
[ZT] Fetching a specific div block and images from a web page with PHP

April 5, 2009, Galaxy

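I was originally looking for a WP plugin, but all I found was someone with the same idea. So I am parking the material I found here for now.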
* function download_url( $url ) see: http://phpxref.com/xref/wordpress/wp-admin/includes/file.php.source.html#l312
* function wp_upload_dir see: http://phpxref.com/xref/wordpress/wp-includes/functions.php
* function wp_unique_filename see: http://phpxref.com/xref/wordpress/wp-includes/functions.php

http://andy.diimii.com/2009/03/php%E6%8A%93%E5%8F%96%E7%B6%B2%E9%A0%81%E7%89%B9%E5%AE%9Adiv%E5%8D%80%E5%A1%8A%E5%8F%8A%E5%9C%96%E7%89%87/
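1. Get all images in a given web page

<?php
// Fetch the content at the given address and store it in $text
$text = file_get_contents('http://andy.diimii.com/');

// Collect every img tag into the two-dimensional array $match
preg_match_all('#<img[^>]*>#i', $text, $match);

// Print $match
print_r($match);
?>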
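The snippet above returns whole img tags. If only the image addresses are needed, the src attribute can be captured directly. The following is a minimal sketch, not part of the original post: the function name extract_img_srcs is hypothetical, and it only handles src values wrapped in single or double quotes.

```php
<?php
// Hypothetical helper: return the src value of every <img> tag in $html.
// Assumption: src values are quoted; unquoted attributes are not handled.
function extract_img_srcs(string $html): array
{
    // Group 1 remembers which quote character opened the value,
    // group 2 is the value itself, read lazily up to the same quote.
    preg_match_all('#<img\b[^>]*\bsrc\s*=\s*(["\'])(.*?)\1#is', $html, $m);
    return $m[2];
}

// Inline sample markup, so no network access is needed to try it.
$sample = '<p><img src="a.png" alt="x"><IMG alt="y" SRC=\'b.jpg\'></p>';
print_r(extract_img_srcs($sample));
```

A tag that carries both data-src and src may yield either value, so treat the result as a best effort rather than a full HTML parse.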

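2. Get the first image in a given web page

<?php
// Fetch the content at the given address and store it in $text
$text = file_get_contents('http://andy.diimii.com/');

// Match the first img tag and store it in the array $match
// (the regex is equivalent to the one above)
preg_match('/<img[^>]*>/Ui', $text, $match);

// Print $match
print_r($match);
?>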
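Both examples so far assume that file_get_contents() succeeds. It returns false on failure, and the regex functions would then run on a non-string. A small wrapper can make that case explicit. This is only a sketch: the name fetch_html, the 10-second timeout, and the user agent string are assumptions made for illustration.

```php
<?php
// Hypothetical helper: fetch a URL (or local path) and return null on failure.
function fetch_html(string $url, int $timeoutSeconds = 10): ?string
{
    // The 'http' options only apply to http:// and https:// addresses.
    $context = stream_context_create([
        'http' => [
            'timeout'    => $timeoutSeconds,
            'user_agent' => 'example-fetcher/0.1', // made-up value
        ],
    ]);

    // @ hides the warning; the false return value is checked instead.
    $body = @file_get_contents($url, false, $context);
    return $body === false ? null : $body;
}

// Usage (not executed here because it needs network access):
// $text = fetch_html('http://andy.diimii.com/');
// if ($text === null) { die('fetch failed'); }
```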

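3. Get a specific div block in a given web page (matched by id)

<?php
// Fetch the content at the given address and store it in $text
$text = file_get_contents('http://andy.diimii.com/2009/01/seo%e5%8c%96%e7%9a%84%e9%97%9c%e9%8d%b5%e5%ad%97%e5%bb%a3%e5%91%8a%e9%80%a3%e7%b5%90/');

// Strip newlines and tabs (only needed for serialized content).
// "\s" is not an escape sequence in a PHP string, so it is left out of the list.
//$text = str_replace(array("\r", "\n", "\t"), '', $text);

// Match the div whose id is PostContent and store the result in the array $match.
// The lazy (.*?) stops at the first </div>, so a nested div cuts the match short.
preg_match('/<div[^>]*id="PostContent"[^>]*>(.*?)<\/div>/si', $text, $match);

// Print $match[0]
print($match[0]);
?>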
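Because the lazy pattern above ends at the first closing tag it meets, a div nested inside PostContent truncates the result. One way around this, while staying with regular expressions, is to count opening and closing div tags. The sketch below is an illustration under that assumption, not the original author's method. The name extract_div_by_id is hypothetical.

```php
<?php
// Hypothetical helper: return the complete <div id="..."> ... </div> block,
// tracking nesting depth so inner </div> tags do not end the match early.
function extract_div_by_id(string $html, string $id): ?string
{
    // Find the opening tag of the target div (id value in matching quotes).
    $open = '/<div\b[^>]*\bid\s*=\s*(["\'])' . preg_quote($id, '/') . '\1[^>]*>/i';
    if (!preg_match($open, $html, $m, PREG_OFFSET_CAPTURE)) {
        return null;
    }
    $start = $m[0][1];                    // byte offset of the opening tag
    $pos   = $start + strlen($m[0][0]);   // continue scanning after it
    $depth = 1;

    // Visit each later <div ...> or </div> and adjust the depth.
    while ($depth > 0
        && preg_match('/<(\/?)div\b[^>]*>/i', $html, $t, PREG_OFFSET_CAPTURE, $pos)) {
        $depth += ($t[1][0] === '/') ? -1 : 1;
        $pos = $t[0][1] + strlen($t[0][0]);
    }

    // Depth 0 means the matching closing tag was found.
    return $depth === 0 ? substr($html, $start, $pos - $start) : null;
}
```

It still treats HTML as text, so div tags inside comments or scripts would confuse the count.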

4. Combining 2 and 3 above

<?php
// Fetch the content at the given address and store it in $text
$text = file_get_contents('http://andy.diimii.com/2009/01/seo%e5%8c%96%e7%9a%84%e9%97%9c%e9%8d%b5%e5%ad%97%e5%bb%a3%e5%91%8a%e9%80%a3%e7%b5%90/');

// Match the div whose id is PostContent and store the result in the array $match
preg_match('/<div[^>]*id="PostContent"[^>]*>(.*?)<\/div>/si', $text, $match);

// Match the first img tag inside that div and store it in the array $match2
preg_match('/<img[^>]*>/Ui', $match[0], $match2);

// Print $match2[0]
print_r($match2[0]);
?>
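For comparison, the same "first image inside a given div" lookup can be done with PHP's DOM extension instead of regular expressions. This is a sketch under a few assumptions: the DOM extension is available, the function name first_img_src_in_div is hypothetical, and the id contains no double quote.

```php
<?php
// Hypothetical helper: src of the first <img> inside <div id="$divId">, or null.
function first_img_src_in_div(string $html, string $divId): ?string
{
    // Guard against empty input and ids that would break the XPath string.
    if ($html === '' || strpos($divId, '"') !== false) {
        return null;
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser errors silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Any img with a src attribute below the div, in document order.
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//div[@id="' . $divId . '"]//img[@src]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return $nodes->item(0)->getAttribute('src');
}
```

The parser copes with nested divs on its own, which is the main reason to prefer it once the markup gets complicated.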

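Postscript: regular expressions really are convenient for this, but honestly I often forget the rules, so here are a few articles for the record (Regular Expression Details | PCRE Functions | Introduction to PHP Regex).

Reference: Fetching the contents of a div tag by id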


http://blog.xoyo.com/dcyhldcyhl/article/426017.shtml
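Fetching and parsing web pages with PHP
Translator: limodou
  Fetching and parsing a file is very simple. This tutorial walks you through it step by step with an example. Let's get started!

  First, we must decide which URL to fetch. It can be set in the script or passed in through $QUERY_STRING. For simplicity, let's set the variable directly in the script.

<?php $url = 'http://www.php.net'; ?>

  Second, we fetch the specified file and store it in an array with the file() function.

<?php $url = 'http://www.php.net'; $lines_array = file($url); ?>

  The file is now in an array. However, the text we want to parse may not all be on one line. To solve this problem, we can simply convert the array $lines_array into a string with the implode(x, y) function. If you plan to use explode() later (to turn the string back into an array), setting x to "|" or "!" or a similar delimiter may be better. For our purposes, though, it is best to set x to an empty string. y is the other required parameter: the array you want implode() to process.

<?php $url = 'http://www.php.net'; $lines_array = file($url); $lines_string = implode('', $lines_array); ?>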
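As a quick check of the step above: file() keeps the line ending on every element, so joining the elements with an empty string gives back the original content byte for byte. The sketch below shows this on a local temporary file. The helper name read_as_string is hypothetical.

```php
<?php
// Hypothetical helper: read a file (or URL) line by line and rejoin it.
function read_as_string(string $path): ?string
{
    $lines = file($path);          // each element still ends with its newline
    if ($lines === false) {
        return null;
    }
    return implode('', $lines);    // empty glue, so nothing is added or lost
}

// Demonstration on a temporary local file instead of a remote URL.
$tmp = tempnam(sys_get_temp_dir(), 'demo');
file_put_contents($tmp, "<html>\n<head></head>\n</html>\n");
var_dump(read_as_string($tmp) === file_get_contents($tmp));
unlink($tmp);
```

When the array itself is not needed, file_get_contents() gets the same string in one call.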

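  The fetching is now done, and it is time to parse. For this example, we want everything between <head> and </head>. To pull that out of the string we need something called a regular expression.

<?php $url = 'http://www.php.net'; $lines_array = file($url); $lines_string = implode('', $lines_array); preg_match('/<head>(.*)<\/head>/is', $lines_string, $head); ?>

  Let's look at the code. As you can see, preg_match() is called in the following form (the older eregi() function took the same three arguments but was removed in PHP 7):

preg_match('/<head>(.*)<\/head>/is', $lines_string, $head);

  "(.*)" means everything, so the pattern can be read as "match everything between <head> and </head>". The i modifier makes the match case-insensitive, and the s modifier lets "." match newlines. $lines_string is the string being parsed, and $head is the array that receives the result.
  Finally, we can output the data. Because there is only one instance of <head> and </head>, we can safely assume the array holds what we want: $head[0] is the whole match including the tags, and $head[1] is only the text between them. Let's print it.

<?php $url = 'http://www.php.net'; $lines_array = file($url); $lines_string = implode('', $lines_array); preg_match('/<head>(.*)<\/head>/is', $lines_string, $head); echo $head[0]; ?>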

  That is all the code.
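The same idea can be wrapped in a small function that works for any tag name. This is a sketch using preg_match(), not the tutorial's own code, and the name extract_tag_contents is hypothetical. It uses a lazy (.*?) so that it stops at the first closing tag.

```php
<?php
// Hypothetical helper: text between the first <tag ...> and its closing tag.
function extract_tag_contents(string $html, string $tag): ?string
{
    $t = preg_quote($tag, '/');
    // i: ignore case, s: let "." match newlines; \b keeps "head" from matching "header".
    $pattern = '/<' . $t . '\b[^>]*>(.*?)<\/' . $t . '>/is';
    return preg_match($pattern, $html, $m) ? $m[1] : null;
}

// Inline sample page, so the example runs without a network connection.
$page = "<html><HEAD>\n<title>Hi</title>\n</HEAD><body></body></html>";
var_dump(extract_tag_contents($page, 'title'));
```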

<?php
 
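// Collect all content URLs and save them to a file.
// $prefix must be a full http:// prefix, because get_url() rejects anything else.
// $host_url is prepended to every link found by get_content_url()
// (assumption: the original call left this argument out).
function get_index($save_file, $prefix="index_", $host_url=""){
    $count = 68;
    $i = 1;
    if (file_exists($save_file)) @unlink($save_file);
    $fp = fopen($save_file, "a+") or die("Open ". $save_file ." failed");
    while($i<$count){
        $url = $prefix . $i .".htm";
        echo "Get ". $url ."...";
        $url_str = get_content_url($host_url, get_url($url));
        echo " OK\n";
        fwrite($fp, $url_str);
        ++$i;
    }
    fclose($fp);
}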
 
// Fetch the target multimedia objects
function get_object($url_file, $save_file, $split="|--:**:--|"){
    if (!file_exists($url_file)) die($url_file ." not exist");
    // Drop the trailing newline of each line and skip empty lines
    $file_arr = file($url_file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if (!is_array($file_arr) || empty($file_arr)) die($url_file ." not content");
    $url_arr = array_unique($file_arr);
    if (file_exists($save_file)) @unlink($save_file);
    $fp = fopen($save_file, "a+") or die("Open save file ". $save_file ." failed");
    foreach($url_arr as $url){
        if (empty($url)) continue;
        echo "Get ". $url ."...";
        $html_str = get_url($url);
        $obj_str = get_content_object($html_str, $split);
        echo " OK\n";
        fwrite($fp, $obj_str);
    }
    fclose($fp);
}
 
// Walk a directory and process the content of each file.
// $dir must end with a directory separator.
function get_dir($save_file, $dir){
    $dp = opendir($dir);
    if (file_exists($save_file)) @unlink($save_file);
    $fp = fopen($save_file, "a+") or die("Open save file ". $save_file ." failed");
    // Strict comparison, so a file named "0" does not end the loop
    while(($file = readdir($dp)) !== false){
        if ($file!="." && $file!=".."){
            echo "Read file ". $file ."...";
            $file_content = file_get_contents($dir . $file);
            $obj_str = get_content_object($file_content);
            echo " OK\n";
            fwrite($fp, $obj_str);
        }
    }
    closedir($dp);
    fclose($fp);
}
 
 
// Fetch the content of the given URL
function get_url($url){
    $reg = '/^http:\/\/[^\/].+$/';
    if (!preg_match($reg, $url)) die($url ." invalid");
    $fp = fopen($url, "r") or die("Open url: ". $url ." failed.");
    $content = "";
    while($fc = fread($fp, 8192)){
        $content .= $fc;
    }
    fclose($fp);
    if (empty($content)){
        die("Get url: ". $url ." content failed.");
    }
    return $content;
}
 
// Fetch the given page over a raw socket.
// The returned string contains the response headers as well as the body;
// an HTTP/1.1 server may also send a chunked body, which is not decoded here.
function get_content_by_socket($url, $host){
    $fp = fsockopen($host, 80) or die("Open ". $url ." failed");
    $header = "GET /".$url ." HTTP/1.1\r\n";
    $header .= "Accept: */*\r\n";
    $header .= "Accept-Language: zh-cn\r\n";
    // No Accept-Encoding header: this function does not decompress gzip/deflate
    $header .= "User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Maxthon; InfoPath.1; .NET CLR 2.0.50727)\r\n";
    $header .= "Host: ". $host ."\r\n";
    //$header .= "Cookie: cnzz02=2; rtime=1; ltime=1148456424859; cnzz_eid=56601755-\r\n\r\n";
    // Ask the server to close the connection so feof() becomes true
    $header .= "Connection: Close\r\n\r\n";

    fwrite($fp, $header);
    $contents = "";
    while (!feof($fp)) {
        $contents .= fgets($fp, 8192);
    }
    fclose($fp);
    return $contents;
}
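// The string returned by get_content_by_socket() is the raw HTTP response:
// a status line, header lines, an empty line, then the body. The helper below
// is a minimal sketch of how the two parts could be separated. Its name,
// split_http_response, is hypothetical and it is not part of the original
// script. It assumes CRLF line endings and does not decode chunked bodies.

```php
<?php
// Hypothetical helper: split a raw HTTP response into status, headers and body.
function split_http_response(string $raw): array
{
    // Headers and body are separated by the first empty line.
    $parts       = explode("\r\n\r\n", $raw, 2);
    $headerLines = explode("\r\n", $parts[0]);
    $statusLine  = array_shift($headerLines);

    $headers = [];
    foreach ($headerLines as $line) {
        if (strpos($line, ':') === false) {
            continue; // not a "Name: value" line
        }
        [$name, $value] = explode(':', $line, 2);
        // Header names are case-insensitive, so store them in lower case.
        $headers[strtolower(trim($name))] = trim($value);
    }

    return [
        'status'  => $statusLine,
        'headers' => $headers,
        'body'    => $parts[1] ?? '',
    ];
}
```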
 
 
// Extract the URLs found in the given content
function get_content_url($host_url, $file_contents){

    //$reg = '/^(#|javascript.*?|ftp:\/\/.+|http:\/\/.+|.*?href.*?|play.*?|index.*?|.*?asp)+$/i';
    //$reg = '/^(down.*?\.html|\d+_\d+\.htm.*?)$/i';
    $rex = "/([hH][rR][eE][Ff])\s*=\s*['\"]*([^>'\"\s]+)[\"'>]*\s*/i";
    $reg = '/^(down.*?\.html)$/i';
    preg_match_all ($rex, $file_contents, $r);
    $result = ""; //array();
    // $r[2] holds the captured href values; keep those that match $reg
    foreach($r[2] as $d){
        if (preg_match($reg, $d)){ $result .= $host_url . $d."\n"; }
    }
    return $result;
}
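// The filtering done by get_content_url() can also be written so that it
// returns an array without duplicates. The following is only a sketch of
// that variation. The name extract_matching_links and the example.com host
// are made up for illustration.

```php
<?php
// Hypothetical helper: hrefs in $html that match $filter, prefixed with $hostUrl.
function extract_matching_links(string $html, string $hostUrl, string $filter): array
{
    // Capture the href value with or without surrounding quotes.
    preg_match_all('/\bhref\s*=\s*["\']?([^>"\'\s]+)/i', $html, $m);

    $links = [];
    foreach ($m[1] as $href) {
        if (preg_match($filter, $href)) {
            $links[] = $hostUrl . $href;
        }
    }
    // Remove repeats and renumber the keys from 0.
    return array_values(array_unique($links));
}

// Inline sample markup with one duplicate and one link that should be skipped.
$sample = '<a href="down_1.html">a</a> <a href=\'index.html\'>b</a> '
        . '<a href=down_2.html>c</a> <a href="down_1.html">again</a>';
print_r(extract_matching_links($sample, 'http://example.com/', '/^down.*?\.html$/i'));
```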
 
// Extract the multimedia file entry from the given content
function get_content_object($str, $split="|--:**:--|"){
    $regx = "/href\s*=\s*['\"]*([^>'\"\s]+)[\"'>]*\s*(.*?<\/b>)/i";
    preg_match_all($regx, $str, $result);

    // Nothing matched: return an empty string so callers can still fwrite() it
    if (empty($result[1])) return "";

    // "多媒体: " ("Multimedia: ") is label text on the target page
    $result[2] = str_replace("多媒体: ", "", $result[2]);
    $result[2] = str_replace("", "", $result[2]);
    return $result[1][0] . $split .$result[2][0] . "\n";
}
 
?>

Posted by 网际旋风 on 2007-11-07 11:25:06


Practical web page fetching with PHP
http://blog.csdn.net/hongyu6/archive/2008/03/11/2170585.aspx

<html>
<head>
<title>Practical web page fetching test </title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body  >
<?php
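$url = 'http://localhost/test.php'; // test against a local page
#$url = 'http://www.myenjoylife.cn/index.php'; // fetch the www.myenjoylife.cn home page
$lines_array = file($url);
$lines_string = implode('', $lines_array);
// eregi() was removed in PHP 7; preg_match() with the i and s modifiers replaces it
preg_match('/(.*)/is', $lines_string, $head);
echo $head[0];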
?>
</body>
</html>

http://phpstarter.net/2008/12/how-to-post-data-and-fetch-remote-pages-from-php-scripts/
How to Post Data and Fetch Remote Pages from PHP Scripts
Published: December 18th, 2008 by: Andrew

There are times when a PHP script needs to fetch the HTML from a remote page, or even post some data to a remote location. Learning how to use cURL or fsockopen() can be time-consuming and unnecessary. There is a nice PHP class called Snoopy that can take care of all these functions very easily.

First, head on over to the project page and download a copy of the script. Once you extract the archive, you will only need Snoopy.class.php. Upload it to your favorite web host, and create a blank PHP script to get started with some examples.

Fetching a Page

Fetching data from a remote URL is very useful to have if you don’t want to mess with cURL or fsockopen. Snoopy takes care of all that for you.

<?php
/* load the snoopy class and initialize the object */
require('../includes/Snoopy.class.php');
$snoopy = new Snoopy();

/* load the page and print the results */
$snoopy->fetch('http://snoopy.sourceforge.net/');
echo '<pre>' . htmlspecialchars($snoopy->results) . '</pre>';
?>

Fetching a web page may be of little use, but we can also use this to retrieve remote XML or CSV files. Here is a more practical example.


<?php
/* load the snoopy class and initialize the object */
require('../includes/Snoopy.class.php');
$snoopy = new Snoopy();

/* fetch the data */
$snoopy->fetch('http://www.weather.gov/xml/current_obs/KLOT.xml');

/* parse the XML data */
$xml = new SimpleXMLElement($snoopy->results);

/* output some of the elements */
echo '<h3>Some Parsed Information</h3>';
echo '<img src="' . $xml->icon_url_base . '/' . $xml->icon_url_name . '" /><br />';
echo '<b>Reporting Station:</b> ' . $xml->location . '<br />';
echo '<b>Observation Time:</b> ' . $xml->observation_time_rfc822 . '<br />';
echo '<b>Temp:</b> ' . $xml->temperature_string . '<br />';
echo '<b>Wind:</b> ' . $xml->wind_string . '<br />';

echo '<hr /><h3>Raw XML</h3>';
echo '<pre>' . htmlspecialchars($snoopy->results) . '</pre>';
?>
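The parsing step can be tried without Snoopy or a network connection by feeding SimpleXML a string directly. The sketch below assumes the SimpleXML extension is available. The function name parse_observation and the sample values are made up, and only the element names come from the example above.

```php
<?php
// Hypothetical helper: read a few named elements from an observation document.
function parse_observation(string $xmlString): ?array
{
    // Collect parser errors quietly and report failure through the return value.
    $previous = libxml_use_internal_errors(true);
    $xml = simplexml_load_string($xmlString);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if ($xml === false) {
        return null;
    }

    // Cast each element to string; a missing element becomes ''.
    return [
        'location'    => (string) $xml->location,
        'temperature' => (string) $xml->temperature_string,
        'wind'        => (string) $xml->wind_string,
    ];
}

// Made-up sample data shaped like the elements read above.
$sample = '<current_observation><location>Example Airport</location>'
        . '<temperature_string>50.0 F (10.0 C)</temperature_string>'
        . '<wind_string>Calm</wind_string></current_observation>';
print_r(parse_observation($sample));
```

Unlike new SimpleXMLElement(), which throws an exception on malformed input, simplexml_load_string() returns false, which keeps the error handling in one place.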

Sending Post & Cookie Data

Some APIs, like PayPal’s Instant Payment Notification tool, require data to be sent to their servers in post form. Snoopy makes it as easy as putting the data in an array and sending it to the URL.

Here is the script that sends some post data and a couple of cookies:

<?php
/* load the snoopy class and initialize the object */
require('../includes/Snoopy.class.php');
$snoopy = new Snoopy();

/* set some values */
$p_data['color'] = 'Red';
$p_data['fruit'] = 'apple';

$snoopy->cookies['vegetable'] = 'carrot';
$snoopy->cookies['something'] = 'value';

/* submit the data and get the result */
$snoopy->submit('http://phpstarter.net/samples/118/data_dump.php', $p_data);

/* output the results */
echo '<pre>' . htmlspecialchars($snoopy->results) . '</pre>';
?>

And here is the result data on the server side:


/* $_POST */
array(2) {
  ["color"]=>
  string(3) "Red"
  ["fruit"]=>
  string(5) "apple"
}
/* $_COOKIE */
array(2) {
  ["vegetable"]=>
  string(6) "carrot"
  ["something"]=>
  string(5) "value"
}
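To make it clearer what goes over the wire in a request like this, here is a rough sketch of the same POST fields and cookies expressed as options for PHP's built-in HTTP stream wrapper. It is not how Snoopy works internally. The function name build_post_options is hypothetical and the exact headers Snoopy sends may differ.

```php
<?php
// Hypothetical helper: stream-context options for a form POST with cookies.
function build_post_options(array $fields, array $cookies = []): array
{
    $headers = ['Content-Type: application/x-www-form-urlencoded'];

    if ($cookies) {
        // Cookies travel in a single header as name=value pairs.
        $pairs = [];
        foreach ($cookies as $name => $value) {
            $pairs[] = $name . '=' . rawurlencode((string) $value);
        }
        $headers[] = 'Cookie: ' . implode('; ', $pairs);
    }

    return [
        'http' => [
            'method'  => 'POST',
            'header'  => implode("\r\n", $headers),
            // Form fields are URL-encoded into the request body.
            'content' => http_build_query($fields),
        ],
    ];
}

// Show the options that would be used; no request is sent here.
$options = build_post_options(
    ['color' => 'Red', 'fruit' => 'apple'],
    ['vegetable' => 'carrot', 'something' => 'value']
);
print_r($options);

// To actually send it (needs network access):
// $reply = file_get_contents($url, false, stream_context_create($options));
```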

Fetching Links

When you need to crawl a page for links, Snoopy has taken care of the trouble to parse the links from the page.

<?php
/* load the snoopy class and initialize the object */
require('../includes/Snoopy.class.php');
$snoopy = new Snoopy();

/* load the page and print the results */
$snoopy->fetchlinks('http://google.com/');
echo '<pre>' . var_export($snoopy->results, true) . '</pre>';
?>

Script output:


array (
  0 => 'http://images.google.com/imghp?hl=en&tab=wi',
  1 => 'http://maps.google.com/maps?hl=en&tab=wl',
  2 => 'http://news.google.com/nwshp?hl=en&tab=wn',
  3 => 'http://www.google.com/prdhp?hl=en&tab=wf',
  4 => 'http://mail.google.com/mail/?hl=en&tab=wm',
  5 => 'http://www.google.com/intl/en/options/',
  6 => 'http://video.google.com/?hl=en&tab=wv',
  7 => 'http://groups.google.com/grphp?hl=en&tab=wg',
  8 => 'http://books.google.com/bkshp?hl=en&tab=wp',
  9 => 'http://scholar.google.com/schhp?hl=en&tab=ws',
  10 => 'http://finance.google.com/finance?hl=en&tab=we',
  11 => 'http://blogsearch.google.com/?hl=en&tab=wb',
  12 => 'http://www.youtube.com/?hl=en&tab=w1',
  13 => 'http://www.google.com/calendar/render?hl=en&tab=wc',
  14 => 'http://picasaweb.google.com/home?hl=en&tab=wq',
  15 => 'http://docs.google.com/?hl=en&tab=wo',
  16 => 'http://www.google.com/reader/view/?hl=en&tab=wy',
  17 => 'http://sites.google.com/?hl=en&tab=w3',
  18 => 'http://www.google.com/intl/en/options/',
  19 => 'http://www.google.com/url?sa=p&pref=ig&pval=3&q=http://www.google.com/ig%3Fhl%3Den%26source%3Diglk&usg=AFQjCNFA18XPfgb7dKnXfKz7x7g1GDH1tg',
  20 => 'http://www.google.com/intl/en/ads/',
  21 => 'http://www.google.com/services/',
  22 => 'http://www.google.com/intl/en/about.html',
  23 => 'http://www.google.com/intl/en/privacy.html',
  24 => 'http://www.google.com/advanced_search?hl=en',
  25 => 'http://www.google.com/preferences?hl=en',
  26 => 'http://www.google.com/language_tools?hl=en',
)
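The list above contains a repeated entry (items 5 and 18) and links to many different hosts. When crawling, it is common to keep only the unique links on one host. The following is a small sketch of that filtering step, applied to an array like the one fetchlinks() produced. The function name unique_links_for_host is hypothetical.

```php
<?php
// Hypothetical helper: unique links whose host equals $host, in original order.
function unique_links_for_host(array $links, string $host): array
{
    $kept = [];
    foreach ($links as $link) {
        $linkHost = parse_url($link, PHP_URL_HOST);
        // Host names are case-insensitive.
        if (is_string($linkHost) && strcasecmp($linkHost, $host) === 0) {
            $kept[$link] = true; // array keys make duplicates collapse
        }
    }
    return array_keys($kept);
}

// A few entries copied from the output above.
$links = [
    'http://www.google.com/intl/en/options/',
    'http://maps.google.com/maps?hl=en&tab=wl',
    'http://www.google.com/intl/en/options/',
    'http://www.google.com/services/',
];
print_r(unique_links_for_host($links, 'www.google.com'));
```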

More Options

We can set various other options and values such as:

  • User name/password for basic HTTP authentication
  • Proxy hosts & ports
  • Raw headers
  • Maximum redirects

<?php
/* load the snoopy class and initialize the object */
require('../includes/Snoopy.class.php');
$snoopy = new Snoopy;

/* make the request through a proxy server */
$snoopy->proxy_host = "my.proxy.host";
$snoopy->proxy_port = "8080";

/* change the user agent or refer URL */
$snoopy->agent = "(compatible; MSIE 4.01; MSN 2.5; AOL 4.0; Windows 98)";
$snoopy->referer = "http://www.microsnot.com/";

/* set some cookies */
$snoopy->cookies["SessionID"] = "238472834723489";
$snoopy->cookies["favoriteColor"] = "RED";

/* set some raw headers */
$snoopy->rawheaders["Pragma"] = "no-cache";

/* set some redirect options */
$snoopy->maxredirs = 2;
$snoopy->offsiteok = false; /* do not follow a redirect to a different domain */

/* set user/pass for basic HTTP authentication */
$snoopy->user = "me";
$snoopy->pass = "p@ssw0rd";
?>

There are even more options not covered here. To see what those options are, see the Snoopy.class.php file and the readme.txt file contained in the downloaded package.
