[UPDATE (22-AUG-2009): THIS IS THE NEW WORKING VERSION.]
Today we are going to discuss a slightly advanced topic, not in the sense that it’d
be difficult to understand (I always try to make things easier anyway) but in that
you may not see an immediate use for it. What we are going to do today is
called Web Scraping. Web scraping means retrieving data from the
web and pulling useful information out of it for our own use. Of course, this
won’t be the next best web scraper; rather, it lays a basic foundation by
showing how simple a web scraper can be.
OK let’s kick off guys!
We are going to scrape Google’s web search results to retrieve
the number of pages indexed for a search term.
To retrieve results for a search term we need the URL. For this, fire up your
favorite browser, browse to the search engine’s (Google, or whatever)
homepage, type in any search query and hit enter.
OK, now look at the address bar. In my case it looked like the one below; yours
should be similar:
http://www.google.com/search?hl=en&q=learning+c&btnG=Google+Search
On inspection you can see your search term in the URL, ‘URL encoded’
(some characters, such as spaces, are changed to codes). There we have it: you can
place any search keyword in the URL (URL-encoded, which is very simple with PHP’s
built-in urlencode() function) and fetch that page. But how do we do that in a
script, you might ask, because that is what we need.
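Before answering that, here is a quick sketch of the encoding step on its own. It builds the search URL from a raw phrase using PHP’s built-in urlencode(). The helper name build_search_url() is made up for this example and is not part of the script we build below; the URL pattern is simply the one from my address bar.

```php
<?php
// Hypothetical helper (the name is an assumption made for this example):
// builds the search URL for a raw, un-encoded search phrase.
function build_search_url($phrase)
{
    // urlencode() turns spaces into '+' and other unsafe characters into %XX codes
    return "http://www.google.com/search?hl=en&q=" . urlencode($phrase) . "&btnG=Google+Search";
}

echo build_search_url("learning c"), "\n";
// http://www.google.com/search?hl=en&q=learning+c&btnG=Google+Search
```

That takes care of building the URL for any phrase; what remains is fetching it from a script.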
To fetch the page from a script, we can use the following function:
file_get_contents();
[UPDATE: WE'LL BE USING THE FOLLOWING USER-DEFINED FUNCTION INSTEAD. READ COMMENTS FOR MORE INFORMATION:
function my_fetch($url, $user_agent = 'Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)')
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    //identify ourselves as a browser
    curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
    //don't include the response headers in the output
    curl_setopt($ch, CURLOPT_HEADER, 0);
    //return the page as a string instead of printing it
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_REFERER, 'http://www.google.com/');
    $result = curl_exec($ch);
    curl_close($ch);
    return $result;
}
]
If you have been following this blog for some time, you might remember we once
used file_get_contents() in my Creating
a Simple Shout Box in PHP post to fetch contents from a local file.
Its beauty is that it can fetch remote (HTTP) files too.
$data = file_get_contents("http://www.google.com/search?hl=en&q=learning+c&btnG=Google+Search");
[UPDATE: NOW USING:
$data = my_fetch("http://www.google.com/search?hl=en&q=learning+c&btnG=Google+Search");
]
The above code will fetch the Google search results for the keyword we searched
for in the browser; $data will contain the HTML source.
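One thing the fetch above does not do is check whether it actually worked. curl_exec() returns false when the transfer fails, so here is a rough sketch of a wrapper that reports that case. The name fetch_or_null() and the 10-second timeout are my own assumptions for illustration; they are not part of the script in this post.

```php
<?php
// Hypothetical wrapper (name and timeout value are assumptions for this sketch):
// returns the page body as a string, or null if the transfer failed.
function fetch_or_null($url, $user_agent = 'Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)')
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
    // return the page as a string instead of printing it
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    // give up after 10 seconds instead of hanging the page (assumed value)
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $result = curl_exec($ch);
    if ($result === false) {
        // curl_error() gives a human-readable reason for the failure
        error_log('Fetch failed: ' . curl_error($ch));
        return null;
    }
    return $result;
}
```

With something like this you could test the return value for null and show a "could not fetch the results" message instead of going on to search an empty page.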
Since we have to scrape the total number of pages indexed for a particular
search term (displayed as “Results 1 - 10 of about XXXX …”),
we need to find some text near that number (XXXX in this case). Here that
text is simply “Results 1 - 10 of about”. It’s also unique throughout
the page, so if we can find it in the returned code we can easily find the
needed data. One more thing: we can make the search easier by first stripping the
HTML from the returned code so that only text remains. This part can be implemented
as below:
//$s holds the URL-encoded search term
$data = my_fetch("http://www.google.com/search?hl=en&q=" . $s . "&btnG=Google+Search");
//strip off HTML
$data = strip_tags($data);
$find = 'Results 1 - 10 of about ';
$find2 = ' for';
//have text beginning from $find
$data = strstr($data, $find);
//find position of $find2
$pos = strpos($data, $find2);
//take substring out, which'd be the number we want
$search_number = substr($data, strlen($find), $pos - strlen($find));
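To see the extraction step in isolation, here is a small sketch that wraps the same idea in a function and runs it on a made-up sample string, so you can try it without fetching anything. The name text_between() and the sample text are assumptions for this example only. Unlike the snippet above, it returns null when either marker is missing.

```php
<?php
// Hypothetical helper (the name is an assumption): returns the text found between
// $start and the first $end that follows it, or null if either marker is missing.
function text_between($text, $start, $end)
{
    $from = strpos($text, $start);
    if ($from === false) {
        return null;
    }
    // skip past the start marker itself
    $from += strlen($start);
    $to = strpos($text, $end, $from);
    if ($to === false) {
        return null;
    }
    return substr($text, $from, $to - $from);
}

// Made-up sample text standing in for a stripped results page
$sample = "Web Results 1 - 10 of about 1,230,000 for learning c. (0.21 seconds)";
echo text_between($sample, 'Results 1 - 10 of about ', ' for'), "\n"; // 1,230,000
var_dump(text_between("no marker here", 'Results 1 - 10 of about ', ' for')); // NULL
```

Checking for null is handy because the marker text is whatever the results page happens to print; if it is ever absent, you get a clear "not found" instead of a garbage substring.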
Here is the complete code:
<html>
<head>
<title>Google Result Scraper</title>
</head>
<body>
<p align="center" style="font-size: 500%"><font color="#0000FF">G</font><font
color="#FF0000">o</font><font color="#FFFF00">o</font><font
color="#0000FF">g</font><font color="#00FF00">l</font><font
color="#FF0000">e</font><font size="2"><br />
Result Scraper</font></p>
<?php
function my_fetch($url, $user_agent = 'Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)')
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_REFERER, 'http://www.google.com/');
    $result = curl_exec($ch);
    curl_close($ch);
    return $result;
}
//check the request directly so a missing 's' doesn't raise a warning
if (isset($_GET['s']))
{
    $s = $_GET['s'];
    //escape the search phrase before echoing it back into the page
    echo "<p><i>Search for " . htmlspecialchars($s) . "</i></p>";
    $s = urlencode($s);
    $data = my_fetch("http://www.google.com/search?hl=en&q=" . $s . "&btnG=Google+Search");
    //strip off HTML
    $data = strip_tags($data);
    //now $data only has text NO HTML
    //these have to be found in the fetched data
    $find = 'Results 1 - 10 of about ';
    $find2 = ' for';
    //have text beginning from $find
    $data = strstr($data, $find);
    //find position of $find2
    //there might be many occurrences
    //but it'd give position of the first one,
    //which is what we want, anyway
    $pos = strpos($data, $find2);
    //take substring out, which'd be the number we want
    $search_number = substr($data, strlen($find), $pos - strlen($find));
    echo "Pages Indexed: $search_number";
}
else
{
?>
<form name="form1" id="form1" method="get" action="">
<div align="center">
<p> <input name="s" type="text" id="s" size="50" />
<input type="submit" name="Submit" value="Count" /></p>
</div>
</form>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
<p>
<?php
}
?>
</p>
<p align="right"><font size="2">by <a
href="http://learning-computer-programming.blogspot.com/">Learning
Computer Programming</a></font></p>
</body>
</html>
Wow, our first scraper is complete. It has a nice interface: you type in a search
phrase, click ‘Count’, and there you are. It displays the number of
pages that contain that term, the same as on Google.
Have fun guys and do comment!
P.S.: You might want to read String
Manipulation Function in PHP I and String
Manipulation Function in PHP II if you are not very familiar with the
string manipulation functions we are using in the code above.