I’ve been looking around the internet to find out whether this is possible: I basically need to get just the title of a web page and nothing else.
Web crawlers can take a long time because they have to load whole pages before examining them, which is inefficient for what I am trying to achieve. Here’s the PHP code I have so far:
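$url = 'http://www.ebay.com/itm/300702997750#ht_500wt_1156';
// file_get_contents() returns false on failure, so check before using the result.
$str = file_get_contents($url);
$title = '';
if ($str !== false && strlen($str) > 0) {
    // Non-greedy and case-insensitive; the "s" flag lets the title span several lines.
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $str, $titleArr)) {
        $title = trim($titleArr[1]);
    }
}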
I want to know whether it would be possible to crawl only part of a page (for example, the first 2000 characters of the page).
Any help would be appreciated, Thanks.
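You could use substr to grab just the first 1000 characters, but by then file_get_contents has already downloaded the whole page. Alternatively, you could send an HTTP range request that asks the server for only the first 500 bytes.
You can benchmark that by timing a full download against a ranged one with some quick and ugly throwaway code.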
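A minimal sketch of such a range request and a crude timing comparison, assuming PHP's HTTP stream wrapper is available (allow_url_fopen enabled). The helper names rangeContextOptions, fetchRange and timeCall are made up for illustration, and this is not necessarily the code originally used here:

```php
<?php
// Hypothetical helper: stream-context options asking for the first $bytes bytes.
function rangeContextOptions(int $bytes): array
{
    return [
        'http' => [
            'method'  => 'GET',
            // Byte ranges are zero-based and inclusive, so 0-499 means 500 bytes.
            'header'  => 'Range: bytes=0-' . ($bytes - 1),
            'timeout' => 10,
        ],
    ];
}

// Hypothetical helper: returns the body as a string, or false on failure.
// A server that ignores the Range header will still send the whole page.
function fetchRange(string $url, int $bytes = 500)
{
    $context = stream_context_create(rangeContextOptions($bytes));
    return @file_get_contents($url, false, $context);
}

// Hypothetical helper: run $fn once and report elapsed time and body size.
function timeCall(callable $fn): array
{
    $start  = microtime(true);
    $result = $fn();
    return [
        'seconds' => microtime(true) - $start,
        'bytes'   => is_string($result) ? strlen($result) : 0,
    ];
}

// Example usage (needs network access, so it is left commented out):
// $url = 'http://en.wikipedia.org/wiki/PHP';
// print_r(timeCall(function () use ($url) { return file_get_contents($url); }));
// print_r(timeCall(function () use ($url) { return fetchRange($url, 500); }));
```

A single run like this is noisy, so repeating each request a few times and comparing averages would give a fairer picture.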
If I run that on my site (http://www.focalstrategy.com/), I get:
Against http://en.wikipedia.org/wiki/PHP, I get:
Against Stack Overflow I get:
and against eBay I get:
You can see by testing that SO and eBay don’t support range requests.
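Rather than inferring support from timings, one option is to look at the status code: a server that honours the range answers with 206 Partial Content, while one that ignores it answers 200 with the full body. A rough sketch, assuming the request went through PHP's HTTP stream wrapper (which fills the local variable $http_response_header); rangeWasHonored is a hypothetical helper name:

```php
<?php
// Hypothetical helper: true if the final response was "206 Partial Content".
// $headers is the list of raw header lines, e.g. $http_response_header.
function rangeWasHonored(array $headers): bool
{
    $status = null;
    foreach ($headers as $line) {
        // After redirects there are several status lines; keep the last one.
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $status = (int) $m[1];
        }
    }
    return $status === 206;
}

// Example usage (needs network access, so it is left commented out):
// $context = stream_context_create(['http' => ['header' => 'Range: bytes=0-499']]);
// $body = file_get_contents('http://en.wikipedia.org/wiki/PHP', false, $context);
// var_dump(rangeWasHonored($http_response_header));
```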
In summary, sites that support range requests will get a speed-up; those that don’t, won’t, and you’ll just get the whole page instead.
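Putting this together for the title case, a possible approach is to look for the title in a small prefix first and only fall back to the full page when the closing tag is not in it. The sketch below is an assumption-laden illustration: extractTitle and titleOf are hypothetical names, the 2000-byte default is taken from the question, and $fetch stands for whatever download function you use.

```php
<?php
// Hypothetical helper: the <title> text of (possibly truncated) HTML, or null.
function extractTitle(string $html): ?string
{
    // Non-greedy, case-insensitive; "s" lets "." match newlines inside the title.
    if (preg_match('#<title\b[^>]*>(.*?)</title>#is', $html, $m)) {
        return trim(html_entity_decode($m[1], ENT_QUOTES | ENT_HTML5, 'UTF-8'));
    }
    // No title, or its closing tag lies beyond the bytes that were fetched.
    return null;
}

// Hypothetical helper: try a small prefix first, then the whole page.
// $fetch takes (string $url, ?int $maxBytes) and returns a string or false;
// null for $maxBytes means "no limit".
function titleOf(string $url, callable $fetch, int $prefixBytes = 2000): ?string
{
    $prefix = $fetch($url, $prefixBytes);
    if (is_string($prefix)) {
        $title = extractTitle($prefix);
        if ($title !== null) {
            return $title;
        }
    }
    $full = $fetch($url, null);
    return is_string($full) ? extractTitle($full) : null;
}
```

The fallback costs a second request on pages whose title sits beyond the prefix, so the prefix size is a trade-off worth measuring against the sites you actually crawl.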