I wrote a crawler for a specific dynamic website. A full crawl takes over 3 hours.
I want to check whether a page has already been crawled or whether it has changed since the last crawl.
If I can do this, the script will finish in a much shorter time.
For example:
foreach ($urls as $url) {
    if (thereAreChanges($url)) {
        crawl($url);
    }
}
Information: The web page doesn’t provide Content-Length or a CRC. These are the response headers:
Array (
    [0] => HTTP/1.1 200 OK
    [Date] => Tue, 08 Jan 2013 07:47:03 GMT
    [Server] => Apache
    [Set-Cookie] => Array (
        [0] => PHPSESSID=eisb6qjme9b0ouoga9su9fgok4; path=/
        [1] => j12011=a%3A3%3A%7Bs%3A3%3A%22sid%22%3Bs%3A26%3A%22eisb6qjme9b0ouoga9su9fgok4%22%3Bs%3A2%3A%22ip%22%3Bs%3A12%3A%2294.103.47.65%22%3Bs%3A4%3A%22time%22%3Bi%3A1357631223%3B%7D; expires=Sat, 09-Mar-2013 07:47:03 GMT; path=/
    )
    [Expires] => Thu, 19 Nov 1981 08:52:00 GMT
    [Cache-Control] => no-store, no-cache, must-revalidate, post-check=0, pre-check=0
    [Pragma] => no-cache
    [Vary] => Accept-Encoding
    [Connection] => close
    [Content-Type] => text/html
)
The site provides Content-Type but doesn't provide Content-Length. How can I ask Apache for the Content-Length?
Update: http://urivalet.com/ can get the Content-Length. That is what I need.
If I could get a CRC of the page in the headers, it would be perfect, but I guess that is a long shot.
Solution:
'header'=>"Accept-Encoding: gzip"
My request was not sending this header, which is why the response did not include Content-Length. With this parameter, the page returns Content-Length.
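For anyone landing here later, this is a minimal sketch of how the header could be combined with the loop from the question. It assumes PHP 7.1 or newer (for the context argument of get_headers()); the helper names gzipContext, contentLengthFrom and lengthChanged and the $knownLengths array are hypothetical and only for illustration. Note that an unchanged Content-Length is only a hint, since a page can change without its compressed size changing.

```php
<?php
// Hypothetical helper: a stream context that sends "Accept-Encoding: gzip".
function gzipContext()
{
    return stream_context_create(array(
        'http' => array(
            'header' => "Accept-Encoding: gzip\r\n",
        ),
    ));
}

// Read Content-Length from a get_headers($url, 1)-style array.
// Returns null when the server did not send the header.
function contentLengthFrom(array $headers)
{
    foreach ($headers as $name => $value) {
        // Numeric keys hold status lines; header names may vary in case.
        if (is_string($name) && strcasecmp($name, 'Content-Length') === 0) {
            // After redirects the value can be an array; use the last entry.
            if (is_array($value)) {
                $value = end($value);
            }
            return (int) $value;
        }
    }
    return null;
}

// Treat a page as changed when its length is unknown, new, or different.
// $knownLengths maps URL => length seen on the previous crawl (assumed storage).
function lengthChanged($url, array &$knownLengths)
{
    $headers = @get_headers($url, 1, gzipContext());
    $length  = ($headers === false) ? null : contentLengthFrom($headers);

    $changed = $length === null
        || !isset($knownLengths[$url])
        || $knownLengths[$url] !== $length;

    if ($length !== null) {
        $knownLengths[$url] = $length;
    }
    return $changed;
}
```

With this, thereAreChanges($url) from the question could simply call lengthChanged($url, $knownLengths), with $knownLengths loaded from and saved to wherever the crawler keeps its state between runs.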