I’m optimizing my simple web crawler (currently using PHP/curl_multi).
The goal is to crawl an entire website while being smart about it and skipping non-HTML content. I tried using nobody to send only HEAD requests, but that doesn’t seem to work on every website (some servers don’t support HEAD), causing exec to pause for a long time (sometimes much longer than loading the page itself).
Is there any other way to get the page type without downloading the entire content, or to force cURL to abandon the download if the file isn’t HTML?
(Writing my own HTTP client is not an option, because I intend to use cURL features such as cookies and SSL later on.)
The correct way to do this is to use
The callback accepts two parameters: the first is the cURL handle, the second is the header. It is called each time a new header arrives.
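The original snippet did not survive, so what follows is only a minimal sketch of a callback with that shape. It assumes the option meant is `CURLOPT_HEADERFUNCTION`, which matches the description above. The names `isHtmlContentType`, `abortOnNonHtml` and `makeHtmlOnlyHandle` are made up for illustration. The idea is that the callback must return the length of the header line it was given, and returning anything else makes cURL abort the transfer before the body is downloaded.

```php
<?php
// Hypothetical helper: true/false for a Content-Type line, null for any other header.
function isHtmlContentType(string $headerLine): ?bool
{
    if (stripos($headerLine, 'Content-Type:') !== 0) {
        return null; // not a Content-Type header
    }
    $value = strtolower(trim(substr($headerLine, strlen('Content-Type:'))));
    return strpos($value, 'text/html') === 0
        || strpos($value, 'application/xhtml+xml') === 0;
}

// Header callback: receives the cURL handle and one raw header line per call.
function abortOnNonHtml($ch, string $headerLine): int
{
    if (isHtmlContentType($headerLine) === false) {
        // A return value different from the line length aborts the transfer.
        return -1;
    }
    return strlen($headerLine);
}

// Hypothetical factory: builds a handle that gives up on non-HTML responses.
function makeHtmlOnlyHandle(string $url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Assumption: this is the option the answer refers to.
    curl_setopt($ch, CURLOPT_HEADERFUNCTION, 'abortOnNonHtml');
    return $ch;
}
```

A handle built this way can be passed to `curl_multi_add_handle()` like any other. An aborted transfer should then show up in `curl_multi_info_read()` with a result of `CURLE_WRITE_ERROR`, so the crawler can tell skipped pages apart from real failures.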