I’m optimizing my simple web crawler (currently using PHP/curl_multi). Goal is to crawl entire

Question

Asked: June 10, 2026 (2026-06-10T07:30:50+00:00)

I’m optimizing my simple web crawler (currently using PHP/curl_multi).

The goal is to crawl an entire website while being smart about it and skipping non-HTML content. I tried setting CURLOPT_NOBODY so that only HEAD requests are sent, but that doesn’t work on every website (some servers don’t support HEAD), and it causes execution to stall for a long time (sometimes much longer than loading the page itself).

Is there any other way to get the page type without downloading the entire content, or to force cURL to abandon the download if the file isn’t HTML?

(Writing my own HTTP client is not an option, because I intend to use cURL features such as cookies and SSL later on.)

1 Answer

Editorial Team · Answer 1 · 2026-06-10T07:30:51+00:00

The correct way to do this is to use CURLOPT_HEADERFUNCTION:

curl_setopt($ch, CURLOPT_HEADERFUNCTION, 'curlHeaderCallback');

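The callback accepts two parameters: the cURL handle and a single header line. It is called once for each header line as it arrives and must return the number of bytes it handled; returning any other value aborts the transfer.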

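// MIME types the crawler should download; any other type aborts the transfer.
$acceptable = array('application/xhtml+xml', 'application/xml', 'text/plain',
                    'text/xml', 'text/html');

function curlHeaderCallback($resURL, $strHeader) {
    global $acceptable;
    if (stripos($strHeader, 'content-type:') === 0) {
        // Take the value after the colon and drop parameters such as "; charset=utf-8".
        // explode() results go into a variable first: array_shift()/array_pop()
        // take their argument by reference and complain about function results.
        $parts = explode(';', substr($strHeader, strlen('content-type:')));
        $type = strtolower(trim($parts[0]));
        if (!in_array($type, $acceptable)) {
            return 0; // not the header length, so cURL aborts the transfer
        }
    }
    return strlen($strHeader); // header fully consumed, continue
}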
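To show how a header callback like this might be attached to a handle, here is a minimal sketch. The helper names `headerIsAcceptable` and `fetchIfAllowed` are hypothetical, not part of the answer above. It assumes PHP 7.1 or later with the curl extension, and it treats any failed transfer the same as an aborted one.

```php
<?php
// Hypothetical helper: returns true unless $header is a Content-Type line
// whose MIME type is missing from $allowed.
function headerIsAcceptable(string $header, array $allowed): bool
{
    if (stripos($header, 'content-type:') !== 0) {
        return true; // not a Content-Type header, nothing to decide
    }
    $value = substr($header, strlen('content-type:'));
    $type  = strtolower(trim(explode(';', $value)[0]));
    return in_array($type, $allowed, true);
}

// Hypothetical helper: fetches $url and returns the body, or null when the
// transfer was aborted by the header callback or failed for another reason.
function fetchIfAllowed(string $url, array $allowed): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HEADERFUNCTION,
        function ($handle, $header) use ($allowed) {
            // Returning anything other than the header length aborts the transfer.
            return headerIsAcceptable($header, $allowed) ? strlen($header) : 0;
        });
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}
```

A call such as `fetchIfAllowed($url, array('text/html', 'application/xhtml+xml'))` should return null for an image or archive without downloading its body. With curl_multi, the same CURLOPT_HEADERFUNCTION option can be set on each handle before it is passed to curl_multi_add_handle.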
