I’m working on a little webcrawler as a side project at the moment and

Question


Asked: June 9, 2026 (2026-06-09T16:25:40+00:00)


I’m working on a little web crawler as a side project. At the moment it collects all hrefs on a page and then parses each of those in turn. My problem is this:

How can I get only the actual pages? At the moment I’m using the following:

foreach ($page->getElementsByTagName('a') as $link)
{
    $href = $link->getAttribute('href');
    $compare_url = parse_url($href);
    if (empty($compare_url['host']))
    {
        // relative link: prefix it with the base host (no doubled slash)
        $links[] = 'http://' . $base_url['host'] . '/' . ltrim($href, '/');
    }
    elseif ($base_url['host'] == $compare_url['host'])
    {
        // absolute link that points at the same host
        $links[] = $href;
    }
}

As you can see, this will also bring in JPEGs, EXE files, etc. I only need to pick up web pages such as .php, .html, .asp and so on.

I’m not sure whether there is a function that can work this out, or whether it will need a regex built from some sort of master list.

Thanks


1 Answer

Editorial Team · Answer 1 · 2026-06-09T16:25:42+00:00

Since the URL string alone isn’t connected with the resource behind it in any way, you will have to go out and ask the web server about each one. For this there’s an HTTP method called HEAD, so you won’t have to download everything.

You can implement this with curl in PHP like this:

// Send a HEAD request and return the header block of the final response.
// Defined at top level: nesting it inside is_html() would redeclare it
// (a fatal error) on the second call.
function curl_head($url) {
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_NOBODY, true);          // HEAD: skip the body
    curl_setopt($curl, CURLOPT_HEADER, true);          // include headers in the output
    curl_setopt($curl, CURLOPT_MAXREDIRS, 5);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_1);
    $content = curl_exec($curl);
    curl_close($curl);

    if ($content === false) {
        return '';                                     // request failed
    }

    // redirected heads just pile up one after another
    $parts = explode("\r\n\r\n", trim($content));

    // return only the last one
    return end($parts);
}

function is_html($url) {
    // probe the URL that was passed in, not a hard-coded one
    $header = curl_head($url);
    // look for the content-type part of the header response
    return preg_match('/^content-type\s*:\s*text\/html/im', $header) === 1;
}

var_dump(is_html('http://github.com'));

This version only accepts text/html responses and doesn’t check whether the response is a 404 or another error (it does, however, follow redirects, up to 5 jumps). You can tweak the regexp, or add some error handling either from the curl response or by matching against the first line of the header string.
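As a rough sketch of the "match against the first line" idea, the snippet below reads the status code and the media type out of a single raw header block. The helper names parse_head() and looks_like_page() are made up for illustration, and the list of accepted types is an assumption you would adjust to your needs.

```php
<?php
// Hypothetical helper: pull the status code and media type out of one raw
// header block (a status line followed by header lines).
function parse_head(string $header): array
{
    $status = 0;
    $type = '';
    // The first line looks like "HTTP/1.1 200 OK" (or "HTTP/2 200").
    if (preg_match('/^HTTP\/\d+(?:\.\d+)?\s+(\d{3})/', $header, $m)) {
        $status = (int) $m[1];
    }
    // Content-Type may carry parameters, e.g. "text/html; charset=utf-8",
    // so capture only the part before any ";" or whitespace.
    if (preg_match('/^content-type\s*:\s*([^;\s]+)/im', $header, $m)) {
        $type = strtolower($m[1]);
    }
    return ['status' => $status, 'type' => $type];
}

// Accept only successful (2xx) responses with an HTML-like media type.
// The accepted types are an assumption, not an exhaustive list.
function looks_like_page(string $header): bool
{
    $info = parse_head($header);
    return $info['status'] >= 200 && $info['status'] < 300
        && in_array($info['type'], ['text/html', 'application/xhtml+xml'], true);
}

$sample = "HTTP/1.1 200 OK\r\nContent-Type: text/html; charset=utf-8\r\nContent-Length: 1234";
var_dump(looks_like_page($sample)); // bool(true)
```

The header block returned for the final response could be passed to looks_like_page() in place of the plain content-type regexp. If you would rather not parse the status line yourself, curl_getinfo() with CURLINFO_HTTP_CODE reports the status code, provided it is called before the handle is closed.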

Note: Web servers will run scripts behind these URLs to give you responses. Be careful not to overload hosts with probing, and avoid requesting “delete” or “unsubscribe” type links.
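One possible way to act on that note is to clean up the link list before probing it and to pause between requests. This is only a sketch: filter_probe_candidates() is a hypothetical helper, the word list and the half-second delay are arbitrary assumptions, and the example.com URLs are placeholders.

```php
<?php
// Hypothetical helper: drop duplicate URLs and URLs whose text hints at a
// side effect, so they are never requested.
function filter_probe_candidates(array $urls): array
{
    // Assumed word list; extend it for the sites being crawled.
    $risky = '/(delete|remove|unsubscribe|logout)/i';
    $seen = [];
    $keep = [];
    foreach ($urls as $url) {
        if (isset($seen[$url]) || preg_match($risky, $url)) {
            continue;
        }
        $seen[$url] = true;
        $keep[] = $url;
    }
    return $keep;
}

$links = [
    'http://example.com/about.php',
    'http://example.com/about.php',
    'http://example.com/list/unsubscribe?id=7',
    'http://example.com/contact.html',
];

foreach (filter_probe_candidates($links) as $url) {
    // Probe $url here (for example with a HEAD request), then pause so the
    // host sees at most about two requests per second.
    usleep(500000);
}
```

A keyword filter like this is a crude heuristic: it can miss destructive links with innocent-looking URLs and can skip harmless pages, so treat it as a first line of defence only.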
