I’m working on a little web crawler as a side project at the moment. Basically, it collects all hrefs on a page and then parses those in turn. My problem is this:
How can I get only the actual page results? At the moment I’m using the following:
foreach ($page->getElementsByTagName('a') as $link)
{
    $href = $link->getAttribute('href');
    $compare_url = parse_url($href);
    if (empty($compare_url['host']))
    {
        // No host in the link: treat it as relative to the base host ($base_url is parsed elsewhere).
        $links[] = 'http://'.$base_url['host'].'/'.ltrim($href, '/');
    }
    elseif ($base_url['host'] == $compare_url['host'])
    {
        // Absolute link on the same host: keep it unchanged.
        $links[] = $href;
    }
}
As you can see, this will bring in JPEGs, exe files, etc. I only need to pick up web pages like .php, .html, .asp, etc.
I’m not sure if there is some function that can work this out, or whether it will need to be a regex against some sort of master list?
Thanks
Since the URL string alone isn’t connected to the resource behind it in any way, you will have to go out and ask the webserver about each one. For this there’s an HTTP method called HEAD, so you won’t have to download everything.
You can implement this with curl in php like this:
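A minimal sketch of such a HEAD check might look like the following. The function names `is_html_headers` and `is_html_page` and the 10-second timeout are assumptions made for illustration; only the curl options themselves are standard.

```php
<?php
// Hypothetical helper: decide from the raw header string whether the
// final response was served as text/html.
function is_html_headers(string $rawHeaders): bool
{
    // When redirects are followed, curl returns one header block per hop;
    // only the last block describes the resource we ended up at.
    $blocks = preg_split('/\r?\n\r?\n/', trim($rawHeaders));
    $last = end($blocks);
    return preg_match('/^Content-Type:\s*text\/html\b/mi', $last) === 1;
}

// Hypothetical helper: probe a URL with a HEAD request (no body is downloaded).
function is_html_page(string $url): bool
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_NOBODY         => true,  // send HEAD instead of GET
        CURLOPT_HEADER         => true,  // include headers in the returned string
        CURLOPT_RETURNTRANSFER => true,  // return the result instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects...
        CURLOPT_MAXREDIRS      => 5,     // ...but give up after 5 hops
        CURLOPT_TIMEOUT        => 10,    // assumed value; tune as needed
    ]);
    $headers = curl_exec($ch);           // false on transport failure
    return is_string($headers) && is_html_headers($headers);
}
```

In the crawler loop you would then keep a link only if `is_html_page($href)` returns true.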
This version only accepts text/html responses and doesn’t check whether the response is a 404 or some other error (it does, however, follow redirects, up to 5 hops). You can tweak the regexp or add some error handling, either from the curl response or by matching against the first line of the header string.
Note: Webservers will run scripts behind these URLs to give you responses. Be careful not to overload hosts with probing, and avoid requesting “delete” or “unsubscribe” type links.
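As a rough sketch of the error handling mentioned above, the status line of the last header block can be matched to pull out the response code. The helper name `final_status_code` is hypothetical, and it assumes the raw header string was captured with CURLOPT_HEADER enabled.

```php
<?php
// Hypothetical helper: extract the status code of the final response from
// the raw header string (one header block per redirect hop).
function final_status_code(string $rawHeaders): ?int
{
    $blocks = preg_split('/\r?\n\r?\n/', trim($rawHeaders));
    $last = end($blocks);
    // The status line looks like "HTTP/1.1 200 OK" or "HTTP/2 200".
    if (preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#', $last, $m) === 1) {
        return (int) $m[1];
    }
    return null; // no recognisable status line
}
```

A crawler could then skip any link whose final status code is not in the 200 range.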