I am making a simple web spider, and I was wondering whether there is a way, from my PHP code, to get all the webpages on a domain.
E.g., let's say I wanted to get all the webpages on Stackoverflow.com. That means it would get:
https://stackoverflow.com/questions/ask
pulling webpages from an adult site — how to get past the site agreement?
https://stackoverflow.com/questions/1234214/
Best Rails HTML Parser
And all the links. How can I get that? Or is there an API or directory that would let me get it?
Also, is there a way I can get all the subdomains?
Btw, how do crawlers crawl websites that don't have sitemaps or syndication feeds?
Cheers.
If a site wants you to be able to do this, it will probably provide a Sitemap. By combining the sitemap with the links found on each page, you should be able to traverse all the pages on a site, but this is really up to the owner of the site and how accessible they make it.
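As a rough illustration of the sitemap half of that, here is a minimal sketch in Python (the same idea ports to PHP with its XML extensions). It assumes the site publishes a sitemap in the standard sitemaps.org XML format; the `sitemap_urls` helper and the `example.com` sample are made up for the example, and fetching the real file (often at `/sitemap.xml`, though that location is only a convention) is left out.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org 0.9 schema.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(xml_text):
    """Return every <loc> value found in a sitemap (or sitemap index)."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


# Hypothetical sample document, standing in for a downloaded sitemap.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

urls = sitemap_urls(sample)
print(urls)
```

If the document is a sitemap index, the `<loc>` values point at further sitemaps rather than pages, so each one would need to be fetched and parsed the same way.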
If the site does not want you to do this, there is nothing you can do to work around it. HTTP does not provide any standard mechanism for listing the contents of a directory.
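Since there is no listing mechanism, a crawler without a sitemap can only start from a known page and follow links. The sketch below shows that idea in Python as a breadth-first traversal limited to one host. Everything named here (`LinkParser`, `same_host_links`, `crawl`, the `fetch` callback, the `fake_site` data) is hypothetical; `fetch` is injected so the example runs offline, and a real version would download pages over HTTP and should respect the site's crawling rules.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_host_links(base_url, html):
    """Return absolute http(s) links in html that stay on base_url's host."""
    parser = LinkParser()
    parser.feed(html)
    host = urlparse(base_url).netloc
    links = set()
    for href in parser.hrefs:
        # Resolve relative links and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parts = urlparse(absolute)
        if parts.scheme in ("http", "https") and parts.netloc == host:
            links.add(absolute)
    return links


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl; fetch(url) returns HTML text or None."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        for link in sorted(same_host_links(url, html)):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical in-memory "site" so the sketch runs without a network.
fake_site = {
    "https://example.com/": '<a href="/a">A</a> <a href="https://other.example/x">X</a>',
    "https://example.com/a": '<a href="/b#top">B</a> <a href="/">home</a>',
    "https://example.com/b": "no links here",
    "https://example.com/orphan": "never linked from anywhere",
}

pages = crawl("https://example.com/", fake_site.get)
print(pages)
```

Note that the `/orphan` page is never found: nothing links to it, which is exactly the limitation described above. The exact host comparison also means subdomains are treated as separate sites in this sketch.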