I’m creating a web crawler. I’m going to give it a URL, and it will scan through the directory and subdirectories for .html files. I’ve been looking at two alternatives:
- scandir($url). This works on local files but not on HTTP sites. Is this because of file permissions? I’m guessing it shouldn’t work, since it would be dangerous for everyone to have access to your website files.
- Searching for links and following them. I can call file_get_contents on the index file, find the links, and then follow them to their .html files.
Do either of these two work, or is there a third alternative?
The only way to look for HTML files is to parse the content returned by the server. Unless, by some small chance, directory browsing is enabled on the server (it is usually one of the first things to be disabled), you don’t have access to directory listings, only to the content the site is prepared to show you and let you use.
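As a rough illustration of parsing the returned content for links, here is a minimal sketch using PHP's DOMDocument. The function name extractLinks is hypothetical, and the sketch assumes the page has already been fetched into a string; it returns the href values exactly as written in the page, without resolving relative links.

```php
<?php
// Hypothetical helper: returns the unique href values of all <a> tags
// found in an HTML string, in the order they first appear.
function extractLinks(string $html): array
{
    // DOMDocument::loadHTML rejects an empty string, so bail out early.
    if (trim($html) === '') {
        return [];
    }

    $doc = new DOMDocument();
    // Collect parser warnings internally; real-world HTML is often malformed.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        // getAttribute returns '' when the attribute is missing.
        $href = trim($anchor->getAttribute('href'));
        if ($href !== '') {
            $links[] = $href;
        }
    }
    return array_values(array_unique($links));
}
```

The string returned by file_get_contents('http://www.mysite.com') could be passed in, after checking that it is not false. Relative links such as "page.html" would still need to be resolved against the page's URL before they can be followed.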
You would have to start at http://www.mysite.com and work onwards, scanning for links to HTML files. And what if the site uses ASP, PHP, or other files that return HTML content?
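The "start at one URL and work onwards" idea might look something like the sketch below. The names crawl, $fetchLinks and $maxPages are all hypothetical: $fetchLinks is assumed to be a callable you supply that downloads a page and returns the absolute URLs it links to (or an empty array on failure). The loop only keeps track of what has been visited and stays on the starting host.

```php
<?php
// Hypothetical crawl loop (breadth-first). $fetchLinks is an assumed callable:
// it takes a URL and returns the absolute URLs linked from that page.
function crawl(string $startUrl, callable $fetchLinks, int $maxPages = 100): array
{
    $host = parse_url($startUrl, PHP_URL_HOST);
    $queue = [$startUrl];
    $visited = [];

    // Stop when there is nothing left to visit or the page limit is reached.
    while ($queue !== [] && count($visited) < $maxPages) {
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue; // Already handled via another link.
        }
        $visited[$url] = true;

        foreach ($fetchLinks($url) as $link) {
            // Stay on the starting host and skip pages already seen.
            if (parse_url($link, PHP_URL_HOST) === $host && !isset($visited[$link])) {
                $queue[] = $link;
            }
        }
    }
    return array_keys($visited);
}
```

Because the loop follows links rather than file names, pages served from .php or .asp URLs are visited as well. Checking the Content-Type header of each response is one way to decide whether a page is HTML, instead of relying on the file extension.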