I’m trying to find the existing subdirectories on the same server as a specified site using PHP.
For example, when parsing the homepage of seoguru.nl, I would like to have an array similar to this:
Array
(
[0] => 'styles'
[1] => 'scripts'
[2] => 'images'
)
(those are all the directories being referenced in the HTML source)
I’ve been thinking about two ways to do this. The first would be a rather advanced regex, but my knowledge of regexes only goes so far… The second would be to use an HTML parser class like DOMDocument, but I wouldn’t know exactly how to do so.
Another problem is that outside sites (e.g. CDNs, or simply links to other sites) have to be excluded, but I think I can filter those out afterwards.
If you need any more information, please ask!
Parsing the HTML will only get you so far. Don’t forget that CSS and JavaScript can also contain URLs, and each of those needs its own parser, separate from whatever you use for the HTML.
Beyond that, don’t use regexes to parse HTML. They’ll blow up in your face far too easily. Definitely use DOM as your first and only choice for the HTML. It’s easy enough to use some XPath to get at tags that will contain URLs (//*[@src] would be the simplest and cover most things you need to scan). The JS and CSS portions will probably be hardest, as there are no standard parsers/manipulators for those built into PHP.
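As a rough illustration of the DOM approach, here is a minimal sketch, not a definitive implementation. The function name findReferencedDirectories and the sample markup are made up for this example. It assumes PHP 7.4 or later, assumes the parsed page is the homepage (so relative paths are treated as relative to the site root), and looks at href as well as src attributes, which goes slightly beyond the //*[@src] query mentioned above. URLs inside CSS and JavaScript are not handled.

```php
<?php
// Hypothetical helper: returns the first path segment of every src/href URL
// that points at the same host as $baseUrl.
function findReferencedDirectories(string $html, string $baseUrl): array
{
    $baseHost = (string) parse_url($baseUrl, PHP_URL_HOST);

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; keep libxml warnings out of the output.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $dirs = [];

    // Select the attribute nodes themselves rather than the owning elements.
    foreach ($xpath->query('//@src | //@href') as $attr) {
        $parts = parse_url(trim($attr->nodeValue));
        if ($parts === false) {
            continue; // seriously malformed URL
        }
        // Skip mailto:, javascript:, data: and similar schemes.
        if (isset($parts['scheme'])
            && !in_array(strtolower($parts['scheme']), ['http', 'https'], true)) {
            continue;
        }
        // Skip other hosts (CDNs, external links). Note that this is a strict
        // comparison, so "www.example.com" and "example.com" count as different.
        if (isset($parts['host']) && strcasecmp($parts['host'], $baseHost) !== 0) {
            continue;
        }

        $path = $parts['path'] ?? '';
        $segments = array_values(array_filter(
            explode('/', $path),
            fn($s) => $s !== '' && $s !== '.' && $s !== '..'
        ));

        // Treat the first segment as a directory only if something follows it
        // or the path ends in a slash; "/contact" alone is ambiguous and skipped.
        if (count($segments) >= 2
            || (count($segments) === 1 && substr($path, -1) === '/')) {
            $dirs[$segments[0]] = true; // array keys de-duplicate for us
        }
    }

    return array_map('strval', array_keys($dirs));
}

// Usage with made-up markup:
$sample = '<html><head><link rel="stylesheet" href="/styles/main.css">'
        . '<script src="scripts/app.js"></script>'
        . '<script src="//cdn.example.com/lib/x.js"></script></head>'
        . '<body><img src="http://seoguru.nl/images/logo.png">'
        . '<a href="/contact">Contact</a></body></html>';
print_r(findReferencedDirectories($sample, 'http://seoguru.nl/'));
```

For a live site you would first fetch the markup yourself (for instance with file_get_contents or cURL) and pass it in together with the URL it came from. The host filter handles the CDN and external-link exclusion in the same pass, so no separate clean-up step is needed afterwards.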