I am building a web crawler in PHP, meant for Intranet use (we’re dealing with a huge Intranet). I managed to download a web page using the cURL functions, but now I want to scan the content for links. I am trying to find all obvious links and split them into their corresponding scheme/authority/path/query/fragment components so I can index them properly.
Is there a known regular expression that matches all the links, including ones like <img src="../images/header/logo.png" />, background-image: url(..) and <a href="?query#lonely-fragment">?
What are all the plain-text link representations that I can find using regular expressions in PHP?
You will be better off parsing documents using a proper HTML parser. Regex is not really suited for this kind of thing.
Once you have done that, it’s fairly trivial using XPath to scan for e.g. `//img/@src` or `//a/@href` to find all of the content links in the document itself.
If you want to scan CSS, you will also need to look for `//style[@type='text/css']` and `//link[@rel='stylesheet'][@type='text/css']/@href` and then use a proper CSS parser to extract all of the content. (Or, if you want to be lazy, you could probably get away with the regex `/url\((.*?)\)/`.)
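As a rough illustration of the approach above, here is a minimal sketch using PHP's built-in DOMDocument and DOMXPath classes, with parse_url() to split each link into components. The function name `extractLinks`, the sample markup, and the host `intranet.example` are made up for this example. The stylesheet XPath is deliberately looser than the one above (it does not require a `type` attribute), and the CSS regex is a slightly extended variant of the lazy one that also strips optional quotes and whitespace. Treat it as a starting point, not a complete crawler.

```php
<?php
// Hypothetical helper: collects link-like attribute values plus url(...)
// references found in <style> blocks and style="" attributes.
function extractLinks(string $html): array
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid, so collect libxml warnings
    // internally instead of letting them surface as PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = [];

    // Links stored directly in attributes.
    $attrQuery = "//a/@href | //img/@src | //link[@rel='stylesheet']/@href";
    foreach ($xpath->query($attrQuery) as $attr) {
        $links[] = $attr->nodeValue;
    }

    // The "lazy" CSS scan: a regex over inline CSS, not a real CSS parser.
    foreach ($xpath->query('//style | //@style') as $node) {
        if (preg_match_all('/url\(\s*[\'"]?(.*?)[\'"]?\s*\)/i', $node->nodeValue, $m)) {
            foreach ($m[1] as $url) {
                $links[] = $url;
            }
        }
    }

    return array_values(array_unique($links));
}

// Sample markup, invented for this example.
$html = <<<'HTML'
<html><head>
<link rel="stylesheet" type="text/css" href="/css/site.css">
<style type="text/css">body { background-image: url("../images/bg.png"); }</style>
</head><body>
<a href="?query#lonely-fragment">x</a>
<img src="../images/header/logo.png" />
</body></html>
HTML;

$links = extractLinks($html);

// parse_url() splits a URL into scheme/host/path/query/fragment.
// It does not resolve relative links against the page URL.
$parts = parse_url('http://intranet.example/a/b?x=1#top');

print_r($links);
print_r($parts);
```

Relative links such as `../images/header/logo.png` still have to be resolved against the URL of the page they were found on before indexing; that step is not shown here. Linked external stylesheets would also need to be downloaded and scanned separately.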