I am building a web crawler in PHP, meant for Intranet use (we’re dealing

Asked: May 18, 2026

I am building a web crawler in PHP, meant for Intranet use (we’re dealing with a huge Intranet). I managed to download a web page using the cURL functions, but now I want to scan the content for links. I am trying to find all obvious links and split them into their corresponding scheme/authority/path/query/fragment components so I can index them properly.

Is there a known regular expression that matches all links, including ones like <img src="../images/header/logo.png" />, background-image: url(..), and <a href="?query#lonely-fragment">?

What are all the plain-text link representations that I can find using regular expressions in PHP?

1 Answer

Editorial Team · Answer 1 · 2026-05-18T00:49:21+00:00

You will be better off parsing the documents with a proper HTML parser. Regular expressions are not well suited to parsing HTML.

Once the document is parsed, it is fairly trivial to use XPath queries such as //img/@src or //a/@href to find all of the content links in the document itself.
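As a rough illustration of the XPath approach, here is a minimal sketch using PHP's bundled DOM extension (DOMDocument and DOMXPath). The sample markup and the function name extractLinks are made up for this example; in the crawler, the HTML string would be the page body returned by cURL.

```php
<?php
// Hypothetical sample markup standing in for a page fetched with cURL.
$html = <<<'HTML'
<html><body>
<img src="../images/header/logo.png" />
<a href="?query#lonely-fragment">A link</a>
</body></html>
HTML;

// extractLinks() is an illustrative helper name, not part of any library.
function extractLinks(string $html): array
{
    $doc = new DOMDocument();

    // Real-world markup is often malformed; collect libxml warnings
    // internally instead of emitting them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = [];

    // The union selects both attribute kinds, returned in document order.
    foreach ($xpath->query('//img/@src | //a/@href') as $attr) {
        $links[] = $attr->nodeValue;
    }

    return $links;
}

$links = extractLinks($html);
print_r($links);
```

For the sample above this collects ../images/header/logo.png and ?query#lonely-fragment. The values come back exactly as written in the markup, so relative references would still need to be resolved against the page URL before indexing.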

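If you want to scan CSS as well, you will also need to look for //style[@type='text/css'] (inline style blocks) and //link[@rel='stylesheet'][@type='text/css']/@href (external stylesheets), and then use a proper CSS parser to extract all of the content. Note that these queries only match elements that declare type='text/css' explicitly. (Or, if you want to be lazy, you could probably get away with the regex /url\((.*?)\)/, keeping in mind that the captured value may still include surrounding quotes.)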
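The lazy regex route could look something like the sketch below. It assumes PHP 7.4 or later (for the arrow function); the sample stylesheet and the helper name extractCssUrls are invented for the example. It will also pick up url(...) inside CSS comments, which a real CSS parser would skip.

```php
<?php
// Hypothetical stylesheet text, e.g. the contents of a <style> element.
$css = <<<'CSS'
body { background-image: url("../images/bg.png"); }
.logo { background: url( '/img/logo.png' ) no-repeat; }
.plain { background: url(images/plain.gif); }
CSS;

// extractCssUrls() is an illustrative helper name, not a library function.
function extractCssUrls(string $css): array
{
    // Non-greedy match of everything between "url(" and the next ")".
    preg_match_all('/url\((.*?)\)/i', $css, $matches);

    // Strip the optional whitespace and quotes around each captured value.
    return array_map(
        static fn(string $raw): string => trim($raw, " \t\n\r\"'"),
        $matches[1]
    );
}

$urls = extractCssUrls($css);
print_r($urls);
```

With the sample stylesheet this yields ../images/bg.png, /img/logo.png and images/plain.gif.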
