I am already parsing pages with the HtmlAgilityPack, and getting most img sources. However

Question

Asked: May 22, 2026

I am already parsing pages with HtmlAgilityPack and getting most img sources. However, many websites include image URLs in places other than the img src attribute (e.g. inline JavaScript, a different attribute, a different element). I would like to cast a slightly wider net and run a regex over the entire HTML string that captures URLs meeting the following conditions:

Must begin with http://, https://, //, or /
Then, any number of valid url path characters
Must end with .jpeg, .jpg, .png, or .gif

I imagine this would be simple to write, but I am not great with regexes. I imagine the parts would look like this:

^((https?\:\/\/)|(\/{1,2}))
(any ideas?)
(\.(jpe?g|png|gif))$

Can anyone help me fill in the blanks?

Thanks

1 Answer

Editorial Team · Answer 1 · 2026-05-22T18:35:01+00:00

There are a number of ad hoc regular expressions for matching URLs out there, but none that I am aware of claims total reliability. However, this one attempts to satisfy your conditions.

According to [1], the valid unreserved URL characters are alphanumerics and the symbols $-_.+!*'(),. There are reserved characters as well, +/?%#&, which are listed concisely in [2]; I couldn't find a list in the bulk of the RFC. Other characters are used in query strings, namely = and ;, so those need to be included too. You then run into the problem that not everyone encodes their URLs properly, so spaces, among other things, may be present. I do not know how to account for that, since the way a browser auto-corrects such URLs can be mystifying.
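To make the character list above concrete, here is a minimal sketch, in Python for brevity, of a stricter pattern whose body only allows the characters named in this answer. The names `URL_CHARS` and `STRICT_IMAGE_URL` are made up for illustration, and the prefix and extension parts are taken from the pattern given in this answer, rewritten with non-capturing groups.

```python
import re

# Characters named in the answer: alphanumerics, the unreserved symbols
# $-_.+!*'(), the reserved symbols +/?%#& and the query characters = and ;
# (the hyphen is escaped so it is not read as a range).
URL_CHARS = r"A-Za-z0-9$\-_.+!*'(),/?%#&=;"

# Hypothetical name. Optional scheme, one or two slashes, a lazy run of
# allowed characters, then one of the image extensions.
STRICT_IMAGE_URL = re.compile(
    r"(?:https?:)?//?[" + URL_CHARS + r"]+?\.(?:jpe?g|png|gif)"
)

# A properly encoded URL is found (the lazy body stops at the extension).
print(STRICT_IMAGE_URL.findall('<img data-src="https://cdn.example.com/a/b.jpg?x=1">'))

# A URL containing a raw space is missed entirely.
print(STRICT_IMAGE_URL.findall('src="http://example.com/my image.png"'))
```

The second call illustrates the encoding problem: a whitelist of valid characters rejects URLs that real pages nevertheless contain.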

Therefore, you might just assume that anything can be in a URL, as long as it starts with something particular and ends with something particular (both of which you provided), though this is still unreliable.

@(https?:)?//?[^'"<>]+?\.(jpg|jpeg|gif|png)@
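As a rough illustration of how this pattern behaves on a whole HTML string, here is a small Python sketch. It assumes the surrounding @ characters are pattern delimiters of the kind some regex APIs require, so they are dropped here. The groups are made non-capturing so that `findall` returns whole matches, and the name `IMAGE_URL` and the sample markup are invented for the example.

```python
import re

# The answer's pattern without the @ delimiters; groups made non-capturing.
IMAGE_URL = re.compile(r"""(?:https?:)?//?[^'"<>]+?\.(?:jpg|jpeg|gif|png)""")

# Hypothetical page fragment with image URLs in several different places.
html = """
<img src="http://example.com/a.png">
<div data-bg='//cdn.example.com/img/b.jpeg'></div>
<script>var hero = "/static/hero.gif";</script>
<a href="https://example.com/page.html">link</a>
"""

for url in IMAGE_URL.findall(html):
    print(url)
# http://example.com/a.png
# //cdn.example.com/img/b.jpeg
# /static/hero.gif
```

Because the body is matched lazily, each match ends at the first image extension it reaches, so a query string after the extension is cut off. The excluded set also still permits whitespace and newlines, so unquoted text containing a slash can produce false positives. Adding `\s` to the excluded characters is one possible tightening, at the cost of missing URLs with raw spaces.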
