How do I match a URL string like this: img src = https://stackoverflow.com/a/b/c/d/someimage.jpg where

Question 1

How do I match a URL string like this:

img src = "https://stackoverflow.com/a/b/c/d/someimage.jpg"

where only the domain name and the file extension (jpg) are fixed while everything else varies?

The following code does not seem to work:

Pattern p = Pattern.compile("<img src=\"http://stachoverflow.com/.*jpg");
// Create a matcher with an input string
Matcher m = p.matcher(url);
while (m.find()) {
    String s = m.toString();
}

Answer 1

First, I would use the group() method to retrieve the matched text, not toString(). But it’s probably just the URL part you want, so I would use parentheses to capture that part and call group(1) to retrieve it.
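As a minimal sketch of this first point, assuming the input is a single tag: the snippet below wraps the URL part in parentheses and reads it back with group(1). The class name GroupDemo and the helper firstUrl are hypothetical, and the pattern is the one from the question with the domain typo corrected and the dot escaped.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Returns the captured URL (group 1) of the first match, or null if nothing matches.
    static String firstUrl(String input) {
        // The parentheses around the URL part make it capturing group 1.
        Pattern p = Pattern.compile("<img src=\"(http://stackoverflow\\.com/.*jpg)");
        Matcher m = p.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<img src=\"http://stackoverflow.com/a/b/c/d/someimage.jpg\">";
        // group(1) holds only the URL; group() would also include the leading <img src=" text.
        System.out.println(firstUrl(html));
    }
}
```

Calling m.toString() instead would return a description of the Matcher object itself, not the matched text.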

Second, I wouldn’t assume src was the first attribute in the <img> tag. On SO, for example, it’s usually preceded by a class attribute. You want to add something to match intervening attributes, but make sure it can’t match beyond the end of the tag. [^<>]+ will probably suffice.

Third, I would use something more restrictive than .* to match the unknown part of the path. There’s always a chance that you’ll find two URLs on one line, like this:

<img src="http://so.com/foo.jpg"> blah <img src="http://so.com/bar.jpg">

In that case, the .* in your regex would bridge the gap, giving you one match where you wanted two. Again, [^<>]* will probably be restrictive enough.
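The difference can be checked with a small sketch that runs both variants over the two-tag line above. The class name GreedyDemo and the helper findAll are hypothetical, and both patterns are simplified to a fixed src-first tag so that only the .* versus [^<>]* choice differs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDemo {
    // Collects capturing group 1 of every match of the given regex in the input.
    static List<String> findAll(String regex, String input) {
        List<String> urls = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        String line = "<img src=\"http://so.com/foo.jpg\"> blah <img src=\"http://so.com/bar.jpg\">";
        // Greedy .* runs to the last ".jpg" on the line, so both tags end up in one match.
        String greedy = "<img src=\"(http://so\\.com/.*\\.jpg)";
        // [^<>]* cannot cross the closing ">" of the first tag, so each tag matches separately.
        String restricted = "<img src=\"(http://so\\.com/[^<>]*\\.jpg)";
        System.out.println(findAll(greedy, line));
        System.out.println(findAll(restricted, line));
    }
}
```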

There are several other potential problems as well. Are attribute values always enclosed in double-quotes, or could they be single-quoted, or not quoted at all? Will there be whitespace around the =? Are element and attribute names always lowercase?

…and I could go on. As has been pointed out many, many times here on SO, regexes are not really the right tool for working with HTML. They can usually handle simple tasks like this one, but it’s essential that you understand their limitations.

Here’s my revised version of your regex (as a Java string literal):

"(?i)<img[^<>]+src\\s*=\\s*[\"']?(http://stackoverflow\\.com/[^<>]+\\.jpg)"

Editorial Team · Answer 1 · 2026-05-14T04:51:00+00:00
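For illustration, here is one way the revised regex from the answer might be used in a loop. This is a sketch under stated assumptions, not the answerer's own code: the class name ImgSrcExtractor, the method extract, and the sample HTML are all hypothetical. The pattern is copied unchanged, so it matches http:// URLs only; an https:// URL like the one in the question would need https?:// in the pattern, which is an adjustment the answer does not make.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImgSrcExtractor {
    // The revised regex from the answer, unchanged. Group 1 is the URL.
    static final Pattern IMG_SRC = Pattern.compile(
        "(?i)<img[^<>]+src\\s*=\\s*[\"']?(http://stackoverflow\\.com/[^<>]+\\.jpg)");

    // Returns every captured URL found in the given HTML text.
    static List<String> extract(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            urls.add(m.group(1)); // just the URL, not the whole tag prefix
        }
        return urls;
    }

    public static void main(String[] args) {
        // Hypothetical input: a leading class attribute, upper-case SRC, spaces
        // around "=", single quotes, and a second, unquoted tag on the same line.
        String html = "<img class=\"thumb\" SRC = 'http://stackoverflow.com/a/b/c/d/someimage.jpg'>"
                    + " blah <img class=x src=http://stackoverflow.com/bar.jpg>";
        for (String url : extract(html)) {
            System.out.println(url);
        }
    }
}
```

Because [^<>]+ is greedy within a tag, a later attribute value that also ends in .jpg would be pulled into the capture, which is one of the limitations the answer warns about.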
