I’ve set myself a somewhat ambitious first task in learning regular expressions (and one which relates to a problem I’m trying to solve). I need to find every instance of a URL that ends in .m4v in a big HTML string.
My first attempt was this, for jpg files:
http.*jpg
Which seems correct at first glance, but of course returns stuff like this:
http://domain.com/page.html" title="Misc"><img src="http://domain.com/image.jpg
Which does match the expression in theory. So really, I need to put something in http.*m4v that says ‘only the closest instance between http and m4v’. Any ideas?
As you’ve noticed, an expression such as http.*jpg is greedy.
That means it reads as much input as possible while still satisfying the expression.
It’s the “*” operator that makes it greedy. There’s a well-defined regex technique for making it non-greedy: put the “?” modifier after the “*”, giving http.*?jpg. Now it will match as little as possible while still satisfying the expression (i.e. it will stop searching at the first occurrence of “jpg”).
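To see the difference, here is a minimal sketch in Python (the patterns themselves are not Python-specific). The sample strings are made up for illustration, apart from the HTML fragment taken from the question.

```python
import re

# Hypothetical sample text with two image URLs on one line.
text = 'see http://domain.com/a.jpg and http://domain.com/b.jpg here'

# Greedy: ".*" grabs as much as it can, so both URLs end up in one match.
greedy_matches = re.findall(r'http.*jpg', text)

# Non-greedy: ".*?" stops at the first "jpg" it can reach.
lazy_matches = re.findall(r'http.*?jpg', text)

print(greedy_matches)  # one long match spanning both URLs
print(lazy_matches)    # two separate matches

# The fragment from the question: the match still STARTS at the first "http",
# so the non-greedy version alone does not isolate the image URL here.
question_html = 'http://domain.com/page.html" title="Misc"><img src="http://domain.com/image.jpg'
lazy_on_question = re.findall(r'http.*?jpg', question_html)
print(lazy_on_question)
```

Note that the non-greedy modifier only shortens the end of a match. On the fragment from the question it still begins at the first “http” and runs through to “image.jpg”, which is why the end (and the allowed characters) of the URL need to be pinned down as well.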
Of course, if “jpg” also appears in the middle of a URL, the non-greedy expression will stop there and will not match the full URL.
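A short Python sketch of that failure mode, using a hypothetical URL whose path contains “jpg” before the real extension:

```python
import re

# Hypothetical URL: "jpg" appears in a directory name as well as the extension.
url = 'http://domain.com/jpg-gallery/photo.jpg'

# The non-greedy pattern stops at the first "jpg" it can reach.
match = re.search(r'http.*?jpg', url)
truncated = match.group(0)

print(truncated)  # http://domain.com/jpg
```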
You’ll want to define the end of the URL as something that can’t be considered part of the URL, such as a space, a new line, or (if the URL is nested inside parentheses) a closing parenthesis. However, this can’t be solved with just one little regex if the URL is embedded in written language, since URLs there are often ambiguous.
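One way to sketch this idea in Python is to restrict the characters allowed inside the URL and require that “.m4v” is followed by something outside that set. The excluded characters (whitespace, quotes, angle brackets, parentheses) are an assumption chosen for this example, not a complete definition of a URL, and the sample HTML is made up.

```python
import re

# Assumed set of characters that may appear inside a URL for this sketch:
# anything except whitespace, quotes, angle brackets and parentheses.
URL_CHAR = r'[^\s"\'<>()]'

# "http://" or "https://", then URL characters, ending in ".m4v" that is
# NOT followed by another URL character (so a delimiter or end of input).
M4V_URL = re.compile(r'https?://' + URL_CHAR + r'+\.m4v(?!' + URL_CHAR + r')')

# Hypothetical HTML with one non-matching link and two .m4v URLs.
html = ('<a href="http://domain.com/page.html" title="Misc">'
        '<video src="http://domain.com/clips/intro.m4v"></video>'
        ' (http://domain.com/m4v-files/demo.m4v)')

m4v_urls = M4V_URL.findall(html)
print(m4v_urls)
```

Because the character class cannot cross a quote or a space, a match can no longer start at one URL and end inside another, and an “m4v” in the middle of a path does not end the match early.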
Take, for example, a URL that is immediately followed by a comma in a sentence.
The comma could technically be a part of the URL. You have to deal with a lot of ambiguities like this when looking for URLs in written text, and it’s hard not to have bugs because of them.
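A rough Python sketch of one possible heuristic: match greedily up to the next whitespace, then strip punctuation that usually belongs to the sentence rather than the URL. The function name, the punctuation set and the sample sentence are all assumptions for illustration, and the heuristic will be wrong for a URL that genuinely ends in one of those characters.

```python
import re

# Assumed set of characters treated as sentence punctuation when trailing.
TRAILING_PUNCTUATION = '.,;:!?'

def find_urls(text):
    """Return URL candidates with trailing sentence punctuation removed."""
    candidates = re.findall(r'https?://\S+', text)
    return [candidate.rstrip(TRAILING_PUNCTUATION) for candidate in candidates]

# Hypothetical sentence: one URL is followed by a comma, the other contains one.
sentence = 'Watch http://domain.com/intro.m4v, then read http://domain.com/a,b.html.'
urls = find_urls(sentence)
print(urls)
```

Only trailing punctuation is stripped, so the comma inside the second URL survives while the comma and full stop that end each clause are dropped.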
EDIT | Here’s an example of a regex in PHP that is a quick and dirty solution, being greedy only where needed and trying to deal with the English language:
It outputs:
The explanation is in the comments.