I’m extracting portions of URLs from text using a regular expression in Python. The URLs I’m looking for are from a limited set of patterns, so it feels like I should just be able to handle them in a regex. What I’m trying to extract is the first portion of the file name (“some.file.name” in all the examples below), which can include dots, letters and digits.
These are the sorts of forms the URL can take:
http://www.example.com/some.file.name.html
http://www.example.com/some.file.name_foo.html
http://www.example.com/some.file.name(123).html
http://www.example.com/some.file.name_foo(123).html
http://www.example.com/some.file.name
http://www.example.com/some.file.name_foo
http://www.example.com/some.file.name(123)
http://www.example.com/some.file.name_foo(123)
I think I’m pretty much there with this regex:
http://www\.example\.com/([a-zA-Z0-9\.]+)(_[a-z]+)?(\(\d+\))?(\.html)?
But it includes the “.html” in the match when the URL is like the first one in the list. Is there any way of stopping this or is it a fundamental limitation of regular expressions?
I’m quite happy to remove the extension in code as it will always be the same and will never be valid as part of the file name, but it would be cleaner to do it as part of the regex match.
Edit:
I should emphasise that these URLs are in bodies of text. I can’t make any guarantees about whether there are characters before or after them or what those characters might be. I think it’s safe to assume that they won’t be numbers, letters, underscores or dots.
Regular expression quantifiers are greedy by default.
Try this regexp:
Notice the extra ? added to the first group so that it does not capture the .html. It makes the first group capture as little as necessary to match, instead of as much as possible. Without the ?, the .html will be included in the first group, because the other groups are optional and a greedy quantifier takes as much as it can as early as possible.
P.S. Also note that I anchored the regexp using ^ and $ to always match the full line.
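As a rough sketch of what the description above amounts to in Python: the ANCHORED pattern below is reconstructed from that description (the question's regex with a lazy first group plus ^ and $), so treat it as an assumption rather than the exact pattern that was posted. Since the question's edit says the URLs sit inside bodies of text, the IN_TEXT variant is a hypothetical adaptation that swaps the anchors for a negative lookahead, relying on the stated assumption that a URL is never followed by a letter, digit, underscore or dot. The names ANCHORED, IN_TEXT and extract_names are made up for illustration.

```python
import re

# Assumed reconstruction: lazy first group (+?) and full-line anchors.
ANCHORED = re.compile(
    r"^http://www\.example\.com/([a-zA-Z0-9.]+?)(_[a-z]+)?(\(\d+\))?(\.html)?$"
)

# Hypothetical variant for URLs embedded in text: instead of ^ and $, a
# negative lookahead requires that no letter, digit, underscore or dot follows,
# which forces the lazy group to grow until the URL is fully consumed.
IN_TEXT = re.compile(
    r"http://www\.example\.com/([a-zA-Z0-9.]+?)(_[a-z]+)?(\(\d+\))?(\.html)?(?![\w.])"
)


def extract_names(text):
    """Return the first file-name portion of every matching URL in text."""
    return [m.group(1) for m in IN_TEXT.finditer(text)]


urls = [
    "http://www.example.com/some.file.name.html",
    "http://www.example.com/some.file.name_foo.html",
    "http://www.example.com/some.file.name(123).html",
    "http://www.example.com/some.file.name_foo(123).html",
    "http://www.example.com/some.file.name",
    "http://www.example.com/some.file.name_foo",
    "http://www.example.com/some.file.name(123)",
    "http://www.example.com/some.file.name_foo(123)",
]

# One URL per string: the anchored pattern yields the same name for all eight.
names = [ANCHORED.match(u).group(1) for u in urls]

# URLs inside running text: the lookahead variant finds each one.
text = (
    "See http://www.example.com/some.file.name.html, then "
    "http://www.example.com/other.name_foo(123) here."
)
found = extract_names(text)  # ['some.file.name', 'other.name']
```

Without either the anchors or the lookahead, the lazy group would stop after a single character, because every group after it is optional and can match nothing.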