I was reading this question about how to parse URLs out of web pages and had a question about the accepted answer which offered this solution:
((mailto\:|(news|(ht|f)tp(s?))\://){1}\S+)
The solution was offered by csmba and he credited it to regexlib.com. Whew. Credits done.
I think this is a fairly naive regular expression but it’s a fine starting point for building something better. But, my question is this:
What is the point of {1}? It means ‘exactly one of the previous grouping’, right? Isn’t that the default behavior of a grouping in a regular expression? Would the expression be changed in any way if the {1} were removed?
If I saw this from a coworker I would point out his or her error but as I write this the response is rated at a 6 and the expression on regexlib.com is rated a 4 of 5. So maybe I’m missing something?
@Jeff Atwood, your interpretation is a little off – the {1} means match exactly once, but has no effect on the ‘capturing’ – the capturing occurs because of the parens – the braces only specify the number of times the pattern must match the source – once, as you say.
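To make the distinction concrete, here is a minimal sketch in Python 3 (assuming its `re` module behaves like the engine in question). It compares the original expression with a hypothetical variant in which every group is made non-capturing with `(?:...)` while the `{1}` is left in place. The names `CAPTURING` and `NON_CAPTURING` are made up for this illustration.

```python
import re

# The expression from the question, unchanged.
CAPTURING = re.compile(r'((mailto\:|(news|(ht|f)tp(s?))\://){1}\S+)')

# Hypothetical variant: same structure and same {1}, but every group
# is non-capturing, so nothing is captured at all.
NON_CAPTURING = re.compile(r'(?:(?:mailto\:|(?:news|(?:ht|f)tp(?:s?))\://){1}\S+)')

url = 'https://example.org/page'

# Both patterns match the same text...
capturing_match = CAPTURING.match(url)
non_capturing_match = NON_CAPTURING.match(url)

# ...but only the parenthesised version records sub-matches.
print(CAPTURING.groups, capturing_match.groups())
print(NON_CAPTURING.groups, non_capturing_match.groups())
```

The quantifier is present in both patterns, yet the number of capture groups depends only on the kind of parentheses used.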
I agree with @Marius, even if his answer is a little terse and may come off as flippant. Regular expressions are tough if one isn’t used to them, and the {1} in the question isn’t quite an error: in systems that support it, it does mean ‘exactly one match’. In that sense it is redundant rather than wrong, and removing it changes nothing.
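A quick way to check this claim is to run the expression with and without the {1} over the same text. The sketch below assumes Python 3's `re` module; the sample sentence and the helper `full_matches` are invented for the illustration.

```python
import re

ORIGINAL = r'((mailto\:|(news|(ht|f)tp(s?))\://){1}\S+)'
# Same expression with the {1} quantifier stripped out.
SIMPLIFIED = ORIGINAL.replace('{1}', '')

# Made-up sample text containing three of the supported schemes.
sample = ('See http://example.org and ftp://files.example.org/pub '
          'or mailto:someone@example.org today.')

def full_matches(pattern, text):
    """Return the full text of every match (group 0), in order."""
    return [m.group(0) for m in re.finditer(pattern, text)]

with_braces = full_matches(ORIGINAL, sample)
without_braces = full_matches(SIMPLIFIED, sample)

print(with_braces)
print(with_braces == without_braces)
```

On this sample, both patterns find the same three strings. That agrees with reading {1} as a no-op here, although a single sample is a sanity check rather than a proof.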
Unfortunately, contrary to a now-deleted post, it doesn’t keep the regexp from matching http://http://example.org, since the \S+ at the end will match one or more non-whitespace characters, including the http://example.org in http://http://example.org (verified using Python 2.5, just in case my regexp reading was off). So, the regexp given isn’t really the best. I’m not a URL expert, but probably something limiting the appearance of ‘:’s and ‘//’s after the first one would be necessary (but hardly sufficient) to ensure good URLs.
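The check can be reproduced in Python 3 along the following lines. The `STRICTER` pattern is not from the original answer or from regexlib.com. It is a hypothetical sketch of the "limit what follows the scheme" idea, using a negative lookahead to refuse a second `://`. It is deliberately simplistic and would also reject legitimate URLs that carry `://` in a query string.

```python
import re

ORIGINAL = re.compile(r'((mailto\:|(news|(ht|f)tp(s?))\://){1}\S+)')

doubled = 'http://http://example.org'
m = ORIGINAL.match(doubled)

# The scheme group matches only the first 'http://'; the trailing \S+
# then consumes the second 'http://example.org' as ordinary text.
print(m.group(0))
print(m.group(2))

# Hypothetical stricter pattern (an assumption, not a full URL validator):
# after the scheme, consume non-whitespace characters only while the
# upcoming text is not another '://'.
STRICTER = re.compile(r'(?:mailto:|(?:news|(?:ht|f)tps?)://)(?:(?!://)\S)+')

# fullmatch is used because the goal is to validate one candidate string.
print(STRICTER.fullmatch(doubled))
print(STRICTER.fullmatch('http://example.org:8080/path') is not None)
```

Using `fullmatch` matters here: with `match` or `search`, the stricter pattern would still accept the leading `http://http` portion of the doubled string.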