I was reading this question about how to parse URLs out of web pages

Question

Asked: May 10, 2026

I was reading this question about how to parse URLs out of web pages and had a question about the accepted answer which offered this solution:

((mailto\:|(news|(ht|f)tp(s?))\://){1}\S+)

The solution was offered by csmba and he credited it to regexlib.com. Whew. Credits done.

I think this is a fairly naive regular expression but it’s a fine starting point for building something better. But, my question is this:

What is the point of {1}? It means ‘exactly one of the previous grouping’, right? Isn’t that the default behavior of a grouping in a regular expression? Would the expression be changed in any way if the {1} were removed?

If I saw this from a coworker I would point out his or her error but as I write this the response is rated at a 6 and the expression on regexlib.com is rated a 4 of 5. So maybe I’m missing something?

1 Answer

score 0 · Answer 1 · 2026-05-10T13:28:19+00:00

@Jeff Atwood, your interpretation is a little off. The {1} does mean ‘match exactly once’, but it has no effect on the capturing. The capturing occurs because of the parens; the braces only specify the number of times the preceding group must match the source, which is once, as you say.

I agree with @Marius, even if his answer is a little terse and may come off as flippant. Regular expressions are tough if one isn’t used to them, and the {1} in the question isn’t quite an error: in systems that support it, it does mean ‘exactly one match’. But a group with no quantifier already has to match exactly once, so the {1} is redundant, and removing it doesn’t change what the expression matches.
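For anyone who wants to check the point about {1} rather than take my word for it, here is a minimal sketch in Python. It compiles the expression from the question with and without the {1} and compares both the matches and the captured groups. The sample text and the helper names are made up for illustration; they are not from the original question.

```python
import re

# The pattern from the question, and the same pattern with {1} removed.
WITH_QUANTIFIER = re.compile(r"((mailto\:|(news|(ht|f)tp(s?))\://){1}\S+)")
WITHOUT_QUANTIFIER = re.compile(r"((mailto\:|(news|(ht|f)tp(s?))\://)\S+)")

# Hypothetical sample text, chosen only for illustration.
SAMPLE = (
    "Mail mailto:someone@example.org or visit http://example.org "
    "and ftp://example.org/file.txt today"
)


def full_matches(pattern, text):
    """Return the full text of every match of pattern in text."""
    return [m.group(0) for m in pattern.finditer(text)]


def all_groups(pattern, text):
    """Return the tuple of captured groups for every match."""
    return [m.groups() for m in pattern.finditer(text)]


# Both comparisons print True: the {1} changes neither matches nor captures.
print(full_matches(WITH_QUANTIFIER, SAMPLE) == full_matches(WITHOUT_QUANTIFIER, SAMPLE))
print(all_groups(WITH_QUANTIFIER, SAMPLE) == all_groups(WITHOUT_QUANTIFIER, SAMPLE))
```

This only shows the two patterns agree on one sample string, not a proof, but it is consistent with how the quantifier is defined.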

Unfortunately, contrary to a now-deleted post, it doesn’t keep the regexp from matching http://http://example.org, since the \S+ at the end matches one or more non-whitespace characters, including the http://example.org in http://http://example.org (verified using Python 2.5, just in case my regexp reading was off). So the regexp given isn’t really the best. I’m not a URL expert, but limiting the appearance of ‘:’s and ‘//’s after the first one would probably be necessary (but hardly sufficient) to ensure good URLs.
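A rough sketch of both points, again in Python. The first pattern is the one from the question. The second is my own hypothetical variant, not anything from regexlib.com: it accepts the same schemes but uses a negative lookahead to refuse a candidate whose non-whitespace tail contains another ‘://’. The names here are assumptions for illustration only.

```python
import re

# The pattern from the question, unchanged.
ORIGINAL = re.compile(r"((mailto\:|(news|(ht|f)tp(s?))\://){1}\S+)")

# Hypothetical variant: same schemes, but the negative lookahead (?!\S*://)
# rejects a candidate if "://" appears again before the next whitespace.
NO_SECOND_SCHEME = re.compile(
    r"(?:mailto:|(?:news|(?:ht|f)tps?)://)(?!\S*://)\S+"
)


def find_urls(pattern, text):
    """Return the full text of every match of pattern in text."""
    return [m.group(0) for m in pattern.finditer(text)]


doubled = "http://http://example.org"

# The original happily matches the whole doubled string.
print(find_urls(ORIGINAL, doubled))

# The variant skips the outer scheme and matches only the inner URL.
print(find_urls(NO_SECOND_SCHEME, doubled))
```

As said, this is hardly sufficient. The variant would also turn away a legitimate URL that carries another URL in its query string, and it does nothing about the other ways a URL can be malformed.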
