How can I detect (with regular expressions or heuristics) a web site link in a string of text such as a comment?
The purpose is to prevent spam. HTML is stripped, so I need to detect invitations to copy-and-paste. It should not be economical for a spammer to post links, because most users will not be able to get to the page successfully. I would like suggestions, references, or discussion on best practices.
Some objectives:
- The low-hanging fruit like well-formed URLs (http://some-fqdn/some/valid/path.ext)
- URLs without the http:// prefix (i.e. a valid FQDN + valid HTTP path)
- Any other funny business
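For the first two bullets, a rough first-pass sketch (in Python; the patterns and the helper name are illustrative, not a vetted implementation) might look like this:

```python
import re

# Well-formed URLs with an explicit scheme.
WELL_FORMED_URL = re.compile(r'https?://[^\s/]+(?:/\S*)?', re.IGNORECASE)

# Schemeless FQDN plus optional path, e.g. some-fqdn/some/valid/path.ext,
# with the hostname required to look like dotted labels ending in a TLD.
SCHEMELESS_FQDN = re.compile(
    r'\b(?:[a-z0-9-]+\.)+[a-z]{2,}(?:/\S*)?', re.IGNORECASE
)

def looks_like_link(text: str) -> bool:
    """True if the comment text appears to contain a link."""
    return bool(WELL_FORMED_URL.search(text) or SCHEMELESS_FQDN.search(text))
```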
Of course, I am blocking spam, but the same process could be used to auto-link text.
Ideas
Here are some things I’m thinking.
- The content is native-language prose so I can be trigger-happy in detection
- Should I strip out all whitespace first, to catch ‘www .example.com‘? Would common users know to remove the space themselves, or do any browsers ‘do-what-I-mean’ and strip it for you?
- Maybe multiple passes are a better strategy, with scans for:
- Well-formed URLs
- All non-whitespace followed by ‘.’ followed by any valid TLD
- Anything else?
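A sketch of that multi-pass idea, assuming Python and a deliberately tiny TLD list (a real filter would load the full IANA list):

```python
import re

# Normalization for a second pass: collapse whitespace around dots so
# that 'www . example . com' becomes 'www.example.com' before scanning.
SPACED_DOT = re.compile(r'\s*\.\s*')

# Abbreviated TLD alternation, for illustration only.
TLDS = r'(?:com|net|org|edu|gov|info|biz|co|uk|us|de)'
SUSPECT = re.compile(r'\S+\.' + TLDS + r'(?:/|\b)', re.IGNORECASE)

def multi_pass_scan(text: str) -> bool:
    # Pass 1: the text as written.
    if SUSPECT.search(text):
        return True
    # Pass 2: the same scan with spaced-out dots collapsed.
    return bool(SUSPECT.search(SPACED_DOT.sub('.', text)))
```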
Related Questions
I’ve read these and they are now documented here, so you can just reference the regexes in those questions if you want.
- replace URL with HTML Links javascript
- What is the best regular expression to check if a string is a valid URL
- Getting parts of a URL (Regex)
Update and Summary
Wow, there are some very good heuristics listed here! For me, the best bang-for-the-buck is a synthesis of the following:
- @Jon Bright’s technique of detecting TLDs (a good defensive chokepoint)
- For those suspicious strings, replace the dot with a dot-looking character as per @capar
- A good dot-looking character is @Sharkey’s subscripted · (i.e. ‘·‘). · is also a word boundary so it’s harder to casually copy & paste.
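A tiny sketch of that defanging step (the look-alike character here is U+00B7 MIDDLE DOT as a stand-in; substitute whichever dot-looking character you prefer):

```python
# Replace real dots in a flagged string with a look-alike so the text
# still reads naturally but no longer pastes cleanly into an address bar.
LOOKALIKE_DOT = '\u00b7'  # MIDDLE DOT, used here as a stand-in

def defang(suspect: str) -> str:
    return suspect.replace('.', LOOKALIKE_DOT)

# defang('www.example.com') -> 'www·example·com'
```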
That should make a spammer’s CPM low enough for my needs; the ‘flag as inappropriate’ user feedback should catch anything else. Other solutions listed are also very useful:
- Strip out all dotted-quads (@Sharkey’s comment to his own answer)
- @Sporkmonger’s requirement for client-side Javascript which inserts a required hidden field into the form.
- Pinging the URL server-side to establish whether it is a web site. (Perhaps I could run the HTML through SpamAssassin or another Bayesian filter, as per @Nathan.)
- Looking at Chrome’s source for its smart address bar to see what clever tricks Google uses
- Calling out to OWASP AntiSAMY or other web services for spam/malware detection.
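For the dotted-quad item, one possible (intentionally loose) check, again as an illustrative Python sketch:

```python
import re

# Loose IPv4 dotted-quad matcher; 999.1.2.3 would also be flagged,
# which is usually fine when the goal is blocking rather than parsing.
DOTTED_QUAD = re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b')

def contains_dotted_quad(text: str) -> bool:
    return bool(DOTTED_QUAD.search(text))
```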
Answer
I’m concentrating my answer on trying to avoid spammers. This leads to two sub-assumptions: the people using the system will be actively trying to contravene your check, and your goal is only to detect the presence of a URL, not to extract the complete URL. This solution would look different if your goal were something else.
I think your best bet is going to be with the TLD. There are the two-letter ccTLDs and the (currently) comparatively small list of others. These need to be prefixed by a dot and suffixed by either a slash or some word boundary. As others have noted, this isn’t going to be perfect. There’s no way to catch ‘buyfunkypharmaceuticals . it’ without disallowing the legitimate ‘I tried again. it doesn’t work’ or similar. All of that said, this would be my suggestion: match any run of non-whitespace, followed by a dot, followed by a known TLD, followed by either a slash or a word boundary.
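One way to write that check, as a Python sketch with an abbreviated TLD alternation (a production version would use the full TLD list):

```python
import re

# Non-whitespace, then a dot, then a recognized TLD, then either a
# slash or a word boundary. The TLD list is truncated for illustration.
TLD_PATTERN = re.compile(
    r'\S+\.(?:com|net|org|edu|gov|mil|biz|info|name|[a-z]{2})(?:/|\b)',
    re.IGNORECASE,
)

def contains_probable_url(text: str) -> bool:
    return bool(TLD_PATTERN.search(text))
```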
Things this will get: anything with a recognizable TLD that pastes straight into the address bar, with or without the http:// prefix.
It will of course break as soon as people start obfuscating their URLs, replacing ‘.’ with ‘ dot ‘. But, again assuming spammers are your concern here, if they start doing that sort of thing, their click-through rates are going to drop another couple of orders of magnitude toward zero. The set of people informed enough to deobfuscate a URL and the set of people uninformed enough to visit spam sites have, I think, a minuscule intersection. This solution should let you detect all URLs that are copy-and-pasteable to the address bar, whilst keeping collateral damage to a bare minimum.