How can I detect (with regular expressions or heuristics) a web site link in a string of text such as a comment?
The purpose is to prevent spam. HTML is stripped, so I need to detect invitations to copy-and-paste. It should not be economical for a spammer to post links, because most users will not be able to get to the page successfully. I would like suggestions, references, or discussion on best practices.
Some objectives:
- The low-hanging fruit like well-formed URLs (http://some-fqdn/some/valid/path.ext)
- URLs without the http:// prefix (i.e. a valid FQDN + valid HTTP path)
- Any other funny business
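For the first two bullets, a rough first-pass sketch (in Python; the patterns and the helper name are illustrative, not a vetted implementation) might look like this:

```python
import re

# Well-formed URLs with an explicit scheme.
WELL_FORMED_URL = re.compile(r'https?://[^\s/]+(?:/\S*)?', re.IGNORECASE)

# Schemeless FQDN plus optional path, e.g. some-fqdn/some/valid/path.ext,
# with the hostname required to look like dotted labels ending in a TLD.
SCHEMELESS_FQDN = re.compile(
    r'\b(?:[a-z0-9-]+\.)+[a-z]{2,}(?:/\S*)?', re.IGNORECASE
)

def looks_like_link(text: str) -> bool:
    """True if the comment text appears to contain a link."""
    return bool(WELL_FORMED_URL.search(text) or SCHEMELESS_FQDN.search(text))
```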
Of course, I am blocking spam, but the same process could be used to auto-link text.
Ideas
Here are some things I’m thinking.
- The content is native-language prose so I can be trigger-happy in detection
- Should I strip out all whitespace first, to catch ‘www .example.com‘? Would common users know to remove the space themselves, or do any browsers ‘do-what-I-mean’ and strip it for you?
- Maybe multiple passes are a better strategy, with scans for:
- Well-formed URLs
- All non-whitespace followed by ‘.’ followed by any valid TLD
- Anything else?
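A sketch of that multi-pass idea, assuming Python and a deliberately tiny TLD list (a real filter would load the full IANA list):

```python
import re

# Normalization for a second pass: collapse whitespace around dots so
# that 'www . example . com' becomes 'www.example.com' before scanning.
SPACED_DOT = re.compile(r'\s*\.\s*')

# Abbreviated TLD alternation, for illustration only.
TLDS = r'(?:com|net|org|edu|gov|info|biz|co|uk|us|de)'
SUSPECT = re.compile(r'\S+\.' + TLDS + r'(?:/|\b)', re.IGNORECASE)

def multi_pass_scan(text: str) -> bool:
    # Pass 1: the text as written.
    if SUSPECT.search(text):
        return True
    # Pass 2: the same scan with spaced-out dots collapsed.
    return bool(SUSPECT.search(SPACED_DOT.sub('.', text)))
```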
Related Questions
I’ve read these and they are now documented here, so you can just reference the regexes in those questions if you want.
- replace URL with HTML Links javascript
- What is the best regular expression to check if a string is a valid URL
- Getting parts of a URL (Regex)
Update and Summary
Wow, there are some very good heuristics listed here! For me, the best bang-for-the-buck is a synthesis of the following:
- @Jon Bright’s technique of detecting TLDs (a good defensive chokepoint)
- For those suspicious strings, replace the dot with a dot-looking character as per @capar
- A good dot-looking character is @Sharkey’s subscripted · (i.e. ‘·‘). · is also a word boundary so it’s harder to casually copy & paste.
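A tiny sketch of that defanging step (the look-alike character here is U+00B7 MIDDLE DOT as a stand-in; substitute whichever dot-looking character you prefer):

```python
# Replace real dots in a flagged string with a look-alike so the text
# still reads naturally but no longer pastes cleanly into an address bar.
LOOKALIKE_DOT = '\u00b7'  # MIDDLE DOT, used here as a stand-in

def defang(suspect: str) -> str:
    return suspect.replace('.', LOOKALIKE_DOT)

# defang('www.example.com') -> 'www·example·com'
```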
That should make a spammer’s CPM low enough for my needs; the ‘flag as inappropriate’ user feedback should catch anything else. Other solutions listed are also very useful:
- Strip out all dotted-quads (@Sharkey’s comment to his own answer)
- @Sporkmonger’s requirement for client-side Javascript which inserts a required hidden field into the form.
- Pinging the URL server-side to establish whether it is a web site. (Perhaps I could run the HTML through SpamAssassin or another Bayesian filter, as per @Nathan.)
- Looking at Chrome’s source for its smart address bar to see what clever tricks Google uses
- Calling out to OWASP AntiSAMY or other web services for spam/malware detection.
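For the dotted-quad item, one possible (intentionally loose) check, again as an illustrative Python sketch:

```python
import re

# Loose IPv4 dotted-quad matcher; 999.1.2.3 would also be flagged,
# which is usually fine when the goal is blocking rather than parsing.
DOTTED_QUAD = re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b')

def contains_dotted_quad(text: str) -> bool:
    return bool(DOTTED_QUAD.search(text))
```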
Answer
I’m concentrating my answer on trying to avoid spammers. This leads to two sub-assumptions: the people using the system will be actively trying to contravene your check, and your goal is only to detect the presence of a URL, not to extract the complete URL. This solution would look different if your goal were something else.
I think your best bet is going to be with the TLD. There are the two-letter ccTLDs and the (currently) comparatively small list of others. These need to be prefixed by a dot and suffixed by either a slash or some word boundary. As others have noted, this isn’t going to be perfect. There’s no way to catch ‘buyfunkypharmaceuticals . it’ without disallowing the legitimate ‘I tried again. it doesn’t work’ or similar. All of that said, this would be my suggestion: match any run of non-whitespace, followed by a dot, followed by a known TLD, followed by either a slash or a word boundary.
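One way to write that check, as a Python sketch with an abbreviated TLD alternation (a production version would use the full TLD list):

```python
import re

# Non-whitespace, then a dot, then a recognized TLD, then either a
# slash or a word boundary. The TLD list is truncated for illustration.
TLD_PATTERN = re.compile(
    r'\S+\.(?:com|net|org|edu|gov|mil|biz|info|name|[a-z]{2})(?:/|\b)',
    re.IGNORECASE,
)

def contains_probable_url(text: str) -> bool:
    return bool(TLD_PATTERN.search(text))
```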
Things this will get: anything with a recognizable TLD that pastes straight into the address bar, with or without the http:// prefix.
It will of course break as soon as people start obfuscating their URLs, replacing ‘.’ with ‘ dot ‘. But, again assuming spammers are your concern here, if they start doing that sort of thing, their click-through rates are going to drop another couple of orders of magnitude toward zero. The set of people informed enough to deobfuscate a URL and the set of people uninformed enough to visit spam sites have, I think, a minuscule intersection. This solution should let you detect all URLs that are copy-and-pasteable to the address bar, whilst keeping collateral damage to a bare minimum.