The Archive Base

Editorial Team
Asked: May 11, 2026


How can I detect (with regular expressions or heuristics) a web site link in a string of text such as a comment?

The purpose is to prevent spam. HTML is stripped, so I need to detect invitations to copy and paste. Posting links should not be economical for a spammer, because most users could not successfully get to the page. I would like suggestions, references, or discussion of best practices.

Some objectives:

  • The low-hanging fruit like well-formed URLs (http://some-fqdn/some/valid/path.ext)
  • URLs but without the http:// prefix (i.e. a valid FQDN + valid HTTP path)
  • Any other funny business

Of course, I am blocking spam, but the same process could be used to auto-link text.

Ideas

Here are some things I’m thinking.

  • The content is native-language prose so I can be trigger-happy in detection
  • Should I strip out all whitespace first, to catch ‘www .example.com’? Would common users know to remove the space themselves, or do any browsers ‘do-what-I-mean’ and strip it for you?
  • Maybe multiple passes are a better strategy, with scans for:
    • Well-formed URLs
    • All non-whitespace followed by ‘.’ followed by any valid TLD
    • Anything else?
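
The multi-pass idea above could be sketched roughly as follows in Python. This is only an illustration under assumptions: the short `TLDS` tuple, the pass names, and the `find_link_passes` helper are hypothetical, and the third pass removes whitespace only around dots rather than all whitespace. Because the checks are deliberately trigger-happy, ordinary prose such as "It works. Net result" would also be flagged by the third pass.

```python
import re

# Hypothetical, deliberately short TLD list; a real deployment would load a fuller one.
TLDS = ("com", "net", "org", "info", "biz")

# Pass 1: well-formed URLs with an explicit scheme.
WELL_FORMED = re.compile(r"https?://\S+", re.IGNORECASE)

# Pass 2: a run of non-whitespace, a dot, then a listed TLD at a word boundary.
BARE_DOMAIN = re.compile(r"\S+\.(?:%s)\b" % "|".join(TLDS), re.IGNORECASE)


def find_link_passes(text):
    """Return the names of the passes that flag `text`, in the order they ran."""
    hits = []
    if WELL_FORMED.search(text):
        hits.append("well-formed")
    if BARE_DOMAIN.search(text):
        hits.append("bare-domain")
    # Pass 3: retry with whitespace around dots removed, to catch 'www .example.com'.
    squeezed = re.sub(r"\s*\.\s*", ".", text)
    if squeezed != text and BARE_DOMAIN.search(squeezed):
        hits.append("spaced-dot")
    return hits
```

A comment would then be held for review whenever the returned list is non-empty.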

Related Questions

I’ve read these and they are now documented here, so you can just reference the regexes in those questions if you want.

  • replace URL with HTML Links javascript
  • What is the best regular expression to check if a string is a valid URL
  • Getting parts of a URL (Regex)

Update and Summary

Wow, there are some very good heuristics listed in here! For me, the best bang-for-the-buck is a synthesis of the following:

  1. @Jon Bright’s technique of detecting TLDs (a good defensive chokepoint)
  2. For those suspicious strings, replace the dot with a dot-looking character as per @capar
  3. A good dot-looking character is @Sharkey’s subscripted middle dot (i.e. ‘·’ rendered as a subscript). The middle dot is also a word boundary, so the result is harder to casually copy and paste.
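
As a rough sketch of how these three steps might fit together in Python: the abbreviated `TLDS` tuple and the `defang` helper below are assumptions made for illustration, not code from any of the answers. Only a dot that sits between a word character and a listed TLD is replaced, and wrapping the middle dot in subscript markup is left to whatever renders the comment.

```python
import re

# Hypothetical, abbreviated TLD list used only for this illustration.
TLDS = ("com", "net", "org", "info", "biz")
MIDDLE_DOT = "\u00b7"  # the '·' character

# A dot preceded by a word character and followed by a listed TLD at a word boundary.
SUSPICIOUS_DOT = re.compile(r"(?<=\w)\.(?=(?:%s)\b)" % "|".join(TLDS), re.IGNORECASE)


def defang(text):
    """Replace each suspicious dot with a middle dot so the text no longer pastes as a URL."""
    return SUSPICIOUS_DOT.sub(MIDDLE_DOT, text)
```

Ordinary sentence-ending dots are left alone because they are followed by whitespace or the end of the text rather than a TLD.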

That should make a spammer’s CPM low enough for my needs; the ‘flag as inappropriate’ user feedback should catch anything else. Other solutions listed are also very useful:

  • Strip out all dotted-quads (@Sharkey’s comment to his own answer)
  • @Sporkmonger’s requirement for client-side Javascript which inserts a required hidden field into the form.
  • Pinging the URL server-side to establish whether it is a web site. (Perhaps I could run the HTML through SpamAssassin or another Bayesian filter as per @Nathan.)
  • Looking at Chrome’s source for its smart address bar to see what clever tricks Google uses
  • Calling out to OWASP AntiSamy or other web services for spam/malware detection.
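
The dotted-quad idea from the list above might look something like this in Python. It is a minimal sketch: the `strip_dotted_quads` helper and the `[removed]` placeholder are hypothetical names chosen for the example, and only four-part numbers whose parts are all at most 255 are treated as addresses.

```python
import re

# Four groups of one to three digits separated by dots, e.g. 192.168.0.1.
DOTTED_QUAD = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def strip_dotted_quads(text, replacement="[removed]"):
    """Replace anything that looks like an IPv4 address with `replacement`."""
    def _replace(match):
        octets = match.group(0).split(".")
        # Leave the text alone when a part is too large to be an IPv4 octet.
        if all(int(octet) <= 255 for octet in octets):
            return replacement
        return match.group(0)

    return DOTTED_QUAD.sub(_replace, text)
```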
1 Answer

  1. Added an answer on May 11, 2026 at 3:27 pm

    I’m concentrating my answer on trying to avoid spammers. This leads to two assumptions: the people using the system will be actively trying to circumvent your check, and your goal is only to detect the presence of a URL, not to extract the complete URL. The solution would look different if your goal were something else.

    I think your best bet is going to be with the TLD. There are the two-letter ccTLDs and the (currently) comparatively small list of others. These need to be prefixed by a dot and suffixed by either a slash or some word boundary. As others have noted, this isn’t going to be perfect. There’s no way to get ‘buyfunkypharmaceuticals . it’ without disallowing the legitimate ‘I tried again. it doesn’t work’ or similar. All of that said, this would be my suggestion:

    \w\.([a-zA-Z]{2}|aero|asia|biz|cat|com|coop|edu|gov|info|int|jobs|mil|mobi|museum|name|net|org|pro|tel|travel)(/|\b)

    Things this will get:

    • buyfunkypharmaceuticals.it
    • google.com
    • http://stackoverflow.com/questions/700163/ (the match is ‘w.com/’)
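
    To try the suggestion out, here is a sketch in Python. It writes the word boundary outside any character class, because Python’s `re` module reads `\b` inside square brackets as a backspace character, and it assumes a word character before the dot. The `contains_link` helper name is hypothetical.

```python
import re

# A word character, a dot, a two-letter ccTLD or one of the listed TLDs,
# then either a slash or a word boundary.
TLD_PATTERN = re.compile(
    r"\w\.(?:[a-zA-Z]{2}|aero|asia|biz|cat|com|coop|edu|gov|info|int|jobs|mil"
    r"|mobi|museum|name|net|org|pro|tel|travel)(?:/|\b)"
)


def contains_link(text):
    """Return True when `text` contains something that looks like a domain name."""
    return TLD_PATTERN.search(text) is not None
```

    With this version the three examples above are flagged, while ‘I tried again. it doesn’t work’ and ‘buyfunkypharmaceuticals . it’ are not, because in both of them the dot is next to a space.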

    It will of course break as soon as people start obfuscating their URLs, replacing ‘.’ with ‘ dot ’. But, again assuming spammers are your goal here, if they start doing that sort of thing, their click-through rates are going to drop another couple of orders of magnitude toward zero. The set of people informed enough to deobfuscate a URL and the set of people uninformed enough to visit spam sites have, I think, a minuscule intersection. This solution should let you detect all URLs that are copy-and-pasteable to the address bar, whilst keeping collateral damage to a bare minimum.
