The Archive Base

Editorial Team
Asked: May 15, 2026

i need to parse all urls from a paragraph(string) eg. check out this site


I need to parse all URLs from a paragraph (string), e.g.

“check out this site google.com and don’t forget to see this too bing.com/maps”

It should return “google.com” and “bing.com/maps”.

I’m currently using this, and it’s not perfect:

reMatch("(^|\s)[^\s@]+\.[^\s@\?\/]{2,5}((\?|\/)\S*)?",mystring)

thanks

1 Answer

  1. Editorial Team
    Added an answer on May 15, 2026 at 7:42 pm

    You need to define more clearly what you consider a URL.

    For example, I might use something such as this:

    (?:https?:)?(?://)?(?:[\w-]+\.)+[a-z]{2,6}(?::\d+)?(?:/[\w.,-]+)*(?:\?\S+)?
    

    (Use it with reMatchNoCase, or put (?i) at the front, to ignore case.)

    This allows only alphanumerics, underscores, and hyphens in the domain parts, allows those plus dots and commas in the path parts, requires the TLD to be letters only, and only looks for numeric ports.
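
As a quick check, here is a minimal sketch that runs the expression above over the sentence from the question. It uses Python's `re` module purely for illustration (an assumption on my part; the answer itself targets ColdFusion), with `re.IGNORECASE` standing in for reMatchNoCase. The `find_urls` helper name is hypothetical.

```python
import re

# The expression from the answer, unchanged.
URL_PATTERN = re.compile(
    r"(?:https?:)?(?://)?(?:[\w-]+\.)+[a-z]{2,6}(?::\d+)?(?:/[\w.,-]+)*(?:\?\S+)?",
    re.IGNORECASE,  # stands in for reMatchNoCase or a leading (?i)
)


def find_urls(text):
    """Return every substring of text that the pattern treats as a URL."""
    # The pattern has no capturing groups, so findall returns whole matches.
    return URL_PATTERN.findall(text)


sample = "check out this site google.com and don't forget to see this too bing.com/maps"
print(find_urls(sample))  # ['google.com', 'bing.com/maps']
```

Note that this bare pattern would also pick up `gmail.com` from an address such as `abcd@gmail.com`, since nothing constrains what comes before the match.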

    This might be good enough, or you may need something that accepts more characters, or perhaps you want to trim things like quotes and brackets off the end of the URL. It depends on the context of what you’re doing, and on whether you’d rather err towards missing URLs or towards detecting non-URLs.
    (I’d probably go for the latter, then potentially run a secondary filter to verify whether something is a URL, but that takes more work and may not be necessary for what you’re doing.)
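
One way the trimming idea could look is sketched below, again in Python for illustration only. It matches loosely with the same expression and then strips trailing punctuation from each candidate. The `TRAILING_JUNK` set and the `find_urls_trimmed` name are assumptions made for this example, not part of the original answer; adjust the character set to your own input.

```python
import re

URL_PATTERN = re.compile(
    r"(?:https?:)?(?://)?(?:[\w-]+\.)+[a-z]{2,6}(?::\d+)?(?:/[\w.,-]+)*(?:\?\S+)?",
    re.IGNORECASE,
)

# Assumed set of characters that are unlikely to end a real URL.
TRAILING_JUNK = ".,;:!?\"')]}>"


def find_urls_trimmed(text):
    """Match loosely, then strip trailing punctuation from each candidate."""
    return [m.rstrip(TRAILING_JUNK) for m in URL_PATTERN.findall(text)]


text = "Maps are at bing.com/maps, and search is at google.com/search?q=cf)."
print(find_urls_trimmed(text))  # ['bing.com/maps', 'google.com/search?q=cf']
```

Without the trimming step, the path part would keep the comma after `maps` and the query part would keep the closing `).`, because both character sets in the expression accept them.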

    Anyhow, the explanation of the above expression is below, hopefully with clear comments to help it make sense. 🙂
    (Note that all groups are non-capturing (?:…) since we don’t need the individual parts.)

    # PROTOCOL
     (?:https?:)?    # optional group of "http:" or "https:"
    
    # SERVER NAME / DOMAIN
     (?://)?         # optional double forward slash
     (?:[\w-]+\.)+   # one or more "word characters" or hyphens, followed by a literal .
                     # grouped together and repeated one or more times
     [a-z]{2,6}      # as many as 6 alphas, but at least 2
    
    # PORT NUMBER
     (?::\d+)?       # an optional group made up of : and one or more digits
    
    # PATH INFO
     (?:/[\w.,-]+)*  # a forward slash then multiple alphanumeric, underscores, or hyphens
                     # or dots or commas (add any other characters as required)
                     # in a group that might occur multiple times (or not at all)
    
    # QUERY STRING
     (?:\?\S+)?      # an optional group containing ? then any non-whitespace
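
Many regex engines can take a commented layout like this directly. As a sketch, assuming Python's `re` module and its `re.VERBOSE` flag (java.util.regex has a comparable `COMMENTS` flag), the annotated form can be compiled as written. The sample URL is made up for illustration.

```python
import re

URL_PATTERN = re.compile(
    r"""
    (?:https?:)?     # PROTOCOL: optional "http:" or "https:"
    (?://)?          # optional double forward slash
    (?:[\w-]+\.)+    # SERVER NAME: one or more labels, each ending in a literal dot
    [a-z]{2,6}       # TLD: 2 to 6 letters
    (?::\d+)?        # PORT: optional colon followed by digits
    (?:/[\w.,-]+)*   # PATH: zero or more slash-prefixed segments
    (?:\?\S+)?       # QUERY STRING: optional ? then any non-whitespace
    """,
    re.VERBOSE | re.IGNORECASE,  # whitespace and comments in the pattern are ignored
)

# Hypothetical input exercising every part of the expression.
match = URL_PATTERN.search("docs at https://example.com:8080/a/b.html?x=1 today")
print(match.group(0) if match else None)  # https://example.com:8080/a/b.html?x=1
```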
    


    Update:
    To prevent the end of email addresses being matched, we need to use a lookbehind, to ensure that prior to the URL we don’t have an @ sign (or anything else unwanted) but without actually including that prior character in the match.

    CF’s regex engine is Apache ORO, which doesn’t support lookbehinds, but java.util.regex does, and it is easy to use from CF via a component I have created.

    Using that is as simple as:

    <cfset jrex = createObject('component','jre-utils').init('CASE_INSENSITIVE') />
    ...
    <cfset Urls = jrex.match( regex , input ) />
    

    After the createObject call, it should work much like the built-in re* functions, with a slight syntax difference and a different regex engine under the hood.

    (If you have any problems or questions with the component, let me know.)

    So, on to your problem of excluding emails from URL matching:

    We can either do a (?<=positive) or (?<!negative) lookbehind, depending on if we want to say “we must have this” or “we must not have this”, like so:

    (?<=\s) # there must be whitespace before the current position
    (?<!@)  # there must NOT be an @ before current position
    

    For this URL case, I would expand either of those to:

    (?<=\s|^)   # look for whitespace OR start of string
    

    or

    (?<![@\w/]) # ensure there is not a @ or / or word character.
    

    Both will work (and can be expanded with more chars), but in different ways, so it simply depends which method you want to do it with.

    Put whichever one you like at the start of your expression, and it should no longer match the end of abcd@gmail.com, unless I’ve screwed something up. 🙂
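
To see how the two guards differ in practice, here is a small comparison sketched in Python (an illustration only, not the CF component). One caveat: Python's `re` module rejects `(?<=\s|^)` because its alternatives have different widths, so the sketch uses `(?<!\S)` as an equivalent ("the previous character, if any, is whitespace"). The variable names and sample text are made up for the example.

```python
import re

BASE = r"(?:https?:)?(?://)?(?:[\w-]+\.)+[a-z]{2,6}(?::\d+)?(?:/[\w.,-]+)*(?:\?\S+)?"

# No guard: anything that looks like a domain matches, including email tails.
no_guard = re.compile(BASE, re.IGNORECASE)
# Stand-in for (?<=\s|^): must be at the start or preceded by whitespace.
after_space = re.compile(r"(?<!\S)" + BASE, re.IGNORECASE)
# As in the answer: must not be preceded by @, / or a word character.
not_after = re.compile(r"(?<![@\w/])" + BASE, re.IGNORECASE)

text = "mail abcd@gmail.com or visit (google.com) and bing.com/maps"

print(no_guard.findall(text))     # ['gmail.com', 'google.com', 'bing.com/maps']
print(after_space.findall(text))  # ['bing.com/maps']
print(not_after.findall(text))    # ['google.com', 'bing.com/maps']
```

Both guards drop the email tail, but the whitespace guard also rejects the URL that follows an opening bracket, which is the kind of trade-off to weigh when choosing between them.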


    Update 2:

    Here is some sample code which will exclude any email addresses from the match:

    <cfset jrex = createObject('component','jre-utils').init('CASE_INSENSITIVE') />
    
    <cfsavecontent variable="SampleInput">
    check out this site google.com and don't forget to see this too bing.com/maps
    this is an email@somewhere.com which should not be matched
    </cfsavecontent>
    
    <cfset FindUrlRegex = '(?<=\s|^)(?:https?:)?(?://)?(?:[\w-]+\.)+[a-z]{2,6}(?::\d+)?(?:/[\w.,-]+)*(?:\?\S+)?' />
    
    <cfset MatchedUrls = jrex.match( FindUrlRegex , SampleInput ) />
    
    <cfdump var="#MatchedUrls#" />
    

    Make sure you have downloaded jre-utils.cfc from here and put it in an appropriate place (e.g. the same directory as the script running this code).

    This step is required because the (?<=…) construct does not work in CF regular expressions.

© 2021 The Archive Base. All Rights Reserved