I need to parse all URLs from a paragraph (string).
E.g.
“check out this site google.com and don’t forget to see this too bing.com/maps”
It should return “google.com” and “bing.com/maps”.
I’m currently using this, and it doesn’t work perfectly:
reMatch("(^|\s)[^\s@]+\.[^\s@\?\/]{2,5}((\?|\/)\S*)?",mystring)
Thanks.
You need to define more clearly what you consider a URL.
For example, I might use something such as this:
(Use with reMatchNoCase, or plonk (?i) at the front to ignore case.)
This specifically allows only alphanumerics, underscores, and hyphens in the domain and path parts, requires the TLD to be letters only, and only looks for numeric ports.
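The expression itself is not reproduced in this copy of the thread, so here is a minimal sketch of a pattern that follows the description above, written against java.util.regex. The class name UrlSketch, the method findUrls, and the optional http/https prefix are my own assumptions for illustration, not the original expression.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: a URL matcher built from the description in the answer.
final class UrlSketch {
    static final Pattern URL = Pattern.compile(
        "(?i)\\b"                 // ignore case; start at a word boundary
        + "(?:https?://)?"        // optional scheme (an assumption, not in the description)
        + "(?:[a-z0-9_-]+\\.)+"   // domain labels: alphanumerics, underscore, hyphen, then a dot
        + "[a-z]+"                // TLD: letters only
        + "(?::\\d+)?"            // optional numeric port
        + "(?:/[a-z0-9_-]+)*"     // optional path parts, same character set as the domain
        + "/?"                    // optional trailing slash
    );

    // Collect every non-overlapping match, in order of appearance.
    static List<String> findUrls(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "check out this site google.com and don't forget to see this too bing.com/maps";
        System.out.println(findUrls(text)); // prints [google.com, bing.com/maps]
    }
}
```

Note that this sketch still matches the gmail.com part of abcd@gmail.com, which is the problem the update further down deals with.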
It might be that this is good enough, or you may need something that looks for more characters, or perhaps you want to trim things like quotes, brackets, etc. off the end of the URL. It depends on the context of what you’re doing as to whether you’d like to err towards missing URLs or detecting non-URLs.
(I’d probably go for the latter, then potentially run a secondary filter to verify if something is a URL, but that takes more work, and may not be necessary for what you’re doing.)
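As a rough illustration of that "over-match first, verify second" idea, the sketch below takes every whitespace-separated token as a candidate, trims quotes, brackets and trailing punctuation, and then runs a stricter check. The names UrlFilterSketch, TRIM and STRICT, the set of trimmed characters, and the two-letter minimum for the TLD are all assumptions of mine; adjust them to whatever you consider a URL.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: loose candidate detection followed by a secondary filter.
final class UrlFilterSketch {
    // Leading quotes/brackets, or trailing quotes/brackets/punctuation (assumed set).
    static final Pattern TRIM = Pattern.compile("^[\"'(\\[<{]+|[\"')\\]>}.,;:!?]+$");

    // Secondary filter: the whole token must look like host[:port][/path].
    static final Pattern STRICT = Pattern.compile(
        "(?i)(?:https?://)?(?:[a-z0-9_-]+\\.)+[a-z]{2,}(?::\\d+)?(?:/\\S*)?");

    static List<String> findUrls(String text) {
        List<String> found = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            // Step 1: trim wrapping characters off the candidate.
            String candidate = TRIM.matcher(token).replaceAll("");
            // Step 2: keep it only if the entire candidate passes the strict check.
            if (STRICT.matcher(candidate).matches()) {
                found.add(candidate);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "See (google.com), \"bing.com/maps\" or mail me at abcd@gmail.com.";
        System.out.println(findUrls(text)); // prints [google.com, bing.com/maps]
    }
}
```

Because the strict check has to match the whole token, abcd@gmail.com is rejected without needing a lookbehind; the trade-off is that URLs glued to other text without whitespace are missed.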
Anyhow, the explanation of the above expression is below, hopefully with clear comments to help it make sense. 🙂
(Note that all groups are non-capturing (?:…) since we don’t need the individual parts.)
Update:
To prevent the end of email addresses being matched, we need to use a lookbehind, to ensure that prior to the URL we don’t have an @ sign (or anything else unwanted) but without actually including that prior character in the match.
CF’s regex is Apache ORO which doesn’t support lookbehinds, but we can use java.util.regex nice and easily via a component I have created, which does support lookbehinds.
Using that is as simple as:
After the createObject, it should basically be like using the built-in re~ stuff, but with the slight syntax difference, and the different regex engine under the hood.
(If you have any problems or questions with the component, let me know.)
So, on to your excluding emails from URL matching problem:
We can either do a (?<=positive) or (?<!negative) lookbehind, depending on if we want to say “we must have this” or “we must not have this”, like so:
For this URL example, I would expand either of those examples to:
or
Both will work (and can be expanded with more chars), but in different ways, so it simply depends which method you want to do it with.
Put whichever one you like at the start of your expression, and it should no longer match the end of abcd@gmail.com, unless I’ve screwed something up. 🙂
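The lookbehind expressions themselves are missing from this copy of the thread, so the following is only a sketch of the two approaches in plain java.util.regex (not through the component mentioned above). The class name LookbehindSketch, the CORE pattern, and the exact characters inside each lookbehind are assumptions of mine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: excluding the tail of email addresses with lookbehinds.
final class LookbehindSketch {
    // Assumed URL body: domain labels, letters-only TLD, optional port and path.
    static final String CORE =
        "(?:[a-z0-9_-]+\\.)+[a-z]+(?::\\d+)?(?:/[a-z0-9_-]+)*";

    // "We must have this": start of input or whitespace directly before the URL.
    static final Pattern POSITIVE = Pattern.compile("(?i)(?<=^|\\s)" + CORE);

    // "We must not have this": no @ before the URL. Domain characters are excluded
    // too, otherwise the engine would skip one character and match "mail.com".
    static final Pattern NEGATIVE = Pattern.compile("(?i)(?<![@\\w.-])" + CORE);

    static List<String> findAll(Pattern pattern, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "mail abcd@gmail.com or visit google.com and bing.com/maps";
        System.out.println(findAll(POSITIVE, text)); // prints [google.com, bing.com/maps]
        System.out.println(findAll(NEGATIVE, text)); // prints [google.com, bing.com/maps]
    }
}
```

The two differ on input such as (google.com): the negative version still finds google.com, while the positive version finds nothing because the URL is preceded by a bracket rather than whitespace. Neither handles an email whose local part contains a dot (first.last@example.com), which would need a further check.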
Update 2:
Here is some sample code which will exclude any email addresses from the match:
Make sure you have downloaded the jre-utils.cfc from here and put it in an appropriate place (e.g. the same directory as the script running this code).
This step is required because the (?<=…) construct does not work in CF regular expressions.