This is JavaScript regex.
regex = /(http:\/\/[^\s]*)/g;
text = "I have http://hibernate.sourceforge.net/hibernate-mapping-3.0.dtd and I like http://google.com a lot";
matches = text.match(regex);
console.log(matches);
I get both URLs in the result. However, I want to eliminate all URLs ending with .dtd. How do I do that?
Note that only URLs ending with .dtd should be removed, so a URL like http://a.dtd.google.com should still pass.
The nicest way to do it is to use a negative lookbehind (in languages that support them):
(?>http:\/\/[^\s]*)(?<!\.dtd)
The ?> in the first bracket makes it an atomic group, which stops the regex engine from backtracking, so it’ll match the full URL as it does now, and if/when the next part fails it won’t try going back and matching less.
The (?<!\.dtd) is a negative lookbehind, which only matches if \.dtd doesn’t match ending at that position (i.e., the URL doesn’t end in .dtd).
For languages that don’t (such as JavaScript before ES2018 added lookbehind), you can do a negative lookahead instead, which is a bit uglier and is generally less efficient.
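A minimal sketch of the lookahead approach in JavaScript, assuming a URL ends at the next whitespace character or at the end of the string. The helper name findNonDtdUrls and the exact pattern are illustrative assumptions, not necessarily the pattern originally posted:

```javascript
// Hypothetical helper: returns every http:// URL in `text` that does not end in ".dtd".
// Assumption: a URL runs up to the next whitespace character or the end of the string.
function findNonDtdUrls(text) {
  // After "http://", the negative lookahead scans the rest of the URL and
  // rejects it if ".dtd" is immediately followed by whitespace or end of input.
  const regex = /http:\/\/(?![^\s]*\.dtd(?!\S))[^\s]*/g;
  // With the g flag, match() returns all full matches, or null when there are none.
  return text.match(regex) || [];
}

const sample = "I have http://hibernate.sourceforge.net/hibernate-mapping-3.0.dtd and I like http://google.com a lot";
console.log(findNonDtdUrls(sample));                        // ["http://google.com"]
console.log(findNonDtdUrls("see http://a.dtd.google.com")); // ["http://a.dtd.google.com"]
```

The inner (?!\S) is what limits the rejection to URLs that end in .dtd, so a .dtd in the middle of a host name does not count.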
The lookahead version will match http://, then scan ahead to make sure the URL doesn’t end in .dtd, then backtrack and scan forward again to get the actual match.
As always, http://www.regular-expressions.info/ is a good reference for more information.
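In JavaScript engines that implement ES2018 lookbehind assertions, the lookbehind idea can be used more directly. JavaScript has no atomic groups, so this sketch swaps in an extra (?!\S) check to keep the engine from backtracking into a shortened URL. That substitution and the function name findNonDtdUrlsLookbehind are assumptions made for illustration:

```javascript
// Hypothetical helper using a negative lookbehind (requires ES2018 lookbehind support).
// Assumption: a URL runs up to the next whitespace character or the end of the string.
function findNonDtdUrlsLookbehind(text) {
  // \S* grabs the URL, (?<!\.dtd) rejects a match that ends in ".dtd",
  // and (?!\S) forces the match to end at whitespace or end of input,
  // standing in for the atomic group that JavaScript lacks.
  const regex = /http:\/\/\S*(?<!\.dtd)(?!\S)/g;
  return text.match(regex) || [];
}

const input = "I have http://hibernate.sourceforge.net/hibernate-mapping-3.0.dtd and I like http://google.com a lot";
console.log(findNonDtdUrlsLookbehind(input)); // ["http://google.com"]
```

Without the trailing (?!\S), the engine would give back one character and happily match the .dtd URL minus its final letter.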