I am trying to write a RegEx rule to find all a href HTML links on my webpage and add `rel="nofollow"` to them.
However, I have a list of URLs that must be excluded. For example, any internal link (e.g. pokerdiy.com) should be excluded by wildcard, so that any link that has my domain name in it is left untouched. I also want to be able to specify exact URLs in the exclude list, for example http://www.example.com/link.aspx.
Here is what I have so far which is not working:
(]+)(href="http://.*?(?!(pokerdiy))[^>]+>)
If you need more background/info you can see the full thread and requirements here (skip the top part to get to the meat):
http://www.snapsis.com/Support/tabid/601/aff/9/aft/13117/afv/topic/afpgj/1/Default.aspx#14737
A regex constructed as explained below would match the first part of any link that starts with `http://` or `https://` and doesn't contain `pokerdiy.com` or `www.example.com/link.aspx` anywhere in the `href` attribute. Replace that match by the captured text with `rel="nofollow"` appended.
If a `rel="nofollow"` is already present, you'll end up with two of these. And of course, relative links or other protocols like `ftp://` etc. won't be matched at all.
Explanation:
`(?!\b(foo|bar)\b)[^"]` matches any non-`"` character unless it is possible to match `foo` or `bar` at the current location. The `\b`s are there to make sure we don't accidentally trigger on `rebar` or `foonly`.
This whole construct is repeated (`(?: ... )+`), and whatever is matched is preserved in backreference `\2`.
Since the next token to be matched is a `"`, the entire regex fails if the attribute contains `foo` or `bar` anywhere.
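As a minimal sketch of the technique explained above, here is one way such a pattern could be assembled in Python from an exclude list. The names `EXCLUDED`, `build_pattern` and `add_nofollow`, and the exact shape of the first group, are assumptions made for illustration; this is not necessarily the pattern from the original answer.

```python
import re

# Hypothetical exclude list: a domain (covers every internal link)
# and one exact URL, both taken from the question.
EXCLUDED = ["pokerdiy.com", "www.example.com/link.aspx"]


def build_pattern(excluded):
    """Compile a regex matching <a ... href="http(s)://..." for non-excluded URLs."""
    # Escape each entry so that dots are matched literally.
    alternation = "|".join(re.escape(item) for item in excluded)
    return re.compile(
        # Group 1: start of the tag up to and including the scheme.
        r'(<a\s[^>]*?href="https?://)'
        # Group 2: any non-quote character, as long as no excluded entry
        # starts at the current position (the negative lookahead).
        r'((?:(?!\b(?:' + alternation + r')\b)[^"])+)'
        # The closing quote; the match fails if an excluded entry was hit first.
        r'"',
        re.IGNORECASE,
    )


def add_nofollow(html, excluded=EXCLUDED):
    """Append rel="nofollow" after the href of every matching link."""
    return build_pattern(excluded).sub(r'\1\2" rel="nofollow"', html)
```

Calling `add_nofollow('<a href="http://other.com/page">x</a>')` adds the attribute, while a link to `http://www.pokerdiy.com/forum` is returned unchanged. As noted above, a link that already carries `rel="nofollow"` would end up with it twice, and relative links are not touched.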