I’m looking to modify the regex pattern below so that it matches the value stored between the quotation marks of the href attribute of a link tag.
My conditions are:
- It can be any URL starting with http
- It must not match anything containing $$
My Current Regular Expression:
var pattern = @"(?<name>href)=""(?<value>http[^""]*)""";
Any help would be appreciated.
Try the following expression:
EDIT: A more detailed explanation of the above pattern:
(?i) – This is an inline regular expression option. It makes the expression case-insensitive, so that “http” will also match “HTTP”.
(?>…) – This is an atomic grouping construct. It basically says that whatever is matched by the group cannot be unmatched. Regex will try many different paths to see if it can get a match. For example, the construct I’ve used to eliminate matches containing “$$” would be circumvented without this grouping construct.
(?<name>…) – A named group.
[^"] – Matches any character that is not a quotation mark.
(…|…) – An alternation construct. The regex will attempt to find a match using the pattern before the pipe (“|”). If a match cannot be made, it will try again with the pattern following the pipe.
*? – This is a non-greedy (lazy) match. With a regular “*”, the regex will attempt to match as much as possible. “*?” will attempt to match as little as possible. It can be marginally more efficient and is helpful when trying to match text between a given pair of symbols.
(?(InvalidUrlChars)…|…) – An if/else grouping construct. Using this particular syntax, the expression preceding the pipe will be matched if the named group (“(InvalidUrlChars)”) was matched. The expression following the pipe will be matched otherwise. The “else” part is optional (I did not use it).
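(?!) – A negative lookahead assertion. I don’t have enough room to describe lookaround assertions, but suffice to say that this expression will always fail: it asserts that the empty pattern does not match at the current position, and the empty pattern always matches.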
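To make a few of the constructs above concrete, here is a small C# sketch that exercises them in isolation. The class name, method names and sample input are hypothetical and chosen only for illustration; they are not part of the expression discussed in this answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ConstructDemo
{
    // Greedy: "(.*)" runs on to the LAST quotation mark in the input.
    public static string Greedy(string input) =>
        Regex.Match(input, @"""(.*)""").Groups[1].Value;

    // Lazy: "(.*?)" stops at the FIRST closing quotation mark.
    public static string Lazy(string input) =>
        Regex.Match(input, @"""(.*?)""").Groups[1].Value;

    // Atomic: once (?>a+) has consumed every 'a' it cannot give one back,
    // so the trailing 'a' in the pattern has nothing left to match.
    public static bool AtomicMatches(string input) =>
        Regex.IsMatch(input, @"^(?>a+)a$");

    // Without the atomic group, a+ backtracks by one character and succeeds.
    public static bool PlainMatches(string input) =>
        Regex.IsMatch(input, @"^a+a$");

    // (?!) is an empty negative lookahead, so it fails at every position.
    public static bool AlwaysFails(string input) =>
        Regex.IsMatch(input, @"(?!)");

    public static void Main()
    {
        string input = "href=\"http://a.example\" title=\"b\"";
        Console.WriteLine(Greedy(input));          // http://a.example" title="b
        Console.WriteLine(Lazy(input));            // http://a.example
        Console.WriteLine(AtomicMatches("aaa"));   // False
        Console.WriteLine(PlainMatches("aaa"));    // True
        Console.WriteLine(AlwaysFails("anything")); // False
    }
}
```

The greedy and lazy calls differ only in the quantifier, which is why the lazy form is the usual choice for text between a pair of delimiters.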
So, in summary, this expression will match any URL, but if the URL contains double dollar signs (“$$”) then the InvalidUrlChars group will be marked as matched. At the end of the expression, if the InvalidUrlChars group was matched, the entire match fails, and the atomic group prevents the regex from backtracking and treating the dollar signs as ordinary characters that are not quotation marks.
See http://msdn.microsoft.com/en-us/library/az24scfc for more information
Compare the following strings:
The following will match:
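As an illustration, here is a minimal C# sketch that runs a few sample links through such a pattern. The pattern shown is a reconstruction assembled from the constructs explained above, so treat it as an assumption rather than the exact expression from this answer; the class name, method name and sample links are likewise hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HrefDemo
{
    // ASSUMPTION: reconstructed from the explanation, not the original pattern.
    // (?i)                         case-insensitive
    // (?>...)                      atomic group around the quoted value
    // (?<InvalidUrlChars>\$\$)     records that "$$" was seen
    // (?(InvalidUrlChars)(?!))     fails the match if "$$" was recorded
    public const string Pattern =
        @"(?i)(?<name>href)=""(?>(?<value>http(?:(?<InvalidUrlChars>\$\$)|[^""])*?)"")(?(InvalidUrlChars)(?!))";

    // Returns the captured URL, or an empty string when the link is rejected.
    public static string ExtractUrl(string html)
    {
        Match m = Regex.Match(html, Pattern);
        return m.Success ? m.Groups["value"].Value : "";
    }

    public static void Main()
    {
        string[] samples =
        {
            "<a href=\"http://example.com/page\">ok</a>",       // matches
            "<a HREF=\"HTTPS://example.com/a$b\">ok</a>",       // matches (single $)
            "<a href=\"http://example.com/$$token$$\">no</a>",  // rejected ($$)
            "<a href=\"/relative/path\">no</a>",                // rejected (no http)
        };

        foreach (string s in samples)
        {
            string url = ExtractUrl(s);
            Console.WriteLine(url.Length > 0 ? "match:    " + url : "no match: " + s);
        }
    }
}
```

Under these assumptions the first two samples match and the last two do not. If the atomic group is removed, the third sample would match as well, because the engine can backtrack and consume each dollar sign through the `[^""]` branch without ever setting InvalidUrlChars.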
EDIT: I heartily agree that processing HTML is best done with an HTML parser. Regex is terrible at it. But if you need a rapid fire solution and you don’t care too much about the occasional quirk, Regex is a suitable stand-in.