I’m trying to parse an HTML page and I’m using the following regular expression:
var regex = new Regex(@"<tag1 id=.id1.>.*<tag2>", RegexOptions.Singleline);
“tag1 id=.id1.” occurs in the document only once. “tag2” occurs nearly 50 times after the occurrence of “tag1”. But when I try to match the page source against my regular expression, it returns only 1 match. Moreover, when I change RegexOptions to “None” or “Multiline”, no matches are returned. I’m very confused by this and would appreciate any help.
Leaving aside the obvious exhortations about not using regex to parse HTML, I can explain to you why you’re seeing what you’re seeing.
If tag1 occurs in your text only once, then the regex can only match it once, so there can never be more than one match. Regular expression matches “consume” the text they have matched, so the next match attempt starts at the end of the last successful match.
This leads to the next problem: .* is greedy, so it matches (with RegexOptions.Singleline) until the end of the string and then backtracks until the last <tag2> it finds in order to allow a successful match. Which is another reason why you only get one match.
As for your second question: Why do the matches go away if you don’t use RegexOptions.Singleline? Simple: Without that option, the dot (.) cannot match newlines, and there appears to be at least one newline between tag1 and the first tag2.
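To make the behaviour described above concrete, here is a minimal sketch. The sample string is made up for illustration (one tag1 followed by three tag2s, separated by newlines), and the lazy variant .*? is included only as a contrast to show what greediness changes; it is not something from the original question.

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // Hypothetical sample input: one tag1, three tag2s, newlines in between.
        string html = "<tag1 id='id1'>\nfirst<tag2>\nsecond<tag2>\nthird<tag2>\n";

        // Greedy .* with Singleline: a single match that runs to the LAST <tag2>.
        var greedy = new Regex(@"<tag1 id=.id1.>.*<tag2>", RegexOptions.Singleline);
        MatchCollection greedyMatches = greedy.Matches(html);
        Console.WriteLine(greedyMatches.Count);                             // 1
        Console.WriteLine(greedyMatches[0].Value.EndsWith("third<tag2>"));  // True

        // Lazy .*? with Singleline (for contrast): still one match,
        // but it stops at the FIRST <tag2>.
        var lazy = new Regex(@"<tag1 id=.id1.>.*?<tag2>", RegexOptions.Singleline);
        Console.WriteLine(lazy.Match(html).Value.EndsWith("first<tag2>"));  // True

        // Without Singleline the dot cannot cross the newline after tag1,
        // so there is no match at all.
        var noSingleline = new Regex(@"<tag1 id=.id1.>.*<tag2>", RegexOptions.None);
        Console.WriteLine(noSingleline.IsMatch(html));                      // False
    }
}
```

In both Singleline cases the match count stays at 1, because the pattern is anchored on tag1 and tag1 appears only once.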