I’m working on a regular expression in a .NET project to get a specific tag. I would like to match the entire DIV tag and its contents:
<html>
<head><title>Test</title></head>
<body>
<p>The first paragraph.</p>
<div id='super_special'>
<p>The Store paragraph</p>
</div>
</body>
</html>
Code:
Regex re = new Regex("(<div id='super_special'>.*?</div>)", RegexOptions.Multiline);
if (re.IsMatch(test))
    Console.WriteLine("it matches");
else
    Console.WriteLine("no match");
I want to match this:
<div id='super_special'>
<p>Anything could go in here...doesn't matter. Let's get it all</p>
</div>
I thought . was supposed to match all characters, but it seems to be having trouble with the carriage returns. What is my regex missing?
Thanks.
Out of the box, without special modifiers, most regex implementations don’t let . match past the end of a line. You should probably look in the documentation of the regex engine you’re using for such a modifier. In .NET it is RegexOptions.Singleline; RegexOptions.Multiline only changes how ^ and $ behave.
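To illustrate the modifier point, here is a minimal sketch that assumes the input contains real line breaks, as in the question. The class name SinglelineDemo and the shortened sample string are made up for illustration; only Regex, RegexOptions.Multiline and RegexOptions.Singleline come from .NET itself.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    // Hypothetical sample input with real line breaks, modeled on the question's HTML.
    public const string Html =
        "<body>\n<p>The first paragraph.</p>\n<div id='super_special'>\n<p>The Store paragraph</p>\n</div>\n</body>";

    // Same pattern as in the question.
    public const string Pattern = "(<div id='super_special'>.*?</div>)";

    public static bool Matches(RegexOptions options)
    {
        return new Regex(Pattern, options).IsMatch(Html);
    }

    public static void Main()
    {
        // Multiline only changes ^ and $; '.' still refuses to match '\n'.
        Console.WriteLine(Matches(RegexOptions.Multiline));   // False
        // Singleline lets '.' match '\n' too, so the match can span lines.
        Console.WriteLine(Matches(RegexOptions.Singleline));  // True
    }
}
```

If changing the options is awkward, the inline form (?s) at the start of the pattern should have the same effect as RegexOptions.Singleline.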
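I have one other piece of advice: beware of greed! Traditionally, regexes are greedy, which means that a pattern built on .* would probably match everything from the opening <div id='super_special'> through the last </div> in the document.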
You should check for a ‘non-greedy’ modifier (such as the ? in .*?), so that your regex stops matching at the first occurrence of </div>, not at the last one.
Also, as others have said, consider using an HTML parser instead of regexes. It will save you a lot of headaches.
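The difference between greedy and non-greedy matching can be sketched roughly as follows. The class GreedyDemo, its helper FirstMatch and the two-div input are invented for this example and are not from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyDemo
{
    // Hypothetical input: two sibling <div> elements on separate lines.
    public const string Html = "<div id='a'>\none\n</div>\n<div id='b'>\ntwo\n</div>";

    // Returns the text of the first match, or "" if there is none.
    public static string FirstMatch(string pattern)
    {
        return Regex.Match(Html, pattern, RegexOptions.Singleline).Value;
    }

    public static void Main()
    {
        string greedy = FirstMatch("<div id='a'>.*</div>");   // runs to the LAST </div>
        string lazy = FirstMatch("<div id='a'>.*?</div>");    // stops at the FIRST </div>

        Console.WriteLine(greedy.Contains("id='b'"));  // True: swallowed the second div
        Console.WriteLine(lazy.Contains("id='b'"));    // False: only the first div
    }
}
```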
Edit: even a non-greedy regex won’t work as expected if <div>s are nested! That’s another reason to consider using an HTML parser.
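A small sketch of the nesting problem, using a made-up input with an inner <div> (the class NestedDemo and its sample string are assumptions for illustration only):

```csharp
using System;
using System.Text.RegularExpressions;

public static class NestedDemo
{
    // Hypothetical input: the target div contains another div.
    public const string Html =
        "<div id='super_special'>\n<div>inner</div>\n<p>tail</p>\n</div>";

    public static string LazyMatch()
    {
        return Regex.Match(Html, "<div id='super_special'>.*?</div>",
                           RegexOptions.Singleline).Value;
    }

    public static void Main()
    {
        string match = LazyMatch();
        // The lazy match ends at the INNER </div>, so the outer div is cut short.
        Console.WriteLine(match.EndsWith("inner</div>", StringComparison.Ordinal)); // True
        Console.WriteLine(match.Contains("<p>tail</p>"));                           // False
    }
}
```

The greedy variant has the opposite problem when more divs follow the target, so neither quantifier finds the matching close tag reliably. A parser tracks the nesting for you.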