What’s the appropriate Perl or Java regex to extract only the second line below? It should find the div tag containing the class="matchthis" attribute.
<div>Do not match this</div>
<div class="matchthis">MATCH THIS</div>
<div class="unimportant">Do not match this</div>
Please do not tell me to use DOM/Soup/etc. I want to know whether a raw regex can solve the simple problem above (you’ll be rewarded for the answer!). Yes, I’m aware of this post, so don’t even mention it.
As you already seem to know, using regular expressions to parse HTML is a bad idea.
In this specific case, I’m pretty sure all you really want is this:
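One pattern that fits this exact input, as a minimal Java sketch. The class name `SimpleDivMatch` and the helper `extract` are hypothetical names chosen for illustration, and the pattern assumes the div has no attribute other than class="matchthis" and no nested tags in its text:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SimpleDivMatch {
    // Matches only a div whose sole attribute is class="matchthis";
    // group 1 captures the text up to the next '<'.
    static final Pattern DIV =
            Pattern.compile("<div class=\"matchthis\">([^<]*)</div>");

    // Returns the inner text of the first matching div, or null if none matches.
    static String extract(String html) {
        Matcher m = DIV.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = String.join("\n",
                "<div>Do not match this</div>",
                "<div class=\"matchthis\">MATCH THIS</div>",
                "<div class=\"unimportant\">Do not match this</div>");
        System.out.println(extract(html)); // prints "MATCH THIS"
    }
}
```

Anything beyond that literal shape (extra whitespace, single quotes, another attribute) makes this pattern fail, which is fine for the three lines in the question but not much else.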
Now, the more flexible you want to get, the more unreadable your regular expression will become. And this is the danger of trying to use regular expressions instead of a proper parser. For instance, say you want to allow for additional attributes besides class. A kind of functional regular expression for this might look like:
Totally readable, right? (Also, almost certainly very wrong.)
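To make the "additional attributes" case concrete, here is a rough Java sketch of what such a pattern could look like. The names `FlexibleDivMatch` and `extract` are hypothetical, and the pattern is only an illustration of the idea, assuming a double-quoted class value and no nested tags inside the div:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FlexibleDivMatch {
    // <div, optional attributes before class, class="matchthis",
    // optional attributes after it, then the text up to the next '<'.
    static final Pattern DIV = Pattern.compile(
            "<div\\s+(?:[^>]*?\\s)?class\\s*=\\s*\"matchthis\"(?:\\s[^>]*)?>([^<]*)</div>");

    // Returns the inner text of the first matching div, or null if none matches.
    static String extract(String html) {
        Matcher m = DIV.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<div id=\"x\" class=\"matchthis\" title=\"t\">MATCH THIS</div>";
        System.out.println(extract(html)); // prints "MATCH THIS"
    }
}
```

It handles attributes on either side of class, but it is easy to fool: a div like `<div title='x class="matchthis" y'>no</div>` is matched even though it has no class attribute at all, because the regex cannot tell an attribute from text inside a quoted value. That is exactly the kind of mistake a real parser does not make.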