Here’s the input String:
<div class="matchthis">Orange</div>
<div class="dontmatch">One</div>
<div class="matchthis" id="hurdle">Lemon</div>
<div class="dontmatch">Two</div>
<div id="hurdle" class="matchthis">Peach</div>
I want the output below (all <div> tags containing class="matchthis"):
<div class="matchthis">Orange</div>
<div class="matchthis" id="hurdle">Lemon</div>
<div id="hurdle" class="matchthis">Peach</div>
This Java RegEx <div class=\"matchthis\">(.*?)(?=</div>) will only output the following:
<div class="matchthis">Orange</div>
Please help improve the RegEx to get the desired output.
Please do not tell me to use slower DOM/Soup/etc. I wonder if raw regex can solve the simple problem above (you’ll be awarded for the answer!). Yes I’m aware of this post so don’t even mention it.
To break it down, the attribute part of the pattern matches any number of the following:
non-quote, non-tag-closer characters,
or double-quoted attribute values,
or single-quoted attribute values.
The Pattern.DOTALL means that your .*? will allow newlines in the div body.
The Pattern.CASE_INSENSITIVE causes it to handle case folding of HTML element names properly, though if your default locale is Turkish you might get some weirdness around <DİV> (note the dotted I).
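As a rough illustration of how those pieces could fit together, here is a minimal sketch. The class name DivMatcher, the method findMatchThisDivs, and the exact pattern are assumptions made for this example, not necessarily the pattern the breakdown above refers to. It also assumes the class attribute is double-quoted, that its value is exactly matchthis, and that divs are not nested.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DivMatcher {
    // One "attribute piece": a character that is neither a quote nor '>',
    // or a double-quoted value, or a single-quoted value.
    private static final String ATTR_PIECE = "(?:[^>\"']|\"[^\"]*\"|'[^']*')";

    // Hypothetical pattern assembled from the pieces described above.
    private static final Pattern MATCH_THIS = Pattern.compile(
        "<div\\b" + ATTR_PIECE + "*?"           // attributes before class (lazy)
        + "\\bclass\\s*=\\s*\"matchthis\""      // the class attribute we want
        + ATTR_PIECE + "*>"                     // remaining attributes, end of open tag
        + "(.*?)</div>",                        // body up to the first closing tag
        Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Returns every full <div ...>...</div> whose open tag has class="matchthis".
    public static List<String> findMatchThisDivs(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = MATCH_THIS.matcher(html);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String input =
            "<div class=\"matchthis\">Orange</div>\n"
            + "<div class=\"dontmatch\">One</div>\n"
            + "<div class=\"matchthis\" id=\"hurdle\">Lemon</div>\n"
            + "<div class=\"dontmatch\">Two</div>\n"
            + "<div id=\"hurdle\" class=\"matchthis\">Peach</div>";
        // Prints the Orange, Lemon and Peach divs, one per line.
        for (String div : findMatchThisDivs(input)) {
            System.out.println(div);
        }
    }
}
```

Because an attribute piece can never consume a `>`, the search for class="matchthis" stays inside a single opening tag, which is why the dontmatch divs are skipped regardless of attribute order. The body is matched lazily, so a nested div would end the match at the inner closing tag.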