I’m seeing strange behaviour with regexp pattern matching.
The regexp is:
String regexp = "<h3.*>(.*)</h3>";
Here is the first case:
<h3 class="pubAdTitleBlock">Title</h3>
In this case everything is fine: matcher.group(1) gives me ‘Title’.
In the second case, I have a link nested inside the h3, like this:
<h3 class="pubAdTitleBlock "><a href="myLink" title="title">Title</a></h3>
This is the problem. In this case:
– matcher.find() is true,
– matcher.group(0) is the full string,
– but matcher.group(1) is an empty string
Why?
I need to extract the title both from <h3 ..>title</h3> and from <h3 ...><a ...>title</a></h3>.
The first .* is greedy, so it will capture " class="pubAdTitleBlock "><a href="myLink" title="title">Title</a", leaving only the empty string between </a> and </h3> for the capturing group. You’ll want to change it to something like [^>]* (i.e. “anything except >”), so that it cannot run past the end of the opening <h3 ...> tag.
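To make the difference concrete, here is a minimal sketch that runs the original pattern and the [^>]* variant against both inputs from the question. The class name H3TitleExtractor, the helper group1, and the TITLE_ONLY pattern are my own illustrative names, not something from your code. TITLE_ONLY is an assumption about what you ultimately want (the bare title with at most one wrapping <a ...> tag); adjust it if your markup differs.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class H3TitleExtractor {

    // Pattern from the question: the first .* is greedy and can swallow
    // everything up to the last '>' that still lets the rest match.
    public static final Pattern GREEDY = Pattern.compile("<h3.*>(.*)</h3>");

    // Suggested change: [^>]* stops at the '>' closing the opening <h3 ...> tag.
    public static final Pattern FIXED = Pattern.compile("<h3[^>]*>(.*)</h3>");

    // Hypothetical extension (an assumption, not part of the original answer):
    // optionally skip one <a ...> wrapper so group 1 is just the title text.
    public static final Pattern TITLE_ONLY =
            Pattern.compile("<h3[^>]*>(?:<a[^>]*>)?([^<]*)(?:</a>)?</h3>");

    // Returns group 1 of the first match, or null when nothing matches.
    public static String group1(Pattern pattern, String html) {
        Matcher matcher = pattern.matcher(html);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String plain = "<h3 class=\"pubAdTitleBlock\">Title</h3>";
        String nested = "<h3 class=\"pubAdTitleBlock \">"
                + "<a href=\"myLink\" title=\"title\">Title</a></h3>";

        System.out.println("[" + group1(GREEDY, plain) + "]");      // [Title]
        System.out.println("[" + group1(GREEDY, nested) + "]");     // []
        System.out.println("[" + group1(FIXED, nested) + "]");      // [<a href="myLink" title="title">Title</a>]
        System.out.println("[" + group1(TITLE_ONLY, plain) + "]");  // [Title]
        System.out.println("[" + group1(TITLE_ONLY, nested) + "]"); // [Title]
    }
}
```

Note that with only the [^>]* change, group 1 in the nested case is the whole <a ...>Title</a> element, so you would still need a second step (or something like TITLE_ONLY) to get just the title.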