I need to scrape some content from an HTTP response with Java. The required fields in the response are foo, bar and bla. My current pattern is very slow. Any ideas on how to improve it?
Response:
...
<div class="ui-a">
<div class="ui-b">
<p><strong>foo</strong></p>
<p>bar</p>
</div>
<div class="ui-c">
<p><strong>bla</strong></p>
<p>...</p>
</div>
</div>
<div class="ui-a">
<div class="ui-b">
<p><strong>foo1</strong></p>
<p>bar1</p>
</div>
<div class="ui-c">
<p><strong>bla1</strong></p>
<p>...</p>
</div>
Pattern:
.*?<div class="ui-a">.*?<strong>(.*?)</strong>.*?<p>(.*?)</p>.*?</div>.*?<div class="ui-c">.*?<strong>(.*?)</strong>.*?
Since you can’t make use of an HTML parser, try something like this:
which will print the following to the console:
Note my changes:
I used the possessive quantifiers *+ and ++: http://www.regular-expressions.info/possessive.html
Instead of .*?, I used (?:(?!...).)*+. The first, .*?, keeps track of every possible match it makes so that it can backtrack at a later stage. The latter, (?:(?!...).)*+, does not keep track of these matches.
That should make it quicker (not sure by how much…).
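For illustration, here is a minimal sketch of what such a pattern could look like for the sample response in the question. It is not the original snippet: the class name DivScraper and the helpers until and extract are made up for this example, and it assumes the markup always follows the sample (a strong tag and a p tag inside ui-b, then a strong tag inside ui-c).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DivScraper {

    // Builds "(?:(?!stop).)*+": possessively consume characters, one at a
    // time, for as long as `stop` does not start at the current position.
    private static String until(String stop) {
        return "(?:(?!" + Pattern.quote(stop) + ").)*+";
    }

    // DOTALL lets '.' cross line breaks, since each record spans several lines.
    private static final Pattern ROW = Pattern.compile(
            "<div class=\"ui-a\">"
            + until("<strong>") + "<strong>(" + until("</strong>") + ")</strong>"
            + until("<p>") + "<p>(" + until("</p>") + ")</p>"
            + until("<div class=\"ui-c\">") + "<div class=\"ui-c\">"
            + until("<strong>") + "<strong>(" + until("</strong>") + ")</strong>",
            Pattern.DOTALL);

    // Returns one {foo, bar, bla} triple per matched ui-a block.
    static List<String[]> extract(String html) {
        List<String[]> rows = new ArrayList<>();
        Matcher m = ROW.matcher(html);
        while (m.find()) {
            rows.add(new String[] { m.group(1), m.group(2), m.group(3) });
        }
        return rows;
    }

    public static void main(String[] args) {
        String html =
                "<div class=\"ui-a\">\n<div class=\"ui-b\">\n"
                + "<p><strong>foo</strong></p>\n<p>bar</p>\n</div>\n"
                + "<div class=\"ui-c\">\n<p><strong>bla</strong></p>\n<p>...</p>\n"
                + "</div>\n</div>\n";
        for (String[] row : extract(html)) {
            System.out.println(String.join(", ", row));
        }
    }
}
```

Note that the leading and trailing .*? from the original pattern are gone: Matcher.find() already scans forward to the next match, so they only add work. If a block is missing one of the expected tags, the pattern will run on into the next block, just as the original would.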