I am looking for repeating patterns inside an HTML page.
The patterns I am interested in start after the prefix “<h2>Seasons</h2>”.
The same patterns also occur before the prefix, but I am not interested in those.
I tried (and failed) with the following Python code (I simplified the pattern to ‘<a href=.+?</a>’ to keep this question readable):
import re

matches = re.compile('<h2>Seasons</h2>.+?(<a href=.+?</a>)+', re.DOTALL).findall(page)
for ref in matches:
    print(ref)
Given the page:
blah blah html stuff
<h2>Seasons</h2>
blah blah more html stuff
<a href=http://www.111.com>111</a><a href=http://www.222.com>222</a><a href=http://www.333.com>333</a>
The output is
<a href=http://www.333.com>333</a>
So it only prints the last match; the other two do not make it into the findall list.
How do I iterate over all matches of the group?
The problem is that the regex matches only a single time. The parenthesized group matches multiple times, but the regex as a whole only matches once. This means findall returns only one match, and a repeated group keeps only the text of its last repetition.
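To see this behaviour directly, here is a small sketch that runs the question's pattern against the question's sample page. The variable names (`page`, `pattern`, `whole_matches`, `captured`) are mine, chosen only for illustration.

```python
import re

# Sample page from the question, rebuilt as a single string.
page = (
    "blah blah html stuff\n"
    "<h2>Seasons</h2>\n"
    "blah blah more html stuff\n"
    "<a href=http://www.111.com>111</a>"
    "<a href=http://www.222.com>222</a>"
    "<a href=http://www.333.com>333</a>\n"
)

pattern = re.compile(r'<h2>Seasons</h2>.+?(<a href=.+?</a>)+', re.DOTALL)

# The regex as a whole matches once, spanning from the heading to the last link.
whole_matches = [m.group(0) for m in pattern.finditer(page)]

# findall returns the group's content, and a repeated group keeps only
# its final repetition.
captured = pattern.findall(page)

print(len(whole_matches))
print(captured)
```

The single overall match does contain all three links, but the group that findall reports has been overwritten on each repetition, so only the last link survives.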
To get around this you need to write a regex that matches multiple times. You might think to use a lookbehind assertion for the <h2> element, so that the regex finds <a> elements, but only if they’re preceded by <h2>Seasons</h2>. Unfortunately, in Python’s re module lookbehind strings have to be of fixed length. You can’t put .+? in a lookbehind assertion. So that approach is out.

Next up is to find the location of the <h2> element first, then perform the regex search starting from there.
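A minimal sketch of that two-step approach, assuming the heading appears literally as `<h2>Seasons</h2>` and using the simplified link pattern from the question. The names `marker`, `link_pattern`, `start` and `links` are my own, and the `000` link before the heading is a made-up addition to the sample page so you can see that earlier links are skipped.

```python
import re

# Sample page from the question, plus one hypothetical link before the
# heading (not in the original sample) to show that it is ignored.
page = (
    "blah blah html stuff <a href=http://www.000.com>000</a>\n"
    "<h2>Seasons</h2>\n"
    "blah blah more html stuff\n"
    "<a href=http://www.111.com>111</a>"
    "<a href=http://www.222.com>222</a>"
    "<a href=http://www.333.com>333</a>\n"
)

marker = '<h2>Seasons</h2>'
link_pattern = re.compile(r'<a href=.+?</a>', re.DOTALL)

# Step 1: locate the heading with a plain string search.
start = page.find(marker)

# Step 2: search for links only from the end of the heading onwards.
# A compiled pattern's findall accepts a start position as its second argument.
if start == -1:
    links = []  # heading not present, so nothing to report
else:
    links = link_pattern.findall(page, start + len(marker))

for link in links:
    print(link)
```

Because the pattern now has no repeated group and matches one link at a time, findall returns every link after the heading rather than just the last one. If you need match positions as well, `link_pattern.finditer(page, start + len(marker))` takes the same start argument.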