I am looking for repeating patterns inside an HTML page. The patterns I am

Asked: June 16, 2026 (2026-06-16T04:23:25+00:00)

I am looking for repeating patterns inside an HTML page.
The patterns I am interested in start after the prefix “<h2>Seasons</h2>”.
The same patterns also occur before the prefix, but I am not interested in those.

I tried (and failed) with the following Python code (I simplified the pattern to ‘<a href=.+?</a>’ to keep this question readable):

import re

matches = re.compile('<h2>Seasons</h2>.+?(<a href=.+?</a>)+', re.DOTALL).findall(page)
for ref in matches:
    print(ref)

Given the page:

blah blah html stuff 
<h2>Seasons</h2>  
blah blah  more html stuff
<a href=http://www.111.com>111</a><a href=http://www.222.com>222</a><a href=http://www.333.com>333</a>

The output is

<a href=http://www.333.com>333</a>

So it only prints the last match; the other two do not make it into the findall list.
How do I iterate over all matches of the group?

1 Answer

Editorial Team · Answer 1 · 2026-06-16T04:23:26+00:00

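The problem is that the regex as a whole matches only once. The parenthesized group is repeated by the + quantifier, so it matches several times inside that single match, but a repeated group keeps only the text of its last iteration. Because the pattern contains a group, findall returns the group's contents instead of the whole match, which is why only the last link comes back.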
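To see this behaviour directly, here is a small sketch that uses the page from the question and inspects the match objects with finditer. The variable names are only illustrative.

```python
import re

# Sample page from the question.
page = (
    "blah blah html stuff\n"
    "<h2>Seasons</h2>\n"
    "blah blah  more html stuff\n"
    "<a href=http://www.111.com>111</a>"
    "<a href=http://www.222.com>222</a>"
    "<a href=http://www.333.com>333</a>\n"
)

pattern = re.compile(r"<h2>Seasons</h2>.+?(<a href=.+?</a>)+", re.DOTALL)

matches = list(pattern.finditer(page))
n_matches = len(matches)             # the whole pattern matches exactly once
last_capture = matches[0].group(1)   # the repeated group holds only its final iteration

print(n_matches)
print(last_capture)
```

The single match spans all three links, but group 1 only remembers the third one.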

To get around this you need to write a regex that matches multiple times. You might think to use a lookbehind assertion for the <h2> element like so:

(?<=<h2>Seasons</h2>.+?)(<a href=.+?</a>)    # doesn't work

This says to find <a> elements, but only if they’re preceded by <h2>Seasons</h2>. Unfortunately, in Python’s re module a lookbehind pattern has to match strings of a fixed length, so you can’t put .+? in a lookbehind assertion. That approach is out.
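A quick way to confirm this limitation is to try compiling the pattern. This is a minimal check, assuming the standard library re module; the names compiled and message are just for illustration.

```python
import re

try:
    # Variable-width lookbehind: rejected by the standard re module at compile time.
    re.compile(r"(?<=<h2>Seasons</h2>.+?)(<a href=.+?</a>)", re.DOTALL)
    compiled = True
    message = ""
except re.error as exc:
    compiled = False
    message = str(exc)

print(compiled, message)
```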

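What does work is to find the position of the <h2> element first, then run the regex search on the page from that position onward. Note that page.find returns -1 when the heading is missing, so this one-liner assumes the heading is present.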

>>> re.findall('<a href=.+?</a>', page[page.find('<h2>Seasons</h2>'):], re.DOTALL)
['<a href=http://www.111.com>111</a>', '<a href=http://www.222.com>222</a>', '<a href=http://www.333.com>333</a>']
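If the heading might be absent, a slightly more defensive variant of the same idea could look like the sketch below. The helper name links_after and the extra link placed before the heading are assumptions added for illustration; the sketch passes a start position to the compiled pattern's findall instead of slicing the string.

```python
import re

LINK = re.compile(r"<a href=.+?</a>", re.DOTALL)

def links_after(page, marker="<h2>Seasons</h2>"):
    """Return all links that appear after the first occurrence of marker."""
    start = page.find(marker)
    if start == -1:
        # str.find returns -1 when the marker is missing; return nothing
        # rather than searching from the wrong position.
        return []
    # Pattern.findall accepts a start position, so no copy of the page is made.
    return LINK.findall(page, start + len(marker))

# Page from the question, plus one hypothetical link before the heading.
page = (
    "<a href=http://www.000.com>000</a>\n"
    "<h2>Seasons</h2>\n"
    "blah blah  more html stuff\n"
    "<a href=http://www.111.com>111</a>"
    "<a href=http://www.222.com>222</a>"
    "<a href=http://www.333.com>333</a>\n"
)

for ref in links_after(page):
    print(ref)
```

The link before the heading is skipped, and a page without the heading yields an empty list.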
