I’m trying to make a simple Python-based HTML parser using regular expressions. My problem

Question

Asked: June 1, 2026 (2026-06-01T02:50:08+00:00)

I’m trying to make a simple Python-based HTML parser using regular expressions. My problem is trying to get my regex search query to find all the possible matches, then store them in a tuple.

Let’s say I have a page with the following stored in the variable HTMLtext:

<ul>
<li class="active"><b><a href="/blog/home">Back to the index</a></b></li>
<li><b><a href="/blog/about">About Me!</a></b></li>
<li><b><a href="/blog/music">Audio Production</a></b></li>
<li><b><a href="/blog/photos">Gallery</a></b></li>
<li><b><a href="/blog/stuff">Misc</a></b></li>
<li><b><a href="/blog/contact">Shoot me an email</a></b></li>
</ul>

I want to perform a regex search on this text and return a tuple containing the last URL directory of each link. So, I’d like to return something like this:

pages = ["home", "about", "music", "photos", "stuff", "contact"]

So far, I’m able to use regex to search for one result:

pages = [re.compile('<a href="/blog/(.*)">').search(HTMLtext).group(1)]

Running this expression makes pages = ['home'].

How can I get the regex search to continue for the whole text, appending the matched text to this tuple?

(Note: I know I probably should NOT be using regex to parse HTML. But I want to know how to do this anyway.)

1 Answer

Editorial Team · Answer 1 · 2026-06-01T02:50:10+00:00

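Your pattern won’t work on all inputs, possibly including yours. The .* is too greedy (technically, it finds a maximal match): whenever more than one link sits within its reach, for example several links on the same line, it matches from the first href all the way to the last closing ">. The two simplest ways to fix this are to use either a minimal match or a negated character class. Then call re.findall instead of search, so that every match is returned rather than only the first.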

import re

# minimal match approach: .+? stops at the first "> it reaches
pages = re.findall(r'<a\s+href="/blog/(.+?)">',
                   HTMLtext, re.I | re.S)

# negated charclass approach: the capture can never cross a double quote
pages = re.findall(r'<a\s+href="/blog/([^"]+)">',
                   HTMLtext, re.I)
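
To see the difference concretely, here is a small sketch that runs the original greedy pattern and both fixes over the same text. The variable one_line is a made-up input (not from the question) with two links on a single line, which is the situation where greediness bites.

```python
import re

# Hypothetical input: two links on ONE line, so the greedy .* can reach past the first ">
one_line = '<a href="/blog/home">Home</a> <a href="/blog/about">About</a>'

# Original pattern from the question, switched to findall
greedy = re.findall(r'<a href="/blog/(.*)">', one_line)

# The two fixes
minimal = re.findall(r'<a\s+href="/blog/(.+?)">', one_line, re.I | re.S)
negated = re.findall(r'<a\s+href="/blog/([^"]+)">', one_line, re.I)

print(greedy)   # ['home">Home</a> <a href="/blog/about']
print(minimal)  # ['home', 'about']
print(negated)  # ['home', 'about']
```

The greedy version returns a single, overlong match because .* first swallows the rest of the line and then backs up only as far as the last ">.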

Obligatory Warning

For simple and reasonably well-constrained text, regexes are just fine; after all, that’s why we use regex search-and-replace in our text editors when editing HTML! However, it gets more and more complicated the less you know about the input, such as

if there’s some other field intervening between the <a and the href, like <a title="foo" href="bar">
casing issues like <A HREF='foo'>
whitespace issues
alternate quotes like href='/foo/bar' instead of href="/foo/bar"
embedded HTML comments

That’s not an exhaustive list of concerns; there are others. So using regexes on HTML is possible, but whether it’s expedient depends on too many other factors to judge in general.
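
If the input turns out to be less well constrained than you hoped, one alternative is Python's standard-library html.parser module, which already copes with most of the concerns listed above. The following is only a rough sketch: the class name BlogLinkCollector and the sample text are invented for illustration, and the /blog/ prefix is taken from the question.

```python
from html.parser import HTMLParser

class BlogLinkCollector(HTMLParser):
    """Hypothetical helper: collect whatever follows /blog/ in each <a href>."""

    prefix = "/blog/"

    def __init__(self):
        super().__init__()
        self.pages = []

    def handle_starttag(self, tag, attrs):
        # html.parser hands over tag and attribute names already lowercased
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.startswith(self.prefix):
                self.pages.append(value[len(self.prefix):])

# Made-up sample exercising case, quoting, extra attributes, comments and whitespace
sample = """<A HREF='/blog/home'>Home</A>
<a title="foo" href="/blog/about">About</a>
<!-- <a href="/blog/hidden">commented out</a> -->
<a
   href="/blog/music">Audio</a>"""

collector = BlogLinkCollector()
collector.feed(sample)
collector.close()
pages = collector.pages
print(pages)  # ['home', 'about', 'music']
```

The commented-out link is skipped because the parser reports it as a comment, not as a start tag.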

However, from the little example you’ve shown, it looks perfectly ok for your own case. You just have to spiff up your pattern and call the right method.
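
Put together for the exact text in the question, a minimal sketch might look like this. It assumes HTMLtext holds the markup exactly as posted; pages_tuple is an extra name added here, in case you really do want a tuple and not a list.

```python
import re

# The sample markup from the question
HTMLtext = """<ul>
<li class="active"><b><a href="/blog/home">Back to the index</a></b></li>
<li><b><a href="/blog/about">About Me!</a></b></li>
<li><b><a href="/blog/music">Audio Production</a></b></li>
<li><b><a href="/blog/photos">Gallery</a></b></li>
<li><b><a href="/blog/stuff">Misc</a></b></li>
<li><b><a href="/blog/contact">Shoot me an email</a></b></li>
</ul>"""

# findall returns every captured group as a list, in document order
pages = re.findall(r'<a\s+href="/blog/([^"]+)">', HTMLtext, re.I)
print(pages)  # ['home', 'about', 'music', 'photos', 'stuff', 'contact']

# Optional: convert to an immutable tuple
pages_tuple = tuple(pages)
```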
