I’m trying to extract every HTML tag including a match for a regular expression.

Question

Asked: May 29, 2026 (2026-05-29T14:56:22+00:00)

I’m trying to extract every HTML tag that contains a match for a regular expression. For example, suppose I want to get every tag containing the string “name” and I have an HTML document like this:

<html>
  <head>
    <title>This tag includes 'name', so it should be retrieved</title>
  </head>
  <body>
    <h1 class="name">This is also a tag to be retrieved</h1>
    <h2>Generic h2 tag</h2>
  </body>
</html>

I could probably use a regular expression to catch every match between an opening "<" and a closing ">". However, I’d like to be able to traverse the parsed tree starting from those matches, so I can get the siblings, parents, or ‘nextElements’. In the example above, that amounts to getting <head>*</head> or maybe <h2>*</h2> once I know they’re parents or siblings of a tag containing the match.

I tried BeautifulSoup, but it seems to me it’s useful when you already know what kind of tag you’re looking for, or when you search based on its contents. In this case, I want to get a match first, take that match as a starting point, and then navigate the tree as BeautifulSoup and other HTML parsers are able to do.

Suggestions?

1 Answer

Editorial Team · Answer 1 · 2026-05-29T14:56:30+00:00

Use lxml.html. It’s a great parser, and it supports XPath, which can express this kind of query concisely.

The example below uses this XPath expression:

//*[contains(text(),'name')]/parent::*/following-sibling::*[1]/*[@class='name']/text()

That means, in English:

Find me any tag that contains the word 'name' in its text, then get
the parent, and then the next sibling, and find inside that any tag with the class
'name' and finally return the text content of that.

The result of running the code is:

['This is also a tag to be retrieved']

Here’s the full code:

text = """
<html>
  <head>
    <title>This tag includes 'name', so it should be retrieved</title>
  </head>
  <body>
    <h1 class="name">This is also a tag to be retrieved</h1>
    <h2>Generic h2 tag</h2>
  </body>
</html>
"""

import lxml.html
doc = lxml.html.fromstring(text)
print(doc.xpath('//*[contains(text(), $stuff)]/parent::*/'
    'following-sibling::*[1]/*[@class=$stuff]/text()', stuff='name'))
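
If you want to start from a regular-expression match rather than a fixed XPath, here is a minimal sketch of that idea. It is not the lxml code above: to stay dependency-free it assumes the document is well-formed XML and uses the standard library's xml.etree.ElementTree. The names SAMPLE, find_matches, and describe are made up for illustration. It checks each element's own text and attribute values against the pattern, then looks up the parent and next sibling of every hit.

```python
import re
import xml.etree.ElementTree as ET

# Same sample document as in the question; it happens to be well-formed XML.
SAMPLE = """<html>
  <head>
    <title>This tag includes 'name', so it should be retrieved</title>
  </head>
  <body>
    <h1 class="name">This is also a tag to be retrieved</h1>
    <h2>Generic h2 tag</h2>
  </body>
</html>"""


def find_matches(root, pattern):
    """Return elements whose own text or any attribute value matches pattern."""
    regex = re.compile(pattern)
    hits = []
    for el in root.iter():
        values = [el.text or ""] + list(el.attrib.values())
        if any(regex.search(v) for v in values):
            hits.append(el)
    return hits


def describe(root, pattern):
    """List (tag, parent tag, next sibling tag) for every matching element."""
    # ElementTree keeps no parent pointer, so build a child -> parent lookup.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    rows = []
    for el in find_matches(root, pattern):
        parent = parent_of.get(el)
        siblings = list(parent) if parent is not None else []
        idx = siblings.index(el) if siblings else -1
        has_next = 0 <= idx < len(siblings) - 1
        nxt = siblings[idx + 1] if has_next else None
        rows.append((
            el.tag,
            parent.tag if parent is not None else None,
            nxt.tag if nxt is not None else None,
        ))
    return rows


root = ET.fromstring(SAMPLE)
for row in describe(root, r"\bname\b"):
    print(row)
```

For the sample document this reports title (parent head, no next sibling) and h1 (parent body, next sibling h2). Real-world HTML is often not well-formed, which is where a tolerant parser such as lxml.html comes in; its elements also offer getparent() and getnext(), so the parent lookup would not be needed there.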

Obligatory reading: the “please don’t parse HTML with regex” answer is here:
https://stackoverflow.com/a/1732454/17160
