I am getting the first paragraph from pages and trying to extract words suitable

Asked: May 26, 2026

I am getting the first paragraph from pages and trying to extract words suitable to be tags or keywords. In some paragraphs there are links and I want to remove the tags:

For instance, if the text is:

A <b>hex triplet</b> is a six-digit, three-<a href="/wiki/Byte"
enter code heretitle="Byte">byte</a> ...

I want to remove

<b></b><a href="/wiki/Byte" title="Byte"></a>

to end up with this

A hex triplet is a six-digit, three-byte ...

A regex like this does not work:

>>> import re
>>> text = """A <b>hex triplet</b> is a six-digit, three-<a href="/wiki/Byte"
    enter code heretitle="Byte">byte</a> ..."""
>>> f = re.findall(r'<.+>', text)
>>> f
['<b>hex triplet</b>', '</a>']
>>>

What is the best way to do this?

I found several similar questions, but I don't think any of them solves this particular problem.

Update with an example of BeautifulSoup's extract (extract deletes the tag including its text, and must be run for each tag separately):

>>> soup = BeautifulSoup(text)
>>> [s.extract() for s in soup('b')]
[<b>hex triplet</b>]
>>> soup
A  is a six-digit, three-<a href="/wiki/Byte" enter code heretitle="Byte">byte</a> ...
>>> [s.extract() for s in soup('a')]
[<a href="/wiki/Byte" enter code heretitle="Byte">byte</a>]
>>> soup
A  is a six-digit, three- ...
>>>

Update

For people with the same question: as mentioned by Brendan Long, this answer using HtmlParser works best.

1 Answer

Editorial Team · Answer 1 · 2026-05-26T05:54:58+00:00

The + quantifier is greedy, meaning it will find the longest possible match. Add a ? to force it to find the shortest possible match:

>>> re.findall(r'<.+?>', text)
['<b>', '</b>', '</a>']

Another way to write the regex is to explicitly exclude right angle brackets inside a tag, using [^>] instead of the dot (.):

>>> re.findall(r'<[^>]+>', text)
['<b>', '</b>', '<a href="/wiki/Byte"\n    enter code heretitle="Byte">', '</a>']

An advantage of this approach is that it will also match newlines (\n). You can get the same behavior with . if you add the re.DOTALL flag.

>>> re.findall(r'<.+?>', text, re.DOTALL)
['<b>', '</b>', '<a href="/wiki/Byte"\n    enter code heretitle="Byte">', '</a>']

To strip out the tags, use re.sub:

>>> re.sub(r'<.+?>', '', text, flags=re.DOTALL)
'A hex triplet is a six-digit, three-byte ...'
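
Regexes like these can still misfire on markup that is less regular than this sample, for example comments or a stray angle bracket inside an attribute value. The question's update mentions that an HtmlParser-based answer worked best. That answer is not reproduced here, so the following is only a minimal sketch of the idea using the standard library's html.parser module. The names TagStripper and strip_tags are made up for illustration.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only the text between tags (hypothetical helper class)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for text nodes only; start and end tags are skipped.
        self.chunks.append(data)

    def get_text(self):
        return "".join(self.chunks)


def strip_tags(markup):
    """Return markup with all tags dropped and their inner text kept."""
    stripper = TagStripper()
    stripper.feed(markup)
    stripper.close()  # flush any text still buffered by the parser
    return stripper.get_text()


text = """A <b>hex triplet</b> is a six-digit, three-<a href="/wiki/Byte"
    enter code heretitle="Byte">byte</a> ..."""
print(strip_tags(text))
# A hex triplet is a six-digit, three-byte ...
```

Unlike BeautifulSoup's extract, this keeps the text inside each tag and handles every tag type in one pass, so there is no need to list b, a, and so on separately.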
