I am getting the first paragraph from pages and trying to extract words suitable to be tags or keywords. Some paragraphs contain links and other markup, and I want to remove the HTML tags while keeping the text inside them.
For instance, if the text is
A <b>hex triplet</b> is a six-digit, three-<a href="/wiki/Byte"
title="Byte">byte</a> ...
I want to remove
<b></b><a href="/wiki/Byte" title="Byte"></a>
to end up with this
A hex triplet is a six-digit, three-byte ...
A regex like this does not work:
>>> import re
>>> text = """A <b>hex triplet</b> is a six-digit, three-<a href="/wiki/Byte"
title="Byte">byte</a> ..."""
>>> f = re.findall(r'<.+>', text)
>>> f
['<b>hex triplet</b>', '</a>']
>>>
What is the best way to do this?
I found several similar questions, but I don't think any of them solves this particular problem.
Update with an example using BeautifulSoup's extract (extract deletes the tag including its text, and must be run for each tag separately):
>>> soup = BeautifulSoup(text)
>>> [s.extract() for s in soup('b')]
[<b>hex triplet</b>]
>>> soup
A is a six-digit, three-<a href="/wiki/Byte" title="Byte">byte</a> ...
>>> [s.extract() for s in soup('a')]
[<a href="/wiki/Byte" title="Byte">byte</a>]
>>> soup
A is a six-digit, three- ...
>>>
Update
For people with the same question: as mentioned by Brendan Long, this answer using HTMLParser works best.
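For readers who want to try the HTMLParser route, here is a minimal sketch, assuming Python 3's standard html.parser module. It is not necessarily the code from the answer referred to above; the names TagStripper, collected_text and strip_tags are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Keeps the text between tags and ignores the tags themselves.

    Hypothetical helper for illustration, not part of the standard library.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for each run of text outside a tag.
        self.chunks.append(data)

    def collected_text(self):
        return ''.join(self.chunks)


def strip_tags(html):
    stripper = TagStripper()
    stripper.feed(html)
    stripper.close()  # flush any text still buffered by the parser
    return stripper.collected_text()


sample = 'A <b>hex triplet</b> is a six-digit, three-<a href="/wiki/Byte" title="Byte">byte</a> ...'
print(strip_tags(sample))
```

Unlike extract, this keeps the text inside each tag and handles all tag names in one pass.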
The + quantifier is greedy, meaning it will find the longest possible match. Add a ? after it (giving <.+?>) to force it to find the shortest possible match.
Another way to write the regex is to explicitly exclude right angle brackets inside a tag, using [^>] instead of the dot (.). An advantage of this approach is that it will also match newlines (\n). You can get the same behavior with . if you add the re.DOTALL flag.
To strip out the tags, use re.sub to replace every match with an empty string.
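As a rough sketch of the patterns discussed in this answer, assuming the sample text from the question with the <a> tag split across two lines (the variable names are illustrative only):

```python
import re

# Sample text from the question; the <a> tag spans a line break.
text = '''A <b>hex triplet</b> is a six-digit, three-<a href="/wiki/Byte"
title="Byte">byte</a> ...'''

# Greedy: .+ runs to the last '>' it can reach on the same line.
greedy = re.findall(r'<.+>', text)

# Non-greedy: .+? stops at the first '>', but '.' still cannot cross
# the newline, so the split <a ...> tag is missed.
lazy = re.findall(r'<.+?>', text)

# Negated class: [^>] also matches '\n', so the split tag is found.
negated = re.findall(r'<[^>]+>', text)

# Same matches with '.' once re.DOTALL lets it match newlines.
dotall = re.findall(r'<.+?>', text, flags=re.DOTALL)

# Replace every tag with an empty string to strip the markup.
stripped = re.sub(r'<[^>]+>', '', text)
print(stripped)
```

Note that a regex like this only suits simple, well-formed fragments; a '>' inside an attribute value, for example, would end the match early.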