I need to extract all the visible text inside paragraph elements in an HTML file using BeautifulSoup in Python.
For example,
<p>Many hundreds of named mango <a href="/wiki/Cultivar" title="Cultivar">cultivars</a> exist.</p>
should return:
Many hundreds of named mango cultivars exist.
P.S. Some files contain Unicode characters (Hindi) which need to be extracted.
Any ideas how to do that?
Here’s how you can do it with BeautifulSoup. This will remove any tags not in VALID_TAGS but keep the content of the removed tags.
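The code for this answer is not shown above, so the following is only a minimal sketch of the approach it describes, assuming the modern `bs4` package (BeautifulSoup 4) and Python 3. The function names `strip_invalid_tags` and `paragraph_texts` and the contents of `VALID_TAGS` are illustrative choices, not taken from the original answer. It uses `Tag.unwrap()`, which replaces a tag with its own children, so the text inside a removed tag is kept.

```python
from bs4 import BeautifulSoup

# Assumption: only <p> is kept; every other tag is unwrapped (tag removed, content kept).
VALID_TAGS = {"p"}


def strip_invalid_tags(html, valid_tags=VALID_TAGS):
    """Return a soup in which every tag not in valid_tags has been unwrapped."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all(True) returns a list of every tag, so modifying the tree
    # while looping over that list is safe.
    for tag in soup.find_all(True):
        if tag.name not in valid_tags:
            tag.unwrap()
    return soup


def paragraph_texts(html):
    """Return the text of each <p> element after unwanted tags are unwrapped."""
    soup = strip_invalid_tags(html)
    return [p.get_text() for p in soup.find_all("p")]


if __name__ == "__main__":
    sample = (
        '<p>Many hundreds of named mango '
        '<a href="/wiki/Cultivar" title="Cultivar">cultivars</a> exist.</p>'
    )
    print(paragraph_texts(sample))
    # ['Many hundreds of named mango cultivars exist.']
```

In Python 3, `str` is Unicode, so Hindi text should pass through unchanged as long as the file is decoded correctly, for example by opening it with `open(path, encoding="utf-8")` if it is UTF-8. If only the text is needed, calling `get_text()` on each `<p>` directly gives the same result without unwrapping first.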