I’m having some trouble formulating a findAll query for BeautifulSoup that’ll do what I want. Previously, I was using findAll to extract only the text from some HTML, essentially stripping away all the tags. For example, if I had:
<b>Cows</b> are being abducted by aliens according to the
<a href="www.washingtonpost.com">Washington Post</a>.
It would be reduced to:
Cows are being abducted by aliens according to the Washington Post.
I would do this by using ''.join(html.findAll(text=True)). This was working great, until I decided I would like to keep only the <a> tags, but strip the rest of the tags away. So, given the initial example, I would end up with this:
Cows are being abducted by aliens according to the
<a href="www.washingtonpost.com">Washington Post</a>.
I initially thought that the following would do the trick:
''.join(html.findAll({'a':True}, text=True))
However, this doesn’t work, since the text=True seems to indicate that it will only find text. What I’m in need of is some OR option – I would like to find text OR <a> tags. It’s important that the tags stay around the text they are tagging – I can’t have the tags or text appearing out of order.
Any thoughts?
Note: BeautifulSoup.findAll is a search API. The first named argument of findAll, which is name, can be used to restrict the search to a given set of tags. With just a single findAll call it is not possible to select all the text between tags and, at the same time, select both the text and the tag for <a>. So I came up with a solution that navigates the parse tree instead.
This solution depends on BeautifulSoup.Tag being imported.
We navigate the parse tree like a list using the contents attribute. When a node is a tag and the tag is not <a>, we extract only its text. Otherwise we take the entire string, tag included. This uses the parse-tree navigation API rather than the search API.
The order of the text is preserved.
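The approach described above can be sketched roughly as follows. This is a minimal sketch, not necessarily the exact code I originally had: it assumes the newer bs4 package (where Tag is imported from bs4 rather than from BeautifulSoup) and the built-in html.parser, and the function name keep_links is made up for illustration. It also recurses into non-<a> tags, so that a link nested inside another tag is still kept.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def keep_links(node):
    """Return the markup under `node` with every tag except <a> stripped.

    `keep_links` is a hypothetical helper name, not part of BeautifulSoup.
    """
    parts = []
    # .contents lists the direct children in document order,
    # so the text and the links come out in their original sequence.
    for child in node.contents:
        if isinstance(child, Tag):
            if child.name == "a":
                # Keep the whole <a> element, tag and attributes included.
                parts.append(str(child))
            else:
                # Any other tag: drop the tag itself, keep what is inside it.
                parts.append(keep_links(child))
        else:
            # Plain text node.
            parts.append(str(child))
    return "".join(parts)


source = (
    '<b>Cows</b> are being abducted by aliens according to the '
    '<a href="www.washingtonpost.com">Washington Post</a>.'
)
soup = BeautifulSoup(source, "html.parser")
result = keep_links(soup)
print(result)
```

For the example from the question this should print the sentence with the <b> tag removed and the <a> tag intact. Note that text nodes come back decoded, so an entity such as &amp; in the input would appear as a plain & in the output; re-escape if the result is going back into HTML.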