I’m having some trouble formulating a findAll query for BeautifulSoup that’ll do what I

Asked: May 24, 2026 (2026-05-24T10:38:16+00:00)

I’m having some trouble formulating a findAll query for BeautifulSoup that’ll do what I want. Previously, I was using findAll to extract only the text from some html, essentially stripping away all the tags. For example, if I had:

<b>Cows</b> are being abducted by aliens according to the
<a href="www.washingtonpost.com>Washington Post</a>.

It would be reduced to:

Cows are being abducted by aliens according to the Washington Post.

I would do this by using ''.join(html.findAll(text=True)). This was working great, until I decided I would like to keep only the <a> tags, but strip the rest of the tags away. So, given the initial example, I would end up with this:

Cows are being abducted by aliens according to the
<a href="www.washingtonpost.com>Washington Post</a>.

I initially thought that the following would do the trick:

''.join(html.findAll({'a':True}, text=True))

However, this doesn’t work, since the text=True seems to indicate that it will only find text. What I’m in need of is some OR option – I would like to find text OR <a> tags. It’s important that the tags stay around the text they are tagging – I can’t have the tags or text appearing out of order.

Any thoughts?

1 Answer

Editorial Team · Answer 1 · 2026-05-24T10:38:17+00:00

Note: findAll is BeautifulSoup's search API. Its first argument, name, restricts the search to a given set of tags. A single findAll call cannot return all the text nodes and, at the same time, keep the <a> tags together with their text, so I came up with the solution below.

This solution depends on BeautifulSoup.Tag being imported.

from BeautifulSoup import BeautifulSoup, Tag

soup = BeautifulSoup('<b>Cows</b> are being abducted by aliens according to the <a href="www.washingtonpost.com>Washington Post</a>.')
parsed_soup = ''

We walk the top level of the parse tree like a list, using the contents attribute. When an item is a tag other than <a>, we extract only its text. Otherwise (a bare string or an <a> tag) we keep the item unchanged, tag included. This relies on BeautifulSoup's API for navigating the parse tree.

for item in soup.contents:
    # Any tag other than <a>: keep only its text.
    if isinstance(item, Tag) and item.name != u'a':
        parsed_soup += ''.join(item.findAll(text=True))
    # Bare strings and <a> tags: keep them unchanged.
    else:
        parsed_soup += unicode(item)

The order of the text is preserved.

 >>> parsed_soup
 u'Cows are being abducted by aliens according to the <a href=\'"www.washingtonpost.com\'>Washington Post</a>.'
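
The loop above only looks at the top level of the document, so an <a> nested inside another tag would lose its markup. As a rough sketch of one way around that, here is the same idea written against Python 3's standard-library html.parser instead of BeautifulSoup. This is a swapped-in technique, not part of the BeautifulSoup API. The names KeepAnchors and keep_only_anchors are made up for illustration, and the sample input assumes the href attribute is properly quoted.

```python
from html import escape
from html.parser import HTMLParser


class KeepAnchors(HTMLParser):
    """Hypothetical helper: collect text and <a> tags in document order."""

    def __init__(self):
        # convert_charrefs=True hands handle_data already-unescaped text.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Re-emit <a> start tags with their attributes; drop every other tag.
        if tag == "a":
            rendered = "".join(
                ' %s="%s"' % (name, escape(value or "", quote=True))
                for name, value in attrs
            )
            self.parts.append("<a%s>" % rendered)

    def handle_endtag(self, tag):
        if tag == "a":
            self.parts.append("</a>")

    def handle_data(self, data):
        # Text arrives unescaped, so escape it again for the output.
        self.parts.append(escape(data, quote=False))


def keep_only_anchors(html):
    """Return html with every tag except <a> stripped, order preserved."""
    parser = KeepAnchors()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


sample = ('<b>Cows</b> are being abducted by aliens according to the '
          '<a href="www.washingtonpost.com">Washington Post</a>.')
result = keep_only_anchors(sample)
```

Because the parser reports start tags, text, and end tags as it meets them, the tags stay around the text they mark up, whatever the nesting depth.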
