The Archive Base


The Archive Base Latest Questions

Editorial Team
Asked: May 13, 2026 (2026-05-13T11:44:39+00:00)

I am using Beautiful Soup to extract ‘content’ from web pages. I know some

I am using Beautiful Soup to extract ‘content’ from web pages. I know some people have asked this question before and they were all pointed to Beautiful Soup and that’s how I got started with it.

I was able to successfully get most of the content but I am running into some challenges with tags that are part of the content. (I am starting off with a basic strategy of: if there are more than x-chars in a node then it is content). Let’s take the html code below as an example:

<div id="abc">
    some long text goes <a href="/"> here </a> and hopefully it 
    will get picked up by the parser as content
</div>

results = soup.findAll(text=lambda x: len(x) > 20)

When I use the above code to get at the long text, it breaks (the identified text will start from ‘and hopefully..’) at the tags. So I tried to replace the tag with plain text as follows:

anchors = soup.findAll('a')

for a in anchors:
  a.replaceWith('plain text')

The above does not work because Beautiful Soup inserts the string as a NavigableString and that causes the same problem when I use findAll with the len(x) > 20. I can use regular expressions to parse the html as plain text first, clear out all the unwanted tags and then call Beautiful Soup. But I would like to avoid processing the same content twice — I am trying to parse these pages so I can show a snippet of content for a given link (very much like Facebook Share) — and if everything is done with Beautiful Soup, I presume it will be faster.

So my question: is there a way to ‘clear tags’ and replace them with ‘plain text’ using Beautiful Soup? If not, what would be the best way to do so?

Thanks for your suggestions!

Update: Alex’s code worked very well for the sample example. I also tried various edge cases and they all worked fine (with the modification below). So I gave it a shot on a real-life website and ran into issues that puzzle me.

import urllib
from BeautifulSoup import BeautifulSoup

page = urllib.urlopen('http://www.engadget.com/2010/01/12/kingston-ssdnow-v-dips-to-30gb-size-lower-price/')
soup = BeautifulSoup(page)

anchors = soup.findAll('a')
i = 0
for a in anchors:
    print str(i) + ":" + str(a)
    for a in anchors:
        if (a.string is None): a.string = ''
        if (a.previousSibling is None and a.nextSibling is None):
            a.previousSibling = a.string
        elif (a.previousSibling is None and a.nextSibling is not None):
            a.nextSibling.replaceWith(a.string + a.nextSibling)
        elif (a.previousSibling is not None and a.nextSibling is None):
            a.previousSibling.replaceWith(a.previousSibling + a.string)
        else:
            a.previousSibling.replaceWith(a.previousSibling + a.string + a.nextSibling)
            a.nextSibling.extract()
    i = i+1

When I run the above code, I get the following error:

0:<a href="http://www.switched.com/category/ces-2010">Stay up to date with 
Switched's CES 2010 coverage</a>
Traceback (most recent call last):
  File "parselink.py", line 44, in <module>
  a.previousSibling.replaceWith(a.previousSibling + a.string + a.nextSibling)
 TypeError: unsupported operand type(s) for +: 'Tag' and 'NavigableString'

When I look at the HTML code, ‘Stay up to date..’ does not have any previous sibling (I did not know how previous sibling worked until I saw Alex’s code, and based on my testing it looks like it is looking for ‘text’ before the tag). So, if there is no previous sibling, I am surprised that it is not going through the if logic of a.previousSibling is None and a.nextSibling is None.

Could you please let me know what I am doing wrong?

-ecognium


1 Answer

  1. Editorial Team
    Added an answer on May 13, 2026 at 11:44 am

    An approach that works for your specific example is:

    from BeautifulSoup import BeautifulSoup
    
    ht = '''
    <div id="abc">
        some long text goes <a href="/"> here </a> and hopefully it 
        will get picked up by the parser as content
    </div>
    '''
    soup = BeautifulSoup(ht)
    
    anchors = soup.findAll('a')
    for a in anchors:
      a.previousSibling.replaceWith(a.previousSibling + a.string)
    
    results = soup.findAll(text=lambda x: len(x) > 20)
    
    print results
    

    which emits

    $ python bs.py
    [u'\n    some long text goes  here ', u' and hopefully it \n    will get picked up by the parser as content\n']
    

    Of course, you’ll probably need to take a bit more care: what if there’s no a.string, or if a.previousSibling is None? You’ll need suitable if statements to handle such corner cases. But I hope this general idea can help you. (In fact, you may also want to merge the next sibling if it’s a string. I’m not sure how that plays with your len(x) > 20 heuristic, but say, for example, that you have two 9-character strings with an <a> containing a 5-character string in the middle; perhaps you’d want to pick up the lot as a single 23-character string? I can’t tell, because I don’t understand the motivation for your heuristic.)
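The corner cases above (a tag with no string, no previous sibling, a next sibling that should also be merged) can be sketched independently of any parsing library. This is only a rough model, not Beautiful Soup code: siblings are represented as a flat list in which a plain str stands for a text node and a (tag_name, text_or_None) tuple stands for an element. That representation and the name merge_inline_text are assumptions made for illustration.

```python
def merge_inline_text(children, inline_tags=("a",)):
    """Fold the text of inline tags into the neighbouring text nodes.

    `children` is a flat list of siblings in document order: a plain str
    stands for a text node, and a (tag_name, text_or_None) tuple stands
    for an element. This representation is hypothetical, not a
    Beautiful Soup type.
    """
    merged = []
    for child in children:
        if isinstance(child, str):
            text = child
        elif child[0] in inline_tags:
            text = child[1] or ""   # corner case: the tag has no string
        else:
            merged.append(child)    # any other tag stays as a boundary
            continue
        if not text:
            continue                # nothing to add
        if merged and isinstance(merged[-1], str):
            merged[-1] += text      # previous sibling is text: concatenate
        else:
            merged.append(text)     # no previous sibling, or it is a tag
    return merged


# Two 9-character strings around a 5-character link become one string.
siblings = ["123456789", ("a", "12345"), "123456789"]
merged = merge_inline_text(siblings)
print([len(node) for node in merged])  # [23]
```

The isinstance check is the important part. In Beautiful Soup, previousSibling and nextSibling return whichever node is adjacent, and that node can be another Tag rather than a string, so concatenating without checking the type can raise the kind of TypeError shown in the question's update.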

    I imagine that besides <a> tags you’ll also want to remove others, such as <b> or <strong>, maybe <p> and/or <br>, etc…? I guess this, too, depends on what the actual idea behind your heuristics is!
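If the goal is only a text snippet, one alternative that avoids editing the tree is to collect text while treating a chosen set of inline tags as transparent and every other tag as a boundary. The sketch below uses Python 3's standard-library html.parser rather than Beautiful Soup, so it is a swapped-in technique, not the approach shown earlier. The INLINE_TAGS set, the TextBlockCollector class, and the content_blocks function are hypothetical names, and which tags count as inline is an assumption you would tune to your heuristic.

```python
from html.parser import HTMLParser

# Assumed for illustration: which tags count as "inline" is a judgment call.
INLINE_TAGS = {"a", "b", "strong", "em", "i", "span"}


class TextBlockCollector(HTMLParser):
    """Collect text runs, ignoring inline tags and splitting at any other tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []     # finished text runs
        self._current = []   # pieces of the run being built

    def _flush(self):
        # Join the pieces and collapse runs of whitespace into single spaces.
        text = " ".join("".join(self._current).split())
        if text:
            self.blocks.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag not in INLINE_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag not in INLINE_TAGS:
            self._flush()

    def handle_data(self, data):
        self._current.append(data)

    def close(self):
        super().close()
        self._flush()        # emit any text left after the last tag


def content_blocks(html, min_chars=20):
    """Return the text runs longer than min_chars characters."""
    parser = TextBlockCollector()
    parser.feed(html)
    parser.close()
    return [block for block in parser.blocks if len(block) > min_chars]


sample = '''<div id="abc">
    some long text goes <a href="/"> here </a> and hopefully it
    will get picked up by the parser as content
</div>'''
print(content_blocks(sample))
```

With the sample from the question, the text on both sides of the link ends up in a single run, so the length check sees the whole sentence. Note that this sketch does not skip the contents of script or style elements, which a real page would need.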



© 2021 The Archive Base. All Rights Reserved