The Archive Base


Editorial Team
Asked: May 31, 2026


I have a document that has two content types: text/xml and text/html. I would like to use BeautifulSoup to parse the document and end up with a clean text version. The document starts as a tuple, so I have been using repr to turn it into something BeautifulSoup recognizes, and then using find_all to find just the text/html bit of the document by searching for the divs, like so:

from bs4 import BeautifulSoup

soup = BeautifulSoup(repr(msg_data))
text = soup.html.find_all("div")

Then, I’m turning text back into a string, saving it to a variable and then turning it back into a soup object and calling get_text on it, like so:

str_text = str(text)
soup_text = BeautifulSoup(str_text)
soup_text.get_text()

However, that then changes the encoding to unicode, like so:

u'[9:16 PM\xa0Erica: with images, \xa0\xa0and that seemed long to me anyway, 9:17     
PM\xa0me: yeah, \xa0Erica: so feel free to make it shorter, \xa0\xa0or rather, please do, 
9:18 PM\xa0nobody wants to read about that shit for 2 pages, \xa0me: :), \xa0Erica: while 
browsing their site, \xa0me: srsly, \xa0Erica: unless of course your writing is magic, 
\xa0me: My writing saves drowning puppies, \xa0\xa0Just plucks him right out and gives 
them a scratch behind the ears and some kibble, \xa0Erica: Maine is weird, \xa0me: haha]'

When I try to re-encode it as UTF-8, like so:

soup.encode('utf-8')

I am back to the unparsed type.

I would like to get to the point where I have clean text saved as a string, so that I can find specific things within the text (for example, “puppies” in the text above).

Basically, I’m running around in circles here. Can anyone help? As always, thank you so much for any help you can give.


1 Answer

  1. Editorial Team
    Added an answer on May 31, 2026 at 1:49 pm

    The encoding isn’t ruined; it’s exactly what it should be. get_text() returns a Unicode string, and '\xa0' is how Python displays U+00A0, the non-breaking space.
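To check what an escaped character actually is, the standard library's unicodedata module can name it. This is a small sketch in Python 3 syntax; the variable name nbsp is just an illustration and does not come from the question.

```python
import unicodedata

# The escaped character that shows up throughout the get_text() output.
nbsp = u"\xa0"

# Ask the Unicode database for the character's official name.
name = unicodedata.name(nbsp)
print(name)

# It counts as whitespace, so whitespace-aware string methods treat it as such.
print(nbsp.isspace())
```

So the text is already clean; the \xa0 only looks odd because the interpreter is showing the repr of the string.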

    If you want to encode this (Unicode) string as ASCII, you can tell the codec to ignore (that is, silently drop) any character it can’t represent:

    >>> x = u'[9:16 PM\xa0Erica: with images, \xa0\xa0and that seemed long to me anyway, 9:17 PM\xa0me: yeah, \xa0Erica: so feel free to make it shorter, \xa0\xa0or rather, please do,  9:18 PM\xa0nobody wants to read about that shit for 2 pages, \xa0me: :), \xa0Erica: while  browsing their site, \xa0me: srsly, \xa0Erica: unless of course your writing is magic,  \xa0me: My writing saves drowning puppies, \xa0\xa0Just plucks him right out and gives  them a scratch behind the ears and some kibble, \xa0Erica: Maine is weird, \xa0me: haha]'
    >>> x.encode('ascii', 'ignore')
    '[9:16 PMErica: with images, and that seemed long to me anyway, 9:17 PMme: yeah, Erica: so feel free to make it shorter, or rather, please do,  9:18 PMnobody wants to read about that shit for 2 pages, me: :), Erica: while  browsing their site, me: srsly, Erica: unless of course your writing is magic,  me: My writing saves drowning puppies, Just plucks him right out and gives  them a scratch behind the ears and some kibble, Erica: Maine is weird, me: haha]'
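Note that 'ignore' drops each non-breaking space outright, which is why words run together in the output above ("PMErica"). If the goal is searchable text, one alternative is to swap each non-breaking space for an ordinary space instead. The following is a minimal sketch in Python 3 syntax, using a shortened piece of the sample text; the names text and clean are assumptions for illustration.

```python
# A shortened piece of the text returned by get_text() in the question.
text = u"\xa0me: My writing saves drowning puppies, \xa0\xa0Just plucks him right out"

# Replace each non-breaking space with a plain space, then collapse
# runs of whitespace (and trim the ends) by splitting and re-joining.
clean = u" ".join(text.replace(u"\xa0", u" ").split())

print(clean)

# Ordinary string searching works on the result.
print(u"puppies" in clean)
print(clean.find(u"puppies"))
```

Searching for "puppies" also works on the original string without any cleanup, since the non-breaking spaces do not touch the word itself; the replacement only matters if you want plain spaces in the result.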
    

    If you have time, you should watch Ned Batchelder’s recent video Pragmatic Unicode. It will make everything clear and simple!


© 2021 The Archive Base. All Rights Reserved