I am using BeautifulSoup to scrape data from a webpage. I want to compare the website data with text that is in a .txt document. However, I seem to be having encoding issues.
The website has the text “heat oven to 400°”. The text also appears like this in “view source” (no HTML entities).
The website is read using BeautifulSoup:
import urllib2  # assumption: the page is fetched with urllib2 (Python 2)
source = urllib2.urlopen("my url").read()
....
soup = BeautifulSoup(source)
The text document was created by making a new text doc encoded as “Encode in UTF-8 without BOM”. I then copy-pasted “heat oven to 400°” from the website into the text doc and saved.
The text file is read as
f = codecs.open('myfilename', encoding='utf-8')
When I compare the two strings, they are not equal, but I want them to be.
To see what is going on: In Eclipse, I split the two texts and, looking at the variables in debug mode, I see that the degree sign from BeautifulSoup appears as \xc2 \xb0. The degree sign from the text doc just appears as \xb0.
Why, and how do I fix it? I’m having this issue with many special chars so I need a general solution. Also, I will be copy-pasting data from several sites into the text doc.
Looks like Beautiful Soup doesn’t have what it needs in order to detect the encoding correctly. You can give a hint by replacing BeautifulSoup(source) with BeautifulSoup(source, fromEncoding='UTF-8'). More options and information are online at “Beautiful Soup Gives You Unicode, Dammit”.
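Another way to avoid the guessing, sketched here on the assumption that the page really is served as UTF-8, is to decode the downloaded bytes yourself and compare Unicode text with Unicode text. The byte string below only stands in for the downloaded page, the variable names are illustrative, and the syntax is Python 3:

```python
# Stand-in for the raw bytes returned by the HTTP read (assumed to be UTF-8).
source = "heat oven to 400\u00b0".encode("utf-8")

# Decode explicitly instead of letting a parser guess the encoding.
page_text = source.decode("utf-8")

# Text as it would come back from reading the UTF-8 text file.
file_text = "heat oven to 400\u00b0"

# Both sides are now Unicode strings, so the comparison is meaningful.
print(page_text == file_text)
```

If the sites you copy from use different encodings, the "utf-8" argument would have to change per site; this sketch does not try to detect that.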
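The two characters ‘\xc2\xb0’ are what you get when the UTF-8 encoding of Unicode code point U+00B0 (the bytes C2 B0) is decoded as Windows-1252, which is Beautiful Soup’s last-resort guess at the encoding. Each byte is read as a character of its own, so the single degree sign turns into “Â°”.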
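The mix-up can be reproduced with the standard library alone. This is a small Python 3 sketch with illustrative variable names; "cp1252" is Python’s codec name for Windows-1252:

```python
degree = "\u00b0"                       # DEGREE SIGN, one code point

# UTF-8 encodes U+00B0 as two bytes: C2 B0.
utf8_bytes = degree.encode("utf-8")

# Decoding those bytes as Windows-1252 yields two characters, U+00C2 and U+00B0.
misread = utf8_bytes.decode("cp1252")

# Reversing the wrong decode and decoding as UTF-8 recovers the original.
repaired = misread.encode("cp1252").decode("utf-8")

print(len(degree), len(misread), repaired == degree)
```

The round trip in the last step only works when every misread character can be encoded back to Windows-1252, so fixing the decoding at the source is the more general solution.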