I’m extremely confused over unicode in Python 2.x.
I’m using BeautifulSoup to scrape a webpage, and I’m trying to insert the things I find into a dictionary with the name as the key, and the url as the value.
I’m using BeautifulSoup’s find function to get the info I need. My code started out as follows:
name = i.find('a').string
url = i.find('a').get('href')
This works, except that the thing returned from find is an object, not a string.
Here’s where things start confusing me.
If I try to convert it to type str before I assign it to the variable, it sometimes throws a UnicodeEncodeError:
'ascii' codec can't encode character u'\xa0' in position 5: ordinal not in range(128)
I Googled around and found that I should be encoding to ASCII, so I tried adding:
print str(i.find('a').string).encode('ascii', 'ignore')
No luck; it still gives a Unicode error.
From there, I tried using repr.
print repr(i.find('a').string)
And that works… almost!
I ran into a new problem here.
Once everything is said and done, and the dictionary is built, I can’t bloody access anything! It keeps giving me a KeyError.
I can loop over the dict:
for i in sorted(data.iterkeys()):
    print i
>>> u'Key1'
>>> u'Key2'
>>> u'Key3'
>>> u'Key4'
but if I try to access an item of the dict like this:
print data['key1']
OR
print data[u'key1']
OR
test = unicode('key1')
print data[test]
They all return KeyErrors, which is 100% confusing to me. I assume it’s got something to do with them being Unicode objects.
I’ve tried just about everything I can come up with, but I can’t figure out what’s going on.
Oh! Adding to the oddity, this code:
name = repr(i.find('a').string)
print type(name)
returns
>>> <type 'str'>
but if I just print the thing
print name
it shows it as a unicode string
>>> u'string name'
The .string value is indeed not a string. It’s a unicode-like object called NavigableString, so you need to cast it with unicode(). If you really need it to be a str instead, you can encode the unicode value from there with an explicit codec. For use in a dict I’d use unicode() objects and not encode.
To understand the difference between unicode() and str() and what encoding to use, I recommend you read the Python Unicode HOWTO.
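Here is a minimal sketch of that advice, not the exact code for your page. To keep it runnable without BeautifulSoup, it assumes a made-up FakeNavigableString class in place of the real NavigableString, and the URL is a placeholder. The text_type alias is also my own addition: it is unicode on Python 2.x and str on Python 3, so in your code you would simply call unicode().

```python
# Sketch only. FakeNavigableString is a made-up stand-in for BeautifulSoup's
# NavigableString so this runs without bs4; the URL is a placeholder too.
text_type = type(u"")  # unicode on Python 2.x, str on Python 3


class FakeNavigableString(text_type):
    """A text subclass, roughly what i.find('a').string hands back."""


raw = FakeNavigableString(u"Key1\xa0")  # \xa0 is a non-breaking space

# Cast to a plain text object (unicode(raw) in Python 2.x) and key on that.
name = text_type(raw)
data = {name: u"http://example.com/key1"}

# Lookups have to match the key exactly, including case and the \xa0.
hit = data.get(u"Key1\xa0")
miss = data.get(u"key1\xa0")  # lower-case "k": not the same key

# repr() returns the quoted, escaped form of the text, not the text itself,
# so keys built with repr() will not match the plain name.
quoted = repr(name)

# Encode only when bytes are really needed, and name the codec explicitly.
as_utf8 = name.encode("utf-8")

# Encoding to ASCII is what fails on characters such as \xa0.
try:
    name.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
```

Two things in the question probably explain the KeyErrors. Keys built with repr() contain the quotes and the u prefix as literal characters, and the printed keys start with a capital "K" while the lookups use 'key1'.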