I’m trying to write a Python script that queries Google search, and I’m getting some Unicode issues. My really basic PoC so far is:
#!/usr/bin/env python
import urllib2
from bs4 import BeautifulSoup
query = "filetype%3Apdf"
url = "http://www.google.com/search?sclient=psy-ab&hl=en&site=&source=hp&q="+query+"&btnG=Search"
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
response = opener.open(url)
data = response.read()
data = data.decode('UTF-8', 'ignore')
data = data.encode('UTF-8', 'ignore')
soup = BeautifulSoup(data)
print u""+soup.prettify('UTF-8')
My traceback is:
Traceback (most recent call last):
  File "./google.py", line 22, in <module>
    print u""+soup.prettify('UTF-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 48786: ordinal not in range(128)
Any ideas?
You are converting your soup tree to UTF-8 (an encoded byte string), then trying to concatenate this to an empty u'' unicode string.

Python will automatically try to decode your encoded byte string using the default encoding, which is ASCII, and it fails to decode the UTF-8 data.

You need to explicitly decode the prettify() output, for example by calling .decode('UTF-8') on it before concatenating it with the unicode string.

The Python Unicode HOWTO explains this better, including how default encodings work. I really, really recommend you read Joel Spolsky’s article on Unicode as well.
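To make the explicit decode concrete, here is a minimal sketch that avoids the network and BeautifulSoup entirely. The `encoded` value is a hypothetical stand-in for what `soup.prettify('UTF-8')` returns (a UTF-8 byte string containing the byte 0xc2); the variable names are illustrative and not from the original script. It is written to run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Stand-in (assumption) for soup.prettify('UTF-8'): an encoded byte string.
# U+00A0 (no-break space) encodes to the bytes 0xc2 0xa0 in UTF-8.
encoded = u"caf\u00e9\u00a0menu".encode("UTF-8")

# Explicitly decode with the codec that produced the bytes, then concatenate.
text = u"" + encoded.decode("UTF-8")

# What Python 2 does implicitly for u"" + encoded: decode as ASCII,
# which cannot handle any byte above 0x7f.
try:
    encoded.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(repr(text))
print("ASCII decode succeeded:", ascii_ok)
```

Applied to the script in the question, the last line would become `print u"" + soup.prettify('UTF-8').decode('UTF-8')`. Note that on Python 3 the mixed concatenation `u"" + encoded` raises a TypeError instead of attempting an implicit ASCII decode, so the explicit decode is needed there as well.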