The code below extracts links from a web page and shows them in a browser. With a lot of UTF-8 encoded webpages this works great. But the French Wikipedia page http://fr.wikipedia.org/wiki/États_unis for example produces an error.
# -*- coding: utf-8 -*-
print 'Content-Type: text/html; charset=utf-8\n'
print '''<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<title>Show Links</title>
</head>
<body>'''
import urllib2, lxml.html as lh
def load_page(url):
    headers = {'User-Agent' : 'Mozilla/5.0 (compatible; testbot/0.1)'}
    try:
        req = urllib2.Request(url, None, headers)
        response = urllib2.urlopen(req)
        page = response.read()
        return page
    except:
        print '<b>Couldn\'t load:', url, '</b><br>'
        return None
def show_links(page):
    tree = lh.fromstring(page)
    for node in tree.xpath('//a'):
        if 'href' in node.attrib:
            url = node.attrib['href']
            if '#' in url:
                url=url.split('#')[0]
            if '@' not in url and 'javascript' not in url:
                if node.text:
                    linktext = node.text
                else:
                    linktext = '-'
                print '<a href="%s">%s</a><br>' % (url, linktext.encode('utf-8'))
page = load_page('http://fr.wikipedia.org/wiki/%C3%89tats_unis')
show_links(page)
print '''
</body>
</html>
'''
I get the following error:
Traceback (most recent call last):
  File "C:\***\question.py", line 42, in <module>
    show_links(page)
  File "C:\***\question.py", line 39, in show_links
    print '<a href="%s">%s</a><br>' % (url, linktext.encode('utf-8'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
My system: Python 2.6 (Windows), lxml 2.3.3, Apache Server (to show the results)
What am I doing wrong?
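You need to encode url too.
The problem might be similar to what happens when a Unicode string and a non-ASCII bytestring are mixed in a single formatting expression. In Python 2, "%s%s" % (u"", "\xc3\x89") raises UnicodeDecodeError, whereas "%s%s" % (u"".encode('utf-8'), "\xc3\x89") works and returns the bytestring unchanged.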
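The empty Unicode string forces the whole expression to be converted to Unicode, so Python 2 tries to decode the bytestring with the default ASCII codec. That is why you see a UnicodeDecodeError even though your code only calls encode.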
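To make the hidden step visible, it can be spelled out by hand. The sketch below is not from the original post: it assumes a hypothetical link text of 'États-Unis' and performs explicitly the ASCII decode that Python 2 applies to a bytestring when the other operand is Unicode.

```python
# Hypothetical link text, written with an escape so the source file stays ASCII.
linktext = u'\xc9tats-Unis'

# This is what the question's code passes to the "%" operator: UTF-8 bytes.
link_bytes = linktext.encode('utf-8')

# When another operand is a Unicode string, Python 2 first turns the
# bytestring back into Unicode using the default 'ascii' codec. Doing that
# step explicitly raises the same kind of error.
try:
    link_bytes.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(repr(error))
```

The failing byte is the first byte of the UTF-8 encoding of 'É', which is outside the ASCII range.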
In general it is a bad idea to mix Unicode and bytestrings. It might appear to be working but sooner or later it breaks. Convert text to Unicode as soon as you receive it, process it and then convert it to bytes for I/O.
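One way to apply that advice to the print line in the question is sketched below. The helper names to_text and format_link are made up for illustration and are not part of lxml or the original code. The sketch assumes the page data is UTF-8 and, like the question's code, leaves out HTML escaping.

```python
def to_text(value, encoding='utf-8'):
    """Return value as a Unicode string; decode it if it is a bytestring."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def format_link(url, linktext):
    """Build the anchor entirely in Unicode and encode once for output."""
    html = u'<a href="%s">%s</a><br>' % (to_text(url), to_text(linktext) or u'-')
    return html.encode('utf-8')


# Mixed input: a Unicode URL and a UTF-8 bytestring as link text.
line = format_link(u'/wiki/\xc9tats-Unis', b'\xc3\x89tats-Unis')
print(repr(line))
```

Because both values are Unicode before formatting, no implicit ASCII decode takes place, and the single encode at the end produces the bytes that go to the browser.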