I’m trying to parse a website’s source and extract the text inside it using the lxml library. Here is my code:
import urllib2
from StringIO import StringIO
from lxml import html
from lxml.html.clean import Cleaner
cleaner = Cleaner(page_structure = False)
htmlsource = cleaner.clean_html(urllib2.urlopen("http://www.verycd.com/").read())
htmltree = html.parse(StringIO(htmlsource.decode("utf-8"))).getroot()
listnode = htmltree.xpath("*")
for node in listnode:
    print node.text_content().strip().encode("utf-8")
When I run the code in the interactive console (dev environment), the result looks like this:
VeryCD电驴大全 - 分享互联网
用户名:
密码:记住我
免费注册
|
忘记密码?
首页 |
商城 |
专题 |
乐园 |
社区 |
电驴 |
网页游戏 |
网址大全
But in the production environment, all Unicode characters are displayed incorrectly:
VeryCDçµé©´å¤§å¨ - å享äºèç½
ç¨æ·åï¼
å¯ç ï¼è®°ä½æÂ
Â
å费注å
|
å¿è®°å¯ç ï¼
é¦é¡µ |
åå |
ä¸é¢ |
ä¹å |
ç¤¾åº |
çµé©´ |
ç½é¡µæ¸¸æ |
ç½å大å¨
Any idea how I can fix this?
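For what it's worth, the garbled production output is consistent with UTF-8 bytes being decoded as Latin-1 somewhere along the way. Here is a minimal sketch of that effect, written in Python 3 syntax (unlike the Python 2 code above) and using only a two-character sample from the page title. The variable names are illustrative and not part of the original code.

```python
# Two characters taken from the page title shown in the expected output.
original = "电驴"

# The page is served as UTF-8, so each character becomes three bytes.
utf8_bytes = original.encode("utf-8")

# Decoding those bytes as Latin-1 maps every byte to one character,
# which produces the kind of garbage seen in production.
garbled = utf8_bytes.decode("latin-1")

# The damage is reversible as long as no bytes were dropped:
# re-encode as Latin-1 to recover the bytes, then decode as UTF-8.
repaired = garbled.encode("latin-1").decode("utf-8")

print(ascii(garbled))
print(repaired == original)
```

Each Chinese character turns into three Latin-1 characters, which matches the pattern in the production output.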
EDIT
It seems I have found the problem. I think there is something wrong with the lxml build bundled with GAE. If I don’t run the cleaner before parsing the HTML, the output is fine.
# cleaner = Cleaner(page_structure = False)
# htmlsource = cleaner.clean_html(urllib2.urlopen("http://www.verycd.com/").read())
htmlsource = urllib2.urlopen("http://www.verycd.com/").read()
htmltree = html.parse(StringIO(htmlsource.decode("utf-8"))).getroot()
Update: This bug is fixed in App Engine, so the following work-around should no longer be necessary.
I’ve accepted this as a bug in either lxml or App Engine. But you can work around it using lxml.etree.parse and lxml.etree.HTMLParser (note that lxml.html is a simple wrapper around these two). That is:
Create the HTMLParser with encoding='utf-8'.
Use etree.parse instead of html.parse; pass parser=htmlparser to etree.parse.
Use node.text instead of node.text_content().
This works around the bug by explicitly telling the HTMLParser to use UTF-8 encoding instead of having it guess (it guesses Latin-1 incorrectly).
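The steps above can be sketched roughly as follows. This is an illustration rather than the exact code from the answer: it uses Python 3 syntax, and a small hard-coded UTF-8 byte string (a hypothetical stand-in for the downloaded page) replaces the urllib2 call so that the example runs without network access. The name htmlparser follows the wording of the steps; texts is an illustrative name.

```python
from io import BytesIO

from lxml import etree

# Hypothetical stand-in for the downloaded page: raw UTF-8 bytes with no
# charset declaration, so a parser left to guess could pick the wrong encoding.
htmlsource = (
    "<html><head><title>VeryCD电驴大全</title></head>"
    "<body><p>用户名:</p></body></html>"
).encode("utf-8")

# Tell the parser the encoding explicitly instead of letting it guess.
htmlparser = etree.HTMLParser(encoding="utf-8")

# Use etree.parse (not html.parse) and hand it the configured parser.
htmltree = etree.parse(BytesIO(htmlsource), parser=htmlparser).getroot()

# Plain etree elements have no text_content() method, so read node.text.
texts = [
    node.text.strip()
    for node in htmltree.iter()
    if node.text and node.text.strip()
]
```

Note that node.text only holds the text directly inside an element before its first child, so it is not a full substitute for text_content() on deeply nested markup.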