I’m trying to parse a website’s source to extract its text using the lxml library.

Asked: June 4, 2026 (2026-06-04T00:28:07+00:00)

I’m trying to parse a website’s source and extract its text using the lxml library. Here is my code:

import urllib2
from StringIO import StringIO
from lxml import html
from lxml.html.clean import Cleaner

cleaner = Cleaner(page_structure = False)
htmlsource = cleaner.clean_html(urllib2.urlopen("http://www.verycd.com/").read())
htmltree = html.parse(StringIO(htmlsource.decode("utf-8"))).getroot()
listnode = htmltree.xpath("*")
for node in listnode:
  print node.text_content().strip().encode("utf-8")

When I run the code in the interactive console (dev environment), the result looks like this:

VeryCD电驴大全 - 分享互联网
用户名：
        密码：记住我 


        免费注册
         |
        忘记密码？



            首页 |
            商城 |
            专题 |
            乐园 |

            社区 |
            电驴 |
            网页游戏 |
            网址大全

But in the production environment, all Unicode characters are displayed incorrectly:

VeryCDçµé©´å¤§å¨ - åäº«äºèç½
ç¨æ·åï¼
        å¯ç ï¼è®°ä½æÂ 

        Â 
        åè´¹æ³¨å
         |
        å¿è®°å¯ç ï¼



            é¦é¡µ |
            åå |
            ä¸é¢ |
            ä¹å |

            ç¤¾åº |
            çµé©´ |
            ç½é¡µæ¸¸æ |
            ç½åå¤§å¨

Any idea how I can fix this?

EDIT

It seems I have found the problem. I think something is wrong with the lxml build that ships with GAE. If I don’t run the Cleaner before parsing the HTML, the output is fine:

# cleaner = Cleaner(page_structure = False)
# htmlsource = cleaner.clean_html(urllib2.urlopen("http://www.verycd.com/").read())
htmlsource = urllib2.urlopen("http://www.verycd.com/").read()
htmltree = html.parse(StringIO(htmlsource.decode("utf-8"))).getroot()

1 Answer

Editorial Team · Answer 1 · 2026-06-04T00:28:08+00:00

Update: This bug is fixed in App Engine, so the following work-around should no longer be necessary.

I’ve accepted this as a bug in either lxml or App Engine. But you can work around it using lxml.etree.parse and lxml.etree.HTMLParser (note that lxml.html is a simple wrapper around these two):

import urllib2
from StringIO import StringIO
from lxml import etree
from lxml.html.clean import Cleaner

cleaner = Cleaner(page_structure = False)
htmlsource = cleaner.clean_html(urllib2.urlopen("http://www.verycd.com/").read())
htmlparser = etree.HTMLParser(encoding='utf-8')
htmltree = etree.parse(StringIO(htmlsource.decode("utf-8")),
                       parser=htmlparser).getroot()
listnode = htmltree.xpath("*")
for node in listnode:
  # node.text is None when an element has no leading text, so fall back to u""
  print (node.text or u"").strip().encode("utf-8")

That is:

1. Create an HTMLParser object, explicitly setting encoding='utf-8'.
2. Use etree.parse instead of html.parse; pass parser=htmlparser to etree.parse.
3. Use node.text instead of node.text_content().

This works around the bug by explicitly telling the HTMLParser to use UTF-8 encoding instead of having it guess (it guesses Latin-1 incorrectly).
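
To see why a wrong Latin-1 guess produces exactly the kind of output shown in the question, here is a small sketch using only the standard library. It is written for Python 3 (an assumption; the code above is Python 2), and the sample string is just one of the menu labels from the question. It does not involve lxml or App Engine at all; it only illustrates what happens when UTF-8 bytes are decoded as Latin-1.

```python
# Illustration only: reproduce the garbling by decoding UTF-8 bytes as Latin-1.
original = "首页"  # one of the menu labels from the question's output

# What the server actually sends: the text encoded as UTF-8 bytes.
utf8_bytes = original.encode("utf-8")

# What a parser that wrongly assumes Latin-1 does with those bytes:
# every single byte becomes one character, so each Chinese character
# turns into three unrelated Latin-1 characters.
garbled = utf8_bytes.decode("latin-1")
print(repr(garbled))

# Because Latin-1 maps every byte to exactly one character, the mistake
# is reversible: re-encode as Latin-1 and decode as UTF-8.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)
```

The garbled string begins with "é", just like the "é¦é¡µ" line in the production output. Telling the parser the encoding up front avoids the wrong guess in the first place, which is preferable to repairing the text afterwards.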
