I've been trying to parse an HTML page in Python using lxml.html

Editorial Team

Asked: May 19, 2026 (2026-05-19T12:55:32+00:00)

I've been trying to parse an HTML page in Python using lxml.html.

I used the following code:

import lxml.html as H
page = open('page.html', 'r').read()
doc = H.fromstring(page)
print H.tostring(doc)

page.html is a web page I downloaded with a proxy program I wrote earlier, which handles the proxying and converts the encoding. The file itself has been re-encoded as UTF-8, but the charset declaration in the page still reads:

<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />

By the way, gb2312 is a Chinese character set.
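
To make the mismatch described above concrete, here is a minimal sketch using only the standard library. It is written for Python 3 (unlike the snippet above), and the sample text is an assumption standing in for the Chinese content of page.html. It shows that bytes written as UTF-8 do not come back as the same text when read as gb2312:

```python
# Sample Chinese text; a stand-in for the real page content (assumption).
text = "\u4e2d\u6587"

# This is what the proxy program effectively did: store the text as UTF-8.
utf8_bytes = text.encode("utf-8")


def decode_as(data, charset):
    """Return the decoded text, or None if the bytes are invalid in charset."""
    try:
        return data.decode(charset)
    except UnicodeDecodeError:
        return None


as_utf8 = decode_as(utf8_bytes, "utf-8")      # the real encoding
as_gb2312 = decode_as(utf8_bytes, "gb2312")   # the encoding the meta tag claims

print(as_utf8 == text)    # the real encoding round-trips the text
print(as_gb2312 == text)  # the declared encoding does not
```

A parser that trusts the meta tag is in the position of the second call: it reads UTF-8 bytes under the wrong character set.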

When I first ran the Python code above, it printed nothing but an empty HTML structure, which is wrong and not what I wanted.

I tried several things and eventually found that the charset declaration was the cause: when I replaced 'charset=gb2312' with an empty string, the parsing code worked as I expected.

But I don't quite understand why this happens. Is the way I solved the problem the right method, or did it only work by coincidence?
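
One alternative to editing the file that may be worth trying is to tell the parser the real encoding explicitly, so it does not rely on the stale declaration. The sketch below is written for Python 3 and uses an inline byte string as a hypothetical stand-in for page.html. It assumes that the `encoding` argument of `lxml.html.HTMLParser` takes precedence over the meta tag, which is how the lxml documentation describes it, but I have not verified it against the real page:

```python
import lxml.html as H

# Hypothetical stand-in for page.html: UTF-8 bytes carrying a gb2312 declaration.
page = (
    '<html><head>'
    '<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />'
    '</head><body><p>\u4e2d\u6587</p></body></html>'
).encode('utf-8')

# Pass the actual encoding to the parser instead of trusting the meta tag.
parser = H.HTMLParser(encoding='utf-8')
doc = H.fromstring(page, parser=parser)

# Pull out the paragraph text to check that it survived parsing.
paragraph_text = doc.findtext('.//p')
print(paragraph_text)
```

For the real file, the bytes would come from `open('page.html', 'rb').read()`; reading in binary mode keeps the data as raw bytes so the parser's `encoding` setting is what decides how they are decoded.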