I need to download and parse a webpage with lxml and build UTF-8 XML output.

Asked: May 14, 2026 (2026-05-14T14:19:11+00:00)

I need to download and parse a webpage with lxml and build UTF-8 XML output. I think a sketch in pseudocode is more illustrative:

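import urllib2
from lxml import etree

# url, my_process_text and outputfile are placeholders defined elsewhere
webfile = urllib2.urlopen(url)
# etree.parse() expects a file-like object, not the bytes returned by read()
tree = etree.parse(webfile, parser=etree.HTMLParser(recover=True))

# xpath() returns a list, so take the first <body> element
body = tree.xpath('/html/body')[0]
# encoding=unicode yields a unicode string instead of encoded bytes
txt = my_process_text(etree.tostring(body, encoding=unicode))

output = etree.Element("out")
output.text = txt  # must be unicode (or plain ASCII)

# encode to UTF-8 once, when writing the result
outputfile.write(etree.tostring(output, encoding='utf-8'))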

So webfile can be in any encoding (lxml should handle this). The output file has to be in UTF-8. I'm not sure where to specify the encoding. Is this approach OK? (I can't find a good tutorial about lxml and encoding, but I can find many reports of problems with it…) I need a robust solution.

Edit:

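To send UTF-8 to lxml, I use:

        # UnicodeDammit is from BeautifulSoup 3: from BeautifulSoup import UnicodeDammit
        converted = UnicodeDammit(webfile, isHTML=True)
        if not converted.unicode:
            # use % so the tried encodings are formatted into the message
            print "ERR. UnicodeDammit failed to detect encoding, tried [%s]" % \
                ', '.join(converted.triedEncodings)
            continue
        webfile = converted.unicode.encode('utf-8')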

1 Answer

Editorial Team · Answer 1 · 2026-05-14T14:19:11+00:00

lxml can be a little wonky about input encodings. It is best to send UTF-8 in and get UTF-8 out.

You might want to use the chardet module or UnicodeDammit to decode the actual data.

You’d want to do something vaguely like:

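import urllib2
import chardet
from lxml import html

content = urllib2.urlopen(url).read()
# chardet may return None when it cannot guess; fall back to UTF-8
encoding = chardet.detect(content)['encoding'] or 'utf-8'
if encoding.lower() != 'utf-8':
    # undecodable bytes become U+FFFD instead of raising
    content = content.decode(encoding, 'replace').encode('utf-8')
doc = html.fromstring(content, base_url=url)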
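
If chardet is not available, the same "normalize to UTF-8 before parsing" step can be roughly approximated with the standard library alone. The following is only a sketch in Python 3, not what chardet or UnicodeDammit do internally: the function name `to_utf8`, the `declared` parameter, and the `cp1252` fallback are all assumptions chosen for illustration. It tries a declared charset first, then strict UTF-8, and only then falls back to a lossy decode.

```python
def to_utf8(raw, declared=None, fallback="cp1252"):
    """Return (utf8_bytes, encoding_used) for bytes of unknown encoding.

    `declared` and `fallback` are illustrative assumptions, not lxml or
    chardet parameters.
    """
    # Try the declared charset first (if any), then strict UTF-8.
    candidates = [enc for enc in (declared, "utf-8") if enc]
    for enc in candidates:
        try:
            return raw.decode(enc).encode("utf-8"), enc
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: try the next candidate.
            continue
    # Last resort: never raise; undecodable bytes become U+FFFD.
    return raw.decode(fallback, "replace").encode("utf-8"), fallback
```

In Python 3 the declared charset could plausibly come from `urllib.request.urlopen(url).headers.get_content_charset()`. Note that a wrong single-byte charset will still decode without error, so this is weaker than real detection.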

I’m not sure why you are moving between lxml and etree, unless you are interacting with another library that already uses etree?
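
To make the "where do I encode" part concrete, here is a minimal Python 3 sketch of the whole pipeline from the question: bytes in, Unicode in the middle, UTF-8 bytes out. The function name `build_utf8_xml` is made up for this example, and the text extraction with `itertext()` is only a stand-in for the asker's `my_process_text`, so treat it as one possible arrangement rather than the definitive one.

```python
from lxml import etree


def build_utf8_xml(raw_bytes, encoding=None):
    """Parse HTML bytes and return a UTF-8 encoded XML document as bytes."""
    # encoding=None lets the parser detect the charset itself;
    # pass a name such as "iso-8859-1" to override the detection.
    parser = etree.HTMLParser(recover=True, encoding=encoding)
    root = etree.fromstring(raw_bytes, parser)

    # Work with str (Unicode) in the middle of the pipeline.
    body = root.find("body")
    if body is None:
        text = ""
    else:
        # Stand-in for my_process_text: join all text nodes under <body>.
        text = "".join(body.itertext()).strip()

    output = etree.Element("out")
    output.text = text

    # Encode exactly once, when serializing the result.
    return etree.tostring(output, encoding="utf-8", xml_declaration=True)
```

The return value is already encoded, so it should be written to a file opened in binary mode (`"wb"`). The input would typically be the result of `urllib.request.urlopen(url).read()`.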
