I want to parse downloaded RSS with lxml, but I don’t know how to

Question

0

Asked: May 21, 20262026-05-21T18:12:37+00:00 2026-05-21T18:12:37+00:00

I want to parse downloaded RSS with lxml, but I don’t know how to

0

I want to parse downloaded RSS with lxml, but I don’t know how to handle with UnicodeDecodeError?

request = urllib2.Request('http://wiadomosci.onet.pl/kraj/rss.xml')
response = urllib2.urlopen(request)
response = response.read()
encd = chardet.detect(response)['encoding']
parser = etree.XMLParser(ns_clean=True,recover=True,encoding=encd)
tree = etree.parse(response, parser)

But I get an error:

tree   = etree.parse(response, parser)
File "lxml.etree.pyx", line 2692, in lxml.etree.parse (src/lxml/lxml.etree.c:49594)
  File "parser.pxi", line 1500, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:71364)
  File "parser.pxi", line 1529, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:71647)
  File "parser.pxi", line 1429, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:70742)
  File "parser.pxi", line 975, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:67
740)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etr
ee.c:63824)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64745)
  File "parser.pxi", line 559, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64027)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 97: ordinal not in range(128)

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-21T18:12:38+00:00

You should probably only be trying to define the character encoding as a last resort, since it’s clear what the encoding is based on the XML prolog (if not by the HTTP headers.) Anyway, it’s unnecessary to pass the encoding to etree.XMLParser unless you want to override the encoding; so get rid of the encoding parameter and it should work.

Edit: okay, the problem actually seems to be with lxml. The following works, for whatever reason:

parser = etree.XMLParser(ns_clean=True, recover=True)
etree.parse('http://wiadomosci.onet.pl/kraj/rss.xml', parser)

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I want to parse downloaded RSS with lxml, but I don’t know how to

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply