I am using Python & lxml and am stuck with an error


Editorial Team

Asked: June 2, 2026 (2026-06-02T05:54:52+00:00)


I am using Python & lxml and am stuck with an error

My code

>>> import urllib
>>> from lxml import html

>>> response = urllib.urlopen('http://www.edmunds.com/dealerships/Texas/Grapevine/GrapevineFordLincoln_1/fullservice-505318162.html').read()
>>> dom = html.fromstring(response)

>>> dom.xpath("//div[@class='description item vcard']")[0].xpath(".//p[@class='service-review-paragraph loose-spacing']")[0].text_content()

Traceback

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/lxml/html/__init__.py", line 249, in text_content
    return _collect_string_content(self)
  File "xpath.pxi", line 466, in lxml.etree.XPath.__call__ (src/lxml/lxml.etree.c:119105)
  File "xpath.pxi", line 242, in lxml.etree._XPathEvaluatorBase._handle_result (src/lxml/lxml.etree.c:116936)
  File "extensions.pxi", line 552, in lxml.etree._unwrapXPathObject (src/lxml/lxml.etree.c:112473)
  File "apihelpers.pxi", line 1344, in lxml.etree.funicode (src/lxml/lxml.etree.c:21864)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x93 in position 477: invalid start byte

The problem is the special character which is present in the div I am fetching. How can I encode/decode the text without losing any data?
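
The failure can be reproduced without fetching the page. The sketch below is only an illustration: the sample bytes are made up (the real page content is not shown here), and it should run on both Python 2.7 and Python 3. It shows that byte 0x93 is rejected by the UTF-8 codec, while ISO-8859-1 and Windows-1252 both accept it but map it to different characters.

```python
# Hypothetical sample bytes containing 0x93/0x94; not taken from the real page.
raw = b'He said \x93hello\x94'

try:
    raw.decode('utf-8')
    reason = None
except UnicodeDecodeError as exc:
    # 0x93 is a continuation byte in UTF-8, so it cannot start a character.
    reason = exc.reason

# ISO-8859-1 maps every byte to the code point with the same number,
# so 0x93 becomes U+0093 (a C1 control character).
latin1 = raw.decode('iso-8859-1')

# Windows-1252 maps 0x93/0x94 to curly double quotes (U+201C/U+201D).
cp1252 = raw.decode('cp1252')

print(reason)
print(repr(latin1))
print(repr(cp1252))
```

Neither decoding loses data, but only the Windows-1252 result contains the quotation marks a reader would expect to see.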


1 Answer

Editorial Team · Answer 1 · 2026-06-02T05:54:53+00:00

The parser assumes the page is UTF-8, but it is not. The simplest fix is to decode it to unicode yourself first, using the encoding the server declares for the page:

>>> import codecs
>>> url = urllib.urlopen('http://www.edmunds.com/dealerships/Texas/Grapevine/GrapevineFordLincoln_1/fullservice-505318162.html')
>>> url.headers.get('content-type')
'text/html; charset=ISO-8859-1'

>>> response = url.read()
>>> # decode the raw bytes to unicode first
>>> response_unicode = codecs.decode(response, 'ISO-8859-1')
>>> dom = html.fromstring(response_unicode)
>>> # and now the same query works
>>> dom.xpath("//div[@class='description item vcard']")[0].xpath(".//p[@class='service-review-paragraph loose-spacing']")[0].text_content()
u'\n                  On December 5th, my vehicle completely shut down.\nI had it towed to Grapevine Ford where they told me that the intak.....

tada!
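
The same idea can be wrapped in a small helper so the charset is not hard-coded. This is a minimal sketch, not part of lxml or urllib: `decode_body`, `CHARSET_RE` and `LATIN1_ALIASES` are hypothetical names introduced here. It assumes the `Content-Type` header is trustworthy, and it makes one deliberate substitution: a declared ISO-8859-1 is tried as Windows-1252 first, which is how web browsers treat that label, so that bytes like 0x93 come out as curly quotes rather than control characters.

```python
import re

# Hypothetical helper names; only the stdlib is used.
CHARSET_RE = re.compile(r'charset\s*=\s*["\']?([\w.:-]+)', re.I)
LATIN1_ALIASES = ('iso-8859-1', 'latin-1', 'latin1')


def decode_body(raw, content_type, default='utf-8'):
    """Decode raw response bytes using the charset from a Content-Type header."""
    match = CHARSET_RE.search(content_type or '')
    charset = match.group(1).lower() if match else default

    # Assumption: pages labelled ISO-8859-1 are usually Windows-1252.
    if charset in LATIN1_ALIASES:
        charset = 'cp1252'

    try:
        return raw.decode(charset)
    except (UnicodeDecodeError, LookupError):
        # ISO-8859-1 maps every byte to a code point, so this never fails
        # and no bytes are dropped.
        return raw.decode('iso-8859-1')


text = decode_body(b'\x93quoted\x94', 'text/html; charset=ISO-8859-1')
print(repr(text))
```

In the session above, the call would be `decode_body(response, url.headers.get('content-type'))`, and the result can be passed to `html.fromstring` in place of `response_unicode`.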
