I have this issue trying to get all the text nodes in an HTML

Question

Asked: June 6, 2026 (2026-06-06T08:15:39+00:00)

I’m trying to get all the text nodes in an HTML document using lxml, but I get a UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 8995: ordinal not in range(128). However, when I check the encoding of the page (encoding = chardet.detect(response)['encoding']), it says it’s utf-8. It seems weird that a single page would be both utf-8 and ascii. Actually, this:

fromstring(response).text_content().encode('ascii', 'replace')

solves the problem.

Here is my code:

from lxml.html import fromstring
import urllib2
import chardet

# my_url is the address of the page to fetch (defined elsewhere)
request = urllib2.Request(my_url)
request.add_header('User-Agent',
                   'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)')
request.add_header("Accept-Language", "en-us")
response = urllib2.urlopen(request).read()

# Guess the encoding of the raw bytes
encoding = chardet.detect(response)['encoding']

print encoding
print fromstring(response).text_content()

Output:

utf-8
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 8995: ordinal not in range(128)

What can I do to solve this issue? Keep in mind that I want to do this with a few other pages, so I don’t want to handle the encoding on a per-page basis.

UPDATE:

Maybe there is something else going on here. When I run this script in the terminal, I get the correct output, but when I run it inside SublimeText, I get the UnicodeEncodeError. Why?

UPDATE2:

It’s also happening when I write this output to a file. .encode('ascii', 'replace') works, but I’d like a more general solution.

Regards

1 Answer

Editorial Team · Answer 1 · 2026-06-06T08:15:43+00:00

Can you try wrapping your string with repr()?
This article might help.

print repr(fromstring(response).text_content())
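Note that repr() only avoids the error by showing escapes such as \xe9 instead of the real characters. For the more general solution asked about in the question, here is a rough sketch, assuming UTF-8 is the output encoding you want. The likely cause is that text_content() returns a unicode string and Python 2 encodes it implicitly when printing or writing; if the output stream reports no encoding (which is probably what happens in the SublimeText console), it falls back to ASCII. Encoding explicitly sidesteps that. The helper names write_text and print_text are made up for this example and are not part of lxml or the standard library.

```python
# -*- coding: utf-8 -*-
import io
import sys


def write_text(text, path, encoding="utf-8"):
    """Write a unicode string to `path` using an explicit encoding."""
    # io.open takes an encoding on both Python 2 and 3, so the text is
    # never pushed through the default ASCII codec.
    with io.open(path, "w", encoding=encoding) as handle:
        handle.write(text)


def print_text(text, stream=None, encoding="utf-8"):
    """Write a unicode string to a stream as already-encoded bytes."""
    if stream is None:
        stream = sys.stdout
    data = (text + u"\n").encode(encoding)
    # Python 3 text streams expose their byte stream as .buffer;
    # Python 2 file objects (and byte buffers) accept bytes directly.
    target = getattr(stream, "buffer", stream)
    target.write(data)
    target.flush()
```

With these in place, the last line of the script in the question could become print_text(fromstring(response).text_content()), and the file case could use write_text(...) with a path of your choice. If you would rather not touch the code, setting the environment variable PYTHONIOENCODING=utf-8 before running the script should have a similar effect on printing, assuming your editor lets you set environment variables for the build.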
