Assume I have a document which uses Unicode in tag names, as for example

Question


Asked: June 1, 2026 (2026-06-01T17:51:43+00:00)


Assume I have a document that uses Unicode in tag names, for example <año>2012</año>.

When I use etree from lxml to parse such a document, I have no problems; the tree is built correctly. But when (for debugging purposes) I try to print some elements, I get an exception about a failed attempt to encode a Unicode character as ASCII.

It is not a problem of terminal configuration or bad encoding of the file, since I can print the name of the node (.tag), which contains the same Unicode character, without any problem. Apparently the problem is caused by the "stringification" of the Element object, which assumes that tag names are always plain ASCII.

The following code shows the problem (and also shows that it is not a file/terminal/encoding problem).

# coding: utf-8
from lxml import etree
doc = """<?xml version="1.0" encoding="utf-8"?>
<año>2012</año>
"""
x = etree.fromstring(doc)   # No problem
print x.tag                 # No problem
print x                     # Exception

Running the above script in a terminal with a properly defined LC_CTYPE produces the following output:

año
Traceback (most recent call last):
  File "procesar.py", line 8, in <module>
    print x
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1' in position 10: ordinal not in range(128)

Note how print x.tag correctly outputs año. Shouldn't print x produce something like <Element año at b7d26eb4>?
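The failure can be reproduced without lxml, which may help narrow down where it comes from. The sketch below is only an illustration: it assumes the stringified element looks like the text guessed at above (the variable names are made up for this example), and it shows that encoding such a string as ASCII fails at index 10, the same position the traceback reports, while encoding it as UTF-8 succeeds.

```python
# Runs on Python 2 and Python 3; u"\xf1" is the character "ñ".
# Assumption: the element's string form resembles the text guessed in the question.
text = u"<Element a\xf1o at b7d26eb4>"

try:
    text.encode("ascii")       # what an ASCII-only output path would attempt
    failed_at = None
except UnicodeEncodeError as exc:
    failed_at = exc.start      # index of the first character that cannot be encoded

# Encoding explicitly as UTF-8 succeeds and yields a byte string.
utf8_bytes = text.encode("utf-8")
```

If the assumption holds, failed_at points at the "ñ" in the tag name, which is consistent with the stringification, not the terminal, being the source of the error.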

Is this a known problem? Any ideas about workarounds?


1 Answer

Editorial Team · Answer 1 · 2026-06-01T17:51:44+00:00

You have to convert Unicode strings into byte strings before output.

Try:

print unicode(x).encode('utf8')

Quoting the documentation of the unicode function:

For objects which provide a __unicode__() method, it will call this method without arguments to create a Unicode string. For all other objects, the 8-bit string version or representation is requested and then converted to a Unicode string using the codec for the default encoding in ‘strict’ mode.
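What "strict" mode means for that conversion can be sketched with plain byte strings. This is a minimal illustration, not lxml's behavior: the sample bytes and variable names are assumptions chosen for the example. It shows that decoding non-ASCII bytes with the ASCII codec in strict mode raises an error, whereas naming the right codec works and a lenient error handler substitutes replacement characters.

```python
# Runs on Python 2 and Python 3.
raw = b"a\xc3\xb1o"            # UTF-8 bytes for "año" (sample data for this sketch)

try:
    raw.decode("ascii")        # error handling defaults to 'strict'
    strict_error_at = None
except UnicodeDecodeError as exc:
    strict_error_at = exc.start    # index of the first undecodable byte

decoded = raw.decode("utf-8")             # correct codec: decodes cleanly
lenient = raw.decode("ascii", "replace")  # bad bytes become U+FFFD
```

The lenient form loses information, so it is only suitable for debugging output; naming the actual encoding is the safer choice.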
