Suppose I have a document that uses non-ASCII characters in tag names, for example <año>2012</año>.
When I use etree from lxml to parse such a document, I have no problems: the tree is built correctly. But when I try to print some elements (for debugging purposes), I get an exception about a failed attempt to encode a Unicode character as ASCII.
It is not a problem of terminal configuration or of a badly encoded file, since I can print the name of the node (.tag), which contains the same Unicode character, without any problem. Apparently the problem is caused by the “stringification” of the Element object, which assumes that tag names are always plain ASCII.
The following code shows the problem (and also shows that it is not a file/terminal/encoding problem).
# coding: utf-8
from lxml import etree
doc = """<?xml version="1.0" encoding="utf-8"?>
<año>2012</año>
"""
x = etree.fromstring(doc) # No problem
print x.tag # No problem
print x # Exception
Running the above script in a terminal with a properly defined LC_CTYPE produces the following output:
año
Traceback (most recent call last):
  File "procesar.py", line 8, in <module>
    print x
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1' in position 10: ordinal not in range(128)
Note that print x.tag correctly outputs año. Shouldn’t print x produce something like <Element año at b7d26eb4>?
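The position reported in the traceback seems consistent with that expected representation. Here is a small check, assuming the text being encoded looks like the string below (the address is simply copied from the example above, and the names expected_repr, bad_index and bad_char are only illustrative):

```python
# coding: utf-8
# Assumption: this text mimics the representation the question expects;
# it is written by hand here, not produced by lxml.
expected_repr = u"<Element año at b7d26eb4>"

try:
    # The same kind of ASCII encoding step that the traceback reports.
    expected_repr.encode("ascii")
    failure = None
except UnicodeEncodeError as exc:
    failure = exc

# failure.start is the index of the first character ASCII cannot represent.
bad_index = failure.start
bad_char = expected_repr[bad_index]
```

Here bad_index is 10 and bad_char is ñ, matching "position 10" and u'\xf1' in the error message, which suggests the exception is raised while the element's representation, not the tag alone, is converted to ASCII.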
Is this a known problem? Any ideas about workarounds?
You have to encode unicode strings into byte strings explicitly before output.
Try:
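One possible sketch of that idea, under a few assumptions: the standard library's xml.etree.ElementTree stands in for lxml so that the example runs without extra packages, and describe is a hypothetical helper, not part of either library. It builds a debug label from .tag, which the question shows is printable, and encodes it explicitly so that no implicit ASCII conversion is involved:

```python
# coding: utf-8
# Sketch only: xml.etree.ElementTree stands in for lxml here, and
# describe() is a hypothetical helper, not a library function.
import xml.etree.ElementTree as ET

# Parse from UTF-8 bytes, matching the encoding declared in the document.
doc = u'<?xml version="1.0" encoding="utf-8"?>\n<año>2012</año>\n'.encode("utf-8")


def describe(element, encoding="utf-8"):
    """Build a debug label from .tag and encode it explicitly to bytes."""
    label = u"<Element %s at 0x%x>" % (element.tag, id(element))
    return label.encode(encoding)


x = ET.fromstring(doc)
label_bytes = describe(x)  # a byte string, safe to write to the terminal
```

Under Python 2, print describe(x) then writes the UTF-8 bytes as they are. The encoding is passed as a parameter because the right value depends on the terminal; utf-8 is only an assumed default.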
quoting the unicode function: