I am scraping a web page with lxml. At one point, I get the content of a table cell.
# get last name
lastNameContainer = tableRow.xpath('./td[@class="lastName"]')
lastName = lastNameContainer[0].text
Unfortunately, one table cell has a character outside of ASCII’s range, which produces this error.
UnicodeEncodeError: 'ascii' codec can't encode characters in position 5-7: ordinal not in range(128)
I tried adding this to the top of my Python file to no avail.
#!/usr/bin/python
# -*- coding: utf-8 -*-
How can I get around this problem? I still want to store this character. This character, by the way, is either ♀ or ♂ depending on the table row.
Update: I realized that the error is triggered when I write the data to a file:
with open('myData.txt', 'w') as myFile:
    myFile.write(lastName + '\n')
Oddly, this still produces the above error:
with open('myData.txt', 'w') as myFile:
    myFile.write(lastName.decode('utf-8') + '\n')
lxml needs its strings in unicode. When I get that exception, I resolve it using decode('utf-8'), i.e.:
E.doc('♀'.decode('utf-8'))
Updated:
Also notice that if lastName is unicode and you try to write a UTF-8 encoded file, you will need to convert it back this way:
lastName.encode('utf-8')
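For the file-writing case in the question, here is a minimal sketch of one way to do it, assuming lastName is already a unicode string (which is what lxml returns for non-ASCII text). Instead of calling encode('utf-8') by hand, it opens the file with io.open and an explicit encoding, so the encoding happens on write. The sample value and the temporary path are made up for illustration; they are not from the original post.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Hypothetical stand-in for the text lxml returned from the table cell.
last_name = u'Smith \u2640'  # \u2640 is the female sign

# Hypothetical output location; the question writes to 'myData.txt'.
path = os.path.join(tempfile.mkdtemp(), 'myData.txt')

# io.open takes an encoding, so unicode text is encoded to UTF-8 on write
# instead of going through the default ASCII codec.
with io.open(path, 'w', encoding='utf-8') as my_file:
    my_file.write(last_name + u'\n')

# Read the file back with the same encoding to check the round trip.
with io.open(path, 'r', encoding='utf-8') as my_file:
    round_tripped = my_file.read()
```

This should also explain why the decode('utf-8') attempt in the question still fails: in Python 2, calling decode on a string that is already unicode first encodes it with the ASCII codec, which raises the same UnicodeEncodeError.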