I have a SharePoint library that captures data entered by users as an XML form. The form is declared as UTF-8, but some of the characters entered by users are not ASCII (e.g. words from French, Spanish, Maori) and are not saved as UTF-8.
Here is an example of such data (abbreviated, without metadata):
<?xml version="1.0" encoding="utf-8"?>
<my:myFields xmlns:my="http://schemas.microsoft.com/etc...">
<my:title>Te whakaako i Te Reo Mäori -- Teaching Te Reo Mäori</my:title>
I am using the parse function in ElementTree (xml.etree.ElementTree) to compile this information into a report, which I am then exporting as CSV and sending off in an Excel spreadsheet. As such I would like to convert both the UTF-8 characters and all user input into a single format that works with Excel (cp1252?):
import os
import xml.etree.ElementTree as ET
course = ET.parse(os.path.join(path, filename))
When I go to write the results of all my calculations to file, I get the following error (for the example XML above):
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 48: ordinal not in range(128)
When I look at the data, I see that the text from the tag has been converted to unicode with ‘\xe4’ in place of the ‘ä’: u'Te whakaako i Te Reo M\xe4ori -- Teaching Te Reo M\xe4ori'.
I would like to be able to have my Excel report include the character ‘ä’, but can’t seem to get it to encode in a way that achieves this.
I am potentially missing some obvious encode/decode point but have been struggling with this for much of the day, so any help is appreciated 🙂
You’re looking for codecs.open().
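A minimal sketch of how that might look, assuming the report rows are already unicode strings taken from the parsed XML. The names rows and out_path are hypothetical placeholders, and the comma join is a simplification that does no CSV quoting:

```python
import codecs
import os
import tempfile

# Hypothetical rows standing in for text pulled out of the parsed XML.
rows = [
    [u"Title"],
    [u"Te whakaako i Te Reo M\xe4ori -- Teaching Te Reo M\xe4ori"],
]

# Hypothetical output location for the report.
out_path = os.path.join(tempfile.gettempdir(), "report.csv")

# codecs.open returns a file object that encodes unicode text on write,
# so the implicit ASCII conversion that raised the error never happens.
with codecs.open(out_path, "w", encoding="cp1252") as handle:
    for row in rows:
        handle.write(u",".join(row) + u"\r\n")

# Read the raw bytes back: in cp1252 the a-umlaut is the single byte 0xE4.
with open(out_path, "rb") as handle:
    raw = handle.read()
```

Note that cp1252 only covers characters such as ä; anything outside that code page (for example ā with a macron) would still raise UnicodeEncodeError unless you pick a wider encoding or pass an errors argument such as errors="replace". On Python 3 the built-in open() takes the same encoding argument.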