I am having trouble with unicode in a script I am writing. I have scoured the internet, including this site, and tried many things, but I still have no idea what is wrong.
My code is very long, but I will show an excerpt from it:
raw_results = get_raw(args)
write_raw(raw_results)
parsed_results = parse_raw(raw_results)
write_parsed(parsed_results)
Basically, I get raw results, which are XML encoded in UTF-8. Writing the raw data causes no problems, but writing the parsed data does. So I am pretty sure the problem is inside the function that parses the data.
I tried everything and I do not understand what the problem is. Even this simple line gives me an error:
def parse_raw(raw_results):
    content = raw_results.replace(u'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>', u'')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd7 in position 570: ordinal not in range(128)
Ideally I would love to be able to work with unicode and have no problems, but I also have no issue with replacing or ignoring any non-ASCII characters and using only regular text. I know I have not provided my full code, but please understand that I cannot, since it is work-related. I hope this is enough to get me some help.
Edit: the top part of my parse_raw function:
from xml.etree.ElementTree import XML, fromstring, tostring
def parse_raw(raw_results):
    raw_results = raw_results.decode("utf-8")
    content = raw_results.replace('<?xml version="1.0" encoding="UTF-8" standalone="yes"?>', '')
    content = "<root>\n%s\n</root>" % content
    mxml = fromstring(content)
Edit 2: I should point out that the code works fine UNLESS there are special characters. When the data is 100% English, there is no problem; the issues arise whenever foreign or accented letters are involved.
Thank you, everyone, for the input and the nudges. I have since solved my own problem: going over my code for the millionth time with a fine-toothed comb, I found the culprit, and everything works now.
For anyone with a similar problem, I have the following information that could help you:
Use the codecs module for writing your files.
My problem was that at a certain point I was trying to turn unicode into unicode, and in another place I was trying to turn normal ASCII into ASCII again. So whenever I solved one issue, another arose, and I assumed it was the same problem.
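A minimal sketch of this advice, assuming that "turning unicode into unicode" means calling .decode() on a value that has already been decoded. The helpers to_text and write_text, the sample XML, and the output file name are hypothetical and not taken from the original script. The idea is to decode exactly once on the way in, work with text in the middle, and let codecs.open do the single encode on the way out.

```python
import codecs
import os
import tempfile


def to_text(value, encoding="utf-8"):
    """Return text, decoding only when value is still raw bytes.

    Calling this twice is harmless: a value that is already text is
    returned unchanged instead of being decoded a second time.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def write_text(path, text, encoding="utf-8"):
    """Write text to path; codecs.open performs the one encode step."""
    with codecs.open(path, "w", encoding=encoding) as handle:
        handle.write(text)


# Hypothetical sample: U+05D0 (a Hebrew letter) encodes to the bytes
# 0xd7 0x90 in UTF-8, so the raw data contains a 0xd7 byte.
raw = u"<item>\u05d0</item>".encode("utf-8")

text = to_text(raw)     # bytes -> text, decoded once
again = to_text(text)   # already text, passed through untouched

out_path = os.path.join(tempfile.mkdtemp(), "parsed.xml")
write_text(out_path, again)
```

Keeping the decode and the encode at the edges of the program means every function in between can assume it receives text, which makes a stray second decode or encode much easier to spot.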
Break your issue into sections… and then you might find your problem!