I’m using BeautifulSoup to parse some XML files. One of the fields in this file frequently uses Unicode characters. I’ve tried unsuccessfully to write the unicode to a file using encode.
The process so far is basically:
- Get the name:
  gamename = items.find('name').string.strip()
- Then incorporate the name into a tuple, which is later converted into a string:
  stringtoprint = userid, gamename.encode('utf-8')
  newstring = "INSERT INTO collections VALUES " + str(stringtoprint) + ";" + "\n"
Then write that string to a file.
listofgamesowned.write(newstring.encode("UTF-8"))
It seems that I shouldn't have to call .encode quite so often. I had tried encoding directly upon parsing out the name, e.g. gamename = items.find('name').string.strip().encode('utf-8'); however, that did not seem to work.
Currently, 'Uudet L\xc3\xb6yt\xc3\xb6retket' is being printed and saved rather than Uudet Löytöretket.
It seems if this were a string I was generating then I’d use something.write(u'Uudet L\xc3\xb6yt\xc3\xb6retket'); however, it’s one element embedded in a string.
Unicode is an in-memory representation of a string. When you write it out you need to encode it, and when you read it back in you need to decode it.
'Uudet L\xc3\xb6yt\xc3\xb6retket' is the utf-8 encoded version of Uudet Löytöretket, so it is what you want to write out. When you want to read a string back from a file you need to decode it. Just remember to encode immediately before you output and decode immediately after you read it back.
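As a rough sketch of that "encode on output, decode on input" rule, the snippet below keeps the name as text until the moment it is written, and lets io.open do the encoding and decoding. The userid value, the file name games.sql, and the temporary directory are made up for illustration. It also formats the fields individually instead of calling str() on the tuple, on the assumption that the escapes in the question come from str() embedding the repr of each tuple element.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Stand-in for the value parsed by BeautifulSoup; \u00f6 is "ö".
gamename = u"Uudet L\u00f6yt\u00f6retket"
userid = 42  # hypothetical value, not from the original post

# Build the line as text. Each field is formatted on its own, because
# str() on a tuple would insert the repr() of every element.
line = u"INSERT INTO collections VALUES ({0}, '{1}');\n".format(userid, gamename)

# Hypothetical output location.
path = os.path.join(tempfile.mkdtemp(), "games.sql")

# Encode at the last moment: io.open encodes to UTF-8 on write.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(line)

# Decode at the first moment: io.open decodes from UTF-8 on read.
with io.open(path, "r", encoding="utf-8") as f:
    read_back = f.read()

# The bytes on disk are the UTF-8 encoded form of the name.
with io.open(path, "rb") as f:
    raw = f.read()
```

Viewed as raw bytes, the file contains the \xc3\xb6 sequences, but a UTF-8 aware editor or reader shows them as ö. Building real SQL by string formatting is only done here to mirror the question; a database driver's parameter binding would normally be safer.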