I’m trying to read a file but I can’t figure out the character encoding. There are three characters in the file whose values I know, and the hex values I see in my hex editor are as follows:
0xCCA9 é
0xCCBB ê
0xCCC1 á
Any ideas what encoding this is?
All English characters in the file are ASCII encoded. I had similar files that were encoded in Mac Central Europe, if that’s any use; perhaps these have been accidentally encoded more than once.
Edit:
Code to find mappings in Python 2.7 (see Esailija’s answer):
find_mappings(...) is a generator that takes a dictionary mapping each byte seen in the file to the byte it should have been. It tries every pair of available codecs and yields the pairs (f, g) for which decoding with f and re-encoding with g reproduces every mapping.
    import encodings
    import pkgutil


    def get_encodings():
        # Every module in the encodings package is a codec, except "aliases".
        false_positives = set(["aliases"])
        found = set(name for imp, name, ispkg in pkgutil.iter_modules(encodings.__path__) if not ispkg)
        found.difference_update(false_positives)
        return found


    def find_mappings(maps):
        names = sorted(get_encodings())
        for f in names:
            for g in names:
                try:
                    matches = all(k.decode(f).encode(g) == v for k, v in maps.items())
                except Exception:
                    # Couldn't encode/decode with this pair of codecs
                    continue
                if matches:
                    yield (f, g)


    # Byte seen in the file -> byte it should have been in UTF-8
    for mapping in find_mappings({'\xCC': '\xC3', '\xBB': '\xAA', '\xA9': '\xA9', '\xC1': '\xA1'}):
        print(mapping)
It’s not in any single encoding; it’s the result of messy encoding conversions. This is how those characters would look in UTF-8:
0xC3A9 é
0xC3AA ê
0xC3A1 á
So what I think originally happened is that UTF-8 data was interpreted as some ASCII-compatible code page X, and the resulting text was then written to the file in Mac Central Europe.
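To see what that hypothesis implies, here is a minimal Python 3 sketch (not the Python 2.7 used in the question). It decodes the observed file bytes as Mac Central Europe and lines each resulting character up with the corresponding UTF-8 byte of the expected character. The names OBSERVED and constraints_on_x are made up for illustration, and it assumes Python's mac_latin2 codec is the right one for Mac Central Europe.

```python
# Bytes seen in the hex editor -> the character the file should contain.
OBSERVED = {b"\xcc\xa9": "é", b"\xcc\xbb": "ê", b"\xcc\xc1": "á"}


def constraints_on_x(observed):
    """Return {character: byte} pairs the unknown code page X would have to satisfy."""
    required = {}
    for file_bytes, expected in observed.items():
        as_mac = file_bytes.decode("mac_latin2")  # the text that code page X produced
        as_utf8 = expected.encode("utf-8")        # the bytes that code page X was fed
        if len(as_mac) != len(as_utf8):
            raise ValueError("byte counts differ; the hypothesis does not fit")
        for ch, byte in zip(as_mac, as_utf8):
            required[ch] = byte
    return required


required = constraints_on_x(OBSERVED)
# required == {"Ő": 0xC3, "©": 0xA9, "Ľ": 0xAA, "Ń": 0xA1}
```

If the byte counts ever differed, the double-conversion theory would not hold for that sample, which is why the sketch raises instead of guessing.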
To get the original data, you would interpret the file as Mac Central Europe, re-encode the result in code page X, and interpret the re-encoded bytes as UTF-8.
I don’t know what the code page X is but it must have the following properties, given that the above is true:
© as 0xA9; same as Mac, Windows and ISO encodings
Ő as 0xC3; rules out any DOS code pages
Ľ as 0xAA
Ń as 0xA1
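Since code page X is unknown, the recovery can only be sketched with the four known pairs. The following Python 3 example (again, not the question's Python 2.7) decodes the file as Mac Central Europe, maps each character back to a byte using those pairs, and decodes the result as UTF-8. KNOWN_X and recover are hypothetical names; the table is deliberately partial and the function refuses anything it cannot map rather than guessing at the rest of X.

```python
# Mac Central Europe character -> byte in the unknown code page X.
# Only these four pairs are known; nothing else about X is assumed.
KNOWN_X = {
    "\u00a9": 0xA9,  # ©
    "\u0150": 0xC3,  # Ő
    "\u013d": 0xAA,  # Ľ
    "\u0143": 0xA1,  # Ń
}


def recover(raw):
    """Undo the double conversion for bytes covered by the known pairs."""
    text = raw.decode("mac_latin2")  # step 1: interpret as Mac Central Europe
    out = bytearray()
    for ch in text:                  # step 2: re-encode in (partial) code page X
        if ord(ch) < 0x80:
            out.append(ord(ch))      # X is ASCII-compatible, so ASCII passes through
        elif ch in KNOWN_X:
            out.append(KNOWN_X[ch])
        else:
            raise ValueError("no known mapping in code page X for %r" % ch)
    return out.decode("utf-8")       # step 3: interpret the result as UTF-8


fixed = recover(b"caf\xcc\xa9")  # "café"
```

Every new accented character found in the file adds more pairs to the table, which in turn narrows down what code page X could be.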