I’m trying to read a file but I can’t figure out the character encoding.

Question

Asked: June 16, 2026 (2026-06-16T18:10:00+00:00)

I’m trying to read a file but I can’t figure out the character encoding. There are three characters in the file whose values I know, and the hex values that I see for them in my hex editor are as follows:

0xCCA9  é
0xCCBB  ê
0xCCC1  á

Any ideas what encoding this is?

All English characters in the file are ASCII-encoded. I had similar files which were encoded in Mac Central Europe, if that’s any use; perhaps these have been accidentally encoded more than once.

Edit:

Code to find mappings in Python 2.7: (See Esailija’s answer above).

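find_mappings(...) is a generator which is given a dictionary mapping each byte seen in the file to the byte expected in UTF-8. It tries every pair of available codecs (decode with the first, encode with the second) and yields the pairs that satisfy every entry in the dictionary.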

import pkgutil
import encodings

def get_encodings():
    false_positives = set(["aliases"])
    found = set(name for imp, name, ispkg in pkgutil.iter_modules(encodings.__path__) if not ispkg)
    found.difference_update(false_positives)
    return found

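def find_mappings(maps):
    # Try every (decode, encode) codec pair and yield the pairs that turn
    # each observed byte (key) into the expected byte (value).
    names = sorted(get_encodings())  # avoid shadowing the encodings module
    for f in names:
        for g in names:
            try:
                if all(k.decode(f).encode(g) == v for k, v in maps.items()):
                    yield (f, g)
            except Exception:
                # This pair couldn't decode/encode one of the bytes
                pass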

# Keys are bytes as they appear in the file, values are the UTF-8 bytes expected in their place
for mapping in find_mappings({'\xCC': '\xC3', '\xBB': '\xAA', '\xA9': '\xA9', '\xC1': '\xA1'}):
    print(mapping)

1 Answer

Editorial Team · Answer 1 · 2026-06-16T18:10:01+00:00

The file is not in any single encoding; it is the result of a chain of messy encoding conversions. This is how the characters would look in UTF-8:

0xC3A9  é
0xC3AA  ê
0xC3A1  á
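
The byte values above can be checked directly, and lining them up against the bytes seen in the hex editor gives the byte-to-byte mapping that the question's search uses. This is a minimal Python 3 sketch (the question's own code targets Python 2.7). The names `samples` and `byte_mapping` are illustrative and not from the original post, and the sketch assumes each observed sequence has the same length as its UTF-8 form:

```python
# Bytes seen in the hex editor for each known character (from the question).
samples = {
    "é": bytes.fromhex("CCA9"),
    "ê": bytes.fromhex("CCBB"),
    "á": bytes.fromhex("CCC1"),
}

def byte_mapping(samples):
    """Pair each observed byte with the UTF-8 byte at the same position."""
    mapping = {}
    for char, observed in samples.items():
        expected = char.encode("utf-8")
        # Assumption: the corruption maps bytes one-to-one, so lengths match.
        if len(observed) != len(expected):
            raise ValueError("length mismatch for %r" % char)
        for seen, wanted in zip(observed, expected):
            mapping[seen] = wanted
    return mapping

# Print the UTF-8 bytes of each sample character as upper-case hex.
for char in samples:
    print(char, char.encode("utf-8").hex().upper())

# Print the observed-byte to UTF-8-byte mapping.
print({hex(k): hex(v) for k, v in byte_mapping(samples).items()})
```

The resulting mapping is 0xCC to 0xC3, 0xA9 to 0xA9, 0xBB to 0xAA and 0xC1 to 0xA1, the same four pairs passed to find_mappings in the question.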

So what I think originally happened is that UTF-8 data was interpreted as an ASCII-compatible code page X, and the result was then written to the file in Mac Central Europe.

To get the original data back, you would decode the file as Mac Central Europe, re-encode the result in code page X, and decode the re-encoded bytes as UTF-8.
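
The three steps above can be sketched in Python 3 as follows. The function name `undo_double_encoding` and its parameters are hypothetical. Both encodings are left as arguments because code page X is not identified here, and the codec names must be ones Python's codec registry knows (for example `mac_latin2` for Mac Central Europe):

```python
def undo_double_encoding(raw, file_encoding, codepage_x):
    """Reverse the suspected chain: file bytes -> text -> bytes in X -> UTF-8 text.

    raw           -- bytes read from the damaged file
    file_encoding -- encoding the file was written in (Mac Central Europe in the answer)
    codepage_x    -- the code page the UTF-8 bytes were wrongly interpreted as
    """
    text_as_stored = raw.decode(file_encoding)        # step 1: interpret the file
    original_bytes = text_as_stored.encode(codepage_x)  # step 2: re-encode in X
    return original_bytes.decode("utf-8")             # step 3: interpret as UTF-8

# ASCII passes through unchanged for any ASCII-compatible pair of encodings.
print(undo_double_encoding(b"plain text", "mac_latin2", "cp1250"))

# Illustration only: one pair that maps the question's three samples back.
for sample in (b"\xcc\xa9", b"\xcc\xbb", b"\xcc\xc1"):
    print(undo_double_encoding(sample, "mac_roman", "latin_1"))
```

As an illustration only, the pair `mac_roman` / `latin_1` happens to turn the three sample byte sequences from the question back into é, ê and á. That is an observation about these three samples, not something established above, and a different pair may fit the rest of the file better.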

I don’t know what code page X is, but if the above is true it must have the following properties:

Encodes Ő as 0xC3; rules out any DOS code pages
Encodes Ľ as 0xAA
Encodes Ń as 0xA1
Is ASCII-compatible; rules out any EBCDIC code pages
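
The properties listed above can be turned into a search over the codecs that ship with Python. This is a hedged Python 3 sketch: `candidate_codecs` and `codecs_matching` are illustrative names, the search only covers codecs in the standard `encodings` package, and it assumes the Mac Central Europe hypothesis (Python codec `mac_latin2`) when deriving the required characters:

```python
import encodings
import pkgutil

def candidate_codecs():
    """Names of the codec modules in the stdlib `encodings` package."""
    skip = {"aliases"}  # helper module, not a codec
    return sorted(
        name
        for _, name, ispkg in pkgutil.iter_modules(encodings.__path__)
        if not ispkg and name not in skip
    )

def codecs_matching(required):
    """Return the codecs that encode every character to the required byte."""
    matches = []
    for name in candidate_codecs():
        try:
            if all(ch.encode(name) == b for ch, b in required.items()):
                matches.append(name)
        except Exception:
            # Not a text codec, not available here, or cannot encode a character.
            continue
    return matches

# File byte -> UTF-8 byte, taken from the samples in the question.
file_to_utf8 = {0xCC: 0xC3, 0xBB: 0xAA, 0xC1: 0xA1}

# Under the Mac Central Europe assumption, decode each file byte to the
# character that code page X would have to encode as the UTF-8 byte.
required = {
    bytes([seen]).decode("mac_latin2"): bytes([wanted])
    for seen, wanted in file_to_utf8.items()
}

print(required)
print(codecs_matching(required))
```

An empty result would only mean that no bundled codec has these properties. In that case either code page X is not among Python's codecs or the Mac Central Europe assumption itself needs revisiting.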
