I know there are quite a few solutions for this problem but mine was


Asked: May 23, 2026

I know there are quite a few solutions for this problem, but mine is peculiar in that I may get truncated UTF-16 data and still have to make a best effort at conversion in cases where decode and encode fail with UnicodeDecodeError. So I came up with the following code in Python.
Please let me know how I can improve it for faster processing.

    # The function name is hypothetical; the original snippet showed only the body.
    def utf16_file_to_ascii(filename):
        # Read the raw bytes once; the with-block closes the file on every path.
        with open(filename, 'rb') as f:
            data = f.read()
        try:
            # conversion to ascii if utf16 data is formatted correctly
            return data.decode('UTF16').encode('ASCII', 'ignore')
        except UnicodeDecodeError:
            # if it fails with UnicodeDecodeError, then use brute force
            # to decode truncated data
            raw = bytearray(data)  # indexing yields ints on Python 2 and 3
            if raw[:2] == b'\xff\xfe':
                print("Little-Endian format, UTF-16")
                # low byte of each code unit comes first; skip the 2-byte BOM
                return bytes(bytearray(b for b in raw[2::2] if 0 < b < 127))
            elif raw[:2] == b'\xfe\xff':
                print("Big-Endian format, UTF-16")
                # low byte of each code unit comes second
                return bytes(bytearray(b for b in raw[3::2] if 0 < b < 127))
            else:
                # no byte order mark, so the byte order is unknown
                return None

1 Answer

Editorial Team · Answer 1 · 2026-05-23T02:43:28+00:00

To tolerate errors you can pass the optional second argument (the error handler) to the byte string’s decode method. In this example the dangling third byte (‘c’) is replaced with the “replacement character” U+FFFD:

>>> 'abc'.decode('UTF-16', 'replace')
u'\u6261\ufffd'

There is also an ‘ignore’ option which will simply drop bytes that can’t be decoded:

>>> 'abc'.decode('UTF-16', 'ignore')
u'\u6261'
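The two examples above use Python 2 byte strings. As a rough sketch of the same idea in Python 3, where `bytes.decode` accepts the same error-handler names, the whole conversion could look like the following. The helper name `utf16_to_ascii` and the sample bytes are made up for illustration.

```python
# Python 3 sketch; utf16_to_ascii is a hypothetical helper name.
def utf16_to_ascii(data: bytes, errors: str = "ignore") -> str:
    """Decode possibly truncated UTF-16 bytes, then drop non-ASCII characters."""
    # A leading BOM, if present, selects the byte order and is not returned.
    text = data.decode("utf-16", errors)
    # Encoding with 'ignore' discards every character outside ASCII.
    return text.encode("ascii", "ignore").decode("ascii")


# "hi" as little-endian UTF-16 with a BOM, followed by one dangling byte.
truncated = b"\xff\xfeh\x00i\x00!"
print(utf16_to_ascii(truncated))                     # dangling byte is dropped
print(repr(truncated.decode("utf-16", "replace")))   # dangling byte becomes U+FFFD
```

Because the decoding happens in one call, there is no need for a separate byte-by-byte fallback when the only problem is a truncated tail.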

While it is common to desire a system that is “tolerant” of incorrectly encoded text, it is often quite difficult to define precisely what the expected behavior is in these situations. You may find that the one who provided the requirement to “deal with” incorrectly encoded text does not fully grasp the concept of character encoding.
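One way to make the expected behavior precise is to write the policy down in code. The sketch below assumes a policy that is not stated in the question: an incomplete sequence at the very end of the data is tolerated, while invalid data anywhere else still raises. It uses Python 3 and the standard library's incremental decoder; the function name `decode_truncated_utf16` is hypothetical.

```python
import codecs


def decode_truncated_utf16(data: bytes) -> str:
    """Decode UTF-16 bytes, silently dropping an incomplete trailing sequence.

    Assumed policy: truncation at the end is tolerated, but invalid data in
    the middle still raises UnicodeDecodeError. The incremental 'utf-16'
    decoder expects the data to start with a byte order mark.
    """
    decoder = codecs.getincrementaldecoder("utf-16")(errors="strict")
    # final=False tells the decoder that more input may follow, so an odd
    # trailing byte or a lone trailing high surrogate is held back in the
    # decoder's buffer instead of raising an error.
    return decoder.decode(data, final=False)


print(decode_truncated_utf16(b"\xff\xfeh\x00i\x00!"))
```

Whether this is the right policy depends on what the data is used for, which is the question to settle with whoever set the requirement.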
