I’m not exactly sure how to ask this question really, and I’m no where close to finding an answer, so I hope someone can help me.
I’m writing a Python app that connects to a remote host and receives back byte data, which I unpack using Python’s built-in struct module. My problem is with the strings, as they include multiple character encodings. Here is an example of such a string:
‘^LThis is an example ^Gstring with multiple ^Jcharacter encodings’
Where the different encoding starts and ends is marked using special escape chars:
- ^L – Latin1
- ^E – Central Europe
- ^T – Turkish
- ^B – Baltic
- ^J – Japanese
- ^C – Cyrillic
- ^G – Greek
And so on… I need a way to convert this sort of string into Unicode, but I’m really not sure how to do it. I’ve read up on Python’s codecs and string.encode/decode, but I’m none the wiser really. I should mention as well, that I have no control over how the strings are outputted by the host.
I hope someone can help me with how to get started on this.
There’s no built-in functionality for decoding a string like this, since it is really its own custom codec. You simply need to split up the string on those control characters and decode it accordingly.
Here’s a (very slow) example of such a function that handles latin1 and shift-JIS:
A faster version might use str.split, or regular expressions.
(Also, as you can see in this example, ‘^J’ is the control character for ‘newline’, so your input data is going to have some interesting restrictions.)