The Archive Base

Editorial Team
Asked: June 14, 2026

I’m decoding bytestreams into unicode characters without knowing the encoding that’s been used by


I’m decoding bytestreams into unicode characters without knowing the encoding that’s been used by each of a hundred or so senders.

Many of the senders are not technically astute, and will not be able to tell me what encoding they are using. It will be determined by the happenstance of the toolchains they are using to generate the data.

The senders are, for the moment, all UK/English based, using a variety of operating systems.

Can I ask all the senders to send me a particular string of characters that will unambiguously demonstrate what encoding each sender is using?

I understand that there are libraries that use heuristics to guess at the encoding – I’m going to chase that up too, as a runtime fallback, but first I’d like to try and determine what encodings are being used, if I can.

(Don’t think it’s relevant, but I’m working in Python)

1 Answer

  1. Editorial Team
    Added an answer on June 14, 2026 at 9:11 pm

    A full answer depends on several factors: the range of encodings used by the various upstream systems, how well your users will comply with instructions to type magic character sequences into text fields, and how comfortable they are with the obscure keyboard combinations needed to type those sequences.

    There are some very easy character sequences which only some users will be able to type. Only users with a Cyrillic keyboard and encoding will find it easy to type “Ильи́ч” (Ilyich), and so you only have to distinguish between the Cyrillic-capable encodings like UTF-8, UTF-16, iso8859_5, and koi8_r. Similarly, you could come up with Japanese, Chinese, and Korean character sequences which distinguish between users of Japanese, simplified Chinese, traditional Chinese, and Korean systems.
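As a rough illustration of the Cyrillic case, the sketch below encodes a test word under each Cyrillic-capable encoding named above and collects the resulting byte sequences. The names `TEST_WORD`, `CYRILLIC_CANDIDATES` and `cyrillic_signatures` are hypothetical, and the word is written without its stress mark on the assumption that the combining accent cannot be represented in iso8859_5 or koi8_r.

```python
# Hypothetical test word: "Ilyich" in Cyrillic, WITHOUT the stress mark,
# because the combining acute accent (U+0301) has no code point in
# ISO-8859-5 or KOI8-R.
TEST_WORD = "\u0418\u043b\u044c\u0438\u0447"

# Encodings named in the text; UTF-16 is split into its two byte orders.
CYRILLIC_CANDIDATES = ["utf-8", "utf-16le", "utf-16be", "iso8859_5", "koi8_r"]


def cyrillic_signatures(word=TEST_WORD, encodings=CYRILLIC_CANDIDATES):
    """Map each candidate encoding to the bytes it produces for `word`."""
    return {name: word.encode(name) for name in encodings}


signatures = cyrillic_signatures()
for name, raw in signatures.items():
    # Print the bytes as hex so the differences are easy to compare by eye.
    print(f"{name:10} {raw.hex()}")
```

If every encoding yields a different byte sequence for the same word, the received bytes identify the sender's encoding among these candidates.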

    But let’s concentrate on users of western European computer systems, and the common encodings like ISO-8859-15, Mac_Roman, UTF-8, UTF-16LE, and UTF-16BE. A very simple test is to have users enter the Euro character ‘€’, U+20AC, and see what byte sequence gets generated:

    • byte ['\xa4'] means iso-8859-15 encoding
    • bytes ['\xe2', '\x82', '\xac'] mean utf-8 encoding
    • bytes ['\x20', '\xac'] mean utf-16be encoding
    • bytes ['\xac', '\x20'] mean utf-16le encoding
    • byte ['\x80'] means cp1252 (“Windows ANSI”) encoding
    • byte ['\xdb'] means macroman encoding
    • iso-8859-1 won’t be able to represent the Euro character at all. iso-8859-15 is the Euro-supporting successor to iso-8859-1.
    • U.S. users probably won’t know how to type a Euro character. (OK, that’s too snarky. 3% of them will know.)

    You should check that each of these byte sequences, interpreted under any of the other possible encodings, is not a character sequence that users would plausibly type themselves. For instance, the '\xa4' of the iso-8859-15 Euro symbol could also be the iso-8859-1 or cp1252 encoding of ‘¤’, the macroman encoding of ‘§’, or the first byte of any of thousands of UTF-16 characters, such as the U+A4xx Yi Syllables in UTF-16BE, or ‘¤’ itself (U+00A4) and U+01A4 LATIN CAPITAL LETTER P WITH HOOK in UTF-16LE. It would not be a valid first byte of a UTF-8 sequence. If some of your users submit text in Yi, you might have a problem.
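One way to run that cross-check is to decode a marker under every candidate encoding and look at what comes out. This is only a sketch: `CANDIDATES` and `alternative_readings` are hypothetical names, and the candidate list is just the set of encodings discussed in this answer.

```python
EURO = "\u20ac"

# Encodings discussed in this answer (an assumption; extend as needed).
CANDIDATES = ["iso-8859-15", "utf-8", "utf-16be", "utf-16le",
              "cp1252", "macroman", "iso-8859-1"]


def alternative_readings(marker, encodings=CANDIDATES):
    """Decode `marker` under every candidate encoding.

    Returns a dict of encoding name -> decoded text, with None where the
    bytes are not valid in that encoding.
    """
    readings = {}
    for name in encodings:
        try:
            readings[name] = marker.decode(name)
        except UnicodeDecodeError:
            readings[name] = None
    return readings


# How else could the iso-8859-15 Euro byte be read?
readings = alternative_readings(EURO.encode("iso-8859-15"))
for name, text in readings.items():
    print(f"{name:12} {text!r}")
```

Any encoding whose reading is not `None` is a potential source of confusion for that marker, so those are the combinations worth thinking about.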

    The Python 3.x documentation, in section 7.2.3, “Standard Encodings”, lists the character encodings which the Python standard library can easily handle. The following program lets you see how a test character sequence is encoded into bytes by various encodings:

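    >>> euro = '\u20ac'  # the Euro sign, U+20AC
    >>> for e in ['iso-8859-1', 'iso-8859-15', 'utf-8', 'utf-16be', 'utf-16le',
    ...           'cp1252', 'macroman']:
    ...     # 'backslashreplace' keeps going when an encoding lacks the character
    ...     print(e, euro.encode(e, 'backslashreplace'))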
    

    So, as an expedient, satisficing hack, consider telling your users to type a ‘€’ as the first character of a text field, if there are any problems with encoding. Then your system should interpret any of the above byte sequences as an encoding clue, and discard them. If users want to start their text content with a Euro character, they start the field with ‘€€’; the first gets swallowed, the second remains part of the text.
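The receiving side of that hack could look roughly like the sketch below. `sniff_and_strip`, `CANDIDATES` and `MARKERS` are hypothetical names, and the markers are computed from Python's own codecs rather than typed in by hand. It assumes the marker, if present, is the very first thing in the field.

```python
EURO = "\u20ac"

# Encodings we are prepared to recognise. iso-8859-1 is left out because it
# cannot encode the Euro sign at all.
CANDIDATES = ["utf-8", "utf-16be", "utf-16le",
              "iso-8859-15", "cp1252", "macroman"]

# (marker bytes, encoding) pairs, longest marker first, so that a multi-byte
# marker is always tested before any single-byte one.
MARKERS = sorted(
    ((EURO.encode(name), name) for name in CANDIDATES),
    key=lambda pair: len(pair[0]),
    reverse=True,
)


def sniff_and_strip(raw):
    """Return (encoding, text) if `raw` starts with a Euro marker.

    The marker is discarded. Returns (None, raw) when no marker matches,
    leaving the caller to fall back on heuristics.
    """
    for marker, name in MARKERS:
        if raw.startswith(marker):
            try:
                return name, raw[len(marker):].decode(name)
            except UnicodeDecodeError:
                continue  # prefix matched by coincidence; try the next one
    return None, raw


print(sniff_and_strip("\u20ac\u20ac10".encode("cp1252")))
print(sniff_and_strip(b"no marker here"))
```

Note that the UTF-16BE marker begins with the byte 0x20, which is a space in the single-byte encodings, so a field that genuinely starts with a space followed by byte 0xAC would be misread. Whether that matters depends on what your senders actually type.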
