The Archive Base

Editorial Team
Asked: May 20, 2026

I need to process a large list of short strings (mostly in Russian, but any other language is possible, including random garbage from a cat walking on the keyboard).

Some of these strings will be encoded in UTF-8 twice.

I need to reliably detect if a given string is double-encoded, and fix it. I should do this without using any external libraries, just by inspecting the bytes. The detection should be as fast as possible.

The question is: how to detect that a given string was encoded in UTF-8 twice?

Update:

The original strings are in UTF-8. Here is the AS3 code that does the second encoding (unfortunately I don’t have control over the client code, so I can’t fix this):

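private function toUTF8(s : String) : String {
    // Serialize the string as UTF-8 bytes.
    var byteArray : ByteArray = new ByteArray();
    byteArray.writeUTFBytes(s);
    byteArray.position = 0;

    var res : String = "";

    // Turn every byte into its own character (code points 0..255).
    while (byteArray.bytesAvailable) {
        res += String.fromCharCode(byteArray.readUnsignedByte());
    }

    return res;
}

// Lowercase, truncate to 64 characters, then apply the byte-per-character conversion.
myString = toUTF8(("" + myString).toLowerCase().substr(0, 64));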

Note the toLowerCase() call. Could this help?
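For readers without an AS3 runtime, the effect of the client code above can be approximated in Python. This is only a sketch under assumptions: the name `to_utf8_like_as3` is hypothetical, Python's `lower()` and slicing by code point may differ in detail from AS3's `toLowerCase()` and `substr()`, and it assumes the resulting string is later serialized as UTF-8, which is what produces the double encoding.

```python
def to_utf8_like_as3(s: str) -> str:
    """Approximate the AS3 toUTF8(): one character per UTF-8 byte of the input."""
    truncated = s.lower()[:64]
    # chr(b) plays the role of String.fromCharCode(byte): code points 0..255.
    return "".join(chr(b) for b in truncated.encode("utf-8"))


original = "Привет"
mangled = to_utf8_like_as3(original)

# Assumption: the mangled string is written out as UTF-8 somewhere downstream.
double_encoded = mangled.encode("utf-8")

# Reversing it: decode once, map each character back to a single byte
# (Latin-1 maps code points 0..255 to bytes 0..255), then decode again.
recovered = double_encoded.decode("utf-8").encode("latin-1").decode("utf-8")
```

Under these assumptions the wrong interpretation in the middle is effectively Latin-1 (each byte becomes the code point with the same value), so the reverse mapping never hits an unassigned byte.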


1 Answer

  1. Editorial Team
    Added an answer on May 20, 2026 at 6:38 am

    In principle you can’t, especially allowing for cat-garbage.

    You don’t say what the original character encoding of the data was before it was UTF-8 encoded once or twice. I’ll assume CP1251 (or at least that CP1251 is one of the possibilities), because it’s quite a tricky case.

    Take a non-ASCII character and UTF-8 encode it. You get two or more bytes, and all of those bytes are valid characters in CP1251 unless one of them happens to be 0x98, the only unassigned byte value in CP1251.

    So, if you convert those bytes from CP1251 to UTF-8, the result is exactly the same as if you’d correctly UTF-8 encoded a CP1251 string consisting of those Russian characters. There’s no way to tell whether the result comes from incorrectly double-encoding one character or from correctly single-encoding two characters.
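The ambiguity can be seen with a single letter. The snippet below is a small illustration, not a detection method; the helper name `survives_cp1251_misreading` is made up for this example, and it relies on Python's `cp1251` codec treating byte 0x98 as undefined.

```python
# How one character, misread as CP1251 and re-encoded, collides with honest text.
single = "я".encode("utf-8")              # the two bytes 0xD1 0x8F
misread = single.decode("cp1251")         # two Cyrillic characters, U+0421 U+040F
double_encoded = misread.encode("utf-8")  # what the faulty pipeline would emit

# A correctly single-encoded string made of those same two characters ("СЏ").
honest = "\u0421\u040f".encode("utf-8")
indistinguishable = double_encoded == honest


def survives_cp1251_misreading(ch: str) -> bool:
    """True if every UTF-8 byte of ch can be read as a CP1251 character."""
    try:
        ch.encode("utf-8").decode("cp1251")
    except UnicodeDecodeError:
        return False
    return True
```

A letter such as "И" (U+0418, UTF-8 bytes 0xD0 0x98) contains the 0x98 byte, so it cannot pass through a CP1251 misreading, whereas "я" can.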

    If you have some control over the original data, you could put a BOM at the start of it. Then when it comes back to you, inspect the initial bytes to see whether you have a UTF-8 BOM, or the result of incorrectly double-encoding a BOM. But I guess you probably don’t have that kind of control over the original text.
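If a BOM could be injected, the check on the returned bytes might look like the sketch below. The function name `classify_bom` and the `wrong_codec` parameter are assumptions for illustration; the doubled BOM is derived by misreading the UTF-8 BOM in the suspected wrong encoding and re-encoding it.

```python
import codecs


def classify_bom(data: bytes, wrong_codec: str = "cp1251") -> str:
    """Return 'double', 'single', or 'unknown' based on the leading bytes."""
    bom = codecs.BOM_UTF8  # b'\xef\xbb\xbf'
    # What the BOM looks like after being misread and UTF-8 encoded again.
    doubled_bom = bom.decode(wrong_codec).encode("utf-8")
    if data.startswith(doubled_bom):
        return "double"
    if data.startswith(bom):
        return "single"
    return "unknown"


# Usage with a hypothetical payload.
payload = codecs.BOM_UTF8 + "привет".encode("utf-8")
twice = payload.decode("cp1251").encode("utf-8")
```

For the AS3 code in the question, passing `wrong_codec="latin-1"` is probably the closer match, since each byte becomes the code point with the same value.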

    In practice you can guess – UTF-8 decode it and then:

    (a) look at the character frequencies, the character-pair frequencies, and the number of non-printable characters. This might allow you to tentatively declare it nonsense, and hence possibly double-encoded. With enough non-printable characters it may be so nonsensical that you couldn’t realistically type it even by mashing at the keyboard, unless maybe your ALT key was stuck.

    (b) attempt the second decode. That is, starting from the Unicode code points that you got by decoding your UTF-8 data, first encode it to CP1251 (or whatever) and then decode the result from UTF-8. If either step fails (due to invalid sequences of bytes), then it definitely wasn’t double-encoded, at least not using CP1251 as the faulty interpretation.

    This is more or less what you do if you have some bytes that might be UTF-8 or might be CP1251, and you don’t know which.
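Step (b) can be sketched in a few lines using only the standard codecs. The function name `undo_double_encoding` and its `wrong_codec` parameter are assumptions for this example; a `None` result means "not double-encoded under this interpretation", not "definitely fine".

```python
from typing import Optional


def undo_double_encoding(data: bytes, wrong_codec: str = "cp1251") -> Optional[str]:
    """Try to reverse a double UTF-8 encoding; return None if that is impossible."""
    try:
        once = data.decode("utf-8")        # undo the outer encoding
        inner = once.encode(wrong_codec)   # undo the faulty interpretation
        text = inner.decode("utf-8")       # undo the original encoding
    except UnicodeError:
        # Either step failed, so the data was not double-encoded this way.
        return None
    if inner == data:
        # Pure ASCII (or otherwise unchanged): nothing to fix.
        return None
    return text


good = "привет".encode("utf-8")
bad = good.decode("cp1251").encode("utf-8")
```

Here `undo_double_encoding(bad)` recovers the original text while `undo_double_encoding(good)` returns `None`. For the AS3 client in the question, `wrong_codec="latin-1"` is likely the relevant setting.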

    You’ll get some false positives for single-encoded cat-garbage indistinguishable from double-encoded data, and maybe a very few false negatives for data that was double-encoded but that after the first encode by fluke still looked like Russian.
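A very crude stand-in for heuristic (a) is to measure how many characters of the once-decoded text are non-printable. The sketch below uses Unicode general categories starting with "C" (control, format, unassigned and similar) as the definition of "non-printable", which is an assumption of this example; the name `suspicious_ratio` is hypothetical. It is mainly useful when the faulty interpretation was Latin-1, where bytes 0x80 to 0x9F become control characters. CP1251 has no such control range, so for that case a real character-frequency table would be needed instead.

```python
import unicodedata


def suspicious_ratio(text: str) -> float:
    """Fraction of characters whose Unicode general category starts with 'C'."""
    if not text:
        return 0.0
    bad = sum(1 for ch in text if unicodedata.category(ch).startswith("C"))
    return bad / len(text)


clean = "привет"
# The same text as it looks after one faulty Latin-1 misreading.
mangled = clean.encode("utf-8").decode("latin-1")

clean_score = suspicious_ratio(clean)
mangled_score = suspicious_ratio(mangled)
```

A nonzero score on short, single-line input is a hint rather than proof, and any cut-off would have to be tuned on real data. Note that ordinary whitespace such as newlines also falls into category "C".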

    If your original encoding has more holes in it than CP1251 then you’ll have fewer false negatives.

    Character encodings are hard.

© 2021 The Archive Base. All Rights Reserved