I have a file that contains Unicode text in an unstated encoding. I want

Question

Asked: May 29, 2026 (2026-05-29T22:22:13+00:00)

I have a file that contains Unicode text in an unstated encoding. I want to scan through this file looking for any Arabic code points in the range U+0600 through U+06FF, and map each applicable Unicode code point to a byte of ASCII, so that the newly produced file will be composed of purely ASCII characters, with all code points under 128.

How do I go about doing this? I tried reading the characters the same way I would read ASCII, but my terminal shows ?? because each one is a multi-byte character.

NOTE: the file uses only a subset of the Unicode character set, and that subset has fewer members than ASCII has characters. I can therefore map this particular subset 1:1 onto ASCII.

1 Answer

Editorial Team · Answer 1 · 2026-05-29T22:22:14+00:00

This is either impossible, or it’s trivial. Here are the trivial approaches:

If no code point exceeds 127, then simply write it out in ASCII. Done.
If some code points exceed 127, then you must choose how to represent them in ASCII. A common strategy is to use XML numeric character references, as in &#x3B1; for U+03B1. This takes up to 10 ASCII characters for each non-ASCII code point transcribed (&#x10FFFF; is the longest).
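The escaping strategy in the second item can be sketched in Python (chosen here purely for illustration; the function name `to_ascii_ncr` is hypothetical). It leaves code points below 128 alone and writes every other code point as a hexadecimal numeric character reference, so nothing is lost and the output can be decoded back.

```python
def to_ascii_ncr(text: str) -> str:
    """Replace every code point above 127 with an XML numeric character reference."""
    return "".join(ch if ord(ch) < 128 else f"&#x{ord(ch):X};" for ch in text)


if __name__ == "__main__":
    # GREEK SMALL LETTER ALPHA is escaped; the ASCII part passes through unchanged.
    print(to_ascii_ncr("alpha = \u03b1"))
```

Python's built-in `text.encode("ascii", "xmlcharrefreplace")` does the same job with decimal references instead of hexadecimal ones.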

The impossible ones I leave as an exercise for the original poster. I won’t even mention the foolish-but-possible (read: stupid) approaches, as these are legion. Data destruction is a capital crime in data processing, and should be treated as such.

Note that I am assuming by ‘Unicode character’ you actually mean ‘Unicode code point’; that is, a programmer-visible character. For user-visible characters, you need ‘Unicode grapheme (cluster)’ instead.

Also, unless you normalize your text first, you’ll hate the world. I suggest NFD.
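To see why normalization matters for Arabic in particular, here is a small Python sketch using the standard `unicodedata` module. U+0622 ARABIC LETTER ALEF WITH MADDA ABOVE is a single code point, but under NFD it becomes ALEF followed by a combining mark, so a mapping keyed on plain ALEF only matches it after normalizing.

```python
import unicodedata

composed = "\u0622"  # ARABIC LETTER ALEF WITH MADDA ABOVE, one code point
decomposed = unicodedata.normalize("NFD", composed)

# After NFD the text is ALEF (U+0627) followed by a combining mark (U+0653).
for ch in decomposed:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Whether the combining mark gets its own ASCII byte or is handled some other way is a decision the mapping still has to make.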

EDIT

After further clarification by the original poster, it seems that what he wants to do is very easily accomplished using existing tools without writing a new program. For example, this converts a certain set of Arabic characters from a UTF-8 input file into an ASCII output file:

$ perl -CSAD -Mutf8 -pe 'tr[ابتثجحخد][abttjhhd]' < input.utf8 > output.ascii

That only handles these code points:

U+0627 ‭ ا  ARABIC LETTER ALEF
U+0628 ‭ ب  ARABIC LETTER BEH
U+062A ‭ ت  ARABIC LETTER TEH
U+062B ‭ ث  ARABIC LETTER THEH
U+062C ‭ ج  ARABIC LETTER JEEM
U+062D ‭ ح  ARABIC LETTER HAH
U+062E ‭ خ  ARABIC LETTER KHAH
U+062F ‭ د  ARABIC LETTER DAL

So you’ll have to extend it to whatever mapping you want.
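Before extending the mapping, it helps to know which Arabic code points the file actually uses. A rough Python sketch of that check follows; `unmapped_arabic` and `covered` are illustrative names, and the sample string stands in for the real file contents.

```python
import unicodedata

# The eight letters that the tr example above translates.
covered = set("\u0627\u0628\u062a\u062b\u062c\u062d\u062e\u062f")


def unmapped_arabic(text: str, covered: set) -> list:
    """Return the characters in U+0600..U+06FF that occur in text but have no mapping."""
    return sorted({ch for ch in text if "\u0600" <= ch <= "\u06ff" and ch not in covered})


if __name__ == "__main__":
    sample = "\u0628\u0627\u0628 \u0633"  # BEH ALEF BEH, space, SEEN
    for ch in unmapped_arabic(sample, covered):
        print(f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}")
```

For a real file, assuming it is UTF-8 like the input above, the text could come from `open("input.utf8", encoding="utf-8").read()`.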

If you want it in a script instead of a command-line tool, that is also easy, and you can then refer to the characters by name by setting up a mapping, such as:

 use charnames ":full";   # required for \N{...} names before Perl 5.16

 my %ascii_for = (        # the hash name is illustrative
     "\N{ARABIC LETTER ALEF}"   =>  "a",
     "\N{ARABIC LETTER BEH}"    =>  "b",
     "\N{ARABIC LETTER TEH}"    =>  "t",
     "\N{ARABIC LETTER THEH}"   =>  "t",
     "\N{ARABIC LETTER JEEM}"   =>  "j",
     "\N{ARABIC LETTER HAH}"    =>  "h",
     "\N{ARABIC LETTER KHAH}"   =>  "h",
     "\N{ARABIC LETTER DAL}"    =>  "d",
 );

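A comparable script can be sketched in Python, which also accepts `\N{...}` names in string literals. This is an illustration under assumptions, not the answerer's code: `ASCII_FOR`, `TABLE` and `transliterate` are made-up names, and the input is assumed to be UTF-8 as in the command-line example. Encoding the result as strict ASCII means any character the table missed raises an error instead of being silently dropped.

```python
import sys

# The same eight letters as the mapping above, keyed by character name.
ASCII_FOR = {
    "\N{ARABIC LETTER ALEF}": "a",
    "\N{ARABIC LETTER BEH}": "b",
    "\N{ARABIC LETTER TEH}": "t",
    "\N{ARABIC LETTER THEH}": "t",
    "\N{ARABIC LETTER JEEM}": "j",
    "\N{ARABIC LETTER HAH}": "h",
    "\N{ARABIC LETTER KHAH}": "h",
    "\N{ARABIC LETTER DAL}": "d",
}
TABLE = str.maketrans(ASCII_FOR)


def transliterate(text: str) -> bytes:
    """Map known Arabic letters to ASCII; raise UnicodeEncodeError on anything unmapped."""
    return text.translate(TABLE).encode("ascii")  # the ascii codec is strict by default


# Hypothetical usage: python translit.py input.utf8 output.ascii
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1], encoding="utf-8") as src, open(sys.argv[2], "wb") as dst:
        for line in src:
            dst.write(transliterate(line))
```

Note that this sample table sends two letters to `t` and two to `h`, so it cannot be reversed; a true 1:1 mapping needs a distinct ASCII byte for each letter.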
If this is supposed to be a component in a larger C++ program, then perhaps you will want to implement this in C++, possibly, but not necessarily, using the ICU4C library, which includes transliteration support.

But if all you need is a simple conversion, I don’t understand why you would write a dedicated C++ program. Seems like way too much work.
