I have a file that contains Unicode text in an unstated encoding. I want to scan through this file looking for any Arabic code points in the range U+0600 through U+06FF, and map each applicable Unicode code point to a byte of ASCII, so that the newly produced file will be composed of purely ASCII characters, with all code points under 128.
How do I go about doing this? I tried reading the characters the same way I would read ASCII, but my terminal shows ?? because each one is a multi-byte character.
NOTE: the file is made up of a subset of the Unicode character set, and the subset size is smaller than the size of ASCII characters. Therefore I am able to do a 1:1 mapping from this particular Unicode subset to ASCII.
This is either impossible, or it’s trivial. Here are the trivial approaches:
If no code point exceeds 127, then simply write it out in ASCII. Done.
If some code points exceed 127, then you must choose how to represent them in ASCII. A common strategy is to use XML syntax, as in &#x3B1; for U+03B1. This will take up to 8 ASCII characters for each trans-ASCII code point in the Basic Multilingual Plane, and up to 10 for code points beyond it.
The impossible ones I leave as an exercise for the original poster. I won’t even mention the foolish-but-possible (read: stupid) approaches, as these are legion. Data destruction is a capital crime in data processing, and should be treated as such.
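The two trivial cases above can be sketched in Python. This is only an illustration under assumptions: the function name to_ascii_with_char_refs is made up for this example, and it writes hexadecimal XML character references for everything above 127.

```python
def to_ascii_with_char_refs(text: str) -> str:
    """Return text with every code point above 127 written as &#xHEX;."""
    return "".join(
        ch if ord(ch) < 128 else f"&#x{ord(ch):X};"
        for ch in text
    )


sample = "alpha = \u03b1"  # contains U+03B1 GREEK SMALL LETTER ALPHA

# Case 1: nothing above 127 means the text is already ASCII.
print(sample.isascii())                  # False

# Case 2: escape whatever lies above 127 instead of destroying it.
print(to_ascii_with_char_refs(sample))   # alpha = &#x3B1;
```

For the output to be reversible, a literal & in the input would also need escaping (for instance as &amp;), which this sketch does not do.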
Note that I am assuming by ‘Unicode character’ you actually mean ‘Unicode code point’; that is, a programmer-visible character. For user-visible characters, you need ‘Unicode grapheme (cluster)’ instead.
Also, unless you normalize your text first, you’ll hate the world. I suggest NFD.
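To see why normalization matters for the Arabic block, here is a small Python sketch using the standard unicodedata module. The choice of U+0622 as the sample is mine, not the original poster's; it shows that one precomposed letter becomes two code points under NFD, so a 1:1 mapping has to be defined against one form or the other.

```python
import unicodedata

composed = "\u0622"  # ARABIC LETTER ALEF WITH MADDA ABOVE, one code point
decomposed = unicodedata.normalize("NFD", composed)

# NFD splits it into ALEF (U+0627) plus the combining MADDAH ABOVE (U+0653).
print([f"U+{ord(c):04X}" for c in decomposed])   # ['U+0627', 'U+0653']
print(len(composed), len(decomposed))            # 1 2

# NFC recomposes the pair back into the single code point.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

Both strings display as the same user-visible character, which is the code point versus grapheme distinction made above.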
EDIT
After further clarification by the original poster, it seems that what he wants to do is very easily accomplished using existing tools without writing a new program. For example, this converts a certain set of Arabic characters from a UTF-8 input file into an ASCII output file:
That only handles these code points:
So you’ll have to extend it to whatever mapping you want.
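As a rough idea of what such an extensible mapping could look like, here is a minimal Python sketch. It is not the command referred to above: the table ARABIC_TO_ASCII, its three entries, and the function names are all hypothetical, and the input file is assumed to be UTF-8. Unmapped non-ASCII code points raise an error rather than being silently dropped.

```python
# Hypothetical table: Arabic code point -> one ASCII character.
# Extend it to cover whatever subset the file really uses.
ARABIC_TO_ASCII = {
    0x0627: "A",  # ARABIC LETTER ALEF
    0x0628: "b",  # ARABIC LETTER BEH
    0x062A: "t",  # ARABIC LETTER TEH
}


def convert(text: str) -> str:
    """Map known Arabic code points; refuse to lose unmapped ones."""
    out = text.translate(ARABIC_TO_ASCII)
    leftover = sorted({ch for ch in out if ord(ch) > 127})
    if leftover:
        listing = ", ".join(f"U+{ord(ch):04X}" for ch in leftover)
        raise ValueError(f"no ASCII mapping for: {listing}")
    return out


def convert_file(src_path: str, dst_path: str) -> None:
    """Read a UTF-8 file (an assumption) and write a pure-ASCII file."""
    with open(src_path, encoding="utf-8") as src:
        converted = convert(src.read())
    with open(dst_path, "w", encoding="ascii") as dst:
        dst.write(converted)


print(convert("\u0628\u0627\u0628"))  # bAb
```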
If you want it in a script instead of a command-line tool, it’s also easy, plus then you can talk about the characters by name by setting up a mapping, such as:
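A hedged sketch of the name-based idea in Python, using unicodedata.lookup to resolve official Unicode character names. The table NAME_TO_ASCII and the ASCII letters chosen are purely illustrative and are not the original poster's mapping.

```python
import unicodedata

# Hypothetical mapping keyed by Unicode character name.
NAME_TO_ASCII = {
    "ARABIC LETTER ALEF": "A",
    "ARABIC LETTER BEH": "b",
    "ARABIC LETTER TEH": "t",
    "ARABIC LETTER SEEN": "s",
}

# Resolve each name to its code point once; lookup() raises KeyError
# for a misspelled name, so typos surface when the table is built.
TABLE = {
    ord(unicodedata.lookup(name)): ascii_char
    for name, ascii_char in NAME_TO_ASCII.items()
}

word = "\u0628\u0627\u0628"  # BEH, ALEF, BEH
print(word.translate(TABLE))  # bAb
```

Keying the table by name keeps the script readable in an ASCII-only editor, at the cost of a lookup step when it starts.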
If this is supposed to be a component in a larger C++ program, then perhaps you will want to implement this in C++, possibly, but not necessarily, using the ICU4C library, which includes transliteration support.
But if all you need is a simple conversion, I don’t understand why you would write a dedicated C++ program. Seems like way too much work.