I have an arbitrary Unicode string that represents a number, such as “2”, “٢” (U+0662, ARABIC-INDIC DIGIT TWO) or “Ⅱ” (U+2161, ROMAN NUMERAL TWO). I want to convert that string into an int. I don’t care about specific locales (the input might not be in the current locale); if it’s a valid number then it should get converted.
I tried QString.toInt and QLocale.toInt, but they don’t seem to get the job done. Example:
bool ok;
int n;
QString s = QChar(0x0662); // ARABIC-INDIC DIGIT TWO
n = s.toInt(&ok); // n == 0; ok == false
QLocale anyLocale(QLocale::AnyLanguage, QLocale::AnyScript, QLocale::AnyCountry);
n = anyLocale.toInt(s, &ok); // n == 0; ok == false
QLocale cLocale = QLocale::C;
n = cLocale.toInt(s, &ok); // n == 0; ok == false
QLocale arabicLocale = QLocale::Arabic; // Specific locale. I don't want that.
n = arabicLocale.toInt(s, &ok); // n == 2; ok == true
Is there a function I am missing?
I could try all locales:
QList<QLocale> allLocales = QLocale::matchingLocales(QLocale::AnyLanguage, QLocale::AnyScript, QLocale::AnyCountry);
for(int i = 0; i < allLocales.size(); i++)
{
    n = allLocales[i].toInt(s, &ok);
    if(ok)
        break;
}
But that feels slightly hackish. Also, it does not work for all strings (e.g. Roman numerals, but that’s an acceptable limitation). Are there any pitfalls when doing it that way, such as conflicting rules in different locales (cf. Turkish vs. non-Turkish letter case rules)?
I’m not aware of any ready-to-use package which does this (but
maybe ICU supports it), but it isn’t hard to do if you really
want to. First, you should download the UnicodeData.txt file
from http://www.unicode.org/Public/UNIDATA/UnicodeData.txt.
This is an easy to parse ASCII file; the exact syntax is
described in http://www.unicode.org/reports/tr44/tr44-10.html,
but for your purposes, all you need to know is that each line in
the file consists of semicolon-separated fields. The first
field contains the character code in hex, the third field the
“general category”, and if the third field is “Nd” (Number,
decimal digit), the seventh field contains the decimal value.
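To make the field layout concrete, here is a minimal sketch that parses two records copied from UnicodeData.txt. The function name parse_decimal_digit is illustrative only:

```python
# Two records in the UnicodeData.txt format (semicolon-separated fields).
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0662;ARABIC-INDIC DIGIT TWO;Nd;0;AN;;2;2;2;N;;;;;
"""

def parse_decimal_digit(line):
    """Return (code_point, value) for an Nd record, or None for anything else."""
    fields = line.strip().split(";")
    # fields[0]: code point in hex, fields[2]: general category,
    # fields[6]: decimal digit value (only filled in for Nd characters).
    if len(fields) < 7 or fields[2] != "Nd":
        return None
    return int(fields[0], 16), int(fields[6])

for line in SAMPLE.splitlines():
    print(parse_decimal_digit(line))
```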
This file can easily be parsed using Python or a number of other
scripting languages, to build a mapping table. You’ll want some
sort of sparse representation, since there are over a million
Unicode code points, of which very few (a few hundred) are
decimal digits. A short Python script can turn the file into a C++
table which can be used to initialize a
std::map<int, int>. If the character is in the map, the mapped element is its value.
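To illustrate the lookup, here is a minimal Python sketch that uses a dict in place of the std::map. It assumes Python's bundled unicodedata module as a stand-in for the parsed UnicodeData.txt table (the bundled data may be a different Unicode version from the file you download). The names build_digit_table and to_int are illustrative only:

```python
import sys
import unicodedata

def build_digit_table():
    """Sparse map from code point to decimal value, for Nd characters only."""
    table = {}
    for code in range(sys.maxunicode + 1):
        ch = chr(code)
        if unicodedata.category(ch) == "Nd":
            table[code] = unicodedata.decimal(ch)
    return table

def to_int(text, table):
    """Convert a non-empty string of decimal digits from any script to an int."""
    if not text:
        raise ValueError("empty string")
    result = 0
    for ch in text:
        if ord(ch) not in table:
            raise ValueError(f"not a decimal digit: {ch!r}")
        result = result * 10 + table[ord(ch)]
    return result

table = build_digit_table()
print(to_int("\u0662", table))  # ARABIC-INDIC DIGIT TWO
```

This version deliberately accepts digits from mixed scripts, which is the first weakness discussed next.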
Whether this is sufficient or not depends on your application.
It has several weaknesses:
It requires extra logic to recognize when two successive
digits are in different alphabets. Presumably a sequence
"1١" should be treated as two numbers (1 and 1), rather than as one
(11). (Because all of the sets of decimal digits are in 10
successive codes, it would be fairly easy, once you know the
digit, to check whether the preceding digit character was in the
same set.)
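The same-set check can be sketched as follows. Because each set occupies 10 successive codes, subtracting a digit's value from its code point gives the code point of that set's zero, and two digits belong to the same set exactly when those zeros match. This sketch again assumes Python's unicodedata module in place of a generated table, and digit_set_zero and split_by_script are illustrative names:

```python
import unicodedata

def digit_set_zero(ch):
    """Code point of the zero of ch's digit set, or None if ch is not a decimal digit."""
    value = unicodedata.decimal(ch, None)
    if value is None:
        return None
    return ord(ch) - value

def split_by_script(text):
    """Parse a run of decimal digits, starting a new number when the digit set changes."""
    numbers = []
    current = None       # value of the number being accumulated
    current_zero = None  # zero code point of the digit set in use
    for ch in text:
        zero = digit_set_zero(ch)
        if zero is None:
            raise ValueError(f"not a decimal digit: {ch!r}")
        if current is not None and zero != current_zero:
            numbers.append(current)
            current = None
        current_zero = zero
        current = (current or 0) * 10 + (ord(ch) - zero)
    if current is not None:
        numbers.append(current)
    return numbers

print(split_by_script("1\u0661"))  # ASCII one followed by Arabic-Indic one
```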
It ignores non-decimal digits, like ௰ or ൱ (Tamil ten and
Malayalam one hundred). There aren’t that many of them, and they are
also in the UnicodeData.txt file, so it might be possible to
find them manually and add them to the table. I don’t know
myself, however, how they combine with other digits when numbers
have been composed.
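To see how such characters differ from decimal digits in the database, the following sketch prints the general category, decimal value and numeric value of a few characters. It assumes Python's unicodedata module, which exposes the same properties as UnicodeData.txt. The helper describe is an illustrative name. The sketch only shows that a numeric value exists, not how these characters combine into larger numbers:

```python
import unicodedata

def describe(ch):
    """Return (general category, decimal value or None, numeric value or None)."""
    return (
        unicodedata.category(ch),
        unicodedata.decimal(ch, None),
        unicodedata.numeric(ch, None),
    )

# Arabic-Indic two, Tamil ten, Malayalam one hundred, Roman numeral two.
for ch in "\u0662\u0BF0\u0D71\u2161":
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {describe(ch)}")
```

Only the first of these characters has a decimal value. The others carry just a numeric value, so a table built from the “Nd” records alone will not contain them.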
If you’re converting numbers, you might have to worry about
the direction. I’m not sure how this is handled (but there is
documentation at the Unicode site); in general, text will appear
in its natural order. In the case of Arabic and related
languages, when reading in the natural order, the low order
digits appear first: something like "١٢" (literally "12", but
because the writing is from right to left, the digits will
appear in the order "21") should be interpreted as 12, and not
21. Except that I’m not sure whether a change direction mark is
present or not. (The exact rules are described in the
documentation at the Unicode site; in the UnicodeData.txt file,
the fifth field (index 4) gives this information. I think if
it’s anything but "AN", you can assume the big-endian standard
used in Europe, but I’m not sure.)
Just to show how simple this is, here’s the Python script to
parse the UnicodeData.txt file for the digit values:
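A minimal sketch of such a script, assuming UnicodeData.txt has already been downloaded and its path is passed as the only command-line argument. The function names and the C++ variable name decimalDigits are illustrative, and the generated initializer assumes a C++11 compiler:

```python
import sys

def read_decimal_digits(lines):
    """Yield (code_point, value) for every record whose general category is Nd."""
    for line in lines:
        fields = line.rstrip("\n").split(";")
        if len(fields) > 6 and fields[2] == "Nd":
            yield int(fields[0], 16), int(fields[6])

def format_cpp_table(digits, name="decimalDigits"):
    """Render the pairs as a C++ std::map<int, int> initializer (C++11 syntax)."""
    rows = [f"    {{ 0x{code:04X}, {value} }}," for code, value in digits]
    return "\n".join(
        ["#include <map>", "", f"std::map<int, int> const {name} = {{", *rows, "};"]
    )

# Only runs when a path is given, e.g.: python digits.py UnicodeData.txt > digits.cpp
if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1], encoding="utf-8") as source:
        print(format_cpp_table(read_decimal_digits(source)))
```

On the C++ side, a lookup with find() on the generated map tells you both whether a code point is a decimal digit and what its value is.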
If you’re doing any work with Unicode, this file is a gold mine
for generating all sorts of useful tables.