I am just getting started implementing ICU transforms using ICU4C in a C++ program.

Question

Asked: May 21, 2026 (2026-05-21T17:41:23+00:00)

I am just getting started implementing ICU transforms using ICU4C in a C++ program. I am particularly looking at transliteration to and from Chinese.

According to this document, the package supports both “Han-Latin” and “Latin-Han” conversion. As a student of Chinese, I find this surprising: Latin-Han conversion is very difficult without highly advanced statistical techniques, and harder still without tone marks. (The closest thing I have seen is Google Transliterate, which does a great job even without user input, but it is not feasible for the present project.) I am skeptical that this is even possible without resorting to the characters conventionally used to transcribe foreign names, as in 比尔·莫瑞. This is the approach taken by Google Maps in their international domains, as we can see in this paper (PDF).

Anyhow, I was willing to suspend disbelief, and after consulting documentation and tutorials, I was able to construct two Transliterator objects (to and from) and perform simple transliteration using them.

While Han-Latin worked passably (about 80% accuracy for simple data), Latin-Han seemed not to work at all: it returned the same Latin string that was input. That is consistent with the results I get using the online transform sample, and with what I know about Chinese. I managed to find this table, which I think is what is used for both transforms, as we can see here:

{ "Latin-Han", "file", "t_Hani_Latn", "REVERSE" },
{ "Han-Latin", "file", "t_Hani_Latn", "FORWARD" },

I presumed this meant that, given a pinyin string, the reverse transform could potentially reproduce the original characters, but this does not seem to be the case.

I guess my general question is this: is this kind of transform even possible with ICU, or anything besides Google Transliterate? What is the expected output? Relatedly, is there a listing somewhere of the script-pairs that ICU actually supports, if this is not really possible?

Thank you for your time

1 Answer

Editorial Team · Answer 1 · 2026-05-21T17:41:24+00:00

Note that the data is from the CLDR project, http://cldr.unicode.org . ICU supports many script pairs, and it will attempt to use a pivot script (such as Han to Latin to Russian), which is why you can create transliterators such as “Any-Latin”. You might try browsing the ICU and CLDR data sets. The note at the top of the Han-Latin file says that it does not round trip.
