I am using ICU4C to transliterate CJK. I am wondering whether it is possible

Asked: June 15, 2026

I am using ICU4C to transliterate CJK. I am wondering whether it is possible to have word segmentation in ICU, to split Chinese text into a sequence of words, defined according to some word segmentation standard.

When I try transliterating for example:

直接输出html代码而不是作为函数返回值代后处理

using

UErrorCode err = U_ZERO_ERROR;
Transliterator* myTrans =
                  Transliterator::createInstance("zh-Latin", UTRANS_FORWARD, err);
UnicodeString str;
str.setTo("直接输出html代码而不是作为函数返回值代后处理");
myTrans->transliterate(str);
std::string st;
str.toUTF8String(st);
std::cout << st << std::endl;

I get the following output:

zhí jiē shū chū html dài mǎ ér bù shì zuò wèi hán shù fǎn huí zhí dài hòu chù lǐ

Checked against online pinyin tools, this looks perfectly fine, but my problem is that ICU transliterates the characters one by one. What I’m looking for is output grouped by word, something like the text below (I don’t know any Chinese, so the grouping below probably doesn’t mean anything, but it should demonstrate the kind of output I’m interested in):

zhíjiē shūchū html dàimǎér bùshì zuò wèihán shùfǎn huízhídài hòu chùlǐ

I have been told that ICU 50 is capable of word segmentation, but I couldn’t find any documentation on it, either on the ICU website or elsewhere on the web. Has anyone worked with word segmentation in ICU, or does anyone know how to do it or have a good link that explains it?

1 Answer

Editorial Team · Answer 1 · 2026-06-15T08:03:44+00:00

“Dictionary Based Iterator” isn’t a different API. Just create an ICU word break iterator with the appropriate locale ID.

There’s a C/C++ sample that comes with ICU in icu/source/samples/break

Also the following sample code shows word breaking:
http://source.icu-project.org/repos/icu/icuapps/trunk/iucsamples/c/s24_brkw/s24_brkw.cpp
http://source.icu-project.org/repos/icu/icuapps/trunk/iucsamples/c/s23_brki/

Probably something like this:

  UErrorCode status = U_ZERO_ERROR;
  // Word break iterator for the Chinese locale.
  BreakIterator *wordIterator = BreakIterator::createWordInstance(Locale("zh"), status);
  if (U_SUCCESS(status)) {
    UnicodeString text = "Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.";
    wordIterator->setText(text);
    int32_t breakCount = 0;
    int32_t start = wordIterator->first();
    // Each [start, end) pair delimits one segment: a word, a space, or punctuation.
    for (int32_t end = wordIterator->next();
         end != BreakIterator::DONE;
         start = end, end = wordIterator->next())
    {
      breakCount++;
    }
  }
  delete wordIterator; // deleting a null pointer is safe if creation failed
