The Archive Base

Editorial Team
Asked: June 15, 2026

I am using ICU4C to transliterate CJK. I am wondering whether it is possible

I am using ICU4C to transliterate CJK. I am wondering whether it is possible to have word segmentation in ICU, to split Chinese text into a sequence of words, defined according to some word segmentation standard.

When I try transliterating for example:

直接输出html代码而不是作为函数返回值代后处理

using

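UErrorCode err = U_ZERO_ERROR;
Transliterator* myTrans =
    Transliterator::createInstance("zh-Latin", UTRANS_FORWARD, err);
// fromUTF8 assumes this source file is saved as UTF-8.
UnicodeString str = UnicodeString::fromUTF8("直接输出html代码而不是作为函数返回值代后处理");
myTrans->transliterate(str);
std::string st;
str.toUTF8String(st);
std::cout << st << std::endl;
delete myTrans;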

I get the following output:

zhí jiē shū chū html dài mǎ ér bù shì zuò wèi hán shù fǎn huí zhí dài hòu chù lǐ

It looks correct when checked against online pinyin tools, but my problem is that ICU transliterates the characters one by one. What I’m looking for is something more like the text below (I don’t know any Chinese, so the grouping below is probably meaningless, but it shows the kind of output I’m interested in):

zhíjiē shūchū html dàimǎér bùshì zuò wèihán shùfǎn huízhídài hòu chùlǐ

I have been told that ICU 50 is capable of word segmentation, but I couldn’t find any documentation on their website or elsewhere on the web. Has anyone worked with word segmentation in ICU, or does anyone know how to do it or have a good link on the subject?

1 Answer

  1. Editorial Team
    Added an answer on June 15, 2026 at 8:03 am

    “Dictionary Based Iterator” isn’t a different API. Just create an ICU word break iterator with the appropriate locale ID.

    There’s a C/C++ sample that comes with ICU in icu/source/samples/break

    Also the following sample code shows word breaking:
    http://source.icu-project.org/repos/icu/icuapps/trunk/iucsamples/c/s24_brkw/s24_brkw.cpp
    http://source.icu-project.org/repos/icu/icuapps/trunk/iucsamples/c/s23_brki/

    Probably something like this:

    UErrorCode status = U_ZERO_ERROR;
    // Locale "zh" asks for the word-break rules used for Chinese text.
    BreakIterator *wordIterator = BreakIterator::createWordInstance(Locale("zh"), status);
    if (U_SUCCESS(status)) {
        UnicodeString text = "Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.";
        wordIterator->setText(text);
        int32_t breakCount = 0;
        int32_t start = wordIterator->first();
        for (int32_t end = wordIterator->next();
             end != BreakIterator::DONE;
             start = end, end = wordIterator->next())
        {
            // [start, end) is one segment: a word, or the space/punctuation between words.
            breakCount++;
        }
    }
    delete wordIterator;
    