Editorial Team
Asked: May 15, 2026


Some time in the near future I will need to implement a cross-language word count, or if that is not possible, a cross-language character count.

By word count I mean an accurate count of the words contained within the given text, taking the language of the text into account. The language of the text is set by a user and will be assumed to be correct.

By character count I mean a count of the “possibly in a word” characters contained within the given text, with the same language information described above.

I would much prefer the former count if at all possible, although I am aware of the difficulties involved and that the latter count is much easier.

I’d love it if I just had to look at English, but I need to consider every language here: Chinese, Korean, English, Arabic, Hindi, and so on.

I would like to know if Stack Overflow has any leads on where to start looking for an existing product / method to do this in PHP, as I am a good lazy programmer*

A simple test showing how str_word_count with setlocale doesn’t work, and a function from php.net’s str_word_count page.

*http://blogoscoped.com/archive/2005-08-24-n14.html


1 Answer

  1. Editorial Team
    Added an answer on May 15, 2026 at 7:26 am

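    Counting characters is easy, as long as you count code points rather than bytes:

    echo strlen('一个有十的字符的句子'); // 30 (WRONG: strlen() counts bytes, not characters)
    echo mb_strlen('一个有十的字符的句子', 'UTF-8'); // 10
    echo strlen(utf8_decode('一个有十的字符的句子')); // 10, but utf8_decode() is deprecated as of PHP 8.2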
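
The question asks for a count of "possibly in a word" characters rather than of all characters. A minimal sketch of one way to approximate that, assuming that letters, combining marks and digits (the Unicode categories L, M and N) are the characters that can appear in a word. The helper name countWordCharacters() is hypothetical, and the choice of categories is an assumption that may need tuning per language:

```php
<?php
// Hypothetical helper: counts code points that are letters (L),
// combining marks (M) or numbers (N). Spaces and punctuation are ignored.
function countWordCharacters(string $text): int
{
    // preg_match_all() returns the number of matches, or false on error
    // (for example when $text is not valid UTF-8).
    $count = preg_match_all('~[\p{L}\p{M}\p{N}]~u', $text);
    if ($count === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }
    return $count;
}

echo countWordCharacters('一个有十的字符的句子'), "\n"; // 10
echo countWordCharacters('Hello, world!'), "\n";      // 10
```

Counting marks separately from their base letters means that decomposed accents add to the total, so normalizing the input first may be worth considering.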
    

    Counting words is where things start to get tricky, especially for Chinese, Japanese and other languages that don’t use spaces (or other common “word boundary” characters) as word separators. I don’t speak Chinese and I don’t understand how word counting works in Chinese, so you’ll have to educate me a bit: what makes a word in these languages? Is it any specific character or set of characters? I remember reading something about how hard it was to identify Japanese words in T9 input, but I can’t find it anymore.
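
One possible lead for languages without explicit separators, offered as a sketch rather than a definitive answer: the ICU library includes dictionary-based word segmentation for several of them, and PHP exposes ICU's break iterators through the intl extension as IntlBreakIterator. The example below assumes intl is installed. The helper name countWordsForLocale() is hypothetical, and what ICU treats as a "word" may not match a native speaker's expectation, so the results should be checked against real text in each language:

```php
<?php
// Hypothetical helper: counts words using ICU's word break iterator.
// Returns null when the intl extension is not available.
function countWordsForLocale(string $text, string $locale): ?int
{
    if (!class_exists('IntlBreakIterator')) {
        return null; // intl is missing; the caller has to pick another strategy
    }

    $iterator = IntlBreakIterator::createWordInstance($locale);
    $iterator->setText($text);

    $words = 0;
    $iterator->first();
    // Each call to next() moves to the following boundary; the rule status
    // describes the segment that ends at that boundary.
    while ($iterator->next() !== IntlBreakIterator::DONE) {
        // Segments of spaces or punctuation have a status below WORD_NONE_LIMIT;
        // numbers, letters, kana and ideographs have a status at or above it.
        if ($iterator->getRuleStatus() >= IntlBreakIterator::WORD_NONE_LIMIT) {
            $words++;
        }
    }
    return $words;
}

var_dump(countWordsForLocale('Hello, world!', 'en'));
var_dump(countWordsForLocale('一个有十的字符的句子', 'zh'));
```

The count for the Chinese sentence depends on the dictionary shipped with the installed ICU version, so no expected value is given here.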

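    The following should correctly return the number of words in languages that use spaces or punctuation characters as word separators (\s is included because \p{Z} does not match tabs or line feeds, and the limit is -1 because passing null there is deprecated as of PHP 8.1):

    count(preg_split('~[\s\p{Z}\p{P}]+~u', $string, -1, PREG_SPLIT_NO_EMPTY));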
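
To make the behaviour of the split above easier to see, here is a small sketch that wraps it in a function and runs it on a few inputs. The function name countSeparatedWords() is hypothetical, and the pattern used here adds \s to the separator class so that tabs and line feeds also end a word:

```php
<?php
// Hypothetical wrapper around the separator-based split.
function countSeparatedWords(string $text): int
{
    // Split on runs of whitespace, Unicode separators (Z) and punctuation (P),
    // dropping the empty pieces at the start or end of the text.
    $parts = preg_split('~[\s\p{Z}\p{P}]+~u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($parts === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }
    return count($parts);
}

echo countSeparatedWords("Counting words is tricky.\nVery tricky!"), "\n"; // 6
echo countSeparatedWords("don't"), "\n";                // 2: the apostrophe is punctuation
echo countSeparatedWords('一个有十的字符的句子'), "\n";  // 1: no separators to split on
```

The last two calls show the limits of this approach: contractions are split at the apostrophe, and a sentence written without separators is counted as a single word.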
    
