I want to detect words in text, i.e. I need to know which characters


Asked: May 13, 2026


I want to detect words in text, i.e. I need to know which characters in a given text are letters (that is, characters that can be part of a spoken word) and which are punctuation and the like.

For example, in the sentence above, “I”, “want”, “i” and “e” are words in this sense, while the spaces, the “.” and the comma are not.

The difficulty is that I want to be able to read any kind of script that Unicode covers. E.g., the German word “schön” is one word. But what about Greek, Arabic or Japanese?

So, what I need is a table or list specifying all ranges of characters that can form words. Optionally, I’d also like to know which characters are digits that can form numbers (assuming other scripts have numbering schemes similar to the Arabic numerals).

I need this for Mac OS X, Windows and Linux. I’ll write a C app, so it needs to be either an OS library or a complete code/data solution that I could translate into C.

I know that Mac OS (Cocoa) offers functions for this purpose, but I am not sure whether there are similar solutions for Windows and Linux (GTK-based, probably?).

Alternatively, I could write my own code if I had the complete tables.

I have found the Unicode charts (http://unicode.org/charts/index.html#scripts), but they don’t come in one convenient form that I could use in programming.

So, can someone tell me whether there are functions for Windows and Linux for this purpose, or where I can find a complete table/list of word characters in Unicode?


1 Answer

Editorial Team · Answer 1 · 2026-05-13T16:46:48+00:00


You can try using the Unicode character category of each character to figure out what the word separators may be, but be aware that some languages (e.g. Japanese) do not use word separators at all.
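To make the category idea concrete, here is a minimal sketch in C, assuming the text is NUL-terminated UTF-8. The names `cp_range`, `letter_ranges`, `is_letter`, `utf8_next` and `count_words` are made up for this example, and the table is only a tiny illustrative subset (basic Latin, Latin-1 and Greek letters). A complete table would have to be generated from the Unicode Character Database (e.g. the letter categories in `UnicodeData.txt`). The decoder is simplified and does not reject overlong encodings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive range of code points. */
typedef struct { uint32_t lo, hi; } cp_range;

/* ASSUMPTION: illustrative subset only, not a complete letter table.
   A real table would be generated from the Unicode Character Database. */
static const cp_range letter_ranges[] = {
    {0x0041, 0x005A}, {0x0061, 0x007A},                   /* A-Z, a-z */
    {0x00C0, 0x00D6}, {0x00D8, 0x00F6}, {0x00F8, 0x00FF}, /* Latin-1 letters */
    {0x0391, 0x03A1}, {0x03A3, 0x03CE},                   /* Greek letters */
};

/* Return 1 if cp falls into one of the letter ranges, else 0. */
static int is_letter(uint32_t cp) {
    size_t n = sizeof letter_ranges / sizeof letter_ranges[0];
    for (size_t i = 0; i < n; i++)
        if (cp >= letter_ranges[i].lo && cp <= letter_ranges[i].hi)
            return 1;
    return 0;
}

/* Decode one UTF-8 sequence at *p and advance *p past it.
   Malformed input yields U+FFFD and advances by one byte. */
static uint32_t utf8_next(const unsigned char **p) {
    const unsigned char *s = *p;
    uint32_t cp;
    int extra;
    if (s[0] < 0x80)                { cp = s[0];        extra = 0; }
    else if ((s[0] & 0xE0) == 0xC0) { cp = s[0] & 0x1F; extra = 1; }
    else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; extra = 2; }
    else if ((s[0] & 0xF8) == 0xF0) { cp = s[0] & 0x07; extra = 3; }
    else { *p = s + 1; return 0xFFFD; }
    for (int i = 1; i <= extra; i++) {
        /* A NUL or non-continuation byte stops here, so we never read
           past the terminator. */
        if ((s[i] & 0xC0) != 0x80) { *p = s + 1; return 0xFFFD; }
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }
    *p = s + extra + 1;
    return cp;
}

/* Count maximal runs of letters in a NUL-terminated UTF-8 string. */
size_t count_words(const char *text) {
    assert(text != NULL);
    const unsigned char *p = (const unsigned char *)text;
    size_t words = 0;
    int in_word = 0;
    while (*p != 0) {
        int letter = is_letter(utf8_next(&p));
        if (letter && !in_word)
            words++;            /* a new run of letters starts here */
        in_word = letter;
    }
    return words;
}
```

Calling `count_words` on the bytes of “I want schön, i.e. words.” counts six runs of letters: “I”, “want”, “schön”, “i”, “e” and “words”. As noted above, this run-of-letters approach does not help for scripts such as Japanese that are written without word separators.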
