I want to tokenize a text, but not by splitting only on whitespace


Asked: May 24, 2026 (2026-05-24T03:12:40+00:00)

I want to tokenize a text, but not by splitting only on whitespace.

There are some things, like proper names, that I want to keep as a single token (eg.: “Renato Dinhani Conceição”). Another case: a percentage (“60 %”) should stay together and not be split into two tokens.

What I want to know is whether there is a tokenizer in some library that allows this level of customization. If not, I will try to write my own, in which case I would like to know whether there is an interface or a set of practices to follow.

Not everything needs to be recognized universally. For example, I don’t need to recognize Chinese characters.

My application is a college project and is mainly aimed at the Portuguese language. Only some things, such as names of people and places, will come from other languages.


1 Answer

Editorial Team · Answer 1 · 2026-05-24T03:12:42+00:00

I would approach this not from a tokenization perspective, but from a rules perspective. The biggest challenge will be creating a rule set comprehensive enough to cover most of your cases.

1. Define, in human terms, which units should not be split on whitespace. The name example is one.
2. For each of those exceptions to the whitespace split, create a set of rules for identifying it. For the name example: two or more consecutive capitalized words, with or without language-specific non-capitalized name words in between (like “de”).
3. Implement each rule as its own class, which can be called as you loop.
4. Split the entire string on whitespace, then loop over the pieces, keeping track of the previous token and the current one, and applying your rule classes to each token.
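
The steps above could be sketched roughly like this in Python. This is only an illustration under my own assumptions: the names MergeRule, PercentRule, should_merge and tokenize are hypothetical and do not come from any library. It uses the percentage case from the question as its single rule.

```python
import re
from typing import List, Optional, Protocol


class MergeRule(Protocol):
    """Hypothetical interface: one class per exception to the whitespace split."""

    def should_merge(self, previous: str, current: str) -> bool:
        """Return True if `current` belongs to the same token as `previous`."""
        ...


class PercentRule:
    """Keeps a number and a following percent sign together, e.g. "60 %"."""

    _number = re.compile(r"^\d+(?:[.,]\d+)?$")

    def should_merge(self, previous: str, current: str) -> bool:
        return bool(self._number.match(previous)) and current.startswith("%")


def tokenize(text: str, rules: List[MergeRule]) -> List[str]:
    """Split on whitespace, then glue pieces back together when a rule says so."""
    tokens: List[str] = []
    previous: Optional[str] = None  # last whitespace-separated piece seen
    for piece in text.split():
        if previous is not None and any(
            rule.should_merge(previous, piece) for rule in rules
        ):
            tokens[-1] = tokens[-1] + " " + piece  # extend the last token
        else:
            tokens.append(piece)
        previous = piece
    return tokens


print(tokenize("Cerca de 60 % dos alunos", [PercentRule()]))
```

Each new exception would then become one more small class passed in the rules list, so the loop itself does not have to change.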

Example for rule isName:

Loop 1: (eg.:            isName = false
Loop 2: “Renato          isName = true
Loop 3: Dinhani          isName = true
Loop 4: Conceição”).     isName = true
Loop 5: Another          isName = false

Leaving you with three tokens:

(eg.:
“Renato Dinhani Conceição”).
Another
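
A possible Python version of the isName check, run on the fragment from the question, might look like the sketch below. Everything in it is my own assumption rather than an established API: the helper names (core, ends_sentence, name_flags, merge_names), the small list of Portuguese connector words, and the heuristic that a capitalized word at the start of a sentence is not evidence of a name, which is why "Another" comes out as false. That heuristic also means a name that opens a sentence loses its first word, so treat this as a starting point only.

```python
from typing import List

CONNECTORS = {"de", "da", "do", "dos", "das", "e"}  # assumed name particles
STRIP_CHARS = "\"'“”‘’()[]{}.,;:!?"  # punctuation ignored around a word
CLOSERS = "\"'”’)]}"  # closing quotes and brackets


def core(piece: str) -> str:
    """Return the piece without surrounding quotes, brackets and punctuation."""
    return piece.strip(STRIP_CHARS)


def ends_sentence(piece: str) -> bool:
    """True if the piece ends in . ! or ?, ignoring closing quotes/brackets."""
    return piece.rstrip(CLOSERS).endswith((".", "!", "?"))


def name_flags(pieces: List[str]) -> List[bool]:
    """Compute the isName flag for every whitespace-separated piece."""
    flags: List[bool] = []
    for i, piece in enumerate(pieces):
        # A capital letter at the start of a sentence proves nothing.
        starts_sentence = i == 0 or ends_sentence(pieces[i - 1])
        word = core(piece)
        capitalized = word[:1].isupper()
        # A connector counts only between a name word and a capitalized word.
        connector = (
            word.lower() in CONNECTORS
            and bool(flags) and flags[-1]
            and i + 1 < len(pieces)
            and core(pieces[i + 1])[:1].isupper()
        )
        flags.append(not starts_sentence and (capitalized or connector))
    return flags


def merge_names(text: str) -> List[str]:
    """Merge runs of two or more name-like pieces into a single token."""
    pieces = text.split()
    tokens: List[str] = []
    run: List[str] = []  # current stretch of name-like pieces

    def flush() -> None:
        # A single capitalized word on its own is not treated as a name.
        if len(run) >= 2:
            tokens.append(" ".join(run))
        else:
            tokens.extend(run)
        run.clear()

    for piece, flag in zip(pieces, name_flags(pieces)):
        if flag:
            run.append(piece)
        else:
            flush()
            tokens.append(piece)
    flush()
    return tokens


sample = "(eg.: “Renato Dinhani Conceição”). Another"
print(name_flags(sample.split()))
print(merge_names(sample))
```

Abbreviations that end in a period (titles, for instance) would be mistaken for sentence ends here, so they would need a rule of their own.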
