I’m trying to tokenize words from any text, e.g.: Ça me plaît. Should be

Question

Asked: May 20, 2026 (2026-05-20T18:18:26+00:00)

I’m trying to tokenize words from any text, e.g.:

Ça me plaît.

Should be tokenized as “ça,me,plaît”.
To do this, I want to strip all special characters from the string and then split it on whitespace. With this code:

text = text.toLowerCase().replaceAll(/^\w/, ' ')
def tokens = text.split(" ")

I get

a me pla t

Which is far from being useful.
What regex do I need here?

Thanks!
Mulone

1 Answer

Editorial Team · Answer 1 · 2026-05-20T18:18:27+00:00

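By default, \w in Java/Groovy regexes only matches ASCII word characters ([a-zA-Z_0-9]), so accented letters such as ç and î get replaced along with the punctuation. Matching on lowercase letters instead keeps them. This seems to work for me (at least for this situation):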

'Ça me plaît.'.toLowerCase().replaceAll( /[^\p{javaLowerCase}]/, ' ').split( ' ' )
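
If it helps, here is a slightly more general sketch of the same idea. Instead of replacing unwanted characters and then splitting, it collects every run of Unicode letters directly. The helper name tokenize is just made up for illustration, and it assumes that only letters count as part of a word (digits, apostrophes and hyphens are treated as separators), which may or may not be what you want. The NFC normalization step is also an assumption on my side: it is there so that accented letters stored as a base letter plus a combining mark are still matched as one letter.

```groovy
import java.text.Normalizer

// Hypothetical helper, not from the original post.
// Assumption: a "word" is any run of Unicode letters (\p{L}).
List<String> tokenize(String text) {
    // Compose base letter + combining accent into a single code point where possible
    String normalized = Normalizer.normalize(text, Normalizer.Form.NFC)
    // Groovy's String.findAll(regex) returns every match as a List<String>
    return normalized.toLowerCase().findAll(/\p{L}+/)
}

println tokenize('Ça me plaît.').join(',')   // prints: ça,me,plaît
```

Because this collects matches rather than splitting, leading punctuation or repeated spaces should not produce empty tokens.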
