I’m trying to tokenize words from any text, e.g.:
Ça me plaît.
Should be tokenized as “ça,me,plaît”.
To do this, I want to strip all special characters from the string and then split it on whitespace. With this code:
text = text.toLowerCase().replaceAll(/[^\w]/, ' ')
def tokens = text.split(" ")
I get
a me pla t
Which is far from being useful.
What regex do I need here?
Thanks!
Mulone
This seems to work for me (at least for this situation):
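A minimal sketch of one way to do it, assuming the input is in composed (NFC) form. By default, Java's `\w` matches only ASCII `[a-zA-Z_0-9]`, which is why `ç` and `î` get replaced by spaces. The Unicode property class `\p{L}` matches letters in any script instead. The variable names here are only illustrative.

```groovy
// Sketch: treat every run of non-letter characters as a separator.
// \p{L} matches any Unicode letter, so accented letters such as ç and î are kept.
def text = "Ça me plaît."

// Lowercase, collapse each run of non-letters into one space, drop edge spaces.
def cleaned = text.toLowerCase().replaceAll(/[^\p{L}]+/, ' ').trim()

// Split on whitespace to get the individual tokens.
def tokens = cleaned.split(/\s+/)

println tokens.join(',')   // ça,me,plaît
```

If digits should count as part of a token as well, `[^\p{L}\p{N}]+` may be closer to what `\w` was meant to do. Text that uses combining accents (decomposed form) would likely need `\p{M}` added to the class, or a normalization step first.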