I have some French text that I need to process. To do that, I need to:
- First, tokenize the text into words
- Then lemmatize those words to avoid processing the same root more than once
As far as I can see, the WordNet lemmatizer in NLTK only works with English. I want something that returns “vouloir” when I give it “voudrais”, and so on. I also cannot tokenize properly because of the apostrophes. Any pointers would be greatly appreciated. 🙂
Here’s an old but relevant comment by an NLTK dev. It looks like the more advanced stemmers in NLTK are all English-specific.
Note: the link he gives is dead; see here for the current RegexpStemmer documentation.
The more recently added Snowball stemmer does appear to support French, though. Let’s put it to the test:
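A minimal sketch of such a test, assuming NLTK is installed. The word list is my own pick for illustration (a conditional form plus a few irregular plurals and verbs), not anything prescribed by the library:

```python
from nltk.stem.snowball import SnowballStemmer

# "french" is one of the languages listed in SnowballStemmer.languages
stemmer = SnowballStemmer("french")

# Illustrative word list (an assumption, chosen to include irregular forms)
words = ["voudrais", "animaux", "yeux", "dors", "couvre"]

# Map each word to its stem and print the pairs for inspection
stems = {word: stemmer.stem(word) for word in words}
for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

Keep in mind that a stemmer only strips suffixes by rule; it does not look words up in a dictionary, so it will not map an inflected form back to its infinitive the way a lemmatizer would.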
As you can see, some results are a bit dubious.
Not quite what you were hoping for, but I guess it’s a start.
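As for the apostrophes, one option is a small regex tokenizer that keeps an elided prefix such as `l'` or `n'` as its own token. This is only a sketch using the standard `re` module; the pattern and the `tokenize_french` name are my own assumptions, not an NLTK API:

```python
import re

# Three alternatives, tried in order:
#   1. an elided prefix: letters followed by an apostrophe (l', n', qu')
#   2. a word, optionally with internal hyphens (arc-en-ciel)
#   3. any single punctuation character
TOKEN_RE = re.compile(r"\w+['’]|\w+(?:-\w+)*|[^\w\s]")

def tokenize_french(text):
    """Split French text into tokens, separating elided prefixes."""
    return TOKEN_RE.findall(text)

tokens = tokenize_french("L'homme n'a pas vu l'arc-en-ciel.")
print(tokens)
# ["L'", 'homme', "n'", 'a', 'pas', 'vu', "l'", 'arc-en-ciel', '.']
```

The obvious weakness is that every apostrophe is treated as an elision, so a word like "aujourd'hui" comes out as two tokens. If that matters for your text, you would need a list of exceptions on top of this.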