I have some text in French that I need to process in some ways.


Editorial Team

Asked: June 14, 2026


I have some text in French that I need to process in some ways. For that, I need to:

First, tokenize the text into words
Then lemmatize those words to avoid processing the same root more than once

As far as I can see, the wordnet lemmatizer in the NLTK only works with English. I want something that can return “vouloir” when I give it “voudrais” and so on. I also cannot tokenize properly because of the apostrophes. Any pointers would be greatly appreciated. 🙂


1 Answer

Editorial Team · Answer 1 · 2026-06-14T01:14:29+00:00

Here's an old but relevant comment by an NLTK developer. It looks like most of the advanced stemmers in NLTK are English-specific:

The nltk.stem module currently contains 3 stemmers: the Porter stemmer, the Lancaster stemmer, and a Regular-Expression based stemmer. The Porter stemmer and Lancaster stemmer are both English-specific. The regular-expression based stemmer can be customized to use any regular expression you wish. So you should be able to write a simple stemmer for non-English languages using the regexp stemmer. For example, for french:

from nltk import stem
stemmer = stem.Regexp('s$|es$|era$|erez$|ions$| <etc> ')

But you’d need to come up with the language-specific regular expression yourself. For a more advanced stemmer, it would probably be necessary to add a new module. (This might be a good student project.)

For more information on the regexp stemmer:

http://nltk.org/doc/api/nltk.stem.regexp.Regexp-class.html

-Edward

Note: the link he gives is dead; refer to the current NLTK documentation for the regexp stemmer instead.
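To make the suffix-stripping idea from the quoted comment concrete, here is a minimal sketch in plain Python that does not depend on NLTK. The suffix list is just the handful of endings from the quote (it is nowhere near a complete French rule set), and the names `FRENCH_SUFFIXES`, `regexp_stem` and `min_length` are made up for illustration. In current NLTK releases the equivalent class is, as far as I know, `nltk.stem.RegexpStemmer`, which takes a similar pattern plus a `min` length argument.

```python
import re

# Illustrative suffixes only, copied from the quoted example; a real French
# stemmer would need many more rules.
FRENCH_SUFFIXES = re.compile(r"(?:erez|ions|era|es|s)$")

def regexp_stem(word, min_length=4):
    """Strip one matching suffix; words shorter than min_length are left alone."""
    if len(word) < min_length:
        return word
    return FRENCH_SUFFIXES.sub("", word)

for w in ["parlerez", "chantions", "tables", "les"]:
    print(w, "->", regexp_stem(w))
```

The length guard is there so that very short words such as "les" are not reduced to a meaningless fragment. This is still a stemmer, not a lemmatizer: it will never map "voudrais" to "vouloir".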

The more recently added snowball stemmer appears to be able to stem French though. Let’s put it to the test:

>>> from nltk.stem.snowball import FrenchStemmer
>>> stemmer = FrenchStemmer()
>>> stemmer.stem('voudrais')
u'voudr'
>>> stemmer.stem('animaux')
u'animal'
>>> stemmer.stem('yeux')
u'yeux'
>>> stemmer.stem('dors')
u'dor'
>>> stemmer.stem('couvre')
u'couvr'

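As you can see, some results are a bit dubious: 'yeux' comes back unchanged, and 'voudrais' is cut down to the stem 'voudr' rather than mapped to the lemma 'vouloir'. A stemmer only strips endings; it does not look up dictionary forms.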

Not quite what you were hoping for, but I guess it’s a start.
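On the tokenizing problem with apostrophes, one possible approach is a small regular expression that keeps an elided article or pronoun ("l'", "j'") as its own token. This is only a sketch under simple assumptions: the names `TOKEN` and `tokenize_french` are hypothetical, punctuation is dropped, and fixed words containing an apostrophe (for example "aujourd'hui") will be split and would need special handling.

```python
import re

# A token is either letters followed by an apostrophe (straight or curly),
# or a plain run of word characters. Punctuation is discarded.
TOKEN = re.compile(r"\w+['’]|\w+")

def tokenize_french(text):
    """Split French text into words, separating elided forms like l' and j'."""
    return TOKEN.findall(text)

print(tokenize_french("J'ai vu l'homme"))
```

The resulting tokens can then be passed one by one to whichever stemmer you settle on.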
