I am implementing a search application.
The corpus consists of large text documents.
During file processing I'm tokenizing all the words and calling Step1 of the Porter Stemmer algorithm
(http://tartarus.org/~martin/PorterStemmer/csharp2.txt).
Step1 gets rid of plurals and -ed or -ing…
I noticed that a word like ‘this’ gets stemmed to ‘thi’.
Is that the normal behavior of the algorithm?
I ask because I wanted to keep the word ‘this’ as a token.
From what you describe, my hunch is that ‘this’ is treated as a plural form by the Porter Stemmer algorithm and reduced to ‘thi’. I do not find an explicit reference to non-plural words ending with ‘s’ in Porter's paper:
http://tartarus.org/~martin/PorterStemmer/def.txt
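To make the hunch concrete, here is a minimal Python sketch of Step 1a as listed in def.txt (SSES -> SS, IES -> I, SS -> SS, S -> nothing). It is only an illustration, not a port of the linked C# code. The names step1a, stem_token and PROTECTED_WORDS are my own, and the protected-word list is a hypothetical workaround, not part of Porter's algorithm.

```python
# Step 1a suffix rules from Porter's definition, ordered longest-first.
# Only the first rule that matches is applied.
STEP_1A_RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),
    ("s", ""),
]


def step1a(word):
    """Apply the first matching Step 1a rule to a lowercase word."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


# Hypothetical list of words to leave unstemmed (an assumption for this
# sketch; the Porter algorithm itself has no such exception list).
PROTECTED_WORDS = {"this"}


def stem_token(word):
    """Skip stemming for protected words, otherwise apply Step 1a."""
    if word in PROTECTED_WORDS:
        return word
    return step1a(word)


for w in ["caresses", "ponies", "caress", "cats", "this"]:
    print(w, "->", step1a(w))
```

In this sketch ‘this’ ends in a single ‘s’ that is not part of ‘ss’, so the last rule strips it just as it does for ‘cats’. That matches the ‘thi’ you are seeing. If you need ‘this’ kept intact, checking a small exception list before stemming, as stem_token does, is one possible approach.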