I’m using Lucene’s term frequency vectors to calculate cosine similarity between documents.
Say my documents contain these 3 terms: “owe”, “owed”, “owing”.
Lucene treats these as 3 separate terms, but all 3 of them mean the same thing, “owe”. Is there any functionality in Lucene that can be used to index by semantics, so that it indexes “owe”, “owed”, and “owing” as the single term “owe” with term frequency = 3?
If not, I’d welcome any suggestions for achieving this.
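You can use a SnowballFilter with an EnglishStemmer in the analyzer you index with. It replaces each of those verb forms with a common stem (in your example it would be owe, or maybe ow), so the three forms are indexed as one term and their occurrences are counted together in the term frequency vector.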
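To show the effect stemming has on term frequencies, here is a minimal, library-free sketch. It does not use Lucene: the `toyStem` method is a hypothetical stand-in that strips a few suffixes and is not the Snowball algorithm, and the class and method names are made up for illustration. In Lucene, the stemming step would be done by the SnowballFilter in your analyzer instead.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class StemmedTermFrequency {

    // Hypothetical toy stemmer for illustration only; NOT the Snowball algorithm.
    // It strips a trailing "ing", "ed", or "e" from sufficiently long tokens.
    static String toyStem(String token) {
        String t = token.toLowerCase(Locale.ROOT);
        if (t.length() > 4 && t.endsWith("ing")) return t.substring(0, t.length() - 3);
        if (t.length() > 3 && t.endsWith("ed")) return t.substring(0, t.length() - 2);
        if (t.length() > 2 && t.endsWith("e")) return t.substring(0, t.length() - 1);
        return t;
    }

    // Counts occurrences per stem, so inflected forms share a single entry.
    static Map<String, Integer> termFrequencies(List<String> tokens) {
        Map<String, Integer> tf = new LinkedHashMap<>();
        for (String token : tokens) {
            tf.merge(toyStem(token), 1, Integer::sum);
        }
        return tf;
    }

    public static void main(String[] args) {
        // The three forms from the question collapse to one stem.
        System.out.println(termFrequencies(List.of("owe", "owed", "owing")));
    }
}
```

With this toy stemmer the three forms collapse to the stem `ow` with a frequency of 3. A real stemmer may produce a different stem, but the point is the same: because every document is stemmed the same way, the cosine similarity is computed over stems rather than surface forms.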