I’m working on a project that needs to count the occurrences of every word in a text file.
For example, I have a text file like this:
What Silver Lake Looks For in IPO Candidates
3 Companies Crushed by Earnings: Apple, Cirrus Logic, IBM
IBM’s Palmisano: How You Get To Be A 100-Year Old Company
Suppose the file contains the three headlines above and I want to count how often each word occurs. Here, “Companies” and “Company” should be treated as the same word, “company” (lowercase), so the total count for “company” is 2.
Is there an NLP toolkit for Java that can tell that two words like “families” and “family” both come from the same word, “family”?
I’ll use the per-word counts for Naive Bayes training, so it’s very important that the counts are accurate.
Apache Lucene and OpenNLP both provide good stemming algorithm implementations. You can review them and use whichever suits you best. I’ve been using Lucene for my projects.
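To show where a stemmer fits into the counting, here is a minimal sketch that uses only the JDK. The class name WordCounts and the normalize method are my own placeholders, not part of Lucene or OpenNLP, and normalize is a toy rule set that handles only a few regular plural endings. In a real project you would replace its body with a call to whichever library stemmer you pick. Note that a Porter-style stemmer reduces both “companies” and “company” to “compani” rather than a dictionary word, which is fine for counting because both variants still end up under the same key.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WordCounts {

    // Toy normalizer (an assumption for illustration, NOT a real stemmer):
    // lowercases the token and undoes a few regular English plural endings.
    static String normalize(String token) {
        String w = token.toLowerCase();
        if (w.length() > 4 && w.endsWith("ies")) {
            return w.substring(0, w.length() - 3) + "y"; // companies -> company
        }
        boolean pluralS = w.length() > 3 && w.endsWith("s")
                && !w.endsWith("ss") && !w.endsWith("us") && !w.endsWith("is");
        return pluralS ? w.substring(0, w.length() - 1) : w;
    }

    // Counts normalized tokens across all lines; TreeMap keeps the output sorted.
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            // Drop a possessive 's (straight or curly apostrophe) so "IBM's" counts as "ibm".
            String cleaned = line.replaceAll("['\u2019]s\\b", "");
            // Split on anything that is not a letter or a digit.
            for (String token : cleaned.split("[^\\p{L}\\p{N}]+")) {
                if (!token.isEmpty()) {
                    counts.merge(normalize(token), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "What Silver Lake Looks For in IPO Candidates",
                "3 Companies Crushed by Earnings: Apple, Cirrus Logic, IBM",
                "IBM\u2019s Palmisano: How You Get To Be A 100-Year Old Company");
        // Prints one "word<TAB>count" pair per line; "company" comes out with count 2.
        count(lines).forEach((word, n) -> System.out.println(word + "\t" + n));
    }
}
```

Keeping the normalization in a single method means the rest of the counting code does not change when you swap the toy rules for a real stemmer. Whatever you choose, apply the same normalization to the training text and to the text you classify later, so the Naive Bayes features line up.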