Could you please recommend Java libraries for text preprocessing and cleanup? The library should perform the following tasks:
- convert all verbs to infinitive
- convert all nouns to singular form
- remove words that contribute nothing to the meaning of a text
Converting words to canonical forms (verbs to infinitives and nouns to singular, for example) is called lemmatization. One Java-based lemmatizer is Stanford CoreNLP.
For “useless words” you probably want “stop words”. There’s no standard list, but there are many floating around the Internet that function in more or less the same way; the only real difference is how many words they include (typically between 100 and 1000). I’ve known people to use this list before. When removing stop words, remember to ignore case when looking for matches.
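To illustrate the case-insensitive matching, here is a minimal sketch in plain Java (9 or later). The class name `StopWordFilter`, the method `removeStopWords`, and the tiny word list are all made up for this example; in practice you would load one of the longer lists from a file, and you would probably use a real tokenizer rather than splitting on whitespace.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

public class StopWordFilter {
    // Tiny illustrative list (an assumption); real lists hold hundreds of words.
    // All entries are stored in lower case.
    private static final Set<String> STOP_WORDS =
            Set.of("a", "an", "and", "the", "of", "to", "in", "is", "are");

    // Splits on whitespace and drops every token whose lower-cased form
    // appears in STOP_WORDS. The surviving tokens keep their original case.
    public static List<String> removeStopWords(String text) {
        return Arrays.stream(text.split("\\s+"))
                .filter(token -> !token.isEmpty())
                .filter(token -> !STOP_WORDS.contains(token.toLowerCase(Locale.ROOT)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // prints [cats, sleeping, garden]
        System.out.println(removeStopWords("The cats are sleeping in THE garden"));
    }
}
```

Lower-casing each token only for the lookup means both "The" and "THE" are matched, while the words you keep are left untouched. Punctuation is not handled here, so "the," would slip through unless you strip it first.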