I want to know if there is an API for text analysis in Java. Something that can extract all the words in a text, separate words, expressions, etc. Something that can tell me whether a word it finds is a number, date, year, name, currency, etc.
I'm just getting started with text analysis, so I only need an API to kick things off. I wrote a web crawler, and now I need something to analyze the downloaded data. I need methods to count the number of words in a page, find similar words, detect data types, and access other text-related features.
Are there APIs for text analysis in Java?
EDIT: Text mining: I want to mine the text, and I'm looking for a Java API that provides this.
For example, you might use some classes from the standard library, such as java.text, or use StreamTokenizer (which you can customize to your requirements). But as you know, text from internet sources usually contains many spelling mistakes, and for better results you need something like a fuzzy tokenizer; java.text and the other standard utilities are too limited for that. So I'd advise you to use regular expressions (java.util.regex) and build your own tokenizer tailored to your needs.
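As a rough illustration of the regex approach, here is a minimal sketch of a tokenizer that also tags each token with a coarse type and counts words. The class name RegexTokenizer, the type labels, and the patterns themselves (ISO-style dates, years starting with 19 or 20, a currency symbol followed by digits) are my own assumptions for the example; real crawled text will need patterns tuned to your data.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTokenizer {
    // Alternatives are tried left to right, so the more specific types come first.
    // All patterns here are illustrative assumptions, not a complete grammar.
    private static final Pattern TOKEN = Pattern.compile(
          "(?<CURRENCY>\\p{Sc}\\d+(?:\\.\\d+)?)"      // e.g. $12.50
        + "|(?<DATE>\\d{4}-\\d{2}-\\d{2})"            // e.g. 2021-03-04
        + "|(?<YEAR>(?:19|20)\\d{2}(?!\\.?\\d))"      // e.g. 2019, but not 2019.5
        + "|(?<NUMBER>\\d+(?:\\.\\d+)?)"              // e.g. 42 or 3.14
        + "|(?<WORD>\\p{L}+(?:['-]\\p{L}+)*)");       // e.g. don't, state-machine

    private static final String[] TYPES = {"CURRENCY", "DATE", "YEAR", "NUMBER", "WORD"};

    // Returns {type, text} pairs in the order the tokens appear.
    public static List<String[]> tokenize(String text) {
        List<String[]> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            for (String type : TYPES) {
                if (m.group(type) != null) {
                    tokens.add(new String[] {type, m.group()});
                    break;
                }
            }
        }
        return tokens;
    }

    // Case-insensitive frequency count of WORD tokens only.
    public static Map<String, Integer> wordCounts(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String[] t : tokenize(text)) {
            if (t[0].equals("WORD")) {
                counts.merge(t[1].toLowerCase(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String text = "The price rose to $12.50 on 2021-03-04, up from 2019.";
        for (String[] t : tokenize(text)) {
            System.out.println(t[0] + "\t" + t[1]);
        }
        System.out.println(wordCounts("the cat and The dog"));
    }
}
```

Anything that matches none of the patterns (punctuation, whitespace) is simply skipped. Handling misspellings would need an extra step on top of this, for example comparing words by edit distance, which this sketch does not attempt.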
P.S.
Depending on your needs, you might also write a state-machine parser to recognize templated parts of raw text. The picture below shows a simple state-machine recognizer (you can construct a more advanced parser that recognizes much more complex templates in text).
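To make the state-machine idea concrete in code, here is a minimal sketch of a recognizer for one template: an optionally signed decimal number such as -3.14. It is not the machine from the picture; the class name DecimalRecognizer, the states, and the template are assumptions chosen only to show the technique.

```java
public class DecimalRecognizer {
    // States of the machine; INT and FRAC are the accepting states.
    enum State { START, SIGN, INT, DOT, FRAC, REJECT }

    // Transition function: the next state for the current state and input character.
    static State next(State s, char c) {
        boolean digit = c >= '0' && c <= '9';
        switch (s) {
            case START:
                if (c == '+' || c == '-') return State.SIGN;
                return digit ? State.INT : State.REJECT;
            case SIGN:
                return digit ? State.INT : State.REJECT;
            case INT:
                if (c == '.') return State.DOT;
                return digit ? State.INT : State.REJECT;
            case DOT:
            case FRAC:
                return digit ? State.FRAC : State.REJECT;
            default:
                return State.REJECT;
        }
    }

    // Feeds the token through the machine one character at a time.
    public static boolean matches(String token) {
        State s = State.START;
        for (int i = 0; i < token.length() && s != State.REJECT; i++) {
            s = next(s, token.charAt(i));
        }
        return s == State.INT || s == State.FRAC;
    }

    public static void main(String[] args) {
        for (String t : new String[] {"42", "-3.14", "3.", "abc"}) {
            System.out.println(t + " -> " + matches(t));
        }
    }
}
```

A more complex template (dates, prices, phone numbers) follows the same shape: add states and transitions, and mark which states are accepting.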