I asked some questions about text mining a week ago, when I was still a bit confused, but now I know what I want to do.
The situation: I have a lot of downloaded pages with HTML content. Some of them may be, for example, text from a blog. They are not structured and come from different sites.
What I want to do: I will split the text into words on whitespace, and I want to classify each word, or group of words, into pre-defined categories such as names, numbers, phone, email, url, date, money, temperature, etc.
What I know: I know the concepts of, or have heard about, Natural Language Processing, Named Entity Recognizers, POS tagging, Naive Bayes, HMM, training and many other things used for classification, but there are several NLP libraries with different classifiers and ways of doing this, and I don’t know which to use or what to do.
WHAT I NEED: I need a code example, from a classifier, an NLP library, whatever, that can classify each word of a text separately, not the entire text. Something like this:
// This is pseudo-code for what I want, not an implementation
classifier.trainFromFile("file-with-train-words.txt");
words = text.split(" ");
for (String word : words) {
    classifiedWord = classifier.classify(word);
    System.out.println(classifiedWord.getType());
}
Can somebody help me? I’m confused by the various APIs, classifiers and algorithms.
You should try Apache OpenNLP. It is easy to use and customize.
If you are doing it for Portuguese, the project documentation has information on how to do it using the Amazonia Corpus. The types supported are:
Person, Organization, Group, Place, Event, ArtProd, Abstract, Thing, Time and Numeric.
Download OpenNLP and the Amazonia Corpus. Extract both and copy the file amazonia.ad to the apache-opennlp-1.5.1-incubating folder. Execute the TokenNameFinderConverter tool to convert the Amazonia corpus to the OpenNLP format:
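For orientation, the OpenNLP name finder training format puts one tokenized sentence per line and wraps each entity in `<START:type> ... <END>` markers. The sketch below shows how such a line breaks down into tokens and typed spans. `NameSampleLine` and `Entity` are hypothetical names made up for illustration, not OpenNLP classes, and the type labels in your converted corpus may differ from the `person` used here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of OpenNLP): reads one line of name finder training data. */
class NameSampleLine {
    /** An entity covering tokens [start, end) with a type label such as "person". */
    static class Entity {
        final String type;
        final int start;
        final int end;

        Entity(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    final List<String> tokens = new ArrayList<>();
    final List<Entity> entities = new ArrayList<>();

    static NameSampleLine parse(String line) {
        NameSampleLine sample = new NameSampleLine();
        String openType = null; // type of the entity currently open, if any
        int openStart = -1;     // token index where the open entity starts
        for (String part : line.trim().split("\\s+")) {
            if (part.isEmpty()) {
                continue; // blank input line
            }
            if (part.startsWith("<START:") && part.endsWith(">")) {
                openType = part.substring("<START:".length(), part.length() - 1);
                openStart = sample.tokens.size();
            } else if (part.equals("<END>") && openType != null) {
                sample.entities.add(new Entity(openType, openStart, sample.tokens.size()));
                openType = null;
            } else {
                sample.tokens.add(part); // ordinary token
            }
        }
        return sample;
    }

    public static void main(String[] args) {
        NameSampleLine s = parse("<START:person> Pierre Vinken <END> , 61 years old .");
        for (Entity e : s.entities) {
            // prints "person: Pierre Vinken"
            System.out.println(e.type + ": " + String.join(" ", s.tokens.subList(e.start, e.end)));
        }
    }
}
```

Opening corpus.txt after the conversion and checking a few lines against this shape is a quick way to confirm the converter did what you expected.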
Train your model (change the encoding to that of the corpus.txt file, which should be your system default encoding; this command can take several minutes):
Executing it from the command line (you should enter only one sentence at a time, and the tokens should be separated by whitespace):
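As I read the note about separated tokens, punctuation has to stand apart from the words (`Lisboa ,` rather than `Lisboa,`) before the sentence is handed to the name finder. OpenNLP ships its own tokenizers for this; the stdlib-only sketch below just illustrates the idea. `SimpleTokenSeparator` and its regular expression are my own assumptions, not anything from OpenNLP.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: rewrites a sentence so that tokens are separated by single spaces. */
class SimpleTokenSeparator {
    // A token is a run of letters/digits, optionally continued through inner
    // punctuation (so "3.50" or "user@example.com" stay whole), or else any
    // single non-space character such as a trailing comma or period.
    private static final Pattern TOKEN =
            Pattern.compile("[\\p{L}\\p{N}]+(?:[.,'@/:-][\\p{L}\\p{N}]+)*|\\S");

    static String separate(String sentence) {
        Matcher m = TOKEN.matcher(sentence);
        List<String> tokens = new ArrayList<>();
        while (m.find()) {
            tokens.add(m.group());
        }
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        // prints "A Maria mora em Lisboa , Portugal ."
        System.out.println(separate("A Maria mora em Lisboa, Portugal."));
    }
}
```

A model generally works best when the input is tokenized the same way as its training data, so treat this as a stand-in until you switch to the tokenizer that matches your corpus.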
Executing it using the API:
To evaluate your model, you can use 10-fold cross-validation (only available in 1.5.2-INCUBATOR; to use it today you need to use the SVN trunk). It can take several hours:
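In case the term is new: k-fold cross-validation splits the annotated sentences into k parts, trains on k-1 of them, tests on the remaining one, and repeats so that every part is used for testing exactly once. That is also why it is slow, since the model is trained ten times. The sketch below shows only the partitioning idea. `KFoldSketch` is a hypothetical name, and assigning items to folds by index modulo k is my assumption, not necessarily how OpenNLP partitions the data.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of k-fold partitioning; not OpenNLP code. */
class KFoldSketch {
    /** Items held out for testing in the given fold (index % k == fold). */
    static <T> List<T> testFold(List<T> items, int k, int fold) {
        List<T> out = new ArrayList<>();
        for (int i = 0; i < items.size(); i++) {
            if (i % k == fold) {
                out.add(items.get(i));
            }
        }
        return out;
    }

    /** Items used for training in the given fold (everything not held out). */
    static <T> List<T> trainFolds(List<T> items, int k, int fold) {
        List<T> out = new ArrayList<>();
        for (int i = 0; i < items.size(); i++) {
            if (i % k != fold) {
                out.add(items.get(i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> sentences = new ArrayList<>();
        for (int i = 0; i < 25; i++) {
            sentences.add("sentence-" + i); // placeholder data
        }
        int k = 10;
        for (int fold = 0; fold < k; fold++) {
            // In a real run: train on trainFolds(...), evaluate on testFold(...),
            // then average the scores over the k rounds.
            System.out.println("fold " + fold
                    + ": train=" + trainFolds(sentences, k, fold).size()
                    + " test=" + testFold(sentences, k, fold).size());
        }
    }
}
```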
Improve the precision/recall by using Custom Feature Generation (check the documentation), for example by adding a name dictionary.
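To check whether a change such as a name dictionary actually helps, compare precision and recall before and after. For named entities these are usually counted over exact matches of span and type: precision is the share of predicted entities that are correct, recall is the share of gold entities that were found, and F1 is their harmonic mean. A minimal sketch, assuming each entity is encoded as a string like `"0-2:person"` (my own convention for the example, not an OpenNLP format):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper: exact-match precision, recall and F1 over entity spans. */
class SpanScores {
    /** Returns {precision, recall, f1}; each entity is a "start-end:type" string. */
    static double[] score(Set<String> gold, Set<String> predicted) {
        Set<String> correct = new HashSet<>(predicted);
        correct.retainAll(gold); // predictions that match a gold entity exactly
        double precision = predicted.isEmpty() ? 0.0 : correct.size() / (double) predicted.size();
        double recall = gold.isEmpty() ? 0.0 : correct.size() / (double) gold.size();
        double f1 = (precision + recall == 0.0)
                ? 0.0
                : 2 * precision * recall / (precision + recall);
        return new double[] {precision, recall, f1};
    }

    public static void main(String[] args) {
        Set<String> gold = new HashSet<>(Arrays.asList(
                "0-2:person", "5-6:place", "9-10:time", "12-14:organization"));
        Set<String> predicted = new HashSet<>(Arrays.asList(
                "0-2:person", "5-6:place", "7-8:person"));
        double[] s = score(gold, predicted);
        System.out.println("precision=" + s[0] + " recall=" + s[1] + " f1=" + s[2]);
    }
}
```

In the example, two of the three predictions are correct and two of the four gold entities are found, so precision is 2/3 and recall is 1/2.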