I did some questions about text-mining a week ago, but I was a bit

Question

0

Asked: May 24, 20262026-05-24T06:10:31+00:00 2026-05-24T06:10:31+00:00

I did some questions about text-mining a week ago, but I was a bit

0

I did some questions about text-mining a week ago, but I was a bit confused and still, but now I know wgat I want to do.

The situation: I have a lot of download pages with HTML content. Some of then can bean be a text from a blog, for example. They are not structured and came from different sites.

What I want to do: I will split all the words with whitespace and I want to classify each one or a group of ones in some pre-defined itens like names, numbers, phone, email, url, date, money, temperature, etc.

What I know: I know the concepts/heard about about Natural Language Processing, Named Entity Reconigzer, POSTagging, NayveBayesian, HMM, training and a lot of things to do classification, etc., but there is some different NLP libraries with differents classifiers and ways to do this and I don’t know what use or what do.

WHAT I NEED: I need some code example from a classifier, NLP, whatever, that can classify each word from a text separetely, and not a entire text. Something like this:

//This is pseudo-code for what I want, and not a implementation

classifier.trainFromFile("file-with-train-words.txt");
words = text.split(" ");
for(String word: words){
    classifiedWord = classifier.classify(word);
    System.out.println(classifiedWord.getType());
}

Somebody can help me? I’m confused with various APIs, classifiers and algorithms.

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-24T06:10:31+00:00

You should try Apache OpenNLP. It is easy to use and customize.

If you are doing it for Portuguese there are information on how to do it on the project documentation using Amazonia Corpus. The types supported are:

Person, Organization, Group, Place, Event, ArtProd, Abstract, Thing, Time and Numeric.

Download the OpenNLP and the Amazonia Corpus. Extract both and copy the file amazonia.ad to the apache-opennlp-1.5.1-incubating folder.

Execute the TokenNameFinderConverter tool to convert the Amazonia corpus to the OpenNLP format:

bin/opennlp TokenNameFinderConverter ad -encoding ISO-8859-1 -data amazonia.ad -lang pt > corpus.txt

Train you model (Change the encoding to the encoding of the corpus.txt file, that should be your system default encoding. This command can take several minutes):
```
bin/opennlp TokenNameFinderTrainer -lang pt -encoding UTF-8 -data corpus.txt -model pt-ner.bin -cutoff 20
```

Executing it from command line (You should execute only one sentence and the tokens should be separated):

$ bin/opennlp TokenNameFinder pt-ner.bin 
Loading Token Name Finder model ... done (1,112s)
Meu nome é João da Silva , moro no Brasil . Trabalho na Petrobras e tenho 50 anos .
Meu nome é <START:person> João da Silva <END> , moro no <START:place> Brasil <END> . <START:abstract> Trabalho <END> na <START:abstract> Petrobras <END> e tenho <START:numeric> 50 anos <END> .

Executing it using the API:

InputStream modelIn = new FileInputStream("pt-ner.bin");

try {
  TokenNameFinderModel model = new TokenNameFinderModel(modelIn);
}
catch (IOException e) {
  e.printStackTrace();
}
finally {
  if (modelIn != null) {
    try {
       modelIn.close();
    }
    catch (IOException e) {
    }
  }
}

// load the name finder
NameFinderME nameFinder = new NameFinderME(model);

// pass the token array to the name finder
String[] toks = {"Meu","nome","é","João","da","Silva",",","moro","no","Brasil",".","Trabalho","na","Petrobras","e","tenho","50","anos","."};

// the Span objects will show the start and end of each name, also the type
Span[] nameSpans = nameFinder.find(toks);

To evaluate your model you can use 10-fold cross validation: (only available in 1.5.2-INCUBATOR, to use it today you need to use the SVN trunk) (it can take several hours)
```
bin/opennlp TokenNameFinderCrossValidator -lang pt -encoding UTF-8 -data corpus.txt -cutoff 20
```
Improve the precision/recall by using the Custom Feature Generation (check documentation), for example by adding a name dictionary.

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I did some questions about text-mining a week ago, but I was a bit

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply