The Archive Base

Editorial Team
Asked: May 24, 2026

I asked some questions about text mining a week ago, but I was a bit confused then. I still am, but now I know what I want to do.

The situation: I have a lot of downloaded pages with HTML content. Some of them can be text from a blog, for example. They are not structured and come from different sites.

What I want to do: I will split all the words on whitespace, and I want to classify each word, or group of words, into predefined items like names, numbers, phone, email, URL, date, money, temperature, etc.

What I know: I know the concepts of, or have heard about, Natural Language Processing, Named Entity Recognizer, POS tagging, Naive Bayes, HMM, training, and a lot of other things used for classification, but there are several different NLP libraries with different classifiers and ways to do this, and I don’t know which to use or what to do.

WHAT I NEED: I need a code example from a classifier, NLP library, whatever, that can classify each word from a text separately, and not an entire text. Something like this:

// This is pseudo-code for what I want, not an implementation

classifier.trainFromFile("file-with-train-words.txt");
words = text.split(" ");
for(String word: words){
    classifiedWord = classifier.classify(word);
    System.out.println(classifiedWord.getType());
}

Can somebody help me? I’m confused by the various APIs, classifiers and algorithms.

1 Answer

  1. Editorial Team
    Added an answer on May 24, 2026 at 6:10 am

    You should try Apache OpenNLP. It is easy to use and customize.

    If you are working with Portuguese, the project documentation explains how to do it using the Amazonia corpus. The supported types are:

    Person, Organization, Group, Place, Event, ArtProd, Abstract, Thing, Time and Numeric.

    1. Download OpenNLP and the Amazonia corpus. Extract both and copy the file amazonia.ad to the apache-opennlp-1.5.1-incubating folder.

    2. Execute the TokenNameFinderConverter tool to convert the Amazonia corpus to the OpenNLP format:

      bin/opennlp TokenNameFinderConverter ad -encoding ISO-8859-1 -data amazonia.ad -lang pt > corpus.txt
      
    3. Train your model (set -encoding to the encoding of the corpus.txt file, which should be your system default encoding; this command can take several minutes):

      bin/opennlp TokenNameFinderTrainer -lang pt -encoding UTF-8 -data corpus.txt -model pt-ner.bin -cutoff 20
      
    4. Run it from the command line (pass one sentence at a time, with the tokens already separated by spaces):

      $ bin/opennlp TokenNameFinder pt-ner.bin 
      Loading Token Name Finder model ... done (1,112s)
      Meu nome é João da Silva , moro no Brasil . Trabalho na Petrobras e tenho 50 anos .
      Meu nome é <START:person> João da Silva <END> , moro no <START:place> Brasil <END> . <START:abstract> Trabalho <END> na <START:abstract> Petrobras <END> e tenho <START:numeric> 50 anos <END> .
      
    5. Run it using the API:

      import java.io.FileInputStream;
      import java.io.IOException;
      import java.io.InputStream;
      import opennlp.tools.namefind.NameFinderME;
      import opennlp.tools.namefind.TokenNameFinderModel;
      import opennlp.tools.util.Span;
      
      // load the model; try-with-resources closes the stream even on failure
      // (the enclosing method must catch or declare IOException)
      TokenNameFinderModel model;
      try (InputStream modelIn = new FileInputStream("pt-ner.bin")) {
        model = new TokenNameFinderModel(modelIn);
      }
      
      // load the name finder
      NameFinderME nameFinder = new NameFinderME(model);
      
      // pass the token array to the name finder
      String[] toks = {"Meu","nome","é","João","da","Silva",",","moro","no","Brasil",".","Trabalho","na","Petrobras","e","tenho","50","anos","."};
      
      // the Span objects will show the start and end of each name, also the type
      Span[] nameSpans = nameFinder.find(toks);
      
    6. To evaluate your model, you can use 10-fold cross-validation (only available in 1.5.2-INCUBATOR; to use it today you need the SVN trunk). It can take several hours:

      bin/opennlp TokenNameFinderCrossValidator -lang pt -encoding UTF-8 -data corpus.txt -cutoff 20
      
    7. Improve precision/recall by using custom feature generation (see the documentation), for example by adding a name dictionary.
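
If you want one label per word, as in the question's pseudo-code, the annotated output from step 4 can be turned into token/label pairs. Here is a minimal sketch, assuming the `<START:type> ... <END>` format shown in that output and whitespace-separated tokens. The class name `TokenLabels`, the method `labelTokens`, and the "other" label for words outside any entity are made up for illustration; they are not part of OpenNLP.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not an OpenNLP class.
final class TokenLabels {

    // Returns one {token, label} pair per word of the annotated line.
    // Words outside any <START:type> ... <END> span get the label "other".
    static List<String[]> labelTokens(String annotated) {
        List<String[]> result = new ArrayList<>();
        String current = "other";
        for (String tok : annotated.trim().split("\\s+")) {
            if (tok.startsWith("<START:") && tok.endsWith(">")) {
                // opening marker: remember the entity type for the next words
                current = tok.substring("<START:".length(), tok.length() - 1);
            } else if (tok.equals("<END>")) {
                // closing marker: back to non-entity words
                current = "other";
            } else if (!tok.isEmpty()) {
                result.add(new String[] {tok, current});
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Fragment of the annotated output shown in step 4
        String line = "moro no <START:place> Brasil <END> . <START:abstract> Trabalho <END> na "
                + "<START:abstract> Petrobras <END> e tenho <START:numeric> 50 anos <END> .";
        for (String[] pair : labelTokens(line)) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
    }
}
```

Every word inside a span gets the span's type, so "50" and "anos" would both come out as numeric. The same loop idea should carry over to the Span array from step 5, using each span's start, end, and type instead of the markers.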
