The Archive Base


Editorial Team
Asked: May 13, 2026

I’m trying to analyze content of a Drupal database for collective intelligence purposes.

So far I've been able to work out a simple example that tokenizes the various content (mainly forum posts) and counts tokens after removing stop words.
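The counting step described above can be sketched in plain Java. This is a simplified stand-in for the real Lucene pipeline: the splitting regex approximates what StandardTokenizer does, and the stop-word list is a tiny illustrative sample rather than Lucene's:

```java
import java.util.*;

// Simplified stand-in for the pipeline described above:
// lowercase, split on non-alphanumerics, drop stop words, count.
// STOP_WORDS is a tiny illustrative sample, not a real stop-word list.
public class TokenCounter {
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("la", "di", "per", "i", "e", "lo", "a", "the"));

    public static Map<String, Integer> countTokens(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) continue;
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countTokens("Pubblichiamo la presentazione di IBM riguardante DB2"));
    }
}
```

The same problem shown below applies here too: embedded HTML survives the split and pollutes the counts, which is why the tokenization itself needs fixing.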

The StandardTokenizer supplied with Lucene should be able to tokenize hostnames and emails, but the content can also have embedded HTML, e.g.:

Pubblichiamo la presentazione di IBM riguardante DB2 per i vari sistemi operativi
Linux, UNIX e Windows.\r\n\r\nQuesto documento sta sulla piattaforma KM e lo potete
scaricare a questo <a href=\'https://sfkm.griffon.local/sites/BSF%20KM/BSF/CC%20T/Specifiche/Eventi2008/IBM%20DB2%20for%20Linux,%20UNIX%20e%20Windows.pdf\' target=blank>link</a>.

This is tokenized badly, as follows:

pubblichiamo -> 1
presentazione -> 1
ibm -> 1
riguardante -> 1
db2 -> 1
vari -> 1
sistemi -> 1
operativi -> 1
linux -> 1
unix -> 1
windows -> 1
documento -> 1
piattaforma -> 1
km -> 1
potete -> 1
scaricare -> 1
href -> 1
https -> 1
sfkm.griffon.local -> 1
sites -> 1
bsf -> 1
20km/bsf -> 1
cc -> 1
20t/specifiche/eventi2008/ibm -> 1
20db2 -> 1
20for -> 1
20linux -> 1
20unix -> 1
20e -> 1
20windows.pdf -> 1
target -> 1
blank -> 1
link -> 1

What I would like is to keep links together as single tokens and to strip HTML tags (like <pre> or <strong>) that are useless.
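The desired pre-processing can be sketched with plain regexes. This is a naive illustration of the idea, not a robust solution (a real HTML parser handles malformed markup far better); `stripTagsKeepLinks` is a hypothetical helper name:

```java
import java.util.regex.*;

// Naive regex sketch of the pre-processing described above: keep each
// link's URL as a single whitespace-delimited token, drop all other tags.
// A real HTML parser would be more robust; this only illustrates the idea.
public class LinkPreserver {
    public static String stripTagsKeepLinks(String html) {
        // Replace each <a ...> start tag with its href value plus a space,
        // so the URL survives tokenization as one token.
        String withUrls = Pattern
            .compile("<a\\s[^>]*href=['\"]?([^'\"\\s>]+)['\"]?[^>]*>", Pattern.CASE_INSENSITIVE)
            .matcher(html)
            .replaceAll("$1 ");
        // Drop every remaining tag (<pre>, <strong>, </a>, ...).
        return withUrls.replaceAll("<[^>]+>", " ");
    }
}
```

Running the result through a tokenizer would then see the full URL as one token and the anchor text as ordinary words.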

Should I write a Filter or a different Tokenizer? Should the new Tokenizer replace the standard one, or can I mix them together? The hardest way would be to take StandardTokenizerImpl, copy it into a new file, and add custom behaviour, but I wouldn't like to go too deep into the Lucene implementation for now (I'm learning gradually).

Maybe there is already something similar implemented but I’ve been unable to find it.

EDIT:
Looking at StandardTokenizerImpl makes me think that extending it by modifying the actual implementation would be no more convenient than writing a tokenizer myself with lex or flex.



1 Answer

Editorial Team
Answered: May 13, 2026 at 12:41 am

    This is most easily achieved by pre-processing the text before giving it to Lucene to tokenize. Use an HTML parser, such as Jericho, to convert your content into text with no HTML, stripping out the tags you don't care about and extracting the text from those that you do. Jericho's TextExtractor is perfect for this, and easy to use.

    // Jericho HTML Parser (package net.htmlparser.jericho) must be on the classpath.
    import net.htmlparser.jericho.*;

    String text = "Pubblichiamo la presentazione di IBM riguardante DB2 per i vari sistemi operativi "
        + "Linux, UNIX e Windows.\r\n\r\nQuesto documento sta sulla piattaforma KM e lo potete "
        + "scaricare a questo <a href='https://sfkm.griffon.local/sites/BSF%20KM/BSF/CC%20T/Specifiche/Eventi2008/IBM%20DB2%20for%20Linux,%20UNIX%20e%20Windows.pdf' target=blank>link</a>.";

    // Keep the text content of <a> elements; exclude everything else.
    TextExtractor te = new TextExtractor(new Source(text)) {
        @Override
        public boolean excludeElement(StartTag startTag) {
            return !HTMLElementName.A.equals(startTag.getName());
        }
    };
    System.out.println(te.toString());


    This outputs:

    Pubblichiamo la presentazione di IBM riguardante DB2 per i vari sistemi operativi Linux, UNIX e Windows. Questo documento sta sulla piattaforma KM e lo potete scaricare a questo link.

    You could use a custom Lucene Tokenizer with an HTML Filter, but it's not the easiest solution; using Jericho will definitely save you development time for this task. The existing HTML analysers for Lucene probably don't do exactly what you want, as they keep all of the text on the page. The only caveat is that you end up processing the text twice rather than as a single stream, but unless you are handling terabytes of data you won't care about this performance consideration, and dealing with performance is best left until you have your app fleshed out and have identified it as an issue anyway.


© 2021 The Archive Base. All Rights Reserved