I have an index from a large corpus with several fields. Only one these

Question

Asked: May 28, 2026

I have an index built from a large corpus with several fields. Only one of these fields contains text.
I need to extract the unique words from the whole index, based on this field.
Does anyone know how I can do that with Lucene in Java?

1 Answer

Editorial Team · Answer 1 · 2026-05-28T13:01:25+00:00

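You’re looking for term vectors: for each document, the set of terms that occur in a field together with the number of times each one occurs (stop words are missing only if the analyzer removed them at index time). Call IndexReader’s getTermFreqVector(docid, field) for each document in the index and add the returned terms to a HashSet. This only works if the field was indexed with term vectors enabled; otherwise the call returns null.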

The alternative would be to use terms() and pick only terms for the field you’re interested in:

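// Lucene 3.x API; "index" is assumed to be the Directory that holds the index.
IndexReader reader = IndexReader.open(index);
TermEnum terms = reader.terms();
Set<String> uniqueTerms = new HashSet<String>();
// terms() starts before the first term, so next() must be called before term().
while (terms.next()) {
    final Term term = terms.term();
    if (term.field().equals("field_name")) {
        uniqueTerms.add(term.text());
    }
}
terms.close();
reader.close();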

This is not optimal, because you read and then discard the terms of every other field. In Lucene 4, the Fields class has a terms(field) method that returns the terms of a single field only.
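
To make the cost of the full scan concrete, here is a small sketch of the "seek to the field, stop when it ends" idea behind a per-field lookup. It is plain Java and does not call Lucene at all: FieldTermScan, FieldTerm and the TreeSet dictionary are hypothetical stand-ins for a term dictionary that is sorted by field and then by text, which is the order Lucene 3.x uses. In 3.x, IndexReader.terms(Term) starts an enumeration at a given term and should let you do the same against a real index, but treat this as an illustration under those assumptions rather than a drop-in implementation.

```java
import java.util.LinkedHashSet;
import java.util.NavigableSet;
import java.util.Set;
import java.util.TreeSet;

public class FieldTermScan {

    /** Hypothetical stand-in for an index term: a (field, text) pair ordered by field, then text. */
    static final class FieldTerm implements Comparable<FieldTerm> {
        final String field;
        final String text;

        FieldTerm(String field, String text) {
            this.field = field;
            this.text = text;
        }

        @Override
        public int compareTo(FieldTerm other) {
            int byField = field.compareTo(other.field);
            return byField != 0 ? byField : text.compareTo(other.text);
        }
    }

    /** Collects the terms of one field by seeking to its first term and stopping at the next field. */
    static Set<String> uniqueTerms(NavigableSet<FieldTerm> dictionary, String field) {
        Set<String> result = new LinkedHashSet<String>();
        // The empty string sorts before any other text, so this is the smallest possible term of the field.
        FieldTerm start = new FieldTerm(field, "");
        for (FieldTerm term : dictionary.tailSet(start, true)) {
            if (!term.field.equals(field)) {
                break; // sorted order: once the field changes, no further matches can follow
            }
            result.add(term.text);
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy dictionary with three fields; only "body" is of interest.
        NavigableSet<FieldTerm> dictionary = new TreeSet<FieldTerm>();
        dictionary.add(new FieldTerm("author", "smith"));
        dictionary.add(new FieldTerm("body", "lucene"));
        dictionary.add(new FieldTerm("body", "index"));
        dictionary.add(new FieldTerm("title", "search"));
        System.out.println(uniqueTerms(dictionary, "body")); // prints [index, lucene]
    }
}
```

The loop never touches the "author" terms and stops at the first "title" term, so the work is proportional to the number of terms in the one field you care about rather than to the size of the whole dictionary.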
