I have an index built from a large corpus with several fields. Only one of these fields contains text.
I need to extract the unique words from the whole index based on this field.
Does anyone know how I can do that with Lucene in Java?
You’re looking for term vectors (the set of all words that occur in a field, with the number of times each word was used, excluding stop words). Call IndexReader’s getTermFreqVector(docid, field) for each document in the index and populate a HashSet with the results. Note that this returns null for documents whose field was indexed without term vectors.

The alternative is to use terms() and keep only the terms that belong to the field you’re interested in.
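A minimal sketch of the terms() approach, assuming the Lucene 3.x API (IndexReader.terms() returning a TermEnum). The class name UniqueTermsLucene3 and the method name uniqueTerms are made up for illustration; the caller supplies an already opened reader and the name of the text field.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

// Hypothetical helper class, not part of Lucene. Assumes Lucene 3.x on the classpath.
public class UniqueTermsLucene3 {

    /** Collects the distinct terms indexed for one field. */
    public static Set<String> uniqueTerms(IndexReader reader, String field) throws IOException {
        Set<String> unique = new HashSet<String>();
        // terms() enumerates the terms of every field in the index;
        // next() must be called before the first term() call.
        TermEnum terms = reader.terms();
        try {
            while (terms.next()) {
                Term term = terms.term();
                // Keep only terms that belong to the requested field.
                if (term.field().equals(field)) {
                    unique.add(term.text());
                }
            }
        } finally {
            terms.close();
        }
        return unique;
    }
}
```

Because the terms come from the index dictionary, they are the analyzed tokens (lowercased, stemmed, or stop-word filtered according to the analyzer used at index time), not the raw words of the original text.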
This is not optimal, because you read and then discard the terms of every other field. Lucene 4 has a Fields class whose terms(field) method returns the terms of a single field only.
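A sketch of the Lucene 4 variant, assuming the 4.x API in which MultiFields.getFields(reader) exposes a Fields view over the whole index and Terms.iterator takes a reusable TermsEnum argument (later major versions changed these signatures). The class name UniqueTermsLucene4 and the method name uniqueTerms are hypothetical.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.index.Fields;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

// Hypothetical helper class, not part of Lucene. Assumes Lucene 4.x on the classpath.
public class UniqueTermsLucene4 {

    /** Collects the distinct terms of a single field without touching other fields. */
    public static Set<String> uniqueTerms(IndexReader reader, String field) throws IOException {
        Set<String> unique = new HashSet<String>();
        // Merged view over all segments; null when the index has no postings.
        Fields fields = MultiFields.getFields(reader);
        if (fields == null) {
            return unique;
        }
        // null when the field does not exist or has no indexed terms.
        Terms terms = fields.terms(field);
        if (terms == null) {
            return unique;
        }
        TermsEnum it = terms.iterator(null); // null: no enum to reuse
        BytesRef text;
        while ((text = it.next()) != null) {
            // Terms are stored as UTF-8 bytes; decode each one to a String.
            unique.add(text.utf8ToString());
        }
        return unique;
    }
}
```

Here only the requested field’s term dictionary is walked, so the cost does not grow with the number of terms in the other fields.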