I’m starting from a Lucene index which someone else created. I’d like to find all of the words that follow a given word. I’ve extracted the term (org.apache.lucene.index.Term) of interest from the index, and I can find the documents which contain that term:
TermDocs segmentTermDocs = segmentReader.termDocs(term);
while (segmentTermDocs.next()) {
    // doc() returns the internal document number of the current match
    Document doc = segmentReader.document(segmentTermDocs.doc());
    ...
}
Is there a way for me to locate the positions of the term in the document and extract the terms which follow it?
Since indexing the n-grams isn’t an option in your situation, some brute force will be required. You could enumerate the IndexReader’s terms and termPositions, but that would likely be excruciatingly slow.
A faster approach would be to implement a divide-and-conquer search by enumerating the terms and using a MultiPhraseQuery to check a whole group at once. Split all the potential terms into reasonably sized groups (say 1000), and run a MultiPhraseQuery search with each chunk and your prefix word. If there are any hits, recurse on the sub-groups until you reach a single term.
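To make the recursion concrete, here is a minimal sketch of just the splitting logic, with the Lucene part left out. The names FollowerSearch, findFollowers, narrow and anyHit are made up for illustration. The anyHit predicate is a stand-in for "build a MultiPhraseQuery from the prefix word followed by this group of terms, run it, and report whether anything matched". In real code I would expect that to use MultiPhraseQuery.add(Term) for the prefix and add(Term[]) for the group, but that wiring is not shown here and you would need to check it against your Lucene version.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

final class FollowerSearch {

    // Splits the candidates into chunks of chunkSize and narrows each chunk
    // that produces at least one hit. anyHit is a hypothetical stand-in for
    // running a MultiPhraseQuery of [prefix word][any term in the group].
    static List<String> findFollowers(List<String> candidates,
                                      Predicate<List<String>> anyHit,
                                      int chunkSize) {
        List<String> found = new ArrayList<>();
        for (int start = 0; start < candidates.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, candidates.size());
            narrow(candidates.subList(start, end), anyHit, found);
        }
        return found;
    }

    // Halves a group recursively, dropping any half with no hits, until
    // only single terms that really follow the prefix remain.
    static void narrow(List<String> group,
                       Predicate<List<String>> anyHit,
                       List<String> found) {
        if (group.isEmpty() || !anyHit.test(group)) {
            return; // nothing in this group follows the prefix word
        }
        if (group.size() == 1) {
            found.add(group.get(0));
            return;
        }
        int mid = group.size() / 2;
        narrow(group.subList(0, mid), anyHit, found);
        narrow(group.subList(mid, group.size()), anyHit, found);
    }
}
```

Each call to anyHit costs one search, so this pays off when most groups contain no following term and get discarded after a single query. If nearly every term can follow your word, it will not do much better than checking terms one at a time.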