Right now, my documents in Lucene can have very large values in one field (anywhere from 0 to hundreds of MB).
I am using Lucene 3.1.0 and create documents like this:
Document doc = new Document();
Field field = new Field(fieldname, VERYLARGEVALUE, store, tokenize, storevector);
doc.add(field);
Here VERYLARGEVALUE is a String in memory. I am thinking of writing VERYLARGEVALUE to a file while it is being created (it is built incrementally by extracting text from a number of sources), and then using:
// constructor: Field(String name, Reader reader, Field.TermVector termVector)
Field field = new Field(fieldname, reader, storevector);
doc.add(field);
Here reader reads from the file I wrote VERYLARGEVALUE to.
Will this decrease the memory requirement, or will VERYLARGEVALUE eventually be read into memory anyway?
Looking through the Lucene code, the Reader you pass into Field ultimately gets passed to the TokenStream that tokenizes your data (namely in DocInverterPerField). So your plan should save memory, since Lucene will stream directly from that reader while indexing. You'll likely want to use a BufferedReader on top of the FileReader for better performance.
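To make the idea concrete, here is a minimal sketch of the spool-to-file approach using only java.io, so it runs without Lucene on the classpath. The class name StreamedFieldSketch and the helpers spoolToTempFile, openBuffered and drain are made up for illustration; drain is only a stand-in for the tokenizer, and the actual Lucene calls are shown as comments at the point where the reader would be handed over.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.util.Arrays;

public class StreamedFieldSketch {

    // Append each extracted chunk to a temp file as it is produced, so the
    // full value never has to exist as a single String in memory.
    // (Hypothetical helper; uses the platform default charset.)
    public static File spoolToTempFile(Iterable<String> chunks) throws IOException {
        File tmp = File.createTempFile("lucene-field-", ".txt");
        tmp.deleteOnExit();
        Writer out = new BufferedWriter(new FileWriter(tmp));
        try {
            for (String chunk : chunks) {
                out.write(chunk);
            }
        } finally {
            out.close();
        }
        return tmp;
    }

    // Wrap the FileReader in a BufferedReader, as suggested above.
    public static Reader openBuffered(File file) throws IOException {
        return new BufferedReader(new FileReader(file));
    }

    // Stand-in for the consumer: pulls the content through a fixed-size
    // buffer and returns the number of characters read.
    public static long drain(Reader reader) throws IOException {
        char[] buf = new char[8192];
        long total = 0;
        int n;
        while ((n = reader.read(buf)) != -1) {
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        File spooled = spoolToTempFile(
                Arrays.asList("text from source A. ", "text from source B."));
        Reader reader = openBuffered(spooled);
        try {
            // With Lucene 3.1 available, the reader would be used here instead:
            //   doc.add(new Field(fieldname, reader, Field.TermVector.NO));
            //   indexWriter.addDocument(doc); // reader is consumed during this call
            System.out.println("characters streamed: " + drain(reader));
        } finally {
            reader.close();
        }
    }
}
```

Two things to keep in mind with the Reader-based constructor: the field is indexed and tokenized but not stored, so you cannot get the text back from the index later, and the reader is only read when the document is added, so don't close it until addDocument has returned.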