Use a StreamingUpdateSolrServer, I used the following algorithm to re-index my huge dataset into SOLR.
Initialize StreamingUpdateSolrServer server = new StreamingUpdateSolrServer(solrServerUrl, numDocsToAddInBatch, numOfThreads);
For each Item…
-->Create document
-->Server.add(document)
When all finished,
server.commit();
server.optimize();
The problem:
Some of my items are not making it into the SOLR index, but no logs are being generated to tell me what happened.
I was able to find most of the documents, but some were missing. No errors in any logs – and I have substantial try/catch blocks with logs around all SOLRJ exceptions on the clients site.
Verify logging is not being hidden for the SOLR WAR
You will definitely want to verify that the SOLR server log settings are not hiding the fact that documents are failing to be added to the index.
Because SOLR uses the SLF4J API, your SOLR server could be over-riding the log settings allowing you to see an error message when the document failed to be indexed.
If you have a custom {solr-war}/WEB-INF/classes/logging.properties, you will need to make sure that the settings are not such that it is hiding the error messages.
By default, errors in adding an item should be shown automatically. So if you did not change your SOLR log settings at any point… you should be seeing any errors during indexing in your server log file.
Troubleshoot why Documents are failing to be indexed
In order to investigate this, it is helpful to follow verification step any time after the indexing is complete:
Once the verification step has completed, the log log_notfound created a list of all Documents that failed to be added into the Index.
You can use the log created by log_fromsolr to compare the document fields for an item that made it into the index and one that did not.
Verify it is not an intermittent issue
Sometimes it might be the case that it is not the same Items failing to be added to the index each time you try to index.
If you find objects in the log_notfound log, you will want to back up the current notfound log and run the indexing process again from scratch. Use a diff tool to see the differences between the first notfound log and the second notfound log.
An intermittent problem is evident when you see large numbers of differences in these files (Note: some differences are to be expected if new objects are being created in the database in between the first and second re-indexing).
If your problem is intermittent, it most certainly points at the application code with respect to your SOLR transactions not being committed correctly.
The same documents consistently come up missing each time it indexes
At this point we have to compare documents that are being found from the SOLR index, versus documents that are not getting into the Lucene index. Usually a field-by-field comparison of the object will start turning of some suspicious values that may be causing issues when adding the document to the index.
Try eliminating all the suspicious fields and then re-indexing the entire thing again. See if the documents are still failing to be indexed. If this worked, you will want to start re-introducing the fields that you removed and see if you can pinpoint the one that is the issue.