I’ve written a custom indexer in PHP to import data from MySQL into Solr instead of using Solr’s own Data Import Handler. It’s working great, but I’m stuck on periodic indexing. Here’s the process I have in mind:
- Search the whole index, find deleted entities by comparing them with the data in MySQL, and remove them from Solr.
- Find recently changed entities in MySQL and index only those. (I have at least 12 tables for a core and I need to check all of them.)
So my question is: is this a good approach, or would you suggest something more efficient? Thanks.
PS: I didn’t use Solr’s Data Import Handler because there were too many things I needed to do on my own, like hierarchical data management. I don’t know whether I could do all of them with the Data Import Handler.
You can start tracking changes at the moment items are changed or removed in your DB. Then you’ll only need to go through that list to update your index. Or add a “created / last updated” field to your DB entities. But that could be a complex task, depending on your system architecture and logic.
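A minimal sketch of that idea, written in Python for brevity (the same queries would work from PHP). It assumes a hypothetical `product` table with an `updated_at` column and a hypothetical `deleted_log` table that your application or a trigger fills whenever a row is removed. SQLite stands in for MySQL here only so the example runs on its own.

```python
import sqlite3

def pending_changes(conn, last_run):
    """Return (ids_to_index, ids_to_delete) recorded after last_run."""
    # Rows created or modified since the previous indexing run.
    changed = [row[0] for row in conn.execute(
        "SELECT id FROM product WHERE updated_at > ? ORDER BY id",
        (last_run,))]
    # Rows removed since the previous run, taken from the deletion log.
    deleted = [row[0] for row in conn.execute(
        "SELECT entity_id FROM deleted_log WHERE deleted_at > ? "
        "ORDER BY entity_id", (last_run,))]
    return changed, deleted

# Stand-in database with hypothetical table and column names.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT);
    CREATE TABLE deleted_log (entity_id INTEGER, deleted_at TEXT);
    INSERT INTO product VALUES (1, 'old', '2024-01-01 00:00:00');
    INSERT INTO product VALUES (2, 'edited', '2024-03-05 10:00:00');
    INSERT INTO deleted_log VALUES (3, '2024-03-06 09:00:00');
""")

changed, deleted = pending_changes(conn, "2024-03-01 00:00:00")
print(changed, deleted)
```

With 12 tables you would run the same pair of queries per table (or log all deletions into one shared table) and store the timestamp of the last successful run somewhere persistent.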
You can also skip checking whether your items exist in the database or index and just re-index everything. On datasets that are not very large, that could be quicker (just make sure that the same indexed entity receives the same Solr ID as before, so that it replaces its older version instead of duplicating it).
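One way to keep IDs stable, sketched in Python, is to derive the Solr ID from the source table and primary key rather than generating it at index time. The `table:key` format and the `solr_id` helper are assumptions for illustration; any scheme works as long as it is deterministic and matches your core's unique key field.

```python
def solr_id(table, primary_key):
    """Build a deterministic document ID from the source table and row key."""
    return f"{table}:{primary_key}"

# A dict stands in for the index: re-adding a document under the same ID
# overwrites the earlier version instead of creating a second copy.
index = {}
index[solr_id("product", 42)] = {"name": "first version"}
index[solr_id("product", 42)] = {"name": "second version"}
print(len(index), index[solr_id("product", 42)]["name"])
```

Prefixing with the table name matters when several tables feed one core, since their primary keys may collide.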
Another way is to have a so-called “delta index” containing only the recently modified items, so you’d have to merge Solr with Solr rather than Solr with the DB.
If you still need to check every single item, it would probably be better to request them from the DB first, because Solr search is generally quicker. Then you can run a batched Solr query requesting many documents by ID at once, and loop through that set to match them against your DB records. So something like this should work: request N results from the DB / produce their Solr IDs / request N Solr documents by ID in a single query / match the sets. But this is a “brute force” method, obviously.
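The batch loop could look roughly like this in Python. `fetch_solr_ids` is a hypothetical callable that you would implement to send the query to your core and return the IDs of the documents found. The query string uses Solr's terms query parser, and the unique key field is assumed to be named `id`. The fake fetcher below only simulates Solr so that the sketch runs on its own.

```python
from itertools import islice

def batches(iterable, size):
    """Yield lists of at most `size` items."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def terms_query(ids):
    """Build one query that matches many documents by ID.
    Assumes the IDs themselves contain no commas."""
    return "{!terms f=id}" + ",".join(ids)

def missing_from_solr(db_ids, fetch_solr_ids, batch_size=500):
    """Return the DB-derived IDs that Solr does not have."""
    missing = []
    for chunk in batches(db_ids, batch_size):
        # One Solr request per batch; ask for at least len(chunk) rows.
        found = set(fetch_solr_ids(terms_query(chunk)))
        missing.extend(i for i in chunk if i not in found)
    return missing

# Stand-in for Solr: pretend only these two documents are indexed.
indexed = {"product:1", "product:3"}

def fake_fetch(query):
    wanted = query.split("}", 1)[1].split(",")
    return [i for i in wanted if i in indexed]

db_ids = ["product:1", "product:2", "product:3", "product:4"]
result = missing_from_solr(db_ids, fake_fetch, batch_size=2)
print(result)
```

This direction finds DB records that are absent from the index. Finding documents that are in Solr but no longer in the DB needs the reverse walk: page through Solr IDs and look each batch up in the DB.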