I’ve been analyzing the best way to improve the performance of our Solr index and will likely shard the current index so that searches can be distributed.
However, given that our index is over 400GB and contains about 700 million documents, reindexing the data seems burdensome. I’ve been toying with the idea of duplicating the index and deleting documents from each copy as a more efficient way to create the sharded environment.
Unfortunately, it seems that modulus isn’t available when querying against a document’s internal numeric ID. What other partitioning strategies could I use to delete by query rather than do a full reindex?
A Lucene tool, IndexSplitter, would do the job. It is mentioned here, with a link to an article (in Japanese; translate it with Google…)
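If splitting at the Lucene level isn't an option, one delete-by-query strategy that doesn't need modulus is to partition on value ranges of an indexed field. The sketch below is only an illustration: it assumes the documents have an indexed, non-negative integer field (called `id` here, which is a hypothetical name) whose values are spread fairly evenly between a known minimum and maximum. `shard_delete_queries` is a made-up helper, not part of Solr. For each copy of the index it builds the query that deletes everything outside that copy's range.

```python
def shard_delete_queries(min_id, max_id, num_shards, field="id"):
    """Build one Solr delete query per shard copy.

    Shard i keeps the documents whose `field` value falls in its own
    inclusive range, so its query deletes everything outside that range.
    Assumes `field` is an indexed, non-negative integer field (hypothetical).
    """
    total = max_id - min_id + 1
    queries = []
    for shard in range(num_shards):
        # Integer arithmetic keeps the ranges contiguous and non-overlapping.
        lo = min_id + shard * total // num_shards
        hi = min_id + (shard + 1) * total // num_shards - 1
        queries.append(f"*:* -{field}:[{lo} TO {hi}]")
    return queries


# Example: four copies of the index, ids assumed to run from 1 to 700 million.
for query in shard_delete_queries(1, 700_000_000, 4):
    # Each line is the XML update message to post to the matching copy.
    print(f"<delete><query>{query}</query></delete>")
```

Each message would be posted to its own copy of the index, followed by a commit. If the field's values are skewed, the shards will end up uneven, so the range boundaries may need to be picked from the actual value distribution rather than computed evenly.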