I am looking at how to implement a high performance tag cloud in Solr.
I have a Solr index with 15 million records, and more are added every day. One field is populated by several copyField statements, so it can hold anywhere between 1 and 6 values. Each value is usually a sentence or two of string data. I’ve attempted to create a custom field type to optimize and tokenize the field for quick faceting, but I’m getting lackluster performance.
Here is the custom field type that I’ve created:
<fieldType name="KeywordCloud" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
    />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
    />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
Any suggestions on how I can achieve at least reasonable performance when faceting this field? Or is there a totally different approach that I can take?
This approach works great when I only have an index of a million documents or so, but 15 million and higher is giving me issues.
Thanks in advance!
Have you played with the Solr caches? As the number of unique terms in a field grows, you need to grow the caches accordingly. See this link for details. Pay attention to the filter cache and the field cache.
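As a rough sketch of what growing the filter cache might look like, here is a fragment for solrconfig.xml. The sizes are illustrative assumptions, not tuned recommendations, and the cache class depends on your Solr version: older releases ship solr.FastLRUCache, while recent ones use solr.CaffeineCache instead. The filter cache matters most when faceting with facet.method=enum, which can use one cache entry per unique term in the field.

```xml
<!-- solrconfig.xml: sketch only, sizes are assumed values to be tuned -->
<query>
  <!-- Larger filterCache so that more per-term filters stay cached
       when faceting with facet.method=enum -->
  <filterCache class="solr.FastLRUCache"
               size="16384"
               initialSize="4096"
               autowarmCount="1024"/>
</query>
```

Every cached entry costs heap, so it is worth watching the hit ratio and eviction counts on the admin statistics page before and after a change like this, and raising the JVM heap if the larger cache needs it.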