I’m trying to get autosuggestions for search terms, but I’ve run into a problem: words containing characters like “-” and “&” are being split after just one character.
Example:
/solr/terms/?terms=true&terms.fl=item&terms.limit=10&terms.sort=count&terms.prefix=t
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
  </lst>
  <lst name="terms">
    <lst name="item">
      <int name="top">11335</int>
      <int name="tshirt">10249</int>
      <int name="t">10156</int>
      <int name="trouser">4771</int>
      <int name="tight">1577</int>
    </lst>
  </lst>
</response>
The problem lies with “tshirt” and “t”: “t” only ever appears as part of “t-shirt”. How do I prevent Solr from splitting a word after a single character when no whitespace follows it? “t shirt” should be split, but “t-shirt” and “h&m” should not.
Thanks for your help!
The field type for item seems to be a text type with WordDelimiterFilterFactory as one of the filters in its analysis chain.
By default, WordDelimiterFilterFactory splits on intra-word delimiters.
So t-shirt generates two tokens, t and shirt, which is why the term t shows up in your results.
If you want to use the terms for autosuggest, remove the WordDelimiterFilterFactory or tune it to fit your requirements.
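One way to tune the filter rather than remove it is its types attribute, which points to a file that reclassifies individual characters. The following is only a rough sketch: the field type name text_suggest_wdf and the file name wdf_types.txt are made up for illustration, and the remaining attribute values are assumptions that you would adjust to your own schema.

```xml
<!-- Sketch only: "text_suggest_wdf" and "wdf_types.txt" are hypothetical names. -->
<fieldType name="text_suggest_wdf" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- wdf_types.txt (placed in the core's conf directory) would contain lines such as:
           \u002D => ALPHA
           & => ALPHA
         so that the hyphen and the ampersand are treated as letters, not delimiters. -->
    <filter class="solr.WordDelimiterFilterFactory"
            types="wdf_types.txt"
            generateWordParts="1"
            catenateWords="0"
            splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With a mapping like this, t-shirt and h&m should each stay a single token, while “t shirt” is still split by the whitespace tokenizer.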
You can use a TextField with a basic configuration, for example WhitespaceTokenizerFactory followed by the lowercase and ASCII folding filters, so that the tokens are analyzed as little as possible and don’t come out fragmented.
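A minimal sketch of such a field type could look like the following. The names text_suggest and item_suggest are placeholders, not something from your schema, and the copyField is just one assumed way to feed the new field from the existing one.

```xml
<!-- Sketch only: "text_suggest" and "item_suggest" are hypothetical names. -->
<fieldType name="text_suggest" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Split on whitespace only, so "t-shirt" and "h&m" stay whole. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Normalize case so the prefix "t" matches "T-Shirt" as well. -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Fold accented characters to their ASCII equivalents. -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>

<!-- A separate field for suggestions, filled from the existing item field. -->
<field name="item_suggest" type="text_suggest" indexed="true" stored="false"/>
<copyField source="item" dest="item_suggest"/>
```

You would then point the request at the new field (terms.fl=item_suggest). Since the change affects index-time analysis, the documents need to be reindexed before the terms list reflects it.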