So, I have a Solr instance which processes inputs and queries using StandardTokenizer (as well as ClassicFilterFactory, LowerCaseFilterFactory and StopFilterFactory).
In my index are a number of files with underscore-separated names (e.g. some_indexed_file.jpg).
I’ve noticed that if I query for some_indexed_file.jpg, I get the file I’m looking for returned correctly.
However, if I instead search for some_indexed_file.jp* (that’s with a trailing asterisk, which I presume acts as a wildcard), I get no results, even though by my understanding it should produce similar results.
Any idea what’s going on? I assume I’m misunderstanding something about the way Solr processes queries.
Edit: as requested, here are the schema XML configuration entries:
<fieldType name="default" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.ClassicFilterFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.StopFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.ClassicFilterFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.StopFilterFactory" />
  </analyzer>
</fieldType>
<field name="filename" type="default" multiValued="true" omitNorms="false" termVectors="false"/>
Well, a bit more research has solved the problem:
The base issue is that Solr doesn’t apply text analysis to wildcard queries.
This meant that it was searching for an exact match to some_indexed_file.jp*. However, when the filename was indexed, it was tokenised into “some”, “indexed” and “file.jpg”, none of which matches this search term.
Searching for some_indexed_file.jpg was being tokenised properly, and therefore returned the right results.
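To make the mismatch concrete, here is a rough Python sketch of the behaviour described above. It is not Solr code: toy_analyze is a made-up stand-in that simply splits on underscores and lowercases, chosen only because it reproduces the three tokens I saw in my index, and the two matching functions are simplified assumptions about how an analysed query and an unanalysed wildcard query get compared against indexed terms.

```python
from fnmatch import fnmatchcase


def toy_analyze(text):
    """Toy stand-in for the analyzer: split on underscores, then lowercase.

    Assumption: this only mirrors the tokens reported above; it is not
    Solr's StandardTokenizer.
    """
    return [token.lower() for token in text.split("_") if token]


def analyzed_query_matches(query, indexed_terms):
    """Normal query: the query is analysed, then each token is looked up."""
    return all(token in indexed_terms for token in toy_analyze(query))


def wildcard_query_matches(pattern, indexed_terms):
    """Wildcard query: the raw pattern is compared to each single term."""
    return any(fnmatchcase(term, pattern) for term in indexed_terms)


indexed_terms = toy_analyze("some_indexed_file.jpg")
print(indexed_terms)  # ['some', 'indexed', 'file.jpg']

# The plain query is split the same way as the indexed value, so it matches.
print(analyzed_query_matches("some_indexed_file.jpg", indexed_terms))  # True

# The wildcard pattern is kept whole, and no single term looks like it.
print(wildcard_query_matches("some_indexed_file.jp*", indexed_terms))  # False

# A pattern that lines up with one indexed term does match.
print(wildcard_query_matches("file.jp*", indexed_terms))  # True
```

In this toy model a pattern such as file.jp* matches because it lines up with a single indexed term; whether the same holds on a real instance depends on the tokens that actually ended up in the index.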