I have a field that I would like to treat as a single string, with all non-alphanumeric characters stripped from it.
For example, I would like “123 456.78-9” to be tokenized as “123456789”. To do that, I have been trying to define my own analyzer. According to the Solr documentation, KeywordTokenizerFactory treats the whole string as a single term, and I can use a PatternReplaceFilterFactory to remove the unwanted characters.
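To make the intended result concrete, here is a minimal plain-Java sketch of the transformation I expect. It only uses java.util.regex, not Hibernate Search or Solr, and the class and method names are made up for illustration. It applies the same character class as the "pattern" parameter of the filter to the whole input, treated as one string:

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only: it mimics "keep the whole
// input as one term, then delete every non-alphanumeric character".
final class StripNonAlphanumeric {
    // Same character class as the filter's "pattern" parameter.
    private static final Pattern NON_ALPHANUMERIC = Pattern.compile("[^a-zA-Z0-9]");

    static String strip(String input) {
        // Replace every match with the empty string (like replace="all").
        return NON_ALPHANUMERIC.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("123 456.78-9")); // prints 123456789
    }
}
```

So the regular expression itself appears to do what I want when applied to the complete value; this says nothing about how the analyzer is actually applied at index or query time.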
I am using the following definition in my code, and it is not working:
@AnalyzerDef(name = "strippinganalyzer",
    tokenizer = @TokenizerDef(factory = KeywordTokenizerFactory.class),
    filters = {
        @TokenFilterDef(factory = PatternReplaceFilterFactory.class,
            params = {
                @org.hibernate.search.annotations.Parameter(name = "pattern", value = "([^a-zA-Z0-9])"),
                @org.hibernate.search.annotations.Parameter(name = "replacement", value = ""),
                @org.hibernate.search.annotations.Parameter(name = "replace", value = "all")
            }
        )
    })
With this definition, a query for “123*” matches, but “1234*” and so on do not. What am I missing?
Thanks
Creating a custom Analyzer seems to do the trick: