I’m using Solr 3.x with a focus on German text, which works well.
Searching for umlauts (öäüß) also works well.
The problem is:
I received some archived text from the late 80s, when most computers and software did not support anything beyond ASCII, so German umlauts in particular were not supported.
For this an alternative notation was used:
ae instead of ä
oe instead of ö
ue instead of ü
ss instead of ß
That means, the name Müller was saved as Mueller.
Back to Solr: I now need to find documents that contain ue even if the user searched for ü.
Example: If I search for all text messages from a person called Müller,
Solr has to find text containing Mueller as well as Müller.
How can I handle this?
Is this the right feature for that? http://wiki.apache.org/solr/UnicodeCollation (I’m not sure I understand the documentation completely.)
By the way, changing the source text with a “search and replace” of all oe to ö is not an option.
As Paige Cook already pointed out, you have already found the relevant documentation, but since not every Solr user knows Java, I decided to write my own answer with a little more detail.
The first step is to add the filter to your field definition:
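A minimal sketch of such a field definition in schema.xml, using solr.CollationKeyFilterFactory with its custom and strength attributes as described on the wiki page. The field type name text_de_collated and the choice of solr.StandardTokenizerFactory are assumptions for illustration; the wiki's own example, which is aimed at sorting, uses solr.KeywordTokenizerFactory instead.

```xml
<!-- Hypothetical field type name; adjust to your schema. -->
<fieldType name="text_de_collated" class="solr.TextField">
  <analyzer>
    <!-- Assumption: split the text into words so single names can be searched. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Loads the tailored rules from conf/customRules.dat.
         strength="primary" ignores case and accent differences. -->
    <filter class="solr.CollationKeyFilterFactory"
            custom="customRules.dat"
            strength="primary"/>
  </analyzer>
</fieldType>
```

Because the same analyzer applies at index and query time, both Mueller and Müller should end up as the same collation key, provided the rules file contains the ae/oe/ue tailorings.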
The next step is to create the necessary customRules.dat file:
You have to create a tiny Java program in order to follow the documentation. Unfortunately, for non-Java programmers this is a little difficult, since the code snippet only shows the important parts. It also uses a third-party library not distributed with the JDK (Apache Commons IO).
Here’s the full Java 7 code necessary to write a customRules.dat without the use of external libraries:
Disclaimer: The code compiles and creates a customRules.dat file, but I didn’t actually test the created file with Solr.
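A minimal sketch of what such a program could look like, following the snippet on the wiki page but using only JDK classes (java.text for the collator, java.nio.file for writing). The class name CustomRulesWriter, the method names, and the default output path are assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Locale;

// Hypothetical class name, chosen for this example only.
class CustomRulesWriter {

    // DIN 5007-2 style tailorings: ae/oe/ue are placed next to a/o/u followed
    // by a combining diaeresis (U+0308), differing only at the tertiary level.
    static final String DIN5007_2_TAILORINGS =
          "& ae , a\u0308 & AE , A\u0308"
        + "& oe , o\u0308 & OE , O\u0308"
        + "& ue , u\u0308 & UE , U\u0308";

    // Takes the JDK's default rules for German and appends the tailorings.
    static String buildTailoredRules() throws ParseException {
        RuleBasedCollator base =
            (RuleBasedCollator) Collator.getInstance(new Locale("de", "DE"));
        // Constructing the collator validates the combined rule string.
        RuleBasedCollator tailored =
            new RuleBasedCollator(base.getRules() + DIN5007_2_TAILORINGS);
        return tailored.getRules();
    }

    // Writes the rules as UTF-8 without any external library.
    static void writeRules(String rules, Path target) throws IOException {
        Files.write(target, rules.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws Exception {
        // Assumption: the output path is passed as the first argument,
        // otherwise the file is written to the current directory.
        Path target = Paths.get(args.length > 0 ? args[0] : "customRules.dat");
        writeRules(buildTailoredRules(), target);
        System.out.println("Wrote " + target.toAbsolutePath());
    }
}
```

Compile it with javac CustomRulesWriter.java, run it with java CustomRulesWriter, and copy the resulting file into the conf directory of your Solr home so that the custom attribute of the filter can find it.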