I am testing WordDelimiterFilter in Solr, but it does not preserve the protected words I pass to it. Could you please look at the code and the example output and tell me which part is missing or used incorrectly?
I am running this code:
private static Analyzer getWordDelimiterAnalyzer() {
    return new Analyzer() {
        @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
            TokenStream stream = new StandardTokenizer(Version.LUCENE_32, reader);
            WordDelimiterFilterFactory wordDelimiterFilterFactory = new WordDelimiterFilterFactory();
            HashMap<String, String> args = new HashMap<String, String>();
            args.put("generateWordParts", "1");
            args.put("generateNumberParts", "1");
            args.put("catenateWords", "1");
            args.put("catenateNumbers", "1");
            args.put("catenateAll", "0");
            args.put("luceneMatchVersion", Version.LUCENE_32.name());
            args.put("language", "English");
            args.put("protected", "protected.txt");
            wordDelimiterFilterFactory.init(args);
            ResourceLoader loader = new SolrResourceLoader(null, null);
            wordDelimiterFilterFactory.inform(loader);
            /*
            List<String> protectedWords = new ArrayList<String>();
            protectedWords.add("good bye");
            protectedWords.add("hello world");
            wordDelimiterFilterFactory.inform(new LinesMockSolrResourceLoader(protectedWords));
            */
            return wordDelimiterFilterFactory.create(stream);
        }
    };
}
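You are using StandardTokenizer, which splits on whitespace (among other things), so “hello world” is always split into “hello” and “world” before WordDelimiterFilter ever sees it. The filter checks its protected list against one token at a time, so a multi-word entry such as “hello world” can never match.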
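To make the order of operations concrete, here is a small plain-Java model of the two stages. It is only a sketch and does not use Lucene or Solr. The class ProtectedWordsSketch and its methods tokenize and delimit are hypothetical names invented for this illustration. The tokenizer stand-in splits on whitespace only, which is coarser than what StandardTokenizer does, and the delimiter stand-in ignores case changes, catenation and the other WordDelimiterFilter options.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ProtectedWordsSketch {

    // Stage 1: stand-in for the tokenizer (assumption: whitespace split only).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Stage 2: stand-in for the word delimiter step. It only ever sees
    // one token at a time, so the protected lookup is per token.
    static List<String> delimit(List<String> tokens, Set<String> protectedWords) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (protectedWords.contains(token)) {
                out.add(token); // protected: passed through unsplit
            } else {
                // not protected: split on runs of non-alphanumeric characters
                for (String part : token.split("[^A-Za-z0-9]+")) {
                    if (!part.isEmpty()) {
                        out.add(part);
                    }
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> protectedWords =
                new HashSet<>(Arrays.asList("hello world", "wi-fi"));
        List<String> tokens = tokenize("hello world wi-fi e-mail");
        System.out.println(delimit(tokens, protectedWords));
    }
}
```

In this model the single-token entry "wi-fi" survives, "e-mail" is split because it is not in the list, and the "hello world" entry has no effect because no single token ever equals it.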
See Lucene Documentation:
The word delimiter protected word list is meant for something like:
If you really want the behavior you described, you can use KeywordTokenizer, which emits the whole input as a single token. But then you have to do all of the splitting yourself.
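As a rough picture of what that trade-off looks like, the following plain-Java sketch treats the whole input as one token, the way a keyword-style tokenizer would. It is not Lucene code. KeywordTokenSketch and analyze are hypothetical names, and the fallback splitting rule (break on any non-alphanumeric run) is just an assumption standing in for whatever splitting you would write yourself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class KeywordTokenSketch {

    // Treats the entire input as a single token. A protected phrase matches
    // only when the whole input equals it; otherwise we split it ourselves.
    static List<String> analyze(String text, Set<String> protectedPhrases) {
        List<String> out = new ArrayList<>();
        String token = text.trim();
        if (token.isEmpty()) {
            return out;
        }
        if (protectedPhrases.contains(token)) {
            out.add(token); // whole input is protected, keep it intact
            return out;
        }
        // Assumed hand-written splitting: break on non-alphanumeric runs,
        // which includes the whitespace a tokenizer would normally handle.
        for (String part : token.split("[^A-Za-z0-9]+")) {
            if (!part.isEmpty()) {
                out.add(part);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> protectedPhrases =
                new HashSet<>(Arrays.asList("hello world", "good bye"));
        System.out.println(analyze("hello world", protectedPhrases));
        System.out.println(analyze("hello world again", protectedPhrases));
    }
}
```

Here "hello world" is kept intact only when it is the entire input. As soon as the input contains anything else, the phrase no longer matches and is split like everything else, which is why this approach tends to suit short, single-phrase fields.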