I’m looking for feedback on which analyzer to use with an index that has documents from multiple languages. Currently I am using the simpleanalyzer, as it seems to handle the broadest amount of languages. Most of the documents to be indexed will be english, but there will be the occasional double-byte language indexed as well.
Are there any other suggestions or should I just stick with the simpleanalyzer.
Thanks
SimpleAnalyzer really is simple, all it does is lower-case the terms. I’d have thought that the StandardAnalyzer would give better results than SimpleAnalyzer even with non-english language data. You could perhaps improve it slightly by supplying a custom list of stop words in addition to the default english-language ones.