I have a scenario where I need to build a multilingual index for two languages with completely different scripts (Hindi and English), so their stemmers and lemmatizers don't affect each other. The index will be huge, containing millions of documents.
Which of the following three approaches should I use for indexing?
- Single field for both languages: a) since the scripts are different, I can apply both analyzers to it; b) faster searching because the number of fields is limited; c) I will need to take care of relevancy issues.
- Language-specific fields: a) possibly slower searching because of the many fields.
- Multicore approach: a) problems handling multilingual documents; b) administration will be harder; c) language-specific search will be easy.
I suggest separate cores. IMHO, it’s simply the right way to go.
You don’t have to use Solr’s automatic language recognition, since you define analyzers (lemmatizers/stemmers) for each core/language separately.
The only drawback is duplicated boilerplate config (most settings are the same for both cores).
See this recent, similar post:
Applying Language Specific Analyzer Dynamically before Solr Indexing
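With separate cores, the indexing client has to decide which core each document goes to. Since Hindi and English use different scripts, a rough sketch of that routing step could simply count Devanagari versus Latin letters. The core names `docs_hi` and `docs_en`, the `text` key, and the majority-vote rule are all assumptions for illustration, not anything Solr prescribes; the batches would still have to be posted to each core's update handler.

```python
# Hypothetical core names; replace with whatever your cores are called.
CORE_BY_LANG = {"hi": "docs_hi", "en": "docs_en"}


def dominant_lang(text: str) -> str:
    """Guess the language from the script: Devanagari vs. ASCII letters."""
    # U+0900..U+097F is the Unicode Devanagari block.
    hi = sum(1 for ch in text if 0x0900 <= ord(ch) <= 0x097F)
    en = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    # Ties (including empty text) fall back to English in this sketch.
    return "hi" if hi > en else "en"


def route(docs):
    """Group documents into per-core batches keyed by core name."""
    batches = {core: [] for core in CORE_BY_LANG.values()}
    for doc in docs:
        core = CORE_BY_LANG[dominant_lang(doc["text"])]
        batches[core].append(doc)
    return batches


if __name__ == "__main__":
    docs = [
        {"id": "1", "text": "नमस्ते दुनिया"},
        {"id": "2", "text": "hello world"},
    ]
    for core, batch in route(docs).items():
        print(core, [d["id"] for d in batch])
```

Documents that genuinely mix both languages are the weak spot of this approach, as the question already notes: a majority vote sends the whole document to one core, so the minority-language text is analyzed with the wrong stemmer.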