I am working on a project where we are trying to introduce a search framework. We are about to start development; so far we have only done some proof-of-concept work. We are struggling with hardware estimates. I am uncertain whether our performance requirements can be met with a single-server setup, or whether we need a replicated or distributed solution.
Here are our main requirements:
- Search in semi-structured data
- Documents contain 15 fields, all of which should be searchable
- Mostly numeric IDs
- Dates
- Names
- 10+ million documents in the index
- 30-40 updates, in batches, every minute
- <100 ms response time for searches with several boolean operators, at 100+ queries per minute
Questions
1) Is it feasible to get this performance on a single-server setup?
2) If not, what is an appropriate setup to meet the performance requirements?
3) We are considering several frameworks on top of Lucene, among them Solr and Zoie. What distributed architecture would be necessary to handle the described load and performance requirements?
Yes, I think so, but it is borderline (I hope you know what I mean).
What you need is enough RAM and CPU power. In the end it depends on the size of any "big" fields, such as full-text fields, and on the size of your database.
For comparison, I use Lucene with 1.2 million docs and 7 fields, mostly short fields (dates, numbers, ...) but also one big text field (500-5000 characters). The MySQL database that Lucene indexes is 1-2 GB. The system runs on a small single-CPU VMware host with 4 GB of RAM. Full-text searches return in 100-400 ms.
If you don't have big text fields, your results will return faster (depending on the kind of search, for example faceted search).
For example, a faceted search on a char(255) field returned in <70 ms.
For your configuration, non-virtualized hardware with lots of memory (>32 GB) and >8 cores would probably be useful.
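To put the numbers from the question in perspective, here is a rough back-of-the-envelope sketch in Python. It only uses the figures quoted in the question (100 queries per minute, a 100 ms target, up to 40 updates per minute); the helper names are made up for illustration. It says nothing about how expensive a single query is, which is what actually has to be measured.

```python
# Back-of-the-envelope load estimate from the numbers in the question.
QUERIES_PER_MINUTE = 100      # "100+ queries per minute"
TARGET_LATENCY_S = 0.100      # "<100 ms response time"
UPDATES_PER_MINUTE = 40       # upper end of "30-40 updates ... every minute"


def queries_per_second(per_minute: float) -> float:
    """Convert a per-minute rate into a per-second rate."""
    return per_minute / 60.0


def avg_concurrent_queries(per_minute: float, latency_s: float) -> float:
    """Little's law: average requests in flight = arrival rate * time in system."""
    return queries_per_second(per_minute) * latency_s


qps = queries_per_second(QUERIES_PER_MINUTE)
in_flight = avg_concurrent_queries(QUERIES_PER_MINUTE, TARGET_LATENCY_S)
seconds_between_updates = 60.0 / UPDATES_PER_MINUTE

print(f"query rate: {qps:.2f} queries/s")
print(f"average queries in flight at target latency: {in_flight:.2f}")
print(f"one update every {seconds_between_updates:.1f} s")
```

At under two queries per second, the average query rate itself is modest. The open question is whether each individual query stays under 100 ms on a 10+ million document index while updates are coming in.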
Does that mean 30-40 new documents per minute? That's no problem!
30-40 updates per minute, each with lots of new documents, would be more challenging.
Additionally, you should optimize your index periodically (for example nightly).
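If the updates arrive in batches anyway, one way to keep the indexing cost down is to send each batch as a single request and let the engine commit once, not once per document. The following is only a sketch, assuming Solr is the chosen framework and a Solr version whose `/update` handler accepts a JSON array of documents. The URL, core name and field names are hypothetical, and the request is built but not sent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical core URL; adjust to your own installation.
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycore/update"


def build_batch_update(docs, commit_within_ms=60_000):
    """Build (but do not send) one HTTP request carrying a whole batch of documents.

    commitWithin asks Solr to make the documents searchable within the given
    number of milliseconds, so no explicit commit is needed per document.
    """
    params = urlencode({"commitWithin": commit_within_ms})
    body = json.dumps(docs).encode("utf-8")
    return Request(
        f"{SOLR_UPDATE_URL}?{params}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical documents with made-up field names.
batch = [
    {"id": "1001", "name": "Alice Example", "created": "2011-05-01T00:00:00Z"},
    {"id": "1002", "name": "Bob Example", "created": "2011-05-02T00:00:00Z"},
]
request = build_batch_update(batch)
# Send with urllib.request.urlopen(request) once a Solr instance is reachable.
```

Whether one commit per minute is acceptable depends on how quickly new documents have to show up in search results.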
Solr runs as a Tomcat application. There you have to define, for example, how much RAM (see above) is assigned to your search engine.
There are different ways to split your index (for more performance or faster updates); clustering is also possible.
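As an illustration of what a split index looks like from the client side, here is a small sketch that builds a Solr distributed-search URL using the `shards` request parameter. The host names, the two-shard layout and the field names in the query are assumptions made up for the example; how you actually partition the documents is a separate decision.

```python
from urllib.parse import urlencode


def build_distributed_query(coordinator, shards, query, rows=10):
    """Build a select URL that asks one Solr node to fan the query out to all shards."""
    params = urlencode({
        "q": query,
        "rows": rows,
        # Comma-separated list of the shards to search.
        "shards": ",".join(shards),
    })
    return f"http://{coordinator}/select?{params}"


# Hypothetical hosts: each one holds a part of the index.
url = build_distributed_query(
    coordinator="search1:8983/solr",
    shards=["search1:8983/solr", "search2:8983/solr"],
    query="name:smith AND created:[2010-01-01T00:00:00Z TO *]",
)
print(url)
```

The node that receives the request merges the partial results, so the client still talks to a single endpoint.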