We need to design a system that lets users search large texts by keyword and, in the future, generate basic reports on how often a given keyword appears across all the articles over a period.
We will have:
- about 200,000 articles added every day
- each article text is about 2KB
- articles are stored for 6 months
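To put those numbers in perspective, here is a rough back-of-the-envelope sizing. It is only a sketch: it assumes "6 months" means about 183 days and decimal units (1 KB = 1000 bytes), and the constant names are made up for illustration.

```python
# Rough sizing from the figures above (raw article text only, no index overhead).
ARTICLES_PER_DAY = 200_000
ARTICLE_SIZE_KB = 2
RETENTION_DAYS = 183  # assumption: "6 months" taken as about 183 days

articles_retained = ARTICLES_PER_DAY * RETENTION_DAYS
daily_text_mb = ARTICLES_PER_DAY * ARTICLE_SIZE_KB / 1000
total_text_gb = articles_retained * ARTICLE_SIZE_KB / 1_000_000

print(f"articles retained: {articles_retained:,}")
print(f"raw text per day:  {daily_text_mb:.0f} MB")
print(f"raw text retained: {total_text_gb:.1f} GB")
```

So we expect roughly 36.6 million articles and about 73 GB of raw text at any time, before counting index size or the MySQL data.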
To do that, we came up with the following solution:
- create a SOLR repository to store the articles
- use a MySQL database to store additional information about each article
The system will search SOLR by keyword and then look up the matching results in MySQL to retrieve the additional information.
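The flow we have in mind looks roughly like the sketch below. To keep it runnable on its own, an in-memory dict with a substring match stands in for SOLR and an in-memory SQLite database stands in for MySQL; the table name `article_info`, its columns, and the function names are all hypothetical.

```python
import sqlite3

def fake_solr_search(keyword, articles):
    """Stand-in for a SOLR keyword query: returns ids of matching articles."""
    return [aid for aid, text in articles.items() if keyword.lower() in text.lower()]

def fetch_metadata(conn, ids):
    """Fetch the additional information for all hits with a single IN (...) query."""
    if not ids:
        return {}
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id, source, published FROM article_info WHERE id IN ({placeholders})",
        ids,
    ).fetchall()
    return {row[0]: {"source": row[1], "published": row[2]} for row in rows}

def search(keyword, articles, conn):
    ids = fake_solr_search(keyword, articles)
    info = fetch_metadata(conn, ids)
    # Keep the order returned by the search step; the database may return rows in any order.
    return [{"id": i, **info.get(i, {})} for i in ids]

# Sample data (invented for the example).
articles = {
    1: "Solar power output rises",
    2: "Election results announced",
    3: "New solar farm opens",
}
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE article_info (id INTEGER PRIMARY KEY, source TEXT, published TEXT)")
conn.executemany(
    "INSERT INTO article_info VALUES (?, ?, ?)",
    [(1, "wire-a", "2024-01-05"), (2, "wire-b", "2024-01-06"), (3, "wire-a", "2024-01-07")],
)

results = search("solar", articles, conn)
print(results)
```

The point of the sketch is that all the hits for one search are fetched from the database in a single query, rather than one query per hit.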
So, would this be a good approach?
If most searches only cover articles added in the last month, would it be a good idea to keep two databases: one holding only the last month's articles, to serve most searches, and another holding all the articles?
Any tips or tricks on how to improve this would be greatly appreciated.
Thanks in advance!
I think your solution is quite good. I would evaluate putting a memcache instance in front of SOLR if you want faster responses for common queries.
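To illustrate the caching idea, here is a minimal cache-aside sketch. It is not a real memcache client: a plain dict with expiry times stands in for the cache, and `query_solr` is a hypothetical stand-in for whatever function actually runs the search.

```python
import time

class QueryCache:
    """Cache-aside wrapper: serve repeated queries from memory until they expire."""

    def __init__(self, backend, ttl_seconds=60.0, clock=time.monotonic):
        self._backend = backend      # callable that runs the real search
        self._ttl = ttl_seconds      # how long a cached result stays valid
        self._clock = clock          # injectable so expiry can be tested
        self._entries = {}           # query -> (expires_at, results)

    def search(self, query):
        now = self._clock()
        entry = self._entries.get(query)
        if entry is not None and entry[0] > now:
            return entry[1]          # cache hit: skip the backend entirely
        results = self._backend(query)
        self._entries[query] = (now + self._ttl, results)
        return results

calls = []

def query_solr(query):
    """Hypothetical search function; records each call so hits can be counted."""
    calls.append(query)
    return [f"doc-for-{query}"]

cache = QueryCache(query_solr, ttl_seconds=60.0)
first = cache.search("solar power")
second = cache.search("solar power")   # served from the cache
print(len(calls))
```

The TTL matters here because new articles arrive all day: a short expiry bounds how stale a cached result list can get.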
I am not sure about the two databases. You would have to weigh the performance benefit against the burden of moving records from the first database to the second as they age. I doubt there is a huge benefit, but that is just a gut feeling, so don’t take my word for it: run experiments.
Also, have you considered that you may need a horizontally scalable solution if your dataset becomes very large?
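By horizontally scalable I mean something along these lines: split the index across several machines and route each article to one of them, for example by a hash of its id. This is only a sketch with a hypothetical shard count and id format; distributed search setups usually handle this routing for you.

```python
import zlib
from collections import Counter

NUM_SHARDS = 4  # hypothetical number of index shards

def shard_for(article_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map an article id to a shard using a stable hash (CRC32)."""
    return zlib.crc32(article_id.encode("utf-8")) % num_shards

# Route some made-up article ids and count how many land on each shard.
ids = [f"article-{n}" for n in range(10_000)]
distribution = Counter(shard_for(a) for a in ids)
print(sorted(distribution.items()))
```

A stable hash such as CRC32 is used instead of Python's built-in `hash()`, which is randomized per process for strings. The trade-off is that a keyword search then has to query every shard and merge the results.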