We have a system where we would like to store about 100M documents. We need to be able to iterate over them and perform very simple retrieval operations: getting a document by its unique Id, and running trivial metadata queries such as retrieving by publication date and source.
We will update the database quite frequently, adding new documents and removing old ones, and we would like to avoid large maintenance jobs. It would be great if it’s easy to replicate or mirror without too much fuss.
We’re currently using SQL Server for this, but we need something much more lightweight.
Any recommendations?
Some kind of NVP (NoSQL) store would be best. Given your requirements, I recommend MongoDB. It supports all of the features you are looking for:
Designed for large sets of docs.
Supports secondary indexes for your metadata queries.
Easy to set up replica sets.
Designed for fast performance & high scale.
It’s easy to install and get started with, and as a programmer, it’s pretty easy to work with.
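To make the lookups from the question concrete, here is a rough sketch of the filters and index spec you might use. The field names ("source", "published"), the database and collection names, and the connection URI are all assumptions for illustration; only "_id" is MongoDB's own primary-key field. The helper functions just build plain dicts, so they run without a server; connect_and_query additionally needs the pymongo driver and a running MongoDB instance.

```python
from datetime import datetime, timezone

# Assumed field names: the question only mentions "source" and "publication date".
# One compound secondary index covering the metadata query (1 = ascending).
INDEX_SPECS = [
    [("source", 1), ("published", 1)],
]


def by_id(doc_id):
    """Filter for fetching one document by its unique id (MongoDB's _id field)."""
    return {"_id": doc_id}


def by_source_and_date(source, start, end):
    """Filter for documents from one source published in [start, end)."""
    return {"source": source, "published": {"$gte": start, "$lt": end}}


def connect_and_query(uri, source, start, end):
    """Create the indexes and run the metadata query.

    Needs pymongo installed and a reachable server; the database name
    "archive" and collection name "documents" are hypothetical.
    """
    from pymongo import MongoClient

    coll = MongoClient(uri)["archive"]["documents"]
    for spec in INDEX_SPECS:
        coll.create_index(spec)  # no-op if the index already exists
    return coll.find(by_source_and_date(source, start, end))


# Building a filter needs no server at all:
start = datetime(2012, 1, 1, tzinfo=timezone.utc)
end = datetime(2012, 2, 1, tzinfo=timezone.utc)
example_filter = by_source_and_date("reuters", start, end)
```

Putting the source first in the compound index suits queries that always name a source; if you also query by date alone, a separate index on the date field would probably be worth adding.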
Cassandra is another possible solution, but it takes a bit more work to set up and to plan your schema. Its main advantage is better support for multi-datacenter sharding and redundancy. Unlike MongoDB, Cassandra doesn’t use a master-slave replication system; all of its nodes are peers.