Can CouchDB handle thousands of separate databases on the same machine?
Imagine you have a collection of BankTransactions. There are many thousands of records. (EDIT: not actually storing transactions; just think of a very large number of very small, frequently updating records. It’s basically a join table from SQL-land.)
Each day you want a summary view of transactions that occurred only at your local bank branch. If all the records are in a single database, regenerating the view will process all of the transactions from all of the branches. This is a much bigger chunk of work, and unnecessary for the user who cares only about his particular subset of documents.
This makes it seem like each bank branch should be partitioned into its own database, in order for the views to be generated in smaller chunks, and independently of each other. But I’ve never heard of anyone doing this, and it seems like an anti-pattern (e.g. duplicating the same design document across thousands of different databases).
Is there a different way I should be modeling this problem? (Should the partitioning happen between separate machines, not separate databases on the same machine?) If not, can CouchDB handle the thousands of databases it will take to keep the partitions small?
(Thanks!)
[Warning, I’m assuming you’re running this in some sort of production environment. Just go with the short answer if this is for a school or pet project.]
The short answer is “yes”.
The longer answer is that there are some things you need to watch out for…
You’re going to be playing whack-a-mole with a lot of system settings, like the maximum number of open file descriptors.
You’ll also be playing whack-a-mole with Erlang VM settings.
CouchDB has a “max open databases” option (max_dbs_open in the [couchdb] config section). Increase this or you’re going to have pending requests piling up.
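As a rough illustration of the two settings above, here is a minimal Python sketch. It assumes the config HTTP API lives at /_config (as in CouchDB 1.x; newer releases put it under /_node/<node-name>/_config, so check your version). The files_per_db and reserve numbers are made-up placeholders, not CouchDB figures, and the helper names are hypothetical.

```python
import json
import resource  # Unix-only standard-library module
import urllib.request


def current_fd_soft_limit():
    """Soft open-file limit of *this* process.

    Only meaningful for CouchDB if run as the same user, with the same
    limits, as the CouchDB process itself.
    """
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


def fd_headroom(soft_limit, expected_open_dbs, files_per_db=2, reserve=256):
    """Descriptors left over after a back-of-the-envelope estimate.

    files_per_db (database file plus one view index) and reserve
    (sockets, logs) are illustrative guesses. A negative result means
    the limit is probably too low.
    """
    return soft_limit - expected_open_dbs * files_per_db - reserve


def max_dbs_open_request(base_url, value, node=None):
    """Build (but do not send) a PUT that sets couchdb/max_dbs_open.

    node=None targets the 1.x-style /_config path; pass a node name to
    target /_node/<node>/_config instead. Authentication is omitted.
    """
    path = "/_config/couchdb/max_dbs_open"
    if node is not None:
        path = "/_node/" + node + path
    # The config API expects the new value as a JSON string.
    body = json.dumps(str(value)).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
```

To apply the change you would pass the result of max_dbs_open_request("http://localhost:5984", 5000) to urllib.request.urlopen, with admin credentials added. Checking fd_headroom(current_fd_soft_limit(), 5000) first gives a quick hint as to whether the file descriptor limit needs raising as well.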
It’s going to be a PITA to aggregate multiple databases to generate reports. You can do it by polling each database’s _changes feed, modifying the data, and then throwing it back into a central/aggregating database. The tooling to make this easier is just not there yet in CouchDB’s API. Almost, but not quite.
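The poll/modify/push loop described above could look roughly like the sketch below. It is only an illustration under assumptions: database names such as branch_42 and all_branches are hypothetical, names are assumed to be URL-safe, authentication is left out, deletions are not propagated, and the "modify" step here is just prefixing the id with the source database name.

```python
import json
import urllib.parse
import urllib.request


def _call(url, payload=None):
    """GET url, or POST payload as JSON, and decode the JSON response."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def to_aggregate_docs(source_db, changes):
    """Turn one _changes response into docs for the aggregating database."""
    docs = []
    for row in changes.get("results", []):
        doc = row.get("doc")
        # Skip deletions and design documents; only data docs are copied.
        if row.get("deleted") or doc is None or row["id"].startswith("_design/"):
            continue
        # Drop _id/_rev and other underscore fields from the source doc.
        copy = {k: v for k, v in doc.items() if not k.startswith("_")}
        # Prefix the id so docs from different branches cannot collide.
        copy["_id"] = source_db + ":" + row["id"]
        copy["source_db"] = source_db
        docs.append(copy)
    return docs


def sync_once(base_url, source_db, target_db, since=0):
    """Copy changes after `since` into target_db; return the new last_seq."""
    query = urllib.parse.urlencode({"since": since, "include_docs": "true"})
    changes = _call(f"{base_url}/{source_db}/_changes?{query}")
    docs = to_aggregate_docs(source_db, changes)
    if docs:
        # Updating a doc that already exists in the target needs its
        # current _rev, so look those up first.
        existing = _call(
            f"{base_url}/{target_db}/_all_docs",
            {"keys": [d["_id"] for d in docs]},
        )
        revs = {
            r["key"]: r["value"]["rev"]
            for r in existing["rows"]
            if "error" not in r
        }
        for d in docs:
            if d["_id"] in revs:
                d["_rev"] = revs[d["_id"]]
        _call(f"{base_url}/{target_db}/_bulk_docs", {"docs": docs})
    return changes["last_seq"]
```

You would call sync_once("http://localhost:5984", "branch_42", "all_branches", since) for each branch database on a schedule, storing the returned last_seq per database so the next poll picks up where the previous one stopped. With thousands of databases, that bookkeeping is exactly the part the tooling does not do for you.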
However, the biggest problem that you’re going to run into if you try to do this is that CouchDB does not horizontally scale [well] by itself. If you add more CouchDB servers, they’re all going to have duplicates of the data. Sure, your max open dbs count will scale linearly with each node added, but other things like view build time won’t (e.g., every node still has to do its own view builds).
By contrast, I’ve seen thousands of open databases on a BigCouch cluster. Anecdotally, that’s because of Dynamo-style clustering: more nodes doing different things in parallel, versus walled-off CouchDB servers replicating to one another.
Cheers.