I have an application that has to store a lot of sparse data.
All documents are separated into Projects.
Each Project has its own database, with its own collections and documents, but all on the same server.
Now I want to make it easier to Query and Reference across Projects.
So I’m considering moving all data into one database and letting each document have a “project” field that I can query against.
The database schema would go from something like:
Project1 (Database)
    Task (Collection)
        {name: my_task, status: Completed, ...}
Project2 (Database)
    Task (Collection)
        {name: other_task, status: Started, ...}
To something like:
SingleDatabase
    Task (Collection)
        {name: my_task, status: Completed, project: Project1, ...}
        {name: other_task, status: Started, project: Project2, ...}
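To make the target layout concrete, here is a minimal sketch in plain Python (no database driver involved) of what the move amounts to: every task is copied into one list and tagged with the name of the project it came from. The helper names `merge_projects` and `tasks_for` are made up for illustration, and the in-memory lists only stand in for the real collections.

```python
from typing import Dict, List


def merge_projects(projects: Dict[str, List[dict]]) -> List[dict]:
    """Flatten per-project task lists into one list, tagging each task."""
    merged = []
    for project_name, tasks in projects.items():
        for task in tasks:
            # Copy the document so the per-project source stays untouched.
            merged.append({**task, "project": project_name})
    return merged


def tasks_for(merged: List[dict], project_name: str) -> List[dict]:
    """Stand-in for querying the single collection on the 'project' field."""
    return [task for task in merged if task["project"] == project_name]


# Hypothetical stand-in for the two per-project Task collections.
per_project = {
    "Project1": [{"name": "my_task", "status": "Completed"}],
    "Project2": [{"name": "other_task", "status": "Started"}],
}

single_collection = merge_projects(per_project)
project1_tasks = tasks_for(single_collection, "Project1")
```

Cross-project queries then simply leave out the filter on "project", which is the convenience I am after.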
My guess is that it would involve some tradeoffs in memory, disk usage, and write performance.
The problem is that I have no idea how much of an impact it would have, or whether it's worth doing at all.
The question is:
Is it possible to calculate what impact this decision could have on the server?
Something like: given X collections, X documents, X indexes… the server would on average have X/s slower writes, require X more memory, and so on.
This is a highly theoretical question, and “theory is a bad companion when it comes to performance”. Even if there were a consistent, well-established theory, it would be extremely complicated, because you would have to account for caching (operations have a history, there is no time-reversibility, you need very detailed usage patterns, etc.), many non-linear effects (most algorithms aim for log(n) or n log(n) behavior), discontinuities in the ‘performance function’ (if your RAM can no longer hold the indexes, swapping starts), and hardware specifics (swapping on an SSD is an order of magnitude faster than on spindles).
The fastest and most reliable way to find out how it behaves is to implement it. That implementation can be flaky, hacky and what not. But you can get a good perf indication in a couple of hours.
Some theoretical input:
In essence, using multiple databases is like a bucket sort: you have some code that can quickly identify which bucket to query. Within those buckets, the indexes are a bit smaller, hence a little faster. On the other hand, search times should increase only logarithmically with index size. Especially for large collections, this means that there is practically no difference.
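As a rough illustration of the logarithmic argument, the sketch below compares the number of comparisons a binary search would need in one per-project index against one merged index. The figures (100 projects, one million entries each) are hypothetical, and a plain log2 is only a simplification: real B-tree indexes have a much higher fan-out, so the actual difference in tree depth is smaller still.

```python
import math


def search_comparisons(n_entries: int) -> float:
    """Approximate comparisons for a binary search over n_entries keys."""
    return math.log2(n_entries)


# Hypothetical sizes, chosen only to show the shape of the growth.
entries_per_project = 1_000_000
n_projects = 100

small_index = search_comparisons(entries_per_project)
merged_index = search_comparisons(entries_per_project * n_projects)

# The index grows 100-fold, but the search cost grows only by log2(100).
extra_comparisons = merged_index - small_index
print(f"per-project: {small_index:.1f}, merged: {merged_index:.1f}, "
      f"extra: {extra_comparisons:.1f}")
```

With these assumed numbers, a 100-fold larger index costs fewer than seven additional comparisons per lookup.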
Disk space will be used more efficiently (unless you tweaked your database settings heavily), because MongoDB will allocate a .ns file of 16MB and at least 64MB of data files for each database, even if you only store a few documents. Hence, if the number of small databases is large, your disk footprint should be better after the migration, despite the additional field.
Changes to the RAM footprint should be negligible, but memory is such an intricate topic that I would not bet a dime.
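The disk argument can be turned into back-of-the-envelope arithmetic. The sketch below uses the 16MB and 64MB figures quoted above as per-database minimums. The count of 500 databases is a made-up example, and the calculation ignores the space taken by the documents themselves, so treat it as a lower bound on preallocation only.

```python
NS_FILE_MB = 16         # namespace file allocated per database (figure quoted above)
MIN_DATA_FILES_MB = 64  # minimum data file allocation per database (figure quoted above)


def min_preallocated_mb(n_databases: int) -> int:
    """Lower bound on disk space preallocated for n_databases databases."""
    return n_databases * (NS_FILE_MB + MIN_DATA_FILES_MB)


# Hypothetical deployment: 500 small project databases versus one merged database.
before_mb = min_preallocated_mb(500)
after_mb = min_preallocated_mb(1)
print(f"before: {before_mb} MB, after: {after_mb} MB")
```

Under these assumptions the per-database overhead shrinks from 40000 MB to 80 MB, which would easily outweigh the cost of one extra short field per document when the databases are small.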