My question has 2 sub-questions:
1. Let’s assume a case where I receive data every second, represented as a set of key/value tuples. Each value is a 64-bit counter, and I need to save it into a database. There are several thousand values, but only about 1% of them carry actual data; the rest are null (a sparsely populated set). Does it make sense to create a table with a few thousand columns, or should I just store rows of “id, timestamp, key, value”?
2. If the answer to question 1 is “thousands of columns”, which database from the MySQL/PostgreSQL family should be used?
The read pattern for this case is mostly charting, so selects will fetch ranges of data based on timestamps. In other words: uniform writes at 1/sec, and occasional reads of either all data or the data in a date/time range.
Bonus question: what pattern can be used to store such data in a NoSQL database? For example, in MongoDB one could use a stats collection whose documents each contain only the 1% of the set that is populated. How would that work with read/map/reduce? How would reading the data compare with MySQL/PostgreSQL?
Edit: My use case is very similar to the NewRelic service, but instead of many small datasets I have much larger datasets (sparsely populated out of an even bigger set) that arrive less often (and for fewer users).
PostgreSQL records null columns in a bitmap, so each null costs only one bit; however, there is a large fixed overhead for every row. Let’s calculate the storage efficiency of the two storage schemes:
So the thousand-column layout is about twice as efficient as splitting the data out by key. The crossover point, below which it becomes more efficient to store keys separately, is at a fill factor of about 0.45%.
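As a rough check on those figures, here is a minimal back-of-the-envelope sketch in Python. The byte sizes are assumptions made for illustration, not exact PostgreSQL accounting: a 24-byte tuple header, one null-bitmap bit per column, an 8-byte timestamp, a 4-byte key and an 8-byte counter, with alignment padding, the per-row item pointer and the id column all ignored. The function names are made up for this example.

```python
import math

# Assumed sizes in bytes; real rows also carry padding and a 4-byte item pointer.
TUPLE_HEADER = 24  # assumed: 23-byte heap tuple header rounded up to 8 bytes
TIMESTAMP = 8      # timestamptz
KEY = 4            # assumed: int4 key in the key/value layout
VALUE = 8          # 64-bit counter


def wide_row_bytes(n_columns: int, n_present: int) -> int:
    """One row with a column per key; each column costs one null-bitmap bit."""
    null_bitmap = math.ceil(n_columns / 8)
    return TUPLE_HEADER + null_bitmap + TIMESTAMP + n_present * VALUE


def key_value_bytes(n_present: int) -> int:
    """One (timestamp, key, value) row per non-null counter."""
    return n_present * (TUPLE_HEADER + TIMESTAMP + KEY + VALUE)


def crossover_fill(n_columns: int) -> float:
    """Fill fraction at which both layouts need the same number of bytes."""
    fixed = TUPLE_HEADER + math.ceil(n_columns / 8) + TIMESTAMP
    extra_per_value = TUPLE_HEADER + TIMESTAMP + KEY  # key/value cost beyond the value itself
    return fixed / extra_per_value / n_columns


if __name__ == "__main__":
    wide = wide_row_bytes(1000, 10)
    narrow = key_value_bytes(10)
    print(f"wide: {wide} B, key/value: {narrow} B, ratio: {narrow / wide:.2f}")
    print(f"crossover fill factor: {crossover_fill(1000):.2%}")
```

Under these assumptions the wide row pays a fixed cost (header, bitmap, timestamp) once per second, while the key/value layout pays a full row header for every non-null counter, which is why the wide layout only loses at very low fill factors.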
This approach won’t scale very far, however. The maximum number of columns in a PostgreSQL table is limited to 1600. To go beyond that you could split the values vertically across several tables. Querying will then have issues of its own, because a result set can’t be much wider than 1600 columns either.
Another option is to encode the key/value pairs into arrays. The structure of the table in this case would be (id serial, ts timestamptz, keys int2[], values int8[]). The storage overhead for the same 1000 attributes at a 1% fill factor would be:
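A rough estimate for the array layout can be sketched the same way. Everything here is an assumption for illustration: a 24-byte tuple header, a 4-byte id, an 8-byte timestamp, and about 24 bytes of header for each one-dimensional array, with alignment padding and any compact storage of short values ignored. The function name is hypothetical.

```python
# Assumed sizes in bytes; treat the result as an order-of-magnitude estimate.
TUPLE_HEADER = 24  # assumed heap tuple header, rounded up
ID = 4             # serial (int4)
TIMESTAMP = 8      # timestamptz
ARRAY_HEADER = 24  # assumed overhead of a one-dimensional array without nulls
INT2 = 2           # element size of keys int2[]
INT8 = 8           # element size of values int8[]


def array_row_bytes(n_present: int) -> int:
    """One row per timestamp holding parallel keys[] and values[] arrays."""
    keys = ARRAY_HEADER + n_present * INT2
    values = ARRAY_HEADER + n_present * INT8
    return TUPLE_HEADER + ID + TIMESTAMP + keys + values


if __name__ == "__main__":
    # 1000 attributes at a 1% fill factor means 10 stored key/value pairs.
    print(array_row_bytes(10), "bytes per row")
```

Note that in this layout the cost depends only on the number of populated keys, not on the total number of possible attributes, since there is no per-column null bitmap to pay for.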
However, querying individual values requires slightly more infrastructure in this case, since a value has to be located through the position of its key in the keys array.
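To illustrate what that extra infrastructure has to do, here is a minimal Python sketch of the parallel-array encoding and a single-key lookup. It only models the logic; in the database itself this would be a small SQL function over the keys and values columns (on newer PostgreSQL versions the built-in array_position function can find the index). The helper names encode and lookup are made up for this example.

```python
from bisect import bisect_left
from typing import Optional, Sequence


def encode(sample: dict[int, int]) -> tuple[list[int], list[int]]:
    """Turn a sparse {key: counter} sample into parallel, key-sorted arrays."""
    keys = sorted(sample)
    return keys, [sample[k] for k in keys]


def lookup(keys: Sequence[int], values: Sequence[int], key: int) -> Optional[int]:
    """Return the counter stored for key, or None if the key was absent (NULL)."""
    i = bisect_left(keys, key)  # binary search works because keys are sorted
    if i < len(keys) and keys[i] == key:
        return values[i]
    return None


if __name__ == "__main__":
    keys, values = encode({42: 7, 3: 100})
    print(keys, values)
    print(lookup(keys, values, 42), lookup(keys, values, 5))
```

Keeping the keys sorted is a design choice of this sketch, not something the table structure requires; it simply makes lookups cheaper than a linear scan.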
If even better storage efficiency or flexibility is needed, a custom datatype can be added.
I know that the many-columns pattern for sensor data is used successfully in many PostgreSQL installations. As for the choice of database, I may be slightly biased, but I would suggest PostgreSQL, because you’ll have much better tools, such as arrays, partial (predicate) indexes and custom datatypes, to rearrange your data storage for more efficiency. The most important thing to keep in mind is to use partitioning from the start.
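As one way to get partitioning in place early, the sketch below generates the DDL for monthly partitions. It assumes PostgreSQL 10 or later with declarative partitioning, and a parent table created with PARTITION BY RANGE on the timestamp column; the table name stats, the monthly granularity and the helper name are all hypothetical choices for this example, not something prescribed above.

```python
from datetime import date


def month_partition_ddl(parent: str, year: int, month: int) -> str:
    """Build a CREATE TABLE ... PARTITION OF statement covering one calendar month."""
    start = date(year, month, 1)
    # First day of the following month; the upper bound of a range partition is exclusive.
    if month == 12:
        end = date(year + 1, 1, 1)
    else:
        end = date(year, month + 1, 1)
    name = f"{parent}_{start:%Y_%m}"
    return (
        f"CREATE TABLE {name} PARTITION OF {parent} "
        f"FOR VALUES FROM ('{start}') TO ('{end}');"
    )


if __name__ == "__main__":
    # "stats" is a hypothetical parent table partitioned by range on its timestamp.
    for month in (11, 12):
        print(month_partition_ddl("stats", 2024, month))
```

With time-based partitions, chart queries over a date range only need to touch the relevant partitions, and old data can be removed by dropping a whole partition instead of deleting rows.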