I need to store large amounts of metering data in a database. A record consists of an id that identifies the data’s source, a timestamp and a value. The records are later retrieved via the id and their timestamp.
According to my previous experience (I am developing the successor of an application that has been in production use for the last five years), disk I/O is the relevant performance bottleneck for data retrieval. (See also this other question of mine.)
As I am never looking for single rows but always for (possibly large) groups of rows that match a range of ids and timestamps, a pretty obvious optimization is to store larger, compressed chunks of data that are accessed via a much smaller index (e.g. by day number) and are decompressed and filtered on the fly by the application.
What I’m looking for is the best strategy for deciding what portion of the data to put in one chunk. In a perfect world, each user request would be fulfilled by retrieving one chunk of data and using most or all of it. So I want to minimize the number of chunks I have to load per request, and I want to minimize the excess data per chunk.
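To make the idea concrete, here is a minimal sketch of the chunking scheme as I currently imagine it, using SQLite, zlib and a fixed binary record layout purely for illustration (the table name, record format and day-number granularity are all assumptions, not decisions):

```python
import sqlite3
import struct
import zlib

# Hypothetical record layout: (timestamp, value) packed as two doubles.
RECORD = struct.Struct("<dd")

def store_chunk(db, source_id, day, records):
    """Compress one day's records for one source into a single blob."""
    raw = b"".join(RECORD.pack(ts, val) for ts, val in records)
    db.execute(
        "INSERT OR REPLACE INTO chunks (source_id, day, data) VALUES (?, ?, ?)",
        (source_id, day, zlib.compress(raw)),
    )

def load_range(db, source_id, first_day, last_day, ts_from, ts_to):
    """Fetch the few chunks covering the range, then filter in memory."""
    rows = db.execute(
        "SELECT data FROM chunks"
        " WHERE source_id = ? AND day BETWEEN ? AND ? ORDER BY day",
        (source_id, first_day, last_day),
    )
    out = []
    for (blob,) in rows:
        raw = zlib.decompress(blob)
        for ts, val in RECORD.iter_unpack(raw):
            if ts_from <= ts <= ts_to:
                out.append((ts, val))
    return out

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE chunks (source_id INTEGER, day INTEGER, data BLOB,"
    " PRIMARY KEY (source_id, day))"
)
store_chunk(db, 1, 100, [(100.5, 1.0), (100.9, 2.0)])
store_chunk(db, 1, 101, [(101.2, 3.0)])
```

The open question is exactly what the chunk key should be: here it is `(source_id, day)`, but that is precisely the choice I'm asking about.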
I’ll post an answer below containing my ideas so far, and make it community property so you can expand on it. Of course, if you have a different approach, post your own.
ETA: S. Lott has posted this answer below, which is helpful to the discussion even if I can’t use it directly (see my comments). The point here is that the ‘dimensions’ to my ‘facts’ are (and should be) influenced by the end user and change over time. This is a core feature of the app and actually the reason I wound up with this question in the first place.
‘groups of rows that match a range of ids and timestamps’
You have two dimensions: the source and time. I’m sure the data source has lots of attributes. Time, I know, has a lot of attributes (year, month, day, hour, day of week, week of year, quarter, fiscal period, etc., etc.)
While your facts have ‘just’ an ID and a timestamp, they could have FKs to the data source dimension and the time dimension.
Viewed as a star-schema, a query that locates ‘groups of rows that match a range of ids’ may — more properly — be a group of rows with a common data source attribute. It isn’t so much a random cluster of ID’s, it’s a cluster of ID’s defined by some common attribute of your dimensions.
Once you define these attributes of the data source dimension, your ‘chunking’ strategy should be considerably more obvious.
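As a sketch of what that might look like: if user queries turn out to cluster on some attribute of the data source (the `SOURCE_SITE` mapping below is a made-up stand-in for whatever that attribute ends up being), the chunk key can be derived from the dimension attribute plus a time attribute rather than from raw id ranges:

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical dimension lookup: maps a raw source id to the attribute
# that actually drives user queries (e.g. the meter's site).
SOURCE_SITE = {101: "plant-a", 102: "plant-a", 201: "plant-b"}

def chunk_key(source_id, timestamp):
    """Chunk by (dimension attribute, day) instead of by raw id."""
    day = datetime.fromtimestamp(timestamp, tz=timezone.utc).date()
    return (SOURCE_SITE[source_id], day.toordinal())

def group_into_chunks(records):
    """records: iterable of (source_id, timestamp, value) tuples."""
    chunks = defaultdict(list)
    for source_id, ts, value in records:
        chunks[chunk_key(source_id, ts)].append((source_id, ts, value))
    return chunks
```

Sources sharing a site then land in the same chunk for a given day, so a query for ‘all of plant-a last week’ touches few chunks and wastes little of each.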
Further, you may find that the bit-mapped index capability of some database products makes it possible to simply store your facts in a plain old table without sweating the chunk design at all.
If bit-mapped indexes still aren’t fast enough, then perhaps you have to denormalize the data source attributes into both dimension and fact, and then partition the fact table on this dimensional attribute.