I am considering two options for table design and I’m not sure what the pros and cons are for each.
Here is a somewhat abstracted description of my situation:
I am keeping track of a number of data points (category_id, point_id, value).
Most of the time, I am only interested in the current value of the data-point. But I need to log all historical values whenever there is a change.
Occasionally I might want to look at the historical values of a particular point, but it is ok if these queries are a little bit slow. What’s most important is that I can get the current values of all points, or the current values of all points in a particular category as quickly as possible.
The two (and possibly three) approaches I am considering:
- Use two separate tables, a `current_values` table and a `history` table, with a trigger that inserts a row into the history table every time something in `current_values` changes.
- Use only one table with a boolean flag `isCurrent` on each row. Whenever a value changes, mark that row as no longer current and insert a new current row with the updated value.
- (Use only one table with timestamps on each row; the current value for a particular id is then the row with the most recent timestamp. But this seems complicated to express as a query, especially if I want to get all the current values for a particular category. I'm not even sure how I would express this without subqueries, or what the performance would be.)
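For concreteness, the first option could be sketched roughly as follows. This is only an illustration: it uses SQLite through Python's `sqlite3` module so that it runs on its own (MySQL's trigger syntax differs in detail), and the `changed_at` column, the trigger name `log_value_change`, and the choice of `(category_id, point_id)` as the key are assumptions, not part of the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE current_values (
    category_id INTEGER NOT NULL,
    point_id    INTEGER NOT NULL,
    value       REAL    NOT NULL,
    PRIMARY KEY (category_id, point_id)   -- assumed key
);
CREATE TABLE history (
    category_id INTEGER NOT NULL,
    point_id    INTEGER NOT NULL,
    value       REAL    NOT NULL,
    changed_at  TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP  -- assumed column
);
-- Copy the outgoing value into history whenever a current value changes.
CREATE TRIGGER log_value_change
AFTER UPDATE OF value ON current_values
WHEN OLD.value <> NEW.value
BEGIN
    INSERT INTO history (category_id, point_id, value)
    VALUES (OLD.category_id, OLD.point_id, OLD.value);
END;
""")

conn.execute("INSERT INTO current_values VALUES (1, 10, 5.0)")
conn.execute(
    "UPDATE current_values SET value = 7.5 WHERE category_id = 1 AND point_id = 10"
)

# Current values for one category come straight from the small table.
current_rows = conn.execute(
    "SELECT category_id, point_id, value FROM current_values WHERE category_id = 1"
).fetchall()
history_rows = conn.execute(
    "SELECT category_id, point_id, value FROM history"
).fetchall()
```

With this layout the "current" query never touches the history rows, so its cost depends only on the few thousand current points.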
There will only be about 3,000-5,000 current points at a time, but the values change frequently enough that up to half of these can change every day, so there will be hundreds of thousands of rows of history eventually.
What are the pros and cons of each approach above (or is there another better approach that I haven’t mentioned)? Given my goal of getting the current set of points as quickly as possible, and being ok with slower queries on the history, which is best?
Options 1 and 2 will have similar performance. The manual “partition” of the data in Option 1 can also be achieved in Option 2 with a clustered index that has IsCurrent as its first column. You can always add a view that returns only the current rows. In some ways the write cost is very similar too: changing IsCurrent moves the old row physically (because of the clustering) and adds the new row, much as your trigger would delete and insert across two tables.
You could also use MySQL’s partitioning feature.
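A minimal sketch of Option 2 with a flag, an index led by that flag, and a view over the current rows might look like this. It runs on SQLite via Python's `sqlite3` so it is self-contained; an ordinary composite index stands in for the clustered index described above, and the names `point_values`, `idx_current`, `current_points`, and `set_value` are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE point_values (
    id          INTEGER PRIMARY KEY,
    category_id INTEGER NOT NULL,
    point_id    INTEGER NOT NULL,
    value       REAL    NOT NULL,
    is_current  INTEGER NOT NULL DEFAULT 1   -- 1 = current, 0 = historical
);
-- Flag first, so current rows sit together in the index.
CREATE INDEX idx_current ON point_values (is_current, category_id, point_id);
CREATE VIEW current_points AS
    SELECT category_id, point_id, value
    FROM point_values
    WHERE is_current = 1;
""")

def set_value(conn, category_id, point_id, value):
    """Retire the current row (if any) and insert the new current row."""
    with conn:  # both statements commit together or not at all
        conn.execute(
            "UPDATE point_values SET is_current = 0 "
            "WHERE category_id = ? AND point_id = ? AND is_current = 1",
            (category_id, point_id),
        )
        conn.execute(
            "INSERT INTO point_values (category_id, point_id, value) "
            "VALUES (?, ?, ?)",
            (category_id, point_id, value),
        )

set_value(conn, 1, 10, 5.0)
set_value(conn, 1, 10, 7.5)
set_value(conn, 1, 11, 2.0)

current_rows = conn.execute(
    "SELECT point_id, value FROM current_points "
    "WHERE category_id = 1 ORDER BY point_id"
).fetchall()
total_rows = conn.execute("SELECT COUNT(*) FROM point_values").fetchone()[0]
```

Doing the update and the insert in one transaction matters here: otherwise a reader could briefly see a point with no current row, or with two.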
A big benefit of having separate tables or partitions of a single table is controlling the backup (and potentially purging) of the data in a more fine-grained way.
A real benefit of Option 1 is that you save that small column, which could matter once you get to billions of rows.
A maintenance benefit of Option 2 is that the schema is always the same (you don’t have to keep schema changes in sync), since there is only one table.
Option 3 is not going to perform as well because the leading edge of current values is more difficult to find: the latest timestamp varies from point to point. That said, performance can be improved with an index on the identifier and timestamp DESC.
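For reference, the "latest row per point" query for Option 3 can be written with a grouped subquery joined back to the table. The sketch below again uses SQLite through Python's `sqlite3`; the table name, the `changed_at` column, the index name, and the sample rows are all invented for illustration, and it assumes no two rows for the same point share a timestamp.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE point_values (
    category_id INTEGER NOT NULL,
    point_id    INTEGER NOT NULL,
    value       REAL    NOT NULL,
    changed_at  TEXT    NOT NULL   -- ISO-8601 text, so string order = time order
);
-- Index on identifier and timestamp DESC, as suggested above.
CREATE INDEX idx_latest
    ON point_values (category_id, point_id, changed_at DESC);
""")

conn.executemany(
    "INSERT INTO point_values VALUES (?, ?, ?, ?)",
    [
        (1, 10, 5.0, "2024-01-01T00:00:00"),
        (1, 10, 7.5, "2024-01-02T00:00:00"),
        (1, 11, 2.0, "2024-01-01T00:00:00"),
        (2, 20, 9.0, "2024-01-01T00:00:00"),
    ],
)

# Find each point's newest timestamp, then join back to fetch its value.
LATEST_IN_CATEGORY = """
SELECT v.point_id, v.value
FROM point_values AS v
JOIN (
    SELECT point_id, MAX(changed_at) AS latest
    FROM point_values
    WHERE category_id = ?
    GROUP BY point_id
) AS m
  ON m.point_id = v.point_id AND m.latest = v.changed_at
WHERE v.category_id = ?
ORDER BY v.point_id
"""

current_rows = conn.execute(LATEST_IN_CATEGORY, (1, 1)).fetchall()
```

The query has to work out the maximum timestamp for every point before it can return anything, which is the extra cost compared with Options 1 and 2.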