Consider a table invoice_data containing 12 years of invoice data. This data is loaded into a cube every day. With every load, the newest 2 years are deleted from invoice_data and re-imported from the live system. (We do that because values in older records may change, too.)
This delete statement takes about 15 minutes, but we can't use a truncate, because we would then have to reload all 12 years, which would take much longer.
Question:
Is it a good design to split a large table invoice_data like this:
invoice_data_old, which contains years < current year - 2
invoice_data_new, which contains years >= current year - 2
This way, we could truncate invoice_data_new and would no longer need the delete statement.
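To make the proposed split concrete, here is a minimal T-SQL sketch. The column names and the view name invoice_data_all are assumptions made up for illustration, and the reload from the live system is left as a placeholder.

```sql
-- Hypothetical schema: the columns are assumptions, not the real invoice layout.
CREATE TABLE dbo.invoice_data_old (
    invoice_id   INT            NOT NULL,
    invoice_date DATE           NOT NULL,
    amount       DECIMAL(18, 2) NOT NULL
);

CREATE TABLE dbo.invoice_data_new (
    invoice_id   INT            NOT NULL,
    invoice_date DATE           NOT NULL,
    amount       DECIMAL(18, 2) NOT NULL
);
GO

-- A view (hypothetical name) so the cube load can still read a single object.
CREATE VIEW dbo.invoice_data_all
AS
SELECT invoice_id, invoice_date, amount FROM dbo.invoice_data_old
UNION ALL
SELECT invoice_id, invoice_date, amount FROM dbo.invoice_data_new;
GO

-- Daily load: empty the recent table, then re-import the last 2 years.
TRUNCATE TABLE dbo.invoice_data_new;
-- INSERT INTO dbo.invoice_data_new (invoice_id, invoice_date, amount)
-- SELECT ... FROM <live system>;
```

One thing this design would still need is a yearly step that moves the rows falling out of the two-year window from invoice_data_new into invoice_data_old.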
Are there any better approaches?
I am using SQL Server 2008, but I think this is a general question.
I think that the best approach to this problem is 'horizontal partitioning'. Basically, you create a sliding window of data: as a new period of data is added at the end, the old data is removed at the beginning.
Over the course of 4 years, you might have 48 partitions (one per month).
The big advantage of this approach is that SQL Server is aware that the data is partitioned in this way and can automatically optimise queries to use only the partitions that hold relevant data, provided the query filters on the partitioning column. For example, when you select data from the last month, SQL Server knows to search only one partition, or 1/48th of the data.
Another important aspect is that removing the oldest partition becomes a metadata operation, so it finishes almost instantly and needs only a brief schema lock rather than long-running row locks.
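As a rough sketch of what this could look like in T-SQL: all object and column names below (pf_invoice_month, ps_invoice_month, invoice_data_part, invoice_data_staging) are hypothetical, and only three monthly boundaries are listed to keep it short. Note that in SQL Server 2008 table partitioning is limited to the Enterprise (and Developer) edition.

```sql
-- RANGE RIGHT: each boundary date is the first day of its own partition.
-- Three boundaries give four partitions: before Jan 2010, Jan, Feb, Mar onward.
CREATE PARTITION FUNCTION pf_invoice_month (DATE)
AS RANGE RIGHT FOR VALUES ('20100101', '20100201', '20100301');
GO

-- Keep all partitions on PRIMARY for simplicity.
CREATE PARTITION SCHEME ps_invoice_month
AS PARTITION pf_invoice_month ALL TO ([PRIMARY]);
GO

-- Partitioned table (hypothetical columns), partitioned by invoice_date.
CREATE TABLE dbo.invoice_data_part (
    invoice_id   INT            NOT NULL,
    invoice_date DATE           NOT NULL,
    amount       DECIMAL(18, 2) NOT NULL
) ON ps_invoice_month (invoice_date);

-- Empty staging table with the same structure on the same filegroup;
-- this is what a partition can be switched into.
CREATE TABLE dbo.invoice_data_staging (
    invoice_id   INT            NOT NULL,
    invoice_date DATE           NOT NULL,
    amount       DECIMAL(18, 2) NOT NULL
) ON [PRIMARY];
GO

-- Filtering on the partitioning column lets the optimizer skip
-- every partition except the one holding February 2010.
SELECT SUM(amount) AS total_amount
FROM dbo.invoice_data_part
WHERE invoice_date >= '20100201' AND invoice_date < '20100301';

-- Remove January 2010 as a metadata operation: look up its partition
-- number, switch it into the staging table, then truncate the staging table.
DECLARE @p INT = $PARTITION.pf_invoice_month('20100101');

ALTER TABLE dbo.invoice_data_part
SWITCH PARTITION @p TO dbo.invoice_data_staging;

TRUNCATE TABLE dbo.invoice_data_staging;
```

For the daily reload in the question, the same switch-and-truncate step could be applied to the partitions covering the last two years before re-importing them. Yearly partitions might be enough for that purpose, though that is a judgment call that depends on the query patterns.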
The downside of this approach is that it takes more effort to set up and maintain (unless you write some automated month-end scripts, which may itself be a non-trivial exercise).
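To give an idea of what such a month-end script might involve, here is a hedged sketch. It assumes a RANGE RIGHT partition function on a DATE column named pf_invoice_month and a scheme named ps_invoice_month (both hypothetical names), and that the oldest month has already been switched out so the partitions being merged are empty.

```sql
DECLARE @new_boundary DATE;
DECLARE @old_boundary DATE;

-- Read the current lowest and highest boundaries from the catalog views.
SELECT @new_boundary = DATEADD(MONTH, 1, MAX(CONVERT(DATE, prv.value))),
       @old_boundary = MIN(CONVERT(DATE, prv.value))
FROM sys.partition_range_values AS prv
JOIN sys.partition_functions AS pf
    ON pf.function_id = prv.function_id
WHERE pf.name = 'pf_invoice_month';

-- Tell the scheme which filegroup the next new partition should use.
ALTER PARTITION SCHEME ps_invoice_month NEXT USED [PRIMARY];

-- Add a partition for the coming month at the end of the window.
ALTER PARTITION FUNCTION pf_invoice_month() SPLIT RANGE (@new_boundary);

-- Drop the oldest boundary at the start of the window.
ALTER PARTITION FUNCTION pf_invoice_month() MERGE RANGE (@old_boundary);
```

Splitting or merging partitions that still contain rows causes data movement, so a production script would normally check that the affected partitions are empty first, and would add error handling around each step.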