I have a vector in which I store increasing data. Normally each element of the vector is a 64-bit long variable. However, the difference between two successive elements is often quite small; for example, we can have a sequence like the following:
1, 34, 37, 42, 45, 1098, 1200, 1211, 1938
What is the best way of compressing this data? Would it be ideal to just keep the differences, with a header byte that defines how big each difference is (a byte, word, double word, etc.), or are there even better ways of compressing such incremental data?
EDIT
I need to compress online, that is while putting data in the vector. You may assume a dynamically expanding vector.
Here’s a very simple strategy for when the increments are typically small:
If the increment is <2**7, emit it as a single byte with the highest bit set to zero (bit pattern 0xxxxxxx).
Else, if the increment is <2**14, emit it as two bytes with highest bits one and zero, respectively (1xxxxxxx 0xxxxxxx).
Extend this to larger increments in the obvious way. The highest (eighth) bit set to one means “wait, there’s more coming”. Zero means “end of integer”.
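The scheme above can be sketched in C++ roughly as follows. This is only an illustration, not a tuned implementation: the class name `DeltaVarintVector` and its member names are made up for this example, the 7-bit groups are written low-order first (the description above does not fix an order), and the values are assumed to be appended in non-decreasing order, with the first value stored as a difference from zero.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container (name invented for this sketch): stores a
// non-decreasing sequence of 64-bit values as variable-length deltas.
class DeltaVarintVector {
public:
    // Append one value online. Assumes value >= the previously appended value.
    void push_back(std::uint64_t value) {
        std::uint64_t delta = value - last_;
        last_ = value;
        // Emit 7 bits per byte, low-order group first (an assumption).
        // A set high bit means "more bytes follow".
        while (delta >= 0x80) {
            bytes_.push_back(static_cast<std::uint8_t>((delta & 0x7F) | 0x80));
            delta >>= 7;
        }
        // Final byte: high bit clear means "end of integer".
        bytes_.push_back(static_cast<std::uint8_t>(delta));
        ++count_;
    }

    // Decode the whole sequence back into plain 64-bit values.
    std::vector<std::uint64_t> decode() const {
        std::vector<std::uint64_t> out;
        out.reserve(count_);
        std::uint64_t value = 0;
        std::uint64_t delta = 0;
        int shift = 0;
        for (std::uint8_t b : bytes_) {
            delta |= static_cast<std::uint64_t>(b & 0x7F) << shift;
            if (b & 0x80) {
                shift += 7;          // continuation byte
            } else {
                value += delta;      // last byte of this delta
                out.push_back(value);
                delta = 0;
                shift = 0;
            }
        }
        return out;
    }

    std::size_t size() const { return count_; }            // number of values
    std::size_t byte_size() const { return bytes_.size(); } // encoded bytes

private:
    std::vector<std::uint8_t> bytes_;
    std::uint64_t last_ = 0;
    std::size_t count_ = 0;
};
```

For the sequence in the question, the differences are 1, 33, 3, 5, 3, 1053, 102, 11, 727, so this sketch would use 11 bytes instead of 72. The trade-off is that random access is lost: reading the n-th element means decoding from the start (or from some periodically stored checkpoint, if you add one).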
I remember seeing this coding scheme being suggested for bigints in some RFC or maybe an internet-draft, but I seem unable to retrieve it right now. Alternatively, you can reuse the UTF-8 encoding scheme for some improved error detection at the expense of less efficient encoding (and you may have to extend it if you want to go beyond 64-bit integers).