I’ve got a large number of integer arrays. Each one has a few thousand integers in it, and each integer is generally the same as the one before it or differs from it by only a bit or two. I’d like to shrink each array down as small as possible to reduce my disk I/O.
Zlib shrinks it to about 25% of its original size. That’s nice, but I don’t think its algorithm is particularly well suited for the problem. Does anyone know a compression library or simple algorithm that might perform better for this type of information?
Update: converting each array to XOR deltas before running zlib shrinks it to about 20% of the original size.
If most of the integers really are the same as the previous one, and the inter-symbol difference can usually be expressed as a single bit flip, this sounds like a job for XOR.
Take an input stream like:
and output:
A bit of pseudocode:
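Here's a minimal Python sketch of the encoding step. The name `xor_encode`, the choice to treat the stream as if it were preceded by a 0, and the sample values are all my own illustration, not anything from a library:

```python
def xor_encode(values):
    """Replace each integer with the XOR of itself and its predecessor."""
    encoded = []
    prev = 0  # assumption: the first value is XORed against 0, so it passes through unchanged
    for v in values:
        encoded.append(v ^ prev)  # 0 when v repeats, a single set bit for a one-bit flip
        prev = v
    return encoded


# Made-up sample: repeats, then a high bit flips (5 -> 133), then a low bit flips (133 -> 132)
stream = [5, 5, 5, 133, 133, 132]
print(xor_encode(stream))  # [5, 0, 0, 128, 0, 1]
```

Repeated values become 0, and a one-bit change becomes a power of two no matter which bit moved.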
We’ve now reduced most of the output to 0, even when a high bit is changed. The RLE compression in any other tool you use will have a field day with this. It’ll work even better on 32-bit integers, and it can still encode a radically different integer popping up in the stream. You’re saved the bother of dealing with bit-packing yourself, as everything remains an int-sized quantity.
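To see what the second-stage compressor makes of the deltas, something like the following rough sketch should do. It assumes the values fit in unsigned 32-bit integers, and the test data is synthetic (one bit flipped every 50 elements), so the sizes it prints say nothing about your real arrays. The helper names `pack_u32` and `xor_deltas` are made up for this example; `zlib.compress` and `struct.pack` are the standard-library calls.

```python
import struct
import zlib


def pack_u32(values):
    """Serialize a list of ints as little-endian unsigned 32-bit integers."""
    return struct.pack(f"<{len(values)}I", *values)


def xor_deltas(values):
    """XOR each value with its predecessor (the first one is XORed with 0)."""
    out = []
    prev = 0
    for v in values:
        out.append(v ^ prev)
        prev = v
    return out


# Synthetic data, an assumption for illustration: 4000 values,
# with a single bit flipped every 50 elements.
values = []
current = 0x12345678
for i in range(4000):
    if i % 50 == 49:
        current ^= 1 << ((i // 50) % 32)
    values.append(current)

raw = pack_u32(values)
plain = zlib.compress(raw)                         # zlib on the raw integers
delta = zlib.compress(pack_u32(xor_deltas(values)))  # zlib on the XOR deltas

print("raw bytes:      ", len(raw))
print("zlib only:      ", len(plain))
print("xor, then zlib: ", len(delta))
```

Run the same comparison on a few of your real arrays before committing to it; how much the XOR pass helps depends on how often values actually repeat.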
When you want to decompress:
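A matching decoder sketch in Python, again with an illustrative name (`xor_decode`) and assuming the encoder XORed the first value against 0. It just keeps a running XOR of everything seen so far:

```python
def xor_decode(encoded):
    """Rebuild the original integers from a list of XOR deltas."""
    values = []
    prev = 0  # assumption: must match the starting value used when encoding
    for d in encoded:
        prev ^= d  # XORing the delta back in recovers the original value
        values.append(prev)
    return values


print(xor_decode([5, 0, 0, 128, 0, 1]))  # [5, 5, 5, 133, 133, 132]
```

Since XOR is its own inverse, no information is lost; a radically different value just shows up as a large delta and decodes the same way.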
This also has the advantage of being a simple algorithm that is going to run really, really fast, since it is just XOR.