I have an information retrieval application that creates bit arrays on the order of tens of millions of bits. The number of ‘set’ bits in the array varies widely, from all clear to all set. Currently, I’m using a straightforward bit array (java.util.BitSet), so each of my bit arrays takes several megabytes.
My plan is to look at the cardinality of the first N bits, then make a decision about what data structure to use for the remainder. Clearly some data structures are better for very sparse bit arrays, and others when roughly half the bits are set (when most bits are set, I can use negation to treat it as a sparse set of zeroes).
- What structures might be good at each extreme?
- Are there any in the middle?
Here are a few constraints or hints:
- The bits are set only once, and in index order.
- I need 100% accuracy, so something like a Bloom filter isn’t good enough.
- After the set is built, I need to be able to efficiently iterate over the ‘set’ bits.
- The bits are randomly distributed, so run-length–encoding algorithms aren’t likely to be much better than a simple list of bit indexes.
- I’m trying to optimize memory utilization, but speed still carries some weight.
Something with an open source Java implementation is helpful, but not strictly necessary. I’m more interested in the fundamentals.
Unless the data is truly random with an even 1/0 distribution, this is a lossless data compression problem, closely analogous to the CCITT Group 3 compression used for black-and-white (i.e., binary) fax images. CCITT Group 3 applies a Huffman coding scheme to the lengths of runs of identical bits. Fax uses a fixed set of Huffman codes, but you can generate codes tailored to each data set to improve the compression ratio. As long as you only need to access the bits sequentially, as you implied, this will be a fairly efficient approach. Random access would create some additional challenges, but you could probably build a binary search tree index over various offset points in the array, which would let you jump close to the desired location and then walk in from there.
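To make the sequential-access idea concrete, here is a minimal Java sketch. Note that it is not Huffman coding: as a simpler stand-in, it stores the gap between consecutive set bits in a variable-length byte code (7 payload bits per byte, high bit as a continuation flag). The class name `GapEncodedBitSet` and its methods are hypothetical, invented for this illustration, and it assumes bits are added in increasing index order, as the question states.

```java
import java.io.ByteArrayOutputStream;
import java.util.NoSuchElementException;
import java.util.PrimitiveIterator;

/** Hypothetical sketch: stores set-bit indexes as variable-length encoded gaps. */
class GapEncodedBitSet {
    private final ByteArrayOutputStream out = new ByteArrayOutputStream();
    private int lastIndex = -1; // index of the most recently added set bit

    /** Appends a set bit; indexes must be non-negative and strictly increasing. */
    void add(int index) {
        if (index <= lastIndex) {
            throw new IllegalArgumentException("indexes must be strictly increasing");
        }
        long gap = (long) index - lastIndex; // always >= 1; long avoids overflow
        // Emit 7 bits at a time, low bits first; the high bit means "more bytes follow".
        while (gap >= 0x80) {
            out.write((int) (gap & 0x7F) | 0x80);
            gap >>>= 7;
        }
        out.write((int) gap);
        lastIndex = index;
    }

    /** Number of bytes currently used by the encoded gaps. */
    int sizeInBytes() {
        return out.size();
    }

    /** Iterates over the set-bit indexes in increasing order (snapshot of current contents). */
    PrimitiveIterator.OfInt iterator() {
        final byte[] bytes = out.toByteArray();
        return new PrimitiveIterator.OfInt() {
            private int pos = 0;      // read position in the byte array
            private int current = -1; // last decoded index

            @Override
            public boolean hasNext() {
                return pos < bytes.length;
            }

            @Override
            public int nextInt() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                long gap = 0;
                int shift = 0;
                byte b;
                do {
                    b = bytes[pos++];
                    gap |= (long) (b & 0x7F) << shift;
                    shift += 7;
                } while ((b & 0x80) != 0);
                current = (int) (current + gap);
                return current;
            }
        };
    }

    public static void main(String[] args) {
        GapEncodedBitSet bits = new GapEncodedBitSet();
        for (int i : new int[] {3, 10, 200, 20000, 5000000}) {
            bits.add(i);
        }
        bits.iterator().forEachRemaining((int i) -> System.out.println(i));
        System.out.println("bytes used: " + bits.sizeInBytes());
    }
}
```

Because every gap costs at least one byte, this particular code only beats a plain bit array when fewer than roughly one bit in eight is set. A bit-level code (Huffman or similar) over the gaps would be needed to do better at higher densities.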
Note: The Huffman scheme still works well even if the data is random, as long as the 1/0 distribution is not perfectly even. That is, the less even the distribution, the better the compression ratio.
Finally, if the bits are truly random with an even distribution, then, well, according to Mr. Claude Shannon, you are not going to be able to compress them by any significant amount using any scheme.
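The Shannon limit can be estimated directly. Assuming each bit is set independently with probability p (which matches the "randomly distributed" description in the question), the entropy is H(p) = -p·log2(p) - (1-p)·log2(1-p) bits per array position, and no lossless scheme can average fewer bits than that. The following Java sketch computes this bound; the class and method names (`BitEntropy`, `entropyPerBit`, `minBytes`) are made up for illustration.

```java
/** Hypothetical helper: Shannon lower bound for a random bit array. */
class BitEntropy {
    /** Entropy in bits per array position when each bit is set with probability p. */
    static double entropyPerBit(double p) {
        if (p <= 0.0 || p >= 1.0) {
            return 0.0; // all clear or all set: nothing to store
        }
        double q = 1.0 - p;
        return -(p * log2(p) + q * log2(q));
    }

    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    /** Smallest achievable average size in bytes for totalBits positions with setBits set. */
    static double minBytes(long totalBits, long setBits) {
        double p = (double) setBits / totalBits;
        return totalBits * entropyPerBit(p) / 8.0;
    }

    public static void main(String[] args) {
        long n = 10_000_000L; // assumed array size, in bits
        for (double p : new double[] {0.001, 0.01, 0.1, 0.5}) {
            long set = Math.round(n * p);
            System.out.printf("p=%.3f: at least %.0f KB, versus %d KB uncompressed%n",
                    p, minBytes(n, set) / 1024.0, n / 8 / 1024);
        }
    }
}
```

H(p) is symmetric around p = 0.5, where it reaches its maximum of exactly 1 bit per position. That is why negating a mostly-set array works, and why a half-full random array cannot be stored in less space than a plain bit array.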