Correct me if I’m approaching this wrong, but I have a queue server and a bunch of Java workers that I’m running in a cluster. My queue has work units that are very small, but there are many of them. So far my benchmarks and review of the workers have shown that I get about 200mb/second.
So I’m trying to figure out how to get more work units through the bandwidth I have. Currently my CPU usage is not very high (40-50%) because the workers can process the data faster than the network can send it. I want to get more work through the queue and am willing to pay for it with expensive compression/decompression (since half of each core is idle right now).
I have tried Java LZO and gzip, but was wondering if there is anything better (even if it’s more CPU-expensive)?
Updated: the data is a byte[]. Basically the queue only takes it in that format, so I am using ByteArrayOutputStream to write two ints and an int[] to a byte[]. The values in the int[] are all ints between 0 and 100 (or 1000, but the vast majority of the numbers are zeros). The lists are quite large, anywhere from 1000 to 10,000 items (again, majority zeros; never more than 100 non-zero numbers in the int[]).
It sounds like using a custom compression mechanism that exploits the structure of the data could be very efficient.
Firstly, using a short[] (16 bit data type) instead of an int[] will halve (!) the amount of data sent. You can do this because the numbers are easily between -2^15 (-32768) and 2^15-1 (32767). This is ridiculously easy to implement.

Secondly, you could use a scheme similar to run-length encoding: a positive number represents that number literally, while a negative number represents that many zeros (after taking absolute values), e.g.
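A minimal Java sketch of this zero-run scheme, assuming every value is non-negative and fits in a short. The class and method names (ZeroRunCodec, encode, decode) are made up for illustration and are not from any library:

```java
import java.util.Arrays;

// Hypothetical helper: zero-run encoding of a sparse int[] into a short[].
class ZeroRunCodec {

    // Non-zero values (assumed to be 1..Short.MAX_VALUE) are copied literally;
    // each run of n consecutive zeros is replaced by the single value -n.
    static short[] encode(int[] data) {
        // Every literal and every zero run produces exactly one short,
        // so the encoded form is never longer than the input.
        short[] out = new short[data.length];
        int n = 0;
        int i = 0;
        while (i < data.length) {
            if (data[i] == 0) {
                int run = 0;
                // Cap the run length so that -run still fits in a short.
                while (i < data.length && data[i] == 0 && run < Short.MAX_VALUE) {
                    run++;
                    i++;
                }
                out[n++] = (short) -run;
            } else {
                if (data[i] < 0 || data[i] > Short.MAX_VALUE) {
                    throw new IllegalArgumentException("value out of range: " + data[i]);
                }
                out[n++] = (short) data[i++];
            }
        }
        return Arrays.copyOf(out, n);
    }

    // Reverses encode(): negative entries expand to that many zeros.
    static int[] decode(short[] encoded) {
        int length = 0;
        for (short s : encoded) {
            length += (s < 0) ? -s : 1;
        }
        int[] out = new int[length]; // Java zero-initialises the array
        int i = 0;
        for (short s : encoded) {
            if (s < 0) {
                i += -s;       // skip over a run of zeros
            } else {
                out[i++] = s;  // literal value
            }
        }
        return out;
    }
}
```

With this sketch, the array {0, 0, 0, 5, 0, 0, 17, 0} encodes to {-3, 5, -2, 17, -1}, and decoding that gives back the original eight values.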
This is harder to implement than just substituting short for int, but will provide ~80% compression in the very worst case (1000 numbers, 100 non-zero, none of which are consecutive).

I just did some simulations to work out the compression ratios. I tested the method I described above, and the one suggested by Louis Wasserman and sbridges. Both performed very well.
Assuming the length of the array and the number of non-zero numbers are both uniformly distributed between their bounds, both methods save about 5400 ints (or shorts) on average, with a compressed size of about 2.5% of the original! The run-length encoding method seems to save about 1 additional int (an average compressed size that is 0.03% smaller), i.e. basically no difference, so you should use the one that is easiest to implement. The following are histograms of the compression ratios for 50000 random samples (they are very similar!).

Summary: using shorts instead of ints and one of the compression methods, you will be able to compress the data to about 1% of its original size!

For the simulation, I used the following R script: