I’m using an event-loop-based server written in Python with Twisted that stores files, and I’d like to be able to classify the files according to their compressibility.
If the probability that they’d benefit from compression is high, they would go to a directory with btrfs compression switched on, otherwise they’d go elsewhere.
I do not need to be sure; 80% accuracy would be plenty, and would save a lot of disk space. But since there is the CPU and filesystem performance issue too, I cannot just save everything compressed.
The files are in the low megabytes. I cannot test-compress them in full without using a huge chunk of CPU and unduly delaying the event loop, or refactoring a compression algorithm to fit into the event loop.
Is there any best practice for getting a quick estimate of compressibility? What I came up with is taking a small chunk (a few kB) of data from the beginning of the file, test-compressing it (with a presumably tolerable delay), and basing my decision on that.
Any suggestions? Hints? Flaws in my reasoning and/or problem?
Just 10K from the middle of the file will do the trick. You don’t want the beginning or the end, since they may contain header or trailer information that is not representative of the rest of the file. 10K is enough to get some amount of compression with any typical algorithm. That will predict a relative amount of compression for the whole file, to the extent that that middle 10K is representative. The absolute ratio you get will not be the same as for the whole file, but the amount that it differs from no compression will allow you to set a threshold. Just experiment with many files to see where to set the threshold.
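A minimal sketch of that idea in Python, assuming zlib from the standard library as a stand-in for whatever compressor the btrfs mount is actually configured with. The function names, the compression level, and the 0.9 threshold are placeholders of mine, not measured values; the threshold in particular should come from experimenting with your own files.

```python
import os
import zlib

SAMPLE_SIZE = 10 * 1024  # 10K, as suggested above


def sample_ratio(path, sample_size=SAMPLE_SIZE, level=1):
    """Compress a chunk from the middle of the file.

    Returns compressed size / original size for that chunk,
    so values well below 1.0 suggest a compressible file.
    """
    size = os.path.getsize(path)
    # Center the sample; small files are read from the start.
    offset = max(0, (size - sample_size) // 2)
    with open(path, "rb") as f:
        f.seek(offset)
        chunk = f.read(sample_size)
    if not chunk:
        return 1.0  # empty file: nothing to gain
    return len(zlib.compress(chunk, level)) / len(chunk)


def looks_compressible(path, threshold=0.9):
    # threshold is an assumed starting point, to be tuned empirically
    return sample_ratio(path) < threshold
```

The read and the compression of 10K are short blocking calls, so they should fit in the event loop; if even that turns out to be too slow, Twisted's `deferToThread` is one way to push the call off the reactor thread.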
As noted, you can save time by doing nothing for files that are obviously already compressed, e.g. .png, .jpg, .mov, .pdf, .zip, etc.
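That shortcut could look something like the following. The set contains only the extensions listed above and is meant to be extended; the names `COMPRESSED_EXTENSIONS` and `already_compressed` are made up for this sketch.

```python
import os

# Extensions from the list above; extend as needed for your data.
COMPRESSED_EXTENSIONS = {".png", ".jpg", ".mov", ".pdf", ".zip"}


def already_compressed(filename):
    """True if the file extension marks a format that is already compressed."""
    ext = os.path.splitext(filename)[1].lower()
    return ext in COMPRESSED_EXTENSIONS
```

Files for which this returns True can go straight to the uncompressed directory without any sampling.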
Measuring entropy is not necessarily a good indicator, since it only gives the zeroth-order estimate of compressibility. If the entropy indicates that it is compressible enough, then it is right. If the entropy indicates that it is not compressible enough, then it may or may not be right. Your actual compressor is a much better estimator of compressibility. Running it on 10K won’t take long.
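To illustrate the second case, here is a small constructed example (my own, not taken from real files): a buffer in which every byte value occurs equally often, so the zeroth-order entropy is the maximum of 8 bits per byte, yet the data is a short repeating pattern that zlib compresses very well.

```python
import math
import zlib
from collections import Counter


def byte_entropy(data):
    """Zeroth-order Shannon entropy of the byte histogram, in bits per byte."""
    counts = Counter(data)
    total = len(data)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


# 10K of the sequence 0, 1, ..., 255 repeated: a flat histogram,
# but highly redundant once the order of the bytes is considered.
data = bytes(range(256)) * 40

entropy = byte_entropy(data)                  # 8.0, i.e. "incompressible"
ratio = len(zlib.compress(data)) / len(data)  # far below 1.0

print(entropy, ratio)
```

The entropy figure looks only at how often each byte occurs, while the compressor also exploits repetition, which is why it is the better predictor here.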