I want to write a storage backend to store larger chunks of data. The

Question

0

Asked: May 24, 20262026-05-24T13:10:55+00:00 2026-05-24T13:10:55+00:00

I want to write a storage backend to store larger chunks of data. The

0

I want to write a storage backend to store larger chunks of data. The data can be anything, but it is mainly binary files (images, pdfs, jar files) or text files (xml, jsp, js, html, java…). I found most of the data is already compressed. If everything is compressed, about 15% disk space can be saved.

I am looking for the most efficient algorithm that can predict with high probability that a chunk of data (let’s say 128 KB) can be compressed or not (lossless compression), without having to look at all the data if possible.

The compression algorithm will be either LZF, Deflate, or something similar (maybe Google Snappy). So predicting if data is compressible should be much faster than compressing the data itself, and use less memory.

Algorithms I already know about:

Try to compress a subset of the data, let’s say 128 bytes (this is a bit slow)
Calculate the sum of 128 bytes, and if it’s within a certain range then it’s likely not compressible (within 10% of 128 * 127) (this is fast, and relatively good, but I’m looking for something more reliable, because the algorithm really only looks at the topmost bits for each byte)
Look at the file headers (relatively reliable, but feels like cheating)

I guess the general idea is that I need an algorithm that can quickly calculate if the probability of each bit in a list of bytes is roughly 0.5.

Update

I have implemented ‘ASCII checking’, ‘entropy calculation’, and ‘simplified compression’, and all give good results. I want to refine the algorithms, and now my idea is to not only predict if data can be compressed, but also how much it can be compressed. Possibly using a combination of algorithms. Now if I could only accept multiple answers… I will accept the answer that gave the best results.

Additional answers (new ideas) are still welcome! If possible, with source code or links 🙂

Update 2

A similar method is now implemented in Linux.

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-24T13:10:56+00:00

Editorial Team

2026-05-24T13:10:56+00:00Added an answer on May 24, 2026 at 1:10 pm

From my experience almost all of the formats that can effectively be compressed are non-binary. So checking if about 70-80% of the characters are within in the [0-127] rage should do the trick.

If you want to to it “properly” (even though I really can’t see a reason to do that), you either have to run (parts of) your compression algorithm on the data or calculate the entropy, as tskuzzy already proposed.

0

Reply
Share
Share

- Report

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I want to write a storage backend to store larger chunks of data. The

Update

Update 2

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply