The Archive Base


Editorial Team
Asked: May 24, 2026

I want to write a storage backend to store larger chunks of data. The data can be anything, but it is mainly binary files (images, pdfs, jar files) or text files (xml, jsp, js, html, java…). I found that most of the data is already compressed. If everything is compressed, about 15% of the disk space can be saved.

I am looking for the most efficient algorithm that can predict with high probability whether a chunk of data (let’s say 128 KB) can be compressed losslessly, without having to look at all of the data if possible.

The compression algorithm will be either LZF, Deflate, or something similar (maybe Google Snappy). So predicting if data is compressible should be much faster than compressing the data itself, and use less memory.

Algorithms I already know about:

  • Try to compress a subset of the data, let’s say 128 bytes (this is a bit slow)

  • Calculate the sum of 128 bytes, and if it’s within a certain range then it’s likely not compressible (within 10% of 128 * 127) (this is fast, and relatively good, but I’m looking for something more reliable, because the algorithm really only looks at the topmost bits for each byte)

  • Look at the file headers (relatively reliable, but feels like cheating)

I guess the general idea is that I need an algorithm that can quickly calculate if the probability of each bit in a list of bytes is roughly 0.5.
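
One way to read that idea literally is to count, for each of the eight bit positions, how often the bit is set across the sample. The sketch below does that; the function names and the 0.1 tolerance are assumptions chosen for illustration, and the sample is assumed to be non-empty.

```python
from typing import List


def bit_frequencies(sample: bytes) -> List[float]:
    """Return, for bit positions 0..7, the fraction of bytes with that bit set."""
    counts = [0] * 8
    for byte in sample:
        for bit in range(8):
            counts[bit] += (byte >> bit) & 1
    return [c / len(sample) for c in counts]


def bits_look_random(sample: bytes, tolerance: float = 0.1) -> bool:
    """True if every bit position is set in roughly half of the bytes."""
    return all(abs(f - 0.5) <= tolerance for f in bit_frequencies(sample))
```

Balanced bits alone do not prove incompressibility: `b"\x00\xff" * 64` passes this check yet is trivially compressible, so the test is probably best used only to reject data quickly, for example ASCII text, where the top bit is never set.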

Update

I have implemented ‘ASCII checking’, ‘entropy calculation’, and ‘simplified compression’, and all give good results. I want to refine the algorithms, and now my idea is to not only predict if data can be compressed, but also how much it can be compressed. Possibly using a combination of algorithms. Now if I could only accept multiple answers… I will accept the answer that gave the best results.
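
As a rough illustration of estimating how much a chunk can be compressed, the order-0 Shannon entropy of the byte histogram gives an estimated size ratio of entropy / 8. This is a sketch under assumptions, not the implementation referred to above; `estimated_ratio` and its `sample_step` parameter are hypothetical names.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Order-0 Shannon entropy in bits per byte, between 0.0 and 8.0."""
    if not data:
        return 0.0
    total = len(data)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(data).values())


def estimated_ratio(data: bytes, sample_step: int = 1) -> float:
    """Estimate compressed size / original size from every `sample_step`-th byte."""
    return shannon_entropy(data[::sample_step]) / 8.0
```

The estimate ignores repeated substrings, so LZ-style compressors can beat it on repetitive data. It also needs enough bytes to be meaningful: a sample of 128 bytes can never show more than 7 bits of entropy per byte.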

Additional answers (new ideas) are still welcome! If possible, with source code or links 🙂

Update 2

A similar method is now implemented in Linux.

1 Answer

  1. Editorial Team
    Added an answer on May 24, 2026 at 1:10 pm

    From my experience, almost all of the formats that can be compressed effectively are non-binary. So checking whether about 70-80% of the characters are within the [0-127] range should do the trick.

    If you want to do it “properly” (even though I really can’t see a reason to do that), you either have to run (parts of) your compression algorithm on the data or calculate the entropy, as tskuzzy already proposed.
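
The ASCII check suggested in this answer might look like the following Python sketch. The 0.75 threshold is an assumed midpoint of the 70-80% range, and sampling every 16th byte is an added assumption so that only part of the chunk is read; both function names are hypothetical.

```python
def ascii_ratio(data: bytes) -> float:
    """Fraction of bytes in the 7-bit ASCII range [0, 127]."""
    if not data:
        return 0.0
    return sum(1 for b in data if b < 128) / len(data)


def probably_compressible(data: bytes, threshold: float = 0.75, step: int = 16) -> bool:
    """Apply the ASCII-ratio check to every `step`-th byte of the chunk."""
    return ascii_ratio(data[::step]) >= threshold
```

For a 128 KB chunk and `step=16`, this inspects 8192 bytes. Compressible binary formats, such as uncompressed bitmaps, can fail the check and would need one of the other methods.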

