I have a very large file compressed with gzip sitting on disk. The production

Question

Asked: June 17, 2026 (2026-06-17T01:38:33+00:00)

I have a very large file compressed with gzip sitting on disk. The production environment is "Cloud"-based, so the storage performance is terrible, but CPU is fine. Previously, our data processing pipeline began with gzip -dc streaming the data off the disk.

Now, in order to parallelise the work, I want to run multiple pipelines that each take a pair of byte offsets (start and end) and extract that chunk of the file. With a plain file this could be achieved with head and tail, but I'm not sure how to do it efficiently with a compressed file; if I gzip -dc and pipe into head, every offset pair toward the end of the file means wastefully decompressing everything before it just to reach the start offset.
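For reference, this is roughly what each pipeline would do if the file were uncompressed. It is only a sketch of the access pattern I am after; the name read_range is made up for illustration, and offsets are assumed to be zero-based with an exclusive end.

```python
def read_range(path, start, end):
    """Return bytes [start, end) of an uncompressed file.

    Equivalent in spirit to combining tail and head on a plain file:
    seek straight to `start`, then read `end - start` bytes.
    """
    with open(path, "rb") as f:
        f.seek(start)                 # O(1): no earlier bytes are read
        return f.read(end - start)
```

The seek is what gets lost once the file is gzipped: there is no cheap way to map an uncompressed offset to a position in the compressed stream.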

So my question is really about the gzip algorithm – is it theoretically possible to seek to a byte offset in the underlying file or get an arbitrary chunk of it, without the full implications of decompressing the entire file up to that point? If not, how else might I efficiently partition a file for "random" access by multiple processes while minimising the I/O throughput overhead?

1 Answer

Editorial Team · Answer 1 · 2026-06-17T01:38:35+00:00

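You can't do that with a plain gzip file, but you can with bzip2, which is block-based rather than stream-based. A gzip stream is DEFLATE data, in which back-references can point up to 32 KB back into previously decompressed output, so a decoder cannot simply start at an arbitrary point without the data that precedes it. bzip2 compresses the input in independent blocks, and this is how Hadoop splits huge files and reads them in parallel with different mappers in MapReduce. It may make sense to re-compress your files as bz2 so you can take advantage of this; it would be easier than some ad-hoc way of chunking up the files.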

The patches that implement this in Hadoop are here: https://issues.apache.org/jira/browse/HADOOP-4012

Here’s another post on the topic: BZip2 file read in Hadoop

Perhaps browsing the Hadoop source code would give you an idea of how to read bzip2 files by blocks.
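If you control the re-compression step, a simpler variant of the same idea is possible. Note that this is not what Hadoop does (it locates block boundaries inside an ordinary bzip2 stream); it is a minimal sketch that instead writes each chunk as its own independent bz2 stream and keeps a side index of where each one starts. The names compress_chunked and read_range and the 1 MiB chunk size are assumptions made for illustration, and the example works on in-memory bytes to stay short.

```python
import bz2

CHUNK = 1 << 20  # uncompressed bytes per independent stream (assumed value)

def compress_chunked(data, chunk=CHUNK):
    """Compress `data` as concatenated, independent bz2 streams.

    Returns (blob, index), where index[i] is the compressed offset of the
    stream holding uncompressed bytes [i * chunk, (i + 1) * chunk).
    """
    out = bytearray()
    index = []
    for pos in range(0, len(data), chunk):
        index.append(len(out))
        out += bz2.compress(data[pos:pos + chunk])
    return bytes(out), index

def read_range(blob, index, start, end, chunk=CHUNK):
    """Return uncompressed bytes [start, end), touching only the needed streams."""
    if start >= end or not index:
        return b""
    first = start // chunk
    last = min((end - 1) // chunk, len(index) - 1)
    pieces = []
    for i in range(first, last + 1):
        lo = index[i]
        hi = index[i + 1] if i + 1 < len(index) else len(blob)
        # On disk this slice would be a seek to `lo` plus a read of `hi - lo`.
        pieces.append(bz2.decompress(blob[lo:hi]))
    joined = b"".join(pieces)
    offset = start - first * chunk
    return joined[offset:offset + (end - start)]
```

Each worker then reads and decompresses only the streams that overlap its byte range, so the I/O cost is proportional to the chunk it wants rather than to its position in the file. The concatenated output is still a valid multi-stream bz2 file, so a pipeline that reads it from start to finish keeps working.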
