I’m downloading zipped files containing XMLs, and because of latency requirements I’d like to avoid writing the zip files to disk before manipulating them. However, java.util.zip doesn’t suffice for me. There’s no way to say “here’s a byte array of a zip file, use it” without turning it into a stream, and ZipInputStream is not reliable, since it scans for entry headers (see the discussion under EDIT below for why that is not reliable).
I do not yet have access to the zip files I’ll be handling, so I don’t know whether I’ll be able to handle them through the ZipInputStream, and I need to find a solution that will work with any valid ZIP files, as the penalty for a failure once I go into production will be high.
Assuming ZipInputStream won’t work, what can I do to solve this problem in cases where there are no entry headers? I’m using Wikipedia’s definition, which includes a comment on how to correctly uncompress zip files (quoted below), as the standard.
EDIT
The Apache Commons Zip library has a good write-up on some of the problems that using a stream (both their solution and Java’s) has. I’ll further add, from Wikipedia and personal experience, that the size and CRC fields in entry headers may not be filled in (I have files with -1 in these fields). Thanks to centic for providing this link.
Also, let me quote Wikipedia on the subject:
Tools that correctly read zip archives must scan for the signatures of
the various fields, the zip central directory. They must not scan for
entries because only the directory specifies where a file chunk
starts. Scanning could lead to false positives, as the format doesn’t
forbid other data to be between chunks, or uncompressed stream
containing such signatures.
Note that ZipInputStream scans for entries, not the central directory, which is the problem with it.
Final Edit
If anyone is interested, this script can be used to produce a valid ZIP file that cannot be read by ZipInputStream from an existing ZIP file. So, as a final edit to this closed question, I needed a library that can read files such as the ones produced by this script.
EDIT: Another suggestion…
Looking at ZipFile from the Apache Commons implementation, it looks like it wouldn’t be too hard to effectively fork that for your project. Create a wrapper around your byte array which has all the pieces of the RandomAccessFile API which are required (I don’t think there are very many). You’ve already indicated that you prefer the interface to ZipFile, so why not go with that?
We don’t know enough about your project to know whether this opens up any legal questions – and even if you gave details, I doubt that anyone here would be able to give good legal advice – but I suspect it wouldn’t take more than an hour or two to get this solution up and working, and I suspect you’d have reasonable confidence in it.
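The byte-array wrapper suggested above could look roughly like the sketch below. The class name ByteArrayRandomAccess is hypothetical, and the set of methods (seek, read, readFully, length, getFilePointer) is only a guess at what a forked ZipFile would call, so check the ZipFile source for the version you fork before relying on it.

```java
import java.io.EOFException;
import java.io.IOException;

/**
 * Hypothetical in-memory stand-in for the parts of RandomAccessFile
 * that a zip reader is likely to need. Not part of any library.
 */
public class ByteArrayRandomAccess {
    private final byte[] data;
    private int position;

    public ByteArrayRandomAccess(byte[] data) {
        this.data = data;
    }

    public long length() {
        return data.length;
    }

    public long getFilePointer() {
        return position;
    }

    /** Moves the read position; seeking to the very end is allowed. */
    public void seek(long pos) throws IOException {
        if (pos < 0 || pos > data.length) {
            throw new IOException("seek out of range: " + pos);
        }
        position = (int) pos;
    }

    /** Returns the next byte as 0..255, or -1 at end of data. */
    public int read() {
        return position < data.length ? (data[position++] & 0xFF) : -1;
    }

    /** Reads up to len bytes; returns the count read, or -1 at end of data. */
    public int read(byte[] buf, int off, int len) {
        if (len == 0) {
            return 0;
        }
        int remaining = data.length - position;
        if (remaining <= 0) {
            return -1;
        }
        int n = Math.min(len, remaining);
        System.arraycopy(data, position, buf, off, n);
        position += n;
        return n;
    }

    /** Fills buf completely or throws EOFException without moving the position. */
    public void readFully(byte[] buf) throws IOException {
        if (data.length - position < buf.length) {
            throw new EOFException("needed " + buf.length + " bytes");
        }
        System.arraycopy(data, position, buf, 0, buf.length);
        position += buf.length;
    }
}
```

As far as I know, later releases of Commons Compress let ZipFile read from a SeekableByteChannel and ship a SeekableInMemoryByteChannel, which may make the fork unnecessary; verify that against the version you actually have.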
EDIT: This may be a slightly more productive answer…
If you’re worried about the entries not being contiguous, but don’t want to handle all the compression side yourself, you might consider an option where you effectively rewrite the data. Create a new ByteArrayOutputStream, and read the central directory at the end. For each entry in the central directory, write out an entry (header + data) to the output stream in a format that you believe ZipInputStream will be happy with. Then write a new central directory – if you want your replacement to be valid you may need to do this from scratch, but if you’re using code which you know won’t actually read the central directory, you could just provide the original one, ignoring the fact that it might not then be valid. So long as it starts with the right signature, that’s probably good enough 🙂
Once you’ve done that, convert the ByteArrayOutputStream into a new byte[], wrap it in a ByteArrayInputStream and then pass that to ZipInputStream or ZipArchiveInputStream.
Depending on your purposes, you may not even need to do that much – you may be able to just extract each file as you go by creating a “mini” zip file with just the one entry you’re reading from the directory at a time.
This does involve understanding the zip file format, but not completely – just the skeleton, effectively. It’s not a quick and easy fix like using an existing API completely, but it shouldn’t take very long. It doesn’t guarantee it’ll be able to read all invalid files (how could it?) but it will protect you against the “data between entries” issue you seem to be particularly concerned about. Hope it’s at least a useful idea…
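To give a feel for how small that skeleton is, here is a rough sketch that finds the end-of-central-directory record in a byte array, walks the central directory, and pulls out each entry. Note that it takes a shortcut compared with the rewrite described above: instead of producing a new archive for ZipInputStream, it inflates each entry directly with java.util.zip.Inflater. The class name CentralDirectoryReader is made up, and the sketch assumes a single-disk archive without Zip64, without encryption, with UTF-8 compatible names, and with only the stored and deflate methods.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

/** Hypothetical helper: reads entries via the central directory only. */
public class CentralDirectoryReader {
    private static final int EOCD_SIG = 0x06054b50; // end of central directory record
    private static final int CEN_SIG = 0x02014b50;  // central directory file header
    private static final int LOC_SIG = 0x04034b50;  // local file header

    public static Map<String, byte[]> extractAll(byte[] zip) throws IOException {
        ByteBuffer buf = ByteBuffer.wrap(zip).order(ByteOrder.LITTLE_ENDIAN);
        int eocd = findEocd(buf);
        int count = buf.getShort(eocd + 10) & 0xFFFF; // total number of entries
        int pos = buf.getInt(eocd + 16);              // start of central directory
        Map<String, byte[]> result = new LinkedHashMap<>();
        for (int i = 0; i < count; i++) {
            if (buf.getInt(pos) != CEN_SIG) {
                throw new IOException("bad central directory header at " + pos);
            }
            int method = buf.getShort(pos + 10) & 0xFFFF;
            int compSize = buf.getInt(pos + 20); // sizes come from the directory,
            int size = buf.getInt(pos + 24);     // not from the local header
            int nameLen = buf.getShort(pos + 28) & 0xFFFF;
            int extraLen = buf.getShort(pos + 30) & 0xFFFF;
            int commentLen = buf.getShort(pos + 32) & 0xFFFF;
            int locOffset = buf.getInt(pos + 42);
            String name = new String(zip, pos + 46, nameLen, StandardCharsets.UTF_8);
            if (!name.endsWith("/")) { // skip directory entries
                result.put(name, readEntry(buf, zip, locOffset, method, compSize, size));
            }
            pos += 46 + nameLen + extraLen + commentLen;
        }
        return result;
    }

    /** Scans backwards over the last 64 KiB + 22 bytes for the EOCD record. */
    private static int findEocd(ByteBuffer buf) throws IOException {
        int last = buf.capacity() - 22;
        int first = Math.max(0, last - 0xFFFF);
        for (int i = last; i >= first; i--) {
            // Also require the comment length to match, to reduce false positives.
            if (buf.getInt(i) == EOCD_SIG
                    && (buf.getShort(i + 20) & 0xFFFF) == buf.capacity() - i - 22) {
                return i;
            }
        }
        throw new IOException("end of central directory record not found");
    }

    private static byte[] readEntry(ByteBuffer buf, byte[] zip, int locOffset,
                                    int method, int compSize, int size) throws IOException {
        if (buf.getInt(locOffset) != LOC_SIG) {
            throw new IOException("bad local header at " + locOffset);
        }
        // Only the name/extra lengths are taken from the local header.
        int nameLen = buf.getShort(locOffset + 26) & 0xFFFF;
        int extraLen = buf.getShort(locOffset + 28) & 0xFFFF;
        int dataStart = locOffset + 30 + nameLen + extraLen;
        if (method == 0) { // stored
            return Arrays.copyOfRange(zip, dataStart, dataStart + compSize);
        }
        if (method != 8) {
            throw new IOException("unsupported compression method " + method);
        }
        Inflater inflater = new Inflater(true); // true = raw deflate, as used in zip
        try {
            inflater.setInput(zip, dataStart, compSize);
            byte[] out = new byte[size];
            int n = 0;
            while (n < size && !inflater.finished()) {
                int r = inflater.inflate(out, n, size - n);
                if (r == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break;
                }
                n += r;
            }
            if (n != size) {
                throw new IOException("entry data shorter than declared size");
            }
            return out;
        } catch (DataFormatException e) {
            throw new IOException(e);
        } finally {
            inflater.end();
        }
    }
}
```

Because the sizes and offsets come from the central directory, this does not care about data between entries or about local headers whose size and CRC fields were left unfilled. It does not verify CRCs, and a corrupt offset will surface as an unchecked IndexOutOfBoundsException, so treat it as a starting point only.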
Yes there is:
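A minimal sketch of the idea, wrapping the byte array in a ByteArrayInputStream and handing that to ZipInputStream. The class and method names (InMemoryZipReader, readAll) are purely illustrative, and the sketch assumes the archive is one that ZipInputStream can read sequentially.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class InMemoryZipReader {
    /** Reads every file entry of an in-memory zip into a name -> content map. */
    public static Map<String, byte[]> readAll(byte[] zipBytes) throws IOException {
        Map<String, byte[]> contents = new LinkedHashMap<>();
        // No file on disk is involved: the stream reads straight from the array.
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            byte[] buffer = new byte[8192];
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                int n;
                // read() returns -1 at the end of the current entry
                while ((n = zis.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
                contents.put(entry.getName(), out.toByteArray());
            }
        }
        return contents;
    }
}
```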
That leaves the issue of whether ZipInputStream can handle all the zip files you’ll give it – but I wouldn’t write it off quite so quickly.
Of course, there are other APIs available. You may want to look at Apache Commons Compress, for example. Even though ZipFile requires a file, ZipArchiveInputStream doesn’t – so again, you could use a ByteArrayInputStream.
EDIT: It looks like ZipArchiveInputStream doesn’t read from the central directory either. I was hoping it would use markSupported to check beforehand, but it appears not to…
EDIT: In the comments on the question, I asked where you’d read that the zip file doesn’t have to contain entry data. You quoted Wikipedia:
That’s not the same as entry data being optional. It’s saying that there may be extra data in awkward places, not that the entries may be missing completely. It’s basically saying that the entries shouldn’t be assumed to be contiguous. I could happily concede that ZipInputStream may not be reading the central directory at the end of the file, but finding code which does that isn’t the same as finding code which copes with entry data not existing.
You then write:
… which suggests you want code which will handle invalid zip files. Combined with this:
That means you’re asking for code which should handle zip files which are invalid in ways you can’t even predict. Just how invalid would it have to be for you to be able to reject it? If I give you 1000 random bytes, with no attempt for them to be a zip file at all, what on earth would you do with it?
Basically, you need to pin the problem down more tightly before it’s feasible to even say whether a particular library is a valid solution. It’s reasonable to collect a set of zip files from various places, which may be invalid in well-understood ways, and say “I must be able to support all of these.” Later you may need to do some work if it turns out that wasn’t good enough. But to be able to support anything, however broken, simply isn’t a valid requirement.