I am currently trying to create a custom Deflate implementation in C#.
I am working on the “pattern search” part, where I have (up to) 32k of data and need to find the longest possible match for my input.
RFC 1951, which defines Deflate, says this about that process:
The compressor uses a chained hash table to find duplicated strings,
using a hash function that operates on 3-byte sequences. At any
given point during compression, let XYZ be the next 3 input bytes to
be examined (not necessarily all different, of course). First, the
compressor examines the hash chain for XYZ. If the chain is empty,
the compressor simply writes out X as a literal byte and advances one
byte in the input. If the hash chain is not empty, indicating that
the sequence XYZ (or, if we are unlucky, some other 3 bytes with the
same hash function value) has occurred recently, the compressor
compares all strings on the XYZ hash chain with the actual input data
sequence starting at the current point, and selects the longest
match.
I do know what a hash function is, and I know what a HashTable is as well. But what is a “chained hash table”, and how could such a structure be designed to be efficient (in C#) when handling a large amount of data? Unfortunately, I didn’t understand how the structure described in the RFC works.
What kind of hash function could I choose (what would make sense)?
Thank you in advance!
A chained hash table is a hash table that stores every item you put in it, even if the key for 2 items hashes to the same value, or even if 2 items have exactly the same key.
A DEFLATE implementation needs to store a bunch of (key, data) items in no particular order, and rapidly look up a list of all the items with a given key.
In this case, the key is 3 consecutive bytes of uncompressed plaintext, and the data is some sort of pointer or offset to where that 3-byte substring occurs in the plaintext.
Many hashtable/dictionary implementations store both the key and the data for every item.
It’s not necessary to store the key in the table for DEFLATE, but it doesn’t hurt anything other than using slightly more memory during compression.
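To make the structure concrete, here is a minimal C# sketch of one common way to lay it out: two integer arrays, where head maps a hash value to the most recent position with that hash and prev links each position back to the previous position with the same hash. No keys are stored at all. The class name, method names, the 15-bit hash, and the maxChain cutoff are my own assumptions for illustration, not something RFC 1951 prescribes.

```csharp
using System;

// Hypothetical sketch of a chained hash table for Deflate match finding.
public sealed class HashChainMatcher
{
    private const int WindowSize = 32768;            // maximum Deflate match distance
    private const int WindowMask = WindowSize - 1;
    private const int HashSize = 1 << 15;            // assumed table size
    private const int MinMatch = 3;
    private const int MaxMatch = 258;                // maximum Deflate match length

    private readonly byte[] data;
    private readonly int[] head = new int[HashSize];   // hash -> newest position, or -1
    private readonly int[] prev = new int[WindowSize]; // position -> older position with same hash, or -1

    public HashChainMatcher(byte[] data)
    {
        this.data = data;
        for (int i = 0; i < head.Length; i++) head[i] = -1;
        for (int i = 0; i < prev.Length; i++) prev[i] = -1;
    }

    // Assumed hash over the 3 bytes starting at pos.
    private int Hash(int pos) =>
        ((data[pos] << 10) ^ (data[pos + 1] << 5) ^ data[pos + 2]) & (HashSize - 1);

    // Records the 3-byte string starting at pos. Requires pos + 2 < data.Length.
    public void Insert(int pos)
    {
        int h = Hash(pos);
        prev[pos & WindowMask] = head[h];  // link to the previous chain head
        head[h] = pos;                     // pos becomes the new chain head
    }

    // Walks the chain for the string at pos. Call this before Insert(pos).
    // Returns (0, 0) when no match of at least MinMatch bytes is found.
    public (int Length, int Distance) FindLongestMatch(int pos, int maxChain = 128)
    {
        if (pos + MinMatch > data.Length) return (0, 0);
        int bestLen = 0, bestDist = 0;
        int limit = Math.Min(MaxMatch, data.Length - pos);
        int candidate = head[Hash(pos)];
        while (candidate >= 0 && pos - candidate <= WindowSize && maxChain-- > 0)
        {
            // Compare byte by byte; this also filters out hash collisions.
            int len = 0;
            while (len < limit && data[candidate + len] == data[pos + len]) len++;
            if (len > bestLen) { bestLen = len; bestDist = pos - candidate; }
            candidate = prev[candidate & WindowMask];
        }
        return bestLen >= MinMatch ? (bestLen, bestDist) : (0, 0);
    }
}
```

For the input "abcabcabcd", inserting positions 0, 1 and 2 and then calling FindLongestMatch(3) yields a match of length 6 at distance 3. Lowering maxChain trades compression ratio for speed, since fewer candidates are examined.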
Some hashtable/dictionary implementations, such as the C++ STL unordered_map, insist that every (key, data) item they store must have a unique key. When you try to store another (key, data) item with the same key as some older item already in the table, these implementations keep only one item per key, typically replacing the old item with the new one.
That does hurt: if you accidentally use the C++ STL unordered_map or a similar implementation, your compressed file will be larger than if you had used a more appropriate library such as the C++ STL hash_multimap (standardized in C++11 as unordered_multimap).
Such an error may be difficult to detect, since the resulting (unnecessarily large) compressed files can be correctly decompressed by any standard DEFLATE decompressor to a file bit-for-bit identical to the original file.
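The same pitfall exists in C#: assigning through the indexer of a Dictionary<TKey, TValue> overwrites the value stored under an existing key (and Add throws on a duplicate key). The sketch below contrasts the two behaviours; the class and method names are hypothetical, and mapping each key to a List<int> of positions is just one simple way to get multimap behaviour.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical demo: unique-key table versus multimap-style table.
public static class MultiMapDemo
{
    // Packs three bytes into one int so it can serve as a dictionary key.
    public static int Key(byte x, byte y, byte z) => (x << 16) | (y << 8) | z;

    // Unique-key table: the indexer overwrites, so only the latest position survives.
    public static Dictionary<int, int> BuildLastOnly(byte[] data)
    {
        var table = new Dictionary<int, int>();
        for (int i = 0; i + 2 < data.Length; i++)
            table[Key(data[i], data[i + 1], data[i + 2])] = i;
        return table;
    }

    // Multimap-style table: every position is kept, oldest first.
    public static Dictionary<int, List<int>> BuildAll(byte[] data)
    {
        var table = new Dictionary<int, List<int>>();
        for (int i = 0; i + 2 < data.Length; i++)
        {
            int key = Key(data[i], data[i + 1], data[i + 2]);
            if (!table.TryGetValue(key, out var positions))
                table[key] = positions = new List<int>();
            positions.Add(i);
        }
        return table;
    }
}
```

For the input "abcabcabc", BuildLastOnly remembers only position 6 for "abc", while BuildAll keeps positions 0, 3 and 6, so a match finder can still consider all earlier occurrences that lie within the window.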
A few implementations of DEFLATE and other compression algorithms deliberately use such a one-item-per-key table, sacrificing compressed file size in order to gain compression speed.
As Nick Johnson said, the default hash function used in your standard “hashtable” or “dictionary” implementation is probably more than adequate.
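If you would rather index plain arrays than rely on a dictionary's built-in hashing, here are two simple options, sketched under my own assumptions (the names, the 15-bit table size and the shift of 5 are illustrative choices, not requirements of RFC 1951). Pack turns the three bytes into a collision-free 24-bit key, which works well as a dictionary key. Update is a shift-and-xor rolling hash, similar in spirit to what some Deflate implementations use, that folds in one byte at a time so the hash of the next position costs a single step.

```csharp
// Hypothetical helpers for hashing 3-byte sequences.
public static class ThreeByteHash
{
    public const int HashBits = 15;                  // assumed table size: 32768 buckets
    public const int HashMask = (1 << HashBits) - 1;
    private const int HashShift = 5;                 // three shifts of 5 bits push the oldest byte out

    // Collision-free 24-bit key; suitable as a Dictionary<int, ...> key.
    public static int Pack(byte x, byte y, byte z) => (x << 16) | (y << 8) | z;

    // Rolling update: folds one more byte into the running hash.
    public static int Update(int hash, byte next) => ((hash << HashShift) ^ next) & HashMask;

    // Hash of a full 3-byte sequence, built from three rolling updates.
    public static int Hash(byte x, byte y, byte z) => Update(Update(Update(0, x), y), z);
}
```

Because each update shifts the oldest byte's contribution out of the 15-bit range, Update(Hash(x, y, z), w) equals Hash(y, z, w), which is what makes the hash cheap to maintain while scanning the input.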
http://en.wikipedia.org/wiki/Hashtable#Separate_chaining