In Huffman coding, decompression requires comparing the bitstream against a set of prefix-free codes. I’m trying to implement a Huffman encoder/decoder in Python, and this is my code to convert the bitstream into ASCII values.
    c = ''
    l = 0
    x = 1
    stime = time.time()
    while l<len(string):
        if string[l:l+x] in table:
            c+=table[string[l:l+x]]
            l+=x
            x = 1
        else:
            x+=1
What could I do to make this loop more efficient?
Fast:
First off, make sure that you have built a canonical Huffman code, where the shorter codes come numerically before the longer codes. This is done easily by first describing your Huffman code as simply the number of bits for each symbol. Then assign Huffman codes to the shortest codes in symbol order, then the next shortest codes in symbol order, and so on. E.g.:

    Symbol  Bits
    A       2
    B       4
    C       3
    D       3
    E       2
    F       3
    G       4

Sort by bits, retaining sort by symbol:

    Symbol  Bits
    A       2
    E       2
    C       3
    D       3
    F       3
    B       4
    G       4

Assign Huffman codes, starting with zero:

    Symbol  Bits  Code
    A       2     00
    E       2     01
    C       3     100
    D       3     101
    F       3     110
    B       4     1110
    G       4     1111
This canonical approach provides a compact means of transmitting a Huffman code from the compressor to the decompressor since you don’t have to send the actual codes or a tree — just the code lengths for each symbol. Then the code can be built as above on the other end.
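The construction described above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `canonical_codes` and the `lengths` dictionary are hypothetical, and the code lengths are an illustrative set for seven symbols A to G.

```python
def canonical_codes(lengths):
    """Map each symbol to (code, bits), given a dict of symbol -> code length."""
    # Sort by code length first, then by symbol.
    ordered = sorted(lengths, key=lambda s: (lengths[s], s))
    codes = {}
    code = 0
    prev_bits = lengths[ordered[0]]
    for sym in ordered:
        bits = lengths[sym]
        # Stepping up to a longer length appends zero bits to the running code.
        code <<= bits - prev_bits
        prev_bits = bits
        codes[sym] = (code, bits)
        code += 1
    return codes


# Illustrative code lengths (an assumption made for this example).
lengths = {"A": 2, "B": 4, "C": 3, "D": 3, "E": 2, "F": 3, "G": 4}
for sym, (code, bits) in canonical_codes(lengths).items():
    print(sym, format(code, f"0{bits}b"))
```

With these lengths the loop prints A 00, E 01, C 100, D 101, F 110, B 1110, G 1111, so only the seven lengths need to be transmitted for the decoder to rebuild the same codes.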
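Now we create decoding tables: a symbol table, Symbol[] = "AECDFBG", and a code index table that gives, for each code length, the first code of that length and the position of its symbol in Symbol[]:

    Bits  Start  Index
    2     0      0
    3     4      2
    4     14     5

Now to decode, you can loop over the code lengths from 2 to 4 bits, and see if your code is less than the starting code of the next bit size (shifted down one bit so that both have the same length). We pull four bits off the stream and call it nyb (if there aren’t four more bits on the stream, just pad with zero bits to fill it out). In pseudocode, using ifs instead of a loop, where >> means shift bits down:

    if (nyb >> 2) < 2:
        output Symbol[(nyb >> 2) - 0 + 0], remove 2 bits from the stream
    else if (nyb >> 1) < 7:
        output Symbol[(nyb >> 1) - 4 + 2], remove 3 bits from the stream
    else:
        output Symbol[nyb - 14 + 5], remove 4 bits from the stream

The comparison values 2 and 7 are the start codes of the next length, 4 and 14, shifted down one bit. It should be pretty easy to see how to turn that into a loop, going from the shortest code length to the second longest code length, and if you come out without finding it, it must be the longest code length.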
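As a rough sketch of that loop in Python: the names `build_tables` and `decode` are hypothetical, the bitstream is assumed to be a string of '0' and '1' characters as in the question, and the Huffman code is assumed to be complete (every bit pattern decodes to some symbol).

```python
def build_tables(lengths):
    """Build the symbol table plus per-length start code and symbol index."""
    symbols = sorted(lengths, key=lambda s: (lengths[s], s))
    min_bits = min(lengths.values())
    max_bits = max(lengths.values())
    start = {}  # first canonical code of each length
    index = {}  # position in `symbols` of the first symbol of each length
    code = 0
    pos = 0
    for bits in range(min_bits, max_bits + 1):
        count = sum(1 for s in symbols if lengths[s] == bits)
        start[bits] = code
        index[bits] = pos
        code = (code + count) << 1  # first code of the next length
        pos += count
    return symbols, start, index, min_bits, max_bits


def decode(bitstring, lengths):
    symbols, start, index, min_bits, max_bits = build_tables(lengths)
    out = []
    pos = 0
    while pos < len(bitstring):
        # Pull max_bits bits, padding with zeros past the end of the stream.
        window = int(bitstring[pos:pos + max_bits].ljust(max_bits, "0"), 2)
        for bits in range(min_bits, max_bits):
            code = window >> (max_bits - bits)
            # Compare with the next length's start code, shifted down one bit.
            if code < (start[bits + 1] >> 1):
                break
        else:
            # Not found among the shorter lengths, so it is the longest one.
            bits = max_bits
            code = window
        out.append(symbols[index[bits] + code - start[bits]])
        pos += bits
    return "".join(out)


lengths = {"A": 2, "B": 4, "C": 3, "D": 3, "E": 2, "F": 3, "G": 4}
print(decode("001110100101011101111", lengths))  # prints ABCDEFG
```

Converting a string slice to an integer on every symbol is done here only for readability; a real decoder would more likely keep the pending bits in an integer buffer.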
Faster:
Build a lookup table whose length is 2**(length of the longest code). Each entry of the table contains the number of bits in the code, and the resulting symbol. You take that many bits of the bitstream to use as an index. Again, if the bitstream doesn’t have that many bits left, then fill out with zeros. Then you simply output the symbol from that indexed entry and remove the number of bits in that indexed entry from the bitstream (which may be less than the number of bits you pulled for the index — make sure that you leave the unused bits in the bitstream). Repeat, where now you are pulling off the first unused bits from the remaining bitstream.
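A single-level table of this kind might look like the sketch below. The names `build_lookup` and `decode` are hypothetical, the bitstream is again assumed to be a string of '0' and '1' characters, and the code is assumed to be canonical and complete so that every table entry gets filled.

```python
def build_lookup(lengths):
    """Return (table, max_bits); table[i] is (bits, symbol) for index i."""
    symbols = sorted(lengths, key=lambda s: (lengths[s], s))
    max_bits = max(lengths.values())
    table = [None] * (1 << max_bits)
    code = 0
    prev_bits = lengths[symbols[0]]
    for sym in symbols:
        bits = lengths[sym]
        code <<= bits - prev_bits
        prev_bits = bits
        # Every index whose leading `bits` bits equal this code gets the symbol.
        first = code << (max_bits - bits)
        for i in range(first, first + (1 << (max_bits - bits))):
            table[i] = (bits, sym)
        code += 1
    return table, max_bits


def decode(bitstring, lengths):
    table, max_bits = build_lookup(lengths)
    out = []
    pos = 0
    while pos < len(bitstring):
        # Index with max_bits bits, padding with zeros at the end of the stream.
        chunk = bitstring[pos:pos + max_bits].ljust(max_bits, "0")
        bits, sym = table[int(chunk, 2)]
        out.append(sym)
        pos += bits  # consume only the bits that belong to this code
    return "".join(out)


lengths = {"A": 2, "B": 4, "C": 3, "D": 3, "E": 2, "F": 3, "G": 4}
print(decode("001110100101011101111", lengths))  # prints ABCDEFG
```

Each symbol now costs one table lookup regardless of its code length, at the price of filling 2**max_bits entries up front.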
At the next level of sophistication, you can do what zlib does. If the longest code is relatively long (in zlib it can be up to 15 bits), the time you take to make the table may not pay for the time saved in decoding, as compared to the following approach. Have a two-level table, where the first level table covers up to n bits, where n is less than the longest code. (In zlib, the optimal choice turns out to be n == 9 for a 15-bit code.) Then if the code is n bits or less, the table entry provides the symbol and number of bits, and you proceed as above. If the code is more than n bits, then you go to a sub-table for that n-bit value that processes the remaining bits, again as above. That table entry indicates how many bits to pull for the sub-table, and defines the size of that sub-table, call it k. You delete the top n bits from the stream and pull the next k bits and use that as an index to the sub-table. Then you get the symbol and number of remaining bits in the code and proceed as in the single-level table. Note that n+k is not necessarily the length of the longest code for each sub-table, since that sub-table may only cover shorter codes. Only the last one or a few sub-tables will have n+k equal to the length of the longest code.

This can be quite fast since, by the construction of the Huffman code, the shorter codes are much more likely. Most of the time you’ll get the symbol at the first level, and only occasionally have to go the second level. The total number of table entries to fill in in the main table and all of the sub-tables can be much less than the number of entries in a big table that covers the full code length. The time spent preparing to decode is then reduced.
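The two-level idea can be sketched as below. This is a simplified illustration and not zlib's actual table layout: the names `build_two_level`, `decode`, and `root_bits` (standing in for n) are hypothetical, entries are plain Python tuples, the bitstream is a string of '0' and '1' characters, and the code is assumed to be canonical and complete.

```python
def build_two_level(lengths, root_bits):
    """Root entries are (bits, symbol, None) or (k, None, sub_table)."""
    symbols = sorted(lengths, key=lambda s: (lengths[s], s))
    codes = []
    code = 0
    prev_bits = lengths[symbols[0]]
    for sym in symbols:
        bits = lengths[sym]
        code <<= bits - prev_bits
        prev_bits = bits
        codes.append((sym, code, bits))
        code += 1

    # The longest code under each root prefix sets that sub-table's size.
    longest = {}
    for sym, code, bits in codes:
        if bits > root_bits:
            prefix = code >> (bits - root_bits)
            longest[prefix] = max(longest.get(prefix, 0), bits)

    root = [None] * (1 << root_bits)
    for prefix, bits in longest.items():
        k = bits - root_bits
        root[prefix] = (k, None, [None] * (1 << k))

    for sym, code, bits in codes:
        if bits <= root_bits:
            first = code << (root_bits - bits)
            for i in range(first, first + (1 << (root_bits - bits))):
                root[i] = (bits, sym, None)
        else:
            extra = bits - root_bits  # bits left after the root prefix
            k, _, sub = root[code >> extra]
            first = (code & ((1 << extra) - 1)) << (k - extra)
            for i in range(first, first + (1 << (k - extra))):
                sub[i] = (extra, sym)
    return root


def decode(bitstring, lengths, root_bits):
    root = build_two_level(lengths, root_bits)
    out = []
    pos = 0
    while pos < len(bitstring):
        idx = int(bitstring[pos:pos + root_bits].ljust(root_bits, "0"), 2)
        bits, sym, sub = root[idx]
        if sub is not None:
            # Long code: drop the root bits, then index the sub-table with k bits.
            pos += root_bits
            idx = int(bitstring[pos:pos + bits].ljust(bits, "0"), 2)
            bits, sym = sub[idx]
        out.append(sym)
        pos += bits
    return "".join(out)


lengths = {"A": 2, "B": 4, "C": 3, "D": 3, "E": 2, "F": 3, "G": 4}
print(decode("001110100101011101111", lengths, root_bits=2))  # prints ABCDEFG
```

With `root_bits=2` on this small code, the root prefix 10 gets a one-bit sub-table (covering C and D) while prefix 11 gets a two-bit sub-table (covering F, B, and G), which illustrates that n+k differs between sub-tables.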
If you have even longer Huffman codes (e.g. 32 bits), you can have more levels of sub-tables. It takes some experimentation to determine the optimal breakpoints for sub-tables, which will depend on how often a new code is sent and tables have to be built.