In huffman coding for decompression you have to compare a bitstream to several values(prefix

Asked: June 10, 2026

In Huffman coding, decompression requires comparing the bitstream against several prefix-free code values. I'm trying to implement a Huffman coder/decoder in Python, and this is my code to convert the bitstream into ASCII values.

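import time

c = ''   # decoded output
l = 0    # start of the current candidate code in the bit string
x = 1    # length of the current candidate code
stime = time.time()
while l < len(string):
    # grow the candidate one bit at a time until it matches a code in the table
    if string[l:l+x] in table:
        c += table[string[l:l+x]]
        l += x
        x = 1
    else:
        x += 1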

What could I do to make this loop more efficient?

Editorial Team · Answer 1 · 2026-06-10T22:48:59+00:00

Fast:

First off, make sure that you have built a canonical Huffman code, where the shorter codes come numerically before the longer codes. This is done easily by first describing your Huffman code as simply the number of bits for each symbol. Then assign Huffman codes to the shortest codes in symbol order, then the next shortest codes in symbol order, and so on. E.g.

Symbol   Bits
  A        2
  B        4
  C        3
  D        3
  E        2
  F        3
  G        4

Sort by bits, retaining sort by symbol:

Symbol   Bits
  A        2
  E        2
  C        3
  D        3
  F        3
  B        4
  G        4

Assign Huffman codes, starting with zero:

Symbol   Bits    Code
  A        2     00
  E        2     01
  C        3     100
  D        3     101
  F        3     110
  B        4     1110
  G        4     1111

This canonical approach provides a compact means of transmitting a Huffman code from the compressor to the decompressor since you don’t have to send the actual codes or a tree — just the code lengths for each symbol. Then the code can be built as above on the other end.
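The assignment described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name canonical_codes and the dict-based input are hypothetical, and the codes are returned as bit strings for readability rather than speed.

```python
def canonical_codes(lengths):
    """Map each symbol to its canonical code, written as a bit string.

    `lengths` maps symbol -> code length in bits (hypothetical input format).
    """
    codes = {}
    code = 0
    prev_len = 0
    # Shorter codes first; ties are broken by symbol order.
    for sym, n in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= n - prev_len  # append zero bits when the length grows
        codes[sym] = format(code, "0{}b".format(n))
        code += 1
        prev_len = n
    return codes


lengths = {"A": 2, "B": 4, "C": 3, "D": 3, "E": 2, "F": 3, "G": 4}
codes = canonical_codes(lengths)
# codes["A"] == "00", codes["C"] == "100", codes["G"] == "1111"
```

Only the lengths dict needs to be sent to the decompressor, which can call the same function to rebuild the codes.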

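Now we create two decoding tables: a symbol table, Symbol[] = "AECDFBG", and a code index table, which gives for each code length the first code of that length (left-aligned to four bits) and the index of its symbol in Symbol[]: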

Bits    Start     Index
  2    0000 (0)     0
  3    1000 (8)     2
  4    1110 (14)    5

Now to decode, you can loop from 2 to 4 bits, and see if your code is less than the starting code of the next bit size. We pull four bits off the stream and call it nyb (if there are fewer than four bits left on the stream, pad with zero bits to fill it out). In pseudocode, using ifs instead of a loop, where >> means shift bits down:

if nyb < Start[Bits are 3] (= 8) then
    output Symbol[Index[Bits are 2] (= 0) + ((nyb - Start[Bits are 2] (= 0)) >> 2)]
    remove top two bits from bitstream
else if nyb < Start[Bits are 4] (= 14) then
    output Symbol[Index[Bits are 3] (= 2) + ((nyb - Start[Bits are 3] (= 8)) >> 1)]
    remove top three bits from bitstream
else (must be four bits)
    output Symbol[Index[Bits are 4] (= 5) + (nyb - Start[Bits are 4] (= 14))]
    remove top four bits from bitstream

It should be pretty easy to see how to turn that into a loop, going from the shortest code length to the second longest code length, and if you come out without finding it, it must be the longest code length.
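As a rough sketch of that loop in Python, assuming the bitstream is a string of '0' and '1' characters as in the question: the names build_tables and decode are hypothetical, and the (bits, start, index) rows mirror the code index table above.

```python
def build_tables(lengths):
    """Return (symbols, rows, max_len); each row is (bits, start, index).

    `start` is the first code of that length, left-aligned to max_len bits,
    and `index` is the position of its symbol in `symbols`.
    """
    order = sorted(lengths.items(), key=lambda kv: (kv[1], kv[0]))
    symbols = "".join(sym for sym, _ in order)
    max_len = order[-1][1]
    rows = []
    code = 0
    prev_len = 0
    for i, (_, n) in enumerate(order):
        code <<= n - prev_len
        if n != prev_len:  # first code of a new length
            rows.append((n, code << (max_len - n), i))
        code += 1
        prev_len = n
    return symbols, rows, max_len


def decode(bits, symbols, rows, max_len):
    out = []
    pos = 0
    while pos < len(bits):
        # Peek max_len bits, padding with zeros past the end of the stream.
        window = int(bits[pos:pos + max_len].ljust(max_len, "0"), 2)
        # Try each length from the shortest; the last row is the fallback.
        for j, (n, start, index) in enumerate(rows):
            if j + 1 == len(rows) or window < rows[j + 1][1]:
                out.append(symbols[index + ((window - start) >> (max_len - n))])
                pos += n
                break
    return "".join(out)


lengths = {"A": 2, "B": 4, "C": 3, "D": 3, "E": 2, "F": 3, "G": 4}
symbols, rows, max_len = build_tables(lengths)
text = decode("001110100101011101111", symbols, rows, max_len)  # "ABCDEFG"
```

The inner loop runs at most once per distinct code length, instead of once per bit with a string slice and dictionary lookup each time.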

Faster:

Build a lookup table whose length is 2**(length of the longest code). Each entry of the table contains the number of bits in the code, and the resulting symbol. You take that many bits of the bitstream to use as an index. Again, if the bitstream doesn’t have that many bits left, then fill out with zeros. Then you simply output the symbol from that indexed entry and remove the number of bits in that indexed entry from the bitstream (which may be less than the number of bits you pulled for the index — make sure that you leave the unused bits in the bitstream). Repeat, where now you are pulling off the first unused bits from the remaining bitstream.
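A minimal sketch of the single-table approach, again assuming a '0'/'1' string as the bitstream and a complete code (every table slot gets filled). The names build_lookup and decode are hypothetical.

```python
def build_lookup(lengths):
    """Return (table, max_len); table[i] is (bits, symbol) for every
    max_len-bit value i whose leading bits are that symbol's code."""
    order = sorted(lengths.items(), key=lambda kv: (kv[1], kv[0]))
    max_len = order[-1][1]
    table = [None] * (1 << max_len)
    code = 0
    prev_len = 0
    for sym, n in order:
        code <<= n - prev_len
        first = code << (max_len - n)
        # Every index that begins with this code maps to the same entry.
        for i in range(first, first + (1 << (max_len - n))):
            table[i] = (n, sym)
        code += 1
        prev_len = n
    return table, max_len


def decode(bits, table, max_len):
    out = []
    pos = 0
    while pos < len(bits):
        # Index with max_len bits, padding with zeros at the end of the stream.
        index = int(bits[pos:pos + max_len].ljust(max_len, "0"), 2)
        n, sym = table[index]
        out.append(sym)
        pos += n  # consume only the bits that belong to this code
    return "".join(out)


lengths = {"A": 2, "B": 4, "C": 3, "D": 3, "E": 2, "F": 3, "G": 4}
table, max_len = build_lookup(lengths)
text = decode("001110100101011101111", table, max_len)  # "ABCDEFG"
```

Each symbol now costs one table lookup regardless of its code length. A real decoder would keep the bits in an integer buffer rather than slicing a string, but the table logic is the same.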

At the next level of sophistication, you can do what zlib does. If the longest code is relatively long (in zlib it can be up to 15 bits), the time you take to make the table may not pay for the time saved in decoding, as compared to the following approach. Have a two-level table, where the first level table covers up to n bits where n is less than the longest code. (In zlib, the optimal choice turns out to be n == 9 for a 15-bit code.) Then if the code is n bits or less, the table entry provides the symbol and number of bits, and you proceed as above. If the code is more than n bits, then you go to a sub-table for that n-bit value that processes the remaining bits, again as above. That table entry indicates how many bits to pull for the sub-table, and defines the size of that sub-table, call it k. You delete the top n bits from the stream and pull the next k bits and use that as an index to the sub-table. Then you get the symbol and number of remaining bits in the code and proceed as in the single-level table. Note that n+k is not necessarily the length of the longest code for each sub-table, since that sub-table may only cover shorter codes. Only the last one or a few sub-tables will have n+k equal to the length of the longest code.

This can be quite fast since, by the construction of the Huffman code, the shorter codes are much more likely. Most of the time you'll get the symbol at the first level, and only occasionally have to go to the second level. The total number of table entries to fill in across the main table and all of the sub-tables can be much less than the number of entries in a big table that covers the full code length. The time spent preparing to decode is then reduced.
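The two-level idea can be sketched as follows. This is not zlib's actual table layout, just an illustration: the names build_two_level, peek and decode are hypothetical, root entries are (bits, symbol, sub) tuples chosen for this example, and the bitstream is again a '0'/'1' string for a complete code.

```python
def build_two_level(lengths, n):
    """Root table indexed by n bits. An entry is (bits, symbol, None) for a
    code of at most n bits, or (k, None, sub) pointing at a 2**k sub-table
    whose entries are (remaining_bits, symbol)."""
    order = sorted(lengths.items(), key=lambda kv: (kv[1], kv[0]))
    root = [None] * (1 << n)
    long_codes = {}  # n-bit prefix -> list of (code, length, symbol)
    code = 0
    prev_len = 0
    for sym, length in order:
        code <<= length - prev_len
        if length <= n:
            first = code << (n - length)
            for i in range(first, first + (1 << (n - length))):
                root[i] = (length, sym, None)
        else:
            prefix = code >> (length - n)
            long_codes.setdefault(prefix, []).append((code, length, sym))
        code += 1
        prev_len = length
    for prefix, group in long_codes.items():
        # k covers the longest code under this prefix, not the longest overall.
        k = max(item[1] for item in group) - n
        sub = [None] * (1 << k)
        for long_code, length, sym in group:
            rest = length - n
            tail = long_code & ((1 << rest) - 1)
            first = tail << (k - rest)
            for i in range(first, first + (1 << (k - rest))):
                sub[i] = (rest, sym)
        root[prefix] = (k, None, sub)
    return root


def peek(bits, pos, count):
    """Read `count` bits at `pos` as an integer, zero-padded past the end."""
    return int(bits[pos:pos + count].ljust(count, "0"), 2)


def decode(bits, root, n):
    out = []
    pos = 0
    while pos < len(bits):
        nbits, sym, sub = root[peek(bits, pos, n)]
        if sub is not None:
            pos += n  # drop the n prefix bits, then index the sub-table
            nbits, sym = sub[peek(bits, pos, nbits)]
        out.append(sym)
        pos += nbits
    return "".join(out)


lengths = {"A": 2, "B": 4, "C": 3, "D": 3, "E": 2, "F": 3, "G": 4}
root = build_two_level(lengths, 2)
text = decode("001110100101011101111", root, 2)  # "ABCDEFG"
```

With n = 2 on the example code, the sub-table under prefix 10 has k = 1 while the one under prefix 11 has k = 2, which shows that n+k only reaches the longest code length for some sub-tables.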

If you have even longer Huffman codes (e.g. 32 bits), you can have more levels of sub-tables. It takes some experimentation to determine the optimal breakpoints for sub-tables, which will depend on how often a new code is sent and tables have to be built.
