The Archive Base

Editorial Team
Asked: May 11, 2026 (2026-05-11T22:46:54+00:00)

I have an application where I am reading and writing small blocks of data


I have an application where I am reading and writing small blocks of data (a few hundred bytes) hundreds of millions of times. I’d like to generate a compression dictionary based on an example data file and use that dictionary forever as I read and write the small blocks. I’m leaning toward the LZW compression algorithm. The Wikipedia page (http://en.wikipedia.org/wiki/Lempel-Ziv-Welch) lists pseudocode for compression and decompression. It looks fairly straightforward to modify it such that the dictionary creation is a separate block of code. So I have two questions:

  1. Am I on the right track or is there a better way?
  2. Why does the LZW algorithm add to the dictionary during the decompression step? Can I omit that, or would I lose efficiency in my dictionary?

Thanks.

Update: Now I’m thinking the ideal case would be to find a library that lets me store the dictionary separately from the compressed data. Does anything like that exist?

Update: I ended up taking the code at http://www.enusbaum.com/blog/2009/05/22/example-huffman-compression-routine-in-c and adapting it. I am Chris in the comments on that page. I emailed my mods back to that blog author, but I haven’t heard back yet. The compression rates I’m seeing with that code are not at all impressive. Maybe that is due to the 8-bit tree size.

Update: I converted it to 16 bits and the compression is better. It’s also much faster than the original code.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;

namespace Book.Core
{
  public class Huffman16
  {
    private readonly double log2 = Math.Log(2);

    private List<Node> HuffmanTree = new List<Node>();

    internal class Node
    {
      public long Frequency { get; set; }
      public byte Uncoded0 { get; set; }
      public byte Uncoded1 { get; set; }
      public uint Coded { get; set; }
      public int CodeLength { get; set; }
      public Node Left { get; set; }
      public Node Right { get; set; }

      public bool IsLeaf
      {
        get { return Left == null; }
      }

      public override string ToString()
      {
        //Pad to 32 digits so that codes longer than 8 bits can be displayed
        var coded = Convert.ToString(Coded, 2).PadLeft(32, '0');
        return string.Format("Uncoded={0}, Coded={1}, Frequency={2}", (Uncoded1 << 8) | Uncoded0, coded.Substring(coded.Length - CodeLength), Frequency);
      }
    }

    public Huffman16(long[] frequencies)
    {
      if (frequencies.Length != ushort.MaxValue + 1)
      {
        throw new ArgumentException("frequencies.Length must equal " + (ushort.MaxValue + 1));
      }
      BuildTree(frequencies);
      EncodeTree(HuffmanTree[HuffmanTree.Count - 1], 0, 0);
    }

    public static long[] GetFrequencies(byte[] sampleData, bool safe)
    {
      if (sampleData.Length % 2 != 0)
      {
        throw new ArgumentException("sampleData.Length must be a multiple of 2.");
      }
      var histogram = new long[ushort.MaxValue + 1];
      if (safe)
      {
        for (int i = 0; i <= ushort.MaxValue; i++)
        {
          histogram[i] = 1;
        }
      }
      for (int i = 0; i < sampleData.Length; i += 2)
      {
        histogram[(sampleData[i] << 8) | sampleData[i + 1]] += 1000;
      }
      return histogram;
    }

    public byte[] Encode(byte[] plainData)
    {
      if (plainData.Length % 2 != 0)
      {
        throw new ArgumentException("plainData.Length must be a multiple of 2.");
      }

      Int64 iBuffer = 0;
      int iBufferCount = 0;

      using (MemoryStream msEncodedOutput = new MemoryStream())
      {
        //Write Final Output Size 1st
        msEncodedOutput.Write(BitConverter.GetBytes(plainData.Length), 0, 4);

        //Begin Writing Encoded Data Stream
        iBuffer = 0;
        iBufferCount = 0;
        for (int i = 0; i < plainData.Length; i += 2)
        {
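          Node FoundLeaf = HuffmanTree[(plainData[i] << 8) | plainData[i + 1]];
          if (FoundLeaf.CodeLength == 0)
          {
            //A symbol with frequency 0 never entered the tree, so it has no code
            throw new ArgumentException("plainData contains a symbol that has no Huffman code.");
          }

          //How many bits are we adding?
          iBufferCount += FoundLeaf.CodeLength;

          //Shift the buffer
          iBuffer = (iBuffer << FoundLeaf.CodeLength) | FoundLeaf.Coded;

          //Are there at least 8 bits in the buffer?
          while (iBufferCount > 7)
          {
            //Write the top 8 buffered bits to output
            msEncodedOutput.WriteByte((byte)(iBuffer >> (iBufferCount - 8)));
            iBufferCount -= 8;
            //Keep only the bits not yet written. The 64-bit mask avoids the
            //int overflow that the previous shift-and-XOR hit with long codes.
            iBuffer &= (1L << iBufferCount) - 1;
          }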
        }

        //Write remaining bits in buffer
        if (iBufferCount > 0)
        {
          iBuffer = iBuffer << (8 - iBufferCount);
          msEncodedOutput.WriteByte((byte)iBuffer);
        }
        return msEncodedOutput.ToArray();
      }
    }

    public byte[] Decode(byte[] bInput)
    {
      long iInputBuffer = 0;
      int iBytesWritten = 0;

      //Establish Output Buffer to write unencoded data to
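      byte[] bDecodedOutput = new byte[BitConverter.ToInt32(bInput, 0)];
      if (bDecodedOutput.Length == 0)
      {
        //Nothing was encoded, so there is nothing to decode
        return bDecodedOutput;
      }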

      var current = HuffmanTree[HuffmanTree.Count - 1];

      //Begin Looping through Input and Decoding
      iInputBuffer = 0;
      for (int i = 4; i < bInput.Length; i++)
      {
        iInputBuffer = bInput[i];

        for (int bit = 0; bit < 8; bit++)
        {
          if ((iInputBuffer & 128) == 0)
          {
            current = current.Left;
          }
          else
          {
            current = current.Right;
          }
          if (current.IsLeaf)
          {
            bDecodedOutput[iBytesWritten++] = current.Uncoded1;
            bDecodedOutput[iBytesWritten++] = current.Uncoded0;
            if (iBytesWritten == bDecodedOutput.Length)
            {
              return bDecodedOutput;
            }
            current = HuffmanTree[HuffmanTree.Count - 1];
          }
          iInputBuffer <<= 1;
        }
      }
      throw new InvalidDataException("Encoded data ended before the expected number of bytes was decoded.");
    }

    private static void EncodeTree(Node node, int depth, uint value)
    {
      if (node != null)
      {
        if (node.IsLeaf)
        {
          node.CodeLength = depth;
          node.Coded = value;
        }
        else
        {
          depth++;
          value <<= 1;
          EncodeTree(node.Left, depth, value);
          EncodeTree(node.Right, depth, value | 1);
        }
      }
    }

    private void BuildTree(long[] frequencies)
    {
      var tiny = 0.1 / ushort.MaxValue;
      var fraction = 0.0;

      SortedDictionary<double, Node> trees = new SortedDictionary<double, Node>();
      for (int i = 0; i <= ushort.MaxValue; i++)
      {
        var leaf = new Node()
        {
          Uncoded1 = (byte)(i >> 8),
          Uncoded0 = (byte)(i & 255),
          Frequency = frequencies[i]
        };
        HuffmanTree.Add(leaf);
        if (leaf.Frequency > 0)
        {
          trees.Add(leaf.Frequency + (fraction += tiny), leaf);
        }
      }

      while (trees.Count > 1)
      {
        var e = trees.GetEnumerator();
        e.MoveNext();
        var first = e.Current;
        e.MoveNext();
        var second = e.Current;

        //Join smallest two nodes
        var NewParent = new Node();
        NewParent.Frequency = first.Value.Frequency + second.Value.Frequency;
        NewParent.Left = first.Value;
        NewParent.Right = second.Value;

        HuffmanTree.Add(NewParent);

        //Remove the two that just got joined into one
        trees.Remove(first.Key);
        trees.Remove(second.Key);

        trees.Add(NewParent.Frequency + (fraction += tiny), NewParent);
      }
    }

  }

}

Usage examples:

To create the dictionary from sample data:

var freqs = Huffman16.GetFrequencies(File.ReadAllBytes(@"D:\nodes"), true);

To initialize an encoder with a given dictionary:

var huff = new Huffman16(freqs);

And to do some compression:

var encoded = huff.Encode(raw);

And decompression:

var decoded = huff.Decode(encoded);
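Since the goal is to build the dictionary once and reuse it forever, the frequency table has to be stored somewhere. The following is a minimal sketch of one way to do that, assuming the table is simply written as a count followed by the raw 64-bit values. The class name FrequencyTableFile and the file layout are hypothetical and not part of the code above.

```csharp
using System;
using System.IO;

// Hypothetical helper: persists the long[] returned by GetFrequencies
// so the same table can be handed to the Huffman16 constructor later.
public static class FrequencyTableFile
{
    // Writes the number of entries, then each 64-bit frequency.
    public static void Save(string path, long[] frequencies)
    {
        using (var writer = new BinaryWriter(File.Create(path)))
        {
            writer.Write(frequencies.Length);
            foreach (long frequency in frequencies)
            {
                writer.Write(frequency);
            }
        }
    }

    // Reads a table previously written by Save.
    public static long[] Load(string path)
    {
        using (var reader = new BinaryReader(File.OpenRead(path)))
        {
            int count = reader.ReadInt32();
            var frequencies = new long[count];
            for (int i = 0; i < count; i++)
            {
                frequencies[i] = reader.ReadInt64();
            }
            return frequencies;
        }
    }
}
```

The tree is rebuilt from the table on every start, so this only yields the same codes as long as the tree-building code itself does not change between the run that encoded a block and the run that decodes it.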
1 Answer

Editorial Team
Added an answer on May 11, 2026 at 10:46 pm (2026-05-11T22:46:54+00:00)

    The hard part, in my mind, is how you build your static dictionary. You don’t want to use the LZW dictionary built from your sample data. LZW wastes a lot of time learning, because the compressor can’t build the dictionary any faster than the decompressor can: a token is only used the second time the compressor sees a string, so that the decompressor can add it to its own dictionary the first time it’s seen. The flip side is that LZW adds entries to the dictionary that may never be used, just in case the string shows up again (e.g., to get a token for ‘stackoverflow’ you’ll also have entries for ‘ac’, ‘ko’, ‘ve’, ‘rf’, etc.).
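To make the learning cost concrete, here is a small sketch that runs the textbook LZW compression loop and records which strings it adds to the dictionary. It works on strings rather than bytes for readability, and the class and method names are made up for this illustration.

```csharp
using System;
using System.Collections.Generic;

public static class LzwDictionaryDemo
{
    // Runs the LZW compression loop over the input and returns the
    // multi-character entries it adds to the dictionary, in order.
    public static List<string> AddedEntries(string input)
    {
        var dictionary = new HashSet<string>();
        var added = new List<string>();
        string current = "";
        foreach (char c in input)
        {
            string candidate = current + c;
            // Single characters count as pre-loaded, like LZW's initial byte codes.
            if (candidate.Length == 1 || dictionary.Contains(candidate))
            {
                current = candidate;
            }
            else
            {
                // A real compressor would emit the code for `current` here.
                // The longer string is only learned, not used yet.
                dictionary.Add(candidate);
                added.Add(candidate);
                current = c.ToString();
            }
        }
        return added;
    }
}
```

Feeding it "stackoverflow" once adds twelve two-character entries (st, ta, ac, ck, ko, ...) and nothing longer, which is where the ‘ac’, ‘ko’, ‘ve’, ‘rf’ entries mentioned above come from. The word has to appear many more times before a single token covers all of it.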

    However, looking at the raw token stream from an LZ77 algorithm could work well. You’ll only see tokens for strings seen at least twice. You can then build a list of the most common tokens/strings to include in your dictionary.
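One possible reading of that idea, as a rough sketch: do a naive LZ77-style pass over the sample, count every matched string, and keep the ones that save the most characters. This is a quadratic-time toy that works on strings and forbids overlapping matches, not a real LZ77 implementation, and all names in it are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StaticDictionaryBuilder
{
    // At each position, finds the longest earlier match and counts the
    // matched string if it has at least minLength characters.
    public static Dictionary<string, int> CountMatches(string sample, int minLength, int maxLength)
    {
        var counts = new Dictionary<string, int>();
        int pos = 0;
        while (pos < sample.Length)
        {
            int bestLength = 0;
            for (int start = 0; start < pos; start++)
            {
                int length = 0;
                // Simplification: a match may not run into the current position.
                while (length < maxLength
                       && pos + length < sample.Length
                       && start + length < pos
                       && sample[start + length] == sample[pos + length])
                {
                    length++;
                }
                if (length > bestLength) bestLength = length;
            }
            if (bestLength >= minLength)
            {
                string match = sample.Substring(pos, bestLength);
                int seen;
                counts.TryGetValue(match, out seen);
                counts[match] = seen + 1;
                pos += bestLength; // the whole match becomes one token
            }
            else
            {
                pos++; // no usable match: this character is a literal
            }
        }
        return counts;
    }

    // Ranks candidates by (match count * length) and keeps the best ones.
    public static List<string> TopEntries(Dictionary<string, int> counts, int maxEntries)
    {
        return counts
            .OrderByDescending(kv => kv.Value * kv.Key.Length)
            .ThenBy(kv => kv.Key, StringComparer.Ordinal)
            .Take(maxEntries)
            .Select(kv => kv.Key)
            .ToList();
    }
}
```

The ranking by count times length is only one plausible heuristic. For a large sample file a hash-based match finder would be needed, since the inner loop here rescans everything seen so far.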

    Once you have a static dictionary, using LZW without the dictionary update seems like an easy implementation, but to get the best compression I’d consider a static Huffman table instead of the traditional 12-bit fixed-size token (as George Phillips suggested). An LZW dictionary will burn tokens for all the substrings you may never actually encode (e.g., if you can encode ‘stackoverflow’, there will be tokens for ‘st’, ‘sta’, ‘stac’, ‘stack’, ‘stacko’, etc.).

    At this point it really isn’t LZW. What makes LZW clever is that the decompressor can build the same dictionary the compressor used while seeing only the compressed data stream, and that is something you won’t be using. But LZW implementations typically have a state where the dictionary is full and is no longer updated; this is how you’d use it with your static dictionary.
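A minimal sketch of what coding against a frozen dictionary could look like, assuming greedy longest-match lookup and plain integer tokens. The class StaticDictionaryCodec and its layout (codes 0 to 255 for single characters, phrases after that) are assumptions for illustration, not an existing library.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public sealed class StaticDictionaryCodec
{
    private readonly List<string> entries = new List<string>();
    private readonly Dictionary<string, int> codes = new Dictionary<string, int>();
    private readonly int maxEntryLength = 1;

    // Codes 0..255 are the single characters, like LZW's initial table.
    // The supplied phrases get the following codes. Nothing is added later.
    public StaticDictionaryCodec(IEnumerable<string> phrases)
    {
        for (int i = 0; i < 256; i++)
        {
            Add(((char)i).ToString());
        }
        foreach (string phrase in phrases)
        {
            if (phrase.Length > 1 && !codes.ContainsKey(phrase))
            {
                Add(phrase);
                maxEntryLength = Math.Max(maxEntryLength, phrase.Length);
            }
        }
    }

    private void Add(string entry)
    {
        codes[entry] = entries.Count;
        entries.Add(entry);
    }

    // Greedy encoding: at each position, take the longest dictionary entry.
    public List<int> Encode(string text)
    {
        var output = new List<int>();
        int pos = 0;
        while (pos < text.Length)
        {
            int length = Math.Min(maxEntryLength, text.Length - pos);
            int code;
            while (!codes.TryGetValue(text.Substring(pos, length), out code))
            {
                length--;
                if (length == 0)
                {
                    throw new ArgumentException("Character outside the 0..255 range at index " + pos);
                }
            }
            output.Add(code);
            pos += length;
        }
        return output;
    }

    // Decoding is a plain table lookup, with no dictionary reconstruction.
    public string Decode(IEnumerable<int> tokens)
    {
        var result = new StringBuilder();
        foreach (int token in tokens)
        {
            result.Append(entries[token]);
        }
        return result.ToString();
    }
}
```

Because matching is done by lookup rather than by extending a prefix one character at a time, the dictionary in this sketch does not need entries for ‘st’, ‘sta’ and so on in order to encode ‘stackoverflow’. The resulting tokens could then be written with fixed-width codes or with a static Huffman table as suggested above.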

© 2021 The Archive Base. All Rights Reserved