I want to compress many small strings (each a C# string of about 75-100 characters).
At the time the dictionary is created I already know all the short strings (nearly a trillion). There will be no additional short strings in the future.
I need to extract exactly one string without decompressing the other strings.
Now I am looking for a library or the best way to do the following:
- Create a dictionary using all the strings I have
- Use this dictionary to compress each string
- Decompress a single string using the dictionary from the first step
I found a good related question, but it is not C# specific. Maybe there is something for C# that I do not know about, a fancy library, or someone has already done this. That is why I am asking this question.
EDIT:
With dictionary I am talking about things like this: http://en.wikipedia.org/wiki/Dictionary_coder
But anything that makes the strings shorter helps. The strings are short text messages in various languages and URLs (30%/70%). There is no need for the compressed strings to be human readable. They will be stored in binary files.
If there are a trillion strings and no more, then each can be represented in 40 bits (5 bytes), since 2^40 is about 1.1 trillion. All you need is a way to use the 5 bytes as an index into the trillion strings.
How do you know all trillion strings? If the compressor and decompressor both have access to all trillion strings, or if there is a way to order and recreate the strings, then all you need is the index.
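The 40-bit figure can be checked with a few lines of Python. This is only a sketch of the arithmetic: `TOTAL_STRINGS`, `encode_index` and `decode_index` are hypothetical names, and the count of exactly 10^12 is an assumption standing in for "nearly a trillion".

```python
# Assumption: exactly one trillion strings, numbered 0 .. TOTAL_STRINGS - 1.
TOTAL_STRINGS = 10**12

# Bits needed to represent the largest index.
BITS_NEEDED = (TOTAL_STRINGS - 1).bit_length()
BYTES_NEEDED = (BITS_NEEDED + 7) // 8


def encode_index(index: int) -> bytes:
    """Pack a string's index into a fixed-width big-endian byte string."""
    if not 0 <= index < TOTAL_STRINGS:
        raise ValueError("index out of range")
    return index.to_bytes(BYTES_NEEDED, "big")


def decode_index(data: bytes) -> int:
    """Recover the index from its fixed-width representation."""
    return int.from_bytes(data, "big")
```

This only helps if both sides can map an index back to its string, which is the point the question above is getting at.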
If you can’t find a way to index the strings, then you can take a subset of the strings and use them as a dictionary for a compressor. Just take the most representative sample (you need to figure out what makes some strings more common than others, or more representative of the others) and concatenate them into a 32K dictionary. That is about 400 of your trillion strings. Then use zlib’s deflateSetDictionary on the compress end and inflateSetDictionary on the decompress end, both with exactly the same 32K dictionary. That will provide good compression on the short strings.
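A minimal sketch of the preset-dictionary approach, written in Python rather than C# because Python's standard `zlib` module exposes this feature through the `zdict` argument. The helper names (`build_dictionary`, `compress_one`, `decompress_one`) and the sample strings are made up for illustration. Choosing a representative sample is the real work and is not shown here.

```python
import zlib

MAX_DICT_SIZE = 32 * 1024  # deflate can only reference the last 32K of history


def build_dictionary(samples, max_size=MAX_DICT_SIZE):
    """Concatenate sample strings into one preset dictionary.

    zlib recommends placing the most common content at the end, so the
    samples are assumed to be ordered from least to most representative,
    and only the tail is kept if the blob is too large.
    """
    blob = b"".join(s.encode("utf-8") for s in samples)
    return blob[-max_size:]


def compress_one(text, zdict):
    """Compress a single string on its own, primed with the dictionary."""
    compressor = zlib.compressobj(level=9, zdict=zdict)
    return compressor.compress(text.encode("utf-8")) + compressor.flush()


def decompress_one(data, zdict):
    """Decompress a single string; needs exactly the same dictionary."""
    decompressor = zlib.decompressobj(zdict=zdict)
    raw = decompressor.decompress(data) + decompressor.flush()
    return raw.decode("utf-8")


# Hypothetical sample data, standing in for a representative subset.
SAMPLES = [
    "compressing short strings with zlib",
    "http://www.example.org/blog/",
    "http://www.example.com/news/2013/",
]
ZDICT = build_dictionary(SAMPLES)

message = "http://www.example.com/news/2013/compressing-short-strings"
packed = compress_one(message, ZDICT)
restored = decompress_one(packed, ZDICT)
```

Because every string is compressed independently against the same dictionary, any single string can be extracted without touching the others. In C#, the same idea needs a zlib binding or library that exposes the set-dictionary calls.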