In my application, I need to store and transmit data that contains many repeating string values (think entity names in an XML document). I have two proposed solutions:
- A) create a string table to be stored along the document, and then use index references (using multi-byte encoding) in the document body, or
- B) simply compress the document using gzip or a similar compression algorithm.
Which one is likely going to perform better in terms of speed and data size? (Obviously, this depends on the quality of the implementations, but assume that option A builds an array of strings dynamically and encodes the document body in some reasonable fashion).
Also, if option B, do you recommend a more potentially suitable compression method other than gzip?
gzip is only a good algorithm when the transmission/storage cost is not too high compared to the cost of CPU time. You can get better compression ratios with bzip2, 7zip, and especialy for natural language, various PPM algorithms.
Of course, it’s not only computation (and static vs. dynamic memory requirement) vs. compression ratio that matters – different compression formats allow varying degrees of efficient random access seeking, low latency stream decoding, and concatenation of zipped data (e.g.
cat a.gz b.gz | gunzip -cis the same asgunzip -c a.gz;gunzip -c b.gz