We recently compared the file sizes of the same tabular data (think single table, half a dozen columns, describing a product catalog) serialized with ProtoBuf.NET or as TSV (tab-separated values), both files compressed with GZip afterward (default .NET implementation).
I was surprised to find that the compressed ProtoBuf.NET version takes a lot more space than the compressed text version (up to 3x more). My pet theory is that ProtoBuf does not respect byte semantics and consequently mismatches the GZip frequency compression tree; hence the relatively inefficient compression.
Another possibility is that ProtoBuf encodes, in fact, a lot more data (to facilitate schema versioning for example), hence the serialized formats are not strictly comparable information-wise.
Has anybody observed the same problem? Is it even worth compressing ProtoBuf?
There are a number of possible factors here; firstly, note that the protocol buffers wire format uses straight UTF-8 encoding for strings; if your data is dominated by strings, it will ultimately need about the same amount of space as it would for TSV.
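To make the string point concrete, here is a minimal sketch that hand-writes string fields the way the wire format lays them out (a tag, a varint length, then the raw UTF-8 bytes) and compares the result with the same values as a TSV line. It deliberately does not use protobuf-net; `WriteStringField`, `WriteVarint` and the sample row are made up for illustration.

```csharp
using System;
using System.IO;
using System.Text;

static class StringSizeSketch
{
    // Writes one length-delimited string field: tag, varint length, raw UTF-8 bytes.
    static void WriteStringField(Stream dest, int fieldNumber, string value)
    {
        byte[] utf8 = Encoding.UTF8.GetBytes(value);
        WriteVarint(dest, (uint)((fieldNumber << 3) | 2)); // wire type 2 = length-delimited
        WriteVarint(dest, (uint)utf8.Length);
        dest.Write(utf8, 0, utf8.Length);
    }

    // Base-128 varint: 7 payload bits per byte, high bit set on all but the last byte.
    static void WriteVarint(Stream dest, uint value)
    {
        while (value >= 0x80)
        {
            dest.WriteByte((byte)((value & 0x7F) | 0x80));
            value >>= 7;
        }
        dest.WriteByte((byte)value);
    }

    static void Main()
    {
        // Hypothetical catalog row; the values are invented for this sketch.
        string[] row = { "SKU-00042", "Stainless steel water bottle", "Kitchen" };

        var proto = new MemoryStream();
        for (int i = 0; i < row.Length; i++)
            WriteStringField(proto, i + 1, row[i]);

        // The same values as one tab-separated, newline-terminated line.
        byte[] tsv = Encoding.UTF8.GetBytes(string.Join("\t", row) + "\n");

        Console.WriteLine("protobuf-style: {0} bytes", proto.Length);
        Console.WriteLine("TSV:            {0} bytes", tsv.Length);
    }
}
```

For short strings and low field numbers, each field costs its UTF-8 bytes plus a one-byte tag and a one-byte length, against a single delimiter byte in TSV, so for string-heavy rows the two end up within a few bytes of each other.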
Protocol buffers is also designed to help store structured data, i.e. more complex models than the single-table scenario. This doesn’t contribute hugely to the size, but start comparing with XML/JSON etc. (which are more similar in terms of capability) and the difference is more obvious.
Additionally, since protocol buffers is pretty dense (UTF-8 notwithstanding), in some cases compressing it can actually make it bigger – you might want to check if this is the case here.
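A quick way to check this is to measure the payload before and after GZip. The sketch below is an assumption-laden stand-in rather than a protobuf test: random bytes play the role of an already-dense payload, an all-zero buffer is the contrast case, and `GZippedLength` is a helper name invented here.

```csharp
using System;
using System.IO;
using System.IO.Compression;

static class GZipGrowthSketch
{
    // Returns the number of bytes produced by gzipping the payload in memory.
    static long GZippedLength(byte[] payload)
    {
        using (var buffer = new MemoryStream())
        {
            // leaveOpen: true, so the MemoryStream can still be read after the GZipStream closes
            using (var gzip = new GZipStream(buffer, CompressionMode.Compress, true))
            {
                gzip.Write(payload, 0, payload.Length);
            } // disposing the GZipStream flushes the final block and the trailer
            return buffer.Length;
        }
    }

    static void Main()
    {
        // Random bytes stand in for dense, hard-to-compress data (illustration only).
        byte[] dense = new byte[64 * 1024];
        new Random(12345).NextBytes(dense);

        // All zeros: about as compressible as input gets.
        byte[] repetitive = new byte[64 * 1024];

        Console.WriteLine("dense:      {0} -> {1}", dense.Length, GZippedLength(dense));
        Console.WriteLine("repetitive: {0} -> {1}", repetitive.Length, GZippedLength(repetitive));
    }
}
```

Because GZip always adds a header and trailer, input with no redundancy left to exploit cannot come out smaller and normally comes out somewhat larger; swapping your own serialized bytes in for `dense` shows which side of that line your data falls on.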
In a quick sample for the scenario you present, both formats give roughly the same sizes – there is no massive jump:
The TSV is marginally smaller in this case, but ultimately TSV is indeed a very simple format (with very limited capabilities in terms of structured data), so it is no surprise that it is quick.
Indeed; if all you are storing is a very simple single table, TSV is not a bad option – however, it is ultimately a very limited format. I can’t reproduce your “much bigger” example.
In addition to the richer support for structured data (and other features), protobuf places a lot of emphasis on processing performance too. Since TSV is pretty simple, the edge here won’t be massive (though it is noticeable in the above), but again: compare against XML, JSON, or the inbuilt BinaryFormatter for a test against formats with similar features, and the difference is obvious.
Example for the numbers above (updated to use BufferedStream):