I have a protocol buffer setup like this:
    [ProtoContract]
    class Foo
    {
        [ProtoMember(1)]
        public Bar[] Bars;
    }
A single Bar gets encoded to a 67-byte protocol buffer. This sounds about right, because I know that a Bar is pretty much just a 64-byte array, and then there are 3 bytes of overhead for length prefixing.
However, when I encode a Foo with an array of 20 Bars it takes 1362 bytes. 20 * 67 is 1340, so there are 22 bytes of overhead just for encoding an array!
Why does this take up so much space? And is there anything I can do to reduce it?
This overhead is quite simply the information it needs to know where each of the 20 objects starts and ends. There is nothing I can do differently here without breaking the format (i.e. doing something contrary to the spec).
If you really want the gory details:
An array or list is (if we exclude “packed”, which doesn’t apply here) simply a repeated block of sub-messages. There are two layouts available for sub-messages: strings and groups. With a string, the layout is:
    [header][length][data]
where header is the varint-encoded mash of the wire-type and field-number (hex 0A in this case with field 1), length is the varint-encoded size of data, and data is the sub-object itself. For small objects (data less than 128 bytes) this often means 2 bytes overhead per object, depending on a: the field number (fields above 15 take more space), and b: the size of the data.
With a group, the layout is:
    [header][data][footer]
where header is the varint-encoded mash of the wire-type and field-number (hex 0B in this case with field 1), data is the sub-object, and footer is another varint mash to indicate the end of the object (hex 0C in this case with field 1).
Groups are less favored generally, but they have the advantage that they don’t incur any extra overhead as data grows in size. For small field-numbers (less than 16) again the overhead is 2 bytes per object. Of course, you pay double for large field-numbers, instead.
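If it helps to see the arithmetic, here is a rough sketch of how the per-object overhead of the two layouts could be computed. The WireOverhead class and its method names are made up for illustration and are not part of protobuf-net. The only things taken from the wire format are the tag formula (field number shifted left by 3, combined with the wire type) and the wire-type values 2, 3 and 4.

```csharp
using System;

// Illustrative helpers only; these names are hypothetical and not part of protobuf-net.
public static class WireOverhead
{
    // Wire types defined by the protocol buffers encoding.
    public const int LengthDelimited = 2; // the "string" layout
    public const int StartGroup = 3;      // group header
    public const int EndGroup = 4;        // group footer

    // A field header is (fieldNumber << 3) | wireType, written as a varint.
    public static uint Tag(int fieldNumber, int wireType)
    {
        return ((uint)fieldNumber << 3) | (uint)wireType;
    }

    // Number of bytes a varint occupies: each byte carries 7 bits of payload.
    public static int VarintLength(ulong value)
    {
        int bytes = 1;
        while (value >= 0x80)
        {
            value >>= 7;
            bytes++;
        }
        return bytes;
    }

    // Overhead of [header][length][data]: the header plus the length prefix.
    public static int StringOverhead(int fieldNumber, int dataLength)
    {
        return VarintLength(Tag(fieldNumber, LengthDelimited))
             + VarintLength((ulong)dataLength);
    }

    // Overhead of [header][data][footer]: the start tag plus the end tag.
    public static int GroupOverhead(int fieldNumber)
    {
        return VarintLength(Tag(fieldNumber, StartGroup))
             + VarintLength(Tag(fieldNumber, EndGroup));
    }
}
```

Under these assumptions, StringOverhead(1, 67) and GroupOverhead(1) both come out at 2 bytes, so for items of that size in field 1 neither layout is smaller. The string layout only starts costing more once data reaches 128 bytes, and the group layout costs more once the field number goes above 15.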