I have some questions regarding packed fields and storing/serializing data with Protocol Buffers.
What I want to do, essentially, is store 4MB of data in a file.
The data I have (in our embedded system) is received as uint8_t (a byte), and I want to store it as efficiently as possible.
I have been testing four protobuf setups, based on these two field definitions:
repeated uint32_t datastruct = 1;
repeated uint32_t datastruct = 1 [packed = true];
Each variant was tried two ways: assigned 1-to-1 (one uint8 per uint32), and bit-shifted with 4 values crammed into each uint32_t.
To my surprise, the stored files are much larger than the original data. (The cases where I put a single uint8 into a uint32 were expected to grow, of course.)
The best result I could achieve was 5.2MB for the 4MB data, which really isn't that good.
Have I misunderstood something vital here?
I do realize that protobuf adds information to the packets, but a 25% increase is too much, IMHO.
Also, using GzipOutputStream increases the size of the file instead of decreasing it.
Any tips would be very appreciated!
Thanks for your time.
This answer is based on the assumption that you are using uint32 in .proto terms.
packed is a positive thing here (it removes the per-value headers); however, by packing a single uint8 into a uint32, you are running into a facet of “varint” encoding: if the most significant bit of the byte is set, the value takes 2 bytes (varint uses 7 bits per byte for data, and one bit as a continuation flag). Consequently, I would recommend switching to the bytes type, which represents any arbitrary chunk of bytes and is encoded “as is”, without any varint or similar. It wouldn't be repeated/packed, just a single bytes field.
Another option would be to use fixed32 (repeated and packed), and place (via shifting) 4 bytes per value, but by the time you've done that you may as well go to bytes and have a more obvious 1:1 map.
Re gzip: it is not uncommon for gzip to increase the size of arbitrary binary without many repeated blocks. By contrast, if your protobuf document contains strings, it is common for the size to shrink, as gzip can spot repeated blocks.
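To make the size argument concrete, here is a rough Python sketch (not protobuf library code) that only counts the encoded bytes for the layouts discussed above. It assumes field number 1, so the tag fits in one byte, and it uses pseudo-random sample data as a stand-in for the real 4MB buffer; the helper names are made up for illustration.

```python
import random
import struct


def varint_len(value: int) -> int:
    """Bytes needed by a base-128 varint: 7 data bits per byte, 1 continuation bit."""
    n = 1
    while value >= 0x80:
        value >>= 7
        n += 1
    return n


def packed_uint32_size(values) -> int:
    """Packed repeated uint32: 1 tag byte + length varint + one varint per value."""
    payload = sum(varint_len(v) for v in values)
    return 1 + varint_len(payload) + payload


def packed_fixed32_size(count: int) -> int:
    """Packed repeated fixed32: 1 tag byte + length varint + 4 bytes per value."""
    payload = 4 * count
    return 1 + varint_len(payload) + payload


def bytes_field_size(n: int) -> int:
    """Single bytes field: 1 tag byte + length varint + the raw bytes as-is."""
    return 1 + varint_len(n) + n


# Assumption: 64 KiB of pseudo-random bytes stands in for the real data.
rng = random.Random(0)
data = bytes(rng.getrandbits(8) for _ in range(64 * 1024))

one_per_value = list(data)                               # one uint8 per uint32
crammed = struct.unpack("<%dI" % (len(data) // 4), data)  # 4 bytes per uint32

for name, size in [
    ("packed uint32, 1 byte per value", packed_uint32_size(one_per_value)),
    ("packed uint32, 4 bytes per value", packed_uint32_size(crammed)),
    ("packed fixed32, 4 bytes per value", packed_fixed32_size(len(crammed))),
    ("single bytes field", bytes_field_size(len(data))),
]:
    print(f"{name}: {size} bytes ({size / len(data):.3f}x)")
```

For roughly uniform byte values, most of the crammed uint32 values are at or above 2^28 and so need a 5-byte varint, i.e. about 5 bytes out per 4 bytes in, which would be consistent with the ~25% growth reported in the question. bytes and packed fixed32 only add the few bytes of tag and length prefix.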
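The gzip remark can be checked in the same spirit. This is a small sketch using Python's standard gzip module rather than protobuf's GzipOutputStream, so it only illustrates the general behaviour; the sample payloads are invented for the comparison.

```python
import gzip
import random


def gzipped_len(blob: bytes) -> int:
    """Length of the blob after gzip compression with default settings."""
    return len(gzip.compress(blob))


# Assumption: pseudo-random bytes model binary data with no repeated blocks.
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(64 * 1024))

# Assumption: a repeated ASCII record models string-heavy message content.
text = b"temperature=21.5;status=OK;" * 2400

print("noise:", len(noise), "->", gzipped_len(noise))
print("text: ", len(text), "->", gzipped_len(text))
```

The noise payload comes out slightly larger because gzip has nothing to match and still adds its header and trailer, while the repetitive text shrinks considerably.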