I posted a related but still different question regarding Protobuf-Net before, so here goes:
I wonder whether someone (esp. Marc) could comment on which of the following would most likely be faster:
(a) I currently store serialized built-in datatypes in a binary file. Specifically, a long(8 bytes), and 2 floats (2x 4 bytes). Each 3 of those later make up one object in deserialized state. The long type represents DateTimeTicks for lookup purposes. I use a binary search to find the start and end locations of a data request. A method then downloads the data in one chunk (from start to end location) knowing that each chunk consists of a packet of many of above described triplets(1 long, 1 float, 1 float) and each triplet is always 16 bytes long. Thus the number triples retrieved is always (endLocation – startLocation)/16. I then iterate over the retrieved byte array, deserialize (using BitConverter) each built-in type and then instantiate a new object made up of a triplet each and store the objects in a list for further processing.
(b) Would it be faster to do the following? Build a separate file (or implement a header) that functions as index for lookup purposes. Then I would not store individual binary versions of the built-in types but instead use Protbuf-net to serialize a List of above described objects (= triplet of int, float, float as source of object). Each List would contain exactly and always one day’s worth of data (remember, the long represents DateTimeTick). Obviously each List would vary in size and thus my idea of generating another file or header for index lookup purposes because each data read request would only request a multiple of full days. When I want to retrieve the serialized list of one day I would then simply lookup the index, read the byte array, deserialize using Protobuf-Net and already have my List of objects. I guess why I am asking is because I do not fully understand how deserialization of collections in protobuf-net works.
To give a better idea about the magnitude of the data, each binary file is about 3gb large, thus contains many millions of serialized objects. Each file contains about 1000 days worth of data. Each data request may request any number of day’s worth of data.
What in your opinion is faster in raw processing time? I wanted to garner some input before potentially writing a lot of code to implement (b), I currently have (a) and am able to process about 1.5 million objects per second on my machine (process = from data request to returned List of deserialized objects).
Summary: I am asking whether binary data can be faster read I/O and deserialized using approach (a) or (b).
What you have is (and no offence intended) some very simple data. If you’re happy dealing with raw data (and it sounds like you are) then it sounds to me like the optimum way to treat this is: as you are. Offsets are a nice clean multiple of 16, etc.
Protocol buffers generally (not just protobuf-net, which is a single implementation of the protobuf specification) is intended for a more complex set of scenarios:
It is a bit of a different use case! As part of this, protocol buffers uses a small but necessary field-header notation (usually one byte per field), and you would need a mechanism to separate records, since they aren’t fixed-size – which is typically another 2 bytes per record. And, ultimately, the protocol buffers handling of float is IEEE-754, so you would be storing the exact same 2 x 4 bytes, but with added padding. The handling of a long integer can be fixed or variable size within the protocol buffers specification.
For what you are doing, and since you care about fastest raw processing time, simple seems best. I’d leave it “as is”.