I’m writing a binary file format that contains a graph of serialized objects. To be more resilient to errors (and to be able to debug problems) I am considering length-prefixing each object in the stream. I’m using C# and a BinaryWriter at the moment, but it is quite a general problem.
The size of each object isn’t known until it has been completely serialized, so there are a number of strategies for writing the length prefixes:
1. Use a write buffer with enough space to have random access and insert the length at the correct position after the object is serialized.
2. Write each object to its own MemoryStream, then write the length of the buffer and the buffer contents to the main stream.
3. Write a zero length for all objects in the first pass, remember the positions in the file for all object sizes (a table of object to size), and make a second pass filling in all the sizes.
4. ??
The total size (and thus the size of the first/outermost object) is typically around 1 MB but can be as large as 50-100 MB. My concern is the performance and memory usage of the process.
Which strategy would be most efficient?
The only way to determine this is to measure.
My first instinct would be to use #2, although it is likely to add pressure to the GC (or to fragment the large object heap if the worker streams' buffers reach the 85,000-byte threshold). However, #3 sounds interesting, assuming the complexity of tracking those positions doesn’t hurt maintainability.
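For what it's worth, here is a rough sketch of #2 next to a single-pass variant of #3: write a placeholder, serialize the object, then seek back and patch the length immediately, rather than keeping a table for a second pass. The names (`LengthPrefix`, `WriteBuffered`, `WriteBackpatched`) and the 4-byte `int` prefix are my own assumptions, not anything from your code, and the backpatching version only works when the output stream is seekable.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper; names and the 4-byte prefix are illustrative assumptions.
public static class LengthPrefix
{
    // Strategy #2: serialize into a temporary MemoryStream, then write
    // the length followed by the buffered bytes to the main stream.
    public static void WriteBuffered(BinaryWriter output, Action<BinaryWriter> writeBody)
    {
        using (var buffer = new MemoryStream())
        {
            // leaveOpen: true so disposing the inner writer does not close the buffer.
            using (var inner = new BinaryWriter(buffer, Encoding.UTF8, true))
            {
                writeBody(inner);
            }
            int length = checked((int)buffer.Length);
            output.Write(length);
            // GetBuffer avoids the extra copy that ToArray would make.
            output.Write(buffer.GetBuffer(), 0, length);
        }
    }

    // Variant of strategy #3: write a zero placeholder, serialize the body
    // directly to the main stream, then seek back and patch the real length.
    public static void WriteBackpatched(BinaryWriter output, Action<BinaryWriter> writeBody)
    {
        Stream stream = output.BaseStream;
        if (!stream.CanSeek)
            throw new NotSupportedException("Backpatching needs a seekable stream.");

        output.Flush();
        long prefixPos = stream.Position;
        output.Write(0);                 // placeholder for the int32 length
        writeBody(output);
        output.Flush();
        long endPos = stream.Position;

        // Body length excludes the 4-byte prefix itself.
        int length = checked((int)(endPos - prefixPos - sizeof(int)));
        stream.Position = prefixPos;
        output.Write(length);
        output.Flush();
        stream.Position = endPos;        // continue writing after the object
    }
}
```

Both methods can be nested (an object's `writeBody` may call the same method for its children) and should produce identical bytes, so you can swap one for the other when measuring. The buffered version allocates one temporary stream per object; the backpatched version allocates nothing extra but does two seeks per object.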
In the end you need to measure with your own data. Also consider that, unless you have unusual circumstances, performance will be dominated by network or storage I/O, not by processing in memory.
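As a starting point for measuring, something like the following could be used to time a serialization routine and count garbage collections while it runs. This is only a minimal sketch: `SerializationBenchmark`, `Measure` and `Result` are hypothetical names of mine, and a proper benchmarking library will give more trustworthy numbers.

```csharp
using System;
using System.Diagnostics;

// Hypothetical measurement helper; not a substitute for a real benchmark harness.
public static class SerializationBenchmark
{
    public struct Result
    {
        public TimeSpan Elapsed;
        public int Gen0Collections;
        public int Gen2Collections;
    }

    public static Result Measure(Action action, int iterations)
    {
        action(); // warm-up run so JIT compilation is not included in the timing

        // Start from a clean heap so the collection counts reflect this run only.
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();

        int gen0Before = GC.CollectionCount(0);
        int gen2Before = GC.CollectionCount(2);

        Stopwatch stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            action();
        }
        stopwatch.Stop();

        return new Result
        {
            Elapsed = stopwatch.Elapsed,
            Gen0Collections = GC.CollectionCount(0) - gen0Before,
            Gen2Collections = GC.CollectionCount(2) - gen2Before
        };
    }
}
```

You would pass it a delegate that serializes a representative graph with each strategy, for example `SerializationBenchmark.Measure(() => SerializeGraph(path), 20)` where `SerializeGraph` stands in for your own routine, and compare the results for both your typical and your largest files. A gen 2 count that rises with the buffered strategy on large graphs would be consistent with the large object heap concern above.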