For some caching I’m thinking of doing for an upcoming project, I’ve been thinking about Java serialization. Namely, should it be used?
Now, I’ve previously written custom serialization and deserialization (Externalizable) for various reasons in years past. These days interoperability has become even more of an issue, and I can foresee a need to interact with .NET applications, so I’ve thought of using a platform-independent solution such as Google’s Protocol Buffers (GPB).
Has anyone had any experience with high-performance use of GPB? How does it compare in terms of speed and efficiency with Java’s native serialization? Alternatively, are there any other schemes worth considering?
I haven’t compared Protocol Buffers with Java’s native serialization in terms of speed, but for interoperability Java’s native serialization is a serious no-no. It’s also not going to be as efficient in terms of space as Protocol Buffers in most cases. Of course, it’s somewhat more flexible in terms of what it can store, and in terms of references etc. Protocol Buffers is very good at what it’s intended for, and when it fits your need it’s great – but there are obvious restrictions due to interoperability (and other things).
I’ve recently posted a Protocol Buffers benchmarking framework in Java and .NET. The Java version is in the main Google project (in the benchmarks directory), the .NET version is in my C# port project. If you want to compare PB speed with Java serialization speed you could write similar classes and benchmark them. If you’re interested in interop though, I really wouldn’t give native Java serialization (or .NET native binary serialization) a second thought.
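As a rough starting point for the Java-serialization side of such a comparison, here is a minimal sketch. The `Person` class, its fields and the iteration count are all hypothetical; you would shape the class to mirror whatever message you define for Protocol Buffers. It only uses the standard `java.io` object streams, and a simple timed loop like this is not a rigorous benchmark, so treat the numbers as indicative only.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class NativeSerializationBenchmark {

    // Hypothetical sample class; mirror the fields of the message you want to compare.
    static class Person implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        final int id;
        final String email;

        Person(String name, int id, String email) {
            this.name = name;
            this.id = id;
            this.email = email;
        }
    }

    // Serializes one object to a byte array using Java's native serialization.
    static byte[] serialize(Object value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        }
        return bytes.toByteArray();
    }

    // Reads one object back from a byte array.
    static Object deserialize(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Person sample = new Person("Ada", 42, "ada@example.com");
        int iterations = 100_000; // arbitrary; increase for steadier numbers

        // Warm-up pass so the JIT has a chance to compile the hot paths.
        for (int i = 0; i < iterations; i++) {
            deserialize(serialize(sample));
        }

        long totalBytes = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            byte[] data = serialize(sample);
            totalBytes += data.length;
            deserialize(data);
        }
        long elapsedNanos = System.nanoTime() - start;

        System.out.printf("Serialized size: %d bytes%n", totalBytes / iterations);
        System.out.printf("Round trips per millisecond: %.1f%n",
                iterations / (elapsedNanos / 1_000_000.0));
    }
}
```

Run it with the same VM settings you use for the Protocol Buffers benchmark so that the two sets of figures are comparable.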
There are other options for interoperable serialization besides Protocol Buffers though – Thrift, JSON and YAML spring to mind, and there are doubtless others.
EDIT: Okay, with interop not being so important, it’s worth trying to list the different qualities you want out of a serialization framework. One thing you should think about is versioning – this is another thing that PB is designed to handle well, both backwards and forwards (so new software can read old data and vice versa) – when you stick to the suggested rules, of course 🙂
I was cautious above about comparing PB’s performance with Java’s native serialization, but I really wouldn’t be surprised to find that PB was faster anyway. If you have the chance, use the server VM: my recent benchmarks showed the server VM to be over twice as fast at serializing and deserializing the sample data. I think the PB code suits the server VM’s JIT very nicely 🙂
Just as sample performance figures, serializing and deserializing two messages (one 228 bytes, one 84750 bytes) I got these results on my laptop using the server VM:
The ‘speed’ vs ‘size’ is whether the generated code is optimised for speed or code size. (The serialized data is the same in both cases. The ‘size’ version is provided for the case where you’ve got a lot of messages defined and don’t want to take a lot of memory for the code.)
As you can see, for the smaller message it can be very fast: over 500 small messages serialized or deserialized per millisecond. Even with the 84,750-byte message it’s taking less than a millisecond per message.
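To put those rates in terms of data throughput, the arithmetic can be sketched as below. The class and method names are made up for illustration; the inputs are just the figures quoted above (500 messages per millisecond at 228 bytes, and one 84750-byte message per millisecond), so the results are lower bounds rather than measurements.

```java
public class ThroughputEstimate {

    // Converts a message rate and message size into megabytes (10^6 bytes) per second.
    static double megabytesPerSecond(double messagesPerMillisecond, int bytesPerMessage) {
        double bytesPerSecond = messagesPerMillisecond * 1000.0 * bytesPerMessage;
        return bytesPerSecond / 1_000_000.0;
    }

    public static void main(String[] args) {
        // Small message: over 500 per millisecond at 228 bytes each.
        System.out.println(megabytesPerSecond(500, 228));
        // Large message: less than a millisecond each, so at least 1 per millisecond.
        System.out.println(megabytesPerSecond(1, 84750));
    }
}
```

That works out to at least 114 MB/s for the small message and at least 84.75 MB/s for the large one.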