I am trying to serialize a large list (~10**6 rows, each with ~20 values) for my own later use, so pickle’s lack of safety isn’t a concern.
Each row of the list is a tuple of values, derived from some SQL database. So far, I have seen datetime.datetime, strings, integers, and NoneType, but I might eventually have to support additional data types.
For serialization, I’ve considered pickle (cPickle), json, and plain text, but only pickle preserves the type information: json can’t serialize datetime.datetime out of the box, and plain text would force me to parse every value back into its type by hand.
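To illustrate the difference, here is a minimal sketch using a made-up row (the values are hypothetical, not taken from my real data). It only shows that the standard json module rejects datetime.datetime by default, while pickle round-trips the tuple with its types intact.

```python
import datetime
import json
import pickle

# Hypothetical sample row with the types seen so far: int, str, datetime, None.
row = (1, "alice", datetime.datetime(2020, 1, 2, 3, 4, 5), None)

# json raises TypeError on datetime.datetime unless a custom encoder is supplied.
try:
    json.dumps(row)
    json_ok = True
except TypeError:
    json_ok = False

# pickle restores the same tuple, with every value keeping its original type.
restored = pickle.loads(pickle.dumps(row))
```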
However, cPickle is pretty slow for data this large, and I’m looking for a faster alternative.
I think you should give PyTables a look. It stores data in HDF5 files and should be very fast, at least faster than going through an RDBMS, since it is very lax and imposes none of a database’s read/write restrictions. You also get a better interface for managing your data than you would by pickling it.
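If you end up staying with pickle for now, it may also be worth checking which protocol you are using, since the old text protocol (protocol 0) is generally much slower and bulkier than the binary ones. The following is only a rough sketch with made-up rows standing in for your data. It uses the standard pickle module, which cPickle mirrors on Python 2.

```python
import datetime
import pickle

# Hypothetical stand-in for the real data: tuples of int, str, datetime, None.
rows = [(i, "name%d" % i, datetime.datetime(2020, 1, 1), None)
        for i in range(1000)]

# Protocol 0 is the original text protocol.
text_blob = pickle.dumps(rows, protocol=0)

# HIGHEST_PROTOCOL selects the newest binary protocol the interpreter supports.
binary_blob = pickle.dumps(rows, protocol=pickle.HIGHEST_PROTOCOL)

# Loading does not need the protocol; it is detected from the data.
restored = pickle.loads(binary_blob)
```

For a real file you would call pickle.dump(rows, f, protocol=pickle.HIGHEST_PROTOCOL) with the file opened in binary mode. How much this helps depends on your data, so time it on a sample before committing to it.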