I’m writing a data processing library in Python that reads data from a variety of sources into memory, manipulates it, then exports it to a variety of formats. So far I have been loading everything into memory, but some of the datasets I’m processing are particularly large (over 4 GB).
I need an open source library for a backing store that deals elegantly with large datasets. It needs to support altering the data structure dynamically (adding, renaming, and removing columns) and reasonably fast iteration. Ideally it would handle arbitrarily sized strings and integers (just as Python does), but I can build that into the library if needed. It also needs to handle missing values.
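For context, here is a much simplified version of what I currently do in memory with plain dicts, and what the backing store would need to support:

```python
# Much simplified version of what I currently do in memory with dicts.
rows = [
    {"qty": 3, "price": 9.99},
    {"qty": None, "price": 4.50},          # missing value
]

for row in rows:
    row["score"] = None                    # add a column
for row in rows:
    row["quantity"] = row.pop("qty")       # rename a column
for row in rows:
    row.pop("price", None)                 # remove a column

for row in rows:                           # iteration needs to stay fast
    print(row)
```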
Does anyone have any suggestions?
A document-oriented database should cope fine with that kind of workload, as long as you do not need complex joins.
The usual candidates are CouchDB and MongoDB.
Both are well suited to MapReduce-like algorithms (which includes iterating over the whole dataset). If you want to merge rows with new data, you will want either the ‘table’ kept sorted or fast access to single elements: both boil down to having an index.
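For example, with MongoDB via pymongo, full iteration and indexed point lookups look roughly like this (the database, collection, and field names here are made up for illustration):

```python
# Sketch assuming MongoDB via pymongo and a local mongod on the default port.
from pymongo import ASCENDING, MongoClient

client = MongoClient()
rows = client["mydata"]["rows"]

# Full iteration: the cursor streams documents in batches, so a 4 GB
# collection never has to fit in memory at once.
total = 0
for doc in rows.find():
    total += doc.get("quantity", 0)

# An index on the merge key gives fast access to single elements,
# which is what merging rows with new data boils down to.
rows.create_index([("record_id", ASCENDING)], unique=True)
row = rows.find_one({"record_id": 12345})
```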
Document-oriented DBs support multiple ‘tables’ by storing documents with different schemas in the same collection, and they can query for documents matching a specific schema without a problem.
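To illustrate (again a pymongo sketch with invented field names), two differently shaped record types can share one collection, distinguished by a type field:

```python
from pymongo import MongoClient

rows = MongoClient()["mydata"]["rows"]

# Two 'tables' in one collection, distinguished by a "type" field.
rows.insert_many([
    {"type": "order",    "record_id": 1, "quantity": 3, "price": 9.99},
    {"type": "customer", "record_id": 2, "name": "Ada", "city": None},  # missing value
])

# Query only the documents that follow the "order" schema.
for doc in rows.find({"type": "order"}):
    print(doc["record_id"], doc["quantity"])
```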
I do not think you will find a lightweight solution that handles multiple 4 GB datasets with the requirements you listed. Dynamic data structures in particular are difficult to implement efficiently.
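That said, the dynamic structure changes you asked for do map directly onto MongoDB’s update operators; just be aware that each one rewrites every matching document, so none of them is cheap on a 4 GB collection. A sketch (field names invented):

```python
from pymongo import MongoClient

rows = MongoClient()["mydata"]["rows"]

# "Column" operations as bulk updates. $set, $rename, and $unset are real
# MongoDB update operators; each pass rewrites every matching document.
rows.update_many({}, {"$set": {"score": None}})          # add a column (as null)
rows.update_many({}, {"$rename": {"qty": "quantity"}})   # rename a column
rows.update_many({}, {"$unset": {"obsolete": ""}})       # remove a column
```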