There seem to be many choices for Python to interface with SQLite (sqlite3, atpy) and HDF5 (h5py, pyTables). I wonder if anyone has experience using these together with numpy arrays or data tables (structured/record arrays), and which of these integrate most seamlessly with “scientific” modules (numpy, scipy) for each data format (SQLite and HDF5).
Most of it depends on your use case.
I have a lot more experience dealing with the various HDF5-based methods than traditional relational databases, so I can’t comment too much on SQLite libraries for Python…
At least as far as h5py vs pyTables, they both offer very seamless access via numpy arrays, but they’re oriented towards very different use cases.

If you have n-dimensional data that you want to quickly access an arbitrary index-based slice of, then it’s much simpler to use h5py. If you have data that’s more table-like, and you want to query it, then pyTables is a much better option.

h5py is a relatively “vanilla” wrapper around the HDF5 libraries compared to pyTables. This is a very good thing if you’re going to be regularly accessing your HDF file from another language (pyTables adds some extra metadata). h5py can do a lot, but for some use cases (e.g. what pyTables does) you’re going to need to spend more time tweaking things.

pyTables has some really nice features. However, if your data doesn’t look much like a table, then it’s probably not the best option.

To give a more concrete example, I work a lot with fairly large (tens of GB) 3- and 4-dimensional arrays of data. They’re homogeneous arrays of floats, ints, uint8s, etc. I usually want to access a small subset of the entire dataset.
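
To make that access pattern concrete, here is a minimal sketch of reading a small sub-block out of a chunked h5py dataset. The file name, dataset name, and array shape are made up for illustration (and kept tiny so it runs quickly); it is not my actual data layout.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file/dataset names and a deliberately small 3-D array.
path = os.path.join(tempfile.mkdtemp(), "volume.h5")
data = np.random.rand(40, 50, 60).astype(np.float32)

with h5py.File(path, "w") as f:
    # chunks=True asks h5py to guess a chunk shape instead of us choosing one.
    dset = f.create_dataset("volume", data=data, chunks=True)
    chunk_shape = dset.chunks  # the chunk shape h5py picked

with h5py.File(path, "r") as f:
    dset = f["volume"]
    # Slicing the dataset reads only the requested region from disk
    # and returns an ordinary NumPy array.
    block = dset[10:20, 5:15, 30:40]
    # A slice along the last axis, which would be a strided read
    # in a C-ordered memmapped file.
    plane = dset[:, :, 7]
```

Both `block` and `plane` come back as plain numpy arrays, so they can be handed straight to numpy or scipy routines.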
h5py makes this very simple, and does a fairly good job of auto-guessing a reasonable chunk size. Grabbing an arbitrary chunk or slice from disk is much, much faster than for a simple memmapped file. (Emphasis on arbitrary… Obviously, if you want to grab an entire “X” slice, then a C-ordered memmapped array is impossible to beat, as all the data in an “X” slice are adjacent on disk.)

As a counterexample, my wife collects data from a wide array of sensors that sample at minute to second intervals over several years. She needs to store and run arbitrary queries (and relatively simple calculations) on her data.
pyTables makes this use case very easy and fast, and still has some advantages over traditional relational databases (particularly in terms of disk usage and the speed at which a large, index-based chunk of data can be read into memory).
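
For the table-like case, a rough sketch of what querying sensor readings with pyTables can look like is below. The column names, the one-minute sampling, and the query condition are all assumptions invented for the example, not her real schema.

```python
import os
import tempfile

import numpy as np
import tables

# Hypothetical schema: one row per sensor reading, stored as a structured array.
n = 3000
rows = np.zeros(n, dtype=[("timestamp", "f8"), ("sensor_id", "i4"), ("value", "f8")])
rng = np.random.default_rng(0)
rows["timestamp"] = np.arange(n) * 60.0   # made-up: one sample per minute
rows["sensor_id"] = np.arange(n) % 3      # made-up: three sensors
rows["value"] = rng.random(n)

path = os.path.join(tempfile.mkdtemp(), "sensors.h5")

with tables.open_file(path, mode="w") as h5:
    # Build the table (columns and data) directly from the structured array.
    table = h5.create_table("/", "readings", obj=rows)
    table.flush()

with tables.open_file(path, mode="r") as h5:
    table = h5.root.readings
    # The condition string is evaluated by pyTables against the on-disk table;
    # only the matching rows come back, as a structured NumPy array.
    hits = table.read_where("(sensor_id == 1) & (value > 0.9)")
    mean_hit = hits["value"].mean()
    # An index-based chunk of rows, read straight into memory.
    first_hour = table.read(0, 60)
```

The results are ordinary structured arrays, so a column such as `hits["value"]` can go directly into numpy or scipy.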