I’d like to create 2D histograms in Python from large datasets (100000+ samples) stored in an HDF5 file. I came up with the following code:
import sys
import h5py
import numpy as np
import matplotlib as mpl
import matplotlib.pylab
f = h5py.File(sys.argv[1], 'r')
A = f['A']
T = f['T']
at_hist, xedges, yedges = np.histogram2d(T, A, bins=500)
extent = [yedges[0], yedges[-1], xedges[0], xedges[-1]]
fig = mpl.pylab.figure()
at_plot = fig.add_subplot(111)
at_plot.imshow(at_hist, extent=extent, origin='lower', aspect='auto')
mpl.pylab.show()
f.close()
It takes about 15 s to execute (100000 data points). CERN’s ROOT, however (using its own tree data structure instead of HDF5), can do this in less than 1 s. Do you have any idea how I could speed up the code? I could also change the structure of the HDF5 data if that would help.
I would try a few different things.
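First, load your data from the HDF5 file into memory instead of passing h5py's array-like objects straight to histogram2d.
Second, if that doesn't fix the problem, you can use a scipy.sparse.coo_matrix to make the 2D histogram. With older versions of numpy, digitize (which all of the various histogram* functions use internally) could use excessive memory under some circumstances. It's no longer the case with recent (>1.5??) versions of numpy, though.
As an example of the first suggestion, you'd do something like: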
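A minimal sketch of reading the datasets into memory before histogramming. The dataset names 'A' and 'T' come from the question; the helper names `load_into_memory` and `histogram_from_file` are made up for illustration, and slicing with `[...]` is just one way to pull a whole h5py dataset into a NumPy array.

```python
import h5py
import numpy as np


def load_into_memory(path, names=("A", "T")):
    """Read the named datasets fully into NumPy arrays (hypothetical helper)."""
    with h5py.File(path, "r") as f:
        # Indexing a dataset with [...] copies all of it from disk into an ndarray.
        return tuple(f[name][...] for name in names)


def histogram_from_file(path, bins=500):
    """Build the 2D histogram from in-memory copies of 'A' and 'T'."""
    A, T = load_into_memory(path)
    # histogram2d now sees plain ndarrays rather than h5py Dataset objects.
    return np.histogram2d(T, A, bins=bins)
```

The rest of the plotting code from the question should work unchanged on the returned histogram and edges.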
The difference here is that the entirety of the arrays will be read into memory, instead of being h5py's array-like objects, which are basically efficient memory-mapped arrays stored on disk.
As for the second suggestion, don't try it unless the first suggestion didn't help your problem. It probably won't be significantly faster (and is likely slower for small arrays), and with recent versions of numpy, it's only slightly more memory-efficient. I do have a piece of code where I deliberately do this, but I wouldn't recommend it in general. It's a very hackish solution. In very select circumstances (many points and many bins), it can perform better than histogram2d, though.
All those caveats aside, though, here it is:
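The following is only a rough sketch of the coo_matrix idea, not necessarily the exact code referred to above. It assumes equal-width bins spanning the data range, and the function name `sparse_histogram2d` is made up for illustration. It relies on the fact that duplicate (row, column) entries in a coo_matrix are summed when the matrix is converted to a dense array.

```python
import numpy as np
from scipy.sparse import coo_matrix


def sparse_histogram2d(x, y, bins=10):
    """2D histogram built by letting a sparse COO matrix sum duplicate indices.

    Assumes x and y are 1D, equally long, and each spans a non-zero range.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    nx, ny = (bins, bins) if np.isscalar(bins) else bins

    xedges = np.linspace(x.min(), x.max(), nx + 1)
    yedges = np.linspace(y.min(), y.max(), ny + 1)

    # Map each sample to an integer bin index; clipping puts the maximum
    # value into the last bin instead of one past it.
    xi = np.clip(((x - xedges[0]) / (xedges[-1] - xedges[0]) * nx).astype(int), 0, nx - 1)
    yi = np.clip(((y - yedges[0]) / (yedges[-1] - yedges[0]) * ny).astype(int), 0, ny - 1)

    # Every sample contributes a weight of 1 at (xi, yi); toarray() adds up
    # entries that share the same index, which yields the bin counts.
    counts = coo_matrix((np.ones(x.size), (xi, yi)), shape=(nx, ny)).toarray()
    return counts, xedges, yedges
```

Samples that fall exactly on a bin edge may be assigned differently than by np.histogram2d because of floating-point rounding, so compare the two on your own data before relying on this.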
On my system, this yields: