I’d like to create 2d histograms in python from large datasets (100000+ samples) stored

Question

0

Asked: May 28, 20262026-05-28T02:44:52+00:00 2026-05-28T02:44:52+00:00

I’d like to create 2d histograms in python from large datasets (100000+ samples) stored

0

I’d like to create 2d histograms in python from large datasets (100000+ samples) stored in a HDF5 file. I came up with the following code:

import sys
import h5py
import numpy as np
import matplotlib as mpl
import matplotlib.pylab

f = h5py.File(sys.argv[1], 'r')

A = f['A']
T = f['T']

at_hist, xedges, yedges = np.histogram2d(T, A, bins=500)
extent = [yedges[0], yedges[-1], xedges[0], xedges[-1]]

fig = mpl.pylab.figure()
at_plot = fig.add_subplot(111)

at_plot.imshow(at_hist, extent=extent, origin='lower', aspect='auto')

mpl.pylab.show()

f.close()

It takes about 15s to execute (100000 data points). CERN’s Root however (using its own tree data structure instead of HDF5) can do this in less than 1s. Do you have any idea how I could speed up the code? I could also change the structure of the HDF5 data if it would be helpful.

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-28T02:44:52+00:00

I would try a few different things.

Load your data from the hdf file instead of passing in what are effectively memory-mapped arrays.
If that doesn’t fix the problem, you can exploit a scipy.sparse.coo_matrix to make the 2D histogram. With older versions of numpy, digitize (which all of the various histogram* functions use internally) could use excessive memory under some circumstances. It’s no longer the case with recent (>1.5??) versions of numpy, though.

As an example of the first suggestion, you’d do something like:

f = h5py.File(sys.argv[1], 'r')
A = np.empty(f['A'].shape, f['A'].dtype)
T = np.empty(f['T'].shape, f['T'].dtype)
f['A'].read_direct(A)
f['T'].read_direct(T)

The difference here is that the entirety of the arrays will be read into memory, instead of being h5py‘s array-like objects, which are basically efficient memory-mapped arrays stored on disk.

As for the second suggestion, don’t try it unless the first suggestion didn’t help your problem.

It probably won’t be significantly faster (and is likely slower for small arrays), and with recent versions of numpy, it’s only slightly more memory-efficient. I do have a piece of code where I deliberately do this, but I wouldn’t recommend it in general. It’s a very hackish solution. In very select circumstances (many points and many bins), it can preform better than histogram2d, though.

All those caveats aside, though, here it is:

import numpy as np
import scipy.sparse
import timeit

def generate_data(num):
    x = np.random.random(num)
    y = np.random.random(num)
    return x, y

def crazy_histogram2d(x, y, bins=10):
    try:
        nx, ny = bins
    except TypeError:
        nx = ny = bins
    xmin, xmax = x.min(), x.max()
    ymin, ymax = y.min(), y.max()
    dx = (xmax - xmin) / (nx - 1.0)
    dy = (ymax - ymin) / (ny - 1.0)

    weights = np.ones(x.size)

    # Basically, this is just doing what np.digitize does with one less copy
    xyi = np.vstack((x,y)).T
    xyi -= [xmin, ymin]
    xyi /= [dx, dy]
    xyi = np.floor(xyi, xyi).T

    # Now, we'll exploit a sparse coo_matrix to build the 2D histogram...
    grid = scipy.sparse.coo_matrix((weights, xyi), shape=(nx, ny)).toarray()

    return grid, np.linspace(xmin, xmax, nx), np.linspace(ymin, ymax, ny)

if __name__ == '__main__':
    num=1e6
    numruns = 1
    x, y = generate_data(num)
    t1 = timeit.timeit('crazy_histogram2d(x, y, bins=500)',
            setup='from __main__ import crazy_histogram2d, x, y',
            number=numruns)
    t2 = timeit.timeit('np.histogram2d(x, y, bins=500)',
            setup='from __main__ import np, x, y',
            number=numruns)
    print 'Average of %i runs, using %.1e points' % (numruns, num)
    print 'Crazy histogram', t1 / numruns, 'sec'
    print 'numpy.histogram2d', t2 / numruns, 'sec'

On my system, this yields:

Average of 10 runs, using 1.0e+06 points
Crazy histogram 0.104092288017 sec
numpy.histogram2d 0.686891794205 sec

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I’d like to create 2d histograms in python from large datasets (100000+ samples) stored

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply