I originally asked a related question here, but didn’t really get anywhere. Perhaps rephrasing part of it more specifically will help.
I have files stored using Matlab’s sparse format (HDF5, csc I
believe), and I’m trying to use Pytables to operate on them directly,
but haven’t succeeded yet. Using h5py I can do the following:
# Method 1: uses h5py (WORKS)
import h5py
import scipy.sparse
f1 = h5py.File(fname)
data = f1['M']['data']
ir = f1['M']['ir']
jc = f1['M']['jc']
M = scipy.sparse.csc_matrix((data, ir, jc))
but if I try to do the equivalent in Pytables:
# Method 2: uses pyTables (DOESN'T WORK)
import tables
import scipy.sparse
f2 = tables.openFile(fname)
data = f2.root.M.data
ir = f2.root.M.ir
jc = f2.root.M.jc
M = scipy.sparse.csc_matrix((data, ir, jc))
this fails (after a long wait) with the error:
TypeError Traceback (most recent call last)
/home/tdiethe/BMJ/<ipython console> in <module>()
/usr/lib/python2.6/dist-packages/scipy/sparse/compressed.pyc in __init__(self, arg1, shape, dtype, copy, dims, nzmax)
56 self.indices = np.array(indices, copy=copy)
57 self.indptr = np.array(indptr, copy=copy)
---> 58 self.data = np.array(data, copy=copy, dtype=getdtype(dtype, data))
59 else:
60 raise ValueError, "unrecognized %s_matrix constructor usage" %\
/usr/lib/python2.6/dist-packages/scipy/sparse/sputils.pyc in getdtype(dtype, a, default)
69 canCast = False
70 else:
---> 71 raise TypeError, "could not interpret data type"
72 else:
73 newdtype = np.dtype(dtype)
TypeError: could not interpret data type
Looking at f2:
In [63]: f2.root.M.data
Out[63]:
/M/data (CArray(4753606,), zlib(3)) ''
atom := Float64Atom(shape=(), dflt=0.0)
maindim := 0
flavor := 'numpy'
byteorder := 'little'
chunkshape := (8181,)
In [64]: f2.root.M.ir
Out[64]:
/M/ir (CArray(4753606,), zlib(3)) ''
atom := UInt64Atom(shape=(), dflt=0)
maindim := 0
flavor := 'numpy'
byteorder := 'little'
chunkshape := (8181,)
In [65]: f2.root.M.jc
Out[65]:
/M/jc (CArray(133339,), zlib(3)) ''
atom := UInt64Atom(shape=(), dflt=0)
maindim := 0
flavor := 'numpy'
byteorder := 'little'
chunkshape := (7843,)
I have two questions:
- how do I load this file in using pytables
- do I need to convert it to a scipy sparse matrix in order to perform operations on it, or can I perform operations (matrix multiplication etc.) directly on the disk files, i.e. without loading the file into memory? If not, what’s the point of using pytables?
I missed seeing this in your original post but I think your issue is in the design of PyTables, which provides an extra level of abstraction on top of the underlying data.
Consider the following:
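A minimal sketch of this behaviour, assuming PyTables 3.x (where the call is spelled tables.open_file) and a small stand-in file whose group M and nodes data, ir and jc mirror the layout shown in the question. The file name, the 3x3 matrix, the cast to int64 and the explicit shape argument are illustrative assumptions, not taken from the original file:

```python
import os
import tempfile

import numpy as np
import scipy.sparse
import tables

# Small stand-in matrix (hypothetical); the real file holds a much larger one.
dense = np.array([[1.0, 0.0, 2.0],
                  [0.0, 0.0, 3.0],
                  [4.0, 5.0, 0.0]])
reference = scipy.sparse.csc_matrix(dense)

# Write the three CSC arrays under /M, using uint64 indices as in the question.
path = os.path.join(tempfile.mkdtemp(), "sparse_demo.h5")  # hypothetical path
with tables.open_file(path, mode="w") as h5:
    group = h5.create_group("/", "M")
    h5.create_carray(group, "data", obj=reference.data)
    h5.create_carray(group, "ir", obj=reference.indices.astype(np.uint64))
    h5.create_carray(group, "jc", obj=reference.indptr.astype(np.uint64))

with tables.open_file(path, mode="r") as h5:
    node = h5.root.M.data                            # a PyTables CArray node
    node_is_ndarray = isinstance(node, np.ndarray)   # the node is not an ndarray
    data = node[:]                                   # slicing reads into a numpy array
    ir = h5.root.M.ir.read().astype(np.int64)        # .read() also returns a numpy array
    jc = h5.root.M.jc.read().astype(np.int64)

# With real numpy arrays, the scipy constructor accepts the triplet.
M = scipy.sparse.csc_matrix((data, ir, jc), shape=dense.shape)
```

Note that this reads all three arrays into memory before scipy sees them; only the node objects themselves are lazy.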
In your second method, data is not a numpy array at the point where it is passed to scipy. A tables.array.Array does not immediately load the underlying array, nor does it immediately expose array-like functionality. This is what led to the error when you tried to use these objects to create a sparse array in scipy.
Instead, the data object produced by PyTables is intended to provide access to the data through additional commands (for example, by indexing with [...]). You can access part of the data by slicing, or all of it with data[:] or data.read(). It is only at this point that the familiar numpy array is produced.
For more information on the tables.array.Array class see http://pytables.github.com/usersguide/libref.html#the-array-class or the "Getting actual data" section at http://www.pytables.org/moin/HowToUse for examples of accessing underlying data.
In comparison, h5py produces much more array-like objects, although still not numpy arrays. You can immediately do numpy operations on data, like your call to scipy, or simpler operations like np.cos(data) or data + np.arange(len(data)). It also appears that the data object has some familiar numpy-like attributes (e.g. shape) and that the underlying data (a numpy.ndarray) is stored at data.value. However, I’m not familiar with h5py as I haven’t used it myself, so I’m not sure what the limitations are in this regard.
In general it appears that PyTables and h5py have different design goals and therefore should be used in different ways. h5py offers a more NumPy-like interface to HDF5 files, while PyTables offers more complex, database-like operations. See the discussion of the differences in the h5py and PyTables docs and on the Enthought mailing list.
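For comparison, here is a hedged sketch of the h5py side, again using a small stand-in file rather than the real one. The file name, the 3x3 matrix, the cast to int64 and the explicit shape argument are assumptions made for illustration. The dataset object exposes shape and supports slicing, and indexing it returns a numpy array:

```python
import os
import tempfile

import h5py
import numpy as np
import scipy.sparse

# Small stand-in matrix (hypothetical), stored as CSC arrays under /M.
dense = np.array([[1.0, 0.0, 2.0],
                  [0.0, 0.0, 3.0],
                  [4.0, 5.0, 0.0]])
reference = scipy.sparse.csc_matrix(dense)

path = os.path.join(tempfile.mkdtemp(), "sparse_demo_h5py.h5")  # hypothetical path
with h5py.File(path, "w") as f:
    group = f.create_group("M")
    group.create_dataset("data", data=reference.data)
    group.create_dataset("ir", data=reference.indices.astype(np.uint64))
    group.create_dataset("jc", data=reference.indptr.astype(np.uint64))

with h5py.File(path, "r") as f:
    dset = f["M"]["data"]                            # an h5py Dataset
    dset_is_ndarray = isinstance(dset, np.ndarray)   # array-like, but not an ndarray
    dset_shape = dset.shape                          # numpy-like attribute
    first_two = dset[:2]                             # partial read from disk
    data = dset[:]                                   # full read into a numpy array
    ir = f["M"]["ir"][:].astype(np.int64)
    jc = f["M"]["jc"][:].astype(np.int64)

M = scipy.sparse.csc_matrix((data, ir, jc), shape=dense.shape)
```

Reading with [:] (or [()]) is used here instead of the data.value attribute mentioned above, because .value was removed in h5py 3.0.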