I have a big array (1000 x 500000 x 6) that is stored in a PyTables file. The calculations I run on it are fairly well optimized for speed, but slicing the array is what takes the most time.
At the beginning of the script, I need to get a subset of the rows: reduced_data = data[row_indices, :, :]. Then, for this reduced dataset, I need to access:
- columns one by one: reduced_data[:,clm_indice,:]
- a subset of the columns: reduced_data[:,clm_indices,:]
Getting these arrays takes forever. Is there any way to speed this up, for example by storing the data differently?
You can try choosing the chunkshape of your array wisely, see: http://pytables.github.com/usersguide/libref.html#tables.File.createCArray
This option controls the order in which the data is physically stored in the file, so it might help to speed up access.
With some luck, for your data access pattern, something like chunkshape=(1000, 1, 6) might work.
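As a rough sketch of what that could look like, the snippet below creates a chunked array whose chunks each hold every row of a single column, then reads columns back. It assumes a recent PyTables, where the call is named create_carray (createCArray in the older API linked above), and it uses a much smaller stand-in shape, a float64 dtype, and a made-up file and node name, none of which come from the question.

```python
import os
import tempfile

import numpy as np
import tables

# Small stand-in shape; the array in the question is 1000 x 500000 x 6.
shape = (100, 500, 6)
# Hypothetical file name, written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "chunk_demo.h5")

with tables.open_file(path, mode="w") as h5:
    # One chunk = all rows of one column, so reading a column touches one chunk.
    data = h5.create_carray(
        h5.root,
        "data",  # hypothetical node name
        atom=tables.Float64Atom(),  # assumed dtype
        shape=shape,
        chunkshape=(shape[0], 1, shape[2]),
    )
    data[:] = np.arange(np.prod(shape), dtype=np.float64).reshape(shape)

with tables.open_file(path, mode="r") as h5:
    data = h5.root.data
    stored_chunkshape = data.chunkshape

    # A single column: every row, one column index, all 6 values.
    one_column = data[:, 3, :]

    # A subset of columns, read one at a time and stacked along axis 1.
    clm_indices = [3, 7, 42]
    some_columns = np.stack([data[:, j, :] for j in clm_indices], axis=1)
```

With the shape from the question and float64 values, a (1000, 1, 6) chunk would be about 48 KB. Whether this layout actually helps depends on the data and the machine, so it is worth timing it against the current file before converting everything.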