I’m currently exploring HDF5. I’ve read the interesting comments in the thread “Evaluating HDF5”, and I understand that HDF5 is a solution of choice for storing data, but how do you query it? For example, say I have a big file containing some identifiers: is there a way to quickly check whether a given identifier is present in the file?
I think the answer is “not directly”.
Here are some of the ways I think you could achieve the functionality.
Use groups:
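A hierarchy of groups could be used as a radix tree to store the data: each group name holds one chunk of the identifier, so checking for an identifier means checking whether the corresponding group path exists. This probably doesn’t scale too well though, since every identifier adds groups, and each group carries its own metadata.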
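To make the radix tree idea concrete, here is a minimal sketch. It only shows how an identifier could map to a nested group path; a nested dict stands in for the HDF5 group hierarchy. The two-character chunk size and the names `chunks`, `group_path`, `insert` and `contains` are my own assumptions for illustration, not anything HDF5 provides.

```python
def chunks(identifier, size=2):
    """Split an identifier into fixed-size pieces, one per group level."""
    return [identifier[i:i + size] for i in range(0, len(identifier), size)]


def group_path(identifier, size=2):
    """Path of nested groups that would represent this identifier."""
    return "/" + "/".join(chunks(identifier, size))


END = ""  # marker key: an identifier ends at this node (chunks are never empty)


def insert(tree, identifier):
    """Create the nested 'groups' for an identifier (dict stands in for HDF5)."""
    node = tree
    for part in chunks(identifier):
        node = node.setdefault(part, {})
    node[END] = True


def contains(tree, identifier):
    """Walk the hierarchy one level per chunk; fail as soon as a level is missing."""
    node = tree
    for part in chunks(identifier):
        if part not in node:
            return False
        node = node[part]
    return END in node
```

With a real file the walk would be replaced by a path existence check on the file itself, for example `path in f` in h5py. The cost of a lookup then depends on the length of the identifier rather than on the number of identifiers stored.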
Use index datasets:
HDF5 has a reference type which can be used to link from separate index tables back to a main table. After writing the main data, you can write additional datasets, each sorted on a different key and holding references to the matching rows of the main table.
To look up a value in one of these index tables, you will have to write a binary search yourself.
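A minimal sketch of that lookup, using plain Python lists in place of HDF5 datasets. The rows in `main_table` are made up, and a row number stands in for an HDF5 reference; `index_keys`, `index_rows` and `lookup` are hypothetical names. In a real file the two index lists would be stored as their own dataset next to the main table.

```python
from bisect import bisect_left

# Main table in the order it was written (made-up rows: identifier, value).
main_table = [("zeta", 3.0), ("alpha", 1.0), ("mike", 2.0)]

# Index table: keys sorted, each paired with the row it points to.
order = sorted(range(len(main_table)), key=lambda i: main_table[i][0])
index_keys = [main_table[i][0] for i in order]
index_rows = order


def lookup(key):
    """Binary search the sorted keys; return the main-table row number or None."""
    pos = bisect_left(index_keys, key)
    if pos < len(index_keys) and index_keys[pos] == key:
        return index_rows[pos]
    return None
```

For example, `lookup("alpha")` returns `1`, the position of that row in `main_table`, and `lookup("nope")` returns `None`. The search touches only O(log n) index entries, which is the point of keeping the index sorted.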
In-memory index:
Depending on the size of the dataset, it may be just as easy to use an in-memory index that is read from and written to its own dataset using something like Boost.Serialization.
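Roughly what I have in mind, sketched in Python with `json` standing in for the serialization library. The idea is to turn the index into a byte string that can be stored in a dataset of its own and loaded back in one read. The function names and the identifier-to-row layout are assumptions for illustration only.

```python
import json


def build_index(identifiers):
    """Map each identifier to its row number in the main dataset."""
    return {ident: row for row, ident in enumerate(identifiers)}


def dump_index(index):
    """Serialize the index to bytes, ready to be written to its own dataset."""
    return json.dumps(index).encode("utf-8")


def load_index(blob):
    """Rebuild the in-memory index from the bytes read back from the file."""
    return json.loads(blob.decode("utf-8"))
```

Once loaded, a presence check is just `identifier in index`. The trade-off is that the whole index has to fit in memory and must be rewritten whenever the main data changes.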
HDF5-FastQuery:
This paper (and also this page) describe the use of bitmap indices to perform complex queries over an HDF5 dataset. I have not tried this.
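For anyone unfamiliar with bitmap indices, the general idea can be sketched as below. This is not the HDF5-FastQuery API and says nothing about how that project implements or compresses its indices; it is a toy version using Python integers as bitmaps, with hypothetical names `build_bitmaps` and `rows_of`.

```python
def build_bitmaps(column):
    """One bitmap per distinct value; bit i is set when row i holds that value."""
    bitmaps = {}
    for row, value in enumerate(column):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row)
    return bitmaps


def rows_of(bitmap):
    """Decode a bitmap back into the list of matching row numbers."""
    return [row for row in range(bitmap.bit_length()) if (bitmap >> row) & 1]


# Made-up columns of a four-row table.
color = build_bitmaps(["red", "blue", "red", "green"])
size = build_bitmaps(["S", "L", "L", "S"])

# Multi-condition queries become bitwise operations on the bitmaps.
red_and_large = rows_of(color["red"] & size["L"])       # [2]
red_or_green = rows_of(color["red"] | color["green"])   # [0, 2, 3]
```

A query with several conditions is answered by combining bitmaps with AND and OR, without scanning the data itself.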