Indexing in HDF5


HDF Indexing Prototypes Other Indexing Projects
Bitmap Indices Projection Indices FastBit PyTables

Basics:

A projection index is a static data structure used to efficiently subset datasets based on the value of the variable rather than the location of the variable in the file. A projection index consists of a list of a data structure consisting of a value and a list of IDs of all the locations that have that value. In HDF5 the list of IDs can easily be stored using a dataspace. Hence to create a projection index we scan the entire dataset and create a list data structures for each value that consists of the value and its associated dataspace. This list is then sorted so that a binary search can allow us to find any location in logarithmic time with respect to the number of elements in the dataset.

For example given the following 2-dimensional dataset:

1 3 4 5 5
1 2 4 5 5
1 2 3 5 5
1 2 3 4 5
1 2 3 4 6

The projection index would be the following:

1 {{0,0-0,4}}
2 {{1,1-1,4}}
3 {{1,0},{2,2-2,4}}
4 {{2,0-2,1},{3,3-3,4}}
5 {{3,0-3,2},{4,0-4-3}}
6 {{4,4}}

In order to respond to the range query [3,4], we would first use binary search to locate the value 3, then scan down to 4. We will get 2 different dataspaces. We use the OR operation on dataspaces (provided by the HDF5 API) to efficiently create the dataspace corresponding to the query.

Restrictions:

The prototype supports indexing of only one dataset.

Since Projection Indexes are single-dimensional an index is created on only one field of a compound datatype.

Currently the value cell in the index can contain only actual values present in the dataset. In the near future we plan to extend it to handle bins of values too.

Using Projection Indexes:

The first HDF5 index prototype, called the H5IN API, the projection index API consists of 2 function calls. The first one allows the user to create an index while the second one allows the user to query on that index.

H5INcreate function call allows the user to create a new projection index on a specified dataset. The dataset is specified through its location and name. The user also specifies the group in which the index is to be stored. The index is stored in a dataset with the same name as the dataset from where the data is being read if the dataset is atomic and the name of the field being indexed if the dataset is compound. For more details on the function call look at the Reference Manual.

H5INquery function call allows the user to query the index. To do so the user must specify the group that stores the indexes and a list of names of the various indexes, and the bounds associated with each index. This would help when we are indexing compound datatypes and can create multiple indexes for the same dataset. For atomic datasets only one index name would need to be provided along with its bounds. For more details on the function call look at the Reference Manual.

Software:

The source code for the indexing software can be found here.