h5py is a Python interface to the Hierarchical Data Format library, version 5. It provides a mature, stable, open way to store data. The HDF5 tutorial provides an excellent introduction to the basic concepts of HDF5.
Useful utilities included with the HDF5 library:
h5dump (command line HDF5 extraction)
h5stat (command line HDF5 database statistics)
There's also HDFView, which provides a nice graphical interface.
I'll walk through the HDF5 tutorial with h5py to give you a feel for how things work. It may help to keep in mind the following HDF5 to filesystem concept map:
HDF5 | filesystem |
---|---|
dataset | file |
attribute | metadata/header |
group | directory |
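To make the analogy concrete, here is a minimal sketch that builds a tiny hierarchy and walks it the way `find` walks a directory tree. The file, group, and dataset names are made up for illustration, and the `core` driver with `backing_store=False` is only there to keep the file in memory.

```python
import h5py
import numpy

# In-memory file (core driver, no backing store): nothing is written to disk.
f = h5py.File('map.h5', 'w', driver='core', backing_store=False)

g = f.create_group('data')                            # group     ~ directory
d = g.create_dataset('temps', data=numpy.arange(3))   # dataset   ~ file
d.attrs['units'] = 'K'                                # attribute ~ metadata/header

# visit() calls the function once per object name, recursing into groups.
names = []
f.visit(names.append)
full_path = d.name    # absolute path, much like a filesystem path

f.close()
print(names)
print(full_path)
```

Names passed to the `visit()` callback are relative to the group you start from, while `.name` gives the absolute path.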
Creating an HDF5 file
>>> import h5py
>>> f = h5py.File('file.h5', 'w')
>>> f.close()
Which creates
$ h5dump file.h5
HDF5 "file.h5" {
GROUP "/" {
}
}
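If you would rather not call `close()` by hand, `h5py.File` also works as a context manager. A minimal sketch, assuming an in-memory file (the `core` driver arguments are only there so the example leaves nothing on disk):

```python
import h5py

# The with-block closes the file on exit, even if an exception is raised.
with h5py.File('file.h5', 'w', driver='core', backing_store=False) as f:
    n_members = len(f)      # a fresh file has an empty root group
    is_open = bool(f)       # a File object is truthy while it is open

still_open = bool(f)        # and falsy once it has been closed
print(n_members, is_open, still_open)
```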
Creating a dataset
>>> import h5py
>>> import numpy
>>> f = h5py.File('dset.h5', 'w')
>>> f['dset'] = numpy.zeros((6,4), dtype=numpy.int32)
>>> f.close()
Which creates
$ h5dump dset.h5
HDF5 "dset.h5" {
GROUP "/" {
DATASET "dset" {
DATATYPE H5T_STD_I32LE
DATASPACE SIMPLE { ( 6, 4 ) / ( 6, 4 ) }
DATA {
(0,0): 0, 0, 0, 0,
(1,0): 0, 0, 0, 0,
(2,0): 0, 0, 0, 0,
(3,0): 0, 0, 0, 0,
(4,0): 0, 0, 0, 0,
(5,0): 0, 0, 0, 0
}
}
}
}
Reading from and writing to a dataset
>>> import h5py
>>> import numpy
>>> f = h5py.File('dset.h5', 'w')
>>> f['dset'] = numpy.arange(24, dtype=numpy.int32).reshape((4, 6))
>>> dset = f['dset']
>>> dset
<HDF5 dataset "dset": shape (4, 6), type "<i4">
>>> dset[...]
array([[ 0, 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10, 11],
[12, 13, 14, 15, 16, 17],
[18, 19, 20, 21, 22, 23]])
>>> f.close()
Which creates
$ h5dump dset.h5
HDF5 "dset.h5" {
GROUP "/" {
DATASET "dset" {
DATATYPE H5T_STD_I32LE
DATASPACE SIMPLE { ( 4, 6 ) / ( 4, 6 ) }
DATA {
(0,0): 0, 1, 2, 3, 4, 5,
(1,0): 6, 7, 8, 9, 10, 11,
(2,0): 12, 13, 14, 15, 16, 17,
(3,0): 18, 19, 20, 21, 22, 23
}
}
}
}
Creating an attribute
Using our file from the previous example:
>>> import h5py
>>> import numpy
>>> f = h5py.File('dset.h5', 'a')
>>> dset = f['dset']
>>> dset.attrs['Units'] = [100, 200]
>>> f.close()
Which creates
$ h5dump dset.h5
HDF5 "dset.h5" {
GROUP "/" {
DATASET "dset" {
DATATYPE H5T_STD_I32LE
DATASPACE SIMPLE { ( 4, 6 ) / ( 4, 6 ) }
DATA {
(0,0): 0, 1, 2, 3, 4, 5,
(1,0): 6, 7, 8, 9, 10, 11,
(2,0): 12, 13, 14, 15, 16, 17,
(3,0): 18, 19, 20, 21, 22, 23
}
ATTRIBUTE "Units" {
DATATYPE H5T_STD_I32LE
DATASPACE SIMPLE { ( 2 ) / ( 2 ) }
DATA {
(0): 100, 200
}
}
}
}
}
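Reading an attribute back goes through the same `attrs` mapping. The sketch below rebuilds a comparable dataset in memory rather than reopening `dset.h5`, and it sets the attribute dtype explicitly (an assumption, to keep the result platform-independent):

```python
import h5py
import numpy

with h5py.File('dset.h5', 'w', driver='core', backing_store=False) as f:
    data = numpy.arange(24, dtype=numpy.int32).reshape((4, 6))
    dset = f.create_dataset('dset', data=data)
    dset.attrs['Units'] = numpy.array([100, 200], dtype=numpy.int32)

    attr_names = list(dset.attrs.keys())   # attrs behaves like a dict
    has_units = 'Units' in dset.attrs
    units = dset.attrs['Units']            # comes back as a NumPy array

print(attr_names, has_units, units)
```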
Creating a group
>>> import h5py
>>> f = h5py.File('group.h5', 'w')
>>> g = f.create_group('/MyGroup')
>>> g
<HDF5 group "/MyGroup" (0 members)>
>>> f.close()
Which creates
$ h5dump group.h5
HDF5 "group.h5" {
GROUP "/" {
GROUP "MyGroup" {
}
}
}
Creating groups using absolute and relative names
>>> import h5py
>>> f = h5py.File('groups.h5', 'w')
>>> g1 = f.create_group('/MyGroup')
>>> g2 = f.create_group('/MyGroup/Group_A')
>>> g3 = g1.create_group('Group_B')
>>> f.keys()
['MyGroup']
>>> f['MyGroup'].keys()
['Group_A', 'Group_B']
>>> f.close()
Which creates
$ h5dump groups.h5
HDF5 "groups.h5" {
GROUP "/" {
GROUP "MyGroup" {
GROUP "Group_A" {
}
GROUP "Group_B" {
}
}
}
}
Creating datasets in groups
Using our file from the previous example:
>>> import h5py
>>> f = h5py.File('groups.h5', 'a')
>>> f['/MyGroup/dset1'] = [3, 3]
>>> g = f['/MyGroup/Group_A']
>>> g['dset2'] = [2, 10]
>>> f.close()
Which creates
$ h5dump groups.h5
HDF5 "groups.h5" {
GROUP "/" {
GROUP "MyGroup" {
GROUP "Group_A" {
DATASET "dset2" {
DATATYPE H5T_STD_I32LE
DATASPACE SIMPLE { ( 2 ) / ( 2 ) }
DATA {
(0): 2, 10
}
}
}
GROUP "Group_B" {
}
DATASET "dset1" {
DATATYPE H5T_STD_I32LE
DATASPACE SIMPLE { ( 2 ) / ( 2 ) }
DATA {
(0): 3, 3
}
}
}
}
}
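Since groups behave like dictionaries, you can also check whether a path exists and tell groups from datasets without resorting to `h5dump`. A rough sketch, using an in-memory stand-in for `groups.h5`:

```python
import h5py

with h5py.File('groups.h5', 'w', driver='core', backing_store=False) as f:
    g1 = f.create_group('/MyGroup')
    g1.create_group('Group_A')
    f['/MyGroup/dset1'] = [3, 3]

    members = sorted(f['MyGroup'].keys())
    # Map each member name to the class of object stored under it.
    kinds = {name: type(obj).__name__ for name, obj in f['MyGroup'].items()}
    # Membership tests accept full paths, not only direct children.
    has_dset1 = 'MyGroup/dset1' in f
    has_dset2 = 'MyGroup/Group_A/dset2' in f

print(members, kinds, has_dset1, has_dset2)
```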
Reading from or writing to a subset of a dataset
Just use the NumPy slice indexing you're used to.
>>> import h5py
>>> import numpy
>>> f = h5py.File('slice.h5', 'w')
>>> f['IntArray'] = numpy.ones((8, 10))
>>> dset = f['IntArray']
>>> dset[...]
array([[ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])
>>> f['IntArray'][:,5:] = 2
>>> dset[...]
array([[ 1., 1., 1., 1., 1., 2., 2., 2., 2., 2.],
[ 1., 1., 1., 1., 1., 2., 2., 2., 2., 2.],
[ 1., 1., 1., 1., 1., 2., 2., 2., 2., 2.],
[ 1., 1., 1., 1., 1., 2., 2., 2., 2., 2.],
[ 1., 1., 1., 1., 1., 2., 2., 2., 2., 2.],
[ 1., 1., 1., 1., 1., 2., 2., 2., 2., 2.],
[ 1., 1., 1., 1., 1., 2., 2., 2., 2., 2.],
[ 1., 1., 1., 1., 1., 2., 2., 2., 2., 2.]])
>>> dset[1:4,2:6] = 5
>>> f['IntArray'][...]
array([[ 1., 1., 1., 1., 1., 2., 2., 2., 2., 2.],
[ 1., 1., 5., 5., 5., 5., 2., 2., 2., 2.],
[ 1., 1., 5., 5., 5., 5., 2., 2., 2., 2.],
[ 1., 1., 5., 5., 5., 5., 2., 2., 2., 2.],
[ 1., 1., 1., 1., 1., 2., 2., 2., 2., 2.],
[ 1., 1., 1., 1., 1., 2., 2., 2., 2., 2.],
[ 1., 1., 1., 1., 1., 2., 2., 2., 2., 2.],
[ 1., 1., 1., 1., 1., 2., 2., 2., 2., 2.]])
>>> f.close()
Here's an example of altering a scalar value:
>>> import h5py
>>> import numpy
>>> f = h5py.File('scalar.h5', 'w')
>>> f['int'] = 1
>>> dset = f['int']
>>> f['int'][...]
1
>>> f['int'][...] = 2
>>> f['int'][...]
2
>>> f.pop('int')
>>> f.close()
I haven't been able to track down official documentation for the dataset[...] syntax, but it is mentioned in the 1.3 release announcement that Andrew sent to the scipy-user list.
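For comparison, indexing with an empty tuple, `dataset[()]`, also reads the whole dataset, for scalar and array datasets alike. A small sketch with made-up dataset names and an in-memory file:

```python
import h5py

with h5py.File('scalar.h5', 'w', driver='core', backing_store=False) as f:
    f['int'] = 1
    f['vec'] = [1, 2, 3]

    scalar = f['int'][()]       # whole scalar dataset
    whole = f['vec'][...]       # whole array via Ellipsis
    also_whole = f['vec'][()]   # same data via an empty tuple
    part = f['vec'][1:]         # ordinary slice

print(scalar, whole, also_whole, part)
```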
Datatypes
Your array's numpy.dtype will be preserved.
>>> import h5py
>>> f = h5py.File('dtype.h5', 'w')
>>> f['complex'] = 2 + 3j
>>> f['complex'].dtype
dtype('complex128')
>>> type(f['complex'][...])
<type 'complex'>
>>> f['complex array'] = [1 + 2j, 3 + 4j]
>>> f['complex array'].dtype
dtype('complex128')
>>> type(f['complex array'][...])
<type 'numpy.ndarray'>
>>> f.close()
Which creates
$ h5dump dtype.h5
HDF5 "dtype.h5" {
GROUP "/" {
DATASET "complex" {
DATATYPE H5T_COMPOUND {
H5T_IEEE_F64LE "r";
H5T_IEEE_F64LE "i";
}
DATASPACE SCALAR
DATA {
(0): {
2,
3
}
}
}
DATASET "complex array" {
DATATYPE H5T_COMPOUND {
H5T_IEEE_F64LE "r";
H5T_IEEE_F64LE "i";
}
DATASPACE SIMPLE { ( 2 ) / ( 2 ) }
DATA {
(0): {
1,
2
},
(1): {
3,
4
}
}
}
}
}
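If you want a storage type other than the one NumPy would pick for your data, you can pass `dtype` to `create_dataset`. A minimal sketch (dataset names are illustrative, and the file lives in memory only):

```python
import h5py
import numpy

with h5py.File('dtype.h5', 'w', driver='core', backing_store=False) as f:
    # Ask for 16-bit integers explicitly instead of the platform default.
    small = f.create_dataset('small', data=[1, 2, 3], dtype=numpy.int16)
    stored_dtype = small.dtype
    read_back = small[...]

    # Assigning an array keeps that array's own dtype.
    f['floats'] = numpy.array([1.5, 2.5], dtype=numpy.float32)
    float_dtype = f['floats'].dtype

print(stored_dtype, read_back.dtype, float_dtype)
```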
Properties
No examples here...
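Still, for a rough idea of what this looks like from Python: h5py exposes some dataset creation properties as keyword arguments to `create_dataset`. The sketch below is my own illustration rather than a port of the HDF5 tutorial's example; the dataset name, chunk shape, and compression level are arbitrary choices.

```python
import h5py
import numpy

with h5py.File('props.h5', 'w', driver='core', backing_store=False) as f:
    data = numpy.zeros((100, 100), dtype=numpy.int32)
    d = f.create_dataset('compressed', data=data,
                         chunks=(10, 100),       # chunked storage layout
                         compression='gzip',     # gzip (deflate) filter
                         compression_opts=4)     # gzip level
    layout = d.chunks
    codec = d.compression
    level = d.compression_opts
    roundtrip_ok = bool((d[...] == data).all())  # filters are transparent on read

print(layout, codec, level, roundtrip_ok)
```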
Chunking and extendible datasets
Extendible datasets must be chunked.
>>> import h5py
>>> import numpy
>>> f = h5py.File('ext.h5', 'w')
>>> f['simple'] = [1, 2, 3] # not chunked
>>> s = f['simple']
>>> s.chunks is None
True
>>> s.resize((6,))
Traceback (most recent call last):
...
TypeError: Only chunked datasets can be resized
>>> c = f.create_dataset('chunked', (3,), numpy.int32, chunks=(2,), maxshape=(6,))
>>> c.chunks
(2,)
>>> c[:] = [1, 2, 3]
>>> c.resize((6,))
>>> c[...]
array([1, 2, 3, 0, 0, 0])
>>> c.resize((6,2))
Traceback (most recent call last):
...
TypeError: New shape length (2) must match dataset rank (1)
>>> f.close()
The "chunkiness" of data is not listed by h5dump by default,
$ h5dump ext.h5
HDF5 "ext.h5" {
GROUP "/" {
DATASET "chunked" {
DATATYPE H5T_STD_I32LE
DATASPACE SIMPLE { ( 6 ) / ( 6 ) }
DATA {
(0): 1, 2, 3, 0, 0, 0
}
}
DATASET "simple" {
DATATYPE H5T_STD_I32LE
DATASPACE SIMPLE { ( 3 ) / ( 3 ) }
DATA {
(0): 1, 2, 3
}
}
}
}
but it is preserved.
>>> f = h5py.File('ext.h5', 'a')
>>> f['chunked'].chunks
(2,)
>>> f['simple'].chunks is None
True
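If you don't know the final size in advance, `maxshape` accepts `None` for an axis that should be allowed to grow without limit. A minimal sketch, assuming an in-memory file and a made-up dataset name:

```python
import h5py
import numpy

with h5py.File('ext.h5', 'w', driver='core', backing_store=False) as f:
    d = f.create_dataset('growable', (3,), dtype=numpy.int32,
                         chunks=(2,), maxshape=(None,))   # unlimited along axis 0
    d[:] = [1, 2, 3]
    d.resize((5,))          # grow the dataset; new elements start at the fill value
    d[3:] = [4, 5]
    grown = d[...]
    maxshape = d.maxshape

print(grown, maxshape)
```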