h5mread 0.99.4
h5mread is an R/Bioconductor package that allows fast and
memory-efficient loading of HDF5 data into R. The main function in the
package is h5mread()
which allows reading arbitrary data from an HDF5
dataset into R, similarly to what the h5read()
function from the
rhdf5 package does. In the case of h5mread()
, the
implementation has been optimized to make it as fast and memory-efficient
as possible.
In addition to the h5mread()
function, the package also provides the
following low-level functionality:
h5dim()
and h5chunkdim()
;
Utility functions to manipulate the dimnames of an HDF5 dataset;
H5File objects to facilitate access to remote HDF5 files like files stored in an Amazon S3 bucket.
Note that the primary use case for the h5mread package is to support higher-level functionality implemented in the HDF5Array package.
Like any other Bioconductor package, h5mread should always
be installed with BiocManager::install()
:
if (!require("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install("h5mread")
Load the package:
library(h5mread)
h5mread()
is an efficient and flexible alternative to rhdf5::h5read()
.
Note that we’ll use writeHDF5Array()
from the HDF5Array
package to conveniently create the HDF5 datasets used in the examples below.
Create a 70,000 x 1,500 random dataset:
set.seed(2009)
m0 <- matrix(runif(105e6), ncol=1500) # 70,000 x 1,500 matrix
temp0_h5 <- tempfile(fileext=".h5")
HDF5Array::writeHDF5Array(m0, temp0_h5, "m0", chunkdim=c(100, 100))
Load 1,000 random rows from the HDF5 dataset:
h5ls(temp0_h5)
## group name otype dclass dim
## 0 / m0 H5I_DATASET FLOAT 70000 x 1500
nrow0 <- h5dim(temp0_h5, "m0")[[1L]]
starts <- list(sample(nrow0, 1000), NULL)
m <- h5mread(temp0_h5, "m0", starts=starts)
See ?h5mread
for more information and additional examples.
Sanity check:
stopifnot(identical(m, m0[starts[[1L]], ]))
The HDF5 format doesn’t natively support sparse data representation. However, because it stores the data in compressed chunks, sparse data gets efficiently compressed in the HDF5 file.
The more sparse the data, the smaller the resulting HDF5 file:
a1 <- poissonSparseArray(c(6100, 960, 75), density=0.5)
a2 <- poissonSparseArray(c(6100, 960, 75), density=0.05)
temp1_h5 <- tempfile(fileext=".h5")
HDF5Array::writeHDF5Array(a1, temp1_h5, "a1", chunkdim=c(50, 40, 5), level=5)
temp2_h5 <- tempfile(fileext=".h5")
HDF5Array::writeHDF5Array(a2, temp2_h5, "a2", chunkdim=c(50, 40, 5), level=5)
file.size(temp1_h5)
## [1] 128949150
file.size(temp2_h5)
## [1] 38533754
However, the small size on disk won’t translate into a small size in memory if the data gets loaded back as an ordinary array:
a21 <- h5mread(temp2_h5, "a2") # not memory-efficient
object.size(a21)
## 1756800224 bytes
To keep memory usage as low as possible, h5mread()
can load the data in
a SparseArray derivative from the SparseArray package. This
is achieved by setting its as.sparse
argument to TRUE
:
a22 <- h5mread(temp2_h5, "a2", as.sparse=TRUE) # memory-efficient
object.size(a22)
## 351486208 bytes
Note that the data is loaded as a COO_SparseArray object:
class(a22)
## [1] "COO_SparseArray"
## attr(,"package")
## [1] "SparseArray"
See ?h5mread
for more information and additional examples.
Sanity checks:
stopifnot(
identical(a21, as.array(a2)),
is(a22, "COO_SparseArray"),
identical(as.array(a22), as.array(a2))
)
The COO_SparseArray class is one of the two SparseArray concrete subclasses defined in the SparseArray package, the other one being SVT_SparseArray. Note that the latter tends to be even more memory-efficient than the former and to achieve better performance overall.
Use coercion to switch back and forth between the two representations:
a22 <- as(a22, "SVT_SparseArray")
class(a22)
## [1] "SVT_SparseArray"
## attr(,"package")
## [1] "SparseArray"
About half the memory footprint of the COO_SparseArray representation:
object.size(a22)
## 188106768 bytes
See ?SparseArray
in the SparseArray package for more
information.
Two convenience functions to obtain the dimensions of an HDF5 dataset as well as the dimensions of its chunks.
See ?h5dim
for more information and some examples.
A small set of low-level utilities is provided to manipulate the dimnames of an HDF5 dataset.
See ?h5writeDimnames
for more information and some examples.
Use an H5File object to access an HDF5 file stored in an Amazon S3 bucket.
See ?H5File
for more information and some examples.
sessionInfo()
## R Under development (unstable) (2025-01-20 r87609)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] h5mread_0.99.4 SparseArray_1.7.4 S4Arrays_1.7.1
## [4] IRanges_2.41.2 abind_1.4-8 S4Vectors_0.45.2
## [7] MatrixGenerics_1.19.1 matrixStats_1.5.0 Matrix_1.7-1
## [10] BiocGenerics_0.53.4 generics_0.1.3 rhdf5_2.51.2
## [13] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] crayon_1.5.3 cli_3.6.3 knitr_1.49
## [4] rlang_1.1.5 xfun_0.50 DelayedArray_0.33.4
## [7] jsonlite_1.8.9 htmltools_0.5.8.1 sass_0.4.9
## [10] rmarkdown_2.29 grid_4.5.0 evaluate_1.0.3
## [13] jquerylib_0.1.4 fastmap_1.2.0 yaml_2.3.10
## [16] lifecycle_1.0.4 Rhdf5lib_1.29.0 bookdown_0.42
## [19] BiocManager_1.30.25 compiler_4.5.0 rhdf5filters_1.19.0
## [22] XVector_0.47.2 lattice_0.22-6 digest_0.6.37
## [25] R6_2.5.1 HDF5Array_1.35.8 bslib_0.8.0
## [28] tools_4.5.0 cachem_1.1.0