1 Introduction
2 Install and load the package
3 The h5mread() function
4 Other functionality provided by the h5mread package
5 Session information

1 Introduction

h5mread is an R/Bioconductor package that allows fast and memory-efficient loading of HDF5 data into R. The main function in the package is h5mread() which allows reading arbitrary data from an HDF5 dataset into R, similarly to what the h5read() function from the rhdf5 package does. In the case of h5mread(), the implementation has been optimized to make it as fast and memory-efficient as possible.

In addition to the h5mread() function, the package also provides the following low-level functionality:

h5dim() and h5chunkdim();
Utility functions to manipulate the dimnames of an HDF5 dataset;
H5File objects to facilitate access to remote HDF5 files like files stored in an Amazon S3 bucket.

Note that the primary use case for the h5mread package is to support higher-level functionality implemented in the HDF5Array package.

2 Install and load the package

Like any other Bioconductor package, h5mread should always be installed with BiocManager::install():

if (!require("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("h5mread")

Load the package:

library(h5mread)

3 The h5mread() function

h5mread() is an efficient and flexible alternative to rhdf5::h5read().

Note that we’ll use writeHDF5Array() from the HDF5Array package to conveniently create the HDF5 datasets used in the examples below.

3.1 Basic example

Create a 70,000 x 1,500 random dataset:

set.seed(2009)
m0 <- matrix(runif(105e6), ncol=1500)  # 70,000 x 1,500 matrix
temp0_h5 <- tempfile(fileext=".h5")
HDF5Array::writeHDF5Array(m0, temp0_h5, "m0", chunkdim=c(100, 100))

Load 1,000 random rows from the HDF5 dataset:

h5ls(temp0_h5)

##   group name       otype dclass          dim
## 0     /   m0 H5I_DATASET  FLOAT 70000 x 1500

nrow0 <- h5dim(temp0_h5, "m0")[[1L]]
starts <- list(sample(nrow0, 1000), NULL)
m <- h5mread(temp0_h5, "m0", starts=starts)

See ?h5mread for more information and additional examples.

Sanity check:

stopifnot(identical(m, m0[starts[[1L]], ]))

3.2 An example involving sparse data

The HDF5 format doesn’t natively support sparse data representation. However, because it stores the data in compressed chunks, sparse data gets efficiently compressed in the HDF5 file.

The more sparse the data, the smaller the resulting HDF5 file:

a1 <- poissonSparseArray(c(6100, 960, 75), density=0.5)
a2 <- poissonSparseArray(c(6100, 960, 75), density=0.05)

temp1_h5 <- tempfile(fileext=".h5")
HDF5Array::writeHDF5Array(a1, temp1_h5, "a1", chunkdim=c(50, 40, 5), level=5)
temp2_h5 <- tempfile(fileext=".h5")
HDF5Array::writeHDF5Array(a2, temp2_h5, "a2", chunkdim=c(50, 40, 5), level=5)

file.size(temp1_h5)

## [1] 128949150

file.size(temp2_h5)

## [1] 38533754

However, the small size on disk won’t translate into a small size in memory if the data gets loaded back as an ordinary array:

a21 <- h5mread(temp2_h5, "a2")  # not memory-efficient
object.size(a21)

## 1756800224 bytes

To keep memory usage as low as possible, h5mread() can load the data in a SparseArray derivative from the SparseArray package. This is achieved by setting its as.sparse argument to TRUE:

a22 <- h5mread(temp2_h5, "a2", as.sparse=TRUE)  # memory-efficient
object.size(a22)

## 351486208 bytes

Note that the data is loaded as a COO_SparseArray object:

class(a22)

## [1] "COO_SparseArray"
## attr(,"package")
## [1] "SparseArray"

See ?h5mread for more information and additional examples.

Sanity checks:

stopifnot(
  identical(a21, as.array(a2)),
  is(a22, "COO_SparseArray"),
  identical(as.array(a22), as.array(a2))
)

3.3 A note about COO_SparseArray objects

The COO_SparseArray class is one of the two SparseArray concrete subclasses defined in the SparseArray package, the other one being SVT_SparseArray. Note that the latter tends to be even more memory-efficient than the former and to achieve better performance overall.

Use coercion to switch back and forth between the two representations:

a22 <- as(a22, "SVT_SparseArray")
class(a22)

## [1] "SVT_SparseArray"
## attr(,"package")
## [1] "SparseArray"

About half the memory footprint of the COO_SparseArray representation:

object.size(a22)

## 188106768 bytes

See ?SparseArray in the SparseArray package for more information.

4 Other functionality provided by the h5mread package

4.1 h5dim() and h5chunkdim()

Two convenience functions to obtain the dimensions of an HDF5 dataset as well as the dimensions of its chunks.

See ?h5dim for more information and some examples.

4.2 Utility functions to manipulate the dimnames of an HDF5 dataset

A small set of low-level utilities is provided to manipulate the dimnames of an HDF5 dataset.

See ?h5writeDimnames for more information and some examples.

4.3 H5File objects

Use an H5File object to access an HDF5 file stored in an Amazon S3 bucket.

See ?H5File for more information and some examples.

5 Session information

sessionInfo()

## R Under development (unstable) (2025-01-20 r87609)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] h5mread_0.99.4        SparseArray_1.7.4     S4Arrays_1.7.1       
##  [4] IRanges_2.41.2        abind_1.4-8           S4Vectors_0.45.2     
##  [7] MatrixGenerics_1.19.1 matrixStats_1.5.0     Matrix_1.7-1         
## [10] BiocGenerics_0.53.4   generics_0.1.3        rhdf5_2.51.2         
## [13] BiocStyle_2.35.0     
## 
## loaded via a namespace (and not attached):
##  [1] crayon_1.5.3        cli_3.6.3           knitr_1.49         
##  [4] rlang_1.1.5         xfun_0.50           DelayedArray_0.33.4
##  [7] jsonlite_1.8.9      htmltools_0.5.8.1   sass_0.4.9         
## [10] rmarkdown_2.29      grid_4.5.0          evaluate_1.0.3     
## [13] jquerylib_0.1.4     fastmap_1.2.0       yaml_2.3.10        
## [16] lifecycle_1.0.4     Rhdf5lib_1.29.0     bookdown_0.42      
## [19] BiocManager_1.30.25 compiler_4.5.0      rhdf5filters_1.19.0
## [22] XVector_0.47.2      lattice_0.22-6      digest_0.6.37      
## [25] R6_2.5.1            HDF5Array_1.35.8    bslib_0.8.0        
## [28] tools_4.5.0         cachem_1.1.0

The h5mread package

Compiled 22 January 2025; Modified 14 January 2025

Package

Contents