derfinderHelper (Collado-Torres, Jaffe, and Leek, 2015) is a small package that was created to speed up the F-statistics approach implemented in the parent package derfinder. It contains a single function, fstats.apply()
, which is used to calculate the F-statistics for a given data matrix, null and an alternative models.
The data is generally arranged in an matrix where the rows (\(n\)) are the genomic features of interest (gene level summaries, exon level summaries, or base-pair data) and the columns (\(m\)) represent the samples. The other two main arguments for fstats.apply()
are the null and alternative model matrices which are \(m \times p_0\) and \(m \times p\) where \(p_0\) is the number of covariates in the null model and \(p\) is the number of covariates in the alternative model. The models have to be nested and thus by definition \(p > p_0\). The end result is a vector of F-statistics with length \(n\), which is run length encoded for memory saving purposes.
Other arguments of fstats.apply()
are related to flow in derfinder such as the scaling factor (scalefac
) used, whether to subset the data (index
), and if the data was separated into chunks and saved to disk to lower the memory load (lowMemDir
).
Implementation-wise, adjustF
is useful when the denominator of the F-statistic calculation is too small. Finally, method
controls how will the F-statistics be calculated.
Matrix
is the recommended option because it uses around half the memory load of regular
and can be faster. Specially if the data was saved in this format previously by derfinder.Rle
uses the least amount of memory but gets very slow as the number of samples increases. Thus making it less than ideal in several cases.regular
uses base R
to calculate the F-statistics and can require a large amount of memory. This is noticeable when using several cores to run fstats.apply()
on different portions of the data.The F-statistics for each feature \(i\) are calculated using the following formula:
\[ F_i = \frac{ (\text{RSS0}_i - \text{RSS1}_i)/(\text{df}_1 - \text{df}_0) }{ \text{adjustF} + (\text{RSS1}_i / (p - p_0 - \text{df_1}))} \]
The following section walks through an example. However, in practice, you will probably not use this package directly and it will be used via derfinder.
First lets create an example data set where we have information for 1000 features and 16 samples where samples 1 to 4 are from group A, 5 to 8 from group B, 9 to 12 from group C, and 13 to 16 from group D.
## Create some toy data
suppressPackageStartupMessages(library('IRanges'))
## Note: the specification for S3 class "AsIs" in package 'BiocGenerics' seems equivalent to one from package 'RJSONIO': not turning on duplicate class definitions for this class.
## Note: the specification for S3 class "connection" in package 'BiocGenerics' seems equivalent to one from package 'RJSONIO': not turning on duplicate class definitions for this class.
## Note: the specification for S3 class "file" in package 'BiocGenerics' seems equivalent to one from package 'RJSONIO': not turning on duplicate class definitions for this class.
## Note: the specification for S3 class "pipe" in package 'BiocGenerics' seems equivalent to one from package 'RJSONIO': not turning on duplicate class definitions for this class.
## Note: the specification for S3 class "textConnection" in package 'BiocGenerics' seems equivalent to one from package 'RJSONIO': not turning on duplicate class definitions for this class.
set.seed(20140923)
toyData <- DataFrame(
'sample1' = Rle(sample(0:10, 1000, TRUE)),
'sample2' = Rle(sample(0:10, 1000, TRUE)),
'sample3' = Rle(sample(0:10, 1000, TRUE)),
'sample4' = Rle(sample(0:10, 1000, TRUE)),
'sample5' = Rle(sample(0:15, 1000, TRUE)),
'sample6' = Rle(sample(0:15, 1000, TRUE)),
'sample7' = Rle(sample(0:15, 1000, TRUE)),
'sample8' = Rle(sample(0:15, 1000, TRUE)),
'sample9' = Rle(sample(0:20, 1000, TRUE)),
'sample10' = Rle(sample(0:20, 1000, TRUE)),
'sample11' = Rle(sample(0:20, 1000, TRUE)),
'sample12' = Rle(sample(0:20, 1000, TRUE)),
'sample13' = Rle(sample(0:100, 1000, TRUE)),
'sample14' = Rle(sample(0:100, 1000, TRUE)),
'sample15' = Rle(sample(0:100, 1000, TRUE)),
'sample16' = Rle(sample(0:100, 1000, TRUE))
)
## Lets say that we have 4 groups
group <- factor(rep(toupper(letters[1:4]), each = 4))
## Note that some groups have higher coverage, we can adjust for this in the model
sampleDepth <- sapply(toyData, sum)
sampleDepth
## sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8
## 4753 5009 4829 4969 7470 7624 7304 7380
## sample9 sample10 sample11 sample12 sample13 sample14 sample15 sample16
## 10387 9644 9795 9748 49419 50509 48726 50448
Next we create the model matrices for our example data set. Lets say that we want to calculate F-statistics comparing the alternative hypothesis that the group coefficients are not 0 versus the null hypothesis that they are equal to 0, when adjusting for the sample depth.
To do so, we create the nested models.
## Build the model matrices
mod <- model.matrix(~ sampleDepth + group)
mod0 <- model.matrix(~ sampleDepth)
## Explore them
mod
## (Intercept) sampleDepth groupB groupC groupD
## 1 1 4753 0 0 0
## 2 1 5009 0 0 0
## 3 1 4829 0 0 0
## 4 1 4969 0 0 0
## 5 1 7470 1 0 0
## 6 1 7624 1 0 0
## 7 1 7304 1 0 0
## 8 1 7380 1 0 0
## 9 1 10387 0 1 0
## 10 1 9644 0 1 0
## 11 1 9795 0 1 0
## 12 1 9748 0 1 0
## 13 1 49419 0 0 1
## 14 1 50509 0 0 1
## 15 1 48726 0 0 1
## 16 1 50448 0 0 1
## attr(,"assign")
## [1] 0 1 2 2 2
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
mod0
## (Intercept) sampleDepth
## 1 1 4753
## 2 1 5009
## 3 1 4829
## 4 1 4969
## 5 1 7470
## 6 1 7624
## 7 1 7304
## 8 1 7380
## 9 1 10387
## 10 1 9644
## 11 1 9795
## 12 1 9748
## 13 1 49419
## 14 1 50509
## 15 1 48726
## 16 1 50448
## attr(,"assign")
## [1] 0 1
Finally, we can calculate the F-statistics using fstats.apply()
.
library('derfinderHelper')
fstats <- fstats.apply(data = toyData, mod = mod, mod0 = mod0, scalefac = 1)
fstats
## numeric-Rle of length 1000 with 1000 runs
## Lengths: 1 1 ... 1
## Values : 4.10142714702069 0.37182716905003 ... 0.409243681977204
We can then proceed to use this information in derfinder or in any way you like.
We created derfinderHelper for calculating F-statistics using SnowParam()
from BiocParallel
(Morgan, Obenchain, Lang, and Thompson, 2015). Using this form of parallelization requires loading the necessary packages in the child processes. Because derfinder takes a long time to load, we shipped off fstats.apply()
to its own package to improve the speed of the calculations while retaining the memory advantages of SnowParam()
over MulticoreParam()
.
Note that transforming the data from a DataFrame
to a dgCMatrix
takes some time, so the most efficient performance is achieved when the data is converted at the beginning instead of at every permutation calculation. This is done in derfinder::preprocessCoverage()
when lowMemDir
is specified.
This package was made possible thanks to:
Code for creating the vignette
## Create the vignette
library('rmarkdown')
system.time(render('derfinderHelper.Rmd', 'BiocStyle::html_document'))
## Extract the R code
library('knitr')
knit('derfinderHelper.Rmd', tangle = TRUE)
## Clean up
file.remove('derfinderHelperRef.bib')
## [1] TRUE
Date the vignette was generated.
## [1] "2015-11-02 20:39:55 PST"
Wallclock time spent generating the vignette.
## Time difference of 4.181 secs
R
session information.
## Session info -----------------------------------------------------------------------------------------------------------
## setting value
## version R version 3.2.2 (2015-08-14)
## system x86_64, linux-gnu
## ui X11
## language en_US:
## collate C
## tz <NA>
## date 2015-11-02
## Packages ---------------------------------------------------------------------------------------------------------------
## package * version date source
## BiocGenerics * 0.16.0 2015-11-03 Bioconductor
## BiocStyle * 1.8.0 2015-11-03 Bioconductor
## IRanges * 2.4.1 2015-11-03 Bioconductor
## Matrix 1.2-2 2015-07-08 CRAN (R 3.2.2)
## R6 2.1.1 2015-08-19 CRAN (R 3.2.2)
## RCurl 1.95-4.7 2015-06-30 CRAN (R 3.2.2)
## RJSONIO 1.3-0 2014-07-28 CRAN (R 3.2.2)
## Rcpp 0.12.1 2015-09-10 CRAN (R 3.2.2)
## RefManageR 0.8.63 2015-06-09 CRAN (R 3.2.2)
## S4Vectors * 0.8.0 2015-11-03 Bioconductor
## XML 3.98-1.3 2015-06-30 CRAN (R 3.2.2)
## bibtex 0.4.0 2014-12-31 CRAN (R 3.2.2)
## bitops 1.0-6 2013-08-17 CRAN (R 3.2.2)
## derfinderHelper * 1.4.1 2015-11-03 Bioconductor
## devtools * 1.9.1 2015-09-11 CRAN (R 3.2.2)
## digest 0.6.8 2014-12-31 CRAN (R 3.2.2)
## evaluate 0.8 2015-09-18 CRAN (R 3.2.2)
## formatR 1.2.1 2015-09-18 CRAN (R 3.2.2)
## htmltools 0.2.6 2014-09-08 CRAN (R 3.2.2)
## httr 1.0.0 2015-06-25 CRAN (R 3.2.2)
## knitcitations * 1.0.7 2015-10-28 CRAN (R 3.2.2)
## knitr 1.11 2015-08-14 CRAN (R 3.2.2)
## lattice 0.20-33 2015-07-14 CRAN (R 3.2.2)
## lubridate 1.3.3 2013-12-31 CRAN (R 3.2.2)
## magrittr 1.5 2014-11-22 CRAN (R 3.2.2)
## memoise 0.2.1 2014-04-22 CRAN (R 3.2.2)
## plyr 1.8.3 2015-06-12 CRAN (R 3.2.2)
## rmarkdown 0.8.1 2015-10-10 CRAN (R 3.2.2)
## stringi 1.0-1 2015-10-22 CRAN (R 3.2.2)
## stringr 1.0.0 2015-04-30 CRAN (R 3.2.2)
## yaml 2.1.13 2014-06-12 CRAN (R 3.2.2)
This vignette was generated using BiocStyle (Morgan, Oleś, and Huber, 2015) with knitr (Xie, 2014) and rmarkdown (Allaire, Cheng, Xie, McPherson, et al., 2015) running behind the scenes.
Citations made with knitcitations
(Boettiger, 2015).
[1] J. Allaire, J. Cheng, Y. Xie, J. McPherson, et al. rmarkdown: Dynamic Documents for R. R package version 0.8.1. 2015. URL: http://CRAN.R-project.org/package=rmarkdown.
[2] D. Bates and M. Maechler. Matrix: Sparse and Dense Matrix Classes and Methods. R package version 1.2-2. 2015. URL: http://CRAN.R-project.org/package=Matrix.
[3] C. Boettiger. knitcitations: Citations for ‘Knitr’ Markdown Files. R package version 1.0.7. 2015. URL: http://CRAN.R-project.org/package=knitcitations.
[4] L. Collado-Torres, A. E. Jaffe and J. T. Leek. derfinderHelper: derfinder helper package. https://github.com/leekgroup/derfinderHelper - R package version 1.4.1. 2015. URL: http://www.bioconductor.org/packages/release/bioc/html/derfinderHelper.html.
[5] M. Lawrence, W. Huber, H. Pagès, P. Aboyoun, et al. “Software for Computing and Annotating Genomic Ranges”. In: PLoS Computational Biology 9 (8 2013). DOI: 10.1371/journal.pcbi.1003118. URL: http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003118}.
[6] M. Morgan, V. Obenchain, M. Lang and R. Thompson. BiocParallel: Bioconductor facilities for parallel evaluation. R package version 1.4.0. 2015.
[7] M. Morgan, A. Oleś and W. Huber. BiocStyle: Standard styles for vignettes and other Bioconductor documents. R package version 1.8.0. 2015. URL: https://github.com/Bioconductor/BiocStyle.
[8] H. Pages, M. Lawrence and P. Aboyoun. S4Vectors: S4 implementation of vectors and lists. R package version 0.8.0. 2015.
[9] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria, 2015. URL: https://www.R-project.org/.
[10] H. Wickham. “testthat: Get Started with Testing”. In: The R Journal 3 (2011), pp. 5–10. URL: http://journal.r-project.org/archive/2011-1/RJournal_2011-1_Wickham.pdf.
[11] H. Wickham and W. Chang. devtools: Tools to Make Developing R Packages Easier. R package version 1.9.1. 2015. URL: http://CRAN.R-project.org/package=devtools.
[12] Y. Xie. “knitr: A Comprehensive Tool for Reproducible Research in R”. In: Implementing Reproducible Computational Research. Ed. by V. Stodden, F. Leisch and R. D. Peng. ISBN 978-1466561595. Chapman and Hall/CRC, 2014. URL: http://www.crcpress.com/product/isbn/9781466561595.