This vignette explains in more detail the relationship between the different functions in derfinder (Collado-Torres, Frazee, Love, Irizarry, et al., 2015) as well as add-on packages derfinderPlot (vignette), derfinderHelper (vignette) and regionReport (vignette). The vignette also includes some bash scripts for running derfinder in a cluster, although there are probably other ways to do so using only R scripts. For example, by using BatchJobsParam()
from BiocParallel.
This vignette assumes that you have read through the introductory vignette.
Lets start by loading the derfinder package.
## Load libraries
library('derfinder')
If you explore the package, you will notice that all exported functions use camel case name and have aliases with underscore names. For example, analyzeChr()
and analyze_chr()
. You should use whichever naming style you prefer. Regrettably, implementing the same flexibility at the argument naming level is not straightforward.
Just like it is shown in the introductory vignette, you can find more information about the advanced arguments using advancedArg()
.
## URLs to advanced arguemtns
sapply(c('analyzeChr', 'loadCoverage'), advancedArg, browse = FALSE)
## analyzeChr
## "https://github.com/search?utf8=%E2%9C%93&q=advanced_argument+filename%3Aanalyze+repo%3Alcolladotor%2Fderfinder+path%3A%2FR&type=Code"
## loadCoverage
## "https://github.com/search?utf8=%E2%9C%93&q=advanced_argument+filename%3Aload+repo%3Alcolladotor%2Fderfinder+path%3A%2FR&type=Code"
## Set browse = TRUE if you want to open them in your browser
These arguments are options we expect not all users will want to change and which would otherwise make the help pages confusing to them. However, all arguments are documented in roxygen2 style in the source code.
Furthermore, note that using the ...
argument allows you to specify some of the documented arguments. For example, you might want to control the maxClusterGap
from findRegions()
in the analyzeChr()
call.
If you are working with data from an organism that is not Homo sapiens, then set the global options defining the species
and the chrsStyle
used. For example, if you are working with Arabidopsis Thaliana and the NCBI naming style, then set the options using the following code:
## Set global species and chrsStyle options
options(species = 'arabidopsis_thaliana')
options(chrsStyle = 'NCBI')
## Then proceed to load and analyze the data
Internally derfinder uses extendedMapSeqlevels()
to use the appropriate chromosome naming style given a species in all functions involving chromosome names.
Further note that the argument subject
from analyzeChr()
is passed to bumphunter::annotateNearest(subject)
. So if you are using a genome different from hg19 remember to provide the appropriate annotation data or simply use analyzeChr(runAnnotation = FALSE)
. You might find the discussion Using bumphunter with non-human genomes useful.
If you are loading data from BAM files, you might want to specify some criteria to decide which reads to include or not. For example, your data might have been generated by a strand-specific protocol. You can do so by specifying the arguments of scanBamFlag()
from Rsamtools.
You can also control whether to include or exclude bases with CIGAR
string D
(deletion from the reference) by setting the advanced argument drop.D = TRUE
in your fullCoverage()
or loadCoverage()
call.
Note that in most scenarios, the fullCov
object illustrated in the introductory vignette can be large in memory. When making plots or calculating the region level coverage, we don’t need the full information. In such situations, it might pay off to create a smaller version by loading only the required data. This can be achieved using the advanced argument which
to fullCoverage()
or loadCoverage()
.
However, it is important to consider that when reading the data from BAM files, a read might align partially inside the region of interest. By default such a read would be discarded and thus the base-level coverage would be lower than what it is in reality. The advanced argument protectWhich
extends regions by 30 kbp (15 kbp each side) to help mitigate this issue.
We can illustrate this issue with the example data from derfinder. First, we load in the data and generate some regions of interest.
## Find some regions to work with
example('loadCoverage', 'derfinder')
##
## ldCvrg> datadir <- system.file('extdata', 'genomeData', package='derfinder')
##
## ldCvrg> files <- rawFiles(datadir = datadir, samplepatt = '*accepted_hits.bam$',
## ldCvrg+ fileterm = NULL)
##
## ldCvrg> ## Shorten the column names
## ldCvrg> names(files) <- gsub('_accepted_hits.bam', '', names(files))
##
## ldCvrg> ## Read and filter the data, only for 2 files
## ldCvrg> dataSmall <- loadCoverage(files = files[1:2], chr = '21', cutoff = 0)
## 2015-11-04 20:48:15 loadCoverage: finding chromosome lengths
## 2015-11-04 20:48:15 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/ERR009101_accepted_hits.bam
## 2015-11-04 20:48:16 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/ERR009102_accepted_hits.bam
## 2015-11-04 20:48:16 loadCoverage: applying the cutoff to the merged data
## 2015-11-04 20:48:16 filterData: originally there were 48129895 rows, now there are 454 rows. Meaning that 100 percent was filtered.
##
## ldCvrg> ## Not run:
## ldCvrg> ##D ## Export to BigWig files
## ldCvrg> ##D createBw(list('chr21' = dataSmall))
## ldCvrg> ##D
## ldCvrg> ##D ## Load data from BigWig files
## ldCvrg> ##D dataSmall.bw <- loadCoverage(c(ERR009101 = 'ERR009101.bw', ERR009102 =
## ldCvrg> ##D 'ERR009102.bw'), chr = 'chr21')
## ldCvrg> ##D
## ldCvrg> ##D ## Compare them
## ldCvrg> ##D mapply(function(x, y) { x - y }, dataSmall$coverage, dataSmall.bw$coverage)
## ldCvrg> ##D
## ldCvrg> ##D ## Note that the only difference is the type of Rle (integer vs numeric) used
## ldCvrg> ##D ## to store the data.
## ldCvrg> ##D
## ldCvrg> ## End(Not run)
## ldCvrg>
## ldCvrg>
## ldCvrg>
example('getRegionCoverage', 'derfinder')
##
## gtRgnC> ## Obtain fullCov object
## gtRgnC> fullCov <- list('21'=genomeDataRaw$coverage)
##
## gtRgnC> ## Assign chr lengths using hg19 information, use only first two regions
## gtRgnC> library('GenomicRanges')
##
## gtRgnC> data(hg19Ideogram, package = 'biovizBase', envir = environment())
##
## gtRgnC> regions <- genomeRegions$regions[1:2]
##
## gtRgnC> seqlengths(regions) <- seqlengths(hg19Ideogram)[names(seqlengths(regions))]
##
## gtRgnC> ## Finally, get the region coverage
## gtRgnC> regionCov <- getRegionCoverage(fullCov=fullCov, regions=regions)
## extendedMapSeqlevels: sequence names mapped from NCBI to UCSC for species homo_sapiens
## 2015-11-04 20:48:16 getRegionCoverage: processing chr21
## 2015-11-04 20:48:16 getRegionCoverage: done processing chr21
Next, we load the coverage again using which
but without any padding. We can see how the coverage is not the same by looking at the maximum coverage for each sample.
## Illustrate reading data from a set of regions
test <- loadCoverage(files = files, chr = '21', cutoff = NULL, which = regions, protectWhich = 0, fileStyle = 'NCBI')
## extendedMapSeqlevels: sequence names mapped from UCSC to NCBI for species homo_sapiens
## 2015-11-04 20:48:17 loadCoverage: finding chromosome lengths
## 2015-11-04 20:48:17 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/ERR009101_accepted_hits.bam
## 2015-11-04 20:48:17 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/ERR009102_accepted_hits.bam
## 2015-11-04 20:48:17 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/ERR009105_accepted_hits.bam
## 2015-11-04 20:48:17 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/ERR009107_accepted_hits.bam
## 2015-11-04 20:48:17 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/ERR009108_accepted_hits.bam
## 2015-11-04 20:48:17 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/ERR009112_accepted_hits.bam
## 2015-11-04 20:48:17 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/ERR009115_accepted_hits.bam
## 2015-11-04 20:48:17 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/ERR009116_accepted_hits.bam
## 2015-11-04 20:48:17 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/ERR009131_accepted_hits.bam
## 2015-11-04 20:48:17 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/ERR009138_accepted_hits.bam
## 2015-11-04 20:48:17 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/ERR009144_accepted_hits.bam
## 2015-11-04 20:48:18 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/ERR009145_accepted_hits.bam
## 2015-11-04 20:48:18 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/ERR009148_accepted_hits.bam
## 2015-11-04 20:48:18 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/ERR009151_accepted_hits.bam
## 2015-11-04 20:48:18 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/ERR009152_accepted_hits.bam
## 2015-11-04 20:48:18 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/ERR009153_accepted_hits.bam
## 2015-11-04 20:48:18 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/ERR009159_accepted_hits.bam
## 2015-11-04 20:48:18 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/ERR009161_accepted_hits.bam
## 2015-11-04 20:48:18 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/ERR009163_accepted_hits.bam
## 2015-11-04 20:48:18 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/ERR009164_accepted_hits.bam
## 2015-11-04 20:48:19 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/ERR009167_accepted_hits.bam
## 2015-11-04 20:48:19 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/SRR031812_accepted_hits.bam
## 2015-11-04 20:48:19 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/SRR031835_accepted_hits.bam
## 2015-11-04 20:48:19 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/SRR031867_accepted_hits.bam
## 2015-11-04 20:48:19 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/SRR031868_accepted_hits.bam
## 2015-11-04 20:48:19 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/SRR031900_accepted_hits.bam
## 2015-11-04 20:48:19 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/SRR031904_accepted_hits.bam
## 2015-11-04 20:48:19 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/SRR031914_accepted_hits.bam
## 2015-11-04 20:48:19 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/SRR031936_accepted_hits.bam
## 2015-11-04 20:48:19 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/SRR031958_accepted_hits.bam
## 2015-11-04 20:48:19 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/SRR031960_accepted_hits.bam
## 2015-11-04 20:48:20 loadCoverage: applying the cutoff to the merged data
## 2015-11-04 20:48:20 filterData: originally there were 48129895 rows, now there are 48129895 rows. Meaning that 0 percent was filtered.
## Some reads were ignored and thus the coverage is lower as can be seen below:
sapply(test$coverage, max) - sapply(genomeDataRaw$coverage, max)
## ERR009101 ERR009102 ERR009105 ERR009107 ERR009108 ERR009112 ERR009115
## 0 0 0 0 -1 0 -1
## ERR009116 ERR009131 ERR009138 ERR009144 ERR009145 ERR009148 ERR009151
## -2 -2 -2 -1 -1 -3 -3
## ERR009152 ERR009153 ERR009159 ERR009161 ERR009163 ERR009164 ERR009167
## 0 -3 -3 -3 -1 -3 -3
## SRR031812 SRR031835 SRR031867 SRR031868 SRR031900 SRR031904 SRR031914
## 0 -1 0 0 0 -1 0
## SRR031936 SRR031958 SRR031960
## 0 0 0
When we re-load the data using some padding to the regions, we find that the coverage matches at all the bases.
## Illustrate reading data from a set of regions
test2 <- loadCoverage(files = files, chr = '21', cutoff = NULL, which =
regions, protectWhich = 3e4, fileStyle = 'NCBI')
## extendedMapSeqlevels: sequence names mapped from UCSC to NCBI for species homo_sapiens
## 2015-11-04 20:48:20 loadCoverage: finding chromosome lengths
## 2015-11-04 20:48:20 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/ERR009101_accepted_hits.bam
## 2015-11-04 20:48:20 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/ERR009102_accepted_hits.bam
## 2015-11-04 20:48:20 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/ERR009105_accepted_hits.bam
## 2015-11-04 20:48:20 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/ERR009107_accepted_hits.bam
## 2015-11-04 20:48:20 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/ERR009108_accepted_hits.bam
## 2015-11-04 20:48:20 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/ERR009112_accepted_hits.bam
## 2015-11-04 20:48:20 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/ERR009115_accepted_hits.bam
## 2015-11-04 20:48:20 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/ERR009116_accepted_hits.bam
## 2015-11-04 20:48:20 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/ERR009131_accepted_hits.bam
## 2015-11-04 20:48:20 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/ERR009138_accepted_hits.bam
## 2015-11-04 20:48:21 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/ERR009144_accepted_hits.bam
## 2015-11-04 20:48:21 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/ERR009145_accepted_hits.bam
## 2015-11-04 20:48:21 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/ERR009148_accepted_hits.bam
## 2015-11-04 20:48:21 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/ERR009151_accepted_hits.bam
## 2015-11-04 20:48:21 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/ERR009152_accepted_hits.bam
## 2015-11-04 20:48:21 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/ERR009153_accepted_hits.bam
## 2015-11-04 20:48:21 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/ERR009159_accepted_hits.bam
## 2015-11-04 20:48:21 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/ERR009161_accepted_hits.bam
## 2015-11-04 20:48:21 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/ERR009163_accepted_hits.bam
## 2015-11-04 20:48:21 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/ERR009164_accepted_hits.bam
## 2015-11-04 20:48:21 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/ERR009167_accepted_hits.bam
## 2015-11-04 20:48:21 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/SRR031812_accepted_hits.bam
## 2015-11-04 20:48:21 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/SRR031835_accepted_hits.bam
## 2015-11-04 20:48:21 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/SRR031867_accepted_hits.bam
## 2015-11-04 20:48:21 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/SRR031868_accepted_hits.bam
## 2015-11-04 20:48:21 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/SRR031900_accepted_hits.bam
## 2015-11-04 20:48:21 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/SRR031904_accepted_hits.bam
## 2015-11-04 20:48:22 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/SRR031914_accepted_hits.bam
## 2015-11-04 20:48:22 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/SRR031936_accepted_hits.bam
## 2015-11-04 20:48:22 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/SRR031958_accepted_hits.bam
## 2015-11-04 20:48:22 loadCoverage: loading BAM file /tmp/RtmpTZemG2/Rinst2fc34ae6931e/derfinder/extdata/genomeData/SRR031960_accepted_hits.bam
## 2015-11-04 20:48:22 loadCoverage: applying the cutoff to the merged data
## 2015-11-04 20:48:22 filterData: originally there were 48129895 rows, now there are 48129895 rows. Meaning that 0 percent was filtered.
## Adding some padding to the regions helps get the same coverage
identical(sapply(test2$coverage, max), sapply(genomeDataRaw$coverage, max))
## [1] TRUE
## A more detailed test reveals that the coverage matches at every base
all(mapply(function(x, y) { identical(x, y) }, test2$coverage, genomeDataRaw$coverage))
## [1] TRUE
How much padding you need to use will depend on your specific data set, and you might be comfortable getting approximately the same coverage values for the sake of greatly reducing the memory resources needed.
If you are under the case where you like to use a specific chromosome naming style but the raw data files use another one, you might need to use the fileStyle
argument.
For example, you could be working with Homo sapiens data and your preferred naming style is UCSC (chr1, chr2, …, chrX, chrY) but the raw data uses NCBI style names (1, 2, …, X, Y). In that case, use fullCoverage(fileStyle = 'NCBI')
or loadCoverage(fileStyle = 'NCBI')
depending if you are loading one chromosome or multiple at a time.
If you prefer to do so, fullCoverage()
and loadCoverage()
can load the data of a chromosome in chunks using GenomicFiles. This is controlled by whether you specify the tilewidth
argument.
Notice that you might run into slight coverage errors near the borders of the tiles for the same reason that was illustrated previously when loading specific regions.
This approach is not necessarily more efficient and can be significantly time consuming if you use a small tilewidth
.
The following figure illustrates how most of derfinder’s functions interact when performing a base-level differential expression analysis by calculating base-level F-statistics.
Flow chart of the different processing steps (black boxes) that can be carried out using derfinder and the functions that perform these actions (in red). Input and output is shown in green boxes. Functions in blue are those applied to the results from multiple chromosomes (mergeResults()
and derfinderReport
). regionReport functions are shown in orange while derfinderPlot functions are shown in dark purple. Purple dotted arrow marks functions that require unfiltered base-level coverage.
analyzeChr()
flow chartThis figure shows in more detail the processing flow in analyzeChr()
, which is the main function for identifying candidate differentially expressed regions (DERs) from the base-level F-statistics.
Many fine-tunning arguments can be passed to analyzeChr()
to feed into the other functions. For example, you might want to use a smaller chunksize
when pre-processing the coverage data (the default is 5 million): specially if you have hundreds of samples.
Another useful argument is scalefac
(by default it’s 32) which controls the scaling factor to use before the log2 transformation.
Furthermore, you might want to specify maxClusterGap
to control the maximum gap between two regions before they are considered to be part of the same cluster.
regionMatrix()
flow chartThe above figure shows the functions used internally by regionMatrix()
and processing steps. Overall, it is much simpler than analyzeChr()
.
Currently, the following functions can use multiple cores, several of which are called inside analyzeChr()
.
calculatePvalues()
: 1 core per chunk of data to process.calculateStats()
: 1 core per chunk of data to process.coerceGR()
: 1 core per chromosome. This function is used by createBw()
.coverageToExon()
: 1 core per strand, then 1 core per chromosome.loadCoverage()
: up to 1 core per tile when loading the data with GenomicFiles. Otherwise, no parallelization is used.fullCoverage()
: 1 core per chromosome. In general, try to avoid using more than 10 cores as you might reach your maximum network speed and/or hard disk input/output seed. For the case described in loadCoverage()
, you can specify how many cores to use per chromosome for the tiles using the mc.cores.load
argument effectively resulting in mc.cores
times mc.cores.load
used (otherwise it’s mc.cores
squared).getRegionCoverage()
: 1 core per chromosome.regionMatrix()
: 1 core per chromosome.All parallel operations use SnowParam()
from BiocParallel when more than 1 core is being used. Otherwise, SerialParam()
is used. Note that if you prefer to specify other types of parallelization you can do so by specifying the BPPARAM.custom
advanced argument.
Because SnowParam()
requires R
to load the necessary packages on each worker, the key function fstats.apply()
was isolated in the derfinderHelper package. This package has much faster loading speeds than derfinder which greatly impacts performance on cases where the actual step of calculating the F-statistics is fast.
You may prefer to use MulticoreParam()
described in the BiocParallel vignette. In that case, when using these functions use BPPARAM.custom = MulticoreParam(workers = x)
where x
is the number of cores you want to use. Note that in some systems, as is the case of the cluster used by derfinder’s developers, the system tools for assessing memory usage can be misleading, thus resulting in much higher memory loads when using MulticoreParam()
instead of the default SnowParam()
.
For each project, we recommend the following organization.
ProjectDir
|-derCoverageInfo
|-derAnalysis
|---analysis01
|---analysis02
That is, a main project directory with two initial directories. One for storing the coverage data, and one for storing each analysis: you might explore different models, cutoffs or other parameters.
You can then use fullCoverage()
, save the result and also save the filtered coverage information for each chromosome separately. Doing so will result in the following structure.
ProjectDir
|-derCoverageInfo
|---Chr1Cov.Rdata
|---Chr2Cov.Rdata
...
|---ChrYCov.Rdata
|---fullCov.Rdata
|-derAnalysis
|---analysis01
|---analysis02
Next, you can use analyzeChr()
for each of the chromosomes of a specific analysis (say analysis01). Doing so will create several Rdata files per chromosome as shown below. bash
scripts can be useful if you wish to submit one cluster job per chromosome. In general, you will use the same model and group information for each chromosome, so saving the information can be useful.
ProjectDir
|-derCoverageInfo
|---Chr1Cov.Rdata
|---Chr2Cov.Rdata
...
|---ChrYCov.Rdata
|---fullCov.Rdata
|-derAnalysis
|---analysis01
|-----models.Rdata
|-----groupInfo.Rdata
|-----chr1/
|-------chunksDir/
|-------logs/
|-------annotation.Rdata
|-------coveragePrep.Rdata
|-------fstats.Rdata
|-------optionsStats.Rdata
|-------regions.Rdata
|-------timeinfo.Rdata
|-----chr2/
...
|-----chrY/
|---analysis02
Then use mergeResults()
to pool together the results from all the chromosomes for a given analysis (here analysis01).
ProjectDir
|-derCoverageInfo
|---Chr1Cov.Rdata
|---Chr2Cov.Rdata
...
|---ChrYCov.Rdata
|---fullCov.Rdata
|-derAnalysis
|---analysis01
|-----logs/
|-----fullAnnotatedRegions.Rdata
|-----fullFstats.Rdata
|-----fullNullSummary.Rdata
|-----fullRegions.Rdata
|-----fullTime.Rdata
|-----optionsMerge.Rdata
|-----chr1/
|-------chunksDir/
|-------logs/
|-------annotation.Rdata
|-------coveragePrep.Rdata
|-------fstats.Rdata
|-------optionsStats.Rdata
|-------regions.Rdata
|-------timeinfo.Rdata
|-----chr2/
...
|-----chrY/
|---analysis02
Finally, you might want to use derfinderReport()
from regionReport to create a HTML report of the results.
For interacting between bash
and R
we have found quite useful the getopt package. Here we include an example R
script that is controlled by a bash
script which submits a job for each chromosome to analyze for a given analysis.
The two files, derfinderAnalysis.R and derAnalysis.sh should live under the derAnalysis directory.
ProjectDir
|-derCoverageInfo
|---Chr1Cov.Rdata
|---Chr2Cov.Rdata
...
|---ChrYCov.Rdata
|---fullCov.Rdata
|-derAnalysis
|---derfinder-Analysis.R
|---derAnalysis.sh
|---analysis01
|---analysis02
Then, you can simply use:
cd /ProjectDir/derAnalysis
sh derAnalysis.sh analysis01
to run analyzeChr()
on all your chromosomes.
## Run derfinder's analysis steps with timing info
## Load libraries
library("getopt")
## Available at https://github.com/lcolladotor/derfinder
library("derfinder")
## Specify parameters
spec <- matrix(c(
'DFfile', 'd', 1, "character", "path to the .Rdata file with the results from loadCoverage()",
'chr', 'c', 1, "character", "Chromosome under analysis. Use X instead of chrX.",
'mcores', 'm', 1, "integer", "Number of cores",
'verbose' , 'v', 2, "logical", "Print status updates",
'help' , 'h', 0, "logical", "Display help"
), byrow=TRUE, ncol=5)
opt <- getopt(spec)
## Testing the script
test <- FALSE
if(test) {
## Speficy it using an interactive R session and testing
test <- TRUE
}
## Test values
if(test){
opt <- NULL
opt$DFfile <- "/ProjectDir/derCoverageInfo/chr21Cov.Rdata"
opt$chr <- "21"
opt$mcores <- 1
opt$verbose <- NULL
}
## if help was asked for print a friendly message
## and exit with a non-zero error code
if (!is.null(opt$help)) {
cat(getopt(spec, usage=TRUE))
q(status=1)
}
## Default value for verbose = TRUE
if (is.null(opt$verbose)) opt$verbose <- TRUE
if(opt$verbose) message("Loading Rdata file with the output from loadCoverage()")
load(opt$DFfile)
## Make it easy to use the name later. Here I'm assuming the names were generated using output='auto' in loadCoverage()
eval(parse(text=paste0("data <- ", "chr", opt$chr, "CovInfo")))
eval(parse(text=paste0("rm(chr", opt$chr, "CovInfo)")))
## Just for testing purposes
if(test) {
tmp <- data
tmp$coverage <- tmp$coverage[1:1e6, ]
library("IRanges")
tmp$position[which(tmp$pos)[1e6 + 1]:length(tmp$pos)] <- FALSE
data <- tmp
}
## Load the models
load("models.Rdata")
## Load group information
load("groupInfo.Rdata")
## Run the analysis with lowMemDir
analyzeChr(chr=opt$chr, coverageInfo=data, models=models, cutoffFstat=1e-06, cutoffType="theoretical", nPermute=1000, seeds=seq_len(1000), maxClusterGap=3000, groupInfo=groupInfo, subject="hg19", mc.cores=opt$mcores, lowMemDir=file.path(tempdir(), paste0("chr", opt$chr) , "chunksDir")), verbose=opt$verbose, chunksize=1e5)
## Done
if(opt$verbose) {
print(proc.time())
print(sessionInfo(), locale=FALSE)
}
Remember to modify the the script to fit your project.
#!/bin/sh
## Usage
# sh derAnalysis.sh analysis01
# Directories
MAINDIR=/ProjectDir
WDIR=${MAINDIR}/derAnalysis
DATADIR=${MAINDIR}/derCoverageInfo
# Define variables
SHORT='derA-01'
PREFIX=$1
# Construct shell files
for chrnum in 22 21 Y 20 19 18 17 16 15 14 13 12 11 10 9 8 X 7 6 5 4 3 2 1
do
echo "Creating script for chromosome ${chrnum}"
if [[ ${chrnum} == "Y" ]]
then
CORES=2
else
CORES=6
fi
chr="chr${chrnum}"
outdir="${PREFIX}/${chr}"
sname="${SHORT}.${PREFIX}.${chr}"
cat > ${WDIR}/.${sname}.sh <<EOF
#!/bin/bash
#$ -cwd
#$ -m e
#$ -l mem_free=120G,h_vmem=60G,h_fsize=10G
#$ -N ${sname}
#$ -pe local ${CORES}
echo "**** Job starts ****"
date
# Create output directory
mkdir -p ${WDIR}/${outdir}
# Make logs directory
mkdir -p ${WDIR}/${outdir}/logs
# run derfinder-analysis.R
cd ${WDIR}/${PREFIX}/
# specific to our cluster
# see http://www.jhpce.jhu.edu/knowledge-base/environment-modules/
module load R/3.1.x
Rscript ${WDIR}/derfinder2-analysis.R -d "${DATADIR}/${chr}Cov.Rdata" -c "${chrnum}" -m ${CORES} -v TRUE
# Move log files into the logs directory
mv ${WDIR}/${sname}.* ${WDIR}/${outdir}/logs/
echo "**** Job ends ****"
date
EOF
call="qsub .${sname}.sh"
$call
done
Your cluster might specify memory requirements differently and you might need to use fewer or more cores depending on your data set.
This vignette covered the most commonly used advanced arguments, details on how to load data, flow charts explaining the relationships between the functions, the recommended output organization, and example shell scripts for running the analysis.
This implementation of derfinder has its origins in Alyssa C. Frazee’s derfinder (Frazee, Sabunciyan, Hansen, Irizarry, et al., 2014). The statistical methods and implementation by now are very different.
Please use:
## Citation info
citation('derfinder')
##
## Collado-Torres L, Frazee AC, Love MI, Irizarry RA, Jaffe AE and
## Leek JT (2015). "derfinder: Software for annotation-agnostic
## RNA-seq differential expression analysis." _bioRxiv_. <URL:
## http://doi.org/10.1101/015370>, <URL:
## http://www.biorxiv.org/content/early/2015/02/19/015370.abstract>.
##
## Frazee AC, Sabunciyan S, Hansen KD, Irizarry RA and Leek JT
## (2014). "Differential expression analysis of RNA-seq data at
## single-base resolution." _Biostatistics_, *15 (3)*, pp. 413-426.
## <URL: http://doi.org/10.1093/biostatistics/kxt053>, <URL:
## http://biostatistics.oxfordjournals.org/content/15/3/413.long>.
This package was made possible thanks to:
Code for creating the vignette
## Create the vignette
library('rmarkdown')
system.time(render('derfinderAdvanced.Rmd', 'BiocStyle::html_document'))
## Extract the R code
library('knitr')
knit('derfinderAdvanced.Rmd', tangle = TRUE)
## Clean up
file.remove('derfinderAdvRef.bib')
## [1] TRUE
Date the vignette was generated.
## [1] "2015-11-04 20:48:22 PST"
Wallclock time spent generating the vignette.
## Time difference of 7.488 secs
R
session information.
## Session info -----------------------------------------------------------------------------------------------------------
## setting value
## version R version 3.2.2 (2015-08-14)
## system x86_64, linux-gnu
## ui X11
## language en_US:
## collate C
## tz <NA>
## date 2015-11-04
## Packages ---------------------------------------------------------------------------------------------------------------
## package * version date source
## AnnotationDbi * 1.32.0 2015-11-05 Bioconductor
## Biobase * 2.30.0 2015-11-05 Bioconductor
## BiocGenerics * 0.16.0 2015-11-05 Bioconductor
## BiocParallel 1.4.0 2015-11-05 Bioconductor
## BiocStyle * 1.8.0 2015-11-05 Bioconductor
## Biostrings 2.38.0 2015-11-05 Bioconductor
## DBI * 0.3.1 2014-09-24 CRAN (R 3.2.2)
## Formula 1.2-1 2015-04-07 CRAN (R 3.2.2)
## GenomeInfoDb * 1.6.1 2015-11-05 Bioconductor
## GenomicAlignments 1.6.1 2015-11-05 Bioconductor
## GenomicFeatures * 1.22.4 2015-11-05 Bioconductor
## GenomicFiles 1.6.0 2015-11-05 Bioconductor
## GenomicRanges * 1.22.0 2015-11-05 Bioconductor
## Hmisc 3.17-0 2015-09-21 CRAN (R 3.2.2)
## IRanges * 2.4.1 2015-11-05 Bioconductor
## MASS 7.3-44 2015-08-30 CRAN (R 3.2.2)
## Matrix 1.2-2 2015-07-08 CRAN (R 3.2.2)
## R6 2.1.1 2015-08-19 CRAN (R 3.2.2)
## RColorBrewer 1.1-2 2014-12-07 CRAN (R 3.2.2)
## RCurl 1.95-4.7 2015-06-30 CRAN (R 3.2.2)
## RJSONIO 1.3-0 2014-07-28 CRAN (R 3.2.2)
## RSQLite * 1.0.0 2014-10-25 CRAN (R 3.2.2)
## Rcpp 0.12.1 2015-09-10 CRAN (R 3.2.2)
## RefManageR 0.8.63 2015-06-09 CRAN (R 3.2.2)
## Rsamtools 1.22.0 2015-11-05 Bioconductor
## S4Vectors * 0.8.1 2015-11-05 Bioconductor
## SummarizedExperiment 1.0.0 2015-11-05 Bioconductor
## TxDb.Hsapiens.UCSC.hg19.knownGene * 3.2.2 2015-10-09 Bioconductor
## XML 3.98-1.3 2015-06-30 CRAN (R 3.2.2)
## XVector 0.10.0 2015-11-05 Bioconductor
## acepack 1.3-3.3 2014-11-24 CRAN (R 3.2.2)
## bibtex 0.4.0 2014-12-31 CRAN (R 3.2.2)
## biomaRt 2.26.0 2015-11-05 Bioconductor
## bitops 1.0-6 2013-08-17 CRAN (R 3.2.2)
## bumphunter * 1.10.0 2015-11-05 Bioconductor
## cluster 2.0.3 2015-07-21 CRAN (R 3.2.2)
## codetools 0.2-14 2015-07-15 CRAN (R 3.2.2)
## colorspace 1.2-6 2015-03-11 CRAN (R 3.2.2)
## derfinder * 1.4.4 2015-11-05 Bioconductor
## derfinderData * 0.104.0 2015-10-16 Bioconductor
## derfinderHelper 1.4.1 2015-11-05 Bioconductor
## devtools * 1.9.1 2015-09-11 CRAN (R 3.2.2)
## digest 0.6.8 2014-12-31 CRAN (R 3.2.2)
## doRNG 1.6 2014-03-07 CRAN (R 3.2.2)
## evaluate 0.8 2015-09-18 CRAN (R 3.2.2)
## foreach * 1.4.3 2015-10-13 CRAN (R 3.2.2)
## foreign 0.8-66 2015-08-19 CRAN (R 3.2.2)
## formatR 1.2.1 2015-09-18 CRAN (R 3.2.2)
## futile.logger 1.4.1 2015-04-20 CRAN (R 3.2.2)
## futile.options 1.0.0 2010-04-06 CRAN (R 3.2.2)
## ggplot2 * 1.0.1 2015-03-17 CRAN (R 3.2.2)
## gridExtra 2.0.0 2015-07-14 CRAN (R 3.2.2)
## gtable 0.1.2 2012-12-05 CRAN (R 3.2.2)
## highr 0.5.1 2015-09-18 CRAN (R 3.2.2)
## htmltools 0.2.6 2014-09-08 CRAN (R 3.2.2)
## httr 1.0.0 2015-06-25 CRAN (R 3.2.2)
## iterators * 1.0.8 2015-10-13 CRAN (R 3.2.2)
## knitcitations * 1.0.7 2015-10-28 CRAN (R 3.2.2)
## knitr * 1.11 2015-08-14 CRAN (R 3.2.2)
## labeling 0.3 2014-08-23 CRAN (R 3.2.2)
## lambda.r 1.1.7 2015-03-20 CRAN (R 3.2.2)
## lattice 0.20-33 2015-07-14 CRAN (R 3.2.2)
## latticeExtra 0.6-26 2013-08-15 CRAN (R 3.2.2)
## locfit * 1.5-9.1 2013-04-20 CRAN (R 3.2.2)
## lubridate 1.3.3 2013-12-31 CRAN (R 3.2.2)
## magrittr 1.5 2014-11-22 CRAN (R 3.2.2)
## matrixStats 0.15.0 2015-10-27 CRAN (R 3.2.2)
## memoise 0.2.1 2014-04-22 CRAN (R 3.2.2)
## munsell 0.4.2 2013-07-11 CRAN (R 3.2.2)
## nnet 7.3-11 2015-08-30 CRAN (R 3.2.2)
## org.Hs.eg.db * 3.2.3 2015-10-11 Bioconductor
## pkgmaker 0.22 2014-05-14 CRAN (R 3.2.2)
## plyr 1.8.3 2015-06-12 CRAN (R 3.2.2)
## proto 0.3-10 2012-12-22 CRAN (R 3.2.2)
## qvalue 2.2.0 2015-11-05 Bioconductor
## registry 0.3 2015-07-08 CRAN (R 3.2.2)
## reshape2 1.4.1 2014-12-06 CRAN (R 3.2.2)
## rmarkdown 0.8.1 2015-10-10 CRAN (R 3.2.2)
## rngtools 1.2.4 2014-03-06 CRAN (R 3.2.2)
## rpart 4.1-10 2015-06-29 CRAN (R 3.2.2)
## rstudioapi 0.3.1 2015-04-07 CRAN (R 3.2.2)
## rtracklayer 1.30.1 2015-11-05 Bioconductor
## scales 0.3.0 2015-08-25 CRAN (R 3.2.2)
## stringi 1.0-1 2015-10-22 CRAN (R 3.2.2)
## stringr 1.0.0 2015-04-30 CRAN (R 3.2.2)
## survival 2.38-3 2015-07-02 CRAN (R 3.2.2)
## xtable 1.8-0 2015-11-02 CRAN (R 3.2.2)
## yaml 2.1.13 2014-06-12 CRAN (R 3.2.2)
## zlibbioc 1.16.0 2015-11-05 Bioconductor
This vignette was generated using BiocStyle (Morgan, Oleś, and Huber, 2015) with knitr (Xie, 2014) and rmarkdown (Allaire, Cheng, Xie, McPherson, et al., 2015) running behind the scenes.
Citations made with knitcitations (Boettiger, 2015).
[1] J. Allaire, J. Cheng, Y. Xie, J. McPherson, et al. rmarkdown: Dynamic Documents for R. R package version 0.8.1. 2015. URL: http://CRAN.R-project.org/package=rmarkdown.
[2] J. D. S. with contributions from Andrew J. Bass, A. Dabney and D. Robinson. qvalue: Q-value estimation for false discovery rate control. R package version 2.2.0. 2015. URL: http://qvalue.princeton.edu/, http://github.com/jdstorey/qvalue.
[3] S. Arora, M. Morgan, M. Carlson and H. Pages. GenomeInfoDb: Utilities for manipulating chromosome and other ‘seqname’ identifiers. R package version 1.6.1. 2015.
[4] C. Boettiger. knitcitations: Citations for ‘Knitr’ Markdown Files. R package version 1.0.7. 2015. URL: http://CRAN.R-project.org/package=knitcitations.
[5] M. Carlson and B. P. Maintainer. TxDb.Hsapiens.UCSC.hg19.knownGene: Annotation package for TxDb object(s). R package version 3.2.2. 2015.
[6] L. Collado-Torres, A. C. Frazee, M. I. Love, R. A. Irizarry, et al. “derfinder: Software for annotation-agnostic RNA-seq differential expression analysis”. In: bioRxiv (2015). DOI: 10.1101/015370. URL: http://www.biorxiv.org/content/early/2015/02/19/015370.abstract.
[7] L. Collado-Torres, A. E. Jaffe and J. T. Leek. derfinderHelper: derfinder helper package. https://github.com/leekgroup/derfinderHelper - R package version 1.4.1. 2015. URL: http://www.bioconductor.org/packages/release/bioc/html/derfinderHelper.html.
[8] A. C. Frazee, S. Sabunciyan, K. D. Hansen, R. A. Irizarry, et al. “Differential expression analysis of RNA-seq data at single-base resolution”. In: Biostatistics 15 (3) (2014), pp. 413-426. DOI: 10.1093/biostatistics/kxt053. URL: http://biostatistics.oxfordjournals.org/content/15/3/413.long.
[9] A. E. Jaffe, P. Murakami, H. Lee, J. T. Leek, et al. “Bump hunting to identify differentially methylated regions in epigenetic epidemiology studies”. In: International journal of epidemiology 41.1 (2012), pp. 200–209. DOI: 10.1093/ije/dyr238.
[10] Jaffe, A. E, Murakami, Peter, et al. “Bump hunting to identify differentially methylated regions in epigenetic epidemiology studies”. In: International Journal of Epidemiology (2012).
[11] F. E. H. Jr, with contributions from Charles Dupont and many others. Hmisc: Harrell Miscellaneous. R package version 3.17-0. 2015. URL: http://CRAN.R-project.org/package=Hmisc.
[12] M. Lawrence, R. Gentleman and V. Carey. “rtracklayer: an R package for interfacing with genome browsers”. In: Bioinformatics 25 (2009), pp. 1841-1842. DOI: 10.1093/bioinformatics/btp328. URL: http://bioinformatics.oxfordjournals.org/content/25/14/1841.abstract.
[13] M. Lawrence, W. Huber, H. Pagès, P. Aboyoun, et al. “Software for Computing and Annotating Genomic Ranges”. In: PLoS Computational Biology 9 (8 2013). DOI: 10.1371/journal.pcbi.1003118. URL: http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003118}.
[14] M. Morgan, V. Obenchain, M. Lang and R. Thompson. BiocParallel: Bioconductor facilities for parallel evaluation. R package version 1.4.0. 2015.
[15] M. Morgan, A. Oleś and W. Huber. BiocStyle: Standard styles for vignettes and other Bioconductor documents. R package version 1.8.0. 2015. URL: https://github.com/Bioconductor/BiocStyle.
[16] M. Morgan, H. Pagès, V. Obenchain and N. Hayden. Rsamtools: Binary alignment (BAM), FASTA, variant call (BCF), and tabix file import. R package version 1.22.0. 2015. URL: http://bioconductor.org/packages/release/bioc/html/Rsamtools.html.
[17] V. Obenchain, M. Love and M. Morgan. GenomicFiles: Distributed computing by file or by range. R package version 1.6.0. 2015.
[18] H. Pages, M. Carlson, S. Falcon and N. Li. AnnotationDbi: Annotation Database Interface. R package version 1.32.0. 2015.
[19] H. Pages, M. Lawrence and P. Aboyoun. S4Vectors: S4 implementation of vectors and lists. R package version 0.8.1. 2015.
[20] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria, 2015. URL: https://www.R-project.org/.
[21] H. Wickham. ggplot2: elegant graphics for data analysis. Springer New York, 2009. ISBN: 978-0-387-98140-6. URL: http://had.co.nz/ggplot2/book.
[22] H. Wickham. “testthat: Get Started with Testing”. In: The R Journal 3 (2011), pp. 5–10. URL: http://journal.r-project.org/archive/2011-1/RJournal_2011-1_Wickham.pdf.
[23] H. Wickham and W. Chang. devtools: Tools to Make Developing R Packages Easier. R package version 1.9.1. 2015. URL: http://CRAN.R-project.org/package=devtools.
[24] Y. Xie. “knitr: A Comprehensive Tool for Reproducible Research in R”. In: Implementing Reproducible Computational Research. Ed. by V. Stodden, F. Leisch and R. D. Peng. ISBN 978-1466561595. Chapman and Hall/CRC, 2014. URL: http://www.crcpress.com/product/isbn/9781466561595.
[25] T. Yin, M. Lawrence and D. Cook. biovizBase: Basic graphic utilities for visualization of genomic data. R package version 1.18.0. 2015.