Author: Zuguang Gu ( z.gu@dkfz.de )
Date: 2015-10-14
Enriched heatmap is a special type of heatmap which visualizes the enrichment of genomic signals on specific target regions. It is broadly used to visualize e.g. how histone marks are enriched to transcription start sites.
There are several tools that can make such heatmap (e.g. ngs.plot or deepTools). Here we implement Enriched heatmap by ComplexHeatmap package. Since this type of heatmap is just a normal heatmap but with some special settings, with the functionality of ComplexHeatmap, it would be much easier to customize the heatmap as well as concatenating to a list of heatmaps to show correspondance between different data sources.
library(EnrichedHeatmap)
Load the example data that we will use for demostration:
set.seed(123)
load(paste0(system.file("extdata", "chr21_test_data.RData", package = "EnrichedHeatmap")))
ls()
## [1] "H3K4me3" "cgi" "genes" "meth" "rpkm"
The example data are all GRanges
objects:
H3K4me3
: coverage for H3K4me3 histone markscgi
: CpG islandsgenes
: genesmeth
: methylationrpkm
: gene expressionIn order to build the vignette fast, the data only include chromosome 21.
Heatmap for the enrichment of H3K4me3 histone mark on TSS:
tss = promoters(genes, upstream = 0, downstream = 1)
tss[1:5]
## GRanges object with 5 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## ENSG00000141956.9 chr21 [43299591, 43299591] -
## ENSG00000141959.12 chr21 [45719934, 45719934] +
## ENSG00000142149.4 chr21 [33245628, 33245628] +
## ENSG00000142156.10 chr21 [47401651, 47401651] +
## ENSG00000142166.8 chr21 [34696734, 34696734] +
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
H3K4me3[1:5]
## GRanges object with 5 ranges and 1 metadata column:
## seqnames ranges strand | coverage
## <Rle> <IRanges> <Rle> | <integer>
## [1] chr21 [9825467, 9825470] * | 10
## [2] chr21 [9825470, 9825488] * | 13
## [3] chr21 [9825488, 9825489] * | 12
## [4] chr21 [9825489, 9825490] * | 13
## [5] chr21 [9825490, 9825493] * | 14
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
mat1 = normalizeToMatrix(H3K4me3, tss, value_column = "coverage",
extend = 5000, mean_mode = "w0", w = 50)
mat1
## Normalize H3K4me3 to tss:
## Upstream 5000bp
## Downstream 5000bp
## Not show target regions
## 720 signal regions
EnrichedHeatmap(mat1, name = "H3K4me3")
normalizeToMatrix()
converts the association between genomic signals (H3K4me3
) and targets(tss
) in to a matrix.
It first splits the extended targets regions ( the extension to upstream and downstream is controlled by extend
argument)
into a list of small windows (the width of the windows is controlled by w
), then overlaps genomic signals to these small windows and calculates
the value for every small window which is the mean value of genomic signals that intersects with the window (the value
corresponds to genomic signals are controlled by value_column
and how to calcualte the mean value is controlled by mean_mode
).
There are several modes for mean_mode
according to different types of genomic signals. It will be explained in latter sections.
EnrichedHeatmap()
returns a EnrichedHeatmap
class instance which is inherited from Heatmap
class,
so parameters and methods for Heatmap
class can be directly applied to EnrichedHeatmap
class.
There is a special column annotation function anno_enriched()
which shows mean values of columns in
the normalized matrix.
EnrichedHeatmap(mat1, col = c("white", "red"), name = "H3K4me3",
top_annotation = HeatmapAnnotation(lines = anno_enriched()),
top_annotation_height = unit(2, "cm"))
Rows can be smoothed by setting smooth
to TRUE
when generating the matrix.
But since data range will change after smoothing,
you need to manually adjust the color mapping if you want to make figures comparable.
In following example, we use colorRamp2()
from circlize package to define
a color mapping function.
library(circlize)
mat1_smoothed = normalizeToMatrix(H3K4me3, tss, value_column = "coverage",
extend = 5000, mean_mode = "w0", w = 50, smooth = TRUE)
# note we set the color range same as the unsmoothed matrix
EnrichedHeatmap(mat1_smoothed, col = colorRamp2(range(mat1), c("white", "red")),
name = "H3K4me3")
Since EnrichedHeatmap
class is inherited from Heatmap
class, it is easy to customize
the heatmap, e.g. split rows, make clustering on rows, add titles, …
Split rows by a vector or a data frame:
EnrichedHeatmap(mat1, col = c("white", "red"), name = "H3K4me3",
split = sample(c("A", "B"), length(genes), replace = TRUE),
column_title = "Enrichment of H3K4me3")
Split rows by k-means clustering:
EnrichedHeatmap(mat1, col = c("white", "red"), name = "H3K4me3", km = 3,
column_title = "Enrichment of H3K4me3", row_title_rot = 0)
Split rows and add column annotation as well:
EnrichedHeatmap(mat1, col = c("white", "red"), name = "H3K4me3",
# note we have three row-clusters, so we assign three colors for the annotation lines
top_annotation = HeatmapAnnotation(lines = anno_enriched(gp = gpar(col = 2:4))),
top_annotation_height = unit(2, "cm"),
km = 3, row_title_rot = 0)
Cluster on rows:
EnrichedHeatmap(mat1, col = c("white", "red"), name = "H3K4me3",
cluster_rows = TRUE, show_row_dend = FALSE, column_title = "Enrichment of H3K4me3")
Some graphic settings specific for the EnrichedHeatmap
object:
EnrichedHeatmap(mat1, col = c("white", "red"), name = "H3K4me3",
pos_line_gp = gpar(col = "blue", lwd = 2), axis_name = c("-5kb", "TSS", "5kb"),
axis_name_rot = -45, border = FALSE)
Extension to upstream and downstream can be controled by extend
either by a single value
or a vector of length 2.
# upstream 1kb, downstream 2kb
mat12 = normalizeToMatrix(H3K4me3, tss, value_column = "coverage",
extend = c(1000, 2000), mean_mode = "w0", w = 50)
EnrichedHeatmap(mat12, name = "H3K4me3", col = c("white", "red"))
Following heatmap visualizes the enrichment of low methylated regions on TSS:
meth[1:5]
## GRanges object with 5 ranges and 1 metadata column:
## seqnames ranges strand | meth
## <Rle> <IRanges> <Rle> | <numeric>
## [1] chr21 [9432427, 9432427] * | 0.267104203931852
## [2] chr21 [9432428, 9432428] * | 0.267106771955287
## [3] chr21 [9432964, 9432964] * | 0.272709911227985
## [4] chr21 [9432965, 9432965] * | 0.2727345344132
## [5] chr21 [9433315, 9433315] * | 0.285114797969136
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
mat2 = normalizeToMatrix(meth, tss, value_column = "meth", mean_mode = "absolute",
extend = 5000, w = 50, empty_value = 0.5)
EnrichedHeatmap(mat2, name = "methylation", column_title = "methylation near TSS")
With window size set to 50bp, it can be possible that there is no CpG site in some of
the windows. In this case, empty_value
is used to fill such windows. Since high methylation
and low methylation are the major methylation types in the genome, we set the empty value to 0.5.
Here we set mean_mode
to absolute
. Following illustrates different settings for mean_mode
:
4 5 2 values in signal
++++++ +++ +++++ signal
================ window (16bp)
4 3 3 overlap
value for this window:
absolute: (4 + 5 + 2)/3
weighted: (4*4 + 5*3 + 2*3)/(4 + 3 + 3)
w0: (4*4 + 5*3 + 2*3)/16
So, according to above rules, if the genomic signals are from single base or very small regions,
setting mean_mode
to absolute
seems to be reasonable. For other case, mean_mode
can be set to
weighted
or w0
.
The values for those windows which contain no CpG sites can be fitted by locfit
method:
mat2 = normalizeToMatrix(meth, tss, value_column = "meth", mean_mode = "absolute",
extend = 5000, w = 50, empty_value = NA, smooth = TRUE)
EnrichedHeatmap(mat2, name = "methylation", column_title = "methylation near TSS")
The target of the enrichment can not only be a single point but also can be regions with width larger than 1. Following heatmap visualizes the enrichment of low methylation on CpG islands:
mat3 = normalizeToMatrix(meth, cgi, value_column = "meth", mean_mode = "absolute",
extend = 5000, w = 50, empty_value = NA, smooth = TRUE)
EnrichedHeatmap(mat3, name = "methylation", column_title = "methylation near CGI")
Width of the target regions can be controlled by target_ratio
which is relative to
the width of the complete heatmap.
Target regions are also splitted into small windows, but since width of the target regions are different from each other, they are splitted by percent to their full width (the percent value is calculated automatically).
mat3 = normalizeToMatrix(meth, cgi, value_column = "meth", mean_mode = "absolute",
extend = 5000, w = 50, empty_value = NA, smooth = TRUE, target_ratio = 0.3)
EnrichedHeatmap(mat3, name = "methylation", axis_name_rot = 90,
column_title = "methylation near CGI")
Thanks to the functionality of ComplexHeatmap package, heatmaps can be concatenated
by +
operator. EnrichedHeatmap
objects and Heatmap
objects can be mixed.
Following heatmaps visualizes correspondance between H3K4me3 mark, methylation and gene expression. It is quite straightforward to see high expression correlates with low methylation and high promoter activity around TSS.
EnrichedHeatmap(mat1, col = c("white", "red"), name = "H3K4me3", width = 1) +
EnrichedHeatmap(mat2, name = "methylation", width = 1) +
Heatmap(log2(rpkm+1), col = c("white", "orange"), name = "log2(rpkm+1)",
show_row_names = FALSE, width = unit(5, "mm"))
Of course you can split rows by splitting the main heatmap:
EnrichedHeatmap(mat1, col = c("white", "red"), name = "H3K4me3", km = 3, width = 1,
top_annotation = HeatmapAnnotation(lines = anno_enriched()),
top_annotation_height = unit(2, "cm"), row_title_rot = 0,
column_title = "H3K4me3") +
EnrichedHeatmap(mat2, name = "methylation", width = 1,
column_title = "Methylation") +
Heatmap(log2(rpkm+1), col = c("white", "orange"), name = "log2(rpkm+1)",
show_row_names = FALSE, width = unit(5, "mm"))
By default every genomic signal tries to intersect to every target region, but if mapping is provided, only those genomic signals that are mapped to the corresponding target region will be kept.
To illustrate it more clearly, we load the example data.
gene
column in neg_cr
is used to map to the names of all_tss
.
load(paste0(system.file("/extdata/neg_cr.RData", package = "EnrichedHeatmap")))
all_tss = promoters(all_genes, upstream = 0, downstream = 1)
all_tss = all_tss[unique(neg_cr$gene)]
neg_cr
## GRanges object with 5712 ranges and 1 metadata column:
## seqnames ranges strand | gene
## <Rle> <IRanges> <Rle> | <character>
## [1] chr1 [ 901460, 902041] * | ENSG00000187583.5
## [2] chr1 [1238870, 1239998] * | ENSG00000131584.14
## [3] chr1 [1241678, 1242375] * | ENSG00000131584.14
## [4] chr1 [1371300, 1371689] * | ENSG00000179403.10
## [5] chr1 [1849952, 1850382] * | ENSG00000178821.8
## ... ... ... ... ... ...
## [5708] chrX [153989953, 153990135] * | ENSG00000130826.11
## [5709] chrX [154113991, 154114210] * | ENSG00000197932.3
## [5710] chrX [154423204, 154424142] * | ENSG00000155959.6
## [5711] chrX [154426343, 154426667] * | ENSG00000155959.6
## [5712] chrX [154488727, 154489937] * | ENSG00000155961.4
## -------
## seqinfo: 17 sequences from an unspecified genome; no seqlengths
all_tss
## GRanges object with 3601 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## ENSG00000187583.5 chr1 [ 901877, 901877] +
## ENSG00000131584.14 chr1 [1244989, 1244989] -
## ENSG00000179403.10 chr1 [1370241, 1370241] +
## ENSG00000178821.8 chr1 [1850712, 1850712] -
## ENSG00000142609.13 chr1 [1935276, 1935276] -
## ... ... ... ...
## ENSG00000126890.13 chrX [153881853, 153881853] -
## ENSG00000130826.11 chrX [153991031, 153991031] +
## ENSG00000197932.3 chrX [154114635, 154114635] +
## ENSG00000155959.6 chrX [154425284, 154425284] +
## ENSG00000155961.4 chrX [154493874, 154493874] -
## -------
## seqinfo: 25 sequences (1 circular) from an unspecified genome; no seqlengths
In this example, neg_cr
contains regions that show negative correlation between methylation
and expression for the genes. The negative correlated regions are detected as:
Since genes may be close to each other, it is possible that one corrlated region for gene A overlaps with gene B. This is not what we want and by specifying the mapping, we can correspond correlated regions to the correct genes.
mat4 = normalizeToMatrix(neg_cr, all_tss, mapping_column = "gene", w = 50, mean_mode = "w0")
EnrichedHeatmap(mat4, col = c("white", "green"), name = "neg_cr", cluster_rows = TRUE,
top_annotation = HeatmapAnnotation(lines = anno_enriched(gp = gpar(col = "green"))),
top_annotation_height = unit(2, "cm"))
Above heatmap shows negative correlated regions are enriched at some distance after the TSS. We guess it is because genes have alternative transcripts and negative correlated regions are actually enriched at the start sites of transcripts.
Next we add another heatmap showing how transcripts are distributed to gene TSS. Maybe here the heatmap is not a nice way for showing transcripts, but according to the annotation graphs on the both top, we can see there is a perfect fitting for the peaks of negative correlated regions and transcripts.
mat5 = normalizeToMatrix(tx, all_tss, mapping_column="gene", w = 50, mean_mode = "w0")
ht_list = EnrichedHeatmap(mat4, col = c("white", "green"), name = "neg_cr", cluster_rows = TRUE,
top_annotation = HeatmapAnnotation(lines1 = anno_enriched(gp = gpar(col = "green"))),
top_annotation_height = unit(2, "cm")) +
EnrichedHeatmap(mat5, col = c("white", "black"), name = "tx",
top_annotation = HeatmapAnnotation(lines2 = anno_enriched(gp = gpar(col = "black"))),
top_annotation_height = unit(2, "cm"))
draw(ht_list, gap = unit(1, "cm"))
Since EnrichedHeatmap is built upon the ComplexHeatmap package, features in ComplexHeatmap can be
used directly for EnrichedHeatmap. As shown before, heatmaps can be split either by km
or spilt
arguments.
The order of rows can be retrieved by row_order()
.
ht_list = draw(ht_list)
row_order(ht_list)
If you are interested in a small cluster, under the interactive mode,
you can use mouse to select this region by select()
function, and it will give you the order of rows
for the selected sub-region.
draw(ht_list)
pos = select()
Since EnrichedHeatmap
and EnrichedHeamtapList
class are inherited from Heamtap
and HeamtapList
class
respectively, all advanced parameters in the latter two classes can be directly used in the former two classes.
E.g. to change graphic settings for the heatmap title:
EnrichedHeatmap(..., column_title_gp = ...)
To change graphic settings for legends:
EnrichedHeatmap(..., heatmap_legend_param = ...)
# or set is globally
ht_global_opt(...)
EnrichedHeatmap(...)
ht_global_opt(RESET = TRUE)
To set the width of the heatmaps if there are more than one heatmaps:
EnrichedHeatmap(..., width = unit(...)) + EnrichedHeatmap(...)
For more advanced settings, please directly go to the vignettes in the ComplexHeamtap package.
Together with above features, you can make very complex heatmaps. Following example is from a real-world dataset. Some information is hidden because the data is not published yet, but still, you can see it is really easy to correspond different sources of information.
Due to findOverlaps()
in GenomicRanges package, normalizeToMatrix()
may use
a lot of memory if your GRanges
object contains too many regions. s
argument in normalizeToMatrix()
can split the data into s
parts but it will increase the running time a lot.
If you generate the plot for the whole genome, I suggest you first save the figure as pdf format and then
convert to png by convert
software, instead of directly saving as png format.
sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.3 LTS
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
## [4] LC_COLLATE=C LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel grid stats graphics grDevices utils datasets methods
## [10] base
##
## other attached packages:
## [1] circlize_0.3.1 EnrichedHeatmap_1.0.0 locfit_1.5-9.1 GenomicRanges_1.22.0
## [5] GenomeInfoDb_1.6.0 IRanges_2.4.0 S4Vectors_0.8.0 BiocGenerics_0.16.0
## [9] ComplexHeatmap_1.6.0 knitr_1.11 markdown_0.7.7
##
## loaded via a namespace (and not attached):
## [1] whisker_0.3-2 XVector_0.10.0 magrittr_1.5 zlibbioc_1.16.0
## [5] lattice_0.20-33 colorspace_1.2-6 rjson_0.2.15 stringr_1.0.0
## [9] tools_3.2.2 matrixStats_0.14.2 RColorBrewer_1.1-2 formatR_1.2.1
## [13] GlobalOptions_0.0.8 dendextend_1.1.0 shape_1.4.2 evaluate_0.8
## [17] stringi_0.5-5 GetoptLong_0.1.0