Package version: ChIPpeakAnno 3.4.6

Contents

1 Introduction

Chromatin immunoprecipitation (ChIP) followed by DNA sequencing (ChIP-seq) and ChIP followed by genome tiling array analysis (ChIP-chip) have become prevalent high throughput technologies for identifying the binding sites of DNA-binding proteins genome-wise. A number of algorithms have been published to facilitate the identification of the binding sites of the DNA-binding proteins of interest. The identified binding sites as the list of peaks are usually converted to BED or bigwig files to be loaded to the UCSC genome browser as custom tracks for investigators to view the proximity to various genomic features such as genes, exons or conserved elements. However, clicking through the genome browser is a daunting task when the number of peaks gets large or the peaks spread widely across the genome.

Here we developed ChIPpeakAnno, a Bioconductor1 package, to facilitate the batch annotation of the peaks identified from ChIP-seq or ChIP-chip experiments. We implemented functionality to find the nearest gene, exon, miRNA or other custom features supplied by users such as the most conserved elements and other transcription factor binding sites leveraging GRanges. Since the genome annotation gets updated frequently, we have leveraged the biomaRt package to retrieve the annotation data on the fly. The users also have the flexibility to pass their own annotation data or annotation from GenomicFeatures as GRanges. We have also leveraged BSgenome and biomaRt to retrieve the sequences around the identified peak for peak validation or motif discovery2. To understand whether the identified peaks are enriched around genes with certain GO terms, we have implemented the Gene Ontology (GO) enrichment test in the ChIPpeakAnno package leveraging the hypergeometric test phyper in the stats package and integrated with the GO annotation from the GO.db package and multiplicity adjustment functions from the multtest package3–8. The pathway analysis using reactome or KEGG is also supported. Starting 3.4, we also implement the functions for permutation test to determine the association between two sets of peaks, and to plot heatmaps for given feature/peak ranges.

2 Quick start

library(ChIPpeakAnno)
## import the MACS output
macs <- system.file("extdata", "MACS_peaks.xls", package="ChIPpeakAnno")
macsOutput <- toGRanges(macs, format="MACS")
## annotate the peaks with precompiled ensembl annotation
data(TSS.human.GRCh38)
macs.anno <- annotatePeakInBatch(macsOutput, AnnotationData=TSS.human.GRCh38, 
                                output="overlapping", maxgap=5000L)
## add gene symbols
library(org.Hs.eg.db)
macs.anno <- addGeneIDs(annotatedPeak=macs.anno, 
                        orgAnn="org.Hs.eg.db", 
                        IDs2Add="symbol")

if(interactive()){## annotate the peaks with UCSC annotation
    library(GenomicFeatures)
    library(TxDb.Hsapiens.UCSC.hg38.knownGene)
    ucsc.hg38.knownGene <- genes(TxDb.Hsapiens.UCSC.hg38.knownGene)
    macs.anno <- annotatePeakInBatch(macsOutput, 
                                     AnnotationData=ucsc.hg38.knownGene, 
                                     output="overlapping", maxgap=5000L)
    macs.anno <- addGeneIDs(annotatedPeak=macs.anno, 
                            orgAnn="org.Hs.eg.db", 
                            feature_id_type="entrez_id",
                            IDs2Add="symbol")
}

3 An example ChIP-seq analysis workflow using ChIPpeakAnno

We illustrate here a common downstream analysis workflow for ChIP-seq experiments. The input of ChIPpeakAnno is a list of called peaks identified from ChIP-seq experiments. The peaks are represented by by GRanges in ChIPpeakAnno. We implemented a conversion functions toGRanges to convert commonly used peak file formats, such as BED, GFF, or other user defined formats such as MACS (a popular peak calling program) output file to GRanges. Type ?toGRanges for more information.

The workflow here exemplifies converting the BED and GFF files to GRanges, finding the overlapping peaks between the two peak sets, and visualizing the number of common and specific peaks with Venn diagram.

bed <- system.file("extdata", "MACS_output.bed", package="ChIPpeakAnno")
gr1 <- toGRanges(bed, format="BED", header=FALSE) 
## one can also try import from rtracklayer
library(rtracklayer)
gr1.import <- import(bed, format="BED")
identical(start(gr1), start(gr1.import))
## [1] TRUE
gr1[1:2]
## GRanges object with 2 ranges and 1 metadata column:
##               seqnames         ranges strand |     score
##                  <Rle>      <IRanges>  <Rle> | <numeric>
##   MACS_peak_1     chr1 [28341, 29610]      * |    160.81
##   MACS_peak_2     chr1 [90821, 91234]      * |    133.12
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
gr1.import[1:2] #note the name slot is different from gr1
## GRanges object with 2 ranges and 2 metadata columns:
##       seqnames         ranges strand |        name     score
##          <Rle>      <IRanges>  <Rle> | <character> <numeric>
##   [1]     chr1 [28341, 29610]      * | MACS_peak_1    160.81
##   [2]     chr1 [90821, 91234]      * | MACS_peak_2    133.12
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
gff <- system.file("extdata", "GFF_peaks.gff", package="ChIPpeakAnno")
gr2 <- toGRanges(gff, format="GFF", header=FALSE, skip=3)
ol <- findOverlapsOfPeaks(gr1, gr2)
makeVennDiagram(ol)
venn diagram of overlaps for duplicated experiments

venn diagram of overlaps for duplicated experiments

## $p.value
##      gr1 gr2 pval
## [1,]   1   1    0
## 
## $vennCounts
##      gr1 gr2 Counts
## [1,]   0   0      0
## [2,]   0   1     61
## [3,]   1   0     62
## [4,]   1   1    164
## attr(,"class")
## [1] "VennCounts"

A pie chart is used to demonstrate the overlap features of the common peaks.

pie1(table(ol$overlappingPeaks[["gr1///gr2"]]$overlapFeature))
Pie chart of common peaks among features

Pie chart of common peaks among features

After finding the overlapping peaks, you can use annotatePeakInBatch to annotate all the genomic features in the AnnotationData within 5kb from those peaks.

overlaps <- ol$peaklist[["gr1///gr2"]]
## ============== old style ===========
## data(TSS.human.GRCh37) 
## overlaps.anno <- annotatePeakInBatch(overlaps, AnnotationData=annoData, 
##                                      output="overlapping", maxgap=5000L)
## overlaps.anno <- addGeneIDs(overlaps.anno, "org.Hs.eg.db", "symbol")
## ============== new style ===========
library(EnsDb.Hsapiens.v75) ##(hg19)
## create annotation file from EnsDb or TxDb
annoData <- annoGR(EnsDb.Hsapiens.v75, feature="gene")
info(annoData)
## annoGR object;
## # source:  EnsDb.Hsapiens.v75 
## # create at:  Thu Feb 11 12:00:00 AM 2016 UTC 
## # feature:  gene 
## # Db type: EnsDb
## # Type of Gene ID: Ensembl Gene ID
## # Supporting package: ensembldb
## # Db created by: ensembldb package from Bioconductor
## # script_version: 0.1.2
## # Creation time: Wed Mar 18 09:30:54 2015
## # ensembl_version: 75
## # ensembl_host: manny.i-med.ac.at
## # Organism: homo_sapiens
## # genome_build: GRCh37
## # DBSCHEMAVERSION: 1.0
annoData[1:2]
## annoGR object with 2 ranges and 1 metadata column:
##                   seqnames               ranges strand |   gene_name
##                      <Rle>            <IRanges>  <Rle> | <character>
##   ENSG00000000003     chrX [99883667, 99894988]      - |      TSPAN6
##   ENSG00000000005     chrX [99839799, 99854882]      + |        TNMD
##   -------
##   seqinfo: 273 sequences from GRCh37 genome
overlaps.anno <- annotatePeakInBatch(overlaps, AnnotationData=annoData, 
                                    output="overlapping", maxgap=5000L)
overlaps.anno$gene_name <- 
    annoData$gene_name[match(overlaps.anno$feature,
                             names(annoData))]
head(overlaps.anno)
## GRanges object with 6 ranges and 11 metadata columns:
##                        seqnames           ranges strand |
##                           <Rle>        <IRanges>  <Rle> |
##   X001.ENSG00000228327     chr1 [713791, 715578]      * |
##   X001.ENSG00000237491     chr1 [713791, 715578]      * |
##   X002.ENSG00000237491     chr1 [724851, 727191]      * |
##   X003.ENSG00000272438     chr1 [839467, 840090]      * |
##   X004.ENSG00000187634     chr1 [856361, 856999]      * |
##   X004.ENSG00000223764     chr1 [856361, 856999]      * |
##                                                            peakNames
##                                                      <CharacterList>
##   X001.ENSG00000228327 gr1__MACS_peak_13,gr2__region_0,gr2__region_1
##   X001.ENSG00000237491 gr1__MACS_peak_13,gr2__region_0,gr2__region_1
##   X002.ENSG00000237491               gr2__region_2,gr1__MACS_peak_14
##   X003.ENSG00000272438               gr1__MACS_peak_16,gr2__region_3
##   X004.ENSG00000187634               gr1__MACS_peak_17,gr2__region_4
##   X004.ENSG00000223764               gr1__MACS_peak_17,gr2__region_4
##                               peak         feature start_position
##                        <character>     <character>      <integer>
##   X001.ENSG00000228327         001 ENSG00000228327         700237
##   X001.ENSG00000237491         001 ENSG00000237491         714150
##   X002.ENSG00000237491         002 ENSG00000237491         714150
##   X003.ENSG00000272438         003 ENSG00000272438         840214
##   X004.ENSG00000187634         004 ENSG00000187634         860260
##   X004.ENSG00000223764         004 ENSG00000223764         852245
##                        end_position feature_strand insideFeature
##                           <integer>    <character>      <factor>
##   X001.ENSG00000228327       714006              -  overlapStart
##   X001.ENSG00000237491       745440              +  overlapStart
##   X002.ENSG00000237491       745440              +        inside
##   X003.ENSG00000272438       851356              +      upstream
##   X004.ENSG00000187634       879955              +      upstream
##   X004.ENSG00000223764       856396              -  overlapStart
##                        distancetoFeature shortestDistance
##                                <numeric>        <integer>
##   X001.ENSG00000228327               215              215
##   X001.ENSG00000237491              -359              359
##   X002.ENSG00000237491             10701            10701
##   X003.ENSG00000272438              -747              124
##   X004.ENSG00000187634             -3899             3261
##   X004.ENSG00000223764                35               35
##                        fromOverlappingOrNearest     gene_name
##                                     <character>   <character>
##   X001.ENSG00000228327              Overlapping RP11-206L10.2
##   X001.ENSG00000237491              Overlapping RP11-206L10.9
##   X002.ENSG00000237491              Overlapping RP11-206L10.9
##   X003.ENSG00000272438              Overlapping  RP11-54O7.16
##   X004.ENSG00000187634              Overlapping        SAMD11
##   X004.ENSG00000223764              Overlapping   RP11-54O7.3
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Once the peaks are annotated, the distribution of the distance to the nearest feature such as the transcription start sites (TSS) can be plotted. The sample code here plots the distribution of the aggregated peak scores and the number of peaks around the TSS.

gr1.copy <- gr1
gr1.copy$score <- 1
binOverFeature(gr1, gr1.copy, annotationData=annoData,
               radius=5000, nbins=10, FUN=c(sum, length),
               ylab=c("score", "count"), 
               main=c("Distribution of aggregated peak scores around TSS", 
                      "Distribution of aggregated peak numbers around TSS"))
Distribution of aggregated peak scores or peak numbers around transcript start sites.

Distribution of aggregated peak scores or peak numbers around transcript start sites.

The distribution of the peaks over exon, intron, enhancer, proximal promoter, 5’ UTR and 3’ UTR can be summarized in peak centric or nucleotide centric view using the function assignChromosomeRegion. Note: setting nucleotideLevel = TRUE will give a nucleotide level distribution over different features.

if(require(TxDb.Hsapiens.UCSC.hg19.knownGene)){
    aCR<-assignChromosomeRegion(gr1, nucleotideLevel=FALSE, 
                           precedence=c("Promoters", "immediateDownstream", 
                                         "fiveUTRs", "threeUTRs", 
                                         "Exons", "Introns"), 
                           TxDb=TxDb.Hsapiens.UCSC.hg19.knownGene)
    barplot(aCR$percentage)
}
Peak distribution over different genomic features.

Peak distribution over different genomic features.

4 Detailed Use Cases and Scenarios

Here we describe some details in using different functions in ChIPpeakAnno for different tasks. As shown in the last section, the common workflow includes: loading called peaks from BED, GFF, or other formats; evaluating and visualizing the concordance among the biological replicates; combining peaks from replicates; preparing genomic annotation(s) as GRanges; associating/annotating peaks with the annotation(s); summarizing peak distributions over exon, intron, enhancer, proximal promoter, 5’UTR and 3’UTR regions; retrieving the sequences around the peaks; and enrichment analysis of GO and biological pathway. We also implement the functions to plot the heatmap of given peak ranges, and perform permutation test to determine if there is an association between two sets of peaks.

4.1 Determine the overlapping peaks and visualize the overlaps with Venn diagram

It is important to evaluate the concordance among the peaks of biological replicates. Prior to associating features of interest with the peaks, it is a common practice to combine the peaks from biological replicates. Also, it is biologically interesting to obtain overlapping peaks from different ChIP-seq experiments to imply the potential formation of transcription factor complexes. ChIPpeakAnno implemented functions to achieve those goals and quantitatively determine the significance of peak overlaps and generate a Venn diagram for visualization.

Here is the sample code to obtain the overlapping peaks with maximum gap of 1kb for two peak ranges.

peaks1 <- GRanges(seqnames=c("1", "2", "3", "4", "5", "6", 
                              "2", "6", "6", "6", "6", "5"),
                   ranges=IRanges(start=c(967654, 2010897, 2496704, 3075869, 
                                          3123260, 3857501, 201089, 1543200, 
                                          1557200, 1563000, 1569800, 167889600),
                                  end= c(967754, 2010997, 2496804, 3075969, 
                                         3123360, 3857601, 201089, 1555199,
                                         1560599, 1565199, 1573799, 167893599),
                                  names=paste("Site", 1:12, sep="")),
                  strand="+")

peaks2 <- GRanges(seqnames=c("1", "2", "3", "4", "5", "6", "1", "2", "3", 
                                     "4", "5", "6", "6", "6", "6", "6", "5"),
                          ranges=IRanges(start=c(967659, 2010898, 2496700, 
                                                 3075866, 3123260, 3857500, 
                                                 96765, 201089, 249670, 307586, 
                                                 312326, 385750, 1549800, 
                                                 1554400, 1565000, 1569400,
                                                 167888600), 
                                         end=c(967869, 2011108, 2496920, 
                                               3076166,3123470, 3857780, 
                                               96985, 201299, 249890, 307796, 
                                               312586, 385960, 1550599, 1560799,
                                               1565399, 1571199, 167888999), 
                                         names=paste("t", 1:17, sep="")),
                          strand=c("+", "+", "+", "+", "+", "+", "-", "-", "-", 
                                   "-", "-", "-", "+", "+", "+", "+", "+"))

ol <- findOverlapsOfPeaks(peaks1, peaks2, maxgap=1000)
peaklist <- ol$peaklist

The function findOverlapsOfPeaks returns an object of overlappingPeaks, which contains there elements: venn_cnt, peaklist (a list consists of all overlapping peaks or unique peaks), and overlappingPeaks (a list of data frame consists of the annotation of all the overlapping peaks).

Within the overlappingPeaks element of the overlappingPeaks object ol (which is also a list), the element “peaks1///peaks2” is a data frame representing the overlapping peaks with maximum gap of 1kb between the two peak lists. Using the overlapFeature column in this data frame, a pie graph can be generated to describe the distribution of the features of the relative positions of peaks1 to peaks2 for the overlapping peaks.

overlappingPeaks <- ol$overlappingPeaks
names(overlappingPeaks)
## [1] "peaks1///peaks2"
dim(overlappingPeaks[["peaks1///peaks2"]])
## [1] 13 14
overlappingPeaks[["peaks1///peaks2"]][1:2, ]
##                                 peaks1 seqnames  start    end width strand
## peaks1__Site1_peaks2__t1 peaks1__Site1        1 967654 967754   101      +
## peaks1__Site7_peaks2__t8 peaks1__Site7        2 201089 201089     1      +
##                              peaks2 seqnames  start    end width strand
## peaks1__Site1_peaks2__t1 peaks2__t1        1 967659 967869   211      +
## peaks1__Site7_peaks2__t8 peaks2__t8        2 201089 201299   211      -
##                          overlapFeature shortestDistance
## peaks1__Site1_peaks2__t1   overlapStart                5
## peaks1__Site7_peaks2__t8     overlapEnd                0
pie1(table(overlappingPeaks[["peaks1///peaks2"]]$overlapFeature))
Pie chart of common peaks among features.

Pie chart of common peaks among features.

The following code returns the merged overlapping peaks from the peaklist object.

peaklist[["peaks1///peaks2"]]
## GRanges object with 11 ranges and 1 metadata column:
##        seqnames                 ranges strand |
##           <Rle>              <IRanges>  <Rle> |
##    [1]        1 [   967654,    967869]      + |
##    [2]        2 [   201089,    201299]      * |
##    [3]        2 [  2010897,   2011108]      + |
##    [4]        3 [  2496700,   2496920]      + |
##    [5]        4 [  3075866,   3076166]      + |
##    [6]        5 [  3123260,   3123470]      + |
##    [7]        5 [167888600, 167893599]      + |
##    [8]        6 [  1543200,   1560799]      + |
##    [9]        6 [  1563000,   1565399]      + |
##   [10]        6 [  1569400,   1573799]      + |
##   [11]        6 [  3857500,   3857780]      + |
##                                        peakNames
##                                  <CharacterList>
##    [1]                  peaks1__Site1,peaks2__t1
##    [2]                  peaks1__Site7,peaks2__t8
##    [3]                  peaks1__Site2,peaks2__t2
##    [4]                  peaks2__t3,peaks1__Site3
##    [5]                  peaks2__t4,peaks1__Site4
##    [6]                  peaks1__Site5,peaks2__t5
##    [7]                peaks2__t17,peaks1__Site12
##    [8] peaks1__Site8,peaks2__t13,peaks2__t14,...
##    [9]                peaks1__Site10,peaks2__t15
##   [10]                peaks2__t16,peaks1__Site11
##   [11]                  peaks2__t6,peaks1__Site6
##   -------
##   seqinfo: 6 sequences from an unspecified genome; no seqlengths

The peaks in peaks1 but not overlap with the peaks in peaks2 can be obtained with:

peaklist[["peaks1"]]
## NULL

The peaks in peaks2 but not overlap with the peaks in peaks1 can be obtained with:

peaklist[["peaks2"]]
## GRanges object with 5 ranges and 1 metadata column:
##       seqnames           ranges strand |       peakNames
##          <Rle>        <IRanges>  <Rle> | <CharacterList>
##   [1]        1 [ 96765,  96985]      - |      peaks2__t7
##   [2]        3 [249670, 249890]      - |      peaks2__t9
##   [3]        4 [307586, 307796]      - |     peaks2__t10
##   [4]        5 [312326, 312586]      - |     peaks2__t11
##   [5]        6 [385750, 385960]      - |     peaks2__t12
##   -------
##   seqinfo: 6 sequences from an unspecified genome; no seqlengths

Venn diagram can be generated by the function makeVennDiagram using the output of findOverlapsOfPeaks as an input.

The makeVennDiagram also outputs p-values indicating whether the overlapping is significant.

makeVennDiagram(ol, totalTest=1e+2)
venn diagram of overlaps

venn diagram of overlaps

## $p.value
##      peaks1 peaks2         pval
## [1,]      1      1 5.890971e-12
## 
## $vennCounts
##      peaks1 peaks2 Counts
## [1,]      0      0     83
## [2,]      0      1      5
## [3,]      1      0      0
## [4,]      1      1     12
## attr(,"class")
## [1] "VennCounts"

Alternatively, users have the option to use other tools to plot Venn diagram. The following code demonstrates how to use a third party R package Vernerable with the output from the function findOverlapsOfPeaks.

#     install.packages("Vennerable", repos="http://R-Forge.R-project.org", 
#                     type="source")
#     library(Vennerable)
#     venn_cnt2venn <- function(venn_cnt){
#         n <- which(colnames(venn_cnt)=="Counts") - 1
#         SetNames=colnames(venn_cnt)[1:n]
#         Weight=venn_cnt[,"Counts"]
#         names(Weight) <- apply(venn_cnt[,1:n], 1, paste, collapse="")
#         Venn(SetNames=SetNames, Weight=Weight)
#     }
# 
#     v <- venn_cnt2venn(ol$venn_cnt)
#     plot(v)

The findOverlapsOfPeaks function accepts up to 5 peak lists for overlapping peaks. The following code is an example for 3 peak lists.

peaks3 <- GRanges(seqnames=c("1", "2", "3", "4", "5", 
                             "6", "1", "2", "3", "4"),
                   ranges=IRanges(start=c(967859, 2010868, 2496500, 3075966,
                                          3123460, 3851500, 96865, 201189, 
                                          249600, 307386),
                                  end= c(967969, 2011908, 2496720, 3076166,
                                         3123470, 3857680, 96985, 201299, 
                                         249890, 307796),
                                  names=paste("p", 1:10, sep="")),
                  strand=c("+", "+", "+", "+", "+", 
                           "+", "-", "-", "-", "-"))

ol <- findOverlapsOfPeaks(peaks1, peaks2, peaks3, maxgap=1000, 
                          connectedPeaks="min")
makeVennDiagram(ol, totalTest=1e+2)
venn diagram of overlaps for three input peak lists

venn diagram of overlaps for three input peak lists

## $p.value
##      peaks1 peaks2 peaks3         pval
## [1,]      0      1      1 1.123492e-09
## [2,]      1      0      1 5.131347e-06
## [3,]      1      1      0 5.890971e-12
## 
## $vennCounts
##      peaks1 peaks2 peaks3 Counts
## [1,]      0      0      0     83
## [2,]      0      0      1      0
## [3,]      0      1      0      2
## [4,]      0      1      1      3
## [5,]      1      0      0      0
## [6,]      1      0      1      0
## [7,]      1      1      0      5
## [8,]      1      1      1      7
## attr(,"class")
## [1] "VennCounts"

The parameter totalTest in the function makeVennDiagram indicates how many potential peaks in total will be used in the hypergeometric test. It should be larger than the largest number of peaks in the replicates. The smaller it is set, the more stringent the test is. The time used to calculate p-value does not depend on the value of the totalTest. For practical guidance on how to choose totalTest, please refer to the post. We also implement a permTest function, in which the number of totalTest is not required. For more details about the permTest, go to section 4.11.

4.2 Generate annotation data

One main function of the ChIPpeakAnno package is to annotate the positional relationship between the peaks and the known genomic features, such as TSS, 5’UTR, 3’UTR etc. Constructing and choosing the appropriate annotation data is crucial for this process.

To simplify this process, we precompiled the annotation data for the transcriptional starting sites (TSS) of various species (with different genome assembly versions). Those TSS annotations include TSS.human.NCBI36, TSS.human.GRCh37, TSS.human.GRCh38, TSS.mouse.NCBIM37, TSS.mouse.GRCm38, TSS.rat.RGSC3.4, TSS.rat.Rnor_5.0, TSS.zebrafish.Zv8, and TSS.zebrafish.Zv9. Those precompiled annotations can be loaded by R data() function.

To annotate the peaks with other genomic features, we provide the function getAnnotation with the argument featureType, e.g., “Exon” to obtain the nearest exon, “miRNA” to find the nearest miRNA, and “5utr” or “3utr” to locate the overlapping “5’UTR” or “3’UTR”. Another parameter for getAnnotation is the name of the appropriate biomaRt dataset, for example, drerio_gene_ensembl for zebrafish genome, mmusculus_gene_ensembl for mouse genome and rnorvegicus_gene_ensembl for rat genome. For a list of available biomaRt and dataset, please refer to the biomaRt package documentation2. For the detailed usage of getAnnotation, you can type ?getAnnotation in R.

You can also determine to use the biologically appropriate database that is related with your biological questions. For example, if you have a list of transcription factor binding sites from literatures and are interested in locating the nearest TSS and the distance to it for the peak lists. You can pass the custom annotation dataset into the function annotatePeakInBatch as GRanges. Use toGRanges function for conversion if necessary.

To facilitate the creation and documentation of the annotation data, we implement an annoGR class, which is an extension of GRanges class, to represent the annotation data. An annoGR object can be constructed from EnsDb, TxDb, or the user defined GRanges object by calling the annoGR function. The advantage of this class is that it contains the meta data such as the source and the timestamp (date) of the data source. Use ?annoGR for more information.

One sample use case is that you want to annotate only the known genes, not other transcript products, such as pseudo genes. A code snippet to build this annotation data using TranscriptDb TxDb.Hsapiens.UCSC.hg19.knownGene with annoGR is:

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
annoData <- annoGR(TxDb.Hsapiens.UCSC.hg19.knownGene, feature="gene")
info(annoData)
## annoGR object;
## # source:  TxDb.Hsapiens.UCSC.hg19.knownGene 
## # create at:  Thu Feb 11 12:00:00 AM 2016 UTC 
## # feature:  gene 
## # Db type: TxDb
## # Supporting package: GenomicFeatures
## # Data source: UCSC
## # Genome: hg19
## # Organism: Homo sapiens
## # Taxonomy ID: 9606
## # UCSC Table: knownGene
## # Resource URL: http://genome.ucsc.edu/
## # Type of Gene ID: Entrez Gene ID
## # Full dataset: yes
## # miRBase build ID: GRCh37
## # transcript_nrow: 82960
## # exon_nrow: 289969
## # cds_nrow: 237533
## # Db created by: GenomicFeatures package from Bioconductor
## # Creation time: 2015-10-07 18:11:28 +0000 (Wed, 07 Oct 2015)
## # GenomicFeatures version at creation time: 1.21.30
## # RSQLite version at creation time: 1.0.0
## # DBSCHEMAVERSION: 1.1
annoData
## annoGR object with 23056 ranges and 0 metadata columns:
##         seqnames                 ranges strand
##            <Rle>              <IRanges>  <Rle>
##       1    chr19 [ 58858172,  58874214]      -
##      10     chr8 [ 18248755,  18258723]      +
##     100    chr20 [ 43248163,  43280376]      -
##    1000    chr18 [ 25530930,  25757445]      -
##   10000     chr1 [243651535, 244006886]      -
##     ...      ...                    ...    ...
##    9991     chr9 [114979995, 115095944]      -
##    9992    chr21 [ 35736323,  35743440]      +
##    9993    chr22 [ 19023795,  19109967]      -
##    9994     chr6 [ 90539619,  90584155]      +
##    9997    chr22 [ 50961997,  50964905]      -
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

4.3 Find the nearest feature and the distance to the feature for the peaklists

With the annotation data, you can annotate the peaks identified from ChIP-seq or ChIP-chip experiments to retrieve the nearest gene and distance to the corresponding TSS of the gene.

For example, using the annoGR object generated in previous section as AnnotationData, the first 6 peaks in the myPeakList are annotated with the following code:

data(myPeakList)
annotatedPeak <- annotatePeakInBatch(myPeakList[1:6],
                                     AnnotationData = annoData)
annotatedPeak[1:3]
## GRanges object with 3 ranges and 9 metadata columns:
##                          seqnames           ranges strand |         peak
##                             <Rle>        <IRanges>  <Rle> |  <character>
##   X1_93_556427.100288069     chr1 [556660, 556760]      * | X1_93_556427
##   X1_41_559455.100288069     chr1 [559774, 559874]      * | X1_41_559455
##   X1_12_703729.100288069     chr1 [703885, 703985]      * | X1_12_703729
##                              feature start_position end_position
##                          <character>      <integer>    <integer>
##   X1_93_556427.100288069   100288069         700245       714068
##   X1_41_559455.100288069   100288069         700245       714068
##   X1_12_703729.100288069   100288069         700245       714068
##                          feature_strand insideFeature distancetoFeature
##                             <character>      <factor>         <numeric>
##   X1_93_556427.100288069              -    downstream            157408
##   X1_41_559455.100288069              -    downstream            154294
##   X1_12_703729.100288069              -        inside             10183
##                          shortestDistance fromOverlappingOrNearest
##                                 <integer>              <character>
##   X1_93_556427.100288069           143485          NearestLocation
##   X1_41_559455.100288069           140371          NearestLocation
##   X1_12_703729.100288069             3640          NearestLocation
##   -------
##   seqinfo: 24 sequences from an unspecified genome; no seqlengths

As discussed in the previous section, all the genomic locations of the human genes have been precompiled, such as TSS.human.NCBI36 dataset, using function getAnnotation. You can pass it to the argument annotaionData of the annotatePeakInBatch function.

data(TSS.human.NCBI36)
annotatedPeak <- annotatePeakInBatch(myPeakList[1:6], 
                 AnnotationData=TSS.human.NCBI36)
annotatedPeak[1:3]
## GRanges object with 3 ranges and 9 metadata columns:
##                                seqnames           ranges strand |
##                                   <Rle>        <IRanges>  <Rle> |
##   X1_93_556427.ENSG00000212875     chr1 [556660, 556760]      * |
##   X1_41_559455.ENSG00000212678     chr1 [559774, 559874]      * |
##   X1_12_703729.ENSG00000197049     chr1 [703885, 703985]      * |
##                                        peak         feature start_position
##                                 <character>     <character>      <integer>
##   X1_93_556427.ENSG00000212875 X1_93_556427 ENSG00000212875         556318
##   X1_41_559455.ENSG00000212678 X1_41_559455 ENSG00000212678         559620
##   X1_12_703729.ENSG00000197049 X1_12_703729 ENSG00000197049         711184
##                                end_position feature_strand insideFeature
##                                   <integer>    <character>      <factor>
##   X1_93_556427.ENSG00000212875       557859              +        inside
##   X1_41_559455.ENSG00000212678       560165              +        inside
##   X1_12_703729.ENSG00000197049       712376              +      upstream
##                                distancetoFeature shortestDistance
##                                        <numeric>        <integer>
##   X1_93_556427.ENSG00000212875               342              342
##   X1_41_559455.ENSG00000212678               154              154
##   X1_12_703729.ENSG00000197049             -7299             7199
##                                fromOverlappingOrNearest
##                                             <character>
##   X1_93_556427.ENSG00000212875          NearestLocation
##   X1_41_559455.ENSG00000212678          NearestLocation
##   X1_12_703729.ENSG00000197049          NearestLocation
##   -------
##   seqinfo: 24 sequences from an unspecified genome; no seqlengths

You can also pass the user defined features as annotationData. A pie chart can be plotted to show the peak distribution among the features after annotation.

myPeak1 <- GRanges(seqnames=c("1", "2", "3", "4", "5", "6", 
                              "2", "6", "6", "6", "6", "5"),
                   ranges=IRanges(start=c(967654, 2010897, 2496704, 3075869, 
                                          3123260, 3857501, 201089, 1543200, 
                                          1557200, 1563000, 1569800, 167889600),
                                  end= c(967754, 2010997, 2496804, 3075969, 
                                         3123360, 3857601, 201089, 1555199,
                                         1560599, 1565199, 1573799, 167893599),
                                  names=paste("Site", 1:12, sep="")))

TFbindingSites <- GRanges(seqnames=c("1", "2", "3", "4", "5", "6", "1", "2", 
                                     "3", "4", "5", "6", "6", "6", "6", "6",
                                     "5"),
                          ranges=IRanges(start=c(967659, 2010898, 2496700, 
                                                 3075866, 3123260, 3857500, 
                                                 96765, 201089, 249670, 307586, 
                                                 312326, 385750, 1549800, 
                                                 1554400, 1565000, 1569400,
                                                 167888600), 
                                         end=c(967869, 2011108, 2496920, 
                                               3076166,3123470, 3857780, 
                                               96985, 201299, 249890, 307796, 
                                               312586, 385960, 1550599, 1560799,
                                               1565399, 1571199, 167888999), 
                                         names=paste("t", 1:17, sep="")),
                          strand=c("+", "+", "+", "+", "+", "+", "-", "-", "-", 
                                   "-", "-", "-", "+", "+", "+", "+", "+"))

annotatedPeak2 <- annotatePeakInBatch(myPeak1, AnnotationData=TFbindingSites)
annotatedPeak2[1:3]
## GRanges object with 3 ranges and 9 metadata columns:
##            seqnames             ranges strand |        peak     feature
##               <Rle>          <IRanges>  <Rle> | <character> <character>
##   Site1.t1     chr1 [ 967654,  967754]      * |       Site1          t1
##   Site2.t2     chr2 [2010897, 2010997]      * |       Site2          t2
##   Site3.t3     chr3 [2496704, 2496804]      * |       Site3          t3
##            start_position end_position feature_strand insideFeature
##                 <integer>    <integer>    <character>      <factor>
##   Site1.t1         967659       967869              +  overlapStart
##   Site2.t2        2010898      2011108              +  overlapStart
##   Site3.t3        2496700      2496920              +        inside
##            distancetoFeature shortestDistance fromOverlappingOrNearest
##                    <numeric>        <integer>              <character>
##   Site1.t1                -5                5          NearestLocation
##   Site2.t2                -1                1          NearestLocation
##   Site3.t3                 4                4          NearestLocation
##   -------
##   seqinfo: 6 sequences from an unspecified genome; no seqlengths
pie1(table(as.data.frame(annotatedPeak2)$insideFeature))
Pie chart of peak distribution among features

Pie chart of peak distribution among features

Another example of user specific AnnotationData is to annotate peaks by promoters, defined as upstream 5K to downstream 500bp from TSS. The sample code here demonstrates using the GenomicFeatures::promoters function to build a custom annotation dataset and annotate the peaks with this user defined promoter annotations.

library(ChIPpeakAnno)
data(myPeakList)
data(TSS.human.NCBI36)
annotationData <- promoters(TSS.human.NCBI36, upstream=5000, downstream=500)
annotatedPeak <- annotatePeakInBatch(myPeakList[1:6,], 
                                     AnnotationData=annotationData,
                                     output="overlapping")
annotatedPeak[1:3]
## GRanges object with 3 ranges and 9 metadata columns:
##                                seqnames           ranges strand |
##                                   <Rle>        <IRanges>  <Rle> |
##   X1_93_556427.ENSG00000209341     chr1 [556660, 556760]      * |
##   X1_93_556427.ENSG00000209344     chr1 [556660, 556760]      * |
##   X1_93_556427.ENSG00000209346     chr1 [556660, 556760]      * |
##                                        peak         feature start_position
##                                 <character>     <character>      <integer>
##   X1_93_556427.ENSG00000209341 X1_93_556427 ENSG00000209341         554314
##   X1_93_556427.ENSG00000209344 X1_93_556427 ENSG00000209344         555569
##   X1_93_556427.ENSG00000209346 X1_93_556427 ENSG00000209346         555643
##                                end_position feature_strand insideFeature
##                                   <integer>    <character>      <factor>
##   X1_93_556427.ENSG00000209341       559813              -        inside
##   X1_93_556427.ENSG00000209344       561068              -        inside
##   X1_93_556427.ENSG00000209346       561142              -        inside
##                                distancetoFeature shortestDistance
##                                        <numeric>        <integer>
##   X1_93_556427.ENSG00000209341              3153             2346
##   X1_93_556427.ENSG00000209344              4408             1091
##   X1_93_556427.ENSG00000209346              4482             1017
##                                fromOverlappingOrNearest
##                                             <character>
##   X1_93_556427.ENSG00000209341              Overlapping
##   X1_93_556427.ENSG00000209344              Overlapping
##   X1_93_556427.ENSG00000209346              Overlapping
##   -------
##   seqinfo: 24 sequences from an unspecified genome; no seqlengths

In the function annotatyePeakInBatch, various parameters can be adjusted to specify the way to calculate the distance and how the features are selected. For example, PeakLocForDistance is to specify the location of the peak for distance calculation: “middle” (recommended) means using the middle of the peak, and “start” (default, for backward compatibility) means using the start of the peak to calculate the distance to the features. Similarly, FeatureLocForDistance is to specify the location of the feature for distance calculation: “middle” means using the middle of the feature, “start” means using the start of the feature to calculate the distance from the peak to the feature; “TSS” (default) means using the start of the feature when the feature is on plus strand and using the end of feature when the feature is on minus strand; “geneEnd” means using end of the feature when feature is on plus strand and using start of feature when feature is on minus strand.

The argument “output” specifies the characteristics of the output of the annotated features. The default is “nearestLocation”, which means to output the nearest features calculated as PeakLocForDistance-FeatureLocForDistance; “overlapping” will output the overlapping features within the maximum gap specified as maxgap between the peak range and feature range; “shortestDistance” will output the nearest features; “both” will output all the nearest features, in addition, will output any features that overlap the peak that are not the nearest features. other options see ?annotatePeakInBatch.

4.4 Find the overlapping and flanking features

In addition to annotate peaks to nearest genes, ChIPpeakAnno can also reports all overlapping and flanking genes by setting output=“both” and maxgap in annotatePeakInBatch. For example, it outputs all overlapping and flanking genes within 5kb plus nearest genes if set maxgap = 5000 and output =“both”.

annotatedPeak <- annotatePeakInBatch(myPeakList[1:6],
                                     AnnotationData = annoData,
                                     output="both", maxgap=5000)
head(annotatedPeak)
## GRanges object with 6 ranges and 9 metadata columns:
##                          seqnames             ranges strand |
##                             <Rle>          <IRanges>  <Rle> |
##   X1_93_556427.100288069     chr1 [ 556660,  556760]      * |
##   X1_41_559455.100288069     chr1 [ 559774,  559874]      * |
##   X1_12_703729.100288069     chr1 [ 703885,  703985]      * |
##       X1_20_925025.84808     chr1 [ 926058,  926158]      * |
##      X1_11_1041174.54991     chr1 [1041646, 1041746]      * |
##       X1_14_1269014.1855     chr1 [1270239, 1270339]      * |
##                                   peak     feature start_position
##                            <character> <character>      <integer>
##   X1_93_556427.100288069  X1_93_556427   100288069         700245
##   X1_41_559455.100288069  X1_41_559455   100288069         700245
##   X1_12_703729.100288069  X1_12_703729   100288069         700245
##       X1_20_925025.84808  X1_20_925025       84808         910579
##      X1_11_1041174.54991 X1_11_1041174       54991        1017198
##       X1_14_1269014.1855 X1_14_1269014        1855        1270658
##                          end_position feature_strand insideFeature
##                             <integer>    <character>      <factor>
##   X1_93_556427.100288069       714068              -    downstream
##   X1_41_559455.100288069       714068              -    downstream
##   X1_12_703729.100288069       714068              -        inside
##       X1_20_925025.84808       917473              -      upstream
##      X1_11_1041174.54991      1051736              -        inside
##       X1_14_1269014.1855      1284492              -    downstream
##                          distancetoFeature shortestDistance
##                                  <numeric>        <integer>
##   X1_93_556427.100288069            157408           143485
##   X1_41_559455.100288069            154294           140371
##   X1_12_703729.100288069             10183             3640
##       X1_20_925025.84808             -8585             8585
##      X1_11_1041174.54991             10090             9990
##       X1_14_1269014.1855             14253              319
##                          fromOverlappingOrNearest
##                                       <character>
##   X1_93_556427.100288069          NearestLocation
##   X1_41_559455.100288069          NearestLocation
##   X1_12_703729.100288069          NearestLocation
##       X1_20_925025.84808          NearestLocation
##      X1_11_1041174.54991          NearestLocation
##       X1_14_1269014.1855              Overlapping
##   -------
##   seqinfo: 24 sequences from an unspecified genome; no seqlengths

4.5 Add other feature IDs to the annotated peaks

Additional annotations features such as entrez ID, gene symbol and gene name can be added with the function addGeneIDs. The annotated peaks can be saved as an Excel file or plotted for visualizing the peak distribution relative to the genomic features of interest. Here is an example to add gene symbol to the annotated peaks. Use ?addGeneIDs for more information.

data(annotatedPeak)
library(org.Hs.eg.db)
addGeneIDs(annotatedPeak[1:6], orgAnn="org.Hs.eg.db", IDs2Add=c("symbol"))
## GRanges object with 6 ranges and 9 metadata columns:
##                                   seqnames                 ranges strand |
##                                      <Rle>              <IRanges>  <Rle> |
##   X1_11_100272487.ENSG00000202254        1 [100272801, 100272900]      + |
##   X1_11_108905539.ENSG00000186086        1 [108906026, 108906125]      + |
##   X1_11_110106925.ENSG00000065135        1 [110107267, 110107366]      + |
##   X1_11_110679983.ENSG00000197106        1 [110680469, 110680568]      + |
##   X1_11_110681677.ENSG00000197106        1 [110682125, 110682224]      + |
##   X1_11_110756560.ENSG00000116396        1 [110756823, 110756922]      + |
##                                             peak         feature
##                                      <character>     <character>
##   X1_11_100272487.ENSG00000202254 1_11_100272487 ENSG00000202254
##   X1_11_108905539.ENSG00000186086 1_11_108905539 ENSG00000186086
##   X1_11_110106925.ENSG00000065135 1_11_110106925 ENSG00000065135
##   X1_11_110679983.ENSG00000197106 1_11_110679983 ENSG00000197106
##   X1_11_110681677.ENSG00000197106 1_11_110681677 ENSG00000197106
##   X1_11_110756560.ENSG00000116396 1_11_110756560 ENSG00000116396
##                                   start_position end_position
##                                        <numeric>    <numeric>
##   X1_11_100272487.ENSG00000202254      100257218    100257309
##   X1_11_108905539.ENSG00000186086      108918435    109013624
##   X1_11_110106925.ENSG00000065135      110091233    110136975
##   X1_11_110679983.ENSG00000197106      110693108    110744824
##   X1_11_110681677.ENSG00000197106      110693108    110744824
##   X1_11_110756560.ENSG00000116396      110753965    110776666
##                                   insideFeature distancetoFeature
##                                     <character>         <numeric>
##   X1_11_100272487.ENSG00000202254    downstream             15582
##   X1_11_108905539.ENSG00000186086      upstream            -12410
##   X1_11_110106925.ENSG00000065135        inside             16033
##   X1_11_110679983.ENSG00000197106      upstream            -12640
##   X1_11_110681677.ENSG00000197106      upstream            -10984
##   X1_11_110756560.ENSG00000116396        inside              2857
##                                   shortestDistance
##                                          <numeric>
##   X1_11_100272487.ENSG00000202254            15491
##   X1_11_108905539.ENSG00000186086            12310
##   X1_11_110106925.ENSG00000065135            16033
##   X1_11_110679983.ENSG00000197106            12540
##   X1_11_110681677.ENSG00000197106            10884
##   X1_11_110756560.ENSG00000116396             2857
##                                   fromOverlappingOrNearest   symbol
##                                                <character> <factor>
##   X1_11_100272487.ENSG00000202254             NearestStart     <NA>
##   X1_11_108905539.ENSG00000186086             NearestStart    NBPF6
##   X1_11_110106925.ENSG00000065135             NearestStart    GNAI3
##   X1_11_110679983.ENSG00000197106             NearestStart  SLC6A17
##   X1_11_110681677.ENSG00000197106             NearestStart  SLC6A17
##   X1_11_110756560.ENSG00000116396             NearestStart    KCNC4
##   -------
##   seqinfo: 24 sequences from an unspecified genome; no seqlengths
addGeneIDs(annotatedPeak$feature[1:6], orgAnn="org.Hs.eg.db", 
           IDs2Add=c("symbol"))
##   ensembl_gene_id  symbol
## 1 ENSG00000065135   GNAI3
## 2 ENSG00000116396   KCNC4
## 3 ENSG00000197106 SLC6A17
## 4 ENSG00000186086   NBPF6
## 5 ENSG00000202254    <NA>

4.6 Obtain the sequences surrounding the peaks

Here is an example to get the sequences of the peaks plus 20 bp upstream and downstream for PCR validation or motif discovery.

peaks <- GRanges(seqnames=c("NC_008253", "NC_010468"),
                 ranges=IRanges(start=c(100, 500), 
                                end=c(300, 600), 
                                names=c("peak1", "peak2")))
library(BSgenome.Ecoli.NCBI.20080805)
peaksWithSequences <- getAllPeakSequence(peaks, upstream=20, 
                                         downstream=20, genome=Ecoli)

The obtained sequences can be converted to fasta format for motif discovery by calling the function write2FASTA.

write2FASTA(peaksWithSequences,"test.fa")

4.7 Heatmap for given feature/peak ranges

You can easily visualize and compare the binding patterns of raw signals of multiple ChIP-Seq experiments using function featureAlignedHeatmap and featureAlignedDistribution.

path <- system.file("extdata", package="ChIPpeakAnno")
files <- dir(path, "broadPeak")
data <- sapply(file.path(path, files), toGRanges, format="broadPeak")
names(data) <- gsub(".broadPeak", "", files)
ol <- findOverlapsOfPeaks(data)
#makeVennDiagram(ol)
features <- ol$peaklist[[length(ol$peaklist)]]
wid <- width(features)
feature.recentered <- feature.center <- features
start(feature.center) <- start(features) + floor(wid/2)
width(feature.center) <- 1
start(feature.recentered) <- start(feature.center) - 2000
end(feature.recentered) <- end(feature.center) + 2000
## here we also suggest importData function in bioconductor trackViewer package 
## to import the coverage.
## compare rtracklayer, it will save you time when handle huge dataset.
library(rtracklayer)
files <- dir(path, "bigWig")
cvglists <- sapply(file.path(path, files), import, 
                       format="BigWig", 
                       which=feature.recentered, 
                       as="RleList")
names(cvglists) <- gsub(".bigWig", "", files)
sig <- featureAlignedSignal(cvglists, feature.center, 
                            upstream=2000, downstream=2000) 
heatmap <- featureAlignedHeatmap(sig, feature.center, 
                                 upstream=2000, downstream=2000,
                                 upper.extreme=c(3,.5,4))
Heatmap of aligned features

Heatmap of aligned features

featureAlignedDistribution(sig, feature.center, 
                           upstream=2000, downstream=2000,
                           type="l")
Distribution of aligned features

Distribution of aligned features

4.8 Output a summary of motif occurrences in the peaks.

Here is an example to search the motifs in the binding peaks. The motif patterns to be searched are saved in the file examplepattern.fa.

peaks <- GRanges(seqnames=c("NC_008253", "NC_010468"),
                 ranges=IRanges(start=c(100, 500), 
                                end=c(300, 600), 
                                names=c("peak1", "peak2")))
filepath <- system.file("extdata", "examplePattern.fa", package="ChIPpeakAnno")
readLines(filepath)
## [1] ">ExamplePattern" "GGNCCK"          ">ExamplePattern" "AACCNM"
library(BSgenome.Ecoli.NCBI.20080805)
summarizePatternInPeaks(patternFilePath=filepath, format="fasta", skip=0L, 
                        BSgenomeName=Ecoli, peaks=peaks)
##      n.peaksWithPattern n.totalPeaks Pattern 
## [1,] "0"                "2"          "GGNCCK"
## [2,] "1"                "2"          "AACCNM"

4.9 Obtain the enriched Gene Ontology (GO) terms or reactome/KEGG terms for the genes near the peaks

With the annotated peak data, you can call the function getEnrichedGO to obtain a list of enriched GO terms. For pathway analysis, you can call function getEnrichedPATH using reactome or KEGG database.

In the following sample code we used a subset of the annotatedPeak (the first 500 peaks) for demonstration. All annotated peaks should be used in the real situation.

library(org.Hs.eg.db)
over <- getEnrichedGO(annotatedPeak[1:500], orgAnn="org.Hs.eg.db", 
                    maxP=0.01, multiAdj=FALSE, minGOterm=10, 
                    multiAdjMethod="", condense=FALSE)
head(over[["bp"]][, -3])
##        go.id                        go.term Ontology count.InDataset
## 1 GO:0006644 phospholipid metabolic process       BP               8
## 2 GO:0006644 phospholipid metabolic process       BP               8
## 3 GO:0006644 phospholipid metabolic process       BP               8
## 4 GO:0006644 phospholipid metabolic process       BP               8
## 5 GO:0006644 phospholipid metabolic process       BP               8
## 6 GO:0006644 phospholipid metabolic process       BP               8
##   count.InGenome      pvalue totaltermInDataset totaltermInGenome EntrezID
## 1            355 0.008592944              11331           1413832    27329
## 2            355 0.008592944              11331           1413832    55650
## 3            355 0.008592944              11331           1413832     9588
## 4            355 0.008592944              11331           1413832   347735
## 5            355 0.008592944              11331           1413832     9890
## 6            355 0.008592944              11331           1413832   255738
head(over[["cc"]][, -3])
##        go.id            go.term Ontology count.InDataset count.InGenome
## 1 GO:0001533 cornified envelope       CC               9             46
## 2 GO:0001533 cornified envelope       CC               9             46
## 3 GO:0001533 cornified envelope       CC               9             46
## 4 GO:0001533 cornified envelope       CC               9             46
## 5 GO:0001533 cornified envelope       CC               9             46
## 6 GO:0001533 cornified envelope       CC               9             46
##         pvalue totaltermInDataset totaltermInGenome  EntrezID
## 1 1.357295e-10               3118            381270     23254
## 2 1.357295e-10               3118            381270    353131
## 3 1.357295e-10               3118            381270 100129271
## 4 1.357295e-10               3118            381270    353137
## 5 1.357295e-10               3118            381270    149018
## 6 1.357295e-10               3118            381270     26239
head(over[["mf"]][, -3])
##        go.id                             go.term Ontology count.InDataset
## 1 GO:0004622          lysophospholipase activity       MF               2
## 2 GO:0004622          lysophospholipase activity       MF               2
## 3 GO:0022841 potassium ion leak channel activity       MF               2
## 4 GO:0022841 potassium ion leak channel activity       MF               2
## 5 GO:0035198                       miRNA binding       MF               2
## 6 GO:0035198                       miRNA binding       MF               2
##   count.InGenome      pvalue totaltermInDataset totaltermInGenome EntrezID
## 1             14 0.007446405               2232            237608   127018
## 2             14 0.007446405               2232            237608     5321
## 3             16 0.009698147               2232            237608     3776
## 4             16 0.009698147               2232            237608     3775
## 5             13 0.006422469               2232            237608   192669
## 6             13 0.006422469               2232            237608    79727

Please note that the default setting of feature_id_type is “ensembl_gene_id”. If you are using TxDb as annotation data, please try to change it to “entrez_id”.

Please also note that org.Hs.eg.db is the GO gene mapping for Human, for other organisms, please refer to released organism annotations, or call function egOrgMap to get the name of annotation database.

egOrgMap("Mus musculus")
## [1] "org.Mm.eg.db"
egOrgMap("Homo sapiens")
## [1] "org.Hs.eg.db"

4.10 Find peaks with bi-directional promoters

Bidirectional promoters are the DNA regions located between the 5’ ends of two adjacent genes coded on opposite strands. The two adjacent genes are transcribed to the opposite directions, and often co-regulated by this shared promoter region9. Here is an example to find peaks with bi-directional promoters and output the percentage of the peaks near bi-directional promoters.

data(myPeakList)
data(TSS.human.NCBI36)
annotatedBDP <- peaksNearBDP(myPeakList[1:10,], 
                             AnnotationData=TSS.human.NCBI36, 
                             MaxDistance=5000,
                             PeakLocForDistance="middle", 
                             FeatureLocForDistance="TSS")
annotatedBDP$peaksWithBDP
## GRanges object with 6 ranges and 9 metadata columns:
##                                 seqnames             ranges strand |
##                                    <Rle>          <IRanges>  <Rle> |
##   X1_14_1300250.ENSG00000218550     chr1 [1300503, 1300603]      * |
##   X1_14_1300250.ENSG00000175756     chr1 [1300503, 1300603]      * |
##    X1_41_559455.ENSG00000212678     chr1 [ 559774,  559874]      * |
##    X1_41_559455.ENSG00000209350     chr1 [ 559774,  559874]      * |
##    X1_93_556427.ENSG00000212875     chr1 [ 556660,  556760]      * |
##    X1_93_556427.ENSG00000209349     chr1 [ 556660,  556760]      * |
##                                          peak         feature
##                                   <character>     <character>
##   X1_14_1300250.ENSG00000218550 X1_14_1300250 ENSG00000218550
##   X1_14_1300250.ENSG00000175756 X1_14_1300250 ENSG00000175756
##    X1_41_559455.ENSG00000212678  X1_41_559455 ENSG00000212678
##    X1_41_559455.ENSG00000209350  X1_41_559455 ENSG00000209350
##    X1_93_556427.ENSG00000212875  X1_93_556427 ENSG00000212875
##    X1_93_556427.ENSG00000209349  X1_93_556427 ENSG00000209349
##                                 start_position end_position feature_strand
##                                      <integer>    <integer>    <character>
##   X1_14_1300250.ENSG00000218550        1303908      1304275              +
##   X1_14_1300250.ENSG00000175756        1298974      1300443              -
##    X1_41_559455.ENSG00000212678         559620       560165              +
##    X1_41_559455.ENSG00000209350         557860       557930              -
##    X1_93_556427.ENSG00000212875         556318       557859              +
##    X1_93_556427.ENSG00000209349         556240       556304              -
##                                 insideFeature distancetoFeature
##                                      <factor>         <numeric>
##   X1_14_1300250.ENSG00000218550      upstream             -3355
##   X1_14_1300250.ENSG00000175756      upstream              -110
##    X1_41_559455.ENSG00000212678        inside               204
##    X1_41_559455.ENSG00000209350      upstream             -1894
##    X1_93_556427.ENSG00000212875        inside               392
##    X1_93_556427.ENSG00000209349      upstream              -406
##                                 shortestDistance fromOverlappingOrNearest
##                                        <integer>              <character>
##   X1_14_1300250.ENSG00000218550             3305          NearestLocation
##   X1_14_1300250.ENSG00000175756               60          NearestLocation
##    X1_41_559455.ENSG00000212678              154          NearestLocation
##    X1_41_559455.ENSG00000209350             1844          NearestLocation
##    X1_93_556427.ENSG00000212875              342          NearestLocation
##    X1_93_556427.ENSG00000209349              356          NearestLocation
##   -------
##   seqinfo: 24 sequences from an unspecified genome; no seqlengths
c(annotatedBDP$percentPeaksWithBDP, 
  annotatedBDP$n.peaks, 
  annotatedBDP$n.peaksWithBDP)
## [1]  0.3 10.0  3.0

4.11 Perform permutation test to determine if there is an association between two sets of peaks

Given two peak lists from two transcript factors (TFs), one can ask whether DNA binding sites of the two TFs are correlated. Most tests are based on hypergeometric distribution, which require user to input an estimate of the total potential binding sites for a given TF. In contrast, peakPermTest implemented here is based on permutation test, which does not require user to supply an estimate of the total potential binding sites. The random peaks in the permutation test are generated using the distribution discovered from the input peaks for a given feature type (transcripts or exons) and relevant binding positions to the features (“TSS”, “geneEnd”). The width of the random peaks also follows the distribution of that of the input peaks.

Following are the sample codes to do the permTest:

if(interactive()){
    library(TxDb.Hsapiens.UCSC.hg19.knownGene)
    txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
    cds <- unique(unlist(cdsBy(txdb)))
    utr5 <- unique(unlist(fiveUTRsByTranscript(txdb)))
    utr3 <- unique(unlist(threeUTRsByTranscript(txdb)))
    set.seed(123)
    utr3 <- utr3[sample.int(length(utr3), 1000)]
    pt <- peakPermTest(utr3, 
             utr5[sample.int(length(utr5), 1000)], 
             maxgap=500,
             TxDb=txdb, seed=1,
             force.parallel=FALSE) 
    plot(pt)
    ## highly relevant peaks
    ol <- findOverlaps(cds, utr3, maxgap=1)
    pt1 <- peakPermTest(utr3,
             c(cds[sample.int(length(cds), 500)], 
                cds[queryHits(ol)][sample.int(length(ol), 500)]), 
             maxgap=500, 
             TxDb=txdb, seed=1,
             force.parallel=FALSE)
    plot(pt1)
}

Alternatively, users can create a pool of peaks representing all potential binding sites with associated binding probabilities for random peak sampling (see ?preparePool). Here is an example to build a human pool of peaks using the transcription factor binding site clusters (V3) (see ?wgEncodeTfbsV3) downloaded from ENCODE with the HOT spots (?HOT.spots) removed. HOT spots are the genomic regions with high probability of being bound by many TFs in ChIP-seq experiments10. We suggest removing those HOT spots from the peak lists before performing permutation test to avoid the overestimation of the association between the two input peak lists. Users can also obtain the ENCODE blacklist for a given species. The blacklists were constructed by identifying consistently problematic regions over independent of cell line and type of experiment for each species in the ENCODE and modENCODE datasets11. Please note that some of those blacklist may need to be liftover-ed to the correct genome assembly.

if(interactive()){
    data(HOT.spots)
    data(wgEncodeTfbsV3)
    hotGR <- reduce(unlist(HOT.spots))
    removeOl <- function(.ele){
        ol <- findOverlaps(.ele, hotGR)
        if(length(ol)>0) .ele <- .ele[-unique(queryHits(ol))]
        .ele
    }
    temp <- tempfile()
    download.file(file.path("http://hgdownload.cse.ucsc.edu", 
                            "goldenPath", "hg19", "encodeDCC", 
                            "wgEncodeRegTfbsClustered", 
                            "wgEncodeRegTfbsClusteredV3.bed.gz"), temp)
    data <- toGRanges(gzfile(temp, "r"), header=FALSE, format="others", 
                      colNames = c("seqnames", "start", "end", "TF"))
    unlink(temp)
    data <- split(data, data$TF)
    TAF1 <- removeOl(data[["TAF1"]])
    TEAD4 <- removeOl(data[["TEAD4"]])
    pool <- new("permPool", grs=GRangesList(wgEncodeTfbsV3), N=length(TAF1))
    pt <- peakPermTest(TAF1, TEAD4, pool=pool, ntimes=1000)
    plot(pt)
}

5 Citing ChIPpeakAnno

Please cite ChIPpeakAnno in your publication as follows:

## 
## Please cite the paper below for the ChIPpeakAnno package.
## 
##   Lihua J Zhu, Claude Gazin, Nathan D Lawson, Herve Pages, Simon M
##   Lin, David S Lapointe and Michael R Green, ChIPpeakAnno: a
##   Bioconductor package to annotate ChIP-seq and ChIP-chip data.
##   BMC Bioinformatics. 2010, 11:237
## 
##   Zhu LJ. Integrative analysis of ChIP-chip and ChIP-seq dataset.
##   Methods Mol Biol. 2013;1067:105-24.

6 Session Info

sessionInfo()
## R version 3.2.3 (2015-12-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.3 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
##  [1] stats4    parallel  grid      stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] FDb.UCSC.tRNAs_1.0.1                   
##  [2] mirbase.db_1.2.0                       
##  [3] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
##  [4] BSgenome.Ecoli.NCBI.20080805_1.3.1000  
##  [5] BSgenome_1.38.0                        
##  [6] EnsDb.Hsapiens.v75_0.99.12             
##  [7] ensembldb_1.2.2                        
##  [8] GO.db_3.2.2                            
##  [9] rtracklayer_1.30.2                     
## [10] TxDb.Hsapiens.UCSC.hg38.knownGene_3.1.3
## [11] GenomicFeatures_1.22.13                
## [12] org.Hs.eg.db_3.2.3                     
## [13] AnnotationDbi_1.32.3                   
## [14] Biobase_2.30.0                         
## [15] ChIPpeakAnno_3.4.6                     
## [16] RSQLite_1.0.0                          
## [17] DBI_0.3.1                              
## [18] VennDiagram_1.6.16                     
## [19] futile.logger_1.4.1                    
## [20] GenomicRanges_1.22.4                   
## [21] GenomeInfoDb_1.6.3                     
## [22] Biostrings_2.38.4                      
## [23] XVector_0.10.0                         
## [24] IRanges_2.4.7                          
## [25] S4Vectors_0.8.11                       
## [26] BiocGenerics_0.16.1                    
## [27] BiocStyle_1.8.0                        
## 
## loaded via a namespace (and not attached):
##  [1] SummarizedExperiment_1.0.2   splines_3.2.3               
##  [3] htmltools_0.3                yaml_2.1.13                 
##  [5] interactiveDisplayBase_1.8.0 RBGL_1.46.0                 
##  [7] survival_2.38-3              XML_3.98-1.3                
##  [9] BiocParallel_1.4.3           lambda.r_1.1.7              
## [11] matrixStats_0.50.1           stringr_1.0.0               
## [13] zlibbioc_1.16.0              evaluate_0.8                
## [15] memoise_1.0.0                knitr_1.12.3                
## [17] biomaRt_2.26.1               httpuv_1.3.3                
## [19] BiocInstaller_1.20.1         Rcpp_0.12.3                 
## [21] xtable_1.8-2                 regioneR_1.2.3              
## [23] formatR_1.2.1                limma_3.26.8                
## [25] graph_1.48.0                 mime_0.4                    
## [27] Rsamtools_1.22.0             AnnotationHub_2.2.3         
## [29] digest_0.6.9                 stringi_1.0-1               
## [31] shiny_0.13.0                 tools_3.2.3                 
## [33] bitops_1.0-6                 magrittr_1.5                
## [35] RCurl_1.95-4.7               futile.options_1.0.0        
## [37] MASS_7.3-45                  rmarkdown_0.9.2             
## [39] httr_1.1.0                   R6_2.1.2                    
## [41] GenomicAlignments_1.6.3      multtest_2.26.0

References

1.Gentleman, R. et al. Bioconductor: Open software development for computational biology and bioinformatics. Genome Biology 5, R80 (2004).

2.Durinck, S. et al. BioMart and bioconductor: A powerful link between biological databases and microarray data analysis. Bioinformatics 21, 3439–3440 (2005).

3.Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological) 57, pp. 289–300 (1995).

4.Benjamini, Y. & Yekutieli, D. The control of the false discovery rate in multiple testing under dependency. Ann. Statist. 29, 1165–1188 (2001).

5.Johnson, N. L., Kemp, A. W. & Kotz, S. Univariate discrete distributions. 444, (John Wiley & Sons, 2005).

6.Holm, S. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6, pp. 65–70 (1979).

7.Hochberg, Y. A sharper bonferroni procedure for multiple tests of significance. Biometrika 75, 800–802 (1988).

8.Dudoit, S., Shaffer, J. P. & Boldrick, J. C. Multiple hypothesis testing in microarray experiments. Statist. Sci. 18, 71–103 (2003).

9.Robertson, G. et al. Genome-wide profiles of sTAT1 dNA association using chromatin immunoprecipitation and massively parallel sequencing. Nature methods 4, 651–657 (2007).

10.Yip, K. Y. et al. Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors. Genome Biol 13, R48 (2012).

11.Consortium, E. P. & others. An integrated encyclopedia of dNA elements in the human genome. Nature 489, 57–74 (2012).