1 Abstract

clusterProfiler supports enrichment analysis of Gene Ontology (GO) and Kyoto Encyclopedia of genes and Genomes (KEGG) with either hypergeometric test or Gene Set Enrichment Analysis (GSEA). clusterProfiler adjust the estimated significance level to account for multiple hypothesis testing and also q-values were calculated for FDR control. It supports several visualization methods, including barplot, cnetplot, enrichMap and gseaplot. clusterProfiler also supports comparing functional profiles among gene clusters. It supports comparing biological themes of GO, KEGG, Disease Ontology (via DOSE) and Reactome pathways (via ReactomePA).

2 Citation

If you use clusterProfiler in published research, please cite G. Yu(2012). In addition, please cite G. Yu (2010) when using GOSemSim for GO semantic similarity analysis, G. Yu (2015) when using DOSE for Disease Ontology analysis and G. Yu (2015) when applying enrichment analysis to NGS data by using ChIPseeker.

G Yu, LG Wang, Y Han, QY He.
clusterProfiler: an R package for comparing biological themes among gene clusters.
OMICS: A Journal of Integrative Biology 2012, 16(5):284-287.

URL: http://dx.doi.org/10.1089/omi.2011.0118

G Yu, F Li, Y Qin, X Bo, Y Wu, S Wang. 
GOSemSim: an R package for measuring semantic similarity among GO terms and gene products.
Bioinformatics 2010, 26(7):976-978.

URL: http://dx.doi.org/10.1093/bioinformatics/btq064

G Yu, LG Wang, GR Yan, QY He.
DOSE: an R/Bioconductor package for Disease Ontology Semantic and Enrichment analysis.
Bioinformatics 2015, 31(4):608-609.

URL: http://dx.doi.org/10.1093/bioinformatics/btu684

G Yu, LG Wang, QY He.
ChIPseeker: an R/Bioconductor package for ChIP peak annotation, comparison and visualization.
Bioinformatics 2015, 31(14):2382-2383.

URL: http://dx.doi.org/10.1093/bioinformatics/btv145

3 Introduction

In recently years, high-throughput experimental techniques such as microarray, RNA-Seq and mass spectrometry can detect cellular molecules at systems-level. These kinds of analyses generate huge quantitaties of data, which need to be given a biological interpretation. A commonly used approach is via clustering in the gene dimension for grouping different genes based on their similarities¹.

To search for shared functions among genes, a common way is to incorporate the biological knowledge, such as Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG), for identifying predominant biological themes of a collection of genes.

After clustering analysis, researchers not only want to determine whether there is a common theme of a particular gene cluster, but also to compare the biological themes among gene clusters. The manual step to choose interesting clusters followed by enrichment analysis on each selected cluster is slow and tedious. To bridge this gap, we designed clusterProfiler², for comparing and visualizing functional profiles among gene clusters.

4 bitr: Biological Id TranslatoR

Many new R user may find traslating ID is a tedious task and I have received many feedbacks from clusterProfiler users that they don’t know how to convert gene symbol, uniprot ID or other ID types to Entrez gene ID that used in clusterProfiler for most of the species.

To remove this obstacle, We provide bitr function for translating among different gene ID types.

x <- c("GPX3",  "GLRX",   "LBP",   "CRYAB", "DEFB1", "HCLS1",   "SOD2",   "HSPA2", 
       "ORM1",  "IGFBP1", "PTHLH", "GPC3",  "IGFBP3","TOB1",    "MITF",   "NDRG1", 
       "NR1H4", "FGFR3",  "PVR",   "IL6",   "PTPRM", "ERBB2",   "NID2",   "LAMB1", 
       "COMP",  "PLS3",   "MCAM",  "SPP1",  "LAMC1", "COL4A2",  "COL4A1", "MYOC",  
       "ANXA4", "TFPI2",  "CST6",  "SLPI",  "TIMP2", "CPM",     "GGT1",   "NNMT",
       "MAL",   "EEF1A2", "HGD",   "TCN2",  "CDA",   "PCCA",    "CRYM",   "PDXK",  
       "STC1",  "WARS",  "HMOX1", "FXYD2", "RBP4",   "SLC6A12", "KDELR3", "ITM2B")
eg = bitr(x, fromType="SYMBOL", toType="ENTREZID", annoDb="org.Hs.eg.db")
head(eg)

##   SYMBOL ENTREZID
## 1   GPX3     2878
## 2   GLRX     2745
## 3    LBP     3929
## 4  CRYAB     1410
## 5  DEFB1     1672
## 6  HCLS1     3059

User should provides an annotation package, both fromType and toType can accept any types that supported.

User can use idType to list all supporting types.

idType("org.Hs.eg.db")

##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT" 
##  [5] "ENSEMBLTRANS" "ENTREZID"     "ENZYME"       "EVIDENCE"    
##  [9] "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"       
## [13] "IPI"          "MAP"          "OMIM"         "ONTOLOGY"    
## [17] "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
## [21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
## [25] "UNIGENE"      "UNIPROT"

We can translate from one type to other types.

ids <- bitr(x, fromType="SYMBOL", toType=c("UNIPROT", "ENSEMBL"), annoDb="org.Hs.eg.db")
head(ids)

##   SYMBOL    UNIPROT         ENSEMBL
## 1   GPX3     P22352 ENSG00000211445
## 2   GLRX A0A024RAM2 ENSG00000173221
## 3   GLRX     P35754 ENSG00000173221
## 4    LBP     P18428 ENSG00000129988
## 5    LBP     Q8TCF0 ENSG00000129988
## 6  CRYAB     P02511 ENSG00000109846

5 Gene Ontology analysis

5.1 Supported organisms

At present, GO analysis in clusterProfiler supports about 20 species internally as shown below:

Arabidopsis
Anopheles
Bovine
Canine
Chicken
Chimp
Coelicolor
E coli strain K12
E coli strain Sakai
Fly
Gondii
Human
Malaria
Mouse
Pig
Rat
Rhesus
Worm
Xenopus
Yeast
Zebrafish

For un-supported organisms, user can use their own GO annotation data (in data.frame format with first column of gene ID and second column of GO ID) and passed it to enricher function (see Universal enrichment analysis section).

If a gene is annotated by a GO ID (direction annotation), it should also annotated by its ancestor GO nodes (indirect annation). If user only has direct annotation, they can pass their annotation to buildGOmap function, which will infer indirection annotation and generate annotation file that suitable for enrichGO function. In future version, we may add functions to help user query annotation from public available database.

5.2 Gene Ontology Classification

In clusterProfiler, groupGO is designed for gene classification based on GO distribution at a specific level. Here we use dataset geneList provided by DOSE. Please refer to vignette of DOSE for more details.

library("DOSE")
data(geneList)
gene <- names(geneList)[abs(geneList) > 2]
head(gene)

## [1] "4312"  "8318"  "10874" "55143" "55388" "991"

ggo <- groupGO(gene     = gene,
               organism = "human",
               ont      = "BP",
               level    = 3,
               readable = TRUE)
head(summary(ggo))

##                    ID                              Description Count
## GO:0019953 GO:0019953                      sexual reproduction    11
## GO:0019954 GO:0019954                     asexual reproduction     0
## GO:0032504 GO:0032504      multicellular organism reproduction    10
## GO:0032505 GO:0032505 reproduction of a single-celled organism     0
## GO:0051321 GO:0051321                       meiotic cell cycle     7
## GO:0006807 GO:0006807      nitrogen compound metabolic process    61
##            GeneRatio
## GO:0019953    11/207
## GO:0019954     0/207
## GO:0032504    10/207
## GO:0032505     0/207
## GO:0051321     7/207
## GO:0006807    61/207
##                                                                                                                                                                                                                                                                                                                                                                      geneID
## GO:0019953                                                                                                                                                                                                                                                                                                    ASPM/CDK1/TRIP13/KIFC1/AURKA/CCNB1/PTTG1/GAMT/BMP4/DNALI1/PGR
## GO:0019954                                                                                                                                                                                                                                                                                                                                                                 
## GO:0032504                                                                                                                                                                                                                                                                                                           ASPM/TRIP13/KIFC1/AURKA/CCNB1/PTTG1/GAMT/BMP4/STC2/PGR
## GO:0032505                                                                                                                                                                                                                                                                                                                                                                 
## GO:0051321                                                                                                                                                                                                                                                                                                                         CDC20/TOP2A/ASPM/NEK2/TRIP13/AURKA/PTTG1
## GO:0006807 CDC45/MCM10/S100A9/FOXM1/MYBL2/S100A8/TOP2A/E2F8/CXCL10/RRM2/UGT8/ISG20/CXCL11/SLC7A5/RAD51AP1/CXCL9/CCNA2/CDK1/GINS1/PAX6/CDT1/BIRC5/EZH2/AURKB/GINS2/CHAF1B/CHEK1/TRIP13/QPRT/IDO1/DTL/NUDT1/CCNB1/PIR/MCM5/PTTG1/MAOB/ADIPOQ/DACH1/ZNF423/CORIN/AK5/RNASE4/OMD/NOVA1/PCK1/EMX2/FABP4/GAMT/BMP4/SLC44A4/ABLIM3/ERBB4/NDP/FOXA1/CRY2/GATA3/TFAP2B/PGR/ADIRF/OGN

The input parameters of gene is a vector of gene IDs. It expects entrezgene for most of the organisms. For yeast, it should be ORF IDs; organism should be the common name of supported species. If readable is setting to TRUE, the input gene IDs will be converted to gene symbols.

5.3 GO over-representation test

Over-representation test³ were implemented in clusterProfiler. For calculation details, please refer to the vignette of DOSE.

ego <- enrichGO(gene          = gene,
                universe      = names(geneList),
                organism      = "human",
                ont           = "CC",
                pAdjustMethod = "BH",
                pvalueCutoff  = 0.01,
                qvalueCutoff  = 0.05,
                readable      = TRUE)
head(summary(ego))

##                    ID                              Description GeneRatio
## GO:0005819 GO:0005819                                  spindle    24/197
## GO:0005876 GO:0005876                      spindle microtubule    11/197
## GO:0000793 GO:0000793                     condensed chromosome    17/197
## GO:0000779 GO:0000779 condensed chromosome, centromeric region    13/197
## GO:0005875 GO:0005875           microtubule associated complex    14/197
## GO:0015630 GO:0015630                 microtubule cytoskeleton    36/197
##              BgRatio       pvalue     p.adjust       qvalue
## GO:0005819 222/11999 1.975020e-13 6.616317e-11 5.883481e-11
## GO:0005876  45/11999 1.101862e-10 1.845619e-08 1.641194e-08
## GO:0000793 150/11999 3.657361e-10 4.084053e-08 3.631695e-08
## GO:0000779  81/11999 5.980787e-10 5.008909e-08 4.454112e-08
## GO:0005875 109/11999 2.660563e-09 1.782577e-07 1.585136e-07
## GO:0015630 765/11999 7.746742e-09 4.325264e-07 3.846190e-07
##                                                                                                                                                                                                                        geneID
## GO:0005819                                                                      KIF20A/CENPE/KIF18B/SKA1/TPX2/KIF4A/ASPM/BIRC5/KIF11/KIFC1/MAD2L1/NEK2/NUSAP1/CDCA8/AURKA/TTK/KIF18A/CCNB1/PRC1/AURKB/KIF23/DLGAP5/CDK1/CDC20
## GO:0005876                                                                                                                                                  KIF18B/SKA1/KIF4A/BIRC5/KIF11/NUSAP1/AURKA/KIF18A/PRC1/AURKB/CDK1
## GO:0000793                                                                                                              NDC80/CENPE/CHEK1/SKA1/NCAPH/BIRC5/MAD2L1/NEK2/ERCC6L/HJURP/CENPN/NCAPG/AURKA/TOP2A/CENPM/CCNB1/AURKB
## GO:0000779                                                                                                                                      NDC80/CENPE/SKA1/BIRC5/MAD2L1/NEK2/ERCC6L/HJURP/CENPN/AURKA/CENPM/CCNB1/AURKB
## GO:0005875                                                                                                                             KIF20A/CENPE/KIF18B/KIF4A/BIRC5/KIF11/KIFC1/MAPT/CDCA8/AURKA/DNALI1/KIF18A/AURKB/KIF23
## GO:0015630 KIF20A/TACC3/CENPE/CHEK1/KIF18B/SKA1/TPX2/KIF4A/ASPM/AK5/BIRC5/KIF11/KIFC1/MAD2L1/MAPT/NEK2/NUSAP1/DTL/ERCC6L/CDCA8/CEP55/NCAPG/AURKA/TOP2A/TTK/DNALI1/KIF18A/CDC45/CCNB1/PRC1/CCNB2/AURKB/KIF23/DLGAP5/CDK1/CDC20
##            Count
## GO:0005819    24
## GO:0005876    11
## GO:0000793    17
## GO:0000779    13
## GO:0005875    14
## GO:0015630    36

A detail explanation of the parameter can be found in the vignette of DOSE.

5.4 GO Gene Set Enrichment Analysis

A common approach in analyzing gene expression profiles was identifying differential expressed genes that are deemed interesting. The enrichment analysis we demonstrated previous were based on these differential expressed genes. This approach will find genes where the difference is large, but it will not detect a situation where the difference is small, but evidenced in coordinated way in a set of related genes. Gene Set Enrichment Analysis (GSEA)⁴ directly addresses this limitation. All genes can be used in GSEA; GSEA aggregates the per gene statistics across genes within a gene set, therefore making it possible to detect situations where all genes in a predefined set change in a small but coordinated way. Since it is likely that many relevant phenotypic differences are manifested by small but consistent changes in a set of genes.

For algorithm details, please refer to the vignette of DOSE.

ego2 <- gseGO(geneList     = geneList,
              organism     = "human",
              ont          = "CC",
              nPerm        = 1000,
              minGSSize    = 120,
              pvalueCutoff = 0.01,
              verbose      = FALSE)

GSEA use permutation test, user can set nPerm for number of permutations. Gene Set size below minGSSize will be omitted.

5.5 GO Semantic Similarity Analysis

GO semantic similarity can be calculated by GOSemSim¹. We can use it to cluster genes/proteins into different clusters based on their functional similarity and can also use it to measure the similarities among GO terms to reduce the redundancy of GO enrichment results.

6 KEGG analysis

The annotation package, KEGG.db, is not updated since 2012. It’s now pretty old and in clusterProfiler, enrichKEGG supports downloading latest online version of KEGG data for enrichment analysis. Using KEGG.db is also supported by explicitly setting use_internal_data parameter to TRUE, but it’s not recommended.

With this new feature, organism is not restricted to those supported in previous release, it can be any species that have KEGG annotation data available in KEGG database. User should pass abbreviation of academic name to the organism parameter. The full list of KEGG supported organisms can be accessed via http://www.genome.jp/kegg/catalog/org_list.html.

6.1 KEGG over-representation test

To speed up the compilation of this document, we set use_internal_data = TRUE.

kk <- enrichKEGG(gene         = gene,
                 organism     = "human",
                 pvalueCutoff = 0.05, 
                 readable     = TRUE,
                 use_internal_data = TRUE)
head(summary(kk))

##                ID                             Description GeneRatio
## hsa04110 hsa04110                              Cell cycle     11/74
## hsa04114 hsa04114                          Oocyte meiosis     10/74
## hsa03320 hsa03320                  PPAR signaling pathway      7/74
## hsa04914 hsa04914 Progesterone-mediated oocyte maturation      6/74
## hsa04115 hsa04115                   p53 signaling pathway      5/74
## hsa04062 hsa04062             Chemokine signaling pathway      8/74
##           BgRatio       pvalue     p.adjust       qvalue
## hsa04110 128/5894 4.309086e-07 4.826176e-05 4.762674e-05
## hsa04114 114/5894 1.252655e-06 7.014870e-05 6.922569e-05
## hsa03320  70/5894 2.351387e-05 8.778511e-04 8.663005e-04
## hsa04914  87/5894 7.212570e-04 2.019520e-02 1.992947e-02
## hsa04115  69/5894 1.637682e-03 3.668408e-02 3.620140e-02
## hsa04062 189/5894 2.373246e-03 4.430059e-02 4.371768e-02
##                                                                  geneID
## hsa04110 CDC45/CDC20/CCNB2/CCNA2/CDK1/MAD2L1/TTK/CHEK1/CCNB1/MCM5/PTTG1
## hsa04114     CDC20/CCNB2/CDK1/MAD2L1/CALML5/AURKA/CCNB1/PTTG1/ITPR1/PGR
## hsa03320                      MMP1/FADS2/ADIPOQ/PCK1/FABP4/HMGCS2/PLIN1
## hsa04914                              CCNB2/CCNA2/CDK1/MAD2L1/CCNB1/PGR
## hsa04115                                    CCNB2/RRM2/CDK1/CHEK1/CCNB1
## hsa04062            CXCL10/CXCL13/CXCL11/CXCL9/CCL18/CCL8/CXCL14/CX3CR1
##          Count
## hsa04110    11
## hsa04114    10
## hsa03320     7
## hsa04914     6
## hsa04115     5
## hsa04062     8

6.2 KEGG Gene Set Enrichment Analysis

kk2 <- gseKEGG(geneList     = geneList,
               organism     = "human",
               nPerm        = 1000,
               minGSSize    = 120,
               pvalueCutoff = 0.05,
               verbose      = FALSE,
               use_internal_data = TRUE)
head(summary(kk2))

##                ID                            Description setSize
## hsa04510 hsa04510                         Focal adhesion     193
## hsa04062 hsa04062            Chemokine signaling pathway     166
## hsa03013 hsa03013                          RNA transport     124
## hsa04060 hsa04060 Cytokine-cytokine receptor interaction     233
## hsa01100 hsa01100                     Metabolic pathways     920
##          enrichmentScore       NES      pvalue   p.adjust    qvalues
## hsa04510      -0.4463585 -1.867478 0.001432665 0.01923077 0.01445922
## hsa04062       0.3831672  1.674395 0.003184713 0.01923077 0.01445922
## hsa03013       0.4271821  1.802566 0.003205128 0.01923077 0.01445922
## hsa04060       0.3394553  1.561842 0.003663004 0.01923077 0.01445922
## hsa01100       0.2360780  1.212631 0.008849558 0.03716814 0.02794597

7 Disease Ontology analysis

DOSE⁵ supports Disease Ontology (DO) Semantic and Enrichment analysis, please refer to the package vignettes. The enrichDO function is very useful for identifying disease association of interesting genes, and function gseAnalyzer function is designed for gene set enrichment analysis of DO.

8 Reactome pathway analysis

ReactomePA uses Reactome as a source of pathway data. The function call of enrichPathway and gsePathway in ReactomePA is consistent with enrichKEGG and gseKEGG.

9 DAVID functional analysis

clusterProfiler provides enrichment and GSEA analysis with GO, KEGG, DO and Reactome pathway supported internally, some user may prefer GO and KEGG analysis with DAVID⁶ and still attracted by the visualization methods provided by clusterProfiler^???. To bridge the gap between DAVID and clusterProfiler, we implemented enrichDAVID. This function query enrichment analysis result from DAVID webserver via RDAVIDWebService⁷ and stored the result as an enrichResult instance, so that we can use all the visualization functions in clusterProfiler to visualize DAVID results. enrichDAVID is fully compatible with compareCluster function and comparing enrichment results from different gene clusters is now available with DAVID.

david <- enrichDAVID(gene = gene,
                     idType = "ENTREZ_GENE_ID",
                     listType = "Gene",
                     annotation = "KEGG_PATHWAY",
                     david.user = "clusterProfiler@hku.hk")

DAVID Web Service has the following limitations:

A job with more than 3000 genes to generate gene or term cluster report will not be handled by DAVID due to resource limit.
No more than 200 jobs in a day from one user or computer.
DAVID Team reserves right to suspend any improper uses of the web service without notice.

For more details, please refer to http://david.abcc.ncifcrf.gov/content.jsp?file=WS.html.

As user has limited usage, please register and use your own user account to run enrichDAVID.

10 Universal enrichment analysis

clusterProfiler supports both hypergeometric test and gene set enrichment analysis of many ontology/pathway, but it’s still not enough for users may want to analyze their data with unsupported organisms, slim version of GO, novel functional annotation (e.g. GO via BlastGO or KEGG via KAAS), unsupported ontologies/pathways or customized annotations.

clusterProfiler provides enricher function for hypergeometric test and GSEA function for gene set enrichment analysis that are designed to accept user defined annotation. They accept two additional parameters TERM2GENE and TERM2NAME. As indicated in the parameter names, TERM2GENE is a data.frame with first column of term ID and second column of corresponding mapped gene and TERM2NAME is a data.frame with first column of term ID and second column of corresponding term name. TERM2NAME is optional.

An example of using enricher and GSEA to analyze DisGeNet annotation is presented in use clusterProfiler as an universal enrichment analysis tool.

11 Functional analysis of NGS data

Functional analysis using NGS data (eg, RNA-Seq and ChIP-Seq) can be performed by linking coding and non-coding regions to coding genes via ChIPseeker⁸ package, which can annotates genomic regions to their nearest genes, host genes, and flanking genes respectivly. In addtion, it provides a function, seq2gene, that simultaneously considering host genes, promoter region and flanking gene from intergenic region that may under control via cis-regulation. This function maps genomic regions to genes in a many-to-many manner and facilitate functional analysis. For more details, please refer to ChIPseeker.

12 Visualization

The function calls of groupGO, enrichGO, enrichKEGG, enrichDO and enrichPathway are consistent and all the output can be visualized by bar plot, enrichment map and category-gene-network plot. It is very common to visualize the enrichment result in bar or pie chart. We believe the pie chart is misleading and only provide bar chart.

12.1 barplot

barplot(ggo, drop=TRUE, showCategory=12)

barplot(ego, showCategory=8)

## dotplot

dotplot is a good alternative to barplot.

dotplot(ego)

## enrichMap

Enrichment map can be viusalized by enrichMap, which also support results obtained from hypergeometric test and gene set enrichment analysis.

enrichMap(ego)

enrichment map of enrichment result

12.2 cnetplot

In order to consider the potentially biological complexities in which a gene may belong to multiple annotation categories and provide information of numeric changes if available, we developed cnetplot function to extract the complex association.

cnetplot(ego, categorySize="pvalue", foldChange=geneList)

cnetplot(kk, categorySize="geneNum", foldChange=geneList)

## gseaplot

Running score of gene set enrichment analysis and its association of phenotype can be visualized by gseaplot.

gseaplot(kk2, geneSetID = "hsa04145")

plotting gsea result

12.3 plotGOgraph

plotGOgraph, which is based on topGO, can accept output of enrichGO and visualized the enriched GO induced graph.

plotGOgraph(ego)

## 
## groupGOTerms:    GOBPTerm, GOMFTerm, GOCCTerm environments built.
## 
## Building most specific GOs ..... ( 335 GO terms found. )
## 
## Build GO DAG topology .......... ( 335 GO terms and 667 relations. )
## 
## Annotating nodes ............... ( 11632 genes annotated to the GO terms. )

## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 29 
## Number of Edges = 50 
## 
## $complete.dag
## [1] "A graph with 29 nodes."

12.4 pathview from pathview package

clusterProfiler users can also use pathview from the pathview⁹ to visualize KEGG pathway.

The following example illustrate how to visualize “hsa04110” pathway, which was enriched in our previous analysis.

library("pathview")
hsa04110 <- pathview(gene.data  = geneList,
                     pathway.id = "hsa04110",
                     species    = "hsa",
                     limit      = list(gene=max(abs(geneList)), cpd=1))

For further information, please refer to the vignette of pathview⁹.

13 Biological theme comparison

clusterProfiler was developed for biological theme comparison², and it provides a function, compareCluster, to automatically calculate enriched functional categories of each gene clusters.

data(gcSample)
lapply(gcSample, head)

## $X1
## [1] "4597"  "7111"  "5266"  "2175"  "755"   "23046"
## 
## $X2
## [1] "23450" "5160"  "7126"  "26118" "8452"  "3675" 
## 
## $X3
## [1] "894"   "7057"  "22906" "3339"  "10449" "6566" 
## 
## $X4
## [1] "5573"  "7453"  "5245"  "23450" "6500"  "4926" 
## 
## $X5
## [1] "5982" "7318" "6352" "2101" "8882" "7803"
## 
## $X6
## [1] "5337"  "9295"  "4035"  "811"   "23365" "4629" 
## 
## $X7
## [1] "2621" "2665" "5690" "3608" "3550" "533" 
## 
## $X8
## [1] "2665" "4735" "1327" "3192" "5573" "9528"

The input for geneCluster parameter should be a named list of gene IDs. To speed up the compilation of this document, we set use_internal_data = TRUE.

ck <- compareCluster(geneCluster = gcSample, fun = "enrichKEGG", use_internal_data = TRUE)
head(summary(ck))

##   Cluster       ID              Description GeneRatio  BgRatio
## 1      X2 hsa04110               Cell cycle    18/297 128/5894
## 2      X2 hsa05340 Primary immunodeficiency     8/297  35/5894
## 3      X3 hsa04512 ECM-receptor interaction     9/152  85/5894
## 4      X4 hsa03030          DNA replication    10/326  36/5894
## 5      X4 hsa04110               Cell cycle    20/326 128/5894
## 6      X4 hsa00240    Pyrimidine metabolism    16/326  99/5894
##         pvalue    p.adjust      qvalue
## 1 6.383147e-05 0.012064148 0.011422474
## 2 2.698906e-04 0.025504659 0.024148104
## 3 3.077198e-04 0.047388855 0.042756862
## 4 1.632995e-05 0.001877912 0.001587708
## 5 1.997779e-05 0.001877912 0.001587708
## 6 8.987559e-05 0.004283049 0.003621167
##                                                                                             geneID
## 1          991/1869/890/1871/701/990/10926/9088/8317/9700/9134/1029/2810/699/11200/23594/8555/4173
## 2                                                              100/6891/3932/973/916/925/958/64421
## 3                                                     7057/3339/3695/1101/3679/3910/3696/1302/3693
## 4                                              5425/4172/4175/4171/10535/5984/2237/4176/54107/4173
## 5 6500/9184/4172/994/4175/4171/1387/10274/8697/902/4616/5591/4176/8881/7043/983/1022/1028/891/4173
## 6                5425/7296/4860/6241/7298/5440/7372/5430/9583/4832/54107/953/5435/1635/51728/29922
##   Count
## 1    18
## 2     8
## 3     9
## 4    10
## 5    20
## 6    16

13.1 Formula interface of compareCluster

compareCluster also supports passing a formula (the code to support formula has been contributed by Giovanni Dall’Olio) of type \(Entrez \sim group\) or \(Entrez \sim group + othergroup\).

## formula interface
mydf <- data.frame(Entrez=c('1', '100', '1000', '100101467',
                       '100127206', '100128071'),
                   group = c('A', 'A', 'A', 'B', 'B', 'B'),
                   othergroup = c('good', 'good', 'bad', 'bad',
                       'good', 'bad'))
xx.formula <- compareCluster(Entrez~group, data=mydf, fun='groupGO')
head(summary(xx.formula))

##   Cluster         ID          Description Count GeneRatio     geneID
## 1       A GO:0016020             membrane     2       2/3   100/1000
## 2       A GO:0005576 extracellular region     3       3/3 1/100/1000
## 3       A GO:0005623                 cell     2       2/3   100/1000
## 4       A GO:0009295             nucleoid     0       0/3           
## 5       A GO:0019012               virion     0       0/3           
## 6       A GO:0030054        cell junction     2       2/3   100/1000

## formula interface with more than one grouping variable
xx.formula.twogroups <- compareCluster(Entrez~group+othergroup,
                                       data=mydf, fun='groupGO')
head(summary(xx.formula.twogroups))

##   Cluster         ID          Description Count GeneRatio geneID
## 1   A.bad GO:0016020             membrane     1       1/1   1000
## 2   A.bad GO:0005576 extracellular region     1       1/1   1000
## 3   A.bad GO:0005623                 cell     1       1/1   1000
## 4   A.bad GO:0009295             nucleoid     0       0/1       
## 5   A.bad GO:0019012               virion     0       0/1       
## 6   A.bad GO:0030054        cell junction     1       1/1   1000

13.2 Visualization of profile comparison

We can visualize the result using plot method.

plot(ck)

By default, only top 5 (most significant) categories of each cluster was plotted. User can changes the parameter showCategory to specify how many categories of each cluster to be plotted, and if showCategory was set to NULL, the whole result will be plotted.

The plot function accepts a parameter by for setting the scale of dot sizes. The default parameter by is setting to “geneRatio”, which corresponding to the “GeneRatio” column of the output. If it was setting to count, the comparison will be based on gene counts, while if setting to rowPercentage, the dot sizes will be normalized by count/(sum of each row)

To provide the full information, we also provide number of identified genes in each category (numbers in parentheses) when by is setting to rowPercentage and number of gene clusters in each cluster label (numbers in parentheses) when by is setting to geneRatio, as shown in Figure 3. If the dot sizes were based on count, the row numbers will not shown.

The p-values indicate that which categories are more likely to have biological meanings. The dots in the plot are color-coded based on their corresponding p-values. Color gradient ranging from red to blue correspond to in order of increasing p-values. That is, red indicate low p-values (high enrichment), and blue indicate high p-values (low enrichment). P-values and adjusted p-values were filtered out by the threshold giving by parameter pvalueCutoff, and FDR can be estimated by qvalue.

User can refer to the example in²; we analyzed the publicly available expression dataset of breast tumour tissues from 200 patients (GSE11121, Gene Expression Omnibus)¹⁰. We identified 8 gene clusters from differentially expressed genes, and using compareCluster to compare these gene clusters by their enriched biological process.

Another example was shown in¹¹, we calculated functional similarities among viral miRNAs using method described in¹², and compared significant KEGG pathways regulated by different viruses using compareCluster.

The comparison function was designed as a framework for comparing gene clusters of any kind of ontology associations, not only groupGO, enrichGO, enrichKEGG and enricher provided in this package, but also other biological and biomedical ontologies, for instance, enrichDO from DOSE⁵ and enrichPathway from ReactomePA work fine with compareCluster for comparing biological themes in disease and reactome pathway perspective. More details can be found in the vignettes of DOSE⁵ and ReactomePA.

14 External documents

14.1 Bugs/Feature requests

If you have any, let me know.

15 Session Information

Here is the output of sessionInfo() on the system on which this document was compiled:

## R version 3.2.3 (2015-12-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.3 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
##  [1] grid      parallel  stats4    stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] Rgraphviz_2.14.0      graph_1.48.0          SparseM_1.7          
##  [4] clusterProfiler_2.4.3 org.Hs.eg.db_3.2.3    GO.db_3.2.2          
##  [7] AnnotationDbi_1.32.3  IRanges_2.4.6         S4Vectors_0.8.9      
## [10] Biobase_2.30.0        BiocGenerics_0.16.1   DOSE_2.8.2           
## [13] RSQLite_1.0.0         DBI_0.3.1             BiocStyle_1.8.0      
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.3       KEGG.db_3.2.2     XVector_0.10.0   
##  [4] formatR_1.2.1     plyr_1.8.3        zlibbioc_1.16.0  
##  [7] tools_3.2.3       digest_0.6.9      lattice_0.20-33  
## [10] evaluate_0.8      gtable_0.1.2      png_0.1-7        
## [13] igraph_1.0.1      yaml_2.1.13       topGO_2.22.0     
## [16] stringr_1.0.0     httr_1.0.0        knitr_1.12       
## [19] Biostrings_2.38.3 qvalue_2.2.2      R6_2.1.1         
## [22] GOSemSim_1.28.2   rmarkdown_0.9.2   ggplot2_2.0.0    
## [25] DO.db_2.9         reshape2_1.4.1    magrittr_1.5     
## [28] scales_0.3.0      htmltools_0.3     splines_3.2.3    
## [31] KEGGREST_1.10.1   colorspace_1.2-6  labeling_0.3     
## [34] stringi_1.0-1     munsell_0.4.2

References

1.Yu, G. et al. GOSemSim: An r package for measuring semantic similarity among gO terms and gene products. Bioinformatics 26, 976–978 (2010).

2.Yu, G., Wang, L.-G., Han, Y. & He, Q.-Y. ClusterProfiler: An r package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology 16, 284–287 (2012).

3.Boyle, E. I. et al. GO::TermFinder–open source software for accessing gene ontology information and finding significantly enriched gene ontology terms associated with a list of genes. Bioinformatics (Oxford, England) 20, 3710–3715 (2004).

4.Subramanian, A. et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America 102, 15545–15550 (2005).

5.Yu, G., Wang, L.-G., Yan, G.-R. & He, Q.-Y. DOSE: An r/Bioconductor package for disease ontology semantic and enrichment analysis. Bioinformatics 31, 608–609 (2015).

6.Huang, D. et al. The DAVID gene functional classification tool: A novel biological module-centric algorithm to functionally analyze large gene lists. Genome Biology 8, R183 (2007).

7.Fresno, C. & Fernández, E. A. RDAVIDWebService: A versatile r interface to DAVID. Bioinformatics 29, 2810–2811 (2013).

8.Yu, G., Wang, L.-G. & He, Q.-Y. ChIPseeker: An r/Bioconductor package for chIP peak annotation, comparison and visualization. Bioinformatics 31, 2382–2383 (2015).

9.Luo, W. & Brouwer, C. Pathview: An R/Bioconductor package for pathway-based data integration and visualization. Bioinformatics 29, 1830–1831 (2013).

10.Schmidt, M. et al. The humoral immune system has a key prognostic impact in node-negative breast cancer. Cancer Research 68, 5405–5413 (2008).

11.Yu, G. & He, Q. Functional similarity analysis of human virus-encoded miRNAs. Journal of Clinical Bioinformatics 1, 15 (2011).

12.Yu, G. et al. A new method for measuring functional similarity of microRNAs. Journal of Integrated OMICS 1, 49–54 (2011).

Statistical analysis and visualization of functional profiles for gene and gene clusters

Guangchuang Yu (guangchuangyu@gmail.com)
School of Public Health, The University of Hong Kong

2016-01-19

Contents

1 Abstract

2 Citation

3 Introduction

4 bitr: Biological Id TranslatoR

5 Gene Ontology analysis

5.1 Supported organisms

5.2 Gene Ontology Classification

5.3 GO over-representation test

5.4 GO Gene Set Enrichment Analysis

5.5 GO Semantic Similarity Analysis

6 KEGG analysis

6.1 KEGG over-representation test

6.2 KEGG Gene Set Enrichment Analysis

7 Disease Ontology analysis

8 Reactome pathway analysis

9 DAVID functional analysis

10 Universal enrichment analysis

11 Functional analysis of NGS data

12 Visualization

12.1 barplot

12.2 cnetplot

12.3 plotGOgraph

12.4 pathview from pathview package

13 Biological theme comparison

13.1 Formula interface of compareCluster

13.2 Visualization of profile comparison

14 External documents

14.1 Bugs/Feature requests

15 Session Information

References

Statistical analysis and visualization of functional profiles for gene and gene clusters

Guangchuang Yu (guangchuangyu@gmail.com)School of Public Health, The University of Hong Kong

2016-01-19

Contents

1 Abstract

2 Citation

3 Introduction

4 bitr: Biological Id TranslatoR

5 Gene Ontology analysis

5.1 Supported organisms

5.2 Gene Ontology Classification

5.3 GO over-representation test

5.4 GO Gene Set Enrichment Analysis

5.5 GO Semantic Similarity Analysis

6 KEGG analysis

6.1 KEGG over-representation test

6.2 KEGG Gene Set Enrichment Analysis

7 Disease Ontology analysis

8 Reactome pathway analysis

9 DAVID functional analysis

10 Universal enrichment analysis

11 Functional analysis of NGS data

12 Visualization

12.1 barplot

12.2 cnetplot

12.3 plotGOgraph

12.4 pathview from pathview package

13 Biological theme comparison

13.1 Formula interface of compareCluster

13.2 Visualization of profile comparison

14 External documents

14.1 Bugs/Feature requests

15 Session Information

References

Guangchuang Yu (guangchuangyu@gmail.com)
School of Public Health, The University of Hong Kong