Citation
When using GAPGOM, please cite the following;
- Ehsani R, Drablos F: TopoICSim: a new semantic similarity measure based on gene ontology. BMC Bioinformatics 2016, 17(1):296. DOI: https://doi.org/10.1186/s12859-016-1160-0
- Ehsani R, Drablos F: Measures of co-expression for improved function prediction of long non-coding RNAs. BMC Bioinformatics 2018. Accepted.
Introduction
GAPGOM (novel Gene Annotation Prediction and other GO Metrics) is an R package with tools and algorithms for estimating correlation of gene expression, enriched terms in gene sets, and semantic distance between sets of gene ontology (GO) terms. This package has been made for predicting the annotation of un-annotated gene(s), in particular with respect to GO, and testing such predictions. The prediction is done by comparing expression patterns between a query gene and a library of annotated genes, and annotate the query gene by enriched terms from the set of genes with similar expression pattern (often described as “guilt by association”).
For correlation of gene expression, GAPGOM is introducing LNCRNA2GOA which is a novel tool. The main interface for expression data is currently the Fantom5 data, using Bioconductor’s ExpressionSet
class.
For semantic similarity of GO terms (in particular for testing predictions), the package is using TopoICSim. It makes use of GO data via the GOSemSim package with the godata()
interface.
GO consists of three main Ontologies; molecular function (MF), biological process (BP) and cellular component (CC).
Installation
Before installing, the package has quite a few dependencies in both cran and Bioconductor. You can run the following block of code (preferably line-by-line because of prompts) to install these and the package itself.
Expression data interfaces
ID support
As of v0.2.7 and up, all AnnotationDbi
IDs should be supported. However, we recommend usage of EntrezID
s because this is the most widely supported ID in this and other packages. If you find issues with respect to ID support, please notify this issue on the package repository. If you want to (or have to) convert IDs manually, the BiomaRt
package is recommended. However, translating IDs to other types is lossy and does not always translate well.
Expression data (FANTOM5)
As of right now, the package has one main dataset interface for expression data; The Fantom5 dataset. For other datasets, an ExpressionSet
has to be manually made as described later in this chapter. There are a few helper functions to make this data usable. The Fantom5 dataset is only available for the human and mouse genome. Examples of helper functions/interfaces can be found below;
# download the fantom5 data file
fantom_file <- fantom_download("./", organism = "mouse",
noprompt = TRUE) # saves filename
# load the file (use fantom_file variable if doing all at once)
ft5 <- fantom_load_raw("./mm9.cage_peak_phase1and2combined_tpm_ann.osc.txt",
verbose = TRUE)
# remove first two rows from fantom5 data (these are seperate statistis,
# we just need expressionvalues)
ft5$df <- ft5$df[3:nrow(ft5$df),]
# convert the raw fantom table to an ExpressionSet
expset <- fantom_to_expset(ft5, verbose = TRUE)
Do note that all standard columns are necessary when converting to an ExpressionSet
in this way!
Manually specifying an ExpressionSet
Because loading expression data right now is a bit limited, this paragraph will describe how to convert expression data to an ExpressionSet
object. We will give an example with randomly selected expression values and IDs. In some cases if you want something specific, defining it this way can actually be better (More control over extra data that goes into the object/better interoperability between packages).
Minimal requirement for an ExpressionSet
;
- expression values
- Unique IDs of a certain type. AnnotationDbi keys are the only IDs that are supported right now.
Each row of expression values should have corresponding IDs, with the ID-type as its column name.
Random expression value generation;
# select x random IDs
x_entries <- 1000
go_data <- GAPGOM::set_go_data("human", "BP", computeIC = FALSE)
#> Loading required package: org.Hs.eg.db
#> Loading required package: AnnotationDbi
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#>
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#>
#> IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#>
#> Filter, Find, Map, Position, Reduce, anyDuplicated, append,
#> as.data.frame, basename, cbind, colnames, dirname, do.call,
#> duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
#> lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#> pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
#> tapply, union, unique, unsplit, which.max, which.min
#> Loading required package: Biobase
#> Welcome to Bioconductor
#>
#> Vignettes contain introductory material; view with
#> 'browseVignettes()'. To cite Bioconductor, see
#> 'citation("Biobase")', and for packages 'citation("pkgname")'.
#> Loading required package: IRanges
#> Loading required package: S4Vectors
#>
#> Attaching package: 'S4Vectors'
#> The following objects are masked from 'package:base':
#>
#> I, expand.grid, unname
#> preparing gene to GO mapping data...
random_ids <- unique(sample(go_data@geneAnno$ENTREZID, x_entries)) # and only keep
# uniques
# make general dataframe.
expressions <- data.frame(random_ids)
colnames(expressions) <- "ENTREZID"
expressions$ID
#> NULL
# n expression values depending on the amount of unique IDs that are present
expressionvalues <- abs(rnorm(length(random_ids)*6))*x_entries
expressions[,2:7] <- expressionvalues
head(expressions)
#> ENTREZID V2 V3 V4 V5 V6 V7
#> 1 56127 1027.1873 497.3245 3461.4253 464.7037 489.6280 44.32688
#> 2 3975 540.7969 814.8878 482.1535 705.6733 566.3916 865.47153
#> 3 285672 732.5895 324.3727 669.1559 392.7910 743.3972 639.94694
#> 4 79871 339.5573 1062.8599 1625.3913 1259.5608 164.5612 1361.73102
#> 5 9353 183.6952 1444.1870 1187.6101 716.0709 1598.7571 485.16822
#> 6 84527 322.1536 964.9530 1622.6438 1717.0781 1057.2547 44.47758
Converting the expression dataframe to an ExpressionSet;
expression_matrix <- as.matrix(expressions[,2:ncol(expressions)])
rownames(expression_matrix) <- expressions$ENTREZID
featuredat <- as.data.frame(expressions$ENTREZID) # And everything else besides expressionvalues (preferably you don't even need to include the IDs themselves here!)
rownames(featuredat) <- expressions$ENTREZID # because they will be the rownames anyway.
expset <- ExpressionSet(expression_matrix,
featureData = new("AnnotatedDataFrame",
data=featuredat))
# To see how it is structured;
head(expset)
#> ExpressionSet (storageMode: lockedEnvironment)
#> assayData: 6 features, 6 samples
#> element names: exprs
#> protocolData: none
#> phenoData: none
#> featureData
#> featureNames: 56127 3975 ... 84527 (6 total)
#> fvarLabels: expressions$ENTREZID
#> fvarMetadata: labelDescription
#> experimentData: use 'experimentData(object)'
#> Annotation:
head(assayData(expset)[["exprs"]]) # where expressionvalues are stored.
#> V2 V3 V4 V5 V6 V7
#> 56127 1027.1873 497.3245 3461.4253 464.7037 489.6280 44.32688
#> 3975 540.7969 814.8878 482.1535 705.6733 566.3916 865.47153
#> 285672 732.5895 324.3727 669.1559 392.7910 743.3972 639.94694
#> 79871 339.5573 1062.8599 1625.3913 1259.5608 164.5612 1361.73102
#> 9353 183.6952 1444.1870 1187.6101 716.0709 1598.7571 485.16822
#> 84527 322.1536 964.9530 1622.6438 1717.0781 1057.2547 44.47758
head(pData(featureData(expset))) # where other information is stored.
#> expressions$ENTREZID
#> 56127 56127
#> 3975 3975
#> 285672 285672
#> 79871 79871
#> 9353 9353
#> 84527 84527
LNCRNA2GOA (expression similarity)
Background
The LNCRNA2GOA (long non-coding RNA to GO Annotation) or expression_prediction()
uses various methods/measures to determine similar genes with similar expression pattern; pearson
, spearman
, kendall
, sobolev
and fisher
. This calculates scores between expression value sets given a query gene. These scores are used to identify genes for the enrichment analysis, which will be sorted by significance before being returned. It is also possible to find similar expression patterns by using the combined
method. The Sobolev and Fisher metrics are so far unique to this package (at least in the context of this type of analysis), all the others are standard methods from the R cor()
function. Details of the novel methods are described below (quoted from paper [1], references edited).
Sobolev metric
In this section, we use definitions and notations as in [2]. We start with the usual p-inner product. Let \(f\), \(g\) be real-valued functions (in this case \(f\) and \(g\) values are the expression vectors of two genes \(f\) and \(g\)):
\(\langle f,g\rangle_{p} = (\sum_{k=1}^{n}\mid f_k.g_k\mid^p)^\frac{1}{p}\)
(2)
By this notation, Sobolev inner product, norm and meter of degree \(k\) respectively can be defined by:
\(\langle f,g\rangle_{p,a}^{S} = \langle f,g\rangle_p+\alpha\langle D^kf,D^kg\rangle_p\)
(3)
\(\mid\mid f\mid\mid_{p,k,\alpha}^S = \sqrt{\langle f,f\rangle_{p,\alpha}^S}\)
(4)
\(d_{p,k,\alpha}^S(f,g) = \mid\mid f-g\mid\mid_{p,k,\alpha}^S\)
(5)
where \(D^k\) is the \(k\)th differential operator. For the special case \(p=2\) and \(\alpha=1\) an interesting connection to the Fourier-transform of analysis can be made; let \(\hat{f}\) be the Fourier-transform \(f\)
\(\hat{f}(\omega_k) = \sum_{j=1}^{N-1}g_jexp(-i\frac{2\pi kj}{N})\)
(6)
Where \(\omega_k=\frac{2\pi k}{N}\) and \(i=\sqrt{-1}\). Finally the norm can be written as
\(\mid\mid f\mid\mid_{2,k,1}^S = \sqrt{\sum_{j=1}^{N-1}(1+\omega_j)^k\mid \hat{f}(\omega_j)\mid^2}\)
(7)
In this work metric (5) with norm (7) and \(k=1\) was used.
Fisher metric
In this section we use definitions and notations such as in [3]. To define Fisher information metric we first introduce the n-simplex \(P_n\) defined by
\(P_n=\{x\in R^{n+1}:\forall i, x_n \ge0,\sum_{i=1}^{n+1}xi=1\}\)
(8)
The coordinates \(\{x_i\}\) describe the probability of observing different outcomes in a single experiment (or expression value of a gene in \(i\)th cell type). The Fisher information metric on \(P_n\) can be defined by
\(Jij = \sum_{k=1}^{n+1}\frac{1}{x_k}\frac{\partial x_k}{\partial x_i}\frac{\partial x_k}{\partial x_j}\)
(9)
We now define a well-known representation of the Fisher information as a pull-back metric from the positive n-sphere \(S_n^+\)
\(S_n^+=\{x\in R^n;\forall i,x_n\ge0,\sum_{i=1}^{n+1}x^2=1\}\)
(10)
The transformation \(T: P_n\to S_n^+\) defined by
\(T(x)=(\sqrt{x_1}, \dots, \sqrt{x_n+1})\)
(11)
pulls back the Euclidean metric on the surface of the sphere to the Fisher information on the multinomial simplex. Actually, the geodesic distance for \(x,y \in P_n\) under the Fisher information metric may be defined by measuring the length of the great circle on \(S_n^+\) between \(T(x)\) and \(T(y)\)
\(d(x,y) = acos(\sum_{i=1}^{n+1}\sqrt{x_iy_i})\)
(12)
The LNCRNA2GOA method can also be used on other novel genes besides lncRNAs.
Example
Scores + Enrichment
The following example is an arbitrary use-case. Meaning that this is just an example and does not (necessarily) imply a certain question/real life use-case. id_select_vector
represents a vector of gene ids that you’d want to use for annotation enrichment (if left empty, algorithm will use all available gene ids in the ExpressionSet).
# Example with default dataset, take a look at the data documentation
# to fully grasp what's going on with the making of the filter etc. (Biobase
# ExpressionSet)
# keep everything that is a protein coding gene (for annotation)
filter_vector <- fData(GAPGOM::expset)[(
fData(GAPGOM::expset)$GeneType=="protein_coding"),]$GeneID
# set gid and run.
gid <- "ENSG00000228630"
result <- GAPGOM::expression_prediction(gid,
GAPGOM::expset,
"human",
"BP",
id_translation_df =
GAPGOM::id_translation_df,
id_select_vector = filter_vector,
method = "combine",
verbose = TRUE, filter_pvals = TRUE)
#> Looking up GO terms...
#> Calculation time (in seconds):
#> 0.2059645652771
kable(result) %>% kable_styling() %>% scroll_box(width = "100%", height = "500px")
GOID | Ontology | Pvalue | FDR | Term | used_method |
---|---|---|---|---|---|
GO:0006810 | BP | 0.0001042 | 0.0003769 | transport | fisher |
GO:0007165 | BP | 0.0010725 | 0.0070414 | signal transduction | kendall |
GO:0006355 | BP | 0.0031231 | 0.0095531 | regulation of transcription, DNA-templated | pearson |
GO:0045893 | BP | 0.0027479 | 0.0097134 | positive regulation of transcription, DNA-templated | fisher |
GO:0006366 | BP | 0.0041102 | 0.0122466 | transcription by RNA polymerase II | sobolev |
GO:0006468 | BP | 0.0091572 | 0.0443554 | protein phosphorylation | spearman |
GO:0006351 | BP | 0.0158478 | 0.0475434 | transcription, DNA-templated | pearson |
Here we display the results, you can see that it has 6 columns;
GOID
Describing the significantly similar GO terms
Ontology
Describing the ontology of the result.Pvalue
The P-value/significance of the result.FDR
bonferoni normalized P-valueTerm
The description of the GO term.used_method
Used scoring method for getting the result.
Besides this, there is an optional parameter for different GO labeling/annotation; id_translation_df
. This dataframe should contain the following;
- rownames \(\to\) rownames of the expression set
- first column \(\to\) Gene ID (e.g. EntrezID). Gene should be the same as in the expression dataset.
- second colum \(\to\) GO IDs.
This can also drastically improve calculation time as most of the time is spent on querying for this translation.
Only scores
There is also another algorithm that allows you to just calculate scores and skip the enrichment;
# Example with default dataset, take a look at the data documentation
# to fully grasp what's going on with making of the filter etc. (Biobase
# ExpressionSet)
# set an artbitrary gene you want to find similarities for. (5th row in this
# case)
gid <- "ENSG00000228630"
result <- GAPGOM::expression_semantic_scoring(gid,
GAPGOM::expset)
kable(result[1:100,]) %>% kable_styling() %>% scroll_box(width = "100%", height = "500px")
original_ids | score | used_method | |
---|---|---|---|
ENSG00000224505 | ENSG00000224505 | 0.2004647 | pearson |
ENSG00000139144 | ENSG00000139144 | 0.1011604 | pearson |
ENSG00000265787 | ENSG00000265787 | 0.1247753 | pearson |
ENSG00000204539 | ENSG00000204539 | 0.1396684 | pearson |
ENSG00000253563 | ENSG00000253563 | 0.0892708 | pearson |
ENSG00000188784 | ENSG00000188784 | 0.0863590 | pearson |
ENSG00000042304 | ENSG00000042304 | 0.0726566 | pearson |
ENSG00000248787 | ENSG00000248787 | 0.2197304 | pearson |
ENSG00000269305 | ENSG00000269305 | 0.0561735 | pearson |
ENSG00000241933 | ENSG00000241933 | 0.0606906 | pearson |
ENSG00000132640 | ENSG00000132640 | 0.1382561 | pearson |
ENSG00000137634 | ENSG00000137634 | 0.0981431 | pearson |
ENSG00000186994 | ENSG00000186994 | 0.1811896 | pearson |
ENSG00000095906 | ENSG00000095906 | 0.5401186 | pearson |
ENSG00000187871 | ENSG00000187871 | 0.0167506 | pearson |
ENSG00000254350 | ENSG00000254350 | 0.0477682 | pearson |
ENSG00000151247 | ENSG00000151247 | 0.0106081 | pearson |
ENSG00000109061 | ENSG00000109061 | 0.0682357 | pearson |
ENSG00000257594 | ENSG00000257594 | 0.0225258 | pearson |
ENSG00000253720 | ENSG00000253720 | 0.1931214 | pearson |
ENSG00000180638 | ENSG00000180638 | 0.0566312 | pearson |
ENSG00000234279 | ENSG00000234279 | 0.1001705 | pearson |
ENSG00000188032 | ENSG00000188032 | 0.1129793 | pearson |
ENSG00000171161 | ENSG00000171161 | 0.0709871 | pearson |
ENSG00000250411 | ENSG00000250411 | 0.0589016 | pearson |
ENSG00000260244 | ENSG00000260244 | 0.2590438 | pearson |
ENSG00000063601 | ENSG00000063601 | 0.0637382 | pearson |
ENSG00000249605 | ENSG00000249605 | 0.0834436 | pearson |
ENSG00000182368 | ENSG00000182368 | 0.0104575 | pearson |
ENSG00000224819 | ENSG00000224819 | 0.1137457 | pearson |
ENSG00000196531 | ENSG00000196531 | 0.0298562 | pearson |
ENSG00000085377 | ENSG00000085377 | 0.0761354 | pearson |
ENSG00000129195 | ENSG00000129195 | 0.3678937 | pearson |
ENSG00000227608 | ENSG00000227608 | 0.1634545 | pearson |
ENSG00000250043 | ENSG00000250043 | 0.0943256 | pearson |
ENSG00000269843 | ENSG00000269843 | 0.3473258 | pearson |
ENSG00000181690 | ENSG00000181690 | 0.4336276 | pearson |
ENSG00000125246 | ENSG00000125246 | 0.3062198 | pearson |
ENSG00000262861 | ENSG00000262861 | 0.1053504 | pearson |
ENSG00000173171 | ENSG00000173171 | 0.1348866 | pearson |
ENSG00000248455 | ENSG00000248455 | 0.0930772 | pearson |
ENSG00000265114 | ENSG00000265114 | 0.1114366 | pearson |
ENSG00000166173 | ENSG00000166173 | 0.2045608 | pearson |
ENSG00000068489 | ENSG00000068489 | 0.1446615 | pearson |
ENSG00000183474 | ENSG00000183474 | 0.0317519 | pearson |
ENSG00000138326 | ENSG00000138326 | 0.1441870 | pearson |
ENSG00000101096 | ENSG00000101096 | 0.1657225 | pearson |
ENSG00000235151 | ENSG00000235151 | 0.2721271 | pearson |
ENSG00000154265 | ENSG00000154265 | 0.1235900 | pearson |
ENSG00000177946 | ENSG00000177946 | 0.5389089 | pearson |
ENSG00000225302 | ENSG00000225302 | 0.0224971 | pearson |
ENSG00000237560 | ENSG00000237560 | 0.1250799 | pearson |
ENSG00000261195 | ENSG00000261195 | 0.1342620 | pearson |
ENSG00000235123 | ENSG00000235123 | 0.0486064 | pearson |
ENSG00000260092 | ENSG00000260092 | 0.1116397 | pearson |
ENSG00000254431 | ENSG00000254431 | 0.1893861 | pearson |
ENSG00000118526 | ENSG00000118526 | 0.1717508 | pearson |
ENSG00000261049 | ENSG00000261049 | 0.0393922 | pearson |
ENSG00000254489 | ENSG00000254489 | 0.1097308 | pearson |
ENSG00000176697 | ENSG00000176697 | 0.3079547 | pearson |
ENSG00000250519 | ENSG00000250519 | 0.0213221 | pearson |
ENSG00000263821 | ENSG00000263821 | 0.0863590 | pearson |
ENSG00000065665 | ENSG00000065665 | 0.0021354 | pearson |
ENSG00000088836 | ENSG00000088836 | 0.0343418 | pearson |
ENSG00000254514 | ENSG00000254514 | 0.1114366 | pearson |
ENSG00000149308 | ENSG00000149308 | 0.0458501 | pearson |
ENSG00000109534 | ENSG00000109534 | 0.0521644 | pearson |
ENSG00000204187 | ENSG00000204187 | 0.2636256 | pearson |
ENSG00000102468 | ENSG00000102468 | 0.0711898 | pearson |
ENSG00000101457 | ENSG00000101457 | 0.1492439 | pearson |
ENSG00000204837 | ENSG00000204837 | 0.1985665 | pearson |
ENSGR0000236871 | ENSGR0000236871 | 0.1737215 | pearson |
ENSG00000113645 | ENSG00000113645 | 0.1680317 | pearson |
ENSG00000261617 | ENSG00000261617 | 0.0935473 | pearson |
ENSG00000177947 | ENSG00000177947 | 0.1092690 | pearson |
ENSG00000233605 | ENSG00000233605 | 0.1114366 | pearson |
ENSG00000187486 | ENSG00000187486 | 0.0193328 | pearson |
ENSG00000159593 | ENSG00000159593 | 0.1764208 | pearson |
ENSG00000230967 | ENSG00000230967 | 0.0863590 | pearson |
ENSG00000064961 | ENSG00000064961 | 0.3389281 | pearson |
ENSG00000257922 | ENSG00000257922 | 0.0255902 | pearson |
ENSG00000241685 | ENSG00000241685 | 0.4240072 | pearson |
ENSG00000171747 | ENSG00000171747 | 0.0964942 | pearson |
ENSG00000152749 | ENSG00000152749 | 0.0533820 | pearson |
ENSG00000183597 | ENSG00000183597 | 0.5322610 | pearson |
ENSG00000214955 | ENSG00000214955 | 0.1746615 | pearson |
ENSG00000166710 | ENSG00000166710 | 0.0462278 | pearson |
ENSG00000149182 | ENSG00000149182 | 0.0432318 | pearson |
ENSG00000259645 | ENSG00000259645 | 0.8049258 | pearson |
ENSG00000137561 | ENSG00000137561 | 0.0933173 | pearson |
ENSG00000175746 | ENSG00000175746 | 0.1013987 | pearson |
ENSG00000230645 | ENSG00000230645 | 0.0219265 | pearson |
ENSG00000254438 | ENSG00000254438 | 0.1059758 | pearson |
ENSG00000233423 | ENSG00000233423 | 0.1607758 | pearson |
ENSG00000254726 | ENSG00000254726 | 0.0301782 | pearson |
ENSG00000006128 | ENSG00000006128 | 0.1607405 | pearson |
ENSG00000259862 | ENSG00000259862 | 0.1016300 | pearson |
ENSG00000204054 | ENSG00000204054 | 0.0103759 | pearson |
ENSG00000185608 | ENSG00000185608 | 0.2245409 | pearson |
ENSG00000163013 | ENSG00000163013 | 0.0495888 | pearson |
We can see that this function returns a different dataframe;
original_ids
The identifier of the gene expression rowscore
The similarity score/correlation calculated by one of the methods.used_method
The used method used to calculate the score.
The rownames also represent the gene expression row. Only the first 100 rows are shown, otherwise the table would be quite big. Enrichment and GO annotation/translation needs to be done manually after this step. However, this should be quite doable with a bit of help from GOSemSim
.
Original dataset
The original publication used the lncRNA2Function data [4], to test if the results are the same, a small script has been made to reproduce the same results and is located at the package install directory under scripts
. The script
folder also contains the two original scripts for the algorithm, but not neccesarily its data. The data (along with the scripts) can be found on the following websites:
Besides this, the scripts folder also contains a proof-of-concept script for potentially doing the analysis on unannotated transcripts (by finding the closest gene). This is eventually meant as a sort of substitute to the famous GREAT tool.
TopoICSim
Background
TopoICSim or Topological Information Content Similarity, is a method to measure the similarity between two GO terms given the information content and topology of the GO DAG tree. Unlike other similar measures, it considers both the shortest and longest DAG paths between two terms, not just the longest or shortest path(s). The paths along the GO DAG tree get weighted with the information content between two terms.
For the information content the following forumla is used;
\(IC(t) = -log(p(t))\)
(1)
Where \(t\) is the (GO) term. The IC is calculated by GOSemSim, and based on the frequency of the specific go term (\(p(t)\)).
A GO tree can be described as a triplet \(\Lambda=(G,\Sigma,R)\), where \(G\) is the set of GO terms, \(\Sigma\) is the set of hierarchical relations between GO terms (mostly defined as is_a or part_of) [5], and \(R\) is a triplet \((t_i, t_j, \xi)\), where \(t_i,t_j\in G\) and \(\xi\in R\) and \(t_i\xi t_j\). The \(\xi\) relationship is an oriented child-parent relation. Top level node of the GO rDAG is the Root, which is a direct parent of the MF, BP and CC nodes. These nodes are called aspect-specific roots and we refer to them as root in the following. A path \(P\) of length \(n\) between two terms \(t_i,t_j\) can be defined as in (23).
\(P:G\times G\to G\times G\dots\times G=G^{n+1};\\P(t_i,t_j) = (t_i,t_j+1,\dots,t_j)\)
(23)
Here \(\forall\) \(s\), \(i\le s<j\), \(\exists\xi_s\in \Sigma\), \(\exists\tau_s\in R\), \(r_s=(t_s,t_{s+1}, \xi_s)\). Because \(G\) is an rDAG, there might be multiple paths between two terms, so we represent all paths between two terms \(t_i,t_j\) according to (24).
\(\mathcal{A}(t_i,t_j)=\underset{P}{\cup}P(t_i,t_j)\)
(24)
We use Inverse Information Content (IIC) values to define shortest and longest paths for two given terms \(t_i,t_j\) as shown in (25-27).
\(SP(t_i,t_j) = \underset{P\in A(t_i,t_j)}{argminIIC(P)}\)
(25)
\(LP(t_i,t_j) = \underset{P\in A(t_i,t_j)}{argmaxIIC(P)}\)
(26)
\(IIC(P) = \sum_{t\in P}\frac{1}{IC(t)}\)
(27)
The standard definition was used to calculate \(IC(t)\) as shown in (28)
\(IC(t) = -log\frac{G_t}{G_\mathrm{Tot}}\)
(28)
Here \(G_t\) is the number of genes annotated by the term \(t\) and \(G_\mathrm{Tot}\) is the total number of genes. The distribution of IC is not uniform in the rDAG, so it is possible to have two paths with different lengths but with same IICs. To overcome this problem we weight paths by their lengths, so the definitions in (25) and (26) can be updated according to (29) and (30).
\(wSP(t_i,t_j)=SP(t_i,t_j)\times len(P)\)
(29)
\(wLP(t_i,t_j)=LP(t_i,t_j)\times len(P)\)
(30)
Now let \(ComAnc(t_i,t_j)\) be the set of all common ancestors for two given terms \((t_i,t_j)\). First we define the disjuntive common ancestors as a subset of \(ComAnc(t_i,t_j)\) as in (31).
\(DisComAnc(t_i,t_j) = \{x\in ComAnc(t_i,t_j)\mid P(x, root)\cap C(x)=\varnothing\}\)
(31)
Here \(P(x,root)\) is the path between \(x\) and \(root\) and \(C(x)\) is set of all immediate children for \(x\). For each disjuntive common ancestor \(x\) in \(DisComAnc(t_i,t_j)\), we define the distance between \(t_i,t_j\) as the ratio of the weighted shortest path between \(t_i,t_j\) which passes from \(x\) to the weighted longest path between \(x\) and \(root\), as in (32-33).
\(D(t_i,t_j,x) = \frac{wSP(t_i,t_j,x)}{wLP(x,root)}\)
(32)
\(wSP(t_i,t_j,x) = wSP(t_i,x)+wSP(t_j,x)\)
(33)
Now the distance for two terms \(t_i,t_j\) can be defined according to (34).
\(D(t_i,t_j)=\underset{x\in DisComAnc(t_i,t_j)}{min}D(t_i,t_j,x)\)
(34)
We convert distance values by the \(\frac{Arctan(.)}{\pi/2}\) function, and the measure for two GO terms \(t_i\) and \(t_j\) can be defined as in (35).
\(S(t_i,t_j) = 1-\frac{Arcatan(D(t_i,t_j))}{\pi/2}\)
(35)
Note that \(root\) refers to one of three first levels in the rDAG. So if \(DisComAnc(t_i,t_j)=\{root\}\) then \(D(t_i,t_j)=\infty\) and \(S(t_i,t_j)=0\). Also if \(t_i = t_j\) then \(D(t_i,t_j)=0\) and \(S(t_i,t_j)=1\). Finally let \(S=[s_{ij}]_{n\times m}\) be a similarity matrix for two given fenes or gene products \(g1, g2\) with GO terms \(t_{11},t_{12},\dots,t_{1n}\) and \(t_{21},t_{12},\dots,t_{2m}\) where \(s_{ij}\) is the similarity between the GO terms \(t_{1i}\) and \(t_{2j}\). We used a rcmax method to calculate similarity between \(g1, g2\), as defined in (36).
\(\begin{aligned}TopoICSim(g_1,g_2)&=rcmax(S)\\&=rcmax\left(\frac{\sum_{i=1}^n\underset{j}{maxs_{ij}}}{n},\frac{\sum_{i=1}^m\underset{i}{maxs_{ij}}}{m}\right)\end{aligned}\)
(36)
We also tested other methods on the similarity matrix, in particular average and BMA, but in general \(rcmax\) gave the best performance for TopoICSim (data not shown).
Besides all this, there’s also an algorithm for the geneset level, you can calculate interset/intraset similarities of these with (13) and (14). Or simply use R’s mean
on the resulting matrix.
\(IntraSetSim(S_k)=\frac{\sum_{i=1}^n\sum_{j=1}^mSim(g_{ki},g_{kj})}{n^2}\)
(13)
\(InterSetSim(S_k)=\frac{\sum_{i=1}^n\sum_{j=1}^mSim(g_{ki},g_{kj})}{n\times m}\)
(14)
(13) is equal to (14) in the specific case of comparing the geneset to itself.
All forumulas/explanations are quoted from the paper (references edited) [6].
Examples
The following example uses the pfam clan gene set to measure intraset similarity. For the single gene, EntrezID
218 and 501 are compared.
result <- GAPGOM::topo_ic_sim_genes("human", "MF", "218", "501",
progress_bar = FALSE)
kable(result$AllGoPairs) %>% kable_styling() %>% scroll_box(width = "100%", height = "500px")
GO:0004029 | GO:0004030 | GO:0005515 | GO:0008106 | GO:0018479 | GO:0004043 | GO:0008802 | GO:0043878 | |
---|---|---|---|---|---|---|---|---|
GO:0004029 | 1.000 | 0.793 | NA | 0.095 | 0.870 | 0.808 | 0.865 | 0.802 |
GO:0004030 | 0.793 | 1.000 | NA | NA | NA | 0.903 | 0.693 | 0.899 |
GO:0005515 | NA | NA | 1 | NA | NA | NA | NA | NA |
GO:0008106 | 0.095 | NA | NA | 1.000 | NA | 0.122 | 0.080 | 0.121 |
GO:0018479 | 0.870 | NA | NA | NA | 1.000 | 0.718 | 0.954 | 0.711 |
GO:0004043 | 0.808 | 0.903 | NA | 0.122 | 0.718 | 1.000 | NA | NA |
GO:0008802 | 0.865 | 0.693 | NA | 0.080 | 0.954 | NA | 1.000 | NA |
GO:0043878 | 0.802 | 0.899 | NA | 0.121 | 0.711 | NA | NA | 1.000 |
result$GeneSim
#> [1] 0.975
# genelist mode
list1 <- c("126133","221","218","216","8854","220","219","160428","224",
"222","8659","501","64577","223","217","4329","10840","7915","5832")
# ONLY A PART OF THE GENELIST IS USED BECAUSE OF R CHECK TIME CONTRAINTS
result <- GAPGOM::topo_ic_sim_genes("human", "MF", list1[1:3], list1[1:3],
progress_bar = FALSE)
kable(result$AllGoPairs) %>% kable_styling() %>% scroll_box(width = "100%", height = "500px")
GO:0005515 | GO:0016620 | GO:0004029 | GO:0004030 | GO:0018477 | GO:0018479 | GO:0008106 | |
---|---|---|---|---|---|---|---|
GO:0005515 | 1 | NA | NA | NA | NA | NA | NA |
GO:0016620 | NA | 1.000 | 0.533 | 0.678 | 0.441 | 0.431 | 0.152 |
GO:0004029 | NA | 0.533 | 1.000 | 0.793 | 0.876 | 0.870 | 0.095 |
GO:0004030 | NA | 0.678 | 0.793 | 1.000 | 0.710 | 0.699 | 0.118 |
GO:0018477 | NA | 0.441 | 0.876 | 0.710 | 1.000 | 0.957 | 0.082 |
GO:0018479 | NA | 0.431 | 0.870 | 0.699 | 0.957 | 1.000 | 0.081 |
GO:0008106 | NA | 0.152 | 0.095 | 0.118 | 0.082 | 0.081 | 1.000 |
126133 | 221 | 218 | |
---|---|---|---|
126133 | 1.000 | 0.912 | 0.912 |
221 | 0.912 | 1.000 | 0.996 |
218 | 0.912 | 0.996 | 1.000 |
Here we can see the output of TopoICsim, it is a list containg 2 items;
AllGoPairs
The \(n\times m\) matrix of GO terms, including their similarities. Some values might be NA because these were non-occuring pairs. You can add AllGoPairs
to the parameters next time you run TopoICSim, to possibly speed up computations (they will be used as pre-calculated scores to fill in occurring pairs).
GeneSim
The gene/geneset similarities depending on your input. Can be 1 number, or a matrix displaying all possible combinations. \(\to\) The mean of the geneset matrix shows the intraset/interset similarity.
Custom genes
Besides this, you can also define custom genes for TopoICSim. This consists of an arbitrary amount of GO terms. The custom genes have to individually defined in a named list.
custom <- list(cus1=c("GO:0016787", "GO:0042802", "GO:0005524"))
result <- GAPGOM::topo_ic_sim_genes("human", "MF", "218", "501",
custom_genes1 = custom, drop = NULL, verbose = TRUE, progress_bar = FALSE)
#> Preparing topoICSim data...
#> Preparing term data.
#> preparing gene to GO mapping data...
#> preparing IC data...
#> Preparing gene/geneset data...
#> Started calculating all go's.
#> Filtering out precalculated values...
#> Resolving common ancestors...
#> Done!
#> Resolving disjunctive common ancestors...
#> Done!
#> Calculating short paths...
#> Done!
#> Calculating long paths...
#> Done!
#> Merging into all_go_pairs...
#> Done!
#> Merging gene(set) results...
#> Done!
#> Calculation time (in seconds):
#> 9.28031349182129
result
#> $GeneSim
#> 501
#> cus1 0.261
#> 218 0.975
#>
#> $AllGoPairs
#> GO:0004029 GO:0004030 GO:0005515 GO:0008106 GO:0018479 GO:0004043
#> GO:0004029 1.000 0.793 NA 0.095 0.870 0.808
#> GO:0004030 0.793 1.000 NA NA NA 0.903
#> GO:0005515 NA NA 1.000 NA NA NA
#> GO:0008106 0.095 NA NA 1.000 NA 0.122
#> GO:0018479 0.870 NA NA NA 1.000 0.718
#> GO:0004043 0.808 0.903 NA 0.122 0.718 1.000
#> GO:0008802 0.865 0.693 NA 0.080 0.954 0.711
#> GO:0043878 0.802 0.899 NA 0.121 0.711 0.910
#> GO:0016787 0.036 0.046 NA 0.034 0.029 0.047
#> GO:0042802 NA NA 0.101 NA NA NA
#> GO:0005524 NA NA 0.061 NA NA NA
#> GO:0008802 GO:0043878 GO:0016787 GO:0042802 GO:0005524
#> GO:0004029 0.865 0.802 0.036 NA NA
#> GO:0004030 0.693 0.899 0.046 NA NA
#> GO:0005515 NA NA NA 0.101 0.061
#> GO:0008106 0.080 0.121 0.034 NA NA
#> GO:0018479 0.954 0.711 0.029 NA NA
#> GO:0004043 0.711 0.910 0.047 NA NA
#> GO:0008802 1.000 0.704 0.029 NA NA
#> GO:0043878 0.704 1.000 0.047 NA NA
#> GO:0016787 0.029 0.047 1.000 NA NA
#> GO:0042802 NA NA NA 1.000 NA
#> GO:0005524 NA NA NA NA 1.000
Here we define a custom gene named “cus1” with the GO terms; “GO:0016787”, “GO:0042802”, “GO:0005524”, it will be added to the first gene vector (218). If you want to only have a custom gene, you can define an empty vector with c()
for the respective vector.
Other notes.
With TopoICSim there is a pre-calculated score matrix available, this can be turned on/off. The scores might become deprecated however, as soon as one of the org.DB
packages gets an update. For this reason, we advise mostly to keep this option off. You can also pre-calculate some GO’s yourself using the custom genes. all_go_pairs
can then be used as a precalculated score matrix, only intersecting/present GO terms will be used.
Parrallel processing and big data
As of right now, parallel processing is not supported. The other algorithms aren’t parallelized yet, because of implementation difficulties regarding the amount and type of dependencies that the algorithms rely on. Possibly, in the future, parallel processing will be supported. It might however be possible to divide the work in TopoICSim (on gene pair level). It runs for each unique pair of genes, which you may be able to generate using the hidden GAPGOM:::.unique_combos
function. The All_go_pairs
object in the result can then be combined together with other results. No support is offered however in regards to trying to make this work. Tip: The all_go_pairs
argument in topo_ic_sim_genes()
doesn’t automatically create a new, bigger matrix, it only uses the overlapping or present GO terms of the analysis based on the input genes.
Performance and benchmarks
The performance of this package is well tested in the Benchmarks vignette, in which benchmarks are also prepared.
Contact and support
For questions, contact, or support use the (Bioconductor) git repository or contact us via the Bioconductor forums.
SessionInfo
sessionInfo()
#> R Under development (unstable) (2021-10-19 r81077)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] org.Hs.eg.db_3.14.0 AnnotationDbi_1.57.0 IRanges_2.29.0
#> [4] S4Vectors_0.33.0 Biobase_2.55.0 BiocGenerics_0.41.0
#> [7] GAPGOM_1.11.0 kableExtra_1.3.4 knitr_1.36
#>
#> loaded via a namespace (and not attached):
#> [1] matrixStats_0.61.0 bitops_1.0-7 bit64_4.0.5
#> [4] filelock_1.0.2 webshot_0.5.2 httr_1.4.2
#> [7] GenomeInfoDb_1.31.0 tools_4.2.0 bslib_0.3.1
#> [10] utf8_1.2.2 R6_2.5.1 DBI_1.1.1
#> [13] colorspace_2.0-2 tidyselect_1.1.1 bit_4.0.4
#> [16] curl_4.3.2 compiler_4.2.0 graph_1.73.0
#> [19] rvest_1.0.2 xml2_1.3.2 sass_0.4.0
#> [22] scales_1.1.1 readr_2.0.2 RBGL_1.71.0
#> [25] rappdirs_0.3.3 systemfonts_1.0.3 stringr_1.4.0
#> [28] digest_0.6.28 rmarkdown_2.11 svglite_2.0.0
#> [31] GEOquery_2.63.0 XVector_0.35.0 pkgconfig_2.0.3
#> [34] htmltools_0.5.2 highr_0.9 dbplyr_2.1.1
#> [37] fastmap_1.1.0 limma_3.51.0 rlang_0.4.12
#> [40] rstudioapi_0.13 RSQLite_2.2.8 prettydoc_0.4.1
#> [43] jquerylib_0.1.4 generics_0.1.1 jsonlite_1.7.2
#> [46] GOSemSim_2.21.0 dplyr_1.0.7 RCurl_1.98-1.5
#> [49] magrittr_2.0.1 GO.db_3.14.0 GenomeInfoDbData_1.2.7
#> [52] Matrix_1.3-4 Rcpp_1.0.7 munsell_0.5.0
#> [55] fansi_0.5.0 lifecycle_1.0.1 stringi_1.7.5
#> [58] yaml_2.2.1 zlibbioc_1.41.0 plyr_1.8.6
#> [61] BiocFileCache_2.3.0 grid_4.2.0 blob_1.2.2
#> [64] crayon_1.4.1 lattice_0.20-45 Biostrings_2.63.0
#> [67] hms_1.1.1 KEGGREST_1.35.0 pillar_1.6.4
#> [70] igraph_1.2.7 fastmatch_1.1-3 glue_1.4.2
#> [73] evaluate_0.14 data.table_1.14.2 png_0.1-7
#> [76] vctrs_0.3.8 tzdb_0.1.2 org.Mm.eg.db_3.14.0
#> [79] purrr_0.3.4 tidyr_1.1.4 assertthat_0.2.1
#> [82] cachem_1.0.6 xfun_0.27 viridisLite_0.4.0
#> [85] tibble_3.1.5 memoise_2.0.0 ellipsis_0.3.2
References
- [1] Ehsani R, Drablos F: Measures of co-expression for improved function prediction of long non-coding RNAs. BMC Bioinformatics 2018. Accepted.
- [2] Villmann T: Sobolev metrics for learning of functional data - mathematical and theoretical aspects. In: Machine Learning Reports. Edited by Villmann T, Schleif F-M, vol. 1. Leipzig, Germany: Medical Department, University of Leipzig; 2007: 1-13.
- [3] Lebanon G: Learning riemannian metrics. In: Proceedings of the Nineteenth conference on Uncertainty in Artificial Intelligence; Acapulco, Mexico. Morgan Kaufmann Publishers Inc. 2003: 362-369
- [4] Jian Q: LncRNA2Function: a comprehensive resource for functional investigation of human lncRNAs based on RNA-seq data. In: BMC Genomics 2015. DOI: 10.1186/1471-2164-16-S3-S2
- [5] Benabderrahmane S, Smail-Tabbone, Poch O, Napoli A, Devignes MD. IntelliGO: a new vector-based semantic similarity measure including annotation origin. In: BMC Bioinformatics. 2010;11:588.
- [6] Ehsani R, Drablos F: TopoICSim: a new semantic similarity measure based on gene ontology. In: BMC Bioinformatics 2016, 17(1):296. DOI: 10.1186/s12859-016-1160-0