BgeeDB
is a collection of functions to import data from the Bgee database (http://bgee.org/) directly into R, and to facilitate downstream analyses, such as gene set enrichment test based on expression of genes in anatomical structures. Bgee provides annotated and processed expression data and expression calls from curated wild-type healthy samples, from human and many other animal species.
The package retrieves the annotation of RNA-seq or Affymetrix experiments integrated into the Bgee database, and downloads into R the quantitative data and expression calls produced by the Bgee pipeline. The package also allows to run GO-like enrichment analyses based on anatomical terms, where genes are mapped to anatomical terms by expression patterns, based on the topGO
package. This is the same as the TopAnat web-service available at (http://bgee.org/?page=top_anat#/), but with more flexibility in the choice of parameters and developmental stages.
In summary, the BgeeDB package allows to: * 1. List annotation of RNA-seq and microarray data available the Bgee database * 2. Download the processed gene expression data available in the Bgee database * 3. Download the gene expression calls and use them to perform TopAnat analyses
In R:
#source("https://bioconductor.org/biocLite.R")
#biocLite("BgeeDB")
library(BgeeDB)
The listBgeeSpecies()
function allows to retrieve available species in the Bgee database, and which data types are available for each species.
listBgeeSpecies()
##
## Querying Bgee to get release information...
##
## Building URL to query species in Bgee release 13_2...
##
## Submitting URL to Bgee webservice... (http://r13_2.bgee.org/?page=species&display_type=tsv)
##
## Query to Bgee webservice successful!
## ID GENUS SPECIES_NAME COMMON_NAME AFFYMETRIX EST IN_SITU
## 1 6239 Caenorhabditis elegans c.elegans TRUE FALSE TRUE
## 2 7227 Drosophila melanogaster fruitfly TRUE TRUE TRUE
## 3 7955 Danio rerio zebrafish TRUE TRUE TRUE
## 4 8364 Xenopus tropicalis xenopus FALSE TRUE TRUE
## 5 9031 Gallus gallus chicken FALSE FALSE FALSE
## 6 9258 Ornithorhynchus anatinus platypus FALSE FALSE FALSE
## 7 9544 Macaca mulatta macaque FALSE FALSE FALSE
## 8 9593 Gorilla gorilla gorilla FALSE FALSE FALSE
## 9 9597 Pan paniscus bonobo FALSE FALSE FALSE
## 10 9598 Pan troglodytes chimpanzee FALSE FALSE FALSE
## 11 9606 Homo sapiens human TRUE TRUE FALSE
## 12 9823 Sus scrofa pig FALSE FALSE FALSE
## 13 9913 Bos taurus cow FALSE FALSE FALSE
## 14 10090 Mus musculus mouse TRUE TRUE TRUE
## 15 10116 Rattus norvegicus rat FALSE FALSE FALSE
## 16 13616 Monodelphis domestica opossum FALSE FALSE FALSE
## 17 28377 Anolis carolinensis anolis FALSE FALSE FALSE
## RNA_SEQ
## 1 TRUE
## 2 FALSE
## 3 FALSE
## 4 TRUE
## 5 TRUE
## 6 TRUE
## 7 TRUE
## 8 TRUE
## 9 TRUE
## 10 TRUE
## 11 TRUE
## 12 TRUE
## 13 TRUE
## 14 TRUE
## 15 TRUE
## 16 TRUE
## 17 TRUE
It is possible to list all species from a specific release of Bgee with the release
argument (see listBgeeRelease()
function), and order the species according to a specific columns with the ordering
argument. For example:
listBgeeSpecies(release = "13.2", order = 2)
##
## Querying Bgee to get release information...
##
## Building URL to query species in Bgee release 13_2...
##
## Submitting URL to Bgee webservice... (http://r13_2.bgee.org/?page=species&display_type=tsv)
##
## Query to Bgee webservice successful!
## ID GENUS SPECIES_NAME COMMON_NAME AFFYMETRIX EST IN_SITU
## 17 28377 Anolis carolinensis anolis FALSE FALSE FALSE
## 13 9913 Bos taurus cow FALSE FALSE FALSE
## 1 6239 Caenorhabditis elegans c.elegans TRUE FALSE TRUE
## 3 7955 Danio rerio zebrafish TRUE TRUE TRUE
## 2 7227 Drosophila melanogaster fruitfly TRUE TRUE TRUE
## 5 9031 Gallus gallus chicken FALSE FALSE FALSE
## 8 9593 Gorilla gorilla gorilla FALSE FALSE FALSE
## 11 9606 Homo sapiens human TRUE TRUE FALSE
## 7 9544 Macaca mulatta macaque FALSE FALSE FALSE
## 16 13616 Monodelphis domestica opossum FALSE FALSE FALSE
## 14 10090 Mus musculus mouse TRUE TRUE TRUE
## 6 9258 Ornithorhynchus anatinus platypus FALSE FALSE FALSE
## 9 9597 Pan paniscus bonobo FALSE FALSE FALSE
## 10 9598 Pan troglodytes chimpanzee FALSE FALSE FALSE
## 15 10116 Rattus norvegicus rat FALSE FALSE FALSE
## 12 9823 Sus scrofa pig FALSE FALSE FALSE
## 4 8364 Xenopus tropicalis xenopus FALSE TRUE TRUE
## RNA_SEQ
## 17 TRUE
## 13 TRUE
## 1 TRUE
## 3 FALSE
## 2 FALSE
## 5 TRUE
## 8 TRUE
## 11 TRUE
## 7 TRUE
## 16 TRUE
## 14 TRUE
## 6 TRUE
## 9 TRUE
## 10 TRUE
## 15 TRUE
## 12 TRUE
## 4 TRUE
In the following example we will choose to focus on mouse (“Mus_musculus”) RNA-seq. Species can be specified using their name or their NCBI taxonomic IDs. To specify that RNA-seq data want to be downloaded, the dataType
argument is set to “rna_seq”. To download Affymetrix microarray data, set this argument to “affymetrix”.
bgee <- Bgee$new(species = "Mus_musculus", dataType = "rna_seq")
##
## Querying Bgee to get release information...
##
## Building URL to query species in Bgee release 13_2...
##
## Submitting URL to Bgee webservice... (http://r13_2.bgee.org/?page=species&display_type=tsv)
##
## Query to Bgee webservice successful!
##
## API key built: 5511f77161100e337830a4c7777f6b779e4f584e109b0e74c9bdf6e7619786664637fefef03708873cd0ab61f1e3a86d321cd5315e63104659d37b08a42ec4d4
Note 1: It is possible to work with data from a specific release of Bgee by specifying the release
argument, see listBgeeRelease()
function.
Note 2: The functions of the package will store the downloaded files in a versioned folder created by default in the working directory. These cache files allow faster re-access to the data. The directory where data are stored can be changed with the pathToData
argument.
The getAnnotation()
function will output the list of RNA-seq experiments and libraries available in Bgee for mouse.
annotation_bgee_mouse <- getAnnotation(bgee)
##
## Downloading annotation files...
##
## Saved annotation files in /tmp/Rtmpma61Nt/Rbuild45413a8e759f/BgeeDB/vignettes/Mus_musculus_Bgee_13_2 folder.
# list the first experiments and libraries
lapply(annotation_bgee_mouse, head)
## $sample.annotation
## Experiment.ID Library.ID Library.secondary.ID Anatomical.entity.ID
## 1 GSE30617 GSM759583 ERX012363 UBERON:0000948
## 2 GSE30617 GSM759584 ERX012348 UBERON:0000948
## 3 GSE30617 GSM759585 ERX012344 UBERON:0000948
## 4 GSE30617 GSM759586 ERX012362 UBERON:0000948
## 5 GSE30617 GSM759587 ERX012378 UBERON:0000948
## 6 GSE30617 GSM759588 ERX012374 UBERON:0000948
## Anatomical.entity.name Stage.ID Stage.name
## 1 heart MmusDv:0000052 8 weeks (mouse)
## 2 heart MmusDv:0000052 8 weeks (mouse)
## 3 heart MmusDv:0000052 8 weeks (mouse)
## 4 heart MmusDv:0000052 8 weeks (mouse)
## 5 heart MmusDv:0000052 8 weeks (mouse)
## 6 heart MmusDv:0000052 8 weeks (mouse)
## Platform.ID Library.type Library.orientation Read.count
## 1 Illumina Genome Analyzer II paired unstranded 31000737
## 2 Illumina Genome Analyzer II paired unstranded 8605668
## 3 Illumina Genome Analyzer II paired unstranded 30075234
## 4 Illumina Genome Analyzer II paired unstranded 29498377
## 5 Illumina Genome Analyzer II paired unstranded 27824366
## 6 Illumina Genome Analyzer II paired unstranded 31160789
## Left.part.mapped.read.count.Mapped.read.count
## 1 23674754
## 2 7099912
## 3 24308959
## 4 24792064
## 5 19632932
## 6 21342634
## Right.part.mapped.read.count Min.read.length Max.read.length
## 1 23809225 76 76
## 2 7083684 76 76
## 3 23884114 76 76
## 4 24122931 76 76
## 5 20578789 76 76
## 6 23529915 76 76
## All.genes.percent.present Protein.coding.genes.percent.present
## 1 50.73 31.84
## 2 41.94 25.69
## 3 50.70 31.66
## 4 49.12 30.51
## 5 50.24 31.51
## 6 50.32 31.72
## Intergenic.regions.percent.present Run.IDs Discarded.run.IDs
## 1 0.98 ERR032227 NA
## 2 1.17 ERR032228 NA
## 3 0.95 ERR032229 NA
## 4 0.87 ERR032238 NA
## 5 0.97 ERR032230 NA
## 6 1.04 ERR032231 NA
## Data.source Data.source.URL
## 1 GEO http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM759583
## 2 GEO http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM759584
## 3 GEO http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM759585
## 4 GEO http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM759586
## 5 GEO http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM759587
## 6 GEO http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM759588
## Bgee.normalized.data.URL
## 1 ftp://lausanne.isb-sib.ch/pub/databases/Bgee/bgee_v13_1/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_RPKM_GSE30617.tsv.zip
## 2 ftp://lausanne.isb-sib.ch/pub/databases/Bgee/bgee_v13_1/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_RPKM_GSE30617.tsv.zip
## 3 ftp://lausanne.isb-sib.ch/pub/databases/Bgee/bgee_v13_1/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_RPKM_GSE30617.tsv.zip
## 4 ftp://lausanne.isb-sib.ch/pub/databases/Bgee/bgee_v13_1/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_RPKM_GSE30617.tsv.zip
## 5 ftp://lausanne.isb-sib.ch/pub/databases/Bgee/bgee_v13_1/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_RPKM_GSE30617.tsv.zip
## 6 ftp://lausanne.isb-sib.ch/pub/databases/Bgee/bgee_v13_1/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_RPKM_GSE30617.tsv.zip
## Raw.file.URL
## 1 http://trace.ncbi.nlm.nih.gov/Traces/study/?acc=GSM759583
## 2 http://trace.ncbi.nlm.nih.gov/Traces/study/?acc=GSM759584
## 3 http://trace.ncbi.nlm.nih.gov/Traces/study/?acc=GSM759585
## 4 http://trace.ncbi.nlm.nih.gov/Traces/study/?acc=GSM759586
## 5 http://trace.ncbi.nlm.nih.gov/Traces/study/?acc=GSM759587
## 6 http://trace.ncbi.nlm.nih.gov/Traces/study/?acc=GSM759588
##
## $experiment.annotation
## Experiment.ID Library.count Condition.count Organ.count Stage.count
## 1 GSE30617 36 6 6 1
## 2 GSE41637 26 9 9 1
## 3 GSE30352 17 6 6 1
## 4 GSE36026 12 12 12 3
## 5 GSE43520 9 5 4 2
## 6 GSE41338 6 5 5 1
## Data.source Data.source.URL
## 1 GEO http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE30617
## 2 GEO http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE41637
## 3 GEO http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE30352
## 4 GEO http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE36026
## 5 GEO http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE43520
## 6 GEO http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE41338
## Bgee.normalized.data.URL
## 1 ftp://lausanne.isb-sib.ch/pub/databases/Bgee/bgee_v13_1/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_RPKM_GSE30617.tsv.zip
## 2 ftp://lausanne.isb-sib.ch/pub/databases/Bgee/bgee_v13_1/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_RPKM_GSE41637.tsv.zip
## 3 ftp://lausanne.isb-sib.ch/pub/databases/Bgee/bgee_v13_1/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_RPKM_GSE30352.tsv.zip
## 4 ftp://lausanne.isb-sib.ch/pub/databases/Bgee/bgee_v13_1/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_RPKM_GSE36026.tsv.zip
## 5 ftp://lausanne.isb-sib.ch/pub/databases/Bgee/bgee_v13_1/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_RPKM_GSE43520.tsv.zip
## 6 ftp://lausanne.isb-sib.ch/pub/databases/Bgee/bgee_v13_1/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_RPKM_GSE41338.tsv.zip
## Experiment.name
## 1 [E-MTAB-599] Mouse Transcriptome
## 2 Evolutionary dynamics of gene and isoform regulation in mammalian tissues
## 3 The evolution of gene expression levels in mammalian organs
## 4 RNA-seq from ENCODE/LICR
## 5 The evolution of lncRNA repertoires and expression patterns in tetrapods
## 6 The evolutionary landscape of alternative splicing in vertebrate species
## Experiment.description
## 1 Sequencing the transcriptome of DBAxC57BL/6J mice. To study the regulation of transcription, splicing and RNA turnover we have sequenced the transcriptomes of tissues collected DBAxC57BL/6J mice.
## 2 Most mammalian genes produce multiple distinct mRNAs through alternative splicing, but the extent of splicing conservation is not clear. To assess tissue-specific transcriptome variation across mammals, we sequenced cDNA from 9 tissues from 4 mammals and one bird in biological triplicate, at unprecedented depth. We find that while tissue-specific gene expression programs are largely conserved, alternative splicing is well conserved in only a subset of tissues and is frequently lineage-specific. Thousands of novel, lineage-specific and conserved alternative exons were identified; widely conserved alternative exons had signatures of binding by MBNL, PTB, RBFOX, STAR and TIA family splicing factors, implicating them as ancestral mammalian splicing regulators. Our data also indicates that alternative splicing is often used to alter protein phosphorylatability, delimiting the scope of kinase signaling.
## 3 Changes in gene expression are thought to underlie many of the phenotypic differences between species. However, large-scale analyses of gene expression evolution were until recently prevented by technological limitations. Here we report the sequencing of polyadenylated RNA from six organs across ten species that represent all major mammalian lineages (placentals, marsupials and monotremes) and birds (the evolutionary outgroup), with the goal of understanding the dynamics of mammalian transcriptome evolution. We show that the rate of gene expression evolution varies among organs, lineages and chromosomes, owing to differences in selective pressures: transcriptome change was slow in nervous tissues and rapid in testes, slower in rodents than in apes and monotremes, and rapid for the X chromosome right after its formation. Although gene expression evolution in mammals was strongly shaped by purifying selection, we identify numerous potentially selectively driven expression switches, which occurred at different rates across lineages and tissues and which probably contributed to the specific organ biology of various mammals. Our transcriptome data provide a valuable resource for functional and evolutionary analyses of mammalian genomes.
## 4 Using RNA-Seq (Mortazavi et al., 2008), high-resolution genome-wide maps of the mouse transcriptome across multiple mouse (C57Bl/6) tissues and primary cells were generated.
## 5 To broaden our understanding of lncRNA evolution, we used an extensive RNA-seq dataset to establish lncRNA repertoires and homologous gene families in 11 tetrapod species. We analyzed the poly- adenylated transcriptomes of 8 organs (cortex/whole brain without cerebellum, cerebellum, heart, kidney, liver, placenta, ovary and testis) and 11 species (human, chimpanzee, bonobo, gorilla, orangutan, macaque, mouse, opossum, platypus, chicken and the frog Xenopus tropicalis), which shared a common ancestor ~370 millions of years (MY) ago. Our dataset included 47 strand-specific samples, which allowed us to confirm the orientation of gene predictions and to address the evolution of sense-antisense transcripts. See also GSE43721 (Soumillon et al, Cell Reports, 2013) for three strand-specific samples for mouse brain, liver and testis.
## 6 How species with similar repertoires of protein coding genes differ so dramatically at the phenotypic level is poorly understood. From comparing the transcriptomes of multiple organs from vertebrate species spanning ~350 million years of evolution, we observe significant differences in alternative splicing complexity between the main vertebrate lineages, with the highest complexity in the primate lineage. Moreover, within as little as six million years, the splicing profiles of physiologically-equivalent organs have diverged to the extent that they are more strongly related to the identity of a species than they are to organ type. Most vertebrate species-specific splicing patterns are governed by the highly variable use of a largely conserved cis-regulatory code. However, a smaller number of pronounced species-dependent splicing changes are predicted to remodel interactions involving factors acting at multiple steps in gene regulation. These events are expected to further contribute to the dramatic diversification of alternative splicing as well as to other gene regulatory changes that contribute to phenotypic differences among vertebrate species.
The getData()
function will download processed RNA-seq data from all mouse experiments in Bgee as a list.
# download all RNA-seq experiments from mouse
data_bgee_mouse <- getData(bgee)
##
## The experiment is not defined. Hence taking all rna_seq experiments available for Mus_musculus.
##
## Downloading expression data...
##
## Saved expression data file in /tmp/Rtmpma61Nt/Rbuild45413a8e759f/BgeeDB/vignettes/Mus_musculus_Bgee_13_2 folder. Now unzipping file...
##
## Saving all data in .rds file...
# number of experiments downloaded
length(data_bgee_mouse)
## [1] 7
# check the downloaded data
lapply(data_bgee_mouse, head)
## [[1]]
## Experiment.ID Library.ID Library.type Gene.ID
## 1 GSE36026 GSM929703 single ENSMUSG00000000001
## 2 GSE36026 GSM929703 single ENSMUSG00000000003
## 3 GSE36026 GSM929703 single ENSMUSG00000000028
## 4 GSE36026 GSM929703 single ENSMUSG00000000031
## 5 GSE36026 GSM929703 single ENSMUSG00000000037
## 6 GSE36026 GSM929703 single ENSMUSG00000000049
## Anatomical.entity.ID Anatomical.entity.name Stage.ID
## 1 UBERON:0001348 brown adipose tissue MmusDv:0000074
## 2 UBERON:0001348 brown adipose tissue MmusDv:0000074
## 3 UBERON:0001348 brown adipose tissue MmusDv:0000074
## 4 UBERON:0001348 brown adipose tissue MmusDv:0000074
## 5 UBERON:0001348 brown adipose tissue MmusDv:0000074
## 6 UBERON:0001348 brown adipose tissue MmusDv:0000074
## Stage.name Read.count RPKM Detection.flag Detection.quality
## 1 24 weeks (mouse) 412 5.49619 present high quality
## 2 24 weeks (mouse) 0 0.00000 absent high quality
## 3 24 weeks (mouse) 53 1.46527 present high quality
## 4 24 weeks (mouse) 94 3.22440 present high quality
## 5 24 weeks (mouse) 4 0.05278 absent high quality
## 6 24 weeks (mouse) 10 0.65387 absent high quality
## State.in.Bgee
## 1 Used in expression call
## 2 Result excluded, reason: pre-filtering
## 3 Used in expression call
## 4 Used in expression call
## 5 Used in no-expression call
## 6 Used in no-expression call
##
## [[2]]
## Experiment.ID Library.ID Library.type Gene.ID
## 1 GSE41637 GSM1020640 paired ENSMUSG00000000001
## 2 GSE41637 GSM1020640 paired ENSMUSG00000000003
## 3 GSE41637 GSM1020640 paired ENSMUSG00000000028
## 4 GSE41637 GSM1020640 paired ENSMUSG00000000031
## 5 GSE41637 GSM1020640 paired ENSMUSG00000000037
## 6 GSE41637 GSM1020640 paired ENSMUSG00000000049
## Anatomical.entity.ID Anatomical.entity.name Stage.ID
## 1 UBERON:0000955 brain UBERON:0000113
## 2 UBERON:0000955 brain UBERON:0000113
## 3 UBERON:0000955 brain UBERON:0000113
## 4 UBERON:0000955 brain UBERON:0000113
## 5 UBERON:0000955 brain UBERON:0000113
## 6 UBERON:0000955 brain UBERON:0000113
## Stage.name Read.count RPKM Detection.flag
## 1 post-juvenile adult stage 2677 12.41761 present
## 2 post-juvenile adult stage 0 0.00000 absent
## 3 post-juvenile adult stage 87 0.83635 absent
## 4 post-juvenile adult stage 27 0.32204 absent
## 5 post-juvenile adult stage 140 0.64249 absent
## 6 post-juvenile adult stage 15 0.34104 absent
## Detection.quality State.in.Bgee
## 1 high quality Used in expression call
## 2 high quality Result excluded, reason: pre-filtering
## 3 high quality Used in expression call
## 4 high quality Result excluded, reason: noExpression conflict
## 5 high quality Used in no-expression call
## 6 high quality Used in expression call
##
## [[3]]
## Experiment.ID Library.ID Library.type Gene.ID
## 1 GSE30617 GSM759583 paired ENSMUSG00000000001
## 2 GSE30617 GSM759583 paired ENSMUSG00000000003
## 3 GSE30617 GSM759583 paired ENSMUSG00000000028
## 4 GSE30617 GSM759583 paired ENSMUSG00000000031
## 5 GSE30617 GSM759583 paired ENSMUSG00000000037
## 6 GSE30617 GSM759583 paired ENSMUSG00000000049
## Anatomical.entity.ID Anatomical.entity.name Stage.ID
## 1 UBERON:0000948 heart MmusDv:0000052
## 2 UBERON:0000948 heart MmusDv:0000052
## 3 UBERON:0000948 heart MmusDv:0000052
## 4 UBERON:0000948 heart MmusDv:0000052
## 5 UBERON:0000948 heart MmusDv:0000052
## 6 UBERON:0000948 heart MmusDv:0000052
## Stage.name Read.count RPKM Detection.flag Detection.quality
## 1 8 weeks (mouse) 941 11.57664 present high quality
## 2 8 weeks (mouse) 0 0.00000 absent high quality
## 3 8 weeks (mouse) 72 1.83570 present high quality
## 4 8 weeks (mouse) 225 7.11763 present high quality
## 5 8 weeks (mouse) 15 0.18255 absent high quality
## 6 8 weeks (mouse) 2 0.12059 absent high quality
## State.in.Bgee
## 1 Used in expression call
## 2 Result excluded, reason: pre-filtering
## 3 Used in expression call
## 4 Used in expression call
## 5 Used in no-expression call
## 6 Used in expression call
##
## [[4]]
## Experiment.ID Library.ID Library.type Gene.ID
## 1 GSE41338 GSM1015150 paired ENSMUSG00000000001
## 2 GSE41338 GSM1015150 paired ENSMUSG00000000003
## 3 GSE41338 GSM1015150 paired ENSMUSG00000000028
## 4 GSE41338 GSM1015150 paired ENSMUSG00000000031
## 5 GSE41338 GSM1015150 paired ENSMUSG00000000037
## 6 GSE41338 GSM1015150 paired ENSMUSG00000000049
## Anatomical.entity.ID Anatomical.entity.name Stage.ID
## 1 UBERON:0000955 brain UBERON:0000113
## 2 UBERON:0000955 brain UBERON:0000113
## 3 UBERON:0000955 brain UBERON:0000113
## 4 UBERON:0000955 brain UBERON:0000113
## 5 UBERON:0000955 brain UBERON:0000113
## 6 UBERON:0000955 brain UBERON:0000113
## Stage.name Read.count RPKM Detection.flag
## 1 post-juvenile adult stage 722 4.17829 present
## 2 post-juvenile adult stage 0 0.00000 absent
## 3 post-juvenile adult stage 59 0.70761 absent
## 4 post-juvenile adult stage 26 0.38690 absent
## 5 post-juvenile adult stage 17 0.09733 absent
## 6 post-juvenile adult stage 46 1.30483 present
## Detection.quality State.in.Bgee
## 1 high quality Used in expression call
## 2 high quality Result excluded, reason: pre-filtering
## 3 high quality Used in expression call
## 4 high quality Result excluded, reason: noExpression conflict
## 5 high quality Used in no-expression call
## 6 high quality Used in expression call
##
## [[5]]
## Experiment.ID Library.ID Library.type Gene.ID
## 1 GSE30352 GSM752614 single ENSMUSG00000000001
## 2 GSE30352 GSM752614 single ENSMUSG00000000003
## 3 GSE30352 GSM752614 single ENSMUSG00000000028
## 4 GSE30352 GSM752614 single ENSMUSG00000000031
## 5 GSE30352 GSM752614 single ENSMUSG00000000037
## 6 GSE30352 GSM752614 single ENSMUSG00000000049
## Anatomical.entity.ID Anatomical.entity.name Stage.ID
## 1 UBERON:0000955 brain UBERON:0000113
## 2 UBERON:0000955 brain UBERON:0000113
## 3 UBERON:0000955 brain UBERON:0000113
## 4 UBERON:0000955 brain UBERON:0000113
## 5 UBERON:0000955 brain UBERON:0000113
## 6 UBERON:0000955 brain UBERON:0000113
## Stage.name Read.count RPKM Detection.flag
## 1 post-juvenile adult stage 550 11.92921 present
## 2 post-juvenile adult stage 0 0.00000 absent
## 3 post-juvenile adult stage 12 0.53941 absent
## 4 post-juvenile adult stage 2 0.11157 absent
## 5 post-juvenile adult stage 13 0.27897 absent
## 6 post-juvenile adult stage 7 0.74416 absent
## Detection.quality State.in.Bgee
## 1 high quality Used in expression call
## 2 high quality Result excluded, reason: pre-filtering
## 3 high quality Used in expression call
## 4 high quality Result excluded, reason: noExpression conflict
## 5 high quality Used in no-expression call
## 6 high quality Used in expression call
##
## [[6]]
## Experiment.ID Library.ID Library.type Gene.ID
## 1 GSE43721 GSM1069683 single ENSMUSG00000000001
## 2 GSE43721 GSM1069683 single ENSMUSG00000000003
## 3 GSE43721 GSM1069683 single ENSMUSG00000000028
## 4 GSE43721 GSM1069683 single ENSMUSG00000000031
## 5 GSE43721 GSM1069683 single ENSMUSG00000000037
## 6 GSE43721 GSM1069683 single ENSMUSG00000000049
## Anatomical.entity.ID Anatomical.entity.name Stage.ID
## 1 UBERON:0000955 brain MmusDv:0000055
## 2 UBERON:0000955 brain MmusDv:0000055
## 3 UBERON:0000955 brain MmusDv:0000055
## 4 UBERON:0000955 brain MmusDv:0000055
## 5 UBERON:0000955 brain MmusDv:0000055
## 6 UBERON:0000955 brain MmusDv:0000055
## Stage.name Read.count RPKM Detection.flag Detection.quality
## 1 11 weeks (mouse) 1024 9.36418 present high quality
## 2 11 weeks (mouse) 0 0.00000 absent high quality
## 3 11 weeks (mouse) 61 1.15606 present high quality
## 4 11 weeks (mouse) 19 0.44676 absent high quality
## 5 11 weeks (mouse) 45 0.40712 absent high quality
## 6 11 weeks (mouse) 35 1.56882 present high quality
## State.in.Bgee
## 1 Used in expression call
## 2 Result excluded, reason: pre-filtering
## 3 Used in expression call
## 4 Result excluded, reason: noExpression conflict
## 5 Used in no-expression call
## 6 Used in expression call
##
## [[7]]
## Experiment.ID Library.ID Library.type Gene.ID
## 1 GSE43520 GSM1064841 single ENSMUSG00000000001
## 2 GSE43520 GSM1064841 single ENSMUSG00000000003
## 3 GSE43520 GSM1064841 single ENSMUSG00000000028
## 4 GSE43520 GSM1064841 single ENSMUSG00000000031
## 5 GSE43520 GSM1064841 single ENSMUSG00000000037
## 6 GSE43520 GSM1064841 single ENSMUSG00000000049
## Anatomical.entity.ID Anatomical.entity.name Stage.ID
## 1 UBERON:0000955 brain UBERON:0000113
## 2 UBERON:0000955 brain UBERON:0000113
## 3 UBERON:0000955 brain UBERON:0000113
## 4 UBERON:0000955 brain UBERON:0000113
## 5 UBERON:0000955 brain UBERON:0000113
## 6 UBERON:0000955 brain UBERON:0000113
## Stage.name Read.count RPKM Detection.flag
## 1 post-juvenile adult stage 852 9.45702 present
## 2 post-juvenile adult stage 0 0.00000 absent
## 3 post-juvenile adult stage 36 0.82814 absent
## 4 post-juvenile adult stage 3 0.08563 absent
## 5 post-juvenile adult stage 31 0.34042 absent
## 6 post-juvenile adult stage 3 0.16322 absent
## Detection.quality State.in.Bgee
## 1 high quality Used in expression call
## 2 high quality Result excluded, reason: pre-filtering
## 3 high quality Used in expression call
## 4 high quality Result excluded, reason: noExpression conflict
## 5 high quality Used in no-expression call
## 6 high quality Used in expression call
# isolate the first experiment
data_bgee_experiment1 <- data_bgee_mouse[[1]]
The result of the getData()
function is, for each experiment, a data frame with the different samples listed in rows, one after the other. Each row is a gene and the expression levels are displayed as raw read counts or RPKMs. A detection flag indicates is the gene is significantly expressed above background level of expression.
Note 1: An additional column in the data frame including expression in the TPM unit will be available from Bgee release 14 (planned for the end of 2016).
Note 2: If microarray data are downloaded, rows correspond to probesets and log2 of expression intensities are available instead of read counts/RPKMs.
Alternatively, you can choose to download data from only one particular RNA-seq experiment from Bgee with the experimentId
parameter:
# download data for GSE30617
data_bgee_mouse_gse30617 <- getData(bgee, experimentId = "GSE30617")
##
## Downloading expression data for the experiment GSE30617 ...
##
## Saved expression data file in /tmp/Rtmpma61Nt/Rbuild45413a8e759f/BgeeDB/vignettes/Mus_musculus_Bgee_13_2 folder. Now unzipping /tmp/Rtmpma61Nt/Rbuild45413a8e759f/BgeeDB/vignettes/Mus_musculus_Bgee_13_2/Mus_musculus_RNA-Seq_read_counts_RPKM_GSE30617.tsv.zip file...
##
## Saving all data in .rds file...
It is sometimes easier to work with data organized as a matrix, where rows represent genes or probesets and columns represent different samples. The formatData()
function reformats the data into an ExpressionSet object including: * An expression data matrix, with genes or probesets as rows, and samples as columns (assayData
slot). The stats
argument allows to choose if the matrix should be filled with read counts, RPKMs (and soon TPMs) for RNA-seq data. For microarray data the matrix is filled with log2 expression intensities. * A data frame listing the samples and their anatomical structure and developmental stage annotation (phenoData
slot) * For microarray data, the mapping from probesets to Ensembl genes (featureData
slot)
The callType
argument allows to retain only actively expressed genes or probesets, if set to “present” or “present high quality”. Genes or probesets that are absent in a given sample are given NA
values.
# use only present calls and fill expression matric with RPKM values
gene.expression.mouse.rpkm <- formatData(bgee, data_bgee_mouse_gse30617, callType = "present", stats = "rpkm")
##
## Extracting expression data matrix...
## Keeping only present genes.
##
## Extracting features information...
##
## Extracting samples information...
gene.expression.mouse.rpkm
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 39179 features, 36 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: GSM759583 GSM759584 ... GSM759618 (36 total)
## varLabels: Library.ID Anatomical.entity.ID ... Stage.name (5
## total)
## varMetadata: labelDescription
## featureData
## featureNames: ENSMUSG00000000001 ENSMUSG00000000003 ...
## ENSMUSG00000099334 (39179 total)
## fvarLabels: Gene.ID
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
For some documentation on the TopAnat analysis, please refer to our publications, or to the web-tool page (http://bgee.org/?page=top_anat#/).
Similarly to the quantitative data download example above, the first step of a topAnat analysis is to built an object from the Bgee class. For this example, we will focus on zebrafish:
# Creating new Bgee class object
bgee <- Bgee$new(species = "Danio_rerio")
##
## NOTE: You did not specify any data type. The argument dataType will be set to c("rna_seq","affymetrix","est","in_situ") for the next steps.
##
## Querying Bgee to get release information...
##
## NOTE: the file describing Bgee species information for release 13_2 was found in the download directory /tmp/Rtmpma61Nt/Rbuild45413a8e759f/BgeeDB/vignettes. Data will not be redownloaded.
##
## API key built: 5511f77161100e337830a4c7777f6b779e4f584e109b0e74c9bdf6e7619786664637fefef03708873cd0ab61f1e3a86d321cd5315e63104659d37b08a42ec4d4
Note : We are free to specify any data type of interest using the dataType
argument among rna_seq
, affymetrix
, est
or in_situ
, or even a combination of data types. If nothing is specified, as in the above example, all data types available for the targeted species are used. This equivalent to specifying dataType=c("rna_seq","affymetrix","est","in_situ")
.
The loadTopAnatData()
function loads a mapping from genes to anatomical structures based on calls of expression in anatomical structures. It also loads the structure of the anatomical ontology and the names of anatomical structures.
# Loading calls of expression
myTopAnatData <- loadTopAnatData(bgee)
##
## Building URLs to retrieve organ relationships from Bgee.........
## URL successfully built (http://r13_2.bgee.org/?page=dao&action=org.bgee.model.dao.api.ontologycommon.RelationDAO.getAnatEntityRelations&display_type=tsv&species_list=7955&attr_list=SOURCE_ID&attr_list=TARGET_ID&api_key=5511f77161100e337830a4c7777f6b779e4f584e109b0e74c9bdf6e7619786664637fefef03708873cd0ab61f1e3a86d321cd5315e63104659d37b08a42ec4d4&source=BgeeDB_R_package&source_version=2.2.0)
## Submitting URL to Bgee webservice (can be long)
## Got results from Bgee webservice. Files are written in "/tmp/Rtmpma61Nt/Rbuild45413a8e759f/BgeeDB/vignettes/Danio_rerio_Bgee_13_2"
##
## Building URLs to retrieve organ names from Bgee.................
## URL successfully built (http://r13_2.bgee.org/?page=dao&action=org.bgee.model.dao.api.anatdev.AnatEntityDAO.getAnatEntities&display_type=tsv&species_list=7955&attr_list=ID&attr_list=NAME&api_key=5511f77161100e337830a4c7777f6b779e4f584e109b0e74c9bdf6e7619786664637fefef03708873cd0ab61f1e3a86d321cd5315e63104659d37b08a42ec4d4&source=BgeeDB_R_package&source_version=2.2.0)
## Submitting URL to Bgee webservice (can be long)
## Got results from Bgee webservice. Files are written in "/tmp/Rtmpma61Nt/Rbuild45413a8e759f/BgeeDB/vignettes/Danio_rerio_Bgee_13_2"
##
## Building URLs to retrieve mapping of gene to organs from Bgee...
## URL successfully built (http://r13_2.bgee.org/?page=dao&action=org.bgee.model.dao.api.expressiondata.ExpressionCallDAO.getExpressionCalls&display_type=tsv&species_list=7955&attr_list=GENE_ID&attr_list=ANAT_ENTITY_ID&api_key=5511f77161100e337830a4c7777f6b779e4f584e109b0e74c9bdf6e7619786664637fefef03708873cd0ab61f1e3a86d321cd5315e63104659d37b08a42ec4d4&source=BgeeDB_R_package&source_version=2.2.0)
## Submitting URL to Bgee webservice (can be long)
## Got results from Bgee webservice. Files are written in "/tmp/Rtmpma61Nt/Rbuild45413a8e759f/BgeeDB/vignettes/Danio_rerio_Bgee_13_2"
##
## Parsing the results.............................................
##
## Adding BGEE:0 as unique root of all terms of the ontology.......
##
## Done.
# Look at the data
## str(myTopAnatData)
The strigency on the quality of expression calls can be changed with the confidence
argument. Finally, if you are interested in expression data coming from a particular developmental stage or a group of stages, please specify the a Uberon stage Id in the stage
argument.
## Loading only high-quality expression calls from affymetrix data made on embryonic samples only
## This is just given as an example, but is not run in this vignette because only few data are returned
bgee <- Bgee$new(species = "Danio_rerio", dataType="affymetrix")
myTopAnatData <- loadTopAnatData(bgee, stage="UBERON:0000068", confidence="high_quality")
Note: As mentioned above, the downloaded data files are stored in a versioned folder that can be set with the pathToData
argument when creating the Bgee class object (default is the working directory). If you query again Bgee with the exact same parameters, these cached files will be read instead of querying the web-service again. It is thus important, if you plan to reuse the same data for multiple parallel topAnat analyses, to plan to make use of these cached files instead of redownloading them for each analysis. The cached files also give the possibility to repeat analyses offline.
First we need to prepare a list of genes in the foreground and in the background. The input format is the same as the gene list required to build a topGOdata
object in the topGO
package: a vector with background genes as names, and 0 or 1 values depending if a gene is in the foreground or not. In this example we will look at genes, annotated with “spermatogenesis” or children terms in the Gene Ontology. We use the biomaRt
package to retrieve this list of genes. We expect them to be enriched for expression in male tissues, notably the testes. The background list of genes is set to all genes annotated to at least one Gene Ontology term, allowing to account for biases in which types of genes are more likely to receive Gene Ontology annotation.
# source("https://bioconductor.org/biocLite.R")
# biocLite("biomaRt")
library(biomaRt)
ensembl <- useMart("ensembl")
ensembl <- useDataset("drerio_gene_ensembl", mart=ensembl)
# Foreground genes are those with GO annotation "spermatogenesis" or childrem terms
myGenes <- getBM(attributes= "ensembl_gene_id", filters=c("go_parent_term"), values=list(c("GO:0007283")), mart=ensembl)
# Background are all genes with GO annotation
universe <- getBM(attributes= "ensembl_gene_id", filters=c("with_go"), values=list(c(TRUE)), mart=ensembl)
# Prepare the gene list vector
geneList <- factor(as.integer(universe[,1] %in% myGenes[,1]))
names(geneList) <- universe[,1]
head(geneList)
summary(geneList == 1)
# Prepare the topGO object
myTopAnatObject <- topAnat(myTopAnatData, geneList)
The above code using the biomaRt
package is not executed in this vignette to prevent building issues of our package in case of biomaRt downtime. Instead we use a geneList
object saved in the data/
folder that we built using pre-downloaded data.
load("../data/geneList.RData")
myTopAnatObject <- topAnat(myTopAnatData, geneList)
##
## Checking topAnatData object.............
##
## Checking gene list......................
##
## WARNING: Some genes in your gene list have no expression data in Bgee, and will not be included in the analysis. 13777 genes in background will be kept.
##
## Building most specific Ontology terms... ( 1050 Ontology terms found. )
##
## Building DAG topology................... ( 1840 Ontology terms and 3702 relations. )
##
## Annotating nodes (Can be long).......... ( 13777 genes annotated to the Ontology terms. )
Warning: This can be long, especially if the gene list is large, since the Uberon anatomical ontology is large and expression calls will be propagated through the whole ontology (e.g., expression in the forebrain will also be counted as expression in parent structures such as the brain, nervous system, etc). Consider running a script in batch mode if you have multiple analyses to do.
For this step, see the vignette of the topGO
package for more details, as you have to directly use the tests implemented in the topGO
package, as shown in this example:
results <- runTest(myTopAnatObject, algorithm = 'classic', statistic = 'fisher')
##
## -- Classic Algorithm --
##
## the algorithm is scoring 298 nontrivial nodes
## parameters:
## test statistic: fisher
You can also choose one of the topGO decorrelation methods, for example the “weight” method, allowing to avoid redundant results induced by the structure of the ontology.
results <- runTest(myTopAnatObject, algorithm = 'weight', statistic = 'fisher')
Warning: This can be long because of the size of the ontology. Consider running scripts in batch mode if you have multiple analyses to do.
The makeTable
function allows to filter and format the test results, and calculate FDR values.
# Display results sigificant at a 10% FDR threshold
makeTable(myTopAnatData, myTopAnatObject, results, cutoff = 0.1)
##
## Building the results table for the 4 significant terms at FDR threshold of 0.1...
## Ordering results by pValue column in increasing order...
## Done
## organId organName annotated
## UBERON:0000079 UBERON:0000079 male reproductive system 4058
## UBERON:0000473 UBERON:0000473 testis 4058
## UBERON:0003135 UBERON:0003135 male reproductive organ 4058
## UBERON:0003101 UBERON:0003101 male organism 4061
## significant expected foldEnrichment pValue FDR
## UBERON:0000079 16 7.36 2.173913 0.0003557706 0.09480433
## UBERON:0000473 16 7.36 2.173913 0.0003557706 0.09480433
## UBERON:0003135 16 7.36 2.173913 0.0003557706 0.09480433
## UBERON:0003101 16 7.37 2.170963 0.0003591073 0.09480433
At the time of building this vignette (April 2017), the significant terms were all related to male reproductive system (testes), which makes biological sense: there is an expression bias for testis of genes involved in spermatogenesis.
By default results are sorted by p-value, but this can be changed with the ordering
parameter by specifying which column should be used to order the results (preceded by a “-” sign to indicate that ordering should be made in decreasing order). For example, it is often convenient to sort significant structures by decreasing enrichment fold, using ordering = -6
. The full table of results can be obtained using cutoff = 1
.
Warning: it is debated whether FDR correction is appropriate on enrichment test results, since tests on different terms of the ontologies are not independent. A nice discussion can be found in the vignette of the topGO
package.