Package: Pbase
Author: Laurent Gatto
Last compiled: Tue Oct 13 23:37:10 2015
Last modified: 2015-10-13 16:27:11
This vignette described how to convert coordinates between different genome references. We will use transcript ENST00000373316
and GRCh38 and GRCh37 as working example.
tr <- "ENST00000373316"
Here is use Gviz to query the latest Ensembl biomart and extract the transcript of interest.
suppressMessages(library("Gviz"))
suppressMessages(library("biomaRt"))
h38 <- useMart("ensembl", "hsapiens_gene_ensembl")
tr38 <- BiomartGeneRegionTrack(biomart = h38,
transcript = tr)
tr38 <- split(tr38, transcript(tr38))
tr38 <- ranges(tr38[[tr]])
tr38
## GRanges object with 13 ranges and 7 metadata columns:
## seqnames ranges strand | feature
## <Rle> <IRanges> <Rle> | <character>
## [1] chrX [78104174, 78104340] + | utr5
## [2] chrX [78104341, 78104405] + | protein_coding
## [3] chrX [78109867, 78109917] + | protein_coding
## [4] chrX [78113744, 78113899] + | protein_coding
## [5] chrX [78114016, 78114160] + | protein_coding
## ... ... ... ... ... ...
## [9] chrX [78123195, 78123374] + | protein_coding
## [10] chrX [78124874, 78125051] + | protein_coding
## [11] chrX [78125327, 78125425] + | protein_coding
## [12] chrX [78125790, 78125830] + | protein_coding
## [13] chrX [78125831, 78129296] + | utr3
## gene exon transcript symbol
## <character> <character> <character> <character>
## [1] ENSG00000102144 ENSE00001600900 ENST00000373316 PGK1
## [2] ENSG00000102144 ENSE00001600900 ENST00000373316 PGK1
## [3] ENSG00000102144 ENSE00003506377 ENST00000373316 PGK1
## [4] ENSG00000102144 ENSE00003502842 ENST00000373316 PGK1
## [5] ENSG00000102144 ENSE00003512377 ENST00000373316 PGK1
## ... ... ... ... ...
## [9] ENSG00000102144 ENSE00000672996 ENST00000373316 PGK1
## [10] ENSG00000102144 ENSE00000672995 ENST00000373316 PGK1
## [11] ENSG00000102144 ENSE00000672994 ENST00000373316 PGK1
## [12] ENSG00000102144 ENSE00001948816 ENST00000373316 PGK1
## [13] ENSG00000102144 ENSE00001948816 ENST00000373316 PGK1
## rank phase
## <numeric> <integer>
## [1] 1 -1
## [2] 1 -1
## [3] 2 2
## [4] 3 2
## [5] 4 2
## ... ... ...
## [9] 8 0
## [10] 9 0
## [11] 10 1
## [12] 11 0
## [13] 11 -1
## -------
## seqinfo: 1 sequence from hsapiens_gene_ensembl genome; no seqlengths
Note the starting position of the transcript is 78104174.
Below, I repeat the same operation without using my own ens Mart instance. As far as I understand, Gviz queries the UCSC genome reference by default.
h37 <- useMart(host = "feb2014.archive.ensembl.org",
biomart = "ENSEMBL_MART_ENSEMBL",
dataset = "hsapiens_gene_ensembl")
tr37 <- BiomartGeneRegionTrack(biomart = h37,
transcript = tr)
tr37 <- split(tr37, transcript(tr37))
tr37 <- ranges(tr37[[tr]])
tr37
## GRanges object with 13 ranges and 7 metadata columns:
## seqnames ranges strand | feature
## <Rle> <IRanges> <Rle> | <character>
## [1] chrX [77359671, 77359837] + | utr5
## [2] chrX [77359838, 77359902] + | protein_coding
## [3] chrX [77365364, 77365414] + | protein_coding
## [4] chrX [77369241, 77369396] + | protein_coding
## [5] chrX [77369513, 77369657] + | protein_coding
## ... ... ... ... ... ...
## [9] chrX [77378692, 77378871] + | protein_coding
## [10] chrX [77380371, 77380548] + | protein_coding
## [11] chrX [77380824, 77380922] + | protein_coding
## [12] chrX [77381287, 77381327] + | protein_coding
## [13] chrX [77381328, 77384793] + | utr3
## gene exon transcript symbol
## <character> <character> <character> <character>
## [1] ENSG00000102144 ENSE00001600900 ENST00000373316 PGK1
## [2] ENSG00000102144 ENSE00001600900 ENST00000373316 PGK1
## [3] ENSG00000102144 ENSE00003506377 ENST00000373316 PGK1
## [4] ENSG00000102144 ENSE00003502842 ENST00000373316 PGK1
## [5] ENSG00000102144 ENSE00003512377 ENST00000373316 PGK1
## ... ... ... ... ...
## [9] ENSG00000102144 ENSE00000672996 ENST00000373316 PGK1
## [10] ENSG00000102144 ENSE00000672995 ENST00000373316 PGK1
## [11] ENSG00000102144 ENSE00000672994 ENST00000373316 PGK1
## [12] ENSG00000102144 ENSE00001948816 ENST00000373316 PGK1
## [13] ENSG00000102144 ENSE00001948816 ENST00000373316 PGK1
## rank phase
## <numeric> <integer>
## [1] 1 -1
## [2] 1 -1
## [3] 2 2
## [4] 3 2
## [5] 4 2
## ... ... ...
## [9] 8 0
## [10] 9 0
## [11] 10 1
## [12] 11 0
## [13] 11 -1
## -------
## seqinfo: 1 sequence from hsapiens_gene_ensembl genome; no seqlengths
Note the starting position of the transcript is 77359671.
These differences seem to stem from different genome builds. Ensembl release 78 uses GRCh38, while UCSC uses GRCh37. Indeed, Gviz sets the Ensembl biomart server to Feb.2014
GRCh37.p13
.
We will use the coordinate mapping infrastructure described in the January 2015 Bioconductor Newletter and the Changing genomic coordinate systems with rtracklayer::liftOver workflow.
First, we query AnnotationHub for a chain file to perform the operation we want.
library("AnnotationHub")
hub <- AnnotationHub()
## snapshotDate(): 2015-08-26
query(hub, 'hg19ToHg38')
## AnnotationHub with 1 record
## # snapshotDate(): 2015-08-26
## # names(): AH14150
## # $dataprovider: UCSC
## # $species: Homo sapiens
## # $rdataclass: ChainFile
## # $title: hg19ToHg38.over.chain.gz
## # $description: UCSC liftOver chain file from hg19 to hg38
## # $taxonomyid: 9606
## # $genome: hg19
## # $sourcetype: Chain
## # $sourceurl: http://hgdownload.cse.ucsc.edu/goldenpath/hg19/liftOver/h...
## # $sourcelastmodifieddate: NA
## # $sourcesize: NA
## # $tags: liftOver, chain, UCSC, genome, homology
## # retrieve record with 'object[["AH14150"]]'
chain <- query(hub, 'hg19ToHg38')[[1]]
The liftOver
function from the rtracklayer package will use the chain and translate the coordinates of a GRanges
object into a new GRangesList
object.
library("rtracklayer")
res <- liftOver(tr37, chain)
res <- unlist(res)
## set annotation
names(res) <- NULL
genome(res) <- "hsapiens_gene_ensembl"
all.equal(res, tr38)
## [1] TRUE
sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.3 LTS
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] grid stats4 parallel stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] rtracklayer_1.30.0 AnnotationHub_2.2.0 biomaRt_2.26.0
## [4] mzID_1.8.0 Biostrings_2.38.0 XVector_0.10.0
## [7] Pbase_0.10.0 Gviz_1.14.0 GenomicRanges_1.22.0
## [10] GenomeInfoDb_1.6.0 IRanges_2.4.0 S4Vectors_0.8.0
## [13] Rcpp_0.12.1 BiocGenerics_0.16.0 BiocStyle_1.8.0
##
## loaded via a namespace (and not attached):
## [1] ProtGenerics_1.2.0 bitops_1.0-6
## [3] matrixStats_0.14.2 doParallel_1.0.8
## [5] RColorBrewer_1.1-2 httr_1.0.0
## [7] MSnbase_1.18.0 tools_3.2.2
## [9] R6_2.1.1 affyio_1.40.0
## [11] rpart_4.1-10 cleaver_1.8.0
## [13] Hmisc_3.17-0 DBI_0.3.1
## [15] colorspace_1.2-6 nnet_7.3-11
## [17] gridExtra_2.0.0 Pviz_1.4.0
## [19] curl_0.9.3 preprocessCore_1.32.0
## [21] chron_2.3-47 Biobase_2.30.0
## [23] formatR_1.2.1 scales_0.3.0
## [25] affy_1.48.0 stringr_1.0.0
## [27] digest_0.6.8 Rsamtools_1.22.0
## [29] foreign_0.8-66 rmarkdown_0.8.1
## [31] dichromat_2.0-0 htmltools_0.2.6
## [33] limma_3.26.0 BSgenome_1.38.0
## [35] RSQLite_1.0.0 impute_1.44.0
## [37] BiocInstaller_1.20.0 shiny_0.12.2
## [39] BiocParallel_1.4.0 acepack_1.3-3.3
## [41] VariantAnnotation_1.16.0 RCurl_1.95-4.7
## [43] magrittr_1.5 Formula_1.2-1
## [45] futile.logger_1.4.1 MALDIquant_1.13
## [47] munsell_0.4.2 proto_0.3-10
## [49] vsn_3.38.0 stringi_0.5-5
## [51] yaml_2.1.13 MASS_7.3-44
## [53] SummarizedExperiment_1.0.0 zlibbioc_1.16.0
## [55] plyr_1.8.3 lattice_0.20-33
## [57] splines_3.2.2 GenomicFeatures_1.22.0
## [59] mzR_2.4.0 knitr_1.11
## [61] reshape2_1.4.1 codetools_0.2-14
## [63] futile.options_1.0.0 XML_3.98-1.3
## [65] evaluate_0.8 biovizBase_1.18.0
## [67] latticeExtra_0.6-26 pcaMethods_1.60.0
## [69] lambda.r_1.1.7 data.table_1.9.6
## [71] httpuv_1.3.3 foreach_1.4.3
## [73] gtable_0.1.2 ggplot2_1.0.1
## [75] mime_0.4 xtable_1.7-4
## [77] survival_2.38-3 iterators_1.0.8
## [79] GenomicAlignments_1.6.0 AnnotationDbi_1.32.0
## [81] cluster_2.0.3 interactiveDisplayBase_1.8.0