Google Genomics implements the GA4GH reads API and this R package can retrieve data from that implementation. For more detail, see https://cloud.google.com/genomics/v1beta2/reference/reads
library(GoogleGenomics)
# This vignette is authenticated on package load from the env variable GOOGLE_API_KEY.
# When running interactively, call the authenticate method.
# ?authenticate
By default, this function retrieves reads for a small genomic region for one sample in 1,000 Genomes.
reads <- getReads()
## Fetching reads page.
## Reads are now available.
length(reads)
## [1] 18
We can see that 18 individual reads were returned and that the JSON response was deserialized into an R list object:
class(reads)
## [1] "list"
mode(reads)
## [1] "list"
The top level field names are:
names(reads[[1]])
## [1] "id" "readGroupId" "readGroupSetId"
## [4] "fragmentName" "properPlacement" "fragmentLength"
## [7] "readNumber" "numberReads" "alignment"
## [10] "alignedSequence" "alignedQuality" "nextMatePosition"
## [13] "info"
And examining the alignment we see:
reads[[1]]$alignedSequence
## [1] "AACAAAAAACTGTCTAACAAGATTTTATGGTTTATAGACCATGATTCCCCGGACACATTAGATAGAAATCTGGGCAAGAGAAGAAAAAAAGGTCAGAGTT"
reads[[1]]$alignment$position$referenceName
## [1] "22"
reads[[1]]$alignment$position$position
## [1] "16051308"
This is good, but this data becomes much more useful when it is converted to Bioconductor data types. For example, we can convert reads in this list representation to GAlignments:
readsToGAlignments(reads)
## GAlignments object with 18 alignments and 1 metadata column:
## seqnames strand cigar qwidth start
## <Rle> <Rle> <character> <integer> <integer>
## ERR251039.44923356 chr22 - 100M 100 16051309
## ERR251039.28556355 chr22 + 100M 100 16051323
## ERR016214.27010769 chr22 + 75M1D6M 81 16051330
## ERR016213.15718767 chr22 + 68M 68 16051400
## ERR016213.29749886 chr22 - 68M 68 16051403
## ... ... ... ... ... ...
## ERR251040.1475552 chr22 + 100M 100 16051454
## ERR251040.3762117 chr22 + 100M 100 16051469
## ERR251040.11853979 chr22 + 100M 100 16051478
## ERR251040.34469189 chr22 + 100M 100 16051486
## ERR251040.38772006 chr22 - 100M 100 16051498
## end width njunc | flag
## <integer> <integer> <integer> | <numeric>
## ERR251039.44923356 16051408 100 0 | 147
## ERR251039.28556355 16051422 100 0 | 163
## ERR016214.27010769 16051411 82 0 | 35
## ERR016213.15718767 16051467 68 0 | 163
## ERR016213.29749886 16051470 68 0 | 153
## ... ... ... ... ... ...
## ERR251040.1475552 16051553 100 0 | 163
## ERR251040.3762117 16051568 100 0 | 163
## ERR251040.11853979 16051577 100 0 | 163
## ERR251040.34469189 16051585 100 0 | 35
## ERR251040.38772006 16051597 100 0 | 19
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Let’s take a look at the reads that overlap rs9536314 for sample NA12893 within the Illumina Platinum Genomes dataset. This SNP resides on chromosome 13 at position 33628137 in 0-based coordinates.
# Change the values of 'chromosome', 'start', or 'end' here if you wish to plot
# alignments from a different portion of the genome.
alignments <- getReads(readGroupSetId="CMvnhpKTFhDyy__v0qfPpkw",
chromosome="chr13",
start=33628130,
end=33628145,
converter=readsToGAlignments)
## Fetching reads page.
## Reads are now available.
alignments
## GAlignments object with 66 alignments and 1 metadata column:
## seqnames strand cigar
## <Rle> <Rle> <character>
## HSQ1009:126:D0UUYACXX:8:2213:15440:74824 chr13 + 101M
## HSQ1009:126:D0UUYACXX:8:2207:9137:75357 chr13 - 101M
## HSQ1009:127:C0LVVACXX:8:1312:4462:83203 chr13 - 101M
## HSQ1009:127:C0LVVACXX:8:2207:9999:16063 chr13 - 101M
## HSQ1009:126:D0UUYACXX:7:2312:2992:38597 chr13 + 101M
## ... ... ... ...
## HSQ1009:126:D0UUYACXX:8:1313:14128:95709 chr13 + 101M
## HSQ1009:126:D0UUYACXX:8:1314:13643:12305 chr13 + 101M
## HSQ1009:126:D0UUYACXX:8:1209:16537:43414 chr13 + 101M
## HSQ1009:126:D0UUYACXX:8:1214:6640:15022 chr13 + 101M
## HSQ1009:127:C0LVVACXX:8:2216:4748:24114 chr13 - 101M
## qwidth start end
## <integer> <integer> <integer>
## HSQ1009:126:D0UUYACXX:8:2213:15440:74824 101 33628032 33628132
## HSQ1009:126:D0UUYACXX:8:2207:9137:75357 101 33628036 33628136
## HSQ1009:127:C0LVVACXX:8:1312:4462:83203 101 33628037 33628137
## HSQ1009:127:C0LVVACXX:8:2207:9999:16063 101 33628038 33628138
## HSQ1009:126:D0UUYACXX:7:2312:2992:38597 101 33628041 33628141
## ... ... ... ...
## HSQ1009:126:D0UUYACXX:8:1313:14128:95709 101 33628137 33628237
## HSQ1009:126:D0UUYACXX:8:1314:13643:12305 101 33628139 33628239
## HSQ1009:126:D0UUYACXX:8:1209:16537:43414 101 33628141 33628241
## HSQ1009:126:D0UUYACXX:8:1214:6640:15022 101 33628141 33628241
## HSQ1009:127:C0LVVACXX:8:2216:4748:24114 101 33628143 33628243
## width njunc |
## <integer> <integer> |
## HSQ1009:126:D0UUYACXX:8:2213:15440:74824 101 0 |
## HSQ1009:126:D0UUYACXX:8:2207:9137:75357 101 0 |
## HSQ1009:127:C0LVVACXX:8:1312:4462:83203 101 0 |
## HSQ1009:127:C0LVVACXX:8:2207:9999:16063 101 0 |
## HSQ1009:126:D0UUYACXX:7:2312:2992:38597 101 0 |
## ... ... ... ...
## HSQ1009:126:D0UUYACXX:8:1313:14128:95709 101 0 |
## HSQ1009:126:D0UUYACXX:8:1314:13643:12305 101 0 |
## HSQ1009:126:D0UUYACXX:8:1209:16537:43414 101 0 |
## HSQ1009:126:D0UUYACXX:8:1214:6640:15022 101 0 |
## HSQ1009:127:C0LVVACXX:8:2216:4748:24114 101 0 |
## flag
## <numeric>
## HSQ1009:126:D0UUYACXX:8:2213:15440:74824 35
## HSQ1009:126:D0UUYACXX:8:2207:9137:75357 19
## HSQ1009:127:C0LVVACXX:8:1312:4462:83203 147
## HSQ1009:127:C0LVVACXX:8:2207:9999:16063 147
## HSQ1009:126:D0UUYACXX:7:2312:2992:38597 35
## ... ...
## HSQ1009:126:D0UUYACXX:8:1313:14128:95709 163
## HSQ1009:126:D0UUYACXX:8:1314:13643:12305 163
## HSQ1009:126:D0UUYACXX:8:1209:16537:43414 163
## HSQ1009:126:D0UUYACXX:8:1214:6640:15022 35
## HSQ1009:127:C0LVVACXX:8:2216:4748:24114 147
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Notice that we passed the converter to the getReads method so that we’re immediately working with GAlignments which means that we can start taking advantage of other Bioconductor functionality. Also keep in mind that the parameters start
and end
are expressed in 0-based coordinates per the GA4GH specification but the Bioconductor data type converters in GoogleGenomics, by default, transform the returned data to 1-based coordinates.
library(ggbio)
We can display the basic alignments and coverage data:
alignmentPlot <- autoplot(alignments, aes(color=strand, fill=strand))
## Scale for 'colour' is already present. Adding another scale for 'colour', which will replace the existing scale.
## Scale for 'fill' is already present. Adding another scale for 'fill', which will replace the existing scale.
coveragePlot <- ggplot(as(alignments, "GRanges")) +
stat_coverage(color="gray40", fill="skyblue")
tracks(alignmentPlot, coveragePlot,
xlab="Reads overlapping rs9536314 for NA12893")
And also display the ideogram for the corresponding location on the chromosome:
ideogramPlot <- plotIdeogram(genome="hg19", subchr="chr13")
ideogramPlot + xlim(as(alignments, "GRanges"))
sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.3 LTS
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] ggbio_1.18.0
## [2] ggplot2_1.0.1
## [3] org.Hs.eg.db_3.2.3
## [4] RSQLite_1.0.0
## [5] DBI_0.3.1
## [6] BSgenome.Hsapiens.UCSC.hg19_1.4.0
## [7] BSgenome_1.38.0
## [8] rtracklayer_1.30.0
## [9] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
## [10] GenomicFeatures_1.22.0
## [11] AnnotationDbi_1.32.0
## [12] GoogleGenomics_1.2.0
## [13] VariantAnnotation_1.16.0
## [14] GenomicAlignments_1.6.0
## [15] Rsamtools_1.22.0
## [16] Biostrings_2.38.0
## [17] XVector_0.10.0
## [18] SummarizedExperiment_1.0.0
## [19] Biobase_2.30.0
## [20] GenomicRanges_1.22.0
## [21] GenomeInfoDb_1.6.0
## [22] IRanges_2.4.0
## [23] S4Vectors_0.8.0
## [24] BiocGenerics_0.16.0
## [25] knitr_1.11
## [26] BiocStyle_1.8.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.1 biovizBase_1.18.0 lattice_0.20-33
## [4] digest_0.6.8 R6_2.1.1 plyr_1.8.3
## [7] futile.options_1.0.0 acepack_1.3-3.3 evaluate_0.8
## [10] BiocInstaller_1.20.0 httr_1.0.0 zlibbioc_1.16.0
## [13] curl_0.9.3 rpart_4.1-10 rmarkdown_0.8.1
## [16] labeling_0.3 proto_0.3-10 splines_3.2.2
## [19] BiocParallel_1.4.0 stringr_1.0.0 foreign_0.8-66
## [22] RCurl_1.95-4.7 biomaRt_2.26.0 munsell_0.4.2
## [25] htmltools_0.2.6 nnet_7.3-11 gridExtra_2.0.0
## [28] Hmisc_3.17-0 XML_3.98-1.3 reshape_0.8.5
## [31] MASS_7.3-44 bitops_1.0-6 RBGL_1.46.0
## [34] grid_3.2.2 jsonlite_0.9.17 GGally_0.5.0
## [37] gtable_0.1.2 magrittr_1.5 formatR_1.2.1
## [40] scales_0.3.0 graph_1.48.0 stringi_0.5-5
## [43] reshape2_1.4.1 latticeExtra_0.6-26 futile.logger_1.4.1
## [46] Formula_1.2-1 rjson_0.2.15 lambda.r_1.1.7
## [49] RColorBrewer_1.1-2 tools_3.2.2 dichromat_2.0-0
## [52] OrganismDbi_1.12.0 survival_2.38-3 yaml_2.1.13
## [55] colorspace_1.2-6 cluster_2.0.3