The Cancer Genome Atlas (TCGA) Data Portal provides a platform for researchers to search, download, and analyze data sets generated by TCGA. It contains clinical information, genomic characterization data, and high-level sequence analysis of the tumor genomes. The aim is to understand genomics in order to improve cancer care.
The RTCGA package offers an easy interface for downloading and integrating a variety of TCGA data using the patient barcode as a key. This makes data acquisition easier, which supports research and the improvement of patients’ treatment. Furthermore, the RTCGA package transforms TCGA data into a tidy form that is convenient to use with the R statistical package.
More detailed information about this package can be found here: https://github.com/MarcinKosinski/RTCGA.
To get started, install the latest version of RTCGA from Bioconductor:
source("http://bioconductor.org/biocLite.R")
biocLite("RTCGA")
or, for the development version:
if (!require(devtools)) {
  install.packages("devtools")
  library(devtools)
}
biocLite("MarcinKosinski/RTCGA")
If you are using devtools on Windows, make sure Rtools is installed on your computer.
Below is an example of how to use the RTCGA package to download ACC cohort data containing clinical data, mutations data, and rnaseq v2 data. It also shows how to untar those data and read them into a tidy format.
We will download data from one of the newest release dates.
library(RTCGA)
releaseDate <- tail( checkTCGA('Dates'), 2 )[1]
# if the server doesn't respond, set the date manually:
# releaseDate <- "2015-06-01"
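To make the date selection concrete, here is a small sketch of what `tail(x, 2)[1]` does. It uses a hard-coded excerpt of the release dates listed at the end of this document rather than a live call to the server, and the variable names are only illustrative.

```r
# Excerpt of the release dates shown later in this document, oldest first.
dates <- c("2015-04-02", "2015-06-01", "2015-08-21", "2015-11-01")

# tail(dates, 2) keeps the two newest dates; [1] picks the older of those two,
# i.e. the second-newest release.
last_two <- tail(dates, 2)
release_date <- last_two[1]
print(release_date)
```

Picking the second-newest date rather than the newest one is the choice made in the code above; `tail(dates, 1)` would select the newest release instead.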
We will need a folder into which we will download data.
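One way to create that folder is sketched below. It assumes the folder is named `data`, to match the `destDir` argument used in the download calls; the variable name `dest_dir` is introduced here for illustration only.

```r
# Assumption: the folder name "data" matches destDir = "data" used for downloads.
dest_dir <- "data"

# Create the folder only if it is not there yet, so re-running is harmless.
if (!dir.exists(dest_dir)) {
  dir.create(dest_dir)
}
```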
Let us download clinical data. Simply use this command:
downloadTCGA( cancerTypes = "ACC", destDir = "data", date = releaseDate )
Let us download rnaseq v2 data. Simply use this command:
downloadTCGA( cancerTypes = "ACC", destDir = "data", date = releaseDate,
dataSet = "rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data.Level" )
# one can check all available dataSets' names with
# checkTCGA('DataSets')
Let us download gene mutations data. Simply use this command:
downloadTCGA( cancerTypes = "ACC", destDir = "data", date = releaseDate,
dataSet = "Mutation_Packager_Calls.Level" )
### untarFile and removeTar parameters
By default, the untarFile and removeTar parameters are set to TRUE, which means that after a desired file is downloaded it is untarred and the no-longer-needed *.tar.gz file is removed. If downloadTCGA() was called with those parameters set to FALSE, the files can be untarred and removed manually as shown below.
### Untarring data
Let us use the untar() function to untar all downloaded sets.
library(magrittr) # provides %>%, in case the pipe is not already available
list.files( "data/") %>%
  file.path( "data", .) %>%
  sapply( untar, exdir = "data/" )
### tar.gz files
After datasets are untarred, the tar.gz files are no longer needed and can be deleted.
list.files( "data/") %>%
file.path( "data", .) %>%
grep( pattern = "tar.gz", x = ., value = TRUE) %>%
sapply( file.remove )
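Note that the pattern `"tar.gz"` above is a regular expression in which the dot matches any character. A slightly stricter variant, sketched here in base R on a throwaway directory with made-up file names, anchors the pattern so that only names ending in `.tar.gz` are removed.

```r
# Work in a fresh temporary directory so nothing real is deleted.
demo_dir <- tempfile("rtcga_demo_")
dir.create(demo_dir)

# Hypothetical file names: two archives and one file that merely contains "tar_gz".
file.create(file.path(demo_dir, c("a.tar.gz", "b.tar.gz", "notes_tar_gz.txt")))

# "\\.tar\\.gz$" matches a literal ".tar.gz" at the end of the file name only.
archives <- list.files(demo_dir, pattern = "\\.tar\\.gz$", full.names = TRUE)
removed <- file.remove(archives)

# Only the non-archive file should be left.
remaining <- list.files(demo_dir)
print(remaining)
```

The unanchored pattern would also have matched `notes_tar_gz.txt`, which is why the anchored form is a little safer when other files share the folder.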
Because the path to the rnaseq data is longer than 256 characters, we need to shorten that directory name so that R can detect the existence of this file.
list.files( "data/") %>%
file.path( "data", .) %>%
grep("rnaseq", x = ., value = TRUE) %>%
file.rename( to = substr(.,start=1,stop=50))
[1] TRUE
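The shortening step relies on `substr()` keeping only the first 50 characters of each path. The sketch below shows that effect on a made-up path; the name is hypothetical and far simpler than a real download folder.

```r
# Hypothetical over-long directory path (not a real TCGA folder name).
long_name <- paste0("data/gdac_example_rnaseq_", strrep("x", 300))

# Keep only the first 50 characters, as in the file.rename() call above.
short_name <- substr(long_name, start = 1, stop = 50)

print(nchar(long_name))
print(nchar(short_name))
```

One caveat of this design: if two directories shared the same first 50 characters, truncation would map them to the same name, so it is worth checking the result when more than one rnaseq folder is present.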
All downloaded clinical datasets for all cohorts are available in the RTCGA.clinical package. The process is described here: http://mi2-warsaw.github.io/RTCGA.data/. Clinical data format is explained here. Below is a single example of how to read the clinical data for the ACC cohort downloaded above.
list.files("data/") %>%
  grep("Clinical", x = ., value = TRUE) %>%
  file.path("data", .) -> folder
folder %>%
  list.files() %>%
  grep("clin.merged", x = ., value = TRUE) %>%
  file.path(folder, .) %>%
  readTCGA(path = ., "clinical") -> ACC.clinical
dim(ACC.clinical)
[1] 92 1125
All downloaded rnaseq datasets for all cohorts are available in the RTCGA.rnaseq package. The process is described here: http://mi2-warsaw.github.io/RTCGA.data/. rnaseq data format is explained here. Below is a single example of how to read the rnaseq data for the ACC cohort downloaded above.
list.files("data/") %>%
  grep("rnaseq", x = ., value = TRUE) %>%
  file.path("data", .) -> folder
folder %>%
  list.files() %>%
  grep("illumina", x = ., value = TRUE) %>%
  file.path(folder, .) %>%
  readTCGA(path = ., "rnaseq") -> ACC.rnaseq
dim(ACC.rnaseq)
[1] 79 20532
All downloaded mutations datasets for all cohorts are available in the RTCGA.mutations package. The process is described here: http://mi2-warsaw.github.io/RTCGA.data/. Mutations data format is explained here. Below is a single example of how to read the mutations data for the ACC cohort downloaded above.
list.files("data/") %>%
  grep("Mutation", x = ., value = TRUE) %>%
  file.path("data", .) -> folder
folder %>%
  readTCGA(path = ., "mutations") -> ACC.mutations
dim(ACC.mutations)
[1] 20166 53
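Once several data types are loaded, they can be combined using the patient barcode as the key. The sketch below illustrates the idea on toy data frames only: the column names and values are made up and are not the columns produced by readTCGA(). It relies on the first 12 characters of a TCGA sample barcode identifying the patient.

```r
# Toy clinical table: one row per patient (hypothetical columns and values).
clinical <- data.frame(
  patient_barcode = c("TCGA-OR-A5J1", "TCGA-OR-A5J2", "TCGA-OR-A5J3"),
  vital_status = c("alive", "dead", "alive"),
  stringsAsFactors = FALSE
)

# Toy expression table: one row per sample, keyed by the full sample barcode.
expression <- data.frame(
  sample_barcode = c("TCGA-OR-A5J1-01A-11R-A29S-07",
                     "TCGA-OR-A5J2-01A-11R-A29S-07"),
  gene_x = c(12.5, 3.1),
  stringsAsFactors = FALSE
)

# The first 12 characters of a sample barcode are the patient barcode.
expression$patient_barcode <- substr(expression$sample_barcode, 1, 12)

# Inner join on the patient barcode; patients without expression data drop out.
merged <- merge(clinical, expression, by = "patient_barcode")
print(merged)
```

If the barcodes in two sources differ in letter case, normalizing both with `toupper()` before merging avoids silent mismatches.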
# library(devtools)
# install_github('Rapporter/pander')
if( require(pander) ){
infoTCGA() %>%
pandoc.table()
}
| | Cohort | BCR | Clinical | CN | LowP | Methylation | mRNA |
---|---|---|---|---|---|---|---|
ACC-counts | ACC | 92 | 92 | 90 | 0 | 80 | 0 |
BLCA-counts | BLCA | 412 | 412 | 410 | 112 | 412 | 0 |
BRCA-counts | BRCA | 1098 | 1098 | 1089 | 19 | 1097 | 526 |
CESC-counts | CESC | 307 | 307 | 295 | 50 | 307 | 0 |
CHOL-counts | CHOL | 51 | 36 | 36 | 0 | 36 | 0 |
COAD-counts | COAD | 460 | 458 | 451 | 69 | 457 | 153 |
COADREAD-counts | COADREAD | 631 | 629 | 616 | 104 | 622 | 222 |
DLBC-counts | DLBC | 58 | 48 | 48 | 0 | 48 | 0 |
ESCA-counts | ESCA | 185 | 185 | 184 | 51 | 185 | 0 |
FPPP-counts | FPPP | 38 | 38 | 0 | 0 | 0 | 0 |
GBM-counts | GBM | 613 | 595 | 577 | 0 | 420 | 540 |
GBMLGG-counts | GBMLGG | 1129 | 1110 | 1090 | 52 | 936 | 567 |
HNSC-counts | HNSC | 528 | 528 | 522 | 108 | 528 | 0 |
KICH-counts | KICH | 113 | 113 | 66 | 0 | 66 | 0 |
KIPAN-counts | KIPAN | 973 | 941 | 883 | 0 | 892 | 88 |
KIRC-counts | KIRC | 537 | 537 | 528 | 0 | 535 | 72 |
KIRP-counts | KIRP | 323 | 291 | 289 | 0 | 291 | 16 |
LAML-counts | LAML | 200 | 200 | 197 | 0 | 194 | 0 |
LGG-counts | LGG | 516 | 515 | 513 | 52 | 516 | 27 |
LIHC-counts | LIHC | 377 | 377 | 370 | 0 | 377 | 0 |
LUAD-counts | LUAD | 585 | 522 | 516 | 120 | 578 | 32 |
LUSC-counts | LUSC | 504 | 504 | 501 | 0 | 503 | 154 |
MESO-counts | MESO | 87 | 87 | 87 | 0 | 87 | 0 |
OV-counts | OV | 602 | 591 | 586 | 0 | 594 | 574 |
PAAD-counts | PAAD | 185 | 185 | 184 | 0 | 184 | 0 |
PCPG-counts | PCPG | 179 | 179 | 175 | 0 | 179 | 0 |
PRAD-counts | PRAD | 499 | 499 | 492 | 115 | 498 | 0 |
READ-counts | READ | 171 | 171 | 165 | 35 | 165 | 69 |
SARC-counts | SARC | 261 | 261 | 257 | 0 | 261 | 0 |
SKCM-counts | SKCM | 470 | 470 | 469 | 118 | 470 | 0 |
STAD-counts | STAD | 443 | 443 | 442 | 107 | 443 | 0 |
STES-counts | STES | 628 | 628 | 626 | 158 | 628 | 0 |
TGCT-counts | TGCT | 150 | 134 | 150 | 0 | 150 | 0 |
THCA-counts | THCA | 503 | 503 | 499 | 98 | 503 | 0 |
THYM-counts | THYM | 124 | 124 | 123 | 0 | 124 | 0 |
UCEC-counts | UCEC | 560 | 548 | 540 | 106 | 547 | 54 |
UCS-counts | UCS | 57 | 57 | 56 | 0 | 57 | 0 |
UVM-counts | UVM | 80 | 80 | 80 | 51 | 80 | 0 |
| | mRNASeq | miR | miRSeq | RPPA | MAF | rawMAF |
---|---|---|---|---|---|---|
ACC-counts | 79 | 0 | 80 | 46 | 90 | 0 |
BLCA-counts | 408 | 0 | 409 | 344 | 130 | 395 |
BRCA-counts | 1093 | 0 | 1078 | 410 | 977 | 0 |
CESC-counts | 304 | 0 | 307 | 173 | 194 | 0 |
CHOL-counts | 36 | 0 | 36 | 30 | 35 | 0 |
COAD-counts | 457 | 0 | 406 | 360 | 154 | 367 |
COADREAD-counts | 623 | 0 | 549 | 491 | 223 | 489 |
DLBC-counts | 33 | 0 | 47 | 33 | 48 | 0 |
ESCA-counts | 184 | 0 | 184 | 126 | 185 | 0 |
FPPP-counts | 0 | 0 | 23 | 0 | 0 | 0 |
GBM-counts | 0 | 565 | 0 | 238 | 290 | 290 |
GBMLGG-counts | 516 | 565 | 512 | 668 | 576 | 803 |
HNSC-counts | 520 | 0 | 523 | 212 | 279 | 510 |
KICH-counts | 66 | 0 | 66 | 63 | 66 | 66 |
KIPAN-counts | 889 | 0 | 873 | 756 | 644 | 799 |
KIRC-counts | 533 | 0 | 516 | 478 | 417 | 451 |
KIRP-counts | 290 | 0 | 291 | 215 | 161 | 282 |
LAML-counts | 179 | 0 | 188 | 0 | 197 | 0 |
LGG-counts | 516 | 0 | 512 | 430 | 286 | 513 |
LIHC-counts | 371 | 0 | 372 | 63 | 198 | 373 |
LUAD-counts | 515 | 0 | 513 | 365 | 230 | 542 |
LUSC-counts | 501 | 0 | 478 | 328 | 178 | 0 |
MESO-counts | 86 | 0 | 87 | 63 | 0 | 0 |
OV-counts | 304 | 570 | 453 | 426 | 316 | 469 |
PAAD-counts | 178 | 0 | 178 | 123 | 146 | 0 |
PCPG-counts | 179 | 0 | 179 | 80 | 179 | 0 |
PRAD-counts | 497 | 0 | 494 | 352 | 332 | 0 |
READ-counts | 166 | 0 | 143 | 131 | 69 | 122 |
SARC-counts | 259 | 0 | 259 | 223 | 252 | 0 |
SKCM-counts | 468 | 0 | 448 | 204 | 343 | 366 |
STAD-counts | 378 | 0 | 436 | 357 | 289 | 0 |
STES-counts | 562 | 0 | 620 | 483 | 474 | 0 |
TGCT-counts | 150 | 0 | 150 | 118 | 149 | 0 |
THCA-counts | 501 | 0 | 502 | 222 | 402 | 0 |
THYM-counts | 120 | 0 | 124 | 90 | 0 | 0 |
UCEC-counts | 545 | 0 | 538 | 440 | 248 | 0 |
UCS-counts | 57 | 0 | 56 | 48 | 57 | 0 |
UVM-counts | 80 | 0 | 80 | 12 | 80 | 0 |
(cohorts <- infoTCGA() %>%
rownames() %>%
sub("-counts", "", x=.))
[1] "ACC" "BLCA" "BRCA" "CESC" "CHOL" "COAD" "COADREAD" "DLBC" "ESCA" "FPPP" "GBM" "GBMLGG" "HNSC"
[14] "KICH" "KIPAN" "KIRC" "KIRP" "LAML" "LGG" "LIHC" "LUAD" "LUSC" "MESO" "OV" "PAAD" "PCPG"
[27] "PRAD" "READ" "SARC" "SKCM" "STAD" "STES" "TGCT" "THCA" "THYM" "UCEC" "UCS" "UVM"
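The cohort names above are obtained by stripping the `-counts` suffix from the row names. A minimal stand-alone sketch of that step, using three row names copied from the table and an end-of-string anchor added here as a small safeguard:

```r
# Three row names as they appear in the infoTCGA() table.
row_names <- c("ACC-counts", "BLCA-counts", "COADREAD-counts")

# "-counts$" removes the suffix only when it is at the end of the name.
cohort_names <- sub("-counts$", "", row_names)
print(cohort_names)
```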
checkTCGA('Dates')
[1] "2011-10-26" "2011-11-15" "2011-11-28" "2011-12-06" "2011-12-30" "2012-01-10" "2012-01-24" "2012-02-17" "2012-03-06" "2012-03-21" "2012-04-12"
[12] "2012-04-25" "2012-05-15" "2012-05-25" "2012-06-06" "2012-06-23" "2012-07-07" "2012-07-25" "2012-08-04" "2012-08-25" "2012-09-13" "2012-10-04"
[23] "2012-10-18" "2012-10-20" "2012-10-24" "2012-11-02" "2012-11-14" "2012-12-06" "2012-12-21" "2013-01-16" "2013-02-03" "2013-02-22" "2013-03-09"
[34] "2013-03-26" "2013-04-06" "2013-04-21" "2013-05-08" "2013-05-23" "2013-06-06" "2013-06-23" "2013-07-15" "2013-08-09" "2013-09-23" "2013-10-10"
[45] "2013-11-14" "2013-12-10" "2014-01-15" "2014-02-15" "2014-03-16" "2014-04-16" "2014-05-18" "2014-06-14" "2014-07-15" "2014-09-02" "2014-10-17"
[56] "2014-12-06" "2015-02-02" "2015-02-04" "2015-04-02" "2015-06-01" "2015-08-21" "2015-11-01"
checkTCGA('DataSets', 'ACC', releaseDate) %>%
length()
[1] 2