1 Introduction

The Cancer Genome Atlas (TCGA) Data Portal provides a platform for researchers to search, download, and analyze data sets generated by TCGA. It contains clinical information, genomic characterization data, and high-level sequence analysis of the tumor genomes. The goal is to improve cancer care through a better understanding of genomics.

The RTCGA package offers an easy interface for downloading and integrating a variety of TCGA data, using the patient barcode as a key. This makes data acquisition easier, which supports both research and the improvement of patients’ treatment. Furthermore, the RTCGA package transforms the TCGA data into a tidy form that is convenient to use with the R statistical environment.
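To illustrate what integrating data by patient barcode can look like, here is a minimal base R sketch. The two data frames, their column names, and the barcodes are made up for illustration; the only fact relied on is that a TCGA sample barcode starts with the 12-character patient barcode. This is not how RTCGA does it internally, only the general idea.

```r
# Hypothetical sample-level table; barcodes and values are invented.
expression <- data.frame(
    sample_barcode = c("TCGA-XX-0001-01A-11R-0000-07",
                       "TCGA-XX-0002-01A-11R-0000-07"),
    expr_value = c(10.5, 7.2),
    stringsAsFactors = FALSE
)

# Hypothetical patient-level table keyed by the 12-character patient barcode.
clinical <- data.frame(
    patient_barcode = c("TCGA-XX-0001", "TCGA-XX-0002"),
    vital_status = c("alive", "dead"),
    stringsAsFactors = FALSE
)

# Derive the patient barcode from the sample barcode, then join on it.
expression$patient_barcode <- substr(expression$sample_barcode, 1, 12)
merged <- merge(clinical, expression, by = "patient_barcode")
print(merged)
```

Real tables may store barcodes in a different letter case, in which case toupper() or tolower() would be needed before joining.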

2 RTCGA package

More detailed information about this package can be found at https://github.com/MarcinKosinski/RTCGA.

2.1 Installation of the RTCGA package

To get started, install the latest version of RTCGA from Bioconductor:

source("http://bioconductor.org/biocLite.R")
biocLite("RTCGA")

or, for the development version, use:

if (!require(devtools)) {
    install.packages("devtools")
    library(devtools)
}
biocLite("MarcinKosinski/RTCGA")

If you are using devtools on Windows, make sure you have Rtools installed on your computer.

3 Light data management and manipulations

Below is an example of how to use the RTCGA package to download ACC cohort data containing clinical data, mutations data, and rnaseq v2 data. It also shows how to unzip those data and read them into a tidy format.

3.1 Adrenal Cortex Cancer (Adrenocortical carcinoma - ACC) data downloading

We will download data from one of the newest release dates.

library(RTCGA)
library(magrittr) # provides the %>% pipe used throughout
releaseDate <- tail( checkTCGA('Dates'), 2 )[1]
# if the server doesn't respond, set the date manually:
# releaseDate <- "2015-06-01"
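As a small illustration of what the date selection does, the sketch below applies the same `tail(..., 2)[1]` expression to a hard-coded vector taken from the end of the list in section 5.3. The helper `pick_release_date()` is hypothetical (it is not part of RTCGA) and assumes that a failed request surfaces as an R error.

```r
# Release dates copied from the end of the list in section 5.3.
release_dates <- c("2015-04-02", "2015-06-01", "2015-08-21", "2015-11-01")

# tail(x, 2) keeps the last two entries; [1] takes the first of those,
# i.e. the second-newest date.
second_newest <- tail(release_dates, 2)[1]

# Hypothetical helper: use a fixed fallback date if fetching the dates fails.
pick_release_date <- function(fetch_dates, fallback = "2015-06-01") {
    tryCatch(tail(fetch_dates(), 2)[1], error = function(e) fallback)
}

from_vector  <- pick_release_date(function() release_dates)
from_failure <- pick_release_date(function() stop("server did not respond"))
```

With a live connection one could pass `function() checkTCGA('Dates')` as `fetch_dates`.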

We will need a folder into which we will download data.
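One way to prepare that folder, sketched with base R only. The name "data" matches the `destDir` value used in the download calls; any other writable path would work as well.

```r
# Create the destination folder if it does not exist yet.
dest_dir <- "data"
if (!dir.exists(dest_dir)) {
    dir.create(dest_dir, recursive = TRUE)
}
```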

3.1.1 Clinical data

Let us download clinical data with the following command:

downloadTCGA( cancerTypes = "ACC", destDir = "data", date = releaseDate )

3.1.2 Rnaseq v2 data

Let us download rnaseq v2 data with the following command:

downloadTCGA( cancerTypes = "ACC", destDir = "data", date = releaseDate,
              dataSet = "rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data.Level" )
# one can check all available dataSets' names with
# checkTCGA('DataSets')

3.1.3 Mutations data

Let us download genes’ mutations data with the following command:

downloadTCGA( cancerTypes = "ACC", destDir = "data", date = releaseDate,
              dataSet = "Mutation_Packager_Calls.Level" )

3.2 untarFile and removeTar parameters

By default, the untarFile and removeTar parameters are set to TRUE, which means that after a file is downloaded it is untarred and the no longer needed *.tar.gz file is removed. If downloadTCGA() was called with those parameters set to FALSE, the files can be untarred and removed manually as follows.

3.2.1 Untarring data

Let us use the untar() function to untar all downloaded sets.

list.files( "data/") %>% 
   file.path( "data", .) %>%
   sapply( untar, exdir = "data/" )

3.2.2 Removing no longer needed tar.gz files

After the datasets are untarred, the tar.gz files are no longer needed and can be deleted.

list.files( "data/") %>% 
   file.path( "data", .) %>%
   grep( pattern = "tar.gz", x = ., value = TRUE) %>%
   sapply( file.remove )
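The pattern "tar.gz" is a regular expression, so the dot matches any character and the match can occur anywhere in the name. A stricter variant, sketched here on made-up file names, escapes the dots and anchors the pattern at the end of the name, so that companion files such as checksums are left alone.

```r
# Hypothetical file names, for illustration only.
files <- c("demo_clinical.tar.gz",
           "demo_clinical.tar.gz.md5",
           "demo_clinical")

# Escaped dots and a trailing $ select only names ending in ".tar.gz".
archives <- grep("\\.tar\\.gz$", files, value = TRUE)
print(archives)
```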

3.3 Shortening directories of downloaded files

Because the path to the rnaseq data has more than 256 characters, we need to shorten that directory name so that R can detect the existence of this file.

list.files( "data/") %>% 
   file.path( "data", .) %>%
   grep("rnaseq", x = ., value = TRUE) %>%    
   file.rename( to = substr(.,start=1,stop=50))
[1] TRUE
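The same renaming idea can be tried safely in a temporary directory. This is only a sketch: the directory name is invented, and the cut-off of 20 characters is arbitrary.

```r
# Work inside a fresh temporary folder so nothing real is touched.
base_dir <- tempfile("rtcga_demo_")
dir.create(base_dir)

# A made-up, overly long directory name containing "rnaseq".
long_dir <- file.path(base_dir, paste0("rnaseq_", strrep("x", 60)))
dir.create(long_dir)

# Keep only the first 20 characters of the directory name.
short_dir <- file.path(base_dir, substr(basename(long_dir), 1, 20))
renamed <- file.rename(from = long_dir, to = short_dir)
```

Truncating only `basename()` rather than the whole path keeps the parent folder intact regardless of how long it is.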

4 Reading TCGA data to the tidy format

4.1 Clinical data

All downloaded clinical datasets for all cohorts are available in the RTCGA.clinical package. The process is described here: http://mi2-warsaw.github.io/RTCGA.data/. Clinical data format is explained here. Below is a single example of how to read the clinical data for ACC, the cohort downloaded above.

list.files("data/") %>%
    grep("Clinical", x = ., value = TRUE) %>%
    file.path("data", .)  -> folder

folder %>%
    list.files() %>%
    grep("clin.merged", x = ., value=TRUE) %>%
    file.path(folder, .) %>%
    readTCGA(path = ., "clinical") -> ACC.clinical

dim(ACC.clinical)
[1]   92 1125
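The file lookup in the pipeline above can also be written with a single `list.files()` call. The sketch below demonstrates this on a throwaway directory tree; the folder and file names are placeholders chosen for illustration, not the exact names produced by a download.

```r
# Build a small fake directory tree in a temporary location.
demo_root <- tempfile("tcga_demo_")
clin_dir  <- file.path(demo_root, "demo_Clinical_folder")
dir.create(clin_dir, recursive = TRUE)
created <- file.create(file.path(clin_dir,
                                 c("DEMO.clin.merged.txt", "MANIFEST.txt")))

# Search recursively for the merged clinical file and return its full path.
clinical_file <- list.files(demo_root, pattern = "clin\\.merged",
                            recursive = TRUE, full.names = TRUE)
print(basename(clinical_file))
```

The resulting path could then be passed as the `path` argument of `readTCGA()`.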

4.2 Rnaseq v2 data

All downloaded rnaseq datasets for all cohorts are available in the RTCGA.rnaseq package. The process is described here: http://mi2-warsaw.github.io/RTCGA.data/. rnaseq data format is explained here. Below is a single example of how to read the rnaseq data for ACC, the cohort downloaded above.

list.files("data/") %>%
    grep("rnaseq", x = ., value = TRUE) %>%
    file.path("data", .) -> folder

folder %>%
    list.files() %>%
    grep("illumina", x = ., value=TRUE) %>%
    file.path(folder, .) %>%
    readTCGA(path = ., "rnaseq") -> ACC.rnaseq

dim(ACC.rnaseq)
[1]    79 20532

4.3 Mutations data

All downloaded mutations datasets for all cohorts are available in the RTCGA.mutations package. The process is described here: http://mi2-warsaw.github.io/RTCGA.data/. Mutations data format is explained here. Below is a single example of how to read the mutations data for ACC, the cohort downloaded above.

list.files("data/") %>%
    grep("Mutation", x = ., value = TRUE) %>%
    file.path("data", .) -> folder

folder %>% 
    readTCGA(path = ., "mutations") -> ACC.mutations
dim(ACC.mutations)
[1] 20166    53

5 Information about TCGA project datasets

5.1 Codes and counts for each cohort

# library(devtools)
# install_github('Rapporter/pander')
if( require(pander) ){
infoTCGA() %>%
    pandoc.table()
}
Table continues below
                 Cohort     BCR  Clinical    CN  LowP  Methylation  mRNA
ACC-counts       ACC         92        92    90     0           80     0
BLCA-counts      BLCA       412       412   410   112          412     0
BRCA-counts      BRCA      1098      1098  1089    19         1097   526
CESC-counts      CESC       307       307   295    50          307     0
CHOL-counts      CHOL        51        36    36     0           36     0
COAD-counts      COAD       460       458   451    69          457   153
COADREAD-counts  COADREAD   631       629   616   104          622   222
DLBC-counts      DLBC        58        48    48     0           48     0
ESCA-counts      ESCA       185       185   184    51          185     0
FPPP-counts      FPPP        38        38     0     0            0     0
GBM-counts       GBM        613       595   577     0          420   540
GBMLGG-counts    GBMLGG    1129      1110  1090    52          936   567
HNSC-counts      HNSC       528       528   522   108          528     0
KICH-counts      KICH       113       113    66     0           66     0
KIPAN-counts     KIPAN      973       941   883     0          892    88
KIRC-counts      KIRC       537       537   528     0          535    72
KIRP-counts      KIRP       323       291   289     0          291    16
LAML-counts      LAML       200       200   197     0          194     0
LGG-counts       LGG        516       515   513    52          516    27
LIHC-counts      LIHC       377       377   370     0          377     0
LUAD-counts      LUAD       585       522   516   120          578    32
LUSC-counts      LUSC       504       504   501     0          503   154
MESO-counts      MESO        87        87    87     0           87     0
OV-counts        OV         602       591   586     0          594   574
PAAD-counts      PAAD       185       185   184     0          184     0
PCPG-counts      PCPG       179       179   175     0          179     0
PRAD-counts      PRAD       499       499   492   115          498     0
READ-counts      READ       171       171   165    35          165    69
SARC-counts      SARC       261       261   257     0          261     0
SKCM-counts      SKCM       470       470   469   118          470     0
STAD-counts      STAD       443       443   442   107          443     0
STES-counts      STES       628       628   626   158          628     0
TGCT-counts      TGCT       150       134   150     0          150     0
THCA-counts      THCA       503       503   499    98          503     0
THYM-counts      THYM       124       124   123     0          124     0
UCEC-counts      UCEC       560       548   540   106          547    54
UCS-counts       UCS         57        57    56     0           57     0
UVM-counts       UVM         80        80    80    51           80     0

                 mRNASeq   miR  miRSeq  RPPA   MAF  rawMAF
ACC-counts            79     0      80    46    90       0
BLCA-counts          408     0     409   344   130     395
BRCA-counts         1093     0    1078   410   977       0
CESC-counts          304     0     307   173   194       0
CHOL-counts           36     0      36    30    35       0
COAD-counts          457     0     406   360   154     367
COADREAD-counts      623     0     549   491   223     489
DLBC-counts           33     0      47    33    48       0
ESCA-counts          184     0     184   126   185       0
FPPP-counts            0     0      23     0     0       0
GBM-counts             0   565       0   238   290     290
GBMLGG-counts        516   565     512   668   576     803
HNSC-counts          520     0     523   212   279     510
KICH-counts           66     0      66    63    66      66
KIPAN-counts         889     0     873   756   644     799
KIRC-counts          533     0     516   478   417     451
KIRP-counts          290     0     291   215   161     282
LAML-counts          179     0     188     0   197       0
LGG-counts           516     0     512   430   286     513
LIHC-counts          371     0     372    63   198     373
LUAD-counts          515     0     513   365   230     542
LUSC-counts          501     0     478   328   178       0
MESO-counts           86     0      87    63     0       0
OV-counts            304   570     453   426   316     469
PAAD-counts          178     0     178   123   146       0
PCPG-counts          179     0     179    80   179       0
PRAD-counts          497     0     494   352   332       0
READ-counts          166     0     143   131    69     122
SARC-counts          259     0     259   223   252       0
SKCM-counts          468     0     448   204   343     366
STAD-counts          378     0     436   357   289       0
STES-counts          562     0     620   483   474       0
TGCT-counts          150     0     150   118   149       0
THCA-counts          501     0     502   222   402       0
THYM-counts          120     0     124    90     0       0
UCEC-counts          545     0     538   440   248       0
UCS-counts            57     0      56    48    57       0
UVM-counts            80     0      80    12    80       0
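A table like this can be used to decide which cohorts are worth downloading for a given data type. The sketch below rebuilds four rows of the table by hand (values copied from the output above) and assumes numeric columns; the actual object returned by `infoTCGA()` may store counts as text, in which case `as.numeric()` would be needed first.

```r
# Four rows copied from the table above; column types are an assumption.
counts <- data.frame(
    Cohort  = c("ACC", "BRCA", "GBM", "OV"),
    mRNASeq = c(79, 1093, 0, 304),
    miR     = c(0, 0, 565, 570),
    stringsAsFactors = FALSE
)

# Cohorts with at least one mRNASeq sample.
with_mrnaseq <- counts$Cohort[counts$mRNASeq > 0]

# Cohorts that have both mRNASeq and miR samples.
with_both <- counts$Cohort[counts$mRNASeq > 0 & counts$miR > 0]
```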

5.2 Available cohorts names

(cohorts <- infoTCGA() %>% 
   rownames() %>% 
   sub("-counts", "", x=.))
 [1] "ACC"      "BLCA"     "BRCA"     "CESC"     "CHOL"     "COAD"     "COADREAD" "DLBC"     "ESCA"     "FPPP"     "GBM"      "GBMLGG"   "HNSC"    
[14] "KICH"     "KIPAN"    "KIRC"     "KIRP"     "LAML"     "LGG"      "LIHC"     "LUAD"     "LUSC"     "MESO"     "OV"       "PAAD"     "PCPG"    
[27] "PRAD"     "READ"     "SARC"     "SKCM"     "STAD"     "STES"     "TGCT"     "THCA"     "THYM"     "UCEC"     "UCS"      "UVM"     
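The `sub()` step can be checked in isolation on a short vector of row names taken from the table in section 5.1. Anchoring the pattern with `$`, as below, is an optional tightening so that only a trailing "-counts" is removed.

```r
# Row names as they appear in the counts table.
row_names <- c("ACC-counts", "BLCA-counts", "COADREAD-counts")

# Strip the trailing "-counts" suffix to obtain the cohort codes.
cohort_codes <- sub("-counts$", "", row_names)
print(cohort_codes)
```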

5.3 Dates of release

checkTCGA('Dates')
 [1] "2011-10-26" "2011-11-15" "2011-11-28" "2011-12-06" "2011-12-30" "2012-01-10" "2012-01-24" "2012-02-17" "2012-03-06" "2012-03-21" "2012-04-12"
[12] "2012-04-25" "2012-05-15" "2012-05-25" "2012-06-06" "2012-06-23" "2012-07-07" "2012-07-25" "2012-08-04" "2012-08-25" "2012-09-13" "2012-10-04"
[23] "2012-10-18" "2012-10-20" "2012-10-24" "2012-11-02" "2012-11-14" "2012-12-06" "2012-12-21" "2013-01-16" "2013-02-03" "2013-02-22" "2013-03-09"
[34] "2013-03-26" "2013-04-06" "2013-04-21" "2013-05-08" "2013-05-23" "2013-06-06" "2013-06-23" "2013-07-15" "2013-08-09" "2013-09-23" "2013-10-10"
[45] "2013-11-14" "2013-12-10" "2014-01-15" "2014-02-15" "2014-03-16" "2014-04-16" "2014-05-18" "2014-06-14" "2014-07-15" "2014-09-02" "2014-10-17"
[56] "2014-12-06" "2015-02-02" "2015-02-04" "2015-04-02" "2015-06-01" "2015-08-21" "2015-11-01"
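Because the dates are returned as text in year-month-day form, they convert cleanly with `as.Date()`. The sketch below uses a few values copied from the output above and a hypothetical cutoff to pick the newest release on or before a given day.

```r
# A few release dates copied from the output above.
release_dates <- as.Date(c("2015-02-04", "2015-04-02", "2015-06-01",
                           "2015-08-21", "2015-11-01"))

# Hypothetical cutoff chosen for illustration.
cutoff <- as.Date("2015-07-01")

# Newest release that is not later than the cutoff.
newest_before_cutoff <- format(max(release_dates[release_dates <= cutoff]))
print(newest_before_cutoff)
```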

5.4 Names of available DataSets

checkTCGA('DataSets', 'ACC', releaseDate) %>%
    length()
[1] 2