1 Introduction

The Cancer Genome Atlas (TCGA) Data Portal provides a platform for researchers to search, download, and analyze data sets generated by TCGA. It contains clinical information, genomic characterization data, and high level sequence analysis of the tumor genomes. The key is to understand genomics to improve cancer care.

The RTCGA package offers an easy interface for downloading and integrating a variety of TCGA data using the patient barcode as a key. This makes data acquisition easier, which in turn facilitates research and the improvement of patients’ treatment. Furthermore, the RTCGA package transforms TCGA data into a tidy form that is convenient to use with the R statistical environment.

2 RTCGA package

More detailed information about this package can be found at https://github.com/MarcinKosinski/RTCGA.

2.1 Installation of the RTCGA package

To get started, install the latest version of RTCGA from Bioconductor:

source("http://bioconductor.org/biocLite.R")
biocLite("RTCGA")

or, for the development version, use:

if (!require(devtools)) {
    install.packages("devtools")
    library(devtools)
}
biocLite("MarcinKosinski/RTCGA")

If you use devtools on Windows, make sure that Rtools is installed on your computer.

3 Light data management and manipulations

Below is an example of how to use the RTCGA package to download ACC cohort data comprising clinical data, mutations data, and rnaseq v2 data. It also shows how to unzip those data and read them into a tidy format.

3.1 Adrenal Cortex Cancer (Adrenocortical carcinoma - ACC) data downloading

We will download data from one of the most recent release dates.

library(RTCGA)
releaseDate <- tail( checkTCGA('Dates'), 2 )[1]
# if the server doesn't respond, set the date by hand, e.g.
# releaseDate <- "2015-06-01"
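
As a small illustration of the indexing used above: tail(x, 2) keeps the last two elements and [1] then selects the first of them, that is, the second most recent date. The sketch below uses a hard-coded vector (the last four dates listed in the "Dates of release" section) in place of a live checkTCGA('Dates') call, so it runs without a server connection.

```r
# Stand-in for checkTCGA('Dates'): the last four release dates listed in section 5.3
dates <- c("2015-04-02", "2015-06-01", "2015-08-21", "2015-11-01")

# tail(dates, 2) keeps the two newest entries; [1] picks the older of the two
releaseDate <- tail(dates, 2)[1]
releaseDate
```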

We will need a folder into which we will download data.
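
One way to create that folder, as a minimal sketch: the name "data" matches the destDir argument used in the download commands of this section, and dir.exists() avoids a warning when the folder is already there. The variable name destDir is only introduced here for illustration.

```r
# Create the download folder used as destDir in the download commands,
# unless it already exists
destDir <- "data"
if (!dir.exists(destDir)) {
    dir.create(destDir)
}
```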

3.1.1 Clinical data

Let us download the clinical data. Simply use this command:

downloadTCGA( cancerTypes = "ACC", destDir = "data", date = releaseDate )

3.1.2 Rnaseq v2 data

Let us download the rnaseq v2 data. Simply use this command:

downloadTCGA( cancerTypes = "ACC", destDir = "data", date = releaseDate,
              dataSet = "rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data.Level" )
# one can check all available dataSets' names with
# checkTCGA('DataSets')

3.1.3 Mutations data

Let us download the gene mutations data. Simply use this command:

downloadTCGA( cancerTypes = "ACC", destDir = "data", date = releaseDate,
              dataSet = "Mutation_Packager_Calls.Level" )

3.2 `untarFile` and `removeTar` parameters

By default, the untarFile and removeTar parameters are set to TRUE, which means that after a requested file is downloaded it is untarred and the no longer needed *.tar.gz file is removed. If downloadTCGA() was called with those parameters set to FALSE, the files can be untarred and then removed by hand, as described in the steps that follow.

Untarring data

Let us use the untar() function to untar all downloaded sets.

list.files( "data/") %>% 
   file.path( "data", .) %>%
   sapply( untar, exdir = "data/" )

3.2.1 Removing no longer needed `tar.gz` files

After the datasets are untarred, the tar.gz files are no longer needed and can be deleted.

list.files( "data/") %>% 
   file.path( "data", .) %>%
   grep( pattern = "tar.gz", x = ., value = TRUE) %>%
   sapply( file.remove )
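
The two manual steps above can be tried safely on a throwaway archive before touching real downloads. The sketch below is an illustration only: the folder and file names are made up, and it uses base R (tar(), untar(), file.remove()) without the pipe. It also anchors the pattern as \\.tar\\.gz$ so that only names ending in .tar.gz are matched.

```r
# Work in a scratch folder so that nothing in "data/" is touched
demoDir <- file.path(tempdir(), "untar_demo")     # hypothetical folder name
dir.create(demoDir, showWarnings = FALSE)
oldWd <- setwd(demoDir)

# Build a small .tar.gz archive to stand in for a downloaded file
writeLines("barcode\tvalue", "sample.txt")        # hypothetical content
tar("sample.tar.gz", files = "sample.txt", compression = "gzip")
file.remove("sample.txt")

# Step 1: untar every archive found in the folder
archives <- list.files(pattern = "\\.tar\\.gz$")
for (archive in archives) {
    untar(archive, exdir = ".")
}

# Step 2: delete the archives once their content has been extracted
removed <- file.remove(archives)

# Record the outcome, then return to the original working directory
extracted <- file.exists("sample.txt")
leftover  <- list.files(pattern = "\\.tar\\.gz$")
setwd(oldWd)
```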

3.3 Shortening directories of downloaded files

Because the path to the rnaseq data is longer than 256 characters, we need to shorten the directory name so that R can detect the existence of this file.

list.files( "data/") %>% 
   file.path( "data", .) %>%
   grep("rnaseq", x = ., value = TRUE) %>%    
   file.rename( to = substr(.,start=1,stop=50))

[1] TRUE
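
Truncating names with substr() is simple, but two different long names could collapse to the same short name. The sketch below uses made-up directory names (not real TCGA folder names) to show the truncation and a check for such collisions before any renaming is done.

```r
# Hypothetical long directory names sharing a common "data/" prefix
longNames <- file.path("data", c(
    paste0("cohortA_", strrep("x", 80)),
    paste0("cohortB_", strrep("x", 80))
))

# Keep only the first 50 characters, as in the file.rename() call above
shortNames <- substr(longNames, start = 1, stop = 50)

# Renaming is only safe if the shortened names are still unique
safeToRename <- !any(duplicated(shortNames))
nchar(shortNames)
```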

4 Reading TCGA data to the tidy format

4.1 Clinical data

All downloaded clinical datasets for all cohorts are available in the RTCGA.clinical package. The process is described here: http://mi2-warsaw.github.io/RTCGA.data/. Clinical data format is explained here. Below is a single example of how to read the clinical data for the ACC cohort downloaded above.

list.files("data/") %>%
    grep("Clinical", x = ., value = TRUE) %>%
    file.path("data", .)  -> folder

folder %>%
    list.files() %>%
    grep("clin.merged", x = ., value=TRUE) %>%
    file.path(folder, .) %>%
    readTCGA(path = ., "clinical") -> ACC.clinical

dim(ACC.clinical)

[1]   92 1125
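
The list.files / grep / file.path chain used above can also be expressed with the pattern and full.names arguments of list.files(). The following base R sketch shows this on a scratch folder; the folder and file names are placeholders chosen for the example and only mimic the "Clinical" and "clin.merged" fragments matched above.

```r
# Build a scratch layout: one "Clinical" folder holding two files
root <- file.path(tempdir(), "read_demo")              # hypothetical location
clinDir <- file.path(root, "demo_Clinical_folder")     # hypothetical name
dir.create(clinDir, recursive = TRUE, showWarnings = FALSE)
file.create(file.path(clinDir, c("demo.clin.merged.txt", "MANIFEST.txt")))

# Folder whose name contains "Clinical", returned with its full path
folder <- list.files(root, pattern = "Clinical", full.names = TRUE)

# File inside that folder whose name contains "clin.merged"
clinFile <- list.files(folder, pattern = "clin\\.merged", full.names = TRUE)
basename(clinFile)
```

A path found this way could then be handed to readTCGA() in place of the piped expression.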

4.2 Rnaseq v2 data

All downloaded rnaseq datasets for all cohorts are available in the RTCGA.rnaseq package. The process is described here: http://mi2-warsaw.github.io/RTCGA.data/. The rnaseq data format is explained here. Below is a single example of how to read the rnaseq data for the ACC cohort downloaded above.

list.files("data/") %>%
    grep("rnaseq", x = ., value = TRUE) %>%
    file.path("data", .) -> folder

folder %>%
    list.files() %>%
    grep("illumina", x = ., value=TRUE) %>%
    file.path(folder, .) %>%
    readTCGA(path = ., "rnaseq") -> ACC.rnaseq

dim(ACC.rnaseq)

[1]    79 20532

4.3 Mutations data

All downloaded mutations datasets for all cohorts are available in the RTCGA.mutations package. The process is described here: http://mi2-warsaw.github.io/RTCGA.data/. Mutations data format is explained here. Below is a single example of how to read the mutations data for the ACC cohort downloaded above.

list.files("data/") %>%
    grep("Mutation", x = ., value = TRUE) %>%
    file.path("data", .) -> folder

folder %>% 
    readTCGA(path = ., "mutations") -> ACC.mutations

dim(ACC.mutations)

[1] 20166    53

5 Information about TCGA project datasets

5.1 Codes and counts for each cohort

# library(devtools)
# install_github('Rapporter/pander')
if( require(pander) ){
infoTCGA() %>%
    pandoc.table()
}

Table continues below
	Cohort	BCR	Clinical	CN	LowP	Methylation	mRNA
ACC-counts	ACC	92	92	90	0	80	0
BLCA-counts	BLCA	412	412	410	112	412	0
BRCA-counts	BRCA	1098	1098	1089	19	1097	526
CESC-counts	CESC	307	307	295	50	307	0
CHOL-counts	CHOL	51	36	36	0	36	0
COAD-counts	COAD	460	458	451	69	457	153
COADREAD-counts	COADREAD	631	629	616	104	622	222
DLBC-counts	DLBC	58	48	48	0	48	0
ESCA-counts	ESCA	185	185	184	51	185	0
FPPP-counts	FPPP	38	38	0	0	0	0
GBM-counts	GBM	613	595	577	0	420	540
GBMLGG-counts	GBMLGG	1129	1110	1090	52	936	567
HNSC-counts	HNSC	528	528	522	108	528	0
KICH-counts	KICH	113	113	66	0	66	0
KIPAN-counts	KIPAN	973	941	883	0	892	88
KIRC-counts	KIRC	537	537	528	0	535	72
KIRP-counts	KIRP	323	291	289	0	291	16
LAML-counts	LAML	200	200	197	0	194	0
LGG-counts	LGG	516	515	513	52	516	27
LIHC-counts	LIHC	377	377	370	0	377	0
LUAD-counts	LUAD	585	522	516	120	578	32
LUSC-counts	LUSC	504	504	501	0	503	154
MESO-counts	MESO	87	87	87	0	87	0
OV-counts	OV	602	591	586	0	594	574
PAAD-counts	PAAD	185	185	184	0	184	0
PCPG-counts	PCPG	179	179	175	0	179	0
PRAD-counts	PRAD	499	499	492	115	498	0
READ-counts	READ	171	171	165	35	165	69
SARC-counts	SARC	261	261	257	0	261	0
SKCM-counts	SKCM	470	470	469	118	470	0
STAD-counts	STAD	443	443	442	107	443	0
STES-counts	STES	628	628	626	158	628	0
TGCT-counts	TGCT	150	134	150	0	150	0
THCA-counts	THCA	503	503	499	98	503	0
THYM-counts	THYM	124	124	123	0	124	0
UCEC-counts	UCEC	560	548	540	106	547	54
UCS-counts	UCS	57	57	56	0	57	0
UVM-counts	UVM	80	80	80	51	80	0

	mRNASeq	miR	miRSeq	RPPA	MAF	rawMAF
ACC-counts	79	0	80	46	90	0
BLCA-counts	408	0	409	344	130	395
BRCA-counts	1093	0	1078	410	977	0
CESC-counts	304	0	307	173	194	0
CHOL-counts	36	0	36	30	35	0
COAD-counts	457	0	406	360	154	367
COADREAD-counts	623	0	549	491	223	489
DLBC-counts	33	0	47	33	48	0
ESCA-counts	184	0	184	126	185	0
FPPP-counts	0	0	23	0	0	0
GBM-counts	0	565	0	238	290	290
GBMLGG-counts	516	565	512	668	576	803
HNSC-counts	520	0	523	212	279	510
KICH-counts	66	0	66	63	66	66
KIPAN-counts	889	0	873	756	644	799
KIRC-counts	533	0	516	478	417	451
KIRP-counts	290	0	291	215	161	282
LAML-counts	179	0	188	0	197	0
LGG-counts	516	0	512	430	286	513
LIHC-counts	371	0	372	63	198	373
LUAD-counts	515	0	513	365	230	542
LUSC-counts	501	0	478	328	178	0
MESO-counts	86	0	87	63	0	0
OV-counts	304	570	453	426	316	469
PAAD-counts	178	0	178	123	146	0
PCPG-counts	179	0	179	80	179	0
PRAD-counts	497	0	494	352	332	0
READ-counts	166	0	143	131	69	122
SARC-counts	259	0	259	223	252	0
SKCM-counts	468	0	448	204	343	366
STAD-counts	378	0	436	357	289	0
STES-counts	562	0	620	483	474	0
TGCT-counts	150	0	150	118	149	0
THCA-counts	501	0	502	222	402	0
THYM-counts	120	0	124	90	0	0
UCEC-counts	545	0	538	440	248	0
UCS-counts	57	0	56	48	57	0
UVM-counts	80	0	80	12	80	0
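
Individual counts can be looked up by row and column name. The sketch below rebuilds two rows of the table by hand (values copied from the ACC and BRCA rows above) rather than calling infoTCGA(), so the exact column types returned by the package are an assumption here.

```r
# Two rows copied from the table above; row names follow the "<cohort>-counts" pattern
info <- data.frame(
    Cohort   = c("ACC", "BRCA"),
    Clinical = c(92, 1098),
    mRNASeq  = c(79, 1093),
    row.names = c("ACC-counts", "BRCA-counts"),
    stringsAsFactors = FALSE
)

# Number of ACC patients with clinical data and with mRNASeq data
accClinical <- info["ACC-counts", "Clinical"]
accRnaseq   <- info["ACC-counts", "mRNASeq"]
c(clinical = accClinical, rnaseq = accRnaseq)
```

These two numbers agree with the row counts of the clinical and rnaseq tables read in section 4 (92 and 79).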

5.2 Available cohorts names

(cohorts <- infoTCGA() %>% 
   rownames() %>% 
   sub("-counts", "", x=.))

 [1] "ACC"      "BLCA"     "BRCA"     "CESC"     "CHOL"     "COAD"     "COADREAD" "DLBC"     "ESCA"     "FPPP"     "GBM"      "GBMLGG"   "HNSC"    
[14] "KICH"     "KIPAN"    "KIRC"     "KIRP"     "LAML"     "LGG"      "LIHC"     "LUAD"     "LUSC"     "MESO"     "OV"       "PAAD"     "PCPG"    
[27] "PRAD"     "READ"     "SARC"     "SKCM"     "STAD"     "STES"     "TGCT"     "THCA"     "THYM"     "UCEC"     "UCS"      "UVM"
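
The sub() call above strips the "-counts" suffix from each row name. A minimal stand-alone illustration, using three of the row names from the table in section 5.1 instead of a live infoTCGA() call:

```r
# Three row names taken from the table in section 5.1
rowNames <- c("ACC-counts", "BLCA-counts", "COADREAD-counts")

# Anchoring the pattern with $ removes the suffix only at the end of the name
cohorts <- sub("-counts$", "", rowNames)
cohorts
```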

5.3 Dates of release

checkTCGA('Dates')

 [1] "2011-10-26" "2011-11-15" "2011-11-28" "2011-12-06" "2011-12-30" "2012-01-10" "2012-01-24" "2012-02-17" "2012-03-06" "2012-03-21" "2012-04-12"
[12] "2012-04-25" "2012-05-15" "2012-05-25" "2012-06-06" "2012-06-23" "2012-07-07" "2012-07-25" "2012-08-04" "2012-08-25" "2012-09-13" "2012-10-04"
[23] "2012-10-18" "2012-10-20" "2012-10-24" "2012-11-02" "2012-11-14" "2012-12-06" "2012-12-21" "2013-01-16" "2013-02-03" "2013-02-22" "2013-03-09"
[34] "2013-03-26" "2013-04-06" "2013-04-21" "2013-05-08" "2013-05-23" "2013-06-06" "2013-06-23" "2013-07-15" "2013-08-09" "2013-09-23" "2013-10-10"
[45] "2013-11-14" "2013-12-10" "2014-01-15" "2014-02-15" "2014-03-16" "2014-04-16" "2014-05-18" "2014-06-14" "2014-07-15" "2014-09-02" "2014-10-17"
[56] "2014-12-06" "2015-02-02" "2015-02-04" "2015-04-02" "2015-06-01" "2015-08-21" "2015-11-01"
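
Because the release dates are printed as ISO-formatted strings, they can be converted with as.Date() and filtered like any other date vector. A sketch, using the last eight dates from the output above as a stand-in for checkTCGA('Dates'):

```r
# Last eight release dates copied from the output above
dates <- as.Date(c("2014-10-17", "2014-12-06", "2015-02-02", "2015-02-04",
                   "2015-04-02", "2015-06-01", "2015-08-21", "2015-11-01"))

# Releases published in 2015
releases2015 <- dates[format(dates, "%Y") == "2015"]

# Most recent release in the vector
newest <- max(dates)

length(releases2015)
format(newest)
```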

5.4 Names of available DataSets

checkTCGA('DataSets', 'ACC', releaseDate) %>%
    length()

[1] 2
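
Note that length() applied to a data frame counts columns, not rows. Assuming checkTCGA('DataSets', ...) returns a data frame (an assumption here, not something the output above confirms), nrow() would give the number of available data sets instead. The column names, sizes, and the third data set name in this sketch are invented for illustration; the first two data set names are the ones used in section 3.1.

```r
# Hypothetical stand-in for the object returned by checkTCGA('DataSets', ...)
dataSets <- data.frame(
    Size = c("1.0M", "2.0M", "3.0M"),                  # invented values
    Name = c("Mutation_Packager_Calls.Level",
             "rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data.Level",
             "hypothetical_data_set"),                 # invented name
    stringsAsFactors = FALSE
)

nColumns  <- length(dataSets)   # number of columns
nDataSets <- nrow(dataSets)     # number of data sets (rows)
c(columns = nColumns, dataSets = nDataSets)
```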

RTCGA package tutorial

Marcin Kosinski

2016-01-18

