AACR Project GENIE is a publicly accessible international cancer registry of real-world data assembled through data sharing between 19 of the leading cancer centers in the world.
GENIE aggregates, harmonizes, and links clinical-grade, next-generation cancer genomic sequencing data with clinical outcomes obtained during routine medical practice from cancer patients treated at these institutions, although the available clinical information is limited (fewer than about 10 variables).
The goal of Project GENIE BPC (Biopharma Collaborative) is to augment the existing registry genomic data with enhanced clinical (phenomic) data to support clinical-genomics analyses.
GENIE BPC sponsors six projects that involve augmenting the genomic data in Project GENIE to include PRISSMM phenomic data and cancer-related outcomes. Each project encompasses a single cancer site. While all projects are based on curating data under the PRISSMM phenomic data curation model, the variables available for each project may vary. The six projects are:
GENIE BPC comprises several phenomic (clinical) elements that include data from the AACR Project GENIE Tier 1 registry, institutions’ tumor registries, manual curation based on institutions’ electronic health records (PRISSMM phenomic data model), and derived variables based on these data elements. The elements are:
This vignette will describe the structure of each element (dataset).
The Patient Characteristics dataset is structured as one record per patient. This dataset includes variables such as date of death, last known alive date, and the number of pathology reports, imaging reports, and medical oncologist assessments, along with many other demographic and vital status variables.
The Patient Characteristics dataset can be linked to all other datasets using `cohort` and `record_id`.
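As a rough illustration of what linking on `cohort` and `record_id` means, here is a minimal Python sketch using toy records rather than the package's own functions. The `vital_status` and `scan` columns, the `link()` helper, and all values are hypothetical; only the two key names come from the text above.

```python
# Toy records; every value and every non-key column name is hypothetical.
patients = [
    {"cohort": "PANC", "record_id": "P-001", "vital_status": "alive"},
    {"cohort": "PANC", "record_id": "P-002", "vital_status": "deceased"},
]
imaging = [
    {"cohort": "PANC", "record_id": "P-001", "scan": "CT"},
    {"cohort": "PANC", "record_id": "P-001", "scan": "MRI"},
    {"cohort": "PANC", "record_id": "P-002", "scan": "CT"},
]


def link(left, right, keys):
    """Left-join two lists of dict records on the given key columns."""
    index = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    joined = []
    for row in left:
        # Rows without a match are kept unchanged (left join).
        matches = index.get(tuple(row[k] for k in keys), [{}])
        for match in matches:
            joined.append({**row, **match})
    return joined


# Attach the one-record-per-patient data to each imaging report.
linked = link(imaging, patients, keys=("cohort", "record_id"))
```

Because the Patient Characteristics dataset has one record per patient, each imaging row picks up exactly one set of patient-level values.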
Two Cancer Diagnosis datasets are provided: the BPC Project Cancer Diagnosis dataset and the Non-BPC Project Cancer Diagnosis dataset.
BPC Project Cancer | Non-BPC Project Cancer |
---|---|
The cancer that met the eligibility criteria for the project and was selected at random for PRISSMM phenomic data curation. | A diagnosis of the same or different cancer type as the project cancer that occurs prior to, simultaneous with, or after the BPC Project cancer. |
Has genomic sequencing associated with the cancer diagnosis. | Does not have associated genomic sequencing |
Recorded in the BPC Project Cancer Diagnosis dataset. | Recorded in the Non-BPC Project Cancer Diagnosis dataset. |
The BPC Project Cancer Diagnosis dataset contains one record per BPC Project cancer diagnosis, per patient.
The BPC Project Cancer Diagnosis dataset can be linked to the following datasets:
- Cancer-Directed Regimen dataset using variables `cohort`, `record_id` and `ca_seq`
- Cancer Panel Test dataset using variables `cohort`, `record_id` and `ca_seq`
- Patient Characteristics, PRISSMM Pathology, PRISSMM Imaging, and PRISSMM Medical Oncologist Assessment datasets using `cohort` and `record_id`
The Non-BPC Project Cancer Diagnosis dataset contains one record per non-BPC Project cancer diagnosis, per patient. This dataset includes two types of cancer diagnoses: 1) non-BPC Project invasive cancer and in situ/non-invasive cancer diagnoses, and 2) other tumors.
The Non-BPC Project Cancer Diagnosis dataset can be linked to the following datasets:
- Cancer-Directed Regimen dataset using variables `cohort`, `record_id` and `ca_seq`
- Patient Characteristics, PRISSMM Pathology, PRISSMM Imaging, and PRISSMM Medical Oncologist Assessment datasets using `cohort` and `record_id`
- Cannot be linked to the Cancer Panel Test dataset because non-BPC Project cancer diagnoses were not genomically sequenced
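The diagnosis-level links above can be sketched in Python with toy records. This is an illustration under assumptions, not the package's implementation: the `dx` and `regimen_number` columns and all values are hypothetical, and the sketch assumes `ca_seq` distinguishes a patient's BPC Project and non-BPC Project diagnoses.

```python
KEYS = ("cohort", "record_id", "ca_seq")

# Toy records; values and the `dx` / `regimen_number` columns are hypothetical.
bpc_dx = [
    {"cohort": "CRC", "record_id": "P-001", "ca_seq": 0, "dx": "project cancer"},
]
non_bpc_dx = [
    {"cohort": "CRC", "record_id": "P-001", "ca_seq": 1, "dx": "other cancer"},
]
regimens = [
    {"cohort": "CRC", "record_id": "P-001", "ca_seq": 0, "regimen_number": 1},
    {"cohort": "CRC", "record_id": "P-001", "ca_seq": 1, "regimen_number": 1},
]


def key_of(row):
    """Build the three-part linking key for a record."""
    return tuple(row[k] for k in KEYS)


# Index both diagnosis datasets by (cohort, record_id, ca_seq).
dx_by_key = {key_of(row): row for row in bpc_dx + non_bpc_dx}

# Annotate each regimen record with the diagnosis it is associated with.
regimens_with_dx = [
    {**reg, "dx": dx_by_key[key_of(reg)]["dx"]} for reg in regimens
]
```

Dropping `ca_seq` from the key would collapse the link to the patient level, which is the link used for the Patient Characteristics and PRISSMM datasets.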
The Cancer-Directed Regimen dataset is structured as one record per regimen-associated cancer diagnosis, per patient.
Cancer-directed regimens were curated for all cancer diagnoses, including both BPC Project and non-BPC Project cancers. A regimen can consist of one drug or up to five drugs given together. Cancer-directed drugs include anti-neoplastic drugs, immunotherapies, targeted therapies, and hormone therapies. A break in treatment of ≥8 weeks was used to indicate the end of a regimen; even if all drugs in the regimen were re-initiated 8+ weeks later, this was considered a new regimen.
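The 8-week rule can be illustrated with a small Python sketch. It is a simplified reading of the rule for illustration only, not the curators' procedure: treatment is represented as hypothetical `(start_day, end_day)` intervals, and drug composition is ignored.

```python
GAP_DAYS = 8 * 7  # a break of 8 weeks or more ends a regimen


def split_into_regimens(intervals):
    """Group (start_day, end_day) treatment intervals into regimens.

    A new regimen starts whenever the gap since the latest end of
    treatment is at least GAP_DAYS.
    """
    regimens = []
    latest_end = None
    for start, end in sorted(intervals):
        if latest_end is not None and start - latest_end < GAP_DAYS:
            regimens[-1].append((start, end))
            latest_end = max(latest_end, end)
        else:
            regimens.append([(start, end)])
            latest_end = end
    return regimens


# Hypothetical treatment courses, in days from an arbitrary reference date.
courses = [(0, 20), (30, 50), (106, 130), (140, 150)]
regimens = split_into_regimens(courses)
```

Here the 56-day break between day 50 and day 106 starts a second regimen, while the 10-day breaks do not.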
If the cancer-directed drug was part of an investigational drug trial, the drug name(s) `drugs_drug_1` through `drugs_drug_5` will be set to “Investigational Drug” and the end date interval will match the start date interval. Identification of investigational drugs varies by institution depending on contractual obligations. For some institutions, all drugs that are part of an investigational trial are required to be masked, even if standard of care. For other institutions, only the investigational drug(s) are required to be masked.
The Cancer-Directed Regimen dataset can be linked to the following datasets:
- BPC Project and non-BPC Project Cancer Diagnosis datasets using the variables `cohort`, `record_id` and `ca_seq`
- Cancer Panel Test dataset using the variables `cohort`, `record_id` and `ca_seq`
- Patient Characteristics, PRISSMM Pathology, PRISSMM Imaging, and PRISSMM Medical Oncologist Assessment datasets using `cohort` and `record_id`
The Cancer-Directed Radiation Therapy dataset is structured as one record per radiation therapy-associated cancer diagnosis, per patient. For example, if radiation therapy is associated with a single cancer diagnosis, there will be one corresponding record in this dataset. If radiation therapy is associated with two cancer diagnoses, then there will be two corresponding records in this dataset: one for the first associated cancer diagnosis and another for the second associated cancer diagnosis. If it is unknown which cancer diagnosis the radiation therapy is associated with, there will still be one record in this dataset.
Radiation therapy was curated beginning within 30 days of the first BPC Project cancer diagnosis. All subsequent radiation therapies were recorded. Additional radiation therapy corresponding to non-BPC Project Cancers may be recorded.
Radiation therapy data is available for the following cohorts: PANC, Prostate, BLADDER.
This dataset can be linked to the following datasets:
- BPC Project Cancer Diagnosis, Non-BPC Project Cancer Diagnosis, Cancer-Directed Regimen, and Cancer Panel Test datasets using the variables `cohort`, `record_id` and `ca_seq`
- Patient Characteristics, PRISSMM Pathology, PRISSMM Imaging, and PRISSMM Medical Oncologist Assessment datasets using `cohort` and `record_id`
The PRISSMM Pathology dataset is structured as one record per pathology report, per patient.
All pathology reports beginning with the month and year of the first BPC Project cancer diagnosis and all subsequent pathology reports are recorded (including pathology reports corresponding to non-BPC Project cancer and subsequent BPC Project cancer diagnoses).
The PRISSMM Pathology dataset can be linked to the following datasets:
- Cancer Panel Test dataset using `cohort`, `record_id`, `ca_seq`, `path_proc_number` and `path_report_number`
- Patient Characteristics, BPC Project and Non-BPC Project Cancer Diagnosis, Cancer-Directed Regimen, PRISSMM Imaging, and PRISSMM Medical Oncologist Assessment datasets using `cohort` and `record_id`
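To show how the five-variable link between pathology reports and cancer panel tests might work, here is a minimal Python sketch with toy records. The `specimen` and `cpt_number` columns and all values are hypothetical; only the five key names come from the text.

```python
PATH_KEYS = (
    "cohort", "record_id", "ca_seq", "path_proc_number", "path_report_number",
)

# Toy records; values and the `specimen` / `cpt_number` columns are hypothetical.
pathology = [
    {"cohort": "BrCa", "record_id": "P-010", "ca_seq": 0,
     "path_proc_number": 1, "path_report_number": 1, "specimen": "biopsy"},
    {"cohort": "BrCa", "record_id": "P-010", "ca_seq": 0,
     "path_proc_number": 2, "path_report_number": 1, "specimen": "resection"},
]
panel_tests = [
    {"cohort": "BrCa", "record_id": "P-010", "ca_seq": 0,
     "path_proc_number": 2, "path_report_number": 1, "cpt_number": 1},
]


def key_of(row):
    """Build the five-part key identifying one pathology report."""
    return tuple(row[k] for k in PATH_KEYS)


reports = {key_of(row): row for row in pathology}

# For each panel test, look up the pathology report of the sequenced specimen.
tests_with_specimen = [
    {**test, "specimen": reports.get(key_of(test), {}).get("specimen")}
    for test in panel_tests
]
```

Using only `cohort` and `record_id` here would match the panel test to both of the patient's pathology reports rather than to the one report for the sequenced specimen.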
The PRISSMM Imaging dataset is structured as one record per imaging report, per patient.
All imaging reports beginning with the month and year of the first BPC Project cancer diagnosis and all subsequent imaging reports are recorded (including imaging reports corresponding to non-BPC Project cancer and subsequent BPC Project cancer diagnoses).
The PRISSMM Imaging dataset can be linked to all datasets using the variable `record_id`.
The PRISSMM Medical Oncologist Assessment dataset is structured as one row per curated medical oncologist assessment, per patient.
Medical oncologist assessments were curated beginning with the month and year of the first BPC Project cancer diagnosis. One medical oncologist assessment per month was curated.
The PRISSMM Medical Oncologist Assessment dataset can be linked to all datasets using the variable `record_id`.
The PRISSMM Tumor Marker dataset is structured as one record per curated tumor marker result, per patient. All serum-based tumor markers that are related to the diagnosis/prognosis of cancer were curated.
Tumor marker data is available for the following cohorts: CRC, BrCa, PANC, Prostate.
Note: variables pertaining to PD-L1, MSI and MMR are recorded in the pathology dataset.
The PRISSMM Tumor Marker dataset can be linked to all datasets using the variable `record_id`.
The Cancer Panel Test dataset is structured as one record per cancer panel test and its associated cancer diagnosis, per patient.
The cancer panel test refers to the multi-gene panels that have been performed through next generation sequencing (NGS) assays. The terms “cancer panel test (CPT)” and “next generation sequencing (NGS)” are used interchangeably.
The Cancer Panel Test dataset can be linked to the following datasets:
- BPC Project Cancer Diagnosis dataset using the variables `cohort`, `record_id` and `ca_seq`
- Cancer-Directed Regimen dataset using the variables `cohort`, `record_id` and `ca_seq`
- PRISSMM Pathology dataset using `cohort`, `record_id`, `ca_seq`, `path_proc_number` and `path_report_number`
Note: this dataset cannot be linked to the non-BPC Project Cancer Diagnosis dataset because non-BPC Project cancer diagnoses were not genomically sequenced.
Please note that pulling genomic GENIE data from Synapse using `pull_data_synapse()` and pulling GENIE data from cBioPortal may result in small differences in the data due to systematic differences in the processing pipelines employed by Synapse and cBioPortal. These differences may include:
- Data formatting - Some data sets (e.g. CNA files) may appear in wide format in Synapse data versus long format in cBioPortal data, or column attributes and names may appear slightly different (e.g. fusions files).
- Default filtering - By default, cBioPortal filters out Silent, Intron, IGR, 3’UTR, 5’UTR, 3’Flank and 5’Flank mutations, except for the promoter mutations of the TERT gene. See cBioPortal documentation for more details. These mutations are retained in Synapse processing pipelines.
- Hugo Symbols - Some genes have more than one accepted Hugo Symbol and may be referred to differently between data sources (e.g. `NSD3` is an alias for `WHSC1L1`). Some tools exist to help you resolve gene aliases across genomic data sources. See `gnomeR::recode_alias()`, `cbioportalR::get_alias()` and vignettes from the {gnomeR} and {cbioportalR} packages for more information on how to use these functions and work with gene aliases.
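The filtering and alias differences described above can be sketched in Python on toy mutation records. This is an illustration under assumptions, not either pipeline's actual logic: it assumes MAF-style `Hugo_Symbol` and `Variant_Classification` columns, assumes TERT promoter mutations are annotated as `5'Flank`, and picks `WHSC1L1` as the harmonized symbol arbitrarily. The `harmonize()` helper and the alias table are hypothetical.

```python
# Classifications described above as filtered out by cBioPortal by default.
EXCLUDED = {"Silent", "Intron", "IGR", "3'UTR", "5'UTR", "3'Flank", "5'Flank"}

# Hypothetical one-entry alias table; the direction of the mapping is arbitrary.
ALIASES = {"NSD3": "WHSC1L1"}


def harmonize(mutations):
    """Apply cBioPortal-like default filtering and recode gene aliases."""
    kept = []
    for mut in mutations:
        symbol = mut["Hugo_Symbol"]
        variant_class = mut["Variant_Classification"]
        # Assumption: TERT promoter mutations carry the 5'Flank classification.
        is_tert_promoter = symbol == "TERT" and variant_class == "5'Flank"
        if variant_class in EXCLUDED and not is_tert_promoter:
            continue
        kept.append({**mut, "Hugo_Symbol": ALIASES.get(symbol, symbol)})
    return kept


# Toy records standing in for unfiltered, Synapse-style mutation data.
synapse_like = [
    {"Hugo_Symbol": "KRAS", "Variant_Classification": "Missense_Mutation"},
    {"Hugo_Symbol": "TP53", "Variant_Classification": "Silent"},
    {"Hugo_Symbol": "TERT", "Variant_Classification": "5'Flank"},
    {"Hugo_Symbol": "NSD3", "Variant_Classification": "Missense_Mutation"},
]
harmonized = harmonize(synapse_like)
```

Applying the same filter and alias table to data from both sources before comparing them removes two of the systematic differences listed above; format differences such as wide versus long CNA files still need separate handling.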