install.packages("ProgModule")
library(ProgModule)

1 Introduction

Cancer arises from dysregulated cell proliferation caused by acquired mutations in key driver genes. With the rapid accumulation of cancer genomic alteration data, a major goal of cancer genomics is to distinguish tumorigenesis driver mutations from passenger mutations, which may improve our understanding of the complex processes involved in cancer formation and progression and help tailor personalized therapies to a tumor’s mutational pattern. Numerous algorithms have been developed to uncover genomic mutational signatures, but they are generally limited by high computational complexity, high false-positive rates, and impracticality for clinical application. To elucidate the underlying mechanisms of cancer initiation, we reasoned that algorithms which identify mutation-driven modules while accounting for the impact on patient prognosis and balancing mutation coverage and exclusivity may uncover intricate associations between mutations and survival, and provide crucial insights for cancer diagnosis and treatment. This package provides a novel bioinformatics tool, ProgModule, that identifies candidate driver modules for predicting patient prognosis by integrating the exclusive coverage of mutations with clinical characteristics in cancer. The detailed flowchart of this package is shown as follows:

2 Overview of the package

The ProgModule package is a bioinformatics tool that identifies driver modules for predicting the prognosis of cancer patients. It balances the exclusive coverage of mutations while also considering the mutation combination-mediated mechanism in cancer. Its functions fall mainly into two categories, analysis and visualization. The functions and a short description of each are summarized below:
1. Obtain non-silent mutations frequency matrix.
2. Identify cohort-specific local subnetworks.
3. Calculate the prognosis-related mutually exclusive mutation (PRMEM) score of module.
4. Identify the prognosis-related mutually exclusive mutation modules.
5. Visualization results:
5.1 Plot patients’ Kaplan-Meier survival curves based on the mutation status of driver module.
5.2 Plot patient-specific dysfunction pathways and user-interested geneset mutually exclusive and co-occurrence plots.
5.3 Plot patient-specific dysfunction pathways’ waterfall plots.
5.4 Plot genes’ hotspot mutation lollipop plots.

2.1 Obtain non-silent mutations frequency matrix.

We downloaded patients’ mutation data from the TCGA database in Mutation Annotation Format (MAF). To represent the mutation status of each gene in each sample, we converted the MAF data into a mutation status matrix in which every row represents a gene and every column represents a sample. In our study, we extract only the non-silent somatic mutations (nonsense mutation, missense mutation, frame-shift indels, splice site, nonstop mutation, translation start site, inframe indels) in protein-coding regions.

MAF files contain many fields, ranging from chromosome names to COSMIC annotations. However, most of the analyses in our package use only the following fields.

  • Mandatory fields: Hugo_Symbol, Variant_Classification, Tumor_Sample_Barcode.
    The complete specification of MAF files can be found on the NCI GDC documentation page.

The function get_mut_status in the ProgModule package implements the above process. Taking simulated data as an example, the command lines are as follows:
#load the mutation annotation file
maf<-system.file("extdata","maffile.maf",package = "ProgModule")
maf_data<-read.delim(maf)
mutvariant<-maf_data[,c("Hugo_Symbol","Tumor_Sample_Barcode","Variant_Classification")]
#perform the function 'get_mut_status'
mut_status<-get_mut_status(mutvariant=mutvariant,nonsynonymous = TRUE)
#view the first five rows and columns of the mut_status matrix
mut_status[1:5,1:5]
#>       TCGA-B0-5117-01A-01D-1421-08 TCGA-B0-5109-01A-02D-1421-08
#> ACTR8                            1                            0
#> PKHD1                            1                            0
#> MUC17                            1                            0
#> SMC3                             1                            0
#> LARP4                            1                            0
#>       TCGA-A3-3367-01A-02D-1421-08 TCGA-B0-5120-01A-01D-1421-08
#> ACTR8                            0                            0
#> PKHD1                            0                            0
#> MUC17                            0                            0
#> SMC3                             0                            0
#> LARP4                            0                            0
#>       TCGA-CZ-5453-01A-01D-1501-10
#> ACTR8                            0
#> PKHD1                            0
#> MUC17                            0
#> SMC3                             0
#> LARP4                            0
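
To make the conversion concrete, the following is a minimal base-R sketch of how a MAF-like table can be turned into a binary gene-by-sample matrix. It is not the implementation of get_mut_status. The toy data, the object names (toy_maf, non_silent, toy_status) and the exact list of Variant_Classification labels treated as non-silent are assumptions made for illustration.

```r
# Toy MAF-like table; all values are invented for illustration only
toy_maf <- data.frame(
  Hugo_Symbol = c("TP53", "TP53", "VHL", "PBRM1", "VHL"),
  Tumor_Sample_Barcode = c("S1", "S2", "S1", "S3", "S1"),
  Variant_Classification = c("Missense_Mutation", "Silent", "Nonsense_Mutation",
                             "Frame_Shift_Del", "Missense_Mutation"),
  stringsAsFactors = FALSE
)

# Assumed set of non-silent classes, following the categories listed above
non_silent <- c("Nonsense_Mutation", "Missense_Mutation", "Frame_Shift_Del",
                "Frame_Shift_Ins", "Splice_Site", "Nonstop_Mutation",
                "Translation_Start_Site", "In_Frame_Del", "In_Frame_Ins")

# Keep only the non-silent records
kept <- toy_maf[toy_maf$Variant_Classification %in% non_silent, ]

# Count mutations per gene and sample; the factor levels keep samples
# that carry only silent mutations as all-zero columns
counts <- table(
  kept$Hugo_Symbol,
  factor(kept$Tumor_Sample_Barcode, levels = unique(toy_maf$Tumor_Sample_Barcode))
)

# Collapse counts to a 0/1 status matrix (rows = genes, columns = samples)
toy_status <- (unclass(counts) > 0) * 1
print(toy_status)
```

In this sketch a gene with several non-silent mutations in the same sample still receives a single 1, and the silent TP53 record in sample S2 does not contribute to the matrix.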


2.2 Search cohort-specific local subnetworks.

A breadth-first search algorithm was then used to search for cohort-specific local subnetworks in the protein-protein interaction (PPI) network. The search starts at each driver gene obtained from the NCG database (defined as a seed node) and iteratively explores its neighboring mutated genes until the local network reaches a maximum number of genes. This maximum size is set by the user (500 in our study). The function get_local_network in the ProgModule package implements the above process.

#load mutation matrix and PPI network
data(mut_status,subnet)
# find the local network of each gene
localnetwork<-get_local_network(network=subnet,freq_matrix=mut_status,max.size=500)
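
The search described above can be sketched in base R as follows. This is only an illustration of a size-capped breadth-first search on a toy edge list, not the code behind get_local_network. The function name bfs_local_network, the edge-list representation and the gene names are hypothetical.

```r
# Toy undirected PPI edge list with hypothetical gene names
edges <- data.frame(
  from = c("A", "A", "B", "C", "D"),
  to   = c("B", "C", "D", "E", "F"),
  stringsAsFactors = FALSE
)
# Genes assumed to be mutated in the cohort; E is not mutated
mutated <- c("A", "B", "C", "D", "F")

# Breadth-first search from a seed gene, restricted to mutated neighbors
# and stopped once max_size genes have been collected
bfs_local_network <- function(edges, seed, mutated, max_size) {
  visited <- seed
  queue <- seed
  while (length(queue) > 0 && length(visited) < max_size) {
    node <- queue[1]
    queue <- queue[-1]
    # Neighbors of the current node in either direction of the edge list
    nbrs <- unique(c(edges$to[edges$from == node], edges$from[edges$to == node]))
    # Keep mutated genes that have not been visited yet
    nbrs <- setdiff(intersect(nbrs, mutated), visited)
    # Do not exceed the size cap
    nbrs <- head(nbrs, max_size - length(visited))
    visited <- c(visited, nbrs)
    queue <- c(queue, nbrs)
  }
  visited
}

print(bfs_local_network(edges, seed = "A", mutated = mutated, max_size = 4))
print(bfs_local_network(edges, seed = "A", mutated = mutated, max_size = 10))
```

With a cap of 4 the search stops before reaching F. With a cap of 10 it collects every mutated gene reachable from the seed, and E is never included because it is not mutated.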


2.5 Visualization results.

  1. The function get_mut_survivalresult is used to draw Kaplan-Meier survival curves based on the mutation status of a driver module. The command lines are as follows:


#Load the data
data(mut_status,final_candidate_module)
sur<-system.file("extdata","sur.csv",package="ProgModule")
sur<-read.delim(sur,sep=",",header=TRUE,row.names=1)
#Drawing Kaplan-Meier Survival Curves.
get_mut_survivalresult(module=final_candidate_module,freq_matrix=mut_status,sur)
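
For readers who want to see the underlying idea, the sketch below groups patients by whether any gene of a module is mutated and compares the two groups with the survival package. It is a hedged illustration on invented data and does not reproduce the internals of get_mut_survivalresult. The column names time and status, and the objects toy_status, toy_sur and module_genes, are assumptions and may differ from the layout of sur.csv.

```r
library(survival)

# Toy mutation status matrix (rows = genes, columns = samples); invented values
toy_status <- matrix(
  c(1, 0, 0, 0, 0, 0,
    0, 1, 0, 0, 0, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("G1", "G2"), paste0("S", 1:6))
)

# Toy survival table; 'time' and 'status' are assumed column names
toy_sur <- data.frame(
  time   = c(5, 8, 20, 25, 30, 12),
  status = c(1, 1, 0, 1, 0, 1),
  row.names = paste0("S", 1:6)
)

# A sample counts as module-mutated if any gene of the module is mutated
module_genes <- c("G1", "G2")
mutated <- colSums(toy_status[module_genes, , drop = FALSE]) > 0
toy_sur$group <- ifelse(unname(mutated[rownames(toy_sur)]), "mutated", "wild-type")

# Kaplan-Meier estimate per group and a log-rank test between the groups
fit <- survfit(Surv(time, status) ~ group, data = toy_sur)
lr <- survdiff(Surv(time, status) ~ group, data = toy_sur)
p_value <- 1 - pchisq(lr$chisq, df = 1)

plot(fit, col = c("red", "blue"), xlab = "Time", ylab = "Survival probability")
legend("topright", legend = names(fit$strata), col = c("red", "blue"), lty = 1)
print(p_value)
```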


  1. The function get_plotMutInteract is used to draw mutually exclusive and co-occurrence plots for patient-specific dysfunction pathways and user-specified gene sets. The command lines are as follows:

#Load the data
data(plotMutInteract_moduledata,plotMutInteract_mutdata)
#Drawing a plotMutInteract of genes
get_plotMutInteract(genes=unique(unlist(plotMutInteract_moduledata[1:4])),freq_matrix=plotMutInteract_mutdata)

#Drawing a plotMutInteract of modules
get_plotMutInteract(module=plotMutInteract_moduledata,freq_matrix=plotMutInteract_mutdata,nShiftSymbols=0)
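
One common way to quantify mutual exclusivity or co-occurrence between two genes is a pairwise Fisher's exact test on their mutation status vectors. The sketch below shows that calculation on invented data. Whether get_plotMutInteract uses exactly this statistic is an assumption, so treat it as an illustration of the concept only.

```r
# Toy mutation status of two genes across ten samples; invented values
g1 <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
g2 <- c(0, 0, 0, 0, 1, 1, 1, 1, 0, 0)

# 2 x 2 contingency table; explicit levels guarantee both rows and columns exist
tab <- table(
  gene1 = factor(g1, levels = c(0, 1)),
  gene2 = factor(g2, levels = c(0, 1))
)
print(tab)

# Fisher's exact test: an odds ratio below 1 suggests mutual exclusivity,
# above 1 suggests co-occurrence
ft <- fisher.test(tab)
tendency <- if (ft$estimate < 1) "mutually exclusive" else "co-occurring"
print(ft$p.value)
print(tendency)
```

In this toy example no sample carries mutations in both genes, so the estimated odds ratio falls below 1 and the pair is labeled as tending toward mutual exclusivity.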


  1. The function get_oncoplots is used to draw waterfall plots of patient-specific dysfunction pathways.
#obtain the modules
data(final_candidate_module)
#load the maf data
maffile<-system.file("extdata","maffile.maf",package="ProgModule")
#Drawing an oncoplot
get_oncoplots(maf=maffile,genes=final_candidate_module[[3]])


  1. The function get_lollipopPlot is used to draw lollipop plots of genes’ mutation hotspots.
#load the maf data
data(maf_data)
#Drawing a lollipopPlot of TP53
get_lollipopPlot(maf=maf_data,gene="TP53")
#> Assuming protein change information are stored under column HGVSp_Short. Use argument AACol to override if necessary.
#> 8 transcripts available. Use arguments refSeqID or proteinID to manually specify tx name.
#> Using longer transcript NM_000546 for now.