Introduction
From copy-number alteration (CNA), RNA-seq data, and pre-defined biological
pathway network, MPAC computes Inferred Pathway Levels (IPLs) for pathway
entities, clusters patient samples by their enriched GO terms, and identifies
key pathway proteins for each patient cluster.
Installation
From GitHub
Start R and enter:
devtools::install_github('pliu55/MPAC')
From Bioconductor
Start R and enter:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
# The following initializes usage of Bioc devel
BiocManager::install(version='devel')
BiocManager::install("MPAC")
For different versions of R, please refer to the appropriate
Bioconductor release.
Required R packages for this vignette
Please use the following code to load required packages for running this
vignette. Many intermediate objects in this vignette are SummarizedExperiment
objects.
require(SummarizedExperiment)
require(MPAC)
Main functions
ppCnInp()
: prepare CNA states from real CNA data
ppRnaInp()
: prepare RNA states from read RNA-seq data
ppRealInp()
: prepare CNA and RNA states from read CNA and RNA-seq data
ppPermInp():
prepare permuted CNA and RNA states
runPrd()
: run PARADIGM to compute IPLs from real CNA and RNA states
runPermPrd()
: run PARADIGM to compute IPLs from permuted CNA and RNA states
colRealIPL()
: collect IPLs from real data
colPermIPL()
: collect IPLs from permuted data
fltByPerm()
: filter IPLs from real data by IPLs from permuted data
subNtw()
: find the largest pathway network subset by filtered IPLs
ovrGMT()
: do gene set over-representation on a sample’s largest pathway
subset
clSamp()
: cluster samples by their gene set over-representation results
conMtf()
: find consensus pathway motifs within a cluster
pltNeiStt()
: plot a heatmap to identify a protein’s pathway determinants
Inferred pathway levels
PARADIGM binary
MPAC uses PARADIGM developed by Vaske et al. to predict
pathway levels. PARADIGM Binary is available to download at Github for
Linux and MacOS.
IPLs from real data
Example below shows input and output files for running PARADIGM on data from
real samples.
PARADIGM will generate multiple output files. All the file names will start with
the sample ID and a suffix indicating their types:
- Inferred pathway levles (IPLs): file with a suffix
ipl.txt
- Output log: file with a suffix
run.out
- Error log: file with a suffix
run.err
- Other auxiliary files, which can be skipped under common usage
# CNA and RNA state from `ppRealInp()`
real_se <- system.file('extdata/TcgaInp/inp_real.rds', package='MPAC') |>
readRDS()
# Pathway file
fpth <- system.file('extdata/Pth/tiny_pth.txt', package='MPAC')
# folder to save all the output files
outdir <- tempdir()
# PARADIGM binary location. Replace the one below with a true location.
paradigm_bin <- '/path/to/PARADIGM'
### code below depends on external PARADIGM binary
runPrd(real_se, fpth, outdir, paradigm_bin, sampleids=c('TCGA-CV-7100'))
#> PARADIGM binary not found:/path/to/PARADIGM
PARADIGM on permuted data
For permuted input, PARADIGM will generate output in the same fashion as on
real data above, except that one permutation corresponds to one output folder
named as p$(index)
, where $index
is the index of that permutation. For
example, three permutations will generate folders p1
, p2
, and p3
.
# a list of list from `ppPermInp()`
permll <- system.file('extdata/TcgaInp/inp_perm.rds', package='MPAC') |>
readRDS()
# Pathway file
fpth <- system.file('extdata/Pth/tiny_pth.txt', package='MPAC')
# folder to save all the output files
outdir <- tempdir()
# PARADIGM binary location. Replace the one below with a true location.
paradigm_bin <- '/path/to/PARADIGM'
# (optional) sample IDs to run PARADIGM on
pat <- 'TCGA-CV-7100'
### code below depends on external PARADIGM binary
runPermPrd(permll, fpth, outdir, paradigm_bin, sampleids=c(pat))
#> PARADIGM binary not found:/path/to/PARADIGM
#> PARADIGM binary not found:/path/to/PARADIGM
#> PARADIGM binary not found:/path/to/PARADIGM
Collecting IPLs
MPAC has PARADIGM run on individual sample in parallel to speed up calculation.
For the convenience of downstream analysis, PARADIGM results will be collected
and put together for all samples.
From PARADIGM on real data
# the folder saving PARADIGM result on real data
# it should be the `outdir` folder from `runPrd()`
indir <- system.file('/extdata/runPrd/', package='MPAC')
# to return a data.table with columns as entities and IPLs for each sample
colRealIPL(indir) |> head()
#> Key: <entity>
#> entity TCGA-CV-7100
#> <char> <num>
#> 1: Activated_JAK1__JAK2__TYK2_(family) 0
#> 2: CD247 0
#> 3: CD28 0
#> 4: CD28/CD86_(complex) 0
#> 5: CD28_B7-2_(complex) 0
#> 6: CD28_bound_to_B7_ligands_(complex) 0
From PARADIGM on permuted data
# the folder saving PARADIGM result on permuted data
# it should be the `outdir` folder from `runPermPrd()`
indir <- system.file('/extdata/runPrd/', package='MPAC')
# number of permutated dataset results to collect
n_perms <- 3
# return a data.table with columns as entities, permutation index, and IPLs for
# each sample
colPermIPL(indir, n_perms) |> head()
#> entity iperm
#> <char> <int>
#> 1: Activated_JAK1__JAK2__TYK2_(family) 1
#> 2: CD247 1
#> 3: CD28 1
#> 4: CD28/CD86_(complex) 1
#> 5: CD28_B7-2_(complex) 1
#> 6: CD28_bound_to_B7_ligands_(complex) 1
#> TCGA-CV-7100
#> <num>
#> 1: 0.00000
#> 2: -0.09873
#> 3: -0.08027
#> 4: -0.01183
#> 5: -0.01175
#> 6: -0.01175
Filtering IPLs
MPAC uses PARADIGM runs on permuted data to generate a background distribution
of IPLs. This distribution is used to filter IPLs from real data to remove
those could be observed by chance.
# collected real IPLs. It is the output from `colRealIPL()`
realdt <- system.file('extdata/fltByPerm/real.rds', package='MPAC') |> readRDS()
# collected permutation IPLs. It is the output from `colPermIPL()`
permdt <- system.file('extdata/fltByPerm/perm.rds', package='MPAC') |> readRDS()
# to return a matrix of filtered IPLs with rows as pathway entities and columns
# as samples. Entities with IPLs observed by chance are set to NA.
fltByPerm(realdt, permdt) |> head()
#> TCGA-BA-5152
#> Activated_JAK1__JAK2__TYK2_(family) NA
#> CD247 0.4511
#> CD28 0.4506
#> CD28/CD86_(complex) 0.6188
#> CD28_B7-2_(complex) 0.6184
#> CD28_bound_to_B7_ligands_(complex) 0.7762
#> TCGA-CV-7100
#> Activated_JAK1__JAK2__TYK2_(family) NA
#> CD247 NA
#> CD28 NA
#> CD28/CD86_(complex) NA
#> CD28_B7-2_(complex) NA
#> CD28_bound_to_B7_ligands_(complex) NA
Find the largest pathway subset
MPAC decomposes the original pathway and identify the largest pathway subset
with all of its entities having filtered IPLs. This sub-pathway allows the user
to focus on the most altered pathway network. Note that, because the set of
entities having filtered IPLs are often different between samples, the
pathway subset will be different between samples as well and represent
sample-specific features.
# a matrix generated by `fltByPerm()`
fltmat <- system.file('extdata/fltByPerm/flt_real.rds', package='MPAC') |>
readRDS()
# a pathway file
fpth <- system.file('extdata/Pth/tiny_pth.txt', package='MPAC')
# a gene set file in MSigDB's GMT format. It should be the same file that will
# be used in the over-representation analysis below.
fgmt <- system.file('extdata/ovrGMT/fake.gmt', package='MPAC')
# to return a list of igraph objects representing the larget sub-pathway for
# each sample
subNtw(fltmat, fpth, fgmt, min_n_gmt_gns=1)
#> $`TCGA-BA-5152`
#> IGRAPH e3b90cd DN-B 11 12 --
#> + attr: name (v/c), type (v/c), ipl (v/n), title
#> | (e/c)
#> + edges from e3b90cd (vertex names):
#> [1] CD28 ->CD28/CD86_(complex)
#> [2] CD28 ->CD28_homodimer_(complex)
#> [3] CD28 ->Nef_CD28_Complex_(complex)
#> [4] CD28 ->Phospho_CD28_homodimer_(complex)
#> [5] CD28_homodimer_(complex)->CD28_B7-2_(complex)
#> + ... omitted several edges
#>
#> $`TCGA-CV-7100`
#> IGRAPH 6439699 DN-B 5 4 --
#> + attr: name (v/c), type (v/c), ipl (v/n), title
#> | (e/c)
#> + edges from 6439699 (vertex names):
#> [1] LCP2->LAT/PLCgamma1/GRB2/SLP76/GADs_(complex)
#> [2] LCP2->NTAL/PLCgamma1/GRB2/SLP76/GADs_(complex)
#> [3] LCP2->p-SLP-76_VAV_(complex)
#> [4] LCP2->phosphorylated_SLP-76_Gads__LAT_(complex)
Gene set over-representation
To understand the biological functions of a sample’s largest sub-pathway, MPAC
performs gene set over-representation analysis on genes with non-zero IPLs in
the sub-pathway.
# a list of igraph objects from `subNtw()`
subntwl <- system.file('extdata/subNtw/subntwl.rds', package='MPAC') |>readRDS()
# a gene set file that has been used in `subNtw()`
fgmt <- system.file('extdata/ovrGMT/fake.gmt', package='MPAC')
# (optional) genes that have CN and RNA data in the input files for PARADIGM
omic_gns <- system.file('extdata/TcgaInp/inp_focal.rds', package='MPAC') |>
readRDS() |> rownames()
# to return a matrix of over-representation adjusted p-values with rows as gene
# set and columns as samples
ovrGMT(subntwl, fgmt, omic_gns)
#> TCGA-BA-5152 TCGA-CV-7100
#> CD_fake 0.6667 NA
#> MI_fake 1.0000 1
Cluster samples by pathway over-representation
With gene set over-representation adjusted p-values, MPAC can cluster samples
as a way to investigate shared features between samples in the same cluster.
Note that, due to randomness introduced in the louvain clustering in igraph R
package version 1.3 (reported in its
Github issue #539), it is
recommended to run clustering multiple times to evaluate its variation. The
clSamp()
function has an argument, n_random_runs
, to specify the number of
random clustering jobs to run.
# a matrix of gene set over-representation adjusted p-values from `ovrGMT()`
ovrmat <- system.file('extdata/clSamp/ovrmat.rds', package='MPAC') |> readRDS()
# to return a data.table of clustering result by 5 random runs:
#
# - each row represents a clustering result
# - the first column, `nreps`, indicates the number of occurrences of a
# clustering result in the 5 random runs
# - the other columns represents each sample's clustering membership
#
clSamp(ovrmat, n_random_runs=5)
#> nreps TCGA-BA-A4IH TCGA-CN-4741 TCGA-CN-5365
#> <int> <int> <int> <int>
#> 1: 5 1 1 1
#> TCGA-CN-5374 TCGA-CN-A499 TCGA-CN-A49C TCGA-CQ-5323
#> <int> <int> <int> <int>
#> 1: 2 1 1 1
#> TCGA-CR-5248 TCGA-CR-6470 TCGA-CR-6480 TCGA-CR-6484
#> <int> <int> <int> <int>
#> 1: 1 2 1 2
#> TCGA-CR-6491 TCGA-CR-7379 TCGA-CV-7102 TCGA-MZ-A6I9
#> <int> <int> <int> <int>
#> 1: 2 1 1 1
#> TCGA-P3-A5QE TCGA-TN-A7HI TCGA-TN-A7HL
#> <int> <int> <int>
#> 1: 1 2 2
Find consensus pathway motifs within a cluster
From the clustering results, MPAC can find consensus pathway motifs from samples
within the same clusters. These motifs will represent cluster-specific pathway
features. They often contain key proteins for further analysis.
# a list of igraph objects from `subNtw()`
subntwl <- system.file('extdata/conMtf/subntwl.rds', package='MPAC') |>readRDS()
# (optional) genes that have CN and RNA data in the input files for PARADIGM
omic_gns <- system.file('extdata/TcgaInp/inp_focal.rds', package='MPAC') |>
readRDS() |> rownames()
# to return a list of igraph objects representing consensus motifs
conMtf(subntwl, omic_gns, min_mtf_n_nodes=50)
#> [[1]]
#> IGRAPH 9975784 DN-- 240 309 --
#> + attr: name (v/c)
#> + edges from 9975784 (vertex names):
#> [1] ARHGEF7 ->CBL_Cool-Pix_(complex)
#> [2] ARHGEF7 ->CDC42/GTP_(complex)
#> [3] B7_family/CD28_(complex)->ITK
#> [4] B7_family/CD28_(complex)->PRF1
#> [5] BTK ->BTK-ITK_(family)
#> [6] BTK ->CD19_Signalosome_(complex)
#> + ... omitted several edges
Diagnostic plot
MPAC implemented a diagnostic function to plot a heatmap of the omic and
pathway states of a protein as well as the pathway states of this protein’s
pathway neighbors. This heatmap facilitates the identification of pathway
determinants of this protein.
# protein of focus
protein <- 'CD86'
# input pathway file
fpth <- system.file('extdata/Pth/tiny_pth.txt', package='MPAC')
# CNA and RNA state matrix from `ppRealInp()`
real_se <- system.file('extdata/pltNeiStt/inp_real.rds', package='MPAC') |>
readRDS()
# filtered IPL matrix from `fltByPerm()`
fltmat <- system.file('extdata/pltNeiStt/fltmat.rds', package='MPAC') |>
readRDS()
# to plot heatmap
pltNeiStt(real_se, fltmat, fpth, protein)
Session Info
Below is the output of sessionInfo()
on the system on which this document
was compiled.
#> R Under development (unstable) (2025-01-20 r87609)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8
#> [2] LC_NUMERIC=C
#> [3] LC_TIME=en_GB
#> [4] LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8
#> [6] LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8
#> [8] LC_NAME=C
#> [9] LC_ADDRESS=C
#> [10] LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8
#> [12] LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils
#> [6] datasets methods base
#>
#> other attached packages:
#> [1] SummarizedExperiment_1.37.0
#> [2] Biobase_2.67.0
#> [3] GenomicRanges_1.59.1
#> [4] GenomeInfoDb_1.43.3
#> [5] IRanges_2.41.2
#> [6] S4Vectors_0.45.2
#> [7] BiocGenerics_0.53.5
#> [8] generics_0.1.3
#> [9] MatrixGenerics_1.19.1
#> [10] matrixStats_1.5.0
#> [11] MPAC_1.1.1
#> [12] data.table_1.16.4
#>
#> loaded via a namespace (and not attached):
#> [1] gridExtra_2.3
#> [2] rlang_1.1.5
#> [3] magrittr_2.0.3
#> [4] clue_0.3-66
#> [5] GetoptLong_1.0.5
#> [6] compiler_4.5.0
#> [7] png_0.1-8
#> [8] systemfonts_1.2.1
#> [9] vctrs_0.6.5
#> [10] stringr_1.5.1
#> [11] pkgconfig_2.0.3
#> [12] shape_1.4.6.1
#> [13] crayon_1.5.3
#> [14] fastmap_1.2.0
#> [15] magick_2.8.5
#> [16] XVector_0.47.2
#> [17] scuttle_1.17.0
#> [18] rmarkdown_2.29
#> [19] UCSC.utils_1.3.1
#> [20] xfun_0.50
#> [21] bluster_1.17.0
#> [22] cachem_1.1.0
#> [23] beachmat_2.23.6
#> [24] jsonlite_1.8.9
#> [25] DelayedArray_0.33.4
#> [26] BiocParallel_1.41.0
#> [27] irlba_2.3.5.1
#> [28] parallel_4.5.0
#> [29] cluster_2.1.8
#> [30] R6_2.5.1
#> [31] bslib_0.8.0
#> [32] stringi_1.8.4
#> [33] RColorBrewer_1.1-3
#> [34] limma_3.63.3
#> [35] jquerylib_0.1.4
#> [36] Rcpp_1.0.14
#> [37] bookdown_0.42
#> [38] iterators_1.0.14
#> [39] knitr_1.49
#> [40] Matrix_1.7-2
#> [41] splines_4.5.0
#> [42] igraph_2.1.4
#> [43] tidyselect_1.2.1
#> [44] viridis_0.6.5
#> [45] abind_1.4-8
#> [46] yaml_2.3.10
#> [47] doParallel_1.0.17
#> [48] codetools_0.2-20
#> [49] lattice_0.22-6
#> [50] tibble_3.2.1
#> [51] evaluate_1.0.3
#> [52] survival_3.8-3
#> [53] fitdistrplus_1.2-2
#> [54] circlize_0.4.16
#> [55] pillar_1.10.1
#> [56] foreach_1.5.2
#> [57] ggplot2_3.5.1
#> [58] munsell_0.5.1
#> [59] scales_1.3.0
#> [60] glue_1.8.0
#> [61] metapod_1.15.0
#> [62] tools_4.5.0
#> [63] BiocNeighbors_2.1.2
#> [64] ScaledMatrix_1.15.0
#> [65] fgsea_1.33.2
#> [66] locfit_1.5-9.10
#> [67] scran_1.35.0
#> [68] Cairo_1.6-2
#> [69] fastmatch_1.1-6
#> [70] cowplot_1.1.3
#> [71] grid_4.5.0
#> [72] edgeR_4.5.1
#> [73] colorspace_2.1-1
#> [74] SingleCellExperiment_1.29.1
#> [75] GenomeInfoDbData_1.2.13
#> [76] BiocSingular_1.23.0
#> [77] cli_3.6.3
#> [78] rsvd_1.0.5
#> [79] viridisLite_0.4.2
#> [80] S4Arrays_1.7.1
#> [81] svglite_2.1.3
#> [82] ComplexHeatmap_2.23.0
#> [83] dplyr_1.1.4
#> [84] gtable_0.3.6
#> [85] sass_0.4.9
#> [86] digest_0.6.37
#> [87] SparseArray_1.7.4
#> [88] dqrng_0.4.1
#> [89] rjson_0.2.23
#> [90] htmltools_0.5.8.1
#> [91] lifecycle_1.0.4
#> [92] httr_1.4.7
#> [93] GlobalOptions_0.1.2
#> [94] statmod_1.5.0
#> [95] MASS_7.3-64