A User Manual for PRANA

Seungjun Ahn

July 22nd, 2024 (Version 1.0.5)

We introduce a Pseudo-value Regression Approach for Network Analysis (PRANA) [1]. To our knowledge, this is the first attempt of utilizing a regression modeling for the differential network (DN) analysis by collective gene expression levels under two experimental conditions (\(\textit{e.g.}\) ‘current’ vs. ‘non-current smokers’ or ‘high-risk’ vs. ‘low-risk’). We start from the mutual information (MI) criteria, followed by pseudo-value calculations, which are then entered into a robust regrssion model.

Requirements

Please install these R packages prior to use PRANA. Note that minet package is available in Bioconductor. Please see below for further instruction. Of important note (as of July 2024), dnapath R package is archived in CRAN as of May 2024, which no longer becomes a requirement for the installation of PRANA. Instead, run_aracne() function is available in our package as of version 1.0.5 to obtain mutual information (MI) estimate via ARACNE for network estimation step.

# library(dplyr) 
# library(parallel) # To use mclapply() when reestimating the association matrix.
# library(robustbase) # To fit a robust regression
# library(minet) # To estimate network via ARACNE

# Please run the following lines to install minet package 
# from Bioconductor in your R console:
#if (!require("BiocManager", quietly = TRUE))
#    install.packages("BiocManager")
#BiocManager::install("minet")
library(PRANA)

Example: Load the COPDGene study data from PRANA R package:

In the original article [1], a real-data analysis was performed to showcase the utility of PRANA. This contains clinical and expression data for 406 samples and 28 COPD-related genes that were highlighted in a recent genome-wide association study [2]. The full data is available from the Gene Expression Database with accession number GSE158699 [3].

data(combinedCOPDdat_RGO) # A complete data containing expression and clinical data.

Data processing for the example data from COPDGene study:

The main variable of our interest in this analysis is the current smoking status. We obtain the indices of subjects who are ‘current’ vs. ‘non-current smokers.’ These indices are used to dichotomize expression dataset into ‘current (Group B)’ and ‘non-current smokers (Group A).’ This is important as the we estimate the group-specific \(p \times p\) association matrices.

# A gene expression data part of the downloaded data.
rnaseqdat <- combinedCOPDdat_RGO[ , 8:ncol(combinedCOPDdat_RGO)]
rnaseqdat <- as.data.frame(apply(rnaseqdat, 2, as.numeric))

# A clinical data with additional covariates sorted by current smoking groups:
# The first column is ID, so do not include.
phenodat <- combinedCOPDdat_RGO[order(combinedCOPDdat_RGO$currentsmoking), 2:7]

# Indices of non-current smoker (namely Group A)
index_grpA <- which(combinedCOPDdat_RGO$currentsmoking == 0)
# Indices of current smoker (namely Group B)
index_grpB <- which(combinedCOPDdat_RGO$currentsmoking == 1)

Apply the PRANA:

Once the data processing is done, we are off to apply the PRANA. In PRANA function, please make sure you specify the expression and clinical data separately. Additionally, you will need to provide the indices for each group of a binary indicator variable that is deemed as a main predictor variable.

PRANAres <- PRANA(RNASeqdat = rnaseqdat, clindat = phenodat, groupA = index_grpA, groupB = index_grpB)

Some supporting features after the use of PRANA:

In this package, there are some additional R functions that can help you locating the name of genes that are significantly differentially connected (DC) and names and adjusted p-values via the empirical Bayes adjustment [4] for all variables and for a specific variable.

# This is useful when we want to create a table with adjusted p-values only.
adjpval(PRANAres)
#>        currentsmoking packyrs   age gender  race FEV1perc
#> 10370           0.004   1.000 0.996      1 1.000        1
#> 10420           0.000   1.000 0.998      1 1.000        1
#> 1306            0.686   0.996 1.000      1 1.000        1
#> 155185          0.000   1.000 1.000      1 1.000        1
#> 158158          0.846   1.000 1.000      1 1.000        1
#> 1653            0.002   1.000 0.996      1 1.000        1
#> 1762            0.000   1.000 1.000      1 1.000        1
#> 23389           0.020   1.000 0.996      1 1.000        1
#> 253461          0.000   1.000 0.996      1 1.000        1
#> 26112           0.800   0.870 1.000      1 1.000        1
#> 27436           0.000   1.000 1.000      1 1.000        1
#> 3308            0.000   1.000 1.000      1 1.000        1
#> 3696            0.000   1.000 0.996      1 1.000        1
#> 374739          0.000   1.000 0.996      1 0.626        1
#> 3842            0.000   1.000 0.996      1 1.000        1
#> 406             0.000   1.000 0.996      1 1.000        1
#> 56986           0.000   1.000 0.996      1 1.000        1
#> 57188           0.000   1.000 1.000      1 1.000        1
#> 6239            0.462   1.000 1.000      1 1.000        1
#> 7067            0.000   0.998 0.996      1 0.330        1
#> 7871            0.000   1.000 0.996      1 1.000        1
#> 79961           0.004   1.000 1.000      1 1.000        1
#> 79991           0.000   1.000 0.996      1 1.000        1
#> 8224            0.000   1.000 0.996      1 1.000        1
#> 8853            0.000   1.000 0.996      1 1.000        1
#> 8870            0.000   1.000 0.996      1 1.000        1
#> 9258            0.000   1.000 0.996      1 0.994        1
#> 9686            0.846   0.972 0.996      1 1.000        1

Back to the description of the original article by Ahn \(\textit{et al.}\), the main interest is in the current smoking status to declare whether a gene is DC between current and non-current smokers at the user-specified significance level. Thus, we subset the vector of p-values for current smoking status from the adjpvaltab object, aka an object with adjusted p-values and gene names using adjpval function above.

# Create an object to keep the table with adjusted p-values using adjpval() function.
adjptab <- adjpval(PRANAres)

adjpval_specific_var is a handy function to retrieve adjusted p-values with gene names for a single variable of interest.

# NOTE: Please do NOT forget to provide a name of variable with the quotation marks!
adjpval_specific_var(adjptab = adjptab, varname = "currentsmoking")
#>        currentsmoking
#> 10370           0.004
#> 10420           0.000
#> 1306            0.686
#> 155185          0.000
#> 158158          0.846
#> 1653            0.002
#> 1762            0.000
#> 23389           0.020
#> 253461          0.000
#> 26112           0.800
#> 27436           0.000
#> 3308            0.000
#> 3696            0.000
#> 374739          0.000
#> 3842            0.000
#> 406             0.000
#> 56986           0.000
#> 57188           0.000
#> 6239            0.462
#> 7067            0.000
#> 7871            0.000
#> 79961           0.004
#> 79991           0.000
#> 8224            0.000
#> 8853            0.000
#> 8870            0.000
#> 9258            0.000
#> 9686            0.846

If you would like to quickly know which genes are significantly DC for your main binary grouping variable, please use sigDCGtab to proceed. This will return a table with adjusted p-values and gene names.

# NOTE: Please do NOT forget to provide a name of variable with the quotation marks!
sigDCGtab <- sigDCGtab(adjptab = adjptab, groupvar = "currentsmoking", alpha = 0.05)
sigDCGtab
#>        currentsmoking
#> 10370           0.004
#> 10420           0.000
#> 155185          0.000
#> 1653            0.002
#> 1762            0.000
#> 23389           0.020
#> 253461          0.000
#> 27436           0.000
#> 3308            0.000
#> 3696            0.000
#> 374739          0.000
#> 3842            0.000
#> 406             0.000
#> 56986           0.000
#> 57188           0.000
#> 7067            0.000
#> 7871            0.000
#> 79961           0.004
#> 79991           0.000
#> 8224            0.000
#> 8853            0.000
#> 8870            0.000
#> 9258            0.000

Lastly, a function below, called sigDCGnames, will be useful when you just want to return the names of the significantly DC genes from PRANA at the 0.05 significance level.

# NOTE: Please do NOT forget to provide a name of variable with the quotation marks!
sigDCGnames <- sigDCGnames(adjptab = adjptab, groupvar = "currentsmoking", alpha = 0.05)
sigDCGnames
#>  [1] "10370"  "10420"  "155185" "1653"   "1762"   "23389"  "253461" "27436" 
#>  [9] "3308"   "3696"   "374739" "3842"   "406"    "56986"  "57188"  "7067"  
#> [17] "7871"   "79961"  "79991"  "8224"   "8853"   "8870"   "9258"

As an additional step, we can use rename_genes function to rename Entrez gene IDs into gene symbols. However, it is currently not available since dnapath package is currently archived. A user may manually install the archived version of dnapath from the CRAN, and follow the R code below.

#rename_genes(sigDCGnames, to = "symbol", species = "human")

References

[1] Ahn S, Grimes T, Datta S. (2023). A pseudo-value regression approach for differential network analysis of co-expression data. \(\textit{BMC Bioinformatics}\). 24(1), 8.

[2] Sakornsakolpat P, Prokopenko D, Lamontagne M, and et al. (2019). Genetic landscape of chronic obstructive pulmonary disease identifies heterogeneous cell-type and phenotype associations. \(\textit{Nature Genetics}\), 51(3), 494–505.

[3] Wang Z, Masoomi A, Xu Z, and et al. (2021). Improved prediction of smoking status via isoform-aware RNA-seq deep learning models. \(\textit{PLoS Computational Biology}\), 17(10), e1009433.

[4] Datta S, Datta S. (2005). Empirical Bayes screening of many p-values with applications to microarray studies. \(\textit{Bioinformatics}\), 21(9), 1987–1994.