KBoost

Luis F. Iglesias-Martinez, Barbara De Kegel and Walter Kolch

Introduction

KBoost is a gene regulatory network inference algorithm. It builds several models using kernel principal components regression and boosting, and from them estimates the probability that each transcription factor regulates a gene. KBoost has one main function, kboost, and a case-specific function, KBoost_human_symbol.

Quickstart

The function kboost infers gene regulatory networks from gene expression data. The gene expression data (D4_multi_1 in the examples below) needs to be a numerical matrix where the columns are the genes and the rows are the observations, patients or experiments.
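Before calling kboost, it can help to confirm that the expression matrix has the layout described above. The following is a minimal base R sketch on simulated data; the object names (expr, n_obs, n_genes) are illustrative and not part of KBoost.

```r
# Simulated expression data: 6 observations (rows) x 4 genes (columns).
set.seed(1)
n_obs <- 6
n_genes <- 4
expr <- matrix(rnorm(n_obs * n_genes), nrow = n_obs, ncol = n_genes)
colnames(expr) <- paste0("G", seq_len(n_genes))

# Basic layout checks: a numeric matrix without missing values.
stopifnot(is.matrix(expr), is.numeric(expr), !anyNA(expr))

# If the data arrive as genes x observations, transpose them first.
expr_genes_in_rows <- t(expr)
expr_fixed <- t(expr_genes_in_rows)

dim(expr_fixed)  # observations x genes
```

Many expression tables are distributed with genes in rows, so the transpose step is often the only preparation needed.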

Without Prior knowledge:

library(KBoost)

data(D4_multi_1)

grn = kboost(D4_multi_1)

grn$GRN[91:93,2:5]
##              [,1]         [,2]         [,3]         [,4]
## [1,] 0.4448443050 0.0020788212 0.0051457926 0.0025597813
## [2,] 0.0005754407 0.0002955955 0.0004028377 0.0004049318
## [3,] 0.0046887521 0.0027442450 0.0024883828 0.0034858423

With Prior knowledge:

KBoost has a Bayesian formulation, which allows the user to include prior knowledge. Such knowledge is often available in biology, particularly for well-studied organisms. If the user knows of transcription factors that regulate certain genes, they can encode this in a matrix prior_weights of size G x K, where G is the number of genes and K is the number of transcription factors.

Take for example the well-known TP53-MDM2 interaction. The user would first build the matrix prior_weights and then, in the column that corresponds to TP53 and the row that corresponds to MDM2, enter a number that represents the prior probability of this interaction. We recommend using values higher than 0.5 but lower than 1 to avoid numerical errors in these cases. For interactions where no prior knowledge is available, the user can simply set 0.5.

library(KBoost)

data(D4_multi_1)

# Matrix of size 100x100 with all values set to 0.5
prior_weights = matrix(0.5,100,100)

# For this example assume we know from previous experiments that TF2 regulates the gene in row 91
prior_weights[91,2] = 0.8

grn = kboost(X=D4_multi_1, prior_weights=prior_weights)

# Note that the first entry now has a slightly higher probability than in the 
# previous example, as a result of adding the prior
grn$GRN[91:93,2:5]
##              [,1]         [,2]         [,3]         [,4]
## [1,] 0.4466380495 0.0020786567 0.0051458405 0.0025593613
## [2,] 0.0005763744 0.0002955955 0.0004028377 0.0004049318
## [3,] 0.0046963602 0.0027442450 0.0024883828 0.0034858423
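The TP53-MDM2 case described above can also be sketched with named rows and columns, which avoids looking up indices by hand. This is a base R illustration only: the gene and TF lists are hypothetical, the dimnames are used purely for indexing, and the matrix passed to kboost would still need to match the dimensions of the expression data.

```r
# Hypothetical gene and TF symbols, used only for illustration.
genes <- c("TP53", "MDM2", "CTCF", "ESR1")
tfs <- c("TP53", "CTCF")

# G x K matrix of uninformative priors (rows = genes, columns = TFs).
prior_weights <- matrix(0.5, nrow = length(genes), ncol = length(tfs),
                        dimnames = list(genes, tfs))

# Prior belief that TP53 (column) regulates MDM2 (row).
prior_weights["MDM2", "TP53"] <- 0.8

prior_weights
```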

With gene symbols:

library(KBoost)

# A random 10x5 numerical matrix
X = rnorm(50,0,1)
X = matrix(X,10,5)

# Gene names corresponding to the columns of X
gen_names = c("TP53","YY1","CTCF","MDM2","ESR1")

grn = KBoost_human_symbol(X,gen_names,pos_weight=0.6, neg_weight=0.4)

# TFs are taken from Lambert et al.; the 4 columns in the output network indicate that 4 of the genes are TFs.
grn$GRN
##           TP53       YY1      CTCF      ESR1
## TP53 0.0000000 0.9375854 0.1862007 0.8673539
## YY1  0.9746300 0.0000000 0.4890362 0.2134963
## CTCF 0.2034594 0.9551302 0.0000000 0.6673853
## MDM2 0.2616139 0.2039738 0.5105623 0.1995845
## ESR1 1.0000000 0.2311417 0.5218359 0.0000000
# Look at the prior weights based on the Gerstein network.
# Output indicates the YY1-TP53 edge is present in the Gerstein network.
grn$prior_weights
##      TP53 YY1 CTCF ESR1
## TP53  0.4 0.6  0.4  0.4
## YY1   0.4 0.4  0.4  0.4
## CTCF  0.4 0.4  0.4  0.4
## MDM2  0.4 0.4  0.4  0.4
## ESR1  0.4 0.4  0.4  0.4
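To illustrate how a prior like the one printed above relates to pos_weight and neg_weight, the sketch below builds a similar matrix from a two-column edge list in base R. The edge list is a toy stand-in, not the real Gerstein network, and the code is an assumption about the general idea rather than the package's implementation.

```r
# Toy edge list standing in for a ChIP-Seq derived network
# (column 1 = TF, column 2 = target gene). Not the real Gerstein data.
edges <- rbind(c("YY1", "TP53"),
               c("YY1", "BRCA1"))  # BRCA1 is absent from gen_names below

gen_names <- c("TP53", "YY1", "CTCF", "MDM2", "ESR1")
tfs <- c("TP53", "YY1", "CTCF", "ESR1")
pos_weight <- 0.6
neg_weight <- 0.4

# Start from neg_weight everywhere (rows = genes, columns = TFs).
prior <- matrix(neg_weight, nrow = length(gen_names), ncol = length(tfs),
                dimnames = list(gen_names, tfs))

# Keep only edges whose TF and gene are both present in the data set.
keep <- edges[, 1] %in% tfs & edges[, 2] %in% gen_names
known <- edges[keep, , drop = FALSE]

# Index by (gene, TF) name pairs and raise those entries to pos_weight.
prior[cbind(known[, 2], known[, 1])] <- pos_weight

prior
```

With this toy input, only the entry in row TP53 and column YY1 is raised to 0.6, mirroring the printed prior.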

Main Functions

kboost(X, TFs, prior_weights, g, v, ite)

Function to infer gene regulatory network from gene expression data.

Input:

Output:
List with the following fields:

KBoost_human_symbol(X, gen_names, g, v, ite, pos_weight, neg_weight)

Function to infer gene regulatory network from human cell lines or patient samples. This function automatically builds a prior from Gerstein et al. (2012) and uses the list of TFs from Lambert et al. (2018). The gene expression data needs to be a numerical matrix.

Input:

Output:
List with the following fields:

AUPR_AUROC_matrix(Net, G_mat, auto_remove, TFs, upper_limit)

Function to calculate the AUROC and AUPR of a known network.
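For intuition about the AUROC metric, the base R sketch below computes it for a toy set of predicted edge scores using the rank-based (Mann-Whitney) formula. The helper name auroc and the data are hypothetical; this is not the code of AUPR_AUROC_matrix and it does not cover AUPR.

```r
# Toy predicted edge probabilities and gold standard (1 = true edge).
scores <- c(0.9, 0.8, 0.4, 0.3, 0.2, 0.1)
truth  <- c(1,   0,   1,   0,   0,   0)

# AUROC equals the probability that a random true edge is scored
# higher than a random non-edge (Mann-Whitney U statistic).
auroc <- function(scores, truth) {
  r <- rank(scores)                  # average ranks handle ties
  n_pos <- sum(truth == 1)
  n_neg <- sum(truth == 0)
  (sum(r[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auroc(scores, truth)  # 0.875: 7 of the 8 (edge, non-edge) pairs are ordered correctly
```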

Input:

Output:
List with the following fields:

d4_mfac(v, g, ite)

Function to produce the KBoost AUPR and AUROC results on the DREAM4 Multifactorial Challenge.

Input:

Output:

get_prior_Gerstein(gen_names, TFs, pos_weight, neg_weight)

Function to build a prior from the network that Gerstein et al. (2012) derived from ChIP-Seq data.

Input:

Output:

grid_search_kboost(dataset, vs, gs, ite)

Function to perform a grid search and find the best hyperparameters.
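The general shape of a grid search can be sketched in base R as follows. The scoring function toy_score is a hypothetical stand-in for "run the inference and measure accuracy", and the candidate values are arbitrary; none of this reflects the internals of grid_search_kboost.

```r
# Candidate hyperparameter values (illustrative only).
vs <- c(0.1, 0.5, 1)
gs <- c(1, 10, 60)

# Hypothetical stand-in for "run inference and return an accuracy score".
# A real search would score the inferred network against a gold standard.
toy_score <- function(v, g) -(v - 0.5)^2 - (log10(g) - 1)^2

# Evaluate every (v, g) combination and keep the best one.
grid <- expand.grid(v = vs, g = gs)
grid$score <- mapply(toy_score, grid$v, grid$g)
best <- grid[which.max(grid$score), ]

best
```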

Input:

Output:
List with the following fields:

irma_check(g, v, ite)

Function to produce the AUPR and AUROC results on the IRMA datasets.

Input:

Output:

net_dist_bin(GRN,TFs,thr)

Function to calculate the shortest distance between nodes.

Input:

Output:

Example:

library(KBoost)
data(D4_multi_1)
Net = kboost(D4_multi_1)
dist = net_dist_bin(Net$GRN,Net$TFs,0.1)
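To make the idea of shortest distances in a thresholded network concrete, here is a base R sketch on a toy network in which every gene is also a TF. It binarises the matrix at thr and runs the Floyd-Warshall algorithm; the values are hypothetical and the approach is an illustration, not necessarily what net_dist_bin does internally.

```r
# Toy 4-gene network where every gene is also a TF (hypothetical values).
# Entry [i, j] is the probability that TF j regulates gene i.
GRN <- matrix(c(0.0,  0.0, 0.0, 0.0,
                0.9,  0.0, 0.0, 0.0,
                0.0,  0.8, 0.0, 0.0,
                0.05, 0.0, 0.7, 0.0),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("G", 1:4), paste0("G", 1:4)))
thr <- 0.1

# Binarise, then transpose so adj[j, i] = 1 means an edge from TF j to gene i.
adj <- t(GRN > thr) * 1

# Floyd-Warshall all-pairs shortest path lengths (Inf = unreachable).
d <- ifelse(adj == 1, 1, Inf)
diag(d) <- 0
n <- nrow(d)
for (k in seq_len(n)) {
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      d[i, j] <- min(d[i, j], d[i, k] + d[k, j])
    }
  }
}

d["G1", "G4"]  # 3, via G1 -> G2 -> G3 -> G4
```

The direct G1 to G4 entry (0.05) falls below the threshold, so the only remaining route has three steps.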

net_summary_bin(GRN,TFs,thr,a,b)

Function to summarize the GRN filtered with a threshold.

Input:

Output: List with the following fields:

Example:

library(KBoost)
data(D4_multi_1)
Net = kboost(D4_multi_1)
Net_Summary = net_summary_bin(Net$GRN)
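As a rough picture of what summarising a thresholded network can involve, the base R sketch below binarises a toy GRN and counts edges per TF and per gene. The matrix values and the threshold are hypothetical, and the statistics shown are an assumption for illustration rather than the fields returned by net_summary_bin.

```r
# Toy GRN: rows = genes, columns = TFs (hypothetical probabilities).
GRN <- matrix(c(0.0, 0.6,  0.1,
                0.7, 0.0,  0.2,
                0.9, 0.3,  0.0,
                0.8, 0.05, 0.4),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("G", 1:4), paste0("TF", 1:3)))
thr <- 0.25

# Keep edges whose probability exceeds the threshold.
bin <- GRN > thr

n_edges <- sum(bin)          # total number of retained edges
out_degree <- colSums(bin)   # number of targets per TF
in_degree <- rowSums(bin)    # number of regulators per gene

out_degree
```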

net_refine(Net)

Function to apply the heuristic post-processing step suggested by Slawek and Arodz to improve accuracy: each column of the network is multiplied by its variance.

Input:

Output:
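The column scaling described for net_refine can be illustrated in base R as follows. The toy matrix is hypothetical, and whether net_refine performs any additional steps is not assumed here.

```r
# Toy network: rows = genes, columns = TFs (hypothetical probabilities).
Net <- matrix(c(0.25, 0.5,
                0.25, 1.0,
                0.25, 0.0),
              nrow = 3, byrow = TRUE)

# Variance of each column; a column with identical scores has variance 0.
col_var <- apply(Net, 2, var)

# Multiply every column by its own variance.
refined <- sweep(Net, 2, col_var, `*`)

refined
```

A TF that assigns the same score to every gene carries no discriminating information, so its column is scaled to zero, while TFs with a wide spread of scores keep more of their weight.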

write_GRN_D4(GRN,TFs, filename)

Function to write output in DREAM4 Challenge Format.
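For orientation, the sketch below writes a toy network as a ranked, tab-separated edge list in base R. It assumes the DREAM4 submission layout is three columns (regulator, target, confidence) ordered by decreasing confidence; the matrix, the gene labels, and the temporary file are illustrative, and this is not the code of write_GRN_D4.

```r
# Toy network: rows = genes, columns = TFs (hypothetical probabilities).
GRN <- matrix(c(0.0, 0.2,
                0.9, 0.0,
                0.4, 0.7),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste0("G", 1:3), paste0("G", 1:2)))

# Long format: one row per (TF, gene) pair, dropping self-loops.
edges <- data.frame(tf = rep(colnames(GRN), each = nrow(GRN)),
                    gene = rep(rownames(GRN), times = ncol(GRN)),
                    score = as.vector(GRN))
edges <- edges[edges$tf != edges$gene, ]

# Rank edges from most to least confident.
edges <- edges[order(edges$score, decreasing = TRUE), ]

# Tab-separated, no header, no quotes.
out_file <- tempfile(fileext = ".txt")
write.table(edges, out_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

readLines(out_file)[1]
```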

Input:

Datasets

DREAM 4 Multifactorial Perturbation Challenge Datasets

D4_multi_1, D4_multi_2, D4_multi_3, D4_multi_4 and D4_multi_5

The gene expression datasets from the DREAM4 multifactorial perturbation challenge. https://www.synapse.org/#!Synapse:syn3049712/wiki/74628

Marbach D, Prill RJ, Schaffter T, Mattiussi C, Floreano D, and Stolovitzky G. Revealing strengths and weaknesses of methods for gene network inference. PNAS, 107(14):6286-6291, 2010.

To use:

data(D4_multi_1)

G_D4_multi_1, G_D4_multi_2, G_D4_multi_3, G_D4_multi_4 and G_D4_multi_5

The gold standard networks for the gene expression datasets of the DREAM4 multifactorial perturbation challenge. https://www.synapse.org/#!Synapse:syn3049712/wiki/74628

Marbach D, Prill RJ, Schaffter T, Mattiussi C, Floreano D, and Stolovitzky G. Revealing strengths and weaknesses of methods for gene network inference. PNAS, 107(14):6286-6291, 2010.

To use:

data(G_D4_multi_1)

Gerstein_Prior_ENET_2

Gene regulatory network derived from the ENCODE ChIP-Seq dataset in human cell lines. A matrix with two columns: the first column is a transcription factor and the second is a gene.

Gerstein, M.B., et al. Architecture of the human regulatory network derived from ENCODE data. Nature 2012;489(7414):91-100.

To use:

data(Gerstein_Prior_ENET_2)

Human_TFs

Set of genes that are transcription factors, given in gene symbol nomenclature.

Lambert, S.A., et al. The Human Transcription Factors. Cell 2018;172(4):650-665.

To use:

data(Human_TFs)