3 PCLassoReg

This section walks through the main functions, their basic usage, and their outputs. By the end, users should know which functions are available, which one to choose, and where to look for help.

First, we load the PCLassoReg package:

library("PCLassoReg")

3.1 PCLasso

The PCLasso model accepts a gene/protein expression matrix, survival data, and protein complexes for training the prognostic model. We load a set of data created beforehand for illustration. Users can either load their own data or use those saved in the workspace.

# load data
data(survivalData)
data(PCGroups)

x <- survivalData$Exp
y <- survivalData$survData

These commands load a list survivalData, which contains a gene expression matrix Exp and the survival information survData for the patients in Exp, and a data frame PCGroups containing the protein complexes downloaded from CORUM (https://mips.helmholtz-muenchen.de/corum/).

survData is an n x 2 matrix with a column “time” of failure/censoring times and a column “status” holding a 0/1 indicator, where 1 means the time is a failure time and 0 means it is a censoring time.

head(survivalData$survData)
#>        time status
#> S1 22.92000      1
#> S2 99.12000      0
#> S3 64.90000      0
#> S4 68.88000      1
#> S5 23.40000      1
#> S6 57.13333      0
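
For users who bring their own data, a response in the same layout can be assembled with base R. The following is a minimal sketch on made-up values; the object names clin and my.survData are hypothetical and not part of the package.

```r
# Hypothetical clinical table; all values are made up for illustration.
clin <- data.frame(
  id     = c("P1", "P2", "P3", "P4"),
  months = c(12.5, 40.0, 7.2, 33.1),
  event  = c(1, 0, 1, 0)   # 1 = failure observed, 0 = censored
)

# Assemble an n x 2 numeric matrix with the column names described above.
my.survData <- cbind(time = clin$months, status = clin$event)
rownames(my.survData) <- clin$id

# Basic sanity checks before any model fitting.
stopifnot(
  is.matrix(my.survData),
  identical(colnames(my.survData), c("time", "status")),
  all(my.survData[, "time"] > 0),
  all(my.survData[, "status"] %in% c(0, 1))
)
```

The row names should match the row names of the expression matrix so that each patient's survival record lines up with the corresponding expression profile.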

Use the getPCGroups function to get the human protein complexes from PCGroups. Note that the parameter Type should be consistent with the gene names in Exp.

# get human protein complexes
PC.Human <- getPCGroups(Groups = PCGroups, Organism = "Human",
                        Type = "EntrezID")
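
A quick way to see whether the identifier type matches is to check how many complex members appear among the column names of the expression matrix. The sketch below uses toy objects (toy.x and toy.complexes are hypothetical, with made-up identifiers) and assumes complexes are represented as character vectors of member identifiers; the actual structure returned by getPCGroups may differ.

```r
# Toy expression matrix whose columns carry made-up identifiers.
set.seed(1)
toy.x <- matrix(rnorm(12), nrow = 3,
                dimnames = list(paste0("S", 1:3), c("10", "20", "30", "40")))

# Toy protein complexes given as character vectors of member identifiers.
toy.complexes <- list(
  complexA = c("10", "20", "99"),
  complexB = c("30", "77")
)

# Fraction of each complex's members that are present as columns of toy.x.
coverage <- sapply(toy.complexes, function(members) {
  mean(members %in% colnames(toy.x))
})
coverage
```

A coverage close to zero for every complex would suggest that Type does not match the naming scheme of the expression matrix.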

In order to train and test the predictive performance of the PCLasso model, we divide the data set into a training set and a test set.

set.seed(20150122)
idx.train <- sample(nrow(x), round(nrow(x)*2/3))
x.train <- x[idx.train,]
y.train <- y[idx.train,]
x.test <- x[-idx.train,]
y.test <- y[-idx.train,]
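
The split above is a simple random one. When the proportion of events should be similar in both subsets, a stratified variant is one option. The following is a minimal sketch on a made-up status vector; toy.status and idx.strat are hypothetical names, and stratifying is an assumption about what a user may want, not something the package requires.

```r
set.seed(1)
toy.status <- rep(c(0, 1), times = c(60, 30))   # made-up censoring indicator

# Sample two thirds of the indices within each status level separately.
idx.by.status <- split(seq_along(toy.status), toy.status)
idx.strat <- unlist(lapply(idx.by.status, function(idx) {
  idx[sample.int(length(idx), round(length(idx) * 2 / 3))]
}), use.names = FALSE)

# Event proportion in the training part and in the held-out part.
train.rate <- mean(toy.status[idx.strat])
test.rate  <- mean(toy.status[-idx.strat])
c(train = train.rate, test = test.rate)
```

Both proportions are equal here because the group sizes divide evenly; with real data they will only be approximately equal.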

We usually use cv.PCLasso instead of PCLasso to train the model, because cv.PCLasso helps us choose the best \(\lambda\) through k-fold cross validation.
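
To illustrate the idea of choosing \(\lambda\) by k-fold cross validation, here is a self-contained sketch in base R. It deliberately swaps in ridge regression on simulated data, because that model has a closed-form solution; it is not the algorithm used inside cv.PCLasso, and all object names are hypothetical.

```r
set.seed(1)
n <- 60; p <- 5
X  <- matrix(rnorm(n * p), n, p)
yv <- drop(X %*% c(2, -1, 0, 0, 0) + rnorm(n))   # simulated response

lambdas <- c(0.01, 0.1, 1, 10, 100)   # candidate penalty values
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))   # random fold labels

# Mean squared prediction error on held-out folds, one value per lambda.
cve <- sapply(lambdas, function(lam) {
  mean(sapply(seq_len(k), function(f) {
    tr <- fold != f
    # Ridge estimate fitted on the k - 1 training folds.
    b <- solve(crossprod(X[tr, ]) + lam * diag(p),
               crossprod(X[tr, ], yv[tr]))
    mean((yv[!tr] - X[!tr, ] %*% b)^2)
  }))
})

# The selected lambda is the candidate with the smallest error.
lambda.min <- lambdas[which.min(cve)]
```

The same pattern (fit on k - 1 folds, score on the remaining fold, average, take the minimum) underlies the lambda.min value reported later in this section, although the error measure for a survival model differs.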

Train the PCLasso model based on the training set data:

# fit cv.PCLasso model
cv.fit1 <- cv.PCLasso(x = x.train, y = y.train, group = PC.Human, nfolds = 5)

cv.fit1 is a list that contains a cv.grpsurv object cv.fit and a list of detected protein complexes complexes.dt. complexes.dt contains the proteins that are present in the expression matrix x.train and are used for model training.

We can visualize the norm of the protein complexes by executing the plot function:

# plot the norm of each group
plot(cv.fit1, norm = TRUE)

Each curve in the figure corresponds to a group (protein complex) and shows the path of that complex's norm as \(\lambda\) varies.
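
As a worked illustration of what a group norm is, the sketch below computes the Euclidean norm of the coefficients within each group for a toy coefficient vector. It assumes the plotted norm is the Euclidean one, as is usual for group lasso penalties, and uses non-overlapping toy groups; beta and complexes are hypothetical objects.

```r
# Toy coefficients for five made-up genes.
beta <- c(g1 = 3, g2 = 4, g3 = 0, g4 = 0, g5 = -2)

# Toy grouping of the genes into three complexes.
complexes <- list(
  complexA = c("g1", "g2"),
  complexB = c("g3", "g4"),
  complexC = "g5"
)

# Euclidean norm of the coefficients belonging to each complex.
group.norm <- sapply(complexes, function(members) sqrt(sum(beta[members]^2)))
group.norm
#> complexA complexB complexC 
#>        5        0        2
```

A complex whose norm is exactly zero has all of its coefficients at zero, which is what it means for the complex not to be selected.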

Visualize the coefficients:

# plot the individual coefficients
plot(cv.fit1, norm = FALSE)

Each curve in the figure corresponds to a variable (gene/protein) and shows the path of that variable's coefficient as \(\lambda\) varies.

The optimal \(\lambda\) value and a plot of the cross-validation error can be obtained to help evaluate the model.

# plot the cross-validation error (deviance)
plot(cv.fit1, type = "cve")

In this plot, the vertical line shows where the cross-validation error curve hits its minimum. The optimal \(\lambda\) can be obtained:

cv.fit1$cv.fit$lambda.min
#> [1] 0.06767398

We can check the selected protein complexes (risk protein complexes) in our model.

# Selected protein complexes at lambda.min
sel.groups <- predict(object = cv.fit1, type="groups",
                      lambda = cv.fit1$cv.fit$lambda.min)
# For values of lambda not in the sequence of fitted models, linear
# interpolation is used.
sel.groups <- predict(object = cv.fit1, type="groups",
                      lambda = c(0.1, 0.05))
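
The comment about linear interpolation can be made concrete with a small base R sketch. It interpolates a toy coefficient path between neighbouring values of a fitted \(\lambda\) sequence; this shows the general idea only and is not the package's internal code. lambda.seq, beta.path and interp_coef are hypothetical names.

```r
# Toy sequence of fitted lambda values (decreasing) and coefficient path:
# one row per gene, one column per lambda value.
lambda.seq <- c(0.20, 0.10, 0.05)
beta.path  <- rbind(g1 = c(0, 0.4, 0.6),
                    g2 = c(0, 0.0, 0.2))

# Linearly interpolate every gene's coefficient at a new lambda value.
interp_coef <- function(lambda.new) {
  apply(beta.path, 1, function(b) {
    approx(x = lambda.seq, y = b, xout = lambda.new)$y
  })
}

# 0.075 lies halfway between the fitted values 0.10 and 0.05.
interp_coef(0.075)
#>  g1  g2 
#> 0.5 0.1
```

Values of \(\lambda\) outside the fitted range cannot be interpolated this way; approx returns NA for them by default.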

Check the number of risk protein complexes:

# The number of risk protein complexes at lambda.min
sel.ngroups <- predict(object = cv.fit1, type="ngroups",
                       lambda = cv.fit1$cv.fit$lambda.min)
# For values of lambda not in the sequence of fitted models, linear
# interpolation is used.
sel.ngroups <- predict(object = cv.fit1, type="ngroups",
                       lambda = c(0.1, 0.05))

Check the norms of the protein complexes:

# The coefficients of protein complexes at lambda.min
groups.norm <- predict(object = cv.fit1, type="coefficients",
                       lambda = cv.fit1$cv.fit$lambda.min)
# For values of lambda not in the sequence of fitted models, linear
# interpolation is used.
groups.norm <- predict(object = cv.fit1, type="coefficients",
                       lambda = c(0.1, 0.05))

Check the selected covariates (risk individual genes/proteins) in our model:

# Selected genes/proteins at lambda.min
sel.vars <- predict(object = cv.fit1, type="vars",
                    lambda=cv.fit1$cv.fit$lambda.min)
# For values of lambda not in the sequence of fitted models, linear
# interpolation is used.
sel.vars <- predict(object = cv.fit1, type="vars",
                    lambda=c(0.1, 0.05))

Check the number of risk individual genes/proteins:

# The number of risk genes/proteins at lambda.min
sel.nvars <- predict(object = cv.fit1, type="nvars",
                     lambda=cv.fit1$cv.fit$lambda.min)
# For values of lambda not in the sequence of fitted models, linear
# interpolation is used.
sel.nvars <- predict(object = cv.fit1, type="nvars",
                     lambda=c(0.1, 0.05))

Due to the overlap of protein complexes, there may be duplicates in the above risk genes/proteins. Use the following command to remove duplication:

# Selected genes/proteins at lambda.min
sel.vars <- predict(object = cv.fit1, type="vars.unique",
                    lambda=cv.fit1$cv.fit$lambda.min)
# For values of lambda not in the sequence of fitted models, linear
# interpolation is used.
sel.vars <- predict(object = cv.fit1, type="vars.unique",
                    lambda=c(0.1, 0.05))
# The number of risk genes/proteins at lambda.min
sel.nvars <- predict(object = cv.fit1, type="nvars.unique",
                     lambda=cv.fit1$cv.fit$lambda.min)
# For values of lambda not in the sequence of fitted models, linear
# interpolation is used.
sel.nvars <- predict(object = cv.fit1, type="nvars.unique",
                     lambda=c(0.1, 0.05))
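
Why overlap produces duplicates can be seen on a toy example. The sketch below uses made-up complexes and gene names (risk.complexes is a hypothetical object, not package output) and shows the difference between counting members per complex and counting distinct genes.

```r
# Two made-up complexes that share one member, geneB.
risk.complexes <- list(
  complexA = c("geneA", "geneB"),
  complexB = c("geneB", "geneC")
)

# Listing members complex by complex counts the shared gene twice.
all.members <- unlist(risk.complexes, use.names = FALSE)
length(all.members)
#> [1] 4

# Removing duplicates gives the distinct genes.
unique.members <- unique(all.members)
unique.members
#> [1] "geneA" "geneB" "geneC"

# Genes that belong to more than one complex.
shared <- unique(all.members[duplicated(all.members)])
shared
#> [1] "geneB"
```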

The fitted PCLasso model can be used to predict the survival risk of new patients:

# predict risk scores of samples in x.test
s <- predict(object = cv.fit1, x = x.test, type="link",
             lambda=cv.fit1$cv.fit$lambda.min)
# For values of lambda not in the sequence of fitted models, linear
# interpolation is used.
s <- predict(object = cv.fit1, x = x.test, type="link",
             lambda=c(0.1, 0.05))
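
One common way to judge how well such risk scores order patients in a test set is the concordance index (C-index). The function below is a simplified base R sketch: it skips pairs with tied times, and c_index is a hypothetical helper, not part of PCLassoReg. Established implementations exist, for example in the survival package.

```r
# Simplified concordance index: among comparable pairs, the share in which
# the patient with the shorter observed failure time has the higher risk.
c_index <- function(time, status, risk) {
  concordant <- 0
  comparable <- 0
  n <- length(time)
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      if (time[i] == time[j]) next            # tied times: skipped here
      short <- if (time[i] < time[j]) i else j
      long  <- if (short == i) j else i
      if (status[short] != 1) next            # shorter time censored
      comparable <- comparable + 1
      if (risk[short] > risk[long]) {
        concordant <- concordant + 1
      } else if (risk[short] == risk[long]) {
        concordant <- concordant + 0.5        # tied risk counts as half
      }
    }
  }
  concordant / comparable
}

# Made-up example: higher risk scores go with shorter times.
toy.time   <- c(1, 2, 3, 4)
toy.status <- c(1, 1, 0, 1)
c_index(toy.time, toy.status, risk = c(4, 3, 2, 1))
#> [1] 1
```

A value of 0.5 corresponds to random ordering and 1 to perfect ordering; with real data the function would be applied to the times and status indicators of the test set together with the predicted risk scores.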

3.2 PCLasso2

The PCLasso2 model accepts a gene/protein expression matrix, a response vector, and protein complexes for training the classification model. We load a set of data created beforehand for illustration. Users can either load their own data or use those saved in the workspace.

# load data
data(classData)
data(PCGroups)

x <- classData$Exp
y <- classData$Label

These commands load a list classData, which contains a protein expression matrix Exp and the class labels Label for the patients in Exp, and a data frame PCGroups containing the protein complexes downloaded from CORUM (https://mips.helmholtz-muenchen.de/corum/).

Use the getPCGroups function to get the human protein complexes from PCGroups. Note that the parameter Type should be consistent with the gene names in Exp.

# get human protein complexes
PC.Human <- getPCGroups(Groups = PCGroups, Organism = "Human",
                        Type = "GeneSymbol")

In order to train and test the predictive performance of the PCLasso2 model, we divide the data set into a training set and a test set.

set.seed(20150122)
idx.train <- sample(nrow(x), round(nrow(x)*2/3))
x.train <- x[idx.train,]
y.train <- y[idx.train]
x.test <- x[-idx.train,]
y.test <- y[-idx.train]

We usually use cv.PCLasso2 instead of PCLasso2 to train the model, because cv.PCLasso2 helps us choose the best \(\lambda\) through k-fold cross validation.

Train the PCLasso2 model based on the training set data:

cv.fit2 <- cv.PCLasso2(x = x.train, y = y.train, group = PC.Human,
                       penalty = "grLasso", family = "binomial", nfolds = 10)

cv.fit2 is a list that contains a cv.grpreg object cv.fit and a list of detected protein complexes complexes.dt. complexes.dt contains the proteins that are present in the expression matrix x.train and are used for model training.

We can visualize the norm of the protein complexes by executing the plot function:

# plot the norm of each group
plot(cv.fit2, norm = TRUE)

Each curve in the figure corresponds to a group (protein complex) and shows the path of that complex's norm as \(\lambda\) varies.

Visualize the coefficients:

# plot the individual coefficients
plot(cv.fit2, norm = FALSE)

Each curve in the figure corresponds to a variable (gene/protein) and shows the path of that variable's coefficient as \(\lambda\) varies.

The optimal \(\lambda\) value and a plot of the cross-validation error can be obtained to help evaluate the model.

# plot the cross-validation error (deviance)
plot(cv.fit2, type = "cve")

In this plot, the vertical line shows where the cross-validation error curve hits its minimum. The optimal \(\lambda\) can be obtained:

cv.fit2$cv.fit$lambda.min
#> [1] 0.01601148

We can check the selected protein complexes (risk protein complexes) in our model.

# Selected protein complexes at lambda.min
sel.groups <- predict(object = cv.fit2, type="groups",
                      lambda = cv.fit2$cv.fit$lambda.min)
# For values of lambda not in the sequence of fitted models, linear
# interpolation is used.
sel.groups <- predict(object = cv.fit2, type="groups",
                      lambda = c(0.1, 0.05))

Check the number of risk protein complexes:

# The number of risk protein complexes at lambda.min
sel.ngroups <- predict(object = cv.fit2, type="ngroups",
                       lambda = cv.fit2$cv.fit$lambda.min)
# For values of lambda not in the sequence of fitted models, linear
# interpolation is used.
sel.ngroups <- predict(object = cv.fit2, type="ngroups",
                       lambda = c(0.1, 0.05))

Check the norms of the protein complexes:

# The coefficients of protein complexes at lambda.min
groups.norm <- predict(object = cv.fit2, type="coefficients",
                       lambda = cv.fit2$cv.fit$lambda.min)
# For values of lambda not in the sequence of fitted models, linear
# interpolation is used.
groups.norm <- predict(object = cv.fit2, type="coefficients",
                       lambda = c(0.1, 0.05))

Check the selected covariates (risk individual genes/proteins) in our model:

# Selected genes/proteins at lambda.min
sel.vars <- predict(object = cv.fit2, type="vars",
                    lambda=cv.fit2$cv.fit$lambda.min)
# For values of lambda not in the sequence of fitted models, linear
# interpolation is used.
sel.vars <- predict(object = cv.fit2, type="vars",
                    lambda=c(0.1, 0.05))

Check the number of risk individual genes/proteins:

# The number of risk genes/proteins at lambda.min
sel.nvars <- predict(object = cv.fit2, type="nvars",
                     lambda=cv.fit2$cv.fit$lambda.min)
# For values of lambda not in the sequence of fitted models, linear
# interpolation is used.
sel.nvars <- predict(object = cv.fit2, type="nvars",
                     lambda=c(0.1, 0.05))

Due to the overlap of protein complexes, there may be duplicates in the above risk genes/proteins. Use the following command to remove duplication:

# Selected genes/proteins at lambda.min
sel.vars <- predict(object = cv.fit2, type="vars.unique",
                    lambda=cv.fit2$cv.fit$lambda.min)
# For values of lambda not in the sequence of fitted models, linear
# interpolation is used.
sel.vars <- predict(object = cv.fit2, type="vars.unique",
                    lambda=c(0.1, 0.05))
# The number of risk genes/proteins at lambda.min
sel.nvars <- predict(object = cv.fit2, type="nvars.unique",
                     lambda=cv.fit2$cv.fit$lambda.min)
# For values of lambda not in the sequence of fitted models, linear
# interpolation is used.
sel.nvars <- predict(object = cv.fit2, type="nvars.unique",
                     lambda=c(0.1, 0.05))

The fitted PCLasso2 model can be used to predict the probability that a sample is a tumor sample:

# predict probabilities of samples in x.test
s <- predict(object = cv.fit2, x = x.test, type="response",
             lambda=cv.fit2$cv.fit$lambda.min)
# For values of lambda not in the sequence of fitted models, linear
# interpolation is used.
s <- predict(object = cv.fit2, x = x.test, type="response",
             lambda=c(0.1, 0.05))

Predict the class labels of new samples:

# predict class labels of samples in x.test
s <- predict(object = cv.fit2, x = x.test, type="class",
             lambda=cv.fit2$cv.fit$lambda.min)
# For values of lambda not in the sequence of fitted models, linear
# interpolation is used.
s <- predict(object = cv.fit2, x = x.test, type="class",
             lambda=c(0.1, 0.05))
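
How probabilities and class labels relate in a logistic model can be sketched in base R. The example assumes labels coded as 0/1 and a cutoff of 0.5, and it works on made-up linear predictor values; eta, prob, pred.class and truth are hypothetical objects rather than output of the package.

```r
# Made-up linear predictor values for four test samples.
eta <- c(-2.0, -0.3, 0.4, 1.5)

# The logistic function maps the linear predictor to a probability.
prob <- plogis(eta)

# Assign class 1 when the probability exceeds 0.5 (equivalently, eta > 0).
pred.class <- as.integer(prob > 0.5)
pred.class
#> [1] 0 0 1 1

# Compare with made-up true labels.
truth <- c(0, 1, 1, 1)
table(truth = truth, predicted = pred.class)
accuracy <- mean(pred.class == truth)
accuracy
#> [1] 0.75
```

With real data, the same comparison would be made between the predicted class labels for x.test and the held-out labels y.test.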

3.3 Other penalties

In addition to “grLasso”, two other penalty functions, “grSCAD” and “grMCP”, can be used to train PCLasso and PCLasso2 models. Their penalty for large coefficients is smaller than that of “grLasso”, so they tend to select fewer risk protein complexes. Note that these two penalty functions take an additional parameter gamma.
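
To see how the penalties differ for large coefficients, the sketch below evaluates the scalar lasso, MCP and SCAD penalty functions as they are commonly defined in the literature. In the group setting the penalty is applied to the norm of each group rather than to a single coefficient. The function names and the gamma values used here (3 for MCP, 4 for SCAD) are assumptions chosen for illustration.

```r
# Lasso penalty: grows linearly without bound.
lasso_pen <- function(theta, lambda) lambda * abs(theta)

# MCP: tapers off and becomes constant once |theta| exceeds gamma * lambda.
mcp_pen <- function(theta, lambda, gamma = 3) {
  a <- abs(theta)
  ifelse(a <= gamma * lambda,
         lambda * a - a^2 / (2 * gamma),
         gamma * lambda^2 / 2)
}

# SCAD: equals the lasso up to lambda, then tapers, then becomes constant.
scad_pen <- function(theta, lambda, gamma = 4) {
  a <- abs(theta)
  ifelse(a <= lambda,
         lambda * a,
         ifelse(a <= gamma * lambda,
                (2 * gamma * lambda * a - a^2 - lambda^2) / (2 * (gamma - 1)),
                lambda^2 * (gamma + 1) / 2))
}

# Compare the three penalties on a grid of coefficient sizes.
theta <- c(0.5, 1, 2, 5, 10)
pen <- cbind(theta = theta,
             lasso = lasso_pen(theta, lambda = 1),
             MCP   = mcp_pen(theta, lambda = 1),
             SCAD  = scad_pen(theta, lambda = 1))
round(pen, 3)
```

For the largest values of theta the MCP and SCAD columns stay flat while the lasso column keeps growing. A larger gamma delays the point at which the penalty levels off, which makes both penalties behave more like the lasso.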

Train the PCLasso model:

# load data
data(survivalData)
data(PCGroups)

x <- survivalData$Exp
y <- survivalData$survData

PC.Human <- getPCGroups(Groups = PCGroups, Organism = "Human",
                        Type = "EntrezID")

# fit PCSCAD model
fit.PCSCAD <- PCLasso(x, y, group = PC.Human, penalty = "grSCAD", gamma = 6)

# fit PCMCP model
fit.PCMCP <- PCLasso(x, y, group = PC.Human, penalty = "grMCP", gamma = 5)

Train the PCLasso2 model:

# load data
data(classData)
data(PCGroups)

x <- classData$Exp
y <- classData$Label

PC.Human <- getPCGroups(Groups = PCGroups, Organism = "Human",
                        Type = "GeneSymbol")

# fit PCSCAD model
fit.PCSCAD2 <- PCLasso2(x, y, group = PC.Human, penalty = "grSCAD",
                        family = "binomial", gamma = 10)

# fit PCMCP model
fit.PCMCP2 <- PCLasso2(x, y, group = PC.Human, penalty = "grMCP",
                       family = "binomial", gamma = 9)

The other functions are used in the same way as shown above for the PCLasso and PCLasso2 models.
