This vignette gives the instructions on how to fit the Bernstein polynomial (BP) based survival regression models using the unprecedent routines implemented in the spsurv
package. In addition to this vignette, you can have access to a full description of the methodology (MSc dissertation) available at arXiv, check it out https://arxiv.org/abs/2003.10548. The spsurv::spbp
function is the main routine of this package, as it allows to fit all BP related survival regression approaches presented in the dissertation. The acronym spbp
refers to semi-parametric Bernstein polynomial based regression. The formula
argument in spsurv::spbp
makes use of the same structure available at the survival
package in order to provide a familiar environment to the public. The development version can be found on github.com/rvpanaro/spsurv, it can be installed using:
library("devtools")
devtools::install_github("rvpanaro/spsurv")
The spsurv
package imports specific routines that provide the necessary support for internal calculations. During the installation, other dependencies are required, such as survival
, loo
, coda
, rstan
and MASS
packages.
The target data set is passed to the spsurv::spbp
function through a mandatory data.frame
object class. Also, it is possible to switch between bayesian
and mle
approaches through the approach
argument and between the ph
, po
and aft
frameworks using the model
argument. Naturally, prior choices are ignored if the approach
argument is set to mle
referring to maximum likelihood (ML) estimation; a warning is displayed in this case. In addition, consider extra arguments that may be passed directly to Stan
software to apply rstan::optimizing
(if ML method), or rstan::sampling
if Markov chain Monte Carlo (MCMC) method, function control options. Both methods were applied under the rstan
interface, see https://mc-stan.org/users/interfaces/rstan.
As mentioned in Chapter 4, the polynomial degree (highest basis order) can be chosen arbitrarily. In particular, the polynomial degree must be greater than zero, the default value of the polynomial degree \(m = \sqrt{n}\) is rounded to the greatest integer. Note that, the domain restriction for the BP, referred to as \(\tau\) here, is defined internally, see the discussion in Chapter 5. The reason for not allowing a user-defined \(\tau\) is to avoid an improper definition that would cause computing problems.
Considering the variety of settings that Stan
can provide and the modeling options above, we believe that the package is flexible regarding user-defined applications of the BP based models. Beyond that, a class, namely spbp
was created to extend some S3 methods that meet the R
community need for printing, summarizing, and plotting functions. Accordingly, we had developed S3 methods extensions to accomplish these tasks such as the spsurv::print.spbp
, spsurv::summary.spbp
, spsurv::model.matrix.spbp
and other summary printing extensions such as print.summary.bpph.bayes
.
Further, there are some coding instructions on how to fit the semi-parametric models: Bernstein based proportional hazards (BPPH), Bernstein based proportional odds (BPPO), and Bernstein based accelerated failure time (BPAFT) models for right-censored data, under the Bayesian or frequentist approach. In the Bayesian perspective, Normal prior distributions are attributed to the regression coefficients while Log-Normal, Gamma, or Inverse Gamma can be attributed to the BP parameters. The default arguments for the spbp
functions were set as follows:
spbp.default <-
spbp(formula, degree, data,
approach = c("mle", "bayes"),
model = c("ph", "po", "aft"),
priors = list(beta = c("normal(0,4)"),
gamma = "lognormal(0,10)"),
scale = TRUE,
...)
Consider formula
an object of class formula, with the Surv
object (survival
package) for right-censored time-to-event data on the left side of \(\sim\) and the explanatory terms on the right; degree
for the integer value of the BP degree, non-integer values are rounded to the greatest valid degree; data
for a mandatory data.frame
object with variables named in the formula; approach
for either Bayesian or ML estimation methods, default is bayes
; model
proportional hazards (PH), proportional odds (PO) or accelerated failure time (AFT) for the modeling classes discussed in Chapter 2, default is ph
; priors
list of prior settings, which is ignored when mle
, and scale
for a logical value that indicates whether to apply the standardization discussed in Chapter 5. Recall that extra arguments can be passed to rstan::sampling
(e.g. iter
, chains
, init
), more details in https://mc-stan.org/users/documentation/.
Most statistical packages about survival regression returns an ANOVA table. In this sense, the object of class spbp
follows the design provided in the survival
package. The output corresponding to the ANOVA table can be obtained with:
library("KMsurv")
data("larynx")
library(spsurv)
fit <- spsurv::spbp(Surv(time, delta)~ age + factor(stage),
approach = "mle", data = larynx)
summary(fit)
## Bernstein Polynomial based Proportional Hazards model
## Call:
## spbp.default(formula = Surv(time, delta) ~ age + factor(stage),
## data = larynx, approach = "mle", model = "ph")
##
## n= 90, number of events= 50
##
## coef exp(coef) se(coef) z Pr(>|z|)
## age 0.0193 1.0195 0.0144 1.34 0.180
## factor(stage)2 0.1720 1.1876 0.4626 0.37 0.710
## factor(stage)3 0.6585 1.9318 0.3556 1.85 0.064 .
## factor(stage)4 1.7991 6.0442 0.4288 4.20 2.7e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Likelihood ratio test= 19.6 on 4 df, p=6e-04
## Wald test = 22.6 on 4 df, p=2e-04
One can reproduce this example by copying and pasting the indicated code in the R
console. The output is as follows:
library("KMsurv")
data("larynx")
library(spsurv)
fit <- spsurv::spbp(Surv(time, delta)~age + factor(stage),
approach = "mle", data = larynx)
Consider that coef
refers to the ML point estimates; exp(coef)
is the point estimate for the hazard ratio; se(coef)
represents the standard errors; z
is the test statistic for the Wald test and p
is the p-value of the Wald test. The estimated BP parameters, the value of the evaluated log-likelihood of the null (reference) model and the stan
object can be obtained having access to the spbp
class object elements. Moreover, apart from the fit
object, it is also possible to obtain the matrix of covariates, the covariance matrix and the likelihood value. This can be done using the following code:
fit$coefficients
## age factor(stage)2 factor(stage)3 factor(stage)4 gamma1
## 1.926946e-02 1.719545e-01 6.584531e-01 1.799095e+00 1.687151e-02
## gamma2 gamma3 gamma4 gamma5 gamma6
## 4.164110e-02 8.170170e-50 1.401060e-02 4.677284e-02 9.401934e-34
## gamma7 gamma8 gamma9 gamma10
## 1.331014e-01 1.189283e-62 4.517613e-112 3.128413e-131
head(model.matrix(fit))
## age factor(stage)2 factor(stage)3 factor(stage)4
## 1 77 0 0 0
## 2 53 0 0 0
## 3 45 0 0 0
## 4 57 0 0 0
## 5 58 0 0 0
## 6 51 0 0 0
diag(fit$var)
## age factor(stage)2 factor(stage)3 factor(stage)4
## 0.0002064355 0.2139848904 0.1264739037 0.1838831112
fit$loglik
## [1] -149.8360 -140.0512
From the Bayesian point of view, the spbp
class contains posterior summary statistics such as the mode, median, mean and standard deviation, along with 95% HPD interval based on the posterior density. Note that the arguments passed after data
are considered Stan
specific control parameters. For instance, the argument chain
allows to choose the number of chains in the MCMC . Other settings such as iter
and warmup
are also flexible and might be set at convenience. The user can simply type in the R
console the code
??rstan::sampling
" for help. The following R
console outcome refers to the Bayesian estimation for the larynx
data set:
fit <- spsurv::spbp(Surv(time, delta)~age + factor(stage),
approach = "bayes", data = larynx,
iter = 2000, chains = 1, warmup = 1000, cores = 1)
##
## SAMPLING FOR MODEL 'spbp' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 3.5e-05 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.35 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 1: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 1: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 1: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 1: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 1: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 1: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 1: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 1: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 1: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 1: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 1: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 2.45794 seconds (Warm-up)
## Chain 1: 2.42382 seconds (Sampling)
## Chain 1: 4.88176 seconds (Total)
## Chain 1:
## Warning: Relative effective sample sizes ('r_eff' argument) not specified.
## For models fit with MCMC, the reported PSIS effective sample sizes and
## MCSE estimates will be over-optimistic.
## Warning: Some Pareto k diagnostic values are slightly high. See help('pareto-k-diagnostic') for details.
## Warning:
## 3 (3.3%) p_waic estimates greater than 0.4. We recommend trying loo instead.
As with the ML estimation, the summary method is extended to the spsurv::spbp
class when applying to a Bayesian fit. Along with the regression estimates, this output also contains descriptive statistics for the posterior hazard ratio denoted by _exp
(in the console output) and the diagnosis statistics from the loo
package. The effective sample size n_eff
gives an estimate of the independent draws from the posterior distribution, and Rhat
referred to as the potential scale reduction statistic, is one of the useful ways to monitor whether a chain has converged to the equilibrium distribution. This statistic measures the ratio between the average variation of the samples within each chain and the variation of the combined samples in the chains; if the chains have not converged to a common distribution, this statistic will be greater than one. It is worth noting that all, the information provided by the Stan
output, including warnings, is passed to the final user. One can have access to the stanfit
object with the fit$stanfit
command. In particular, one can have access to built-in plot functions and even to a shiny
app (details in https://shiny.rstudio.com/ developed by Stan
developer’s team. The summary outcome is as follows:
summary(fit)
## Bayesian Bernstein Polynomial based Proportional Hazards model
##
## Call:
## spbp.default(formula = Surv(time, delta) ~ age + factor(stage),
## data = larynx, approach = "bayes", cores = 1, iter = 2000,
## chains = 1, warmup = 1000, model = "ph")
##
## n= 0, number of events= 0
##
## mode mean se_mean sd 50% n_eff Rhat lowerHPD
## age 0.0213 0.0193 0.000491 0.0147 0.0196 900 0.999 -0.00944
## factor(stage)2 0.1837 0.1217 0.017673 0.4480 0.1400 643 1.000 -0.80166
## factor(stage)3 0.5358 0.6772 0.012475 0.3366 0.6602 728 0.999 0.04247
## factor(stage)4 1.6716 1.7815 0.013653 0.4082 1.7603 894 0.999 0.93321
## upperHPD
## age 0.0472
## factor(stage)2 0.9489
## factor(stage)3 1.3174
## factor(stage)4 2.5377
## ---
## mean_exp median_exp sd_exp lowerHPD_exp upperHPD_exp
## age 1.02 1.02 0.015 0.991 1.05
## factor(stage)2 1.25 1.15 0.594 0.344 2.31
## factor(stage)3 2.08 1.94 0.721 0.951 3.53
## factor(stage)4 6.46 5.81 2.775 2.114 11.76
## ---
##
## Computed from 1000 by 90 log-likelihood matrix
##
## Estimate SE
## elpd_waic -149.6 9.6
## p_waic 9.3 1.0
## waic 299.2 19.1
##
## 3 (3.3%) p_waic estimates greater than 0.4. We recommend trying loo instead.
##
## Computed from 1000 by 90 log-likelihood matrix
##
## Estimate SE
## elpd_loo -149.7 9.6
## p_loo 9.4 1.0
## looic 299.4 19.1
## ------
## Monte Carlo SE of elpd_loo is 0.1.
##
## Pareto k diagnostic values:
## Count Pct. Min. n_eff
## (-Inf, 0.5] (good) 88 97.8% 534
## (0.5, 0.7] (ok) 2 2.2% 306
## (0.7, 1] (bad) 0 0.0% <NA>
## (1, Inf) (very bad) 0 0.0% <NA>
##
## All Pareto k estimates are ok (k < 0.7).
## See help('pareto-k-diagnostic') for details.
The next code chunk shows the code for trace and density plotting and to give access to the shiny app from the shinystan package shinystan
. Figures illustrate the trace plot and the density plot of the BPPH for the larynx
data set. The graphs show unimodal posterior densities and well behaved chains with good mixing, this is a good behavior indication.
rstan::traceplot(fit$stanfit, pars = c("beta", "gamma"))
rstan::stan_dens(fit$stanfit, pars = c("beta", "gamma"))
shinystan::launch_shinystan(fit$stanfit)
Not least, a S3 method had to be created rather than extended. The survivor
method was created to accomplish the calculation of the survival function evaluated in each time point. The goal is similar to the survival::survfit
S3 method, that could be extended instead. The difference is that spbp
classes allows both Bayesian and frequentist approaches. The following code was used to generate Figure:
## CoxPH model
fitcoxph <- survival::coxph(Surv(time , delta)~age + factor(stage),
data = larynx)
## Determine the groups of patients
newdata <- data.frame(age =c(77,77,77,77), stage = c(1,2,3,4))
## survfit Breslow estimator
breslowsurv <- survival::survfit(fitcoxph, newdata = newdata)
## spbp point-wise estimate
spbpsurv <- spsurv::survivor(fit, newdata = newdata)
plot(breslowsurv, bty = "n", lwd = 3, main = "77 years old patient survival per Stage")
points(spbpsurv$time, spbpsurv$survival1, col = 1, pch = 23)
points(spbpsurv$time, spbpsurv$survival2, col = 2, pch = 23)
points(spbpsurv$time, spbpsurv$survival3, col = 3, pch = 23)
points(spbpsurv$time, spbpsurv$survival4, col = 4, pch = 23)
legend("topright", c("Stage I", "Stage II", "Stage III", "Stage IV"), pch = 23, bty = "n", col = 1:4)
The content of this vignette introduces the spsurv
package. Here, the analysis was dedicated to illustrating, in practice, the commands implemented in the proposed package spsurv. We still have work to do to improve and update this tool, however, the present version is ready for the main statistical study in the field of survival analysis. The routines presented in this dissertation are unprecedented. Many efforts with regard to the instruction the routines documentation were carried out concurrently with the spsurv package implementation. This document is part of the content submitted to CRAN. The package is also in public use and is available at the github development platform, the link is: https://github.com/rvpanaro/spsurv.