The QHScrnomo
package provides functions for fitting,
validating, and generating predictions from competing risks regression
models, with an emphasis on nomograms. Nomograms
allow multivariable models to be translated into a diagram that can be
easily implemented into clinical workflows for healthcare practitioners
to quickly provide individualized risk estimates for patient outcomes of
interest (e.g., death from cancer) while accounting for potential
competing events (e.g, death from other causes) to help aid clinical
decision making. This vignette will walk through an example workflow for
constructing a nomogram.
First, we’ll load the package:
We’ll be using the prostate.dat
data set throughout.
This is an artificial data set meant to represent patients with prostate
cancer undergoing 1 of 3 treatments. The patients are followed for some
period of time until they experience death from prostate cancer (the
event of interest), death from other causes, or they are still
alive and therefore censored. The data set consists of 2000
observations, each representing a patient, with the following 9
columns:
UNIQUEID
: Unique patient identifierTX
: Treatment received for prostate cancer
(EBRT
, PI
, or RP
)PSA
: Pre-treatment prostate-specific antigen (PSA)
levels (ng/mL)BX_GLSN_CAT
: Biopsy Gleason
Score. A factor with levels 1 (score 2-6), 2 (score 7), and 3 (score
8-10)CLIN_STG
: Clinical stage of cancer (T1
,
T2
, T3
)AGE
: Age of patient on the date of treatmentRACE_AA
: Patient race where 1
= African
American, and 0
is OtherTIME_EVENT
: Event time or follow-up timeEVENT_DOD
: Event status. 1
= patient died
of prostate cancer, 2
= patient died of other causes,
0
= patient did not die and therefore was censoredstr(prostate.dat)
#> 'data.frame': 2000 obs. of 9 variables:
#> $ UNIQID : int 23982340 17299867 35653287 34821801 24377202 20243570 25179226 30671082 26178029 35519165 ...
#> $ TX : Factor w/ 3 levels "EBRT","PI","RP": 3 3 3 2 2 3 3 3 3 3 ...
#> $ PSA : num 25.2 3.3 7.6 5.6 4.6 4.72 8.2 4.6 3.7 4.23 ...
#> $ BX_GLSN_CAT: Factor w/ 3 levels "1","2","3": 3 1 1 1 1 1 1 1 1 1 ...
#> $ CLIN_STG : Factor w/ 3 levels "T1","T2","T3": 2 1 1 1 1 1 1 1 1 1 ...
#> $ AGE : num 57.9 64.6 70.6 65.4 63.8 ...
#> $ RACE_AA : num 1 0 0 0 0 0 0 0 0 0 ...
#> $ TIME_EVENT : num 136.9 87.8 18.7 10.3 7.2 ...
#> $ EVENT_DOD : num 0 1 1 0 2 0 1 2 0 0 ...
The following steps show how to build a competing risks regression model and construct a nomogram with it.
To start, we first want to create a competing risks regression model.
Our package heavily utilizes the model building procedures and utilities
in the rms
package, so the first step is to create a Cox
Proportional-Hazards model for the event of interest (death from
prostate cancer) from the rms::cph
function using the
preferred workflow:
# Register the data set
dd <- datadist(prostate.dat)
options(datadist = "dd")
# Fit the Cox-PH model for the event of interest
prostate.f <- cph(Surv(TIME_EVENT,EVENT_DOD == 1) ~ TX + rcs(PSA,3) +
BX_GLSN_CAT + CLIN_STG + rcs(AGE,3) +
RACE_AA, data = prostate.dat,
x = TRUE, y= TRUE, surv=TRUE, time.inc = 144)
prostate.f
#> Cox Proportional Hazards Model
#>
#> cph(formula = Surv(TIME_EVENT, EVENT_DOD == 1) ~ TX + rcs(PSA,
#> 3) + BX_GLSN_CAT + CLIN_STG + rcs(AGE, 3) + RACE_AA, data = prostate.dat,
#> x = TRUE, y = TRUE, surv = TRUE, time.inc = 144)
#>
#> Model Tests Discrimination
#> Indexes
#> Obs 2000 LR chi2 55.16 R2 0.029
#> Events 464 d.f. 11 R2(11,2000)0.022
#> Center -2.7576 Pr(> chi2) 0.0000 R2(11,464)0.091
#> Score chi2 56.50 Dxy 0.230
#> Pr(> chi2) 0.0000
#>
#> Coef S.E. Wald Z Pr(>|Z|)
#> TX=PI 0.5135 0.1431 3.59 0.0003
#> TX=RP 0.0594 0.1246 0.48 0.6336
#> PSA -0.0547 0.0270 -2.03 0.0425
#> PSA' 0.0884 0.0463 1.91 0.0563
#> BX_GLSN_CAT=2 0.1900 0.1137 1.67 0.0948
#> BX_GLSN_CAT=3 0.7213 0.1683 4.28 <0.0001
#> CLIN_STG=T2 -0.2888 0.1116 -2.59 0.0096
#> CLIN_STG=T3 -0.0154 0.2372 -0.06 0.9483
#> AGE -0.0429 0.0127 -3.37 0.0007
#> AGE' 0.0332 0.0158 2.10 0.0353
#> RACE_AA -0.3154 0.1429 -2.21 0.0273
Then we can call crr.fit
. This function takes the
rms::cph
fit and uses the cmprsk::crr
function
to convert it to a competing risks regression model, while maintaining
attributes from the original fit.
# Refit to a competing risks regression
prostate.crr <- crr.fit(prostate.f, cencode = 0, failcode = 1)
prostate.crr
#> convergence: TRUE
#> coefficients:
#> TX=PI TX=RP PSA PSA' BX_GLSN_CAT=2
#> 0.30480 0.06632 -0.05174 0.08788 0.12550
#> BX_GLSN_CAT=3 CLIN_STG=T2 CLIN_STG=T3 AGE AGE'
#> 0.63010 -0.26480 0.08370 -0.03062 0.01853
#> RACE_AA
#> -0.25220
#> standard errors:
#> [1] 0.13750 0.11810 0.02638 0.04513 0.11460 0.16120 0.10920 0.21500 0.01240
#> [10] 0.01593 0.14070
#> two-sided p-values:
#> TX=PI TX=RP PSA PSA' BX_GLSN_CAT=2
#> 2.7e-02 5.7e-01 5.0e-02 5.1e-02 2.7e-01
#> BX_GLSN_CAT=3 CLIN_STG=T2 CLIN_STG=T3 AGE AGE'
#> 9.3e-05 1.5e-02 7.0e-01 1.4e-02 2.4e-01
#> RACE_AA
#> 7.3e-02
We can see that the model object above is of class
cmprsk
.
We have created generic methods to assess output:
summary.cmprsk
method utilizes
summary.rms
to flexibly summarize model effects and
contrasts.summary(prostate.crr)
#> Effects Response : Surv(TIME_EVENT, EVENT_DOD == 1)
#>
#> Factor Low High Diff. Effect S.E. Lower 0.95
#> PSA 5.000 9.400 4.400 -0.145130 0.074520 -0.291190
#> Hazard Ratio 5.000 9.400 4.400 0.864910 NA 0.747380
#> AGE 58.137 69.562 11.425 -0.184680 0.077432 -0.336440
#> Hazard Ratio 58.137 69.562 11.425 0.831370 NA 0.714310
#> RACE_AA 0.000 1.000 1.000 -0.252210 0.140700 -0.527970
#> Hazard Ratio 0.000 1.000 1.000 0.777080 NA 0.589800
#> TX - EBRT:RP 3.000 1.000 NA -0.066323 0.118050 -0.297700
#> Hazard Ratio 3.000 1.000 NA 0.935830 NA 0.742530
#> TX - PI:RP 3.000 2.000 NA 0.238490 0.132460 -0.021125
#> Hazard Ratio 3.000 2.000 NA 1.269300 NA 0.979100
#> BX_GLSN_CAT - 2:1 1.000 2.000 NA 0.125530 0.114610 -0.099103
#> Hazard Ratio 1.000 2.000 NA 1.133800 NA 0.905650
#> BX_GLSN_CAT - 3:1 1.000 3.000 NA 0.630090 0.161180 0.314180
#> Hazard Ratio 1.000 3.000 NA 1.877800 NA 1.369100
#> CLIN_STG - T2:T1 1.000 2.000 NA -0.264760 0.109220 -0.478820
#> Hazard Ratio 1.000 2.000 NA 0.767390 NA 0.619510
#> CLIN_STG - T3:T1 1.000 3.000 NA 0.083704 0.214990 -0.337680
#> Hazard Ratio 1.000 3.000 NA 1.087300 NA 0.713430
#> Upper 0.95
#> 0.00092617
#> 1.00090000
#> -0.03291100
#> 0.96762000
#> 0.02355300
#> 1.02380000
#> 0.16505000
#> 1.17950000
#> 0.49809000
#> 1.64560000
#> 0.35016000
#> 1.41930000
#> 0.94600000
#> 2.57540000
#> -0.05069400
#> 0.95057000
#> 0.50509000
#> 1.65710000
anova.cmprsk
method utilizes anova.rms
to test for statistical significance of model terms.anova(prostate.crr)
#> Wald Statistics Response: Surv(TIME_EVENT, EVENT_DOD == 1)
#>
#> Factor Chi-Square d.f. P
#> TX 5.21 2 0.0739
#> PSA 3.85 2 0.1458
#> Nonlinear 3.79 1 0.0515
#> BX_GLSN_CAT 15.29 2 0.0005
#> CLIN_STG 6.88 2 0.0320
#> AGE 9.27 2 0.0097
#> Nonlinear 1.35 1 0.2445
#> RACE_AA 3.21 1 0.0730
#> TOTAL NONLINEAR 5.16 2 0.0758
#> TOTAL 44.64 11 <.0001
Obviously there is a lot of work that would go into determining that
we have the right model specifications, covariates, term specifications,
etc. We’ll assume that the current model is the one we are moving
forward with. See the cmprsk
package to learn more about
the competing risks modeling framework used here, how to interpret
parameters, etc.
Suppose we are particularly interested in accurately predicting the 10-year risk of death from prostate cancer.
First, we can use the tenf.crr
function to generate
K-fold cross-validated predictions for each observation in the
development data set. The default is to set fold=10
for
10-fold cross-validation, so we’ll leave it at that:
set.seed(123)
prostate.dat$preds.tenf <- tenf.crr(prostate.crr, time = time_of_interest)
#> 1 2 3 4 5 6 7 8 9 10
str(prostate.dat$preds.tenf)
#> num [1:2000] 0.374 0.376 0.277 0.372 0.394 ...
This vector shows us, for each patient, the estimated (out of sample) probability of death from prostate cancer given the covariates in the model, adjusting for the possibility of death from other causes as a competing event.
The cindex
function computes the concordance
index for binary, time-to-event, or competing risks outcomes. In
this case, we are interested in the competing risks version, which
utilizes the event times, and only considers pairwise comparisons
involving those with the event of interest when testing for
concordance.
For example, if Patient A experienced the event of interest at 5 years and Patient B experienced the event of interest at 10 years, the algorithm checks to see if the predicted risk for Patient A is larger than that of Patient B–in which case they would be considered concordant. There are similar comparisons if one of the patients had a different event status. However, if Patient A nor Patient B experienced the event of interest (i.e., they either experienced a competing event or were censored), they would not be compared.
cindex(
prob = prostate.dat$preds.tenf,
fstatus = prostate.dat$EVENT_DOD,
ftime = prostate.dat$TIME_EVENT,
type = "crr",
failcode = 1
)
#> N n usable concordant cindex
#> 2.000000e+03 2.000000e+03 5.652590e+05 3.228440e+05 5.711435e-01
The output displays the number of observations (in the original input
N
, and those used n
), the number of pairwise
comparisons made (usable
), the number of concordant pairs
(concordant
), and the concordance index
(cindex
), which is the proportion of the usable pairs that
were concordant. The C-index ranges from 0.50, a coin flip, to 1,
perfect discrimination. In this case, our model has relatively poor
discrimination ability.
We can also assess how calibrated our model predictions are
by comparing the predicted probabilities from the model to the actual
event rate observed. The groupci
function does this by
breaking up the input risk distribution into groups (e.g., by quantiles,
user-defined groups, etc.) and computing the observed cumulative
incidence within each group at the time point in which the predictions
were made for. The goal is for these quantities to follow 45-degree line
such that the risks produced by the models align with the true rates at
which events occur.
groupci(
x = prostate.dat$preds.tenf,
ftime = prostate.dat$TIME_EVENT,
fstatus = prostate.dat$EVENT_DOD,
g = 10, # Deciles
u = time_of_interest,
failcode = 1,
xlab = "Predicted 10-year prostate cancer-specific mortality",
ylab = "Actual 10-year prostate cancer-specific mortality"
)
#> x n events ci std.err
#> [1,] 0.2195250 200 41 0.2137948 0.03444284
#> [2,] 0.2614243 200 42 0.2594906 0.04058236
#> [3,] 0.2844747 200 45 0.2740857 0.04014348
#> [4,] 0.3045506 200 48 0.3427236 0.04718743
#> [5,] 0.3239753 200 50 0.3807385 0.04866854
#> [6,] 0.3418128 200 43 0.2773428 0.04117005
#> [7,] 0.3595614 200 48 0.3746006 0.04898192
#> [8,] 0.3808118 200 44 0.4879819 0.06502863
#> [9,] 0.4101135 200 49 0.3324923 0.04835294
#> [10,] 0.4827395 200 54 0.4596141 0.05677256
There might be some evidence that our model slightly overestimates the actual risk of 10-year prostate cancer-specific mortality for those most at risk.
Once the model is built, validated, and ready to use in practice, we
can construct the nomogram. The nomogram.crr
function takes
our original model fit and maps it to a diagram that allows it to be
printed and used in a clinical (or any other) setting to quickly
“plug-in” model covariates and generate a risk estimate.
# Set some nice display labels (also see ?Newlevels)
prostate.g <-
Newlabels(
fit = prostate.crr,
labels =
c(
TX = "Treatment options",
PSA = "PSA (ng/mL)",
BX_GLSN_CAT = "Biopsy Gleason Score Sum",
CLIN_STG = "Clinical Stage",
AGE = "Age (Years)",
RACE_AA = "Race"
)
)
# Construct the nomogram
nomogram.crr(
fit = prostate.g,
failtime = time_of_interest,
lp = FALSE,
xfrac = 0.65,
fun.at = seq(0.2, 0.45, 0.05),
funlabel = "Predicted 10-year risk"
)
It is a point system-based diagram in which we can extract the individual contribution of each risk factor (top axis), then add them up to obtain the final risk estimate (using the “Total Points” axis, then drawing a line straight down to the risk axis). Also notice that even with non-linear terms in the original model, they are able to be displayed with axes adjusted accordingly. Interactions can be handled as well.
It may be of interest to extract the model equation in a
programmatically-usable format to, for example, hard-code into an
application for automated model evaluation. The sas.cmprsk
function provides this functionality, given the original model fit and
(optionally) the desired time horizon.
sas.cmprsk(prostate.crr, time = time_of_interest)
#> Base failure probability by time = 120 is 0.9605018
#> 0.30480861 * (TX = "PI") + 0.066323409 * (TX = "RP") -
#> 0.05173739 * PSA + 0.00060026412 * max(PSA - 3.8, 0)**3 -
#> 0.00075935137 * max(PSA - 6.335, 0)**3 + 0.00015908725 *
#> max(PSA - 15.9, 0)**3 + 0.12553095 * (BX_GLSN_CAT = "2") +
#> 0.63008736 * (BX_GLSN_CAT = "3") - 0.26475762 * (CLIN_STG =
#> "T2") + 0.083704451 * (CLIN_STG = "T3") - 0.030615192 *
#> AGE + 4.2897617e-05 * max(AGE - 53.318904, 0)**3 - 8.993847e-05 *
#> max(AGE - 64.190411, 0)**3 + 4.7040853e-05 * max(AGE - 74.104384,
#> 0)**3 - 0.25220729 * RACE_AA
The bulk of the output reflects the predicted model value on the
linear predictor scale once a set of covariate values are
entered (see ?cmprsk::crr
). Since we’ve entered a value in
the time
argument, we also get the base failure
probability at that time horizon, which is the result of plugging in
0
for all covariate values and transforming the prediction
to the cumulative incidence scale. These together can then be used to
calculate the probability estimate for an arbitrary set of covariate
values (once evaluated). Notice that the formula also handles the
non-linear terms that were modeled using restricted cubic splines by
capturing the knot positions, so that all the user needs to “plug-in” is
the covariate value on the original scale.
cuminc
The cmprsk::cuminc
function computes the cumulative
incidence for competing events. The pred.ci
function uses
its output to construct a data.frame
of the cumulative
incidence estimates at a desired time horizon and event cause of
interest.
# Get the cuminc object
cum <-
cmprsk::cuminc(
ftime = prostate.dat$TIME_EVENT,
fstatus = prostate.dat$EVENT_DOD,
group = prostate.dat$TX,
cencode = 0
)
# Extract "nice" output at a time point of interest
pred.ci(cum, tm1 = time_of_interest, failcode = 1)
#> Group CI.Prob CI.Var
#> 1 EBRT 0.2840050 0.0005236825
#> 2 PI 0.4047218 0.0013601997
#> 3 RP 0.3287862 0.0004578474
The predict
function is a method for a
cmprsk
object to generate predicted values at a desired
time horizon (see ?predict.cmprsk
).
prostate.dat$pred.120 <- predict(prostate.crr, time = time_of_interest)
str(prostate.dat$pred.120)
#> num [1:2000] 0.36 0.349 0.286 0.381 0.402 ...
By default, the function used the model development data set to get
predicted values, but the newdata
argument allows the user
to specify new records containing the model covariates.