Example Data Set

We’ll be using the prostate.dat data set throughout. This is an artificial data set meant to represent patients with prostate cancer undergoing 1 of 3 treatments. The patients are followed for some period of time until they experience death from prostate cancer (the event of interest), death from other causes, or they are still alive and therefore censored. The data set consists of 2000 observations, each representing a patient, with the following 9 columns:

UNIQUEID: Unique patient identifier
TX: Treatment received for prostate cancer (EBRT, PI, or RP)
PSA: Pre-treatment prostate-specific antigen (PSA) levels (ng/mL)
BX_GLSN_CAT: Biopsy Gleason Score. A factor with levels 1 (score 2-6), 2 (score 7), and 3 (score 8-10)
CLIN_STG: Clinical stage of cancer (T1, T2, T3)
AGE: Age of patient on the date of treatment
RACE_AA: Patient race where 1 = African American, and 0 is Other
TIME_EVENT: Event time or follow-up time
EVENT_DOD: Event status. 1 = patient died of prostate cancer, 2 = patient died of other causes, 0 = patient did not die and therefore was censored

str(prostate.dat)
#> 'data.frame':    2000 obs. of  9 variables:
#>  $ UNIQID     : int  23982340 17299867 35653287 34821801 24377202 20243570 25179226 30671082 26178029 35519165 ...
#>  $ TX         : Factor w/ 3 levels "EBRT","PI","RP": 3 3 3 2 2 3 3 3 3 3 ...
#>  $ PSA        : num  25.2 3.3 7.6 5.6 4.6 4.72 8.2 4.6 3.7 4.23 ...
#>  $ BX_GLSN_CAT: Factor w/ 3 levels "1","2","3": 3 1 1 1 1 1 1 1 1 1 ...
#>  $ CLIN_STG   : Factor w/ 3 levels "T1","T2","T3": 2 1 1 1 1 1 1 1 1 1 ...
#>  $ AGE        : num  57.9 64.6 70.6 65.4 63.8 ...
#>  $ RACE_AA    : num  1 0 0 0 0 0 0 0 0 0 ...
#>  $ TIME_EVENT : num  136.9 87.8 18.7 10.3 7.2 ...
#>  $ EVENT_DOD  : num  0 1 1 0 2 0 1 2 0 0 ...

Steps to construct a nomogram

The following steps show how to build a competing risks regression model and construct a nomogram with it.

1. Fit a competing risks regression model

To start, we first want to create a competing risks regression model. Our package heavily utilizes the model building procedures and utilities in the rms package, so the first step is to create a Cox Proportional-Hazards model for the event of interest (death from prostate cancer) from the rms::cph function using the preferred workflow:

# Register the data set
dd <- datadist(prostate.dat)
options(datadist = "dd")

# Fit the Cox-PH model for the event of interest
prostate.f <- cph(Surv(TIME_EVENT,EVENT_DOD == 1) ~ TX  + rcs(PSA,3) +
           BX_GLSN_CAT + CLIN_STG + rcs(AGE,3) +
           RACE_AA, data = prostate.dat,
           x = TRUE, y= TRUE, surv=TRUE, time.inc = 144)
prostate.f
#> Cox Proportional Hazards Model
#> 
#> cph(formula = Surv(TIME_EVENT, EVENT_DOD == 1) ~ TX + rcs(PSA, 
#>     3) + BX_GLSN_CAT + CLIN_STG + rcs(AGE, 3) + RACE_AA, data = prostate.dat, 
#>     x = TRUE, y = TRUE, surv = TRUE, time.inc = 144)
#> 
#>                         Model Tests      Discrimination    
#>                                                 Indexes    
#> Obs       2000    LR chi2     55.16      R2       0.029    
#> Events     464    d.f.           11    R2(11,2000)0.022    
#> Center -2.7576    Pr(> chi2) 0.0000     R2(11,464)0.091    
#>                   Score chi2  56.50      Dxy      0.230    
#>                   Pr(> chi2) 0.0000                        
#> 
#>               Coef    S.E.   Wald Z Pr(>|Z|)
#> TX=PI          0.5135 0.1431  3.59  0.0003  
#> TX=RP          0.0594 0.1246  0.48  0.6336  
#> PSA           -0.0547 0.0270 -2.03  0.0425  
#> PSA'           0.0884 0.0463  1.91  0.0563  
#> BX_GLSN_CAT=2  0.1900 0.1137  1.67  0.0948  
#> BX_GLSN_CAT=3  0.7213 0.1683  4.28  <0.0001 
#> CLIN_STG=T2   -0.2888 0.1116 -2.59  0.0096  
#> CLIN_STG=T3   -0.0154 0.2372 -0.06  0.9483  
#> AGE           -0.0429 0.0127 -3.37  0.0007  
#> AGE'           0.0332 0.0158  2.10  0.0353  
#> RACE_AA       -0.3154 0.1429 -2.21  0.0273

Then we can call crr.fit. This function takes the rms::cph fit and uses the cmprsk::crr function to convert it to a competing risks regression model, while maintaining attributes from the original fit.

# Refit to a competing risks regression
prostate.crr <- crr.fit(prostate.f, cencode = 0, failcode = 1)
prostate.crr
#> convergence:  TRUE 
#> coefficients:
#>         TX=PI         TX=RP           PSA          PSA' BX_GLSN_CAT=2 
#>       0.30480       0.06632      -0.05174       0.08788       0.12550 
#> BX_GLSN_CAT=3   CLIN_STG=T2   CLIN_STG=T3           AGE          AGE' 
#>       0.63010      -0.26480       0.08370      -0.03062       0.01853 
#>       RACE_AA 
#>      -0.25220 
#> standard errors:
#>  [1] 0.13750 0.11810 0.02638 0.04513 0.11460 0.16120 0.10920 0.21500 0.01240
#> [10] 0.01593 0.14070
#> two-sided p-values:
#>         TX=PI         TX=RP           PSA          PSA' BX_GLSN_CAT=2 
#>       2.7e-02       5.7e-01       5.0e-02       5.1e-02       2.7e-01 
#> BX_GLSN_CAT=3   CLIN_STG=T2   CLIN_STG=T3           AGE          AGE' 
#>       9.3e-05       1.5e-02       7.0e-01       1.4e-02       2.4e-01 
#>       RACE_AA 
#>       7.3e-02

2. Summarize and assess model output

We can see that the model object above is of class cmprsk.

class(prostate.crr)
#> [1] "cmprsk" "crr"

We have created generic methods to assess output:

The summary.cmprsk method utilizes summary.rms to flexibly summarize model effects and contrasts.

summary(prostate.crr)
#>              Effects              Response : Surv(TIME_EVENT, EVENT_DOD == 1) 
#> 
#>  Factor            Low    High   Diff.  Effect    S.E.     Lower 0.95
#>  PSA                5.000  9.400  4.400 -0.145130 0.074520 -0.291190 
#>   Hazard Ratio      5.000  9.400  4.400  0.864910       NA  0.747380 
#>  AGE               58.137 69.562 11.425 -0.184680 0.077432 -0.336440 
#>   Hazard Ratio     58.137 69.562 11.425  0.831370       NA  0.714310 
#>  RACE_AA            0.000  1.000  1.000 -0.252210 0.140700 -0.527970 
#>   Hazard Ratio      0.000  1.000  1.000  0.777080       NA  0.589800 
#>  TX - EBRT:RP       3.000  1.000     NA -0.066323 0.118050 -0.297700 
#>   Hazard Ratio      3.000  1.000     NA  0.935830       NA  0.742530 
#>  TX - PI:RP         3.000  2.000     NA  0.238490 0.132460 -0.021125 
#>   Hazard Ratio      3.000  2.000     NA  1.269300       NA  0.979100 
#>  BX_GLSN_CAT - 2:1  1.000  2.000     NA  0.125530 0.114610 -0.099103 
#>   Hazard Ratio      1.000  2.000     NA  1.133800       NA  0.905650 
#>  BX_GLSN_CAT - 3:1  1.000  3.000     NA  0.630090 0.161180  0.314180 
#>   Hazard Ratio      1.000  3.000     NA  1.877800       NA  1.369100 
#>  CLIN_STG - T2:T1   1.000  2.000     NA -0.264760 0.109220 -0.478820 
#>   Hazard Ratio      1.000  2.000     NA  0.767390       NA  0.619510 
#>  CLIN_STG - T3:T1   1.000  3.000     NA  0.083704 0.214990 -0.337680 
#>   Hazard Ratio      1.000  3.000     NA  1.087300       NA  0.713430 
#>  Upper 0.95 
#>   0.00092617
#>   1.00090000
#>  -0.03291100
#>   0.96762000
#>   0.02355300
#>   1.02380000
#>   0.16505000
#>   1.17950000
#>   0.49809000
#>   1.64560000
#>   0.35016000
#>   1.41930000
#>   0.94600000
#>   2.57540000
#>  -0.05069400
#>   0.95057000
#>   0.50509000
#>   1.65710000

The anova.cmprsk method utilizes anova.rms to test for statistical significance of model terms.

anova(prostate.crr)
#>                 Wald Statistics          Response: Surv(TIME_EVENT, EVENT_DOD == 1) 
#> 
#>  Factor          Chi-Square d.f. P     
#>  TX               5.21       2   0.0739
#>  PSA              3.85       2   0.1458
#>   Nonlinear       3.79       1   0.0515
#>  BX_GLSN_CAT     15.29       2   0.0005
#>  CLIN_STG         6.88       2   0.0320
#>  AGE              9.27       2   0.0097
#>   Nonlinear       1.35       1   0.2445
#>  RACE_AA          3.21       1   0.0730
#>  TOTAL NONLINEAR  5.16       2   0.0758
#>  TOTAL           44.64      11   <.0001

Obviously there is a lot of work that would go into determining that we have the right model specifications, covariates, term specifications, etc. We’ll assume that the current model is the one we are moving forward with. See the cmprsk package to learn more about the competing risks modeling framework used here, how to interpret parameters, etc.

3. Validate model predictions

Suppose we are particularly interested in accurately predicting the 10-year risk of death from prostate cancer.

time_of_interest <- 120 # In months, so 10 years

Generate cross-validated predictions

First, we can use the tenf.crr function to generate K-fold cross-validated predictions for each observation in the development data set. The default is to set fold=10 for 10-fold cross-validation, so we’ll leave it at that:

set.seed(123)
prostate.dat$preds.tenf <- tenf.crr(prostate.crr, time = time_of_interest)
#> 1  2  3  4  5  6  7  8  9  10
str(prostate.dat$preds.tenf)
#>  num [1:2000] 0.374 0.376 0.277 0.372 0.394 ...

This vector shows us, for each patient, the estimated (out of sample) probability of death from prostate cancer given the covariates in the model, adjusting for the possibility of death from other causes as a competing event.

Compute the concordance index (C-Index)

The cindex function computes the concordance index for binary, time-to-event, or competing risks outcomes. In this case, we are interested in the competing risks version, which utilizes the event times, and only considers pairwise comparisons involving those with the event of interest when testing for concordance.

For example, if Patient A experienced the event of interest at 5 years and Patient B experienced the event of interest at 10 years, the algorithm checks to see if the predicted risk for Patient A is larger than that of Patient B–in which case they would be considered concordant. There are similar comparisons if one of the patients had a different event status. However, if Patient A nor Patient B experienced the event of interest (i.e., they either experienced a competing event or were censored), they would not be compared.

cindex(
  prob = prostate.dat$preds.tenf,
  fstatus = prostate.dat$EVENT_DOD,
  ftime = prostate.dat$TIME_EVENT,
  type = "crr",
  failcode = 1
)
#>            N            n       usable   concordant       cindex 
#> 2.000000e+03 2.000000e+03 5.652590e+05 3.228440e+05 5.711435e-01

The output displays the number of observations (in the original input N, and those used n), the number of pairwise comparisons made (usable), the number of concordant pairs (concordant), and the concordance index (cindex), which is the proportion of the usable pairs that were concordant. The C-index ranges from 0.50, a coin flip, to 1, perfect discrimination. In this case, our model has relatively poor discrimination ability.

Assess model calibration

We can also assess how calibrated our model predictions are by comparing the predicted probabilities from the model to the actual event rate observed. The groupci function does this by breaking up the input risk distribution into groups (e.g., by quantiles, user-defined groups, etc.) and computing the observed cumulative incidence within each group at the time point in which the predictions were made for. The goal is for these quantities to follow 45-degree line such that the risks produced by the models align with the true rates at which events occur.

groupci(
  x = prostate.dat$preds.tenf,
  ftime = prostate.dat$TIME_EVENT,
  fstatus = prostate.dat$EVENT_DOD,
  g = 10, # Deciles
  u = time_of_interest,
  failcode = 1,
  xlab = "Predicted 10-year prostate cancer-specific mortality",
  ylab = "Actual 10-year prostate cancer-specific mortality"
)

#>               x   n events        ci    std.err
#>  [1,] 0.2195250 200     41 0.2137948 0.03444284
#>  [2,] 0.2614243 200     42 0.2594906 0.04058236
#>  [3,] 0.2844747 200     45 0.2740857 0.04014348
#>  [4,] 0.3045506 200     48 0.3427236 0.04718743
#>  [5,] 0.3239753 200     50 0.3807385 0.04866854
#>  [6,] 0.3418128 200     43 0.2773428 0.04117005
#>  [7,] 0.3595614 200     48 0.3746006 0.04898192
#>  [8,] 0.3808118 200     44 0.4879819 0.06502863
#>  [9,] 0.4101135 200     49 0.3324923 0.04835294
#> [10,] 0.4827395 200     54 0.4596141 0.05677256

There might be some evidence that our model slightly overestimates the actual risk of 10-year prostate cancer-specific mortality for those most at risk.

4. Create the nomogram

Once the model is built, validated, and ready to use in practice, we can construct the nomogram. The nomogram.crr function takes our original model fit and maps it to a diagram that allows it to be printed and used in a clinical (or any other) setting to quickly “plug-in” model covariates and generate a risk estimate.

# Set some nice display labels (also see ?Newlevels)
prostate.g <-
  Newlabels(
    fit = prostate.crr,
    labels = 
      c(
        TX = "Treatment options",
        PSA = "PSA (ng/mL)",
        BX_GLSN_CAT = "Biopsy Gleason Score Sum",
        CLIN_STG = "Clinical Stage",
        AGE = "Age (Years)",
        RACE_AA = "Race"
      )
  )

# Construct the nomogram
nomogram.crr(
  fit = prostate.g,
  failtime = time_of_interest,
  lp = FALSE,
  xfrac = 0.65,
  fun.at = seq(0.2, 0.45, 0.05),
  funlabel = "Predicted 10-year risk"
)

It is a point system-based diagram in which we can extract the individual contribution of each risk factor (top axis), then add them up to obtain the final risk estimate (using the “Total Points” axis, then drawing a line straight down to the risk axis). Also notice that even with non-linear terms in the original model, they are able to be displayed with axes adjusted accordingly. Interactions can be handled as well.

Other functions

Generate the model equation

It may be of interest to extract the model equation in a programmatically-usable format to, for example, hard-code into an application for automated model evaluation. The sas.cmprsk function provides this functionality, given the original model fit and (optionally) the desired time horizon.

sas.cmprsk(prostate.crr, time = time_of_interest)
#> Base failure probability by time = 120 is 0.9605018 
#>  0.30480861 * (TX = "PI") + 0.066323409 * (TX = "RP") - 
#>     0.05173739 * PSA + 0.00060026412 * max(PSA - 3.8, 0)**3 - 
#>     0.00075935137 * max(PSA - 6.335, 0)**3 + 0.00015908725 * 
#>     max(PSA - 15.9, 0)**3 + 0.12553095 * (BX_GLSN_CAT = "2") + 
#>     0.63008736 * (BX_GLSN_CAT = "3") - 0.26475762 * (CLIN_STG = 
#>     "T2") + 0.083704451 * (CLIN_STG = "T3") - 0.030615192 * 
#>     AGE + 4.2897617e-05 * max(AGE - 53.318904, 0)**3 - 8.993847e-05 * 
#>     max(AGE - 64.190411, 0)**3 + 4.7040853e-05 * max(AGE - 74.104384, 
#>     0)**3 - 0.25220729 * RACE_AA

The bulk of the output reflects the predicted model value on the linear predictor scale once a set of covariate values are entered (see ?cmprsk::crr). Since we’ve entered a value in the time argument, we also get the base failure probability at that time horizon, which is the result of plugging in 0 for all covariate values and transforming the prediction to the cumulative incidence scale. These together can then be used to calculate the probability estimate for an arbitrary set of covariate values (once evaluated). Notice that the formula also handles the non-linear terms that were modeled using restricted cubic splines by capturing the knot positions, so that all the user needs to “plug-in” is the covariate value on the original scale.

Create “nice” output from `cuminc`

The cmprsk::cuminc function computes the cumulative incidence for competing events. The pred.ci function uses its output to construct a data.frame of the cumulative incidence estimates at a desired time horizon and event cause of interest.

# Get the cuminc object
cum <- 
  cmprsk::cuminc(
    ftime = prostate.dat$TIME_EVENT, 
    fstatus = prostate.dat$EVENT_DOD, 
    group = prostate.dat$TX,
    cencode = 0
  )

# Extract "nice" output at a time point of interest
pred.ci(cum, tm1 = time_of_interest, failcode = 1)
#>   Group   CI.Prob       CI.Var
#> 1  EBRT 0.2840050 0.0005236825
#> 2    PI 0.4047218 0.0013601997
#> 3    RP 0.3287862 0.0004578474

Calculate predicted probabilities

The predict function is a method for a cmprsk object to generate predicted values at a desired time horizon (see ?predict.cmprsk).

prostate.dat$pred.120 <- predict(prostate.crr, time = time_of_interest)
str(prostate.dat$pred.120)
#>  num [1:2000] 0.36 0.349 0.286 0.381 0.402 ...

By default, the function used the model development data set to get predicted values, but the newdata argument allows the user to specify new records containing the model covariates.

QHScrnomo

Introduction

Example Data Set

Steps to construct a nomogram

1. Fit a competing risks regression model

2. Summarize and assess model output

3. Validate model predictions

Generate cross-validated predictions

Compute the concordance index (C-Index)

Assess model calibration

4. Create the nomogram

Other functions

Generate the model equation

Create “nice” output from `cuminc`

Calculate predicted probabilities

QHScrnomo

Introduction

Example Data Set

Steps to construct a nomogram

1. Fit a competing risks regression model

2. Summarize and assess model output

3. Validate model predictions

Generate cross-validated predictions

Compute the concordance index (C-Index)

Assess model calibration

4. Create the nomogram

Other functions

Generate the model equation

Create “nice” output from cuminc

Calculate predicted probabilities

Create “nice” output from `cuminc`