The lboxcox package is based on our recent research publication (Xing et al. [2]), in which we proposed the logistic Box-Cox model: a routine logistic regression model augmented with a shape parameter so that the relationship between the log odds and its systematic component can be modeled more flexibly. Below we provide instructions on how to fit the logistic Box-Cox model using the lboxcox package.
library(lboxcox)
The package ships with a dataset called depress, drawn from the National Health and Nutrition Examination Survey (NHANES) (Xing et al. [2]). The outcome is a binary indicator of depression, and the main predictor is the blood mercury level, whose nonlinear association with the outcome is accommodated by the extra shape parameter. The remaining variables are age, gender, and weight; the weight variable contains sampling weights, which are used to adjust for sampling bias.
data(depress)
head(depress)
#>   depression mercury age gender    weight
#> 1          0    7.13  37      1 18979.728
#> 2          0    1.40  60      0 19877.753
#> 3          0    2.46  60      0 16671.291
#> 4          0    6.24  55      0  3749.384
#> 5          0   11.20  64      1 66567.646
#> 6          0    1.48  51      1  5494.848
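As a quick sanity check of the data, the sampling weights can be used to compare the unweighted and survey-weighted prevalence of depression. The short sketch below uses only base R and the column names shown above.

mean(depress$depression)                                 # unweighted prevalence of depression
weighted.mean(depress$depression, w = depress$weight)    # prevalence weighted by the survey weights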
The logistic Box-Cox model is fitted with the lbc_maxlik() function; here we store the result in trained_model.
trained_model = lbc_maxlik(
    depression ~ mercury +          # response variable, and primary predictor
        age + factor(gender),       # covariates
    weight_column_name = "weight",  # the column name which contains the survey weights
    data = depress,                 # data comes from the package
    svy_lambda_vector = seq(0, 2, length = 25),
    init_lambda_vector = seq(0, 2, length = 100),
    num_cores = 2,                  # since vignettes can only work with a max of two processes
    seed = 1,
    iterlim = 100,
    timelim = 10
)
summary(trained_model)
#> --------------------------------------------
#> Maximum Likelihood estimation
#> Newton-Raphson maximisation, 5 iterations
#> Return code 1: gradient close to zero (gradtol)
#> Log-Likelihood: -41.14436
#> 5 free parameters
#> Estimates:
#>                   Estimate Std. error t value Pr(> t)
#> Beta_0           -2.206758  12.233684  -0.180   0.857
#> Beta_1           -0.365960   4.470466  -0.082   0.935
#> Lambda            0.205635  16.899802   0.012   0.990
#> age              -0.003513   0.231210  -0.015   0.988
#> factor(gender)_1 -0.492110   8.073192  -0.061   0.951
#> --------------------------------------------
The output shows the coefficients and the standard errors for all the predictors. Categorical covariates wrapped in factor() are automatically expanded into indicator variables during model fitting, as with factor(gender)_1 above.
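The summary header indicates a maximum likelihood fit via Newton-Raphson, so the returned object is presumably a maxLik-style fit. Under that assumption, the estimates and standard errors can be pulled out with the usual accessors, for example to form rough 95% Wald intervals:

# Assumes trained_model behaves like a maxLik fit, as the summary header suggests
est <- coef(trained_model)               # point estimates
se  <- sqrt(diag(vcov(trained_model)))   # standard errors from the estimated covariance matrix
cbind(estimate = est,
      lower    = est - 1.96 * se,
      upper    = est + 1.96 * se)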
The above is the intended use case when we do not have initial values for the coefficients. In that case, svyglm from the survey package [1] is used to estimate the initial values; this step involves the lambda parameter, which is searched over a reasonable range (the lambda grids supplied above) to optimize the model.
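The range of that search is set by the svy_lambda_vector and init_lambda_vector arguments shown in the call above; if the default grid over 0 to 2 seems too coarse or too narrow for your data, a different grid can be supplied. A sketch (the object name trained_model_wide is just illustrative):

# Same model as above, but with wider and denser grids of candidate lambda values
trained_model_wide = lbc_maxlik(
    depression ~ mercury +
        age + factor(gender),
    weight_column_name = "weight",
    data = depress,
    svy_lambda_vector = seq(0, 3, length = 50),    # wider, denser candidate grid
    init_lambda_vector = seq(0, 3, length = 200),  # wider, denser candidate grid
    num_cores = 2,
    seed = 1,
    iterlim = 100,
    timelim = 10
)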
Model fitting can also be parallelized by setting the num_cores argument to the number of cores you want to use for training. Keep the default value (num_cores = 1) if you do not wish to parallelize.
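If you do want to parallelize in your own analysis, a common pattern is to leave one core free for the rest of the system; the sketch below assumes the base-R parallel package for detecting the core count.

# Use all but one of the locally available cores, falling back to 1
n_cores <- max(1, parallel::detectCores() - 1, na.rm = TRUE)
# ...then pass num_cores = n_cores in the lbc_maxlik() call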
However, if you have an initial guess for the coefficients, you can supply it as a vector and bypass the initial value estimation described above entirely. We do this as shown below.
init = c(0.0984065403, -0.0227734374, 0.0000000000, -0.0002426025, -0.0316484585)
trained_model = lbc_maxlik(
    depression ~ mercury +          # response variable, and primary predictor
        age + factor(gender),       # covariates
    weight_column_name = "weight",  # the column name which contains the survey weights
    data = depress,                 # data comes from the package
    init = init,                    # user-supplied initial coefficient values
    svy_lambda_vector = seq(0, 2, length = 25),
    init_lambda_vector = seq(0, 2, length = 100),
    num_cores = 2,                  # since vignettes can only work with a max of two processes
    seed = 2023,
    iterlim = 100,
    timelim = 10
)
summary(trained_model)
#> --------------------------------------------
#> Maximum Likelihood estimation
#> Newton-Raphson maximisation, 5 iterations
#> Return code 1: gradient close to zero (gradtol)
#> Log-Likelihood: -41.14436
#> 5 free parameters
#> Estimates:
#>                   Estimate Std. error t value Pr(> t)
#> Beta_0           -2.206758  12.233684  -0.180   0.857
#> Beta_1           -0.365960   4.470466  -0.082   0.935
#> Lambda            0.205635  16.899802   0.012   0.990
#> age              -0.003513   0.231210  -0.015   0.988
#> factor(gender)_1 -0.492110   8.073192  -0.061   0.951
#> --------------------------------------------
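To interpret the estimated shape parameter, note that Lambda enters the model through a Box-Cox style transformation of the primary predictor. The sketch below assumes the standard Box-Cox form (x^lambda - 1)/lambda for nonzero lambda and plugs in the estimate from the summary above.

# Box-Cox transform of mercury at the estimated shape parameter
# (assumes the standard transformation (x^lambda - 1) / lambda, lambda != 0)
lambda_hat <- 0.205635
mercury_bc <- (depress$mercury^lambda_hat - 1) / lambda_hat
head(mercury_bc)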
[1] Lumley, T. (2011). Complex surveys: a guide to analysis using R (Vol. 565). John Wiley & Sons.
[2] Xing, L., Zhang, X., Burstyn, I., & Gustafson, P. (2021). On logistic Box–Cox regression for flexibly estimating the shape and strength of exposure‐disease relationships. Canadian Journal of Statistics, 49(3), 808-825.