In the main vignette, I discussed how to perform flexible variable selection in a simple, simulated example. In this document, I will discuss intrinsic selection in more detail.
Intrinsic variable selection is variable selection performed using intrinsic variable importance (Williamson and Huang 2023). Intrinsic variable importance is a summary of the true population distribution: it is the difference between the best possible prediction performance achievable using a set of variables and the best possible prediction performance achievable without that set (Williamson, Gilbert, Simon, et al. 2021; Williamson, Gilbert, Carone, et al. 2020). Prediction performance can be measured using, for example, R-squared, classification accuracy, area under the receiver operating characteristic curve (AUC), or binomial deviance. Intrinsic importance can also be defined for a single variable, by averaging the gain in best possible prediction performance obtained by adding that variable to each possible subset of the remaining variables (Williamson and Feng 2020); it is this notion that we use for variable selection.
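For intuition, here is a sketch of this averaged quantity in the notation of Williamson and Feng (2020): writing $V(f, P_0)$ for the predictiveness of a prediction function $f$ under the true data-generating distribution $P_0$, and $f_{0,s}$ for the best possible prediction function based only on the variables in subset $s$, the intrinsic importance of variable $j$ takes the Shapley-value form

$$\psi_{0,j} = \sum_{s \subseteq \{1, \ldots, p\} \setminus \{j\}} \frac{1}{p} \binom{p-1}{\lvert s \rvert}^{-1} \left\{ V\left(f_{0, s \cup \{j\}}, P_0\right) - V\left(f_{0, s}, P_0\right) \right\},$$

i.e., an average over all subsets $s$ not containing $j$ of the gain in best possible predictiveness from adding variable $j$.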
Throughout this vignette, I will use a dataset inspired by data collected by the Early Detection Research Network (EDRN). Biomarkers developed at six “labs” are validated at one or more of four “validation sites” on 306 cysts. The data also include two binary outcome variables: whether the cyst was classified as mucinous, and whether the cyst was determined to have high malignant potential. The data contain missing values, which complicates variable selection; only 212 of the 306 cysts have complete information. In the first section, we will drop the missing data; in the second section, we will appropriately handle the missing data by using multiple imputation. I will use AUC to measure intrinsic importance.
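The complete-case analysis below uses objects y_cc, x_cc, and x_names that are set up in the main vignette. A minimal sketch of how they might be constructed (assuming the raw data live in a data frame called biomarkers with outcome column mucinous, consistent with the imputation code later in this document) is:

library("dplyr")
# restrict to the 212 cysts with complete information
cc <- complete.cases(biomarkers)
# outcome: mucinous status, complete cases only
y_cc <- biomarkers$mucinous[cc]
# predictors: the biomarker columns, complete cases only
x_cc <- biomarkers[cc, ] %>%
  select(starts_with("lab"), starts_with("cea"))
x_names <- names(x_cc)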
The first step in performing intrinsic variable selection is to estimate intrinsic variable importance. To estimate variable importance, I use the function sp_vim from the R package vimp (Williamson, Feng, Wolock, et al. 2023). This requires specifying a library of candidate learners for the Super Learner (Laan, Polley, and Hubbard 2007); throughout, I use a very simple library of learners (this is for illustration only; in practice, I suggest using a large library of learners).
set.seed(1234)
# set up a library for the Super Learner; this library is too simple for use in most applications
learners <- "SL.glm"
univariate_learners <- "SL.glm"
# number of cross-validation folds (kept small here for speed)
V <- 2
# estimate the SPVIMs
library("SuperLearner")
library("vimp")
#> vimp version 2.3.1: Perform Inference on Algorithm-Agnostic Variable Importance
est <- suppressWarnings(
  sp_vim(Y = y_cc, X = x_cc, V = V, type = "auc",
         SL.library = learners, gamma = 0.1, alpha = 0.05, delta = 0,
         cvControl = list(V = V), env = environment())
)
est
#> Variable importance estimates:
#> Estimate SE 95% CI VIMP > 0 p-value
#> s = 1 0.000000e+00 0.04733026 [0, 0.09196796] FALSE 0.5000000
#> s = 2 4.215857e-02 0.16534232 [0, 0.36622356] FALSE 0.4131126
#> s = 3 0.000000e+00 0.04733026 [0, 0.09196796] FALSE 0.5000000
#> s = 4 0.000000e+00 0.04733026 [0, 0.09196796] FALSE 0.5000000
#> s = 5 6.187506e-05 0.06263558 [0, 0.12282535] FALSE 0.4998444
#> s = 6 0.000000e+00 0.11676553 [0, 0.21535200] FALSE 0.5000000
#> s = 7 0.000000e+00 0.04733026 [0, 0.09196796] FALSE 0.5000000
#> s = 8 7.752945e-03 0.11357388 [0, 0.23035366] FALSE 0.4820035
#> s = 9 0.000000e+00 0.05425335 [0, 0.10041316] FALSE 0.5000000
#> s = 10 2.070649e-02 0.07993317 [0, 0.17737263] FALSE 0.4492960
#> s = 11 0.000000e+00 0.04733026 [0, 0.09196796] FALSE 0.5000000
#> s = 12 0.000000e+00 0.05157831 [0, 0.10109163] FALSE 0.5000000
#> s = 13 6.410256e-03 0.02006046 [0, 0.04572804] FALSE 0.4832894
#> s = 14 0.000000e+00 0.06751652 [0, 0.12712006] FALSE 0.5000000
#> s = 15 0.000000e+00 0.04733026 [0, 0.09196796] FALSE 0.5000000
#> s = 16 0.000000e+00 0.04733026 [0, 0.09196796] FALSE 0.5000000
#> s = 17 0.000000e+00 0.05109337 [0, 0.09725160] FALSE 0.5000000
#> s = 18 4.462739e-04 0.06625191 [0, 0.13029763] FALSE 0.4988827
#> s = 19 0.000000e+00 0.04733026 [0, 0.09196796] FALSE 0.5000000
#> s = 20 0.000000e+00 0.04733026 [0, 0.09196796] FALSE 0.5000000
#> s = 21 0.000000e+00 0.04733026 [0, 0.09196796] FALSE 0.5000000
The next step is to choose an error rate to control and a method for controlling it. Here, I control the generalized family-wise error rate (gFWER) overall and use Holm-adjusted p-values to control the individual family-wise error rate:
intrinsic_set <- intrinsic_selection(
  spvim_ests = est, sample_size = nrow(x_cc), alpha = 0.2, feature_names = x_names,
  control = list(quantity = "gFWER", base_method = "Holm", k = 1)
)
intrinsic_set
#> # A tibble: 21 × 6
#> feature est p_value adjusted_p_value rank selected
#> <chr> <dbl> <dbl> <dbl> <dbl> <lgl>
#> 1 lab1_actb 0 0.5 1 14 FALSE
#> 2 lab1_molecules_score 0.0422 0.413 1 1 TRUE
#> 3 lab1_telomerase_score 0 0.5 1 14 FALSE
#> 4 lab2_fluorescence_score 0 0.5 1 14 FALSE
#> 5 lab3_muc3ac_score 0.0000619 0.500 1 6 FALSE
#> 6 lab3_muc5ac_score 0 0.5 1 14 FALSE
#> 7 lab4_areg_score 0 0.5 1 14 FALSE
#> 8 lab4_glucose_score 0.00775 0.482 1 3 FALSE
#> 9 lab5_mucinous_call 0 0.5 1 14 FALSE
#> 10 lab5_neoplasia_v1_call 0.0207 0.449 1 2 FALSE
#> # ℹ 11 more rows
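Since the result is a tibble with a logical selected column (as shown above), the selected set can be pulled out directly; for example:

# extract the names of the variables chosen by intrinsic selection
selected_vars <- intrinsic_set$feature[intrinsic_set$selected]
selected_vars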
I could also choose to control the false discovery rate (FDR), again using Holm-adjusted p-values to control the individual family-wise error rate:
intrinsic_set_fdr <- intrinsic_selection(
  spvim_ests = est, sample_size = nrow(x_cc), alpha = 0.2, feature_names = x_names,
  control = list(quantity = "FDR", base_method = "Holm", k = 1)
)
intrinsic_set_fdr
#> # A tibble: 21 × 6
#> feature est p_value adjusted_p_value rank selected
#> <chr> <dbl> <dbl> <dbl> <dbl> <lgl>
#> 1 lab1_actb 0 0.5 1 14 FALSE
#> 2 lab1_molecules_score 0.0422 0.413 1 1 FALSE
#> 3 lab1_telomerase_score 0 0.5 1 14 FALSE
#> 4 lab2_fluorescence_score 0 0.5 1 14 FALSE
#> 5 lab3_muc3ac_score 0.0000619 0.500 1 6 FALSE
#> 6 lab3_muc5ac_score 0 0.5 1 14 FALSE
#> 7 lab4_areg_score 0 0.5 1 14 FALSE
#> 8 lab4_glucose_score 0.00775 0.482 1 3 FALSE
#> 9 lab5_mucinous_call 0 0.5 1 14 FALSE
#> 10 lab5_neoplasia_v1_call 0.0207 0.449 1 2 FALSE
#> # ℹ 11 more rows
To properly handle the missing data, we first perform multiple imputation. We use the R package mice (Buuren, Groothuis-Oudshoorn, and others 2023; van Buuren and Groothuis-Oudshoorn 2011), here with only 2 imputations (in practice, more imputations are typically recommended).
library("mice")
set.seed(20231121)
mi_biomarkers <- mice::mice(data = biomarkers, m = n_imp, printFlag = FALSE)
imputed_biomarkers <- mice::complete(mi_biomarkers, action = "long") %>%
rename(imp = .imp, id = .id)
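As a quick sanity check, the long format stacks one completed copy of the data per imputation, so the following should hold:

# the long-format data contain n_imp stacked copies of the original rows
nrow(imputed_biomarkers) == n_imp * nrow(biomarkers)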
We can perform intrinsic variable selection using the imputed data. First, we estimate variable importance for each imputed dataset.
set.seed(20231121)
est_lst <- lapply(as.list(1:n_imp), function(l) {
  # predictors from the l-th imputed dataset
  this_x <- imputed_biomarkers %>%
    filter(imp == l) %>%
    select(starts_with("lab"), starts_with("cea"))
  this_y <- biomarkers$mucinous
  suppressWarnings(
    sp_vim(Y = this_y, X = this_x, V = V, type = "auc",
           SL.library = learners, gamma = 0.1, alpha = 0.05, delta = 0,
           cvControl = list(V = V), env = environment())
  )
})
Next, we use Rubin’s rules (Rubin 2018) to combine the variable importance estimates across the imputed datasets, and use the combined estimates to perform variable selection. Here, we control the generalized family-wise error rate at k = 5, again using Holm-adjusted p-values to control the individual family-wise error rate.
intrinsic_set <- intrinsic_selection(
  spvim_ests = est_lst, sample_size = nrow(biomarkers),
  feature_names = x_names, alpha = 0.05,
  control = list(quantity = "gFWER", base_method = "Holm", k = 5)
)
intrinsic_set
We select five variables: those with the five largest estimated variable importance values. The point estimates and p-values have been computed using Rubin’s rules.
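For intuition, Rubin’s rules pool a per-variable estimate across the m imputations by averaging the point estimates and combining within- and between-imputation variability. intrinsic_selection() performs this pooling internally; a minimal conceptual sketch, using a hypothetical helper pool_rubin, is:

# a conceptual sketch of Rubin's rules for one variable's SPVIM estimate;
# `ests` and `vars` are the m point estimates and their estimated variances
pool_rubin <- function(ests, vars) {
  m <- length(ests)
  qbar <- mean(ests)              # pooled point estimate
  wbar <- mean(vars)              # average within-imputation variance
  b <- var(ests)                  # between-imputation variance
  total <- wbar + (1 + 1 / m) * b # Rubin's total variance
  c(est = qbar, se = sqrt(total))
}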