The propensity score (PS) is the conditional probability of assignment to a particular treatment given a vector of observed covariates (Rosenbaum and Rubin 1983). Hirano and Imbens (2004) extended the idea to studies with a continuous treatment (or exposure) and called it the generalized propensity score (GPS), which is a conditional probability density function of the exposure given the covariates. In this package, the GPS model is trained as a density estimation procedure (Kennedy et al. 2017) using either a parametric model (a standard linear regression model) or a non-parametric model (a flexible machine learning model). After training, GPS values are estimated from the model predictions. The machine learning models are built with the SuperLearner package (Van der Laan, Polley, and Hubbard 2007). For more details on the problem framework and assumptions, please see Wu et al. (2020).
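To make the parametric case concrete, the following is a minimal sketch in plain Python, not the package's implementation. It assumes a single covariate and normally distributed residuals (the text above does not specify the residual distribution), and the names `fit_linear`, `normal_pdf`, and `estimate_gps_sketch` are hypothetical helpers introduced only for this illustration. The exposure `w` is regressed on the covariate `c`, and the GPS of each observation is the fitted normal density evaluated at its observed exposure.

```python
import math
import random


def fit_linear(w, c):
    """Ordinary least squares of exposure w on one covariate c.

    Returns (intercept, slope, residual standard deviation).
    """
    n = len(w)
    mean_c = sum(c) / n
    mean_w = sum(w) / n
    sxx = sum((ci - mean_c) ** 2 for ci in c)
    sxy = sum((ci - mean_c) * (wi - mean_w) for ci, wi in zip(c, w))
    slope = sxy / sxx
    intercept = mean_w - slope * mean_c
    resid = [wi - (intercept + slope * ci) for wi, ci in zip(w, c)]
    # n - 2 degrees of freedom: intercept and slope were estimated.
    sigma = math.sqrt(sum(r * r for r in resid) / (n - 2))
    return intercept, slope, sigma


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and SD sigma at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def estimate_gps_sketch(w, c):
    """GPS under the assumed model w | c ~ Normal(a + b * c, sigma^2)."""
    intercept, slope, sigma = fit_linear(w, c)
    return [normal_pdf(wi, intercept + slope * ci, sigma)
            for wi, ci in zip(w, c)]


# Synthetic data (an assumption for illustration only).
rng = random.Random(0)
c = [rng.uniform(0.0, 10.0) for _ in range(500)]
w = [2.0 + 0.5 * ci + rng.gauss(0.0, 1.0) for ci in c]

gps = estimate_gps_sketch(w, c)
```

In the non-parametric case, the linear fit would be replaced by a flexible learner for the conditional mean, while the density evaluation step stays conceptually the same.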
Whether predictive performance should be the primary criterion when training the GPS model is an open research question. In this package, users have complete control over the hyperparameters and can tune the prediction models to reach different performance levels.
Users can use any library in the SuperLearner package. However, to provide control over the internal libraries, the package generates customized wrappers. The following table lists the available customized wrappers and their hyperparameters.
| Package name | sl_lib name | prefix | available hyperparameters |
|---|---|---|---|
| XGBoost | m_xgboost | xgb_ | nrounds, eta, max_depth, min_child_weight, verbose |
| ranger | m_ranger | rgr_ | num.trees, write.forest, replace, verbose, family |
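The prefix column suggests that each hyperparameter is addressed by a prefixed name, so that settings for different wrappers can share one flat list. The sketch below illustrates that naming convention in plain Python. It assumes the full name is the prefix concatenated with the hyperparameter name (for example `xgb_nrounds`), which the table implies but does not state outright. `PREFIXES` and `split_params` are hypothetical names and are not part of the package.

```python
# Prefix of each customized wrapper, copied from the table above.
PREFIXES = {"m_xgboost": "xgb_", "m_ranger": "rgr_"}


def split_params(params):
    """Group prefixed hyperparameters by wrapper and strip the prefix.

    Assumes names have the form <prefix><hyperparameter>.
    """
    grouped = {lib: {} for lib in PREFIXES}
    for name, value in params.items():
        for lib, prefix in PREFIXES.items():
            if name.startswith(prefix):
                grouped[lib][name[len(prefix):]] = value
                break
        else:
            # No known prefix matched this name.
            raise ValueError(f"unrecognized hyperparameter prefix: {name}")
    return grouped


# Example values are arbitrary and chosen only for illustration.
params = {"xgb_nrounds": 50, "xgb_eta": 0.1, "rgr_num.trees": 200}
grouped = split_params(params)
```

Here `grouped["m_xgboost"]` holds `nrounds` and `eta`, and `grouped["m_ranger"]` holds `num.trees`, so each wrapper only sees its own settings.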
Both the XGBoost and ranger libraries are developed for efficient processing on multiple cores. The only requirement is that OpenMP is installed on the system. Users need to pass the number of threads (nthread) when running the estimate_gps function.
In the following section, we conduct several analyses to test scalability and performance. These analyses give a rough estimate of what to expect for different data sizes and computational resources.