This vignette provides descriptions of the example data sets in the ensr package. We provide information about how the data sets were built and also about each individual variable.

library(data.table) library(magrittr) library(qwraps2) library(digest) set.seed(42) library(ensr) ## Loading required package: glmnet ## Loading required package: Matrix ## Loading required package: foreach ## Loaded glmnet 2.0-16 ## ## Attaching package: 'glmnet' ## The following object is masked from 'package:qwraps2': ## ## auc

Traumatic Brain Injuries

A synthetic data set based based on a Traumatic Brain Injury (TBI) study has been included with the ensr package. All data points in this data set have been randomly generated and have no association with any real patient.

The columns of this data set are:

Column(s)	Meaning
`age`	Age of the patient, in days, at time of hospital admission
`female`	An integer flag for sex, 1 == female, 0 == male.
`los`	Length of stay: number of days between hospital admission and discharge.
`pcode1`, …, `pcode6`	Indicator columns for the presence or absence of a procedure code from a billing database.
`ncode1`, …, `ncode6`	Indicator columns for the presence or absence of the same procedure code, but from a trauma database.
`injury1`, …, `injury3`	Indicator columns for the presence or absence of three different types of TBI.

age

Age of the patient, in days, at time of hospital admission

female

An integer flag for sex, 1 == female, 0 == male.

los

Length of stay: number of days between hospital admission and discharge.

pcode1, …, pcode6

Indicator columns for the presence or absence of a procedure code from a billing database.

ncode1, …, ncode6

Indicator columns for the presence or absence of the same procedure code, but from a trauma database.

injury1, …, injury3

Indicator columns for the presence or absence of three different types of TBI.

Construction of the synthetic data set:

TBI_SUBJECTS <- 1323L tbi <- data.table(age = round(rweibull(TBI_SUBJECTS, shape = 1.9, scale = 2000)), female = rbinom(TBI_SUBJECTS, 1, 0.10), los = round(rexp(TBI_SUBJECTS, rate = 1/13)))

The pcode and ncode variables are difficult to simulate. There are 64 permutations for each set. We built two vectors with relative frequencies of the codes.

pcodes <- c(`0.0.0.0.0.0` = 1230, `1.0.0.0.0.0` = 130, `0.1.0.0.0.0` = 40, `1.1.0.0.0.0` = 7, `0.0.1.0.0.0` = 120, `1.0.1.0.0.0` = 4, `0.1.1.0.0.0` = 20, `1.1.1.0.0.0` = 2, `0.0.0.1.0.0` = 4, `1.0.0.1.0.0` = 4, `0.1.0.1.0.0` = 20, `1.1.0.1.0.0` = 4, `0.0.1.1.0.0` = 20, `1.0.1.1.0.0` = 2, `0.1.1.1.0.0` = 2, `1.1.1.1.0.0` = 7, `0.0.0.0.1.0` = 5, `1.0.0.0.1.0` = 4, `0.1.0.0.1.0` = 10, `1.1.0.0.1.0` = 9, `0.0.1.0.1.0` = 10, `1.0.1.0.1.0` = 7, `0.1.1.0.1.0` = 1, `1.1.1.0.1.0` = 7, `0.0.0.1.1.0` = 10, `1.0.0.1.1.0` = 6, `0.1.0.1.1.0` = 5, `1.1.0.1.1.0` = 8, `0.0.1.1.1.0` = 5, `1.0.1.1.1.0` = 2, `0.1.1.1.1.0` = 6, `1.1.1.1.1.0` = 9, `0.0.0.0.0.1` = 40, `1.0.0.0.0.1` = 7, `0.1.0.0.0.1` = 40, `1.1.0.0.0.1` = 10, `0.0.1.0.0.1` = 10, `1.0.1.0.0.1` = 1, `0.1.1.0.0.1` = 20, `1.1.1.0.0.1` = 5, `0.0.0.1.0.1` = 10, `1.0.0.1.0.1` = 2, `0.1.0.1.0.1` = 6, `1.1.0.1.0.1` = 1, `0.0.1.1.0.1` = 8, `1.0.1.1.0.1` = 1, `0.1.1.1.0.1` = 7, `1.1.1.1.0.1` = 2, `0.0.0.0.1.1` = 9, `1.0.0.0.1.1` = 1, `0.1.0.0.1.1` = 1, `1.1.0.0.1.1` = 2, `0.0.1.0.1.1` = 6, `1.0.1.0.1.1` = 9, `0.1.1.0.1.1` = 6, `1.1.1.0.1.1` = 4, `0.0.0.1.1.1` = 2, `1.0.0.1.1.1` = 9, `0.1.0.1.1.1` = 3, `1.1.0.1.1.1` = 1, `0.0.1.1.1.1` = 7, `1.0.1.1.1.1` = 4, `0.1.1.1.1.1` = 4, `1.1.1.1.1.1` = 3) ncodes <- c(`0.0.0.0.0.0` = 1240, `1.0.0.0.0.0` = 200, `0.1.0.0.0.0` = 130, `1.1.0.0.0.0` = 4, `0.0.1.0.0.0` = 90, `1.0.1.0.0.0` = 9, `0.1.1.0.0.0` = 20, `1.1.1.0.0.0` = 7, `0.0.0.1.0.0` = 20, `1.0.0.1.0.0` = 3, `0.1.0.1.0.0` = 10, `1.1.0.1.0.0` = 8, `0.0.1.1.0.0` = 7, `1.0.1.1.0.0` = 9, `0.1.1.1.0.0` = 8, `1.1.1.1.0.0` = 8, `0.0.0.0.1.0` = 6, `1.0.0.0.1.0` = 9, `0.1.0.0.1.0` = 10, `1.1.0.0.1.0` = 3, `0.0.1.0.1.0` = 9, `1.0.1.0.1.0` = 7, `0.1.1.0.1.0` = 6, `1.1.1.0.1.0` = 8, `0.0.0.1.1.0` = 6, `1.0.0.1.1.0` = 2, `0.1.0.1.1.0` = 9, `1.1.0.1.1.0` = 6, `0.0.1.1.1.0` = 2, `1.0.1.1.1.0` = 2, `0.1.1.1.1.0` = 5, `1.1.1.1.1.0` = 7, `0.0.0.0.0.1` = 20, `1.0.0.0.0.1` = 6, `0.1.0.0.0.1` = 4, `1.1.0.0.0.1` = 1, `0.0.1.0.0.1` = 2, `1.0.1.0.0.1` = 4, `0.1.1.0.0.1` = 7, `1.1.1.0.0.1` = 6, `0.0.0.1.0.1` = 4, `1.0.0.1.0.1` = 1, `0.1.0.1.0.1` = 8, `1.1.0.1.0.1` = 6, `0.0.1.1.0.1` = 1, `1.0.1.1.0.1` = 4, `0.1.1.1.0.1` = 2, `1.1.1.1.0.1` = 7, `0.0.0.0.1.1` = 3, `1.0.0.0.1.1` = 7, `0.1.0.0.1.1` = 7, `1.1.0.0.1.1` = 9, `0.0.1.0.1.1` = 5, `1.0.1.0.1.1` = 1, `0.1.1.0.1.1` = 6, `1.1.1.0.1.1` = 5, `0.0.0.1.1.1` = 2, `1.0.0.1.1.1` = 3, `0.1.0.1.1.1` = 1, `1.1.0.1.1.1` = 5, `0.0.1.1.1.1` = 6, `1.0.1.1.1.1` = 5, `0.1.1.1.1.1` = 4, `1.1.1.1.1.1` = 2)

We generated the code variables by taking a random sample of the names of the above objects, splitting the names by the ., and coercing the result to a data.table.

pcodes <- sample(names(pcodes), TBI_SUBJECTS, replace = TRUE, prob = pcodes) %>% strsplit("\\.") %>% lapply(as.integer) %>% do.call(rbind, .) %>% as.data.table %>% setNames(., paste0("pcode", 1:6)) pcodes ## pcode1 pcode2 pcode3 pcode4 pcode5 pcode6 ## 1: 0 0 1 1 0 0 ## 2: 0 0 0 0 0 0 ## 3: 0 0 0 0 0 0 ## 4: 0 0 0 0 0 0 ## 5: 0 0 0 0 0 0 ## --- ## 1319: 0 0 1 1 0 1 ## 1320: 0 0 0 0 0 0 ## 1321: 0 0 0 0 0 0 ## 1322: 0 0 0 0 0 0 ## 1323: 0 0 0 0 0 1 ncodes <- sample(names(ncodes), TBI_SUBJECTS, replace = TRUE, prob = ncodes + 0.1) %>% strsplit("\\.") %>% lapply(as.integer) %>% do.call(rbind, .) %>% as.data.table %>% setNames(., paste0("ncode", 1:6)) ncodes ## ncode1 ncode2 ncode3 ncode4 ncode5 ncode6 ## 1: 0 1 0 0 0 0 ## 2: 0 0 0 0 0 0 ## 3: 0 0 0 0 0 0 ## 4: 0 0 0 0 0 0 ## 5: 0 0 1 1 1 1 ## --- ## 1319: 0 0 0 0 0 0 ## 1320: 1 1 0 1 1 0 ## 1321: 0 0 0 0 0 0 ## 1322: 1 0 0 0 0 0 ## 1323: 0 0 0 0 0 0

The last step in building the predictor variables is to bind the three data.tables together:

The three injury categories are constructed by fitting a simplified logistic regression model. The regression coefficient vectors are defined next.

injury1_coef <- matrix(c(-5.80494, -3.03946e-05, 0.773355, 0.00556597, -1.04436, -0.849594, 0.770122, 0.968153, 16.6315, 0.411286, 4.07926, 5.16926, 2.84976, -17.1038, -14.7382, 4.30086), ncol = 1) injury2_coef <- matrix(c(-427.083, 0.0742405, -342.072, -8.09704, 299.132, 991.75, -85.0155, -170.344, -57.408, 123.201, 161.41, -568.483, -767.944, 706.95, 10.2964, 148.551), ncol = 1) injury3_coef <- matrix(c(-3.54036, -0.00054294, 1.75962, -0.0475071, -17.515, -60.4276, 4.58458, 0.58551, -19.7312, -1.16923, 2.16091, 63.7699, 39.3569, -37.8554, -14.1002, -14.8469), ncol = 1) injury_expression <- quote( c("injury1", "injury2", "injury3") := list( sapply(qwraps2::invlogit(cbind(1, as.matrix(tbi)) %*% injury1_coef), function(p) rbinom(1, 1, p)), sapply(qwraps2::invlogit(cbind(1, as.matrix(tbi)) %*% injury2_coef), function(p) rbinom(1, 1, p)), sapply(qwraps2::invlogit(cbind(1, as.matrix(tbi)) %*% injury3_coef), function(p) rbinom(1, 1, p)) ) ) tbi[, eval(injury_expression)]

To prove that the tbi object constructed in this vignette is the same as the one provided in via calling data(tbi, package = "ensr") we show the sha1 for both data sets here:

Water Percolation Through A Landfill

The file landfill.csv contains results from a computer simulation of water percolation through five layers in a landfill over one year. The raw data file is provided so the end user can explore their own pre-processing of the data.

landfill <-
  fread(file = system.file("extdata/landfill.csv", package = "ensr"), sep = ",")

This landfill consists of five layers:

Topsoil
Geomembrane Liner
Clay
Weathered Lavery Till (WLT)
Unweathered Lavery Till (ULT)

The outcomes modeled by the computer program are:

Outcome	Variable Name(s)
evaporation	evap
runoff	runoff
percolation	perc_topsoil, perc_liner, perc_clay, perc_wlt, perc_ult
drainage	drain_topsoil
water content	wc_topsoil, wc_clay, wc_wlt, wc_ult

The set of predictors included in the data set are:

Predictor	Variable Name(s)	Notes
Porosity	topsoil_porosity, clay_porosity, wlt_porosity, ult_porosity
van Genuchten \(\alpha\)	topsoil_alpha, clay_alpha, wlt_alpha, ult_alpha	related to the inverse of the air entry suction
van Genuchten \(\theta_r\)	topsoil_thetar, clay_thetar, wlt_thetar, ult_thetar	residual water content
van Genuchten N	topsoil_n, clay_n, wlt_n, ult_n	pore size
Saturated hydraulic conductivity	topsoil_ks, clay_ks, wlt_ks, ult_ks
Relative Humidity	rh
Leaf Area Index	lai
Grow Start	growstart	First day of the growing season
Grow End	growend	Last day of the growing season
Wind	wind	Average wind speed in mph
Solar Radiation	rmw, rmd	Mean of daily solar radiation for wet days, or dry days
Amplitude of daily solar radiation	ar
SCS AMCII Curve Number	cn
Plant root beta	beta	Used for calculating evaporation zone depth
Linear Defects	liner_defects	Number of defects per acre of liner
Linear Pinholes	liner_pinholes	Number of pinholes per acre of liner
100 year average Solar Radiation	weather_solrad
100 year average Precipitation	weather_precip
100 year average Temperature	weather_temp

Landfill Data Preparation

The elastic net method suggests that all predictors be centered and scaled. In the case of multi-variable responses, it is recommended that the outcomes be centered and scaled as well. For calls to glmnet and cv.glmnet, the argument standardized = TRUE (default) will center and scale the values. If you have already centered and scaled the data you may prefer to set standardized = FALSE.

If you want to explicitly standardize your data, the base R function scale is adequate for most situations. However, with the landfill data the simple centering and scaling could be considered inappropriate.
scale will use the mean and standard deviation of the input. However, the landfill data contains measurements with the same values. Centering and scaling should be based on the mean and standard deviation of just the unique values. The ensr function standardize will, by default, standardize a numeric vector based on just the unique values using either the mean and standard deviation or the median and interquartile range (IQR).

Here is an example. There are 974 rows in the landfill data set. There are 83 unique values for the van Genuchten \(\alpha\) value. Here is how the standardization results are affected by non-unique predictor values:

nrow(landfill)
## [1] 974
length(unique(landfill$topsoil_alpha))
## [1] 83

# Standard scaling, using the mean and standard deviation for the full vector.
scaled_topsoil_alpha_v1 <- as.vector(scale(landfill$topsoil_alpha))

scaled_topsoil_alpha_v2 <-
  (landfill$topsoil_alpha - mean(landfill$topsoil_alpha)) / sd(landfill$topsoil_alpha)

all.equal(target  = scaled_topsoil_alpha_v1,
          current = scaled_topsoil_alpha_v2,
          check.attributes = FALSE)
## [1] TRUE

# Center and scale using the mean and standard deviation of only the unique
# values of the vector.
scaled_topsoil_alpha_v3 <-
  as.vector(scale(landfill$topsoil_alpha,
                  center = mean(unique(landfill$topsoil_alpha)),
                  scale  = sd(unique(landfill$topsoil_alpha))))

scaled_topsoil_alpha_v4 <- standardize(landfill$topsoil_alpha)

all.equal(target  = scaled_topsoil_alpha_v3,
          current = scaled_topsoil_alpha_v4,
          check.attributes = FALSE)
## [1] TRUE

# Center and scale using the median and IQR.
scaled_topsoil_alpha_v5 <-
  as.vector(scale(landfill$topsoil_alpha,
                  center = median(unique(landfill$topsoil_alpha)),
                  scale  = IQR(unique(landfill$topsoil_alpha))))

scaled_topsoil_alpha_v6 <-
  standardize(landfill$topsoil_alpha,
              stats = list(center = "median", scale = "IQR"))

all.equal(target  = scaled_topsoil_alpha_v5,
          current = scaled_topsoil_alpha_v6,
          check.attributes = FALSE)
## [1] TRUE

For the full landfill data set we can center and scale all the columns using:

landfill <- standardize(landfill)

Note that the values used for centering and scaling are retained as attributes for each variable in the data set.

str(landfill[, 1:2] )
## Classes 'data.table' and 'data.frame':   974 obs. of  2 variables:
##  $ topsoil_porosity: num  0.244 -0.171 -0.301 -0.223 -0.586 ...
##   ..- attr(*, "center")= num 0.296
##   ..- attr(*, "scale")= num 0.0385
##  $ topsoil_alpha   : num  -0.735 -0.455 -0.455 0.146 -0.535 ...
##   ..- attr(*, "center")= num 0.0143
##   ..- attr(*, "scale")= num 0.0025
##  - attr(*, ".internal.selfref")=<externalptr>

The landfill object generated in this vignette is the same as the one generated using data(landfill, package = "ensr").

data(landfill, package = "ensr", envir = ensr_env)
all.equal(digest::sha1(landfill),
          digest::sha1(ensr_env$landfill))
## [1] TRUE

print(sessionInfo(), local = FALSE) ## R Under development (unstable) (2019-01-13 r75986) ## Platform: x86_64-pc-linux-gnu (64-bit) ## Running under: Debian GNU/Linux 9 (stretch) ## ## Matrix products: default ## BLAS/LAPACK: /usr/lib/libopenblasp-r0.2.19.so ## ## attached base packages: ## [1] stats graphics grDevices utils datasets methods base ## ## other attached packages: ## [1] ensr_0.1.0 glmnet_2.0-16 foreach_1.4.4 Matrix_1.2-15 ## [5] digest_0.6.18 magrittr_1.5 data.table_1.12.0 qwraps2_0.4.0 ## [9] knitr_1.21 ## ## loaded via a namespace (and not attached): ## [1] Rcpp_1.0.0 codetools_0.2-16 lattice_0.20-38 grid_3.6.0 ## [5] evaluate_0.12 rlang_0.3.1 stringi_1.2.4 rmarkdown_1.11 ## [9] iterators_1.0.10 tools_3.6.0 stringr_1.3.1 xfun_0.4 ## [13] yaml_2.2.0 compiler_3.6.0 htmltools_0.3.6

Elastic Net SearcheR Datasets

ensr package version 0.1.0

Peter E. DeWitt

2019-01-14

Traumatic Brain Injuries

Water Percolation Through A Landfill

Landfill Data Preparation

Session Info