Making cheese

Structuring code that is robust to updates in the data, changes in methodological specifications, etc., and can get to presentation-ready results in an automated way can mitigate errors caused by painful, tedious tasks–it can also be a huge time saver. The cheese package was designed with this philosophy in mind, heavily influenced by (and explicit dependencies on) tidy concepts. Its intention is to provide general tools for working with data and results during statistical analysis, in addition to a collection of functions designated for specific statistical tasks.

Let’s take a closer look:

#Load the package
require(cheese)
## Loading required package: cheese

Heart disease dataset

This data comes from the UCI Machine Learning Repository, containing a collection of demographic and clinical characteristics from 303 patients. It was subsequently processed and cleaned into a format suitable for analysis–details of which can be found here.

#Look at the data
heart_disease
## # A tibble: 303 × 9
##      Age Sex    ChestPain           BP Cholest…¹ Blood…² Maxim…³ Exerc…⁴ Heart…⁵
##    <dbl> <fct>  <fct>            <dbl>     <dbl> <lgl>     <dbl> <fct>   <fct>  
##  1    63 Male   Typical angina     145       233 TRUE        150 No      No     
##  2    67 Male   Asymptomatic       160       286 FALSE       108 Yes     Yes    
##  3    67 Male   Asymptomatic       120       229 FALSE       129 Yes     Yes    
##  4    37 Male   Non-anginal pain   130       250 FALSE       187 No      No     
##  5    41 Female Atypical angina    130       204 FALSE       172 No      No     
##  6    56 Male   Atypical angina    120       236 FALSE       178 No      No     
##  7    62 Female Asymptomatic       140       268 FALSE       160 No      Yes    
##  8    57 Female Asymptomatic       120       354 FALSE       163 Yes     No     
##  9    63 Male   Asymptomatic       130       254 FALSE       147 No      Yes    
## 10    53 Male   Asymptomatic       140       203 TRUE        155 Yes     Yes    
## # … with 293 more rows, and abbreviated variable names ¹​Cholesterol,
## #   ²​BloodSugar, ³​MaximumHR, ⁴​ExerciseInducedAngina, ⁵​HeartDisease

The functions that will be introduced are roughly in order of their development, as many build off of one another. Selection of columns is done with non-standard evaluation (NSE) or with tidyselect::select_helpers, where applicable.

Splitting and binding data

divide() is used to split up a data frame into a list of subsets. Suppose you want to split the example data set by sex and heart disease status:

div_dat <-
  heart_disease %>%
    divide(
      Sex,
      HeartDisease
    )
div_dat
## $Female
## $Female$No
## # A tibble: 72 × 7
##      Age ChestPain           BP Cholesterol BloodSugar MaximumHR ExerciseInduc…¹
##    <dbl> <fct>            <dbl>       <dbl> <lgl>          <dbl> <fct>          
##  1    41 Atypical angina    130         204 FALSE            172 No             
##  2    57 Asymptomatic       120         354 FALSE            163 Yes            
##  3    56 Atypical angina    140         294 FALSE            153 No             
##  4    48 Non-anginal pain   130         275 FALSE            139 No             
##  5    58 Typical angina     150         283 TRUE             162 No             
##  6    50 Non-anginal pain   120         219 FALSE            158 No             
##  7    58 Non-anginal pain   120         340 FALSE            172 No             
##  8    66 Typical angina     150         226 FALSE            114 No             
##  9    69 Typical angina     140         239 FALSE            151 No             
## 10    71 Atypical angina    160         302 FALSE            162 No             
## # … with 62 more rows, and abbreviated variable name ¹​ExerciseInducedAngina
## 
## $Female$Yes
## # A tibble: 25 × 7
##      Age ChestPain           BP Cholesterol BloodSugar MaximumHR ExerciseInduc…¹
##    <dbl> <fct>            <dbl>       <dbl> <lgl>          <dbl> <fct>          
##  1    62 Asymptomatic       140         268 FALSE            160 No             
##  2    65 Asymptomatic       150         225 FALSE            114 No             
##  3    61 Asymptomatic       130         330 FALSE            169 No             
##  4    51 Asymptomatic       130         305 FALSE            142 Yes            
##  5    62 Asymptomatic       160         164 FALSE            145 No             
##  6    60 Asymptomatic       150         258 FALSE            157 No             
##  7    61 Asymptomatic       145         307 FALSE            146 Yes            
##  8    43 Asymptomatic       132         341 TRUE             136 Yes            
##  9    62 Non-anginal pain   130         263 FALSE             97 No             
## 10    63 Asymptomatic       150         407 FALSE            154 No             
## # … with 15 more rows, and abbreviated variable name ¹​ExerciseInducedAngina
## 
## 
## $Male
## $Male$No
## # A tibble: 92 × 7
##      Age ChestPain           BP Cholesterol BloodSugar MaximumHR ExerciseInduc…¹
##    <dbl> <fct>            <dbl>       <dbl> <lgl>          <dbl> <fct>          
##  1    63 Typical angina     145         233 TRUE             150 No             
##  2    37 Non-anginal pain   130         250 FALSE            187 No             
##  3    56 Atypical angina    120         236 FALSE            178 No             
##  4    57 Asymptomatic       140         192 FALSE            148 No             
##  5    44 Atypical angina    120         263 FALSE            173 No             
##  6    52 Non-anginal pain   172         199 TRUE             162 No             
##  7    57 Non-anginal pain   150         168 FALSE            174 No             
##  8    54 Asymptomatic       140         239 FALSE            160 No             
##  9    49 Atypical angina    130         266 FALSE            171 No             
## 10    64 Typical angina     110         211 FALSE            144 Yes            
## # … with 82 more rows, and abbreviated variable name ¹​ExerciseInducedAngina
## 
## $Male$Yes
## # A tibble: 114 × 7
##      Age ChestPain           BP Cholesterol BloodSugar MaximumHR ExerciseInduc…¹
##    <dbl> <fct>            <dbl>       <dbl> <lgl>          <dbl> <fct>          
##  1    67 Asymptomatic       160         286 FALSE            108 Yes            
##  2    67 Asymptomatic       120         229 FALSE            129 Yes            
##  3    63 Asymptomatic       130         254 FALSE            147 No             
##  4    53 Asymptomatic       140         203 TRUE             155 Yes            
##  5    56 Non-anginal pain   130         256 TRUE             142 Yes            
##  6    48 Atypical angina    110         229 FALSE            168 No             
##  7    58 Atypical angina    120         284 FALSE            160 No             
##  8    58 Non-anginal pain   132         224 FALSE            173 No             
##  9    60 Asymptomatic       130         206 FALSE            132 Yes            
## 10    40 Asymptomatic       110         167 FALSE            114 Yes            
## # … with 104 more rows, and abbreviated variable name ¹​ExerciseInducedAngina

The default behavior is to continually split the data in a hierarchical structure based on the order of the input columns, and to remove them from the result. The keep argument can be used to retain the column(s) in the data frames.

fasten() allows you to take a list structure that contains data frames at an arbitrary depth, and roll them back up into a single data frame. This is useful when a statistical process needs to be mapped to many subsets, and you want to be able to easily collect the results without repeatedly binding data. You can call the function with the divided data from above:

div_dat %>%
  fasten()
## # A tibble: 303 × 7
##      Age ChestPain           BP Cholesterol BloodSugar MaximumHR ExerciseInduc…¹
##    <dbl> <fct>            <dbl>       <dbl> <lgl>          <dbl> <fct>          
##  1    41 Atypical angina    130         204 FALSE            172 No             
##  2    57 Asymptomatic       120         354 FALSE            163 Yes            
##  3    56 Atypical angina    140         294 FALSE            153 No             
##  4    48 Non-anginal pain   130         275 FALSE            139 No             
##  5    58 Typical angina     150         283 TRUE             162 No             
##  6    50 Non-anginal pain   120         219 FALSE            158 No             
##  7    58 Non-anginal pain   120         340 FALSE            172 No             
##  8    66 Typical angina     150         226 FALSE            114 No             
##  9    69 Typical angina     140         239 FALSE            151 No             
## 10    71 Atypical angina    160         302 FALSE            162 No             
## # … with 293 more rows, and abbreviated variable name ¹​ExerciseInducedAngina

It was binded back to the original number of rows, but it lost the columns it was split by. The into argument can be used to handle this:

div_dat %>%
  fasten(
    into = c("Sex", "HeartDisease")
  )
## # A tibble: 303 × 9
##    Sex    HeartDisease   Age ChestPain        BP Chole…¹ Blood…² Maxim…³ Exerc…⁴
##    <chr>  <chr>        <dbl> <fct>         <dbl>   <dbl> <lgl>     <dbl> <fct>  
##  1 Female No              41 Atypical ang…   130     204 FALSE       172 No     
##  2 Female No              57 Asymptomatic    120     354 FALSE       163 Yes    
##  3 Female No              56 Atypical ang…   140     294 FALSE       153 No     
##  4 Female No              48 Non-anginal …   130     275 FALSE       139 No     
##  5 Female No              58 Typical angi…   150     283 TRUE        162 No     
##  6 Female No              50 Non-anginal …   120     219 FALSE       158 No     
##  7 Female No              58 Non-anginal …   120     340 FALSE       172 No     
##  8 Female No              66 Typical angi…   150     226 FALSE       114 No     
##  9 Female No              69 Typical angi…   140     239 FALSE       151 No     
## 10 Female No              71 Atypical ang…   160     302 FALSE       162 No     
## # … with 293 more rows, and abbreviated variable names ¹​Cholesterol,
## #   ²​BloodSugar, ³​MaximumHR, ⁴​ExerciseInducedAngina

The positions of the into values always lineup with the level of the list hierarchy. So, for example, if you don’t care about retaining the heart disease status, you can do this:

div_dat %>%
  fasten(
    into = "Sex"
  )
## # A tibble: 303 × 8
##    Sex      Age ChestPain           BP Cholesterol BloodSugar MaximumHR Exerci…¹
##    <chr>  <dbl> <fct>            <dbl>       <dbl> <lgl>          <dbl> <fct>   
##  1 Female    41 Atypical angina    130         204 FALSE            172 No      
##  2 Female    57 Asymptomatic       120         354 FALSE            163 Yes     
##  3 Female    56 Atypical angina    140         294 FALSE            153 No      
##  4 Female    48 Non-anginal pain   130         275 FALSE            139 No      
##  5 Female    58 Typical angina     150         283 TRUE             162 No      
##  6 Female    50 Non-anginal pain   120         219 FALSE            158 No      
##  7 Female    58 Non-anginal pain   120         340 FALSE            172 No      
##  8 Female    66 Typical angina     150         226 FALSE            114 No      
##  9 Female    69 Typical angina     140         239 FALSE            151 No      
## 10 Female    71 Atypical angina    160         302 FALSE            162 No      
## # … with 293 more rows, and abbreviated variable name ¹​ExerciseInducedAngina

In contrast, if you want to forgo the sex indicator, empty strings will need to be used as placeholders so the names are applied at the correct levels.

div_dat %>%
  fasten(
    into = c("", "HeartDisease")
  )
## # A tibble: 303 × 8
##    HeartDisease   Age ChestPain           BP Cholesterol Blood…¹ Maxim…² Exerc…³
##    <chr>        <dbl> <fct>            <dbl>       <dbl> <lgl>     <dbl> <fct>  
##  1 No              41 Atypical angina    130         204 FALSE       172 No     
##  2 No              57 Asymptomatic       120         354 FALSE       163 Yes    
##  3 No              56 Atypical angina    140         294 FALSE       153 No     
##  4 No              48 Non-anginal pain   130         275 FALSE       139 No     
##  5 No              58 Typical angina     150         283 TRUE        162 No     
##  6 No              50 Non-anginal pain   120         219 FALSE       158 No     
##  7 No              58 Non-anginal pain   120         340 FALSE       172 No     
##  8 No              66 Typical angina     150         226 FALSE       114 No     
##  9 No              69 Typical angina     140         239 FALSE       151 No     
## 10 No              71 Atypical angina    160         302 FALSE       162 No     
## # … with 293 more rows, and abbreviated variable names ¹​BloodSugar, ²​MaximumHR,
## #   ³​ExerciseInducedAngina

Obviously, the classes of the split columns are not retained from the original data since the splitting and binding processes are independent.

Adjusting the depth

As shown above, the default behavior for these functions is to split or bind “as much as possible”. However, it can be useful to have control over the shape of the split. This is where the depth argument comes in. Suppose you want the same data frames at the leaves of the list, but only one level deep:

heart_disease %>%
  divide(
    Sex,
    HeartDisease,
    depth = 1
  )
## $`Female|No`
## # A tibble: 72 × 7
##      Age ChestPain           BP Cholesterol BloodSugar MaximumHR ExerciseInduc…¹
##    <dbl> <fct>            <dbl>       <dbl> <lgl>          <dbl> <fct>          
##  1    41 Atypical angina    130         204 FALSE            172 No             
##  2    57 Asymptomatic       120         354 FALSE            163 Yes            
##  3    56 Atypical angina    140         294 FALSE            153 No             
##  4    48 Non-anginal pain   130         275 FALSE            139 No             
##  5    58 Typical angina     150         283 TRUE             162 No             
##  6    50 Non-anginal pain   120         219 FALSE            158 No             
##  7    58 Non-anginal pain   120         340 FALSE            172 No             
##  8    66 Typical angina     150         226 FALSE            114 No             
##  9    69 Typical angina     140         239 FALSE            151 No             
## 10    71 Atypical angina    160         302 FALSE            162 No             
## # … with 62 more rows, and abbreviated variable name ¹​ExerciseInducedAngina
## 
## $`Male|No`
## # A tibble: 92 × 7
##      Age ChestPain           BP Cholesterol BloodSugar MaximumHR ExerciseInduc…¹
##    <dbl> <fct>            <dbl>       <dbl> <lgl>          <dbl> <fct>          
##  1    63 Typical angina     145         233 TRUE             150 No             
##  2    37 Non-anginal pain   130         250 FALSE            187 No             
##  3    56 Atypical angina    120         236 FALSE            178 No             
##  4    57 Asymptomatic       140         192 FALSE            148 No             
##  5    44 Atypical angina    120         263 FALSE            173 No             
##  6    52 Non-anginal pain   172         199 TRUE             162 No             
##  7    57 Non-anginal pain   150         168 FALSE            174 No             
##  8    54 Asymptomatic       140         239 FALSE            160 No             
##  9    49 Atypical angina    130         266 FALSE            171 No             
## 10    64 Typical angina     110         211 FALSE            144 Yes            
## # … with 82 more rows, and abbreviated variable name ¹​ExerciseInducedAngina
## 
## $`Female|Yes`
## # A tibble: 25 × 7
##      Age ChestPain           BP Cholesterol BloodSugar MaximumHR ExerciseInduc…¹
##    <dbl> <fct>            <dbl>       <dbl> <lgl>          <dbl> <fct>          
##  1    62 Asymptomatic       140         268 FALSE            160 No             
##  2    65 Asymptomatic       150         225 FALSE            114 No             
##  3    61 Asymptomatic       130         330 FALSE            169 No             
##  4    51 Asymptomatic       130         305 FALSE            142 Yes            
##  5    62 Asymptomatic       160         164 FALSE            145 No             
##  6    60 Asymptomatic       150         258 FALSE            157 No             
##  7    61 Asymptomatic       145         307 FALSE            146 Yes            
##  8    43 Asymptomatic       132         341 TRUE             136 Yes            
##  9    62 Non-anginal pain   130         263 FALSE             97 No             
## 10    63 Asymptomatic       150         407 FALSE            154 No             
## # … with 15 more rows, and abbreviated variable name ¹​ExerciseInducedAngina
## 
## $`Male|Yes`
## # A tibble: 114 × 7
##      Age ChestPain           BP Cholesterol BloodSugar MaximumHR ExerciseInduc…¹
##    <dbl> <fct>            <dbl>       <dbl> <lgl>          <dbl> <fct>          
##  1    67 Asymptomatic       160         286 FALSE            108 Yes            
##  2    67 Asymptomatic       120         229 FALSE            129 Yes            
##  3    63 Asymptomatic       130         254 FALSE            147 No             
##  4    53 Asymptomatic       140         203 TRUE             155 Yes            
##  5    56 Non-anginal pain   130         256 TRUE             142 Yes            
##  6    48 Atypical angina    110         229 FALSE            168 No             
##  7    58 Atypical angina    120         284 FALSE            160 No             
##  8    58 Non-anginal pain   132         224 FALSE            173 No             
##  9    60 Asymptomatic       130         206 FALSE            132 Yes            
## 10    40 Asymptomatic       110         167 FALSE            114 Yes            
## # … with 104 more rows, and abbreviated variable name ¹​ExerciseInducedAngina

You now have list with four elements containing the subsets–the names of which are the concatenated levels of the split columns (see the sep argument).

This argument also works when binding data. Going back to the original divided data frame, you may only want to bind the internal lists.

div_dat %>%
  fasten(
    into = "HeartDisease",
    depth = 1
  )
## $Female
## # A tibble: 97 × 8
##    HeartDisease   Age ChestPain           BP Cholesterol Blood…¹ Maxim…² Exerc…³
##    <chr>        <dbl> <fct>            <dbl>       <dbl> <lgl>     <dbl> <fct>  
##  1 No              41 Atypical angina    130         204 FALSE       172 No     
##  2 No              57 Asymptomatic       120         354 FALSE       163 Yes    
##  3 No              56 Atypical angina    140         294 FALSE       153 No     
##  4 No              48 Non-anginal pain   130         275 FALSE       139 No     
##  5 No              58 Typical angina     150         283 TRUE        162 No     
##  6 No              50 Non-anginal pain   120         219 FALSE       158 No     
##  7 No              58 Non-anginal pain   120         340 FALSE       172 No     
##  8 No              66 Typical angina     150         226 FALSE       114 No     
##  9 No              69 Typical angina     140         239 FALSE       151 No     
## 10 No              71 Atypical angina    160         302 FALSE       162 No     
## # … with 87 more rows, and abbreviated variable names ¹​BloodSugar, ²​MaximumHR,
## #   ³​ExerciseInducedAngina
## 
## $Male
## # A tibble: 206 × 8
##    HeartDisease   Age ChestPain           BP Cholesterol Blood…¹ Maxim…² Exerc…³
##    <chr>        <dbl> <fct>            <dbl>       <dbl> <lgl>     <dbl> <fct>  
##  1 No              63 Typical angina     145         233 TRUE        150 No     
##  2 No              37 Non-anginal pain   130         250 FALSE       187 No     
##  3 No              56 Atypical angina    120         236 FALSE       178 No     
##  4 No              57 Asymptomatic       140         192 FALSE       148 No     
##  5 No              44 Atypical angina    120         263 FALSE       173 No     
##  6 No              52 Non-anginal pain   172         199 TRUE        162 No     
##  7 No              57 Non-anginal pain   150         168 FALSE       174 No     
##  8 No              54 Asymptomatic       140         239 FALSE       160 No     
##  9 No              49 Atypical angina    130         266 FALSE       171 No     
## 10 No              64 Typical angina     110         211 FALSE       144 Yes    
## # … with 196 more rows, and abbreviated variable names ¹​BloodSugar, ²​MaximumHR,
## #   ³​ExerciseInducedAngina

Note that the positions of the into values are directly related to how shallow/deep you want the result to be.

The depth can also be controlled relative to the leaves by using negative integers. This can be useful if it is unclear how deep the list will be (or is). In the example above, suppose you didn’t know how deep the list was, but you knew the heart disease status was the most internal split, and thus wanted to bind it.

div_dat %>%
  fasten(
    into = "HeartDisease",
    depth = -1
  )
## $Female
## # A tibble: 97 × 8
##    HeartDisease   Age ChestPain           BP Cholesterol Blood…¹ Maxim…² Exerc…³
##    <chr>        <dbl> <fct>            <dbl>       <dbl> <lgl>     <dbl> <fct>  
##  1 No              41 Atypical angina    130         204 FALSE       172 No     
##  2 No              57 Asymptomatic       120         354 FALSE       163 Yes    
##  3 No              56 Atypical angina    140         294 FALSE       153 No     
##  4 No              48 Non-anginal pain   130         275 FALSE       139 No     
##  5 No              58 Typical angina     150         283 TRUE        162 No     
##  6 No              50 Non-anginal pain   120         219 FALSE       158 No     
##  7 No              58 Non-anginal pain   120         340 FALSE       172 No     
##  8 No              66 Typical angina     150         226 FALSE       114 No     
##  9 No              69 Typical angina     140         239 FALSE       151 No     
## 10 No              71 Atypical angina    160         302 FALSE       162 No     
## # … with 87 more rows, and abbreviated variable names ¹​BloodSugar, ²​MaximumHR,
## #   ³​ExerciseInducedAngina
## 
## $Male
## # A tibble: 206 × 8
##    HeartDisease   Age ChestPain           BP Cholesterol Blood…¹ Maxim…² Exerc…³
##    <chr>        <dbl> <fct>            <dbl>       <dbl> <lgl>     <dbl> <fct>  
##  1 No              63 Typical angina     145         233 TRUE        150 No     
##  2 No              37 Non-anginal pain   130         250 FALSE       187 No     
##  3 No              56 Atypical angina    120         236 FALSE       178 No     
##  4 No              57 Asymptomatic       140         192 FALSE       148 No     
##  5 No              44 Atypical angina    120         263 FALSE       173 No     
##  6 No              52 Non-anginal pain   172         199 TRUE        162 No     
##  7 No              57 Non-anginal pain   150         168 FALSE       174 No     
##  8 No              54 Asymptomatic       140         239 FALSE       160 No     
##  9 No              49 Atypical angina    130         266 FALSE       171 No     
## 10 No              64 Typical angina     110         211 FALSE       144 Yes    
## # … with 196 more rows, and abbreviated variable names ¹​BloodSugar, ²​MaximumHR,
## #   ³​ExerciseInducedAngina

The new depth is one less than the input. In this case, the same result was achieved because of the symmetry.

Applying functions

The next set of functions are rooted in functional programming (i.e. purrr). In each instance, given a pre-defined function, you can easily control the scope of the data in which it is evaluated on. This is an incredibly useful and efficient philosophy for generating reusable code that is non-repetitive.

Pairwise combinations of column sets

Often times a statistical analysis is concerned with multiple outcomes/targets, though each one potentially uses the same set of explanatory variables. You could of course hard code the specific analysis steps for each outcome (e.g. formulas, etc.), but then the code becomes repetitive. You could also make separate data sets for each outcome with the associated predictors and then apply the same analysis process to each data set, but then you waste resources by storing multiple copies of the same data in memory. dish() allows you to distribute a function to various pairwise sets of columns from the same data set in a single call. The motivation described above is merely that–this is a general function that can be applied to any context.

As a simple first example, lets compute the Pearson correlation of blood pressure and cholesterol with all numeric columns:

heart_disease %>%
  dish(
    f = cor,
    left = c(BP, Cholesterol),
    right = where(is.numeric)
  )
## $BP
## $BP$Age
## [1] 0.2849459
## 
## $BP$BP
## [1] 1
## 
## $BP$Cholesterol
## [1] 0.1301201
## 
## $BP$MaximumHR
## [1] -0.04535088
## 
## 
## $Cholesterol
## $Cholesterol$Age
## [1] 0.2089503
## 
## $Cholesterol$BP
## [1] 0.1301201
## 
## $Cholesterol$Cholesterol
## [1] 1
## 
## $Cholesterol$MaximumHR
## [1] -0.003431832

The left argument controls which columns are entered in the first argument of f, and the right argument controls which columns are entered into the second. In the example, you’ll notice that the left columns were also included in the set of right columns, because they are, in fact, numeric(). If you didn’t want this, you can omit the right argument, which will then include everything not in left as the second argument of f:

heart_disease %>%
  dplyr::select(where(is.numeric)) %>% #Need to do this so only numeric columns are evaluated
  dish(
    f = cor,
    left = c(BP, Cholesterol)
  )
## $BP
## $BP$Age
## [1] 0.2849459
## 
## $BP$MaximumHR
## [1] -0.04535088
## 
## 
## $Cholesterol
## $Cholesterol$Age
## [1] 0.2089503
## 
## $Cholesterol$MaximumHR
## [1] -0.003431832

Now lets suppose you want to regress both blood pressure and cholesterol on all other variables in the data set. The each_ arguments allow you to control whether the column sets are entered into the function individually as vectors, or together in a single data frame. The former is the default, so you’ll need to use this argument here:

heart_disease %>%
  dish(
    f = function(y, x) lm(y ~ ., data = x),
    left = c(BP, Cholesterol),
    each_right = FALSE
  )
## $BP
## 
## Call:
## lm(formula = y ~ ., data = x)
## 
## Coefficients:
##               (Intercept)                        Age  
##                  97.09264                    0.50748  
##                   SexMale   ChestPainAtypical angina  
##                  -3.95940                   -9.84788  
## ChestPainNon-anginal pain      ChestPainAsymptomatic  
##                  -9.56350                  -10.11788  
##            BloodSugarTRUE                  MaximumHR  
##                   6.70106                    0.09688  
##  ExerciseInducedAnginaYes            HeartDiseaseYes  
##                   1.72906                    6.00819  
## 
## 
## $Cholesterol
## 
## Call:
## lm(formula = y ~ ., data = x)
## 
## Coefficients:
##               (Intercept)                        Age  
##                  127.5108                     1.2471  
##                   SexMale   ChestPainAtypical angina  
##                  -23.6185                     8.8270  
## ChestPainNon-anginal pain      ChestPainAsymptomatic  
##                    5.7660                     8.0778  
##            BloodSugarTRUE                  MaximumHR  
##                   -0.7327                     0.3491  
##  ExerciseInducedAnginaYes            HeartDiseaseYes  
##                    7.8039                    12.5303

Subsets of data

stratiply() streamlines the familiar split() then purrr::map() approach by making it simple to select the stratification columns, and apply a function to each subset. Let’s run a multiple logistic regression model using heart disease as the outcome, stratified by sex:

heart_disease %>%
  stratiply(
    f = glm,
    by = Sex,
    formula = HeartDisease ~ . -ChestPain,
    family = "binomial"
  )
## $Female
## 
## Call:  .f(formula = ..1, family = "binomial", data = .x[[i]])
## 
## Coefficients:
##              (Intercept)                       Age                        BP  
##                 -6.95485                   0.03362                   0.04366  
##              Cholesterol            BloodSugarTRUE                 MaximumHR  
##                  0.00354                   0.47028                  -0.02465  
## ExerciseInducedAnginaYes  
##                  2.12039  
## 
## Degrees of Freedom: 96 Total (i.e. Null);  90 Residual
## Null Deviance:       110.7 
## Residual Deviance: 76.21     AIC: 90.21
## 
## $Male
## 
## Call:  .f(formula = ..1, family = "binomial", data = .x[[i]])
## 
## Coefficients:
##              (Intercept)                       Age                        BP  
##                 1.877142                  0.024853                  0.007710  
##              Cholesterol            BloodSugarTRUE                 MaximumHR  
##                 0.008718                 -0.390310                 -0.042231  
## ExerciseInducedAnginaYes  
##                 1.106387  
## 
## Degrees of Freedom: 205 Total (i.e. Null);  199 Residual
## Null Deviance:       283.2 
## Residual Deviance: 211.2     AIC: 225.2

Adding multiple stratification columns will produce a deeper list:

heart_disease %>%
  stratiply(
    f = glm,
    by = c(Sex, ExerciseInducedAngina),
    formula = HeartDisease ~ . -ChestPain,
    family = "binomial"
  )
## $Female
## $Female$No
## 
## Call:  .f(formula = ..1, family = "binomial", data = .x[[i]])
## 
## Coefficients:
##    (Intercept)             Age              BP     Cholesterol  BloodSugarTRUE  
##       -9.58123         0.01027         0.07213         0.00406         0.65807  
##      MaximumHR  
##       -0.02540  
## 
## Degrees of Freedom: 74 Total (i.e. Null);  69 Residual
## Null Deviance:       62.53 
## Residual Deviance: 49.93     AIC: 61.93
## 
## $Female$Yes
## 
## Call:  .f(formula = ..1, family = "binomial", data = .x[[i]])
## 
## Coefficients:
##    (Intercept)             Age              BP     Cholesterol  BloodSugarTRUE  
##       3.850121        0.049450        0.012214        0.001458        0.945153  
##      MaximumHR  
##      -0.056538  
## 
## Degrees of Freedom: 21 Total (i.e. Null);  16 Residual
## Null Deviance:       28.84 
## Residual Deviance: 23.36     AIC: 35.36
## 
## 
## $Male
## $Male$No
## 
## Call:  .f(formula = ..1, family = "binomial", data = .x[[i]])
## 
## Coefficients:
##    (Intercept)             Age              BP     Cholesterol  BloodSugarTRUE  
##       2.728309        0.053058       -0.003653        0.005387       -1.106214  
##      MaximumHR  
##      -0.042036  
## 
## Degrees of Freedom: 128 Total (i.e. Null);  123 Residual
## Null Deviance:       174 
## Residual Deviance: 142.3     AIC: 154.3
## 
## $Male$Yes
## 
## Call:  .f(formula = ..1, family = "binomial", data = .x[[i]])
## 
## Coefficients:
##    (Intercept)             Age              BP     Cholesterol  BloodSugarTRUE  
##        0.70070        -0.01348         0.03561         0.01342         1.09404  
##      MaximumHR  
##       -0.04531  
## 
## Degrees of Freedom: 76 Total (i.e. Null);  71 Residual
## Null Deviance:       75.94 
## Residual Deviance: 60.53     AIC: 72.53

Distributing to specific classes

typly() allows you to apply a function to columns of a data frame whose type inherits at least one (or none) of the specified candidates. It is similar to purrr::map_if(), but uses methods::is() for determining types (see some_type()), and only returns the results for the selected columns:

heart_disease %>%
  typly(
    f = mean,
    types = c("numeric", "logical")
  )
## $Age
## [1] 54.43894
## 
## $BP
## [1] 131.6898
## 
## $Cholesterol
## [1] 246.6931
## 
## $BloodSugar
## [1] 0.1485149
## 
## $MaximumHR
## [1] 149.6073

You can use the negated argument to apply the function to columns that are not any of the listed types:

heart_disease %>%
  typly(
    f = table,
    types = c("numeric", "logical"),
    negated = TRUE,
    useNA = "always"
  )
## $Sex
## 
## Female   Male   <NA> 
##     97    206      0 
## 
## $ChestPain
## 
##   Typical angina  Atypical angina Non-anginal pain     Asymptomatic 
##               23               50               86              144 
##             <NA> 
##                0 
## 
## $ExerciseInducedAngina
## 
##   No  Yes <NA> 
##  204   99    0 
## 
## $HeartDisease
## 
##   No  Yes <NA> 
##  164  139    0

Populating key-value pairs into custom string templates

absorb() allows you to create collections of strings that use keys as placeholders, which are then populated by the values. This is useful for setting up results of your analysis to be presented in a specific format that is independent of the data set at hand. Here’s a simple example:

absorb(
  key = c("mean", "sd", "var"),
  value = c("10", "2", "4"),
  text = 
    c(
      "MEAN: mean, SD: sd",
      "VAR: var = sd^2",
      MEAN = "mean"
    )
)
##                                                  MEAN 
## "MEAN: 10, SD: 2"    "VAR: 4 = 2^2"              "10"

The input text is scanned to look for the keys (interpreted as regular expressions) and are replaced with the corresponding values.

Let’s look at a more useful example. Suppose we have a collection summary statistics for patient age by angina and heart disease status:

age_stats <-
  heart_disease %>%
  dplyr::group_by(
    ExerciseInducedAngina,
    HeartDisease
  ) %>%
  dplyr::summarise(
    mean = round(mean(Age), 2),
    sd = round(sd(Age), 2),
    median = median(Age),
    iqr = IQR(Age),
    .groups = "drop"
  ) %>%
  tidyr::gather(
    key = "key",
    value = "value",
    -ExerciseInducedAngina,
    -HeartDisease
  )
age_stats
## # A tibble: 16 × 4
##    ExerciseInducedAngina HeartDisease key    value
##    <fct>                 <fct>        <chr>  <dbl>
##  1 No                    No           mean   52.4 
##  2 No                    Yes          mean   57.1 
##  3 Yes                   No           mean   53.6 
##  4 Yes                   Yes          mean   56.2 
##  5 No                    No           sd      9.68
##  6 No                    Yes          sd      7.56
##  7 Yes                   No           sd      8.54
##  8 Yes                   Yes          sd      8.27
##  9 No                    No           median 52   
## 10 No                    Yes          median 58   
## 11 Yes                   No           median 53   
## 12 Yes                   Yes          median 57   
## 13 No                    No           iqr    15   
## 14 No                    Yes          iqr    10   
## 15 Yes                   No           iqr    12.5 
## 16 Yes                   Yes          iqr     8.25

Next, you can specify the string format you want to nicely summarize the information in each group:

age_stats %>%
  dplyr::group_by(
    ExerciseInducedAngina,
    HeartDisease
  ) %>%
  dplyr::summarise(
    Summary = 
      absorb(
        key = key,
        value = value,
        text = 
          c(
            "mean (sd) | median (iqr)"
          )
      ),
    .groups = "drop"
  ) 
## # A tibble: 4 × 3
##   ExerciseInducedAngina HeartDisease Summary                 
##   <fct>                 <fct>        <chr>                   
## 1 No                    No           52.42 (9.68) | 52 (15)  
## 2 No                    Yes          57.1 (7.56) | 58 (10)   
## 3 Yes                   No           53.61 (8.54) | 53 (12.5)
## 4 Yes                   Yes          56.24 (8.27) | 57 (8.25)

Of course, this is much more useful when the summary data is already in the format of age_stats.

Concatenating repetitive keys

In the case where the key pattern found in the string template is repeated, its values are collapsed into a concatenated string:

absorb(
  key = age_stats$key,
  value = age_stats$value,
  text = c("(mean) / (sd)", "median")
)
## [1] "(52.42_57.1_53.61_56.24) / (9.68_7.56_8.54_8.27)"
## [2] "52_58_53_57"

The sep argument can be used to customize how values are separated.

Evaluating strings as expressions

It may be useful in some cases to evaluate the resulting string as an expression. You can setup the prior example so that the result contains valid expressions:

absorb(
  key = age_stats$key,
  value = age_stats$value,
  text = c("(mean) / (sd)", "median"),
  sep = "+",
  evaluate = TRUE,
  trace = TRUE
)
## (mean) / (sd) median 
## (mean) / (sd) median 
## (52.42+57.1+53.61+56.24) / (sd) median 
## (52.42+57.1+53.61+56.24) / (sd) 52+58+53+57
## [[1]]
## [1] 6.442584
## 
## [[2]]
## [1] 220

These statistics are not entirely useful here, but you get the point.

Traversing lists to identify objects

depths() and depths_string() are used for traversing a list structure to find the elements that satisfy a predicate. The former produces a vector of the unique depths that contain at least one element where predicate is true, and the latter creates a string representation of the traversal with the actual positions.

#Make a divided data frame
div_dat1 <-
  heart_disease %>%
  divide(
    Sex,
    HeartDisease
  )

#Find the depths of the data frames
div_dat1 %>%
  depths(
    predicate = is.data.frame
  )
## [1] 2
div_dat1 %>%
  depths_string(
    predicate = is.data.frame
  )
## [1] "{1{-1,-2},2{-1,-2}}"

In the string result, the brackets represent the level of the list, and the integers represent the index at that level, which are negative if predicate is true for that element.

By default, the algorithm continues when rlang::is_bare_list() so that the traversal can end with list-like objects. However, the bare argument can be used to traverse deeper into the list:

div_dat1 %>%
  depths_string(
    predicate = is.data.frame,
    bare = FALSE
  )
## [1] "{1{-1{1,2,3,4,5,6,7},-2{1,2,3,4,5,6,7}},2{-1{1,2,3,4,5,6,7},-2{1,2,3,4,5,6,7}}}"

Now it also evaluated the columns of each data frame in the list, though it still found the correct positions where the predicate held.

Distorting relationships between columns

muddle() can be used to randomly shuffle columns in a data frame. This is a convenient way to remove confounding when exploring the effects of a variable on an outcome.

set.seed(123)
heart_disease %>%
  muddle()
## # A tibble: 303 × 9
##      Age Sex    ChestPain           BP Cholest…¹ Blood…² Maxim…³ Exerc…⁴ Heart…⁵
##    <dbl> <fct>  <fct>            <dbl>     <dbl> <lgl>     <dbl> <fct>   <fct>  
##  1    43 Female Non-anginal pain   130       214 FALSE       120 Yes     No     
##  2    44 Male   Asymptomatic       130       204 FALSE       143 Yes     Yes    
##  3    68 Female Asymptomatic       128       242 FALSE       125 Yes     No     
##  4    35 Female Asymptomatic       120       269 FALSE       163 Yes     Yes    
##  5    45 Female Atypical angina    138       240 FALSE       169 No      No     
##  6    54 Male   Asymptomatic       128       209 FALSE       182 Yes     No     
##  7    61 Male   Asymptomatic       100       258 FALSE       152 Yes     Yes    
##  8    57 Female Asymptomatic       152       229 FALSE       142 No      No     
##  9    67 Male   Non-anginal pain   110       283 FALSE       138 No      Yes    
## 10    51 Female Atypical angina    125       204 TRUE        161 No      No     
## # … with 293 more rows, and abbreviated variable names ¹​Cholesterol,
## #   ²​BloodSugar, ³​MaximumHR, ⁴​ExerciseInducedAngina, ⁵​HeartDisease

All columns are permuted by default. Use the at argument to only shuffle certain ones:

heart_disease %>%
  muddle(
    at = Age
  )
## # A tibble: 303 × 9
##      Age Sex    ChestPain           BP Cholest…¹ Blood…² Maxim…³ Exerc…⁴ Heart…⁵
##    <dbl> <fct>  <fct>            <dbl>     <dbl> <lgl>     <dbl> <fct>   <fct>  
##  1    47 Male   Typical angina     145       233 TRUE        150 No      No     
##  2    49 Male   Asymptomatic       160       286 FALSE       108 Yes     Yes    
##  3    59 Male   Asymptomatic       120       229 FALSE       129 Yes     Yes    
##  4    35 Male   Non-anginal pain   130       250 FALSE       187 No      No     
##  5    44 Female Atypical angina    130       204 FALSE       172 No      No     
##  6    29 Male   Atypical angina    120       236 FALSE       178 No      No     
##  7    45 Female Asymptomatic       140       268 FALSE       160 No      Yes    
##  8    59 Female Asymptomatic       120       354 FALSE       163 Yes     No     
##  9    46 Male   Asymptomatic       130       254 FALSE       147 No      Yes    
## 10    54 Male   Asymptomatic       140       203 TRUE        155 Yes     Yes    
## # … with 293 more rows, and abbreviated variable names ¹​Cholesterol,
## #   ²​BloodSugar, ³​MaximumHR, ⁴​ExerciseInducedAngina, ⁵​HeartDisease

Additional arguments can be passed to sample() as well. For example, you might want five random values from each column:

heart_disease %>%
  muddle(
    size = 5
  )
## # A tibble: 5 × 9
##     Age Sex    ChestPain           BP Choleste…¹ Blood…² Maxim…³ Exerc…⁴ Heart…⁵
##   <dbl> <fct>  <fct>            <dbl>      <dbl> <lgl>     <dbl> <fct>   <fct>  
## 1    54 Female Asymptomatic       136        275 TRUE        163 No      No     
## 2    62 Male   Asymptomatic       125        271 FALSE       195 No      No     
## 3    52 Male   Non-anginal pain   150        286 FALSE       105 No      No     
## 4    64 Male   Asymptomatic       106        273 FALSE       163 Yes     No     
## 5    54 Male   Atypical angina    120        199 FALSE       182 Yes     Yes    
## # … with abbreviated variable names ¹​Cholesterol, ²​BloodSugar, ³​MaximumHR,
## #   ⁴​ExerciseInducedAngina, ⁵​HeartDisease

Spanning keys and values across the columns

stretch() is for spanning keys and values across the columns. It is similar to tidyr::spread(), but allows for any number of keys and values to be selected. It’s functionality is more so targeted at setting up your results to be presented in a table, rather than for general data manipulation. Thus, the resulting column names and orderings are setup to make it a seamless transition.

To illustrate, lets first collect the parameter estimates and standard errors from multiple regression models using blood pressure and cholesterol as outcomes, stratified by sex:

mod_results <-
  heart_disease %>%
  
  #For each sex
  stratiply(
    by = Sex,
    f =
      ~.x %>%
      
      #Collect coefficients and standard errors for model estimates on multiple outcomes
      dish(
        f = function(y, x) {
          model <- lm(y ~ ., data = x)
          tibble::tibble(
            Parameter = names(model$coefficients),
            Estimate = model$coefficients,
            SE = sqrt(diag(vcov(model)))
          )
        },
        left = c(BP, Cholesterol),
        each_right = FALSE
      ) 
  ) %>%
  
  #Gather results into a single data frame
  fasten(
    into = c("Sex", "Outcome")
  )
mod_results
## # A tibble: 36 × 5
##    Sex    Outcome     Parameter                 Estimate      SE
##    <chr>  <chr>       <chr>                        <dbl>   <dbl>
##  1 Female BP          (Intercept)                 93.7   24.7   
##  2 Female BP          Age                          0.533  0.210 
##  3 Female BP          ChestPainAtypical angina   -16.3    9.73  
##  4 Female BP          ChestPainNon-anginal pain  -15.3    9.15  
##  5 Female BP          ChestPainAsymptomatic      -14.2    9.64  
##  6 Female BP          BloodSugarTRUE               6.92   5.53  
##  7 Female BP          MaximumHR                    0.123  0.0993
##  8 Female BP          ExerciseInducedAnginaYes     8.00   4.89  
##  9 Female BP          HeartDiseaseYes             12.1    5.21  
## 10 Female Cholesterol (Intercept)                -12.1   91.6   
## # … with 26 more rows

Now lets span the estimates and standard errors across the columns for both outcomes:

straught1 <-
  mod_results %>%
  stretch(
    key = Outcome,
    value = c(Estimate, SE)
  )
straught1
## # A tibble: 18 × 6
##    Sex    Parameter                 BP_Estimate   BP_SE Cholesterol_Es…¹ Chole…²
##    <chr>  <chr>                           <dbl>   <dbl>            <dbl>   <dbl>
##  1 Female (Intercept)                   93.7    24.7             -12.1    91.6  
##  2 Female Age                            0.533   0.210             2.30    0.779
##  3 Female BloodSugarTRUE                 6.92    5.53             24.6    20.6  
##  4 Female ChestPainAsymptomatic        -14.2     9.64             34.0    35.8  
##  5 Female ChestPainAtypical angina     -16.3     9.73             22.7    36.1  
##  6 Female ChestPainNon-anginal pain    -15.3     9.15             34.1    34.0  
##  7 Female ExerciseInducedAnginaYes       8.00    4.89              8.14   18.2  
##  8 Female HeartDiseaseYes               12.1     5.21              5.88   19.3  
##  9 Female MaximumHR                      0.123   0.0993            0.720   0.369
## 10 Male   (Intercept)                  105.     14.7             151.     38.7  
## 11 Male   Age                            0.480   0.142             0.765   0.374
## 12 Male   BloodSugarTRUE                 4.96    3.12             -6.85    8.23 
## 13 Male   ChestPainAsymptomatic        -10.0     4.25              3.38   11.2  
## 14 Male   ChestPainAtypical angina      -8.61    4.68              9.55   12.3  
## 15 Male   ChestPainNon-anginal pain     -6.99    4.31             -0.802  11.4  
## 16 Male   ExerciseInducedAnginaYes      -1.08    2.78              6.88    7.33 
## 17 Male   HeartDiseaseYes                3.07    2.80             13.4     7.39 
## 18 Male   MaximumHR                      0.0392  0.0601            0.239   0.158
## # … with abbreviated variable names ¹​Cholesterol_Estimate, ²​Cholesterol_SE

The sep argument controls how the new column names are concatenated.

Now suppose you wanted to also span sex across the columns. It can just be added to the keys:

straught2 <-
  mod_results %>%
  stretch(
    key = c(Sex, Outcome),
    value = c(Estimate, SE)
  )
straught2
## # A tibble: 9 × 9
##   Parameter     Femal…¹ Femal…² Femal…³ Femal…⁴ Male_B…⁵ Male_…⁶ Male_…⁷ Male_…⁸
##   <chr>           <dbl>   <dbl>   <dbl>   <dbl>    <dbl>   <dbl>   <dbl>   <dbl>
## 1 (Intercept)    93.7   24.7    -12.1    91.6   105.     14.7    151.     38.7  
## 2 Age             0.533  0.210    2.30    0.779   0.480   0.142    0.765   0.374
## 3 BloodSugarTR…   6.92   5.53    24.6    20.6     4.96    3.12    -6.85    8.23 
## 4 ChestPainAsy… -14.2    9.64    34.0    35.8   -10.0     4.25     3.38   11.2  
## 5 ChestPainAty… -16.3    9.73    22.7    36.1    -8.61    4.68     9.55   12.3  
## 6 ChestPainNon… -15.3    9.15    34.1    34.0    -6.99    4.31    -0.802  11.4  
## 7 ExerciseIndu…   8.00   4.89     8.14   18.2    -1.08    2.78     6.88    7.33 
## 8 HeartDisease…  12.1    5.21     5.88   19.3     3.07    2.80    13.4     7.39 
## 9 MaximumHR       0.123  0.0993   0.720   0.369   0.0392  0.0601   0.239   0.158
## # … with abbreviated variable names ¹​Female_BP_Estimate, ²​Female_BP_SE,
## #   ³​Female_Cholesterol_Estimate, ⁴​Female_Cholesterol_SE, ⁵​Male_BP_Estimate,
## #   ⁶​Male_BP_SE, ⁷​Male_Cholesterol_Estimate, ⁸​Male_Cholesterol_SE

Notice how the resulting columns are positioned. They are sequentially sorted starting with the outer-most key in a hierarchical manner. Since there are multiple value columns in this example, their names are also appended to the result in the order in which they were requested.

Stacking headers in a table

grable() stands for “graded table”. It is used to stack headers into a knitr::kable() in an automated fashion by iteratively parsing the column names by the specified delimiter. The result of stretch() was purposely made to flow directly into this function.

straught1 %>%
  grable(
    at = -c(Sex, Parameter)
  )
BP
Cholesterol
Sex Parameter Estimate SE Estimate SE
Female (Intercept) 93.7187493 24.6536900 -12.1297785 91.6044593
Female Age 0.5330329 0.2096016 2.2975419 0.7788061
Female BloodSugarTRUE 6.9178015 5.5329199 24.5796821 20.5583885
Female ChestPainAsymptomatic -14.1810178 9.6423236 34.0195680 35.8274902
Female ChestPainAtypical angina -16.3247037 9.7289213 22.6736038 36.1492568
Female ChestPainNon-anginal pain -15.3471449 9.1541813 34.1239627 34.0137249
Female ExerciseInducedAnginaYes 7.9952541 4.8947727 8.1371920 18.1872573
Female HeartDiseaseYes 12.0913966 5.2058240 5.8795266 19.3430149
Female MaximumHR 0.1226586 0.0992801 0.7201695 0.3688901
Male (Intercept) 105.2923383 14.6933205 150.9743256 38.7467039
Male Age 0.4800104 0.1417168 0.7654107 0.3737111
Male BloodSugarTRUE 4.9616363 3.1197639 -6.8540853 8.2269061
Male ChestPainAsymptomatic -10.0474949 4.2547960 3.3795322 11.2200180
Male ChestPainAtypical angina -8.6148546 4.6798629 9.5540927 12.3409315
Male ChestPainNon-anginal pain -6.9875791 4.3056911 -0.8015539 11.3542299
Male ExerciseInducedAnginaYes -1.0827067 2.7780388 6.8846691 7.3257672
Male HeartDiseaseYes 3.0740495 2.8029917 13.4020228 7.3915687
Male MaximumHR 0.0391572 0.0600935 0.2387579 0.1584683

The at argument specifies which columns should be spanned in the header. Let’s try it with the two-key example from above:

straught2 %>%
  grable(
    at = -Parameter
  )
Female
Male
BP
Cholesterol
BP
Cholesterol
Parameter Estimate SE Estimate SE Estimate SE Estimate SE
(Intercept) 93.7187493 24.6536900 -12.1297785 91.6044593 105.2923383 14.6933205 150.9743256 38.7467039
Age 0.5330329 0.2096016 2.2975419 0.7788061 0.4800104 0.1417168 0.7654107 0.3737111
BloodSugarTRUE 6.9178015 5.5329199 24.5796821 20.5583885 4.9616363 3.1197639 -6.8540853 8.2269061
ChestPainAsymptomatic -14.1810178 9.6423236 34.0195680 35.8274902 -10.0474949 4.2547960 3.3795322 11.2200180
ChestPainAtypical angina -16.3247037 9.7289213 22.6736038 36.1492568 -8.6148546 4.6798629 9.5540927 12.3409315
ChestPainNon-anginal pain -15.3471449 9.1541813 34.1239627 34.0137249 -6.9875791 4.3056911 -0.8015539 11.3542299
ExerciseInducedAnginaYes 7.9952541 4.8947727 8.1371920 18.1872573 -1.0827067 2.7780388 6.8846691 7.3257672
HeartDiseaseYes 12.0913966 5.2058240 5.8795266 19.3430149 3.0740495 2.8029917 13.4020228 7.3915687
MaximumHR 0.1226586 0.0992801 0.7201695 0.3688901 0.0391572 0.0600935 0.2387579 0.1584683

The tables can then be piped through subsequent customized processing with kableExtra tools.

Univariate analysis

descriptives() produces a data frame in long format with a collection of descriptive statistics.

heart_disease %>%
  descriptives()
## # A tibble: 111 × 9
##    fun_eval fun_key   col_ind col_lab    val_ind val_lab val_dbl val_chr val_cbn
##    <chr>    <chr>       <int> <chr>        <int> <chr>     <dbl> <chr>   <chr>  
##  1 all      available       1 Age              1 <NA>        303 <NA>    303    
##  2 all      available       2 Sex              1 <NA>        303 <NA>    303    
##  3 all      available       3 ChestPain        1 <NA>        303 <NA>    303    
##  4 all      available       4 BP               1 <NA>        303 <NA>    303    
##  5 all      available       5 Cholester…       1 <NA>        303 <NA>    303    
##  6 all      available       6 BloodSugar       1 <NA>        303 <NA>    303    
##  7 all      available       7 MaximumHR        1 <NA>        303 <NA>    303    
##  8 all      available       8 ExerciseI…       1 <NA>        303 <NA>    303    
##  9 all      available       9 HeartDise…       1 <NA>        303 <NA>    303    
## 10 all      class           1 Age              1 <NA>         NA numeric numeric
## # … with 101 more rows

univariate_associations() is a special case of dish() specifically for computing association metrics of one or more responses with a collection of predictors.

heart_disease %>%
  univariate_associations(
    f = list(AIC = function(y, x) AIC(lm(y ~ x))),
    responses = c(BP, Cholesterol)
  )
## # A tibble: 14 × 3
##    response    predictor               AIC
##    <chr>       <chr>                 <dbl>
##  1 BP          Age                   2577.
##  2 BP          Sex                   2602.
##  3 BP          ChestPain             2598.
##  4 BP          BloodSugar            2593.
##  5 BP          MaximumHR             2602.
##  6 BP          ExerciseInducedAngina 2602.
##  7 BP          HeartDisease          2596.
##  8 Cholesterol Age                   3243.
##  9 Cholesterol Sex                   3244.
## 10 Cholesterol ChestPain             3259.
## 11 Cholesterol BloodSugar            3257.
## 12 Cholesterol MaximumHR             3257.
## 13 Cholesterol ExerciseInducedAngina 3256.
## 14 Cholesterol HeartDisease          3255.

univariate_table() allows you to create a custom descriptive table for a data set. It uses almost every function in the package across its span of capabilities.

heart_disease %>%
  univariate_table()
Variable Level Summary
Age 56 (48, 61)
Sex Female 97 (32.01%)
Sex Male 206 (67.99%)
ChestPain Typical angina 23 (7.59%)
ChestPain Atypical angina 50 (16.5%)
ChestPain Non-anginal pain 86 (28.38%)
ChestPain Asymptomatic 144 (47.52%)
BP 130 (120, 140)
Cholesterol 241 (211, 275)
MaximumHR 153 (133.5, 166)
ExerciseInducedAngina No 204 (67.33%)
ExerciseInducedAngina Yes 99 (32.67%)
HeartDisease No 164 (54.13%)
HeartDisease Yes 139 (45.87%)

See vignette("describe") for a more in-depth introduction to these functions.