This vignette introduces the nestedmodels package and its most basic use case. This and all other vignettes assume that you are familiar with the ‘tidymodels’ framework (e.g. by reading Tidy Modeling with R). This vignette does not aim to teach good statistical practice; instead, it demonstrates how to use the package as simply as possible.
nestedmodels is an extension to the ‘tidymodels’ framework. It allows models and workflows to be used on nested data. It provides an alternative to modeltime’s approach to nested modelling or the ‘multilevelmod’ package, allowing any model or workflow to be used on nested data very easily.
A typical case where you may need the nestedmodels package is panel data. When you have a set of time series, each describing a different object (the historic prices for a set of stocks, for example), you may want to model each time series separately. Many time series modelling tools do not work well with non-date predictors, and many models do not accept non-numeric predictors (although there are often better ways to deal with the latter problem; see recipes::step_dummy()). In this scenario, nested modelling is often the best solution.
The implementation of nestedmodels is very simple. Fitting a nested model fits the model to each nested value (for time series about a set of stocks, a model would be fitted to each stock). The correct model will then be selected and used when making predictions.
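The idea can be illustrated with a minimal base R sketch. This is a conceptual illustration only, not the package's actual implementation; the toy data, the use of lm(), and the helper predict_by_id() are all assumptions made for this example.

```r
set.seed(1)

# Toy panel data: three ids, each with its own slope (hypothetical data)
toy <- data.frame(
  id = rep(1:3, each = 10),
  x = rep(1:10, times = 3)
)
toy$z <- toy$id * toy$x + rnorm(nrow(toy), sd = 0.1)

# Fit one model per id value
fits <- lapply(split(toy, toy$id), function(d) lm(z ~ x, data = d))

# Route each new row to the model that was fitted for its id
predict_by_id <- function(fits, new_data) {
  preds <- numeric(nrow(new_data))
  for (key in unique(new_data$id)) {
    rows <- new_data$id == key
    preds[rows] <- predict(
      fits[[as.character(key)]],
      new_data[rows, , drop = FALSE]
    )
  }
  preds
}

new_rows <- data.frame(id = c(1, 3), x = c(5, 5))
predict_by_id(fits, new_rows)
```

Each id gets its own fitted model, and each new row is predicted by the model whose id matches.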
In this vignette, we’re going to explore the most basic example of a nested model. You’re going to need the following packages:
library(nestedmodels)
library(tidyr)
library(parsnip)
library(recipes)
library(workflows)
library(rsample)
library(glmnet)
We’re going to use the example data included in the nestedmodels package. The data is very simple, and only serves as an example of data that can be nested, rather than representing anything concrete.
data("example_nested_data")
data <- example_nested_data
data
#> # A tibble: 1,000 × 7
#> id id2 x y z a b
#> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 49 48.5 29.1 44.7 50.0
#> 2 1 1 50 64.2 29.7 40.2 64.9
#> 3 1 1 51 -19.4 26.6 43.2 38.0
#> 4 1 1 52 41.0 28.8 66.4 61.7
#> 5 1 1 53 -94.2 23.9 18.2 -1.66
#> 6 1 1 54 72.6 30.0 83.8 38.8
#> 7 1 1 55 -91.5 24.0 91.7 40.7
#> 8 1 1 56 -50.5 25.5 79.8 55.4
#> 9 1 1 57 90.3 30.6 50.3 33.8
#> 10 1 1 58 32.4 28.6 25.4 20.5
#> # ℹ 990 more rows
The data can be nested in the following way:
nested_data <- nest(data, data = -id)
nested_data
#> # A tibble: 20 × 2
#> id data
#> <int> <list>
#> 1 1 <tibble [50 × 6]>
#> 2 2 <tibble [50 × 6]>
#> 3 3 <tibble [50 × 6]>
#> 4 4 <tibble [50 × 6]>
#> 5 5 <tibble [50 × 6]>
#> 6 6 <tibble [50 × 6]>
#> 7 7 <tibble [50 × 6]>
#> 8 8 <tibble [50 × 6]>
#> 9 9 <tibble [50 × 6]>
#> 10 10 <tibble [50 × 6]>
#> 11 11 <tibble [50 × 6]>
#> 12 12 <tibble [50 × 6]>
#> 13 13 <tibble [50 × 6]>
#> 14 14 <tibble [50 × 6]>
#> 15 15 <tibble [50 × 6]>
#> 16 16 <tibble [50 × 6]>
#> 17 17 <tibble [50 × 6]>
#> 18 18 <tibble [50 × 6]>
#> 19 19 <tibble [50 × 6]>
#> 20 20 <tibble [50 × 6]>
Let’s split this data into training and testing sets using the nested_resamples() function. This ensures that the training and testing sets both contain data for every ‘id’ value.
split <- nested_resamples(nested_data, rsample::initial_split())
data_tr <- rsample::training(split)
data_tst <- rsample::testing(split)
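As a quick check of this property, the ‘id’ values in the two sets can be compared. This is a small sketch using base R only; it assumes that data_tr and data_tst are the unnested data frames created above, each still containing the ‘id’ column.

```r
# TRUE if every id value appears in both the training and testing sets
setequal(unique(data_tr$id), unique(data_tst$id))
```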
Now let’s define our model:
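The model definition itself is not shown here. A sketch consistent with the specification printed further down (a linear regression with penalty = 0.1 and the glmnet engine) would be the following; the exact original call is an assumption.

```r
library(parsnip)

# Penalised linear regression. The penalty value and the engine are taken
# from the printed model specification shown later in this vignette.
model <- linear_reg(penalty = 0.1) %>%
  set_engine("glmnet")
```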
Since we’re fitting this model to nested data, we need some way to make the model ‘nested’. This is very simple with the nested() function.
nested_model <- model %>%
  nested()
nested_model
#> Nested Model Specification
#>
#> Inner model:
#> Linear Regression Model Specification (regression)
#>
#> Main Arguments:
#> penalty = 0.1
#>
#> Computational engine: glmnet
We can then fit this model in the usual way. Note that the data must be nested, and the formula cannot include the ‘id’ column.
nested_tr <- tidyr::nest(data_tr, data = -id)
model_fit <- fit(nested_model, z ~ x + y + a + b, nested_tr)
model_fit
#> Nested model fit, with 20 inner models
#> # A tibble: 20 × 2
#> id .model_fit
#> <int> <list>
#> 1 1 <fit[+]>
#> 2 2 <fit[+]>
#> 3 3 <fit[+]>
#> 4 4 <fit[+]>
#> 5 5 <fit[+]>
#> 6 6 <fit[+]>
#> 7 7 <fit[+]>
#> 8 8 <fit[+]>
#> 9 9 <fit[+]>
#> 10 10 <fit[+]>
#> 11 11 <fit[+]>
#> 12 12 <fit[+]>
#> 13 13 <fit[+]>
#> 14 14 <fit[+]>
#> 15 15 <fit[+]>
#> 16 16 <fit[+]>
#> 17 17 <fit[+]>
#> 18 18 <fit[+]>
#> 19 19 <fit[+]>
#> 20 20 <fit[+]>
Predicting can also be done in the usual way (the data to predict on can be either nested or non-nested). Here we predict on the testing set.
predict(model_fit, data_tst)
#> # A tibble: 260 × 1
#> .pred
#> <dbl>
#> 1 31.2
#> 2 27.0
#> 3 25.6
#> 4 41.7
#> 5 28.9
#> 6 27.1
#> 7 17.5
#> 8 27.3
#> 9 27.3
#> 10 26.4
#> # ℹ 250 more rows
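Because nested data is also accepted, the same predictions can be requested from a nested version of the testing set. This is a sketch that assumes model_fit and data_tst from above; the name nested_tst is introduced here for illustration only.

```r
library(tidyr)

# Nest the testing set by id, mirroring how the training set was nested
nested_tst <- nest(data_tst, data = -id)
predict(model_fit, nested_tst)
```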
This method is fine, but having to nest the data ourselves is a pain. We can solve this by using a workflow.
We first define the recipe, and we define a step which is used to nest the data. This time, the formula can include the ‘id’ column, since the recipe needs to act on it.
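The recipe code is not shown here. A sketch consistent with the baked output below, which keeps x, y, a, b and z and replaces ‘id’ with ‘.nest_id’, might look like this. It assumes data_tr from above and uses step_nest() from nestedmodels; the exact formula is inferred from that output rather than confirmed.

```r
library(recipes)
library(nestedmodels)

# The formula includes id so that step_nest() can act on it
recipe <- recipe(data_tr, z ~ x + y + a + b + id) %>%
  step_nest(id)
```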
This is a little easier than nesting the data manually. Note that the recipe does not actually nest the data, but instead removes the specified columns and adds a new column, ‘.nest_id’, which specifies which nest each row belongs to.
recipe %>%
  prep() %>%
  bake(NULL)
#> # A tibble: 740 × 6
#> x y a b z .nest_id
#> <int> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 50 64.2 40.2 64.9 29.7 Nest 1
#> 2 74 75.8 98.7 57.2 38.8 Nest 1
#> 3 85 -8.74 52.4 43.3 53.3 Nest 1
#> 4 57 90.3 50.3 33.8 30.6 Nest 1
#> 5 73 -67.2 31.3 5.80 33.6 Nest 1
#> 6 92 39.9 77.3 99.6 3.31 Nest 1
#> 7 52 41.0 66.4 61.7 28.8 Nest 1
#> 8 65 94.6 54.8 74.7 22.9 Nest 1
#> 9 77 -18.8 13.8 51.9 52.9 Nest 1
#> 10 86 104. 63.8 -0.387 57.4 Nest 1
#> # ℹ 730 more rows
Now we create the workflow, combining the recipe and the model.
A workflow can be fitted in the same way as a model, but note that since we used step_nest(), the data does not have to be nested.
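The code for creating and fitting the workflow is not shown here. A minimal sketch, assuming the nested_model, recipe and data_tr objects from above; the name wf is hypothetical, while wf_fit matches the name used in the calls that follow.

```r
library(parsnip)
library(workflows)

# Combine the nested model and the recipe into a single workflow
wf <- workflow() %>%
  add_model(nested_model) %>%
  add_recipe(recipe)

# Fit on the unnested training data; the recipe step handles the nesting
wf_fit <- fit(wf, data_tr)
```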
This fit object can then be used to make predictions.
predict(wf_fit, data_tst)
#> # A tibble: 260 × 1
#> .pred
#> <dbl>
#> 1 31.2
#> 2 27.0
#> 3 25.6
#> 4 41.7
#> 5 28.9
#> 6 27.1
#> 7 17.5
#> 8 27.3
#> 9 27.3
#> 10 26.4
#> # ℹ 250 more rows
Other common parsnip functions can also be used on fitted nested models:
augment(wf_fit, data_tst)
#> # A tibble: 260 × 8
#> id id2 x y z a b .pred
#> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 51 -19.4 26.6 43.2 38.0 31.2
#> 2 1 1 55 -91.5 24.0 91.7 40.7 27.0
#> 3 1 1 56 -50.5 25.5 79.8 55.4 25.6
#> 4 1 1 62 109. 23.4 5.23 19.8 41.7
#> 5 1 1 63 1.35 19.6 38.2 43.6 28.9
#> 6 1 1 66 46.0 21.2 30.4 60.6 27.1
#> 7 1 2 76 -37.7 52.2 60.8 72.2 17.5
#> 8 1 2 78 32.9 54.7 87.1 61.1 27.3
#> 9 1 2 80 129. 58.2 79.9 87.5 27.3
#> 10 1 2 81 84.9 56.7 2.82 58.2 26.4
#> # ℹ 250 more rows
tidy(wf_fit)
#> # A tibble: 100 × 4
#> .nest_id term estimate penalty
#> <fct> <chr> <dbl> <dbl>
#> 1 Nest 1 (Intercept) 49.0 0.1
#> 2 Nest 1 x -0.181 0.1
#> 3 Nest 1 y 0.0798 0.1
#> 4 Nest 1 a 0.0621 0.1
#> 5 Nest 1 b -0.256 0.1
#> 6 Nest 2 (Intercept) -84.2 0.1
#> 7 Nest 2 x 0.701 0.1
#> 8 Nest 2 y -0.00725 0.1
#> 9 Nest 2 a -0.0532 0.1
#> 10 Nest 2 b -0.0261 0.1
#> # ℹ 90 more rows
This is all you really need to know to use the nestedmodels package. These models and workflows can be compared, fitted and tuned in much the same way as normal models and workflows; you can even combine them with normal models using the workflowsets and stacks packages.
In this short vignette, a simple nested model and a nested workflow were created and used on dummy data to demonstrate how nestedmodels is used.