This vignette demonstrates three major functions in the
brokenstick
package: brokenstick()
,
predict()
and plot()
. We also need
dplyr
and ggplot2
.
require("brokenstick")
require("dplyr")
require("ggplot2")
For more elaborate documentation, see the manual.
The smocc_200
data in the brokenstick
package contain the heights of 200 Dutch children measured on 10 visits
at ages 0-2 years.
<- brokenstick::smocc_200
data head(data, 3)
## # A tibble: 3 × 7
## id age sex ga bw hgt hgt_z
## <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 10001 0 female 40 3960 52 0.575
## 2 10001 0.0821 female 40 3960 55.6 0.888
## 3 10001 0.159 female 40 3960 58.2 0.797
Figure 1 dispays the data from the first 500 rows as a set of growth curves of Dutch children. Curves are steeper during the first few months, so child growth is faster for young infants. Note also there are more cross-overs during the first half year, whereas fewer occur later. This means that the relative positions have been settled by the age of 2 years, or - put differently - that the correlation between time points at those ages is high.
ggplot(data[1:500, ], aes(x = age, y = hgt_z, group = id, color = as.factor(id))) +
geom_line(size = 0.1) +
geom_point(size = 0.7) +
scale_colour_viridis_d(option = "cividis") +
xlab("Age (years)") +
ylab("Length SDS") +
theme_light() +
theme(legend.position = "none")
Figure 2 dispays the same data, but with the vertical axis changed to Standard Deviation Scores (SDS), or \(Z\)-score. The \(Z\)-score is the height corrected for age relative to the Dutch height reference from the Fourth Dutch Growth Study. The \(Z\)-score transformation takes away the major time trend, so all curves are more or less flat. This allows us to see a more detailed assessment of individual growth.
The plots also show how the measurements are clustered around ten ages: birth, 1, 2, 3, 6, 9, 12, 15, 18 and 24 months. While the design was followed rigorously in the study, some variation in timing is inevitable because of weekends, holidays, sickness, and other events. The timing variation poses a problem because we cannot directly compare the measurement between different children (especially for figure 1). Also, we cannot easily construct the “broad” matrix with 10 time point per child.
Of course, we can divide the time axis into ten age groups, and treat all point within the same age group as being measured at the same point. This is probably a good strategy for nicely looking data - as we have here -, but this approach is problematic in data with irregular time intervals, of there are multiple measurement per age group, if the measurement schedules vary by child, or in data combined from studies that employed different designs.
The brokenstick
package contains tools to approximate
the observed data by a series of connecting straight lines. When these
lines closely follow the data, we may replace each trajectory by its
values at the breakpoints. The statistical analysis can then be done on
the regularised trajectories, which is easier than working with the
observed data.
We fit a trivial broken stick model with just one line on ages between birth and two years, and plot the trajectories of three selected children as follows:
set.seed(123)
<- brokenstick(hgt ~ age | id, data, knots = c(0, 2))
fit <- c(10001, 10005, 10022)
ids plot(fit, group = ids,
xlab = "Age (years)", ylab = "Length (cm)")
The following plot displays the same data, but in standardised units so as to increase the analytic resolution:
<- brokenstick(hgt_z ~ age | id, data, knots = c(0, 2))
fit0 plot(fit0, group = ids,
xlab = "Age (years)", ylab = "Length (SDS)")
Note that both approximations describe the individual trend in the data, but do not address any systematic deviations from the trend.
The broken stick model describes a trajectory by a series of connected straight lines. We first calculate a model with two connected lines. The first line starts at birth and end at the age of exactly 1 years. The second line spans the period between 1 to 2 years. In addition, the lines must connect at the age of 1 year. We estimate and plot the model as follows:
<- brokenstick(hgt_z ~ age | id, data = data, knots = c(0, 1, 2))
fit2 plot(fit2, group = ids, xlab = "Age (years)", ylab = "Length (SDS)")
The plot shows that the two-line model is still fairly crude. The
fit2
object holds the parameter estimates of the model:
summary(fit2)
## Class brokenstick (kr)
## Variables hgt_z (outcome), age (predictor), id (group)
## Data 1942 (n), 36 (nmis), 200 (groups)
## Parameters 16 (total), 4 (fixed), 4 (variance), 6 (covariance), 2 (error)
## Knots 0 1 2
## Means -0.0394 0.0278 0.0646
## Residuals 0.113 0.146 0.168 0.195 0.383 (min, P25, P50, P75, max)
## Mean resid 0.179
## R-squared 0.867
##
## Variance-covariance matrix
## age_0 age_1 age_2
## age_0 1.218
## age_1 0.481 0.823
## age_2 0.485 0.782 0.884
The console output lists the knots of the model, including the left
and right boundary knots at 0 and 2.6776. The row of means
correspond to the fixed effect estimates of the linear mixed model. We
may interpret these as the global means. Next, the output lists the
variance-covariance matrix of the random effects. The model contains 16
parameters in total: four fixed effects (means), four random effects
(diagonal elements), 6 covariance (off-diagonal elements) and 2 error
variances (one for the residual error variance, one for the variability
of the error per cluster). These parameters are enough to reconstruct
the broken stick model, and to apply it to new data.
We refine the model in the first two years by adding a knot for each age at which a visit was scheduled. This model can be run as
<- round(c(0, 1, 2, 3, 6, 9, 12, 15, 18, 24)/12, 4)
knots <- brokenstick(hgt_z ~ age | id, data = data, knots = knots) fit9
This optimization problem is more difficult, so it takes slightly longer to run. The results are
summary(fit9)
## Class brokenstick (kr)
## Variables hgt_z (outcome), age (predictor), id (group)
## Data 1942 (n), 36 (nmis), 200 (groups)
## Parameters 79 (total), 11 (fixed), 11 (variance), 55 (covariance), 2 (error)
## Knots 0.0000 0.0833 0.1667 0.2500 0.5000 0.7500 1.0000 1.2500 1.5000 2.0000
## Means -0.18618 0.02163 0.00262 0.06602 0.06346 -0.01955 0.00781 0.03233 -0.02665 0.11470
## Residuals 0.0176 0.0268 0.0329 0.0494 4.5190 (min, P25, P50, P75, max)
## Mean resid 0.0752
## R-squared 0.983
##
## Variance-covariance matrix
## age_0 age_0.0833 age_0.1667 age_0.25 age_0.5 age_0.75 age_1 age_1.25
## age_0 1.740
## age_0.0833 1.118 1.337
## age_0.1667 1.077 1.095 1.246
## age_0.25 0.900 1.005 0.999 1.052
## age_0.5 0.627 0.821 0.822 0.815 0.943
## age_0.75 0.597 0.696 0.693 0.718 0.867 0.938
## age_1 0.495 0.611 0.579 0.606 0.786 0.819 0.869
## age_1.25 0.432 0.614 0.575 0.601 0.757 0.807 0.834 0.953
## age_1.5 0.455 0.614 0.624 0.604 0.808 0.824 0.836 0.914
## age_2 0.361 0.483 0.517 0.539 0.720 0.753 0.783 0.883
## age_1.5 age_2
## age_0
## age_0.0833
## age_0.1667
## age_0.25
## age_0.5
## age_0.75
## age_1
## age_1.25
## age_1.5 1.039
## age_2 0.919 1.06
There nine-line model summarises the data by 79 parameters. This model fits substantially better. The model residuals are substantially smaller (0.075 instead of 0.179) and the proportions of explained variance is much higher (0.983 instead of 0.867).
The figure shows that the nine-line broken stick model fits the observed data very well.
The predict()
function allows us to obtain various types
of predictions from the broken stick model. The simplest call
<- predict(fit2)
p1 head(p1)
## .pred
## 1 0.720
## 2 0.660
## 3 0.604
## 4 0.534
## 5 0.352
## 6 0.171
identical(nrow(data), nrow(p1))
## [1] TRUE
produces a tibble
with the one column called
.pred
for each row in data
. We can bind column
.pred
to data
for further processing.
Sometimes, we also want the prediction at the knot values, for
example, to create graphs that contain observed and modelled
trajectories. We obtain predictions at the knots by the special
x = "knots"
argument, e.g.
<- predict(fit2, x = "knots", include_data = FALSE)
p2 head(p2)
## .source id age sex ga bw hgt hgt_z .pred
## 1 added 10001 0 <NA> NA NA NA NA 0.71989
## 2 added 10001 1 <NA> NA NA NA NA -0.00939
## 3 added 10001 2 <NA> NA NA NA NA 0.05785
## 4 added 10002 0 <NA> NA NA NA NA -0.23264
## 5 added 10002 1 <NA> NA NA NA NA -0.37772
## 6 added 10002 2 <NA> NA NA NA NA -0.50233
nrow(p2)
## [1] 600
We use the include_data = FALSE
argument to remove
predictions for the observed data. The output is more verbose and
includes the grid of knots for each child (id
,
age
). The column .source
is equal to
added
as all rows are non-observed data.
Note there are also knots at ages 0.00 and 2.68 years. These are
boundary knots, and added by the brokenstick()
function.
The boundary knots effectively filter the observations that enter the
calculations. By default, the boundary knots span the age range in the
data. For technical reasons, the broken stick model also defines and
estimates parameters for these knots, but these may in general be
ignored, especially when the data near the boundary knots are
sparse.
If we wish to obtain estimates at both the knots and the observed data use:
<- predict(fit2, x = "knots", hide = "none")
p3 table(p3$.source)
##
## added data
## 800 1942
This return 1942 rows for the data and 800 rows for the knots.
The proportion of the variance of the outcome explained by the two-line model is
get_r2(fit2)
## [1] 0.867
For the second model we get
get_r2(fit9)
## [1] 0.983
so the nine-line broken stick model explains about 98 percent of the variance of the height SDS.
Suppose we are interest in knowing the effect of sex, gestational age and birth weight on the height SDS at the age of 2 years. This is an analysis at the subject level. Let us first extract the subject-level data with variables that vary over subjects only.
<- data %>%
subj select(id, sex, ga, bw) %>%
group_by(id) %>%
slice(1)
head(subj, 3)
## # A tibble: 3 × 4
## # Groups: id [3]
## id sex ga bw
## <dbl> <chr> <dbl> <dbl>
## 1 10001 female 40 3960
## 2 10002 male 38 3210
## 3 10003 female 40 4170
We also need the outcome variable. We take it from the broken stick estimates from the nine line solution and append it to the subject level data.
<- predict(fit9, x = "knots", shape = "wide", include_data = FALSE)
bs <- bind_cols(subj, select(bs, -id))
data head(data, 3)
## # A tibble: 3 × 14
## # Groups: id [3]
## id sex ga bw `0` `0.0833` `0.1667` `0.25` `0.5` `0.75` `1`
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 10001 female 40 3960 0.569 0.865 0.735 0.657 0.232 -0.172 0.118
## 2 10002 male 38 3210 -0.184 -0.329 -0.252 -0.319 -0.218 -0.281 -0.344
## 3 10003 female 40 4170 1.17 2.11 1.97 2.04 2.03 1.81 1.32
## # … with 3 more variables: `1.25` <dbl>, `1.5` <dbl>, `2` <dbl>
The names of the columns in bs
correspond to the knot
values.
The effect of the subject’s sex, gestational age and birth weight on
the height SDS at the age of 2 years (here denoted by the variable named
2
) can be estimated as
<- lm(`2` ~ sex + ga + I(bw / 1000), data = data)
fit1_lm summary(fit1_lm)
##
## Call:
## lm(formula = `2` ~ sex + ga + I(bw/1000), data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8965 -0.4525 0.0817 0.6424 2.0369
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0361 1.4705 0.70 0.48187
## sexmale 0.0109 0.1285 0.08 0.93249
## ga -0.0700 0.0433 -1.62 0.10752
## I(bw/1000) 0.5343 0.1420 3.76 0.00022 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.879 on 196 degrees of freedom
## Multiple R-squared: 0.0757, Adjusted R-squared: 0.0616
## F-statistic: 5.35 on 3 and 196 DF, p-value: 0.00145
Note that the analysis shows there is a substantial effect of birth
weight. Of course, it might be that birth weight is directly related to
height at the age of 2 years. Alternatively, the relation could be
mediated by birth length. The following model adds birth length (the
variable named 0
) to the model:
<- lm(`2` ~ sex + ga + I(bw / 1000) + `0`, data = data)
fit2_lm summary(fit2_lm)
##
## Call:
## lm(formula = `2` ~ sex + ga + I(bw/1000) + `0`, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9375 -0.5326 0.0426 0.6292 1.8792
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.226719 1.662005 1.94 0.0536 .
## sexmale -0.000703 0.126542 -0.01 0.9956
## ga -0.089472 0.043247 -2.07 0.0399 *
## I(bw/1000) 0.143924 0.201777 0.71 0.4765
## `0` 0.245522 0.091492 2.68 0.0079 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.865 on 195 degrees of freedom
## Multiple R-squared: 0.109, Adjusted R-squared: 0.0904
## F-statistic: 5.94 on 4 and 195 DF, p-value: 0.000156
The effect of birth length on length at age 2 is very strong. There is no separate effect of birth weight anymore, so this analysis suggests that the relation between birth weight and length at age 2 can be explained by their mutual associations to birth length.
This vignette illustrated the use of the brokenstick()
,
plot()
and predict()
functions. Other
vignettes highlight various other capabilities of the package.