Tidying up, transforming and exploring data is an important part of data analysis, and you can manage many common tasks in this process with the tidyverse or related packages. The sjmisc-package fits into this workflow, especially when you work with labelled data, because it offers functions for data transformation and labelled data utility functions. This vignette describes typical steps when beginning with data exploration.
The examples are based on data from the EUROFAMCARE project, a survey
on the situation of family carers of older people in Europe. The sample
data set efc
is part of this package. Let us see how the
family carer’s gender and subjective perception of negative impact of
care as well as the cared-for person’s dependency are associated with
the family carer’s quality of life.
The first thing that may be of interest is probably the distribution
of gender. You can print frequencies for labelled data with frq(). This
function requires either a vector or a data frame as input and prints
the variable label as first line, followed by a frequency table with
values, labels, counts and percentages of the vector.
library(sjmisc)
library(dplyr)
data(efc)

frq(efc$c161sex)
#> carer's gender (x) <numeric>
#> # total N=908 valid N=901 mean=1.76 sd=0.43
#>
#> Value | Label | N | Raw % | Valid % | Cum. %
#> -----------------------------------------------
#> 1 | Male | 215 | 23.68 | 23.86 | 23.86
#> 2 | Female | 686 | 75.55 | 76.14 | 100.00
#> <NA> | <NA> | 7 | 0.77 | <NA> | <NA>
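Since frq() also accepts a data frame, several variables can be tabulated in one call. The following is a minimal sketch, assuming that variables are selected by passing their unquoted names after the data argument; the object name res is only illustrative.

```r
library(sjmisc)
data(efc)

# one frequency table per selected variable
res <- frq(efc, c161sex, e42dep)
res
```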
Next, let’s look at the distribution of gender by the cared-for
person’s dependency. To compute cross tables, you can use flat_table().
It requires the data as first argument, followed by any number of
variable names.
But first, we need to know the name of the dependency-variable. This
is where find_var() comes into play. It searches for variables in a
data frame by variable name, variable label, value labels, or any
combination of these. By default, it looks for variable names and
labels. The function also supports regex-patterns. By default,
find_var() returns the column-indices, but you can also print a small
“summary” with the out-argument.
# find all variables with "dependency" in name or label
find_var(efc, "dependency", out = "table")
#> col.nr var.name var.label
#> 1 5 e42dep elder's dependency
Variable in column 5, named e42dep, is what we are looking for.
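The other output formats can be requested explicitly. This is a small sketch, assuming that out = "index" returns the column positions and out = "df" returns the matching columns as a data frame; the object names idx and dep_df are only illustrative.

```r
library(sjmisc)
data(efc)

# column positions of all matching variables
idx <- find_var(efc, "dependency", out = "index")
idx

# the matching variables themselves, as a data frame
dep_df <- find_var(efc, "dependency", out = "df")
head(dep_df)
```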
Now we can look at the distribution of gender by dependency:
flat_table(efc, e42dep, c161sex)
#> c161sex Male Female
#> e42dep
#> independent 18 48
#> slightly dependent 54 170
#> moderately dependent 80 226
#> severely dependent 63 241
Since the distribution of male and female carers is skewed, let’s look
at the proportions. To compute cross tables with row or column
percentages, use the margin-argument.
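A call with column percentages might look like the sketch below. It assumes that margin = "col" is the appropriate choice here, so that percentages add up within each gender. The base R cross-check with prop.table() is an addition for illustration and not part of sjmisc.

```r
library(sjmisc)
data(efc)

# cross table with column percentages (percentages sum up within each gender)
flat_table(efc, e42dep, c161sex, margin = "col")

# cross-check with base R: column proportions for the same two variables
base_col <- prop.table(table(efc$e42dep, efc$c161sex), margin = 2)
round(100 * base_col, 2)
```

Use margin = "row" instead to get the proportion of male and female carers within each dependency category.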
Next, we need the negative impact of care (neg_c_7) and want
to create three groups: low, moderate and high negative impact. We can
easily recode and label vectors with rec(). This function
does not only recode vectors, it also allows direct labelling of
categories inside the recode-syntax (this is optional, you can also use
the val.labels-argument). We now recode neg_c_7
into a new variable burden. The cut-points are a bit arbitrary,
for the sake of demonstration.
efc$burden <- rec(
  efc$neg_c_7,
  rec = c("min:9=1 [low]; 10:12=2 [moderate]; 13:max=3 [high]; else=NA"),
  var.label = "Subjective burden",
  as.num = FALSE # we want a factor
)
# print frequencies
frq(efc$burden)
#> Subjective burden (x) <categorical>
#> # total N=908 valid N=892 mean=2.03 sd=0.81
#>
#> Value | Label | N | Raw % | Valid % | Cum. %
#> -------------------------------------------------
#> 1 | low | 280 | 30.84 | 31.39 | 31.39
#> 2 | moderate | 301 | 33.15 | 33.74 | 65.13
#> 3 | high | 311 | 34.25 | 34.87 | 100.00
#> <NA> | <NA> | 16 | 1.76 | <NA> | <NA>
You can see the variable burden has a variable label
(“Subjective burden”), which was set inside rec(), as well
as three values with labels (“low”, “moderate” and “high”). Values from
the lowest value in neg_c_7 up to 9 were recoded into 1, values 10 to
12 into 2 and values from 13 to the highest value in neg_c_7 into 3.
All remaining values are set to missing (else=NA; for
details on the recode-syntax, see ?rec).
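To see how the recode-syntax maps individual values, it can help to apply the same pattern to a short vector. The scores below are hypothetical toy values, not taken from efc, and the object names x and y are only illustrative.

```r
library(sjmisc)

# hypothetical toy scores, including a missing value
x <- c(7, 9, 10, 12, 13, 20, NA)

# same recode pattern as used for neg_c_7
y <- rec(
  x,
  rec = "min:9=1 [low]; 10:12=2 [moderate]; 13:max=3 [high]; else=NA",
  as.num = FALSE # return a factor
)

# compare original and recoded values side by side
data.frame(original = x, recoded = y)
```

Here min and max refer to the smallest and largest value of the toy vector, so the boundaries adapt to whatever data is passed in.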
How is burden distributed by gender? We can group the data and print
frequencies using frq() for this as well, as this function
also accepts grouped data frames. Frequencies for grouped data frames
first print the group-details (variable name and category), followed by
the frequency table. Thanks to labelled data, the output is easy to
understand.
efc %>%
  select(burden, c161sex) %>%
  group_by(c161sex) %>%
  frq()
#> Subjective burden (burden) <categorical>
#> # grouped by: Male
#> # total N=215 valid N=212 mean=1.91 sd=0.81
#>
#> Value | Label | N | Raw % | Valid % | Cum. %
#> ------------------------------------------------
#> 1 | low | 80 | 37.21 | 37.74 | 37.74
#> 2 | moderate | 72 | 33.49 | 33.96 | 71.70
#> 3 | high | 60 | 27.91 | 28.30 | 100.00
#> <NA> | <NA> | 3 | 1.40 | <NA> | <NA>
#>
#> Subjective burden (burden) <categorical>
#> # grouped by: Female
#> # total N=686 valid N=679 mean=2.08 sd=0.81
#>
#> Value | Label | N | Raw % | Valid % | Cum. %
#> -------------------------------------------------
#> 1 | low | 199 | 29.01 | 29.31 | 29.31
#> 2 | moderate | 229 | 33.38 | 33.73 | 63.03
#> 3 | high | 251 | 36.59 | 36.97 | 100.00
#> <NA> | <NA> | 7 | 1.02 | <NA> | <NA>
Let’s investigate the association between quality of life and burden
across the different dependency categories, by fitting linear models for
each category of e42dep. We can do this using nested data
frames. nest() from the tidyr-package
can create subsets of a data frame, based on grouping criteria, and
create a new list-variable, where each element itself is a data
frame (so it’s nested, because we have data frames inside a data
frame).
In the following example, we group the data by e42dep, and “nest” the groups. Now we get a data frame with two columns: first, the grouping variable (e42dep) and second, the datasets (subsets) for each dependency category as data frames, stored in the list-variable data. The data frames in the subsets (in data) all contain the selected variables burden, c161sex and quol_5 (quality of life).
# convert variable to labelled factor, because we then
# have the labels as factor levels in the output
efc$e42dep <- to_label(efc$e42dep, drop.levels = TRUE)
efc %>%
  select(e42dep, burden, c161sex, quol_5) %>%
  group_by(e42dep) %>%
  tidyr::nest()
#> # A tibble: 5 × 2
#> # Groups: e42dep [5]
#> e42dep data
#> <fct> <list>
#> 1 moderately dependent <tibble [306 × 3]>
#> 2 severely dependent <tibble [304 × 3]>
#> 3 independent <tibble [66 × 3]>
#> 4 slightly dependent <tibble [225 × 3]>
#> 5 <NA> <tibble [7 × 3]>
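The nested subsets are ordinary data frames and can be pulled out of the list-variable by position. The sketch below is self-contained, so it leaves out burden and does not convert e42dep to a labelled factor; the object name nested is only illustrative.

```r
library(sjmisc)
library(dplyr)
data(efc)

nested <- efc %>%
  select(e42dep, c161sex, quol_5) %>%
  group_by(e42dep) %>%
  tidyr::nest()

# the list-variable holds one data frame per group
class(nested$data)

# first rows of the subset that belongs to the first row of `nested`
head(nested$data[[1]])
```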
Using map() from the purrr-package, we
can iterate this list and apply any function on each data frame in the
list-variable “data”. We want to apply the lm()-function to
the list-variable, to run linear models for all “dependency-datasets”.
The results of these linear regressions are stored in another
list-variable, models (created with mutate()). To
quickly access and look at the coefficients, we can use
spread_coef().
efc %>%
  select(e42dep, burden, c161sex, quol_5) %>%
  group_by(e42dep) %>%
  tidyr::nest() %>%
  na.omit() %>%       # remove nested group for NA
  arrange(e42dep) %>% # arrange by order of levels
  mutate(models = purrr::map(
    data, ~
      lm(quol_5 ~ burden + c161sex, data = .))
  ) %>%
  spread_coef(models)
#> # A tibble: 4 × 7
#> # Groups: e42dep [4]
#> e42dep data models `(Intercept)` burden2 burden3 c161sex
#> <fct> <list> <list> <dbl> <dbl> <dbl> <dbl>
#> 1 independent <tibble> <lm> 18.8 -3.16 -4.94 -0.709
#> 2 slightly dependent <tibble> <lm> 19.8 -2.20 -2.48 -1.14
#> 3 moderately dependent <tibble> <lm> 17.9 -1.82 -5.29 -0.637
#> 4 severely dependent <tibble> <lm> 19.1 -3.66 -7.92 -0.746
We see that higher burden is associated with lower quality of life,
for all dependency-groups. The se and p.val-arguments add standard
errors and p-values to the output. The model.term-argument returns the
statistics only for a specific term. If you specify a model.term, the
arguments se and p.val automatically default to TRUE.
efc %>%
  select(e42dep, burden, c161sex, quol_5) %>%
  group_by(e42dep) %>%
  tidyr::nest() %>%
  na.omit() %>%       # remove nested group for NA
  arrange(e42dep) %>% # arrange by order of levels
  mutate(models = purrr::map(
    data, ~
      lm(quol_5 ~ burden + c161sex, data = .))
  ) %>%
  spread_coef(models, burden3)
#> # A tibble: 4 × 6
#> # Groups: e42dep [4]
#> e42dep data models burden3 std.error p.value
#> <fct> <list> <list> <dbl> <dbl> <dbl>
#> 1 independent <tibble [66 × 3]> <lm> -4.94 2.20 2.84e- 2
#> 2 slightly dependent <tibble [225 × 3]> <lm> -2.48 0.694 4.25e- 4
#> 3 moderately dependent <tibble [306 × 3]> <lm> -5.29 0.669 5.22e-14
#> 4 severely dependent <tibble [304 × 3]> <lm> -7.92 0.875 2.10e-17
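To see what spread_coef() condenses, the coefficients can also be collected by hand with base R. This is only a rough sketch, not how spread_coef() is implemented. To stay self-contained it uses the numeric neg_c_7 as predictor instead of the recoded burden, so the estimates differ from the output above. The object names subsets, models and coefs are only illustrative.

```r
library(sjmisc)
data(efc)

# one subset per dependency category (rows with missing e42dep are dropped)
subsets <- split(efc, efc$e42dep)

# fit the same kind of model in each subset
models <- lapply(subsets, function(d) {
  lm(quol_5 ~ neg_c_7 + c161sex, data = d)
})

# one row of coefficients per dependency category
coefs <- t(sapply(models, coef))
coefs

# standard errors and p-values for a single model
summary(models[[1]])$coefficients
```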