Simple data exploration

Introduction

The summarize function in dplyr, especially when combined with group_by and across, provides powerful tools for exploring data using summary statistics. The psyntur package provides some wrappers to these tools to allow data exploration, albeit of a limited kind, to be done quickly and easily. We explore some of these functions in this vignette.

Load the psyntur functions and data sets with the usual library command.

library(psyntur)
#> Registered S3 method overwritten by 'GGally':
#>   method from   
#>   +.gg   ggplot2

Summary statistics with describe

We can use the describe function in psyntur. The first argument to describe should be the data frame. Each subsequent argument should be a named call to a summary statistic function, like mean, median, etc., applied to any variable in the data frame. For example, using the faithfulfaces data frame, we can obtain the arithmetic mean and standard deviation of the faithful variable as follows.

describe(data = faithfulfaces, avg = mean(faithful), stdev = sd(faithful))
#> # A tibble: 1 × 2
#>     avg stdev
#>   <dbl> <dbl>
#> 1  5.14 0.957

We can apply the same or different functions to the same or different variables.

describe(data = faithfulfaces,
         avg_faith = mean(faithful), 
         avg_trust = mean(trustworthy),
         sd_trust = sd(trustworthy))
#> # A tibble: 1 × 3
#>   avg_faith avg_trust sd_trust
#>       <dbl>     <dbl>    <dbl>
#> 1      5.14      4.32    0.791

We can obtain the summary statistics for the chosen variables separately for each group defined by another variable by using the by argument.

describe(data = faithfulfaces, by = face_sex, 
         avg = mean(faithful), stdev = sd(faithful))
#> # A tibble: 2 × 3
#>   face_sex   avg stdev
#>   <chr>    <dbl> <dbl>
#> 1 female    5.55 0.802
#> 2 male      4.75 0.932

The by argument may be a vector of variables. In this case, the chosen variables are grouped by the combination of the by variables. For example, in the following we group the time variable in vizverb by both task and response.

describe(vizverb, by = c(task, response),
         avg = mean(time),
         median = median(time),
         iqr = IQR(time),
         stdev = sd(time)
)
#> # A tibble: 4 × 6
#>   task   response   avg median   iqr stdev
#>   <chr>  <chr>    <dbl>  <dbl> <dbl> <dbl>
#> 1 verbal verbal   12.8   11.2   2.92  5.17
#> 2 verbal visual   13.7   13.5   4.96  3.98
#> 3 visual verbal    9.01   7.68  4.65  3.37
#> 4 visual visual   18.2   16.0   7.59  6.12
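
Because describe is a wrapper around dplyr's group_by and summarize, a grouped summary like the one above can also be written directly in dplyr. The following is a minimal sketch of that correspondence, not psyntur's actual implementation. It assumes the built-in mtcars data frame (chosen only so that the example runs without psyntur), with am and vs standing in for the by variables and mpg for the summarized variable.

```r
library(dplyr)

# Group by the combination of two variables, then compute
# several summary statistics of mpg within each group.
mpg_summary <- mtcars %>%
  group_by(am, vs) %>%
  summarize(avg = mean(mpg),
            median = median(mpg),
            iqr = IQR(mpg),
            stdev = sd(mpg),
            .groups = "drop")

mpg_summary
```

The result has one row per combination of am and vs, which mirrors the one row per combination of task and response returned by describe.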

Multiple summary functions to multiple variables

It would be tedious and repetitive to use describe as above if we wanted to apply the same set of summary statistic functions to a set of variables. Instead, we can use describe_across. For example, to calculate the mean, median, and standard deviation of two variables, trustworthy and faithful, in the faithfulfaces data set, we can do the following.

describe_across(faithfulfaces,
                variables = c(trustworthy, faithful),
                functions = list(avg = mean, median = median, stdev = sd)
)
#> # A tibble: 1 × 6
#>   trustworthy_avg trustworthy_median trustworthy_stdev faithful_avg faithful_median
#>             <dbl>              <dbl>             <dbl>        <dbl>           <dbl>
#> 1            4.32               4.24             0.791         5.14            5.24
#> # … with 1 more variable: faithful_stdev <dbl>

Note that the data frame that is returned is in a wide format. We can pivot this to a longer format by saying pivot = TRUE.

describe_across(faithfulfaces,
                variables = c(trustworthy, faithful),
                functions = list(avg = mean, median = median, stdev = sd),
                pivot = TRUE
)
#> # A tibble: 2 × 4
#>   variable      avg median stdev
#>   <chr>       <dbl>  <dbl> <dbl>
#> 1 trustworthy  4.32   4.24 0.791
#> 2 faithful     5.14   5.24 0.957

We can use the by argument to calculate the summary statistics for each subgroup corresponding to each value of the by variable, as in the following example.

describe_across(faithfulfaces,
                variables = c(trustworthy, faithful),
                functions = list(avg = mean, median = median, stdev = sd),
                by = face_sex,
                pivot = TRUE
)
#> # A tibble: 4 × 5
#>   face_sex variable      avg median stdev
#>   <chr>    <chr>       <dbl>  <dbl> <dbl>
#> 1 female   trustworthy  4.44   4.29 0.742
#> 2 female   faithful     5.55   5.71 0.802
#> 3 male     trustworthy  4.21   4.18 0.822
#> 4 male     faithful     4.75   4.85 0.932

As in the case of describe, the by argument can be a vector of variables.
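
For comparison, a grouped summary of several variables with several functions can be written in plain dplyr using across inside summarize. This is a sketch under the assumption that describe_across builds on these tools, as the introduction suggests; it is not psyntur's source code. It uses the built-in mtcars data frame, with mpg and wt as the variables and am as the grouping variable, so that it runs without psyntur.

```r
library(dplyr)

# Apply the same named list of functions to each selected variable,
# separately within each level of am.
wide_summary <- mtcars %>%
  group_by(am) %>%
  summarize(across(c(mpg, wt),
                   list(avg = mean, median = median, stdev = sd)),
            .groups = "drop")

wide_summary
```

The columns are named with the variable name followed by the function name (mpg_avg, mpg_median, and so on), which is the same wide layout that describe_across returns when pivot is not set.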

Dealing with missing values with _xna

When variables have NA values, most summary statistic functions will, by default, return NA. To illustrate this, we can modify faithfulfaces to contain NA values for the faithful variable.

faithfulfaces_na <- faithfulfaces %>%
  dplyr::mutate(faithful = ifelse(faithful > 6, NA, faithful))

Now, if we try one of the above describe or describe_across functions with the faithful variable, we will obtain corresponding NA values.

describe_across(faithfulfaces_na,
                variables = c(trustworthy, faithful),
                functions = list(avg = mean, median = median, stdev = sd),
                by = face_sex,
                pivot = TRUE
)
#> # A tibble: 4 × 5
#>   face_sex variable      avg median  stdev
#>   <chr>    <chr>       <dbl>  <dbl>  <dbl>
#> 1 female   trustworthy  4.44   4.29  0.742
#> 2 female   faithful    NA     NA    NA    
#> 3 male     trustworthy  4.21   4.18  0.822
#> 4 male     faithful    NA     NA    NA
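
The NA results above come from the base R summary functions themselves rather than from describe_across. A small sketch with a made-up vector (the values are illustrative only) shows the same behaviour, and also how to count the missing values.

```r
# A made-up numeric vector with one missing value.
x <- c(4.2, NA, 5.1, 6.3)

mean(x)        # NA: the missing value propagates
median(x)      # NA
sd(x)          # NA
sum(is.na(x))  # counts how many values are missing
```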

Of course, if we set na.rm = TRUE in any or all of the summary functions, we will remove the NA values before the statistics are calculated. This is relatively easy to do with describe, as in the following example.

describe(data = faithfulfaces_na, by = face_sex, 
         avg = mean(faithful, na.rm = T), stdev = sd(faithful, na.rm = T))
#> # A tibble: 2 × 3
#>   face_sex   avg stdev
#>   <chr>    <dbl> <dbl>
#> 1 female    5.11 0.606
#> 2 male      4.65 0.845

However, for describe_across, we pass in a list of functions, and so to set na.rm = T, we need to create purrr-style anonymous functions that call the summary statistic function with na.rm = T, as in the following example.

library(purrr)
describe_across(faithfulfaces_na,
                variables = c(trustworthy, faithful),
                functions = list(avg = ~mean(., na.rm = T), 
                                 median = ~median(., na.rm = T), 
                                 stdev = ~sd(., na.rm = T)),
                by = face_sex,
                pivot = TRUE
)
#> # A tibble: 4 × 5
#>   face_sex variable      avg median stdev
#>   <chr>    <chr>       <dbl>  <dbl> <dbl>
#> 1 female   trustworthy  4.44   4.29 0.742
#> 2 female   faithful     5.11   5.26 0.606
#> 3 male     trustworthy  4.21   4.18 0.822
#> 4 male     faithful     4.65   4.82 0.845

Anonymous functions like this are not very transparent for those new to R, and the resulting code looks quite complex.

In order to avoid using code like ~mean(., na.rm = T), for a number of commonly used summary statistic functions (sum, mean, median, var, sd, IQR), we have made counterparts where na.rm is set to TRUE by default. These functions have the same name as the original with the suffix _xna (but IQR is iqr_xna, not IQR_xna). As such, we can do the following.

describe_across(faithfulfaces_na,
                variables = c(trustworthy, faithful),
                functions = list(avg = mean_xna, median = median_xna, stdev = sd_xna),
                by = face_sex,
                pivot = TRUE
)
#> # A tibble: 4 × 5
#>   face_sex variable      avg median stdev
#>   <chr>    <chr>       <dbl>  <dbl> <dbl>
#> 1 female   trustworthy  4.44   4.29 0.742
#> 2 female   faithful     5.11   5.26 0.606
#> 3 male     trustworthy  4.21   4.18 0.822
#> 4 male     faithful     4.65   4.82 0.845
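
To give a sense of what such counterparts involve, the following sketch defines similar wrappers by hand. The names my_mean_xna, my_sd_xna, and my_iqr_xna are hypothetical, chosen to avoid clashing with psyntur's own functions, and the definitions are an assumption about how such wrappers could be written, not psyntur's actual source.

```r
# Hypothetical wrappers that always drop NA values before summarizing.
my_mean_xna <- function(x) mean(x, na.rm = TRUE)
my_sd_xna <- function(x) sd(x, na.rm = TRUE)
my_iqr_xna <- function(x) IQR(x, na.rm = TRUE)

# A made-up vector with one missing value.
x <- c(1, NA, 3)

my_mean_xna(x)
my_sd_xna(x)
my_iqr_xna(x)
```

Because each wrapper is an ordinary named function of one argument, it can be placed in a list of functions in the same way as mean or sd.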