With tibblify()
you can rectangle deeply nested lists
into a tidy tibble. These lists might come from an API in the form of
JSON or from scraping XML. The reasons to use tibblify()
over other tools like jsonlite::fromJSON()
or
tidyr::hoist()
are:
jsonlite::fromJSON()
.jsonlite::fromJSON()
.Let’s start with gh_users
, which is a list containing
information about four GitHub users.
library(tibblify)
gh_users_small <- purrr::map(gh_users, ~ .x[c("followers", "login", "url", "name", "location", "email", "public_gists")])
names(gh_users_small[[1]])
#> [1] "followers" "login" "url" "name" "location"
#> [6] "email" "public_gists"
Quickly rectangling gh_users_small
is as easy as
applying tibblify()
to it:
tibblify(gh_users_small)
#> The spec contains 1 unspecified field:
#> • email
#> # A tibble: 4 × 7
#> followers login url name location email public_gists
#> <int> <chr> <chr> <chr> <chr> <list> <int>
#> 1 780 jennybc https://api.github.co… Jenn… Vancouv… <NULL> 54
#> 2 3958 jtleek https://api.github.co… Jeff… Baltimo… <NULL> 12
#> 3 115 juliasilge https://api.github.co… Juli… Salt La… <NULL> 4
#> 4 213 leeper https://api.github.co… Thom… London,… <NULL> 46
We can now look at the specification tibblify()
used for
rectangling
guess_tspec(gh_users_small)
#> The spec contains 1 unspecified field:
#> • email
#> tspec_df(
#> tib_int("followers"),
#> tib_chr("login"),
#> tib_chr("url"),
#> tib_chr("name"),
#> tib_chr("location"),
#> tib_unspecified("email"),
#> tib_int("public_gists"),
#> )
If we are only interested in some of the fields we can easily adapt the specification
spec <- tspec_df(
login_name = tib_chr("login"),
tib_chr("name"),
tib_int("public_gists")
)
tibblify(gh_users_small, spec)
#> # A tibble: 4 × 3
#> login_name name public_gists
#> <chr> <chr> <int>
#> 1 jennybc Jennifer (Jenny) Bryan 54
#> 2 jtleek Jeff L. 12
#> 3 juliasilge Julia Silge 4
#> 4 leeper Thomas J. Leeper 46
We refer to lists like gh_users_small
as
collection and objects are the elements of such lists.
Objects and collections are the typical input for
tibblify()
.
Basically, an object is simply something that can be converted to a one row tibble. This boils down to a condition on the names of the object:
object
must have names (the names
attribute must not be NULL
),NA
or
""
),In other words, the names must fulfill
vec_as_names(repair = "check_unique")
. The name-value pairs
of an object are the fields.
For example list(x = 1, y = "a")
is an object with the
fields (x, 1)
and (y, "a")
but
list(1, z = 3)
is not an object because it is not fully
named.
A collection is basically just a list of similar objects so that the fields can become the columns in a tibble.
Providing an explicit specification has a couple of advantages:
As seen before the specification for a collection is done with
tspec_df()
. The columns of the output tibble are describe
with the tib_*()
functions. They describe the path to the
field to extract and the output type of the field. There are the
following five types of functions:
tib_scalar(ptype)
: a length one vector with type
ptype
tib_vector(ptype)
: a vector of arbitrary length with
type ptype
tib_variant()
: a vector of arbitrary length and type;
you should barely ever need thistib_row(...)
: an object with the fields
...
tib_df(...)
: a collection where the objects have the
fields ...
For convenience there are shortcuts for tib_scalar()
and
tib_vector()
for the most common prototypes:
logical()
: tib_lgl()
and
tib_lgl_vec()
integer()
: tib_int()
and
tib_int_vec()
double()
: tib_dbl()
and
tib_dbl_vec()
character()
: tib_chr()
and
tib_chr_vec()
Date
: tib_date()
and
tib_date_vec()
Date
encoded as character: tib_chr_date()
and tib_chr_date_vec()
Scalar elements are the most common case and result in a normal vector column
tibblify(
list(
list(id = 1, name = "Peter"),
list(id = 2, name = "Lilly")
),
tspec_df(
tib_int("id"),
tib_chr("name")
)
)
#> # A tibble: 2 × 2
#> id name
#> <int> <chr>
#> 1 1 Peter
#> 2 2 Lilly
With tib_scalar()
you can also provide your own
prototype
Let’s say you have a list with durations
x <- list(
list(id = 1, duration = vctrs::new_duration(100)),
list(id = 2, duration = vctrs::new_duration(200))
)
x
#> [[1]]
#> [[1]]$id
#> [1] 1
#>
#> [[1]]$duration
#> Time difference of 100 secs
#>
#>
#> [[2]]
#> [[2]]$id
#> [1] 2
#>
#> [[2]]$duration
#> Time difference of 200 secs
and then use it in tib_scalar()
If an element does not always have size one then it is a vector
element. If it still always has the same type ptype
then it
produces a list of ptype
column:
x <- list(
list(id = 1, children = c("Peter", "Lilly")),
list(id = 2, children = "James"),
list(id = 3, children = c("Emma", "Noah", "Charlotte"))
)
tibblify(
x,
tspec_df(
tib_int("id"),
tib_chr_vec("children")
)
)
#> # A tibble: 3 × 2
#> id children
#> <int> <list<chr>>
#> 1 1 [2]
#> 2 2 [1]
#> 3 3 [3]
You can use tidyr::unnest()
or tidyr::unnest_longer()
to flatten these columns to regular columns.
For example in gh_repos_small
gh_repos_small <- purrr::map(gh_repos, ~ .x[c("id", "name", "owner")])
gh_repos_small <- purrr::map(
gh_repos_small,
function(repo) {
repo$owner <- repo$owner[c("login", "id", "url")]
repo
}
)
gh_repos_small[[1]]
#> $id
#> [1] 61160198
#>
#> $name
#> [1] "after"
#>
#> $owner
#> $owner$login
#> [1] "gaborcsardi"
#>
#> $owner$id
#> [1] 660288
#>
#> $owner$url
#> [1] "https://api.github.com/users/gaborcsardi"
the field owner
is an object itself. The specification
to extract it uses tib_row()
spec <- guess_tspec(gh_repos_small)
spec
#> tspec_df(
#> tib_int("id"),
#> tib_chr("name"),
#> tib_row(
#> "owner",
#> tib_chr("login"),
#> tib_int("id"),
#> tib_chr("url"),
#> ),
#> )
and results in a tibble column
tibblify(gh_repos_small, spec)
#> # A tibble: 30 × 3
#> id name owner$login $id $url
#> <int> <chr> <chr> <int> <chr>
#> 1 61160198 after gaborcsardi 660288 https://api.github.com/users/gaborcs…
#> 2 40500181 argufy gaborcsardi 660288 https://api.github.com/users/gaborcs…
#> 3 36442442 ask gaborcsardi 660288 https://api.github.com/users/gaborcs…
#> 4 34924886 baseimports gaborcsardi 660288 https://api.github.com/users/gaborcs…
#> 5 61620661 citest gaborcsardi 660288 https://api.github.com/users/gaborcs…
#> 6 33907457 clisymbols gaborcsardi 660288 https://api.github.com/users/gaborcs…
#> 7 37236467 cmaker gaborcsardi 660288 https://api.github.com/users/gaborcs…
#> 8 67959624 cmark gaborcsardi 660288 https://api.github.com/users/gaborcs…
#> 9 63152619 conditions gaborcsardi 660288 https://api.github.com/users/gaborcs…
#> 10 24343686 crayon gaborcsardi 660288 https://api.github.com/users/gaborcs…
#> # ℹ 20 more rows
If you don’t like the tibble column you can unpack it with
tidyr::unpack()
. Alternatively, if you only want to extract
some of the fields in owner
you can use a nested path
spec2 <- tspec_df(
id = tib_int("id"),
name = tib_chr("name"),
owner_id = tib_int(c("owner", "id")),
owner_login = tib_chr(c("owner", "login"))
)
spec2
#> tspec_df(
#> tib_int("id"),
#> tib_chr("name"),
#> owner_id = tib_int(c("owner", "id")),
#> owner_login = tib_chr(c("owner", "login")),
#> )
tibblify(gh_repos_small, spec2)
#> # A tibble: 30 × 4
#> id name owner_id owner_login
#> <int> <chr> <int> <chr>
#> 1 61160198 after 660288 gaborcsardi
#> 2 40500181 argufy 660288 gaborcsardi
#> 3 36442442 ask 660288 gaborcsardi
#> 4 34924886 baseimports 660288 gaborcsardi
#> 5 61620661 citest 660288 gaborcsardi
#> 6 33907457 clisymbols 660288 gaborcsardi
#> 7 37236467 cmaker 660288 gaborcsardi
#> 8 67959624 cmark 660288 gaborcsardi
#> 9 63152619 conditions 660288 gaborcsardi
#> 10 24343686 crayon 660288 gaborcsardi
#> # ℹ 20 more rows
Objects usually have some fields that always exist and some that are
optional. By default tib_*()
demands that a field
exists
x <- list(
list(x = 1, y = "a"),
list(x = 2)
)
spec <- tspec_df(
x = tib_int("x"),
y = tib_chr("y")
)
tibblify(x, spec)
#> Error in `tibblify()`:
#> ! Field y is required but does not exist in `x[[2]]`.
#> ℹ Use `required = FALSE` if the field is optional.
You can mark a field as optional with the argument
required = FALSE
:
spec <- tspec_df(
x = tib_int("x"),
y = tib_chr("y", required = FALSE)
)
tibblify(x, spec)
#> # A tibble: 2 × 2
#> x y
#> <int> <chr>
#> 1 1 a
#> 2 2 <NA>
You can specify the value to use with the fill
argument
To rectangle a single object you have two options:
tspec_object()
which produces a list or
tspec_row()
which produces a tibble with one row.
While tibbles are great for a single object it often makes more sense to convert them to a list.
For example a typical API response might be something like
api_output <- list(
status = "success",
requested_at = "2021-10-26 09:17:12",
data = list(
list(x = 1),
list(x = 2)
)
)
To convert to a one row tibble
row_spec <- tspec_row(
status = tib_chr("status"),
data = tib_df(
"data",
x = tib_int("x")
)
)
api_output_df <- tibblify(api_output, row_spec)
api_output_df
#> # A tibble: 1 × 2
#> status data
#> <chr> <list<tibble[,1]>>
#> 1 success [2 × 1]
it is necessary to wrap data
in a list. To access
data
one has to use api_output_df$data[[1]]
which is not very nice.
object_spec <- tspec_object(
status = tib_chr("status"),
data = tib_df(
"data",
x = tib_int("x")
)
)
api_output_list <- tibblify(api_output, object_spec)
api_output_list
#> $status
#> [1] "success"
#>
#> $data
#> # A tibble: 2 × 1
#> x
#> <int>
#> 1 1
#> 2 2
Now accessing data
does not required an extra subsetting
step