Data are stored in many different ways in tables or spreadsheets
because no strict semantic or topographic standards for the organisation
of tables are commonly accepted. In the R environment the tidy
paradigm is a first step towards interoperability of data, in that it
requires a certain arrangement of tables, where variables are recorded
in columns and observations in rows (see https://tidyr.tidyverse.org/). Tables can be tidied
(i.e., brought into a tidy arrangement) via packages such as
tidyr
, however, all functions that deal with reshaping
tables to date require data that are already organised into
topologically coherent, rectangular tables. This is often violated in
practice, especially in data that are scraped off of the internet.
tabshiftr
fills this gap in the toolchain towards more
interoperable data via schema
descriptions that are built
with setters and debugged with getters and a reorganise()
function that ties everything together.
install.packages("tabshiftr")
or the latest development version from github:
::install_github("EhrmannS/tabshiftr") devtools
A disorganised table may look like the following table:
library(tabshiftr)
library(knitr)
# a rather disorganised table with messy clusters and a distinct variable
<- tabs2shift$clusters_messy
input kable(input)
X1 | X2 | X3 | X4 | X5 | X6 | X7 |
---|---|---|---|---|---|---|
commodities | harvested | production | . | . | . | . |
unit 1 | . | . | . | . | . | . |
soybean | 1111 | 1112 | year 1 | . | . | . |
maize | 1121 | 1122 | year 1 | . | . | . |
soybean | 1211 | 1212 | year 2 | . | . | . |
maize | 1221 | 1222 | year 2 | . | . | . |
. | . | . | . | . | . | . |
commodities | harvested | production | commodities | harvested | production | . |
unit 2 | . | . | unit 3 | . | . | . |
soybean | 2111 | 2112 | soybean | 3111 | 3112 | year 1 |
maize | 2121 | 2122 | maize | 3121 | 3122 | year 1 |
soybean | 2211 | 2212 | soybean | 3211 | 3212 | year 2 |
maize | 2221 | 2222 | maize | 3221 | 3222 | year 2 |
If we were to transform this data into tidy data by merely using the
functions in tidyr
(or the extended tidyverse
in general), we’d potentially end up with a massive algorithm,
especially for such complicated table arrangements. For other tables
that may or may not be as complicated, we’d have to set up yet more
algorithms and while a pipeline of tidy functions is relatively easy to
set up, it would still become very laborious to repeat this for the
dozens of potential table arrangements. In tabshiftr
we
solve that by describing the schema of the input table and providing
this schema description to the reorganise()
function. This
requires us to use a vastly smaller set of code and makes it thus a lot
more efficient to bring multiple heterogeneous data into an
interoperable format.
# put together schema description by ...
# ... identifying cluster positions
<- setCluster(id = "territories", left = c(1, 1, 4), top = c(1, 8, 8))
schema
# ... specifying the cluster ID as id variable (obligatory for when we deal with clusters)
<- schema %>%
schema setIDVar(name = "territories", columns = c(1, 1, 4), rows = c(2, 9, 9))
# ... specifying a distinct variable (explicit position)
<- schema %>%
schema setIDVar(name = "year", columns = 4, rows = c(3:6), distinct = TRUE)
# ... specifying a tidy variable (by giving the column values)
<- schema %>%
schema setIDVar(name = "commodities", columns = c(1, 1, 4))
# ... identifying the (tidy) observed variables
<- schema %>%
schema setObsVar(name = "harvested", columns = c(2, 2, 5)) %>%
setObsVar(name = "production", columns = c(3, 3, 6))
# to potentially debug the schema description, first validate the schema ...
<- validateSchema(schema = schema, input = input)
schema_valid
# ... and extract parts of it per cluster (also check out the other getters in
# this package)
getIDVars(schema = schema_valid, input = input)
#> [[1]]
#> [[1]]$year
#> # A tibble: 4 × 1
#> X4
#> <chr>
#> 1 year 1
#> 2 year 1
#> 3 year 2
#> 4 year 2
#>
#> [[1]]$commodities
#> # A tibble: 4 × 1
#> X1
#> <chr>
#> 1 soybean
#> 2 maize
#> 3 soybean
#> 4 maize
#>
#>
#> [[2]]
#> [[2]]$year
#> # A tibble: 4 × 1
#> X4
#> <chr>
#> 1 year 1
#> 2 year 1
#> 3 year 2
#> 4 year 2
#>
#> [[2]]$commodities
#> # A tibble: 4 × 1
#> X1
#> <chr>
#> 1 soybean
#> 2 maize
#> 3 soybean
#> 4 maize
#>
#>
#> [[3]]
#> [[3]]$year
#> # A tibble: 4 × 1
#> X4
#> <chr>
#> 1 year 1
#> 2 year 1
#> 3 year 2
#> 4 year 2
#>
#> [[3]]$commodities
#> # A tibble: 4 × 1
#> X4
#> <chr>
#> 1 soybean
#> 2 maize
#> 3 soybean
#> 4 maize
getObsVars(schema = schema_valid, input = input)
#> [[1]]
#> [[1]]$harvested
#> # A tibble: 4 × 1
#> X2
#> <chr>
#> 1 1111
#> 2 1121
#> 3 1211
#> 4 1221
#>
#> [[1]]$production
#> # A tibble: 4 × 1
#> X3
#> <chr>
#> 1 1112
#> 2 1122
#> 3 1212
#> 4 1222
#>
#>
#> [[2]]
#> [[2]]$harvested
#> # A tibble: 4 × 1
#> X2
#> <chr>
#> 1 2111
#> 2 2121
#> 3 2211
#> 4 2221
#>
#> [[2]]$production
#> # A tibble: 4 × 1
#> X3
#> <chr>
#> 1 2112
#> 2 2122
#> 3 2212
#> 4 2222
#>
#>
#> [[3]]
#> [[3]]$harvested
#> # A tibble: 4 × 1
#> X5
#> <chr>
#> 1 3111
#> 2 3121
#> 3 3211
#> 4 3221
#>
#> [[3]]$production
#> # A tibble: 4 × 1
#> X6
#> <chr>
#> 1 3112
#> 2 3122
#> 3 3212
#> 4 3222
# alternatively, if the clusters are regular, relative values starting from the
# cluster origin could be set
<- setCluster(id = "territories",
schema_alt left = c(1, 1, 4), top = c(1, 8, 8)) %>%
setIDVar(name = "territories", columns = 1, rows = .find(row = 2, relative = TRUE)) %>%
setIDVar(name = "year", columns = 4, rows = c(3:6), distinct = TRUE) %>%
setIDVar(name = "commodities", columns = .find(col = 1, relative = TRUE)) %>%
setObsVar(name = "harvested", columns = .find(col = 2, relative = TRUE)) %>%
setObsVar(name = "production", columns = .find(col = 3, relative = TRUE))
The reorganise()
function carries out the steps of
validating, extracting the variables, pivoting the tentative output and
putting the final table together automatically, so it merely requires
the finalised schema
and the input
table.
# has a pretty print function
schema #> 3 clusters
#> origin : 1|1, 8|1, 8|4 (row|col)
#> id : territories
#>
#> variable type row col dist
#> ------------- ---------- ------ ------ ------
#> territories id 2, 9 1, 4 F
#> year id 3:6 4 T
#> commodities id 1, 4 F
#> harvested observed 2, 5 F
#> production observed 3, 6 F
<- reorganise(input = input, schema = schema)
output kable(output)
territories | year | commodities | harvested | production |
---|---|---|---|---|
unit 1 | year 1 | maize | 1121 | 1122 |
unit 1 | year 1 | soybean | 1111 | 1112 |
unit 1 | year 2 | maize | 1221 | 1222 |
unit 1 | year 2 | soybean | 1211 | 1212 |
unit 2 | year 1 | maize | 2121 | 2122 |
unit 2 | year 1 | soybean | 2111 | 2112 |
unit 2 | year 2 | maize | 2221 | 2222 |
unit 2 | year 2 | soybean | 2211 | 2212 |
unit 3 | year 1 | maize | 3121 | 3122 |
unit 3 | year 1 | soybean | 3111 | 3112 |
unit 3 | year 2 | maize | 3221 | 3222 |
unit 3 | year 2 | soybean | 3211 | 3212 |
tabshiftr
.tabshiftr
should at some
point support that those schemas can be exported into data-formats that
are used by downstream applications (xml, json, …), following proper
(ISO) standards. In case you have experience with those standards and
would like to collaborate on it, please get in touch!This work was supported by funding to Carsten Meyer through the Flexpool mechanism of the German Centre for Integrative Biodiversity Research (iDiv) (FZT-118, DFG).