1 Rationale

Tables as places where data are recorded can be pretty messy. The ‘tidy’ paradigm in R proposes that data are organised so that variables are recorded in columns, observations in rows and that there is only one value per cell (Wickham 2014). This, however, is only one interpretation of how data should be organised (Broman and Woo 2018) and especially when scraping data off the internet, one frequently encounters spreadsheets that don’t follow this paradigm.

The tidyr package is one of the most popular tools to bring data into a tidy format. However, up until today it is limited to tables that are already organised into topologically coherent (rectangular) chunks and any messiness beyond that requires dedicated scripts for reorganisation. In tabshiftr we try to describe and work with a further dimension of messiness, where data are available as so-called disorganised (messy) data, data that are not yet arranged into rectangular form.

The approach of tabshiftr is based on describing the arrangement of such tables in a so-called schema description, which is then the basis for automatic reorganisation via the function reorganise(). Typically there is an input and an output schema, describing the arrangement of the input and output tables, respectively. The advantage of this procedure is that input and output tables exist explicitly and the schema maps the transformation of the data. As we want to end up with tidy tables, the output schema is pre-determined by a tidy table of the included variables and the input schema needs to be put together by you, the user.

2 The basics

Data can be disorganised according to multiple dimensions. To understand those dimensions, we first need to understand the nature of data. Data "of the same kind" are collected in a variable, which is always a combination of a name and the values. In a table, names are typically in the topmost row and values are in the column below that name (Tab. 1). Conceptually, there are two types of variables in any table:

Variables that contain categorical levels that identify the units for which values have been observed (they are called identifying variables here).
Variables that have been measured or observed and that consequently represent the target values of that measurement, be they continuous or categorical (they are called observed variables here).

Moreover, a table is part of a series when other tables of that series contain the same variables with slightly different values for the identifying variables, irrespective of how the distinct tables of that series are arranged.

Table 1: An example table containing one identifying and two observed variables, with the variable names in the topmost row and the values in all consecutive rows.
identifying variable	observed variable (categorical)	observed variable (continuous)
sample 1	blue	10
sample 2	green	20
sample 3	red	30

Here, we do use the term spreadsheet to denote what would typically be seen as table, as the data we deal with here are typically "disorganised" and thus not what would be recognised as a table. Even though data in many spreadsheets are disorganised, they are mostly not non-systematic. Especially in complex spreadsheets, one often encounters a situation where a set of variables occurs more than once with the same or very similar arrangement, which we call cluster here. Data that are part of clusters are split along one of the, typically categorical, variables (the cluster ID), with the aim to increase the visual accessibility or direct the focus for human readers (Tab. 2). This may also be the case where data are split up into several files or spreadsheets of a file, where the cluster ID can be found in the file or spreasheet name or the meta-data. In many cases, the cluster ID is an implicit variable, a variable that is not immediately recogniseable as such.

Table 2: An example of a table with several clusters of comparable variables.
sample	colour	intensity	sample	colour	intensity
sample 1			sample 2
	blue	10		blue	11
	green	20		green	24
	red	30		red	13

sample 3			sample 4
	blue	20		blue	10
	green	15		green	16
	red	33		red	21

3 How to make a schema description

3.1 Decision tree

To set up a schema description, go through the following questions step by step and provide the respective answer in the respective function. Linked tables in the next section can serve as examples to compare against.

Variables: Clarify which are the identifying variables and which are the observed variables. Make sure not to mistake a listed observed variable (Tab. 10) as an identifying variable.
Format: Provide potentially information about the table format in setFormat().
Clusters: In case there are clusters, provide in setCluster()
- are data clustered into several files or spreadsheets and is the information used to separate the data (i.e., the spreadsheet name) a variable of interest (Tab. 6) or are the data clustered within one spreadsheet (Tab. 12, Tab. 13, Tab. 14 & Tab. 15)?
- are data clustered according to an identifying variable (Tab. 12, Tab. 13), or are observed variables grouped into clusters (Tab. 14)?
- are clusters nested into a grouping variable of interest (Tab. 15)?
Identifying variables: provide in setIDVar()
- in which column(s) is the variable?
- is the variable a result of merging several columns (Tab. 4) or must it be split off of another column (Tab. 5)?
- is the variable wide (i.e., its values are in several columns) (Tab. 7, Tab. 8 & Tab. 9)? In this case, the values will look like they are part of the header.
- is the variable distinct from the main table (Tab. 16)?
Observed variable: provide in setObsVar()
- in which column(s) is the variable?
- is the variable wide and nested in a wide identifying variable (i.e., the name of the observed variable is below the name of the identifying variable) (Tab. 7 & Tab. 9)?
- is the variable listed (i.e., the names of the variable are given as a value of an identifying variable) (Tab. 10)?

3.2 Additional remarks

To work with tabshiftr, tables need to be read in while treating any header rows as data, i.e., by not setting the first row as header, because in disorganised tables it’s often not only the first row that is part of the header.
Each column should be treated as a character data type, because some columns might contain data with both numeric and character cells.

input <- read_csv(file = ...,
                  col_names = FALSE,
                  col_types = cols(.default = "c"))

reorganise() takes care of reformatting the data-types into the most permissive data type that does not introduce NAs where there should be data, i.e, if a variable can be numeric, it is formatted as numeric column.
When setting up a schema description, some generalisations are hard to wrap your mind around. Try imagining that once you have set a variable, it’s values are “removed” from the table. For instance, once clusters are set for Tab. 6, imagine that the meta-data header isn’t part of the table anymore, resulting in a much simpler table. Or for Tab. 16, once territories are set, imagine that the respective unit1/2/3 values disappear to see that the respective column is actually a tidy column of commodities.

3.3 An example

As an example, we show here how to build a schema description for a table that has a wide identifying variable and a listed observed variable. This table contains additionally some dummy information one would typically encounter in tables, such as empty_cols and rows and data that are not immediately of interest (other_observed).

kable(input <- tabs2shift$listed_column_wide)

X1	X2	X3	X4	X5	X6	X7	X8
territories	period	.	dimension	other_observed	soybean	maize	empty_col
unit 1	year 1	.	harvested	xyz	1111	1121	.
unit 1	year 1	.	production	xyz	1112	1122	.
unit 1	year 2	.	harvested	xyz	1211	1221	.
unit 1	year 2	.	production	xyz	1212	1222	.
.	.	.	.	.	.	.	.
unit 2	year 1	.	harvested	xyz	2111	2121	.
unit 2	year 1	.	production	xyz	2112	2122	.
unit 2	year 2	.	harvested	xyz	2211	2221	.
unit 2	year 2	.	production	xyz	2212	2222	.

In this case we don’t need to set clusters and can start immediately with setting the first id variable territories, which is in the first column and otherwise tidy. The order by which we set the variables determines where they ocurr in the output table. Any of the setters start by default with an empty schema, in case none is provided to them from a previous setter, thus none needs to be provided at the beginning of a schema.

schema <- setIDVar(name = "territories", columns = 1)

Since version 0.3.0, tabshiftr comes with getters that allow to debug the current schema description. To do this, however, the schema first needs to be validated. This is in order to make sure that all the generic information are evaluated with the respective input. After that, a getter can be used to extract the respective information, for example the reorganised id variables with getIDVars().

validateSchema(schema = schema, input = input) %>% 
  getIDVars(input = input)
#> [[1]]
#> [[1]]$territories
#> # A tibble: 9 × 1
#>   X1         
#>   <chr>      
#> 1 territories
#> 2 unit 1     
#> 3 unit 1     
#> 4 unit 1     
#> 5 unit 1     
#> 6 unit 2     
#> 7 unit 2     
#> 8 unit 2     
#> 9 unit 2

After seeing that our specification results in a meaningful output, we can continue setting the other id variables years (tidy and in column 2) and commodities (spread over two columns and the values are in the first row). Note, how we pipe the previous schema into the next setter. This results in the next variable being added to that schema.

schema <- schema %>% 
  setIDVar(name = "year", columns = 2) %>%
  setIDVar(name = "commodities", columns = c(6, 7), rows = 1)

Validating and checking for id variables again results in the following

validateSchema(schema = schema, input = input) %>% 
  getIDVars(input = input)
#> [[1]]
#> [[1]]$territories
#> # A tibble: 8 × 1
#>   X1    
#>   <chr> 
#> 1 unit 1
#> 2 unit 1
#> 3 unit 1
#> 4 unit 1
#> 5 unit 2
#> 6 unit 2
#> 7 unit 2
#> 8 unit 2
#> 
#> [[1]]$year
#> # A tibble: 8 × 1
#>   X2    
#>   <chr> 
#> 1 year 1
#> 2 year 1
#> 3 year 2
#> 4 year 2
#> 5 year 1
#> 6 year 1
#> 7 year 2
#> 8 year 2
#> 
#> [[1]]$commodities
#> # A tibble: 1 × 2
#>   X6      X7   
#>   <chr>   <chr>
#> 1 soybean maize

The id variable commodities is clearly wide (more than one column) and its’ values are not repeated four times, as it should be, judging by the combination of the other variables. However, this is an expected tentative output that will be handled in a later step and the id variables have been specified correctly.

Next, we set the listed observed variables. Listed means that the column names of the observed variables are treated as if they were the values of an identifying variable (in column 4), while the values are in the columns 6 and 7. In this case, the values need to be filtered by value (i.e., the values of that variable are found in columns 6 and 7, where column 4 contains value).

schema <- schema %>% 
  setObsVar(name = "harvested", columns = c(6, 7), key = 4, value = "harvested") %>%
  setObsVar(name = "production", columns = c(6, 7), key = 4, value = "production")

We then get the following observed variables, which is also an expected tentative output.

validateSchema(schema = schema, input = input) %>% 
  getObsVars(input = input)
#> [[1]]
#> [[1]]$listed
#> # A tibble: 8 × 3
#>   key        X6    X7   
#>   <chr>      <chr> <chr>
#> 1 harvested  1111  1121 
#> 2 production 1112  1122 
#> 3 harvested  1211  1221 
#> 4 production 1212  1222 
#> 5 harvested  2111  2121 
#> 6 production 2112  2122 
#> 7 harvested  2211  2221 
#> 8 production 2212  2222

From both, the output of getIDVars and getObsVars we can calculate how many and which combinations of data exist (e.g., the two columns in the observed variables correspond to the two values of the identifying variable commodities) and that they still need to be pivoted to be in a tidy arrangement.

The reorganise() function carries out the steps of validating, extracting the variables, pivoting the tentative output and putting the final table together automatically, so it merely requires the finalised (non-validated) schema and the input table.

schema # has a pretty print function
#>   1 cluster (whole spreadsheet)
#> 
#>    variable      type       row   col   key   value         
#>   ------------- ---------- ----- ----- ----- ------------  
#>    territories   id               1                        
#>    year          id               2                        
#>    commodities   id         1     6:7                      
#>    harvested     observed         6:7   4     harvested    
#>    production    observed         6:7   4     production

reorganise(input = input, schema = schema)
#> # A tibble: 8 × 5
#>   territories year   commodities harvested production
#>   <chr>       <chr>  <chr>           <dbl>      <dbl>
#> 1 unit 1      year 1 maize            1121       1122
#> 2 unit 1      year 1 soybean          1111       1112
#> 3 unit 1      year 2 maize            1221       1222
#> 4 unit 1      year 2 soybean          1211       1212
#> 5 unit 2      year 1 maize            2121       2122
#> 6 unit 2      year 1 soybean          2111       2112
#> 7 unit 2      year 2 maize            2221       2222
#> 8 unit 2      year 2 soybean          2211       2212

4 Table types

In this section we look at some examples of disorganised data, discuss the dimension along which they are disorganised and show which schema description should be used to reorganise them.

All of the following examples contain an other_observed, an empty_col column and an empty row, which serve the purpose of dummy information or formating that could be found in any table and should not disturb the process of reorganising. You can run all the examples by simply loading the schema and calling reorganise(input = tabs2shift$..., schema = schema) with the respective table that is plotted for this example.

4.1 Table contains one cluster

4.1.1 Tidy table

In case the observed variables are arranged into individual columns (Tab. 3), we have tidy data (Wickham 2014), which are largely already in the target arrangement. The tidy table may however, still contain unneeded data, need different names, or transformation factors for the values.

kable(tabs2shift$tidy)

Table 3: A largely tidy table.
X1	X2	X3	X4	X5	X6	X7
territories	period	commodities	other_observed	harvested	production	empty_col
unit 1	year 1	soybean	xyz	1111	1112	.
unit 1	year 1	maize	xyz	1121	1122	.
unit 1	year 2	soybean	xyz	1211	1212	.
unit 1	year 2	maize	xyz	1221	1222	.
.	.	.	.	.	.	.
unit 2	year 1	soybean	xyz	2111	2112	.
unit 2	year 1	maize	xyz	2121	2122	.
unit 2	year 2	soybean	xyz	2211	2212	.
unit 2	year 2	maize	xyz	2221	2222	.

schema <-
  setIDVar(name = "territories", columns = 1) %>%
  setIDVar(name = "year", columns = 2) %>%
  setIDVar(name = "commodities", columns = 3) %>%
  setObsVar(name = "harvested", columns = 5) %>%
  setObsVar(name = "production", columns = 6, factor = 0.1)

reorganise(input = tabs2shift$tidy, schema = schema)
#> # A tibble: 8 × 5
#>   territories year   commodities harvested production
#>   <chr>       <chr>  <chr>           <dbl>      <dbl>
#> 1 unit 1      year 1 maize            1121       112.
#> 2 unit 1      year 1 soybean          1111       111.
#> 3 unit 1      year 2 maize            1221       122.
#> 4 unit 1      year 2 soybean          1211       121.
#> 5 unit 2      year 1 maize            2121       212.
#> 6 unit 2      year 1 soybean          2111       211.
#> 7 unit 2      year 2 maize            2221       222.
#> 8 unit 2      year 2 soybean          2211       221.

4.1.2 Mismatch of columns and variables

Sometimes it may be the case that the number of variables is not the same as there are columns because either one variable is spread over several column, or one column contains several variables.

In the former case, columns need to be merged (Tab. 4) and in the latter case, columns need to be split via regular expressions (Tab. 5). For example, .+?(?=_) gives everything up until the first _ and (?<=\\_).* everything after the _.

kable(input <- tabs2shift$split_column)

Table 4: The variables `year` is split up into two columns.
X1	X2	X3	X4	X5	X6	X7	X8
territories	timespan	other_observed	step	commodity	harvested	production	empty_col
unit 1	year	xyz	1	soybean	1111	1112	.
unit 1	year	xyz	1	maize	1121	1122	.
unit 1	year	xyz	2	soybean	1211	1212	.
unit 1	year	xyz	2	maize	1221	1222	.
.	.	.	.	.	.	.	.
unit 2	year	xyz	1	soybean	2111	2112	.
unit 2	year	xyz	1	maize	2121	2122	.
unit 2	year	xyz	2	soybean	2211	2212	.
unit 2	year	xyz	2	maize	2221	2222	.

schema <-
  setIDVar(name = "territories", columns = 1) %>%
  setIDVar(name = "year", columns = c(2, 4), merge = " ") %>%
  setIDVar(name = "commodities", columns = 5) %>%
  setObsVar(name = "harvested", columns = 6) %>%
  setObsVar(name = "production", columns = 7)

kable(tabs2shift$merged_column)

Table 5: The variables `year` and `commodities` are stored in the same column.
X1	X2	X3	X4	X5	X6
territories	unit	other_observed	harvested	production	empty_col
unit 1	year 1_soybean	xyz	1111	1112	.
unit 1	year 1_maize	xyz	1121	1122	.
unit 1	year 2_soybean	xyz	1211	1212	.
unit 1	year 2_maize	xyz	1221	1222	.
.	.	.	.	.	.
unit 2	year 1_soybean	xyz	2111	2112	.
unit 2	year 1_maize	xyz	2121	2122	.
unit 2	year 2_soybean	xyz	2211	2212	.
unit 2	year 2_maize	xyz	2221	2222	.

schema <-
  setIDVar(name = "territories", columns = 1) %>%
  setIDVar(name = "year", columns = 2, split = ".+?(?=_)") %>%
  setIDVar(name = "commodities", columns = 2, split = "(?<=\\_).*") %>%
  setObsVar(name = "harvested", columns = 4) %>%
  setObsVar(name = "production", columns = 5)

4.1.3 Implicit variables

When data are split up into clusters that are stored in separate files or spreadsheets, the cluster ID is often recorded not in the table as an explicit variable, but is only provided in the file or table name. In those cases, we have to register this cluster ID as an identifying variable nevertheless, to output a consistent table.

kable(input <- tabs2shift$implicit_variable)

Table 6: The information about which territory we are dealing with is missing or implied by some meta-data.
X1	X2	X3	X4
this meta-data header doesn’t tell us this tables is actually about unit 1.	.	.	.
We might get these information only from the context	.	.	.
.	.	.	.
commodities	harvested	production	period
soybean	1111	1112	year 1
maize	1121	1122	year 1
soybean	1211	1212	year 2
maize	1221	1222	year 2

schema <- setCluster(id = "territories",
                     left = 1, top = 4) %>%
  setIDVar(name = "territories", value = "unit 1") %>%
  setIDVar(name = "year", columns = 4) %>%
  setIDVar(name = "commodities", columns = 1) %>%
  setObsVar(name = "harvested", columns = 2) %>%
  setObsVar(name = "production", columns = 3)

4.1.4 Wide variables

In case identifying variables are factors with a small number of levels, those levels may be falsely used as names of other variables, where they would be next to each other and thus "wide" (Tab. 7). Those other variables (both identifying and observed variables) would then be "nested" in the wide identifying variables. In those cases we have to record for the identifying variable(s) the columns and the row in which the values of the identifying variable are found (they will look like they are part of the header). For the observed variable(s) we need to record the columns and the row where the name of that variable is found.

kable(input <- tabs2shift$one_wide_id)

Table 7: The observed variables are nested within the identifying variable `commodities`.
X1	X2	X3	X4	X5	X6	X7	X8
territories	.	period	soybean	.	maize	.	empty_col
.	.	.	harvested	production	harvested	production	.
unit 1	.	year 1	1111	1112	1121	1122	.
unit 1	.	year 2	1211	1212	1221	1222	.
.	.	.	.	.	.	.	.
unit 2	.	year 1	2111	2112	2121	2122	.
unit 2	.	year 2	2211	2212	2221	2222	.

schema <-
  setIDVar(name = "territories", columns = 1) %>%
  setIDVar(name = "year", columns = 3) %>%
  setIDVar(name = "commodities", columns = c(4, 6), rows = 1) %>%
  setObsVar(name = "harvested", columns = c(4, 6), top = 2) %>%
  setObsVar(name = "production", columns = c(5, 7), top = 2)

The same is also true in case the identifying variables are actually nested within the observed variables, i.e., the names of the observed variables are on top of the names of the id variables (Tab. 8). However, if an observed variables is in the topmost row (and there are no clusters), top = 1 can be omitted.

kable(input <- tabs2shift$wide_obs)

Table 8: The identifying variable `commodities` is nested within the observed variables.
X1	X2	X3	X4	X5	X6
territories	period	harvested	.	production	.
.	.	soybean	maize	soybean	maize
unit 1	year 1	1111	1121	1112	1122
unit 1	year 2	1211	1221	1212	1222
unit 2	year 1	2111	2121	2112	2122
unit 2	year 2	2211	2221	2212	2222

schema <-
  setIDVar(name = "territories", columns = 1) %>%
  setIDVar(name = "year", columns = 2) %>%
  setIDVar(name = "commodities", columns = c(3:6), rows = 2) %>%
  setObsVar(name = "harvested", columns = c(3, 4)) %>%
  setObsVar(name = "production", columns = c(5, 6))

In case several variables are nested within other variables, we have to specify for all nested or nesting variables in which respective rows their values sit.

kable(input <- tabs2shift$two_wide_id)

Table 9: The observed variables are nested within the identifying variable `year` and `commodities`.
X1	X2	X3	X4	X5	X6	X7	X8	X9
territories	year 1	.	.	.	year 2	.	.	.
.	soybean	.	maize	.	soybean	.	maize	.
.	harvested	production	harvested	production	harvested	production	harvested	production
unit 1	1111	1112	1121	1122	1211	1212	1221	1222
unit 2	2111	2112	2121	2122	2211	2212	2221	2222

schema <-
  setIDVar(name = "territories", columns = 1) %>%
  setIDVar(name = "year", columns = c(2, 6), rows = 1) %>%
  setIDVar(name = "commodities", columns = c(2, 4, 6, 8), rows = 2) %>%
  setObsVar(name = "harvested", columns = c(2, 4, 6, 8), top = 3) %>%
  setObsVar(name = "production", columns = c(3, 5, 7, 9), top = 3)

4.1.5 Listed observed variables

Some tables contain a column where the names of observed variables (harvested and production) are treated as if they were the values of an identifying variable (dimension), while the values are presented in only one column (values) (Tab. 10). To end up with tidy data in those cases, we need to extract the values associated with the observed variables. Thus, we define the observed variables and specify the key = in which the variable names sit, and the value = the variable name has, to extract that variable.

kable(input <- tabs2shift$listed_column)

Table 10: The variable names of the observed variable are treated as if they were the values of the identifying variable `dimension`.
X1	X2	X3	X4	X5	X6	X7	X8
territories	period	commodities	other_observed	.	dimension	values	empty_col
unit 1	year 1	soybean	xyz	.	harvested	1111	.
unit 1	year 1	maize	xyz	.	harvested	1121	.
unit 1	year 1	soybean	xyz	.	production	1112	.
unit 1	year 1	maize	xyz	.	production	1122	.
unit 1	year 2	soybean	xyz	.	harvested	1211	.
unit 1	year 2	maize	xyz	.	harvested	1221	.
unit 1	year 2	soybean	xyz	.	production	1212	.
unit 1	year 2	maize	xyz	.	production	1222	.
.	.	.	.	.	.	.	.
unit 2	year 1	soybean	xyz	.	harvested	2111	.
unit 2	year 1	maize	xyz	.	harvested	2121	.
unit 2	year 1	soybean	xyz	.	production	2112	.
unit 2	year 1	maize	xyz	.	production	2122	.
unit 2	year 2	soybean	xyz	.	harvested	2211	.
unit 2	year 2	maize	xyz	.	harvested	2221	.
unit 2	year 2	soybean	xyz	.	production	2212	.
unit 2	year 2	maize	xyz	.	production	2222	.

schema <-
  setIDVar(name = "territories", columns = 1) %>%
  setIDVar(name = "year", columns = 2) %>%
  setIDVar(name = "commodities", columns = 3) %>%
  setObsVar(name = "harvested", columns = 7, key = 6, value = "harvested") %>%
  setObsVar(name = "production", columns = 7, key = 6, value = "production")

Moreover, (several) identifying variables may be wide additionally and we have to proceed as mentioned above, by providing the columns and the rows of the variable values (which appear to be names).

kable(input <- tabs2shift$listed_column_wide)

Table 11: The identifying variable `commodities` is treated as if it were the observed variables while the variable names of the observed variable are treated as if they were the values of the identifying variable `dimension`.
X1	X2	X3	X4	X5	X6	X7	X8
territories	period	.	dimension	other_observed	soybean	maize	empty_col
unit 1	year 1	.	harvested	xyz	1111	1121	.
unit 1	year 1	.	production	xyz	1112	1122	.
unit 1	year 2	.	harvested	xyz	1211	1221	.
unit 1	year 2	.	production	xyz	1212	1222	.
.	.	.	.	.	.	.	.
unit 2	year 1	.	harvested	xyz	2111	2121	.
unit 2	year 1	.	production	xyz	2112	2122	.
unit 2	year 2	.	harvested	xyz	2211	2221	.
unit 2	year 2	.	production	xyz	2212	2222	.

schema <-
  setIDVar(name = "territories", columns = 1) %>%
  setIDVar(name = "year", columns = 2) %>%
  setIDVar(name = "commodities", columns = c(6, 7), rows = 1) %>%
  setObsVar(name = "harvested", columns = c(6, 7), key = 4, value = "harvested") %>%
  setObsVar(name = "production", columns = c(6, 7), key = 4, value = "production")

4.1.6 Misplaced columns or rows

WIP

4.2 Table contains several clusters

Clusters are typically of the same arrangement within one table, they can be repeated along rows (horizontally) or along columns (vertically), but also a non-systematic distribution is possible. Moreover, clusters may be nested into some grouping variable in a similar way that data are nested into clusters. A table could also be treated like a cluster when the table is not only composed of the table, but perhaps also text in the form of some table description (that may be scattered in the document and) that does not allow the table to start at the table origin in the topmost left cell.

4.2.1 Horizontal clusters

In case clusters are sitting right next to each other in the same origin row (Tab. 12), it is sufficient to provide the topmost row and all leftmost columns at which a new cluster starts.

kable(input <- tabs2shift$clusters_horizontal)

Table 12: Horizontal clusters of the identifying variable `period`.
X1	X2	X3	X4	X5	X6	X7	X8	X9	X10
commodities	period	harvested	production	other_observed	commodities	period	harvested	production	empty_col
unit 1	.	.	.	.	unit 2	.	.	.	.
soybean	year 1	1111	1112	xyz	soybean	year 1	2111	2112	.
maize	year 1	1121	1122	xyz	maize	year 1	2121	2122	.
soybean	year 2	1211	1212	xyz	soybean	year 2	2211	2212	.
maize	year 2	1221	1222	xyz	maize	year 2	2221	2222	.

schema <- setCluster(id = "territories",
                     left = c(1, 6), top = 2) %>%
  setIDVar(name = "territories", columns = c(1, 6), rows = 2) %>%
  setIDVar(name = "year", columns = c(2, 7)) %>%
  setIDVar(name = "commodities", columns = c(1, 6)) %>%
  setObsVar(name = "harvested", columns = c(3, 8)) %>%
  setObsVar(name = "production", columns = c(4, 9))

4.2.2 Vertical clusters

For vertically arranged clusters (Tab. 13), just like for the horizontal case, the respective rows and columns need to be provided.

kable(input <- tabs2shift$clusters_vertical)

Table 13: Vertical clusters of the identifying variable `period`.
X1	X2	X3	X4	X5	X6	X7	X8
territories	period	other_observed	.	commodities	harvested	production	empty_col
.	.	.	.	.	.	.	.
unit 1	.	.	.	.	.	.	.
.	year 1	xyz	.	soybean	1111	1112	.
.	year 1	xyz	.	maize	1121	1122	.
.	year 2	xyz	.	soybean	1211	1212	.
.	year 2	xyz	.	maize	1221	1222	.
.	.	.	.	.	.	.	.
unit 2	.	.	.	.	.	.	.
.	year 1	xyz	.	soybean	2111	2112	.
.	year 1	xyz	.	maize	2121	2122	.
.	year 2	xyz	.	soybean	2211	2212	.
.	year 2	xyz	.	maize	2221	2222	.

schema <- setCluster(id = "territories",
                     left = 1, top = c(3, 9)) %>%
  setIDVar(name = "territories", columns = 1, rows = c(3, 9)) %>%
  setIDVar(name = "year", columns = 2) %>%
  setIDVar(name = "commodities", columns = 5) %>%
  setObsVar(name = "harvested", columns = 6) %>%
  setObsVar(name = "production", columns = 7)

4.2.3 Clusters of observed variables

The previous two types of clusters are clusters of identifying variables, but it may also be the case that the observed variables are split up into distinct clusters. Here, we need to specify first of all setClusters(..., id = "observed") to indicate that clusters are observed variables. Next, we need to set up the observed variables so that they contain "key = "cluster" and in value the number of the cluster this variable can be found in.

kable(input <- tabs2shift$clusters_observed)

Table 14: Vertical clusters of the observed variables.
X1	X2	X3	X4	X5	X6	X7	X8
.	territories	period	other_observed	commodities	.	value	empty_col
harvested	.	.	.	.	.	.	.
.	unit 1	year 1	xyz	soybean	.	1111	.
.	unit 1	year 1	xyz	maize	.	1121	.
.	unit 1	year 2	xyz	soybean	.	1211	.
.	unit 1	year 2	xyz	maize	.	1221	.
.	unit 2	year 1	xyz	soybean	.	2111	.
.	unit 2	year 1	xyz	maize	.	2121	.
.	unit 2	year 2	xyz	soybean	.	2211	.
.	unit 2	year 2	xyz	maize	.	2221	.
.	.	.	.	.	.	.	.
production	.	.	.	.	.	.	.
.	unit 1	year 1	xyz	soybean	.	1112	.
.	unit 1	year 1	xyz	maize	.	1122	.
.	unit 1	year 2	xyz	soybean	.	1212	.
.	unit 1	year 2	xyz	maize	.	1222	.
.	unit 2	year 1	xyz	soybean	.	2112	.
.	unit 2	year 1	xyz	maize	.	2122	.
.	unit 2	year 2	xyz	soybean	.	2212	.
.	unit 2	year 2	xyz	maize	.	2222	.

schema <- setCluster(id = "observed",
                     left = 1, top = c(2, 12)) %>%
  setIDVar(name = "territories", columns = 2) %>%
  setIDVar(name = "year", columns = 3) %>%
  setIDVar(name = "commodities", columns = 5) %>%
  setObsVar(name = "harvested", columns = 7, key = "cluster", value = 1) %>%
  setObsVar(name = "production", columns = 7, key = "cluster", value = 2)

4.2.4 Clusters that are nested into another variable

When (some) clusters are nested into a (grouping) variable of interest, not only the cluster positions need to be specified, but also their relation to the grouping variable. Similar to the cluster ID, this group ID also needs to be specified as an identifying variable and needs to be provided as group = ... and the membership of each cluster to a group needs to be specified in member = .... The cluster position needs to be specified just as it would be without groups.

kable(input <- tabs2shift$clusters_nested)

Table 15: Clusters are grouped according to a variable of interest that should also be captured.
X1	X2	X3	X4	X5	X6	X7	X8
territory	commodities	.	other_observed	harvested	production	year	empty_col
group 1	.	.	.	.	.	.	.
unit 1	.	.	.	.	.	.	.
.	soybean	.	xyz	1111	1112	year 1	.
.	maize	.	xyz	1121	1122	year 1	.
.	soybean	.	xyz	1211	1212	year 2	.
.	maize	.	xyz	1221	1222	year 2	.
unit 2	.	.	.	.	.	.	.
.	soybean	.	xyz	2111	2112	year 1	.
.	maize	.	xyz	2121	2122	year 1	.
.	soybean	.	xyz	2211	2212	year 2	.
.	maize	.	xyz	2221	2222	year 2	.
.	.	.	.	.	.	.	.
group 2	.	.	.	.	.	.	.
unit 3	.	.	.	.	.	.	.
.	soybean	.	xyz	3111	3112	year 1	.
.	maize	.	xyz	3121	3122	year 1	.
.	soybean	.	xyz	3211	3212	year 2	.
.	maize	.	xyz	3221	3222	year 2	.

schema <- setCluster(id = "sublevel",
                     group = "territories", member = c(1, 1, 2),
                     left = 1, top = c(3, 8, 15)) %>%
  setIDVar(name = "territories", columns = 1, rows = c(2, 14)) %>%
  setIDVar(name = "sublevel", columns = 1, rows = c(3, 8, 15)) %>%
  setIDVar(name = "year", columns = 7) %>%
  setIDVar(name = "commodities", columns = 2) %>%
  setObsVar(name = "harvested", columns = 5) %>%
  setObsVar(name = "production", columns = 6)

4.2.5 Variables that are distinct from a cluster

When not all identifying variables can be provided relative to the cluster origin, for example because they are missing for some clusters, it makes more sense to define such a variable as a distinct variable. This is done by providing row and col as absolute values and setting distinct = TRUE. Other variables that are all arranged in the same way in each cluster can be specified so that their row and column indices are given relative to the cluster position (relative = TRUE), as shown in the alternative shema below.

kable(input <- tabs2shift$clusters_messy)

Table 16: Several clusters where one variable is not available for each cluster, but distinct of them.
X1	X2	X3	X4	X5	X6	X7
commodities	harvested	production	.	.	.	.
unit 1	.	.	.	.	.	.
soybean	1111	1112	year 1	.	.	.
maize	1121	1122	year 1	.	.	.
soybean	1211	1212	year 2	.	.	.
maize	1221	1222	year 2	.	.	.
.	.	.	.	.	.	.
commodities	harvested	production	commodities	harvested	production	.
unit 2	.	.	unit 3	.	.	.
soybean	2111	2112	soybean	3111	3112	year 1
maize	2121	2122	maize	3121	3122	year 1
soybean	2211	2212	soybean	3211	3212	year 2
maize	2221	2222	maize	3221	3222	year 2

schema <- setCluster(id = "territories",
                     left = c(1, 1, 4), top = c(1, 8, 8)) %>%
  setIDVar(name = "territories", columns = c(1, 1, 4), rows = c(2, 9, 9)) %>%
  setIDVar(name = "year", columns = 4, rows = c(3:6), distinct = TRUE) %>%
  setIDVar(name = "commodities", columns = c(1, 1, 4)) %>%
  setObsVar(name = "harvested", columns = c(2, 2, 5)) %>%
  setObsVar(name = "production", columns = c(3, 3, 6))

schema_alt <- setCluster(id = "territories",
                         left = c(1, 1, 4), top = c(1, 8, 8)) %>%
  setIDVar(name = "territories", columns = 1, rows = .find(row = 2, relative = TRUE)) %>%
  setIDVar(name = "year", columns = 4, rows = c(3:6), distinct = TRUE) %>%
  setIDVar(name = "commodities", columns = .find(col = 1, relative = TRUE)) %>%
  setObsVar(name = "harvested", columns = .find(col = 2, relative = TRUE)) %>%
  setObsVar(name = "production", columns = .find(col = 3, relative = TRUE))

4.2.6 Messy clusters

In case several clusters are neither aligned along a row nor a column, and are all of differing size, the respective information need to be provided at the same index of the respective property. For example, three clusters, where the first cluster starts at (1,1) and is 3 by 4 cells in size, where the second clusters starts at (5,2) and is 5 by 5 cells in size, and so on, needs to be specified as below.

schema <- setCluster(id = ...,
                     left = c(1, 2, 5), top = c(1, 5, 1),  
                     width = c(3, 5, 2), height = c(4, 5, 3), 
                     ...) %>% 
  setIDVar(name = "territories", columns = .find(col = 1, relative = TRUE)) %>% 
  ...

Additionally, given that at least the tables within each cluster are all arranged in the same way, the contained variables can be specified so that their row and column indices are given relative to the cluster position (relative = TRUE). If also that is not the case, the row and column values for each cluster need to be provided for the respective variables in the same way as for cluster positions.

References

Broman, Karl W., and Kara H. Woo. 2018. “Data Organization in Spreadsheets.” PeerJ Preprints. https://doi.org/10.7287/peerj.preprints.3183v2.

Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10): 1–23. https://doi.org/10.18637/jss.v059.i10.

tabshiftr - Reshape disorganised messy data