Tired of struggling to convert open data formats into R dataframes? Look no further: meet the opendataformat package.
With just a few lines of code, you can convert a data package specified as opendataformat into an R dataframe (read_odf()) or convert an R dataframe into a data package specified in the opendataformat (write_odf()).
But wait, there’s more! Our package goes beyond parsing. Accessing metadata has never been easier. Dive into the treasure trove of information stored in your R dataframes. Explore dataset labels, descriptions, and valuable details about variables such as labels and value labels with the docu_odf() function and the getmetadata_odf() function.
This vignette will guide you through a series of examples that demonstrate the possibilities of these functions. Let’s get started!
You can install the package by downloading the latest version from Zenodo. Alternatively, you can install the development version from GitHub using the install_git() function from the devtools package.
# Install the latest released version of the
# opendataformat package from Zenodo:
install.packages(
  "https://zenodo.org/records/13683314/files/opendataformat_1.2.1.tar.gz",
  repos = NULL, method = "libcurl")
# Alternatively you can install the development version from GitHub:
devtools::install_git(
  "https://github.com/opendataformat/r-package-opendataformat.git")
read_odf()
The opendataformat package provides example data specified in the Open Data Format. Data in the Open Data Format is a ZIP file containing a CSV file and an XML file. The example data contains the data CSV (data.csv) with 20 rows and 7 columns and the metadata XML (metadata.xml). The metadata file describes the dataset and its variables. If you are interested in what these files look like, you can find an example in the Open Data Format git repository: https://git.soep.de/opendata/specification/-/tree/main/external/example
To load the data, we need to specify the path to the ZIP file. Here the example data is loaded using read_odf(). Alternatively, you can set file to a ZIP file in the working directory.
library(opendataformat)
#> Loading required package: cli
#> Loading required package: magrittr
#> Loading required package: xml2
#> Loading required package: data.table
#> Loading required package: zip
#>
#> Attaching package: 'zip'
#> The following objects are masked from 'package:utils':
#>
#> unzip, zip
path <- system.file("extdata", "data.zip", package = "opendataformat")
df <- read_odf(file = path)
The output of the read_odf() function is an R data.frame object that has the additional class odf. It has additional metadata stored in the attributes of the dataframe and the variables/columns. These include the languages of the metadata, labels, descriptions, URLs, variable types, and value labels.
df
bap87 bap9201 bap9001 bap9002 bap9003 bap96 name
1 4 -2 1 -1 2 -2.00 Jakob
2 3 5 -2 1 4 1.57 Luca
3 NA -1 -1 2 -1 1.92 Emilia
4 1 9 -2 2 4 1.85 -1
5 -1 4 2 3 1 1.91 Johanna
6 3 4 -1 4 -2 1.80 Paul
7 1 9 2 -1 -1 1.80
8 5 6 1 -1 1 1.96 Mia
9 5 5 5 3 1 1.64 Ben
10 -2 4 4 -1 -2 1.93 Jakob
11 -1 4 2 1 5 1.93 Anton
12 -2 5 3 -2 4 NA Charlotte
13 3 -1 2 1 2 1.74 Luca
14 2 -2 -2 4 -1 1.65 Maria
15 5 -1 -2 -1 -1 1.80 Johanna
16 4 5 1 3 -1 1.58 Emma
17 3 7 1 2 -2 1.95 Felix
18 3 NA 5 3 -2 1.98 David
19 -2 8 1 4 5 1.61 -2
20 2 8 3 1 2 1.83 Anton
If you load the haven package, you see the variable labels in the active language.
If you want to import a dataset with metadata in only one or several languages, use the languages argument. To load the example data with only English labels and descriptions, set languages="en". By default, languages = "all". You can also pass several language codes at once.
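The three variants described above can be sketched as follows. This is a minimal sketch using the bundled example file; it assumes that several languages are passed as a character vector such as c("en", "de"), and the object names df_en, df_all and df_en_de are only illustrative.

```r
library(opendataformat)

# Path to the example data shipped with the package
path <- system.file("extdata", "data.zip", package = "opendataformat")

# Only English labels and descriptions
df_en <- read_odf(file = path, languages = "en")

# All available languages (the default)
df_all <- read_odf(file = path, languages = "all")

# Several languages at once (assumed to be passed as a character vector)
df_en_de <- read_odf(file = path, languages = c("en", "de"))
```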
You can set further arguments for the read_odf() function. With the nrows argument you define how many rows to read, excluding the header. With the skip argument you set how many rows to skip (excluding the header). With the select argument you determine which columns/variables to load, using a vector of indices or variable/column names.
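A short sketch of these arguments, again based on the bundled example file. The row counts and the selected variables are arbitrary choices for illustration, and the object names are hypothetical.

```r
library(opendataformat)

path <- system.file("extdata", "data.zip", package = "opendataformat")

# Read only the first 10 data rows
df_head <- read_odf(file = path, nrows = 10)

# Skip the first 5 data rows
df_rest <- read_odf(file = path, skip = 5)

# Load only selected variables, by name or by column index
df_by_name <- read_odf(file = path, select = c("bap87", "name"))
df_by_index <- read_odf(file = path, select = c(1, 7))
```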
docu_odf()
You can explore dataset information using two methods. Firstly, you can browse metadata at the record level, providing an overview of the dataset. Alternatively, you have the option to examine specific variable details, allowing you to gain insights into selected data attributes.
By default, when using the docu_odf() function, dataset-level information is presented through the console and an HTML page. If you’re using RStudio, this HTML page will be displayed within the RStudio viewer.
To display the metadata only in the console, use the style argument with the value print (or console). This is the setting used for the demonstrations in this vignette.
To display metadata information only in the viewer, set style="viewer" or style="html". By default, style="both".
docu_odf(df, style = "print")
#> Dataset: soep-core v38.1: bap
#> Label:
#> [en]Data from individual questionnaires 2010
#> Description:
#> [en]The data were collected as part of the SOEP-Core study using the questionnaire "Living in Germany - Survey 2010 on the social situation - Personal questionnaire for all. This questionnaire is addressed to the individual persons in the household. A view of the survey instrument can be found here: https://www.diw.de/documents/dokumentenarchiv/17/diw_01.c.369781.de/soepfrabo_personen_2010.pdf
#> languages:
#> en de (active: en)
#> URL:
#> https://paneldata.org/soep-core/data/bap
#> Variables:
#> Variable Label en
#> 1 bap87 Current Health
#> 2 bap9201 hours of sleep, normal workday
#> 3 bap9001 Pressed For Time Last 4 Weeks
#> 4 bap9002 Run-down, Melancholy Last 4 Weeks
#> 5 bap9003 Well-balanced Last 4 Weeks
#> 6 bap96 Height
#> 7 name Firstname
To obtain a comprehensive overview of all variables within the dataset, simply set the argument variables="yes".
docu_odf(df, variables = "yes", style = "print")
Dataset: soep-core v38.1: bap
Label:
[en]Data from individual questionnaires 2010
Description:
[en]The data were collected as part of the SOEP-Core study using the questionnaire "Living in Germany - Survey 2010 on the social situation - Personal questionnaire for all. This questionnaire is addressed to the individual persons in the household. A view of the survey instrument can be found here: https://www.diw.de/documents/dokumentenarchiv/17/diw_01.c.369781.de/soepfrabo_personen_2010.pdf
languages:
en de (active: en)
URL:
https://paneldata.org/soep-core/data/bap
Variables:
Variable Label en
1 bap87 Current Health
2 bap9201 hours of sleep, normal workday
3 bap9001 Pressed For Time Last 4 Weeks
4 bap9002 Run-down, Melancholy Last 4 Weeks
5 bap9003 Well-balanced Last 4 Weeks
6 bap96 Height
7 name Firstname
If you are interested in just one specific variable, you can do this:
docu_odf(df$bap9001, style = "print")
#> Variable: bap9001
#> Label:
#> [en]Pressed For Time Last 4 Weeks
#> Description:
#> [en]Frequency of feeling time pressure in the past 4 weeks
#> Type:
#> numeric
#> URL:
#> https://paneldata.org/soep-core/data/bap/bap9001
#> Value Labels:
#> Value en
#> -2 Does not apply
#> -1 No Answer
#> 1 Always
#> 2 Often
#> 3 Sometimes
#> 4 Almost Never
#> 5 Never
Certain datasets offer metadata such as labels, descriptions, or value labels in multiple languages. To display the metadata in all languages supported by your dataset, set the languages argument to all. This setting lets you see which languages are available for the metadata within your dataset.
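As a minimal sketch, the call below prints the documentation of one variable in every available language. It reuses the example dataset and the style = "print" setting from above.

```r
library(opendataformat)

path <- system.file("extdata", "data.zip", package = "opendataformat")
df <- read_odf(file = path)

# Label, description and value labels of bap9001 in all available languages
docu_odf(df$bap9001, style = "print", languages = "all")
```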
If you have a specific language of interest, you can display it by specifying the corresponding language code. This gives you the language variant of the variable labels and value labels. In this example, we display the German version:
docu_odf(df$bap9001, style = "print", languages = "de")
#> Variable: bap9001
#> Label:
#> [de]Eile, Zeitdruck letzten 4 Wochen
#> Description:
#> [de]Häufigkeit des Gefühls von Zeitdruck in den letzten 4 Wochen
#> Type:
#> numeric
#> URL:
#> https://paneldata.org/soep-core/data/bap/bap9001
#> Value Labels:
#> Value de
#> -2 trifft nicht zu
#> -1 keine Angabe
#> 1 Immer
#> 2 Oft
#> 3 Manchmal
#> 4 Fast nie
#> 5 Nie
You can apply this function to the entire dataset, allowing you to access the desired information across all variables.
If you prefer another display style, you can use the datasets’ metadata directly from the attributes and write your own code:
for (i in names(df)) {
  cat(
    paste0(attributes(df[[i]])$name, ": ", attributes(df[[i]])$label_de, "\n")
  )
}
bap87: Gesundheitszustand gegenwärtig
bap9201: Stunden Schlaf, normaler Werktag
bap9001: Eile, Zeitdruck letzten 4 Wochen
bap9002: Niedergeschlagen letzten 4 Wochen
bap9003: Ausgeglichen letzten 4 Wochen
bap96: Körpergröße
name: Vorname
You can also use the getmetadata_odf() function to retrieve labels and other metadata for the variables:
getmetadata_odf(df, type = "label")
bap87 bap9201
"Current Health" "hours of sleep, normal workday"
bap9001 bap9002
"Pressed For Time Last 4 Weeks" "Run-down, Melancholy Last 4 Weeks"
bap9003 bap96
"Well-balanced Last 4 Weeks" "Height"
name
"Firstname"
Value labels can be retrieved in the same way by setting type = "valuelabels".
setlanguage_odf()
Alternatively, you can set the current (active) language for a dataset object. (This function is modeled on the label language command in Stata.)
To see which languages are available for the dataset metadata, inspect the languages attribute.
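A minimal sketch of both steps. The attribute names languages and lang are the ones shown in the attributes() output of this vignette. The call to setlanguage_odf() rests on an assumption: that it takes the dataset and a language code, in that order, and returns the dataset with the new active language.

```r
library(opendataformat)

path <- system.file("extdata", "data.zip", package = "opendataformat")
df <- read_odf(file = path)

# Languages available in the metadata, and the currently active one
attr(df, "languages")
attr(df, "lang")

# Switch the active language to German
# (assumed signature: dataset first, then the language code)
df_de <- setlanguage_odf(df, "de")
attr(df_de, "lang")
```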
getmetadata_odf() and attributes()
Browsing through datasets’ metadata provides a valuable initial overview. However, when it comes time to dive into the analysis work, questions arise regarding the storage location of the metadata and the process of accessing and utilizing it. Let’s explore how and where the metadata is stored, and how we can effectively access and leverage it for analysis purposes.
An easy way to retrieve metadata is to use the getmetadata_odf() function.
attributes() and attr()
Another way is to retrieve metadata directly from the attributes. The metadata imported from the Open Data Format file into an R dataframe is stored as R attributes. By using the base R functions attributes() and attr(), you can easily access this metadata. When providing the entire dataset to the function, R will display all the metadata describing the dataset as a whole in your console.
attributes(df)
$names
[1] "bap87" "bap9201" "bap9001" "bap9002" "bap9003" "bap96" "name"
$row.names
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
$study
[1] "soep-core v38.1"
$name
[1] "bap"
$description_en
[1] "The data were collected as part of the SOEP-Core study using the questionnaire \"Living in Germany - Survey 2010 on the social situation - Personal questionnaire for all. This questionnaire is addressed to the individual persons in the household. A view of the survey instrument can be found here: https://www.diw.de/documents/dokumentenarchiv/17/diw_01.c.369781.de/soepfrabo_personen_2010.pdf"
$description_de
[1] "Die Daten wurden im Rahmen der Studie SOEP-Core mittels des Fragebogens „Leben in Deutschland – Befragung 2010 zur sozialen Lage - Personenfragebogen für alle“ erhoben. Dieser Fragebogen richtet sich an die einzelnen Personen im Haushalt. Eine Ansicht des Erhebungsinstrumentes finden Sie hier: https://www.diw.de/documents/dokumentenarchiv/17/diw_01.c.369781.de/soepfrabo_personen_2010.pdf"
$label_en
[1] "Data from individual questionnaires 2010"
$label_de
[1] "Daten vom Personenfragebogen 2010"
$url
[1] "https://paneldata.org/soep-core/data/bap"
$languages
[1] "en" "de"
$lang
[1] "en"
$label
[1] "Data from individual questionnaires 2010"
$class
[1] "odf" "data.frame"
If you provide a specific variable to the function, only the corresponding metadata for that variable will be printed.
attributes(df$bap87)
$name
[1] "bap87"
$label_en
[1] "Current Health"
$label_de
[1] "Gesundheitszustand gegenwärtig"
$description_en
[1] "Question: How would you describe your current health?"
$description_de
[1] "Frage: Wie würden Sie Ihren gegenwärtigen Gesundheitszustand beschreiben?"
$type
[1] "numeric"
$url
[1] "https://paneldata.org/soep-core/data/bap/bap87"
$labels_en
Does not apply No Answer Very good Good Satisfactory
-2 -1 1 2 3
Poor Bad
4 5
$labels_de
trifft nicht zu keine Angabe Sehr gut Gut
-2 -1 1 2
Zufriedenstellend Weniger gut Schlecht
3 4 5
$languages
[1] "en" "de"
$lang
[1] "en"
$label
[1] "Current Health"
If you’re interested in a particular attribute, you can access it using the dollar sign followed by the attribute name. For instance, let’s consider accessing a variable label in German (language code: de) as an example.
Alternatively, you can use the attr() function to get the same result.
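Both access paths can be sketched with base R alone. The vector below is only a stand-in that carries two of the attributes shown above for bap87; with the loaded dataset you would use df$bap87 instead.

```r
# Stand-in for df$bap87: a plain vector with two of its metadata attributes
bap87 <- structure(
  c(4, 3, NA, 1),
  label_en = "Current Health",
  label_de = "Gesundheitszustand gegenwärtig"
)

# Dollar-sign access on the list returned by attributes()
label_dollar <- attributes(bap87)$label_de

# Same result with attr(); exact = TRUE avoids partial matching of names
label_attr <- attr(bap87, "label_de", exact = TRUE)

identical(label_dollar, label_attr)
```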
Moreover, you have the flexibility to copy, remove, and modify these attributes to suit your needs.
getmetadata_odf()
You can also use the getmetadata_odf() function to retrieve labels and other metadata for the variables. By default, the function will return the variable labels for a dataset:
getmetadata_odf(df, type = "labels")
bap87 bap9201
"Current Health" "hours of sleep, normal workday"
bap9001 bap9002
"Pressed For Time Last 4 Weeks" "Run-down, Melancholy Last 4 Weeks"
bap9003 bap96
"Well-balanced Last 4 Weeks" "Height"
name
"Firstname"
You can also retrieve the label of a specific variable by passing that variable instead of the dataset.
To retrieve metadata in a specific language, use the language parameter.
Alternatively, set the active language of the dataset using the setlanguage_odf() function.
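The three options could look like the sketch below. The language argument name follows the text above. The setlanguage_odf() call assumes the dataset comes first and the language code second, and that the function returns the modified dataset.

```r
library(opendataformat)

path <- system.file("extdata", "data.zip", package = "opendataformat")
df <- read_odf(file = path)

# Label of a single variable
getmetadata_odf(df$bap87, type = "labels")

# Labels of all variables in German, via the language parameter
getmetadata_odf(df, type = "labels", language = "de")

# Alternatively, switch the active language first
# (assumed signature: dataset, then language code)
df_de <- setlanguage_odf(df, "de")
getmetadata_odf(df_de, type = "labels")
```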
You can also use the getmetadata_odf() function to retrieve value labels for a specific variable by setting the argument type="valuelabels":
getmetadata_odf(df$bap9001, type = "valuelabels")
Does not apply No Answer Always Often Sometimes
-2 -1 1 2 3
Almost Never Never
4 5
The value labels are stored as the names of the returned vector:
names(getmetadata_odf(df$bap9001, type = "valuelabels"))
[1] "Does not apply" "No Answer" "Always" "Often"
[5] "Sometimes" "Almost Never" "Never"
You can use the getmetadata_odf() function to return descriptions, URLs, variable types and metadata languages as well.
To retrieve variable description(s), set the argument type="description":
getmetadata_odf(df, type = "description")
bap87
"Question: How would you describe your current health?"
bap9201
"Sleep hours per weekday"
bap9001
"Frequency of feeling time pressure in the past 4 weeks"
bap9002
"Frequency of feeling a sad and depressed state"
bap9003
"Frequency of feeling balance"
bap96
"Body size"
name
"Firstname"
To retrieve variable URL(s), set the argument type="url".
To retrieve variable type(s), set the argument type="type".
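Both calls follow the same pattern as the description example. A minimal sketch on the example dataset:

```r
library(opendataformat)

path <- system.file("extdata", "data.zip", package = "opendataformat")
df <- read_odf(file = path)

# URLs documenting each variable
getmetadata_odf(df, type = "url")

# Variable types (for example "numeric")
getmetadata_odf(df, type = "type")
```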
write_odf()
To save a dataset as an ODF file, we can use the write_odf() function. Let’s assume we want to save the first four columns of our dataset as a new ODF file. We pass the R dataframe and the file name (including its location, if needed) to write_odf().
write_odf(
  x = df[, 1:4],
  file = "../df_1_4.zip"
)
# or store the subset first and write it to the working directory:
df_14 <- df[, 1:4]
write_odf(
  x = df_14,
  file = "df_1_4.zip"
)
The XML file metadata.xml and the CSV file data.csv are saved within the directory ‘data_rec’, as well as within the ZIP file ‘data_rec.zip’. The dataset looks the same as before, just with fewer variables:
<?xml version='1.0' encoding='utf-8'?>
<codeBook xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="ddi:codebook:2_5 http://www.ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd" xmlns="ddi:codebook:2_5" version="2.5">
<fileDscr>
<fileTxt>
<fileName>bap</fileName>
<fileCont xml:lang="en">The data were collected as part of the SOEP-Core study using the questionnaire "Living in Germany - Survey 2010 on the social situation - Personal questionnaire for all. This questionnaire is addressed to the individual persons in the household. A view of the survey instrument can be found here: https://www.diw.de/documents/dokumentenarchiv/17/diw_01.c.369781.de/soepfrabo_personen_2010.pdf</fileCont>
<fileCont xml:lang="de">Die Daten wurden im Rahmen der Studie SOEP-Core mittels des Fragebogens „Leben in Deutschland – Befragung 2010 zur sozialen Lage - Personenfragebogen für alle“ erhoben. Dieser Fragebogen richtet sich an die einzelnen Personen im Haushalt. Eine Ansicht des Erhebungsinstrumentes finden Sie hier: https://www.diw.de/documents/dokumentenarchiv/17/diw_01.c.369781.de/soepfrabo_personen_2010.pdf</fileCont>
<fileCitation>
<titlStmt>
<titl xml:lang="en">Data from individual questionnaires 2010</titl>
<titl xml:lang="de">Daten vom Personenfragebogen 2010</titl>
</titlStmt>
</fileCitation>
</fileTxt>
<notes>
<ExtLink URI="https://paneldata.org/soep-core/data/bap" />
</notes>
</fileDscr>
<dataDscr>
<var name="bap87">
<labl xml:lang="en">Current Health</labl>
<labl xml:lang="de">Gesundheitszustand gegenwärtig</labl>
<txt xml:lang="en">Question: How would you describe your current health?</txt>
<txt xml:lang="de">Frage: Wie würden Sie Ihren gegenwärtigen Gesundheitszustand beschreiben?</txt>
<notes>
<ExtLink URI="https://paneldata.org/soep-core/data/bap/bap87" />
</notes>
<varFormat type="numeric" />
<catgry>
<catValu>-2</catValu>
<labl xml:lang="en">Does not apply</labl>
<labl xml:lang="de">trifft nicht zu</labl>
</catgry>
<catgry>
<catValu>-1</catValu>
<labl xml:lang="en">No Answer</labl>
<labl xml:lang="de">keine Angabe</labl>
</catgry>
<catgry>
<catValu>1</catValu>
<labl xml:lang="en">Very good</labl>
<labl xml:lang="de">Sehr gut</labl>
...
The data.csv file now includes just four columns:
"bap87","bap9201","bap9001","bap9002"
4,-2,1,-1
3,5,-2,1
,-1,-1,2
1,9,-2,2
-1,4,2,3
3,4,-1,4
1,9,2,-1
...
If you wish to export only the metadata for documentation or archiving purposes, set the argument export_data="no". The resulting directory or zip file will then contain only the metadata XML file, without the data CSV file.
If you wish to export the dataset with the metadata in only one or some languages, set the languages argument, for example languages=c("en"). By default, languages="all". You can also pass several languages to be exported.
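The export options above might be used as in the following sketch. The output file names are hypothetical, the files are written to a temporary directory, and passing several languages as a character vector is an assumption.

```r
library(opendataformat)

path <- system.file("extdata", "data.zip", package = "opendataformat")
df <- read_odf(file = path)

# Write to a temporary directory so the working directory stays clean
out_dir <- tempdir()

# Metadata only: no data CSV is exported
write_odf(x = df, file = file.path(out_dir, "df_meta_only.zip"),
          export_data = "no")

# Data plus English metadata only
write_odf(x = df, file = file.path(out_dir, "df_en.zip"),
          languages = "en")

# Data plus metadata in several languages (assumed: a character vector)
write_odf(x = df, file = file.path(out_dir, "df_en_de.zip"),
          languages = c("en", "de"))
```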
Now let’s see how we can use the metadata to better understand the data and make more informative plots.
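A plain frequency table of bap87 is a natural starting point. The sketch below rebuilds that column from the printed example data as a stand-in, so it runs without the package; with the loaded dataset, table(df$bap87) would be the equivalent call.

```r
# bap87 values copied from the printed example data (stand-in for df$bap87)
bap87 <- c(4, 3, NA, 1, -1, 3, 1, 5, 5, -2, -1, -2, 3, 2, 5, 4, 3, 3, -2, 2)

# Counts per raw value; NA is dropped by default
freq <- table(bap87)
freq
```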
As expected, the frequency table displays the occurrence count of each variable value. Now, let’s make the frequency table more convenient by using the value labels associated with the variables. To access the value labels, as explained in the preceding section, you can use the base R function attributes(). Let’s examine them now:
attributes(df$bap87)$labels_en
Does not apply No Answer Very good Good Satisfactory
-2 -1 1 2 3
Poor Bad
4 5
attributes(df$bap87)$labels_de
trifft nicht zu keine Angabe Sehr gut Gut
-2 -1 1 2
Zufriedenstellend Weniger gut Schlecht
3 4 5
table(factor(df$bap87, labels = names(attributes(df$bap87)$labels_en)))
Does not apply No Answer Very good Good Satisfactory
3 2 2 2 5
Poor Bad
2 3
Alternatively, you can use the getmetadata_odf() function to get the value labels.
To display the data in a language other than the default one, let’s try German by appending the respective language code to the attribute name. For example, you can use $labels_de to access the German language labels and present the information accordingly.
table(
  factor(
    df$bap87,
    labels = names(attributes(df$bap87)$labels_de)
  )
)
trifft nicht zu keine Angabe Sehr gut Gut
3 2 2 2
Zufriedenstellend Weniger gut Schlecht
5 2 3
The same table can be produced with the getmetadata_odf() function.
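A sketch of the getmetadata_odf() variant for both languages. It assumes that the language parameter described earlier also applies to type = "valuelabels".

```r
library(opendataformat)

path <- system.file("extdata", "data.zip", package = "opendataformat")
df <- read_odf(file = path)

# English value labels via getmetadata_odf() instead of attributes()
labels_en <- getmetadata_odf(df$bap87, type = "valuelabels")
table(factor(df$bap87, labels = names(labels_en)))

# German value labels, requested through the language parameter
labels_de <- getmetadata_odf(df$bap87, type = "valuelabels", language = "de")
table(factor(df$bap87, labels = names(labels_de)))
```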
To merge ODF datasets, use left_join(), right_join(), full_join(), and inner_join() from the dplyr package instead of the merge() function, so that the metadata attributes of the merged datasets are kept.
library(dplyr)
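# similar to merge(df[, c(1:3, 7)], df[, c(4:5, 7)], by = "name", all.x = TRUE, all.y = FALSE)
# (column 7 is "name", so both parts contain the join key)
merged_df <- left_join(df[, c(1:3, 7)], df[, c(4:5, 7)], by = "name")
# or, letting dplyr join on the only shared column ("name"):
merged_df <- left_join(df[, c(1:3, 7)], df[, c(4:5, 7)])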
We want to display the table with only valid answers. Therefore, we set the values -2 and -1 to NA. Because we do not want to overwrite the original variable, we generate a new one, bap87_rec.
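Creating the new variable is a plain assignment. The sketch below uses a small stand-in vector with a few of the attributes of bap87, so it runs without the package; with the loaded dataset the equivalent would be bap87_rec <- df$bap87.

```r
# Stand-in for df$bap87: the first values plus a few of its attributes
bap87 <- structure(
  c(4, 3, NA, 1, -1),
  name = "bap87",
  label_en = "Current Health",
  labels_en = c("Does not apply" = -2, "No Answer" = -1, "Very good" = 1)
)

# Plain assignment creates a copy; the original stays untouched
bap87_rec <- bap87

# The metadata attributes travel with the copy
identical(attributes(bap87_rec), attributes(bap87))
```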
We check the attributes of the metadata and notice they are also copied from the original variable to the new one:
attributes(bap87_rec)
$name
[1] "bap87"
$label_en
[1] "Current Health"
$label_de
[1] "Gesundheitszustand gegenwärtig"
$description_en
[1] "Question: How would you describe your current health?"
$type
[1] "numeric"
$url
[1] "https://paneldata.org/soep-core/data/bap/bap87"
$labels_en
Does not apply No Answer Very good Good Satisfactory
-2 -1 1 2 3
Poor Bad
4 5
$labels_de
trifft nicht zu keine Angabe Sehr gut Gut
-2 -1 1 2
Zufriedenstellend Weniger gut Schlecht
3 4 5
$languages
[1] "en" "de"
$lang
[1] "en"
$label
[1] "Current Health"
Now we can set the negative values to NA:
# Set all negative (missing) codes to NA; existing NAs are left untouched
bap87_rec[!is.na(bap87_rec) & bap87_rec <= -1] <- NA
table(bap87_rec, useNA = "ifany")
bap87_rec
1 2 3 4 5 <NA>
2 2 5 2 3 6
We notice that the copied values and value labels do not fit anymore:
attributes(bap87_rec)$labels_en
Does not apply No Answer Very good Good Satisfactory
-2 -1 1 2 3
Poor Bad
4 5
To change that, we’ll copy positions 3 to 7, retaining the desired range of values and their respective value labels.
attributes(bap87_rec)$labels_en <-
  unname(attributes(df$bap87)$labels_en)[3:7] # values
names(attributes(bap87_rec)$labels_en) <-
  names(attributes(df$bap87)$labels_en)[3:7] # labels
attributes(bap87_rec)$labels_en
Very good Good Satisfactory Poor Bad
1 2 3 4 5
Do the same for the other language versions of the new recoded variable:
attributes(bap87_rec)$labels_de <-
  unname(attributes(df$bap87)$labels_de)[3:7] # values
names(attributes(bap87_rec)$labels_de) <-
  names(attributes(df$bap87)$labels_de)[3:7] # labels
We also notice that the variable name is no longer adequate. We replace the name copied from the original variable with the new name bap87_rec.
Now we generate the frequency table by using the variable as a factor variable.
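Renaming and tabulating can be sketched in base R. The vector below is a stand-in for the recoded variable: its values follow the printed example data with the negative codes already set to NA, and it carries only the attributes needed here.

```r
# Stand-in for the recoded variable (negative codes already set to NA)
bap87_rec <- structure(
  c(4, 3, NA, 1, NA, 3, 1, 5, 5, NA, NA, NA, 3, 2, 5, 4, 3, 3, NA, 2),
  name = "bap87",
  labels_en = c("Very good" = 1, "Good" = 2, "Satisfactory" = 3,
                "Poor" = 4, "Bad" = 5)
)

# Replace the name copied from the original variable
attr(bap87_rec, "name") <- "bap87_rec"

# Frequency table with value labels instead of raw codes
freq_rec <- table(
  factor(bap87_rec, labels = names(attr(bap87_rec, "labels_en")))
)
freq_rec
```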
To create a barplot, we will utilize the recoded variable from the previous section. This example will demonstrate how to leverage metadata to create a more convenient and informative graph. By incorporating the metadata into the visualization, we can enhance the graph’s interpretability and provide a clearer understanding of the data.
barplot(
  table(
    factor(
      bap87_rec,
      labels = names(attributes(bap87_rec)$labels_en)
    )
  ),
  main = attributes(bap87_rec)$description_en, # title
  xlab = paste0(
    attributes(bap87_rec)$name, ": ", attributes(bap87_rec)$label), # label
  sub = attributes(bap87_rec)$url, # subtitle
  cex.main = 0.9, cex.names = 0.7, cex.sub = 0.8, cex.axis = 0.6,
  cex.lab = 0.7 # font sizes
)
Drawing a barplot with the German description is straightforward when dealing with data that have multiple language versions of labels and descriptions. Simply append the language code to the end of the label attributes to generate the barplot with the German description:
barplot(
  table(
    factor(
      bap87_rec,
      labels = names(attributes(bap87_rec)$labels_de)
    )
  ),
  main = attributes(bap87_rec)$description_de, # title
  xlab = paste0(
    attributes(bap87_rec)$name, ": ", attributes(bap87_rec)$label_de), # label
  sub = attributes(bap87_rec)$url, # subtitle
  cex.main = 0.7, cex.names = 0.5, cex.sub = 0.8, cex.axis = 0.7,
  cex.lab = 0.7 # font sizes
)