This package offers a tidy solution for epidemiological data. It houses a range of functions for epidemiologists and public health data wizards for data management and cleaning. These include:
import()
imports files of any formats.export()
exports files of any formats.get_admin_names()
downloads admin names using various
country codes and naming conventions.clean_admin_names()
cleans admin names using both
user-provided and downloaded admin data.create_test()
creates a unit-testing functions to
perform various data validation.consistency_check()
plot to see if certain variables
exceed others (i.e., tests vs cases).handle_outliers()
detects outliers using various
approaches, and offers functionality to manage them.missing_plot()
plots missing data or reporting rate for
given variable(s) by different factors.The package is yet to be available on Cran, but can be installed using devtools in R.
# The way to install it via CRAN once available
# install.packages("epiCleanr")
You can install the latest development version from GitHub by using:
# If you haven't installed the 'devtools' package, run:
# install.packages("devtools")
::install_github("truenomad/epiCleanr") devtools
Inspired by rio
, the import function allows you to read
data from a wide range of file formats. Additional reading options
specific to each format can be passed through the ellipsis (…) argument.
Similarly, the export function provides a simple way to export data into
various formats.
# Load the epiCleanr package
library(epiCleanr)
# Reading a CSV file with a specific seperator
<- import("path/to/your/file.csv", sep = "\n")
data_csv
# Import the first sheet from an Excel file
<- import("path/to/your/file.xlsx", sheet = 1)
data_excel
# Export a Stata DTA file
export(my_data, "path/to/your/file.dta")
# Export an RDS file
export(my_data, "path/to/your/file.rds")
# Export an Excel file with sheets
export(
list(my_data = my_data1, my_data2 = my_data2), "path/to/your/file.xlsx")
The clean_names_strings()
function offers additional
flexibility by allowing you to clean not just strings within columns but
also the column names themselves.
# For data frame with snake_case (default)
data("iris")
colnames(iris)
# > "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
<- clean_names_strings(iris)
cleaned_iris colnames(cleaned_iris)
# > "sepal_length" "sepal_width" "petal_length" "petal_width" "species"
You can download administrative names (such as districts, provinces,
etc.,) for a given country via the GeoNames
website using
country name or codes to pull the data.
# Get admin names for Togo
<- get_admin_names(country_name_or_code = "Togo")
admin_names
# Different admin levels are saved as a list
str(admin_names$adm2)
#>'data.frame': 30 obs. of 7 variables:
#> $ country_code : chr "TG" "TG" "TG" "TG" ...
#> $ asciiname : chr "Vo Prefecture" "Zio Prefecture" "Tchaoudjo" "Tchamba" ...
#> $ alternatenames: chr "Circonscription de Vogan, Prefecture de Vo, Préfecture de Vo, ...
#> $ adm2 : chr "Vo Prefecture" "Zio Prefecture" "Tchaoudjo" "Tchamba" ...
#> $ latitude : num 6.42 6.58 9 8.83 6.67 ...
#> $ longitude : num 1.5 1.17 1.17 1.42 1.5 ...
#> $ last_updated : IDate, format: "2019-01-08" "2019-01-08" "2022-02-17" "2022-02-17" ...
You can clean administrative names using them
clean_admin_names
function, which uses various matching and
string distance algorithms to match your admin data, using your own list
of admin names as well admin names from GeoNames
website.
# Get simulated epi data for Togo
# get path
<- system.file(
path "extdata",
"fake_epi_df_togo.rds",
package = "epiCleanr")
# get example data
<- import(path)
fake_epi_df_togo
# get path
<- system.file(
path "extdata",
"fake_epi_df_togo.rds",
package = "epiCleanr")
# get referecne dataset with clean admin names
<- import(path)
fake_epi_df_togo
# Lets check how the matching looks
clean_admin_names(
country_code = "Togo",
admin_names_to_clean = fake_epi_df_togo$district,
user_base_admin_names = togo_admin_df$district,
report_mode = T)
#> There are 15 out of 15 (100%) admins that have been perfectly matched!
#> # A tibble: 15 × 5
#> names_to_clean final_names ource_of_cleaned_name prop_matched matching_algorithm
#> <chr> <chr> <chr> <dbl> <chr>
#> 1 Bas-Mono Bas-Mono User base admin names 100 Levenshtein Distance
#> 2 Bliita Blitta Main admin name from geonames 100 Soundex
#> 3 Centrale Centrale User base admin names 100 Levenshtein Distance
#> 4 Cinkaasi Cinkassé User base admin names 100 Soundex
#> 5 Dankben Dankpen Main admin name from geonames 100 Soundex
#> 6 East-Mono Est-Mono Main admin name from geonames 100 Soundex
#> 7 Kaloto Kloto Main admin name from geonames 100 Soundex
#> 8 Keéran Keran Alternative name from geonames 100 Soundex
#> 9 Lomé Lomé User base admin names 100 Levenshtein Distance
#> 10 Ogou Ogou Main admin name from geonames 100 Levenshtein Distance
#> 11 Sotouboua Sotouboua Main admin name from geonames 100 Levenshtein Distance
#> 12 Tchamaba Tchamba Main admin name from geonames 100 Soundex
#> 13 Vo Vo Prefecture Main admin name from geonames 100 Levenshtein Distance
#> 14 Yotto Yoto Main admin name from geonames 100 Soundex
#> 15 Zioo Zio Prefecture Main admin name from geonames 100 Soundex
# If we are happy with the names, we can update the old names with the new ones
# otherwise we can further clean the names manually
$district <-
fake_epi_df_togoclean_admin_names(
country_code = "Togo",
admin_names_to_clean = fake_epi_df_togo$district,
user_base_admin_names = togo_admin_df$district,
report_mode = F)
The create_test
is there so users can create their own
functions in which they can use for unit-testing when working with
datasets that require lots of manipulation and wrangling. This function
(plus the tidylog
package) will save users from the
headache of troubleshooting issues related to data joins and
transformations.
# Set up a unit-testing fucntion
<- create_test(
my_tests # For checking the dimension of the data
dimension_test = c(900, 9),
# For expected number of combinations in data
combinations_test = list(
variables = c("month", "year", "district"),
expectation = 12 * 5 * 15),
# Check repeated cols, rows and max and min thresholds
row_duplicates = TRUE, col_duplicates = TRUE,
max_threshold_test = list(malaria_tests = 1000, cholera_tests = 1000),
min_threshold_test = list(cholera_cases = 0, cholera_cases = 0)
)
# Apply your unit-test on your data
my_tests(fake_epi_df_togo)
#> Test passed! You have the correct number of dimensions!
#> Test passed! No duplicate rows found!
#> Test passed! No repeated columns found!
#> Test passed! You have the correct number of combinations for month, year, district!
#> Test passed! Values in column cholera_cases are above the threshold.
#> Test passed! Values in column cholera_cases are above the threshold.
#> Test passed! Values in column malaria_tests are below the threshold.
#> Test passed! Values in column cholera_tests are below the threshold.
#> Congratulations! All tests passed: 8/8 (100%) 😀
The consistency_check
function serves as a function for
validating the logical relationships between variables. For instance, if
you know that the number of disease cases cannot exceed the number of
tests for that disease, this function helps inspect such expected
behaviours in your data.
# Run checks using Togo data
consistency_check(fake_epi_df_togo,
tests = c("malaria_tests", "cholera_tests"),
cases = c("malaria_cases", "cholera_cases"))
#> Consistency test passed for malaria_tests vs malaria_cases: There are more tests than there are cases!
#> Consistency test failed for cholera_tests vs cholera_cases: There are 3 (0.33%) rows where cases are greater than tests.
# You can even save the plot if you want
::ggsave("../man/figures/consistency_plot.png", width = 10,
ggplot2height = 5.75, scale = 0.95, dpi = 400, bg = "white")
The handle_outliers
function is designed for both
identifying and addressing outliers in your dataset. It supports
multiple statistical methods for outlier detection, such as Z-score,
modified Z-score, and Interquartile Range. Beyond detection, the
function offers various options for handling outliers, including removal
or replacement with mean, median, grouped means, or quantiles.
# Select the variables
<- c("malaria_tests", "malaria_cases",
variables "cholera_tests", "cholera_cases")
# Get outliers report
<- handle_outliers(
outliers vars = variables,
fake_epi_df_togo, method = "zscore", report_mode = TRUE)
# Get the report
$report
outliers
#> variable test outliers prop_outliers
#> <chr> <chr> <glue> <chr>
#> 1 malaria_tests Z-Score 5/900 <1%
#> 2 malaria_cases Z-Score 5/900 <1%
#> 3 cholera_tests Z-Score 2/900 <1%
#> 4 cholera_cases Z-Score 1/900 <1%
# Get the plot
$plot
outliers
# You can even save the plot if you want
# ggplot2::ggsave("../man/figures/outliers_plot.png", width = 10,
# height = 7, scale = 0.95, dpi = 400)
# If happy, create a dataframe with the outliers handled
<- handle_outliers(
fake_epi_df_togo_no_outliers vars = variables,
fake_epi_df_togo, method = "zscore", report_mode = FALSE,
treat_method = "mean")
The missing_plot
function allows you to visualize
missing data across different variables, either by a single factor like
time or district, or by two factors simultaneously like date and
district. This is particularly useful when you’re dealing with
epidemiological data that might have varying degrees of completeness
across different times or locations.
# Make date columns
<- fake_epi_df_togo |>
fake_epi_df_togo2 ::mutate(date = lubridate::as_date(paste0(year, month, "/01")))
dplyr
# Missing rate for variables across dates
missing_plot(fake_epi_df_togo2, miss_vars = variables,
x_var = "date", use_rep_rate = F)
# Reporting rate for malaria_cases across dates and districts
missing_plot(fake_epi_df_togo2, miss_vars = "malaria_cases",
x_var = "date", y_var = "district", use_rep_rate = T)