Integration of geospatial and demographic data

ColOpenData provides access to both geospatial and demographic data from Colombia through independent modules. To make the two easier to combine, the package also includes a function that merges geospatial and demographic information. In this vignette you will learn how to use the function merge_geo_demographic().

library(ColOpenData)
library(dplyr)
library(ggplot2)

Disclaimer: all data is loaded into the environment of the user’s R session, but is not saved to the user’s computer.

How to merge geospatial and demographic data

Documentation access

Geospatial and demographic data can be merged based on the spatial aggregation level (SAL). While geospatial data can be aggregated down to the block level, demographic data is typically available only at the department and municipality levels. Therefore, these are the only SALs at which the two types of data can be merged.

The merge_geo_demographic() function takes the name of the demographic dataset of interest as a parameter. Therefore, we should first consult the demographic documentation to decide which dataset we want to work with. Let’s suppose we want to select a dataset at the department level. We can load the list of all available demographic datasets and then filter it by the desired SAL.

datasets_dem <- list_datasets("demographic", "EN")

department_datasets <- datasets_dem[datasets_dem$level == "department", ]

head(department_datasets)
#> # A tibble: 6 × 7
#>   name                 group       source year  level      category  description
#>   <chr>                <chr>       <chr>  <chr> <chr>      <chr>     <chr>      
#> 1 DANE_CNPVH_2018_1HD  demographic DANE   2018  department househol… Number of …
#> 2 DANE_CNPVH_2018_2HD  demographic DANE   2018  department househol… Number of …
#> 3 DANE_CNPVH_2018_3HD  demographic DANE   2018  department househol… Households…
#> 4 DANE_CNPVPD_2018_1PD demographic DANE   2018  department persons_… Total cens…
#> 5 DANE_CNPVPD_2018_3PD demographic DANE   2018  department persons_… Total cens…
#> 6 DANE_CNPVPD_2018_4PD demographic DANE   2018  department persons_… Census pop…
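
Since dplyr is already loaded, the same filtering step can also be written with filter(). The sketch below is only an illustration on a toy stand-in for the catalogue: the first two dataset names are taken from the output above, while the third row (including its name) is hypothetical and exists only so that the filter has something to exclude.

```r
library(dplyr)

# Toy stand-in for the table returned by list_datasets().
# The last row is hypothetical and is not a real dataset name.
catalogue <- data.frame(
  name = c(
    "DANE_CNPVH_2018_1HD",
    "DANE_CNPVPD_2018_1PD",
    "HYPOTHETICAL_MUNICIPALITY_DATASET"
  ),
  level = c("department", "department", "municipality")
)

# Keep only the datasets available at the department level.
department_only <- catalogue %>%
  filter(level == "department")

department_only$name
```

With the real catalogue, the same filter() call would be applied to datasets_dem instead of the toy table.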

After reviewing the available datasets, we can select the one we wish to work with and take a closer look. For instance, let’s suppose we choose the dataset “DANE_CNPVPD_2018_14BPD”.

chosen_dataset <- download_demographic("DANE_CNPVPD_2018_14BPD")
#> Original data is retrieved from the National Administrative Department
#> of Statistics (Departamento Administrativo Nacional de Estadística -
#> DANE).
#> Reformatted by package authors.
#> Stored by Universidad de Los Andes under the Epiverse TRACE iniative.

head(chosen_dataset)
#> # A tibble: 6 × 7
#>   codigo_departamento departamento sexo  grupo_de_edad area 
#>   <chr>               <chr>        <chr> <chr>         <chr>
#> 1 total               Nacional     total total         total
#> 2 total               Nacional     total total         total
#> 3 total               Nacional     total total         total
#> 4 total               Nacional     total total         total
#> 5 total               Nacional     total total         total
#> 6 total               Nacional     total total         total
#> # ℹ 2 more variables: servicio_salud_al_que_acudieron <chr>, total <int>

chosen_dataset contains information on the health service attended by people who had an illness, accident, dental problem or other health problem in the last thirty days. Now, we can use the merge_geo_demographic() function.

The simplified argument makes the function download a simplified version of the geometries. This is not recommended for applications that require high spatial accuracy, but the approximation is sufficient for a simple plot and makes the download much faster. To override this and download the full geometries, you could use simplified = FALSE.

merged_data <- merge_geo_demographic(
  demographic_dataset =
    "DANE_CNPVPD_2018_14BPD"
)
#> Original data is retrieved from the National Administrative Department
#> of Statistics (Departamento Administrativo Nacional de Estadística -
#> DANE).
#> Reformatted by package authors.
#> Stored by Universidad de Los Andes under the Epiverse TRACE iniative.

head(merged_data)
#> # A tibble: 6 × 18
#>   codigo_departamento departamento version         area latitud longitud
#>   <chr>               <chr>          <dbl>        <dbl>   <dbl>    <dbl>
#> 1 05                  Antioquia       2018 62804708983.    6.92    -75.6
#> 2 08                  Atlántico       2018  3315752105.   10.7     -75.0
#> 3 11                  Bogotá, D.C.    2018  1622852605.    4.32    -74.2
#> 4 13                  Bolívar         2018 26719196397.    8.75    -74.5
#> 5 15                  Boyacá          2018 23138048132     5.78    -73.1
#> 6 17                  Caldas          2018  7425221672.    5.34    -75.3
#> # ℹ 12 more variables: total_personas_que_tuvieron_alguna_enfermedad <int>,
#> #   sin_informacion <int>,
#> #   a_la_entidad_de_seguridad_social_en_salud_a_la_cual_esta_afliado_a <int>,
#> #   a_un_medico_particular <int>, a_un_boticario_farmaceuta_droguista <int>,
#> #   a_terapias_alternativas <int>,
#> #   acudio_a_una_autoridad_indigena_espiritual <int>,
#> #   otro_medico_de_un_grupo_etnico <int>, uso_remedios_caseros <int>, …
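
Comparing the two outputs, the demographic table has one row per combination of categories, while the merged table has one row per department and one column per health service. A minimal sketch of that idea on toy data is shown below. It is not the package's implementation: the counts are made up, the column name servicio is shortened for illustration, and a plain data frame stands in for the department geometries.

```r
# Toy long-format demographic table (made-up counts).
demographic_long <- data.frame(
  codigo_departamento = c("05", "05", "17", "17"),
  servicio = c(
    "uso_remedios_caseros", "a_un_medico_particular",
    "uso_remedios_caseros", "a_un_medico_particular"
  ),
  total = c(10L, 30L, 5L, 15L)
)

# Toy attribute table standing in for the department geometries.
geo <- data.frame(
  codigo_departamento = c("05", "17"),
  departamento = c("Antioquia", "Caldas")
)

# Reshape to one row per department and one column per category.
counts <- tapply(
  demographic_long$total,
  list(demographic_long$codigo_departamento, demographic_long$servicio),
  sum
)
demographic_wide <- data.frame(
  codigo_departamento = rownames(counts),
  counts,
  row.names = NULL
)

# Join both tables on the shared department code.
merged_toy <- merge(geo, demographic_wide, by = "codigo_departamento")
merged_toy
```

The department code is the key shared by both tables, which is why the merge is only possible at a SAL that exists in both types of data.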

merged_data presents geospatial information related to departments, as well as the information related to the health service attended by the population. We can use this dataset to visualize the proportion of people in each department who used home remedies for health issues. To achieve this, we will calculate the proportion by dividing the count of people who reported using home remedies (“uso_remedios_caseros”) by the total count of people who reported experiencing a health problem in each department.

merged_data <- merged_data %>%
  mutate(proportion_home_remedies = uso_remedios_caseros /
    total_personas_que_tuvieron_alguna_enfermedad)
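
As a quick sanity check of this calculation, the sketch below applies the same division to a toy table with made-up counts. It also shows one possible way to return NA when the denominator is zero; that guard is an assumption added for illustration and may not be needed for the real data.

```r
library(dplyr)

# Toy table with made-up counts; department "C" has no reported cases.
toy <- data.frame(
  departamento = c("A", "B", "C"),
  uso_remedios_caseros = c(10L, 5L, 0L),
  total_personas_que_tuvieron_alguna_enfermedad = c(40L, 20L, 0L)
)

toy <- toy %>%
  mutate(
    # Divide only where the denominator is positive; otherwise return NA.
    proportion_home_remedies = if_else(
      total_personas_que_tuvieron_alguna_enfermedad > 0,
      uso_remedios_caseros / total_personas_que_tuvieron_alguna_enfermedad,
      NA_real_
    )
  )

toy$proportion_home_remedies
```

All valid proportions should fall between 0 and 1, which is an easy property to verify before plotting.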

We can now plot the results.

ggplot(data = merged_data) +
  geom_sf(mapping = aes(fill = proportion_home_remedies), color = "white") +
  theme_minimal() +
  theme(
    plot.background = element_rect(fill = "white", colour = "white"),
    panel.background = element_rect(fill = "white", colour = "white"),
    panel.grid = element_blank(),
    axis.text = element_blank(),
    axis.ticks = element_blank(),
    plot.title = element_text(hjust = 0.5)
  ) +
  scale_fill_gradient("Proportion", low = "#10bed2", high = "#deff00") +
  ggtitle(
    # Explicit line break, so the second line carries no stray indentation
    label = paste(
      "Proportion of people who reported using home remedies",
      "to treat a health problem",
      sep = "\n"
    ),
    subtitle = "Colombia"
  )