In this vignette, we will explore the OmopSketch functions that provide
information about individuals' characteristics at specific points in
time. We will use summarisePopulationCharacteristics() to generate a
summary of the demographic details within the database population.
Additionally, we will tidy and present the results using
tablePopulationCharacteristics(), which supports either gt or flextable
for formatting the output.
Before we dive into the OmopSketch functions, we first need to load the essential packages and create a mock CDM using the Eunomia database.
library(dplyr)
library(CDMConnector)
library(DBI)
library(duckdb)
library(OmopSketch)
# Connect to Eunomia database
con <- DBI::dbConnect(duckdb::duckdb(), CDMConnector::eunomia_dir())
cdm <- CDMConnector::cdmFromCon(
  con = con, cdmSchema = "main", writeSchema = "main"
)
cdm
#>
#> ── # OMOP CDM reference (duckdb) of Synthea synthetic health database ──────────
#> • omop tables: person, observation_period, visit_occurrence, visit_detail,
#> condition_occurrence, drug_exposure, procedure_occurrence, device_exposure,
#> measurement, observation, death, note, note_nlp, specimen, fact_relationship,
#> location, care_site, provider, payer_plan_period, cost, drug_era, dose_era,
#> condition_era, metadata, cdm_source, concept, vocabulary, domain,
#> concept_class, concept_relationship, relationship, concept_synonym,
#> concept_ancestor, source_to_concept_map, drug_strength
#> • cohort tables: -
#> • achilles tables: -
#> • other tables: -
To start, we will use the summarisePopulationCharacteristics() function
to generate a summarised result object, capturing demographic
characteristics at both observation_period_start_date and
observation_period_end_date.
summarisedResult <- summarisePopulationCharacteristics(cdm)
#> ! cohort columns will be reordered to match the expected order:
#> cohort_definition_id, subject_id, cohort_start_date, and cohort_end_date.
#> ℹ Building new trimmed cohort
#> Creating initial cohort
#> ✔ Cohort trimmed
#> ℹ adding demographics columns
#>
#> ℹ summarising data
#>
#> ✔ summariseCharacteristics finished!
#>
#> ! The following column type were changed:
#> • variable_name: from integer to character
summarisedResult |> glimpse()
#> Rows: 49
#> Columns: 13
#> $ result_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ cdm_name <chr> "Synthea synthetic health database", "Synthea synthet…
#> $ group_name <chr> "cohort_name", "cohort_name", "cohort_name", "cohort_…
#> $ group_level <chr> "demographics", "demographics", "demographics", "demo…
#> $ strata_name <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ strata_level <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ variable_name <chr> "Number records", "Number subjects", "Cohort start da…
#> $ variable_level <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ estimate_name <chr> "count", "count", "min", "q25", "median", "q75", "max…
#> $ estimate_type <chr> "integer", "integer", "date", "date", "date", "date",…
#> $ estimate_value <chr> "2694", "2694", "1908-09-22", "1950-07-13", "1961-03-…
#> $ additional_name <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ additional_level <chr> "overall", "overall", "overall", "overall", "overall"…
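Because the summarised result is a long table with one row per estimate, individual estimates can be pulled out with ordinary dplyr verbs. The sketch below is only an illustration: toy_result is a hypothetical, hand-built stand-in that copies a few values from the output in this vignette, so that the example runs without a database connection. The variable names n_subjects and earliest_start are likewise made up for this example.

```r
library(dplyr)

# Hypothetical stand-in for a few rows of summarisedResult;
# the values are copied from the output shown in this vignette.
toy_result <- tibble(
  variable_name  = c("Number records", "Number subjects",
                     "Cohort start date", "Cohort start date"),
  estimate_name  = c("count", "count", "min", "max"),
  estimate_type  = c("integer", "integer", "date", "date"),
  estimate_value = c("2694", "2694", "1908-09-22", "1986-11-03")
)

# estimate_value is stored as character, so convert it after filtering
n_subjects <- toy_result |>
  filter(variable_name == "Number subjects", estimate_name == "count") |>
  pull(estimate_value) |>
  as.integer()

earliest_start <- toy_result |>
  filter(variable_name == "Cohort start date", estimate_name == "min") |>
  pull(estimate_value) |>
  as.Date()
```

The same filter() calls should apply to the real summarisedResult, since they rely only on the column names visible in the glimpse() output.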
To tidy and display the summarised result using a gt table, we can use
the tablePopulationCharacteristics() function.
summarisedResult |>
  tablePopulationCharacteristics(type = "gt")
#> ! Results have not been suppressed.
| Variable name | Variable level | Estimate name | Database name: Synthea synthetic health database |
|---|---|---|---|
| Number records | - | N | 2,694 |
| Number subjects | - | N | 2,694 |
| Cohort start date | - | Median [Q25 - Q75] | 1961-03-18 [1950-07-13 - 1970-08-29] |
| | | Range | 1908-09-22 to 1986-11-03 |
| Cohort end date | - | Median [Q25 - Q75] | 2018-12-14 [2018-08-02 - 2019-04-06] |
| | | Range | 1945-07-20 to 2019-07-03 |
| Age at start | - | Median [Q25 - Q75] | 0 [0 - 0] |
| | | Mean (SD) | 0.00 (0.00) |
| | | Range | 0 to 0 |
| Age at end | - | Median [Q25 - Q75] | 57 [47 - 67] |
| | | Range | 31 to 110 |
| Sex | Female | N% | 1,373 (50.97) |
| | Male | N% | 1,321 (49.03) |
| Prior observation | - | Median [Q25 - Q75] | 0 [0 - 0] |
| | | Mean (SD) | 0.00 (0.00) |
| | | Range | 0 to 0 |
| Future observation | - | Median [Q25 - Q75] | 20,870 [17,494 - 24,701] |
| | | Mean (SD) | 21,601.60 (5,460.69) |
| | | Range | 11,396 to 40,348 |
| Days in cohort | - | Median [Q25 - Q75] | 20,872 [17,495 - 24,702] |
| | | Mean (SD) | 21,602.60 (5,460.69) |
| | | Range | 11,397 to 40,349 |
To obtain a flextable instead of a gt table, you can simply change the
type argument to "flextable". Additionally, it is important to note that
age at start, prior observation, and future observation are calculated
at the defined start date (in this case, each individual's
observation_period_start_date). On the other hand, age at end is
calculated at the defined end date (i.e., each individual's
observation_period_end_date).
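To make these definitions concrete, here is a minimal base R sketch for a single made-up individual. It is not the package's implementation: the dates, the completed_years() helper, and the simple "completed years" age rule are assumptions chosen for illustration. Prior observation is taken as the days from observation start to the index date, and future observation as the days from the index date to observation end.

```r
# Hypothetical individual; all dates are made up for illustration
birth_date        <- as.Date("1960-05-10")
observation_start <- as.Date("1960-05-10")
observation_end   <- as.Date("2018-12-14")

# Hypothetical helper: completed years between two dates
completed_years <- function(from, to) {
  from_lt <- as.POSIXlt(from)
  to_lt   <- as.POSIXlt(to)
  years   <- to_lt$year - from_lt$year
  # Subtract one year if the anniversary has not been reached yet
  before_anniversary <- (to_lt$mon < from_lt$mon) |
    (to_lt$mon == from_lt$mon & to_lt$mday < from_lt$mday)
  years - as.integer(before_anniversary)
}

# With no study period, the index date is the observation start
index_date <- observation_start

age_at_start       <- completed_years(birth_date, index_date)       # 0
age_at_end         <- completed_years(birth_date, observation_end)  # 58
prior_observation  <- as.integer(index_date - observation_start)    # 0
future_observation <- as.integer(observation_end - index_date)
days_in_cohort     <- future_observation + 1L  # both end points counted
```

The one-day offset between future observation and days in cohort matches the tables in this vignette, where the mean of "Days in cohort" is exactly one more than the mean of "Future observation" whenever the cohort ends at the observation end date.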
To focus on a specific period within the observation data, rather than
analysing each individual's entire observation period, we can trim the
study period by using the studyPeriod argument. This allows us to
analyse the demographic metrics within a defined time range rather than
between the default observation start and end dates.
summarisePopulationCharacteristics(
  cdm,
  studyPeriod = c("1950-01-01", "1999-12-31")
) |>
  tablePopulationCharacteristics()
#> ! cohort columns will be reordered to match the expected order:
#> cohort_definition_id, subject_id, cohort_start_date, and cohort_end_date.
#> ℹ Building new trimmed cohort
#> Creating initial cohort
#> ✔ Cohort trimmed
#> ℹ adding demographics columns
#>
#> ℹ summarising data
#>
#> ✔ summariseCharacteristics finished!
#>
#> ! The following column type were changed:
#> • variable_name: from integer to character
#> ! Results have not been suppressed.
| Variable name | Variable level | Estimate name | Database name: Synthea synthetic health database |
|---|---|---|---|
| Number records | - | N | 2,693 |
| Number subjects | - | N | 2,693 |
| Cohort start date | - | Median [Q25 - Q75] | 1961-03-19 [1950-07-22 - 1970-08-30] |
| | | Range | 1950-01-01 to 1986-11-03 |
| Cohort end date | - | Median [Q25 - Q75] | 1999-12-31 [1999-12-31 - 1999-12-31] |
| | | Range | 1961-02-26 to 1999-12-31 |
| Age at start | - | Median [Q25 - Q75] | 0 [0 - 0] |
| | | Mean (SD) | 3.31 (8.32) |
| | | Range | 0 to 41 |
| Age at end | - | Median [Q25 - Q75] | 38 [29 - 49] |
| | | Range | 13 to 91 |
| Sex | Female | N% | 1,372 (50.95) |
| | Male | N% | 1,321 (49.05) |
| Prior observation | - | Median [Q25 - Q75] | 0 [0 - 0] |
| | | Mean (SD) | 1,252.46 (3,094.66) |
| | | Range | 0 to 15,076 |
| Future observation | - | Median [Q25 - Q75] | 20,489 [17,290 - 23,961] |
| | | Mean (SD) | 20,352.30 (3,799.83) |
| | | Range | 4,074 to 25,383 |
| Days in cohort | - | Median [Q25 - Q75] | 14,092 [10,684 - 17,762] |
| | | Mean (SD) | 13,798.73 (3,730.66) |
| | | Range | 4,075 to 18,262 |
However, if you are interested in analysing the demographic
characteristics starting from a specific date without restricting the
study end, you can define just the start of the study period. When the
end date is not defined, the summarisePopulationCharacteristics()
function will by default use the observation_period_end_date to
calculate the end-point statistics.
summarisePopulationCharacteristics(
  cdm,
  studyPeriod = c("1950-01-01", NA)
) |>
  tablePopulationCharacteristics()
#> ! cohort columns will be reordered to match the expected order:
#> cohort_definition_id, subject_id, cohort_start_date, and cohort_end_date.
#> ℹ Building new trimmed cohort
#> Creating initial cohort
#> ✔ Cohort trimmed
#> ℹ adding demographics columns
#>
#> ℹ summarising data
#>
#> ✔ summariseCharacteristics finished!
#>
#> ! The following column type were changed:
#> • variable_name: from integer to character
#> ! Results have not been suppressed.
| Variable name | Variable level | Estimate name | Database name: Synthea synthetic health database |
|---|---|---|---|
| Number records | - | N | 2,693 |
| Number subjects | - | N | 2,693 |
| Cohort start date | - | Median [Q25 - Q75] | 1961-03-19 [1950-07-22 - 1970-08-30] |
| | | Range | 1950-01-01 to 1986-11-03 |
| Cohort end date | - | Median [Q25 - Q75] | 2018-12-14 [2018-08-03 - 2019-04-06] |
| | | Range | 1961-02-26 to 2019-07-03 |
| Age at start | - | Median [Q25 - Q75] | 0 [0 - 0] |
| | | Mean (SD) | 3.31 (8.32) |
| | | Range | 0 to 41 |
| Age at end | - | Median [Q25 - Q75] | 57 [47 - 67] |
| | | Range | 31 to 110 |
| Sex | Female | N% | 1,372 (50.95) |
| | Male | N% | 1,321 (49.05) |
| Prior observation | - | Median [Q25 - Q75] | 0 [0 - 0] |
| | | Mean (SD) | 1,252.46 (3,094.66) |
| | | Range | 0 to 15,076 |
| Future observation | - | Median [Q25 - Q75] | 20,489 [17,290 - 23,961] |
| | | Mean (SD) | 20,352.30 (3,799.83) |
| | | Range | 4,074 to 25,383 |
| Days in cohort | - | Median [Q25 - Q75] | 20,490 [17,291 - 23,962] |
| | | Mean (SD) | 20,353.30 (3,799.83) |
| | | Range | 4,075 to 25,384 |
Similarly, if you are only interested in analysing the population
characteristics up to a specific end date, you can define only the end
date and set the start of the study period to NA in the studyPeriod
argument. By default, the observation_period_start_date will be used.
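The effect of a full or partly open study period can be pictured with a small base R sketch. This is an assumed trimming rule inferred from the tables above (clip each observation period to the study period and drop periods that do not overlap it), not the package's actual code. The helper trim_to_study_period() and the three individuals are hypothetical.

```r
# Hypothetical helper: clip observation periods to a study period.
# An NA bound means "do not restrict on that side".
trim_to_study_period <- function(periods, studyPeriod) {
  study_start <- as.Date(studyPeriod[1])
  study_end   <- as.Date(studyPeriod[2])

  start <- periods$observation_period_start_date
  end   <- periods$observation_period_end_date

  # Move starts forward and ends backward so they fall inside the study period
  if (!is.na(study_start)) start[start < study_start] <- study_start
  if (!is.na(study_end))   end[end > study_end]       <- study_end

  trimmed <- data.frame(
    person_id         = periods$person_id,
    cohort_start_date = start,
    cohort_end_date   = end
  )
  # Periods that do not overlap the study period are dropped
  trimmed[trimmed$cohort_start_date <= trimmed$cohort_end_date, ]
}

# Made-up observation periods for three individuals
periods <- data.frame(
  person_id = 1:3,
  observation_period_start_date = as.Date(c("1908-09-22", "1961-03-18", "1986-11-03")),
  observation_period_end_date   = as.Date(c("1945-07-20", "2018-12-14", "2019-07-03"))
)

both_bounds <- trim_to_study_period(periods, c("1950-01-01", "1999-12-31"))
start_only  <- trim_to_study_period(periods, c("1950-01-01", NA))
end_only    <- trim_to_study_period(periods, c(NA, "1999-12-31"))
```

In this toy data, the first individual's observation ends before 1950, so they are dropped whenever the study start is set. That would be consistent with the record count falling from 2,694 to 2,693 in the tables above.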
Population characteristics can also be estimated by stratifying the
data based on age and sex using the ageGroup and sex arguments.
summarisePopulationCharacteristics(
  cdm,
  sex = TRUE,
  ageGroup = list("<60" = c(0, 59), ">=60" = c(60, Inf))
) |>
  tablePopulationCharacteristics()
#> ! cohort columns will be reordered to match the expected order:
#> cohort_definition_id, subject_id, cohort_start_date, and cohort_end_date.
#> ℹ Building new trimmed cohort
#> Creating initial cohort
#> ✔ Cohort trimmed
#> ℹ adding demographics columns
#>
#> ℹ summarising data
#>
#> ✔ summariseCharacteristics finished!
#>
#> ! The following column type were changed:
#> • variable_name: from integer to character
#> ! Results have not been suppressed.
| Variable name | Variable level | Estimate name | Database name: Synthea synthetic health database |
|---|---|---|---|
| overall; overall | | | |
| Number records | - | N | 2,694 |
| Number subjects | - | N | 2,694 |
| Cohort start date | - | Median [Q25 - Q75] | 1961-03-18 [1950-07-13 - 1970-08-29] |
| | | Range | 1908-09-22 to 1986-11-03 |
| Cohort end date | - | Median [Q25 - Q75] | 2018-12-14 [2018-08-02 - 2019-04-06] |
| | | Range | 1945-07-20 to 2019-07-03 |
| Age at start | - | Median [Q25 - Q75] | 0 [0 - 0] |
| | | Mean (SD) | 0.00 (0.00) |
| | | Range | 0 to 0 |
| Age at end | - | Median [Q25 - Q75] | 57 [47 - 67] |
| | | Range | 31 to 110 |
| Sex | Female | N% | 1,373 (50.97) |
| | Male | N% | 1,321 (49.03) |
| Prior observation | - | Median [Q25 - Q75] | 0 [0 - 0] |
| | | Mean (SD) | 0.00 (0.00) |
| | | Range | 0 to 0 |
| Future observation | - | Median [Q25 - Q75] | 20,870 [17,494 - 24,701] |
| | | Mean (SD) | 21,601.60 (5,460.69) |
| | | Range | 11,396 to 40,348 |
| Days in cohort | - | Median [Q25 - Q75] | 20,872 [17,495 - 24,702] |
| | | Mean (SD) | 21,602.60 (5,460.69) |
| | | Range | 11,397 to 40,349 |
| <60; overall | | | |
| Number records | - | N | 2,694 |
| Number subjects | - | N | 2,694 |
| Cohort start date | - | Median [Q25 - Q75] | 1961-03-18 [1950-07-13 - 1970-08-29] |
| | | Range | 1908-09-22 to 1986-11-03 |
| Cohort end date | - | Median [Q25 - Q75] | 2018-12-14 [2018-08-02 - 2019-04-06] |
| | | Range | 1945-07-20 to 2019-07-03 |
| Age at start | - | Median [Q25 - Q75] | 0 [0 - 0] |
| | | Mean (SD) | 0.00 (0.00) |
| | | Range | 0 to 0 |
| Age at end | - | Median [Q25 - Q75] | 57 [47 - 67] |
| | | Range | 31 to 110 |
| Sex | Female | N% | 1,373 (50.97) |
| | Male | N% | 1,321 (49.03) |
| Prior observation | - | Median [Q25 - Q75] | 0 [0 - 0] |
| | | Mean (SD) | 0.00 (0.00) |
| | | Range | 0 to 0 |
| Future observation | - | Median [Q25 - Q75] | 20,870 [17,494 - 24,701] |
| | | Mean (SD) | 21,601.60 (5,460.69) |
| | | Range | 11,396 to 40,348 |
| Days in cohort | - | Median [Q25 - Q75] | 20,872 [17,495 - 24,702] |
| | | Mean (SD) | 21,602.60 (5,460.69) |
| | | Range | 11,397 to 40,349 |
| overall; Female | | | |
| Number records | - | N | 1,373 |
| Number subjects | - | N | 1,373 |
| Cohort start date | - | Median [Q25 - Q75] | 1961-05-13 [1950-08-09 - 1971-01-04] |
| | | Range | 1908-09-22 to 1986-04-17 |
| Cohort end date | - | Median [Q25 - Q75] | 2018-12-18 [2018-08-12 - 2019-04-07] |
| | | Range | 1945-07-20 to 2019-07-01 |
| Age at start | - | Median [Q25 - Q75] | 0 [0 - 0] |
| | | Mean (SD) | 0.00 (0.00) |
| | | Range | 0 to 0 |
| Age at end | - | Median [Q25 - Q75] | 57 [47 - 67] |
| | | Range | 31 to 110 |
| Sex | Female | N% | 1,373 (100.00) |
| Prior observation | - | Median [Q25 - Q75] | 0 [0 - 0] |
| | | Mean (SD) | 0.00 (0.00) |
| | | Range | 0 to 0 |
| Future observation | - | Median [Q25 - Q75] | 20,860 [17,381 - 24,682] |
| | | Mean (SD) | 21,665.77 (5,623.53) |
| | | Range | 11,396 to 40,348 |
| Days in cohort | - | Median [Q25 - Q75] | 20,861 [17,382 - 24,683] |
| | | Mean (SD) | 21,666.77 (5,623.53) |
| | | Range | 11,397 to 40,349 |
| overall; Male | | | |
| Number records | - | N | 1,321 |
| Number subjects | - | N | 1,321 |
| Cohort start date | - | Median [Q25 - Q75] | 1961-01-23 [1950-04-13 - 1970-04-19] |
| | | Range | 1909-02-14 to 1986-11-03 |
| Cohort end date | - | Median [Q25 - Q75] | 2018-12-09 [2018-07-26 - 2019-04-03] |
| | | Range | 1967-02-18 to 2019-07-03 |
| Age at start | - | Median [Q25 - Q75] | 0 [0 - 0] |
| | | Mean (SD) | 0.00 (0.00) |
| | | Range | 0 to 0 |
| Age at end | - | Median [Q25 - Q75] | 57 [48 - 67] |
| | | Range | 31 to 109 |
| Sex | Male | N% | 1,321 (100.00) |
| Prior observation | - | Median [Q25 - Q75] | 0 [0 - 0] |
| | | Mean (SD) | 0.00 (0.00) |
| | | Range | 0 to 0 |
| Future observation | - | Median [Q25 - Q75] | 20,972 [17,556 - 24,703] |
| | | Mean (SD) | 21,534.91 (5,287.44) |
| | | Range | 11,438 to 40,005 |
| Days in cohort | - | Median [Q25 - Q75] | 20,973 [17,557 - 24,704] |
| | | Mean (SD) | 21,535.91 (5,287.44) |
| | | Range | 11,439 to 40,006 |
| <60; Female | | | |
| Number records | - | N | 1,373 |
| Number subjects | - | N | 1,373 |
| Cohort start date | - | Median [Q25 - Q75] | 1961-05-13 [1950-08-09 - 1971-01-04] |
| | | Range | 1908-09-22 to 1986-04-17 |
| Cohort end date | - | Median [Q25 - Q75] | 2018-12-18 [2018-08-12 - 2019-04-07] |
| | | Range | 1945-07-20 to 2019-07-01 |
| Age at start | - | Median [Q25 - Q75] | 0 [0 - 0] |
| | | Mean (SD) | 0.00 (0.00) |
| | | Range | 0 to 0 |
| Age at end | - | Median [Q25 - Q75] | 57 [47 - 67] |
| | | Range | 31 to 110 |
| Sex | Female | N% | 1,373 (100.00) |
| Prior observation | - | Median [Q25 - Q75] | 0 [0 - 0] |
| | | Mean (SD) | 0.00 (0.00) |
| | | Range | 0 to 0 |
| Future observation | - | Median [Q25 - Q75] | 20,860 [17,381 - 24,682] |
| | | Mean (SD) | 21,665.77 (5,623.53) |
| | | Range | 11,396 to 40,348 |
| Days in cohort | - | Median [Q25 - Q75] | 20,861 [17,382 - 24,683] |
| | | Mean (SD) | 21,666.77 (5,623.53) |
| | | Range | 11,397 to 40,349 |
| <60; Male | | | |
| Number records | - | N | 1,321 |
| Number subjects | - | N | 1,321 |
| Cohort start date | - | Median [Q25 - Q75] | 1961-01-23 [1950-04-13 - 1970-04-19] |
| | | Range | 1909-02-14 to 1986-11-03 |
| Cohort end date | - | Median [Q25 - Q75] | 2018-12-09 [2018-07-26 - 2019-04-03] |
| | | Range | 1967-02-18 to 2019-07-03 |
| Age at start | - | Median [Q25 - Q75] | 0 [0 - 0] |
| | | Mean (SD) | 0.00 (0.00) |
| | | Range | 0 to 0 |
| Age at end | - | Median [Q25 - Q75] | 57 [48 - 67] |
| | | Range | 31 to 109 |
| Sex | Male | N% | 1,321 (100.00) |
| Prior observation | - | Median [Q25 - Q75] | 0 [0 - 0] |
| | | Mean (SD) | 0.00 (0.00) |
| | | Range | 0 to 0 |
| Future observation | - | Median [Q25 - Q75] | 20,972 [17,556 - 24,703] |
| | | Mean (SD) | 21,534.91 (5,287.44) |
| | | Range | 11,438 to 40,005 |
| Days in cohort | - | Median [Q25 - Q75] | 20,973 [17,557 - 24,704] |
| | | Mean (SD) | 21,535.91 (5,287.44) |
| | | Range | 11,439 to 40,006 |
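The way a named list of age ranges maps individuals to strata can be sketched in base R. The helper assign_age_group() and the four individuals below are hypothetical and only illustrate the idea; the package may assign groups differently. The sketch assumes that each range is inclusive at both ends and that age is evaluated at the cohort start date.

```r
# Hypothetical helper: label each age with the name of the first
# age group whose [lower, upper] range contains it.
assign_age_group <- function(age, ageGroup) {
  labels <- rep(NA_character_, length(age))
  for (name in names(ageGroup)) {
    bounds <- ageGroup[[name]]
    inside <- is.na(labels) & age >= bounds[1] & age <= bounds[2]
    labels[inside] <- name
  }
  labels
}

ageGroup <- list("<60" = c(0, 59), ">=60" = c(60, Inf))

# Made-up individuals: age at cohort start and sex
people <- data.frame(
  age_at_start = c(0, 0, 41, 63),
  sex          = c("Female", "Male", "Female", "Male")
)
people$age_group <- assign_age_group(people$age_at_start, ageGroup)

# Number of individuals in each age group and sex combination
strata_counts <- table(age_group = people$age_group, sex = people$sex)
```

Under that assumption, a population in which everyone has an age at start of 0 falls entirely into the "<60" group, which would be consistent with the table above containing no ">=60" stratum.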