One challenge when working with electronic healthcare (EHR) data from different hospitals, which is accentuated if the intensive care units (ICUs) collecting the data are located in different geographic regions, is the use of different measurement unit systems. In part, this can be attributed to the use of imperial units in English speaking countries (e.g. patient weight being reported in lbs instead of kg), but more subtle differences in practice are involved as well, such as reporting lab test results in mg/dL instead of the SI unit mmol/L. While discrepancies of the former type are easy to resolve, harmonization of different notions of concentration is slightly more involved due to the conversion factor being substance-specific.
ricu
data conceptsData concepts of type num_cncpt
can specify an expected
units as character vector (where the string in position 1 is added as
per-column attribute units
). When using concepts of this
type, unit conversion to the specified target unit is not handled
automatically but is up to the user via a callback function. Unit
mismatches are reported as messages during concept loading1.
(dat <- load_concepts(c("lact", "map"), "mimic_demo"))
#> ── Loading 2 concepts ──────────────────────────────────────────────────────
#> • lact
#> • map
#> ◯ removed 22 (0.14%) of rows due to `NA` values
#> ◯ removed 13 (0.08%) of rows due to out of range entries
#> ────────────────────────────────────────────────────────────────────────────
#> # A `ts_tbl`: 13,115 ✖ 4
#> # Id var: `icustay_id`
#> # Units: `lact` [mmol/L], `map` [mmHg]
#> # Index var: `charttime` (1 hours)
#> icustay_id charttime lact map
#> <int> <drtn> <dbl> <dbl>
#> 1 201006 -58 hours 1.7 NA
#> 2 201006 -10 hours 1.8 NA
#> 3 201006 0 hours 2.2 82
#> 4 201006 1 hours NA 88.8
#> 5 201006 2 hours NA 91.5
#> …
#> 13,111 298685 314 hours NA 76
#> 13,112 298685 315 hours NA 58
#> 13,113 298685 316 hours NA 47
#> 13,114 298685 317 hours NA 35
#> 13,115 298685 318 hours NA 12
#> # ℹ 13,110 more rows
Messages raised during loading of lactate values, for example, indicate that 0.62% of retrieved values are specified in mEq/L instead of mmol/L (which requires an identity transformation for unit conversion), while the remaining discrepancies are false positives (both mmole/L and MMOLL can be assumed to mean mmol/L). For mean arterial bloop pressure values, the target unit is specified as mmHg (with the alternative spelling mm Hg being accepted as well), however, due to the data organization in eICU2, no explicit measurement units are specified for this variable, in turn causing the large percentage of missing unit values reported.
Several utility functions are exported from ricu
for
helping with creating callback functions that handle unit conversion.
Data items corresponding to the bilirubin concept for the European
datasets HiRID and AUMCdb, for example, have a callback entry specified
as convert_unit(binary_op(`*`, 0.058467), "mg/dL")
. This
creates a callback functions which applies
binary_op(`*`, 0.058467)
to the column specified as
val_var
and replaces existing values in the column
identified by unit_var
with the value "mg/dl"
.
In case the loaded data already is comprised of a mix of units, a
regular expression passed as rgx
can be specified, which
will be used to identify the rows on which to operate. Finally, the
function binary_op
turns a binary function into an unary
function by fixing the second operand.
As a first sanity check we will slightly modify data loading in order
to be warned about item IDs that do not appear in the data. For this we
chop up data items loaded from the dictionary such that items of type
sel_itm
which may contain several IDs, are split into
separate items and regular expressions in rgx_itm
items are
taken apart such that aggregate expressions such as
(foo|bar)baz
are turned into foobaz
and
barbaz
.
Next, for the actual data loading, report_empty()
substitutes the internally called function do_itm_load()
with a modified version that takes note of the offending IDs alongside
concept, table and data source names whenever zero rows are returned for
a given itm
object.
demo <- c("mimic_demo", "eicu_demo")
concepts <- c("map", "lact", "bili", "gcs", "abx")
dict <- load_dictionary(demo, concepts)
empty_items <- report_empty(dict, merge = FALSE, verbose = FALSE)
#> Tracing function "do_itm_load" in package "ricu"
#> Untracing function "do_itm_load" in package "ricu"
If no table is printed given the modified data loading above, every
single ID that is part of a sel_itm
or an
rgx_itm
actually returns some data. For other types of data
items this means that for every single item as a whole, some data was
returned. There are limitations to this type of sanity check though: It
might be the case that one of the supplied IDs associated with a
num_cncpt
concept returns data in an unexpected unit of
measurement which may cause the range filter to remove all of that data
again. In such a scenario though, this will be reported (if
TRUE
is passed as verbose
argument to
load_concept()
). Paying attention to the output produced by
load_concept()
should help spot such issues, albeit no
longer at item resolution but only at concept level.
Next, we will investigate the number of measurements available per concept and stay day. For each stay ID and concept we calculate the number of measurements and note the stay duration. From this we can visualize how the number of measurements per day is distributed over the datasets alongside the percentage of patients that have at least one measurement available.
count_meas <- function(x) {
x[!is.na(get(data_var(x))), list(count = .N), by = c(id_vars(x))]
}
meas_day <- function(x, los) {
merge(x, los)[, count := count / los_icu]
}
quants <- function(x) {
setNames(
as.list(quantile(x, c(0.05, 0.25, 0.5, 0.75, 0.95))),
c("min", "lwr", "med", "upr", "max")
)
}
meas_stats <- function(x, concept) {
x[, c(list(concept = concept, n_pat = .N), quants(count / los_icu)),
by = "source"]
}
srcs <- c("mimic", "eicu", "aumc", "hirid", "miiv")
los <- load_concepts("los_icu", srcs, verbose = FALSE)
los <- los[los_icu > 0, ]
concepts <- c("map", "lact", "crea", "bili", "plt")
dat <- load_concepts(concepts, srcs, merge = FALSE, verbose = FALSE)
counts <- lapply(dat, count_meas)
counts <- lapply(counts, merge, los)
counts <- Map(meas_stats, counts, names(counts))
counts <- do.call(rbind, counts)
counts <- merge(counts, los[, list(total_pat = .N), by = "source"],
by = "source")
head(counts)
#> source concept n_pat min lwr med upr
#> 1: aumc map 21627 13.7840745 21.0645481 22.8242075 23.648350
#> 2: aumc lact 17457 0.1910777 0.9028213 2.1686747 4.652666
#> 3: aumc crea 22930 1.1130736 1.7539997 2.8533686 3.850267
#> 4: aumc bili 12789 0.2482502 0.4707763 0.9436435 1.518987
#> 5: aumc plt 22938 1.1383050 1.8613927 3.2129197 5.366460
#> 6: eicu map 190613 9.2604502 19.9282419 22.7586207 23.762376
#> max total_pat
#> 1: 24.281525 23103
#> 2: 10.573116 23103
#> 3: 7.959834 23103
#> 4: 3.992237 23103
#> 5: 8.888889 23103
#> 6: 24.881210 199646
Finally, we compare the densities we obtain by looking concept values per dataset, as visualized in the following plot.
When extending the ricu
dictionary to both new data
sources and new data concepts, it might be worthwhile to visually
inspect the returned data in a fashion similar to the above in order to
have a high-level confirmation that measurement units roughly agree.
Recently, experimental support for automatic unit
conversion has been added via the unt_cncpt
class. Such a
concept will attempt to convert data columns using the units package and
therefore requires that both the source- and target-units are recognized
and convertible (see units::ud_are_convertible()
).↩︎
Both the vitalperiodic
and
vitalaperiodic
tables in eICU are layed out in
wide format (whereas most other tables are in long
format), and therefore no unit columns are available. This also explains
the substantial degree of missingness (in terms of values) reported, as
such a wide data organization scheme coupled with differing
measurement intervals per variable will inevitably lead to some degree
of sparsity.↩︎