Version 2.0 of the Ultimate Microarray Prediction, Inference, and Reality Engine (Umpire) extends the functions of the Umpire 1.0 R package to allow researchers to simulate realistic, mixed-type, clinical data. Statisticians, computer scientists, and clinical informaticians who develop and improve methods to analyze clinical data from a variety of contexts (including clinical trials, population cohorts, and electronic medical record sources) recognize that it is difficult to evaluate methods on real data where “ground truth” is unknown. Frequently, they turn to simulations where the can control the underlying structure, which can result in simulations which are too simplistic to reflect complex clinical data realities. Clinical measurements on patients may be treated as independent, in spite of the elaborate correlation structures that arise in networks, pathways, organ systems, and syndromes in real biology. Further, the researcher finds limited tools at her disposal to facilitate simulation of binary, categorical, or mixed data at this representative level of biological complexity.
In this vignette, we describe a workflow with the Umpire package to simulate biologically realistic, mixed-type clinical data.
As usual, we start by loading the package:
Since we are going to run simulations, for reproducibility purposes, we should set the seed of the random number generator.
The simulation workflow begins by simulating complex, correlated, continuous data with known “ground truth” by instantiating a ClinicalEngine. We simulate 20 features and 4 clusters of unequal size. The ClinicalEngine generates subtypes (clusters) with known “ground truth” through an implementation of the Umpire 1.0 CancerModel and CancerEngine.
## A 'CancerEngine' using the cancer model:
## --------------
## Clinical Simulation Model (Raw), a CancerModel object constructed via:
## CancerModel(name = "Clinical Simulation Model (Raw)", nPossible = NP,
## nPattern = nClusters, HIT = hitfn, SURV = SURV, OUT = OUT,
## survivalModel = survivalModel, prevalence = Prevalence(isWeighted,
## nClusters))
##
## Pattern prevalences:
## [1] 0.3173905 0.1315154 0.1265533 0.4245408
##
## Survival effects:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.63510 -0.40700 -0.24181 -0.21561 -0.01502 0.21166
##
## Outcome effects:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.228756 -0.080970 -0.035408 -0.009743 0.116928 0.124044
## --------------
##
## Base expression given by:
## An Engine with 10 components.
##
## Altered expression given by:
## An Engine with 10 components.
Note that the prevalences are not equal; when you use isweighted = TRUE, they are chosen from a Dirichlet distribution. Note also that the summary function describes the object as a CancerEngine, since the same underlying structure is used to implement a ClinicalEngine.
Now we confirm that the model expects to produce the 20 features that we requested. It will do so using 10 “components”, where each component consists of a pair of correlated features.
## [1] 20
## [1] 10
The ClinicalEngine is used to simulate the raw, base dataset.
Data are simulated as a list with two objects: simulated data and associated clinical information, including “ground truth” subtype membership and survival data (outcome, length of followup, and occurrence of event of interest within the followup period).
## [1] "list"
## [1] "clinical" "data"
## CancerSubType Outcome LFU Event
## Min. :1.000 Bad :151 Min. : 0.00 Mode :logical
## 1st Qu.:1.000 Good:149 1st Qu.:19.75 FALSE:197
## Median :3.000 Median :31.50 TRUE :103
## Mean :2.717 Mean :33.81
## 3rd Qu.:4.000 3rd Qu.:49.00
## Max. :4.000 Max. :71.00
The raw data are simulated as a matrix of continuous values.
## [1] "matrix" "array"
## [1] 20 300
The user may add further additive noise to the raw data. The ClinicalNoiseModel simulates additive noise for each feature f and patient i as a normal distribution \(E_{fi} \sim N(0, \tau)\) , where the standard deviation \(\tau\) varies with a hyperparameter along the gamma distribution \(\tau \sim Gamma(shape, scale)\). Thus, the ClinicalNoiseModel generates many features with low noise (such as a tightly calibrated laboratory test) and some features with high noise (such as a blood pressure measured by hand and manually entered into the medical record.) The user may apply default parameters or individual parameters. Next, the ClinicalNoiseModel is applied to blur the previously simulated data. The default model below generates a low overall level of additive noise.
## A 'NoiseModel' with:
## additive offset = 0
## additive scale distributed as:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00601 0.03036 0.05852 0.06958 0.09856 0.18595
## multiplicative scale = 0
Umpire 2.0 allows the simulation of binary, nominal, and ordinal data from raw, continuous data in variable, user-defined mixtures. The user defines prevalences, summing to 1, of binary, continuous, and categorical data in the desired final mixture. For categorical features, the user may tune the percent of categorical data desired to be nominal and the range of the number of categories to be simulated.
The data simulated above by the ClinicalEngine and ClinicalNoiseModel takes rows (not columns) as features, as an omics convention. Thus, by default, when generating data, rows are treated as features and columns as patients. The makeDataTypes method transposes its results to a data frame where the columns are features and the rows are patients. This transposition both fits better with the conventions used for clinical data, but also supports the ability to store different kinds of (mixed-type) data in different columns.
dt <- makeDataTypes(dset$data,
pCont = 1/3, pBin = 1/3, pCat = 1/3,
pNominal = 0.5, range = 3:9,
inputRowsAreFeatures = TRUE)
names(dt)
## [1] "binned" "cutpoints"
The makeDataTypes function generates a list containing two objects: a data.frame of mixed-type data…
## [1] "data.frame"
## [1] 300 20
## V1 V2 V3 V4 V5
## R:52 Min. :5.191 Min. :0.0000 Min. :0.0000 Min. :0.00
## S:55 1st Qu.:6.490 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00
## T:32 Median :6.971 Median :0.0000 Median :0.0000 Median :1.00
## U:59 Mean :6.965 Mean :0.4133 Mean :0.4067 Mean :0.66
## V:50 3rd Qu.:7.400 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.00
## W:52 Max. :8.792 Max. :1.0000 Max. :1.0000 Max. :1.00
##
## V6 V7 V8 V9 V10
## Min. :1.923 Min. :5.454 Min. :3.563 Min. :0.00 X :53
## 1st Qu.:3.578 1st Qu.:6.566 1st Qu.:4.902 1st Qu.:0.00 V :48
## Median :4.024 Median :7.071 Median :5.378 Median :0.00 Y :38
## Mean :4.054 Mean :7.065 Mean :5.300 Mean :0.19 S :36
## 3rd Qu.:4.502 3rd Qu.:7.540 3rd Qu.:5.679 3rd Qu.:0.00 T :36
## Max. :5.795 Max. :8.859 Max. :6.873 Max. :1.00 W :34
## (Other):55
## V11 V12 V13 V14 V15 V16
## R:40 Min. :2.797 Min. :0.00 Min. :0.0 G :55 Min. :0.0
## S:48 1st Qu.:4.086 1st Qu.:0.00 1st Qu.:0.0 E :42 1st Qu.:0.0
## T:57 Median :4.460 Median :0.00 Median :0.0 F :41 Median :1.0
## U:39 Mean :4.440 Mean :0.07 Mean :0.1 D :38 Mean :0.7
## V:56 3rd Qu.:4.825 3rd Qu.:0.00 3rd Qu.:0.0 A :34 3rd Qu.:1.0
## W:60 Max. :5.798 Max. :1.00 Max. :1.0 C :33 Max. :1.0
## (Other):57
## V17 V18 V19 V20
## Min. :2.606 Min. :4.993 A:52 Min. :2.356
## 1st Qu.:4.053 1st Qu.:6.608 B:30 1st Qu.:4.796
## Median :4.528 Median :7.179 C:40 Median :5.529
## Mean :4.515 Mean :7.123 D:38 Mean :5.505
## 3rd Qu.:4.912 3rd Qu.:7.608 E:47 3rd Qu.:6.190
## Max. :6.610 Max. :8.738 F:48 Max. :8.570
## G:45
The cutpoints contain a record, for each feature, of data type, break points, and labels. Here are two examples of the kind of information stored for a cutpoint.
## $breaks
## 0% 10.55853% 28.88869% 48.54916% 66.07294% 82.94035% 100%
## -Inf 4.055293 4.728019 5.149966 5.491690 5.900582 Inf
##
## $labels
## [1] "T" "S" "U" "W" "V" "R"
##
## $Type
## [1] "nominal"
## $breaks
## [1] -Inf 7.35767 Inf
##
## $labels
## [1] 0 1
##
## $Type
## [1] "symmetric binary"
And here is an overview of the number of features of each type.
## type
## asymmetric binary continuous nominal ordinal
## 1 8 3 2
## symmetric binary
## 6
The cupoitns should be saved for downstream use in the MixedTypeEngine.
The many parameters defining a simulated data mixture can be stored as a single MixedTypeEngine for downstream use to easily generate future datasets with the same simulation parameters.
The MixedTypeEngine stores the following components for re-implementation:
## A 'MixedTypeEngine' (MTE) based on:
## A 'CancerEngine' using the cancer model:
## --------------
## Clinical Simulation Model (Raw), a CancerModel object constructed via:
## CancerModel(name = "Clinical Simulation Model (Raw)", nPossible = NP,
## nPattern = nClusters, HIT = hitfn, SURV = SURV, OUT = OUT,
## survivalModel = survivalModel, prevalence = Prevalence(isWeighted,
## nClusters))
##
## Pattern prevalences:
## [1] 0.3173905 0.1315154 0.1265533 0.4245408
##
## Survival effects:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.63510 -0.40700 -0.24181 -0.21561 -0.01502 0.21166
##
## Outcome effects:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.228756 -0.080970 -0.035408 -0.009743 0.116928 0.124044
## --------------
##
## Base expression given by:
## An Engine with 10 components.
##
## Altered expression given by:
## An Engine with 10 components.
##
## ---------------
## The MTE uses the following noise model:
## A 'NoiseModel' with:
## additive offset = 0
## additive scale distributed as:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00601 0.03036 0.05852 0.06958 0.09856 0.18595
## multiplicative scale = 0
## ---------------
## The MTE simulates clinical data of these types:
##
## asymmetric binary continuous nominal ordinal
## 1 8 3 2
## symmetric binary
## 6
With rand, the user can easily generate new data sets with the same simulation parameters.
## [1] "list"
## Length Class Mode
## 0 NULL NULL
## CancerSubType Outcome LFU Event
## Min. :1.0 Bad :14 Min. : 7.00 Mode :logical
## 1st Qu.:2.0 Good: 6 1st Qu.:18.75 FALSE:13
## Median :3.5 Median :37.00 TRUE :7
## Mean :2.9 Mean :37.35
## 3rd Qu.:4.0 3rd Qu.:54.75
## Max. :4.0 Max. :70.00
By using the keepal argument othe function, you can keep the intermediate datasets produced by the rand method.
## [1] "list"
## [1] "raw" "clinical" "noisy" "binned"
The raw and noisy elements have the rows as (future clinical) features and the columns as patients/samples.
## [1] 20 25
## V1 V2 V3 V4
## Min. :4.028 Min. :5.816 Min. :1.901 Min. : 8.537
## 1st Qu.:4.535 1st Qu.:6.560 1st Qu.:2.348 1st Qu.: 9.466
## Median :5.425 Median :6.903 Median :3.227 Median :10.173
## Mean :5.344 Mean :6.967 Mean :3.393 Mean :10.087
## 3rd Qu.:5.893 3rd Qu.:7.295 3rd Qu.:4.194 3rd Qu.:10.607
## Max. :7.708 Max. :8.818 Max. :5.475 Max. :11.558
## V5 V6 V7 V8
## Min. :6.180 Min. :2.825 Min. :6.297 Min. :4.460
## 1st Qu.:6.920 1st Qu.:3.395 1st Qu.:6.818 1st Qu.:4.967
## Median :7.402 Median :3.797 Median :7.160 Median :5.058
## Mean :7.366 Mean :3.894 Mean :7.207 Mean :5.194
## 3rd Qu.:7.798 3rd Qu.:4.177 3rd Qu.:7.574 3rd Qu.:5.448
## Max. :8.676 Max. :5.618 Max. :8.578 Max. :6.076
## V9 V10 V11 V12
## Min. :5.222 Min. :3.932 Min. :4.832 Min. :3.785
## 1st Qu.:6.182 1st Qu.:4.690 1st Qu.:6.019 1st Qu.:4.342
## Median :6.739 Median :5.504 Median :6.374 Median :4.566
## Mean :6.704 Mean :5.418 Mean :6.354 Mean :4.762
## 3rd Qu.:7.039 3rd Qu.:5.760 3rd Qu.:6.648 3rd Qu.:5.125
## Max. :8.556 Max. :6.929 Max. :8.148 Max. :6.479
## V13 V14 V15 V16
## Min. :6.050 Min. :6.125 Min. :4.136 Min. : 8.014
## 1st Qu.:7.135 1st Qu.:6.816 1st Qu.:5.638 1st Qu.: 8.626
## Median :7.417 Median :7.232 Median :5.940 Median : 9.244
## Mean :7.434 Mean :7.261 Mean :6.088 Mean : 9.071
## 3rd Qu.:7.879 3rd Qu.:7.779 3rd Qu.:6.558 3rd Qu.: 9.471
## Max. :8.551 Max. :8.156 Max. :7.543 Max. :10.093
## V17 V18 V19 V20
## Min. :3.173 Min. :6.379 Min. :6.508 Min. :3.353
## 1st Qu.:4.296 1st Qu.:6.945 1st Qu.:6.940 1st Qu.:5.250
## Median :4.825 Median :7.320 Median :7.200 Median :5.704
## Mean :4.745 Mean :7.391 Mean :7.217 Mean :5.511
## 3rd Qu.:5.095 3rd Qu.:7.713 3rd Qu.:7.461 3rd Qu.:5.980
## Max. :6.312 Max. :8.652 Max. :7.832 Max. :6.405
## [1] 25 20
## V1 V2 V3 V4
## Min. : 2.287 Min. : 2.790 Min. : 3.404 Min. : 2.941
## 1st Qu.: 5.653 1st Qu.: 5.500 1st Qu.: 5.874 1st Qu.: 4.558
## Median : 6.501 Median : 6.554 Median : 6.646 Median : 6.218
## Mean : 6.526 Mean : 6.454 Mean : 6.631 Mean : 5.930
## 3rd Qu.: 7.464 3rd Qu.: 7.159 3rd Qu.: 7.208 3rd Qu.: 6.799
## Max. :10.206 Max. :10.271 Max. :11.206 Max. :10.230
## V5 V6 V7 V8
## Min. : 3.935 Min. : 2.921 Min. :3.263 Min. :3.144
## 1st Qu.: 5.416 1st Qu.: 5.379 1st Qu.:5.225 1st Qu.:5.209
## Median : 6.396 Median : 6.525 Median :6.327 Median :6.762
## Mean : 6.584 Mean : 6.472 Mean :6.236 Mean :6.488
## 3rd Qu.: 7.621 3rd Qu.: 7.611 3rd Qu.:7.206 3rd Qu.:7.431
## Max. :10.188 Max. :10.608 Max. :9.419 Max. :9.205
## V9 V10 V11 V12
## Min. :4.166 Min. : 3.269 Min. :3.477 Min. :3.911
## 1st Qu.:5.557 1st Qu.: 5.045 1st Qu.:5.453 1st Qu.:4.915
## Median :6.553 Median : 5.996 Median :6.502 Median :6.594
## Mean :6.560 Mean : 6.344 Mean :6.422 Mean :6.518
## 3rd Qu.:7.577 3rd Qu.: 7.538 3rd Qu.:7.424 3rd Qu.:7.928
## Max. :9.531 Max. :10.452 Max. :9.666 Max. :9.389
## V13 V14 V15 V16
## Min. :3.563 Min. : 1.941 Min. : 2.194 Min. :2.342
## 1st Qu.:4.991 1st Qu.: 4.889 1st Qu.: 5.451 1st Qu.:5.049
## Median :6.358 Median : 6.504 Median : 6.011 Median :6.771
## Mean :6.305 Mean : 6.221 Mean : 6.317 Mean :6.308
## 3rd Qu.:7.340 3rd Qu.: 7.357 3rd Qu.: 7.125 3rd Qu.:7.561
## Max. :9.488 Max. :10.614 Max. :11.498 Max. :9.884
## V17 V18 V19 V20
## Min. : 2.422 Min. : 3.298 Min. : 2.968 Min. :3.418
## 1st Qu.: 4.938 1st Qu.: 4.864 1st Qu.: 5.692 1st Qu.:5.108
## Median : 6.343 Median : 6.580 Median : 6.632 Median :6.509
## Mean : 6.094 Mean : 6.343 Mean : 6.614 Mean :6.304
## 3rd Qu.: 7.371 3rd Qu.: 7.227 3rd Qu.: 7.567 3rd Qu.:7.192
## Max. :10.057 Max. :10.199 Max. :11.299 Max. :9.389
## V21 V22 V23 V24
## Min. : 2.162 Min. :4.072 Min. :3.012 Min. :3.273
## 1st Qu.: 5.056 1st Qu.:5.482 1st Qu.:4.889 1st Qu.:5.086
## Median : 6.289 Median :6.592 Median :6.095 Median :6.165
## Mean : 6.327 Mean :6.533 Mean :6.270 Mean :6.166
## 3rd Qu.: 7.420 3rd Qu.:7.702 3rd Qu.:7.600 3rd Qu.:6.810
## Max. :11.418 Max. :9.413 Max. :9.530 Max. :9.474
## V25
## Min. : 2.118
## 1st Qu.: 4.941
## Median : 6.422
## Mean : 6.176
## 3rd Qu.: 7.322
## Max. :10.860
Noisy data arises by adding simulated noise to the raw data.
The binned element has columns as features and rows as samples. Binned data arises by applying cut points to noisy data.
## [1] 25 20
## V1 V2 V3 V4 V5 V6 V7
## R:6 Min. :5.878 0:12 0:11 0:12 Min. :2.921 Min. :6.165
## S:7 1st Qu.:6.477 1:13 1:14 1:13 1st Qu.:3.418 1st Qu.:6.821
## T:2 Median :6.918 Median :3.895 Median :7.197
## U:2 Mean :6.959 Mean :3.892 Mean :7.199
## V:5 3rd Qu.:7.247 3rd Qu.:4.306 3rd Qu.:7.658
## W:3 Max. :8.810 Max. :5.739 Max. :8.577
##
## V8 V9 V10 V11 V12 V13 V14
## Min. :4.482 0:19 T :6 R:5 Min. :3.578 0:22 0:22
## 1st Qu.:4.925 1: 6 R :5 S:3 1st Qu.:4.252 1: 3 1: 3
## Median :5.103 X :4 T:2 Median :4.619
## Mean :5.195 S :3 U:4 Mean :4.771
## 3rd Qu.:5.443 V :2 V:7 3rd Qu.:5.154
## Max. :6.113 W :2 W:4 Max. :6.379
## (Other):3
## V15 V16 V17 V18 V19 V20
## D :5 0: 8 Min. :3.156 Min. :6.304 A:3 Min. :3.360
## F :4 1:17 1st Qu.:4.293 1st Qu.:6.895 B:2 1st Qu.:5.263
## G :4 Median :4.838 Median :7.368 C:4 Median :5.701
## A :3 Mean :4.748 Mean :7.388 D:6 Mean :5.496
## B :3 3rd Qu.:5.105 3rd Qu.:7.768 E:6 3rd Qu.:5.971
## H :3 Max. :6.333 Max. :8.630 F:4 Max. :6.361
## (Other):3 G:0