The purpose of this vignette is to illustrate how one would use the
pval_gen()
function to simulate p-values and then feed
these p-values to the cECM()
function to illustrate the
performance of event categorization techniques. Alternatively, one could
use this vignette to learn how to feed sets of real training p-values
along with newly generated p-values to cECM()
.
First, let’s load the package:
The pval_gen()
function simulates p-values generated
from the two discriminants of event depth and polarity of motion, for
the event types of "earthquake"
and
"explosion"
. Details for the methods for calculating the
specific discriminant p-values are shown in Anderson et al. (2007).
An example of a call to the pval_gen()
function is shown
below.
simulated.pvalues <- pval_gen(sims = 160, grid.dim = c(1000,1000,30),
seismometer = list(N = 100, max.depth = 2),
explosion = list(max.depth = 3, prob = 0.3),
pwave.arrival = list(H0 = 4, optim.starts = 15),
first.polarity = list(read.err = 0.6))
The call produces 160 sets of p-values, which will be split between
training data and newly acquired data. Under the current version, it is
advised to generate p-values for both these data sets with the same call
to pval_gen()
so that the same seismometer locations and
variance in arrival times at these locations are consistent. Events and
seismometers are located within a space that is 1000 km in X and Y
coordinates, and has a depth of 30 km. N = 50
seismometer
locations, with a maximum depth of 2 km are randomly and uniformly
distributed through the region. An "explosion"
event has a
maximum true depth of 2 km. When generating each set of p-values there
is a probability of prob = 0.2
that the set will be
generated from an explosion. This explosion$prob
parameter
sets the ratio between events of "explosion"
and
"earthquake"
within the data set, with some randomness
within this ratio. The null hypothesis for testing event depth is set to
greater or equal to 4 km. To reduce computation time of the simulation
optim.starts
is set to 15. This controls how exhaustive the
search is for the gradient based estimation of the event location.
A summary of the simulation can be inspected:
summary(simulated.pvalues)
#> p.depth p.polarity event
#> Min. :0.00000 Min. :0.0000 Length:160
#> 1st Qu.:0.02007 1st Qu.:0.0000 Class :character
#> Median :0.17442 Median :0.0000 Mode :character
#> Mean :0.24880 Mean :0.1393
#> 3rd Qu.:0.50033 3rd Qu.:0.1545
#> Max. :0.80779 Max. :0.9941
The first column of the simulated.pvalues
data frame
contains a p-value related to estimated event depth, the second column
contains the p-value related to polarity of first motion, and the third
column contains the true event category. An example row of a single
event is as follows:
To show the functionality of the cECM()
function,
simulated.pvalues
needs to be split into a training set and
a set that would be used equivalent to if a new event which needed to be
categorized occurred. In this case, we will use the last 10 sets of
p-values generated. The p-values for new.data
are tabulated
after subsetting them below.
train <- simulated.pvalues[1:150,]
new.data <- simulated.pvalues[151:160,]
knitr::kable(new.data, format = "html")
p.depth | p.polarity | event | |
---|---|---|---|
151 | 0.0089881 | 0.0000000 | earthquake |
152 | 0.2159951 | 0.0000371 | earthquake |
153 | 0.0001332 | 0.0000000 | earthquake |
154 | 0.5837720 | 0.0000000 | earthquake |
155 | 0.0000003 | 0.0000000 | earthquake |
156 | 0.3875222 | 0.0000000 | earthquake |
157 | 0.6192695 | 0.5640187 | explosion |
158 | 0.4641851 | 0.0000000 | earthquake |
159 | 0.1813066 | 0.3840009 | explosion |
160 | 0.0000004 | 0.0000000 | earthquake |
The cECM()
function cannot use the event
column of new.data
. This column is saved in order to check
the accuracy of the method, and then deleted from the
new.data
data frame.
We are now ready to use the cECM()
function.
First, we will use cECM()
to fit a model to the training
data, and then use this model fit to look at the evidence that the new
data is from the distribution of p-values obtained from explosion data.
It is possible to feed both the training data and the new data to the
cECM()
function simultaneously. However, the usage
illustrated below is more realistic for an application where the model
would first be trained and later be used.
Then this object can be plotted:
This fit
can be saved for later use. However, because we
have some new data, we are ready to use this fit
with the
new.data
.
We can then investigate the aggregate p-values for
new.data
belonging to the "explosion"
distribution. Organizing and summarizing the output from
cECM()
allows for a comparison between the aggregate
p-value and the true category. Ideally, the explosion
column should have an aggregate p-value below the pre-determined
threshold when the new.data.true
column contains
"earthquake"
.
categories <- cbind(new.data.category$cECM, new.data.true)
categories <- categories[c("explosion", "new.data.true")]
knitr::kable(categories, format = "html")
explosion | new.data.true |
---|---|
0.0014954 | earthquake |
0.0148655 | earthquake |
0.0007148 | earthquake |
0.0182961 | earthquake |
0.0006443 | earthquake |
0.0203718 | earthquake |
0.6851373 | explosion |
0.0207576 | earthquake |
0.6036446 | explosion |
0.0006447 | earthquake |