aggregate-pvalue-sim

The purpose of this vignette is to illustrate how one would use the pval_gen() function to simulate p-values and then feed these p-values to the cECM() function to demonstrate the performance of event categorization techniques. Alternatively, this vignette can be used to learn how to supply sets of real training p-values, along with p-values from newly acquired events, to cECM().

First, let’s load the package:

library(ezECM)

Simulating p-values

The pval_gen() function simulates p-values generated from the two discriminants of event depth and polarity of first motion, for the event types of "earthquake" and "explosion". Details of the methods for calculating the specific discriminant p-values are given in Anderson et al. (2007).

An example of a call to the pval_gen() function is shown below.

simulated.pvalues <- pval_gen(sims = 160, grid.dim = c(1000,1000,30), 
                              seismometer = list(N = 100, max.depth = 2),
                              explosion = list(max.depth = 3, prob = 0.3),
                              pwave.arrival = list(H0 = 4, optim.starts = 15),
                              first.polarity  = list(read.err = 0.6))

The call produces 160 sets of p-values, which will be split between training data and newly acquired data. Under the current version, it is advised to generate the p-values for both of these data sets with the same call to pval_gen(), so that the seismometer locations and the variance in arrival times at those locations are consistent across the two sets. Events and seismometers are located within a space that spans 1000 km in the X and Y coordinates and has a depth of 30 km. N = 100 seismometer locations, each with a maximum depth of 2 km, are distributed uniformly at random throughout the region. An "explosion" event has a maximum true depth of 3 km. When generating each set of p-values there is a probability of prob = 0.3 that the set will be generated from an explosion; this explosion$prob parameter sets the expected ratio of "explosion" to "earthquake" events within the data set, with some randomness around this ratio. The null hypothesis for testing event depth is that the event depth is greater than or equal to H0 = 4 km. To reduce the computation time of the simulation, optim.starts is set to 15; this argument controls how exhaustive the search is for the gradient-based estimation of the event location.
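
Before splitting the data, the realized mix of event types implied by explosion$prob can be checked with a quick tabulation (the exact counts will vary from run to run):

# tabulate how many simulated events of each type were produced
table(simulated.pvalues$event)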

A summary of the simulation can be inspected:

summary(simulated.pvalues)
#>     p.depth          p.polarity        event          
#>  Min.   :0.00000   Min.   :0.0000   Length:160        
#>  1st Qu.:0.02007   1st Qu.:0.0000   Class :character  
#>  Median :0.17442   Median :0.0000   Mode  :character  
#>  Mean   :0.24880   Mean   :0.1393                     
#>  3rd Qu.:0.50033   3rd Qu.:0.1545                     
#>  Max.   :0.80779   Max.   :0.9941

The first column of the simulated.pvalues data frame contains a p-value related to estimated event depth, the second column contains the p-value related to polarity of first motion, and the third column contains the true event category. An example row of a single event is as follows:

simulated.pvalues[1,]
#>     p.depth   p.polarity      event
#> 1 0.6111228 3.911149e-13 earthquake

To show the functionality of the cECM() function, simulated.pvalues needs to be split into a training set and a set treated as newly acquired data, as if new events requiring categorization had occurred. In this case, we will use the last 10 sets of p-values generated as the new data. The p-values for new.data are tabulated below after subsetting.

train <- simulated.pvalues[1:150,]
new.data <- simulated.pvalues[151:160,]

knitr::kable(new.data, format = "html")
       p.depth   p.polarity   event
151  0.0089881    0.0000000   earthquake
152  0.2159951    0.0000371   earthquake
153  0.0001332    0.0000000   earthquake
154  0.5837720    0.0000000   earthquake
155  0.0000003    0.0000000   earthquake
156  0.3875222    0.0000000   earthquake
157  0.6192695    0.5640187   explosion
158  0.4641851    0.0000000   earthquake
159  0.1813066    0.3840009   explosion
160  0.0000004    0.0000000   earthquake

The cECM() function cannot use the event column of new.data when categorizing new events. This column is therefore saved, so the accuracy of the method can be checked afterwards, and then removed from the new.data data frame.

new.data.true <- new.data$event
new.data$event <- NULL

We are now ready to use the cECM() function.

ECM

First, we will use cECM() to fit a model to the training data, and then use this model fit to examine the evidence that the new data come from the distribution of p-values obtained from explosion events. It is possible to feed both the training data and the new data to the cECM() function simultaneously; however, the usage illustrated below, where the model is first trained and only later applied, is more realistic for an application.

fit <- cECM(x = train)
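
For completeness, the single-call usage mentioned above would look like the following sketch (not run here, since we want to reuse the trained fit on its own); the object name both.at.once is purely illustrative:

# train on the labeled p-values and categorize the new events in one call
both.at.once <- cECM(x = train, newdata = new.data)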

Then this object can be plotted:

plot(fit)

This fit can be saved for later use.
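
For example, the fit could be stored and reloaded with base R serialization (the file name below is just an example; the package may also provide its own export helpers):

# write the trained model to disk for use in a later session
saveRDS(fit, file = "cECM-fit.rds")
# restore it when new events arrive
fit <- readRDS("cECM-fit.rds")

Because we already have new data in hand, however, we can apply the fit to new.data directly.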

new.data.category <- cECM(x = fit, newdata = new.data)

We can then investigate the aggregate p-values for each event in new.data belonging to the "explosion" distribution. Organizing and summarizing the output from cECM() allows for a comparison between the aggregate p-value and the true category. Ideally, the explosion column should have an aggregate p-value below a pre-determined threshold whenever the new.data.true column contains "earthquake".

categories <- cbind(new.data.category$cECM, new.data.true)
categories <- categories[c("explosion", "new.data.true")]

knitr::kable(categories, format = "html")
 explosion   new.data.true
 0.0014954   earthquake
 0.0148655   earthquake
 0.0007148   earthquake
 0.0182961   earthquake
 0.0006443   earthquake
 0.0203718   earthquake
 0.6851373   explosion
 0.0207576   earthquake
 0.6036446   explosion
 0.0006447   earthquake
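
As a rough illustration, these aggregate p-values could be turned into hard decisions by comparing against a chosen significance threshold; the 0.05 cutoff and the decision column below are purely illustrative and not a recommendation from the package:

# hypothetical decision rule: a small explosion p-value is taken as evidence
# against the event belonging to the explosion population
threshold <- 0.05
categories$decision <- ifelse(categories$explosion < threshold,
                              "earthquake", "explosion")
# cross-tabulate the illustrative decisions against the true categories
table(predicted = categories$decision, truth = categories$new.data.true)

For the ten events shown above, such a cutoff would separate the two explosions from the eight earthquakes, but the appropriate threshold in practice depends on the application.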

References

Anderson, Dale N, Deborah K Fagan, Mark A Tinker, Gordon D Kraft, and Kevin D Hutchenson. 2007. “A Mathematical Statistics Formulation of the Teleseismic Explosion Identification Problem with Multiple Discriminants.” Bulletin of the Seismological Society of America 97 (5): 1730–41.