Performs the Baumgartner-Weiß-Schindler 2-sample test of equal probability distributions.
– Steven E. Pav, shabbychef@gmail.com
This package can be installed from CRAN, via drat, or from github:
# via CRAN:
install.packages("BWStest")
# via drat:
if (require(drat)) {
    drat:::add("shabbychef")
    install.packages("BWStest")
}
# get snapshot from github (may be buggy)
if (require(devtools)) {
    install_github("shabbychef/BWStest")
}
The Baumgartner-Weiß-Schindler (‘BWS’) test is a non-parametric hypothesis test for the null of equal probability distributions of two samples, not unlike the two-sample Kolmogorov-Smirnov test or the Wilcoxon test. The BWS test is more powerful than these other tests under certain alternatives, as shown in the original paper and replicated below.
The front end for the hypothesis test is the function bws_test. By default it performs the classical Baumgartner-Weiß-Schindler test against a two-sided alternative, returning an htest object:
require(BWStest)
set.seed(12345)
# under the null:
x <- rnorm(200)
y <- rnorm(200)
hval <- bws_test(x, y)
show(hval)
##
## two-sample BWS test
##
## data: x vs. y
## B = 1, p-value = 0.2
## alternative hypothesis: true difference in survival functions is not equal to 0
# under the alternative:
z <- rnorm(200, mean = 1)
hval <- bws_test(x, z)
show(hval)
##
## two-sample BWS test
##
## data: x vs. z
## B = 30, p-value <2e-16
## alternative hypothesis: true difference in survival functions is not equal to 0
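For a rough comparison with the classical competitors mentioned above, one can run them on the same draws. A minimal sketch using the base stats functions, reusing the x and z defined above:

# classical two-sample tests on the same samples (x vs. z)
ks.test(x, z)$p.value
wilcox.test(x, z)$p.value
t.test(x, z)$p.value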
The code also supports the alternative test statistics of Neuhäuser and Murakami, as well as one-sided alternatives for some of the tests:
set.seed(12345)
# under the alternative:
x <- rnorm(200)
z <- rnorm(200, mean = 1)
hval <- bws_test(x, z, alternative = "less")
show(hval)
##
## two-sample Neuhauser/Murakami test
##
## data: x vs. z
## B_2 = -30, p-value <2e-16
## alternative hypothesis: true difference in survival functions is less than 0
x <- rnorm(8)
z <- rnorm(8, mean = 1)
hval <- bws_test(x, z, method = "B3", alternative = "two.sided")
show(hval)
##
## two-sample Murakami test
##
## data: x vs. z
## B_3 = 2, p-value = 0.009
## alternative hypothesis: true difference in survival functions is not equal to 0
We should note that the B3 through B5 tests do not achieve nominal coverage for large sample sizes, and should only be used on sample sizes of about 12 or fewer in each of the two samples.
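If one does want the other small-sample variants, they are presumably requested the same way as B3 above, via the method argument. A small sketch under that assumption:

set.seed(999)
x <- rnorm(10)
y <- rnorm(10, mean = 1)
# assumption: "B4" and "B5" are accepted by the method argument, like "B3" above
bws_test(x, y, method = "B4")
bws_test(x, y, method = "B5", alternative = "two.sided")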
Doveryai, no proveryai. (Trust, but verify.) – Russian proverb.
Here we perform 5000 simulations of the BWS test statistic under the null hypothesis, then evaluate the putative CDF at each. If the code is correct, the resulting p-values should be uniform, so I q-q plot them against the uniform law:
require(BWStest)
# now compute a bunch under the null:
set.seed(1234)
bvals <- replicate(5000, bws_stat(rnorm(100), rnorm(100)))
# compute the approximate p-values under the null:
pvals <- bws_cdf(bvals)

require(ggplot2)
ph <- ggplot(data.frame(pv = pvals), aes(sample = pv)) +
    stat_qq(distribution = stats::qunif)
print(ph)
Looks good to me!
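As a numerical complement to the eyeball test, one could also feed the simulated p-values to a one-sample Kolmogorov-Smirnov test against the uniform law. A quick sketch:

# sanity check: the simulated p-values should be indistinguishable from uniform
ks.test(pvals, "punif")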
Here we replicate figure 2A of Baumgartner et al. We draw two samples from the normal distribution, both with unit standard deviation, letting a be the difference in means. We check the empirical rejection rate at the 0.05 level for a few different tests. As in Baumgartner, we find that the lowly t-test is the most powerful in this case, with the BWS, Cramer-Von Mises, and Wilcoxon tests displaying similar power, then the KS test the least powerful. Note that the Kolmogorov Smirnov test does not appear to have nominal coverage under the null, probably due to the small sample size.
n.sim <- 10000
avals <- seq(0, 3.2, length.out = 17)
alpha <- 0.05
mnsize <- 10

# this is archived on CRAN, unfortunately:
library(CvM2SL2Test)
# find the CVM critical value.
critv <- uniroot(function(x) {
    cvmts.pval(x, mnsize, mnsize) - alpha
}, lower = 0, upper = 100, maxiter = 5000)$root

set.seed(1234)
simul <- sapply(avals, function(a) {
    rejs <- replicate(n.sim, {
        x <- rnorm(mnsize, mean = 0)
        y <- rnorm(mnsize, mean = a)
        bws <- bws_cdf(bws_stat(x, y), lower_tail = FALSE) <= alpha
        ttv <- t.test(x, y, alternative = "two.sided")$p.value <= alpha
        cvm <- cvmts.test(x, y) >= critv
        ksv <- ks.test(x, y, alternative = "two.sided")$p.value <= alpha
        wcx <- wilcox.test(x, y, alternative = "two.sided")$p.value <= alpha
        c(bws, ttv, cvm, ksv, wcx)
    })
    rejrate <- rowMeans(rejs)
    names(rejrate) <- c("BWS test", "t test", "Cramer-Von Mises test",
        "Kolmogorov Smirnov test", "Wilcoxon test")
    rejrate
}, simplify = "matrix")

Arejrates <- data.frame(t(simul))
Arejrates$a <- avals

library(tidyr)
library(dplyr)
library(ggplot2)
plotdf <- tidyr::gather(Arejrates, "test", "rejection_rate", -a) %>%
    dplyr::mutate(test = gsub("\\.", " ", test))
ph <- ggplot(plotdf, aes(x = a, y = rejection_rate, group = test,
    colour = test)) + geom_line() + geom_point() +
    labs(x = "a, difference in means", y = "rejection rate")
print(ph)
Here we replicate figure 2B of Baumgartner et al. We draw two samples from the normal distribution, both with zero mean, one with unit standard deviation, the other with standard deviation of sigma. We compute the empirical rejection rate at the 0.05 level, dropping the t-test since it is not relevant for this formulation. As in Baumgartner, we find the BWS test is the most powerful, followed by KS test, then Cramer-Von Mises, then Wilcoxon, which is basically useless in this simulation.
n.sim <- 10000
svals <- seq(1, 45, length.out = 10)
alpha <- 0.05
mnsize <- 10

# this is archived on CRAN, unfortunately:
library(CvM2SL2Test)
# find the CVM critical value.
critv <- uniroot(function(x) {
    cvmts.pval(x, mnsize, mnsize) - alpha
}, lower = 0, upper = 100, maxiter = 5000)$root

set.seed(1234)
simul <- sapply(svals, function(s) {
    rejs <- replicate(n.sim, {
        x <- rnorm(mnsize, mean = 0, sd = 1)
        y <- rnorm(mnsize, mean = 0, sd = s)
        bws <- bws_cdf(bws_stat(x, y), lower_tail = FALSE) <= alpha
        cvm <- cvmts.test(x, y) >= critv
        ksv <- ks.test(x, y, alternative = "two.sided")$p.value <= alpha
        wcx <- wilcox.test(x, y, alternative = "two.sided")$p.value <= alpha
        c(bws, cvm, ksv, wcx)
    })
    rejrate <- rowMeans(rejs)
    names(rejrate) <- c("BWS test", "Cramer-Von Mises test",
        "Kolmogorov Smirnov test", "Wilcoxon test")
    rejrate
}, simplify = "matrix")

Brejrates <- data.frame(t(simul))
Brejrates$sigma <- svals

plotdf <- tidyr::gather(Brejrates, "test", "rejection_rate", -sigma) %>%
    dplyr::mutate(test = gsub("\\.", " ", test))
ph <- ggplot(plotdf, aes(x = sigma, y = rejection_rate, group = test,
    colour = test)) + geom_line() + geom_point() +
    labs(x = "sigma, ratio of standard deviations", y = "rejection rate")
print(ph)
Here we replicate figure 3A of Baumgartner et al. We draw two samples from the exponential distribution, letting l be the ratio of the rate parameters of the two populations. We compute the empirical rejection rate at the 0.05 level. As in Baumgartner, we find the BWS test is the most powerful, followed by Wilcoxon, then Cramer-Von Mises, then the KS test.
n.sim <- 10000
lvals <- seq(1, 12)
alpha <- 0.05
mnsize <- 10

# this is archived on CRAN, unfortunately:
library(CvM2SL2Test)
# find the CVM critical value.
critv <- uniroot(function(x) {
    cvmts.pval(x, mnsize, mnsize) - alpha
}, lower = 0, upper = 100, maxiter = 5000)$root

set.seed(1234)
simul <- sapply(lvals, function(l) {
    rejs <- replicate(n.sim, {
        x <- rexp(mnsize, rate = 1)
        y <- rexp(mnsize, rate = l)
        bws <- bws_cdf(bws_stat(x, y), lower_tail = FALSE) <= alpha
        cvm <- cvmts.test(x, y) >= critv
        ksv <- ks.test(x, y, alternative = "two.sided")$p.value <= alpha
        wcx <- wilcox.test(x, y, alternative = "two.sided")$p.value <= alpha
        c(bws, cvm, ksv, wcx)
    })
    rejrate <- rowMeans(rejs)
    names(rejrate) <- c("BWS test", "Cramer-Von Mises test",
        "Kolmogorov Smirnov test", "Wilcoxon test")
    rejrate
}, simplify = "matrix")

Crejrates <- data.frame(t(simul))
Crejrates$lratio <- lvals

plotdf <- tidyr::gather(Crejrates, "test", "rejection_rate", -lratio) %>%
    dplyr::mutate(test = gsub("\\.", " ", test))
ph <- ggplot(plotdf, aes(x = lratio, y = rejection_rate, group = test,
    colour = test)) + geom_line() + geom_point() +
    labs(x = "l, ratio of rate parameters", y = "rejection rate")
print(ph)
Here we replicate figure 3B of Baumgartner et al. We draw two samples, one from the normal distribution with zero mean and variance one-twelfth, the other uniformly on -0.5 to 0.5. We take equal sample sizes from these two populations, then vary the sample size, checking the empirical rejection rate at the 0.05 level. Since the first two moments are equal, the Wilcoxon test is useless here, and not applied. As in Baumgartner, we find the BWS test is the most powerful, followed by the KS and Cramer-Von Mises tests. Based on the power plots here, I theorize that Baumgartner et al. are plotting the total sample size on the x axis, that is, drawing n from both distributions, then plotting empirical power versus 2n. We follow that convention, which makes the plots match those of Baumgartner.
n.sim <- 10000
mvals <- seq(10, 670, by = 60)
alpha <- 0.05

# this is archived on CRAN, unfortunately:
library(CvM2SL2Test)
set.seed(1234)
simul <- sapply(mvals, function(mnsize) {
    # find the CVM critical value. note that this basically converged
    # for mnsize > 100 or so, so we take the min.. for reasons of speed.
    critv <- uniroot(function(x) {
        cvmts.pval(x, min(mnsize, 80), min(mnsize, 80)) - alpha
    }, lower = 0, upper = 2, maxiter = 100)$root
    rejs <- replicate(n.sim, {
        x <- rnorm(mnsize, mean = 0, sd = 1/sqrt(12))
        y <- runif(mnsize, min = -0.5, max = 0.5)
        bws <- bws_cdf(bws_stat(x, y), lower_tail = FALSE) <= alpha
        cvm <- cvmts.test(x, y) >= critv
        ksv <- ks.test(x, y, alternative = "two.sided")$p.value <= alpha
        c(bws, cvm, ksv)
    })
    rejrate <- rowMeans(rejs)
    names(rejrate) <- c("BWS test", "Cramer-Von Mises test",
        "Kolmogorov Smirnov test")
    rejrate
}, simplify = "matrix")

Drejrates <- data.frame(t(simul))
Drejrates$ssize <- 2 * mvals

plotdf <- tidyr::gather(Drejrates, "test", "rejection_rate", -ssize) %>%
    dplyr::mutate(test = gsub("\\.", " ", test))
ph <- ggplot(plotdf, aes(x = ssize, y = rejection_rate, group = test,
    colour = test)) + geom_line() + geom_point() +
    labs(x = "m+n, total sample size", y = "rejection rate")
print(ph)
Neuhäuser and Murakami described some modifications to the original test of Baumgartner. Neuhäuser's test allows one to test against directional alternatives, while Murakami enumerated some modifications to the weighting scheme. These are available via the murakami_stat function, where the flavor corresponds to the test number from Murakami's paper: flavor 0 recovers the original BWS statistic, flavor 1 is Murakami's B1, flavor 2 is Neuhäuser's directional statistic, and flavors 3 through 5 are Murakami's B3 through B5. Here we take these through the paces as above.
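For reference, a quick sketch of the lower-level interface that the simulations below rely on, with murakami_stat computing a statistic and murakami_cdf evaluating its null CDF (the variable names here are only illustrative):

set.seed(123)
x <- rnorm(9)
y <- rnorm(9)
# flavor 2 is Neuhauser's statistic; evaluate its null CDF at the observed value
b2 <- murakami_stat(x, y, flavor = 2)
murakami_cdf(b2, n1 = 9, n2 = 9, flavor = 2)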
As above, we draw samples under the null and compare them to the putative CDF:
require(BWStest)
# now compute a bunch under the null:
n1 <- 9
n2 <- n1
set.seed(1234)
allpvs <- lapply(0L:5L, function(flavor) {
    bvals <- replicate(5000, murakami_stat(rnorm(n1), rnorm(n2),
        flavor = flavor))
    # compute the approximate p-values under the null:
    pvals <- murakami_cdf(bvals, n1 = n1, n2 = n2, flavor = flavor)
    data.frame(pv = pvals, flavor = flavor)
})
df <- do.call(rbind, allpvs)

require(ggplot2)
ph <- ggplot(df, aes(sample = pv)) + facet_grid(flavor ~ .) +
    stat_qq(distribution = stats::qunif)
print(ph)
While these all look fine, they are based on small sample sizes. The CDF is approximated by evaluating all the permutations (with memoisation to tame the computational requirements), but this can only be done up to some reasonably small sample size. If the test statistic does not converge beyond that sample size, the CDF approximation will not be accurate. This appears to be the case for flavors 3 through 5, as demonstrated below:
require(BWStest)
# now compute a bunch under the null:
n1 <- 50
n2 <- n1
set.seed(1234)
allpvs <- lapply(0L:5L, function(flavor) {
    bvals <- replicate(5000, murakami_stat(rnorm(n1), rnorm(n2),
        flavor = flavor))
    # compute the approximate p-values under the null:
    pvals <- murakami_cdf(bvals, n1 = n1, n2 = n2, flavor = flavor)
    data.frame(pv = pvals, flavor = flavor)
})
df <- do.call(rbind, allpvs)

require(ggplot2)
ph <- ggplot(df, aes(sample = pv)) + facet_grid(flavor ~ .) +
    stat_qq(distribution = stats::qunif)
print(ph)
So that’s not so great.
We can perform the power comparisons of Baumgartner et al. using the Murakami test statistics. Here we consider the setup of figure 2A, samples of size 10 from the normal distribution with equal variances but different means. Neuhäuser’s test is, as expected, more powerful for detecting this one-sided alternative, probably on par with the t-test, while the BWS test (basically equivalent to Murakami 1) is next, then 5, 3, 4.
n.sim <- 10000
avals <- seq(0, 3.2, length.out = 17)
alpha <- 0.05
mnsize <- 10

set.seed(1234)
simul <- sapply(avals, function(a) {
    statvs <- replicate(n.sim, {
        x <- rnorm(mnsize, mean = 0)
        y <- rnorm(mnsize, mean = a)
        bws <- bws_stat(x, y)
        murav <- sapply(1L:5L, function(flavor) {
            murakami_stat(x, y, flavor = flavor)
        })
        c(bws, murav)
    })
    rej_bws <- mean(bws_cdf(statvs[1, ], lower_tail = FALSE) <= alpha)
    rej_mur <- sapply(2:6, function(rown) {
        mean(murakami_cdf(statvs[rown, ], mnsize, mnsize,
            flavor = rown - 1, lower_tail = FALSE) <= alpha)
    })
    rejrate <- c(rej_bws, rej_mur)
    names(rejrate) <- c("BWS test", "Murakami 1", "Neuhauser",
        "Murakami 3", "Murakami 4", "Murakami 5")
    rejrate
}, simplify = "matrix")

Arejrates <- data.frame(t(simul))
Arejrates$a <- avals

library(tidyr)
library(dplyr)
library(ggplot2)
plotdf <- tidyr::gather(Arejrates, "test", "rejection_rate", -a) %>%
    dplyr::mutate(test = gsub("\\.", " ", test))
ph <- ggplot(plotdf, aes(x = a, y = rejection_rate, group = test,
    colour = test)) + geom_line() + geom_point() +
    labs(x = "a, difference in means", y = "rejection rate")
print(ph)
Now we consider figure 2B, with two samples of equal size drawn from the normal distribution with zero mean but different standard deviations. Neuhäuser's test is nearly useless here because the means of the populations are equal. Murakami 3, 4, and 5 are the most powerful, followed by BWS.
n.sim <- 10000
svals <- seq(1, 45, length.out = 10)
alpha <- 0.05
mnsize <- 10

set.seed(1234)
simul <- sapply(svals, function(s) {
    statvs <- replicate(n.sim, {
        x <- rnorm(mnsize, mean = 0, sd = 1)
        y <- rnorm(mnsize, mean = 0, sd = s)
        bws <- bws_stat(x, y)
        murav <- sapply(1L:5L, function(flavor) {
            murakami_stat(x, y, flavor = flavor)
        })
        c(bws, murav)
    })
    rej_bws <- mean(bws_cdf(statvs[1, ], lower_tail = FALSE) <= alpha)
    rej_mur <- sapply(2:6, function(rown) {
        mean(murakami_cdf(statvs[rown, ], mnsize, mnsize,
            flavor = rown - 1, lower_tail = FALSE) <= alpha)
    })
    rejrate <- c(rej_bws, rej_mur)
    names(rejrate) <- c("BWS test", "Murakami 1", "Neuhauser",
        "Murakami 3", "Murakami 4", "Murakami 5")
    rejrate
}, simplify = "matrix")

Brejrates <- data.frame(t(simul))
Brejrates$sigma <- svals

plotdf <- tidyr::gather(Brejrates, "test", "rejection_rate", -sigma) %>%
    dplyr::mutate(test = gsub("\\.", " ", test))
ph <- ggplot(plotdf, aes(x = sigma, y = rejection_rate, group = test,
    colour = test)) + geom_line() + geom_point() +
    labs(x = "sigma, ratio of standard deviations", y = "rejection rate")
print(ph)
Now to figure 3A, with two equal sized samples from the exponential distribution, letting l be the ratio of the rate parameters of the two populations. The BWS test is the most powerful here, apparently.
n.sim <- 10000
lvals <- seq(1, 12)
alpha <- 0.05
mnsize <- 10

set.seed(1234)
simul <- sapply(lvals, function(l) {
    statvs <- replicate(n.sim, {
        x <- rexp(mnsize, rate = 1)
        y <- rexp(mnsize, rate = l)
        bws <- bws_stat(x, y)
        murav <- sapply(1L:5L, function(flavor) {
            murakami_stat(x, y, flavor = flavor)
        })
        c(bws, murav)
    })
    rej_bws <- mean(bws_cdf(statvs[1, ], lower_tail = FALSE) <= alpha)
    rej_mur <- sapply(2:6, function(rown) {
        mean(murakami_cdf(statvs[rown, ], mnsize, mnsize,
            flavor = rown - 1, lower_tail = FALSE) <= alpha)
    })
    rejrate <- c(rej_bws, rej_mur)
    names(rejrate) <- c("BWS test", "Murakami 1", "Neuhauser",
        "Murakami 3", "Murakami 4", "Murakami 5")
    rejrate
}, simplify = "matrix")

Crejrates <- data.frame(t(simul))
Crejrates$lratio <- lvals

plotdf <- tidyr::gather(Crejrates, "test", "rejection_rate", -lratio) %>%
    dplyr::mutate(test = gsub("\\.", " ", test))
ph <- ggplot(plotdf, aes(x = lratio, y = rejection_rate, group = test,
    colour = test)) + geom_line() + geom_point() +
    labs(x = "l, ratio of rate parameters", y = "rejection rate")
print(ph)
Now figure 3B, with a normal sample and a uniform sample, with equal first and second moments, varying the sample size. Because flavors 3 through 5 are not well behaved, we drop them here. Once again the BWS test is the most powerful.
n.sim <- 10000
mvals <- seq(10, 670, by = 60)
alpha <- 0.05

set.seed(1234)
simul <- sapply(mvals, function(mnsize) {
    statvs <- replicate(n.sim, {
        x <- rnorm(mnsize, mean = 0, sd = 1/sqrt(12))
        y <- runif(mnsize, min = -0.5, max = 0.5)
        bws <- bws_stat(x, y)
        murav <- sapply(1L:2L, function(flavor) {
            murakami_stat(x, y, flavor = flavor)
        })
        c(bws, murav)
    })
    rej_bws <- mean(bws_cdf(statvs[1, ], lower_tail = FALSE) <= alpha)
    rej_mur <- sapply(2:3, function(rown) {
        mean(murakami_cdf(statvs[rown, ], mnsize, mnsize,
            flavor = rown - 1, lower_tail = FALSE) <= alpha)
    })
    rejrate <- c(rej_bws, rej_mur)
    names(rejrate) <- c("BWS test", "Murakami 1", "Neuhauser")
    rejrate
}, simplify = "matrix")

Drejrates <- data.frame(t(simul))
Drejrates$ssize <- 2 * mvals

plotdf <- tidyr::gather(Drejrates, "test", "rejection_rate", -ssize) %>%
    dplyr::mutate(test = gsub("\\.", " ", test))
ph <- ggplot(plotdf, aes(x = ssize, y = rejection_rate, group = test,
    colour = test)) + geom_line() + geom_point() +
    labs(x = "m+n, total sample size", y = "rejection rate")
print(ph)
I anticipate the following: