Broglio et al. (2014) describe an example diagnostic study from Julian et al. (2008). This was a single-arm diagnostic study of the GeneSearch Breast Lymph Node (BLN) Assay for patients with breast cancer undergoing sentinel lymph node (SLN) biopsy. The new BLN test yielded a result during the surgery, i.e. intra-operatively, which yielded a positive or negative test for metastasis. The gold standard reference test was histologic evaluation. Whilst histologic evaluation may not be known immediately, we assume it is available soon after, meaning both test results (BLN and reference) are known immediately.
The trial would be declared successful if the new test sensitivity (\(\pi_1\)) and specificity (\(\pi_0\)) satisfy the following posterior probability inequalities:
\[ P[\pi_1 > 0.70 | \text{Data}] \ge 0.985 \, \text{ and } P[\pi_0 > 0.90 | \text{Data}] \ge 0.985. \]
Based on this, we know that
<- 0.7
sens_pg <- 0.9
spec_pg
<- 0.985,
succ_sens <- 0.985,
succ_spec
<- "both" endpoint
The thresholds of 0.985 were reported as being chosen to control the overall type I error at <5%. In general, when there are interim looks at the data, a type I error penalty must be paid. The choice of thresholds can be determined using simulation (in the null model space) at multiple sequential thresholds, and identify which threshold(s) lead to type I error below the required value.
Their study did not specify what the assumed sensitivity, specificity, or true prevalence was. From the pilot (beta) study reported by Julian et al. (2008), we assume that the true values are:
<- 0.824
sens_true <- 0.963
spec_true <- 0.2 prev_true
There was no prevalence value reported from the pilot study, so we assume prev_true <- 0.20
.
The prior distributions for prior_sens
, prior_spec
, and prior_prev
, are all assumed to be independent \(\text{Beta}(0.1, 0.1)\). This is passed as:
<- c(0.1, 0.1)
prior_sens <- c(0.1, 0.1)
prior_spec <- c(0.1, 0.1) prior_prev
The trial sample size selection analyses were planned starting when \(n=200\) patients were enrolled, and after every additional \(n=50\) patients, up to a maximum of \(n=700\). Hence, we can set:
<- seq.int(200, 700, by = 50) n_at_looks
We can simulate trials under this design using the adaptDiag
package function multi_trial()
function as follows:
library(adaptDiag)
<- multi_trial(
fit_power sens_true = 0.824,
spec_true = 0.963,
prev_true = 0.20,
endpoint = "both",
sens_pg = 0.7,
spec_pg = 0.9,
prior_sens = c(0.1, 0.1),
prior_spec = c(0.1, 0.1),
prior_prev = c(0.1, 0.1),
succ_sens = 0.985,
succ_spec = 0.985,
n_at_looks = seq(200, 700, 50),
n_mc = 10000,
n_trials = 200,
ncores = 1L)
Here, we have only simulated 200 trials here, using a total of 1 core. In general, n_trials
should be larger, and ncores
should be set as large as possible dependent on the machine being used to perform the calculations. The Monte Carlo integration performed for futility assessment is based on n_mc = 10000
draws.
To determine the operating characteristics, we need to also specify the futility stopping rule. That is, we wish to calculate the posterior predictive probability of eventual trial success at the maximum sample size. Using conjugacy, the posterior predictive distribution of cases is Beta-Binomial (Broglio et al. 2014). For the GeneSearch BLN Assay study, the futility stopping rule was set at 5% – that is, if the posterior predictive probability of eventual study success is <0.05, the study will terminate early.
We add in an additional criteria that the study cannot stop for early success or futilty until a fixed number of reference positive cases are observed. We implement these 2 criteria and extract the operating characteristic as follows:
summarise_trials(fit_power, min_pos = 30, fut = 0.05)
#> n
#> decision 200 250 300 350 400 450 500 550 600 650 700
#> early win 64 37 25 13 15 11 5 8 5 1 0
#> late win 0 0 0 0 0 0 0 0 0 0 2
#> no stopping 0 0 0 0 0 0 0 0 0 0 1
#> stop for futility 1 2 1 0 1 1 0 1 4 2 0
#> power stop_futility n_avg sens spec mean_pos
#> 1 0.93 0.065 319 0.8439725 0.9636661 63.73
The printed output shows a table with columns listing the sample size looks, and rows listing the reasons for the trial stopping. Below this table we also see estimates of the study power, average sample size, proportion of trials stopping for futility, as well as posterior estimates of the sensitivity and specificity.
We also wish to evaluate the type I error. We can straightforwardly calculate this by assuming the true values are equal to the null.
<- update(fit_power,
fit_type1 sens_true = 0.7,
spec_true = 0.9)
summarise_trials(fit_type1, min_pos = 30, fut = 0.05)
#> n
#> decision 200 250 300 350 400 450 500 600
#> stop for futility 156 18 9 7 5 1 1 3
#> power stop_futility n_avg sens spec mean_pos
#> 1 0 1 228 0.6928179 0.9012423 45.405
The operating characteristics show the type I error is well controlled at the 5% level. Note: in practice, one should run many simulations (at least 10,000) to get accurate estimates of the type I error.
The true prevalence was passed as prev_true = 0.20
, commensurate with 20% of subjects in the trial testing positive with the reference histology test. We do not know what the actual value of the true prevalence was when the trial was originally designed. (In fact, we do not know what the input true parameters for sensitivity and specificity were either.) It is reasonably straightforward to search a grid of input parameters, as follows.
Assume we are unsure of what the true prevalence might be. We believe it will be 15% and 30%. We then wish to evaluate the operating characteristics at a grid of plausible values. If we let , we can apply the multi_trial()
function within a loop:
<- NULL
tab
for (i in 1:length(prev_true_vec)) {
<- multi_trial(
fit_power_i sens_true = 0.824,
spec_true = 0.963,
prev_true = prev_true_vec[i],
endpoint = "both",
sens_pg = 0.7,
spec_pg = 0.9,
prior_sens = c(0.1, 0.1),
prior_spec = c(0.1, 0.1),
prior_prev = c(0.1, 0.1),
succ_sens = 0.985,
succ_spec = 0.985,
n_at_looks = seq(200, 700, 50),
n_mc = 1000,
n_trials = 100,
ncores = 1L)
<- summarise_trials(fit_power_i, min_pos = 30, fut = 0.05)
out <- rbind(tab, out)
tab }
We can then use this this to understand how the power changes with the the assumed true prevalence value, as
plot(prev_true_vec, tab$power,
xlab = "True prevalence",
ylab = "Power",
main = "Prevalence vs. power",
type = "b")
Note: to allow the vignette to compile quickly, we have only used n_mc = 1000
and n_trials = 100
. In practice, these values would both need substantially larger.
Broglio KR, Connor JT, Berry SM. Not too big, not too small: a Goldilocks approach to sample size selection. Journal of Biopharmaceutical Statistics, 2014; 24(3): 685–705.
Julian TB, Blumencranz P, Deck K, Whitworth P, Berry DA, Berry SM, Rosenberg A, et al. Novel intraoperative molecular test for sentinel lymph node metastases in patients with early-stage breast cancer. Journal of Clinical Oncology: Official Journal of the American Society of Clinical Oncology, 2008; 26(20): 3338–3345.