Note: Simulation-based calculations of sample size necessarily entail a fair amount of computation. For vignettes of CRAN packages, the allowed computational limits are small. As a result, this vignette demonstrates coding examples using nRegression without evaluating them. To see the results of running the code, you can view the full output of these computations on the GitHub page: https://github.com/srivastavbudugutta/nRegression
Sample size calculations are fundamental to the design of many research studies. When tractable, analytic methods for calculating the sample size are preferred. However, many research studies have more complex designs. In these settings, simulation techniques may be used to generate data and test assumptions about statistical methods. The nRegression package was designed to estimate the minimal sample size required to attain a specific statistical power in the context of linear regression and logistic regression models through simulations.
For an assumed effect size and sample size, the statistical power is the probability of a true positive result for the study. Simulation studies can estimate the statistical power through repeated generation of studies with random sampling of data from a statistical model. The empirical proportion of simulated studies that reject the null hypothesis would then estimate the statistical power of the study’s design.
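As a minimal sketch of this idea, the base R code below estimates power for the slope of a simple linear regression at a fixed sample size. The helper `estimate_power`, the effect size `beta`, and the data-generating model are illustrative assumptions and are not part of nRegression.

```r
# Hypothetical helper (not part of nRegression): estimate the power of a
# test of the slope in a simple linear regression by repeated simulation.
estimate_power <- function(n, num.experiments = 200, beta = 0.3,
                           conf.level = 0.95, seed = 41) {
  set.seed(seed)
  alpha <- 1 - conf.level
  rejected <- logical(num.experiments)
  for (b in seq_len(num.experiments)) {
    # Generate one simulated study of size n (assumed model: y = 1 + beta * x + noise).
    x <- rnorm(n)
    y <- 1 + beta * x + rnorm(n)
    fit <- lm(y ~ x)
    # Record whether the null hypothesis (slope equal to zero) is rejected.
    p.value <- summary(fit)$coefficients["x", "Pr(>|t|)"]
    rejected[b] <- p.value < alpha
  }
  # The proportion of rejections estimates the statistical power.
  mean(rejected)
}

power.50 <- estimate_power(n = 50)
power.200 <- estimate_power(n = 200)
```

Comparing `power.50` and `power.200` shows how the estimated power grows with the sample size under these assumed settings.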
Because simulations begin with a specific sample size, it is relatively easy to calculate the statistical power. However, calculating the minimally viable sample size to achieve a given level of statistical power is less straightforward. The nRegression package uses a search algorithm that estimates the statistical power at a range of sample sizes. This algorithm includes a number of steps:
1. Select a specific statistical power (e.g. 0.8) for the study to achieve.
2. Select a minimum \(m\) and maximum value \(M\) for the sample size. This creates a range of possible values to explore.
3. Select a starting sample size \(n\).
4. Create a simulation study that estimates the statistical power for the study at sample size \(n\).
5. Reset the search space as follows:
   - If the estimated statistical power is at least the specified value (1), set the maximum value to the previous sample size.
   - If the estimated statistical power is less than the specified value (1), set the minimum value to the previous sample size.
6. Then select a new sample size as the minimum of:
   - \(n\) plus a user selected increment \(k\) (e.g. 50);
   - The ceiling of the average of the current minimum and maximum values: \(\lceil(m+M)/2\rceil\).

   This formula is: \(n_{new} = \min\left(n+k, \lceil{(m+M)/2}\rceil\right)\)
This selection is a modification of a binary search algorithm. For any decrease or small increase in the selected sample size, binary search is used. However, if the increase from a binary search is large, then a smaller increase \(k\) is used. This can help to reduce the amount of computation required for generating simulation studies with large sample sizes.
The resulting sample size will be the smallest value considered whose estimated statistical power meets or exceeds the specified value. If no such value is detected, the sample size that empirically achieved the highest value of the statistical power will be returned.
The results will include information about the simulation’s estimates of the statistical power, range of values considered, and information from the full simulation study for the selected sample size.
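The search described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: `search_sample_size` and `toy_power` are hypothetical names, the simulation is replaced by a deterministic stand-in power curve so the example runs instantly, and the search is assumed to halt once the gap between the current maximum and minimum is no larger than the stopping threshold.

```r
# Hypothetical sketch of the modified binary search (not nRegression's code).
# power.fn(n) is any function returning an estimated power at sample size n.
search_sample_size <- function(power.fn, power = 0.8, n.start = 200, n.min = 1,
                               n.max = 3000, increment = 100, stop.threshold = 1) {
  m <- n.min
  M <- n.max
  n <- n.start
  ns <- integer(0)
  ests <- numeric(0)
  repeat {
    est <- power.fn(n)
    ns <- c(ns, n)
    ests <- c(ests, est)
    # Reset the search space around the sample size just evaluated.
    if (est >= power) M <- n else m <- n
    # Assumed stopping rule: halt when the remaining range is narrow enough.
    if (M - m <= stop.threshold) break
    # Binary search step, capped at an increase of `increment`.
    n.new <- min(n + increment, ceiling((m + M) / 2))
    if (n.new == n) break
    n <- n.new
  }
  reached <- any(ests >= power)
  # Smallest evaluated n meeting the target, else the n with the highest power.
  best <- if (reached) min(ns[ests >= power]) else ns[which.max(ests)]
  list(n = best, reached.target = reached,
       iterations = data.frame(n = ns, estimated.power = ests))
}

# Stand-in power curve (assumption): approximate power of a z-test whose
# standardized effect grows as 0.1 * sqrt(n).
toy_power <- function(n) pnorm(0.1 * sqrt(n) - qnorm(0.975))

result <- search_sample_size(toy_power)
capped <- search_sample_size(toy_power, n.max = 500)
```

In `result$iterations`, the early steps rise by the increment of 100 and later steps bisect the remaining range. The `capped` run illustrates the case in which no sample size in the range attains the target.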
The nRegression package is an extension of the simitation package, which can be used to generate simulations of studies involving linear and logistic regression relationships in data. The nRegression package involves specifying a number of quantities and inputs:
The preferred statistical power as a real number in \((0,1)\).
Information for simulating the multivariate data. This consists of an ordered specification of the variables and the generating model used for each variable in the simulation.
The formula for the linear or logistic regression model that will be applied to the data.
The specific variable in the regression model for which the corresponding coefficient’s value will be statistically tested.
Other relevant inputs, such as the minimum \(m\) and maximum \(M\) sample sizes in the search, the starting value \(n\), the increment \(k\), the stopping threshold \(w\), the confidence level of the test, the number of simulated experiments \(B\), the type of model to fit (linear or logistic regression), and the randomization seed.
Then the nRegression method estimates the minimally viable sample size according to these specifications.
Following the steps of the previous section, we will specify the following quantities:
step.age <- "Age ~ N(45, 10)"
step.female <- "Female ~ binary(0.53)"
step.health.percentile <- "Health.Percentile ~ U(0,100)"
step.exercise.sessions <- "Exercise.Sessions ~ Poisson(2)"
step.diet <- "Diet ~ sample(('Light', 'Moderate', 'Heavy'), (0.2, 0.45, 0.35))"
Then we can specify the Healthy.Lifestyle variable as a binary value with a logistic regression relationship as a function of Age, Female, Health.Percentile, Exercise.Sessions, and Diet:
step.healthy.lifestyle <- "Healthy.Lifestyle ~ logistic(log(0.45) - 0.1 * (Age -45) + 0.05 * Female + 0.01 * Health.Percentile + 0.5 * Exercise.Sessions - 0.1 * (Diet == 'Moderate') - 0.4 * (Diet == 'Heavy'))"
Then Weight can be specified as a continuous value according to a linear regression relationship with the other variables as inputs:
step.weight <- "Weight ~ lm(150 - 15 * Female + 0.5 * Age - 0.1 * Health.Percentile - 0.2 * Exercise.Sessions + 5 * (Diet == 'Moderate') + 15 * (Diet == 'Heavy') - 2 * Healthy.Lifestyle + N(0, 10))"
Each of these steps is concatenated into a single character vector:
the.steps <- c(step.age, step.female, step.health.percentile, step.exercise.sessions,
step.diet, step.healthy.lifestyle, step.weight)
The steps are ordered to describe the sequence in which the variables will be calculated. Note that any variable with a dependence on other variables (e.g. Healthy.Lifestyle) must be specified after all of its inputs.
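To make the meaning of these steps concrete, here is a rough base R rendering of the same generating process for a single data set. It is a sketch under assumptions about the notation: that N(45, 10) denotes a mean of 45 and a standard deviation of 10, that logistic(...) draws a TRUE/FALSE outcome with probability given by the inverse logit of the expression, and that lm(...) adds the stated noise term to the linear predictor. The object names `n`, `eta`, and `dat` are illustrative.

```r
set.seed(41)
n <- 1000  # illustrative sample size

# Independent inputs, generated first.
Age <- rnorm(n, mean = 45, sd = 10)
Female <- rbinom(n, size = 1, prob = 0.53)
Health.Percentile <- runif(n, min = 0, max = 100)
Exercise.Sessions <- rpois(n, lambda = 2)
Diet <- sample(c("Light", "Moderate", "Heavy"), size = n, replace = TRUE,
               prob = c(0.2, 0.45, 0.35))

# Healthy.Lifestyle depends on the inputs above, so it comes after them.
eta <- log(0.45) - 0.1 * (Age - 45) + 0.05 * Female +
  0.01 * Health.Percentile + 0.5 * Exercise.Sessions -
  0.1 * (Diet == "Moderate") - 0.4 * (Diet == "Heavy")
Healthy.Lifestyle <- rbinom(n, size = 1, prob = plogis(eta)) == 1

# Weight depends on every earlier variable, so it comes last.
Weight <- 150 - 15 * Female + 0.5 * Age - 0.1 * Health.Percentile -
  0.2 * Exercise.Sessions + 5 * (Diet == "Moderate") +
  15 * (Diet == "Heavy") - 2 * Healthy.Lifestyle + rnorm(n, mean = 0, sd = 10)

dat <- data.frame(Age, Female, Health.Percentile, Exercise.Sessions, Diet,
                  Healthy.Lifestyle, Weight)
```

Reordering the statements so that Weight precedes Healthy.Lifestyle would fail, which mirrors the ordering requirement on the steps.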
the.formula.logistic <- Healthy.Lifestyle ~ Age + Female + Health.Percentile + Exercise.Sessions +
Weight
Note that there is no requirement for this formula to match the generating function used in the previous step. This can be helpful if the planned model does not include certain variables (e.g. unmeasured confounding variables).
Here we base our sample size calculation on the statistical power of a test of the coefficient of the Exercise.Sessions variable. (Specifically, this would be for a null hypothesis that its coefficient has the value zero.)
the.variable <- "Exercise.Sessions"
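conf.level = 0.95
model.type = "logistic"
seed = 41
vstr = 3.6
num.experiments = 200
n.start = 200
n.min = 1
n.max = 3000
increment = 100
stop.threshold = 1
# Target statistical power (0.8 is assumed here, matching the example value in the search description)
power = 0.8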
n.logistic = nRegression(the.steps = the.steps, num.experiments = num.experiments,
the.formula = the.formula.logistic, the.variable = the.variable, seed = seed,
n.start = n.start, n.min = n.min, n.max = n.max, increment = increment, stop.threshold = stop.threshold,
power = power, model.type = model.type, verbose = TRUE)
Note that using verbose = TRUE will provide information on the search as it proceeds, while verbose = FALSE will suppress this information.
The resulting variable n.logistic is an object containing a range of information:
These include the specified steps used to generate the data, the full set of simulated data for all of the experiments at that sample size, information about the statistical tests for each experiment, and information to analyze the simulation study across the range of experiments. (For more information, please see the documentation for the simitation package.)
From the earlier example, we can consider linear regression models of the Weight output as a function of the other variables. Then we can estimate the minimally viable sample size with only modest changes in the specifications:
the.formula.lm <- Weight ~ Age + Female + Health.Percentile + Exercise.Sessions +
Healthy.Lifestyle
model.type = "lm"
num.experiments <- 500
n.start = 500
n.max <- 10000
increment = 500
stop.threshold = 10
the.variable = "Healthy.LifestyleTRUE"
Note here that we are testing the coefficient associated with Healthy.Lifestyle as equal to TRUE (relative to a base case of FALSE). Since this variable is categorical, we must name the coefficient according to the model's output as specified by R's lm() function.
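A quick way to confirm the coefficient name is to fit the model once on a small data set and inspect the names of the coefficients. The toy data below are an illustrative assumption, and only base R functions are used.

```r
set.seed(41)
# Toy data with a logical predictor (illustrative values only).
toy <- data.frame(Healthy.Lifestyle = rep(c(TRUE, FALSE), times = 25),
                  Age = rnorm(50, mean = 45, sd = 10))
toy$Weight <- 150 + 0.5 * toy$Age - 2 * toy$Healthy.Lifestyle +
  rnorm(50, mean = 0, sd = 10)

fit <- lm(Weight ~ Age + Healthy.Lifestyle, data = toy)

# R appends the non-reference level to the variable name.
coef.names <- names(coef(fit))
```

Here `coef.names` contains "Healthy.LifestyleTRUE", which is the form of the name supplied as the.variable.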
Also note that we are using a less precise stopping threshold of 10. This will reduce the number of iterations before the search procedure halts.
We can then estimate the minimally viable sample size:
n.lm = nRegression(the.steps = the.steps, num.experiments = num.experiments, the.formula = the.formula.lm,
the.variable = the.variable, seed = seed, n.start = n.start, n.max = n.max, increment = increment,
stop.threshold = stop.threshold, power = power, model.type = model.type, verbose = TRUE)
n.lm$n
We can also see information on the iterations:
n.lm$iterations
Simulation-based methods are necessarily computationally intensive. The time required to generate an estimate will grow with the number of experiments in each simulation, the range of sample sizes, the selected starting point, and the stopping criteria. To reduce this computation time, we offer the following points of guidance:
You might first consider a smaller-scale calculation to narrow the range of potential sample sizes. For instance, you could reduce the number of experiments in each simulation and set a large stopping threshold (e.g. 50). After reviewing the outputs, the range of sample sizes can be revised. Then a larger-scale calculation, with more experiments per simulation and a smaller stopping threshold, can be conducted.
For instance, with the linear regression example, we might first conduct the following smaller-scale calculation:
n.lm = nRegression(the.steps = the.steps, num.experiments = 100, the.formula = the.formula.lm,
the.variable = the.variable, seed = seed, n.start = n.start, n.max = n.max, increment = increment,
stop.threshold = 50, power = power, model.type = model.type, verbose = FALSE)
n.lm$iterations
This would indicate to us that the sample size is approximately 860. To verify, we might consider a narrower range below and above this value, with a larger number of experiments and a small stopping threshold:
n.lm = nRegression(the.steps = the.steps, num.experiments = 500, the.formula = the.formula.lm,
the.variable = the.variable, seed = seed, n.start = 870, n.min = 800, n.max = 1200,
increment = increment, stop.threshold = 1, power = power, model.type = model.type,
verbose = FALSE)
n.lm$n
n.lm$iterations
When the established maximum is not sufficiently large, the search algorithm will return a sample size equal to the largest value it considered. If the previous example for linear regression had been restricted to a smaller range of sample sizes, then we would arrive at the following calculation:
n.lm = nRegression(the.steps = the.steps, num.experiments = num.experiments, the.formula = the.formula.lm,
the.variable = the.variable, seed = seed, n.start = n.start, n.max = 700, increment = increment,
stop.threshold = stop.threshold, power = power, model.type = model.type, verbose = TRUE)
n.lm$n
n.lm$power
n.lm$iterations
Therefore, it is important to double check the results to ensure the quality of the estimated sample size.
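One way to automate this check is sketched below. The helper `check_search_result` is hypothetical, and it assumes that the returned object stores the selected sample size in `n` and its estimated power in `power` as single numbers, as suggested by the calls above. The values in `mock.result` are made-up placeholders standing in for a real result.

```r
# Hypothetical helper: flag results that deserve a second look.
check_search_result <- function(result, target.power, n.max) {
  reached.target <- result$power >= target.power
  at.upper.bound <- result$n >= n.max
  list(reached.target = reached.target,
       at.upper.bound = at.upper.bound,
       # Review the result if the target was missed or the search hit its ceiling.
       needs.review = !reached.target || at.upper.bound)
}

# Placeholder values (not real output) mimicking a search capped at n.max = 700.
mock.result <- list(n = 700, power = 0.62)
check <- check_search_result(mock.result, target.power = 0.8, n.max = 700)
```

When `check$needs.review` is TRUE, a sensible next step is to rerun the search with a larger maximum sample size.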
The nRegression package offers a simulation-based method to perform sample size calculations for linear and logistic regression models. These tools are easy to use and flexible for use in many study designs. Simulation-based methods can offer a complement to analytic sample size calculations when they are available. They can also provide reasonable estimates when analytic methods cannot be employed. The nRegression method greatly simplifies the work of generating the intended simulations and searching for the minimally viable sample size. Simulation studies necessarily introduce some variability into the methods. Repeated calculations across a range of randomization seeds may be helpful. For practitioners, the methods of the nRegression package can extend the reach of simulation methods and reduce the required labor. Similar calculations can be performed for other statistical models, which can be an area for future research.