Introduction to nRegression

Note: Simulation-based calculations of sample size necessarily entail a fair amount of computation. For vignettes of CRAN packages, the allowed computational limits are small. As a result, this vignette will demonstrate coding examples using nRegression without evaluation. To see the results of running the code, you can view the full output of these computations on the github page: https://github.com/srivastavbudugutta/nRegression

Introduction

Sample size calculations are fundamental to the design of many research studies. When tractable, analytic methods for calculating the sample size are preferred. However, many research studies have more complex designs. In these settings, simulation techniques may be used to generate data and test assumptions about statistical methods. The nRegression package was designed to estimate the minimal sample size required to attain a specific statistical power in the context of linear regression and logistic regression models through simulations.

Statistical Power and Sample Sizes in Simulations

For an assumed effect size and sample size, the statistical power is the probability of a true positive result for the study. Simulation studies can estimate the statistical power through repeated generation of studies with random sampling of data from a statistical model. The empirical proportion of simulated studies that reject the null hypothesis would then estimate the statistical power of the study’s design.

A Searching Algorithm

Because simulations begin with a specific sample size, it is relatively easy to calculate the statistical power. However, calculating the minimally viable sample size to achieve a given level of statistical power is less straightforward. The nRegression package utilizes a searching algorithm to estimate the statistical at a range of sample sizes. This algorithm includes a number of steps:

Select a specific statistical power (e.g. 0.8) for the study to achieve.
Select a minimum \(m\) and maximum value \(M\) for the sample size. This creates a range of possible values to explore.
Select a starting sample size \(n\).
Create a simulation study that estimates the statistical power for the study at sample size \(n\).
Reset the search space as follows:

If the estimated statistical power is at least the specified value (1), set the maximum value to the previous sample size. If the estimated statistical power is less than the specified value (1), set the minimum value to the previous sample size.
Then select a new sample size as the minimum of:
1. \(n\) plus a user selected increment \(k\) (e.g. 50);
2. The ceiling of the average of the current minimum and maximum values: \(\lceil(m+M)/2\rceil\).

This formula is: \(n_{new} = \min\left(n+k, \lceil{(m+M)/2}\rceil\right)\)

This selection is a modification of a binary search algorithm. For any decrease or small increase in the selected sample size, binary search is used. However, if the increase from a binary search is large, then a smaller increase \(k\) is used. This can help to reduce the amount of computation required for generating simulation studies with large sample sizes.

Steps 4 and 5 are repeated. Each time, the range from the minimum to the maximum possible sample size is reduced. When this range is smaller than a user-specified value \(w\) (e.g. 5), the algorithm will stop.

The resulting sample size will constitute the smallest value that exceeds the specified threshold. If no such value is detected, the sample size that empirically achieved the highest value of the statistical power will be returned.

The results will include information about the simulation’s estimates of the statistical power, range of values considered, and information from the full simulation study for the selected sample size.

Simulations with the simitation Package

The nRegression package is an extension of the simitation package, which can be used to generate simulations of studies involving linear and logistic regression relationships in data. The nRegression package involves specifying a number of quantities and inputs:

The preferred statistical power as a real number in \((0,1)\).
Information for simulating the multivariate data. This consists of an ordered specification of the variables and the generating model used for each variable the simulation.
The formula for the linear or logistic regression model that will be applied to the data.
The specific variable in the regression model for which the corresponding coefficient’s value will be statistically tested.
Other relevant inputs, such as the minimum \(m\) and maximum \(M\) sample sizes in the search, the starting value \(n\), the increment \(k\), the stopping threshold \(w\), the confidence level of the test, the number of simulated experiments \(B\), the type of model to fit (linear or logistic regression), and the randomization seed.

Then the nRegression method estimates the minimally viable sample size according to these specifications.

Examples


library(nRegression)

Logistic Regression

Following the steps of the previous section, we will specify the following quantities:

Statistical Power:

power = 0.9

Variables to Simulation: The intended simulation will include the following variables: Age, Female, Health.Percentile, Exercise.Sessions, Diet, Healthy.Lifestyle, and Weight. The simitation package allows for analytic specifications of these variables. We will generate Age as a Normal random variable, Female as a Bernoulli random variable, Health.Percentile as a uniform random variable, as Exercise.Sessions as a Poisson random variable. The Diet is sampled from a discrete categorical distribution of the values of ‘Light’, ‘Moderate’, and ‘Heavy’, each with associated probabilities:

step.age <- "Age ~ N(45, 10)"
step.female <- "Female ~ binary(0.53)"
step.health.percentile <- "Health.Percentile ~ U(0,100)"
step.exercise.sessions <- "Exercise.Sessions ~ Poisson(2)"
step.diet <- "Diet ~ sample(('Light', 'Moderate', 'Heavy'), (0.2, 0.45, 0.35))"

Then we can specify the Healthy.Lifestyle variable as a binary value with a logistic regression relationship as a function of Age, Female, Health.Percentile, Exercise.Sessions, and Diet:

step.healthy.lifestyle <- "Healthy.Lifestyle ~ logistic(log(0.45) - 0.1 * (Age -45) + 0.05 * Female + 0.01 * Health.Percentile + 0.5 * Exercise.Sessions - 0.1 * (Diet == 'Moderate') - 0.4 * (Diet == 'Heavy'))"

Then Weight can be specified as a continuous value according to a linear regression relationship with the other variables as inputs:

step.weight <- "Weight ~ lm(150 - 15 * Female + 0.5 * Age - 0.1 * Health.Percentile - 0.2 * Exercise.Sessions  + 5 * (Diet == 'Moderate') + 15 * (Diet == 'Heavy') - 2 * Healthy.Lifestyle + N(0, 10))"

Each of these steps is concatenated into a single character vector:

the.steps <- c(step.age, step.female, step.health.percentile, step.exercise.sessions,
    step.diet, step.healthy.lifestyle, step.weight)

The steps are sordered to describe which variables will be calculated. Note that the variables with a dependence on other variables (e.g. Healthy.Lifestyle) must be specified after all of its inputs.

The Model’s Formula: We will then specify the formula for the logistic regression of the Healthy.Lifestyle variable:

the.formula.logistic <- Healthy.Lifestyle ~ Age + Female + Health.Percentile + Exercise.Sessions +
    Weight

Note that there is no requirement for this formula to match the generating function used in the previous step. This can be helpful if the planned model does not include certain variables (e.g. unmeasured confounding variables).

The Variable to Test:

the.variable = "Exercise.Sessions"

Here we are basing our sample size calculation based upon the statistical power related to a test of the coefficient of the Exercise.Sessions variable. (Specifically, this would be for a null hypothesis that its coefficient has the value zero.)

Other Inputs: These are specified below using notation to match the nRegression function’s inputs.

conf.level = 0.95
model.type = "logistic"
seed = 41
vstr = 3.6
num.experiments = 200

n.start = 200
n.min = 1
n.max = 3000
increment = 100
stop.threshold = 1

Sample Size Calculation: We can then use the nRegression method:

n.logistic = nRegression(the.steps = the.steps, num.experiments = num.experiments,
    the.formula = the.formula.logistic, the.variable = the.variable, seed = seed,
    n.start = n.start, n.min = n.min, n.max = n.max, increment = increment, stop.threshold = stop.threshold,
    power = power, model.type = model.type, verbose = TRUE)

Note that using verbose = TRUE will provide information on the search as it proceeds, while verbose = FALSE will suppress this information.

The resulting variable n.logistic is an object containing a range of information:

names(n.logistic)

These include:

The selected sample size:

n.logistic$n

The empirical power of the selected sample size in the simulation study:

n.logistic$power

Information on each iteration of the search:

n.logistic$iterations

Full information on the simulation study for the selected sample size:

names(n.logistic$simstudy)

These include the specified steps used to generate the data, the full set of simulated data for all of the experiments at that sample size, information about the statistical tests for each experiment, and information to analyze the simulation study across the range of experiments. (For more information, please see the documentation for the simitation package.)

Linear Regression

From the earlier example, we can consider linear regression models of the Weight output as a function of the other variables. Then we can estimate the minimally viable sample size with only modest changes in the specifications:

the.formula.lm <- Weight ~ Age + Female + Health.Percentile + Exercise.Sessions +
    Healthy.Lifestyle
model.type = "lm"
num.experiments <- 500

n.start = 500
n.max <- 10000
increment = 500
stop.threshold = 10

the.variable = "Healthy.LifestyleTRUE"

Note here that we are testing the coefficient associated with Healthy.Lifestyle as equal to TRUE (relative to a base case of FALSE). Since this variable is categorical, we must name the coefficient according to the model’s output as specified by R’s glm() function.

Also note that we are using a less precise stopping threshold of 10. This will reduce the number of iterations before the search procedure halts.

We can then estimate the minimally viable sample size:

n.lm = nRegression(the.steps = the.steps, num.experiments = num.experiments, the.formula = the.formula.lm,
    the.variable = the.variable, seed = seed, n.start = n.start, n.max = n.max, increment = increment,
    stop.threshold = stop.threshold, power = power, model.type = model.type, verbose = TRUE)

n.lm$n

We can also see information on the iterations:

n.lm$iterations

Practical Considerations

Simulation-based methods are necessarily computationally intensive. The time required to generate an estimate will grow with the number of experiments in each simulation, the range of sample sizes, the selected starting point, and the stopping criteria. To reduce this computation time, we offer the following points of guidance:

You might first consider a smaller-scale calculation to narrow the range of potential sample sizes. For instance, you could reduce the number of experiments in each simulation and set a large stopping criteria (e.g. 50). After reviewing the outputs, the range of sample sizes can be revised. Then a larger scale calculation, with more experiments per simulation and a smaller stopping criteria, can be conducted.

For instance, with the linear regression example, we might first conduct the following smaller-scale calculation:

n.lm = nRegression(the.steps = the.steps, num.experiments = 100, the.formula = the.formula.lm,
    the.variable = the.variable, seed = seed, n.start = n.start, n.max = n.max, increment = increment,
    stop.threshold = 50, power = power, model.type = model.type, verbose = FALSE)

n.lm$iterations

This would indicate to us that the sample size is approximately 860. To verify, we might consider a more narrow range below and above this value, with a larger number of experiments and a small stopping threshold:

n.lm = nRegression(the.steps = the.steps, num.experiments = 500, the.formula = the.formula.lm,
    the.variable = the.variable, seed = seed, n.start = 870, n.min = 800, n.max = 1200,
    increment = increment, stop.threshold = 1, power = power, model.type = model.type,
    verbose = FALSE)

n.lm$n
n.lm$iterations

Checking for Viability

When the established maximum is not sufficiently large, the search algorithm will return a sample size equal to the largest value it considered. If the previous example for linear regression had been restricted to a smaller range of sample sizes, then we would arrive at the following calculation:

n.lm = nRegression(the.steps = the.steps, num.experiments = num.experiments, the.formula = the.formula.lm,
    the.variable = the.variable, seed = seed, n.start = n.start, n.max = 700, increment = increment,
    stop.threshold = stop.threshold, power = power, model.type = model.type, verbose = TRUE)

n.lm$n
n.lm$power
n.lm$iterations

Therefore, it is important to double check the results to ensure the quality of the estimated sample size.