This package provides methods for fitting generalized linear models (GLMs) and generalized additive models (GAMs). Linear and additive regression are useful modeling approaches for real-valued response data. However, in many applications the response of interest is not continuous or even numerical in nature. For instance, results from polls typically have a binary response of either “yes” or “no”, and many engineering fields deal with discrete data values instead of a continuous response. In these situations, the assumption of Gaussian-distributed errors is not appropriate. Generalized regression methods model the expectation of the response under the assumption of an alternative, non-Gaussian data-generating distribution. They take advantage of the relative simplicity of a linear/additive modeling approach and apply it through the use of an appropriate link function. This vignette briefly overviews the fitting procedures for GLMs and GAMs implemented in this package and provides simple illustrative examples of the glam function in action.
Iteratively Reweighted Least Squares (IRLS; Chen and Shao (1993)) is a common approach for fitting GLMs. IRLS is an iterative procedure that uses Fisher scoring to search for the coefficients of a linear model that maximize the likelihood; at convergence it yields the maximum likelihood estimator (MLE) of the coefficients. What makes IRLS powerful is its relative simplicity and modularity. As described in chapter 6 of Hastie and Tibshirani (1990), employing IRLS requires only a few unique quantities. One is (obviously) the choice of link function, whose evaluation and derivative with respect to the expectation \(\mu\) are used in the algorithm’s search. We must also provide an analytical form for the variance of the response \(Y\), which depends on the choice of distributional family. The rest of the algorithm is the same for every model; so long as we have these quantities, we can implement this fitting approach for any GLM we please and will (eventually) converge to the MLE.
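To make the moving parts concrete, below is a minimal IRLS sketch for a Poisson GLM with the canonical log link, where \(g(\mu) = \log\mu\), \(g'(\mu) = 1/\mu\), and \(\text{Var}(Y) = \mu\). This illustrates the general technique only; glam’s internal implementation may differ.

irls_poisson <- function(X, y, tol = 1e-8, maxit = 100) {
  X <- cbind(1, X)                  # prepend an intercept column
  beta <- rep(0, ncol(X))
  for (i in seq_len(maxit)) {
    eta <- drop(X %*% beta)         # linear predictor
    mu  <- exp(eta)                 # inverse link
    z <- eta + (y - mu) / mu        # working response: eta + (y - mu) * g'(mu)
    w <- mu                         # weights: 1 / (Var(Y) * g'(mu)^2)
    beta_new <- drop(solve(t(X) %*% (w * X), t(X) %*% (w * z)))
    if (max(abs(beta_new - beta)) < tol) {
      beta <- beta_new
      break
    }
    beta <- beta_new
  }
  beta
}

Swapping in a different link and variance function is all it takes to fit another family, which is the modularity that the local scoring procedure below exploits as well.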
Although the structure of additive models differs from that of a linear model, their fitting procedure is remarkably close to IRLS. In addition to the smoothing functions we wish to employ for our fit (Hastie and Tibshirani (1990)), we need only specify the same quantities we needed for the GLM fit: the choice of link function, its derivative, and the variance of the response \(Y\). The local scoring procedure detailed in chapter 6 of Hastie and Tibshirani (1990) uses these quantities to iteratively fit additive models using a weighted backfitting approach. The backfitting algorithm is a simple approach to fitting additive models that is detailed in chapter 4 of Hastie and Tibshirani (1990).
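The sketch below illustrates the inner backfitting loop for a Gaussian additive model using smoothing splines (via base R’s smooth.spline); local scoring wraps a loop like this in IRLS-style reweighting. Again, this is a sketch of the general technique under simplifying assumptions, not glam’s implementation.

backfit <- function(X, y, tol = 1e-8, maxit = 100) {
  X <- as.matrix(X)
  p <- ncol(X)
  alpha <- mean(y)                       # intercept; the smooths are centered
  f <- matrix(0, nrow(X), p)             # fitted function values, one column per predictor
  for (it in seq_len(maxit)) {
    f_old <- f
    for (j in seq_len(p)) {
      # smooth the partial residuals against the j-th predictor
      r <- y - alpha - rowSums(f[, -j, drop = FALSE])
      f[, j] <- predict(smooth.spline(X[, j], r), X[, j])$y
      f[, j] <- f[, j] - mean(f[, j])    # center each smooth for identifiability
    }
    if (max(abs(f - f_old)) < tol) break
  }
  list(alpha = alpha, f = f, fitted = alpha + rowSums(f))
}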
The ‘GLAM’ library takes advantage of the IRLS and local scoring procedures above to fit commonly-used regression models on non-Gaussian response data. We wrap our methods into one easy-to-use function called glam:
There are four required arguments: a numeric matrix of covariate data \(X\), a vector of responses \(Y\), a choice of whether to fit a GLM or a GAM, and the distributional family with which to model the response. The optional arguments are whether or not to include an intercept term (default is TRUE), the tolerance at which convergence is declared (default is \(10^{-8}\)), and the maximum number of iterations allowed before convergence is deemed to have failed (default is \(100\)).
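A full call might therefore look like the following; note that the names of the optional arguments (intercept, tol, maxit) are assumptions for illustration, so consult the function’s help page for the actual signature.

fit <- glam(cbind(X, X^2), Y, model = "linear", family = "poisson",
            intercept = TRUE, tol = 1e-8, maxit = 100)   # optional-argument names assumed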
We provide some simple examples of how to use our function for four different kinds of response data.
We first generate some fake one-dimensional covariate data that the following examples will use to generate different responses.
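A setup chunk along the following lines is assumed throughout; the sample size and covariate distribution here are illustrative, as the examples only require a sorted one-dimensional \(X\) so that the fitted curves plot cleanly.

set.seed(1)           # for reproducibility (assumed)
n <- 100              # sample size (assumed)
X <- sort(runif(n))   # one-dimensional covariate on [0, 1]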
We start with beta regression, which is the primary contribution of this package given the lack of beta regression fitting tools currently available in R. Beta regression is appropriate for response data bounded between 0 and 1; this is common when dealing with proportion or percentage data, such as modeling the proportion of people who answered a poll question in a certain way.
To fit a beta GLM/GAM, we specify family = "beta" in our glam call. We then compare the performance of a GLM (model = "linear") with a GAM (model = "additive") on generated response data that ranges between \(0\) and \(1\) over the given domain of \(X\).
# Quadratic mean between 0.5 and 0.75 plus small Gaussian noise
# (Y stays inside (0, 1) with high probability)
Y <- (X - 0.5)^2 + 0.5 + rnorm(n, sd = 1e-1)
beta_glm <- glam(cbind(X, X^2), Y, model = "linear", family = "beta")
beta_gam <- glam(cbind(X, X^2), Y, model = "additive", family = "beta")
plot(X, Y, xlab = "X", ylab = "Y", main = "Beta Regression Example")
lines(X, beta_glm$mu, col = "red", lwd = 2)
lines(X, beta_gam$mu, col = "blue", lwd = 2)
legend("bottomright", legend = c("GLM", "GAM"), col = c("red", "blue"), lwd = 2)
For both models we plot the fitted mean \(\mu\) that is returned by glam. The estimated coefficients from the GLM fit are also returned.
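Assuming the fitted object stores them in a component named coefficients (a hypothetical name; inspect str(beta_glm) for the actual one), they could be printed with:

beta_glm$coefficients   # component name assumed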
Logistic regression is perhaps the most common instance of generalized regression. It is useful when the response data is binary in nature; for instance, many surveys have questions where the recorded answer is only a “yes” or a “no” rather than a numerical value. In the current iteration of the GLAM package only a GLM fit is available, but we hope to make a GAM version available soon.
To fit a GLM on some example binary data, we specify family = "binomial". Note that the response \(Y\) must be coded numerically with two unique values; in this example we code the response as either \(0\) for “failure” or \(1\) for “success”.
# Success probability is 0.9 in the outer thirds of the domain and 0 in the middle
Y <- rep(0, n)
Y[X <= 1/3] <- rbinom(sum(X <= 1/3), 1, 0.9)
Y[X >= 2/3] <- rbinom(sum(X >= 2/3), 1, 0.9)
log_glm <- glam(cbind(X, X^2), Y, model = "linear", family = "binomial")
plot(X, Y, xlab = "X", ylab = "Y", main = "Binomial Regression Example")
lines(X, log_glm$mu, col = "red", lwd = 2)
legend("bottomright", legend = c("GLM"), col = c("red"), lwd = 2)
Poisson regression is another common example of generalized modeling, often used when the response of interest is a count of some kind. For instance, we may wish to model the expected number of times a given event occurs under varying conditions.
Poisson GLMs and GAMs can be fit by setting family = "poisson":
# Count response whose rate is essentially 6*X, since the noise sd is tiny
Y <- rpois(n, 3*rnorm(n, mean = 2*X, sd = 1e-4))
pois_glm <- glam(cbind(X, X^2), Y, model = "linear", family = "poisson")
pois_gam <- glam(cbind(X, X^2), Y, model = "additive", family = "poisson")
#> Backfitting algorithm failed to converge in 100 iterations.
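As in the beta example, we can overlay the fitted means returned by glam on the data:

plot(X, Y, xlab = "X", ylab = "Y", main = "Poisson Regression Example")
lines(X, pois_glm$mu, col = "red", lwd = 2)
lines(X, pois_gam$mu, col = "blue", lwd = 2)
legend("topleft", legend = c("GLM", "GAM"), col = c("red", "blue"), lwd = 2)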
Finally, gamma regression is a common approach for modeling continuous, positive responses. Many situations produce response data that is inherently positive in nature, such as the heights or weights of individuals. While a normal approximation is often used in practice and typically works well, we may wish to use a model that enforces the appropriate positivity constraint on the domain of \(Y\).
To fit a gamma GLM or GAM, we simply specify family = "gamma":
# Positive response with mean 50*X (shape 10, scale proportional to X)
Y <- rgamma(n, shape = 10, scale = 5*X)
gamma_glm <- glam(cbind(X, X^2), Y, model = "linear", family = "gamma")
gamma_gam <- glam(cbind(X, X^2), Y, model = "additive", family = "gamma")
plot(X, Y, xlab = "X", ylab = "Y", main = "Gamma Regression Example")
lines(X, gamma_glm$mu, col = "red", lwd = 2)
lines(X, gamma_gam$mu, col = "blue", lwd = 2)
legend("bottomright", legend = c("GLM", "GAM"), col = c("red", "blue"), lwd = 2)
Virginia Tech Department of Statistics, ahcooper@vt.edu