The Art of Influence

A Practical Guide to Working with Influence Functions

Klaus Kähler Holst

Influence functions

Estimators that have parametric convergence rates can often be fully characterized by their influence function (IF), also referred to as an influence curve or canonical gradient (Bickel et al. 1998; Vaart 1998). The IF allows for the direct estimation of properties of the estimator, including its asymptotic variance. Moreover, estimates of the IF enable the simple combination and transformation of estimators into new ones. This vignette describes how to estimate and manipulate IFs using the R-package lava (Holst and Budtz-Jørgensen 2013).

Formally, let \(Z_1,\ldots,Z_n\) be iid \(k\)-dimensional stochastic variables, \(Z_i=(Y_{i},A_{i},W_{i})\sim P_{0}\), and let \(\widehat{\theta}\) be a consistent estimator of the parameter \(\theta\in\mathbb{R}^p\). When \(\widehat{\theta}\) is a regular and asymptotically linear (RAL) estimator, it has a unique iid decomposition \[\begin{align*} \sqrt{n}(\widehat{\theta}-\theta) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \operatorname{IC}(Z_i; P_{0}) + o_{P}(1), \end{align*}\] where the function \(\operatorname{IC}\) is the unique influence function s.t. \(\mathbb{E}\{\operatorname{IC}(Z_{i}; P_{0})\}=0\) and \(\mathbb{V}\!\text{ar}\{\operatorname{IC}(Z_{i}; P_{0})\}<\infty\) (Tsiatis 2006; Vaart 1998). The influence function thus fully characterizes the asymptotic behaviour of the estimator, and by the central limit theorem the estimator converges weakly to a Gaussian distribution \[ \sqrt{n}(\widehat{\theta}-\theta) \overset{\mathcal{D}}{\longrightarrow} \mathcal{N}(0, \mathbb{V}\!\text{ar}\{\operatorname{IC}(Z; P_{0})\}), \] and the empirical variance of the plugin estimate, \(\mathbb{P}_{n}\operatorname{IC}(Z; \widehat{P})^{\otimes 2} = \frac{1}{n}\sum_{i=1}^n \operatorname{IC}(Z_{i}; \widehat{P})\operatorname{IC}(Z_{i}; \widehat{P})^{\top}\), can be used to obtain a consistent estimate of the asymptotic variance. Note that in practice the estimate \(\widehat{P}\) used in the plugin estimate only needs to capture the parts of the distribution of \(Z\) that are necessary to evaluate the IF. In some cases this nuisance parameter can be estimated using flexible machine learning components, and in other (parametric) cases it can be derived directly from \(\widehat{\theta}\).
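
As a minimal illustration of the plugin variance formula (a sketch on simulated data, independent of the example data introduced below), the IF of a sample mean is simply the centred observation, and its empirical variance divided by \(n\) gives the squared standard error:

set.seed(1)
z <- rnorm(100)
ic <- z - mean(z)                ## plugin IF of the sample mean
sqrt(sum(ic^2) / length(z)^2)    ## robust standard error of the mean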

The IFs are easily derived for the parameters of many parametric statistical models, as illustrated in the next example sections. More generally, the IF can also be derived for a smooth target parameter \(\Psi: \mathcal{P}\to\mathbb{R}\), where \(\mathcal{P}\) is a family of probability distributions forming the statistical model, which often can be left completely non-parametric. Formally, the parameter must be pathwise differentiable (see Vaart 1998) in the sense that there exists a bounded linear function \(\dot\Psi\colon L_{2}(P_{0})\to\mathbb{R}\) such that \([\Psi(P_{t}) - \Psi(P_{0})]t^{-1} \to \dot\Psi(P_{0})(g)\) as \(t\to 0\) for any parametric submodel \(P_t\) with score \(g(z)= \partial/(\partial t) \log p_t(z)|_{t=0}\). Riesz’s representation theorem then tells us that this directional derivative has a unique representer, \(\phi_{P_{0}}\), lying in the closure of the submodel score space (the tangent space), s.t. \[\begin{align*} \dot\Psi(P_0)(g) = \langle\phi_{P_0}, g\rangle = \int \phi_{P_0}(z)g(z)\,dP_0(z). \end{align*}\] The unique representer is exactly the IF, which can be found by solving the above integral equation. For more details on how to derive influence functions, we refer to (Laan and Rose 2011; Hines et al. 2022).

As an example, we might be interested in the target parameter \(\Psi(P) = \mathbb{E}_P(Z)\), which can be shown to have the unique (and thereby efficient) influence function \(Z\mapsto Z-\mathbb{E}_P(Z)\) under the non-parametric model. Another target parameter could be \(\Psi_{a}(P) = \mathbb{E}_{P}[\mathbb{E}_{P}(Y\mid A=a, W)]\), which is often of key interest in causal inference and which has the IF \[\begin{align*} \operatorname{IC}(Y,A,W; P) = \frac{\mathbf{1}(A=a)}{\mathbb{P}(A=a\mid W)}(Y-\mathbb{E}_{P}[Y\mid A=a,W]) + \mathbb{E}_{P}[Y\mid A=a,W] - \Psi_{a}(P). \end{align*}\] See the section on average treatment effects.

Examples

To illustrate the methods we consider data arising from the model \(Y_{ij} \sim \operatorname{Bernoulli}\{\operatorname{expit}(X_{ij} + A_{i} + W_{i})\}\), \(A_{i} \sim \operatorname{Bernoulli}\{\operatorname{expit}(W_{i})\}\) with independent covariates \(X_{ij}\sim\mathcal{N}(0,1)\), \(W_{i}\sim\mathcal{N}(0,1)\).

m <- lvm() |>
  regression(y1 ~ x1 + a + w) |>
  regression(y2 ~ x2 + a + w) |>
  regression(y3 ~ x3 + a + w) |>
  regression(y4 ~ x4 + a + w) |>
  regression(a ~ w) |>
  distribution(~ y1 + y2 + y3 + y4 + a, value = binomial.lvm()) |>
  distribution(~id, value = Sequence.lvm(integer = TRUE))

We simulate from the model where \(Y_3\) is only observed for half of the subjects

n <- 4e2
dw <- sim(m, n, seed = 1) |>
  transform(y3 = y3 * ifelse(id > n / 2, NA, 1))
Print(dw)
#>     y1 x1       a w        y2 x2       y3 x3       y4 x4        id 
#> 1   1  -0.6265  1  1.0744  0  -1.08691 1  -1.5570  1   0.34419  1  
#> 2   1   0.1836  1  1.8957  1  -1.82608 1   1.9232  1   0.01272  2  
#> 3   0  -0.8356  0 -0.6030  0   0.99528 0  -1.8568  0  -0.87345  3  
#> 4   0   1.5953  0 -0.3909  1  -0.01186 0  -2.1061  1   0.34280  4  
#> 5   0   0.3295  1 -0.4162  0  -0.59963 1   0.6976  1  -0.17739  5  
#> ---                                                                
#> 396 0  -0.92431 0 -1.02939 0  -0.30825 NA -1.05037 0  -0.008056 396
#> 397 1   1.59291 1 -0.01093 1   0.01552 NA  1.63787 1   1.033784 397
#> 398 0   0.04501 0 -1.22499 0  -0.44232 NA -1.20733 0  -0.799127 398
#> 399 0  -0.71513 0 -2.59611 0  -1.63801 NA -2.62616 0   1.004233 399
#> 400 1   0.86522 1  1.16912 0  -0.64140 NA  0.01746 1  -0.311973 400
## Data in long format
dl <- mets::fast.reshape(dw, varying = c("y", "x")) |> na.omit()
Print(dl)
#>      a w      id  y x       num
#> 1    1 1.074  1   1 -0.6265 1  
#> 2    1 1.074  1   0 -1.0869 2  
#> 3    1 1.074  1   1 -1.5570 3  
#> 4    1 1.074  1   1  0.3442 4  
#> 5    1 1.896  2   1  0.1836 1  
#> ---                            
#> 1594 0 -2.596 399 0 -1.6380 2  
#> 1596 0 -2.596 399 0  1.0042 4  
#> 1597 1  1.169 400 1  0.8652 1  
#> 1598 1  1.169 400 0 -0.6414 2  
#> 1600 1  1.169 400 1 -0.3120 4

Example: population mean

The main functions for working with influence functions are estimate, IC, and merge.

The estimate function is the primary tool for obtaining parameter estimates and related information. It returns an object of class estimate, which is a general container for holding information about estimated parameters. The estimate function takes as input either a model object (the first argument x), or a parameter vector and corresponding influence function (IF) matrix specified using the coef and IF arguments. If the primary goal is to apply the delta method or test linear hypotheses, it is also possible to provide the asymptotic variance estimate via the vcov argument, without specifying the IF matrix.

estimate(x=, ...)
estimate(coef=, IF=, ...)
estimate(coef=, vcov=, ...)

Here we first consider the problem of estimating the IF of the mean. For a general transformation \(f: \mathbb{R}^k\to\mathbb{R}^p\) we have that \[ \sqrt{n}\{\mathbb{P}_{n}f(X) - \mathbb{E}[f(X)]\} = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} \{f(X_{i}) - \mathbb{E}[f(X)]\}, \] and hence for the problem of estimating the proportion of the binary outcome \(Y_1\), the IF is given by \(\mathbf{1}(Y_{1}=1) - \mathbb{P}(Y_{1}=1)\).

To estimate this parameter and its IF we will use the estimate function

inp <- as.matrix(dw[, c("y1", "y2")])
e <- estimate(inp[, 1, drop = FALSE], type="mean") 
class(e)
#> [1] "estimate"
e
#>    Estimate Std.Err   2.5%  97.5%    P-value
#> y1     0.61 0.02439 0.5622 0.6578 4.435e-138

The reported standard errors from the estimate method are the robust standard errors obtained from the IF. The variance estimate and the parameters can be extracted with the vcov and coef methods. The IF itself can be extracted with the IC (or influence) method:

IC(e) |> Print()
#>     y1   
#> 1    0.39
#> 2    0.39
#> 3   -0.61
#> 4   -0.61
#> 5   -0.61
#> ---      
#> 396 -0.61
#> 397  0.39
#> 398 -0.61
#> 399 -0.61
#> 400  0.39
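
These quantities are easily reproduced by hand; the standard error is simply the empirical standard deviation of the IF scaled by \(1/\sqrt{n}\):

p.hat <- mean(dw$y1)
ic <- dw$y1 - p.hat                               ## IF of the proportion
c(estimate = p.hat, std.err = sqrt(mean(ic^2) / nrow(dw)))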

It is also possible to simultaneously estimate the proportions of each of the two binary outcomes

estimate(inp)
#>    Estimate Std.Err   2.5%  97.5%    P-value
#> y1    0.610 0.02439 0.5622 0.6578 4.435e-138
#> y2    0.535 0.02494 0.4861 0.5839 4.316e-102

or alternatively the input can be a model object, here an mlm object:

e <- lm(cbind(y1, y2) ~ 1, data = dw) |>
  estimate()
IC(e) |> head()
#>   y1:(Intercept) y2:(Intercept)
#> 1           0.39         -0.535
#> 2           0.39          0.465
#> 3          -0.61         -0.535
#> 4          -0.61          0.465
#> 5          -0.61         -0.535
#> 6          -0.61          0.465

Different methods are available for inspecting an estimate object

summary(e)
#> Call: estimate.default(x = x, coef = pars(x))
#> ────────────────────────────────────────────────────────────────────────────────
#>                Estimate Std.Err   2.5%  97.5%    P-value
#> y1:(Intercept)    0.610 0.02439 0.5622 0.6578 4.435e-138
#> y2:(Intercept)    0.535 0.02494 0.4861 0.5839 4.316e-102
#> 
#>  Null Hypothesis: 
#>   [y1:(Intercept)] = 0
#>   [y2:(Intercept)] = 0 
#>  
#> chisq = 955.6986, df = 2, p-value < 2.2e-16
## extract parameter coefficients
coef(e)
#> y1:(Intercept) y2:(Intercept) 
#>          0.610          0.535
## Asymptotic (robust) variance estimate
vcov(e)
#>                y1:(Intercept) y2:(Intercept)
#> y1:(Intercept)     5.9475e-04   0.0000841250
#> y2:(Intercept)     8.4125e-05   0.0006219375
## Matrix with estimates and confidence limits
estimate(e, level = 0.99) |> parameter()
#>                Estimate    Std.Err      0.5%     99.5%       P-value
#> y1:(Intercept)    0.610 0.02438750 0.5471820 0.6728180 4.434692e-138
#> y2:(Intercept)    0.535 0.02493867 0.4707622 0.5992378 4.316104e-102
## Influence curve
IC(e) |> head()
#>   y1:(Intercept) y2:(Intercept)
#> 1           0.39         -0.535
#> 2           0.39          0.465
#> 3          -0.61         -0.535
#> 4          -0.61          0.465
#> 5          -0.61         -0.535
#> 6          -0.61          0.465
## Join estimates
e + e # Same as merge(e,e)
#>                  Estimate Std.Err   2.5%  97.5%    P-value
#> y1:(Intercept)      0.610 0.02439 0.5622 0.6578 4.435e-138
#> y2:(Intercept)      0.535 0.02494 0.4861 0.5839 4.316e-102
#> ────────────────                                          
#> y1:(Intercept).1    0.610 0.02439 0.5622 0.6578 4.435e-138
#> y2:(Intercept).1    0.535 0.02494 0.4861 0.5839 4.316e-102

Example: generalized linear model

For a \(Z\)-estimator defined by the score equation \(\mathbb{E}[U(Z; \theta)] = 0\), the IF is given by \[\begin{align*} \operatorname{IC}(Z; \theta) = -\mathbb{E}\Big\{\frac{\partial}{\partial\theta^\top}U(Z; \theta)\Big\}^{-1}U(Z; \theta). \end{align*}\] In particular, for a maximum likelihood estimator the score, \(U\), is given by the partial derivative of the log-likelihood function, and the leading factor is the inverse Fisher information.

As an example, we can obtain the estimates with robust standard errors for a logistic regression model:

g <- glm(y1 ~ a + x1, data = dw, family = binomial)
estimate(g)
#>             Estimate Std.Err    2.5%   97.5%   P-value
#> (Intercept)  -0.2687  0.1622 -0.5867 0.04931 9.772e-02
#> a             1.5595  0.2428  1.0835 2.03545 1.348e-10
#> x1            0.9728  0.1435  0.6916 1.25397 1.198e-11

We can compare that to the usual (non-robust) standard errors:

estimate(g, robust = FALSE)
#>             Estimate Std.Err    2.5%   97.5%   P-value
#> (Intercept)  -0.2687  0.1589 -0.5802 0.04281 9.091e-02
#> a             1.5595  0.2423  1.0846 2.03433 1.220e-10
#> x1            0.9728  0.1396  0.6992 1.24634 3.177e-12

The IF can be extracted from the estimate object or directly from the model object

IC(g) |> head()
#>   (Intercept)          a         x1
#> 1  0.09816353   3.715892 -0.8478763
#> 2 -0.08203584   2.562573  0.7085752
#> 3 -2.74896196   3.316937  1.8772112
#> 4 -6.78328520   4.052090 -9.0268560
#> 5  0.47533946 -11.818085 -4.1056905
#> 6 -2.77584564   3.340948  1.8677174
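
The sandwich structure of the formula above can be checked by hand: the IF is the per-observation score pre-multiplied by the inverse of the expected negative score derivative, which for a maximum likelihood estimator can be estimated by \(n\) times the model-based variance. A small sketch, assuming the standard logistic-regression score \(U_i = x_i\{Y_i - \operatorname{expit}(x_i^\top\theta)\}\):

X <- model.matrix(g)
U <- X * (dw$y1 - predict(g, type = "response"))  ## per-observation scores
IC.manual <- U %*% (vcov(g) * nrow(dw))           ## scores times the estimated 'bread'
head(IC.manual)                                   ## should agree with IC(g) above up to numerical error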

The same estimates can be obtained with a cumulative link regression model, which also generalizes to ordinal outcomes. Here we consider the proportional odds model given by \[\begin{align*} \log\left(\frac{\mathbb{P}(Y\leq j\mid x)}{1-\mathbb{P}(Y\leq j\mid x)}\right) = \alpha_{j} + \beta^{\top}x, \quad j=1,\ldots,J. \end{align*}\]

ordreg(y1 ~ a + x1, dw, family=binomial(logit)) |> estimate()
#>     Estimate Std.Err     2.5%  97.5%   P-value
#> 0|1   0.2687  0.1622 -0.04932 0.5867 9.772e-02
#> a     1.5595  0.2429  1.08349 2.0355 1.350e-10
#> x1    0.9728  0.1435  0.69157 1.2540 1.200e-11

Note that the sandwich::estfun function from the sandwich package (Zeileis, Köll, and Graham 2020) can also be used to extract the estimating functions (and thereby the IF) of many parametric models, but it does not provide the tools for combining and transforming these.

Example: right-censored outcomes

To illustrate the methods on survival data we will use the Mayo Clinic Primary Biliary Cholangitis Data (Therneau and Grambsch 2000)

library("survival")
data(pbc, package="survival")

The Cox proportional hazards model can be fitted with the mets::phreg method, which can estimate the IF for both the partial likelihood parameters and the baseline hazard. Here we fit a survival model with right-censored event times:

fit.phreg <- mets::phreg(Surv(time, status > 0) ~ age + sex, data = pbc)
fit.phreg
#> Call:
#> mets::phreg(formula = Surv(time, status > 0) ~ age + sex, data = pbc)
#> 
#>    n events
#>  418    186
#> coeffients:
#>        Estimate       S.E.    dU^-1/2 P-value
#> age   0.0220977  0.0070372  0.0072712  0.0017
#> sexf -0.2999507  0.2022144  0.2097533  0.1380
#> 
#> exp(coeffients):
#>      Estimate    2.5%  97.5%
#> age   1.02234 1.00834 1.0365
#> sexf  0.74085 0.49843 1.1012
IC(fit.phreg) |> head()
#>           age       sexf
#> 1  0.12691175  2.9551968
#> 2 -0.16011629 -4.3755455
#> 3  0.19322595 -8.1786480
#> 4  0.04668109  0.8548021
#> 5 -0.22936186  0.9721761
#> 6  0.07015171  0.5644414

The IF for the baseline cumulative hazard at a specific time point \[\begin{align*} \Lambda_0(t) = \int_0^t \lambda_0(u)\,du, \end{align*}\] where \(\lambda_0(t)\) is the baseline hazard, can be estimated in a similar way:

baseline <- function(object, time, ...) {
  ic <- mets::IC(object, baseline = TRUE, time = time, ...)
  est <- mets::predictCumhaz(object$cumhaz, new.time = time)[1, 2]
  estimate(NULL, coef = est, IC = ic, labels = paste0("chaz:", time))
}
tt <- 2000
baseline(fit.phreg, tt)
#>           Estimate Std.Err    2.5%  97.5% P-value
#> chaz:2000    0.178 0.07597 0.02913 0.3269 0.01911

The estimate and IF methods are also available for parametric survival models via survival::survreg, here a Weibull model:

survival::survreg(Surv(time, status > 0) ~ age + sex, data = pbc, dist="weibull") |>
  estimate()
#>             Estimate  Std.Err     2.5%     97.5%    P-value
#> (Intercept)  9.02697 0.382437  8.27741  9.776530 3.521e-123
#> age         -0.01919 0.006362 -0.03166 -0.006723  2.554e-03
#> sexf         0.28170 0.174338 -0.06000  0.623392  1.061e-01
#> scale        0.87751 0.070693  0.73896  1.016067  2.220e-35

Example: random effects model / structural equation model

General structural equation models (SEMs) can be estimated with lava::lvm. Here we fit a random effects probit model \[ \mathbb{P}(Y_{ij} = 1 \mid U_{i}, W_{i})=\Phi(\mu_{j} + \beta_{j} W_{i} + U_{i}), \quad U_{i}\sim\mathcal{N}(0,\sigma_{u}^{2}),\quad j=1,2 \] to the simulated dataset

sem <- lvm(y1 + y2 ~ 1 * u + w) |>
  latent(~ u) |>
  ordinal(K=2, ~ y1 + y2)
semfit <- estimate(sem, data = dw)

## Robust standard errors
estimate(semfit)
#>      Estimate Std.Err    2.5%    97.5%   P-value
#> y2   -0.21038 0.09391 -0.3944 -0.02631 2.508e-02
#> u     0.36025 0.06659  0.2297  0.49076 6.293e-08
#> y1~w  0.55425 0.06931  0.4184  0.69009 1.272e-15
#> y2~w  0.59388 0.07510  0.4467  0.74107 2.623e-15
#> u~~u -0.09496 0.07360 -0.2392  0.04929 1.970e-01

Example: quantile

Let \(\beta\) denote the \(\tau\)th quantile of \(X\), with IF \[\begin{align*} \operatorname{IC}(x; P_{0}) = \{\tau - \mathbf{1}(x\leq \beta)\}f_{0}(\beta)^{-1}, \end{align*}\]

where \(f_{0}\) is the density function of \(X\).

To calculate the variance estimate, an estimate of the density is thus needed, which can be obtained with a kernel estimator. Alternatively, the resampling method of Zeng and Lin (2008) can be applied. Here we use a kernel smoother (additional arguments to the estimate function are passed on to stats::density.default) to estimate the quantiles and IFs for the 25%, 50%, and 75% quantiles of \(W\) and \(X_1\)

eq <- estimate(dw[, c("w", "x1")], type = "quantile", probs = c(0.25, 0.5, 0.75))
eq
#>        Estimate Std.Err    2.5%    97.5%   P-value
#> w.25%  -0.81214 0.07270 -0.9546 -0.66965 5.649e-29
#> w.50%  -0.11062 0.07195 -0.2516  0.03040 1.242e-01
#> w.75%   0.67716 0.07778  0.5247  0.82960 3.133e-18
#> x1.25% -0.57510 0.06334 -0.6992 -0.45096 1.090e-19
#> x1.50% -0.02664 0.06073 -0.1457  0.09238 6.609e-01
#> x1.75%  0.69590 0.06690  0.5648  0.82703 2.437e-25
IC(eq) |> head()
#>          w.25%     w.50%      w.75%     x1.25%    x1.50%     x1.75%
#> [1,] 0.8394734  1.438982  2.6942304 -2.1941660 -1.214564 -0.7725337
#> [2,] 0.8394734  1.438982  2.6942304  0.7313887  1.214564 -0.7725337
#> [3,] 0.8394734 -1.438982 -0.8980768 -2.1941660 -1.214564 -0.7725337
#> [4,] 0.8394734 -1.438982 -0.8980768  0.7313887  1.214564  2.3176012
#> [5,] 0.8394734 -1.438982 -0.8980768  0.7313887  1.214564 -0.7725337
#> [6,] 0.8394734 -1.438982 -0.8980768 -2.1941660 -1.214564 -0.7725337
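
The reported standard error of, e.g., the median of \(W\) can be reproduced by hand from the IF formula above by plugging in a kernel density estimate (a sketch; the result should be close to the w.50% standard error reported above, with small differences possible due to bandwidth choices):

tau <- 0.5
beta.hat <- quantile(dw$w, tau)
dens <- density(dw$w)                                ## kernel density estimate
f.beta <- approx(dens$x, dens$y, xout = beta.hat)$y  ## density evaluated at the median
ic <- (tau - (dw$w <= beta.hat)) / f.beta            ## estimated IF
sqrt(mean(ic^2) / nrow(dw))                          ## standard error of the median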

Combining influence functions

A key benefit of working with the IFs of estimators is that this allows for transforming or combining different estimates while easily deriving the resulting IF, and thereby the asymptotic distribution, of the new estimator.

Let \(\widehat{\theta}_{1}, \ldots, \widehat{\theta}_{M}\) be \(M\) different estimators with decompositions \[\begin{align*} \sqrt{n}(\widehat{\theta}_{m}-\theta_{m}) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} \operatorname{IC}_m(Z_i; P_{0}) + o_{P}(1) \end{align*}\] based on iid data \(Z_1,\ldots,Z_n\). It then follows immediately (Vaart 1998, Theorem 18.10[vi]) that the joint distribution of \(\widehat{\theta} - {\theta}= (\widehat{\theta}_{1}^{\top},\ldots,\widehat{\theta}_{M}^{\top})^\top- ({\theta}_{1}^{\top},\ldots,{\theta}_{M}^{\top})^\top\) is given by \[\begin{align*} \sqrt{n}(\widehat{\theta}-\theta) &= \frac{1}{\sqrt{n}}\sum_{i=1}^{n} \underbrace{[\operatorname{IC}_{1}(Z_i; P_{0})^\top,\ldots,\operatorname{IC}_{M}(Z_i; P_{0})^\top]^{\top}}_{\overline{\operatorname{IC}}(Z_i; P_{0})} + o_{P}(1) \\ &\overset{\mathcal{D}}{\longrightarrow}\mathcal{N}(0,\Sigma) \end{align*}\] by the CLT, and under regularity conditions \(\mathbb{P}_{n}\overline{\operatorname{IC}}(Z; \widehat{P})^{\otimes 2} \overset{P}{\longrightarrow}\Sigma\) as \(n\to\infty\).

To illustrate this we consider two marginal logistic regression models fitted separately for \(Y_1\) and \(Y_2\) and combine the estimates and IFs using the merge method

g1 <- glm(y1 ~ a, family=binomial, data=dw)
g2 <- glm(y2 ~ a, family=binomial, data=dw)
e <- merge(g1, g2)
summary(e)
#> Call: estimate.default(x = NULL, data = NULL, id = id, stack = FALSE, 
#>     IC = ic0, keep = keep, coef = coefs)
#> ────────────────────────────────────────────────────────────────────────────────
#>               Estimate Std.Err    2.5%    97.5%   P-value
#> (Intercept)    -0.1861  0.1442 -0.4688  0.09655 1.969e-01
#> a               1.3239  0.2173  0.8981  1.74978 1.105e-09
#> ─────────────                                            
#> (Intercept).1  -0.6168  0.1505 -0.9117 -0.32185 4.152e-05
#> a.1             1.5060  0.2148  1.0849  1.92712 2.385e-12
#> 
#>  Null Hypothesis: 
#>   [(Intercept)] = 0
#>   [a] = 0
#>   [(Intercept).1] = 0
#>   [a.1] = 0 
#>  
#> chisq = 96.4362, df = 4, p-value < 2.2e-16
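
Since g1 and g2 are fitted on the same subjects, the joint IF is simply the column-wise stacking of the two marginal IFs, and the joint variance is again the empirical variance of the stacked IF (a quick check; the differences should be numerically zero):

ic.joint <- cbind(IC(g1), IC(g2))                   ## stacked IFs (same subjects, same order)
range(ic.joint - IC(e))                             ## should be numerically zero
range(crossprod(ic.joint) / nrow(dw)^2 - vcov(e))   ## reproduces the joint variance estimate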

As we have access to the joint asymptotic distribution, we can for example test whether the odds ratio is the same for the two responses:

estimate(e, cbind(0,1,0,-1), null=0)
#>             Estimate Std.Err    2.5%  97.5% P-value
#> [a] - [a.1]  -0.1821  0.3003 -0.7707 0.4065  0.5443
#> 
#>  Null Hypothesis: 
#>   [a] - [a.1] = 0

More details can be found in the section on hypothesis testing.

Imbalanced data

Let \(O_{1} = (Z_{1}R_{1}, R_{1}), \ldots, O_{N}=(Z_{N}R_{N}, R_{N})\) be iid with \(R_{i}\!\perp\!\!\!\!\perp\!Z_i\), and let the full-data IF for some estimator of a parameter \(\theta\in\mathbb{R}^p\) be \(\operatorname{IC}(\cdot; P_{0})\). For convenience let the data be ordered such that \(R_{i}=\mathbf{1}(i\leq n)\), where \(n\) is the number of observed data points; then the complete-case estimator is consistent and based on the same IF \[\begin{align*} \sqrt{n}(\widehat{\theta}-\theta) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \operatorname{IC}(Z_i; P_{0}) + o_{P}(1). \end{align*}\] This estimator can also be decomposed in terms of the observed data \(O_1,\ldots,O_N\) by noting that \[\begin{align*} \sqrt{N}(\widehat{\theta}-\theta) = \frac{1}{\sqrt{N}}\sum_{i=1}^N \operatorname{IC}(Z_i; P_{0})\frac{R_i N}{n} + o_{P}(1), \end{align*}\] where the term \(\frac{R_i N}{n}\) corresponds to an inverse probability weighting with the empirical plugin estimate of the proportion of observed data \(R=1\). Under a missing completely at random assumption we can therefore combine estimators that are estimated on different datasets. Let the observed data be \((Z_{11}R_{11}, R_{11}, Z_{21}R_{21}, R_{21}), \ldots, (Z_{1N}R_{1N}, R_{1N}, Z_{2N}R_{2N}, R_{2N})\) with complete-case estimators \(\widehat{\theta}_1\) and \(\widehat{\theta}_2\) of the parameters \(\theta_1\) and \(\theta_2\) based on \((Z_{11}R_{11}, \ldots, Z_{1N}R_{1N})\) and \((Z_{21}R_{21}, \ldots, Z_{2N}R_{2N})\), respectively, and let the corresponding IFs be \(\operatorname{IC}_{1}(\cdot; P_{0})\) and \(\operatorname{IC}_{2}(\cdot; P_{0})\). It then follows that \[\begin{align*} \sqrt{N}\left\{ \begin{pmatrix} \widehat{\theta}_1 \\ \widehat{\theta}_2 \end{pmatrix} - \begin{pmatrix} \vphantom{\widehat{\theta}_1}\theta_1 \\ \vphantom{\widehat{\theta}_1}\theta_2 \end{pmatrix} \right\} = \frac{1}{\sqrt{N}}\sum_{i=1}^N \begin{pmatrix} \operatorname{IC}_1(Z_{1i}; P_{0})\frac{R_{1i}N}{R_{1\bullet}} \\ \operatorname{IC}_2(Z_{2i}; P_{0})\frac{R_{2i}N}{R_{2\bullet}} \end{pmatrix} + o_{P}(1) \end{align*}\] with \(R_{k\bullet} = \sum_{i=1}^{N}R_{ki}.\) Returning to the example, we can combine the marginal estimates of two model objects that have been fitted on different datasets (as the outcome \(Y_3\) is only available for half of the subjects). Here we will use the overloaded + operator

g2 <- glm(y2 ~ 1, family = binomial, data = dw)
summary(g2)
#> 
#> Call:
#> glm(formula = y2 ~ 1, family = binomial, data = dw)
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)
#> (Intercept)   0.1402     0.1002   1.399    0.162
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 552.56  on 399  degrees of freedom
#> Residual deviance: 552.56  on 399  degrees of freedom
#> AIC: 554.56
#> 
#> Number of Fisher Scoring iterations: 3
dwc <- na.omit(dw) 
g3 <- glm(y3 ~ 1, family = binomial, data = dwc)
summary(g3)
#> 
#> Call:
#> glm(formula = y3 ~ 1, family = binomial, data = dwc)
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)  
#> (Intercept)   0.2615     0.1426   1.833   0.0668 .
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 273.87  on 199  degrees of freedom
#> Residual deviance: 273.87  on 199  degrees of freedom
#> AIC: 275.87
#> 
#> Number of Fisher Scoring iterations: 3

e2 <- estimate(g2, id = dw$id)
e3 <- estimate(g3, id = "id", data=dwc)

merge(e2,e3) |> IC() |> Print()
#>     (Intercept) (Intercept).1
#> 1   -2.151       3.540       
#> 2    1.869       3.540       
#> 3   -2.151      -4.598       
#> 4    1.869      -4.598       
#> 5   -2.151       3.540       
#> ---                          
#> 396 -2.151       0.000       
#> 397  1.869       0.000       
#> 398 -2.151       0.000       
#> 399 -2.151       0.000       
#> 400 -2.151       0.000
vcov(e2 + e3)
#>               (Intercept) (Intercept).1
#> (Intercept)   0.010049191   0.002102598
#> (Intercept).1 0.002102598   0.020342639
## Same marginals as
list(vcov(e2), vcov(e3))
#> [[1]]
#>             (Intercept)
#> (Intercept)  0.01004919
#> 
#> [[2]]
#>             (Intercept)
#> (Intercept)  0.02034264
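
The \(N/n\) inflation factor appearing in the decomposition above can be verified directly: the non-zero entries of the merged IF equal the complete-case IF of g3 scaled by \(N/n = 400/200\) (a quick check):

head(IC(e3)) * nrow(dw) / nrow(dwc)   ## matches the first rows of the merged '(Intercept).1' column above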

Note that it is also possible to directly specify the id variables in the merge call:

merge(e2, e3, id = list(dw$id, dwc$id))
#>               Estimate Std.Err     2.5%  97.5% P-value
#> (Intercept)     0.1402  0.1002 -0.05625 0.3367 0.16186
#> ─────────────                                         
#> (Intercept).1   0.2615  0.1426 -0.01807 0.5410 0.06676

In the above example the id argument defines the identifier that makes it possible to link the rows of the different IFs that should be glued together. If omitted, the id will automatically be extracted from the model-specific IC method (deriving it from the original data.frame used for estimating the model). This works automatically for all models and IC methods described in this document.

estimate(g2) |>
  IC() |> head()
#>   (Intercept)
#> 1   -2.150532
#> 2    1.869154
#> 3   -2.150532
#> 4    1.869154
#> 5   -2.150532
#> 6    1.869154
vcov(estimate(g2) + estimate(g3))
#>               (Intercept) (Intercept).1
#> (Intercept)   0.010049191   0.002102598
#> (Intercept).1 0.002102598   0.020342639
(estimate(g2) + estimate(g3)) |>
  (rownames %++% head %++% IC)()
#> [1] "1"   "10"  "100" "101" "102" "103"

To force the id variables to be treated as non-overlapping between the merged model objects, i.e., to assume complete independence between the estimates, the argument id=NULL can be used

merge(g1, g2, id = NULL) |> (Print %++% IC)()
#>     (Intercept) a          (Intercept).1
#> 1   -5.047e-15   5.128e+00  0.000e+00   
#> 2   -5.047e-15   5.128e+00  0.000e+00   
#> 3   -7.547e+00   7.547e+00  0.000e+00   
#> 4   -7.547e+00   7.547e+00  0.000e+00   
#> 5    1.673e-14  -1.600e+01  0.000e+00   
#> ---                                     
#> 796  0.000       0.000      3.738       
#> 797  0.000       0.000      3.738       
#> 798  0.000       0.000     -4.301       
#> 799  0.000       0.000     -4.301       
#> 800  0.000       0.000     -4.301
merge(g1, g2, id = NULL) |> vcov()
#>                 (Intercept)             a (Intercept).1
#> (Intercept)    2.079760e-02 -2.079760e-02  2.700089e-29
#> a             -2.079760e-02  4.720777e-02 -1.552978e-25
#> (Intercept).1  2.700089e-29 -1.552978e-25  1.004919e-02

Renaming and subsetting parameters

To keep only a subset of the parameters, the keep argument can be used.

merge(g1, g2, keep = c("(Intercept)", "(Intercept).1"))
#>               Estimate Std.Err     2.5%   97.5% P-value
#> (Intercept)    -0.1861  0.1442 -0.46876 0.09655  0.1969
#> (Intercept).1   0.1402  0.1002 -0.05625 0.33671  0.1619

The argument can be given either as a character vector or a vector of indices:

merge(g1,g2, keep=c(1, 3))
#>               Estimate Std.Err     2.5%   97.5% P-value
#> (Intercept)    -0.1861  0.1442 -0.46876 0.09655  0.1969
#> (Intercept).1   0.1402  0.1002 -0.05625 0.33671  0.1619

or as a vector of Perl-style regular expressions:

merge(g1, g2, keep = "cept", regex = TRUE)
#>               Estimate Std.Err     2.5%   97.5% P-value
#> (Intercept)    -0.1861  0.1442 -0.46876 0.09655  0.1969
#> (Intercept).1   0.1402  0.1002 -0.05625 0.33671  0.1619
merge(g1, g2, keep = c("\\)$", "^a$"), regex = TRUE, ignore.case = TRUE)
#>             Estimate Std.Err    2.5%   97.5%   P-value
#> (Intercept)  -0.1861  0.1442 -0.4688 0.09655 1.969e-01
#> a             1.3239  0.2173  0.8981 1.74978 1.105e-09

When merging estimates, unique parameter names are created. It is also possible to rename the parameters with the labels argument:

merge(g1, g2, labels = c("a", "b", "c")) |> estimate(keep = c("a", "c"))
#>   Estimate Std.Err     2.5%   97.5% P-value
#> a  -0.1861  0.1442 -0.46876 0.09655  0.1969
#> c   0.1402  0.1002 -0.05625 0.33671  0.1619
merge(g1, g2,
      labels = c("a", "b", "c"),
      keep = c("a", "c")
)
#>   Estimate Std.Err     2.5%   97.5% P-value
#> a  -0.1861  0.1442 -0.46876 0.09655  0.1969
#> c   0.1402  0.1002 -0.05625 0.33671  0.1619
estimate(g1, labels=c("a", "b"))
#>   Estimate Std.Err    2.5%   97.5%   P-value
#> a  -0.1861  0.1442 -0.4688 0.09655 1.969e-01
#> b   1.3239  0.2173  0.8981 1.74978 1.105e-09

Finally, the subset argument can be used to subset the parameters and IFs before the actual merging is done:

merge(g1, g2, subset="(Intercept)")
#>               Estimate Std.Err     2.5%   97.5% P-value
#> (Intercept)    -0.1861  0.1442 -0.46876 0.09655  0.1969
#> ─────────────                                          
#> (Intercept).1   0.1402  0.1002 -0.05625 0.33671  0.1619

Clustered data (non-iid case)

Let \(Z_i = (Z_{i1},\ldots,Z_{iN_{i}})\) and assume that \((Z_{i}, N_{i}) \sim P\), \(i=1,\ldots,n\) are iid and \(N_i\!\perp\!\!\!\!\perp\! Z_{ij}\). The variables \(Z_{i1},\ldots,Z_{iN_{i}}\) are assumed to be exchangeable but not necessarily independent. Define \(N = \sum_{i=1}^{n} N_i\), and assume that a parameter estimate \(\widehat{\theta}\in\mathbb{R}^p\) has the decomposition \[ \sqrt{N}(\widehat{\theta}-\theta) = \frac{1}{\sqrt{N}} \sum_{i=1}^{n} \sum_{k=1}^{N_{i}} \operatorname{IC}(Z_{ik}; P_{0}) + o_{P}(1). \] It then follows that \[ \sqrt{n}(\widehat{\theta}-\theta) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \widetilde{\operatorname{IC}}(Z_{i}; P_{0}) + o_{P}(1) \] with \(\widetilde{\operatorname{IC}}(Z_{i}; P_{0}) = \sum_{k=1}^{N_{i}} \frac{n}{N}\operatorname{IC}(Z_{ik}; P_{0})\), \(i=1,\ldots,n\), which are iid and therefore admit the usual CLT argument for deriving the asymptotic variance of \(\widehat{\theta}\). Turning back to the example data, we can estimate the marginal model

g0 <- glm(y ~ a + w + x, data = dl, family = binomial())

The naive asymptotic variance estimate, which ignores that observations within the same cluster are dependent, is not consistent. Instead we can calculate cluster-robust standard errors from the above iid decomposition:

estimate(g0, id=dl$id)
#>             Estimate Std.Err    2.5%   97.5%   P-value
#> (Intercept)  -0.1147 0.09351 -0.2979 0.06862 2.201e-01
#> a             1.0178 0.13016  0.7627 1.27288 5.303e-15
#> w             0.9825 0.07421  0.8370 1.12791 5.278e-40
#> x             0.9485 0.07835  0.7949 1.10203 9.976e-34
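
The cluster-robust variance can also be reconstructed by hand by summing the observation-level IF within clusters before taking the empirical variance (a sketch; this should reproduce the standard errors above up to possible small-sample corrections):

ic <- IC(g0)                                     ## observation-level IF (one row per row of dl)
ic.cluster <- rowsum(ic, dl$id)                  ## sum the IF within clusters
sqrt(diag(crossprod(ic.cluster) / nrow(dl)^2))   ## cluster-robust standard errors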

We can confirm that this is equivalent to the variance estimates obtained from a GEE marginal model with a working independence correlation structure (Halekoh, Højsgaard, and Yan 2006)

gee0 <- geepack::geeglm(y ~ a + w + x, data = dl, id = dl$id, family=binomial)
summary(gee0)
#> 
#> Call:
#> geepack::geeglm(formula = y ~ a + w + x, family = binomial, data = dl, 
#>     id = dl$id)
#> 
#>  Coefficients:
#>             Estimate  Std.err    Wald Pr(>|W|)    
#> (Intercept) -0.11466  0.09351   1.504     0.22    
#> a            1.01777  0.13016  61.145 5.33e-15 ***
#> w            0.98246  0.07421 175.250  < 2e-16 ***
#> x            0.94845  0.07835 146.523  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Correlation structure = independence 
#> Estimated Scale Parameters:
#> 
#>             Estimate Std.err
#> (Intercept)   0.9538   0.098
#> Number of clusters:   400  Maximum cluster size: 4

Computational aspects

Working with large and potentially multiple different IFs can be memory-intensive. A remedy is to aggregate the IFs within a coarser, randomly assigned grouping variable. Following the same arguments as in the previous section, the aggregated IF is still iid and allows us to estimate the asymptotic variance. Obviously, the same grouping must be used across estimates when combining IFs.

set.seed(1)
y <- cbind(rnorm(1e5))
N <- 2e2 ## Number of aggregated groups, the number of observations in the new IF
id <- foldr(nrow(y), N, list=FALSE)
Print(cbind(table(id)))
#>     [,1]
#> 1   500 
#> 2   500 
#> 3   500 
#> 4   500 
#> 5   500 
#> ---     
#> 196 500 
#> 197 500 
#> 198 500 
#> 199 500 
#> 200 500

## Aggregated IF
e <- estimate(cbind(y), id = id) 
object.size(e)
#> 18992 bytes
e
#>     Estimate Std.Err      2.5%    97.5% P-value
#> p1 -0.002244 0.00332 -0.008751 0.004263  0.4991
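
For comparison, the unaggregated estimate can be computed directly; the point estimate is identical, the variance estimate should be close but not identical (aggregation only preserves it in expectation), and the stored IF, and hence the object, is considerably larger:

e0 <- estimate(cbind(y))   ## no aggregation: the full 1e5 x 1 IF is kept
object.size(e0)
c(vcov(e0), vcov(e))       ## similar, but not identical, variance estimates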

IF building blocks: transformations and the delta theorem

Let \(\phi\colon \mathbb{R}^p\to\mathbb{R}^m\) be differentiable at \(\theta\) and assume that \(\widehat{\theta}_n\) is RAL estimator with IF given by \(\operatorname{IC}(\cdot; P_{0})\) such that \[\begin{align*} \sqrt{n}(\widehat{\theta}_n - \theta) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \operatorname{IC}(Z_i; P_{0}) + o_{P}(1), \end{align*}\] then by the delta method (Vaart 1998 Theorem 3.1) \[\begin{align*} \sqrt{n}\{\phi(\widehat{\theta}_n) - \phi(\theta)\} = \frac{1}{\sqrt{n}}\sum_{i=1}^n \nabla\phi(\theta)\operatorname{IC}(Z_i; P_{0}) + o_{P}(1), \end{align*}\]

where \(\phi\colon \theta\mapsto (\phi_{1}(\theta),\ldots,\phi_{m}(\theta))^\top\) and \(\nabla\) is the partial derivative operator \[\begin{align*} \nabla\phi(\theta) = \begin{pmatrix} \tfrac{\partial}{\partial\theta_1}\phi_1(\theta) & \cdots & \tfrac{\partial}{\partial\theta_p}\phi_1(\theta) \\ \vdots & \ddots & \vdots \\ \tfrac{\partial}{\partial\theta_1}\phi_m(\theta) & \cdots & \tfrac{\partial}{\partial\theta_p}\phi_m(\theta) \\ \end{pmatrix}. \end{align*}\]

Together with the ability to derive the joint IF from marginal IFs, this provides us with a powerful tool for constructing new estimates using the IFs as the fundamental building blocks.

To apply the delta method, the function transforming the parameters must be supplied to the estimate method (argument f)

estimate(g1, sum)
#>    Estimate Std.Err   2.5% 97.5%   P-value
#> p1    1.138  0.1625 0.8193 1.456 2.532e-12
estimate(g1, function(p) list(a = sum(p))) # named list
#>   Estimate Std.Err   2.5% 97.5%   P-value
#> a    1.138  0.1625 0.8193 1.456 2.532e-12
## Multiple parameters
estimate(g1, function(x) c(x, x[1] + exp(x[2]), inv = 1 / x[2]))
#>               Estimate Std.Err    2.5%   97.5%   P-value
#> (Intercept)    -0.1861  0.1442 -0.4688 0.09655 1.969e-01
#> a               1.3239  0.2173  0.8981 1.74978 1.105e-09
#> (Intercept).1   3.5721  0.7289  2.1435 5.00062 9.539e-07
#> inv.a           0.7553  0.1240  0.5124 0.99828 1.105e-09
estimate(g1, exp)         
#>             Estimate Std.Err   2.5% 97.5%   P-value
#> (Intercept)   0.8302  0.1197 0.5955 1.065 4.087e-12
#> a             3.7582  0.8166 2.1578 5.359 4.175e-06
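
The last example can be verified by hand: with \(\phi = \exp\) applied element-wise, the gradient is the diagonal matrix of the transformed parameters, and the delta-method variance follows directly (a small sketch):

p <- coef(g1)
D <- diag(exp(p))                                  ## gradient of the element-wise exp
V <- D %*% vcov(estimate(g1)) %*% t(D)             ## delta-method (robust) variance
cbind(estimate = exp(p), std.err = sqrt(diag(V)))  ## compare with estimate(g1, exp) above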

The gradient can be provided as the attribute grad of the function; otherwise numerical differentiation is applied.

Example: Pearson correlation

As a simple toy example, consider the problem of estimating the covariance of two variables \(X_1\) and \(X_2\), \[\begin{align*} \widehat{\mathbb{C}\!\text{ov}}(X_1,X_2) = \mathbb{P}_n(X_1-\mathbb{P}_n X_1)(X_2-\mathbb{P}_n X_2). \end{align*}\] It is easily verified that the IF of the sample estimate of \((\mathbb{E}X_{1}, \mathbb{E} X_{2}, \mathbb{E}\{X_{1}X_{2}\})^\top\) is given by \(\operatorname{IC}(X_1,X_2; P_{0}) = (X_{1}-\mathbb{E}X_{1}, X_{2}-\mathbb{E}X_{2}, X_{1}X_{2}-\mathbb{E}\{X_{1}X_{2}\})^\top\). By the delta theorem with \(\phi(x,y,z) = z-xy\) we have \(\nabla\phi(x,y,z) = (-y, -x, 1)\), and thus the IF for the sample covariance estimate becomes \[\begin{align*} \operatorname{IC}_{x_1, x_2}(X_1, X_2; P_{0}) = (X_1 - \mathbb{E}X_1)(X_2 - \mathbb{E}X_2) - \mathbb{C}\!\text{ov}(X_1,X_2). \end{align*}\]

We can implement this directly using the estimate function via the IC argument, which allows us to provide a user-specified IF, with the point estimate given by the coef argument:

Cov <- function(x, y, ...) {
  est <- mean(x * y) - mean(x) * mean(y)
  estimate(
    coef = est,
    IC = (x - mean(x)) * (y - mean(y)) - est,
    ...
  )
}
with(dw, Cov(x1, x2))
#>  Estimate Std.Err     2.5%  97.5% P-value
#>  0.004043 0.04976 -0.09349 0.1016  0.9352

As an illustration we could also derive this estimate from simpler building blocks of \(\mathbb{E}X_{1}\), \(\mathbb{E}X_{2}\), and \(\mathbb{E}(X_{1}X_{2})\).

e1 <- lm(cbind(x1, x2, x1 * x2) ~ 1, data = dw) |>
  estimate(labels = c("Ex1", "Ex2", "Ex1x2"))
e1
#>        Estimate Std.Err     2.5%   97.5% P-value
#> Ex1    0.038089 0.04842 -0.05681 0.13298  0.4315
#> Ex2   -0.037026 0.05187 -0.13869 0.06464  0.4753
#> Ex1x2  0.002633 0.05003 -0.09541 0.10068  0.9580
estimate(e1, function(x) c(x, cov=with(as.list(x), Ex1x2 - Ex2* Ex1)))
#>        Estimate Std.Err     2.5%   97.5% P-value
#> Ex1    0.038089 0.04842 -0.05681 0.13298  0.4315
#> Ex2   -0.037026 0.05187 -0.13869 0.06464  0.4753
#> Ex1x2  0.002633 0.05003 -0.09541 0.10068  0.9580
#> cov    0.004043 0.04976 -0.09349 0.10158  0.9352

The variances can be estimated in the same way, and the combined estimates can then be used to estimate the correlation:

e2 <- with(dw, Cov(x1, x2, labels = "c", id = id) +
               Cov(x1, x1, labels = "v1", id = id) +
               Cov(x2, x2, labels = "v2", id = id))
rho <- estimate(e2, function(x) list(rho = x[1] / (x[2] * x[3])^.5))
rho
#>     Estimate Std.Err     2.5%  97.5% P-value
#> rho 0.004025 0.04953 -0.09306 0.1011  0.9352

By using a variance stabilizing transformation, Fisher’s \(z\)-transform (Lehmann and Romano 2023), \(z = \operatorname{arctanh}(\widehat{\rho}) = \frac{1}{2}\log\left(\frac{1+\widehat{\rho}}{1-\widehat{\rho}}\right)\), confidence limits with generally better coverage can be obtained:

estimate(rho, atanh, back.transform = tanh)
#>     Estimate Std.Err     2.5%  97.5% P-value
#> rho 0.004025         -0.09279 0.1008  0.9352

The confidence limits are calculated on the \(\operatorname{arctanh}\)-scale and transformed back to the original correlation scale via the back.transform argument. In this case, where the estimate is far from the boundary of the parameter space, the variance stabilizing transform has almost no impact, and the confidence limits agree with the original symmetric confidence limits.

Linear contrasts and hypothesis testing

An important special case of parameter transformations is linear transformations. A particular interest lies in testing null hypotheses of the form \[\begin{align*} H_0\colon\quad \mathbf{B}\theta = \mathbf{b}_0, \end{align*}\]

where \(\mathbf{B}\in\mathbb{R}^{m\times p}\) is a matrix of estimable contrasts and \(\mathbf{b}_0\in\mathbb{R}^{m}\).

As an example, consider marginal models for the binary response variables \(Y_1, Y_2, Y_3\):

g <- lapply(
  list(y1 ~ a, y2 ~ a, y3 ~ a), #, y4 ~ a+x4),
  function(f) glm(f, family = binomial, data = dw)
)
gg <- Reduce(merge, g)
gg
#>               Estimate Std.Err    2.5%    97.5%   P-value
#> (Intercept)    -0.1861  0.1442 -0.4688  0.09655 1.969e-01
#> a               1.3239  0.2173  0.8981  1.74978 1.105e-09
#> ─────────────                                            
#> (Intercept).1  -0.6168  0.1505 -0.9117 -0.32185 4.152e-05
#> a.1             1.5060  0.2148  1.0849  1.92712 2.385e-12
#> ─────────────                                            
#> (Intercept).2  -0.2938  0.2063 -0.6982  0.11064 1.545e-01
#> a.2             1.1047  0.2962  0.5242  1.68515 1.914e-04

A linear transformation can be specified by supplying a matrix as the f argument instead of a function:

B <- cbind(0,1, 0,-1, 0,0)
estimate(gg, B)
#>             Estimate Std.Err    2.5%  97.5% P-value
#> [a] - [a.1]  -0.1821  0.3003 -0.7707 0.4065  0.5443
#> 
#>  Null Hypothesis: 
#>   [a] - [a.1] = 0

The \(\mathbf{b}_0\) vector (by default assumed to be zero) can be specified via the null argument:

estimate(gg, B, null=1)
#>             Estimate Std.Err    2.5%  97.5%   P-value
#> [a] - [a.1]  -0.1821  0.3003 -0.7707 0.4065 8.281e-05
#> 
#>  Null Hypothesis: 
#>   [a] - [a.1] = 1

For testing multiple hypotheses we use that \((\mathbf{B}\widehat{\theta}-\mathbf{b}_0)^{\top} (\mathbf{B}\widehat{\Sigma}\mathbf{B}^{\top})^{-1} (\mathbf{B}\widehat{\theta}-\mathbf{b}_0)\) asymptotically follows a \(\chi^2_{\operatorname{rank}(\mathbf{B})}\) distribution under the null hypothesis, where \(\widehat{\Sigma}\) is the estimated variance of \(\widehat{\theta}\) (i.e., vcov(gg)):

B <- rbind(cbind(0,1, 0,-1, 0,0),
           cbind(0,1, 0,0, 0,-1))
estimate(gg, B)
#>             Estimate Std.Err    2.5%  97.5% P-value
#> [a] - [a.1]  -0.1821  0.3003 -0.7707 0.4065  0.5443
#> [a] - [a.2]   0.2192  0.3637 -0.4936 0.9321  0.5466
#> 
#>  Null Hypothesis: 
#>   [a] - [a.1] = 0
#>   [a] - [a.2] = 0 
#>  
#> chisq = 1.343, df = 2, p-value = 0.5109
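
The reported \(\chi^2\) statistic can be reproduced by hand from the joint robust variance via the quadratic form above (a quick check):

bhat <- B %*% coef(gg)                       ## estimated contrasts (b0 = 0 here)
V <- B %*% vcov(gg) %*% t(B)
chisq <- drop(t(bhat) %*% solve(V) %*% bhat)
c(chisq = chisq, p.value = pchisq(chisq, df = nrow(B), lower.tail = FALSE))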

Such linear statistics can also be specified directly as expressions of the parameter names

estimate(gg, a + a.1, 2*a - a.2, a, null=c(2,1,1))
#>              Estimate Std.Err   2.5% 97.5%  P-value
#> [a] + [a.1]     2.830  0.3107 2.2210 3.439 0.007557
#> 2[a] - [a.2]    1.543  0.5208 0.5224 2.564 0.296991
#> [a]             1.324  0.2173 0.8981 1.750 0.135985
#> 
#>  Null Hypothesis: 
#>   [a] + [a.1] = 2
#>   2[a] - [a.2] = 1
#>   [a] = 1 
#>  
#> chisq = 7.557, df = 3, p-value = 0.05612

We refer to the functions lava::contr and lava::parsedesigns for defining contrast matrices.

contr(list(1, c(1, 2), c(1, 4)), n = 5)
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    1    0    0    0    0
#> [2,]    1   -1    0    0    0
#> [3,]    1    0    0   -1    0

A particularly useful contrast is the following, for considering all pairwise comparisons of different exposure estimates:

pairwise.diff(3)
#>      [,1] [,2] [,3]
#> [1,]    1   -1    0
#> [2,]    1    0   -1
#> [3,]    0    1   -1
estimate(gg, pairwise.diff(3), null=c(1,1,1), use=c(2,4,6))
#>               Estimate Std.Err    2.5%  97.5%   P-value
#> [a] - [a.1]    -0.1821  0.3003 -0.7707 0.4065 8.281e-05
#> [a] - [a.2]     0.2192  0.3637 -0.4936 0.9321 3.182e-02
#> [a.1] - [a.2]   0.4013  0.3506 -0.2858 1.0885 8.773e-02
#> 
#>  Null Hypothesis: 
#>   [a] - [a.1] = 1
#>   [a] - [a.2] = 1
#>   [a.1] - [a.2] = 1 
#>  
#> chisq = 11.96, df = 2, p-value = 0.002523

When conducting multiple tests, each at a nominal level \(\alpha\), the overall type I error is not controlled at level \(\alpha\).
The influence function also allows for adjusting for multiple comparisons. Let \(Z_{1},\ldots,Z_{p}\) denote the \(Z\)-statistics of \(p\) distinct two-sided hypothesis tests, which under the null hypothesis we assume are asymptotically jointly zero-mean Gaussian with correlation matrix \(R.\) Let \(Z_{\max} = \max_{i=1,\ldots,p} |Z_i|\); then the family-wise error rate (FWER) under the null can be approximated by \[ P(Z_{\max} > z) = 1-\int_{-z}^{z} \cdots \int_{-z}^{z} \phi_{R}(x_{1},\ldots,x_{p}) \,dx_{1}\cdots\,dx_{p}, \] where \(\phi_{R}\) is the multivariate normal density function with mean 0 and variance given by the correlation matrix \(R\). The adjusted \(p\)-value corresponding to an unadjusted \(p\)-value \(p_{i}\) can then be calculated as \[P\{Z_{\max} > \Phi^{-1}(1-p_{i}/2)\},\] where \(\Phi\) is the standard Gaussian CDF. As described in (Pipper, Ritz, and Bisgaard 2012), the joint distribution of \(Z_{1},\ldots,Z_{p}\) can be estimated from the IFs. This is implemented in the p.correct method

gg0 <- estimate(gg, use="^a", regex=TRUE, null=rep(.8, 3))
p.correct(gg0)
#>       Estimate  P-value Adj.P-value
#> [a]      1.324 0.015891    0.046870
#> [a.1]    1.506 0.001015    0.003043
#> [a.2]    1.105 0.303571    0.661497
#> attr(,"adjusted.significance.level")
#> [1] 0.01698
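
The underlying calculation can be sketched by hand using the estimated correlation of the \(Z\)-statistics and a multivariate normal probability (a sketch, assuming the mvtnorm package is available; results may differ slightly due to the numerical integration):

z <- abs(coef(gg0) - 0.8) / sqrt(diag(vcov(gg0)))   ## Z-statistics for H0: theta = 0.8
R <- cov2cor(vcov(gg0))                             ## estimated correlation of the Z-statistics
## adjusted p-value for the first test, P(Z_max > |z_1|):
1 - mvtnorm::pmvnorm(lower = rep(-z[1], length(z)), upper = rep(z[1], length(z)), corr = R)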

While this always yields a more powerful test than a Bonferroni adjustment, a generally more powerful closed-testing procedure (Marcus, Peritz, and Gabriel 1976) can be obtained by considering all intersection hypotheses.

Figure: closed testing via Wald tests of all intersection hypotheses.

To reject \(H_{1}\) while controlling the FWER at level \(\alpha\), we should test all intersection hypotheses involving \(H_{1}\) and check whether they are all rejected at the \(\alpha\)-level. The adjusted \(p\)-values can here be obtained as the maximum \(p\)-value across all the composite hypothesis tests. Unfortunately, this only works for relatively few comparisons, as the number of tests grows exponentially.

closed.testing(gg0)
#>     Estimate   P-value Adj.P-value
#> a      1.324 1.105e-09    0.033802
#> a.1    1.506 2.385e-12    0.003413
#> a.2    1.105 1.914e-04    0.303571

Averaging

Some parameters of interest are expressed as averages over functions of the observed data and estimated parameters of a model. The asymptotic distribution can in some of these cases also be derived from the influence function. Let \(Z_{1},\ldots,Z_{n}\) be iid observations, \(Z_{1}\sim P_{0}\) and let \(X_{i}\subset Z_{i}\).

Assume that \(\widehat{\theta}\) is a RAL estimator of \(\theta\in\Omega\subset \mathbb{R}^{p}\), \[\sqrt{n}(\widehat{\theta}-\theta) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\phi(Z_{i}; P_{0}) + o_{P}(1), \] and let \(f:\mathcal{X}\times\Omega\to\mathbb{R}\) be continuously differentiable in \(\theta\); then \[\sqrt{n}\{f(X; \widehat{\theta})-f(X; \theta)\} = \frac{1}{\sqrt{n}}\nabla_{\theta}f(X;\theta)\sum_{i=1}^{n}\phi(Z_{i}; P_{0}) + o_{P}(1). \]

Let \(\Psi = P_{0}f(X;\theta)\) and \(\widehat{\Psi} = P_{n} f(X;\widehat{\theta})\), where \(P_{0}\) and \(P_{n}\) here denote integration with respect to \(X\). It is easily verified that \[\begin{align*} \widehat{\Psi}-\Psi &= (P_{n}-P_{0})(f(X; \theta)-\Psi) + P_{0}[f(X;\widehat{\theta})-f(X;\theta)] \\ &\quad + (P_{n}-P_{0})[f(X;\widehat{\theta})-f(X;\theta)]. \end{align*}\]

From Lemma 19.24 (Vaart 1998) it follows that for the last term \[\begin{align*} \sqrt{n}(P_{n}-P_{0})[f(X;\widehat{\theta})-f(X;\theta)] = o_{P}(1) \end{align*}\] when \(f\) for example is Lipschitz and more generally when \(f(X;\theta)\) forms a \(P_{0}\)-Donsker class.

It therefore follows that \[\begin{align*} \sqrt{n}(\widehat{\Psi}-\Psi) &= \sqrt{n}P_{n}(f(X; \theta)-\Psi) + \frac{1}{\sqrt{n}}P_{0}\nabla_{\theta}f(X;\theta)\sum_{i=1}^{n}\phi(Z_{i}; P_{0}) + o_{P}(1) \\ &= \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\{f(X_{i};\theta)-\Psi\} + \frac{1}{\sqrt{n}}P_{0}\nabla_{\theta}f(X;\theta)\sum_{i=1}^{n}\phi(Z_{i}; P_{0}) + o_{P}(1). \end{align*}\] Hence the IF for \(\widehat{\Psi}\) becomes \[ \operatorname{IC}(Z; P_{0}) = f(X;\theta)-\Psi + [P_{0}\nabla_{\theta}f(X;\theta)]\phi(Z; P_{0}). \]

Turning back to the example, we can estimate the logistic regression model \(\operatorname{logit}(\mathbb{E}\{Y_1 \mid A,X_1,W\}) = \beta_0 + \beta_a A + \beta_{x_1} X_1 + \beta_w W\), and from this we want to estimate the target parameter \[ \theta(P) = \mathbb{E}_{P}[\mathbb{E}(Y_{1}\mid A=1, X_{1}, W)]. \] To do this we first fit the model and then define a function that returns the predicted probability \(\mathbb{P}(Y_{1}=1\mid A=1,X_{1},W)\) for the observed values of \(X_1, W\) but with the treatment variable \(A\) kept fixed at the value \(1\)

g <- glm(y1 ~ a + x1 + w, data=dw, family=binomial)
pr <- function(p, data, ...)
  with(data, expit(p[1] + p["a"] + p["x1"]*x1 + p["w"]*w))
pr(coef(g), dw) |> head()
#> [1] 0.8307 0.9651 0.4234 0.9337 0.7669 0.4834

The target parameter can now be estimated with the syntax

id <- foldr(NROW(dw), 100, list=FALSE)
ea <- estimate(g, pr, average=TRUE, id=id)
ea
#>     Estimate Std.Err   2.5%  97.5%    P-value
#> val   0.7006 0.02966 0.6425 0.7587 2.561e-123
IC(ea) |> head()
#>        val
#> 1 -0.41685
#> 2 -0.22444
#> 3  0.34679
#> 4  0.03143
#> 5  0.02478
#> 6 -0.61997

Example: Average Treatment Effects
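
The targeted::cate function estimates the average potential outcomes E[y1(1)] and E[y1(0)] together with their difference (the (Intercept) row), based on an IF of the form given for \(\Psi_{a}(P)\) in the introduction: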

a1 <- targeted::cate(a ~ 1,
                     data = dw,
                     response_model = y1 ~ x1+w+a,
                     propensity_model = a ~ x1*w
                     )
a1
#>             Estimate Std.Err    2.5%  97.5%   P-value
#> E[y1(1)]      0.6983 0.03500 0.62970 0.7669 1.502e-88
#> E[y1(0)]      0.5375 0.03587 0.46725 0.6078 8.911e-51
#> ───────────                                          
#> (Intercept)   0.1608 0.04696 0.06871 0.2528 6.189e-04
IC(a1) |> head()
#>   E[y1(1)] E[y1(0)] (Intercept)
#> 1   0.3927  0.02231     0.37042
#> 2   0.2923  0.36519    -0.07294
#> 3  -0.2022 -0.71701     0.51486
#> 4   0.2210 -1.18578     1.40680
#> 5  -1.5965 -0.03709    -1.55939
#> 6  -0.2002 -0.82485     0.62466
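
For comparison, the E[y1(1)] estimate can be sketched directly from the IF formula in the introduction, with the nuisance models fitted by glm (a rough check; the result should be close to, but not necessarily identical with, the estimate above):

or.fit <- glm(y1 ~ x1 + w + a, data = dw, family = binomial)   ## outcome regression
ps.fit <- glm(a ~ x1 * w, data = dw, family = binomial)        ## propensity model
q1  <- predict(or.fit, newdata = transform(dw, a = 1), type = "response")
pi1 <- predict(ps.fit, type = "response")
mean(dw$a / pi1 * (dw$y1 - q1) + q1)                           ## AIPW estimate of E[y1(1)]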

SessionInfo

sessionInfo()
#> R version 4.3.2 (2023-10-31)
#> Platform: aarch64-apple-darwin22.6.0 (64-bit)
#> Running under: macOS Sonoma 14.3.1
#> 
#> Matrix products: default
#> BLAS:   /Users/kkzh/.asdf/installs/R/4.3.2/lib/R/lib/libRblas.dylib 
#> LAPACK: /Users/kkzh/.asdf/installs/R/4.3.2/lib/R/lib/libRlapack.dylib;  LAPACK version 3.11.0
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: Europe/Copenhagen
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] survival_3.5-7 lava_1.8.0    
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyr_1.3.0          sass_0.4.7           utf8_1.2.4          
#>  [4] future_1.33.1        generics_0.1.3       futile.options_1.0.1
#>  [7] lattice_0.22-5       pracma_2.4.4         listenv_0.9.1       
#> [10] digest_0.6.34        magrittr_2.0.3       evaluate_0.23       
#> [13] grid_4.3.2           mvtnorm_1.2-4        fastmap_1.1.1       
#> [16] jsonlite_1.8.8       Matrix_1.6-5         backports_1.4.1     
#> [19] graph_1.80.0         formatR_1.14         targeted_0.5        
#> [22] purrr_1.0.2          fansi_1.0.6          Rgraphviz_2.46.0    
#> [25] codetools_0.2-19     numDeriv_2016.8-1.1  jquerylib_0.1.4     
#> [28] cli_3.6.2            rlang_1.1.3          futile.logger_1.4.3 
#> [31] mets_1.3.4           parallelly_1.37.1    future.apply_1.11.1 
#> [34] splines_4.3.2        geepack_1.3.9        cachem_1.0.8        
#> [37] yaml_2.3.7           tools_4.3.2          parallel_4.3.2      
#> [40] nloptr_2.0.3         dplyr_1.1.3          optimx_2023-10.21   
#> [43] lambda.r_1.2.4       globals_0.16.2       BiocGenerics_0.48.0 
#> [46] broom_1.0.5          vctrs_0.6.5          R6_2.5.1            
#> [49] stats4_4.3.2         lifecycle_1.0.4      MASS_7.3-60         
#> [52] pkgconfig_2.0.3      timereg_2.0.5        progressr_0.14.0    
#> [55] bslib_0.5.1          pillar_1.9.0         data.table_1.15.2   
#> [58] glue_1.7.0           Rcpp_1.0.12          tidyselect_1.2.0    
#> [61] xfun_0.41            tibble_3.2.1         highr_0.10          
#> [64] knitr_1.45           htmltools_0.5.6.1    rmarkdown_2.25      
#> [67] compiler_4.3.2

Bibliography

Bickel, Peter J., Chris A. J. Klaassen, Ya’Acov Ritov, and Jon A. Wellner. 1998. Efficient and Adaptive Estimation for Semiparametric Models. Springer.
Halekoh, Ulrich, Søren Højsgaard, and Jun Yan. 2006. “The R Package Geepack for Generalized Estimating Equations.” Journal of Statistical Software 15 (2): 1–11. https://doi.org/10.18637/jss.v015.i02.
Hines, Oliver, Oliver Dukes, Karla Diaz-Ordaz, and Stijn Vansteelandt. 2022. “Demystifying Statistical Learning Based on Efficient Influence Functions.” The American Statistician, February, 1–13. https://doi.org/10.1080/00031305.2021.2021984.
Holst, K. K., and E. Budtz-Jørgensen. 2013. “Linear Latent Variable Models: The Lava-Package.” Computational Statistics 28 (4): 1385–1452. https://doi.org/10.1007/s00180-012-0344-y.
Laan, Mark J. van der, and Sherri Rose. 2011. Targeted Learning. Causal Inference for Observational and Experimental Data. Springer. https://doi.org/10.1007/978-1-4419-9782-1.
Lehmann, E. L., and Joseph P. Romano. 2023. Testing Statistical Hypotheses. Fourth. Springer Texts in Statistics. New York: Springer. https://doi.org/10.1007/978-3-030-70578-7.
Marcus, Ruth, Eric Peritz, and K. R. Gabriel. 1976. “On Closed Testing Procedures with Special Reference to Ordered Analysis of Variance.” Biometrika 63 (3): 655–60. https://doi.org/10.1093/biomet/63.3.655.
Pipper, Christian B., Christian Ritz, and Hans Bisgaard. 2012. “A Versatile Method for Confirmatory Evaluation of the Effects of a Covariate in Multiple Models.” Journal of the Royal Statistical Society C, Appl. Statist. 61 (2): 315–26. https://doi.org/10.1111/j.1467-9876.2011.01005.x.
Therneau, T., and P. Grambsch. 2000. Modeling Survival Data: Extending the Cox Model. Springer-Verlag, New York. https://doi.org/10.1007/978-1-4757-3294-8.
Tsiatis, A. 2006. Semiparametric Theory and Missing Data. Springer Series in Statistics. Springer New York. https://doi.org/10.1007/0-387-37345-4.
Vaart, A. W. van der. 1998. Asymptotic Statistics. Cambridge University Press, Cambridge. https://doi.org/10.1017/CBO9780511802256.
Zeileis, Achim, Susanne Köll, and Nathaniel Graham. 2020. “Various Versatile Variances: An Object-Oriented Implementation of Clustered Covariances in R.” Journal of Statistical Software 95 (1): 1–36. https://doi.org/10.18637/jss.v095.i01.
Zeng, Donglin, and D. Y. Lin. 2008. “Efficient Resampling Methods for Nonsmooth Estimating Functions.” Biostatistics 9 (2): 355–63. https://doi.org/10.1093/biostatistics/kxm034.