---
title: "What is the Delta Method?"
author: "Alex Gold, Nat Olin, Annie Wang"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Delta Method}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

There are a number of ways to compute the standard errors for the
[margins](usage.html) of a regression. It might be possible to derive a 
probability density function for the margin itself, but doing so is often
painful, and a closed form may not even exist. It is also possible to use
simulation or bootstrapping to create standard errors for the margin.

In this package, we follow 
[Stata's margins command](https://www.stata.com/help.cgi?margins) and use the
delta method, which is a semi-parametric method that takes advantage of a 
closed-form solution to $\frac{d(\text{link}^{-1}(X \beta))}{d(X \beta)}$ to 
improve computational time relative to simulation or bootstrap methods.

# The Delta Method

The delta method is a general method for deriving the variance of a function of
asymptotically normal random variables with known variance. In this case, the
delta method takes advantage of the fact that the margin is (usually) an
infinitely differentiable function of the data, $X$, and the vector of $\beta$s
to derive a closed-form solution for the standard errors of the
margin.^["Usually" because this statement requires that the canonical link
function for the regression has a closed-form derivative. Luckily, this is true
for most common forms of linear regression.]

In particular, the delta method uses a 
[Taylor series](https://en.wikipedia.org/wiki/Taylor_series) expansion of the 
inverse link function of the regression to approximate the margin in the 
neighborhood of $X$ and the $\beta$s and derive variation near that point.

## A Reminder about Taylor Series

The [Taylor expansion](https://en.wikipedia.org/wiki/Taylor_series) is a useful 
tool because it allows us to restate a differentiable function, $G(x)$, in 
terms of (an infinite sum of) the derivatives of $G(x)$. To be more precise, 
an infinitely differentiable $G(x)$ can be expanded around a point $a$ as

\[G(x) = G(a) + \frac{G'(a)}{1!}(x - a) + \frac{G''(a)}{2!}(x-a)^2 + 
\frac{G'''(a)}{3!}(x-a)^3 + \dots\]

If we cut off the expansion after some number of terms (two is common), 
we can get a useful approximation of $G(x)$.
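
For instance, here is a quick numeric sketch (nothing to do with the package 
itself) of a second-order Taylor approximation of $\exp(x)$ around $a = 0$:

```{r}
# Second-order Taylor approximation of exp(x) around a = 0:
# exp(x) ~ exp(0) + exp(0) * x + (exp(0) / 2) * x^2
taylor_exp <- function(x) 1 + x + x^2 / 2

# The approximation is very good close to the expansion point and
# degrades as we move away from it
x <- c(0.01, 0.1, 0.5, 1)
rbind(true = exp(x), approximation = taylor_exp(x))
```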

## Taylor Series and the Delta Method

In the case of predicted margins (levels), where the regression model has 
[link function](https://en.wikipedia.org/wiki/Generalized_linear_model#Link_function)
$\text{link}$^[The exact form of the link function and its inverse will depend 
on the type of regression. For example, the logit function is the canonical 
link function for logistic regression and allows transformations between 
probabilities and log-odds.], the column vector of predicted levels $P_m$ 
at covariates $X_1$ is 

\[P_m(X_1 \beta) = \text{link}^{-1}(X_1 \beta)\]

The predicted effects $P_e$ of that same regression are a function of $X \beta$ 
such that 

\[P_e(X_1 \beta) = \frac{d(\text{link}^{-1}(X_1 \beta))}{d(X \beta)}\]

Depending on whether the effect is over a continuous or categorical variable, 
this may be an actual derivative (the instantaneous rate of change) or the 
subtraction of $P(X \beta)$ calculated at one value of $X$ from another 
(the first difference). 
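
As a toy numeric illustration of that distinction (the values here are 
arbitrary and not taken from any model), compare the instantaneous rate of 
change of the inverse logit at one value of the linear predictor with a first 
difference between two values:

```{r}
# Inverse logit, i.e. the inverse link function for logistic regression
inv_logit <- function(z) 1 / (1 + exp(-z))

# "Continuous" effect: instantaneous rate of change at z = 0.5
z <- 0.5
instantaneous <- exp(-z) / (1 + exp(-z))^2

# "Categorical" effect: first difference between z = 1.5 and z = 0.5
first_difference <- inv_logit(1.5) - inv_logit(0.5)

c(instantaneous = instantaneous, first_difference = first_difference)
```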

Using the Taylor expansion, we can approximate $P$, an arbitrary function of the
random variable $X \beta$, around the point $X_1 \beta$ as^[Note that we treat the 
input $X$ as fixed and $\beta$ as a random variable.]

\[P(X \beta) = P(X_1 \beta) + \frac{d(P(X_1 \beta))}{d(X\beta)}(X\beta - X_1 \beta)\]

Now we can substitute for the different margins we'll care about. For
predicted levels, the Taylor expansion is 

\[P_m(X \beta) = \text{link}^{-1}(X_1 \beta) + 
\frac{d(\text{link}^{-1}(X_1 \beta))}{d(X \beta)}(X\beta - X_1 \beta)\]

For the predicted effects of categorical variables, we are trying to 
estimate the effects at $P_e(X_1 \beta - X_2 \beta)$, which gives us

\[\begin{aligned}
P_e(X \beta) &= \text{link}^{-1} (X_1 \beta - X_2 \beta) + 
\frac{d(\text{link}^{-1}(X_1 \beta - X_2 \beta))}{d(X \beta)}(X\beta - (X_1 \beta - X_2 \beta)) \\
\end{aligned}
\]

For the predicted effects of continuous variables, the marginal effect is a derivative, so 

\[\begin{aligned}
P_e(X \beta) &= \frac{d(\text{link}^{-1}(X_1 \beta))}{d(X \beta)} + 
\frac{d^2(\text{link}^{-1}(X_1 \beta))}{d(X \beta)^2}\left(X\beta - X_1 \beta\right)
\end{aligned}
\]
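
To make this concrete: for the logit link used in the example later in this 
vignette, both of these derivatives have simple closed forms. Writing 
$p = \text{logit}^{-1}(X_1 \beta)$, we have

\[
\frac{d(\text{logit}^{-1}(X_1 \beta))}{d(X \beta)} = p(1 - p), \qquad
\frac{d^2(\text{logit}^{-1}(X_1 \beta))}{d(X \beta)^2} = p(1 - p)(1 - 2p)
\]

so for logistic regression every term above is available in closed form.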

# How does the Delta Method Work?

The equations above describe how to approximate predicted levels or effects, but
why not just calculate our estimate $P(X\beta)$ directly?

Well, we can do that for the point estimate, but we also want to calculate errors
on that estimate, and the variance of $P(X\beta)$ isn't known. Fortunately, we
_can_ calculate the variance of the approximations above. [Wikipedia contains a
really nice 
derivation](https://en.wikipedia.org/wiki/Delta_method#Multivariate_delta_method)
of the multivariate delta method:

\[\text{Var}[P(X \beta)] = \text{Var}\left[P(X_1 \beta) + \frac{d(P(X_1 \beta))}{d(X\beta)}(X\beta - X_1 \beta)\right]\]

Because $X_1\beta$ is a fixed point, both $P(X_1 \beta)$ and $\frac{d(P(X_1 \beta))}{d(X\beta)} X_1 \beta$ are constants with zero variance, so this simplifies to

\[\text{Var}[P(X \beta)] = \text{Var}\left[\frac{d(P(X_1 \beta))}{d(X\beta)} \cdot X\beta\right]\]

The first term in the variance is the vector of partial derivatives of our
estimator, also known as our jacobian matrix.^[The Jacobian matrix is just the
name of the matrix of all first partial derivatives of a vector-valued
function.] For any predicted level indexed by $i$ in a regression, the $i,j$th
element of the jacobian will be the derivative of predicted level $i$ with 
respect to coefficient $j$. Those are fixed quantities. We also know
the variance of $X\beta$, since that's our variance-covariance matrix $V$.^[As a
quick reminder about the variance-covariance matrix, if your model contains
coefficients $b_0$ to $b_n$ (each with mean $\mu_{b_i}$ and standard deviation
$\sigma_{b_i}$), the $i,j$th element of the variance-covariance matrix is
$cov(b_i, b_j)$.] 

So we want the variance of a random variable multiplied by a fixed matrix, which
we can find.^[Quick reminder: for a fixed scalar $a$ and a random variable $r$ 
with variance $b$, $\text{Var}(a \cdot r)$ is $a^2 \cdot b$. The matrix analog of 
$a^2 \cdot b$ is $A B A^T$, where $A$ is a fixed matrix and $B$ is the 
variance-covariance matrix of a random vector.]

\[\text{Var}[P(X \beta)] = \left(\frac{d(P(X_1 \beta))}{d(X\beta)}\right)^T \ V \left(\frac{d(P(X_1 \beta))}{d(X\beta)}\right)\]

So practically speaking, to get our variance, we'll pre- and post-multiply the 
original variance-covariance matrix from the regression by the partial 
derivatives of the inverse link function.
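
If you want to convince yourself of that sandwich rule, here is a tiny 
simulation (the matrices are made up and purely illustrative) checking that the 
empirical covariance of $Ar$ matches $A B A^T$:

```{r}
set.seed(1)

# A made-up variance-covariance matrix B and a fixed matrix A
B <- matrix(c(2, 0.5, 0.5, 1), nrow = 2)
A <- matrix(c(1, 2, 3, 4), nrow = 2)

# Draw correlated normal vectors r with covariance B via a Cholesky factor
r <- matrix(rnorm(2e5), ncol = 2) %*% chol(B)

# Empirical covariance of A r next to the closed form A B A^T
list(empirical = round(cov(r %*% t(A)), 2),
     closed_form = round(A %*% B %*% t(A), 2))
```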

In short, 

1. Calculate the jacobian matrix of the inverse link function 
of $X \beta$, $J$.[^1]

2. Get the variance-covariance matrix, $V$, from the regression output, 
or calculate it some other way.^[Usually you'll just want the 
variance-covariance matrix of the regression, but you'll need to modify 
the standard variance-covariance matrix if you want to cluster standard 
errors or something.]

3. Sandwich multiply the matrices: $J V J^{T}$. The diagonal of the resulting 
$k \times k$ matrix holds the variances of the $k$ predicted levels/effects. 
(A compact sketch of these three steps follows below.)
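
To show that this recipe isn't specific to logistic regression, here is a 
compact, self-contained sketch of the three steps for a Poisson model with a 
log link, fit to the built-in `mtcars` data purely for illustration. For the 
log link, the inverse link and its derivative are both $\exp(\cdot)$.

```{r}
# Illustrative only: the average predicted count from a Poisson model,
# with a delta-method standard error via the three steps above
pois <- glm(carb ~ wt + hp, data = mtcars, family = 'poisson')
xp <- model.matrix(pois)
xb <- as.vector(xp %*% coef(pois))

# 1. Jacobian of the average predicted level with respect to the coefficients
#    (the derivative of the inverse link exp(.) is exp(.) itself)
J <- (exp(xb) %*% xp) / nrow(xp)

# 2. Variance-covariance matrix from the fit
V <- vcov(pois)

# 3. Sandwich multiply
c(margin = mean(exp(xb)), std_error = sqrt(diag(J %*% V %*% t(J))))
```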

# Example

Say we're fitting a logistic regression using the `margex` data, and we're 
interested in the predicted outcome for different treatment groups:

```{r}
library(modmarg)
data(margex)
lg <- glm(outcome ~ treatment * age, data = margex, family = 'binomial')
summary(lg)
```

where the `treatment` variable $\tau \in \{0, 1\}$. What is the average outcome 
when $\tau = 0$ or $\tau = 1$?

Based on these results alone, we don't really know what the predicted outcome 
is for different treatment groups because we can't directly interpret the 
coefficients: there's an interaction term, and the effect of 
each covariate depends on the levels of the other covariates.

## Point estimates

To get an estimate of the average levels of the outcome for different values 
of treatment, we can set treatment to 0 or 1 for the entire dataset, generate 
predicted outcomes for the dataset, and average the results (in other words, 
what would the outcome have been, if everybody was in the control / treatment 
group?). The linear predictors of our model (in this case, setting $\tau = t$) 
are given by: 
\[
\hat{y}_i = \alpha + \beta_1 \cdot t + \beta_2 \cdot \text{age}_i + \beta_3 \cdot (t \cdot \text{age}_i)
\]
\ldots also known as $\hat{y}_i = X_i\beta$. We can then transform the linear 
predictors to the scale of the outcome variable (predicted probabilities) with 
the inverse link function: \[f(z) = \frac{1}{1 + \exp(-z)}\] Finally, we can 
find the average predicted level by taking the average of these predicted 
probabilities. Combining these, we have:
\[
P(X\beta) = \frac{1}{n} \sum_{i = 1}^n \frac{1}{1 + \exp(-X_i\beta)}
\]
Now let's do this in R: set treatment to 0 or 1, generate linear 
predictors, transform them to predicted outcomes, and take the mean.

```{r}
# Extract the n x k matrix of data
x <- model.matrix(lg)

# Extract the coefficients from the model (a column vector with k entries)
beta <- matrix(lg$coefficients)

# CONTROL:

# Set treatment and treatment:age to 0 for all observations
x[, "treatment"] <- 0
x[, 'treatment:age'] <- x[, 'treatment'] * x[, 'age']

# Get linear predictors
pred_ctl <- x %*% beta

# Apply the inverse link function to get predicted probabilities
pp_ctl <- 1 / (1 + exp(-pred_ctl))

# Get the average predicted probability
mean_pp_ctl <- mean(pp_ctl)

# TREATMENT:

# Set treatment to 1 and treatment:age to age for all observations
x[, "treatment"] <- 1
x[, 'treatment:age'] <- x[, 'treatment'] * x[, 'age']

# Get linear predictors
pred_treat <- x %*% beta

# Apply the inverse link function to get predicted probabilities
pp_treat <- 1 / (1 + exp(-pred_treat))

# Get the average predicted probability
mean_pp_treat <- mean(pp_treat)

# RESULTS:

mean_pp_ctl
mean_pp_treat
```
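
As a quick cross-check on these point estimates (this shortcut is fine for the 
levels themselves, but not for their standard errors, as discussed below), base 
R's `predict()` gives the same numbers when we hand it counterfactual data:

```{r}
# Same point estimates via predict(): set treatment to 0 or 1 for everyone,
# predict on the response (probability) scale, and average
mean(predict(lg, newdata = transform(margex, treatment = 0), type = "response"))
mean(predict(lg, newdata = transform(margex, treatment = 1), type = "response"))
```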

## Variance

So far this is pretty straightforward, but we want to know the standard 
error of these estimates. 

You might be tempted to just use the R command, 
`predict(model, newdata, se.fit = TRUE)`, fix treatment to 0 or 1, and find the 
average of the resulting standard errors. Don't do that! You'll average the 
standard errors of the predicted outcomes for the individual observations in 
your dataset, which is _not_ the same thing as the standard error of the 
average predicted outcome, which is what we're looking for.

As explained above, we can approximate the variance of the predicted margins: 
using $J$ to represent the jacobian (the derivative of our transformation 
with respect to each $\beta$ parameter), then

\[\text{Var}(P(X\beta)) =  J V J^T\]

\ldots Going back to our function from before, let's start with the 
coefficient on treatment, $\beta_1$. First we just apply the quotient rule 
(and the chain rule in the last step):

\begin{align*}
P(X\beta) &= \frac{1}{n} \sum_{i = 1}^n \frac{1}{1 + \exp(-X_i\beta)}\\
\frac{\partial}{\partial \beta_1} P(X\beta) &= \frac{\partial}{\partial \beta_1} \left[\frac{1}{n} \sum_{i = 1}^n \frac{1}{1 + \exp(-X_i\beta )}\right]\\
&= \frac{1}{n} \sum_{i = 1}^n \frac{\partial}{\partial \beta_1} \left[\frac{1}{1 + \exp(-X_i\beta)}\right]\\
&= \frac{1}{n} \sum_{i = 1}^n \frac{-\frac{\partial}{\partial \beta_1}[1 + \exp(-X_i\beta)]}{(1 + \exp(-X_i\beta))^2} \\
&= \frac{1}{n} \sum_{i = 1}^n \frac{-\exp(-X_i\beta)}{(1 + \exp(-X_i\beta))^2} \cdot \frac{\partial}{\partial \beta_1} [-X_i\beta]
\end{align*}

Coming up for air for a second: what's the last term in that final 
expression? 

$X_i\beta = \alpha + \beta_1 \tau_i + \beta_2 \text{age}_i + \beta_3 (\tau_i \cdot \text{age}_i)$, 
so the derivative of that with respect to $\beta_1$ is just $\tau_i$. Therefore, 

\begin{align*}
\frac{\partial}{\partial \beta_1} P(X\beta) 
&= \frac{1}{n} \sum_{i = 1}^n \frac{-\exp(-X_i\beta)}{(1 + \exp(-X_i\beta))^2} \cdot (-\tau_i)\\
&= \frac{1}{n} \sum_{i = 1}^n \frac{\exp(-X_i\beta)}{(1 + \exp(-X_i\beta))^2} \cdot \tau_i
\end{align*}
\ldots which is the element of our jacobian corresponding to $\beta_1$.^[Note that 
the first part of the expression is sometimes written 
$\frac{1}{1 + \exp(-X_i\beta)} \cdot \frac{1}{1 + \exp(X_i\beta)}$, 
which is the same quantity, since $\frac{d}{d(X\beta)} \text{logit}^{-1}(X\beta) = \text{logit}^{-1}(X\beta) \cdot \text{logit}^{-1}(-X\beta)$.]
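
If you'd like a numerical sanity check on that expression, here is an 
illustrative sketch (the counterfactual matrix `x1` and the helper `p_margin()` 
exist only for this check) comparing the analytic derivative with a 
finite-difference approximation. We use the treatment = 1 counterfactual so 
that $\tau_i = 1$ for every observation:

```{r}
# Counterfactual data with treatment set to 1 for everyone
x1 <- model.matrix(lg)
x1[, 'treatment'] <- 1
x1[, 'treatment:age'] <- x1[, 'age']
b <- lg$coefficients

# P(X beta): the average predicted probability as a function of the coefficients
p_margin <- function(b) mean(1 / (1 + exp(-(x1 %*% b))))

# Analytic derivative with respect to beta_1, from the formula above
xb <- as.vector(x1 %*% b)
analytic <- mean(exp(-xb) / (1 + exp(-xb))^2 * x1[, 'treatment'])

# Finite-difference approximation: nudge the treatment coefficient slightly
eps <- 1e-6
b_eps <- b
b_eps['treatment'] <- b_eps['treatment'] + eps
finite_diff <- (p_margin(b_eps) - p_margin(b)) / eps

c(analytic = analytic, finite_difference = finite_diff)
```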

What about the other elements of the jacobian? Because the $X_i\beta$ term is 
additively separable, each element has the same form, except for the covariate 
at the end. For example, the derivative with respect to $\beta_2$ is going to be 

\[\frac{1}{n} \sum_{i = 1}^n \frac{\exp(-X_i\beta)}{(1 + \exp(-X_i\beta))^2} \cdot \text{age}_i\]

So the full jacobian is
\[
J = 
\left[
  \begin{array}{l} \\
    \frac{1}{n} \sum_{i = 1}^n \frac{\exp(-X_i\beta)}{(1 + \exp(-X_i\beta))^2} \cdot 1\\
    \frac{1}{n} \sum_{i = 1}^n \frac{\exp(-X_i\beta)}{(1 + \exp(-X_i\beta))^2} \cdot \tau_i\\
    \frac{1}{n} \sum_{i = 1}^n \frac{\exp(-X_i\beta)}{(1 + \exp(-X_i\beta))^2} \cdot \text{age}_i\\
    \frac{1}{n} \sum_{i = 1}^n \frac{\exp(-X_i\beta)}{(1 + \exp(-X_i\beta))^2} \cdot \tau_i \cdot \text{age}_i
  \end{array}
\right]
\]

This can be rewritten as

\[
J = 
\frac{1}{n}
\left[
  \begin{array}{cccc} \\
     \frac{\exp(-X_1\beta)}{(1 + \exp(-X_1\beta))^2} &
     \frac{\exp(-X_2\beta)}{(1 + \exp(-X_2\beta))^2} &
     \cdots &
     \frac{\exp(-X_n\beta)}{(1 + \exp(-X_n\beta))^2}
  \end{array}
\right]
\left[
  \begin{array}{cccc}\\
    1 & \tau_1 & \text{age}_1 & \tau_1 \cdot \text{age}_1 \\
    1 & \tau_2 & \text{age}_2 & \tau_2 \cdot \text{age}_2 \\
    \vdots  & \vdots & \vdots & \vdots\\
    1 & \tau_n & \text{age}_n & \tau_n \cdot \text{age}_n \\
  \end{array}
\right]
\]

The first term applies the derivative of the inverse link function to every 
linear predictor in the model, and the second term is the covariate matrix 
from our data! How convenient! That's going to be true for all generalized 
linear models. So we can rewrite this as

\[J = \frac{\left[\begin{array}{cccc}\\ \frac{\exp(-X_1\beta)}{(1 + \exp(-X_1\beta))^2} & \frac{\exp(-X_2\beta)}{(1 + \exp(-X_2\beta))^2} & \cdots & \frac{\exp(-X_n\beta)}{(1 + \exp(-X_n\beta))^2}\end{array}\right] X}{n}\]

Recalling that our variance is $J V J^T$, and that we've already calculated 
$X_i\beta$ for all $i$, this is relatively simple to do in R:

```{r}
# Get the data
x <- model.matrix(lg)

# CONTROL ERROR

# Apply the derivative of the inverse link function to the linear predictors
deriv <- as.vector(exp(-pred_ctl) / (1 + exp(-pred_ctl))^2)

# Set treatment to 0
x[, 'treatment'] <- 0
x[, 'treatment:age'] <- x[, 'treatment'] * x[, 'age']

# Complete the chain rule by matrix-multiplying the derivatives by the data,
# now we have the jacobian
j <- deriv %*% x / nrow(x)

# The variance of our estimate is the sandwich product of the jacobian and the
# model's variance-covariance matrix
variance <- j %*% vcov(lg) %*% t(j)

# The error is the square root of that
se_ctl <- sqrt(diag(variance))

# TREATMENT ERROR: same logic

deriv <- as.vector(exp(-pred_treat) / (1 + exp(-pred_treat))^2)

x <- model.matrix(lg)

x[, 'treatment'] <- 1
x[, 'treatment:age'] <- x[, 'treatment'] * x[, 'age']

j <- deriv %*% x / nrow(x)

variance <- j %*% vcov(lg) %*% t(j)
se_treat <- sqrt(diag(variance))

se_ctl
se_treat
```

## Review

OK, we've got our estimates and standard errors:

```{r}
result <- data.frame(
  Label = c("treatment = 0", "treatment = 1"),
  Margin = c(mean_pp_ctl, mean_pp_treat),
  Standard.Error = c(se_ctl, se_treat)
)

result
```

What do we get from modmarg?

```{r}
marg <- modmarg::marg(mod = lg, var_interest = 'treatment')
marg[[1]][, c("Label", "Margin", "Standard.Error")]
```

Hooray!

[^1]: Some practical notes on calculating the jacobian: the example at the bottom
of the vignette covers how to calculate the jacobian for predicted levels pretty
thoroughly. For predicted effects, the big change is that you are now calculating 
the variance of predicted effects, so you need the derivative of the effect rather 
than of the level. For categorical variables, the effect is a first difference, so 
the jacobian is itself a difference: take the difference between the jacobian at 
each level and the jacobian at the base level and return that as the jacobian. For 
continuous variables, the effect is an instantaneous rate of change, so you need to 
actually compute the second derivative of the inverse link function and use that in 
place of the first derivative above. You need the second derivative explicitly 
because you want an instantaneous rate of change, as opposed to the rate of change 
over a range as with categorical variables above.