For many types of regression techniques, the coefficients in the model may not be sufficient to recover the marginal relationship between a covariate and the outcome of a regression (or the error in that estimate). In the simplest case, say you fit the following formula with glm: wages ~ age + I(age^2).
Because the output will include coefficients for both age and age squared, it’s not immediately apparent what the total marginal relationship is between a change in age and wages. Moreover, computing the error in that estimate is a non-trivial problem.
This non-obviousness of marginal relationships is a problem even for very simple regressions whose functional forms mean that the coefficients are not in the base units of the regression. For example, the coefficients of a logistic regression are log odds ratios, so even simple regressions are not immediately interpretable.
This package reproduces the margins command from Stata, which allows for easy and quick estimation of marginal relationships and the associated error. The error is computed using the delta method, which we detail in a separate vignette. In this vignette, we cover a few possible use cases of the modmarg package.[1]
We want to ascertain the marginal effect of treatment on y while controlling for age. In this first example we’ll use a binned age variable.
library(modmarg)
data(margex)
g <- glm(y ~ as.factor(agegroup) * as.factor(treatment), data = margex)
summary(g)
##
## Call:
## glm(formula = y ~ as.factor(agegroup) * as.factor(treatment),
## data = margex)
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 68.6901 0.8921 77.001
## as.factor(agegroup)30-39 -1.4677 1.3448 -1.091
## as.factor(agegroup)40+ -9.8679 1.2384 -7.968
## as.factor(treatment)1 14.6799 1.6316 8.997
## as.factor(agegroup)30-39:as.factor(treatment)1 -2.1209 2.2438 -0.945
## as.factor(agegroup)40+:as.factor(treatment)1 -2.1073 1.9565 -1.077
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## as.factor(agegroup)30-39 0.275
## as.factor(agegroup)40+ 2.27e-15 ***
## as.factor(treatment)1 < 2e-16 ***
## as.factor(agegroup)30-39:as.factor(treatment)1 0.345
## as.factor(agegroup)40+:as.factor(treatment)1 0.282
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 416.1914)
##
## Null deviance: 1391433 on 2999 degrees of freedom
## Residual deviance: 1246077 on 2994 degrees of freedom
## AIC: 26615
##
## Number of Fisher Scoring iterations: 2
It’s not at all obvious from these coefficients what the total marginal relationship is between treatment and y.
We can get the predicted margin of y at the various levels of treatment.
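A sketch of the call that would produce this output (type = "levels" requests predicted levels; the full marg interface appears later in this vignette):

modmarg::marg(mod = g, var_interest = "treatment", type = "levels")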
## [[1]]
## Label Margin Standard.Error Test.Stat P.Value Lower CI (95%)
## 1 treatment = 0 63.28363 0.5484021 115.3964 0 62.20835
## 2 treatment = 1 76.37699 0.5526036 138.2130 0 75.29347
## Upper CI (95%)
## 1 64.35892
## 2 77.46051
Or the effect of a unit change in treatment.
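Again a sketch of the likely call, now with type = "effects":

modmarg::marg(mod = g, var_interest = "treatment", type = "effects")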
## [[1]]
## Label Margin Standard.Error Test.Stat P.Value Lower CI (95%)
## 1 treatment = 0 0.00000 0.0000000 NaN NaN 0.00000
## 2 treatment = 1 13.09336 0.7785343 16.81796 1.013688e-60 11.56684
## Upper CI (95%)
## 1 0.00000
## 2 14.61988
Or maybe we want to get the treatment effect at several different levels of age. Let’s re-run the regression with a cubic polynomial in continuous age (maybe we’re looking at the severity of a disease that’s most prevalent among the young and the old).
Note that you have to use raw = T when using poly(). Otherwise marg will try to create multiple orthogonal vectors from a constant, which doesn’t work so well.
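Concretely, the model below matches the Call shown in the output, stored as g for the marg call that follows:

g <- glm(y ~ poly(age, 3, raw = T) * as.factor(treatment), data = margex)
summary(g)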
##
## Call:
## glm(formula = y ~ poly(age, 3, raw = T) * as.factor(treatment),
## data = margex)
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 1.118e+02 2.371e+01 4.716
## poly(age, 3, raw = T)1 -3.741e+00 2.012e+00 -1.860
## poly(age, 3, raw = T)2 1.077e-01 5.415e-02 1.990
## poly(age, 3, raw = T)3 -1.083e-03 4.649e-04 -2.329
## as.factor(treatment)1 -4.282e+01 3.677e+01 -1.165
## poly(age, 3, raw = T)1:as.factor(treatment)1 4.973e+00 2.997e+00 1.659
## poly(age, 3, raw = T)2:as.factor(treatment)1 -1.392e-01 7.788e-02 -1.788
## poly(age, 3, raw = T)3:as.factor(treatment)1 1.244e-03 6.486e-04 1.918
## Pr(>|t|)
## (Intercept) 2.52e-06 ***
## poly(age, 3, raw = T)1 0.0630 .
## poly(age, 3, raw = T)2 0.0467 *
## poly(age, 3, raw = T)3 0.0199 *
## as.factor(treatment)1 0.2442
## poly(age, 3, raw = T)1:as.factor(treatment)1 0.0972 .
## poly(age, 3, raw = T)2:as.factor(treatment)1 0.0739 .
## poly(age, 3, raw = T)3:as.factor(treatment)1 0.0552 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 404.2444)
##
## Null deviance: 1391433 on 2999 degrees of freedom
## Residual deviance: 1209499 on 2992 degrees of freedom
## AIC: 26530
##
## Number of Fisher Scoring iterations: 2
modmarg::marg(mod = g, var_interest = "treatment", type = "effects",
at = list("age" = c(20, 40, 60)))
## $`age = 20`
## Label Margin Standard.Error Test.Stat P.Value Lower CI (95%)
## 1 treatment = 0 0.00000 0.000000 NaN NaN 0.000000
## 2 treatment = 1 10.90186 3.278277 3.325484 0.0008932985 4.473953
## Upper CI (95%)
## 1 0.00000
## 2 17.32976
##
## $`age = 40`
## Label Margin Standard.Error Test.Stat P.Value Lower CI (95%)
## 1 treatment = 0 0.00000 0.000000 NaN NaN 0.00000
## 2 treatment = 1 12.96123 1.128904 11.48125 6.862963e-30 10.74772
## Upper CI (95%)
## 1 0.00000
## 2 15.17474
##
## $`age = 60`
## Label Margin Standard.Error Test.Stat P.Value Lower CI (95%)
## 1 treatment = 0 0.00000 0.000000 NaN NaN 0.00000
## 2 treatment = 1 23.06926 3.492633 6.605119 4.683001e-11 16.22105
## Upper CI (95%)
## 1 0.00000
## 2 29.91746
Let’s say we want to figure out how much treatment increased the likelihood of a binary outcome.
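The model, matching the Call shown in the output below:

g <- glm(outcome ~ as.factor(treatment), family = binomial, data = margex)
summary(g)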
##
## Call:
## glm(formula = outcome ~ as.factor(treatment), family = binomial,
## data = margex)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.44999 0.09554 -25.64 <2e-16 ***
## as.factor(treatment)1 1.40222 0.11221 12.50 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2732.1 on 2999 degrees of freedom
## Residual deviance: 2551.5 on 2998 degrees of freedom
## AIC: 2555.5
##
## Number of Fisher Scoring iterations: 5
Those coefficients are log odds ratios, so it’s really unclear what the marginal relationship is on the probability scale.
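marg puts the results back on the probability scale. A sketch of the levels call:

modmarg::marg(mod = g, var_interest = "treatment", type = "levels")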
## [[1]]
## Label Margin Standard.Error Test.Stat P.Value
## 1 treatment = 0 0.07943925 0.006986952 11.36966 5.921906e-30
## 2 treatment = 1 0.25965379 0.011313052 22.95170 1.416955e-116
## Lower CI (95%) Upper CI (95%)
## 1 0.06574508 0.09313343
## 2 0.23748062 0.28182697
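And the effect of treatment (again a sketch):

modmarg::marg(mod = g, var_interest = "treatment", type = "effects")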
## [[1]]
## Label Margin Standard.Error Test.Stat P.Value Lower CI (95%)
## 1 treatment = 0 0.0000000 0.00000000 NaN NaN 0.0000000
## 2 treatment = 1 0.1802145 0.01329672 13.55331 7.573387e-42 0.1541535
## Upper CI (95%)
## 1 0.0000000
## 2 0.2062756
Aha! It’s an 18 percentage point increase in the likelihood of a positive outcome from treatment, and the effect is highly statistically significant. Much more interpretable.
Let’s say we want to know the marginal predicted level at only a couple of age groups while controlling for distance. marg makes that simple. Note that unlike the at option, which takes a named list of values, at_var_interest takes just an unnamed vector.
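The model, matching the Call shown in the output below:

g <- glm(y ~ poly(distance, 2, raw = T) * as.factor(agegroup), data = margex)
summary(g)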
##
## Call:
## glm(formula = y ~ poly(distance, 2, raw = T) * as.factor(agegroup),
## data = margex)
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 7.269e+01 9.098e-01
## poly(distance, 2, raw = T)1 8.555e-02 1.112e-01
## poly(distance, 2, raw = T)2 -1.175e-04 1.552e-04
## as.factor(agegroup)30-39 -4.524e-01 1.386e+00
## as.factor(agegroup)40+ -6.516e+00 1.239e+00
## poly(distance, 2, raw = T)1:as.factor(agegroup)30-39 6.289e-02 1.520e-01
## poly(distance, 2, raw = T)2:as.factor(agegroup)30-39 -1.053e-04 2.124e-04
## poly(distance, 2, raw = T)1:as.factor(agegroup)40+ 1.008e-02 1.243e-01
## poly(distance, 2, raw = T)2:as.factor(agegroup)40+ -2.796e-05 1.732e-04
## t value Pr(>|t|)
## (Intercept) 79.888 < 2e-16 ***
## poly(distance, 2, raw = T)1 0.770 0.442
## poly(distance, 2, raw = T)2 -0.757 0.449
## as.factor(agegroup)30-39 -0.326 0.744
## as.factor(agegroup)40+ -5.260 1.54e-07 ***
## poly(distance, 2, raw = T)1:as.factor(agegroup)30-39 0.414 0.679
## poly(distance, 2, raw = T)2:as.factor(agegroup)30-39 -0.496 0.620
## poly(distance, 2, raw = T)1:as.factor(agegroup)40+ 0.081 0.935
## poly(distance, 2, raw = T)2:as.factor(agegroup)40+ -0.161 0.872
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 452.6186)
##
## Null deviance: 1391433 on 2999 degrees of freedom
## Residual deviance: 1353782 on 2991 degrees of freedom
## AIC: 26870
##
## Number of Fisher Scoring iterations: 2
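The values passed to at_var_interest must match the values of agegroup as they appear in the data; the distinct values printed below presumably come from something like:

unique(as.character(margex$agegroup))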
## [1] "40+" "20-29" "30-39"
## [[1]]
## Label Margin Standard.Error Test.Stat P.Value Lower CI (95%)
## 1 agegroup = 20-29 73.42933 0.9075403 80.91028 0 71.64987
## Upper CI (95%)
## 1 75.2088
Normally we assume that the amount of variation in our outcome of interest (conditional on covariates) is constant across our sample. Sometimes this assumption is violated, and we will use a different variance-covariance matrix to represent this heterogeneity in variance (heteroskedasticity). Creating these variance-covariance matrices is beyond the scope of this package. However, they can be used with marg to correct standard errors and p values in predicted levels or effects.
Let’s say we want to get the predicted levels of an outcome for different treatments, but we want to cluster our standard errors by the arm variable. We estimate the model, and then create the “clustered” variance-covariance matrix separately (see this script for one method to do this). This code and example replicate the vce(cluster arm) option in Stata.
We can use the vcov_mat option to pass a custom variance-covariance matrix to modmarg. Because this is an OLS model, the degrees of freedom for the t-test must also be corrected. Here we’re using Stata’s default correction of ngroups - 1, where ngroups is the number of unique values in the clustering variable. Notice that the standard errors and p values increase substantially.
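The model, matching the Call shown in the output below:

g <- glm(outcome ~ treatment + distance, family = "gaussian", data = margex)
summary(g)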
##
## Call:
## glm(formula = outcome ~ treatment + distance, family = "gaussian",
## data = margex)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.378e-02 9.625e-03 9.743 < 2e-16 ***
## treatment 1.786e-01 1.323e-02 13.508 < 2e-16 ***
## distance -2.314e-04 3.646e-05 -6.346 2.55e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1311316)
##
## Null deviance: 422.64 on 2999 degrees of freedom
## Residual deviance: 393.00 on 2997 degrees of freedom
## AIC: 2424
##
## Number of Fisher Scoring iterations: 2
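Suppose the clustered variance-covariance matrix, computed separately, is stored in a variable (hypothetically, clust_vcov):

clust_vcov  # hypothetical: clustered vcov matrix, built outside modmarg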
## (Intercept) treatment1 distance
## (Intercept) 2.970431e-03 -4.945889e-04 -5.960823e-06
## treatment1 -4.945889e-04 7.952403e-04 5.335865e-07
## distance -5.960823e-06 5.335865e-07 1.225712e-08
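And the corrected degrees of freedom, ngroups - 1, which is 2 here since arm takes three unique values (hypothetically stored as clust_dof):

clust_dof  # hypothetical: ngroups - 1 = 3 - 1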
## [1] 2
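First, the predicted levels with the model-based (uncorrected) variance-covariance matrix, a sketch of the call:

modmarg::marg(mod = g, var_interest = "treatment", type = "levels")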
## [[1]]
## Label Margin Standard.Error Test.Stat P.Value Lower CI (95%)
## 1 treatment = 0 0.0802249 0.009356981 8.573802 1.579014e-17 0.06187815
## 2 treatment = 1 0.2588702 0.009344512 27.702918 1.362609e-150 0.24054793
## Upper CI (95%)
## 1 0.09857166
## 2 0.27719254
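Then the same levels with the clustered matrix and corrected degrees of freedom passed in (a sketch, assuming the hypothetical clust_vcov and clust_dof above and that the degrees-of-freedom argument is dof):

modmarg::marg(mod = g, var_interest = "treatment", type = "levels",
              vcov_mat = clust_vcov, dof = clust_dof)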
## [[1]]
## Label Margin Standard.Error Test.Stat P.Value Lower CI (95%)
## 1 treatment = 0 0.0802249 0.04810472 1.667714 0.23730671 -0.12675299
## 2 treatment = 1 0.2588702 0.04671881 5.541028 0.03106062 0.05785542
## Upper CI (95%)
## 1 0.2872028
## 2 0.4598851
[1] Modmarg is short for model margins.