Instrumental variables

Video walk-through

If you want to follow along with this example, you can download these three datasets:

There’s a set of videos that walks through each section below. To make it easier for you to jump around the video examples, I cut the long video into smaller pieces and included them all in one YouTube playlist.

You can also watch the playlist (and skip around to different sections) here:

Background

For all these examples, we’re interested in the perennial econometrics question of whether an extra year of education causes increased wages. Economists love this stuff.

We’ll explore the question with three different datasets: a fake one I made up and two real ones from published research.

Make sure you load all these libraries before getting started:

Code
library(tidyverse)  # ggplot(), mutate(), and friends
library(parameters)  # Extract model parameters to data frames
library(performance) # Extract model details to data frames
library(marginaleffects) # Easier model predictions
library(modelsummary)  # Create side-by-side regression tables
library(tinytable)  # Add fancier formatting to tables
library(estimatr)  # Run 2SLS models in one step with iv_robust()

Education, wages, and father’s education (fake data)

First let’s use some fake data to see if education causes additional wages.

ed_fake <- read_csv("data/father_education.csv")

The father_education.csv file contains four variables:

Variable name Description
wage Weekly wage
educ Years of education
ability Magical column that measures your ability to work and go to school (omitted variable)
fathereduc Years of education for father

Naive model

If we could actually measure ability, we could estimate this model, which closes the confounding backdoor posed by ability and isolates just the effect of education on wages:

Code
model_forbidden <- lm(wage ~ educ + ability, data = ed_fake)
model_parameters(model_forbidden)
## Parameter   | Coefficient |   SE |           95% CI | t(997) |      p
## ---------------------------------------------------------------------
## (Intercept) |      -85.57 | 7.20 | [-99.70, -71.45] | -11.89 | < .001
## educ        |        7.77 | 0.46 | [  6.87,   8.66] |  17.05 | < .001
## ability     |        0.34 | 0.01 | [  0.32,   0.36] |  33.20 | < .001
## 
## Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed using a Wald t-distribution approximation.

However, in real life we don’t have ability, so we’re stuck with a naive model:

Code
model_naive <- lm(wage ~ educ, data = ed_fake)
model_parameters(model_naive)
## Parameter   | Coefficient |    SE |           95% CI | t(998) |      p
## ----------------------------------------------------------------------
## (Intercept) |      -59.38 | 10.38 | [-79.74, -39.02] |  -5.72 | < .001
## educ        |       13.12 |  0.62 | [ 11.91,  14.34] |  21.24 | < .001
## 
## Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed using a Wald t-distribution approximation.

The naive model overestimates the effect of education on wages (12.2 vs. 9.24) because of omitted variable bias. Education suffers from endogeneity—there are things in the model (like ability, hidden in the error term) that are correlated with it. Any estimate we calculate will be wrong and biased because of selection effects or omitted variable bias (all different names for endogeneity).

Check instrument validity

To fix the endogeneity problem, we can use an instrument to remove the endogeneity from education and instead use a special exogeneity-only version of education. Perhaps someone’s father’s education can be an instrument for education (it’s not the greatest instrument, but we’ll go with it).

For an instrument to be valid, it must meet three criteria:

  1. Relevance: Instrument is correlated with policy variable
  2. Exclusion: Instrument is correlated with outcome only through the policy variable
  3. Exogeneity: Instrument isn’t correlated with anything else in the model (i.e. omitted variables)

Relevance

We can first test relevance by making a scatterplot and running a model of policy ~ instrument:

Code
ggplot(ed_fake, aes(x = fathereduc, y = educ)) +
  geom_point() +
  geom_smooth(method = "lm")

Code

check_relevance <- lm(educ ~ fathereduc, data = ed_fake)
model_parameters(check_relevance)
## Parameter   | Coefficient |   SE |       95% CI | t(998) |      p
## -----------------------------------------------------------------
## (Intercept) |        2.25 | 0.17 | [1.91, 2.59] |  13.07 | < .001
## fathereduc  |        0.92 | 0.01 | [0.89, 0.94] |  84.48 | < .001
modelsummary::get_gof(check_relevance)
##    aic  bic r.squared adj.r.squared rmse nobs    F logLik
## 1 2137 2152      0.88          0.88  0.7 1000 7136  -1066

This looks pretty good! The F-statistic is definitely above 10 (it’s 7,136!), and there’s a significant relationship between the instrument and policy. I’d say that this is relevant.

Exclusion

To check for exclusion, we need to see if there’s a relationship between father’s education and wages that occurs only because of education. If we plot it, we’ll see a relationship:

Code
ggplot(ed_fake, aes(x = fathereduc, y = wage)) +
  geom_point() +
  geom_smooth(method = "lm")

That’s to be expected, since in our model, father’s education causes education which causes wages—they should be correlated. But we have to use a convincing story + theory to justify the idea that a father’s education increases the hourly wage only because it increases one’s education, and there’s no real statistical test for that. Good luck.

Exogeneity

There’s not really a test for exogeneity either, since there’s no way to measure other endogenous variables in the model (that’s the whole reason we’re using IVs in the first place!). Because we have the magical ability column in this fake data, we can test it. Father’s education shouldn’t be related to ability:

Code
ggplot(ed_fake, aes(x = ability, y = fathereduc)) +
  geom_point() +
  geom_smooth(method = "lm")

And it’s not! We can safely say that it meets the exogeneity assumption.

In real life, though there’s no statistical test for exogeneity. We just have to tell a theory-based story that the number of years of education one’s father has is not correlated with anything else in the model (including any omitted variables). Good luck with that—it’s probably not a good instrument. This relates to Scott Cunningham’s argument that instruments have to be weird. According to Scott:

The reason I think this is because an instrument doesn’t belong in the structural error term and the structural error term is all the intuitive things that determine your outcome. So it must be weird, otherwise it’s probably in the error term.

Let’s just pretend that father’s education is a valid instrument and move on :)

2SLS manually

Now we can do two-stage least squares (2SLS) regression and use the instrument to filter out the endogenous part of education. The first stage predicts education based on the instrument (we already ran this model earlier when checking for relevance, but we’ll do it again just for fun):

Code
first_stage <- lm(educ ~ fathereduc, data = ed_fake)

Now we want to add a column of predicted education to our original dataset. The easiest way to do that is with the predictions() function from the {marginaleffects} library, which plugs values from a dataset into a model to generate predictions:

Code
ed_fake_with_prediction <- predictions(first_stage, ed_fake)
head(ed_fake_with_prediction)
##   rowid estimate std.error statistic p.value s.value conf.low conf.high wage educ ability fathereduc  df
## 1     1       18     0.027       666       0     Inf       18        18  180   19     408         17 Inf
## 2     2       16     0.022       734       0     Inf       16        16  100   16     310         15 Inf
## 3     3       18     0.031       603       0     Inf       18        19  125   18     303         18 Inf
## 4     4       16     0.022       739       0     Inf       16        17  178   17     342         16 Inf
## 5     5       16     0.025       637       0     Inf       16        16  265   17     534         15 Inf
## 6     6       17     0.022       755       0     Inf       17        17  187   17     409         16 Inf

Instead of dealing with names like estimate, I like to rename the fitted variable to something more understandable after I use predictions():

Code
ed_fake_with_prediction <- predictions(first_stage, ed_fake) |>
  rename(educ_hat = estimate)

head(ed_fake_with_prediction)
##   rowid educ_hat std.error statistic p.value s.value conf.low conf.high wage educ ability fathereduc  df
## 1     1       18     0.027       666       0     Inf       18        18  180   19     408         17 Inf
## 2     2       16     0.022       734       0     Inf       16        16  100   16     310         15 Inf
## 3     3       18     0.031       603       0     Inf       18        19  125   18     303         18 Inf
## 4     4       16     0.022       739       0     Inf       16        17  178   17     342         16 Inf
## 5     5       16     0.025       637       0     Inf       16        16  265   17     534         15 Inf
## 6     6       17     0.022       755       0     Inf       17        17  187   17     409         16 Inf

We can now use the new educ_hat variable in our second stage model:

Code
second_stage <- lm(wage ~ educ_hat, data = ed_fake_with_prediction)
model_parameters(second_stage)
## Parameter   | Coefficient |    SE |        95% CI | t(998) |      p
## -------------------------------------------------------------------
## (Intercept) |       28.82 | 12.67 | [3.95, 53.69] |   2.27 | 0.023 
## educ hat    |        7.83 |  0.76 | [6.35,  9.32] |  10.37 | < .001
## 
## Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed using a Wald t-distribution approximation.

The estimate for educ_hat is arguably more accurate now because we’ve used the instrument to remove the endogenous part of education and should only have the exogenous part.

2SLS in one step

Doing all that two-stage work is neat and it helps with the intuition of instrumental variables, but it’s tedious. More importantly, the standard errors for educ_hat are wrong and the \(R^2\) and other diagnostics for the second stage model are wrong too. You can use fancy math to adjust these things in the second stage, but we’re not going to do that. Instead, we’ll use a function that does both stages of the 2SLS model at the same time!

There are several functions from different R packages that let you do 2SLS, and they all work a little differently and have different benefits:

  • iv_robust() from {estimatr}:
    • Syntax: outcome ~ treatment | instrument
    • Benefits: Handles robust and clustered standard errors; includes special tests for weak instruments (that are better than the standard “check if F > 10”)
  • feols() from {fixest}:
    • Syntax: outcome ~ treatment | instrument or outcome ~ treatment | fixed effects | instrument (if using fixed effects)
    • Benefits: Handles fixed effects really quickly
  • ivreg() from {ivreg}:
    • Syntax: outcome ~ treatment | instrument
    • Benefits: Includes special tests for weak instruments (that are better than the standard “check if F > 10”)
  • plm() from {plm}:
    • Syntax: outcome ~ treatment | instrument
    • Benefits: Handles panel data (country/year, state/year, etc.)

This page here has more detailed examples of the main three: iv_robust(), ivreg(), and feols()

I typically like using iv_robust(), so we’ll do that here. Instead of running a first stage, generating predictions, and running a second stage, we can do it all at once like this:

Code
model_2sls <- iv_robust(wage ~ educ | fathereduc,
                        data = ed_fake)
model_parameters(model_2sls)
## # Fixed Effects
## 
## Parameter   | Coefficient |    SE |        95% CI | t(998) |      p
## -------------------------------------------------------------------
## (Intercept) |       28.82 | 11.16 | [6.91, 50.73] |   2.58 | 0.010 
## educ        |        7.83 |  0.66 | [6.53,  9.14] |  11.81 | < .001
## 
## Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed using a Wald t-distribution approximation.

The coefficient for educ here is the same as educ_hat from the manual 2SLS model, but here we found it in one line of code! Also, the model’s standard errors and diagnostics are all correct now.

Compare results

We can put all the models side-by-side to compare them:

Code
# gof_omit here will omit goodness-of-fit rows that match any of the text. This
# means 'contains "IC" OR contains "Low" OR contains "Adj" OR contains "p.value"
# OR contains "statistic" OR contains "se_type"'. Basically we're getting rid of
# all the extra diagnostic information at the bottom
modelsummary(list("Forbidden" = model_forbidden, "OLS" = model_naive,
                  "2SLS (by hand)" = second_stage, "2SLS (automatic)" = model_2sls),
             gof_omit = "IC|Log|Adj|p\\.value|statistic|se_type",
             stars = TRUE) |>
  # Add a background color to rows 3-4 and 7-8
  style_tt(i = c(3:4, 7:8), background = "#F5ABEA")
Forbidden OLS 2SLS (by hand) 2SLS (automatic)
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
(Intercept) -85.571*** -59.378*** 28.819* 28.819**
(7.198) (10.376) (12.672) (11.165)
educ 7.767*** 13.124*** 7.835***
(0.456) (0.618) (0.664)
ability 0.344***
(0.010)
educ_hat 7.835***
(0.755)
Num.Obs. 1000 1000 1000 1000
R2 0.673 0.311 0.097 0.261
F 1025.794 451.244 107.639
RMSE 26.97 39.13 44.80 40.55

Note how the coefficients for educ_hat and educ in the 2SLS models are close to the coefficient for educ in the forbidden model that accounts for ability. That’s the magic of instrumental variables!

Education, wages, and parent’s education (multiple instruments) (real data)

This data comes from the wage2 dataset in the {wooldridge} R package (and it’s real!). The data was used in this paper:

M. Blackburn and D. Neumark (1992), “Unobserved Ability, Efficiency Wages, and Interindustry Wage Differentials,” Quarterly Journal of Economics 107, 1421-1436. https://doi.org/10.3386/w3857

wage2 <- read_csv("data/wage2.csv")

This dataset includes a bunch of different variables. If you run library(wooldridge) and then run ?wage you can see the documentation for the data. These are the variables we care about for this example:

Variable name Description
wage Monthly wage (1980 dollars)
educ Years of education
feduc Years of education for father
meduc Years of education for mother

To make life easier, we’ll rename some of the columns and get rid of rows with missing data:

Code
ed_real <- wage2 |>
  rename(education = educ, education_dad = feduc, education_mom = meduc) |>
  na.omit()  # Get rid of rows with missing values

Naive model

We want to again estimate the effect of education on wages, but this time we’ll use both one’s father’s education and one’s mother’s education as instruments. Here’s the naive estimate of the relationship, which suffers from endogeneity:

Code
model_naive <- lm(wage ~ education, data = ed_real)
model_parameters(model_naive)
## Parameter   | Coefficient |    SE |          95% CI | t(661) |      p
## ---------------------------------------------------------------------
## (Intercept) |      175.16 | 92.84 | [-7.13, 357.45] |   1.89 | 0.060 
## education   |       59.45 |  6.70 | [46.30,  72.60] |   8.88 | < .001
## 
## Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed using a Wald t-distribution approximation.

This is wrong though! Education is endogenous to unmeasured things in the model (like ability, which lives in the error term). We can isolate the exogenous part of education with an instrument.

Check instrument validity

Before doing any 2SLS models, we want to check the validity of the instruments. Remember, for an instrument to be valid, it should meet these criteria:

  1. Relevance: Instrument is correlated with policy variable
  2. Exclusion: Instrument is correlated with outcome only through the policy variable
  3. Exogeneity: Instrument isn’t correlated with anything else in the model (i.e. omitted variables)

Relevance

We can check for relevance by looking at the relationship between the instruments and education:

Code
# Combine father's and mother's education into one column so we can plot both at the same time
ed_real_long <- ed_real |>
  pivot_longer(cols = c(education_dad, education_mom),
               names_to = "instrument", values_to = "instrument_value")

ggplot(ed_real_long, aes(x = instrument_value, y = education)) +
  # Make points semi-transparent because of overplotting
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm") +
  facet_wrap(vars(instrument))

Code
model_check_instruments <- lm(education ~ education_dad + education_mom,
                              data = ed_real)
model_parameters(model_check_instruments)
## Parameter     | Coefficient |   SE |        95% CI | t(660) |      p
## --------------------------------------------------------------------
## (Intercept)   |        9.91 | 0.32 | [9.29, 10.54] |  31.02 | < .001
## education dad |        0.22 | 0.03 | [0.16,  0.28] |   7.58 | < .001
## education mom |        0.14 | 0.03 | [0.07,  0.21] |   4.17 | < .001
## 
## Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed using a Wald t-distribution approximation.
modelsummary::get_gof(model_check_instruments)
##    aic  bic r.squared adj.r.squared rmse nobs  F logLik
## 1 2804 2822       0.2           0.2    2  663 83  -1398

There’s a clear relationship between both of the instruments and education, and the coefficients for each are significant. The F-statistic for the model is 83, which is higher than 10, which might be a good sign of a strong instrument. However, it’s less than 104, which, according to this paper, is a better threshold for F statistics. So maybe it’s not so relevant in the end. Who knows.

Exclusion

We can check for exclusion in part by looking at the relationship between the instruments and the outcome, or wages. We should see some relationship:

Code
ggplot(ed_real_long, aes(x = instrument_value, y = wage)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm") +
  facet_wrap(~ instrument)

And we do! Now we just have to make the case that the only reason there’s a relationship is that parental education only influences wages through education. Good luck with that.

Exogeneity

The last step is to prove exogeneity—that parental education is not correlated with education or wages. Good luck with that too.

2SLS manually

Now that we’ve maybe found some okay-ish instruments perhaps, we can use them in a two-stage least squares model. I’ll show you how to do it by hand, just to help with the intuition, but then we’ll do it automatically with iv_robust().

Assuming that parental education is a good instrument, we can use it to remove the endogenous part of education using 2SLS. In the first stage, we predict education using our instruments:

Code
first_stage <- lm(education ~ education_dad + education_mom, data = ed_real)

We can then extract the predicted education and add it to our main dataset, renaming the estimate variable to something more useful along the way:

Code
ed_real_with_predicted <- predictions(first_stage, ed_real) |>
  rename(education_hat = estimate)

Finally, we can use predicted education to estimate the exogenous effect of education on wages:

Code
second_stage <- lm(wage ~ education_hat,
                   data = ed_real_with_predicted)
model_parameters(second_stage)
## Parameter     | Coefficient |     SE |             95% CI | t(661) |      p
## ---------------------------------------------------------------------------
## (Intercept)   |     -537.71 | 208.16 | [-946.46, -128.97] |  -2.58 | 0.010 
## education hat |      111.56 |  15.18 | [  81.76,  141.36] |   7.35 | < .001
## 
## Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed using a Wald t-distribution approximation.

The coefficient for education_hat here should arguably be our actual effect!

2SLS in one step

Again, in real life, you won’t want to do all that. It’s tedious and your standard errors are wrong. Here’s how to do it all in one step:

Code
model_2sls <- iv_robust(wage ~ education | education_dad + education_mom,
                        data = ed_real)
model_parameters(model_2sls)
## # Fixed Effects
## 
## Parameter   | Coefficient |     SE |             95% CI | t(661) |      p
## -------------------------------------------------------------------------
## (Intercept) |     -537.71 | 214.43 | [-958.76, -116.66] |  -2.51 | 0.012 
## education   |      111.56 |  15.90 | [  80.34,  142.78] |   7.02 | < .001
## 
## Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed using a Wald t-distribution approximation.

The coefficient for education is the same that we found in the manual 2SLS process, but now the errors are correct.

Compare results

Let’s compare all the findings and interpret the results!

Code
modelsummary(list("OLS" = model_naive, "2SLS (by hand)" = second_stage,
                  "2SLS (automatic)" = model_2sls),
             gof_omit = "IC|Log|Adj|p\\.value|statistic|se_type",
             stars = TRUE) |>
  # Add a background color to rows 3 and 5
  style_tt(i = c(3, 5), background = "#F5ABEA")
OLS 2SLS (by hand) 2SLS (automatic)
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
(Intercept) 175.160+ -537.712* -537.712*
(92.839) (208.164) (214.431)
education 59.452*** 111.561***
(6.698) (15.901)
education_hat 111.561***
(15.176)
Num.Obs. 663 663 663
R2 0.106 0.076 0.025
F 78.786 54.041
RMSE 383.97 390.55 401.16

The 2SLS effect is roughly twice as large and is arguably more accurate, since it has removed the endogeneity from education. An extra year of school leads to an extra $111.56 dollars a month in income (in 1980 dollars).

Check for weak instruments

The F-statistic in the first stage was 83.3, which is bigger than 10, but not huge. Again, this newer paper argues that relying on 10 as a threshold isn’t great. They provide a new, more powerful test called the tF procedure, but nobody’s written an R function to do that yet, so we can’t use it yet.

We can, however, do a couple other tests for instrument strength. First, if you include the diagnostics = TRUE argument when running iv_robust(), you can get a few extra diagnostic statistics. (See the “Details” section in the documentation for iv_robust for more details about what these are.)

Let’s re-run the 2SLS model with iv_robust with diagnostics on. To see diagnostic details, you can’t use model_parameters() (since that just shows the coefficients), so you have to use summary():

Code
model_2sls <- iv_robust(wage ~ education | education_dad + education_mom,
                        data = ed_real, diagnostics = TRUE)
summary(model_2sls)
## 
## Call:
## iv_robust(formula = wage ~ education | education_dad + education_mom, 
##     data = ed_real, diagnostics = TRUE)
## 
## Standard error type:  HC2 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper  DF
## (Intercept)     -538      214.4   -2.51 1.24e-02   -958.8     -117 661
## education        112       15.9    7.02 5.66e-12     80.3      143 661
## 
## Multiple R-squared:  0.0247 ,    Adjusted R-squared:  0.0232 
## F-statistic: 49.2 on 1 and 661 DF,  p-value: 5.66e-12
## 
## Diagnostics:
##                  numdf dendf value p.value    
## Weak instruments     2   660  96.2 < 2e-16 ***
## Wu-Hausman           1   660  16.5 5.5e-05 ***
## Overidentifying      1    NA   0.4    0.53    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The main diagnostic we care about here is the first one: “Weak instruments”. This is a slightly fancier version of just looking at the first-stage F statistic. The null hypothesis for this test is that the instruments we have specified are weak, so we’d like to reject that null hypothesis. Here, the p-value is tiny, so we can safely reject the null and say the instruments likely aren’t weak. (In general, you want a statistically significant weak instruments test).

Another approach for checking for weak instruments is to calculate something called the Anderson-Rubin confidence set, which is essentially a 95% confidence interval for your coefficient that shows the stability of the coefficient based on how weak or strong the instrument is. This test was invented in like 1949 and it’s arguably more robust than checking F statistics, but for whatever reason, nobody really teaches it or uses it!. It’s not in any of the textbooks for this class, and it’s really kind of rare. Even if you google “anderson rubin test weak instruments”, you’ll only find a bunch of lecture notes from fancy econometrics classes (like p. 10 here, or p. 4 here, or p. 4 here).

Most of the automatic 2SLS R packages don’t provide an easy way to do this test! But there is a package dedicated to IV diagnostics (conveniently named {ivDiag}) that does this test and a ton of others (see here for a tutorial about all of them).

To use it, you need to use ivDiag(), which will return a list of a bunch of diagnostics. You have to specify the dataset, the outcome (Y), the treatment (D), and the instrument(s) (Z), and it’ll take a few seconds to run because it does some fancy sampling work behind the scenes:

Code
library(ivDiag)

diagnostics <- ivDiag(
  data = ed_real,
  Y = "wage",
  D = "education",
  Z = c("education_dad", "education_mom")
)

The diagnostics object has a bunch of different slots, like the first stage F-statistic:

Code
diagnostics$F_stat
##  F.standard    F.robust   F.cluster F.bootstrap F.effective 
##          83          97          NA          91         159

Importantly, it has Anderson-Rubin confidence intervals:

Code
diagnostics$AR
## $Fstat
##   F df1 df2   p 
##  25   2 660   0 
## 
## $ci.print
## [1] "[76.0024, 151.8828]"
## 
## $ci
## [1]  76 152
## 
## $bounded
## [1] TRUE

Based on this confidence interval, given the strength (or weakness) of the instruments, the IV estimate could be as low as 76 and as high as 152, which is a fairly big range around the $112 effect we found. Neat.

There’s no magic threshold to look for in these confidence intervals—you’re mostly concerned with how much potential variability there is. If you’re fine with a causal effect that could be between 76 and 152, great. If you want that range to be narrower, find some better instruments.

Education, wages, and distance to college (control variables) (real data)

For this last example we’ll estimate the effect of education on wages using a different instrument—geographic proximity to colleges. This data comes from David Card’s 1995 study where he did the same thing, and it’s available in the {wooldridge} library as card. You can find a description of all variables here; we’ll use these:

Variable name Description
lwage Annual wage (log form)
educ Years of education
nearc4 Living close to college (=1) or far from college (=0)
smsa Living in metropolitan area (=1) or not (=0)
exper Years of experience
expersq Years of experience (squared term)
black Black (=1), not black (=0)
south Living in the south (=1) or not (=0)
Code
card <- read_csv("data/card.csv")

Once again, Card wants to estimate the effect of education on wage. But to remove the endogeneity that comes from ability, he uses a different instrumental variable: proximity to college.

He also uses control variables to help explain additional variation in wages: smsa66 + exper + expersq + black + south66.

IMPORTANT NOTE: When you include controls, every control variable needs to go in both stages. The only things from the first stage that don’t carry over to the second stage are the instruments—notice how nearc4 is only in the first stage, since it’s the instrument, but it’s not in the second stage. The other controls are all in both stages.

He thus estimates a model where:

First stage:

\[ \widehat{\text{educ}} = \beta_0 + \beta_1\text{nearc4} + \beta_{2-6}\text{Control variables} \]

Second stage:

\[ \text{lwage} = \beta_0 + \beta_1 \widehat{\text{educ}} + \beta_{2-6}\text{Control variables} \]

Check instrument validity

Card provides arguments to support each of three main characteristics of a good instrumental variable:

  1. Relevancy: People who live close to a 4-year college have easier access to education at a lower costs (no commuting costs and time nor accommodation costs), so they have greater incentives to pursue education.
  2. Exclusion: Proximity to a college has no effect on your annual income, unless you decide to pursue further education because of the nearby college.
  3. Exogeneity: Individual ability does not depend on proximity to a college.

Let’s see if these assumptions hold up:

Relevancy

There should be a strong relationship between the instrument (distance to college) and education:

Code
first_stage <- lm(educ ~ nearc4 + smsa66 + exper + expersq + black + south66,
                  data = card)
model_parameters(first_stage)
## Parameter   | Coefficient |       SE |         95% CI | t(3003) |      p
## ------------------------------------------------------------------------
## (Intercept) |       16.81 |     0.17 | [16.47, 17.15] |   96.94 | < .001
## nearc4      |        0.33 |     0.09 | [ 0.16,  0.50] |    3.84 | < .001
## smsa66      |        0.25 |     0.08 | [ 0.09,  0.42] |    3.00 | 0.003 
## exper       |       -0.41 |     0.03 | [-0.48, -0.34] |  -12.12 | < .001
## expersq     |    4.61e-04 | 1.65e-03 | [ 0.00,  0.00] |    0.28 | 0.780 
## black       |       -0.92 |     0.09 | [-1.11, -0.74] |   -9.90 | < .001
## south66     |       -0.35 |     0.08 | [-0.51, -0.18] |   -4.18 | < .001
## 
## Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed using a Wald t-distribution approximation.
modelsummary::get_gof(first_stage)
##     aic   bic r.squared adj.r.squared rmse nobs   F logLik
## 1 12558 12606      0.47          0.47  1.9 3010 449  -6271

Based on this first stage model, nearc4 has a significant relationship to educ, and the model’s joint F statistic is 449, which is definitely bigger than both 10 and 104. Good. We’ll call it relevant.

Exclusion

For distance to college to work as an instrument and meet the exclusion restriction, we have to prove that distance to college causes wages only through getting more education. Think about other possible pathways between living close to a college and increased wages—there could be other paths that don’t go through education. Good luck.

Exogeneity

For distance to college to work as an exogenous instrument, we have to prove that none of the unobserved confounders between education and earnings are connected to distance. Also good luck.

2SLS estimation

Assuming distance to education is a valid instrument (sure), we can use it in a 2SLS model. Remember that control variables have to go in both stages, so specify them accordingly in the model formula:

Code
model_2sls <- iv_robust(lwage ~ educ + smsa66 + exper + expersq + black + south66 |
                          nearc4 + smsa66 + exper + expersq + black + south66,
                        data = card, diagnostics = TRUE)
model_parameters(model_2sls, digits = 3)
## # Fixed Effects
## 
## Parameter   | Coefficient |        SE |           95% CI | t(3003) |      p
## ---------------------------------------------------------------------------
## (Intercept) |       3.357 |     0.930 | [ 1.533,  5.182] |   3.608 | < .001
## educ        |       0.157 |     0.055 | [ 0.050,  0.265] |   2.869 | 0.004 
## smsa66      |       0.081 |     0.027 | [ 0.028,  0.134] |   3.015 | 0.003 
## exper       |       0.118 |     0.024 | [ 0.072,  0.165] |   5.004 | < .001
## expersq     |      -0.002 | 3.661e-04 | [-0.003, -0.002] |  -6.595 | < .001
## black       |      -0.104 |     0.052 | [-0.206, -0.001] |  -1.975 | 0.048 
## south66     |      -0.064 |     0.028 | [-0.119, -0.009] |  -2.277 | 0.023
## 
## Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed using a Wald t-distribution approximation.

Cool cool. Based on the coefficient for educ, a year of education causes a 15.7% increase in annual wages, on average.

Is that an improvement over a naive model where we don’t account for any of the endogeneity?

Code
model_naive <- lm(lwage ~ educ + smsa66 + exper + expersq + black + south66,
                  data = card)
model_parameters(model_naive, digits = 3)
## Parameter   | Coefficient |        SE |           95% CI | t(3003) |      p
## ---------------------------------------------------------------------------
## (Intercept) |       4.731 |     0.069 | [ 4.597,  4.866] |  68.927 | < .001
## educ        |       0.076 |     0.004 | [ 0.069,  0.083] |  21.459 | < .001
## smsa66      |       0.113 |     0.015 | [ 0.083,  0.142] |   7.468 | < .001
## exper       |       0.085 |     0.007 | [ 0.072,  0.098] |  12.639 | < .001
## expersq     |      -0.002 | 3.222e-04 | [-0.003, -0.002] |  -7.388 | < .001
## black       |      -0.177 |     0.018 | [-0.213, -0.141] |  -9.578 | < .001
## south66     |      -0.096 |     0.016 | [-0.128, -0.064] |  -5.970 | < .001
## 
## Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed using a Wald t-distribution approximation.

Yep! Without removing endogeneity from education, an additional year of education is only associated with a 7.6% increase in annual wages, on average.

Compare results

For fun, we can look at the results side-by-side:

Code
modelsummary(list("Naive OLS" = model_naive, "2SLS" = model_2sls),
             gof_omit = "IC|Log|Adj|p\\.value|statistic|se_type",
             stars = TRUE) |>
  # Add a background color to rows 3-4
  style_tt(i = 3:4, background = "#F5ABEA")
Naive OLS 2SLS
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
(Intercept) 4.731*** 3.357***
(0.069) (0.930)
educ 0.076*** 0.157**
(0.004) (0.055)
smsa66 0.113*** 0.081**
(0.015) (0.027)
exper 0.085*** 0.118***
(0.007) (0.024)
expersq -0.002*** -0.002***
(0.000) (0.000)
black -0.177*** -0.104*
(0.018) (0.052)
south66 -0.096*** -0.064*
(0.016) (0.028)
Num.Obs. 3010 3010
R2 0.269 0.143
F 184.606
RMSE 0.38 0.41

Extra diagnostics

Finally, we can check for weak instruments issues. The F statistic we found in the first stage was pretty big, so that’s a good sign, but we can look at the first stage’s weak instrument statistic, as well as the Anderson-Rubin confidence interval.

Because we included diagnostics = TRUE in the model, we can just use summary() to check weak instruments diagnostics:

Code
summary(model_2sls)
## 
## Call:
## iv_robust(formula = lwage ~ educ + smsa66 + exper + expersq + 
##     black + south66 | nearc4 + smsa66 + exper + expersq + black + 
##     south66, data = card, diagnostics = TRUE)
## 
## Standard error type:  HC2 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper   DF
## (Intercept)  3.35723   0.930415    3.61 3.13e-04  1.53292  5.18155 3003
## educ         0.15722   0.054807    2.87 4.15e-03  0.04976  0.26468 3003
## smsa66       0.08097   0.026858    3.01 2.59e-03  0.02831  0.13363 3003
## exper        0.11835   0.023649    5.00 5.93e-07  0.07198  0.16472 3003
## expersq     -0.00241   0.000366   -6.60 5.00e-11 -0.00313 -0.00170 3003
## black       -0.10358   0.052451   -1.97 4.84e-02 -0.20643 -0.00074 3003
## south66     -0.06371   0.027979   -2.28 2.28e-02 -0.11857 -0.00885 3003
## 
## Multiple R-squared:  0.143 , Adjusted R-squared:  0.141 
## F-statistic: 96.7 on 6 and 3003 DF,  p-value: <2e-16
## 
## Diagnostics:
##                  numdf dendf value p.value    
## Weak instruments     1  3003 15.31 9.3e-05 ***
## Wu-Hausman           1  3002  2.61    0.11    
## Overidentifying      0    NA    NA      NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value for the “Weak instruments” test is tiny, which means we can safely reject the null hypothesis that the near college instrument is weak. Neat.

To calculate Anderson-Rubin confidence intervals, we can use ivDiag():

Code
diagnostics_card <- ivDiag(
  data = card,
  Y = "lwage",
  D = "educ",
  Z = "nearc4",
  controls = c("smsa66", "exper", "expersq", "black", "south66")
)

diagnostics_card$AR
## $Fstat
##       F     df1     df2       p 
## 8.7e+00 1.0e+00 3.0e+03 3.2e-03 
## 
## $ci.print
## [1] "[0.0597, 0.3128]"
## 
## $ci
## [1] 0.06 0.31
## 
## $bounded
## [1] TRUE

Phew. That’s a pretty wide interval, ranging from 6% to 31%. It’s still positive, but it could sometimes be fairly small.