Instrumental Variables Solutions

6-5 Instrumental Variables – Solutions
March 19, 2024
Causal Graph (DAG) Review
Let’s briefly review what we know about directed acyclic graphs (DAGs) or causal graphs. In this lab, we will introduce the dagitty and ggdag libraries for drawing DAGs in R.
Firstly, it is imperative that we keep in mind that DAGs are essentially a visual representation of our assumptions about the causal relationships between variables. We are rarely, if ever, able to prove that our DAG is actually “true”–we simply assume that it is.
Therefore we must proceed with extreme caution when deciding upon the assumptions we wish to encode in our DAG (most assumptions are derived from knowledge within the field such as literature review, expert insight, etc.). And we must also take great care when interpreting any results from our statistical analysis, as they are only valid in the context of our DAG (and any other assumptions made).
The assumptions encoded in our DAG include (but are not limited to):
1. The variables included (and not included) in the DAG as a whole
2. Exclusion restriction(s) (defined below)
3. Independence assumption(s) (defined below)
DAG Key Terms
Let’s recall some key terms:
• Endogenous variables – Measured variables including exposure (A), outcome (Y), and any other measured covariates (W). Sometimes collectively referred to as X (as in X = {W, A, Y}) or in other literature as S.
• Exogenous variables – Unmeasured variables (UX) which feed into the endogenous variables. Sometimes collectively referred to as U (as in U = {UW, UA, UY}).
• Exclusion restriction – Note that this concept can be a bit confusing because it can refer to two slightly different scenarios:
• In the context of causal inference, it can refer to the assumption that a particular arrow does not exist between two endogenous variables X. In other words, the absence of an arrow between any pair of endogenous variables is inherently an exclusion restriction–an assumption that must be justified.
• In the context of IVs, it can refer to the assumption that the only path by which Z (instrument) affects Y (outcome) is through A (treatment). Meaning that Z does not affect Y in some other direct or indirect way.
• Independence assumption – Assumption regarding the joint distribution of the exogenous variables U. That is, the assumption that any pair of exogenous variables (UX1, UX2) are independent of each other (UX1 ⊥ UX2), i.e. there is no arrow between them. In other words, the absence of an arrow between any pair of exogenous variables is inherently an independence assumption–an assumption that must be justified.
• Unblocked backdoor path – A path between the exposure (A) and the outcome (Y) (besides the direct “main effect” path of interest) which does not contain a collider. In other words, an indirect path which may explain some or all of the observed relationship between the exposure and outcome.
• Collider – A covariate W with two parent nodes (two arrows pointing inward) on some backdoor path between the exposure (A) and the outcome (Y). The existence of a collider on a particular path “blocks” said path. NB: Conditioning on a collider induces a path between its two parents (thereby possibly inducing a new unblocked backdoor path).
Example: In the first DAG below, W is a collider. In the second DAG, we have conditioned on W, thereby introducing a new path between A and Y. Let’s explore the example using ggdag(). This is not the easiest package to use, but here is a great tutorial you can use to get you started with the basics; a minimal code sketch also follows the figures below.
Figure: the unadjusted DAG, and the DAG adjusted for the collider W, in which a new path between A and Y is activated by the adjustment.
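Here is a minimal sketch of this example in code. It assumes only the structure described above (A → Y, with W a collider via A → W ← Y); the lab’s actual figure code may have differed:

# A minimal sketch of the collider example (assumed structure: A -> Y, A -> W <- Y)
library(ggdag)

collider_dag <- dagify(
  Y ~ A,       # main effect of interest
  W ~ A + Y,   # W is a collider of A and Y
  exposure = "A",
  outcome = "Y"
)

# Unadjusted: the path A -> W <- Y is blocked by the collider W
ggdag(collider_dag) + theme_dag()

# Adjusting for W activates a path between its parents A and Y
ggdag_adjust(collider_dag, var = "W") + theme_dag()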
DAG Example Questions
Let’s go through a few examples and answer a few questions about each DAG. Remember, we are interested in understanding the effect of exposure (A) on the outcome (Y ).
Question 1: Answer the following questions for the DAG above:
a. What are the endogenous variables?
b. What are the exogenous variables?
c. Are there any exclusion restrictions? If so, what are they?
d. Are there any independence assumptions? If so, what are they?
e. Are there any unblocked backdoor paths? If so, what is the path? (Note: There may be multiple paths)
f. Are there any colliders? If so, what are they? What path(s) do they block? What would happen if you were to condition on them?

Question 1 Solutions:
a. X = {W1, A, Y}
b. U = {U}
c. No.
d. No.
e. Yes, two: A ← W1 → Y and A ← U → Y.
f. No.
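You can also check answers like these mechanically. The sketch below reconstructs the Question 1 DAG from the solutions above (A ← W1 → Y and A ← U → Y, plus the direct arrow A → Y, which is assumed) and asks dagitty to enumerate the paths:

# Hedged sketch: Question 1 DAG reconstructed from the solutions above
library(dagitty)

q1_dag <- dagitty("dag {
  U [latent]
  U -> A
  U -> Y
  W1 -> A
  W1 -> Y
  A -> Y
}")

# Lists every path from A to Y; open = TRUE marks the unblocked ones
paths(q1_dag, from = "A", to = "Y")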
Question 2: Answer the following questions for the DAG below:
a. What are the endogenous variables?
b. What are the exogenous variables?
c. Are there any exclusion restrictions? If so, what are they?
d. Are there any independence assumptions? If so, what are they?
e. Are there any unblocked backdoor paths? If so, what is the path? (Note: There may be multiple paths)
f. Are there any colliders? If so, what are they? What path(s) do they block? What would happen if you were to condition on them?
Question 2 Solutions:
a. X = {W1, A, Y}
b. U = {UW1, UA, UY}
c. No.
d. Yes; UW1 ⊥ UA, UW1 ⊥ UY, and UA ⊥ UY.
e. No; the only indirect path, A → W1 ← Y, is blocked by the collider W1.
f. Yes; W1; A → W1 ← Y; it would induce an unblocked backdoor path between A and Y.
Question 3: Answer the following questions for the DAG above:
a. What are the endogenous variables?
b. What are the exogenous variables?
c. Are there any exclusion restrictions? If so, what are they?
d. Are there any independence assumptions? If so, what are they?
e. Are there any unblocked backdoor paths? If so, what is the path? (Note: There may be multiple paths)
f. Are there any colliders? If so, what are they? What path(s) do they block? What would happen if you were to condition on them?
Question 3 Solutions:
a. X = {W1, W2, A, Y}
b. U = {UW1, UW2, UA, UY}
c. Yes; there is an assumption of no direct causal relationship between W2 and A.
d. Yes; UW1 ⊥ UA, UW1 ⊥ UY, UW1 ⊥ UW2, UW2 ⊥ UA, UW2 ⊥ UY, and UA ⊥ UY.
e. Yes; A → W1 → W2 → Y.
f. Yes; W1; A → W1 ← Y; it would induce an unblocked backdoor path between A and Y.
Instrumental Variables
Instrumental Variables Rationale
Recall from our consideration of randomized experiments that, when implemented properly, randomizing the exposure allows us to ensure independence between the exposure and any other covariates. A simple DAG representing this situation when considering only the exposure A and outcome Y is shown below.

DAG of a randomized experiment
This independence of A from any measured covariates W and from any unmeasured confounders U is what allows us to make direct causal inferences about the effect of A on Y in randomized experiments.
As we have seen, however, observational data usually do not afford us the same freedom. Let us consider the DAG below.
DAG of what you might see with observational data
This simple DAG represents an unfortunately common situation in observational studies, in which the exposure A and the outcome Y are thought to have measured and unmeasured confounders in common.
We have explored many methods of accounting for measured confounders W, but what of unmeasured confounders U? We cannot control for a variable we cannot measure.
One strategy to combat this concern is to determine whether we might find some measurable covariate Z which can “represent” the exposure A but which, unlike A, is independent from unmeasured confounders.

Such a covariate, if found, is called an instrumental variable.
Instrumental Variable Criteria
While instrumental variables can be an exciting, clever “loophole” to the problem of exposure non-independence, they must be chosen with care.
In order for some variable Z to be a valid instrument, it must be:
• Causally related to the exposure A. This can be represented in the DAG with an arrow Z → A. This is commonly referred to as the First Stage.
• Exogenous to the system. That is, independent from (or not correlated with) the other covariates in the system both measured (W) and unmeasured (U). This can be represented in the DAG as the absence of arrows between Z and the Ws (a.k.a. exclusion restrictions) and the absence of arrows between unmeasured confounders of Z (UZ) and any other unmeasured confounders U (a.k.a. independence assumptions).
– In other words, there should be no unblocked backdoor path from Z to Y –the only path from Z to Y must be that through A. Confusingly, this criterion is commonly referred to simply as the Exclusion Restriction.
In the following DAG, Z satisfies these requirements and is a valid instrument of the effect of A on Y.
Figure: DAG of an instrumental variable analysis
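dagitty can also search a DAG for variables satisfying these criteria. A minimal sketch, assuming the basic IV structure just described (Z → A → Y, with unmeasured U confounding A and Y):

# Hedged sketch: the basic IV DAG
library(dagitty)

iv_dag <- dagitty("dag {
  U [latent]
  Z -> A
  A -> Y
  U -> A
  U -> Y
}")

# Returns Z if it is a valid instrument for the effect of A on Y
instrumentalVariables(iv_dag, exposure = "A", outcome = "Y")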
This second criterion has some inherent flexibility, however. In the case of a causal relationship between Z and any measured confounders W, we can control for said confounders and still find this requirement satisfied. Consider the following DAG:

Figure: DAG of an instrumental variable analysis with multiple paths from Z to Y
The above DAG shows an unblocked backdoor path from Z to Y through W. However, if we control for W we see this path disappear (Note: arrows turn grey when they can be ignored after adjustment):
Figure: the same DAG after adjusting for W; the backdoor path from Z to Y through W is now blocked.
Now the only path from Z to Y is the direct path through A.
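This can be checked mechanically as well. A sketch, assuming the structure pictured above (Z ← W → Y alongside Z → A → Y, with unmeasured U confounding A and Y):

# Hedged sketch: IV DAG in which the measured confounder W affects both Z and Y
library(dagitty)

ivw_dag <- dagitty("dag {
  U [latent]
  Z -> A
  A -> Y
  U -> A
  U -> Y
  W -> Z
  W -> Y
}")

# Without adjustment, Z <- W -> Y is an open backdoor path from Z to Y
paths(ivw_dag, from = "Z", to = "Y")

# Conditioning on W closes it, leaving only Z -> A -> Y open
paths(ivw_dag, from = "Z", to = "Y", Z = "W")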
However, remember we must as always be cautious when adjusting for any covariates. In the previous example, we began with an independence assumption that UW ⊥ UYA. Let us consider the following DAG without this independence assumption:
Note the only difference here is that W shares unmeasured confounding U with A and Y. Now we again control for W:
Figure: the DAG with shared unmeasured confounding, after adjusting for W.

Here we see that we still have an unblocked backdoor path from Z to Y. (Note that controlling for W should not induce a relationship between UZ and U, only between Z and U; the extra edge is an issue with the package.)
Question 4: What is the new unblocked backdoor path from Z to Y ? Why did controlling for W open up this path?
Solution: Z → W ← U → Y. W is a collider of Z and U because it has two arrows going into it, so conditioning on W induces an association between its parents Z and U, opening a path from Z to Y through U.
Recall that whenever we control for a covariate we must be on the lookout for colliders. Consider the following DAG:
Notice here that we again have the independence assumption UW ⊥ UYA, saving us from the problem just considered. However, W itself is now a collider on the path from Z to Y.
Question 5: Why is this a problem? What would happen if we controlled for W? Include a DAG in your answer.
Solution: Conditioning on W will induce a path from Z to Y directly, which is therefore an unblocked backdoor path (of sorts) since it does not pass through A.
## NOTE: The adjustment code below is not working; this seems to be an issue with how the package
## handles controlling for colliders.
# ex_dag4 %>%
#   control_for(var = "W") %>%
#   ggdag_adjust() +
#   geom_dag_node(aes(color = adjusted)) +
#   geom_dag_text(col = "white")

# Instead, you can use this as an opportunity to draw your DAG by hand and include a picture of it here.
# Be sure to change "example-dag.jpg" below to the correct name of your file
knitr::include_graphics("example-dag.jpg")
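As a working alternative to the broken plotting code, you can confirm the opened path numerically with dagitty. This sketch assumes the Question 5 structure described above (Z → A → Y, U confounding A and Y, and W a collider via Z → W ← Y):

# Hedged sketch: DAG in which W is a collider on a path from Z to Y
library(dagitty)

q5_dag <- dagitty("dag {
  U [latent]
  Z -> A
  A -> Y
  U -> A
  U -> Y
  Z -> W
  Y -> W
}")

# Unconditionally, the only open path from Z to Y runs through A
paths(q5_dag, from = "Z", to = "Y")

# Conditioning on the collider W opens a second path, Z -> W <- Y
paths(q5_dag, from = "Z", to = "Y", Z = "W")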

Two-Stage Least Squares (2SLS) Regression
In practice, instrumental variables are used most often in the context of linear regression models using Two-Stage Least Squares (2SLS) Regression.
Recall that a simple linear regression model looks as follows: Y = β0 + β1A + ε
Where the parameter coefficients β0, β1 represent the y-intercept and slope, respectively, and ε represents the error term.
Earlier we saw that a problem arises when A and Y have unmeasured confounders U in common. This problem is diagnosed when considering the causal relationships represented in our DAG, but in practice is often discovered as a correlation between A and the error term ε.
Exclusion Restriction
There is no empirical way to determine whether the “exclusion restriction” requirement discussed above (that the only causal path from Z to Y must be that through A) is met. You must use your knowledge of the system to develop what you believe to be an accurate DAG, and then determine whether your intended instrument satisfies this requirement based on that DAG. However, in practice, a variable Z can be ruled out as a potential instrument if it appears correlated with ε.
First Stage
The “first stage” requirement (that Z must have a causal effect on A), however, can be empirically tested, and as the name implies, doing so is indeed the first stage in implementing an instrumental variable analysis.
To do so, we simply run a linear regression of the intended instrument Z on the exposure A (and any measured confounders W that we have determined appropriate to control for):
Z = β0 + β1A + ε
If this regression shows a strong relationship between Z and A, Z is considered a strong instrument and we may proceed. If the relationship is weak, however, Z is considered a weak instrument and may be a poor choice of instrument. (A common rule of thumb is a first-stage F-statistic above 10.)
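As a sketch of this check, using the lab’s formulation of the first stage and a hypothetical data frame dat with columns Z, A, and W:

# Hedged sketch: first-stage strength check on a hypothetical data frame `dat`
first_stage <- lm(Z ~ A + W, data = dat)

# Look at the coefficient on A and the overall F-statistic for strength
summary(first_stage)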

If we decide to move forward with using Z as an instrument, we save the predicted values of the instrument Ẑ and the covariance of Z and A (Cov(Z, A)) for the next stage.
Question 6: Consider, what are some potential concerns with using a weak instrument?
Solution: There are many possible answers, but the primary concern is that Z may not truly have a causal effect on A (or at least, not a very strong one). Weak instruments also inflate the variance of the 2SLS estimator and can bias it in finite samples.
Second Stage
Now that we have the predicted values of the instrument Ẑ, we regress the outcome Y on these values, like so: Y = β0 + β1Ẑ + ε
We then retrieve the covariance between Z and Y (Cov(Z, Y)). The ratio between this and Cov(Z, A) is then our 2SLS estimate of the coefficient on A in the original model:
β̂1 = Cov(Z, Y) / Cov(Z, A)
Question 7: Explain in your own words why you think the above estimates the desired parameter. Your answer here.
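One way to sketch an answer (among several valid ones): take the covariance of each side of the structural model Y = β0 + β1A + ε with Z, and use the exclusion restriction, which implies Cov(Z, ε) = 0:

Cov(Z, Y) = Cov(Z, β0 + β1A + ε) = β1 Cov(Z, A) + Cov(Z, ε) = β1 Cov(Z, A)

Since the first-stage requirement guarantees Cov(Z, A) ≠ 0, dividing through gives β1 = Cov(Z, Y) / Cov(Z, A).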
Natural Experiments
A common source of potential instrumental variables is natural experiments. A “natural experiment” refers to observational data in which randomization has been applied to an exposure (or instrumental) variable, but that randomization was not implemented by the researchers (i.e. it was implemented by “nature”). Common examples include legislative differences in similar jurisdictions (or legislative changes in a single jurisdiction, comparing shortly before and shortly after said change), proximity to a source of the exposure of interest, and many others.
Simulation
Let us consider a modified version of our AspiTyleCedrin example explored previously. In this version, say that both exposure to AspiTyleCedrin and the outcome of experiencing a migraine are affected by watching cable news, since advertisements for AspiTyleCedrin are commonly shown on cable news channels, and stress from excessive cable news watching can trigger migraines. Say also that living near a pharmacy that carries AspiTyleCedrin makes people more likely to use it, but is not related to cable news watching or experience of migraines. Furthermore, say sex assigned at birth does have an effect on both AspiTyleCedrin use and experience of migraines, but is not causally related to either cable news watching or proximity to a pharmacy that sells AspiTyleCedrin. (Note: This is just an example; in reality there may be reason to suspect causal relationships that we are not considering here.)
Thus we have the following variables:
Endogenous variables:
• A: Treatment variable indicating whether the individual i took AspiTyleCedrin (Ai = 1) or not (Ai = 0).
• Y: Continuous outcome variable indicating the number of migraines experienced by an individual in the past month. (NOTE: We have previously measured this variable as binary!)
• W: Variable representing sex assigned at birth, with W = 0 indicating AMAB (assigned male at birth), W = 1 indicating AFAB (assigned female at birth), and W = 2 indicating an X on the birth certificate, possibly representing an intersex individual or left blank.
• Z: Instrumental variable indicating the proximity in miles of individual i to a pharmacy that sells AspiTyleCedrin.

Exogenous variables:
• U_YA: Unmeasured confounding variable, cable news watching, which affects the exposure A and the outcome Y.
• U_Z: Unmeasured confounding variable(s) which affect the instrument Z.
And our DAG is as follows:
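As a sketch, the structure described above can also be written in dagify() form (U_YA and U_Z are unmeasured, so they are marked latent):

# Hedged sketch of the simulation DAG described above
library(ggdag)

sim_dag <- dagify(
  Z ~ U_Z,           # unmeasured factors determine proximity to a pharmacy
  A ~ Z + W + U_YA,  # proximity, sex assigned at birth, and cable news affect use
  Y ~ A + W + U_YA,  # use, sex assigned at birth, and cable news affect migraines
  exposure = "A",
  outcome = "Y",
  latent = c("U_YA", "U_Z")
)
ggdag(sim_dag) + theme_dag()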
Simulate the dataset:
# set seed for reproducibility
set.seed(10)

n = 1e4 # Number of individuals (smaller than last time)

# NOTE: Again, don't worry too much about how we're creating this dataset,
# this is just an example.
df <- data.frame(U_Z = rnorm(n, mean=50, sd=5),
                 U_YA = rbinom(n, size = 1, prob = 0.34),
                 W = sample(0:2, size = n, replace = TRUE, prob = c(0.49,0.50,0.01)),
                 eps = rnorm(n))
df <- df %>%
  mutate(Z = 1.2*U_Z + 5,
         A = as.numeric(rbernoulli(n,
                                   p = (0.03 + 0.06*(W > 0) + 0.7*(Z < 60) + 0.21*(U_YA == 1)))),
         Y = 5 - 4*A + 1*W + 3*U_YA)

head(df)

##        U_Z U_YA W          eps        Z A Y
## 1 50.09373    1 1  0.974739870 65.11248 0 9
## 2 49.07874    0 0 -0.006558132 63.89448 0 5
## 3 43.14335    0 1  1.567393278 56.77202 1 2
## 4 47.00416    0 0  0.474007817 61.40499 0 5
## 5 51.47273    0 0 -0.944051166 66.76727 0 5
## 6 51.94897    0 1 -1.543734178 67.33877 0 6

summary(df)

##       U_Z             U_YA              W               eps
##  Min.   :32.34   Min.   :0.0000   Min.   :0.0000   Min.   :-4.199057
##  1st Qu.:46.63   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:-0.676836
##  Median :49.97   Median :0.0000   Median :1.0000   Median : 0.019156
##  Mean   :50.01   Mean   :0.3389   Mean   :0.5201   Mean   : 0.003346
##  3rd Qu.:53.39   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.: 0.679510
##  Max.   :69.06   Max.   :1.0000   Max.   :2.0000   Max.   : 4.101319
##        Z               A                Y
##  Min.   :43.81   Min.   :0.0000   Min.   : 1.000
##  1st Qu.:60.96   1st Qu.:0.0000   1st Qu.: 5.000
##  Median :64.97   Median :0.0000   Median : 5.000
##  Mean   :65.01   Mean   :0.2656   Mean   : 5.474
##  3rd Qu.:69.07   3rd Qu.:1.0000   3rd Qu.: 6.000
##  Max.   :87.88   Max.   :1.0000   Max.   :10.000

Question 8: Use the lm() function to regress proximity Z on AspiTyleCedrin use A and sex assigned at birth W. Assign the predicted values to the variable name Z_hat. Use the cov() function to find Cov(Z, A) and assign the result to the variable name cov_za.
# 1. first stage
# ———-
lm_out1 <- lm(Z ~ A + W, # regress Z (instrument) on A + W
              data = df) # specify data

# view model summary
summary(lm_out1)

## lm(formula = Z ~ A + W, data = df)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -21.2030  -3.6628  -0.7653   3.3060  22.9693
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.56501    0.08187  813.02   <2e-16 ***
## A           -6.22232    0.12184  -51.07   <2e-16 ***
## W            0.18394    0.10449    1.76
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.374 on 9997 degrees of freedom
## Multiple R-squared: 0.207, Adjusted R-squared: 0.2068
## F-statistic: 1304 on 2 and 9997 DF, p-value: < 2.2e-16

# get fitted values (Z-hat)
Z_hat <- lm_out1$fitted.values

# get the covariance of Z and A
cov_za <- cov(df$Z, df$A)

Question 9: Use the lm() function to regress migraines Y on your fitted values Z_hat. Use the cov() function to find Cov(Z, Y) and assign the result to the variable name cov_zy.

# 2. reduced form
# ----------
lm_out2 <- lm(Y ~ Z_hat, # regress Y (outcome) on fitted values from first stage
              data = df) # specify data

# view model summary
summary(lm_out2)

## lm(formula = Y ~ Z_hat, data = df)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2.0402 -1.2868 -0.3828  1.7132  3.5213
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -28.44453    0.34358  -82.79   <2e-16 ***
## Z_hat         0.52177    0.00528   98.81   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.449 on 9998 degrees of freedom
## Multiple R-squared: 0.4941, Adjusted R-squared: 0.494
## F-statistic: 9764 on 1 and 9998 DF, p-value: < 2.2e-16

# get the covariance of Z and Y
cov_zy <- cov(df$Z, df$Y)

Question 10: Use your cov_za and cov_zy to estimate the coefficient β1 in the following equation: Y = β0 + β1A + β2W + ε
Interpret your result in words.

# 3. calculate treatment effect
# ----------
beta_hat <- cov_zy/cov_za # divide Cov(Z,Y) by Cov(Z,A)
beta_hat

## [1] -3.899776

When controlling for sex assigned at birth, use of AspiTyleCedrin reduces migraines by approximately 3.9 per month.

The AER package also provides us with the ivreg() function, which allows us to perform IV regression in one command (it also computes appropriate standard errors for the second stage, which the manual two-step approach does not):

# repeat using ivreg()
# ----------
lm_out3 <- ivreg(Y ~ A + W | W + Z, # reduced form (think of as normal OLS model) | controls + instrument
                 data = df) # specify data

# view model summary
summary(lm_out3)

## ivreg(formula = Y ~ A + W | W + Z, data = df)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.0904 -1.0107 -0.9815  1.9388  2.0476
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  6.01072    0.02634  228.21   <2e-16 ***
## A           -3.92033    0.07041  -55.68   <2e-16 ***
## W            0.97082    0.02761   35.16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.413 on 9997 degrees of freedom
## Multiple R-Squared: 0.5195, Adjusted R-squared: 0.5194
## Wald test: 1969 on 2 and 9997 DF, p-value: < 2.2e-16

Question 11: Compare the estimate of the coefficient on A in the output above to your previous answer.

The results are very similar. In this case the estimate using ivreg() is slightly larger in magnitude, but if you repeat this with a different seed, it might be smaller. The small difference arises because ivreg() adjusts for W in both stages, while the covariance ratio above does not; both approaches are consistent here, so the estimates agree closely.

Modelsummary

There are a number of packages that can help you quickly and easily format your results for a paper. My favorite is the modelsummary library because it is so flexible, intuitive, and easily customizable; check out the documentation. I’ve given you some code to quickly compare your results with a basic OLS model and format the table for a paper, which is based off this great tutorial.

# you might need to (re)install tinytex() if not already. Follow instruction prompt
# tinytex::reinstall_tinytex(repository = "illinois")

# create a list of models
models <- list(
  "OLS" = lm(Y ~ A + W, data = df), # since we didn't run an OLS above, we can specify it
  "IV" = lm_out3 # specify the model output
)

# display table
modelsummary(models, # specify list of models
             title = 'OLS vs Instrumental Variable Analysis', # add title
             notes = "You can insert a footnote here." # add notes and much more!
)

Table 1: OLS vs Instrumental Variable Analysis

              OLS              IV
(Intercept)   5.852 (0.021)    6.011 (0.026)
A            −3.264 (0.031)   −3.920 (0.070)
W             0.941 (0.027)    0.971 (0.028)
Num.Obs.      10 000           10 000
R2            0.540            0.519
R2 Adj.       0.540            0.519
AIC           34 863.1         35 291.6
BIC           34 892.0         35 320.5
Log.Lik.     −17 427.562
F             5859.331
RMSE          1.38             1.41

You can insert a footnote here.

References

http://dx.doi.org/10.2139/ssrn.3135313
https://www.statisticshowto.com/instrumental-variable/
https://umanitoba.ca/faculties/health_sciences/medicine/units/chs/departmental_units/mchp/protocol/media/Instrumental_variables.pdf
https://rpubs.com/wsundstrom/t_ivreg
https://en.wikipedia.org/wiki/Instrumental_variables_estimation#Testing_the_exclusion_restriction
https://towardsdatascience.com/a-beginners-guide-to-using-instrumental-variables-635fd5a1b35f
https://www.econometrics-with-r.org/12-1-TIVEWASRAASI.html