
6-3 Matching Methods – Solutions
March 12, 2024
# libraries
xfun::pkg_attach2(c("tidyverse", # load all tidyverse packages
                    "here",      # set file path
                    "MatchIt",   # for matching
                    "optmatch",  # for matching
                    "cobalt"))   # for matching assessment

# chunk options
# ----------
knitr::opts_chunk$set(warning = FALSE) # prevents warnings from appearing after code chunks

# prevent scientific notation
# ----------
options(scipen = 999)
As we saw in last week's lab, an important advantage of randomized experiments is that they allow researchers to ensure independence between the exposure variable and other covariates; that is, treatment and control groups have similar covariate distributions and differ only randomly.
The same cannot be said of observational studies, no matter how large the sample size. Thus, researchers often use a variety of matching methods to try to replicate this matching of covariate distributions between exposure groups.
In this lab we will consider some of these matching methods. Note that these methods are all implemented in the analysis stage (i.e. after the study has already been completed), and are distinct from (though may be similar to) methods of conducting studies which are matched from the outset.
Furthermore, matching should not be seen as an alternative to modeling adjustments such as regression; rather, the two are often used together, as the sketch below illustrates.
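For example, here is a minimal sketch of that combined workflow: match first, then fit an outcome regression on the matched sample. This is an illustrative aside, not part of the lab exercises; it assumes the df data frame simulated in the next section, and the object names m.out, df_matched, and model_matched are arbitrary.

# match, then model - illustrative sketch (not part of the lab exercises)
# ---------
m.out <- matchit(A ~ W1 + W2 + W3, data = df, method = "nearest") # 1:1 nearest-neighbor matching
df_matched <- match.data(m.out) # matched sample, with weights and subclass columns added
# outcome regression on the matched sample, adjusting for the covariates again
model_matched <- glm(Y_obs ~ A + W1 + W2 + W3,
                     family = binomial(),
                     data = df_matched,
                     weights = weights) # matching weights returned by match.data()
summary(model_matched)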
Simulation
We will again use the simulated example from last week assessing the effectiveness of AspiTyleCedrin at treating migraines. As a reminder, this dataset contained the following variables:
• A: Treatment variable indicating whether individual i:
  – DID take AspiTyleCedrin (A_i = 1)
  – DID NOT take AspiTyleCedrin (A_i = 0)
• Y_obs: Outcome variable indicating whether individual i:
  – DID experience a migraine (Y_i^obs = 1)
  – DID NOT experience a migraine (Y_i^obs = 0)
• W1: Variable representing sex assigned at birth:
  – W1 = 0 indicating AMAB (assigned male at birth)
  – W1 = 1 indicating AFAB (assigned female at birth)
  – W1 = 2 indicating an X on the birth certificate, an intersex individual, or left blank
• W2: Variable representing simplified racial category:
  – W2 = 0 indicating White
  – W2 = 1 indicating Black or African American
  – W2 = 2 indicating Non-White Hispanic or Latinx
  – W2 = 3 indicating American Indian or Alaska Native
  – W2 = 4 indicating Asian
  – W2 = 5 indicating Native Hawaiian or Other Pacific Islander
Say that there is concern among providers that AspiTyleCedrin may be less effective among individuals with a higher Body Mass Index (BMI). To simulate this, we will modify the code we used to create the original AspiTyleCedrin dataset to also include the variable W3 representing an individual’s BMI. (We’ll also modify the treatment and observed outcomes to be confounded by this variable.)
# set seed
# ----------
set.seed(42) # set so that random process of generating data is reproducible

# set the number of individuals for simulated dataset
# ----------
n = 1e4 # Number of individuals (smaller than last time)

# NOTE: Again, don't worry too much about how we're creating this dataset,
# this is just an example.

# W3 scaled to have mu=24 and sigma=4 a la
# https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4789291/
# where k = mu^2/sigma^2 and theta = sigma^2/mu
# Also make treatment less likely so that there are more controls,
# and add ID column
df <- data.frame(ID = seq.int(n),
                 W1 = sample(0:2, size = n, replace = TRUE,
                             prob = c(0.49, 0.50, 0.01)),
                 W2 = sample(0:5, size = n, replace = TRUE,
                             prob = c(0.60, 0.13, 0.19, 0.06, 0.015, 0.005)),
                 W3 = rgamma(n, shape = 36, scale = (2/3)))

df <- df %>%
  mutate(W3 = W3 + 8*(W1 == 1) + 12*(W2 == 2) +
           8*(W2 == 3) + 4*(W2 == 4) + (-4)*(W2 == 5),
         A = as.numeric(rbernoulli(n,
           p = (0.16 + 0.07*(W1 > 0) + 0.21*(W2 == 0) - 0.1*(W3 > 25)))),
         Y_0 = as.numeric(rbernoulli(n,
           p = (0.87 + 0.035*(W1 > 0) + 0.05*(W2 > 0)) +
             abs((W3 - 22)/100))),
         Y_1 = as.numeric(rbernoulli(n,
           p = (0.34 + 0.035*(W1 > 0) + 0.3*(W2 > 0)) +
             abs((W3 - 22)/100) + 0.2*(W3 > 30))),
         ITE = Y_1 - Y_0,
         Y_obs = as.numeric((A & Y_1) | (!A & Y_0)))

ATE_true <- mean(df$ITE)
df_a1 <- df %>% filter(A == 1)
ATT_true <- mean(df_a1$ITE)

df <- df %>% select(-Y_0, -Y_1, -ITE)
df_a1 <- df_a1 %>% select(-Y_0, -Y_1, -ITE)
df_a0 <- df %>% filter(A == 0)
head(df)

##   ID W1 W2       W3 A Y_obs
## 1  1  0  0 28.51215 0     1
## 2  2  0  2 34.91706 0     1
## 3  3  1  1 31.11242 0     1
## 4  4  0  0 26.56770 0     1
## 5  5  0  2 29.66014 0     1
## 6  6  0  2 38.53180 0     1

summary(df)

##        ID             W1               W2               W3
##  Min.   :    1  Min.   :0.0000   Min.   :0.0000   Min.   :12.84
##  1st Qu.: 2501  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:25.25
##  Median : 5000  Median :1.0000   Median :0.0000   Median :30.44
##  Mean   : 5000  Mean   :0.5203   Mean   :0.7677   Mean   :30.79
##  3rd Qu.: 7500  3rd Qu.:1.0000   3rd Qu.:2.0000   3rd Qu.:35.55
##  Max.   :10000  Max.   :2.0000   Max.   :5.0000   Max.   :57.38
##        A               Y_obs
##  Min.   :0.0000  Min.   :0.0000
##  1st Qu.:0.0000  1st Qu.:1.0000
##  Median :0.0000  Median :1.0000
##  Mean   :0.2568  Mean   :0.8666
##  3rd Qu.:1.0000  3rd Qu.:1.0000
##  Max.   :1.0000  Max.   :1.0000
Let’s take a look at the covariate distributions, comparing those that did and did not take AspiTyleCedrin:
Sex Assigned at Birth (SAAB)
For this chunk, there is extra ggplot code that illustrates how you might customize a figure for publication. There is a lot more you can do, so be sure to delve into the ggplot documentation to see all that is possible.
# treatment status by sex
# --------------------------------------------------
df %>%
  # processing
  # ----------
  mutate(sex = case_when(W1 == 0 ~ "Male",        # assigned male at birth
                         W1 == 1 ~ "Female",      # assigned female at birth
                         W1 == 2 ~ "X/intersex"), # an X on the birth certificate, representing an intersex individual
         sex = fct_relevel(sex, "Male", "Female", "X/intersex"), # relevel factor
         treatment = case_when(A == 0 ~ "Control",
                               A == 1 ~ "Treatment")) %>%
  # plot
  # ----------
  ggplot(aes(x = sex, fill = treatment)) +
  # create a bar plot using geom_bar()
  geom_bar() +
  geom_text(stat = "count", aes(label = ..count..), # calculate count and pass to label
            vjust = -0.5) +                         # vjust adds space between bar and label
  # facet grid controls the number of panels - prefer this to facet_wrap
  facet_grid(
    #rows = vars(treatment), # facet variable in the rows
    cols = vars(treatment)   # facet variable in the columns
  ) +
  theme_bw() +                        # set base black and white theme
  theme(legend.position = "bottom") + # theme functions manipulate different elements of the plot's appearance
  scale_fill_manual(values = c("#800000", "#027148")) + # assign colors using the values parameter
  scale_y_continuous(breaks = seq(0, 4000, 1000),       # y axis floor, ceiling, and step
                     labels = scales::label_number(scale = 1,      # scale the variable
                                                   accuracy = 1,   # decimal points
                                                   big.mark = ",", # add "," or "."
                                                   prefix = "",
                                                   suffix = ""),   # add a suffix, e.g., "%" or "k"
                     limits = c(0, 4000)) +                        # set floor and ceiling
  labs(x = "Sex Assigned at Birth",         # x-axis label
       y = "Count",                         # y-axis label
       fill = "Treatment status",           # legend label
       caption = "Note: ",                  # add a caption
       title = "Distribution of Sex Assigned at Birth by Treatment Status") # title

[Figure: "Distribution of Sex Assigned at Birth by Treatment Status" (bar plots of Male, Female, and X/intersex counts, faceted by Control and Treatment)]
# chi-squared to test difference
# ----------
chisq.test(table(df$A, df$W1))

##
## Pearson's Chi-squared test
##
## data: table(df$A, df$W1)
## X-squared = 7.8191, df = 2, p-value = 0.02005
The bar plot above clearly shows a difference in the distribution of SAAB among the two groups, and this is confirmed by the small p-value (0.02) from the χ2 test.
# treatment status by race
# --------------------------------------------------
df %>%
  # processing
  # ----------
  mutate(race = case_when(W2 == 0 ~ "White",    # non-Hispanic White
                          W2 == 1 ~ "Black",    # non-Hispanic Black
                          W2 == 2 ~ "Hispanic", # Latinx
                          W2 == 3 ~ "AIAN",     # American Indian or Alaska Native
                          W2 == 4 ~ "Asian",    # Asian
                          W2 == 5 ~ "NH/OPI"),  # Native Hawaiian or Other Pacific Islander
         race = fct_relevel(race, "White", "Black", "Hispanic", "AIAN", "Asian", "NH/OPI"), # relevel factor
         treatment = case_when(A == 0 ~ "Control",
                               A == 1 ~ "Treatment")) %>%
  # plot
  # ----------
  ggplot(aes(x = race, fill = treatment)) +
  geom_bar() +
  facet_grid(cols = vars(treatment)) +  # facet variable in the columns
  theme_bw() +                          # set base black and white theme
  theme(legend.position = "bottom") +   # theme functions manipulate different elements of the plot's appearance
  labs(title = "Distribution of Racial Category by Treatment Status", fill = "")
[Figure: "Distribution of Racial Category by Treatment Status" (bar plots of racial categories, faceted by Control and Treatment)]

# chi-squared to test difference
# ----------
chisq.test(table(df$A, df$W2))

##
## Pearson's Chi-squared test
##
## data: table(df$A, df$W2)
## X-squared = 661.27, df = 5, p-value < 0.00000000000000022

The bar plot above again shows a difference in the distribution of simplified racial category among the two groups, and this is again confirmed by the very small p-value from the χ2 test.

You can find more documentation for the plotting parameters here.

Finally, since BMI (W3) is a continuous variable, we can use geom_histogram() to view its distribution by treatment status.

# treatment status by BMI
# --------------------------------------------------
df %>%
  # processing
  # ----------
  mutate(treatment = case_when(A == 0 ~ "Control",
                               A == 1 ~ "Treatment")) %>%
  # plot
  # ----------
  ggplot(aes(x = W3, fill = treatment)) +
  geom_histogram(binwidth = 1, aes(y = ..density..)) +
  facet_grid(rows = vars(treatment)) +  # facet variable in the rows
  theme_bw() +                          # set base black and white theme
  theme(legend.position = "bottom") +   # theme functions manipulate different elements of the plot's appearance
  labs(title = "Distribution of BMI among Treated and Untreated",
       x = "BMI",
       fill = "")

[Figure: "Distribution of BMI among Treated and Untreated" (density histograms of BMI, faceted by Control and Treatment)]
# t-test to test difference
# ----------
t.test(W3 ~ A, data = df)
## Welch Two Sample t-test
## data: W3 by A
## t = 15.686, df = 4735.2, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  2.243671 2.884605
## sample estimates:
## mean in group 0 mean in group 1
##        31.45203        28.88789

While it may be difficult to determine from the histogram above how the distribution of BMI differs among the two groups, the very small p-value from the t-test shows evidence of a clear difference. Thus we can see the need to improve the matching of these covariate distributions.

Matching Considerations

There are a number of factors to consider when choosing a matching method, including the following:

• Distance Metric
• Greediness
• Control:Treatment Ratio
• Caliper Width
• Replacement
• Estimand

Distance Metric

The goal of matching is to match each treatment unit (in our case, each individual who took AspiTyleCedrin, A == 1) to one or more "control" units (in our case, individuals who did not take AspiTyleCedrin, A == 0) based on baseline covariates (in our case, W1, W2, W3). Conceptually, this means we are trying to find the control unit(s) that most closely resemble the counterfactual for each treatment unit.

Exact Matching

Ideally, we would like to find the control unit(s) which have identical values on all covariates. This is called "exact matching". For our dataset, this would mean each individual who took AspiTyleCedrin (A == 1) would be matched with individual(s) who did not take AspiTyleCedrin (A == 0) with the exact same SAAB (W1), racial category (W2), and BMI (W3). In other words, the exact distance between two points X_i, X_j, where X_i = {W1_i, W2_i, W3_i} and X_j = {W1_j, W2_j, W3_j}, is defined as:

$$\text{Distance}(X_i, X_j) = \begin{cases} 0 & \text{if } X_i = X_j \\ \infty & \text{if } X_i \neq X_j \end{cases}$$

Question 1: The data frame df_a0 contains all the individuals that did not take AspiTyleCedrin, and the data frame df_a1 contains all those who did. In the R code chunk below, use the first ten rows of df_a0 and the first five rows of df_a1 to find the exact distance of the first ten individuals who did not take AspiTyleCedrin from each of the first five individuals who did. (Hint: How many comparisons should you be making?)

# calculate exact matches by hand
# --------------------------------------------------
# create dataframes
# ---------
df_a0_small <- df_a0[1:10,]
df_a1_small <- df_a1[1:5,]
cols <- c("W1", "W2", "W3")

# create function to only keep matches where observations are equal
# ---------
dist.exact <- function(x, y) {
  ifelse(all(x == y), 0, NA) # NA means no match
}

# function to calculate distances
# ---------
calculate.dist <- function(x, y, dist.method,
                           xnames = df_a1_small$ID,
                           ynames = df_a0_small$ID) {
  dists <- apply(y, 1, function(j) {apply(x, 1, function(i) {dist.method(i, j)})})
  rownames(dists) <- xnames
  colnames(dists) <- ynames
  return(dists)
}

# apply to data and save as new object
# ---------
dists_ex <- calculate.dist(df_a1_small[, cols], # x
                           df_a0_small[, cols], # y
                           dist.exact)          # distance metric
dists_ex

##     1  2  3  4  5  6  7  8 10 12
## 9  NA NA NA NA NA NA NA NA NA NA
## 11 NA NA NA NA NA NA NA NA NA NA
## 15 NA NA NA NA NA NA NA NA NA NA
## 16 NA NA NA NA NA NA NA NA NA NA
## 17 NA NA NA NA NA NA NA NA NA NA

While exact matching is ideal, it is not always possible, such as in the case of continuous variables like our BMI variable, W3.

Question 2: Explain why matching on a continuous variable would likely be impossible.
ANSWER: The probability of any exact value of a continuous variable is by definition zero, so even taking rounding into account, the probability of finding exact matches on a continuous variable is very low.

Question 3: Modify your code above to only check the distance for W1 and W2 values.

# check again but omit W3 (BMI)
# ---------
dists_ex_lim <- calculate.dist(df_a1_small[, cols[1:2]],
                               df_a0_small[, cols[1:2]],
                               dist.exact)
dists_ex_lim

##     1  2  3  4  5  6  7  8 10 12
## 9   0 NA NA  0 NA NA  0 NA  0 NA
## 11 NA NA NA NA NA NA NA NA NA NA
## 15 NA NA NA NA NA NA NA NA NA NA
## 16 NA NA NA NA NA NA NA NA NA  0
## 17  0 NA NA  0 NA NA  0 NA  0 NA

Since exact matching is not always possible, there are a variety of alternative distance metrics which may be used to determine how similar a potential match is. A few of these methods are discussed below.

Mahalanobis Distance Matching

The Mahalanobis distance in general is a "multi-dimensional generalization of the idea of measuring how many standard deviations away [some point] P is from the mean of [some distribution] D." In the context of matching, however, the Mahalanobis distance measures this distance between the two points X_i and X_j rather than that between one point and a distribution. Mathematically, this version of the Mahalanobis distance is defined as follows:

$$\text{Distance}(X_i, X_j) = \sqrt{(X_i - X_j)^T S^{-1} (X_i - X_j)}$$

where $S$ is the covariance matrix of $X_i$ and $X_j$, and $S^{-1}$ is its inverse.

Question 4: Using the cov() function to find the covariance matrix of {W1, W2, W3} from the whole dataset, modify your code from Question 1 to instead find the Mahalanobis distance of the first ten individuals who did not take AspiTyleCedrin from each of the first five individuals who did. (Hint: The t() function will transpose a vector or matrix, and matrix multiplication uses the %*% operator, not *)

# calculate Mahalanobis distance metric
# --------------------------------------------------
# calculate covariance matrix
# ---------
cov_df <- cov(df[, cols])

# create a function to calculate Mahalanobis distance
# ---------
dist_mahalanobis <- function(x, y) {
  diff <- (x - y)                       # difference between the two covariate vectors
  sqrt( t(diff) %*% cov_df %*% (diff) ) # transpose the difference and multiply through the covariance matrix
  # (strictly speaking, the formula above calls for the inverse covariance
  # matrix, solve(cov_df); this follows the lab's original code)
}

# apply function to calculate Mahalanobis distance
# ---------
dists_ma <- calculate.dist(df_a1_small[, cols], # x
                           df_a0_small[, cols], # y
                           dist_mahalanobis)    # distance metric
dists_ma

##            1         2        3         4        5         6        7        8
## 9  53.711777 102.53125 73.88371 39.210885 63.33835 129.48478 66.22732 192.1228
## 11 36.824702  12.16181 16.68337 51.324811 27.30501  38.99333 24.31058 101.5984
## 15  4.977753  53.79966 25.14157  9.554379 14.70509  80.74327 17.47859 143.3723
## 16 50.456454  99.25355 70.61490 35.959232 60.05258 126.20982 62.97019 188.8519
## 17 57.306452 106.12532 77.47812 42.805559 66.93147 133.07908 69.82200 195.7173
##           10       12
## 9   18.47621 78.78527
## 11 109.01085 11.81294
## 15  67.23230 30.04789
## 16  21.76174 75.51503
## 17  14.88153 82.37969

Propensity Score Matching

The propensity score of an individual is a measure of the probability of that individual receiving the treatment based upon the baseline covariates. That is, given a set of covariate values ({W1_i, W2_i, W3_i} in our case), the propensity score represents the estimated probability of treatment (A_i = 1). The propensity score is often estimated using a logit model and is therefore defined as follows:

$$\pi_i = P(A_i = 1 \mid X_i) = \frac{1}{1 + e^{-X_i \beta}}$$

We can estimate these propensity scores using logistic regression, by regressing the treatment A on the baseline covariates X, like so:

# fit a logit model
# --------------------------------------------------
model_ps <- # save logit model as an object
  glm(A ~ W1 + W2 + W3,    # regress A (treatment) on covariates (W1, W2, W3)
      family = binomial(), # specifying binomial calls a logit model
      data = df)           # specify data for regression

# print summary
summary(model_ps)

## glm(formula = A ~ W1 + W2 + W3, family = binomial(), data = df)
##
## Coefficients:
##              Estimate Std. Error z value             Pr(>|z|)
## (Intercept)  0.015900   0.121459   0.131                0.896
## W1           0.358689   0.056395   6.360      0.0000000002014 ***
## W2          -0.531363   0.033945 -15.654 < 0.0000000000000002 ***
## W3          -0.031861   0.004817  -6.614      0.0000000000373 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 11394 on 9999 degrees of freedom
## Residual deviance: 10709 on 9996 degrees of freedom
## AIC: 10717
##
## Number of Fisher Scoring iterations: 4

We can then use this model and the predict() function to add all of the estimated propensity scores for each data point in df:

# add propensity scores to the data
# ---------
df <- # save over df dataframe object
  df %>% # pass data
  mutate(prop_score = predict(model_ps)) # create a new variable with the predicted propensity score
# (note: predict() on a glm returns the linear predictor, i.e. the log-odds,
# by default; add type = "response" to get probabilities)
# update the subsetted datasets - WOULD NOT DO THIS IN YOUR WORK
# ---------
df_a0 <- df %>% filter(A == 0) # save anything under control as a dataframe
df_a1 <- df %>% filter(A == 1) # save anything under treatment as a dataframe
df_a0_small <- df_a0[1:10,]    # further subsetting
df_a1_small <- df_a1[1:5,]     # further subsetting

Propensity score matching uses the absolute difference between two propensity scores as its distance metric, or rather:

$$\text{Distance}(X_i, X_j) = |\pi_i - \pi_j|$$

Question 5: Again modify your previous code to find the propensity score distance of the first ten individuals who did not take AspiTyleCedrin from each of the first five individuals who did.

# calculate distances based on propensity scores
# ---------
dist.prop.score <- function(x, y) {
  abs(x - y) # distance based on absolute value
}

# apply function
# ---------
dists_ps <- calculate.dist(as.matrix(df_a1_small[, "prop_score"]), # x
                           as.matrix(df_a0_small[, "prop_score"]), # y
                           dist.prop.score)                        # method
dists_ps

##            1         2          3         4         5        6         7
## 9  0.2294698 1.4962611 0.48498975 0.1675185 1.3287721 1.611429 0.2829393
## 11 0.2024718 1.4692631 0.45799176 0.1405205 1.3017741 1.584431 0.2559413
## 15 0.3809625 1.6477538 0.63648243 0.3190112 1.4802647 1.762922 0.4344320
## 16 0.3136261 0.9531652 0.05810617 0.3755774 0.7856761 1.068333 0.2601566
## 17 0.2448272 1.5116184 0.50034708 0.1828758 1.3441294 1.626786 0.2982966
##            8         10        12
## 9  1.5192731 0.07893488 0.8657146
## 11 1.4922751 0.10593288 0.8387166
## 15 1.6707658 0.07255779 1.0172072
## 16 0.9761772 0.62203081 0.3226186
## 17 1.5346305 0.06357755 0.8810719

Double Robustness

A key advantage of propensity score matching is that, when used in conjunction with outcome regression, it provides a "doubly robust" estimator. That is, "When used individually to estimate a causal effect, both outcome regression and propensity score methods are unbiased only if the statistical model is correctly specified. The doubly robust estimator combines these 2 approaches such that only 1 of the 2 models need be correctly specified to obtain an unbiased efficient estimator."

"Correctly specified" means that a model accurately represents the relationship between the variables. E.g. a linear model between x and y is correctly specified if and only if x and y truly do have a linear relationship to each other. This means that only one of the two models (the model of treatment on covariates or the model of outcome on treatment and covariates) needs to accurately represent the relationships among the respective variables in order for the estimate to be unbiased.
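To make the double-robustness idea concrete, here is a minimal sketch of an augmented inverse-probability-weighted (AIPW) doubly robust estimate of the ATE. This is an illustrative aside, not part of the lab solutions; it assumes the df and model_ps objects created above, and the names model_out, mu1, mu0, and ate_dr are arbitrary.

# doubly robust (AIPW) estimator - illustrative sketch
# ---------
model_out <- glm(Y_obs ~ A + W1 + W2 + W3, family = binomial(), data = df)   # outcome model
ps  <- predict(model_ps, type = "response")                                  # propensity scores as probabilities
mu1 <- predict(model_out, newdata = transform(df, A = 1), type = "response") # predicted outcome if treated
mu0 <- predict(model_out, newdata = transform(df, A = 0), type = "response") # predicted outcome if untreated

# augment each outcome-model prediction with an inverse-probability-weighted residual
ate_dr <- mean(df$A * (df$Y_obs - mu1) / ps + mu1) -
  mean((1 - df$A) * (df$Y_obs - mu0) / (1 - ps) + mu0)
ate_dr

If either the outcome model or the propensity model is correctly specified, this estimator is consistent, which is exactly the double-robustness property described above.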
Greediness

Once we decide upon a distance metric, we must also choose a matching algorithm. That is, how shall the computed distances be used to determine a match? The various matching algorithms fall into two general categories: "greedy" and optimal.

"Greedy" Matching

Greedy algorithms in general are used to reduce larger problems to smaller ones by taking the best option at the time and repeating, while never returning to earlier choices to make changes. In the context of matching, this means that a greedy matching algorithm chooses the best single match first and removes that chosen match. It then repeats this process by choosing the best single match still remaining and removing that match, and so on.

There are a number of different ways to decide which match to deem "best", including but not limited to:

• Choose the treatment participant with the highest propensity score first, and match it to the "control" participant with the closest propensity score (shortest propensity score distance).
• Same as above but start with the lowest rather than highest propensity score.
• The best overall match (minimum of all match distances) in the entire dataset.
• Random selection.

Most greedy matching algorithms in common use (including those listed above) are "nearest neighbor" algorithms, which choose a treatment individual first and match it to a control individual rather than the reverse. (A packaged version of this approach is sketched below.)
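For reference, here is a minimal sketch of greedy nearest-neighbor propensity score matching using the MatchIt package loaded at the top of the lab. This is an aside, not part of the solutions; the object name m.greedy is arbitrary.

# greedy nearest-neighbor matching via MatchIt - illustrative sketch
# ---------
m.greedy <- matchit(A ~ W1 + W2 + W3,    # treatment regressed on covariates
                    data = df,
                    method = "nearest",  # greedy nearest-neighbor matching
                    distance = "glm",    # propensity scores from logistic regression
                    m.order = "largest") # match units with the largest distance values first
summary(m.greedy)
cobalt::bal.tab(m.greedy) # assess covariate balance after matching

In the exercises below we instead implement the greedy orderings by hand, to see exactly what such an algorithm is doing.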
Question 6: Using the propensity score distances you made in Question 5, find the greedy matching of this subset using highest to lowest propensity score. Report the IDs of both elements of each matched pair. (Hint: You may find the which.min() and which.max() functions helpful)

# use greedy matching - subset on highest to lowest propensity
# --------------------------------------------------
# create new datasets
# ---------
treat <- c()   # create empty treatment vector
control <- c() # create empty control vector
df_a1_small_copy <- as.data.frame(df_a1_small) # create a copy to prevent overwriting within the loop
dists_ps_copy <- as.data.frame(dists_ps)       # create a copy to prevent overwriting within the loop

# loop through to grab matches based on propensity scores
# ---------
for(i in 1:nrow(df_a1_small)) {
  max_treat <- which.max(df_a1_small_copy$prop_score)    # position of the highest remaining propensity score
  treat[i] <- rownames(dists_ps_copy)[max_treat]         # add that unit's ID to the treat vector
  match_control <- which.min(dists_ps_copy[max_treat, ]) # find its closest match in control
  control[i] <- colnames(dists_ps_copy)[match_control]   # store the matched control's ID as a character
  df_a1_small_copy <- df_a1_small_copy[-max_treat, ]     # remove the matched treatment unit from the dataframe
  dists_ps_copy <- dists_ps_copy[-max_treat, -match_control, drop = FALSE] # drop what we have just selected
  # (base indexing is used instead of slice()/select() so that the row and
  # column names of the distance matrix are preserved across iterations)
}

# ---------
treat

## [1] "15" "17" "9" "11" "16"

control

## [1] "10" "4" "1" "7" "3"

Question 7: Same as Question 6, but now find the greedy matching of this subset using lowest to highest propensity score.
# use greedy matching - subset on lowest to highest propensity
# --------------------------------------------------
# create new datasets
# ---------
treat <- c()
control <- c()
df_a1_small_copy <- as.data.frame(df_a1_small)
dists_ps_copy <- as.data.frame(dists_ps)

# loop through to grab matches based on propensity scores
# ---------
for(i in 1:nrow(df_a1_small)) {
  min_treat <- which.min(df_a1_small_copy$prop_score)    # position of the lowest remaining propensity score
  treat[i] <- rownames(dists_ps_copy)[min_treat]         # store its ID
  match_control <- which.min(dists_ps_copy[min_treat, ]) # find its closest control match
  control[i] <- colnames(dists_ps_copy)[match_control]   # store the matched control's ID
  df_a1_small_copy <- df_a1_small_copy[-min_treat, ]     # remove the matched treatment unit
  dists_ps_copy <- dists_ps_copy[-min_treat, -match_control, drop = FALSE] # drop its row and the control's column
}

# ---------
treat
## [1] "16" "11" "9" "17" "15"

control

## [1] "3" "10" "4" "1" "7"
Question 8: Same as in the previous two problems, but now find the greedy matching of this subset using best overall match.
# use greedy matching - subset using best overall
# --------------------------------------------------
# create new datasets
# ---------
treat <- c()
control <- c()
dists_ps_copy <- as.data.frame(dists_ps)

# loop through to grab matches based on propensity scores
# ---------
for(i in 1:nrow(df_a1_small)) {
  best <- which(dists_ps_copy == min(dists_ps_copy), arr.ind = TRUE) # row/column position of the best overall match
  treat[i] <- rownames(dists_ps_copy)[best[1]]
  control[i] <- colnames(dists_ps_copy)[best[2]]
  dists_ps_copy <- dists_ps_copy[-best[1], -best[2], drop = FALSE]   # drop the matched row and column
}

# ---------
treat

## [1] "16" "17" "11" "9" "15"

control

## [1] "3" "10" "4" "1" "7"
Question 9: Were there any differences in the matchings you found in the previous three problems?

ANSWER: Yes; each of the three greedy orderings produced a different set of matched pairs, meaning the choice of matching algorithm can have a big impact.
Optimal Matching
Optimal matching, as the name implies, seeks to find an optimal matching scheme in which the overall match difference is minimized. For example, if we were to add the distances of all match pairs chosen, an optimal matching would seek the set of match pairs which produces the smallest sum. A disadvantage of optimal matching is that it can be computationally intensive without providing sufficient improvements over greedy matching.
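As an aside, optimal pair matching is available through MatchIt, which calls the optmatch package loaded at the top of the lab. A minimal sketch, with m.opt as an arbitrary name:

# optimal matching via MatchIt - illustrative sketch
# ---------
m.opt <- matchit(A ~ W1 + W2 + W3,
                 data = df,
                 method = "optimal", # minimizes the total within-pair distance (requires optmatch)
                 distance = "glm")
summary(m.opt)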
Control:Treatment Ratio
You may have noticed that in the previous examples we only selected one “control” individual for each treatment individual, often called 1 : 1 matching. However, in some cases we may prefer to match more than one control to each treatment, often called k : 1 matching, where k is the number of control individuals desired per treatment individual. (Note: while we are not considering them here, there are matching algorithms which discard treatment individuals rather than control individuals)
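As an aside, MatchIt implements k:1 matching through its ratio argument. A minimal sketch for k = 2, with m.2to1 as an arbitrary name:

# 2:1 nearest-neighbor matching via MatchIt - illustrative sketch
# ---------
m.2to1 <- matchit(A ~ W1 + W2 + W3,
                  data = df,
                  method = "nearest",
                  distance = "glm",
                  ratio = 2) # two control matches per treatment individual
summary(m.2to1)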
Question 10: Modify your code from Question 6 to perform a 2:1 matching rather than 1:1. That is, find the two best “control” matches for each treatment individual, using highest to lowest propensity score.
# manual matching - using 2:1 ratio
# --------------------------------------------------
# create new datasets
# ---------
treat <- c()
control_1 <- c()
control_2 <- c()
df_a1_small_copy <- as.data.frame(df_a1_small)
dists_ps_copy <- as.data.frame(dists_ps)

# loop through to grab matches based on propensity scores
# ---------
for(i in 1:nrow(df_a1_small)) {
  max_treat <- which.max(df_a1_small_copy$prop_score)      # position of the highest remaining propensity score
  treat[i] <- rownames(dists_ps_copy)[max_treat]           # store its ID
  df_a1_small_copy <- df_a1_small_copy[-max_treat, ]       # remove the treatment unit from the copy
  match_control_1 <- which.min(dists_ps_copy[max_treat, ]) # first (closest) control match
  control_1[i] <- colnames(dists_ps_copy)[match_control_1]
  dists_ps_copy <- dists_ps_copy[, -match_control_1, drop = FALSE] # drop that control's column
  if(ncol(dists_ps_copy) > 1) {
    match_control_2 <- which.min(dists_ps_copy[max_treat, ]) # second-closest control match
    control_2[i] <- colnames(dists_ps_copy)[match_control_2]
    dists_ps_copy <- dists_ps_copy[-max_treat, -match_control_2, drop = FALSE] # drop its column and the treatment row
  } else {
    control_2[i] <- colnames(dists_ps_copy) # only one control column left, so take it
  }
}

# ---------
treat

## [1] "15" "17" "9" "11" "16"

control_1

## [1] "10" "1" "3" "5" "8"

control_2

## [1] "4" "7" "12" "2" "6"

Question 11: Did any of the matches you made in Question 6 change in Question 10?

ANSWER: Yes.

It is also possible to have a variable number of control individuals per treatment individual in "full" matching. Full matching assures that every individual in the dataset is paired. Full matching can only be achieved using an optimal matching algorithm.

Caliper Width

As seen in 1:1 and k:1 matching, some data may be pruned in favor of other priorities. We may also choose to prune data for which no sufficiently close match can be found. For this method we choose a threshold, or "caliper", and only consider matches whose distance is within this caliper width, discarding any individuals left unmatched.

Replacement

Another consideration when deciding upon a matching algorithm is whether matches are made with or without replacement. That is, can the same control individual be matched to more than one treatment individual? You may notice that so far we have only considered matching without replacement.

Question 12: Write code to perform the same gr