Matching strategy for clusters that don't include both treatment groups - multilevel matching #188

Open
leonALIVE opened this issue Jan 28, 2024 · 3 comments

Comments

@leonALIVE

Is there a workaround to get matchit to preferentially match within cluster and to find a match outside the cluster if one does not exist within? (Something similar to Cannas and Arpino (2019) CMatching "hybrid matching" that is no longer supported.)

My example,
I'm evaluating the effect of an intervention (treatment) applied to patients (subjects) in hospitals (clusters or groups). In more than a third of hospitals, either all or none of the patients were exposed to the intervention. Strict within-cluster matching options require me to subset (= exclude) a large section of the study population.

I can group hospitals by hospital-level covariates to increase cluster size, but I was hoping there may be a more elegant approach to this problem that is common in my field.

@ngreifer
Collaborator

ngreifer commented Jan 28, 2024

That's a great question. I can think of an ad-hoc workaround that would be fairly straightforward to implement but would require some manual coding. Essentially, you do regular matching but put a large penalty on any between-cluster matches. The way you could implement this penalty would be by adding a large positive number to the distance between units in different clusters in a distance matrix. That way, between-cluster matches would only occur if the within-cluster match was impossible (e.g., because there were no units left or all remaining units were banned due to a caliper or other constraint). You would also need to match in order of closeness, i.e., by setting m.order = "closest", which would ensure every unit that can get a within-cluster match gets one before any between-cluster matches are sought.
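To make the reasoning above concrete, here is a minimal base-R sketch on made-up numbers. The propensity scores, cluster labels, and the `greedy_closest()` helper are all hypothetical and only imitate the idea of matching in order of closeness; this is not MatchIt's own code.

```r
# Toy data (invented): 3 treated units and 3 control units.
ps_treated <- c(t1 = 0.30, t2 = 0.50, t3 = 0.70)
ps_control <- c(c1 = 0.32, c2 = 0.52, c3 = 0.71)
cl_treated <- c(t1 = "A", t2 = "A", t3 = "B")
cl_control <- c(c1 = "A", c2 = "B", c3 = "C")

# Absolute propensity score difference for every treated-control pair
# (rows = treated, columns = control).
d <- abs(outer(ps_treated, ps_control, "-"))

# TRUE where the two units belong to different clusters.
between <- outer(cl_treated, cl_control, "!=")
dimnames(between) <- dimnames(d)

# Penalize between-cluster pairs by more than any unpenalized distance.
d_pen <- d + between * 100 * max(d)

# Hypothetical helper: greedy 1:1 matching without replacement that always
# takes the smallest remaining distance first.
greedy_closest <- function(dmat) {
  matches <- character(0)
  while (nrow(dmat) > 0 && ncol(dmat) > 0) {
    idx <- which(dmat == min(dmat), arr.ind = TRUE)[1, ]
    matches[rownames(dmat)[idx[1]]] <- colnames(dmat)[idx[2]]
    dmat <- dmat[-idx[1], -idx[2], drop = FALSE]
  }
  matches
}

matched <- greedy_closest(d_pen)
matched
```

In this toy example, t1 and t3 take the controls from their own clusters first, and t2, whose only same-cluster control is already used, falls back to a control from another cluster.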

Here is how you might implement this using propensity score matching.

library(MatchIt)

#Compute PS
ps <- glm(A ~ X1 + X2 + cluster, data = data, family = binomial)$fitted.values

#Compute PS distance (rows are treated units, columns are control units)
dist <- euclidean_dist(A ~ ps, data = data)

#Create penalty matrix (> 0 where the two units are in different clusters)
cluster_dist <- euclidean_dist(A ~ cluster, data = data)

#Apply penalty matrix
dist[cluster_dist > 0] <- dist[cluster_dist > 0] + 100 * max(dist)

#Do matching
m <- matchit(A ~ X1 + X2 + cluster, data = data,
             distance = dist, m.order = "closest")

#Find which treated units received matches outside their cluster
rownames(m$match.matrix)[cluster_dist[cbind(rownames(m$match.matrix), m$match.matrix[,1])] > 0]

Setting the penalty to Inf is equivalent to doing exact matching on cluster; setting the penalty to anything larger than the largest distance will prioritize within-cluster matching and do between-cluster matching only for the units that require a match outside their cluster, still prioritizing otherwise close matches. You can modify the penalty to penalize different clusters different amounts. The great thing about being able to supply a distance matrix is that you can implement whatever penalty or restriction you want.
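As a rough illustration of penalizing different clusters by different amounts, the sketch below builds a two-tier penalty in base R. Everything in it is made up for the example: the toy propensity scores, the hospital labels, and the hospital-level `type` grouping. Pairs in the same hospital get no penalty, pairs in different hospitals of the same type get one penalty, and pairs in hospitals of different types get twice that.

```r
# Toy data (invented): 2 treated units and 3 control units.
ps_treated <- c(t1 = 0.40, t2 = 0.60)
ps_control <- c(c1 = 0.41, c2 = 0.45, c3 = 0.61)
cl_treated <- c(t1 = "H1", t2 = "H2")
cl_control <- c(c1 = "H3", c2 = "H4", c3 = "H2")

# Hypothetical hospital-level grouping, named by hospital.
type <- c(H1 = "rural", H2 = "urban", H3 = "urban", H4 = "rural")

# Propensity score distance (rows = treated, columns = control).
d <- abs(outer(ps_treated, ps_control, "-"))

# Indicators for "different hospital" and "different hospital type".
diff_cluster <- outer(cl_treated, cl_control, "!=")
diff_type <- outer(unname(type[cl_treated]), unname(type[cl_control]), "!=")

# Tiered penalty: 0 (same hospital), base (same type), 2 * base (other type).
base <- 100 * max(d)
d_pen <- d + base * diff_cluster + base * diff_type

# Closest control for each treated unit under the penalized distance.
best <- apply(d_pen, 1, function(row) names(which.min(row)))
best
```

Here t1 is pointed to c2 (another rural hospital) even though c1 is closer on the propensity score, while t2 keeps its same-hospital control c3. A matrix built this way would presumably be passed through `distance =` in the same way as `dist` above, provided it has the same layout and dimnames as the matrix returned by `euclidean_dist()`.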

@leonALIVE
Author

leonALIVE commented Jan 28, 2024

Thank you for this beautiful solution, Noah!

Note that, for some reason, the code to find which treated units received matches outside their cluster does not work. It just produces a matrix of NAs (regardless of whether I run the code on lalonde or my own test dataset).

The rest of it works perfectly.

Here is the test data I'm using:
https://github.com/leonALIVE/fake_data/blob/main/dtax.csv

And here is your code, using the variable names in the test dataset. The covariates included in the model below are just for testing purposes. The cluster variable indicating hospital is called 'DAG'.

dtax <- read.csv("~/dtax.csv")

ps <- glm(surg_checklist~age+gender+Hb+chronic_comorbid___1+anes_techniq+Specialists+DAG,
          data = dtax, family = binomial)$fitted

dist <- euclidean_dist(surg_checklist ~ ps, data = dtax)

cluster_dist <- euclidean_dist(surg_checklist ~ DAG, data = dtax)

dist[cluster_dist > 0] <- dist[cluster_dist > 0] + 100 * max(dist)

m.out <- matchit(surg_checklist~DAG+age+gender+Hb+chronic_comorbid___1+anes_techniq+Specialists,
                  data = dtax, distance = dist, m.order = "closest", replace = TRUE)

@ngreifer
Collaborator

Glad it worked! Change the rbind() to cbind() and it should work correctly. I'll make that edit above in case someone else wants to use the code.
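Another way to flag between-cluster matches, sketched here as an alternative rather than as the approach used in this thread, is to compare cluster labels directly instead of indexing the penalty matrix. The objects below are toy stand-ins: `clusters` plays the role of the cluster column named by unit, and `mm` plays the role of a match matrix whose row names are treated units and whose entries are the names of their matched controls (NA where no match was found).

```r
# Toy stand-ins (invented) for the cluster variable and the match matrix.
clusters <- c(u1 = "A", u2 = "A", u3 = "B", u4 = "C", u5 = "C")
mm <- matrix(c("u2", "u4", NA), ncol = 1,
             dimnames = list(c("u1", "u3", "u5"), NULL))

# Cluster of each treated unit and of each of its matched controls.
treated_cluster <- clusters[rownames(mm)]
matched_cluster <- matrix(clusters[mm], nrow = nrow(mm),
                          dimnames = dimnames(mm))

# TRUE where a match comes from another cluster, NA where there is no match.
outside <- matched_cluster != treated_cluster

# Treated units whose first match lies outside their own cluster.
outside_units <- names(which(outside[, 1]))
outside_units
```

With the objects from this thread, the stand-ins would presumably be replaced by something like `setNames(dtax$DAG, rownames(dtax))` and `m.out$match.matrix`, assuming the match matrix entries are row names of the data.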
