
Propensity Score Matching

Pan Jie, Lecturer, West China School of Public Health

Email: panjie.jay@gmail.com
Website: panjie.org
Quantitative Methods in Health Policy and Management Research, November 7, 2013
Introductory example
 Question: What is the impact of a cash transfer intervention on maternal mortality?
 OLS? Would there be selection bias?
 Propensity Score Matching is one strategy that corrects for selection bias when making such estimates

Evaluation Problem
Let Y_i^T be the medical cost of patient i in the treatment group (T), i.e. Y_i^T | T (1).

Let Y_i^C be the medical cost of patient i in the control group, i.e. Y_i^C | C (2).

We are interested in Y_i^T | T − Y_i^C | T, the effect of treatment on medical cost.

We will never observe the same patient's medical cost both with and without treatment at the same time.

But we may hope to learn the average effect of treatment on medical cost, E(Y_i^T − Y_i^C | T). Then

E(Y_i^T − Y_i^C | T) = [E(Y_i^T | T) − E(Y_i^C | C)] + [E(Y_i^C | C) − E(Y_i^C | T)]

The first bracket is the observable difference between groups (1) and (2); the second is the selection bias that a naive comparison of (1) and (2) ignores.
What are Propensity Scores?
 Paul Rosenbaum and Donald Rubin (1983) define a
propensity score as “the conditional probability of
assignment to a particular treatment given a vector of
observed covariates.”
 Propensity scores range from 0 to 1. In a randomized experiment with an equal-probability assignment mechanism and two treatment conditions, each person has a true propensity score of .5. In a non-experimental, observational study, propensity scores must be estimated.
 Two assumptions that are made are:
 (1) Stable Unit-Treatment Value Assumption: There
is a unique value rti corresponding to unit i and
treatment t.
 (2) Strongly Ignorable Treatment Assignment: the
responses, rti, are conditionally independent of the
treatment assignment, t, given the observed
covariates, and for each covariate the subjects have
a positive probability of receiving the treatment.

Generating Propensity Scores
 Propensity scores can be estimated using
several methods, but the most commonly used
method is logistic regression.

 The regression uses observable covariates, and following Rubin and Thomas (1996), "unless a variable can be excluded because there is a consensus that it is unrelated to outcome or is not a proper covariate, it is advisable to include it in the propensity score model even if it is not statistically significant." Thus, propensity scores are covariate-promiscuous.
 When logistic regression is used, the observed
covariates are the predictors and the treatment
assignment (dummy coded 0=No
Treatment/exposure, 1=treatment/exposure)
is used as the dependent variable.

 The predicted value (probability) is the propensity score; each person in the target population will end up with a propensity score, unless they have missing values on covariates.
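The estimation step above can be sketched in pure Python (an illustrative sketch, not the production Stata workflow): a logistic regression fitted by gradient ascent, whose fitted probabilities are the propensity scores. The data here are synthetic, and the helper names are hypothetical.

```python
import math
import random

def fit_logit(X, t, lr=0.1, iters=2000):
    """Fit treatment ~ covariates by gradient ascent on the logistic
    log-likelihood; returns (intercept, coefficient list)."""
    n, k = len(X), len(X[0])
    b0, b = 0.0, [0.0] * k
    for _ in range(iters):
        g0, g = 0.0, [0.0] * k
        for xi, ti in zip(X, t):
            p = 1.0 / (1.0 + math.exp(-(b0 + sum(bj * xj for bj, xj in zip(b, xi)))))
            e = ti - p                       # residual drives the gradient
            g0 += e
            for j in range(k):
                g[j] += e * xi[j]
        b0 += lr * g0 / n
        b = [bj + lr * gj / n for bj, gj in zip(b, g)]
    return b0, b

def propensity_scores(X, t):
    """Predicted treatment probabilities = estimated propensity scores."""
    b0, b = fit_logit(X, t)
    return [1.0 / (1.0 + math.exp(-(b0 + sum(bj * xj for bj, xj in zip(b, xi)))))
            for xi in X]

# Synthetic data: subjects with larger x are more likely to be treated.
random.seed(0)
X = [[random.gauss(0, 1)] for _ in range(200)]
t = [1 if random.random() < 1 / (1 + math.exp(-1.5 * x[0])) else 0 for x in X]
ps = propensity_scores(X, t)
```

Because treatment assignment depends on x, the treated group's average estimated score exceeds the controls' average score, which is exactly what matching later exploits.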

Limitations of Matching
 If the two groups do not have substantial
overlap, then substantial error may be
introduced:
 E.g., if only the worst cases from the
untreated “comparison” group are compared
to only the best cases from the treatment
group, the result may be regression toward
the mean
 makes the comparison group look better
 makes the treatment group look worse
Propensity Score Overlap - 1
[Figure: back-to-back histogram of propensity scores (0 to 1, in .05 intervals) by number of infants (0-400 on each side), comparing the 5-19% Poverty group with the < 5% Poverty group.]
Data Source: MN Department of Health, Center for Health Statistics, LBD 1990-1999
Propensity Score Overlap - 2
[Figure: back-to-back histogram of propensity scores (0 to 1, in .05 intervals) by number of infants (0-400 on each side), comparing the 40-100% Poverty group with the < 5% Poverty group.]
Data Source: MN Department of Health, Center for Health Statistics, LBD 1990-1999
Propensity Scores - An Example of NO Overlap
[Figure: back-to-back histogram of propensity scores (0 to 1) by number of subjects (0-400 on each side) for the High Poverty and Low Poverty groups; the two distributions do not overlap.]
Common support
 For the matching, we had to decide whether the test should be performed only on observations with propensity scores inside the common support region (i.e., precisely on the subset of the comparison group most comparable to the treatment group) or on the full comparison group.
 Heckman et al., (1997) argue that imposing the
common support restriction in the estimation of
propensity scores improves the quality of the
estimates. Lechner (2001), on the other hand, argues
that besides reducing the sample considerably,
imposing the restriction may lose high-quality
matches at the boundary of the common support
region.
 General practice is to use common support.
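Imposing common support can be sketched in a few lines of Python (an illustration; the `common_support` helper and the example scores are hypothetical). Units whose score falls outside the overlap of the two groups' score ranges are dropped.

```python
def common_support(ps_treated, ps_control):
    """Keep only units whose propensity score lies in the overlap of the
    two groups' score ranges (the common support region)."""
    lo = max(min(ps_treated), min(ps_control))
    hi = min(max(ps_treated), max(ps_control))
    keep_t = [p for p in ps_treated if lo <= p <= hi]
    keep_c = [p for p in ps_control if lo <= p <= hi]
    return keep_t, keep_c

# The treated unit at 0.90 and the control at 0.05 have no counterparts
# in the other group, so both are discarded.
t_kept, c_kept = common_support([0.30, 0.55, 0.90], [0.05, 0.35, 0.60])
```

This is the simplest min/max rule; as the Lechner (2001) argument above notes, it can discard informative observations near the boundary.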
Cases Are Excluded at Both Ends of
the Propensity Score
[Figure: distributions of the predicted probability for participants and nonparticipants; cases at both tails are excluded, and only the middle range of matched cases is retained.]
Propensity Score Matching (PSM)
 Employs a predicted probability of group
membership
 E.g. treatment vs. control group
 Based on observed predictors, usually obtained from
logistic regression to create counterfactual group
(Rosenbaum & Rubin, 1983)
 Dependent variable: T=1, if participate; T=0, otherwise
T=f(age, gender, pre-cci, etc.)
 Allows “quasi-randomized” experiment
 Two subjects, one in treated group and one in the
control, with the same (or similar) propensity score,
can be seen as “randomly assigned” to either group
Criteria for “Good” PSM Before
Matching
 Identify treatment and comparison group with
substantial overlap
 Same exclusion and inclusion criteria
 Overweighting some variables (Medicare vs. Medicaid)
 Choose variables appropriately
 Conduct logit estimations correctly

Choosing appropriate variables
 Approach
 Including irrelevant variables increases the
variance of predictions
 Dropping independent variables that have
small coefficients can reduce the average
error of predictions
Choose Appropriate Conditioning
Variables-2
 Include all variables that affect both treatment
assignment and the outcome variables
 Including variables only weakly related to treatment
assignment usually reduces bias more than it will
increase variance
 To avoid post-treatment bias, we should exclude
variables affected by the treatment variables
 Step-wise regression
Stata command: sw logit treatment age cci cap female race,
pr(.2) backward
My favorite:
Stata command: sw logit treatment (age female) cci cap
female race, pr(.2) pe(.5) lockterm1
Example
 Variables that affect both treatment and outcomes
Ex: age, gender, female, race, pre-period severity, top 10 comorbidities, plan type, CCI, total number of diagnoses, pre-period expenditures
 Variables that may create post-treatment bias (OVER-MATCHING)
Ex: post-period severity, comorbidities, CCI, total number of diagnoses
General Procedure for Conducting
PSM
 STEP 1: Run Logistic Regression
 Dependent variable: T=1, if participate; T=0,
otherwise
 Choose appropriate conditioning variables
 Obtain propensity score: predicted probability (p)
or log[p/(1-p)]
 STEP 2: Match Each Participant to One or More
Non-participants on propensity Score
 Stratification Matching
 Nearest Neighbor Matching
 Caliper Matching
 Mahalanobis Matching
 Kernel Matching
 Radius Matching

Stratification Method
 Divide the range of variation of the propensity
score in intervals such that within each interval
treated and control units have, on average, the
same propensity score
 Calculate the difference in the outcome measure between the treatment and the control group in each interval
 Average treatment effect is obtained as an
average of outcomes of each block with
weights given by the distribution of treated
units across blocks
 Discard observations in blocks where either treated or control units are absent

Stata command: atts cost, pscore(phat) blockid(5) boot reps(250)
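The block-by-block averaging described above can be sketched in pure Python (an illustration of the idea, not the `atts` implementation; the five equal-width blocks and the toy (propensity, outcome) pairs are assumptions made for the example).

```python
def stratified_att(treated, control, n_blocks=5):
    """treated/control: lists of (propensity, outcome) pairs.
    Average the within-block outcome differences, weighting each block by
    its number of treated units; blocks missing either group are discarded."""
    edges = [i / n_blocks for i in range(n_blocks + 1)]
    total_t, acc = 0, 0.0
    for lo, hi in zip(edges, edges[1:]):
        bt = [y for p, y in treated if lo <= p < hi or (hi == 1 and p == 1)]
        bc = [y for p, y in control if lo <= p < hi or (hi == 1 and p == 1)]
        if not bt or not bc:
            continue  # discard blocks without both treated and control units
        diff = sum(bt) / len(bt) - sum(bc) / len(bc)
        acc += diff * len(bt)
        total_t += len(bt)
    return acc / total_t

# Two usable blocks (diffs 2 and 6, one treated unit each); the treated
# unit at 0.55 and the control at 0.90 sit in blocks that get discarded.
att = stratified_att(
    [(0.15, 12.0), (0.35, 20.0), (0.55, 30.0)],
    [(0.10, 10.0), (0.30, 14.0), (0.90, 50.0)])
```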

Nearest Neighbor Matching
 Randomly order the participants and non-participants
 Then select the first participant and find the non-participant with the closest propensity score
Nearest Neighbor Matchup (1 to 1)

Treatment propensity scores: 0.005, 0.007
Control propensity scores: 0.005, 0.0055, 0.006, 0.0061, 0.009
Matches: 0.005 → 0.005; 0.007 → 0.0061 (its nearest remaining control)

Stata command: psmatch2, outcome(cost) pscore(phat) n(1) noreplacement
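A greedy version of this procedure, reusing the example scores above, might look like the following (an illustrative sketch, not `psmatch2` itself; the `nn_match` helper is hypothetical).

```python
def nn_match(ps_treated, ps_control, replace=False):
    """Greedy 1-to-1 nearest-neighbor match on the propensity score;
    returns {treated_score: matched_control_score}. Without replacement,
    each control can be used at most once."""
    pool = list(ps_control)
    matches = {}
    for pt in ps_treated:
        best = min(pool, key=lambda pc: abs(pc - pt))
        matches[pt] = best
        if not replace:
            pool.remove(best)  # consume the matched control
    return matches

# Without replacement: 0.005 pairs with 0.005, and 0.007 with its
# closest remaining control, 0.0061.
m = nn_match([0.005, 0.007], [0.005, 0.0055, 0.006, 0.0061, 0.009])
```

Passing `replace=True` reproduces the with-replacement variant, where several treated units may share one control.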
Nearest Neighbor Matchup (2 to 1)

Treatment propensity scores: 0.005, 0.007
Control propensity scores: 0.005, 0.0055, 0.006, 0.0061, 0.009
Matches: 0.005 → {0.005, 0.0055}; 0.007 → {0.006, 0.0061}

Stata command: psmatch2, outcome(cost) pscore(phat) n(2) noreplacement
Nearest Neighbor Matchup with Replacement (1 to 1)

Treatment propensity scores: 0.005, 0.0051
Control propensity scores: 0.005, 0.0055, 0.006, 0.0061
Matches: both 0.005 and 0.0051 → 0.005 (the same control is reused)

Stata command: psmatch2, outcome(cost) pscore(phat) n(1)
Caliper Matching
 Define a common support region
 Suggested caliper: one quarter of the standard deviation of the estimated propensity score
 Randomly select one treatment unit and match it with a control whose propensity score falls within the caliper

qui predict phat
sum phat

Variable |   Obs      Mean   Std. Dev.    Min       Max
-------------+--------------------------------------------------------
phat     | 30878   .027301    .034677  .0000891  .7371861

Caliper: 0.034677/4 ≈ 0.0086
Stata command: psmatch2, outcome(cost) pscore(phat) n(2) noreplacement cal(0.0086)
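The caliper rule can be sketched as follows (an illustration; it reuses the Std. Dev. reported above, and the `caliper_match` helper is hypothetical). A nearest-neighbor match is accepted only when the score distance stays within the caliper.

```python
def caliper_match(ps_treated, ps_control, caliper):
    """1-to-1 nearest-neighbor match without replacement, accepted only
    when the score distance is within the caliper; treated units with no
    acceptable neighbor are left unmatched."""
    pool = list(ps_control)
    matches = {}
    for pt in ps_treated:
        if not pool:
            break
        best = min(pool, key=lambda pc: abs(pc - pt))
        if abs(best - pt) <= caliper:
            matches[pt] = best
            pool.remove(best)
    return matches

sd = 0.034677          # Std. Dev. of phat from the summary above
caliper = sd / 4       # suggested one quarter of a standard deviation
# 0.005 finds a control within the caliper; 0.200 does not and is dropped.
m = caliper_match([0.005, 0.200], [0.006, 0.500], caliper)
```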
Radius and Kernel Matching
 Radius matching
 Each treated unit is matched only with the control
units whose propensity score falls in a predefined
neighborhood of the propensity score of the treated
unit
 Kernel matching
 All treated are matched with a weighted average of
all controls
 Weights are inversely proportional to the distance
between the propensity scores of treated and
controls
Radius Matching

Stata command: psmatch2, outcome(cost) pscore(phat) radius cal(0.005)

[Figure: a circle of radius r around each treated unit's propensity score; every control unit falling inside the circle is picked as a match.]
Kernel Matching
Stata command: psmatch2, outcome(cost) pscore(phat) kernel

[Diagram: each treated unit's cost CiT, at propensity score PiT, is compared with control costs CiC at scores PiC lying at distances d = |PiT − PiC|. Each difference CiT − CiC is weighted in proportion to 1/d, so the lowest distance (the best match) receives the highest weight.]
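The inverse-distance weighting idea can be written as a short sketch (a simplified illustration: real kernel matching typically passes the distance through a kernel function, and this helper assumes no control shares the treated unit's exact score, which would divide by zero).

```python
def kernel_att_weights(p_treated, ps_control):
    """For one treated unit, weight each control inversely to its
    propensity-score distance and normalise the weights to sum to 1."""
    inv = [1.0 / abs(p_treated - pc) for pc in ps_control]
    s = sum(inv)
    return [w / s for w in inv]

# The closest control (0.29) dominates the weighted average of controls.
w = kernel_att_weights(0.30, [0.29, 0.32, 0.40])
```

The treated unit's outcome would then be compared with the weighted average of all control outcomes using these weights.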


Example
 Data are used to estimate the cost of illness for
asthma patients
 Details of the data set can be found in Baser (2005)
 Treatment Group (Patients with the disease) =
1184
 Control Group (Patients without the disease)=
3169
Descriptive Table for Treatment and
Control Group
Variables Treatment Control Differences
(N=1184) (N=3169)
Mean Mean p-values
Age 28.026 31.368 0.000
Female 0.506 0.514 0.646
Male 0.494 0.486 0.646
Northeast 0.218 0.234 0.248
North central 0.253 0.213 0.005
South 0.378 0.331 0.004
West 0.110 0.099 0.314
Other region 0.042 0.122 0.000
CCI 0.999 0.159 0.000
Point of Service 0.720 0.721 0.908
Other Plan Type 0.280 0.279 0.908
Estimation of Propensity Score
with Logit
Variables Coefficients Stand. Err. P-values 95% Confidence Int
North central 0.301 0.115 0.009 0.077 0.526
South 0.184 0.104 0.077 -0.020 0.387
West 0.280 0.146 0.055 -0.006 0.567
Other region -1.428 0.289 0.000 -1.995 -0.861
CCI 1.887 0.069 0.000 1.751 2.023
Age -0.033 0.003 0.000 -0.039 -0.028
Female -0.030 0.081 0.716 -0.189 0.130
Point of service -0.043 0.090 0.635 -0.220 0.134
Constant -0.934 0.127 0.000 -1.183 -0.686
N= 4353
Prob > chi2 0.000
Pseudo R2 0.243
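Given the estimated coefficients above, a patient's propensity score is just the inverse logit of the linear index. The sketch below applies the table's coefficients to a hypothetical patient profile (the profile and helper names are made up for illustration).

```python
import math

# Coefficients from the logit table above.
coef = {"north_central": 0.301, "south": 0.184, "west": 0.280,
        "other_region": -1.428, "cci": 1.887, "age": -0.033,
        "female": -0.030, "point_of_service": -0.043}
const = -0.934

def propensity(x):
    """Inverse logit of the linear index; covariates absent from x are 0."""
    xb = const + sum(coef[k] * v for k, v in x.items())
    return 1.0 / (1.0 + math.exp(-xb))

# Hypothetical patient: 30-year-old female in the South, CCI = 1,
# point-of-service plan.
patient = {"south": 1, "cci": 1, "age": 30, "female": 1,
           "point_of_service": 1}
p = propensity(patient)
```

Consistent with CCI's large positive coefficient, raising CCI raises the predicted probability of being in the treatment (asthma) group.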
Types of Propensity Score Matching
Used
 M1: Nearest Neighbor
 M2: 2 to 1
 M3: Mahalanobis
 M4: Mahalanobis with caliper
 M5: Radius
 M6: Kernel
 M7: Stratified
Results

Matching Type                 Treatment  Control  Difference  S.E.  Regression-Based Difference
Unmatched                         10398     3345        7053   742                         4247
M1: Nearest neighbor              10398     7377        3021  1275                         3969
M2: 2 to 1                        10398     7364        3034   978                         5157
M3: Mahalanobis                   10398     6892        3506  2281                         4823
M4: Mahalanobis with caliper      11104     6641        4463  3252                         4456
M5: Radius                        10398     7786        2612  1278                         4601
M6: Kernel                        10398     7942        2456  2281                         4823
M7: Stratified                    10398     8358        2040  2564                         3754
Which One to Choose?
 Asymptotically, the matching estimators all end
up comparing only exact matches
 And therefore give the same answer
 In a finite sample however, choice makes a
difference!
 The general tendency in the literature is to choose matching with replacement when the control data set is small
 If it is large and evenly distributed, matching without replacement is better
 Kernel, Mahalanobis, and Radius matching work better with asymmetrically distributed, large control datasets
 Stratification works better if we suspect
unobservable effects in the matching

Quantifiable Criteria-1
C1. Two sample t-statistics & Chi-square test
between treatment and matched control
observations
- Desired: insignificant values (no remaining difference between groups)
Variables C1: T-test or chi-square test p-values
M1 M2 M3 M4 M5 M6 M7
Age 0 0 0.001 0.709 0 0 0
Female 0.267 0.868 0.87 0.999 0.233 0.376 0.255
Male 0.267 0.868 0.87 0.999 0.233 0.376 0.255
Northeast 0.005 0.003 0.514 0.482 0.002 0 0.006
North central 0 0 0.962 0.999 0 0 0
South 0 0 0.966 0.999 0.003 0 0
West 0 0 0.948 0.999 0.61 0.024 0.45
Other region 0 0 0.999 0.999 0 0 0
CCI 0 0 0.255 0.999 0 0 0
Point of service 0.407 0.937 0.891 0.999 0.455 0.689 0.515
Other plan type 0.407 0.937 0.891 0.999 0.455 0.689 0.515
Quantifiable Criteria-2
C2. Mean difference as a percentage of the average standard deviation:

100 × (X̄_T − X̄_C) / [½ (s_XT + s_XC)]

The lower the value, the better the match.

Variables        C2: The mean difference as a percentage of the average standard deviation
                 M1    M2    M3    M4   M5    M6    M7
Age              57.9  55.5  13.5  2.2  50.4  56.1  48.5
Female           4.6   0.6   0.7   0    4.9   3.6   4.1
Male             4.6   0.6   0.7   0    4.9   3.6   4.1
Northeast        11.6  10.8  2.7   4.2  12.8  15.9  10.2
North central    20.6  17.1  0.2   0    18    19.6  16.4
South            29    28.7  0.2   0    12.2  20.1  25.7
West             17.4  15.1  0.3   0    2.1   9.3   3.5
Other region     43.4  42.3  0     0    34.4  38    30.4
CCI              26.7  25.7  4.7   0    19.3  25    20.7
Point of service 3.4   0.3   0.6   0    4.9   1.6   2.8
Other plan type  3.4   0.3   0.6   0    4.9   1.6   2.8
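Criterion C2 can be computed directly (a pure-Python sketch; the toy age samples are invented for the example, and the `std_diff_pct` helper is hypothetical).

```python
import math

def std_diff_pct(xs_t, xs_c):
    """Mean difference as a percentage of the average standard deviation
    (criterion C2); smaller magnitude means a better match."""
    mt = sum(xs_t) / len(xs_t)
    mc = sum(xs_c) / len(xs_c)
    def sd(xs, m):
        # sample standard deviation
        return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))
    return 100.0 * (mt - mc) / (0.5 * (sd(xs_t, mt) + sd(xs_c, mc)))

# Toy age samples for a treated and a control group.
d = std_diff_pct([28.0, 30.0, 26.0], [31.0, 33.0, 29.0])
```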
Quantifiable Criteria-3
C3. Calculate the percent reduction in bias in the means of the explanatory variables, comparing the mean difference after matching (A) with the initial, unmatched difference (I):

100 × [(X̄_IT − X̄_IC) − (X̄_AT − X̄_AC)] / (X̄_IT − X̄_IC)

The best value is 100, meaning a 100% reduction. The more the value deviates from 100, the worse the match.
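Criterion C3 is a one-line calculation (a sketch; the example means are invented, chosen so the result overshoots 100, as several entries in the table below do).

```python
def pct_bias_reduction(mt_after, mc_after, mt_init, mc_init):
    """Percent reduction in the treated-control mean difference after
    matching (criterion C3); 100 means the imbalance was fully removed."""
    init_diff = mt_init - mc_init
    after_diff = mt_after - mc_after
    return 100.0 * (init_diff - after_diff) / init_diff

# Initial gap of -3.4 shrinks to +0.2 after matching: a slight
# over-correction, so the reduction exceeds 100.
r = pct_bias_reduction(28.5, 28.3, 28.0, 31.4)
```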

Variables C3: Percent reduction bias in means of explanatory variables


M1 M2 M3 M4 M5 M6 M7
Age 182.7 172.2 37.9 123.4 143 174.4 151.6
Female 191.8 137.8 143.2 100 212.9 132.4 150.8
Male 191.8 137.8 143.2 100 212.9 132.4 150.8
Northeast 201 178.1 166.3 54.6 233.1 316.1 250.4
North central 109.8 75.9 102.1 100 84.8 99.5 88.5
South 187 184.3 98.2 100 25.4 103.4 154.5
West 676.8 595.5 108.1 100 163.8 394.1 389.5
Other region 149.9 141.5 100 100 83.3 108.9 135.5
CCI 130.7 129.4 105.1 100 123 128.9 120.5
Point of service 759.6 28.4 43.3 100 212.9 316.2 268.6
Other plan type 759.6 28.4 43.3 100 212.9 316.2 268.6
Quantifiable Criteria-4
C4. Use the Kolmogorov-Smirnov test to
compare treatment density estimates for
explanatory variables
- Desired: insignificant values
Variables C4: Comparison of treatment and control density estimates
M1 M2 M3 M4 M5 M6 M7
Age 0 0 0 0.156 0 0 0
Female 0.999 0.999 0.999 0.83 0.999 0.99 0.89
Male 0.999 0.999 0.999 0.83 0.99 0.99 0.89
Northeast 0.999 0.999 0.999 0.999 0.972 0.972 0.95
North central 0.787 0.374 0.999 0.999 0.129 0.129 0.85
South 0.997 0.999 0.256 0.999 0.05 0.05 0.08
West 0.969 0.677 0.616 0.927 0.999 0.999 0.49
Other region 0.761 0.953 0.999 0.999 0.123 0.123 0.35
CCI 0 0 0 0.153 0 0 0
Point of service 0.999 0.999 0.989 0.864 0.999 0.999 0.988
Other plan type 0.999 0.999 0.989 0.864 0.999 0.999 0.988

M1: Nearest Neighbor; M2: 2 to 1; M3: Mahalanobis; M4: Mahalanobis with caliper
M5: Radius; M6: Kernel; M7: Stratified
Quantifiable Criteria-5
C5. Use the Kolmogorov-Smirnov test to
compare treatment density estimates of
the propensity scores of control units
with those of the treated units
- Desired: insignificant values
C5: Comparison of the density estimates of the propensity scores of control units with those of the treated units
M1 M2 M3 M4 M5 M6 M7
Propensity Scores 0 0 0.001 0.192 0 0 0

M1: Nearest Neighbor; M2: 2 to 1; M3: Mahalanobis; M4: Mahalanobis with caliper
M5: Radius; M6: Kernel; M7: Stratified
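The two-sample Kolmogorov-Smirnov statistic used in criteria C4 and C5 is easy to sketch in pure Python (this computes only the statistic, the largest gap between the two empirical CDFs, not the p-value; the `ks_statistic` helper is hypothetical).

```python
def ks_statistic(xs, ys):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum absolute gap
    between the two empirical CDFs, evaluated at every observed point."""
    pts = sorted(set(xs) | set(ys))
    nx, ny = len(xs), len(ys)
    return max(abs(sum(x <= p for x in xs) / nx -
                   sum(y <= p for y in ys) / ny) for p in pts)

# Identical samples give statistic 0; fully separated samples give 1.
d_same = ks_statistic([0.1, 0.2, 0.3, 0.4], [0.1, 0.2, 0.3, 0.4])
d_far = ks_statistic([0.1, 0.2, 0.3, 0.4], [0.6, 0.7, 0.8, 0.9])
```

For C5, `xs` and `ys` would be the propensity scores of the treated and matched control units; a small statistic (insignificant test) indicates similar score distributions.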
Results

Matching Type Treatment Control Difference S.E.


Unmatched 10398 3345 7053 742
M1: Nearest neighbor 10398 7377 3021 1275
M2: 2 to 1 10398 7364 3034 978
M3: Mahalanobis 10398 6892 3506 2281
M4: Mahalanobis with caliper 11104 6641 4463 3252
M5: Radius 10398 7786 2612 1278
M6: Kernel 10398 7942 2456 2281
M7: Stratified 10398 8358 2040 2564
Average Treatment Effect:
Regression
 Looked at the difference in means of
expenditure between the matched treatment
and control units
 Now, we will run the regression over matched
groups, independent variables are treatment
indicator + same variables as in propensity
score estimation.
 GLM with log link and gamma family

STATA command:
glm cost treatment age …, link(log) family(gamma)
robust
Estimated Total Health Care
Expenditure
Regression Based
Matching Type Difference S.E. Treatment Control Difference S.E.
Unmatched $4,247 $489 $10,398 $3,345 $7,053 $742
M1: Nearest neighbor $3,969 $1,135 $10,398 $7,377 $3,021 $1,275
M2: 2 to 1 $5,157 $1,232 $10,398 $7,364 $3,034 $978
M3: Mahalanobis $4,823 $1,205 $10,398 $6,892 $3,506 $2,281
M4: Mahalanobis with caliper $4,456 $994 $11,104 $6,641 $4,463 $3,252
M5: Radius $4,601 $659 $10,398 $7,786 $2,612 $1,278
M6: Kernel $4,823 $1,205 $10,398 $7,942 $2,456 $2,281
M7: Stratified $3,754 $1,009 $10,398 $8,358 $2,040 $2,564
Multivariate Regression After
Propensity Score Matching
 Is it necessary?
 Results are at least as good as the ones after
Propensity Score Matching
 Tells us the marginal effects of each variable on
outcomes measure
glm cost treatment age …, link(log)
family(gamma) robust
 Increase efficiency - double filtering
 Covers your mistakes!
Why do we need Propensity Score Matching if we
run multivariate regression-1

- Argument against using regression: it fits parameters using a "global" method. Many prefer "local" methods, in which the regression function at a point is affected only, or mainly, by nearby observations.
- Dehejia and Wahba, 1999 show that propensity score
matching approach leads to results that are close to the
experimental evidence, where the regression approaches
failed.
- Since you don't care about individual coefficients (the main purpose is classification), you can put as many variables as possible on the right-hand side in the logit runs; doing the same in a multivariate analysis will cost you.
The Counter-Argument:
-Jeffrey M. Wooldridge, University Distinguished Professor,
Fellow of Econometric Society, Author of “Introductory
Econometrics” and “Econometric Analysis of Cross Section
and Panel Data”
… “as I often say, people who do data analysis want to do
anything other than standard regression analysis. It’s too
bad we have the mind set, as a careful regression with
good controls and flexible functional forms is probably the
best we can do in many cases.
…what is a bit disconcerting is that these things take on a
life of their own. It seems that once the method is
blessed by a few smart people, there is no turning back!”
Limitations-1
 If two groups do not have substantial overlap,
then substantial error may be introduced
 Individuals that fall outside the region of common
support have to be disregarded. Information from
those outside the common support could be useful
especially if treatment effects are heterogeneous.
Possible solution:
1. Analyze them separately
2. Calculate non-parametric bounds of the parameter of interest (Lechner, 2000)
Limitations-2
Matching may not eliminate bias due to
unobservables
- Suppose we match people on the basis of age, region, plan type, and race, and attribute any resulting difference in expenditure to differences in treatment
- It is quite possible that expenditures of patients with
same observable characteristics vary widely due to
physician/practice prescribing patterns
Possible Solutions:
1. Rosenbaum Bounds (2002): how strongly must an unmeasured variable influence the selection process in order to undermine the implications of the matching analysis?
2. Instrumental Variable Approach: see Newhouse and McClellan (1998) for a detailed explanation.

Conclusion
 Propensity score matching creates a "quasi-randomized experiment" from observational data
 For retrospective data when true randomization is not possible
 Choosing among different types of matching
techniques is important and we should look at
several criteria
 Multivariate analysis after applying the correct
matching technique increases the efficiency of
the outcome estimator
Research of Interest
 Propensity Score Estimation with More
than Two Categories
 (Imbens, 1999)
 Accounting for Limited Overlap in
Estimation of Average Treatment Effect
 (Optimal Subpopulation Average Treatment
(OSATE) estimation) (Crump, Hotz, Imbens,
Mitnik, work in progress)
 Combining regression and propensity
score matching (Double Robustness)
(Wooldridge, work in progress)
Commonly Asked Questions
 How do we select the propensity score variables?
 Should we include/exclude any propensity
score variables when we build our outcomes
regression covariate list?
 Is it possible to over-match?
 How can we control for unobservable effects?
