
Thoughts on the use of propensity score matching in the evaluation of the Educational Maintenance Allowance (EMA)

By Alex Bryson, Principal Research Fellow, Policy Studies Institute

This note discusses the use of propensity score matching (PSM) in the first two years of the EMA evaluation. Comments relate to two reports (Ashworth et al., 2001; Ashworth et al., referred to as R1 and R2 respectively below) where PSM is used to estimate EMA effects on participation in school post-compulsion (Year 12) and retention by Year 13. The reports are engaging, well-written and competent pieces of evaluation research. Thanks to the honesty and endeavour of the evaluators in seeking continual improvements in their analyses, reading the reports back-to-back offers real insights into the sensitivity of the evaluation results to the implementation of matching. I therefore commend the reports to policy evaluators and policy makers. Below I make observations under seven broad headings.

1. Identifying the causal impact of EMA

Matching occurs at two levels: area-level, to find matches for the LEA pilot areas among the remaining LEAs in England, and at the level of the individual across pilot and control areas. The latter may be required, even if area-level matching seems reasonable, to account for differences in demographics across pilot/control areas that may affect outcomes. Good area-level matching is always critical in area-based evaluation because the approach is predicated on the assumption that differences in outcomes can be attributed to the programme (in this case EMA) and not to other area-based differences.

The evaluators note some problems (R1, page 13) with the initial area matching, which relied on YCS data. The YCS may have lacked some key data items and contained relatively few observations per LEA. The evaluators report using EMA survey data to improve the allocation of pilot areas but, of course, by that point the control areas used to administer the survey had already been chosen. The evaluators readily admit that the area matching was by no means perfect. They were particularly concerned about systematic differences between pilot and control areas in pre-EMA staying-on rates and deprivation rates. Their concerns were borne out.

By R2, the evaluators had ward-level indices of deprivation, ward-level staying-on rates pre-EMA, plus information on local school quality. They were able to use these controls in individual-level matching, thereby overcoming some of the drawbacks of the area-level matching. The effect of conditioning on these extra variables is dramatic.

In R1, EMA effects in urban areas get smaller post-matching whereas effects in rural areas get larger post-matching. Conditioning on the wider set of Xs in R2, the opposite happens: EMA effects in urban areas are bigger post-matching whereas rural effects become smaller post-matching and are no longer statistically significant. Of course, we can be more confident in the results presented in R2, but we can never know whether we have conditioned on all relevant Xs. PSM, like all other non-experimental evaluation techniques, relies on rich data to identify unbiased programme estimates.
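To make the mechanics concrete, here is a minimal sketch of the kind of individual-level estimator described above: a probit for living in a pilot area estimated on individual and ward-level characteristics, followed by kernel matching on the estimated propensity score. The variable names, covariate list, kernel and bandwidth are illustrative assumptions, not the evaluators' actual specification.

```python
import numpy as np
import statsmodels.api as sm

# df is assumed to hold one row per young person: a pilot-area indicator,
# the Year 12 participation outcome, and covariates. All names below are
# hypothetical placeholders, not the evaluators' actual variable list.
COVARIATES = ["household_income", "parental_education", "ward_deprivation",
              "ward_staying_on_rate", "school_quality"]

def kernel_matching_att(df, treat="pilot", outcome="stay_on",
                        xvars=COVARIATES, bandwidth=0.06):
    """Probit propensity score followed by Gaussian kernel matching.
    A sketch only: no common-support trimming and no standard errors."""
    X = sm.add_constant(df[xvars])
    pscore = np.asarray(sm.Probit(df[treat], X).fit(disp=0).predict(X))

    d = (df[treat] == 1).to_numpy()
    y = df[outcome].to_numpy(dtype=float)
    p_t, y_t = pscore[d], y[d]
    p_c, y_c = pscore[~d], y[~d]

    # Each pilot-area individual is compared with a kernel-weighted average
    # of control-area individuals with similar propensity scores.
    gaps = (p_t[:, None] - p_c[None, :]) / bandwidth
    k = np.exp(-0.5 * gaps ** 2)
    weights = k / k.sum(axis=1, keepdims=True)
    counterfactual = weights @ y_c

    return float(np.mean(y_t - counterfactual))
```

In practice one would also want to impose common support and obtain standard errors (for example by bootstrapping); these are exactly the implementation choices discussed in the next section.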

2. Detail on matching procedures and diagnostics

In judging whether PSM is likely to capture the causal impact of a programme and, if not, the likely biases in the estimated effects, one needs to know:

- the theory informing what the evaluator believes influences both treatment and outcome and thus merits inclusion in the matching estimator
- the ability to capture these influences with the data available
- how the matching is implemented (weighing bias against efficiency, the use of sample weights, and so on)
- the sensitivity of results to what enters the propensity estimation
- the difficulties in identifying matches among the non-treated for the whole treated sample
- success in balancing the mean scores of the Xs in the matching estimation across the treated and matched comparators (a sketch of this diagnostic follows below).

The reports do not offer enough detail to make this judgement. For instance, the probit estimates for the area-level matching and individual-level matching are never presented. Although we are informed that determinants of treatment and outcomes differ systematically across the sexes and by rural and urban location, prompting sub-group analyses throughout, we are never shown the probit estimates for these sub-groups, which would allow the reader to see how the groups differ. There is only cursory mention of the shift from nearest neighbour matching in R1 to kernel density regression in R2 and, in spite of some support problems in estimating effects for rural areas, these are not discussed.

In defence of the evaluators, they have already produced two lengthy reports and it is unlikely that any reader would have the stamina for more. Instead, it might be advisable to publish a separate technical report that presents all the information above. (Perhaps this is a lesson for all government evaluation that relies on the use of complex techniques.) Only then can the reader come to a considered opinion about whether the study has isolated an unbiased EMA effect.
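One diagnostic the reports could usefully publish is the standardised difference in covariate means between treated cases and their matched comparators, before and after matching. A rough sketch of how such a table could be produced, again with hypothetical inputs:

```python
import numpy as np
import pandas as pd

def standardised_differences(df, treat, xvars, comparison_weights=None):
    """Standardised percentage difference in covariate means between the
    treated and the (possibly match-weighted) comparison group. Weights of
    None give the raw, pre-matching comparison."""
    d = (df[treat] == 1).to_numpy()
    results = {}
    for x in xvars:
        v = df[x].to_numpy(dtype=float)
        if comparison_weights is None:
            w = np.ones((~d).sum())
        else:
            w = np.asarray(comparison_weights, dtype=float)
        mean_t = v[d].mean()
        mean_c = np.average(v[~d], weights=w)
        pooled_sd = np.sqrt(0.5 * (v[d].var(ddof=1) + v[~d].var(ddof=1)))
        results[x] = 100.0 * (mean_t - mean_c) / pooled_sd
    return pd.Series(results)
```

Reporting these figures for each sub-group (men and women, urban and rural), alongside the probit estimates themselves, would let readers judge the quality of the matching directly.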

3. Which treatment effect: eligibility and take-up

The evaluators make it clear that they are estimating the impact of eligibility for EMA, as opposed to actual take-up of EMA. Two issues arise.

First, the evaluators face severe difficulties in accurately estimating eligibility using the survey data. EMA eligibility depends on the age of the child and, because EMA offers means-tested assistance, household income. The latter is likely to be measured with substantial error due to the number of income items comprising household income. The reports make this clear through the detailed reporting of imputation methods to overcome item non-response and through their reporting of the degree of misalignment between eligibility and take-up in the pilot areas. It is not clear what bias this measurement error might imply.

Second, the effect of eligibility combines the impact of EMA receipt and the probability of take-up. The latter varies a great deal across areas in a way that may be explained, in part, by differences in administrative efficiency across LEAs. The problem here is that these differences in LEA efficiency cannot easily be separated from EMA effects: using eligibility as the measure of treatment conflates the two.

Given these two concerns (measurement error in estimating eligibility and the conflation of entitlement and take-up), it would have been valuable to supplement estimates of the effect of EMA eligibility with estimates of the effect of actual EMA take-up. This could be done by matching those who take up their entitlements in pilot areas with like eligibles in control areas. But it would also be interesting to estimate the effect of take-up by comparing EMA receivers with like eligibles within pilot areas, thus side-stepping area-related differences. This last approach would provide an estimate of treatment-on-the-treated within the pilot areas, where treatment is defined as being in receipt of EMA.
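To illustrate the within-pilot comparison suggested above, the sketch below matches EMA recipients to eligible non-recipients in the pilot areas on a probit take-up score, using nearest-neighbour matching with replacement. The column names (`pilot`, `eligible`, `receives_ema`) and the matching method are hypothetical; this is one way the comparison could be implemented, not a description of the evaluators' procedure.

```python
import numpy as np
import statsmodels.api as sm

def within_pilot_takeup_effect(df, xvars, outcome="stay_on"):
    """Effect of EMA receipt among eligibles in pilot areas: each recipient
    is matched, with replacement, to the eligible non-recipient with the
    nearest probit take-up score. A sketch without standard errors."""
    sample = df[(df["pilot"] == 1) & (df["eligible"] == 1)]
    X = sm.add_constant(sample[xvars])
    score = np.asarray(sm.Probit(sample["receives_ema"], X).fit(disp=0).predict(X))

    d = (sample["receives_ema"] == 1).to_numpy()
    y = sample[outcome].to_numpy(dtype=float)
    p_t, p_c, y_t, y_c = score[d], score[~d], y[d], y[~d]

    # Index of the closest non-recipient for each recipient.
    nearest = np.abs(p_t[:, None] - p_c[None, :]).argmin(axis=1)
    return float(np.mean(y_t - y_c[nearest]))
```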

4. Problems in re-weighting data to overcome sample attrition

Like many survey-based evaluations, EMA suffers from substantial attrition resulting from opt-outs, non-contacts, movers, refusals, item non-response and non-response at follow-up interviews. Where attrition is non-random and is associated with treatment and outcomes, as in this case, it is important to try to re-weight the data so that it reflects the original pilot population. The efforts made to re-weight the achieved samples back to the pilot populations using Family Resources Survey generated weights are not convincing. This is clear from Table 2.9, which shows lower EMA effects on Year 12 participation using wave 2 of cohort 1 relative to wave 1 of cohort 1, despite the use of weights. This suggests the weighting scheme is insufficient. It might be better to generate attrition weights from probits estimating the various reasons for attrition. These weights, which are the inverse of the probability of non-response, moving and so on, can be combined with the match weights in estimating the EMA effect.
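A minimal sketch of this attrition-weighting suggestion, assuming a single response indicator and placeholder covariates: a probit for being observed at follow-up, with the inverse of the predicted response probability used as the attrition weight.

```python
import numpy as np
import statsmodels.api as sm

def attrition_weights(df, xvars, responded="responded_at_follow_up"):
    """Inverse-probability-of-response weights from a probit for remaining
    in the sample at follow-up, estimated on the original wave 1 sample.
    Separate probits could be run for refusals, movers and non-contacts,
    with the resulting weights multiplied together."""
    X = sm.add_constant(df[xvars])
    p_respond = np.asarray(sm.Probit(df[responded], X).fit(disp=0).predict(X))
    # Cap very small predicted probabilities to avoid extreme weights.
    return 1.0 / np.clip(p_respond, 0.02, None)

# At the estimation stage, each respondent's matching weight would simply be
# multiplied by his or her attrition weight before the EMA effect is computed.
```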

5. The relative impact of full versus partial eligibility

The evaluators find the EMA effect is only statistically significant for full eligibility: there is no significant effect of partial eligibility. There are two difficulties in interpreting this result. First, to obtain it, matching must occur within the full and partial eligibility groups. What distinguishes full and partial eligibility is household income, so we cannot rule out the possibility that these results may be telling us more about heterogeneous treatment effects across lower and higher income households than they are about the EMA taper. Second, the distinction between full and partial eligibility is not that useful from a policy perspective because partial eligibility lumps together individuals who are eligible for anything between £5 and just under £30 (the full amount) per week.

6. How to interpret the diminishing EMA effect

The EMA effect for urban men diminishes over time, such that the effect is not statistically significant for cohort 2 (Table 2.3 of R2). It is not clear why this is so, but the evaluators point to a 5 percentage point increase in post-compulsory education participation by young men in the urban control areas between cohorts 1 and 2 (R2, p. 25). This change in the control areas seems large over such a short period of time: had it occurred in the pilot areas it may well have been attributed, erroneously, to EMA! Among urban women, on the other hand, the EMA effect increases. Clearly we need to know more about what has been happening in the control areas, since they are supposed to be like the pilot areas in all respects other than running EMA.

7. Difficulties in extrapolating from these results to a wider population

The evaluators seek to extrapolate from their results to the potential impact of EMA if extended more widely. This extrapolation is not convincing. The problem the evaluators face is that EMA pilot areas were chosen because they were not representative of England as a whole.

Indeed, they were chosen as pilot areas because it was anticipated that EMA would have its greatest effect in these areas. It seems unlikely, therefore, that the use of FRS-generated weights to weight the results to the wider population will suffice. If policy-makers were interested in designing an evaluation capable of estimating the EMA effect in other areas, it would have been preferable to have chosen the pilot areas at random.

A second reason why efforts to extrapolate these results to the wider population are unconvincing is the difficulty the evaluators have in identifying the impact EMA may have on staying-on and retention among those ineligible for it. These effects may be large in the case of programmes affecting a sizeable proportion of a population: this is the case with EMA which, according to the evaluators, may be available to over half of each cohort considering whether to stay on in post-compulsory education. The concern is that EMA may affect the schooling decisions of non-eligibles either positively or negatively, thus boosting or diminishing the EMA effect arising from its direct impact on eligibles. To get at these spillover effects, the evaluators undertake additional matching, this time using ineligibles as well as eligibles in the matching. They find no evidence of spillovers on participation (see R2, Table 2.8). However, the problem with this analysis is that the ineligibles are taken from control areas where, by definition, one would never expect to encounter spillovers because the programme is absent in those areas. It might have been interesting to see what spillover effects may emerge from an analysis confined to eligibles and ineligibles in the pilot areas.

There is an additional problem in extrapolating the rural area results to rural areas more generally, since the rural EMA estimates are based on a single rural pilot area (Cornwall) which is not particularly well-matched to its two control area counterparts. It is always hazardous to draw policy inferences on the basis of what is effectively a single observation. In the event, the R2 results find EMA does not significantly improve participation in rural areas. Although this result is due in part to the inflation of standard errors arising from the introduction of ward-level controls into the matching estimator, one does not know a priori what effect might emerge with a larger sample size.
