31 1 233 1 10 20071029
methodological rigor and quality of development impact evaluations, while at the same time adapting the methodology to the technical, administrative, political and socio-cultural contexts within which these evaluations are developed, implemented and used.

The Strengths and Limitations of RCTs in the International Development Context

Cook argues that evaluation has to deal with different kinds of ideas and issues, and that causal questions are not always important. This is indeed the case, as shown by the heavy reliance of development agencies on process evaluations for lesson learning and accountability (with respect to the use of inputs and possibly assessing actual outputs against target values). However, for answering questions about causality, randomized designs can be well warranted theoretically and empirically, and they are widely perceived to have at least a marginal advantage over other bias-free methods such as regression discontinuity. Experiments also enjoy credibility in some policy debates, though this is not yet the case in most developing countries or even in all international development agencies. Consequently, having rigorous evidence on impact can increase the likelihood of information being used in policy debates.

Cook argued that randomized designs have a significant methodological advantage over quasi-experimental designs, none of which can adequately control for sample selection bias. Cook recognizes that valid knowledge has often come from non-experiments and there are many situations in which randomized designs are not possible. Despite their advantages, relatively limited use has been made of randomized designs in social (policy) research. He concludes that multiple methods are required in policy research, that generating better causal knowledge has a large role in policy research and that there is a special need for experiments today.

Many of these arguments are equally applicable to the international development field. International development projects typically use one of two procedures for participant selection: self-selection (people are invited to apply for, for example, small business loans, or communities apply to participate in a program to provide water, schools or other social services) and administrative selection (the project implementing agency selects the individuals, communities or administrative areas who will participate). Hence there is very probably a selection bias, as participants are likely to have special characteristics, often correlated with project success, which distinguish them from non-participants.

If selection characteristics are known and observed, then they can be controlled for to remove the bias by using a range of quasi-experimental (regression-based) techniques. But if selection characteristics cannot be observed (depending on such things as ‘entrepreneurial’ or ‘community’ spirit), then the omission of these variables will bias regression-based estimates of project impact. However, in cases where these unobserved determinants do not vary over time (time invariant), their influence can be removed by double differencing (the difference in the change in the outcome for the treatment and control groups), and so selection bias is eliminated. But we have to assume this time invariance, as it can of course not be observed.

Where effect sizes are small, or there is a long time between treatment and measuring impact, the problem of having to control for all confounding factors is greater, and so randomization again appears attractive. But the integrity of the design must be preserved, meaning that the control group must not receive any treatment.

The methodological advantage of RCTs is that they eliminate both project selection and sample selection bias, the two major threats to
conclusion validity. This is a very powerful advantage that cannot be matched by quasi-experimental designs. Some argue that regression discontinuity designs can also largely control for these biases, but this approach has been used infrequently in development evaluation, partly as the required longitudinal data sets are rarely available, but also because the enforcement of eligibility criteria (upon which the approach depends) is more likely to be lax in developing countries. An examination of the Grameen Bank microfinance program in Bangladesh found that, contrary to its stated criteria, many beneficiaries had more than one acre of land (Morduch, 1998). A study of a child growth monitoring program, also in Bangladesh, found that most community nutrition workers were unable to correctly interpret the growth charts used to select children for supplementary feeding (World Bank, 2005).

However, the advocacy for RCTs and strong evaluation designs, and the related, but separate, call for the creation of an independent evaluation agency, has stimulated a strong response from many critics, questioning the applicability, or the claimed advantages, of these approaches. Most fundamentally, some critics argue that complex processes of social change cannot be assessed through quantitative outcome measures that ignore the setting within which the program is implemented, do not study how the implementation process affects outcomes, do not assess the qualitative dimensions of the program, and do not have the flexibility to identify and study changes that take place, both in project administration and in the project setting, during the life of the project. A powerful part of this critique, which has strongly influenced development agencies in the past, is the argument that intended outcomes such as ‘empowerment’ are immeasurable, certainly in terms of the money metric (see, for example, Kabeer, 1992). However, considerable progress has been made in the last decade in measurement of apparently elusive concepts such as social capital (e.g., Krishna, 2002) and, indeed, empowerment (Alkire, 2005, and Alsop et al., 2006). But even those who accept the validity of RCTs point out their limitations.

The first limitation is the narrow range of interventions to which RCTs can be applied. Such an approach is most readily applicable when the intervention is a discrete, homogeneous intervention, precisely like a drug, hence the prevalence of the approach in the medical field, including in developing countries. But development projects are usually complex and heterogeneous, and they evolve during implementation. Examples where randomized approaches have been used are interventions such as conditional cash transfers (a grant to the household conditioned upon certain behavior, such as sending girls to school or ensuring children receive regular health check-ups), or very specific changes such as using flip charts in school or deworming school children. But for a large, multi-million dollar intervention, it is likely that only small sub-components of the project, if any, will be amenable to an experimental approach. For example, a single education project may support school rehabilitation, curriculum and textbook development, teacher training, strengthening local government management capacity, and technical support to the Education Management Information System, but at best selected sub-components of such a project will be amenable to a randomized evaluation. Development projects may also suffer from a small n problem, since the project may focus on working with one, or only a small number of agencies (such as supporting the creation of an anti-corruption commission or technical assistance to a government ministry), or support national policy reform.

A second limitation is that, even when an intervention appears to be homogeneous, it can be difficult to ensure that this is in fact so across time and space. It is rarely possible to ensure a high degree of control over the conditions of subjects throughout the treatment period (which may last for a year or more in some
cases), or to ensure multiple repetitions of the treatment with different dosages on different subject groups. Often logistical problems make it difficult to ensure that school books, medicines or other inputs are delivered on time and regularly re-supplied to all locations. This can be complicated by irregular electricity, computer networks or, in poorer countries, availability of fuel. In more than one project we have evaluated, the project workers accompanying the evaluation team to the field were not recognized by the villagers, not having visited the village for up to two years, and in other cases the quality of the services may vary, when, for example, health centers, schools or other facilities are understaffed, or not all staff receive the preparatory training or speak the local language. A major challenge for many evaluations is to be able to monitor or document what services have actually been received, as project monitoring systems either do not provide this information (e.g. there is no record of how many patients both receive the full package of malaria treatment and are given the required orientation on the use of the kit) or the project administrators intentionally fail to report deficiencies (as when many teachers are absent during the planting or harvest seasons). Consequently, the evaluator often has only limited information on how uniformly treatments were actually administered.

Furthermore, when projects are administered in phases it will often be found that selection criteria will be modified as the characteristics of the target population are better understood or political pressures come into play to allow certain previously excluded groups to be considered (for example, a secondary school scholarship program may originally have been targeted for rural areas but children in urban areas may gradually be admitted). Alternatively, the package of services, or the way they are administered, may change.

Similarly, when the project is implemented in different locations, and particularly when these fall under different administrative jurisdictions, the selection procedures may vary, either because the nature of the data used to define the target population may differ or because of different political pressures. Or the project may use different implementing agents (sometimes government, sometimes non-governmental organizations (NGOs), often a number of them) in different parts of the country. So, for example, as mentioned in the Bangladesh example above, in a growth monitoring program the technical skills of community nutrition workers are crucial, but the ability of different NGOs to effectively train women for this role can vary enormously.

A further requirement is that the project environment remains constant for all subjects throughout the treatment period and that no external events interfere with the supposedly controlled implementation setting. It is often not possible to ensure this degree of control, and external events such as the opening of a new factory, or the launch of a project by another agency, might complement or interfere with outcomes of the project being studied. Or different project settings may be subjected to economic, political, demographic or other changes that may affect how the project is implemented in different locations.

In addition to difficulties in satisfying these requirements in most development contexts, there are a number of other problems affecting the utilization of RCTs. First, the RCT evaluation designs are by definition inflexible in that the same measurement instruments must be applied throughout the evaluation. This makes it difficult to capture or adapt to changing circumstances such as changes in participant selection procedures, how the project is administered or to contextual changes that might have a different effect on project and control groups. For example, urban renewal programs might affect the comparison groups, or the opening of new industries might affect the economic opportunities of the target population.
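The effect of a contextual change that reaches only one group can be illustrated with a small worked calculation. The sketch below computes the double difference described earlier in the article, (P2 - P1) - (C2 - C1), using the P1, P2, C1, C2 notation for first and second observations on project and control groups. The function name and every number are hypothetical and chosen only for illustration; they are not taken from any evaluation discussed here.

```python
# Illustrative sketch only: all figures below are hypothetical.

def double_difference(p1, p2, c1, c2):
    """Change in the project group minus change in the comparison group."""
    return (p2 - p1) - (c2 - c1)

# Hypothetical group means for some outcome (for example, an income index).
p1, p2 = 50.0, 62.0   # project group: first and second observation
c1, c2 = 48.0, 54.0   # comparison group: first and second observation

baseline_estimate = double_difference(p1, p2, c1, c2)

# Assume an outside event (say, a new factory) raises the comparison
# group's second observation by 3 points but does not reach the project group.
shock = 3.0
shocked_estimate = double_difference(p1, p2, c1, c2 + shock)

print(baseline_estimate)  # 6.0
print(shocked_estimate)   # 3.0
```

In this made-up case the estimated impact is halved even though the project itself is unchanged, which is why events that affect project and control groups differently need to be documented during implementation.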
A further potential problem concerns “seepage” (or “contagion”), when sectors of the control group become absorbed into the project population or when they have de facto access to project benefits. Examples of the latter include access to water supply by neighboring communities (which also reduces water availability for project households), non-eligible communities or individuals gaining access to scholarship programs or other educational interventions, or information campaigns spread by word of mouth rather than project-supported media.

Another potential issue concerns the fidelity of the data collection procedures (how accurately they represent the situation being described). While this is an issue for all evaluation designs, it may be particularly problematic for RCTs that rely on one or a limited number of quantitative indicators.

It is sometimes claimed that the implementation of an RCT design will be more expensive than a non-RCT. However, this will not always be the case, as a randomized design will often significantly reduce the cost of sample frame construction because both project and control samples are generated from the same sampling frame.

Finally, there may be practical difficulties in implementing a randomized design. The most common is that a random allocation is not acceptable to policy-makers, either because the program’s scope is national or, where it is not, because administrative criteria are preferred or there is political interference in the allocation, which can also come into play once an intended experiment is underway. A project management office, which is typically a unit under a parent ministry, is not politically powerful, so higher-level political support is needed for experimental designs to enforce their implementation at local level.

A second practical problem is that evaluators are usually not involved in project design, and so potential opportunities to introduce strong evaluation designs are sometimes missed, as the project design and the parameters of the evaluation have already been established before the evaluator is called in. Of course in some cases randomized designs are selected even without the involvement of the evaluator. One example is when demand for a particular service exceeds supply and a lottery is used to ensure that the system of selection is seen to be fair and to avoid political or other pressure during selection, but such cases are very rare in developing countries.

Third, there are sometimes ethical objections to the use of randomization, as it is perceived that important benefits affecting health or even life-saving treatments are deliberately withheld from people who need them. Indeed, the term experiment has unfortunate connotations, as people may object to being ‘experimented’ on: randomized program designs for the Maori population in New Zealand were stopped for precisely this reason. The response to this, which may prove acceptable, is that the control is not getting the treatment yet, rather than that it is not getting it at all.

Finally, government policies or interventions of other agencies may differentially affect the project and control populations. In some cases government or other agencies may provide additional services to the target population to take advantage of the investments already being made by the project implementing agency. In other cases, these agencies may only focus on the non-beneficiary population to avoid duplication or to even out levels of benefits to different sectors of the low-income population.

The above considerations are just some of the constraints on using RCTs in developing countries. The fact that there have been so few such experiments is not simply because agencies have been unwilling or unable to implement such evaluations, but because it is often estimated that probably at most only about 5 percent of the total value of development finance is amenable to such an approach. Even
the most ardent proponents of RCTs in the development field admit that they “must necessarily remain a small fraction of all evaluations” (Duflo & Kremer, 2005, p. 205).

Despite these difficulties there are a number of situations in which randomized trials can be used. In principle the possibility of using an RCT can be considered whenever there is a clearly defined target population, subjects can be randomly allocated to treatment and control groups, a significant proportion of the population will not receive the treatment, the treatment is applied in a standardized and uniform way, and the project setting remains relatively stable throughout the period of the trials. Consequently, RCTs are likely to work better when the trial lasts for a relatively short period of time.

The following are some of the situations in which RCTs can be used. First, government may decide to use a lottery selection system to ensure equity and transparency in the selection procedure. The Bolivian Water Supply and Sanitation project was one example where a lottery was used because demand from villages interested in receiving these services exceeded the government’s ability to provide the services within a given period (in this case programs were planned on an annual basis). The lottery was considered politically and ethically acceptable because this was a multi-year program, so that villages not selected during the first year knew they had another chance to enter the lottery the following year.

Second, there are a number of situations in which stakeholders are interested in testing the effectiveness or cost-effectiveness of treatments in different combinations and settings and where randomization is possible and politically acceptable. These are often situations where replication of a program is being considered and where it is important to separate real project effects from other confounding factors. Examples where RCTs have been used include: comparing the effectiveness of deworming with other ways to increase school attendance or performance; the effectiveness of water supply and sanitation on community health and income and the comparison of the effectiveness of different delivery systems (community managed versus top-down); the impacts of vouchers for private schooling in Colombia; and the impact of changing interest rates on loan acceptance in South Africa.

Third, as Cook observes in his article, there are situations in which randomized trials are considered to be more credible than other options by some key stakeholders, so there may be political support for their use. Such an argument might apply to some countries in Latin America, where there is a strong tradition of impact evaluation (several countries require such studies to be carried out for all social programs), and randomized approaches have become widely known through their use for high-profile conditional cash transfer programs, notably PROGRESA in Mexico. But, outside of Latin America, such a systematic demand from policy makers is at present unlikely in developing countries.

Opportunities for Applying Strong Quasi-Experimental Designs in International Development

The problems of program heterogeneity and of a small n also limit the applicability of quasi-experimental impact evaluation designs. The problem of possibly immeasurable outcomes will hinder any quantitative approach. Nonetheless, although still only representing a minority of all project evaluations, the opportunities for applying strong quasi-experimental designs are much greater than for using RCTs. There are many situations where randomization was not possible, or was not used, but where a pretest-posttest control group design was used, which allows a double-difference based approach. As argued above, such approaches can be bias free unless there are unobserved determinants of program
participation which vary over time and which are correlated with the outcomes of interest. These designs can be used whenever survey data can be collected on both the project and a (reasonably representative) comparison group at the start and end (or at some point late in the project cycle) of the project. They are particularly strong when good secondary survey data are available so that propensity score matching or instrumental variables can strengthen the comparison between the two samples.

One of the main limitations on the use of strong quasi-experimental designs is that frequently the evaluation is not commissioned until late in the project cycle, so that baseline data cannot be collected. This late start would of course also eliminate the possibility of using an RCT. However, when the evaluation does begin at the start of the project, and if sufficient funds are available, it is often possible to use the pretest/posttest control group design. Examples include: evaluating housing and urban infrastructure projects targeted to clearly defined low income populations; conditional cash transfer programs; scholarship programs targeted for a particular section of the school population (for example female secondary students or all students from poor families when there is a clearly defined criterion of poverty); water supply and sanitation projects targeted for particular communities; and road construction projects (although in these latter cases defining the project and control groups is often more difficult, but not impossible).

Even if a formal baseline survey was not conducted, there may be other surveys which have been carried out in the project area which can serve this purpose. For example, in its evaluation of agricultural extension services in Kenya, the World Bank Independent Evaluation Group (IEG) commissioned a household survey of 285 households which had been covered by the Rural Household Budget Survey 15 years earlier, at the start of the National Extension Project (World Bank, 1999). In another example, for its study of support to basic education in Ghana, IEG surveyed 1,600 households and 705 schools in 85 enumeration areas which had been covered by a combined income and expenditure and education survey in the late 1980s (World Bank, 2003).

Despite the practical and political difficulties discussed above, strong evaluation designs have an important role to play in international development. Very few development programs have been subjected to rigorous impact evaluations, and the vast majority have been assessed without even a simple quasi-experimental design or any reference to a counterfactual. A large proportion of aid projects have not been subject to any impact evaluation, but even when there has been such a study it has not always employed a counterfactual. A review by the World Bank’s evaluation department (OED, now called IEG) of its own impact evaluations found that of the 78 studies so classified only 21 had employed a counterfactual (Gupta Kapoor, 2002), though the situation has changed so that all new IEG impact studies contain counterfactual analysis. Consequently many of the claims that programs have been “effective” and have achieved their objectives (contributing to the elimination of poverty, increasing school enrolment and performance and so on) are based on rather flimsy evidence. Many agencies define impact as simply comparing baseline measures with post project measures for the target population with no kind of comparison group, and it is implicitly assumed that all of the changes can be attributed to the project intervention.

Recently there has been a renewed concern within the development community for greater aid accountability, and many agencies have introduced results-based management, which it is claimed focuses on a better measure of results (outcomes and possibly impacts). However, in most cases the results-based management systems continue to rely on post project
because funding and implementing agencies only become aware at this late stage of the need to collect systematic evidence on which to make decisions about continuation or replication of the project, or because the original project document required an impact evaluation but it was not considered a priority. Indeed, it is currently common practice amongst the evaluation departments of all major development agencies not to get involved in evaluation until the end of the project, although the project should also contain plans for a ‘self-evaluation’ implemented by the project staff. Only recently has the World Bank’s evaluation department made an exception and allowed its staff to give ex ante advice on evaluation design for a health financing project in the Indian state of Karnataka. The French development agency has also recently begun an impact evaluation program with ex ante evaluation. But these examples remain exceptions.

Frequently, but not always, the belated interest in evaluation also means that there is an inadequate budget and often time pressures to deliver the evaluation report in time for the negotiations on the future of the project. Critics also claim that, given that the evaluation is being commissioned to support the agency’s claim that the project should continue to be funded, the evaluator is often given subtle, or not so subtle, hints that while the evaluation must be “objective and impartial”, it is hoped that the findings will be positive. However, our experience of working for a number of development agencies suggests this is not common practice, though the extent to which it happens varies and is more nuanced. In general, evaluations undertaken for or by evaluation departments have a fair degree of independence. There are usually systems for review or response from the operational side of the agency, which can help correct errors of fact, though they may sometimes also allow pressure to be brought to bear on content, a pressure that formal independence can limit. ‘Self-evaluations’ are more likely to be subject to these biases, but can still provide valuable lessons, though it may sometimes be necessary to read between the lines.

The range of possible quasi-experimental and non-experimental impact evaluation designs is summarized in Table 1. These approaches have been widely used in real-world contexts when experimental (randomized) designs have not been an option. The designs are ordered roughly from methodologically most to least robust. However, this is only a loose classification, as theoretically sound designs can be considerably weakened if they are not properly implemented (which of course also applies to RCTs), while some of the theoretically weaker designs can be strengthened, for example if used as part of a mixed-method, theory-based design or if additional observation points can be included. It should also be emphasized that this categorization is made from a quantitative evaluation design perspective, and many qualitative evaluation practitioners would take issue with the underlying premise of the superiority, or even the appropriateness, of the quantitative methods on which the judgment is being made.

Five of the designs (1, 2, 4, 5 and 7) can only be used when the evaluation begins at the start of the project. One design can be used when the evaluation begins when the project is already underway (design 3) and two are used for evaluations that start late in the project cycle (designs 6 and 8).

Strengthening the Evaluation Design

There are a number of cost-effective ways to strengthen impact evaluation designs when working under budget, time or data constraints. Indeed, these methods should be adopted for any impact evaluation. But they are particularly important when facing time or budget constraints, as they help underpin the validity of the findings.
The first is to consider the feasibility of building the evaluation design on a program theory model. A theory-based approach involves mapping out the channels through which the inputs are expected to achieve the intended outcomes. When circumstances permit (see later in this paragraph), a program theory model helps explain the links in the causal chain, enabling the evaluation to identify the key assumptions that must be tested. A program theory can also incorporate contextual analysis so as to identify local economic, political, institutional, environmental and socio-cultural factors that can help explain differences in the performance and outcomes of the same project when implemented in different locations. Theory-based approaches can also incorporate process analysis so as to monitor how the project is actually implemented, the quality of implementation and unplanned variations in the package of services actually received by different communities or beneficiaries. Under certain conditions the program theory can help distinguish between design failure and implementation failures to explain why intended outcomes were not achieved, and can help establish plausible association between inputs and outcomes, or the lack of such an association.

However, program theory models only work well under certain conditions. Theory models do not work well when there is no sound theory on which to build, or there is a lack of empirical evidence on, for example, expected effect sizes or the linkages between key variables. They are also difficult to use when there are several competing theories. However, Weiss (2000) argues that it is possible to develop and test several alternative theory models based on different theories. For example, Carvalho and White (2004) defined and tested two competing theories to explain the likely impacts of social investment funds on the level of local participation in the selection of community social infrastructure projects. One theory, advocated by the supporters of social funds, argued that inviting local communities to select among different social infrastructure projects would increase the level of community participation; while the other theory, espoused by some critics of the approach, argued that the decision-making process would be co-opted by local elites and would not increase local participation.

A World Bank evaluation of agricultural extension services in Kenya found that extension workers spent far less time than planned in the field and visiting farmers, and that since the planned link from new research to extension advice did not operate, the extension workers were proposing to farmers that they adopt methods most had already adopted. Hence the result that there was no impact on yields is extremely plausible although the control was not a randomized one (World Bank, 1999).

A second method for strengthening evaluation design (for all evaluations, not just weaker ones) is to adopt a good mixed-method design, combining quantitative and qualitative approaches in the formulation, implementation and analysis of the evaluation. This can be done in a number of ways. Qualitative data may be used for triangulation, that is, to provide additional evidence in support of the quantitative results. But the most important role for qualitative data is often to help frame the research. An evaluation design, and quantitative questionnaire, framed in ignorance of field conditions is very likely to overlook important aspects of how the project actually functions, which may well differ from what is described in the operational manual. Finally, qualitative data can help interpret the quantitative results. A household survey conducted in Malawi and Zambia for a World Bank study of funds for community-identified and implemented projects (social funds) found that participation rates in the project selection decision meeting were very low, but participation rates in project implementation very high. Qualitative fieldwork showed that the decision on the choice of
Table 1
Eight Commonly Used Quasi-Experimental and Non-experimental Impact Evaluation Designs
Key
T = Time
P = Project participants; C = Control group
P1, P2, C1, C2 = First and second observations
X = Project intervention (a process rather than a discrete event)
Column headings: Start of project [pre-test]; Project intervention [process, not discrete event]; Mid-term evaluation; End of project [post-test]; The stage of the project cycle at which each evaluation design can be used.
project was taken by a small group, usually the village headmen and the school head teacher, and then announced in the community meeting, with each household instructed to send a worker on a particular day (World Bank, 2002). Whilst this was not the community participation envisaged by the program’s designers, it has proved an effective means of rapidly expanding social infrastructure in rural areas.

A third method is to make maximum use of available secondary data, including project monitoring data, which are usually under-utilized in project evaluations. A fourth is to include, whenever time and budget permit, collection of data at additional points in the project cycle. In some cases this may be at some point during project implementation, while in other cases this may involve data collection some time after the project has been completed so as to assess project sustainability.

Addressing Time and Budget Constraints

Addressing Budget Constraints

Five options can be considered (Bamberger, Rugh and Mabry, 2006, Chapter 3). First, considerable cost savings are often possible by eliminating one or more of the four data collection points (pretest/posttest project and control group). For example, design 5 eliminates baseline control group data and design 6 eliminates all baseline data. There is clearly a trade-off that must be assessed for this and the following options between cost savings and methodological rigor. Second, the data collection instruments can be simplified to reduce the amount of information to be collected. Often considerable amounts of unnecessary or low-priority information can be eliminated by judicious pruning. In other cases it may be possible to reduce the number of people from whom information is collected (for example, only interviewing the household head, although this can affect the quality of information on the opinions, behavior and economic activities of household members who are not interviewed). Third, the creative use of secondary data can often reduce data collection costs. Fourth, a judicious assessment of expected effect size and power analysis may sometimes make it possible to reduce sample size while still obtaining satisfactory estimates of project impact. Finally, there are often ways to reduce the costs of data collection. One possibility is to use less expensive interviewers such as medical students or student teachers rather than commercial interviewers. In some cases questionnaires could be self-administered (rather than hiring interviewers) and in other cases it may be possible to obtain information through direct observation rather than household surveys (for example, observing pedestrian and vehicular traffic patterns, or direct observation of time-use and sexual division of labor).

While it is often assumed that the evaluation will always require the collection of primary data, it is often possible to significantly reduce time and cost, as well as enhance quality, by drawing on available secondary sources of data (White 2006). In addition to primary data collection in both project and control areas, it may be possible to obtain data from one of the following sources:

- use of existing secondary data from already completed surveys (demographic and health surveys, living standard measurement studies, etc.) as a baseline for both project and control areas.
- use of secondary data, as discussed above, for control groups with the collection of primary data for the project area. This option is often used when the sample of project households in the secondary source is too small or where additional information, not included in the previous survey, must be collected on the project population.
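The fourth budget option above, using expected effect size and power analysis to justify a smaller sample, can be sketched with the usual normal-approximation formula for comparing the means of two equal-sized groups. This is a minimal illustration, not the procedure used in any study cited here. The function name, the default significance level and power, and the example effect sizes are all assumptions. It also assumes simple random sampling; clustered samples, which are common in household surveys, would need a design-effect adjustment that is not shown.

```python
import math
from statistics import NormalDist

def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided comparison of two means.

    effect_size is the expected difference in means divided by the
    outcome's standard deviation (a standardized effect size).
    Assumes equal group sizes and simple random sampling.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    n = 2 * ((z_alpha + z_power) / effect_size) ** 2
    return math.ceil(n)

# Hypothetical expected effect sizes: a larger expected effect
# allows a much smaller sample for the same power.
print(sample_size_per_group(0.2))  # 393
print(sample_size_per_group(0.5))  # 63
```

The comparison shows why a realistic prior estimate of effect size matters for the budget: if the expected effect is overstated in order to cut costs, the resulting sample may be too small to detect the impact that actually occurs.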