http://www.jmde.com
Articles
Using Strong Evaluation Designs in Developing Countries: Experience and Challenges
Michael Bamberger and Howard White

Independent Consultant and Independent Evaluation Group, The World Bank
In his article in the November 2006 JMDE Thomas Cook reviewed the current debate on the role of randomized control trials (RCT). While his article focused on educational research, the issues raised, and the responses by Michael Scriven (2006) and Jane Davidson (2006), highlighted the broader issues currently being debated on the need for more ‘rigorous’ program evaluation in all sectors. The purpose of this article is to extend the discussion to the field of international development evaluation, reviewing the different approaches which can be adopted to rigorous evaluation methodology and their applicability in a development setting.

In the international development field there is a growing debate on the need for more rigorous evaluation design, and specifically the use of randomized control trials. The majority of evaluations carried out by official development agencies have largely been process evaluations. In addition, from the early eighties there was an increased use of participatory evaluation approaches. Such approaches were a welcome change from a tradition which had not seen it necessary to seek the opinion of the intended beneficiaries of aid-financed interventions, but they did not produce data amenable to quantitative analysis of impact. The rise of results-based approaches in government agencies in developed countries, and the associated focus on the Millennium Development Goals as a measure of development progress, has resulted in greater calls to be able to demonstrate aid impact, and to know what works (see White, 2004, for a discussion of the results agenda in development agencies).

Several international conferences have stressed the need for greater accountability in the use of aid and greater rigor in the assessment of development outcomes. In particular, the 2002 Monterrey Conference on Financing for Development heightened interest in the use of results-based management in development agencies and the 2005 Paris Accords encouraged multi-donor cooperation in the promotion of, among other things, impact evaluations. The Poverty Action Lab (PAL) at MIT has for a number of years been promoting the use of randomized designs and also offers training programs for developing countries on the use of these designs. The Center for Global Development (CGD) has become a strong advocate of more rigorous evaluation designs, notably in their publication “When Will We Ever Learn” (CGD, 2006). Recently CGD issued a “Call to Action”, calling for the creation of an independent evaluation agency to ensure more independence and rigor in the evaluation of development programs. The term “Gold Standard” has recently been introduced into evaluation discourse to refer to RCTs as being the impact evaluation methodology to which development agencies should aspire, though this privileged position is disputed by others.

The purpose of this article is to seek common ground on ways to strengthen the
Journal of MultiDisciplinary Evaluation, Volume 4, Number 8 58

ISSN 1556-8180
October 2007
methodological rigor and quality of development impact evaluations, while at the same time adapting the methodology to the technical, administrative, political and socio-cultural contexts within which these evaluations are developed, implemented and used.

The Strengths and Limitations of RCTs in the International Development Context

Cook argues that evaluation has to deal with different kinds of ideas and issues and causal questions are not always important. This is indeed the case, as shown by the heavy reliance of development agencies on process evaluations for lesson learning and accountability (with respect to the use of inputs and possibly assessing actual outputs against target values). However, for answering questions about causality randomized designs can be well warranted theoretically and empirically and they are widely perceived to have at least a marginal advantage over other bias-free methods such as regression discontinuity. Experiments also enjoy credibility in some policy debates, though this is not yet the case in most developing countries or even in all international development agencies. Consequently, having rigorous evidence on impact can increase the likelihood of information being used in policy debates.

Cook argued that randomized designs have a significant methodological advantage over quasi-experimental designs, none of which can adequately control for sample selection bias. Cook recognizes that valid knowledge has often come from non-experiments and there are many situations in which randomized designs are not possible. Despite their advantages, relatively limited use has been made of randomized designs in social (policy) research. He concludes that multiple methods are required in policy research, that generating better causal knowledge has a large role in policy research and that there is a special need for experiments today.

Many of these arguments are equally applicable to the international development field. International development projects typically use one of two procedures for participant selection: self-selection (people are invited to apply for, for example, small business loans, or communities apply to participate in a program to provide water, schools or other social services) and administrative selection (the project implementing agency selects the individuals, communities or administrative areas who will participate). Hence there is very probably a selection bias as participants are likely to have special characteristics, often correlated with project success, which distinguish them from non-participants.

If selection characteristics are known and observed then they can be controlled for to remove the bias by using a range of quasi-experimental (regression-based) techniques. But if selection characteristics cannot be observed (depending on such things as ‘entrepreneurial’ or ‘community’ spirit), then the omission of these variables will bias regression-based estimates of project impact. However, in the cases that these unobserved determinants do not vary over time (time invariant) then their influence can be removed by double differencing (the difference in the change in the outcome for the treatment and control groups), and so selection bias eliminated; but we have to assume this time invariance as it can of course not be observed.

Where effect sizes are small, or there is a long time between treatment and measuring impact, then the problem of having to control for all confounding factors is greater, and so randomization again appears attractive. But the integrity of the design must be preserved, meaning that the control group must not receive any treatment.

The methodological advantage of RCTs is that they eliminate both project selection and sample selection bias, the two major threats to conclusion validity. This is a very powerful

advantage that cannot be matched by quasi-experimental designs. Some argue that regression discontinuity designs can also largely control for these biases, but this approach has been used infrequently in development evaluation, partly as the required longitudinal data sets are rarely available, but also because the enforcement of eligibility criteria (upon which the approach depends) is more likely to be lax in developing countries. An examination of the Grameen Bank microfinance program in Bangladesh found that, contrary to its stated criteria, many beneficiaries had more than one acre of land (Morduch, 1998). A study of a child growth monitoring program, also in Bangladesh, found that most community nutrition workers were unable to correctly interpret the growth charts used to select children for supplementary feeding (World Bank, 2005).

However, the advocacy for RCTs and strong evaluation designs, and the related, but separate, call for the creation of an independent evaluation agency, has stimulated a strong response from many critics, questioning the applicability, or the claimed advantages, of these approaches. Most fundamentally, some critics argue that complex processes of social change cannot be assessed through quantitative outcome measures that ignore the setting within which the program is implemented, do not study how the implementation process affects outcomes, do not assess the qualitative dimensions of the program, and do not have the flexibility to identify and study changes that take place, both in project administration and in the project setting, during the life of the project. A powerful part of this critique, which has strongly influenced development agencies in the past, is the argument that intended outcomes such as ‘empowerment’ are immeasurable, certainly in terms of the money metric (see, for example, Kabeer, 1992). However, considerable progress has been made in the last decade in measurement of apparently elusive concepts such as social capital (e.g., Krishna, 2002) and, indeed, empowerment (Alkire, 2005, and Alsop et al., 2006). But even those who accept the validity of RCTs point out their limitations.

The first limitation is the narrow range of interventions to which RCTs can be applied. Such an approach is most readily applicable when the intervention is a discrete, homogenous intervention, precisely like a drug, which explains the prevalence of the approach in the medical field, including in developing countries. But development projects are usually complex and heterogeneous, and they evolve during implementation. Examples where randomized approaches have been used are interventions such as conditional cash transfers (a grant to the household conditioned upon certain behavior, such as sending girls to school or ensuring children receive regular health check-ups), or very specific changes such as using flip charts in school or deworming school children. But for a large, multi-million dollar intervention, it is likely that only small sub-components of the project, if any, will be amenable to an experimental approach. For example, a single education project may support school rehabilitation, curriculum and textbook development, teacher training, strengthening local government management capacity, and technical support to the Education Management Information System, but at best selected sub-components of such a project will be amenable to a randomized evaluation. Development projects may also suffer from a small n problem, since the project may focus on working with one, or only a small number of agencies (such as supporting the creation of an anti-corruption commission or technical assistance to a government ministry), or support national policy reform.

A second limitation is that, even when an intervention appears to be homogenous, it can be difficult to ensure that this is in fact so across time and space. It is rarely possible to ensure a high degree of control over the conditions of subjects throughout the treatment period (which may last for a year or more in some

cases), or to ensure multiple repetitions of the treatment with different dosages on different subject groups. Often logistical problems make it difficult to ensure that school books, medicines or other inputs are delivered on time and regularly re-supplied to all locations. This can be complicated by irregular electricity, computer networks or, in poorer countries, availability of fuel. In more than one project we have evaluated, the project workers accompanying the evaluation team to the field were not recognized by the villagers, having not visited the village for up to two years, and in other cases the quality of the services may vary, when for example health centers, schools or other facilities are understaffed, or not all staff receive the preparatory training or speak the local language. A major challenge for many evaluations is to be able to monitor or document what services have actually been received, as project monitoring systems either do not provide this information (e.g. there is no record of how many patients both receive the full package of malaria treatment and are given the required orientation on the use of the kit) or the project administrators intentionally fail to report deficiencies (as when many teachers are absent during the planting or harvest seasons). Consequently, the evaluator often has only limited information on how uniformly treatments were actually administered.

Furthermore, when projects are administered in phases it will often be found that selection criteria will be modified as the characteristics of the target population are better understood or political pressures come into play to allow certain previously excluded groups to be considered (for example, a secondary school scholarship program may originally have been targeted for rural areas but children in urban areas may gradually be admitted). Alternatively, the package of services, or the way they are administered, may change.

Similarly, when the project is implemented in different locations, and particularly when these fall under different administrative jurisdictions, the selection procedures may vary, either because the nature of the data used to define the target population may differ or because of different political pressures. Or the project may use different implementing agents (sometimes government, sometimes non-governmental organizations (NGOs), often a number of them) in different parts of the country. So, for example, as mentioned in the Bangladesh example above, in a growth monitoring program the technical skills of community nutrition workers are crucial, but the ability of different NGOs to effectively train women for this role can vary enormously.

A further requirement is that the project environment remains constant for all subjects throughout the treatment period and that no external events interfere with the supposedly controlled implementation setting. It is often not possible to ensure this degree of control, and external events such as the opening of a new factory, or the launch of a project by another agency, might complement or interfere with outcomes of the project being studied. Or different project settings may be subjected to economic, political, demographic or other changes that may affect how the project is implemented in different locations.

In addition to difficulties in satisfying these requirements in most development contexts, there are a number of other problems affecting the utilization of RCTs. First, the RCT evaluation designs are by definition inflexible in that the same measurement instruments must be applied throughout the evaluation. This makes it difficult to capture or adapt to changing circumstances such as changes in participant selection procedures or in how the project is administered, or to contextual changes that might have a different effect on project and control groups. For example, urban renewal programs might affect the comparison groups, or the opening of new industries might affect the economic opportunities of the target population.
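The double-differencing argument made earlier, and the concern above about contextual changes that affect project and control groups differently, can be illustrated with a simple worked equation. This is only an expository sketch: the additive model and the notation ($\mu$, $\lambda$, $\delta$, $s$) are assumptions introduced here, not formulas taken from the studies cited. Suppose the mean outcome of group $g$ (project $P$ or control $C$) at time $t$ (0 = baseline, 1 = follow-up) is the sum of a time-invariant group characteristic $\mu_g$ (including unobserved selection characteristics), a common time trend $\lambda_t$, the project impact $\delta$ (received only by the project group at follow-up), and the effect $s_{g,t}$ of external events on that group:

```latex
\begin{aligned}
Y_{P,0} &= \mu_P + \lambda_0 + s_{P,0}, &\qquad Y_{P,1} &= \mu_P + \lambda_1 + \delta + s_{P,1},\\
Y_{C,0} &= \mu_C + \lambda_0 + s_{C,0}, &\qquad Y_{C,1} &= \mu_C + \lambda_1 + s_{C,1},\\[4pt]
DD &= (Y_{P,1} - Y_{P,0}) - (Y_{C,1} - Y_{C,0})
    &&= \delta + \bigl[(s_{P,1} - s_{P,0}) - (s_{C,1} - s_{C,0})\bigr].
\end{aligned}
```

Under these assumptions the time-invariant terms $\mu_P$ and $\mu_C$ and the common trend cancel, which is why double differencing removes time-invariant selection bias. The bracketed term does not cancel: an external event that changes outcomes in one group more than in the other is attributed to the project unless it is measured and controlled for separately.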

A further potential problem concerns “seepage” (or “contagion”) when sectors of the control group become absorbed into the project population or when they have de facto access to project benefits. Examples of the latter include access to water supply by neighboring communities (which also reduces water availability for project households), non-eligible communities or individuals gaining access to scholarship programs or other educational interventions, or information campaigns spread by word of mouth rather than project-supported media.

Another potential issue concerns the fidelity of the data collection procedures (how accurately they represent the situation being described). While this is an issue for all evaluation designs it may be particularly problematic for RCTs that rely on one or a limited number of quantitative indicators.

It is sometimes claimed that the implementation of an RCT design will be more expensive than a non-RCT. However, this will not always be the case as a randomized design will often significantly reduce the cost of sample frame construction as both project and control samples are generated from the same sampling frame.

Finally there may be practical difficulties in implementing a randomized design. The most common is that a random allocation is not acceptable to policy-makers, either because the program’s scope is national or, where it is not, because administrative criteria are preferred or there is political interference in the allocation, which can also come into play once an intended experiment is underway. A project management office, which is typically a unit under a parent ministry, is not politically powerful, so higher-level political support is needed for experimental designs to enforce their implementation at local level.

A second practical problem is that evaluators are usually not involved in project design, and so potential opportunities to introduce strong evaluation designs are sometimes missed as the project design and the parameters of the evaluation have already been established before the evaluator is called in. Of course in some cases randomized designs are selected even without the involvement of the evaluator. One example is when demand for a particular service exceeds supply and a lottery is used to ensure that the system of selection is seen to be fair and to avoid political or other pressure during selection, but such cases are very rare in developing countries.

Third, there are sometimes ethical objections to the use of randomization as it is perceived that important benefits affecting health or even life-saving treatments are deliberately withheld from people who need them. Indeed, the term experiment has unfortunate connotations, as people may object to being ‘experimented’ on; randomized program designs for the Maori population in New Zealand were stopped for precisely this reason. The response to this, which may prove acceptable, is that the control group is not getting the treatment yet, rather than that it is not getting it at all.

Finally, government policies or interventions of other agencies may differentially affect the project and control populations. In some cases government or other agencies may provide additional services to the target population to take advantage of the investments already being made by the project implementing agency. In other cases, these agencies may only focus on the non-beneficiary population to avoid duplication or to even out levels of benefits to different sectors of the low-income population.

The above considerations are just some of the constraints on using RCTs in developing countries. The fact that there have been so few such experiments is not simply because agencies have been unwilling or unable to implement such evaluations, but because, as is often estimated, at most about 5 percent of the total value of development finance is amenable to such an approach. Even

the most ardent proponents of RCTs in the development field admit that they “must necessarily remain a small fraction of all evaluations” (Duflo & Kremer, 2005, p. 205).

Despite these difficulties there are a number of situations in which randomized trials can be used. In principle the possibility of using an RCT can be considered whenever there is a clearly defined target population, subjects can be randomly allocated to treatment and control groups, a significant proportion of the population will not receive the treatment, the treatment is applied in a standardized and uniform way, and the project setting remains relatively stable throughout the period of the trials. Consequently, RCTs are likely to work better when the trial lasts for a relatively short period of time.

The following are some of the situations in which RCTs can be used. First, government may decide to use a lottery selection system to ensure equity and transparency in the selection procedure. The Bolivian Water Supply and Sanitation project was one example where a lottery was used because demand from villages interested in receiving these services exceeded the government’s ability to provide the services within a given period (in this case programs were planned on an annual basis). The lottery was considered politically and ethically acceptable because this was a multi-year program so that villages not selected during the first year knew they had another chance to enter the lottery the following year.

Second, there are a number of situations in which stakeholders are interested in testing the effectiveness or cost-effectiveness of treatments in different combinations and settings and where randomization is possible and politically acceptable. These are often situations where replication of a program is being considered and where it is important to separate real project effects from other confounding factors. Examples where RCTs have been used include: comparing the effectiveness of deworming with other ways to increase school attendance or performance; the effectiveness of water supply and sanitation on community health and income and the comparison of the effectiveness of different delivery systems (community managed versus top-down); the impacts of vouchers for private schooling in Colombia; and the impact of changing interest rates on loan acceptance in South Africa.

Third, as Cook observes in his article, there are situations in which randomized trials are considered to be more credible than other options by some key stakeholders, so there may be political support for their use. Such an argument might apply to some countries in Latin America, where there is a strong tradition of impact evaluation (several countries require such studies to be carried out for all social programs), and randomized approaches have become widely known through their use for high-profile conditional cash transfer programs, notably PROGRESA in Mexico. But, outside of Latin America, such a systematic demand from policy makers is at present unlikely in developing countries.

Opportunities for Applying Strong Quasi-Experimental Designs in International Development

The problems of program heterogeneity and of a small n also limit the applicability of quasi-experimental impact evaluation designs. The problem of possibly immeasurable outcomes will hinder any quantitative approach. Nonetheless, although still only representing a minority of all project evaluations, the opportunities for applying strong quasi-experimental designs are much greater than for using RCTs. There are many situations where randomization was not possible, or was not used, but where a pretest-posttest control group design was used, which allows a double-difference based approach. As argued above, such approaches can be bias free unless there are unobserved determinants of program

participation which vary over time and which are correlated with the outcomes of interest. These designs can be used whenever survey data can be collected on both the project and a (reasonably representative) comparison group at the start and end (or at some point late in the project cycle) of the project. They are particularly strong when good secondary survey data are available so that propensity score matching or instrumental variables can strengthen the comparison between the two samples.

One of the main limitations on the use of strong quasi-experimental designs is that frequently the evaluation is not commissioned until late in the project cycle so that baseline data cannot be collected. This late start would of course also eliminate the possibility of using an RCT. However, when the evaluation does begin at the start of the project, and if sufficient funds are available, it is often possible to use the pretest/posttest control group design. Examples include: evaluating housing and urban infrastructure projects targeted to clearly defined low income populations; conditional cash transfer programs; scholarship programs targeted for a particular section of the school population (for example female secondary students or all students from poor families when there is a clearly defined criterion of poverty); water supply and sanitation projects targeted for particular communities; and road construction projects (although in these latter cases defining project and control groups is often more difficult, but not impossible).

Even if a formal baseline survey was not conducted, there may be other surveys which have been carried out in the project area which can serve this purpose. For example, in its evaluation of agricultural extension services in Kenya, the World Bank Independent Evaluation Group (IEG) commissioned a household survey of 285 households which had been covered by the Rural Household Budget Survey 15 years earlier, at the start of the National Extension Project (World Bank, 1999). In another example, for its study of support to basic education in Ghana, IEG surveyed 1,600 households and 705 schools in 85 enumeration areas which had been covered by a combined income and expenditure and education survey in the late 1980s (World Bank, 2003).

Despite the practical and political difficulties discussed above, strong evaluation designs have an important role to play in international development. Very few development programs have been subjected to rigorous impact evaluations, and the vast majority have been assessed without even a simple quasi-experimental design or any reference to a counterfactual. A large proportion of aid projects have not been subject to any impact evaluation, but even when there has been such a study it has not always employed a counterfactual. A review by the World Bank’s evaluation department (OED, now called IEG) of its own impact evaluations found that of the 78 studies so classified only 21 had employed a counterfactual (Gupta Kapoor, 2002), though the situation has changed so that all new IEG impact studies contain counterfactual analysis. Consequently many of the claims that programs have been “effective” and have achieved their objectives (contributing to the elimination of poverty, increasing school enrolment and performance and so on) are often based on rather flimsy evidence. Many agencies define impact as simply comparing baseline measures with post project measures for the target population with no kind of comparison group and it is implicitly assumed that all of the changes can be attributed to the project intervention.

Recently there has been a renewed concern within the development community for greater aid accountability and many agencies have introduced results-based management, which it is claimed focuses on a better measure of results (outcomes and possibly impacts). However, in most cases the results-based management systems continue to rely on post project

comparisons with a baseline but with no comparison group, so it is not clear how much progress has been made.

Consequently, reasonably robust evaluation designs that include a logically defensible counterfactual, even if they do not satisfy the highest methodological standards, can provide a significantly better understanding of the extent to which development programs are achieving their objectives as well as helping understand the factors contributing to the level of impact and how it is distributed among different sectors of the target population.

As we will discuss in the next section, there is a wide range of impact evaluation design options that offer a useful level of understanding of potential impacts even when evaluations are conducted under budget and time constraints. Many of these techniques can also be used in the very common situation where the evaluation is not commissioned until late in the project cycle.

The main message is the need to broaden the focus of the debate beyond searching for the relatively small number of project settings where rigorous designs can be used (although advantage should be taken of all opportunities to apply these designs), to focusing on proposing a wider range of impact evaluation methodologies that can provide operationally useful, and acceptably valid, estimates of project impact and that can be applied to a much wider range of development interventions. The challenge is to define a minimum set of methodological criteria for an evaluation design to be considered sufficiently rigorous to provide valid estimates of project impact.

Real-World Approaches to Impact Evaluation

Evaluations are commissioned either at the outset of a project (ex ante) or toward its end (ex post). The former case permits the use of stronger evaluation designs and much of the discussion on rigorous impact evaluations is limited to a discussion of this scenario. However, even when the evaluation is planned to commence with the start of the project there are a number of real-world constraints that limit the possibility of using strong evaluation designs (Bamberger, Rugh & Mabry, 2006). Budget constraints may exclude collection of baseline data for a comparison group and may even limit the kinds, and sample size, of baseline data which can be collected on the project population. In other cases time pressures may limit baseline data collection or make it impossible to conduct exploratory studies and pilot testing of data collection instruments required for a sound survey design. Time pressures may squeeze out the baseline altogether: once project implementation starts there is much to be done, and conducting a baseline survey for an impact study that will only produce results some years hence is far from a priority so it is eventually conducted, at best, a few years into the project (in a recent Indian irrigation project we studied, the ‘baseline’ had been conducted five and a half years into a seven year project). For this reason, it is preferable to conduct the baseline before the formal start of the project, which will usually require funds from a different source. It may also be difficult to identify populations not affected by the project from which a comparison group could be selected. Finally there may be political or administrative constraints such as the implementing agency’s concern that interviewing families or communities not scheduled to receive project benefits will stir up political controversy or create pressures to expand the scope of the project beyond available plans or resources. In other cases the client is not convinced that it makes sense to “waste” money and time interviewing populations not involved in the project.

But it is more common that the evaluation is not commissioned until towards the end of the project, or even after it has finished, either

because funding and implementing agencies only become aware at this late stage of the need to collect systematic evidence on which to make decisions about continuation or replication of the project, or because the original project document required an impact evaluation but it was not considered a priority. Indeed, it is currently common practice amongst the evaluation departments of all major development agencies to not get involved in evaluation until the end of the project, although the project should also contain plans for a ‘self-evaluation’ implemented by the project staff. Only recently has the World Bank’s evaluation department made an exception and allowed its staff to give ex ante advice on evaluation design for a health financing project in the Indian state of Karnataka. The French development agency has also recently begun an impact evaluation program with ex ante evaluation. But these examples remain exceptions.

Frequently, but not always, the belated interest in evaluation also means that there is an inadequate budget and often time pressures to deliver the evaluation report in time for the negotiations on the future of the project. Critics also claim that, given that the evaluation is being commissioned to support the agency’s claim that the project should continue to be funded, the evaluator is often given subtle, or not so subtle, hints that while the evaluation must be “objective and impartial”, it is hoped that the findings will be positive. However, our experience of working for a number of development agencies suggests this is not common practice, though the extent to which it happens varies and is more nuanced. In general, evaluations undertaken for or by evaluation departments have a fair degree of independence. There are usually systems for review or response from the operational side of the agency, which can help correct errors of fact, though they may sometimes also allow pressure to be brought to bear on content, although formal independence can limit these pressures. ‘Self-evaluations’ are more likely to be subject to these biases, but can still provide valuable lessons, though it may sometimes be necessary to read between the lines.

The range of possible quasi-experimental and non-experimental impact evaluation designs is summarized in Table 1. These approaches have been widely used in real-world contexts when experimental (randomized) designs have not been an option. The designs are ordered roughly from methodologically most to least robust. However, this is only a loose classification, as theoretically sound designs can be considerably weakened if they are not properly implemented (which of course also applies to RCTs), while some of the theoretically weaker designs can be strengthened, for example if used as part of a mixed-method, theory-based design or if additional observation points can be included. It should also be emphasized that this categorization is made from a quantitative evaluation design perspective and many qualitative evaluation practitioners would take issue with the underlying premise of the superiority, or even the appropriateness, of the quantitative methods on which the judgment is made.

Five of the designs (1, 2, 4, 5 and 7) can only be used when the evaluation begins at the start of the project. One design can be used when the evaluation begins when the project is already underway (design 3) and two are used for evaluations that start late in the project cycle (designs 6 and 8).

Strengthening the Evaluation Design

There are a number of cost-effective ways to strengthen impact evaluation designs when working under budget, time or data constraints. Indeed, these methods should be adopted for any impact evaluation. But they are particularly important when facing time or budget constraints as they help underpin the validity of the findings.

The first is to consider the feasibility of building the evaluation design on a program theory model. A theory-based approach involves mapping out the channels through which the inputs are expected to achieve the intended outcomes. When circumstances permit (see below), a program theory model helps explain the links in the causal chain, enabling the evaluation to identify the key assumptions that must be tested. A program theory can also incorporate contextual analysis so as to identify local economic, political, institutional, environmental and socio-cultural factors that can help explain differences in the performance and outcomes of the same project when implemented in different locations. Theory-based approaches can also incorporate process analysis so as to monitor how the project is actually implemented, the quality of implementation and unplanned variations in the package of services actually received by different communities or beneficiaries. Under certain conditions the program theory can help distinguish between design failure and implementation failure to explain why intended outcomes were not achieved, and can help establish a plausible association between inputs and outcomes, or the lack of such an association.

However, program theory models only work well under certain conditions. Theory models do not work well when there is no sound theory on which to build, or there is a lack of empirical evidence on, for example, expected effect sizes or the linkages between key variables. They are also difficult to use when there are several competing theories. However, Weiss (2000) argues that it is possible to develop and test several alternative theory models based on different theories. For example, Carvalho and White (2004) defined and tested two competing theories to explain the likely impacts of social investment funds on the level of local participation in the selection of community social infrastructure projects. One theory, advocated by the supporters of social funds, argued that inviting local communities to select among different social infrastructure projects would increase the level of community participation; while the other theory, espoused by some critics of the approach, argued that the decision-making process would be co-opted by local elites and would not increase local participation.

A World Bank evaluation of agricultural extension services in Kenya found that extension workers spent far less time than planned in the field and visiting farmers, and that since the planned link from new research to extension advice did not operate, the extension workers were proposing to farmers that they adopt methods most had already adopted. Hence the result that there was no impact on yields is extremely plausible although the control was not a randomized one (World Bank, 1999).

A second method for strengthening evaluation design (for all evaluations, not just weaker ones) is to adopt a good mixed-method design, combining quantitative and qualitative approaches in the formulation, implementation and analysis of the evaluation. This can be done in a number of ways. Qualitative data may be used for triangulation, that is, to provide additional evidence in support of the quantitative results. But the most important role for qualitative data is often to help frame the research. An evaluation design, and quantitative questionnaire, framed in ignorance of field conditions is very likely to overlook important aspects of how the project actually functions, which may well differ from what is described in the operational manual. Finally, qualitative data can help interpret the quantitative results. A household survey conducted in Malawi and Zambia for a World Bank study of funds for community-identified and implemented projects (social funds) found that participation rates in the project selection decision meeting were very low, but participation rates in project implementation very high. Qualitative fieldwork showed that the decision on the choice of

Table 1
Eight Commonly Used Quasi-Experimental and Non-experimental Impact Evaluation Designs

Key: T = time; P = project participants; C = control group; P1, P2, C1, C2 = first and second observations; X = project intervention (a process rather than a discrete event).
Observation points: T1 = start of project (pre-test); T2 = mid-term evaluation; T3 = end of project (post-test).
Stage = the stage of the project cycle at which each evaluation design can be used.

RELATIVELY ROBUST QUASI-EXPERIMENTAL DESIGNS

1. Pre-test post-test non-equivalent control group design with statistical matching of the two groups. Participants are either self-selected or are selected by the project implementing agency. Statistical techniques (such as propensity score matching), drawing on high-quality secondary data, are used to match the two groups on a number of relevant variables.
   Observations: T1: P1, C1; X; T3: P2, C2. Stage: Start.

2. Pre-test post-test non-equivalent control group design with judgmental matching of the two groups. Participants are either self-selected or are selected by the project implementing agency. Control areas are usually selected judgmentally and subjects are randomly selected from within these areas.
   Observations: T1: P1, C1; X; T3: P2, C2. Stage: Start.

LESS ROBUST QUASI-EXPERIMENTAL DESIGNS

3. Pre-test/post-test comparison where the baseline study is not conducted until the project has been underway for some time (most commonly this is around the mid-term review).
   Observations: X; T2: P1, C1; T3: P2, C2. Stage: During project implementation (often at mid-term).

4. Pipeline control group design. When a project is implemented in phases, subjects in Phase 2 (i.e., who will not receive benefits until some later point in time) can be used as the control group for Phase 1 subjects.
   Observations: T1: P1, C1; X; T3: P2, C2. Stage: Start.

5. Pre-test post-test comparison of project group combined with post-test comparison of project and control group.
   Observations: T1: P1; X; T3: P2, C2. Stage: Start.

6. Post-test comparison of project and control groups.
   Observations: X; T3: P1, C1. Stage: End.

NON-EXPERIMENTAL DESIGNS (THE LEAST ROBUST)

7. Pre-test post-test comparison of project group.
   Observations: T1: P1; X; T3: P2. Stage: Start.

8. Post-test analysis of project group.
   Observations: X; T3: P1. Stage: End.

Source: Adapted from Bamberger, Rugh, & Mabry (2006).
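
The observation schedule in Table 1 can also be written down as a small data structure, which makes it easier to see which designs remain open at a given stage and which comparisons each design supports. The following Python sketch is illustrative only and is not taken from the source table: the names DESIGNS, feasible_designs, comparisons and double_difference are hypothetical, the rule that a design is feasible when its first observation falls at or after the stage at which the evaluation begins is an assumption, and the outcome figures in the usage note are made up.

```python
# Hypothetical encoding of Table 1: observation point -> groups observed.
# "P" = project participants, "C" = control group.
# T1 = start of project, T2 = mid-term, T3 = end of project.
DESIGNS = {
    1: {"T1": "PC", "T3": "PC"},  # statistical matching
    2: {"T1": "PC", "T3": "PC"},  # judgmental matching
    3: {"T2": "PC", "T3": "PC"},  # late baseline (around mid-term)
    4: {"T1": "PC", "T3": "PC"},  # pipeline control group
    5: {"T1": "P", "T3": "PC"},   # no baseline for the control group
    6: {"T3": "PC"},              # post-test only, with control group
    7: {"T1": "P", "T3": "P"},    # project group only, before and after
    8: {"T3": "P"},               # project group only, post-test
}

STAGES = ["T1", "T2", "T3"]


def feasible_designs(evaluation_starts):
    """Designs whose first observation is not earlier than the stage
    at which the evaluation begins (an assumed feasibility rule)."""
    start = STAGES.index(evaluation_starts)
    return [
        number
        for number, obs in DESIGNS.items()
        if min(STAGES.index(t) for t in obs) >= start
    ]


def comparisons(design):
    """List the comparisons that a design's observations allow."""
    obs = DESIGNS[design]
    times = sorted(obs, key=STAGES.index)
    first, last = times[0], times[-1]
    two_rounds = len(times) == 2
    found = []
    if two_rounds and "P" in obs[first]:
        found.append("before-after")       # project group over time
    if "C" in obs[last]:
        found.append("with-without")       # project vs control at the end
    if two_rounds and "C" in obs[first] and "C" in obs[last]:
        found.append("double difference")  # both groups, both rounds
    return found


def double_difference(p1, p2, c1, c2):
    """Change in the project group minus change in the control group."""
    return (p2 - p1) - (c2 - c1)
```

Under these assumptions, feasible_designs("T3") returns designs 6 and 8, matching the statement that only those two can be used late in the project cycle. With made-up mean outcomes of 40 and 55 for the project group and 42 and 50 for the control group, double_difference(40, 55, 42, 50) gives (55 - 40) - (50 - 42) = 7.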

project was taken by a small group, usually the village headmen and the school head teacher, and then announced in the community meeting, with each household instructed to send a worker on a particular day (World Bank, 2002). Whilst this was not the community participation envisaged by the program’s designers, it has proved an effective means of rapidly expanding social infrastructure in rural areas.

A third method is to make maximum use of available secondary data, including project monitoring data, which are usually under-utilized in project evaluations. A fourth is to include, whenever time and budget permit, collection of data at additional points in the project cycle. In some cases this may be at some point during project implementation, while in other cases this may involve data collection some time after the project has been completed so as to assess project sustainability.

Addressing Time and Budget Constraints

Addressing Budget Constraints

Five options can be considered (Bamberger, Rugh and Mabry 2006, Chapter 3). First, considerable cost savings are often possible by eliminating one or more of the four data collection points (pretest/posttest project and control group). For example, design 5 eliminates baseline control group data and design 6 eliminates all baseline data. There is clearly a trade-off that must be assessed for this and the following options between cost savings and methodological rigor. Second, the data collection instruments can be simplified to reduce the amount of information to be collected. Often considerable amounts of unnecessary or low-priority information can be eliminated by judicious pruning. In other cases it may be possible to reduce the number of people from whom information is collected (for example, only interviewing the household head, although this can affect the quality of information on the opinions, behavior and economic activities of household members who are not interviewed). Third, the creative use of secondary data can often reduce data collection costs. Fourth, a judicious assessment of expected effect size and power analysis may sometimes make it possible to reduce sample size while still obtaining satisfactory estimates of project impact. Finally, there are often ways to reduce the costs of data collection. One possibility is to use less expensive interviewers such as medical students or student teachers rather than commercial interviewers. In some cases questionnaires could be self-administered (rather than hiring interviewers) and in other cases it may be possible to obtain information through direct observation rather than household surveys (for example, observing pedestrian and vehicular traffic patterns, or direct observation of time-use and sexual division of labor).

While it is often assumed that the evaluation will always require the collection of primary data, it is often possible to significantly reduce time and cost, as well as enhance quality, by drawing on available secondary sources of data (White 2006). In addition to primary data collection in both project and control areas, it may be possible to obtain data from one of the following sources:

- use of existing secondary data from already completed surveys (demographic and health surveys, living standard measurement studies, etc.) as a baseline for both project and control areas.
- use of secondary data, as discussed above, for control groups with the collection of primary data for the project area. This option is often used when the sample of project households in the secondary source is too small or where additional information, not included in the previous survey, must be collected on the project population.

- piggy-backing, in which an additional module can be added to an already planned survey, possibly over-sampling the project area to obtain the desired power.
- synchronized survey, in which a larger survey is used to select the control group (for example by propensity score matching) and a survey is carried out only amongst the project group.

The Inter-American Development Bank has been very successful in supporting low-cost impact evaluations while avoiding any new data collection. It has used proposals submitted to undertake studies to identify existing data sources, to which it can obtain access for local research teams who may not otherwise be able to obtain those data for analysis (and in consequence put in cheap bids to achieve this privilege).

Addressing Time Constraints

Most of the above techniques can also be used to reduce time (Bamberger, Rugh and Mabry 2006, Chapter 4). When time is a constraint but there is an adequate budget, it is sometimes possible to contract local consultants to conduct preparatory studies to save time for foreign or out-of-town consultants in order to increase the efficiency of the limited time they have available for in-country or project visits. Video-conferencing can also be an effective way to improve coordination and save time. Hiring more researchers, interviewers or data analysts may also be considered to reduce the time required for data collection and analysis. However, increasing the size of the research team also increases the complexity of coordination so that less time may be saved than expected. Data collection technologies such as hand-held computers, internet surveys and optical scanning are also possible time-savers.

Addressing Data Constraints

Real-world evaluations often lack baseline data, particularly on the control group but also quite often on the project population as well. The lack of a baseline is important since, if selection is based on unobservable factors that do not vary over time, their influence can be removed by double differencing, which requires baseline data. For the same reason, double differencing also helps when there has been inadequate definition of the control population. A number of strategies are available to reconstruct baseline data (Bamberger, Rugh and Mabry 2006, Chapter 5).

First, as mentioned above, an existing survey may serve this purpose. Second, existing documentary data from within the organization or from other sources can be used, or key informants can also be asked to provide information on pre-project conditions. Finally, informants can be asked to recall their situation prior to the start of the project. Some evaluators question the validity of recall as it is particularly vulnerable to bias because of intentional distortion or lapses of memory. But all questionnaires are based on recall, so it is actually a question of degree rather than whether the approach should be used at all. Areas such as income and expenditure and fertility behavior, in which extensive research has been conducted on the reliability of recall, have shown that it is possible to identify the direction and magnitude of bias as well as ways to reduce the bias. For example, between 1989 and 1998, the National Sample Survey in India experimented with different recall periods for measuring expenditure. It was found that when the 30-day recall period for food items was replaced with a 7-day period, the total estimated food expenditures increased by around 30%. When at the same time the 30-day recall period for infrequent expenditures was replaced with a one-year recall, the estimated total expenditure increased by about 17% (Deaton 2005). A number of studies have found there is a general tendency to under-estimate small expenditures (truncation) and to over-estimate major expenditures (telescoping). See Bamberger, Rugh and Mabry (2006, pp. 97-99) for a brief review of the recall literature.

Similar research in other areas (mainly by comparing information provided on current behavior or assets with recall of this same information at a future point in time) could greatly enhance the utility of recall. But for the time being it can be noted that major events and purchases (such as main assets like a vehicle or livestock) can be recalled with reasonable accuracy, especially if other methods are used to triangulate the information. Asset measures, combined with indicators of housing quality, are increasingly used as a proxy for the more difficult-to-measure outcome of household income. Krishna et al. (2005) use recall for an asset-based approach to analyzing poverty trends in a number of Indian villages over a 25-year period.

There are also a number of PRA techniques that can be used to reconstruct baseline conditions. The term PRA (Participatory Rural Appraisal) is now commonly used as a generic term to describe a wide range of participatory planning and evaluation techniques that are used with groups or communities to identify their development priorities; their perception of the constraints affecting the achievement of their goals and the resources they can draw on; and their opinions on the effectiveness of community organizations and external programs. PRA techniques were originally developed, drawing heavily on the work of Robert Chambers (e.g. Chambers 1994a, b and c), for working with mainly rural communities with low levels of literacy and often with difficulties in expressing their ideas verbally. Consequently, PRA has developed a wide range of techniques that do not involve reading or writing and that use non-verbal communication. With all of these techniques a facilitator works with community groups, rather than individuals, and uses social maps, charts and other visual and easily understandable techniques to reconstruct time-lines, trend analysis, historical transects and seasonal diagrams to trace the evolution of the community and the critical incidents in its history (Kumar 2002). PRA methods are also helpful for addressing another data constraint, which occurs when data collection methods are not adequate for collecting sensitive information or for identifying, locating and interviewing difficult-to-reach groups. In addition to questions concerning potential biases in information collected from groups and how the data can be incorporated into quantitative analysis, a problem with most group-based data collection is that the sample size is significantly reduced as the unit of analysis becomes the group rather than the individual or household. This is particularly important when group-based techniques are advocated as a way to reduce the costs of data collection through household sample surveys.

Conclusions

We conclude that RCTs do indeed have a role to play in developing countries. Although even under the most favorable circumstances RCTs will only make up a small percentage of impact evaluations, they are currently falling short of even that amount, so there is scope for expansion. Given their limitations, it is also necessary to identify other means of undertaking impact studies. These other means must also address the time and budget constraints under which evaluators are frequently forced to operate. We have presented a range of designs with a range of costs and rigor. Where the most rigorous designs are not possible, a good theory-based approach will lend plausibility to the findings.

References

Alkire, S. (2002). Valuing freedom: Sen’s capability approach and poverty reduction. Oxford: Oxford University Press.

Alsop, R., Bertelsen, M., & Holland, J. (2006). Empowerment in practice: From analysis to implementation. Washington, DC: World Bank.

Bamberger, M., Rugh, J., & Mabry, L. (2006). Realworld evaluation: Working under budget, time, data and political constraints. Thousand Oaks, CA: Sage.

Bamberger, M. (2006). Conducting quality impact evaluations under budget, time and data constraints. Washington, DC: World Bank.

Carvalho, S., & White, H. (2004). Theory-based evaluation: The case of social funds. American Journal of Evaluation, 25(2), 141-60.

Chambers, R. (1994a). The origins and practice of participatory rural appraisal. World Development, 22(7), 953-969.

Chambers, R. (1994b). Participatory rural appraisal: Analysis of experience. World Development, 22(7), 1253-1268.

Chambers, R. (1994c). Participatory rural appraisal: Challenges, potentials and paradigm. World Development, 22(7), 1437-145.

CGD. (2006). When will we ever learn? Improving lives through impact evaluation. Washington, DC: Center for Global Development.

Cook, T. D. (2006). Describing what is special about the role of experiments in contemporary educational research: Putting the “gold standard” rhetoric into perspective. Journal of MultiDisciplinary Evaluation, 6, 1-7.

Davidson, E. J. (2006). The RCTs-only doctrine: Brakes on the acquisition of knowledge? Journal of MultiDisciplinary Evaluation, 6, ii-v.

Deaton, A. (2005). Measuring poverty in a growing world (or measuring growth in a poor world). Review of Economics and Statistics, 87(1), 1-19.

Duflo, E., & Kremer, M. (2005). Use of randomization in the evaluation of development effectiveness. In G. K. Pitman, O. Feinstein, & G. Ingram (Eds.), Evaluating development effectiveness. Washington, DC: World Bank.

Gupta Kapoor, A. (2002). Review of impact evaluation methodologies used by the Operations Evaluation Department over past 25 years [OED Working Paper]. Washington, DC: IEG, World Bank.

Kabeer, N. (1992). Evaluating cost-benefit analysis as a tool for gender planning. Development and Change, 23.

Krishna, A. (2002). Active social capital: Tracing the roots of development and democracy. University Presses of California, Columbia and Princeton.

Krishna, A., Kapila, M., Porwal, M., & Singh, V. (2005). Why growth is not enough: Household poverty dynamics in Northeast Gujarat, India. Journal of Development Studies, 41(7), 1163-1192.

Kumar, S. (2002). Methods for community participation: A complete guide for practitioners. Rugby, England: ITDG Publishing.

Morduch, J. (1998). Does microfinance really help the poor? New evidence from flagship program in Bangladesh. Princeton, NJ: Princeton University.

Scriven, M. (2006). Converting perspective to practice. Journal of MultiDisciplinary Evaluation, 6, 8-9.

Weiss, C. (2000). Which links in which theories shall we evaluate? In P. J. Rogers et al. (Eds.), Program theory in evaluation: Challenges and opportunities (pp. 35-45). New Directions for Evaluation, No. 87. San Francisco: Jossey-Bass.

White, H. (2004). Using development goals and targets for donor agency performance measurement. In R. Black & H. White (Eds.), Targeting development: Critical perspectives on the Millennium Development Goals. London: Routledge.

White, H. (2006). Impact evaluation experience of the Independent Evaluation Group of the World Bank. Washington, DC: World Bank.

World Bank. (1999). Agricultural extension: The Kenya experience. Washington, DC: World Bank.

World Bank. (2002). Social funds: Assessing the effectiveness. Washington, DC: World Bank.

World Bank. (2004). Books, buildings and learning outcomes: An impact evaluation of World Bank assistance to basic education in Ghana. Washington, DC: World Bank.

World Bank. (2005). Maintaining momentum to 2015? An impact evaluation of interventions to improve maternal and child health and nutrition outcomes in Bangladesh. Washington, DC: World Bank.

Journal of MultiDisciplinary Evaluation, Volume 4, Number 8
