
DRAFT ONLY Rethinking Evaluation of Grants and Contributions

Acknowledgement

While I am responsible for any errors in this paper, I received significant aid from a number of commentators and former senior consumers and practitioners of evaluation research. Tim Dugas, Maryantonett Flumian and Maria Barrados were particularly generous with their time and feedback. Internally at EKOS, I received very significant feedback and editorial advice from Laura Maxwell and several other colleagues. Finally, I would like to thank Terry Colpitts for inviting me to speak and for his very helpful feedback on various drafts. I would also like to acknowledge the guidance of Stephen Johnson and the assistance provided by Tracy Matsuo.

1.0 Introduction

Some forty years ago, Donald T. Campbell succinctly outlined the goal of an experimental society where rational, empirical evidence would be a crucial basis for societal decision making:

The United States and other modern nations, Campbell proclaimed, should be ready for an experimental approach to social reform, an approach in which we try out new programs designed to cure specific social problems. (Campbell, 1969)

Forty years later, the most popular and powerful politician on the planet (President Barack Obama) chose to express the following point in what may have been the most listened-to political address of all time:

The question we ask today is not whether our government is too big or too small, but whether it works, whether it helps families find jobs at a decent wage, care they can afford, a retirement that is dignified.

Where the answer is yes, we intend to move forward. Where the answer is no, programs will end. (Barack Obama, Inaugural address).

In reviewing the current status and future prospects for the evaluation of Grants and Contributions, I thought it might be helpful to situate the problem in a broader historical field. The first quote is by no means the beginning of program evaluation, but it condenses some of the initial promise of the field. The idea of rational management based on empirical data can be traced back to at least the encyclopaedists and the notion of political arithmetic. In fact, the term statistics itself initially referred to measures of state, and there is a rich tradition of what might be called technocratic approaches to societal management.

In the sixties, the current movement found its roots in the search for a Great Society in the United States and, a little later, the Just Society in Canada. Social science was to be harnessed to the pursuit of not just a better standard of living but a better quality of life. Along with social indicators, the fledgling evaluation research movement was to substitute hard empirical evidence of what works and what doesn't for the vagaries of less informed political choices. I quote Donald Campbell because, throughout his career, he outlined arguably the most sophisticated model of how evaluation would assess the counterfactual hypothesis of what the world would have looked like if a program hadn't existed. The search for falsification was in keeping with his preference for experimentalism and his epistemological debt to Karl Popper (1972). Sometimes we would use non-experimental designs, but only while adjusting for and recognizing the threats to internal and external validity these entailed. The basic idea here is that we can never verify anything with certainty but we can falsify with certainty. Programs and policies which survived falsification would continue. Those which failed this test would be abandoned, and newer approaches would be introduced (to also face these experimental criteria).

In the full flush of what Daniel Bell (1973) called the emergence of a knowledge-based, post-industrial society, this was an extremely alluring idea. No longer would we be slaves to the irrational caprices of impressions, intuitions and vested interests. Hard evidence and rationalism would supplant these clearly inferior methods of deciding. At least that was the theory.

So here we are roughly four decades later, and we should be encouraged that no less a figure than President Obama sought to remind the world of the importance of this idea in the short time he had to make what may well have been his most listened-to political statement. A cynic, however, might wonder why it was necessary to make this plea some four decades later if it was such an obvious idea in the middle part of the last century. Shouldn't evaluation research have already supplanted the more capricious and irrational forms of decision-making it was intended to replace? While most of us would agree that we should have seen the triumph of causal evidence, the actual record of utilization is not very positive. So, while we should not abandon this alluring idea, we should also be clear-eyed about the fact that after forty years it is hard to make a compelling case that we have made much real progress toward Campbell's original idea of an experimental society. What are the main obstacles to having causal evidence of the incremental impacts of programs and policies exert the influence that we all believe it should? Should we be looking at methodological barriers? Is the problem one of confidence in the integrity of evaluation conclusions? Or are the problems more practical in nature: does it cost too much, is it too slow, or is the language of evaluation too opaque and banal to capture the interest of key audiences? Or, finally, are the most formidable barriers more in the realm of the diverse and often contradictory interests underlying the crucial actors in the modern state?

2.0 The recent Canadian context

I use this broad framing to take up the question of how we are doing with the more specific evaluation of Grants and Contributions (G&Cs). While G&Cs have their own unique challenges, they share many of the issues which are endemic to the broader field.

In Canada, evaluation research was institutionalized under the rubric of program evaluation. The initial Treasury Board Secretariat guidelines (late seventies) clearly laid out the notion of causal impacts, and program evaluation took on a unique role distinct from audit, review, and other forms of management consulting. The policy and its operationalization within the federal government have evolved through time. Several respected authorities I have discussed this topic with suggest that the function may have reached its pinnacle, in terms of quality and influence, at the erstwhile HRDC. Although major expenditure reviews such as the Nielsen Task Force largely ignored program evaluation, by the middle part of the nineties HRDC had a large and vibrant program evaluation branch which was doing large-scale effectiveness evaluations. These studies were producing important and often counterintuitive findings about which labour market innovations worked and which did not. For example, during the major process of Social Policy Review, these studies actually formed a significant component of the materials used to brief the minister and his committee.

In fact, we developed blended models of evaluation results and polling results which were integrated into a segmented model of recommendations and advice on social policy renewal (Lifelong Learning and the World of Work: A Segmented Perspective, 1995). These briefing materials became a prominent feature of the debate about social policy and EI reform.

On a number of occasions, I have had the privilege of briefing the most senior levels of elected and bureaucratic officials (including the Deputy Minister, Prime Minister and the Cabinet). This particular exercise is, however, the only time I recall having the opportunity to speak to the most senior bureaucratic and political audiences about evaluation results. I think it was the accessible and timely blend of hard evaluation results and opinion data which enhanced this reception. Unfortunately, despite the large number of examples where more superficial polling results have had great influence on the most senior levels of decision-making, it is almost unheard of to imagine the more rigorous results of evaluation research exerting such influence. The very image of a minister or Prime Minister confronted with a political crisis summoning his favoured evaluator to tell him what really works is so far-fetched it is almost humorous. Yet we would think nothing of the same scenario with the favoured pollster offering advice. What is wrong with this picture?

Despite the sophistication and scale of the HRDC evaluation group, much of its impact and presence faded after the handoff of labour market development to the provinces in the aftermath of the near loss of the 1995 Referendum. As in many other examples, this illustrates the triumph of perceived political exigencies over rational evidence.

A common view is that the weakening of program evaluation evident in HRDC in the later part of the nineties was reflective of a more general atrophying. Evaluations became smaller in scale and more closely linked to ongoing management advice. At the same time, both the Auditor General (Desautels) and Parliament were expressing dissatisfaction with the general oversight of G&Cs. This concern exploded for reasons we will consider shortly, and renewed interest in evaluation accompanied the Federal Accountability Act of 2006. Treasury Board Secretariat introduced a new Transfer Payment Policy in 2008, and there was fresh interest in evaluation as a way of informing risk management, including what to do with G&Cs.

In the balance of this talk, I want to turn to the question of why utilization has been dramatically less than the initial proponents dreamed of and still far short of the familiar goal that President Obama cited in his inaugural address. With better methods, could we improve the validity of our conclusions and then see greater utilization? Does the obscure language of program evaluation render results inaccessible, or is their timeliness suspect, so that by the time answers are available the questions have faded or shifted to other ground? Or do key barriers lie in the nature of the contradictory interests and incentives that underlie the bureaucratic and political realms of the modern state? We will consider all three areas in the rest of this speech and conclude that there is room for improvement in each of them. The main reason, however, that some cynics might claim that evaluation research has spent forty-odd years packing its bags for a trip it never seems to take lies in the realm of contradictory interests and incentives.

3.0 Grants and Contributions: special areas of concern


Let's turn more specifically to the area of G&Cs, HRSDC, and the uneasy mixture of methodological and organizational issues underlying the evaluation of G&Cs. I have chosen to focus on HRSDC (or its ancestor HRDC) because the example so vividly underlines some of the key points I wish to make. There are many similarities to the problems of evaluating G&Cs in other settings as well.

G&Cs are the truly flexible and responsive portion of government. Unlike legislated entitlement programs, they allow government to present a more agile face to citizens. For instance, the current panoply of infrastructure investments being delivered under the aegis of the Economic Action Plan (EAP) is the federal government's key tool for attempting to jump-start a moribund economy.

The newer Treasury Board thinking on risk management reflects some of Campbell's original concept of societal experimentation. If governments are to improve their responsiveness and effectiveness, then embodying some of what Schumpeter called creative destruction is a good idea. Better still, allowing hard evidence of what did not work to guide the destructive portion of the cycle is the exact point that President Obama was referring to in his inaugural address ("Where the answer is no, programs will end.").


G&Cs are, by their more ad hoc and protean nature, best suited to implementing a more agile, experimental government. In the nineties, there were attempts to increase government agility and value for money using the new public management, which involved greater use of non-governmental partners. There were examples where these partnered models compared favourably to other methods. Yet the implementation of this flexibility is fraught with difficulties, and the evaluation record to date is inauspicious. Improvements in agility and effectiveness often came at the cost of weaker expenditure management, yet the balance of effectiveness and control was never really debated.

In the case of G&Cs, the methodological problems for the evaluator are magnified. The program logic is often fuzzier and the accountability regimes are more diffuse and less clearly developed. The diverse scale and nature of different projects and investments within the overall program often make it very difficult to classify, compare, and integrate conclusions. Moreover, there is often a limited number of cases to sample from, which renders the application of probability theory difficult.

G&Cs often produce a more direct mixture of the political and bureaucratic realms of government. Some have speculated that disputes about who gets to cut which ribbons in which ridings have severely retarded the implementation of the EAP. We don't need to merely speculate about the potentially incendiary nature of this mixture, however; any casual review of the recent history of this department will vividly illustrate some of the key points we are making. As we review this, it is important to bear in mind the diverse and often contradictory vested interests of the political world, the program managers and the evaluator.

The recent journey from HRDC to its new incarnation as HRSDC illustrates some of the very points I want to make about the relative influence of rational empirical knowledge in the world of decision-making. In the heat of public controversy, it is rarely (if ever) the case that hard effectiveness research can counter perceived political and communications imperatives.

A particularly poignant illustration of many of the central points I am trying to make can be drawn from the ill-fated Transitional Jobs Fund Program (TJF). Without delving into the details of this much-analyzed case, it is important to note that the focus on the TJF produced a broad range of profound impacts. At the very least, the audit and evaluation records of this G&C program were linked to the set of forces which saw the resignation of the Minister, the cessation of the program and, arguably, the dismantling of the Department.


TJF was designed (partially) on the principles of the new public management model. It was to be based on greater agility and partnered, local delivery (often with the voluntary sector). While recognizing the inherent challenges of financial management and diffusion of accountability, there were reasons to expect value-for-money advantages (which had been seen in earlier evaluations). We conducted a preliminary evaluation, and there was also an internal audit. The preliminary evaluation conclusions on effectiveness were largely positive, but several participants had complained about undue political interference in the granting process. It was a minority view, and the program design had actually included a role for local MPs. The concern was noted in the report (although it did not find its way into the Executive Summary). The pyrotechnic impacts of this secondary note were dramatic and long lasting. The affair escalated into the "billion dollar boondoggle" and "shovelgate." Subsequent forensic analysis suggested the actual amount of missing funds was far less than $80,000, not trivial but far short of a billion.

Not only was the program scrapped, but the Minister resigned. This was also the genesis of a view which emerged amongst some key advisors to the Prime Minister that the corrosive influences of this scandal were so severe and irreparable that the Department itself had to be dismantled and rebranded. This drastic measure was implemented notwithstanding the fact that extensive quantitative and qualitative research had clearly shown minimal awareness of the scandal amongst clients and the general public, and that overall public impressions of the Department were very positive. The Department was broken down into three rather odd new departments which functioned less well than before. Predictably, the Department was slowly reintegrated into the more coherent original organizational configuration, at considerable cost which most likely dwarfed any funds lost in the initial incident.

This controversy, along with the even more dramatic sponsorship scandal (another Grants and Contributions program where much more egregious abuses and criminality occurred), arguably changed Canada's political landscape and most certainly ushered in the new Accountability Act. This swing to stricter accountability has produced full employment for auditors and had the serendipitous benefit of reinvigorating interest in evaluation following a period of decline. Whether these changes have made the federal government more slow-footed and inefficient remains to be seen, but whatever the allure of the new public management governance model, it too was extinguished in the smoking ruins of these affairs.


Revisiting this sorry tale was not intended to reawaken past ghosts. Rather, its most striking implications for this audience today are twofold. First, Grants and Contributions programs have potentially inflammable ingredients which need to be managed carefully if the parties involved are to avoid being burned. Second, and more importantly, it appears that issues of causal effectiveness had virtually no influence in the record of decision-making and were eclipsed by perceived political calculus. So why is it that, despite all of the compelling rational arguments in favour of causal evidence, evaluation once again played such a minor role in the decision-making? Clearly the initial internal audit, and the subsequent work of the Auditor General, were extremely important and acted upon. In the rush to vacate the new public management model and usher in a strengthened era of strict expenditure management, was there a careful debate of actual results and value for money versus financial probity? Was program evaluation relatively silent because of poor methodology, or was it because more powerful contradictory interests overwhelmed these considerations? Why did the audit conclusions have such great influence and the evaluation record so little? Are actual results less important than prudential financial management?

4.0 Some methodological, practical and organizational suggestions


Moving from this specific example, the methodological challenges of answering difficult questions about the causal impacts of programs, particularly the diverse, multi-levelled projects delivered under G&C programs, are formidable. Without being overly specific, we suggest some areas where progress might occur.

First of all, we should keep the large-M methodological issue of causal inference and epistemology clearly in mind. One of the most attractive features of Campbell's approach was his insistence on at least emulating, if not always adhering to, the rigour of experimental design. This approach owes a debt to Karl Popper's notion of falsification. We can never verify any proposition with certainty, as Hume had pointed out in his discussion of the limits of induction. We can, however, know with certainty what isn't true. Hence objective knowledge is an evolutionary process of the survival of the fittest (non-falsified) hypotheses. I wonder if some of the newer theory-based methods don't stray into the potential fallacy of affirming the consequent. The key point I want to stress is that there are a myriad of useful forms of management advice; the unique value of program evaluation should be rooted pre-eminently in the realm of causal impacts.


As an extension of this principle, it might be wise to recognize that while there is a logic of testing (falsification), there is no corresponding logic of discovery (innovation). So the evaluator can certainly shed light on what new projects or policies might work, but his or her fundamental role is to point to what hasn't worked (and, by corollary, to tentatively approve those programs which survived falsification). This also relates to the point that evaluators must continue to strive to clarify and demarcate their role as the science of the causal effects of programs and policies. In my view, there is too much blurring across other areas of accountability, and this has done a disservice to the core purpose and value of evaluation. This is not simply because evaluators have been drawn into the role of ongoing management consulting to program managers but because auditors have also attempted to lay claim to authority in documenting effectiveness and results. There is little in the professional practices of auditors and accountants to support this claim. In the case of G&C programs, the application of experimental method is far-fetched. Even if these methods could be applied, for example by delivering the programs in a random or systematic sample of catchment areas, there would be serious obstacles to both validity and ethics. How could we justify withholding, for example, stimulus funding in areas of relatively equal need? Unlike drug trials, the value of the effects knowledge would not outweigh the ethical, let alone political, fallout. Moreover, even if we could imagine situations where this was possible, experimental designs often suffer serious issues of external validity and timeliness. Would these results be reproducible in other geographic settings or other periods in the business cycle?

One-shot retrospective designs, even with elaborate statistical controls, also have serious methodological flaws. Perhaps if we lower our immediate sights, we might be able to make greater progress by implementing repeated measures of key policy and performance outcomes. Moreover, these repeated outcome measures could be appended or fused to pre-existing data collection platforms.
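To make the repeated-measures idea concrete, the sketch below (Python, purely illustrative; the wave dates, sample sizes and indicator names are hypothetical rather than drawn from any actual HRSDC instrument) shows how a small set of outcome indicators could ride along on an existing quarterly survey and immediately yield a monitored time series of means and standard errors.

```python
# A minimal sketch, not an actual departmental system: appending repeated
# outcome measures to an existing survey platform. All variables are simulated.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Pretend these are quarterly waves of an existing client-satisfaction survey.
waves = pd.date_range("2008-01-01", periods=8, freq="QS")
frames = []
for i, wave in enumerate(waves):
    n = 400  # respondents per wave
    frames.append(pd.DataFrame({
        "wave": wave,
        # Existing satisfaction item already collected each wave (1-5 scale).
        "satisfaction": rng.integers(1, 6, n),
        # Appended policy-outcome indicators (hypothetical): confidence in
        # skills and self-assessed ability to find a new job, drifting
        # slightly upward over time purely for illustration.
        "skills_confidence": np.clip(rng.normal(3.0 + 0.05 * i, 1.0, n), 1, 5),
        "job_search_ability": np.clip(rng.normal(2.8 + 0.04 * i, 1.0, n), 1, 5),
    }))

panel = pd.concat(frames, ignore_index=True)

# The repeated-measures payoff: a simple time series of outcome means and
# standard errors, refreshed every quarter rather than produced one-shot.
summary = (panel.groupby("wave")[["skills_confidence", "job_search_ability"]]
                .agg(["mean", "sem"]))
print(summary.round(2))
```

A tabulation of this kind, refreshed with each wave, is the sort of rapid, ongoing feedback on key outcomes discussed in the remainder of this section.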

5.0 Some practical issues

A full discussion of practical issues would include economics, timeliness and communications. One example will be used to illustrate these points.

Large, complex departments like HRSDC routinely collect regular data for non-evaluation purposes. There are considerable resources devoted, for example, to measuring client satisfaction and organizational branding. Often, dozens of indicators are used to operationalize the common measurement tool to assess client satisfaction. It is our contention that the vast majority of this information could be reliably measured with less than half the indicators currently used. This would free up space to include regular monitoring of a series of key policy outcome indicators (e.g., confidence in skills, ability to find a new job or relocate, income adequacy, self-rated health, etc.). These indicators could be designed to explicitly measure some of the ultimate objectives of G&C programs. The fusion of outcome data and administrative program data could be done in a quite rapid and accessible fashion. We first argued for a greater integration of evaluation and market research methods in a 1988 Canadian Journal of Program Evaluation article entitled Toward Ongoing Monitoring and Evaluation Systems. My colleagues in both the evaluation and polling worlds remain as underwhelmed by this argument as I am convinced that it would help.
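As a rough illustration of the contention that fewer indicators could measure satisfaction just as reliably, the following sketch checks whether a reduced item set preserves scale reliability using Cronbach's alpha. The items and data are simulated assumptions, not the actual common measurement tool; the point is only the form of the test one might run on real questionnaire data.

```python
# A minimal sketch, assuming simulated data: does a reduced item set retain
# the reliability of the full satisfaction battery?
import numpy as np
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Classic Cronbach's alpha for a set of item columns."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

rng = np.random.default_rng(1)
n = 500
latent = rng.normal(0, 1, n)  # one underlying satisfaction factor
full_items = pd.DataFrame({
    f"q{i:02d}": latent + rng.normal(0, 0.8, n) for i in range(1, 21)
})  # 20 highly redundant items
reduced_items = full_items[[f"q{i:02d}" for i in range(1, 9)]]  # keep 8

print("alpha, 20 items:", round(cronbach_alpha(full_items), 3))
print("alpha,  8 items:", round(cronbach_alpha(reduced_items), 3))
# If the reduced set keeps alpha comfortably high, the freed-up questionnaire
# space can carry policy-outcome indicators instead.
```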

Merely monitoring outcomes on a regular basis would not provide clear evidence of causal impacts. It would, however, at least tell us about the state of ultimate goals and whether things are getting better or worse. Through time, moreover, even a much less fully identified causal model could yield causal insights from a clear time series. This could be further enhanced by aggregating outcome data to small geographic areas and then linking in program supply data and other characteristics of those areas. This could provide an aggregated model for testing causal hypotheses.
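A minimal sketch of what such an aggregated small-area model might look like follows. The area, quarter, spending and outcome variables are all simulated placeholders; with real data, the outcome means from the monitoring system and program supply figures from administrative files would take their place.

```python
# A minimal sketch of an aggregated small-area model: outcome means by area
# and quarter, linked to a hypothetical program supply measure, with area and
# time fixed effects. Simulated data only, not actual departmental figures.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
areas, quarters = 40, 12
rows = []
for a in range(areas):
    area_effect = rng.normal(0, 0.5)  # unobserved area characteristics
    for t in range(quarters):
        spend = rng.gamma(2.0, 1.0)   # program dollars per capita (simulated)
        outcome = 3.0 + area_effect + 0.03 * t + 0.15 * spend + rng.normal(0, 0.3)
        rows.append({"area": a, "quarter": t, "spend": spend, "outcome": outcome})
df = pd.DataFrame(rows)

# Two-way fixed effects: a crude but transparent test of whether areas
# receiving more program supply saw larger outcome gains over time.
model = smf.ols("outcome ~ spend + C(area) + C(quarter)", data=df).fit()
print("estimated effect of program supply:", round(model.params["spend"], 3))
print("p-value:", round(model.pvalues["spend"], 4))
```

This is not a substitute for experimental evidence of incremental impact, but it shows how a running, fused data platform could support at least tentative causal hypothesis testing.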


There is another advantage to introducing what we have in the past called Ongoing Monitoring and Evaluation Systems (OMES) (Graves, 1988). By simplifying measurement to regularly monitor key policy outcomes, the overall timeliness, accessibility and agility of evaluation results can be improved. One of the principal practical barriers to better utilization is concern about the timeliness and transparency of results. Rapid, ongoing feedback on the state of the dependent variables of interest could help reduce these concerns.

By blending in the transparency and timeliness of modern polling techniques, we can achieve a fusion evaluation which could capture the attention of the most senior audiences in a more effective manner. Today, we conduct social surveys (now known as polls) in a tiny fraction of the time devoted to them at the outset of my career; twenty-five years ago, those same surveys took months. In consulting with experts, it seems evaluations are no quicker than in the past. Couldn't we fruitfully apply some of this agility to evaluation?

Obviously this sort of approach will require rethinking the current organization of research and data collection in HRSDC (and virtually all other large government departments and agencies). We must move beyond the current insular research fiefdoms to explore methods of developing shared databases that serve the needs of several clients. This logic could be extended to horizontal coverage of several departments and even across different levels of government and different sectors. In the past, we have had great success building horizontal policy and communications platforms for a range of organizations who share a common research interest but don't have the resources to pursue it themselves. Such horizontal evaluations could provide economies of scale and create common criterion evidence across diverse organizations. How many departments and levels of government might share an interest in knowing which stimulus spending methods were most cost-effective? Having entered the realm of organizational reform, I would like to conclude with my most radical proposal.

For many years, proponents and critics have debated what would be the best organizational locus and configuration for more effective program evaluation. A more muscular and evaluation-focussed Comptroller General has been discussed, along with the possibility of housing ultimate responsibility within the Office of the Auditor General. The OAG certainly has an enviable record of having its conclusions forcefully tabled and otherwise acted upon, yet it has historically rejected this call over concerns that resource limitations would mute its audit effectiveness and blur its role. Others have also argued that housing evaluation within departments produces more realistic and nuanced evaluations. This is, admittedly, a vulgar simplification of these complex arguments.

Whatever the merits of these different positions, what is undeniable is that evaluations have rarely been a major force shaping public decision-making. Along with the worked-through example of the Transitional Jobs Fund, I would add that program evaluation has had a very modest impact on landmark expenditure review processes ranging from the Nielsen Task Force to the Program Review exercise of the last decade. It is encouraging that the current TBS risk management framework appears to open the door to a more substantial role for program evaluation. It is, however, naïve to think that evaluation's influence will prevail simply because it is a good idea or because the opportunity exists. The historical record of utilization flies in the face of this optimistic view of evaluation impacts.

In seeking an understanding of why this hasn't been the case, and how things could work better in the future, we have suggested that a more timely, accessible approach to evaluation, one which integrates diverse departmental and perhaps cross-departmental data collection into an ongoing system of monitoring outcomes, might make sense. We also argue that an integrated fusion of evaluation and communications (polling) research might allow evaluators to capture some of the attention that other fields have enjoyed. Others have offered far better developed and compelling arguments for methodological reforms.

At the end of the day, however, it is not weak methodology which explains evaluation's weaker-than-hoped-for impacts. The deeper answers lie in the nature of the organizational design and, perhaps more importantly, of the incentive systems underpinning the bureaucratic and, ultimately, political wings of the modern federal state.

First, a couple of notes on the bureaucracy. Currently, program managers have an enlarged role in shaping the depiction of program logic and the evaluation approach to be used. Given that success within an organization is linked to records of success, this may produce undesirable organizational tensions. Asking program managers to sign off on a powerful evaluation design is tantamount to asking the dog to fetch the stick you are going to beat it with. If we concur that the ultimate role of evaluation lies in the falsification of program logic, then there is little upside and obvious downside in the incentive system underpinning the current arrangement.

Turning to the more challenging political realm, the problem is magnified considerably.


Political success lies in winning elections. Negative evaluation evidence can be useful fodder for opposition critics, but there is little upside for those sitting on the government side of the House. Promises of a stricter accountability regime and a greater focus on results can also be a modestly useful component of a platform for seeking electoral success, but actual evaluation results are rarely helpful to the governing party. There are few public hosannas for positive evaluations, and even mixed evaluations can usher in a cycle of politically damaging consequences. One can also safely argue that a government's enthusiasm for evaluation will begin to wane the longer it holds office and the documented inventory of foolishness shifts from its predecessors' policies and programs to its own.

A potential solution for raising the influence of evaluation evidence in public decision-making, and insulating it from these contradictory interests, would be the creation of an Office of the Evaluator General (OEG). The OEG would also serve to champion and raise public consciousness about the importance of knowing what works and what doesn't.

One of the reasons that the Auditor General has had a successful record of improving value for money is its arm's-length relationship to the government of the day. The Auditor General reports directly to Parliament, and its reports receive high-profile media, stakeholder and public attention. These conditions are crucial to the objectivity and influence of the Office.

The time may well have come for a parallel Office of the Evaluator General. Harnessing the resources of internal evaluation units, which would be physically housed in departments but report to the OEG, the Evaluator General would be insulated from the vested interests of both the bureaucracy and the political realm. While raising public consciousness of the need to know whether programs actually work (separately from sound financial management), the function would serve the diverse interests of Parliament as a whole and the broader public.


Bibliography
Bell, Daniel. The Coming of Post-Industrial Society (1973).
Campbell, Donald T. "Reforms as Experiments," American Psychologist, Vol. 24, No. 4 (April 1969).
Campbell, Donald T. and Stanley, Julian C. Experimental and Quasi-Experimental Designs for Research (1966).
Colpitts, T. A Brief History of Evaluation: In Time and Space. Ottawa (February 2009).
D'Aloisio, Guy; Laurendeau, Michel; Neimanis, V.; Obrecht, Michael; Porteous, Nancy; Prieur, Paul; Witmer, Julie. "An Evaluator General for Canada: A Solution for Filling the Accountability Void?" Canadian Government Executive, Vol. 13, No. 7 (September 2007).
Good, David A. The Politics of Public Management: The HRDC Audit of Grants and Contributions.
Graves, Frank. "Harder Methods for Softer Programs: Evaluating the Special Program of Cultural Initiatives," Canadian Evaluation Society Annual General Meeting, Ottawa, 1984.
Graves, Frank. "Strengthening the Dialogue between Program Evaluation and Market Research: Toward Ongoing Monitoring and Evaluation Systems (OMES)," Canadian Journal of Program Evaluation, Vol. 3:1 (1988).
Graves, Frank. "Towards Practical Rigour: Methodological and Strategic Considerations for Program Evaluation," presented to the Canadian Evaluation Society, May 1984.
Obama, Barack. Inaugural Address, Washington, D.C., January 2009.
Popper, Karl R. Objective Knowledge: An Evolutionary Approach (1972).
"The Changing Role of Nonrandomized Research Designs in Assessments," in J. Hudson, J. Mayne and R. Thomilson (eds.), Action-Oriented Evaluation in Organizations: Canadian Practices. Toronto: Wall and Emerson Inc., 1992. ISBN 1-895131-09-X.
"Towards Practical Rigour: Methodological and Strategic Considerations for Program Evaluation," Optimum, Bureau of Management Consulting, Vol. 15:4 (1984).
Preliminary Evaluation of the Transitional Jobs Fund (1998). EKOS Research Associates Inc., submitted to Evaluation Services, Evaluation and Data Development, Human Resources Development Canada.

