
Evaluating Intelligence

Kristan J. Wheaton
Assistant Professor
Mercyhurst College
814 824 2023
kwheaton Page 2 2/8/2009

Evaluating Intelligence1

Evaluating intelligence is tricky.

Really tricky.

Sherman Kent, one of the foremost early thinkers regarding the analytic process in the
US national security intelligence community, wrote in 1976, “Few things are asked the
estimator more often than "How good is your batting average?" No question could be
more legitimate--and none could be harder to answer.” So difficult was the question that
Kent reports not only the failure of a three-year effort in the 1950s to establish the validity
of various National Intelligence Estimates but also the immense relief among the
analysts in the Office of National Estimates (forerunner of the National Intelligence
Council) when the CIA “let the enterprise peter out.”

Unfortunately for intelligence professionals, the decisionmakers that intelligence
supports have no such difficulty evaluating the intelligence they receive. They routinely
and publicly find intelligence to be “wrong” or lacking in some significant respect.
Abbot Smith, writing for Studies In Intelligence in 1969, cataloged many of these errors
in On The Accuracy Of National Intelligence Estimates. The list of failures at the time
included the development of the Soviet H-bomb, the Soviet invasions of Hungary and
Czechoslovakia, the Cuban Missile Crisis and the Missile Gap. The Tet Offensive, the
collapse of the Soviet Union and the Weapons of Mass Destruction fiasco in Iraq would
soon be added to the list of widely recognized (at least by decisionmakers) “intelligence
failures.”

This article originated as a series of posts on my blog, Sources and Methods. This form of
“experimental scholarship” -- using the medium of the internet and the vehicle of the blog as a
way to put my research online -- provides for more or less real-time peer review. Earlier
examples of this genre include: A Wiki Is Like A Room..., The Revolution Begins On Page 5,
What Is Intelligence? and What Do Words Of Estimative Probability Mean?. Given its origin
and the fact that it is stored electronically in the ISA archive, I will retain the
hyperlinks as a form of citation.

In addition, astute readers will note that some of what I write here I have previously discussed in other
places, most notably in an article written with my long-time collaborator, Diane Chido, for Competitive
Intelligence Magazine and in a chapter of our book on Structured Analysis Of Competing Hypotheses
(written with Diane, Katrina Altman, Rick Seward and Jim Kelly). Diane and the others clearly deserve
full credit for their contribution to this current iteration of my thinking on this topic.

Nor was the US the only intelligence community to suffer such indignities. The Soviets
had their Operation RYAN, the Israelis their Yom Kippur War and the British their
Falklands. In each case, after the fact, senior government officials, the press and
ordinary citizens alike pinned the black rose of failure on their respective intelligence
organizations.

To be honest, in some cases, the intelligence organization in question deserved the
criticism but, in many cases, it did not -- or at least not the full measure of fault it
received. However, whether the blame was earned or not, in the aftermath of each of
these cases, commissions were duly summoned, investigations into the causes of the
failure examined, recommendations made and changes, to one degree or another,
ratified regarding the way intelligence was to be done in the future.

While much of the record is still out of the public eye, I suspect it is safe to say that
intelligence successes rarely received such lavish attention.

Why do intelligence professionals find intelligence so difficult, indeed impossible, to
evaluate while decisionmakers do so routinely? Is there a practical model for thinking
about the problem of evaluating intelligence? What are the logical consequences for
both intelligence professionals and decisionmakers that derive from this model? Finally,
is there a way to test the model using real world data?

I intend to attempt to answer all of these questions but first I need to tell you a story…

A Tale Of Two Weathermen

I want to tell you a story about two weathermen: one good, competent and diligent, and
one bad, stupid and lazy. Why weathermen? Well, in the first place, they are not
intelligence analysts, so I will not have to concern myself with all the meaningless
distinctions that might arise if I use a real example. In the second place, they are enough
like intelligence analysts that the lessons derived from this thought experiment – sorry, I
mean “story” – will remain meaningful in the intelligence domain.

Imagine first the good weatherman and imagine that he only knows one rule: If it is
sunny outside today, then it is likely to be sunny tomorrow (I have no idea why he only
knows one rule. Maybe he just got hired. Maybe he hasn’t finished weatherman school
yet. Whatever the reason, this is the only rule he knows). While the weatherman only
knows this one rule, it is a good rule and has consistently been shown to be correct.

His boss comes along and asks him what the weather is going to be like tomorrow. The
good weatherman remembers his rule, looks outside and sees sun. He tells the boss, “It
is likely to be sunny tomorrow.”

The next day the weather is sunny and the boss is pleased.

Clearly the weatherman was right. The boss then asks the good weatherman what the
weather will be like the next day. “I want to take my family on a picnic,” says the boss,
“so the weather tomorrow is particularly important to me.” Once again the good
weatherman looks outside and sees sun and says, “It is likely to be sunny tomorrow.”

The next day, however, the rain is coming down in sheets. A wet and bedraggled
weatherman is sent straight to the boss’ office as soon as he arrives at work. After the
boss has told the good weatherman that he was wrong and given him an earful to boot,
the good weatherman apologizes but then asks, “What should I have done differently?”

“Learn more rules!” says the boss.

“I will,” says the weatherman, “but what should I have done differently yesterday? I
only knew one rule and I applied it correctly. How can you say I was wrong?”

“Because you said it would be sunny and it rained! You were wrong!” says the boss.

“But I had a good rule and I applied it correctly! I was right!” says the weatherman.

Let’s leave them arguing for a minute and think about the bad weatherman.

This guy is awful. The kind of guy who sets low standards for himself and consistently
fails to achieve them, who has hit rock bottom and started to dig, who is not so much of
a has-been as a won’t-ever-be (For more of these see British Performance Evaluations).
He only knows one rule but has learned it incorrectly! He thinks that if it is cloudy
outside today, it is likely to be sunny tomorrow. Moreover, tests have consistently
shown that weathermen who use this rule are far more likely to be wrong than right.

The bad weatherman’s boss asks the same question: “What will the weather be like
tomorrow?” The bad weatherman looks outside and sees that it is cloudy and he states
(with the certainty that only the truly ignorant can muster), “It is likely to be sunny
tomorrow.”

The next day, against the odds, the day is sunny. Was the bad weatherman right? Even
if you thought he was right, over time, of course, this weatherman is likely to be wrong
far more often than he is to be right. Would you evaluate him based solely on his last
judgment or would you look at the history of his estimative judgments?

There are several aspects of the weathermen stories that seem to be applicable to
intelligence. First, as the story of the good weatherman demonstrates, the traditional

notion that intelligence is either “right” or “wrong” is meaningless without a broader
understanding of the context in which that intelligence was produced.

Second, as the story of the bad weatherman revealed, considering estimative judgments
in isolation, without also evaluating the history of estimative judgments, is a mistake.
Any model for evaluating intelligence needs to (at least) take these two factors into
account.

A Model For Evaluating Intelligence

Clearly there is a need for a more sophisticated model for evaluating intelligence – one
that takes not only the results into consideration but also the means by which the analyst
arrived at those results. It is not enough to get the answer right; analysts must also
“show their work” in order to demonstrate that they were not merely lucky.

For the purpose of this paper, I will refer to the results of the analysis -- the analytic
estimate under consideration -- as the product of the analysis. I will call the means by
which the analyst arrived at that estimate the process. Analysts, therefore, can be largely
(more on this later) correct in their analytic estimate. In this case, I will define the
product as true. Likewise, analysts can be largely incorrect in their analytic estimate in
which case I will label the product false.

Just as important, however, is the process. If an analyst uses a flawed, invalid process
(much like the bad weatherman used a rule proven to be wrong most of the time), then I
would say the process is false. Likewise, if the analyst used a generally valid process,
one which produced reasonably reliable results over time, then I would say the process
was true or largely accurate and correct.

Note that these two spectra are independent of one another. It is entirely possible to
have a true process and a false product (consider the story of the good weatherman). It
is also possible to have a false process and a true product (such as with the story of the
bad weatherman). On the other hand, both product and process are bound tightly
together as a true process is more likely to lead to a true product and vice-versa. The
Chinese notion of yin and yang or the physicist’s idea of complementarity are useful
analogues for the true relationship between product and process.

In fact, it is perhaps convenient to think of this model for evaluating intelligence in a
small matrix, such as the one below:

[Matrix: product (true/false) on one axis crossed with process (true/false) on the other,
yielding four combinations.]

There are a number of examples of each of these four basic combinations. For instance,
consider the use of intelligence preparation of the battlefield in the execution of combat
operations in the Mideast and elsewhere. Both the product and the process by which it
was derived have proven to be accurate. On the other hand, statistical sampling of voters
(polling) is unquestionably a true process but has, upon occasion, generated
spectacularly incorrect results (see Truman v. Dewey…)

False processes abound. Reading horoscopes, tea leaves and goat entrails are all false
processes which, every once in a while, turn out to be amazingly accurate. These same
methods, however, are even more likely to be false in both process and product.

What are the consequences of this evaluative model? In the first place, it makes no
sense to talk about intelligence being “right” or “wrong”. Such an appraisal is overly
simplistic and omits critical evaluative information. Evaluators should be able to specify
if they are talking about the intelligence product or process or both. Only at this level of
detail does any evaluation of intelligence begin to make sense.

Second, with respect to which is more important, product or process, it is clear that
process should receive more attention. Errors in a single product might well result in
poor decisions, but are generally easy to identify in retrospect if the process is valid. On
the other hand, errors in the analytic process, which are much more difficult to detect,
virtually guarantee a string of failures over time with only luck to save the unwitting
analyst. This truism is particularly difficult for an angry public or a congressman on the
warpath to remember in the wake of a costly “intelligence failure”. This makes it all the
more important to embed this principle deeply in any system for evaluating intelligence
from the start when, presumably, heads are cooler.

Finally, and most importantly, it makes no sense to evaluate intelligence in isolation –
to examine only one case to determine how well an intelligence organization is
functioning. Only by examining both product and process systematically over a series
of cases does a pattern emerge that allows for appropriate corrective action, if necessary
at all, to be taken.

The Problems With Evaluating Product

The fundamental problem with evaluating intelligence products is that intelligence, for
the most part, is probabilistic. Even when an intelligence analyst thinks he or she knows
a fact, it is still subject to interpretation or may have been the result of a deliberate
campaign of deception.

• The problem is exacerbated when making an intelligence estimate, where good
analysts never express conclusions in terms of certainty. Instead, analysts
typically use words of estimative probability (or, what linguists call verbal
probability expressions) such as "likely" or "virtually certain" to express a
probabilistic judgment. While there are significant problems with using words
(instead of numbers or number ranges) to express probabilities, using a limited
number of such words in a preset order of ascending likelihood currently seems
to be considered the best practice by the National Intelligence Council (Iran
NIE, page 5).
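To make the idea of a preset, ascending vocabulary concrete, the sketch below maps estimative words to probability ranges. The specific words and ranges are invented for this illustration and are not the official NIC scale.

```python
# Illustrative only: estimative words in ascending order of likelihood.
# The ranges below are assumptions made for this sketch, NOT the
# official National Intelligence Council scale.
WORDS_OF_ESTIMATIVE_PROBABILITY = [
    ("remote",           (0.00, 0.05)),
    ("very unlikely",    (0.05, 0.20)),
    ("unlikely",         (0.20, 0.45)),
    ("even chance",      (0.45, 0.55)),
    ("likely",           (0.55, 0.80)),
    ("very likely",      (0.80, 0.95)),
    ("almost certainly", (0.95, 1.00)),
]

def word_for(probability):
    """Return the first estimative word whose (assumed) range covers
    the given probability."""
    for word, (low, high) in WORDS_OF_ESTIMATIVE_PROBABILITY:
        if low <= probability <= high:
            return word
    raise ValueError("probability must be between 0 and 1")

print(word_for(0.6))  # likely
```

Keeping the list short and strictly ranked mirrors the practice described above: the reader only has to learn one small, ordered vocabulary rather than guess what each analyst means.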

Intelligence products, then, suffer from two broad categories of error: Problems of
calibration and problems of discrimination. Anyone who has ever stepped on a scale
only to find that they weigh significantly more or significantly less than expected
understands the idea of calibration. Calibration is the act of adjusting a value to meet a
standard.

In simple probabilistic examples, the concept works well. Consider a fair, ten-sided die.
Each number, one through ten, has the same probability of coming up when the die is
rolled (10%). If I asked you to tell me the probability of rolling a seven, and you said
10%, we could say that your estimate was perfectly calibrated. If you said the
probability was only 5%, then we would say your estimate was poorly calibrated and we
could "adjust" it to 10% in order to bring it into line with the standard.

Translating this concept into the world of intelligence analysis is incredibly complex.

To have perfectly calibrated intelligence products, we would have to be able to say that,
if a thing is 60% likely to happen, then it happens 60% of the time. Most intelligence
questions (beyond the trivial ones), however, are unique, one of a kind. The exact set of
circumstances that led to the question being asked in the first place and much of the
information relevant to its likely outcome are impossible to replicate, making it difficult
to keep score in a meaningful way.
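If intelligence questions were repeatable, keeping score would be mechanical. The sketch below, with invented forecasts and outcomes, shows what such a calibration check would look like: group forecasts by their stated probability and compare that probability against the observed frequency of the event.

```python
# A calibration check for repeatable forecasts (data invented for
# illustration; real intelligence questions rarely permit this).
from collections import defaultdict

def calibration_table(forecasts, outcomes):
    """Group forecasts by stated probability and return the observed
    frequency of the event for each stated probability."""
    buckets = defaultdict(list)
    for stated_p, happened in zip(forecasts, outcomes):
        buckets[stated_p].append(happened)
    return {p: sum(hits) / len(hits) for p, hits in sorted(buckets.items())}

# Ten forecasts of "60% likely"; the event occurred in 6 of the 10 cases,
# so the stated probability matches the observed frequency exactly.
print(calibration_table([0.6] * 10, [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]))
# {0.6: 0.6}
```

The difficulty the paragraph above describes is precisely that the ten comparable forecasts this sketch assumes almost never exist for a real estimative question.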

The second problem facing intelligence products is one of discrimination.
Discrimination is associated with the idea that intelligence is either “right” or “wrong”.
An analyst with a perfect ability to discriminate always gets the answer right, whatever
the circumstance. While the ability to perfectly discriminate between right and wrong
analytic conclusions might be a theoretical ideal, the ability to actually achieve such a
feat exists only in the movies. Most complex systems are subject to a certain sensitive
dependence on initial conditions which precludes any such ability to discriminate
beyond anything but trivially short time frames.

If it appears that calibration and discrimination are in conflict, they are. The better
calibrated an analyst is, the less likely they are to be willing to definitively discriminate
between possible estimative conclusions. Likewise, the more willing an analyst is to
discriminate between possible estimative conclusions, the less likely he or she is to be
properly calibrating the possibilities inherent in the intelligence problem.

For example, an analyst who says X is 60% likely to happen is still 40% "wrong" when
X does happen should an evaluator choose to focus on the analyst's ability to
discriminate. Likewise, the analyst who said X will happen is also 40% wrong if the
objective probability of X happening was 60% (even though X does happen), if the
evaluator chooses to focus on the analyst's ability to calibrate.

Failure to understand the tension between these two evaluative principles leaves the
unwitting analyst open to a "damned if you do, damned if you don't" attack by critics of
the analyst's estimative work. The problem only grows worse if you consider words of
estimative probability instead of numbers.
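The "damned if you do, damned if you don't" trap can be made concrete with a toy scoring sketch. The "objective" 60% probability here is a stipulation of the example; no real evaluator has access to such a number.

```python
# Toy scoring sketch of the calibration/discrimination tension.
# The objective probability of X (0.6) is stipulated for illustration.

def discrimination_error(forecast, outcome):
    """Distance between the forecast and what actually happened (0 or 1):
    the evaluator who cares only about 'right or wrong'."""
    return abs(outcome - forecast)

def calibration_error(forecast, objective_probability):
    """Distance between the forecast and the underlying probability:
    the evaluator who cares only about well-calibrated estimates."""
    return abs(objective_probability - forecast)

objective_p, outcome = 0.6, 1  # X was 60% likely, and it happened

for label, forecast in [("'60% likely'", 0.6), ("'X will happen'", 1.0)]:
    print(label,
          "discrimination error:", round(discrimination_error(forecast, outcome), 2),
          "calibration error:", round(calibration_error(forecast, objective_p), 2))
```

Whichever way the analyst leans, one of the two scores comes out 0.4: the hedged forecast is punished by the discrimination-focused evaluator, the definitive forecast by the calibration-focused one.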

All this, in turn, typically leads analysts to ask for what Philip Tetlock, in his excellent
book Expert Political Judgment, called "adjustments" when being evaluated regarding
the accuracy of their estimative products. Specifically, Tetlock outlines four key
adjustments:

• Value adjustments -- mistakes made were the "right mistakes" given the cost of
the alternatives
• Controversy adjustments -- mistakes were made by the evaluator and not the
analyst

• Difficulty adjustments -- mistakes were made because the problem was so
difficult or, at least, more difficult than problems a comparable body of analysts
typically faced
• Fuzzy set adjustments -- mistakes were made but the estimate was a "near miss"
so it should get partial credit

This parade of horribles should not be construed as a defense of the school of thought
that says that intelligence cannot be evaluated, that it is too hard to do. It is merely to
show that evaluating intelligence products is truly difficult and fraught with traps to
catch the unwary. Any system established to evaluate intelligence products needs to
acknowledge these issues and, to the greatest degree possible, deal with them.

Many of the "adjustments", however, can also be interpreted as excuses. Just because
something is difficult to do doesn't mean you shouldn't do it. An effective and
appropriate system for evaluating intelligence is an essential step in figuring out what
works and what doesn't, in improving the intelligence process. As Tetlock notes (p. 9),
"The list (of adjustments) certainly stretches our tolerance for uncertainty: It requires
conceding that the line between rationality and rationalization will often be blurry. But,
again, we should not concede too much. Failing to learn everything is not tantamount to
learning nothing."

The Problems With Evaluating Process

In addition to product failures, there are a number of ways that the intelligence process
can fail as well. Requirements can be vague, collection can be flimsy or undermined by
deliberate deception, production values can be poor or intelligence made inaccessible
through over-classification. Finally, the intelligence architecture, the system in which
all the pieces are embedded, can be cumbersome, inflexible and incapable of responding
to the intelligence needs of the decisionmaker. All of these are part of the intelligence
process and any of these -- or any combination of these -- reasons can be the cause of an
intelligence failure.

In this article (and in this section in particular), I intend to look only at the kinds of
problems that arise when attempting to evaluate the analytic part of the process. From
this perspective, the most instructive current document available is Intelligence
Community Directive (ICD) 203: Analytic Standards. Paragraph D4, the operative
paragraph, lays out what makes for a good analytic process in the eyes of the Director
of National Intelligence:

• Objectivity
• Independent of Political Considerations
• Timeliness

• Based on all available sources of intelligence
• Properly describes the quality and reliability of underlying sources
• Properly caveats and expresses uncertainties or confidence in analytic judgments
• Properly distinguishes between underlying intelligence and analyst's
assumptions and judgments
• Incorporates alternative analysis where appropriate
• Demonstrates relevance to US national security
• Uses logical argumentation
• Exhibits consistency of analysis over time or highlights changes and explains them
• Makes accurate judgments and assessments

This is an excellent starting point for evaluating the analytic process. There are a few
problems, though. Some are trivial. Statements such as "Demonstrates relevance to US
national security" would have to be modified slightly to be entirely relevant to other
disciplines of intelligence such as law enforcement and business. Likewise, the
distinction between "objectivity" and "independent of political considerations" would
likely bother a stricter editor as the latter appears to be redundant (though I suspect the
authors of the ICD considered this and still decided to separate the two in order to
highlight the notion of political independence).

Some of the problems are not trivial. I have already discussed the difficulties associated
with mixing process accountability and product accountability, something the last item
on the list, "Makes accurate judgments and assessments," seems to encourage us to do.

Even more problematic, however, is the requirement to "properly caveat and express
uncertainties or confidence in analytic judgments." Surely the authors meant to say
"express uncertainties and confidence in analytic judgments". While this may seem like
hair-splitting, the act of expressing uncertainty and the act of expressing a degree of
analytic confidence are quite different things. This distinction is made (though not as
clearly as I would like) in the prefatory matter to all of the recently released National
Intelligence Estimates. The idea that the analyst can either express uncertainties
(typically through the use of words of estimative probability) or express confidence flies
in the face of this current practice.

Analytic confidence is (or should be) considered a crucial subsection of an evaluation of
the overall analytic process. If the question answered by the estimate is, "How likely is
X to happen?" then the question answered by an evaluation of analytic confidence is
"How likely is it that you, the analyst, are wrong?" These concepts are analogous to
statistical notions of probability and margin of error (as in polling data that indicates
that Candidate X is looked upon favorably by 55% of the electorate with a plus or
minus 3% margin of error). Given the lack of a controlled environment, the inability to
replicate situations important to intelligence analysts and the largely intuitive nature of

most intelligence analysis, an analogy, however, is what it must remain.

What contributes legitimately to an increase in analytic confidence? To answer this
question, it is essential to go beyond the necessary but by no means sufficient criteria
set by the standards of ICD 203. In other words, analysis which is biased or late
shouldn't make it through the door but analysis that is only unbiased and on time meets
only the minimum standard.

Beyond these entry-level standards for a good analytic process, what are those elements
that actually contribute to a better estimative product? The current best answer to this
question comes from Josh Peterson's thesis on the topic. In it he argued that seven
elements had adequate experimental data to suggest that they legitimately contribute to
analytic confidence:

• Use of Structured Methods in Analysis
• Overall Source Reliability
• Level of Source Corroboration/Agreement
• Subject Matter Expertise
• Amount of Collaboration Among Analysts
• Task Complexity
• Time Pressure
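One way to picture how such elements might combine is a simple checklist score. The element names, the 0-to-1 ratings, and the equal weighting below are my own assumptions for illustration; Peterson's thesis does not prescribe any such formula.

```python
# Hypothetical checklist score built from Peterson's seven elements.
# Names, ratings and equal weighting are assumptions for illustration.
ELEMENTS = [
    "structured_methods", "source_reliability", "source_corroboration",
    "subject_matter_expertise", "analyst_collaboration",
    "task_complexity", "time_pressure",
]

def analytic_confidence(ratings):
    """Average the analyst's 0.0-1.0 ratings across all seven elements
    (for complexity and time pressure, a high rating means the problem
    was simple and the deadline generous)."""
    missing = set(ELEMENTS) - set(ratings)
    if missing:
        raise ValueError(f"unrated elements: {sorted(missing)}")
    return sum(ratings[e] for e in ELEMENTS) / len(ELEMENTS)

ratings = dict.fromkeys(ELEMENTS, 0.5)
ratings["source_corroboration"] = 1.0   # strongly corroborated sources
print(round(analytic_confidence(ratings), 3))  # 0.571
```

Even this toy version surfaces the open questions that follow: whether a simple average is right depends entirely on the relative importance of, and interactions between, the elements.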

There are still numerous questions that remain to be answered. Which element is most
important? Is there a positive or negative synergy between two or more of the elements?
Are these the only elements that legitimately contribute to analytic confidence?

Perhaps the most important question, however, is how the decisionmaker -- the person
or organization the intelligence analyst supports -- likely sees this interplay of elements
that continuously impacts both the analytic product and process.

The Decisionmaker's Perspective

Decisionmakers are charged with making decisions. While this statement is blindingly
obvious, its logical extension actually has some far reaching consequences.

First, even if the decision is to "do nothing" in a particular instance, it is still a decision.
Second, with the authority to make a decision comes (or should come) responsibility
and accountability for that decision's consequences (The recent kerfuffle surrounding
the withdrawal of Tom Daschle from consideration for an appointment in the Obama
cabinet is instructive in this matter).

Driving these decisions are typically two kinds of forces. The first is largely internal to

the individual or organization making the decision. The focus here is on the capabilities
and limitations of the organization itself: How well-trained are my soldiers? How
competent are my salespeople? How fast is my production line, how efficient are my
logistics or how well equipped are my police units? Good decisionmakers are often
comfortable here. They know themselves quite well. Oftentimes they have risen through
the ranks of an organization or even started the organization on their own. The internal
workings of a decisionmaker's own organization are easiest (if not easy) to see, predict
and control.

The same cannot be said of external forces. The current upheaval in the global market is
likely, for example, to affect even good, well-run businesses in ways that are extremely
difficult to predict, much less control. The opaque strategies of state and non-state
actors threaten national security plans and the changing tactics of organized criminal
activity routinely frustrate law enforcement professionals. Understanding these external
forces is a defining characteristic of intelligence and the complexity of these forces is
often used to justify substantial intelligence budgets.

Experienced decisionmakers do not expect intelligence professionals to be able to
understand external forces to the same degree that it is possible to understand internal
forces. They do expect intelligence to reduce their uncertainty, in tangible ways,
regarding these external forces. Sometimes intelligence provides up-to-date descriptive
information, unavailable previously to the decisionmaker (such as the U2 photographs
in the run-up to the Cuban Missile Crisis). Decisionmakers, however, find it much more
useful when analysts provide estimative intelligence -- assessments about how the
relevant external forces are likely to change.

• Note: I am talking about good, experienced decisionmakers here. I do not
intend to address concerns regarding bad or stupid decisionmakers in this series
of posts, though both clearly exist. These concerns are outside the scope of a
discussion about evaluating intelligence and fall more naturally into the realms
of management studies or psychology. I have a slightly different take on
inexperienced (with intelligence) decisionmakers, however. I teach my students
that intelligence professionals have an obligation to teach the decisionmakers
they support about what intelligence can and cannot do in the same way the
grizzled old platoon sergeant has an obligation to teach the newly minted second
lieutenant about the ins and outs of the army.

Obviously then, knowing, with absolute certainty, where the relevant external forces
will be and what they will be doing is of primary importance to decisionmakers.
Experienced decisionmakers also know that to expect such precision from intelligence
is unrealistic. Rather, they expect that the estimates they receive will only reduce their
uncertainty about those external forces, allowing them to plan and decide with greater
but not absolute clarity.

Imagine, for example, a company commander on a mission to defend a particular piece
of terrain. The intelligence officer tells the commander that the enemy has two primary
avenues of approach, A and B, and that it is "likely" that the enemy will choose avenue
A. How does this intelligence inform the commander's decision about how to defend the
objective?

For the sake of argument, let's assume that the company commander interprets the word
"likely" as meaning "about 60%". Does this mean that the company commander should
allocate about 60% of his forces to defending Avenue A and 40% to defending Avenue
B? That is a solution but there are many, many ways in which such a decision would
make no sense at all.

The worst case scenario for the company commander, however, is if he only has enough
forces to adequately cover one of the two avenues of approach. In this case, diverting
any forces at all will guarantee failure.

Assuming an accurate analytic process and all other things being equal (and I can do
that because this is a thought experiment), the commander should align his forces along
Avenue A in this worst case situation. This gives him the best chance of stopping the
enemy forces. This decisionmaker, with his limited forces, is essentially forced by the
situation to treat a 60% probability as 100% accurate for planning purposes. Since many
decisions are (or appear to be to the decisionmaker) of this type, it is no wonder that
decisionmakers, when they evaluate intelligence, tend to focus on the ability to
discriminate between possible outcomes over the ability to calibrate the estimative
probabilities.
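The commander's worst case can be sketched numerically. All the numbers, and the all-or-nothing assumption that a partly covered avenue always falls, come from the thought experiment above.

```python
# The worst case: only enough forces to hold ONE avenue adequately.
# Assumption (from the thought experiment): a fully covered avenue is
# held, a partly covered avenue always falls.
P_AVENUE_A = 0.6   # the commander's reading of "likely"

def p_success(share_on_a):
    """Probability of stopping the enemy for a given share of the
    (insufficient) force placed on Avenue A."""
    if share_on_a == 1.0:   # concentrate everything on A
        return P_AVENUE_A
    if share_on_a == 0.0:   # concentrate everything on B
        return 1 - P_AVENUE_A
    return 0.0              # any split leaves both avenues too weak

for share in (1.0, 0.6, 0.0):
    print(f"share on A = {share}: P(success) = {p_success(share)}")
```

Concentrating on Avenue A (0.6) dominates both the 60/40 split (0.0) and concentrating on B (0.4), which is exactly why this commander is rationally forced to treat a 60% estimate as certain for planning purposes.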

The Iraq WMD Estimate And Other Pre-War Iraq Assessments

Applying all these principles to a real world case is difficult but not impossible – at least
in part. Of the many estimates and documents made public regarding the war in Iraq,
three seem close enough in time, space, content and authorship to serve as a test case.

Perhaps the most famous document leading up to the war in Iraq is the much-maligned
National Intelligence Estimate (NIE) titled Iraq's Continuing Programs for Weapons Of
Mass Destruction completed in October, 2002 and made public (in part) in April, 2004.
Subjected to extensive scrutiny by the Commission on the Intelligence Capabilities of
the United States Regarding Weapons of Mass Destruction, this NIE was judged "dead
wrong" in almost all of its major estimative conclusions (i.e. in the language of this
paper, the product was false).

Far less well known are two Intelligence Community Assessments (ICA), both
completed in January, 2003. The first, Regional Consequences of Regime Change in
Iraq, was made public in April, 2007 as was the second ICA, Principal Challenges in
Post-Saddam Iraq. Both documents were part of the US Senate Select Committee on
Intelligence report on Pre-War Intelligence Assessments about Post War Iraq and
both (heavily redacted) documents are available as appendices to the committee's
final report.

The difference between an NIE and an ICA seems modest to an outsider. Both types of
documents are produced by the National Intelligence Council and both are coordinated
within the US national security intelligence community and, if appropriate, with cleared
experts outside the community. The principal differences appear to be the degree of
high level approval (NIEs are approved at a higher level than ICAs) and the intended
audiences (NIEs are aimed at high level policy makers while ICAs are geared more to
the desk-analyst policy level).

In this case, there appears to be at least some overlap in the actual drafters of the three
documents. Paul Pillar, National Intelligence Officer (NIO) for the Near East and South
Asia at the time, was primarily responsible for coordinating (and, presumably, drafting)
both of the ICAs. Pillar also assisted Robert D. Walpole, NIO for Strategic and Nuclear
Programs, in the preparation of the NIE (along with Lawrence K. Gershwin, NIO for
Science and Technology and Major General John R. Landry, NIO for Conventional
Military Issues).

Despite the differences in the purposes of these documents, it is likely safe to say that
the fundamental analytic processes -- the tradecraft and evaluative norms -- were largely
the same. It is highly unlikely, for example, that standards such as "timeliness" and
"objectivity" were maintained in NIEs but abandoned in ICAs.

Why is this important? As discussed in detail earlier in this paper, it is important, in
evaluating intelligence, to cast as broad a net as possible: to look not only at examples
where the intelligence product was false but also at cases where the intelligence product
was true and, in turn, to examine the process in both cases to determine if the analysts
were good or just lucky, or bad or just unlucky. These three documents, prepared at
roughly the same time, under roughly the same conditions, with roughly the same
resources, on roughly the same target, allow the accuracy of their estimative conclusions
to be compared with some assurance that doing so may help get at any underlying flaws
or successes in the analytic process.

Batting Averages

Despite good reasons to believe that the findings of the Iraq WMD National Intelligence
Estimate (NIE) and the two pre-war Intelligence Community Assessments (ICAs)
regarding Iraq can be evaluated as a group in order to tease out insights into the quality
of the analytic processes used to produce these products, several problems remain
before we can determine the "batting average".

• Assumptions vs. Descriptive Intelligence: The NIE drew its estimative
conclusions from what the authors believed were the facts based on an analysis
of the information collected about Saddam Hussein's WMD programs. Much of
this descriptive intelligence (i.e. that information which was not proven but
clearly taken as factual for purposes of the estimative parts of the NIE) turned
out to be false. The ICAs, however, are largely based on a series of assumptions
either explicitly or implicitly articulated in the scope notes to those two
documents. This analysis, therefore, will only focus on the estimative
conclusions of the three documents and not on the underlying facts.

• Descriptive Intelligence vs. Estimative Intelligence: Good analytic tradecraft has
always required analysts to clearly distinguish estimative conclusions from the
direct and indirect information that supports those estimative conclusions. The
inconsistencies in the estimative language along with the grammatical structure
of some of the findings make this particularly difficult. For example, the Iraq
NIE found: "An array of clandestine reporting reveals that Baghdad has
procured covertly the types and quantities of chemicals and equipment sufficient
to allow limited CW agent production hidden in Iraq's legitimate chemical
industry." Clearly the information gathered suggested that the Iraqis had procured
the chemicals. What is not as clear is whether they were likely using them for
limited CW production or whether they merely could use these chemicals for such
purposes. A strict constructionist would argue for the latter interpretation
whereas the overall context of the Key Judgments would suggest the former. I
have elected to focus on the context to determine which statements are
estimative in nature. This inserts an element of subjectivity into my analysis and
may skew the results.

• Discriminative vs. Calibrative Estimates: The language of the documents uses
both discriminative ("Baghdad is reconstituting its nuclear weapons program")
and calibrative language ("Saddam probably has stocked at least 100 metric tons
... of CW agents"). Given the seriousness of the situation in the US at that time,
the purposes for which these documents were to be used, and the discussion of
the decisionmaker’s perspective earlier, I have elected to treat calibrative
estimates as discriminative for purposes of evaluation.

• Overly Broad Estimative Conclusions: Overly broad estimates are easy to spot.
Typically these statements use highly speculative verbs such as "might" or
"could". A good example of such a statement is the claim: "Baghdad's UAVs
could threaten Iraq's neighbors, US forces in the Persian Gulf, and if brought
close to, or into, the United States, the US homeland." Such alarmism seems
silly today but it should have been seen as silly at the time as well. From a
theoretical perspective, these types of statements tell the decisionmaker nothing
useful (anything "could" happen; everything is "possible"). One option, then, is
to mark these statements as meaningless and eliminate them from consideration.
This, in my mind, would only encourage the bad practice, so I intend to count these
kinds of statements as false if they turned out to have no basis in fact (I would,
under the same logic, have to count them as true if they turned out to be true, of
course).

• Weight of the Estimative Conclusion: Some estimates are clearly more
fundamental to a report than others. Conclusions regarding direct threats to US
soldiers from regional actors, for example, should trump any minor and indirect
consequences regarding regional instability identified in the reports. Engaging in
such an exercise might be something appropriate for individuals directly
involved in this process and in a better position to evaluate these weights. I, on
the other hand, am looking only for the broadest possible patterns (if any) in the
data. I have, therefore, decided to weight all estimative conclusions equally.

• Dealing with Dissent: There were several dissents in the Iraq NIE. While the
majority opinion is, in some sense, the final word on the matter, an analytic
process that tolerates formal dissent deserves some credit as well. Going simply
with the majority opinion does not accomplish this. Likewise, eliminating the
dissented opinion from consideration gives too much credit to the process. I
have chosen to count those estimative conclusions with dissents as both true and
false (for scoring purposes only).

Clearly, given the caveats and conditions under which I am attempting this analysis, I
am looking only for broad patterns of analytic activity. My intent is not to spend hours
quibbling about all of the various ways a particular judgment could be interpreted as
true or false after the fact. It is merely to make the case that evaluating intelligence
is difficult but that, even with those difficulties firmly in mind, it is possible to go
back and, if we look at a broad enough swath of analysis, come to some interesting
conclusions about the process.
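The scoring rules above can be sketched as a small tally function. This is a hypothetical illustration of the method only, not the author's actual worksheet; the `Judgment` class and its fields are my own invention for the sake of the sketch:

```python
from dataclasses import dataclass

@dataclass
class Judgment:
    correct: bool          # did the estimative conclusion turn out to be true?
    dissent: bool = False  # was a formal dissent registered against it?

def batting_average(judgments):
    """Tally under the rules above: calibrative statements are scored as if
    discriminative, all conclusions are weighted equally, and a conclusion
    carrying a formal dissent is counted once as true AND once as false."""
    true_count = false_count = 0
    for j in judgments:
        if j.dissent:
            true_count += 1   # credit the dissenting view...
            false_count += 1  # ...and debit the majority view (or vice versa)
        elif j.correct:
            true_count += 1
        else:
            false_count += 1
    total = true_count + false_count
    return true_count / total, false_count / total
```

Note that, under this scheme, a dissented conclusion inflates the denominator by one, which slightly dampens the influence of every other judgment on the final percentage.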

Within these limits, then, by my count, the Iraq NIE contained 28 (85%) false
estimative conclusions and 5 (15%) true ones. This analysis tracks quite well with the
WMD Commission's own evaluation that the NIE was incorrect in "almost all of its pre-
war judgments about Iraq's weapons of mass destruction." By my count, the Regional
Consequences of Regime Change in Iraq ICA fares much better, with 23 (96%) correct
estimative conclusions and only one (4%) incorrect one. Finally, the report on the
Principal Challenges in Post-Saddam Iraq nets 15 (79%) correct analytic estimates to
4 (21%) incorrect ones. My conclusions are certainly consistent with the
tone of the Senate Subcommittee Report.

It is noteworthy that the Senate Subcommittee did not go to the same pains to
compliment analysts on their fairly accurate reporting in the ICAs as the WMD
Commission did to pillory the NIE. Likewise, there was no call from Congress
to ensure that the process involved in creating the NIE was reconciled with the
process used to create the ICAs, no laws proposed to take advantage of this
largely accurate work, no restructuring of the US national intelligence
community to ensure that the good analytic processes demonstrated in these
ICAs would dominate the future of intelligence analysis.

The most interesting number, however, is the combined score for the three documents.
Out of the 76 estimative conclusions made in the three reports, 43 (57%) were correct
and 33 (43%) incorrect. Is this a good score or a bad score? Such a result is likely
much better than mere chance. For each judgment made, there were likely many
reasonable hypotheses under consideration; if there were only three reasonable
hypotheses in each case, the base rate would be 33%. On average, then, the analysts
involved nearly doubled that "batting average".
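The comparison with chance can be checked with a quick calculation. Treating the 76 judgments as independent trials with a one-in-three base rate is a simplifying assumption of my own, not a claim made in the reports themselves:

```python
from math import comb

correct, total = 43, 76
observed = correct / total  # the combined "batting average"
base_rate = 1 / 3           # chance, given three reasonable hypotheses per judgment

# Exact binomial tail: the probability of getting at least 43 of 76 judgments
# right by guessing alone at the 33% base rate.
p_by_chance = sum(
    comb(total, k) * base_rate**k * (1 - base_rate)**(total - k)
    for k in range(correct, total + 1)
)

print(round(observed, 2))              # 0.57
print(round(observed / base_rate, 2))  # 1.7 -- nearly double the base rate
```

Under these assumptions the tail probability is vanishingly small (well under one in a thousand), which supports the claim that the combined record is much better than mere chance.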

Likewise, it is consistent with both hard and anecdotal data on historical trends in
analytic forecasting. Mike Lyden, in his thesis on Accelerated Analysis, calculated
that, historically, US national security intelligence community estimates have been
correct approximately two-thirds of the time.

Former Director of the CIA, GEN Michael Hayden, made his own estimate of analytic
accuracy in May of last year: "Some months ago, I met with a small group of
investment bankers and one of them asked me, 'On a scale of 1 to 10, how good is our
intelligence today?' I said the first thing to understand is that anything above 7 isn't on
our scale. If we're at 8, 9, or 10, we're not in the realm of intelligence—no one is asking
us the questions that can yield such confidence. We only get the hard sliders on the
corner of the plate."

Given these standards, 57%, while a bit low by historical measures, certainly seems to
be within normal limits and, even more importantly, consistent with what the
intelligence community’s senior leadership expects from its analysts.

Final Thoughts

The purpose of this article was not to rationalize away, in a frenzy of legalese, the
obvious failings of the Iraq WMD NIE. Under significant time pressure and operating
with what they admitted was limited information on key questions, the authors failed to
check their assumptions and saw all of the evidence as confirming an existing
conceptual framework. (While this conceptual framework was shared by virtually
everyone else, the authors do not get a free pass on this either: testing assumptions
and understanding the dangers of overly rigid conceptual models is Intel Analysis
101.)

On the other hand, if the focus of inquiry is broadened just a bit, to include the two
ICAs about Iraq completed by at least some of the same people, using many of the same
processes, the picture becomes much brighter. When evaluators consider the three
documents together, the analysts track pretty well with historical norms and
leadership expectations. Like the good weatherman discussed earlier, these analysts
are hard to fault for getting it "wrong".

Moreover, the failure by evaluators to look at intelligence successes as well as
intelligence failures and to examine them for where the analysts were actually good or
bad (vs. where the analysts were merely lucky or unlucky) is a recipe for turmoil.
Imagine a football coach who only watched game film when the team lost and ignored
lessons from when the team won. This is clearly stupid but it is very close to what
happens to the intelligence community every 5 to 10 years. From the Hoover
Commission to today, so-called intelligence failures get investigated while intelligence
successes get, well, nothing.

The intelligence community has, in the past, done itself no favors for when the
investigations inevitably come, however. The lack of clarity and consistency in the
estimative language used in these documents, coupled with the lack of an auditable
process, made coming to any sort of conclusion about the veracity of product or process far more
difficult than it needed to be. While I do not expect that other investigators would come
to startlingly different conclusions than mine, I would expect there to be areas where we
would disagree -- perhaps strongly -- due to different interpretations of the same
language. This is not in the intelligence community's interest as it creates the impression
that intelligence is merely an "opinion" or, even worse, that the analysts are "just guessing."

Finally, there appears to be one more lesson to be learned from an examination of these
three documents. Beyond the scope of evaluating intelligence, it goes to the heart of
what intelligence is and what role it serves in a policy debate.

In the days before the vote to go to war, the Iraq NIE clearly answered the question it
had been asked, albeit in a predictable way (so predictable, in fact, that few in
Washington bothered to read it). The Iraq ICAs, on the other hand, came out in January
2003, two months before the start of the war. They were generated in response to a
request from the Director of Policy Planning at the State Department and were intended,
as are all ICAs, for lower level policymakers. These reports quite accurately -- as it
turns out -- predicted the tremendous difficulties should the eventual solution (of the
several available to the policymakers at the time) to the problem of Saddam Hussein's
presumed WMDs be war.

What if all three documents had come out at the same time and had all been NIEs?
There does not appear to be, from the record, any reason why they could not have been
issued simultaneously. The Senate Subcommittee states on page 2 of its report that there
was no special collection involved in the ICAs, that it was "not an issue well-suited to
intelligence collection." The report went on to state, "Analysts based their judgments
primarily on regional and country expertise, historical evidence and," significantly, in
light of this paper, "analytic tradecraft." In short, open sources and sound analytic
processes. Time was of the essence, of course, but it is clear from the record that the
information necessary to write the reports was already in the analysts’ heads.

It is hard to imagine that such a trio of documents would not have significantly altered
the debate in Washington. The outcome might still have been war, but the ability of
policymakers to dodge their fair share of the blame would have been severely limited.
In the end, it is perhaps the best practice for intelligence to answer not only those
questions it is asked but also those questions it should have been asked.