

Building Watson: An Overview of the DeepQA Project

David Ferrucci, Eric Brown, Jennifer Chu-Carroll, James Fan, David Gondek, Aditya A. Kalyanpur, Adam Lally, J. William Murdock, Eric Nyberg, John Prager, Nico Schlaefer, and Chris Welty

Copyright © 2010, Association for the Advancement of Artificial Intelligence. All rights reserved. ISSN 0738-4602

IBM Research undertook a challenge to build a computer system that could compete at the human champion level in real time on the American TV quiz show, Jeopardy. The extent of the challenge includes fielding a real-time automatic contestant on the show, not merely a laboratory exercise. The Jeopardy Challenge helped us address requirements that led to the design of the DeepQA architecture and the implementation of Watson. After three years of intense research and development by a core team of about 20 researchers, Watson is performing at human expert levels in terms of precision, confidence, and speed at the Jeopardy quiz show. Our results strongly suggest that DeepQA is an effective and extensible architecture that can be used as a foundation for combining, deploying, evaluating, and advancing a wide range of algorithmic techniques to rapidly advance the field of question answering (QA).

The goals of IBM Research are to advance computer science by exploring new ways for computer technology to affect science, business, and society. Roughly three years ago, IBM Research was looking for a major research challenge to rival the scientific and popular interest of Deep Blue, the computer chess-playing champion (Hsu 2002), that also would have clear relevance to IBM business interests.

With a wealth of enterprise-critical information being captured in natural language documentation of all forms, the problems with perusing only the top 10 or 20 most popular documents containing the user’s two or three key words are becoming increasingly apparent. This is especially the case in the enterprise where popularity is not as important an indicator of relevance and where recall can be as critical as precision. There is growing interest in having enterprise computer systems deeply analyze the breadth of relevant content to more precisely answer and justify answers to users’ natural language questions. We believe advances in question-answering (QA) technology can help support professionals in critical and timely decision making in areas like compliance, health care, business integrity, business intelligence, knowledge discovery, enterprise knowledge management, security, and customer support.
For researchers, the open-domain QA problem is attractive as it is one of the most challenging in the realm of computer science and artificial intelligence, requiring a synthesis of information retrieval, natural language processing, knowledge representation and reasoning, machine learning, and computer-human interfaces. It has had a long history (Simmons 1970) and saw rapid advancement spurred by system building, experimentation, and government funding in the past decade (Maybury 2004, Strzalkowski and Harabagiu 2006).

With QA in mind, we settled on a challenge to build a computer system, called Watson,1 which could compete at the human champion level in real time on the American TV quiz show, Jeopardy. The extent of the challenge includes fielding a real-time automatic contestant on the show, not merely a laboratory exercise.

Jeopardy! is a well-known TV quiz show that has been airing on television in the United States for more than 25 years (see the Jeopardy! Quiz Show sidebar for more information on the show). It pits three human contestants against one another in a competition that requires answering rich natural language questions over a very broad domain of topics, with penalties for wrong answers. The nature of the three-person competition is such that confidence, precision, and answering speed are of critical importance, with roughly 3 seconds to answer each question. A computer system that could compete at human champion levels at this game would need to produce exact answers to often complex natural language questions with high precision and speed and have a reliable confidence in its answers, such that it could answer roughly 70 percent of the questions asked with greater than 80 percent precision in 3 seconds or less.

Finally, the Jeopardy Challenge represents a unique and compelling AI question similar to the one underlying Deep Blue (Hsu 2002) — can a computer system be designed to compete against the best humans at a task thought to require high levels of human intelligence, and if so, what kind of technology, algorithms, and engineering is required? While we believe the Jeopardy Challenge is an extraordinarily demanding task that will greatly advance the field, we appreciate that this challenge alone does not address all aspects of QA and does not by any means close the book on the QA challenge the way that Deep Blue may have for playing chess.

The Jeopardy Challenge

Meeting the Jeopardy Challenge requires advancing and incorporating a variety of QA technologies including parsing, question classification, question decomposition, automatic source acquisition and evaluation, entity and relation detection, logical form generation, and knowledge representation and reasoning.

Winning at Jeopardy requires accurately computing confidence in your answers. The questions and content are ambiguous and noisy and none of the individual algorithms are perfect. Therefore, each component must produce a confidence in its output, and individual component confidences must be combined to compute the overall confidence of the final answer. The final confidence is used to determine whether the computer system should risk choosing to answer at all. In Jeopardy parlance, this confidence is used to determine whether the computer will “ring in” or “buzz in” for a question. The confidence must be computed during the time the question is read and before the opportunity to buzz in. This is roughly between 1 and 6 seconds with an average around 3 seconds.

Confidence estimation was very critical to shaping our overall approach in DeepQA. There is no expectation that any component in the system does a perfect job — all components post features of the computation and associated confidences, and we use a hierarchical machine-learning method to combine all these features and decide whether or not there is enough confidence in the final answer to attempt to buzz in and risk getting the question wrong.

In this section we elaborate on the various aspects of the Jeopardy Challenge.

The Categories

A 30-clue Jeopardy board is organized into six columns. Each column contains five clues and is associated with a category. Categories range from broad subject headings like “history,” “science,” or “politics” to less informative puns like “tutu much,” in which the clues are about ballet, to actual parts of the clue, like “who appointed me to the Supreme Court?” where the clue is the name of a judge, to “anything goes” categories like “potpourri.” Clearly some categories are essential to understanding the clue, some are helpful but not necessary, and some may be useless, if not misleading, for a computer.

A recurring theme in our approach is the requirement to try many alternate hypotheses in varying contexts to see which produces the most confident answers given a broad range of loosely coupled scoring algorithms. Leveraging category information is another clear area requiring this approach.

The Questions

There are a wide variety of ways one can attempt to characterize the Jeopardy clues. For example, by topic, by difficulty, by grammatical construction, by answer type, and so on. A type of classification that turned out to be useful for us was based on the primary method deployed to solve the clue.
The Jeopardy! Quiz Show

The Jeopardy! quiz show is a well-known syndicated U.S. TV quiz show that has been on the air since 1984. It features rich natural language questions covering a broad range of general knowledge. It is widely recognized as an entertaining game requiring smart, knowledgeable, and quick players.

The show’s format pits three human contestants against each other in a three-round contest of knowledge, confidence, and speed. All contestants must pass a 50-question qualifying test to be eligible to play. The first two rounds of a game use a grid organized into six columns, each with a category label, and five rows with increasing dollar values. The illustration shows a sample board for a first round. In the second round, the dollar values are doubled. Initially all the clues in the grid are hidden behind their dollar values. The game play begins with the returning champion selecting a cell on the grid by naming the category and the dollar value. For example the player may select by saying “Technology for $400.”

The clue under the selected cell is revealed to all the players and the host reads it out loud. Each player is equipped with a hand-held signaling button. As soon as the host finishes reading the clue, a light becomes visible around the board, indicating to the players that their hand-held devices are enabled and they are free to signal or “buzz in” for a chance to respond. If a player signals before the light comes on, then he or she is locked out for one-half of a second before being able to buzz in again.

The first player to successfully buzz in gets a chance to respond to the clue. That is, the player must answer the question, but the response must be in the form of a question. For example, validly formed responses are, “Who is Ulysses S. Grant?” or “What is The Tempest?” rather than simply “Ulysses S. Grant” or “The Tempest.” The Jeopardy quiz show was conceived to have the host providing the answer or clue and the players responding with the corresponding question or response. The clue/response concept represents an entertaining twist on classic question answering. Jeopardy clues are straightforward assertional forms of questions. So where a question might read, “What drug has been shown to relieve the symptoms of ADD with relatively few side effects?” the corresponding Jeopardy clue might read “This drug has been shown to relieve the symptoms of ADD with relatively few side effects.” The correct Jeopardy response would be “What is Ritalin?”

Players have 5 seconds to speak their response, but it’s typical that they answer almost immediately since they often only buzz in if they already know the answer. If a player responds to a clue correctly, then the dollar value of the clue is added to the player’s total earnings, and that player selects another cell on the board. If the player responds incorrectly then the dollar value is deducted from the total earnings, and the system is rearmed, allowing the other players to buzz in. This makes it important for players to know what they know — to have accurate confidences in their responses.

There is always one cell in the first round and two in the second round called Daily Doubles, whose exact location is hidden until the cell is selected by a player. For these cases, the selecting player does not have to compete for the buzzer but must respond to the clue regardless of the player’s confidence. In addition, before the clue is revealed the player must wager a portion of his or her earnings. The minimum bet is $5 and the maximum bet is the larger of the player’s current score and the maximum clue value on the board. If players answer correctly, they earn the amount they bet, else they lose it.

The Final Jeopardy round consists of a single question and is played differently. First, a category is revealed. The players privately write down their bet — an amount less than or equal to their total earnings. Then the clue is revealed. They have 30 seconds to respond. At the end of the 30 seconds they reveal their answers and then their bets. The player with the most money at the end of this third round wins the game. The questions used in this round are typically more difficult than those used in the previous rounds.
The bulk of Jeopardy clues represent what we would consider factoid questions — questions whose answers are based on factual information about one or more individual entities. The questions themselves present challenges in determining what exactly is being asked for and which elements of the clue are relevant in determining the answer. Here are just a few examples (note that while the Jeopardy! game requires that answers are delivered in the form of a question (see the Jeopardy! Quiz Show sidebar), this transformation is trivial and for purposes of this paper we will just show the answers themselves):

Category: General Science
Clue: When hit by electrons, a phosphor gives off electromagnetic energy in this form.
Answer: Light (or Photons)

Category: Lincoln Blogs
Clue: Secretary Chase just submitted this to me for the third time; guess what, pal. This time I’m accepting it.
Answer: his resignation

Category: Head North
Clue: They’re the two states you could be reentering if you’re crossing Florida’s northern border.
Answer: Georgia and Alabama

Decomposition. Some more complex clues contain multiple facts about the answer, all of which are required to arrive at the correct response but are unlikely to occur together in one place. For example:

Category: “Rap” Sheet
Clue: This archaic term for a mischievous or annoying child can also mean a rogue or scamp.
Subclue 1: This archaic term for a mischievous or annoying child.
Subclue 2: This term can also mean a rogue or scamp.
Answer: Rapscallion

In this case, we would not expect to find both “subclues” in one sentence in our sources; rather, if we decompose the question into these two parts and ask for answers to each one, we may find that the answer common to both questions is the answer to the original clue.

Another class of decomposable questions is one in which a subclue is nested in the outer clue, and the subclue can be replaced with its answer to form a new question that can more easily be answered. For example:

Category: Diplomatic Relations
Clue: Of the four countries in the world that the United States does not have diplomatic relations with, the one that’s farthest north.
Inner subclue: The four countries in the world that the United States does not have diplomatic relations with (Bhutan, Cuba, Iran, North Korea).
Outer subclue: Of Bhutan, Cuba, Iran, and North Korea, the one that’s farthest north.
Answer: North Korea

Decomposable Jeopardy clues generated requirements that drove the design of DeepQA to generate zero or more decomposition hypotheses for each question as possible interpretations.
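The parallel decomposition idea can be made concrete with a small sketch. This is not Watson's implementation; it assumes a hypothetical answer_candidates function (stubbed here with canned values for the “Rap” Sheet subclues) and simply favors candidates that are proposed for every subclue:

```python
from collections import defaultdict

def answer_candidates(clue):
    """Hypothetical single-clue QA call returning {candidate: confidence}.
    Stubbed with canned values for the "Rap" Sheet subclues."""
    canned = {
        "This archaic term for a mischievous or annoying child.":
            {"rapscallion": 0.40, "urchin": 0.55, "imp": 0.50},
        "This term can also mean a rogue or scamp.":
            {"rapscallion": 0.45, "rascal": 0.60, "knave": 0.35},
    }
    return canned.get(clue, {})

def merge_parallel(subclues):
    """Combine candidates across subclues; answers supported by every
    subclue rank highest (a stand-in for DeepQA's customizable answer
    combination component)."""
    total, support = defaultdict(float), defaultdict(int)
    for sub in subclues:
        for cand, conf in answer_candidates(sub).items():
            total[cand] += conf
            support[cand] += 1
    return sorted(total, key=lambda c: (support[c], total[c]), reverse=True)

subclues = ["This archaic term for a mischievous or annoying child.",
            "This term can also mean a rogue or scamp."]
print(merge_parallel(subclues)[0])   # -> rapscallion
```

Nested decomposition, by contrast, would substitute the inner subclue's answer back into the outer clue before asking again, as in the Diplomatic Relations example above.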
Puzzles. Jeopardy also has categories of questions that require special processing defined by the category itself. Some of them recur often enough that contestants know what they mean without instruction; for others, part of the task is to figure out what the puzzle is as the clues and answers are revealed (categories requiring explanation by the host are not part of the challenge). Examples of well-known puzzle categories are the Before and After category, where two subclues have answers that overlap by (typically) one word, and the Rhyme Time category, where the two subclue answers must rhyme with one another. Clearly these cases also require question decomposition. For example:

Category: Before and After Goes to the Movies
Clue: Film of a typical day in the life of the Beatles, which includes running from bloodthirsty zombie fans in a Romero classic.
Subclue 1: Film of a typical day in the life of the Beatles.
Answer 1: (A Hard Day’s Night)
Subclue 2: Running from bloodthirsty zombie fans in a Romero classic.
Answer 2: (Night of the Living Dead)
Answer: A Hard Day’s Night of the Living Dead

Category: Rhyme Time
Clue: It’s where Pele stores his ball.
Subclue 1: Pele ball (soccer)
Subclue 2: where store (cabinet, drawer, locker, and so on)
Answer: soccer locker

There are many infrequent types of puzzle categories including things like converting roman numerals, solving math word problems, sounds like, finding which word in a set has the highest Scrabble score, homonyms and heteronyms, and so on. Puzzles constitute only about 2–3 percent of all clues, but since they typically occur as entire categories (five at a time) they cannot be ignored for success in the Challenge as getting them all wrong often means losing a game.

Excluded Question Types. The Jeopardy quiz show ordinarily admits two kinds of questions that IBM and Jeopardy Productions, Inc., agreed to exclude from the computer contest: audiovisual (A/V) questions and Special Instructions questions. A/V questions require listening to or watching some sort of audio, image, or video segment to determine a correct answer. For example:

Category: Picture This
(Contestants are shown a picture of a B-52 bomber)
Clue: Alphanumeric name of the fearsome machine seen here.
Answer: B-52
Figure 1. Lexical Answer Type Frequency. (Bar charts plot the relative frequency, from 0 to 12 percent, of the 40 most frequent and the 200 most frequent LATs; labels include he, country, city, man, film, state, author, title, and NA for clues with no explicit type.)

Special instruction questions are those that are not “self-explanatory” but rather require a verbal explanation describing how the question should be interpreted and solved. For example:

Category: Decode the Postal Codes
Verbal instruction from host: We’re going to give you a word comprising two postal abbreviations; you have to identify the states.
Clue: Vain
Answer: Virginia and Indiana

Both present very interesting challenges from an AI perspective but were put out of scope for this contest and evaluation.

The Domain

As a measure of the Jeopardy Challenge’s breadth of domain, we analyzed a random sample of 20,000 questions extracting the lexical answer type (LAT) when present. We define a LAT to be a word in the clue that indicates the type of the answer, independent of assigning semantics to that word. For example in the following clue, the LAT is the string “maneuver.”

Category: Oooh….Chess
Clue: Invented in the 1500s to speed up the game, this maneuver involves two pieces of the same color.
Answer: Castling

About 12 percent of the clues do not indicate an explicit lexical answer type but may refer to the answer with pronouns like “it,” “these,” or “this” or not refer to it at all. In these cases the type of the answer must be inferred by the context. Here’s an example:

Category: Decorating
Clue: Though it sounds “harsh,” it’s just embroidery, often in a floral pattern, done with yarn on cotton cloth.
Answer: crewel

The distribution of LATs has a very long tail, as shown in figure 1. We found 2500 distinct and explicit LATs in the 20,000 question sample. The most frequent 200 explicit LATs cover less than 50 percent of the data. Figure 1 shows the relative frequency of the LATs. It labels all the clues with no explicit type with the label “NA.” This aspect of the challenge implies that while task-specific type systems or manually curated data would have some impact if focused on the head of the LAT curve, it still leaves more than half the problems unaccounted for. Our clear technical bias for both business and scientific motivations is to create general-purpose, reusable natural language processing (NLP) and knowledge representation and reasoning (KRR) technology that can exploit as-is natural language resources and as-is structured knowledge rather than to curate task-specific knowledge resources.
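The head-versus-tail arithmetic behind figure 1 is easy to reproduce. A minimal sketch, assuming a hypothetical per-question list of LAT strings (None for the roughly 12 percent of clues with no explicit LAT):

```python
from collections import Counter

def lat_coverage(lats, head_size=200):
    """Fraction of questions whose LAT falls in the head of the distribution.
    lats holds one lexical answer type per question, or None if the clue
    has no explicit LAT."""
    counts = Counter(lat if lat is not None else "NA" for lat in lats)
    head = [lat for lat, _ in counts.most_common(head_size) if lat != "NA"]
    return sum(counts[lat] for lat in head) / len(lats)

# Toy sample; on the paper's 20,000-question sample, roughly 2500 distinct
# explicit LATs were found and the 200 most frequent covered less than half.
sample = ["country", "city", "he", "film", None, "country", "star", None]
print(f"head coverage: {lat_coverage(sample, head_size=3):.0%}")
```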
The Metrics

In addition to question-answering precision, the system’s game-winning performance will depend on speed, confidence estimation, clue selection, and betting strategy.
Figure 2. Precision Versus Percentage Attempted. Perfect confidence estimation (upper line) and no confidence estimation (lower line). (Precision is plotted against percent answered, both ranging from 0 to 100 percent.)

Ultimately the outcome of the public contest will be decided based on whether or not Watson can win one or two games against top-ranked humans in real time. The highest amount of money earned by the end of a one- or two-game match determines the winner. A player’s final earnings, however, often will not reflect how well the player did during the game at the QA task. This is because a player may decide to bet big on Daily Double or Final Jeopardy questions. There are three hidden Daily Double questions in a game that can affect only the player lucky enough to find them, and one Final Jeopardy question at the end that all players must gamble on. Daily Double and Final Jeopardy questions represent significant events where players may risk all their current earnings. While potentially compelling for a public contest, a small number of games does not represent statistically meaningful results for the system’s raw QA performance.

While Watson is equipped with betting strategies necessary for playing full Jeopardy, from a core QA perspective we want to measure correctness, confidence, and speed, without considering clue selection, luck of the draw, and betting strategies. We measure correctness and confidence using precision and percent answered. Precision measures the percentage of questions the system gets right out of those it chooses to answer. Percent answered is the percentage of questions it chooses to answer (correctly or incorrectly). The system chooses which questions to answer based on an estimated confidence score: for a given threshold, the system will answer all questions with confidence scores above that threshold. The threshold controls the trade-off between precision and percent answered, assuming reasonable confidence estimation. For higher thresholds the system will be more conservative, answering fewer questions with higher precision. For lower thresholds, it will be more aggressive, answering more questions with lower precision. Accuracy refers to the precision if all questions are answered.

Figure 2 shows a plot of precision versus percent attempted curves for two theoretical systems. It is obtained by evaluating the two systems over a range of confidence thresholds. Both systems have 40 percent accuracy, meaning they get 40 percent of all questions correct. They differ only in their confidence estimation. The upper line represents an ideal system with perfect confidence estimation. Such a system would identify exactly which questions it gets right and wrong and give higher confidence to those it got right. As can be seen in the graph, if such a system were to answer the 50 percent of questions it had highest confidence for, it would get 80 percent of those correct.
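Curves like those in figure 2 can be traced with a few lines of code. A minimal sketch, assuming hypothetical per-question (confidence, correct) pairs; because the synthetic confidences below carry no information about correctness, the sweep reproduces the flat lower line:

```python
import random

def precision_at_threshold(results, threshold):
    """results: (confidence, correct) pairs, one per question. Returns
    (percent answered, precision) when only questions at or above the
    confidence threshold are attempted."""
    answered = [correct for conf, correct in results if conf >= threshold]
    if not answered:
        return 0.0, None
    return len(answered) / len(results), sum(answered) / len(answered)

# Synthetic system with roughly 40 percent accuracy and uninformative confidences.
random.seed(0)
results = [(random.random(), random.random() < 0.4) for _ in range(1000)]
for t in (0.9, 0.7, 0.5, 0.0):
    pct, prec = precision_at_threshold(results, t)
    print(f"threshold {t:.1f}: answered {pct:.0%}, precision {prec:.0%}")
```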
Figure 3. Champion Human Performance at Jeopardy. (Precision versus percent answered, both ranging from 0 to 100 percent.)

We refer to this level of performance as 80 percent precision at 50 percent answered. The lower line represents a system without meaningful confidence estimation. Since it cannot distinguish between which questions it is more or less likely to get correct, its precision is constant for all percent attempted. Developing more accurate confidence estimation means a system can deliver far higher precision even with the same overall accuracy.

The Competition: Human Champion Performance

A compelling and scientifically appealing aspect of the Jeopardy Challenge is the human reference point. Figure 3 contains a graph that illustrates expert human performance on Jeopardy. It is based on our analysis of nearly 2000 historical Jeopardy games. Each point on the graph represents the performance of the winner in one Jeopardy game.2 As in figure 2, the x-axis of the graph, labeled “% Answered,” represents the percentage of questions the winner answered, and the y-axis of the graph, labeled “Precision,” represents the percentage of those questions the winner answered correctly.

In contrast to the system evaluation shown in figure 2, which can display a curve over a range of confidence thresholds, the human performance shows only a single point per game based on the observed precision and percent answered the winner demonstrated in the game. A further distinction is that in these historical games the human contestants did not have the liberty to answer all questions they wished. Rather the percent answered consists of those questions for which the winner was confident and fast enough to beat the competition to the buzz. The system performance graphs shown in this paper are focused on evaluating QA performance, and so do not take into account competition for the buzz. Human performance helps to position our system’s performance, but obviously, in a Jeopardy game, performance will be affected by competition for the buzz and this will depend in large part on how quickly a player can compute an accurate confidence and how the player manages risk.
The center of what we call the “Winners Cloud” (the set of light gray dots in the graph in figures 3 and 4) reveals that Jeopardy champions are confident and fast enough to acquire on average between 40 percent and 50 percent of all the questions from their competitors and to perform with between 85 percent and 95 percent precision.

The darker dots on the graph represent Ken Jennings’s games. Ken Jennings had an unequaled winning streak in 2004, in which he won 74 games in a row. Based on our analysis of those games, he acquired on average 62 percent of the questions and answered with 92 percent precision. Human performance at this task sets a very high bar for precision, confidence, speed, and breadth.

Baseline Performance

Our metrics and baselines are intended to give us confidence that new methods and algorithms are improving the system or to inform us when they are not so that we can adjust research priorities. Our most obvious baseline is the QA system called Practical Intelligent Question Answering Technology (PIQUANT) (Prager, Chu-Carroll, and Czuba 2004), which had been under development at IBM Research by a four-person team for 6 years prior to taking on the Jeopardy Challenge. At the time it was among the top three to five Text Retrieval Conference (TREC) QA systems. Developed in part under the U.S. government AQUAINT program3 and in collaboration with external teams and universities, PIQUANT was a classic QA pipeline with state-of-the-art techniques aimed largely at the TREC QA evaluation (Voorhees and Dang 2005). PIQUANT performed in the 33 percent accuracy range in TREC evaluations. While the TREC QA evaluation allowed the use of the web, PIQUANT focused on question answering using local resources. A requirement of the Jeopardy Challenge is that the system be self-contained and does not link to live web search.

The requirements of the TREC QA evaluation were different from those of the Jeopardy Challenge. Most notably, TREC participants were given a relatively small corpus (1M documents) from which answers to questions must be justified; TREC questions were in a much simpler form compared to Jeopardy questions, and the confidences associated with answers were not a primary metric. Furthermore, the systems were allowed to access the web and had a week to produce results for 500 questions. The reader can find details in the TREC proceedings4 and numerous follow-on publications.

An initial 4-week effort was made to adapt PIQUANT to the Jeopardy Challenge. The experiment focused on precision and confidence. It ignored issues of answering speed and aspects of the game like betting and clue values.

The questions used were 500 randomly sampled Jeopardy clues from episodes in the past 15 years. The corpus that was used contained, but did not necessarily justify, answers to more than 90 percent of the questions. The result of the PIQUANT baseline experiment is illustrated in figure 4. As shown, on the 5 percent of the clues that PIQUANT was most confident in (left end of the curve), it delivered 47 percent precision, and over all the clues in the set (right end of the curve), its precision was 13 percent. Clearly the precision and confidence estimation are far below the requirements of the Jeopardy Challenge.

A similar baseline experiment was performed in collaboration with Carnegie Mellon University (CMU) using OpenEphyra,5 an open-source QA framework developed primarily at CMU. The framework is based on the Ephyra system, which was designed for answering TREC questions. In our experiments on TREC 2002 data, OpenEphyra answered 45 percent of the questions correctly using a live web search.

We spent minimal effort adapting OpenEphyra, but like PIQUANT, its performance on Jeopardy clues was below 15 percent accuracy. OpenEphyra did not produce reliable confidence estimates and thus could not effectively choose to answer questions with higher confidence. Clearly a larger investment in tuning and adapting these baseline systems to Jeopardy would improve their performance; however, we limited this investment since we did not want the baseline systems to become significant efforts.

The PIQUANT and OpenEphyra baselines demonstrate the performance of state-of-the-art QA systems on the Jeopardy task. In figure 5 we show two other baselines that demonstrate the performance of two complementary approaches on this task. The light gray line shows the performance of a system based purely on text search, using terms in the question as queries and search engine scores as confidences for candidate answers generated from retrieved document titles. The black line shows the performance of a system based on structured data, which attempts to look the answer up in a database by simply finding the named entities in the database related to the named entities in the clue. These two approaches were adapted to the Jeopardy task, including identifying and integrating relevant content.

The results form an interesting comparison. The search-based system has better performance at 100 percent answered, suggesting that the natural language content and the shallow text search techniques delivered better coverage. However, the flatness of the curve indicates the lack of accurate confidence estimation.6
Figure 4. Baseline Performance. (Precision versus percent answered, both ranging from 0 to 100 percent.)

The structured approach had better informed confidence when it was able to decipher the entities in the question and found the right matches in its structured knowledge bases, but its coverage quickly drops off when asked to answer more questions. To be a high-performing question-answering system, DeepQA must demonstrate both these properties to achieve high precision, high recall, and an accurate confidence estimation.

The DeepQA Approach

Early on in the project, attempts to adapt PIQUANT (Chu-Carroll et al. 2003) failed to produce promising results. We devoted many months of effort to encoding algorithms from the literature. Our investigations ran the gamut from deep logical form analysis to shallow machine-translation-based approaches. We integrated them into the standard QA pipeline that went from question analysis and answer type determination to search and then answer selection. It was difficult, however, to find examples of how published research results could be taken out of their original context and effectively replicated and integrated into different end-to-end systems to produce comparable results. Our efforts failed to have significant impact on Jeopardy or even on prior baseline studies using TREC data.

We ended up overhauling nearly everything we did, including our basic technical approach, the underlying architecture, metrics, evaluation protocols, engineering practices, and even how we worked together as a team. We also, in cooperation with CMU, began the Open Advancement of Question Answering (OAQA) initiative. OAQA is intended to directly engage researchers in the community to help replicate and reuse research results and to identify how to more rapidly advance the state of the art in QA (Ferrucci et al. 2009).

As our results dramatically improved, we observed that system-level advances allowing rapid integration and evaluation of new ideas and new components against end-to-end metrics were essential to our progress. This was echoed at the OAQA workshop for experts with decades of investment in QA, hosted by IBM in early 2008. Among the workshop conclusions was that QA would benefit from the collaborative evolution of a single extensible architecture that would allow component results to be consistently evaluated in a common technical context against a growing variety of what were called “Challenge Problems.”
Figure 5. Text Search Versus Knowledge Base Search. (Precision versus percent answered, both ranging from 0 to 100 percent.)

Different challenge problems were identified to address various dimensions of the general QA problem. Jeopardy was described as one addressing dimensions including high precision, accurate confidence determination, complex language, breadth of domain, and speed.

The system we have built and are continuing to develop, called DeepQA, is a massively parallel probabilistic evidence-based architecture. For the Jeopardy Challenge, we use more than 100 different techniques for analyzing natural language, identifying sources, finding and generating hypotheses, finding and scoring evidence, and merging and ranking hypotheses. What is far more important than any particular technique we use is how we combine them in DeepQA such that overlapping approaches can bring their strengths to bear and contribute to improvements in accuracy, confidence, or speed.

DeepQA is an architecture with an accompanying methodology, but it is not specific to the Jeopardy Challenge. We have successfully applied DeepQA to both the Jeopardy and TREC QA tasks. We have begun adapting it to different business applications and additional exploratory challenge problems including medicine, enterprise search, and gaming.

The overarching principles in DeepQA are massive parallelism, many experts, pervasive confidence estimation, and integration of shallow and deep knowledge.

Massive parallelism: Exploit massive parallelism in the consideration of multiple interpretations and hypotheses.

Many experts: Facilitate the integration, application, and contextual evaluation of a wide range of loosely coupled probabilistic question and content analytics.

Pervasive confidence estimation: No component commits to an answer; all components produce features and associated confidences, scoring different question and content interpretations. An underlying confidence-processing substrate learns how to stack and combine the scores.

Integrate shallow and deep knowledge: Balance the use of strict semantics and shallow semantics, leveraging many loosely formed ontologies.

Figure 6 illustrates the DeepQA architecture at a very high level.
Figure 6. DeepQA High-Level Architecture.

The remaining parts of this section provide a bit more detail about the various architectural roles.

Content Acquisition

The first step in any application of DeepQA to solve a QA problem is content acquisition, or identifying and gathering the content to use for the answer and evidence sources shown in figure 6.

Content acquisition is a combination of manual and automatic steps. The first step is to analyze example questions from the problem space to produce a description of the kinds of questions that must be answered and a characterization of the application domain. Analyzing example questions is primarily a manual task, while domain analysis may be informed by automatic or statistical analyses, such as the LAT analysis shown in figure 1. Given the kinds of questions and broad domain of the Jeopardy Challenge, the sources for Watson include a wide range of encyclopedias, dictionaries, thesauri, newswire articles, literary works, and so on.

Given a reasonable baseline corpus, DeepQA then applies an automatic corpus expansion process. The process involves four high-level steps: (1) identify seed documents and retrieve related documents from the web; (2) extract self-contained text nuggets from the related web documents; (3) score the nuggets based on whether they are informative with respect to the original seed document; and (4) merge the most informative nuggets into the expanded corpus. The live system itself uses this expanded corpus and does not have access to the web during play.
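A minimal sketch of this four-step expansion, assuming hypothetical retrieve_related, extract_nuggets, and relatedness callables for web retrieval, nugget extraction, and seed-relative scoring (none of these names come from Watson itself):

```python
def expand_corpus(seed_docs, retrieve_related, extract_nuggets, relatedness,
                  keep_per_seed=50):
    """Sketch of the four-step corpus expansion described above.
    retrieve_related, extract_nuggets, and relatedness are hypothetical
    callables; keep_per_seed is an arbitrary cutoff."""
    expanded = []
    for seed in seed_docs:
        related = retrieve_related(seed)                                # step 1
        nuggets = [n for doc in related for n in extract_nuggets(doc)]  # step 2
        nuggets.sort(key=lambda n: relatedness(seed, n), reverse=True)  # step 3
        expanded.extend(nuggets[:keep_per_seed])                        # step 4
    return expanded
```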
In addition to the content for the answer and evidence sources, DeepQA leverages other kinds of semistructured and structured content. Another step in the content-acquisition process is to identify and collect these resources, which include databases, taxonomies, and ontologies, such as dbPedia,7 WordNet (Miller 1995), and the Yago8 ontology.

Question Analysis

The first step in the run-time question-answering process is question analysis. During question analysis the system attempts to understand what the question is asking and performs the initial analyses that determine how the question will be processed by the rest of the system. The DeepQA approach encourages a mixture of experts at this stage, and in the Watson system we produce shallow parses, deep parses (McCord 1990), logical forms, semantic role labels, coreference, relations, named entities, and so on, as well as specific kinds of analysis for question answering. Most of these technologies are well understood and are not discussed here, but a few require some elaboration.
Question Classification. Question classification is the task of identifying question types or parts of questions that require special processing. This can include anything from single words with potentially double meanings to entire clauses that have certain syntactic, semantic, or rhetorical functionality that may inform downstream components with their analysis. Question classification may identify a question as a puzzle question, a math question, a definition question, and so on. It will identify puns, constraints, definition components, or entire subclues within questions.

Focus and LAT Detection. As discussed earlier, a lexical answer type is a word or noun phrase in the question that specifies the type of the answer without any attempt to understand its semantics. Determining whether or not a candidate answer can be considered an instance of the LAT is an important kind of scoring and a common source of critical errors. An advantage to the DeepQA approach is to exploit many independently developed answer-typing algorithms. However, many of these algorithms are dependent on their own type systems. We found the best way to integrate preexisting components is not to force them into a single, common type system, but to have them map from the LAT to their own internal types.

The focus of the question is the part of the question that, if replaced by the answer, makes the question a stand-alone statement. Looking back at some of the examples shown previously, the focus of “When hit by electrons, a phosphor gives off electromagnetic energy in this form” is “this form”; the focus of “Secretary Chase just submitted this to me for the third time; guess what, pal. This time I’m accepting it” is the first “this”; and the focus of “This title character was the crusty and tough city editor of the Los Angeles Tribune” is “This title character.” The focus often (but not always) contains useful information about the answer, is often the subject or object of a relation in the clue, and can turn a question into a factual statement when replaced with a candidate, which is a useful way to gather evidence about a candidate.

Relation Detection. Most questions contain relations, whether they are syntactic subject-verb-object predicates or semantic relationships between entities. For example, in the question, “They’re the two states you could be reentering if you’re crossing Florida’s northern border,” we can detect the relation borders(Florida,?x,north).

Watson uses relation detection throughout the QA process, from focus and LAT determination, to passage and answer scoring. Watson can also use detected relations to query a triple store and directly generate candidate answers. Due to the breadth of relations in the Jeopardy domain and the variety of ways in which they are expressed, however, Watson’s current ability to effectively use curated databases to simply “look up” the answers is limited to fewer than 2 percent of the clues.

Watson’s use of existing databases depends on the ability to analyze the question and detect the relations covered by the databases. In Jeopardy the broad domain makes it difficult to identify the most lucrative relations to detect. In 20,000 Jeopardy questions, for example, we found the distribution of Freebase9 relations to be extremely flat (figure 7). Roughly speaking, even achieving high recall on detecting the most frequent relations in the domain can at best help in about 25 percent of the questions, and the benefit of relation detection drops off fast with the less frequent relations. Broad-domain relation detection remains a major open area of research.
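The route from a detected relation to candidate answers can be sketched as follows; the toy in-memory store and the borders example stand in for the SPARQL queries Watson issues against much larger curated sources:

```python
# A toy in-memory triple store; Watson queries much larger curated stores,
# but the candidate-generation idea is the same.
TRIPLES = {
    ("Georgia", "borders", "Florida"),
    ("Alabama", "borders", "Florida"),
    ("Cuba", "borders", "Florida"),   # deliberately noisy entry
}

def candidates_from_relation(relation, argument):
    """Given a detected relation such as borders(Florida, ?x), return every
    entity that can fill the open variable."""
    return {s for (s, r, o) in TRIPLES if r == relation and o == argument}

print(candidates_from_relation("borders", "Florida"))
# The noisy Cuba candidate is not pruned here; constraints such as the
# "north" directional argument are left to downstream evidence scoring.
```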
Decomposition. As discussed above, an important requirement driven by analysis of Jeopardy clues was the ability to handle questions that are better answered through decomposition. DeepQA uses rule-based deep parsing and statistical classification methods both to recognize whether questions should be decomposed and to determine how best to break them up into subquestions. The operating hypothesis is that the correct question interpretation and derived answer(s) will score higher after all the collected evidence and all the relevant algorithms have been considered. Even if the question did not need to be decomposed to determine an answer, this method can help improve the system’s overall answer confidence.

DeepQA solves parallel decomposable questions through application of the end-to-end QA system on each subclue and synthesizes the final answers by a customizable answer combination component. These processing paths are shown in medium gray in figure 6. DeepQA also supports nested decomposable questions through recursive application of the end-to-end QA system to the inner subclue and then to the outer subclue. The customizable synthesis components allow specialized synthesis algorithms to be easily plugged into a common framework.

Hypothesis Generation

Hypothesis generation takes the results of question analysis and produces candidate answers by searching the system’s sources and extracting answer-sized snippets from the search results. Each candidate answer plugged back into the question is considered a hypothesis, which the system has to prove correct with some degree of confidence.

We refer to search performed in hypothesis generation as “primary search” to distinguish it from search performed during evidence gathering (described below). As with all aspects of DeepQA, we use a mixture of different approaches for primary search and candidate generation in the Watson system.
Figure 7. Approximate Distribution of the 50 Most Frequently Occurring Freebase Relations in 20,000 Randomly Selected Jeopardy Clues. (Relative frequency is on the y-axis, ranging from 0 to 4 percent.)

Primary Search. In primary search the goal is to find as much potentially answer-bearing content as possible based on the results of question analysis — the focus is squarely on recall with the expectation that the host of deeper content analytics will extract answer candidates and score this content plus whatever evidence can be found in support or refutation of candidates to drive up the precision. Over the course of the project we continued to conduct empirical studies designed to balance speed, recall, and precision. These studies allowed us to regularly tune the system to find the number of search results and candidates that produced the best balance of accuracy and computational resources. The operative goal for primary search eventually stabilized at about 85 percent binary recall for the top 250 candidates; that is, the system generates the correct answer as a candidate answer for 85 percent of the questions somewhere within the top 250 ranked candidates.

A variety of search techniques are used, including the use of multiple text search engines with different underlying approaches (for example, Indri and Lucene), document search as well as passage search, knowledge base search using SPARQL on triple stores, the generation of multiple search queries for a single question, and backfilling hit lists to satisfy key constraints identified in the question.

Triple store queries in primary search are based on named entities in the clue; for example, find all database entities related to the clue entities, or based on more focused queries in the cases that a semantic relation was detected. For a small number of LATs we identified as “closed LATs,” the candidate answer can be generated from a fixed list in some store of known instances of the LAT, such as “U.S. President” or “Country.”

Candidate Answer Generation. The search results feed into candidate generation, where techniques appropriate to the kind of search results are applied to generate candidate answers. For document search results from “title-oriented” resources, the title is extracted as a candidate answer. The system may generate a number of candidate answer variants from the same title based on substring analysis or link analysis (if the underlying source contains hyperlinks). Passage search results require more detailed analysis of the passage text to identify candidate answers. For example, named entity detection may be used to extract candidate answers from the passage. Some sources, such as a triple store and reverse dictionary lookup, produce candidate answers directly as their search result.
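A minimal sketch of candidate generation over the two main kinds of search results, assuming hypothetical hit structures and a crude stand-in named-entity extractor (Watson's actual extractors and title-variant generation are far richer):

```python
def extract_entities(text):
    """Stand-in named-entity extractor: capitalized tokens only."""
    return {w.strip(".,") for w in text.split() if w[:1].isupper()}

def generate_candidates(document_hits, passage_hits):
    """Sketch of candidate generation: titles of title-oriented document
    hits become candidates directly; passages are mined for entities."""
    candidates = {hit["title"] for hit in document_hits}
    for passage in passage_hits:
        candidates.update(extract_entities(passage["text"]))
    return candidates

doc_hits = [{"title": "A Hard Day's Night (film)"}]
psg_hits = [{"text": "Ford pardoned Nixon on Sept. 8, 1974."}]
print(generate_candidates(doc_hits, psg_hits))
```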
If the correct answer(s) are not generated at this stage as a candidate, the system has no hope of answering the question. This step therefore significantly favors recall over precision, with the expectation that the rest of the processing pipeline will tease out the correct answer, even if the set of candidates is quite large. One of the goals of the system design, therefore, is to tolerate noise in the early stages of the pipeline and drive up precision downstream.

Watson generates several hundred candidate answers at this stage.

Soft Filtering

A key step in managing the resource versus precision trade-off is the application of lightweight (less resource intensive) scoring algorithms to a larger set of initial candidates to prune them down to a smaller set of candidates before the more intensive scoring components see them. For example, a lightweight scorer may compute the likelihood of a candidate answer being an instance of the LAT. We call this step soft filtering.

The system combines these lightweight analysis scores into a soft filtering score. Candidate answers that pass the soft filtering threshold proceed to hypothesis and evidence scoring, while those candidates that do not pass the filtering threshold are routed directly to the final merging stage. The soft filtering scoring model and filtering threshold are determined based on machine learning over training data.

Watson currently lets roughly 100 candidates pass the soft filter, but this is a parameterizable function.
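A minimal sketch of the soft filter, with hand-set weights and threshold standing in for the learned soft filtering model described above:

```python
def soft_filter(candidates, lightweight_scorers, weights, threshold, cap=100):
    """Combine cheap scores into a single soft-filtering score and keep only
    candidates worth the expensive evidence-scoring stage. In Watson the
    weights and threshold come from machine learning over training data;
    here they are simply given."""
    def score(cand):
        return sum(w * s(cand) for s, w in zip(lightweight_scorers, weights))
    passing = sorted((c for c in candidates if score(c) >= threshold),
                     key=score, reverse=True)[:cap]
    routed_to_final_merge = [c for c in candidates if c not in passing]
    return passing, routed_to_final_merge
```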
Hypothesis and Evidence Scoring

Candidate answers that pass the soft filtering threshold undergo a rigorous evaluation process that involves gathering additional supporting evidence for each candidate answer, or hypothesis, and applying a wide variety of deep scoring analytics to evaluate the supporting evidence.

Evidence Retrieval. To better evaluate each candidate answer that passes the soft filter, the system gathers additional supporting evidence. The architecture supports the integration of a variety of evidence-gathering techniques. One particularly effective technique is passage search where the candidate answer is added as a required term to the primary search query derived from the question. This will retrieve passages that contain the candidate answer used in the context of the original question terms. Supporting evidence may also come from other sources like triple stores. The retrieved supporting evidence is routed to the deep evidence scoring components, which evaluate the candidate answer in the context of the supporting evidence.

Scoring. The scoring step is where the bulk of the deep content analysis is performed. Scoring algorithms determine the degree of certainty that retrieved evidence supports the candidate answers. The DeepQA framework supports and encourages the inclusion of many different components, or scorers, that consider different dimensions of the evidence and produce a score that corresponds to how well evidence supports a candidate answer for a given question.

DeepQA provides a common format for the scorers to register hypotheses (for example candidate answers) and confidence scores, while imposing few restrictions on the semantics of the scores themselves; this enables DeepQA developers to rapidly deploy, mix, and tune components to support each other. For example, Watson employs more than 50 scoring components that produce scores ranging from formal probabilities to counts to categorical features, based on evidence from different types of sources including unstructured text, semistructured text, and triple stores. These scorers consider things like the degree of match between a passage’s predicate-argument structure and the question, passage source reliability, geospatial location, temporal relationships, taxonomic classification, the lexical and semantic relations the candidate is known to participate in, the candidate’s correlation with question terms, its popularity (or obscurity), its aliases, and so on.

Consider the question, “He was presidentially pardoned on September 8, 1974”; the correct answer, “Nixon,” is one of the generated candidates. One of the retrieved passages is “Ford pardoned Nixon on Sept. 8, 1974.” One passage scorer counts the number of IDF-weighted terms in common between the question and the passage. Another passage scorer, based on the Smith-Waterman sequence-matching algorithm (Smith and Waterman 1981), measures the lengths of the longest similar subsequences between the question and passage (for example “on Sept. 8, 1974”). A third type of passage scoring measures the alignment of the logical forms of the question and passage. A logical form is a graphical abstraction of text in which nodes are terms in the text and edges represent either grammatical relationships (for example, Hermjakob, Hovy, and Lin [2000]; Moldovan et al. [2003]), deep semantic relationships (for example, Lenat [1995], Paritosh and Forbus [2005]), or both. The logical form alignment identifies Nixon as the object of the pardoning in the passage, and that the question is asking for the object of a pardoning. Logical form alignment gives “Nixon” a good score given this evidence. In contrast, a candidate answer like “Ford” would receive near identical scores to “Nixon” for term matching and passage alignment with this passage, but would receive a lower logical form alignment score.
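The first of these scorers, IDF-weighted term overlap, is simple enough to sketch directly; the document frequencies below are invented for illustration, whereas a real system would derive them from its corpus:

```python
import math

# Invented document frequencies; a real system derives IDF from its corpus.
DOC_FREQ = {"pardoned": 20, "september": 500, "1974": 800, "8": 5000,
            "on": 80_000, "he": 90_000, "was": 85_000}
N_DOCS = 100_000

def idf(term):
    return math.log(N_DOCS / (1 + DOC_FREQ.get(term, 1)))

def idf_overlap_score(question, passage):
    """Sum the IDF weights of terms shared by question and passage."""
    def tokens(s):
        return set(s.lower().replace(",", " ").replace(".", " ").split())
    return sum(idf(t) for t in tokens(question) & tokens(passage))

question = "He was presidentially pardoned on September 8, 1974"
print(idf_overlap_score(question, "Ford pardoned Nixon on Sept. 8, 1974"))
```

Note that this scorer gives "Ford" and "Nixon" identical credit for the passage above; only deeper analytics such as logical form alignment tell them apart.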
Figure 8. Evidence Profiles for Two Candidate Answers. Dimensions are on the x-axis and relative strength is on the y-axis. (The profiles for Argentina and Bolivia are compared across the dimensions Location, Passage Support, Popularity, Source Reliability, and Taxonomic, with relative strength ranging from about -0.2 to 0.8.)

Another type of scorer uses knowledge in triple stores, simple reasoning such as subsumption and disjointness in type taxonomies, geospatial reasoning, and temporal reasoning. Geospatial reasoning is used in Watson to detect the presence or absence of spatial relations such as directionality, borders, and containment between geoentities. For example, if a question asks for an Asian city, then spatial containment provides evidence that Beijing is a suitable candidate, whereas Sydney is not. Similarly, geocoordinate information associated with entities is used to compute relative directionality (for example, California is SW of Montana; GW Bridge is N of Lincoln Tunnel, and so on).

Temporal reasoning is used in Watson to detect inconsistencies between dates in the clue and those associated with a candidate answer. For example, the two most likely candidate answers generated by the system for the clue, “In 1594 he took a job as a tax collector in Andalusia,” are “Thoreau” and “Cervantes.” In this case, temporal reasoning is used to rule out Thoreau as he was not alive in 1594, having been born in 1817, whereas Cervantes, the correct answer, was born in 1547 and died in 1616.
Each of the scorers implemented in Watson, how they work, how they interact, and their independent impact on Watson’s performance deserves its own research paper. We cannot do this work justice here. It is important to note, however, that at this point no one algorithm dominates. In fact we believe DeepQA’s facility for absorbing these algorithms, and the tools we have created for exploring their interactions and effects, will represent an important and lasting contribution of this work.

To help developers and users get a sense of how Watson uses evidence to decide between competing candidate answers, scores are combined into an overall evidence profile. The evidence profile groups individual features into aggregate evidence dimensions that provide a more intuitive view of the feature group. Aggregate evidence dimensions might include, for example, Taxonomic, Geospatial (location), Temporal, Source Reliability, Gender, Name Consistency, Relational, Passage Support, Theory Consistency, and so on. Each aggregate dimension is a combination of related feature scores produced by the specific algorithms that fired on the gathered evidence.

Consider the following question: Chile shares its longest land border with this country. In figure 8 we see a comparison of the evidence profiles for two candidate answers produced by the system for this question: Argentina and Bolivia.
Simple search engine scores favor Bolivia as an answer, due to a popular border dispute that was frequently reported in the news. Watson prefers Argentina (the correct answer) over Bolivia, and the evidence profile shows why. Although Bolivia does have strong popularity scores, Argentina has strong support in the geospatial, passage support (for example, alignment and logical form graph matching of various text passages), and source reliability dimensions.

Final Merging and Ranking

It is one thing to return documents that contain key words from the question. It is quite another, however, to analyze the question and the content enough to identify the precise answer and yet another to determine an accurate enough confidence in its correctness to bet on it. Winning at Jeopardy requires exactly that ability.

The goal of final ranking and merging is to evaluate the hundreds of hypotheses based on potentially hundreds of thousands of scores to identify the single best-supported hypothesis given the evidence and to estimate its confidence — the likelihood it is correct.

Answer Merging

Multiple candidate answers for a question may be equivalent despite very different surface forms. This is particularly confusing to ranking techniques that make use of relative differences between candidates. Without merging, ranking algorithms would be comparing multiple surface forms that represent the same answer and trying to discriminate among them. While one line of research has been proposed based on boosting confidence in similar candidates (Ko, Nyberg, and Luo 2007), our approach is inspired by the observation that different surface forms are often disparately supported in the evidence and result in radically different, though potentially complementary, scores. This motivates an approach that merges answer scores before ranking and confidence estimation. Using an ensemble of matching, normalization, and coreference resolution algorithms, Watson identifies equivalent and related hypotheses (for example, Abraham Lincoln and Honest Abe) and then enables custom merging per feature to combine scores.

Ranking and Confidence Estimation

After merging, the system must rank the hypotheses and estimate confidence based on their merged scores. We adopted a machine-learning approach that requires running the system over a set of training questions with known answers and training a model based on the scores. One could assume a very flat model and apply existing ranking algorithms (for example, Herbrich, Graepel, and Obermayer [2000]; Joachims [2002]) directly to these score profiles and use the ranking score for confidence. For more intelligent ranking, however, ranking and confidence estimation may be separated into two phases. In both phases sets of scores may be grouped according to their domain (for example, type matching, passage scoring, and so on) and intermediate models trained using ground truths and methods specific for that task. Using these intermediate models, the system produces an ensemble of intermediate scores. Motivated by hierarchical techniques such as mixture of experts (Jacobs et al. 1991) and stacked generalization (Wolpert 1992), a metalearner is trained over this ensemble. This approach allows for iteratively enhancing the system with more sophisticated and deeper hierarchical models while retaining flexibility for robustness and experimentation as scorers are modified and added to the system.

Watson’s metalearner uses multiple trained models to handle different question classes as, for instance, certain scores that may be crucial to identifying the correct answer for a factoid question may not be as useful on puzzle questions.

Finally, an important consideration in dealing with NLP-based scorers is that the features they produce may be quite sparse, and so accurate confidence estimation requires the application of confidence-weighted learning techniques (Dredze, Crammer, and Pereira 2008).
discriminate among them. While one line of framework implementation of the Unstructured
research has been proposed based on boosting con- Information Management Architecture (Ferrucci
fidence in similar candidates (Ko, Nyberg, and Luo and Lally 2004). UIMA was designed to support
2007), our approach is inspired by the observation interoperability and scaleout of text and multi-
that different surface forms are often disparately modal analysis applications. All of the components
supported in the evidence and result in radically in DeepQA are implemented as UIMA annotators.
different, though potentially complementary, These are software components that analyze text
scores. This motivates an approach that merges and produce annotations or assertions about the
answer scores before ranking and confidence esti- text. Watson has evolved over time and the num-
mation. Using an ensemble of matching, normal- ber of components in the system has reached into
ization, and coreference resolution algorithms, the hundreds. UIMA facilitated rapid component
Watson identifies equivalent and related hypothe- integration, testing, and evaluation.
ses (for example, Abraham Lincoln and Honest Early implementations of Watson ran on a single
Abe) and then enables custom merging per feature processor where it took 2 hours to answer a single
to combine scores. question. The DeepQA computation is embarrass-
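A minimal sketch of this merging step appears below, assuming a tiny alias table in place of Watson's ensemble of matching, normalization, and coreference algorithms, and using a per-feature maximum as one possible merging rule.

```python
def canonicalize(answer):
    """Stand-in for Watson's ensemble of matching, normalization, and
    coreference algorithms; here we only consult a small alias table."""
    aliases = {"honest abe": "Abraham Lincoln", "abe lincoln": "Abraham Lincoln"}
    key = answer.strip().lower()
    return aliases.get(key, answer.strip())

def merge_candidates(candidates):
    """Merge equivalent surface forms and combine their scores per feature.

    `candidates` maps a surface form to a dict of feature scores.  Taking the
    maximum per feature is one simple rule; as the text notes, Watson allows
    custom merging logic for each feature.
    """
    merged = {}
    for surface, scores in candidates.items():
        canon = canonicalize(surface)
        bucket = merged.setdefault(canon, {})
        for feature, value in scores.items():
            bucket[feature] = max(bucket.get(feature, float("-inf")), value)
    return merged

print(merge_candidates({
    "Abraham Lincoln": {"passage_support": 0.6, "type_match": 0.9},
    "Honest Abe":      {"passage_support": 0.8, "type_match": 0.4},
}))
# {'Abraham Lincoln': {'passage_support': 0.8, 'type_match': 0.9}}
```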
Ranking and Confidence Estimation

After merging, the system must rank the hypotheses and estimate confidence based on their merged scores. We adopted a machine-learning approach that requires running the system over a set of training questions with known answers and training a model based on the scores. One could assume a very flat model and apply existing ranking algorithms (for example, Herbrich, Graepel, and Obermayer [2000]; Joachims [2002]) directly to these score profiles and use the ranking score for confidence. For more intelligent ranking, however, ranking and confidence estimation may be separated into two phases. In both phases sets of scores may be grouped according to their domain (for example, type matching, passage scoring, and so on) and intermediate models trained using ground truths and methods specific for that task. Using these intermediate models, the system produces an ensemble of intermediate scores. Motivated by hierarchical techniques such as mixture of experts (Jacobs et al. 1991) and stacked generalization (Wolpert 1992), a metalearner is trained over this ensemble. This approach allows for iteratively enhancing the system with more sophisticated and deeper hierarchical models while retaining flexibility for robustness and experimentation as scorers are modified and added to the system.

Watson's metalearner uses multiple trained models to handle different question classes as, for instance, certain scores that may be crucial to identifying the correct answer for a factoid question may not be as useful on puzzle questions.

Finally, an important consideration in dealing with NLP-based scorers is that the features they produce may be quite sparse, and so accurate confidence estimation requires the application of confidence-weighted learning techniques (Dredze, Crammer, and Pereira 2008).
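The following sketch illustrates the two-phase idea on toy data: per-domain intermediate models are trained first, and a metalearner is then trained over their outputs in the spirit of stacked generalization. The feature grouping, synthetic data, and use of plain logistic regression are assumptions for illustration; Watson's actual models, question-class handling, and confidence-weighted learning are more elaborate.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: each row is one candidate answer, columns are feature scores,
# and y records whether the candidate was correct.
rng = np.random.default_rng(0)
X = rng.random((200, 6))
y = (0.5 * X[:, 0] + 0.3 * X[:, 3] + 0.2 * X[:, 5] + 0.1 * rng.random(200) > 0.6).astype(int)

# Hypothetical grouping of feature columns into domains.
domains = {"type_match": [0, 1], "passage_scoring": [2, 3], "source": [4, 5]}

# Phase 1: one intermediate model per feature domain.
intermediate = {}
for name, cols in domains.items():
    intermediate[name] = (cols, LogisticRegression().fit(X[:, cols], y))

def stacked_scores(features):
    # Each intermediate model contributes one stacked score per candidate.
    return np.column_stack([model.predict_proba(features[:, cols])[:, 1]
                            for cols, model in intermediate.values()])

# Phase 2: a metalearner over the ensemble of intermediate scores yields the
# final rank score / confidence.  In practice the metalearner would be trained
# on held-out (out-of-fold) intermediate scores to avoid overfitting.
meta = LogisticRegression().fit(stacked_scores(X), y)
confidences = meta.predict_proba(stacked_scores(X))[:, 1]
print("top candidate index:", int(np.argmax(confidences)),
      "confidence:", round(float(confidences.max()), 3))
```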
Speed and Scaleout

DeepQA is developed using Apache UIMA,10 a framework implementation of the Unstructured Information Management Architecture (Ferrucci and Lally 2004). UIMA was designed to support interoperability and scaleout of text and multimodal analysis applications. All of the components in DeepQA are implemented as UIMA annotators. These are software components that analyze text and produce annotations or assertions about the text. Watson has evolved over time and the number of components in the system has reached into the hundreds. UIMA facilitated rapid component integration, testing, and evaluation.

Early implementations of Watson ran on a single processor where it took 2 hours to answer a single question. The DeepQA computation is embarrassingly parallel, however. UIMA-AS, part of Apache UIMA, enables the scaleout of UIMA applications using asynchronous messaging. We used UIMA-AS to scale Watson out over 2500 compute cores. UIMA-AS handles all of the communication, messaging, and queue management necessary using the open JMS standard. The UIMA-AS deployment of Watson enabled competitive run-time latencies in the 3–5 second range.

To preprocess the corpus and create fast run-time indices we used Hadoop.11 UIMA annotators were easily deployed as mappers in the Hadoop map-reduce framework. Hadoop distributes the content over the cluster to afford high CPU utilization and provides convenient tools for deploying, managing, and monitoring the corpus analysis process.
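Watson's preprocessing annotators are Java UIMA components, but the mapper idea can be sketched with a simple Hadoop Streaming-style script. The tab-separated input format and the term-emitting analysis below are assumptions chosen only to keep the example self-contained.

```python
#!/usr/bin/env python3
"""A deliberately simplified Hadoop Streaming mapper.

Watson's corpus preprocessing wraps UIMA annotators (Java) as mappers; this
stand-in just emits (term, doc_id) pairs so a downstream reducer could build a
simple inverted index.  Input lines are assumed to be "doc_id<TAB>text".
"""
import re
import sys

def main():
    for line in sys.stdin:
        try:
            doc_id, text = line.rstrip("\n").split("\t", 1)
        except ValueError:
            continue  # skip malformed records
        for term in set(re.findall(r"[a-z0-9]+", text.lower())):
            # Hadoop Streaming expects tab-separated key/value pairs on stdout.
            print(f"{term}\t{doc_id}")

if __name__ == "__main__":
    main()
```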
Strategy

Jeopardy demands strategic game play to match wits against the best human players. In a typical Jeopardy game, Watson faces the following strategic decisions: deciding whether to buzz in and attempt to answer a question, selecting squares from the board, and wagering on Daily Doubles and Final Jeopardy.

The workhorse of strategic decisions is the buzz-in decision, which is required for every non–Daily Double clue on the board. This is where DeepQA's ability to accurately estimate its confidence in its answer is critical, and Watson considers this confidence along with other game-state factors in making the final determination whether to buzz.
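A toy version of such a buzz-in rule is sketched below: the confidence threshold is nudged by a couple of made-up game-state heuristics. Watson's actual decision uses learned models of the game and opponents; the function, parameters, and adjustments here are illustrative assumptions only.

```python
def should_buzz(confidence, threshold=0.5, my_score=0, leader_score=0,
                clue_value=0, clues_remaining=60):
    """Illustrative buzz-in rule: buzz when answer confidence clears a
    threshold that is adjusted by simple game-state considerations."""
    adjusted = threshold
    # Late in the game and trailing badly: take more risk.
    if clues_remaining < 10 and my_score + clue_value * clues_remaining < leader_score:
        adjusted -= 0.15
    # Comfortably ahead: protect the lead by being more conservative.
    if leader_score > 0 and my_score > 2 * leader_score:
        adjusted += 0.10
    return confidence >= adjusted

print(should_buzz(0.42, my_score=4000, leader_score=12000,
                  clue_value=800, clues_remaining=6))   # True: must take risks
print(should_buzz(0.42, my_score=20000, leader_score=8000,
                  clue_value=800, clues_remaining=6))   # False: protect the lead
```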
Another strategic decision, Final Jeopardy wagering, generally receives the most attention and analysis from those interested in game strategy, and there exists a growing catalogue of heuristics such as "Clavin's Rule" or the "Two-Thirds Rule" (Dupee 1998) as well as identification of those critical score boundaries at which particular strategies may be used (by no means does this make it easy or rote; despite this attention, we have found evidence that contestants still occasionally make irrational Final Jeopardy bets). Daily Double betting turns out to be less studied but just as challenging since the player must consider opponents' scores and predict the likelihood of getting the question correct just as in Final Jeopardy. After a Daily Double, however, the game is not over, so evaluation of a wager requires forecasting the effect it will have on the distant, final outcome of the game.

These challenges drove the construction of statistical models of players and games, game-theoretic analyses of particular game scenarios and strategies, and the development and application of reinforcement-learning techniques for Watson to learn its strategy for playing Jeopardy. Fortunately, moderate amounts of historical data are available to serve as training data for learning techniques. Even so, it requires extremely careful modeling and game-theoretic evaluation as the game of Jeopardy has incomplete information and uncertainty to model, critical score boundaries to recognize, and savvy, competitive players to account for. It is a game where one faulty strategic choice can lose the entire match.

Status and Results

After approximately 3 years of effort by a core algorithmic team composed of 20 researchers and software engineers with a range of backgrounds in natural language processing, information retrieval, machine learning, computational linguistics, and knowledge representation and reasoning, we have driven the performance of DeepQA to operate within the winner's cloud on the Jeopardy task, as shown in figure 9. Watson's results illustrated in this figure were measured over blind test sets containing more than 2000 Jeopardy questions.

After many nonstarters, by the fourth quarter of 2007 we finally adopted the DeepQA architecture. At that point we had all moved out of our private offices and into a "war room" setting to dramatically facilitate team communication and tight collaboration. We instituted a host of disciplined engineering and experimental methodologies supported by metrics and tools to ensure we were investing in techniques that promised significant impact on end-to-end metrics. Since then, modulo some early jumps in performance, the progress has been incremental but steady. It is slowing in recent months as the remaining challenges prove either very difficult or highly specialized and covering small phenomena in the data.

By the end of 2008 we were performing reasonably well — about 70 percent precision at 70 percent attempted over the 12,000-question blind data, but it was taking 2 hours to answer a single question on a single CPU. We brought on a team specializing in UIMA and UIMA-AS to scale up DeepQA on a massively parallel high-performance computing platform. We are currently answering more than 85 percent of the questions in 5 seconds or less — fast enough to provide competitive performance, and with continued algorithmic development are performing with about 85 percent precision at 70 percent attempted.
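The precision-at-percent-attempted metric used throughout these results (and plotted in figure 9) can be computed from per-question confidences as sketched below; the synthetic data is there only to make the example runnable.

```python
import random

def precision_at_attempted(results, fraction):
    """Precision when answering only the `fraction` of questions on which
    the system is most confident.  `results` holds (confidence, correct)
    pairs, one per question; this is the metric behind figure 9's
    precision-versus-percent-answered curves."""
    ranked = sorted(results, key=lambda r: r[0], reverse=True)
    k = max(1, round(fraction * len(ranked)))
    return sum(1 for _, correct in ranked[:k] if correct) / k

# Synthetic run: questions the system is confident about are right more often.
random.seed(1)
toy = [(c, random.random() < 0.35 + 0.6 * c)
       for c in (random.random() for _ in range(1000))]

for frac in (0.4, 0.7, 1.0):
    print(f"precision at {int(frac * 100)}% attempted:",
          round(precision_at_attempted(toy, frac), 3))
```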
We have more to do in order to improve precision, confidence, and speed enough to compete with grand champions. We are finding great results in leveraging the DeepQA architecture capability to quickly admit and evaluate the impact of new algorithms as we engage more university partnerships to help meet the challenge.

An Early Adaptation Experiment

Another challenge for DeepQA has been to demonstrate if and how it can adapt to other QA tasks. In mid-2008, after we had populated the basic architecture with a host of components for searching, evidence retrieval, scoring, final merging, and ranking for the Jeopardy task, IBM collaborated with CMU to try to adapt DeepQA to the TREC QA problem by plugging in only select domain-specific components previously tuned to the TREC task.

Figure 9. Watson's Precision and Confidence Progress as of the Fourth Quarter 2009. (Plot of precision versus percent of questions answered, comparing a baseline system with DeepQA versions v0.1, 12/07, through v0.7, 04/10.)

In particular, we added question-analysis components from PIQUANT and OpenEphyra that identify answer types for a question, and candidate answer-generation components that identify instances of those answer types in the text. The DeepQA framework utilized both sets of components despite their different type systems — no ontology integration was performed. The identification and integration of these domain-specific components into DeepQA took just a few weeks.

The extended DeepQA system was applied to TREC questions. Some of DeepQA's answer and evidence scorers are more relevant in the TREC domain than in the Jeopardy domain and others are less relevant. We addressed this aspect of adaptation for DeepQA's final merging and ranking by training an answer-ranking model using TREC questions; thus the extent to which each score affected the answer ranking and confidence was automatically customized for TREC.
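The sketch below illustrates that adaptation step in miniature: the same candidate features are kept, but a separate final ranking model is trained per task so the learned weights reflect that task's questions. The synthetic features, weights, and use of logistic regression are assumptions for illustration, not the actual TREC or Jeopardy models.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical setup: the shared candidate-scoring pipeline yields a fixed
# feature vector per candidate, but each task gets its own final ranking
# model so feature weights are customized to that task's questions.
rng = np.random.default_rng(7)

def make_task_data(weights, n=300):
    X = rng.random((n, len(weights)))
    y = (X @ np.array(weights) + 0.1 * rng.random(n) > 0.55).astype(int)
    return X, y

tasks = {
    # Passage scorers matter more for Jeopardy-like clues, type matching more
    # for TREC-like factoids (illustrative weights only).
    "jeopardy": make_task_data([0.5, 0.2, 0.3]),
    "trec": make_task_data([0.2, 0.6, 0.2]),
}

rankers = {task: LogisticRegression().fit(X, y) for task, (X, y) in tasks.items()}
for task, model in rankers.items():
    print(task, "learned feature weights:", np.round(model.coef_[0], 2))
```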
Figure 10 shows the results of the adaptation experiment. Both the 2005 PIQUANT and 2007 OpenEphyra systems had less than 50 percent accuracy on the TREC questions and less than 15 percent accuracy on the Jeopardy clues. The DeepQA system at the time had accuracy above 50 percent on Jeopardy. Without adaptation DeepQA's accuracy on TREC questions was about 35 percent. After adaptation, DeepQA's accuracy on TREC exceeded 60 percent. We repeated the adaptation experiment in 2010, and in addition to the improvements to DeepQA since 2008, the adaptation included a transfer learning step for TREC questions from a model trained on Jeopardy questions. DeepQA's performance on TREC data was 51 percent accuracy prior to adaptation and 67 percent after adaptation, nearly level with its performance on blind Jeopardy data.

The adapted system performed significantly better than the original complete systems on the task for which they were designed. While just one adaptation experiment, this is exactly the sort of behavior we think an extensible QA system should exhibit. It should quickly absorb domain- or task-specific components and get better on that target task without degradation in performance in the general case or on prior tasks.

Figure 10. Accuracy on Jeopardy! and TREC. (Bar chart, as of 9/2008, comparing accuracy on Jeopardy! and TREC for IBM's 2005 TREC QA system [PIQUANT], CMU's 2007 TREC QA system [Ephyra], DeepQA prior to adaptation, and DeepQA performing on both tasks.)

Summary

The Jeopardy Challenge helped us address requirements that led to the design of the DeepQA architecture and the implementation of Watson. After 3 years of intense research and development by a core team of about 20 researchers, Watson is performing at human expert levels in terms of precision, confidence, and speed at the Jeopardy quiz show. Our results strongly suggest that DeepQA is an effective and extensible architecture that may be used as a foundation for combining, deploying, evaluating, and advancing a wide range of algorithmic techniques to rapidly advance the field of QA.

The architecture and methodology developed as part of this project have highlighted the need to take a systems-level approach to research in QA, and we believe this applies to research in the broader field of AI. We have developed many different algorithms for addressing different kinds of problems in QA and plan to publish many of them in more detail in the future. However, no one algorithm solves challenge problems like this. End-to-end systems tend to involve many complex and often overlapping interactions. A system design and methodology that facilitated the efficient integration and ablation studies of many probabilistic components was essential for our success to date. The impact of any one algorithm on end-to-end performance changed over time as other techniques were added and had overlapping effects. Our commitment to regularly evaluate the effects of specific techniques on end-to-end performance, and to let that shape our research investment, was necessary for our rapid progress.

Rapid experimentation was another critical ingredient to our success. The team conducted more than 5500 independent experiments in 3 years — each averaging about 2000 CPU hours and generating more than 10 GB of error-analysis data. Without DeepQA's massively parallel architecture and a dedicated high-performance computing infrastructure, we would not have been able to perform these experiments, and likely would not have even conceived of many of them.

Tuned for the Jeopardy Challenge, Watson has begun to compete against former Jeopardy players in a series of "sparring" games. It is holding its own, winning 64 percent of the games, but has to be improved and sped up to compete favorably against the very best.

We have leveraged our collaboration with CMU and with our other university partnerships in getting this far and hope to continue our collaborative work to drive Watson to its final goal, and help openly advance QA research.

Acknowledgements

We would like to acknowledge the talented team of research scientists and engineers at IBM and at partner universities, listed below, for the incredible work they are doing to influence and develop all aspects of Watson and the DeepQA architecture. It is this team who are responsible for the work described in this paper. From IBM, Andy Aaron, Einat Amitay, Branimir Boguraev, David Carmel, Arthur Ciccolo, Jaroslaw Cwiklik, Pablo Duboue, Edward Epstein, Raul Fernandez, Radu Florian, Dan Gruhl, Tong-Haing Fin, Achille Fokoue, Karen Ingraffea, Bhavani Iyer, Hiroshi Kanayama, Jon Lenchner, Anthony Levas, Burn Lewis, Michael McCord, Paul Morarescu, Matthew Mulholland, Yuan Ni, Miroslav Novak, Yue Pan, Siddharth Patwardhan, Zhao Ming Qiu, Salim Roukos, Marshall Schor, Dafna Sheinwald, Roberto Sicconi, Kohichi Takeda, Gerry Tesauro, Chen Wang, Wlodek Zadrozny, and Lei Zhang. From our academic partners, Manas Pathak (CMU), Chang Wang (University of Massachusetts [UMass]), Hideki Shima (CMU), James Allen (UMass), Ed Hovy (University of Southern California/Information Sciences Institute), Bruce Porter (University of Texas), Pallika Kanani (UMass), Boris Katz (Massachusetts Institute of Technology), Alessandro Moschitti and Giuseppe Riccardi (University of Trento), Barbara Cutler, Jim Hendler, and Selmer Bringsjord (Rensselaer Polytechnic Institute).

Notes

1. Watson is named after IBM's founder, Thomas J. Watson.
2. Random jitter has been added to help visualize the distribution of points.
3. www-nlpir.nist.gov/projects/aquaint.
4. trec.nist.gov/proceedings/proceedings.html.
5. sourceforge.net/projects/openephyra/.
6. The dip at the left end of the light gray curve is due to the disproportionately high score the search engine assigns to short queries, which typically are not sufficiently discriminative to retrieve the correct answer in top position.
7. dbpedia.org/.
8. www.mpi-inf.mpg.de/yago-naga/yago/.
9. freebase.com/.
10. incubator.apache.org/uima/.
11. hadoop.apache.org/.

References

Chu-Carroll, J.; Czuba, K.; Prager, J. M.; and Ittycheriah, A. 2003. Two Heads Are Better Than One in Question-Answering. Paper presented at the Human Language Technology Conference, Edmonton, Canada, 27 May–1 June.

Dredze, M.; Crammer, K.; and Pereira, F. 2008. Confidence-Weighted Linear Classification. In Proceedings of the Twenty-Fifth International Conference on Machine Learning (ICML). Princeton, NJ: International Machine Learning Society.

Dupee, M. 1998. How to Get on Jeopardy! ... and Win: Valuable Information from a Champion. Secaucus, NJ: Citadel Press.

Ferrucci, D., and Lally, A. 2004. UIMA: An Architectural Approach to Unstructured Information Processing in the Corporate Research Environment. Natural Language Engineering 10(3–4): 327–348.

Ferrucci, D.; Nyberg, E.; Allan, J.; Barker, K.; Brown, E.; Chu-Carroll, J.; Ciccolo, A.; Duboue, P.; Fan, J.; Gondek, D.; Hovy, E.; Katz, B.; Lally, A.; McCord, M.; Morarescu, P.; Murdock, W.; Porter, B.; Prager, J.; Strzalkowski, T.; Welty, C.; and Zadrozny, W. 2009. Towards the Open Advancement of Question Answer Systems. IBM Technical Report RC24789, Yorktown Heights, NY.

Herbrich, R.; Graepel, T.; and Obermayer, K. 2000. Large Margin Rank Boundaries for Ordinal Regression. In Advances in Large Margin Classifiers, 115–132. Linköping, Sweden: Liu E-Press.

Hermjakob, U.; Hovy, E. H.; and Lin, C. 2000. Knowledge-Based Question Answering. In Proceedings of the Sixth World Multiconference on Systems, Cybernetics, and Informatics (SCI-2002). Winter Garden, FL: International Institute of Informatics and Systemics.

Hsu, F.-H. 2002. Behind Deep Blue: Building the Computer That Defeated the World Chess Champion. Princeton, NJ: Princeton University Press.

Jacobs, R.; Jordan, M. I.; Nowlan, S. J.; and Hinton, G. E. 1991. Adaptive Mixtures of Local Experts. Neural Computation 3(1): 79–87.

Joachims, T. 2002. Optimizing Search Engines Using Clickthrough Data. In Proceedings of the Thirteenth ACM Conference on Knowledge Discovery and Data Mining (KDD). New York: Association for Computing Machinery.

Ko, J.; Nyberg, E.; and Si, L. 2007. A Probabilistic Graphical Model for Joint Answer Ranking in Question Answering. In Proceedings of the 30th Annual International ACM SIGIR Conference, 343–350. New York: Association for Computing Machinery.

Lenat, D. B. 1995. Cyc: A Large-Scale Investment in Knowledge Infrastructure. Communications of the ACM 38(11): 33–38.

Maybury, Mark, ed. 2004. New Directions in Question-Answering. Menlo Park, CA: AAAI Press.

McCord, M. C. 1990. Slot Grammar: A System for Simpler Construction of Practical Natural Language Grammars. In Natural Language and Logic: International Scientific Symposium. Lecture Notes in Computer Science 459. Berlin: Springer Verlag.

Miller, G. A. 1995. WordNet: A Lexical Database for English. Communications of the ACM 38(11): 39–41.

Moldovan, D.; Clark, C.; Harabagiu, S.; and Maiorano, S. 2003. COGEX: A Logic Prover for Question Answering. Paper presented at the Human Language Technology Conference, Edmonton, Canada, 27 May–1 June.

Paritosh, P., and Forbus, K. 2005. Analysis of Strategic Knowledge in Back of the Envelope Reasoning. In Proceedings of the 20th AAAI Conference on Artificial Intelligence (AAAI-05). Menlo Park, CA: AAAI Press.

Prager, J. M.; Chu-Carroll, J.; and Czuba, K. 2004. A Multi-Strategy, Multi-Question Approach to Question Answering. In New Directions in Question-Answering, ed. M. Maybury. Menlo Park, CA: AAAI Press.

Simmons, R. F. 1970. Natural Language Question-Answering Systems: 1969. Communications of the ACM 13(1): 15–30.

Smith, T. F., and Waterman, M. S. 1981. Identification of Common Molecular Subsequences. Journal of Molecular Biology 147(1): 195–197.

Strzalkowski, T., and Harabagiu, S., eds. 2006. Advances in Open-Domain Question-Answering. Berlin: Springer.

Voorhees, E. M., and Dang, H. T. 2005. Overview of the TREC 2005 Question Answering Track. In Proceedings of the Fourteenth Text Retrieval Conference. Gaithersburg, MD: National Institute of Standards and Technology.

Wolpert, D. H. 1992. Stacked Generalization. Neural Networks 5(2): 241–259.

David Ferrucci is a research staff member and leads the Semantic Analysis and Integration department at the IBM T. J. Watson Research Center, Hawthorne, New York. Ferrucci is the principal investigator for the DeepQA/Watson project and the chief architect for UIMA, now an OASIS standard and Apache open-source project. Ferrucci's background is in artificial intelligence and software engineering.

Eric Brown is a research staff member at the IBM T. J. Watson Research Center. His background is in information retrieval. Brown's current research interests include question answering, unstructured information management architectures, and applications of advanced text analysis and question answering to information retrieval systems.

Jennifer Chu-Carroll is a research staff member at the IBM T. J. Watson Research Center. Chu-Carroll is on the editorial board of the Journal of Dialogue Systems, and previously served on the executive board of the North American Chapter of the Association for Computational Linguistics and as program cochair of HLT-NAACL 2006. Her research interests include question answering, semantic search, and natural language discourse and dialogue.

James Fan is a research staff member at IBM T. J. Watson Research Center. His research interests include natural language processing, question answering, and knowledge representation and reasoning. He has served as a program committee member for several top-ranked AI conferences and journals, such as IJCAI and AAAI. He received his Ph.D. from the University of Texas at Austin in 2006.

David Gondek is a research staff member at the IBM T. J. Watson Research Center. His research interests include applications of machine learning, statistical modeling, and game theory to question answering and natural language processing. Gondek has contributed to journals and conferences in machine learning and data mining. He earned his Ph.D. in computer science from Brown University.

Aditya A. Kalyanpur is a research staff member at the IBM T. J. Watson Research Center. His primary research interests include knowledge representation and reasoning, natural language programming, and question answering. He has served on W3 working groups, as program cochair of an international semantic web workshop, and as a reviewer and program committee member for several AI journals and conferences. Kalyanpur completed his doctorate in AI and semantic web related research from the University of Maryland, College Park.

Adam Lally is a senior software engineer at IBM's T. J. Watson Research Center. He develops natural language processing and reasoning algorithms for a variety of applications and is focused on developing scalable frameworks of NLP and reasoning systems. He is a lead developer and designer for the UIMA framework and architecture specification.

J. William Murdock is a research staff member at the IBM T. J. Watson Research Center. Before joining IBM, he worked at the United States Naval Research Laboratory. His research interests include natural-language semantics, analogical reasoning, knowledge-based planning, machine learning, and computational reflection. In 2001, he earned his Ph.D. in computer science from the Georgia Institute of Technology.

Eric Nyberg is a professor at the Language Technologies Institute, School of Computer Science, Carnegie Mellon University. Nyberg's research spans a broad range of text analysis and information retrieval areas, including question answering, search, reasoning, and natural language processing architectures, systems, and software engineering principles.

John Prager is a research staff member at the IBM T. J. Watson Research Center in Yorktown Heights, New York. His background includes natural-language based interfaces and semantic search, and his current interest is in incorporating user and domain models to inform question-answering. He is a member of the TREC program committee.

Nico Schlaefer is a Ph.D. student at the Language Technologies Institute in the School of Computer Science, Carnegie Mellon University and an IBM Ph.D. Fellow. His research focus is the application of machine learning techniques to natural language processing tasks. Schlaefer is the primary author of the OpenEphyra question answering system.

Chris Welty is a research staff member at the IBM Thomas J. Watson Research Center. His background is primarily in knowledge representation and reasoning. Welty's current research focus is on hybridization of machine learning, natural language processing, and knowledge representation and reasoning in building AI systems.