researchers, the open-domain QA problem is attractive as it is one of the most challenging in the realm of computer science and artificial intelligence, requiring a synthesis of information retrieval, natural language processing, knowledge representation and reasoning, machine learning, and computer-human interfaces. It has had a long history (Simmons 1970) and saw rapid advancement spurred by system building, experimentation, and government funding in the past decade (Maybury 2004, Strzalkowski and Harabagiu 2006).

With QA in mind, we settled on a challenge to build a computer system, called Watson,¹ which could compete at the human champion level in real time on the American TV quiz show, Jeopardy. The extent of the challenge includes fielding a real-time automatic contestant on the show, not merely a laboratory exercise.

Jeopardy! is a well-known TV quiz show that has been airing on television in the United States for more than 25 years (see the Jeopardy! Quiz Show sidebar for more information on the show). It pits three human contestants against one another in a competition that requires answering rich natural language questions over a very broad domain of topics, with penalties for wrong answers. The nature of the three-person competition is such that confidence, precision, and answering speed are of critical importance, with roughly 3 seconds to answer each question. A computer system that could compete at human champion levels at this game would need to produce exact answers to often complex natural language questions with high precision and speed and have a reliable confidence in its answers, such that it could answer roughly 70 percent of the questions asked with greater than 80 percent precision in 3 seconds or less.

Finally, the Jeopardy Challenge represents a unique and compelling AI question similar to the one underlying Deep Blue (Hsu 2002): can a computer system be designed to compete against the best humans at a task thought to require high levels of human intelligence, and if so, what kind of technology, algorithms, and engineering is required? While we believe the Jeopardy Challenge is an extraordinarily demanding task that will greatly advance the field, we appreciate that this challenge alone does not address all aspects of QA and does not by any means close the book on the QA challenge the way that Deep Blue may have for playing chess.
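The performance target stated above (answering roughly 70 percent of the questions with greater than 80 percent precision) can be read as precision measured over only the questions the system is most confident about. The following is a minimal sketch of that measurement, not the evaluation code used for Watson; the function name `precision_at_answered` and the sample data are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def precision_at_answered(results: List[Tuple[float, bool]], fraction: float) -> float:
    """Precision over the most confident `fraction` of questions.

    `results` holds one (confidence, was_correct) pair per question.
    This is an illustrative helper, not part of any published system.
    """
    if not results:
        raise ValueError("results must not be empty")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")

    # Rank questions from most to least confident.
    ranked = sorted(results, key=lambda pair: pair[0], reverse=True)

    # Attempt only the top `fraction` of questions (at least one).
    n_answered = max(1, round(fraction * len(ranked)))
    n_correct = sum(1 for _, was_correct in ranked[:n_answered] if was_correct)
    return n_correct / n_answered


# Made-up sample: ten questions with confidences and correctness flags.
sample = [
    (0.95, True), (0.90, True), (0.85, True), (0.80, False), (0.75, True),
    (0.70, True), (0.60, True), (0.40, False), (0.30, True), (0.10, False),
]

precision = precision_at_answered(sample, 0.70)
meets_target = precision > 0.80
```

Ranking by confidence before truncating is what ties the metric to confidence estimation: a system with the same answers but poorly ordered confidences would score lower at the same fraction answered.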
The Jeopardy Challenge

Meeting the Jeopardy Challenge requires advancing and incorporating a variety of QA technologies including parsing, question classification, question decomposition, automatic source acquisition and evaluation, entity and relation detection, logical form generation, and knowledge representation and reasoning.

Winning at Jeopardy requires accurately computing confidence in your answers. The questions and content are ambiguous and noisy and none of the individual algorithms are perfect. Therefore, each component must produce a confidence in its output, and individual component confidences must be combined to compute the overall confidence of the final answer. The final confidence is used to determine whether the computer system should risk choosing to answer at all. In Jeopardy parlance, this confidence is used to determine whether the computer will “ring in” or “buzz in” for a question. The confidence must be computed during the time the question is read and before the opportunity to buzz in. This is roughly between 1 and 6 seconds with an average around 3 seconds.

Confidence estimation was critical to shaping our overall approach in DeepQA. There is no expectation that any component in the system does a perfect job: all components post features of the computation and associated confidences, and we use a hierarchical machine-learning method to combine all these features and decide whether or not there is enough confidence in the final answer to attempt to buzz in and risk getting the question wrong.

In this section we elaborate on the various aspects of the Jeopardy Challenge.
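The idea of combining per-component features into one final confidence, and then deciding whether to buzz in, can be sketched as follows. This is a deliberately simplified stand-in: it uses a single logistic combination with hand-set weights rather than the hierarchical machine-learning method described above, and the feature names, weights, bias, and threshold are all hypothetical.

```python
import math
from typing import Dict

# Hypothetical feature weights and bias. In a real system these would be
# learned from training data, not set by hand.
WEIGHTS: Dict[str, float] = {
    "type_match": 2.0,
    "passage_support": 1.5,
    "source_reliability": 0.5,
}
BIAS = -2.0
BUZZ_THRESHOLD = 0.5  # hypothetical risk threshold


def combine_confidence(features: Dict[str, float]) -> float:
    """Map component feature scores to one confidence in (0, 1)."""
    # Missing features contribute nothing to the weighted sum.
    z = BIAS + sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def should_buzz(features: Dict[str, float], threshold: float = BUZZ_THRESHOLD) -> bool:
    """Attempt an answer only when the combined confidence clears the threshold."""
    return combine_confidence(features) >= threshold


# Made-up feature vectors for a well-supported and a poorly supported answer.
strong = {"type_match": 1.0, "passage_support": 1.0, "source_reliability": 1.0}
weak = {"type_match": 0.1, "passage_support": 0.1, "source_reliability": 0.1}

buzz_on_strong = should_buzz(strong)
buzz_on_weak = should_buzz(weak)
```

Raising the threshold trades attempted questions for precision, which is the same trade-off behind the penalty for wrong answers in the game.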
The Categories

A 30-clue Jeopardy board is organized into six columns. Each column contains five clues and is associated with a category. Categories range from broad subject headings like “history,” “science,” or “politics” to less informative puns like “tutu much,” in which the clues are about ballet, to actual parts of the clue, like “who appointed me to the Supreme Court?” where the clue is the name of a judge, to “anything goes” categories like “potpourri.” Clearly some categories are essential to understanding the clue, some are helpful but not necessary, and some may be useless, if not misleading, for a computer.

A recurring theme in our approach is the requirement to try many alternate hypotheses in varying contexts to see which produces the most confident answers given a broad range of loosely coupled scoring algorithms. Leveraging category information is another clear area requiring this approach.
The Questions

There is a wide variety of ways one can attempt to characterize the Jeopardy clues: for example, by topic, by difficulty, by grammatical construction, by answer type, and so on. A type of classification that turned out to be useful for us was based on the primary method deployed to solve the clue. The