researchers, the open-domain QA problem is attractive as it is one of the most challenging in the realm of computer science and artificial intelligence, requiring a synthesis of information retrieval, natural language processing, knowledge representation and reasoning, machine learning, and computer-human interfaces. It has had a long history (Simmons 1970) and saw rapid advancement spurred by system building, experimentation, and government funding in the past decade (Maybury 2004, Strzalkowski and Harabagiu 2006).

With QA in mind, we settled on a challenge to build a computer system, called Watson, which could compete at the human champion level in real time on the American TV quiz show, Jeopardy! The extent of the challenge includes fielding a real-time automatic contestant on the show, not merely a laboratory exercise.
Jeopardy! is a well-known TV quiz show that has been airing on television in the United States for more than 25 years (see the Jeopardy! Quiz Show sidebar for more information on the show). It pits three human contestants against one another in a competition that requires answering rich natural language questions over a very broad domain of topics, with penalties for wrong answers. The nature of the three-person competition is such that confidence, precision, and answering speed are of critical importance, with roughly 3 seconds to answer each question. A computer system that could compete at human champion levels at this game would need to produce exact answers to often complex natural language questions with high precision and speed and have a reliable confidence in its answers, such that it could answer roughly 70 percent of the questions asked with greater than 80 percent precision in 3 seconds or less.

Finally, the
Jeopardy! Challenge represents a unique and compelling AI question similar to the one underlying Deep Blue (Hsu 2002): can a computer system be designed to compete against the best humans at a task thought to require high levels of human intelligence, and if so, what kind of technology, algorithms, and engineering is required? While we believe the Jeopardy! Challenge is an extraordinarily demanding task that will greatly advance the field, we appreciate that this challenge alone does not address all aspects of QA and does not by any means close the book on the QA challenge the way that Deep Blue may have for playing chess.
The Jeopardy! Challenge requires advancing and incorporating a variety of QA technologies including parsing, question classification, question decomposition, automatic source acquisition and evaluation, entity and relation detection, logical form generation, and knowledge representation and reasoning.

Winning at
Jeopardy! requires accurately computing confidence in your answers. The questions and content are ambiguous and noisy and none of the individual algorithms are perfect. Therefore, each component must produce a confidence in its output, and individual component confidences must be combined to compute the overall confidence of the final answer. The final confidence is used to determine whether the computer system should risk choosing to answer at all. In
Jeopardy! parlance, this confidence is used to determine whether the computer will “ring in” or “buzz in” for a question. The confidence must be computed during the time the question is read and before the opportunity to buzz in. This is roughly between 1 and 6 seconds with an average around 3 seconds.

Confidence estimation was very critical to shaping our overall approach in DeepQA. There is no expectation that any component in the system does a perfect job; all components post features of the computation and associated confidences, and we use a hierarchical machine-learning method to combine all these features and decide whether or not there is enough confidence in the final answer to attempt to buzz in and risk getting the question wrong.

In this section we elaborate on the various aspects of the
Jeopardy! Challenge. The Jeopardy! board is organized into six columns. Each column contains five clues and is associated with a category. Categories range from broad subject headings like “history,” “science,” or “politics” to less informative puns like “tutu much,” in which the clues are about ballet, to actual parts of the clue, like “who appointed me to the Supreme Court?” where the clue is the name of a judge, to “anything goes” categories like “potpourri.” Clearly some categories are essential to understanding the clue, some are helpful but not necessary, and some may be useless, if not misleading, for a computer.

A recurring theme in our approach is the requirement to try many alternate hypotheses in varying contexts to see which produces the most confident answers given a broad range of loosely coupled scoring algorithms. Leveraging category information is another clear area requiring this approach.
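The two themes above, scoring alternate candidate hypotheses and combining per-component confidences into a single buzz decision, can be sketched as follows. This is a minimal illustration, not the DeepQA implementation: the candidate answers, feature names, and weights are invented for the example, and the hand-set logistic combiner stands in for the hierarchical machine-learning model the system actually trains.

```python
# Minimal sketch of confidence combination and the buzz decision.
# All candidates, feature names, weights, and the threshold are
# hypothetical; DeepQA learns its combination model from data.
import math

def combine(features, weights, bias):
    """Logistic combination of loosely coupled scorer outputs
    into a single answer confidence in [0, 1]."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Each candidate answer carries features posted by independent scorers.
candidates = {
    "Chicago": {"type_match": 1.0, "passage_support": 0.8, "popularity": 0.6},
    "Toronto": {"type_match": 0.0, "passage_support": 0.7, "popularity": 0.9},
}

# Illustrative weights; a real system fits these with machine learning.
weights = {"type_match": 2.5, "passage_support": 1.5, "popularity": 0.5}
bias = -2.0

# Score every hypothesis and keep the most confident one.
scored = {answer: combine(f, weights, bias) for answer, f in candidates.items()}
best_answer = max(scored, key=scored.get)

BUZZ_THRESHOLD = 0.5  # risk/reward cutoff for ringing in

if scored[best_answer] > BUZZ_THRESHOLD:
    print(f"Buzz in with: {best_answer} ({scored[best_answer]:.2f})")
else:
    print("Stay silent")
```

In the real system both the combination weights and the buzz threshold, which reflects the game's risk/reward structure, would be learned from historical question data rather than set by hand.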
There are a wide variety of ways one can attempt to characterize the Jeopardy! clues. For example, by topic, by difficulty, by grammatical construction, by answer type, and so on. A type of classification that turned out to be useful for us was based on the primary method deployed to solve the clue. The