Rapid Exploitation and Analysis of Documents
David Buttler, David Andrzejewski, Keith Stevens, David Anastasiu, Byron Gao
Analysts are overwhelmed with information. They have large archives of historical data, both structured and unstructured, and continuous streams of relevant messages and documents that they need to match to current tasks, digest, and incorporate into their analysis. The purpose of the READ project is to develop technologies that make it easier to catalog, classify, and locate relevant information. We approached this task from multiple angles. First, we tackle the issue of processing large quantities of information in reasonable time. Second, we provide mechanisms that allow users to customize their queries based on latent topics exposed from corpus statistics. Third, we assist users in organizing query results, adding localized expert structure over results. Fourth, we use word sense disambiguation techniques to increase the precision of matching user-generated keyword lists with terms and concepts in the corpus. Fifth, we enhance co-occurrence statistics with latent topic attribution to aid entity relationship discovery. Finally, we quantitatively analyze the quality of three popular latent modeling techniques to examine under which circumstances each is useful.
The analysis of unstructured and structured text documents is a fundamental part of both government and business intelligence. Most of the information analysts need to process is represented as text, including newspapers and web sites, scientific articles in journals, and proprietary messages and analysis. The amount of such information is staggering, with hundreds of millions of individual documents. Unfortunately, most analysts have very few tools to evaluate this vast amount of unstructured data. While there have been significant advances in new types of analytic tools, such as Palantir, the most common tool used today by most analysts is simple Boolean keyword search. The project described here extends the capability of the user interfaces that analysts are accustomed to, enabling analysts to assess the relevance of individual documents and find interesting documents from a massive document set.
http://www.palantirtech.com/

There are several aspects of the project, each of which is described below. The main theme is tying together multiple components to create a unified interface that takes advantage of the latest in information retrieval research and systems software. We attempt to address the following analyst problems: managing large numbers of documents, and keeping everything accessible via search; coming up with the right keywords to find targeted concepts or actors; organizing search results so that they are comprehensible; precisely identifying concepts of interest without having to wade through masses of unrelated terms; and searching by entity networks rather than document sets. Finally, we also examine some of the conceptual underpinnings of our approach, measuring various alternatives that can have a large impact on the quality of the statistical summary information we present to analysts to help them understand a corpus, and on the various mechanisms we use to help them in their search.
Managing large corpora of documents is a difficult task. However, recent years have seen significant advances in open-source software infrastructure that make the problem more tractable. The major improvements include information retrieval software, specifically Lucene and Solr, and simplified distributed processing systems, such as the Hadoop Map/Reduce implementation. Solr provides the infrastructure for querying large numbers of documents; it includes faceted search to allow users to refine their search by different aspects of the corpus. However, the facets must be generated and placed there by operators of the system. Creating those facets is often a processing-intensive task requiring both a global view of the corpus and the information in the text of the local document. We use Map/Reduce to distribute the processing load across a cluster to make the processing time tractable.
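The facet-generation step described above follows the standard map/reduce pattern. The following is a minimal sketch, not the project's actual pipeline: it shows how facet counts over document metadata could be computed with a map phase that emits key/value pairs and a reduce phase that aggregates them, as Hadoop would do in parallel across a cluster. The facet fields ("source", "year") and the sample documents are hypothetical.

```python
from collections import defaultdict

def map_phase(doc):
    """Emit (facet_field, facet_value) pairs for one document."""
    pairs = []
    for field in ("source", "year"):
        if field in doc:
            pairs.append((field, doc[field]))
    return pairs

def reduce_phase(all_pairs):
    """Aggregate emitted pairs into per-field value counts."""
    counts = defaultdict(lambda: defaultdict(int))
    for field, value in all_pairs:
        counts[field][value] += 1
    return {field: dict(vals) for field, vals in counts.items()}

def compute_facets(corpus):
    emitted = []
    for doc in corpus:            # in Hadoop, mappers run in parallel
        emitted.extend(map_phase(doc))
    return reduce_phase(emitted)  # reducers aggregate by key

corpus = [
    {"source": "newswire", "year": 2009},
    {"source": "newswire", "year": 2010},
    {"source": "journal",  "year": 2010},
]
facets = compute_facets(corpus)
# facets["source"] → {"newswire": 2, "journal": 1}
```

In a real deployment the reducer output would be loaded into Solr fields so that faceted refinement is available at query time.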
Enhanced Keyword Search
Another issue that comes up is assisting the user in understanding a specialized and unfamiliar corpus. Typical search terms may be less useful, and there are fewer standard external resources (concept hierarchies, user query logs, links to Wikipedia, etc.) that can be leveraged to provide search guidance in internal information systems. In these cases we exploit statistical structure in the corpus to enhance query strings and to provide an overall gist for the corpus.
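One simple form of the corpus-statistics idea above is expanding a query term with the terms it most often co-occurs with in the corpus. This is an illustrative sketch, not necessarily the project's method; the toy corpus and the raw document-level co-occurrence score are assumptions for the example.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(docs):
    """Count how often each pair of distinct terms shares a document."""
    pair_counts = Counter()
    for doc in docs:
        terms = set(doc.split())
        for a, b in combinations(sorted(terms), 2):
            pair_counts[(a, b)] += 1
    return pair_counts

def expand_query(term, pair_counts, k=2):
    """Return the query term plus its top-k co-occurring terms."""
    scores = Counter()
    for (a, b), n in pair_counts.items():
        if a == term:
            scores[b] += n
        elif b == term:
            scores[a] += n
    return [term] + [t for t, _ in scores.most_common(k)]

docs = [
    "uranium enrichment centrifuge",
    "uranium enrichment facility",
    "centrifuge facility inspection",
]
counts = cooccurrence_counts(docs)
expanded = expand_query("uranium", counts, k=2)
```

A production system would normalize these counts (e.g., with pointwise mutual information) so that frequent but uninformative terms do not dominate the expansion.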
As soon as a query is submitted, there are several things that can be done to improve the result lists. Updated facet counts provide one digested view of the results. Re-ranked documents, specialized for a task, provide another view. Relevant latent topic themes give