December 15, 2011

Five Ways Entity Extraction Enhances the Intelligence Cycle
Much usable intelligence comes from unstructured text. It exists in many forms, including electronic documents, email, web pages, news feeds, and social media. The challenge is to quickly identify which documents will most likely yield pertinent intelligence. Entity extraction is an effective tool to enhance this identification process.

We put the World in the World Wide Web®

language idenficaon. Company headquarters are in Cambridge. Soware vendors. mullingual search. (2012-08-15) . Exalead/Dassault Systems. and corpus-derived data. visit www. Microso. enty extracon. and other enterprise applicaons. keeping government and law enforcement ahead of exponenal growth of data storage volumes. EMC. and cheaper techniques to extract forensic evidence. content providers. Northrop Grumman. and Symantec. and identy resoluon in over forty languages. and Tokyo. Our products and services are used by over 250 major firms. SAIC. including Amazon.S. London. with branch offices in San Francisco. Our Rosee® linguiscs plaorm is a widely used suite of interoperable components that power search.basistech. name indexing. and logos used in this document are the property of their respecve owners. Washington. “Rosee”.ABOUT BASIS TECHNOLOGY Basis Technology provides soware soluons for text analycs. Our forensics team pioneers beer. financial compliance. e-discovery. defense and intelligence industry by such firms as CACI. and name translaon. Lockheed Marn. “Geoscope”. © 2012 Basis Technology Corporaon. Hewle-Packard. and “We put the World in the World Wide Web” are registered trademarks of Basis Technology Corporaon. expert rules. We are the top provider of mullingual technology to web and e-commerce search engines. and Yahoo!. Google. “Basis Technology”. faster. Our linguiscs team is at the forefront of applied natural language processing using a combinaon of stascal modeling. service marks. For more informaon. and SRI. including Cisco. All other trademarks. social media monitoring. informaon retrieval. financial instuons. “Odyssey Digital Forensics”. digital forensics. Our text analysis products are widely used in the Massachuses. business intelligence. Oracle. and government agencies worldwide rely on Basis Technology’s soluons for Unicode compliance.

In the processing and exploitation stage of the intelligence cycle, being able to quickly assess the relevance of information as it arrives, whether HUMINT, SIGINT, or OSINT, is key in keeping the cycle short. A shorter cycle allows quicker adjustments to requirements and quicker follow-up of fruitful avenues. Entities (e.g., people, places, and organizations) are frequently the pivotal data points. Being able to quickly pick out the documents mentioning entities of interest speeds the triage of data, and the following analysis and dissemination. This paper discusses the utility of entity extraction for the intelligence cycle, and weighs different methods of entity extraction available today.

THE IMPACT OF ENTITY EXTRACTION

1. Search Beyond Key “Words” to Key “Entities”

Keyword search is good at what it was built to do: return exact keyword matches. However, this approach ends up under-including many relevant documents and over-including many irrelevant documents.

Since keyword search does not know the meaning of words, a search for “car” will miss references to “Chrysler,” “Saturn,” or “Toyota.” An entity extractor, however, can be trained to recognize these references as names of companies or products. Also, keyword search is unlikely to find documents where the keyword is misspelled, whereas entity extraction will find misspelled entities because it finds entities based on their context.

Similarly, in many languages, including English, personal names are created from common nouns, for example, “Mary Jane Brown.” Keyword search may over-include irrelevant documents such as “brown sugar” and “brown Mary Janes” (a type of shoe) when the meaning of the word varies by context.

Let’s consider an example where keyword search is over-inclusive. Given “United Nations,” keyword search may return the “United” of “United Airlines” or “united we stand” or “united as a people.” Entity extraction understands that “United Nations” and “UN” are entities, and only those documents will be flagged.

2. Reveal Rising Trends and Patterns

Entity extraction can illuminate patterns and rising trends when the same entity is flagged in multiple sources. Suppose you are an analyst monitoring incoming chatter. You notice a sudden increase in the mentions of a town in Iowa the president is due to visit in two weeks. This increase might trigger a closer analysis of documents which mention both entities. Noticing this uptick in mentions of the town and the president could lead the analyst to read those documents first, and inform the security arrangements for the president if necessary.

Finding this pattern with keyword search would require pre-knowledge to ask about this town and the president, but entity extraction doesn’t. Extractors can spot new, unknown entities when they first appear. Early detection of unmonitored entities can alert analysts to new, emerging patterns even if they don’t yet know the significance of the entity.

3. Use as Foundational Data for Drawing Relationships

Entities which have a relationship will appear in the same document, for example, “President Obama to visit Ottumwa on Thursday” or “James Bond works for British Secret Service.” [1] Knowing which documents contain entities of interest allows analysts to focus on those first. From examining the smaller set of documents, an analyst may find relationships and conclusions to feed into applications for linking, data visualization, and alerting.

[1] Although the converse statement (“entities appearing in the same document have a relationship”) is not necessarily true!

4. Reduce Costs and Increase Efficiency

Entity extraction reduces not only labor costs, but also the cost of not having the right intelligence soon enough. Without entity extraction, document filtering is done by analysts reading through files and flagging ones of interest. Entity extraction automates that work, leaving more time for analysts to draw connections and accelerate the intelligence cycle.

Savings can be even greater in eliminating the false positives that might trigger a chain reaction, consuming precious human resources on dead ends. Many tools for identifying threats rely upon triage rules to raise alerts upon discovering patterns or significant words. Avoiding over- and under-inclusion of search results means more complete information and fewer chases down fruitless paths.

5. Tighten the Intelligence Feedback Loop

Entity extraction does more than conserve valuable analyst time; it reveals shortcuts through a confusing mountain of data that would not have been found even with double the manpower. What relationship, if any, is there between Obama and Ottumwa? Shortcuts through the data reveal significant patterns quickly and pinpoint relevant documents, giving analysts time to examine and grade their significance. At the same time, producing only (and all) true entities eliminates the need to analyze false information at each stage of the cycle.
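The trend-spotting idea in item 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it presumes some extractor has already produced (day, entity) pairs, and the function name rising_entities, the thresholds, and the sample data are all hypothetical rather than part of any product described here.

```python
from collections import Counter


def rising_entities(mentions, baseline_days, current_day,
                    min_ratio=3.0, min_count=3):
    """Flag entities whose mentions jump on current_day.

    mentions      -- list of (day, entity) pairs, assumed to come from an
                     entity extractor (none is invoked here)
    baseline_days -- non-empty collection of earlier days used for comparison
    Thresholds are illustrative assumptions, not recommended values.
    """
    baseline = Counter(e for day, e in mentions if day in baseline_days)
    current = Counter(e for day, e in mentions if day == current_day)
    flagged = []
    for entity, count in current.items():
        daily_avg = baseline[entity] / len(baseline_days)
        # Treat rarely seen or brand-new entities as if they averaged one
        # mention a day, so the ratio test stays meaningful for them.
        if count >= min_count and count >= min_ratio * max(daily_avg, 1.0):
            flagged.append(entity)
    return sorted(flagged)


# Hypothetical extractor output: mentions of the town spike on the last day.
sample = (
    [("2011-12-01", "Ottumwa"), ("2011-12-01", "Obama"), ("2011-12-02", "Obama")]
    + [("2011-12-03", "Ottumwa")] * 4
    + [("2011-12-03", "Obama")] * 2
)
print(rising_entities(sample, {"2011-12-01", "2011-12-02"}, "2011-12-03"))
# prints ['Ottumwa']
```

Because the counts are keyed on whatever entities the extractor emits, no one has to know in advance to ask about the town, which is the point the text makes about keyword search.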

enes can populate analyst repositories of structured data. paern and trend analysis. • Easy customizaon—Every enty extractor works best on the genre of material it was trained on. a verb in electronics. and tools cannot be either. the appearance of enes may trigger new avenues of invesgaon. thus a certain amount of customizability is required in almost any applicaon.Individually. Comprehensive intelligence analysis requires proficiency in the languages of Asia. so the capability should offer easy integraon with other analysis tools to present data in a format that is easy to use and manipulate. link and data visualizaon. whether it means training on text with a different profile or adding custom-enes. and elsewhere. Five Ways Enty Extracon Enhances the Intelligence Cycle 5 . Effecve enty extracon has these characteriscs: • Foreign language capabilies—Intelligence is not limited to English or European languages. In the aggregate. Within a single document. judicious deployment of enty extracon decreases the manpower required to idenfy and isolate relevant informaon. while reducing costs and increasing the quality and speed of intelligence analysis. many common words can occur with different meanings. Only true enes should be flagged. and identy resoluon. or the last name of a person. the Middle East. • Context-sensive extracon—The word “gates” may be a plural noun in real estate. driving further analysis for fact and concept extracon. ESSENTIAL ELEMENTS OF ENTITY EXTRACTION The analysis stage of the intelligence cycle depends on effecve enty extracon to analyze mulple languages and draw accurate conclusions—even when words are used with different meanings in different contexts. Since most analysis centers around enes. • Seamless integraon—Enty extracon is just one step in the analysis-process chain.

based on the rules it has “learned. and email addresses. new enes can be discovered without retraining.1999” or “10*9*1999” can confuse them. medicines. There are also common paerns that are used in different ways.” are usually the name of a person. Stascal—These extractors idenfy enes based on the context of surrounding words. the models creates their own rules for idenfying the context in which enes appear. so a list of countries will find not only “China” (the country) but also “fine china. Unconvenonal formang such as “10. An extractor should have a small footprint and be fast enough to rapidly analyze high volumes of data. an extractor should be at least 85% accurate in major languages.” With this approach. such as Social Security numbers. postal codes. A list of important geographical and polical enes can extract the most important ones. and over 90% for English and the most common European languages. street addresses. or weapons work well with this approach.” Paern Matching—These mathemacal paerns do a good job with enes that fall into set paerns. Further. easy to extend. Through machine training. soware can only idenfy enes defined by known rules. For instance. Each of these approaches works well with some enes but fails with others which leads us to our final approach: 6 Five Ways Enty Extracon Enhances the Intelligence Cycle . and oen quite accurate within its narrow scope. Stascal models are trained on hundreds of thousands of words which have been marked by human annotators. Rule-Based—Sophiscated paerns are oen referred to as “rules.9. Rules like these require connuous maintenance by experts. METHODS OF ENTITY EXTRACTION Enty extractors use several different approaches: List-Based—A simple list of all enes in a category is simple to implement. However. However. Finite categories like names of countries. and new enes require reprogramming. A phone number and a SKU may have the same structure. 
it could result in a SKU mistakenly being extracted as a phone number. they lack context.” Rule-based enty extractors look for more complex paerns. paerns can also be myopic and lack context. lists require constant maintenance and the enes must be known a priori. • High throughput—The volume of data to be analyzed—from both closed and open sources —is growing exponenally every year. but since the regular expression detector isn’t aware of the surrounding context. Enty extracon is then a stascal analysis of the probability of each word being an enty. phone numbers. Furthermore. the words that follow “Mr.• High accuracy—To be useful.

paern matching. Pre-Trained and Customizable Enty Extracon Rosee Enty Extractor comes ready to idenfy a wide-range of enty types. BASIS TECHNOLOGY’S APPROACH TO ENTITY EXTRACTION The Rosee® Enty Extractor is a hybrid mechanism that integrates the results from three techniques: list-based. • Easy customizaon—Users can add custom enes via regular expressions or lists. and stascal. The hybrid approach will produce a more accurate result than any one approach alone. but also includes stascal data generated from the new documents. place. Five Ways Enty Extracon Enhances the Intelligence Cycle 7 . Apache Solr. places. Pashto. The pre-wrien regular expressions for the paern-matching modules support a variety of formats for each enty type. .Hybrid—This soluon integrates the results from mulple approaches. The target text is fed to all three modules and then a fourth module called the redactor balances the results and acts as judge when answers conflict. or Java applicaon programming interface (API). Japanese Korean. dtSearch. C++. which handle a high volume of transacons and require high quality for every system component. • High accuracy and throughput—Rosee’s accuracy and speed is industry-tested and used by customers such as Microso Bing. or enhance the stascal model with training data with addional data relevant to the user’s problem domain. including English. Rosee uses a weighted set of criteria to merge results and idenfy people. It has a database of carefully curated enes for the list-based module. Persian. which are representave of a variety of genres for each supported language Many users customize Rosee to their own problem domains by adding their own lists of enes and regular expressions. • Context-sensive extracon—Rosee’s stascal models consider context when extracng key enes such as person. and major European languages.and other search engines. 
Rosee Enty Extractor Features • Foreign language capabilies—Rosee extracts enes from text in many languages. and other enes. • Seamless integraon—Rosee is a soware development kit (SDK) accessible via via single C. It has been designed for simple integraon with Apache Lucene. The stascal modules achieve high accuracy by pre-training on carefully tagged and quality checked corpora.NET. The stascal models can be enhanced by feeding the training soware documents with the same profile as those needing enty extracon. Chinese. The enhanced model retains all of the pre-loaded knowledge . It weighs the results based upon the known strength and reliability of each approach. Urdu. Arabic. and organizaon (including company names).

ENTITY EXTRACTION IN ACTION: PREDICTING IEDS

Rosette Entity Extractor was incorporated into an application to predict where improvised explosive devices (IEDs) might be found. The application used entities to find links between messages and then clustered the messages to identify possible “hot spots” where the probability of future IED attacks was high.

However, examining coalition message traffic is a far cry from news articles which entity extractors are normally trained on.

HAMID KHALILI and his followers were spotted on the border of PAK and AFG using IMINT from UAVs. It is believed this group is responsible for the recent IDF attacks in MsE (42SUF31386402, JADID District, BALKH Province). Coalition forces are working with AFG locals in this region to track KHALILI before he escapes too far into the border of PAK.

Out-of-the-box, the entity extractor did not recognize “PAK” as “Pakistan,” a place, and required additional training. In an ideal world, from 2MB to 20MB of message text would have been annotated and added to the model, but time and data were limited. Instead, in three days, two annotators and a software developer added about 100 annotated messages and rebuilt the statistical model. The result was more accurate entity extraction for more accurate message linking to detect IED hot spots.
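The linking step in this case study, finding messages that share entities, can be sketched as follows. This is a simplified illustration, not the application described above: it swaps the retrained statistical model for a plain alias table, assumes entities have already been extracted per message, and uses hypothetical names (ALIASES, link_messages) and sample data. The expansion of “AFG” to Afghanistan is an assumption; only “PAK” is spelled out in the text.

```python
from itertools import combinations

# Alias table standing in for what the retrained model learned.
# "AFG" -> "Afghanistan" is an assumption made for this sketch.
ALIASES = {"PAK": "Pakistan", "AFG": "Afghanistan"}


def normalize(entity):
    """Map a known abbreviation to its canonical name; leave others as-is."""
    return ALIASES.get(entity, entity)


def link_messages(message_entities):
    """Return {(id_a, id_b): shared entities} for every pair with overlap.

    message_entities -- dict mapping a message id to the set of entity
                        strings some extractor found in that message
    """
    normalized = {
        mid: {normalize(e) for e in entities}
        for mid, entities in message_entities.items()
    }
    links = {}
    for a, b in combinations(sorted(normalized), 2):
        shared = normalized[a] & normalized[b]
        if shared:
            links[(a, b)] = shared
    return links


# Hypothetical extractor output for three messages.
extracted = {
    "m1": {"PAK", "KHALILI"},
    "m2": {"Pakistan"},
    "m3": {"BALKH Province"},
}
print(link_messages(extracted))
# prints {('m1', 'm2'): {'Pakistan'}}
```

Without the normalization step, m1 and m2 would share nothing, which is why unrecognized abbreviations such as “PAK” weaken any downstream linking or clustering.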

Rosee Language Idenfier sorts documents by language and encoding. Rosee Enty Extractor finds enes from these words. With good data on enes going into the intelligence cycle’s data processing stage. 5. We will be happy to assist you in evaluang the performance of our products on your data. Reliable Enty Extracon as a Foundaon for Analycs Government agencies and technology companies trust Rosee Enty Extractor to augment the mechanisms and intelligence analysis used to prepare an increasingly large collecon of text for exploita” “Obama.. 2.” a misspelled “Barak Obama. 4. Rosee Base Linguiscs idenfies word and sentence boundaries and performs other linguisc processing to allow search engines to index the data for highly accurate search. 1. “Barack Obama.ONE STEP OF AN INTEGRATED DOCUMENT EXPLOITATION WORKFLOW Rosee Enty Extractor is just one step in the text analysis pipeline that starts with mullingual text and ends with a master index of enes in the document set or a translated list of names. focused and dependable conclusions can be quickly generated and acted upon. 3. Rosee Name Indexer matches different instances of the same name (e. Rosee Name Translator displays English translaons of foreign names to help Englishspeaking analysts. or write to info@basistech. EXPLORE FURTHER For more informaon or to request an evaluaon.g. Five Ways Enty Extracon Enhances the Intelligence Cycle 9 . please call us at 617-386-2090 or 800-697-2062.” and “President Obama”).