
Text Mining: Tools, Techniques, and Applications

Nathan Treloar, President, AvaQuest, Inc.

Outline
Text Mining Defined
Foundations of Text Mining
Example Applications
User Interface Challenges
The Future

   

© 2002, AvaQuest Inc.

Mining Medical Literature 


Medical research
Find causal links between symptoms or diseases and drugs or chemicals.

A Real Example 

Research objective:
  - Follow chains of causal implication to discover a relationship between migraines and biochemical levels.

Data:
  - medical research papers, medical news (unstructured text information)

Key concept types:
  - symptoms, drugs, diseases, chemicals...

Example Application: Medical Research 
      

stress is associated with migraines
stress can lead to loss of magnesium
calcium channel blockers prevent some migraines
magnesium is a natural calcium channel blocker
spreading cortical depression (SCD) is implicated in some migraines
high levels of magnesium inhibit SCD
migraine patients have high platelet aggregability
magnesium can suppress platelet aggregability
(source: Swanson and Smalheiser, 1994)
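
The findings above pair up into chains: each one links either migraines or magnesium to a shared intermediate concept. A minimal Python sketch of following such chains is given here. The link list is a hand-encoded paraphrase of the slide's findings, and `find_bridges` is a hypothetical helper name, not part of any tool mentioned in this deck.

```python
from collections import defaultdict

# Hand-encoded (concept, concept) links paraphrasing the findings listed above.
LINKS = [
    ("migraine", "stress"),
    ("stress", "magnesium"),
    ("migraine", "calcium channel blockers"),
    ("calcium channel blockers", "magnesium"),
    ("migraine", "spreading cortical depression"),
    ("spreading cortical depression", "magnesium"),
    ("migraine", "platelet aggregability"),
    ("platelet aggregability", "magnesium"),
]

def find_bridges(links, a, c):
    """Return the intermediate concepts B that are linked to both a and c."""
    neighbors = defaultdict(set)
    for x, y in links:
        # Treat links as undirected associations for this sketch.
        neighbors[x].add(y)
        neighbors[y].add(x)
    return sorted(neighbors[a] & neighbors[c])

bridges = find_bridges(LINKS, "migraine", "magnesium")
for b in bridges:
    print(f"migraine <-> {b} <-> magnesium")
```

Each printed line is one candidate chain of implication; in practice the links would come from entity extraction over the literature rather than a hand-written list.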

Text Mining Defined 

Discover useful and previously unknown "gems" of information in large text collections

"Search" versus "Discover"

                           Search (goal-oriented)    Discover (opportunistic)
Structured Data            Data Retrieval            Data Mining
Unstructured Data (Text)   Information Retrieval     Text Mining

Data Retrieval
Find records within a structured database.
Database type: Structured
Search mode: Goal-driven
Atomic entity: Data record
Example information need: "Find a Japanese restaurant in Boston that serves vegetarian food."
Example query: "SELECT * FROM restaurants WHERE city = 'boston' AND type = 'japanese' AND has_veg = true"

Information Retrieval 

Find relevant information in an unstructured information source (usually text)
Database type: Unstructured
Search mode: Goal-driven
Atomic entity: Document
Example information need: "Find a Japanese restaurant in Boston that serves vegetarian food."
Example query: "Japanese restaurant Boston" or Boston->Restaurants->Japanese

Data Mining 

Discover new knowledge through analysis of data
Database type: Structured
Search mode: Opportunistic
Atomic entity: Numbers and dimensions
Example information need: "Show trend over time in number of visits to Japanese restaurants in Boston"
Example query: "SELECT date, SUM(visits) FROM restaurants WHERE city = 'boston' AND type = 'japanese' GROUP BY date ORDER BY date"

Text Mining 

Discover new knowledge through analysis of text
Database type: Unstructured
Search mode: Opportunistic
Atomic entity: Language feature or concept
Example information need: "Find the types of food poisoning most often associated with Japanese restaurants"
Example query: Rank diseases found associated with "Japanese restaurants"
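
One simple way to read the example query is as co-occurrence counting: for every document that mentions the anchor phrase, count which disease terms appear alongside it. The Python sketch below assumes a toy document list and a hypothetical disease vocabulary; both are invented for illustration, and `rank_cooccurring` is not a function from any product named in this deck.

```python
from collections import Counter

# Toy documents and vocabulary, invented for illustration only.
DOCS = [
    "Norovirus outbreak traced to a Japanese restaurant downtown.",
    "Health inspectors found salmonella at a Japanese restaurant.",
    "A second Japanese restaurant reported norovirus cases.",
    "Salmonella found at a burger stand.",
]
DISEASES = ["salmonella", "norovirus", "e. coli"]

def rank_cooccurring(docs, anchor, vocabulary):
    """Rank vocabulary terms by the number of docs that also mention the anchor."""
    counts = Counter()
    for doc in docs:
        text = doc.lower()
        if anchor.lower() in text:
            # Each term counts at most once per document.
            counts.update(term for term in vocabulary if term in text)
    return counts.most_common()

ranking = rank_cooccurring(DOCS, "Japanese restaurant", DISEASES)
print(ranking)
```

Plain substring matching stands in here for the entity extraction step described later in the deck.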

Motivation for Text Mining
Approximately 90% of the world's data is held in unstructured formats (source: Oracle Corporation).
Information-intensive business processes demand that we move beyond simple document retrieval to "knowledge" discovery.

[Chart: structured numerical or coded information, 10%; unstructured or semi-structured information, 90%]

Challenges of Text Mining 

Very high number of possible "dimensions"
  - All possible word and phrase types in the language!!

Unlike data mining:
  - records (= docs) are not structurally identical
  - records are not statistically independent

Complex and subtle relationships between concepts in text
  - "AOL merges with Time-Warner"
  - "Time-Warner is bought by AOL"

Ambiguity and context sensitivity
  - automobile = car = vehicle = Toyota
  - Apple (the company) or apple (the fruit)

The Emergence of Text Mining 

Advances in text processing technology
  - Natural Language Processing (NLP)
  - Computational Linguistics

Cheap Hardware!
  - CPU
  - Disk
  - Network

Text Processing 


Statistical Analysis
  - Quantify text data

Language or Content Analysis
  - Identifying structural elements
  - Extracting and codifying meaning
  - Reducing the dimensions of text data

Statistical Analysis 

Use statistics to add a numerical dimension to unstructured text
  - Term frequency
  - Document frequency
  - Term proximity
  - Document length
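
The four statistics listed above can each be computed in a few lines. The following Python sketch assumes a very simple tokenizer (lowercased alphanumeric runs); the function names are illustrative choices, not taken from the original slides.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

def text_statistics(docs):
    """Return per-document term frequencies, document frequencies, and lengths."""
    term_freqs = [Counter(tokenize(d)) for d in docs]
    doc_freq = Counter()
    for tf in term_freqs:
        doc_freq.update(tf.keys())  # each term counts once per document
    lengths = [sum(tf.values()) for tf in term_freqs]
    return term_freqs, doc_freq, lengths

def min_distance(text, term_a, term_b):
    """Smallest token distance between two terms, or None if either is absent."""
    tokens = tokenize(text)
    pos_a = [i for i, t in enumerate(tokens) if t == term_a]
    pos_b = [i for i, t in enumerate(tokens) if t == term_b]
    if not pos_a or not pos_b:
        return None
    return min(abs(i - j) for i in pos_a for j in pos_b)

docs = [
    "magnesium can suppress platelet aggregability",
    "high levels of magnesium inhibit SCD",
]
term_freqs, doc_freq, lengths = text_statistics(docs)
print(doc_freq["magnesium"], lengths, min_distance(docs[0], "magnesium", "platelet"))
```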

Content Analysis
Lexical and Syntactic Processing
  - Recognizing "tokens" (terms)
  - Normalizing words
  - Language constructs (parts of speech, sentences, paragraphs)

Semantic Processing
  - Extracting meaning
  - Named Entity Extraction (people names, company names, locations, etc.)

Extra-semantic features
  - Identify feelings or sentiment in text

Goal = Dimension Reduction

Syntactic Processing
Lexical analysis
  - Recognizing word boundaries
  - Relatively simple process in English

Syntactic analysis
  - Recognizing larger constructs
  - Sentence and paragraph recognition
  - Parts of speech tagging
  - Phrase recognition
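
For English, word and sentence boundaries can be approximated with regular expressions. The Python sketch below is deliberately naive: it assumes sentences end in ".", "!" or "?" followed by whitespace, so abbreviations such as "Inc." would be split incorrectly. Both function names are illustrative.

```python
import re

def split_sentences(text):
    """Split after ., ! or ? when followed by whitespace (naive heuristic)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def split_words(sentence):
    """Return word tokens, keeping internal apostrophes and hyphens."""
    return re.findall(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*", sentence)

text = "AOL merges with Time-Warner. Time-Warner is bought by AOL!"
sentences = split_sentences(text)
print(sentences)
print(split_words(sentences[0]))
```

Part-of-speech tagging and phrase recognition need trained models or grammars and are out of scope for a sketch of this size.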

Named Entity Extraction 



Identify and type language features
Examples:
  - People names
  - Company names
  - Geographic location names
  - Dates
  - Monetary amounts
  - Others... (domain specific)
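
Some entity types, such as dates and monetary amounts, follow surface patterns regular enough to sketch with regular expressions. The Python example below is an assumption-laden illustration: the two patterns cover only "Month YYYY" dates and dollar amounts, and names of people, companies, and places generally need dictionaries or trained models instead.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)

# Illustrative patterns only; real extractors handle many more formats.
PATTERNS = {
    "DATE": re.compile(r"\b(?:" + MONTHS + r")\s+\d{4}\b"),
    "MONEY": re.compile(r"\$\d[\d,]*(?:\.\d+)?(?:\s(?:million|billion))?"),
}

def extract_entities(text):
    """Return (type, text) pairs for every pattern match, in order of appearance."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), label, match.group()))
    return [(label, value) for _, label, value in sorted(found)]

sample = "In February 2002 the deal was valued at $1.5 billion."
print(extract_entities(sample))
```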

Simple Entity Extraction

"The quick brown fox jumps over the lazy dog"

"The quick brown fox": Noun phrase, Mammal, Canidae
"the lazy dog": Noun phrase, Mammal, Canidae

Entity Extraction in Use 

Categorization
  - Assign structure to unstructured content to facilitate retrieval

Summarization
  - Get the "gist" of a document or document collection

Query expansion
  - Expand query terms with related "typed" concepts

Text Mining
  - Find patterns, trends, relationships between concepts in text

Extra-semantic Information 

Extracting hidden meaning or sentiment based on use of language.
  - Examples:
    - "Customer is unhappy with their service!"
    - Sentiment = discontent

Sentiment is:
  - Emotions: fear, love, hate, sorrow
  - Feelings: warmth, excitement
  - Mood, disposition, temperament, ...

Or even (someday)...
  - Lies, sarcasm
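
A very rough approximation of sentiment tagging is to count words from small positive and negative word lists. The Python sketch below assumes a tiny hand-made lexicon invented for illustration; it ignores negation ("not happy"), sarcasm, and everything else that makes the problem hard.

```python
import re

# Tiny hand-made lexicon, an assumption for illustration; real lexicons are far larger.
NEGATIVE = {"unhappy", "hates", "angry", "complains"}
POSITIVE = {"happy", "satisfied", "loves", "pleased"}

def tag_sentiment(text):
    """Label text as positive, negative, or neutral by counting lexicon words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score < 0:
        return "negative"
    if score > 0:
        return "positive"
    return "neutral"

print(tag_sentiment("Customer is unhappy with their service!"))
```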

Text Mining: General Applications 

Relationship Analysis
  - If A is related to B, and B is related to C, there is potentially a relationship between A and C.

Trend analysis
  - Occurrences of A peak in October.

Mixed applications
  - Co-occurrence of A together with B peaks in November.
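
Trend and mixed analyses both reduce to counting mentions per time period. The Python sketch below assumes documents arrive as (month, text) pairs and uses the placeholder terms "alpha" and "beta" for A and B; the data is invented to mirror the slide's October and November example.

```python
from collections import Counter

# Invented (month, text) pairs; "alpha" stands for A and "beta" for B.
DATED_DOCS = [
    ("2001-10", "alpha report"),
    ("2001-10", "alpha update"),
    ("2001-10", "alpha again"),
    ("2001-11", "alpha and beta together"),
    ("2001-11", "beta and alpha again"),
]

def monthly_counts(dated_docs, *terms):
    """Count documents per month that mention every one of the given terms."""
    counts = Counter()
    for month, text in dated_docs:
        lowered = text.lower()
        if all(term.lower() in lowered for term in terms):
            counts[month] += 1
    return counts

def peak_month(counts):
    """Return the month with the highest count, or None if there are no counts."""
    return max(counts, key=counts.get) if counts else None

print(peak_month(monthly_counts(DATED_DOCS, "alpha")))
print(peak_month(monthly_counts(DATED_DOCS, "alpha", "beta")))
```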

Text Mining: Business Applications 

Ex 1: Decision Support in CRM
  - What are customers' typical complaints?
  - What is the trend in the number of satisfied customers in Cleveland?

Ex 2: Knowledge Management
  - People Finder

Ex 3: Personalization in eCommerce
  - Suggest products that fit a user's interest profile (even based on personality info).

Example 1:
Decision Support using Bank Call Center Data 

The Needs:
  - Analysis of call records as input into decision-making process of Bank's management
  - Quick answers to important questions
    - Which offices receive the most angry calls?
    - What products have the fewest satisfied customers?
    - ("Angry" and "Satisfied" are recognizable sentiments)
  - User friendly interface and visualization tools

Example 1:
Decision Support using Bank Call Center Data 

The Information Source:
  - Call center records
  - Example:

AC2G31, 01, 0101, PCC, 021, 0053352, NEW YORK, NY, H-SUPRVR8, STMT, "mr stark has been with the company for about 20 yrs. He hates his stmt format and wishes that we would show a daily balance to help him know when he falls below the required balance on the account."

Example 1:
Call Volume by Sentiment

[Bar chart: Negative Calls Related to Bank Statements, by city (Cleveland, New York, Boston); vertical axis from 0 to 1000 calls]
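
A chart like this one can be built by parsing each call record, tagging the free-text comment, and counting by city. The Python sketch below rests on several assumptions: the field positions (city in column 7, topic code in column 10, comment last) are inferred from the single sample record, the first record is an abridged version of that sample, the second record is hypothetical, and the word list is a toy lexicon.

```python
import csv
from collections import Counter

NEGATIVE_WORDS = {"hates", "angry", "unhappy"}  # toy lexicon, an assumption

RECORDS = [
    # Abridged from the sample record shown earlier.
    'AC2G31, 01, 0101, PCC, 021, 0053352, NEW YORK, NY, H-SUPRVR8, STMT, '
    '"mr stark hates his stmt format and wishes that we would show a daily balance"',
    # Hypothetical neutral record, invented for contrast.
    'ZZ0000, 01, 0101, PCC, 021, 0000000, BOSTON, MA, H-SUPRVR1, STMT, '
    '"customer asked for a copy of a statement"',
]

def negative_calls_by_city(lines, topic="STMT"):
    """Count calls per city whose comment contains a negative word for the topic."""
    counts = Counter()
    # skipinitialspace lets the quoted comment follow ", " and still parse as one field.
    for row in csv.reader(lines, skipinitialspace=True):
        if len(row) < 11:
            continue  # skip malformed or empty rows
        city, code, comment = row[6], row[9], row[10]
        words = set(comment.lower().replace(".", " ").split())
        if code == topic and words & NEGATIVE_WORDS:
            counts[city] += 1
    return counts

print(negative_calls_by_city(RECORDS))
```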

Example 2: KM People Finder 

The Needs:
  - Find people as well as documents that can address my information need.
  - Promote collaboration and knowledge sharing
  - Leverage existing information access system

The Information Sources:
  - Email, groupware, online reports, ...

Example 2: Simple KM People Finder

[Diagram: Query -> Search or Navigation System -> Relevant Docs -> Name Extractor (using an Authority List) -> Ranked People Names]
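
The name extraction and ranking step of this pipeline can be sketched as matching an authority list against the documents a search returned, then ordering people by how many documents mention them. In the Python sketch below, the documents and names are invented, `rank_people` is a hypothetical helper, and the search step itself is assumed to have already produced the relevant documents.

```python
import re
from collections import Counter

def rank_people(relevant_docs, authority_list):
    """Rank authority-list names by the number of relevant docs that mention them."""
    counts = Counter()
    for doc in relevant_docs:
        for name in authority_list:
            # Whole-word, case-insensitive match of the full name.
            pattern = r"\b" + re.escape(name) + r"\b"
            if re.search(pattern, doc, flags=re.IGNORECASE):
                counts[name] += 1
    return counts.most_common()

# Invented search results and authority list, for illustration only.
relevant_docs = [
    "Report on text mining by Ann Lee and Bob Ray.",
    "Ann Lee: notes on entity extraction.",
    "Meeting minutes, no authors.",
]
authority_list = ["Ann Lee", "Bob Ray", "Cy Ng"]
print(rank_people(relevant_docs, authority_list))
```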

Example 2: KM People Finder


Example 3: Personalized Movie ³Matcher´ 

The Need:
  - Match movies to individuals based on preference profile

The Information:
  - Written reviews of movies
  - Users' lists of favorite movies.

[Diagram: Movie Reviews -> Sentiment Analysis -> Typed and Tagged Reviews]

Sentiment Analysis of Movies: Visualization (after Evans)
[Figure: Action and Romance movies compared on a 0 to 1 scale across sentiment concepts: absurdity, insecurity, injustice, conflict, crime, inferiority, death, deception, immorality, horror, fear, destruction]

Commercial Tools
IBM Intelligent Miner for Text
Semio Map
InXight LinguistX / ThingFinder
LexiQuest
ClearForest
Teragram
SRA NetOwl Extractor
Autonomy

      

User Interfaces for Text Mining 


Need some way to present results of Text Mining in an intuitive, easy to manage form.
Options:
  - Conventional text "lists" (1D)
  - Charts and graphs (2D)
  - Advanced visualization tools (3D+)
    - Network maps
    - Landscapes
    - 3D "spaces"

UI Challenges

Simple lists, charts, and graphs are either not obviously applicable or difficult to work with because of the high dimensionality of text.

Advanced visualization tools can be intimidating for the general community and are not readily accepted.

Charts and Graphs
http://www.cognos.com/


Visualization: Network Maps
http://www.thinkmap.com/


Visualization: Network Maps
http://www.lexiquest.com/


Visualization: Landscapes
http://www.aurigin.com/


Visualization: 3D Spaces
http://zing.ncsl.nist.gov/~cugini/uicd/cc-paper.html


The Future 



Different tools and data, but common dimensions
Example:
  - "Find sales trends by product and correlate with occurrences of company name in business news articles"
  - Dimensions: Time, Company names (or stock symbols), Product names, Regions

Recent Events 

February 2002
  - Meta Group posts report arguing for need to integrate business intelligence applications with knowledge management portals.

March 2002
  - SAS, leading provider of business intelligence software solutions, partners with Inxight to introduce true text mining product.