Bristol-Myers Squibb
2007 PIUG Northeast Conference
New Brunswick, New Jersey
Introduction – Text Mining & Visualization
Overview of Text Mining Tools
■ Capabilities
■ Data Sources
■ Results
■ Strengths
Summary
Why do we need a tool to do text mining?
Type of Tool
Capabilities
Data Sources
Results
Strengths
Summary
* Text mining tool slides are provided courtesy of the vendors.
Text Mining Capabilities
Keyword Analysis
■ Extracting nouns or noun phrases in text without
understanding their meaning or relationships or
counting the number of times the nouns appear
Statistical Analysis
■ Frequency-based analysis – counting the number
of times a word appears in the text
Linguistic Analysis
■ Natural language processing (NLP) – “Trained Agent”
■ Semantic analysis
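As a rough illustration of the frequency-based (statistical) analysis listed above, the following minimal Python sketch counts how often each word appears in a text. The stop-word list and the sample sentence are made up for this example and do not come from any of the tools discussed.

```python
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "in", "and", "or", "to", "is", "for"}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_frequencies(text: str) -> Counter:
    """Count how often each non-stop-word appears in the text."""
    return Counter(t for t in tokenize(text) if t not in STOP_WORDS)


# Made-up sample text.
sample = "The stem cell therapy uses stem cells. Stem cell patents cite cell culture methods."
freqs = term_frequencies(sample)
for term, count in freqs.most_common(3):
    print(term, count)
```

Note that this counts surface forms only ("cell" and "cells" are separate terms) and has no notion of meaning, which is the limitation that linguistic analysis is meant to address.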
Text Mining Data Sources
■ Unstructured text
►Full-text documents, emails
■ Structured text
►Database records, such as records from STN or PubMed
■ Hybrid content
►Patents: the front page is structured, the body text is not
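To make the hybrid case concrete, here is a small Python sketch of a patent record that keeps structured front-page fields separate from the unstructured body text. The class name, field names, and sample values are all hypothetical placeholders, not a real patent or a real database schema.

```python
from dataclasses import dataclass, field


@dataclass
class PatentRecord:
    # Structured front-page fields (hypothetical selection).
    publication_number: str
    title: str
    assignee: str
    # Unstructured free text.
    abstract: str = ""
    claims: list[str] = field(default_factory=list)

    def structured_fields(self) -> dict[str, str]:
        """Return the fields a database-style tool can query directly."""
        return {
            "publication_number": self.publication_number,
            "title": self.title,
            "assignee": self.assignee,
        }

    def unstructured_text(self) -> str:
        """Return the free text that needs text mining to analyze."""
        return "\n".join([self.abstract, *self.claims])


# Placeholder values only; this is not a real patent.
record = PatentRecord(
    publication_number="XX0000000",
    title="Example title",
    assignee="Example Corp",
    abstract="An example abstract.",
    claims=["1. An example claim."],
)
print(record.structured_fields())
print(record.unstructured_text())
```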
Data Sources
I. General Data Sources (Unstructured):
ClearForest
GoldFire Innovator
Inxight
OmniViz
Temis
Results
■ Static categorization of key concepts
■ Accurate answers to questions
■ Dynamic document summarization
GoldFire Innovator - Strengths
Precision retrieval of targeted R&D content
►Retrieves information from context – semantic
indexing
►Automated summaries and categorization
►Relevance filtering and ranking
Using natural language query to search
►Ask the right questions - How to dry paper? How to
balance diets?
Innovation Trend Analysis
► Competitive analysis
► Technology analysis
► Patent relationship analysis – citation analysis
Inxight
Type of Tool
■ Text mining software tool.
Capability
■ Natural Language Processing
■ Contextual extractions (leaning towards semantic analysis)
Data Source
■ Unstructured text from websites, internal repositories, full-
text documents
■ Documents have to be pre-processed to extract meta-data
and identify entity types
Results
■ Hierarchical categorization
Inxight - Strengths
Federated Search capability
Claims to be more accurate than a
human reader
Software can work in 32 languages
and can understand 27 entity types
Can process 1.2 gigabytes per hour
Claims to have the most powerful
linguistic algorithms in the field
Temis
Type of tool
■ Text Mining Solutions - software
Capability
■ Natural language processing
►Insight Discoverer™ Extractor – info extraction server powered
by XeLDA and used with specialized Skill Cartridges
►Insight Discoverer™ Categorizer – doc categorization server
►Insight Discoverer™ Clusterer – automated classification server
►XeLDA – Multilingual linguistic engine for natural language processing
►Skill Cartridge – A set of customizable knowledge components
that define the information to be extracted. The two major knowledge
components are multilingual dictionaries and multilingual
extraction rules (which establish relationships between defined concepts)
Skill Cartridge Overview
Open architecture
■ Plug & Play annotation components
■ Each defines areas of interests & extraction rules
■ Extraction rules describe the sentence structure that characterizes a concept
[Figure: Insight Discoverer™ Extractor. Text (any kind, any format) is processed by XeLDA™ into words (any concept); plug & play Skill Cartridges™ then assign meaning. Two example cartridges are shown. Merger & Acquisition: meaning = acquisition; target & buyer; amount & date. Positive & Negative Sentiment Analysis: meaning = satisfaction; people, companies, products; satisfaction; support.]
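The Skill Cartridge idea described above (a dictionary plus extraction rules that describe the sentence structure of a concept) can be sketched in a few lines of Python. This is only an illustration using regular expressions. It is not how Temis implements its cartridges, and the dictionary entries, rule pattern, and company names are invented for the example.

```python
import re

# Hypothetical dictionary: surface terms mapped to the concept they signal.
DICTIONARY = {"acquired": "Acquisition", "bought": "Acquisition"}

# Hypothetical extraction rule: "<Buyer> <acquisition verb> <Target> [for <amount>]".
_VERBS = "|".join(re.escape(term) for term in DICTIONARY)
RULE = re.compile(
    rf"(?P<buyer>[A-Z]\w+) (?P<verb>{_VERBS}) (?P<target>[A-Z]\w+)"
    r"(?: for (?P<amount>\$[\d.]+ (?:million|billion)))?"
)


def extract(text: str) -> list[dict]:
    """Apply the rule to the text and return one record per match."""
    hits = []
    for m in RULE.finditer(text):
        hits.append(
            {
                "meaning": DICTIONARY[m["verb"]],
                "buyer": m["buyer"],
                "target": m["target"],
                "amount": m["amount"],  # None when no amount is stated
            }
        )
    return hits


# Made-up sentences with fictional company names.
print(extract("Alpha acquired Beta for $2.5 billion."))
print(extract("Gamma bought Delta last year."))
```

Keeping the dictionary separate from the rule mirrors the plug & play idea: swapping in a different dictionary and pattern changes what is extracted without touching the `extract` function.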
Temis
Data Sources
■ Any kind, any format, Internal & external data,
documents, literature, patents, clinical trials,
chemistry and biology, bioinformatics, internet,
email, etc
Results
■ Clusters, Rankings, Lists to discover information
trends and relationships
Temis - Strengths
Searching by concepts
► Selecting concepts from concept tree
OmniViz
Type of tool
■ Visual-based data/text mining software
Capability
■ Algorithm-based statistical analysis, not semantics
Data source/type
■ Numeric, text, categorical, chem. structures, sequence,
structured/unstructured text
Results
■ Interactive visualization maps such as CoMet,
Correlation, Galaxy Proximity, etc.
OmniViz - Strengths
■ Interactive visualizations
■ Supports analysis of large amounts of data
(millions of documents) - numeric, categorical and
full-text analysis, including patents.
■ Broad applications including gene expression,
sequence & pathway analysis, chemical
structures, cheminformatics, clinical trial, patent
analysis, diagnosis and treatment, legal,
marketing data, regulatory compliance,
intelligence analysis, etc.
■ Flexible data import and merge capabilities
ClearForest
Type of Tool
■ Text mining tool (text analytics solution)
Capability
■ Semantic analysis/NLP
Data Sources
■ Unstructured text – websites
■ Patents
■ Internal documents
■ Meta-data
Results
■ Structured data entities
■ List of potential solutions for identified issues
■ Visualization tools – trend graphs, category maps
►Color and font are used to show intensity of relationships
ClearForest
Text Analytics: How it Works
[Figure: labels recoverable from the diagram. Inputs: Patents (MicroPatent Search, U.S. Patent Search); Text, Word, Excel, etc.; Database Fields. Processing: Tagging Platform; Extraction Across Records, including domain-specific entities & relationships. Output: Unified Role-Based Analysis Interfaces.]
The Galaxy view organizes references according to how they are related conceptually.
References on farming and herbs, either their cultivation or use as herbicides, are found in
the upper left region of the Galaxy.
Groups in the lower right focus on herbs in medicine.
The region in between farming and medicine contains a mix of
references about herbage diets in farm animals, herbal extracts
from plants, and research on health effects of herbicide exposure.
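One common way to quantify how conceptually related two references are is cosine similarity over word-count vectors. A map such as the Galaxy view could then place similar documents near each other. The Python sketch below illustrates that general idea only; it is a stand-in technique, not the vendor's actual algorithm, and the three short texts are invented.

```python
import math
import re
from collections import Counter


def vectorize(text: str) -> Counter:
    """Turn a text into a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Made-up snippets standing in for three groups of references.
docs = {
    "farming": "herb cultivation on farms and herbicide use in farming",
    "medicine": "herbal medicine and herb extracts in clinical medicine",
    "mixed": "health effects of herbicide exposure and herbal extracts",
}
vectors = {name: vectorize(text) for name, text in docs.items()}

for left, right in [("farming", "medicine"), ("farming", "mixed"), ("medicine", "mixed")]:
    print(left, right, round(cosine(vectors[left], vectors[right]), 3))
```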
Quosa
Type of tool
■ Text mining tool based on concept extraction/clustering
Capability
■ Statistical analysis (term extraction, frequency ranking,
concept extraction using dynamic extraction algorithm from
MIT/Harvard)
Data sources
■ unstructured text - PubMed, Ovid, Google Scholar
■ Patents
■ Internal documents
Results
■ Highly organized collection of documents (folders on
shared server or local machine)
■ Team sharing and annotating
Quosa - Strengths
Aureka
Type of tool
■ content and software tool specializing in visualization and
citation analysis
Capability
■ Keyword and Statistical Analysis
Data Sources
■ patent databases listed in MicroPatent’s FullText collection
Results
■ ThemeScape maps, hyperbolic citations trees, text clusters
Aureka ThemeScape Map of Stem Cell Technology
A ThemeScape map of a large set of documents provides an
initial view of the content. Additional probing and analysis of
the map will help to reveal more insight.
Citation Tree of Patent EP0778277
A cited patent provides insight into a corporation's strategic intent with that patent:
building a picket fence, holding a non-core patent, or lacking R&D interest.
Aureka – Strengths
Annotation capabilities
Strong visualization analysis
►Patent mapping with ThemeScape
►Clustering by Vivisimo
Wisdomain
Type of tool
■ Content and software tool. Web-based searching and
citation tool. Analysis module is local
Capability
■ Keyword analysis, citation map visualized
searching
Data Sources
■ Patents, specialized in US, EP, PCT, PAJ,
INPADOC legal and family status, China abs,
Korea abs
Results
■ Genealogy tree, Tables, charts
Wisdomain - Strengths
Strong citation analysis capability
►Backward and forward citations, with more than
one level of nesting
►Collateral citation analysis
►Citation alerts
Genealogy Tree
►Good for competitive analysis and licensing
strategy planning
[Figure: Genealogy tree. The subject patent is shown on a timeline (applied 1990, issued 1993, with the pending period marked) among boxes for the other patents in the tree.]
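Backward and forward citation analysis with more than one level of nesting, as listed under the Wisdomain strengths above, amounts to walking a citation graph to a chosen depth. The Python sketch below shows the idea with a breadth-first walk. The patent identifiers and citation links are invented, and this is not Wisdomain's implementation.

```python
from collections import deque

# Hypothetical data: each patent maps to the patents it cites (backward citations).
CITES = {
    "P3": ["P1", "P2"],
    "P4": ["P3"],
    "P5": ["P3", "P4"],
}


def invert(cites: dict[str, list[str]]) -> dict[str, list[str]]:
    """Build the forward-citation map: each patent maps to the patents citing it."""
    cited_by: dict[str, list[str]] = {}
    for citing, cited_list in cites.items():
        for cited in cited_list:
            cited_by.setdefault(cited, []).append(citing)
    return cited_by


def walk(graph: dict[str, list[str]], start: str, max_depth: int) -> dict[str, int]:
    """Breadth-first walk; return {patent: nesting level} up to max_depth."""
    levels: dict[str, int] = {}
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the requested nesting level
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                levels[nxt] = depth + 1
                queue.append((nxt, depth + 1))
    return levels


backward = walk(CITES, "P3", max_depth=2)          # patents that P3 builds on
forward = walk(invert(CITES), "P1", max_depth=2)   # patents that build on P1
print(backward)
print(forward)
```

The nesting level returned for each patent is the shortest citation distance from the subject patent, which is the kind of information a genealogy tree lays out visually.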
■ Phase II
►Pilot selected tools
►Identify potential client groups and interview
representative clients
Closing Remarks
Acknowledgements
Peter Mattei Aureka
Thomas Klose ClearForest
Shelley Pavlek GoldFire/Invention Machine
Joanne Freeman Inxight
Marlene Khouri M-CAM
Heahyun Yoo OmniViz
Tony Medina PatAnalyst
Michael Rogers Quosa
Karen Stesis RefViz
Tisha Zawisky Temis
Lou Ann DiNallo VantagePoint
Mary Talmadge-Grebenar Wisdomain
Joseph Bezek
Claudia Powers
Ramesh Durvasula (Informatics)
Ronald Stoner (Mead Johnson)
Questions