
Text Mining & Visualization

Impressions of emerging capabilities

Cynthia Barcelon-Yang (speaker)


Yun Yun Yang (speaker)
Lucy Akers

Bristol-Myers Squibb
2007 PIUG Northeast Conference
New Brunswick, New Jersey
Introduction – Text Mining & Visualization
Overview of Text Mining Tools
■ Capabilities
■ Data Sources
■ Results
■ Strengths
Summary
Why do we need a tool to do text mining?

Welcome to the age of too much information...


Typical questions asked of IP Operations
Often, the IP Operations group within an organization provides centralized support
to a wide range of business units, and is responsible for answering the following:

How many patents do we have concerning technology ‘x’?
How does our portfolio compare with company ‘ABC’?
Who is citing our portfolio?
Which patents does business unit ‘xyz’ own?
Which patents should we divest as a result of selling division XYZ?
How do our invention disclosures compare with current granted patents?
How do we improve our patent operations?
What is text mining?
(according to Marti Hearst of UC Berkeley School of Information)
■ The discovery of new, previously unknown information, by
automatically extracting information from different written
resources.
■ A variation on a field called data mining, that tries to find
interesting patterns from large databases.
■ Many researchers think it will require a full simulation of how the
mind works before we can write programs that read the way
people do.
■ Related field: computational linguistics (also known as natural
language processing)
■ Hearst distinguishes between "real" text mining, that discovers
new pieces of knowledge, and approaches that find overall
trends in textual data.
Text Mining Process

Courtesy of: Invention Machine Corp.


Common Tasks
List generation (can be displayed as histograms)
List cleanup and grouping of concepts
Co-occurrence matrices and other graphing
Clustering, categorization, grouping and
extraction of text
Mapping document clusters or concepts
Adding temporal components to maps
Citation analysis
Subject/Action/Object (SAO) functions (a.k.a.
NLP)
Federated searching, e.g. on the Internet or intranets
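Two of the tasks above, list generation and co-occurrence matrices, can be sketched in a few lines. This is a minimal illustration only: it assumes each document has already been reduced to a set of terms, and the sample data and function names are hypothetical, not taken from any of the tools evaluated.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample: each document already reduced to a set of terms.
docs = [
    {"stem cell", "culture medium", "differentiation"},
    {"stem cell", "differentiation"},
    {"culture medium", "bioreactor"},
]

def term_list(documents):
    """List generation: count how many documents mention each term."""
    counts = Counter()
    for terms in documents:
        counts.update(terms)
    return counts

def cooccurrence(documents):
    """Count how many documents mention each unordered pair of terms."""
    pairs = Counter()
    for terms in documents:
        # Sorting makes (a, b) and (b, a) land on the same key.
        for a, b in combinations(sorted(terms), 2):
            pairs[(a, b)] += 1
    return pairs

term_counts = term_list(docs)
pair_counts = cooccurrence(docs)

# Print the list as a crude text histogram, most frequent first.
for term, n in term_counts.most_common():
    print(f"{term:<16} {'#' * n}")
```

The pair counts are the cells of a symmetric co-occurrence matrix; a real tool would add list cleanup before counting so that spelling variants do not split a term.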
Project Planning
■ Phase I
►Literature searches, key references, brainstorming of
text/data mining & visualization
►Identify potential tools to evaluate
►Vendor onsite demonstrations
► Summary of initial tool evaluations
■ Phase II
►Pilot selected tools
►Identify potential client groups and interview
representative clients
Investigation & Process Approach

■ Scout the literature/internet sources & brainstorm


■ Benchmark
■ “Patinformatics – Tools and Tasks” by Tony Trippe,
World Patent Information 25 (2003) 211–221
■ “Data Visualization Tools - A Perspective from
the Pharmaceutical Industry” by Jeannette Eldridge,
World Patent Information 28 (2006) 43–49
■ Vendor demos
Tools Initially Identified

AnaVist
Anacubis
Aureka
Bioalma
BizInt
ClearForest
Delphion
Entrieva (Semio)
GoldFire
Inxight
M-CAM
Matheo Patent
OmniViz
PatAnalyst
Quosa
Technology Watch
Temis
VantagePoint
Vivisimo
Wisdomain
Wistract
Vendor Tool Demonstrations
1. Quosa
2. Inxight
3. PatAnalyst
4. OmniViz
5. Temis
6. Aureka
7. Wisdomain
8. GoldFire
9. VantagePoint
10. ClearForest
11. M-CAM
12. RefViz
* Overview of Vendor Tools

Type of Tool
Capabilities
Data Sources
Results
Strengths
Summary
* Text mining tool slides are provided courtesy of the vendors.
Text Mining Capabilities
Keyword Analysis
■ Extracting nouns or noun phrases from text, without
understanding their meaning or relationships and without
counting the number of times they appear
Statistical Analysis
■ Frequency-based analysis – counting the number
of times a word appears in the text
Linguistic Analysis
■ Natural language processing (NLP) – “Trained Agent”
■ Semantic analysis
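The difference between keyword analysis and frequency-based analysis can be shown with a small sketch. It assumes a simple regular-expression tokenizer and a hypothetical stop-word list in place of real noun-phrase extraction, which would need a part-of-speech tagger; none of this reflects how any evaluated tool works internally.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real tool would use a part-of-speech
# tagger to keep only nouns and noun phrases.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "to", "for", "by"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def keywords(text):
    """Keyword analysis: the distinct candidate terms, with no counts."""
    return {t for t in tokenize(text) if t not in STOP_WORDS}

def term_frequencies(text):
    """Statistical analysis: how many times each term appears."""
    return Counter(t for t in tokenize(text) if t not in STOP_WORDS)

sample = "The pump fails in cold weather. The pump relay shorts."
print(sorted(keywords(sample)))
print(term_frequencies(sample).most_common(2))
```

Neither function knows that "pump relay" is a single part or that "fails" is a problem; recognizing such roles and relationships is what the linguistic analysis category adds.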
Text Mining Data Sources

■ Unstructured text
►Full-text documents, emails
■ Structured text
►Database records, such as records from STN,
PubMed
■ Hybrid content
►Patents: the front page is structured, the body text is not
Data Sources
I. General Data Sources (Unstructured):
ClearForest
GoldFire Innovator
Inxight
OmniViz
Temis

II. Bibliographic Data Sources (Structured):


Quosa
RefViz
VantagePoint

III. Patent-Focused (Hybrid):


Aureka
M-CAM
PatAnalyst
Wisdomain
Evaluation Template
Type of Tool
■ Text mining software tool
■ Database content provider
■ Both
Capabilities
■ Keyword analysis
■ Statistical analysis
■ Linguistic analysis
Data Sources
■ Structured bibliographic data sources
■ Unstructured sources – full-text web, email, corporate repositories, etc.
■ Hybrid sources – patents, combination of structured/unstructured
Results
■ Lists of documents
■ Tables
■ Charts/Graphs
■ Maps
Strengths – Disclaimer: Our impressions only!
Summary
GoldFire Innovator

Type of tool – text mining tool
Technology – Semantic Analysis
Data Sources
■ Unstructured information from personal data,
corporate data, deep web, content, patents,
internet
►15 MM worldwide patents
►Database of over 8000 scientific effects
►3000 cross-disciplinary scientific deep web websites

Results
■ Static categorization of key concepts
■ Accurate answers to questions
■ Dynamic document summarization
GoldFire Innovator - Strengths
Precision retrieval of targeted R&D content
►Retrieves information from context – semantic
indexing
►automated summaries and categorization
►Relevant filtering and ranking
Using natural language query to search
►Ask the right questions - How to dry paper? How to
balance diets?
Innovation Trend Analysis
► Competitive analysis
► Technology analysis
► Patent relationship analysis – citation analysis
Inxight
Type of Tool
■ Text mining software tool
Capability
■ Natural Language Processing
■ Contextual extractions (leaning towards semantic analysis)
Data Source
■ Unstructured text from websites, internal repositories,
full-text documents
■ Documents have to be pre-processed to extract metadata
and identify entity types
Results
■ Hierarchical categorization
Inxight - Strengths
Federated Search capability
Vendor claims greater accuracy than a human reader
Software can work in 32 languages
and can understand 27 entity types
Can process 1.2 gigabytes per hour
Vendor claims the most powerful
linguistic algorithms in the field
Temis
Type of tool
■ Text Mining Solutions - software
Capability
■ Natural Language processing
►Insight Discoverer™ Extractor – information extraction server powered
by XeLDA and used with specialized Skill Cartridges
►Insight Discoverer™ Categorizer – document categorization server
►Insight Discoverer™ Clusterer – automated classification server
►XeLDA – multilingual linguistic engine for natural language processing
►Skill Cartridge – a set of customizable knowledge components
that define the information to be extracted. The two major knowledge
components are multilingual dictionaries and multilingual
extraction rules (which establish relationships between defined concepts)
Skill Cartridge Overview
Open architecture
■ Plug & Play annotation components
■ Each defines areas of interest & extraction rules
■ Extraction rules describe the sentence structure that characterizes a concept

[Figure: Insight Discoverer™ Extractor with Plug & Play Skill Cartridges™, built on XeLDA™]
■ Input: Text (any kind, any format); Words (any concept)
■ Merger & Acquisition cartridge: Meaning = Acquisition (target & buyer, amount & date, ...)
■ Positive & Negative Sentiment Analysis cartridge: Meaning = Satisfaction (people, companies, products; satisfaction; support; ...)
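As a rough illustration of an extraction rule that describes the sentence structure characterizing a concept, the sketch below uses a plain regular expression for the Merger & Acquisition example (buyer, target, amount). The pattern, the function name, and the company names are all hypothetical; this is not Skill Cartridge syntax, which the slides do not show.

```python
import re

# Hypothetical rule: "<buyer> acquired|acquires|will acquire <target> [for <amount>]".
# A plain regular expression stands in for a real extraction rule language.
ACQUISITION_RULE = re.compile(
    r"(?P<buyer>[A-Z][\w&-]*(?: [A-Z][\w&-]*)*)"
    r" (?:acquired|acquires|will acquire) "
    r"(?P<target>[A-Z][\w&-]*(?: [A-Z][\w&-]*)*)"
    r"(?: for (?P<amount>\$[\d.,]+ (?:million|billion)))?"
)

def extract_acquisitions(text):
    """Return one dict (buyer, target, amount) per match of the rule."""
    return [m.groupdict() for m in ACQUISITION_RULE.finditer(text)]

# Invented company names, used only to exercise the rule.
sample = "Alpha Corp acquired Beta Labs for $2.4 billion. Gamma Inc acquires Delta."
hits = extract_acquisitions(sample)
for hit in hits:
    print(hit)
```

A rule like this is brittle compared with a linguistic engine: it relies on capitalization and a fixed verb list, which is the gap that dictionaries and multilingual rules are meant to close.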
Temis

Data Sources
■ Any kind, any format, Internal & external data,
documents, literature, patents, clinical trials,
chemistry and biology, bioinformatics, internet,
email, etc
Results
■ Clusters, Rankings, Lists to discover information
trends and relationships
Temis - Strengths

Searching by concepts
► Selecting concepts from concept tree

Specialized Skill cartridges


► Life science Skill Cartridges
– Analytics
– Text Mining 360°
– Competitive Intelligence
– Human Resources Management
► General Skill Cartridges
– Biological Entity Relationships – best selling
– Medical Entity Relationships
– Chemical Entity Relationships
– Competitive Intelligence Life Sciences Edition
Temis - Strengths

Strong extraction, categorization, and clustering capabilities
Robust XeLDA linguistic engine
Quick trend analysis
Chemical Document Browser – specialized
extraction module for chemical substance
nomenclature translation to chemical
structures.
OmniViz

Type of tool
■ Visual-based data/text mining software

Capability
■ Algorithm-based statistical analysis, not semantics

Data source/type
■ Numeric, text, categorical, chemical structures, sequence,
structured/unstructured text
Results
■ Interactive visualization maps such as CoMet,
Correlation, Galaxy Proximity, etc.
OmniViz - Strengths

■ Interactive visualizations
■ Supports analysis of large amounts of data
(millions of documents) - numeric, categorical and
full-text analysis, including patents.
■ Broad applications including gene expression,
sequence & pathway analysis, chemical
structures, cheminformatics, clinical trial, patent
analysis, diagnosis and treatment, legal,
marketing data, regulatory compliance,
intelligence analysis, etc.
■ Flexible data import and merge capabilities
ClearForest
Type of Tool
■ Text mining tool (text analytics solution)
Capability
■ Semantic analysis/NLP
Data Sources
■ Unstructured text – websites
■ Patents
■ Internal documents
■ Metadata
Results
■ Structured data entities
■ List of potential solutions for identified issues
■ Visualization tools – trend graphs, category maps
►Color and font are used to show intensity of relationships
ClearForest
Text Analytics: How it Works
[Figure: ClearForest text analytics pipeline]
Inputs
■ Unstructured documents – Text, Word, Excel, Email, WWW, PDF
■ Database – DB text fields
Tagging Platform (including domain-specific entities & relationships) – XML output
<PartProblemCondition>
<Part> Fuel Pump </Part>
<Problem> Fails </Problem>
<Condition> Corroded </Condition>
</PartProblemCondition>
Extraction Across Records – Output DB
Part          Problem    Condition
Fuel Pump     Fails      Corroded
Pump Relay    Shorts     Cold weather
Headlight     Fails      Running hot
Engine        Stalls     At low speeds
Unified Role-Based Analysis Interfaces
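The step from tagged XML to database rows on this slide can be sketched with Python's standard library. The element names come from the slide; the `<Records>` wrapper, the second sample record's layout, and the function name are assumptions added for illustration, not ClearForest's actual output format or API.

```python
import xml.etree.ElementTree as ET

# Tagged output in the shape shown on the slide; the <Records> wrapper
# is an assumption added so that several records form one XML document.
TAGGED = """
<Records>
  <PartProblemCondition>
    <Part> Fuel Pump </Part>
    <Problem> Fails </Problem>
    <Condition> Corroded </Condition>
  </PartProblemCondition>
  <PartProblemCondition>
    <Part> Pump Relay </Part>
    <Problem> Shorts </Problem>
    <Condition> Cold weather </Condition>
  </PartProblemCondition>
</Records>
"""

def to_rows(xml_text):
    """Flatten each tagged record into a (part, problem, condition) tuple."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for record in root.iter("PartProblemCondition"):
        rows.append(tuple(
            (record.findtext(tag) or "").strip()
            for tag in ("Part", "Problem", "Condition")
        ))
    return rows

rows = to_rows(TAGGED)
for row in rows:
    print(row)
```

Once the records are rows, the "extraction across records" step becomes ordinary database work, for example grouping by part to see which problems recur.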
ClearForest
Packaged Extraction Modules
Inputs
■ Patents – MicroPatent Search, U.S. Patent Search
■ Text, Word, Excel, etc.
■ Database Fields

Outputs
Structured Data Entities
■ Agent
■ Application Number
■ Assignee
■ Assignee Address
■ Examiner
■ Filing Date
■ Inventor
■ Inventor Address
■ IPC
■ Issue Date
■ Number Of Claims
■ Patent Citations
■ Patent Number
■ US Class
Entities
■ Claim Element
■ Claim Invention
■ Extracted Terms
■ Invention Terms
■ Measurement Terms
■ Number of Claims
■ Patent Section
■ Problem Solved Terms
■ Problems Solved
■ Process Technology Terms
■ Technology Terms
ClearForest - Strengths
Can be applied to a wide range of
applications as evidenced by wide variety of
available extraction modules
■ Security/intelligence gathering
■ Product/customer information
■ Corporate/People profiles
■ Patents
■ Biomedical entities
Analytics tool can discover unexpected
relationships between entities that would not
have been otherwise uncovered by standard,
manual methods.
VantagePoint
Type of tool
■ Text mining software mainly used for technology
assessment and company profiling
Capability
■ Uses pattern matching, rule-based, and natural language
processing techniques
Data Sources
■ Works best with structured data – text data from
bibliographic databases
Results
■ Summaries, charts, matrices, maps, and graphs
VantagePoint - Key Features
Rapid navigation in large abstract collections
Helps find relationships within your data
Visually displays relationships
Buckets documents to help in categorization
Utilities for cleaning data
User-created thesauri for reducing data
Scripting capabilities to automate knowledge-gathering
Easily exports output to other applications
Can be configured to text mine most forms of
structured bibliographic data
VantagePoint - Strengths

List Creation and Cleanup
■ Patent assignee, author, inventor
■ Pre-built IPC, user-created thesauri
Analytical tool box
■ Rapid navigation in large abstract collections to
answer who, where, what, when but not how and
why
■ Visually displays relationships
Scripting capabilities to automate knowledge-gathering
■ Configure to extract from structured databases
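List cleanup with a user-created thesaurus can be pictured as mapping name variants to one preferred form before counting. The sketch below is a generic illustration under that assumption; the thesaurus entries, sample names, and functions are hypothetical and do not represent VantagePoint's thesaurus format or scripting interface.

```python
from collections import Counter

# Hypothetical user-created thesaurus: normalized variant -> preferred name.
THESAURUS = {
    "bristol myers squibb co": "Bristol-Myers Squibb",
    "bristol-myers squibb company": "Bristol-Myers Squibb",
    "bms": "Bristol-Myers Squibb",
}

def normalize(name):
    """Lowercase, drop punctuation except hyphens, and collapse spaces."""
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch in " -")
    return " ".join(cleaned.split())

def clean_list(names, thesaurus):
    """Map each raw name to its preferred form and count the results."""
    counts = Counter()
    for raw in names:
        key = normalize(raw)
        # Names without a thesaurus entry are kept as entered.
        counts[thesaurus.get(key, raw.strip())] += 1
    return counts

raw_assignees = [
    "Bristol Myers Squibb Co.",
    "BMS",
    "Bristol-Myers Squibb Company",
    "Example Pharma Inc",
]
cleaned = clean_list(raw_assignees, THESAURUS)
print(cleaned)
```

Without this step the three variants would appear as three separate assignees in any list, histogram, or matrix built from the data.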
RefViz
Type of tool
■ Text Analysis and Data Visualization software
Capability
■ Statistical and Linguistic analysis
►“Mathematical signature” – relationship of words
►Uses a thesaurus tool
Data Sources
■ Only structured data from title, abstracts/notes
fields, or ISI Web of Science, PubMed, OCLC,
Output
Results
■ “Galaxy” & matrix visualization
RefViz - Strengths

■ Reference Retriever™ can search multiple online sources simultaneously
■ Can be used together with EndNote, ProCite, and
Reference Manager to provide an additional level
of analysis to existing reference collections
■ Analyzes large numbers of references by thematic
content
■ Interactive, visual landscape
Reveal trends and associations in references

The Galaxy view organizes references according to how they are related conceptually.

References on farming and herbs, either their cultivation or use as herbicides, are found in
the upper left region of the Galaxy.
Groups in the lower right focus on herbs in medicine.
The region in between farming and medicine contains a mix of
references about herbage diets in farm animals, herbal extracts
from plants, and research on health effects of herbicide exposure.
Quosa
Type of tool
■ Text mining tool based on concept extraction/clustering
Capability
■ Statistical analysis (term extraction, frequency ranking,
concept extraction using dynamic extraction algorithm from
MIT/Harvard)
Data sources
■ Unstructured text – PubMed, Ovid, Google Scholar
■ Patents
■ Internal documents
Results
■ Highly organized collection of documents (folders on
shared server or local machine)
■ Team sharing and annotating
Quosa - Strengths

Full-text retrieval and management of scientific documents
■ Get the full article from a journal or patent
gateway
► PubMed, Ovid, USPTO website
■ Document Summary from My Article
Organizer
■ Download to EndNote
M-CAM Doors™
Type of tool
■ Patent database provider, with text analysis and risk management
solution
Capability
■ Linguistic & semantic-based analysis, multilingual
Data Sources
■ Patents from over 88 patenting authorities, 50 million patent documents
■ Journal articles (by the end of summer 2006)
Results
■ “Compass” citation view
■ “Magellan” telescope & hourglass – patent life timeline
■ Patent uniqueness and enforceability analysis
■ Competitive intelligence analysis – financial risk analysis for
merger/acquisition and stock trading
M-CAM Doors™
Hourglass view – shows behavior and intent

Red bar – cited patents
Blue bar – citing patents
Green bar – concurrent art – share pendency
Purple bar – volume of uncited patents
Orange bar – volume of patents that did not cite subject patent
M-CAM Doors™ - Strengths
Powerful visual interface for citation analysis
with related family & legal status views
Can rate each patent for its uniqueness,
reliance on related patents, and enforcement
potential – based on Hourglass view
Can rank patent clusters by relevance to
business objectives
Competitive Intelligence/Investment
Research
■ New Patent Thursday™, Patent Portfolio
Confidence Rating™, Custom PPCR™
PatAnalyst
Type of tool
■ Patent database provider – integrated source (UNIPAT) of patent
databases from US, PCT, EPO, PAJ, Germany, UK, France and
Switzerland
■ Patent search & examination service
Capability
■ No text mining algorithm
Data Sources
■ 51.5 MM patent documents – bibliographic data from 70 countries
from EPO
■ 15 MM full-text documents – 8 countries/patenting authorities
Results
■ Viewer – analyze and organize the patent documents/families
■ Easy-to-use analytical colored text-highlighting of keywords
■ Organized folders of documents
PatAnalyst - Strengths

Powerful user interface with enhanced display features
■ Highlighted keywords appear in different colors
■ Side-by-side views of full-text and standard
bibliographic data
■ Integrated IPC category trees
■ “Live” legal status & patent family tree view from
EPO Viewer (EPOQUE)
■ Combined search of full-text & bibliographic data
Aureka

Type of tool
■ Content and software tool specializing in visualization and
citation analysis
Capability
■ Keyword and Statistical Analysis
Data Sources
■ Patent databases listed in MicroPatent’s FullText collection
Results
■ ThemeScape maps, hyperbolic citation trees, text clusters
Aureka ThemeScape Map of Stem Cell Technology

A ThemeScape map of a large set of documents provides an initial view of the content.
Additional probing and analysis of the map will help to reveal more insight.
Citation Tree of Patent EP0778277
A cited patent provides insight into a corporation’s strategic intent with a patent:
building a picket fence, a non-core patent, or a lack of R&D interest.
Aureka – Strengths

Strong citation analysis tool
►Interactive citation tree – intelligence analysis
and strategic planning

Annotation capabilities
Strong visualization analysis
►Patent mapping with ThemeScape
►Clustering by Vivisimo
Wisdomain
Type of tool
■ Content and software tool. Web-based searching and
citation tool. Analysis module is local
Capability
■ Keyword analysis, citation map visualized
searching
Data Sources
■ Patents, specialized in US, EP, PCT, PAJ,
INPADOC legal and family status, China abs,
Korea abs
Results
■ Genealogy tree, tables, charts
Wisdomain - Strengths
Strong citation analysis capability
►Backward and forward citations, more than
one level of nesting
►Collateral citation analysis
►Citation alerts
Genealogy Tree
►Good in competitive analysis and licensing
strategy planning
Graphic view of the search results
Collateral Citation
Identifying similar patents sharing the same pending period with the subject patent

[Figure: timeline of the subject patent's pending period (applied 1990, issued 1993), with other patents plotted around it and the key collateral patent highlighted]

7 collateral patents are identified based on indirect citation relations.
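The idea of collateral citation, finding patents that were pending at the same time as the subject patent and are related to it only indirectly, can be sketched as an interval-overlap test plus a citation check. This is an illustration under stated assumptions: "indirect citation relation" is modeled here as sharing at least one cited reference with no direct citation either way, and the patent identifiers and dates are hypothetical. The slides do not describe Wisdomain's actual method.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Patent:
    number: str          # hypothetical identifiers, not real patents
    applied: date
    issued: date
    cites: frozenset     # patent numbers this patent cites

def pending_overlap(a, b):
    """True when the two pending periods (applied..issued) intersect."""
    return a.applied <= b.issued and b.applied <= a.issued

def collateral(subject, candidates):
    """Candidates pending at the same time that share a cited reference.

    Sharing a cited reference is an assumed stand-in for the slide's
    'indirect citation relations'; direct citations are excluded.
    """
    return [
        p for p in candidates
        if p.number != subject.number
        and pending_overlap(subject, p)
        and subject.cites & p.cites
        and subject.number not in p.cites
        and p.number not in subject.cites
    ]

subject = Patent("S1", date(1990, 3, 1), date(1993, 6, 1), frozenset({"R1", "R2"}))
pool = [
    Patent("A", date(1991, 1, 1), date(1994, 1, 1), frozenset({"R2"})),  # overlaps, shares R2
    Patent("B", date(1994, 1, 1), date(1996, 1, 1), frozenset({"R1"})),  # filed after S1 issued
    Patent("C", date(1989, 5, 1), date(1992, 2, 1), frozenset({"R9"})),  # no shared reference
]
found = [p.number for p in collateral(subject, pool)]
print(found)
```

Patents found this way could not have cited each other as prior art while both were pending, which is why ordinary backward and forward citation searches tend to miss them.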
Summary
Vendor Name   Strength                                                  Potential User Groups
ClearForest   Extraction modules                                        Business Intelligence
GoldFire      Sophisticated semantic analysis tool                      R&D scientists
Inxight       Extraction & Federated Search                             R&D Informatics
OmniViz       Interactive visualization                                 R&D scientists
Temis         Extraction using Specialized Skill Cartridges             R&D scientists, Business Intelligence
Quosa         Full-text retrieval & mgmt                                R&D scientists
RefViz        Bibliographic data post-processing                        R&D scientists, Information Professionals
VantagePoint  Analytical tool box for technology or company assessment  Information Professionals, Business Intelligence
Aureka        Patent mapping, clustering & citation analysis            Legal/Patent Dept., R&D scientists, Information Professionals, Strategic Planning, Business Intelligence
M-CAM         Patent uniqueness & enforcement analysis                  Business Intelligence, Legal/Patent Dept., Information Professionals
PatAnalyst    Powerful full-text user interface with display features   Information Professionals, R&D scientists
Wisdomain     Strong collateral citation analysis                       R&D scientists, Information Professionals
Path Forward

■ Phase II
►Pilot selected tools
►Identify potential client groups and interview
representative clients
Closing Remarks
Acknowledgements
Peter Mattei Aureka
Thomas Klose ClearForest
Shelley Pavlek GoldFire/Invention Machine
Joanne Freeman Inxight
Marlene Khouri M-CAM
Heahyun Yoo OmniViz
Tony Medina PatAnalyst
Michael Rogers Quosa
Karen Stesis RefViz
Tisha Zawisky Temis
Lou Ann DiNallo VantagePoint
Mary Talmadge-Grebenar Wisdomain
Joseph Bezek
Claudia Powers
Ramesh Durvasula (Informatics)
Ronald Stoner (Mead Johnson)
Questions
