Text Mining & Visualization: Impressions of Emerging Capabilities

Cynthia Barcelon-Yang (speaker)

Yun Yun Yang (speaker)
Lucy Akers
Bristol-Myers Squibb
2007 PIUG Northeast Conference
New Brunswick, New Jersey
Introduction – Text Mining & Visualization
Overview of Text Mining Tools
■ Capabilities
■ Data Sources
■ Results
■ Strengths
Summary
Why do we need a tool to do text mining?
Welcome to the age of too much information...

Typical questions asked of IP Operations
Often, the IP Operations group within an organization provides centralized support
to a wide range of business units, and is responsible for answering the following:
How many patents do we have concerning technology ‘x’?
How does our portfolio compare with company ‘ABC’?
Who is citing our portfolio?
Which patents does business unit ‘xyz’ own?
Which patents should we divest as a result of selling division XYZ?
How do our invention disclosures compare with currently granted patents?
How do we improve our patent operations?
What is text mining?
(according to Marti Hearst of UC Berkeley School of Information)
■ The discovery of new, previously unknown information, by
automatically extracting information from different written
resources.
■ A variation on a field called data mining, that tries to find
interesting patterns from large databases.
■ Many researchers think it will require a full simulation of how the
mind works before we can write programs that read the way
people do.
■ Draws on computational linguistics (also known as natural language
processing)
■ Hearst distinguishes between "real" text mining, that discovers
new pieces of knowledge, and approaches that find overall
trends in textual data.
Text Mining Process
Courtesy of: Invention Machine Corp.

Common Tasks
List generation (can be displayed as histograms)
List cleanup and grouping of concepts
Co-occurrence matrices and other graphing
Clustering, categorization, grouping and
extraction of text
Mapping document clusters or concepts
Adding temporal components to maps
Citation analysis
Subject/Action/Object (SAO) functions (a.k.a.
NLP)
Federated searching e.g. on Internet or Intranets
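Two of the tasks above, list generation and co-occurrence matrices, can be sketched in a few lines. This is only a minimal illustration on made-up records, not a description of how any of the vendor tools work; the sample terms and the function names `term_list` and `cooccurrence` are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical records: each document reduced to a set of cleaned-up terms.
docs = [
    {"stem cell", "culture medium", "differentiation"},
    {"stem cell", "differentiation"},
    {"culture medium", "bioreactor"},
]

def term_list(documents):
    """List generation: count how many documents mention each term."""
    counts = Counter()
    for terms in documents:
        counts.update(terms)
    return counts

def cooccurrence(documents):
    """Co-occurrence matrix stored sparsely, keyed by sorted term pairs."""
    pairs = Counter()
    for terms in documents:
        for a, b in combinations(sorted(terms), 2):
            pairs[(a, b)] += 1
    return pairs

counts = term_list(docs)      # can be displayed as a histogram
matrix = cooccurrence(docs)   # can be displayed as a matrix or graph
```

The term counts are the kind of list that would be cleaned up and grouped before analysis; the pair counts are the raw material for a co-occurrence display.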
Project Planning
■ Phase I
►Literature searches, key references, brainstorming of
text/data mining & visualization
►Identify potential tools to evaluate
►Vendor onsite demonstrations
► Summary of initial tool evaluations
■ Phase II
►Pilot selected tools
►Identify potential client groups and interview
representative clients
Investigation & Process Approach
■ Scout the literature/internet sources & brainstorm

■ Benchmark
■ “Patinformatics – Tools and Tasks” by Tony Trippe,
World Patent Information 25 (2003) 211–221
■ “Data Visualization Tools - A Perspective from
the Pharmaceutical Industry” by Jeannette Eldridge,
World Patent Information 28 (2006) 43–49
■ Vendor demos
Tools Initially Identified
AnaVist
Anacubis
Aureka
Bioalma
BizInt
ClearForest
Delphion
Entrieva (Semio)
GoldFire
Inxight
M-CAM
Matheo Patent
OmniViz
PatAnalyst
Quosa
Technology Watch
Temis
VantagePoint
Vivisimo
Wisdomain
Wistract
Vendor Tool Demonstrations
1. Quosa
2. Inxight
3. PatAnalyst
4. OmniViz
5. Temis
6. Aureka
7. Wisdomain
8. GoldFire
9. VantagePoint
10. ClearForest
11. M-CAM
12. RefViz
* Overview of Vendor Tools
Type of Tool
Capabilities
Data Sources
Results
Strengths
Summary
* Text mining tool slides are provided courtesy of the vendors.
Text Mining Capabilities
Keyword Analysis
■ Extracting nouns or noun phrases in text without
understanding their meaning or relationships or
counting the number of times the nouns appear
Statistical Analysis
■ Frequency-based analysis – counting the number
of times a word appears in the text
Linguistic Analysis
■ Natural language processing (NLP) – “Trained Agent”
■ Semantic analysis
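The frequency-based analysis described above amounts to counting words. A minimal sketch follows, assuming a tiny hand-picked stopword list and a made-up sample sentence; it does not attempt the linguistic or semantic analysis that the NLP-based tools perform.

```python
import re
from collections import Counter

# Illustrative stopword list only; real tools ship much larger ones.
STOPWORDS = {"a", "an", "and", "in", "of", "the", "to", "for", "is"}

def word_frequencies(text):
    """Count how often each non-stopword appears in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)

sample = "The pump fails. The fuel pump is corroded and the pump relay shorts."
freq = word_frequencies(sample)
top = freq.most_common(1)   # the single most frequent term
```

Note that the counter has no idea that "fuel pump" and "pump relay" are different parts; resolving that is what separates keyword and statistical analysis from linguistic analysis.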
Text Mining Data Sources
■ Unstructured text
►Full-text documents, emails
■ Structured text
►Database records, such as records from STN,
PubMed
■ Hybrid content
►Patents: the front page is structured, the text is not
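The hybrid case can be pictured as a record with typed front-page fields sitting next to a free-text body. The sketch below is an assumption for illustration only: the class name `PatentRecord`, its field names and the sample values are hypothetical and do not reflect any vendor's data model.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PatentRecord:
    # Structured front-page fields (field names are illustrative assumptions).
    patent_number: str
    assignee: str
    ipc_codes: List[str] = field(default_factory=list)
    # Unstructured part: description and claims as free text.
    full_text: str = ""

    def structured_fields(self) -> dict:
        """Fields that a bibliographic-style tool could analyze directly."""
        return {
            "patent_number": self.patent_number,
            "assignee": self.assignee,
            "ipc_codes": list(self.ipc_codes),
        }

record = PatentRecord(
    patent_number="XX0000000",       # placeholder, not a real patent
    assignee="Example Pharma",       # made-up assignee
    ipc_codes=["A61K"],
    full_text="A composition comprising ...",
)
```

A structured-data tool would work from `structured_fields()`, while a tool aimed at unstructured text would mine `full_text`.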
Data Sources
I. General Data Sources (Unstructured):
ClearForest
GoldFire Innovator
Inxight
OmniViz
Temis
II. Bibliographic Data Sources (Structured):

Quosa
RefViz
VantagePoint
III. Patent-Focused (Hybrid):

Aureka
M-CAM
PatAnalyst
Wisdomain
Evaluation Template
Type of Tool
■ Text mining software tool
■ Database content provider
■ Both
Capabilities
■ Keyword analysis
■ Statistical analysis
■ Linguistic analysis
Data Sources
■ Structured bibliographic data sources
■ Unstructured sources – full-text web, email, corporate repositories, etc.
■ Hybrid sources – patents, combination of structured/unstructured
Results
■ Lists of documents
■ Tables
■ Charts/Graphs
■ Maps
Strengths – Disclaimer: Our Impressions only!
Summary
GoldFire Innovator
Type of tool – text mining tool
Technology – semantic analysis
Data Sources
■ Unstructured information from personal data,
corporate data, deep web, content, patents,
internet
►15 MM worldwide patents
►Database of over 8000 scientific effects
►3000 cross-disciplinary scientific deep web websites
Results
■ Static categorization of key concepts
■ Accurate answers to questions
■ Dynamic document summarization
GoldFire Innovator - Strengths
Precision retrieval of targeted R&D content
►Retrieves information from context – semantic
indexing
►automated summaries and categorization
►Relevant filtering and ranking
Using natural language query to search
►Ask the right questions - How to dry paper? How to
balance diets?
Innovation Trend Analysis
► Competitive analysis
► Technology analysis
► Patent relationship analysis – citation analysis
Inxight
Type of Tool
■ Text mining software tool.
Capability
■ Natural Language Processing
■ Contextual extractions (leaning towards semantic analysis)
Data Source
■ Unstructured text from websites, internal repositories, full-
text documents
■ Documents have to be pre-processed to extract meta-data
and identify entity types
Results
■ Hierarchical categorization
Inxight - Strengths
Federated Search capability
Vendor claims higher accuracy than a human reader
Software can work in 32 languages and can understand 27 entity types
Can process 1.2 gigabytes per hour
Vendor claims the most powerful linguistic algorithms in the field
Temis
Type of tool
■ Text Mining Solutions - software
Capability
■ Natural Language processing
►Insight Discoverer™ Extractor – information extraction server powered
by XeLDA and used with specialized Skill Cartridges
►Insight Discoverer™ Categorizer – document categorization server
►Insight Discoverer™ Clusterer – automated classification server
►XeLDA – multilingual linguistic engine for natural language processing
►Skill Cartridge – a set of customizable knowledge components
that define the information to be extracted. The two major knowledge
components are multilingual dictionaries and multilingual
extraction rules (which establish relationships between defined concepts).
Skill Cartridge Overview
Open architecture
■ Plug & Play annotation components
■ Each defines areas of interests & extraction rules
■ Extraction rules describe the sentence structure that characterizes a concept
[Figure: Insight Discoverer™ Extractor stack. Text (any kind, any format) feeds XeLDA™, which yields words (any concept); plug & play Skill Cartridges™ then assign meaning.]
Example Skill Cartridges shown in the figure:
■ Merger & Acquisition: meaning = acquisition; target & buyer; amount & date; …
■ Positive & Negative Sentiment Analysis: meaning = satisfaction; people, companies, products; satisfaction; support; ...
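The Skill Cartridge idea, a dictionary of concepts plus extraction rules that describe sentence structure, can be illustrated with a toy rule-based extractor. This is not Temis code and does not use the XeLDA engine; the verb dictionary, the regular expression and the company names are all hypothetical.

```python
import re

# Dictionary component: surface forms mapped to the concept they signal.
ACQUISITION_VERBS = {
    "acquires": "Acquisition",
    "buys": "Acquisition",
    "purchases": "Acquisition",
}

# Extraction rule: "<Buyer> <verb> <Target> [for <amount>]".
RULE = re.compile(
    r"(?P<buyer>[A-Z]\w+) (?P<verb>\w+) (?P<target>[A-Z]\w+)"
    r"(?: for (?P<amount>\$[\d.]+ (?:million|billion)))?"
)

def extract_events(sentence):
    """Return one dict per rule match whose verb is in the dictionary."""
    events = []
    for m in RULE.finditer(sentence):
        meaning = ACQUISITION_VERBS.get(m.group("verb"))
        if meaning:
            events.append({
                "meaning": meaning,
                "buyer": m.group("buyer"),
                "target": m.group("target"),
                "amount": m.group("amount"),
            })
    return events

events = extract_events("AlphaCo acquires BetaCorp for $2.5 billion.")
ignored = extract_events("AlphaCo praises BetaCorp.")
```

Swapping in a different dictionary and rule set is the rough equivalent of plugging in a different cartridge, for example one aimed at sentiment rather than acquisitions.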
Temis
Data Sources
■ Any kind, any format, Internal & external data,
documents, literature, patents, clinical trials,
chemistry and biology, bioinformatics, internet,
email, etc
Results
■ Clusters, Rankings, Lists to discover information
trends and relationships
Temis - Strengths
Searching by concepts
► Selecting concepts from concept tree
Specialized Skill cartridges

► Life science Skill Cartridges
– Analytics
– Text Mining 360°
– Competitive Intelligence
– Human Resources Management
► General Skill Cartridges
– Biological Entity Relationships – best selling
– Medical Entity Relationships
– Chemical Entity Relationships
– Competitive Intelligence Life Sciences Edition
Temis - Strengths
Strong extraction, categorization, and clustering capabilities
Robust XeLDA linguistic engine
Quick trend analysis
Chemical Document Browser – specialized extraction module for
translating chemical substance nomenclature to chemical structures.
OmniViz
Type of tool
■ visual based data/text mining software
Capability
■ algorithm based statistical analysis, not semantics
Data source/type
■ numeric, text, categorical, chem. structures, sequence,
structured/unstructured text
Results
■ interactive visualization maps such as CoMet,
Correlation, Galaxy Proximity, etc.
OmniViz - Strengths
■ Interactive visualizations
■ Supports analysis of large amounts of data
(millions of documents) - numeric, categorical and
full-text analysis, including patents.
■ Broad applications including gene expression,
sequence & pathway analysis, chemical
structures, cheminformatics, clinical trial, patent
analysis, diagnosis and treatment, legal,
marketing data, regulatory compliance,
intelligence analysis, etc.
■ Flexible data import and merge capabilities
ClearForest
Type of Tool
■ Text mining tool (text analytics solution)
Capability
■ Semantic analysis/NLP
Data Sources
■ Unstructured text – websites
■ Patents
■ Internal documents
■ Meta-data
Results
■ Structured data entities
■ List of potential solutions for identified issues
■ Visualization tools – trend graphs, category maps
►Color and font are used to show intensity of relationships
ClearForest
Text Analytics: How it Works
[Figure: processing pipeline, reconstructed from the slide labels]
Inputs: unstructured documents (Text, Word, Excel, Email, WWW, PDF) and database text fields
Tagging platform: XML tagging, including domain-specific entities & relationships, for example:
<PartProblemCondition>
<Part> Fuel Pump </Part>
<Problem> Fails </Problem>
<Condition> Corroded </Condition>
</PartProblemCondition>
Extraction across records into an output database:
Part | Problem | Condition
Fuel Pump | Fails | Corroded
Pump Relay | Shorts | Cold weather
Headlight | Fails | Running hot
Engine | Stalls | At low speeds
Analysis: unified role-based analysis interfaces
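The step from tagged XML to database rows shown on this slide can be sketched with the Python standard library. The `<Records>` wrapper element and the `to_rows` helper are assumptions added for illustration; only the `PartProblemCondition`, `Part`, `Problem` and `Condition` tags come from the slide, and this is not ClearForest's actual output format or API.

```python
import xml.etree.ElementTree as ET

# Tagged output as on the slide, wrapped in a hypothetical <Records> root.
tagged = """
<Records>
  <PartProblemCondition>
    <Part> Fuel Pump </Part>
    <Problem> Fails </Problem>
    <Condition> Corroded </Condition>
  </PartProblemCondition>
</Records>
"""

def to_rows(xml_text):
    """Flatten each tagged relationship into a (part, problem, condition) row."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for rec in root.iter("PartProblemCondition"):
        rows.append(tuple(
            rec.findtext(tag, default="").strip()
            for tag in ("Part", "Problem", "Condition")
        ))
    return rows

rows = to_rows(tagged)
```

Each row corresponds to one line of the output table, which is what makes the extracted relationships queryable across many records.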
ClearForest
Packaged Extraction Modules
Inputs
■ Patents (MicroPatent Search, U.S. Patent Search)
■ Text, Word, Excel, etc.
■ Database fields
Outputs
■ Structured data entities: Agent, Application Number, Assignee, Assignee Address, Examiner, Filing Date, Inventor, Inventor Address, IPC, Issue Date, Number of Claims, Patent Citations, Patent Number, US Class
■ Entities: Claim Element, Claim Invention, Extracted Terms, Invention Terms, Measurement Terms, Number of Claims, Patent Section, Problem Solved Terms, Problems Solved, Process Technology Terms, Technology Terms
ClearForest - Strengths
Can be applied to a wide range of
applications as evidenced by wide variety of
available extraction modules
■ Security/intelligence gathering
■ Product/customer information
■ Corporate/People profiles
■ Patents
■ Biomedical entities
Analytics tool can discover unexpected
relationships between entities that would not
have been otherwise uncovered by standard,
manual methods.
VantagePoint
Type of tool
■ Text mining software mainly used for technology
assessment and company profiling
Capability
■ Uses pattern matching, rule-based, and natural language
processing techniques
Data Sources
■ Works best with structured data - text data from
bibliographic databases
Results
■ summaries, charts, matrices, maps, and graphs
VantagePoint - Key Features
Rapid navigation in large abstract collections
Helps find relationships within your data
Visually displays relationships
Buckets documents to help in categorization
Utilities for cleaning data
User-created thesauri for reducing data
Scripting capabilities to automate knowledge-
gathering
Easily exports output to other applications
Can be configured to text mine most forms of
structured bibliographic data
VantagePoint - Strengths
List Creation and Cleanup

■ patent assignee, author, inventor
■ pre-built IPC, User created thesauri
Analytical tool box
■ rapid navigation in large abstract collections to
answer who, where, what, when but not how and
why
■ visually displays relationships
Scripting capabilities to automate
knowledge-gathering
■ configure to extract from structured databases
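List cleanup with a user-created thesaurus can be sketched as a lookup from name variants to one preferred form. The thesaurus entries, the `normalize` helper and the company names below are made up for illustration and are not taken from VantagePoint.

```python
from collections import Counter

# Hypothetical user-created thesaurus: normalized variant -> preferred name.
THESAURUS = {
    "acme pharma inc": "Acme Pharma",
    "acme pharmaceuticals": "Acme Pharma",
    "acme": "Acme Pharma",
}

def normalize(name):
    """Map a raw assignee string to its preferred form when a variant is known."""
    key = " ".join(name.lower().replace(".", "").replace(",", "").split())
    return THESAURUS.get(key, name.strip())

raw = ["Acme Pharma, Inc.", "ACME", "Acme Pharmaceuticals", "Example Labs Ltd."]
cleaned = Counter(normalize(n) for n in raw)
```

After cleanup, the three variants collapse into one list entry, so counts per assignee are no longer split across spellings.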
RefViz
Type of tool
■ Text Analysis and Data Visualization software
Capability
■ Statistical and Linguistic analysis
►“mathematical signature” – relationship of words
►Uses a thesaurus tool
Data Sources
■ Only structured data from title, abstracts/notes
fields, or ISI Web of Science, PubMed, OCLC,
Output
Results
■ “Galaxy” & matrix visualization
RefViz - Strengths
■ Reference Retriever™ can search multiple online sources simultaneously
■ can be used together with EndNote, ProCite, and
Reference Manager to provide an additional level
of analysis to existing reference collections
■ analyzes large numbers of references by thematic
content
■ interactive, visual landscape
Reveal trends and associations in references
The Galaxy view organizes references according to how they are related conceptually.
References on farming and herbs, either their cultivation or use as herbicides, are found in
the upper left region of the Galaxy.
Groups in the lower right focus on herbs in medicine.
The region in between farming and medicine contains a mix of
references about herbage diets in farm animals, herbal extracts
from plants, and research on health effects of herbicide exposure.
Quosa
Type of tool
■ Text mining tool based on concept extraction/clustering
Capability
■ Statistical analysis (term extraction, frequency ranking,
concept extraction using dynamic extraction algorithm from
MIT/Harvard)
Data sources
■ unstructured text - PubMed, Ovid, Google Scholar
■ Patents
■ Internal documents
Results
■ Highly organized collection of documents (folders on
shared server or local machine)
■ Team sharing and annotating
Quosa - Strengths
Full-text retrieval and management of scientific documents
■ Get the full article from a journal or patent gateway
► PubMed, Ovid, USPTO website
■ Document Summary from My Article Organizer
■ Download to EndNote
M-CAM Doors™
Type of tool
■ Patent database provider, with text analysis and risk management
solution
Capability
■ Linguistic & semantic-based analysis, multi lingual
Data Sources
■ Patents from over 88 patenting authorities, 50 million patent documents
■ Journal articles (by the end of summer 2006)
Results
■ “Compass” citation view
■ “Magellan” telescope & hourglass – patent life timeline
■ Patent uniqueness and enforceability analysis
■ Competitive intelligence analysis - financial risk analysis for
merger/acquisition and stock trading
M-CAM Doors™
Hourglass view – shows behavior and intent
Red bar – cited patents
Blue bar – citing patents
Green bar – concurrent art – share pendency
Purple bar – volume of uncited patents
Orange bar – volume of patents that did not cite subject patent
M-CAM Doors™ - Strengths
Powerful visual interface for citation analysis
with related family & legal status views
Can rate each patent for its uniqueness,
reliance on related patents, and enforcement
potential – based on Hourglass view
Can rank patent clusters by relevance to
business objectives
Competitive Intelligence/Investment
Research
■ New Patent Thursday™ , Patent Portfolio
Confidence Rating™ , Custom PPCR™
PatAnalyst
Type of tool
■ Patent database provider – integrated source (UNIPAT) of patent
databases from US, PCT, EPO, PAJ, Germany, UK, France and
Switzerland
■ Patent search & examination service
Capability
■ No text mining algorithm
Data Sources
■ 51.5 MM patent documents – bibliographic data from 70 countries
from EPO
■ 15MM full-text documents – 8 countries/patenting authorities
Results
■ Viewer – analyze and organize the patent documents/families.
■ easy to use analytical colored text-highlighting of keywords
■ Organized folders of documents
PatAnalyst - Strengths
Powerful user interface with enhanced display features
■ Highlighted keywords appear in different colors
■ Side-by-side views of full-text and standard
bibliographic data
■ Integrated IPC category trees
■ “Live” legal status & patent family tree view from
EPO Viewer (EPOQUE)
■ Combined search of full-text & bibliographic data
Aureka
Type of tool
■ content and software tool specializing in visualization and
citation analysis
Capability
■ Keyword and Statistical Analysis
Data Sources
■ patent databases listed in MicroPatent’s FullText collection
Results
■ ThemeScape maps, hyperbolic citations trees, text clusters
Aureka ThemeScape Map of Stem Cell Technology
A ThemeScape map of a large set of documents provides an initial view of the
content. Additional probing and analysis of the map will help to reveal more insight.
Citation Tree of Patent EP0778277
A cited patent provides insight into a corporation’s strategic intent with a patent:
building a picket fence, holding a non-core patent, or showing a lack of R&D interest.
Aureka – Strengths
Strong citation analysis tool

►Interactive citation tree – intelligence analysis
and strategic planning
Annotation capabilities
Strong visualization analysis
►Patent mapping with ThemeScape
►Clustering by Vivisimo
Wisdomain
Type of tool
■ Content and software tool. Web-based searching and
citation tool. Analysis module is local
Capability
■ Keyword analysis, citation map visualized
searching
Data Sources
■ Patents, specialized in US, EP, PCT, PAJ,
INPADOC legal and family status, China abs,
Korea abs
Results
■ Genealogy tree, Tables, charts
Wisdomain - Strengths
Strong citation analysis capability
►backward and forward citations, more than
one nesting
►collateral citation analysis
►citation alerts
Genealogy Tree
►good in competitive analysis and licensing
strategy planning
Graphic view of the search results
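Backward and forward citation analysis with more than one level of nesting can be sketched as a breadth-first walk over a citation graph. The patent identifiers and the `CITES` data below are placeholders, and the helper names are hypothetical; this is not Wisdomain's implementation.

```python
from collections import deque

# Hypothetical citation data: patent -> patents it cites (backward citations).
CITES = {
    "P1": ["P2", "P3"],
    "P2": ["P4"],
    "P3": [],
    "P4": [],
    "P5": ["P1"],
}

def invert(cites):
    """Build the forward-citation view: patent -> patents that cite it."""
    cited_by = {p: [] for p in cites}
    for src, targets in cites.items():
        for t in targets:
            cited_by.setdefault(t, []).append(src)
    return cited_by

def nested(graph, start, depth):
    """Return {patent: nesting level} for patents within `depth` levels of start."""
    levels = {}
    queue = deque([(start, 0)])
    while queue:
        node, d = queue.popleft()
        if d == depth:
            continue
        for nxt in graph.get(node, []):
            if nxt != start and nxt not in levels:
                levels[nxt] = d + 1
                queue.append((nxt, d + 1))
    return levels

backward = nested(CITES, "P1", 2)           # what P1 cites, two levels deep
forward = nested(invert(CITES), "P1", 2)    # what cites P1, two levels deep
```

The level numbers are what a genealogy tree display would use to place each patent at the right distance from the subject patent.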

Collateral Citation
Identifying similar patents sharing the same pending period with the subject patent
[Figure: timeline of the subject patent's pending period, from APPLIED (1990) to ISSUED (1993), with other patents plotted around it and the key collateral patent highlighted.]
7 collateral patents are identified based on indirect citation relations.
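The pending-period part of the collateral citation idea, finding patents that were pending at the same time as the subject patent, can be sketched as an interval-overlap test. The slide also mentions indirect citation relations, which this sketch does not model. The `Patent` class, the helper names and the sample records are hypothetical; only the 1990 and 1993 dates for the subject patent come from the slide.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Patent:
    number: str    # placeholder identifier, not a real patent number
    applied: int   # filing year
    issued: int    # grant year

def shares_pendency(subject, other):
    """True when the two pending periods overlap by at least one year."""
    return other.applied <= subject.issued and other.issued >= subject.applied

def collateral_candidates(subject, pool):
    """Patents in the pool that were pending while the subject was pending."""
    return [p for p in pool if p != subject and shares_pendency(subject, p)]

subject = Patent("S", 1990, 1993)
pool = [
    Patent("A", 1988, 1991),   # overlaps the start of the pending period
    Patent("B", 1991, 1992),   # entirely inside the pending period
    Patent("C", 1994, 1996),   # filed after the subject issued
    Patent("D", 1985, 1989),   # issued before the subject was filed
]
candidates = collateral_candidates(subject, pool)
```

A fuller version would then filter these candidates by their citation links to the subject patent before calling any of them collateral.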
Summary
Vendor Name | Strength | Potential User Groups
ClearForest | Extraction modules | Business Intelligence
GoldFire | Sophisticated semantic analysis tool | R&D scientists
Inxight | Extraction & Federated Search | R&D Informatics
OmniViz | Interactive visualization | R&D scientists
Temis | Extraction using Specialized Skill Cartridges | R&D scientists, Business Intelligence
Quosa | Full-text retrieval & mgmt | R&D scientists
RefViz | Bibliographic data post-processing | R&D scientists, Information Professionals
VantagePoint | Analytical tool box for technology or company assessment | Information Professionals, Business Intelligence
Aureka | Patent mapping, clustering & citation analysis | Legal/Patent Dept., R&D scientists, Information Professionals, Strategic Planning, Business Intelligence
M-CAM | Patent uniqueness & enforcement analysis | Business Intelligence, Legal/Patent Dept., Information Professionals
PatAnalyst | Powerful full-text user interface with display features | Information Professionals, R&D scientists
Wisdomain | Strong collateral citation analysis | R&D scientists, Information Professionals
Path Forward
■Phase II
►Pilot selected tools
►Identify potential client groups and interview
representative clients
Closing Remarks
Acknowledgements
Peter Mattei – Aureka
Thomas Klose – ClearForest
Shelley Pavlek – GoldFire/Invention Machine
Joanne Freeman – Inxight
Marlene Khouri – M-CAM
Heahyun Yoo – OmniViz
Tony Medina – PatAnalyst
Michael Rogers – Quosa
Karen Stesis – RefViz
Tisha Zawisky – Temis
Lou Ann DiNallo – VantagePoint
Mary Talmadge-Grebenar – Wisdomain
Joseph Bezek
Claudia Powers
Ramesh Durvasula (Informatics)
Ronald Stoner (Mead Johnson)
Questions
