KDD 99 Chasm
Rakesh Agrawal
IBM Almaden Research Center
Thesis
• The greatest challenge facing data mining is
to make the transition from being an early
market technology to mainstream
technology
• We have the opportunity to make this
transition successful
Outline
• Chasm in the technology adoption life cycle,
à la Geoffrey Moore†
• Experience with Quest/Intelligent Miner
• Ideas for successful chasm crossing
Chasm
• Yes!!!
• My personal viewpoint based on
Quest/Intelligent Miner experience
Quest
• Started as skunk work in early nineties
• Inspired by needs articulated by industry
visionaries:
– Transaction data collected over a long period
– Current tools/SQL don’t cut it
– About ready to throw data away
Approach
• Examine “real” applications
• Identify operations that cut across
applications
• Design fast, scalable algorithms for each
operation
• Develop applications by composing
operations
Operations
• Associations
• Sequential Patterns
• Similar time series
• Classification
• Clustering
• Deviations
http://www.almaden.ibm.com/cs/quest
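To make the first operation concrete, here is a minimal sketch of association mining restricted to item pairs. It assumes brute-force pair counting on hypothetical toy baskets; it is not the fast, scalable algorithm the Quest project designed, and the function name and data are illustrative only.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Return item pairs whose support (fraction of transactions) >= min_support."""
    counts = Counter()
    for basket in transactions:
        # Count each unordered pair at most once per transaction.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}


# Hypothetical toy data, not taken from the talk.
baskets = [
    ["bread", "milk"],
    ["bread", "diapers", "beer"],
    ["milk", "diapers", "beer"],
    ["bread", "milk", "diapers"],
]
result = frequent_pairs(baskets, min_support=0.5)
```

Brute-force counting enumerates every pair in every basket, which is why it does not scale to the long transaction histories mentioned earlier.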
Bringing Quest to market
• Visionaries who inspired Quest did not
become first customers:
– Wanted evidence that the technology “worked”
• Frustrating attempts to interest major IBM
customers:
– Integration with existing applications
– Too-far-out technology
– Resistance from in-house analytic groups
First hits
• Small information-based companies who
provided data in exchange for free results
• CIO who wanted to be seen as the
technology pioneer in his industry
• CIO who wanted the success story to
feature in the company’s annual report
Led to the formation of a group offering services using Quest
Characteristics of engagements
• Mostly associations and sequential patterns
• Completeness a big plus
• Unanticipated uses
• Feedback for further development
Into the product land
• Formation of a small “out-of-plan” product
group to productize Quest
• Facilitated by a closet mathematician
• Successes of the services group used for
market validation
• Continued development and infusion of
technology
Intelligent Miner
• Serious product
• Integrates technologies from various groups
• Fast, scalable, runs on multiple platforms
• Several “early market” success
stories
http://www.software.ibm.com/data/iminer/
Are we in the chasm?
• Perceived to be sophisticated technology,
usable only by specialists
• Long, expensive projects
• Stand-alone, loosely coupled with data
infrastructures
• Difficult to infuse into existing mission-
critical applications
Chasm Crossing
• Personal speculations on some technical
challenges
• Do not imply IBM research/product
directions
XML-based Data Mining Standard (1)
• Model Building:
– A pair of standard DTDs for each operation
– Interchangeable library of operator implementations
[Diagram labels: Data Specs and Parameters (Standard DTD); Operator Library; Model (Standard DTD)]
Ack: Mattos, Pirahesh, Schwenkreis
XML-based Data Mining Standard (2)
• Model Deployment:
– Mapping XML object provides mapping between names and format in the model object and the data record
– Model could have been developed on a different system
[Diagram labels: Model, Mapping, and Data Record (Standard DTDs); Application Library; Result (Standard DTD)]
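The deployment idea can be sketched as follows: a model document refers to its own field names, and a separate mapping document translates them to the column names of the data record being scored. All element names, attribute names, and the threshold rule below are hypothetical and do not come from any standard DTD; this is only one way such a mapping might be applied.

```python
import xml.etree.ElementTree as ET

# Hypothetical model document: a single threshold rule on one field.
MODEL_XML = """<model type="threshold-classifier">
  <rule field="income" threshold="50000" above="approve" below="review"/>
</model>"""

# Hypothetical mapping document: model field name -> data record column name.
MAPPING_XML = """<mapping>
  <field model="income" record="ANNUAL_INC"/>
</mapping>"""


def score(record, model_xml=MODEL_XML, mapping_xml=MAPPING_XML):
    """Apply the model to a data record, translating field names via the mapping."""
    mapping = {
        f.get("model"): f.get("record")
        for f in ET.fromstring(mapping_xml).findall("field")
    }
    rule = ET.fromstring(model_xml).find("rule")
    value = float(record[mapping[rule.get("field")]])
    if value > float(rule.get("threshold")):
        return rule.get("above")
    return rule.get("below")
```

Because the model never names the record's columns directly, a model built on one system could be scored against differently named data on another by swapping only the mapping document.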
Implications
• Standard interfaces for application
developers to incorporate data mining
• Coupling with relational databases
– mappings from DTDs to relational schemas
– implementation using existing infrastructure
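As an illustration of coupling with a relational database, the sketch below assumes an association-rule model has been mapped into two tables and is applied to basket data with plain SQL. The schema, table names, and sample rows are hypothetical; SQLite stands in for whatever relational system is already in place.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE rule (rule_id INTEGER PRIMARY KEY, consequent TEXT, confidence REAL);
    CREATE TABLE rule_antecedent (rule_id INTEGER, item TEXT);
    CREATE TABLE basket (basket_id INTEGER, item TEXT);
""")
# Hypothetical model: {diapers, chips} => beer with confidence 0.8.
conn.execute("INSERT INTO rule VALUES (1, 'beer', 0.8)")
conn.executemany("INSERT INTO rule_antecedent VALUES (?, ?)",
                 [(1, "diapers"), (1, "chips")])
# Hypothetical data records to score.
conn.executemany("INSERT INTO basket VALUES (?, ?)",
                 [(10, "diapers"), (10, "chips"), (11, "diapers")])

# A rule fires for a basket when every antecedent item is present in it.
FIRING_RULES = """
    SELECT b.basket_id, r.consequent, r.confidence
    FROM rule r
    JOIN rule_antecedent a ON a.rule_id = r.rule_id
    JOIN basket b ON b.item = a.item
    GROUP BY b.basket_id, r.rule_id
    HAVING COUNT(DISTINCT a.item) =
           (SELECT COUNT(*) FROM rule_antecedent WHERE rule_id = r.rule_id)
    ORDER BY b.basket_id
"""
fired = conn.execute(FIRING_RULES).fetchall()
```

Here only basket 10 contains both antecedent items, so it is the only one for which the rule fires; the scoring runs entirely inside the database engine.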
Data Mining Benchmarks
• UC Irvine repository
• Generating synthetic benchmarks modeled
after real data sets is a hard problem
– How to map names into meaningful literals
– How to preserve empirical distributions
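A minimal sketch of the two sub-problems for a single categorical column, under strong simplifying assumptions: real names are replaced by opaque literals, and synthetic values are drawn in proportion to observed frequencies. The function name and the `item_<i>` naming scheme are hypothetical.

```python
import random
from collections import Counter


def synthesize_column(real_values, n, seed=0):
    """Draw n synthetic values following the real column's empirical
    distribution, with real names replaced by opaque literals."""
    counts = Counter(real_values)
    # Opaque stand-ins hide the real names (and lose their meaning).
    alias = {name: f"item_{i}" for i, name in enumerate(sorted(counts))}
    rng = random.Random(seed)
    names = list(counts)
    weights = [counts[name] for name in names]
    drawn = rng.choices(names, weights=weights, k=n)
    return [alias[name] for name in drawn], alias


synthetic, alias = synthesize_column(["a", "a", "a", "b"], 1000)
```

This preserves only per-column frequencies. Correlations across columns and literals that remain meaningful to a reader are not handled, which is part of what makes the general problem hard.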
Ack: R. Srikant
Personalization
• Internet might provide for the first time tools
necessary for users to capture information
about themselves and to selectively release
this information†
• Will we be providing these tools?
† John Hagel, Marc Singer. Net Worth. Harvard Business School Press.
What about Association Rules?
• Very long patterns
• Separating wheat from chaff
• Principled introduction of domain
knowledge
What else?
• Formal foundations of data mining
Summary
• Closely couple data mining with database systems
• Embed data mining into applications
• Personalization
• Focus on web
• Privacy
• Standard interfaces
• Benchmarks
• Auto focussing
Concluding remarks
• Data mining, a great technology
– Combination of intriguing theoretical questions
with large commercial interest in the
technology
• Poised for transitioning into mainstream
technology
• Will we rise to the challenge as a
community?
Acknowledgments
Arning Arnold Bayardo Baur Bollinger Brodbeck
Baune Carey Chandra Cody Faloutsos Gardner
Gehrke Ghosh Greissl Gruhl Grove Gunopulos