Data Mining To Knowledge Discovery

From Data Mining to Knowledge Discovery: An Introduction
Gregory Piatetsky-Shapiro
KDnuggets
Outline
Introduction
Data Mining Tasks
Application Examples
Trends leading to Data Flood

More data is generated:
Bank, telecom, other business transactions ...
Scientific data: astronomy, biology, etc.
Web, text, and e-commerce
More data is captured:
Storage technology is faster and cheaper
DBMS are capable of handling bigger DBs
Examples
Europe's Very Long Baseline Interferometry (VLBI) has 16 telescopes, each of which produces 1 Gigabit/second of astronomical data over a 25-day observation session
storage and analysis are a big problem
Walmart is reported to have a 24-terabyte DB
AT&T handles billions of calls per day
data cannot be stored; analysis is done on the fly
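To give a sense of scale for the VLBI example, the figures on the slide can be multiplied out. This is only a back-of-the-envelope sketch: it assumes decimal units and that every telescope streams continuously for the whole session, neither of which the slide states.

```python
# Assumptions (not from the slides): decimal units, and all telescopes
# stream continuously for the entire observation session.
TELESCOPES = 16
RATE_BITS_PER_SEC = 1e9      # 1 Gigabit/second per telescope
SESSION_DAYS = 25

seconds = SESSION_DAYS * 24 * 60 * 60
bytes_per_telescope = RATE_BITS_PER_SEC / 8 * seconds
total_bytes = TELESCOPES * bytes_per_telescope

print(f"per telescope: {bytes_per_telescope / 1e12:.0f} TB")
print(f"all telescopes: {total_bytes / 1e15:.2f} PB")
```

Under these assumptions one session comes to roughly 270 TB per telescope and about 4.32 PB in total, which is why storage and analysis are described as a big problem.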
Growth Trends
Moore's law
Computer speed doubles every 18 months
Storage law
Total storage doubles every 9 months
Consequence
Very little data will ever be looked at by a human
Knowledge Discovery is NEEDED to make sense and use of data.
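The gap between the two doubling rates compounds quickly. The sketch below simply applies the two doubling periods quoted above; the helper name `growth_factor` and the chosen time spans are illustrative.

```python
def growth_factor(months: float, doubling_period_months: float) -> float:
    """Multiplicative growth after `months`, given a fixed doubling period."""
    return 2 ** (months / doubling_period_months)

for years in (1.5, 3, 6):
    months = years * 12
    speed = growth_factor(months, 18)    # computer speed: doubles every 18 months
    storage = growth_factor(months, 9)   # total storage: doubles every 9 months
    print(f"{years} years: speed x{speed:.0f}, storage x{storage:.0f}, "
          f"gap x{storage / speed:.0f}")
```

After three years, for example, speed has grown 4-fold while storage has grown 16-fold, so the amount of data per unit of computing power keeps rising.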
Knowledge Discovery Definition

Knowledge Discovery in Data is the non-trivial process of identifying
valid
novel
potentially useful
and ultimately understandable patterns in data.
From Advances in Knowledge Discovery and Data Mining, Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy (Chapter 1), AAAI/MIT Press, 1996
Related Fields
[Figure: Data Mining and Knowledge Discovery, shown with its related fields: Machine Learning, Visualization, Statistics, Databases]
Knowledge Discovery Process

[Figure: Raw Data -> Integration -> Data Warehouse -> Selection & Cleaning -> Target Data -> Transformation -> Transformed Data -> Data Mining -> Patterns and Rules -> Interpretation & Evaluation -> Knowledge (Understanding)]
Outline
Introduction
Data Mining Tasks
Data Mining Tasks: Classification

Learn a method for predicting the instance class from pre-labeled (classified) instances
Many approaches: Statistics, Decision Trees, Neural Networks, ...
Classification: Linear Regression
Linear Regression: w0 + w1 x + w2 y >= 0
Regression computes wi from data to minimize squared error to fit the data
Not flexible enough
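As a rough illustration of the rule w0 + w1 x + w2 y >= 0, the sketch below fits the weights by least squares on a small made-up data set and then applies the rule. The data points, the +1/-1 label encoding, and the helper names are assumptions for the example, not part of the slides.

```python
def solve(a, b):
    """Solve the linear system a @ w = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_least_squares(points, labels):
    """Return [w0, w1, w2] minimizing sum((w0 + w1*x + w2*y - label)^2)."""
    rows = [[1.0, x, y] for x, y in points]
    # Normal equations: (R^T R) w = R^T labels
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    rhs = [sum(r[i] * t for r, t in zip(rows, labels)) for i in range(3)]
    return solve(gram, rhs)

# Hypothetical toy data: label +1 means "blue", -1 means "green".
points = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (4.0, 4.0), (5.0, 4.0), (4.0, 5.0)]
labels = [-1.0, -1.0, -1.0, 1.0, 1.0, 1.0]
w = fit_least_squares(points, labels)

def classify(x, y):
    """Apply the decision rule w0 + w1*x + w2*y >= 0."""
    return "blue" if w[0] + w[1] * x + w[2] * y >= 0 else "green"

print(w, classify(5.0, 5.0), classify(1.0, 1.0))
```

The decision boundary is a single straight line, which is what "not flexible enough" refers to: classes that cannot be separated by one line cannot be fit well this way.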
Classification: Decision Trees

if X > 5 then blue
else if Y > 3 then blue
else if X > 2 then green
else blue
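The rule list above translates directly into code. The function name `classify` is illustrative; the thresholds and colours are taken from the rules as written.

```python
def classify(x: float, y: float) -> str:
    """Decision tree from the rules above: each test is tried in order."""
    if x > 5:
        return "blue"
    if y > 3:
        return "blue"
    if x > 2:
        return "green"
    return "blue"

for point in [(6, 0), (1, 4), (3, 1), (1, 1)]:
    print(point, classify(*point))
```

Each test splits the plane with an axis-parallel line, so the green region here is the rectangle 2 < X <= 5, Y <= 3.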
Classification: Neural Nets

Can select more complex regions
Can be more accurate
Can also overfit the data: find patterns in random noise
Data Mining Central Quest
Find true patterns and avoid overfitting (false patterns due to randomness)
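Overfitting can be shown with a very flexible model trained on pure noise. The sketch below uses a 1-nearest-neighbour classifier (chosen here for brevity, in place of a neural net) on points whose labels are random coin flips; the sample sizes and the seed are arbitrary.

```python
import random

def nearest_neighbor_label(point, train):
    """Return the label of the training example closest to `point` (1-NN)."""
    def sq_dist(example):
        (x, y), _label = example
        return (x - point[0]) ** 2 + (y - point[1]) ** 2
    return min(train, key=sq_dist)[1]

def accuracy(examples, train):
    """Fraction of `examples` whose label matches the 1-NN prediction."""
    hits = sum(nearest_neighbor_label(p, train) == label for p, label in examples)
    return hits / len(examples)

def random_examples(rng, n):
    # Labels are coin flips, independent of the coordinates: pure noise.
    return [((rng.random(), rng.random()), rng.choice(["blue", "green"]))
            for _ in range(n)]

rng = random.Random(0)
train = random_examples(rng, 200)
test = random_examples(rng, 500)

train_accuracy = accuracy(train, train)
test_accuracy = accuracy(test, train)
print("training accuracy:", train_accuracy)
print("test accuracy:", test_accuracy)
```

The model memorizes the training set perfectly, yet on fresh data it does no better than guessing, because there was never a true pattern to find. Comparing accuracy on held-out data is the usual way to tell the two cases apart.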
Data Mining Tasks: Clustering

Find natural grouping of instances given un-labeled data
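One common way to find such groupings is k-means, sketched below on a handful of made-up 2-D points. The slides do not name a specific clustering method, so the algorithm, the data, and the choice of initial centers are all assumptions for illustration.

```python
def kmeans(points, centers, iterations=10):
    """Plain k-means: alternate between assigning points and moving centers."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(
                range(len(centers)),
                key=lambda i: (p[0] - centers[i][0]) ** 2 + (p[1] - centers[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c)) if c else old
            for c, old in zip(clusters, centers)
        ]
    return centers, clusters

# Hypothetical un-labeled data with two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, clusters = kmeans(points, centers=[points[0], points[3]])
print(centers)
print(clusters)
```

No labels are supplied anywhere: the two groups emerge only from the distances between the points.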
Major Data Mining Tasks

Classification: predicting an item class
Clustering: finding clusters in data
Associations: e.g. A & B & C occur frequently
Visualization: to facilitate human discovery
Estimation: predicting a continuous value
Deviation Detection: finding changes
Link Analysis: finding relationships
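For the Associations task in the list above, "A & B & C occur frequently" can be made concrete by counting how often item sets appear together. The transactions below are invented, and the brute-force counting is only a sketch; practical association-rule miners prune the search rather than enumerate every combination.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket transactions.
transactions = [
    {"A", "B", "C"},
    {"A", "B", "C", "D"},
    {"A", "B"},
    {"B", "C"},
    {"A", "B", "C"},
]

def frequent_itemsets(transactions, size, min_support):
    """Return item sets of the given size whose support meets the threshold."""
    counts = Counter()
    for basket in transactions:
        for itemset in combinations(sorted(basket), size):
            counts[itemset] += 1
    n = len(transactions)
    return {items: c / n for items, c in counts.items() if c / n >= min_support}

frequent = frequent_itemsets(transactions, size=3, min_support=0.5)
print(frequent)
```

Here only the set (A, B, C) appears in at least half of the transactions (3 of 5), so it is the only frequent triple.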

www.KDnuggets.com
Data Mining Software Guide
Outline
Introduction
Data Mining Tasks
Major Application Areas for Data Mining Solutions
Advertising
Bioinformatics
Customer Relationship Management (CRM)
Database Marketing
Fraud Detection
eCommerce
Health Care
Investment/Securities
Manufacturing, Process Control
Sports and Entertainment
Telecommunications
Web
Case Study: Search Engines

Early search engines relied mainly on keywords on a page and were subject to manipulation
Google's success is due to its algorithm, which mainly uses links to the page
Google founders Sergey Brin and Larry Page were students at Stanford in 1998, doing research in databases and data mining, which led to Google
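The idea of ranking pages by the links pointing to them can be sketched with a simplified PageRank-style power iteration. This is an illustration of link-based ranking in general, not Google's production algorithm; the function name `link_rank`, the damping value, and the tiny link graph are assumptions.

```python
def link_rank(links, damping=0.85, iterations=50):
    """Simplified PageRank-style scores for a dict of page -> outgoing links."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page in pages:
            # A page with no outgoing links spreads its rank over all pages.
            targets = links.get(page) or pages
            share = damping * rank[page] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank

# Hypothetical link graph: nobody links to "spam", whatever keywords it contains.
links = {
    "home": ["about", "docs"],
    "about": ["home"],
    "docs": ["home"],
    "spam": ["home"],
}
rank = link_rank(links)
for page, score in sorted(rank.items(), key=lambda kv: -kv[1]):
    print(f"{page}: {score:.3f}")
```

Because a page's score depends on who links to it rather than on its own text, stuffing a page with keywords does not raise its rank in this scheme.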
Case Study: Direct Marketing and CRM
Most major direct marketing companies are using modeling and data mining
Most financial companies are using customer modeling
Modeling is easier than changing customer behaviour
Some successes
Verizon Wireless reduced churn rate from 2% to 1.5%
Biology: Molecular Diagnostics

Leukemia: Acute Lymphoblastic (ALL) vs Acute Myeloid (AML)
72 samples, about 7,000 genes
[Figure: gene expression for ALL vs AML samples]
Results: 33 correct (97% accuracy), 1 error (sample suspected to be mislabelled)
AF1q: New Marker for Medulloblastoma?
AF1Q: ALL1-fused gene from chromosome 1q
transmembrane protein
Related to leukemia (3 PUBMED entries) but not to Medulloblastoma
Case Study: Security and Fraud Detection
Credit Card Fraud Detection
Money laundering: FAIS (US Treasury)
Securities Fraud: NASDAQ Sonar system
Phone fraud: AT&T, Bell Atlantic, British Telecom/MCI
Bio-terrorism detection at Salt Lake Olympics 2002
Data Mining and Terrorism: Controversy in the News
TIA: Terrorism (formerly Total) Information Awareness Program
DARPA program closed by Congress
some functions transferred to intelligence agencies
CAPPS II: screens all airline passengers; controversial
Invasion of Privacy or Defensive Shield?
Criticism of analytic approach to Threat Detection:
Data Mining will
invade privacy
generate millions of false positives
But can it be effective?
Can Data Mining and Statistics be Effective for Threat Detection?
Criticism: Databases have 5% errors, so analyzing 100 million suspects will generate 5 million false positives
Reality: Analytical models correlate many items of information to reduce false positives.
Example: Identify one biased coin from 1,000.
After one throw of each coin, we cannot identify it.
After 30 throws, the one biased coin will stand out with high probability.
Can identify 19 biased coins out of 100 million with a sufficient number of throws
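The biased-coin example above can be worked through numerically. The slides do not say how biased the coin is, so this sketch assumes the extreme case: the biased coin always lands heads, and a coin is flagged as suspicious only if every one of its throws is heads.

```python
# Assumptions (not from the slides): the biased coin always lands heads,
# and a coin is flagged when all of its throws come up heads.

def expected_fair_coins_flagged(fair_coins: int, throws: int) -> float:
    """Expected number of fair coins that show heads on every throw."""
    return fair_coins * 0.5 ** throws

def prob_no_fair_coin_flagged(fair_coins: int, throws: int) -> float:
    """Probability that the biased coin is the only one flagged."""
    return (1.0 - 0.5 ** throws) ** fair_coins

# The criticism: a 5% error rate applied once to 100 million records.
false_positives_single_check = 100_000_000 * 5 // 100
print("single check:", false_positives_single_check, "false positives")

# The reality: combining many independent observations per coin.
for throws in (1, 10, 30):
    flagged = expected_fair_coins_flagged(999, throws)
    clean = prob_no_fair_coin_flagged(999, throws)
    print(f"{throws:2d} throws: about {flagged:.3g} fair coins flagged, "
          f"P(only the biased coin flagged) = {clean:.6f}")
```

With one throw about half of the fair coins look just like the biased one; with 30 throws the chance that any fair coin matches it is below one in a million. The same reasoning is why correlating many items of information per case cuts false positives far below the single-check estimate.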
Another Approach: Link Analysis
Can Find Unusual Patterns in the Network Structure
Analytic technology can be effective
Combining multiple models and link analysis can reduce false positives
Today there are millions of false positives with manual analysis
Data Mining is just one additional tool to help analysts
Analytic Technology has the potential to reduce the current high rate of false positives
Data Mining with Privacy

Data Mining looks for patterns, not people!
Technical solutions can limit privacy invasion
Replacing sensitive personal data with anonymous IDs
Giving randomized outputs
Multi-party computation over distributed data
Bayardo & Srikant, Technological Solutions for Protecting Privacy, IEEE Computer, Sep 2003
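The first technique listed, replacing personal data with an anonymous ID, might look like the sketch below. It uses a keyed hash (HMAC-SHA256) from the Python standard library; the key, the record layout, and the 16-character truncation are illustrative assumptions, and this is not presented as the method described by Bayardo & Srikant.

```python
import hashlib
import hmac

# Hypothetical secret key; in practice it would be kept away from the analysts.
SECRET_KEY = b"replace-with-a-secret-key"

def pseudonym(personal_id: str) -> str:
    """Map a personal identifier to a stable anonymous ID using a keyed hash."""
    digest = hmac.new(SECRET_KEY, personal_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]

# Hypothetical transaction records: (customer name, amount).
records = [("Alice Smith", 120.0), ("Bob Jones", 75.5), ("Alice Smith", 42.0)]
anonymized = [(pseudonym(name), amount) for name, amount in records]

for row in anonymized:
    print(row)
```

The same person always maps to the same ID, so patterns across records can still be mined, while the names themselves never reach the analyst.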
The Hype Curve for Data Mining and Knowledge Discovery
[Figure: hype curve with phases labeled: rising expectations, over-inflated expectations, disappointment, growing acceptance and mainstreaming]
Summary
www.KDnuggets.com: the website for Data Mining and Knowledge Discovery
Contact: Gregory Piatetsky-Shapiro
gregory@kdnuggets.com
