Intelligence
DATA MINING
“We’re drowning in information
but starving for knowledge.”
(John Naisbitt)
Topics
Data mining is a blend of multiple disciplines:
- Statistics
- Artificial Intelligence
- Pattern Recognition
- Machine Learning
- Mathematical Modeling
- Databases
Statistics :
- Impose a model on the data that we feel will replicate the actual
patterns in the data
DM :
- Let the data tell us the story
- Make sense of what previously could not be seen
Statistical forecasting and Data Mining
Statistical Forecasting
- we seek verification of a previously held hypothesis
- we know which patterns exist in the time series data we forecast
Data Mining
- seeks discovery of new knowledge from the data
- allows the data itself to reveal the patterns within, rather than imposing
the patterns on the data at the outset
DM terminology: PREDICTION (prediction + forecasting)
Prediction
• The act of “telling” about the future
• Guessing
+ experiences
+ opinions
+ other relevant information
Forecasting
• Estimating a future value based on past data values
• Data and model based
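To make the "data and model based" side of this contrast concrete, here is a minimal sketch of forecasting as estimating a future value from past data values. The moving-average model, the function name `moving_average_forecast`, and the sales figures are all illustrative assumptions, not something the slides prescribe.

```python
from statistics import mean


def moving_average_forecast(values, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    if window < 1 or len(values) < window:
        raise ValueError("need at least `window` past values")
    return mean(values[-window:])


# Hypothetical monthly sales, invented purely for illustration.
monthly_sales = [100, 110, 120, 130, 140, 150]

# The estimate uses only past data values and a fixed model (a 3-month average).
forecast = moving_average_forecast(monthly_sales, window=3)
print(forecast)  # 140
```

A moving average is only one of many possible forecasting models; it is used here because it makes the dependence on past values easy to see.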
Terminology in Data Mining
Database
• QUERY: well defined; SQL
• DATA: operational data
• OUTPUT: precise; subset of a database
Data Mining
• QUERY: poorly defined; no precise query language
• DATA: not operational data
• OUTPUT: fuzzy; not a subset of a database
Query Examples
• DATA MINING
Find all credit applicants who are poor credit risks - CLASSIFICATION
Identify customers with similar buying habits - CLUSTERING
Find all items which are frequently purchased with Y - ASSOCIATION
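The association query above ("items frequently purchased with Y") can be sketched as a simple co-occurrence count. This is a simplified illustration, not a full association-rule algorithm such as Apriori; the function name `frequently_bought_with`, the `min_confidence` threshold, and the baskets are hypothetical.

```python
from collections import Counter


def frequently_bought_with(transactions, item, min_confidence=0.5):
    """Return {other_item: confidence} for items that appear in at least
    `min_confidence` of the transactions that contain `item`."""
    baskets = [set(t) for t in transactions if item in t]
    if not baskets:
        return {}
    # Count how often every other item shares a basket with `item`.
    counts = Counter(other for b in baskets for other in b if other != item)
    return {
        other: n / len(baskets)
        for other, n in counts.items()
        if n / len(baskets) >= min_confidence
    }


# Hypothetical market baskets, invented for illustration.
transactions = [
    ["bread", "milk", "butter"],
    ["bread", "butter"],
    ["milk", "eggs"],
    ["bread", "milk", "butter", "eggs"],
    ["bread", "jam"],
]

result = frequently_bought_with(transactions, "bread", min_confidence=0.5)
print(result)  # butter appears in 3 of 4 bread baskets, milk in 2 of 4
```

Note that the output is a list of item relationships with strengths, not a subset of the original rows, which is the distinction the terminology comparison draws.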
How Data Mining Works
• Types of patterns
– Association
– Prediction
– Cluster (segmentation)
– Sequential (or time series) relationships
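As an illustration of the cluster (segmentation) pattern type, the sketch below groups customers by a single numeric attribute using a tiny one-dimensional k-means loop. The choice of k-means, the starting centers, and the spending figures are assumptions made for the example.

```python
from statistics import mean


def kmeans_1d(values, centers, iterations=10):
    """Tiny one-dimensional k-means. Returns (centers, clusters)."""
    centers = list(centers)
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return centers, clusters


# Hypothetical weekly spend per customer, invented for illustration.
spend = [12, 15, 14, 80, 85, 90]
centers, clusters = kmeans_1d(spend, centers=[10, 100])
print(clusters)  # [[12, 15, 14], [80, 85, 90]]
```

No class labels are supplied: the two segments emerge from the data alone, which is what separates clustering from classification.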
Data Mining Applications
Data Preparation Task
• Data Consolidation
– Collect data
– Select data
– Integrate data
• Data Cleaning
– Eliminate inconsistencies
• Data Transformation
– Normalize data
– Discretize/aggregate data
– Construct new attributes
• Data Reduction
– Reduce number of cases
– Balance skewed data
Result: Well-formed Data
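The transformation steps named above (normalize, discretize, construct new attributes) can be sketched on a few records as follows. The min-max scaling, the age cut points, the field names, and the records themselves are illustrative assumptions; real projects choose these per data set.

```python
def min_max_normalize(values):
    """Rescale numbers linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def discretize(value, cut_points, labels):
    """Return the label of the first bin whose upper cut point exceeds `value`;
    values at or above the last cut point get the last label."""
    for cut, label in zip(cut_points, labels):
        if value < cut:
            return label
    return labels[-1]


# Hypothetical customer records, invented for illustration.
records = [
    {"age": 20, "income": 20000, "purchases": 2},
    {"age": 40, "income": 50000, "purchases": 5},
    {"age": 60, "income": 80000, "purchases": 4},
]

incomes = min_max_normalize([r["income"] for r in records])
for r, scaled in zip(records, incomes):
    r["income_scaled"] = scaled  # normalize data
    r["age_group"] = discretize(r["age"], [30, 50], ["young", "middle", "senior"])  # discretize
    r["income_per_purchase"] = r["income"] / r["purchases"]  # construct a new attribute

print([r["age_group"] for r in records])  # ['young', 'middle', 'senior']
```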
Data Mining Process: CRISP-DM
The Six-Step CRISP-DM Data Mining Process
Source: KDnuggets.com.
DM tasks
• Prediction: the act of “telling” about the future
• Classification: analyzing the historical behavior
of groups of entities with similar characteristics,
to predict the future behavior of a new entity
from its similarity to those groups
• Clustering: finding groups of entities with similar
characteristics
• Association: establishing relationships among
items that occur together
• Sequence discovery: finding time-based
associations
• Visualization: presenting results obtained
through one or more of the other methods
• Regression: a statistical estimation technique
based on fitting a curve defined by a
mathematical equation of known type but
unknown parameters to existing data
• Forecasting: estimating a future data value
based on past data values.
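The regression definition above (an equation of known type with unknown parameters, fitted to existing data) can be illustrated with the simplest case, a straight line y = a·x + b fitted by ordinary least squares. The function name `fit_line` and the data points are assumptions for the example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations, invented for illustration.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

# The equation type (a line) is known; the parameters are estimated from data.
slope, intercept = fit_line(xs, ys)
prediction = slope * 5 + intercept  # numeric estimate for an unseen x
print(slope, intercept, prediction)  # 2.0 1.0 11.0
```

The result is a numeric value rather than a class label, which is what distinguishes regression from classification among the prediction tasks.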
A Taxonomy for Data Mining tasks, methods and algorithms
PREDICTION
CLASSIFICATION
• Used for prediction based on historical data and relationships
• What is being predicted is a class label
• Weather predictions: sunny, rainy, cloudy…
REGRESSION
• A statistical estimation technique: fitting a curve defined by a mathematical equation of known type but unknown parameters to existing data
• What is being predicted is a numeric value
• Weather predictions: 25 °C
Classification
Analyzing the historical behavior of groups of entities with similar
characteristics, to predict the future behavior of a new entity from its
similarity to those groups
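One way to read "predict the behavior of a new entity from its similarity to those groups" is a nearest-neighbor classifier, sketched below for the credit-risk example mentioned earlier. The k-nearest-neighbors method, the two features, and the labeled applicants are illustrative assumptions; the slides do not name a specific algorithm.

```python
from collections import Counter
from math import dist


def knn_classify(labeled_points, new_point, k=3):
    """Label `new_point` by majority vote among its k nearest labeled points."""
    neighbors = sorted(labeled_points, key=lambda item: dist(item[0], new_point))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical historical applicants: (income in thousands, years employed).
# Features are left unscaled for brevity; real use would normalize them first.
applicants = [
    ((20, 1), "poor risk"),
    ((25, 2), "poor risk"),
    ((30, 1), "poor risk"),
    ((70, 8), "good risk"),
    ((80, 10), "good risk"),
    ((75, 9), "good risk"),
]

# A new applicant is assigned the class of the group it most resembles.
label_a = knn_classify(applicants, (72, 7))
label_b = knn_classify(applicants, (22, 1))
print(label_a, "/", label_b)  # good risk / poor risk
```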
1. What do you think about data mining and its implications for
privacy? What is the threshold between discovery of
knowledge and infringement of privacy?
Myth: Data mining provides instant, crystal-ball-like predictions.
Reality: Data mining is a multistep process that requires deliberate, proactive design and use.
Myth: Data mining is not yet viable for mainstream business applications.
Reality: The current state of the art is ready for almost any business type and/or size.
Myth: Data mining requires a separate, dedicated database.
Reality: Because of the advances in database technology, a dedicated database is not required.
Myth: Only those with advanced degrees can do data mining.
Reality: Newer Web-based tools enable managers of all educational levels to do data mining.
Myth: Data mining is only for large firms that have lots of customer data.
Reality: If the data accurately reflect the business or its customers, any company can use data mining.
Common Data Mining Blunders