
Chapter 4:

Data Mining
Introduction:-
Data Mining:-
is the art and science of discovering knowledge, insights, and patterns in
data.
 Patterns must be valid, novel, potentially useful, and understandable.
 Data mining draws the knowledge of data quality and data organizing from the Database area.
 Data mining draws modeling and analytical techniques from Statistics and Computer Science
(Artificial Intelligence).
 Data mining draws the knowledge of decision-making from Business Management.

Target:-
is a large retail chain that crunches data to develop insights that help target marketing
and advertising campaigns.
Gathering & Selecting Data:-
Enterprise Data Model (EDM):
is a unified, high-level model of all the data stored in an
organization’s databases.
 The EDM is usually inclusive of the data generated from all internal
systems.
 The EDM provides the basic menu of data to create a data warehouse for a
particular decision-making purpose.
 The EDM helps imagine what relevant external data should be gathered to
provide context and develop good predictive relationships with the internal
data.
Evaluating Data Mining Results:-
Classification:- is the main category of supervised learning activity.
 Predictive accuracy = (correct predictions) / (total predictions).
 When a truly positive data point is predicted as positive → true positive (TP)
 When a truly negative data point is predicted as negative → true negative (TN)
 When a truly positive data point is predicted as negative → false negative (FN)
 When a truly negative data point is predicted as positive → false positive (FP)
 Predictive accuracy = (TP + TN) / (TP + TN + FP + FN)
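As a minimal sketch of the accuracy formula above, the following Python snippet counts TP, TN, FP, and FN from two lists of labels and then computes predictive accuracy. The function names and the sample labels are hypothetical, chosen only for illustration.

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP, FN for binary labels (True = positive)."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a and p:
            tp += 1          # positive point predicted as positive
        elif not a and not p:
            tn += 1          # negative point predicted as negative
        elif a and not p:
            fn += 1          # positive point predicted as negative
        else:
            fp += 1          # negative point predicted as positive
    return tp, tn, fp, fn


def predictive_accuracy(tp, tn, fp, fn):
    """Predictive accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


# Hypothetical example data: eight data points.
actual    = [True, True, True, False, False, False, True, False]
predicted = [True, False, True, False, True, False, True, False]

tp, tn, fp, fn = confusion_counts(actual, predicted)
accuracy = predictive_accuracy(tp, tn, fp, fn)
print(tp, tn, fp, fn, accuracy)  # 3 3 1 1 0.75
```

Here six of the eight predictions are correct, so the accuracy is 6 / 8 = 0.75.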
Data Mining Techniques:-

Techniques
    Supervised Learning
        Classification: Machine Learning
            Decision Trees
            Neural Networks
        Classification: Statistics
            Regression
    Unsupervised Learning
        Clustering Analysis
        Association Rules
Data Cleansing & Preparation:-
1. Duplicate data needs to be removed.
2. Missing values need to be filled in, or those rows should be
removed.
3. Data elements should be comparable:
a) Transformed from one unit to another.
b) Comparable over time.
c) Stored at the same granularity to ensure comparability.
4. Continuous values may need to be binned into a few
buckets to help with some analyses.
5. Outlier data elements need to be removed after careful
review, to avoid the skewing of results.
6. Data may need to be selected to increase information
density.
7. Ensure that the data is representative of the phenomena
under analysis by correcting for any biases in the selection
of data.
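Several of the steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the record layout, the outlier threshold `SPEND_LIMIT`, the mean-fill strategy, and the bucket boundary are all hypothetical, and a real project would choose them after reviewing the data.

```python
from statistics import mean

# Hypothetical raw records: (customer_id, height, unit, spend)
raw = [
    (1, 170.0, "cm", 120.0),
    (1, 170.0, "cm", 120.0),   # exact duplicate of the first row
    (2, 65.0, "in", None),     # height in inches, spend missing
    (3, 180.0, "cm", 95.0),
    (4, 160.0, "cm", 9000.0),  # suspiciously large spend
]

# Step 1: remove duplicate rows while keeping the original order.
deduped = list(dict.fromkeys(raw))

# Step 3a: transform every height to one unit (centimetres).
def to_cm(value, unit):
    return value * 2.54 if unit == "in" else value

rows = [(cid, to_cm(h, u), spend) for cid, h, u, spend in deduped]

# Step 5: set outliers aside for review (assumed threshold).
# Done before filling so the outlier does not distort the fill value.
SPEND_LIMIT = 1000.0
outliers = [r for r in rows if r[2] is not None and r[2] > SPEND_LIMIT]
rows = [r for r in rows if r not in outliers]

# Step 2: fill missing spend with the mean of the known values.
known = [s for _, _, s in rows if s is not None]
fill_value = mean(known)
rows = [(cid, h, s if s is not None else fill_value) for cid, h, s in rows]

# Step 4: bin the continuous spend value into a few buckets.
def spend_bucket(spend):
    return "low" if spend < 100 else "high"

buckets = [spend_bucket(s) for _, _, s in rows]
print(rows)
print(buckets)
```

The outlier row is kept in `outliers` rather than silently discarded, so it can still be reviewed before the decision to drop it is final.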
Outputs of Data Mining:-
 Decision Tree (Business Rule)
 Related Format
 Population “Centroid”
 Artificial Neural Networks (ANN)
 Cluster Analysis (Segmentation Technique)
 The K-means Technique
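To make the K-means technique and the idea of a population "centroid" concrete, here is a minimal sketch of k-means on one-dimensional data. The function name `k_means_1d`, the starting centroids, and the sample ages are assumptions for illustration; production tools use more careful initialization and support many dimensions.

```python
def k_means_1d(points, centroids, iterations=10):
    """Plain k-means on 1-D data with caller-supplied starting centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, clusters


# Hypothetical customer ages, segmented into two groups.
ages = [18, 20, 22, 45, 47, 50]
centroids, clusters = k_means_1d(ages, centroids=[18, 50])
print(centroids)
print(clusters)  # [[18, 20, 22], [45, 47, 50]]
```

Each resulting cluster is a segment of the population, and its centroid (here the mean age) acts as the representative value for that segment.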
Tools & Platforms for Data Mining:-
1. Simple or Sophisticated.
2. Stand-alone or Embedded.
3. Open source or Commercial.
4. User Interface.
5. Data Formats.
Programs:-
1. Excel
2. Weka
3. R
4. IBM SPSS Modeler
Data Mining Best Practices:-
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Model Evaluation
6. Dissemination & Rollout
Myths about Data Mining:-
 Data Mining is about algorithms.
 Data Mining is about predictive accuracy.
 Data Mining requires a data warehouse.
 Data Mining requires large quantities of data.
 Data Mining requires a technology expert.
Data Mining Mistakes:-
 Selecting the wrong problem for data mining.
 Being buried under mountains of data without clear metadata.
 Disorganized data mining.
 Insufficient business knowledge.
 Incompatibility of data mining tools and datasets.
 Looking only at aggregated results and not at individual
records or predictions.
 Measuring your results differently from the way your
sponsor measures them.
