
Chapter four

Data mining techniques


Topics
Exploratory Data Analysis

Predictive Modeling

Descriptive Modeling

Discovering Patterns and Rules


Exploratory Data Analysis (EDA)
• Exploratory data analysis can be extremely valuable
  – You should always “look” at your data before applying any data mining algorithms
  – Exploratory data analysis is useful for data checking
    • E.g., finding that a variable is always integer-valued or positive
    • Finding that some variables are highly skewed

• In general, EDA helps to
  – maximize insight into a dataset;
  – uncover/extract underlying structure;
  – extract important variables;
  – detect outliers and anomalies;
  – test underlying assumptions.
Cont...
• Classical sequence of data analysis:
  Problem → Data → Model → Analysis → Conclusion
  – Model the pattern before analysis
• Exploratory sequence:
  Problem → Data → Analysis → Model → Conclusion
  – Analyze the data before applying models/mining algorithms
Exploratory Data Analysis (EDA)
• Exploratory data analysis helps to get an overall sense of the dataset by computing summary statistics:
  – number of distinct values, max, min, mean, median, variance, skewness, …
• EDA is an approach to data analysis that employs a variety of techniques (mostly graphical) for visualization. Such techniques include:
  – scatter plots, mean plots, standard deviation plots, and main effects plots of the raw data.
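As a minimal illustration, here is a sketch of these summary statistics and plots using pandas; the file name and the column names ("customers.csv", "age", "income") are hypothetical placeholders, not part of the course material:

```python
# A minimal EDA sketch with pandas (hypothetical file and columns).
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("customers.csv")

print(df.describe())                      # count, mean, std, min/max, quartiles
print(df.nunique())                       # number of distinct values per column
print(df.select_dtypes("number").skew())  # skewness: flags highly skewed variables

df.plot.scatter(x="age", y="income")      # a quick scatter plot of two variables
plt.show()
```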
From Data to Model: quality data for a better model

Data → Model → Business Solution (recommendation)
What is a model?
 A simplified representation of reality (based on certain assumptions) created for a specific purpose.
▶ Simple: stylized, partial (focused only on certain aspects), captures the essence.
▶ Representation: words, pictures, boxes and arrows, mathematical expressions.
▶ Specific purpose: what to capture, what to ignore.
▶ Variables: entities of interest
  ▶ controllable, uncontrollable, environmental
▶ Assumptions: reduce complexity
▶ In the context of BI & A, a model includes algorithms + data.
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
  – Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  – New data is classified based on the training set
• Unsupervised learning (clustering)
  – The class labels of the training data are unknown
  – Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
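A small sketch of this contrast with scikit-learn; the toy records and labels below are invented for illustration only:

```python
# Supervised vs. unsupervised learning in miniature (toy data).
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X = [[25, 40], [30, 60], [45, 80], [50, 30]]   # e.g. [age, income in $k]
y = ["no", "yes", "yes", "no"]                 # class labels = supervision

# Supervised: the labels y guide the learning; new data is then classified.
clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[28, 55]]))                 # -> a predicted class label

# Unsupervised: no labels; the algorithm looks for clusters in X itself.
km = KMeans(n_clusters=2, n_init=10).fit(X)
print(km.labels_)                              # cluster assignment per record
```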
Basic Data Mining Techniques
• Classification maps data into predefined groups or classes to enhance the prediction process (Data → class labels).
  – It is a supervised learning technique.
• Clustering groups similar data together into clusters.
  – It is used to find appropriate groupings of elements for a set of data. Unlike classification, clustering is a kind of undirected knowledge discovery or unsupervised learning: there is no target field, and the relationships among the data are identified bottom-up.
  – It is an unsupervised learning technique.
• Association rules (also known as market basket analysis)
  – discover interesting associations between attributes contained in a database. Based on frequency counts of how often items occur together in events, an association rule tells, if item X is part of an event, with what percentage item Y is also part of that event (see the sketch below).
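A sketch of those frequency counts for a single rule, "bread → milk"; the transactions are invented for illustration:

```python
# Support and confidence for the rule "if bread then milk" (toy data).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

sup = support({"bread", "milk"})                    # P(bread and milk)
conf = sup / support({"bread"})                     # P(milk | bread)
print(f"support={sup:.2f}, confidence={conf:.2f}")  # 0.50, 0.67
```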
Classification
 Given a collection of records (training set)
▶ Each record contains a set of attributes; one of the attributes is the class.
 Find a model for the class attribute as a function of the values of the other attributes.
▶ Goal: previously unseen records should be assigned a class as accurately as possible.
 A test set is used to determine the accuracy of the model.
▶ Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
Classification Example
Classification
 Classification is the process of finding a set of models (or functions) that describe and distinguish data classes or concepts, for the purpose of being able to use the model to predict the class of an object whose class is unknown.
➢ The derived model is based on the training data set and can be represented in various forms, such as classification IF-THEN rules, decision trees, mathematical formulae, or neural networks.
Classification vs. Prediction
Classification: A Two-Step Process
 Model construction: describing a set of predetermined classes (a model is built from the training set).
 Model usage: classifying future or unseen records (the model is validated on a test set before use).
Classification Process (1): Model Construction
Cont...
Classification Process (2): Use the Model in Prediction
Issues regarding classification: data preparation and experimentation

 Data preprocessing
▶ Preprocess data in order to reduce noise and handle
missing values
▶ Data transformation
▶ Generalize and/or normalize data
 Relevance analysis (feature selection)
▶ Remove the irrelevant or redundant attributes
 Conducting various experiments using different algorithms
▶ Select the best model
Thus Classification
 Classification is a data mining (machine learning) technique used to predict group membership for data instances.
 Given a collection of records (training set), each record contains a set of attributes; one of the attributes is the class.
▶ Classification is finding a model for the class attribute as a function of the values of the other attributes.
 Goal: previously unseen records should be assigned a class as accurately as possible. A test set is used to determine the accuracy of the model.
▶ Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
 For example, one may use classification to predict whether the weather on a particular day will be “sunny”, “rainy” or “cloudy” (a sketch of this workflow follows below).
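A sketch of this split/build/validate workflow in scikit-learn; the weather features and records are invented, only the pattern is the point:

```python
# Train/test workflow for a toy weather classifier (hypothetical data).
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Each record: [humidity %, pressure hPa]; label: the weather that day.
X = [[85, 1005], [40, 1020], [70, 1012], [90, 1000],
     [35, 1025], [65, 1010], [80, 1003], [45, 1022]]
y = ["rainy", "sunny", "cloudy", "rainy",
     "sunny", "cloudy", "rainy", "sunny"]

# Split: the training set builds the model, the test set validates it.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0)
model = DecisionTreeClassifier().fit(X_tr, y_tr)
print(accuracy_score(y_te, model.predict(X_te)))   # accuracy on unseen records
```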
Illustrating Classification Task: induction and deduction

Induction: Training Set → Learning algorithm → Learn Model → Model
Deduction: Model + Test Set → Apply Model → predicted class labels

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Classification approaches/methods
 There are various classification methods. Popular classification techniques include the following:
 Decision tree classifier: divides the decision space into piecewise constant regions.
 Rule based: association-based classifiers.
 K-Nearest Neighbour: classifies based on a similarity measure.
 Neural networks: partition by non-linear boundaries.
 Bayesian network: a probabilistic model.
 Support vector machine: solves non-linearly separable problems.
Decision Trees
Simple classification using decision tree

 What are the two known classes?


 How many attributes are used? (What are they?)
Decision tree classifier
• A decision tree performs classification by constructing, from training instances, a tree whose leaves carry class labels.
  – The tree is traversed for each test instance to find a leaf, and the class of the leaf is the predicted class. This is directed knowledge discovery in the sense that there is a specific field whose value we want to predict (a sketch follows below).
• It is a widely used learning method. It has been applied to:
  – classifying medical patients based on the disease,
  – classifying equipment malfunctions by cause,
  – classifying loan applicants by likelihood of payment.
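As a sketch, fitting and inspecting such a tree with scikit-learn; the toy data and the feature names ("student", "credit_ok") are hypothetical:

```python
# Fit a tree on labeled instances, print it, and classify test instances;
# the leaves of the printed tree carry the class labels.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[0, 1], [1, 1], [1, 0], [0, 0]]       # e.g. [student, credit_ok]
y = ["no", "yes", "yes", "no"]

tree = DecisionTreeClassifier().fit(X, y)
print(export_text(tree, feature_names=["student", "credit_ok"]))

# Prediction traverses the tree to a leaf for each test instance.
print(tree.predict([[1, 1], [0, 0]]))      # -> ['yes' 'no']
```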
Cont...
Decision Tree (detail)
Example of a Decision Tree
Another Example of Decision Tree
Thus
 As it is a classification approach, decision tree generation consists of two phases:
 Tree construction:
▶ At the start, all the training examples are at the root
▶ Partition the examples recursively based on selected attributes
 Tree pruning:
▶ Identify and remove branches that reflect noise or outliers
 Use of the decision tree:
▶ Classifying an unknown sample
▶ Test the attribute values of the sample against the decision tree
Training Dataset- example
Output: A Decision Tree for “buys_computer”
Extracting Classification Rules from Trees
 Represent the knowledge in the form of IF-THEN rules
 Rules are easier for humans to understand
 Example
IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no”
IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”
Points to recall
Classification
▶ a two-step process
▶ predicts categorical class labels
Classification approaches
▶ Decision tree based techniques
▶ Rule based techniques
▶ Function based techniques
Decision tree
▶ Attribute selection is the core idea
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
  – The tree is constructed in a top-down, recursive, divide-and-conquer manner
  – At the start, all the training examples are at the root
  – Attributes are categorical (if continuous-valued, they are discretized in advance)
  – Examples are partitioned recursively based on selected attributes
  – Optimal attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain, Gini index)
• Conditions for stopping partitioning (see the sketch below)
  – All samples for a given node belong to the same class
  – There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf)
  – There are no samples left
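A from-scratch sketch of this greedy loop; the record format (dicts with a "class" key) and the pluggable `score` heuristic are assumptions for illustration, with `score` standing for a measure such as the information gain defined on the following slides:

```python
# Greedy, top-down, recursive divide-and-conquer tree induction.
# Assumptions: records are dicts with a "class" key; attributes are
# categorical; score(rows, attr) is the selection heuristic; no pruning.
from collections import Counter

def build_tree(rows, attrs, score):
    labels = [r["class"] for r in rows]
    # Stopping conditions: all samples in one class, or no attributes
    # remain (then classify the leaf by majority vote).
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: select the attribute with the best heuristic score.
    best = max(attrs, key=lambda a: score(rows, a))
    # Divide and conquer: one subtree per observed attribute value,
    # so empty partitions never arise.
    return {best: {v: build_tree([r for r in rows if r[best] == v],
                                 [a for a in attrs if a != best],
                                 score)
                   for v in {r[best] for r in rows}}}
```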
Attribute Selection Measure (Measures of Node Impurity)
 Information gain: select the attribute with the highest information gain.
First, compute the disorder using entropy: the expected information needed to classify objects into classes.
Second, measure the information gain: calculate by how much the disorder of a set would be reduced by knowing the value of a particular attribute.
With the information gain measure we want:
 a large gain
 equivalently: a small average disorder created
Use of Info Gain: three-step procedure
Entropy
• The entropy measures the disorder of a set S containing a total of n examples, of which n+ are positive and n- are negative; it is given by:

  $D(n_+, n_-) = -\frac{n_+}{n}\log_2\frac{n_+}{n} - \frac{n_-}{n}\log_2\frac{n_-}{n} = \mathrm{Entropy}(S)$

• Some useful properties of the entropy:
  • D(n, m) = D(m, n)
  • D(0, m) = 0
    • D(S) = 0 means that all the examples in S have the same class
  • D(m, m) = 1
    • D(S) = 1 means that half the examples in S are of one class and half are of the opposite class
Information Gain
• The information gain measures the expected reduction in entropy due to splitting on an attribute A:

  $\mathrm{GAIN}_{split} = \mathrm{Entropy}(S) - \sum_{i=1}^{k}\frac{n_i}{n}\,\mathrm{Entropy}(i)$

  where the parent node p is split into k partitions, n_i is the number of records in partition i, and
  $\mathrm{Entropy}(i) = D(n_+, n_-) = -\frac{n_+}{n_i}\log_2\frac{n_+}{n_i} - \frac{n_-}{n_i}\log_2\frac{n_-}{n_i}$

• Information gain measures the reduction in entropy achieved by the split. Choose the split that achieves the most reduction (maximizes GAIN).
  – Used in ID3 and C4.5
  – Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.
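A direct translation of these two formulas into Python (a sketch; records are again assumed to be dicts with a "class" key):

```python
# Entropy D and information gain, written to mirror the formulas above.
from collections import Counter
from math import log2

def entropy(labels):
    """D over any number of classes: -sum (c/n) * log2(c/n)."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr):
    """Entropy(S) minus the weighted entropy of the k partitions."""
    n = len(rows)
    parent = entropy([r["class"] for r in rows])
    children = sum(
        len(part) / n * entropy([r["class"] for r in part])
        for part in ([r for r in rows if r[attr] == v]
                     for v in {r[attr] for r in rows}))
    return parent - children
```

This `info_gain` is exactly the kind of `score` heuristic that the induction sketch shown earlier expects.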
Example: Decision Tree for “buy computer or not”. Use the training dataset given below to construct the decision tree.
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
Cont...
Attribute Selection by Information Gain
• Class P: buys_computer = “yes” (9 examples)
• Class N: buys_computer = “no” (5 examples)
• Entropy of the whole set: E(P, N) = E(9, 5) = 0.940
• Compute the entropy for age:

  age     pi  ni  E(pi, ni)
  <=30    2   3   0.971
  31…40   4   0   0
  >40     3   2   0.971

  E(age) = (5/14) E(2,3) + (4/14) E(4,0) + (5/14) E(3,2) = 0.69

• Hence Gain(age) = E(P, N) − E(age) = 0.25
• Similarly:
  Gain(income) = 0.029
  Gain(student) = 0.151
  Gain(credit_rating) = 0.048
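A quick check of this arithmetic (a sketch; base-2 logs, values rounded):

```python
# Reproduce the worked numbers for the buys_computer example.
from math import log2

def D(p, n):
    """Entropy of a set with p positive and n negative examples."""
    total = p + n
    return -sum(c / total * log2(c / total) for c in (p, n) if c)

E_root = D(9, 5)
E_age = 5/14 * D(2, 3) + 4/14 * D(4, 0) + 5/14 * D(3, 2)
print(round(E_root, 3))           # 0.940
print(round(E_age, 3))            # 0.694 (~0.69)
print(round(E_root - E_age, 3))   # 0.247, i.e. Gain(age) ~ 0.25
```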
Cont...
Output: A Decision Tree for “buys_computer”

age?
├── <=30 → student?
│     ├── no → no
│     └── yes → yes
├── 31…40 → yes
└── >40 → credit_rating?
      ├── excellent → no
      └── fair → yes
Classification Rules
IF age = “<=30” & student = “no” THEN buys_computer = “no”
IF age = “<=30” & student = “yes” THEN buys_computer = “yes”
IF age = “31…40” THEN buys_computer = “yes”
IF age = “>40” & credit_rating = “excellent” THEN buys_computer = “no”
IF age = “>40” & credit_rating = “fair” THEN buys_computer = “yes”
Metrics for Performance Evaluation
▶ Confusion matrix and cost matrix
▶ Confusion matrix (classification matrix)
▶ Focus on the predictive capability of a model rather than on how long it takes to classify or build models, scalability, etc.
Cont...
▶ A confusion matrix displays the number of correct
and incorrect predictions made by the model compared
with the actual classifications in the test data.
▶ The matrix is n-by-n, where n is the number of
classes.
▶ Allows the computation of
▶ Accuracy
▶ Error rate
Accuracy and error rate
 Accuracy counts the test records that are correctly predicted by the classification model; error rate counts those that are incorrectly predicted (error rate = 1 − accuracy). A sketch follows below.
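A minimal sketch with scikit-learn; the true and predicted labels are invented:

```python
# Confusion matrix, accuracy and error rate on toy test labels.
from sklearn.metrics import confusion_matrix, accuracy_score

y_true = ["yes", "yes", "no", "no", "yes", "no"]
y_pred = ["yes", "no",  "no", "yes", "yes", "no"]

print(confusion_matrix(y_true, y_pred, labels=["yes", "no"]))
acc = accuracy_score(y_true, y_pred)   # correct predictions / total
print(acc, 1 - acc)                    # accuracy, error rate
```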
Other Cost-Sensitive Measures
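The slide's figure is not reproduced here; the cost-sensitive measures usually presented at this point, written in terms of the TP/FP/TN/FN counts of a binary confusion matrix, are precision, recall, and the F-measure:

$\text{Precision} = \frac{TP}{TP+FP}, \qquad \text{Recall} = \frac{TP}{TP+FN}, \qquad F_1 = \frac{2\,\text{Precision}\cdot\text{Recall}}{\text{Precision}+\text{Recall}}$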
Pros and Cons of decision trees
Pros:
• Reasonable training time
• Fast application
• Easy to interpret
• Easy to implement
• Can handle a large number of features
Cons:
• Cannot handle complicated relationships between features
• Simple decision boundaries
• Problems with lots of missing or noisy data

Why decision tree induction in data mining?
• Relatively faster learning speed (than other classification methods)
• Convertible to simple and easy-to-understand classification if-then-else rules
• Comparable classification accuracy with other methods
• Does not require any prior knowledge of the data distribution; works well on noisy data
