ML Class ITV
Sérgio Viademonte
Institute of Technology Vale
All content following this page was uploaded by Sérgio Viademonte on 01 June 2018.
ITV
Applied Computing Group
• Clustering
Given a data matrix D, partition its records into sets S1…Sn such that
records in each cluster are similar to each other.
• Classification
Learning the structure of a dataset of examples, already partitioned
into groups, referred to as categories or classes.
• Outlier Detection
Given a data matrix D, determine the records of D that are very
different from the remaining records in D.
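Of the three tasks, outlier detection is the simplest to show in a few lines. Below is a minimal sketch (not from the slides) using a z-score rule; the data values and the 2.0 threshold are invented for the example, and real detectors use more robust criteria:

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=2.0):
    # Flag records whose distance from the mean exceeds `threshold` standard deviations
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) / s > threshold]

data = [10, 11, 9, 10, 12, 11, 10, 95]
print(zscore_outliers(data))  # [95]
```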
Sergio Viademonte, PhD. ITV DS. Reference: Charu Aggarwal (2015). Data Mining: The Textbook.
Roadmap
begin
Create root node containing D;
loop
Grow it by selecting an eligible node in the tree;
Split the node into two or more nodes based on the split criterion;
until
No more nodes for split;
Prune overfitting nodes;
Label each leaf node with its dominant class;
end
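The induction loop above can be sketched in Python. This is a minimal illustration, not the slides' algorithm: it handles categorical attributes, splits them in a fixed order rather than by a split criterion, and skips pruning; the weather/temp toy data is invented for the example:

```python
from collections import Counter

def majority(labels):
    # Label a leaf with its dominant class
    return Counter(labels).most_common(1)[0][0]

def build_tree(rows, labels, attrs):
    # Stop splitting when the node is pure or no attributes remain
    if len(set(labels)) == 1 or not attrs:
        return majority(labels)
    attr = attrs[0]  # naive choice; real learners pick by error rate, Gini, or entropy
    branches = {}
    for value in set(row[attr] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[attr] == value]
        branches[value] = build_tree([rows[i] for i in idx],
                                     [labels[i] for i in idx],
                                     attrs[1:])
    return {attr: branches}

rows = [{"weather": "sunny", "temp": "warm"},
        {"weather": "sunny", "temp": "cold"},
        {"weather": "rainy", "temp": "warm"},
        {"weather": "rainy", "temp": "cold"}]
labels = ["play", "play", "play", "home"]
print(build_tree(rows, labels, ["weather", "temp"]))
```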
[Figure: an example decision tree. The root tests "weather" (branches: sunny, rainy); an internal node tests temperature (branches: warm, cold); the leaves carry the classes "play" and "home".]
• Each internal node is a test on an attribute.
• At each node, one attribute is chosen to split the training examples into distinct classes.
• Error rate
Let p be the fraction of instances in a set of data points S that belong to the dominant class label; then
er = 1 - p
Lower values of the error rate are better.
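The error-rate criterion is a one-liner over the class counts; a small sketch (the labels are invented for the example):

```python
from collections import Counter

def error_rate(labels):
    # p = fraction of instances in S belonging to the dominant class label
    p = Counter(labels).most_common(1)[0][1] / len(labels)
    return 1 - p

print(error_rate(["play", "play", "play", "home"]))  # 0.25
print(error_rate(["play", "play"]))                  # 0.0 for a pure set
```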
Compute the weighted average of the split criterion over the individual attribute values Si:
G(S1…Sr) = Σ i=1..r (|Si| / |S|) · G(Si)
Calculate the overall Gini index based on the target attribute, G(Stg).
Calculate the Gini index for each individual attribute/value, G(Si).
Calculate the gain for attribute Si as G(Stg) - G(Si), and choose the attribute with the largest gain.
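The steps above can be sketched as follows, using the standard Gini index G(S) = 1 - Σ pj²; the toy weather and class data are invented for the example:

```python
from collections import Counter

def gini(labels):
    # G(S) = 1 - sum over classes of p_j^2
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(values, labels):
    # Weighted average over attribute values Si: sum of |Si|/|S| * G(Si)
    n = len(labels)
    total = 0.0
    for v in set(values):
        subset = [lab for val, lab in zip(values, labels) if val == v]
        total += len(subset) / n * gini(subset)
    return total

labels = ["play", "play", "play", "home"]      # target attribute
weather = ["sunny", "sunny", "rainy", "rainy"]  # candidate splitting attribute
gain = gini(labels) - weighted_gini(weather, labels)
print(gain)  # 0.125
```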
Decision Trees
• Splitting attribute
Let pj be the fraction of data points in class j for the attribute value vi; then
the class entropy E(vi) is defined as follows:
E(vi) = - Σ j=1..k pj · log2(pj)
Lower values of entropy are better (a value of 0 implies a perfect
separation).
Posterior distribution p(x | a) for x given a.
E(vi) ∈ [0, log2(k)]
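The entropy definition above translates directly into code; a short sketch (the labels are invented for the example):

```python
from collections import Counter
from math import log2

def entropy(labels):
    # E = sum of -p_j * log2(p_j); 0 means a perfect separation
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

print(entropy(["play", "home"]))          # 1.0, the maximum log2(k) for k = 2
print(entropy(["play", "play", "play"]))  # 0.0 for a pure node
```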
• Splitting attribute
• Algorithms
Example:
Given X, Y ⊂ I, with X ∩ Y = Ø.
Support is the fraction of the N transactions that contain X ∪ Y:
s(X → Y) = σ(X ∪ Y) / N
Confidence is the ratio of the number of transactions that contain X ∪ Y to the number of
transactions that contain X, given by the following expression:
c(X → Y) = σ(X ∪ Y) / σ(X)
1. Frequent Itemset Generation, whose objective is to find all the itemsets that
satisfy the minsup threshold. These itemsets are called frequent itemsets.
2. Rule Generation, whose objective is to extract all the high-confidence rules from
the frequent itemsets found in the previous step. These rules are called strong rules.
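The two phases can be sketched with a brute-force pass over a toy transaction database. The transactions and the minsup value are invented for the example, and real miners use Apriori or FP-Growth pruning instead of enumerating every subset:

```python
from itertools import combinations

transactions = [{"bread", "milk"},
                {"bread", "diapers", "beer"},
                {"milk", "diapers", "beer"},
                {"bread", "milk", "diapers"},
                {"bread", "milk", "beer"}]
N = len(transactions)

def support(itemset):
    # s = fraction of the N transactions containing every item in the itemset
    return sum(itemset <= t for t in transactions) / N

def confidence(X, Y):
    # c(X -> Y) = s(X ∪ Y) / s(X)
    return support(X | Y) / support(X)

# Phase 1: frequent itemset generation (brute force over all candidate subsets)
items = sorted(set().union(*transactions))
minsup = 0.4
frequent = [frozenset(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if support(set(c)) >= minsup]

# Phase 2: rule generation from the frequent itemsets
print(support({"bread"}))               # 0.8
print(confidence({"bread"}, {"milk"}))  # close to 0.75 (3/5 over 4/5)
```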