• Decision Trees
• Association Rules
Can we do better?
Women and children first?
[Figure: decision tree from the Titanic train data analysis, with leaf nodes labeled "Survived" / "Did Not Survive"; the Build and Use phases are shown side by side]
• Think : “If, Then” rules specified in the feature space
• Greedily divide (binary split) the feature space into distinct, non-overlapping regions
• “Natural” clustering given the target variable
• Every observation mapped to a leaf node is assigned the label most commonly occurring in that leaf (Classification)
• Every observation mapped to a leaf node is assigned the mean of the samples in the leaf (Regression)
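To make this concrete, a minimal sketch using the rpart package on R's built-in iris data (the package choice is an assumption; the slides do not name a library):

library(rpart)
# Greedy binary splits of the feature space into non-overlapping regions
tree <- rpart(Species ~ ., data = iris, method = "class")
predict(tree, head(iris), type = "class")   # each row gets its leaf's majority label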
• Measures of Impurity
– Determine Information Gain
– Determine split choice
http://people.revoledu.com/kardi/tutorial/DecisionTree/how-to-measure-impurity.htm
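A short sketch of two common impurity measures, written as hypothetical helper functions (p is the vector of class proportions in a node):

gini    <- function(p) 1 - sum(p^2)                         # Gini impurity
entropy <- function(p) -sum(ifelse(p > 0, p * log2(p), 0))  # entropy in bits
gini(c(0.4, 0.6))      # 0.48
entropy(c(0.4, 0.6))   # ~0.971
# Information gain of a split = parent impurity minus the
# weighted average impurity of the child nodes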
• Can do better
– Sort the values of the attribute you can split on
– Find all the "breakpoints" where the class labels associated with them change
– Consider only the split points where the labels change (see the sketch below)
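A minimal sketch of this idea for one numeric attribute (toy, hypothetical values):

x <- c(1.2, 2.5, 3.1, 4.8, 5.0)    # attribute values, already sorted
y <- c("A", "A", "B", "B", "A")    # class labels in the same order
change <- which(y[-1] != y[-length(y)])   # positions where the label changes
(x[change] + x[change + 1]) / 2           # midpoints: the candidate split points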
[Figure: Train and Test error plotted against alpha, with the mean cross-validation error used to select the pruning level]
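The same idea as a sketch in rpart, where the complexity parameter is called cp rather than alpha:

library(rpart)
fit <- rpart(Species ~ ., data = iris, method = "class", cp = 0.0)
printcp(fit)   # cross-validated error (xerror) at each cp value
best <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned <- prune(fit, cp = best)   # keep the subtree with minimum CV error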
Reduced Error Pruning
• Start from the leaf nodes
• If removing a node does not reduce accuracy on a validation set, collapse the subtree of this node into the previous (parent) node
• Continue till the root node
[Figure: example subtree rooted at a "Wind" node with Yes/No branches]
Controlling Overfit
Other parameters
• Minimum information gain
• Minimum count in a leaf node
• Max number of leaf nodes
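A sketch of how these knobs map onto rpart's controls (rpart has no direct leaf-count cap, so maxdepth stands in for the last bullet):

library(rpart)
ctrl <- rpart.control(cp = 0.01,       # minimum improvement required to split
                      minbucket = 20,  # minimum count in a leaf node
                      maxdepth = 5)    # depth cap in place of a leaf-count cap
fit <- rpart(Species ~ ., data = iris, method = "class", control = ctrl)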
• {x, y, z}: Antecedent
• A: Consequent
• The fraction of transactions containing the whole itemset {x, y, z, A} is called the support of the rule (here, 23% support)
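A worked example with hypothetical counts that reproduce the 23% above:

n_total <- 100    # transactions in the database (hypothetical)
n_xyzA  <- 23     # transactions containing all of x, y, z and A
n_xyzA / n_total  # support({x, y, z} -> A) = 0.23, i.e., 23%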
Lift
• Are confidence and support always good?
– Example: a leak in the oil gasket causes an engine rebuild
• The probability of an engine rebuild is very small, hence the support for this rule is small
• The confidence, however, will be high
• What does the value of one feature tell us about the value of another feature?
• People who buy diapers are likely to buy baby powder
• If (people buy diaper), then (they buy baby powder)
• Caution : Watch the directionality! (A ➔ B does not mean B ➔ A)
• Association rules
– Are statements about relations among features (attributes) across elements (tuples)
– Use a transaction-itemset data model
• Confidence = 1?
• Caution : Diaper is very popular!
• Does the inclusion of {Milk, Beer} increase the probability of Diaper?
• Lift
– Lift(X ➔ Y) = Confidence(X ➔ Y) / Support(Y), or equivalently P(Y|X) / P(Y)
– > 1 : X & Y positively correlated (presence of X lifts the probability of Y’s presence)
– < 1 : X & Y negatively correlated (presence of X reduces the probability of Y’s presence)
– = 1 : X & Y not correlated
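A worked numeric example with hypothetical counts:

n    <- 1000   # transactions (hypothetical)
n_X  <- 200    # transactions containing X
n_Y  <- 300    # transactions containing Y
n_XY <- 90     # transactions containing both
confidence <- n_XY / n_X    # P(Y|X) = 0.45
support_Y  <- n_Y / n       # P(Y)   = 0.30
confidence / support_Y      # lift = 1.5 > 1: X lifts the probability of Y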
Question
• How many rules can we generate when we have p items in our dataset?
– Approximately 2^p (every subset of the p items is a candidate itemset)
• Algorithm
– Find all frequent 1-itemsets (frequent ➔ support above the minimum threshold)
– Find all frequent 2-itemsets built from the filtered 1-itemsets
– Find all frequent 3-itemsets built from the filtered 2-itemsets
– ….
• The downward-closure property (every subset of a frequent itemset must itself be frequent) allows us to reduce our search space for the rules, as in the sketch below.
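A minimal sketch using the arules package and its bundled Groceries data (the package and threshold values are assumptions, not from the slides); it also creates the basket_rules object plotted next:

library(arules)
data(Groceries)
basket_rules <- apriori(Groceries,
                        parameter = list(support = 0.01, confidence = 0.5))
inspect(head(sort(basket_rules, by = "lift")))   # strongest rules first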
library(arulesViz)   # plot methods for rule sets
plot(basket_rules, measure = c("support", "lift"), shading = "confidence")
○ Parallelize
○ Increase the minimum support threshold
Association Rules : Summary
• Applications
– Market Basket Analysis
– Any dataset where features can be represented as taking only two values: 0/1
• Preprocessing: Discretization, Feature selection
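Since rule mining needs 0/1 items, numeric features are binned first; a sketch using arules::discretize on a hypothetical column:

library(arules)
age  <- c(23, 37, 45, 61, 29)   # hypothetical numeric feature
bins <- discretize(age, method = "interval", breaks = 3)
table(bins)   # each bin becomes a binary item when coerced to transactions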