
MANAGEMENT INFORMATION SYSTEMS

DATA MINING AND BUSINESS INTELLIGENCE

Week 5 – Decision Trees


PROF. DR. GÖKHAN SILAHTAROĞLU
Lecturer: NADA MISK
Outline
In KNIME until today:
• Read/Write Data
• CSV Reader, Excel Reader, Table Reader, File Reader, DB Reader
• CSV Writer, Excel Writer, Table Writer
• Data Manipulation
• Column Filter, Row Filter, GroupBy, Pivoting, Concatenate, Rule Engine, String Manipulation, Math Formula, Column Expressions
• Data Understanding / Visualization
• Data Explorer, Color Manager, Bar Chart, Scatter Plot
Today:
• Normalization
• Data Manipulation
• Numeric Binner, Number To String, Joiner
• Data Understanding / Visualization
• Scatter Plot – Table View (component)
• Decision Trees
DATA RESTRUCTURING

 Some of the models, techniques and algorithms that we will use in data mining work only with numerical values, while others use categorical values. Still others only deal with values consisting of 0s and 1s. In such cases, we must adapt the data we have to the algorithm we will work with; that is, we must restructure the data.
 Decision trees use interval values instead of continuous values.
 For example, if the wage variable takes various values between 550 and 15,000 TL, we can divide these values into ranges such as 550–1,000; 1,001–2,000; 2,001–4,000; 4,001–7,000; 7,001–10,000; 10,001–15,000.
 Algorithms such as artificial neural networks work with values in the range 0.0–1.0.
NORMALIZATION

The process of scaling the available data into a range such as 0.0–1.0 is called normalization. There are several methods for the normalization process:

min–max normalization: s' = (s − min) / (max − min)

new min–max normalization: s' = [(s − min) / (max − min)] · (new_max − new_min) + new_min

z-score normalization: s' = (s − avg) / σ
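
A minimal Python sketch of the three formulas above (an illustration added here, not part of the original slides; the function names are our own):

def min_max(s, s_min, s_max):
    # min-max normalization into the 0.0 - 1.0 range
    return (s - s_min) / (s_max - s_min)

def new_min_max(s, s_min, s_max, new_min, new_max):
    # min-max normalization into an arbitrary [new_min, new_max] range
    return (s - s_min) / (s_max - s_min) * (new_max - new_min) + new_min

def z_score(s, avg, sigma):
    # z-score normalization using the mean and the standard deviation
    return (s - avg) / sigma

print(round(min_max(37, 18, 89), 2))            # ~0.27 (Example 1 below)
print(round(new_min_max(37, 18, 89, 1, 5), 2))  # ~2.07 (Example 1 below)
print(round(z_score(37, 48.96, 19.11), 2))      # ~-0.63 (Example 2 below)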

NORMALIZATION – EXAMPLE 1
If the maximum age in the database is 89 and the minimum age is 18, what is the normalized value of 37 years?

s' = (s − min) / (max − min) = (37 − 18) / (89 − 18) ≈ 0.27

If the maximum age in the database is 89 and the minimum age is 18, what is the normalized value of 37 years in the range 1–5?

s' = [(s − min) / (max − min)] · (new_max − new_min) + new_min

s' = [(37 − 18) / (89 − 18)] · (5 − 1) + 1 ≈ 2.07 ≈ 2
NORMALIZATION – EXAMPLE 2

s' = (s − avg) / σ

If the mean age is 48.96 and the standard deviation is 19.11, what is the z-score normalized value of 37 years?

s' = (37 − 48.96) / 19.11 ≈ −0.63
STUDY
 Apply the normalization algorithms to the exam grades below:
 55, 65, 89, 23, 66, 45, 25, 12, 85, 78, 77, 68, 36, 60, 55
VISUAL TRANSFORMATION
 18–29 range: 1st group
 30–37 range: 2nd group
 38–45 range: 3rd group
 46–60 range: 4th group
 61 and over: 5th group
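
A small Python sketch of this kind of interval-based (visual) transformation, using the age groups above (an illustration only; in KNIME the same job is done by the Numeric Binner node listed in the outline):

def age_group(age):
    # map a continuous age to the group labels defined above
    if 18 <= age <= 29:
        return 1
    elif 30 <= age <= 37:
        return 2
    elif 38 <= age <= 45:
        return 3
    elif 46 <= age <= 60:
        return 4
    else:  # 61 and over
        return 5

print([age_group(a) for a in [21, 35, 44, 52, 70]])  # [1, 2, 3, 4, 5]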

DATA MINING MODELS

 Classification: Sales Forecasts; Order Estimates; Estimation and Causes of Production Defect Costs; Fraud Detection
 Clustering: Customer Profiling; Product Sales Profile; Clustering of Error Locations and Times
 Association: Market Basket Analysis; Time-Based Analysis (Consecutive Sales)
SUPERVISED & UNSUPERVISED LEARNING
DATA MINING MODELS

 Prediction: revealing the existence of hidden patterns in databases.
 Classification: sorting data into groups based on certain common characteristics.
CLASSIFICATION

Common classification approaches: "Bayesian classification algorithm", "algorithms based on decision trees", "k-nearest neighbor algorithm", "artificial neural networks".
DECISION TREES

 Decision trees are one of the most widely used algorithms in classification problems.
 Compared to other methods, decision trees are easier to build and to understand.
CLASSIFICATION PROCESS
A decision tree is a structure made up of a root node, branches, and leaf nodes.
Each internal node specifies a test on an attribute, each branch specifies an outcome of that test, and each leaf node carries a class label.
The top node of the tree is the root node.

[Diagram: an example tree. The root node A tests Colour. The branch Brown leads to node B, which tests Horoscope: Lion leads to a leaf with "3-item sale" and Cancer to a leaf with "2-item sale". The branch Pink leads to node C, which tests Product: Blouse leads to a leaf with "2-item sale" and Skirt to a leaf with "1-item sale".]
RULE EXAMPLES EXTRACTED BY THE DECISION TREE

Rule 1:
If colour = brown
Then
• If horoscope = "lion" Then decision = 3-item sale;
• If horoscope = "cancer" Then decision = 2-item sale;

Rule 2:
If colour = pink
Then
• If item = "blouse" Then decision = 2-item sale;
• If item = "skirt" Then decision = 1-item sale;
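
Since a decision tree is in effect a set of nested IF commands (as noted later in these slides), the two rules above can be sketched in Python as follows (an illustration; attribute names and return values follow the slide):

def predict_sale(colour, horoscope=None, item=None):
    # nested-IF form of Rule 1 and Rule 2 above
    if colour == "brown":
        if horoscope == "lion":
            return "3-item sale"
        if horoscope == "cancer":
            return "2-item sale"
    if colour == "pink":
        if item == "blouse":
            return "2-item sale"
        if item == "skirt":
            return "1-item sale"
    return "no rule matched"

print(predict_sale("brown", horoscope="lion"))  # 3-item sale
print(predict_sale("pink", item="skirt"))       # 1-item sale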
DECISION TREE STRUCTURE EXAMPLE

[Diagram: an example tree structure in which the internal nodes test attributes A1–A4 and the leaves carry the class labels C1–C5.]
LIFT

 Decision trees produce rules, but they also provide some parameter values. One of the most important of these is the concept of lift.
 Lift is a value that shows how strongly the records in a node belong to the target class compared to the data set as a whole. For example, a lift value of 255% indicates that the records in that node belong to the specified class at a rate 2.55 times higher than in the overall data.
LIFT CALCULATION EXAMPLE

Membership Period   Leave
1-5                 0
6-15                1
6-15                1
1-5                 0
6-15                1
6-15                0
1-5                 1
1-5                 0
1-5                 0

 Rule 1: IF membership = 1-5 THEN Leave = 0;
 Rule 2: IF membership = 6-15 THEN Leave = 1;

 Lift for Rule 1: (4/5) / (5/9) = 1.44
 Lift for Rule 2: (3/4) / (4/9) = 1.69
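
A short Python sketch of the lift calculation above (illustration only):

# (membership period, leave) pairs from the table above
records = [("1-5", 0), ("6-15", 1), ("6-15", 1), ("1-5", 0), ("6-15", 1),
           ("6-15", 0), ("1-5", 1), ("1-5", 0), ("1-5", 0)]

def lift(records, membership, leave):
    # lift = P(class inside the node) / P(class in the whole data set)
    node = [r for r in records if r[0] == membership]
    node_rate = sum(1 for r in node if r[1] == leave) / len(node)
    overall_rate = sum(1 for r in records if r[1] == leave) / len(records)
    return node_rate / overall_rate

print(round(lift(records, "1-5", 0), 2))   # Rule 1: 1.44
print(round(lift(records, "6-15", 1), 2))  # Rule 2: 1.69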
SAMPLE DATA
Gender Weight Height Size
F 48 170 M
F 49 151 S
F 52 158 M
F 56 165 M
M 59 160 S
F 61 159 M
M 62 162 S
M 63 174 M
F 68 168 M
F 69 177 L
M 72 170 M
M 74 165 S
M 85 175 M
M 85 190 L
M 98 190 L
SAMPLE DATA TRANSFORMED INTO A DECISION TREE

[Diagram: the sample data converted into a decision tree. The root node splits on Height (roughly 160–175, 176–182, and 183 and above); the subtrees then split on Gender or Weight, and the leaves carry the Size classes Small, Medium, and Large.]
DECISION TREES

 Rule generator: they produce rules, for example:
• If gender = male and age is between 30–40, then will buy a computer
• If age > 20 and studying computer science and living in Istanbul, then reads comics
 They work on the principle of repeatedly splitting (narrowing down) the data, like the game "guess the object I have in mind in 20 questions", http://www.20q.net/
 They produce a tree-like structure; what is actually produced is a set of nested IF commands.
 Each leaf of the tree (the THEN expression) returns the class.
GENERAL ALGORITHM OF DECISION TREES

D: learning database
T: tree to be built
T = ∅                      // initially the tree is the empty set
set the branching (splitting) criterion
T = set the root node
T = branch the root node according to the branching rules
for each branch do
    set the variable (attribute) that will be placed at this node
    if (the stopping condition is reached)
        add a leaf and stop
    else
        loop (branch this node again)
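
A highly simplified Python sketch of this general scheme (an illustration under our own assumptions: the split attribute is taken in the order given rather than chosen by a gain-ratio criterion, and the stopping condition is a purity threshold):

from collections import Counter

def build_tree(rows, attributes, target, min_purity=1.0):
    # rows: list of dicts, attributes: candidate columns, target: class column
    classes = [r[target] for r in rows]
    majority, count = Counter(classes).most_common(1)[0]
    if count / len(rows) >= min_purity or not attributes:
        return majority                       # add a leaf and stop
    attr = attributes[0]                      # placeholder branching criterion
    node = {}
    for value in sorted(set(r[attr] for r in rows)):
        subset = [r for r in rows if r[attr] == value]
        node[(attr, value)] = build_tree(subset, attributes[1:], target, min_purity)
    return node

data = [{"membership": "1-5", "leave": 0}, {"membership": "1-5", "leave": 0},
        {"membership": "6-15", "leave": 1}, {"membership": "6-15", "leave": 1}]
print(build_tree(data, ["membership"], "leave"))
# {('membership', '1-5'): 0, ('membership', '6-15'): 1}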
PRUNING

 Pruning is the removal of weak branches (inner IFs).
 It is the removal of unnecessary rules.
 Algorithms usually leave this option to the user.
STOPPING CRITERIA

 In some software this is also referred to as purity.
 The desired purity value is used for pre-pruning, set before the tree is built.
 While the tree is being built, once the records in a node reach the given purity percentage (for example 70% or 95%), that node is turned into a leaf and the process moves on to the other branches.
BAGGING / BOOTSTRAP AVERAGING

 Bagging means running several decision trees at the same time and using the average of their results.
 It significantly improves the quality of the decision tree.
 It is used to reduce the effect of noise.
ENTROPY

 Entropy is used to measure uncertainty, confusion, and randomness within a data set.
 If all the available data belonged to a single class, for example if everyone had the same favourite football team, we would not be surprised by the answer when we asked anyone about their favourite team; the entropy would then be zero (0).
 Entropy takes a value between 0 and 1.
 The higher the entropy, the higher the surprise.

H(p1, p2, ..., pn) = Σ pi · log(1/pi)
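
A small Python sketch of this formula, with base-10 logarithms as in the worked example later in these slides (illustration only):

import math

def entropy(counts):
    # H = sum over classes of p_i * log10(1 / p_i)
    total = sum(counts)
    return sum((c / total) * math.log10(total / c) for c in counts if c > 0)

print(entropy([10]))                 # 0.0 -> a single class, no surprise at all
print(round(entropy([4, 8, 3]), 3))  # 0.438 -> the S/M/L size counts used later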

C4.5 ALGORITHM

 Normally, missing data is not taken into account when creating a decision tree; that is, when calculating the gain ratio, only the records whose values are not missing are used.
 The C4.5 algorithm, however, also uses the records with missing data when calculating the gain ratio, by predicting the missing values with the help of the other data and variables.
 Thus, a tree that can extract more sensitive and more meaningful rules can be produced.
C4.5 ALGORITHM – ENTROPY CALCULATION

H(p1, p2, ..., pn) = Σ pi · log(1/pi)

Using the SAMPLE DATA table above (15 records with Size classes S, M, L):

Calculating by target (class): number of S: 4, number of M: 8, number of L: 3

(4/15) log(15/4) + (8/15) log(15/8) + (3/15) log(15/3) = 0.4384

(Base-10 logarithms are used in these calculations.)
SAMPLE

Split information for Gender (7 female and 8 male records):

H(D_F; D_M) = (7/15) log(15/7) + (8/15) log(15/8) = 0.3001

Gain Ratio (Gender) = 0.4384 / 0.3001 = 1.46
The same operations are now carried out for Weight.

SAMPLE

The Weight and Height values are converted into interval labels:

Weight intervals    Label
48 – 55.99          1
56 – 64.99          2
65 – 75.99          3
76 – 85.99          4
86 – 100            5

Height intervals    Label
151 – 159.99        1
160 – 164.99        2
165 – 174.99        3
175 – 184.99        4
185 and above       5
SAMPLE

Coded data (Weight and Height replaced by their interval labels):

Gender  Weight  Height  Size
F       1       3       M
F       1       1       S
F       1       1       M
F       2       3       M
M       2       2       S
F       2       1       M
M       2       2       S
M       2       3       M
F       3       3       M
F       3       4       L
M       3       3       M
M       3       3       S
M       4       4       M
M       4       5       L
M       5       5       L

Split information for Weight (label counts 3, 5, 4, 2, 1):

H(3/15, 5/15, 4/15, 2/15, 1/15) = (3/15) log(15/3) + (5/15) log(15/5) + (4/15) log(15/4) + (2/15) log(15/2) + (1/15) log(15/1) = 0.6470

Gain Ratio (Weight) = 0.4384 / 0.6470 = 0.6776
SAMPLE

Split information for Height (label counts 3, 2, 6, 2, 2, using the same coded data):

H(3/15, 2/15, 6/15, 2/15, 2/15) = (3/15) log(15/3) + (2/15) log(15/2) + (6/15) log(15/6) + (2/15) log(15/2) + (2/15) log(15/2) = 0.6490

Gain Ratio (Height) = 0.4384 / 0.6490 = 0.6755
SAMPLE

Gain Ratio (Gender) = 0.4384 / 0.3001 = 1.46
Gain Ratio (Weight) = 0.4384 / 0.6470 = 0.6776
Gain Ratio (Height) = 0.4384 / 0.6490 = 0.6755
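
The figures above can be reproduced with a short Python sketch that follows the simplified definition used in these slides (total entropy divided by the split information, base-10 logarithms); the printed results match the slides up to small rounding differences:

import math
from collections import Counter

def entropy10(values):
    # H = sum of p_i * log10(1 / p_i) over the distribution of the given values
    counts = Counter(values)
    n = len(values)
    return sum((c / n) * math.log10(n / c) for c in counts.values())

# coded sample data: (gender, weight label, height label, size class)
rows = [("F",1,3,"M"), ("F",1,1,"S"), ("F",1,1,"M"), ("F",2,3,"M"), ("M",2,2,"S"),
        ("F",2,1,"M"), ("M",2,2,"S"), ("M",2,3,"M"), ("F",3,3,"M"), ("F",3,4,"L"),
        ("M",3,3,"M"), ("M",3,3,"S"), ("M",4,4,"M"), ("M",4,5,"L"), ("M",5,5,"L")]

target_entropy = entropy10([r[3] for r in rows])      # ~0.4384
for name, idx in [("Gender", 0), ("Weight", 1), ("Height", 2)]:
    split_info = entropy10([r[idx] for r in rows])
    print(name, round(split_info, 4), round(target_entropy / split_info, 4))
# Gender 0.3001 1.4611 | Weight 0.647 0.6777 | Height 0.649 0.6756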
SAMPLE

As a result of this algorithm, the variable with the smallest gain ratio is assigned as the root (or as the next node). As can be seen, the Height variable was chosen as the root.

[Diagram: the root node "Height?" with branches for the intervals 151–159.99, 160–164.99, 165–174.99, 175–184.99, and 185 and above.]
