
MANAGEMENT INFORMATION SYSTEMS

DATA MINING AND BUSINESS INTELLIGENCE

Week 5 – Decision Trees


PROF. DR. GÖKHAN SILAHTAROĞLU
Lecturer: NADA MISK
Outline
In KNIME until today:
• Read/Write Data
• CSV Reader, Excel Reader, Table Reader, File Reader, DB Reader
• CSV Writer, Excel Writer, Table Writer
• Data Manipulation
• Column Filter, Row Filter, GroupBy, Pivoting, Concatenate, Rule Engine, String Manipulation, Math Formula, Column Expressions
• Data Understanding / Visualization
• Data Explorer, Color Manager, Bar Chart, Scatter Plot
Today:
• Normalization
• Data Manipulation
• Numeric Binner, Number To String, Joiner
• Data Understanding / Visualization
• Scatter Plot – Table View (component)
• Decision Trees
DATA RESTRUCTURING

 Some of the models, techniques and algorithms that we will use in data mining work only with numerical values, while others use categorical values. Still others only deal with values consisting of 0s and 1s. In such cases, we must adapt the data we have to the algorithm we will work with; that is, we must restructure the data.
 Decision trees use interval values instead of continuous values.
 For example, if the wage variable takes various values between 550 and 15,000 TL, we can divide these values into ranges such as 550–1,000; 1,001–2,000; 2,001–4,000; 4,001–7,000; 7,001–10,000; 10,001–15,000.
 Algorithms such as artificial neural networks work with values in the range 0.0–1.0.
NORMALIZATION

The process of scaling the available data into a range such as 0.0–1.0 is called normalization. There are several methods for the normalization process:

min–max normalization: s' = (s − min) / (max − min)

new min–max normalization: s' = [(s − min) / (max − min)] · (new_max − new_min) + new_min

z-score normalization: s' = (s − avg) / σ
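
A minimal Python sketch of the three formulas above (an illustration added here, not part of the original slides; the function names are our own):

def min_max(s, s_min, s_max):
    # min-max normalization into the 0.0 - 1.0 range
    return (s - s_min) / (s_max - s_min)

def new_min_max(s, s_min, s_max, new_min, new_max):
    # min-max normalization into an arbitrary [new_min, new_max] range
    return (s - s_min) / (s_max - s_min) * (new_max - new_min) + new_min

def z_score(s, avg, sigma):
    # z-score normalization using the mean and the standard deviation
    return (s - avg) / sigma

print(round(min_max(37, 18, 89), 2))            # ~0.27 (Example 1 below)
print(round(new_min_max(37, 18, 89, 1, 5), 2))  # ~2.07 (Example 1 below)
print(round(z_score(37, 48.96, 19.11), 2))      # ~-0.63 (Example 2 below)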

NORMALIZATION – EXAMPLE 1
If the maximum age in the database is 89 and the minimum age is 18, what is the normalized value of 37 years?

s' = (s − min) / (max − min) = (37 − 18) / (89 − 18) ≈ 0.27

If the maximum age in the database is 89 and the minimum age is 18, what is the normalized value of 37 years in the range 1–5?

s' = [(s − min) / (max − min)] · (new_max − new_min) + new_min

s' = [(37 − 18) / (89 − 18)] · (5 − 1) + 1 ≈ 2.07 ≈ 2
NORMALIZATION – EXAMPLE 2

s' = (s − avg) / σ

If the mean age is 48.96 and the standard deviation is 19.11, what is the z-score normalized value of 37 years?

s' = (37 − 48.96) / 19.11 ≈ −0.63
STUDY
 Apply the normalization algorithms to the exam grades below:
 55, 65, 89, 23, 66, 45, 25, 12, 85, 78, 77, 68, 36, 60, 55
VISUAL TRANSFORMATION
 18–29 range: 1st group
 30–37 range: 2nd group
 38–45 range: 3rd group
 46–60 range: 4th group
 61 and over: 5th group
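
A small Python sketch of this kind of interval-based (visual) transformation, using the age groups above (an illustration only; in KNIME the same job is done by the Numeric Binner node listed in the outline):

def age_group(age):
    # map a continuous age to the group labels defined above
    if 18 <= age <= 29:
        return 1
    elif 30 <= age <= 37:
        return 2
    elif 38 <= age <= 45:
        return 3
    elif 46 <= age <= 60:
        return 4
    else:  # 61 and over
        return 5

print([age_group(a) for a in [21, 35, 44, 52, 70]])  # [1, 2, 3, 4, 5]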

DATA MINING MODELS

 Classification: Sales Forecasts; Order Estimates; Estimation and Causes of Production Defect Costs; Fraud Detection
 Clustering: Customer Profiling; Product Sales Profile; Clustering of Error Locations and Times
 Association: Market Basket Analysis; Time-Based Analysis (Consecutive Sales)
SUPERVISED & UNSUPERVISED LEARNING
DATA MINING MODELS

 Prediction: revealing the existence of hidden patterns in databases.
 Classification: sorting data into groups based on certain common characteristics.
CLASSIFICATION

Common classification approaches: "Bayesian classification algorithm", "algorithms based on decision trees", "k-nearest neighbor algorithm", "artificial neural networks".
DECISION TREES

 Decision trees are one of the most widely used algorithms in classification problems.
 Compared to other methods, decision trees are easier to build and to understand.
CLASSIFICATION PROCESS
A decision tree is a structure made up of a root node, branches, and leaf nodes.
Each internal node specifies a test on an attribute, each branch specifies an outcome of that test, and each leaf node carries a class label.
The top node of the tree is the root node.

[Diagram: an example tree. The root node A tests Colour. The branch Brown leads to node B, which tests Horoscope: Lion leads to a leaf with "3-item sale" and Cancer to a leaf with "2-item sale". The branch Pink leads to node C, which tests Product: Blouse leads to a leaf with "2-item sale" and Skirt to a leaf with "1-item sale".]
RULE EXAMPLES EXTRACTED BY THE DECISION TREE

Rule 1:
If colour = brown
Then
• If horoscope = "lion" Then decision = 3-item sale;
• If horoscope = "cancer" Then decision = 2-item sale;

Rule 2:
If colour = pink
Then
• If item = "blouse" Then decision = 2-item sale;
• If item = "skirt" Then decision = 1-item sale;
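
Since a decision tree is in effect a set of nested IF commands (as noted later in these slides), the two rules above can be sketched in Python as follows (an illustration; attribute names and return values follow the slide):

def predict_sale(colour, horoscope=None, item=None):
    # nested-IF form of Rule 1 and Rule 2 above
    if colour == "brown":
        if horoscope == "lion":
            return "3-item sale"
        if horoscope == "cancer":
            return "2-item sale"
    if colour == "pink":
        if item == "blouse":
            return "2-item sale"
        if item == "skirt":
            return "1-item sale"
    return "no rule matched"

print(predict_sale("brown", horoscope="lion"))  # 3-item sale
print(predict_sale("pink", item="skirt"))       # 1-item sale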
DECISION TREE STRUCTURE EXAMPLE

[Diagram: an example tree structure in which the internal nodes test attributes A1–A4 and the leaves carry the class labels C1–C5.]
LIFT

 Decision trees produce rules, but they also provide some parameter values. One of the most important of these is the concept of lift.
 Lift is a value that shows how strongly the records in a node belong to the target class compared to the data set as a whole. For example, a lift value of 255% indicates that the records in that node belong to the specified class at a rate 2.55 times higher than in the overall data.
LIFT CALCULATION EXAMPLE

Membership Period   Leave
1-5                 0
6-15                1
6-15                1
1-5                 0
6-15                1
6-15                0
1-5                 1
1-5                 0
1-5                 0

 Rule 1: IF membership = 1-5 THEN Leave = 0;
 Rule 2: IF membership = 6-15 THEN Leave = 1;

 Lift for Rule 1: (4/5) / (5/9) = 1.44
 Lift for Rule 2: (3/4) / (4/9) = 1.69
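
A short Python sketch of the lift calculation above (illustration only):

# (membership period, leave) pairs from the table above
records = [("1-5", 0), ("6-15", 1), ("6-15", 1), ("1-5", 0), ("6-15", 1),
           ("6-15", 0), ("1-5", 1), ("1-5", 0), ("1-5", 0)]

def lift(records, membership, leave):
    # lift = P(class inside the node) / P(class in the whole data set)
    node = [r for r in records if r[0] == membership]
    node_rate = sum(1 for r in node if r[1] == leave) / len(node)
    overall_rate = sum(1 for r in records if r[1] == leave) / len(records)
    return node_rate / overall_rate

print(round(lift(records, "1-5", 0), 2))   # Rule 1: 1.44
print(round(lift(records, "6-15", 1), 2))  # Rule 2: 1.69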
SAMPLE DATA
Gender Weight Height Size
F 48 170 M
F 49 151 S
F 52 158 M
F 56 165 M
M 59 160 S
F 61 159 M
M 62 162 S
M 63 174 M
F 68 168 M
F 69 177 L
M 72 170 M
M 74 165 S
M 85 175 M
M 85 190 L
M 98 190 L
SAMPLE DATA TRANSFORMED INTO A DECISION TREE

[Diagram: the sample data converted into a decision tree. The root node splits on Height (roughly 160–175, 176–182, and 183 and above); the subtrees then split on Gender or Weight, and the leaves carry the Size classes Small, Medium, and Large.]
DECISION TREES

 Rule generator: they produce rules, for example:
• If gender = male and age is between 30–40, then will buy a computer
• If age > 20 and studying computer science and living in Istanbul, then reads comics
 They work on the principle of repeatedly splitting (narrowing down) the data, like the game "guess the object I have in mind in 20 questions", http://www.20q.net/
 They produce a tree-like structure; what is actually produced is a set of nested IF commands.
 Each leaf of the tree (the THEN expression) returns the class.
GENERAL ALGORITHM OF DECISION TREES

D: learning database
T: tree to be built
T = ∅                      // initially the tree is the empty set
set the branching (splitting) criterion
T = set the root node
T = branch the root node according to the branching rules
for each branch do
    set the variable (attribute) that will be placed at this node
    if (the stopping condition is reached)
        add a leaf and stop
    else
        loop (branch this node again)
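
A highly simplified Python sketch of this general scheme (an illustration under our own assumptions: the split attribute is taken in the order given rather than chosen by a gain-ratio criterion, and the stopping condition is a purity threshold):

from collections import Counter

def build_tree(rows, attributes, target, min_purity=1.0):
    # rows: list of dicts, attributes: candidate columns, target: class column
    classes = [r[target] for r in rows]
    majority, count = Counter(classes).most_common(1)[0]
    if count / len(rows) >= min_purity or not attributes:
        return majority                       # add a leaf and stop
    attr = attributes[0]                      # placeholder branching criterion
    node = {}
    for value in sorted(set(r[attr] for r in rows)):
        subset = [r for r in rows if r[attr] == value]
        node[(attr, value)] = build_tree(subset, attributes[1:], target, min_purity)
    return node

data = [{"membership": "1-5", "leave": 0}, {"membership": "1-5", "leave": 0},
        {"membership": "6-15", "leave": 1}, {"membership": "6-15", "leave": 1}]
print(build_tree(data, ["membership"], "leave"))
# {('membership', '1-5'): 0, ('membership', '6-15'): 1}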
PRUNING

 Pruning is the removal of weak branches (inner IFs).
 It is the removal of unnecessary rules.
 Algorithms usually leave this option to the user.
STOPPING CRITERIA

 In some software this is also referred to as purity.
 The desired purity value is used for pre-pruning, set before the tree is built.
 While the tree is being built, once the records in a node reach the given purity percentage (for example 70% or 95%), that node is turned into a leaf and the process moves on to the other branches.
BAGGING / BOOTSTRAP AVERAGING

 Bagging means running several decision trees at the same time and using the average of their results.
 It significantly improves the quality of the decision tree.
 It is used to reduce the effect of noise.
ENTROPY

 Entropy is used to measure uncertainty, confusion, and randomness within a data set.
 If all the available data belonged to a single class, for example if everyone had the same favourite football team, we would not be surprised by the answer when we asked anyone about their favourite team; the entropy would then be zero (0).
 Entropy takes a value between 0 and 1.
 The higher the entropy, the higher the surprise.

H(p1, p2, ..., pn) = Σ pi · log(1/pi)
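
A small Python sketch of this formula, with base-10 logarithms as in the worked example later in these slides (illustration only):

import math

def entropy(counts):
    # H = sum over classes of p_i * log10(1 / p_i)
    total = sum(counts)
    return sum((c / total) * math.log10(total / c) for c in counts if c > 0)

print(entropy([10]))                 # 0.0 -> a single class, no surprise at all
print(round(entropy([4, 8, 3]), 3))  # 0.438 -> the S/M/L size counts used later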

C4.5 ALGORITHM

 Normally, missing data is not taken into account when creating a decision tree; that is, when calculating the gain ratio, only the records whose values are not missing are used.
 The C4.5 algorithm, however, also uses the records with missing data when calculating the gain ratio, by predicting the missing values with the help of the other data and variables.
 Thus, a tree that can extract more sensitive and more meaningful rules can be produced.
C4.5 ALGORITHM – ENTROPY CALCULATION

H(p1, p2, ..., pn) = Σ pi · log(1/pi)

Using the SAMPLE DATA table above (15 records with Size classes S, M, L):

Calculating by target (class): number of S: 4, number of M: 8, number of L: 3

(4/15) log(15/4) + (8/15) log(15/8) + (3/15) log(15/3) = 0.4384

(Base-10 logarithms are used in these calculations.)
SAMPLE

Split information for Gender (7 female and 8 male records):

H(D_F; D_M) = (7/15) log(15/7) + (8/15) log(15/8) = 0.3001

Gain Ratio (Gender) = 0.4384 / 0.3001 = 1.46
The same operations are now carried out for Weight.

SAMPLE

The Weight and Height values are converted into interval labels:

Weight intervals    Label
48 – 55.99          1
56 – 64.99          2
65 – 75.99          3
76 – 85.99          4
86 – 100            5

Height intervals    Label
151 – 159.99        1
160 – 164.99        2
165 – 174.99        3
175 – 184.99        4
185 and above       5
SAMPLE

Coded data (Weight and Height replaced by their interval labels):

Gender  Weight  Height  Size
F       1       3       M
F       1       1       S
F       1       1       M
F       2       3       M
M       2       2       S
F       2       1       M
M       2       2       S
M       2       3       M
F       3       3       M
F       3       4       L
M       3       3       M
M       3       3       S
M       4       4       M
M       4       5       L
M       5       5       L

Split information for Weight (label counts 3, 5, 4, 2, 1):

H(3/15, 5/15, 4/15, 2/15, 1/15) = (3/15) log(15/3) + (5/15) log(15/5) + (4/15) log(15/4) + (2/15) log(15/2) + (1/15) log(15/1) = 0.6470

Gain Ratio (Weight) = 0.4384 / 0.6470 = 0.6776
SAMPLE

Split information for Height (label counts 3, 2, 6, 2, 2, using the same coded data):

H(3/15, 2/15, 6/15, 2/15, 2/15) = (3/15) log(15/3) + (2/15) log(15/2) + (6/15) log(15/6) + (2/15) log(15/2) + (2/15) log(15/2) = 0.6490

Gain Ratio (Height) = 0.4384 / 0.6490 = 0.6755
SAMPLE

Gain Ratio (Gender) = 0.4384 / 0.3001 = 1.46
Gain Ratio (Weight) = 0.4384 / 0.6470 = 0.6776
Gain Ratio (Height) = 0.4384 / 0.6490 = 0.6755
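
The figures above can be reproduced with a short Python sketch that follows the simplified definition used in these slides (total entropy divided by the split information, base-10 logarithms); the printed results match the slides up to small rounding differences:

import math
from collections import Counter

def entropy10(values):
    # H = sum of p_i * log10(1 / p_i) over the distribution of the given values
    counts = Counter(values)
    n = len(values)
    return sum((c / n) * math.log10(n / c) for c in counts.values())

# coded sample data: (gender, weight label, height label, size class)
rows = [("F",1,3,"M"), ("F",1,1,"S"), ("F",1,1,"M"), ("F",2,3,"M"), ("M",2,2,"S"),
        ("F",2,1,"M"), ("M",2,2,"S"), ("M",2,3,"M"), ("F",3,3,"M"), ("F",3,4,"L"),
        ("M",3,3,"M"), ("M",3,3,"S"), ("M",4,4,"M"), ("M",4,5,"L"), ("M",5,5,"L")]

target_entropy = entropy10([r[3] for r in rows])      # ~0.4384
for name, idx in [("Gender", 0), ("Weight", 1), ("Height", 2)]:
    split_info = entropy10([r[idx] for r in rows])
    print(name, round(split_info, 4), round(target_entropy / split_info, 4))
# Gender 0.3001 1.4611 | Weight 0.647 0.6777 | Height 0.649 0.6756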
SAMPLE

As a result of this algorithm, the variable with the smallest gain ratio is assigned as the root (or as the next node). As can be seen, the Height variable was chosen as the root.

[Diagram: the root node "Height?" with branches for the intervals 151–159.99, 160–164.99, 165–174.99, 175–184.99, and 185 and above.]
