Some of the models, techniques, and algorithms used in data mining work only with numerical values, while others require categorical values; still others accept only values consisting of 0s and 1s. In such cases we must adapt the data we have to the algorithm we will use, that is, we must transform the data into a suitable form.
Some decision-tree algorithms use interval (binned) values instead of continuous values.
For example, if the wage variable takes values between 550 and 15,000 TL, we can divide these values into ranges such as 550–1000, 1001–2000, 2001–4000, 4001–7000, 7001–10,000, and 10,001–15,000.
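This kind of binning can be sketched in Python; the bin edges follow the wage example above, and `bin_wage` is an illustrative helper name, not part of any library:

```python
# Binning a continuous wage variable into the intervals from the text:
# 550-1000, 1001-2000, 2001-4000, 4001-7000, 7001-10000, 10001-15000.
import bisect

edges = [1000, 2000, 4000, 7000, 10000, 15000]   # upper edge of each bin
labels = ["550-1000", "1001-2000", "2001-4000",
          "4001-7000", "7001-10000", "10001-15000"]

def bin_wage(wage):
    """Return the interval label for a wage between 550 and 15000 TL."""
    return labels[bisect.bisect_left(edges, wage)]

print(bin_wage(550))    # -> 550-1000
print(bin_wage(3500))   # -> 2001-4000
print(bin_wage(15000))  # -> 10001-15000
```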
Algorithms such as artificial neural networks work with values in the range 0.0–1.0.
DR. G.SILAHTAROGLU 3
NORMALIZATION
The process of rescaling the available data into a range such as 0.0–1.0 is called normalization. There are several methods for normalization:

min-max normalization: s' = (s - min) / (max - min)

min-max normalization to a new range: s' = ((s - min) / (max - min)) · (new_max - new_min) + new_min

z-score normalization: s' = (s - avg) / σ
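The three formulas can be sketched in Python as follows (the function names are illustrative):

```python
def min_max(s, lo, hi):
    """Min-max normalization: scale s from [lo, hi] into [0.0, 1.0]."""
    return (s - lo) / (hi - lo)

def min_max_range(s, lo, hi, new_lo, new_hi):
    """Min-max normalization into a new range [new_lo, new_hi]."""
    return (s - lo) / (hi - lo) * (new_hi - new_lo) + new_lo

def z_score(s, avg, sigma):
    """Z-score normalization: standardize s by the mean and std deviation."""
    return (s - avg) / sigma
```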
NORMALIZATION– EXAMPLE 1
If the maximum age in the database is 89 and the minimum age is 18, what is the normalized value of 37 years?

s' = (s - min) / (max - min) = (37 - 18) / (89 - 18) = 19 / 71 ≈ 0.27

What is the normalized value of 37 years in the range 1–5?

s' = ((s - min) / (max - min)) · (new_max - new_min) + new_min = ((37 - 18) / (89 - 18)) · (5 - 1) + 1 ≈ 2.07
NORMALIZATION– EXAMPLE 2
z-score normalization: s' = (s - avg) / σ

If the mean age is 48.96 and the standard deviation is 19.11, what is the z-score normalized value of 37 years?

s' = (37 - 48.96) / 19.11 ≈ -0.63
EXERCISE
Apply the normalization algorithms to the following exam grades:
55, 65, 89, 23, 66, 45, 25, 12, 85, 78, 77, 68, 36, 60, 55
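A possible solution sketch for the exercise in Python; the population standard deviation (`pstdev`) is assumed, since the slide does not specify sample vs. population:

```python
from statistics import mean, pstdev

grades = [55, 65, 89, 23, 66, 45, 25, 12, 85, 78, 77, 68, 36, 60, 55]

# Min-max normalization into [0.0, 1.0]
lo, hi = min(grades), max(grades)              # 12 and 89
min_max = [(g - lo) / (hi - lo) for g in grades]

# Z-score normalization
avg, sigma = mean(grades), pstdev(grades)
z = [(g - avg) / sigma for g in grades]

print([round(v, 2) for v in min_max])
print([round(v, 2) for v in z])
```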
VISUAL TRANSFORMATION
Sales Forecasts
Order Estimates
Classification estimation and causes of production defect costs
Fraud Detection
SUPERVISED & UNSUPERVISED LEARNING
DATA MINING MODELS
Prediction
Classification
Revealing the existence of hidden patterns in databases
CLASSIFICATION
DECISION TREES
[Decision-tree diagram: the root splits on color (brown / pink); the brown branch splits on horoscope (Leo / Cancer) and the pink branch on product (blouse / skirt), ending in leaves of 3-item, 2-item, 2-item, and 1-item sales.]
RULE EXAMPLES EXTRACTED BY A DECISION TREE

Rule 1:
If color = brown Then
• If horoscope = "Leo" Then decision = 3-item sale;
• If horoscope = "Cancer" Then decision = 2-item sale;

Rule 2:
If color = pink Then
• If item = "blouse" Then decision = 2-item sale;
• If item = "skirt" Then decision = 1-item sale;
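Rules like these can be written as nested conditionals; the sketch below follows Rule 1 and Rule 2, and `decide` is an illustrative name:

```python
def decide(color, horoscope=None, item=None):
    """Predicted number of items sold, following Rule 1 and Rule 2."""
    if color == "brown":                  # Rule 1
        if horoscope == "Leo":
            return 3
        if horoscope == "Cancer":
            return 2
    elif color == "pink":                 # Rule 2
        if item == "blouse":
            return 2
        if item == "skirt":
            return 1
    return None  # no rule covers this record

print(decide("brown", horoscope="Leo"))  # -> 3
print(decide("pink", item="skirt"))      # -> 1
```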
DECISION TREE STRUCTURE EXAMPLE
[Diagram: a decision-tree structure with attribute nodes A1–A4 and leaf classes C1–C5.]
LIFT
Decision trees produce rules, but they also provide some parameter values; one of the most important is lift.
Lift is the value that shows how concentrated the target class is among the records in a node compared with the data set as a whole. For example, a lift value of 255% indicates that the records in that node belong to the specified class at a rate 2.55 times higher than in the whole data set.
LIFT CALCULATION EXAMPLE
Membership P.   Leave
1-5             0
6-15            1
6-15            1
1-5             0
6-15            1
6-15            0
1-5             1
1-5             0
1-5             0

Rule 1: IF membership = 1-5 THEN Leave = 0;
Rule 2: IF membership = 6-15 THEN Leave = 1;
LIFT CALCULATION EXAMPLE

Rule 1 (membership = 1-5): 4 of the 5 records in this node have Leave = 0, i.e. 80%. In the whole table, 5 of the 9 records have Leave = 0 (55.6%). Lift = 80% / 55.6% ≈ 1.44.

Rule 2 (membership = 6-15): 3 of the 4 records in this node have Leave = 1, i.e. 75%. In the whole table, 4 of the 9 records have Leave = 1 (44.4%). Lift = 75% / 44.4% ≈ 1.69.
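The lift of each rule over the nine-record table can be computed with a short sketch (lift = P(class | node) / P(class) over the whole data set; `lift` is an illustrative name):

```python
# The nine (membership, leave) records from the example table.
data = [("1-5", 0), ("6-15", 1), ("6-15", 1), ("1-5", 0),
        ("6-15", 1), ("6-15", 0), ("1-5", 1), ("1-5", 0), ("1-5", 0)]

def lift(node_value, target_class):
    """Lift of the rule IF membership = node_value THEN Leave = target_class."""
    node = [leave for member, leave in data if member == node_value]
    node_rate = node.count(target_class) / len(node)
    overall_rate = sum(1 for _, leave in data if leave == target_class) / len(data)
    return node_rate / overall_rate

print(round(lift("1-5", 0), 2))   # -> 1.44
print(round(lift("6-15", 1), 2))  # -> 1.69
```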
The desired purity value is used for pre-pruning, before the tree is fully grown.
Taking this purity value into account while the tree is being built, once a node reaches a given purity percentage such as 70% or 95%, it is turned into a leaf and the process moves on to the other branches.
BAGGING / BOOTSTRAP AVERAGING
Bagging runs several decision trees at the same time, each trained on a bootstrap sample of the data, and uses the mean of their outputs.
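A minimal sketch of the idea, with a trivial mean predictor standing in for a real decision tree (all names are illustrative):

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) records with replacement (a bootstrap sample)."""
    return [rng.choice(data) for _ in data]

def bagged_predict(data, n_models=10, seed=42):
    """Train one model per bootstrap sample and average the predictions."""
    rng = random.Random(seed)
    # Stand-in "tree": each model simply predicts the mean of its sample.
    predictions = [sum(s) / len(s)
                   for s in (bootstrap_sample(data, rng) for _ in range(n_models))]
    return sum(predictions) / len(predictions)

print(bagged_predict([1, 2, 3, 4, 5]))
```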
C4.5 ALGORITHM
In a basic decision tree, missing data are not taken into account when the tree is created; that is, when calculating the gain ratio, only the records whose data are not missing are used.
The C4.5 algorithm instead makes use of the missing data as well, predicting the missing values with the help of the other records and variables when calculating the gain ratio.
Thus a tree that can extract more sensitive and more meaningful rules can be produced.
C4.5 ALGORITHM
ENTROPY CALCULATION
Entropy of the target variable Size (4 S, 8 M, 3 L among the 15 records), using base-10 logarithms:

H(D) = (4/15)·log(15/4) + (8/15)·log(15/8) + (3/15)·log(15/3) = 0.4384
EXAMPLE
Gender  Weight  Height  Size
F       48      170     M
F       49      151     S
F       52      158     M
F       56      165     M
M       59      160     S
F       61      159     M
M       62      162     S
M       63      174     M
F       68      168     M
F       69      177     L
M       72      170     M
M       74      165     S
M       85      175     M
M       85      190     L
M       98      190     L

Split information for Gender (7 F, 8 M):

H(DF; DM) = (7/15)·log(15/7) + (8/15)·log(15/8) = 0.3001

Gain ratio (Gender) = 0.4384 / 0.3001 = 1.46
The same operations are now done for Weight.
EXAMPLE

Gender  Weight  Height  Size
F       1       3       M
F       1       1       S
F       1       1       M
F       2       3       M
M       2       2       S
F       2       1       M
M       2       2       S
M       2       3       M
F       3       3       M
F       3       4       L
M       3       3       M
M       3       3       S
M       4       4       M
M       4       5       L
M       5       5       L

Split information for Weight (bin counts 3, 5, 4, 2, 1):

H(3/15; 5/15; 4/15; 2/15; 1/15) = (3/15)·log(15/3) + (5/15)·log(15/5) + (4/15)·log(15/4) + (2/15)·log(15/2) + (1/15)·log(15/1) = 0.6470

Gain ratio (Weight) = 0.4384 / 0.6470 = 0.6776
EXAMPLE

Gender  Weight  Height  Size
F       1       1       S
F       1       1       M
F       2       1       M
M       2       2       S
M       2       2       S
F       1       3       M
F       2       3       M
M       2       3       M
F       3       3       M
M       3       3       M
M       3       3       S
F       3       4       L
M       4       4       M
M       4       5       L
M       5       5       L

Split information for Height (bin counts 3, 2, 6, 2, 2):

H(3/15; 2/15; 6/15; 2/15; 2/15) = (3/15)·log(15/3) + (2/15)·log(15/2) + (6/15)·log(15/6) + (2/15)·log(15/2) + (2/15)·log(15/2) = 0.6490

Gain ratio (Height) = 0.4384 / 0.6490 = 0.6755
EXAMPLE

[Diagram: the tree splits on Height into interval branches such as 151–159.99 and 185+.]