Why Data Mining? Commercial Viewpoint
• Lots of data is being collected and warehoused
– Web data
• Google has petabytes of web data
• Facebook has billions of active users
– purchases at department/grocery stores, e-commerce
• Amazon handles millions of visits/day
– Bank/credit card transactions
• Computers have become cheaper and more powerful

Why Data Mining? Scientific Viewpoint
• Data is collected and stored at enormous speeds
– remote sensors on a satellite
• NASA EOSDIS archives over petabytes of earth science data per year
– telescopes scanning the skies
• sky survey data
– high-throughput biological data
– scientific simulations
• terabytes of data generated in a few hours
• Data mining helps scientists
[Figures: fMRI data from brain, sky survey data, surface temperature of Earth, gene expression data.]
Great opportunities to improve productivity in all walks of life
Great opportunities to solve society's major problems
– Improving health care and reducing costs
– Predicting the impact of climate change
– Classification
• used for discrete target variables
– For example, predicting whether a web user will make a purchase at an online bookstore
– Regression
• used for continuous target variables
– For example, forecasting the future price of a stock
• The goal of both tasks is to learn a model that minimizes the error between the predicted and true values of the target variable

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
[Figure: building a classifier — a training set of labeled records (home owner, education, years employed, class label) is fed to a learning algorithm, which induces a decision tree that tests years of employment (> 3 yr / < 3 yr, > 7 yrs / < 7 yrs) and predicts Yes or No.]
• Predicting tumor cells as benign or malignant
• Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
• Label past transactions as fraud or fair transactions.
– This forms the class attribute.
• Learn a model for the class of the transactions.
• Use this model to detect fraud by observing credit card transactions on an account.
[Figure: use of K-means to partition Sea Surface Temperature (SST) measurements, plotted by latitude and longitude, into land and sea clusters of the Northern and Southern Hemispheres.]
http://www3.yildiz.edu.tr/~naydin Fundamentals
[Figure: a discrete information-processing system — discrete inputs, system state, and discrete outputs; internally, a CPU with a control unit, datapath, and input/output. Typical inputs: keyboard, mouse, modem, microphone; typical outputs: CRT, LCD, modem, speakers.]
• Synchronous or asynchronous?
Transducers
Continuous (analog) signal ↔ discrete signal:
x(t) = f(t)  → analog-to-digital conversion →  x[n] = x[1], x[2], x[3], ..., x[n]
• The continuous analogue signal has to be held before it can be sampled.
[Figure: digitization — a continuous measurement x(t) over 0–10 seconds and the discrete samples x(n) taken from it.]
Sampling
• The sampling results in a discrete set of digital numbers that represent measurements of the signal
– usually taken at equal intervals of time
• Sampling takes place after the hold
– The hold circuit must be fast enough that the signal is not changing during the time the circuit is acquiring the signal value
• We don't know what we don't measure
• In the process of measuring the signal, some information is lost

Sampling
• The analog signal is sampled every Ts secs.
• Ts is referred to as the sampling interval.
• fs = 1/Ts is called the sampling rate or sampling frequency.
• There are 3 sampling methods:
– Ideal: an impulse at each sampling instant
– Natural: a pulse of short width with varying amplitude
– Flattop: sample and hold; like natural, but with a single amplitude value
• The process is referred to as pulse amplitude modulation (PAM) and the outcome is a signal with analog (non-integer) values
Sampling Theorem — Nyquist sampling rate for low-pass and bandpass signals
• A band-limited signal with highest frequency fm can be reconstructed from its samples only if the sampling rate satisfies fs ≥ 2 fm.
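As a small illustration (a sketch, not from the original slides; fm_hz and fs_hz are illustrative names), a Python check of the Nyquist criterion:

# Nyquist check: a signal band-limited to fm (Hz) can be
# reconstructed only if it is sampled at fs >= 2*fm.
def nyquist_ok(fm_hz, fs_hz):
    return fs_hz >= 2 * fm_hz

print(nyquist_ok(4000, 8000))   # True: 8 kHz is enough for a 4 kHz band
print(nyquist_ok(4000, 6000))   # False: aliasing would occur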
(Number)_r = ( Σ_{i=0}^{n-1} A_i r^i ) + ( Σ_{j=-m}^{-1} A_j r^j )
           = (Integer Portion) + (Fraction Portion)
• How about negative numbers?
– we use two more symbols to distinguish positive and negative: + and -
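A small Python sketch (not part of the original slides) that evaluates this sum for the digits of a base-r number:

# Evaluate (Number)_r = sum(A_i * r^i, i = 0..n-1) + sum(A_j * r^j, j = -m..-1)
def base_r_value(int_digits, frac_digits, r):
    value = 0
    for a in int_digits:                          # integer portion, MSD first
        value = value * r + a
    for j, a in enumerate(frac_digits, start=1):  # fraction portion
        value += a * r ** (-j)
    return value

print(base_r_value([1, 0, 1], [1, 1], 2))   # 101.11 in base 2 = 5.75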
• Given n binary digits (called bits), a binary code is a mapping from a set of represented elements to a subset of the 2^n binary numbers.
• Example: a binary code for the seven colors of the rainbow; code 100 is not used.
Color   Binary Number
Red     000
Orange  001
Yellow  010
Green   011
Blue    101
Indigo  110
Violet  111
• Given M elements to be represented by a binary code, the minimum number of bits, n, satisfies the relationships 2^n ≥ M > 2^(n-1), i.e. n = ⌈log2 M⌉, where ⌈x⌉, called the ceiling function, is the smallest integer greater than or equal to x.
• Example: How many bits are required to represent decimal digits with a binary code?
– 4 bits are required (n = ⌈log2 10⌉ = 4)
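The relationship n = ⌈log2 M⌉ is easy to check directly; a minimal sketch:

import math

# Minimum number of bits n needed to encode M distinct elements:
# 2**n >= M > 2**(n-1), i.e. n = ceil(log2(M)).
def min_bits(m):
    return math.ceil(math.log2(m))

print(min_bits(7))    # 3 bits for the seven rainbow colors
print(min_bits(10))   # 4 bits for the ten decimal digits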
(57)_dec = (0101 0111)_BCD
• 13 → 0001 0011 in BCD
– This is coding.
XOR                XNOR
A B | A ⊕ B        A B | (A ⊕ B)'
0 0 |   0          0 0 |    1
0 1 |   1          0 1 |    0
1 0 |   1          1 0 |    0
1 1 |   0          1 1 |    1
• Consider measuring height above average:
– If Ali's height is 3 cm above average and Veli's height is 6 cm above average, then would we say that Veli is twice as tall as Ali?
– Is this situation analogous to that of temperature?
• the Fahrenheit scale? the Kelvin scale?

Attribute Type | Description | Examples | Operations
Nominal (Categorical, Qualitative) | values are just different names | {male, female} | mode, entropy, χ² test
Ordinal (Categorical, Qualitative) | values also order objects (<, >) | hardness of minerals, {good, better, best}, grades, street numbers | median, percentiles, rank correlation, run tests, sign tests
Interval (Numeric, Quantitative) | for interval attributes, differences between values are meaningful (+, −) | calendar dates, temperature in Celsius or Fahrenheit | mean, standard deviation, Pearson's correlation, t and F tests
Ratio (Numeric, Quantitative) | for ratio variables, both differences and ratios are meaningful (×, ÷) | temperature in Kelvin, monetary quantities, counts, age, mass, length, current | geometric mean, harmonic mean, percent variation
• The type of an attribute can also be described in terms of transformations that do not change its meaning.
– For example, the meaning of a length attribute is unchanged if it is measured in meters instead of feet.
• The statistical operations that make sense for a particular type of attribute are those that will not change under such transformations.

Attribute Type | Allowed Transformations | Comment
Nominal | any permutation of values | If all employee ID numbers were reassigned, would it make any difference?
Ordinal | an order-preserving change of values, i.e., new_value = f(old_value), where f is a monotonic function | The notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.
Interval | new_value = a × old_value + b, where a and b are constants | Thus, the Fahrenheit and Celsius temperature scales differ in the location of their zero value and the size of a unit (degree).
• If the order of the terms (words) in a document is ignored—the "bag of words" approach—then a document can be represented as a term vector, where each term is a component (attribute) of the vector and its value is the number of times the term occurs in the document.
Term:        team  coach  play  ball  score  game  win  lost  timeout  season
Document 1:   3     0      5     0     2      6     0    2     0        2
• Poor data quality negatively affects many data processing efforts.
– Data mining example: a classification model for detecting people who are loan risks is built using poor data
• some credit-worthy candidates are denied loans
• What kinds of data quality problems are there? How can we detect problems with the data? What can we do about these problems?
• Examples of data quality problems:
– noise and outliers, wrong data, fake data, missing values, duplicate data
• Measurement error refers to any problem resulting from the measurement process.
– A common problem is that the value recorded differs from the true value to some extent.
– For continuous attributes, the numerical difference between the measured and true value is called the error.
• Data collection error refers to errors such as:
– omitting data objects or attribute values,
– inappropriately including a data object.
• Both measurement errors and data collection errors can be either systematic or random.
Dimensionality Reduction
• Data sets can have a large number of features.
• There are a variety of benefits to dimensionality reduction.
• Many data mining algorithms work better if the dimensionality—the number of attributes in the data—is lower.
– This is partly because dimensionality reduction can eliminate irrelevant features and reduce noise, and partly because of the curse of dimensionality.
• When dimensionality increases, data becomes increasingly sparse in the space that it occupies.
• Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful.
[Experiment: randomly generate 500 points and compute the difference between the maximum and minimum distance between any pair of points.]
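The experiment is easy to reproduce; a sketch (the dimensions tried are illustrative) showing that the gap between the maximum and minimum pairwise distance shrinks, relatively, as dimensionality grows:

import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for dim in (2, 10, 100, 1000):
    pts = rng.random((500, dim))   # 500 random points in [0, 1]^dim
    d = pdist(pts)                 # all pairwise Euclidean distances
    print(dim, np.log10((d.max() - d.min()) / d.min()))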
• Equal frequency approach used to obtain 4 values.
• K-means approach used to obtain 4 values.
Transformations
• often applied to convert a similarity to a dissimilarity, or vice versa, or to transform a proximity measure to fall within a particular range, such as [0, 1].
– For instance, we may have similarities that range from 1 to 10, but the particular algorithm or software package that we want to use may be designed to work only with dissimilarities, or it may work only with similarities in the interval [0, 1].
• Frequently, proximity measures, especially similarities, are defined or transformed to have values in the interval [0, 1].
• Next, we consider more complicated measures of proximity between objects that involve multiple attributes:
– dissimilarities between data objects
– similarities between data objects
• Euclidean distance:
d(x, y) = sqrt( Σ_{k=1}^{n} (x_k − y_k)² )
where n is the number of dimensions (attributes) and x_k and y_k are, respectively, the kth attributes (components) of data objects x and y.
• Standardization is necessary if scales differ.
• If s(x, y) is the similarity between points x and y, then the typical properties of similarities are the following:
– Positivity: s(x, y) = 1 only if x = y (0 ≤ s ≤ 1)
– Symmetry: s(x, y) = s(y, x) for all x and y
• For similarities, the triangle inequality typically does not hold
– However, a similarity measure can be converted to a metric distance
• Consider an experiment in which people are asked to classify a small set of characters as they flash on a screen.
– The confusion matrix for this experiment records how often each character is classified as itself, and how often each is classified as another character.
– Using the confusion matrix, we can define a similarity measure between a character x and a character y as the number of times that x is misclassified as y,
• but note that this measure is not symmetric.
Similarity Measures for Binary Data
• Simple Matching Coefficient (SMC)
– one commonly used similarity coefficient
• Jaccard Similarity Coefficient
– frequently used to handle objects consisting of asymmetric binary attributes
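A minimal sketch of both coefficients for 0/1 vectors (SMC counts all matching attributes; Jaccard ignores 0–0 matches):

def smc(x, y):
    # Simple Matching Coefficient: matching attributes / all attributes
    return sum(a == b for a, b in zip(x, y)) / len(x)

def jaccard(x, y):
    # Jaccard: 1-1 matches / attributes that are not 0-0 matches
    m11 = sum(a == 1 and b == 1 for a, b in zip(x, y))
    not00 = sum(not (a == 0 and b == 0) for a, b in zip(x, y))
    return m11 / not00 if not00 else 0.0

x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(smc(x, y), jaccard(x, y))   # 0.7 0.0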
• The correlation between two sets of numerical values, i.e., two vectors x and y, is defined by:
corr(x, y) = covariance(x, y) / ( standard_deviation(x) × standard_deviation(y) )
– where the standard statistical notation and definitions are used.
• Correlation is always in the range −1 to 1.
– A correlation of 1 (−1) means that x and y have a perfect positive (negative) linear relationship;
• that is, x_k = a y_k + b, where a and b are constants.
• The following two vectors x and y illustrate cases where the correlation is −1 and +1, respectively:
x = (−3, 6, 0, 3, −6)    x = (3, 6, 0, 3, 6)
y = (1, −2, 0, −1, 2)    y = (1, 2, 0, 1, 2)
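The two cases can be verified numerically; a sketch:

import numpy as np

def corr(x, y):
    # Pearson correlation: covariance(x, y) / (std(x) * std(y))
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.mean((x - x.mean()) * (y - y.mean())) / (x.std() * y.std())

print(corr([-3, 6, 0, 3, -6], [1, -2, 0, -1, 2]))   # -1.0
print(corr([3, 6, 0, 3, 6], [1, 2, 0, 1, 2]))       # +1.0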
• Compare the three proximity measures according to their behavior under variable transformation:
– scaling: multiplication by a value
– translation: adding a constant

Property                               Cosine  Correlation  Euclidean Distance
Invariant to scaling (multiplication)  Yes     Yes          No
Invariant to translation (addition)    No      Yes          No

• Consider the example:
– x = (1, 2, 4, 3, 0, 0, 0), y = (1, 2, 3, 4, 0, 0, 0)
– ys = y × 2 = (2, 4, 6, 8, 0, 0, 0), yt = y + 5 = (6, 7, 8, 9, 5, 5, 5)

Measure             (x, y)   (x, ys)  (x, yt)
Cosine              0.9667   0.9667   0.7940
Correlation         0.9429   0.9429   0.9429
Euclidean Distance  1.4142   5.8310   14.2127

• The choice of the right proximity measure depends on the domain.
• What is the correct choice of proximity measure for the following situations?
– Comparing documents using the frequencies of words
• Documents are considered similar if the word frequencies are similar
– Comparing the temperature in Celsius of two locations
• Two locations are considered similar if the temperatures are similar in magnitude
– Comparing two time series of temperature measured in Celsius
• Two time series are considered similar if their shape is similar,
– i.e., they vary in the same way over time, achieving minimums and maximums at similar times, etc.
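The table's values can be reproduced with a short script (a sketch; scipy's cosine and correlation functions return distances, i.e., 1 minus the corresponding similarity):

import numpy as np
from scipy.spatial import distance

x  = np.array([1, 2, 4, 3, 0, 0, 0])
y  = np.array([1, 2, 3, 4, 0, 0, 0])
ys = y * 2     # scaled
yt = y + 5     # translated

for v in (y, ys, yt):
    cos = 1 - distance.cosine(x, v)        # cosine similarity
    cor = 1 - distance.correlation(x, v)   # Pearson correlation
    euc = distance.euclidean(x, v)
    print(round(cos, 4), round(cor, 4), round(euc, 4))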
Entropy
• Information relates to possible outcomes of an event
– transmission of a message, flip of a coin, or measurement of a piece of data
• The more certain an outcome, the less information it contains, and vice versa
– For example, if a coin has two heads, then an outcome of heads provides no information
– More quantitatively, the information is related to the probability of an outcome
• The smaller the probability of an outcome, the more information it provides, and vice versa
– Entropy is the commonly used measure
• For a variable (event) X with n possible values (outcomes) x1, x2, …, xn, each outcome having probability p1, p2, …, pn, the entropy of X, H(X), is given by
H(X) = − Σ_{i=1}^{n} p_i log2 p_i
• Entropy is between 0 and log2 n and is measured in bits
– Thus, entropy is a measure of how many bits it takes to represent an observation of X on average
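A minimal entropy sketch, with the two-headed coin as a check:

import math

def entropy(probs):
    # H(X) = -sum(p_i * log2(p_i)); terms with p_i = 0 contribute nothing
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1.0]))                       # two-headed coin: 0.0 bits
print(entropy([0.5, 0.5]))                  # fair coin: 1.0 bit
print(entropy([0.25, 0.25, 0.25, 0.25]))    # fair four-sided die: 2.0 bits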
• Some similarity measures are based on information theory:
– Mutual information, in various versions
– Maximal Information Coefficient (MIC) and related measures
• General, and can handle non-linear relationships
• Can be complicated and time intensive to compute
Entropy
• For a variable (event) X with n possible values (outcomes) x1, x2, …, xn, each outcome having probability p1, p2, …, pn, the entropy of X, H(X), is given by
H(X) = − Σ_{i=1}^{n} p_i log2 p_i
• What is the entropy of a fair four-sided die?
– Each of the 4 outcomes has probability 1/4, so H(X) = −4 × (1/4) log2(1/4) = log2 4 = 2 bits.
• For continuous data, the calculation is harder.
Mutual Information
• The information one variable provides about another. Formally,
I(X, Y) = H(X) + H(Y) − H(X, Y)
where H(X, Y) is the joint entropy of X and Y:
H(X, Y) = − Σ_i Σ_j p_ij log2 p_ij
where p_ij is the probability that the ith value of X and the jth value of Y occur together.
• For discrete variables, this is easy to compute.
• The maximum mutual information for discrete variables is log2(min(nX, nY)), where nX (nY) is the number of values of X (Y).
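A sketch computing I(X, Y) = H(X) + H(Y) − H(X, Y) from observed pairs (the sample data is illustrative, not the slides' Student Status/Grade table):

import math
from collections import Counter

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(pairs):
    n = len(pairs)
    hx = H([c / n for c in Counter(x for x, _ in pairs).values()])
    hy = H([c / n for c in Counter(y for _, y in pairs).values()])
    hxy = H([c / n for c in Counter(pairs).values()])
    return hx + hy - hxy   # I(X, Y) = H(X) + H(Y) - H(X, Y)

data = [("grad", "A"), ("grad", "A"), ("grad", "B"),
        ("ugrad", "B"), ("ugrad", "C"), ("ugrad", "C")]
print(mutual_information(data))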
• Mutual information of Student Status and Grade = 0.9928 + 1.4406 − 2.2710 = 0.1624
General Approach for Combining Similarities
• Sometimes attributes are of many different types, but an overall similarity is needed.
– For the kth attribute, compute a similarity, s_k(x, y), in the range [0, 1].
– Define an indicator variable, δ_k, for the kth attribute as follows:
• δ_k = 0 if the kth attribute is an asymmetric attribute and both objects have a value of 0, or if one of the objects has a missing value for the kth attribute
• δ_k = 1 otherwise
– Compute:
similarity(x, y) = Σ_{k=1}^{n} δ_k s_k(x, y) / Σ_{k=1}^{n} δ_k
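A sketch of the combined similarity; the per-attribute similarities s_k and indicators δ_k are assumed to be precomputed:

def combined_similarity(s, delta):
    # similarity(x, y) = sum(delta_k * s_k) / sum(delta_k)
    den = sum(delta)
    return sum(d * sk for d, sk in zip(delta, s)) / den if den else 0.0

# Three attributes; the second is skipped (asymmetric, both values 0)
print(combined_similarity([0.8, 0.0, 0.5], [1, 0, 1]))   # 0.65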
• May not want to treat all attributes the same.
– Use non-negative weights ω_k:
similarity(x, y) = Σ_{k=1}^{n} ω_k δ_k s_k(x, y) / Σ_{k=1}^{n} ω_k δ_k
• Can also define a weighted form of distance:
d(x, y) = ( Σ_{k=1}^{n} ω_k |x_k − y_k|^r )^{1/r}
naydin@yildiz.edu.tr
http://www3.yildiz.edu.tr/~naydin
– A classification model is an abstract representation of the relationship between the attribute set and the class label.
Task | Attribute set, x | Class label, y
Categorizing email messages (spam filtering) | Features extracted from email message header and content | spam or non-spam (binary)
Identifying tumor cells | Features extracted from x-rays or MRI scans | malignant or benign cells (binary)
Cataloging galaxies | Features extracted from telescope images | elliptical, spiral, or irregular-shaped galaxies (multiclass)
General Framework for Building a Classification Model
• Induction
– The process of using a learning algorithm to build a classification model from the training data.
– Also known as learning a model or building a model.
• Deduction
– The process of applying a classification model to unseen test instances to predict their class labels.
• The process of classification thus involves two steps:
– applying a learning algorithm to the training data to learn a model,
– applying the model to assign labels to unlabeled instances.
• A classification technique refers to a general approach to classification.
• Models that deliver such predictive insights are said to have good generalization performance.
• The performance of a model (classifier) can be evaluated by comparing its predicted labels against the true labels of instances.
• This information can be summarized in a table called a confusion matrix.
– Each entry f_ij denotes the number of instances from class i predicted to be of class j.
• number of correct predictions: f11 + f00
• number of incorrect predictions: f10 + f01
[Figure: applying the decision tree to a test record — the record is routed from the Home Owner node through Marital Status (Single, Divorced vs. Married) and Annual Income (< 80K vs. > 80K) until it reaches a leaf; here the leaf predicts NO, so Defaulted is assigned "No".]
• There could be more than one tree that fits the same data!
[Figure: deduction — the learned tree is applied to a test set of records whose class labels are unknown.]
• Hunt's algorithm grows the decision tree in a top-down fashion, recursively splitting training records that belong to more than one class.
[Figure: Hunt's algorithm on the loan data — the tree is refined step by step, from a single leaf (Defaulted = No) to splits on Home Owner, Marital Status (Single, Divorced vs. Married), and Annual Income (< 80K vs. >= 80K).]
Design Issues of Decision Tree Induction
• How should training records be split?
– Method for expressing test condition
• depending on attribute types
– Measure for evaluating the goodness of a test condition
• How should the splitting procedure stop?
– Stop splitting if all the records belong to the same class or have identical attribute values
– Early termination

Methods for Expressing Test Conditions
• Decision tree induction algorithms must provide a method for expressing an attribute test condition and its corresponding outcomes for different attribute types:
– Binary Attributes
– Nominal Attributes
– Ordinal Attributes
– Continuous Attributes
Test Condition for Binary Attributes
• generates two potential outcomes
Test Condition for Nominal Attributes
• Multi-way split:
– Use as many partitions as distinct values (e.g., Marital Status).
Test Condition for Ordinal Attributes
[Figure: (i) binary split, (ii) multi-way split — for shirt size, a binary grouping such as {Small, Large} vs. {Medium, Extra Large} violates the order of the attribute values.]
How to Determine the Best Split
• Before splitting: 10 records of class 0 and 10 records of class 1
• Greedy approach:
– Nodes with purer class distribution are preferred
• Need a measure of node impurity
Computing the Gini Index of a Collection of Nodes
• When a node p is split into k partitions (children):
GINI_split = Σ_{i=1}^{k} (n_i / n) GINI(i)
where n_i = number of records at child i, and n = number of records at the parent node p.

Binary Attributes: Computing the Gini Index
• Splits into two partitions (child nodes)
• Effect of weighting partitions:
– Larger and purer partitions are sought
[Figure: a binary split on attribute B; the parent node has C1 = 7, C2 = 5, Gini = 0.486.]
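A sketch of both computations; the candidate child counts in the last line are illustrative:

def gini(counts):
    # GINI(t) = 1 - sum(p_i(t)^2) over the class counts at node t
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts) if n else 0.0

def gini_split(children):
    # GINI_split = sum(n_i / n * GINI(i)) over the k child nodes
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

print(round(gini([7, 5]), 3))                  # parent node: 0.486
print(round(gini_split([[5, 2], [2, 3]]), 3))  # a candidate binary split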
Categorical Attributes: Computing the Gini Index
• Use the count matrix to make decisions
– Multi-way split, or two-way split (find the best partition of values), e.g., on CarType

Continuous Attributes: Computing the Gini Index
• Number of possible splitting values = number of distinct values
• Each splitting value has a count matrix associated with it
– class counts in each of the partitions
• For efficient computation, for each attribute:
– Sort the attribute on its values
– Linearly scan these values, each time updating the count matrix and computing the Gini index
– Choose the split position that has the least Gini index

Cheat:           No    No    No    Yes   Yes   Yes   No    No    No    No
Sorted values:   60    70    75    85    90    95    100   120   125   220
Split positions: 55  65  72  80  87  92  97  110  122  172  230
Gini:            0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420
• The best split is Annual Income ≤ 97, with Gini = 0.300.
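The scan can be sketched as follows, using the slide's Annual Income data (sort once, then sweep candidate split positions while updating the class counts):

def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts) if n else 0.0

def best_split(values, labels):
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    total = {lab: labels.count(lab) for lab in set(labels)}
    left = {lab: 0 for lab in total}
    best = (float("inf"), None)
    for i in range(n - 1):                 # linear scan of split positions
        left[pairs[i][1]] += 1
        split = (pairs[i][0] + pairs[i + 1][0]) / 2
        w = (i + 1) / n
        g = w * gini(list(left.values())) \
            + (1 - w) * gini([total[k] - left[k] for k in left])
        best = min(best, (g, split))
    return best

income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_split(income, cheat))   # (0.3, 97.5): split at Annual Income <= 97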
• Example: a node with C1 = 0 and C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1, so Entropy = −0 log2 0 − 1 log2 1 = 0.
• When a parent node p is split into k partitions (children), n_i is the number of records in child node i.
• Gain Ratio:
Gain Ratio = Gain_split / Split Info
Split Info = − Σ_{i=1}^{k} (n_i / n) log2(n_i / n)
where the parent node p is split into k partitions (children) and n_i is the number of records in child node i.

CarType split                                                   Gini    SplitINFO
{Family, Sports, Luxury} (3-way; C1 = 1, 8, 1; C2 = 3, 0, 7)    0.163   1.52
{Sports, Luxury} vs {Family} (C1 = 9, 1; C2 = 7, 3)             0.468   0.72
{Family, Luxury} vs {Sports} (C1 = 8, 2; C2 = 0, 10)            0.167   0.97

• Classification error at a node t:
Error(t) = 1 − max_i [p_i(t)]
where p_i(t) is the frequency of class i at node t.
– Maximum of 1 − 1/c when records are equally distributed among all classes, implying the least interesting situation
– Minimum of 0 when all records belong to one class, implying the most interesting situation
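A sketch of both measures; the SplitInfo check uses the 3-way CarType partition sizes (4, 8, 8) from the table above:

import math

def split_info(sizes):
    # SplitInfo = -sum(n_i/n * log2(n_i/n)) over the k partitions
    n = sum(sizes)
    return -sum(s / n * math.log2(s / n) for s in sizes if s)

def classification_error(counts):
    # Error(t) = 1 - max_i p_i(t)
    return 1 - max(counts) / sum(counts)

print(round(split_info([4, 8, 8]), 2))   # 3-way CarType split: 1.52
print(classification_error([0, 6]))      # pure node: 0.0
print(classification_error([3, 3]))      # evenly split node: 0.5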
[Figure: the classification framework on the loan data — a model is learned from the labeled training set and then applied (deduction) to test records whose class labels are unknown; records are drawn by random selection so that training and test sets are quite consistent with each other.]
• Expected error of a model over a random selection of records from the same distribution.
Overfitting and Underfitting of Decision Trees
• As the model becomes more and more complex, the test error can start increasing even though the training error may be decreasing.
– Underfitting: when the model is too simple, both training and test errors are large
– Overfitting: when the model is too complex, the training error is small but the test error is large

Model Overfitting – Impact of Training Data Size
• Using twice the number of data instances: increasing the size of the training data reduces the difference between training and testing errors at a given model size.
Model Overfitting – Impact of Training Data Size
• Performance of decision trees using 20% of the data for training (twice the original training size)

Reasons for Model Overfitting
• Limited training size
– In general, as we increase the size of a training set, the patterns learned from the training set start resembling the true patterns in the overall data
• the effect of overfitting can be reduced by increasing the training size
• High model complexity
– Generally, a more complex model has a better ability to represent complex patterns in the data.
– Multiple Comparison Procedure
– also known as the multiple testing problem
Estimating the Complexity of Decision Trees
– err(T): error rate on all training records
– Ω: trade-off hyper-parameter (similar to α), the relative cost of adding a leaf node
– k: number of leaf nodes
– N_train: total number of training records

Example (Ω = 1), for two candidate trees TL (7 leaves, e(TL) = 4/24) and TR (4 leaves, e(TR) = 6/24):
e_gen(TL) = 4/24 + 1 × 7/24 = 11/24 = 0.458
e_gen(TR) = 6/24 + 1 × 4/24 = 10/24 = 0.417
• Cost(Model, Data) = Cost(Data|Model) + α × Cost(Model)
– Cost is the number of bits needed for encoding.
– Search for the least costly model.
• Cost(Data|Model) encodes the misclassification errors.
• Cost(Model) uses node encoding (number of children) plus splitting condition encoding.
[Figure: the two decision trees TL and TR, with e(TL) = 4/24 and e(TR) = 6/24.]
• Holdout
– Reserve k% for training and (100 − k)% for testing
– Random subsampling: repeated holdout
• Cross validation
– Partition the data into k disjoint subsets
– k-fold: train on k − 1 partitions, test on the remaining one
– Leave-one-out: k = n
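A minimal k-fold cross-validation sketch using scikit-learn (the data set and classifier are placeholders):

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# 5-fold: train on 4 partitions, test on the remaining one, 5 times
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv)
print(scores, scores.mean())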
naydin@yildiz.edu.tr
http://www3.yildiz.edu.tr/~naydin
Clustering
• Clustering is the process of separating similar pieces of data, and most clustering methods use distances between data.
• The process of separating objects into clusters (groups)
• Cluster:
– a group of similar objects
• Objects in the same cluster are more similar to each other
• Objects in different clusters are less alike
• Unsupervised learning:
– it is not clear which object belongs to which class, nor how many classes there are
[Example: clustering for browsing/grouping — stocks with similar price movements form clusters, e.g. Financial-DOWN: Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN; another cluster: Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN, Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN.]
• Summarization
– Reduce the size of large data sets
[Figure: traditional hierarchical clustering of points p1–p4 with its traditional dendrogram, and a non-traditional hierarchical clustering with its non-traditional dendrogram.]
Other Distinctions Between Sets of Clusters
Characteristics of the Input Data Are Important
• Hierarchical clustering
• Density-based clustering
[Figure: K-means clustering iterations on a sample 2-D data set.]
[Figure: original points and the K-means clusterings obtained from different initial centroids.]
Problems with Selecting Initial Points
• If there are K "real" clusters, then the chance of selecting one centroid from each cluster is small.
– The chance is relatively small when K is large
– If the clusters are the same size, n, then the chance is K! n^K / (Kn)^K = K! / K^K
10 Clusters Example
[Figure: K-means iterations 1–4 on the 10-cluster data, starting with two initial centroids in one cluster of each pair of clusters.]
10 Clusters Example
[Figure: K-means iterations 1–4, starting with some pairs of clusters having three initial centroids, while others have only one.]
K-means++
• This approach can be slower than random initialization, but it very consistently produces better results in terms of SSE (Sum of Squared Error).
– The k-means++ algorithm guarantees an approximation ratio of O(log k) in expectation, where k is the number of centers.
• To select a set of initial centroids, C, perform the following:
1. Select an initial point at random to be the first centroid.
2. For k − 1 steps:
3. For each of the N points, x_i, 1 ≤ i ≤ N, find the minimum squared distance to the currently selected centroids, C_1, …, C_j, 1 ≤ j < k, i.e., min_j d²(C_j, x_i).
4. Randomly select a new centroid by choosing a point with probability proportional to min_j d²(C_j, x_i) / Σ_i min_j d²(C_j, x_i).
5. End For
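A sketch of steps 1–5 (the three-blob test data is illustrative):

import numpy as np

def kmeans_pp_init(points, k, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: first centroid chosen uniformly at random
    centroids = [points[rng.integers(len(points))]]
    for _ in range(k - 1):                       # Step 2
        # Step 3: min squared distance to the centroids chosen so far
        d2 = np.min([((points - c) ** 2).sum(axis=1) for c in centroids],
                    axis=0)
        # Step 4: choose the next centroid with probability d2 / sum(d2)
        centroids.append(points[rng.choice(len(points), p=d2 / d2.sum())])
    return np.array(centroids)

rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(off, 1, (50, 2)) for off in (0, 5, 10)])
print(kmeans_pp_init(pts, 3))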
CLUTO: http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview
• One solution is to find a large number of clusters such that each of them represents a part of a natural cluster.
• But these small clusters need to be put together in a post-processing step.
[Figure: hierarchical clustering of six points and the corresponding dendrogram.]
[Figure: anatomy of a dendrogram over items B–F — the root, terminal (leaf) nodes, and sister nodes.]
[Figure: starting configuration — points p1…p12 each form their own cluster, with a proximity matrix of all pairwise distances.]
[Figure: after some merging steps — clusters C1…C5 and the updated proximity matrix.]
Step 4
• We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.
[Figure: proximity matrix over C1–C5 before the merge.]
Step 5
• The question is: "How do we update the proximity matrix?"
[Figure: proximity matrix after merging C2 and C5 — the entries for the new cluster C2 ∪ C5 against C1, C3, and C4 are marked "?".]
How to define inter-cluster similarity:
• MIN (single link)
• MAX (complete link)
• Group Average
• Distance Between Centroids
• Other methods driven by an objective function
– Ward's Method uses squared error
[Figure: proximity matrix over points p1–p5.]
• The proximity of two clusters is based on the two closest points in the different clusters
– Determined by one pair of points, i.e., by one link in the proximity graph
• Example: xy-coordinates of six points:
dist({3, 6}, {2, 5}) = min(dist(3, 2), dist(6, 2), dist(3, 5), dist(6, 5))
                     = min(0.15, 0.25, 0.28, 0.39)
                     = 0.15
[Figure: nested clusters and dendrogram for single-link clustering of the six points.]
• The proximity of two clusters is based on the two most distant points in the different clusters
– Determined by all pairs of points in the two clusters
• As with single link, points 3 and 6 are merged first.
• However, {3, 6} is merged with {4}, instead of {2, 5} or {1}, because
dist({3, 6}, {4}) = max(dist(3, 4), dist(6, 4)) = max(0.15, 0.22) = 0.22
dist({3, 6}, {2, 5}) = max(dist(3, 2), dist(6, 2), dist(3, 5), dist(6, 5)) = max(0.15, 0.25, 0.28, 0.39) = 0.39
dist({3, 6}, {1}) = max(dist(3, 1), dist(6, 1)) = max(0.22, 0.23) = 0.23
[Figure: nested clusters and dendrogram for complete-link (MAX) clustering of the six points.]
• Group average: the proximity of two clusters is the average of the pairwise proximities between points in the two clusters:
proximity(Cluster_i, Cluster_j) = Σ_{p_i ∈ Cluster_i, p_j ∈ Cluster_j} proximity(p_i, p_j) / ( |Cluster_i| × |Cluster_j| )
dist({3, 6, 4}, {1}) = (0.22 + 0.37 + 0.23)/(3 × 1) = 0.28
dist({2, 5}, {1}) = (0.24 + 0.34)/(2 × 1) = 0.29
dist({3, 6, 4}, {2, 5}) = (0.15 + 0.28 + 0.25 + 0.39 + 0.20 + 0.29)/(3 × 2) = 0.26
• Because dist({3, 6, 4}, {2, 5}) is smaller than dist({3, 6, 4}, {1}) and dist({2, 5}, {1}), clusters {3, 6, 4} and {2, 5} are merged at the fourth stage.
[Figure: nested clusters and dendrogram for group-average clustering of the six points.]
• Limitations
– Biased towards globular clusters
[Figure: hierarchical clusterings of the six points using MIN, MAX, Group Average, and Ward's Method.]
• O(N²) space, since it uses the proximity matrix.
– N is the number of points.
• Once a decision is made to combine two clusters, it cannot be undone
• Different schemes have problems with one or more of the following:
– Sensitivity to noise
– Difficulty handling clusters of different sizes and non-globular shapes
– Breaking large clusters
DBSCAN: Core, Border, and Noise Points
MinPts = 7
DBSCAN: Core, Border, and Noise Points
1. Label all points as core, border, or noise points.
2. Eliminate noise points.
3. Put an edge between all core points within a distance Eps of each other.
4. Make each group of connected core points into a separate cluster.
5. Assign each border point to one of the clusters of its associated core points.
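A compact sketch of these five steps (brute-force neighborhoods; eps and min_pts play the roles of Eps and MinPts):

import numpy as np

def dbscan(points, eps, min_pts):
    n = len(points)
    dist = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    nbrs = [set(np.flatnonzero(dist[i] <= eps)) for i in range(n)]
    core = {i for i in range(n) if len(nbrs[i]) >= min_pts}   # step 1
    labels = [-1] * n        # step 2: -1 marks noise
    cid = 0
    for i in core:           # steps 3-4: connected components of core points
        if labels[i] != -1:
            continue
        stack = [i]
        while stack:
            j = stack.pop()
            if labels[j] == -1:
                labels[j] = cid
                stack.extend(nbrs[j] & core)
        cid += 1
    for i in range(n):       # step 5: attach border points to a core's cluster
        if labels[i] == -1:
            for j in nbrs[i] & core:
                labels[i] = labels[j]
                break
    return labels

pts = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [50, 50]])
print(dbscan(pts, eps=2.0, min_pts=2))   # [0, 0, 0, 1, 1, -1]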
[Figure: original points and the DBSCAN clusters; dark blue points indicate noise.]
• Can handle clusters of different shapes and sizes
• Resistant to noise
When DBSCAN Does NOT Work Well
• The idea is that for points in a cluster, their kth nearest neighbors are at close distance
• Noise points have their kth nearest neighbor at a farther distance
• So, plot the sorted distance of every point to its kth nearest neighbor
• For cluster analysis, the analogous question is how to evaluate the "goodness" of the resulting clusters.
• But "clusters are in the eye of the beholder"!
– In practice, the clusters we find are defined by the clustering algorithm
• Then why do we want to evaluate them?
– To avoid finding patterns in noise
– To compare clustering algorithms
– To compare two sets of clusters
– To compare two clusters
[Figure: clusterings found in random data by different algorithms.]
K = 1 cluster:
SSE = (1−3)² + (2−3)² + (4−3)² + (5−3)² = 10
SSB = 4 × (3−3)² = 0
Total = 10 + 0 = 10
K = 2 clusters:
SSE = (1−1.5)² + (2−1.5)² + (4−4.5)² + (5−4.5)² = 1
SSB = 2 × (3−1.5)² + 2 × (4.5−3)² = 9
Total = 1 + 9 = 10
• A proximity graph-based approach can also be used for cohesion and separation.
– Cluster cohesion is the sum of the weights of all links within a cluster.
– Cluster separation is the sum of the weights between nodes in the cluster and nodes outside the cluster.
• Compute the correlation between the two matrices.
– Since the matrices are symmetric, only the correlation between n(n−1)/2 entries needs to be calculated.
• A high magnitude of correlation indicates that points that belong to the same cluster are close to each other.
– Correlation may be positive or negative depending on whether the similarity matrix is a similarity or dissimilarity matrix
• Not a good measure for some density- or contiguity-based clusters.
Measuring Cluster Validity Via Correlation
• Correlation of ideal similarity and proximity matrices for the K-means clusterings of the following well-clustered data set.
[Figure: the data set and its similarity matrix.]
Measuring Cluster Validity Via Correlation
• Correlation of ideal similarity and proximity matrices for the K-means clusterings of the following random data set.
[Figure: the data set and its similarity matrix.]
• Order the similarity matrix with respect to cluster labels and inspect visually.
[Figure: well-separated clusters produce a block-diagonal similarity matrix.]
Judging a Clustering Visually by its Similarity Matrix
• Clusters in random data are not so crisp
[Figure: DBSCAN clustering of random data and its similarity matrix.]
Judging a Clustering Visually by its Similarity Matrix
[Figure: DBSCAN clustering of a more complex data set (clusters 1–7) and its similarity matrix.]
[Figure: SSE as a function of the number of clusters K.]
• SSE curve for a more complicated data set
[Figure: a data set with clusters 1–7 and its SSE curve.]
Supervised Measures of Cluster Validity: Entropy and Purity
• Need a framework to interpret any measure.
– For example, if our measure of evaluation has the value 10, is that good, fair, or poor?
• Statistics provide a framework for cluster validity.
– The more "atypical" a clustering result is, the more likely it represents valid structure in the data
– Compare the value of an index obtained from the given data with those resulting from random data.
• If the value of the index is unlikely, then the cluster results are valid
SSE = 0.005
[Figure: histogram of the SSE of three clusters in 500 sets of random data points of size 100, distributed over the range 0.2–0.8 for x and y values.]
• Data errors
– e.g., 2 years old, 100 kg
[Figure: the data points and their estimated Gaussian probability density.]
Grubbs' Test
– H0: There is no outlier in the data
– HA: There is at least one outlier
• Grubbs' test statistic:
G = max |X − X̄| / s
• Reject H0 if:
G > ((N − 1)/√N) √( t²(α/N, N−2) / (N − 2 + t²(α/N, N−2)) )
[Figure: Gaussian probability density of the data.]
• Detects outliers in univariate data
• Assumes the data come from a normal distribution
• Detects one outlier at a time, removes the outlier, and repeats
– H0: There is no outlier in the data
– HA: There is at least one outlier

Likelihood Approach
• Assume the data set D contains samples from a mixture of two probability distributions:
– M (majority distribution)
– A (anomalous distribution)
• General approach:
– Initially, assume all the data points belong to M
– Let Lt(D) be the log likelihood of D at time t
– For each point xt that belongs to M, move it to A
• Let Lt+1(D) be the new log likelihood.
• Compute the difference, Δ = Lt(D) − Lt+1(D)
• If Δ > c (some threshold), then xt is declared an anomaly and moved permanently from M to A
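Returning to Grubbs' test, a sketch of one iteration using scipy's t distribution (alpha is the significance level; the sample data is illustrative):

import numpy as np
from scipy import stats

def grubbs_is_outlier(x, alpha=0.05):
    x = np.asarray(x, float)
    n = len(x)
    g = np.max(np.abs(x - x.mean())) / x.std(ddof=1)   # G = max|X - mean| / s
    t2 = stats.t.ppf(1 - alpha / n, n - 2) ** 2        # t^2(alpha/N, N-2)
    crit = (n - 1) / np.sqrt(n) * np.sqrt(t2 / (n - 2 + t2))
    return g > crit, g, crit

print(grubbs_is_outlier([5.0, 5.1, 4.9, 5.2, 9.8]))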
[Figure: distance-based outlier scores — "One Nearest Neighbor, Two Outliers" and "Five Nearest Neighbors, Small Cluster".]
• Simple
• Expensive: O(n²)
• Sensitive to parameters
• Sensitive to variations in density
• Distance becomes less meaningful in high-dimensional space
[Figure: in the NN approach, p2 is not considered an outlier, while the LOF approach finds both p1 and p2 as outliers.]
[Figure: clustering-based outlier scores — distance of points from the closest centroid, and relative distance of points from the closest centroid.]
Reconstruction Error(x) = ‖x − x̂‖
• Objects with large reconstruction errors are anomalies
• This data may contain outliers
– Every point is in the same orthant (quadrant)
Finding Outliers with a One-Class SVM
• Decision boundary with ν = 0.1
• Decision boundary with ν = 0.05 and ν = 0.2