Classification Analysis, Part 2
Måns Thulin
Department of Mathematics, Uppsala University
thulin@math.uu.se
Outline
Probabilistic methods
Previously, we have looked at probabilistic methods for
classification – methods based on statistical theory and model
assumptions.
In a statistical problem, the basic situation is the following: the data are regarded as observations of random variables whose distribution is described by a probability model, and the classification is based on probabilities computed from that model.
Algorithmic methods
By splitting the given data into a training set and a test set, we
can evaluate the performance of our algorithmic method.
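To make this concrete, here is a minimal sketch of the split-and-evaluate workflow in Python, using scikit-learn (the library choice and the simulated data are assumptions for illustration; the slides do not prescribe any particular tools):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Simulated toy data: 100 points in two dimensions, two classes (0 and 1).
rng = np.random.default_rng(seed=1)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hold out 30% of the data as a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

# Fit the method on the training set only...
clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# ...and judge it by its predictive accuracy on the unseen test set.
print("Test accuracy:", accuracy_score(y_test, clf.predict(X_test)))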
A toy example
Example: consider a data set with two groups: red and blue.
[Figure: "kNN classification" scatter plot of the red and blue points; axes x and y, both running from −2 to 6.]
A toy example
How should we classify the new black point?
[Figure: the same "kNN classification" plot, now with a new, unlabelled black point.]
A toy example
It seems reasonable to classify the point as being blue!
[Figure: the "kNN classification" plot; the black point lies in the blue group.]
A toy example
How should we classify the new black point?
[Figure: the "kNN classification" plot with a second new black point.]
A toy example
It seems reasonable to classify the point as being red!
[Figure: the "kNN classification" plot; the second black point lies in the red group.]
A less nice example
But what about this point?
[Figure: the "kNN classification" plot with a black point that falls between the red and blue groups.]
kNN: basic idea
In the first two examples, we could easily classify the point since all
points in its neighbourhood had the same colour.
kNN: basic idea
Look at the k = 1 closest neighbour. The point is classified as being blue, since the nearest neighbour is blue.
[Figure: the k = 1 nearest neighbour of the black point is highlighted; it is blue.]
kNN: basic idea
Look at the k = 2 closest neighbours. It is not clear how to
classify the point (no colour has a majority).
[Figure: the k = 2 nearest neighbours are highlighted; one is blue and one is red.]
kNN: basic idea
Look at the k = 3 closest neighbours. The point is classified as
being blue (2 votes against 1).
[Figure: the k = 3 nearest neighbours are highlighted; two are blue and one is red.]
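The voting procedure in these three examples can be written down in a few lines. Below is a from-scratch Python sketch of kNN with an unweighted majority vote; the function name and the toy coordinates are illustrative, not taken from the slides:

import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_new, k=3):
    # Euclidean distances from x_new to every training point.
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest neighbours.
    nearest = np.argsort(dists)[:k]
    # The most common class among them wins the vote.
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Two groups in the spirit of the figures:
X_train = np.array([[0.0, 0.0], [1.0, 0.5], [4.0, 4.0], [4.5, 3.5]])
y_train = np.array(["red", "red", "blue", "blue"])
print(knn_classify(X_train, y_train, np.array([4.2, 3.8]), k=3))  # blue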
kNN: choosing k
kNN: no majority
In our example, we encountered a problem when k = 2: no colour
had a majority. What should we do in such cases?
- Flip a coin?
  - This ignores some of the information that we have gathered!
- Let the closest neighbour decide? Or the k − 1 closest?
- A better solution is probably to use weighted votes, so that the votes from closer neighbours are seen as more important. This idea could be used in all cases, not just when there is no majority (a sketch follows after this list).
- Look at k + 1 neighbours instead?
  - Essentially, this means that when we don't have enough information to make a decision, we gather more information.
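A sketch of the weighted-votes idea in Python: each of the k neighbours votes with weight 1/distance, so a tie such as the k = 2 case above is broken in favour of the closer neighbour. The 1/distance weighting is one common choice and an assumption here, not something the slides prescribe:

import numpy as np

def knn_weighted(X_train, y_train, x_new, k=2):
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = {}
    for i in nearest:
        # Closer neighbours get larger weights; the small constant guards
        # against division by zero if a training point coincides with x_new.
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + 1.0 / (dists[i] + 1e-12)
    # The class with the largest total weight wins.
    return max(votes, key=votes.get)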
kNN: some last comments
Decision trees: basic idea
Consider the following data set with vertebrate data:
[Table: vertebrate data.]
There are many algorithms for building the tree. One of the
earliest is Hunt’s algorithm:
Let D_t be the set of observations belonging to a node t.
1. If all observations in D_t are of the same class i, then t is a leaf node labelled as i.
2. Otherwise, use some condition to partition the observations into two smaller subsets. A child node is created for each outcome of the condition, and the observations are distributed to the children based on the outcomes.
When should the splitting stop? Other criteria are sometimes used, but a simple and reasonable stopping criterion is to stop splitting once all remaining nodes are leaf nodes. A runnable sketch of the algorithm follows below.
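Here is a minimal Python sketch of Hunt's algorithm for numeric features with binary threshold splits. Choosing the split that minimizes the weighted Gini of the children (defined on the next slide) is an assumption; the slides leave the splitting condition open:

import numpy as np
from collections import Counter

def gini(y):
    n = len(y)
    return 1.0 - sum((c / n) ** 2 for c in Counter(y).values())

def hunt(X, y):
    # Step 1: if all observations are of the same class, make a leaf node.
    if len(set(y)) == 1:
        return {"leaf": y[0]}
    # Step 2: otherwise pick a condition, create a child for each outcome,
    # and distribute the observations to the children.
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= t
            score = (left.sum() * gini(y[left])
                     + (~left).sum() * gini(y[~left])) / len(y)
            if best is None or score < best[0]:
                best = (score, j, t, left)
    if best is None:  # no split possible: fall back to a majority-class leaf
        return {"leaf": Counter(y).most_common(1)[0][0]}
    _, j, t, left = best
    return {"feature": j, "threshold": t,
            "left": hunt(X[left], y[left]), "right": hunt(X[~left], y[~left])}

For instance, hunt(np.array([[1.0], [2.0], [3.0], [4.0]]), np.array(["bird", "bird", "mammal", "mammal"])) returns a single split at 2.0 with two pure leaf children.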
Decision trees: the best split
Let p(i|t) be the fraction of observations in class i at the node t
and let c be the number of classes.
The Gini for node t is defined as

    Gini(t) = 1 - \sum_{i=1}^{c} (p(i|t))^2
If all observations at t belong to a single class, then Gini(t) = 1 - 1^2 - 0 - \dots - 0 = 0.
The Gini is maximized when all classes have the same number of
observations at t.
One criterion for splitting could be to minimize the Gini in the next level of the tree. That way we will get "purer" nodes.
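A quick numeric check of the two boundary cases, in Python (the labels are made up for illustration):

from collections import Counter

def gini(labels):
    # Gini(t) = 1 - sum over classes of p(i|t)^2, with p(i|t) the class fractions.
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini(["a", "a", "a", "a"]))  # pure node: 1 - 1^2 = 0.0
print(gini(["a", "a", "b", "b"]))  # two equally large classes: 1 - 2*(1/2)^2 = 0.5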
Decision trees: the best split
The situation becomes a bit more complicated if we take into
account that the children can have different numbers of
observations. To account for this, we try to maximize the gain:
    Gain = Gini(t) - \sum_{j=1}^{k} \frac{n_{v_j}}{n_t} Gini(v_j)

where v_1, ..., v_k are the children of node t, n_{v_j} is the number of observations at child v_j, and n_t is the number of observations at t.
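The gain is straightforward to compute; a small Python sketch, with gini as in the previous example (repeated so the snippet runs on its own) and a made-up example split:

from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gain(parent, children):
    # Gain = Gini(t) - sum_j (n_{v_j} / n_t) * Gini(v_j)
    n_t = len(parent)
    return gini(parent) - sum(len(v) / n_t * gini(v) for v in children)

# Splitting a 50/50 node into two pure children achieves the maximal gain:
print(gain(["a", "a", "b", "b"], [["a", "a"], ["b", "b"]]))  # 0.5 - 0.0 = 0.5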
Decision trees: vertebrate data
Decision trees: extensions
Algorithmic versus probabilistic methods: pros
Algorithmic versus probabilistic methods: cons
Algorithmic versus probabilistic methods: discussion
- The data and the problem at hand should lead to the solution, not prior ideas about what kinds of methods are good.
- The statistician should focus on finding a good solution, regardless of whether that solution uses algorithmic or probabilistic methods.
- How good a method is should be judged by its predictive accuracy on the test data.
- This last point is perhaps controversial; we often judge probabilistic methods by their theoretical properties.