
Lecture 9: Classification and algorithmic methods

Måns Thulin
Department of Mathematics, Uppsala University
thulin@math.uu.se

Multivariate Methods • 17/5 2011

Outline

▶ What are algorithmic methods?
▶ Algorithmic methods for classification
  ▶ kNN classification
  ▶ Decision trees
▶ Algorithmic versus probabilistic methods

Probabilistic methods
Previously, we have looked at probabilistic methods for
classification – methods based on statistical theory and model
assumptions.
In a statistical problem, the basic situation is the following:

Nature → {Black box} → Data

The probabilistic approach is to assume a model for what happens in the black box (normal distribution, ARIMA time series, linear model, Markov chains...). We hope that the models describe the black box accurately enough.

"All models are wrong, but some are useful." – George Box
Some statisticians, and indeed people from other fields as well,
argue that it is time to think outside the box.

Algorithmic methods

Suppose that we have a set of data with known classes. Without any model assumptions, we can use heuristics and good ideas to come up with new methods. We can create algorithms that create rules for classifying new points using the given training data.

By splitting the given data into a training set and a test set, we
can evaluate the performance of our algorithmic method.

"All models are wrong, and increasingly you can succeed without them." – Peter Norvig, research director at Google
As a motivating example, we’ll look at a situation where it is more
or less clear that we don’t need fancy methods or model
assumptions to classify new observations.

A toy example
Example: consider a data set with two groups: red and blue.

[Figure: "kNN classification" scatter plot of the red and blue groups in the (x, y) plane.]
A toy example
How should we classify the new black point?

[Figure: "kNN classification" scatter plot with a new, unlabelled black point.]
A toy example
It seems reasonable to classify the point as being blue!

[Figure: "kNN classification" scatter plot; the new point lies within the blue cluster.]
A toy example
How should we classify the new black point?

[Figure: "kNN classification" scatter plot with a second new black point.]
A toy example
It seems reasonable to classify the point as being red!

[Figure: "kNN classification" scatter plot; this new point lies within the red cluster.]
A less nice example
But what about this point?

[Figure: "kNN classification" scatter plot; the new point lies between the red and blue clusters.]
kNN: basic idea

In the first two examples, we could easily classify the point since all
points in its neighbourhood had the same colour.

What should we do when there is more than one colour in the neighbourhood?

The kNN algorithm classifies the new point by letting the k Nearest Neighbours – the k points that are the closest to the point – "vote" about the class of the new point.
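A minimal sketch of this voting step (not from the lecture), assuming Python with numpy and a training set stored as a numerical array with one row per observation; the toy data at the end are hypothetical, in the spirit of the red/blue example:

import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest neighbours."""
    # Euclidean distance from x_new to every training point
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # Majority vote among their class labels
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups
X_train = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, -0.5],   # "red" group
                    [4.0, 4.0], [4.5, 3.5], [5.0, 4.5]])   # "blue" group
y_train = np.array(["red", "red", "red", "blue", "blue", "blue"])

print(knn_classify(X_train, y_train, np.array([4.2, 4.1]), k=3))  # -> "blue"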

kNN: basic idea
Look at the k = 1 closest neighbours. The point is classified as
being blue, since the nearest neighbour is blue.

[Figure: "kNN classification, k = 1" – the single nearest neighbour of the new point is blue.]
kNN: basic idea
Look at the k = 2 closest neighbours. It is not clear how to
classify the point (no colour has a majority).

[Figure: "kNN classification, k = 2" – the two nearest neighbours are one blue and one red.]
kNN: basic idea
Look at the k = 3 closest neighbours. The point is classified as
being blue (2 votes against 1).

[Figure: "kNN classification, k = 3" – two of the three nearest neighbours are blue.]
kNN: choosing k

▶ Clearly, the choice of k is very important.
  ▶ If k is too small, the algorithm becomes sensitive to noise points and outliers.
  ▶ If k is too large, the neighbourhood will probably include points from other classes.
▶ How should we choose k?
  ▶ This is a difficult question! There is no "right answer".
  ▶ Often a test data set is used to investigate the performance for different k.
  ▶ Typically, we choose the k that has the lowest misclassification rate for the test data (see the sketch below).
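A hedged sketch of this procedure, assuming scikit-learn is available (the class and function names below are scikit-learn's, not the lecture's, and the data are a small synthetic stand-in):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for a real data set: two overlapping groups
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array(["red"] * 50 + ["blue"] * 50)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Misclassification rate on the test set for a range of k
for k in range(1, 11):
    acc = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_test, y_test)
    print(k, 1 - acc)

We would then pick the k with the smallest test misclassification rate.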

kNN: no majority
In our example, we encountered a problem when k = 2: no colour
had a majority. What should we do in such cases?

▶ Flip a coin?
  ▶ This ignores some of the information that we have gathered!
▶ Let the closest neighbour decide? Or the k − 1 closest?
  ▶ A better solution is probably to use weighted votes, so that the votes from closer neighbours are seen as more important. This idea could be used in all cases, not just when there is no majority (see the sketch below).
▶ Look at k + 1 neighbours instead?
  ▶ Essentially, this means that when we don't have enough information to make a decision, we gather more information.
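One possible way to implement the weighted-vote idea is inverse-distance weighting. This is a sketch of my own, not the lecture's prescription; the small eps term guards against division by zero when a neighbour coincides with the new point:

import numpy as np
from collections import defaultdict

def weighted_knn_classify(X_train, y_train, x_new, k=3, eps=1e-9):
    """Classify x_new by inverse-distance-weighted votes of its k nearest neighbours."""
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]
    scores = defaultdict(float)
    for i in nearest:
        scores[y_train[i]] += 1.0 / (dists[i] + eps)  # closer neighbours count more
    return max(scores, key=scores.get)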

kNN: some last comments

▶ kNN is essentially a rank method: we measure the distance to all points in the data set and rank them accordingly. The k points with the lowest ranks are used to classify the new point.
▶ An important question is what we mean by "close". Which distance measure should we use? Euclidean distance? Statistical distance? Mahalanobis? Should we look at standardized data? (A small sketch follows below.)
▶ Is it meaningful to use distance measures if the data are binary or categorical? If some of the variables are categorical and some are continuous measurements? Are more general similarity measures useful?
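For instance, Mahalanobis distances could be plugged into kNN in place of Euclidean distances. A sketch, assuming the covariance matrix is estimated from the training data (and that there are enough observations for it to be invertible):

import numpy as np

def mahalanobis_distances(X_train, x_new):
    """Mahalanobis distance from x_new to every row of X_train."""
    S_inv = np.linalg.inv(np.cov(X_train, rowvar=False))  # inverse sample covariance
    diffs = X_train - x_new
    # d_i = sqrt( (x_i - x_new)^T S^{-1} (x_i - x_new) ) for every row i
    return np.sqrt(np.einsum("ij,jk,ik->i", diffs, S_inv, diffs))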

Decision trees: basic idea

Another popular algorithmic classification method is decision trees.

Have you ever played the game ”20 questions”?

Decision trees are more or less that game!

The idea is to classify the new observation by asking a series of questions. Depending on the answer to the first question, different second questions are asked, and so on. Questions are asked until a conclusion is reached.

Decision trees: basic idea
Consider the following data set with vertebrate data:

Name          | Body temp    | Gives birth | Has legs | Class
Human         | warm-blooded | yes         | yes      | mammal
Whale         | warm-blooded | yes         | no       | mammal
Cat           | warm-blooded | yes         | yes      | mammal
Cow           | warm-blooded | yes         | yes      | mammal
Python        | cold-blooded | no          | no       | reptile
Komodo dragon | cold-blooded | no          | yes      | reptile
Turtle        | cold-blooded | no          | yes      | reptile
Salmon        | cold-blooded | no          | no       | fish
Eel           | cold-blooded | no          | no       | fish
Pigeon        | warm-blooded | no          | yes      | bird
Penguin       | warm-blooded | no          | yes      | bird

Decision tree example: see blackboard!


Decision trees: building the tree
Given training data, how can we build the decision tree?

There are many algorithms for building the tree. One of the earliest is Hunt's algorithm. Let D_t be the set of observations belonging to a node t.
1. If all observations in D_t are of the same class i, then t is a leaf node labeled as i.
2. Otherwise, use some condition to partition the observations into two smaller subsets. A child node is created for each outcome of the condition and the observations are distributed to the children based on the outcomes.
When should the splitting stop? Other criteria are sometimes used, but a simple and reasonable stopping criterion is to stop splitting when all remaining nodes are leaf nodes.
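A rough sketch of this recursion in Python; the choose_split argument (a function that returns a yes/no question, or None when no useful split exists) is a placeholder of my own, not part of the lecture:

from collections import Counter

def hunt(data, labels, choose_split):
    """Recursively build a decision tree from (data, labels) with Hunt's algorithm."""
    # Step 1: if all observations have the same class, this node is a leaf
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}
    condition = choose_split(data, labels)
    if condition is None:  # no useful question found: label with the majority class
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    # Step 2: partition the observations on the condition and recurse on each child
    yes = [i for i, x in enumerate(data) if condition(x)]
    no = [i for i in range(len(data)) if i not in yes]
    if not yes or not no:  # the condition did not actually separate anything
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    return {
        "condition": condition,
        "yes": hunt([data[i] for i in yes], [labels[i] for i in yes], choose_split),
        "no": hunt([data[i] for i in no], [labels[i] for i in no], choose_split),
    }

A Gini-based criterion, like the one discussed on the following slides, is a natural choice for choose_split.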

How, then, do we choose the condition for partitioning?

Decision trees: the best split
Let p(i|t) be the fraction of observations in class i at the node t
and let c be the number of classes.
The Gini for node t is defined as

$$\mathrm{Gini}(t) = 1 - \sum_{i=1}^{c} p(i|t)^2.$$

Gini is a measure of "impurity". If all observations belong to the same class, then

$$\mathrm{Gini}(t) = 1 - 1^2 - 0 - \dots - 0 = 0.$$

The Gini is maximized when all classes have the same number of
observations at t.
One criterion for splitting could be to minimize the Gini in the next
level of the tree. That way we will get ”purer” nodes.
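A minimal sketch of the Gini computation from the class counts at a node (the counts below are illustrative, not from the lecture):

import numpy as np

def gini(counts):
    """Gini impurity from a vector of class counts at a node."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)

print(gini([10, 0, 0]))   # pure node: 0.0
print(gini([5, 5]))       # evenly mixed two-class node: 0.5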

Decision trees: the best split
The situation becomes a bit more complicated if we take into
account that the children can have different numbers of
observations. To account for this, we try to maximize the gain:
$$\mathrm{Gain} = \mathrm{Gini}(t) - \sum_{j=1}^{k} \frac{n_{v_j}}{n_t}\,\mathrm{Gini}(v_j)$$

where the $v_j$ are the children and $n_i$ is the number of observations at node $i$.

This is equivalent to minimizing $\sum_{j=1}^{k} \frac{n_{v_j}}{n_t}\,\mathrm{Gini}(v_j)$.

Vertebrate example: see blackboard!
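As a hedged sketch of that blackboard computation, here is the gain for splitting the vertebrate data on "Gives birth", using class counts read off the table (the gini helper repeats the definition above):

import numpy as np

def gini(counts):
    p = np.asarray(counts, dtype=float) / sum(counts)
    return 1.0 - np.sum(p ** 2)

# Class counts at the parent node: mammal, reptile, fish, bird
parent = [4, 3, 2, 2]
# Split on "Gives birth": yes -> 4 mammals; no -> 3 reptiles, 2 fish, 2 birds
children = [[4], [3, 2, 2]]

n_t = sum(parent)
weighted_child_gini = sum(sum(c) / n_t * gini(c) for c in children)
gain = gini(parent) - weighted_child_gini
print(round(gain, 3))  # roughly 0.31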


Sometimes other impurity measures than Gini are used. One example is the entropy:

$$\mathrm{Entropy}(t) = -\sum_{i=1}^{c} p(i|t)\log_2 p(i|t).$$

Decision trees: vertebrate data

Name          | Body temp    | Gives birth | Has legs | Class
Human         | warm-blooded | yes         | yes      | mammal
Whale         | warm-blooded | yes         | no       | mammal
Cat           | warm-blooded | yes         | yes      | mammal
Cow           | warm-blooded | yes         | yes      | mammal
Python        | cold-blooded | no          | no       | reptile
Komodo dragon | cold-blooded | no          | yes      | reptile
Turtle        | cold-blooded | no          | yes      | reptile
Salmon        | cold-blooded | no          | no       | fish
Eel           | cold-blooded | no          | no       | fish
Pigeon        | warm-blooded | no          | yes      | bird
Penguin       | warm-blooded | no          | yes      | bird

Decision trees: extensions

Some further remarks:

▶ In our example, we only used binary splits, where each internal node has two children. It is also possible to use non-binary splits, where each internal node can have more than two children.
▶ When the data is continuous, it is perhaps not as easy to choose the split criteria.
  ▶ Example: animal weight. A node question could be: "is weight < 10 kg?"
  ▶ Is this a better question than "is weight < 11 kg?" or "is weight < 9 kg?" (See the sketch below.)
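A common heuristic, stated here as my own assumption rather than part of the lecture, is to try the midpoints between consecutive sorted values as candidate thresholds and pick the one with the smallest weighted child Gini. A sketch, with hypothetical animal weights:

import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_threshold(weights, labels):
    """Pick the weight threshold with the lowest weighted child Gini."""
    weights, labels = np.asarray(weights, dtype=float), np.asarray(labels)
    order = np.argsort(weights)
    w, y = weights[order], labels[order]
    candidates = (w[:-1] + w[1:]) / 2  # midpoints between consecutive sorted values
    best = None
    for t in candidates:
        left, right = y[w < t], y[w >= t]
        if len(left) == 0 or len(right) == 0:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if best is None or score < best[1]:
            best = (t, score)
    return best  # (threshold, weighted Gini), or None if no split is possible

# Hypothetical weights (kg) and classes
print(best_threshold([0.3, 4.0, 9.5, 60.0, 500.0], ["bird", "cat", "cat", "human", "whale"]))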
Having looked at two algorithmic methods, we will now compare the merits of algorithmic and probabilistic methods.

Algorithmic versus probabilistic methods: pros

Probabilistic methods:
– Mathematical/probabilistic foundation.
– Possible to derive optimal methods.
– Often gives nice interpretations of the results.
– Possible to control error rates by choosing significance levels.

Algorithmic methods:
– No need for model assumptions.
– Is "optimized" using the test data.
– Often has a good heuristic foundation.
– Some methods work well when p > n.

Algorithmic versus probabilistic methods: cons

Probabilistic methods:
– May be based on asymptotic results that do not work well when the sample size is small.
– The model may be a poor description of nature.
– The conclusions are only about the model's mechanism and not about the true mechanism.
– Evaluating the model fit can be difficult, especially in higher dimensions.

Algorithmic methods:
– Relies heavily on the training data, which may not be representative.
– Difficult or impossible to find optimal methods.
– Likely not as good as the probabilistic method if the model is accurate.
– Some methods lack solid theoretical support.

Algorithmic versus probabilistic methods: discussion

A paper by Leo Breiman from 2001 (Statistical modeling: the two cultures, Statistical Science, Vol. 16) discusses the use of algorithmic methods in modern statistics. Breiman argues that:

▶ The data and the problem at hand should lead to the solution – not prior ideas about what kind of methods are good.
▶ The statistician should focus on finding a good solution – regardless of whether that solution uses algorithmic or probabilistic methods.
▶ How good a method is should be judged by its predictive accuracy on the test data.
▶ This last point is perhaps controversial; we often judge probabilistic methods by their theoretical properties.

Algorithmic versus probabilistic methods: discussion

Some further comments and questions:


▶ Are algorithmic methods simply "even more non-parametric" non-parametric methods?
▶ Today it is not uncommon for new probabilistic methods to be published with nothing but simulation results to back them up (as the underlying mathematics can be quite complicated). Is this any different from the support for algorithmic methods?
▶ There are some very interesting research problems in trying to provide probabilistic support for algorithmic methods.
▶ Regardless of how we feel about algorithmic methods, we should not be afraid to introduce new tools to our statistical toolbox!

