
Machine Learning: Tools, Techniques, Applications

(2013-14-I)
Aug. 2013
Chapter 2
Bayes decision theory (Ver:0.8)
We observed earlier that the dependence of the predicted variable on the feature vector is stochastic. Consequently, the expected or average error made by any predictor on X will be non-zero. One question that naturally arises is: what is the least expected error that the best (that is, most accurate) predictor will make? This acts as a lower bound on the expected error of any predictor.
Let us understand this through an example. Assume we wish to predict the geographic origin of a person from India as North if the person's place of origin is north of the parallel close to Nagpur, and otherwise as South. Let North correspond to class $\omega_1$ and South correspond to class $\omega_2$. We will abuse notation a bit and let $\omega_1$ and $\omega_2$ denote both the label and the set of objects with that label; it will be apparent from the context which is meant. Now suppose we ask the predictor to label a person picked at random from India. How will the predictor do this if no information is available about the individual? One possibility that immediately suggests itself is to use the population ratios of the north and south. So, if $f_N$ is the fraction of the total population that is from the north and $f_S$ the fraction from the south, then a simple rule is:

if $(f_N > f_S)$ then $\omega_1$ else $\omega_2$.
If we let $f_N$ be identified with $P(\omega_1)$ and $f_S$ with $P(\omega_2)$, the a priori probabilities for classes $\omega_1$ and $\omega_2$ respectively, then we can write the above rule in terms of the a priori probabilities:

if $(P(\omega_1) > P(\omega_2))$ then $\omega_1$ else $\omega_2$.
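For instance, with hypothetical fractions $f_N = 0.6$ and $f_S = 0.4$, this rule always predicts $\omega_1$, and its expected error is $P(\omega_2) = 0.4$ since every person from the south is mislabelled.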
Now assume, in addition, that we have some information related to the skin colour of people in $\omega_1$ and $\omega_2$. These distributions, assumed to be similar normal distributions for both classes except for a shifted mean, are shown in figure 2.1. Now when we get the feature vector x of an unlabelled individual whose skin colour feature value is x, we can modify our labelling rule to (note x is a one-dimensional vector):

if $(p(\omega_1|x) > p(\omega_2|x))$ then $\omega_1$ else $\omega_2$.
That is, given x, if the conditional probability for $\omega_1$ is greater then we give the label $\omega_1$, otherwise $\omega_2$. If the class conditional distributions in figure 2.1 and the a priori probabilities are known, we can use Bayes rule to get $p(\omega_1|x)$ and $p(\omega_2|x)$:

$$p(\omega_i|x) = \frac{p(x|\omega_i)\,P(\omega_i)}{p(x)}, \qquad i = 1, 2$$
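A minimal sketch of this rule in Python follows. The means, common standard deviation, and priors are hypothetical values chosen for illustration (the text fixes none of them), with the two class conditionals taken as equal-variance normals as in figure 2.1.

import math

def normal_pdf(x, mu, sigma):
    # density of N(mu, sigma^2) at x
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical parameters: similar normal class conditionals with shifted means.
MU = {1: 3.0, 2: 5.0}      # means of p(x|omega_1), p(x|omega_2)
SIGMA = 1.0                # common standard deviation
PRIOR = {1: 0.5, 2: 0.5}   # a priori probabilities P(omega_i)

def posterior(i, x):
    # Bayes rule: p(omega_i|x) = p(x|omega_i) P(omega_i) / p(x)
    evidence = sum(normal_pdf(x, MU[j], SIGMA) * PRIOR[j] for j in (1, 2))
    return normal_pdf(x, MU[i], SIGMA) * PRIOR[i] / evidence

def label(x):
    # give the label with the larger posterior
    return 1 if posterior(1, x) > posterior(2, x) else 2

print(label(3.2))   # 1: x lies in R_1, left of the threshold t = 4
print(label(4.7))   # 2: x lies in R_2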
If we assume that X is a d-dimensional space then the various class conditional distributions will carve the space into regions or volumes where, for each region, one of the conditional probabilities $p(\omega_i|x)$ is highest, that is, larger than $p(\omega_j|x)$ for all $j \neq i$. If x is in that region then we give it the corresponding label $\omega_i$. For example, in figure 2.1, if x is in region $R_1$ then it will be labelled $\omega_1$, and if it is in $R_2$ it will be labelled $\omega_2$.
Figure 2.1: Class conditional distributions $p(x|\omega_1)$, $p(x|\omega_2)$ of skin colour for $\omega_1$, $\omega_2$. $R_1$ is the region to the left of t and $R_2$ the region to its right.
At the point where $R_1$ and $R_2$ meet, either label can be given. This is the general Bayes decision rule. Note that in $R_1$ there is a finite probability of finding objects of class $\omega_2$, and similarly in $R_2$ there is a finite probability of finding objects belonging to $\omega_1$. The Bayes decision rule will misclassify such objects. The total error probability is clearly the area under the curve shown shaded in figure 2.2.
Figure 2.2: Class conditional distributions $p(x|\omega_1)$, $p(x|\omega_2)$ of skin colour for $\omega_1$, $\omega_2$.
Intuitively, we can see that if we move the threshold point t in figure 2.2 either to the right or to the left, the shaded area under the curve increases. Thus the minimum error is obtained as per the Bayes decision rule. This can be proved in a more rigorous manner but we shall not do that here. More generally we can write:
$$p(\text{error}) = \sum_{i=1}^{C} p(\text{error}|\omega_i)\,P(\omega_i)$$

where $p(\text{error}|\omega_i)$ is the probability of misclassifying objects from class $\omega_i$. This is given by:

$$p(\text{error}|\omega_i) = \int_{R_i^c} p(x|\omega_i)\,dx$$
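For the two-class example of figure 2.2, for instance, $p(\text{error}|\omega_1) = \int_{R_2} p(x|\omega_1)\,dx$ is just the tail of the $\omega_1$ density lying to the right of the threshold t.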
where $R_i^c$ is the complement of region $R_i$ in the feature space X. So,
$$p(\text{error}) = \sum_{i=1}^{C} P(\omega_i)\int_{R_i^c} p(x|\omega_i)\,dx = \sum_{i=1}^{C} P(\omega_i)\left[1 - \int_{R_i} p(x|\omega_i)\,dx\right] = 1 - \underbrace{\left[\sum_{i=1}^{C} P(\omega_i)\int_{R_i} p(x|\omega_i)\,dx\right]}_{A} \qquad (2.1)$$
If we want to minimize $p(\text{error})$ then we have to maximize the value of the expression A in equation 2.1. So we must choose the regions $R_i$ such that each $\int_{R_i} P(\omega_i)\,p(x|\omega_i)\,dx$ is a maximum. The sum of integrals will be a maximum when we choose each $R_i$ as the region over which $P(\omega_i)\,p(x|\omega_i)$ is the largest over all classes, so that A becomes $\int_X \max_i\,[P(\omega_i)\,p(x|\omega_i)]\,dx$ over the whole feature space X. This means the probability of a correct label is:
$$p(\text{correct}) = \int_X \max_i\,[P(\omega_i)\,p(x|\omega_i)]\,dx$$

and the probability of error is $p(\text{error}) = 1 - p(\text{correct})$. But notice that picking the class that maximizes $P(\omega_i)\,p(x|\omega_i)$ is essentially Bayes rule. Therefore, Bayes rule indeed does minimize the probability of error.
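A quick numerical check of this claim, under the same hypothetical Gaussian parameters as the earlier sketch: the two-class error probability is computed as a function of the threshold t, and a scan finds its minimum exactly at the Bayes threshold.

import math

def normal_cdf(x, mu, sigma):
    # cumulative distribution of N(mu, sigma^2)
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

def p_error(t, mu1=3.0, mu2=5.0, sigma=1.0, p1=0.5, p2=0.5):
    # With R_1 = (-inf, t] and R_2 = (t, inf):
    # p(error) = P(omega_1) * mass of p(x|omega_1) in R_2
    #          + P(omega_2) * mass of p(x|omega_2) in R_1
    return p1 * (1 - normal_cdf(t, mu1, sigma)) + p2 * normal_cdf(t, mu2, sigma)

thresholds = [2 + 0.01 * k for k in range(401)]   # scan t over [2, 6]
best = min(thresholds, key=p_error)
print(best, p_error(best))  # minimum at t = 4.0, the Bayes threshold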
When different classification errors are made they are not all equally costly. For example, suppose tumours are being classified as malignant or benign. A classification error that classifies a malignant tumour as benign is much more costly than one where a benign tumour is misclassified as malignant: in the first case it could mean the death of an individual, while in the second the further tests that are usually done before treatment is started should be able to catch the mistake. Therefore, we would like to give different weights to different errors. Let $\lambda_{ij}$ be the weight when true label $\omega_i$ is misclassified as $\omega_j$. This weighted probability is often called risk or loss. For the case where there are only two labels the risk r is:
$$r = \lambda_{12}\int_{R_2} P(\omega_1)\,p(x|\omega_1)\,dx + \lambda_{21}\int_{R_1} P(\omega_2)\,p(x|\omega_2)\,dx$$
The integral $r_1 = \int_{R_2} p(x|\omega_1)\,dx$ is called the risk or loss associated with giving the wrong label $\omega_2$ to an object whose real label is $\omega_1$. More generally, for C class labels the risk or loss for label $\omega_k$ is:
$$r_k = \sum_{i=1}^{C} \lambda_{ki}\int_{R_i} p(x|\omega_k)\,dx$$
The goal is to choose the regions $R_i$ such that the expected or average risk is minimized. The expected risk is:
$$r = \sum_{k=1}^{C} P(\omega_k)\,r_k = \sum_{k=1}^{C} P(\omega_k)\sum_{i=1}^{C}\lambda_{ki}\int_{R_i} p(x|\omega_k)\,dx = \sum_{i=1}^{C}\int_{R_i}\left[\sum_{k=1}^{C}\lambda_{ki}\,P(\omega_k)\,p(x|\omega_k)\right]dx \qquad (2.2)$$
We would like to minimize the risk in equation 2.2 by choosing the regions $R_i$ suitably. By analogy with the previous discussion, we see that this is really a weighted version of the Bayes decision rule, with the $\lambda_{ki}$ as weights. The loss/risk can be specified by a loss/risk matrix:
$$\begin{pmatrix} \lambda_{11} & \cdots & \lambda_{1C} \\ \vdots & \ddots & \vdots \\ \lambda_{C1} & \cdots & \lambda_{CC} \end{pmatrix}$$
Usually $\lambda_{ii}$ is 0, that is, a correct classification incurs no cost. When we have the 0-1 loss defined by:

$$\lambda_{ki} = \begin{cases} 0 & k = i \\ 1 & k \neq i \end{cases}$$

then minimizing the risk gives the minimum classification error probability.
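As an illustration, for the tumour example with $\omega_1$ = malignant and $\omega_2$ = benign, one might take the hypothetical loss matrix $\begin{pmatrix} 0 & 100 \\ 1 & 0 \end{pmatrix}$, so that calling a malignant tumour benign ($\lambda_{12} = 100$) is weighted a hundred times more heavily than the reverse error ($\lambda_{21} = 1$); the numbers are invented purely to show the asymmetry.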
A common and useful special case is when we have only two class labels. Then, for a given x, the risks of the two labels are:

$$r_1 = \lambda_{11}\,p(x|\omega_1)P(\omega_1) + \lambda_{21}\,p(x|\omega_2)P(\omega_2)$$
$$r_2 = \lambda_{12}\,p(x|\omega_1)P(\omega_1) + \lambda_{22}\,p(x|\omega_2)P(\omega_2)$$
For an unknown vector x we give label $\omega_1$ if $r_1 < r_2$, otherwise label $\omega_2$. Assuming that $\lambda_{ij} > \lambda_{ii}$, we can write the above as:

$$\text{label}(x) = \begin{cases} \omega_1 & \text{if } \dfrac{p(x|\omega_1)}{p(x|\omega_2)} > \dfrac{P(\omega_2)}{P(\omega_1)} \cdot \dfrac{\lambda_{21} - \lambda_{22}}{\lambda_{12} - \lambda_{11}} \\[1ex] \omega_2 & \text{otherwise} \end{cases}$$
Notice that this is again Bayes rule if we assume $\lambda_{11} = \lambda_{22} = 0$ and $\lambda_{12} = \lambda_{21}$. For better intuition, consider the effect of different values of $\lambda_{12}$ and $\lambda_{21}$ on the threshold t in figure 2.1. Assuming $P(\omega_1) = P(\omega_2)$, we get $x \in \omega_1$ if $p(x|\omega_1) > p(x|\omega_2)\,\frac{\lambda_{21}}{\lambda_{12}}$, otherwise it is in $\omega_2$. So the threshold t moves left if $\lambda_{21} > \lambda_{12}$, otherwise it moves right.
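The following sketch makes the threshold movement concrete for the same hypothetical equal-variance normal class conditionals and equal priors used above; the closed-form threshold comes from taking logarithms of the likelihood-ratio condition, a standard manipulation not spelled out in the text.

import math

def loss_weighted_threshold(mu1=3.0, mu2=5.0, sigma=1.0, lam12=1.0, lam21=1.0):
    # With lambda_11 = lambda_22 = 0 and P(omega_1) = P(omega_2), the rule
    # "label omega_1 iff p(x|omega_1)/p(x|omega_2) > lambda_21/lambda_12"
    # reduces, for equal-variance normals, to a single threshold t on x
    # (R_1 to its left, R_2 to its right):
    return 0.5 * (mu1 + mu2) - sigma ** 2 * math.log(lam21 / lam12) / (mu2 - mu1)

print(loss_weighted_threshold())                        # 4.0 for symmetric losses
print(loss_weighted_threshold(lam21=2.0, lam12=1.0))    # about 3.65: t moves left
print(loss_weighted_threshold(lam21=1.0, lam12=2.0))    # about 4.35: t moves right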