
Theory and Modeling for Artificial Intelligence

Lecture 3: Classification

Lecturer: 장원철, Fall 2023

3.1 Introduction

Statistics        Computer Science        Meaning
estimation        learning                find a good classifier
classification    supervised learning     predicting a discrete Y from X
clustering        unsupervised learning   putting data into groups
data              training sample         (X1, Y1), . . . , (Xn, Yn)
covariates        features                the Xi's
classifier        hypothesis              map h : X → Y

• The problem of predicting a discrete random variable Y from another random variable X is called classification or supervised learning or discrimination or pattern recognition or machine learning.

• Consider iid data (X1 , Y1 ), . . . , (Xn , Yn ) where

Xi = (Xi1 , . . . , Xid )T ∈ X ⊂ Rd

is a d-dimensional vector and Yi takes values in {0, 1}.

Definition. A classification rule is a function h : X → {0, 1}. We observe X and predict Y with h(X). The classification risk (or error rate) of h is

R(h) = Pr(Y ̸= h(X))

Example (Handwritten Digits). Identify handwritten digits from images. Each Y is a digit from 0 to 9.
There are 256 covariates x1 , . . . , x256 corresponding to the intensity values from the pixels of the 16 × 16
image.


Example (Linear Decision Boundary). The figure below shows 100 data points. The covariate X = (X1, X2) is 2-dimensional and the outcome Y ∈ Y = {0, 1}. □ means Y = 0 and △ means Y = 1. A linear classification rule is of the form
$$h(x) = \begin{cases} 1 & \text{if } a + b_1 x_1 + b_2 x_2 > 0 \\ 0 & \text{otherwise.} \end{cases}$$

Example (Coronary Risk-Factor Study). There are 462 males between the ages of 15 and 64 from three
rural areas in South Africa. The outcome Y is the presence (Y = 1) or absence (Y = 0) of coronary
heart disease and there are 9 covariates: systolic blood pressure, cumulative tobacco (kg), ldl (low density
lipoprotein), adiposity, famhist (family history of heart disease), typea (type-A behavior), obesity, alcohol
(current alcohol consumption), and age. The goal is to predict Y from the covariates.

Definition. The true error rate (or classification risk) of a classifier h is
$$R(h) = \Pr(h(X) \ne Y)$$
and the empirical error rate (or training error rate) is
$$\hat{R}_n(h) = \frac{1}{n}\sum_{i=1}^{n} I(h(X_i) \ne Y_i).$$

Theorem 3.1. The rule h that minimizes R(h) is
$$h^*(x) = \begin{cases} 1 & \text{if } m(x) > \frac{1}{2} \\ 0 & \text{otherwise} \end{cases}$$
where
$$m(x) = E(Y \mid X = x) = \Pr(Y = 1 \mid X = x)$$
denotes the regression function. The rule h∗ is called the Bayes rule. The risk R∗ = R(h∗) of the Bayes rule is called the Bayes risk. The set D(h) = {x : m(x) = 1/2} is called the decision boundary.

Proof. It suffices to show that
$$R(h) - R(h^*) \ge 0.$$
Note that
$$R(h) = \Pr(Y \ne h(X)) = \int \Pr(Y \ne h(X) \mid X = x) f(x)\, dx.$$
Hence we only need to show that
$$\Pr(Y \ne h(X) \mid X = x) - \Pr(Y \ne h^*(X) \mid X = x) \ge 0 \quad \text{for all } x.$$

Now
$$\begin{aligned}
\Pr(Y \ne h(X) \mid X = x) &= 1 - \Pr(Y = h(X) \mid X = x) \\
&= 1 - \{\Pr(Y = 1, h(x) = 1 \mid X = x) + \Pr(Y = 0, h(x) = 0 \mid X = x)\} \\
&= 1 - \{I(h(x) = 1)\Pr(Y = 1 \mid X = x) + I(h(x) = 0)\Pr(Y = 0 \mid X = x)\} \\
&= 1 - \{I(h(x) = 1)\, m(x) + I(h(x) = 0)(1 - m(x))\} \\
&= 1 - \{I(x)\, m(x) + (1 - I(x))(1 - m(x))\}
\end{aligned}$$
where I(x) = I(h(x) = 1).

Hence
$$\begin{aligned}
&\Pr(Y \ne h(X) \mid X = x) - \Pr(Y \ne h^*(X) \mid X = x) \\
&= \{I^*(x)\, m(x) + (1 - I^*(x))(1 - m(x))\} - \{I(x)\, m(x) + (1 - I(x))(1 - m(x))\} \\
&= (2m(x) - 1)(I^*(x) - I(x)) \\
&= 2\left(m(x) - \frac{1}{2}\right)(I^*(x) - I(x)),
\end{aligned}$$
where I∗(x) = I(h∗(x) = 1).

When m(x) ≥ 1/2, h∗(x) = 1, so both factors are nonnegative and the above term is nonnegative. When m(x) ≤ 1/2, h∗(x) = 0, so both factors are nonpositive and hence the above term is again nonnegative.

• We can rewrite h∗ in a different way. From Bayes' theorem,
$$m(x) = \Pr(Y = 1 \mid X = x) = \frac{\Pr(x \mid Y = 1)\Pr(Y = 1)}{\Pr(x \mid Y = 1)\Pr(Y = 1) + \Pr(x \mid Y = 0)\Pr(Y = 0)} = \frac{\pi_1 p_1(x)}{\pi_1 p_1(x) + \pi_0 p_0(x)}$$

where πj = Pr(Y = j). Therefore
$$m(x) > \frac{1}{2} \iff \frac{p_1(x)}{p_0(x)} > \frac{1 - \pi_1}{\pi_1}.$$

Thus the Bayes rule can be rewritten as
$$h^*(x) = \begin{cases} 1 & \text{if } \dfrac{p_1(x)}{p_0(x)} > \dfrac{1 - \pi_1}{\pi_1} \\ 0 & \text{otherwise.} \end{cases}$$

• Define the oracle classifier h0 by
$$R(h_0) = \inf_{h \in \mathcal{H}} R(h),$$
where H is a set of classifiers.

• Let R0 = R(h0) denote the oracle risk of H. In general,
$$R(h) - R(h^*) = \underbrace{R(h) - R(h_0)}_{\text{distance from oracle}} + \underbrace{R(h_0) - R(h^*)}_{\text{distance of oracle from Bayes error}}$$
where the first term is a kind of variance and the second term acts like a squared bias. In other words, if one chooses a bigger class H, then h0 is closer to h∗ (smaller bias), but the first term tends to grow (larger variance).

• A natural choice for approximating the Bayes classifier is the plug-in classification rule:
$$\hat{h}(x) = \begin{cases} 1 & \text{if } \hat{m}(x) > \frac{1}{2} \\ 0 & \text{otherwise.} \end{cases}$$

Theorem 3.2. The risk of the plug-in classification rule satisfies
$$R(\hat{h}) - R(h^*) \le 2\sqrt{\int \left(\hat{m}(x) - m^*(x)\right)^2 dP_X(x)}.$$

Proof. It is easy to show (HW #2) that
$$\Pr(\hat{h}(X) \ne Y) - \Pr(h^*(X) \ne Y) = 2\int \left|m^*(x) - \tfrac{1}{2}\right| I(h^*(x) \ne \hat{h}(x))\, dP_X(x).$$

Now when h∗(x) ≠ ĥ(x), there are two possible scenarios:

1. ĥ(x) = 1 and h∗(x) = 0;
2. ĥ(x) = 0 and h∗(x) = 1.

In both scenarios, we can conclude that
$$|\hat{m}(x) - m^*(x)| \ge \left|m^*(x) - \tfrac{1}{2}\right|.$$

Therefore,
$$\begin{aligned}
\Pr(\hat{h}(X) \ne Y) - \Pr(h^*(X) \ne Y) &\le 2\int |m^*(x) - \hat{m}(x)|\, I(h^*(x) \ne \hat{h}(x))\, dP_X(x) \\
&\le 2\int |m^*(x) - \hat{m}(x)|\, dP_X(x) \\
&\le 2\sqrt{\int \left(\hat{m}(x) - m^*(x)\right)^2 dP_X(x)}.
\end{aligned}$$
The last inequality follows from E|Z| ≤ √(E(Z²)) for any random variable Z.

• The above theorem tells us that if m̂(x) is close to m∗(x), then the plug-in classification risk will be close to the Bayes risk. However, the converse is not necessarily true.

• How to find a good classifier?

1. Empirical Risk Minimization: Choose a set of classifiers H and find ĥ ∈ H that minimizes some estimate of the risk R(h).

2. Regression: Find an estimate m̂ of the regression function m and substitute it into the Bayes rule.

3. Density Estimation: Estimate p0 from the Xi's for which Yi = 0, estimate p1 from the Xi's for which Yi = 1, and let π̂ = (1/n) Σi Yi. Define
$$\hat{m}(x) = \widehat{\Pr}(Y = 1 \mid X = x) = \frac{\hat{\pi}\hat{p}_1(x)}{\hat{\pi}\hat{p}_1(x) + (1 - \hat{\pi})\hat{p}_0(x)}$$
and
$$\hat{h}(x) = \begin{cases} 1 & \text{if } \hat{m}(x) > \frac{1}{2} \\ 0 & \text{otherwise.} \end{cases}$$
A small sketch of this third approach is given below.
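As an illustration of the density estimation approach, here is a minimal NumPy sketch: it estimates p0 and p1 with product Gaussian kernels, estimates π by the sample proportion, and plugs them into the Bayes rule. The bandwidth h and the toy data are arbitrary choices made for this example, not part of the lecture.

```python
import numpy as np

def kde(x, data, h=0.5):
    """Product-Gaussian kernel density estimate at the d-dimensional point x."""
    z = (x - data) / h                                   # data: (n, d), x: (d,)
    kernels = np.exp(-0.5 * z**2) / (h * np.sqrt(2 * np.pi))
    return np.mean(np.prod(kernels, axis=1))             # average of product kernels

def plug_in_classifier(x, X, Y, h=0.5):
    """Approach 3: estimate p0, p1 and pi, then plug into the Bayes rule."""
    pi_hat = Y.mean()
    p1_hat = kde(x, X[Y == 1], h)
    p0_hat = kde(x, X[Y == 0], h)
    m_hat = pi_hat * p1_hat / (pi_hat * p1_hat + (1 - pi_hat) * p0_hat)
    return int(m_hat > 0.5)

# toy data: two Gaussian clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
Y = np.concatenate([np.zeros(50), np.ones(50)])
print(plug_in_classifier(np.array([2.0, 2.0]), X, Y))    # likely predicts 1
```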

3.2 Empirical Risk Minimization

• Let H = {h1 , . . . , hm } be a finite set of classifiers.

• Empirical risk minimization means choosing the classifier ĥ ∈ H to minimize the training error (empirical risk) R̂n(h):
$$\hat{h} = \operatorname*{argmin}_{h \in \mathcal{H}} \hat{R}_n(h) = \operatorname*{argmin}_{h \in \mathcal{H}} \left( \frac{1}{n}\sum_i I(h(X_i) \ne Y_i) \right)$$

• Let h∗ be the best classifier in H:
$$R(h^*) = \min_{h \in \mathcal{H}} R(h).$$

• We want to show that, with high probability, R(ĥ) ≤ R(h∗) + ϵ for some small ϵ > 0.

• Recall Hoeffding's inequality: if X1, . . . , Xn ∼ Bernoulli(p), then for any ϵ > 0,
$$\Pr(|\hat{p} - p| > \epsilon) \le 2e^{-2n\epsilon^2}$$
where p̂ = Σi Xi / n.

Theorem 3.3. Assume H is finite and has m elements. Then, for any ϵ > 0,
$$\Pr\left(\max_{h \in \mathcal{H}} |\hat{R}_n(h) - R(h)| > \epsilon\right) \le 2me^{-2n\epsilon^2}.$$

Proof. We will use the union bound and Hoeffding's inequality.
$$\begin{aligned}
\Pr\left(\max_{h \in \mathcal{H}} |\hat{R}_n(h) - R(h)| > \epsilon\right) &= \Pr\left(\bigcup_{h \in \mathcal{H}} \left\{ |\hat{R}_n(h) - R(h)| > \epsilon \right\}\right) \\
&\le \sum_{h \in \mathcal{H}} \Pr\left(|\hat{R}_n(h) - R(h)| > \epsilon\right) \\
&\le \sum_{h \in \mathcal{H}} 2e^{-2n\epsilon^2} = 2me^{-2n\epsilon^2}.
\end{aligned}$$

• Fix α and let
$$\epsilon_n = \sqrt{\frac{1}{2n}\log\left(\frac{2m}{\alpha}\right)}.$$
Then
$$\Pr\left(\max_{h \in \mathcal{H}} |\hat{R}_n(h) - R(h)| > \epsilon_n\right) \le \alpha.$$

• Hence, with probability at least 1 − α, the following is true:
$$R(\hat{h}) \le \hat{R}_n(\hat{h}) + \epsilon_n \le \hat{R}_n(h^*) + \epsilon_n \le R(h^*) + 2\epsilon_n.$$

• Summarizing:
$$\Pr\left( R(\hat{h}) > R(h^*) + \sqrt{\frac{2}{n}\log\left(\frac{2m}{\alpha}\right)} \right) \le \alpha.$$
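To get a feel for the size of this bound, here is a quick numerical evaluation; the values of n, m and α are arbitrary examples, not from the lecture.

```python
import math

def erm_bound(n, m, alpha):
    """epsilon_n and the excess-risk bound 2 * epsilon_n from the argument above."""
    eps_n = math.sqrt(math.log(2 * m / alpha) / (2 * n))
    return eps_n, 2 * eps_n

for n in (100, 1000, 10000):
    eps_n, excess = erm_bound(n, m=50, alpha=0.05)
    print(f"n={n:6d}  eps_n={eps_n:.3f}  R(h_hat) <= R(h*) + {excess:.3f} with prob >= 0.95")
```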

3.3 Linear and Logistic Regression

• The regression approach is to estimate m(x) = E(Y | X = x) = Pr(Y = 1 | X = x). We can use the linear model
$$Y = m(x) + \epsilon = \beta_0 + \sum_{j=1}^{d} \beta_j X_j + \epsilon$$
or the logistic model
$$m(x) = \Pr(Y = 1 \mid X = x) = \frac{e^{\beta_0 + \sum_j \beta_j x_j}}{1 + e^{\beta_0 + \sum_j \beta_j x_j}}.$$

• The parameters β0 and β = (β1, . . . , βd)ᵀ can be estimated by maximizing the conditional likelihood
$$\mathcal{L}(\beta_0, \beta) = \prod_{i=1}^{n} \pi(x_i; \beta_0, \beta)^{Y_i}\, (1 - \pi(x_i; \beta_0, \beta))^{1 - Y_i},$$
where π(x; β0, β) = Pr(Y = 1 | X = x) under the logistic model.

• The conditional log-likelihood is
$$\begin{aligned}
\ell(\beta_0, \beta) &= \sum_{i=1}^{n} \left\{ Y_i \log \pi(x_i; \beta_0, \beta) + (1 - Y_i)\log(1 - \pi(x_i; \beta_0, \beta)) \right\} \\
&= \sum_{i=1}^{n} \left\{ Y_i(\beta_0 + x_i^T\beta) - \log\left(1 + \exp(\beta_0 + x_i^T\beta)\right) \right\}.
\end{aligned}$$

• To find the logistic regression MLE, we use the iteratively reweighted least squares (IRLS) algorithm; a sketch is given at the end of this section.

• Even if the model is wrong, this might work well since we only need to approximate the decision boundary.
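Here is a minimal NumPy sketch of the IRLS iteration mentioned above (equivalently, Newton-Raphson on the conditional log-likelihood). The toy data, iteration count, and function names are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def irls_logistic(X, y, n_iter=25):
    """Fit logistic regression by iteratively reweighted least squares.

    X: (n, d) covariates (an intercept column is added internally), y: (n,) labels in {0, 1}.
    Returns (beta0, beta).
    """
    n, d = X.shape
    Xd = np.hstack([np.ones((n, 1)), X])          # design matrix with intercept
    beta = np.zeros(d + 1)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Xd @ beta)))    # fitted probabilities
        W = p * (1 - p)                           # IRLS weights
        # Newton/IRLS step: beta <- beta + (X^T W X)^{-1} X^T (y - p)
        H = Xd.T @ (W[:, None] * Xd)
        beta = beta + np.linalg.solve(H, Xd.T @ (y - p))
    return beta[0], beta[1:]

# toy example
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(float)
print(irls_logistic(X, y))
```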

3.4 Linear Discriminant Analysis

• Suppose that p0(x) = p(x | Y = 0) and p1(x) = p(x | Y = 1) are both multivariate Gaussians:
$$p_k(x) = \frac{1}{(2\pi)^{d/2}|\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)\right), \quad k = 0, 1,$$
where Σ0 and Σ1 are both d × d covariance matrices.

Theorem 3.4. Suppose that X|Y = 0 ∼ N(μ0, Σ0) and X|Y = 1 ∼ N(μ1, Σ1). Then the Bayes rule is
$$h^*(x) = \begin{cases} 1 & \text{if } r_1^2 < r_0^2 + 2\log\left(\dfrac{\pi_1}{1 - \pi_1}\right) + \log\left(\dfrac{|\Sigma_0|}{|\Sigma_1|}\right) \\ 0 & \text{otherwise} \end{cases}$$
where ri² = (x − μi)ᵀ Σi⁻¹ (x − μi), i = 0, 1, is the Mahalanobis distance.

• An equivalent way of expressing the Bayes rule is
$$h^*(x) = \operatorname*{argmax}_{k} \delta_k(x)$$
where
$$\delta_k(x) = -\frac{1}{2}\log|\Sigma_k| - \frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1}(x - \mu_k) + \log \pi_k$$
is called the Gaussian discriminant function.

• In practice, insert sample estimates for µk , Σk , πk .

• The decision boundary is quadratic (Quadratic Discriminant Analysis, QDA).

• Set Σ0 = Σ1 = Σ to get a linear decision boundary (LDA). A sketch of the plug-in version of both appears below.
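The following is a minimal NumPy sketch of plug-in discriminant analysis: sample estimates of μk, Σk and πk are substituted into the discriminant functions δk, with a pooled-covariance option for LDA. The toy data and option names are illustrative assumptions.

```python
import numpy as np

def fit_discriminant(X, y, pooled=False):
    """Plug-in Gaussian discriminant analysis: QDA by default, LDA if pooled=True."""
    means, priors, Sigmas = {}, {}, {}
    for k in (0, 1):
        Xk = X[y == k]
        means[k] = Xk.mean(axis=0)
        priors[k] = len(Xk) / len(X)
        Sigmas[k] = np.cov(Xk, rowvar=False)
    if pooled:                                   # LDA: common (pooled) covariance matrix
        S = sum(Sigmas[k] * (np.sum(y == k) - 1) for k in (0, 1)) / (len(X) - 2)
        Sigmas = {0: S, 1: S}
    return means, priors, Sigmas

def delta(x, mu, Sigma, pi):
    """Gaussian discriminant function delta_k(x)."""
    _, logdet = np.linalg.slogdet(Sigma)
    diff = x - mu
    return -0.5 * logdet - 0.5 * diff @ np.linalg.solve(Sigma, diff) + np.log(pi)

def predict(x, means, priors, Sigmas):
    scores = [delta(x, means[k], Sigmas[k], priors[k]) for k in (0, 1)]
    return int(np.argmax(scores))

# toy example
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1.0, (100, 2)), rng.normal(2, 1.5, (100, 2))])
y = np.concatenate([np.zeros(100), np.ones(100)]).astype(int)
means, priors, Sigmas = fit_discriminant(X, y)   # QDA; pass pooled=True for LDA
print(predict(np.array([2.0, 2.0]), means, priors, Sigmas))
```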

3.5 Relationship Between Logistic Regression and LDA

• LDA and logistic regression are almost the same thing.

• If we assume that each group is Gaussian with the same covariance matrix, then
$$\begin{aligned}
\log\left(\frac{\Pr(Y = 1 \mid X = x)}{\Pr(Y = 0 \mid X = x)}\right) &= \log\left(\frac{\pi_1}{\pi_0}\right) - \frac{1}{2}(\mu_0 + \mu_1)^T \Sigma^{-1}(\mu_1 - \mu_0) + x^T \Sigma^{-1}(\mu_1 - \mu_0) \\
&\equiv \alpha_0 + \alpha^T x.
\end{aligned}$$

• On the other hand, the logistic model is
$$\log\left(\frac{\Pr(Y = 1 \mid X = x)}{\Pr(Y = 0 \mid X = x)}\right) = \beta_0 + \beta^T x.$$

• Both LDA and logistic regression lead to a linear classification rule. The difference is in how we estimate the parameters.

• In LDA we estimated the whole joint distribution by maximizing the likelihood
$$\prod_i f(X_i, y_i) = \underbrace{\prod_i f(X_i \mid y_i)}_{\text{Gaussian}}\; \underbrace{\prod_i f(y_i)}_{\text{Bernoulli}}$$

• In logistic regression we maximized the conditional likelihood ∏i f(yi | Xi) but we ignored the second term f(Xi):
$$\prod_i f(X_i, y_i) = \underbrace{\prod_i f(y_i \mid X_i)}_{\text{logistic}}\; \underbrace{\prod_i f(X_i)}_{\text{ignored}}$$

• Since classification only requires knowing f (y|x), we don’t really need to estimate the whole joint
distribution.

• Logistic regression leaves the marginal distribution f (x) unspecified so it is more nonparametric than
LDA. This is an advantage of the logistic regression approach over LDA.

3.6 Support Vector Machines

• In this section, the outcomes are coded as −1 and 1.

• We consider a class of linear classifiers called Support Vector Machines (SVM). It will be convenient to label the outcomes as −1 and +1 instead of 0 and 1. A linear classifier can then be written as
$$h(x) = I(H(x) > 0)$$
where x = (x1, . . . , xd) and
$$H(x) = \beta_0 + \sum_{j=1}^{d} \beta_j x_j.$$

• Note that:
classifier correct ⇒ Yi H(Xi ) ≥ 0
classifier incorrect ⇒ Yi H(Xi ) ≤ 0

• The classification risk is
$$R = \Pr(Y \ne h(X)) = \Pr(YH(X) < 0) = E\left(L(YH(X))\right)$$
where L is the 0-1 loss: L(a) = 1 if a < 0 and L(a) = 0 if a ≥ 0.

Figure 1: The 0-1 classification loss (dashed line), hinge loss (solid line) and logistic loss (dotted line)

• The SVM classifier is ĥ(x) = I(Ĥ(x) > 0), where Ĥ(x) can be obtained by minimizing
$$\underbrace{\sum_{i=1}^{n} \left[1 - Y_i H(X_i)\right]_+}_{\text{hinge loss}} + \frac{\lambda}{2}\|\beta\|^2$$
where λ > 0.
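A minimal sketch of minimizing this regularized hinge loss by subgradient descent is shown below. The step size, iteration count, and toy data are illustrative assumptions; practical SVM solvers treat this optimization much more carefully.

```python
import numpy as np

def svm_subgradient(X, y, lam=0.1, eta=0.01, n_iter=2000):
    """Minimize sum_i [1 - y_i H(x_i)]_+ + (lam/2)||beta||^2 by subgradient descent.

    X: (n, d) covariates, y: (n,) labels in {-1, +1}.
    """
    beta0, beta = 0.0, np.zeros(X.shape[1])
    for _ in range(n_iter):
        margins = y * (beta0 + X @ beta)
        active = margins < 1                      # points violating the margin
        grad_beta = lam * beta - (y[active][:, None] * X[active]).sum(axis=0)
        grad_beta0 = -y[active].sum()
        beta -= eta * grad_beta
        beta0 -= eta * grad_beta0
    return beta0, beta

# toy data with labels in {-1, +1}
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-1, 1, (100, 2)), rng.normal(1, 1, (100, 2))])
y = np.concatenate([-np.ones(100), np.ones(100)])
beta0, beta = svm_subgradient(X, y)
print("training accuracy:", np.mean(np.sign(beta0 + X @ beta) == y))
```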

• Figure 1 compares the 0-1 classification loss, the hinge loss, and the logistic loss log(1 + e^(−yH(x))).

• The hinge loss is the smallest convex function that lies above the 0-1 loss, so computation is easy, and the minimizer of E[1 − Y H(X)]+ is the Bayes rule.

• The SVM classifier is often developed from a geometric perspective.

• Suppose that the data are linearly separable, that is, there exists a hyperplane that perfectly sepa-
rates the two classes.

• How can we find a separating hyperplane? Note that LDA or logistic regression may not find it.

• A separating hyperplane will minimize
$$-\sum_{i \in \mathcal{M}} Y_i H(X_i)$$
where M is the index set of all misclassified data points.

• Rosenblatt's perceptron algorithm takes starting values and, for each misclassified point (Xi, Yi), updates them as
$$\begin{pmatrix} \beta \\ \beta_0 \end{pmatrix} \leftarrow \begin{pmatrix} \beta \\ \beta_0 \end{pmatrix} + \rho \begin{pmatrix} Y_i X_i \\ Y_i \end{pmatrix}$$
where ρ > 0 is the learning rate.
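A minimal sketch of this update rule is given below, cycling through misclassified points until none remain or an iteration cap is reached; the cap and the toy data are assumptions made for illustration.

```python
import numpy as np

def perceptron(X, y, rho=1.0, max_epochs=100):
    """Rosenblatt's perceptron: update (beta, beta0) on each misclassified point."""
    beta, beta0 = np.zeros(X.shape[1]), 0.0
    for _ in range(max_epochs):
        updated = False
        for xi, yi in zip(X, y):
            if yi * (beta0 + xi @ beta) <= 0:     # misclassified (or on the boundary)
                beta += rho * yi * xi
                beta0 += rho * yi
                updated = True
        if not updated:                           # no misclassified points: a separating hyperplane was found
            break
    return beta0, beta

# linearly separable toy data, labels in {-1, +1}
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-2, 0.5, (50, 2)), rng.normal(2, 0.5, (50, 2))])
y = np.concatenate([-np.ones(50), np.ones(50)])
beta0, beta = perceptron(X, y)
print(np.all(y * (beta0 + X @ beta) > 0))         # True if the data are perfectly separated
```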



Figure 2: Support vectors and maximum margin hyperplane

• However, there are many separating hyperplanes.

• The particular separating hyperplane that this algorithm converges to depends on the starting values.

• Intuitively, it seems reasonable to choose the hyperplane furthest from the data in the sense that it separates the +1s and −1s and maximizes the distance to the closest point.

• This hyperplane is called the maximum margin hyperplane.

• The margin is the distance from the hyperplane to the nearest point.

• Points on the boundary of the margin are called support vectors. Figure 2 shows support vectors.

• The data can be separated by some hyperplane if and only if there exists a hyperplane H(x) = β0 + Σj βj xj such that
$$Y_i H(X_i) \ge 1, \quad i = 1, \ldots, n.$$

• The goal, then, is to maximize the margin, subject to this condition. That is,
$$\max_{\beta_0, \beta} M \quad \text{subject to} \quad \sum_{j=1}^{d} \beta_j^2 = 1, \qquad Y_i H(X_i) \ge M, \; i = 1, \ldots, n.$$

• Given two vectors a and b, let ⟨a, b⟩ = aᵀb = Σj aj bj denote the inner product of a and b.

• Let
$$\hat{H}(x) = \hat{\beta}_0 + \sum_{j=1}^{d} \hat{\beta}_j x_j$$
denote the optimal (largest margin) hyperplane.

• Then, for j = 1, . . . , d,
$$\hat{\beta}_j = \sum_{i=1}^{n} \hat{\alpha}_i Y_i X_{ij}$$
where Xij is the value of the covariate Xj for the ith data point, and α̂ = (α̂1, . . . , α̂n) is the vector that maximizes
$$\sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{k=1}^{n} \alpha_i \alpha_k Y_i Y_k \langle X_i, X_k \rangle$$
subject to
$$\alpha_i \ge 0 \quad \text{and} \quad \sum_i \alpha_i Y_i = 0.$$
i

• The points Xi for which α̂i ≠ 0 are called support vectors. β̂0 can be found by solving
$$\hat{\alpha}_i \left[ Y_i (X_i^T \hat{\beta} + \hat{\beta}_0) - 1 \right] = 0$$
for any support point Xi.

• Ĥ may be written as
$$\hat{H}(x) = \hat{\beta}_0 + \sum_{i=1}^{n} \hat{\alpha}_i Y_i \langle x, X_i \rangle.$$

• If there is no perfect linear classifier, then one allows overlap between the groups by replacing the
condition with
Yi H(xi ) ≥ 1 − ξi , ξi ≥ 0, i = 1, . . . , n.

• The variables ξ1 , . . . , ξn are called slack variables.

• We now maximize M subject to
$$\sum_{j=1}^{d} \beta_j^2 = 1, \qquad Y_i H(X_i) \ge M(1 - \xi_i), \qquad \xi_i \ge 0, \qquad \sum_{i=1}^{n} \xi_i \le C, \qquad i = 1, \ldots, n.$$

• The constant C is a tuning parameter that controls the amount of overlap.

• Here is another (easier) way to think about the SVM.

• There is a trick called kernelization for improving a computationally simple classifier h.

• The idea is to map the covariate X - which takes values in X - into a higher dimensional space Z and
apply the classifier in the bigger space Z.

• This can yield a more flexible classifier while retaining computational simplicity.

Example. The covariate is x = (x1, x2). The Yi's can be separated into two groups using an ellipse. Define a mapping ϕ by
$$z = (z_1, z_2, z_3) = \phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2).$$
Thus, ϕ maps X = R² into Z = R³. In the higher-dimensional space Z, the Yi's are separable by a linear decision boundary.

• If we significantly expand the dimension of the problem, we might increase the computational burden.

• For example, if x has dimension d = 256 and we wanted to use all fourth-order terms, then z = ϕ(x) has dimension 183,181,376.

• We are spared this computational nightmare by the following two facts:

– First, many classifiers just use the inner product between pairs of points.
– Second, the inner product in Z can be written as
$$\langle z, \tilde{z} \rangle = \langle \phi(x), \phi(\tilde{x}) \rangle = x_1^2 \tilde{x}_1^2 + 2 x_1 \tilde{x}_1 x_2 \tilde{x}_2 + x_2^2 \tilde{x}_2^2 = \left(\langle x, \tilde{x} \rangle\right)^2 \equiv K(x, \tilde{x}).$$

• Thus, we can compute ⟨z, z̃⟩ without ever computing Zi = ϕ(Xi).
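A quick numerical check of this identity (pure NumPy; the specific vectors are arbitrary):

```python
import numpy as np

def phi(x):
    """Feature map for the degree-2 polynomial kernel in two dimensions."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x, x_tilde = np.array([1.0, 2.0]), np.array([3.0, -1.0])
lhs = phi(x) @ phi(x_tilde)          # inner product computed in the feature space Z
rhs = (x @ x_tilde) ** 2             # K(x, x_tilde) = <x, x_tilde>^2, computed in X
print(lhs, rhs)                      # both equal 1.0
```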

• To summarize, kernelization involves finding a mapping ϕ : X → Z and a classifier such that:

1. Z has higher dimension than X and so leads to a richer set of classifiers.
2. The classifier only requires computing inner products.
3. There is a function K, called a kernel, such that ⟨ϕ(x), ϕ(x̃)⟩ = K(x, x̃).
4. Everywhere the term ⟨x, x̃⟩ appears in the algorithm, replace it with K(x, x̃).

• In fact, we never need to construct the mapping ϕ at all.

• We only need to specify a kernel K(x, x̃) that corresponds to ⟨ϕ(x), ϕ(x̃)⟩ for some ϕ.

• This raises an interesting question: given a function of two variables K(x, y), does there exist a function
ϕ(x) such that K(x, y) = ⟨ϕ(x), ϕ(y)⟩?

• The answer is provided by Mercer's theorem which says, roughly, that if K is positive definite - meaning that
$$\int\int K(x, y) f(x) f(y)\, dx\, dy \ge 0$$
for square integrable functions f - then such a ϕ exists.

• The support vector machine can be kernelized as follows.

• We simply replace ⟨Xi , Xj ⟩ with K(Xi , Xj ).

• We now maximize
$$\sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{k=1}^{n} \alpha_i \alpha_k Y_i Y_k K(X_i, X_k).$$

• The hyperplane can be written as
$$\hat{H}(x) = \hat{\beta}_0 + \sum_{i=1}^{n} \hat{\alpha}_i Y_i K(x, X_i).$$
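As a practical illustration, the sketch below fits a kernelized SVM with scikit-learn's SVC, which solves this dual problem internally; the RBF kernel, the value of C, and the toy data are illustrative choices, not prescribed by the lecture.

```python
import numpy as np
from sklearn.svm import SVC

# toy nonlinear problem: class +1 inside a disc, class -1 outside
rng = np.random.default_rng(5)
X = rng.uniform(-2, 2, size=(300, 2))
y = np.where(np.sum(X**2, axis=1) < 1.5, 1, -1)

clf = SVC(kernel="rbf", C=1.0)        # replaces <Xi, Xk> with a Gaussian kernel
clf.fit(X, y)
print("number of support vectors:", len(clf.support_vectors_))
print("training accuracy:", clf.score(X, y))
```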

3.7 Nonparametric Classification

3.7.1 Nonparametric Logistic Regression

• If Y is not real valued or ϵ is not Gaussian, using the basic regression model might not be appropriate.

• Assume that, given x, Y has an exponential family distribution:
$$f(y \mid x) = \exp\left\{ \frac{y\theta(x) - b(\theta(x))}{a(\phi)} + c(y, \phi) \right\}$$
for some functions a(·), b(·) and c(·).

• Here θ(·) is called the canonical parameter and ϕ is called the dispersion parameter.

• Define

m(x) = E (Y |X = x) = b′ (θ(x))
σ 2 (x) = Var (Y |X = x) = a(ϕ)b′′ (θ(x))

• The generalized linear model is of the form

g(E (Y |X = x)) = xT β

for some known function g called the link function.

• The parameters β are usually estimated by maximum likelihood.

• We want to use a nonparametric regression version of the GLM.

• The local polynomial regression estimator can be obtained by solving the weighted least squares problem
$$\operatorname*{argmin}_{a} \sum_{i=1}^{n} w_i \left(Y_i - P_x(X_i, a)\right)^2$$
where wi = K((x − Xi)/h).

• We now replace the L2 loss with the log-likelihood function.

• Suppose Y | X = x ∼ Bernoulli(m(x)) for some smooth function m(x) for which 0 ≤ m(x) ≤ 1.

• The likelihood function is
$$\prod_{i=1}^{n} m(X_i)^{Y_i} (1 - m(X_i))^{1 - Y_i}.$$

• Define ξ(x) = log(m(x)/(1 − m(x))). Then the log-likelihood function is
$$\ell(m) = \sum_{i=1}^{n} \ell(Y_i, \xi(X_i)),$$
where
$$\ell(y, \xi) = \log\left[ \left(\frac{e^{\xi}}{1 + e^{\xi}}\right)^{y} \left(\frac{1}{1 + e^{\xi}}\right)^{1 - y} \right] = y\xi - \log(1 + e^{\xi}).$$

• To estimate the regression function at x, for u near x we approximate m(u) by the local logistic function
$$m(u) \approx \frac{e^{a_0 + a_1(u - x)}}{1 + e^{a_0 + a_1(u - x)}}.$$

• Equivalently,
ξ(u) = logit(m(u)) ≈ a0 + a1 (u − x).

• Now define the local log-likelihood
$$\begin{aligned}
\ell_x(a) &= \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right) \ell(Y_i, a_0 + a_1(X_i - x)) \\
&= \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right) \left\{ Y_i(a_0 + a_1(X_i - x)) - \log\left(1 + e^{a_0 + a_1(X_i - x)}\right) \right\}.
\end{aligned}$$

• Choose â = argmax_a ℓ_x(a).

• The nonparametric estimate of m(x) is
$$\hat{m}(x) = \frac{e^{\hat{a}_0}}{1 + e^{\hat{a}_0}}.$$
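A minimal sketch of computing m̂(x) by Newton iterations on the local log-likelihood is given below; the Gaussian kernel, bandwidth, iteration count, and toy data are illustrative assumptions.

```python
import numpy as np

def local_logistic(x0, X, Y, h=0.5, n_iter=20):
    """Local logistic regression estimate of m(x0) for a scalar covariate."""
    w = np.exp(-0.5 * ((x0 - X) / h) ** 2)          # kernel weights K((x0 - Xi)/h)
    Z = np.column_stack([np.ones_like(X), X - x0])  # local design: rows (1, Xi - x0)
    a = np.zeros(2)                                 # a = (a0, a1)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Z @ a)))
        grad = Z.T @ (w * (Y - p))                  # gradient of the local log-likelihood
        hess = -Z.T @ (w[:, None] * (p * (1 - p))[:, None] * Z)
        a = a - np.linalg.solve(hess, grad)         # Newton step
    return 1.0 / (1.0 + np.exp(-a[0]))              # m_hat(x0) = e^{a0} / (1 + e^{a0})

# toy example with m(x) = logistic(2x)
rng = np.random.default_rng(6)
X = rng.uniform(-2, 2, 300)
Y = (rng.uniform(size=300) < 1 / (1 + np.exp(-2 * X))).astype(float)
print(local_logistic(0.0, X, Y))                    # should be near 0.5
```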

• To choose the optimal bandwidth, we can use leave-one-out log-likelihood cross-validation:
$$\text{CV} = \sum_{i=1}^{n} \ell\left(Y_i, \hat{\xi}_{-i}(X_i)\right)$$
where ξ̂−i(x) is the estimator obtained by leaving out (Xi, Yi).

• Unfortunately, we don’t have a simple formula for CV as in linear regression. However, we can approx-
imate the CV function.

• Let ℓ̇(y, ξ) and ℓ̈(y, ξ) denote the first and second derivatives of ℓ(y, ξ) with respect to ξ:
$$\dot{\ell}(y, \xi) = y - p(\xi), \qquad \ddot{\ell}(y, \xi) = -p(\xi)(1 - p(\xi))$$
where p(ξ) = e^ξ/(1 + e^ξ).

• Define Vx = diag(V(X1), . . . , V(Xn)) with
$$V(X_i) = -\ddot{\ell}\left(Y_i, \hat{a}_0 + \hat{a}_1(X_i - x)\right).$$

• Then
$$\text{CV} \approx \ell_x(\hat{a}) + \sum_{i=1}^{n} m(X_i)\left(\dot{\ell}(Y_i, \hat{a}_0)\right)^2$$
where e1 = (1, 0, . . . , 0)ᵀ,
$$m(x) = K(0)\, e_1^T \left(X_x^T W_x V_x X_x\right)^{-1} e_1,$$
Xx is the local design matrix with rows (1, Xi − x), and Wx = diag(K((x − Xi)/h)).

• The effective degrees of freedom is
$$\nu = \sum_{i=1}^{n} m(X_i)\, E\left(-\ddot{\ell}(Y_i, \hat{a}_0)\right).$$

3.7.2 Nearest Neighbor Classifiers

• The k-nearest neighbor rule is
$$h(x) = \begin{cases} 1 & \text{if } \sum_{i=1}^{n} w_i(x) I(Y_i = 1) > \sum_{i=1}^{n} w_i(x) I(Y_i = 0) \\ 0 & \text{otherwise} \end{cases}$$
where wi(x) = 1 if Xi is one of the k nearest neighbors of x, and wi(x) = 0 otherwise.

• "Nearest" depends on how you define the distance.

• Often we use Euclidean distance ∥Xi − Xj ∥.

• An important part of this method is to choose a good value of k. We can use cross-validation.
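A minimal sketch of choosing k by cross-validation with scikit-learn; the candidate grid, fold count, and toy data are illustrative assumptions (Figure 3 below shows this kind of cross-validation curve for the South African heart disease data).

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# toy data standing in for a real data set
rng = np.random.default_rng(7)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=1.0, size=300) > 0).astype(int)

cv_error = {}
for k in range(1, 51, 2):                       # odd k avoids ties between the two classes
    clf = KNeighborsClassifier(n_neighbors=k)
    cv_error[k] = 1 - cross_val_score(clf, X, y, cv=5).mean()

best_k = min(cv_error, key=cv_error.get)
print("selected k:", best_k, "estimated error:", round(cv_error[best_k], 3))
```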

[Figure 3 plot: cross-validation error (vertical axis, roughly 0.34 to 0.41) against k (horizontal axis, 0 to 50).]

Figure 3: kNN for the South African heart disease data

• Figure 3 shows the result of cross-validation for the South African heart disease data.

3.7.3 Density Estimation and Naive Bayes

• The (plug-in) Bayes rule is h(x) = argmax_k π̂k f̂k(x), where fk(x) = p(x | Y = k).

• We can estimate πk by π̂k = (1/n) Σi I(Yi = k).

• We can estimate fk using density estimation. For example, we could apply kernel density estimation
to Dk = {Xi : Yi = k} to get fbk .

• But if x = (x1, . . . , xd) is high-dimensional, nonparametric density estimation is not very reliable.

• Solution: we assume that the components X1, . . . , Xd are independent within each class (the naive Bayes assumption).

• We can then use one-dimensional density estimators and multiply them:
$$\hat{f}_0(x_1, \ldots, x_d) = \prod_{j=1}^{d} \hat{f}_{0j}(x_j), \qquad \hat{f}_1(x_1, \ldots, x_d) = \prod_{j=1}^{d} \hat{f}_{1j}(x_j).$$

• The assumption that the components of X are independent is usually wrong, yet the resulting classifier might still be accurate. Here is a summary of the steps in the naive Bayes classifier (a sketch follows the list):

1. For each group k, compute an estimate f̂kj of the density fkj for Xj, using the data for which Yi = k.
2. Let f̂k(x) = f̂k(x1, . . . , xd) = ∏_{j=1}^d f̂kj(xj).
3. Let π̂k = (1/n) Σ_{i=1}^n I(Yi = k).
4. Define
$$h(x) = \operatorname*{argmax}_{k} \hat{\pi}_k \hat{f}_k(x).$$
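A minimal NumPy sketch of these four steps, using one-dimensional Gaussian kernel density estimates (the bandwidth and toy data are illustrative assumptions):

```python
import numpy as np

def kde_1d(x, data, h=0.5):
    """One-dimensional Gaussian kernel density estimate at x."""
    z = (x - data) / h
    return np.mean(np.exp(-0.5 * z**2) / (h * np.sqrt(2 * np.pi)))

def naive_bayes_predict(x, X, Y, h=0.5):
    """Steps 1-4: per-class product of one-dimensional KDEs times the class proportion."""
    scores = []
    for k in (0, 1):
        Xk = X[Y == k]
        pi_k = np.mean(Y == k)                                    # step 3
        f_k = np.prod([kde_1d(x[j], Xk[:, j], h)                  # steps 1-2
                       for j in range(X.shape[1])])
        scores.append(pi_k * f_k)                                 # step 4: pi_k * f_k(x)
    return int(np.argmax(scores))

# toy data with two classes and three covariates
rng = np.random.default_rng(8)
X = np.vstack([rng.normal(0, 1, (100, 3)), rng.normal(1.5, 1, (100, 3))])
Y = np.concatenate([np.zeros(100), np.ones(100)]).astype(int)
print(naive_bayes_predict(np.array([1.5, 1.5, 1.5]), X, Y))       # likely 1
```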

• Naive Bayes is closely related to generalized additive models. Under the naive Bayes model,
$$\begin{aligned}
\log\left(\frac{\Pr(Y = 1 \mid X)}{\Pr(Y = 0 \mid X)}\right) &= \log\left(\frac{\pi f_1(X)}{(1 - \pi) f_0(X)}\right) \\
&= \log\left(\frac{\pi \prod_{j=1}^{d} f_{1j}(X_j)}{(1 - \pi)\prod_{j=1}^{d} f_{0j}(X_j)}\right) \\
&= \log\left(\frac{\pi}{1 - \pi}\right) + \sum_{j=1}^{d} \log\left(\frac{f_{1j}(X_j)}{f_{0j}(X_j)}\right) \\
&= \beta_0 + \sum_{j=1}^{d} g_j(X_j),
\end{aligned}$$
which has the form of a generalized additive model.

Example. Figure 4 shows an artificial data set with two covariates x1 and x2. Figure 4 (middle) shows kernel density estimators of f̂1(x1), f̂1(x2), f̂0(x1), f̂0(x2). The bottom left plot shows the resulting naive Bayes decision boundary. The bottom right plot shows the predictions from a GAM model. Clearly, this is similar to the naive Bayes model. The GAM model has an error rate of 0.03. In contrast, a linear model yields a classifier with an error rate of 0.78.

Figure 4: Top: artificial data; middle: kernel density estimates; bottom: naive Bayes and GAM classifiers
