
CSCE 970 Lecture 3:
Linear Classifiers

Stephen D. Scott

January 21, 2003

1


Introduction

• Sometimes probabilistic information unavailable or mathematically intractable

• Many alternatives to Bayesian classification, but optimality guarantee may be compromised!

• Linear classifiers use a decision hyperplane to perform classification

• Simple and efficient to train and use

• Optimality requires linear separability of classes

[Figure: two linearly separable classes (Class A and Class B), an unclassified point, and the decision line separating them]

2

Linear Discriminant Functions

• Let w = [w1, . . . , wℓ]^T be a weight vector and w0 (a.k.a. θ) be a threshold

• Decision surface is a hyperplane:

      w^T · x + w0 = 0

• E.g. predict ω2 if Σ_{i=1}^ℓ wi xi > w0, otherwise predict ω1

[Figure: a linear threshold unit — inputs x1, . . . , xℓ with weights w1, . . . , wℓ, plus a constant input 1 with weight w0, feed the sum Σ_i wi xi; output y(t) = 1 if sum > 0 (ω1), y(t) = 0 otherwise (ω2); may also use +1 and −1]

• Focus of this lecture: How to find the wi's

  – Perceptron algorithm

  – Winnow

  – Least squares methods (if classes not linearly separable)

3


The Perceptron Algorithm

• Assume linear separability, i.e. ∃ w* s.t.

      w*^T · x > 0  ∀ x ∈ ω1
      w*^T · x ≤ 0  ∀ x ∈ ω2

  (w0* is included in w*)

• So ∃ deterministic function classifying vectors (contrary to Ch. 2 assumptions)

• Given actual label y(t) for trial t, update weights (see the sketch below):

      w(t + 1) = w(t) + ρ (y(t) − ŷ(t)) x(t)

  – ρ > 0 is learning rate

  – (y(t) − ŷ(t)) moves weights toward correct prediction for x(t)

4
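The following is a minimal Python/NumPy sketch of the perceptron update above; it is not part of the original slides. The learning rate rho, the epoch limit, and the toy data are illustrative choices, and the threshold w0 is folded into w by appending a constant 1 to each input.

```python
import numpy as np

def perceptron_train(X, y, rho=1.0, max_epochs=100):
    """Online perceptron: w(t+1) = w(t) + rho * (y(t) - yhat(t)) * x(t)."""
    X = np.hstack([X, np.ones((X.shape[0], 1))])   # fold threshold w0 into w
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):                    # cycle through the vectors
        mistakes = 0
        for x_t, y_t in zip(X, y):
            y_hat = 1 if w @ x_t > 0 else 0        # threshold the weighted sum
            if y_hat != y_t:
                w += rho * (y_t - y_hat) * x_t     # move toward correct prediction
                mistakes += 1
        if mistakes == 0:                          # no errors: done (separable case)
            break
    return w

# Toy usage: two linearly separable classes in the plane
X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, 0, 0])
w = perceptron_train(X, y)
```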
The Perceptron Algorithm
Example

[Figure: in the (x1, x2) plane with w0 = 0, the current decision line for w(t) predicts y(t) = 0 on the example x(t), whose true label is y(t) = 1; the update produces w(t + 1), whose new decision line is closer to the optimal decision line given by w*, separating ω1 from ω2]

5


The Perceptron Algorithm
Intuition

• Compromise between correctiveness and conservativeness

  – Correctiveness: Tendency to improve on x(t) if prediction error made

  – Conservativeness: Tendency to keep w(t + 1) close to w(t)

• Use cost function that measures both:

      U(w) = ‖w(t + 1) − w(t)‖₂²  [conserv.]  +  η (y(t) − w(t + 1) · x(t))²  [corrective]

           = Σ_{i=1}^ℓ (wi(t + 1) − wi(t))²  +  η (y(t) − Σ_{i=1}^ℓ wi(t + 1) xi(t))²

6
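As an aside (not in the slides), a tiny Python helper that evaluates the two terms of U(w) above for a candidate w(t + 1); η and the example vectors are arbitrary illustrative values.

```python
import numpy as np

def perceptron_cost(w_next, w_t, x_t, y_t, eta):
    """U(w) = ||w(t+1) - w(t)||^2 + eta * (y(t) - w(t+1) . x(t))^2."""
    conservative = np.sum((w_next - w_t) ** 2)     # stay close to w(t)
    corrective = eta * (y_t - w_next @ x_t) ** 2   # fit the current example
    return conservative + corrective

# Example values (illustrative): the gradient-style update lowers the corrective term
w_t = np.array([0.5, -0.2, 0.0])
x_t = np.array([2.0, 2.0, 1.0])
y_t, eta = 1.0, 0.1
w_next = w_t + eta * (y_t - w_t @ x_t) * x_t
print(perceptron_cost(w_next, w_t, x_t, y_t, eta))
```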

The Perceptron Algorithm
Intuition (cont'd)

• Take gradient w.r.t. w(t + 1) and set to 0:

      0 = 2 (wi(t + 1) − wi(t)) − 2η (y(t) − Σ_{j=1}^ℓ wj(t + 1) xj(t)) xi(t)

• Approximate with

      0 = 2 (wi(t + 1) − wi(t)) − 2η (y(t) − Σ_{j=1}^ℓ wj(t) xj(t)) xi(t),

  which yields

      wi(t + 1) = wi(t) + η (y(t) − Σ_{j=1}^ℓ wj(t) xj(t)) xi(t)

• Applying threshold to summation yields

      wi(t + 1) = wi(t) + η (y(t) − ŷ(t)) xi(t)

7


The Perceptron Algorithm
Miscellany

• If classes linearly separable, then by cycling through vectors, guaranteed to converge in finite number of steps

• For real-valued output, can replace threshold function on sum with

  – Identity function: f(x) = x

  – Sigmoid function: e.g. f(x) = 1/(1 + exp(−ax))

  – Hyperbolic tangent: e.g. f(x) = c tanh(ax)

8
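A small sketch (not in the slides) of the alternative output functions listed above applied to the weighted sum; the parameter values a and c are arbitrary examples.

```python
import numpy as np

def identity(s):
    return s                                # f(x) = x

def sigmoid(s, a=1.0):
    return 1.0 / (1.0 + np.exp(-a * s))     # f(x) = 1/(1 + exp(-ax))

def tanh_output(s, a=1.0, c=1.0):
    return c * np.tanh(a * s)               # f(x) = c tanh(ax)

# Real-valued outputs for one weighted sum (threshold folded into w)
w = np.array([0.4, -0.3, 0.1])
x = np.array([2.0, 2.0, 1.0])
s = w @ x
print(identity(s), sigmoid(s, a=2.0), tanh_output(s, a=2.0, c=1.0))
```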
Winnow/Exponentiated Gradient

[Figure: the same linear threshold unit as before — inputs x1, . . . , xℓ with weights w1, . . . , wℓ and a constant input 1 with weight w0; y(t) = 1 if sum > 0 (ω1), y(t) = 0 otherwise (ω2); may also use +1 and −1]

• Same as Perceptron, but update weights:

      wi(t + 1) = wi(t) exp(−2η (ŷ(t) − y(t)) xi(t))

• If y(t), ŷ(t) ∈ {0, 1} ∀ t, then set η = (ln α)/2 (α > 1) and get Winnow (see the sketch below):

      wi(t + 1) = wi(t)/α^{xi(t)}   if ŷ(t) = 1, y(t) = 0
      wi(t + 1) = wi(t) α^{xi(t)}   if ŷ(t) = 0, y(t) = 1
      wi(t + 1) = wi(t)             if ŷ(t) = y(t)

9


Winnow/Exponentiated Gradient
Intuition

• Measure distance in cost function with unnormalized relative entropy:

      U(w) = Σ_{i=1}^ℓ ( wi(t) − wi(t + 1) + wi(t + 1) ln(wi(t + 1)/wi(t)) )  [conserv.]
             + η (y(t) − w(t + 1) · x(t))²  [corrective]

• Take gradient w.r.t. w(t + 1) and set to 0:

      0 = ln(wi(t + 1)/wi(t)) − 2η (y(t) − Σ_{j=1}^ℓ wj(t + 1) xj(t)) xi(t)

• Approximate with

      0 = ln(wi(t + 1)/wi(t)) − 2η (y(t) − Σ_{j=1}^ℓ wj(t) xj(t)) xi(t),

  which yields

      wi(t + 1) = wi(t) exp(−2η (ŷ(t) − y(t)) xi(t))

10
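A minimal Python sketch (not in the slides) of the Winnow special case from Slide 9 for Boolean attributes; α = 2, the separate threshold θ = ℓ/2, the all-ones initialization, and the toy disjunction data are illustrative choices (the slides instead fold the threshold into w).

```python
import numpy as np

def winnow_train(X, y, alpha=2.0, theta=None, max_epochs=100):
    """Winnow: multiply w_i by alpha**x_i on a false negative, divide on a
    false positive, leave unchanged when the prediction is correct."""
    n, l = X.shape
    theta = l / 2.0 if theta is None else theta    # example threshold choice
    w = np.ones(l)                                 # positive initial weights
    for _ in range(max_epochs):
        mistakes = 0
        for x_t, y_t in zip(X, y):
            y_hat = 1 if w @ x_t > theta else 0
            if y_hat == 1 and y_t == 0:
                w /= alpha ** x_t                  # demote active weights
            elif y_hat == 0 and y_t == 1:
                w *= alpha ** x_t                  # promote active weights
            mistakes += int(y_hat != y_t)
        if mistakes == 0:
            break
    return w

# Toy usage: labels given by the disjunction x_1 OR x_3 over 5 Boolean attributes
X = np.array([[1, 0, 0, 0, 0], [0, 0, 1, 0, 1], [0, 1, 0, 1, 0], [0, 0, 0, 0, 1]], dtype=float)
y = np.array([1, 1, 0, 0])
w = winnow_train(X, y)
```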

Winnow/Exponentiated Gradient
Negative Weights

• Winnow and EG update weights by multiplying by a positive constant: impossible to change sign

  – Weight vectors restricted to one quadrant

• Solution: Maintain weight vectors w+(t) and w−(t)

  – Predict ŷ(t) = (w+(t) − w−(t)) · x(t)

  – Update (see the sketch below):

        ri+(t) = exp(−2η (ŷ(t) − y(t)) xi(t) U)
        ri−(t) = 1/ri+(t)

        wi+(t + 1) = U · ( wi+(t) ri+(t) ) / ( Σ_{j=1}^ℓ ( wj+(t) rj+(t) + wj−(t) rj−(t) ) )

  – U and the denominator normalize weights for proof of error bound

Kivinen & Warmuth, "Additive Versus Exponentiated Gradient Updates for Linear Prediction." Information and Computation, 132(1):1–64, Jan. 1997. [see web page]

11


Winnow/Exponentiated Gradient
Miscellany

• Winnow and EG are multiplicative weight update schemes versus additive weight update schemes, e.g. Perceptron

• Winnow and EG work well when most attributes (features) are irrelevant, i.e. optimal weight vector w* is sparse (many 0 entries)

• E.g. xi ∈ {0, 1}, x's are labelled by a monotone k-disjunction over ℓ attributes, k ≪ ℓ

  – Remaining ℓ − k attributes are irrelevant

  – E.g. x5 ∨ x9 ∨ x12, ℓ = 150, k = 3

  – For disjunctions, number of on-line prediction mistakes is O(k log ℓ) for Winnow and worst-case Ω(kℓ) for Perceptron

  – So in worst case, Winnow needs exponentially fewer updates for training than Perceptron

• Other bounds exist for real-valued inputs and outputs

12
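A minimal sketch (not in the slides) of one EG± trial from the update on Slide 11; the slides only write out the wi+ update, so the symmetric wi− update is an assumption here, and η, U, the uniform initialization, and the toy vector are illustrative.

```python
import numpy as np

def eg_pm_update(w_pos, w_neg, x, y, eta=0.1, U=1.0):
    """One EG+/- trial: predict with w+ - w-, then multiplicatively update
    and renormalize both vectors so the total weight stays U."""
    y_hat = (w_pos - w_neg) @ x
    r_pos = np.exp(-2.0 * eta * (y_hat - y) * x * U)
    r_neg = 1.0 / r_pos
    denom = np.sum(w_pos * r_pos + w_neg * r_neg)
    w_pos_new = U * w_pos * r_pos / denom
    w_neg_new = U * w_neg * r_neg / denom      # symmetric update (assumption)
    return w_pos_new, w_neg_new, y_hat

# Toy usage: uniform positive/negative weights over 4 attributes, total weight U = 1
l = 4
w_pos = np.full(l, 1.0 / (2 * l))
w_neg = np.full(l, 1.0 / (2 * l))
x = np.array([1.0, 0.0, 1.0, 0.0])
w_pos, w_neg, y_hat = eg_pm_update(w_pos, w_neg, x, y=1.0)
```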
Non-Linearly Separable Classes

• What if no hyperplane completely separates the classes?

[Figure: two overlapping classes (Class A and Class B) with the optimal decision line between them]

• Add extra inputs that are nonlinear combinations of original inputs (Section 4.14); see the sketch below

  – E.g. attribs. x1 and x2, so try

        x = [x1, x2, x1x2, x1², x2², x1²x2, x1x2², x1³, x2³]^T

  – Perhaps classes linearly separable in new feature space

  – Useful, especially with Winnow/EG logarithmic bounds

  – Kernel functions/SVMs

• Pocket algorithm (p. 63) guarantees convergence to a best hyperplane

• Winnow's & EG's agnostic results

• Least squares methods (Sec. 3.4)

• Networks of classifiers (Ch. 4)

13


Non-Linearly Separable Classes
Winnow's Agnostic Results

• Winnow's total number of prediction mistakes/loss (in on-line setting) provably not much worse than that of best linear classifier

• Loss bound related to performance of best classifier and total distance under ‖·‖₁ that feature vectors must be moved to make best classifier perfect [Littlestone, COLT '91]

• Similar bounds for EG [Kivinen & Warmuth]

14
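A minimal sketch (not in the slides) of the degree-3 feature expansion from Slide 13 for two attributes; the function name is illustrative.

```python
import numpy as np

def expand_features(x1, x2):
    """Map (x1, x2) to [x1, x2, x1*x2, x1^2, x2^2, x1^2*x2, x1*x2^2, x1^3, x2^3]."""
    return np.array([x1, x2, x1 * x2, x1 ** 2, x2 ** 2,
                     x1 ** 2 * x2, x1 * x2 ** 2, x1 ** 3, x2 ** 3])

# Classes that are not linearly separable in (x1, x2) may become separable in the
# expanded space, so a linear classifier can be trained on expand_features(x1, x2).
z = expand_features(2.0, -1.0)
```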

Non-Linearly Separable Classes
Least Squares Methods

• Recall from Slide 7:

      wi(t + 1) = wi(t) + η (y(t) − Σ_{j=1}^ℓ wj(t) xj(t)) xi(t)
                = wi(t) + η (y(t) − w(t)^T · x(t)) xi(t)

• If we don't threshold the dot product during training and allow η to vary each trial (i.e. substitute ηt), get* Eq. 3.38, p. 69:

      w(t + 1) = w(t) + ηt x(t) (y(t) − w(t)^T · x(t))

• This is the Least Mean Squares (LMS) Algorithm (see the sketch below)

• If e.g. ηt = 1/t, then

      lim_{t→∞} P[w(t) = w*] = 1,

  where

      w* = argmin_{w ∈ ℝ^ℓ} E[(y − w^T · x)²]

  is the vector minimizing mean square error (MSE)

* Note that here w(t) is the weight before trial t. In the book it is the weight after trial t.

15


Multiclass learning
Kessler's Construction

[Figure: three classes ω1, ω2, ω3 in the plane, each with its own decision line (ω1's line, ω2's line, ω3's line); the point [2, 2] belongs to ω1]

• For* x = [2, 2, 1]^T of class ω1, want

      Σ_{i=1}^{ℓ+1} w1i xi > Σ_{i=1}^{ℓ+1} w2i xi   AND   Σ_{i=1}^{ℓ+1} w1i xi > Σ_{i=1}^{ℓ+1} w3i xi

* The extra 1 is added so the threshold can be placed in w.

16
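A minimal sketch (not in the slides) of the LMS update from Slide 15 with ηt = 1/t; the synthetic data stream and noise level are illustrative placeholders.

```python
import numpy as np

def lms(stream, l):
    """LMS: w(t+1) = w(t) + eta_t * x(t) * (y(t) - w(t)^T x(t)), with eta_t = 1/t."""
    w = np.zeros(l)
    for t, (x_t, y_t) in enumerate(stream, start=1):
        eta_t = 1.0 / t
        w = w + eta_t * x_t * (y_t - w @ x_t)   # no thresholding during training
    return w

# Toy usage: noisy linear targets y = w_true . x + noise
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
stream = []
for _ in range(1000):
    x = rng.normal(size=3)
    stream.append((x, w_true @ x + 0.1 * rng.normal()))
w_hat = lms(stream, l=3)
```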
Multiclass learning
Kessler's Construction (cont'd)

• So map x to

      x1 = [2, 2, 1, −2, −2, −1, 0, 0, 0]^T    (orig. | neg | pad)
      x2 = [2, 2, 1, 0, 0, 0, −2, −2, −1]^T

  (all labels = +1) and let

      w = [w11, w12, w10, w21, w22, w20, w31, w32, w30]^T    (blocks w1 | w2 | w3)

• Now if w*^T · x1 > 0 and w*^T · x2 > 0, then

      Σ_{i=1}^{ℓ+1} w*1i xi > Σ_{i=1}^{ℓ+1} w*2i xi   AND   Σ_{i=1}^{ℓ+1} w*1i xi > Σ_{i=1}^{ℓ+1} w*3i xi

• In general, map the (ℓ + 1) × 1 feature vector x to x1, . . . , xM−1, each of size (ℓ + 1)M × 1

• x ∈ ωi ⇒ x in ith block and −x in jth block (rest are 0s). Repeat for all j ≠ i

• Now train to find weights for new vector space via Perceptron, Winnow, etc. (see the sketch below)

17


Multiclass learning
Error-Correcting Output Codes (ECOC)

• Since Winnow & Perceptron learn binary functions, learn individual bits of binary encoding of classes

• E.g. M = 4, so use two linear classifiers:

      Class   Classifier 1   Classifier 2
      ω1      0              0
      ω2      0              1
      ω3      1              0
      ω4      1              1

  and train simultaneously

• Problem: Sensitive to individual classifier errors, so use a set of encodings per class to improve robustness

• Similar to principle of error-correcting output codes used in communication networks [Dietterich & Bakiri, 1995]

• General-purpose, independent of learner

18
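A minimal sketch (not in the slides) of Kessler's construction from Slide 17: given the augmented vector x of class ωi among M classes, it builds the M − 1 expanded vectors (all labelled +1); the function name is illustrative.

```python
import numpy as np

def kessler_vectors(x_aug, class_index, M):
    """For x (with the constant 1 appended) of class i, build M-1 vectors of size
    (l+1)*M with x in block i, -x in block j (j != i), and zeros elsewhere."""
    d = len(x_aug)
    vectors = []
    for j in range(M):
        if j == class_index:
            continue
        z = np.zeros(d * M)
        z[class_index * d:(class_index + 1) * d] = x_aug    # ith block gets x
        z[j * d:(j + 1) * d] = -x_aug                       # jth block gets -x
        vectors.append(z)
    return vectors

# Toy usage: x = [2, 2, 1]^T of class omega_1 (index 0) with M = 3 classes
x_aug = np.array([2.0, 2.0, 1.0])
x1, x2 = kessler_vectors(x_aug, class_index=0, M=3)
# x1 == [2, 2, 1, -2, -2, -1, 0, 0, 0], x2 == [2, 2, 1, 0, 0, 0, -2, -2, -1]
```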

Topic summary due in 1 week!

19
