
CSCE 970 Lecture 3:
Linear Classifiers

Stephen D. Scott

January 21, 2003

1


Introduction

• Sometimes probabilistic information unavailable or mathematically intractable

• Many alternatives to Bayesian classification, but optimality guarantee may be compromised!

• Linear classifiers use a decision hyperplane to perform classification

• Simple and efficient to train and use

• Optimality requires linear separability of classes

[Figure: two linearly separable classes (Class A and Class B), an unclassified point, and the decision line separating them]

2

Linear Discriminant Functions

• Let w = [w1, . . . , wℓ]^T be a weight vector and w0 (a.k.a. θ) be a threshold

• Decision surface is a hyperplane:

      w^T · x + w0 = 0

• E.g. predict ω2 if Σ_{i=1}^ℓ wi xi > w0, otherwise predict ω1

[Figure: a linear threshold unit — inputs x1, . . . , xℓ with weights w1, . . . , wℓ, plus a constant input 1 with weight w0, feed the sum Σ_i wi xi; output y(t) = 1 if sum > 0 (ω1), y(t) = 0 otherwise (ω2); may also use +1 and −1]

• Focus of this lecture: How to find the wi's

  – Perceptron algorithm

  – Winnow

  – Least squares methods (if classes not linearly separable)

3


The Perceptron Algorithm

• Assume linear separability, i.e. ∃ w* s.t.

      w*^T · x > 0  ∀ x ∈ ω1
      w*^T · x ≤ 0  ∀ x ∈ ω2

  (w0* is included in w*)

• So ∃ deterministic function classifying vectors (contrary to Ch. 2 assumptions)

• Given actual label y(t) for trial t, update weights (see the sketch below):

      w(t + 1) = w(t) + ρ (y(t) − ŷ(t)) x(t)

  – ρ > 0 is learning rate

  – (y(t) − ŷ(t)) moves weights toward correct prediction for x(t)

4
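The following is a minimal Python/NumPy sketch of the perceptron update above; it is not part of the original slides. The learning rate rho, the epoch limit, and the toy data are illustrative choices, and the threshold w0 is folded into w by appending a constant 1 to each input.

```python
import numpy as np

def perceptron_train(X, y, rho=1.0, max_epochs=100):
    """Online perceptron: w(t+1) = w(t) + rho * (y(t) - yhat(t)) * x(t)."""
    X = np.hstack([X, np.ones((X.shape[0], 1))])   # fold threshold w0 into w
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):                    # cycle through the vectors
        mistakes = 0
        for x_t, y_t in zip(X, y):
            y_hat = 1 if w @ x_t > 0 else 0        # threshold the weighted sum
            if y_hat != y_t:
                w += rho * (y_t - y_hat) * x_t     # move toward correct prediction
                mistakes += 1
        if mistakes == 0:                          # no errors: done (separable case)
            break
    return w

# Toy usage: two linearly separable classes in the plane
X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, 0, 0])
w = perceptron_train(X, y)
```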
The Perceptron Algorithm
Example

[Figure: in the (x1, x2) plane with w0 = 0, the current decision line for w(t) predicts y(t) = 0 on the example x(t), whose true label is y(t) = 1; the update produces w(t + 1), whose new decision line is closer to the optimal decision line given by w*, separating ω1 from ω2]

5


The Perceptron Algorithm
Intuition

• Compromise between correctiveness and conservativeness

  – Correctiveness: Tendency to improve on x(t) if prediction error made

  – Conservativeness: Tendency to keep w(t + 1) close to w(t)

• Use cost function that measures both:

      U(w) = ‖w(t + 1) − w(t)‖₂²  [conserv.]  +  η (y(t) − w(t + 1) · x(t))²  [corrective]

           = Σ_{i=1}^ℓ (wi(t + 1) − wi(t))²  +  η (y(t) − Σ_{i=1}^ℓ wi(t + 1) xi(t))²

6
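As an aside (not in the slides), a tiny Python helper that evaluates the two terms of U(w) above for a candidate w(t + 1); η and the example vectors are arbitrary illustrative values.

```python
import numpy as np

def perceptron_cost(w_next, w_t, x_t, y_t, eta):
    """U(w) = ||w(t+1) - w(t)||^2 + eta * (y(t) - w(t+1) . x(t))^2."""
    conservative = np.sum((w_next - w_t) ** 2)     # stay close to w(t)
    corrective = eta * (y_t - w_next @ x_t) ** 2   # fit the current example
    return conservative + corrective

# Example values (illustrative): the gradient-style update lowers the corrective term
w_t = np.array([0.5, -0.2, 0.0])
x_t = np.array([2.0, 2.0, 1.0])
y_t, eta = 1.0, 0.1
w_next = w_t + eta * (y_t - w_t @ x_t) * x_t
print(perceptron_cost(w_next, w_t, x_t, y_t, eta))
```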

The Perceptron Algorithm
Intuition (cont'd)

• Take gradient w.r.t. w(t + 1) and set to 0:

      0 = 2 (wi(t + 1) − wi(t)) − 2η (y(t) − Σ_{j=1}^ℓ wj(t + 1) xj(t)) xi(t)

• Approximate with

      0 = 2 (wi(t + 1) − wi(t)) − 2η (y(t) − Σ_{j=1}^ℓ wj(t) xj(t)) xi(t),

  which yields

      wi(t + 1) = wi(t) + η (y(t) − Σ_{j=1}^ℓ wj(t) xj(t)) xi(t)

• Applying threshold to summation yields

      wi(t + 1) = wi(t) + η (y(t) − ŷ(t)) xi(t)

7


The Perceptron Algorithm
Miscellany

• If classes linearly separable, then by cycling through vectors, guaranteed to converge in finite number of steps

• For real-valued output, can replace threshold function on sum with

  – Identity function: f(x) = x

  – Sigmoid function: e.g. f(x) = 1/(1 + exp(−ax))

  – Hyperbolic tangent: e.g. f(x) = c tanh(ax)

8
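A small sketch (not in the slides) of the alternative output functions listed above applied to the weighted sum; the parameter values a and c are arbitrary examples.

```python
import numpy as np

def identity(s):
    return s                                # f(x) = x

def sigmoid(s, a=1.0):
    return 1.0 / (1.0 + np.exp(-a * s))     # f(x) = 1/(1 + exp(-ax))

def tanh_output(s, a=1.0, c=1.0):
    return c * np.tanh(a * s)               # f(x) = c tanh(ax)

# Real-valued outputs for one weighted sum (threshold folded into w)
w = np.array([0.4, -0.3, 0.1])
x = np.array([2.0, 2.0, 1.0])
s = w @ x
print(identity(s), sigmoid(s, a=2.0), tanh_output(s, a=2.0, c=1.0))
```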
Winnow/Exponentiated Gradient

[Figure: the same linear threshold unit as before — inputs x1, . . . , xℓ with weights w1, . . . , wℓ and a constant input 1 with weight w0; y(t) = 1 if sum > 0 (ω1), y(t) = 0 otherwise (ω2); may also use +1 and −1]

• Same as Perceptron, but update weights:

      wi(t + 1) = wi(t) exp(−2η (ŷ(t) − y(t)) xi(t))

• If y(t), ŷ(t) ∈ {0, 1} ∀ t, then set η = (ln α)/2 (α > 1) and get Winnow (see the sketch below):

      wi(t + 1) = wi(t)/α^{xi(t)}   if ŷ(t) = 1, y(t) = 0
      wi(t + 1) = wi(t) α^{xi(t)}   if ŷ(t) = 0, y(t) = 1
      wi(t + 1) = wi(t)             if ŷ(t) = y(t)

9


Winnow/Exponentiated Gradient
Intuition

• Measure distance in cost function with unnormalized relative entropy:

      U(w) = Σ_{i=1}^ℓ ( wi(t) − wi(t + 1) + wi(t + 1) ln(wi(t + 1)/wi(t)) )  [conserv.]
             + η (y(t) − w(t + 1) · x(t))²  [corrective]

• Take gradient w.r.t. w(t + 1) and set to 0:

      0 = ln(wi(t + 1)/wi(t)) − 2η (y(t) − Σ_{j=1}^ℓ wj(t + 1) xj(t)) xi(t)

• Approximate with

      0 = ln(wi(t + 1)/wi(t)) − 2η (y(t) − Σ_{j=1}^ℓ wj(t) xj(t)) xi(t),

  which yields

      wi(t + 1) = wi(t) exp(−2η (ŷ(t) − y(t)) xi(t))

10
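A minimal Python sketch (not in the slides) of the Winnow special case from Slide 9 for Boolean attributes; α = 2, the separate threshold θ = ℓ/2, the all-ones initialization, and the toy disjunction data are illustrative choices (the slides instead fold the threshold into w).

```python
import numpy as np

def winnow_train(X, y, alpha=2.0, theta=None, max_epochs=100):
    """Winnow: multiply w_i by alpha**x_i on a false negative, divide on a
    false positive, leave unchanged when the prediction is correct."""
    n, l = X.shape
    theta = l / 2.0 if theta is None else theta    # example threshold choice
    w = np.ones(l)                                 # positive initial weights
    for _ in range(max_epochs):
        mistakes = 0
        for x_t, y_t in zip(X, y):
            y_hat = 1 if w @ x_t > theta else 0
            if y_hat == 1 and y_t == 0:
                w /= alpha ** x_t                  # demote active weights
            elif y_hat == 0 and y_t == 1:
                w *= alpha ** x_t                  # promote active weights
            mistakes += int(y_hat != y_t)
        if mistakes == 0:
            break
    return w

# Toy usage: labels given by the disjunction x_1 OR x_3 over 5 Boolean attributes
X = np.array([[1, 0, 0, 0, 0], [0, 0, 1, 0, 1], [0, 1, 0, 1, 0], [0, 0, 0, 0, 1]], dtype=float)
y = np.array([1, 1, 0, 0])
w = winnow_train(X, y)
```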

Winnow/Exponentiated Gradient
Negative Weights

• Winnow and EG update weights by multiplying by a positive constant: impossible to change sign

  – Weight vectors restricted to one quadrant

• Solution: Maintain weight vectors w+(t) and w−(t)

  – Predict ŷ(t) = (w+(t) − w−(t)) · x(t)

  – Update (see the sketch below):

        ri+(t) = exp(−2η (ŷ(t) − y(t)) xi(t) U)
        ri−(t) = 1/ri+(t)

        wi+(t + 1) = U · ( wi+(t) ri+(t) ) / ( Σ_{j=1}^ℓ ( wj+(t) rj+(t) + wj−(t) rj−(t) ) )

  – U and the denominator normalize weights for proof of error bound

Kivinen & Warmuth, "Additive Versus Exponentiated Gradient Updates for Linear Prediction." Information and Computation, 132(1):1–64, Jan. 1997. [see web page]

11


Winnow/Exponentiated Gradient
Miscellany

• Winnow and EG are multiplicative weight update schemes versus additive weight update schemes, e.g. Perceptron

• Winnow and EG work well when most attributes (features) are irrelevant, i.e. optimal weight vector w* is sparse (many 0 entries)

• E.g. xi ∈ {0, 1}, x's are labelled by a monotone k-disjunction over ℓ attributes, k ≪ ℓ

  – Remaining ℓ − k attributes are irrelevant

  – E.g. x5 ∨ x9 ∨ x12, ℓ = 150, k = 3

  – For disjunctions, number of on-line prediction mistakes is O(k log ℓ) for Winnow and worst-case Ω(kℓ) for Perceptron

  – So in worst case, Winnow needs exponentially fewer updates for training than Perceptron

• Other bounds exist for real-valued inputs and outputs

12
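A minimal sketch (not in the slides) of one EG± trial from the update on Slide 11; the slides only write out the wi+ update, so the symmetric wi− update is an assumption here, and η, U, the uniform initialization, and the toy vector are illustrative.

```python
import numpy as np

def eg_pm_update(w_pos, w_neg, x, y, eta=0.1, U=1.0):
    """One EG+/- trial: predict with w+ - w-, then multiplicatively update
    and renormalize both vectors so the total weight stays U."""
    y_hat = (w_pos - w_neg) @ x
    r_pos = np.exp(-2.0 * eta * (y_hat - y) * x * U)
    r_neg = 1.0 / r_pos
    denom = np.sum(w_pos * r_pos + w_neg * r_neg)
    w_pos_new = U * w_pos * r_pos / denom
    w_neg_new = U * w_neg * r_neg / denom      # symmetric update (assumption)
    return w_pos_new, w_neg_new, y_hat

# Toy usage: uniform positive/negative weights over 4 attributes, total weight U = 1
l = 4
w_pos = np.full(l, 1.0 / (2 * l))
w_neg = np.full(l, 1.0 / (2 * l))
x = np.array([1.0, 0.0, 1.0, 0.0])
w_pos, w_neg, y_hat = eg_pm_update(w_pos, w_neg, x, y=1.0)
```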
Non-Linearly Separable Classes

• What if no hyperplane completely separates the classes?

[Figure: two overlapping classes (Class A and Class B) with the optimal decision line between them]

• Add extra inputs that are nonlinear combinations of original inputs (Section 4.14); see the sketch below

  – E.g. attribs. x1 and x2, so try

        x = [x1, x2, x1x2, x1², x2², x1²x2, x1x2², x1³, x2³]^T

  – Perhaps classes linearly separable in new feature space

  – Useful, especially with Winnow/EG logarithmic bounds

  – Kernel functions/SVMs

• Pocket algorithm (p. 63) guarantees convergence to a best hyperplane

• Winnow's & EG's agnostic results

• Least squares methods (Sec. 3.4)

• Networks of classifiers (Ch. 4)

13


Non-Linearly Separable Classes
Winnow's Agnostic Results

• Winnow's total number of prediction mistakes/loss (in on-line setting) provably not much worse than that of best linear classifier

• Loss bound related to performance of best classifier and total distance under ‖·‖₁ that feature vectors must be moved to make best classifier perfect [Littlestone, COLT '91]

• Similar bounds for EG [Kivinen & Warmuth]

14
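A minimal sketch (not in the slides) of the degree-3 feature expansion from Slide 13 for two attributes; the function name is illustrative.

```python
import numpy as np

def expand_features(x1, x2):
    """Map (x1, x2) to [x1, x2, x1*x2, x1^2, x2^2, x1^2*x2, x1*x2^2, x1^3, x2^3]."""
    return np.array([x1, x2, x1 * x2, x1 ** 2, x2 ** 2,
                     x1 ** 2 * x2, x1 * x2 ** 2, x1 ** 3, x2 ** 3])

# Classes that are not linearly separable in (x1, x2) may become separable in the
# expanded space, so a linear classifier can be trained on expand_features(x1, x2).
z = expand_features(2.0, -1.0)
```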

Non-Linearly Separable Classes
Least Squares Methods

• Recall from Slide 7:

      wi(t + 1) = wi(t) + η (y(t) − Σ_{j=1}^ℓ wj(t) xj(t)) xi(t)
                = wi(t) + η (y(t) − w(t)^T · x(t)) xi(t)

• If we don't threshold the dot product during training and allow η to vary each trial (i.e. substitute ηt), get* Eq. 3.38, p. 69:

      w(t + 1) = w(t) + ηt x(t) (y(t) − w(t)^T · x(t))

• This is the Least Mean Squares (LMS) Algorithm (see the sketch below)

• If e.g. ηt = 1/t, then

      lim_{t→∞} P[w(t) = w*] = 1,

  where

      w* = argmin_{w ∈ ℝ^ℓ} E[(y − w^T · x)²]

  is the vector minimizing mean square error (MSE)

* Note that here w(t) is the weight before trial t. In the book it is the weight after trial t.

15


Multiclass learning
Kessler's Construction

[Figure: three classes ω1, ω2, ω3 in the plane, each with its own decision line (ω1's line, ω2's line, ω3's line); the point [2, 2] belongs to ω1]

• For* x = [2, 2, 1]^T of class ω1, want

      Σ_{i=1}^{ℓ+1} w1i xi > Σ_{i=1}^{ℓ+1} w2i xi   AND   Σ_{i=1}^{ℓ+1} w1i xi > Σ_{i=1}^{ℓ+1} w3i xi

* The extra 1 is added so the threshold can be placed in w.

16
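A minimal sketch (not in the slides) of the LMS update from Slide 15 with ηt = 1/t; the synthetic data stream and noise level are illustrative placeholders.

```python
import numpy as np

def lms(stream, l):
    """LMS: w(t+1) = w(t) + eta_t * x(t) * (y(t) - w(t)^T x(t)), with eta_t = 1/t."""
    w = np.zeros(l)
    for t, (x_t, y_t) in enumerate(stream, start=1):
        eta_t = 1.0 / t
        w = w + eta_t * x_t * (y_t - w @ x_t)   # no thresholding during training
    return w

# Toy usage: noisy linear targets y = w_true . x + noise
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
stream = []
for _ in range(1000):
    x = rng.normal(size=3)
    stream.append((x, w_true @ x + 0.1 * rng.normal()))
w_hat = lms(stream, l=3)
```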
Multiclass learning
Kessler's Construction (cont'd)

• So map x to

      x1 = [2, 2, 1, −2, −2, −1, 0, 0, 0]^T    (orig. | neg | pad)
      x2 = [2, 2, 1, 0, 0, 0, −2, −2, −1]^T

  (all labels = +1) and let

      w = [w11, w12, w10, w21, w22, w20, w31, w32, w30]^T    (blocks w1 | w2 | w3)

• Now if w*^T · x1 > 0 and w*^T · x2 > 0, then

      Σ_{i=1}^{ℓ+1} w*1i xi > Σ_{i=1}^{ℓ+1} w*2i xi   AND   Σ_{i=1}^{ℓ+1} w*1i xi > Σ_{i=1}^{ℓ+1} w*3i xi

• In general, map the (ℓ + 1) × 1 feature vector x to x1, . . . , xM−1, each of size (ℓ + 1)M × 1

• x ∈ ωi ⇒ x in ith block and −x in jth block (rest are 0s). Repeat for all j ≠ i

• Now train to find weights for new vector space via Perceptron, Winnow, etc. (see the sketch below)

17


Multiclass learning
Error-Correcting Output Codes (ECOC)

• Since Winnow & Perceptron learn binary functions, learn individual bits of binary encoding of classes

• E.g. M = 4, so use two linear classifiers:

      Class   Classifier 1   Classifier 2
      ω1      0              0
      ω2      0              1
      ω3      1              0
      ω4      1              1

  and train simultaneously

• Problem: Sensitive to individual classifier errors, so use a set of encodings per class to improve robustness

• Similar to principle of error-correcting output codes used in communication networks [Dietterich & Bakiri, 1995]

• General-purpose, independent of learner

18
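A minimal sketch (not in the slides) of Kessler's construction from Slide 17: given the augmented vector x of class ωi among M classes, it builds the M − 1 expanded vectors (all labelled +1); the function name is illustrative.

```python
import numpy as np

def kessler_vectors(x_aug, class_index, M):
    """For x (with the constant 1 appended) of class i, build M-1 vectors of size
    (l+1)*M with x in block i, -x in block j (j != i), and zeros elsewhere."""
    d = len(x_aug)
    vectors = []
    for j in range(M):
        if j == class_index:
            continue
        z = np.zeros(d * M)
        z[class_index * d:(class_index + 1) * d] = x_aug    # ith block gets x
        z[j * d:(j + 1) * d] = -x_aug                       # jth block gets -x
        vectors.append(z)
    return vectors

# Toy usage: x = [2, 2, 1]^T of class omega_1 (index 0) with M = 3 classes
x_aug = np.array([2.0, 2.0, 1.0])
x1, x2 = kessler_vectors(x_aug, class_index=0, M=3)
# x1 == [2, 2, 1, -2, -2, -1, 0, 0, 0], x2 == [2, 2, 1, 0, 0, 0, -2, -2, -1]
```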

Topic summary due in 1 week!

19
