
Statistical Machine Learning (BE4M33SSU)

Lecture 5: Structured Output Support Vector Machines


Czech Technical University in Prague
V. Franc

BE4M33SSU – Statistical Machine Learning, Winter 2022


Linear classifier
2/15

X . . . set of observations


Y . . . finite set of hidden states




φ : X × Y → Rn . . . joint feature map




w ∈ Rn . . . vector of parameters


Generic linear classifier h : X → Y




h(x; w) ∈ Argmax_{y∈Y} ⟨w, φ(x, y)⟩
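
A minimal sketch of this prediction rule, assuming a finite output set that can be enumerated and a joint feature map phi(x, y) returning a numpy vector (the names predict, phi and labels are illustrative, not from the slides):

```python
import numpy as np

def predict(x, w, phi, labels):
    """Generic linear classifier: h(x; w) = argmax over y in labels of <w, phi(x, y)>."""
    return max(labels, key=lambda y: w @ phi(x, y))
```
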
Example: two-class linear classifier
3/15

X = Rn . . . set of observations


Y = {+1, −1} . . . binary labels




Two-class linear classifier h : Rn → {−1, +1}





h(x; u, v) = sign(⟨u, x⟩ + v) = { +1 if ⟨u, x⟩ + v ≥ 0
                                  −1 if ⟨u, x⟩ + v < 0 }

We can write the two-class classifier as




h(x; w) ∈ Argmax_{y∈{−1,+1}} y(⟨u, x⟩ + v) = Argmax_{y∈{−1,+1}} ⟨w, φ(x, y)⟩

where φ : Rn × {−1, +1} → Rn+1

φ(x, y) = y(x, 1) and w = (u, v)
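
The reduction can be checked directly: with φ(x, y) = y · (x, 1) and w = (u, v), the generic argmax reproduces the sign rule. A small sketch with illustrative numbers:

```python
import numpy as np

def phi_binary(x, y):
    """Joint feature map phi(x, y) = y * (x, 1) for Y = {+1, -1}."""
    return y * np.append(x, 1.0)

u, v = np.array([1.0, -2.0]), 0.5            # illustrative parameters
w = np.append(u, v)                          # w = (u, v)
x = np.array([0.3, 0.1])
y_pred = max([+1, -1], key=lambda y: w @ phi_binary(x, y))
assert y_pred == (1 if u @ x + v >= 0 else -1)
```
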


Example: multi-class linear classifier
4/15
X = Rn . . . set of observations; Y = {1, . . . , Y } . . . set of class labels


Multi-class linear classifier h : X → Y




h(x; w) ∈ Argmax_{y∈Y} ⟨wy, x⟩

where w = (w1, . . . , wY ) ∈ Rn·Y are parameters.


We can write the multi-class classifier as


h(x; w) ∈ Argmax_{y∈Y} ⟨wy, x⟩ = Argmax_{y∈Y} ⟨w, φ(x, y)⟩

where φ : Rn × Y → Rn·Y is

φ(x, y) = (0, . . . , 0, x, 0, . . . , 0)    with x placed in the y-th slot
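
A sketch of this joint feature map, assuming the labels are integers 1, . . . , Y (the function name is illustrative):

```python
import numpy as np

def phi_multiclass(x, y, num_classes):
    """Place x into the y-th block of a zero vector of length n * num_classes."""
    n = len(x)
    out = np.zeros(n * num_classes)
    out[(y - 1) * n : y * n] = x
    return out

# With w = (w_1, ..., w_Y) stacked into one vector,
# <w, phi_multiclass(x, y, Y)> equals <w_y, x>.
```
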
Example: sequence classifier for OCR
5/15
x = (x1, . . . , xL) ∈ I^L . . . sequence of images with characters


y = (y1, . . . , yL) ∈ A^L . . . seq. of chars. from A = {A, . . . , Z}




p(xi | yi) . . . appearance model for characters




p(y1), p(yi | yi−1) . . . language model




Hidden Markov Chain model of the sequences:




p(x1, . . . , xL, y1, . . . , yL) = p(y1) ∏_{i=2}^{L} p(yi | yi−1) ∏_{i=1}^{L} p(xi | yi)

[Figure: chain graphical model y1 → y2 → y3 → y4 with transition factors p(y2 | y1), p(y3 | y2), p(y4 | y3) and emission factors p(x1 | y1), . . . , p(x4 | y4) linking each state yi to its observation xi]
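
A sketch of the joint log-probability under this model, assuming integer-coded states and the (illustrative) tables/functions log_p_init, log_p_trans and log_p_emit:

```python
def hmm_log_joint(x, y, log_p_init, log_p_trans, log_p_emit):
    """log p(x1..xL, y1..yL) = log p(y1) + sum_i log p(yi|yi-1) + sum_i log p(xi|yi)."""
    score = log_p_init[y[0]] + log_p_emit(x[0], y[0])
    for i in range(1, len(y)):
        score += log_p_trans[y[i - 1], y[i]] + log_p_emit(x[i], y[i])
    return score
```
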
Example: sequence classifier for OCR
6/15
The MAP estimate from HMC:


ŷ ∈ Argmax_{y∈A^L} ( log p(y1) + ∑_{i=2}^{L} log p(yi | yi−1) + ∑_{i=1}^{L} log p(xi | yi) )

Let us assume the following parametrization:




log p(y1) = ⟨w, φ(y1)⟩
log p(yi | yi−1) = ⟨w, φ(yi−1, yi)⟩
log p(xi | yi) = ⟨w, φ(xi, yi)⟩

The MAP estimate becomes a linear classifier:




ŷ ∈ Argmax_{(y1,...,yL)∈A^L} ⟨ w, φ(y1) + ∑_{i=2}^{L} φ(yi−1, yi) + ∑_{i=1}^{L} φ(xi, yi) ⟩

where the bracketed sum of feature vectors is the joint feature map φ(x, y).
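
Because the score decomposes over neighboring positions, the Argmax over A^L can be computed exactly by dynamic programming (max-sum / Viterbi). A sketch, assuming the unary and pairwise scores are given as arrays/functions; the names score_init, score_trans and score_emit are illustrative:

```python
import numpy as np

def viterbi_map(x, alphabet, score_init, score_trans, score_emit):
    """MAP sequence for a chain-structured score via max-sum dynamic programming."""
    L, Y = len(x), len(alphabet)
    delta = np.empty((L, Y))              # best prefix score ending in each state
    back = np.zeros((L, Y), dtype=int)    # argmax back-pointers

    delta[0] = np.asarray(score_init) + np.array([score_emit(x[0], a) for a in range(Y)])
    for i in range(1, L):
        for a in range(Y):
            cand = delta[i - 1] + score_trans[:, a]
            back[i, a] = int(np.argmax(cand))
            delta[i, a] = cand[back[i, a]] + score_emit(x[i], a)

    states = [int(np.argmax(delta[-1]))]  # backtrack the best path
    for i in range(L - 1, 0, -1):
        states.append(int(back[i, states[-1]]))
    states.reverse()
    return [alphabet[a] for a in states]
```
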
Learning by Empirical Risk Minimization
7/15
ℓ : Y × Y → [0, ∞) . . . loss such that ℓ(y, y′) = 0 iff y = y′


Find w of h(x; w) ∈ Argmax_{y∈Y} ⟨w, φ(x, y)⟩ which minimizes the risk


 
R(w) = E_{(x,y)∼p} [ ℓ(y, h(x; w)) ]

ERM based learning leads to solving




w∗ ∈ Argmin_{w∈Rn} R_{T^m}(w)

where the empirical risk is

R_{T^m}(w) = (1/m) ∑_{i=1}^{m} ℓ(y^i, h(x^i; w))

and T^m = {(x^i, y^i) ∈ (X × Y) | i = 1, . . . , m} are training examples drawn i.i.d. from the distribution p(x, y).
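
A direct sketch of the empirical risk for a finite output set, using a brute-force argmax for the classifier; the names are illustrative:

```python
def empirical_risk(examples, w, phi, loss, labels):
    """R_{T^m}(w) = (1/m) sum_i loss(y^i, h(x^i; w)) with h the linear classifier."""
    total = 0.0
    for x, y in examples:
        y_hat = max(labels, key=lambda yy: w @ phi(x, yy))   # h(x; w)
        total += loss(y, y_hat)
    return total / len(examples)
```
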
Learning linear classifier from separable examples
8/15

An example (x^i, y^i) is correctly classified, that is,

y^i = h(x^i; w) = Argmax_{y∈Y} ⟨w, φ(x^i, y)⟩

is equivalent to

⟨φ(x^i, y^i), w⟩ > ⟨φ(x^i, y), w⟩ , ∀y ∈ Y \ {y^i}

Definition: The examples T^m = {(x^i, y^i) ∈ (X × Y) | i = 1, . . . , m} are linearly separable w.r.t. the joint feature map φ : X × Y → Rn if there exists w ∈ Rn such that

⟨φ(x^i, y^i), w⟩ > ⟨φ(x^i, y), w⟩ , ∀i ∈ {1, . . . , m}, ∀y ∈ Y \ {y^i}


(Generic) Perceptron algorithm
9/15

Task: given a set of points {a^i ∈ Rn | i = 1, 2, . . . , K} we want to find w ∈ Rn such that

⟨w, a^i⟩ > 0 , ∀i ∈ {1, 2, . . . , K}     (1)

Algorithm:


1. w ← 0
2. Find k ∈ {1, 2, . . . , K} such that ⟨w, a^k⟩ ≤ 0
3. If there is no such k return w otherwise update

w ← w + a^k

and go to step 2.
If the set of inequalities (1) is solvable, then the Perceptron algorithm terminates after a finite number of updates, and this number does not depend on the number of points K.
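
A sketch of the generic Perceptron; the iteration cap max_iter is an added safeguard, not part of the algorithm above:

```python
import numpy as np

def perceptron(A, max_iter=100_000):
    """Find w with <w, a^i> > 0 for all rows a^i of the (K, n) array A."""
    w = np.zeros(A.shape[1])                      # step 1: w <- 0
    for _ in range(max_iter):
        violated = np.flatnonzero(A @ w <= 0)     # step 2: any k with <w, a^k> <= 0?
        if violated.size == 0:                    # step 3: none left, return w
            return w
        w = w + A[violated[0]]                    # update w <- w + a^k, go to step 2
    return None                                   # cap reached (e.g. not solvable)
```
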


Structured Output Perceptron
10/15
Learning h(x; w) ∈ Argmax_{y∈Y} ⟨w, φ(x, y)⟩ from examples T^m = {(x^i, y^i) ∈ (X × Y) | i = 1, . . . , m} leads to solving

⟨φ(x^i, y^i), w⟩ − ⟨φ(x^i, y), w⟩ > 0 , ∀i ∈ {1, . . . , m}, ∀y ∈ Y \ {y^i}

Algorithm:


1. w ← 0
2. Find a misclassified example (x^k, y^k) ∈ T^m such that

y^k ≠ ŷ^k = Argmax_{y∈Y} ⟨w, φ(x^k, y)⟩     (prediction problem)

3. If there is no misclassified example return w otherwise update

w ← w + φ(x^k, y^k) − φ(x^k, ŷ^k)     (parameter update)

and go to step 2.
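
A sketch of the structured output Perceptron for a small, explicitly enumerable output set; for truly structured outputs the prediction problem must be solved by a dedicated algorithm (e.g. the Viterbi sketch above for chains). The argument names are illustrative:

```python
import numpy as np

def so_perceptron(examples, phi, labels, n, max_epochs=100):
    """Structured output Perceptron on examples T^m, feature dimension n."""
    w = np.zeros(n)
    for _ in range(max_epochs):
        updated = False
        for x, y in examples:
            y_hat = max(labels, key=lambda yy: w @ phi(x, yy))  # prediction problem
            if y_hat != y:                                      # misclassified example
                w += phi(x, y) - phi(x, y_hat)                  # parameter update
                updated = True
        if not updated:                                         # all examples correct
            return w
    return w
```
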
Structured Output Support Vector Machines
11/15
Learning h(x; w) ∈ Argmax_{y∈Y} ⟨w, φ(x, y)⟩ from examples T^m = {(x^i, y^i) ∈ (X × Y) | i = 1, . . . , m} by ERM leads to

w∗ ∈ Argmin_{w∈Rn} R_{T^m}(w) where R_{T^m}(w) = (1/m) ∑_{i=1}^{m} ℓ(y^i, h(x^i; w))

The SO-SVM approximates the ERM by a convex problem




w∗ ∈ Argmin_{w∈W_r} R^ψ(w) where R^ψ(w) = (1/m) ∑_{i=1}^{m} ψ(x^i, y^i, w)

where
• W_r = {w ∈ Rn | ‖w‖ ≤ r} . . . a ball of radius r
• ψ : X × Y × Rn → R . . . proxy approximating the true loss ℓ
Margin rescaling loss
12/15

The score of the correct label of an example (x^i, y^i) should be above the scores of the incorrect labels increased by a margin proportional to the loss ℓ(y^i, y):

⟨w, φ(x^i, y^i)⟩ ≥ ⟨w, φ(x^i, y)⟩ + ℓ(y^i, y) , ∀y ∈ Y \ {y^i}

Example: Sequential OCR, Hamming distance ℓ(y, y′) = ∑_{i=1}^{L} [yi ≠ y′i]

For an image x^i of a word with the correct label y^i = JOHN, each incorrect label enters with a margin equal to its Hamming loss:

ψ(x^i, y^i, w) = max{ 0,
    4 + ⟨φ(x^i, AAAA), w⟩ − ⟨φ(x^i, JOHN), w⟩,
    3 + ⟨φ(x^i, JAAA), w⟩ − ⟨φ(x^i, JOHN), w⟩,
    2 + ⟨φ(x^i, JOAA), w⟩ − ⟨φ(x^i, JOHN), w⟩,
    1 + ⟨φ(x^i, JOHA), w⟩ − ⟨φ(x^i, JOHN), w⟩,
    . . . }
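
The Hamming distance used above is straightforward to compute; a small sketch:

```python
def hamming_loss(y, y_prime):
    """Number of positions where two equal-length label sequences differ."""
    return sum(a != b for a, b in zip(y, y_prime))

# hamming_loss("JOHN", "JOHA") -> 1,  hamming_loss("JOHN", "AAAA") -> 4
```
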
Margin rescaling loss
13/15
The score of the correct label of an example (x^i, y^i) should be above the scores of the incorrect labels increased by a margin proportional to the loss ℓ(y^i, y):

⟨w, φ(x^i, y^i)⟩ ≥ ⟨w, φ(x^i, y)⟩ + ℓ(y^i, y) , ∀y ∈ Y \ {y^i}

The margin rescaling loss




ψ(x^i, y^i, w) = max{ 0, max_{y∈Y\{y^i}} ( ℓ(y^i, y) + ⟨w, φ(x^i, y)⟩ − ⟨w, φ(x^i, y^i)⟩ ) }

Upper bound of the true loss:




y^i ≠ ŷ = h(x^i; w) = Argmax_{y∈Y} ⟨w, φ(x^i, y)⟩

implies ⟨w, φ(x^i, ŷ)⟩ − ⟨w, φ(x^i, y^i)⟩ ≥ 0 and hence

ψ(x^i, y^i, w) ≥ ℓ(y^i, h(x^i; w)) , ∀w ∈ Rn
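
A brute-force sketch of the margin rescaling proxy for a small output set; for structured outputs the inner maximization (loss-augmented inference) needs a dedicated algorithm, e.g. a Viterbi-like dynamic program for the OCR chain. The argument names are illustrative:

```python
def margin_rescaling_loss(x_i, y_i, w, phi, loss, labels):
    """psi(x^i, y^i, w) = max(0, max over y != y^i of loss + score difference)."""
    correct_score = w @ phi(x_i, y_i)
    worst = 0.0
    for y in labels:
        if y == y_i:
            continue
        worst = max(worst, loss(y_i, y) + w @ phi(x_i, y) - correct_score)
    return worst
```
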


SO-SVM as a convex unconstrained problem
14/15
SO-SVM as a constrained convex problem:


w∗ ∈ Argmin_{w∈W_r} R^ψ_r(w) where R^ψ_r(w) = (1/m) ∑_{i=1}^{m} ψ(x^i, y^i, w)

and W_r = {w ∈ Rn | ‖w‖ ≤ r}

SO-SVM as an unconstrained convex problem:




w∗ ∈ Argmin_{w∈Rn} R^ψ_λ(w) where R^ψ_λ(w) = (1/m) ∑_{i=1}^{m} ( (λ/2) ‖w‖² + ψ(x^i, y^i, w) ) = (1/m) ∑_{i=1}^{m} ψ_λ(x^i, y^i, w)

where λ > 0 is a hyper-parameter which controls over-fitting.


SO-SVM problem solved by Stochastic Gradient Descent
15/15
The SO-SVM as an unconstrained convex problem:


w∗ ∈ Argmin_{w∈Rn} R^ψ_λ(w) where R^ψ_λ(w) = (1/m) ∑_{i=1}^{m} ψ_λ(x^i, y^i, w)

Algorithm:


1: Choose an initial iterate w^1 ∈ Rn

2: for k = 1, 2, . . .
3:   Select an example (x^k, y^k) ∈ T^m uniformly at random
4:   Compute a subgradient g^k of ψ_λ(x^k, y^k, w) at w^k:

     g^k = λ w^k − φ(x^k, y^k) + φ(x^k, ŷ^k)

     ŷ^k ∈ Argmax_{y∈Y} ( ℓ(y^k, y) + ⟨w^k, φ(x^k, y)⟩ )

5:   Choose a stepsize α_k > 0 (e.g. α_k = constant / k)
6:   Set the new iterate w^{k+1} ← w^k − α_k g^k
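
A sketch of the stochastic subgradient method above, again with a brute-force loss-augmented Argmax that is only feasible for small output sets; lam, num_iters and the stepsize constant c are illustrative hyper-parameters:

```python
import numpy as np

def sosvm_sgd(examples, phi, loss, labels, n, lam=0.01, num_iters=1000, c=1.0):
    """Stochastic subgradient descent for the unconstrained SO-SVM objective."""
    rng = np.random.default_rng(0)
    w = np.zeros(n)                                    # initial iterate w^1
    for k in range(1, num_iters + 1):
        x, y = examples[rng.integers(len(examples))]   # pick (x^k, y^k) at random
        # loss-augmented prediction: argmax_y loss(y^k, y) + <w^k, phi(x^k, y)>
        y_hat = max(labels, key=lambda yy: loss(y, yy) + w @ phi(x, yy))
        g = lam * w - phi(x, y) + phi(x, y_hat)        # subgradient of psi_lambda
        w = w - (c / k) * g                            # stepsize alpha_k = c / k
    return w
```
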
