
Statistical Machine Learning (BE4M33SSU)

Lecture 5: Structured Output Support Vector Machines


Czech Technical University in Prague
V. Franc

BE4M33SSU – Statistical Machine Learning, Winter 2022


Linear classifier
2/15

X . . . set of observations


Y . . . finite set of hidden states




φ : X × Y → Rn . . . joint feature map




w ∈ Rn . . . vector of parameters


Generic linear classifier h : X → Y




h(x; w) ∈ Argmax_{y∈Y} ⟨w, φ(x, y)⟩
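
A minimal sketch of this prediction rule, assuming a finite output set that can be enumerated and a joint feature map phi(x, y) returning a numpy vector (the names predict, phi and labels are illustrative, not from the slides):

```python
import numpy as np

def predict(x, w, phi, labels):
    """Generic linear classifier: h(x; w) = argmax over y in labels of <w, phi(x, y)>."""
    return max(labels, key=lambda y: w @ phi(x, y))
```
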
Example: two-class linear classifier
3/15

X = Rn . . . set of observations


Y = {+1, −1} . . . binary labels




Two-class linear classifier h : Rn → {−1, +1}





h(x; u, v) = sign(⟨u, x⟩ + v) = { +1 if ⟨u, x⟩ + v ≥ 0
                                  −1 if ⟨u, x⟩ + v < 0 }

We can write the two-class classifier as




h(x; w) ∈ Argmax_{y∈{−1,+1}} y(⟨u, x⟩ + v) = Argmax_{y∈{−1,+1}} ⟨w, φ(x, y)⟩

where φ : Rn × {−1, +1} → Rn+1

φ(x, y) = y(x, 1) and w = (u, v)
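
The reduction can be checked directly: with φ(x, y) = y · (x, 1) and w = (u, v), the generic argmax reproduces the sign rule. A small sketch with illustrative numbers:

```python
import numpy as np

def phi_binary(x, y):
    """Joint feature map phi(x, y) = y * (x, 1) for Y = {+1, -1}."""
    return y * np.append(x, 1.0)

u, v = np.array([1.0, -2.0]), 0.5            # illustrative parameters
w = np.append(u, v)                          # w = (u, v)
x = np.array([0.3, 0.1])
y_pred = max([+1, -1], key=lambda y: w @ phi_binary(x, y))
assert y_pred == (1 if u @ x + v >= 0 else -1)
```
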


Example: multi-class linear classifier
4/15
X = Rn . . . set of observations; Y = {1, . . . , Y } . . . set of class labels


Multi-class linear classifier h : X → Y




h(x; w) ∈ Argmax_{y∈Y} ⟨wy, x⟩

where w = (w1, . . . , wY ) ∈ Rn·Y are parameters.


We can write the multi-class classifier as


h(x; w) ∈ Argmax_{y∈Y} ⟨wy, x⟩ = Argmax_{y∈Y} ⟨w, φ(x, y)⟩

where φ : Rn × Y → Rn·Y is

φ(x, y) = (0, . . . , 0, x, 0, . . . , 0)    with x placed in the y-th slot
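
A sketch of this joint feature map, assuming the labels are integers 1, . . . , Y (the function name is illustrative):

```python
import numpy as np

def phi_multiclass(x, y, num_classes):
    """Place x into the y-th block of a zero vector of length n * num_classes."""
    n = len(x)
    out = np.zeros(n * num_classes)
    out[(y - 1) * n : y * n] = x
    return out

# With w = (w_1, ..., w_Y) stacked into one vector,
# <w, phi_multiclass(x, y, Y)> equals <w_y, x>.
```
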
Example: sequence classifier for OCR
5/15
x = (x1, . . . , xL) ∈ I^L . . . sequence of images with characters


y = (y1, . . . , yL) ∈ A^L . . . seq. of chars. from A = {A, . . . , Z}




p(xi | yi) . . . appearance model for characters




p(y1), p(yi | yi−1) . . . language model




Hidden Markov Chain model of the sequences:




p(x1, . . . , xL, y1, . . . , yL) = p(y1) ∏_{i=2}^{L} p(yi | yi−1) ∏_{i=1}^{L} p(xi | yi)

[Figure: chain graphical model y1 → y2 → y3 → y4 with transition factors p(y2 | y1), p(y3 | y2), p(y4 | y3) and emission factors p(x1 | y1), . . . , p(x4 | y4) linking each state yi to its observation xi]
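
A sketch of the joint log-probability under this model, assuming integer-coded states and the (illustrative) tables/functions log_p_init, log_p_trans and log_p_emit:

```python
def hmm_log_joint(x, y, log_p_init, log_p_trans, log_p_emit):
    """log p(x1..xL, y1..yL) = log p(y1) + sum_i log p(yi|yi-1) + sum_i log p(xi|yi)."""
    score = log_p_init[y[0]] + log_p_emit(x[0], y[0])
    for i in range(1, len(y)):
        score += log_p_trans[y[i - 1], y[i]] + log_p_emit(x[i], y[i])
    return score
```
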
Example: sequence classifier for OCR
6/15
The MAP estimate from HMC:


ŷ ∈ Argmax_{y∈A^L} ( log p(y1) + ∑_{i=2}^{L} log p(yi | yi−1) + ∑_{i=1}^{L} log p(xi | yi) )

Let us assume the following parametrization:




log p(y1) = ⟨w, φ(y1)⟩
log p(yi | yi−1) = ⟨w, φ(yi−1, yi)⟩
log p(xi | yi) = ⟨w, φ(xi, yi)⟩

The MAP estimate becomes a linear classifier:




ŷ ∈ Argmax_{(y1,...,yL)∈A^L} ⟨ w, φ(y1) + ∑_{i=2}^{L} φ(yi−1, yi) + ∑_{i=1}^{L} φ(xi, yi) ⟩

where the bracketed sum of feature vectors is the joint feature map φ(x, y).
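
Because the score decomposes over neighboring positions, the Argmax over A^L can be computed exactly by dynamic programming (max-sum / Viterbi). A sketch, assuming the unary and pairwise scores are given as arrays/functions; the names score_init, score_trans and score_emit are illustrative:

```python
import numpy as np

def viterbi_map(x, alphabet, score_init, score_trans, score_emit):
    """MAP sequence for a chain-structured score via max-sum dynamic programming."""
    L, Y = len(x), len(alphabet)
    delta = np.empty((L, Y))              # best prefix score ending in each state
    back = np.zeros((L, Y), dtype=int)    # argmax back-pointers

    delta[0] = np.asarray(score_init) + np.array([score_emit(x[0], a) for a in range(Y)])
    for i in range(1, L):
        for a in range(Y):
            cand = delta[i - 1] + score_trans[:, a]
            back[i, a] = int(np.argmax(cand))
            delta[i, a] = cand[back[i, a]] + score_emit(x[i], a)

    states = [int(np.argmax(delta[-1]))]  # backtrack the best path
    for i in range(L - 1, 0, -1):
        states.append(int(back[i, states[-1]]))
    states.reverse()
    return [alphabet[a] for a in states]
```
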
Learning by Empirical Risk Minimization
7/15
ℓ : Y × Y → [0, ∞) . . . loss such that ℓ(y, y′) = 0 iff y = y′


Find w of h(x; w) ∈ Argmax_{y∈Y} ⟨w, φ(x, y)⟩ which minimizes the risk


 
R(w) = E_{(x,y)∼p} [ ℓ(y, h(x; w)) ]

ERM based learning leads to solving




w∗ ∈ Argmin_{w∈Rn} R_{T^m}(w)

where the empirical risk is

R_{T^m}(w) = (1/m) ∑_{i=1}^{m} ℓ(y^i, h(x^i; w))

and T^m = {(x^i, y^i) ∈ (X × Y) | i = 1, . . . , m} are training examples drawn i.i.d. from the distribution p(x, y).
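
A direct sketch of the empirical risk for a finite output set, using a brute-force argmax for the classifier; the names are illustrative:

```python
def empirical_risk(examples, w, phi, loss, labels):
    """R_{T^m}(w) = (1/m) sum_i loss(y^i, h(x^i; w)) with h the linear classifier."""
    total = 0.0
    for x, y in examples:
        y_hat = max(labels, key=lambda yy: w @ phi(x, yy))   # h(x; w)
        total += loss(y, y_hat)
    return total / len(examples)
```
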
Learning linear classifier from separable examples
8/15

An example (x^i, y^i) is correctly classified, that is,

y^i = h(x^i; w) = Argmax_{y∈Y} ⟨w, φ(x^i, y)⟩

is equivalent to

⟨φ(x^i, y^i), w⟩ > ⟨φ(x^i, y), w⟩ , ∀y ∈ Y \ {y^i}

Definition: The examples T^m = {(x^i, y^i) ∈ (X × Y) | i = 1, . . . , m} are linearly separable w.r.t. the joint feature map φ : X × Y → Rn if there exists w ∈ Rn such that

⟨φ(x^i, y^i), w⟩ > ⟨φ(x^i, y), w⟩ , ∀i ∈ {1, . . . , m}, ∀y ∈ Y \ {y^i}


(Generic) Perceptron algorithm
9/15

Task: given a set of points {a^i ∈ Rn | i = 1, 2, . . . , K} we want to find w ∈ Rn such that

⟨w, a^i⟩ > 0 , ∀i ∈ {1, 2, . . . , K}     (1)

Algorithm:


1. w ← 0
2. Find k ∈ {1, 2, . . . , K} such that ⟨w, a^k⟩ ≤ 0
3. If there is no such k return w otherwise update

w ← w + a^k

and go to step 2.
If the set of inequalities (1) is solvable, then the Perceptron algorithm terminates after a finite number of updates, and this number does not depend on the number of points K.
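
A sketch of the generic Perceptron; the iteration cap max_iter is an added safeguard, not part of the algorithm above:

```python
import numpy as np

def perceptron(A, max_iter=100_000):
    """Find w with <w, a^i> > 0 for all rows a^i of the (K, n) array A."""
    w = np.zeros(A.shape[1])                      # step 1: w <- 0
    for _ in range(max_iter):
        violated = np.flatnonzero(A @ w <= 0)     # step 2: any k with <w, a^k> <= 0?
        if violated.size == 0:                    # step 3: none left, return w
            return w
        w = w + A[violated[0]]                    # update w <- w + a^k, go to step 2
    return None                                   # cap reached (e.g. not solvable)
```
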


Structured Output Perceptron
10/15
Learning h(x; w) ∈ Argmax_{y∈Y} ⟨w, φ(x, y)⟩ from examples T^m = {(x^i, y^i) ∈ (X × Y) | i = 1, . . . , m} leads to solving

⟨φ(x^i, y^i), w⟩ − ⟨φ(x^i, y), w⟩ > 0 , ∀i ∈ {1, . . . , m}, ∀y ∈ Y \ {y^i}

Algorithm:


1. w ← 0
2. Find a misclassified example (x^k, y^k) ∈ T^m such that

y^k ≠ ŷ^k = Argmax_{y∈Y} ⟨w, φ(x^k, y)⟩     (prediction problem)

3. If there is no misclassified example return w otherwise update

w ← w + φ(x^k, y^k) − φ(x^k, ŷ^k)     (parameter update)

and go to step 2.
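
A sketch of the structured output Perceptron for a small, explicitly enumerable output set; for truly structured outputs the prediction problem must be solved by a dedicated algorithm (e.g. the Viterbi sketch above for chains). The argument names are illustrative:

```python
import numpy as np

def so_perceptron(examples, phi, labels, n, max_epochs=100):
    """Structured output Perceptron on examples T^m, feature dimension n."""
    w = np.zeros(n)
    for _ in range(max_epochs):
        updated = False
        for x, y in examples:
            y_hat = max(labels, key=lambda yy: w @ phi(x, yy))  # prediction problem
            if y_hat != y:                                      # misclassified example
                w += phi(x, y) - phi(x, y_hat)                  # parameter update
                updated = True
        if not updated:                                         # all examples correct
            return w
    return w
```
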
Structured Output Support Vector Machines
11/15
Learning h(x; w) ∈ Argmax_{y∈Y} ⟨w, φ(x, y)⟩ from examples T^m = {(x^i, y^i) ∈ (X × Y) | i = 1, . . . , m} by ERM leads to

w∗ ∈ Argmin_{w∈Rn} R_{T^m}(w) where R_{T^m}(w) = (1/m) ∑_{i=1}^{m} ℓ(y^i, h(x^i; w))

The SO-SVM approximates the ERM by a convex problem




w∗ ∈ Argmin_{w∈W_r} R^ψ(w) where R^ψ(w) = (1/m) ∑_{i=1}^{m} ψ(x^i, y^i, w)

where
• W_r = {w ∈ Rn | ‖w‖ ≤ r} . . . a ball of radius r
• ψ : X × Y × Rn → R . . . proxy approximating the true loss ℓ
Margin rescaling loss
12/15

The score of the correct label of an example (x^i, y^i) should be above the scores of the incorrect labels increased by a margin proportional to the loss ℓ(y^i, y):

⟨w, φ(x^i, y^i)⟩ ≥ ⟨w, φ(x^i, y)⟩ + ℓ(y^i, y) , ∀y ∈ Y \ {y^i}

Example: Sequential OCR, Hamming distance ℓ(y, y′) = ∑_{i=1}^{L} [yi ≠ y′i]

For an image x^i of a word with the correct label y^i = JOHN, each incorrect label enters with a margin equal to its Hamming loss:

ψ(x^i, y^i, w) = max{ 0,
    4 + ⟨φ(x^i, AAAA), w⟩ − ⟨φ(x^i, JOHN), w⟩,
    3 + ⟨φ(x^i, JAAA), w⟩ − ⟨φ(x^i, JOHN), w⟩,
    2 + ⟨φ(x^i, JOAA), w⟩ − ⟨φ(x^i, JOHN), w⟩,
    1 + ⟨φ(x^i, JOHA), w⟩ − ⟨φ(x^i, JOHN), w⟩,
    . . . }
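
The Hamming distance used above is straightforward to compute; a small sketch:

```python
def hamming_loss(y, y_prime):
    """Number of positions where two equal-length label sequences differ."""
    return sum(a != b for a, b in zip(y, y_prime))

# hamming_loss("JOHN", "JOHA") -> 1,  hamming_loss("JOHN", "AAAA") -> 4
```
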
Margin rescaling loss
13/15
The score of the correct label of an example (x^i, y^i) should be above the scores of the incorrect labels increased by a margin proportional to the loss ℓ(y^i, y):

⟨w, φ(x^i, y^i)⟩ ≥ ⟨w, φ(x^i, y)⟩ + ℓ(y^i, y) , ∀y ∈ Y \ {y^i}

The margin rescaling loss




ψ(x^i, y^i, w) = max{ 0, max_{y∈Y\{y^i}} ( ℓ(y^i, y) + ⟨w, φ(x^i, y)⟩ − ⟨w, φ(x^i, y^i)⟩ ) }

Upper bound of the true loss:




y^i ≠ ŷ = h(x^i; w) = Argmax_{y∈Y} ⟨w, φ(x^i, y)⟩

implies ⟨w, φ(x^i, ŷ)⟩ − ⟨w, φ(x^i, y^i)⟩ ≥ 0 and hence

ψ(x^i, y^i, w) ≥ ℓ(y^i, h(x^i; w)) , ∀w ∈ Rn
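
A brute-force sketch of the margin rescaling proxy for a small output set; for structured outputs the inner maximization (loss-augmented inference) needs a dedicated algorithm, e.g. a Viterbi-like dynamic program for the OCR chain. The argument names are illustrative:

```python
def margin_rescaling_loss(x_i, y_i, w, phi, loss, labels):
    """psi(x^i, y^i, w) = max(0, max over y != y^i of loss + score difference)."""
    correct_score = w @ phi(x_i, y_i)
    worst = 0.0
    for y in labels:
        if y == y_i:
            continue
        worst = max(worst, loss(y_i, y) + w @ phi(x_i, y) - correct_score)
    return worst
```
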


SO-SVM as a convex unconstrained problem
14/15
SO-SVM as a constrained convex problem:


w∗ ∈ Argmin_{w∈W_r} R^ψ_r(w) where R^ψ_r(w) = (1/m) ∑_{i=1}^{m} ψ(x^i, y^i, w)

and W_r = {w ∈ Rn | ‖w‖ ≤ r}

SO-SVM as an unconstrained convex problem:




w∗ ∈ Argmin_{w∈Rn} R^ψ_λ(w) where R^ψ_λ(w) = (1/m) ∑_{i=1}^{m} ( (λ/2) ‖w‖² + ψ(x^i, y^i, w) ) = (1/m) ∑_{i=1}^{m} ψ_λ(x^i, y^i, w)

where λ > 0 is a hyper-parameter which controls over-fitting.


SO-SVM problem solved by Stochastic Gradient Descent
15/15
The SO-SVM as an unconstrained convex problem:


w∗ ∈ Argmin_{w∈Rn} R^ψ_λ(w) where R^ψ_λ(w) = (1/m) ∑_{i=1}^{m} ψ_λ(x^i, y^i, w)

Algorithm:


1: Choose an initial iterate w^1 ∈ Rn

2: for k = 1, 2, . . .
3:   Select an example (x^k, y^k) ∈ T^m uniformly at random
4:   Compute a subgradient g^k of ψ_λ(x^k, y^k, w) at w^k:

     g^k = λ w^k − φ(x^k, y^k) + φ(x^k, ŷ^k)

     ŷ^k ∈ Argmax_{y∈Y} ( ℓ(y^k, y) + ⟨w^k, φ(x^k, y)⟩ )

5:   Choose a stepsize α_k > 0 (e.g. α_k = constant / k)
6:   Set the new iterate w^{k+1} ← w^k − α_k g^k
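
A sketch of the stochastic subgradient method above, again with a brute-force loss-augmented Argmax that is only feasible for small output sets; lam, num_iters and the stepsize constant c are illustrative hyper-parameters:

```python
import numpy as np

def sosvm_sgd(examples, phi, loss, labels, n, lam=0.01, num_iters=1000, c=1.0):
    """Stochastic subgradient descent for the unconstrained SO-SVM objective."""
    rng = np.random.default_rng(0)
    w = np.zeros(n)                                    # initial iterate w^1
    for k in range(1, num_iters + 1):
        x, y = examples[rng.integers(len(examples))]   # pick (x^k, y^k) at random
        # loss-augmented prediction: argmax_y loss(y^k, y) + <w^k, phi(x^k, y)>
        y_hat = max(labels, key=lambda yy: loss(y, yy) + w @ phi(x, yy))
        g = lam * w - phi(x, y) + phi(x, y_hat)        # subgradient of psi_lambda
        w = w - (c / k) * g                            # stepsize alpha_k = c / k
    return w
```
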
