A Tutorial On Conditional Random Fields: With Applications To Music Analysis
Slim Essid
Telecom ParisTech
November 2013
Slides available online
http://www.telecom-paristech.fr/~essid/resources.htm
Contents
• Introduction
• References to applications in MIR-related tasks
• The logistic regression model
• Conditional Random Fields (for linear-chain data)
• Improvements and extensions to original CRFs
• Conclusion
• References
• Appendix
Introduction Preliminaries
Features
[Figure: a signal divided into frames x1 x2 x3 x4, annotated with the chords C C F F.]
Features may be real-valued (e.g. MFCC, chroma, tempo...) or symbolic (e.g. note/chord played, key...).
• Usually assembled as feature vectors x.
Notations
Feature functions
Different from features!
Definition
A feature function is a real-valued function of both the input space O
(observations) and the output space Y (target labels), fj : O × Y → R,
that can be used to compute characteristics of the observations.
Example: fj(oi, yi) = 1 if oi = C, oi+1 = E and yi = Cmaj; 0 otherwise.
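To make this concrete, here is a minimal Python sketch of such an indicator feature function (the function name and the string encoding of notes and chords are illustrative, not from the slides):

def f_j(o, y_i, i):
    # Indicator feature: fires when the note at position i is C, the next
    # note is E, and the candidate chord label is Cmaj (toy example).
    return 1.0 if o[i] == "C" and o[i + 1] == "E" and y_i == "Cmaj" else 0.0

print(f_j(["C", "E", "G"], "Cmaj", 0))  # 1.0
print(f_j(["C", "E", "G"], "Fmaj", 0))  # 0.0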
Feature functions
• Remarks:
− Different attributes may thus be considered for different classes.
− Feature functions are more general than features: one can define fj(o, y) ≜ xj, or f(o, y) ≜ x.
− In the following, feature-function notations will be used only when needed; otherwise, feature vectors will be preferred.
Probabilistic classification
• Choose the class ŷ maximizing the posterior probability p(y|x), in order to minimize the error rate (here the expected 0-1 loss).
MAP: Maximum A Posteriori probability
• By Bayes' rule:
ŷ = argmax_y p(y)p(x|y) / p(x) = argmax_y p(y)p(x|y).
• Assuming a fixed prior p(y) (possibly uninformative: p(y) = 1/K), one is left with:
ŷ = argmax_y p(x|y).
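As a toy illustration, a Python sketch of the MAP decision rule (all probability values are made up for the example):

import numpy as np

prior = np.array([0.5, 0.3, 0.2])         # p(y) for K = 3 classes
likelihood = np.array([0.1, 0.4, 0.35])   # p(x|y) evaluated at the observed x

posterior_unnorm = prior * likelihood     # proportional to p(y|x)
y_hat = int(np.argmax(posterior_unnorm))  # MAP decision: argmax_y p(y)p(x|y)
print(y_hat, posterior_unnorm / posterior_unnorm.sum())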
[Figure: class-conditional densities p(x|y = 1) and p(x|y = 2) (left) and the corresponding class posteriors p(y = 1|x) and p(y = 2|x) (right), as functions of x.]
Generated using pmtk3 (Dunham and Murphy, 2010)
• Cons:
− Classes need to be learned jointly and data should be available for all classes.
− Models do not allow for generating observations.
[Figure: chord progressions annotated with Roman numerals in several keys — Ab Maj: VI II V I; C Maj: IV V I; Eb Maj: VI II V I; G Maj: IV V I.]
Linear-chain structure
[Figure: a chain of observations o1 o2 o3 o4.]
Linear-chain structure
Autotagging tasks: target tags are correlated (e.g. bebop, Jazz, fast tempo).
[Figure: a genre-label tree — ... Classical, Jazz, Rock/Pop, ...; under Jazz: Bebop, Big band, Cool Jazz, ...]
Tree structure
More notations
• Remarks:
− Observations are no longer assumed to be i.i.d. within each sequence.
− Sequences x(q) do not necessarily have the same length; when needed, nq will denote the length of x(q).
Introduction A first look at CRF
p(y|x; θ) = (1/Z(x, θ)) exp( Σ_{j=1}^{D} θj Fj(x, y) )
          = (1/Z(x, θ)) Ψ(x, y; θ); θ = {θ1, · · · , θD}.
• Z(x, θ) = Σ_y exp( Σ_j θj Fj(x, y) ) is called the partition function.
• Ψ(x, y; θ) = exp( Σ_{j=1}^{D} θj Fj(x, y) ) is called the potential function.
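For small problems the model can be evaluated by brute force, which makes the role of the partition function explicit. A Python sketch (the global feature functions, weights and labels are hypothetical):

import itertools
import numpy as np

def crf_posterior(x, y, theta, feats, labels):
    # Brute-force p(y|x; theta): enumerate all label sequences to compute
    # the partition function Z(x, theta). feats: global features F_j(x, y).
    def score(y_seq):
        return sum(t * F(x, y_seq) for t, F in zip(theta, feats))
    Z = sum(np.exp(score(y_seq))
            for y_seq in itertools.product(labels, repeat=len(x)))
    return np.exp(score(y)) / Z

# Hypothetical features: count of label 'a', count of 'a'->'b' transitions
feats = [lambda x, y: sum(l == "a" for l in y),
         lambda x, y: sum(y[i - 1] == "a" and y[i] == "b"
                          for i in range(1, len(y)))]
p = crf_posterior([0, 1, 2], ("a", "b", "a"), [1.0, 0.5], feats, ["a", "b"])
print(p)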
Applications of CRFs
• They have been successfully used for various computer vision tasks (He
et al., 2004; Quattoni et al., 2007; Wang et al., 2006; Morency et al., 2007; Rudovic et al., 2012)
References to applications in MIR-related tasks
Overview
System     | (Davies and Plumbley, 2007) | (Ellis, 2007) | (Klapuri et al., 2006) | CRF
F-measure  | 68.4                        | 55.6          | 69.6                   | 72.7
The logistic regression model
The logistic regression model Model specification
log( P(C1|x) / P(CK|x) ) = w10 + w1ᵀx
log( P(C2|x) / P(CK|x) ) = w20 + w2ᵀx
...
log( P(CK−1|x) / P(CK|x) ) = w(K−1)0 + w(K−1)ᵀx
• From log( P(Ck|x) / P(CK|x) ) = wk0 + wkᵀx; k = 1, · · · , K − 1, it is easy to deduce that:
P(Ck|x) = exp(wk0 + wkᵀx) / (1 + Σ_{l=1}^{K−1} exp(wl0 + wlᵀx)); k = 1, · · · , K − 1,
and P(CK|x) = 1 / (1 + Σ_{l=1}^{K−1} exp(wl0 + wlᵀx)).
• Remarks
− The model is a classification model (not a regression model!)
− It is a discriminative model, as it targets P(Ck|x) (as opposed to modeling p(x|Ck) in generative models).
• When K = 2:
P(C1|x) = p = 1 / (1 + exp −(w10 + w1ᵀx))
P(C2|x) = 1 − p
• p = 1 / (1 + exp −a); a = w10 + w1ᵀx
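A minimal Python sketch of the binary case (the weights are made up):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

w0, w = -0.5, np.array([1.2, -0.7])  # hypothetical parameters
x = np.array([0.3, 0.9])
p_c1 = sigmoid(w0 + w @ x)           # P(C1|x)
print(p_c1, 1.0 - p_c1)              # P(C2|x) = 1 - P(C1|x)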
σ(a) ≜ 1 / (1 + exp −a)
[Figure: the logistic sigmoid σ(a), plotted for a ∈ [−8, 8].]
• Properties:
Symmetry: σ(−a) = 1 − σ(a)
• The odds p/(1−p) ∈ [0, +∞], hence the log-odds log( p/(1−p) ) ∈ [−∞, +∞].
The logistic regression model Maximum Entropy Modeling
In terms of statistics:
P(Cmaj) + P(Cmin) + P(A♭maj) + P(A♭min) + P(Fmaj) + P(Fmin) = 1.
• Safe choice:
P(Cmaj) = P(Cmin) = · · · = P(Fmin) = 1/6.
Why “uniform”?
• Intuitively: the most uniform model according to our knowledge, the only
unbiased assumption
• Ancient wisdom:
− Occam’s razor (William of Ockham, 1287-1347), principle of parsimony: “Numquam ponenda est pluralitas sine necessitate.” [Plurality must never be posited without necessity.]
− Laplace: “when one has no information to distinguish between the probability
of two events, the best strategy is to consider them equally likely.” (Principle
of Insufficient Reason)
More facts
• The matching chord is Cmaj or Fmaj 30% of the time: P(Cmaj) + P(Fmaj) = 3/10.
• Ẽ(fj) ≜ Σ_{o,y} p̃(o, y) fj(o, y): expected value of fj w.r.t. the empirical distribution p̃(o, y).
• E(fj) ≜ Σ_{o,y} p̃(o) p(y|o) fj(o, y): expected value of fj w.r.t. the model p(o, y).
Constraint equation
Σ_{o,y} p̃(o) p(y|o) fj(o, y) = Σ_{o,y} p̃(o, y) fj(o, y), i.e. E(fj) = Ẽ(fj).
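A toy numeric check of this constraint in Python (all distributions are made up, and fj is tabulated as a matrix over the two observations and two labels):

import numpy as np

obs, labs = [0, 1], [0, 1]
p_tilde = np.array([[0.3, 0.1],   # empirical joint p~(o, y)
                    [0.2, 0.4]])
p_model = np.array([[0.7, 0.3],   # model conditionals p(y|o)
                    [0.4, 0.6]])
f = np.array([[1.0, 0.0],         # f_j(o, y) tabulated
              [0.0, 1.0]])

p_o = p_tilde.sum(axis=1)                      # empirical marginal p~(o)
E_tilde = (p_tilde * f).sum()                  # empirical expectation of f_j
E_model = (p_o[:, None] * p_model * f).sum()   # model expectation of f_j
print(E_tilde, E_model)                        # maxent requires these to match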
H(y|o) ≜ − Σ_{o,y} p̃(o) p(y|o) log p(y|o): the conditional entropy.

p(y = k|x) = exp(wk0 + wkᵀx) / (1 + Σ_{l=1}^{K−1} exp(wl0 + wlᵀx))
           = exp(w′k0 + w′kᵀx) / Σ_{l=1}^{K} exp(w′l0 + w′lᵀx)
           = (1/Zw(x)) exp(w′k0 + w′kᵀx).
• Maxent model:
p(y = k|o) = (1/Zλ(o)) exp( Σ_j λkj fj(o, y) )
p(y = k|x) = (1/Zw(x)) exp(w′k0 + w′kᵀx)
• Using:
− feature functions fj(o, y) = xj; f0(o, y) = 1 and x = (x1, · · · , xj, · · · , xD)ᵀ;
− w′k0 + w′kᵀx = Σ_{j=0}^{D} w′kj fj(o, y);
• Maxent model:
p(y = k|o) = (1/Zλ(o)) exp( Σ_j λkj fj(o, y) )
p(y = k|o) = (1/Zw(o)) exp( Σ_j w′kj fj(o, y) )
• CRF model:
p(y|x; θ) = (1/Zθ(x)) exp( Σ_j θj Fj(x, y) )
Conclusion
The solution to the maximum entropy problem has the same parametric form as the logistic regression and CRF models.
The logistic regression model Parameter estimation
Negative Log-Likelihood
L(D; θ) = L(w̃) = − Σ_{i=1}^{N} { yi log p(xi; w̃) + (1 − yi) log(1 − p(xi; w̃)) }
                = − Σ_{i=1}^{N} { yi w̃ᵀx̃i − log(1 + exp(w̃ᵀx̃i)) }
Gradient: ∇L(D; w̃) = − Σ_{i=1}^{N} x̃i (yi − p(xi; w̃))
Hessian: ∂²L(D; w̃)/∂w̃∂w̃ᵀ = Σ_{i=1}^{N} x̃i x̃iᵀ p(xi; w̃)(1 − p(xi; w̃))
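These two quantities are all a Newton-Raphson solver needs. A Python sketch implementing the formulas above, with a few Newton steps on made-up, non-separable data:

import numpy as np

def nll_grad_hess(X, y, w):
    # X: (N, D+1) design matrix with a leading column of ones (the x-tildes);
    # y: labels in {0, 1}; w: current weight vector w-tilde.
    p = 1.0 / (1.0 + np.exp(-X @ w))           # p(x_i; w)
    grad = -X.T @ (y - p)                      # -sum_i x_i (y_i - p_i)
    hess = (X * (p * (1 - p))[:, None]).T @ X  # sum_i x_i x_i^T p_i (1 - p_i)
    return grad, hess

X = np.hstack([np.ones((4, 1)), np.array([[0.2], [0.4], [0.6], [0.8]])])
y = np.array([0.0, 1.0, 0.0, 1.0])
w = np.zeros(2)
for _ in range(10):                            # Newton-Raphson iterations
    g, H = nll_grad_hess(X, y, w)
    w -= np.linalg.solve(H, g)                 # w <- w - H^{-1} grad
print(w)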
Optimization problem
Solve for w̃ the D + 1 non-linear equations:
Σ_{i=1}^{N} yi x̃ji = Σ_{i=1}^{N} x̃ji p(xi; w̃); 0 ≤ j ≤ D
Optimization methods
Objective: solve Σ_{i=1}^{N} yi x̃ji = Σ_{i=1}^{N} x̃ji p(xi; w̃).
Among the many available descent algorithms, two are widely used:
− the Newton-Raphson method: fast... but complex (efficient variants exist);
− the stochastic gradient descent method: easy to implement, adapted to large-scale problems.
[Figure: Newton-Raphson step — the quadratic approximation g_quad,θn(θ) of g(θ) around θn is minimized at θn + sn.]
[Figure: the next quadratic approximation g_quad,θn+1(θ), built around θn+1.]
The logistic regression model Improvements to the logistic regression model
ℓ2-regularization

ŵ = argmin_w̃ L(D; w̃) + (γ/2)||w||²
  = argmin_w̃ − Σ_{i=1}^{N} [ yi w̃ᵀx̃i − log(1 + exp(w̃ᵀx̃i)) ] + (γ/2) Σ_{j=1}^{D} wj²

ℓ2-regularization
Discussion
Recall that:
ŵ = argmin_w̃ L(D; w̃) + (γ/2)||w||²
is equivalent to:
ŵ = argmin_w̃ L(D; w̃) subject to ||w||² ≤ t.
ℓ2-regularization
Gradient and Hessian: the penalty term simply adds γw to the gradient and γI to the Hessian (the intercept w0 is typically left unregularized).
ℓ1-regularization
ℓ1-regularization
Discussion
• Difficulties:
− The regularizer is not differentiable at zero, yielding a non-smooth optimization problem.
→ Specific optimization techniques are needed (Yuan et al., 2010).
− In configurations with groups of highly correlated features:
  - ℓ1-regularization tends to select one feature in each group at random;
  - ℓ2-regularization tends to yield better prediction performance.
→ Consider the elastic net model (Hastie et al., 2009): L(D; w̃) + γ2||w||₂² + γ1||w||₁
KLR model
p(yi|xi) = 1 / (1 + exp −g(xi)); g(x) = w0 + wᵀφ(x)
Regularized KLR
• By the representer theorem: g(x) = w0 + Σ_{i=1}^{N} αi K(xi, x).
• The problem is strictly convex and can be solved using classic solvers.
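A gradient-descent sketch of regularized KLR in Python, assuming an RBF kernel (the kernel choice, data, step size and omission of the bias w0 are simplifications of this sketch, not the tutorial's prescription):

import numpy as np

def fit_klr(X, y_pm, lam=0.1, lr=0.1, iters=500):
    # y_pm: labels in {-1, +1}; fits g(x) = sum_i alpha_i K(x_i, x).
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq)                        # RBF Gram matrix K(x_i, x_j)
    alpha = np.zeros(len(y_pm))
    for _ in range(iters):
        g = K @ alpha                      # g(x_i) on the training points
        s = 1.0 / (1.0 + np.exp(y_pm * g)) # sigma(-y'_i g(x_i))
        grad = K @ (-y_pm * s) + lam * (K @ alpha)  # grad of NLL + (lam/2) a'Ka
        alpha -= lr * grad
    return alpha, K

X = np.array([[0.0], [0.2], [0.8], [1.0]])
y_pm = np.array([-1.0, -1.0, 1.0, 1.0])
alpha, K = fit_klr(X, y_pm)
print(1.0 / (1.0 + np.exp(-(K @ alpha))))  # P(y=1|x_i) on the training points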
KLR vs SVM
• It can be shown that KLR and SVM are quite related (see Appendix).
• Very similar prediction performance and optimal margin properties.
• Same refinements are possible: SMO, MKL...
Conditional Random Fields (for linear-chain data)
Conditional Random Fields (for linear-chain data) Introduction
Structured-output data
p(y|x; θ) = (1/Z(x, θ)) exp( Σ_{j=1}^{D} θj Fj(x, y) )
          = (1/Z(x, θ)) Ψ(x, y; θ); θ = {θ1, · · · , θD}.
• Z(x, θ) = Σ_y exp( Σ_j θj Fj(x, y) ) is called the partition function.
• Ψ(x, y; θ) = exp( Σ_{j=1}^{D} θj Fj(x, y) ) is called the potential function.
• Remarks:
− CRFs appear to be an extension of logistic regression to structured data.
− Feature functions Fj(x, y) depend on the whole sequence of observations x and labels y.
Fj(x, y) = Σ_{i=1}^{n} fj(yi−1, yi, x, i)
• Hence:
p(y|x; θ) = (1/Z(x, θ)) exp( Σ_{i=1}^{n} Σ_{j=1}^{Do} θj bj(yi, x, i) + Σ_{i=1}^{n} Σ_{j=1}^{Dt} θj tj(yi−1, yi, x, i) )
with bj the Do observation (state) feature functions and tj the Dt transition feature functions.
Connection to HMM
An HMM p(y, x) = Π_{i=1}^{n} p(yi|yi−1) p(xi|yi) can be rewritten as p(y, x) = exp( Σ_i Σ_{l,q} λlq 1{yi = l} 1{yi−1 = q} + Σ_i Σ_{o,l} µol 1{xi = o} 1{yi = l} ), where λlq = log p(yi = l|yi−1 = q) and µol = log p(xi = o|yi = l).
Connection to HMM
• Also using p(y|x) = p(y, x) / Σ_{y′} p(y′, x) and letting Z(x) = Σ_{y′} p(y′, x), one gets:
phmm(y|x) = (1/Z(x)) exp( Σ_{i=1}^{n} Σ_j θj bj(yi, x, i) + Σ_{i=1}^{n} Σ_j θj tj(yi−1, yi, x, i) );
Connection to HMM
Discussion
[Figure: graphical models — the HMM (directed: yi−1 → yi, yi → xi; generative) vs. the linear-chain CRF (undirected, conditioned on the observations xi).]
Connection to HMM
Advantage of the discriminative nature of CRF
Problems to be solved:
Conditional Random Fields (for linear-chain data) Inference
ŷ = argmax_y p(y|x; θ) = argmax_y (1/Z(x, θ)) exp( Σ_{j=1}^{D} θj Fj(x, y) )
  = argmax_y Σ_{j=1}^{D} θj Fj(x, y)
  = argmax_y Σ_{i=1}^{n} Σ_{j=1}^{D} θj fj(yi−1, yi, x, i)
Let gi(yi−1, yi) ≜ Σ_{j=1}^{D} θj fj(yi−1, yi, x, i); then:
ŷ = argmax_y Σ_{i=1}^{n} Σ_{j=1}^{D} θj fj(yi−1, yi, x, i) = argmax_y Σ_{i=1}^{n} gi(yi−1, yi).
Let δm(s) be the optimal “intermediate score” such that at time step m the label value is s:
δm(s) ≜ max_{y1,··· ,ym−1} [ Σ_{i=1}^{m−1} gi(yi−1, yi) + gm(ym−1, s) ]
[Figure: trellis over label values y1, y2, y3, · · · , yK; δm(s) is the score of the best partial path reaching label s at step m.]
Viterbi recursion
δm(s) = max_{ym−1 ∈ Y} [ δm−1(ym−1) + gm(ym−1, s) ]; 2 ≤ m ≤ n
(See Appendix for more details.)
Initialization: ∀s ∈ Y
δ1(s) = g1(y0, s); y0 = start
ψ1(s) = start
Recursion: ∀s ∈ Y; 2 ≤ m ≤ n
δm(s) = max_{y∈Y} [ δm−1(y) + gm(y, s) ]
ψm(s) = argmax_{y∈Y} [ δm−1(y) + gm(y, s) ]
Termination:
δn(y*n) = max_{y∈Y} δn(y) = max_y Σ_{i=1}^{n} gi(yi−1, yi)
y*n = argmax_{y∈Y} δn(y)
Path backtracking:
y*m = ψm+1(y*m+1); m = n − 1, n − 2, · · · , 1.
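A direct Python transcription of this algorithm (encoding the start state as row 0 of the first score matrix is a convention of this sketch; the scores are random for the example):

import numpy as np

def viterbi(g):
    # g: array of shape (n, K, K); g[i, q, s] is the local score of moving
    # from label q at step i to label s at step i+1 (g[0, 0, s] plays the
    # role of g1(start, s)).
    n, K, _ = g.shape
    delta = g[0, 0, :].copy()           # initialization: delta_1(s)
    psi = np.zeros((n, K), dtype=int)
    for m in range(1, n):               # recursion over positions 2..n
        scores = delta[:, None] + g[m]  # delta_{m-1}(q) + g_m(q, s)
        psi[m] = scores.argmax(axis=0)
        delta = scores.max(axis=0)
    y = [int(delta.argmax())]           # termination: y*_n
    for m in range(n - 1, 0, -1):       # path backtracking
        y.append(int(psi[m][y[-1]]))
    return y[::-1]

rng = np.random.default_rng(0)
print(viterbi(rng.normal(size=(5, 3, 3))))  # chain of length 5, K = 3 labels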
y*m = ψm+1(y*m+1); m = n − 1, n − 2, · · · , 1.
[Figure: backtracking the optimal label path through the trellis, from step n down to step 1.]
Problems to be solved:
• Alternatively, defining βm(ym) = Σ_{ym+1} Mm+1(ym, ym+1) βm+1(ym+1); 1 ≤ m ≤ n − 1, and βn(yn) = 1, one gets:
Marginal probability
p(ym−1, ym|x) = Σ_{y\{ym−1,ym}} p(y|x)
Conditional Random Fields (for linear-chain data) Parameter estimation
• Given training data D = {(x(1), y(1)), · · · , (x(N), y(N))}, the NLL is:
L(D; θ) ≜ − Σ_{q=1}^{N} log p(y(q)|x(q); θ)
        = Σ_{q=1}^{N} [ log Z(x(q); θ) − Σ_{i=1}^{nq} Σ_{j=1}^{D} θj fj(y(q)i−1, y(q)i, x(q), i) ]
        = Σ_{q=1}^{N} [ log Z(x(q); θ) − Σ_{j=1}^{D} θj Fj(x(q), y(q)) ].
NLL gradient
Gradient: ∂L(D; θ)/∂θk = Σ_{q=1}^{N} [ ∂/∂θk log Z(x(q); θ) − Fk(x(q), y(q)) ].

∂/∂θk log Z(x; θ) = (1/Z(x; θ)) ∂/∂θk Σ_{y∈Yⁿ} exp( Σ_{j=1}^{D} θj Fj(x, y) )
 = (1/Z(x; θ)) Σ_{y∈Yⁿ} Fk(x, y) exp( Σ_{j=1}^{D} θj Fj(x, y) )
 = Σ_{y∈Yⁿ} Fk(x, y) exp( Σ_j θj Fj(x, y) ) / Z(x; θ)
 = Σ_{y∈Yⁿ} Fk(x, y) p(y|x; θ)
 = E_{p(y|x;θ)}[ Fk(x, y) ].
NLL gradient
Hence ∂/∂θk log Z(x; θ) = E_{p(y|x;θ)}[ Fk(x, y) ]: a conditional expectation given x, and:
∂L(D; θ)/∂θk = Σ_{q=1}^{N} { E_{p(y|x(q);θ)}[ Fk(x(q), y) ] − Fk(x(q), y(q)) }.
Optimality condition
• Setting the derivatives to 0, i.e. ∂L(D; θ)/∂θk = 0, yields:
Σ_{q=1}^{N} E_{p(y|x(q);θ)}[ Fk(x(q), y) ] = Σ_{q=1}^{N} Fk(x(q), y(q)); 1 ≤ k ≤ D
Optimality condition
• Equivalently, dividing by N:
(1/N) Σ_{q=1}^{N} E_{p(y|x(q);θ)}[ Fk(x(q), y) ] = (1/N) Σ_{q=1}^{N} Fk(x(q), y(q)); 1 ≤ k ≤ D
i.e. the expected feature counts under the model must match their empirical averages, as in maximum entropy modeling.
E_{p(y|x;θ)}[ Fk(x, y) ] = Σ_{y∈Yⁿ} Fk(x, y) p(y|x; θ)
 = Σ_{i=1}^{n} Σ_{y∈Yⁿ} fk(yi−1, yi, x, i) p(y|x; θ)
 = Σ_{i=1}^{n} Σ_{yi−1,yi∈Y²} fk(yi−1, yi, x, i) p(yi−1, yi|x; θ)
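A brute-force Python check on a tiny chain that the direct expectation and the pairwise-marginal form coincide (the scores, the feature and the start-label convention are illustrative):

import itertools
import numpy as np

labels, n = [0, 1], 3
rng = np.random.default_rng(1)
g = rng.normal(size=(n, 2, 2))      # local scores g_i(y_{i-1}, y_i)

seqs = list(itertools.product(labels, repeat=n))
def score(y):
    return sum(g[i, 0 if i == 0 else y[i - 1], y[i]] for i in range(n))
w = np.exp([score(y) for y in seqs]); p = w / w.sum()   # p(y|x)

def f_k(prev, cur):                 # a local feature f_k(y_{i-1}, y_i)
    return float(prev == 0 and cur == 1)

# Direct expectation: sum_y F_k(x, y) p(y|x)
direct = sum(p[j] * sum(f_k(0 if i == 0 else y[i - 1], y[i]) for i in range(n))
             for j, y in enumerate(seqs))
# Via pairwise marginals: sum_i sum_{a,b} f_k(a, b) p(y_{i-1}=a, y_i=b | x)
via_marg = 0.0
for i in range(n):
    for a in labels:
        for b in labels:
            m = sum(p[j] for j, y in enumerate(seqs)
                    if y[i] == b and (0 if i == 0 else y[i - 1]) == a)
            via_marg += f_k(a, b) * m
print(direct, via_marg)             # the two values coincide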
Optimization
Now that we are able to compute the gradient, we can use a descent method to solve for θ. Many algorithms are available (see Sokolovska, 2010; Lavergne et al., 2010).
Improvements and extensions to original CRFs
Regularization
Using the ℓ2-norm
• Redefine the objective function as: L(D; θ) + ||θ||²/(2σ²);
σ²: a free parameter penalizing large weights (as in ridge regression).
Regularization
Using the ℓ1-norm to perform feature selection
• Difficulties:
− The regularizer is not differentiable at zero: specific optimization techniques are needed (Sokolovska, 2010).
− In configurations with groups of highly correlated features, it tends to select one feature in each group at random.
Improvements and extensions to original CRFs Hidden-state CRF model
Motivation
Problem: the CRF model does not support hidden states.
[Figure: a CRF trained from D = {(x(i), y(i))}i maps an input sequence x1 x2 x3 x4 · · · to an output sequence y1 y2 y3 y4 (e.g. the chords Am, Dm7, G7, CMaj7).]
[Figure: the hidden-state CRF inserts a chain of hidden variables hi−1, hi, hi+1 between the observations xi and the labels yi.]
Inference in HCRF
• Using the HCRF model p(y, h|x; θ) = (1/Z(x, θ)) exp( Σ_{j=1}^{D} θj Fj(x, y, h) ) entails being able to compute:
− ŷ = argmax_{y∈Y} p(y|x; θ*), to classify new test cases;
− the partition function Z(x, θ), to evaluate posterior probabilities.
• Let Z′(y, x, θ) ≜ Σ_{h∈Hⁿ} exp( Σ_{j=1}^{D} θj Fj(x, y, h) ): marginalization w.r.t. h.
• We have:
− p(y|x; θ) = Σ_{h∈Hⁿ} p(y, h|x; θ) = Z′(y, x, θ) / Σ_y Z′(y, x, θ);
− Z(x, θ) = Σ_y Z′(y, x, θ).
Negative log-likelihood
L(D; θ) ≜ − Σ_{q=1}^{N} log p(y(q)|x(q); θ)
        = Σ_{q=1}^{N} { log( Σ_y Z′(y, x(q), θ) ) − log Z′(y(q), x(q), θ) };
Z′(y, x, θ) ≜ Σ_{h∈Hⁿ} exp( Σ_{j=1}^{D} θj Fj(x, y, h) )
NLL gradient
∂L(D; θ)/∂θk = Σ_{q=1}^{N} { Σ_{y,h} Fk(x(q), y, h) p(y, h|x(q); θ) − Σ_h Fk(x(q), y(q), h) p(h|y(q), x(q); θ) }
Input: x1 x2 x3 x4 ...
Hidden: h1 h2 h3 h4
Output: y = ’TRUMPET’
Ψ(x, y, h; θ) = Σ_{i=1}^{n} ⟨θ(hi), xi⟩ + Σ_{i=1}^{n} θ(y, hi) + Σ_{i=1}^{n} θ(y, hi−1, hi)
Evaluation
• Classifying 1-second long segments of solo excerpts of Cello, Guitar,
Piano, Bassoon and Oboe.
• Data:
− training set: 2505 segments (i.e. 42’);
− testing set: 2505 segments.
• Classifiers:
− `2 -regularized HCRF with 3 hidden states;
− Linear SVM.
• Features: 47 cepstral, perceptual and temporal features.
• Results:
Classifier        | SVM | HCRF
Average accuracy  | 75% | 76%
Improvements and extensions to original CRFs Other extensions
Other extensions
[Figure: graphical models of three variants — the linear-chain CRF (labels yi over observations xi), the hidden-state CRF (a single sequence label y above a hidden chain hi), and a latent-dynamic CRF (per-position labels yi above hidden states hi).]
Conclusion
Take-home messages I
Take-home messages II
References
Software: e.g. Wapiti (C99) — linear-chain CRFs for NLP, large label and feature sets, various regularization and optimization methods (L-BFGS) (Lavergne et al., 2010).
CRF tutorials
Bibliography I
Y. Altun, A. Smola, and T. Hofmann. Exponential families for conditional random fields. In Proceedings of the 20th
conference on Uncertainty in artificial intelligence, volume 0, 2004. URL
http://dl.acm.org/citation.cfm?id=1036844.
A. L. Berger, S. A. D. Pietra, and Vincent J. Della Pietra. A Maximum Entropy Approach to Natural Language
Processing. Computational Linguistics, 22(1), 1996.
J. A. Burgoyne, L. Pugin, C. Kereliuk, and I. Fujinaga. A cross-validated study of modelling strategies for automatic chord recognition in audio. International Conference on Music Information Retrieval (ISMIR), 2007. URL
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.119.5760&rep=rep1&type=pdf.
M. E. P. Davies and M. D. Plumbley. Context-dependent beat tracking of musical audio. Audio, Speech, and
Language Processing, IEEE Transactions on, 15(3):1009–1020, 2007.
Z. Duan, L. Lu, A. Msra, and H. District. Collective annotation of music from multiple semantic categories. In
Proceedings of the 9th International Conference of Music Information Retrieval, pages 237–242, Philadelphia,
Pennsylvania, USA, 2008.
M. Dunham and K. Murphy. pmtk3: probabilistic modeling toolkit for Matlab/Octave, version 3, 2010. URL
https://code.google.com/p/pmtk3/.
S. Durand. Estimation de la position des temps et des premiers temps d’un morceau de musique. Master thesis,
Universite Pierre et Marie Curie, 2013.
D. Ellis. Beat Tracking by Dynamic Programming. Journal of New Music Research, 36(1):51–60, 2007. ISSN
09298215. URL
http://www.informaworld.com/openurl?genre=article&doi=10.1080/09298210701653344&magic=crossref.
A. Gunawardana, M. Mahajan, A. Acero, and J. Platt. Hidden conditional random fields for phone classification. In
Interspeech, 2005. URL
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.113.8190&rep=rep1&type=pdf.
T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and
Prediction. Springer, 2009. ISBN 0387848576. URL http://books.google.com/books?id=tVIjmNS3Ob8C&pgis=1.
Bibliography II
X. He, R. Zemel, and M. Carreira-Perpinan. Multiscale conditional random fields for image labeling. In IEEE
Computer Society Conference on Computer Vision and Pattern Recognition, CVPR, volume 2, pages 695–702.
IEEE, 2004. ISBN 0-7695-2158-4. doi: 10.1109/CVPR.2004.1315232. URL
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1315232.
W.-T. Hong. Speaker identification using Hidden Conditional Random Field-based speaker models. In 2010
International Conference on Machine Learning and Cybernetics, pages 2811–2816. IEEE, July 2010. ISBN
978-1-4244-6526-2. doi: 10.1109/ICMLC.2010.5580793. URL
http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=5580793&contentType=Conference+
Publications&searchField=Search_All&queryText=Hidden+Conditional+Random+Fields.
V. Imbrasaite, T. Baltrusaitis, and P. Robinson. Emotion tracking in music using continuous conditional random
fields and relative feature representation. In 2013 IEEE International Conference on Multimedia and Expo
Workshops (ICMEW). IEEE, July 2013. ISBN 978-1-4799-1604-7. doi: 10.1109/ICMEW.2013.6618357. URL
http://ieeexplore.ieee.org/articleDetails.jsp?arnumber=6618357.
F. V. Jensen and T. D. Nielsen. Bayesian Networks and Decision Graphs. Springer, 2007. ISBN 0387682813. URL
http://books.google.fr/books/about/Bayesian_Networks_and_Decision_Graphs.html?id=goSLmQq4UBcC&pgis=1.
C. Joder, S. Essid, and G. Richard. A Conditional Random Field Viewpoint of Symbolic Audio-to-Score Matching.
In ACM Multimedia 2010, Florence, Italy, 2010.
C. Joder, S. Essid, and G. Richard. A Conditional Random Field Framework for Robust and Scalable Audio-to-Score
Matching. IEEE Transactions on Audio, Speech and Language Processing, 19(8):2385–2397, 2011.
C. Joder, S. Essid, and G. Richard. Learning Optimal Features for Polyphonic Audio-to-Score Alignment. IEEE
Transactions on Audio Speech and Language Processing, 2013. doi: 10.1109/TASL.2013.2266794.
A. P. Klapuri, A. J. Eronen, and J. T. Astola. Analysis of the meter of acoustic musical signals. Audio, Speech, and
Language Processing, IEEE Transactions on, 14(1):342–355, 2006.
Bibliography III
J. Lafferty, Y. Liu, and X. Zhu. Kernel Conditional Random Fields: Representation, Clique Selection, and
Semi-Supervised Learning. In International Conference on Machine Learning, 2004. URL
http://scholar.google.com/scholar?hl=en&btnG=Search&q=intitle:Kernel+Conditional+Random+Fields+:
+Representation+,+Clique+Selection+,+and+Semi-Supervised+Learning#5.
J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and
labeling sequence data. In Machine learning international workshop then conference, pages 282–289. Citeseer,
2001. URL http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.23.9849&rep=rep1&type=pdf.
T. Lavergne, O. Cappe, and F. Yvon. Practical very large scale CRFs. In 48th Annual Meeting of the Association for
Computational Linguistics (ACL), pages 504–513, Uppsala, Sweden, 2010. URL
http://www.aclweb.org/anthology/P10-1052.
C.-X. Li. Exploiting label correlations for multi-label classification. Master thesis, UC San Diego, 2011.
A. K. McCallum. Mallet: A machine learning for language toolkit, 2002. URL http://mallet.cs.umass.edu.
L.-P. Morency. HCRF: Hidden- state Conditional Random Field Library, 2010. URL
http://sourceforge.net/projects/hcrf/.
L.-P. Morency, A. Quattoni, and T. Darrell. Latent-Dynamic Discriminative Models for Continuous Gesture
Recognition. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, June
2007. ISBN 1-4244-1179-3. doi: 10.1109/CVPR.2007.383299. URL
http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4270324.
J. Morris and E. Fosler-Lussier. Conditional Random Fields for Integrating Local Discriminative Classifiers. TASLP,
16(3):617–628, 2008. ISSN 1558-7916. doi: 10.1109/TASL.2008.916057. URL
http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4451149.
N. Okazaki. CRFsuite: a fast implementation of Conditional Random Fields (CRFs), 2007. URL
http://www.chokkan.org/software/crfsuite/.
Bibliography IV
S. Ortmanns, H. Ney, and A. Eiden. Language-model look-ahead for large vocabulary speech recognition. In
Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP ’96, volume 4, pages
2095–2098. IEEE, 1996. ISBN 0-7803-3555-4. doi: 10.1109/ICSLP.1996.607215. URL
http://ieeexplore.ieee.org/articleDetails.jsp?arnumber=607215.
T. Qin and T. Liu. Global ranking using continuous conditional random fields. In Advances in Neural and Information
Processing Systems, 2008. URL http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2008_0518.pdf.
A. Quattoni, S. Wang, L.-P. Morency, M. Collins, and T. Darrell. Hidden conditional random fields. IEEE
transactions on pattern analysis and machine intelligence, 29(10):1848–53, Oct. 2007. ISSN 0162-8828. doi:
10.1109/TPAMI.2007.1124. URL http://www.ncbi.nlm.nih.gov/pubmed/17699927.
S. Reiter, B. Schuller, and G. Rigoll. Hidden Conditional Random Fields for Meeting Segmentation. In Multimedia
and Expo, 2007 IEEE International Conference on, pages 639–642. IEEE, July 2007. ISBN 1-4244-1016-9. doi:
10.1109/ICME.2007.4284731. URL
http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=4284731&contentType=Conference+
Publications&searchField=Search_All&queryText=Hidden+Conditional+Random+Fields.
O. Rudovic, V. Pavlovic, and M. Pantic. Kernel conditional ordinal random fields for temporal segmentation of facial
action units. In Computer Vision - ECCV, pages 1–10, 2012. URL
http://link.springer.com/chapter/10.1007/978-3-642-33868-7_26.
S. Sarawagi and W. Cohen. Semi-markov conditional random fields for information extraction. In Advances in
Neural Information Processing Systems, 2005. URL
http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2005_427.pdf.
E. Schmidt and Y. Kim. Modeling musical emotion dynamics with conditional random fields. In International
Conference on Music Information Retrieval (ISMIR), pages 777–782, Miami, FL, USA, 2011. URL
http://music.ece.drexel.edu/files/Navigation/Publications/Schmidt2011c.pdf.
M. Schmidt. crfChain, 2008. URL http://www.di.ens.fr/~mschmidt/Software/crfChain.html.
Bibliography V
B. Settles. Biomedical named entity recognition using conditional random fields and rich feature sets. International
Joint Workshop on Natural Language Processing in Biomedicine and its Applications, 2004. URL
http://dl.acm.org/citation.cfm?id=1567618.
N. Sokolovska. Contributions to the estimation of probabilistic discriminative models: semi-supervised learning and
feature selection. PhD thesis, Telecom ParisTech, 2010.
Y.-H. Sung and D. Jurafsky. Hidden Conditional Random Fields for phone recognition. 2009 IEEE Workshop on
Automatic Speech Recognition & Understanding, pages 107–112, Dec. 2009. doi:
10.1109/ASRU.2009.5373329. URL
http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5373329.
T. Kudo. CRF++, 2003. URL http://crfpp.googlecode.com/svn/trunk/doc/index.html.
B. Taskar, P. Abbeel, and D. Koller. Discriminative probabilistic models for relational data. In Proceedings of the
Eighteenth conference on Uncertainty in Artificial Intelligence (UAI02), 2002. URL
http://dl.acm.org/citation.cfm?id=2073934.
H. M. Wallach. Efficient Training of Conditional Random Fields. PhD thesis, University of Edinburgh, 2002.
S. B. Wang, A. Quattoni, L.-P. Morency, and D. Demirdjian. Hidden Conditional Random Fields for Gesture
Recognition. pages 1521–1527. IEEE Computer Society, 2006. ISBN 0-7695-2597-0. URL
http://portal.acm.org/citation.cfm?id=1153171.1153627.
G. Yuan, K. Chang, C. Hsieh, and C. Lin. A comparison of optimization methods and software for large-scale
l1-regularized linear classification. The Journal of Machine Learning Research, 2010. URL
http://dl.acm.org/citation.cfm?id=1953034.
J. Zhu and T. Hastie. Support vector machines, kernel logistic regression and boosting. Multiple Classifier Systems,
2364:16–26, 2002. URL http://link.springer.com/chapter/10.1007/3-540-45428-4_2.
Appendix
– Optimization with stochastic gradient learning
– Comparing KLR and SVM
– Derivation of the Viterbi algorithm
– The forward-backward method
Appendix Optimization with stochastic gradient learning
Algorithm
- Initialise w̃
- Repeat (until convergence):
  - Randomly permute the training examples (σ denotes the permutation)
  - For i = 1 : N:
      wj ← wj + ηt (yσ(i) − pσ(i)) xj,σ(i); j = 1, · · · , D (ηt: step size)
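A Python sketch of this update rule for binary logistic regression (step size, epoch count and data are made up for the example):

import numpy as np

def sgd_logreg(X, y, lr=0.1, epochs=50, seed=0):
    # X: (N, D+1) with a leading column of ones; y: labels in {0, 1}.
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])                 # initialise w-tilde
    for _ in range(epochs):                  # repeat until convergence
        for i in rng.permutation(len(y)):    # randomly permute the examples
            p = 1.0 / (1.0 + np.exp(-w @ X[i]))
            w += lr * (y[i] - p) * X[i]      # w <- w + eta (y_s(i) - p_s(i)) x_s(i)
    return w

X = np.hstack([np.ones((4, 1)), np.array([[0.2], [0.4], [0.6], [0.8]])])
y = np.array([0.0, 1.0, 0.0, 1.0])
print(sgd_logreg(X, y))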
The SVM optimization problem can be written:
min_w̃ Σ_{i=1}^{N} [ 1 − y′i g(xi) ]+ + (γ/2)||g||²_HK ; y′i ∈ {−1, 1}
[Figure: the hinge loss [1 − y′g(x)]+ as a function of y′g(x).]
Appendix Comparing KLR and SVM
KLR vs SVM
• Let y′i = 1 if yi = 1, and y′i = −1 if yi = 0.
• The negative log-likelihood of the KLR model can then be written as L(D; w̃) = Σ_{i=1}^{N} log(1 + exp −y′i g(xi)).
• Both models solve a problem of the form:
min_w̃ Σ_{i=1}^{N} l(y′i g(xi)) + (λ/2)||g||²_HK ;
with, for KLR: l(y′i g(xi)) = log(1 + exp −y′i g(xi)); for SVM: l(y′i g(xi)) = [1 − y′i g(xi)]+.
Hinge vs negative binomial log-likelihood
[Figure: the hinge loss and the negative binomial log-likelihood (NLL) plotted as functions of y′g(x).]
Appendix Derivation of the Viterbi algorithm
Let gi(yi−1, yi) ≜ Σ_{j=1}^{D} θj fj(yi−1, yi, x, i); then:
ŷ = argmax_y Σ_{i=1}^{n} Σ_{j=1}^{D} θj fj(yi−1, yi, x, i) = argmax_y Σ_{i=1}^{n} gi(yi−1, yi).
Let δm(s) ≜ max_{y1,··· ,ym−1} [ Σ_{i=1}^{m−1} gi(yi−1, yi) + gm(ym−1, s) ]
          = max_{ym−1} [ ( max_{y1,··· ,ym−2} Σ_{i=1}^{m−1} gi(yi−1, yi) ) + gm(ym−1, s) ].
Viterbi decoding
Now: δm−1(ym−1) = max_{y1,··· ,ym−2} [ Σ_{i=1}^{m−2} gi(yi−1, yi) + gm−1(ym−2, ym−1) ]
                = max_{y1,··· ,ym−2} Σ_{i=1}^{m−1} gi(yi−1, yi),
so the inner maximization above is exactly δm−1(ym−1), which yields the recursion δm(s) = max_{ym−1} [ δm−1(ym−1) + gm(ym−1, s) ].
Viterbi decoding
The algorithm
Initialization: ∀s ∈ Y
δ1(s) = g1(y0, s); y0 = start
ψ1(s) = start
Recursion: ∀s ∈ Y; 2 ≤ m ≤ n
δm(s) = max_{y∈Y} [ δm−1(y) + gm(y, s) ]
ψm(s) = argmax_{y∈Y} [ δm−1(y) + gm(y, s) ]
Termination:
δn(y*n) = max_{y∈Y} δn(y)
y*n = argmax_{y∈Y} δn(y)
Path backtracking:
y*m = ψm+1(y*m+1); m = n − 1, n − 2, · · · , 1.
Appendix The forward-backward method
Appendix The forward-backward method
X
β m (ym ) = Mm+1 (ym , ym+1 )β m+1 (ym+1 ) ; 1 ≤ m ≤ n − 1
ym+1 ∈Y
β n (yn ) = 1
Marginal probability
p(ym−1, ym|x) = Σ_{y\{ym−1,ym}} p(y|x);
y\{ym−1, ym} ≜ {y1, · · · , ym−2, ym+1, · · · , yn}.

p(ym−1, ym|x) = (1/Z(x)) Σ_{y\{ym−1,ym}} Π_{i=1}^{n} Mi(yi−1, yi, x)
 = (1/Z(x)) Σ_{y\{ym−1,ym}} [ Π_{i=1}^{m−1} Mi(yi−1, yi, x) ] × Mm(ym−1, ym, x) × Π_{i=m+1}^{n} Mi(yi−1, yi, x)
Marginal probability
p(ym−1, ym|x) = (1/Z(x)) Mm(ym−1, ym, x) × [ Σ_{y1,··· ,ym−2} Π_{i=1}^{m−1} Mi(yi−1, yi, x) ] × [ Σ_{ym+1,··· ,yn} Π_{i=m+1}^{n} Mi(yi−1, yi, x) ]
p(ym−1, ym|x) = (1/Z(x)) αm−1(ym−1) Mm(ym−1, ym, x) βm(ym).
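A Python sketch of the full forward-backward computation of these pairwise marginals (reducing the first transition matrix to a single start row is a convention of this sketch; the matrices are random for the example):

import numpy as np

def pairwise_marginals(M):
    # M: shape (n, K, K); M[i, a, b] plays the role of M_{i+1}(a, b, x),
    # with the i = 0 slice read only through its start row a = 0.
    n, K, _ = M.shape
    alpha = np.zeros((n, K)); beta = np.ones((n, K))
    alpha[0] = M[0, 0, :]                    # alpha_1 from the start state
    for i in range(1, n):                    # forward recursion
        alpha[i] = alpha[i - 1] @ M[i]
    for i in range(n - 2, -1, -1):           # backward recursion
        beta[i] = M[i + 1] @ beta[i + 1]
    Z = alpha[-1].sum()                      # partition function Z(x)
    # p(y_{m-1}=a, y_m=b | x) = alpha_{m-1}(a) M_m(a, b) beta_m(b) / Z
    return np.array([np.outer(alpha[m - 1], beta[m]) * M[m] / Z
                     for m in range(1, n)]), Z

rng = np.random.default_rng(2)
marg, Z = pairwise_marginals(np.exp(rng.normal(size=(4, 3, 3))))
print(marg[0].sum(), marg[1].sum())          # each pairwise marginal sums to 1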