
Pattern Recognition (master's course), School of Artificial Intelligence, University of Chinese Academy of Sciences

Chapter 3: Parameter Estimation (continued)

Lecturer: Cheng-Lin Liu (liucl@nlpr.ia.ac.cn)
September 29, 2021
Teaching assistants: Mengbiao Zhao (zhaomengbiao2017@ia.ac.cn)
Hongyu Guo (guohongyu2019@nlpr.ia.ac.cn)
Fei Zhu (zhufei2018@ia.ac.cn)
Framework of Pattern Classification Based on Bayesian Decision

(Overview diagram: a classifier is designed with Bayesian decision rules, i.e., minimum-risk decision or maximum a posteriori (MAP) decision, using discriminant functions based on probability densities; this requires estimating the class prior probabilities and the class-conditional probability densities. Density estimation methods are parametric (Gaussian and Gamma densities, Bernoulli, binomial and multinomial distributions, estimated by maximum likelihood or Bayesian estimation; classifiers such as LDF and QDF), semi-parametric (Gaussian mixture densities estimated with the Expectation-Maximization (EM) algorithm), or non-parametric (histogram, Parzen window, k-nearest-neighbor estimation, k-NN classification). Related topics include compound pattern recognition (hidden Markov models (HMM), Markov random fields (MRF)) and multiple classifier fusion.)
Too Many Formulas: How to Cope?
• Keep the big picture in mind
  – First the whole, then the parts, then back to the whole; simplify first, then return to the general case
  – Understanding the concepts matters most!
    • Grasp the physical meaning of the feature space, the symbols and the formulas, and build intuition
    • To understand the physical meaning in high-dimensional spaces, simplify to low dimensions first, then generalize back to high dimensions
  – Pay attention to the differences and connections (commonalities) between methods
  – Understand the concepts first, then go into the details and mathematical proofs
    • For the main methods, understand the principle, the procedure and the conclusions
    • Complicated proof steps can be skipped; remember the conclusions
  – Know the details in simple cases, e.g., the derivation of the maximum likelihood estimate of a Gaussian density
  – For maximum likelihood estimation of a Gaussian mixture (EM algorithm), know the main steps (E-step, M-step)
  – Be able to write out the full derivation for low-dimensional spaces and simple models; this is not required for high-dimensional or complex models
  – The ability to do mathematical analysis (formalization) and proofs is important for innovative research, but it is impossible (there is not enough time) to work through every detail
    • Make good use of existing concepts, principles and conclusions; understanding them and being able to apply them is the foundation

Review of the Previous Lecture
• Bayesian decision for discrete variables
• Compound pattern classification

• Maximum likelihood parameter estimation
  – The Gaussian case
• Bayesian estimation
  – Parameter space vs. feature space: their differences and connections

Outline
• Chapter 3: Parameter Estimation
  (parametric and semi-parametric methods for Bayesian classification)
  – The feature dimensionality problem
    • Overfitting
    • Extension: feature dimensionality in open-set classification
  – Expectation-Maximization (EM)
    • The general case
    • EM for Gaussian mixture
  – Hidden Markov models

The Feature Dimensionality Problem
• Statistical pattern classification
  – Partitioning of the feature space
  – Bayesian decision: minimum-risk rule, MAP

• What do we gain by adding features?
  – Discriminability: features that differ between classes help classification
• What problems do added features bring?
  – Computation
  – Storage
  – Generalization performance, overfitting

Classification Error Rate vs. Features
• Two-class Gaussian distributions
  – p(\mathbf{x} \mid \omega_j) \sim N(\mu_j, \Sigma), \; j = 1, 2, equal covariance matrices
  – Bayes error rate:
    P(e) = \frac{1}{\sqrt{2\pi}} \int_{r/2}^{\infty} e^{-u^2/2}\, du, \qquad r^2 = (\mu_1 - \mu_2)^T \Sigma^{-1} (\mu_1 - \mu_2)
    (figure: the two class densities and the decision regions R_1, R_2)

• Two-class Gaussian distributions
  – Conditionally independent case: r^2 = \sum_{i=1}^{d} \left( \frac{\mu_{i1} - \mu_{i2}}{\sigma_i} \right)^2
    • In each dimension, the distance between the two class means (relative to the standard deviation) reflects that feature's discriminability and its contribution to the error rate
    • Adding features helps reduce the error rate, since r^2 increases (see the numerical sketch below)

• An example where feature dimensionality determines separability
  – The classes are completely separable in the 3D space
  – Their projections onto 2D and 1D subspaces overlap

However, adding features may also degrade classification performance,
because of model estimation error (wrong model).
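A minimal numerical sketch (not from the slides) of how the Bayes error shrinks as features are added in the conditionally independent, equal-covariance case. The per-dimension mean gaps are made-up illustrative values; the error is computed as P(e) = 1 - Φ(r/2).

```python
# Made-up per-dimension mean gaps; unit variances assumed for illustration.
import math

def bayes_error(r):
    # 1 - Phi(r/2), written with the complementary error function
    return 0.5 * math.erfc(r / (2.0 * math.sqrt(2.0)))

mean_gaps = [1.0, 0.8, 0.5, 0.3, 0.1]   # (mu_i1 - mu_i2) / sigma_i for feature i
r2 = 0.0
for d, gap in enumerate(mean_gaps, start=1):
    r2 += gap ** 2
    r = math.sqrt(r2)
    print(f"d = {d}: r = {r:.3f}, Bayes error = {bayes_error(r):.4f}")
# Each added feature with a nonzero mean gap increases r and lowers the Bayes error.
```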

Computational Complexity
• Maximum likelihood estimation
  – Gaussian distribution, d-dimensional features, n samples
  – The complexity of parameter estimation is dominated by the estimation of Σ (the sample covariance costs O(nd^2))

• Parameter storage complexity
  c\left(d + \frac{d(d+1)}{2}\right)  for c classes (a mean vector plus a symmetric covariance matrix per class)
• Classification complexity?
  – Computing the inverse of the covariance matrix is relatively expensive, in general O(d^3)
Overfitting
• Overfitting
  – High feature dimensionality together with few training samples makes the estimated model parameters inaccurate
    • For example, estimating a covariance matrix requires more than d samples
• How to overcome it
  – Dimensionality reduction: feature extraction (transformation), feature selection
  – Parameter sharing / smoothing (see the sketch after this list)
    • Shared covariance matrix Σ_0
    • Shrinkage (a.k.a. Regularized Discriminant Analysis)
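A minimal sketch (not the course code) of covariance shrinkage as used in Regularized Discriminant Analysis: each class covariance is interpolated with a shared covariance and, optionally, with a scaled identity. The mixing weights alpha and beta are hypothetical hyperparameters, usually chosen by cross-validation.

```python
import numpy as np

def shrink_covariance(sigma_k, sigma_0, alpha=0.5, beta=0.1):
    """Return a regularized covariance for one class.

    sigma_k : (d, d) class-specific sample covariance
    sigma_0 : (d, d) covariance shared by all classes (e.g., pooled covariance)
    """
    d = sigma_k.shape[0]
    # Step 1: shrink the class covariance toward the shared covariance.
    sigma = (1.0 - alpha) * sigma_k + alpha * sigma_0
    # Step 2: shrink further toward a multiple of the identity, keeping it well conditioned.
    sigma = (1.0 - beta) * sigma + beta * (np.trace(sigma) / d) * np.eye(d)
    return sigma

# Usage with random data standing in for two classes' samples:
rng = np.random.default_rng(0)
X1, X2 = rng.normal(size=(20, 5)), rng.normal(size=(15, 5))
S1, S2 = np.cov(X1, rowvar=False), np.cov(X2, rowvar=False)
S0 = ((len(X1) - 1) * S1 + (len(X2) - 1) * S2) / (len(X1) + len(X2) - 2)  # pooled
S1_reg = shrink_covariance(S1, S0)
```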

• An example of overfitting
  (figure: a 10th-order polynomial fitted to a small set of training points)

Extension: Feature Dimensionality in Open-Set Classification
• The open-set classification problem
  – Known classes: \omega_i, \; i = 1, \ldots, c, plus an unknown class \omega_{c+1}
  – Posterior probabilities: \sum_{i=1}^{c+1} P(\omega_i \mid \mathbf{x}) = 1
  – \omega_{c+1} has no training samples; test samples from it should be rejected as outliers
• The feature dimensionality issue
  – Distinguishing c+1 classes requires more features than distinguishing c classes
  – If the classifier is trained only to separate the c known classes, outliers are easily confused with known-class samples at test time
  – Therefore, when training a classifier on samples of the c known classes, the feature representation should retain the ability to distinguish more classes
    • For example, add a data-reconstruction loss (similar to an auto-encoder) as a regularization term when training a neural network, or generate some imaginary-class samples (by combining samples of known classes)

Expectation-Maximization (EM)
• Parameter estimation with missing data
  – Good features x_g, missing/bad features x_b

  – Given the current parameter value θ^i, estimate a new value θ
    • Take the expectation over the missing data (marginalize), then maximize:
      Q(\theta; \theta^i) = E_{x_b}\left[ \ln p(x_g, x_b; \theta) \mid x_g; \theta^i \right], \qquad \theta^{i+1} = \arg\max_{\theta} Q(\theta; \theta^i)

• Expectation-Maximization (EM)
  – The EM algorithm guarantees that the log-likelihood of the good data increases monotonically.

• Example: EM for a 2D Gaussian
  (figures: the Gaussian obtained from the initial parameters and the estimate after 3 EM iterations, for a small data set in which the first component x_{41} of sample x_4 is unknown; what would change if the complete sample were x_4 = (1, 4)^T?)
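A minimal sketch (assumed setup, not the lecture's exact example) of EM for fitting a single 2D Gaussian when one coordinate of one sample is missing. The data values below are made up; np.nan marks the missing entry x_41.

```python
import numpy as np

X = np.array([[0.0, 2.0],
              [1.0, 0.0],
              [2.0, 2.0],
              [np.nan, 4.0]])   # x_4: first component unknown

mu = np.nanmean(X, axis=0)                          # crude initialization
sigma = np.diag(np.nanvar(X, axis=0) + 1e-3)

for it in range(20):
    # E-step: expected sufficient statistics E[x] and E[x x^T] for each sample.
    Ex = X.copy()
    Exx = np.zeros((len(X), 2, 2))
    for n, x in enumerate(X):
        miss = np.isnan(x)
        if miss.any():
            m, o = np.where(miss)[0], np.where(~miss)[0]
            # Conditional Gaussian: E[x_m | x_o] and Var[x_m | x_o]
            k = sigma[np.ix_(m, o)] @ np.linalg.inv(sigma[np.ix_(o, o)])
            Ex[n, m] = mu[m] + (k @ (x[o] - mu[o]))
            cond_var = sigma[np.ix_(m, m)] - k @ sigma[np.ix_(o, m)]
            Exx[n] = np.outer(Ex[n], Ex[n])
            Exx[n][np.ix_(m, m)] += cond_var
        else:
            Exx[n] = np.outer(x, x)
    # M-step: update mean and covariance from the expected statistics.
    mu = Ex.mean(axis=0)
    sigma = Exx.mean(axis=0) - np.outer(mu, mu)

print("mu =", mu)
print("sigma =\n", sigma)
```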

• EM for Gaussian mixture
  – A parametric probability density function that can represent complex distributions:
    p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\, p(\mathbf{x} \mid \theta_k), \qquad \text{subject to } \sum_{k=1}^{K} \pi_k = 1

  • Gaussian components: p(\mathbf{x} \mid \theta_k) = N(\mathbf{x} \mid \mu_k, \Sigma_k)
  – Parameter estimation: Maximum Likelihood (ML)
    \max \; LL = \log \prod_{n=1}^{N} p(\mathbf{x}_n) = \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k\, p(\mathbf{x}_n \mid \theta_k)
    \nabla_{\pi_k} LL = 0, \quad \nabla_{\mu_k} LL = 0, \quad \nabla_{\Sigma_k} LL = 0

  These equations cannot be solved analytically (the logarithm of a sum couples the components).
• EM Algorithm for Gaussian mixture
  – Incomplete data X, complete data {X, Z}
    z_{nk} \in \{0, 1\}, \; k = 1, \ldots, K  (the component assignments Z are the missing data)
  – Expectation of the complete-data log-likelihood:
    Q(\Theta, \Theta^{old}) = \sum_{Z} \left[\log p(X, Z \mid \Theta)\right] p(Z \mid X, \Theta^{old})

  1. Choose an initial set of parameters Θ^{old}
  2. Do
     E-step: Evaluate p(Z | X, Θ^{old})
     M-step: Update the parameters
       \Theta^{new} = \arg\max_{\Theta} Q(\Theta, \Theta^{old})
     If the convergence condition is not satisfied: Θ^{old} ← Θ^{new}
  3. End
(C.M. Bishop, Pattern Recognition and Machine Learning, 2006)
• EM Algorithm for Gaussian mixture
  E-step:
    p(X, Z \mid \Theta) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}}\, N(\mathbf{x}_n \mid \mu_k, \Sigma_k)^{z_{nk}}
    Q(\Theta, \Theta^{old}) = E_Z[\log p(X, Z \mid \Theta)] = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \left\{\log \pi_k + \log N(\mathbf{x}_n \mid \mu_k, \Sigma_k)\right\}
    \gamma(z_{nk}) = P(z_{nk} = 1 \mid \mathbf{x}_n) = \frac{\pi_k\, N(\mathbf{x}_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, N(\mathbf{x}_n \mid \mu_j, \Sigma_j)}

  M-step: \nabla_{\mu_k} Q = 0, \quad \nabla_{\Sigma_k} Q = 0, \quad \nabla_{\pi_k} Q = 0
    \mu_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, \mathbf{x}_n
    \Sigma_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (\mathbf{x}_n - \mu_k^{new})(\mathbf{x}_n - \mu_k^{new})^T
    N_k = \sum_{n=1}^{N} \gamma(z_{nk}), \qquad \pi_k^{new} = \frac{N_k}{N}
  (a minimal implementation sketch follows)
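A minimal sketch (assumed data and initialization, not the lecture's code) of the EM updates above for a Gaussian mixture with full covariances.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=50, seed=0):
    N, d = X.shape
    rng = np.random.default_rng(seed)
    # Initialization: random means from the data, shared covariance, uniform weights.
    mu = X[rng.choice(N, K, replace=False)]
    sigma = np.array([np.cov(X, rowvar=False) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: responsibilities gamma(z_nk).
        dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], sigma[k])
                                for k in range(K)])
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update the parameters from the responsibilities.
        Nk = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
        pi = Nk / N
    return pi, mu, sigma

# Usage with synthetic two-component data:
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 1.0, size=(200, 2)),
               rng.normal([4, 4], 0.5, size=(200, 2))])
pi, mu, sigma = em_gmm(X, K=2)
```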

An example is given in (C.M. Bishop, Pattern Recognition and Machine Learning, 2006, Figure 9.8).

Break
Hidden Markov Models
• Sequential (Temporal) Patterns
  – Variable length
  – Distortion
  – Ambiguous boundaries between primitives (symbols)
• Bayesian Classification
  – Sequence of patterns (observations): O = O_1 O_2 \cdots O_T
  – Sequence of classes (states): q = q_1 q_2 \cdots q_T
  – Posterior probability: P(q \mid O) = \frac{p(O \mid q)\, P(q)}{p(O)}
• Hidden Markov Model (HMM)
  – Models p(O|q), p(O,q)
Markov Chain
• Sequence of states (classes), q_t \in \{S_1, \ldots, S_N\}:
  P(q_1 q_2 \cdots q_T) = P(q_1) P(q_2 \mid q_1) P(q_3 \mid q_1 q_2) \cdots P(q_T \mid q_1 \cdots q_{T-1})

  – First-order Markov chain:
    P(q_t = S_j \mid q_{t-1} = S_i, q_{t-2} = S_k, \ldots) = P(q_t = S_j \mid q_{t-1} = S_i)
    P(q_1 q_2 \cdots q_T) = P(q_1) P(q_2 \mid q_1) P(q_3 \mid q_2) \cdots P(q_T \mid q_{T-1})

  – State transition probabilities:
    a_{ij} = P(q_t = S_j \mid q_{t-1} = S_i), \quad 1 \le i, j \le N, \qquad \sum_{j=1}^{N} a_{ij} = 1

L.R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proc. IEEE, 77(2): 257-286, 1989.
  – State duration (self-transition): the same state S_i is observed for d consecutive steps,
    O = \{S_i, S_i, S_i, \ldots, S_i, S_j \ne S_i\} \quad (\text{positions } 1, 2, 3, \ldots, d, d+1)
    P(O \mid \text{Model}, q_1 = S_i) = (a_{ii})^{d-1} (1 - a_{ii}) = p_i(d)

  – Expected duration of a specific state:
    \bar{d}_i = \sum_{d=1}^{\infty} d\, p_i(d) = \sum_{d=1}^{\infty} d\, (a_{ii})^{d-1} (1 - a_{ii}) = \frac{1}{1 - a_{ii}}

• Example: transitions of the weather
  State 1: rain (or snow), State 2: cloudy, State 3: sunny
  A = \{a_{ij}\} = \begin{bmatrix} 0.4 & 0.3 & 0.3 \\ 0.2 & 0.6 & 0.2 \\ 0.1 & 0.1 & 0.8 \end{bmatrix}

  Expected number of consecutive sunny days and cloudy days? (see the sketch below)
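A small check (not from the slides) of the expected-duration formula d_i = 1 / (1 - a_ii) applied to the weather transition matrix above.

```python
import numpy as np

A = np.array([[0.4, 0.3, 0.3],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])
states = ["rain", "cloudy", "sunny"]

for name, a_ii in zip(states, np.diag(A)):
    print(f"{name}: expected duration = {1.0 / (1.0 - a_ii):.2f} days")
# cloudy: 1/(1-0.6) = 2.5 days, sunny: 1/(1-0.8) = 5 days
```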


Hidden Markov Model (HMM)
• Markov chain: the states are observable
• Hidden states: an example
  – Imagine you are in a windowless room and cannot see the weather outside. Instead, you have to guess the weather from the temperature and humidity in the room.
    • Observations: temperature, humidity
    • Hidden states: weather
  – Hidden Markov Model (HMM): a doubly embedded stochastic process
    (diagram: observations O_1 O_2 \cdots O_T with distribution P(O_1, O_2, \ldots, O_T) are emitted by hidden states q_1 q_2 \cdots q_T with distribution P(q_1, q_2, \ldots, q_T), \; q_i \in \{S_1, S_2, \ldots, S_N\})
    • Infer the states from the observations
• Elements of an HMM λ = (A, B, π)
  – N: number of states in the model, S = {S_1, S_2, ..., S_N}
  – M: number of observation symbols, V = {v_1, v_2, ..., v_M}
  – State transition probability distribution A = {a_{ij}}:
    a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i), \quad 1 \le i, j \le N
  – Observation symbol (emission) probability distribution B = {b_j(k)}:
    b_j(k) = P(v_k \text{ at } t \mid q_t = S_j), \quad 1 \le j \le N, \; 1 \le k \le M
  – Initial state distribution π = {π_i}:
    \pi_i = P(q_1 = S_i), \quad 1 \le i \le N

• Three Basic Problems of HMMs
  – Problem 1 (Evaluation):
    How to efficiently compute the probability P(O|λ) of an observation sequence
  – Problem 2 (Decoding):
    How to choose the best state sequence corresponding to an observation sequence
  – Problem 3 (Training):
    How to estimate the model parameters

Evaluation Problem
• Given the model λ = (A, B, π) and an observation sequence O = O_1 O_2 \cdots O_T, compute P(O | λ)
  – Direct computation:
    P(O \mid \lambda) = \sum_{\text{all } Q} P(O \mid Q, \lambda)\, P(Q \mid \lambda)
                     = \sum_{q_1, q_2, \ldots, q_T} \pi_{q_1} b_{q_1}(O_1)\, a_{q_1 q_2} b_{q_2}(O_2) \cdots a_{q_{T-1} q_T} b_{q_T}(O_T)

    Conditional independence of observations:
      P(O \mid Q, \lambda) = \prod_{t=1}^{T} P(O_t \mid q_t, \lambda) = b_{q_1}(O_1)\, b_{q_2}(O_2) \cdots b_{q_T}(O_T)
    Markov chain of states:
      P(Q \mid \lambda) = \pi_{q_1} a_{q_1 q_2} a_{q_2 q_3} \cdots a_{q_{T-1} q_T}

  – Complexity: O(2T N^T)!

• Evaluation: the Forward Procedure
  – Define the forward variable: \alpha_t(i) = P(O_1 O_2 \cdots O_t, q_t = S_i \mid \lambda)
  – Initialization: \alpha_1(i) = \pi_i b_i(O_1), \quad 1 \le i \le N
  – Induction:
    \alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i)\, a_{ij} \right] b_j(O_{t+1}), \quad 1 \le t \le T-1, \; 1 \le j \le N
  – Termination:
    P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)
  – Complexity: O(TN^2) (a minimal code sketch follows)
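A minimal sketch (assumed array conventions, not the lecture's code) of the forward procedure for a discrete-observation HMM; O holds symbol indices.

```python
import numpy as np

def forward(pi, A, B, O):
    """alpha[t, i] = P(O_1..O_{t+1}, q_{t+1} = S_i | lambda)  (0-based t)."""
    T, N = len(O), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]                            # initialization
    for t in range(T - 1):
        alpha[t + 1] = (alpha[t] @ A) * B[:, O[t + 1]]    # induction
    return alpha, alpha[-1].sum()                         # termination: P(O | lambda)

# Usage with the 3-state weather model and 2 observation symbols (hypothetical B):
pi = np.array([1/3, 1/3, 1/3])
A = np.array([[0.4, 0.3, 0.3], [0.2, 0.6, 0.2], [0.1, 0.1, 0.8]])
B = np.array([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]])        # made-up emission probabilities
alpha, prob = forward(pi, A, B, O=[0, 1, 1, 0])
```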

• Evaluation: the Backward Procedure
  – Define the backward variable: \beta_t(i) = P(O_{t+1}, \ldots, O_T \mid q_t = S_i, \lambda)
  – Initialization: \beta_T(i) = 1, \quad 1 \le i \le N
  – Induction:
    \beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j), \quad 1 \le t \le T-1, \; 1 \le i \le N
  – Termination:
    P(O \mid \lambda) = \sum_{i=1}^{N} \pi_i b_i(O_1)\, \beta_1(i) = \sum_{i=1}^{N} \alpha_1(i)\, \beta_1(i)
  – Complexity? (also O(TN^2); see the sketch below)
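A companion sketch to the forward procedure above: the backward recursion under the same assumed array conventions (not the lecture's code).

```python
import numpy as np

def backward(A, B, O):
    """beta[t, i] = P(O_{t+2}..O_T | q_{t+1} = S_i, lambda)  (0-based t)."""
    T, N = len(O), A.shape[0]
    beta = np.ones((T, N))                                  # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])        # induction
    return beta

# P(O | lambda) can be recovered as sum_i pi_i * B[i, O[0]] * beta[0, i].
```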

Decoding Problem
• This is pattern recognition
• Optimal sequence of states: maximizing the posterior is equivalent to maximizing the joint probability,
  \max_{q_1 q_2 \cdots q_T} P(q_1 q_2 \cdots q_T \mid O, \lambda) \;\Leftrightarrow\; \max_{q_1 q_2 \cdots q_T} P(q_1 q_2 \cdots q_T, O \mid \lambda)

• Viterbi Algorithm (DP: dynamic programming)
  – Define the variable:
    \delta_t(i) = \max_{q_1, q_2, \ldots, q_{t-1}} P(q_1 q_2 \cdots q_{t-1}, q_t = S_i, O_1 O_2 \cdots O_t \mid \lambda)
  – DP recursion:
    \delta_{t+1}(j) = \left[ \max_i \delta_t(i)\, a_{ij} \right] b_j(O_{t+1})
  – Initialization:
    \delta_1(i) = \pi_i b_i(O_1), \quad \psi_1(i) = 0, \quad 1 \le i \le N

• Viterbi Algorithm (cont.)
  – Recursion:
    \delta_t(j) = \left[ \max_{1 \le i \le N} \delta_{t-1}(i)\, a_{ij} \right] b_j(O_t), \qquad \psi_t(j) = \arg\max_{1 \le i \le N} \delta_{t-1}(i)\, a_{ij}, \qquad 2 \le t \le T, \; 1 \le j \le N
  – Termination:
    P^* = \max_{1 \le i \le N} \delta_T(i), \qquad q_T^* = \arg\max_{1 \le i \le N} \delta_T(i)
  – Backtracking:
    q_t^* = \psi_{t+1}(q_{t+1}^*), \quad t = T-1, \ldots, 1
  – Complexity: O(TN^2) (a minimal code sketch follows)
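A minimal sketch (assumed array conventions, matching the forward sketch above) of the Viterbi algorithm for a discrete-observation HMM.

```python
import numpy as np

def viterbi(pi, A, B, O):
    T, N = len(O), len(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, O[0]]                       # initialization
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A           # scores[i, j] = delta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)               # best predecessor for each state j
        delta[t] = scores.max(axis=0) * B[:, O[t]]   # recursion
    # Termination and backtracking of the best state sequence.
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1][path[t + 1]]
    return path, delta[-1].max()

# Usage with the same hypothetical model as the forward sketch:
pi = np.array([1/3, 1/3, 1/3])
A = np.array([[0.4, 0.3, 0.3], [0.2, 0.6, 0.2], [0.1, 0.1, 0.8]])
B = np.array([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]])
best_path, best_score = viterbi(pi, A, B, O=[0, 1, 1, 0])
```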
• Appendix: the Dynamic Programming (DP) Principle (Bellman's Principle of Optimality)
  – The best path through a particular intermediate place is the best way from the start to that place, followed by the best way from that place to the goal.
  – Implication: among the multiple ways of reaching an intermediate place, only the best one needs to be retained.
  – Often used in sequence matching and in HMMs.

Training Problem
• Maximum Likelihood (ML): \max_{A, B, \pi} P(O \mid \lambda)
• Baum-Welch Algorithm (EM):
  \max_{\lambda} Q(\lambda, \bar{\lambda}) = \sum_{Q} \left[\log P(Q, O \mid \lambda)\right] P(Q, O \mid \bar{\lambda})

  – Define the variable \xi_t(i, j) = P(q_t = S_i, q_{t+1} = S_j \mid O, \lambda):
    \xi_t(i, j) = \frac{P(O, q_t = S_i, q_{t+1} = S_j \mid \lambda)}{P(O \mid \lambda)}
               = \frac{\alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)}

  – Define the probability \gamma_t(i) = P(q_t = S_i \mid O, \lambda):
    \gamma_t(i) = \frac{\alpha_t(i)\, \beta_t(i)}{P(O \mid \lambda)} = \frac{\alpha_t(i)\, \beta_t(i)}{\sum_{i=1}^{N} \alpha_t(i)\, \beta_t(i)} = \sum_{j=1}^{N} \xi_t(i, j)

• Baum-Welch Algorithm (cont.)
  – Re-estimation formulas:
    \bar{\pi}_i = expected frequency (number of times) in state S_i at time t = 1 = \gamma_1(i)

    \bar{a}_{ij} = \frac{\text{expected number of transitions from state } S_i \text{ to state } S_j}{\text{expected number of transitions from state } S_i}
               = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}
               = \frac{\sum_{t=1}^{T-1} \alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)}{\sum_{t=1}^{T-1} \alpha_t(i)\, \beta_t(i)}

    \bar{b}_j(k) = \frac{\text{expected number of times in state } j \text{ and observing symbol } v_k}{\text{expected number of times in state } j}
               = \frac{\sum_{t=1,\, O_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}
               = \frac{\sum_{t=1,\, O_t = v_k}^{T} \alpha_t(j)\, \beta_t(j)}{\sum_{t=1}^{T} \alpha_t(j)\, \beta_t(j)}
  (a minimal re-estimation sketch follows)
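A minimal sketch (single observation sequence, assumed conventions, reusing the forward and backward sketches defined above) of one Baum-Welch re-estimation pass.

```python
import numpy as np

def baum_welch_step(pi, A, B, O):
    T, N = len(O), len(pi)
    alpha, prob = forward(pi, A, B, O)      # forward sketch defined earlier
    beta = backward(A, B, O)                # backward sketch defined earlier
    # gamma[t, i] = P(q_t = S_i | O, lambda)
    gamma = alpha * beta / prob
    # xi[t, i, j] = P(q_t = S_i, q_{t+1} = S_j | O, lambda), for t = 1..T-1
    xi = (alpha[:-1, :, None] * A[None, :, :] *
          (B[:, O[1:]].T * beta[1:])[:, None, :]) / prob
    # Re-estimation formulas.
    pi_new = gamma[0]
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.zeros_like(B)
    for k in range(B.shape[1]):
        B_new[:, k] = gamma[np.array(O) == k].sum(axis=0) / gamma.sum(axis=0)
    return pi_new, A_new, B_new
```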

Continuous-Density HMM
• Handling Continuous Observations
  – Continuous features: observation vectors O_t
  – Discretization: vector quantization (VQ)
    • Each vector is replaced with its closest codevector, which is treated as a symbol
    • Small codebook: large quantization distortion
    • Large codebook: a large amount of data is required to estimate the emission probabilities
  – Continuous emission density: Gaussian mixture (GM)
    b_j(\mathbf{O}) = \sum_{m=1}^{M} c_{jm}\, N(\mathbf{O};\, \mu_{jm}, U_{jm}), \quad 1 \le j \le N

• Parameter estimation for continuous HMMs (omitted)

Application to Speech Recognition
• Isolated Word Recognition
  – Given an HMM λ_v for each word v in the vocabulary
  – For an input observation sequence O, the Bayes decision (assuming equal prior probabilities) is
    v^* = \arg\max_{1 \le v \le V} P(\lambda_v \mid O) = \arg\max_{1 \le v \le V} P(O \mid \lambda_v)
    (see the sketch below)

  – Acoustic features O_t (details omitted)
  – Vector quantization (discrete observation symbols)
  – Choice of model parameters
    • Number of states: empirical (cross-validation); it can be the same for all word models
    • Number of components in the GM
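A minimal sketch (hypothetical two-word vocabulary and placeholder parameters) of isolated word recognition: score the observation sequence with each word's HMM using the forward procedure given earlier and pick the word with the highest likelihood.

```python
import numpy as np

# word_models maps a word to its (pi, A, B) parameters; the values are placeholders.
word_models = {
    "yes": (np.array([1.0, 0.0]), np.array([[0.7, 0.3], [0.0, 1.0]]),
            np.array([[0.8, 0.2], [0.3, 0.7]])),
    "no":  (np.array([1.0, 0.0]), np.array([[0.5, 0.5], [0.0, 1.0]]),
            np.array([[0.2, 0.8], [0.7, 0.3]])),
}

def recognize(O):
    # forward() is the sketch given with the evaluation problem above.
    scores = {w: forward(pi, A, B, O)[1] for w, (pi, A, B) in word_models.items()}
    return max(scores, key=scores.get), scores

word, scores = recognize(O=[0, 0, 1, 1])
```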

Extensions of HMM
• Hybrid HMM/Neural Network
  – HMM: parametric emission b_j(O_t) = p(O_t | q_t = S_j), with conditional independence of observations
  – Neural network: discriminative emission probability p(q_t = S_j | O_t)
    • The network outputs approximate the posterior probabilities
  – Replace p(O_t | q_t) with the scaled likelihood p(q_t | O_t) / P(q_t) (see the sketch below):
    \frac{p(O_t \mid q_t)}{p(O_t)} = \frac{p(q_t \mid O_t)}{P(q_t)}
  – The network may take multiple frames as input to learn temporal correlations:
    p(q_t = S_j \mid \ldots O_{t-1} O_t O_{t+1} \ldots)
  • Latest development: deep neural networks
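A minimal sketch (hypothetical numbers) of the scaled-likelihood conversion above: divide the network's frame-level posteriors by the state priors to obtain quantities proportional to the emission likelihoods used in Viterbi decoding.

```python
import numpy as np

posteriors = np.array([[0.7, 0.2, 0.1],     # p(q_t = S_j | O_t), one row per frame
                       [0.1, 0.6, 0.3]])
priors = np.array([0.5, 0.3, 0.2])           # P(q_t = S_j), e.g., state frequencies

log_scaled_likelihood = np.log(posteriors) - np.log(priors)
# These log scaled likelihoods replace log b_j(O_t) in the Viterbi recursion.
```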

Discussion
• Feature dimensionality and overfitting
  – Ways to overcome overfitting?
• Expectation-Maximization (EM)
  – Expectation of the log-likelihood over the missing data
  – EM for Gaussian mixture
• Hidden Markov Models (HMM)
  – Three basic problems
  – Viterbi algorithm
  – Extensions

Next Lecture

• Chapter 4: Non-Parametric Methods
  – Density estimation
  – Parzen window method
  – k-nearest-neighbor estimation
  – Nearest-neighbor rule
  – Distance metrics
  – Approximation by series expansion

