Yen-Chin Lee
Outline
- Labeling sequence data: problem introduction
- Conditional random fields (CRFs)
- Different views on building a CRF:
  - From directed to undirected graphical models
  - From generative to discriminative models
- Sequence models: from HMMs to CRFs
- Difference between MEMMs & CRFs
- Parameter estimation / inference
- Experiment
X is a random variable over data sequences; Y is a random variable over label sequences. Each Y_i is assumed to range over a finite label set A. The problem:
Learn how to assign labels y from the label set to a data sequence x.
Example: X = (x1, x2, x3) = (Thinking, is, being); Y = (y1, y2, y3) = (noun, verb, noun)
Applications
- Computational biology
- Computational linguistics
- Information extraction
Has been used successfully in various domains such as part-of-speech tagging and other natural language processing tasks. A linear-chain CRF defines

P(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_t \Big[ \sum_i \lambda_i f_i(x, y_t) + \sum_j \mu_j g_j(x, y_t, y_{t-1}) \Big] \Big)
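A minimal Python sketch of how this probability can be computed for a toy linear-chain CRF (not from the slides): the feature functions, weights, and the brute-force computation of Z(x) are illustrative assumptions.

```python
# Sketch: P(y | x) for a toy linear-chain CRF, with Z(x) brute-forced
# over all label sequences. Features and weights are made up for illustration.
import itertools
import math

LABELS = ["noun", "verb"]

def score(x, y, lambdas, mus, f, g):
    """Unnormalized log-score: sum over t of
    sum_i lambda_i f_i(x, y_t) + sum_j mu_j g_j(x, y_t, y_{t-1})."""
    s = 0.0
    for t in range(len(x)):
        s += sum(l * fi(x, t, y[t]) for l, fi in zip(lambdas, f))
        if t > 0:
            s += sum(m * gj(x, t, y[t], y[t - 1]) for m, gj in zip(mus, g))
    return s

def prob(x, y, lambdas, mus, f, g):
    """P(y | x) = exp(score(x, y)) / Z(x)."""
    z = sum(math.exp(score(x, yp, lambdas, mus, f, g))
            for yp in itertools.product(LABELS, repeat=len(x)))
    return math.exp(score(x, y, lambdas, mus, f, g)) / z

# Toy indicator features:
f = [lambda x, t, yt: float(x[t].endswith("ing") and yt == "noun")]
g = [lambda x, t, yt, yprev: float(yprev == "noun" and yt == "verb")]

print(prob(["Thinking", "is", "being"], ("noun", "verb", "noun"), [1.5], [0.8], f, g))
```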
An undirected acyclic graph that allows some transitions to "vote" more strongly than others, depending on the corresponding observations.
Motivation
Directed models
Use a conditional probability for each local substructure; called a Bayesian network.
Undirected models
Use a potential function for each local substructure; called a Markov random field (or Markov network).
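For contrast, the two factorizations written out (standard definitions, added for reference; pa(x_i) denotes the parents of x_i, and C ranges over cliques):

P(x_1, \ldots, x_n) = \prod_i P(x_i \mid \mathrm{pa}(x_i)) \quad \text{(Bayesian network)}

P(x_1, \ldots, x_n) = \frac{1}{Z} \prod_C \psi_C(x_C), \qquad Z = \sum_x \prod_C \psi_C(x_C) \quad \text{(Markov random field)}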
Naive Bayes vs. logistic regression:
- Naive Bayes: based on a model of the joint distribution P(y, x); needs to calculate P(x)
- Logistic regression: based on a model of the conditional distribution P(y | x); does not need P(x)
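In formulas (the standard forms of the two models, added for reference; x = (x_1, \ldots, x_K) are the features):

P(y, x) = P(y) \prod_{k=1}^{K} P(x_k \mid y) \quad \text{(naive Bayes)}

P(y \mid x) = \frac{\exp(\theta_y \cdot x)}{\sum_{y'} \exp(\theta_{y'} \cdot x)} \quad \text{(logistic regression)}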
The power of graphical models lies in modeling many interdependent variables. An HMM models the joint distribution p(y, x) under two assumptions:
Given its direct predecessor, each state is independent of all its ancestors; each observation depends only on the current state.
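These two independence assumptions give the standard HMM factorization of the joint distribution:

p(y, x) = \prod_{t=1}^{T} p(y_t \mid y_{t-1}) \, p(x_t \mid y_t)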
Key: the conditional distribution p(y|x) of an HMM is a CRF with a particular choice of feature functions, whose weights are, e.g.,

\theta_{ij} = \log p(y' = i \mid y = j)

Last step: write the conditional probability P(y|x) of the HMM; the result is a linear-chain CRF.
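A minimal Python sketch of this correspondence, assuming toy transition and emission tables (the names and numbers are illustrative): each HMM probability becomes the weight of one indicator feature.

```python
# Sketch: rewriting an HMM as a linear-chain CRF. Every transition and
# emission probability becomes the weight of an indicator feature.
import math

# Toy HMM parameters (illustrative numbers):
trans = {("noun", "verb"): 0.7, ("noun", "noun"): 0.3,      # p(y' = i | y = j)
         ("verb", "noun"): 0.9, ("verb", "verb"): 0.1}
emit = {("noun", "Thinking"): 0.5, ("noun", "being"): 0.5,  # p(x = o | y = i)
        ("verb", "is"): 1.0}

# CRF weight of the transition indicator feature 1{y = j, y' = i}:
theta = {(j, i): math.log(p) for (j, i), p in trans.items()}
# CRF weight of the observation indicator feature 1{y = i, x = o}:
mu = {(i, o): math.log(p) for (i, o), p in emit.items()}

print(theta[("noun", "verb")])  # = log p(y' = verb | y = noun)
```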
A conditional model representing the probability of reaching a state given an observation and the previous state. Observation sequences are treated as events to be conditioned upon. Given a training set X with label sequences Y:
Train a model Θ that maximizes P(Y | X, Θ). For a new data sequence x, the predicted labels y maximize P(y | x, Θ). Notice the per-state normalization.
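A minimal Python sketch of the per-state normalization, assuming a toy scoring function w as a stand-in for the learned exponential model:

```python
# Sketch: one MEMM step. P(y_t | y_{t-1}, x_t) is normalized separately
# at every state; w is a toy stand-in for the learned model.
import math

LABELS = ["noun", "verb"]

def w(y_prev, y, x_t):
    """Toy score; a real MEMM uses a learned feature-based model."""
    return 1.0 if (y_prev, y) == ("noun", "verb") else 0.5

def memm_step(y_prev, x_t):
    """Distribution over successor states of y_prev given observation x_t."""
    scores = {y: math.exp(w(y_prev, y, x_t)) for y in LABELS}
    z = sum(scores.values())  # per-state partition function
    return {y: s / z for y, s in scores.items()}

print(memm_step("noun", "is"))  # sums to 1 for each (y_prev, x_t) pair
```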
MEMMs (cont'd)
MEMMs have all the advantages of conditional models. Per-state normalization: all the mass that arrives at a state must be distributed among its possible successor states. This makes MEMMs subject to the label bias problem.
P(1 and 2 | ro) = P(2 | 1 and ro) P(1 | ro) = P(2 | 1 and o) P(1 | r)
P(1 and 2 | ri) = P(2 | 1 and ri) P(1 | ri) = P(2 | 1 and i) P(1 | r)
Since P(2 | 1 and x) = 1 for all x, P(1 and 2 | ro) = P(1 and 2 | ri). In the training data, label value 2 is the only label value observed after label value 1, therefore P(2 | 1) = 1, so P(2 | 1 and x) = 1 for all x. However, we expect P(1 and 2 | ri) to be greater than P(1 and 2 | ro). Per-state normalization does not allow this required expectation.
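A toy numeric sketch of the same argument (the probabilities are assumptions chosen for illustration): because label 1 has a single successor, per-state normalization forces P(2 | 1, x) = 1, so the observation cannot influence the score of the path.

```python
# Sketch: the label bias problem in numbers (toy per-state model).
def p_first(label, obs):
    """P(first label | first observation); 'r' is ambiguous between paths."""
    return {1: 0.5, 4: 0.5}.get(label, 0.0)

def p_second(label, prev, obs):
    """P(second label | previous label, observation)."""
    if prev == 1:                          # label 1 has a single successor,
        return 1.0 if label == 2 else 0.0  # so the observation is ignored
    return 0.5                             # irrelevant to this demonstration

p_ro = p_second(2, 1, "o") * p_first(1, "r")
p_ri = p_second(2, 1, "i") * p_first(1, "r")
assert p_ro == p_ri  # observing 'o' vs 'i' cannot change the score
```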
Advantage: the conditional log-likelihood is concave, therefore every local optimum is a global optimum.
There is no analytical solution for the parameters that maximize the log-likelihood: setting the gradient to zero and solving for Θ does not yield a closed-form solution, so iterative techniques are adopted: iterative scaling, gradient descent.
In practice, gradient descent with quasi-Newton methods is used; the runtime is O(TM²NG), where T is the sequence length, M the number of labels, N the number of training instances, and G the number of required gradient computations.
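A minimal sketch of such a training loop on a toy CRF. The expectations are brute-forced over all label sequences rather than computed with the efficient forward-backward recursions, so this illustrates the gradient (empirical minus expected feature counts), not the O(TM²NG) implementation; features, data, and learning rate are illustrative assumptions.

```python
# Sketch: maximum-likelihood training of a toy linear-chain CRF by
# plain gradient ascent; expectations are brute-forced for clarity.
import itertools
import math

LABELS = ["noun", "verb"]

def features(x, y):
    """Toy feature counts; real CRFs use many indicator features."""
    return [sum(x[t].endswith("ing") and y[t] == "noun" for t in range(len(x))),
            sum(y[t - 1] == "noun" and y[t] == "verb" for t in range(1, len(x)))]

def log_score(theta, x, y):
    return sum(w * c for w, c in zip(theta, features(x, y)))

def grad_step(theta, data, lr=0.1):
    grad = [0.0] * len(theta)
    for x, y in data:
        ys = list(itertools.product(LABELS, repeat=len(x)))
        z = sum(math.exp(log_score(theta, x, yp)) for yp in ys)
        for k, count in enumerate(features(x, y)):
            expected = sum(math.exp(log_score(theta, x, yp)) / z * features(x, yp)[k]
                           for yp in ys)
            grad[k] += count - expected  # empirical minus expected count
    return [w + lr * g for w, g in zip(theta, grad)]

data = [(["Thinking", "is", "being"], ("noun", "verb", "noun"))]
theta = [0.0, 0.0]
for _ in range(50):
    theta = grad_step(theta, data)
print(theta)
```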
Experiment
A run consists of 2,000 training examples and 500 test examples, trained to convergence with the iterative scaling algorithm. The CRF error is 4.6%, while the MEMM error is 42%: the MEMM fails to discriminate between the two branches, whereas the CRF solves the label bias problem.
Summary
Per-state normalized conditional models such as MEMMs are prone to the label bias problem. CRFs provide the benefits of discriminative models, solve the label bias problem well, and demonstrate good performance.
Thanks!!