
Conditional Random Fields - A probabilistic graphical model

Yen-Chin Lee

Outline

Labeling sequence data problem
Introduction to conditional random fields (CRFs)
Different views on building a conditional random field (CRF):
  From directed to undirected graphical models
  From generative to discriminative models
  Sequence models
From HMMs to CRFs
Difference between MEMMs & CRFs
Parameter estimation / inference
Experiments

Labeling Sequence Data

X is a random variable over data sequences
Y is a random variable over label sequences
Each Yi is assumed to range over a finite label set A

The problem: learn how to give labels y from the label set to a data sequence x

Example: X = (Thinking, is, being), Y = (noun, verb, noun)

Applications

Computational biology
Computational linguistics
Information extraction


Conditional Random Fields

A form of discriminative model

Has been used successfully in various domains such as part-of-speech tagging and other natural language processing tasks

$$P(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{t}\sum_{i} \lambda_i\, f_i(x, y_t) + \sum_{t}\sum_{j} \mu_j\, g_j(x, y_t, y_{t-1}) \Big)$$

Undirected acyclic graph
Allows some transitions to vote more strongly than others, depending on the corresponding observations
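To make the formula concrete, here is a minimal sketch that computes P(y | x) for a tiny linear-chain CRF by brute-force enumeration of Z(x). The labels, feature weights, and sentence are illustrative assumptions, not from the slides; a real implementation would compute Z(x) with the forward algorithm rather than enumeration.

```python
import itertools
import math

# Toy linear-chain CRF: P(y|x) = exp(state + transition scores) / Z(x)
# Labels and weights below are made up for illustration.
LABELS = ["noun", "verb"]

def sequence_score(x, y, w_state, w_trans):
    """Unnormalized score exp(sum_t f(x, y_t) + sum_t g(x, y_t, y_{t-1}))."""
    score = 0.0
    for t, (word, label) in enumerate(zip(x, y)):
        score += w_state.get((word, label), 0.0)          # state feature f_i
        if t > 0:
            score += w_trans.get((y[t - 1], label), 0.0)  # transition feature g_j
    return math.exp(score)

def probability(x, y, w_state, w_trans):
    # Z(x) sums the unnormalized score over *all* label sequences of length len(x)
    Z = sum(sequence_score(x, y_alt, w_state, w_trans)
            for y_alt in itertools.product(LABELS, repeat=len(x)))
    return sequence_score(x, y, w_state, w_trans) / Z

if __name__ == "__main__":
    x = ["Thinking", "is", "being"]
    w_state = {("Thinking", "noun"): 1.0, ("is", "verb"): 1.5, ("being", "noun"): 0.5}
    w_trans = {("noun", "verb"): 0.8, ("verb", "noun"): 0.8}
    print(probability(x, ("noun", "verb", "noun"), w_state, w_trans))
```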

Motivation
[Figure: diagram relating the models discussed here: Naive Bayes, Hidden Markov Model, and Bayesian Network on the generative/directed side; Logistic Regression, Linear-Chain Conditional Random Field, General Conditional Random Field, and Markov Random Field on the discriminative/undirected side.]

Directed vs. Undirected Models


Directed models
Use a conditional probability for each local substructure
Called Bayesian networks

Undirected models
Use potential functions for each local substructure
Called Markov random fields (Markov networks)

Generative vs. discriminative models


[Same model-relationship diagram as above, contrasting the generative models (Naive Bayes, Hidden Markov Model, Bayesian Network) with their discriminative counterparts (Logistic Regression, Linear-Chain CRF, General CRF, Markov Random Field).]

Generative vs. discriminative models


Generative (Naive Bayes): based on a model of the joint distribution P(y, x); needs to calculate P(x)
Discriminative (Logistic Regression): based on a model of the conditional distribution P(y | x); does not need P(x)
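The dependence on P(x) comes from turning a joint model into a conditional one; writing this out makes the contrast explicit:

$$P(y \mid x) = \frac{P(y, x)}{P(x)} = \frac{P(y, x)}{\sum_{y'} P(y', x)}$$

A discriminative model parameterizes P(y | x) directly and never has to model or sum over the inputs x.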

Overview: sequence models


[Same model-relationship diagram as above, highlighting the sequence models: the Hidden Markov Model and the Linear-Chain CRF.]

Sequence models: HMMs

Power of graphical models: modeling many interdependent variables
An HMM models the joint distribution p(x, y)

Uses two independence assumptions to do it tractably:

Given its direct predecessor, each state is independent of its earlier ancestors
Each observation depends only on the current state
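Under these two assumptions the joint distribution factorizes over the chain (with y_0 taken as a fixed initial state):

$$p(\mathbf{x}, \mathbf{y}) = \prod_{t=1}^{T} p(y_t \mid y_{t-1})\, p(x_t \mid y_t)$$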

From HMMs to linear chain CRFs (1)

Key: the conditional distribution p(y|x) of an HMM is a CRF with a particular choice of feature functions, with weights

$$\theta_{ij} = \log p(y' = i \mid y = j)$$
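As a sketch of that "particular choice of feature functions" (the indicator-feature notation below is the standard rewriting, and an assumption about what the original slide showed): the HMM joint distribution can be written as

$$p(\mathbf{y}, \mathbf{x}) = \frac{1}{Z} \exp\Big( \sum_{t}\sum_{i,j} \theta_{ij}\, \mathbf{1}\{y_t = i\}\, \mathbf{1}\{y_{t-1} = j\} + \sum_{t}\sum_{i}\sum_{o} \mu_{oi}\, \mathbf{1}\{y_t = i\}\, \mathbf{1}\{x_t = o\} \Big)$$

with $\theta_{ij} = \log p(y' = i \mid y = j)$, $\mu_{oi} = \log p(x = o \mid y = i)$, and $Z = 1$. Conditioning on x then yields a linear-chain CRF whose features are exactly these indicator functions.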

From HMMs to linear chain CRFs (2)

Last step: write the conditional probability p(y|x) for the HMM. A linear-chain conditional random field is then defined as any distribution p(y|x) that takes the form shown below.
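A standard way to write that form (a sketch; the exact notation on the original slide may differ) is:

$$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_{t=1}^{T} \exp\Big( \sum_{k} \lambda_k f_k(y_t, y_{t-1}, \mathbf{x}, t) \Big), \qquad Z(\mathbf{x}) = \sum_{\mathbf{y}'} \prod_{t=1}^{T} \exp\Big( \sum_{k} \lambda_k f_k(y'_t, y'_{t-1}, \mathbf{x}, t) \Big)$$

Note that Z(x) sums over all label sequences, i.e. the normalization is global rather than per state.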

Maximum Entropy Markov Models (MEMMs)


A conditional model that represents the probability of reaching a state given an observation and the previous state
Considers observation sequences to be events to be conditioned upon
Given a training set X with label sequences Y:

Train a model θ that maximizes P(Y | X, θ)
For a new data sequence x, the predicted label sequence y maximizes P(y | x, θ)
Notice the per-state normalization
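A sketch of what per-state normalization looks like (standard MEMM form; the exact notation is an assumption): each local transition distribution is its own exponential model,

$$P(y_t \mid y_{t-1}, x_t) = \frac{1}{Z(y_{t-1}, x_t)} \exp\Big( \sum_{k} \lambda_k f_k(y_t, y_{t-1}, x_t) \Big)$$

where $Z(y_{t-1}, x_t)$ sums only over the possible next states $y_t$, so the probability mass is renormalized at every state.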

MEMMs (contd)

MEMMs have all the advantages of conditional models
Per-state normalization: all the mass that arrives at a state must be distributed among the possible successor states
Subject to the label bias problem

Bias toward states with fewer outgoing transitions

Label Bias Problem


Consider this MEMM:

P(1 and 2 | ro) = P(2 | 1 and ro) P(1 | ro) = P(2 | 1 and o) P(1 | r)
P(1 and 2 | ri) = P(2 | 1 and ri) P(1 | ri) = P(2 | 1 and i) P(1 | r)

Since P(2 | 1 and x) = 1 for all x, P(1 and 2 | ro) = P(1 and 2 | ri)
In the training data, label value 2 is the only label value observed after label value 1
Therefore P(2 | 1) = 1, so P(2 | 1 and x) = 1 for all x
However, we expect P(1 and 2 | ri) to be greater than P(1 and 2 | ro)
Per-state normalization does not allow the model to express this preference
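A minimal numeric sketch of this argument. The per-state transition tables below are made up (as are the competing labels 4 and 5 on the other branch); only their structure matters: label 1 has a single successor, so per-state normalization forces its outgoing probability to 1 regardless of the observation.

```python
# P(first label | first observation): after "r" the model can branch to label 1 or 4.
p_first = {"r": {1: 0.5, 4: 0.5}}

# P(next label | previous label, observation). Label 1 has exactly one successor
# (label 2), so per-state normalization gives it probability 1 for any observation.
p_next = {
    (1, "i"): {2: 1.0},
    (1, "o"): {2: 1.0},
    (4, "i"): {5: 1.0},
    (4, "o"): {5: 1.0},
}

def path_prob(labels, observations):
    """P(label sequence | observation sequence) in the per-state normalized model."""
    prob = p_first[observations[0]][labels[0]]
    for t in range(1, len(labels)):
        prob *= p_next[(labels[t - 1], observations[t])][labels[t]]
    return prob

# Identical probabilities, even if "i" were the only observation ever seen
# on the 1 -> 2 transition in training: this is the label bias problem.
print(path_prob([1, 2], ["r", "i"]))  # 0.5
print(path_prob([1, 2], ["r", "o"]))  # 0.5
```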

Solve the Label Bias Problem

Change the state-transition structure of the model

Not always practical to change the set of states

Principles in parameter estimation

Basic principle: maximum likelihood estimation with the conditional log likelihood

$$\ell(\theta) = \sum_{i=1}^{N} \log p\big(y^{(i)} \mid x^{(i)}\big)$$

advantage: conditional log likelihood is concave, therefore every local optimum is a global one

Differentiating the log-likelihood function with respect to the parameters λ_j gives:
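The resulting gradient has the standard "empirical feature counts minus expected feature counts" form (the exact notation on the original slide is assumed):

$$\frac{\partial \ell}{\partial \lambda_j} = \sum_{i=1}^{N}\sum_{t} f_j\big(y^{(i)}_t, y^{(i)}_{t-1}, x^{(i)}, t\big) \;-\; \sum_{i=1}^{N}\sum_{t}\sum_{y,\,y'} f_j\big(y, y', x^{(i)}, t\big)\, p\big(y_t = y,\, y_{t-1} = y' \mid x^{(i)}\big)$$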

Principles in parameter estimation

There is no analytical solution for the parameters that maximize the log-likelihood
Setting the gradient to zero and solving for θ does not, in general, yield a closed-form solution
Iterative techniques are adopted:
Iterative scaling
Gradient descent

Using gradient descent (quasi-Newton methods), the runtime is O(T M² N G)
T: length of the sequences
M: number of labels
N: number of training instances
G: number of required gradient computations
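A minimal end-to-end sketch of gradient-based training on toy data (brute-force likelihood, numerically approximated gradients via L-BFGS, and made-up features; a real implementation would use forward-backward recursions and analytic gradients):

```python
import itertools
import numpy as np
from scipy.optimize import minimize

# Toy linear-chain CRF: 2 labels, indicator features for (word, label) and
# (prev_label, label) pairs, brute-force normalization. Illustrative only.
LABELS = [0, 1]            # 0 = noun, 1 = verb (assumed)
VOCAB = {"Thinking": 0, "is": 1, "being": 2}
N_FEATS = len(VOCAB) * len(LABELS) + len(LABELS) ** 2

def feature_counts(x, y):
    """Vector of feature counts: state features first, then transition features."""
    f = np.zeros(N_FEATS)
    for t, (word, label) in enumerate(zip(x, y)):
        f[VOCAB[word] * len(LABELS) + label] += 1.0
        if t > 0:
            f[len(VOCAB) * len(LABELS) + y[t - 1] * len(LABELS) + label] += 1.0
    return f

def neg_log_likelihood(w, data):
    """Negative conditional log likelihood plus a small L2 penalty to keep it well-posed."""
    nll = 0.05 * w @ w
    for x, y in data:
        scores = {y_alt: w @ feature_counts(x, y_alt)
                  for y_alt in itertools.product(LABELS, repeat=len(x))}
        log_Z = np.logaddexp.reduce(list(scores.values()))
        nll -= scores[tuple(y)] - log_Z
    return nll

if __name__ == "__main__":
    # One training sentence labeled noun(0) verb(1) noun(0).
    data = [(["Thinking", "is", "being"], (0, 1, 0))]
    result = minimize(neg_log_likelihood, np.zeros(N_FEATS),
                      args=(data,), method="L-BFGS-B")
    print("trained weights:", np.round(result.x, 3))
```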

Summary of the relations among model structures

Experiment

Modeling the label bias problem

Each run consists of 2,000 training examples and 500 test examples, trained to convergence using the iterative scaling algorithm
CRF error is 4.6%, MEMM error is 42%
The MEMM fails to discriminate between the two branches
The CRF solves the label bias problem

MEMM vs. HMM


The HMM outperforms the MEMM

MEMM vs. CRF


CRF usually outperforms the MEMM

Summary

Discriminative models with per-state normalization (such as MEMMs) are prone to the label bias problem
CRFs provide the benefits of discriminative models
CRFs solve the label bias problem and demonstrate good performance

Thanks!!
