Maximum Entropy Markov Models (MEMMs) (McCallum et al., 2000) that avoid the label-bias problem (Lafferty et al., 2001). Although the details of CRFs have been introduced elsewhere, we present them here in moderate detail for the sake of clarity, and because we use a different training procedure.

Let $o = \langle o_1, o_2, \ldots, o_T \rangle$ be some observed input data sequence, such as a sequence of Chinese characters (the values on input nodes of the graphical model). Let $S$ be a set of FSM states, each of which is associated with a label, $l$ (such as a label NOTSTART). Let $s = \langle s_1, s_2, \ldots, s_T \rangle$ be some sequence of states (the values on output nodes). CRFs define the conditional probability of a state sequence given an input sequence as

  $P(s|o) = \frac{1}{Z_o} \exp\left( \sum_{t=1}^{T} \sum_k \lambda_k f_k(s_{t-1}, s_t, o, t) \right)$

where $Z_o$ is a normalization factor over all state sequences, $f_k(s_{t-1}, s_t, o, t)$ is an arbitrary feature function over its arguments, and $\lambda_k$ is a learned weight for each feature function. A feature function may, for example, be defined to have value 0 in most cases, and have value 1 if and only if $s_{t-1}$ is state #1 (which may have label NOTSTART), $s_t$ is state #2 (which may have label START), and the observation at position $t$ in $o$ is a Chinese character appearing in the lexicon of surnames. Higher $\lambda$ weights make their corresponding FSM transitions more likely, so the weight $\lambda_k$ in this example should be positive, since Chinese surname characters often occur at the beginning of a word. More generally, feature functions can ask powerfully arbitrary questions about the input sequence.

CRFs define the conditional probability of a label sequence based on total probability over the state sequences,

  $P(l|o) = \sum_{s : l(s) = l} P(s|o),$

where $l(s)$ is the sequence of labels corresponding to the states in $s$.

Note that the normalization factor, $Z_o$, is the sum of the "scores" of all possible state sequences,

  $Z_o = \sum_{s} \exp\left( \sum_{t=1}^{T} \sum_k \lambda_k f_k(s_{t-1}, s_t, o, t) \right),$

and that the number of state sequences is exponential in the input sequence length, $T$. In arbitrarily-structured CRFs, calculating the partition function in closed form is intractable, and approximation methods such as Gibbs sampling or loopy belief propagation must be used. In linear-chain-structured CRFs, as we have here for sequence modeling, the partition function can be calculated very efficiently by dynamic programming. The details are given in the following subsection.

2.1 Efficient Inference in CRFs

The probability of being in (and transitioning between) CRF states at a particular position in the input sequence can be calculated efficiently by dynamic programming. We can define slightly modified "forward values", $\alpha_t(s_i)$, to be the probability of arriving in state $s_i$ given the observations $\langle o_1, \ldots, o_t \rangle$, and recurse:

  $\alpha_{t+1}(s) = \sum_{s'} \alpha_t(s') \exp\left( \sum_k \lambda_k f_k(s', s, o, t) \right).$

The backward procedure and the remaining details of Baum-Welch are defined similarly. $Z_o$ is then $\sum_s \alpha_T(s)$. The Viterbi algorithm for finding the most likely state sequence given the observation sequence can be correspondingly modified from its original HMM form.
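For concreteness, here is a minimal Python sketch of this forward computation; the dense per-position score matrices (holding $\exp \sum_k \lambda_k f_k(s', s, o, t)$ for every state pair) are an illustrative representation assumed here, not notation from the paper.

    import numpy as np

    def forward_partition(transition_scores, start_scores):
        # transition_scores: shape (T, S, S); entry [t, i, j] holds
        #   exp(sum_k lambda_k * f_k(s_i, s_j, o, t)) for this input o.
        # start_scores: shape (S,); unnormalized score of starting in each state.
        T, S, _ = transition_scores.shape
        alpha = np.empty((T + 1, S))
        alpha[0] = start_scores
        for t in range(T):
            # alpha_{t+1}(s) = sum_{s'} alpha_t(s') * score(s' -> s at t)
            alpha[t + 1] = alpha[t] @ transition_scores[t]
        z_o = alpha[T].sum()  # Z_o = sum_s alpha_T(s)
        return alpha, z_o

In practice the recursion is run in log space (with log-sum-exp) to avoid numerical overflow; the backward pass is symmetric, recursing from the end of the sequence.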
2.2 Training CRFs

The weights of a CRF, $\Lambda = \{\lambda_k\}$, are typically set to maximize the conditional log-likelihood, $L$, of labeled sequences in some training set, $D = \{\langle o, l \rangle^{(1)}, \ldots, \langle o, l \rangle^{(N)}\}$:

  $L = \sum_{j=1}^{N} \log P(l^{(j)} | o^{(j)}) - \sum_k \frac{\lambda_k^2}{2\sigma^2},$

where the second sum is a Gaussian prior over parameters, with variance $\sigma^2$, that provides smoothing to help cope with sparsity in the training data (Chen and Rosenfeld, 1999).

When the training labels make the state sequence unambiguous (as they often do in practice), the likelihood function in exponential models such as CRFs is convex, so there are no local maxima, and thus finding the global optimum is guaranteed. It is not, however, straightforward to find it quickly. Parameter estimation in CRFs requires an iterative procedure, and some methods require fewer iterations than others.

Although the original presentation of CRFs (Lafferty et al., 2001) described training procedures based on iterative scaling, it is significantly faster to train CRFs and other "maximum entropy"-style exponential models by a quasi-Newton method, such as L-BFGS (Byrd et al., 1994; Malouf, 2002; Sha and Pereira, 2002). This method approximates the second derivative of the likelihood by keeping a running, finite window of previous first derivatives. Sha and Pereira (2002) show that training CRFs by L-BFGS is several orders of magnitude faster than iterative scaling, and also significantly faster than conjugate gradient.
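As a sketch of what this black-box interface looks like, the gradient computation below mirrors the three terms of the derivative given next; `feature_counts` (computing $C_k$ along a labeled path) and `expected_feature_counts` (the forward-backward expectation) are hypothetical helper functions assumed for illustration.

    import numpy as np

    def log_likelihood_gradient(lambdas, training_data, sigma_sq,
                                feature_counts, expected_feature_counts):
        # lambdas: shape (K,) array of feature weights.
        grad = -lambdas / sigma_sq                # derivative of the Gaussian prior
        for o, s in training_data:                # labeled (observation, state path) pairs
            grad += feature_counts(s, o)          # empirical counts C_k(s, o)
            grad -= expected_feature_counts(o)    # model expectation of C_k under P(s|o)
        return grad                               # hand this to an L-BFGS routine

An off-the-shelf optimizer is then given the negated likelihood and gradient to minimize.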
L-BFGS can simply be treated as a black-box optimization procedure, requiring only that one provide the first derivative of the function to be optimized. Assuming that the training labels make the state paths unambiguous, let $s^{(j)}$ denote that path; the first derivative of the log-likelihood is then

  $\frac{\partial L}{\partial \lambda_k} = \left( \sum_{j=1}^{N} C_k(s^{(j)}, o^{(j)}) \right) - \left( \sum_{j=1}^{N} \sum_{s} P_\Lambda(s|o^{(j)}) \, C_k(s, o^{(j)}) \right) - \frac{\lambda_k}{\sigma^2},$

where $C_k(s, o) = \sum_t f_k(s_{t-1}, s_t, o, t)$ is the "count" for feature $k$ given $s$ and $o$. The first parenthesized term corresponds to the count of feature $k$ in the labeled training data. The second parenthesized term corresponds to the expected count of feature $k$ using the current CRF parameters, $\Lambda$, to determine the likely state paths. Matching simple intuition, notice that when the state paths chosen by the CRF parameters match the state paths from the labeled data, the derivative will be 0. When the training labels do not disambiguate a single state path, expectation-maximization can be used to fill in the "missing" state paths.

2.3 Two Alternate Priors for CRFs

In previous unpublished experiments on named entity recognition, the first author found that obtaining high accuracy required that some features have large-magnitude weights, but that the Gaussian prior (which "punishes" weights by their value squared) was preventing them from reaching this magnitude. At the same time, the Gaussian prior was only very weakly "punishing" low-magnitude weights, and thus the resulting models had many non-zero weights, but almost none above one. Since the number of features in CRFs is often created by a large power set of conjunctions, intuitively one would expect that many features are irrelevant and should have zero weight, while others that have sufficient training-data support should be allowed more freedom in their range of weight value.

A linear ($L_1$, $\mathrm{abs}(\lambda)$-shaped) rather than quadratic ($L_2$, $\lambda^2$-shaped) penalty function would come closer to this goal. It is well known, for example, in the support-vector machine literature, that $L_1$ loss functions encourage "sparse" solutions, in which many weights for unimportant features are exactly zero. We therefore use a differentiable approximation to the $L_1$ penalty, which has two intuitively meaningful meta-parameters: one controls the slope of the line away from zero, and the other controls the "roundness" of the differentiable region near zero.
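To make the contrast concrete, the sketch below compares the quadratic penalty with a smoothed linear one. The hyperbolic form $\alpha\sqrt{\lambda^2 + \beta^2}$ is one standard differentiable approximation to $\alpha \, \mathrm{abs}(\lambda)$, chosen here purely for illustration, with $\alpha$ playing the slope role and $\beta$ the roundness role.

    import numpy as np

    def gaussian_penalty(lam, sigma_sq):
        # Quadratic (L2): weak pressure near zero, harsh on large weights.
        return lam**2 / (2 * sigma_sq), lam / sigma_sq

    def smoothed_l1_penalty(lam, alpha, beta):
        # Linear away from zero with slope alpha; beta rounds off the
        # non-differentiable corner at zero (as beta -> 0, this
        # approaches alpha * abs(lam)).
        value = alpha * np.sqrt(lam**2 + beta**2)
        grad = alpha * lam / np.sqrt(lam**2 + beta**2)
        return value, grad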
Conjunction features combine individual tests using shifts of up to 3 positions. Using the notation $f_t$ to indicate an individual feature test, $f$, applied at sequence position $t$, and $f_{t-1}f_t f_{t+1}$ to indicate a conjunction of three feature tests applied at various sequence positions, the complete set of time-shifting patterns we use consists of the individual tests $f_{t-1}$, $f_t$, $f_{t+1}$ and their two- and three-way conjunctions.

5 Experimental Results

We tested our approach using hand-segmented[2] data from the Penn Chinese Treebank (LDC-2001T11). It consists of 325 Xinhua newswire articles on a variety of subjects. We selected it because it is a standard data set[3] on which others have published results.

[2] Words plus punctuation.
[3] E.g., http://www.mandarintools.com.
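A sketch of how the time-shifted conjunction patterns described above could be generated is given below; the per-position lists of atomic test names are an invented representation for illustration, as are the example feature names.

    from itertools import product

    # Offset tuples for the time-shifting patterns described above.
    PATTERNS = [(0,), (-1,), (1,), (-1, 0), (0, 1), (-1, 1), (-1, 0, 1)]

    def conjunction_features(atomic_tests, t):
        # atomic_tests[i]: list of atomic feature-test names firing at
        #   position i, e.g. ["in-surname-lexicon", "is-digit"] (illustrative).
        features = []
        for offsets in PATTERNS:
            if any(not (0 <= t + d < len(atomic_tests)) for d in offsets):
                continue  # pattern would fall off the edge of the sequence
            for combo in product(*(atomic_tests[t + d] for d in offsets)):
                # e.g. "in-surname-lexicon@-1 & is-digit@+0"
                features.append(" & ".join(f"{f}@{d:+d}"
                                           for f, d in zip(combo, offsets)))
        return features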
adjective ending character     money
adverb ending character        negative characters
building words                 organization indicator
Chinese number characters      preposition characters
Chinese period (dot)           provinces
cities & regions               punctuation characters
countries                      Roman alphabetics
dates                          Roman digits
department characters          stopwords
digit characters               surnames
foreign name characters        symbol characters
function words                 time
job titles                     verb characters
locations                      wordlist (188k lexicon)

Figure 2: Lexicons used in our experiments

Method   # training   training    testing segmentation
         sentences    tag acc.    prec.   recall   F1
Peng         ?           ?        75.1    74.0     74.2
Ponte        ?           ?        84.4    87.8     86.0
Teahan      40k          ?          ?       ?      94.4
Xue         10k        97.8       95.2    95.1     95.2
CRF        2805        99.9       97.3    97.8     97.5
CRF        1402        99.9       96.9    97.4     97.1
CRF         140        100        95.4    96.0     95.7
CRF          56        100        93.9    95.0     94.4
CRF           7        100        82.8    83.5     83.1
CRF+       2805        100        99.6    99.7     99.6
CRF-       2805        99.9       92.6    92.6     92.6
CRF-       1402        100        90.6    90.4     90.5
CRF-        140        100        80.0    78.6     79.3
CRF-          7        100        60.4    59.0     59.7

Figure 3: Word segmentation results
We measure performance by the percentage of tags correct, and by segmentation precision, recall and F1. Let $A$ be the number of predicted segments, $B$ be the number of true segments, and $C$ be the number of predicted segments with perfect starting and ending boundaries; then precision is $C/A$ and recall is $C/B$. F1 is the harmonic mean of precision and recall.
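These definitions translate directly into code; the sketch below assumes segments are represented as (start, end) character-offset pairs, an illustrative choice.

    def segmentation_prf(predicted, true):
        # predicted, true: non-empty sets of (start, end) offset pairs;
        # a predicted segment counts as correct only if both of its
        # boundaries match a true segment exactly.
        correct = len(predicted & true)       # C: perfectly bounded segments
        precision = correct / len(predicted)  # C / A
        recall = correct / len(true)          # C / B
        f1 = (2 * precision * recall / (precision + recall)
              if correct else 0.0)
        return precision, recall, f1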
Results are given in Figure 3. On the Chinese Treebank with realistic, incomplete lexicons, the best previous result of which we are aware is 95.2% F1 (Xue and Converse, 2002). CRFs produce 97.5% F1, reducing error by half. Furthermore, this improved performance was reached with significantly less training data. As additional testament to the training-data efficiency of our domain-knowledge-laden features, notice that CRFs maintain high accuracy with a shockingly small amount of training data: with only 280 training sentences, CRFs still out-perform previous methods; with only 56 labeled sentences, they match the performance of Teahan et al. (2000), which was trained on roughly 700 times as much data.

Segmentation accuracy is high even on words that appear in none of our 30 lexicons. On the test set, there are 1134 out-of-vocabulary words, 985 of which the CRF correctly segments, a success rate of 87%. Brief scanning of our results by a native speaker indicates that some of the "errors" made by the CRF are actually judgement calls that could go either way, with most of the true segmentation errors being in proper names.

With the full training set CRFs are still over-fitting (F1 on the training set is 99.7%), so CRFs would be expected to do even better with more training data. We might also do significantly better if we gathered more words for our wordlist lexicon: when we artificially give CRFs a perfect lexicon by including all words in our training and testing sets (CRF+ in Figure 3), we obtain 99.6% F1.

To further demonstrate the advantages of domain-knowledge-laden features, we also trained and tested CRFs using only traditional character-based features
(CRF- in Figure 3). Features indicate the presence of individual characters, as well as time-shifted conjunctions, as before, with the same conjunction patterns over $f_{t-1}$, $f_t$ and $f_{t+1}$, resulting in about 400k features. Including more context would have greatly enlarged the total number of features, which wouldn't appear helpful, since the model was already badly over-fitting (having training F1 of 99.9% even with the maximum amount of training data).

All of the above results use a Gaussian prior with variance $\sigma^2$ selected by cross-validation on the training set.

References

Rie Kubota Ando and Lillian Lee. 2000. Mostly-unsupervised statistical segmentation of Japanese: Applications to Kanji. In NAACL, pages 241-248.

R. H. Byrd, J. Nocedal, and R. B. Schnabel. 1994. Representations of quasi-Newton matrices and their use in limited memory methods. Mathematical Programming, 63:129-156.

K. J. Chen and S. H. Liu. 1992. Word identification for Mandarin Chinese sentences. In Proceedings of the Fourteenth International Conference on Computational Linguistics.

Stanley F. Chen and Ronald Rosenfeld. 1999. A Gaussian prior for smoothing maximum entropy models. Technical Report CMU-CS-99-108, CMU.