
Chinese Word Segmentation with Conditional Random Fields and Integrated Domain Knowledge

Andrew McCallum
Computer Science Department
University of Massachusetts Amherst
Amherst, MA 01002
mccallum@cs.umass.edu

Fang-fang Feng
Computer Science Department
University of Massachusetts Amherst
Amherst, MA 01002
fang@cs.umass.edu

Abstract

Chinese word segmentation is a difficult, important and widely-studied sequence modeling problem. Conditional Random Fields (CRFs) are a new discriminative sequence modeling technique that supports the incorporation of many rich features. This paper demonstrates the ability of CRFs to easily integrate domain knowledge, and thus reduce the need for labeled data. Using readily-available domain knowledge in the CRF’s feature definitions, we show that, even training on as few as 140 segmented Chinese sentences, we can achieve world-class segmentation accuracy. Furthermore, when training on more data, our approach yields segmentation F1 of 97.5% on the Penn Chinese Treebank with realistic, incomplete lexicons—a 50% reduction in error compared with the previously best published results of which we are aware. We also introduce two alternative prior distributions for the CRF’s learned weights.

1 Introduction and Related Work

Unlike western languages, Chinese is not written with spaces separating words. Some words consist of a single Chinese character, but many words consist of two, three, four or more characters. A human reader of Chinese will naturally segment the character stream into words in order to discern its meaning. It is generally accepted, therefore, that a necessary first step in most Chinese language technology applications (including document retrieval, summarization and extraction) is to segment the character stream into words.

The problem of Chinese word segmentation has been widely studied, and approached from several different angles. Some researchers have used a large lexicon of Chinese words and simple rules to select among several different segmentations consistent with the lexicon (Chen and Liu, 1992; Wu and Tseng, 1993). However, these approaches have been limited by the impossibility of creating a lexicon that includes all possible Chinese words and by the lack of statistical sophistication in the rules.

Others have used unsupervised learning techniques and strong statistical methods to train on a large body of unsegmented Chinese text (Ando and Lee, 2000). Some also incorporate lexicons (Ponte and Croft, 1996; Peng and Schuurmans, 2001). These methods successfully cluster character sequences into words by using the relative predictability of the next character. However, the lack of supervision fundamentally prevents this approach from separating the words of new, highly correlated word sequences such as idioms and common phrases.

In response, others have taken advantage of the availability of human-segmented Chinese corpora, and used supervised machine learning methods—including techniques such as finite state transducers (Sproat et al., 1996), hidden Markov models (Teahan et al., 2000), maximum entropy classifiers and transformation-based learning (Xue and Converse, 2002). However, the amount of labeled training data
will always be limited, and given the large number of parameters that must be learned in these models, performance degradation due to over-fitting is often significant. This over-fitting could be ameliorated by the use of domain knowledge to provide prior biases, but models that combine the sophistication of finite-state decoding with the ability to integrate domain knowledge have previously been elusive.

Conditional Random Fields (CRFs) (Lafferty et al., 2001) are recent models that have the ability to combine rich domain knowledge with finite-state decoding, sophisticated statistical methods, and discriminative, supervised training. In their most general form, they are arbitrary undirected graphical models trained to maximize the conditional probability of the desired outputs given the corresponding inputs. In the special case we use here, they can be roughly understood as discriminatively-trained hidden Markov models with next-state transition functions represented by exponential models (as in maximum entropy classifiers), and with great flexibility to view the observation sequence in terms of arbitrary, overlapping features, including long-range dependencies and multiple levels of granularity.

This paper demonstrates the success of CRFs on Chinese word segmentation, with special focus on reducing the need for labeled training data by injecting domain knowledge into the model. We believe that machine learning models should not be trained tabula rasa—instead, we should use models that let their users easily integrate background knowledge whenever it is available.

To apply CRFs to Chinese word segmentation, we first cast segmentation as a sequence labeling problem: each character that begins a word is labeled with the START tag, and every other character is labeled with the NONSTART tag. The task of a trained CRF is to recover the sequence of tags given an unlabeled (unsegmented) sentence of Chinese characters. This is performed by constructing a CRF whose states are each associated with a tag, then running an analogue of the Viterbi algorithm to efficiently find the most likely state sequence given the input character sequence. Finally, we output the tags associated with the states in that state sequence. We incorporate strong domain knowledge into our model by making use of 28 different special-purpose lexicons of Chinese words and characters that are readily available from the Internet and other sources. The various lexicons include common surnames, country names, location names, job titles, punctuation characters, digits, and characters that indicate the word endings of adjectives and adverbs. The features used by the CRF are conjunctions of position-shifted membership queries of these lexicons.
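For illustration, a minimal Java sketch of this casting (not the implementation described later; the example words are arbitrary) converts hand-segmented words into per-character START/NONSTART tags and converts a predicted tag sequence back into words:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class TaggingSketch {
        // Each character that begins a word gets START; all other characters get NONSTART.
        static List<String> wordsToTags(List<String> words) {
            List<String> tags = new ArrayList<>();
            for (String word : words) {
                for (int i = 0; i < word.length(); i++) {
                    tags.add(i == 0 ? "START" : "NONSTART");
                }
            }
            return tags;
        }

        // Recover the segmentation from an unsegmented sentence and a predicted tag sequence.
        static List<String> tagsToWords(String sentence, List<String> tags) {
            List<String> words = new ArrayList<>();
            StringBuilder current = new StringBuilder();
            for (int i = 0; i < sentence.length(); i++) {
                if (i > 0 && tags.get(i).equals("START")) {
                    words.add(current.toString());
                    current.setLength(0);
                }
                current.append(sentence.charAt(i));
            }
            if (current.length() > 0) words.add(current.toString());
            return words;
        }

        public static void main(String[] args) {
            List<String> words = Arrays.asList("\u5317\u4eac", "\u5927\u5b66");  // two two-character words
            List<String> tags = wordsToTags(words);
            System.out.println(tags);                         // [START, NONSTART, START, NONSTART]
            System.out.println(tagsToWords("\u5317\u4eac\u5927\u5b66", tags));
        }
    }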
Comparing Chinese word segmentation accuracies can be difficult because many papers use different data sets and ground-rules. Some published results claim 98% or 99% segmentation precision and recall, e.g. (Chen and Liu, 1992), but these either count only the testing words that occur in the lexicon, or use unrealistic lexicons that have extremely small or artificially non-existent out-of-vocabulary rates, or use short sentences with many numbers. On the more difficult Penn Chinese Treebank data, the best performance of which we are aware is an F1 of 95.2% (Xue and Converse, 2002). In correspondingly careful experiments with a set-aside test set, we have obtained an F1 of 97.5% on the Chinese Treebank, cutting error nearly in half. In our test set, 5% of the words are out-of-vocabulary; of these the CRF segments 87% correctly. Notably, when trained with only 140 sentences, CRFs still reach F1 of 95.7%. A CRF trained with character features instead of the domain-knowledge-laden lexicon features obtains only F1 of 92.6% on the same amount of training data.

Pointers to publicly-available Java source code for training and testing, our lexicons, and an easily deployed pre-trained segmenter will be provided in the final copy of this paper.

This paper also presents two alternative prior distributions for the learned weights of CRFs, designed to produce “sparser” solutions than the traditional Gaussian prior. However, we find that these alternatives do not improve the performance of Chinese word segmentation.

2 Conditional Random Fields

Conditional Random Fields (CRFs) (Lafferty et al., 2001) are undirected graphical models used to calculate the conditional probability of values on designated output nodes given values on designated input nodes.

In the special case in which the designated output
nodes of the graphical model are linked by edges in a linear chain, CRFs make a first-order Markov independence assumption among output nodes, and thus correspond to finite state machines (FSMs). In this case CRFs can be roughly understood as conditionally-trained hidden Markov models. CRFs of this type are a globally-normalized extension to Maximum Entropy Markov Models (MEMMs) (McCallum et al., 2000) that avoids the label-bias problem (Lafferty et al., 2001). Although the details of CRFs have been introduced elsewhere, we present them here in moderate detail for the sake of clarity, and because we use a different training procedure.

Let $o = \langle o_1, o_2, \ldots, o_T \rangle$ be some observed input data sequence, such as a sequence of Chinese characters (the values on the input nodes of the graphical model). Let $S$ be a set of FSM states, each of which is associated with a label $l \in \mathcal{L}$ (such as the label NONSTART). Let $s = \langle s_1, s_2, \ldots, s_T \rangle$ be some sequence of states (the values on the output nodes). CRFs define the conditional probability of a state sequence given an input sequence as

  $p_\Lambda(s \mid o) = \frac{1}{Z_o} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(s_{t-1}, s_t, o, t) \Big)$

where $Z_o$ is a normalization factor over all state sequences, $f_k(s_{t-1}, s_t, o, t)$ is an arbitrary feature function over its arguments, and $\lambda_k$ is a learned weight for each feature function. A feature function may, for example, be defined to have value 0 in most cases, and have value 1 if and only if $s_{t-1}$ is state #1 (which may have label NONSTART), $s_t$ is state #2 (which may have label START), and the observation at position $t$ in $o$ is a Chinese character appearing in the lexicon of surnames. Higher $\lambda$ weights make their corresponding FSM transitions more likely, so the weight in this example should be positive, since Chinese surname characters often occur at the beginning of a word. More generally, feature functions can ask powerfully arbitrary questions about the input sequence.
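For illustration, a minimal Java sketch of such a feature function (not the implementation described later; the lexicon entries and weights are invented) scores a single NONSTART-to-START transition with two binary features, one of which consults a surname lexicon:

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class FeatureFunctionSketch {
        // A binary feature of (previous state, current state, observation sequence, position).
        interface Feature {
            double value(String prevTag, String curTag, String[] chars, int t);
        }

        public static void main(String[] args) {
            // Toy surname lexicon (placeholder entries).
            Set<String> surnames = new HashSet<>(Arrays.asList("\u738b", "\u674e", "\u5f20"));

            // Feature 1: fires when a surname character starts a word (NONSTART -> START transition).
            Feature surnameStartsWord = (prev, cur, chars, t) ->
                prev.equals("NONSTART") && cur.equals("START") && surnames.contains(chars[t]) ? 1.0 : 0.0;

            // Feature 2: fires on any START tag, regardless of the observation.
            Feature anyStart = (prev, cur, chars, t) -> cur.equals("START") ? 1.0 : 0.0;

            Feature[] features = { surnameStartsWord, anyStart };
            double[] lambda = { 1.7, 0.3 };   // made-up learned weights; the surname feature is positive

            String[] chars = { "\u738b", "\u5c0f", "\u660e" };  // a three-character observation sequence
            // Unnormalized score of one transition: exp( sum_k lambda_k * f_k(s_{t-1}, s_t, o, t) )
            // (boundary handling at position 0 is omitted for brevity).
            double sum = 0.0;
            for (int k = 0; k < features.length; k++) {
                sum += lambda[k] * features[k].value("NONSTART", "START", chars, 0);
            }
            System.out.println("transition score = " + Math.exp(sum));
        }
    }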
CRFs define the conditional probability of a label sequence based on the total probability over the state sequences,

  $p_\Lambda(l \mid o) = \sum_{s \,:\, l(s) = l} p_\Lambda(s \mid o)$

where $l(s)$ is the sequence of labels corresponding to the labels of the states in sequence $s$.

Note that the normalization factor, $Z_o$ (also known in statistical physics as the partition function), is the sum of the “scores” of all possible state sequences,

  $Z_o = \sum_{s} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(s_{t-1}, s_t, o, t) \Big)$

and that the number of state sequences is exponential in the input sequence length, $T$. In arbitrarily-structured CRFs, calculating the partition function in closed form is intractable, and approximation methods such as Gibbs sampling or loopy belief propagation must be used. In linear-chain-structured CRFs, as we have here for sequence modeling, the partition function can be calculated very efficiently by dynamic programming. The details are given in the following subsection.

2.1 Efficient Inference in CRFs

As in the forward-backward algorithm for hidden Markov models (HMMs) (Rabiner, 1990), the probability that a particular transition was taken between two CRF states at a particular position in the input sequence can be calculated efficiently by dynamic programming. We can define slightly modified “forward values”, $\alpha_t(s_i)$, to be the probability of arriving in state $s_i$ given the observations $\langle o_1, \ldots, o_t \rangle$. We set $\alpha_0(s)$ equal to the probability of starting in each state $s$, and recurse:

  $\alpha_{t+1}(s) = \sum_{s'} \alpha_t(s') \exp\Big( \sum_{k} \lambda_k f_k(s', s, o, t) \Big).$

The backward procedure and the remaining details of Baum-Welch are defined similarly. $Z_o$ is then $\sum_{s} \alpha_T(s)$. The Viterbi algorithm for finding the most likely state sequence given the observation sequence can be correspondingly modified from its original HMM form.
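For illustration, a minimal Java sketch of this forward recursion (assuming the per-position transition scores exp(sum_k lambda_k f_k) have already been computed into a matrix; in practice the computation is done in log space to avoid underflow):

    public class ForwardSketch {
        /**
         * Computes the partition function Z_o by the forward recursion.
         * phi[t][i][j] = exp( sum_k lambda_k * f_k(s_i, s_j, o, t) ),
         * the unnormalized score of moving from state i to state j at position t.
         * start[j] is the score of beginning the sequence in state j.
         */
        static double partition(double[] start, double[][][] phi) {
            int numStates = start.length;
            double[] alpha = start.clone();              // alpha_0
            for (int t = 0; t < phi.length; t++) {       // recurse over positions
                double[] next = new double[numStates];
                for (int j = 0; j < numStates; j++) {
                    for (int i = 0; i < numStates; i++) {
                        next[j] += alpha[i] * phi[t][i][j];
                    }
                }
                alpha = next;                            // alpha_{t+1}
            }
            double z = 0.0;                              // Z_o = sum_s alpha_T(s)
            for (double a : alpha) z += a;
            return z;
        }

        public static void main(String[] args) {
            // Two states (START, NONSTART) and two transitions worth of made-up scores.
            double[] start = { 1.0, 0.2 };
            double[][][] phi = {
                { { 0.5, 2.0 }, { 1.5, 0.4 } },
                { { 0.6, 1.8 }, { 1.2, 0.3 } }
            };
            System.out.println("Z_o = " + partition(start, phi));
        }
    }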
2.2 Training CRFs

The weights of a CRF, $\Lambda = \{\lambda_k\}$, are typically set to maximize the conditional log-likelihood, $\mathcal{L}$, of labeled sequences in some training set, $D = \{\langle o, l \rangle^{(1)}, \ldots, \langle o, l \rangle^{(N)}\}$:
  $\mathcal{L} = \sum_{j} \log p_\Lambda\big(l^{(j)} \mid o^{(j)}\big) \;-\; \sum_{k} \frac{\lambda_k^2}{2\sigma^2}$

where the second sum is a Gaussian prior over parameters, with variance $\sigma^2$, that provides smoothing to help cope with sparsity in the training data (Chen and Rosenfeld, 1999).

When the training labels make the state sequence unambiguous (as they often do in practice), the likelihood function in exponential models such as CRFs is convex, so there are no local maxima, and finding the global optimum is guaranteed. It is not, however, straightforward to find it quickly. Parameter estimation in CRFs requires an iterative procedure, and some methods require fewer iterations than others.

Although the original presentation of CRFs (Lafferty et al., 2001) described training procedures based on iterative scaling, it is significantly faster to train CRFs and other “maximum entropy”-style exponential models by a quasi-Newton method such as L-BFGS (Byrd et al., 1994; Malouf, 2002; Sha and Pereira, 2002). This method approximates the second derivative of the likelihood by keeping a running, finite window of previous first derivatives. Sha and Pereira (2002) show that training CRFs by L-BFGS is several orders of magnitude faster than iterative scaling, and also significantly faster than conjugate gradient.

L-BFGS can simply be treated as a black-box optimization procedure, requiring only that one provide the first derivative of the function to be optimized. Assuming that the training labels make the state paths unambiguous, let $s^{(j)}$ denote that path for the $j$th training instance; then the first derivative of the log-likelihood is

  $\frac{\partial \mathcal{L}}{\partial \lambda_k} = \Big( \sum_{j} C_k(s^{(j)}, o^{(j)}) \Big) - \Big( \sum_{j} \sum_{s} p_\Lambda(s \mid o^{(j)}) \, C_k(s, o^{(j)}) \Big) - \frac{\lambda_k}{\sigma^2}$

where $C_k(s, o)$ is the “count” for feature $f_k$ given $s$ and $o$, equal to $\sum_t f_k(s_{t-1}, s_t, o, t)$, the sum of $f_k$ values over all positions, $t$, in the sequence $s$. The last term, $-\lambda_k / \sigma^2$, is the derivative of the Gaussian prior.

The first parenthesized term corresponds to the count of feature $f_k$ given that the training labels are used to determine the correct state paths. The second parenthesized term corresponds to the expected count of feature $f_k$ using the current CRF parameters, $\Lambda$, to determine the likely state paths. Matching simple intuition, notice that when the state paths chosen by the CRF parameters match the state paths from the labeled data, the derivative will be 0 (apart from the prior term).

When the training labels do not disambiguate a single state path, expectation-maximization can be used to fill in the “missing” state paths.
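For illustration, a small Java sketch showing how the three terms of this derivative combine (assuming the empirical and model-expected feature counts have already been accumulated, e.g. with the forward-backward recursions above):

    public class GradientSketch {
        /**
         * dL/dlambda_k = empirical count of feature k
         *              - expected count of feature k under the current parameters
         *              - lambda_k / sigma^2   (derivative of the Gaussian prior)
         */
        static double[] gradient(double[] empirical, double[] expected,
                                 double[] lambda, double sigmaSquared) {
            double[] grad = new double[lambda.length];
            for (int k = 0; k < lambda.length; k++) {
                grad[k] = empirical[k] - expected[k] - lambda[k] / sigmaSquared;
            }
            return grad;
        }

        public static void main(String[] args) {
            double[] empirical = { 12.0, 3.0 };   // counts from the labeled state paths
            double[] expected  = { 10.5, 3.0 };   // counts expected under the current CRF
            double[] lambda    = { 0.8, -0.1 };
            double[] g = gradient(empirical, expected, lambda, 1.0);
            // When expected counts match the empirical counts, only the prior term remains.
            System.out.println(g[0] + " " + g[1]);
        }
    }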
2.3 Two Alternative Priors for CRFs

In previous unpublished experiments on named entity recognition, the first author found that obtaining high accuracy required that some features have large-magnitude weights, but that the Gaussian prior (which “punishes” weights in proportion to their squared value) was preventing them from reaching this magnitude. At the same time, the Gaussian prior was only very weakly “punishing” low-magnitude weights, and thus the resulting models had many non-zero weights, but almost none above one.

Since the feature set of a CRF is often created from a large power set of conjunctions, intuitively one would expect that many features are irrelevant and should have zero weight, while others that have sufficient training-data support should be allowed more freedom in their range of weight values.

A linear (L1, $\mathrm{abs}(\lambda)$-shaped) rather than quadratic (L2, $\lambda^2$-shaped) penalty function would come closer to this goal. It is well known, for example, in the support-vector machine literature, that L1 loss functions encourage “sparse” solutions, in which many weights for unimportant features are nearly zero.

The absolute value function is not everywhere differentiable, its derivative being undefined at zero. However, the function $(\alpha/\beta)\log(\cosh(\beta\lambda))$ is everywhere differentiable, and has an appealingly similar shape, as well as two intuitively meaningful meta-parameters: $\alpha$ controls the slope of the line away from zero, and $\beta$ controls the “roundness” of the differentiable curve close to zero. The derivative of this function is $\alpha\tanh(\beta\lambda)$. We call this a hyperbolic-L1 prior; it has likely been used elsewhere in other contexts, but we have not yet found a reference to it.
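For illustration, a tiny Java sketch of the hyperbolic-L1 penalty as written above, $(\alpha/\beta)\log(\cosh(\beta\lambda))$, and its derivative $\alpha\tanh(\beta\lambda)$; the constants are arbitrary:

    public class HyperbolicL1Sketch {
        // Penalty applied to a single weight: (alpha/beta) * log(cosh(beta * lambda)).
        static double penalty(double lambda, double alpha, double beta) {
            return (alpha / beta) * Math.log(Math.cosh(beta * lambda));
        }

        // Its derivative, alpha * tanh(beta * lambda): roughly +/-alpha far from zero,
        // and smoothly rounded (slope alpha*beta) near zero.
        static double derivative(double lambda, double alpha, double beta) {
            return alpha * Math.tanh(beta * lambda);
        }

        public static void main(String[] args) {
            double alpha = 1.0, beta = 10.0;
            for (double lambda : new double[] { -2.0, -0.1, 0.0, 0.1, 2.0 }) {
                System.out.printf("lambda=%5.2f  penalty=%7.4f  derivative=%7.4f%n",
                        lambda, penalty(lambda, alpha, beta), derivative(lambda, alpha, beta));
            }
        }
    }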
An even more extreme embodiment of “sparseness” and non-interference with other weights would be a weight penalty function that rises away from zero as before, but becomes more nearly flat further out. This shape might be captured with a log(cosh(·))-style penalty on $\mathrm{abs}(\lambda)$ that has a third meta-parameter controlling how far away from zero the flattening begins. We have experimented with the first prior, but not yet the second. As will be described in section 5, on Chinese word segmentation we did not find the hyperbolic-L1 prior to improve performance.

3 Use of Domain Knowledge with CRFs

One of the chief strengths of CRFs is their ability to easily incorporate arbitrary features of the input sequence. This strength can be used with traditional HMM sequence modeling features, i.e. the current token identity,[1] by simply adding additional features that query the identity of tokens before and after the current token, as well as conjunctions of these features. These additional features are highly non-independent and would be quite problematic for a generative model like an HMM—not to mention the outright impossibility of a generative sequence model depending on future, not yet generated elements of the sequence.

[1] For example, a feature defined to be non-zero only when the current token is the Chinese character for “cat”.

However, these types of features still do not take full advantage of CRFs’ flexibility. This paper strongly advocates learning that is not tabula rasa, but rather incorporates as much domain knowledge as possible. Doing so can improve generalization and also greatly reduce the amount of labeled training data that is necessary. In language technology applications, which often use models with extremely large numbers of parameters, the need for large volumes of labeled training data is often a bottleneck to high accuracy. Methods that allow the straightforward incorporation of domain knowledge provide an extra avenue for weak supervision, and reduce the burden of data labeling.

Domain knowledge comes in many forms, nearly all of which can be cast as features: gazetteers and lexicons, lists of significant token suffixes and other sub-constituents, hand-generated questions about entire regions of the sequence, even the output of previous hand-built rules, expert systems and ad-hoc solutions.

Input: (1) A training set: sentences of Chinese text with hand-segmented words, and (2) several lexicons of Chinese words and characters belonging to different salient categories (such as location names, surnames, digits, etc.).
Algorithm: Use a quasi-Newton method to set the parameters of a CRF finite state model to maximize the conditional likelihood of START and NONSTART tags on the characters in the training data.
Output: A finite state model that finds word boundaries in unsegmented Chinese text by finding the most likely state (and thus tag) sequence using the Viterbi algorithm.

Figure 1: Outline of our algorithm.

The next two sections describe such an approach to Chinese word segmentation, and show positive experimental results, including high accuracies with extremely small amounts of labeled data.

4 CRFs plus Domain Knowledge for Chinese Word Segmentation

Our approach to Chinese word segmentation combines the strengths of (1) supervised machine learning trained on a hand-segmented corpus, (2) a rich variety of domain-knowledge-laden lexicons which reduce the need for large quantities of labeled data, (3) a maximum-likelihood training method guaranteed to find the global optimum, (4) a strong representation of contextual information to help correctly handle out-of-vocabulary (OOV) words, and (5) the full statistical power of finite-state inference.

We cast the segmentation problem as one of sequence tagging. Chinese characters that begin a new word are given the START tag, and characters in the middle and at the end of words are given the NONSTART tag. The task of segmenting new, unsegmented test data becomes a matter of assigning to it a sequence of tags (labels).

A conditional random field is configured as a finite state machine (FSM) for this purpose, and tagging is performed using Viterbi to find the most likely label sequence for a given character sequence. The FSM may be first-order Markov, in which case there is one state for each label, and the choice of the next label depends only on the single previous label and the input sequence. The FSM may also
be second-order Markov or higher. A second-order FSM has a state for every possible pair of labels, and the choice of the next label depends on the previous two labels.
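For illustration, a small Java sketch of the Viterbi search used for decoding (two states standing for START and NONSTART, with made-up transition scores; a second-order model would use the same code with states standing for label pairs):

    public class ViterbiSketch {
        /**
         * Finds the most likely state sequence under per-position transition scores.
         * score[t][i][j]: unnormalized score of moving from state i at position t-1 to state j at position t;
         * start[j]: score of starting in state j. States here are 0=START, 1=NONSTART.
         */
        static int[] viterbi(double[] start, double[][][] score) {
            int T = score.length + 1, n = start.length;
            double[][] best = new double[T][n];
            int[][] back = new int[T][n];
            for (int j = 0; j < n; j++) best[0][j] = Math.log(start[j]);
            for (int t = 1; t < T; t++) {
                for (int j = 0; j < n; j++) {
                    best[t][j] = Double.NEGATIVE_INFINITY;
                    for (int i = 0; i < n; i++) {
                        double v = best[t - 1][i] + Math.log(score[t - 1][i][j]);
                        if (v > best[t][j]) { best[t][j] = v; back[t][j] = i; }
                    }
                }
            }
            int[] path = new int[T];
            for (int j = 1; j < n; j++) if (best[T - 1][j] > best[T - 1][path[T - 1]]) path[T - 1] = j;
            for (int t = T - 1; t > 0; t--) path[t - 1] = back[t][path[t]];
            return path;
        }

        public static void main(String[] args) {
            double[] start = { 2.0, 0.5 };                       // START is favored at position 0
            double[][][] score = {                               // made-up transition scores
                { { 0.4, 1.6 }, { 1.2, 0.6 } },
                { { 1.5, 0.5 }, { 0.9, 1.1 } }
            };
            for (int s : viterbi(start, score)) System.out.print(s == 0 ? "START " : "NONSTART ");
            System.out.println();
        }
    }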
Some previous methods have used a single large lexicon of Chinese words, or a small collection of “character class” definitions (such as digits and punctuation marks), to help perform segmentation, e.g. (Ponte and Croft, 1996; Xue and Converse, 2002). CRFs allow us to take this approach to the extreme: we incorporate not a few, but a large variety of different lexicons, each with different specialities and meanings. For example, one is simply a very large dictionary of Chinese words, another is a list of location names, and others capture surnames, digits, etc. The complete list of our lexicons is given in the next section.

The features used in our CRF are membership queries of character subsegments in these lexicons; for example, one feature might indicate that the subsequence consisting of the previous character and the current character appears in the lexicon of locations. We also add features that query lexicon membership of subsequences before and after the current character. For example, a feature could ask if the character three positions ahead appears in the lexicon of organization name indicators. Furthermore, we create new features out of two- and three-element conjunctions of these atomic features, which might ask, for example, whether the character two positions back is in the list of prepositions and the character one position ahead is in the list of verbs.
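For illustration, a Java sketch of how such position-shifted lexicon-membership tests and their conjunctions might be generated as feature strings (simplified to single-character lookups and a three-position window; the lexicon contents are toy entries, and the actual templates are described in section 5.1):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    public class LexiconFeatureSketch {
        // Returns atomic feature strings of the form lexiconName@offset that fire at position t.
        static List<String> atomicFeatures(String sentence, int t,
                                           Map<String, Set<String>> lexicons, int[] offsets) {
            List<String> feats = new ArrayList<>();
            for (int off : offsets) {
                int p = t + off;
                if (p < 0 || p >= sentence.length()) continue;
                String ch = String.valueOf(sentence.charAt(p));
                for (Map.Entry<String, Set<String>> lex : lexicons.entrySet()) {
                    if (lex.getValue().contains(ch)) feats.add(lex.getKey() + "@" + off);
                }
            }
            return feats;
        }

        public static void main(String[] args) {
            Map<String, Set<String>> lexicons = new LinkedHashMap<>();
            lexicons.put("surname", new HashSet<>(Arrays.asList("\u738b")));            // toy entries
            lexicons.put("digit", new HashSet<>(Arrays.asList("\u4e00", "\u4e8c")));
            String sentence = "\u738b\u4e00";
            List<String> atoms = atomicFeatures(sentence, 0, lexicons, new int[] { -1, 0, 1 });
            // Two-element conjunctions of the atomic features, e.g. "surname@0&digit@1".
            List<String> conj = new ArrayList<>(atoms);
            for (int i = 0; i < atoms.size(); i++) {
                for (int j = i + 1; j < atoms.size(); j++) {
                    conj.add(atoms.get(i) + "&" + atoms.get(j));
                }
            }
            System.out.println(conj);
        }
    }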
Because features that examine the past and future provide strong contextual clues, and because we use the power of Viterbi for finite state inference, we can usually successfully segment new words that do not appear in any of our lexicons. All our features are binary-valued. Although we could also add conjuncts indicating the lack of membership in a lexicon, we have not yet done this.

5 Experimental Results

We tested our approach using hand-segmented data from the Penn Chinese Treebank (LDC-2001T11). It consists of 325 Xinhua newswire articles on a variety of subjects. We selected it because it is a standard data set on which others have published results, and because it has long sentences, relatively few numbers, and a large and rich vocabulary—all of which make word segmentation more challenging.

In pre-processing this Chinese Treebank data, we ignore all part-of-speech tags and exclude all header and title information, obtaining our training and testing data from only the body of the articles. The article bodies contain a total of 3,811 sentences, comprising 96,653 words (10,653 of which are unique) and 166,120 Chinese characters.

The test set for all experiments below consists of the sentences of the articles whose file number in the Treebank is evenly divisible by four. The sentences from the other articles comprise the training set. The training set contains 2,805 sentences (71,969 tokens,[2] 62,145 words); the testing set, 1,006 sentences (24,684 tokens, 21,109 words). The testing set has 4,587 unique words, 1,688 (37%) of which do not appear anywhere in the training set. This out-of-vocabulary rate speaks to the large and varied vocabulary in this data set, and to the difficulty of this segmentation task.

[2] Words plus punctuation.

5.1 Lexicon Features as Domain Knowledge

Many lexicons of Chinese words and characters are readily available from the Internet and other sources. Our domain-knowledge-providing lexicons consist of 28 lists of Chinese words and characters obtained from several Internet sites,[3] from BBN’s Chinese named entity recognizer, and from a local native speaker. They were all obtained independently of LDC data. (We could have reasonably used the Penn Chinese Dictionary and a word lexicon built from the training corpus, but did not.) The complete list of our lexicons appears in Figure 2.

[3] E.g. http://www.mandarintools.com.
    adjective ending character      money
    adverb ending character         negative characters
    building words                  organization indicator
    Chinese number characters       preposition characters
    Chinese period (dot)            provinces
    cities & regions                punctuation characters
    countries                       Roman alphabetics
    dates                           Roman digits
    department characters           stopwords
    digit characters                surnames
    foreign name characters         symbol characters
    function words                  time
    job titles                      verb characters
    locations                       wordlist (188k lexicon)

Figure 2: Lexicons used in our experiments.

Features used in our CRF are time-shifted conjunctions of membership queries to these lexicons over character subsequence windows of size 1-8. An additional set of features captures whether the current character is equal to the character before or after it, using shifts of up to 3 positions. Writing $C_0$ for an individual feature test, $C$, applied at the current sequence position, and $C_{-1}C_0C_1$ for a conjunction of feature tests applied at neighboring positions, the complete set of time-shifting patterns we use consists of single positions and two- and three-element conjunctions within a small window around the current character. The total number of such features appearing in the training set is 42,187. Given that each of these features has a parameter on each FSM transition, and that there are sequence-start and sequence-end weights for each state, our second-order Markov CRF has a total of 337,500 parameters.

Using our Java implementation and a 1.6GHz Pentium, training the CRF by L-BFGS on the 2,805 sentences of the training set takes about 100 iterations, 300 megabytes of memory and 5 hours. Segmenting one sentence from the test set takes less than a tenth of a second.

Accuracy is measured in terms of the percentage of tags that are correct, and in terms of segmentation precision, recall and F1. Let $P$ be the number of predicted segments, $T$ be the number of true segments, and $C$ be the number of predicted segments with perfect starting and ending boundaries; then precision is $C/P$ and recall is $C/T$. F1 is the harmonic mean of precision and recall.
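For illustration, a small Java sketch of this segment-boundary comparison (evaluation scripts differ in details such as handling empty predictions, so treat it as a sketch):

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class SegmentationEval {
        // Represents each word as its [start, end) character span in the sentence.
        static Set<String> spans(List<String> words) {
            Set<String> spans = new HashSet<>();
            int pos = 0;
            for (String w : words) {
                spans.add(pos + ":" + (pos + w.length()));
                pos += w.length();
            }
            return spans;
        }

        public static void main(String[] args) {
            List<String> gold = Arrays.asList("\u5317\u4eac", "\u5927\u5b66");       // true segments
            List<String> pred = Arrays.asList("\u5317\u4eac", "\u5927", "\u5b66");   // predicted segments

            Set<String> goldSpans = spans(gold);
            Set<String> predSpans = spans(pred);
            Set<String> correct = new HashSet<>(predSpans);
            correct.retainAll(goldSpans);   // predicted segments with exactly correct boundaries

            double precision = (double) correct.size() / predSpans.size();
            double recall = (double) correct.size() / goldSpans.size();
            double f1 = 2 * precision * recall / (precision + recall);
            System.out.printf("P=%.3f R=%.3f F1=%.3f%n", precision, recall, f1);
        }
    }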
Results are given in Figure 3. On the Chinese Treebank with realistic, incomplete lexicons, the best previous result of which we are aware is 95.2% F1 (Xue and Converse, 2002). CRFs produce 97.5% F1, reducing error by half. Furthermore, this improved performance was reached with significantly less training data. As additional testament to the training-data efficiency of our domain-knowledge-laden features, notice that CRFs maintain high accuracy with a shockingly small amount of training data: with only 140 training sentences, CRFs still out-perform previous methods; with only 56 labeled sentences, they match the performance of Teahan et al. (2000), which was trained on roughly 700 times as much data.

    Method   # training sentences   tag acc.   prec.   recall   F1
    Peng     ?                      ?          75.1    74.0     74.2
    Ponte    ?                      ?          84.4    87.8     86.0
    Teahan   40k                    ?          ?       ?        94.4
    Xue      10k                    97.8       95.2    95.1     95.2
    CRF      2805                   99.9       97.3    97.8     97.5
    CRF      1402                   99.9       96.9    97.4     97.1
    CRF      140                    100        95.4    96.0     95.7
    CRF      56                     100        93.9    95.0     94.4
    CRF      7                      100        82.8    83.5     83.1
    CRF+     2805                   100        99.6    99.7     99.6
    CRF-     2805                   99.9       92.6    92.6     92.6
    CRF-     1402                   100        90.6    90.4     90.5
    CRF-     140                    100        80.0    78.6     79.3
    CRF-     7                      100        60.4    59.0     59.7

Figure 3: Accuracy comparison in percentages. Peng (Peng and Schuurmans, 2001) is an unsupervised method. Ponte (Ponte and Croft, 1996) is a lexicon-based method. Teahan (Teahan et al., 2000) is a supervised generative FSM. Xue (Xue and Converse, 2002) is a supervised maximum-entropy method. CRF is the method described in this paper. CRF+ is the same, using an artificially complete lexicon that has a zero OOV rate. CRF- uses traditional character-based features instead of multiple lexicons.

We were particularly pleased to see that CRF accuracy is high even on words that appear in none of our 28 lexicons. On the test set, there are 1,134 out-of-vocabulary words, 985 of which the CRF correctly segments—a success rate of 87%. Brief scanning of our results by a native speaker indicates that some of the “errors” made by the CRF are actually judgement calls that could go either way, with most of the true segmentation errors being in proper names.

With the full training set CRFs are still over-fitting (F1 on the training set is 99.7%), so CRFs would be expected to do even better with more training data. We might also do significantly better if we gathered more words for our wordlist lexicon—when we artificially give CRFs a perfect lexicon by including all words in our training and testing sets (CRF+ in Figure 3), we obtain 99.6% F1.

To further demonstrate the advantages of domain-knowledge-laden features, we also trained and tested CRFs using only traditional character-based features
(CRF- in Figure 3). Features indicate the presence of individual characters, as well as time-shifted conjunctions of them, using the same kinds of conjunction patterns as before—resulting in about 400k features. Including more context would have greatly enlarged the total number of features, which would not appear helpful, since the model was already badly over-fitting (having training F1 of 99.9% even with the maximum amount of training data).

All of the above results use a Gaussian prior with a variance selected by cross-validation on the training set. Alternative values of the variance ranging from 0.1 to 10 did not have a high impact—all had testing F1 above or near 97%. Experiments with the hyperbolic-L1 prior also yielded results in the same range, with none higher than 97.3%.

6 Conclusions

Conditional random fields support a strong, principled integration of finite-state inference, discriminative training, and the use of rich, arbitrary features. When domain knowledge is incorporated into the feature set, we can drastically reduce the need for labeled training data.

We would expect our accuracy to rise even further by performing cross-validated error analysis on the training set, and by tuning the features and other meta-parameters. It should also improve by further enlarging our lexicons and increasing our amount of labeled training data. In future work we also plan to investigate the benefits of feature induction (which could be used to discover those cases where capturing individual Chinese characters would be helpful), and to investigate methods for training with combinations of labeled and unlabeled data. Additionally, we plan to apply our methods to segmenting other Asian languages, such as Japanese and Thai.

Acknowledgements

This work was supported in part by the Center for Intelligent Information Retrieval and in part by SPAWARSYSCEN-SD grant numbers N66001-99-1-8912 and N66001-02-1-8903. Any opinions, findings and conclusions or recommendations expressed in this material are the author(s)’ and do not necessarily reflect those of the sponsor. We would also like to thank Douglas Waterman for helpful comments on a previous draft.

References

Rie Kubota Ando and Lillian Lee. 2000. Mostly-unsupervised statistical segmentation of Japanese: Applications to Kanji. In NAACL, pages 241–248.

R. H. Byrd, J. Nocedal, and R. B. Schnabel. 1994. Representations of quasi-Newton matrices and their use in limited memory methods. Mathematical Programming, 63:129–156.

K. J. Chen and S. H. Liu. 1992. Word identification for Mandarin Chinese sentences. In Proceedings of the Fourteenth International Conference on Computational Linguistics.

Stanley F. Chen and Ronald Rosenfeld. 1999. A Gaussian prior for smoothing maximum entropy models. Technical Report CMU-CS-99-108, CMU.

John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. ICML, pages 282–289.

Robert Malouf. 2002. A comparison of algorithms for maximum entropy parameter estimation. In Proceedings of the Sixth Conference on Natural Language Learning (CoNLL-2002).
Andrew McCallum, Dayne Freitag, and Fernando Pereira. 2000. Maximum entropy Markov models for information extraction and segmentation. In Proc. ICML 2000, pages 591–598.

Fuchun Peng and Dale Schuurmans. 2001. Self-supervised Chinese word segmentation. In F. Hoffman et al., editors, Advances in Intelligent Data Analysis, pages 238–247.

Jay M. Ponte and W. Bruce Croft. 1996. USeg: A retargetable word segmentation procedure for information retrieval. In Symposium on Document Analysis and Information Retrieval.

Lawrence R. Rabiner. 1990. A tutorial on hidden Markov models and selected applications in speech recognition. In Alex Waibel and Kai-Fu Lee, editors, Readings in Speech Recognition, pages 267–296. Morgan Kaufmann.

Fei Sha and Fernando Pereira. 2002. Shallow parsing with conditional random fields. Technical Report CIS TR MS-CIS-02-35, University of Pennsylvania, Computer and Information Science Department. (forthcoming).

Richard W. Sproat, Chilin Shih, William Gale, and Nancy Chang. 1996. A stochastic finite-state word-segmentation algorithm for Chinese. Computational Linguistics, 22(3):377–404.

W. J. Teahan, Y. Wen, R. McNab, and I. H. Witten. 2000. A compression-based algorithm for Chinese word segmentation. Computational Linguistics, 26(3):375–393.

Zimin Wu and Gwyneth Tseng. 1993. Chinese text segmentation for text retrieval: Achievements and problems. Journal of the American Society for Information Science, 44(9):532–544.

Nianwen Xue and Susan Converse. 2002. Combining classifiers for Chinese word segmentation. In Proceedings of the First SIGHAN Workshop on Chinese Language Processing.
