
A two-stage statistical word segmentation system for Chinese

Guohong Fu                        K.K. Luke
Dept of Linguistics               Dept of Linguistics
The University of Hong Kong       The University of Hong Kong
Pokfulam Road, Hong Kong          Pokfulam Road, Hong Kong
ghfu@hkucc.hku.hk                 kkluke@hkusua.hku.hk

Abstract

In this paper we present a two-stage statistical word segmentation system for Chinese based on word bigram and word-formation models. The system was evaluated on the Peking University corpora at the First International Chinese Word Segmentation Bakeoff. We also report and discuss the results of this evaluation.

1 Introduction

Word segmentation, which aims to recognize the implicit word boundaries in Chinese text, is of fundamental importance for Chinese language processing. Over the past decades, great progress has been made in Chinese word segmentation (Nie et al., 1995; Yao, 1997; Fu and Wang, 1999; Wang et al., 2000; Zhang et al., 2002). However, two difficult problems remain for any practical segmentation system aimed at large open applications: ambiguity resolution and unknown word (so-called OOV word) identification.

In this paper, we present a two-stage statistical word segmentation system for Chinese. In the first stage, a word bigram model is employed to segment known words (viz. the words included in the system dictionary) in the input. In the second stage, a hybrid algorithm that incorporates word contextual information, word-formation patterns and a word juncture model is applied to identify unknown words.

The rest of this paper is organized as follows: Section 2 presents a word bigram solution for known word segmentation. Section 3 describes a hybrid approach for unknown word identification. In Section 4, we report the results of our system in the SIGHAN evaluation program, and in the final section we give our conclusions on this work.

2 The first stage: Segmentation of known words

In a sense, known word segmentation is a process of disambiguation. In our system, we use a word bigram language model and the Viterbi algorithm (Viterbi, 1967) to resolve word boundary ambiguities in known word segmentation.

For a given input Chinese character string $C = c_1 c_2 \cdots c_n$, there is usually more than one possible segmentation $W = w_1 w_2 \cdots w_m$ according to the system dictionary. Word bigram segmentation aims to find the segmentation $\hat{W} = w_1 w_2 \cdots w_m$ that maximizes the probability $\prod_{i=1}^{m} \Pr(w_i \mid w_{i-1})$, i.e.

    $\hat{W} = \arg\max_W \Pr(W \mid C) \approx \arg\max_W \prod_{i=1}^{m} \Pr(w_i \mid w_{i-1})$    (1)

where $\Pr(w_i \mid w_{i-1})$ is the probability that the word $w_i$ will occur given the previous word $w_{i-1}$, which can easily be estimated from a segmented corpus using maximum likelihood estimation (MLE), i.e.

    $\Pr(w_i \mid w_{i-1}) \approx \frac{\mathrm{Count}(w_{i-1} w_i)}{\mathrm{Count}(w_{i-1})}$    (2)

To avoid the problem of data sparseness in MLE, we apply the linear interpolation technique (Jelinek and Mercer, 1980) to smooth the estimated word bigram probabilities.
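As a concrete illustration of the first stage, the following sketch estimates an interpolated word bigram model (Eq. 2, smoothed in the spirit of Jelinek and Mercer, 1980) and segments a character string with Viterbi-style dynamic programming over character positions, maximizing Eq. (1). This is a minimal reconstruction under our own assumptions, not the authors' implementation: the names (BigramModel, segment_known), the interpolation weight and the single-character fallback for out-of-dictionary spans are all illustrative choices.

```python
import math
from collections import defaultdict

class BigramModel:
    """Word bigram model with linear interpolation smoothing:
    P(w|v) = lam * P_ML(w|v) + (1 - lam) * P_ML(w)."""

    def __init__(self, lam=0.9):
        self.lam = lam                 # interpolation weight (assumed value)
        self.uni = defaultdict(int)    # Count(w)
        self.bi = defaultdict(int)     # Count(v w)
        self.total = 0

    def train(self, segmented_sentences):
        for words in segmented_sentences:   # each sentence is a list of words
            prev = "<s>"
            self.uni[prev] += 1
            for w in words:
                self.uni[w] += 1
                self.bi[(prev, w)] += 1
                self.total += 1
                prev = w

    def logprob(self, prev, w):
        # MLE bigram term (Eq. 2), interpolated with the unigram frequency
        p_bi = self.bi[(prev, w)] / self.uni[prev] if self.uni[prev] else 0.0
        p_uni = self.uni[w] / self.total if self.total else 0.0
        p = self.lam * p_bi + (1.0 - self.lam) * p_uni
        return math.log(p) if p > 0 else -50.0   # floor for unseen events

def segment_known(text, dictionary, model, max_word_len=8):
    """Viterbi search over character positions: best[i] is the best-scoring
    analysis of text[:i], so the result maximizes Eq. (1) in log space."""
    n = len(text)
    best = [None] * (n + 1)        # best[i] = (score, last_word, backpointer)
    best[0] = (0.0, "<s>", -1)
    for i in range(1, n + 1):
        for j in range(max(0, i - max_word_len), i):
            word = text[j:i]
            if best[j] is None:
                continue
            # allow single characters so the lattice always has a path
            if word not in dictionary and len(word) > 1:
                continue
            score = best[j][0] + model.logprob(best[j][1], word)
            if best[i] is None or score > best[i][0]:
                best[i] = (score, word, j)
    words, i = [], n               # follow backpointers to recover the path
    while i > 0:
        _, w, j = best[i]
        words.append(w)
        i = j
    return list(reversed(words))
```

Working in log space keeps the product in Eq. (1) numerically stable, and the single-character fallback guarantees that every input string receives some segmentation.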
3 The second stage: Unknown word identification

The second stage is mainly concerned with the unknown words that remain unresolved after the first stage. This section describes a hybrid algorithm for unknown word identification that incorporates a word juncture model, word-formation patterns and contextual information. To avoid the complicated normalization of probabilities of different dimensions, the simple superposition principle is used to merge these models.

3.1 Word juncture model

The word juncture model scores an unknown word by assigning a juncture type to each word pair. Most unknown words appear as strings of known words after the first-stage segmentation, so unknown word identification can be viewed as a process of re-assigning the correct juncture type to each known word pair in the input. Given a known word string $W = w_1 w_2 \cdots w_n$, there is a word juncture between each word pair $w_i w_{i+1}$ ($1 \le i \le n-1$). In general, there are two types of junctures in unknown word identification, namely word boundary (denoted by $t_B$) and non-word boundary (denoted by $t_N$).

Let $t(w_i w_{i+1})$ denote the type of a word juncture $w_i w_{i+1}$, and $\Pr(t(w_i w_{i+1}))$ the relevant conditional probability. Then

    $\Pr(t(w_i w_{i+1})) \stackrel{\mathrm{def}}{=} \frac{\mathrm{Count}(t(w_i w_{i+1}))}{\mathrm{Count}(w_i w_{i+1})}$    (3)

Thus the word juncture probability $P_{WJM}(w_U)$ of a particular unknown word $w_U = w_i w_{i+1} \cdots w_j$ ($1 \le i \le j \le n$) can be calculated as

    $P_{WJM}(w_U) = \Pr(t_B(w_{i-1} w_i)) \times \Pr(t_B(w_j w_{j+1})) \times \prod_{l=i}^{j-1} \Pr(t_N(w_l w_{l+1}))$    (4)

In a sense, the word juncture model mirrors the affinity of known word pairs in forming an unknown word: for a word juncture $(w_i, w_{i+1})$, the larger the probability $\Pr(t_N(w_i w_{i+1}))$, the more likely the two words are to merge into one new word.
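The juncture counts and the score of Eq. (4) are straightforward to compute once every adjacent pair of known words in the training corpus is labelled as a boundary or non-boundary juncture. The sketch below is our own minimal rendering of this model; the names (WordJunctureModel, juncture_score) and the sentence sentinels are illustrative assumptions.

```python
from collections import defaultdict

class WordJunctureModel:
    """Relative-frequency estimate of Pr(t(w_i w_{i+1})) as in Eq. (3),
    from junctures labelled word-boundary (t_B) or non-word-boundary (t_N)."""

    def __init__(self):
        self.n_b = defaultdict(int)   # Count(t_B(v w))
        self.n_n = defaultdict(int)   # Count(t_N(v w))

    def observe(self, v, w, inside_word):
        (self.n_n if inside_word else self.n_b)[(v, w)] += 1

    def p_boundary(self, v, w):
        total = self.n_b[(v, w)] + self.n_n[(v, w)]   # Count(v w)
        return self.n_b[(v, w)] / total if total else 0.0

    def p_nonboundary(self, v, w):
        total = self.n_b[(v, w)] + self.n_n[(v, w)]
        return self.n_n[(v, w)] / total if total else 0.0

def juncture_score(words, i, j, wjm, left="<s>", right="</s>"):
    """P_WJM of merging words[i..j] into one unknown word (Eq. 4):
    boundary probability at both edges, non-boundary probability inside."""
    left_w = words[i - 1] if i > 0 else left
    right_w = words[j + 1] if j + 1 < len(words) else right
    p = wjm.p_boundary(left_w, words[i]) * wjm.p_boundary(words[j], right_w)
    for l in range(i, j):
        p *= wjm.p_nonboundary(words[l], words[l + 1])
    return p
```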
3.2 Word-formation patterns

The word-formation pattern model scores an unknown word according to the probability of how each internal known word contributes to its formation. In general, a known word $w$ may take one of four patterns while forming a word: (1) $w$ is itself a word; (2) $w$ is the beginning of an unknown word; (3) $w$ is in the middle of an unknown word; (4) $w$ appears at the end of an unknown word. For convenience, we use $S$, $B$, $M$ and $E$ to denote these four patterns respectively. Let $pttn(w)$ denote a particular pattern of $w$ in an unknown word and $\Pr(pttn(w))$ the relevant probability. Then

    $\Pr(pttn(w)) \stackrel{\mathrm{def}}{=} \frac{\mathrm{Count}(pttn(w))}{\mathrm{Count}(w)}$    (5)

Obviously, $\sum_{pttn} \Pr(pttn(w)) = 1$, and $1 - \Pr(S(w))$ is the word-formation power of the known word $w$.

Let $P_{pttn}(w_U)$ be the overall word-formation pattern probability of an unknown word $w_U = w_1 w_2 \cdots w_l$. Then

    $P_{pttn}(w_U) = \prod_{w_i \in w_U} \Pr(pttn(w_i))$    (6)

Theoretically, a known word can take any pattern while forming an unknown word, but the probabilities are far from uniform across words and patterns. For example, the word 性 (xing4, nature) is most likely to act as a suffix: according to our investigation of the training corpus, the character 性 appears at the end of a multi-character word in more than 93% of cases.
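Estimating Eq. (5) amounts to counting, for every known word, how often it occurs alone versus at the beginning, middle or end of a longer gold-standard word. Below is a minimal sketch under the same caveat as before: names such as WordFormationPatterns and pattern_score are our own, and scoring a single kept word with its S pattern is our reading of how such a token fits Eq. (6).

```python
from collections import defaultdict

class WordFormationPatterns:
    """Pr(pttn(w)) for pttn in {S, B, M, E}, estimated as in Eq. (5)."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))  # counts[w][pattern]

    def observe(self, w, pattern):
        self.counts[w][pattern] += 1          # pattern is "S", "B", "M" or "E"

    def prob(self, w, pattern):
        total = sum(self.counts[w].values())  # Count(w)
        return self.counts[w][pattern] / total if total else 0.0

    def formation_power(self, w):
        # 1 - Pr(S(w)): the tendency of w to combine into larger words
        return 1.0 - self.prob(w, "S")

def pattern_score(parts, wfp):
    """P_pttn of a candidate token built from the known words `parts` (Eq. 6):
    B for the first part, M for middle parts, E for the last;
    a single kept word is scored with its S pattern."""
    if len(parts) == 1:
        return wfp.prob(parts[0], "S")
    p = wfp.prob(parts[0], "B")
    for w in parts[1:-1]:
        p *= wfp.prob(w, "M")
    return p * wfp.prob(parts[-1], "E")
```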
3.3 Hybrid algorithm for unknown word identification

The current algorithm for unknown word identification consists of three major components: (1) an unknown word extractor first extracts from the output of the first stage a fragment of known words $w_1 w_2 \cdots w_n$ that may contain unknown words, based on the relevant word-formation powers and word juncture probabilities, together with its left and right contextual words $w_L$ and $w_R$; (2) a candidate word constructor then generates a lattice of all possible new segmentations $\{W_U \mid W_U = x_1 x_2 \cdots x_m\}$ of the extracted fragment that may involve unknown words; (3) a decoder finally incorporates the word juncture model $P_{WJM}(W_U)$, the word-formation patterns $P_{pttn}(W_U)$ and the word bigram probability $P_{bigram}(W_U)$ to score these candidates, and applies the Viterbi algorithm again to find the new segmentation $\hat{W}_U = x_1 x_2 \cdots x_m$ with the maximum score:

    $\hat{W}_U = \arg\max_{W_U} \{P_{pttn}(W_U) + P_{WJM}(W_U) + P_{bigram}(W_U)\} = \arg\max_{W_U} \sum_{i=1,\ldots,m} \left( P_{pttn}(x_i) + P_{WJM}(x_i) + \Pr(x_i \mid x_{i-1}) \right)$    (7)

where $x_0 = w_L$ and $x_{m+1} = w_R$. Let $w_U$ denote any unknown word in the training corpus. If $x_i$ is an unknown word, then $\Pr(x_i \mid x_{i-1}) = \frac{\sum_{w_U} \mathrm{Count}(x_{i-1} w_U)}{\mathrm{Count}(x_{i-1})}$.
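The scoring rule in Eq. (7) adds the three model scores by simple superposition rather than multiplying normalized probabilities. The sketch below reuses juncture_score and pattern_score from the previous sketches and, instead of the paper's Viterbi search over a candidate lattice, enumerates all regroupings of a short extracted fragment directly, which yields the same argmax. The callables bigram and unk_bigram (the pooled unknown-word estimate from the where-clause of Eq. 7), and the choice to add the juncture term only for merged tokens, are our own assumptions.

```python
import itertools

def decode_fragment(words, w_left, w_right, wjm, wfp, bigram, unk_bigram):
    """Pick the regrouping of a stage-1 word fragment that maximizes Eq. (7).
    Each candidate token x contributes P_pttn(x) + P_WJM(x) + Pr(x | x_prev)."""
    n = len(words)
    best, best_score = None, float("-inf")
    # a candidate = a split/merge decision at each of the n-1 junctures
    for mask in itertools.product((0, 1), repeat=n - 1):
        groups, start = [], 0
        for k, split in enumerate(mask, start=1):
            if split:
                groups.append((start, k - 1))
                start = k
        groups.append((start, n - 1))
        score, prev = 0.0, w_left
        for i, j in groups:
            token = "".join(words[i:j + 1])
            score += pattern_score(words[i:j + 1], wfp)
            if j > i:   # only merged tokens get a word juncture score here
                score += juncture_score(words, i, j, wjm,
                                        left=w_left, right=w_right)
            # bigram term: pooled unknown-word estimate for merged tokens,
            # the ordinary word bigram probability otherwise
            score += unk_bigram(prev, token) if j > i else bigram(prev, token)
            prev = token
        if score > best_score:
            best_score = score
            best = ["".join(words[i:j + 1]) for i, j in groups]
    return best
```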
4 Experiments running on an ACER notebook (TM632XC-P4M).
Our system participated in both the closed and open tests on the Peking University corpora at the First International Chinese Word Segmentation Bakeoff. This section reports and discusses the results of this evaluation.
4.1 Measures

In the evaluation program of the First International Chinese Word Segmentation Bakeoff, six measures are employed to score the performance of a word segmentation system: test recall (R), test precision (P), the balanced F-measure (F), the out-of-vocabulary (OOV) rate of the test corpus, the recall on OOV words (R_OOV), and the recall on in-vocabulary words (R_IV). OOV is defined as the set of words in the test corpus that do not occur in the training corpus in the closed test, and as the set of words in the test corpus that do not occur in the lexicon used in the open test.
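For reference, the balanced F-measure combines test precision and recall with equal weight; its standard definition (not spelled out in the paper) is

    $F = \frac{2 \times P \times R}{P + R}$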
4.2 Experimental lexicons and corpora

As shown in Table 1, we used only the training data from the Peking University corpus to train our system, in both the closed and open tests. As for the dictionary, we compiled a 55,226-word dictionary for the closed test from the training corpus, and used a dictionary of about 65,269 words in the open test.

    Items    # words in lexicon    # train. words    # test. words
    Closed   55,226                1,121,017         17,194
    Open     65,269                1,121,017         17,194

    Table 1: Experimental lexicons and corpora

4.3 Experimental results and discussion

    Items    F      R      P      OOV    R_OOV    R_IV
    Closed   0.939  0.936  0.942  0.069  0.675    0.955
    Open     0.937  0.933  0.941  0.094  0.762    0.950

    Table 2: Test results on the PK corpus

Segmentation speed: There are in all about 28,458 characters in the test corpus. Running on an ACER notebook (TM632XC-P4M), our system takes about 3.21 and 3.07 seconds in all to perform full segmentation (including known word segmentation and unknown word identification) on the closed and open test corpus respectively. This indicates that our system is able to process about 531,925~556,182 characters per minute.
Results and discussion: The results for the closed and open tests are presented in Table 2. We can draw some conclusions from these results.

Firstly, the overall performance of our system is very stable across the closed and open tests. As shown in Table 2, the out-of-vocabulary (OOV) rate is 6.9% in the closed test and 9.4% in the open test, yet the overall F-measure drops by only 0.2 percentage points in the open test compared with the closed test.

Secondly, our approach can handle most unknown words in the input. As can be seen from Table 2, the recall on OOV words is 67.5% in the closed test and 76.2% in the open test. Wang et al. (2000) and Yao (1997) have proposed a character juncture model and word-formation patterns for Chinese unknown word identification. However, being character-based methods, their approaches only work for unknown words made up of pure monosyllabic characters. To address this problem, we introduce both a word juncture model and word-based word-formation patterns into our system. As a result, our system can deal with unknown words composed of different kinds of known words, including both monosyllabic characters and multi-character words.

Although our system is effective for most ambiguities and unknown words in the input, it has some inherent deficiencies. Firstly, to avoid data sparseness, we do not differentiate known words from unknown words when estimating the word juncture model and word-formation patterns from the training corpus. This simplification may introduce some noise into these models for identifying unknown words. Our further investigations show that the precision on OOV words is still rather low, i.e. 67.1% in the closed test and 72.5% in the open test. As a result, our system may yield a number of mistaken unknown words during processing.

Secondly, we regard known word segmentation and unknown word identification as two independent stages in our system. This strategy is simple and easily applicable, but it breaks down when the input contains a mixture of ambiguities and unknown words. For example, the test corpus contains the sentence 中行长葛支行注重健身, in which the string 中行长葛 is a fragment mixing ambiguity with unknown words. The correct segmentation is 中行/长葛/, where 中行 (Zhonghang, the Bank of China) is an abbreviated organization name and 长葛 (Changge) is a place name. However, this fragment is segmented as 中/行长/葛/ in the first stage of our system, and the unknown word identification stage has no mechanism to split the word 行长 (hangzhang, president), which finally results in a wrong segmentation.
5 Conclusions

This paper presents a two-stage statistical word segmentation system for Chinese. In the first stage, a word bigram model and the Viterbi algorithm are applied to perform known word segmentation on the input plain text; a hybrid approach is then employed in the second stage, incorporating word bigram probabilities, a word juncture model and word-based word-formation patterns to detect OOV words. Experiments on the Peking University corpora show that the present system, based on fairly simple word bigram and word-formation models, can achieve an F-score of 93.7% or above. In future work, we hope to improve our strategies for estimating the word juncture model and word-formation patterns, and to develop an integrated technique that performs known word segmentation and unknown word identification in one pass.

Acknowledgments

We would like to thank all colleagues of the First International Chinese Word Segmentation Bakeoff for their evaluation of the results, and the Institute of Computational Linguistics, Peking University, for providing the experimental corpora.

References

Fu, Guohong and Xiaolong Wang. 1999. Unsupervised Chinese word segmentation and unknown word identification. In: Proceedings of NLPRS'99, Beijing, China, 32-37.

Jelinek, Frederick, and Robert L. Mercer. 1980. Interpolated estimation of Markov source parameters from sparse data. In: Proceedings of the Workshop on Pattern Recognition in Practice, Amsterdam, 381-397.

Nie, Jian-Yuan, M.-L. Hannan and W.-Y. Jin. 1995. Unknown word detection and segmentation of Chinese using statistical and heuristic knowledge. Communications of COLIPS, 5(1&2): 47-57.

Viterbi, A. J. 1967. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, IT-13(2): 260-269.

Wang, Xiaolong, Guohong Fu, Danial S. Yeung, James N.K. Liu, and Robert Luk. 2000. Models and algorithms of Chinese word segmentation. In: Proceedings of the International Conference on Artificial Intelligence (IC-AI'2000), Las Vegas, Nevada, USA, 1279-1284.

Yao, Yuan. 1997. Statistics Based Approaches Towards Chinese Language Processing. Ph.D. thesis, National University of Singapore.

Zhang, Hua-Ping, Qun Liu, Hao Zhang, and Xue-Qi Cheng. 2002. Automatic recognition of Chinese unknown words based on roles tagging. In: Proceedings of the First SIGHAN Workshop on Chinese Language Processing, Taiwan, 71-77.
