
Japanese Case Structure Analysis
by Unsupervised Construction of a Case Frame Dictionary

Daisuke Kawahara, Nobuhiro Kaji and Sadao Kurohashi

Graduate School of Informatics, Kyoto University
Yoshida-Honmachi, Sakyo-ku, Kyoto, 606-8501, Japan
{kawahara, kaji, kuro}@pine.kuee.kyoto-u.ac.jp

Abstract

In Japanese, case structure analysis is very important to handle several troublesome characteristics of Japanese such as scrambling, omission of case components, and disappearance of case markers. However, for lack of a wide-coverage case frame dictionary, it has been difficult to perform case structure analysis accurately. Although several methods to construct a case frame dictionary from analyzed corpora have been proposed, they cannot avoid the data sparseness problem. This paper proposes an unsupervised method of constructing a case frame dictionary from an enormous raw corpus by using a robust and accurate parser. It also provides a case structure analysis method based on the constructed dictionary.

1 Introduction

Syntactic analysis, or parsing, has been a main objective in Natural Language Processing. In the case of Japanese, however, syntactic analysis cannot clarify relations between words in sentences because of several troublesome characteristics of Japanese such as scrambling, omission of case components, and disappearance of case markers. Therefore, in Japanese sentence analysis, case structure analysis is an important issue, and a case frame dictionary is necessary for the analysis.

Some research institutes have constructed Japanese case frame dictionaries manually (Ikehara et al., 1997; Information-Technology Promotion Agency, Japan, 1987). However, it is quite expensive, or almost impossible, to construct a wide-coverage case frame dictionary by hand.

Others have tried to construct a case frame dictionary automatically from analyzed corpora (Utsuro et al., 1998). However, existing syntactically analyzed corpora are too small to learn a dictionary, since case frame information consists of relations between nouns and verbs, which multiplies to millions of combinations.

Based on such a consideration, we took the following unsupervised learning strategy for Japanese case structure analysis:

1. At first, a robust and accurate parser is developed, which does not utilize a case frame dictionary,
2. a very large corpus is parsed by the parser,
3. reliable noun-verb relations are extracted from the parse results, and a case frame dictionary is constructed from them, and
4. the dictionary is utilized for case structure analysis.

2 Characteristics of Japanese language and necessity of case structure analysis

In Japanese, postpositions function as case markers (CMs) and a verb is final in a sentence. The basic structure of a Japanese sentence is as follows:

(1) kare ga            coat  wo            kiru.
    he  nominative-CM  coat  accusative-CM wear
    (He wears a coat)

A clause modifier is left to the modified noun as follows:

(2) kare ga      kite-iru coat
    he  nom-CM   wear     coat
    (the coat he wears)

The modified noun followed by a postposition then becomes a case component of a matrix verb. The typical structure of a Japanese complex sentence is as follows:
(3) boushi no  iro   wa           kite-iru coat ni         awaseru.
    hat    of  color topic-marker wear     coat dative-CM  harmonize
    (φ harmonizes the color of his/her hat with the coat he/she wears)

In terms of automatic analysis, the problematic characteristics of Japanese sentences can be summarized as follows:

1. Case components are often scrambled or omitted.

2. Case-marking postpositions disappear when case components are accompanied by topic-markers or other special postpositions meaning 'just', 'also' and others.

   ex) kare wa           coat mo   kite-iru.
       he   topic-marker coat also wear
       (He wears a coat also)

3. A noun modified by a clause is usually a case component for the verb of the modifying clause. However, there is no case-marker for their relation. In the case of sentence 3, there is no case-marker for coat in relation to kite-iru 'wear'. Note that ni (dative-CM) of coat ni does not show the case to kite-iru 'wear', but to awaseru 'harmonize'.

4. Sentence 3 exhibits a typical structural ambiguity in a Japanese sentence. That is, iro wa 'color topic-marker' possibly modifies kite-iru 'wear' or awaseru 'harmonize'.

In English, sentence structure is rather rigid, and word order (the position in relation to the verb) clearly defines cases. In Japanese, however, the problem 1 above makes word order useless, and CMs constitute the only information for detecting cases.

Nevertheless, CMs often disappear because of the problems 2 and 3, which means that simple syntactic analysis cannot clarify cases sufficiently. For example, given an input sentence:

(4) kare wa           Deutsch-go mo   hanasu.
    he   topic-marker German     also speak
    (he speaks German also)

a simple syntactic analysis just detects that both kare 'he' and Deutsch-go 'German' modify hanasu 'speak', but tells nothing about which is the subject and which is the object. This analysis result is not sufficient for subsequent NLP applications like Japanese to English machine translation.

Then, what we need to do is a case structure analysis based on a case frame dictionary, or a subcategorization frame of each verb, as follows:

hanasu 'speak':
    ga (nom)  kare 'he', hito 'person'
    wo (acc)  eigo 'English', kotoba 'language'
kiru 'wear':
    ga (nom)  kare 'he', hito 'person'
    wo (acc)  fuku 'cloth', coat 'coat'
awaseru 'harmonize':
    ga (nom)  kare 'he', hito 'person'
    wo (acc)  iro 'color'
    ni (dat)  fuku 'cloth'

Consultation of such a dictionary can easily find that kare 'he' is a nominative case and Deutsch-go 'German' is an accusative case in the sentence 4.

Furthermore, a case frame dictionary can solve the problem 4 above, that is, some part of structural ambiguity in sentences. In the case of sentence 3, a proper head for iro wa 'color topic-marker' can be selected by consulting the case slots of kiru 'wear' and those of awaseru 'harmonize'.

3 Unsupervised construction of a case frame dictionary

This section explains how to construct a case frame dictionary from corpora automatically.

As mentioned in the introduction section, it is quite expensive, or almost impossible, to construct a wide-coverage case frame dictionary by hand. In Japanese, some noun + copula works like an adjective. For example, sansei da 'positiveness + copula' can take a ga case and a ni case. However, such case frames are rarely covered by the existing handmade dictionaries¹.

Furthermore, existing handmade dictionaries cover typical obligatory cases like ga (nominative), wo (accusative), and ni (dative), but do not cover compound case markers such as ni-kanshite 'in terms of', wo-megutte 'concerning' and others.

Then, we tried to construct an example-based case frame dictionary from corpora.

¹ Our method collects case frames not only for verbs, but also for adjectives and nouns + copula. In this paper, we use 'verb' instead of 'verb/adjective or noun + copula' for simplicity.

Table 1: The accuracy of KNP.

  ga      wo      ni        kara    made   yori    wa, mo           clause       clause
  (nom.)  (acc.)  (dative)  (from)  (to)   (from)  (topic-marker)   modifying    modifying    Total
                                                                    verbs        nouns
  91.2%   97.7%   94.2%     83.8%   85.3%  82.8%   88.0%            84.3%        95.5%        91.3%

The dictionary describes what kind of cases each verb has and what kind of nouns can fill a case slot. Very large syntactically analyzed corpora could be useful to construct such a dictionary. However, corpus annotation costs very much, and existing analyzed corpora are too small from the viewpoint of case frame learning. For example, in the Kyoto University Corpus, which consists of about 40,000 analyzed sentences of newspaper articles, very basic verbs like tetsudau 'help' or uketsukeru 'accept' appear only 10 times or 15 times respectively. It is obvious that such small data are insufficient for automatic case frame learning. That is, case frame learning must be done from enormous unanalyzed corpora, in an unsupervised way².

3.1 Good parser

The NLP research group at Kyoto University has been developing a robust and accurate parsing system, KNP, over the last ten years (Kurohashi and Nagao, 1994; Kurohashi and Nagao, 1998). This parser has the following advantages:

• Japanese is an agglutinative language, and several function words (auxiliary verbs, suffixes, and postpositions) often appear together, and in many cases compositionality does not hold among them. KNP treats such function words carefully and precisely.

• KNP detects scopes of coordination structures well based on their parallelism.

• KNP employs several heuristic rules to produce unique parses for the input sentences.

The accuracy of KNP is shown in Table 1, which counted whether each phrase modifies a proper head or not. The overall accuracy was around 90%, and the accuracy concerning case components varies from 82% to 98%.

We can collect pairs of verbs and case components from the automatic analyses of large corpora by KNP.

3.2 Coping with two problems

The quality of automatic case frame learning could be negatively influenced by the following two problems:

Word sense ambiguity: A verb sometimes has various usages and possibly has several case frames depending on its usages.

Structural ambiguity: KNP performs fairly well, but automatic parse results inevitably contain errors.

The following sections explain how to solve these problems.

3.2.1 Word sense ambiguity

If a verb has two or more meanings and their case frame patterns differ, we have to disambiguate the sense of each occurrence of the verb in a corpus first, and collect case components for each sense respectively. However, unsupervised word sense disambiguation of free texts is one of the most difficult problems in NLP. At the very beginning, even the definition of word senses is open to question.

To cope with this problem, we made a very simple but useful assumption: a light verb has different case frames depending on its main case component; an ordinary verb has a unique case frame even if it has two or more meanings. For example, the case frame of the verb naru 'become' differs depending on its ni (dative) case as follows:

    ... ga   byouki ni naru
        nom. ill       become
    ... ga   ... to tomodachi ni naru
        nom. with   a friend     become

In most cases, the main case components are placed just in front of the light verbs, so that the automatic parser can detect their relations reliably.

² In English, several unsupervised methods have been proposed (Manning, 1993; Briscoe and Carroll, 1997). However, as mentioned in Section 3, automatic Japanese case analysis is much harder than English.

Table 2: Examples of the constructed case frames.

verb             case marker   example nouns
tetsudau         ga (nom)      husband, person, child, staff, I, suspect, faculty, ...
'help'           wo (acc)      job, shop, farmwork, preparation, election, move, ...
                 ni (dat)      son, friend, ambassador, member, thank, holiday, ...
                 de (op)       volunteer, affair, office, reward, house, headquarters, ...
yomu             ga (nom)      person, I, child, adult, parent, teacher, ...
'read'           wo (acc)      newspaper, book, magazine, article, novel, letter, ...
                 ni (dat)      child, person, daughter, teacher, student, reader, ...
                 de (op)       newspaper, book, magazine, library, classroom, bathroom, ...
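Entries like those in Table 2 can be thought of as a nested mapping from a case frame key to case markers to example nouns. The following is only a sketch of one possible representation (all names and the storage layout are my own, not the authors' code); it also folds in the light-verb assumption from Section 3.2.1, under which a light verb gets one frame per main case component while an ordinary verb gets a single frame.

```python
from collections import defaultdict

# Hypothetical light-verb list, as described in Section 3.2.1.
LIGHT_VERBS = {"suru", "naru", "aru", "iu", "nai"}

def frame_key(verb, main_case_noun=None):
    """A light verb gets a separate frame per main case component;
    an ordinary verb gets a single frame regardless of its senses."""
    if verb in LIGHT_VERBS and main_case_noun is not None:
        return (verb, main_case_noun)
    return (verb, None)

# frame key -> case marker -> set of example nouns (cf. Table 2)
case_frames = defaultdict(lambda: defaultdict(set))

def add_instance(verb, case_marker, noun, main_case_noun=None):
    """Record one noun-verb relation extracted from a parsed sentence."""
    case_frames[frame_key(verb, main_case_noun)][case_marker].add(noun)

add_instance("yomu", "wo", "newspaper")
add_instance("yomu", "wo", "book")
# naru 'become' splits into frames by its main (ni) case component:
add_instance("naru", "ni", "byouki", main_case_noun="byouki")
add_instance("naru", "ni", "tomodachi", main_case_noun="tomodachi")
```

With this layout, the two usages of naru end up in distinct frames, while all instances of an ordinary verb like yomu accumulate in one frame.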

Therefore, as for five major and troublesome light verbs (suru 'do', naru 'become', aru 'is ...', iu 'say', nai 'not'), their case frames are distinguished depending on their left neighbouring case components. For other verbs, we assume a unique case frame.

3.2.2 Structural ambiguity

As shown in Table 1, KNP detects heads of case components with fairly high accuracy. However, in order to collect more reliable data, we discarded modifier-head relations in the automatically parsed corpora in the following cases:

• When CMs of case components disappear because of topic markers or others.

• When the verb is followed by a causative auxiliary or a passive auxiliary; the case pattern is changed and the trace in KNP is not so reliable.

Based on the conditions above, case components of each verb are collected from the parsed corpora, and the collected data are considered as the case frames of the verbs. However, if the frequency of a CM is very low compared to the other CMs, it might have been collected because of parse errors. So, we set the threshold for the CM frequency as 2√mf, where mf means the frequency of the most found CM. If the frequency of a CM is less than the threshold, it is discarded. For example, suppose the most frequent CM for a verb is wo, appearing 100 times, and the frequency of the ni CM for the verb is 16; the ni CM is discarded (since it is less than the threshold, 20).
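The frequency threshold above can be sketched as follows; this is a hypothetical implementation of the 2√mf rule, reproducing the wo:100 / ni:16 example from the text.

```python
import math

def filter_case_markers(cm_freqs):
    """Discard case markers whose frequency is below 2 * sqrt(mf),
    where mf is the frequency of the most frequent case marker."""
    mf = max(cm_freqs.values())
    threshold = 2 * math.sqrt(mf)
    return {cm: f for cm, f in cm_freqs.items() if f >= threshold}

# Example from the text: wo appears 100 times, so the threshold is 20,
# and a ni case seen only 16 times is discarded.
print(filter_case_markers({"wo": 100, "ni": 16}))  # {'wo': 100}
```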
3.3 Constructed case frame dictionary

We applied the above procedure to the Mainichi Newspaper Corpus (7 years, 3,600,000 sentences). From the corpus, case frames of 23,497 verbs were constructed; the average number of case slots of a verb is 2.8; the average number of example nouns in a case slot is 33.6. Table 2 shows examples of the constructed case frames.

Although the constructed data look appropriate in most cases, it is hard to evaluate a dictionary statically. In the next section, we use the dictionary in case structure analysis and evaluate the analysis result, which also implies an evaluation of the dictionary itself.

4 Case structure analysis using the constructed case frame dictionary

4.1 Matching of an input sentence and a case frame

The basic procedure in case structure analysis is to match an input sentence with a case frame, as shown in Figure 1.

The matching of case components in an input and case slots in a case frame is done under the following conditions:

1. When a case component has a CM, it must be assigned to the case slot with the same CM.

2. When a case component does not have a CM, it can be assigned to the ga, wo, or ni CM slot.

3. Only one case component can be assigned to a case slot (unique case assignment constraint).

The conditions above may produce multiple matching patterns, and to select the proper one among them, the nouns of the case components are compared with the examples in the case slots of the dictionary.
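The three matching conditions can be sketched as a constraint check over candidate assignments. This is hypothetical illustration code, not the authors' implementation; the example-similarity scoring that ranks the surviving patterns is described below.

```python
from itertools import permutations

def valid_matchings(components, frame_slots):
    """Enumerate assignments of case components to case slots satisfying:
    (1) a component with a CM goes only to the slot with the same CM,
    (2) a CM-less component may go to the ga, wo or ni slot,
    (3) each slot takes at most one component (unique case assignment)."""
    results = []
    # permutations() never reuses a slot, which enforces condition 3
    for slots in permutations(frame_slots, len(components)):
        ok = True
        for (noun, cm), slot in zip(components, slots):
            if cm is not None and cm != slot:
                ok = False  # condition 1 violated
            if cm is None and slot not in ("ga", "wo", "ni"):
                ok = False  # condition 2 violated
        if ok:
            results.append({noun: slot
                            for (noun, cm), slot in zip(components, slots)})
    return results

# sentence (5): syorui wa (CM-less, topic-marked) and kare ni (dative CM)
matchings = valid_matchings([("syorui", None), ("kare", "ni")],
                            ["ga", "wo", "ni", "de"])
```

For sentence (5) this yields exactly the two patterns discussed in the text: syorui in the ga slot or in the wo slot, with kare fixed to the ni slot.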

(5) syorui   wa           kare ni        watashita.
    document topic-marker he   dative-CM handed
    (I handed the document to him.)

(6) Deutsch-go mo   hanasu sensei
    German     also speak  teacher
    (a teacher who speaks also German)

[Diagram: the case components of sentences (5) and (6) are matched against the case slots of the case frames of watasu 'hand' (e.g. wo: money, memo, bribe, ...; ni: person, suspect, ...) and hanasu 'speak'.]

Figure 1: Matching of an input sentence and a case frame.

Even though a 3,600,000-sentence corpus was used for learning, the examples in the case slots are still sparse, and an input noun mostly does not match exactly an example in the dictionary. Then, a thesaurus is employed to solve this problem.

In our experiments, the NTT Semantic Feature Dictionary (Ikehara et al., 1997) is employed as a thesaurus. Suppose we calculate the similarity between w1 and w2; their depths in the thesaurus are d1 and d2, and the depth of their lowest (most specific) common node is dc. The similarity score between them is calculated as follows:

    sim(w1, w2) = 2 × dc / (d1 + d2)

If w1 and w2 are in the same node of the thesaurus, the similarity is 1.0, the maximum score based on this criterion. If w1 and w2 are identical, the similarity is 1.0, of course.
The score of a case assignment is the best similarity between the input noun and the examples in the case slot. The score of a matching pattern is the sum of the scores of the case assignments in it. If two or more patterns meet the above conditions, the one which has the best score is selected as the final result.

In the case of sentence 5 in Figure 1, kare ni 'he dative-CM' is assigned to the ni case slot. Then, syorui wa 'document topic-marker' can be assigned to the ga or wo case slot. By calculating the similarity between syorui and the ga-slot examples and the wo-slot examples, it is considered to be assigned to the wo slot.

In the case of sentence 6, none of the case components has a CM. Based on the similarity calculation, Deutsch-go is assigned to wo, and sensei is assigned to ga.

4.2 Parsing with case structure analysis

A complex sentence which contains a clausal modifier exhibits a typical structural ambiguity of Japanese; case components left to the verb of a clausal modifier, Vc, possibly modify Vc or a matrix verb Vm.

For example, in sentence 3, iro wa 'color topic-marker' possibly modifies kite-iru 'wear' or awaseru 'harmonize'.

KNP, a rule-based parser, handles this type of ambiguity as follows. If a case component is followed by a comma, it is treated as modifying Vm; if not, it is treated as modifying Vc. Although this heuristic rule usually explains real data very well, sentence 3 will be analyzed incorrectly.

Parsing which utilizes a case frame dictionary can consider which is the proper head, Vc or Vm, for an ambiguous case component by comparing examples in the case slots of Vc and Vm. Such a consideration must be done considering what other case components modify Vc and Vm, since the assigned case slot of a case component might differ depending on the candidate structure of the sentence, due to the unique case assignment constraint.

Therefore, it is necessary to expand the structural ambiguity and consider all the possible structures for an input. So, we calculate the matching score of all pairs of case components and verbs in all possible structures of the sentence, and select the best structure based on the sum of the matching scores.

[Diagram: the two candidate structures of sentence 3, boushi no iro wa kite-iru coat ni awaseru; iro wa is matched against the ga slot of kiru 'wear' in one structure and against the wo slot of awaseru 'harmonize' in the other, with a distance penalty of -2 on the attachment to the farther verb.]

Figure 2: Parsing with case structure analysis.
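The structure selection just described can be sketched as follows. This is hypothetical code: `candidate_structures` stands for the expanded set of possible parses (each a list of component-verb attachments), and `matching_score` stands for the case-frame matching of Section 4.1.

```python
def select_structure(candidate_structures, matching_score, penalty):
    """Pick the structure whose summed matching scores
    (plus distance penalties) is largest."""
    def total(structure):
        return sum(matching_score(comp, verb) + penalty(comp, verb)
                   for comp, verb in structure)
    return max(candidate_structures, key=total)

# toy example for sentence 3: iro attaches either to kiru or to awaseru
structures = [[("iro", "kiru")], [("iro", "awaseru")]]
best = select_structure(
    structures,
    matching_score=lambda comp, verb: 0.9 if verb == "awaseru" else 0.2,
    penalty=lambda comp, verb: 0.0)
```

In the toy run, the higher similarity of iro to the awaseru frame outweighs the kiru attachment, mirroring the analysis in the text.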

Since the heuristic rule employed in KNP is actually very useful, we incorporate it; that is, a penalty score is imposed on a modifier-head relation depending on the distance between the modifier and the head. If a modifier is not followed by a comma, the penalty scores 0, -2, -4, -6, ... are imposed when the modifier modifies the first (nearest), second, third, fourth, ... verb in the sentence, respectively; if it is followed by a comma, the penalty scores -2, 0, -2, -4, ... are imposed.

Table 3: The accuracy of case detection.

                     correct case    incorrect case
                     detection       detection
topic-marker             82              13            5
clausal modifier         73              18            9
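The distance penalty above can be written as a small helper; this is an illustrative sketch (the parameter `k` is my own notation for the rank of the modified verb, counting the nearest verb as 1).

```python
def distance_penalty(k, followed_by_comma):
    """Penalty for a modifier modifying the k-th nearest verb.
    Without a comma: 0, -2, -4, -6, ... for k = 1, 2, 3, 4, ...
    With a comma:   -2,  0, -2, -4, ... for k = 1, 2, 3, 4, ..."""
    if not followed_by_comma:
        return -2 * (k - 1)
    return -2 if k == 1 else -2 * (k - 2)
```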
For example, sentence 3 was analyzed by our method as shown in Figure 2. Since the similarity score between iro 'color' and the wo slot of awaseru 'harmonize' is much larger than that between iro 'color' and the ga slot of kiru 'wear', the correct structure of the sentence was detected (the right-hand parse of Figure 2). Note that, furthermore, both the case of iro in relation to awaseru 'harmonize' and the case of coat in relation to kite-iru 'wear' were detected correctly.

Structural ambiguities often cause a combinatorial explosion when a sentence is long. However, by detecting beforehand the scopes of coordinate structures, which often appear in long sentences, we can reasonably limit the possible structures of the sentence.

The average analysis speed in the experiments described in the next section was about 50 sentences/min. The time-out of one min. was employed for only 7 out of 4,272 test sentences.

4.3 Experiments and discussion

We used 4,272 sentences of the Kyoto University Corpus as a test set. We parsed them by our new method (Figure 3 shows several examples) and checked two points: case detection of ambiguous case components and syntactic analysis.

First, we randomly selected ambiguous case components: 100 topic-marker case components and 100 case components modified by clausal modifiers.

We checked whether their cases were correctly detected or not. As shown in Table 3, the accuracy of the analysis was fairly good: that for topic-markers was 82% and that for clausal modifiers was 73%.

Then, we compared the parse results of our method with those of the original KNP. As a result, 565 modifier-head relations differed; in 260 cases, our method was correct and the original KNP was incorrect (considering the structures in the Kyoto University Corpus as a gold standard); in 224 cases, vice versa. That is, our method was superior to KNP by 36 cases, and increased the overall accuracy from 89.8% to 89.9%. Since the heuristic rule used in KNP is very strong, the improvement was not big. The improvement of the accuracy, though small, is valuable, because an accuracy around 90% seems close to the ceiling of this task.

[Diagram: dependency parses of two newspaper sentences, with the modifier-head relations improved by case information marked.]

Figure 3: Examples of the analysis results.

5 Conclusion

We proposed an unsupervised construction method of a case frame dictionary. We obtained a large case frame dictionary, which consists of 23,497 verbs. Using this dictionary, we can detect ambiguous case components accurately. Also, since our method employs unsupervised dictionary learning, it can be easily scaled up.

References

Ted Briscoe and John Carroll. 1997. Automatic extraction of subcategorization from corpora. In Proceedings of ANLP-97.

Satoru Ikehara, Masahiro Miyazaki, Satoshi Shirai, Akio Yokoo, Hiromi Nakaiwa, Kentarou Ogura, Yoshifumi Ooyama, and Yoshihiko Hayashi, editors. 1997. Japanese Lexicon. Iwanami Publishing.

Information-Technology Promotion Agency, Japan. 1987. Japanese Verbs: A Guide to the IPA Lexicon of Basic Japanese Verbs.

S. Kurohashi and M. Nagao. 1994. A syntactic analysis method of long Japanese sentences based on the detection of conjunctive structures. Computational Linguistics, 20(4).

S. Kurohashi and M. Nagao. 1998. Building a Japanese parsed corpus while improving the parsing system. In Proceedings of The First International Conference on Language Resources & Evaluation, pages 719-724.

Christopher D. Manning. 1993. Automatic acquisition of a large subcategorization dictionary from corpora. In Proceedings of ACL-93.

Takehito Utsuro, Takashi Miyata, and Yuji Matsumoto. 1998. General-to-specific model selection for subcategorization preference. In Proceedings of the 17th International Conference on Computational Linguistics and the 36th Annual Meeting of the Association for Computational Linguistics.