Abstract
Automatic transliteration and back-transliteration between languages with drastically different alphabets and phoneme inventories, such as English/Korean, English/Japanese, English/Arabic, and English/Chinese, are of practical importance in machine translation, cross-lingual information retrieval, and automatic bilingual dictionary compilation. In this paper, a bi-directional and to some extent language-independent methodology for English/Korean transliteration and back-transliteration is described. Our method is composed of character alignment and decision tree learning. We induce transliteration rules for each English alphabet letter and back-transliteration rules for each Korean alphabet letter. Training the decision trees requires a large set of labeled transliteration and back-transliteration examples; however, this kind of resource is generally not available. Our character alignment algorithm is capable of aligning an English word and its Korean transliteration highly accurately and in the desired way.
In contemporary Korean, a significant portion of the vocabulary consists of loan words, most of them from English. They are generally transliterated into Korean. Both in machine translation and cross-lingual information retrieval, the word mismatch problem caused by foreign words is serious (Jeong et al., 1999). For example, the English word 'digital' may be variously transliterated in Korean as 'ticithel', 'ticithal', and 'ticithul', even though 'ticithel' is preferred as the standard form. This is because an English phoneme can only be ambiguously mapped to more than one Korean phoneme, due to their radically different phonologies. Moreover, careless writers often use an English word in its original form without transliteration. This mixed use of various transliterations together with their original English word causes a severe word mismatch problem in information retrieval. One solution to this problem is to index only the canonical form of the variations. As the canonical form, the original English word should be preferred. Automatic back-transliteration is required to find the canonical form of the variants (Jeong et al., 1999).

Figure 1. Transliteration and back-transliteration: transliteration maps a source-language word (e.g., English 'internet') to the target language (Korean); back-transliteration maps the Korean form back to English.

For the machine learning of transliteration and back-transliteration rules, a large set of phonetically aligned pairs of English words and Korean transliterations is essential. However, such a resource does not exist, and manual alignment would be highly tedious. In previous research, either a small number of aligned word pairs were manually constructed (Jeong et al., 1999) or an unsupervised learning method, a variation of the EM (Expectation Maximization) algorithm, was tried (Lee & Choi, 1998). However, manual alignment is impractical for large amounts of data, and the performance of unsupervised learning was not good enough.

In this paper, we first present an automatic character alignment method between an English word and its Korean transliteration. Our method is highly accurate: more than 99% accuracy was obtained. With this alignment method, the construction of a large, reliable set of labeled (aligned) examples is almost instant.

Once we have enough aligned data, virtually any supervised learning method is applicable to the transliteration and back-transliteration domain. We chose the decision tree method to automatically induce transliteration and back-transliteration rules, for its conceptual simplicity and its empirically verified effectiveness in the text-to-speech domain (Dietterich et al., 1995).

Our methodology is fully bi-directional, i.e. the same methodology is used for both transliteration and back-transliteration. Moreover, it is easily adaptable to other language pairs: the decision tree learning algorithm is domain independent, and our alignment method is relatively easy to adapt. The alignment method is logically divided into a language-independent part and a language-dependent part. The alignment algorithm itself is domain independent, but the heuristic rules it uses incorporate bilingual phonemic knowledge. This bilingual phonemic knowledge is simple enough that even a non-expert can encode it without much difficulty.

2. Character Alignment

2.1. English/Korean character alignment

English/Korean character alignment is, given a source-language word (English) and its phonetic equivalent in the target language (Korean), the task of finding the most phonetically probable correspondence between their characters. For example, the English word 'board' is generally transliterated into 'potu'[1] in Korean, and one possible alignment is:

    English   b   oa   r   d
              |    |   |   |
    Korean    p    o   -   tu

Generally, E/K character alignment has the following properties:

(1) many-to-many correspondence
(2) null correspondence
(3) no crossing dependency

In most cases English words and their Korean transliterations do not have the same length, which means the mapping is generally a many-to-many correspondence. It may also contain null correspondences. These are mainly due to silent English letters, but they are also caused by Korean characters: in Korean, a consonant may not stand alone and must be followed by a vowel. This introduces dummy Korean vowels that cannot be mapped to any English letter. In character alignment, unlike word and sentence alignment, crossing dependency does not occur. In other words, when the correspondences are visualized as links between English and Korean characters, the links do not cross each other. The no-crossing-dependency property makes an effective alignment algorithm possible by drastically reducing the search space.

Let us call a mapping unit such as 'b', 'oa', 'r', 'd', 'p', 'o', or 'tu' in the above alignment example a PU (Pronunciation Unit) (Lee & Choi, 1998). We could use decision trees to induce the mapping rules between English PUs and Korean PUs. Unfortunately, however, too many PUs may be produced, and consequently too many decision trees would need to be constructed. Moreover, null PUs on the source-word side make the application of the decision tree method difficult. To remedy this problem, we constrain the alignment configuration: we allow only one-to-many correspondences and prohibit null PUs on the source-word side.

Under these constraints, the previous alignment example is modified as follows:

    English   b   o   a   r   d
              |   |   |   |   |
    Korean    p   o   -   -   tu

Conversely, when the source word is Korean and the target word is English, i.e. back-transliteration, the alignment should be:

    Korean    p   o    t   u
              |   |    |   |
    English   b   oar  d   -

This constrained version of character alignment makes decision tree learning more manageable and more efficient. In the case of E/K transliteration, only 26 decision trees, one for each English alphabet letter, need to be learned, and in the case of E/K back-transliteration only 46 decision trees, one for each Korean alphabet letter[2].

[1] Korean characters are composed into syllable units when written. The two-syllable word 'potu' may be decomposed into the character sequence 'p', 'o', 't', 'u'.
[2] Refer to the appendix for more information about the 46 Korean alphabet letters.

2.2. Alignment algorithm

For the automatic character alignment, we developed an extended version of Covington's alignment algorithm (Covington, 1996). Covington's algorithm views an alignment as a way of stepping through two words while performing a match or skip operation on each step.
Thus, the alignment

    source   b   o   a   r   d   -
    target   p   o   -   -   t   u

is produced by matching 'b' and 'p', 'o' and 'o', then skipping 'a' and 'r', matching 'd' and 't', and lastly skipping 'u'. The null symbol '-' indicates a skip at that position.

Covington's algorithm produces only one-to-one correspondences. This implies that null mappings are inevitable on both the source- and the target-word side. In order to produce one-to-many correspondences and remove nulls on the source-word side, we introduce the bind operation. We define two kinds of bind: forward bind ('>') and backward bind ('<'). The following alignment of the English word 'switch' and its Korean transliteration 'suwuichi' pictorially represents the two bind operations:

    source   s    w > i    t > c < h
    target   su     wui        chi

where '>' and '<' respectively represent a forward and a backward bind at that position. 'w' is forward-bound with 'i' and together they are matched with 'wui'. Similarly, 't' and 'h' are forward- and backward-bound with 'c' and collectively matched with 'chi'. By introducing bind operations we can remove nulls on the source side. The recurrent alignment example of 'board' and 'potu' may therefore be represented as

    source   b   o   a   r   d   <
    target   p   o   -   -   t   u

in the case of transliteration, and

    source   p   o   <   <   t   u
    target   b   o   a   r   d   -

in the case of back-transliteration.

We can systematically generate all the valid alignments that are reachable by match, skip, and bind operations and that satisfy the alignment constraints. Aligning can then be interpreted as finding the best alignment in the search space composed of all the valid alignments. The algorithm does not use dynamic programming but performs a depth-first search while pruning fruitless branches early (Covington, 1996). To evaluate each alignment, every match, skip, and bind is assigned a penalty depending on the phonetic similarity between the English letter and the Korean character under consideration. The alignment with the least total penalty, summed over all the operations, is chosen as the best. For example, the total penalty of the alignment of 'board' and 'potu' is computed as follows:

    English    b    o    a    r    d    <
    Korean     p    o    -    -    t    u
    operation  M    M    S    S    M    B
    penalty    0 + 10 + 40 + 60 +  0 + 200 = 310

A human with a little bilingual phonemic knowledge can align almost any English word and its Korean transliteration correctly, because relatively simple bilingual phonemic knowledge is sufficient for the alignment task. We aim to simulate this human process by exploiting the following two heuristics, which are expected to be very effective in E/K character alignment:

H1. A consonant tends to map to a consonant, and a vowel tends to map to a vowel.
H2. There exist typical Korean transliterations for each English letter.

We have succeeded in aligning with high accuracy using heuristics H1 and H2. Heuristic H1 seems to hold always except for 'w': the semi-vowel 'w' is sometimes mapped to a Korean consonant, even though it is usually mapped to vowels. For heuristic H2, we can easily make a list of typical Korean transliterations for each English letter. Generally, an English letter has more than one phonetically similar Korean character. Table 1 lists phonetically similar Korean transliterations (or characters) for several English letters. This simple bilingual phonemic knowledge can be encoded without much effort, even by a non-expert. If we match an English letter with the higher-priority Korean character in its list, in most cases we get the correct alignment. To handle more complicated cases, we constructed the evaluation metrics in Table 2 by extending heuristic H1.

Table 1. Typical Korean transliterations for several English letters ('a', 'b', 'd', 'o', 'r'; the Korean characters are omitted here).[4]

The evaluation metrics in Table 2 approximately realize the following principles: (1) phonetic similarity matters most; (2) consonants matter more than vowels; (3) a vowel and a consonant do not tend to map to each other; (4) phonetically dissimilar consonants do not tend to map to each other; rather, a skip is preferred; (5) phonetically dissimilar vowels tend to map to each other rather than be skipped; (6) bind is more generous for dissimilar-consonant matching and also for vowel/consonant matching; (7) a consonant/vowel bind is preferred over a dissimilar-consonant bind.

We ran our alignment algorithm on 500 pairs of English words and Korean transliterations, and the alignment results were manually checked by human evaluators.

[4] A consonant marked with '*' in Table 1 is a syllable-final consonant.
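The penalty-based depth-first search described above can be sketched in a few lines of Python. This is only a minimal illustration under assumed values: the romanized jamo, the `TYPICAL` list, and every numeric penalty below are made up for the sketch and are not the paper's actual Table 1/Table 2 metrics.

```python
# Sketch of the extended Covington-style alignment: depth-first search over
# match / skip / bind operations, keeping the least-penalty alignment.
# All penalty values and the romanized jamo are illustrative assumptions.

EN_VOWELS = set("aeiou")
KO_VOWELS = {"a", "e", "i", "o", "u"}        # romanized jamo treated as vowels
TYPICAL = {("b", "p"), ("d", "t"), ("o", "o"), ("s", "s")}  # hypothetical Table 1 entries

def penalty(e, ks):
    """Penalty for mapping English letter e to a tuple ks of Korean jamo."""
    if not ks:                               # skip: e maps to nothing
        return 40 if e in EN_VOWELS else 60  # skipping a consonant costs more
    k = ks[0]                                # first jamo is scored as a match
    if (e, k) in TYPICAL:
        cost = 0
    elif (e in EN_VOWELS) == (k in KO_VOWELS):
        # same class (H1): dissimilar vowels map cheaply, dissimilar
        # consonants cost more than a skip (principles 4 and 5)
        cost = 10 if e in EN_VOWELS else 100
    else:
        cost = 250                           # vowel/consonant mismatch
    return cost + 200 * (len(ks) - 1)        # flat penalty per extra bound jamo

def align(src, tgt, max_bind=2):
    """Return (cost, mapping) where mapping[i] is the jamo tuple for src[i]."""
    best = [(float("inf"), None)]
    def dfs(i, j, cost, parts):
        if cost >= best[0][0]:               # prune fruitless branches early
            return
        if i == len(src):
            if j == len(tgt):
                best[0] = (cost, parts)
            return
        for n in range(max_bind + 1):        # src[i] takes 0..max_bind jamo
            if j + n > len(tgt):
                break
            piece = tuple(tgt[j:j + n])
            dfs(i + 1, j + n, cost + penalty(src[i], piece), parts + [piece])
    dfs(0, 0, 0, [])
    return best[0]

cost, mapping = align("board", ["p", "o", "t", "u"])
# best alignment: b->p, o->o, a->skip, r->skip, d->t with the dummy vowel u bound
```

With these toy penalties the search recovers the constrained one-to-many alignment of 'board' and 'potu' given earlier; a real implementation would plug in the full Table 2 metrics.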
Table 4. Performance comparison between two attribute selection methods: proximity and information gain.

    training  |          transliteration           |        back-transliteration
    data size | word   char   letter   total no.   | word   char   letter   total no.
              | acc.   acc.   acc.     of leaves   | acc.   acc.   acc.     of leaves
    1000      | 38.9   76.8   82.4     2950        | 27.3   75.2   79.2     3104
    2000      | 45.0   80.5   85.1     5017        | 31.7   77.6   81.2     5565
    3000      | 48.7   81.8   86.3     6973        | 34.7   78.6   82.1     7738
    4000      | 49.7   82.8   86.9     9922        | 36.4   79.2   82.6     10646
    5000      | 50.9   83.1   87.3     12355       | 35.9   79.5   83.0     13167
    6000      | 51.3   83.4   87.5     14547       | 37.2   80.0   83.3     15423
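To make the per-letter decision-tree setup concrete, the following sketch turns aligned pairs, such as the 'board'/'potu' alignment above, into per-letter training examples with a small context window. The window size, the padding symbol '_', and the feature layout are illustrative assumptions, not the paper's exact configuration.

```python
from collections import defaultdict

def extract_examples(pairs, window=2):
    """Build per-letter training data from aligned (word, mapping) pairs.

    Each example is (context features, target Korean jamo string); the
    features are the letters around the current position, padded with '_'.
    One decision tree would then be trained per English letter.
    """
    data = defaultdict(list)                      # English letter -> examples
    for word, mapping in pairs:
        padded = "_" * window + word + "_" * window
        for i, letter in enumerate(word):
            ctx = padded[i:i + 2 * window + 1]    # letter plus its neighbours
            features = ctx[:window] + ctx[window + 1:]  # drop the letter itself
            data[letter].append((features, "".join(mapping[i])))
    return data

pairs = [("board", [("p",), ("o",), (), (), ("t", "u")])]
data = extract_examples(pairs)
# e.g. data["d"] holds one example: context "ar__", class "tu"
```

Letters aligned to nothing (here 'a' and 'r') get the empty string as their class, so "map to null" is just one more class label for the tree to learn.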
There are two fundamentally different approaches to avoiding overfitting of the training data in decision tree learning. The first approach, called pre-pruning, is to stop growing the decision tree before it reaches a perfect classification of all the examples. The second approach, called post-pruning, is to derive the complete decision tree first and then prune it.

The most critical part of pre-pruning is deciding when to stop growing. Many stopping criteria have been proposed (Mingers, 1989), but we tested only one of them in our experiments. The information gain measure was adopted as the stopping criterion, since we did not use it as the attribute selection measure.[7] If the information gain acquired by growing the current node is less than a specified threshold, growing stops and the current node becomes a leaf node. The majority class of the examples at the node is assigned as the class of the leaf.

Generally, post-pruning is known to be a more effective method for avoiding overfitting than pre-pruning (Mitchell, 1997). We therefore tested the reduced error pruning method (Quinlan, 1987), one of the most representative post-pruning methods and known to give consistent performance (Mingers, 1989), to find out whether it could improve the transliteration accuracy.

In past work on English text-to-speech mapping, pruning trees turned out to be not very helpful in improving the mapping accuracy (Dietterich et al., 1995). We obtained similar results: all the tree pruning methods failed to increase either the transliteration or the back-transliteration accuracy (Table 5). This result may mean that ID3 is not overfitting the data; in other words, the 3000-word data does not contain much noise. This result is not so surprising, since our character alignment is almost perfect and the only source of noise is the relatively small number of words of non-English origin. Another interesting observation is that in back-transliteration, pre-pruning and post-pruning reduced the tree sizes by 28% and 46% respectively without decreasing back-transliteration accuracy, while no significant change in tree size occurred in transliteration.

[7] If the information gain measure is used as the attribute selection measure, a chi-square test is generally used as the stopping criterion. Using the same measure for both attribute selection and pre-pruning may cause problems (Quinlan, 1986).

4.5. Training set size

Until now, we have only shown performance values for the 3000-word training set. More training data might raise the transliteration accuracy, so we generated decision trees with six training sets, increasing in steps of 1000 words from 1000 to 6000 words (Table 6). We obtained a significant performance improvement, especially in word accuracy. Transliteration word accuracy increased from 38.9 (1000 words) to 51.3 (6000 words), by about 44%, and back-transliteration word accuracy increased from 27.3 (1000 words) to 37.2 (6000 words), by about 36%. However, in both transliteration and back-transliteration, the rate of increase slowed rapidly after 3000 words.

4.6. Discussions

We analyzed how well the individual decision trees are learned. In the case of transliteration, all the English vowels 'a', 'e', 'i', 'o', 'u' and the consonants 'w' and 'x' had especially low letter accuracy. The low accuracy of the vowels arises because each is realized as many different phonemes. In the case of back-transliteration, Korean vowels generally had relatively low accuracy for the same reason.

The low letter accuracy might mean that their transliteration or back-transliteration is very irregular, or that their target concepts are very complex.
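The information-gain stopping criterion discussed in the pruning section above can be sketched with a toy ID3 implementation. The threshold value, the toy contexts, and the class labels below are illustrative assumptions rather than the paper's setup.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

def info_gain(examples, attr):
    """Information gain of splitting (features, label) examples on one feature."""
    labels = [y for _, y in examples]
    groups = {}
    for x, y in examples:
        groups.setdefault(x[attr], []).append(y)
    remainder = sum(len(g) / len(examples) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def id3(examples, attrs, min_gain=0.1):
    """Grow a tree, but stop (pre-prune) when the best gain is below min_gain."""
    labels = [y for _, y in examples]
    majority = Counter(labels).most_common(1)[0][0]
    if not attrs or len(set(labels)) == 1:
        return majority
    best = max(attrs, key=lambda a: info_gain(examples, a))
    if info_gain(examples, best) < min_gain:   # pre-pruning: make a leaf with
        return majority                        # the node's majority class
    node = {"attr": best, "branches": {}, "default": majority}
    for value in {x[best] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[best] == value]
        node["branches"][value] = id3(subset, [a for a in attrs if a != best], min_gain)
    return node

def classify(node, x):
    while isinstance(node, dict):
        node = node["branches"].get(x[node["attr"]], node["default"])
    return node

# Toy per-letter data: (left context, right context) -> Korean jamo string
examples = [
    (("a", "h"), "chi"),
    (("b", "h"), "chi"),
    (("a", "o"), "k"),
    (("b", "o"), "k"),
]
tree = id3(examples, attrs=[0, 1], min_gain=0.1)
```

Raising `min_gain` makes the tree collapse to the majority class earlier, which is exactly the size/accuracy trade-off the pruning experiments measure.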
In cases where transliteration and back-transliteration are inherently highly irregular, there is no good way to improve the accuracy. However, if the low accuracy is due to the high complexity of the target concept, more training data or a wider context window might improve it.

5. Conclusion

We have presented a very effective bi-directional automatic English/Korean transliteration and back-transliteration methodology. Our method consists of character alignment and decision tree learning. We wanted to induce transliteration rules for each English letter and back-transliteration rules for each Korean letter, but the large set of labeled examples needed for training the decision trees was missing. We therefore developed a highly accurate character alignment algorithm, which is able to align two words across languages in a desirable, constrained way. Our method is somewhat language independent: the only language-dependent part is the alignment evaluation metrics, which can also be constructed easily without much effort.

Appendix: Korean consonants and vowels

The Korean alphabet consists of 19 consonants and 21 vowels. Korean characters are composed into syllable units when written. The possible syllable configurations are CV, CVC, and CVCC. The consonant preceding the vowel is called syllable-initial, and the consonants following the vowel are called syllable-final. For phonetic matching, it is beneficial to distinguish Korean consonants as syllable-initials and syllable-finals, since they are pronounced differently depending on their position within a syllable.

Table 7 shows the list of Korean consonants and vowels. In Korean transliteration of foreign words, only 7 consonants are allowed in syllable-final position. The syllable-initial 'ㅇ' is a dummy whose only purpose is to satisfy the legitimate Korean syllable configurations, so we decided to omit it in our decomposed form of Korean words.

Table 7. Korean consonants and vowels: 19 syllable-initial consonants, 7 syllable-final consonants (marked with '*'), and 21 vowels (the Korean characters are omitted here).

Acknowledgements

This work was supported by the Korea Science and Engineering Foundation (KOSEF) through the Advanced Information Technology Research Center (AITrc).

References

Collier, N., A. Kumano and H. Hirakawa, 1997. Acquisition of English-Japanese proper nouns from noisy-parallel newswire articles using Katakana matching. Natural Language Processing Pacific Rim Symposium '97.

Covington, M. A., 1996. An algorithm to align words for historical comparison. Computational Linguistics, 22.

Dietterich, T. G., H. Hild and G. Bakiri, 1995. A comparison of ID3 and Backpropagation for English text-to-speech mapping. Machine Learning, 18.

Jeong, K. S., S. H. Myaeng, J. S. Lee, and K. S. Choi, 1999. Automatic identification and back-transliteration of foreign words for information retrieval. Information Processing and Management, 35(4):523-540.

Kang, Y. and A. A. Maciejewski, 1996. An algorithm for generating a dictionary of Japanese scientific terms. Literary and Linguistic Computing, 11(2).

Knight, K. and J. Graehl, 1997. Machine transliteration. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (ACL), Madrid, Spain.

Lee, J. S., 1999. An English-Korean transliteration and retransliteration model for cross-lingual information retrieval. Ph.D. dissertation, Dept. of Computer Science, Korea Advanced Institute of Science and Technology.

Lee, J. S. and K. S. Choi, 1998. English to Korean statistical transliteration for information retrieval. Computer Processing of Oriental Languages, 12(1):17-37.

Mingers, J., 1989. An empirical comparison of selection measures for decision tree induction. Machine Learning, 3:319-342.

Mingers, J., 1989. An empirical comparison of pruning methods for decision tree induction. Machine Learning, 4:227-243.

Mitchell, T. M., 1997. Machine Learning. The McGraw-Hill Companies, Inc.

Nam, Y. S., 1997. The latest foreign word dictionary. Sung-An-Dang Press.

Quinlan, J. R., 1987. Rule induction with statistical data: a comparison with multiple regression. Journal of the Operational Research Society, 38:347-352.

Quinlan, J. R., 1986. Induction of decision trees. Machine Learning, 1:81-106.

Stalls, B. and K. Knight, 1998. Translating names and technical terms in Arabic text. In Proceedings of the COLING/ACL Workshop on Computational Approaches to Semitic Languages.

Wagner, R. A. and M. J. Fischer, 1974. The string-to-string correction problem. Journal of the ACM, 21(1):168-178.

Wan, S. and C. M. Verspoor, 1998. Automatic English-Chinese name transliteration for development of multilingual resources. In Proceedings of COLING-ACL '98, the joint meeting of the 17th International Conference on Computational Linguistics and the 36th Annual Meeting of the Association for Computational Linguistics, Montreal, Canada.