Morphology Generation For English-Indian Language Statistical Machine Translation
https://doi.org/10.1007/s00500-020-05393-7
Abstract
When translating into morphologically rich languages, statistical MT approaches face the problem of data sparsity. The sparseness problem is especially severe when the corpus of the morphologically richer language is small. Even though factored models can correctly generate morphological forms of words, data sparseness limits their performance. In this paper, we describe a simple and effective solution based on enriching the input corpora with various morphological forms of words. We apply this method in phrase-based and factor-based experiments on two morphologically rich languages, Hindi and Marathi, when translating from English. We evaluate the performance of our experiments both by automatic evaluation and by subjective evaluation of adequacy and fluency. We observe that the morphology injection method helps in improving the quality of translation. We further show that it handles the data sparseness problem to a great extent.
describe the morphology generation and the factored model for handling morphology, with a case study of handling morphology for English to Hindi translation, in Sect. 3. We describe our experiments and evaluations in Sect. 4. Section 5 draws conclusions and points to future work.

2 Sparseness in factored translation models

While factored models allow the incorporation of linguistic annotations, they also lead to the problem of data sparseness. Two kinds of sparseness can be identified: sparseness during translation and sparseness during generation. Sparseness in translation occurs when a particular combination of factors does not exist in the source-side training corpus. For example, let the factored model have a single translation step: X|Y → P|Q. Suppose the training data has evidence for only the mapping xi|yj → pk|ql. The factored model learnt from this data cannot translate xu|yv for all u ≠ i and v ≠ j; it generates UNKNOWN as output in these cases. Note that if we train a simple phrase-based model only on the surface forms of words, we will at least get some output, which may not have the correct inflections but will still convey the meaning.
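The translation-step sparseness described above can be illustrated with a toy lookup (a minimal sketch; the phrase-table entry and factor values are invented for illustration, not taken from the paper's data):

```python
# Toy illustration of sparseness in a factored translation step
# X|Y -> P|Q: the model can only map factor combinations that were
# actually seen in training.
phrase_table = {
    # (source root, number factor) -> target root|suffix
    ("boy", "sg"): "ladakaa|null",
}

def translate(root, factor):
    # An unseen factor combination yields UNKNOWN, even though the
    # root "boy" itself was observed (with a different factor).
    return phrase_table.get((root, factor), "UNKNOWN")

print(translate("boy", "sg"))  # ladakaa|null
print(translate("boy", "pl"))  # UNKNOWN
```

A plain phrase-based model backing off to the surface form would instead emit some (possibly wrongly inflected) output here, which is why the factored model's UNKNOWNs hurt more.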
Sparseness in generation occurs when the target-side training corpus does not contain a specific factor combination. For example, let the factored model have a single generation step: P|Q → R. Suppose the target-side training data has evidence of only pa|qb → rc. The factored model learnt from this data cannot generate from pu|qv for all u ≠ a and v ≠ b. Again, the factored model generates UNKNOWN as output. Thus, due to sparseness, we cannot make the best use of factored models. In fact, they are far worse than phrase-based models when a particular factor combination is absent from the training data.

A simple and effective solution to the sparseness problem is to have all factor combinations present in the training data. For the factored model described in Sect. 3.1, in order to remove data sparseness in the translation step, we need to have all Source root|{S} → Target root|suffix pairs present in the training data. Also, to remove data sparseness in the generation step, we need to have all Target root|suffix → Target surface word pairs present in the training data. In Sect. 3, we use a solution to this problem in the context of English to Hindi translation.

3 Morphology generation

In this section, we discuss the handling of noun and verb morphology when translating from English to Hindi using factored models. We also discuss the solution to the sparseness problem.

3.1 Noun morphology

In this section, we discuss the factored model for handling Hindi noun morphology and the data sparseness solution in that context. Hindi nouns show morphological marking only for number and case. Number can be either singular or plural. Case marking on Hindi nouns is either direct or oblique. Gender, an inherent lexical property of Hindi nouns (masculine or feminine), is not morphologically marked but is realized via agreement with adjectives and verbs (Singh and Sarma 2011).

The generation phase begins with the correction of the genders of the translated words, since certain words are masculine in the source language but feminine in the target, and vice versa. This is followed by short-distance and long-distance agreement, performed by the intra-chunk and inter-chunk modules. These ensure that the gender, number and person of local groups of phrases agree, and also that the gender of the subject's verbs or objects reflects that of the subject.

3.1.1 Factored model setup

Noun inflections in Hindi are affected only by the number and case of the noun (Singh et al. 2010). So, in this case, the set S consists of number and case. Number can be singular or plural, and case can be direct or oblique. An example of factors and mapping steps is shown in Fig. 1. The generation of the number and case factors is discussed in Sect. 4.

3.1.2 Building word-form dictionary

In the case of the factored model described in Sect. 3.1.1:
• To solve the sparseness in the translation step, we need to have all English root|number|case → Hindi root noun|suffix pairs present in the training data.
• To solve the sparseness in the generation step, we need to have all Hindi root noun|suffix → Hindi surface word pairs present in the training data.

In other words, we need to obtain a set of suffixes and their corresponding number-case values for each noun pair. Using these suffixes and the Hindi root word, we then generate Hindi surface words to remove sparseness in the generation step. We need to generate four pairs for each noun present in the training data, i.e., (sg-dir, sg-obl, pl-dir, pl-obl), and obtain their corresponding Hindi inflections. In the following section, we discuss how to generate these morphological forms.
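The enumeration of the four number-case pairs per noun can be sketched as follows (a minimal sketch; the class-to-suffix table is a made-up stand-in for the classification in Table 1, which is not reproduced here):

```python
# Sketch of the morphology-injection enumeration: for every noun pair
# in the training data, produce all four number-case combinations and
# pair each with the suffix its inflectional class assigns.
SUFFIX_TABLE = {
    # class -> {(number, case): suffix}; "" means no overt marking
    "A": {("sg", "dir"): "", ("sg", "obl"): "",
          ("pl", "dir"): "", ("pl", "obl"): ""},
    "D": {("sg", "dir"): "aa", ("sg", "obl"): "e",
          ("pl", "dir"): "e", ("pl", "obl"): "o"},
}

def expand_noun(english_root, hindi_root, noun_class):
    """Return the four English root|number|case -> Hindi root|suffix pairs."""
    pairs = []
    for (number, case), suffix in SUFFIX_TABLE[noun_class].items():
        source = f"{english_root}|{number}|{case}"
        target = f"{hindi_root}|{suffix}"
        pairs.append((source, target))
    return pairs

# Four entries per noun: sg-dir, sg-obl, pl-dir, pl-obl
for src, tgt in expand_noun("dog", "kutt", "D"):
    print(src, "->", tgt)
```

The generated pairs are what gets injected into the training corpus so that both the translation and the generation steps see every factor combination.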
3.1.2.1 Generating new morphological forms Figure 2 shows a pipeline to generate new morphological forms for an English-Hindi noun pair. To generate the different morphological forms, we need to know the suffix of a noun in Hindi for the corresponding number and case combination. For this purpose, we use the morphological classification of nouns shown in Table 1. Nouns are classified into five classes, namely A, B, C, D and E, according to their inflectional behavior with respect to case and number (Singh et al. 2010). All nouns in the same class show the same inflectional behavior.

• Class A: includes nouns that take a null suffix for all number-case values. These nouns are generally abstract or uncountable.
• Class B: includes feminine nouns ending in i, ī or ya that take ya for the features [+ pl, − oblique] and yo for [+ pl, + oblique].
• Class C: includes feminine nouns that take e for the feature [+ pl] and o for [+ pl, + oblique].
• Class D: includes masculine nouns that end in a or ya. Some kinship terms are also included; words directly derived from Sanskrit are excluded.
• Class E: includes masculine nouns that inflect only for the features [+ pl, + oblique]. The nouns in this class end with u, ū, i, ī or a consonant.

3.1.2.2 Predicting inflectional class for new lexemes For the classification of new lexemes into one of the five classes discussed above, we need gender information. After gender is lexically assigned to the new lexeme, its inflectional class can be predicted using the procedure outlined in Fig. 3. A masculine noun may or may not inflect, depending on its semantic properties. If it is an abstract noun or a mass noun, it falls into the non-inflecting Class A irrespective of its phonological form. A countable lexeme, on the other hand, falls into one of the two masculine classes based on its phonological form. A similar procedure is followed for feminine nouns.

To predict the class of a Hindi noun, we develop a classifier which uses the gender and the ending characters of the noun as features (Singh et al. 2010). Using the class of the Hindi noun and the classification shown in Table 1, we obtain four different suffixes and the corresponding number-case combinations. For example, if we know that the noun (kuttaa) belongs to class D, then we can get four different suffixes for (kuttaa), as shown in Table 2.
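The prediction procedure can be sketched as a small decision rule (illustrative only; the exact endings and the semantic test come from Fig. 3 and Table 1, which are not reproduced here, and the real system uses a trained classifier rather than hand rules):

```python
# Rough sketch of inflectional-class prediction from the two features
# used by the classifier: the noun's gender and its ending characters.
# A countability test routes masculine abstract/mass nouns to the
# non-inflecting Class A, as in the procedure of Fig. 3.
def predict_class(gender, ending, countable=True):
    if gender == "m":
        if not countable:              # abstract/mass nouns do not inflect
            return "A"
        if ending in ("a", "ya"):      # masculine -a/-ya nouns
            return "D"
        return "E"                     # other masculine endings (u, i, consonant)
    else:                              # feminine nouns
        if ending in ("i", "ya"):
            return "B"
        return "C"

print(predict_class("m", "a"))                   # D
print(predict_class("m", "a", countable=False))  # A
print(predict_class("f", "i"))                   # B
```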
[Table 2: morphological suffixes for the dog (kuttaa) noun pair]

Generating surface word: Next, we generate the Hindi surface word from the Hindi root noun and suffix using a rule-based joiner (reverse morphological) tool. The rules of the joiner use the ending of the root noun and the class to which the suffix belongs as features. Thus, we obtain four different morphological forms for the noun entities present in the training data, and we augment the original training data with these newly generated morphological forms. Table 3 shows the four morphological forms of the boy (ladakaa) noun pair. The joiner solves the sparseness in the generation step.
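The joiner step can be sketched as follows (a minimal sketch; the single sandhi-style rule shown is invented to illustrate the mechanism and is not the tool's actual rule set, and transliterated forms stand in for Devanagari):

```python
# Minimal sketch of a rule-based joiner: combine a Hindi root noun and
# a suffix into a surface form. Real joiner rules key on the ending of
# the root and the class of the suffix; one illustrative rule is shown.
def join(root, suffix):
    # Example rule: an -aa-final root drops its final vowel before a
    # vowel-initial suffix (kuttaa + e -> kutte).
    if root.endswith("aa") and suffix and suffix[0] in "aeiou":
        return root[:-2] + suffix
    return root + suffix

print(join("kuttaa", "e"))  # kutte
print(join("kuttaa", "o"))  # kutto
print(join("kuttaa", ""))   # kuttaa
```

Running the joiner over every (root, suffix) pair produced for the word-form dictionary yields the Target root|suffix → Target surface word entries that remove the generation-step sparseness.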
3.2 Verb morphology

In this section, we discuss the factored model for handling Hindi verb morphology and the data sparseness solution in that context. Many grammarians and morphologists have discussed the inflectional categories of Hindi verbs, but these studies are either pedagogical or structural in approach. Across all approaches, there is much agreement on the kinds of inflectional categories seen in Hindi verbs. The inflection in Hindi verbs may appear as suffixes or as auxiliaries (Singh and Sarma 2011). These categories and their exponents are shown in Table 4. When translating from English to Hindi, to handle all of these verb inflections we would need to have all the corresponding factors available to implement a factored model. But, as we will see, so many factors in the factored model may degrade the performance of the translation system. Hence, we use only those factors which are important and easily available.
Table 5 Evaluation scores for each system (BLEU, number of OOV words, and adequacy/fluency for En-Hi and En-Mr; column headers were not recovered from the source):

Noun           Fact   25.30  12.84  2130  1499  19.33  15.08  25.62  16.52  25.65  17.20
               Fact0  28.41  17.86  1839  1302  30.73  23.58  35.66  26.25
Verb           Fact   26.23  13.02  1241  1872  20.11  17.58  26.85  18.67  27.86  19.26
               Fact0  29.16  19.02  1010  1649  35.91  26.74  39.91  29.31
Noun and verb  Fact   22.93  10.55  2293  3237  18.87  10.98  20.89  13.69  24.92  16.28
               Fact0  24.03  12.01  1967  2422  24.19  18.79  28.06  22.36
Noun and verb  Phr    24.87  13.40   913   875  12.38  10.06  23.07  17.70  25.90  18.24
               Phr0   27.89  16.41   853   828  27.15  21.72  31.92  25.25
Verb factors:

• Number factor: using typed dependencies, we extract the subject of the sentence and obtain the number of the subject, as we do for a noun.
• Person factor: we look up the subject in a simple list of pronouns to find its person.
• Tense, aspect and modality factor: we use the POS tags of the verbs to extract the tense, aspect and modality of the sentence.
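The three factor extractors can be sketched together (a minimal sketch; it assumes dependency triples and POS tags already produced by external tools such as a dependency parser and tagger, and the tag-to-TAM mapping and pronoun lists are illustrative, not the paper's exact resources):

```python
# Sketch of verb-factor extraction from pre-computed annotations:
# (relation, head, dependent) dependency triples and a word->POS map.
FIRST = {"i", "we"}
SECOND = {"you"}

def verb_factors(dependencies, pos_tags):
    # Number factor: find the subject via the nsubj dependency.
    subject = next(dep for rel, _head, dep in dependencies if rel == "nsubj")
    plural = pos_tags.get(subject) == "NNS" or subject.lower() in {"we", "they", "you"}
    number = "pl" if plural else "sg"
    # Person factor: simple pronoun-list lookup.
    s = subject.lower()
    person = "1" if s in FIRST else "2" if s in SECOND else "3"
    # Tense/aspect/modality factor from the verb's POS tag.
    tam_map = {"VBD": "past", "VBZ": "present", "VBP": "present", "MD": "modal"}
    verb = next(w for w, t in pos_tags.items() if t.startswith("VB") or t == "MD")
    return number, person, tam_map.get(pos_tags[verb], "unknown")

deps = [("nsubj", "barks", "dog")]
tags = {"dog": "NN", "barks": "VBZ"}
print(verb_factors(deps, tags))  # ('sg', '3', 'present')
```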
4.1 Automatic evaluation

The translation systems were evaluated using the BLEU score (Papineni et al. 2002). We also counted the number of OOV words in the translation outputs, since a reduction in the number of unknowns in the output indicates better handling of data sparsity. Table 5 shows the evaluation scores and counts. From the evaluation scores, it is evident that Fact0/Phr0 outperforms Fact/Phr on every morphology problem in both Hindi and Marathi. However, the improvements in the En-Mr systems are smaller; this is due to the small size of the word-form dictionaries used for injection. The percentage reduction in OOV words shows that morphology injection is more effective with factored models than with the phrase-based model. Also, the improvements in BLEU are small compared to the percentage reduction in OOV words.
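The OOV count used alongside BLEU can be sketched as follows (a minimal sketch; the placeholder token and example sentences are invented for illustration):

```python
# Sketch of the OOV count: the number of untranslated/UNKNOWN tokens
# in the system output serves as a proxy for how well the system
# handled data sparsity.
def count_oov(outputs, unknown_token="UNK"):
    return sum(line.split().count(unknown_token) for line in outputs)

baseline = ["ladakaa UNK gaya", "UNK khaata hai"]   # 2 unknowns
injected = ["ladakaa ghar gaya", "kuttaa khaata hai"]  # 0 unknowns
print(count_oov(baseline), count_oov(injected))  # 2 0
```

A drop in this count after morphology injection indicates that previously unseen factor combinations are now covered by the enriched training data.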
4.2 Subjective evaluation

As BLEU evaluation with a single reference is not a true measure of our method's quality, we also performed human evaluation. We found that the Fact0/Phr0 systems produce better outputs than the Fact/Phr systems in terms of both adequacy and fluency. We randomly chose 150 translation outputs from each system for manual evaluation to obtain adequacy and fluency scores. The scores were given on a scale of 1-5, from worst to best (Sreelekha et al. 2013). Table 5 shows the average scores for each system. We observe up to 9% improvement in adequacy and up to 11% improvement in fluency.

5 Conclusion

SMT approaches suffer from data sparsity when translating into morphologically rich languages. We use the morphology injection method to solve this problem by enriching the training data with the missing morphological forms of words. We verify this method on two morphologically rich languages, Marathi and Hindi, when translating from English. Our analysis shows that morphology injection performs very well and improves translation quality. We observe a large reduction in the number of OOV words and improvements in the adequacy and fluency of the translation outputs. The method is more effective when used with factored models than with phrase-based models. Although the approach to solving data sparsity is simple, the morphology generation process may be difficult for target languages that are morphologically very complex. A possible direction for future work is to generalize the approach to morphology generation and to verify the effectiveness of morphology injection on such morphologically complex languages.

Acknowledgements The authors would like to thank Prof. Pushpak Bhattacharyya for his guidance during this work. The authors would like to thank the Department of Science & Technology, Govt. of India for providing funding under the Woman Scientist Scheme (WOS-A) with the Project Code SR/WOS-A/ET/1075/2014. The author would like to acknowledge her own associated works published on the ACM and ACL web.

Author's contributions The first author is the sole author of this work.
Funding This work is funded by the Department of Science & Technology, Govt. of India under the Woman Scientist Scheme (WOS-A) with the Project Code SR/WOS-A/ET/1075/2014.

Availability of data and materials The created lexical resources are freely available under a Creative Commons license.

Compliance with ethical standards

Conflict of interest The author declares that there is no conflict of interest associated with this work.

Ethics approval and consent to participate Not applicable.

Consent for publication Not applicable.

References

Avramidis E, Koehn P (2008) Enriching morphologically poor languages for statistical machine translation. In: ACL
Chahuneau V, Schlinger E, Smith NA, Dyer C (2013) Translating into morphologically rich languages with synthetic phrases. In: EMNLP
De Marneffe M-C, Manning CD (2008) Stanford typed dependencies manual. http://nlp.stanford.edu/software/dependenciesmanual.pdf
Koehn P, Hoang H (2007) Factored translation models. In: EMNLP-CoNLL
Koehn P, Och FJ, Marcu D (2007) Statistical phrase-based translation. In: NAACL on human language technology, vol 1. ACL
Papineni K, Roukos S, Ward T, Zhu W-J (2002) BLEU: a method for automatic evaluation of machine translation. In: ACL
Singh S, Sarma VM (2011) Verbal inflection in Hindi: a distributed morphology approach. In: PACLIC
Singh S, Sarma VM, Muller S (2010) Hindi noun inflection and distributed morphology. In: Muller S (ed), Universite Paris Diderot, Paris 7, France. CSLI Publications, p 307
Sreelekha S, Dabre R, Bhattacharyya P (2013) Comparison of SMT and RBMT, the requirement of hybridization for Marathi-Hindi MT. In: ICON, 10th international conference on NLP, December 2013
Tamchyna A, Bojar O (2013) No free lunch in factored phrase-based machine translation. In: Computational linguistics and intelligent text processing. Springer, Berlin Heidelberg, pp 210-223
Toutanova K, Klein D, Manning CD, Singer Y (2003) Feature-rich part-of-speech tagging with a cyclic dependency network. In: Proceedings of the 2003 conference of the North American chapter of the Association for Computational Linguistics on human language technology, vol 1. Association for Computational Linguistics, pp 173-180. https://doi.org/10.3115/1073445.1073478

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.