Morphology Generation For English-Indian Language Statistical Machine Translation
https://doi.org/10.1007/s00500-020-05393-7
Abstract
When translating into morphologically rich languages, statistical MT approaches face the problem of data sparsity. The sparseness problem is especially severe when the corpus of the morphologically richer language is small. Even though factored models can correctly generate morphological forms of words, data sparseness limits their performance. In this paper, we describe a simple and effective solution based on enriching the input corpora with various morphological forms of words. We apply this method in phrase-based and factor-based experiments on two morphologically rich languages, Hindi and Marathi, when translating from English. We evaluate the performance of our experiments both by automatic evaluation and by subjective evaluation of adequacy and fluency. We observe that the morphology injection method helps in improving the quality of translation. We further show that it handles the data sparseness problem to a great extent.
describe the morphology generation and the factored model for handling morphology, with a case study of handling morphology for English to Hindi translation, in Sect. 3. We describe our experiments and evaluations in Sect. 4. Section 5 draws conclusions and points to future work.

2 Sparseness in factored translation models

While factored models allow the incorporation of linguistic annotations, they also lead to the problem of data sparseness. Two kinds of sparseness can be identified: sparseness during translation and sparseness during generation. Sparseness in translation occurs when a particular combination of factors does not exist in the source-side training corpus. For example, let the factored model have a single translation step: X|Y → P|Q. Suppose the training data has evidence for only the mapping xi|yj → pk|ql. The factored model learnt from this data cannot translate xu|yv for all u ≠ i and v ≠ j; it generates UNKNOWN as output in these cases. Note that if we train a simple phrase-based model only on the surface forms of words, we will at least get some output, which may not have the correct inflections but will still convey the meaning.
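The translation-step sparseness described above can be illustrated with a toy lookup (a minimal sketch; the phrase-table entry and factor values are invented for illustration, not taken from the paper's data):

```python
# Toy illustration of sparseness in a factored translation step
# X|Y -> P|Q: the model can only map factor combinations that were
# actually seen in training.
phrase_table = {
    # (source root, number factor) -> target root|suffix
    ("boy", "sg"): "ladakaa|null",
}

def translate(root, factor):
    # An unseen factor combination yields UNKNOWN, even though the
    # root "boy" itself was observed (with a different factor).
    return phrase_table.get((root, factor), "UNKNOWN")

print(translate("boy", "sg"))  # ladakaa|null
print(translate("boy", "pl"))  # UNKNOWN
```

A plain phrase-based model backing off to the surface form would instead emit some (possibly wrongly inflected) output here, which is why the factored model's UNKNOWNs hurt more.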
Sparseness in generation occurs when the target-side training corpus does not contain a specific factor combination. For example, let the factored model have a single generation step: P|Q → R. Suppose the target-side training data has evidence of only pa|qb → rc. The factored model learnt from this data cannot generate from pu|qv for all u ≠ a and v ≠ b. Again, the factored model generates UNKNOWN as output. Thus, due to sparseness, we cannot make the best use of factored models. In fact, they are far worse than phrase-based models when a particular factor combination is absent from the training data.

A simple and effective solution to the sparseness problem is to have all factor combinations present in the training data. For the factored model described in Sect. 3.1, in order to remove data sparseness in the translation step, we need to have all Source root|{S} → Target root|suffix pairs present in the training data. Also, to remove data sparseness in the generation step, we need to have all Target root|suffix → Target surface word pairs present in the training data. In Sect. 3, we use a solution to this problem in the context of English to Hindi translation.

3 Morphology generation

In this section, we discuss the handling of noun and verb morphology when translating from English to Hindi using factored models. We also discuss the solution to the sparseness problem.

3.1 Noun morphology

In this section, we discuss the factored model for handling Hindi noun morphology and the data sparseness solution in that context. Hindi nouns show morphological marking only for number and case. Number can be either singular or plural. Case marking on Hindi nouns is either direct or oblique. Gender, an inherent lexical property of Hindi nouns (masculine or feminine), is not morphologically marked but is realized via agreement with adjectives and verbs (Singh and Sarma 2011).

The generation phase begins with the correction of the genders of the translated words, since certain words are masculine in the source language but feminine in the target, and vice versa. This is followed by short-distance and long-distance agreement, performed by the intra-chunk and inter-chunk modules. These ensure that the gender, number and person of local groups of phrases agree, and also that the gender of the subject's verbs or objects reflects that of the subject.

3.1.1 Factored model setup

Noun inflections in Hindi are affected only by the number and case of the noun (Singh et al. 2010). So, in this case, the set S consists of number and case. Number can be singular or plural, and case can be direct or oblique. An example of factors and mapping steps is shown in Fig. 1. The generation of the number and case factors is discussed in Sect. 4.

3.1.2 Building word-form dictionary

In the case of the factored model described in Sect. 3.1.1:
• To solve the sparseness in the translation step, we need to have all English root|number|case → Hindi root noun|suffix pairs present in the training data.
• To solve the sparseness in the generation step, we need to have all Hindi root noun|suffix → Hindi surface word pairs present in the training data.

In other words, we need to obtain a set of suffixes and their corresponding number-case values for each noun pair. Using these suffixes and the Hindi root word, we then generate Hindi surface words to remove sparseness in the generation step. We need to generate four pairs for each noun present in the training data, i.e., (sg-dir, sg-obl, pl-dir, pl-obl), and obtain their corresponding Hindi inflections. In the following section, we discuss how to generate these morphological forms.
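The enumeration of the four number-case pairs per noun can be sketched as follows (a minimal sketch; the class-to-suffix table is a made-up stand-in for the classification in Table 1, which is not reproduced here):

```python
# Sketch of the morphology-injection enumeration: for every noun pair
# in the training data, produce all four number-case combinations and
# pair each with the suffix its inflectional class assigns.
SUFFIX_TABLE = {
    # class -> {(number, case): suffix}; "" means no overt marking
    "A": {("sg", "dir"): "", ("sg", "obl"): "",
          ("pl", "dir"): "", ("pl", "obl"): ""},
    "D": {("sg", "dir"): "aa", ("sg", "obl"): "e",
          ("pl", "dir"): "e", ("pl", "obl"): "o"},
}

def expand_noun(english_root, hindi_root, noun_class):
    """Return the four English root|number|case -> Hindi root|suffix pairs."""
    pairs = []
    for (number, case), suffix in SUFFIX_TABLE[noun_class].items():
        source = f"{english_root}|{number}|{case}"
        target = f"{hindi_root}|{suffix}"
        pairs.append((source, target))
    return pairs

# Four entries per noun: sg-dir, sg-obl, pl-dir, pl-obl
for src, tgt in expand_noun("dog", "kutt", "D"):
    print(src, "->", tgt)
```

The generated pairs are what gets injected into the training corpus so that both the translation and the generation steps see every factor combination.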
3.1.2.1 Generating new morphological forms Figure 2 shows a pipeline to generate new morphological forms for an English-Hindi noun pair. To generate the different morphological forms, we need to know the suffix of a noun in Hindi for the corresponding number and case combination. For this purpose, we use the morphological classification of nouns shown in Table 1. Nouns are classified into five classes, namely A, B, C, D and E, according to their inflectional behavior with respect to case and number (Singh et al. 2010). All nouns in the same class show the same inflectional behavior.

• Class A: includes nouns that take a null suffix for all number-case values. These nouns are generally abstract or uncountable.
• Class B: includes feminine nouns ending in i, ī or ya that take ya for the features [+ pl, − oblique] and yo for [+ pl, + oblique].
• Class C: includes feminine nouns that take e for the feature [+ pl] and o for [+ pl, + oblique].
• Class D: includes masculine nouns that end in a or ya. Some kinship terms are also included; words directly derived from Sanskrit are excluded.
• Class E: includes masculine nouns that inflect only for the features [+ pl, + oblique]. The nouns in this class end with u, ū, i, ī or a consonant.

3.1.2.2 Predicting inflectional class for new lexemes For the classification of new lexemes into one of the five classes discussed above, we need gender information. After gender is lexically assigned to the new lexeme, its inflectional class can be predicted using the procedure outlined in Fig. 3. A masculine noun may or may not inflect, depending on its semantic properties. If it is an abstract noun or a mass noun, it falls into the non-inflecting Class A irrespective of its phonological form. A countable lexeme, on the other hand, falls into one of the two masculine classes based on its phonological form. A similar procedure is followed for feminine nouns.

To predict the class of a Hindi noun, we develop a classifier which uses the gender and the ending characters of the noun as features (Singh et al. 2010). Using the class of the Hindi noun and the classification shown in Table 1, we obtain four different suffixes and the corresponding number-case combinations. For example, if we know that the noun (kuttaa) belongs to class D, then we can get four different suffixes for (kuttaa), as shown in Table 2.
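The prediction procedure can be sketched as a small decision rule (illustrative only; the exact endings and the semantic test come from Fig. 3 and Table 1, which are not reproduced here, and the real system uses a trained classifier rather than hand rules):

```python
# Rough sketch of inflectional-class prediction from the two features
# used by the classifier: the noun's gender and its ending characters.
# A countability test routes masculine abstract/mass nouns to the
# non-inflecting Class A, as in the procedure of Fig. 3.
def predict_class(gender, ending, countable=True):
    if gender == "m":
        if not countable:              # abstract/mass nouns do not inflect
            return "A"
        if ending in ("a", "ya"):      # masculine -a/-ya nouns
            return "D"
        return "E"                     # other masculine endings (u, i, consonant)
    else:                              # feminine nouns
        if ending in ("i", "ya"):
            return "B"
        return "C"

print(predict_class("m", "a"))                   # D
print(predict_class("m", "a", countable=False))  # A
print(predict_class("f", "i"))                   # B
```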
[Table 2: morphological suffixes for the dog (kuttaa) noun pair]

Generating surface word: Next, we generate the Hindi surface word from the Hindi root noun and suffix using a rule-based joiner (reverse morphological) tool. The rules of the joiner use the ending of the root noun and the class to which the suffix belongs as features. Thus, we obtain four different morphological forms for the noun entities present in the training data, and we augment the original training data with these newly generated morphological forms. Table 3 shows the four morphological forms of the boy (ladakaa) noun pair. The joiner solves the sparseness in the generation step.
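The joiner step can be sketched as follows (a minimal sketch; the single sandhi-style rule shown is invented to illustrate the mechanism and is not the tool's actual rule set, and transliterated forms stand in for Devanagari):

```python
# Minimal sketch of a rule-based joiner: combine a Hindi root noun and
# a suffix into a surface form. Real joiner rules key on the ending of
# the root and the class of the suffix; one illustrative rule is shown.
def join(root, suffix):
    # Example rule: an -aa-final root drops its final vowel before a
    # vowel-initial suffix (kuttaa + e -> kutte).
    if root.endswith("aa") and suffix and suffix[0] in "aeiou":
        return root[:-2] + suffix
    return root + suffix

print(join("kuttaa", "e"))  # kutte
print(join("kuttaa", "o"))  # kutto
print(join("kuttaa", ""))   # kuttaa
```

Running the joiner over every (root, suffix) pair produced for the word-form dictionary yields the Target root|suffix → Target surface word entries that remove the generation-step sparseness.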
3.2 Verb morphology

In this section, we discuss the factored model for handling Hindi verb morphology and the data sparseness solution in that context. Many grammarians and morphologists have discussed the inflectional categories of Hindi verbs, but these studies are either pedagogical or structural in approach. Across all approaches, there is much agreement on the kinds of inflectional categories seen in Hindi verbs. The inflection in Hindi verbs may appear as suffixes or as auxiliaries (Singh and Sarma 2011). These categories and their exponents are shown in Table 4. When translating from English to Hindi, to handle all of these verb inflections we would need to have all the corresponding factors available to implement a factored model. But, as we will see, so many factors in the factored model may degrade the performance of the translation system. Hence, we use only those factors which are important and easily available.
Table 5 Evaluation scores for each system (BLEU, number of OOV words, and adequacy/fluency for En-Hi and En-Mr; column headers were not recovered from the source):

Noun           Fact   25.30  12.84  2130  1499  19.33  15.08  25.62  16.52  25.65  17.20
               Fact0  28.41  17.86  1839  1302  30.73  23.58  35.66  26.25
Verb           Fact   26.23  13.02  1241  1872  20.11  17.58  26.85  18.67  27.86  19.26
               Fact0  29.16  19.02  1010  1649  35.91  26.74  39.91  29.31
Noun and verb  Fact   22.93  10.55  2293  3237  18.87  10.98  20.89  13.69  24.92  16.28
               Fact0  24.03  12.01  1967  2422  24.19  18.79  28.06  22.36
Noun and verb  Phr    24.87  13.40   913   875  12.38  10.06  23.07  17.70  25.90  18.24
               Phr0   27.89  16.41   853   828  27.15  21.72  31.92  25.25
Verb factors:

• Number factor: using typed dependencies, we extract the subject of the sentence and obtain the number of the subject, as we do for a noun.
• Person factor: we look up the subject in a simple list of pronouns to find its person.
• Tense, aspect and modality factor: we use the POS tags of the verbs to extract the tense, aspect and modality of the sentence.
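The three factor extractors can be sketched together (a minimal sketch; it assumes dependency triples and POS tags already produced by external tools such as a dependency parser and tagger, and the tag-to-TAM mapping and pronoun lists are illustrative, not the paper's exact resources):

```python
# Sketch of verb-factor extraction from pre-computed annotations:
# (relation, head, dependent) dependency triples and a word->POS map.
FIRST = {"i", "we"}
SECOND = {"you"}

def verb_factors(dependencies, pos_tags):
    # Number factor: find the subject via the nsubj dependency.
    subject = next(dep for rel, _head, dep in dependencies if rel == "nsubj")
    plural = pos_tags.get(subject) == "NNS" or subject.lower() in {"we", "they", "you"}
    number = "pl" if plural else "sg"
    # Person factor: simple pronoun-list lookup.
    s = subject.lower()
    person = "1" if s in FIRST else "2" if s in SECOND else "3"
    # Tense/aspect/modality factor from the verb's POS tag.
    tam_map = {"VBD": "past", "VBZ": "present", "VBP": "present", "MD": "modal"}
    verb = next(w for w, t in pos_tags.items() if t.startswith("VB") or t == "MD")
    return number, person, tam_map.get(pos_tags[verb], "unknown")

deps = [("nsubj", "barks", "dog")]
tags = {"dog": "NN", "barks": "VBZ"}
print(verb_factors(deps, tags))  # ('sg', '3', 'present')
```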
4.1 Automatic evaluation

The translation systems were evaluated using the BLEU score (Papineni et al. 2002). We also counted the number of OOV words in the translation outputs, since a reduction in the number of unknowns in the output indicates better handling of data sparsity. Table 5 shows the evaluation scores and counts. From the evaluation scores, it is evident that Fact0/Phr0 outperforms Fact/Phr on every morphology problem in both Hindi and Marathi. However, the improvements in the En-Mr systems are smaller; this is due to the small size of the word-form dictionaries used for injection. The percentage reduction in OOV words shows that morphology injection is more effective with factored models than with the phrase-based model. Also, the improvements in BLEU are small compared to the percentage reduction in OOV words.
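The OOV count used alongside BLEU can be sketched as follows (a minimal sketch; the placeholder token and example sentences are invented for illustration):

```python
# Sketch of the OOV count: the number of untranslated/UNKNOWN tokens
# in the system output serves as a proxy for how well the system
# handled data sparsity.
def count_oov(outputs, unknown_token="UNK"):
    return sum(line.split().count(unknown_token) for line in outputs)

baseline = ["ladakaa UNK gaya", "UNK khaata hai"]   # 2 unknowns
injected = ["ladakaa ghar gaya", "kuttaa khaata hai"]  # 0 unknowns
print(count_oov(baseline), count_oov(injected))  # 2 0
```

A drop in this count after morphology injection indicates that previously unseen factor combinations are now covered by the enriched training data.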
4.2 Subjective evaluation

As BLEU evaluation with a single reference is not a true measure of our method's quality, we also performed human evaluation. We found that the Fact0/Phr0 systems produce better outputs than the Fact/Phr systems in terms of both adequacy and fluency. We randomly chose 150 translation outputs from each system for manual evaluation to obtain adequacy and fluency scores. The scores were given on a scale of 1-5, from worst to best (Sreelekha et al. 2013). Table 5 shows the average scores for each system. We observe up to 9% improvement in adequacy and up to 11% improvement in fluency.

5 Conclusion

SMT approaches suffer from data sparsity when translating into morphologically rich languages. We use the morphology injection method to solve this problem by enriching the training data with the missing morphological forms of words. We verify this method on two morphologically rich languages, Marathi and Hindi, when translating from English. Our analysis shows that morphology injection performs very well and improves translation quality. We observe a large reduction in the number of OOV words and improvements in the adequacy and fluency of the translation outputs. The method is more effective when used with factored models than with phrase-based models. Although the approach to solving data sparsity is simple, the morphology generation process may be difficult for target languages that are morphologically very complex. A possible direction for future work is to generalize the approach to morphology generation and to verify the effectiveness of morphology injection on such morphologically complex languages.

Acknowledgements The authors would like to thank Prof. Pushpak Bhattacharyya for his guidance during this work. The authors would like to thank the Department of Science & Technology, Govt. of India for providing funding under the Woman Scientist Scheme (WOS-A) with the Project Code SR/WOS-A/ET/1075/2014. The author would like to acknowledge her own associated works published on the ACM and ACL web.

Author's contributions The first author is the sole author of this work.
Funding This work is funded by the Department of Science & Technology, Govt. of India under the Woman Scientist Scheme (WOS-A) with the Project Code SR/WOS-A/ET/1075/2014.

Availability of data and materials The created lexical resources are freely available under a Creative Commons license.

Compliance with ethical standards

Conflict of interest The author declares that there is no conflict of interest associated with this work.

Ethics approval and consent to participate Not applicable.

Consent for publication Not applicable.

References

Avramidis E, Koehn P (2008) Enriching morphologically poor languages for statistical machine translation. In: ACL
Chahuneau V, Schlinger E, Smith NA, Dyer C (2013) Translating into morphologically rich languages with synthetic phrases. In: EMNLP
De Marneffe M-C, Manning CD (2008) Stanford typed dependencies manual. http://nlp.stanford.edu/software/dependenciesmanual.pdf
Koehn P, Hoang H (2007) Factored translation models. In: EMNLP-CoNLL
Koehn P, Och FJ, Marcu D (2007) Statistical phrase-based translation. In: NAACL on human language technology, vol 1. ACL
Papineni K, Roukos S, Ward T, Zhu W-J (2002) BLEU: a method for automatic evaluation of machine translation. In: ACL
Singh S, Sarma VM (2011) Verbal inflection in Hindi: a distributed morphology approach. In: PACLIC
Singh S, Sarma VM, Muller S (2010) Hindi noun inflection and distributed morphology. In: Muller S (ed), Universite Paris Diderot, Paris 7, France. CSLI Publications, p 307
Sreelekha S, Dabre R, Bhattacharyya P (2013) Comparison of SMT and RBMT, the requirement of hybridization for Marathi-Hindi MT. In: ICON, 10th international conference on NLP, December 2013
Tamchyna A, Bojar O (2013) No free lunch in factored phrase-based machine translation. In: Computational linguistics and intelligent text processing. Springer, Berlin Heidelberg, pp 210-223
Toutanova K, Klein D, Manning CD, Singer Y (2003) Feature-rich part-of-speech tagging with a cyclic dependency network. In: Proceedings of the 2003 conference of the North American chapter of the Association for Computational Linguistics on human language technology, vol 1. Association for Computational Linguistics, pp 173-180. https://doi.org/10.3115/1073445.1073478

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.