
Artif Intell Rev

DOI 10.1007/s10462-012-9351-1

Arabic machine translation: a survey

Arwa Alqudsi · Nazlia Omar · Khalid Shaker

© Springer Science+Business Media B.V. 2012

Abstract Although no machine translation technique fully meets human requirements,
finding a quick and efficient translation mechanism has become an urgent necessity, owing
to the differences between the languages spoken in the world's communities and the vast
development that has occurred worldwide. Each technique has its own advantages and
disadvantages. The purpose of this paper is therefore to shed light on some of the machine
translation techniques available in the literature, and to encourage researchers to study
them. We discuss the linguistic characteristics of Arabic that are related to machine
translation in detail, along with the difficulties they might present. The paper then
summarizes the major techniques used in machine translation from Arabic into English, and
discusses their strengths and weaknesses.

Keywords Arabic machine translation · Arabic language morphology

1 Introduction

Machine translation is a computer application that translates texts or speech from one natural
language to another. Machine translation receives a source sentence,

A. Alqudsi (B) · N. Omar


Knowledge Technology Research Group (KT), School of Computer Science, Faculty of Information
Science and Technology, University Kebangsaan Malaysia, 43600 UKM Bangi, Selangor, Malaysia
e-mail: arwa@ftsm.ukm.my
N. Omar
e-mail: no@ftsm.ukm.my

K. Shaker
Department of Software Engineering, Faculty of Computer Science and Information Technology,
University of Malaya, 50603 Lembah Pantai, Kuala Lumpur, Malaysia
e-mail: khalidsh@um.edu.my


$S = [s_1, s_2, \ldots, s_i]$

and generates a target sentence,

$T = [t_1, t_2, \ldots, t_j]$

by translating the source sentence and rendering its meaning in the target language.
Hutchins (2007) provides a good overview of text translation, and Dorr et al. (1999)
provide a comprehensive survey of machine translation. The question is: why should
researchers be interested in developing new translation systems? The first and most
important reason is need: since the advent of computers, there has been an increasing
demand for online communication between people worldwide who speak different languages.
Another reason is that no machine translation system fully satisfies people's requirements
in terms of translation quality and retrieval time. Furthermore, computer translation
tools can increase translation throughput and deliver immediate results at a lower cost.
Finally, machine translation is a major activity in natural language processing for
different fields. Thus, efficient techniques that work with special rules should be
available to build a useful machine translation system, in order to improve the
translation of natural language texts into other natural languages and to remove
anomalies. Arnold et al. (1994) suggested that machine translation should focus on
moderate translation that involves human interaction.
Recently, machine translation has achieved better results for almost all natural languages
(Attia 2008). Many online translation systems are available. Google Translator¹ is a free
online text translation service that is based on statistical machine translation paradigms
and supports more than 55 languages. Microsoft Translator² is a free online translation
service that is based on example-based machine translation and several statistical machine
translation technologies, and supports 32 languages. Systran³ uses a rule-based machine
translation paradigm and can translate a certain number of languages, such as English,
Arabic, French, Dutch, and Chinese. Many of its language pairs translate to or from
English or French.
The goals of this paper are to provide details of the Arabic language, to characterize the
main ideas of Arabic to English machine translation, and provide a classification of various
approaches applied to translate Arabic text into English text.

1.1 Arabic language

Arabic is one of the six major world languages. It originated in the area currently known
as the Arabian Peninsula, and has been used since the 2nd millennium Before the Common
Era. Spoken Arabic is presently more divergent than written Arabic, due to dialectal
interference. In morphological analysis, Arabic words are often ambiguous
(Al-Sughaiyer and Al-Kharashi 2004).
Arabic is a joint official language in Middle Eastern and African states. Large
communities of Arabic speakers have existed outside of the Middle East since the end of
the last century, particularly in the United States and Europe. The motivation for this
paper is to shed light on the features of Arabic and to examine several Arabic-to-English
translation systems described in the literature, in terms of their strengths and
weaknesses.

1 http://translate.google.com.
2 http://www.microsofttranslator.com.
3 http://www.systransoft.com/.


Fig. 1 Arabic sentence and the equivalent sentence in English ("A new British
investigation of Iraqis torture")

Arabic allows several word orders, which poses a significant challenge to MT because the
same sentence can be expressed in more than one way. An Arabic sentence is made up of
three elements: subject, verb, and object. Depending on the order of these elements,
Arabic sentences can be classified into four types: SVO, VSO, VOS, and SOV.
Put simply, the task is to take a string of words (a "sentence") in the source language,
drawn from the source vocabulary, and transform it into another string of words (a
"sentence") in the target language, drawn from the target vocabulary. Some languages, such
as German or Chinese, may require special pre-processing, as there are no clearly marked
word boundaries.
There is often no special treatment of morphological variants. Because Arabic is rich and
complex in its morphological and syntactic structures, the size of the vocabulary can
therefore reach into the tens or hundreds of thousands, or even millions.
Soudi et al. (2007) presented a review of the salient issues in Arabic computational mor-
phology, provided a broad coverage of the computational techniques for the processing of
Arabic morphology, and a detailed discussion of the linguistic approaches on which each
computational treatment is based. They also introduced the transliteration scheme, which
is used to represent Arabic words for readers who cannot read Arabic script, as well as
guidelines for pronouncing Arabic, given this transliteration.
The goal of a translation system, when presented with an input sequence, is to find the
target sequence that is its translation. An example of a pair of corresponding sequences
is shown in Fig. 1, where an Arabic sentence is translated into an English sentence. The
words can be aligned by drawing a line between each pair of words that are translations
of each other.
Any new system therefore needs a mechanism to choose between the possible options for
each translation decision. It also needs a mechanism to reorder words correctly, as words
with equivalent meanings do not always appear in the same order in the source and target
sentences. Reordering typically depends on the syntactic structure of the target language.
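The alignment and reordering ideas can be illustrated with a minimal sketch. It assumes a hypothetical alignment stored as (source position, target position) pairs, and uses English glosses of the VSO sentence from Table 1 in place of Arabic script. The function name and the toy data are illustrative only and are not taken from any of the surveyed systems.

```python
from typing import List, Tuple


def reorder_by_alignment(source: List[str],
                         alignment: List[Tuple[int, int]]) -> List[str]:
    """Place each source token at the target position given by the alignment."""
    target_len = max(t_idx for _, t_idx in alignment) + 1
    target: List[str] = [""] * target_len
    for s_idx, t_idx in alignment:
        target[t_idx] = source[s_idx]
    return target


# Word-for-word gloss of a VSO sentence (Table 1): "Ate Adam the apple".
source_gloss = ["ate", "Adam", "the", "apple"]
# Hypothetical one-to-one alignment: (source position, target position).
alignment = [(0, 1), (1, 0), (2, 2), (3, 3)]

print(" ".join(reorder_by_alignment(source_gloss, alignment)))
```

A real system would have to learn or derive the alignment, and would also handle one-to-many and unaligned words, which this sketch ignores.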
The foremost challenge for Natural Language Processing (NLP) in Arabic is overcoming
ambiguity (Kamir et al. 2002; Albared et al. 2009). It is not uncommon for the different
possible translations of a word to have very different meanings and, because of its rich
and complex morphology, Arabic is notorious for its morphological ambiguity (Attia 2006).
According to Daimi (2001), Fehri (1993), and Chalabi (2000), there are many complexities
in Arabic. The major issues are listed below:
• Arabic writing direction is from right to left in a horizontal form. For example:

The translation for this sentence is Khalid read the book.


Table 1 Arabic free word order

Sentence order   Arabic sentence   English translation
VSO                                Ate Adam the apple
OVS                                The apple ate Adam
SVO                                Adam ate the apple
VOS                                Ate the apple Adam

Fig. 2 Derivation of words from a three-letter-root (English glosses of the derived
forms: wrote, it was written, book, books, library, libraries, office, offices, writer,
writers)

• There are no capital letters in Arabic.


• Punctuation in Arabic is similar to English, except for commas, which sit ‘on’ the line
instead of ‘under’ the line.
• Arabic uses gender for all known nouns (none are neutral).
• Space is left between words in sentences.

Some letters change shape depending on their position within a word, that is, whether
they are at the start, in the middle, or at the end of the word. A letter may also take
one of two shapes at the end of a word.
Arabic has a comparatively free word order (see Table 1). All four sentences in Table 1
have the same English translation, Adam ate the apple, but they use different word
orders: the first sentence is VSO, the second is OVS, the third is SVO, and the last is
VOS.

• Arabic is a pro-drop language. According to Chalabi (2004), the subject can be omitted,
leaving any syntactic parser with the challenge to decide, first, whether or not there is an
omitted pronoun in the subject position, and second, what the antecedent of the omitted
pronoun is. For example:
(He writes the lesson)
• Arabic makes heavy use of clitics and affixes (Abbès et al. 2004). Some Arabic words
hold the meaning of a full sentence. For example:
(We will play)
• Arabic words can often be ambiguous, because of the three-letter root system. This
system allows Arabic to cover a wide range of meanings. One or more of the root letters
is dropped in some derivations, and this leads to possible ambiguity. Figure 2 shows the
derivation of words from a three-letter root.
• Arabic has no equivalents of the English verbs 'to be' (the copula) and 'to have'.
Examples are shown in Tables 2 and 3.


Table 2 Verb 'to be'

Arabic sentence   Arabic reading      English sentence (to be)
                  The door opening    The door is opening
                  He clever           He is clever

Table 3 Verb 'to have'

Arabic sentence   Arabic reading      English sentence (to have)
                  To her a bag        She has a bag
                  To him a book       He has a book

Table 4 Feminine nouns are derived from masculine nouns

Arabic nouns   English translation
(male)         Engineer
(female)       Engineer
(male)         Doctor
(female)       Doctor

Table 5 Feminine nouns are different from masculine

Arabic nouns   English translation
(male)         Boy
(female)       Girl
(male)         Man
(female)       Woman
(male)         Cock
(female)       Chicken

In the examples in Table 2, rather than saying "the door is opening," the Arabic sentence
reads literally as "the door opening," and instead of "He is clever," it reads "He
clever."
In English, the verb 'to have' usually means 'to own'. As shown in Table 3, rather than
saying "she has a bag," the Arabic equivalent reads "to her a bag," and rather than "he
has a book," it reads "to him a book."

• Nouns in Arabic must either be masculine or feminine. Usually feminine nouns are derived
from masculine nouns, which are considered as the stem (see Table 4).

However, some feminine nouns are not derived from a masculine noun and differ from it
entirely, as shown in Table 5.
• The number system in Arabic includes a dual form, whereas English moves directly from
the singular to the plural. The Arabic dual is formed by adding a suffix to the singular
(stem); the suffix differs depending on whether the case is nominative or accusative and
genitive (as shown in Table 6).
• The plural of Arabic masculine nouns is formed by adding a suffix to the singular noun;
the suffix differs depending on whether the case is nominative or accusative and genitive
(see Table 7).


Table 6 Arabic dual and plural forms

Arabic singular   Arabic dual (nominative)   Arabic dual (accusative and genitive)   English translation
(male)            (male)                     (male)                                  Two engineers
(female)          (female)                   (female)                                Two engineers

Table 7 Plural form of Arabic masculine nouns

Arabic singular   Arabic plural (nominative)   Arabic plural (accusative and genitive)   English translation
                                                                                          Teachers
                                                                                          Visitors

Table 8 Plural form of Arabic feminine nouns

Arabic singular   Arabic plural (nominative)   Arabic plural (accusative and genitive)   English translation
                                                                                          Teachers
                                                                                          Visitors

Table 9 Broken plural

Arabic singular   English translation   Arabic plural   English translation
                  Door                                  Doors
                  Pen                                   Pens
                  Planet                                Planets

• The plural of Arabic feminine nouns is formed by adding a suffix to the stem word; the
suffix differs depending on whether the case is nominative or accusative and genitive
(see Table 8).
• Some Arabic words have no fixed rule for their plural form. Their plurals are formed by
changing the vowels, or by adding or deleting letters of the original word; this type of
plural is called a broken plural (as shown in Table 9).

1.2 Some Arabic affixes matter

Arabic has a large number of suffixes and prefixes that can attach to a stem to form
words. This leads to a large vocabulary in the lexicon and hence a potential increase in
word error rate. To tackle this problem, previous studies report that a simple
morphological analysis of Arabic words is helpful and shows good potential for machine
translation (Afify et al. 2006). Prefixes and/or suffixes are attached to word stems to
produce new words.


Table 10 Arabic prefixes, suffixes and their meanings

Prefixes     Suffixes
(and)        (your (singular))
(the)        (your (plural))
(then)       (his)
(to)         (her)
(and the)    (their)
(like)       (my)
The stem word can be derived by applying some predefined patterns to the roots. Table 10
shows several Arabic prefixes, suffixes, and their meanings.
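A simple morphological analysis of this kind can be sketched as light affix stripping. The sketch below is an illustration under stated assumptions, not the method of Afify et al. (2006): the affix lists are the standard Arabic forms that correspond to the meanings listed in Table 10, the minimum stem length of three letters is an arbitrary guard, and the function name `segment` is hypothetical.

```python
from typing import Tuple

# Assumed affix inventory matching the glosses in Table 10 (longest forms first).
PREFIXES = ["وال", "ال", "و", "ف", "ل", "ك"]  # and the, the, and, then, to, like
SUFFIXES = ["كم", "هم", "ها", "ك", "ه", "ي"]  # your (pl.), their, her, your (sg.), his, my
MIN_STEM = 3  # arbitrary guard so that short words are left intact


def segment(word: str) -> Tuple[str, str, str]:
    """Strip at most one prefix and one suffix; return (prefix, stem, suffix)."""
    prefix = suffix = ""
    stem = word
    for p in PREFIXES:
        if stem.startswith(p) and len(stem) - len(p) >= MIN_STEM:
            prefix, stem = p, stem[len(p):]
            break
    for s in SUFFIXES:
        if stem.endswith(s) and len(stem) - len(s) >= MIN_STEM:
            suffix, stem = s, stem[:-len(s)]
            break
    return prefix, stem, suffix


print(segment("وكتابهم"))  # "and their book"
print(segment("ولد"))      # "boy": the initial letter is part of the stem
```

Stripping by string matching alone over-segments many words whose first or last letters only look like affixes; a lexicon or a full morphological analyser is needed to resolve such cases.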

2 Translation approaches

There are many different approaches to machine translation, and many techniques have been
used over the years to enhance its performance. In this work, we give a brief explanation
of the main approaches that have been used in previous works.

2.1 Rule-based

In the field of machine translation, the rule-based approach was the first technique used
by researchers. Rules are written by humans according to their linguistic knowledge. The
strength of this approach is that it can analyse text deeply at both the syntactic and
semantic levels. A key component of any rule-based MT system is its lexical resources. In
practice, rule-based machine translation systems often have several dictionaries, some
containing main entries and others containing specialized vocabulary. Brill and Resnik
(1994) described a new rule-based approach to prepositional phrase attachment
disambiguation, and compared the results of this algorithm with other rule-based
approaches to the same problem. Salem et al. (2008) introduced a system called UniArab to
support the development of a rule-based lexical framework for Arabic language processing
using the Role and Reference Grammar (RRG) linguistic model. The weakness of the
rule-based approach is that it is impossible to write rules that cover all languages, as
this requires great linguistic knowledge (Charoenpornsawat et al. 2002). Hutchins and
Harold (1992) state that rule-based machine translation approaches can be classified by
their architecture into the following categories: the direct approach, the transfer-based
approach, and the interlingual approach.

2.1.1 Direct approach

The direct approach was used by most first generation MT systems. When an MT system uses
direct translation, the source language text is usually not analysed structurally beyond
morphology, as the translation is based on dictionaries. Translation occurs as follows:
• Translation is word-by-word.
• Very little analysis of the source text (e.g., no syntactic or semantic analysis).
• Relies on a large bilingual dictionary. For each word in the source language, the dictionary
specifies a set of rules for translating that word.
• After the words are translated, simple reordering rules are applied.
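
The steps listed above can be sketched in a few lines. This is a minimal illustration, assuming a toy three-word lexicon and a single local reordering rule (a verb-initial sentence is turned into subject-verb order); the lexicon entries, tags, and function name are hypothetical and do not come from any system cited here.

```python
from typing import Dict, Tuple

# Toy bilingual dictionary: source word -> (target word, coarse tag).
LEXICON: Dict[str, Tuple[str, str]] = {
    "أكل": ("ate", "V"),
    "آدم": ("Adam", "N"),
    "التفاحة": ("the apple", "N"),
}


def translate_direct(sentence: str) -> str:
    """Word-by-word lookup followed by one simple local reordering rule."""
    # Step 1: dictionary lookup; unknown words are passed through unchanged.
    looked_up = [LEXICON.get(token, (token, "UNK")) for token in sentence.split()]
    # Step 2: local reordering; a leading verb followed by a noun is swapped.
    if len(looked_up) >= 2 and looked_up[0][1] == "V" and looked_up[1][1] == "N":
        looked_up[0], looked_up[1] = looked_up[1], looked_up[0]
    return " ".join(word for word, _ in looked_up)


print(translate_direct("أكل آدم التفاحة"))
```

Because no syntactic analysis is performed, the rule cannot tell a subject from an object, which is the kind of limitation discussed for this approach.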


Fig. 3 Direct MT approach (source language input, morphological analysis, bilingual
dictionary lookup, local reordering, target language output)

Joshan and Lehal (2007) showed that the direct approach can translate Hindi text into
Punjabi text with reasonably good accuracy.
Traces of the direct approach can even be found in indirect systems. However, the direct
MT system model has a more rudimentary software design. Figure 3 shows some of the steps
of the translation process. The main limitation of direct machine translation is that it
lacks analysis of the source language. This may cause several problems, as words are
translated without disambiguation of their syntactic role.
In response to the apparent weakness of the direct approach, many types of indirect
approach were developed (as shown in Fig. 3). Systems of this nature are sometimes
referred to as second generation systems.

2.1.2 Interlingua approach

Interlingua machine translation is a classic approach, and the most attractive one for a
multilingual system. It is carried out in two stages. In the first stage, the source
language sentence is analysed into an Interlingua representation; in the second stage,
the meaning of that representation is rendered as an output sentence in the target
language.
Within the rule-based machine translation paradigm, the Interlingua approach is an
alternative to the direct and transfer approaches. Hutchins (1986) mentioned that
Interlingua was the first indirect method. However, Hutchins and Harold (1992) mentioned
that "In the past, the intention or hope was to develop an interlingual representation
that was truly 'universal' and could thus be intermediary between any natural languages.
At present, interlingual systems are less ambitious." The development of Interlingua
systems depends on translating the source text into an intermediate language, or symbolic
representation, which can then be translated into any other language. Eric and Teruko
(1992) described Knowledge-Based Natural Language Translation (KANT), the first
knowledge-based interlingual machine translation system to combine a principled source
language design, semi-automated knowledge acquisition, and knowledge compilation
techniques in order to produce fast, high-quality translation into multiple languages. In
addition, Bonnie et al. (2004) defined the elements of an Interlingua, the main issues
faced by researchers and builders of Interlingua systems, and improvements to Interlingua
MT systems.
The Interlingua approach has both advantages and disadvantages. One advantage for
multilingual machine translation is that no transfer component has to be created for each
language pair. An obvious disadvantage is that the definition of an Interlingua is
difficult.


Fig. 4 Interlingua MT with 4 languages (English, French, German, and Spanish, each linked
to the central Interlingua as both a source and a target language)

Fig. 5 Transfer approach (direct, transfer, and Interlingua levels between the source
language text and the target language text)

It is easier to add new language pairs to the system than it is in the direct method, as the addi-
tion of a new language to the system entails the creation of just two new modules: analysis
grammar and generation grammar. For example, in a system that has four languages (i.e.,
English, French, German, and Spanish) there are 12 language pairs (Hutchins and Harold
1992). Figure 4 illustrates an Interlingua system.
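
The arithmetic behind this comparison can be checked with a short sketch. It assumes that every ordered pair of distinct languages counts as one language pair, and that an Interlingua system needs one analysis grammar and one generation grammar per language, as described above; the function names are illustrative.

```python
def directed_language_pairs(n_languages: int) -> int:
    """Number of ordered source-target pairs among n languages."""
    return n_languages * (n_languages - 1)


def interlingua_modules(n_languages: int) -> int:
    """One analysis grammar plus one generation grammar per language."""
    return 2 * n_languages


for n in (4, 5, 10):
    print(n, directed_language_pairs(n), interlingua_modules(n))
```

For four languages this gives 12 language pairs against 8 Interlingua modules, and the gap widens as languages are added, since the first quantity grows quadratically and the second linearly.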

2.1.3 Transfer approach

Transfer based machine translation is based on the idea of Interlingua, and is currently one
of the most widely used methods of machine translation. The transfer approach is carried out
in three phases. The first phase analyses the source language sentence and builds a syntactic
analysis using the Source Language (SL) dictionary. The second phase is the transfer process,
which changes the results of the analysis phase and produces the linguistic and structural
equivalents between two languages. The third phase is the generation phase, which produces
the Target Language (TL) text, based on the linguistic data of the source language, using a
target language dictionary.
Figure 5 represents the transfer strategy. This strategy involves three phases and
represents the document linguistically, using a source language dictionary. The transfer
rules may look quite similar to the rules of direct translation systems, but they operate
on syntactic structures. This approach therefore handles long-distance reordering easily.
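
The three phases can be sketched as three small functions. This is a toy illustration under strong assumptions: the analysis phase only recognises verb-subject-object clauses, the dictionaries contain three hypothetical entries, and the structure is a flat mapping of roles rather than a real parse tree.

```python
from typing import Dict, List

# Toy source language dictionary (word -> tag) and transfer lexicon.
SL_TAGS: Dict[str, str] = {"أكل": "V", "آدم": "N", "التفاحة": "N"}
TRANSFER_LEXICON: Dict[str, str] = {
    "أكل": "ate",
    "آدم": "Adam",
    "التفاحة": "the apple",
}


def analyse(tokens: List[str]) -> Dict[str, str]:
    """Phase 1: build a flat syntactic analysis of a V-S-O clause."""
    tags = [SL_TAGS[token] for token in tokens]
    if tags != ["V", "N", "N"]:
        raise ValueError("this sketch only analyses verb-subject-object clauses")
    return {"verb": tokens[0], "subject": tokens[1], "object": tokens[2]}


def transfer(sl_structure: Dict[str, str]) -> Dict[str, str]:
    """Phase 2: replace source lexical items with target ones, keeping the roles."""
    return {role: TRANSFER_LEXICON[word] for role, word in sl_structure.items()}


def generate(tl_structure: Dict[str, str]) -> str:
    """Phase 3: linearise the roles in English subject-verb-object order."""
    return " ".join(tl_structure[role] for role in ("subject", "verb", "object"))


def translate_transfer(sentence: str) -> str:
    return generate(transfer(analyse(sentence.split())))


print(translate_transfer("أكل آدم التفاحة"))
```

Unlike a word-by-word rule, the generation phase orders whole constituents by role, which is why reordering over longer distances is easier to express in this architecture.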
Toma (1977) improved the SYSTRAN system, which is regarded as the first system to use
this approach. The system has proven capabilities in an operational environment, and has
an inherent capability to translate from one language into any number of different
languages. Lavie et al. (2004) developed a basic Hindi-to-English MT system by enhancing
the performance of a syntactic transfer-based approach with strong statistical methods.
Hatem and Omar (2010) proposed a transfer-based approach to Arabic-to-English MT in order
to solve the word ordering problem. Their approach was tested on 100 titles from the
Aljazeera news website.

2.2 Statistical

The ideas behind statistical machine translation come from information theory: text is
translated according to a probabilistic process. The statistical approach does not
require linguistic knowledge, but it does need a large bilingual corpus. It uses
statistics gathered from the bilingual corpus together with a language model. The
advantage of this approach is its ability to produce suitable translations even if a
given sentence is not similar to any sentence in the training corpus.
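
The combination of bilingual statistics and a language model is commonly formalised as a noisy-channel model, in which the chosen translation maximises P(S|T) · P(T). The sketch below assumes that formulation, which the survey itself does not spell out, and uses invented probabilities for two candidate translations purely for illustration.

```python
import math
from typing import Dict, Tuple


def best_translation(candidates: Dict[str, Tuple[float, float]]) -> str:
    """Pick the candidate T that maximises log P(S|T) + log P(T).

    `candidates` maps a target sentence to a pair
    (translation-model probability P(S|T), language-model probability P(T)).
    """
    def score(item: Tuple[str, Tuple[float, float]]) -> float:
        _, (tm_prob, lm_prob) = item
        return math.log(tm_prob) + math.log(lm_prob)

    return max(candidates.items(), key=score)[0]


# Invented probabilities: the literal word order scores slightly better under the
# translation model but much worse under the target language model.
candidates = {
    "Adam ate the apple": (0.20, 0.010),
    "ate Adam the apple": (0.25, 0.0001),
}
print(best_translation(candidates))
```

In a real system both probabilities are estimated from corpora and the search runs over a very large space of candidates rather than a fixed list.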
A classic example of this approach is IBM's work on French-English translation, using
the Canadian Hansards. A good survey of Statistical Machine Translation (SMT) was
produced by Lopez (2008). Josef and Ney (2000) discussed five IBM alignment models and
presented different single-word based alignment models for statistical machine
translation. Marcu (2001) presented an algorithm that translates natural language
sentences by exploiting both a statistical translation model and a Translation Memory
(TM). The results show that an automatically derived translation memory can often be used
within a statistical framework to find translations of higher probability than those
found using a statistical model alone. Zavrel et al. (1997) applied a statistical method,
Memory-Based Learning, to the resolution of prepositional phrase attachment. In recent
years, statistical machine translation has contributed to a significant resurgence of
interest in machine translation. It is now the most widely studied machine translation
method but, in our view, it has not yet met people's requirements in terms of quality.

2.3 Example-based (EBMT)

Like the statistical approach, the example-based approach does not require linguistic
knowledge, but it does need a large bilingual corpus. Example-based MT can produce
suitable translations when a given sentence is similar to sentences in the training data.
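
The retrieval step that this dependence on similarity implies can be sketched as follows. The example base, the use of `difflib.SequenceMatcher` over word tokens as the similarity measure, and the function name are all assumptions made for illustration; the sketch only retrieves the closest example and does not perform the adaptation step of a full EBMT system.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

# Hypothetical example base of (source sentence, stored translation) pairs.
EXAMPLES: List[Tuple[str, str]] = [
    ("له كتاب", "He has a book"),
    ("لها حقيبة", "She has a bag"),
]


def retrieve(sentence: str) -> Tuple[str, str, float]:
    """Return the stored example whose source side is most similar to the input."""
    query = sentence.split()

    def similarity(example: Tuple[str, str]) -> float:
        # Ratio of matching word tokens between the query and the example source.
        return SequenceMatcher(None, query, example[0].split()).ratio()

    best = max(EXAMPLES, key=similarity)
    return best[0], best[1], similarity(best)


source, translation, score = retrieve("له قلم")  # "he has a pen"
print(translation, round(score, 2))
```

The retrieved translation still contains "book" where the input has "pen"; replacing the mismatched fragment is the adaptation step, which is where example-based systems differ most from one another.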
Over the last decade, Example-Based Machine Translation (EBMT) has shown great pro-
gress (Furuse and Iida 1992; Nirenburg et al. 1994; Brown and Ralf 1996; Nagao 1997).
Furuse and Iida (1992) presented a method called Transfer-Driven Machine Translation
(TDMT), which utilized an example-based framework for various processes and combined
multi-level knowledge. Meanwhile, Richardson et al. (2001) described Microsoft Research
Machine Translation (MSR-MT) as a large-scale example-based Machine Translation system
(and some statistical) for several language pairs. The system was tested on English-Spanish.
The evaluation results showed that MSR-MT’s integration of rule-based parsers, statisti-
cal techniques, and example-based processing, produced translations of high quality, which
exceeded those of un-customized commercial MT systems.

2.4 Knowledge-based (KBMT)

Knowledge-Based Machine Translation (KBMT) systems are based on the premise that "high
quality translation requires in-depth understanding of the text" (Arnold et al. 1994).
This approach requires reference to real-world knowledge, as well as knowledge of the
"differences in cultural backgrounds and differences in conceptual divisions" (Hutchins
and Harold 1992) between diverse languages.


Fig. 6 Knowledge-based MT approach (components: source language text, batch text system,
semantic knowledge base, representation of meaning, target language text)

Knowledge-Based Machine Translation (KBMT) (as shown in Fig. 6) was implemented in a
pilot system called SAM (Script Applying Mechanism), a multifaceted project to explore
the role of stereotypical domain knowledge in automated text understanding. The syntactic
analysis required to build a language-free meaning representation has both advantages and
disadvantages over earlier approaches. KBMT creates the possibility of true multilingual
translation by abandoning transfer grammars in favour of more principled parsing and
generation techniques. The KBMT approach requires a parser to map the source language
into semantic symbols and a generator to map those symbols into the target language. As a
result, "KBMT systems rely on an augmenter" (Trujillo 1999). Furthermore, a source text
can be translated into many languages, as it only needs to be parsed once, and the
resulting meaning representation can then be generated in each target language.
Generation is a simpler, less computationally demanding process. Thus, KBMT makes the
process of multilingual translation far more computationally tractable. It also greatly
reduces the amount of development work required to reach eventual closure in the number
of grammars needed to translate between all commonly spoken human languages.
KBMT systems usually require huge amounts of knowledge. Mitamura et al. (1991) improved a
Knowledge-based, Accurate Natural-language Translation (KANT) system that reduces this
requirement, in order to produce practical, scalable, and accurate KBMT applications.
Richardson et al. (2001) described Microsoft Research Machine Translation (MSR-MT) as an
EBMT system that generates high-quality output in a specific domain, exceeding that of
commercial machine translation systems. The system was applied to both Spanish-to-English
and English-to-Spanish translation. Tahir et al. (2010) proposed and designed a new
knowledge-based machine translation system that uses data mining and text mining
techniques to overcome problems such as syntactic and structural ambiguity, lexical
ambiguity, polysemy, discourse, anaphoric ambiguity, and different shades of meaning. The
system fulfilled all of the requirements of computational linguistic natural language
processing. It was designed for Urdu, but it could be used for many other languages.

2.5 Hybrid method

Hybrid methods are used to incorporate higher level abstract syntax rules to arrive at
the final translation. Hybrid methods have been explored in the research community
without any real success, due to the difficulty of merging fundamentally different
approaches. New algorithms were developed to capture several kinds of knowledge: how
words, phrases, and patterns should be translated; how syntax-based translation rules
should be applied; and how syntactically based target structures should be built.


Groves and Way (2005) incorporated marker chunks with statistical machine translation
sub-sentential alignments. They found that the resulting hybrid system was capable of
outperforming both baseline translation models. Groves and Way (2006) continued this
research by developing new hybrid Statistical (SMT) and Example-Based (EBMT) systems,
which outperformed both the SMT and EBMT baseline systems. Langlais and Simard (2002)
attempted to merge Example-Based and Statistical Machine Translation in a hybrid system,
but to no avail.
Paul et al. (2005a) presented a multi-engine hybrid approach to MT. It uses the two main
strategies of corpus-based translation: firstly, EBMT, which retrieves the translation
examples that best match an input and then adjusts them to obtain the translation; and
secondly, Statistical Machine Translation, which learns from corpora and dictionaries and
then searches for the best translation. The system was applied to Japanese-to-English and
Chinese-to-English translation. Paul et al. (2005b) proposed an approach that integrates
example-based and rule-based machine translation systems with statistical methods. The
source language input is paired with an initial translation hypothesis; the outputs
generated by multiple translation engines, such as rule-based and example-based systems,
are used as the initial translation hypotheses. This approach was applied to the
Japanese-to-English translation of conversation in the travel domain.
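
One simple way to combine several engines is to score each engine's output with a statistical model and keep the best one. The sketch below illustrates only that selection idea, under stated assumptions: it is not the procedure of Paul et al., the engine outputs are invented, and the scoring model is a toy bigram language model with add-one smoothing trained on two sentences.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Model = Tuple[Counter, Counter, int]


def train_bigrams(corpus: List[str]) -> Model:
    """Count bigrams and their left contexts in a small target-language corpus."""
    pair_counts: Counter = Counter()
    context_counts: Counter = Counter()
    vocab = set()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split()
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1
    return pair_counts, context_counts, len(vocab)


def log_score(sentence: str, model: Model) -> float:
    """Add-one smoothed bigram log-probability of a sentence."""
    pair_counts, context_counts, vocab_size = model
    tokens = ["<s>"] + sentence.split()
    return sum(
        math.log((pair_counts[(prev, cur)] + 1) / (context_counts[prev] + vocab_size))
        for prev, cur in zip(tokens, tokens[1:])
    )


def select_hypothesis(hypotheses: Dict[str, str], model: Model) -> Tuple[str, str]:
    """Return (engine name, output) for the highest-scoring hypothesis."""
    engine = max(hypotheses, key=lambda name: log_score(hypotheses[name], model))
    return engine, hypotheses[engine]


model = train_bigrams(["Adam ate the apple", "Adam read the book"])
# Invented outputs of two hypothetical engines for the same source sentence.
hypotheses = {
    "rule_based": "ate Adam the apple",
    "example_based": "Adam ate the apple",
}
print(select_hypothesis(hypotheses, model))
```

Selecting among whole outputs is the simplest form of combination; the systems described above go further and use the engine outputs as starting hypotheses for statistical processing.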

3 Related work

3.1 General Arabic machine translation

Arabic is the language of the Qur'an (the Muslim holy book), and millions of people need
to understand it. Therefore, efficient techniques that work with special rules should be
available to build a useful computer system that produces high quality translations. The
issue of enhancing the quality of machine translation has been gaining interest amongst
researchers in recent years.
Salem et al. (2008) reported that computerised translation, rather than manual
translation, can save a lot of effort and cost. Arabic is notorious for its complex
morphology (McCarthy 1979; Azmi 1988; Beesley 1998; Ratcliffe 1998; Ibrahim 2002). It has
always been a challenge for computational morphology and a difficult testing ground for
morphological analysis technologies. When translating from a morphologically rich
language, the translation process is broken into multiple steps, which are referred to as
tokenization. Habash and Sadat (2006) and Lee (2004) stated that tokenization is helpful
when translating Arabic. Arabic is segmented by simple punctuation tokenization, but this
level of tokenization is not enough for syntactic analysis (Hatem and Omar 2010).
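
Why punctuation-level tokenization falls short can be seen in a small sketch. It assumes a simple regular-expression tokenizer that splits on whitespace and separates punctuation marks; the function name and the example sentence are illustrative only.

```python
import re
from typing import List


def punctuation_tokenize(text: str) -> List[str]:
    """Split on whitespace and detach punctuation marks from words."""
    # \w+ matches a run of letters or digits (including Arabic letters);
    # [^\w\s] matches a single punctuation character.
    return re.findall(r"\w+|[^\w\s]", text)


# "and their book [is] new."
tokens = punctuation_tokenize("وكتابهم جديد.")
print(tokens)
```

The first token still bundles a conjunction, a noun, and a possessive pronoun into a single unit, so a syntactic analyser would need a further clitic segmentation step to see them as separate words.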
Attia (2007) implemented a rule-based tokenizer that handles tokenization as a
pre-processing stage in MT. The advantage of this implementation is that it is more
manageable and deterministic to debug. However, its lack of robustness makes it
inapplicable, as no single morphological transducer can claim to cover all the words of a
language. Different models of tokenization are applied at different levels of linguistic
depth, while the tokenizer interacts with other components. According to Beesley and
Karttunen (2003), there are three strategies for developing Arabic morphologies,
depending on the level of analysis:

1. One-level rules: analysing Arabic at the stem level and using regular concatenation.
2. Two-level rules: analysing Arabic words as being composed of roots and patterns, in
addition to concatenations.
3. Three-level rules: analysing Arabic words as being composed of roots, templates, and
vocalization, in addition to concatenations.
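
The difference between the three strategies can be illustrated with a minimal Python sketch. It assumes a romanised transliteration (the root ktb, "write"); the function names and the template notation (C for a root-consonant slot, V for a vowel slot) are illustrative assumptions and are not taken from Beesley and Karttunen (2003).

```python
def interdigitate(root: str, template: str) -> str:
    """Fill the C slots of a template with the consonants of a root, in order."""
    consonants = iter(root)
    return "".join(next(consonants) if slot == "C" else slot for slot in template)


def one_level(stem: str, prefix: str = "", suffix: str = "") -> str:
    """One-level view: affixes are concatenated around an unanalysed stem."""
    return prefix + stem + suffix


def two_level(root: str, pattern: str, prefix: str = "", suffix: str = "") -> str:
    """Two-level view: the stem is a root merged with a vocalised pattern."""
    return one_level(interdigitate(root, pattern), prefix, suffix)


def three_level(root: str, template: str, vocalization: str,
                prefix: str = "", suffix: str = "") -> str:
    """Three-level view: the pattern is split into a CV template and its vowels."""
    vowels = iter(vocalization)
    pattern = "".join(next(vowels) if slot == "V" else slot for slot in template)
    return two_level(root, pattern, prefix, suffix)


# The same surface word, analysed at three depths (hypothetical transliteration).
print(one_level("katab", suffix="a"))                    # kataba
print(two_level("ktb", "CaCaC", suffix="a"))             # kataba
print(three_level("ktb", "CVCVC", "aa", suffix="a"))     # kataba
```

In this sketch each deeper level reuses the level above it, so the only thing that changes is how much of the stem is treated as structured rather than stored as a whole.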

Attia (2005) developed a morphological analyser based on the one-level rules approach, which
considers stems as the base forms of Arabic words and handles spelling differences through
alteration rules. Alansary et al. (2009) highlighted the four axes of analysis,
which are morphological analysis, lexical analysis, syntactic analysis, and semantic analy-
sis. They presented a test roadmap for Arabic corpus analysis, by following a stem-based
approach to be used in analysing the international corpus of Arabic. They also discussed
general considerations to bear in mind when starting the process of analysing the interna-
tional corpus of Arabic. Köpr and Miller (2009) presented a powerful Arabic morphological
analyser and generator. Their system was used as a component in both rule-based and sta-
tistical machine translation systems. The authors gave implementation details on nominal
morphology, verbal morphology analysis, lexicon, derivational morphology, and morpho-
logical generation. The overall accuracy rate for their system was 91.4 %. Some of the errors
originated from words that did not exist in the lexicon, and there was therefore no analy-
sis for them. For the remaining errors, an incorrect analysis was produced for the source
language.
Žabokrtský and Smrž (2003) developed a dependency grammar for Arabic, with a focus
on the automatic transformation of the phrase-structure syntactic trees of Arabic into
dependency-driven analytical ones. Meanwhile, Ditters (2001) wrote a grammar for Arabic using
the AGFL formalism (Affix Grammars over a Finite Lattice).
Usually, any sentence that has two or more structural representations is said to be syn-
tactically ambiguous. Sometimes, Arabic sentences with only one structural representation
may be ambiguous. Daimi (2001) described a technique for identifying syntactic ambiguity in
single-parse Arabic sentences using the Definite Clause Grammar formalism. His work analysed
each sentence and validated the conditions that govern the survival of certain types of
syntactic ambiguities in Arabic sentences.
Attia (2008) described the main syntactic structures in Arabic within the LFG framework.
He built the first Arabic parser using the Xerox Linguistics Environment, which allows grammar
rules and notations to be written following the LFG formalism. This parser was
only tested on short sentences in the news domain. Spence and Christopher (2010) cre-
ated higher parsing baselines and showed that Arabic parsing performance is not as poor
as previously thought. They described the grammar state splits that significantly improve
parsing performance, catalogued parsing errors, and quantified the effect of segmentation
errors.
Othman et al. (2003) reported an attempt to create an efficient chart parser for analysing
Modern Standard Arabic (MSA) sentences. Their parser was able to satisfy syntactic constraints,
thus reducing parsing ambiguity, and lexical semantic features were used to disambiguate
sentence structure. The authors explained that their Arabic morphological analyser depends on
an Augmented Transition Network (ATN) technique. They used Prolog to implement both the Arabic
parser and the Arabic morphological analyser. The linguistic rules were obtained only from
sentences in the agriculture domain.
Habash (2010) discussed Modern Standard Arabic. The author focused on Arabic script,
phonology, orthography, morphology, syntax, and semantics, as well as on the issues that
Arabic raises for machine translation, such as morphology and script. Larkey et al. (2002)
described a large-scale system for the morphological analysis and generation of on-line Arabic
words represented in the standard orthography, whether wholly vowelled, partially vowelled, or
un-vowelled. The analysis shows the root, pattern, and all other affixes, together
with characteristic tags indicating part-of-speech, person, number, voice, mood, aspect, etc.
The system depended on the lexicons and rules from a two-level morphological system,
reworked extensively using Xerox Finite-State Morphology tools.
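
Accepting wholly vowelled, partially vowelled, and un-vowelled input means that an analyser must treat diacritics as optional constraints. The following is a minimal Python sketch of that idea only; it is not the finite-state method of the system described above. The function names and the three-entry lexicon are hypothetical, and the diacritic range U+064B to U+0652 (fathatan to sukun) is the only assumption made about the script.

```python
import re

# Arabic vowel marks and related diacritics: U+064B (fathatan) .. U+0652 (sukun).
DIACRITIC = re.compile("[\u064B-\u0652]")


def split_marks(word):
    """Split a word into (letter, attached diacritics) pairs."""
    pairs = []
    for ch in word:
        if DIACRITIC.match(ch) and pairs:
            letter, marks = pairs[-1]
            pairs[-1] = (letter, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs


def compatible(query, entry):
    """True if the letters match and every diacritic given in the query agrees."""
    q, e = split_marks(query), split_marks(entry)
    return len(q) == len(e) and all(
        ql == el and (qm == "" or qm == em)
        for (ql, qm), (el, em) in zip(q, e)
    )


def candidates(query, lexicon):
    """Return the fully vowelled lexicon entries that the query could stand for."""
    return [entry for entry in lexicon if compatible(query, entry)]


# Hypothetical lexicon: three vowellings of the letter sequence k-t-b.
KATABA = "\u0643\u064E\u062A\u064E\u0628\u064E"   # kataba
KUTUB = "\u0643\u064F\u062A\u064F\u0628"          # kutub
KUTIBA = "\u0643\u064F\u062A\u0650\u0628\u064E"   # kutiba
LEXICON = [KATABA, KUTUB, KUTIBA]
```

With this lexicon, the un-vowelled query matches all three entries, a query vowelled only on its first letter narrows the set, and a wholly vowelled query matches exactly one entry.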
Beesley (1996) described a finite-state morphological analyser of written modern standard
Arabic words (available on the internet at http://www.xrce.xerox.com/research/mltt/arabic).
The system is composed of the analyser, which runs on a network server, and Java applets
that run on the user’s machine. It uses standard Arabic orthography for both input and output.
Darwish (2002) presented a way to rapidly develop a shallow Arabic morphological analyser,
based on automatically derived rules and statistics. The system was restricted to a fixed set
of roots, and it could not handle some rare Arabic words that constitute complete sentences
and do not appear in the training set. The analyser also lacked the ability to decipher affix
combinations.
Abu Shugier and Sembok (2007) asserted that “Arabic differs tremendously in terms of its
characters, morphology, and diacritic, from other languages; and to claim otherwise would be
a mistake.” Attia (2008) mentioned that the traditional classification of Arabic parts-of-speech
into nouns, verbs and particles, is not sufficient for a complete computational grammar. This
was confirmed by Farghaly and Senellart (2003), Chalabi (2001), and Alsalman (2004), who
described and evaluated Arabic machine translation and pointed out the rules that must be
followed in Arabic translation.

3.2 Arabic-to-English MT

Arabic has attracted attention in the Natural Language Processing (NLP) community because of
its political importance and the linguistic differences between it and other languages. These
linguistic features, particularly its complex morphology, present motivating challenges for
researchers working on Arabic. Significant work has been done in Arabic natural language
processing in applications such as machine translation (Farghaly and Senellart 2003; Shaalan
et al. 2004), entity extraction (Shaalan and Raza 2009), and sentiment analysis (Almas and
Ahmed 2007).
Farghaly and Shaalan (2009) described the general features and specific properties of Arabic,
highlighted the significance of the language, and presented solutions intended to guide
current and future practitioners in the field of Arabic natural language processing, including
solutions that have already been adopted by some pioneering researchers in the field.
Lee et al. (2003) presented a robust word segmentation algorithm, which segments a word into
a sequence of prefixes, a stem, and suffixes. Their method is seeded with a small manually
segmented Arabic corpus and uses it to bootstrap an unsupervised algorithm that builds a
segmenter for Arabic words from a large un-segmented Arabic corpus. The algorithm can identify
any number of prefixes and suffixes of a given token, and it can generally be applied to
different language families. The algorithm achieves about 97 % segmentation accuracy on a
development test corpus containing 28,449 word tokens.
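
Segmentation of a token into prefixes, a stem, and suffixes can be sketched in Python as follows. This is a toy illustration and not the algorithm of Lee et al. (2003): the affix lists, the romanised transliteration, the stem list, and the selection rule (prefer the longest stem found in a known-stem list) are all assumptions made for the example.

```python
# Hypothetical, romanised affix inventories (illustrative only).
PREFIXES = ("w", "f", "b", "l", "k", "Al")
SUFFIXES = ("h", "hA", "hm", "At", "wn", "yn")


def strip_prefixes(word):
    """Yield (prefixes, remainder) for every way of peeling prefixes off the front."""
    yield [], word
    for p in PREFIXES:
        if word.startswith(p) and len(word) > len(p):
            for rest_prefixes, rest in strip_prefixes(word[len(p):]):
                yield [p] + rest_prefixes, rest


def strip_suffixes(word):
    """Yield (remainder, suffixes) for every way of peeling suffixes off the end."""
    yield word, []
    for s in SUFFIXES:
        if word.endswith(s) and len(word) > len(s):
            for rest, rest_suffixes in strip_suffixes(word[:-len(s)]):
                yield rest, rest_suffixes + [s]


def segmentations(word):
    """Enumerate all prefix*-stem-suffix* analyses licensed by the affix lists."""
    for prefixes, rest in strip_prefixes(word):
        for stem, suffixes in strip_suffixes(rest):
            yield prefixes, stem, suffixes


def segment(word, stems):
    """Pick the analysis with the longest known stem; otherwise keep the word whole."""
    known = [seg for seg in segmentations(word) if seg[1] in stems]
    if not known:
        return [], word, []
    return max(known, key=lambda seg: len(seg[1]))


STEMS = {"ktAb", "drs"}  # hypothetical stem list
print(segment("wbktAbhm", STEMS))  # (['w', 'b'], 'ktAb', ['hm'])
```

A data-driven segmenter would replace the stem-list lookup with scores learned from a corpus; the enumeration of candidate analyses stays the same.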
Bisazza and Federico (2010) proposed a chunk-based reordering technique to automat-
ically identify and move clause-initial verbs in the Arabic side of a word-aligned parallel
corpus. The method is applied to pre-process the training data and to collect statistics about
verb movements. From this analysis, verb reordering patterns are identified and built on the
test sentences before decoding them. This technique handled the most important cases of verb
reordering in Arabic-English translation, focusing only on the problem of VSO sentences.
Carpuat et al. (2010) presented a method for improving overall SMT quality that uses a
syntactic parser to reorder VS constructions into SV for Arabic-to-English word alignment. The
authors did not totally solve the problem, because many verb reorderings were missed, even
though the resulting system surpassed a strong baseline in terms of BLEU scores and produced
more globally readable translations.
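
The basic operation behind both studies, moving a clause-initial verb behind its subject so that the Arabic side resembles English SVO order, can be sketched in Python. The chunk labels (VC for a verb chunk, NC for a noun chunk), the romanised tokens, and the rule that the first noun chunk after the verb is the subject are simplifying assumptions for illustration, not the procedures used by the cited authors.

```python
def reorder_vso(chunks):
    """Move a clause-initial verb chunk behind the noun chunk that follows it.

    `chunks` is a list of (label, tokens) pairs. The first noun chunk after the
    verb is assumed to be the subject, which is a simplification.
    """
    if len(chunks) >= 2 and chunks[0][0] == "VC" and chunks[1][0] == "NC":
        return [chunks[1], chunks[0]] + list(chunks[2:])
    return list(chunks)


def flatten(chunks):
    """Return the token sequence of a chunked clause."""
    return [token for _, tokens in chunks for token in tokens]


# Hypothetical chunked clause: "ktb Al wld Al drs" (verb, subject, object).
clause = [("VC", ["ktb"]), ("NC", ["Al", "wld"]), ("NC", ["Al", "drs"])]
print(flatten(reorder_vso(clause)))  # ['Al', 'wld', 'ktb', 'Al', 'drs']
```

Clauses that do not start with a verb chunk are returned unchanged, which mirrors the restriction to VSO sentences.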
Nguyen and Vogel (2008) presented a context-dependent morphology pre-processing technique for
Arabic-English translation. The authors used the alignment between Arabic morphemes and English
words to train a model that removes non-aligned Arabic morphemes. They also discussed the
relation between the size of the reordering window and morphology processing. The model was
only applied to a travel-domain system and a news-domain system. Abraham and Salim (2005)
presented a word alignment algorithm for Arabic-English translation based on supervised
alignment data, and compared the performance of their algorithm with human annotation
performance. Shirko et al. (2010) developed a machine translation system called Npae-Rbmt that
translates Arabic noun phrases into English using a transfer-based approach. The method was
tested on 88 thesis and journal titles from the computer science domain. The accuracy of this
system’s results was 94.6 %.
Salem et al. (2008) presented an Arabic to English machine translation system called
UniArab, which was based on the Role and Reference Grammar model, and detailed the system’s
design and how it accommodated the particulars of Arabic to generate English. Because the
system was implemented with a limited lexicon, it failed to translate many words whose
structures were not covered by the system. The system also did not deal with ambiguity.
Chafia and Ali (1995) presented an attempt to perform MT from Arabic to English and French.
They showed that, to achieve good results, Arabic must be analysed and reordered according to
the rules of Arabic. Yassine et al. (2010) investigated the possibility of building a
high-performance Arabic Named Entity Recognition (NER) system by using lexical, syntactic, and
morphological features, and by augmenting the model with deeper lexical features and more
syntagmatic features. These features were extracted from noisy data obtained via projection
from an Arabic-English parallel corpus. The results showed that the system achieved a
significantly high performance for almost all data-sets that were obtained from broadcast news
only. Larkey et al. (2002) used the spelling feature to measure the string kernel distance
between Romanised Arabic and English words.
In summary, machine translation from Arabic to English is a difficult task, and efficient
techniques that work with language-specific rules are needed to build a useful system. The
rule-based approach is the most popular method applied to machine translation from Arabic to
English. This is due to the morphological and syntactic complexity of Arabic, which requires
many processing steps, such as word segmentation (segmenting a word into a sequence of
prefixes, a stem, and suffixes) and word analysis (analysing and reordering Arabic according
to the rules of Arabic).

3.3 English-to-Arabic MT

There are many orthographical differences between Arabic and English, which have to be taken
into consideration by MT developers. For example, italics are used in English to indicate
emphasis on a word, whereas in Arabic emphasis is expressed by a change in word order or by
the introduction of an emphatic word.
Badr et al. (2009) applied syntactic phrase reordering in English-to-Arabic statistical
machine translation, and introduced reordering rules that were motivated linguistically. The
work also studied the effect of combining reordering with Arabic morphological segmentation,
a pre-processing technique used to improve Arabic-to-English and English-to-Arabic
translation. Although phrase-based statistical machine translation proved to be a robust and
effective approach to machine translation, it had a limited capacity to deal with
long-distance phenomena, because it relied on local alignments.
Elming and Habash (2009) studied syntactic reordering and the effect of the alignment
method on learning reordering rules within English to Arabic translation tasks. They achieved
significant improvements in translation quality. Toutanova et al. (2008) improved the quality
of Statistical Machine Translation (SMT) by applying models that predicted word forms from
their stems using extensive morphological and syntactic information from both the source
and the target languages. They applied the inflection generation models in translating English
into two morphologically complex languages, namely Russian and Arabic, and showed that
an independent model of morphology generation can be successfully integrated with an SMT
system, making improvements in both phrasal and syntax-based MT. Their model achieved
an accuracy of over 91 %, which suggests that the model was effective when its input was
clear in its stem choice and order.
Attia (2003) discussed the translation of English into Arabic using a transfer approach.
His study focused on the analysis of English as a source language, problems related to the
transfer of English into Arabic, and the generation of Arabic as a target language, which dealt
with agreement as one of the characteristics that greatly affects the output of MT. The study
was limited to electronic texts i.e., texts which are written in a machine readable format. Abu
Shugier (2009) presented a rule-based approach in English to Arabic MT, and emphasis was
given to the handling of word agreement and ordering. A major design goal of this system
was that it would be used as a tool and integrated with a general machine translation system.
The total score for this system was 96.1 %.
Modern Arabic has agreement asymmetries that are sensitive to word order effects. Aoun
et al. (1994) proposed an analysis of first conjunct agreement in verb sentences in Lebanese
Arabic and Moroccan Arabic. They argued that the agreement of number-sensitive items is caused
by clausal coordination. Guessoum and Zantout (2005) presented a methodology for evaluating
Arabic Machine Translation (MT) systems. Their evaluation methodology was applied
to four English-Arabic commercial MT systems and the results of the evaluation of these
systems were presented for the domains of the Internet and Arabization.
Hatem and Nassar (2008) introduced a modified version of Dijkstra’s shortest path algorithm to
identify target-language phrases. The method lists the indexes of the source sentence’s words
that are found in the target-language corpus and constructs a directed graph, in which the
phrases that form a shortest-path walk are identified. The method was used in a hybrid
English-to-Arabic MT system that combines rule-based and example-based machine translation
techniques. Shaalan et al. (2004) implemented a transfer-based MT system to translate fairly
complex English NPs into Arabic. The system was applied to 66 real thesis and journal titles
from the computer science domain, and the accuracy of the results was 94.6 %. This significant
improvement was attributed to the use of rules specific to noun phrases.
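
The use of a shortest-path search to pick phrases, as in the work of Hatem and Nassar (2008), can be illustrated with a minimal Python sketch. It uses the standard Dijkstra algorithm, not the authors' modified version, and the graph construction is an assumption made for the example: nodes are word positions, an edge from i to j exists when words i..j-1 form a phrase found in a (hypothetical) phrase inventory, and unknown single words are allowed at a higher cost.

```python
import heapq


def best_phrase_path(words, known_phrases, max_len=4):
    """Cover a sentence with as few known phrases as possible (Dijkstra search)."""
    n = len(words)
    dist = {0: 0}          # cheapest known cost of reaching each word position
    prev = {}              # back-pointers for recovering the path
    heap = [(0, 0)]
    while heap:
        d, i = heapq.heappop(heap)
        if i == n:
            break
        if d > dist.get(i, float("inf")):
            continue       # stale heap entry
        for j in range(i + 1, min(n, i + max_len) + 1):
            phrase = tuple(words[i:j])
            if phrase in known_phrases:
                cost = 1   # one edge per phrase found in the inventory
            elif j == i + 1:
                cost = 2   # unknown single word: allowed, but penalised
            else:
                continue
            if d + cost < dist.get(j, float("inf")):
                dist[j] = d + cost
                prev[j] = i
                heapq.heappush(heap, (d + cost, j))
    path, j = [], n
    while j > 0:           # walk the back-pointers from the end of the sentence
        i = prev[j]
        path.append(" ".join(words[i:j]))
        j = i
    return path[::-1]


# Hypothetical phrase inventory extracted from a corpus.
PHRASES = {("the", "new"), ("machine", "translation"),
           ("machine", "translation", "system"), ("system",)}
print(best_phrase_path("the new machine translation system".split(), PHRASES))
# ['the new', 'machine translation system']
```

Because every edge has a positive cost, the first time the end of the sentence is popped from the queue its path is optimal, which is the property that makes Dijkstra's algorithm applicable here.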
Finally, machine translation from English to Arabic uses methods that differ slightly from
those discussed for Arabic to English. Most experiments dealt with agreement, one of the
characteristics that most affects the output of machine translation from English to Arabic,
and with improvements to reordering.

3.4 From Arabic to other languages and vice versa

Several papers discussed the major approaches to machine translation from and to Arabic.
Approaches to machine translation from French to Arabic, for example, are rare. According to
Alsharaf et al. (2004), French cannot be translated into Arabic using existing approaches
(i.e., direct, transfer, pivot, and statistical). This may be because the two languages are
linguistically distant, which requires the analysis of certain linguistic phenomena that are
specific to the pair. Such phenomena do not necessarily occur in other language pairs.
Alsharaf et al. (2004) presented an approach to the machine translation of French to Ara-
bic. They incorporated certain aspects of existing approaches (i.e., direct, transfer, pivot, and
statistical) in some of their system’s steps. They also introduced new functions to treat
language pairs that are linguistically distant. French morphology is very different from
Arabic morphology, because French has no method to construct nouns, adjectives, adverbs, and
actors according to the verb rhyme, as Arabic does.
Hasan et al. (2006) first presented a statistically driven machine translation system for
Arabic to French and applied the system to the medical domain. They also described the
necessary steps needed to create a system for corpus acquisition, pre-processing (such as
Arabic tokenizer), training the models, and generating translations. Debili (1992) handled
the problem related to the automatic alignment of sentences belonging to bilingual text pairs.
Their experiments were applied to French-English and French-Arabic text pairs. Meanwhile,
Mostefa et al. (2009) presented a semi-automatic annotation with Named Entities of a
multilingual corpus for Arabic, English, and French. The text corpus was made of comparable
newswires from Agence France-Presse, covering the period 2004 to 2006. The method,
which they used for producing the corpus, was iterative and the annotations were checked
manually and corrected if necessary. Statistics of the corpus were presented and compared
with the annotation results for the three languages.
Besançon et al. (2009) presented the InFile evaluation paradigm (INformation, FILtering, and
Evaluation) in general and focused on a study of the Arabic part of the corpus in particular.
They reported a coverage mismatch between the profiles and the Arabic documents, and discussed
the problems that may arise when trying to transfer information from English and French to
Arabic. Guidere (2002) applied a corpus-based form of machine translation that depended on a
bilingual corpus of French and Arabic texts and on the alignment of translation parts. The
author used alignment to combine linguistic and statistical information. He also proposed
procedures to construct a machine translation system based on parallel translated corpora.
Moghrabi (1998) described a machine translation system between French and Arabic in the
sub-world of cooking recipes. He described the design of the generation component and how this
design allows a variety of outputs, all expressing the same conceptual meaning. This system
belonged to the knowledge-based family of Interlingua translation systems. It focused on the
importance of the meaning of the text being processed and articulated all of its available
knowledge-bases in order to achieve a flexible, meaningful wording.
Another language that attracted the attention of researchers was Chinese. In common use,
Chinese uses a complex orthography that contains about 10,000 characters, which express
semantics rather than phonological information. Chinese is written either from left-to-right
or top-down and words are written without spaces. The major challenge for Chinese pro-
cessing is word segmentation. There is no morphology in Chinese, but it does have limited
nominal and verbal aspects. The first work in Arabic-Chinese MT was by Habash and Jun
Hu (2009). They presented a comparison of two approaches for Arabic-Chinese machine
translation using English as a pivot language. Their system handled many complex
Arabic-Chinese syntactic variations. The results showed that using English as a pivot language
was better than direct translation from Arabic to Chinese.
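
Pivot translation amounts to composing two translation systems. The Python sketch below shows only this composition; the word-for-word lookup tables are toy stand-ins for real MT systems, the names are hypothetical, and reordering (which a real Arabic-Chinese system must handle) is deliberately ignored.

```python
def make_translator(table):
    """Wrap a word-for-word lookup table as a translation function (toy MT system)."""
    def translate(tokens):
        # Unknown words are passed through unchanged.
        return [table.get(token, token) for token in tokens]
    return translate


def pivot(source_to_pivot, pivot_to_target):
    """Build a source-to-target translator that goes through a pivot language."""
    def translate(tokens):
        return pivot_to_target(source_to_pivot(tokens))
    return translate


# Toy lexicons (illustrative only): Arabic -> English and English -> Chinese.
AR_EN = {"كتاب": "book", "جديد": "new"}
EN_ZH = {"book": "书", "new": "新"}

arabic_to_chinese = pivot(make_translator(AR_EN), make_translator(EN_ZH))
print(arabic_to_chinese(["كتاب", "جديد"]))  # ['书', '新']
```

The practical appeal of the pivot design is that each half can be trained on the larger parallel corpora that exist for language pairs involving English.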
Japanese is the closest in form to Chinese, but less common in machine translation. The
word order of Japanese is Subject Object Verb (SOV). Japanese nouns have no grammatical
number, gender, or features, and some words are usually translated as pronouns. Bouillon
et al. (2008) described an interlingua-based medical speech translation system that translates
between Japanese and Arabic in both directions. They also described a simple generic tool for
debugging
Interlingua translation rules, and a method for improving speech understanding performance
by re-scoring N-best speech hypothesis lists. They used statistical tuning methods to increase
efficiency.
Spanish is also attracting scientists’ attention. Spanish is a language with a two-gender
system. As for syntax, the sentence word order is Subject Verb Object, though variations are
possible. Doaa and Ana (2008) focused on discourse markers as key elements that guide the
inferences drawn from statements and that help in natural language processing. They used a
rule-based approach to build a resource that addresses aspects of discourse structure and
coherence through the automatic identification, classification, and annotation of the
discourse markers in a multilingual parallel corpus (i.e., Arabic-Spanish-English). The
research provided an important resource for the community, as it presented a multilingual
computational processing of different kinds of discourse markers. Furthermore, the research
addressed Arabic from a computational pragmatic perspective, where the classification,
identification, and annotation processes were implemented using the information provided by
the tagging of Spanish discourse markers and alignments.
Doaa et al. (2006) developed a multilingual parallel corpus (Arabic-Spanish-English) aligned
on the sentence level and tagged on the POS level, which is a valuable resource for this kind
of translation. The corpus covers the pairs Arabic to Spanish, English to Arabic, and English
to Spanish. Evaluated against a gold standard, the results of this method were over 90 %,
although the percentages differed from one language pair to another.
Another language gaining interest among researchers is Hindi. The syntax for Hindi is
Subject-Object-Verb, and nouns are either masculine or feminine. The verb must agree with
its subject in both number and gender and the adjectives must agree with the nouns. Some
words in Hindi can be translated into different forms, but the meaning is approximately the
same and their translation depends upon the grammatical context.
Mark et al. (2004) presented a comparative analysis of relative clauses in machine transla-
tion of Hindi and Arabic, in the tradition of the Paninian Grammar Framework, which leads
to deriving a common logical form for equivalent sentences. Parallels are drawn between
the Hindi co-relative construction and resumptive pronouns in Arabic. More details about
relative clauses in Hindi and Arabic can be found in Mark et al. (2004).
In conclusion, the approaches to machine translation from Arabic to other languages (and vice
versa) differ considerably, owing to the different features of each language, such as
morpho-syntactic and agreement features, and because each language has its own challenges and
ambiguities. Figure 7 summarises the methods from the literature review that were employed to
develop machine translation from/to Arabic.

4 Discussion

Concluding this review, it is clear that machine translation can be, and in fact has been,
addressed using a variety of different approaches. In this review, we have focused our
discussion on the application of these approaches to Arabic machine translation problems, and
have taken special note of which approaches might be suitable for dealing with the features of
Arabic. The systems have many elements in common, although there is also a growing
diversity. The transfer approach has made swift progress and there is great optimism for its
future success. It is more suited to certain types of Arabic challenges, because the level of
abstraction that it requires is easy to reach and its level of analysis is attainable and easy
to implement.

Fig. 7 Summary of machine translation approaches [figure not reproduced: a tree that divides
machine translation into Arabic machine translation (Arabic to English, English to Arabic, and
Arabic to other languages and vice versa, each grouped into rule-based, statistical, and other
methods) and general machine translation (rule-based, knowledge-based, hybrid, example-based,
and statistical approaches, together with the direct, interlingua, and transfer strategies),
with the surveyed studies listed under each branch]

Developing a transfer-based MT system requires less time and effort than developing an
Interlingua system, which is why most commercial systems apply the transfer approach. One
noticeable finding of our review is that most of the approaches proposed for machine
translation systems may have been tested only on limited domains. This situation is
understandable from a practical perspective. From a research point of view, however, it often
makes it very difficult to assess how well a system performs in comparison with other systems.
There seems to be confusion in the field about how different machine translation systems
should be formally tested and evaluated against one another. As we have seen in this survey,
different approaches address different challenges and seek to achieve different sets of
objectives, which makes comparisons difficult in many cases. Comparisons are made harder still
because studies often report different performance measures and perform different numbers of
trials when analysing and testing their systems.
Most of the community evaluations focused on the translation of news and government
texts. There is very little work on other domain translations, particularly those that describe
much of the information found on the internet, where translation is in demand. Usually, the
authors tested their machine translation systems using BLEU, an automatic evaluation metric
used in the research and development cycle of machine translation technology. BLEU measures
the n-gram similarity of a candidate translation to a set of reference translations. Its
applicability is wider than MT alone: it could also be extended to evaluate natural language
generation and summarization systems.
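
The n-gram similarity at the core of BLEU is a clipped (modified) n-gram precision. The Python sketch below computes only this component for a single n-gram order; the full metric additionally combines several n-gram orders with a geometric mean and applies a brevity penalty. The function names are illustrative.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against a set of references."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    # For each n-gram, the largest count observed in any single reference.
    max_ref = Counter()
    for reference in references:
        for gram, count in ngrams(reference, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    # A candidate n-gram is credited at most as often as it occurs in a reference.
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


candidate = "the the the the the the the".split()
references = ["the cat is on the mat".split(),
              "there is a cat on the mat".split()]
print(modified_precision(candidate, references, 1))  # 2/7, about 0.2857
```

Clipping is what prevents a degenerate candidate that repeats a frequent word from scoring well: here only two of the seven occurrences of "the" are credited.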

5 Conclusion and future work

Arabic has a flexible word order, which makes it a significant challenge for MT, because a
sentence can be expressed in Arabic in various subject-verb-object combinations with the same
meaning. In Arabic, three elements make up a sentence, namely subject, verb, and object. Using
all of these elements, Arabic sentences can be classified into four types according to their
word order, i.e., SVO, VSO, VOS, and SOV. Thus, it is a difficult task to find a machine
translation system that meets human requirements. It is not yet clear whether
machine translation can satisfy peoples’ requirements in terms of translation quality and
retrieval time. We assume that many kinds of phenomena exist, some of which are suitable
for MT.
This paper has given an account of, and reasons for, the widespread use of machine
translation. The features of Arabic and the major MT methods were also discussed. Some of these
methods are commonly used. Most machine translation systems focus on the translation of news
and official texts, whilst not many focus on open-domain translation, which mainly concerns
informal genres that make up much of the information on the internet (for which translation is
in great demand). Statistical machine translation has grown quickly and transfer machine
translation will surely follow. Nonetheless, these MT systems still do not meet human
requirements. This paper has investigated current MT techniques employed to translate from and
to Arabic. In the future, we plan to develop a new Arabic-to-English machine translation
system that takes the challenges of reordering into consideration. We hope to extend our
system to cover Arabic and other languages with even more features.

References

Abbès R, Dichy J, Hassoun M (2004) The architecture of a standard Arabic lexical database: some figures,
ratios and categories from the DIINAR.1 source program. In: The workshop on computational approaches
to Arabic script-based languages, COLING 2004. Geneva, Switzerland, pp 15–22
Abraham I, Salim R (2005) A maximum entropy word aligner for Arabic-English machine translation. In:
Proceedings of human language technology conference and conference on empirical methods in natural
language processing (HLT/EMNLP). pp 89–96 (Vancouver)
Abu Shugier M, Sembok T (2007) Handling agreement in machine translation from English to Arabic. In:
1st International conference on digital communications and computer applications (DCCA2007). JUST.
pp 385–379
Abu Shugier M (2009) Word agreement and ordering in English-Arabic machine translation: a rule-based
approach. PhD thesis, FTSM, University Kebangsaan Malaysia, p 175
Afify M, Sarikaya R, HKJ Kuo LB, Gao Y (2006) On the use of morphological analysis for dialectal Arabic
speech recognition. In: 9th International conference on spoken language processing (Interspeech-ICSLP),
Pittsburgh. pp 277–280
Alansary S, Nagi M, Adly N (2009) Towards analysing the international corpus of Arabic (ICA). In: Interna-
tional conference on language engineering. Progress of Morphological Stage, Egypt, pp 241–245
Albared M, Nazlia O, Mohd J, Ab Aziz (2009) Classifiers combination to Arabic morphoSyntactic disam-
biguation. In: International conference on electrical engineering and informatics, Malaysia. 978-1-4244-
4913-2/09 (IEEE)
Almas Y, Ahmed K (2007) A note on extracting “sentiments” in financial news in English, Arabic, and Urdu.
In: Proceedings of the 2nd workshop on computational approaches to Arabic script-based languages
(CAASL’07). pp 1–12
Alsalman S (2004) The effectiveness of machine translation. Int J Arab Engl Stud 5:145–160
Alsharaf H, Sylviane C, Peter G (2004) French to Arabic machine translation: the specificity of language
couples. In: 9th EAMT workshop, “Broadening horizons of machine translation and its applications”,
26–27 April 2004, Malta, pp 11–17
Al-Sughaiyer I, Al-Kharashi IA (2004) Arabic morphological analysis techniques: a comprehensive survey.
JASIST 55(3):189–213
Aoun J, Elabbas B, Dominique S (1994) Agreement, word order, and conjunction in some varieties of Arabic.
Linguist Inq 25:195–220
Arnold D, Balkan L, Lee H, Meijer S, Sadler L (1994) Machine translation: an introductory guide. Blackwell,
Manchester
Attia M (2006) An ambiguity-controlled morphological analyser for modern standard Arabic modelling finite
state networks. In: Challenge of Arabic for NLP/MT conference. The British Computer Society, London,
pp 48–67
Attia M (2007) Arabic tokenization system. In: ACL-Workshop on computational approaches to semitic lan-
guages, Prague
Attia M (2005) Developing a robust Arabic morphological transducer using finite state technology. In: The
8th annual CLUK research colloquium. Manchester
Attia M (2008) Handling Arabic morphological and syntactic ambiguity within the LFG framework with a
view to machine translation. Ph.D. Thesis. The University of Manchester, Manchester, p 61
Attia M (2003) Implications of the agreement features in machine translation. M.A. Thesis. University of
Manchester
Azmi M (1988) Arabic morphology: a study in the system of conjugation. Hasan Publishers, Hyderabad
Badr I, Zbib R, Glass J (2009) Syntactic phrase reordering for English-to-Arabic statistical machine transla-
tion. In: The 12th conference of the European chapter of the association for computational linguistics.
Athens, pp 86–93
Beesley K (1996) Arabic finite-state morphological analysis and generation. In: Proceedings of the 16th
conference on association for computational linguistics. pp 89–94
Beesley KR (1998) Arabic morphology using only finite-state operations. In: Computational approaches to
semitic languages: proceedings of the workshop. Montreal, pp 50–57
Beesley KR, Karttunen L (2003) Finite state morphology. CSLI Publications, Palo Alto, CA
Besançon R, Mostefa D, Timimi I, Chaudiron S, Laïb M (2009) Arabic, English and French: three languages
in a filtering systems evaluation project. In: MEDAR 2009: 2nd international conference on Arabic
language resources & Tools, 22–23 April 2009, Cairo, pp 163–167
Bisazza A, Federico M (2010) Chunk-based verb reordering in VSO sentences for Arabic-English statistical
machine translation. In: ACL 2010: joint fifth workshop on statistical machine translation and Metric-
sMATR. Proceedings of the workshop, 15–16 July 2010, Uppsala University, Uppsala, pp 235–243
Bonnie J, Dorr E, Hovy H, Lori S (2004) Machine translation: interlingual methods. In: Brown K (ed) Ency-
clopaedia of language and linguistics, 2nd edn, ms. 939
Bouillon P, Sonia H, Yukie N, Kyoko K, Hitoshi I, Nikos T, Marianne S, Beth AH, Manny R (2008) Devel-
oping non-European translation pairs in a medium-vocabulary medical speech translation system. In:
LREC 2008: 6th Language resources and evaluation conference, Marrakech, Morocco, 26–30 May,
pp 1741–1748
Brill E, Resnik P (1994) A rule-based approach to prepositional phrase attachment. In: Proceedings of the
15th conference on 1994, acl.ldc.upenn.edu
Brown D, Ralf B (1996) Example-based machine translation in the Pangloss system. In: Proceedings of the
COLING-96, vol 1, pp 169–174 (Copenhagen)
Carpuat M, Yuval M, Nizar H (2010) Improving Arabic-to-English statistical machine translation by reor-
dering post-verbal subjects for alignment. In: ACL 2010: the 48th annual meeting of the association
for computational linguistics, Uppsala, July 11–16, 2010: Proceedings of the Conference Short Papers,
pp 178–183
Chafia M, Ali Mili (1995) Machine translation from Arabic to English and French. Information Sciences
3(2):91–109
Chalabi A (2004) Elliptic personal pronoun and MT in Arabic. In: JEP-2004-TALN 2004 special session
on Arabic language processing-text and speech. http://www.lpl.univ-aix.fr/jep-taln04/proceed/actes/
arabe2004/TAAC17.pdf
Chalabi A (2000) MT-based transparent Arabization of the internet TARJIM.COM. In: White JS (ed) AMTA
2000, LNAI 1934. Springer, Berlin pp 189–191
Chalabi A (2001) Sakhr web-based Arabic/English MT engine. Downloaded from www.elsnet.org/arabic2001/
chalabi.pdf on 25 Aug
Charoenpornsawat P, Sornlertlamvanich V, Charoenporn T (2002) Improving translation quality of rule-based
machine translation. In: Proceedings of COLING-02 on machine translation in Asia. Morristown, pp 1–6
Daimi K (2001) Identifying syntactic ambiguities in single-parse Arabic sentence. Comput Hum 35:333–349
Darwish K (2002) Building a shallow Arabic morphological analyser in one day. In: Proceedings of the
ACL workshop on natural language processing in the biomedical domain, PA, USA. Association for
Computational Linguistics
Debili F (1992) Aligning sentences in bilingual texts French–English and French–Arabic. In: COLING,
pp 517–525 (Nantes)
Ditters E (2001) A formal grammar for the description of sentence structure in modern standard Arabic. In:
Workshop on Arabic processing: status and prospects at ACL/EACL, Toulouse
Doaa S, Ana GL (2008) Pragmatic annotation of discourse markers in a multilingual parallel corpus (Arabic-
Spanish-English). In: LREC 2008: 6th language resources and evaluation conference, Marrakech, 26–30
May 2008
Doaa S, Antonio M, Sandoval J, Guirao M, Enrique A (2006) Building a parallel multilingual corpus (Arabic-
Spanish-English). In: LREC-2006: fifth international conference on language resources and evaluation.
Proceedings, Genoa, Italy, 22–28 May 2006, pp 2176–2181
Dorr BJ, Jordan PW, Benoit JW (1999) A survey of current paradigms in machine translation. In: Zelkowitz
M (ed) Advances in computers, vol 49. Academic Press, London pp 1–68
Elming J, Habash N (2009) Syntactic reordering for English-Arabic phrase-based machine translation. In:
Proceedings of the EACL 2009 workshop on computational approaches to semitic languages, Athens,
pp 69–77
Eric HN, Teruko M (1992) The KANT system: fast, accurate, high-quality translation in practical domains.
In: International conference on computational linguistics proceedings of the 14th conference on compu-
tational linguistics, vol 3. pp 1069–1073
Farghaly A, Shaalan K (2009) Arabic natural language processing: challenges and solutions. ACM Trans
Asian Lang Inform Process Assoc Comput Mach 8:1–22. doi:10.1145/1644879.1644881
Farghaly A, Senellart J (2003) Intuitive coding of the Arabic lexicon. In: Proceedings of the MT Summit IX,
the association for machine translation in the Americas (AMTA’03)
Fehri AF (1993) Issues in the structure of Arabic clauses and words. Kluwer, Dordrecht
Furuse O, Iida H (1992) An example-based method for transfer-driven machine translation. In: The third
international conference on theoretical and methodological issues, Empiristic vs. Rationalist methods in
MT. Montréal, pp 139–150
Groves D, Way A (2006) Hybrid data-driven models of machine translation. Springer Science & Business
Media B.V., Berlin, pp 301–323
Groves D, Way A (2005) Hybrid example-based SMT: the best of both worlds? In: Proceedings of the ACL
2005 workshop on building and using parallel texts: data-driven machine translation and beyond, Ann
Arbor, pp 183–190
Guessoum A, Zantout R (2005) A methodology for evaluating Arabic machine translation systems. Mach
Trans 18:299–335 doi:10.1007/s10590-005-2412-3 (Springer)
Guidere M (2002) Toward corpus-based machine translation for standard Arabic. Translation Journal 6(1).
http://accurapid.com/journal/19mt.htm, visited September
Habash N (2010) Introduction to Arabic natural language processing. In: Graeme H (ed) Synthesis lectures
on human language technologies. Morgan & Claypool Publishers, San Rafael p 187
Habash N, Jun Hu (2009) Improving Arabic-Chinese statistical machine translation using English as pivot lan-
guage. In: Proceedings of the fourth workshop on statistical machine translation, Athens, 30 March–31
March, pp 173–181
Habash N, Sadat F (2006) Arabic pre-processing schemes for statistical machine translation. In: Proceedings
of the 7th meeting of the North American chapter of the association for computational linguistics/human
language technologies conference (HLT-NAACL06). New York, pp 49–52
Hasan S, Isbihani A El I, Hermann N (2006) Creating a large-scale Arabic to French statistical machine
translation system. In: LREC-2006: fifth international conference on language resources and evaluation.
Proceedings, Genoa, Italy, 22–28 May
Hatem A, Nassar A (2008) Modified Dijkstra-like search algorithm for English to Arabic machine translation
system. In: Hutchins J, Hahn Walther v (eds) Proceedings EAMT 2008: 12th annual conference of the
European association for machine translation, September 22–23, 2008. Hamburg, pp 66–71
Hatem A, Omar N (2010) Syntactic reordering for Arabic-English phrase-based machine translation. In:
Database theory and application, bio-science and bio-technology. Springer Lecture Notes in Computer
Science, vol 118. Verlag, Berlin, pp 198–206
Hutchins J (2007) Machine translation: a concise history. In: Wai CS (ed) Computer aided translation: theory
and practice. Chinese University of Hong Kong, Hong Kong
Hutchins WJ, Harold LS (1992) An introduction to machine translation. Academic Press, London
Hutchins WJ (1986) Machine translation: past, present, future. Ellis Horwood Limited, West Sussex
Ibrahim K (2002) Al-Murshid fi Qawa’id Al-Nahw wa Al-Sarf [The Guide in Syntax and Morphology Rules].
Amman, Jordan, Al-Ahliyyah for Publishing and Distribution
Josef FO, Ney H (2000) Improved statistical alignment models. In: ACL00: Proceedings of the 38th annual
meeting of the association for computational linguistics, Hong Kong, pp 440–447
Joshan GS, Lehal GS (2007) Evaluation of direct machine translation system from Punjabi to Hindi. Int
J Systemics Cybern Inform, 76–83
Kamir D, Soreq N, Neeman Y (2002) A comprehensive NLP system for modern standard Arabic and modern
Hebrew. In: Proceedings of the workshop on computational approaches to semitic languages in the 40th
annual meeting of the association for computational linguistics (ACL-02). Philadelphia
Köprü S, Miller J (2009) A unification based approach to the morphological analysis and generation of Arabic.
In: CAASL-3—third workshop on computational approaches to Arabic script-based languages [at] MT
Summit XII, August 26 2009
Langlais P, Simard M (2002) Merging example-based and statistical machine translation. In: Richardson SD
(ed) Machine translation: from research to real users, 5th conference of the association for machine
translation in the Americas (AMTA-2002), Tiburon, October 2002. proceedings, Springer, Berlin,
pp 104–113
Larkey L, Ballesteros L, Connell M (2002) Improving stemming for Arabic information retrieval: light stem-
ming and co-occurrence analysis. pp 275–282
Lavie A, Probst K, Peterson E, Vogel S, Levin L, Font-Llitjos A, Carbonell J (2004) A Trainable transfer-based
machine translation approach for languages with limited resources. In: Proceedings of workshop of the
European association for machine translation (EAMT-2004), Valletta, Malta, pp 116–123
Lee Y (2004) Morphological analysis for statistical machine translation. In: Proceedings of the joint conference
on human language technologies and the annual meeting of the North American chapter of the
association of computational linguistics (HLT-NAACL)
Lee Y, Suk L, Kishore P, Salim R (2003) Language model based Arabic word segmentation. In: 41st annual
meeting of the association for computational linguistics. Sapporo, pp 399–406
Lopez A (2008) Statistical machine translation. ACM Comput Surv 40(3):1–49
Marcu D (2001) Towards a unified approach to memory- and statistical-based machine translation. In: Asso-
ciation for computational linguistics: 39th annual meeting and 10th conference of the European chapter,
Toulouse, pp 378–385
Mark P, Domenyk E, Samir K, Lakshmi P (2004) Relative clauses in Hindi and Arabic: a Paninian depen-
dency grammar analysis. In: Coling’04 workshop: proceedings recent advances in dependency grammar,
August 28, pp 9–16
McCarthy J (1979) Formal problems in semitic phonology and morphology. Ph.D. dissertation, MIT, Cam-
bridge
Mitamura T, Nyberg E, Carbonell J (1991) An efficient interlingua translation system for multi-lingual docu-
ment production. In: Proceedings of machine translation Summit III, Washington, DC, July 2–4
Moghrabi C (1998) On parametering the choice of words in text generation and its usefulness in machine trans-
lation. In: International conference “Machine translation: ten years on” proceedings held at Cranfield
University, England, 12–14 November. Cranfield University Press, pp 1–9
Mostefa D, Laïb M, Chaudiron S, Choukri K, Chalendar G (2009) A multilingual named entity corpus for Ara-
bic, English and French. In: MEDAR 2009: 2nd international conference on Arabic language resources
& tools, April 2009, Cairo
Nagao M (1997) Machine translation through language understanding. In: Proceedings of MT Summit VI,
San Diego, pp 41–49
Nguyen T, Vogel S (2008) Context-based Arabic morphological analysis for machine translation. In: Proceedings
of the 12th conference on computational natural language learning, Manchester, pp 135–142
Nirenburg S, Beale S, Domashnev C (1994) A full text experiment in example based machine translation.
In: Proceedings of the international conference on new methods in language processing, Manchester,
pp 78–87
Othman E, Shaalan K, Rafea A (2003) A chart parser for analysing modern standard Arabic sentence. In: The
MT Summit IX workshop on machine translation for semitic languages: Issues and Approaches, New
Orleans
Paul M, Doi T, Hwang Y, Imamura K, Sumita E (2005a) Nobody is perfect: ATR’s hybrid approach to spo-
ken language translation. In: Proceedings of the international workshop on spoken language translation
(IWSLT 2005), Pittsburgh, pp 55–62
Paul M, Sumita E, Yamamoto S (2005b) A machine learning approach to hypothesis selection of greedy
decoding for SMT. In: MT Summit X workshop: second workshop on example-based machine transla-
tion, Phuket, pp 117–124
Ratcliffe R (1998) The broken plural problem in Arabic and comparative semitic: allomorphy and analogy in
non-concatenative morphology. J. Benjamins, Amsterdam
Richardson S, Dolan W, Menezes A, Pinkham J (2001) Achieving commercial-quality translation with exam-
ple-based methods. In: Proceedings of MT summit VIII, Santiago De Compostela, Spain
Salem Y, Arnold H, Brian N (2008) Implementing Arabic to English machine translation using the role and
reference grammar linguistic model. In: Proceedings of the eighth annual international conference on
information technology and telecommunication (ITT 2008), Galway, Ireland, October 2008 (Runner-up
for Best Paper Award)
Shaalan K, Rafea A, Abdel Monem A, Baraka H (2004) Machine translation of English noun phrases into
Arabic. Int J Comput Process Orient Lang. World Scientific Publishing Company 17(2):121–134
Shaalan K, Raza H (2009) NERA: named entity recognition for Arabic. J Am Soc Inf Sci Technol. John Wiley
& Sons, Inc., NJ, 60(8):1652–1663
Shirko O, Omar N, Arshad H, Albared M (2010) Machine translation of noun phrases from Arabic to English
using transfer-based approach. J Comput Sci 6(3):350–356 (ISSN 1549-3636)
Soudi A, Bosch A, Neumann G (2007) Arabic computational morphology: knowledge-based and empirical
methods. Springer, Berlin
Spence G, Christopher D (2010) Better Arabic parsing: baselines, evaluations, and analysis. In: Coling 2010:
23rd international conference on computational linguistics. Proceedings of the conference, 23–27 August
2010, Beijing International Convention Centre, Beijing
Tahir GR, Asghar S, Masood N (2010) Knowledge based machine translation. In: Proceedings of international
conference on information and emerging technologies (ICIET). Karachi, Pakistan pp 1–5
Toma P (1977) SYSTRAN as a multilingual machine translation system. In: The third European
congress on information systems, pp 569–581
Toutanova K, Suzuki H, Ruopp A (2008) Applying morphology generation models to machine translation.
In: ACL-08: HLT. 46th annual meeting of the association for computational linguistics: human language
technologies. Proceedings of the conference, June 15–20, 2008, The Ohio State University, Columbus,
pp 514–522
Trujillo A (1999) Translation engines: techniques for machine translation. Springer, London
Yassine B, Imed Z, Mona D, Paolo R (2010) Arabic named entity recognition: using features extracted from
noisy data. In: Proceedings of the ACL 2010 conference short papers, pp 281–285, Uppsala, 11–16 July
2010. © 2010 Association for Computational Linguistics
Žabokrtský Z, Smrž O (2003) Arabic syntactic trees: from constituency to dependency. In: The 10th conference of
the European chapter of the association for computational linguistics, Budapest, pp 183–186
Zavrel J, Daelemans W, Veenstra J (1997) Resolving PP attachment ambiguities with memory-based learning.
In: The workshop on computational natural language learning (CoNLL’97). Madrid, pp 136–144