You are on page 1of 5

2nd International Conference on Machine Learning and Computer Science (IMLCS'2013) May 6-7, 2013 Kuala Lumpur (Malaysia)

Proverb Treatment in Malay-English


Machine Translation
Khirulnizam Abd. Rahman and Norita Mohd. Norwawi

Abstract – Machine translation is a process by which computer They are normally short, generally known sentence of the
software is used to translate a text from one natural language, in this folk which contains wisdom, truth, morals, and traditional
case Malay, to another, English. They are normally short, generally views in a metaphorical, fixed and memorisable form and
known sentence of the folk which contains wisdom, truth, morals, which is handed down from generation to generation [12].
and traditional views in a metaphorical, fixed and memorisable form
and which is handed down from generation to generation. Although
Although proverbs do beautifies Malay literature, however
proverb do beautifies Malay literature, this brings challenges to this brings challenges to machine translation since proverb
machine translation since proverb cannot be translated literally, rather cannot be translated literally, rather logically [7].
logically. This paper is to propose a research on proverb treatment in
B. Brief Description of Malay Proverbs
Malay-English machine translation. The treatment process started by
identifying the Malay proverbs using the pattern matching approach. Peribahasa or Malay proverb is a group of words in a fixed
After the identification, the proverb will be translated into the order that has a particular meaning that is different from the
meaning in English. To avoid literal (direct) translation a lookup meanings of each word understood on its own [1]. There are
table of the Malay proverb and the actual meaning in English has four categories of Malay proverbs [1], which are simpulan
been prepared. There is another challenge, the ambiguity of the bahasa, perumpamaan, bidalan and pepatah.
proverb meaning (there are Malay proverbs with several different Simpulan bahasa – normally consist of two words (sometimes
meanings). To resolve this ambiguous issue, the possibility theory
three). The literal meaning of the word combination is
will be implemented with the help of the one of the WordNet feature,
the semantic labeling. different than the actual meaning of the ‘simpulan bahasa’.
Example: Langkah kanan; literally means right footstep, yet
Keywords-- idiom treatment , machine translation, Malay-English the actual meaning is lucky.
machine translation, pattern matching, possibility theory
Perumpamaan – phrases started with seolah-olah, ibarat,
bak, seperti, macam, bagai or laksana.
I. INTRODUCTION
Example: bagaikan pinang dibelah dua; literally means like
A. Proverbs in Malay-English Machine Translation betel nut split apart evenly, yet the actual meaning is
compatible / equally beautiful and handsome for a pair of just
M ACHINE translation (MT) is sometimes defined as
automated translation or machine aided translation. It is
a process by which computer software is used to
married bride and bridegroom.
Pepatah – proverb that contains advices or teachings.
translate a text from one natural language, Malay, to another, Example: Adat berperang, yang kalah jadi abu, menang
English [27]. The study of Malay language in machine jadi arang; literally means in war, loser become ashes, winner
translation has been a serious topic since 1984 with the become coal, yet the actual meaning is in war, the defeated
establishment of Unit Terjemahan Melalui Komputer (UTMK) and the winner are both losers.
in USM [5]. Malay language in the sense of this research is Bidalan – phrase (pepatah) started with jangan, biar or
bahasa Melayu which is the standard first language in ingat.
Malaysia. Although Indonesia, Malaysia and Brunei languages Example: Kalau kail panjang sejengkal, lautan dalam
shares the common roots of Austronesian family, the research jangan diduga; literally means if you have a short hook, do not
is focusing into Malaysian Malay language. attempt to fish in the deep sea, yet the actual meaning is if you
Proverbs (peribahasa) in Malay language are beautiful have little knowledge, do not dare to dream big.
elements to deliver advices, Malay teachings, moral values
and comparison through metaphoric phrases [26]. II. LITERATURE REVIEW

A. Multi Words Expressions (MWE)


MWE is combination of several words to form another
meaning. As in other languages, Malay also contains a lot of
MWE. Although some MWEs can be isolated in the
tokenization process, and then analysed as a single cluster,
Khirulnizam Abd. Rahman and Norita Mohd. Norwawi are with Universiti most of them cannot [3].
Sains Islam Malaysia Nilai, Negeri Sembilan, MALAYSIA email:
khirulnizam@gmail.com

4
2nd International Conference on Machine Learning and Computer Science (IMLCS'2013) May 6-7, 2013 Kuala Lumpur (Malaysia)

According to Tatabahasa Dewan [16], the basic sentence in III. PROBLEM STATEMENT
Malay can be categorised into four basic structures (Table I). The most challenging issue in interpreting natural language
TABLE I texts is the ambiguity problem [15], [10]. Proverbs are one
FOUR CATEGORIES OF BASIC MALAY SENTENCES part of the ambiguity issues. Proverbs normally come with
Malay basic sentence Combination of Phrases fixed sequence of words; however the meaning is not based on
category the words directly [1]. Since proverb is translated logically,
1 Noun Phrase + Noun Phrase
2 Noun Phrase + Verb Phrase
the machine translation algorithm needs to know the semantic
3 Noun Phrase + Adjective Phrase (real meaning) of the phrases. On top of these issues, certain
4 Noun Phrase + Prepositional Phrase Malay proverbs have ambiguous meaning (more than one
Source: Tatabahasa Dewan Edisi Ketiga, 2008 [16] meaning) which the solution has not been mention in existing
proverbs treatment [20], [7], [4].
Aiti Aw et al [2] in their research realized the important of
Noun Phrases in Malay sentence structure decided to study IV. RESEARCH QUESTIONS
this issue. They proposed a translation approach where they
make use of parallel bilingual corpus to obtain a large set of This study is to answer the following research questions;
bilingual terms and then use them to train a statistical engine. a) Is there any workable algorithm that can be
There’s another research by Rais et. al. [17] indexing the implemented to detect Malay proverbs?
Malay MWE using combination of query translation approach b) Since the nature of several Malay proverbs that has
and weighting schemes. The researchers did mention about ambiguity in the meaning, can this be solved using the
dictionary is crucial in multiword detection. existing algorithm?
c) Will the semantic analysis improve the Malay proverb
B. Why proverbs are another type of MWE that are more translation?
difficult to translate?
Proverb and idioms are a part of MWE. The only problem V. THE RESEARCH OBJECTIVES
with proverbs and idioms in MT is they cannot be translated The aim of this research is to propose a novel approach to
literally, rather logically [7]. The translation machine needs to proverb classification using Naïve Bayesian approach and
know the definite meaning of the proverbs. translation into English using combination of dictionary
C. Challenges in Malay proverbs automated detection. mapping and semantic labeling.
The objectives are:
These are several challenges that need to be consider in the a) To analyze the pattern of Malay proverbs.
automated detection of Malay proverbs. b) To develop an algorithm to extract proverbs from Malay
a) Word with affixes – Example: “Kembang sayap” =
texts.
“Mengembangkan sayap”.
c) To develop an algorithm of translating the Malay idioms
b) Another word in between (stopword) Example:
into English by providing the actual (logical) meaning
“berpijak di bumi nyata” or sometimes “berpijak di
(not literal meaning).
bumi yang nyata” which means “do not day-dreaming”. d) To model the semantic analysis in order to define
c) Different proverb having the same meaning. ambiguous meaning of the Malay proverbs.
Example: “Baik budi”, “baik hati” which means kind-
e) To evaluate the effectiveness of the proposed approach.
hearted. Another example: "bintang hati", "buah
hati", "mahkota hati", "rangkai hati", "tajuk mahkota",
VI. THE SCOPE OF RESEARCH
"tangkai hati", "tangkai kalbu", "bintang terang" which
means beloved. The research is capable of detecting and defining the meaning
of proverb in Malay text. Proverbs may appear in modern
D. Ambiguous meaning of Malay proverb. Malay literature such as in novel, short stories or blogs.
There are Malay proverbs can have more than one meaning
[18]. Examples of ambiguous Malay proverbs are. VII. THE RESEARCH CONTRIBUTIONS
a) “mata air” which means 1) lover, or 2) underground ater
The research’s major contributions are;
resource. a) To develop an algorithm to identify proverb in Malay
b) “orang putih” which means 1) pious man, or 2) text.
European people.
b) To develop an algorithm that can translate Malay
c) “air muka” – 1) face 2) pride
proverb by providing the right meaning of the
d) bawa diri - 1) running away 2) sulking 3) being
proverbs that has ambiguous meaning.
independent.
e) Bagai cicak makan kapur - meaning 1) ashamed because
VIII. PREVIOUS WORKS
of his own offense. 2) pleased.
f)Ada air adalah ikan - meaning: 1) there must be people These are several works by the previous researchers that
in a country 2) fortune is everywhere have done remarkable jobs in solving the issue of
proverbs/idioms detection and translation. These researches
studied proverbs in many different languages such as Hindi,
Punjabi, Malay, Germany, English and Italian.

5
2nd International Conference on Machine Learning and Computer Science (IMLCS'2013) May 6-7, 2013 Kuala Lumpur (Malaysia)

Brahmaleen et. al. [4] suggested a method in identifying


proverbs in Hindi text and translate them into Punjabi. The
machine has a dictionary to map the Hindi proverb and the
equivalent proverb in Punjabi. The illustration of the workflow
of the said system is available in Figure 1.

Fig. 2 METIS II translation flow Source: Dmitra, 2010 [7]

IX. RESEARCH METHODOLOGY


The research is an experimental based research whereby
development of the system will be carried out, and evaluated
Fig. 1 Hindi to Punjabi proverb automated translation to validate the performance. The following research processes
Source: Brahmaleen et. al., 2010 [4] are adopted;
Phase 1 : Literature Review (Information Gathering/
Noah and Ismail [20] found out that Malay proverb can be
Acquisition)
detected using Naïve Bayesian model. The experiment
A thorough literature review is conducted related to the area
suggested that the multinomial model shows slightly better
of research; covering theoretical, conceptual and
performance as compared to the Multivariate Bernoulli model.
developmental aspects. Current research on subject matter is
This research suggested stop words removal to enhance the
analysed. The research gap is identified and refined to its
classification process.
potential venture.
There is another research from ATMA, UKM whereby they
Phase 2: Dictionary of Malay Proverb (and Semantic
are providing a searchable database of Malay proverbs and
Labelling)
idioms [25]. The application receives a complete or a part of
Dictionary of Malay Proverb is defined as the main
proverb/idioms, using the SQL pattern matching to search the
dictionary used in the system. Suggested dictionary is the
list of the proverbs/idioms, and the outputs are all the
Peribahasa Melayu Kontemporari by Abdullah & Ainon, 2011
proverbs/idioms that similar to the user’s request.
[1]. Semantic labelling (tagging) will be done manually to the
Dmitra [7] studied the METIS II (in Figure 3) translation
list of proverbs that has ambiguous meaning (as suggested in
engine which capable of translating several languages such as
Wordnet, 2012 [11] and Dmitra [7]).
Dutch, German, Greek and Spanish into English. The pre-
Phase 3: Design
translation process involves tokenization, lemmatization,
The design of system is carefully devised to support Malay
tagging and chunker. By using the existing METIS II
proverb detection and translation. Pre-classification will
facilities, she managed to embed an idioms translation tool for
involve tokenization, stop words removal and stemming. The
Germany into English.
proverb detection will be constructed from the pattern
The idioms processor comprises of the following three
matching. As for the translation, direct dictionary lookup will
components;
be adopted (for proverb with single meaning). To resolve the
a) German-English idioms dictionary,
ambiguity issue, the possibility theory will be implemented
b) German source language corpus of sentences, and
together with the semantic analysis and semantic labeling as in
c) Syntactic matching rules.
WordNet [11].
The dictionary was constructed manually, the German Phase 4: Development/ Implementation
corpus (containing idioms) assembled from varies prominent The proposed system is developed based on the design
resources such as Europarl [9], sentence examples from the phase aforementioned. Appropriate programming languages
Web and DWDS [8]. The last component is the German are used (suggestion to develop this system as a web-based
syntactic matching rules which recognize the syntax of application).
German idioms either in the form of continuous of Phase 5: Evaluation
discontinuous. The system is tested in terms of its impact on the efficiency
However, among all the researches mentioned in the of the proverb classification, and correctness of translation.
discussion of the previous works, none of them mention about The testing dataset will be developed by combining sentences
the treatment to the proverbs that have ambiguous meaning.

6
2nd International Conference on Machine Learning and Computer Science (IMLCS'2013) May 6-7, 2013 Kuala Lumpur (Malaysia)

that contains Malay proverb appear in Malay novel, short REFERENCES


stories or blogs. [1] Abdullah Hassan and Ainon Mohd. 2011. Kamus Peribahasa
Kontemporari, Edisi Ketiga. PTS Professional Publishing.
X. EXPECTED FINDINGS [2] Aiti Aw, Sharifah Mahani Aljunied and Haizhou Li. 2009. Malay
Multi-word Expression Translation. Second Workshop on Technologies
Based on the problems identified, the researcher suggesting and Corpora for Asia-Pacific Speech Translation (TCAST 2009),
the following framework for Malay proverb treatment for Suntec, Singapore.
[3] Arvi Hurskainen. 2008. Multiword Expressions and Machine
Malay-English machine translation in Figure 3. Translation. Technical Reports in Language Technology Report No 1,
Tokenization is process of separating words and 2008 . < http://www.njas.helsinki.fi/salama/multiword-expressions-and-
punctuation. This process provides a list of tokens which machine-translation.pdf>
represents separated words and punctuations [7]. [4] Brahmaleen K. Sidhu, Arjan Singh and Vishal Goyal. 2010.
Identification of Proverbs in Hindi Text Corpus and their Translation
Stop words removal is a process of enlisting all the words into Punjabi. Journal of Computer Science and Engineering, Vol. 2,
that are considered not important (meaningless). In this case Issue 1, July 2010.
stop words are such as yang, itu, macam, seperti, bagai, [5] Choy-Kim Chuah and Zaharin Yusoff. 2002. Computational Linguistics
laksana, ibarat, umpama, ini, begitu, begini and etc. This at Universiti Sains Malaysia. LREC 2002 Third International
Conference On Language Resources And Evaluation. The University of
process will produce better accuracy [22] in text classification. Las Palmas de Gran Canaria, Canary Islands – Spain, 29 May - 31 May
Stemming is a process of finding the root of a word through 2002.
removing the prefixes and postfixes. Word such as membawa [6] Darwis Harahap. 1992. Sejarah Pertumbuhan Bahasa Melayu. Penerbit
is rooted from bawa which appear in membawa diri (or Universiti Sains Malaysia, Pulau Pinang.
[7] Dmitra Anastasiou. 2010. PhD Thesis, Universitat des Saarlandes.
sometimes used as bawa diri). Stemming will provides more Unpublished.
chances of finding the proverb in the dictionary. [8] DWDS. 6 June 2012. < http://www.dwds.de >
Malay proverbs detection - by using the pattern matching [9] Europarl. 2012. <http://www.statmt.org/europarl/>
approach, this process will identify Malay proverbs appearing [10] Hejab Ma’azer Al Fawareh, Shaidah Jusoh and Wan Rozaini Sheikh
Osman. 2008. Ambiguity in Text Mining. Proceedings of the
in the text. International Conference on Computer and Communication Engineering
Proverb Semantic Analysis – in the case of proverb with 2008. May 13-15, 2008 Kuala Lumpur, Malaysia
ambiguous meaning, the semantic tagger will be analyzed. [11] Lim, L. T. and Hussein, N.. 2006. Fast Prototyping of a Malay WordNet
System. In: Proceedings of the Language, Articial Intelligence and
Computer Science for Natural Language Processing (LAICS-NLP)
Summer School Workshop. Bangkok, Thailand, 2006, pp. 13–16.
[12] Mieder, Wolfgang. 2003. Proverbs are Never out of Season. Popular
Wisdom in the Modern Age. New York: Oxford University Press.
[13] Mohd Noor, Yusnita and Jamaludin, Zulikha and Jusoh, Shaidah. 2010.
A Retrospective View On The Promise On Machine Translation For
Bahasa Melayu-English. In: Found in Translation : International
Conference on Translation and Multiculturalism, 23-25 July 2010,
University of Malaya.
[14] Mosleh H. Al-Adhaileh and Tang Enya Kong. 1999. Example-Based
Machine Translation Based on the Synchronous SSTC Annotation
Schema, Machine Translation Summit VII, 1999.
[15] Nadzeya Kiyavitskaya, Nicola Zeni, Luisa and Mich Daniel M. Berry:
Requirements for Tools for Ambiguity Identification and Measurement
in Natural Language Requirements Specifications. WER 2007: 197-206.
[16] Nik Safiah Karim, Farid M Onn and Hashim Haji Musa. 2008.
Tatabahasa Dewan Edisi Ketiga. Dewan Bahasa dan Pustaka, Kuala
Lumpur.
[17] N.H. Rais, M.T. Abdullah & R.A Kadir, 2011. Multiword Phrases
Indexing for Malay-English Cross Language Information Retrieval,
Information Technology Journal, 2011.
Fig. 3 Expected Findings - A framework to classify Malay proverb [18] Peribahasa Melayu. 2012. Institut Tamadun dan Alama Melayu, UKM.
and translate it into the actual meaning in English http://malaycivilization.ukm.my/idc/groups/ukm_view_page/documents/
malayportal/ukm_view_page.hcsp?nid=79
[19] Princeton University "About WordNet." WordNet. Princeton University.
XI. CONCLUSIONS 6 January 2012. <http://wordnet.princeton.edu>
Proverb translation is challenging because the ambiguity [20] S.A. Noah and F. Ismail. 2008. Automatic Classifications of Malay
Proverbs Using Naive Bayesian Algorithm. Information Technology
and the logical definition. As highlighted in the previous Journal, 2008.
discussion, there are Malay proverbs with ambiguous [21] Shamsuddin Ahmad. 2006. Kamus Peribahasa Melayu-Inggeris, PTS
meaning. This research propose the following steps for Publication.
proverb detection and translation of Malay text; tokenization, [22] Silva C. and B. Ribeiro. 2003. The Importance of Stop Words Removal
on Recall Values in Text Categorization. Proceedings of the
stemming, proverb detection and translation. International Joint Conference on Neural Networks, 20-24 July 2003.pp
By developing the prototype of the system, the researcher 1661-1666.
could be able to minimize the confusion in defining proverb [23] Suhaimi Ab. Rahman, Noorhayati Ahmad, Hafizullah Amin Hashim,
through automated translation. It is hoped that this approach Abdul Wahab Dahalan. 2006. Real Time On-Line English-Malay
Machine Translation (MT) System. Third Real-Time Technology And
will have great impact on the advancement of proverb Applications Symposium 2006, UPM, Malaysia.
treatment in Machine Translation system. [24] Suhaimi Ab. Rahman, Normaziah Abdul Aziz and Badariah Solemon.
2008. An English-Malay Translation Memory System, CITWorkshops,

7
2nd International Conference on Machine Learning and Computer Science (IMLCS'2013) May 6-7, 2013 Kuala Lumpur (Malaysia)

pp.619-624, 2008 IEEE 8th International Conference on Computer and


Information Technology Workshops, 2008. .
[25] Supyan Hussin, Ding Choo Ming, Afendi Hamat & Arba’eyah Abdul
Rahman. 2004. Kamus Peribahasa Melayu Digital yang Pertama. Sari 22
(2004), 49-61.
[26] Susana Widyastuti. 2010. Peribahasa: Cerminan Kepribadian Budaya
Lokal Dan Penerapannya Di masa Kini. Proceeding of Seminar Nasional
UTY 3 Juli 2010.
[27] Systrans. 2004. http://www.systransoft.com/systran/corporate-profile/
translation-technology/what-is-machine-translation.

You might also like