
THE CHALLENGES OF HANDLING PROVERBS IN MALAY-ENGLISH MACHINE TRANSLATION

Khirulnizam Abd Rahman, Norita Md Norwawi
Faculty of Science and Technology, Universiti Sains Islam Malaysia
khirulnizam@gmail.com

Abstract: Proverbs are a unique feature of the Malay language in which a message or piece of advice is communicated indirectly through metaphoric phrases. However, the use of proverbs causes confusion or misinterpretation if one is not familiar with the phrases, since they cannot be translated literally. This paper discusses the process of automated filtering of Malay proverbs in Malay text. The next process is translation. In machine translation, the biggest challenge related to proverbs is ambiguity. In Malay itself, there are proverbs that have several meanings, and some can even be translated literally. This is also discussed in the paper. The objective of this review is to identify the issues in handling Malay proverbs in machine translation and how previous researchers have solved them. An experiment was also conducted to test the effectiveness of several online machine translators. The researchers gathered 200 proverb entries from the Kamus Peribahasa by Abdul Aziz Abdul Rahman, 2012 (a Malay proverb dictionary). Two online machine translation services were tested, namely Google Translate and CitCat.com. In the experiment, Google Translate correctly translated 34 percent and CitCat.com 52 percent of the 200 Malay proverbs tested. The rest were translated either wrongly or literally. This paper is part of the research and development of a Malay proverb filterer, whereby the system will recognise any proverb present in a Malay text and convert it into plain Malay before the text is translated into English by the machine translators.

Keywords: proverb translation, Malay-English translation, machine translation, proverbs, idioms.
*This paper appears in the Proceedings of the 14th International Conference on Translation 2013 (http://ppa14atf.usm.my/). ISBN 978-967-12092-0-2. 27 – 29 August 2013, Universiti Sains Malaysia, Penang.

1.0 INTRODUCTION

1.1 Idioms in Malay

Bahasa Malaysia (Malay) is a language spoken mostly by the Malays in Malaysia (Aiti et al., 2009b), and also by non-Malays. It is the national language of Malaysia and Brunei (Clynes & David, 2011) and one of the official languages of Singapore; Malay is also used for communication in Southern Thailand (Pattani) (Fontong, 2007). Several researchers agree that an idiom is a phrase combining two or more words whose overall meaning differs from the meanings of its constituent words (Zoltan and Peter, 1996). Idioms in Malay are categorised into proverbs and non-proverbs (Ainon & Abdullah, 2010).

Figure 1: Hierarchy of idioms in Malay, adapted from Ainon & Abdullah (2010).

1.2 Descriptions of Malay Proverbs

A peribahasa, or Malay proverb, is a group of words in a fixed order that has a particular meaning, different from the meanings of each word understood on its own (Abdullah & Ainon, 2011). Malay proverbs usually come in a fixed word order; however, according to Abdullah (2010), a different word is sometimes used. For example, the proverb “kera kena belacan” can also take the form “monyet dapat belacan”. The proverb means restless. Kera can also be expressed as monyet, both meaning monkey. Examples: bagaikan aur dengan tebing, air dicincang tidak akan putus, sekilas ikan di air sudah ku tahu jantan dan betina, anak di rumah kelaparan kera di hutan disusukan.

There are four categories of Malay proverbs (as described by Abdullah & Ainon, 2011): simpulan bahasa, perumpamaan, bidalan and pepatah.

Simpulan bahasa – normally consists of two words (sometimes three). The literal meaning of the word combination is different from the actual meaning of the simpulan bahasa. Example: Langkah kanan; literally means right footstep, yet the actual meaning is lucky.

Perumpamaan – a phrase that starts with seolah-olah, ibarat, bak, seperti, macam, bagai or laksana. Translated into English, these words are similar to as or like, expressing resemblance. Example: bagaikan pinang dibelah dua; literally means like a betel nut split evenly in two, yet the actual meaning is a well-matched couple, equally beautiful and handsome, said of a newly married bride and bridegroom.

Pepatah – a proverb that contains advice or teachings. Example: Adat berperang, yang kalah jadi abu, menang jadi arang; literally means in war, the loser becomes ashes and the winner becomes coal, yet the actual meaning is that in war, the defeated and the winner are both losers.

Bidalan – a phrase (pepatah) that starts with or contains jangan, biar or ingat. Example: Kalau kail panjang sejengkal, lautan dalam jangan diduga; literally means if you have a short hook, do not attempt to fish in the deep sea, yet the actual meaning is if you have little knowledge, do not dare to dream big.

2.0 MALAY-ENGLISH MACHINE TRANSLATION

Machine translation (MT) is sometimes defined as automated translation or machine-aided translation. It is the process by which computer software is used to translate a text from one natural language, for example Malay, to another, such as English (Systrans, 2012). Though Malay is an important language in Malaysia, serious study of Malay in machine translation only began in 1984 with the establishment of Unit Terjemahan Melalui Komputer (UTMK) at USM (Chuah & Zaharin, 2002). This was followed by research funded by MIMOS on automatically translating English to Malay using a Bilingual Knowledge Base bank (Suhaimi & Normaziah, 2004). However, these works did not discuss in detail the handling of proverbs in Malay-English machine translation. The authors believe that proverb treatment is important because the Malay language is rich in proverbs (Koh Boon, 1992; Lim, 2003).

Proverbs (peribahasa) in Malay are beautiful elements for delivering advice, Malay teachings, moral values and comparisons through metaphoric phrases (Susana, 2010). A proverb is normally a short, generally known sentence of the folk which contains wisdom, truth, morals and traditional views in a metaphorical, fixed and memorisable form, and which is handed down from generation to generation (Mieder, 1993). Although proverbs beautify Malay literature, they bring challenges to machine translation, since a proverb cannot be translated literally but must be translated logically, that is, according to its meaning (Dimitra, 2010).

2.0 PREVIOUS RESEARCHES

2.1 Multi Words Expressions (MWE)

An MWE is a combination of several words that forms another meaning. As in other languages, Malay contains many MWEs. Although some MWEs can be isolated in the tokenization process and then analysed as a single cluster, most of them cannot (Arvi, 2008). Aiti Aw et al. (2009a) recognised the importance of noun phrases in Malay sentence structure and decided to study this issue. They proposed a translation approach that uses a parallel bilingual corpus to obtain a large set of bilingual terms, which are then used to train a statistical engine. Another study, by Rais et al. (2011), indexed Malay MWEs using a combination of a query translation approach and weighting schemes. The researchers noted that a dictionary is crucial for multiword detection.

2.2 Why are proverbs a type of MWE that is more difficult to translate?

Proverbs and idioms are a part of MWE (Arvi, 2008; Sharma & Goyal, 2011). The main problem with proverbs and idioms in MT is that they cannot be translated literally, but must be translated logically (Dimitra, 2010). Thus the translation machine needs to know the definite meaning of each proverb. The most challenging issue in interpreting natural language texts is the ambiguity problem (Kiyavitskaya et al., 2007; Hejab et al., 2008), and proverbs are one source of ambiguity. Proverbs normally come with a fixed sequence of words; however, the meaning is not derived directly from the words (Abdullah & Ainon, 2011). Since a proverb is translated logically, the machine translation algorithm needs to know the semantics (real meaning) of the phrase. On top of this issue, certain Malay proverbs are ambiguous (they have more than one meaning), a problem whose solution has not been addressed in existing proverb treatments (Noah & Ismail, 2008; Dimitra, 2010; Brahmaleen et al., 2010).

3.0 CHALLENGES IN MALAY PROVERB MACHINE TRANSLATION

This paper discusses two major challenges in the treatment of proverbs in machine translation. The first challenge lies in the detection/filtering phase; the second is determining the correct semantic definition of the proverb (based on the context of the sentence).

3.1 Malay proverbs automated detection

Several challenges are encountered in the process of detecting/filtering Malay proverbs:

Words with affixes – Some proverbs appear with affixes. Example: “Kembang sayap” = “Mengembangkan sayap”. In this early research phase, the authors propose a stemming process (Muhamad Taufik et al., 2009). Stemming is a process that removes word affixes.

Another word in between (stop word) – Example: “berpijak di bumi nyata”, or sometimes “berpijak di bumi yang nyata”, which means “do not daydream”. To overcome this issue, the researchers propose stop word removal, a process of listing and removing all the words that are considered unimportant (meaningless). In this case the stop words include yang, itu, macam, seperti, bagai, laksana, ibarat, umpama, ini, begitu and begini. This process produces better accuracy in text processing (Agus, 2009).

Different combinations of words used to represent the same proverbial meaning – Malay proverbs usually come in a fixed word order; however, according to Abdullah (2010), a different word is sometimes used. Example: “Ada angin, ada pokoknya” can also appear as “Ada angin, ada pohonnya”, which means that anything that happens has its cause. Other examples are “dapat dihitung (dibilang) dengan jari” and “bagai kera (beruk/monyet) kena belacan”.
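The three detection challenges above can be illustrated with a minimal sketch. It is not the authors' filterer: the affix rules are a toy stand-in for the stemmer cited above, the root-word check is drawn only from the proverb list itself, and all names (`toy_stem`, `VARIANTS`, `detect_proverbs`) are hypothetical. The stop words and proverb variants are the ones quoted in this section.

```python
import re

# Stop words as listed in this section.
STOP_WORDS = {"yang", "itu", "macam", "seperti", "bagai", "laksana",
              "ibarat", "umpama", "ini", "begitu", "begini"}

# Canonical proverb forms taken from the examples in this paper.
PROVERBS = [
    "kembang sayap",
    "berpijak di bumi nyata",
    "ada angin ada pokoknya",
    "kera kena belacan",
]

# Interchangeable words mapped to the form used in PROVERBS (assumed table).
VARIANTS = {"monyet": "kera", "beruk": "kera", "pohonnya": "pokoknya"}

# Toy root dictionary: only the words that occur in the proverb list.
ROOTS = {word for proverb in PROVERBS for word in proverb.split()}

# Simplified affix rules (an assumption, not the cited stemmer).
# Each prefix lists the initial letters it may have absorbed, e.g. meng + kembang.
PREFIX_RULES = [("meng", ("", "k")), ("mem", ("", "p")), ("men", ("", "t")),
                ("me", ("",)), ("ber", ("",)), ("di", ("",)), ("ter", ("",))]
SUFFIXES = ("kan", "an", "i")


def toy_stem(word):
    """Return a known root for `word` if stripping affixes yields one."""
    if word in ROOTS:
        return word
    candidates = [word]
    for prefix, restored_initials in PREFIX_RULES:
        if word.startswith(prefix):
            rest = word[len(prefix):]
            candidates.extend(initial + rest for initial in restored_initials)
    for candidate in list(candidates):
        for suffix in SUFFIXES:
            if candidate.endswith(suffix):
                candidates.append(candidate[:-len(suffix)])
    for candidate in candidates:
        if candidate in ROOTS:
            return candidate
    return word  # unknown word: leave it unchanged


def normalise(text):
    """Tokenise, drop stop words, stem, then map variants to canonical words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    tokens = [toy_stem(t) for t in tokens]
    return [VARIANTS.get(t, t) for t in tokens]


def contains(tokens, pattern):
    """True if `pattern` occurs as a contiguous run inside `tokens`."""
    n = len(pattern)
    return any(tokens[i:i + n] == pattern for i in range(len(tokens) - n + 1))


def detect_proverbs(text):
    """List the canonical proverbs found in `text` after normalisation."""
    tokens = normalise(text)
    return [p for p in PROVERBS if contains(tokens, normalise(p))]


if __name__ == "__main__":
    # Made-up sample sentences, one per challenge.
    for sentence in ["Syarikat itu mengembangkan sayap ke luar negara.",
                     "Kita mesti berpijak di bumi yang nyata.",
                     "Dia seperti monyet kena belacan."]:
        print(sentence, "->", detect_proverbs(sentence))
```

Normalising both the sentence and the dictionary entries with the same function means one canonical entry can match several surface forms. A realistic system would need a full Malay stemmer and root dictionary in place of the toy rules.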

3.2 Malay proverb translation

The most difficult aspect of translating proverbs is ambiguity. Ambiguity in language study is defined as uncertainty or inexactness of meaning. From the authors’ observation, there are two categories of ambiguity in Malay proverbs. In the first category, the phrase can have both a literal and an idiomatic meaning. In the second category, the proverb has more than one idiomatic meaning (Abdullah & Ainon, 2011; Abdullah, 2010; Peribahasa Melayu, 2012).

3.2.1 Several phrases with literal and idiomatic meaning.

Example of Malay proverb | Literal meaning | Figurative meaning
Angkat senjata | Lifting a weapon | Going to war
Cuci tangan | Washing hands | Repented from doing any crime; do not want to be responsible
Cuci mata | Washing eyes | Sightseeing
Gosok belakang | Rub on the back | To console
Ibu ayam / bapa ayam | Hen / cock | Pimp
Kena tembak | Shot | Cheated
Kena tikam | Stabbed | Betrayed
Kena tikam dari belakang | Stabbed from the back | Betrayed by someone we trust
Kena tendang | Kicked | Fired
Kerak nasi | Rice crust | Easily deterred
Kipas angin | An electrical device to blow wind | Provoking others to fight
Geleng kepala | An action of shaking your head | Refusal
Angkat kaki | An activity of lifting up a leg | To run away
Patah kaki | Broken leg | 1. A person we are hoping will help us is not around; 2. No vehicle to move around
Garu kepala | An action of scratching one’s head | Confused
Hari hendak hujan | It’s going to rain | Teasing: a person is going to cry
Bulan penuh | The state of full moon | Beautiful face
Gigit jari | Biting the finger | Failed to achieve
Gigit bibir | Biting the lips | Furious
Mengurut dada | Stroking the chest | To be patient
Menjolok sarang tebuan | Poking the beehive | Intentionally doing something dangerous
Meriam buluh | A cannon made of bamboo that produces a loud sound but does not shoot any cannonball | Bragging about things that he/she did not do

3.2.2 Examples of proverbs that have more than one idiomatic meaning.

Example of Malay proverb | Meaning | Other meaning
Mata air | Underground water resource | Lover
Orang putih | European people | Pious man
Air muka | Face | Pride
Bawa diri | Travel or being independent | Sulking or running away
Seperti cicak makan kapur | Ashamed of his own offense | Pleased
Ada air adalah ikan | There must be people in a country | Fortune is everywhere
Abu di atas tunggul | Easily forgotten | Not safe
Ada hati | Desire above competency | Having a feeling to somebody
Ada tangga, hendak memanjat tiang | Doing things against the norm/regulation | Opting the hard way to do an easy thing
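The two categories of ambiguity described in this section suggest that a proverb lexicon entry needs to hold an optional literal reading alongside one or more idiomatic readings. The following is a minimal sketch of such a record, assuming hypothetical names (`ProverbEntry`, `ambiguity_flags`) that do not come from the paper; the sample entries are taken from the examples in this section. It only flags which kind of ambiguity an entry carries and does not attempt to choose a sense from context.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProverbEntry:
    """One lexicon record: a phrase with its possible readings."""
    phrase: str
    literal: Optional[str] = None          # set when a literal reading is plausible
    idiomatic: List[str] = field(default_factory=list)


def ambiguity_flags(entry):
    """Return the ambiguity categories (as described above) that apply."""
    flags = []
    if entry.literal and entry.idiomatic:
        flags.append("literal_or_idiomatic")   # first category
    if len(entry.idiomatic) > 1:
        flags.append("multiple_idiomatic")     # second category
    return flags


# Sample entries based on the examples listed in this section.
LEXICON = {
    entry.phrase: entry
    for entry in [
        ProverbEntry("cuci mata", literal="washing eyes",
                     idiomatic=["sightseeing"]),
        ProverbEntry("patah kaki", literal="broken leg",
                     idiomatic=["a person we hoped would help is not around",
                                "no vehicle to move around"]),
        ProverbEntry("ada hati",
                     idiomatic=["desire above competency",
                                "having a feeling for somebody"]),
    ]
}

if __name__ == "__main__":
    for phrase, entry in LEXICON.items():
        print(phrase, "->", ambiguity_flags(entry))
```

An entry with any flag would need context-based disambiguation before translation, while an entry with no flag could be replaced by its single meaning directly.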

5. THE EXPERIMENT

This experiment analyses the current state of Malay proverb translation using three commercially available machine translation services, namely Google Translate, CitCat.com and MyMemory.translated.net. The study analyses 200 Malay proverbs extracted from several Malay proverb dictionaries used by secondary school students. The result is the percentage of proverbs correctly translated by the automated translators, together with a discussion of the translation issues.

5.1 The Online Automated Translators

There are several online machine translators: Google Translate, Bing Translator, Babel Fish and CitCat.com. Bing Translator provides Indonesian-English but not Malay-English, while Babel Fish has neither Malay-English nor Indonesian-English. Google Translate, CitCat.com and MyMemory.translated.net are three online machine translators that are capable of automatically translating Malay sentences into English. These services provide basic translation for free.

Google Translate is a free translation service that provides instant online translation between 64 different languages (Google Translate, 2013), including Malay to any of the other 63 languages. Google Translate has implemented the statistical machine translation approach since 2007, meaning that the translation system is highly dependent on translation examples analysed from thousands of translated documents.

CitCat.com is a new company (established in 2007) providing mainly English-Malay-English automated translation (CitCat.com, 2013). Several other language pairs are supported, such as English-Chinese-English and English-Indonesian. Currently the authors cannot find any information on which approach is implemented in this translation service: rule-based, statistical or example-based.
MyMemory.translated.net is another interesting automated translation service. It implements the statistical translation approach (http://mymemory.translated.net/doc/) by combining several other services such as Google Translate, Bing Translate, Systran and Worldlingo. This service also benefits greatly from users’ contributions, which enrich its translation examples. The website claims that its service is capable of translating Malay into 150 other languages, including English. However, at the end of the study we found that the translation results of this tool are very similar to the output from Google Translate. Hence, we decided to exclude it from the discussion.

Gaule & Singh (2012) claim that Google Translate does not filter idioms or proverbs in English-to-Hindi translation. As a result of this failure, proverbs are translated literally instead of semantically. From an early observation, the authors found that these two systems are sometimes capable of detecting Malay proverbs in the source language. However, there are also failures in detecting Malay proverbs. For example, the phrase “berat hati” is translated into “reluctantly”, which is correct. However, the proverb “kera sumbang” (a person who lives in seclusion, or a recluse) is translated literally into “monkey incest”. Though the translation services

5.2 Experiment Objective

The main purpose of this study is to roughly estimate the correctness of automated translation tools in translating Malay sentences that contain proverbs. Example (translations were done on 2nd January 2013):

Malay proverb | Meaning (in Malay) | Correct English translation (manual) | Translation by Google Translate | Translation by CitCat.com | Translation by MyMemory.translated.net
Ada hati | Keinginan / menyimpan perasaan | Having a feeling / crush / ambition / desire | Some heart (translated literally) | There is heart (translated literally) | Have heart (translated literally)
Mencari sesuap nasi | Bekerja | Make ends meet | Make ends meet (correct) | Seeking for a mouthful of rice (translated literally) | Make ends meet
Bagai langit dengan bumi | Sangat berbeza | Absolutely different | Like heaven and earth (translated literally) | Like sky with earth (translated literally) | Like heaven and earth (translated literally)

5.3 Study and Findings

The authors randomly collected 200 Malay proverbs (mostly categorised as simpulan bahasa) from two Malay proverb dictionaries (Abdul Aziz, 2012; Zanariah, 2012). The dictionaries provide sample sentences showing how to use each proverb in a Malay sentence. These sentences were then automatically translated into English using Google Translate and CitCat.com. Each translated sentence was evaluated by a language expert to determine whether the translation was correct. Although Bing Translate provides Indonesian-to-English translation, the authors decided not to include it in the study since it targets the Indonesian language.

The table below gives, for each translator, the percentage of proverbs that were 1) correctly translated, 2) literally translated (word-by-word or direct translation), and 3) translated into similar idioms available in English. Of the 200 proverb entries tested, CitCat.com scored better, with 52% correctly translated proverbs compared to 34% for Google Translate. As for literal translation, Google Translate produced 55% incorrect literal translations while CitCat.com produced 44.5%. Overall, the researchers concluded that CitCat.com does a better job of handling Malay proverb translation than Google Translate.

Criteria | Google Translate | CitCat.com
% correctly translated | 34.0 | 52.0
% literally translated but wrong | 55.0 | 44.5
% translated into similar idioms in English | 9.6 | 11.0

The “correctly translated” criterion covers translations that either provide the meaning or provide a similar idiom in English. For example, the Malay proverb “ada gula ada semut” is translated into “where bees, there is honey”. These two proverbs carry almost the same meaning:
“ada gula ada semut”: People will go to any place where they can earn their living (Shamsuddin, 2006).
“where bees, there is honey”: Where there are industrious persons, there is wealth, for the hand of the diligent maketh rich (Simpson & Speake, 2003).
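As a sketch of how the percentages could be computed from the expert's per-sentence judgements, the snippet below tallies one label per translated sentence. The label names and the sample list are hypothetical and are not the study's data; the only rule taken from the text is that a translation counts as correct when it gives either the meaning or a similar English idiom.

```python
from collections import Counter

# Labels that count as "correctly translated", per the criterion above.
CORRECT_LABELS = {"meaning", "similar_idiom"}


def summarise(labels):
    """Turn a list of per-sentence labels into percentages of the total."""
    total = len(labels)
    if total == 0:
        raise ValueError("no judgements to summarise")
    counts = Counter(labels)

    def pct(n):
        return round(100.0 * n / total, 1)

    return {
        "correct": pct(sum(counts[label] for label in CORRECT_LABELS)),
        "literal_wrong": pct(counts["literal_wrong"]),
        "similar_idiom": pct(counts["similar_idiom"]),
    }


if __name__ == "__main__":
    # Made-up judgements for five sentences, for illustration only.
    sample = ["meaning", "similar_idiom", "literal_wrong",
              "literal_wrong", "other_wrong"]
    print(summarise(sample))
```

Because similar idioms are counted inside the correct total, the three reported percentages are not expected to sum to 100.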

Note: The complete data set of sentences containing the Malay proverbs, together with the translations by Google Translate and CitCat.com, can be obtained from the corresponding author’s website at http://kerul.net .

6. CONCLUSIONS

The main objective of this paper is to discuss the challenges of detecting and translating Malay proverbs in machine translation. From the experiment conducted using two online machine translation services, we conclude that these services perform quite poorly in handling proverbs. However, CitCat.com does a better job than Google Translate in handling Malay proverbs. From our observation, neither system caters for proverbial phrases with more than one meaning. In the next phase, the researchers will propose a Malay proverb filterer, to be developed as a complementary tool that processes the Malay text before it is passed to these machine translation tools.
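The proposed filterer, which converts proverbs into plain Malay before machine translation, can be sketched as a preprocessing step. This is only an illustration under stated assumptions: the function names are hypothetical, the `translate` argument stands in for whichever MT service is used, matching is exact (affixes, stop words and word variants are not handled here), and the two lexicon entries reuse the proverb meanings given in Section 5.2.

```python
import re

# Proverb -> plain Malay meaning, taken from the examples in Section 5.2.
PLAIN_MALAY = {
    "mencari sesuap nasi": "bekerja",
    "bagai langit dengan bumi": "sangat berbeza",
}


def replace_proverbs(text, lexicon=PLAIN_MALAY):
    """Replace each known proverb in `text` with its plain Malay meaning."""
    # Longest proverbs first, so a short entry cannot split a longer one.
    for proverb in sorted(lexicon, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(proverb) + r"\b", re.IGNORECASE)
        # A function is used as the replacement so the meaning is inserted verbatim.
        text = pattern.sub(lambda match, p=proverb: lexicon[p], text)
    return text


def translate_with_filter(text, translate):
    """Filter proverbs first, then hand the plain text to an MT function."""
    return translate(replace_proverbs(text))


if __name__ == "__main__":
    # Made-up sentence; the lambda is a placeholder, not a real translator.
    sentence = "Dia mencari sesuap nasi di bandar."
    print(replace_proverbs(sentence))
    print(translate_with_filter(sentence, lambda s: "[MT input] " + s))
```

Keeping the filter separate from the translator means the same preprocessing could be placed in front of any of the services tested.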

REFERENCES
Abdul Aziz Abdul Rahman. 2012. Kamus Peribahasa. Pan Asia Publications Sdn. Bhd.
Abdullah Hassan and Ainon Mohd. 2011. Kamus Peribahasa Kontemporari, Edisi Ketiga. PTS Professional Publishing.
Agus T. Kwee, Flora S. Tsai and Wenyin Tang. 2009. Sentence-Level Novelty Detection in English and Malay. Advances in Knowledge Discovery and Data Mining, Lecture Notes in Computer Science, Volume 5476, pp. 40-51.
Aiti Aw, Sharifah Mahani Aljunied and Haizhou Li. 2009a. Malay Multi-word Expression Translation. Second Workshop on Technologies and Corpora for Asia-Pacific Speech Translation (TCAST 2009), Suntec, Singapore.
Aiti Aw, Sharifah Mahani Aljunied, Lianhau Lee and Haizhou Li. 2009b. Piramid: Bahasa Indonesia and Bahasa Malaysia Translation System Enhanced through Comparable Corpora. Second Workshop on Technologies and Corpora for Asia-Pacific Speech Translation (TCAST 2009), Suntec, Singapore.
Arvi Hurskainen. 2008. Multiword Expressions and Machine Translation. Technical Reports in Language Technology, Report No 1, 2008. <http://www.njas.helsinki.fi/salama/multiword-expressions-and-machine-translation.pdf>
Brahmaleen K. Sidhu, Arjan Singh and Vishal Goyal. 2010. Identification of Proverbs in Hindi Text Corpus and their Translation into Punjabi. Journal of Computer Science and Engineering, Vol. 2, Issue 1, July 2010.
Choy-Kim Chuah and Zaharin Yusoff. 2002. Computational Linguistics at Universiti Sains Malaysia. LREC 2002 Third International Conference on Language Resources and Evaluation. The University of Las Palmas de Gran Canaria, Canary Islands, Spain, 29 May - 31 May 2002.
CitCat.com. 2013. About CitCat. http://www.citcat.com/ Accessed on 7 January 2013.
Clynes, Adrian and Deterding, David. 2011. Standard Malay (Brunei). Journal of Phonetic Association, Vol. 42 Issue 2, 2011, pp.
Darwis Harahap. 1992. Sejarah Pertumbuhan Bahasa Melayu. Penerbit Universiti Sains Malaysia, Pulau Pinang.
Dimitra Anastasiou. 2010. Idiom Treatment Experiments in Machine Translation. PhD Thesis, Universität des Saarlandes. Unpublished.
Fontong Raine Boonlong. 2007. The Language Rights of the Malay Minority in Thailand. Asia Pacific Journal of Human Rights and the Law, Vol 1, 2007, pp. 47-63.
Gaule, Monika and Singh, Gurpreet Josan. 2012. Machine Translation of Idioms from English to Hindi. International Journal of Computational Engineering Research, Vol. 2, Issue 6, pp. 50-54.
Google Translate. http://translate.google.com/about/ Accessed on 4th January 2013.
Hejab Ma’azer Al Fawareh, Shaidah Jusoh and Wan Rozaini Sheikh Osman. 2008. Ambiguity in Text Mining. Proceedings of the International Conference on Computer and Communication Engineering 2008. May 13-15, 2008, Kuala Lumpur, Malaysia.
Lim, Kim Hui. 2003. BUDI as a Malay Mind: A Philosophical Study of Malay Ways of Reasoning and Emotion in Peribahasa. PhD Thesis, University of Hamburg. Unpublished.
Lim, L. T. and Hussein, N. 2006. Fast Prototyping of a Malay WordNet System. In: Proceedings of the Language, Artificial Intelligence and Computer Science for Natural Language Processing (LAICS-NLP) Summer School Workshop. Bangkok, Thailand, 2006, pp. 13–16.
Mieder, Wolfgang. 2003. Proverbs are Never out of Season. Popular Wisdom in the Modern Age. New York: Oxford University Press.
Mohd Noor, Yusnita and Jamaludin, Zulikha and Jusoh, Shaidah. 2010. A Retrospective View On The Promise On Machine Translation For Bahasa Melayu-English. In: Found in Translation: International Conference on Translation and Multiculturalism, 23-25 July 2010, University of Malaya.
Mosleh H. Al-Adhaileh and Tang Enya Kong. 1999. Example-Based Machine Translation Based on the Synchronous SSTC Annotation Schema. Machine Translation Summit VII, 1999.
Muhamad Taufik Abdullah, Fatimah Ahmad, Ramlan Mahmod and Tengku Mohd Tengku Sembok. 2009. Rules Frequency Order Stemmer for Malay Language. IJCSNS International Journal of Computer Science and Network Security, Vol. 9 No. 2, pp. 433-438.
Nadzeya Kiyavitskaya, Nicola Zeni, Luisa Mich and Daniel M. Berry. 2007. Requirements for Tools for Ambiguity Identification and Measurement in Natural Language Requirements Specifications. WER 2007: 197-206.
Nik Safiah Karim, Farid M Onn and Hashim Haji Musa. 2008. Tatabahasa Dewan Edisi Ketiga. Dewan Bahasa dan Pustaka, Kuala Lumpur.
N.H. Rais, M.T. Abdullah and R.A Kadir. 2011. Multiword Phrases Indexing for Malay-English Cross Language Information Retrieval. Information Technology Journal, 2011.
Peribahasa Melayu. 2012. Institut Tamadun dan Alam Melayu, UKM. 6 January 2012. http://malaycivilization.ukm.my/idc/groups/ukm_view_page/documents/malayportal/ukm_view_page.hcsp?nid=79
Princeton University. "About WordNet." WordNet. Princeton University. Accessed on 6 January 2012. http://wordnet.princeton.edu
S.A. Noah and F. Ismail. 2008. Automatic Classifications of Malay Proverbs Using Naive Bayesian Algorithm. Information Technology Journal, 2008.
Shamsuddin Ahmad. 2006. Kamus Peribahasa Melayu-Inggeris. PTS Publication.
Sharma, Monika and Goyal, Vishal. 2011. Extracting Proverbs in Machine Translation from Hindi to Punjabi using Relational Data Approach. International Journal of Computer Science and Communication, Vol. 2, No. 2, July-December 2011, pp. 611-613.
Silva C. and B. Ribeiro. 2003. The Importance of Stop Words Removal on Recall Values in Text Categorization. Proceedings of the International Joint Conference on Neural Networks, 20-24 July 2003, pp. 1661-1666.
Simpson, J.A. and Speake, J. 2003. The Concise Oxford Dictionary of Proverbs. Oxford University Press.
Suhaimi Ab. Rahman, Noorhayati Ahmad, Hafizullah Amin Hashim and Abdul Wahab Dahalan. 2006. Real Time On-Line English-Malay Machine Translation (MT) System. Third Real-Time Technology and Applications Symposium 2006, UPM, Malaysia.
Suhaimi Ab. Rahman and Normaziah Abd Aziz. 2004. Improving Word Alignment in an English-Malay Parallel Corpus for Machine Translation. In Pre-LREC 2004 Workshop on Amazing Utility of Parallel and Comparable Corpora, Lisbon, Portugal, May 2004.
Suhaimi Ab. Rahman, Normaziah Abdul Aziz and Badariah Solemon. 2008. An English-Malay Translation Memory System. CIT Workshops, pp. 619-624, 2008 IEEE 8th International Conference on Computer and Information Technology Workshops, 2008.
Supyan Hussin, Ding Choo Ming, Afendi Hamat and Arba’eyah Abdul Rahman. 2004. Kamus Peribahasa Melayu Digital yang Pertama. Sari 22 (2004), 49-61.
Susana Widyastuti. 2010. Peribahasa: Cerminan Kepribadian Budaya Lokal Dan Penerapannya Di Masa Kini. Proceeding of Seminar Nasional UTY, 3 Juli 2010.
Systrans. 2004. http://www.systransoft.com/systran/corporate-profile/translationtechnology/what-is-machine-translation.
Zanariah Abdul. 2012. Peribahasa WATAFA (Wajib Tahu dan Faham). Pelangi Book Publishing, Kuala Lumpur.
Zoltan Kovecses and Peter Szabo. 1996. Idioms: A View from Cognitive Semantics. Journal of Applied Linguistic, Vol 17(3), pp. 326-355.