You are on page 1of 9


discussions, stats, and author profiles for this publication at:

A Grammar Checker for Tagalog using


Article · May 2012


2 826

2 authors:

Nathaniel Oco Allan Borra

National University (Philippines) De La Salle University


Some of the authors of this publication are also working on these related projects:

eParticipation View project

eParticipation 2.0 View project

All content following this page was uploaded by Nathaniel Oco on 31 August 2015.

The user has requested enhancement of the downloaded file.

A Grammar Checker for Tagalog using LanguageTool

Nathaniel Oco Allan Borra

Center for Language Technologies Center for Language Technologies
College of Computer Studies College of Computer Studies
De La Salle University De La Salle University
2401 Taft Avenue 2401 Taft Avenue
Malate, Manila City Malate, Manila City
1004 Metro Manila 1004 Metro Manila
Philippines Philippines

a part-of-speech tag based on the declarations in

Abstract the Tagger Dictionary. The words and their part-
of-speech are used to check for patterns that
match those declared in the rule file. If there is a
pattern match, an error message is shown to the
This document outlines the use of Language user. Currently, LanguageTool supports Belaru-
Tool for a Tagalog Grammar Checker. Lan- sian, Catalan, Danish, Dutch, English, Esperanto,
guage Tool is an open-source rule-based en-
French, Galician, Icelandic, Italian, Lithuanian,
gine that offers grammar and style checking
functionalities. The details of the various lin-
Malayalam, Polish, Romanian, Russian, Slovak,
guistic resource requirements of Language Slovenian, Spanish, Swedish, and Ukrainian to a
Tool for the Tagalog language are outlined and certain degree.
discussed. These are the tagger dictionary and Tagalog is the basis for the Filipino language,
the rule file that use the notation of Language the official language of the Philippines. Accord-
Tool. The expressive power of Language ing to a data collected by Cheng et al. (2009),
Tool’s notation is analyzed and checked if there are 22,000,000 native speakers of Tagalog.
Tagalog linguistic phenomena are captured or This makes it the highest in the country, fol-
not. The system was tested using a collection lowed by Cebuano with 20,000,000 native
of sentences and these are the results: 91%
speakers. Tagalog is very rich in morphology,
precision rate, 51% recall rate, 83% accuracy
Ramos (1971) stated that Tagalog words are
normally composed of root words and affixes.
1 Credits Dimalen and Dimalen (2007) described Tagalog
as a language with “high degree of inflection”.
LanguageTool was developed by Naber (2003). Jasa et al. (2007) stated that the number of
It can run as a stand-alone program and as an available Tagalog grammar checkers is limited.
extension for OpenOffice.Org1 and LibreOffice2. Tagalog is a very rich language and Language-
LanguageTool is distributed through Language- Tool is a flexible language. The development of
Tool’s website: Tagalog support for LanguageTool provides a
readily-available Tagalog grammar checker that
2 Introduction can be easily updated.
LanguageTool is an open-source style and 3 Related Works
grammar checker that follows a manual-based
rule-creation approach. Ang et al. (2002) developed a semantic analyzer
LanguageTool utilizes rules stored in an xml that has the capability to check semantic rela-
file to analyze and check text input. The text in- tionships in a Tagalog sentence. Jasa et al. (2007)
put is separated into sentences, each sentence is and Dimalen and Dimalen (2007) both developed
separated into words, and each word is assigned syntax-based Filipino grammar checker exten-
sions for OpenOffice.Org Writer. In syntax-
1 OpenOffice.Org is available at based grammar checkers, error-checking is based
2 LibreOffice is available at on the parser. An input is considered correct if

Proceedings of the 9th Workshop on Asian Language Resources, pages 2–9,
Chiang Mai, Thailand, November 12 and 13, 2011.
parsing succeeds, erroneous if parsing fails. Na- word declarations for the Tagalog Tagger Dic-
ber (2003) explained that syntax-based grammar tionary.
checkers need a complete grammar to function.
Erroneous sentences that are not covered by the 4.2 Tagset for the Tagger Dictionary
grammar can be flagged as error-free input. A tagset for the Tagalog tagger dictionary is pro-
posed. The tagset is based on the tagset devel-
4 LanguageTool Resources oped by Rabo and Cheng (2006) and the modifi-
Discussed here are the different language re- cations by Miguel and Roxas (2007). The discus-
sources required by the tool. The notations, for- sions on Tagalog affixation (1971) and case sys-
mats, and acquisition of resources are outlined tem of Tagalog verbs (1973) by Ramos, verb
and discussed. aspect and verb focus by Cena and Ramos
(1990), different Tagalog part-of-speech by Cu-
4.1 Tagger Dictionary bar and Cubar (1994), and inventory of verbal
affixes by Otanes and Schachter (1972) were
Language Tool utilizes a dictionary file, called
taken into account.
the Tagger Dictionary. The tagger dictionary,
Table 1 shows the proposed noun tags. Nouns
which contains word declarations, is utilized in
were classified into proper nouns, common
pattern matching to identify and tag words with
nouns, and abbreviations. Kroeger (1993) ex-
their part-of-speech.
plained that the determiners used for proper
The tagger dictionary can be a txt file, a dict
nouns and common nouns are different to a cer-
file, or an FSA-encoded 3 dict file. The tagger
tain degree.
dictionary contains three columns, separated by a
tag. The first column is the inflected form. The
NOUN: [tag] [semantic class]
second column is the base form. The third col-
umn is the part-of-speech tag. The format for the
Tagalog tagger dictionary follows the three- NPRO Proper Noun
column format. The first column is the inflected NCOM Common Noun
form, which could contain ligatures. The second NABB Abbreviation
column is similar to the first column, except that Table 1. Noun Tags
ligatures were omitted. This serves as the base
form. The third column is the proposed tag, Table 2 shows the proposed pronoun tags.
which is composed of the part-of-speech or POS Grammatical person and plurality attribute were
of the word and the corresponding attribute-value added to aid in distinguishing different types of
pair, separated by a white space character. This pronouns.
serves as the POS tag. Figure 1 shows a sample
declaration from the Tagalog tagger dictionary. PRONOUN: [tag] [grammatical person]
doktor doktor NCOM
PANP “ang” Pronouns
ako ako PANP ST S
PNGP “ng” Pronouns
kumakain kumakain VACF IN
PSAP “sa” Pronouns
nasa nasa PRLO
mga mga DECP PAND “ang” Demonstratives
hoy hoy INTR PNGD “ng” Demonstratives
PSAD “sa” Demonstratives
Figure 1. Tagalog Tagger Dictionary Example
PFOP Found Pronouns
PINP Interrogative Pronouns
Evaluation and test data from different re- PCOP Comparison Pronouns
searches on Tagalog POS Tagging (Bonus, 2004; PIDP Indefinite Pronouns
Cheng and Rabo, 2006; Miguel and Roxas, POTH Other
2007) were used to come up with almost 8,000 Grammatical
ST 1st person
FSA stands for Finite State Automata. Morfologik was ND 2nd person
used to build the binary automata. Morfologik is available at RD 3rd person
stemming/ NU Null

S Singular ADVERB: [tag] [modifies]
P Plural Tag
B Both AVMA Manner
Table 2. Pronoun Tags AVNU Numeral
AVDE Definite
Table 3 shows the proposed verb tags. Verb AVEO Comparison, group I
focus and verb aspect were added. The verb fo- AVET Comparison, group II
cus can indicate the thematic role the subject is AVCO Comparative, group I
taking. This is useful for future works. AVCT Comparative, group II
AVSO Superlative, group I
VERB: [focus] [aspect] AVST Superlative, group II
Focus AVSC Slight comparison
VACF Actor Focus AVAY Agree (Panang-ayon)
VOBF Object / Goal Focus AVGI Disagree (Pananggi)
VBEF Benefactive Focus AVAG Possibility (Pang-agam)
VLOF Locative Focus AVPA Frequency (Pamanahon)
VINF Instrument Focus AVOT Other
VOTF Other Modifies
Aspect VE Verb
NE Neutral AD Adjective
CM Completed AV Adverb
IN Incompleted AL Applicable to All
CN Contemplated Table 5. Adverb Tags
RC Recently Completed
OT Other Conjunctions, prepositions, determiners, inter-
Table 3. Verb Tags jections, ligatures, particles, enclitic, punctua-
tion, and auxiliary words are also part of the pro-
Table 4 shows the proposed adjective tags. posed tagset. These tags however, do not contain
Plurality was added to handle number agreement. additional properties or corresponding attribute-
Kroeger (1993) stated that if the plurality of the value pairs. Overall, the tagset has a total of 87
nominative argument does not match the plural- tags from 14 POS and lexical categories.
ity of the adjective or the predicate, the sentence
considered ungrammatical. 4.3 Rule File
The rule file is an xml file used to check errors in
ADJECTIVE: [tag] [plurality] a sentence. If a pattern declared in the rule
Tag matches the input sentence, an error is shown to
ADMO Modifier the user.
ADCO Comparative The rule file, case insensitive by default, is
ADSU Superlative composed of several rule categories which may
ADNU Numeral cover but is not limited to spelling, grammar,
ADUN Unaffixated style, and punctuation errors. Each rule category
ADOT Other is composed of one or more rules or rule groups.
Plurality Each rule is composed of different elements and
S Singular attributes. The three basic elements a rule has are
P Plural pattern, message, and example. The pattern ele-
N Null ment is where the error to be matched is de-
Table 4. Adjective Tags clared. The message element is where the feed-
back and suggestion, if applicable, is declared.
Table 5 shows the proposed adverb tags. An The example element is where incorrect and cor-
additional attribute was added to distinguish the rect examples are declared. Figure 2 shows a
POS of the word being modified. Ramos (1971) pseudocode that describes what happens in the
stated that adverbs in Tagalog can modify verbs, event a pattern is matched and Figure 3 shows an
adjectives, and other adverbs. example rule in the Tagalog rule file.

ding? = din or ding
if(pattern in rule file = pattern in input) { ring? = rin or ring
mark error; .*[aeiou] = any word that ends in a vowel
show feedback; .*[bcdfghjklmnpqrstvwxyz] = any word that
provide suggestions if applicable; ends in a consonant
Figure 4. Regular expression usage
Figure 2. Pseudocode
<token bla="x">think</token>
<rule id="MGA_MGA" name="mga mga matches the word “think”
(ang mga)">
<pattern case_sensitive="no" <token regexp="yes">think|say</token>
mark_from="0"> matches the regular expression think|say, i.e.
<token>mga</token> the word “think” or “say”
</pattern> <token postag="VB" />
<message>Do you mean <token>house</token>
<suggestion>ang matches a base form verb followed by the
\2</suggestion>? "mga" can word house.
not be followed by another
"mga". <token>cause</token> <token regexp="yes"
</message> negate="yes">and|to</token>
<short>Word Repetition</short> matches the word “cause” followed by any
<example correction="ang mga" word that is not “and” or “to”
<marker>mga mga</marker> <token postag="SENT_START" /> <to-
tanawin.</example> ken>foobar</token>
<example type="correct">Maganda matches the word “foobar” only at the begin-
<marker>ang mga</marker> ning of a sentence
</rule> Figure 5. Different methods of pattern-matching
described in LangaugeTool’s website
Figure 3. Rule File Declaration for “ang ang”
word repetition The following resources were used as basis in
developing rules: Makabagong Balarila ng Pili-
Pattern matching can utilize tokens, POS tags, pino (Ramos, 1971), Writing Filipino Gramamar:
and a combination of both to properly capture Traditions and Trends (Cubar and Cubar, 1994),
errors. Regular expressions 4 are also used to Modern Tagalog: Grammatical Explanations and
simplify or merge several rules. Figure 4 shows Exercises for Non-native Speakers (Cena and
different examples of using regular expression. Ramos, 1990), Tagalog Reference Grammar
Different methods of pattern-matching explained (Otanes and Schachter, 1972) and Phrase Struc-
in LanguageTool’s website are shown in Figure ture and Grammatical Relations in Tagalog
5. It should be noted that if a particular error is (Kroeger, 1993).
not covered by the tagger dictionary and the rule
file, the error will not be detected. 5 Tagalog Grammar Checking
Errors are classified into three types: wrong
word, missing word, and transposition of words.
This section discusses the different types of er-
rors and the corresponding method for capturing
these errors. Figure 6 shows a pseudocode ex-
plaining how an error is classified.

Standard Regular Expression Engine of Java. Described at:,5.0/docs/api/java/util/r

if(POS sequence != unoccurring) ken matching is performed. Regular expressions
Wrong Word; were employed to make rule files shorter.
else if(POS sequence = unoccurring)
if(POS sequence before != Correct:
unoccurring || POS after != unoccur- Magnanakaw din siya.
ring) He is also thief.
Missing Word;
else Incorrect:
Transposition; Magnanakaw rin siya.
(For: He is also thief.)
Figure 6. Pseudocode
Figure 8. Sound and Letter Change
5.1 Wrong Words
Wrong words are often caused by using the Other errors like proper adverb and ligature
wrong determiner and affixation rule. Also, mor- usage also fall into this type of error.
phophonemic change and verb focus are often
5.2 Missing Words
not taken into consideration. There are cases
where relying on part-of-speech alone will not Missing words are often due to missing deter-
capture certain errors. To address this issue, miners, particles, markers, and other words com-
grammatical person and plurality of pronouns, posed of several letters. Usually, missing words
focus and aspect of verbs, plurality of adjectives, cause irregular and unoccurring POS sequence.
and word modified by adverbs were considered Figure 9 illustrates an example. Unoccurring
in developing the tagset. Consider the example in POS sequence are checked and matched against
Figure 7. Both have the same POS but only one specific rules. The missing word is added to the
is correct. Kroeger (1993) pointed out that plural- sentence as feedback. In the sentences in Figure
ity in adjectives is demonstrated by the redupli- 9, it is unnatural for a pronoun to be immediately
cation of the first syllable. An error caused by the followed by an adjective. Missing words are cap-
disagreement of the plurality of the adjective and tured by looking for unoccurring POS sequence
the plurality of the nominative argument can not often caused by a missing word.
be handled by considering the part-of-speech
only. Correct:
Ikaw ay maganda.
Correct: Pronoun Marker Adjective
Magaganda kami. You beautiful
Adjective 1st person Pronoun You are beautiful.
Plural Plural
Beautiful we. Incorrect:
We are beautiful. Ikaw maganda.
Pronoun Adjective
Incorrect: You beautiful
Magaganda ako. (For: You are beautiful)
Adjective 1st person Pronoun
Plural Singular Figure 9. Missing Lexical Marker “ay”
Beaautiful me. 5.3 Transposition
(For: I am beautiful)
The process of detecting errors caused by trans-
Figure 7. Number Agreement position is similar to missing words. The main
difference is tokens and POS tags before and af-
Consider the sentences in Figure 8. The en- ter the unoccurring POS sequence are considered
clitic “din” is used if the last letter of the preced- and checked for any irregularities.
ing word is a consonant. Otherwise, “rin” is
used. Cena and Ramos (1990) explained that
sound and letter changes occur in affixation and
even in word boundaries. “din” and “rin” is one
of many examples. To address this, a simple to-

6 Performance of Language Tool: Re- The system flagged 4 error-free sentences as
sults and Analysis erroneous. This is mainly because of wrong dec-
larations in the tagger dictionary file. Figure 13
The system was initially tested using a collection shows one of the sentences. In the tagger dic-
of sentences. The collection is composed of tionary, “mag-aral” was declared as a noun and
evaluation data used in FiSSAn (Ang et al., “maingay” was declared both as an adverb and as
2002), LEFT (Chan et al., 2006), and PanPam an adjective. In the Tagalog language, if a com-
(Jasa et al., 2007). Test data used by Dimalen mon noun is preceded by an adjective, there
(2003) examples from books (Kroeger, 1993; should be a ligature between them. Figure 14
Ramos, 1971), and additional test data are also demonstrates proper Tagalog ligature usage.
part of the collection. A total of 272 sentences
from the collection were used. Table 6 shows a
summary of figures. 186 out of 190 error-free Umalis ang mabait
Verb Det Adjective
sentences were marked as error-free, 4 out of 190
error-free sentences were marked as erroneous, Leave the good
42 out of 82 erroneous sentences were marked as
ngunit maingay mag-aral.
erroneous, and 40 out of 82 erroneous sentences
were marked as error-free. Conjunct Adverb Verb
but noisy study
Sentences Correctly Incorrectly Total Figure 13. Flagged as erroneous
Flagged Flagged
Error-free 186 4 190
Erroneous 42 40 82 Root word ends with a vowel, add “-ng”
Total 228 44 272 Matalino + bata
Table 6. Summary of Figures Adjective Common Noun
Intelligent Child
The test showed that the system has a 91%
precision rate, 51% recall rate, and 83% accuracy =Matalinong bata
rate. Figure 10, Figure 11, and Figure 12 show Intelligent Child
the formulas used for precision, recall, and accu-
racy, respectively. True Positives refer to errone- Root word ends with the letter “n”, add “-g”
ous evaluation data properly flagged by the sys- Matulin + bata
tem as erroneous. False Positives refer to error- Adjective Common Noun
free evaluation data flagged by the system as er- Fast Child
roneous. True Negatives refer to error-free
evaluation data properly flagged by the system as =Matuling bata
error-free. False Negatives refer to erroneous Fast Child
evaluation data flagged by the system as error-
free. Root word ends with a consonant, add “na”
Matapang + bata
Adjective Common Noun
Brave Child
TruePositives + FalsePositives
Figure 10. Precision Formula =Matapang na bata
Brave Child
Figure 14. Ligature usage
TruePositives + FalseNegatives
Figure 11. Recall Formula The presence of ellipsis in one of the sen-
tences is another reason why error-free sentences
TruePositives + TrueNegatives were flagged as erroneous. Ellipsis was not de-
clared in the rule file. This resulted in two sen-
tences being recognized as one.
Figure 12. Accuracy Formula The system flagged 40 out of 42 erroneous
sentences as error-free. A close analysis on there
errors reveal that majority of the sentences con-

tains free-word order errors, transposition of For comparative evaluation, the same collec-
more than 2 words, extra words. Some sentences tion was tested on PanPam (Jasa et al., 2007) and
contain errors that focus on semantic checking. these are the results: 23% precision rate, 46%
Figure 15 shows 9 of these sentences. These are recall rate, and 38% accuracy rate. Table 7 shows
the type of errors that are not handled by the sys- a summary of figures.
tem and are not declared in the rule file. Future
research works can focus on these areas. Sentences Correctly Incorrectly Total
Flagged Flagged
Humihinga ang bangkay. Error-free 68 122 190
The corpse is breathing. Erroneous 38 44 82
Total 106 166 272
Nagluto ang sanggol. Table 7. PanPam Results
The baby cooked.
The comparative evaluation shows that the
Naglakad ang ahas. system scored 68% higher than PanPam in terms
The snake walked. of precision, 5% higher in terms of recall, and
37% higher in terms of accuracy.
Kumain ang plato. Overall, these findings reaffirm earlier analy-
The plate ate. sis by Konchady (2009) that rule-based grammar
checkers that follow a manual-based rule-
Nabasag ang basong mabilis. creation approach tend to produce low recall rate
The fast glass shattered. but precision rate is above average. This is be-
cause the total number of rules isn’t sufficient to
Kumain ang plato sa baso. cover a variety of errors. Also, because of pat-
The plate ate at the glass. tern-matching, majority of the errors detected are
indeed errors. It is also important to note, espe-
Kumain ang aso ng plato. cially in the case of LanguageTool, that the pat-
The dog ate the plate. terns being captured are erroneous sentences and
not error-free sentences. This makes rule-based
Tumakbo ang sapatos. grammar checkers dependent on the rules de-
The shoe ran. clared for error checking coverage.
LanguageTool can support the Tagalog lan-
Nagluto ang pusa ng pagkain. guage to a certain degree. Although developing a
The cat cooked food. tagger dictionary and a rule file is a tedious task,
it is necessary to create a tagger dictionary, a tag-
Figure 15. Flagged as error-free set, and rules that can handle the different Taga-
log linguistic Penomena.
Among the 42 erroneous sentences it correctly
flagged as erroneous, the system provided the Acknowledgements
correct feedback for 41 sentences. The sentence
with incorrect feedback is shown in Figure 16. The authors acknowledge developers and main-
The sentence, used to test free-word order, con- tainers of LanguageTool especially Daniel Na-
tains transposition of several words. The system ber, Dominique Pellé, and Marcin Milkowski for
detected it as a missing last word error because being instrumental to the completion of this
the determiner “ang” can not be the last word of study and to the completion of the Tagalog sup-
a sentence. port for LanguageTool. The August 07, 2011
snapshot of LanguageTool would not include
Tagalog support if not for their assistance. The
Pinalo tatay ng makulit batang ang. authors also acknowledge the opinions and
thoughts shared through email by Vlamir Rabo,
Correct Form: Manu Konchady, and Mark Johnson.
Pinalo ng tatay ang batang makulit.
The father spanked the naughty child. References
Figure 16. Sentence with incorrect feedback Charibeth K. Cheng, Nathalie Rose T. Lim, and Ra-
chel Edita O. Roxas. 2009. Philippine Language

Resources: Trends and Directions. Proceedings of pino Sentence Syntax and Semantic Analyzer. Un-
the 7th Workshop on Asian Langauge Resource dergraduate Thesis. De La Salle University, Ma-
(ALR7), Singapore. nila.
Charibeth K. Cheng and Vlamir S. Rabo. 2006. Paul Kroeger. 1993. Phrase Structure and Gram-
TPOST: A Template-based Part-of-Speech Tagger martical Relations in Tagalog. CSLI Publications,
for Tagalog. Journal of Research in Science, Com- Stanford, CA.
puting, and Engineering, Volume 3, Number 1.
Resty M. Cena and Teresita V. Ramos. 1990. Modern
Dalos D. Miguel and Rachel Edita O. Roxas. 2007. Tagalog: Grammatical Explanations and Exercises
Comparative Analysis of Tagalog Part of Speech for Non-native Speakers. University of Hawaii
(POS) Taggers. Proceedings of the 4th National Press, Honolulu, HI.
Natural Language Processing Research Sympo-
Teresita V. Ramos. 1971. Makabagong Balarila ng
sium (NNLPRS), CSB Hotel, Manila. ISSN 1908-
Pilipino. Rex Book Store, Manila.
Teresita V. Ramos. 1973. The Case System of Taga-
Daniel Naber. 2003. A Rule-Based Style and Gram-
log Verbs. Doctoral Dissertation. University of
mar Checker. Diploma Thesis. Bielefeld Univer-
Hawaii. Honolulu, HI.
sity, Bielefeld.
Davis Muhajereen D. Dimalen and Editha D. Di-
malen. 2007. An OpenOffice Spelling and Gram-
mar Checker Add-in Using an Open Source Exter-
nal Engine as Resource Manager and Parser. Pro-
ceedings of the 4th National Natural Language
Processing Research Symposium (NNLPRS), CSB
Hotel, Manila.
Don Erick J. Bonus. 2004. A Stemming Algorithm for
Tagalog Words. Proceedings of the 4th Philippine
Computing Science Congress (PSCS 2004), Uni-
versity of the Philippines – Los Baños, Laguna.
Editha D. Dimalen. 2003. A Parsing Algorithm for
Constituent Structures of Tagalog. Master’s The-
sis. De La Salle University, Manila.
Ernesto H. Cubar and Nelly I. Cubar. 1994. Writing
Filipino Grammar: Traditions and Trends. New
Day Publishers, Quezon City.
Erwin Andrew O. Chan, Chris Ian R. Lim, Richard
Bryan S. Tan, and Marlon Cromwell N. Tong.
2006. LEFT: Lexical Functional Grammar Based
English-Filipino Translator. Undergraduate The-
sis. De La Salle University, Manila.
Fe T. Otanes and Paul Schachter. 1972. Tagalog Ref-
erence Grammar. University of California Press,
Berkeley, CA.
Manu Konchady. 2009. Detecting Grammatical Er-
rors in Text using a Ngram-based Ruleset. Re-
trieved from:
Michael A. Jasa, Justin O. Palisoc, and Martee M.
Villa. 2007. Panuring Pampanitikan (PanPam): A
Sentence Syntax and Semantic Based Grammar
Checker for Filipino. Undergraduate Thesis. De La
Salle University, Manila.
Morgan O. Ang, Sonny G. Cagalingan, Paulo Justin
U. Tan, and Reagan C. Tan. 2002. FiSSAn: Fili-

View publication stats