INVESTIGATING SPELLING VARIANTS AND CONVENTIONALIZATION RATES OF THE
PHILIPPINE NATIONAL LANGUAGE’S ORTHOGRAPHIC SYSTEM USING
SPELLING VARIANTS SEEN IN A PHILIPPINE HISTORICAL CORPUS

Joel P. Ilao (1, 2) and Rowena Cristina L. Guevara (2)

(1) Center for Automation Research, De La Salle University
2401 Taft Avenue 1004, Malate, Manila, Philippines

(2) Digital Signal Processing Laboratory, Electrical and Electronics Engineering Institute
Room 410, EEE Building, Velasquez St., University of the Philippines – Diliman

e-mail: joel.ilao@delasalle.ph, gev@eee.upd.edu.ph

ABSTRACT

In this paper, we present our initial efforts to objectively describe the development of the Philippine national language’s system of orthography through a corpus-based analysis of a historical text corpus taken from documents published from the 1900s to the present times that are written in the Tagalog, Pilipino and Filipino languages. Spelling variants were extracted from the gathered historical corpus, and usage preference rates of competing spelling variant groups were graphed to illustrate various levels of development in the Philippine national language’s orthographic system. The Philippine orthographic system’s year-by-year development was also summarized through a graph of the overall conventionalization rate of all the spelling variant groups. The resulting graphs reveal interesting developmental patterns which can serve to guide Philippine language institutes in drafting effective language planning policies.

Index Terms: Corpus linguistics, Philippine orthography

1. INTRODUCTION

The Filipino language, the Philippines’ national language, has shown rapid and significant changes in its vocabulary, orthography and grammar, due to the Philippines’ rich colonial history and the conscious efforts at the national and institutional levels to standardize the grammar and orthography of the language. The idea of a national language was conceived in the early 1930s and was initially based on the Tagalog language spoken in the city of Manila, the Philippines’ socio-economic and political capital. Due to various socio-political reasons, the Philippine national language has been renamed twice in the past: as Pilipino in 1959 and then as Filipino in 1987. Furthermore, the development of the Philippine national language was made more complex owing to the country’s archipelagic nature and the influences of nations that have occupied the Philippines at various points in the past: Spain from the 16th century to the early 1900s, the United States of America from the 1900s to 1941 and from 1945 to 1946, and briefly Japan from 1941 to 1945 during World War II. This study is a step in objectively gauging the effect of various socio-political developments in Philippine history by investigating spelling variants taken from a Philippine historical corpus as a starting point.

Spelling variants are groups of words belonging to the same language that only slightly differ in spelling but have the same meaning. For example, the Filipino words ‘lalaki’ and ‘lalake’ are spelling variants that both translate to the English word “male”. Studies involving detection as well as generation of spelling variants were made in the past for various reasons. In particular, spelling variants generation is used to improve the hit rate of Information Retrieval systems (e.g. search engines) by substituting search words with their alternative spellings in searchable databases [1]. Standard forms of spelling variant groups are also used to standardize the transcriptions of speech corpora. In the area of historical corpus linguistics, the Variant Detector (VARD) system is used in detecting historical spelling variants to properly annotate historical documents and normalise them to their modern form, while groups of spelling variants were extracted from a Brazilian Portuguese historical corpus in building a Historical Dictionary of Brazilian Portuguese [2]. In this study, we are interested in extracting spelling variant groups from Tagalog/Pilipino/Filipino historical corpora in order to observe standardisation trends in the orthographic system of the Philippine national language.

Various methods have been employed in the past for detecting spelling variants. The primary method for detecting variants is to first determine the modes of differences among valid spelling variant pairs, and formalize
these differences into basic transformational rules that define exactly how spelling variant pairs are different in their spellings, either through letter or string replacements. Spelling transformational rules are either manually generated [2] [3] or automatically generated with optional weighting schemes [4]. Giusti (2007) has reported being able to generate 43 transformational rules using a manual approach, while Archer et al. (2006) have listed 68 replacement rules and 52 replacement rules for the German and English languages, respectively. When dealing with historical corpora, Giusti (2007) mentions other issues that make spelling variants detection more problematic: (1) broken words at the end of lines are not always hyphenated, and (2) there are uncommon typographical errors.

2. METHODOLOGY

Lists of possible spelling variants seen in a historical Tagalog/Pilipino/Filipino corpus were generated, based on a maximum allowable number of edit operations for transforming one word into a candidate word in a particular text corpus. Spelling transformation rules corresponding to classes of competing variant pairs were then created through manual inspection of the generated spelling variant lists. Frequency counts for each of the spelling transformation rules corresponding to competing spelling variant cases were made in order to determine which case is preferred by language users over the other. Details of the implementation of the modules just described are presented in the following sections.

2.1. Collecting the Tagalog/Pilipino/Filipino Corpus with Century-long analysis window

Electronic text corpora representing different periods in the Philippines’ history, published in Manila, starting from 1900 to the present times were gathered. These corpora consist of Tagalog documents published prior to 1959, Pilipino-written documents from 1959 to 1986, and Filipino-written documents from 1987 to the present. The corpora gathered were literary works such as novels, poems, and short stories. The time period for sampling published works started with the 1900s because works written prior to this period are mainly in Spanish. Since the Tagalog language became the basis of the national language, works written in Tagalog from the 1900s were considered. A good number of works included in this study are Tagalog novels, which are literary works published starting in the 1900s until the present times. Short stories, particularly those published in Liwayway magazine from 1922 to 2005, were also included. Various texts were taken from available online repositories: FilNet (http://www.filipiniana.net), Project Gutenberg (http://www.gutenberg.ph) and the eLib Project (http://elib.gov.ph).

2.2. Extracting the working lexicon from the historical Tagalog/Pilipino/Filipino corpora

The corpus with a century-long analysis window described in Section 2.1 is used in observing fluctuations in preferences by writers of Tagalog/Pilipino/Filipino material in using competing spelling variation rules covered by the analysis window. Spelling variant rules are defined here to be transformational rules that define a particular class of spelling variants. A word list for Tagalog/Pilipino/Filipino was built using the collected text corpora. Unwanted entries, such as numbers (e.g. dates, counts, phone numbers), misspellings and other terms considered non-standard (e.g. OCR errors), are removed from the resulting lexicon using a two-step method: (1) removing words that contain combinations of both numbers and letters, and then (2) removing entries that do not satisfy a minimum frequency of occurrence requirement in the corpora, which is set to 5.

2.3. Generation of spelling variant groups, and determination of Transformation Rules

Levenshtein (1966) defined three edit operations that can be used to compare one string of characters with another string of characters: (1) substitution, (2) insertion and (3) deletion [5]. Refer to Table 1 for examples of Levenshtein edit operations seen from spelling variant pairs in the Tagalog/Pilipino/Filipino written language. Two words are considered to have an edit distance of N if a minimum of N Levenshtein edit operations need to be applied successively to transform one word to the other word. For example, the word pair mag-asawa and nagaasawa can be transformed into each other through the 3 successive steps shown in Figure 1 (a substitution, a deletion and an insertion), so its edit distance is at most 3; substituting the hyphen directly with the letter a shortens the path to 2 operations, which is the minimum for this pair.

mag-asawa → nag-asawa → nagasawa → nagaasawa

Figure 1. Illustration of word transformation using Levenshtein edit distance

From the cleaned running lexicon, candidate spelling variant cases were extracted by looking for words that differ by at most two edit operations. The generated spelling variant cases were then inspected, and spelling transformation rules reflecting the categories under which the spelling variant cases fall were constructed.

Table 1. Levenshtein Edit Operations and their examples for Tagalog/Pilipino/Filipino language. Affected characters in the examples are written in bold.

Edit Operation    Example
substitution      anu-ano → ano-ano
deletion          hinantay → inantay
insertion         kolehiala → kolehiyala
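The lexicon cleaning of Section 2.2 and the candidate extraction of Section 2.3 can be sketched as follows. This is a minimal illustration and not the authors' implementation: the tokenizer, the lower-casing, the function names (`build_lexicon`, `levenshtein`, `candidate_variants`) and the brute-force pairwise comparison are all assumptions made for the example. Only the minimum frequency of 5 and the limit of two edit operations come from the text.

```python
import re
from collections import Counter
from itertools import combinations

# Assumption: a token is a run of letters/digits, optionally joined by hyphens.
TOKEN = re.compile(r"[^\W_]+(?:-[^\W_]+)*")


def build_lexicon(texts, min_count=5):
    """Count word forms, then apply the two-step cleaning of Section 2.2."""
    counts = Counter(
        tok.lower() for text in texts for tok in TOKEN.findall(text)
    )
    return {
        word: n
        for word, n in counts.items()
        # Step 1: drop entries containing digits (assumption: this also
        # removes purely numeric entries such as dates and counts).
        if not any(ch.isdigit() for ch in word)
        # Step 2: drop entries seen fewer than min_count times.
        and n >= min_count
    }


def levenshtein(a, b):
    """Minimum number of substitutions, insertions and deletions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def candidate_variants(lexicon, max_edits=2):
    """All word pairs in the lexicon within max_edits of each other."""
    pairs = []
    for a, b in combinations(sorted(lexicon), 2):
        if abs(len(a) - len(b)) > max_edits:
            continue  # the length gap alone already exceeds the limit
        if levenshtein(a, b) <= max_edits:
            pairs.append((a, b))
    return pairs


texts = ["lalaki " * 5 + "lalake " * 6 + "bihira " * 2]  # toy corpus
lexicon = build_lexicon(texts)      # "bihira" is dropped (count < 5)
print(candidate_variants(lexicon))  # [('lalake', 'lalaki')]
```

A threshold of two edits also pairs many unrelated short words, which is consistent with the manual inspection step described above. The pairwise loop is quadratic in the lexicon size, so a real lexicon would likely need an indexed search instead.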
Figure 2. Six types of spelling variant plots: (A) showing clean transition, (B) barrel-shaped, (C) with consistent demarcation, (D) with multiple crossovers, (E) funnel-shaped, (F) showing almost-equal usage preference rates for both competing spelling variant groups.
Normalized frequency graphs for competing word sets were then created to show transition periods in grammar and orthography from the 1900s to the present.

2.4. Determining the overall conventionalization rate of usage preferences of spelling variants

The regularization value of a group of competing word forms is defined as the ratio of the number of times the most commonly used word form for a particular publication year was seen in the text corpora on hand, over the total number of occurrences of all the competing word forms for that group. This metric can be seen as an indication of how a language standardises in its use, as seen from the publications of professional writers. For this study, all the spelling variant groups were extracted from the gathered historical corpus. Regularization values for each spelling variant group were then computed for all the years with available data. All the regularization values of the spelling variant groups were afterwards averaged per year in order to arrive at the overall conventionalization graph of the Philippine national language’s spelling system over the course of the 20th century.

3. RESULTS AND ANALYSIS

Appendix I shows the 29 transformation rules that were hand-crafted from manual inspections of the spelling variant groups extracted from the corpus using Levenshtein edit distance as similarity measure. There are six prevailing types of graphs observed from the normalized usage preference plots of all the 29 spelling variant groups. Figure 2 shows some representative plots of these categories.

Many of the transformation rules (e.g. <qui> vs. <ki>) are remnants of a Spanish-based system of orthography that started to be replaced at the beginning of the century. The spelling variant groups that involve the use of the old Spanish system of orthography tend to show clean transition lines, whose transition regions can be used to demarcate the stage from an old system of orthography to a modern system of orthography. Figure 2-A clearly shows that the transition region lies in the first decade of the 20th century, centering on the year 1905. The case of <v> vs. <b> in Figure 2-B shows a funnel-shaped graph for their corresponding usage preference rates, which can mean that recent developments have caused a steady resurgence in the use of the alternate forms. While some of the competing forms in the spelling variant groups show consistent domination of one form over the other, as in Figure 2-C, there are some spelling variants that show multiple crossovers in usage preference rates, such as in Figure 2-D, indicating that the rules for these particular spelling variant groups are not yet fully accepted within the writing community. The last two graphs (Figure 2-E and Figure 2-F) reflect the fact that Filipinos are generally confused over the use of <i> vs. <e> and <o> vs. <u> in the Filipino language’s written form. In fact, it was also shown that this confusion is not confined to the Tagalog/Pilipino/Filipino language, but extends to the other major Philippine languages Ilokano and Cebuano-Visayan as well [6].

Figure 3 shows the resulting overall conventionalization graphs of spelling variant groups, aggregated per year, by 5 years, by 10 years, and by 25 years. The per-year plots show that there was a dip in the 1920s in the conventionalization plots, and recent years also see a gradual decrease of regularity in language usage (from 2000 onwards). The 1920s can be seen as a transition period where the old ways of spelling were gradually being supplanted by new ways of spelling (capoua → kapuwa, cong → kong, etc.). The 2000s graph shows sudden variability in preferences, particularly in ways of adopting loan words. The plots aggregated by 5 years and 10 years show that the spelling system conventionalizes beginning from the 1920s until the mid-1980s. However, it starts to drop from the mid-1980s onwards.

Figure 3. Overall Conventionalization Graph for the Tagalog/Pilipino/Filipino

3. CONCLUSIONS AND FUTURE WORK

We have just described our method for objectively tracking the development of the Philippine national language’s system of orthography by investigating usage preference plots of competing spelling variant categories extracted from a historical corpus composed of works published from the 1900s to the present times. The normalized usage preference plots yielded six types of spelling variant cases: (1) those that exhibit transitional plots typical of Spanish-influenced spelling conventions that were being supplanted by their modern forms in the first decade of the 20th century, (2) those that consistently show one spelling convention being largely preferred over the other, (3) cases showing multiple crossovers, indicating that rules regarding these spelling variants still have not been settled, (4) cases that show barrel-shaped graphs, (5) cases that show funnel-shaped graphs, indicating a resurgence in the use of alternate spelling forms, and (6) cases showing almost equal usage preferences, indicating that both alternative spelling forms are widely accepted in the writing community.

The results of this study are particularly interesting to planners of the Filipino language, since they enable them to see, in very objective terms, how the national language develops, and what interventions have significantly affected its progress. Thus, a natural follow-up to this study would be to correlate language-related socio-political and legislative historical developments to the usage preference and conventionalization plots culled from the gathered historical corpus.
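The regularization value defined in Section 2.4, and the per-year averaging behind the overall conventionalization graph, can be sketched as follows. This is a hedged illustration only: the nested-dictionary layout, the function names and the counts in the example are hypothetical and are not taken from the corpus. The aggregation by 5, 10 and 25 years is left out because the text does not say how the values were pooled.

```python
from collections import defaultdict


def regularization(form_counts):
    """Share of the most frequent form among all competing forms.

    form_counts maps each competing word form (or rule side) to its
    number of occurrences in one publication year.
    """
    total = sum(form_counts.values())
    if total == 0:
        return None  # no data for this group in this year
    return max(form_counts.values()) / total


def conventionalization(groups):
    """Average the regularization values of all groups, year by year.

    groups has the (assumed) layout {group_name: {year: {form: count}}}.
    Years for which a group has no data do not contribute to the average.
    """
    per_year = defaultdict(list)
    for years in groups.values():
        for year, form_counts in years.items():
            value = regularization(form_counts)
            if value is not None:
                per_year[year].append(value)
    return {
        year: sum(values) / len(values)
        for year, values in sorted(per_year.items())
    }


# Hypothetical counts, for illustration only.
groups = {
    "<qui> vs. <ki>": {1901: {"qui": 30, "ki": 10}, 1910: {"qui": 5, "ki": 45}},
    "<i> vs. <e>": {1910: {"i": 6, "e": 4}},
}
for year, rate in conventionalization(groups).items():
    print(year, round(rate, 2))
```

A value near 1 means one form dominates in that year, while a value near 0.5 for a two-form group means both forms are used about equally often.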
4. ACKNOWLEDGMENTS

The authors would like to thank the Department of Science and Technology - Science Education Institute (DOST-SEI) for funding this research project as part of the Engineering Research and Development for Technology (ERDT) scholarship given to the first author.

5. REFERENCES

[1] A. Ernst-Gerlach and N. Fuhr, "Retrieval in text collections with historic spelling using linguistic and spelling variants," in JCDL '07: Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries, 2007.
[2] R. Giusti, A. J. Candido, M. Muniz, L. Cucatto and S. Aluisio, "Automatic detection of spelling variation in historical corpus: an application to build Brazilian Portuguese spelling variants dictionary," in Corpus Linguistics Conference, University of Birmingham, U.K., 2007.
[3] S. Orasmaa, R. Käärik, J. Vilo and T. Hennoste, "Information Retrieval of Word Form Variants," in Seventh conference on International Language Resources and Evaluation (LREC'10), Valletta, Malta, 2010.
[4] D. Archer, A. Ernst-Gerlach, S. Kempken, T. Pilz and P. Rayson, "The identification of spelling variants in English and German historical texts: manual or automatic?," in Abstracts of Digital Humanities, Paris: Sorbonne, 2006.
[5] V. I. Levenshtein, "Binary codes capable of correcting deletions, insertions, and reversals," Soviet Physics Doklady, vol. 10, no. 8, pp. 707-710, 1966.
[6] J. Ilao and T. G. R. Santos, "Comparative analysis of actual language usage and selected grammar and orthographical rules for Filipino, Cebuano-Visayan and Ilokano: a Corpus-based Approach," in 2nd Philippine Conference Workshop on Mother Tongue-Based Multilingual Education, Iloilo, 2012.

Appendix I. Hand-crafted transformation rules based on spelling variant groups extracted from running lexicon. Highlighted rows correspond to transformation rules influenced by the Spanish system of orthography which prevailed until the end of the 19th century.

Number  Transformation Rule                       Example
1       <c> vs. <k>                               balcon / balkon
2       <c> vs. <s>                               princesa / prinsesa
3       <ch> vs. <ts>                             derecho / deretso
4       DASH vs. NODASH                           bahay-kubo / bahaykubo
5       <ñ> vs. <n>                               hañgarin / hangarin
6       <ñ> vs. <ny>                              españa / espanya
7       <f> vs. <p>                               filosopo / pilosopo
8       <ia> vs. <iya>                            biblia / bibliya
9       IPINAGREDUP_vs_IPINAGNODUP                pinag-aagawan / pinapag-agawan
10      IPINAREDUP_vs_IPINANODUP                  ipinamamalas / ipinapamalas
11      <i> vs. <e>                               babai / babae
12      <iye> vs. <ie>                            impiyerno / impierno
13      <iyo> vs. <io>                            kolehiyo / kolehio
14      <j> vs. <h>                               jardin / hardin
15      <ks> vs. <x>                              taksi / taxi
16      <ll> vs. <ly>                             martillo / martilyo
17      MNAI vs MNA                               maibibili / mabibili; naisasagot / nasasagot
18      MNAKAPAGREDUP vs MNAKAPAGNODUP            makapag-iisa / makakapag-isa; nakapag-uutos / nakakapag-utos
19      MNAKAREDUP vs MNAKANODUP                  makapag-iisa / makakapag-isa; nakapagsasabi / nakakapagsabi
20      MUTEH vs WITHH                            ospital / hospital
21      <ng> vs. <n>                              kangi-kangina / kani-kanina
22      <o> vs. <u>                               anu-ano / ano-ano
23      PrefixI vs WithoutPrefixI                 iginawa / ginawa
24      <qui> vs. <ki>                            aquin / akin
25      RepeatingVowels vs non-repeatingVowels    aakalain / akalain
26      <ui> vs. <wi>                             dalauin / dalawin
27      <v> vs. <b>                               españa / espanya
28      <w> vs. <u>                               asawa / asaua
29      <z> vs. <s>                               luzon / luson
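The letter and string replacement rules in Appendix I lend themselves to a simple automatic check of whether a candidate word pair falls under a given rule. The sketch below is an assumption-laden illustration, not the procedure used in the study, where the rules were assigned by manual inspection. The `RULES` dictionary holds only a subset of the replacement rules, and the helper names are hypothetical. Morphological rules such as the reduplication cases are not covered.

```python
# A subset of the letter/string replacement rules from Appendix I,
# written as (one spelling, competing spelling).
RULES = {
    "<c> vs. <k>": ("c", "k"),
    "<c> vs. <s>": ("c", "s"),
    "<ch> vs. <ts>": ("ch", "ts"),
    "<i> vs. <e>": ("i", "e"),
    "<j> vs. <h>": ("j", "h"),
    "<qui> vs. <ki>": ("qui", "ki"),
    "<z> vs. <s>": ("z", "s"),
}


def one_replacement(word, old, new):
    """Yield every string obtained by replacing one occurrence of old."""
    start = word.find(old)
    while start != -1:
        yield word[:start] + new + word[start + len(old):]
        start = word.find(old, start + 1)


def rules_for_pair(a, b, rules=RULES):
    """Names of the rules under which a and b differ by one replacement."""
    return [
        name
        for name, (old, new) in rules.items()
        if b in one_replacement(a, old, new)
        or a in one_replacement(b, old, new)
    ]


print(rules_for_pair("aquin", "akin"))     # ['<qui> vs. <ki>']
print(rules_for_pair("balcon", "balkon"))  # ['<c> vs. <k>']
```

A pair can match more than one rule once the full rule set is used, so the result is a list and such cases would still need to be resolved by inspection.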
