
Finite State Morphology: The Turkish Nominal Paradigm

A Thesis by Philip Makedonski /nero_pdm@yahoo.com/
Submitted to Seminar für Sprachwissenschaft, Eberhard Karls Universität Tübingen, 72074 Tübingen, Germany
In fulfillment of the requirements for the degree Bachelor of Arts in Computational Linguistics

July 2005


ABSTRACT
Finite State Morphology: The Turkish Nominal Paradigm
Makedonski, Philip
Seminar für Sprachwissenschaft, Eberhard Karls Universität Tübingen
Supervisor: Dr. Dale Gerdemann
July 2005, 24 Pages

In this thesis I present a finite state approach to the inflectional morphology of Turkish nouns, with the ultimate goal of building a morphological analyzer for Turkish nouns. We'll be dealing primarily with the principles of vowel harmony across the different inflectional noun suffixes in Turkish, as the most interesting phenomenon, and with my implementation of these principles in the Xerox Finite State Toolbox (XFST). We will also pay attention to the other morphophonological alternations that occur, as a result of the inflectional processes, both in the stem and in the suffixes attached to it.

Keywords: Natural Language Processing, Finite State Networks, Morphology, Computational Linguistics, Turkish


To my family, to my love


ACKNOWLEDGEMENTS

First, I'd like to thank my supervisor Dr. Dale Gerdemann for his support and advice on this project. I appreciate the freedom and independence I had in the choice of topic and approach. I would also like to thank Dr. Sandra Kübler for her support and understanding throughout this course of studies. Thanks to my friends for their understanding, which in many cases turned out to be crucial for my progress. Many, many thanks to my family for their support all the time, no matter what was happening. And most of all, special thanks to Nevin Recep for sparking my interest in the Turkish language and supporting me all the time.

TABLE OF CONTENTS

ABSTRACT
DEDICATION
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
1. INTRODUCTION
  1.1 MOTIVATION
  1.2 MORPHOLOGY
  1.3 RELATED WORK
  1.4 OVERVIEW
2. BACKGROUND
  2.1 TURKISH
  2.2 FINITE STATE TECHNOLOGY
    2.2.1 Finite State Automata (FSA)
    2.2.2 Finite State Transducers (FST's)
  2.3 XFST
3. THE MODEL
  3.1 THE NOMINAL PARADIGM OF TURKISH. MORPHOTACTICS
    3.1.1 Inflection for Number
    3.1.2 Case Inflection
    3.1.3 Inflection for Possession
    3.1.4 Lexical Exceptions – the 'su' case
  3.2 PHONOLOGICAL ALTERNATION RULES
    3.2.1 Resolving Vowel Harmony
    3.2.2 Consonant Alternation Rules
      Final Consonant (De)Voicing
      (De)Gemination
    3.2.3 Other Alternations
      Vowel Insertion/Deletion
      The Glottal Stop
    3.2.4 Rule Order
  3.3 IMPLEMENTATION
    3.3.1 The Lexicon
    3.3.2 The Rules Component
      Vowel Harmony Rules
      Consonant Alternation Rules
    3.3.3 Fixing the Morphotactics
4. CONCLUSIONS
5. FUTURE WORK
APPENDIX A: LIST OF ABBREVIATIONS
APPENDIX B: LEXC CODE SAMPLES
APPENDIX C: ON REPLACEMENT RULES

1. Introduction

1.1 Motivation

In morphologically rich languages like Bulgarian, Russian, Turkish, Spanish and many others, any adequate Natural Language Processing (NLP) application requires a good morphological component, due to the increased role of morphology in these languages. This is especially valid for agglutinative languages like Turkish, where the concept of a word is much wider. Different relations between the words in a sentence are mostly expressed by affixes. As a consequence, grammatical features and functions typically assigned to the syntactic structure in morphologically poor languages like English are often represented in the morphological structure. In Turkish for example, the nominal inflectional paradigm has three basic types of suffixes, for number, possession and case (the number varies in the different sources), and the verbal inflectional paradigm is even more complicated with its eight affixes (again, the number might be different depending on the source).

This in turn would require a rich lexicon, and a lexicon that explicitly lists all the possible forms as separate entries would quickly explode into an unmanageable size, due to the rich inflectional and derivational possibilities for a single base (dictionary) form (stem). There are approximately 20,000 stems and 300-400 roots actively used in Turkish, which effectively amount to millions of inflected and derived forms. Furthermore, many affixes and roots in Turkish change their shape depending on the environment and have to obey various constraints like vowel harmony. This further increases the demand for automated morphological analysis.

As it turns out, morphological structures are much more regular than syntactic ones. They can be handled very efficiently and accurately using sets of rules and compact lexicons of base forms (stems). Furthermore, important semantic and grammatical information could be encoded in such lexicons as well.

1.2 Morphology

The central concepts of morphology are morphotactics and (morpho)phonological alternations. Morphotactics (also morphosyntax or word formation) defines the constraints on possible morpheme combinations. Phonological (also orthographical) alternations define the changes in morphemes occurring in particular environments. To illustrate the issue, an example from Karttunen (Karttunen, 2003) comes in handy:

(1) pity → piti-less → piti-less-ness (Karttunen, 2003, p. XVI)

The morphotactic definition accounts for the acceptability of a word like piti-less-ness and the unacceptability of a word like *piti-ness-less. Phonological alternations, on the other side, describe why pity is realized as piti in the context of a following less. These are simple examples that could be caught easily with a few basic rules. But for a full-scale NLP application, one needs a much more sophisticated system.
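The division of labour in example (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tiny suffix-order table and the function names `is_well_formed` and `spell_out` are hypothetical and are not part of the thesis or of any toolbox.

```python
# Hypothetical toy morphotactics: which suffix may follow which element.
SUFFIX_ORDER = {"STEM": {"less"}, "less": {"ness"}, "ness": set()}


def is_well_formed(morphemes):
    """Morphotactics: accept only licensed orders of suffixes after the stem."""
    state = "STEM"
    for suffix in morphemes[1:]:
        if suffix not in SUFFIX_ORDER[state]:
            return False
        state = suffix
    return True


def spell_out(morphemes):
    """Alternation: a stem-final y is realized as i when a suffix follows."""
    stem, suffixes = morphemes[0], morphemes[1:]
    if suffixes and stem.endswith("y"):
        stem = stem[:-1] + "i"
    return stem + "".join(suffixes)


print(is_well_formed(["pity", "less", "ness"]))  # licensed order
print(is_well_formed(["pity", "ness", "less"]))  # rejected order
print(spell_out(["pity", "less", "ness"]))
```

The first function only decides which combinations exist, the second only decides how an accepted combination is spelled; keeping them apart is the modular split the rest of the thesis relies on.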

1.3 Related Work

A significant amount of work has already been done in the computational modeling of Turkish morphology: Köksal's first approach to a computerized model for automatic morphological analysis of Turkish (Köksal, 1975), Hankamer's description in terms of finite state morphology (Hankamer, 1986), and numerous recent works by Kemal Oflazer based on his Two-Level model of Turkish (Oflazer, 1994). Since the earlier works are hard to find, I will briefly discuss only the more recent works by Oflazer and Schaaik, as they are closely related to what I am doing in this project.

Schaaik's Studies in Turkish Grammar (Schaaik, 1996) is a comprehensive guide to building a computational model for full nominal phrases using the functional grammar formalism (Dik, 1981).

Oflazer's work is based primarily on his two-level model for Turkish morphology (Oflazer, 1994). The idea behind the two-level models originates from Koskenniemi (Koskenniemi, 1983). The most significant difference from the ordered linear approach in composed sequences of rule transducers [1] is that all the rules operate in parallel. To illustrate the difference, a basic two-level model and a cascade-based model relating the languages defining the lexical and surface forms are presented in Figure 1.1 below:

[Figure 1.1: Cascade-based and two-level (parallel) models in finite state morphology. In the cascade, the lexical form passes through FST 1, FST 2, …, FST n in sequence, via intermediate forms, to the surface form; in the two-level model, FST 1, FST 2, …, FST n apply in parallel between the lexical form and the surface form.]

[1] More on transducers and automata follows in the technical background on finite state technology in Section 2.2. For now think of rule transducers simply as a way to implement rules.

In the cascade model of composed rule transducers, each transducer operates on its own input and output, producing an intermediate output to feed the next transducer in the cascade. With the key concept here being "feed", the major drawback of the two-level models has been that in the case of bleeding or feeding relations between rules (which is often the case in generative phonology), it is hardly possible to define such relations within this approach
(apart from having to design the rules very carefully in order to get the necessary result). But the convenience of the cascade-based model from this perspective comes at a price. In the process of composition, the network could easily "explode" into unmanageable size, as many parts of it may need to be copied. Luckily there are some techniques to restrict such growth. My project combines both models in a way, as we shall see later: whenever parallel operation of rules is needed, parallel rules will be used, and whenever sequential (linear) operation of rules is needed, a sequence will be used.

1.4 Overview

In the following sections I will present a finite state approach to a part of the Turkish morphology. I will focus on the nominal morphology only, in particular the different inflectional paradigms, as the complete nominal morphology of Turkish is a subject too broad to cover here (let alone the complete Turkish morphology). I will try to approach the task in as modular a way as possible, so that if changes or extensions are required, all that is needed is to plug in the extension component and occasionally tune the system a little. The key concept here is modularity. The advantage is that once a solution for the nominal morphology is designed, it could easily be extended to cover the other major word classes in a language.

In Section 2 I will roughly present the background information needed to proceed through the paper, as follows: Section 2.1 gives the linguistic background on Turkish, and Sections 2.2 and 2.3 provide some technical background on the technology employed and the particular toolbox I have chosen to use. The actual model and its implementation will be presented in full in Section 3. We conclude in Section 4, and in Section 5 I will present an outlook on possible future elaborations.

2. Background

In the following sections I will present the basic "technical" properties of the language and the technology used to model it. For the purpose of this project I will be using the Xerox Finite State Toolbox (XFST) and the "manual" to it by Lauri Karttunen (Karttunen, 2003). My work is based primarily on Geoffrey Lewis' Turkish (Lewis, 1989) and Turkish Grammar (Lewis, 1967), referred to as the official language guides for Turkish in most papers.

2.1 Turkish

In this subsection I will present the most important features of Turkish that we'll be dealing with in the subsequent sections. Turkish is an agglutinative language from the family of Turkic languages. A Turkish word consists of a root (base form) and a number of suffixes attached to it, each extending its meaning or changing its word class:

(2) bilgi – knowledge
    bilgisiz – without knowledge
    bilgisizlik – lack of knowledge
    bilgisizlikleri – their lack of knowledge
    bilgisizliklerinden – from their lack of knowledge
    bilgisizliklerindenmiş – I gather that it was from their lack of knowledge
    (Lewis, 1989, p. 3)

As one might infer, many ideas typically expressed by prepositions or pronouns across languages are expressed by suffixes in Turkish.

Another important feature of the Turkish language is vowel harmony. Vowel harmony is basically described as a "progressive sound assimilation" phenomenon. In simple words, the features of a vowel depend on the features of the preceding vowel. The Turkish vowel system is shown in Table 2.1 below:

          Unrounded       Rounded
          Low    High     Low    High
  Back    a      ı        o      u
  Front   e      i        ö      ü

Table 2.1: The vowel system of Turkish.

Geoffrey Lewis (Lewis, 1989) describes the vowel harmony in Turkish with a general law of vowel harmony in terms of the feature +/-back of vowels. As stated in (Lewis, 1989), all the vowels in a word agree with the backness value of the first vowel of that word, harmonizing with the vowel of the first syllable:

(3) -Back                          +Back
    sekiz – eight                  dokuz – nine
    seksen – eighty                doksan – ninety
    sinir – nerve                  sınır – frontier
    sinirler – nerves              sınırlar – frontiers
    sinirlerimiz – our nerves      sınırlarımız – our frontiers
    (Lewis, 1989, p. 11)

In cases of disharmony [1] in the root, or if an invariable suffix is attached, the harmonic suffixes harmonize with the vowel of the last preceding syllable. So attaching the plural suffix -ler/-lar, which harmonizes for backness, to anne (mother) will result in anneler (mothers) and not in *annelar.

[1] Exceptions to this principle are: a small number of native Turkish words – elma (apple), anne (mother), kardeş (brother or sister); loanwords; compound words – bilgisayar (computer), from bilgi (information) and sayar (counter, lister); and eight invariable suffixes. Clements and Sezer account for them in (Clements, 1982).

We'll be dealing exclusively with the vowel harmony of suffixes in Turkish and, as mentioned before, the scope of this project will be restricted to inflectional noun suffixes only.

There is also, as Lewis (Lewis, 1989) refers to it, a "special law of vowel harmony" that constrains the occurrence of vowels in terms of roundedness [1]. Unrounded vowels are typically followed by unrounded vowels, and rounded vowels are typically followed by low unrounded or high rounded vowels. Combining the two principles we end up with the following:

(4) a is followed by a or ı
    e is followed by e or i
    ı is followed by a or ı
    i is followed by e or i
    o is followed by a or u
    ö is followed by e or ü
    u is followed by a or u
    ü is followed by e or ü

Turkish suffixes, except the eight invariable ones, harmonize with the vowel of the last syllable of the word they are attached to. They can be divided into two groups: the vowels of the first group alternate between the low unrounded vowels a and e (also called e-type [2] suffixes (Pollard, 1996)), and the vowels of the second group alternate between the high vowels ı, i, u and ü (the so-called i-type suffixes (Pollard, 1996)). With one exception, the present tense verbal suffix –iyor/–ıyor/–uyor/–üyor, no suffixes contain o or ö. (4) above provides some basic notion about this classification. The plural suffix -ler/-lar falls in the first class, whereas a suffix like the definite objective case suffix is an i-type suffix.

(5) ev (house)        evler (houses)        evi (the house)
    kol (arm)         kollar (arms)         kolu (the arm)
    kitap (book)      kitaplar (books)      kitabı (the book)
    köprü (bridge)    köprüler (bridges)    köprüyü (the bridge)

One might notice a few additional things from (5). First of all, no vowel sequences are possible in Turkish. Exceptions are some loan words like saat (hour). Typically a buffer y is inserted if a suffix beginning with a vowel is attached to a word ending in a vowel. In some cases it is an n or an s. Second, words in Turkish typically end in voiceless consonants, but these change to voiced ones intervocalically: kitap (book) → kitabı (book, definite objective case – the book). This topic, along with the other alternations occurring in the process of suffixation, will be further elaborated in Section 3.2.

These are the general morphological and phonological features of Turkish that we will pay attention to. In Sections 3.1 and 3.2 I will present the actual morphotactics of the Turkish nominal inflectional paradigm and the phonological alternation rules respectively.

[1] Exceptions to this principle are: tapu (title-deed), avuç (hollow of the hand), abuk sabuk (nonsensical), çamur (mud) – in general a can be followed by u if a p, b, v or m intervenes. These exceptions apparently occur only root-internally and do not seem to affect suffixation.

[2] The e-/i-type distinction is really a distinction between harmonizing vowels and not suffixes, as Pollard (Pollard, 1996) proposes for the sake of simplicity. Some suffixes, like the 3pPl Poss. –leri/–ları, feature both types of harmonizing vowels.
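The pairs in (4) can be read as two lookup tables, one per harmony type. The Python sketch below does exactly that; the placeholder letters `E` and `I` for the two kinds of harmonizing vowel and the function name `harmonize` are notational assumptions of this sketch, and buffer consonants and consonant voicing are deliberately left out.

```python
# Suffix vowel chosen after each preceding vowel, transcribed from (4).
E_TYPE = {"a": "a", "ı": "a", "o": "a", "u": "a",
          "e": "e", "i": "e", "ö": "e", "ü": "e"}
I_TYPE = {"a": "ı", "ı": "ı", "o": "u", "u": "u",
          "e": "i", "i": "i", "ö": "ü", "ü": "ü"}


def harmonize(stem, suffix):
    """Resolve the placeholders E and I in suffix, left to right.

    Each resolved vowel becomes the context for the next placeholder,
    so a suffix with two harmonizing vowels is handled in one pass.
    """
    word = stem
    for char in suffix:
        if char in ("E", "I"):
            previous = next(c for c in reversed(word) if c in E_TYPE)
            char = (E_TYPE if char == "E" else I_TYPE)[previous]
        word += char
    return word


for stem in ["ev", "kol", "kitap", "köprü"]:
    print(stem, harmonize(stem, "lEr"), harmonize(stem, "lErI"))
```

Note that `harmonize("kitap", "I")` would yield kitapı, not kitabı: the voicing of the stem-final consonant is a separate alternation and is not modeled here.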

a 1 c b 2 3 b Figure 2. That is. with different properties and set of arcs that connect these states.2. For the slightly more complicated network in Figure 2. which might as well be identical to the input. but not a by itself. Various tasks are nowadays approached using finite state technology – part-of-speech disambiguation. Enumerating all the inputs seems unreasonable. In the above example there are two paths possible to the final state 3. we end up with an infinite set of acceptable input strings. A regular expression (or a regex) is a pattern that matches a set of strings which obey particular syntax rules. at the end of the input the network should be in a final state. 2. Automata describe languages. bcab.1 Finite State Automata (FSA) Finite state networks typically have one start state and one or more final states. abcb. the state marked with a double circle (3) is the final state. Valid inputs for the network in Figure 2.2 Finite State Technology Finite state technology was quickly condemned by the linguists at the earlier stages of its development due to its weak descriptive power. whereas transducers express relations between languages. But later on it proved to be quite useful for modeling parts of languages that could be considered finite and regular. In order to accept a string.1 are b and ab. Automata are finite state machines that only accept a set of given strings (a language).. The basic idea behind finite state technology is a set of states. We’d rather define some rule that selects valid inputs.2: A bit more complicated three-state model. It 1 a 1 b 2 3 b Figure 2. It is the basis for any further kind of natural language processing. A more compact representation could be defined using regular expressions.2. abcab… Because of the looping arc through c. Transitions between the states are possible only if the required input is recognized. We will be talking about networks here as a general term abstracting over transducers and automata. 
tokenization. The sequence of transitions over arcs to a particular state is called a path. Arcs have a direction and an input symbol. bcb. All the possible input strings in this case seem to follow a particular (regular) pattern. But the most significant and core application of finite state technology in NLP remains morphological analysis. 10 .1: A simple three-state network. for a particular state there is a set of outgoing arcs with their respective input symbols. ab.2. shallow parsing. The state marked with and arrow (1) is the start state. The arc with input c takes us back to the start state creating a loop. whereas transducers provide a set of outputs for an accepted input. The states and arcs together form networks1. valid input sequences will be: b.

is an essential concept in Finite State Technology. Regular expressions describe the languages accepted by Finite State Automata, the regular languages. In their current state, regular expressions are only partially related to real regular expressions. The precise syntax varies among applications and toolboxes; I will describe the necessary syntax basics in further detail, in terms of the toolbox I am using, in Section 2.3. A model solution for the above networks using the lexc language is provided in the appendix.

Once we have designed a network describing a language or a relation, we can apply different operations to it: intersection (&) [2], union (|), subtraction (-), concatenation ( ), composition (.o.), negation (~), etc. There are newer operations defined in every particular toolbox, extending its capabilities and expressive power. These are the basics. The essential terms will be explained as needed as we proceed.

[2] The operators and their syntax vary among toolboxes. I will be using the ones described in (Karttunen, 2003).

2.2.2 Finite State Transducers (FST's)

A Finite State Network (or a Finite State Machine), as noted above, is the general term for Finite State Automata (FSA) and Finite State Transducers (FST's). Where FSA deal with acceptance/recognition only, FST's also provide output(s) for the recognized input. This major difference is described using symbol pairs in the model in Figure 2.3 below:

[Figure 2.3: A Finite State Transducer. The network of Figure 2.2 with the arc labels a:A, b:B and c.]

It accepts the same strings as the FSA in Figure 2.2, but transforms the lowercase a's and b's into upper-case A's and B's respectively. The c's remain unchanged. For an input string like ab the output will be AB, for abcb it will be ABcB, and so on. It seems like a simple replacement operation, but there is no such operation involved here. In this case we have strings from one language (later on referred to as the 'UPPER' language [1]) related to strings from another language (which will be called the 'LOWER' language [1]). The c, which remains unchanged, is subject to the identity relation.

[1] The terms will be explained in more detail in Section 2.3.

A general feature of Finite State Networks is that they can be composed together, yielding a sequence of transducers/automata: a modular structure that is essential to our purpose in this paper. Most important to note here is the composition operation (.o.). Composition is an operation on two relations. Say we have the transducer above (Figure 2.3) that is turning lowercase a's and b's into upper-case A's and B's respectively. This could be further described as <a,A> and <b,B> in terms of relations. Say we then have another transducer that is turning capital A's and B's into numbers, <A,1> and <B,2>. Composing the two of them would provide us with a new transducer taking the upper side of the first and the lower side of the second transducer, where the inner symbols match:

(6) [<a,A>, <b,B>] .o. [<A,1>, <B,2>] → [<a,1>, <b,2>]
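To make composition concrete, here is a small Python sketch of the transducer in Figure 2.3, a second transducer for the pairs <A,1> and <B,2>, and a product construction that composes them as in (6). It is a sketch under assumptions: the arc tables are my reading of the figures, the identity pair for c in the second transducer is added so that strings containing c pass through, and the construction only covers deterministic networks without empty symbols. None of the names come from XFST.

```python
# (state, upper symbol) -> (lower symbol, next state); states as in Figure 2.3.
TO_UPPERCASE = {
    (1, "a"): ("A", 2),
    (1, "b"): ("B", 3),
    (2, "b"): ("B", 3),
    (3, "c"): ("c", 1),
}
# Hypothetical one-state transducer for <A,1>, <B,2>; the c pair is an assumption.
TO_DIGITS = {
    (0, "A"): ("1", 0),
    (0, "B"): ("2", 0),
    (0, "c"): ("c", 0),
}


def apply_down(arcs, start, finals, text):
    """Map an upper-side string to its lower-side string, or None if rejected."""
    state, output = start, []
    for symbol in text:
        if (state, symbol) not in arcs:
            return None
        lower, state = arcs[(state, symbol)]
        output.append(lower)
    return "".join(output) if state in finals else None


def compose(arcs1, start1, finals1, arcs2, start2, finals2):
    """Product construction: keep arc pairs whose inner symbols match."""
    start = (start1, start2)
    arcs, finals = {}, set()
    todo, seen = [start], {start}
    while todo:
        s1, s2 = todo.pop()
        if s1 in finals1 and s2 in finals2:
            finals.add((s1, s2))
        for (source, upper), (middle, n1) in arcs1.items():
            if source != s1 or (s2, middle) not in arcs2:
                continue
            lower, n2 = arcs2[(s2, middle)]
            arcs[((s1, s2), upper)] = (lower, (n1, n2))
            if (n1, n2) not in seen:
                seen.add((n1, n2))
                todo.append((n1, n2))
    return arcs, start, finals


COMPOSED = compose(TO_UPPERCASE, 1, {3}, TO_DIGITS, 0, {0})
print(apply_down(TO_UPPERCASE, 1, {3}, "abcb"))
print(apply_down(*COMPOSED, "ab"))
```

The composed network never builds the intermediate string AB; the inner symbols are only used while the product is constructed, which is the point of composing the cascade into a single transducer.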

All the operations can be applied multiple times to different networks. For some of them the order matters, for others not. Composition allows us to build a cascade of multiple transducers into a single transducer, that is, to compose multiple rule transducers into a single lexical transducer that relates strings from the language of surface forms to strings from the language of lexical (underlying) forms. Additional transducers can be composed with the network at hand to impose restrictions, define alternations or add more content. This natural feature of finite state networks is what makes them so suitable for morphological processing. It was C. D. Johnson (Johnson, 1972) who first realized that morphophonological knowledge could be modeled using FSN's.

I will spare the mathematical model behind Finite State Networks, as it won't be necessary to understand the current paper. For further information on finite state technology and automata theory refer to (Hopcroft, 1979).

2.3 XFST

The Xerox Finite State Toolbox (XFST) was developed at the Xerox Research Centre Europe (XRCE) by Kenneth R. Beesley and Lauri Karttunen. It implements the standard finite state operations such as composition and union, as well as several innovative operations like replacement rules [1] and local sequentialization.

[1] A brief overview of the formalism is available in the appendix.

XFST includes: lexc, a compiler for lexicons in the lexc language, which is specifically designed for handling morphotactics in natural languages, and xfst, the core tool providing an interface to the finite state calculus for building, accessing and manipulating Finite State Networks, and a compiler for regular expressions and replacement rules, which will be essential to my work. Additionally, there is a compiler for two-level morphology rules (twolc) as described by Koskenniemi (Koskenniemi, 1983), but its application is beyond the scope of my work, so I will leave it aside. XFST also provides two tools, lookup and tokenize, designed for testing and application of larger projects, but they won't be discussed any further in this paper.

XFST defines transducers as relations between two languages. What is referred to as the upper language can be thought of as the input, and the lower language would then be the output, when we apply an input to a transducer downwards. If we apply input to the transducer upwards, then the roles switch: the input is applied on the lower side and the output comes from the upper side. Although it seems a bit confusing, the terms upper and lower remain constant. In the definition of a lexical transducer, the upper side language will describe the lexical (underlying) forms of the language to be analyzed, and the lower side language will contain the actual surface forms in the standard orthography.

In the process of implementing a morphological analyzer, in terms of the current task at hand, the morphotactics will be defined in lexc, as intended, whereas phonological/orthographical alternation rules will be defined as separate transducers (mostly using replacement rules), composed together into a single transducer, which itself will be composed with the network derived from the lexc definition of the lexicon to finally result in a lexical transducer which will be used for our final purpose. The most fascinating part is that once we have constructed a transducer for morphological generation, we can easily apply it in the other direction for the task of morphological analysis.
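The upper/lower terminology can be illustrated by treating a lexical transducer as a plain set of string pairs. In the Python sketch below, the surface forms come from example (5), while the lexical strings with the tag `+Pl` are a hypothetical notation chosen for illustration; the function names mirror the idea of applying a network downwards or upwards but are not calls into XFST.

```python
# (upper = lexical form, lower = surface form); the +Pl tag is hypothetical.
RELATION = [
    ("ev+Pl", "evler"),
    ("kol+Pl", "kollar"),
    ("kitap+Pl", "kitaplar"),
]


def apply_down(upper):
    """Generation: look up the surface forms related to a lexical form."""
    return [low for up, low in RELATION if up == upper]


def apply_up(lower):
    """Analysis: look up the lexical forms related to a surface form."""
    return [up for up, low in RELATION if low == lower]


print(apply_down("kol+Pl"))
print(apply_up("kitaplar"))
```

Both functions read the same relation; only the side that is matched against the input changes, which is why a generator comes with an analyzer for free.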

3. The Model

In this section I will present the nominal paradigm of Turkish and my implementation of it. In Sections 3.1 and 3.2 I will present the theoretical background behind my model. There are two modules in the model: the lexicon, defining the morphotactics of Turkish nouns, and the morphophonological rules component, describing the alternations occurring on the surface.

An important notion in the following sections will be that of archiphonemic descriptions. The general idea: I will be using, both in theory and in practice, the so-called archiphonemes to describe classes of similar phonemes that alternate depending on the environment. For example, to describe vowel harmony I will be using "I" to generalize over the class of high vowels that alternate according to the principle of i-type vowel harmony, and "E" to generalize over the class of low unrounded vowels that alternate in concordance with the principle of e-type vowel harmony. The symbols denoting the particular classes of alternating phonemes will be defined as needed as we proceed further.

As I was implementing the vowel harmony principles using variables for the alternating vowel segments, I realized that the idea of using variables could be further employed to describe other phenomena, such as the consonant alternations. My initial approach, using consonant alternation rules on the surface forms, failed to describe the exceptional cases, so I had to redesign it using unspecified abstract definitions on the lexical side for entries that do undergo the alternations and underspecify the entries that do not. We'll get back to this issue in the subsequent sections. So let's have a closer look at the core of the Turkish noun paradigm.

3.1 The Nominal Paradigm of Turkish. Morphotactics

The nominal inflectional paradigm is defined in different ways in the various sources. Worth mentioning is that in some sources, the relativising suffix –ki is classified as part of the nominal inflectional paradigm. At the current stage of development I won't be concerned with it, however. On the other side, case-type suffixes are also differently defined in the various sources: in some of the recent works, the suffix –(y)la/–(y)le is classified as an instrumental case suffix. The basic pattern on which everyone agrees, though, is:

STEM – NUM – POSS – CASE

Turkish has no distinction of grammatical gender.

[Figure 3.1: A simplified FSA model for the nominal morphotactics in Turkish. States 1 to 5 are connected in sequence by arcs labeled STEM, NUM, POSS and CASE; the NUM, POSS and CASE arcs each have an alternative empty (0) arc.]

The definition will be further extended in the subsequent sections.
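The pattern STEM – NUM – POSS – CASE with optional suffix slots can be sketched as a small acceptor over slot labels. This Python version is an assumption-laden illustration rather than the lexc definition: it works on abstract labels instead of actual suffixes, and treating each skipped slot as an empty (0) arc is my reading of Figure 3.1.

```python
from itertools import product

SLOTS = ["NUM", "POSS", "CASE"]  # fixed order; each slot may be skipped


def accepts(labels):
    """Accept STEM followed by any in-order subset of NUM, POSS, CASE."""
    if not labels or labels[0] != "STEM":
        return False
    position = 0
    for label in labels[1:]:
        # Only slots at or after the current position are still available.
        if label not in SLOTS[position:]:
            return False
        position = SLOTS.index(label) + 1
    return True


def all_paths():
    """Enumerate every label sequence the acceptor allows."""
    paths = []
    for choice in product([True, False], repeat=len(SLOTS)):
        paths.append(["STEM"] + [s for s, used in zip(SLOTS, choice) if used])
    return paths


print(accepts(["STEM", "NUM", "CASE"]))
print(accepts(["STEM", "CASE", "NUM"]))
print(len(all_paths()))
```

With three optional slots the acceptor licenses eight label sequences per stem, from the bare stem up to STEM NUM POSS CASE.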

3.1.1 Inflection for Number

The basic uninflected dictionary form of Turkish nouns is singular (or, as claimed in some sources, “numberless”). The plural form is derived by attaching the –ler/–lar suffix. Its vowel is of e-type harmony, therefore the compact representation using an archiphonemic description will be –lEr. It comes generally before any other inflectional suffix. Ketrez (Ketrez, 2003) provides an extensive study on the multiple readings of the Turkish plural morpheme, but it is mostly from syntactic and semantic points of view and I won’t go any further discussing the issue.

3.1.2 Case Inflection

Lewis (Lewis, 1967, 1989) defines six cases in his grammar of Turkish. Table 3.1 below provides an overview of the case paradigm in Turkish:

Case \ Last preceding vowel        e or i    ö or ü    a or ı    o or u
Absolute (Nominative)              (no suffix)
Definite Objective (Accusative)    -(y)i     -(y)ü     -(y)ı     -(y)u
Genitive (of)                      -(n)in    -(n)ün    -(n)ın    -(n)un
Dative (to, for)                   -(y)e     -(y)e     -(y)a     -(y)a
Locative (in, on, at)              -de       -de       -da       -da
Ablative (from, out of)            -den      -den      -dan      -dan

Table 3.1: Summary of case suffixes in Turkish.

The bracketed y and n are realized on the surface only if the word the suffix is attached to ends in a vowel. The locative and ablative suffixes are generally realized as –de/–da and –den/–dan, but when attached to a word ending in a voiceless consonant (ç, f, h, k, p, s, ş and t), they are realized as –te/–ta and –ten/–tan respectively. So, using archiphonemic descriptions and the principles of vowel harmony, the case inflection summary will look like:

Case                               Lexical Form of the Suffix
Absolute (Nominative)              (no suffix)
Definite Objective (Accusative)    -(y)I
Genitive (of)                      -(n)In
Dative (to, for)                   -(y)E
Locative (in, on, at)              -DE
Ablative (from, out of)            -DEn

Table 3.2: Summary of case suffixes in Turkish using archiphonemic descriptions.

A few examples will be:

(7) araba (car, Nom.) → araba-(y)I (car, Acc. / LF) → arabayı (car, Acc. / SF – “the car”)
    ev (house, Nom.) → ev-DE (house, Loc. / LF) → evde (house, Loc. / SF – “in the house”)

As mentioned above, some more recent works treat what used to be (and I believe still is) a postposition (ilE) following absolute or genitive forms as an additional instrumental/comitative case suffix (–(y)lE). It is however, as far as my knowledge reaches, still used both as a postposition and as a cliticized suffix. I will stick to the classic works for now and treat it as a separate (non-case) suffix¹.

3.1.3 Inflection for Possession

Where in many languages possession is formed using pre-/post-posed pronouns (English: my/mine, your/yours, his, her/hers, etc.; German: mein (my), dein (your), sein (his), ihr (her), etc.; Bulgarian: pre-posed – мой ([moy] - my, “of mine”), твой ([tvoy] - your, “of yours”), негов ([negov] - his, “of his”), неин ([nein] - her, “of hers”), etc., post-posed – ми ([mi] - my), ти ([ti] - your), му ([mu] - his), и ([i] - her)…), in Turkish possession is expressed by suffixes. The complexity of the possessives varies across languages, depending on their overall morphological complexity. In Bulgarian for example, the pre-posed possessives act pretty much like adjectives and typically precede them, so they carry the inflection for gender, number and definiteness. In Turkish the possessive suffixes are partially derived from the present tense forms of the verb to be. A summary of the possessive suffixes is presented in Table 3.3 below:

Person    Suffix     Gloss
1pSg      -(I)m      my
2pSg      -(I)n      your
3pSg      -(s)I      his/her/its
1pPl      -(I)mIz    our
2pPl      -(I)nIz    your
3pPl      -lErI      their

Table 3.3: Summary of possessive suffixes in Turkish using archiphonemic descriptions.

Again, the bracketed segments surface only in particular conditions. Opposite to the case suffixes, where the bracketed segments surfaced only if the word they are attached to ends in a vowel, here the optional segments surface both if the word the possessive suffix is attached to ends in a consonant (for the first and second person singular and plural) and if the word ends in a vowel (for the third person singular). So we have vowel deletion in one case and consonant insertion in the other, to avoid vowel sequences².

(8) ev (house) → ev-(I)m (house, 1pSg Poss. / LF) → evim (house, 1pSg Poss. / SF – “my house”)
    araba (car) → araba-(I)mIz (car, 1pPl Poss. / LF) → arabamız (car, 1pPl Poss. / SF – “our car”)
    araba (car) → araba-(s)I (car, 3pSg Poss. / LF) → arabası (car, 3pSg Poss. / SF – “his/her car”)

¹ Lewis (Lewis, 1967, 1989) states that it is attached to nominative nouns and genitive pronouns; in this sense it could be considered an additional case suffix. I will leave it aside until I get a clearer view on the issue.
² More on vowel sequences to come in the description of the rules in the following sections.
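The behaviour of the bracketed segments in Tables 3.2 and 3.3 can be summarised in one condition: a bracketed vowel surfaces after a consonant, and a bracketed consonant surfaces after a vowel. The Python fragment below is a minimal sketch of that condition only. The helper name attach and the dash-separated output are assumptions made for illustration, the sketch handles a single bracketed segment at the start of a suffix, and it leaves the archiphonemes I, E and D unresolved.

```python
# Surface vowels plus the vowel archiphonemes I and E.
VOWELS = set("aeıioöuü") | {"I", "E"}


def attach(stem, suffix):
    """Attach a suffix whose first segment may be bracketed, e.g. "(I)m".

    Returns the lexical form with the optional segment kept or dropped;
    vowel harmony and consonant alternations are not applied here.
    """
    if not suffix.startswith("("):
        return stem + "-" + suffix
    optional, rest = suffix[1], suffix[3:]  # "(I)m" -> "I", "m"
    stem_ends_in_vowel = stem[-1] in VOWELS
    optional_is_vowel = optional in VOWELS
    # A bracketed vowel is kept after a consonant,
    # a bracketed consonant is kept after a vowel.
    keep = optional_is_vowel != stem_ends_in_vowel
    return stem + "-" + (optional if keep else "") + rest
```

With this helper, attach("ev", "(I)m") gives ev-Im and attach("araba", "(s)I") gives araba-sI, while attach("araba", "(I)mIz") drops the vowel and gives araba-mIz, in line with the forms in (7) and (8).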

Possessive suffixes precede case suffixes. By having another look at the two inflectional paradigms one might or might not notice that some of the suffixed forms could occasionally overlap on the surface. For example: the underlyingly different ev-(y)I (house – Definite Objective (Accusative) case, “the house”) and ev-(s)I (house – 3pSg possessive, “his house”) end up absolutely the same on the surface – evi:

(9) ev (house) → ev-(y)I (house, Acc. / LF) → evi (house, Acc. / SF – “the house”)
    ev (house) → ev-(s)I (house, 3pSg Poss. / LF) → evi (house, 3pSg Poss. / SF – “his/her house”)

Things get further complicated if there are multiple instances of the plural suffix –lEr – in the case of 3pPl possessive for example, if the possessed noun is already plural – evler (houses) → *evlerleri → evleri (their houses) – one –lEr gets deleted. So we end up having the single form evleri for both “their house” and “their houses”. A closer look, however, reveals even further complications: evleri could also denote the accusative case of the plural of houses (“the houses”) and the 3pSg possessive of the plural of houses – “his/her houses”. Even though Turkish is morphologically highly specified, we often have 2-, or as in this case 4-fold ambiguities. The derivations from the underlying lexical representations of the four interpretations of evleri are given in (10) below:

(10) Pl. Acc. (the houses)               ev-lEr-(y)I    ev-ler-I       evleri
     Pl. 3pPl Poss. (their houses)       ev-lEr-lErI    ev-ler-leri    evleri
     Sg. 3pPl Poss. (their house)        ev-lErI        ev-leri        evleri
     Pl. 3pSg Poss. (his/her houses)     ev-lEr-(s)I    ev-ler-i       evleri

Worth noting, just to make things even more confusing, is that after the third person possessive suffixes a so-called “pronominal n” is added when there is a case suffix following.

(11) evi (his/her house, also the house)
but:
(12) evinde (in his/her house in our case, but also identical with in your house)

Confusing? Typically ambiguities are resolved by looking at the context where the ambiguous word occurs – ambiguous forms are usually used with the genitive of the personal pronouns to avoid confusion. In this case the noun itself reverts to accusative case.

(13) evleri (their house)     →  onların evi (their house, “the house of theirs”)
     (house.3pPl.Poss)           (they.Gen house.Acc.)
     evleri (their houses)    →  onların evleri (their houses, “the houses of theirs”)
     (houses.3pPl.Poss)          (they.Gen houses.Acc.)
     evleri (his houses)      →  onun evleri (his houses, “the houses of his”)
     (houses.3pSg.Poss)          (he.Gen houses.Acc.)

For the purpose of this project, I won’t be concerned with morphological disambiguation, as this task should be performed at a later stage, after examining the already analyzed context. There are further distinctions in the uses of the possessives in Turkish, but again, this topic is beyond the scope of my work. As one might imagine, there are plenty of possible inflections for a single entry in the lexicon, that is, for a single noun stem: 2x for number times 7x (the six possessive suffixes + the possession-free form) for possession times 6x (or even 7x if the instrumental case is included) for case inflection results in 84 basic forms from inflection only (even though some of them might be identical).

3.1.4 Lexical Exceptions – the ‘su’ case

There is only one pure lexical exception to the paradigm – the noun su (water). The exception manifests itself as su taking the –yun suffix for the genitive (instead of the standard –nun suffix) and also, in the possessive forms, there is always a y preceding the possessive suffix – suyum (my water) instead of *sum, suyu (his/her water) instead of *susu. In general, the y is inserted whenever a suffix starting in a vowel or dropping consonant is attached to the word. There is however a large number of derived noun roots that end in –su, for example akarsu (river – “running water”), and things get further complicated. For this reason, it deserves a special treatment.

3.2 Phonological Alternation Rules

In the following subsections I will outline theoretically the basics of the phonological alternation rules in Turkish with respect to the task at hand. I will present how the basics work and then address some of the exceptional cases.

3.2.1 Resolving Vowel Harmony

The vowel harmony principles as described in Section 2.1 are rather simple to implement. I split the two harmony classes in two rules – for e-type harmony and for i-type harmony. The e-type harmony rule checks the value of the backness feature of the last preceding vowel – if it is a back vowel, the underlying E is realized as a; if it is a front vowel, it is realized as e. Since the system does not provide us with feature specification of phonemes, I had to define the classes of vowels as sets:

(14) define BackV [a | ı | o | u];
     define FrontV [e | i | ö | ü];
     define HighV [ı | i | u | ü];
     define LowV [a | e | o | ö];
     define RoundedV [o | ö | u | ü];
     define UnroundedV [a | ı | e | i];

The intersection (&) of those sets provides us with the sub-classes of vowels having combined features. So the set of back unrounded vowels will be derived as:

(15) [ BackV & UnroundedV ]

which results in:

(16) [ a | ı ]

This is essential for defining the i-type harmony, as it is based on two features rather than one, namely backness and roundedness. So, if the last preceding vowel is back and unrounded, the underlying I is realized as ı (the high back unrounded vowel – so to say, intersecting the set of high vowels with the sets describing the features of the last preceding vowel). The same holds for the other realizations of the underlying I:

(17) I → [HighV & BackV & RoundedV] || [BackV & RoundedV] Consonant¹ _

Which should be read as: I is realized as the high back rounded vowel (u) in the context of a back rounded last preceding vowel (o or u). The other rules are identical:

(18) I → [HighV & BackV & UnroundedV] || [BackV & UnroundedV] Consonant _
     I → [HighV & FrontV & RoundedV] || [FrontV & RoundedV] Consonant _
     I → [HighV & FrontV & UnroundedV] || [FrontV & UnroundedV] Consonant _

This is only necessary to state clearly the principles operating vowel harmony. One might as well simply write the rules as: I -> i || [ i | e ] Consonant _ , etc., but that won’t have much descriptive linguistic value.

In my solution the rules operate in parallel locally, that is, for the e-type and the i-type they operate together among themselves, but the e-type harmony still has precedence over the i-type. The reason behind it: apart from the backness harmony being the more general principle and having broader coverage, the abstract symbols have to be resolved in a left-to-right fashion and e-type suffixes at the current stage precede i-type suffixes. We need the exact properties of the last preceding vowel in order to resolve the next variable vowel in the following suffix (or even in the same suffix). This is important, because if an e-type suffix is added, all the following suffixes feature unrounded vowels (unless a suffix with an invariable rounded vowel is added). I might need to combine the e-type and i-type rules into one single rule operating in parallel as the system gets more sophisticated².

A few words about the exceptions to vowel harmony: we will be concerned with roots whose last vowel does not have predictive power over the harmonic features of the suffixes attached to it. Schaaik (Schaaik, 1996) refers to words which induce such exceptions as “disharmonic roots”. The same term however is used in some sources for roots that do not conform to the principles of vowel harmony internally – the exceptional cases already mentioned in Section 2.1, like anne (mother), çamur (mud), etc. Although the two groups often do overlap, it can’t be stated that this is always the case. The exceptions we will be dealing with are mostly of foreign origin: alkol (alcohol), saat (clock), rol (role), etc., which are realized as alkolü (alcohol, Acc.), rolü (role, Acc.), saati (clock, Acc.) and alkoller (alcohol, Pl.), roller (role, Pl.), saatler (clock, Pl.) instead of *alkolu, *rolu, *saatı and *alkollar, *rollar, *saatlar respectively in their accusative and plural forms.

¹ Consonant is also a defined class featuring all the consonants.
² A small issue that occurred when I accidentally switched the order of the rules was that words having a round vowel in their last syllable (like katalog (catalogue)) were resolved in an unusual way, *kataloglarunuz, whereas the correct form would be kataloglarınız (your catalogues). This was due to resolving the –InIz (2pPl Possessive suffix) as –unuz in concordance with the last (resolved) preceding vowel o (the E in the plural suffix –lEr was still pending resolution).

3.2.2 Consonant Alternation Rules

As mentioned in Section 2.2, word final consonants undergo particular alternations depending on the environment. My approach to this issue is partially based on the paper by Sharon Inkelas and Orhan Orgun (Inkelas, 1997) in which lexical exceptions are treated in terms of Optimality Theory. In brief: the alternating word final consonant in regular roots that undergo the alternations will be unspecified in the lexicon using a special symbol and the exceptional cases will be underspecified with their non-alternating surface realizations so that they won’t trigger the alternation rules. We will pay some attention to the exceptions in the end of the section.

3.2.2.1 Final Consonant (De)Voicing

The final consonant voicing occurs when a suffix starting in a vowel or a dropping consonant is attached to the stem. It covers the voiceless plosives p, ç, and t, which transform into their voiced counterparts. Additionally, what is often classified separately as a “K/0”¹ alternation (namely because of the subclass of velar consonants k, g and ğ that exhibit similar behavior) falls into this category as well. For the purpose of this project, I used archiphonemic descriptions for the alternating segments. In most of the related works they pick the capital letter for the voiced phoneme (B for b and p, D for d and t, etc.) or a capital for the geminating phoneme (S for ss and s, etc.). I will stick to the standard notation to avoid unnecessary confusion. (19) below gives a basic notion of the alternations that occur, where they occur, and what the archiphonemic symbols stand for:

(19) B → b || _ Vowel, otherwise B → p
     D → d || _ Vowel, otherwise D → t
     C → c || _ Vowel, otherwise C → ç
     K → ğ || _ Vowel, otherwise K → k
     G → ğ || _ Vowel, otherwise G → g
     Q → g || _ Vowel, otherwise Q → k

The above abstraction is necessary to model the exceptions to these alternation rules. So far it seems fine as far as alternations in the stems are concerned. But similar alternations occur in suffixes as well. They are dependent on the preceding phoneme and assimilate the value of its voicing feature. So we have²:

(20) B → p || VoiceLessCons _ , otherwise B → b
     D → t || VoiceLessCons _ , otherwise D → d
     C → ç || VoiceLessCons _ , otherwise C → c

¹ K/0 because the counterpart of k intervocalically is the so-called “yumuşak ge” (soft g), which is phonologically realized as lengthening of the preceding vowel.
² For the purpose of this project only the d/t alternation will actually be used, as it is the only one occurring in the inflectional suffixes of nouns.
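The rows of (19) and (20) for B, D and C can be tried out on individual forms with a few lines of Python. This is a minimal sketch under stated assumptions, not the xfst implementation: the function names are hypothetical, the velar symbols K, G and Q are left out, morpheme boundaries are handled by passing stem and suffix separately, and any bracketed optional segment is assumed to have been resolved already.

```python
# Surface vowels plus the vowel archiphonemes I and E.
VOWELS = set("aeıioöuü") | {"I", "E"}
# The voiceless consonants listed in Section 3.1.2.
VOICELESS = set("çfhkpsşt")

# (19): archiphoneme -> (realization before a vowel, realization otherwise)
STEM_FINAL = {"B": ("b", "p"), "D": ("d", "t"), "C": ("c", "ç")}
# (20): archiphoneme -> (realization after a voiceless consonant, otherwise)
SUFFIX_ONSET = {"B": ("p", "b"), "D": ("t", "d"), "C": ("ç", "c")}


def resolve_stem_final(stem, following=""):
    """Resolve a stem-final B, D or C given the material that follows it."""
    last = stem[-1]
    if last not in STEM_FINAL:
        return stem
    before_vowel, otherwise = STEM_FINAL[last]
    next_is_vowel = following[:1] in VOWELS  # empty string means word end
    return stem[:-1] + (before_vowel if next_is_vowel else otherwise)


def resolve_suffix_onset(preceding, suffix):
    """Resolve a suffix-initial B, D or C given the resolved preceding form."""
    first = suffix[0]
    if first not in SUFFIX_ONSET:
        return suffix
    after_voiceless, otherwise = SUFFIX_ONSET[first]
    return (after_voiceless if preceding[-1] in VOICELESS else otherwise) + suffix[1:]
```

For the lexical stem kitaB, resolve_stem_final("kitaB", "Im") returns kitab and resolve_stem_final("kitaB", "DE") returns kitap; resolve_suffix_onset("kitap", "DE") then returns tE, whose vowel is still waiting for vowel harmony.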

In linguistic terms we have regressive assimilation in stems and progressive assimilation in suffixes. The rules in (19) and (20) are oversimplified of course. In the actual implementation they feature a wider context including morpheme boundaries to make the distinctions clearer. An example for both phenomena, where several rules apply, is the inflection of kitap (book) in Table 3.4:

Surface Form    Lexical Form     Alternation Rules         Gloss
kitap           kitaB            B→p                       book, Sg. Nom.
kitaplar        kitaB-lEr        B→p, E→a                  book, Pl. Nom.
kitabım         kitaB-(I)m       B→b, I→ı                  book, Sg. 1pSg Poss. Nom.
kitapta         kitaB-DE         B→p, D→t, E→a             book, Sg. Loc.
kitabımda       kitaB-(I)m-DE    B→b, I→ı, D→d, E→a        book, Sg. 1pSg Poss. Loc.

Table 3.4: Summary of the application of the phonological alternation rules.

The exceptions to these rules include primarily monosyllabic words that preserve the quality of their final consonant. There are however monosyllabic words that do undergo the alternation rules, as there are polysyllabic words that do not. Such exceptions will be underspecified in the lexicon with their unchanging consonant.

3.2.2.2 (De)Gemination

Apart from the final stop voicing/devoicing, which is the most productive type of consonant alternation, a few other types of alternations are worth mentioning. The final consonant (de)gemination occurs only in a small number of Arabic loan words. The nature of this phenomenon is similar to the one of the final consonant (de)voicing – a word final segment gets doubled if a suffix starting in a vowel (or dropping consonant) is attached to the word:

(21) his (feeling) → hissi (feeling, Acc., “the feeling”)    → hisler (feelings)
     hat (line)    → hattı (line, Acc., “the line”)          → hatlar (lines)

Again, we will have to employ special symbols that will be realized differently on the surface depending on the context, as proposed by Schaaik (Schaaik, 1996)¹. This issue could be approached differently, by underspecifying the geminating stems with their double consonants in the lexicon and then removing the additional segment if necessary.

3.2.2.3 Other Alternations

Two other alternations are worth mentioning for the sake of completeness. One of them involves vowel insertion/deletion and the other describes the status of the glottal stop in Turkish. The first one is rather common, whereas the second operates on a limited domain of Arabic loan words. Both of them show some ambiguities.

¹ He proceeds even further, investigating the dependence of these alternations on the re-syllabification processes occurring with the different suffixes. I will not go into detail however, as my project is not intended to feature a syllabification module in its current stage of development.

3.2.2.3.1 Vowel Insertion/Deletion

Some stems in Turkish exhibit an interesting property of forming stem final consonant clusters via vowel epenthesis:

(22) burun (nose)     → burnu (nose, Acc., “the nose”)
     fikir (idea)     → fikri (idea, Acc., “the idea”)
     şehir (city)     → şehri (city, Acc., “the city”)¹
     ömür (life)      → ömrü (life, Acc., “the life”)
     alın (forehead)  → alnı (forehead, Acc., “the forehead”)

This phenomenon occurs again whenever a suffix starting in a vowel is attached to the stem (it seems that all the stem-internal alternations in Turkish are conditioned on the same context). The epenthesized vowel is always a high vowel, but its other features cannot be automatically determined, so it has to be hard-coded. Such stems will be indicated in the lexicon with a meta character preceding the vowel which is to be deleted. As for the quality of the consonant clusters that are formed after the epenthesis occurs, there have been several attempts to define the possible consonant sequences in such cases, but this is far beyond the scope of this paper.

3.2.2.3.2 The Glottal Stop

This, along with the gemination rule, is probably the most unproductive rule in Turkish. They both concern only a limited number of Arabic loan words. In modern Turkish however, the glottal stop is mostly omitted both in speech and writing. It is preserved only when ambiguities occur – telin (of the wire / your wire) and tel’in (denunciation). The nature of the glottal stop is not quite clear to me, however I attempted an approach based on Schaaik’s (Schaaik, 1996) description and the Turkish Lexical Database Project (TLDP). Schaaik (Schaaik, 1996) describes two types of glottal stop:

(23) Type 1: ^ -> 0 / ^ (0 if a consonant follows and ^ if a vowel follows)
     cami^ (mosque) -> camiler (mosques)
                    -> cami^i (the mosque / his/her mosque)²
     Type 2: ‘ -> i / ‘ (i if a consonant follows and ‘ if a vowel follows)
     nev’ (sort)    -> neviler (sorts)
                    -> nev’i (the sort / his/her sort)
     (Schaaik, 1996, pp. 114)

Both are supposed to act as consonants if a vowel follows. Apparently, in TLDP the glottal stop is not featured either. Both cases are accepted there – camii and camisi both denote the 3pSg Possessive form (his/her mosque), identically camii and camiyi both denote the accusative case (the mosque). For the second type though, only yeisi (the despair / his/her despair) and neviyi (the sort) / nevisi (his/her sort) are recognized. So the first type allows for both realizations, whereas the second type behaves more or less as if it wasn’t there at all. The examples I am concerned with are:

(24) camim (mosque, 1pSg Poss., “my mosque”)
vs.
(25) camiim (mosque, 1pSg Poss., “my mosque”)

where only the first one (camim) seems to be proper, but the TLDP analyzer provides different solutions. I tried to approach the issue as in the TLDP. In my solution, they should both denote the same thing: analogous to camii and camisi (his/her mosque), in my project they will both stand for “my mosque”. There are some mismatches though, and even though it is more likely that the mistake is overgeneration from my side, it is also possible that the TLDP analyzer has some flaws. For now, I have to investigate the issue further.

¹ In modern Turkish, the tendency is to retain the i in şehir (city) – şehiri (the city).
² The Type 1 glottal marker ^ is not manifesting itself orthographically.

3.3 Implementation

The model comprises two components – the lexicon, describing the morphotactics of Turkish (technically it is implemented as an FSA, defined in lexc, but it does include some transductions for the tags), and a set of rules that describe the morphophonological alternations that occur on the surface (implemented naturally by a set of FSTs in xfst, using the formalism of replacement rules).

3.3.1 The Lexicon

The lexicon network implemented in lexc describes the morphotactics of the Turkish nominal inflection. First of all, there is a multicharacter symbols definition (26) where a set of sequences of symbols that should be treated as atomic symbols is defined:

(26) Multichar_Symbols +Noun +Poss +Case +1p +2p +3p +Sg +Pl +DefObj +Gen +Dat +Loc +Abl +Abs

These are primarily used to define the tags to be used (case marking, number, possession, etc.). Further on, it contains a sub-lexicon of the noun stems – it is the simplest, but most important part – it contains the noun stems in their lexical (underlying) form, which could be automatically extracted from a dictionary. This form includes all the special symbols that denote alternating segments and trigger the alternation rules. Then on the next stage (the standard continuation class for all nouns) a tag +Noun is attached on the upper side. On the lower (surface) side it is realized as an epsilon, that is, it is visible only if morphological analysis (or lookup) is performed (same for all the other tags). The continuation class from there is the number lexicon – number suffixes are attached on the lower (surface) side and tags +Sg and +Pl are attached on the upper (lexical) side (the dash “ – ” stands for morpheme boundary):

(27) LEXICON Number
     +Sg:0       Possessive;
     +Pl:-lEr    Possessive;

A possessive sub-lexicon follows which defines the inflection for possession as described in Section 3.1.3 with the appropriate tags. There is an intermediate lexicon however, that specifies the optionality of the possessive suffixes:

(28) LEXICON Possessive
     +Poss:0     PSuff;
     +Case:0     CSuff;

That is, either take a possessive tag +Poss and go to the lexicon of possessive suffixes, or take a +Case tag and go to the lexicon of case suffixes. In the morphological analysis module of the Turkish WordNet® the possessive markup is obligatory. It is referred to as possessive agreement there, and if there is none, then the tag is +Pnon. I don’t find it necessary for now, but of course it won’t be any problem to tune my system up so that it features the same type of mark-up. So the actual sub-lexicon for the possessive suffixes is called PSuff:

(29) LEXICON PSuff
     +1p+Sg:-*Im      Case;    ! "my"
     +2p+Sg:-*In      Case;    ! "your"
     +3p+Sg:-*sIN     Case;    ! "his/her/its"
     +1p+Pl:-*ImIz    Case;    ! "our"
     +2p+Pl:-*InIz    Case;    ! "your"
     +3p+Pl:-lErIN    Case;    ! "their"

Two more points to make clear: the optional segments which were marked with brackets in the theoretical part are prefixed with an optionality marker (*), and the pronominal n is denoted by the capital N. Oflazer (Oflazer, 1995) defines it as a part of the case suffixes. In his case, there are two copies of each case suffix – one that follows the third person possessive form and one for all the other possessive and non-possessive forms. To me it seems more intuitive to have it as a part of the possessive, as it is indeed a “pronominal n”, and I don’t find much sense in having two instances of every case inflection. In my case it is an optional segment that surfaces only if there is a suffix following the third person singular and plural possessive forms.

After taking a possessive suffix there is again an intermediate stage that should be passed – the possessive forms still have to take a +Case tag.

(30) LEXICON CSuff
     +DefObj:-*yI    #;    ! Definite Objective Case (Accusative)
     +Gen:-*nIn      #;    ! Genitive Case - possessive, "of"
     +Dat:-*yE       #;    ! Dative Case - (indirect object) "to", "for"
     +Loc:-DE        #;    ! Locative Case - "in", "on", "at"
     +Abl:-DEn       #;    ! Ablative Case - "from", "out of", "through"
     +Abs:0          #;    ! Absolute (dictionary) form (Nominative)
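The continuation classes in (27) to (30) can be walked mechanically to list every tag sequence together with its lexical suffix string. The Python fragment below is a sketch of such a walk, not a replacement for lexc. It assumes that the intermediate stage after the possessive suffixes is a lexicon named Case with the single entry +Case:0 leading to CSuff (the name is taken from Figure 3.2, the listing itself is not given in the text), and the function name paths is hypothetical.

```python
# Each entry is (upper-side tags, lower-side string, continuation class).
# The "Case" lexicon is an assumption based on Figure 3.2.
LEXICON = {
    "Number": [("+Sg", "", "Possessive"), ("+Pl", "-lEr", "Possessive")],
    "Possessive": [("+Poss", "", "PSuff"), ("+Case", "", "CSuff")],
    "PSuff": [
        ("+1p+Sg", "-*Im", "Case"),
        ("+2p+Sg", "-*In", "Case"),
        ("+3p+Sg", "-*sIN", "Case"),
        ("+1p+Pl", "-*ImIz", "Case"),
        ("+2p+Pl", "-*InIz", "Case"),
        ("+3p+Pl", "-lErIN", "Case"),
    ],
    "Case": [("+Case", "", "CSuff")],
    "CSuff": [
        ("+DefObj", "-*yI", "#"),
        ("+Gen", "-*nIn", "#"),
        ("+Dat", "-*yE", "#"),
        ("+Loc", "-DE", "#"),
        ("+Abl", "-DEn", "#"),
        ("+Abs", "", "#"),
    ],
}


def paths(lexicon_name):
    """Yield (upper, lower) pairs for every path from lexicon_name to '#'."""
    if lexicon_name == "#":
        yield "", ""
        return
    for upper, lower, continuation in LEXICON[lexicon_name]:
        for rest_upper, rest_lower in paths(continuation):
            yield upper + rest_upper, lower + rest_lower
```

Starting from Number, the walk yields 2 x (6 + 1) x 6 = 84 pairs per stem, for example +Pl+Poss+3p+Pl+Case+Loc paired with -lEr-lErIN-DE, which agrees with the count of basic forms given in Section 3.1.3.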

PSuff 6.The last component of our lexicon is the case inflection sub-lexicon. +2p+Sg:-*In .Case +Sg:0 . To summarize.)). It is obligatory. +1p+Pl:-*ImIz . a visual map of the lexicon network is presented in Figure 3. +Abl:-Den . as all uninflected nouns are in their absolute form (Nominative case). +Abs:0 Figure 3. +Loc:-DE .2 The Rules Component The rules component of the system is implemented as a sequence of composed transducers in xfst using the formalism of replacement rules. 1 24 .3.Number +1p+Sg:-*Im .Possessive +Poss:0 +Case:0 +Case:0 7.2: Schematic visualization model of the lexicon network 3. +3p+Pl:-lErIN 5. The hash symbol (#) is an anchor symbol denoting word boundary (in replacement rules it is circumfixed by dots (. +Pl:lEr 4.NN +Noun:0 3. +2p+Pl:-*InIz .2 below: 0 0. It currently features 17 rules. 8.CSuff +DefObj:-*yI .Root 1.Noun /Noun Stems/ 2. +Gen:-*nIn . of which 12 are significant and 5 are just for cleaning up the markers1. # +Dat:-*yE . +3p+Sg:-*sIN . as it often happens that I need to preserve some markers in order to see what exactly has gone wrong in case of an error.#. The rules are composed in a particular I prefer keep them apart in the development stage.

The rules are split (for now) in several groups addressing the different phenomena types that they describe.1 Vowel Harmony Rules So far.2. the rules for e-type and i-type vowel harmony are split into two separate rules (which operate in parallel among themselves). 25 .. but they might need to be merged into a single rule operating in parallel on all the harmonizing segments. there are no other vowels. (32) I -> i || [FrontV & UnroundedV] ~$[Vowel] _ .. This is especially true for monosyllabic roots that lose their one and only vowel.) which stands for sequential operation). Same for the i-type harmony rule.. Full independence is hardly achievable.sequence. A thing to mention. The tilde (~) on the other side stands for a complementation operator – negation (in this case: negation of the language that contains vowels). for the rule of progressive assimilation in suffixes. In simple words the left context should be read as: there is a front vowel on the left and between it and the symbol to be resolved (E).o. where the e-type precedes the i-type harmony resolution. an underlying E is realized as e on the surface when the last preceding vowel is a front vowel and as a when the last preceding vowel is a back vowel. As already mentioned.1. As mentioned above in Section 3. This defines the e-type harmony rule: (31) E -> e || FrontV ~$[Vowel] _ . In the case of a dropping vowel in the stem for instance. as some of them do depend on each other. I -> ü || [FrontV & RoundedV] ~$[Vowel] _ . E -> a || BackV ~$[Vowel] _ The dollar sign ($) has a special meaning in xfst – „contains“. only that it considers two features (backness and roundedness) of the last preceding vowel. Further on. only that it concerns back vowels in the left context.) in xfst replacement rules stand for parallel operation (as opposed to the composition operator (. since the suffixes have to harmonize with this vowel. I -> ı || [BackV & UnroundedV] ~$[Vowel] _ . 
I -> u || [BackV & RoundedV] ~$[Vowel] _

Same for the second line. The double commas (,,) separate rules that apply in parallel, so this is a two-level rule.

As far as vowel disharmony in suffixes is concerned, as already mentioned, the stems that induce such disharmony will be marked as such (again this could be implemented as an automated procedure) by inserting a (dis-)harmony marker after the last vowel of the stem. The disharmony marker itself will be nothing more than the vowel that induces the new vowel harmony, prefixed by a harmony marker (H). For example, alkol (alcohol), which transforms into alkolü (instead of *alkolu) (the alcohol) and alkoller (instead of *alkollar) (alcohol, Pl), will be lexically represented as alkoHül. In other words, the vowel harmony rules have to apply before the vowel is deleted.

A few classes needed to be defined in order to make the rules operational. I defined a class for the vowels and consonants initially, where the consonant class had to be extended to feature all the archiphonemic descriptions used. The vowels are further divided into subclasses according to their features for the vowel harmony resolution.

3.3.2.2 Consonant Alternation Rules

The most productive consonant alternation rule, as described in Section 3.2, is the final stop devoicing rule. The suffix onset (de)voicing rule will also fall in this category. Similar rules (both in operation and conditions) are the K/0 alternation rule and the consonant gemination rules. As we've already had an extensive overview of the principles behind these rules, I will not discuss them any further. For these rules, I had to use abstract symbols denoting the alternating phonemes (just as in the vowel harmony rules). I also had to define a class of voiceless consonants.

(33)
Final Consonant Devoicing Rule:
[ B -> b , C -> c , D -> d || _ %.(%*) Vowel ] .o.
[ B -> p , C -> ç , D -> t || _ [[%.(%*) Cons] | .#.]]

Suffix Onset Devoicing Rule:
[ C -> ç , D -> t , G -> k || VLCons %.(%*) _ ] .o.
[ C -> c , D -> d , G -> ğ || ~VLCons %.(%*) _ ]

Velar Alternation Rule:
[ G -> ğ , K -> ğ , Q -> g || _ %.(%*) Vowel ] .o.
[ G -> g , K -> k , Q -> k || _ [[%.(%*) Cons] | .#.]]

Gemination Rule:
[ S -> s , T -> t || _ [%. Cons | .#.] ] .o.
[ S -> [ s s ] , T -> [ t t ] ]

The percent sign (%) is used as an escape character in xfst to literalize characters that have a special meaning otherwise. The round brackets denote optionality in the regular expressions sense: (%*) in a replacement rule means "there is a possible literal * there". The anchor marker (.#.) is used to denote word boundaries (the beginning of string if used on the left and the end of string if used on the right).

A few notes on the gemination rule: it is a rather radical approach as far as context is concerned, and it could be improved in case of failure. So far it covers only the cases of geminating s and t, just to implement the principle. They were chosen at random out of the set of eight geminating consonants in Turkish. For the remaining six consonants, special symbols have to be chosen and their transformations need to be inserted in the rule (a purely mechanical operation). There are however some special cases, where gemination and devoicing occur simultaneously:

(34) muhip (friend) → muhibbi (friend, Acc, "the friend")

and the even further complicated case of serhat (border), which is an exception to vowel harmony, besides undergoing gemination and voicing: serhaddi (border, Acc, "the border"). This issue could be fixed using a few minor tricks and the current system is ready to handle it, but I will leave it for a later stage of development.
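The behaviour of the final consonant devoicing and gemination rules in (33) can be approximated outside xfst. The following is only a minimal Python sketch under several assumptions that are not part of the thesis: the function names are hypothetical, the suffix vowels are taken to be already harmonized, the optional marker (%*) is ignored, every symbol that is neither a vowel nor the boundary counts as a consonant, and the test words kitap and his are illustrative examples chosen here, not items from the lexicon.

```python
# Hypothetical sketch of two rules from (33); '.' is the morpheme boundary.
VOWELS = set("aeıioöuü")
# Archiphoneme -> (voiced, voiceless) realization.
STOPS = {"B": ("b", "p"), "C": ("c", "ç"), "D": ("d", "t")}
# Archiphoneme -> single consonant; it is doubled before a vowel-initial suffix.
GEMINATES = {"S": "s", "T": "t"}


def _before_vowel(rest: str) -> bool:
    """True if a morpheme boundary followed by a vowel comes next."""
    return len(rest) >= 2 and rest[0] == "." and rest[1] in VOWELS


def _before_cons_or_end(rest: str) -> bool:
    """True at the end of the word or before a boundary plus a consonant."""
    if rest == "":
        return True
    return len(rest) >= 2 and rest[0] == "." and rest[1] not in VOWELS


def final_stop_devoicing(lexical: str) -> str:
    out = []
    for i, ch in enumerate(lexical):
        rest = lexical[i + 1:]
        if ch in STOPS and _before_vowel(rest):
            out.append(STOPS[ch][0])   # voiced before a vowel-initial suffix
        elif ch in STOPS and _before_cons_or_end(rest):
            out.append(STOPS[ch][1])   # voiceless word-finally or before a consonant
        else:
            out.append(ch)             # everything else is left untouched
    return "".join(out)


def gemination(lexical: str) -> str:
    out = []
    for i, ch in enumerate(lexical):
        if ch not in GEMINATES:
            out.append(ch)
        elif _before_cons_or_end(lexical[i + 1:]):
            out.append(GEMINATES[ch])       # single consonant
        else:
            out.append(GEMINATES[ch] * 2)   # geminate elsewhere
    return "".join(out)


def surface(lexical: str) -> str:
    """Apply the two rules in sequence, then remove the boundary marker."""
    return gemination(final_stop_devoicing(lexical)).replace(".", "")


print(surface("kitaB"), surface("kitaB.ı"), surface("kitaB.lar"))
print(surface("hiS"), surface("hiS.i"))
```

As in the xfst rules, the second branch of each function plays the role of the second network in the composition: it only fires in the contexts the first one leaves unresolved.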

3.3.2.3 Fixing the Morphotactics

The few next rules are used to "fix the morphotactics": they deal with general phenomena such as vowel/consonant deletion, the pronominal n, the lexical exception "su" and the elimination of the multiple plural morpheme.

The rule for multiple plurals simply takes two adjacent plural morphemes and rewrites them as a single morpheme, nothing unusual. The rule for the exceptional class of words ending in "-su" (water) is again pretty simple: the words ending in su are specified in the lexicon as suY (this is partially from the origin of the word; historically it derives from "suw"). This special symbol is then realized as y in the proper context or as epsilon by default. The rule for the pronominal n simply drops the N word finally (a tricky solution).

The rule for the glottal stop is again a tricky solution. My approach was to optionally delete it if a vowel follows:

(35) [ %^ (->) [.0.] || _ %.%* ]

This way, the ^ marker is either realized as an underlying consonant or as nothing at all, so both camii and camisi/camiyi will be recognized as described in Section 3.

The most complicated rule, and the one that took me the most time to design optimally (and which is still under consideration whether it is the best solution or not), is the rule that manages all the dropping consonants and vowels in suffixes (except the pronominal n). As we've seen, all these phenomena occur to avoid vowel sequences. In the case of suffixes we have y and n insertion if the stem ends in a vowel. On the other side, we have high vowel (I) deletion if the stem ends in a vowel and s insertion if the stem ends in a vowel in the possessive inflectional paradigm. After quite a bit of thinking I dealt with all these phenomena in a single blow:

(36) [ [? - HighV] -> 0 || Cons %.%* _ ] .o.
     [ HighV -> 0 || Vowel %.%* _ ]

The above composition of two rules does two things, namely:
1. It deletes every segment that is not a high vowel (I), which is marked up as optional, in the context of a preceding consonant across a morpheme boundary, and
2. It deletes every high vowel (I), which is marked up as optional, in the context of a preceding vowel across a morpheme boundary.

In simple terms, I called it fixing the vowel sequences, as this is more or less what it is supposed to do. It took me quite a while to come to this idea. I was happy to see that others have approached the issue in a similar way1.

Next, the rule for dropping stem vowels has nothing particularly interesting to it. Dropping segments are prefixed by a literal dollar sign ($), so an underlying koy$un (bosom) will be realized as koyun (bosom) and koynu (the bosom) (as opposed to koyun (sheep), which is realized as koyunlar (sheep, Pl) and koyunu (the sheep)). The dropping segments remain if the suffix attached starts in a consonant on the surface.

1 See (Schaaik, 1996).
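The effect of the two composed deletions in rule (36) can be illustrated with a short Python sketch. It is not the thesis implementation: the function names are hypothetical, HighV is approximated by the surface high vowels (the suffix vowels are assumed to be harmonized already), and the stems ev and araba with their suffixes are illustrative examples chosen here. As in the rules, "." marks the morpheme boundary and "*" marks the following segment as optional.

```python
import re

# Hypothetical sketch of rule (36), "fixing the vowel sequences".
VOWELS = "aeıioöuü"
HIGH_VOWELS = "ıiuü"

# Step 1: delete an optional non-high-vowel segment after consonant + boundary.
DROP_AFTER_CONS = re.compile(rf"([^{VOWELS}.*]\.\*)[^{HIGH_VOWELS}]")
# Step 2: delete an optional high vowel after vowel + boundary.
DROP_AFTER_VOWEL = re.compile(rf"([{VOWELS}]\.\*)[{HIGH_VOWELS}]")


def fix_vowel_sequences(lexical: str) -> str:
    """Apply the two deletions in sequence, mirroring the .o. composition."""
    step1 = DROP_AFTER_CONS.sub(r"\1", lexical)
    return DROP_AFTER_VOWEL.sub(r"\1", step1)


def clear_markers(form: str) -> str:
    """Remove the boundary and optionality markers (a separate clean-up step)."""
    return form.replace(".", "").replace("*", "")


def realize(lexical: str) -> str:
    return clear_markers(fix_vowel_sequences(lexical))


for lexical in ("ev.*yi", "araba.*yı", "ev.*im", "araba.*ım", "araba.*sı"):
    print(lexical, "->", realize(lexical))
```

Keeping the marker clean-up in its own function follows the preference stated in the text for separating the clean-up steps from the rules during development.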

The remaining rules clean up the marker leftovers. The clean-up procedures can be incorporated in the rules themselves, but during the development stage I prefer to keep them separated for debugging purposes.

3.3.2.4 Rule Order

A few notes on the current rule ordering.

[Figure 3.3: The current rule ordering. The figure shows a single composition (.o.) chain of the following rule networks, listed here alphabetically, with ClearExHarmony as its last element: ClearExHarmony, ClearGlottal, ClearMBMarker, ClearOptMarker, ClearSVDMarker, eTypeHR, FinalStopDevoicing, FixGlottal, FixMultiPlural, FixVowelSeq, Gemination, iTypeHR, PronominalN, StemVDeletion, SuffOnsetDevoicing, SUexception, VelarAlternations.]

First things first, getting rid of the multiple plural morpheme is a good thing to start with. There are some local dependencies among the rules: as already mentioned, the e-type harmony rule has to precede the i-type harmony rule (or probably they will have to be merged in a single rule and apply simultaneously as two-level rules). Also the vowel harmony resolution shall precede the stem vowel deletion. If we proceed from left to right (with parallel rules), the stem vowels will be deleted before the vowels in the suffixes which shall harmonize with the deleted vowel are resolved. And therefore if we have the wrong rule ordering, rules that apply on segments that occur after unresolved segments might cause major troubles. This is the reason why most finite state approaches to Turkish morphology are based on two-level morphological descriptions.

The final stop devoicing, the velar alternations, the gemination, stem vowel deletion and the rule for the su exception could all operate at a single stage, as they occur in identical contexts and their purpose is more or less the same. The suffix onset devoicing rule is partially dependent on the outcome of the final stop devoicing rule, but if the input is processed left to right, this will be determined before the application of the suffix onset devoicing rule. The pronominal n rule is also on its own. There should also be some tendency to go from simpler and more general to more sophisticated and specific rules (either in upward or downward direction). In the end it seems that the rules are mostly independent. All that matters is to process the input sequentially, from left to right.

4. Conclusions

In order to analyze the complex and often symbiotic relations between words, one needs first to determine the exact properties of each and every individual token. Some of the properties, however, could only be determined after examining the environment, so getting the bigger picture. The common approach to this issue is "inside-out" (or bottom-up): starting from the basic entities and building up increasingly complex structures out of them. In this paper I presented an approach to part of the "basic entities" in the Turkish language.

5. Future Work

Where do we go from here on? One could come up with various ideas. As a first step however, before everything else, a complete coverage of the language of choice has to be accomplished. First, the model has to be completed to cover the other major word category in Turkish, as well as the minor word categories, to result in a full-featured morphological processor. Then perhaps, to extend functionality, a lexicon extraction routine has to be implemented that automatically extracts entities from a dictionary into the morphological processor. This could be combined with a morphological guesser, in which the former will be used to train the latter, and the guessing algorithm will occasionally provide substance for the extension of the lexicon, and the two could form a symbiotic relation. Such however is not present in the current stage of development. Further on I am thinking of implementing a syllabification module, as it seems quite necessary, as well as perhaps stress markup.

Having a fully functional morphological processor at hand, there are various ways one could take: integrate it into a larger NLP system (speech synthesis/recognition applications, OCR applications, language tutoring applications, automatic machine translation applications, artificial intelligence components, supplemental linguistic applications), extend its functionality for different tasks (a major advantage of the modular approach: simply add a new module for the task at hand and occasionally tune up the existing modules), add a context component for disambiguation (this falls in the previous category perhaps), try approaching a different language, and numerous other options in the field. I myself am not so sure which way this project will take.

Bibliography:

Clements, George N. and Engin Sezer. Vowel and Consonant Disharmony in Turkish. In The Structure of Phonological Representations (Part II), ed. by H. van der Hulst and N. Smith. Linguistic Models. Foris, Dordrecht, The Netherlands. 1982.
Dik, Simon C. Functional Grammar, 3rd Ed. Foris Publishing, Dordrecht, The Netherlands. 1981.
Hankamer, Jorge. Finite State Morphology and Left to Right Parsing. Paper, 3rd International Conference on Turkish Linguistics, Tilburg, Holland, August 1986.
Hopcroft, J. E. and J. D. Ullman. Introduction to Automata Theory, Languages and Computation. Addison-Wesley. 1979.
Inkelas, Sharon and Orhan Orgun. The Implications of Lexical Exceptions for the Nature of Grammar. In Derivations and Constraints in Phonology, ed. by Iggy Roca. Clarendon Press, Oxford. 1997.
Johnson, C. Douglas. Formal Aspects of Phonological Description. Mouton, The Hague, Paris. 1972.
Karttunen, Lauri, with Kenneth R. Beesley. Finite State Morphology. CSLI Publications, Stanford, USA. 2003.
Ketrez, F. Nihan. Multiple Readings of the Plural Morpheme in Turkish. USC. 2003. (online at: http://www-scf.usc.edu/~ketrez/papers/ADL2003ketrez.pdf, 25.06.2005)
Koskenniemi, Kimmo. Two-level Morphology: A General Computational Model for Word-Form Recognition and Production. Department of General Linguistics, University of Helsinki. 1983.
Köksal, A. A First Approach to a Computerized Model for the Automatic Morphological Analysis of Turkish. Doctoral Dissertation. Hacettepe Universitesi, Ankara. 1975.
Lewis, Geoffrey. Turkish Grammar. Oxford University Press, Oxford. 1967.
Lewis, Geoffrey. Turkish, 2nd ed. (Teach Yourself Books). Hodder and Stoughton. 1989.
Oflazer, Kemal. Two-level Description of Turkish Morphology. Linguistic and Literary Computing. 1994. (online at: http://acl.ldc.upenn.edu/E/E93/E93-1066.pdf, 25.06.2005)
Oflazer, Kemal, Elvan Göçmen and Cem Bozşahin. An Outline of Turkish Morphology. Technical Report, Middle East Technical University. 1995. (online at: http://www.lcsl.metu.edu.tr/ftp/papers/morphspecs.ps.gz, 18.07.2005)
Pollard, Asuman Çelen and David Pollard. Turkish: A complete course for beginners. (Teach Yourself Books). Hodder and Stoughton, London. 1996.
Schaaik, Gerjan van. Studies in Turkish Grammar. Harrassowitz Verlag, Wiesbaden, Germany. 1996.
Sebüktekin, Hikmet I. Turkish-English Contrastive Analysis: Turkish Morphology and Corresponding English Structures. Mouton, The Hague, Paris. 1971.

Useful links:

http://www.hlst.sabanciuniv.edu/TL/ - The Turkish Lexical Database Project - provides morphological analysis to verify the results
http://www.turkishdictionary.net/ - Turkish online dictionary - additional glossary
http://www.google.com/ - Everything is there! - using the web as a corpus

Appendix A: List of Abbreviations

CASES:
Nom./+Abs              Nominative/Absolute
Acc./+DefObj           Accusative/Definite Objective
Dat./+Dat              Dative
Gen./+Gen              Genitive
Loc./+Loc              Locative
Abl./+Abl              Ablative

NUMBER/POSSESSIVE:
Sg./+Sg                Singular
Pl./+Pl                Plural
(+)1p/2p/3p Poss./+Poss   1/2/3 Person Possessive

GENERAL:
FST                    Finite State Transducer
FSA                    Finite State Automaton (-ta)
FSN                    Finite State Network
LF                     Lexical Form (lexicon entry form)
SF                     Surface Form (standard orthographical representation)

Appendix B: lexc Code Samples

!##############A lexc solution to the network in Figure 2.2########################
LEXICON Root     !#The start state so to say. Every lexicon needs it.
One ;            !#Figuratively speaking: State 1

LEXICON One
a Two ;          !#The two arcs with the respective input symbols and destinations
b Three ;

LEXICON Two      !#State 2
b Three ;

LEXICON Three    !#State 3
# ;              !#The hash symbol denotes end of input, or a final state
c One ;          !#The loop back to State 1

!#A line in lexc has two components:
!#1. An expression (which could be as complex as needed)
!#2. A continuation class
!#Think of the expression as the symbol over the arc and
!#the continuation class as the destination state

!#################A model lexc solution for Figure 2.3###########################
!#Same as above for the most part
LEXICON Root
One ;

LEXICON One
a:A Two ;        !#The colon operator denotes a transduction here
b:B Three ;

LEXICON Two
b:B Three ;

LEXICON Three
# ;
c One ;

!#Basically the expressions could be regular expressions
!#with varying complexity, combining various operations,
!#but as my key concept is modularity, I will try to keep
!#them as simple as possible.
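To make the idea of continuation classes more tangible, the second lexc sample can be mimicked with a small Python sketch. This is only an illustration under assumptions made here: the dictionary layout and the function name lookdown are hypothetical, each entry is stored as an (upper, lower, continuation class) triple, and the upper side of a pair such as a:A is taken to be the symbol before the colon.

```python
# Hypothetical sketch: the Figure 2.3 lexicon as continuation classes.
# Each entry is (upper symbol, lower symbol, continuation class); "#" ends a word.
LEXICON = {
    "Root": [("", "", "One")],
    "One": [("a", "A", "Two"), ("b", "B", "Three")],
    "Two": [("b", "B", "Three")],
    "Three": [("", "", "#"), ("c", "c", "One")],
}


def lookdown(upper: str, state: str = "Root") -> set:
    """Return every lower-side string that the network pairs with `upper`."""
    results = set()
    for up, low, target in LEXICON[state]:
        if not upper.startswith(up):
            continue
        rest = upper[len(up):]
        if target == "#":
            # The end-of-input class accepts only when nothing is left.
            if rest == "":
                results.add(low)
        else:
            # Follow the continuation class with the remaining input.
            for tail in lookdown(rest, target):
                results.add(low + tail)
    return results


print(lookdown("ab"))
print(lookdown("abcb"))
print(lookdown("a"))
```

Each dictionary key corresponds to one LEXICON block and each triple to one lexc line, so following a continuation class is the same as following an arc to its destination state.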

Appendix C: On Replacement Rules

Replacement rules are simply intuitive and convenient shorthands for more complex regular expressions. The most general shape of a context-free replacement rule is:

A->B

where A and B are regular languages (which could be arbitrarily complex regular expressions themselves). In this case, every string from the upper language (the universal language1) is mapped to itself, except that whenever a substring from A is encountered, it is related to a substring from B (as opposed to normal transducers where, if the input string doesn't match a string from the upper language, nothing happens and there is no output).

This formalism is further extended to include context:

A->B || L _ R

where A, B, L and R all denote languages and not relations (both L and R are optional). The double vertical bars separate the rule(s) from the context. What happens here is essentially the same as above, only that the languages A and B are further contextually restricted. A substring from A is related to a substring from B only if it is preceded by a substring from L and followed by a substring from R.

Different rules operating in the same context are separated by a comma:

A->B , C->D || L _ R

The same is valid for contexts:

A->B || L1 _ R1 , L2 _ R2

Replacement rules could be constructed to operate in parallel (as in two-level models) using the double comma (,,) separator:

A->B || L1 _ R1 ,, C->D || L2 _ R2

Or composed as standard networks:

[A->B || L1 _ R1] .o. [C->D || L2 _ R2]

The difference is crucial if the rules are dependent on each other.

These are the basics. For more information on XFST and its replacement rules refer to (Karttunen, 2003).

1 The language of all possible strings.
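Why the difference between parallel and composed rules matters for dependent rules can be seen in a small Python sketch. It is a deliberate simplification made here, not xfst behaviour in full: the function names are hypothetical, and the rules are context-free and rewrite single symbols only.

```python
# Hypothetical sketch: parallel versus composed application of a -> b and b -> a.

def apply_parallel(text: str, rules: dict) -> str:
    """All rules look at the original input, as with the ',,' separator."""
    return "".join(rules.get(symbol, symbol) for symbol in text)


def apply_composed(text: str, rules: list) -> str:
    """Each rule sees the output of the previous one, as with '.o.'."""
    for source, target in rules:
        text = text.replace(source, target)
    return text


print(apply_parallel("abba", {"a": "b", "b": "a"}))
print(apply_composed("abba", [("a", "b"), ("b", "a")]))
```

In the parallel case the two symbols are swapped, while in the composed case the second rule also rewrites the output of the first, so every symbol ends up as a. This is the same kind of dependency that the rule ordering of the harmony and deletion rules has to respect.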