Md. Shariful Islam Bhuyan, and Reaz Ahmed

ABSTRACT
Although Head-driven Phrase Structure Grammar (HPSG) is a successful syntactic theory in many respects, it has inadequate coverage of morphological constructions, especially of nonconcatenative morphology, which is prominent in Semitic languages such as Arabic and Hebrew. In this paper, we extend the HPSG framework to support the rich nonconcatenative morphology of the Arabic verbal system, the best instance of nonconcatenative morphology among the living languages. We also introduce the features needed for the syntactic and semantic aspects of an Arabic verb.

Keywords: Nonconcatenative Morphology, Head-driven Phrase Structure Grammar, Arabic Verbal Morphology, Constraint-based Grammar

The Arabic language exhibits an extremely rich morphology [8]-[9]. Both concatenative and nonconcatenative operations take place in the formation of an Arabic word: inflection is carried out by concatenative operations, whereas derivation is carried out by nonconcatenative operations.

Morpho-syntactic operations performed over morphemes come in two flavors, concatenative and nonconcatenative. Concatenative operations are those in which morphemes are linearly concatenated. For example:
i. Prefixation: clear | unclear
ii. Suffixation: walk | walked
iii. Circumfixation: mind | unmindful
Nonconcatenative operations are those in which morphemes are nonlinearly embedded. For example:
i. Infixation: kataba | kattaba
ii. Simulfixation: eat | ate
iii. Modification: man | men
iv. Suppletion: go | went
There are many other morpho-syntactic operations as well. In this paper, we focus mainly on nonconcatenative operations and give a mathematical formalism that captures their rich diversity.

Arabic word formation is an excellent example of nonconcatenative root-pattern morphology. A combination of root letters is plugged into one of a variety of morphological patterns, each with a priori fixed letters and a particular vowel melody, and the pattern gives rise to the corresponding syntactic and semantic phenomena. We call such a morphological pattern a “measure” in this paper. To convey the richness of Arabic measures, we give the following example. Here, the root letters ‘k’, ‘t’, ‘b’, bearing the concept of writing, are plugged into various measures to obtain a myriad of syntactic and semantic phenomena. A set of measures sharing a particular semantic paradigm is called a “Form”. Arabic has many forms, ten of which are used regularly. The root letters ‘k’, ‘t’, ‘b’ can be plugged into nine of them.
i. Form I (Transitive): kataba – He wrote

Broad-coverage precision grammar [1]-[3] and computational lexicon development for deep linguistic processing is a research-intensive area with several potential applications [4]. Amidst the vast literature on formal linguistic theory [5], Head-driven Phrase Structure Grammar (HPSG) [6] holds a unique position, since it combines the best features of contemporary approaches and establishes an integrated framework for cross-layer representation comprising phonology, morphology, syntax, semantics, pragmatics and discourse. Although HPSG successfully describes numerous syntactic and semantic phenomena, it lacks rigorous analyses of morphological phenomena, especially of nonconcatenative morphology [7]. Nonconcatenative morphology illustrates an interesting paradigm of morphological operations, which is prominent in Semitic languages such as Arabic and Hebrew [10]-[11]. Among the living languages, Arabic demonstrates the best instance of nonconcatenative morphology. The Arabic verb system exhibits both concatenative and nonconcatenative morphology and is capable of lexically expressing diverse syntactic and semantic phenomena. The formalisms of existing morphological analyzers for Arabic are not powerful enough to capture this higher-layer diversity. In this paper, we extend the HPSG framework to support rich nonconcatenative morphology, as a step toward the first comprehensive HPSG construction of the Arabic verbal system.

ii. Form II (Causative): kattaba – He caused to write
iii. Form III (Ditransitive): kaataba – He corresponded
iv. Form IV (Factitive): aktaba – He dictated
v. Form V (Reflexive): takattaba – It was written on its own
vi. Form VI (Reciprocity): takaataba – They wrote to each other
vii. Form VII (Submissive): inkataba – He was subscribed
viii. Form VIII (Reciprocity): iktataba – They wrote to each other
ix. Form X (Control): istaktaba – He asked to write

The above example illustrates the derivational paradigm of an Arabic word. However, there is also an inflectional paradigm. Every entry of Table 1 can take fourteen inflectional forms according to its number, gender and person. For the imperfect form, there are three such inflectional paradigms. Tables 2 and 3 show the inflectional paradigms for the active perfect and passive perfect entries of Form I.

Table 1 Derivational Paradigm of root “ktb”
                          FORM I        FORM II       FORM III      FORM …
Active perfect            katab-a       kattab-a      kaatab-a      …
Passive perfect           kutib-a       kuttib-a      kuutib-a      …
Active imperfect          ya-ktub-u     yu-kattib-u   yu-kaatib-u   …
Passive imperfect         yu-ktab-u     yu-kattab-u   yu-kaatab-u   …
Active imperative         u-ktub        kattib        kaatib        …
Passive imperative        litu-ktab     litu-kattab   litu-kaatab   …
Verbal noun               kitaab-atun   ta-ktiib-un   kitaab-un     …
Active participle         kaatib-un     mu-kattib-un  mu-kaatib-un  …
Passive participle        ma-ktuub-un   mu-kattab-un  mu-kaatab-un  …
Locative participle       ma-ktab-un    …             …             …
Instrumental participle   mi-ktab-un    …             …             …

Table 2 Inflectional Paradigm of Form I-Active Perfect
Ind/Sub/Juss   Singular    Dual         Plural
3rd/Masc.      katab-a     katab-aa     katab-uua
3rd/Fem.       katab-at    katab-ataa   katab-na
2nd/Masc.      katab-ta    katab-tuma   katab-tum
2nd/Fem.       katab-ti    katab-tuma   katab-tunna
1st            katab-tu    katab-na     katab-na

From the above tables, we note a crucial point: for a particular inflectional paradigm, a certain portion of the word, containing all the root letters, is always constant. For Tables 2 and 3, these portions are “katab” and “kutib” respectively. We call this fixed portion the “stem-measure”, and the remaining part, which contains a prefix and/or suffix and is governed by the agreement information, the “affix-measure”. In our analysis, a measure may therefore contain two parts: a stem-measure and an affix-measure.

An Arabic word can encode a complete sentence. For example, sayaktubuhu – He will write it. We can break the word into the following components.
1) Prefix: sa – the particle indicating future
2) Suffix: hu – the object pronoun attached as a clitic
3) Root: k, t, b – the root letters bearing the concept of writing
4) Measure: ya_ _u_u – bearing the syntactic and semantic information of the event
It may be possible to concatenate multiple prefixes and suffixes. However, there must be a single measure and a single set of root letters. Here the measure indicates that the actor is in the 3rd person, singular number and masculine gender, and that the verb is in the indicative mood, active voice and derived Form I; it also indicates that the event has not yet been completed. If we plug in another set of root letters, for example n, S, r, which bears the concept of helping, we get sayanSuruhu – He will help him.

[Diagram: the word sa-yaktubu-hu annotated with its prefix (future particle), root (writing concept), measure (3rd/sg/masc/ind/perf/act/form-I) and suffix (it-attached pronoun)]

From the diagram, we can conclude that an Arabic word has four components, where the measure packages the syntactic and semantic features and the root supplies the core concept. Based on this analysis, we can give the following model of an Arabic word.

A root-derived Arabic word = Prefix + affix-measure (stem-measure (Root)) + Suffix
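The word model above can be illustrated with a minimal Python sketch. It assumes that a measure is written as a template string in which each underscore is a slot for one root letter, following the notation ya_ _u_u used in the text; the function names interdigitate and build_word are hypothetical and are not part of the HPSG analysis itself.

```python
from typing import Sequence


def interdigitate(measure: str, root: str) -> str:
    """Fill each '_' slot of a measure template with the next root letter.

    Spaces in the template are only visual separators and are dropped.
    The template notation is an assumption borrowed from 'ya_ _u_u'.
    """
    if measure.count("_") != len(root):
        raise ValueError("measure needs exactly one slot per root letter")
    letters = iter(root)
    out = []
    for ch in measure:
        if ch == " ":
            continue  # separator only, carries no letter
        out.append(next(letters) if ch == "_" else ch)
    return "".join(out)


def build_word(root: str, measure: str,
               prefixes: Sequence[str] = (),
               suffixes: Sequence[str] = ()) -> str:
    """Word = Prefix + measure(Root) + Suffix, with exactly one measure and one root."""
    return "".join(prefixes) + interdigitate(measure, root) + "".join(suffixes)


# The two examples from the text: same prefix, measure and suffix, different root.
print(build_word("ktb", "ya_ _u_u", ["sa"], ["hu"]))
print(build_word("nSr", "ya_ _u_u", ["sa"], ["hu"]))
# Stem-measures of Tables 2 and 3, written in the same assumed notation.
print(interdigitate("_a_a_", "ktb"), interdigitate("_u_i_", "ktb"))
```

Keeping the root and the measure as separate arguments mirrors the claim that the root supplies the core concept while the measure packages the grammatical information; swapping the root leaves everything else untouched.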

Table 3 Inflectional Paradigm of Form I-Passive Perfect
Ind/Sub/Juss   Singular    Dual         Plural
3rd/Masc.      kutib-a     kutib-aa     kutib-uua
3rd/Fem.       kutib-at    kutib-ataa   kutib-na
2nd/Masc.      kutib-ta    kutib-tuma   kutib-tum
2nd/Fem.       kutib-ti    kutib-tuma   kutib-tunna
1st            kutib-tu    kutib-na     kutib-na

There are syntactic and semantic features that govern the derivational and inflectional paradigms of Arabic roots. Through linguistic investigation, we have listed the features that will be used in this paper. The attributes in Tables 4 and 5 govern the derivational and the inflectional paradigm of an Arabic root, respectively.

Table 4 Attributes Governing Derivational Paradigm
Attribute   Values
POS         noun, verb, particle
FORM        I, II, III, IV, …
VOICE       active, passive
VFORM       perfect, imperfect, imperative

Table 5 Attributes Governing Inflectional Paradigm
Attribute      Values
MODALITY       emphatic, uncertainty
MOOD           indicative, subjunctive, jussive
PERSON         1st, 2nd, 3rd
NUMBER         singular, dual, plural
GENDER         masculine, feminine
CASE           nominative, accusative, genitive
DEFINITENESS   definite, indefinite
POLARITY       affirmative, negative

This is not the whole story of Arabic morphology. Another facet of Arabic morphology is the concept of root class. We call a set of roots that share a common derivational and inflectional paradigm a root class. The class is determined by the characteristics of the root letters. The roots ‘k’, ‘t’, ‘b’ and ‘n’, ‘S’, ‘r’ are both members of the same root class, the sound root class.

3. AN HPSG PRIMER
Natural languages generally consist of two components: first, the utterances that can be used by humans; second, the linguistic rules that license those utterances. For example, in English, ”He writes books”, ”writes books” and ”writes” are all valid utterances. However, “Writes he books”, “writes he” and “rwite” are not valid, since the rules do not license them. HPSG is a mathematical theory of natural language that formally captures these two core linguistic components. Utterances are modeled using a mathematical object, the Sign (a formal representation of linguistic objects such as phrases, words and lexemes), and rules are captured using another mathematical object, the Construct (a formal representation of the grammar rules or schemata that are used to license signs). Both signs and constructs are described using feature structures, that is, collections of features of the corresponding linguistic objects along with their values. These features and their values constitute a very detailed type hierarchy (see Figure 1).

Figure 1: An HPSG Type Hierarchy

From the type hierarchy of Figure 1, we can see that there are two types of feature structure. Atoms are atomic types that can be used as the values of features. Functions are feature structures described using an attribute value matrix (AVM); they map features to feature structures. Notable functions are sign, cxt (construction), phrase and others.

An utterance can have linguistic features spanning multiple layers, e.g. phonological, morphological, syntactic, semantic, pragmatic and others. To capture these features, the description of a typical HPSG sign looks like Figure 2.

Figure 2: An HPSG Sign

To capture grammatical rules, the feature structure of a construct has a mother (MTR) feature and a daughters (DTRS) feature. The value of MTR is a sign and the value of DTRS is a nonempty list of signs. The description of a typical HPSG construct looks like Figure 3. The licensing of signs follows the Sign Principle, which states that “Every sign must be lexically or constructionally licensed, where, lexically licensed only if it satisfies some lexical entry, and constructionally licensed only if it is the mother of some construct” [6]. We use constructional HPSG [6] in this paper.

To model a linguistic phenomenon, we first need to identify the signs involved, together with their hierarchy. Next, we need to design functional feature structures for them with linguistically motivated features.
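The relation between signs, constructs and the Sign Principle can be made concrete with a small Python sketch. It is a toy illustration under stated assumptions, not the feature geometry of [6]: a sign is reduced to a hypothetical form string plus a dictionary of syntactic features, and "satisfies some lexical entry" is simplified to plain equality with an entry in a word list.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Sign:
    """Toy sign: a surface form plus syntactic features (field names are assumptions)."""
    form: str
    syn: Dict[str, str] = field(default_factory=dict)


@dataclass
class Construct:
    """A construct has a mother (MTR) sign and a nonempty list of daughter (DTRS) signs."""
    mtr: Sign
    dtrs: List[Sign]

    def __post_init__(self) -> None:
        if not self.dtrs:
            raise ValueError("DTRS must be a nonempty list of signs")


def is_licensed(sign: Sign, lexicon: List[Sign], constructs: List[Construct]) -> bool:
    """Sign Principle: a sign must be lexically or constructionally licensed.

    Simplification: 'satisfies a lexical entry' is modelled as equality.
    """
    lexically = any(sign == entry for entry in lexicon)
    constructionally = any(c.mtr == sign for c in constructs)
    return lexically or constructionally


writes = Sign("writes", {"POS": "verb"})
books = Sign("books", {"POS": "noun"})
writes_books = Sign("writes books", {"POS": "verb"})

lexicon = [writes, books]
constructs = [Construct(mtr=writes_books, dtrs=[writes, books])]

print(is_licensed(writes, lexicon, constructs))         # licensed by the lexicon
print(is_licensed(writes_books, lexicon, constructs))   # licensed as a mother
print(is_licensed(Sign("rwite"), lexicon, constructs))  # not licensed
```

In a real grammar, licensing is checked by unification against typed descriptions rather than by equality; the sketch only shows the two routes by which a sign can be admitted.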

Figure 3: An HPSG Construction

We then also need to define the necessary constructs as well as the atomic type hierarchy. In the next section, we build these ingredients for Arabic verbs.

4. ARABIC IN HPSG
Here we give, using our analysis, the attribute value matrix (AVM) for the Arabic verb kataba – “He wrote” in its active form, and for its corresponding passive form kutiba – “It was written”.

Figure 4: An HPSG Sign for kataba

In Figure 4, we first have three features associated with morphology. First, the feature ROOT, which has a list of root letters as well as the CONTENT feature, which gives the semantic contribution made by the root letters. In this case, its value is structure-shared with the write-fr in the FRAMES feature. Next, the feature MEASURE, which contains the morphological, syntactic and semantic information contributed by the measure. First, the feature PATTERN captures the stem along with the root letters using structure sharing. Next, the feature FORM denotes the semantic paradigm, which governs the derivational paradigm of the verb lexeme. In this case, the word kataba is a Form I derivative. Next, the feature CAT contains the syntactic category for this measure. Its value is structure-shared with the syntactic feature CAT. Finally, the feature PNG captures the PERSON, NUMBER and GENDER information of our semantic actor in the case where it is not syntactically realized. Then, the feature TYPE denotes the associated root class. Arabic roots are classified into several root classes according to their derivational and inflectional paradigms. In this case, its value is sound. This feature affects both the root and the measure, so it has been taken out as a first-level morphological feature.

We present the syntactic and semantic information using the SYN and SEM features. First, the CAT feature identifies the syntactic category of kataba. It contains the VFORM and VOICE features of Arabic. In this case, their values are perfect and active respectively. Next, the VAL feature captures the subcategorization of verbs. VAL is a list of signs that are required by the syntactic head. The verb kataba (write) is a transitive verb that takes an object. We should note that the hidden pronoun “he” is encoded by the inflectional morphology when no explicit subject is used. The semantic actor is not realized syntactically. In this version, the verb subcategorizes only for a syntactic object. We can also see the constraints imposed on the object. In this case, the verb kataba requires an object. Therefore, its syntactic head should be a noun phrase with the value of its CASE feature set to accusative. The negative value of the OPT feature indicates that this object is not optional but is required for the utterance to be syntactically correct.

Next, we need to consider some semantic features. First, we consider the INDEX feature, which is a reference to a discourse entity. Next, the FRAMES feature, which serves as a bag of elementary predicates that describe the situation at hand. Here, we use a typed-feature version of predicate logic to capture the semantics of natural language. Predicates have their respective arguments. In this example, the event of writing is expressed. The event is completed in the past and there is a discourse referent for the actor, the hidden pronoun. To capture the core event, the write-predicate is introduced. There are two semantic roles associated with this predicate. First, we consider the role of the writer, who plays a doer role, expressed by the feature ACTOR. Second, we consider the role of the written, which plays an undergoer role, expressed by the feature UNDGR. Moreover, the write-predicate has a situation hook, expressed by the feature SIT. To capture the temporal constraint, we use the perfect-predicate. The perfect-predicate takes a situation hook as an argument, which is expressed as the feature ARG. Finally, to express the actor of the event, we introduce a discourse referent with the corresponding PNG feature, which captures the semantics of PERSON, NUMBER and GENDER.

We use the technique of co-indexing for sharing semantic objects. The discourse referent predicate is actually the actor of the write-predicate. To denote this constraint, the INDEX value of the hidden pronoun and the ACTOR value of the write-predicate are co-indexed; both are given the value i. This is an example of reference co-indexing. We also use event co-indexing. The event hook SIT of the write-predicate, the situation hook of the entire scenario and the argument ARG of the perfect-predicate are all co-indexed and expressed using the value s.

Another important issue in an HPSG representation is the syntax-semantics interface. For example, in the case of kataba, this is done by co-indexing the INDEX value of the syntactic object and the UNDGR value of the write-predicate with a value j. This indicates that the syntactic object is our semantic undergoer, whereas, as noted in the previous discussion, the semantic actor is not syntactically realized.

Figure 5: An HPSG Sign for kutiba

In Figure 5, we show the HPSG representation of the passive kutiba. We identify the changes associated with this conversion. First, the PATTERN feature is changed to capture the derivational morphological operation. The next change is found in the feature VOICE, whose value changes to passive. Unlike English, which can have a prepositional complement in passives, Arabic passives do not subcategorize for a subject or any other argument. For this reason, the VAL list is empty. Moreover, the discourse referent in the FRAMES feature, with its PNG feature, is now co-indexed with the UNDGR feature of the write-predicate, expressed by the value j. The semantic actor is now completely unknown, since nothing in the sign refers to it.

Figure 6: An HPSG Sign for yaktubu

Figure 7: An HPSG Sign for yuktabu
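The co-indexing described for kataba and kutiba can be sketched in Python by letting object identity stand in for structure sharing: two features are co-indexed exactly when they hold the same object. This is only an illustration under stated assumptions. The nested-dictionary layout, the SEM-level key SIT for the situation hook of the whole scenario, and the predicate names perfect-fr and referent-fr are hypothetical; only write-fr and the feature names VFORM, VOICE, VAL, CASE, OPT, INDEX, FRAMES, SIT, ACTOR, UNDGR, ARG and PNG come from the text. The PNG values given for kutiba are likewise assumed.

```python
class Index:
    """An index token; co-indexing is modelled as sharing the very same object."""

    def __init__(self, name: str) -> None:
        self.name = name

    def __repr__(self) -> str:
        return self.name


def kataba_sign() -> dict:
    """Active perfect 'kataba': the actor is a hidden pronoun, the object is required."""
    i, j, s = Index("i"), Index("j"), Index("s")
    obj = {"CAT": "noun", "CASE": "accusative", "OPT": False, "INDEX": j}
    png = {"PERSON": "3rd", "NUMBER": "singular", "GENDER": "masculine"}
    return {
        "SYN": {"CAT": {"VFORM": "perfect", "VOICE": "active"}, "VAL": [obj]},
        "SEM": {
            "SIT": s,  # hypothetical key for the situation hook of the scenario
            "FRAMES": [
                {"PRED": "write-fr", "SIT": s, "ACTOR": i, "UNDGR": j},
                {"PRED": "perfect-fr", "ARG": s},
                {"PRED": "referent-fr", "INDEX": i, "PNG": png},
            ],
        },
    }


def kutiba_sign() -> dict:
    """Passive perfect 'kutiba': empty VAL, no ACTOR, referent shared with UNDGR."""
    j, s = Index("j"), Index("s")
    png = {"PERSON": "3rd", "NUMBER": "singular", "GENDER": "masculine"}  # assumed
    return {
        "SYN": {"CAT": {"VFORM": "perfect", "VOICE": "passive"}, "VAL": []},
        "SEM": {
            "SIT": s,
            "FRAMES": [
                {"PRED": "write-fr", "SIT": s, "UNDGR": j},
                {"PRED": "perfect-fr", "ARG": s},
                {"PRED": "referent-fr", "INDEX": j, "PNG": png},
            ],
        },
    }


def coindexed(a: object, b: object) -> bool:
    """Two values are co-indexed only if they are one and the same object."""
    return a is b


active = kataba_sign()
write, perfect, referent = active["SEM"]["FRAMES"]
print(coindexed(write["ACTOR"], referent["INDEX"]))                 # reference co-indexing (i)
print(coindexed(write["SIT"], perfect["ARG"]))                      # event co-indexing (s)
print(coindexed(write["UNDGR"], active["SYN"]["VAL"][0]["INDEX"]))  # syntax-semantics interface (j)
```

Using identity rather than equality of names matters here: two indices that merely share a label would not express that the syntactic object and the semantic undergoer are the same discourse entity.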

In the passive, the semantic actor has no syntactic or semantic reference, which is a distinctive property of Arabic passives.

In Figures 6 and 7, we also give the HPSG signs for yaktubu and yuktabu, the imperfect and passive imperfect forms of kataba. A newly introduced feature is MOOD, which takes the value indicative in this case, and the value of VFORM changes to imperfect. An imperfect-predicate denotes the non-completion aspect of the event.

5. CONCLUSION
In this paper, we propose a way to capture nonconcatenative morphology, especially Arabic verb morphology, within the framework of HPSG. To construct the matrix from Table 1, we need to cope with the wide range of forms that an Arabic verb can take. Much work remains to be done in the future. The results will be helpful for the construction of resource grammars for languages with rich nonconcatenative morphology.

REFERENCES
[1] Copestake A. and Flickinger D., “An open-source grammar development environment and broad-coverage English grammar using HPSG,” Second conference on Language Resources and Evaluation, 2000.
[2] Comrie B., Fabri R., Hume B., Mifsud M., Stolz T., and Vanhove M. (Eds), “Towards an HPSG Analysis of Maltese,” 1st International Conference on Maltese Linguistics, 2007.
[3] Marimon M., Bel N., Espeja S., and Seghezzi N., “The Spanish Resource Grammar: pre-processing strategy and lexical acquisition,” ACL Workshop on Deep Linguistic Processing, 2007.
[4] Bond F., Oepen S., Siegel M., Copestake A., and Flickinger D., “Open source machine translation with DELPH-IN,” Open-Source Machine Translation Workshop at the 10th Machine Translation Summit, 2005, pp. 15-22.
[5] Sells P., Lectures on Contemporary Syntactic Theories. Stanford: CSLI Publications, 1985.
[6] Sag I. and Wasow T., Syntactic Theory: A Formal Introduction. Stanford: CSLI Publications, 1999.
[7] Bird S. and Klein E., “Phonological Analysis in Typed Feature Systems”, Computational Linguistics, vol. 20, 1994, pp. 55-90.
[8] Beesley K., “Finite-State Morphological Analysis and Generation of Arabic at Xerox Research: Status and Plans in 2001”, ACL Workshop on Arabic Language Processing: Status and Prospects, 2001, pp. 1–8.
[9] Smrž O., Functional Arabic Morphology. Formal System and Implementation. PhD Dissertation, Charles University in Prague, 2007.
[10] Riehemann S., “Type-Based Derivational Morphology,” Journal of Comparative Germanic Linguistics, vol. 2, 1998, pp. 49-77.
[11] Riehemann S., A Constructional Approach to Idioms and Word Formation. PhD Dissertation, Stanford, 2001.

