
M. Boella, F.R. Romani, A. Al-Raies, C. Solimando, G. Lancioni, 'The SALAH Project: Segmentation and Linguistic Analysis of Hadīth Arabic Texts', in M. V. Salem, K. Shaalan, F. Oroumchian, A. Shakery, H. Khelalfa (eds.), Proceedings of the Seventh Asia Information Retrieval Societies Conference, Springer, Heidelberg, 2011 [in press]. The original publication will soon be available at www.springerlink.com.

The SALAH Project:
Segmentation and Linguistic Analysis of Ḥadīṯ Arabic Texts

Marco Boella1, Francesca Romana Romani2, Anjela Al-Raies2, Cristina Solimando2, Giuliano Lancioni2

1 University La Sapienza, Rome, Italy
marco.boella@alice.it
2 Roma Tre University, Rome, Italy
francescaromana.romani@gmail.com, anjela.alraies@gmail.com, {csolimando,lancioni}@uniroma3.it

Abstract. A model for the unsupervised segmentation and linguistic analysis of Arabic texts of the Prophetic tradition (ḥadīṯs), SALAH, is proposed. The model automatically segments each text unit into a transmitter chain (isnād) and a text content (matn) and further analyzes each segment according to two distinct pipelines: a set of regular expressions chunks transmitter chains into a graph labeled with the relations between transmitters, while a tailored, augmented version of the AraMorph morphological analyzer (RAM) lexically and morphologically analyzes and annotates the text content. A graph with relations among transmitters and a lemmatized text corpus, both in XML format, are the final output of the system, which can further feed the automatic generation of concordances of the texts with variable-sized windows. The model's results can be useful for a variety of purposes, including retrieving information from ḥadīṯ texts, verifying the relations between transmitters, finding variant readings, and supplying lexical information to specialized dictionaries.

Keywords. segmentation, Arabic text treatment, information retrieval, morphological analyzer, hadith, regular expressions, graph

1 Introduction

Information retrieval in Arabic has often pivoted on contemporary texts, for obvious reasons: electronic availability, usefulness of information, analogy with work done in other linguistic domains. However, Classical texts are much more important in contemporary Arabic culture than in most Western countries, as witnessed by the large diffusion of websites which make medieval books available not only to scholars, but also, and most importantly, to laymen interested in such texts.

All authors have contributed equally to this work, but since it refers to a modular project, Boella should be mainly credited for Sec. 3, Romani for Sec. 2, Al-Raies for Sec. 5, Solimando for Sec. 4.1, and Lancioni for Secs. 1, 4.2 and 6.


A special role in this context is played by ḥadīṯ texts, the set of narratives on the life and deeds of the Prophet that altogether constitute the sunna, or Islamic Tradition (see Section 2). These texts do not only have a historical importance: they are the cornerstone of Muslim law and a favored reading of most Muslims around the world, and their presence in contemporary written Arabic is widespread. Notwithstanding their importance, Classical texts have not been, at least to the best of our knowledge, the subject of any scholarly research project as far as information retrieval is concerned: to search the texts, most scholars still refer to older, paper resources such as Wensinck's concordances [1].

On the contrary, ḥadīṯ texts are a privileged field for information retrieval. Their structure, which couples a text with a preceding chain of transmitters that assures the validity of the tradition, or isnād, is already (if informally) organized in such a way that readers are able to detect information with a relatively small amount of ambiguity. Yet, notwithstanding the importance of relations among transmitters in ascertaining the legal validity of a tradition, such data are still managed in a rather haphazard way, by recurring to traditional resources such as prosopographical repertories and by evaluating transmission relations in a mostly impressionistic way. The same is true of the lexical and grammatical content of traditions: in most cases, interpreters analyze each ḥadīṯ on its own merit, making little, if any, recourse to cross-textual regularities and collocations.

Our research project aims to devise methods and algorithms to extract as much information as possible from such texts in an automatic way. The subject matters on which we started working are the automatic segmentation of isnād and narrative text (or matn: see Section 3), the reconstruction of chains of transmitters through graphs, the creation of (semi-)automatic lexical concordances, and the prospective development of a grammar suitable to (semi-)automatically interpret texts and to build semantic representations which can further be employed in inference (by modeling a classical method used by Islamic law scholars). Preliminary results of a morphological analyzer and lemmatizer (see Section 4) are discussed in Section 5.

2 Contents and Structure of the Corpus: the Ḥadīṯs

Ḥadīṯ, lit. 'narrative, talk', is the term used to indicate each member of the set of shorter or longer narratives on the life and deeds of the Prophet Muḥammad (571–632) that report what he said or did, or his tacit approval of something said or done, and that by themselves define what is considered good, providing details to regulate all aspects of life in this world and to prepare people for the beyond and clarifying the shades of meaning of the Koran. Ḥadīṯ texts constitute the sunna, lit. 'way of life', or Islamic Tradition, which in Muslim culture is considered second in authority only to the Koran: the other sources of Islamic Law (uṣūl al-fiqh), ijmāʿ 'consensus' and qiyās 'analogical reasoning', generally have a lower rank.1
1 Since ḥadīṯs became sources of rules of conduct as authoritative examples of the Prophet's behavior, they were very much in demand: the Companions of the Prophet, or simply those who had known him, had much to tell about him, and the new converts wished to learn what he said or did in order to imitate him and to conform to his traditional standards of behavior, as a rule, in the name of taqlīd (the so-called imitatio Muhammadica, or imitation of Muḥammad).

The ḥadīṯ structure is a sequence of binary elements: a narrative text, the matn, preceded by a chain of transmitters (isnād, literally 'support') who have transmitted the narrative and who assure the validity of the tradition, following one another back to the first who saw or heard Muḥammad.2 For the selection of input data, computational and linguistic criteria were privileged over philological ones. Among the canonical collections of ḥadīṯs, an on-line edition of the collection known as Ṣaḥīḥ al-Buḫārī, compiled by Muḥammad ibn Ismāʿīl al-Buḫārī [3], was chosen. Its features, namely full digitization and vocalization, allow a wide range of investigations without any need for manual intervention or preparation. The text has been processed as is, and a systematic control of orthographical and philological coherence has been postponed as not relevant at this stage of the project's implementation.

2 The isnād is a guarantee of truth for the ḥadīṯ, through the reputation and the good faith of the transmitters who handed down the narratives orally [2]: Islamic civilization was built upon the supremacy of the spoken word and of hearing; written fixation has only a supporting role, as a prescriptive and restrictive measure against the tendency to establish false chains of transmission that would otherwise be considered valid.

3 Extracting Surface Information: from Segmentation to Representation

Automatic segmentation is a process that assigns segment boundaries in order to get discrete objects from a non-discrete continuum [4]. This approach aims to avoid, or at least to limit drastically, supervised intervention, which is rather resource-consuming in time and human involvement, especially when dealing with large amounts of data [4-6].

3.1 Ḥadīṯ Segmentation: Pairing Explicit and Implicit Information

The task of segmenting ḥadīṯ texts is in many respects analogous to other cases of semi-structured texts in (pseudo-)natural languages, such as semi-formal texts in descriptions of mathematics. Wolska and Kruijff-Korbayová [7] approached the analysis and formalization of the symbolisms and formulas used in mathematics manuals, and drafted a model that: (i) finds regularities in a text; (ii) employs them as patterns to extract textual and meta-textual information; (iii) conceives a set of rules based on these patterns in order to automatically translate extensive verbal expressions into mathematical formulas and vice versa. This study and others [8] point out that segmentation can be used in textual analysis not only to identify discrete strings, but also to try to assign, through an analysis of regularities and recurrences, a global structure to the text itself. This structure can be seen as governed by a sort of contextual grammar of rules, which also controls the connections between content information and its textual organization.


As written above, ḥadīṯ collections fit the definition of structured texts remarkably well. The organization of a ḥadīṯ is, in fact, quite rigorous and dependent on a set of recurrent functional expressions (based on verbs and prepositions in particular) that bound, define and sometimes nest different kinds of content [9-10]. The text continuum can therefore be read as formed by two levels: the first one contains information, while the other assumes, beside its textual value, a meta-textual function which organizes and defines the first level. This shows, although in a linear way, a structure somehow similar to that employed in databases, in which records contain information defined by fields.

The parallelism with mathematics and information science is far from perfect, however. In fact, a look at the literature about the structure of ḥadīṯ collections shows that there is no general agreement about the original value, meaning and translation of these functional expressions [11-13], and a complete set of them has not yet been fully defined. However, they can undoubtedly be considered as carrying some extra meanings besides the merely linguistic ones: (i) they separate one transmitter from another; (ii) they specify the authority and typology of transmission; (iii) sometimes they show the direction of the transmission. An automatic recognition of these elements in the text and of their role seems able to draft a rather solid structure of relationships and meanings.

3.2 Extraction and Organization: the HadExtractor Program

In order to test the above-mentioned models of segmenting and structuring texts, a specific program named HadExtractor (HE) has been designed to deal with ḥadīṯ corpora, aiming to: (i) read the full collection and identify single ḥadīṯs; (ii) segment, for each ḥadīṯ, the isnād from the matn; (iii) extract from each isnād all transmitters' names together with the relative supplementary information (position, typology and direction of transmission). HE was written and implemented in Python [14]. At present HE is designed as a rather closed system, as it specifically requires ḥadīṯ texts as input and does not yet handle other Classical Arabic textual structures.

Direct processing of Arabic script in programming languages is possible in theory but hard to manage, especially because of the switching of direction (right-to-left / left-to-right) among strings in different character systems. The original Arabic script has therefore been converted by using a set of characters based on the Buckwalter transliteration system [15], modified by us in order to fit the constraints of Python and of regular expressions on special reserved characters. This transliteration uses ASCII characters only and substitutes the diacritics usually employed in academic Latin-script transliteration with capital letters and, where needed, non-alphabetic characters. We implemented the conversion by employing a specific program that allows back-transliteration to Arabic at every stage and processes both vocalized and unvocalized strings [16].

The core of HE is based on regular expression syntax, conceived in the 1950s by Kleene as a tool of automata theory to describe formal languages and developed afterwards by Thompson to be used in programming languages [17].


A regular expression (regex) consists of a formally structured text string describing complex search patterns, which can be applied to other strings in order to find matches. The complexity of the constants and operators employed makes regexes powerful and deeply expressive instruments to retrieve textual segments [18]. HE has been built mainly on three regexes: the first one identifies single ḥadīṯs, the second one separates the isnād from the matn, and the last one captures the transmitters' names.

After the identification of single ḥadīṯs, HE looks for the textual separation point between isnād and matn, through a regex that models a pattern containing as variables the above-mentioned functional expressions and some look-behind and look-ahead operators that check the context in order to effectively detect the functional value of the expression, since, obviously, the same word can recur in other contexts, for example inside the matn, without any particular meta-textual role. Once all isnāds are obtained, another regex, working in a similar way but referring to a larger list of functional expressions, extracts all transmitters' names, pairing them with the corresponding functional expression. The extracted matns were paired with a digitized English translation [19-20], which was processed by a tailored version of HE.

Once HE has run all regex routines and related tasks, it produces as output an XML file, in which all extracted information is organized, automatically tagged and nested. The following lines show an excerpt from the output XML file, related to a single ḥadīṯ (Arabic script is in the modified Buckwalter transliteration):
    <hadith id_ar="7296" id_cor="">
      <source_info> <vol>9</vol> <num>7554</num> </source_info>
      <isn>
        <trasm type="Had+aCaniy">muHam+adu b_nu Eabiy GaAlibI</trasm>
        <trasm type="Had+aCanaA">muHam+adu b_nu eis_maAciyla</trasm>
        <trasm type="Had+aCanaA">muc_tamirU</trasm>
        <trasm type="samic_tu">Eabiy</trasm>
        [...]
        <trasm type="Ean+a">Eabiy raAficI</trasm>
        <trasm type="Had+aCahu Ean+ahu samica">Eabuw huray_raoa</trasm>
      </isn>
      <mat>yaquwlu [...] camalNA</mat>
    </hadith>
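By way of illustration, the following minimal sketch (not HE's actual code: the marker list, the function names and the matn-start heuristic are simplified assumptions) shows how regexes over such functional expressions can separate an isnād from a matn and pair each transmitter with the expression that introduces him:

    import re

    # Illustrative subset of functional expressions in the modified
    # Buckwalter-like transliteration of the excerpt above; the real HE
    # list is larger and its patterns also use look-behind/look-ahead
    # operators to check the context of each occurrence.
    MARKERS = ["Had+aCaniy",   # "he narrated to me"
               "Had+aCanaA",   # "he narrated to us"
               "samic_tu",     # "I heard"
               "Ean+a"]        # "on the authority of"
    MARKER_RE = re.compile("|".join(map(re.escape, MARKERS)))

    def extract_transmitters(isnad):
        # Pair every functional expression with the text that follows it,
        # mirroring the <trasm type="..."> elements of the XML output.
        markers = MARKER_RE.findall(isnad)
        names = [chunk.strip() for chunk in MARKER_RE.split(isnad)[1:]]
        return list(zip(markers, names))

    def split_isnad_matn(hadith_text):
        # Rough heuristic: look for a verb of saying (here simply qaAla or
        # yaquwlu) after the last transmission marker and treat it as the
        # beginning of the matn.
        matches = list(MARKER_RE.finditer(hadith_text))
        if not matches:
            return "", hadith_text
        tail = hadith_text[matches[-1].end():]
        verb = re.search(r"\b(?:qaAla|yaquwlu)\b", tail)
        cut = matches[-1].end() + (verb.start() if verb else len(tail))
        return hadith_text[:cut].strip(), hadith_text[cut:].strip()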

3.3 Representation: Transmitter Chains and Graphs

Once HE has been applied to the ḥadīṯ corpus, a large amount of automatically extracted information becomes available for further investigation, most of it dealing with the extracted matn (see Section 4). Focusing instead on the isnād, an example of representation of the information about transmitters is given below, with the aim of focusing on the relationships between objects rather than on the objects themselves. We accordingly structured the features of the extracted isnād information into the following categories: (i) the name of the transmitter; (ii) its position in the transmission chain of the single ḥadīṯ (which usually starts from the collector and ends with the Prophet Muḥammad); (iii) the typology of transmission (see Section 3.1).


These categories were read as pertaining to a simple model in which objects have different kinds of relationships among them. This model is clearly similar to those of graph theory, in which a graph is defined as a structure of nodes and edges that models relations among objects from a given collection [21]. Nodes and edges are drawn on a two-dimensional grid through specific algorithms, in order to graphically visualize the above-mentioned relationships [22]. It was therefore clear that graph theory could be usefully applied to our data in order to attempt a graphical representation of the transmitter chains. This kind of representation aims to offer a sophisticated, quantitatively-based instrument in a field of research traditionally characterized by analogical and human-based approaches [23] [11].

On the basis of the fundamental literature on graph drawing [24-25], we have conceived and implemented in Python another specific program, named ChainViewer. This application, by using existing Python libraries for graph drawing: (i) gets all the information about transmitters stored in the XML file containing the data previously extracted by HE; (ii) for each ḥadīṯ, assigns the transmitters' names to nodes, the relationships between each transmitter and the previous/next one (i.e. the chains) to edges, and the typology of transmission to edge types; (iii) through a drawing algorithm, generates graphs for single chains or joins multiple chains together in the same graph (in this case, if a transmitter's name appears twice or more, it is shown once but with multiple edges).

At the current stage of development, ChainViewer works well only with a limited number of chains (see Fig. 1), but it could virtually be applied to all chains at the same time, in order to automatically gather in a single graph all the transmitters of a ḥadīṯ collection together with the complete set of their transmission relationships. This task obviously presents new problems to deal with, namely: (i) the need for a semi-automatic instrument able to disambiguate homonyms, unify various inflected forms of the same name, and identify nicknames and aliases; (ii) a specifically designed drawing algorithm that could deal with thousands of nodes and edges, and dynamically represent them with expansion/compression tools.
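As an illustration of this design, the following minimal sketch (not the actual ChainViewer code; it assumes the networkx library and the element names of the XML excerpt in Section 3.2) joins all chains of a collection in a single multigraph:

    import xml.etree.ElementTree as ET
    import networkx as nx

    def build_chain_graph(xml_path):
        # Nodes are transmitter names; each edge links two consecutive
        # transmitters and carries the transmission type and the hadith id.
        graph = nx.MultiDiGraph()
        root = ET.parse(xml_path).getroot()
        for hadith in root.iter("hadith"):
            chain = hadith.findall("./isn/trasm")
            for prev, curr in zip(chain, chain[1:]):
                graph.add_edge(prev.text, curr.text,
                               trans_type=curr.get("type"),
                               hadith_id=hadith.get("id_ar"))
        return graph

    # A small set of chains can then be drawn, e.g. with nx.draw(graph);
    # name disambiguation and large-scale layout remain open problems.

Because repeated names collapse into a single node, a transmitter occurring in several chains is automatically shown once with multiple edges, as described above.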

Fig. 1. A portion of a non-directional graph of transmission chains obtained by processing 11 ḥadīṯs together (Arabic script is in the modified Buckwalter transliteration)


4 Analyzing the Corpus: the Revised AraMorph Analyzer

4.1 The Original AraMorph Implementation

As a starting point for the implementation of the text analysis module of the SALAH project, the morphological analyzer and lemmatizer AraMorph (AM) by Tim Buckwalter [26] was chosen. The main reasons for this choice are the simplicity of the model's design, its high performance level even in an unsupervised environment, and the ease of maintaining and extending it.3

In contrast to a long-standing linguistic and computer science tradition which emphasized the need for complex, multi-layered, deep morphological components in order to analyze Arabic texts properly and to account for the apparent lack of linearity of many Arabic morphemes (see examples and discussion in [26], [28-30]), AM chooses to treat Arabic words (in the rather naive, but computationally efficient, sense of any sequence of characters separated by spaces) as elements linearly decomposable into three sub-elements: a prefix, a stem, and a suffix, the stem being the only necessary sub-element (in fact, zero prefixes and suffixes are postulated, sometimes adding some grammatical information to the stem according to the time-honoured morphological concept of the zero morpheme).

This simple account is implemented in (possibly) the simplest way, by feeding the system with three lookup lists of, respectively, (a) prefixes, (b) stems and (c) suffixes, together with three compatibility tables between, respectively, (d) prefixes and stems, (e) prefixes and suffixes and (f) stems and suffixes. Entries in the lookup lists are made up of four fields: (i) unvocalized and (ii) vocalized forms of the morpheme, (iii) grammatical category and (iv) English gloss; the compatibility tables just list pairs of compatible morphemes, all other combinations being incompatible. Supplementary pieces of information, not employed in the analysis proper but potentially useful for glossing the texts (root and lemma for a group of morphemes), are provided in the stem lookup list in the form of pseudo-comments.

The analysis, both in the original Buckwalter model (implemented in Perl) and in the Java implementation by the AM project, is performed through a brute-force search of every possible decomposition of words into prefixes, stems and suffixes, by looking up prefixes from 0 to 4 letters long, stems from 1 letter upwards, and suffixes from 0 to 6 letters long. Only the unvocalized form of words is taken into account (short vowels and other diacritics are stripped before lookup): candidate prefix-stem-suffix decompositions are first matched against the first fields of the respective lookup lists and discarded if any of the elements is missing; then the grammatical categories of the surviving combinations are matched against the compatibility tables and discarded if any combination is not present. As a result, each word of the input text can be labeled as (i) unrecognized, if no possible analysis passes the test, (ii) unambiguous or (iii) ambiguous if, respectively, one or more analyses are licensed.

3 Buckwalter's system has been used in many different projects, mostly in its Java implementation; it is, for instance, included as a morphological analysis tool in the Arabic WordNet Project [27].
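A schematic sketch of the prefix-stem-suffix lookup described above (the data structures and the function name are illustrative and much simplified with respect to the actual AM code):

    # prefixes, stems, suffixes: dicts mapping an unvocalized surface form to
    # a list of (vocalized_form, category, gloss) entries; the empty string is
    # a legitimate key, standing for the zero prefix/suffix.  ab, ac, bc: sets
    # of compatible category pairs (prefix-stem, prefix-suffix, stem-suffix).
    def analyze(word, prefixes, stems, suffixes, ab, ac, bc):
        analyses = []
        for i in range(0, min(4, len(word) - 1) + 1):       # prefix: 0-4 letters
            for j in range(i + 1, len(word) + 1):           # stem: >= 1 letter
                pre, stem, suf = word[:i], word[i:j], word[j:]
                if len(suf) > 6:                            # suffix: 0-6 letters
                    continue
                for p in prefixes.get(pre, []):
                    for s in stems.get(stem, []):
                        for x in suffixes.get(suf, []):
                            if ((p[1], s[1]) in ab and (p[1], x[1]) in ac
                                    and (s[1], x[1]) in bc):
                                analyses.append((p, s, x))
        # []: unrecognized; one item: unambiguous; several items: ambiguous
        return analyses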


This model, whose beauty lies in its very simplicity, is a good starting point for a successful morphological analysis, but it does not fit our needs, for several reasons.

First, while the emphasis on the unvocalized form of the word is relatively justified for the ideal text genre targeted by Buckwalter (newspaper texts and other Modern Standard Arabic (MSA) non-literary texts that largely make up the LDC Arabic corpus), it is far from ideal for other types of texts, first of all fully vocalized ones like ḥadīṯ corpora, but also sparsely vocalized texts: each diacritic added to the consonantal skeleton of a text reduces ambiguity, so a system that, like the original AM model, deliberately chooses to ignore this information accepts a higher degree of (morphological) ambiguity and automatically passes a number of wrong analyses that would instead be ruled out by taking into account the diacritics present in the text.

The second weak point of the original model lies in the fact that the lookup lists and the relative compatibility tables were built from a sample of the text corpora Buckwalter worked on: again, only morphemes attested in a subset of MSA texts, and their combinations, are included in the lookup lists and the combination tables, which unavoidably leads to rejecting or wrongly analyzing many words attested in other textual types.

A third weakness of the original AM implementation, linked to the previous one, lies in the lack of any stylistic or chronological information in the lookup lists. This way, many morphemes that are virtually exclusive to MSA texts (for instance, a not negligible number of transliterated foreign named entities which cannot be found in Classical texts and which are relatively rare in modern literary texts as well) are included in the lists (and vocalized more or less properly, since foreign proper names are never vocalized in real-world Arabic texts, merely in order to respect the field structure of the lists themselves) and are likely to give rise to a number of false positives in the analysis of some textual genres.

4.2 Modifications to the Algorithm

In order to overcome the weaknesses listed in the previous section, a number of modifications to the original AM algorithm were devised that tackle the single problems detected above; the new algorithm has been dubbed Revised AraMorph (RAM).

The first modification concerns the token identification mechanism: instead of discarding vocalization, our revised lemmatizer uses it to reduce the number of false positives by taking into account all the vowels present in the text. The comparison phase is less trivial than it might seem, since it must proceed in three stages: (i) the token is segmented into consonants and diacritics (where everything between two characters marked as consonants is a diacritic); (ii) consonants must match exactly; in fact some qualifications are in order to take into account current practice, e.g. an alif with hamza above or below matches a simple alif (which the original AM accounts for pragmatically, but rather inefficiently, by multiplying entries), and some more are required to reflect idiosyncrasies in ḥadīṯ orthography; (iii) diacritics present in the token must not contradict the fully vocalized form in the lexicon (that is, e.g., missing vowels are acceptable, but a vowel cannot match a different one).4 A sketch of this comparison is given at the end of this section.

To tackle the second weak point, namely the partial and unbalanced coverage of the Arabic lexicon in the original AM implementation, a file with additional stems automatically extracted from Anthony Salmoné's Arabic-English dictionary [31] (a work from the end of the 19th century encoded in TEI-compliant XML format within the Perseus project) was added to the system. Moreover, an analysis of the most frequent types of unrecognized tokens made it possible to add a limited, but important, number of additional lists of prefixes and suffixes together with the relative combination tables. The single most important addition was the set of prefixes, suffixes and combinatory rules for verb imperatives, a category entirely missing from the original AM implementation (perhaps on purpose, since Arabic imperatives are morphologically complex and quite rare in newspaper texts) but relatively frequent in ḥadīṯ texts, given the abundance of prescriptions and performative contexts in the latter.

To reduce the third problem detected, namely the genre and style indistinctness of the AM lexicon, we experimented with automatically removing items that are likely to correspond to contemporary foreign named entities, especially proper names and place-names.5 In order to do so, we first extracted a list of potential named entities by exploiting a suggestion by Tim Buckwalter himself that a gloss starting with an uppercase letter is a named entity in 99% of cases; we then performed a full-text search for each such word in Salmoné's dictionary and retained only the words found there. This way, we are likely to exclude most foreign contemporary named entities while retaining Arabic proper names and place-names, which can be found in Classical texts and which are often (but unfortunately not always) included by Salmoné.
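The following minimal sketch illustrates the vocalization-aware comparison described as the first modification above (the diacritic inventory, the normalization table and the function names are illustrative assumptions, not RAM's actual code):

    import re

    # Short vowels, tanwin, shadda and sukun treated as diacritics.
    DIACRITICS = "\u064B\u064C\u064D\u064E\u064F\u0650\u0651\u0652"
    DIACRITIC_RUN = re.compile("[%s]*" % DIACRITICS)

    def normalize(c):
        # e.g. alif with hamza above or below is matched against plain alif
        return {"\u0623": "\u0627", "\u0625": "\u0627"}.get(c, c)

    def segment(token):
        # (i) split a token into (consonant, diacritics) pairs
        pairs, i = [], 0
        while i < len(token):
            run = DIACRITIC_RUN.match(token, i + 1)
            pairs.append((normalize(token[i]), run.group()))
            i = run.end()
        return pairs

    def compatible(token, lexicon_form):
        # (ii) consonants must match one by one; (iii) the diacritics written
        # in the token must not contradict the fully vocalized lexicon form
        # (missing vowels are acceptable, a different vowel is not).
        t, l = segment(token), segment(lexicon_form)
        return (len(t) == len(l)
                and all(tc == lc for (tc, _), (lc, _) in zip(t, l))
                and all(all(d in ld for d in td)
                        for (_, td), (_, ld) in zip(t, l)))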

5 Results and Evaluation

Results of both HE and RAM were evaluated according to standard practice [32], by dividing the corpus into a training (95%) and a testing (5%) section; the testing section was kept relatively small in consideration of the homogeneity of the corpus and of the need to manually annotate the test sentences.
4 Some parameters were introduced in the experiments to test the impact of full vocalization in ḥadīṯ texts: since the texts are fully vocalized, it is meaningful to allow neither additional vowels nor a tašdīd (reduplication) symbol not present in the text. Both the original algorithm and our implementation were rewritten in Python 2.6 in order to profit from other existing tools and to have the possibility of treating Arabic texts directly in Unicode format (even if this is not expedient in some cases, see also Section 3.2).

5 The original AraMorph implementation, true to its newspaper-based spirit (in this case, the source was the AFP corpus), included decidedly contemporary items such as the Arabized versions of the names of the Belgian tennis player Sabine Appelmans or of the Czech soccer team Sigma Olomouc. In some cases, confusion with Arabic words is in fact possible, especially if we take into account the fact that nothing like capitalization is available in the Arabic script: for instance, the transcription of the English first name John (juwn) is indistinguishable from jūn 'inlet, bay'.


At the present stage of development, the results obtained through both modules are very good, but quite faceted. The total number of processed ḥadīṯs was 7,305, and the segmentation produced outputs for 7,135 of them, showing an effectiveness rate of 97.7%. A manual screening of the testing ḥadīṯ sample showed a rate of 7.7% incorrect outputs, of which 6.8% are false negatives and 0.9% false positives.
Table 1. Summary of HE results

  segmentation                  effective     97.7%
                                wrong          2.3%
  testing of segmented data     error rate     7.7%
                                precision     99.2%
                                recall        93.1%
                                F measure     96.1%

Accuracy will be improved mainly by refining the above-mentioned operators and, secondly, by increasing human control of the outputs where automatic recognition still fails. As for RAM, the system was applied only to the matn text effectively segmented by HE. The resulting corpus, which gathers the matn sections only, consists of 382,700 words. We then applied the original AM analyzer to get a preliminary system output; the results were then compared to the output of the RAM analyzer with different parameter settings.
Table 2. Summary of RAM results

                                original    RAM with        RAM with        RAM with contemporary
                                AM          vocalization    added entries   NEs removed
  recognition
    unanalyzed                  10.36%      12.55%           7.23%           8.12%
    univocal                    29.45%      58.98%          62.52%          67.79%
    ambiguous                   60.19%      28.47%          30.25%          24.09%
  testing of segmented data
    error rate                  60.54%      32.77%          27.65%          24.58%
    precision                   64.90%      74.57%          81.47%          83.37%
    recall                      74.56%      92.66%          90.88%          92.05%
    F measure                   69.40%      82.64%          85.92%          87.50%

The RAM system with vocalization fares far better than the original AM in univocal token recognition, even if the rate of unanalyzed tokens is slightly higher. In fact, the comparison is equivocal, since AM gets a better unanalyzed rate at the price of a higher number of false positives (which RAM discriminates through vocalization).
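For reference, the F measures reported in Tables 1 and 2 are the usual harmonic mean of precision and recall; a quick check against the reported figures:

    def f_measure(precision, recall):
        return 2 * precision * recall / (precision + recall)

    print(round(f_measure(99.2, 93.1), 1))    # 96.1, as in Table 1
    print(round(f_measure(83.37, 92.05), 1))  # 87.5, as in the last column of Table 2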


6 Further Research

Both HE and RAM can be seen as starting points for future research. HE can be extended and generalized to other domains within Classical Arabic culture where texts are arranged according to semi-formal criteria: genealogical repertories, specialized dictionaries, definition lexica (as opposed to lexical encyclopedias). RAM can be further extended to cope with a larger range of textual genres, especially if coupled with a reasonably well-performing system for Arabic named entity recognition. As shown by the flowchart in Fig. 2 and alluded to in the Introduction, this might well feed other, higher-level systems of text analysis and information retrieval.

Fig. 2. The SALAH process flowchart

References
1. Wensinck, A.J., et al.: Concordance et indices de la tradition musulmane: les six Livres, le Musnad d'al-Dārimī, le Muwaṭṭaʾ de Mālik, le Musnad de Aḥmad ibn Ḥanbal. Brill, Leiden (1933–)
2. Brown, J.: The Canonization of al-Bukhārī and Muslim: The Formation and Function of the Sunnī Ḥadīth Canon. Brill, Leiden (2007)
3. al-Bukhārī, M. b. I.: Ṣaḥīḥ al-Bukhārī. Dār Ṭūq al-Najāh, Riyāḍ (1990)
4. Gibbon, D., Moore, R., Winski, R. (eds.): Handbook of Standards and Resources for Spoken Language Systems. Mouton de Gruyter, Berlin (1997)
5. Jackson, P., Moulinier, I.: Natural Language Processing for Online Applications: Text Retrieval, Extraction & Categorization. John Benjamins, Amsterdam (2002)
6. Abu El-Khair, I.: Arabic Information Retrieval. Annual Review of Information Science and Technology, 41(1), 505-533 (2007)
7. Wolska, M., Kruijff-Korbayová, I.: Analysis of mixed natural and symbolic language input in mathematical dialogs. In: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Stroudsburg (2004)


8. Bird, S., Klein, E.: Regular Expressions for Natural Language Processing. University of Pennsylvania (2006)
9. Robson, J.: Standards applied by Muslim traditionists. Bulletin of the John Rylands Library, 63 (1961)
10. Juynboll, G.H.A.: Encyclopedia of Canonical Hadith. Brill, Leiden (2007)
11. Sezgin, F.: Geschichte des arabischen Schrifttums, Vol. 1. Brill, Leiden (1967)
12. Robson, J.: Ḥadīth. In: Encyclopaedia of Islam, Vol. 3, pp. 23-28. Brill, Leiden (1978)
13. Günther, S.: Assessing the Sources of Classical Arabic Compilations: The Issue of Categories and Methodologies. British Journal of Middle Eastern Studies, 32(1), 75-98 (2005)
14. van Rossum, G.: An Introduction to Python for UNIX/C Programmers. In: Proceedings of the NLUUG najaarsconferentie (1993)
15. Buckwalter, T.: Buckwalter Arabic Transliteration (undated), available at http://qamus.org/transliteration.htm
16. Lancioni, G.: An Adaptation of the Buckwalter Transcription Model to XML and Regular Expression Syntax. Technical report, Roma Tre University, r3a (2011)
17. Aho, A.V.: Algorithms for finding patterns in strings. In: van Leeuwen, J. (ed.) Handbook of Theoretical Computer Science, Volume A: Algorithms and Complexity, pp. 255-300. The MIT Press (1990)
18. Goyvaerts, J., Levithan, S.: Regular Expressions Cookbook. O'Reilly, Sebastopol (2009)
19. Khan, M.M.: The English Translation of Sahih Al Bukhari. Al-Saadawi Publications, Alexandria (1984)
20. Al-ʿAsqalānī, A.: Fatḥ al-bārī bi-sharḥ Ṣaḥīḥ al-Bukhārī. Dār al-Maʿrifah, Bayrūt (1959)
21. Berge, C.: Théorie des graphes et ses applications. Collection Universitaire de Mathématiques, vol. II. Dunod, Paris (1958)
22. Bondy, J.A., Murty, U.S.R.: Graph Theory. Springer, Heidelberg (2008)
23. Fück, J.: Beiträge zur Überlieferungsgeschichte von Buḫārīs Traditionssammlung. Zeitschrift der Deutschen Morgenländischen Gesellschaft, 60-87 (1938)
24. Di Battista, G., Eades, P., Tamassia, R., Tollis, I.G.: Graph Drawing: Algorithms for the Visualization of Graphs. Prentice Hall, Upper Saddle River (1999)
25. Kaufmann, M., Wagner, D.: Drawing Graphs: Methods and Models. Springer, Heidelberg (2001)
26. Buckwalter, T.: Buckwalter Arabic Morphological Analyzer Version 1.0. Linguistic Data Consortium, Philadelphia (2002)
27. Black, W., Elkateb, S., Rodriguez, H., Alkhalifa, M., Vossen, P., Pease, A., Fellbaum, C.: Introducing the Arabic WordNet Project. In: Sojka, Choi, Fellbaum, Vossen (eds.) Proceedings of the Third International WordNet Conference (2006)
28. Al-Sughaiyer, I.A., Al-Kharashi, I.A.: Arabic morphological analysis techniques: A comprehensive survey. Journal of the American Society for Information Science and Technology, 55(3), 189-213 (2004)
29. Bebah, M., Belahbib, R., Boudlal, A., Lakhouaja, A., Mazroui, A., Meziane, A.: A Markovian Approach for Arabic Root Extraction. The International Arab Journal of Information Technology, 8(1) (2011)
30. Habash, N., Rambow, O.: Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop. In: Proceedings of the 43rd Annual Meeting of the ACL, pp. 573-580, Ann Arbor (2005)
31. Salmoné, H.A.: An Advanced Learner's Arabic-English Dictionary. Librairie du Liban, Beirut (1889)
32. van Rijsbergen, C.J.: Information Retrieval. Butterworths, London (1979)