• Embed Doc
  • Readcast
  • Collections
  • CommentGo Back
Download
 
Analysis of Nasta’leeq for Pakistani Languages
M G Abbas Malik Christian Boitet Pushpak Bhattachariyya
GETALP – LIG, University Joseph Fourer abbas.malik@imag.fr Christian.Boitet@imag.fr 
 
CSE, IIT Bombay, India pb@cse.iitb.ac.in
 
Abstract
 Nasta’leeq is a bidirectional, diagonal, non-monotonic, cursive, highly context-sensitive and verycomplex writing system for languages like Urdu,Punjabi, Pashto, Sindhi, Baluchi, Kashmiri, etc.written in a derivation of the Perso-Arabic script. Thestyle is defined by well-formed rules, passed downmainly through generations of calligraphers and old manuscripts, and then by the printed materials of themodern era, but these rules have not beenquantitatively analyzed in enough details for languagesmentioned above. This paper first gives a brief introduction of different writing styles of Perso-Arabicscript and then discusses its salient features. It alsobriefly discusses the alphabets of Pakistani languages.Finally, it gives the quantitative analysis of Nasta’leeqand explains its context-sensitive behavior with respect to Pakistani languages that is equally true for Arabic,Persian and other languages written in a derivation of Perso-Arabic script. Finally, it discusses the Context-Sensitive Substitution Grammar of Nasta’leeq, acomputational model of Nasta’leeq.
1. Introduction
Pakistan is a country with at least six major languages and 58 minor ones (Rahman, 2004). Urdu,the national language, has over 11 million (7.57%)mother-tongue speakers while those who use it as asecond language are more than 105 million (Grimes,2000). Punjabi, the mother tongue of 44.15% population, is the biggest language of Pakistan. Other major languages are Pashto, Sindhi, Baluchi andKashmiri according to their number of speakers andgeographical divisions. The size of these languages andUrdu is shown in Table 1.The benefits from the Information Technology (IT)revolution cannot be reaped unless masses use it,which is not possible unless computing is possible in alanguage that is understood by the masses (Afzal andHussain, 2001). Information has become such anintegral part of our global society that its access isconsidered as a basic human right. Internet is believedto be the dominant carrier of information across theglobe. Currently, English is the lingua franca for Internet and most of the information is available in it, but that makes information practically inaccessible tothe vast majority of the world. This is applicableespecially to countries like Pakistan where those whomay be considered barely literate in Urdu according to43.92% literacy rate as claimed in the census of 1998are nearly 66 million. That is rather a large number compared to nearly 26 million (17.29%) who, having passed the ten-year school system (matriculation), can presumably read and understand a little English. Andyet Internet and computer programs function in Englishin Pakistan and not even in Urdu let alone the other languages. This means that most Pakistanis are either excluded from the digital world or function in it ashandicapped aliens. Indeed, most matriculates fromUrdu and Sindhi medium schools have suchrudimentary knowledge of English that they cannotcarry out any meaningful interaction, especially thatwhich would increase their knowledge or analyticalskills, with the digital world. Perhaps only the 4.38%graduates (about 6.5 millions) could do so (Rahman,2004).
Language No. of Speakers
Urdu* 164,290,000Punjabi 66,225,000Pashto 23,130,000Sindhi 21,150,000Baluchi 5,355,000Kashmiri 4,496,000* Urdu include native and 2nd language speakersSource: Rahman, 2004
Table 1: Speakers of Pakistani Languages
 
2. Arabic Script and its Writing Styles
Arabic script is a cursive writing system. It hasmany writing styles including Naskh, Kufi, Sulus,Riqah, Deevani, etc. some of them are shown in figure1. Nasta’leeq writing style was developed in Iranduring the 14
th
and 15
th
centuries by combining Naskh
 
and Taleeq (an old obsolete style)
1
. It is one of themain genres of the Islamic calligraphy. It is rich incalligraphic content. Owing to the complexities of rendering, the basic shapes identified in this section areunable to render languages in an acceptable form in Nasta’leeq. A detailed quantitative analysis of  Nasta’leeq with respect to Pakistani languages is givenin section 4.
zz õô 
  
 ‚Zu[ p 
 
 u@ 
  
 ÖZ
  
  ¯
او
 
او
 
     او
 
     ا   و
 
مقلاو
a
مشلاخسو
a
     او
 
     ا   و
 
راو
 
سارو
 
Figure 1: Different Writing Styles for Arabic
The distinguishing characteristics of Perso-Arabicscript are discussed for the benefit of the unacquaintedreader. It is read from right-to-left. Figure 2 showssome sample characters from Pakistani languages.Unlike English, characters do not have upper and lower case.
b
ڃ
 
b
c
b
 
` ^ [ \
\
 
ٿ
 
\
] _
^
 
]
 
\
 
[
Z
ڇ
 
څ
f e
e
r q p o n m l k
ڙ
 
ږ
 
ړ
g i
g
 
 
ڏ
 
 
 
 
 
ڊ
 
p
} ~
ې
 
~
 
 
| { z
ڼ
y
ڻ
 
y
 
x w u v 
t s
Figure 2: Sample Characters of Pakistani Languages
The shape assumed by a character in a word iscontext-sensitive, i.e. the shape is different dependingwhether the position of the character is at the beginning, in the middle or at the end of the constituentword. This generates three shapes, the fourth being theindependent shape of the character (Khaver, 1999;Malik, 2005). Figure 3 shows these four shapes of thecharacter Beh in Naskh writing style.
Figure 3: Context sensitive shapes of BEH(Khaver, 1999)
To be precise, the above is true for all except certainnumber of characters that only have the independentand the terminating shape when they come at beginning and middle or end of a word respectively(Khaver, 1999; Malik, 2005). Some of these charactersare shown in Figure 4.
ZW
p
f e
e
ڊ  ڏ
g i
g
 
ړږ
g
ڙ
Figure 4: Sample Characters having only Two Shapes
 
1
 http://en.wikipedia.org/wiki/Nasta%27liq_script 
Hamza
 
never comes at the beginning of a word(Khaver, 1999), but it could come at the beginning of aligature. Also it takes the independent shape instead of the final shape when it comes at the end of the word.Thus, it has initial, middle and independent shapes(Khaver, 1999; Malik, 2005), as illustrated in figure 5.
Figure 5: Shapes of Hamza (circled)
Arabic, Persian and Pakistani languages have alarge set of diacritical marks that are necessary for thecorrect articulation of a word. The diacritical marksappear above or below a character to define a vowel or to geminate a character (Khaver, 1999; Malik, 2005).They are the foundation of the vowel system in theselanguages. The most common diacritical marks withthe character Beh are shown in Figure 6.
G
[
F
[
 E
[
 
Figure 6: BEH with Diacritical Marks
Diacritics, though part of the language, aresparingly used. They are essential for ambiguitiesremoval, natural language processing and speechsynthesis (Khaver, 1999; Malik, 2005; Malik, 2006;Malik 
et al.
, 2008).
3. Pakistani Languages
Pakistani languages are written in an alphabet that isderived from the Perso-Arabic alphabet. It is not possible to discuss all Pakistani languages here. This paper only discusses the six languages given in Table 1 because the last five represent the major geographicaldivision of Pakistan and Urdu is the National languageof Pakistan. All of these languages belong to the Indo-European language family. Their family tree is given inFigure 7.
Figure 7: Language Tree of Pakistani Languages
The alphabets of each of these languages arediscussed separately here with their Unicode values. InUnicode, Arabic and its associated languages likeUrdu, Punjabi, Pashto, Sindhi, etc. have been allocatedthe code points 0600h – 06FFh, 0750h – 077Fh andFB50h – FEFFh.
 
3.1. Urdu
 
Urdu is the National language of Pakistan and oneof the state languages of India with more than 60millions native speakers. It is one of the biggestlanguages of the world, if one considers Hindi/Urdu asdialects of the same language called Hindustani byPlatts (Platts, 1909). Table 2 gives the size of Hindi/Urdu.
Speakers
Native 2
nd
Language Total
Hindi 366,000,000 487,000,000 853,000,000Urdu 60,290,000 104,000,000 164,290,000Total 426,290,000 591,000,000
1,017,000,000Table 2: Hindi and Urdu speakers (Malik
et al.
, 2008)
 Urdu is written in Nasta’leeq style. It has 35consonant characters representing 27 consonant soundsas some consonant sounds are represented by two or more consonant characters, e.g. the sound ‘s’ isrepresented by three different characters Seh (
_
), Seen(
k
) and Sad (
m
) (Malik 
et al.
, 2008). Out of 35consonant characters, 32 are adopted from Persian. 3retroflex consonants are added to accommodate theindigenous sounds of the Indian sub-continent. Thesecharacters are Tteh (
^
) [
ʈ
], Ddal (
e
) [
ɖ 
] and Rreh (
g
)[
ɽ
]. Non-aspirated consonants of Urdu are given inTable 3.
Sr.
 
Symbol Unicode Sr.
 
Symbol Unicode
1
[
[b]0628 19
m
[
s
]
 
06352
\
[p]067E 20
n
[
z
]
 
06363
]
[
]062A 21
o
[
]
 
06374
^
[
ʈ
]0679 22
p
[
z
]
 
06385
_
[s]06B2 23
q
[
ʔ
]
 
06396
`
[
ʤ
]062C 24
r
[
ɣ
]
 
063A7
[
ʧ
]0686 25
s
[
]
 
06418
b
[h]062D 26
t
[
q
]
 
06429
c
[x]062E 27
 
[
k
]
 
06A910
e
[d
   ̭
]062F 28
[g]
 
06AF11
e
[
ɖ 
]0688 29
w
[
l
]
 
064412
f
[z]0630 30
x
[
m
]
 
064513
[r]0631 31
 
[
n
]
 
064614
g
[
ɽ
]0691 32
z
[
v
]
 
064815
i
[z]0632 33
{
[
h
]
 
06C116
g
[
ʒ
]0698 34
 
[
 j
]
 
06CC17
k
[s]0633 35
>
[
]
 
062918
l
[
ʃ
]0634
Table 3: Non-Aspirated Urdu Consonants
The phenomenon of aspiration does not exist inPersian or Arabic but it exists in languages of theregion e.g. Hindi, Urdu, Punjabi, etc. In Urdu, thespecial character Heh Doachashmee (
|
) is used to mark the aspiration. Thus aspirated consonants arerepresented by the combination of the consonant to beaspirated and Heh Doachashmee (
 ه
) e.g.
[
[b] +
|
[h]=
 J
[b
ʰ
],
`
[
ʤ
] +
|
[h] =
 Y
[
ʤʰ
], etc. Urdu has 15aspirated consonants (Malik 
et al.
, 2008). AspiratedUrdu consonants are given in Table 4.
Sr.
 
Symbol Sr.
 
Symbol Sr.
 
Urdu
1
 J
[
b
ʰ
]6
 Y
[
ʧʰ
]11
JÏ 
[
k
ʰ
]2
 J
[
p
ʰ
]7
|
e
[
ḓʰ
]12
 Ï 
[
g
ʰ
]3
 J
[
ṱʰ
]8
|e
[
ɖʰ
]13
[
l
ʰ
]4
 
 J
[
ʈʰ
]9
|
[
r
ʰ
]14
[
m
ʰ
]5
 Y
[
ʤʰ
]10
|
g
[
ɽʰ
]15
 J
[
n
ʰ
]
Table 4: Aspirated Urdu Consonants
In addition to consonants, Urdu has 10 vowels and 7of them also have their nasalized forms (Malik 
et al.
,2008; Hussain, 2004). They are represented with thehelp of four long vowels (Alef Madda (
W
), Alef (
Z
),Waw (
z
) and Yeh (
)) and three short vowels (ArabicFatha
F
, Damma
E
and Kasra
G
). The representation of a vowel is context-sensitive, i.e. a vowel may bewritten in two or more ways according to the context ina word, e.g. the vowel sound [
ə
] is represented by Alef (
Z
) + Zabar (
F
) at the start of a word and by Zabar (
 F
) inthe middle of a word. The vowel sound [
ə
] never comes at the end of a word. Nasalization of a vowel ismarked with Noon-ghunna (
y
) and Noon (
) at the endand in the middle of a word respectively (Malik 
et al.
,2008). For more details, see (Malik 
et al.
, 2008).Urdu contains 15 diacritical marks. They representvowel sounds, except Hamza-e-Izafat (
 
Y
) and Kasr-e-Izafat (
G
) that are used to build compound words,
e.g.
 
ﺲﻨﺋﺎﺳ
 
ٔﮦرادِا
[
ɪ
d
   ̭ɑ
ə
h
ɪ
s
ɑɪ
ns] (Institute of Science),
ِﺦﻳِرﺎﺗﺶﺋاﺪﻴﭘ
[t
ɑ
rix
ɪ
 ped
ɑɪʃ
] (date of birth),
etc.
Shadda (
 H
) isused to geminate a consonant
e.g.
 
ّبر
[r 
ə
 bb] (God),
ﺎﻬّ ﭼا
 [
ə
ʧʧʰɑ
] (good),
etc.
Sukun (
 
) is used to mark theabsence of a vowel after the base consonant (Platts,1909; Malik 
et al.
, 2008).Languages of Table 1 share Perso-Arabic punctuation and special symbols. These punctuationmarks and symbols are given in Table 5.
Sr.
 
Symbol Unicode Sr.
 
Symbol Unicode
1
Ô
060C 10
؏
 060F2
;
061B 11
ؐ 
 
06103
?
061F 12
œ
 
06114
X
06D4 13
Ÿ
 
06125
    ؀
 0600 14
ؓ
 
0613
of 00

Leave a Comment

You must be to leave a comment.
Submit
Characters: ...
You must be to leave a comment.
Submit
Characters: ...