Published by: ijcsis on Nov 02, 2010
Copyright:Attribution Non-commercial

(IJCSIS) International Journal of Computer Science and Information Security, Vol. 8, No. 7, 2010

Approximate String Search for Bangla: Phonetic and Semantic Standpoint

Adeeb Ahmed
Department of Electrical and Electronic Engineering
Bangladesh University of Engineering and Technology
Dhaka, Bangladesh
ahmedadeeb@yahoo.com

Abdullah Al Helal
Department of Electrical and Electronic Engineering
Bangladesh University of Engineering and Technology
Dhaka, Bangladesh
helal@eee.uiu.ac.bd

Abstract - Despite the progress in the field of approximate string search, little research has been performed on Bangla string matching. Approximate string search is of great interest for spell checking, query relaxation, and interactive search. In this work, we propose a method for Bangla string search that is specially adapted to Bangla spelling rules and grammar. Rather than relying on simple string matching, special emphasis is given to ensuring that words with relevant meaning are not ignored because of their inflected forms. Phonetic matching is also emphasized for this purpose.

Keywords - Approximate string search; Bangla search; Levenshtein distance; query relaxation; spelling suggestion; case ending

I. INTRODUCTION
Consider a collection of strings named ‘Database’ and a query string named ‘Queryword’. We need to find all the substrings in ‘Database’ that are similar to ‘Queryword’ and sort them according to their similarity with it. The real challenge is to define the term ‘similarity’. Different methods have been proposed for this purpose [1]–[6], using similarity functions such as Levenshtein distance [7], cosine similarity [5], or the Jaccard coefficient [8].

But these distances alone cannot deal with the common spelling mistakes made by humans. In Bangla especially, words may lose their original form when used inside a sentence, because Bangla is a highly inflected language. Accounting for these alterations is far beyond the scope of these functions alone. In this work, we take two different matters into account for approximate search. The first is the set of common spelling mistakes made by humans in Bangla. For this purpose Bangla phonetics was studied, and any mismatch between similar-sounding letters is ignored. Being an extremely rich language, Bangla possesses more than one character for various similar-sounding voiced and unvoiced sounds. From a phonetic standpoint, they could easily be represented by a single character. Because they produce almost the same auditory sensation, these similar-sounding letters often create confusion and cause spelling mistakes. Moreover, in some cases, different spellings of a single word are accepted. For approximate string search, these common spelling errors must be studied carefully.

The second factor is more important for implementing query relaxation. Due to the grammatical rules of Bangla, words often lose their original form inside a sentence through different forms of inflection. These inflections may be classified into groups such as tense ending, case ending, personal ending, imperative ending, etc. [9]. Among these, case ending is responsible for the alteration of nouns and pronouns. Two facts about this type of inflection are noteworthy: it is unavoidable in sentence formation, and it causes an insignificant change in the meaning of the word. As proper nouns and nouns together constitute over 70% of the query terms on the web [10], considering word inflection due to case ending is extremely important for Bangla search.

II. PRELIMINARIES
Among the various functions for computing the similarity between two strings, Levenshtein distance, or edit distance, is a widely accepted one. The Levenshtein distance between two strings is defined as the minimum number of operations (substitution, insertion, or deletion) required to convert one string into the other. Let us look carefully at the performance of edit distance in Bangla string matching. Here we assume the Bangla text is encoded using Unicode [11]. As stated earlier, Bangla contains several similar-sounding characters that often introduce confusion; as an example, we study the spelling of the word ‘BANGLA’ itself. It can be spelled in two different ways, ‘বাংলা’ and ‘বাঙলা’. If we simply consider the Levenshtein distance, we get the value 1 (substitute ং with ঙ). Now consider the edit distance between ‘বাংলা’ and ‘বালা’, which means ‘BANGLE’ in English, a completely different word. A simple insertion of ং transforms ‘বালা’ into ‘বাংলা’, resulting in the same edit distance as for the previous pair. But from a phonetic point of view, ‘বাংলা’ is much closer to ‘বাঙলা’, and in this case the two also possess the same meaning. So these factors must be considered before computing the conventional edit distance.

http://sites.google.com/site/ijcsis/ ISSN 1947-5500

Understanding the second factor, the inflection of words due to case ending, requires some knowledge of Bangla grammar. To make things easier, here is a brief description of the alteration process. Like prepositions in English, a consonant (র, য় etc.), a dependent vowel (ে) [11], or both (রে, তে) may be added at the end of a word in Bangla [12]. The most troubling part is that, unlike in English, these additional dependent vowels or consonants merge with the word and become an integral part of it. As these inflections do not change the meaning of the word significantly, but are rather used to embed the word in a sentence, they should be considered carefully in a query search for the web. A simple example may help to clarify the necessity of considering case ending. Suppose someone wants to know about the capital of Bangladesh, that is, DHAKA. In Bangla it is spelled ‘ঢাকা’. Table I lists some words with their meanings and their edit distances from the word of interest.

TABLE I. PERFORMANCE OF LEVENSHTEIN DISTANCE FOR BANGLA

Serial | Word | English Meaning | Edit distance
1 | ঢাকা | Dhaka, capital of Bangladesh | 0
2 | ঢাক | Drum (musical instrument) | 1
3 | ঢাকী | Drummer | 1
4 | টাকা | Currency of Bangladesh | 1
5 | ডাকা | To call | 1
6 | পাকা | Ripe | 1
7 | ঢাকায় | In Dhaka | 1
8 | ঢাকার | Of Dhaka | 1
9 | ঢাকাকে | To Dhaka (used for addressing) | 2
10 | ঢাকাতে | In Dhaka | 2

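
The edit distances in Table I can be checked with a short sketch of the standard dynamic-programming Levenshtein distance. It assumes, as the text does, that words are Unicode strings and that each code point (including each dependent vowel) counts as one character; the function name `levenshtein` is only illustrative and is not taken from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca by cb, or match
        prev = curr
    return prev[-1]


query = "ঢাকা"
words = ["ঢাকা", "ঢাক", "ঢাকী", "টাকা", "ঢাকার", "ঢাকাকে"]
# One distance per word, in the order of rows 1, 2, 3, 4, 8 and 9 of Table I.
print([levenshtein(query, w) for w in words])
```

Because the distance is computed over code points, the result depends on how the text is encoded; a letter stored as a base character plus a combining mark would count as two characters.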
In the list, words 2 to 8 all have the same edit distance and would be treated with equal importance as being similar to the word of interest. But from a semantic standpoint, words 7 to 10, which carry the true information about the capital of Bangladesh, should be given preference. For words with puzzling spelling rules, the simultaneous occurrence of an inflection due to case ending and a spelling mistake may lead to a higher edit distance and an undesirable outcome. These facts motivate us to perform additional tasks before calculating the conventional edit distance for Bangla approximate search.

III. METHOD

For any string searching algorithm, one of the most important factors is running time, especially for applications adopting a web-based service model. We assume a large list of words on which to perform the search operation, as is the case for both dictionary search and web queries. So we propose a method that consists of two major stages:

A. Fast filtering
B. Computation of modified edit distance

A. Fast Filtering

Since we assume a large amount of data to search, the best idea is to quickly discard a large part of the text using a computationally efficient method. This is called filtering, and various methods such as n-grams [13] or spaced seeds [14] have been proposed for the purpose. In our work, we apply a fast filtering method that resembles the n-gram method (with n = 1) and has a very simple form. A filter is said to be lossless if it does not discard any potential match during its operation. To be on the safe side and avoid losing a probable match, we adopt a filter that acts defensively rather than aggressively discarding a large amount of text on the first run. The filtering process consists of the following steps.

1) Length Matching: If the query string has N characters, then only words whose length lies between $L_1$ and $L_2$ inclusive are considered for the next step, where

$$L_1 = N - N/2, \qquad L_2 = N + N/2,$$

with $L_1$ and $L_2$ rounded to the nearest integer. At this stage, a large number of words are discarded at little computational cost. A comparatively large margin is used for words to pass the filter. This is because a Bangla word within a sentence can be augmented by a case ending (e.g. ‘ঢাকা’ may become ‘ঢাকাকে’, see Table I), which results in a longer word. On the other hand, words may take a shorter form due to variation in spelling (e.g. ‘◌ঁ’ and ‘◌্’ are sometimes ignored).
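
The length-matching step can be sketched as follows. The paper only says that the bounds are rounded to the nearest integer; rounding halves upward is an assumption made here, and the names `length_bounds` and `length_filter` are hypothetical. Word length is taken to be the number of Unicode code points.

```python
from __future__ import annotations

import math


def length_bounds(n: int) -> tuple[int, int]:
    """Return (L1, L2) = (N - N/2, N + N/2), each rounded to the nearest integer."""
    def round_half_up(x: float) -> int:
        # Assumption: ties such as 2.5 are rounded upward.
        return math.floor(x + 0.5)

    return round_half_up(n - n / 2), round_half_up(n + n / 2)


def length_filter(query: str, words: list[str]) -> list[str]:
    """Keep only the words whose length lies between L1 and L2 inclusive."""
    low, high = length_bounds(len(query))
    return [w for w in words if low <= len(w) <= high]
```

For a four-character query such as ‘ঢাকা’ the bounds are 2 and 6, so every word in Table I passes this stage.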

2) Coarse Distance Matching: Only the words that pass the first stage are considered at this stage. In this step, the similarity between the query word and the searched word is measured by comparing the number of occurrences of each character in the two words. For this purpose, a one-dimensional vector of length k is used, where k is the number of possible characters (including dependent vowels) in Bangla. For the query word, the vector $C_Q$ is computed before starting the process:

$$C_Q = [\, c_{Q1} \;\; c_{Q2} \;\; c_{Q3} \;\; \dots \;\; c_{Qk} \,] \tag{1}$$

where $c_{Qn}$ is the number of occurrences of the n-th character in the query word. Similarly, $C_S$ is computed for the searched word:

$$C_S = [\, c_{S1} \;\; c_{S2} \;\; c_{S3} \;\; \dots \;\; c_{Sk} \,] \tag{2}$$

The coarse distance is then computed by the equation

$$C_D = \sum_{n=1}^{k} \left| c_{Qn} - c_{Sn} \right| \tag{3}$$
A higher value of $C_D$ indicates a greater difference between the two words. A threshold is set for words to pass through the filter. Here too, a moderately large threshold is used to keep the filter lossless. A simple assumption suggests a threshold proportional to the length of the query string, as a longer string can contain more errors. Only words whose $C_D$ value is lower than the threshold are considered for the next stage.
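
A minimal sketch of the coarse distance follows. It assumes the distance is the sum of absolute differences between per-character counts, and it uses a `Counter` in place of the fixed length-k vector (characters absent from both words contribute zero, so the sum is the same). The proportionality constant `factor` is a hypothetical parameter; the paper does not state its value.

```python
from collections import Counter


def coarse_distance(query: str, candidate: str) -> int:
    """Sum over all characters of |count in query - count in candidate|."""
    cq, cs = Counter(query), Counter(candidate)
    # A Counter returns 0 for characters it has not seen.
    return sum(abs(cq[ch] - cs[ch]) for ch in cq.keys() | cs.keys())


def coarse_filter(query: str, words: list, factor: float = 1.0) -> list:
    """Keep words whose coarse distance is below a threshold proportional to len(query)."""
    threshold = factor * len(query)  # assumed form of the threshold
    return [w for w in words if coarse_distance(query, w) < threshold]
```

Since the measure ignores character order, anagrams have a coarse distance of zero; it only serves to discard clearly unrelated words before the more expensive edit-distance stage.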

B. Computing Modified Edit Distance

The focal part of our work is to compute a distance between words that is intelligent enough to account for both human error in searching and the inflections introduced in words when they are used in sentences. To do so, some modifications are made to the words before the traditional distance is computed. The first stage of modification takes into account the phonetic similarities between words; exploiting these resemblances, spellings are manipulated and simplified. In the second stage, semantic similarities are considered, and alterations are made to ensure that semantically similar words in the database produce a lower distance from the query word than other words do.

1) Spelling Modifications: As stated earlier, two different features are emphasized during approximate string search in our work. First, spelling mistakes due to similar-sounding letters (e.g. ‘স’, ‘শ’, ‘ষ’) should be considered. This problem also exists in English, and various approximate string matching algorithms built for English use phonetic methods such as Soundex [15] or PHONIX [16]. Soundex is the oldest of these and was developed specifically for English. It maps the 26 letters of English onto only 7 disjoint sets according to their phonetic similarity. Vowels are ignored completely and play no part in the computation. PHONIX is similar to Soundex, but applies some modifications to a word before mapping it. For Bangla, however, mapping words onto such a small number of sets and ignoring the vowels may produce unacceptable results. Because of the word structure, a small variation in a dependent or independent vowel may produce a completely different word with a different meaning. In Table I, for example, words 1, 2 and 3 differ by only one dependent vowel. This shows that ignoring the vowels, as Soundex does, is inappropriate for Bangla. Furthermore, because Soundex maps the English letters onto only 7 disjoint sets, letters that hardly sound alike end up in the same set (e.g. D and T are treated equally in both Soundex and PHONIX). But careful observation of the Bangla lexicon reveals numerous distinct words that are comparable from a phonetic standpoint (e.g. words 1, 4 and 5 in Table I). Moreover, because fast filtering is applied in the first stage, we expect a relatively small number of words to remain, which eliminates the need for a highly computationally efficient matching. Considering these details, we use a rather conservative conversion that converts only the phonetically similar characters and keeps most of each word unchanged (Table II). There is also a small list of ignored characters, which are commonly disregarded. A simple example may clarify the procedure. Consider the Bangla word ‘বিভীষণ’. After conversion it becomes ‘বিভিসন’.

TABLE II. MAPPING OF WORDS FOR SPELLING MODIFICATION

Characters in the original word | Converted characters
◌্, ◌ঁ, ◌ঃ, ◌া | Ignored
[missing], ◌ং | [missing]
শ, ষ | স
ণ | ন
ড়, ঢ় | র
য | জ
[missing] | ত
ঈ | ই
ঊ | উ
◌ী | ◌ি
◌ূ | ◌ু
Rest of the characters | Unchanged
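
The conversion in Table II can be sketched as a character-level translation. The sketch below covers only the rows that can be read unambiguously: it drops ◌্, ◌ঁ and ◌ঃ and applies the letter and vowel-sign replacements. Two points are assumptions rather than statements from the paper: ং is mapped to ঙ (suggested by the ‘বাংলা’ / ‘বাঙলা’ example in Section II), and letters stored as a base character plus nukta are first joined into their single-code-point forms so that য় is not mistaken for য. The name `simplify_spelling` is hypothetical.

```python
# Join "base letter + nukta" pairs into single code points before mapping.
_PRECOMPOSE = {
    "\u09a1\u09bc": "\u09dc",  # ড + ় -> ড়
    "\u09a2\u09bc": "\u09dd",  # ঢ + ় -> ঢ়
    "\u09af\u09bc": "\u09df",  # য + ় -> য়
}

_IGNORED = ["\u09cd", "\u0981", "\u0983"]  # ◌্ (hasanta), ◌ঁ, ◌ঃ

_REPLACEMENTS = {
    "\u0982": "\u0999",  # ং -> ঙ (assumed direction)
    "\u09b6": "\u09b8",  # শ -> স
    "\u09b7": "\u09b8",  # ষ -> স
    "\u09a3": "\u09a8",  # ণ -> ন
    "\u09dc": "\u09b0",  # ড় -> র
    "\u09dd": "\u09b0",  # ঢ় -> র
    "\u09af": "\u099c",  # য -> জ
    "\u0988": "\u0987",  # ঈ -> ই
    "\u098a": "\u0989",  # ঊ -> উ
    "\u09c0": "\u09bf",  # ◌ী -> ◌ি
    "\u09c2": "\u09c1",  # ◌ূ -> ◌ু
}

# str.translate expects code points as keys; None deletes the character.
_TABLE = {ord(src): dst for src, dst in _REPLACEMENTS.items()}
_TABLE.update({ord(ch): None for ch in _IGNORED})


def simplify_spelling(word: str) -> str:
    """Apply the phonetic simplification to one word; other characters are unchanged."""
    for pair, single in _PRECOMPOSE.items():
        word = word.replace(pair, single)
    return word.translate(_TABLE)
```

With this mapping ‘বিভীষণ’ becomes ‘বিভিসন’, and ‘বাংলা’ and ‘বাঙলা’ collapse to the same string, so their edit distance after simplification is zero.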

2) Case Ending Consideration: The second form of modification is particularly required for web queries or database search. As explained earlier, Bangla words undergo various inflections. Because nouns and pronouns are of greater importance in web search, in our work we handle only the inflections applied to nouns and pronouns, that is, inflections due to case ending.

To complicate matters further, most case ending terms in Bangla are integrated with the original word, which makes them even harder to deal with. Fortunately, only a limited number of case ending terms are used in Bangla (listed in Table III), and with proper logic these can be identified most of the time.

TABLE III. LIST OF CASE ENDINGS USED IN BANGLA

Group-1 | কে, রে, এরে, এ, য়, তে, এতে, র, এর
Group-2 | দ্বারা, দিয়া, কর্তৃক, হইতে, থেকে, চেয়ে

In Table III, the case endings listed as Group-2 do not unite with the original word and are thus not of our concern. Only Group-1 case endings are considered. Note that the case endings in Table III are not written in the exact form in which they appear inside a word. For ease of reading, they are written using independent vowels (এ, এতে, etc.). When attached to a word, these independent vowels are replaced by dependent vowels (e.g. এ by ◌ে). As an example, when the case ending ‘এর’ is used with the word ‘সকাল’, the word after inflection is

সকাল + এর = সকালের