(IJCSIS) International Journal of Computer Science and Information Security, Vol. 8, No. 7, 2010
grammar. To make things easier, here is a brief description of the alteration process. Like prepositions in English, a consonant (র, য়, etc.), a dependent vowel (◌) [11], or both (র, ত) may be added at the end of a word in Bangla [12]. The most troubling part is that, unlike in English, these additional dependent vowels or consonants merge with the word and become an integral part of it. Because these inflections do not significantly change the meaning of the word, and serve instead to embed the word in a sentence, they should be handled carefully in a query search used for the web. A simple example may help clarify why the case ending must be considered. Suppose someone wants to know about the capital of Bangladesh, Dhaka. In Bangla it is spelled 'ঢাকা'. Table I lists some words with their respective meanings and their edit distances from the word of interest.
TABLE I. PERFORMANCE OF LEVENSHTEIN DISTANCE FOR BANGLA

Serial   Word      English Meaning                     Edit distance
1        ঢাকা      Dhaka, capital of Bangladesh        0
2        ঢাক       Drum (musical instrument)           1
3        ঢাকী      Drummer                             1
4        টাকা      Currency of Bangladesh              1
5        ডাকা      To call                             1
6        পাকা      Ripe                                1
7        ঢাকায়     In Dhaka                            1
8        ঢাকার     Of Dhaka                            1
9        ঢাকাকে    To Dhaka (used for addressing)      2
10       ঢাকাতে    In Dhaka                            2
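The edit distances in Table I can be checked with a standard Levenshtein computation. The following is a minimal sketch, not the authors' implementation: it assumes each Unicode code point (including a dependent vowel sign) counts as one character, and the names `levenshtein`, `DHAKA` and `TABLE_I` are illustrative. The words are written as escape sequences so that the code-point sequence is unambiguous.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance over Unicode code points."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


DHAKA = "\u09a2\u09be\u0995\u09be"  # ঢাকা (word 1)

# A few rows of Table I (assumed code-point spellings).
TABLE_I = {
    "drum": "\u09a2\u09be\u0995",                        # ঢাক   (word 2)
    "of_dhaka": "\u09a2\u09be\u0995\u09be\u09b0",        # ঢাকার (word 8)
    "to_dhaka": "\u09a2\u09be\u0995\u09be\u0995\u09c7",  # ঢাকাকে (word 9)
}

distances = {name: levenshtein(DHAKA, word) for name, word in TABLE_I.items()}
```

Under these assumptions, the unrelated word "drum" and the inflected form "of Dhaka" both come out at distance 1, while the inflected form "to Dhaka" comes out at distance 2, which is the ranking problem the table illustrates. Distances for words containing য় may differ depending on whether that letter is stored as one code point or two.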
In the list, words 2 to 8 all have the same edit distance and would therefore be treated as equally similar to the word of interest. From a semantic standpoint, however, words 7 to 10 carry the true information about the capital of Bangladesh and should be given preference. For words with puzzling spelling rules, the simultaneous occurrence of an inflection due to case ending and a spelling mistake may lead to a higher edit distance and an undesirable outcome. These facts motivate us to perform an additional task before calculating the conventional edit distance for Bangla approximate search.

III. METHOD

For any string searching algorithm, one of the most important factors is running time, especially for applications adopting a web-based service model. We assume a large list of words on which to perform the search operation, as is the case for both a dictionary search and a web query. We therefore propose a method that consists of two major stages:
A. Fast filtering
B. Computation of modified edit distance
A. Fast Filtering

Since we assume a large amount of data to search, the best approach is to quickly discard a large part of the text using a computationally efficient method. This is called filtering, and various methods such as n-grams [13] or spaced seeds [14] have been proposed for the purpose. In our work, we applied a fast filtering method that resembles the n-gram method (with n=1) and has a very simple form. A filter is said to be lossless if it does not discard any potential match during its operation. To be on the safe side and avoid losing a probable matching word, we adopted a filter that acts defensively rather than aggressively discarding a large amount of text on the first pass. The filtering process consists of the following steps.
1) Length Matching: If the query string has N characters, then only words whose length lies between L1 and L2 inclusive are considered for the next step, where

$L_1 = N - N/2$
$L_2 = N + N/2$

with L1 and L2 rounded to the nearest integer.

At this stage, a large number of words are discarded at little computational cost. A comparatively large margin is used for words to pass the filter. This is because in Bangla a word within a sentence can be augmented by a case ending (e.g. 'ঢাকা' may become 'ঢাকাকে', see Table I), which results in a longer word. On the other hand, words may take a shorter form due to variation in spelling (e.g. '◌ঁ' and '◌্' are sometimes ignored).
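The length-matching step can be sketched as follows. This is only an illustration under stated assumptions: word length is taken as the number of Unicode code points, halves are rounded upward (the text says only "rounded to the nearest integer"), and the function names `length_bounds` and `length_filter` are hypothetical.

```python
import math


def length_bounds(n):
    """Return (L1, L2) = (N - N/2, N + N/2), rounded to the nearest integer.

    Assumption: ties (x.5) are rounded upward; the paper does not say.
    """
    half = n / 2
    low = math.floor(n - half + 0.5)
    high = math.floor(n + half + 0.5)
    return low, high


def length_filter(query, words):
    """Keep only the words whose length lies in [L1, L2] for the query."""
    low, high = length_bounds(len(query))
    return [w for w in words if low <= len(w) <= high]
```

For a four-character query such as 'ঢাকা', the bounds are 2 and 6, so the six-character inflected form 'ঢাকাকে' still passes the filter.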
2) Coarse Distance Matching: Only the words that pass the first step are considered in this step. Here, the similarity between the query word and the searched word is measured by comparing the number of occurrences of each character in the two words. For this purpose, a one-dimensional vector of length k is used, where k is the number of possible characters (including dependent vowels) in Bangla. For the query word, the vector $C_Q$ is computed once, before the process starts. Say,
$C_Q = [\, c_{Q1} \;\; c_{Q2} \;\; c_{Q3} \;\; \ldots \;\; c_{Qk} \,]$    (1)
where $c_{Qn}$ is the number of occurrences of the n-th character in the query word. Similarly, $C_S$ is computed for the searched word:
$C_S = [\, c_{S1} \;\; c_{S2} \;\; c_{S3} \;\; \ldots \;\; c_{Sk} \,]$    (2)
Now, coarse distance is computed by the equation
$C_D = \sum_{n=1}^{k} \left| c_{Qn} - c_{Sn} \right|$    (3)
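A minimal sketch of the coarse distance follows, reading equation (3) as a sum of absolute differences between character counts. Instead of a fixed vector of length k, it uses a `Counter`, which gives the same sum because characters absent from both words contribute zero. The function name `coarse_distance` is illustrative and not taken from the paper.

```python
from collections import Counter


def coarse_distance(query, candidate):
    """Sum over all characters of |count in query - count in candidate|."""
    cq = Counter(query)      # plays the role of C_Q
    cs = Counter(candidate)  # plays the role of C_S
    # A Counter returns 0 for characters it has not seen.
    return sum(abs(cq[ch] - cs[ch]) for ch in cq.keys() | cs.keys())


DHAKA = "\u09a2\u09be\u0995\u09be"               # ঢাকা
TO_DHAKA = "\u09a2\u09be\u0995\u09be\u0995\u09c7"  # ঢাকাকে
```

Because only counts are compared, the measure ignores character order: two words made of the same characters in a different order have coarse distance 0. In practice the query's `Counter` would be built once and reused for every candidate, as the text describes for $C_Q$.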
171    http://sites.google.com/site/ijcsis/    ISSN 1947-5500