
JOURNAL OF COMPUTING, VOLUME 3, ISSUE 8, AUGUST 2011, ISSN 2151-9617 HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING/ WWW.JOURNALOFCOMPUTING.ORG


A Novel Phonetic Name Matching Technique


Junaid Tariq, Noman Ajmal, and Azam Khan
Abstract: A name is an attribute used to search for records in almost every application that keeps records, because every object must have a name. Practical experience shows that names undergo many changes over time, which makes such searches complex. This paper covers various problems related to names and the shortcomings of existing techniques. It also surveys different phonetic name matching techniques. Phonetic name matching techniques are faster than pattern matching techniques, but they do not support the features of pattern matching. Finally, we present a name matching technique that uses a phonetic approach to achieve the features of pattern matching. The proposed technique is flexible: it handles changes made to a name over time and also handles single-character errors efficiently.
Index Terms: Name Matching, Name Change, Name Matching Techniques.

1 INTRODUCTION
Large amounts of data are created by organizations that keep records about people, e.g., customers, suppliers, patients, research authors, bankers, artists, musicians, politicians, actors, and businessmen. A personal name is often used to search for data related to such groups of people. The study of personal names is called anthroponymy [10]. African naming conventions differ from those of Western societies, where a person's father's name comes last [10]. An Akan child receives a birthday name along with his or her own name, according to the day on which he or she was born [10]. Names in different communities are stored in different formats: one community stores the first name followed by the second name, while another stores the same name as the second name followed by the first name. Name matching has been studied by different research communities using different approaches, such as statistics and artificial intelligence [7]. Variations in names exist even within a single community: [8] explains the differences in Chinese personal names between Beijing and Hong Kong and highlights variations in names across different Chinese communities.

In a few fields, e.g., medical records and crime investigation, no ID or registration number can be used to search for a record; in such fields, the name must be used to find the record in a large database containing millions or even billions of records. Exact matching is also not desirable because of the limitations of machine translation systems and other automated systems, such as OCR (Optical Character Recognition), that convert names from one language or form to another. OCR is still a research area in which 100% accuracy has not been achieved, so converting names with any automated system is challenging and error-prone. This leads to the assumption that searching for names using an exact matching technique will be full of errors. The challenges of Chinese-to-English personal name conversion are presented in [9]. Another problem with personal names is that they are not unique and have many variations. Personal names are ubiquitous in information systems [11].

Table 1 shows how names are written in different communities. In Pakistani culture, the family name is the last name of the person, whereas in Italy it is the first name of the person.

Table 1: Name arrangement in different countries [12]
Country/Community | First name | Last name | Arrangement
Pakistan | One or more given names | 1 family name | Given name first
English/Dutch/German | Two (2) or more given names (first, middle) | 1 family name | Family name first
Spain | One (1) or more given names | Two family names (one from father, one from mother) | Given name first
Italy | One (1) or more given names | Family name | Family name first

The paper is organized as follows: Section 2 presents various problems associated with names, Section 3 discusses various phonetic name matching approaches, Section 4 covers the limitations of various phonetic techniques, Section 5 covers the proposed technique, and Sections 6 and 7 discuss the results and conclusion, respectively.

Junaid Tariq is working as a lecturer at COMSATS Institute of Information Technology, Islamabad, Pakistan. E-mail: junaid_tariq@comsats.edu.pk.
Noman Ajmal is working as a project manager at Fulgent Resources, Islamabad, Pakistan. E-mail: raja.noman@hotmail.com.
Azam Khan is working as an RA at COMSATS Institute of Information Technology, Islamabad, Pakistan. E-mail: azam_khan@comsats.edu.pk.

2011 Journal of Computing Press, NY, USA, ISSN 2151-9617


2 PROBLEMS ASSOCIATED WITH NAMES

The following characteristics are related to names:
o In some cultures, a woman's family name changes upon marriage [12].
o Officers in the armed forces prefix their names with their rank.
o The same name can have different valid spellings [1], e.g., Gayle and Gale.
o Cristina T. Smith and C. Tina Smith may be the names of the same person.

The following problems are related to names:
o An early study showed that 80% of errors were due to a single-character error [13, 1].
o 14% of errors were due to spelling variations [1].
o 8% of errors were due to a different last name for women [1].
o Typing errors are made by data entry personnel, e.g., Sydney typed as sydeny.

Therefore, exact matching will lead to poor results. According to [15], exact matching does not guarantee that the name found is the desired name. For this reason, many approximate matching techniques have been developed. Figure 1 shows the different categories of character-level errors [14]:
1. Typographical errors: typing errors made by the data entry operator.
2. Cognitive errors: errors due to lack of knowledge.
3. Phonetic errors: the correct spelling of a similar-sounding word is used instead.

Figure 1: Character Level Error (Character Level Errors: Typographical Error, Cognitive Error, Phonetic Error)

3 PHONETIC NAME MATCHING APPROACHES

Approximate matching techniques are based on pattern matching, phonetic encoding, or a combination of both. A large number of techniques have been developed for both of these approaches [1]. The following are the approaches for phonetic encoding:

1. Soundex [2] [3]
2. Phonex [3]
3. Phonix [4]
4. NYSIIS [5]
5. Double-Metaphone [6]
6. Fuzzy Soundex [2]

Figure 2 shows that the two main categories of name matching approaches are phonetic encoding and pattern matching.

Figure 2: Name Matching Approaches (Name Matching Approaches: Phonetic Encoding, Pattern Matching)

Phonetic encoding algorithms are faster than pattern matching algorithms [1], as their complexity is O(|s|) for any given string s. Among the phonetic techniques, Phonix is the slowest [1]. The following subsections briefly discuss the phonetic encoding approaches.

1. Soundex
1.1 Introduction
Soundex is based on English language pronunciation [17]. Soundex was developed by Robert C. Russell and Margaret K. Odell in 1918. It converts a string into a code according to how the name is pronounced. It is the best-known phonetic encoding algorithm.
1.2 Working
The algorithm is outlined in [16]:
1. Capitalize all letters in the word and drop all punctuation marks. Pad the word with rightmost blanks as needed during each procedure step.
2. Retain the first letter of the word.
3. Change all occurrences of the following letters to '0' (zero): 'A', 'E', 'I', 'O', 'U', 'H', 'W', 'Y'.
4. Change letters from the following sets into the digit given:
   o 1 = 'B', 'F', 'P', 'V'
   o 2 = 'C', 'G', 'J', 'K', 'Q', 'S', 'X', 'Z'
   o 3 = 'D', 'T'
   o 4 = 'L'
   o 5 = 'M', 'N'
   o 6 = 'R'
5. Reduce each pair of identical digits that occur beside each other in the string that resulted after step (4) to a single digit.
6. Remove all zeros (placed there in step 3) from the string that results from step (5).
7. Pad the string that resulted from step (6) with trailing zeros and return only the first four positions, which will be of the form <uppercase letter> <digit> <digit> <digit>.
1.3 Example
The Soundex code for peter is P360.
1.4 Limitation/Drawback
Soundex keeps the first letter, so any variation or error at the beginning of a name results in a different code.

2. Phonex
2.1 Introduction
Phonex tries to improve the encoding quality of Soundex.
2.2 Working
1. Trailing characters are removed.
2. Retain the first letter of the word.
3. Various rules are applied in addition to the Soundex rules, e.g.:
   o Kn = (replaced by) n
   o Wr = r
4. The output of Phonex is one letter followed by three digits (Phonex form: <letter> <digit> <digit> <digit>).

3. Phonix
3.1 Introduction
Phonix applies more than one hundred rules to the string. Some of these rules concern the beginning of the string, some the middle portion, some the end, and some are applied anywhere.
3.2 Working
1. The name string is transformed using the Phonix rules.
2. The transformed string of step 1 is encoded to a code of one character and three digits. Phonix uses the following encoding scheme:

Figure 3: Phonix Rules

3.3 Limitation/Drawback
The large number of rules in Phonix makes it slow and complex.

4. NYSIIS
4.1 Introduction
The New York State Identification and Intelligence System (NYSIIS) has transformation rules like Phonex and Phonix. The code returned by NYSIIS consists of letters only.

5. Double Metaphone
5.1 Introduction
This algorithm was developed for non-English names, such as Asian names. The code produced by Double-Metaphone consists of letters. It contains many rules similar to those of Phonix. In certain cases, Double-Metaphone returns two phonetic codes.

6. Fuzzy Soundex
6.1 Introduction
Fuzzy Soundex is based on q-grams. It has transformation rules like Phonix: some rules are limited to the beginning of the string, some to the end, some to the middle, and some are applicable anywhere in the string.
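As an illustration of the seven Soundex steps outlined in this section, the following Python sketch follows them literally. It is not a reference implementation: the function name soundex and the helper tables are hypothetical, and refinements found in some Soundex variants (for example, special handling of 'H' and 'W' or of the first letter's own digit) are deliberately left out.

```python
import string

# Digit groups from step 4 of the outline; the letters of step 3 map to '0'.
SOUNDEX_GROUPS = {
    "0": "AEIOUHWY",
    "1": "BFPV",
    "2": "CGJKQSXZ",
    "3": "DT",
    "4": "L",
    "5": "MN",
    "6": "R",
}
LETTER_TO_DIGIT = {ch: d for d, letters in SOUNDEX_GROUPS.items() for ch in letters}


def soundex(name: str) -> str:
    """Return a 4-character code following the steps as listed (a sketch)."""
    # Step 1: capitalize and drop everything that is not an ASCII letter.
    letters = [ch for ch in name.upper() if ch in string.ascii_uppercase]
    if not letters:
        return ""
    # Step 2: retain the first letter.
    first = letters[0]
    # Steps 3 and 4: replace the remaining letters by their digits.
    digits = [LETTER_TO_DIGIT[ch] for ch in letters[1:]]
    # Step 5: collapse identical digits that sit beside each other.
    collapsed = []
    for d in digits:
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    # Step 6: remove the zeros introduced in step 3.
    body = "".join(d for d in collapsed if d != "0")
    # Step 7: pad with trailing zeros and keep the first four positions.
    return (first + body + "000")[:4]


print(soundex("peter"))                   # P360
print(soundex("Gale"), soundex("Gayle"))  # G400 G400
```

The two spellings Gale and Gayle receive the same code, while a change to the first letter would produce a different code, which is the drawback noted for Soundex.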

4 LIMITATIONS OF PHONETIC TECHNIQUES


After examining the working of the various phonetic name matching techniques, we arrive at the following limitations:
1. Typing errors cannot be detected, as a, e, h, i, o, u, w, y are replaced by 0 (zero), etc.
2. Reducing repeated characters to a single one (1) changes the whole word.
3. Interchanging the first and last names results in a code that makes name matching impossible, e.g.:

Pakistani Way (first name, second name) | British Way (second name, first name)
Junaid Tariq | Tariq Junaid
After coding: J53T60 | After coding: T60J53

The above example shows clearly that J53T60 will never match T60J53, although in reality both are the name of the same person. Such modifications make it impossible to highlight problems such as typing errors, words that sound the same, a last name replaced by a first name, etc. To overcome these shortcomings, phonetic algorithms are combined with pattern matching algorithms.

5 PROPOSED TECHNIQUE
A. Introduction
Phonetic matching algorithms are easy to implement and are less expensive in terms of computation than pattern matching algorithms, but they have limitations and problems, as discussed previously.
To achieve easy implementation together with some of the positive features of pattern matching algorithms, we applied an ASCII coding technique to the phonetic algorithm, e.g., assign code 0 to a/A, 1 to b/B, ..., and similarly assign code 25 to z/Z.
Consider a name with first name Junaid and last name Tariq:
o Full name in Pakistani format: Junaid Tariq
o Full name in European format: Tariq Junaid
o Another variation of this name: J. Tariq
o Adding a prefix to the name: Mr. Junaid Tariq
o Name after a typing error: Mr. Juneid Tariq
o Name after missing a single character: Jnaid Tariq
All these variations actually refer to the same person. We therefore need a technique that matches all of these variations to the same person.

B. Proposed Algorithm Working
1. Remove all punctuation except a dot (.) if it is present in the middle of the string, e.g., J. Tariq.
2. Assign a code to each character, from 0 to 25 for the characters a/A to z/Z respectively.
3. Add the codes within each sub-string of the string to get a single accumulated number for each sub-string present in the string.
4. Attach the first English character of each sub-string to its respective accumulated code.
5. At the end you will get a code containing one letter followed by one number for each sub-string.

C. Proposed Algorithm Rules
1. If matching is to take place between one string and multiple sub-strings, or between multiple sub-strings and a single string, then try to match the first character with any pair of the sub-strings to see whether it is part of the same string or not.
2. If a character is followed by a dot (.), then match only the first character with the string; in this case you do not need to match the number. Also, do not calculate the accumulated number of any character followed by a dot (.).
3. If the first character matches and the accumulated number does not, then find the difference between the two strings' accumulated numbers. If the difference is less than 25, it means that there is a typing mistake of a single character or that the data entry person has missed one character.
4. The user selects either the soft search or the hard search mode. In soft search mode, you can skip at most one prefix and one postfix sub-string. In hard search mode, each sub-string of the target string must match a sub-string of the input search string for it to be declared a hit.

Note: in hard search, the sequence of sub-strings does not matter; the only thing that matters is that all sub-strings are matched with the target string.

D. Explanatory Example
Table 2 shows the working of the proposed technique. The input string column contains the same name (Junaid Tariq) entered in different ways. The code column shows the code generated for the input string. The hit/miss column indicates whether the name is matched in the database or not, and finally, the reason column presents the rule by which the name is matched.

Table 2: Junaid Tariq is saved in database
Input string | Experiment name | Code | Hit/miss | Reason
Junaid Tariq | Asian format | J53 T60 | Hit | Rule # 1
J. Tariq | Dot (.) format | J T60 | Hit | Rule # 2
Tariq Junaid | European format | T60 J53 | Hit | Rule # 4
Mr. Junaid Tariq | Prefix addition | M J53 T60 | Hit | Rule # 4
Junaid | Single name | J53 | Hit | Rule # 4
Juneid | Typing error | J57 | Hit | Rule # 3
Jnaid | Character missed | J33 | Hit | Rule # 3

The benefits of the proposed technique include:
o Simple to understand.
o Easy to implement.
o Efficiently handles typing errors.
o Tolerates a missed character in a sub-string.
o Gives sub-string level matching control to the user.
o Fewer rules than Phonix, Phonex, etc.
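The encoding steps and matching rules of the proposed technique can be sketched in Python as follows. This is one possible reading, not the authors' implementation: the names encode, code_string, parts_match, same_parts, and is_hit are hypothetical. Two details are assumptions made for this sketch, because the rules are stated informally: sub-strings are paired regardless of order in both modes, and in soft search only the longer name may lose its first and/or last sub-string.

```python
from itertools import permutations
from typing import List, Optional, Tuple

Part = Tuple[str, Optional[int]]  # (initial letter, accumulated number or None)


def encode(name: str) -> List[Part]:
    """Encode each sub-string as its initial plus the sum of letter codes (a=0 ... z=25)."""
    # Working step 1: drop punctuation except the dot.
    cleaned = "".join(ch for ch in name if ch.isalpha() or ch.isspace() or ch == ".")
    parts: List[Part] = []
    for token in cleaned.split():
        letters = [ch for ch in token.lower() if "a" <= ch <= "z"]
        if not letters:
            continue
        initial = letters[0].upper()
        if "." in token:
            # Rule 2: no accumulated number for a sub-string followed by a dot.
            parts.append((initial, None))
        else:
            parts.append((initial, sum(ord(ch) - ord("a") for ch in letters)))
    return parts


def code_string(parts: List[Part]) -> str:
    """Render the code as text, e.g. 'J53 T60'."""
    return " ".join(letter + ("" if number is None else str(number)) for letter, number in parts)


def parts_match(a: Part, b: Part, tolerance: int = 25) -> bool:
    """Compare two sub-string codes using rules 2 and 3."""
    (letter_a, number_a), (letter_b, number_b) = a, b
    if letter_a != letter_b:
        return False
    if number_a is None or number_b is None:
        return True  # Rule 2: only the initial is compared.
    return abs(number_a - number_b) < tolerance  # Rule 3 (difference 0 is an exact match).


def same_parts(xs: List[Part], ys: List[Part]) -> bool:
    """True if every sub-string of xs pairs with a distinct one of ys, in any order."""
    if len(xs) != len(ys) or not xs:
        return False
    return any(all(parts_match(x, y) for x, y in zip(xs, perm)) for perm in permutations(ys))


def is_hit(query: str, stored: str, soft: bool = True) -> bool:
    """Rule 4: hard search needs all sub-strings; soft search may skip a prefix/postfix."""
    q, s = encode(query), encode(stored)
    if same_parts(q, s):
        return True
    if not soft or len(q) == len(s):
        return False
    # Assumption: only the longer name may drop its first and/or last sub-string.
    longer, shorter = (q, s) if len(q) > len(s) else (s, q)
    trimmed = (longer[1:], longer[:-1], longer[1:-1])
    return any(same_parts(t, shorter) for t in trimmed)


print(code_string(encode("Junaid Tariq")))       # J53 T60
print(is_hit("Tariq Junaid", "Junaid Tariq"))    # True
print(is_hit("Junaid Khan", "Junaid Tariq"))     # False
```

Trying every ordering with permutations is acceptable here because personal names have only a few sub-strings; a larger system would likely index the codes instead.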

6 RESULTS
The hit rate (shown in Table 2, Section 5, sub-heading D) clearly shows that the proposed technique overcomes the missing-character and typing-error limitations of phonetic techniques. The proposed technique handles errors caused by a single character, which account for 80% of errors. It also handles the changes made to names in different communities, as it does not check the order of the name parts, but only the presence of sub-strings. The technique also handles the shorthand forms of a name used by a person.

7 CONCLUSION
Phonetic techniques are faster than pattern matching techniques, but they produce false results even if a single character is missed or wrongly typed. The proposed technique uses a phonetic approach, which means that it is fast, but it provides better matching results than existing phonetic techniques. The new technique overcomes the limitations faced by many phonetic techniques. It is also flexible enough to handle the various transformations that a name undergoes over a period of time.

REFERENCES
[1] Peter Christen, A Comparison of Personal Name Matching: Techniques and Practical Issues, ANU, Canberra ACT 0200, Australia.
[2] D. Holmes, Approximate string matching, ACM Computing Surveys, 12(4):381-402.
[3] A. Lait and B. Randell, An Assessment of Name Matching Algorithms, Technical report, University of Newcastle upon Tyne, 1993.
[4] T. Gadd, PHONIX: The Algorithm, Automated Library and Information Systems, 24(4):363-366.
[5] C. L. Borgman and S. L. Siegfried, A Survey of Applications of Personal Name-Matching Algorithms, Journal of American Society for Information Science, 1992.
[6] L. Philips, The Double-Metaphone Search Algorithms, C/C++ Users Journal, 2000.
[7] William W. Cohen, Pradeep Ravikumar, A Comparison of String Distance Metrics for Name-Matching Tasks, American Association for Artificial Intelligence, 2003.
[8] Lawrence Cheung, Maosong Sun, Identification of Chinese Personal Names in Unrestricted Texts, University of Hong Kong.
[9] Benjamin K. Tsou and Oi Yee, Evaluating Chinese-English Translation Systems for Personal Name Coverage.
[10] Kofi Agyekum, The Sociolinguistic of Akan Personal Names, Nordic Journal of African Studies 15(2): 206-235 (2006).
[11] Patrick Reuther, Personal Name Matching: New Test Collections and a Social Network Based Approach, University of Trier, Germany, 16 March 2006.
[12] Article: Family name, URL: http://familypedia.wikia.com/wiki/Family_name, retrieved on 29 May 2010.
[13] F. J. Damerau, A Technique for Computer Detection and Correction of Spelling Errors, Computers and Biomedical Research, 1992.
[14] K. Kukich, Techniques for automatically correcting words in text, ACM, 1992.
[15] Yuzana, Khin Marlar, Sounds Alike Name Matching for Myanmar Language, World Academy of Science, Engineering and Technology, 2008.
[16] Article: Understanding Classic SoundEx Algorithms, URL: http://creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm, retrieved on 30 May 2010.
[17] Article: Soundex, URL: http://en.wikipedia.org/wiki/Soundex, retrieved on 30 May 2010.
