
International Journal on New Computer Architectures and Their Applications (IJNCAA) 1(4): 1011-1017
The Society of Digital Information and Wireless Communications, 2011 (ISSN: 2220-9085)

REMOVING DUPLICATIONS IN EMAIL ADDRESSES IN LINEAR TIME

Eyas El-Qawasmeh
King Saud University, Information Systems Dept.
Riyadh, Saudi Arabia
eyasa@usa.net

Abstract. Currently, many government offices and companies use mailing lists to reach their clients. Any mailing list needs continuous updates, which include removing unsubscribed emails, inserting newly arrived emails, and removing duplications. Duplication can occur when merging two mailing lists into one master mailing list, where both merged lists contain the same email, or when either mailing list contains the same email more than once. Existing algorithms for removing duplications in a mailing list require time complexity greater than linear. Most of them sort the emails in alphabetical order and then remove the duplications, in O(n log n) overall. However, we are able to reduce the time complexity to O(n) using hashing. This saves the time and effort of the senders.

Keywords: Mailing list, linear time, sorting, duplication

1. INTRODUCTION

Currently, mailing lists are a powerful tool that enables one-to-many communication instead of one-to-one communication. The conversion from a one-to-one channel into a one-to-many channel saves the time of the sender, since it allows the sender to communicate with many people at the same time. This feature allows a person to communicate with a group of people instead of on an individual basis (El-Qawasmeh, E., 2011). Mailing lists are very important due to their power to bring people around the world together in a single communication setting (Bettenburg, N. et al., 2009).

Currently, building an email list (we call it Master_List) that targets a certain group of people is one of the most effective methods available for establishing online communications. A good email list provides the user with a line of communication that is convenient and cheap (Hausenblas, M. and Rehatschek, H., 2007). Online marketing depends on it since its cost is low. Governments and different organizations have started using it since they realized its influence. For example, a government that manages a certain event, such as a conference, uses a mailing list that contains the people interested in this conference.

It should be clear that any mailing list is dynamic rather than static. The reason for this is that any mailing list, which contains thousands of emails, has many updates
daily. Some emails are no longer valid, and some owners of these emails might want to unsubscribe. This requires good management of the mailing list. Good management implies that the owner of the mailing list removes all the unsubscribed emails and the emails of closed accounts (detected from non-delivery). On the other hand, the owner needs to update the list by inserting extra emails. The extra emails come mostly from another list, so the owner ends up with two lists that he needs to merge into one single list (we call it Master_List) while removing all the duplicated emails.

Currently, duplication occurs a lot in mailing lists due to the wide use of ready-made software (such as grab programs or email harvest programs) that collects emails from the web automatically. Keeping the duplication in the emails causes inconvenience to the receiver. The inconvenience occurs because the receiver receives the same email many times within a short period of time. This damages the sender's professionalism.

Removing the duplications can be done easily in polynomial time (O(n^2)) or in O(n log n). The O(n log n) bound is achieved by sorting the emails in ascending order and comparing each email with its successor; since the emails are sorted, identical emails are adjacent, and one of each identical pair is removed. However, this paper suggests removing the duplication in linear time (O(n)). This is a worthwhile saving in time complexity.

The organization of this paper is as follows. Section 2 describes current related work. Section 3 describes the proposed algorithm. Section 4 describes the hash function. Section 5 presents the performance results. Section 6 is a discussion, and finally Section 7 gives the conclusions.

2. CURRENT RELATED WORK

The problem that this paper handles can be described as follows: You have two files, where each file contains only email addresses (the file here is the mailing list). These files could be Word, text, Excel, or even some other type. You want to merge these two files into one single file. In addition, you need to remove any possible duplication with linear time complexity. This problem has two versions. In the first version, the email addresses in one or both of the files are sorted alphabetically; in the other version, the email addresses are not sorted. Our proposed algorithm solves the mentioned problem independently of the sorting. In other words, it does not matter whether the email addresses in either file are sorted or not.

This problem includes a basic operation, which is removing duplications in the email addresses. Duplication means that the same email address occurs more than once in the mailing list. This occurs very frequently when the sender gets a list of email addresses and wants to merge them with the mailing list that he has. In other words, the merge of two different mailing lists generates one mailing list (Master_List) that will contain many duplicated emails. Therefore, talking about removing duplication that is generated from merging
two files is similar to removing duplication that exists in one file. As we discussed previously, two lists will be converted into one list. This scenario happens a lot in real life, where the administrator of the Master mailing list keeps getting either new collections of emails stored in a mailing list, which we will call a mini-list, or a mini-list containing the emails of people who unsubscribed, whose emails bounced back, or whose accounts are closed.

To the best of the author's knowledge, no existing work approaches this specific problem with time complexity below O(n log n). The common algorithm is to merge the two lists and then sort in O(n log n). After sorting, a scan iterates through the mailing list, and whenever an email address is identical to its successor, only one of them is kept and the other is deleted. The same problem has been approached in removing duplication from arrays of numeric values. However, the idea used for arrays cannot be adapted here, since the time required for removing duplications is also O(n log n). The next section describes the proposed algorithm.
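The common sort-then-scan approach described above can be sketched in a few lines of Python. This is only an illustration of the O(n log n) baseline, not code from the paper; the function name `dedupe_by_sorting` is hypothetical, and addresses are assumed to match only when their strings are exactly equal.

```python
def dedupe_by_sorting(list1, list2):
    """Merge two mailing lists and remove duplicates by sorting, then scanning."""
    merged = sorted(list1 + list2)  # the O(n log n) step
    master_list = []
    for email in merged:
        # After sorting, identical addresses are adjacent, so it is enough
        # to compare each address with the last one that was kept.
        if not master_list or master_list[-1] != email:
            master_list.append(email)
    return master_list
```

The sort dominates the running time; the scan that follows is a single linear pass.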

3. PROPOSED ALGORITHM

The proposed algorithm assumes that we have two mailing lists stored as text files. If a file is in another format, such as Word, it is first converted to a plain text file. The first mailing list contains a set of emails that might not be sorted. In addition, there is a possibility that one email is written many times. The second list is also a text file, and it also contains a set of emails, some of which might be repeated in the first list. The algorithm starts by creating a new list called Master_List (generated from merging the two lists). The first list is copied as it is to Master_List, followed by the contents of the second list.

The proposed algorithm uses hashing. In hashing, each email address is mapped to a certain location. The type of hashing used is hashing with collisions. Here, a collision means that two emails are identical. In this case, the second email (which is a duplicate) is not saved.

The algorithm is as follows:

Pre-Process operation
Begin
Step 0: Create an array of size equal to quadruple the total size of List1 and List2.
// The reason is that in hashing, it is recommended that the hash table size is at least 3
// times the number of entries.
Step 1: Create a new empty list and call it Master_List.
Step 2: Copy all items in List1 into Master_List (Master_List ← List1).
Step 3: Copy all items in List2 into Master_List starting from the last location that has been reached in Master_List.
End

Process operation
For each email in Master_List do the following
Begin While
Step 4: Get one email address. Hash the email using the function called Hash(email)
// Function Hash is described in the next section.

Step 5: Store the hashed email address in the array, where the index of the array is the generated hash value.
// For example, if Hash(SalesTv@hotmail.com) is 6789, then this number is the index of the array.
// If the email was already written and a duplication exists, then the new entry will overwrite
// the previous email, which is a duplicate.
End While
Step 6: Store the array in a text file.
Step 7: Exit

4. HASH FUNCTION

The proposed algorithm suggests using hashing. Hashing can be with collisions or free from collisions. The whole idea of our algorithm is to do hashing. If two emails are identical and we would like to remove one of them, then both of them are hashed to the same location, but only one of them is saved. This holds even if the same email is repeated tens of times, since all copies hash to the same location.

The proposed hash function maps each email address to a specific number. This number is the index of an array where the email will be stored. If the same email occurs another time, it will overwrite itself in the same location. The proposed algorithm reads each email address, takes only 5 characters from it, and hashes them to a single place. In general, hashing 5 characters requires a hash table whose size is three times the number of entries. Thus, if the two merged lists contain 1,000,000 emails, it is recommended that the hash table contain at least triple this value. If the size of the hash table is greater than this value, the performance will be much better, and the emails will be scattered randomly all over the table, with the possibility of including more than 5 characters in the hash.

It should be pointed out that the size of this table can be expanded if the number of emails is very large. In that case, we will use six characters or more. On the other hand, if the number of emails is small, we might use four characters for hashing instead of five. However, the remaining description in this paper uses five characters to hash the email address.

The hash function reads the email address and detects the @ sign. This sign is considered the pivot character. Relative to the pivot character (@), we take the three characters that directly precede it, then the first character after it as the fourth character, and finally, as the fifth character, the first character that occurs after the dot (only the first dot that comes after the pivot character). Figure 1 demonstrates an example of how the characters will be hashed.

The characters allowed in the hash are only English letters and digits (0, 1, 2, ..., 9), plus the two special symbols - and _. The total number of characters is 26 alphabetic letters plus 10 digits plus two special symbols, which brings the total to 38 symbols. The size of the hash table follows from these 38 symbols and the 5 hashed characters, which brings the total number of entries in the hash table to 6255408.
Allocating a one-dimensional array of this size is currently possible on all machines.
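The selection of the five characters around the @ pivot can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the names `ALPHABET` and `hashed_characters` are hypothetical, addresses are lower-cased before selection (the text does not say how case is handled), and every address is assumed to be well formed, with at least three characters before the @ and a dot after it.

```python
import string

# 26 letters + 10 digits + '-' and '_' = 38 symbols, as listed in the text.
ALPHABET = string.ascii_lowercase + string.digits + "-_"


def hashed_characters(email):
    """Return the five characters that the text selects from an address."""
    # '@' is the pivot; lower-casing is an assumption of this sketch.
    local, _, domain = email.lower().partition("@")
    before = local[-3:]           # the three characters directly before '@'
    after = domain[0]             # the first character after '@'
    dot = domain.index(".")       # only the first dot after the pivot counts
    after_dot = domain[dot + 1]   # the first character after that dot
    return before + after + after_dot
```

For the example address used in the figures, `hashed_characters("eelqawasmeh@ksu.edu.sa")` selects the characters `m`, `e`, `h`, `k`, and `e`.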

[Figure 1: The characters that will be hashed, shown for the example address eelqawasmeh@ksu.edu.sa]

The index of the hash table is computed for any email according to the following formula:

$$\text{Index} = \left(c_1 \cdot 38^5 + c_2 \cdot 38^4 + c_3 \cdot 38^3 + c_4 \cdot 38^2 + c_5 \cdot 38\right) - 1$$

where $c_i$ is the order of the $i$-th hashed letter. The result will be a number in the range from 0 to 6255407. The algorithm for the hash function is written below.

Function Hash (email)
// This function reads any email and converts it to a decimal number ranging from 0 to 6255407.
Begin
Loop
Assign HashValue ← 0.
Read the email until the "@" symbol.
Read only the three characters that directly precede the @ and substitute their decimal values into the formula mentioned previously.
Read the first character after @ and substitute its value into the previous formula.
Read the first character that comes after @ and is preceded by a dot, then substitute its value into the previous formula.
Store the email in HashValue[Index].
End loop
End

The whole process is depicted in Figure 2.

[Figure 2: The process of obtaining the hash value, shown for the example address eelqawasmeh@ksu.edu.sa]

It should be noted that it is not necessary to hash only five characters. It could be either more or fewer. The size of the existing Master mailing list could affect the selection process.
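The index formula and the overwrite-based removal of duplicates can be combined into a short Python sketch. It rests on several assumptions that are not in the paper: the ordering of the 38 symbols (letters first, then digits, then - and _, numbered from 1), the treatment of characters outside that alphabet (they contribute 0), lower-casing of addresses, and the final reduction modulo the table size, which is added here only so that the index always fits the allocated array. The names `five_chars`, `hash_index`, and `remove_duplicates` are hypothetical.

```python
import string

ALPHABET = string.ascii_lowercase + string.digits + "-_"  # 38 symbols
# Assumed ordering: 'a' -> 1, ..., 'z' -> 26, '0' -> 27, ..., '_' -> 38.
ORDER = {ch: i + 1 for i, ch in enumerate(ALPHABET)}


def five_chars(email):
    """Three characters before '@', first after '@', first after the next dot."""
    local, _, domain = email.lower().partition("@")
    return local[-3:] + domain[0] + domain[domain.index(".") + 1]


def hash_index(email, table_size):
    """Evaluate the index formula, then wrap it into the table (assumption)."""
    chars = five_chars(email)
    # c1*38^5 + c2*38^4 + c3*38^3 + c4*38^2 + c5*38, minus 1.
    # Characters outside the 38-symbol alphabet contribute 0 (assumption).
    value = sum(ORDER.get(ch, 0) * 38 ** power
                for ch, power in zip(chars, range(5, 0, -1))) - 1
    return value % table_size


def remove_duplicates(list1, list2):
    """Steps 0 to 5: merge the lists, then store each address at its index."""
    master_list = list1 + list2            # Steps 1 to 3
    table_size = 4 * len(master_list)      # Step 0: quadruple the total size
    table = [None] * table_size
    for email in master_list:              # Steps 4 and 5
        # A repeated address lands on its own slot and overwrites itself.
        table[hash_index(email, table_size)] = email
    return [email for email in table if email is not None]
```

Each address is processed once with a constant amount of work, which is where the linear running time comes from. As in the scheme described in the text, two distinct addresses that map to the same index also overwrite each other, so this sketch can drop an address that is not a true duplicate.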

5. PERFORMANCE RESULTS

It is clear from the algorithm that it takes linear time. Let us assume that the size of the first list is m and the size of the second list is n. Then the time complexity of the proposed algorithm is O(n+m), with a small hidden constant. An experiment was done in Visual Basic to evaluate the proposed algorithm. The experiment used two real files from companies. We adjusted these two files, by deleting some emails, to give the specific numbers of emails listed in Table 1. The execution results were as follows.

Table 1: Execution time for different cases of merge

No. of emails in List 1    No. of emails in List 2    Size of Master_List    Time in seconds
20000                      10000                      30000                  6
40000                      30000                      70000                  14
60000                      50000                      110000                 22
80000                      70000                      150000                 30
100000                     90000                      190000                 40

Below is a graph of the previous results normalized to the first value. Please note that we fixed the following parameters: the number of duplications within List 1 is equal to zero, the number of duplications within List 2 is equal to zero, and the percentage of duplication between both lists is set to 20%.

[Figure 3: The time normalized for merging files]

6. DISCUSSION

The time complexity for this version is linear with the following details:
Step 0: constant time
Step 1: constant time
Step 2: O(m) operations bucket sort
Step 3: O(m+n) at most
Step 4: O(m+n) at most

Therefore, the total time will be O(m+n). The proposed algorithm takes advantage of the fact that a collision means duplication, and hence the email address is saved only once. One might argue that five characters are not enough and that two distinct emails might hash to the same location. This is a valid point. In this case, we will lose one useful email. However, if we consider the probability of this, we find it is very low. For example, let us assume that the average length of any email is 10 characters. Then the probability that two emails are identical is $1/26^{10}$. This probability is almost zero. Even for five characters, the probability is very low. In fact, we trade a small number of lost email addresses for the improved time complexity. This can be solved very easily using linear probing which is

used in hashing to solve the problem of collision, or by expanding the hash table.

7. CONCLUSIONS

People all over the world are using the internet more and more. Businesses and governments use email very heavily due to its convenience. Sending one email message to a group of receivers requires that the sender's mailing list always be up to date. This paper proposed an algorithm that merges two lists into one list in linear time. The proposed algorithm removes duplications. This saves time, and it is more convenient for the receiver.
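The linear-probing remedy raised in the discussion can be sketched as follows. This is an illustration rather than part of the proposed algorithm: the function name `remove_duplicates_with_probing` is hypothetical, and Python's built-in `hash` stands in for the five-character hash function, which is an assumption made to keep the sketch short.

```python
def remove_duplicates_with_probing(list1, list2):
    """Merge two lists and drop duplicates; distinct addresses never overwrite."""
    master_list = list1 + list2
    table_size = max(1, 4 * len(master_list))  # quadruple, as in Step 0
    table = [None] * table_size
    for email in master_list:
        slot = hash(email) % table_size  # stand-in hash (assumption)
        # Probe forward until the same address (a true duplicate) or an
        # empty slot is found. The table is at most one quarter full,
        # so an empty slot always exists and the loop terminates.
        while table[slot] is not None and table[slot] != email:
            slot = (slot + 1) % table_size
        table[slot] = email
    return [email for email in table if email is not None]
```

Because a slot is only reused when it already holds the very same address, no distinct address is lost. With a lightly loaded table the expected number of probes per address stays constant, so the expected total time remains linear in the size of the merged list. The output follows table order rather than input order.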
ACKNOWLEDGEMENT

This work was supported by the Research Center of College of Computer and Information Sciences, King Saud University. The author is grateful for this support.

REFERENCES

1. Bettenburg, N., Shihab, E. and Hassan, A. (2009). An empirical study on the risks of using off-the-shelf techniques for processing mailing list data. 2009 IEEE International Conference on Software Maintenance, pp. 539-542, http://doi.ieeecomputersociety.org/10.1109/ICSM.2009.5306383.
2. El-Qawasmeh, E. (2011). Categorizing Received Email to Improve Delivery. International Journal of Computer Systems Science and Engineering, Australia, Vol. 26, No. 3, pp. 115-121, March 2011.
3. Hausenblas, M. and Rehatschek, H. (2007). mle: Enhancing the Exploration of Mailing List Archives Through Making Semantics Explicit. Semantic Web Challenge 2007, 6th International Semantic Web Conference (ISWC2007), Busan, Korea, November 11-15.
4. Dabbish, L., Kraut, R., Fussell, S. and Kiesler, S. (2005). Understanding Email Use: Predicting Action on a Message. Proceedings of ACM Conference on Human Factors in Computing Systems, Oregon, USA, pp. 691-700.
5. Segal, R. and Kephart, J. (1999). Mailcat: An Intelligent Assistant for Organizing E-mail. Proceedings of the Third International Conference on Autonomous Agents, pp. 276-282.
6. Vel, O., Anderson, A., Corney, M. and Mohay, G. (2001). Mining E-Mail Content for Author Identification Forensics. ACM SIGMOD Record, Vol. 30, No. 4, pp. 55-64.
7. Mara, J. and Hidalgo, G. (2002). Evaluating Cost-Sensitive Unsolicited Bulk Email Categorization. Proceedings of the 2002 ACM Symposium on Applied Computing, Springer-Verlag, pp. 615-620.
8. Ducheneaut, N. and Belloti, V. (2001). Email as Habitat: An Exploration of Embedded Personal Information Management. Interactions, Vol. 8, No. 5, pp. 30-38.
9. Tyler, J. R. and Tang, J. C. (2003). When Can I Expect an Email Response? A Study of Rhythms in Email Usage. Proceedings from ECSCW 03: European Conference on Computer-Supported Cooperative Work, pp. 239-258.
10. Pavlov, O. V., Melville, N. and Plice, R. K. (2008). Toward a Sustainable Email Marketing Infrastructure. Journal of Business Research, Vol. 61, No. 11, pp. 1191-1199.
11. El-Qawasmeh, E., Snasel, V. and Pichappan, P. (2008). Reshaping Email Relationships. Proceedings of the Third International Conference on Digital Information Management (ICDM2008), pp. 304-307, Nov. 13-16.
12. Jolai, F., Asadzadeh, S. and Taghizadeh, M. (2008). Performance Estimation of an Email Contact Center by a Finite Source Discrete Time Geo/Geo/1 Queue with Disasters. Computers and Industrial Engineering, Vol. 55, No. 3, pp. 543-55.