
# IJSTE - International Journal of Science Technology & Engineering | Volume 3 | Issue 11 | May 2017

## Duplicate Detection using Algorithms

Elizabeth Priya Mammen
PG Student
Department of Computer Science & Engineering
Mount Zion College of Engineering, Kadammanitta, Pathanamthitta, Kerala

Sreelakshmi R
Assistant Professor
Department of Computer Science & Engineering
Mount Zion College of Engineering, Kadammanitta, Pathanamthitta, Kerala

Abstract
Data is among an organization's most important assets, and large collections of data often contain duplicates. To find these duplicates, algorithms such as the Sorted Neighborhood Method (SNM) and the Progressive Sorted Neighborhood Method (PSNM) are used. The Duplicate Count Strategy++ (DCS++) algorithm is an efficient algorithm for finding duplicates in data sets; by using DCS++, we can obtain the distinct records.
Keywords: Data, DCS++, PSNM, SNM
________________________________________________________________________________________________________

## I. INTRODUCTION

Progressive duplicate detection is a pay-as-you-go method that identifies most duplicate pairs early in the process. Instead of reducing the overall time needed to finish the entire duplicate detection task, progressive methods try to decrease the average time after which a duplicate is found. The aim of progressive duplicate detection is to recognize most matches early in the process; the algorithm therefore increases the efficiency of the duplicate detection workflow, not the quality of the reported results. Usually, windowing techniques are used to reduce the complexity of the duplicate detection process. Ending a progressive algorithm prematurely can thus be sensible: it finds slightly fewer duplicates in a much shorter time. Hence, progressive approaches increase efficiency particularly for use cases that do not need the whole result. In comparison to traditional duplicate detection, we can define progressive duplicate detection with the following two criteria.
Improved Early Quality - Let t be an arbitrary target time at which results are needed. The output of a progressive algorithm at t is then better than the output of an equivalent traditional algorithm. Typically, t is much smaller than the complete runtime of the traditional algorithm.
Same Eventual Quality - If both a traditional algorithm and its progressive version finish execution without early termination, they produce identical results.

## II. EXISTING SYSTEM

In the existing system, the Sorted Neighborhood Method (SNM) is used to find duplicates in a dataset. SNM has three phases:
1) Key selection: a sorting key is assigned to each record. The key is generated by concatenating the values of two or more attributes.
2) Sorting: all records are sorted according to the key.
3) Windowing: a window slides over the sorted data; within the window, all record pairs are compared and duplicates are marked.
A disadvantage of the sorted neighborhood method is its fixed window size. Some duplicates may be missed when the selected window size is too small; on the other hand, unnecessary comparisons are carried out when the window size is too large. Efficient detection of duplicate records is challenging because database cleaning is a very complicated process.
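The three phases above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the record fields (`last_name`, `zip`), the key construction, and the exact-match similarity test are assumptions made for the example; a real system would use approximate string metrics.

```python
def sorting_key(record):
    """Phase 1: build a key by concatenating two attribute values."""
    return record["last_name"].lower() + record["zip"]

def is_duplicate(a, b):
    """Toy similarity test (exact match on the key attributes)."""
    return a["last_name"].lower() == b["last_name"].lower() and a["zip"] == b["zip"]

def snm(records, window_size=3):
    # Phase 2: sort all records by the key.
    ordered = sorted(records, key=sorting_key)
    duplicates = []
    # Phase 3: slide a fixed-size window over the sorted data and
    # compare all record pairs inside the current window.
    for i in range(len(ordered)):
        for j in range(i + 1, min(i + window_size, len(ordered))):
            if is_duplicate(ordered[i], ordered[j]):
                duplicates.append((ordered[i]["id"], ordered[j]["id"]))
    return duplicates

records = [
    {"id": 1, "last_name": "Smith", "zip": "10001"},
    {"id": 2, "last_name": "smith", "zip": "10001"},
    {"id": 3, "last_name": "Jones", "zip": "20002"},
]
print(snm(records))  # the two Smith records are reported as a pair: [(1, 2)]
```

Note how the fixed `window_size` embodies the disadvantage discussed above: a near-duplicate pair sorted more than `window_size - 1` positions apart would never be compared.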

## III. PROPOSED SYSTEM

The proposed system finds duplicates in large datasets using the Duplicate Count Strategy++ (DCS++) algorithm. DCS++ is an efficient method for finding duplicates in a dataset; by using it, we obtain a distinct dataset. In DCS++, the dataset is sorted by a key, and a window then slides over all the records to find all the duplicates. Unlike SNM, the window size in DCS++ is variable.


Architecture

[Figure: System architecture]

First, the system uploads records from the database. After uploading, the records are sorted using a sorting technique. From the sorted records, duplicate records are detected using the Progressive Sorted Neighborhood Method (PSNM) and the Duplicate Count Strategy++ (DCS++).
PSNM compares the records that lie within a window over the sorted order of data that has already been partitioned. The algorithm applies this procedure iteratively without varying the window size, dynamically changing the execution order of the comparisons based on intermediate results. After completing this procedure, the duplicates found by PSNM can be displayed.
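The progressive comparison order can be sketched as below. This is a simplified illustration of the idea only: partitioning and the dynamic reordering based on intermediate results are omitted, and the helper name `psnm_pairs` is an assumption for the example. The key point is that all pairs at sort-distance 1 are compared before any pair at distance 2, so the most likely duplicates (closest in sort order) are reported earliest.

```python
def psnm_pairs(ordered, window_size):
    """Yield index pairs of sorted records in progressive order:
    all pairs at sort-distance 1 first, then distance 2, and so on,
    up to the (fixed) window size."""
    for d in range(1, window_size):          # current sort distance
        for i in range(len(ordered) - d):    # every pair at that distance
            yield i, i + d

ordered = ["a", "b", "c", "d"]
print(list(psnm_pairs(ordered, 3)))
# distance-1 pairs precede distance-2 pairs:
# [(0, 1), (1, 2), (2, 3), (0, 2), (1, 3)]
```

A traditional windowing algorithm would instead exhaust each record's whole window before moving on, deferring the closest (most promising) comparisons of later records.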
DCS++ compares the records that lie within a window over the sorted order, applying the procedure iteratively with a variable window size. Whenever a duplicate record is identified, the window size is incremented: for each detected duplicate, the records following that duplicate are added to the window. No partitioning is required. Records already detected as duplicates are skipped to save comparisons, and no duplicate records in the dataset are missed. DCS++ is thus an efficient method for finding duplicates in the dataset.
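The adaptive window growth can be sketched as follows. This is an illustrative simplification under stated assumptions: the base window size, the exact-match `same` test, and the record values are all made up for the example, and the full DCS++ details (duplicate-count thresholds, transitive closure) are omitted.

```python
def dcs_plus_plus(ordered, base_window=3, same=lambda a, b: a == b):
    """Adaptive-window duplicate search over sorted records.
    When record i matches record j, the window is extended past j,
    and records already classified as duplicates are skipped."""
    duplicates = set()   # indices already known to be duplicates
    pairs = []
    n = len(ordered)
    for i in range(n):
        if i in duplicates:
            continue                       # skip detected duplicates
        end = min(i + base_window, n)      # initial window for record i
        j = i + 1
        while j < end:
            if same(ordered[i], ordered[j]):
                pairs.append((i, j))
                duplicates.add(j)
                # grow the window: include the records after the match
                end = min(j + base_window, n)
            j += 1
    return pairs

data = ["ann", "ann", "ann", "bob", "carl"]
print(dcs_plus_plus(data))  # [(0, 1), (0, 2)]
```

In the example, the match at index 1 extends the window so the duplicate at index 2 is also found, and indices 1 and 2 are then skipped as window anchors, saving comparisons exactly as described above.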

## IV. CONCLUSION

This project finds the duplicates in a dataset. In the existing system, progressive duplicate detection with the Progressive Sorted Neighborhood Method (PSNM) algorithm is used to find the duplicates in the data set. In this method the dataset is partitioned, and the copies are then checked within each partition. The disadvantage of this technique is that checking is done within a partition only; no checking takes place between partitions, so duplicates may occasionally be missed. This disadvantage of the progressive method is overcome by using an adaptive technique for duplicate detection: the Duplicate Count Strategy++ (DCS++) algorithm is used to find the duplicates in the dataset.
