International Journal of Scientific Engineering and Technology Research, Vol. 06, Issue 22, June 2017, Pages: 4253-4262
www.ijsetr.com
K. MAHESHWARI, PRAVIN THUMUKUNTA
Abstract: A database may contain very large datasets in which many duplicate records are present. Duplicate records arise when data entries are not stored in a uniform manner across the database, so the structural heterogeneity problem must be resolved. Duplicate records are difficult to detect, and their detection takes considerable execution time. This literature survey reviews various techniques used to find duplicate records in a database; however, these techniques have several shortcomings. To address them, progressive algorithms have been proposed that significantly increase the efficiency of finding duplicates when the execution time is limited and that improve the quality of the records.
I. INTRODUCTION
Databases play an important role in today's IT-based economy. Many industries and systems depend on the accuracy of databases to carry out operations. Therefore, the quality of the information stored in the databases can have significant cost implications for a system that relies on information to function and conduct business. In an error-free system with perfectly clean data, the construction of a comprehensive view of the data consists of linking (in relational terms, joining) two or more tables on their key fields. Unfortunately, data often lack a unique, global identifier that would permit such an operation. Furthermore, the data are neither carefully controlled for quality nor defined in a consistent way across different data sources. Thus, data quality is often compromised by many factors, including data entry errors (e.g., "studnet" instead of "student"), missing integrity constraints (e.g., allowing entries such as Employee Age = 567), and multiple conventions for recording information. To make things worse, in independently managed databases not only the values but also the structure, semantics, and underlying assumptions about the data may differ. Data are among the most important assets of a company, but due to data changes and sloppy data entry, errors such as duplicate entries can occur, making data cleansing, and in particular duplicate detection, indispensable. However, the sheer size of today's datasets renders duplicate detection processes expensive. Online retailers, for example, offer huge catalogs comprising a constantly growing set of items from many different suppliers. As independent persons change the product portfolio, duplicates arise. Although there is an obvious need for deduplication, online shops that cannot afford downtime cannot apply traditional deduplication.

A. General
Duplicate detection is the process of identifying multiple representations of the same real-world entities. Today, duplicate detection methods need to process ever larger datasets in ever shorter time: maintaining the quality of a dataset becomes increasingly difficult. We present two novel, progressive duplicate detection algorithms that significantly increase the efficiency of finding duplicates if the execution time is limited: they maximize the gain of the overall process within the time available by reporting most results much earlier than traditional approaches. Comprehensive experiments show that our techniques can double the efficiency over time of traditional duplicate detection and significantly improve upon related work.
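The progressive idea described above can be made concrete with a small example. The following Python sketch is only an illustration under assumed choices (a simple string similarity and a fixed window growth); it is not the authors' exact PSNM implementation, but it shows how sorting once and then comparing neighbours at increasing rank distance lets likely duplicates be reported long before all comparisons have finished.

from difflib import SequenceMatcher

def similarity(a, b):
    # Simple string similarity in [0, 1]; a stand-in for any domain-specific measure.
    return SequenceMatcher(None, a, b).ratio()

def progressive_snm(records, key=lambda r: r, threshold=0.9, max_window=10):
    # Sort once, then compare neighbours at increasing rank distance, so the
    # most promising pairs are checked (and reported) first.
    ordered = sorted(records, key=key)
    for distance in range(1, max_window):          # progressive rounds
        for i in range(len(ordered) - distance):
            a, b = ordered[i], ordered[i + distance]
            if similarity(key(a), key(b)) >= threshold:
                yield a, b                          # report the result immediately

# Example: duplicates surface in the first rounds, long before all
# comparisons are finished.
names = ["John Smith", "Jon Smith", "Jane Doe", "John  Smith", "J. Doe"]
for pair in progressive_snm(names, threshold=0.8):
    print(pair)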
Fig. 2. Effect of partition caching and look-ahead.
1. Low latency: On all datasets, PSNM and PB start reporting first results about 1-2% earlier than SNM and SLORP. This advantage is a result of our progressive Magpie Sort. For the non-progressive algorithms, we use an implementation of the Two-Phase Multi-way Merge Sort (TPMMS), which is a popular approach for external-memory sorting. Although TPMMS is highly efficient, Magpie-Sorting slightly outperforms this approach regarding progressiveness.
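For orientation, the following compact Python sketch illustrates the two phases of TPMMS mentioned above; it works on in-memory lists purely for illustration, whereas a real external sort would spill the sorted runs to temporary files, and the run size is an assumed parameter.

import heapq

def tpmms(values, run_size=1000):
    # Phase 1: produce sorted runs of bounded size.
    runs = [sorted(values[i:i + run_size]) for i in range(0, len(values), run_size)]
    # Phase 2: k-way merge of all runs in one pass.
    return list(heapq.merge(*runs))

print(tpmms([5, 3, 9, 1, 7, 2], run_size=2))   # -> [1, 2, 3, 5, 7, 9]

The relevant contrast for progressiveness is that such a sort delivers nothing until the final merge completes, whereas a progressive sort can hand the first sorted neighbourhoods to the matcher much earlier.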
Fig. 12. The above screen allows an administrator to select a Sub Root Element in order to add a dataset to the specified Sub Root Element.
Fig. 15. This screen is designed for the user to register in the database. After entering the details, the user clicks the Submit button, and the user details are registered and stored in the database. Whenever the user later needs to access the database to search for a particular file, the registered username and password must be entered. If they are valid, the user is allowed to search for a file; otherwise an error is shown.
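As an illustration of the register-and-login check just described, the following Python sketch stores a salted password hash and validates it at login; the in-memory dictionary, function names, and parameters are assumptions for the example only, since the actual system keeps user details in its database.

import hashlib, os

users = {}   # username -> (salt, password digest); stands in for the user table

def register(username, password):
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    users[username] = (salt, digest)

def login(username, password):
    if username not in users:
        return False
    salt, digest = users[username]
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000) == digest

register("alice", "s3cret")
print(login("alice", "s3cret"))   # True: the user may search for files
print(login("alice", "wrong"))    # False: an error is shown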
Fig. 19. The above screens are the first steps of the output filtering. Here we applied the Progressive Sorted Neighborhood Method (PSNM), which allows the engine to search all the files based on the given keyword. Identifying one million duplicates took more than 30 minutes without using PSNM, but calculating the transitive closure on them takes only 1.4 seconds.
Fig. 21. This is the actual output shown to the user. Here we applied the Progressive Blocking (PB) algorithm, which identifies the duplicate records and displays each of them only once in the output. PSNM and PB execute in very limited time compared to normal execution and also identify duplicates in a very short period of time.
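The transitive closure mentioned above is what turns pairwise matches into clusters so that each duplicate is shown only once. A minimal sketch, using the standard union-find technique as an assumed (not necessarily the authors') implementation, looks as follows.

def transitive_closure(pairs):
    # Group pairwise duplicate matches into clusters with union-find,
    # so every record ends up in exactly one reported cluster.
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in pairs:
        union(a, b)

    clusters = {}
    for x in parent:
        clusters.setdefault(find(x), set()).add(x)
    return list(clusters.values())

# Matches (A, B) and (B, C) collapse into one cluster {A, B, C}.
print(transitive_closure([("A", "B"), ("B", "C"), ("D", "E")]))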
V. CONCLUSION
We also presented several new deduplication constructions supporting authorized duplicate check in a hybrid cloud architecture, in which the duplicate-check tokens of files are generated by the private cloud server with private keys. Security analysis demonstrates that our schemes are secure with respect to the insider and outsider attacks specified in the proposed security model. As a proof of concept, we implemented a prototype of our proposed authorized duplicate check scheme and conducted test-bed experiments on our prototype.
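As a rough illustration of the duplicate-check tokens mentioned in the conclusion, the sketch below derives a keyed token from a file's hash; the HMAC construction, key name, and upload flow are assumptions for this example and are not claimed to be the exact scheme of the proposed system.

import hashlib, hmac

PRIVATE_KEY = b"private-cloud-server-key"   # held only by the private cloud server

def duplicate_check_token(file_bytes):
    # Identical files yield identical tokens, so the storage side can detect
    # duplicates without ever seeing the file contents.
    file_hash = hashlib.sha256(file_bytes).digest()
    return hmac.new(PRIVATE_KEY, file_hash, hashlib.sha256).hexdigest()

stored_tokens = set()

def upload(file_bytes):
    token = duplicate_check_token(file_bytes)
    if token in stored_tokens:
        return "duplicate - not stored again"
    stored_tokens.add(token)
    return "stored"

print(upload(b"report v1"))   # stored
print(upload(b"report v1"))   # duplicate - not stored again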
VI. FUTURE ENHANCEMENTS
Although the above solution already prevents file redundancy, a future enhancement is to defend against brute-force attacks introduced and launched through the public cloud server, making the scheme more powerful and secure while still not allowing files to be duplicated. At present, attackers use dictionaries and software programs that can test hundreds of thousands of password combinations per second and could crack passwords within minutes; such brute-force attacks typically begin over Secure Shell (SSH). By not allowing a user's password to be probed more than a fixed number of times, the enhancement would prevent the file keys from being taken, would not allow duplicate keys to open a file, and would thus preserve the protection against file redundancy.
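One way to realize the suggested limit on password probing is a simple attempt counter with a lockout window; the sketch below is only an assumed illustration, with the limit and window chosen arbitrarily.

import time

MAX_ATTEMPTS = 5          # assumed policy values, not taken from the paper
LOCKOUT_SECONDS = 300

failed = {}               # username -> (failure count, time of first recent failure)

def allow_attempt(username):
    # Refuse further password checks once a user exceeds the attempt limit.
    count, first = failed.get(username, (0, 0.0))
    if count >= MAX_ATTEMPTS and time.time() - first < LOCKOUT_SECONDS:
        return False      # locked out: brute-force guessing is throttled
    return True

def record_failure(username):
    count, first = failed.get(username, (0, 0.0))
    if count == 0 or time.time() - first >= LOCKOUT_SECONDS:
        failed[username] = (1, time.time())
    else:
        failed[username] = (count + 1, first)

# A login handler would call allow_attempt() before checking the password
# and record_failure() after every wrong guess.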