
ISSN 2319-8885

Vol.06, Issue.22,
June-2017,
Pages: 4253-4262
www.ijsetr.com

Duplicate Detection of Record Linkage in Real-World Data


K. MAHESHWARI¹, PRAVIN THUMUKUNTA²
¹PG Scholar, Dept of CSE, Methodist College of Engineering and Technology, Affiliated to Osmania University, Hyderabad, Telangana, India.
²Assistant Professor, Dept of CSE, Methodist College of Engineering and Technology, Affiliated to Osmania University, Hyderabad, Telangana, India.

Abstract: A database can contain very large datasets in which many duplicate records are present. Duplicate records can occur even when data entries are stored in a uniform manner in the database, i.e., after the structural heterogeneity problem has been resolved. Detecting these duplicate records is difficult and takes considerable execution time. The surveyed literature describes various techniques for finding duplicate records in a database, but these techniques have shortcomings. To address them, a progressive algorithm is proposed that significantly increases the efficiency of finding duplicates when the execution time is limited and improves the quality of the records.

Keywords: Duplicate Detection, Entity Resolution, Pay-As-You-Go, Progressiveness, Data Cleaning.

I. INTRODUCTION
Databases play an important role in today's IT-based economy. Many industries and systems depend on the accuracy of databases to carry out operations. Therefore, the quality of the information stored in the databases can have significant cost implications for a system that relies on information to function and conduct business. In an error-free system with perfectly clean data, the construction of a comprehensive view of the data consists of linking -- in relational terms, joining -- two or more tables on their key fields. Unfortunately, data often lack a unique, global identifier that would permit such an operation. Furthermore, the data are neither carefully controlled for quality nor defined in a consistent way across different data sources. Thus, data quality is often compromised by many factors, including data entry errors (e.g., a misspelling such as "studet" instead of "student"), missing integrity constraints (e.g., allowing entries such as Employee Age = 567), and multiple conventions for recording information. To make things worse, in independently managed databases not only the values, but also the structure, semantics, and underlying assumptions about the data may differ. Data are among the most important assets of a company. But due to data changes and sloppy data entry, errors such as duplicate entries might occur, making data cleansing and in particular duplicate detection indispensable. However, the sheer size of today's datasets renders duplicate detection processes expensive. Online retailers, for example, offer huge catalogs comprising a constantly growing set of items from many different suppliers. As independent persons change the product portfolio, duplicates arise. Although there is an obvious need for deduplication, online shops without downtime cannot afford traditional deduplication.

A. General
Duplicate detection is the process of identifying multiple representations of the same real-world entities. Today, duplicate detection methods need to process ever larger datasets in ever shorter time: maintaining the quality of a dataset becomes increasingly difficult. We present two novel, progressive duplicate detection algorithms that significantly increase the efficiency of finding duplicates if the execution time is limited: they maximize the gain of the overall process within the time available by reporting most results much earlier than traditional approaches. Comprehensive experiments show that our techniques can double the efficiency over time of traditional duplicate detection and significantly improve upon related work.



B. Objective
The main aim of the project is to identify most duplicate pairs early in the detection process. Instead of reducing the overall time needed to finish the entire process, progressive approaches try to reduce the average time after which a duplicate is found. Early termination, in particular, then yields more complete results with a progressive algorithm than with any traditional approach.

II. DATA PREPARATION
Duplicate record detection is the process of identifying different or multiple records that refer to one unique real-world entity or object. The data preparation stage is the first step in the process of duplicate record detection, during which data entries are stored in a uniform manner in the database, resolving the structural heterogeneity problem. The data preparation stage includes a parsing, a data transformation, and a standardization step. These steps improve the quality of the in-flow data and make the data comparable and more usable.

Fig 1. Steps in data preparation.

Parsing is the first component in data preparation; it locates, identifies, and isolates individual data elements in the source files. Parsing makes it easier to correct, standardize, and match data because it allows the comparison of individual components rather than of long, complex strings of data. E.g., the appropriate parsing of name and address components into consistent packets of information is a crucial part of the data cleaning process. Data transformation refers to simple conversions that can be applied to the data in order for them to conform to the data types of their corresponding domains. This type of conversion focuses mainly on one field at a time, without any consideration of the values in the related fields.

Examples of data transformation:
1. Conversion of a data element from one data type to another.
2. Renaming of a field from one name to another.
3. Range checking, which involves examination of data in a field to ensure that it falls within the expected range, usually a numeric or date range.

Data standardization refers to the process of standardizing the information represented in certain fields to a specific content format. This is used for information that can be stored in many ways in various data sources and must be converted to a uniform representation before the duplicate detection process starts. Without standardization, many duplicate entries could be erroneously designated as non-duplicates based on the fact that common identifying information cannot be compared. One of the most common standardization applications involves address information.
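To make these three steps concrete, here is a minimal Python sketch that parses a raw record string into fields, applies simple type conversions, and standardizes the values. The field layout, the abbreviation table, and the date format are illustrative assumptions, not part of the described system.

```python
import re
from datetime import datetime

def parse(raw):
    # Parsing: isolate individual data elements from one raw string,
    # assuming a "name; street; city; age; date" layout for illustration.
    name, street, city, age, signup = [p.strip() for p in raw.split(";")]
    return {"name": name, "street": street, "city": city,
            "age": age, "signup": signup}

def transform(record):
    # Data transformation: convert values to the data types of their domains.
    record["age"] = int(record["age"])
    record["signup"] = datetime.strptime(record["signup"], "%d/%m/%Y").date()
    return record

ABBREVIATIONS = {"st.": "street", "rd.": "road", "ave.": "avenue"}

def standardize(record):
    # Standardization: bring free-form fields into one content format
    # so that equivalent values become directly comparable.
    record["name"] = " ".join(record["name"].lower().split())
    street = record["street"].lower()
    for short, full in ABBREVIATIONS.items():
        street = street.replace(short, full)
    record["street"] = re.sub(r"\s+", " ", street)
    record["city"] = record["city"].strip().title()
    return record

raw = "  John  Smith ; 12 Baker St. ; hyderabad ; 34 ; 05/06/2017"
print(standardize(transform(parse(raw))))
```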
After the data preparation phase, the data are typically stored in tables having comparable fields. The next step is to identify which fields should be compared. E.g., it would not be meaningful to compare the contents of the field Last Name with the field Address. Even after parsing, data standardization, and identification of similar fields, it is not trivial to match duplicate records. Misspellings and different conventions for recording the same information still result in different, multiple representations of a unique object in the database.

A. Existing System
Much research on duplicate detection, also known as entity resolution and by many other names, focuses on pair selection algorithms that try to maximize recall on the one hand and efficiency on the other hand. Xiao et al. proposed a top-k similarity join that uses a special index structure to estimate promising comparison candidates. This approach progressively resolves duplicates and also eases the parameterization problem.
Drawbacks in Existing System:
1. A user has only limited, maybe unknown, time for data cleansing and wants to make the best possible use of it. Then, one can simply start the algorithm and terminate it when needed; the result size will be maximized.
2. A user has little knowledge about the given data but still needs to configure the cleansing process.
3. A user needs to do the cleaning interactively to, for instance, find good sorting keys by trial and error. Then, the progressive algorithm can be run repeatedly; each run quickly reports possibly large results.
4. All presented hints produce static orders for the comparisons and miss the opportunity to dynamically adjust the comparison order at runtime based on intermediate results.

C. Proposed System
In this work, however, we focus on techniques which try to report most matches early on, while possibly slightly increasing their overall runtime. To achieve this, they need to estimate the similarity of all comparison candidates in order to compare the most promising record pairs first. We propose two novel, progressive duplicate detection algorithms, namely the progressive sorted neighborhood method (PSNM), which performs best on small and almost clean datasets, and progressive blocking (PB), which performs best on large and very dirty datasets. Both enhance the efficiency of duplicate detection even on very large datasets. We introduce a concurrent progressive approach for the multi-pass method for the progressive duplicate detection workflow.
Advantages in Proposed System:
1. Improved early quality.
2. Same eventual quality.
3. Our algorithms PSNM and PB dynamically adjust their behavior by automatically choosing optimal parameters, e.g., window sizes, block sizes, and sorting keys, rendering their manual specification superfluous. In this way, we significantly ease the parameterization complexity for duplicate detection in general and contribute to the development of more user-interactive applications.
4. If the file is ready to open, it first needs to run in SSP to prove the file ownership. If the proof is passed, the user will get the key token to open the file.
5. There is no chance of getting a duplicate key for the same or another user to open the file, because the key is stored in the private cloud using the hybrid cloud.

III. EVALUATION
In the previous sections, we presented two progressive duplicate detection algorithms, namely PSNM and PB, and their Attribute Concurrency techniques. In this section, we first generally evaluate the performance of our approaches and compare them to the traditional Sorted Neighborhood Method (SNM) and the Sorted List of Record Pairs (SLORP) presented in [2]. Then, we test our algorithms using a much larger dataset and a concrete use case. The graphs used for performance measurements plot the total number of reported duplicates over time. Each duplicate is a positively matched record pair. For better readability, we manually marked some data points from the many hundred measured data points that make up a graph.

A. System Techniques
1. Progressive sorted neighborhood method (PSNM)
2. Progressive blocking (PB)
3. Subset selection algorithm (SSA)
4. Time complexity (TC)

B. Experimental setup
To evaluate the performance of our algorithms, we chose three real-world datasets with different characteristics (see Table 1). Since only the CD-dataset comes with its own true gold standard, we computed duplicates in the DBLP- and CSX-dataset by running an exhaustive duplicate detection process using our fixed and reasonable (but for our evaluation irrelevant) similarity measure.

Table 1. Real-world datasets and their characteristics

The CD-dataset contains various records about music and audio CDs. The DBLP-dataset is a bibliographic index on computer science journals and proceedings. In contrast to the other two datasets, DBLP includes many large clusters of similar article representations. The CSX-dataset contains bibliographic data used by the CiteSeerX search engine for scientific digital literature. CSX also stores the full abstracts of all its publications in text format. These abstracts are the largest attributes in our experiments. Our work focuses on increasing efficiency while keeping the same effectiveness. Hence, we assume a given, correct similarity measure; it is treated as an exchangeable black box. For our experiments, however, we use the Damerau-Levenshtein similarity. This similarity measure achieved an actual precision of 93% on the CD-dataset, for which we have a true gold standard. The first part of our evaluation is executed on a DELL Optiplex 755 comprising an Intel Core 2 Duo E8400 3 GHz and 4 GB RAM. We use Ubuntu 12.04 32 bit as operating system and Java 1.6 as runtime environment. The evaluation of Sec. 4.6 uses a different machine, explained there.
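Since the experiments treat the similarity measure as an exchangeable black box, the following minimal Python sketch shows one way such a Damerau-Levenshtein similarity could look. It implements the restricted (optimal string alignment) variant with a simple length normalization; the normalization and the example records are illustrative assumptions, not the exact configuration used in the evaluation.

```python
def osa_distance(a, b):
    # Restricted Damerau-Levenshtein (optimal string alignment) distance:
    # edits are insertions, deletions, substitutions, and adjacent transpositions.
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def similarity(a, b):
    # Normalize the edit distance to a similarity in [0, 1].
    if not a and not b:
        return 1.0
    return 1.0 - osa_distance(a, b) / max(len(a), len(b))

# A likely duplicate pair (transposed characters) and a clear non-duplicate.
print(similarity("The Dark Side of the Moon", "The Drak Side of teh Moon"))
print(similarity("Abbey Road", "Led Zeppelin IV"))
```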
1. Memory limitation: We assume that many real-world datasets are considerably larger than the amount of available main memory, e.g., in our use case described in Sec. 4.6. Therefore, we limit the main memory of our machine to 1 GB so that the DBLP- and CSX-dataset do not fit into main memory entirely. 1 GB of memory corresponds to about 100,000 records that can be loaded at once. The artificial limitation actually degrades the performance of our algorithms more than the performance of the non-progressive baseline, because progressive algorithms need to access partitions several times. As our experiments show, using more memory significantly increases the progressiveness of both PSNM and PB. Sec. 4.6 further shows that all results on 1 GB main memory can be extrapolated to larger datasets being processed using more main memory.

2. Quality measure: To evaluate the progressiveness of our algorithms, we use the quality measure proposed earlier. For the weighting function, we generally choose ω(t) = max(1 − t/T, 0), with T the overall runtime, i.e., the area under the curve of the corresponding result graph. In this way, the calculated quality values are visually easy to understand.
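Under the assumption stated above that a duplicate found at time t contributes ω(t) = max(1 − t/T, 0) and that the sum is normalized by the total number of duplicates, the progressiveness quality can be sketched as follows; the two runs in the example are invented for illustration.

```python
def progressiveness(detection_times, total_duplicates, T):
    """Quality of a progressive run: each duplicate found at time t (seconds)
    contributes w(t) = max(1 - t/T, 0); normalizing by the total number of
    duplicates yields the area under the duplicates-over-time curve in [0, 1]."""
    weight = lambda t: max(1.0 - t / T, 0.0)
    return sum(weight(t) for t in detection_times) / total_duplicates

# Two hypothetical runs over the same 1000-duplicate dataset and a 600 s budget:
# the progressive run reports most duplicates early, the baseline reports late.
progressive = [i * 0.3 for i in range(1000)]          # spread over the first 300 s
baseline    = [450 + i * 0.15 for i in range(1000)]   # spread over 450-600 s
print(progressiveness(progressive, 1000, 600))  # ~0.75
print(progressiveness(baseline, 1000, 600))     # ~0.13
```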
3. Baseline approach: The baseline algorithm, which we use in our tests, is the standard Sorted Neighborhood Method (SNM). This algorithm has been implemented similarly to the PSNM algorithm so that it may use load-compare parallelism as well. In our experiments, we always execute SNM and PSNM with the same parameters and optimizations to compare them in a fair way.

C. Optimizations in PSNM
Before we compare our PSNM algorithm to the PB algorithm and existing approaches, we separately evaluate PSNM's different progressive optimizations. We use a window size of 20 in all these experiments.

1. Window Interval: The window interval parameter I is a trade-off parameter: small values close to 1 favor progressiveness at any price, while large values close to the window size optimize for a short overall runtime. In all our experiments, I = 1 performs best, achieving, for instance, 67% progressiveness on the DBLP-dataset. On the same dataset, the performance reduces to 65% for I = 2, to 62% for I = 4, and to 48% for I = 10. Hence, we suggest setting I = 1 if early termination can be used.
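A minimal Python sketch of the progressive sorted-neighborhood idea discussed here: sort the records by a key and then compare pairs by increasing rank distance, so that the most promising comparisons are executed and reported first. It deliberately omits the partitioning, partition caching, and look-ahead of the full PSNM, and the record set and duplicate predicate are illustrative.

```python
def progressive_sorted_neighborhood(records, key, is_duplicate, max_window=20):
    """Sketch of the progressive sorted-neighborhood idea: after sorting by a
    key, emit candidate pairs by increasing rank distance (1, 2, ..., W-1),
    so the most promising comparisons are executed and reported first."""
    ordered = sorted(records, key=key)
    for distance in range(1, max_window):          # window interval I = 1
        for i in range(len(ordered) - distance):
            a, b = ordered[i], ordered[i + distance]
            if is_duplicate(a, b):
                yield a, b                          # report duplicates as found

# Usage with a toy record set and an illustrative duplicate predicate.
records = [{"title": "Data Cleaning"}, {"title": "Data Cleanin"},
           {"title": "Entity Resolution"}, {"title": "Data Cleaning!"}]
same = lambda a, b: a["title"].strip("!").lower() == b["title"].strip("!").lower()
for pair in progressive_sorted_neighborhood(records, key=lambda r: r["title"],
                                            is_duplicate=same):
    print(pair)
```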
2. Partition Caching: Although eventually PSNM executes the same comparisons as the traditional SNM approach, the algorithm takes longer to finish. The reason for this observation is the increased number of highly expensive load processes. To reduce their complexity, PSNM implements partition caching. We now evaluate the traditional SNM algorithm, a PSNM algorithm without partition caching, and a PSNM algorithm with partition caching on the DBLP-dataset. The results of this experiment are shown in the left graph of Fig. 2. The experiment shows that the benefit of partition caching is significant: the runtime of PSNM decreases by 42%, minimizing the runtime difference between PSNM and SNM to only 2%.

3. Look-Ahead: To optimize the selection of comparison candidates, PSNM's look-ahead strategy dynamically executes comparisons around recently identified duplicates. In the following experiment, we evaluate the gain of this optimization. As in the previous experiment, we compare the look-ahead optimized PSNM to the non-optimized PSNM on the DBLP-dataset. As the results in the right graph of Fig. 2 show, the look-ahead strategy clearly improves the progressiveness of the PSNM algorithm: the measured quality increases from 37% to 64%. This is a quality gain of 42%. On the CSX-dataset, however, the performance increases by only 7%, from 70% to 75%. The reason is that the benefit of the look-ahead optimization greatly depends on the number and the size of duplicate clusters contained within a dataset. The CSX-dataset contains only few large clusters of similar records and, therefore, exhibits a very homogeneous distribution of duplicates, which is why the look-ahead strategy achieves only a small gain in progressiveness on that dataset.

Fig 2. Effect of partition caching and look-ahead.

Fig 3. Evaluation of the load-compare parallelism.

4. Load-Compare Parallelism: By parallelizing the load phase and the compare phase, the load time for partitions should ideally no longer affect the performance. The following experiments evaluate this assumption for our PSNM. Since the load-compare parallelism also improves the traditional SNM, the experiment runs SNM with and without parallelization as well. Fig. 3 illustrates the results of the experiment. On the DBLP-dataset, load-compare parallelism performs almost perfectly: the entire load time is hidden by the compare time so that the optimized PSNM algorithm and the optimized SNM algorithm finish nearly simultaneously. This is due to the fact that the latency-hiding effect reduced the runtime of the PSNM algorithm by 43% but the runtime of the SNM algorithm by only 5%. On the larger CSX-dataset, however, the load-compare parallelism strategy reduces the runtime of the SNM algorithm by 11% and the runtime of the PSNM algorithm by 25%. This is a remarkable gain, but since the load phases are much longer than the compare phases on this dataset, the optimization cannot hide the full data access latency: the CSX-dataset contains many enormously large attribute values that increase the load time a lot. Although the load-compare parallelism improves the PSNM algorithm, all further experiments do not use this optimization; the comparisons would become unfair using parallelization for some algorithms and no parallelization for some other algorithms, in particular those of [2].
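The load-compare parallelism described above can be sketched as a producer-consumer pattern in which a background thread loads the next partition while the main thread compares the current one. The partition loader, the comparison function, and the one-slot prefetch buffer below are illustrative placeholders, not the paper's implementation.

```python
import threading, queue, time

def load_partitions(partition_ids, load, buffer):
    # Producer: the loader thread reads partitions from disk while the
    # main thread is still comparing the previously loaded partition.
    for pid in partition_ids:
        buffer.put(load(pid))   # blocks when the buffer is full
    buffer.put(None)            # sentinel: no more partitions

def run_with_load_compare_parallelism(partition_ids, load, compare):
    buffer = queue.Queue(maxsize=1)   # one partition is prefetched ahead
    loader = threading.Thread(target=load_partitions,
                              args=(partition_ids, load, buffer))
    loader.start()
    while (partition := buffer.get()) is not None:
        compare(partition)            # compare while the next load runs
    loader.join()

# Illustrative stand-ins for real partition loading and window comparisons.
fake_load = lambda pid: (time.sleep(0.1), list(range(pid * 5, pid * 5 + 5)))[1]
fake_compare = lambda part: time.sleep(0.1)
run_with_load_compare_parallelism([0, 1, 2, 3], fake_load, fake_compare)
```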

D. Comparison to related work
In the following experiment, we evaluate our algorithms PSNM and PB on all four datasets. We use the traditional, non-progressive SNM algorithm as baseline to measure the real benefit of PSNM and PB. Furthermore, the experiment includes an implementation of the Sorted List of Record Pairs (SLORP) hint [2], which we consider to be the best progressive duplicate detection algorithm in related work. For fairness, SLORP also uses partition caching, because text files had not been considered as an input format in that work. The experiment uses a maximum window size of 20 for PSNM, SNM, and SLORP. Accordingly, we set both PB's block size and PB's block range to 5. So, the PB algorithm executes 11% fewer comparisons on each dataset than the three other approaches. The results of the experiment are depicted in Fig. 4.

1. Low latency: On all datasets, PSNM and PB start reporting first results about 1-2% earlier than SNM and SLORP. This advantage is a result of our progressive Magpie Sort. For the non-progressive algorithms, we use an implementation of the Two-Phase Multi-way Merge Sort (TPMMS), which is a popular approach for external memory sorting. Although TPMMS is highly efficient, Magpie-Sorting slightly outperforms this approach regarding progressiveness.

2. PSNM: In all three test runs, PSNM achieves the best performance, approximately doubling the progressiveness of the SNM baseline algorithm. PSNM also significantly outperforms the SLORP algorithm. In our experiment, PSNM exhibits a 6% (CSX) to 29% (DBLP) higher progressiveness than SLORP.

3. PB: The PB algorithm is the second best algorithm in this experiment. As the progressiveness of this algorithm highly benefits from more and larger duplicate clusters, it shows its best performance on the DBLP-dataset. In general, PB reports first duplicates in the starting phase clearly more slowly than PSNM, because running a window of size 1 is initially more efficient than running the first block comparisons. In the following phases, however, PB resolves duplicate clusters extremely fast. Overall, PSNM is still 3% more progressive than PB on the DBLP-dataset. Thereby, we need to consider that PB executes 11% fewer comparisons than PSNM and, therefore, finds 4% fewer duplicates. Hence, PB actually competes well with PSNM on skewed datasets but loses on uniformly distributed duplicates in single-pass settings.

Fig 4. Performance comparison of the traditional SNM and the progressive PSNM and PB algorithms.

4. I/O-Overhead: For a given dataset, the tasks of sorting, candidate generation, and record comparison all have the same runtime in both progressive and non-progressive algorithms. However, the progressive algorithms require more I/O operations if the data does not fit into main memory. This causes their overall runtimes to increase, which then reduces their progressivity. Fig. 6 shows these runtime differences especially for the large CSX-dataset. If the data fits into main memory, e.g., for the CD-dataset, this effect cannot be observed.

5. Pairs Quality: To show how precisely comparison candidates are chosen, we evaluated the pairs quality PQ of PSNM, PB, and SNM over time. The PQ of a duplicate detection algorithm at time t is the number of identified duplicates at t divided by the number of comparisons that were executed to find these duplicates. So the perfect duplicate detection algorithm, comparing only those record pairs that in fact are duplicates, yields PQ = 1. Fig. 7 depicts the PQ-value curves for the CSX-dataset (left chart). As the curves show, the two progressive approaches choose their comparison candidates much more carefully: the PSNM algorithm detects a new duplicate with every 12th and PB with every 20th comparison in the first few minutes. The baseline approach, in contrast, reports less than one duplicate in 100 comparisons. In the end, all algorithms have executed (almost) the same comparisons, so that their PQ curves converge to the same value.
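A tiny sketch of this pairs-quality measure, computed incrementally from a chronological log of comparisons; the event format and the example run are illustrative assumptions.

```python
def pairs_quality(events):
    """PQ at each point in time: identified duplicates so far divided by
    comparisons executed so far. `events` is a chronological list of
    (timestamp, is_duplicate) entries, one per executed comparison."""
    comparisons, duplicates, curve = 0, 0, []
    for timestamp, is_duplicate in events:
        comparisons += 1
        duplicates += int(is_duplicate)
        curve.append((timestamp, duplicates / comparisons))
    return curve

# A run that finds a duplicate on every 2nd comparison early on, then none:
events = [(t, t % 2 == 0) for t in range(10)] + [(t, False) for t in range(10, 20)]
print(pairs_quality(events)[-1])   # PQ after 20 comparisons: 5/20 = 0.25
```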
6. Precision and Recall: The proposed progressive algorithms enhance the efficiency and usability of duplicate detection processes, but do not change their effectiveness. Of course, the similarity function used to determine duplicates must match the characteristics of the used sorting key(s). But both similarity function and keys are irrelevant for the progressiveness of our algorithms. In other words: if the similarity function is poor, we obtain the same poor results from progressive and non-progressive algorithms. To illustrate this behavior, we evaluated the change in precision and recall on the CD-dataset, which is the only dataset for which a true gold standard is given. As the right chart in Fig. 5 shows, the recall curves correspond to the previous duplicate curves. The precision curves, on the other hand, give the following insights: First, the final precision of 93% is relatively high, which underlines the suitability of the used similarity function. Second, both SNM and PSNM have very similar values in precision, which verifies the irrelevance of the similarity measure for progressiveness. Third, the progressive algorithms find few false positive matches in relation to true positive matches in the beginning, as the precision graphs show.

Fig 5. Evaluation of pairs quality PQ (left) and precision and recall (right).

E. Attribute Concurrency
Our Attribute Concurrency algorithms AC-PSNM and AC-PB progressively execute the multi-pass method for the PSNM algorithm and the PB algorithm, respectively, favoring good keys over poor keys by dynamically ranking the different passes using their intermediate results. In the following, we compare AC-PSNM and AC-PB to the common multi-pass execution model, which resolves the different keys sequentially in random order. The experiment uses three different keys, which are {Title}, {Authors}, and {Description}. Since a common multi-pass algorithm can execute the different passes in any order, it might accidentally choose the best or worst order of keys.

Fig 6. Attribute Concurrency on the DBLP-dataset.

Therefore, we run the traditional, sequential multi-pass algorithm with the optimal key Sequence 1, two mediocre key Sequences 2 and 3, and the worst key Sequence 4. The corresponding graphs are depicted in Fig. 6. The fifth graph in both charts shows the AC-strategy for the respective algorithm. First of all, both charts show that the AC-approaches need about 10% more time to finish. This is because the ranking of intermediate results and the scheduling of different keys takes some additional time. Moreover, both approaches need to store all orders simultaneously in main memory, which decreases the size of their partitions. We first evaluate the results for the AC-PSNM algorithm. With a progressiveness of 79%, Sequence 1 is the best approach. Our AC-PSNM algorithm, then, delivers the second best result with 76%, followed by all other results. Thereby, the worst sequence achieves a progressive quality of only 59%. Due to the overhead of creating all orders and lots of initial block pairs, the PB approach loses much time early on. But after 18 minutes of runtime, the attribute-concurrent PB algorithm outperforms all other multi-pass approaches, because it has finished the initial runs and can now simultaneously use the benefits of all orders. Therefore, its overall progressiveness of 90% is almost as good as the progressiveness of the best sequence, which is 91%. The worst sequence of sorting keys, in contrast, achieves only 62% progressive performance, which is considerably less than the best two approaches. In summary, both attribute-concurrent approaches offer a good progressive quality. Although they might not find the most progressive multi-pass configurations, they always produce reliable execution orders for the different passes. We also see that PB outperforms PSNM in multi-pass settings. Finally, it is worth noting that, due to the dynamically generated execution orders, only little expert knowledge is needed for creating good sorting or blocking keys.
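The attribute-concurrency idea — executing several single-key passes and dynamically favoring the keys that currently yield the most duplicates — can be sketched as a simple scheduler. The ranking rule (duplicates per executed comparison) and all names below are illustrative and do not reproduce the exact AC-PSNM/AC-PB logic.

```python
def attribute_concurrent(passes, budget):
    """Interleave several single-key passes and always advance the pass with
    the best intermediate yield (duplicates found per comparison executed).
    Each pass is an iterator that yields True for a duplicate, False otherwise."""
    stats = {key: [0, 1] for key in passes}    # key -> [duplicates, comparisons]
    found_per_key = {key: 0 for key in passes}
    for _ in range(budget):
        if not passes:
            break
        # Rank the keys by their current duplicate rate and pick the best one.
        best = max(passes, key=lambda k: stats[k][0] / stats[k][1])
        try:
            is_duplicate = next(passes[best])
        except StopIteration:
            del passes[best]
            continue
        stats[best][1] += 1
        stats[best][0] += int(is_duplicate)
        found_per_key[best] += int(is_duplicate)
    return found_per_key

# Illustrative passes: the {Title} key yields duplicates far more often.
title_pass = iter([True, True, False, True] * 50)
author_pass = iter([False, False, True, False] * 50)
print(attribute_concurrent({"Title": title_pass, "Author": author_pass}, budget=100))
```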
Fig 7. The incremental transitive closure overhead.

F. Incremental Transitive Closure
In this experiment, we evaluate the computational overhead caused by the incremental calculation of the transitive closure. We take a result set of one million duplicates (a subset of the duplicates found in the use case of Sec. 4.6), submit it to the transitive closure algorithm, and measure the time after each insert. Fig. 7 plots the resulting curve. The left chart shows that the proposed sorted-lists-of-duplicates data structure does not scale well with the result set's size. However, the incremental transitive closure algorithm by Wallace and Kollias scales linearly with the number of identified duplicates if we use an index structure on the identified duplicates. The measurements further show that the overhead of calculating the transitive closure is negligible: identifying one million duplicates took more than 30 minutes, but calculating the transitive closure on them takes only 1.4 seconds.
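The paper relies on the incremental transitive closure algorithm by Wallace and Kollias [17]. As a hedged illustration of the same requirement — merging duplicate pairs into clusters as they arrive — the sketch below uses a union-find structure instead, which also gives near-constant time per inserted pair; it is not the authors' data structure.

```python
class DuplicateClusters:
    """Incrementally maintain the transitive closure of duplicate pairs with a
    union-find structure: if (a, b) and (b, c) are reported as duplicates, then
    a, b, and c end up in one cluster."""
    def __init__(self):
        self.parent = {}

    def _find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]   # path halving
            x = self.parent[x]
        return x

    def add_duplicate(self, a, b):
        # Called once per reported duplicate pair; near-constant time per insert.
        ra, rb = self._find(a), self._find(b)
        if ra != rb:
            self.parent[rb] = ra

    def cluster_of(self, x):
        return self._find(x)

clusters = DuplicateClusters()
for pair in [("r1", "r2"), ("r2", "r3"), ("r7", "r8")]:
    clusters.add_duplicate(*pair)
print(clusters.cluster_of("r3") == clusters.cluster_of("r1"))   # True
print(clusters.cluster_of("r7") == clusters.cluster_of("r1"))   # False
```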
G. Examining a concrete use case
Progressive duplicate detection is an efficient and convenient solution for many data cleansing use cases. In cooperation with plista (www.plista.com), a company offering target-oriented online advertisement, we used our progressive algorithms to detect personas in web server log data. A persona is a user with a certain interest area. Hence, the same user is and should be reflected by different personas if her interests differ. Compared to the number of entity duplicates in traditional data cleansing tasks, we expect many more persona duplicates in this dataset. To arrange target-oriented advertisements, plista collects anonymized web log data for visitors of their customers' web pages. The huge amount of constantly growing data comprises information about the user's software, geographic location, query terms, and categories, to mention only a few attributes. We refer to this dataset as the plista dataset. For the task of finding personas, we consider a subset of the IMPRESSION-table comprising 100 million records and 63 attributes, which corresponds to 150 GB in total. Although primarily used to create recommendations for advertisement, plista also analyzes the dataset to identify users. Currently, users are identified by their session ID -- not recognizing different users that, for instance, share the same device, or the same user that maintains multiple sessions. To identify users more accurately, domain experts at plista defined a similarity measure for web log records that deduplicates personas. The similarity measure compares 17 of the 63 attributes by edit distance, numerical distance, or exact matching and returns a final similarity as the weighted sum of the individual similarities.
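A weighted-sum record similarity of the kind described for the plista use case can be sketched as follows; the attributes, comparison functions, weights, and example records are invented for illustration, since the actual measure compares 17 specific attributes defined by plista's domain experts.

```python
def edit_similarity(a, b):
    # Simple normalized string similarity (1 - relative Levenshtein distance).
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return 1.0 - prev[-1] / max(len(a), len(b))

def numeric_similarity(a, b, scale):
    return max(0.0, 1.0 - abs(a - b) / scale)

def exact(a, b):
    return 1.0 if a == b else 0.0

# Illustrative attribute -> (comparison function, weight) configuration.
MEASURE = {
    "user_agent": (edit_similarity, 0.4),
    "geo_region": (exact, 0.3),
    "screen_width": (lambda a, b: numeric_similarity(a, b, scale=400), 0.3),
}

def record_similarity(r1, r2):
    # Weighted sum of the individual attribute similarities.
    return sum(w * f(r1[attr], r2[attr]) for attr, (f, w) in MEASURE.items())

a = {"user_agent": "Mozilla/5.0 Firefox", "geo_region": "Berlin", "screen_width": 1920}
b = {"user_agent": "Mozilla/5.O Firefox", "geo_region": "Berlin", "screen_width": 1680}
print(record_similarity(a, b))   # high weighted similarity -> candidate persona duplicate
```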
Fig 8. Duplicates found in the plista-dataset.
To run the persona detection, we use a Dell PowerEdge R620 with two Intel Xeon E5-2650 2.00 GHz CPUs and 128 GB DDR3-1600 RAM. Note that although the server provides 16 cores, the current implementations of all algorithms are single-threaded and, therefore, utilize only one core. Hence, all algorithms can further be improved by parallelization. The server's main memory of 128 GB can hold 15 million records of the given plista-dataset, which leads to seven partitions overall. Due to the size of the dataset and the high number of expected duplicates, we also increase the maximum window size to 50 for the SNM-approaches and the block size to 6 and maximum block range to 8 for the PB algorithm. The results of this experiment are shown in Fig. 8. The traditional Sorted Neighborhood Method takes almost seven days to finish the persona detection. Not only must the user wait this long for results, the algorithm also reserves significant server resources during these days. In combination with early termination, both progressive algorithms significantly reduce this effort. Although the two algorithms require more time to completely finish, they deliver almost the same results in a much shorter time: PSNM identifies 71% and PB identifies 93% of all duplicates already in the first two days. So if we accept a slightly less complete result, we can run the deduplication in two instead of seven days. With 56%, SNM exhibits an above-average progressive performance. However, PSNM still outperforms this quality with 73% and PB with even 88%. These results are comparable to the results that we measured in Sec. 4.3 on smaller datasets using less memory. The reason for PB significantly outperforming PSNM on the plista dataset is that the dataset contains many duplicate clusters, which was foreseeable for the use case at hand.

We also show the quality for other weighting functions ω(t) with L = 1 and t in days for this experiment: as the first two rank the results similarly, the last function puts so much weight on the few very early results that PSNM is ranked highest here. So PSNM might be preferable in a pipeline scenario. In the analysis, we found out that the plista dataset contains about 135 million duplicate pairs (w.r.t. the expert's similarity measure definition of a persona). After merging all these duplicates, we ended up with 61.4 million distinct personas in the 100 million web log records. Among those, 55 million were singletons, i.e., had no duplicate. So each persona visited about 1.6 web pages containing plista advertisement on average. Furthermore, the average size of a duplicate cluster (excluding the singletons) is 21 pairs, which corresponds to seven records for the same persona (7 choose 2 = 21 pairs). So most personas visit only one web page with plista advertisement (the singletons), but if a persona visits more than one page, then she visits seven pages on average. By further inspecting the identified personas, however, data mining specialists might discover more insights. In summary, executing a full, traditional duplicate detection run on plista's massive amount of log data turned out to be extremely time and resource consuming. Using progressive duplicate detection techniques, on the contrary, renders this process feasible: as the result of the persona detection need not necessarily be complete, the progressive analysis can be stopped at any point in time and still maximizes the output.

IV. RESULT

Fig 9. The above screen is the Home page, which describes the project and also contains all tabs.

Fig 10. This is the administrator login screen, which allows an administrator to log in with a username and password. If the username and password are valid, the administrator is allowed to create datasets; otherwise an error is shown.

Fig 11. The above screen allows an administrator to select a Root Element in order to add a dataset to the specified Root Element.


Fig 12. The above screen allows an administrator to select a Sub Root Element in order to add a dataset to the specified Sub Root Element.

Fig 13. This screen allows adding a dataset to a particular Root Element and Sub Root Element. After selecting a file, the administrator has to click on the Submit button. This adds the dataset to the database.

Fig 14. This screen allows adding an image related to a dataset file to a particular Root Element and Sub Root Element. After selecting a file, the administrator has to click on the Submit button. This adds an image for a particular dataset file to the database.

Fig 15. The screen is designed for a user to register in the database. After entering the details, the user clicks on the Submit button; the user details are then registered and stored in the database. Whenever the user needs to access the database to search for a particular file, the user needs to enter the registered username and password. If they are valid, the user is allowed to search for a file; otherwise an error is shown.

Fig 16. User's registration success page.

Fig 17. The above screen shows the user login page. If a user is already registered, the user is allowed to move on to the further steps. If a user is not registered, the user is requested to register with the option called New User.

Fig 18. The above screen allows a user to search for a file with a specified keyword given in the search box; the user must then click on the Search button.

Fig 19. The above screens are the first steps of the output filtering. Here we applied the "Progressive Sorted Neighborhood Method (PSNM)" algorithm, which allows the engine to search all the files based on the given keyword. Identifying one million duplicates took more than 30 minutes without using PSNM, but calculating the transitive closure on them takes only 1.4 seconds.

Fig 20. The above screen is the second step of filtering the output by applying the algorithm named "Subset Selection Algorithm". This algorithm filters the files into two divisions based on the given threshold of 5, i.e., it retrieves a file that repeats five or more times as a Relevant Dataset file and the rest as Irrelevant Dataset files, which are not shown in the output. The above screen also contains a button named "Progressive Blocking". If that button is clicked, the actual output is displayed to the user.

Fig 21. This is the actual output shown to the user. Here we applied the algorithm named "Progressive Blocking", which identifies the duplicate records and displays each of them only once in the output. PSNM and PB execute in very limited time when compared to normal execution and also identify duplicates in a very short period of time.
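The threshold-based filtering described for Fig. 20 can be sketched as follows; the threshold of 5 follows the caption, while the function name, input format, and file names are illustrative assumptions.

```python
from collections import Counter

def subset_selection(matched_files, threshold=5):
    """Split search results into relevant and irrelevant dataset files:
    a file that occurs at least `threshold` times among the matches is
    kept as relevant, everything else is filtered out of the output."""
    counts = Counter(matched_files)
    relevant = [f for f, n in counts.items() if n >= threshold]
    irrelevant = [f for f, n in counts.items() if n < threshold]
    return relevant, irrelevant

matches = ["sales.csv"] * 7 + ["customers.csv"] * 2 + ["orders.csv"] * 5
relevant, irrelevant = subset_selection(matches)
print(relevant)     # ['sales.csv', 'orders.csv']  -- shown to the user
print(irrelevant)   # ['customers.csv']            -- suppressed in the output
```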
V. CONCLUSION
We also presented several new deduplication constructions supporting authorized duplicate check in a hybrid cloud architecture, in which the duplicate-check tokens of files are generated by the private cloud server with private keys. Security analysis demonstrates that our schemes are secure in terms of the insider and outsider attacks specified in the proposed security model. As a proof of concept, we implemented a prototype of our proposed authorized duplicate check scheme and conducted test-bed experiments on our prototype.
VI. FUTURE ENHANCEMENTS
Although the above solution does not allow file redundancy, brute-force attacks introduced and launched from the public cloud server could be addressed in future work, making the scheme more powerful and secure while still not allowing files to be duplicated. At present, attackers use dictionaries and software programs that can test hundreds of thousands of password combinations per second and could crack passwords within minutes; the system should therefore not allow another party to probe a user's password more than a particular number of times. Brute-force attacks typically begin with secure shell (SSH); the enhanced scheme would prevent the file keys from being taken, would not allow duplicate keys to open a file, and would thus preserve protection against file redundancy.
VII. REFERENCES
[1] S. E. Whang, D. Marmaros, and H. Garcia-Molina, "Pay-as-you-go entity resolution," IEEE Trans. Knowl. Data Eng., vol. 25, no. 5, pp. 1111–1124, May 2012.
[2] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios, "Duplicate record detection: A survey," IEEE Trans. Knowl. Data Eng., vol. 19, no. 1, pp. 1–16, Jan. 2007.
[3] F. Naumann and M. Herschel, An Introduction to Duplicate Detection. San Rafael, CA, USA: Morgan & Claypool, 2010.
[4] H. B. Newcombe and J. M. Kennedy, "Record linkage: Making maximum use of the discriminating power of identifying information," Commun. ACM, vol. 5, no. 11, pp. 563–566, 1962.
[5] M. A. Hernandez and S. J. Stolfo, "Real-world data is dirty: Data cleansing and the merge/purge problem," Data Mining Knowl. Discovery, vol. 2, no. 1, pp. 9–37, 1998.
[6] X. Dong, A. Halevy, and J. Madhavan, "Reference reconciliation in complex information spaces," in Proc. Int. Conf. Manage. Data, 2005, pp. 85–96.
[7] O. Hassanzadeh, F. Chiang, H. C. Lee, and R. J. Miller, "Framework for evaluating clustering algorithms in duplicate detection," Proc. Very Large Databases Endowment, vol. 2, pp. 1282–1293, 2009.
[8] O. Hassanzadeh and R. J. Miller, "Creating probabilistic databases from duplicated data," VLDB J., vol. 18, no. 5, pp. 1141–1166, 2009.
[9] U. Draisbach, F. Naumann, S. Szott, and O. Wonneberg, "Adaptive windows for duplicate detection," in Proc. IEEE 28th Int. Conf. Data Eng., 2012, pp. 1073–1083.
[10] S. Yan, D. Lee, M.-Y. Kan, and L. C. Giles, "Adaptive sorted neighborhood methods for efficient record linkage," in Proc. 7th ACM/IEEE Joint Int. Conf. Digit. Libraries, 2007, pp. 185–194.
[11] J. Madhavan, S. R. Jeffery, S. Cohen, X. Dong, D. Ko, C. Yu, and A. Halevy, "Web-scale data integration: You can only afford to pay as you go," in Proc. Conf. Innovative Data Syst. Res., 2007.
[12] S. R. Jeffery, M. J. Franklin, and A. Y. Halevy, "Pay-as-you-go user feedback for dataspace systems," in Proc. Int. Conf. Manage. Data, 2008, pp. 847–860.
[13] C. Xiao, W. Wang, X. Lin, and H. Shang, "Top-k set similarity joins," in Proc. IEEE Int. Conf. Data Eng., 2009, pp. 916–927.
[14] P. Indyk, "A small approximately min-wise independent family of hash functions," in Proc. 10th Annu. ACM-SIAM Symp. Discrete Algorithms, 1999, pp. 454–456.
[15] U. Draisbach and F. Naumann, "A generalization of blocking and windowing algorithms for duplicate detection," in Proc. Int. Conf. Data Knowl. Eng., 2011, pp. 18–24.
[16] H. S. Warren, Jr., "A modification of Warshall's algorithm for the transitive closure of binary relations," Commun. ACM, vol. 18, no. 4, pp. 218–220, 1975.
[17] M. Wallace and S. Kollias, "Computationally efficient incremental transitive closure of sparse fuzzy binary relations," in Proc. IEEE Int. Conf. Fuzzy Syst., 2004, pp. 1561–1565.
[18] F. J. Damerau, "A technique for computer detection and correction of spelling errors," Commun. ACM, vol. 7, no. 3, pp. 171–176, 1964.
[19] P. Christen, "A survey of indexing techniques for scalable record linkage and deduplication," IEEE Trans. Knowl. Data Eng., vol. 24, no. 9, pp. 1537–1555, Sep. 2012.
[20] B. Kille, F. Hopfgartner, T. Brodt, and T. Heintz, "The Plista dataset," in Proc. Int. Workshop Challenge News Recommender Syst., 2013, pp. 16–23.
[21] L. Kolb, A. Thor, and E. Rahm, "Parallel sorted neighborhood blocking with MapReduce," in Proc. Conf. Datenbanksysteme in Büro, Technik und Wissenschaft, 2011.
