
A

SYNOPSIS
of

The Dissertation Work

on

XML Duplicate Detection


for

M.E. in Computer Science and Engineering


at

Department of CSE,
Annasaheb Dange College of Engineering & Technology,
Ashta, Sangli.
(Affiliated to Shivaji University, Kolhapur)
by
Ms. Trupti Ashok Patil

PRN No.: 1312445135

Guide
Prof. S. V. Patil

Academic Year

2014-15
XML Duplicate Detection

• Name of College: Annasaheb Dange College of Engineering & Technology, Ashta

• Name of Course: ME-CSE-PART-II

• Name of the Student: Ms. Trupti Ashok Patil (PRN: 1312445135)

• Date of Registration: 25-08-2013

• Name of the Guide: Prof. S. V. Patil


(PG Recognition No.: SU/PGBUTR/RECOG/247/243, Date: 05/04/2013)

• Proposed Title: "XML Duplicate Detection"

1 Introduction
Hierarchical data is defined as a set of data items that are related to each other by hierarchical
relationships. Hierarchical relationships exist where one item of data is the parent of another
item. The structure allows representing information using parent/child relationships where each
parent can have many children, but each child has only one parent (also known as a 1-to-many
relationship). All attributes of a specific record are listed under an entity type.
An XML document is a tree, and therefore a single XML data type instance can represent
a complete hierarchy. XML is used both for large scale electronic publishing of data, and for
the exchange of data on the Web and elsewhere. The two main features of XML are that the
data is organized hierarchically and is semi-structured, mixing content (e.g., text) and
structure using so-called XML tags.
Several problems arise in the context of data integration, where data from distributed and
heterogeneous data sources is combined. One of these problems is the possibly inconsistent
representation of the same real-world object in the different data sources. When combining
data from heterogeneous sources, the ideal result is a unique, complete, and correct
representation of every object. Such data quality can only be achieved through data cleansing,
where the most important task is to ensure that an object has only one representation in the
result. This requires the identification of duplicate objects and is referred to as
"duplicate detection".

2 Relevance/Motivation
Relational data represents only a fraction of today's data. XML is increasingly popular as a
data representation, especially for data published on the World Wide Web and data exchanged
between organizations. In XML data, different types of objects may also be described within a
single schema; however, due to the semi-structured nature of XML, they are not necessarily
described by a fixed set of single-valued attributes. Consequently, similarity measures
designed to compare equally structured flat tuples are no longer applicable to semi-structured,
hierarchical XML elements.
Structural divergences are more difficult to identify in XML data than in relational databases,
due to the flexible hierarchical structure of XML documents. Several nesting levels can exist
in an XML document. This does not occur in relational databases, since tables and attributes
are flat. It is harder to find correspondences when comparing documents that use different
hierarchical structures to convey the same information. As a consequence, the techniques used
for relational databases cannot be readily applied to XML data sources.
In the world of XML, there are not necessarily uniform, clearly defined structures like
tables. Duplicates are multiple representations of the same real-world object that differ
from each other. Duplicate detection has been studied extensively for relational data, where
the detection strategy consists of comparing pairs of tuples. Methods devised for duplicate
detection in a single relation do not directly apply to XML data: two objects have to be
compared according to both structure and data. Therefore, there is a need to develop methods
to detect duplicate objects in nested XML data, using a duplicate detection algorithm.
Whether two XML nodes are duplicates depends only on whether their values are duplicates
and their children nodes are duplicates.
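
For illustration, consider two hypothetical <person> elements (the tag names and values below
are invented for this example, not taken from any particular data set) that describe the same
real-world person with different values and a different structure:

    <person>
        <name>John Smith</name>
        <address>
            <city>Sangli</city>
        </address>
    </person>

    <person>
        <name>J. Smith</name>
        <address><city>Sangli</city></address>
        <phone>12345</phone>
    </person>

Neither the values nor the child structures of the two elements match exactly, so a duplicate
detection method must compare them at both levels.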

3 Literature Review
E. Rahm and H.H. Do surveyed data cleaning approaches for data quality problems [1]. Data
cleaning is used to detect and remove errors and inconsistencies from data in order to improve
its quality. The need for data cleaning is increasing because sources often contain redundant
data in different representations.
R. Ananthakrishna, S. Chaudhuri, and V. Ganti developed an algorithm for eliminating fuzzy
duplicates [2]. The algorithm eliminates duplicates in dimensional tables of a data warehouse,
which are associated with hierarchies. They take advantage of these hierarchies to develop a
high-quality, scalable duplicate elimination algorithm and evaluate it on real data sets.
D.V. Kalashnikov and S. Mehrotra proposed a domain-independent data cleaning approach [3].
Their approach, RelDC, analyses not only object features but also inter-object relationships
to improve disambiguation quality.
M. Weis and F. Naumann present a generalized framework for object identification, together
with an XML-specific specialization [4]. The framework consists of three main components:

• Candidate Definition component

• Duplicate definition component

• Duplicate detection component


The candidate definition specifies which objects to compare; the duplicate definition defines
what information is part of a candidate's description and when two candidates are duplicates;
and the duplicate detection component defines how duplicates are searched for among the
candidates. Using this framework, the DogmatiX algorithm was introduced for object
identification in XML, using heuristics to determine candidate descriptions
domain-independently. The system uses a domain-independent similarity measure tailored to
hierarchical, semi-structured XML data. Experimental results demonstrate good recall and
precision when adequate heuristics are chosen. The system also uses an object filter to
reduce the number of pairwise candidate comparisons.
L. Leitao, P. Calado, and M. Weis proposed a novel method for fuzzy duplicate detection in
hierarchical, semi-structured XML data, based on a Bayesian network [5]. The Bayesian network
model is able to accurately determine the probability of two XML objects in a given database
being duplicates. Probabilities are computed efficiently using the network, and the model
also provides great flexibility in its configuration, allowing the use of different similarity
measures for the field values and different conditional probabilities to combine the
similarity probabilities of the XML elements. Experimental results show that the proposed
algorithm maintains more accurate results and higher recall values than previous
state-of-the-art methods such as DogmatiX, even when dealing with data containing a high
amount of errors and missing information.
A.M. Kade and C.A. Heuser presented a novel approach to the matching problem [6]. XML document
integration is an important process in highly dynamic applications, as the volume of data
available in this format is constantly growing. The matching problem is the problem of
determining which parts of two documents contain the same information. To match two XML
documents, the proposed technique uses information from both the structure and the content of
the documents' elements, an important step in the integration of XML documents. The technique
uses three pieces of information to calculate the similarity between two nodes of XML trees:
the elements' content, the nodes' names, and the nodes' paths. Experiments showed that the
technique is capable of dealing with the structural and content diversity that typically
exists in XML documents created by distinct sources.
D. Milano, M. Scannapieco, and T. Catarci defined a new distance for XML data, the
structure-aware XML distance [7]. The distance compares only those portions of XML data trees
whose structure suggests similar semantics, and it performs the comparison on unordered trees
without incurring high computational costs. The paper proposes an algorithm that measures the
distance between two trees and discusses its complexity, which is polynomial. Experiments
test the effectiveness and efficiency of the distance measure as a comparison function for
XML object identification.
S. Guha, H.V. Jagadish, N. Koudas, D. Srivastava, and T. Yu proposed computationally more
efficient lower and upper bounds for the tree edit distance between ordered labeled trees [8].
They then show how the tree edit distance, and other metrics that quantify the distance
between trees, can be incorporated into a join framework, and they propose a generic technique
based on the notion of reference sets that can be used to process joins between XML data
sources. They proposed various algorithms for processing approximate XML joins and
experimentally quantified the performance tradeoffs.


J.C.P. Carvalho and A.S. da Silva proposed an approach for finding similar identities among
objects from multiple Web sources [9]. The similarity-based approach works like the join
operation in relational databases, but a semantic similarity condition takes the place of the
equality condition. The similarity condition is based on the vector space model. The
effectiveness of the approach is demonstrated by experimental results with real Web data
sources from different domains.
M.A. Hernandez and S.J. Stolfo present the merge/purge problem: the task of merging data from
multiple sources in as efficient a manner as possible, while maximizing the accuracy of the
result [10]. To solve the merge/purge problem, they use the sorted neighborhood method. An
alternative method based on data clustering modestly improves the running time of the process.

4 Proposed Work
4.1 Scope:
• Duplicate detection in hierarchical data is an important task in today's data integration
and data cleansing processes.

• The proposed XMLDup method uses a Bayesian network to determine the probability that two
XML elements are duplicates.

• A network pruning strategy is presented to improve the efficiency of the network evaluation.

Need of Work:

• XML is growing in use and popularity, yet it is more challenging than relational data
because there is no distinction between object types and attribute types, and instances of
the same object may have different structures. Since XML is hierarchical, the proposed work
takes advantage of this hierarchical structure for efficiency and effectiveness.

• The proposed method determines duplicates by considering not only the information within
the elements but also the way that information is structured.

4.2 Methodology:
4.2.1 System architecture
The user selects the XML files that contain duplicate records. The files are then parsed
using a DOM (Document Object Model) API parser, which maps the records in each file to Java
objects. These mapped objects are then used to construct the BN model.


Figure 1: System Architecture

The BN has a node labeled with the records' parent node, to which a binary random variable is
assigned. This variable takes the value 1 (active) to represent the fact that the XML nodes
are duplicates, and the value 0 (inactive) to represent the fact that they are not. The
parent node then has sub-parent nodes for the child nodes in the XML tree, and this continues
down to the leaf nodes.
A list containing the parent nodes is given to the XMLDup method for calculating
probabilities. To obtain the duplicate probability, the method propagates the prior
probabilities associated with the BN leaf nodes upward, setting the intermediate node
probabilities, until the root probability is found; that is, the similarities of the leaf
nodes are propagated up the tree and combined at each level until the root node is reached,
yielding the final probability. If the calculated probability exceeds the threshold, the two
records represented by the BN structure are duplicates, and the duplicate records are then
separated from the list.
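
As a minimal sketch of this bottom-up propagation (the class and method names below are
illustrative assumptions, not the actual XMLDup implementation, and a simple product of child
probabilities is used as the combination function), the root probability can be computed
recursively:

    import java.util.ArrayList;
    import java.util.List;

    class BNNode {
        double prior;                              // leaf value similarity in [0, 1]
        List<BNNode> children = new ArrayList<>();

        // Leaves return their prior similarity; inner nodes combine the
        // probabilities of their children (assuming independence, a parent
        // is a duplicate only if all of its children are).
        double duplicateProbability() {
            if (children.isEmpty()) {
                return prior;
            }
            double p = 1.0;
            for (BNNode child : children) {
                p *= child.duplicateProbability();
            }
            return p;
        }
    }

    class PropagationDemo {
        public static void main(String[] args) {
            BNNode name = new BNNode();
            name.prior = 0.9;                      // e.g. string similarity of two names
            BNNode city = new BNNode();
            city.prior = 0.8;                      // e.g. string similarity of two cities
            BNNode root = new BNNode();
            root.children.add(name);
            root.children.add(city);
            double threshold = 0.5;
            // 0.9 * 0.8 = 0.72 >= 0.5, so this pair is reported as a duplicate.
            System.out.println(root.duplicateProbability() >= threshold);
        }
    }

The model allows other combination functions at each level; the product above is only one
possible choice.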
In order to improve the BN evaluation time, a lossless pruning strategy is used. The strategy
is lossless in the sense that no duplicate objects are lost: only object pairs incapable of
reaching the given duplicate probability threshold are discarded. The pruning algorithm is
given the list of BN models constructed at the beginning and evaluates each model in turn.
Traversing the BN structure, it propagates the node probabilities up to the parent nodes; if
the probability of the root exceeds the threshold, the records represented by the BN
structure are considered duplicates. In this way, the algorithm checks the entire list and
finds the duplicates.
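
A minimal sketch of how such a lossless cut-off can work, under the assumption that node
probabilities are combined by multiplication, so that every factor still to be evaluated is
at most 1.0 and the partial product never underestimates the final probability:

    public class PruningSketch {
        // Evaluate the leaf similarities of one candidate pair, stopping as
        // soon as the pair can no longer reach the duplicate threshold.
        static boolean isDuplicate(double[] leafSimilarities, double threshold) {
            double p = 1.0;
            for (double sim : leafSimilarities) {
                p *= sim;
                // The remaining factors are <= 1.0, so p can only decrease:
                // once p < threshold, the pair is safely discarded.
                if (p < threshold) {
                    return false;  // pruned without computing later similarities
                }
            }
            return true;
        }
    }

Because a pair is discarded only when it provably cannot reach the threshold, no true
duplicates are lost, which is what makes the pruning lossless.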


4.2.2 Module 1: Read file


This module accepts files from the user. The files contain hierarchical data, such as XML
documents. The XML files to be compared are selected here.

4.2.3 Module 2: Parsing XML files


The XML files containing duplicate records are parsed. The proposed system uses the DOM
(Document Object Model) API, which parses each XML document and maps the records in the file
to Java objects.
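
A minimal parsing sketch using the standard Java DOM API is shown below (the file name
records.xml and the record tag person are assumptions made for illustration; the real files
may use a different schema):

    import java.io.File;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    public class XmlReader {
        public static void main(String[] args) throws Exception {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new File("records.xml")); // assumed file name
            doc.getDocumentElement().normalize();

            // Collect every record element; in the proposed system each record
            // would be mapped to a Java object before BN construction.
            NodeList records = doc.getElementsByTagName("person"); // assumed tag name
            for (int i = 0; i < records.getLength(); i++) {
                Element record = (Element) records.item(i);
                System.out.println("record " + i + ": "
                        + record.getTextContent().trim());
            }
        }
    }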

4.2.4 Module 3: Implementing XMLDup method


XMLDup uses a Bayesian network to determine the probability that two XML elements are
duplicates. The mapped objects are used to construct the BN. The BN has a node labeled with
the records' parent node, to which a binary random variable is assigned: the value 1 (active)
represents the fact that the XML nodes are duplicates, and the value 0 (inactive) the fact
that they are not. The parent node then has sub-parent nodes for the child nodes in the XML
tree, and this continues down to the leaf nodes. The nodes are compared with each other, and
those pairs whose probabilities exceed the threshold value are retained.

4.2.5 Module 4: Calculate precision and recall


Each BN structure is taken in turn; the probability of each BN node is calculated and
propagated to its parent node until the root node is reached. If the probability of the root
node exceeds the threshold, the BN structure represents duplicate records. Precision and
recall are then calculated for the XML file, and the proposed system is compared with the
original XMLDup algorithm to check that it gives better recall than the previous system.
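
Precision and recall can be computed against a gold standard of known duplicate pairs; the
following is a minimal sketch (the encoding of a pair of record ids as a string is an
illustrative assumption):

    import java.util.HashSet;
    import java.util.Set;

    public class Evaluation {
        // precision = TP / (TP + FP), recall = TP / (TP + FN),
        // where each pair of record ids is encoded as a string like "id1-id2".
        static double[] precisionRecall(Set<String> detected, Set<String> truth) {
            Set<String> truePositives = new HashSet<>(detected);
            truePositives.retainAll(truth);
            double precision = detected.isEmpty()
                    ? 0.0 : (double) truePositives.size() / detected.size();
            double recall = truth.isEmpty()
                    ? 0.0 : (double) truePositives.size() / truth.size();
            return new double[] { precision, recall };
        }
    }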

4.3 System Configuration


Hardware Requirement
• Processor - Pentium IV

• RAM - 64 MB

• Hard Disk - 2 GB

Software Requirement
• Operating System - Windows XP

• Front End - Java


5 Expected month for completion of work: April 2015

Table 1: Schedule

Sr. No.   Task                                          Duration in months
1         Literature Survey and Synopsis Preparation    2
2         Designing System and Study of Algorithm       2
3         Implementation of Modules 1 and 2             2
4         Implementation of Modules 3 and 4             2
5         Analysis                                      1
6         Report Preparation                            1


References
[1] E. Rahm and H.H. Do, “Data Cleaning: Problems and Current Approaches,” IEEE Data
Eng. Bull., vol. 23, no. 4, pp. 3-13, Dec. 2000.
[2] R. Ananthakrishna, S. Chaudhuri, and V. Ganti, “Eliminating Fuzzy Duplicates in Data
Warehouses,” Proc. Conf. Very Large Databases (VLDB), pp. 586-597, 2002.
[3] D.V. Kalashnikov and S. Mehrotra, “Domain-Independent Data Cleaning via Analysis of
Entity-Relationship Graph,” ACM Trans. Database Systems, vol. 31, no. 2, pp. 716-767,
2006.
[4] M. Weis and F. Naumann, “Dogmatix Tracks Down Duplicates in XML,” Proc. ACM SIGMOD
Conf. Management of Data, pp. 431-442, 2005.
[5] L. Leitao, P. Calado, and M. Weis, “Structure-Based Inference of XML Similarity for Fuzzy
Duplicate Detection,” Proc. 16th ACM Int'l Conf. Information and Knowledge Management,
pp. 293-302, 2007.
[6] A.M. Kade and C.A. Heuser, “Matching XML Documents in Highly Dynamic Applications,”
Proc. ACM Symp. Document Eng. (DocEng), pp. 191-198, 2008.
[7] D. Milano, M. Scannapieco, and T. Catarci, “Structure Aware XML Object Identification,”
Proc. VLDB Workshop Clean Databases (CleanDB), 2006.
[8] S. Guha, H.V. Jagadish, N. Koudas, D. Srivastava, and T. Yu, “Approximate XML Joins,”
Proc. ACM SIGMOD Conf. Management of Data, 2002.
[9] J.C.P. Carvalho and A.S. da Silva, “Finding Similar Identities among Objects from Multiple
Web Sources,” Proc. CIKM Workshop Web Information and Data Management (WIDM),
pp. 90-93, 2003.
[10] M.A. Hernandez and S.J. Stolfo, “The Merge/Purge Problem for Large Databases,” Proc.
ACM SIGMOD Conf. Management of Data, pp. 127-138, 1995.

Date: July 17, 2014

Place: Ashta

Ms. Trupti Ashok Patil                                  Prof. S. V. Patil


Student name and signature Guide

Dr. M. A. Kadampur
HOD Principal
