SYNOPSIS
on
XML Duplicate Detection
Department of CSE,
Annasaheb Dange College of Engineering & Technology,
Ashta, Sangli.
(Affiliated to Shivaji University, Kolhapur)
by
Ms. Trupti Ashok Patil
PRN No.: 1312445135
Guide
Prof. S. V. Patil
Academic Year
2014-15
XML Duplicate Detection
1 Introduction
Hierarchical data is defined as a set of data items that are related to each other by hierarchical
relationships. Hierarchical relationships exist where one item of data is the parent of another
item. The structure allows representing information using parent/child relationships where each
parent can have many children, but each child has only one parent (also known as a 1-to-many
relationship). All attributes of a specific record are listed under an entity type.
An XML document is a tree, and therefore a single XML data type instance can represent
a complete hierarchy. XML is used both for large-scale electronic publishing of data and for
the exchange of data on the Web and elsewhere. The two main features of XML are that the
data is organized hierarchically and is semi-structured, mixing content (e.g., text) and structure
using so-called XML tags.
Several problems arise in the context of data integration, where data from distributed and
heterogeneous data sources is combined. One of these problems is the possibly inconsistent
representation of the same real-world object in the different data sources. When combining data
from heterogeneous sources, the ideal result is a unique, complete, and correct representation for
every object. Such data quality can only be achieved through data cleansing, where the most
important task is to ensure that an object has only one representation in the result. This requires
the identification of objects and is referred to as "duplicate detection".
2 Relevance/Motivation
Relational data represents only a fraction of today's data. XML is increasingly popular as
a data representation, especially for data published on the World Wide Web and data exchanged
between organizations. In XML data, it is also true that different types of objects are described
within a single schema; however, due to the semi-structured nature of XML, they are not
necessarily described by a fixed set of single-valued attributes. Consequently, similarity measures
designed to compare equally structured flat tuples are no longer applicable to semi-structured,
hierarchical XML elements.
Structural divergences are more difficult to identify in XML data than in relational databases,
due to the flexible hierarchical structure of XML documents. Several nesting levels can exist
in an XML document. This does not occur in relational databases, since tables and attributes
are flat. It is harder to find correspondences when comparing documents which use different
hierarchical structures to convey the same information. As a consequence, the techniques used
for relational databases cannot be readily applied to XML data sources.
In the world of XML, there are not necessarily uniform and clearly defined structures like
tables. Duplicates are multiple representations of the same real-world object that differ from
each other. Duplicate detection has been studied extensively for relational data, where the
detection strategy consists of comparing pairs of tuples. Methods devised for duplicate detection
in a single relation do not directly apply to XML data: two objects have to be compared according
to both structure and data. Therefore, there is a need to develop a method to detect duplicate
objects in nested XML data. Duplicates are detected using a duplicate detection algorithm. Whether
two XML nodes are duplicates depends only on whether their values are duplicates
and their children nodes are duplicates.
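This recursive rule can be illustrated with a small sketch; the class, the positional child pairing, and the trivial case-insensitive similarity function below are assumptions for illustration only, not part of any cited algorithm:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch: two XML nodes are duplicates only if their own values
// are similar and their corresponding children are themselves duplicates.
class NodeDup {
    String value;
    List<NodeDup> children = new ArrayList<>();

    NodeDup(String value) { this.value = value; }

    // Toy value similarity: case-insensitive equality (an assumption).
    static boolean similarValues(String a, String b) {
        return a.equalsIgnoreCase(b);
    }

    // Recursive duplicate test: values must be similar and children
    // must match pairwise (a deliberately simplified pairing).
    static boolean isDuplicate(NodeDup x, NodeDup y) {
        if (!similarValues(x.value, y.value)) return false;
        if (x.children.size() != y.children.size()) return false;
        for (int i = 0; i < x.children.size(); i++) {
            if (!isDuplicate(x.children.get(i), y.children.get(i))) return false;
        }
        return true;
    }
}
```

A real detector would use fuzzier string similarity and a more flexible child pairing, but the recursion structure is the same.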
3 Literature Review
E. Rahm and H.H. Do surveyed data cleaning approaches for data quality problems [1]. Data
cleaning is used to detect and remove errors and inconsistencies from data in order to improve
the quality of data. The need for data cleaning is increasing because the sources often contain
redundant data in different representations.
R. Ananthakrishna, S. Chaudhuri, and V. Ganti developed an algorithm for eliminating
fuzzy duplicates [2]. The algorithm eliminates duplicates in dimensional tables of a data
warehouse that are associated with hierarchies. The authors take advantage of these hierarchies
to develop a high-quality, scalable duplicate elimination algorithm and evaluate it on real data sets.
D.V. Kalashnikov and S. Mehrotra proposed a domain-independent data cleaning approach [3].
Their approach, RELDC, analyzes not only object features but also inter-object relationships
to improve disambiguation quality.
M. Weis and F. Naumann present a generalized framework for object identification and an
XML-specific specialization [4]. The framework consists of three main components: the
candidate definition, which specifies what objects to compare; the duplicate definition, which
defines what information is part of a candidate's description and when two candidates are
duplicates; and the duplicate detection, which defines how duplicates are searched for among
the candidates. Using this framework, the DogMatix algorithm was introduced for object
identification in XML; it uses heuristics to determine candidate descriptions domain-independently.
The system uses a domain-independent similarity measure tailored to hierarchical and
semi-structured XML data. Experimental results demonstrate good recall and precision when
adequate heuristics are chosen. The system also uses an object filter to reduce the number of
pairwise candidate comparisons.
L. Leitao, P. Calado, and M. Weis proposed a novel method for fuzzy duplicate detection
in hierarchical and semi-structured XML data, based on a Bayesian network [5]. The
Bayesian network model can accurately determine the probability that two XML objects
in a given database are duplicates. Probabilities are computed efficiently using the Bayesian
network, and the model also provides great flexibility in its configuration, allowing the use of
different similarity measures for the field values and different conditional probabilities to combine
the similarity probabilities of the XML elements. Experimental results show that the proposed
algorithm maintains high accuracy, with better recall than previous state-of-the-art methods
such as DogMatix, even when dealing with data containing a high amount of errors and
missing information.
A.M. Kade and C.A. Heuser presented a novel approach to the matching problem [6]. XML
document integration is an important process in highly dynamic applications, as the volume
of data available in this format is constantly growing. The matching problem is the problem
of defining which parts of two documents contain the same information. To match two
XML documents, the proposed technique uses information from both the structure
and the content of the elements of the documents. This is an important step in the integration
of XML documents. The technique uses three pieces of information to calculate the
similarity between two nodes of XML trees: the element content, the node names, and the node
paths. The experiments showed that the technique is capable of dealing with the structural and
content diversity that typically exists in XML documents created by distinct sources.
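As a rough illustration of combining these three signals, the following sketch uses a weighted sum with exact-match scoring; the weights and the per-signal scoring function are hypothetical choices, not the measure used in the cited paper:

```java
// Illustrative combination of the three signals: element content,
// node name, and node path. Weights are assumed, summing to 1.
class NodeSimilarity {
    static final double W_CONTENT = 0.5;
    static final double W_NAME = 0.3;
    static final double W_PATH = 0.2;

    // Toy per-signal similarity: 1.0 on a case-insensitive match, else 0.0.
    static double sim(String a, String b) {
        return a.equalsIgnoreCase(b) ? 1.0 : 0.0;
    }

    // Weighted combination of the three signals into one score in [0, 1].
    static double nodeSimilarity(String contentA, String contentB,
                                 String nameA, String nameB,
                                 String pathA, String pathB) {
        return W_CONTENT * sim(contentA, contentB)
             + W_NAME * sim(nameA, nameB)
             + W_PATH * sim(pathA, pathB);
    }
}
```

In practice, each signal would use a fuzzier measure (e.g., an edit-distance-based string similarity), but the idea of blending content, name, and path evidence is the same.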
D. Milano, M. Scannapieco, and T. Catarci defined a new distance for XML data, the
structure-aware XML distance [7]. The distance compares only those portions of XML data trees
whose structure suggests similar semantics. It performs comparisons on unordered trees without
incurring high computational costs. The paper presents an algorithm that measures the
distance between two trees and discusses its complexity, which is polynomial. Experiments
test the effectiveness and efficiency of the distance measure as a comparison function for XML
object identification.
S. Guha, H.V. Jagadish, N. Koudas, D. Srivastava, and T. Yu proposed lower and upper
bounds for the tree edit distance between ordered labeled trees that are computationally more
efficient [8]. They then show how the tree edit distance, and other metrics that quantify distance
between trees, can be incorporated in a join framework. They propose a generic technique,
based on the notion of reference sets, that can be used to process joins between XML data
sources, present various algorithms for processing approximate XML joins, and experimentally
quantify their benefits.
4 Proposed Work
4.1 Scope:
• Duplicate detection in hierarchical data is an important aspect of today's world, and it is
achieved by performing different operations on the data.
• The proposed XMLDup method uses a Bayesian network to determine the probability that
two XML elements are duplicates.
• A network pruning strategy is presented to improve the efficiency of the network evaluation.
Need for Work:
• XML is growing in use and popularity, yet it is more challenging than relational data because
there is no distinction between object types and attribute types, and instances of the same
object may have different structures. Since XML is hierarchical, the proposed work takes
advantage of the hierarchical structure for efficiency and effectiveness.
• The proposed method determines duplicates by considering not only the information within
the elements but also the way that information is structured.
4.2 Methodology:
4.2.1 System architecture
The user selects the XML files that contain duplicate records. The XML files are then
parsed using the DOM (Document Object Model) API. The DOM parser reads the XML
document and maps the records in the file to Java objects. These mapped objects are then
used to construct the BN (Bayesian network) model.
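The parsing step can be sketched with the standard Java DOM API; the XML snippet and tag names used here are illustrative, not the project's actual schema:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Sketch of the parsing step: the DOM API reads an XML document into a
// tree of nodes, from which record objects can be extracted.
class DomParseSketch {
    // Parse an XML string and return how many elements carry the given
    // tag name (e.g., one per candidate record).
    static int countElements(String xml, String tag) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagName(tag).getLength();
        } catch (Exception e) {
            throw new RuntimeException("XML parsing failed", e);
        }
    }
}
```

In the proposed system, each matched element would be mapped to a Java object rather than merely counted.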
The BN has a node labeled with the record's parent node, to which a binary random variable
is assigned. This variable takes the value 1 (active) to represent the fact that the XML nodes
are duplicates, and the value 0 (inactive) to represent the fact that they are not. The parent
node then has sub-parent nodes for the child nodes in the XML tree, and this continues down
to the leaf nodes.
A list containing the parent nodes is given to the XMLDup method for calculating probabilities.
To obtain the duplicate probability, the method propagates the prior probabilities associated
with the BN leaf nodes upward, setting the intermediate node probabilities, until the root
probability is reached. In other words, the similarity of the leaf nodes is propagated up the
tree to the root node, where the final probability is computed.
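This upward propagation can be sketched as follows; multiplying the child probabilities is a simplifying independence assumption made for illustration, not the exact conditional probabilities XMLDup can be configured with:

```java
import java.util.ArrayList;
import java.util.List;

// Toy BN node: a leaf carries a prior similarity probability, and each
// inner node combines its children's probabilities (here, by product,
// a simplifying assumption). The root call yields the final probability.
class BnNode {
    double prior;                       // meaningful only for leaves
    List<BnNode> children = new ArrayList<>();

    BnNode(double prior) { this.prior = prior; }

    double probability() {
        if (children.isEmpty()) return prior;   // leaf: prior similarity
        double p = 1.0;
        for (BnNode c : children) p *= c.probability();
        return p;                               // propagated upward
    }
}
```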
If the calculated probability exceeds the threshold, the two records represented by the BN
structure are duplicates. Duplicate records are then separated from the list.
In order to improve the BN evaluation time, a lossless pruning strategy is used. The strategy
is lossless in the sense that no duplicate objects are lost: only object pairs incapable of
reaching the given duplicate probability threshold are discarded.
The pruning algorithm is given the list of BN models built at the beginning and takes each
BN model in turn for evaluation. Traversing the BN structure, it propagates the probabilities
of the values to the parent nodes; if the probability of the parent exceeds the threshold, the
records represented by the BN structure are considered duplicates. In this way, the algorithm
checks the entire list and finds the duplicates.
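Under the same simplifying product assumption as above, the lossless pruning idea can be sketched as an early stop (a hypothetical illustration, not the actual XMLDup pruning algorithm): since every factor is at most 1, the running score can only decrease, so evaluation of a pair can safely stop once the score falls below the threshold.

```java
import java.util.List;

// Lossless pruning sketch: stop evaluating a candidate pair as soon as
// the running product drops below the threshold, because it can never
// recover. No true duplicate is discarded by this early stop.
class PrunedEvaluation {
    static boolean isDuplicate(List<Double> childProbabilities, double threshold) {
        double running = 1.0;
        for (double p : childProbabilities) {
            running *= p;
            if (running < threshold) return false; // prune early, losslessly
        }
        return running >= threshold;
    }
}
```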
Hardware Requirement

• RAM - 64 MB
• Hard Disk - 2 GB
Software Requirement
• Operating System -Windows XP
Table 1: Schedule
References
[1] E. Rahm and H.H. Do, “Data Cleaning: Problems and Current Approaches,” IEEE Data
Eng. Bull., vol. 23, no. 4, pp. 3-13, Dec. 2000.
[2] R. Ananthakrishna, S. Chaudhuri, and V. Ganti, “Eliminating Fuzzy Duplicates in Data
Warehouses,” Proc. Conf. Very Large Databases (VLDB), pp. 586-597, 2002.
[3] D.V. Kalashnikov and S. Mehrotra, “Domain-Independent Data Cleaning via Analysis of
Entity-Relationship Graph,” ACM Trans. Database Systems, vol. 31, no. 2, pp. 716-767,
2006.
[4] M. Weis and F. Naumann, “Dogmatix Tracks Down Duplicates in XML,” Proc. ACM SIG-
MOD Conf. Management of Data, pp. 431-442, 2005.
[5] L. Leitao, P. Calado, and M. Weis, “Structure-Based Inference of XML Similarity for Fuzzy
Duplicate Detection,” Proc. 16th ACM Int'l Conf. Information and Knowledge Management,
pp. 293-302, 2007.
[6] A.M. Kade and C.A. Heuser, “Matching XML Documents in Highly Dynamic Applica-
tions,” Proc. ACM Symp. Document Eng. (DocEng), pp. 191-198, 2008.
[7] D. Milano, M. Scannapieco, and T. Catarci, “Structure Aware XML Object Identification,”
Proc. VLDB Workshop Clean Databases (CleanDB), 2006.
[8] S. Guha, H.V. Jagadish, N. Koudas, D. Srivastava, and T. Yu, “Approximate XML Joins,”
Proc. ACM SIGMOD Conf. Management of Data, 2002.
[9] J.C.P. Carvalho and A.S. da Silva, “Finding Similar Identities among Objects from Multiple
Web Sources,” Proc. CIKM Workshop Web Information and Data Management (WIDM),
pp. 90-93, 2003.
[10] M.A. Hernandez and S.J. Stolfo, “The Merge/Purge Problem for Large Databases,” Proc.
ACM SIGMOD Conf. Management of Data, pp. 127-138, 1995.
Place: Ashta
Dr. M. A. Kadampur
HOD Principal