
International Journal of Computer Trends and Technology (IJCTT) volume 7 number 2 Jan 2014

ISSN: 2231-2803

Deriving the Probability with Machine Learning
and Efficient Duplicate Detection in Hierarchical Data

D. Nithya, M.Phil., K. Karthickeyan, M.C.A., M.Phil., Ph.D.

M.Phil. Scholar, Department of Computer Science, Dr. SNS Rajalakshmi College of Arts & Science, Coimbatore, Tamil Nadu
Associate Professor, Department of Computer Science, Dr. SNS Rajalakshmi College of Arts & Science, Coimbatore

Abstract — Duplicate detection is an important task in data mining: it identifies whether two representations in a data set refer to the same real-world object. Real-world duplicates are multiple representations of the same real-world data object, and while duplicate detection can be applied in many settings, it is particularly important in databases. In this work, duplicate detection over a single relation is extended to hierarchically structured XML data. The existing system presents a method called XMLDup. XMLDup uses a Bayesian network to compute the conditional probability that two XML elements are duplicates, taking into account not only the information the elements contain but also the way that information is structured. Because the conditional probability values of the Bayesian network are derived manually, this approach is less efficient than a machine-learning-based one. The proposed system detects duplicates among XML data and XML objects even when the input files use different structural representations. It derives the conditional probabilities by applying support vector machine (SVM) models, whose associated learning algorithms analyse the XML duplicate data objects: given a set of XML data as input, the method predicts a conditional probability value for each data item in the hierarchical structure. The proposed SVM-based classification performs duplicate detection more efficiently and effectively.

Keywords — Duplicate detection, XML data, SVM classification, Bayesian networks, entity resolution.


I. Introduction

Duplicate detection is the important task of determining whether different representations of XML data and XML data objects refer to the same real-world object. It is an essential step in data cleaning [1, 2] and is significant for data integration [3], personal data management [4], and several other areas. The problem has been studied extensively for data stored in a single relational table with enough attributes to make meaningful comparisons. However, a great deal of data comes in more complex structures, where such conventional approaches cannot be applied. For example, an XML element may carry no text at all, yet its hierarchical relationships to other elements can provide enough information for meaningful comparisons. The problem of XML duplicate detection is particularly relevant in applications such as catalogue integration or online data processing.

This research presents duplicate detection for XML data under a hierarchical representation of data objects, rather than relying only on general top-down or bottom-up traversals, and it covers all relationship types (one-to-one, one-to-many, and many-to-one) between XML data objects in that hierarchical representation. The essential scheme was previously outlined in a poster [5]. Essentially, one object is considered to depend on another if the latter helps detect duplicates of the former. For illustration, actors help to discover duplicates among movies, based on the relationships between movies and their actors. Because such mutual dependencies can occur, detecting duplicates among some XML data helps to find duplicate XML data objects elsewhere in the database. Consequently, algorithms such as [4] use these dependencies to increase effectiveness by performing pairwise comparisons more than just once.

Methods for duplicate detection in a single relation may not be directly applicable to XML data, owing to differences between the data models [5]. For example, XML elements of the same object type may each have a different structure at the instance level, whereas tuples within a relation always have the same structure. More significantly, the hierarchical relationships in XML provide useful additional information that helps improve both the runtime and the quality of duplicate detection. This is illustrated by the two XML elements shown in Figure 1. Both represent person objects and are labeled pro.

Each of these two XML elements has children named name and date of birth (dob). These in turn nest further elements, place of birth (pob) and contacts (cnt); the contact details are represented as address (add) and email (eml), which are children of cnt. The leaf nodes of this hierarchical structure contain the actual text data. In this instance, the objective of duplicate detection is to discover that the two persons are duplicates, regardless of the differences in their data. This is done by comparing the leaf nodes, and their corresponding parent node values, across both objects.

This work argues that the hierarchical organization of XML data helps in detecting duplicate pro elements: descendant elements can be detected to be equivalent, which increases the similarity of their ancestors, and so on up the hierarchy.
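This intuition can be sketched in Python (an illustrative sketch, not the system's actual algorithm: the element names mirror Figure 1, but the string measure and the averaging rule for inner nodes are assumptions made here for illustration):

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher

def leaf_similarity(a, b):
    # String similarity of two leaf text values, in [0, 1].
    return SequenceMatcher(None, a or "", b or "").ratio()

def tree_similarity(e1, e2):
    # Bottom-up propagation: leaves are compared directly, and an
    # inner node's score is the average of its children's scores
    # (a simple illustrative choice, not XMLDup's Bayesian combination).
    c1, c2 = list(e1), list(e2)
    if not c1 and not c2:
        return leaf_similarity(e1.text, e2.text)
    scores = [tree_similarity(a, b) for a, b in zip(c1, c2)]
    return sum(scores) / len(scores) if scores else 0.0

p1 = ET.fromstring("<pro><name>John Doe</name><dob>1980-01-02</dob></pro>")
p2 = ET.fromstring("<pro><name>Jon Doe</name><dob>1980-01-02</dob></pro>")
sim = tree_similarity(p1, p2)
print(round(sim, 2))
```

Here the equivalent descendants (the identical dob leaves) raise the ancestors' score even though the name values differ slightly.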

Figure 1: XML data with tree structure

Our contributions, particularly compared to our earlier work,
can be summarized as follows:
1. We first take into consideration the differing structures of XML objects, and introduce a machine learning algorithm to derive the CP values prior to the network pruning stage.
2. We then present a novel pruning algorithm and study how the order in which nodes are processed affects runtime.
3. We obtain the conditional probabilities using SVM classification and use them to analyse the XML data objects.
4. Finally, we show how to increase efficiency when a small drop in recall, i.e., in the number of detected duplicates, is acceptable. This procedure can be tuned manually or performed automatically, using known duplicate objects from other databases.


II. Related Work

Early work in XML duplicate detection was concerned with the efficient execution of XML join operations. Guha et al. [6] proposed an algorithm to perform joins between similar elements in XML databases efficiently. They focused on computing a tree edit distance [6], which can be used within an XML join algorithm. Carvalho and Silva proposed a solution to the problem of matching Web data extracted as hierarchical trees: two hierarchical representations of person elements are compared by transforming the original data into vectors, and the similarity between elements is measured with a cosine similarity function. However, this takes only linear combinations of weighted similarities and does not take advantage of the helpful features present in XML databases. The difficulty of identifying multiple representations of the same real-world object has been addressed in a large body of work. Ironically, the problem appears under many names, such as record linkage [7], object matching [1], object consolidation [8], and reference reconciliation [4], to name just a few.
Broadly, research in duplicate detection can be categorized in two ways: methods to improve effectiveness and methods to improve efficiency. Research on the former is concerned with improving precision and recall, for example by devising sophisticated similarity measures. Examples are [4] and [9], where the associations among objects are used to improve the quality of the result. Research on the latter assumes a given similarity measure and devises algorithms that try to avoid applying the measure to all pairs of objects. An example is the sorted neighborhood method, which trades effectiveness for better efficiency by comparing only objects within a certain window. In a first variant, this ordering is used to obtain the same effectiveness as before, but faster; in a subsequent variant, efficiency is further improved, at the price of missing some duplicates.
Several approaches use a hierarchical representation for finding duplicate XML data objects in XML databases [12], [6], [13]. These works differ from earlier approaches because they were specifically designed to exploit the characteristics of XML object representations and the semantics inherent in the XML labels.





Weis and Naumann [14] proposed the DogmatiX framework, which aims at both effectiveness and efficiency in duplicate detection. The framework consists of three major steps: candidate definition, duplicate definition, and duplicate detection. While the first two provide the definitions needed for duplicate detection, the third component contains the actual algorithm, an extension to XML data of the work of Ananthakrishna et al. [15].
Hierarchical representations of XML data objects for personal data are trees built from the original personal information management (PIM) data [7] in a database. A further dimension distinguishes three methods used to perform duplicate detection: machine learning combined with similarity measures to learn which data objects are duplicates; clustering algorithms; and iterative algorithms, which repeat the process until items are detected as duplicates and are in turn aggregated into clusters using transitivity.
Earlier surveys show that data mining techniques are well suited to detecting duplicates and mining data efficiently in large databases or data warehouses, but the data cleaning and preprocessing steps require considerable time to find irrelevant data and to detect duplicates in a structured manner. The data cleaning problem arises because information from a variety of heterogeneous sources is combined to generate a single database [10].
Numerous distinct data cleaning issues have been recognized, such as dealing with missing data, managing incorrect data, and record linkage. Here we deal with one particular challenge called reference disambiguation, also known as fuzzy match or fuzzy lookup. This problem arises when entities in a database contain references to other entities. If entities were referred to by unique identifiers, disambiguating those references would be trivial.

III. Proposed Work

The proposed XML duplicate detection approach detects duplicate XML data objects, as well as duplicate data, within the structure. The representation of the XML data uses a Bayesian network to model the data and to drive the duplicate detection method [12]. This Bayesian network is a directed acyclic graph.
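As a rough sketch of this construction (the positional pairing of elements and the node labels are simplifying assumptions made here, not the exact network used by XMLDup), the DAG can be derived from two XML trees, with each node asserting that a pair of elements are duplicates and with edges leading from child pairs to their parent pair:

```python
import xml.etree.ElementTree as ET

def build_pair_dag(e1, e2, path="pro"):
    # Each DAG node models the statement "these two elements are
    # duplicates"; its probability will be conditioned on the child
    # pairs below it, so edges run from child pairs to the parent pair.
    node = {"pair": path, "children": []}
    for c1, c2 in zip(list(e1), list(e2)):
        node["children"].append(build_pair_dag(c1, c2, path + "/" + c1.tag))
    return node

a = ET.fromstring("<pro><name>John</name><cnt><eml>j@x.org</eml></cnt></pro>")
b = ET.fromstring("<pro><name>Jon</name><cnt><eml>j@x.org</eml></cnt></pro>")
dag = build_pair_dag(a, b)
print(dag["pair"], [c["pair"] for c in dag["children"]])
```

The resulting structure is acyclic by construction, since edges only follow the XML nesting.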

a. Bayesian Network Construction

In the Bayesian network model, a node in the structure denotes a variable and an edge denotes a relationship between nodes in the data objects. This work first defines the Bayesian network representation for duplicate detection, then calculates the similarity between the XML data objects at the nodes of the Bayesian network. Based on the similarity measure, it classifies duplicate data in the structure. A mapping schema maps the relationships among the XML data objects in the network; its result must first be validated to ensure a high-quality mapping.
In the Bayesian network model, the conditional probability values are otherwise set by assumption. To overcome this problem, we propose machine learning methods that automatically derive the conditional probability value for every node in the Bayesian network. This yields accurate probability values for each attribute in the XML data and efficiently detects the duplicate files.

b. Support vector machines (SVM)

SVMs are supervised learning models that analyse data and recognize patterns. A basic SVM takes a set of inputs and, from an initial learning phase, predicts a result for each input item: each item is assigned one of two outcomes, a positive or a negative class. The SVM model operates in two phases, learning and testing, assigning each item to one category or the other. An SVM model represents the examples as points in space, mapped so that the examples of the separate categories are divided by a gap that is as wide as possible. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training data point of any class, because in general the larger the margin, the lower the generalization error of the classifier. Although the original problem may be stated in a finite-dimensional space, it regularly happens that the sets to be discriminated are not linearly separable in that space.
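The decision function and margin can be illustrated in a few lines of plain Python (a toy sketch: the hyperplane w, the bias b, and the two points are invented, and no training takes place here):

```python
def decision(w, x, b):
    # f(x) = w·x + b: the sign places x on one side of the hyperplane.
    return sum(wi * xi for wi, xi in zip(w, x)) + b

w, b = [1.0, -1.0], 0.0            # assumed separating hyperplane
examples = [([3.0, 1.0], +1),      # (feature vector, true label)
            ([1.0, 3.0], -1)]

for x, y in examples:
    f = decision(w, x, b)
    margin = y * f                 # >= 1: correctly classified, outside the margin
    print(x, f, margin)
```

Maximizing the smallest such margin over the training set is what singles out the SVM hyperplane among all separating hyperplanes.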

In the SVM method, a set of XML data from the Bayesian network structure is taken as input. The conditional probability values are then derived using machine learning, and once these values are derived, the duplicate data are detected.

c. Deriving the conditional probabilities

Conditional probability 1 (CP1) denotes the probability that two nodes' values are duplicates. Under CP1, if the nodes hold duplicate values, each of their attributes is treated as a duplicate value; otherwise the nodes hold distinct original XML data.

Conditional probability 2 (CP2) denotes the probability that the descendant nodes are duplicates, and duplicates are detected from those nodes according to the corresponding attributes. If many descendant nodes in the tree have higher probability values than their parent nodes, the nodes are considered duplicates.

Conditional probability 3 (CP3) denotes the probability for both parent and child nodes together; in this step, duplicates between the two nodes' attributes are detected.


Conditional probability 4 (CP4) denotes the probability that a set of nodes of the same type are duplicates, given that every pair of individual nodes in the set is a duplicate.
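How such probabilities might be combined up the network can be sketched as follows (the two combination rules below, a product for "all parts must be duplicates" and a noisy-OR for "any matching part supports duplication", are common illustrative choices rather than XMLDup's exact conditional probability tables):

```python
from functools import reduce

def all_match(probs):
    # Conjunctive combination (CP1/CP4 flavour): every component must
    # be a duplicate, so the probabilities multiply.
    return reduce(lambda acc, p: acc * p, probs, 1.0)

def any_match(probs):
    # Noisy-OR (disjunctive flavour): duplicate if at least one
    # component matches.
    return 1.0 - reduce(lambda acc, p: acc * (1.0 - p), probs, 1.0)

value_probs = [0.9, 0.8]                 # assumed per-value duplicate probabilities
print(round(all_match(value_probs), 2))  # both values must match
print(round(any_match(value_probs), 2))  # any value may match
```

The conjunctive rule is the stricter of the two, which is why it suits "all attributes duplicate" conditions while the noisy-OR suits optional evidence.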

Algorithm steps for conditional probability values

Input: the training samples, with the XML data structure constructed from the BN, w = {w1, ..., wn}, as the input file for SVM classification.
Output: the classification result and the predicted conditional probability values CP1, CP2, CP3, CP4.

1. Procedure SVM(X) // X: training data from the BN result
2. Begin
3. Initialize C = 0 // C holds the conditional probability class labels (positive or negative), initially empty
4. Get input file X for training // the BN result serves as the training examples; the CP values are predicted as output
5. Read the number of XML data structures constructed from the BN result
6. x·w + b = 0 // the separating hyperplane: x is the feature matrix built from the BN, w the weight matrix, and b the bias added to their product to give the class value
7. x·w + b = ±1 // these equations mark the classifier margin, which bounds the soft margin on either side
8. Decision function f(x) = x·w + b // decides the class label for each SVM training example
9. If f(x) ≥ 1, then x is labeled as the first class
10. Else, if f(x) ≤ -1, then x is labeled as the second class
11. For each of the i = 1, ..., n predictions, verify the classification in the testing phase by checking
12. y_i (x_i·w + b) ≥ 1 // if this holds, the example is correctly classified and its conditional probability value is accepted
13. Display the result // finally, display the conditional probability values of the classification result
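A runnable sketch of these classification steps (the feature vectors, weights, and bias below are invented for illustration; a real implementation would learn w and b from the training data rather than fix them):

```python
def f(w, x, b):
    # Decision function: f(x) = x·w + b.
    return sum(xi * wi for xi, wi in zip(x, w)) + b

# Toy BN-derived feature vectors with known labels (+1 duplicate, -1 not).
samples = [([0.9, 0.8], +1), ([0.2, 0.1], -1), ([0.95, 0.7], +1)]
w, b = [2.0, 2.0], -2.0          # assumed learned weight vector and bias

predictions = []
for x, y in samples:
    score = f(w, x, b)
    pred = +1 if score >= 1 else -1   # margin-based class labels
    assert y * score >= 1             # margin check: y(x·w + b) >= 1
    predictions.append(pred)

print(predictions)
```

The margin assertion plays the role of the testing-phase verification: every training example must lie on the correct side of its margin.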

d. Network Pruning for BN

To reduce BN evaluation time, we propose a lossless pruning approach: object pairs incapable of reaching a given duplicate probability threshold are simply discarded. As stated earlier, network evaluation is performed by propagating the prior probabilities, in a bottom-up fashion, until the topmost node is reached. The prior probabilities are obtained by applying a similarity measure to the content of the leaf nodes. Computing such similarities is the most expensive operation in network evaluation, and in the duplicate detection procedure in general. Consequently, the idea behind our pruning proposal is to avoid computing prior probabilities unless they are strictly required. Every similarity value is assumed to be 1 before the pruning step proceeds; the design is to maintain, at each step of the procedure, an estimate of the final probability. Whenever a new similarity is evaluated, the estimate takes into account the already-computed similarities while the still-unknown similarities remain set to 1.
There are two challenges in setting the pruning strategy's compensation by means of a pruning factor smaller than one. First, different attributes in an XML object have different characteristics and therefore different pruning factors. Second, fine-tuning the factors manually can be a complex task, especially if the user has little knowledge of the database, so the system should be able to compute all pruning factors automatically. This improves runtime in order to optimize efficiency, while minimizing the loss in effectiveness.

Algorithm 2: NetworkPruning(N, T)

Require: the node N for which to compute the duplicate probability score (derived from the SVM classification); the threshold T (node pairs whose current score falls below T are considered non-duplicates).
Ensure: the duplicate probability of the XML nodes represented by N, also derived from the SVM.

1: L ← getParentNodes(N) {Get the ordered list of parent nodes}
2: parentScore[n] ← 1 for all n ∈ L {Maximum probability of each parent node}
3: currentScore ← 0
4: for each node n in L do {Compute the duplicate probability}
5:   if n is a value node then
6:     score ← getSimilarityScore(n) {For value nodes, compute the similarities}
7:   else
8:     newThreshold ← getNewThreshold(T, parentScore)
9:     score ← NetworkPruning(n, newThreshold)
10:  end if
11:  parentScore[n] ← score
12:  currentScore ← getProbability(N, parentScore) {Update the estimated duplicate probability}
13:  if currentScore < T then
14:    end network evaluation {Prune: N cannot reach the threshold}
15:  end if
16: end for
17: return currentScore
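A minimal executable version of the pruning recursion, under simplifying assumptions (scores are combined by a plain product, unevaluated children are optimistically scored 1, the threshold is passed down unchanged, and a toy tree stands in for the Bayesian network):

```python
def network_pruning(node, threshold):
    # node is ("value", similarity) for a leaf, or ("inner", children).
    # Returns the duplicate probability, stopping early (pruning)
    # once the optimistic upper bound falls below the threshold.
    kind, payload = node
    if kind == "value":
        return payload
    score = 1.0                    # unevaluated children assumed to score 1
    for child in payload:
        score *= network_pruning(child, threshold)
        if score < threshold:      # the bound can only shrink further,
            return score           # so the remaining children are skipped
    return score

tree = ("inner", [("value", 0.9), ("value", 0.5), ("value", 0.8)])
print(network_pruning(tree, 0.5))  # stops after two leaves: 0.9 * 0.5 < 0.5
```

Because the running product is an upper bound on the final score, stopping early never discards a pair that could still reach the threshold, which is what makes the pruning lossless.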

IV. Experimental Results

In this work we evaluate the performance of the proposed SVM-based machine learning method, and then evaluate the proposed XML duplicate detection system with SVM and our pruning optimization, with the pruning threshold adjusted according to the SVM result. The experiments compare the duplicate detection outcomes.
They show that the proposed XMLDup detection yields lower error values than duplicate detection over XML data objects without it, with no considerable degradation of the results, and it performs effectively even when dealing with reasonable amounts of missing data. Using attributes that carry little information can decrease the system's effectiveness. The efficiency tests revealed that our network pruning approach benefits from considering nodes in a particular order, given by a predefined ordering heuristic.
The experimental results were evaluated by measuring the precision and recall of duplicate detection in several domains. The proposed SVM-based pruning with a Bayesian network achieves higher precision and recall in XML duplicate detection, regarding both efficiency and effectiveness. The efficiency of the proposed system is shown in Figure 2: the enhanced approach, indicated by the blue line, achieves better precision and recall than the red line of the base work.

Figure 2: Performance comparison vs parameters
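The precision and recall reported above can be computed in the usual way; a short sketch with invented detection results (the pairs below are illustrative, not the paper's data):

```python
def precision_recall(detected, true_dups):
    # detected: pairs the system flagged; true_dups: actual duplicate pairs.
    detected, true_dups = set(detected), set(true_dups)
    tp = len(detected & true_dups)               # true positives
    precision = tp / len(detected) if detected else 0.0
    recall = tp / len(true_dups) if true_dups else 0.0
    return precision, recall

detected = [(1, 2), (3, 4), (5, 6)]              # assumed system output
true_dups = [(1, 2), (3, 4), (7, 8), (9, 10)]    # assumed gold standard
p, r = precision_recall(detected, true_dups)
print(round(p, 2), round(r, 2))
```

Trading recall for efficiency, as the pruning factor does, lowers the second number while ideally leaving the first untouched.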


V. Conclusion

In this work, the conditional probabilities used by the network pruning technique are derived using learning methods, which yields more accurate XMLDup detection results than general methods. The SVM machine learning algorithm derives the conditional probability values automatically, instead of using generic probabilities. BN pruning is then performed to eliminate duplicate XML data and XML data objects. The model remains highly flexible, and different kinds of probability values can be derived from the SVM. To improve the runtime efficiency of XMLDup, a network pruning strategy with SVM is also presented; moreover, this approach can be performed automatically, without user involvement, and produces efficient results for duplicate detection.
Future work will extend the BN model structure algorithm to additional types of machine learning and optimization algorithms, such as bee colony, artificial immune system, and BAT algorithms, to derive the conditional probability values.
References

1. H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C. Saita, "Declarative data cleaning: Language, model, and algorithms," in Proc. Int'l Conf. Very Large Databases (VLDB), Rome, Italy, 2001, pp. 371-380.
2. E. Rahm and H. H. Do, "Data cleaning: Problems and current approaches," IEEE Data Engineering Bulletin, vol. 23, pp. 3-13, 2000.
3. A. Doan, Y. Lu, Y. Lee, and J. Han, "Object matching for information integration: A profiler-based approach," IEEE Intelligent Systems, pp. 54-59, 2003.
4. X. Dong, A. Halevy, and J. Madhavan, "Reference reconciliation in complex information spaces," in Proc. Int'l Conf. Management of Data (SIGMOD), Baltimore, MD, 2005.
5. M. Weis and F. Naumann, "Detecting duplicates in complex XML data," in Proc. Int'l Conf. Data Engineering (ICDE), Atlanta, Georgia, 2006.
6. D. Milano, M. Scannapieco, and T. Catarci, "Structure aware XML object identification," in Proc. VLDB Workshop on Clean Databases (CleanDB), 2006.
7. W. E. Winkler, "Advanced methods for record linkage," Technical report, Statistical Research Division, U.S. Census Bureau, Washington, DC, 1994.
8. Z. Chen, D. V. Kalashnikov, and S. Mehrotra, "Exploiting relationships for object consolidation," in Proc. SIGMOD Workshop on Information Quality in Information Systems, Baltimore, MD, 2005.
9. M. Weis and F. Naumann, "Duplicate detection in XML," in Proc. SIGMOD Workshop on Information Quality in Information Systems, Paris, France, 2004, pp. 10-19.
10. D. V. Kalashnikov and S. Mehrotra, "Domain-independent data cleaning via analysis of entity-relationship graph," ACM Trans. Database Systems, vol. 31, no. 2, pp. 716-767, 2006.
11. L. Leitão, P. Calado, and M. Herschel, "Efficient and effective duplicate detection in hierarchical data," IEEE Trans. Knowledge and Data Engineering, vol. 25, no. 5, pp. 1028-1041, May 2013.
12. L. Leitão, P. Calado, and M. Weis, "Structure-based inference of XML similarity for fuzzy duplicate detection," in Proc. 16th ACM Int'l Conf. Information and Knowledge Management (CIKM), 2007, pp. 293-302.
13. S. Puhlmann, M. Weis, and F. Naumann, "XML duplicate detection using sorted neighborhoods," in Proc. Conf. Extending Database Technology (EDBT), 2006, pp. 773-791.
14. M. Weis and F. Naumann, "DogmatiX tracks down duplicates in XML," in Proc. ACM SIGMOD Conf. Management of Data, 2005, pp. 431-442.
15. R. Ananthakrishna, S. Chaudhuri, and V. Ganti, "Eliminating fuzzy duplicates in data warehouses," in Proc. Conf. Very Large Databases (VLDB), 2002, pp. 586-597.