A Survey On Large Scale Schema and Ontology Matching Techniques
1 INTRODUCTION
In general, a relational or XML database structure is called a schema, and a database represented in a semantic language such as the Web Ontology Language (OWL) or the Resource Description Framework (RDF) is called an ontology. A matching technique finds semantic correspondences among the Cartesian product of entity pairs of the two given ontologies. These correspondences may stand for an equivalence, subsumption, or disjointness relation between the ontology entities. The result of a matching technique is a set of semantic correspondences called an alignment. Fig. 1 depicts the general framework for schema matching techniques, which is also applicable to ontology matching techniques. As shown in the figure, the input of the matching technique is two schemas, which are first imported into a format suitable for processing. For faster computation the schemas may be preprocessed, e.g., by preparing a neighbour list for each entity to speed up the structural similarity calculation, or by removing redundant information from the schema. The similarity value is calculated for every entity pair in the Cartesian product, one entity from each of the two preprocessed schemas, using a match workflow consisting of a set of matchers (e.g., lexical, structural and semantic matchers). The similarity value ranges from 0 to 1: a value of 1 indicates an equivalence relation and a value of 0 a disjoint relation. The matcher workflow can be sequential, parallel, iterative or mixed. The similarity values obtained from the match workflow are aggregated to obtain the final alignment. For large-scale schema matching the workflow varies slightly to incorporate techniques such as early pruning and partitioning of the schema to reduce the search space, parallelization, etc.

According to Shvaiko and Euzenat [11], one of the toughest challenges for matching systems is handling large-scale schemas or ontologies, and according to Rahm [12] the domains where large-scale matching is necessary include e-business [13], web data [14], life science ontologies and medicine [15], and web directories [16]. Despite the advances made in current matching systems, they are still unable to achieve both good effectiveness (correctness and completeness of the alignment) and good efficiency (time and space efficiency). Effectiveness is measured by precision and recall, while efficiency is measured by execution time and memory used.

The remainder of this paper is organized as follows: Section 2 discusses various large-scale matching techniques. Section 3 gives a brief comparison of the large-scale matching techniques. Finally, Section 4 concludes the survey of ontology matching techniques.

Mrs. K. Saruladha is with the Computer Science Department, Pondicherry Engineering College, Puducherry, Pin 605014, India. Dr. G. Aghila is with the Computer Science Department, Pondicherry University, Puducherry, Pin 605014, India. Ms. B. Sathiya is with the Computer Science Department, Pondicherry Engineering College, Puducherry, Pin 605014, India.

[Fig. 1 diagram: Input Schemas, Preprocessor (one per schema), Similarity Computation, Similarity Aggregation, Alignments]
Fig. 1 General Framework for Matching Process
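The general framework and the effectiveness measures described in the introduction can be illustrated with a minimal sketch. The two matchers, the unweighted mean used for aggregation, the threshold of 0.7 and the sample schemas are all assumptions chosen for illustration; they are not taken from any of the surveyed systems.

```python
from difflib import SequenceMatcher
from itertools import product


def lexical_sim(a: str, b: str) -> float:
    """String similarity in [0, 1] based on difflib's matching-block ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_sim(a: str, b: str) -> float:
    """Jaccard overlap of underscore-separated tokens, also in [0, 1]."""
    ta, tb = set(a.lower().split("_")), set(b.lower().split("_"))
    return len(ta & tb) / len(ta | tb)


def match(source, target, matchers, threshold=0.7):
    """Score every pair of the Cartesian product and keep pairs above threshold."""
    alignment = set()
    for s, t in product(source, target):
        # Aggregation: unweighted mean of all matcher scores (an assumption).
        score = sum(m(s, t) for m in matchers) / len(matchers)
        if score >= threshold:
            alignment.add((s, t))
    return alignment


def precision_recall(found, reference):
    """Precision = correct / found, recall = correct / reference."""
    correct = len(found & reference)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical input schemas and a hand-made reference alignment.
source = ["author_name", "book_title", "price"]
target = ["writer_name", "title", "price", "isbn"]
reference = {("author_name", "writer_name"),
             ("book_title", "title"),
             ("price", "price")}

found = match(source, target, [lexical_sim, token_sim])
precision, recall = precision_recall(found, reference)
```

Because every pair of the Cartesian product is scored, the cost grows with the product of the two schema sizes, which is the motivation for the search space reduction techniques surveyed in this paper.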
JOURNAL OF COMPUTING, VOLUME 3, ISSUE 10, OCTOBER 2011, ISSN 2151-9617 HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING WWW.JOURNALOFCOMPUTING.ORG
… pruned using heuristic strategies to reduce the search space constructed in the previous step. The selected matching pairs are then processed for similarity computation and interpretation. Once again the system has to determine which matching pairs to add to the agenda for the next iteration. The heuristic strategies are based on the labels of entities, the neighbours of entities aligned in the previous iteration, the hierarchy of entities, random picks, or a combination of these strategies.

Similarity Computation. The similarity between the selected entity pairs can be measured by a wide range of similarity functions such as Levenshtein's edit distance, the Dice coefficient and cosine similarity. Any of these functions can be chosen based on the features of the entity.

Similarity Aggregation. QOM uses a weighted average of the similarity values for a given entity pair. The weight of each similarity value is calculated by a sigmoid function, which emphasizes high individual similarities and de-emphasizes low individual similarities.

Interpretation. First, matching pairs with a similarity value less than a threshold are discarded. Next, pairs which violate the bijectivity (one-to-one and onto constraint) of the matching are discarded.

Iteration. The initial iteration uses lexical knowledge, while later iterations use ontology structure knowledge for matching. Based on the authors' experiments, ten iterations are required irrespective of ontology size.

Based on the evaluation, QOM is nearly twenty times faster than NOM. The drawback of QOM is that its effectiveness is marginally lower than that of NOM, since not all entity pairs are evaluated.

2.1.2 Peukert et al. Method
Peukert et al. proposed a schema and ontology matching algorithm. It is a rule-based optimization technique which rewrites the match workflow to improve performance.
The filter operators within the match workflow eliminate dissimilar entity pairs (those whose similarity value is less than some threshold) from intermediate match results, thus reducing the search space for subsequent matchers. The threshold is either statically predetermined or dynamically derived from the similarity threshold used in the previous match workflow to select the alignment.

The input of this process is XML schemas, metamodels or ontologies. These input formats are converted into a standard internal format. The user must design the matching process, which is represented as a graph whose vertices represent matchers from the library and whose edges represent execution order and data flow. Since the graph is designed by the user, the match workflow can be parallel, serial or iterative. The user can utilize the rewrite recommendation system to increase the efficiency of the matching workflow. This rewrite recommendation system is similar to cost-based rewrites in database query optimization; e.g., the use of rewrite rules is similar to the use of predicate push-down rules in database query optimization, which reduce the number of input tuples for joins and other expensive operators. A simple cost model is used to choose which parts of a matching workflow to rewrite. These rewrites can be accepted or rejected by the user. Once the matching process graph is finalised, it can be executed to obtain the alignment.

The drawbacks of this method are as follows. The rewrite rules focus only on efficiency improvements, not on effectiveness improvements. The matcher library consists of only a small number of matchers.
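The effect of a filter operator placed between two matchers can be sketched as follows. This is only an illustration under stated assumptions: the bigram Dice matcher, the use of difflib as a stand-in for an expensive matcher, both thresholds and the sample element names are hypothetical and are not the operators or rewrite rules of Peukert et al.

```python
from difflib import SequenceMatcher
from itertools import product


def cheap_matcher(a: str, b: str) -> float:
    """Inexpensive first-stage matcher: character-bigram Dice coefficient."""
    ga = {a[i:i + 2] for i in range(len(a) - 1)}
    gb = {b[i:i + 2] for i in range(len(b) - 1)}
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def expensive_matcher(a: str, b: str) -> float:
    """Stand-in for a costly matcher (here simply difflib's ratio)."""
    return SequenceMatcher(None, a, b).ratio()


def filter_pairs(scored, threshold):
    """Filter operator: drop pairs whose similarity is below the threshold."""
    return {pair: s for pair, s in scored.items() if s >= threshold}


def run_workflow(source, target, filter_threshold=0.3, select_threshold=0.8):
    # Stage 1: cheap matcher over the full Cartesian product.
    stage1 = {(s, t): cheap_matcher(s, t) for s, t in product(source, target)}
    # Filter: only surviving pairs reach the expensive matcher.
    survivors = filter_pairs(stage1, filter_threshold)
    # Stage 2: expensive matcher on the reduced search space.
    stage2 = {pair: expensive_matcher(*pair) for pair in survivors}
    alignment = filter_pairs(stage2, select_threshold)
    return alignment, len(stage1), len(survivors)


# Hypothetical schema element names.
source = ["customer", "order_date", "zip"]
target = ["customer", "orderdate", "postal_code", "phone"]
alignment, total_pairs, evaluated_pairs = run_workflow(source, target)
```

Comparing `evaluated_pairs` with `total_pairs` shows how many invocations of the second matcher the filter saves. A filter threshold that is too high would also discard true correspondences, which is the efficiency versus effectiveness trade-off discussed in this survey.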
2.2.2 Falcon-AO
Falcon-AO is an ontology matching tool which aims at finding alignments between two given OWL/RDF ontologies. The matcher library of Falcon-AO consists of V-Doc [17] and I-Sub [19], which are lightweight linguistic matchers; GMO [18], an iterative structural matcher; and PBM (Partition-Based Matcher), which adopts a divide-and-conquer strategy to find block mappings between large-scale ontologies. The three phases of PBM are explained below.

Partitioning ontologies: Structural proximities between entities are calculated based on how closely they are related in the hierarchies. The ontology clusters are formed based on structural proximities using a modified ROCK [20] clustering algorithm. RDF blocks are constructed from the clusters by assigning each RDF triple to a cluster in which at least two of its entities are contained.

Matching blocks: Alignments with high similarity are referred to as anchors. A lightweight string comparison technique, I-Sub, is first employed to find anchors between the two full ontologies, and then the blocks from the two ontologies are matched based on the distribution of the anchors.

Discovering alignments: V-Doc adopts a linguistic approach to ontology matching. GMO is an iterative structural matcher. Similarity combination is a heuristic strategy to tune the thresholds of the above two matchers.

The drawbacks of this method are as follows. The entire ontology needs to be processed to find anchors, and thus the efficiency of Falcon-AO is reduced. The maximum number of entities in a cluster is determined by the GMO matcher, and the clustering algorithm terminates abruptly if it reaches this maximum, which leads to poor clustering.

2.2.3 Taxomap
Taxomap is an ontology matching algorithm consisting of two partitioning algorithms, Partition-Anchor-Partition (PAP) and Anchor-Partition-Partition (APP), which have been designed to take the alignment objective into account in the partitioning process. The more structured ontology is referred to as the target ontology and the less structured one as the source ontology.
PAP is suitable for matching a structured ontology against an unstructured ontology, and APP is suitable for matching two structured ontologies. An entity pair, one entity from each of the two ontologies, which has identical labels is called an anchor; anchors are used in both PAP and APP. The alignment is based on lexical and structural (subclass) similarity measures.

Partition-Anchor-Partition (PAP):
1. Use the PBM matcher to partition the target ontology into a set of blocks.
2. Identify the set of anchors between the two ontologies. This set will be the centers of the future blocks which will be generated from the source ontology.
3. Use the PBM matcher to partition the source ontology around the identified centers.
4. Align each block with the corresponding block.
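Step 2 of PAP can be sketched as below, assuming the target blocks of step 1 are already available. The label normalization, the entity identifiers, the block names and the sample labels are hypothetical; the PBM partitioning itself is not reproduced here.

```python
from collections import defaultdict


def normalize(label: str) -> str:
    """Illustrative normalization: case-fold and collapse whitespace."""
    return " ".join(label.lower().split())


def find_anchors(source_labels, target_labels):
    """Return (source_id, target_id) pairs whose normalized labels are identical."""
    by_label = defaultdict(list)
    for tid, label in target_labels.items():
        by_label[normalize(label)].append(tid)
    return [(sid, tid)
            for sid, label in source_labels.items()
            for tid in by_label.get(normalize(label), [])]


def centers_per_block(anchors, target_blocks):
    """Group anchored source entities by the target block holding their partner."""
    block_of = {tid: name
                for name, members in target_blocks.items()
                for tid in members}
    centers = defaultdict(set)
    for sid, tid in anchors:
        centers[block_of[tid]].add(sid)
    return dict(centers)


# Hypothetical labels; target_blocks stands for the output of step 1.
source_labels = {"s1": "Heart", "s2": "Left  Ventricle", "s3": "Femur", "s4": "Aorta"}
target_labels = {"t1": "heart", "t2": "left ventricle", "t3": "femur", "t4": "tibia"}
target_blocks = {"cardio": {"t1", "t2"}, "skeletal": {"t3", "t4"}}

anchors = find_anchors(source_labels, target_labels)
centers = centers_per_block(anchors, target_blocks)
```

Each set in `centers` would then serve as the seed around which the corresponding source block is grown in step 3. Entities without an identical label in the other ontology, such as "Aorta" here, contribute no anchor, which is why the method depends on label overlap.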
Anchor-Partition-Partition (APP):
1. Identify the set of anchors across the ontologies.
2. Partition both the target ontology and the source ontology by modifying the PBM matcher to take shared anchors into account.
3. Align blocks that share the maximal number of anchors.

The drawbacks of this method are as follows. The effectiveness of this method depends on the availability of identical labels across the ontologies. Only labels and the hierarchy structure are used for matching, and hence recall is comparatively low.

2.2.4 AnchorFlood
AnchorFlood is an ontology matching algorithm which implements dynamic partition-based matching that avoids a priori partitioning of the ontologies. The similarity measure for alignment is based on internal structure, external structure and the Jaro-Winkler string distance. The process starts with a set of anchors, where an anchor is defined as an exact string match of a pair of concepts, properties or instances.

The algorithm preprocesses the ontologies to normalize the textual contents of entities. It starts with an anchor and collects the neighbouring entities of the chosen anchor across the ontologies. The segment pair constructed from the collected neighbours is processed to produce alignment pairs. The process repeats until either all the collected entities are explored or no new aligned pair is found. An anchor is discarded as a misalignment if its constructed segment does not produce sufficient alignments.

The drawbacks of this method are as follows. Semantic similarity between entities is not explored. It ignores certain distantly placed aligned pairs, since segments are constructed from the neighbours of anchors. Only exact string matches of concepts, properties and instances are considered as anchors, which leads to inefficient alignments.

… algorithm has been proposed by this system, leading to better load balancing, limited memory consumption and scalability without reducing the effectiveness of the match results.

The system first generates the context attributes for each entity. Then each ontology is partitioned into a set of partitions using the size-based partitioning algorithm, achieving intra-matcher parallelism. Partition pairs are then constructed, one partition from each of the two ontologies. Each partition pair is assigned to a processor. Within each processor the partition pair is processed in parallel by the element-level, structure-level and instance-based matchers so as to achieve inter-matcher parallelism. The alignments from all the matchers are aggregated to output the final alignment.

The drawbacks of this method are as follows. The matcher library consists of only a small number of matchers. The partitioning algorithm uses only simple strategies, which can be improved.
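The partition-pair scheme described above can be sketched with Python's standard thread pool. The chunking function, the single string matcher, the threshold and the sample entity names are assumptions made for illustration, and only intra-matcher parallelism is shown. Threads are used to keep the sketch self-contained; for CPU-bound matchers a process pool or separate machines would be the more realistic choice.

```python
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher
from itertools import product


def size_based_partitions(entities, max_size):
    """Split an entity list into consecutive chunks of at most max_size."""
    return [entities[i:i + max_size] for i in range(0, len(entities), max_size)]


def match_partition_pair(pair, threshold=0.85):
    """Match one (source partition, target partition) pair with a string matcher."""
    src, tgt = pair
    return [(s, t) for s, t in product(src, tgt)
            if SequenceMatcher(None, s.lower(), t.lower()).ratio() >= threshold]


def parallel_match(source, target, max_size=2, workers=4):
    """Build all partition pairs, match them concurrently and merge the results."""
    pairs = list(product(size_based_partitions(source, max_size),
                         size_based_partitions(target, max_size)))
    # Each partition pair is an independent task.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(match_partition_pair, pairs)
    # Aggregation: union of the correspondences found in every partition pair.
    return sorted({c for part in results for c in part})


# Hypothetical entity labels.
source = ["Heart", "Lung", "Kidney", "Liver"]
target = ["heart", "lungs", "liver", "spleen"]
alignment = parallel_match(source, target)
```

Since every source partition is paired with every target partition, the merged result equals what a single pass over the full Cartesian product would return; the gain lies in bounded memory per task and in concurrent execution.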
* Not available. ** Optional.

The drawback of RiMOM is its inefficiency in dealing with large-scale ontologies. Even though it shows very good effectiveness for large-scale ontologies, it consumes a long time and a large amount of memory, since it does not incorporate search space reduction techniques such as early pruning, partitioning of the ontology or parallelization.

ASMOV (Automated Semantic Matching of Ontologies with Verification) [9] is an iterative ontology matching algorithm. The strength of ASMOV lies in the post-processing of the alignment to remove alignments which are semantically inconsistent. It uses WordNet and the Unified Medical Language System (UMLS) to increase effectiveness. The input ontologies should be in OWL-DL format. The input of ASMOV is two ontologies and an optional predetermined alignment set. The lexical, structural and instance-based similarity values between all possible pairs of entities, one from each of the two ontologies, are calculated, and the optional input alignment is used to adjust the calculated measures. This step yields a similarity matrix containing the calculated similarity value for every pair of entities. For each entity, the maximum similarity value in the similarity matrix is chosen as the pre-alignment.

This pre-alignment must undergo a process of semantic verification, which is an extensive post-processing step to eliminate potential inconsistencies among the set of pre-alignments. Five different kinds of inconsistencies are checked. One such inconsistency rule is multiple-entity correspondences; e.g., if two alignments (a, b) and (b, c) exist, then there must also be an alignment (a, c); if not, the above two alignments cannot be verified and are hence removed. The output of this step is the semantically verified similarity matrix, which is then tested against a termination condition. If this condition is true, no further iteration is needed and the process stops. The resulting alignment is the final alignment set.

The drawbacks of this system are as follows. When the two given ontologies are dissimilar, the effectiveness decreases. Sometimes the semantic verification eliminates too many alignments, leading to lower recall. For each iteration ASMOV needs polynomial time, which leads to inefficiency.

AgreementMaker [10] is a schema and ontology matching algorithm consisting of a wide range of matchers for lexical and structural features of the ontology. It provides both serial and parallel matcher workflows. The strength of the system lies in its GUI, which enables the user to choose, control and execute the iterative matchers and their results. Through the GUI the user can choose matchers from the matcher library based on the matching granularity (element-wise, structure-wise and instance-wise), the dominant features of the input schemas, etc.

The input ontology can be in XML, RDFS, OWL, or N3. The system consists of three layers. The first matcher layer uses the entity features (e.g., label, comments, annotations, and instances) and computes the similarity value using syntactic and lexical matchers. The resulting similarity values are combined using a weighted average. The second matcher layer uses structural properties of the ontologies to compute the similarity value. Finally, the third matcher layer combines the similarity values of the first and second matcher layers.

The drawback of this system is that the end user should be a sophisticated domain expert, because choosing, controlling and executing the matchers requires domain knowledge.
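The three-layer organisation described for AgreementMaker can be illustrated with a small sketch. Everything concrete in it is an assumption: the two feature matchers, the parent-label comparison standing in for a structural matcher, all weights, the entity records, and the use of a weighted average in the third layer (the text above specifies that method only for the first layer).

```python
from difflib import SequenceMatcher


def weighted_average(scores, weights):
    """Combine similarity values with a weighted average."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


def label_sim(a, b):
    """Syntactic matcher on entity labels."""
    return SequenceMatcher(None, a["label"].lower(), b["label"].lower()).ratio()


def comment_sim(a, b):
    """Lexical matcher: Jaccard overlap of the words in the two comments."""
    wa = set(a["comment"].lower().split())
    wb = set(b["comment"].lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def first_layer(a, b, weights=(0.7, 0.3)):
    """Layer 1: feature-based matchers combined by weighted average."""
    return weighted_average([label_sim(a, b), comment_sim(a, b)], weights)


def second_layer(a, b):
    """Layer 2: stand-in structural matcher comparing the parents' labels."""
    return SequenceMatcher(None, a["parent"].lower(), b["parent"].lower()).ratio()


def third_layer(a, b, weights=(0.6, 0.4)):
    """Layer 3: combine the results of layers 1 and 2 (weighted average assumed)."""
    return weighted_average([first_layer(a, b), second_layer(a, b)], weights)


# Hypothetical entity records.
car_a = {"label": "Car", "comment": "a road vehicle", "parent": "Vehicle"}
car_b = {"label": "car", "comment": "a road vehicle", "parent": "vehicle"}
bank = {"label": "Bank", "comment": "financial institution", "parent": "Organization"}

same = third_layer(car_a, car_b)
different = third_layer(car_a, bank)
```

In a layered design of this kind the weights of each layer are the parameters a user would tune, which is consistent with the observation above that operating the system requires domain knowledge.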
To show the state of the art in current large-scale matching techniques, we briefly compare the matching prototypes discussed above that have been successfully applied to large-scale matching problems. The OAEI (Ontology Alignment Evaluation Initiative) is a benchmark competition for evaluating newly proposed matching systems, and the comparison also indicates which prototypes have participated in OAEI. Table 1 provides a brief comparison of ten match prototypes.
4 CONCLUSION
This survey has discussed the various matching techniques that can be used for large-scale matching. Its aim is to give a broad overview of the techniques used to increase efficiency, such as early pruning strategies, partitioning of the ontology or schema, and parallelization of the matching process, as well as the techniques used to increase effectiveness, such as post-processing of alignments, dynamic matcher selection, and full user control of the match workflow through a GUI. The survey shows that there is always a trade-off between good efficiency and good effectiveness in the existing matching systems. Hence, we are working on a large-scale matching technique which will combine the advantages of the methods discussed in this paper to achieve both good efficiency and good effectiveness, and we will test it with ontologies of a particular domain.
REFERENCES
[1] Ehrig M, Staab S, Quick ontology matching, Proceedings of the International Semantic Web Conference (ISWC). LNCS, vol. 3298. Springer, Heidelberg, pp. 683-697, 2004.
[2] Peukert E, Berthold H, Rahm E, Rewrite techniques for performance optimization of schema matching processes, Proceedings of 13th international conference on extending database technology (EDBT). ACM, NY, pp. 453-464, 2010.
[3] Hu W, Qu Y, Cheng G, Matching large ontologies: A divide-and-conquer approach, Journal of Data Knowledge Engineering, vol. 67, no. 1, pp. 140-160, 2008.
[4] Hamdi F, Safar B, Reynaud C, Zargayouna H, Alignment-based partitioning of large-scale ontologies, Advances in knowledge discovery and management, 1st ed. New York: Springer Heidelberg, pp. 251-269, 2009.
[5] Do HH, Rahm E, Matching large schemas: Approaches and evaluation, Journal of Information System, vol. 3, no. 6, pp. 857-885, 2007.
[6] Hanif MS, Aono M, An efficient and scalable algorithm for segmented alignment of ontologies of arbitrary size, Journal of Web Semantics, vol. 7, no. 4, pp. 344-356, 2009.
[7] Gross A, Hartung M, Kirsten T, Rahm E, On matching large life science ontologies in parallel, Proceedings of 7th international conference on data integration in the life sciences (DILS). LNCS, vol. 6254. Springer, Heidelberg, 2010.
[8] Li J, Tang J, Li Y, Luo Q, RiMOM: A dynamic multistrategy ontology alignment framework, IEEE Trans Knowl Data Eng, vol. 21, no. 8, pp. 1218-1232, 2009.
[9] Jean-Mary YR, Shironoshita EP, Kabuka MR, Ontology matching with semantic verification, J Web Sem, vol. 7, no. 3, pp. 235-251, 2009.
[10] Cruz IF, Antonelli FP, Stroe C, AgreementMaker: Efficient matching for large real-world schemas and ontologies, PVLDB, VLDB Endowment, vol. 2, no. 2, pp. 1586-1589, 2009.
[11] Shvaiko P, Euzenat J, Ten challenges for ontology matching, Confederated International Conference on the Move to Meaningful Internet Systems, pp. 1164-1182, 2008.
[12] Rahm E, Towards large-scale schema and ontology matching, Schema matching and mapping, Bellahsene Z, Bonifati A, Rahm E, eds. New York: Springer Heidelberg, pp. 3-27, 2011.
[13] Smith K, Morse M, Mork P et al., The role of schema matching in large enterprises, Proceedings of CIDR, 2009.
[14] Su W, Wang J, Lochovsky FH, Holistic schema matching for web query interfaces, Proceedings of international conference on extending database technology (EDBT). Springer, Heidelberg, pp. 77-94, 2006.
[15] Kirsten T, Thor A, Rahm E, Instance-based matching of large life science ontologies, Proceedings of data integration in the life sciences (DILS). LNCS, Springer, Heidelberg, vol. 4544, pp. 172-187, 2007.
[16] Avesani P, Giunchiglia F, Yatskevich M, A large scale taxonomy mapping evaluation, Proceedings of the International Semantic Web Conference (ISWC). LNCS, vol. 3729. Springer, Heidelberg, pp. 67-81, 2005.
[17] Qu Y, Hu W, Cheng G, Constructing virtual documents for ontology matching, Proceedings of the 15th International World Wide Web Conference, ACM Press, pp. 23-31, 2006.
[18] Hu W, Jian N, Qu Y, Wang Y, GMO: a graph matching for ontologies, Proceedings of the K-CAP Workshop on Integrating Ontologies, pp. 41-48, 2005.
[19] Stoilos G, Stamou G, Kollias S, A string metric for ontology alignment, Proceedings of the 4th International Semantic Web Conference, LNCS, Springer, vol. 3729, pp. 624-637, 2005.
[20] Guha S, Rastogi R, Shim K, ROCK: a robust clustering algorithm for categorical attributes, Proceedings of the 15th International Conference on Data Engineering, pp. 512-521, 1999.
K. Saruladha is working as Assistant Professor at Pondicherry Engineering College, India. She has a total of 21 years of teaching experience. She graduated from Pondicherry University, Puducherry, India. She is a member of the Indian Society of Technical Education, India. She has published nearly 20 research papers in distributed computing, information security and ontology-based information retrieval. She is currently pursuing her Ph.D. in ontology-based information retrieval systems.

Dr. G. Aghila is working as Professor at Pondicherry University, India, and has a total of 21 years of teaching experience. She graduated from Anna University, Chennai, India. She has published nearly 40 research papers in web crawlers and ontology-based information retrieval. She is currently guiding 8 Ph.D. scholars. She was in receipt of the schrneiger award. She is an expert in ontology development. Her areas of interest include artificial intelligence, text mining and semantic web technologies.

B. Sathiya is a postgraduate student pursuing her M.Tech in Computer Science and Engineering (Distributed Computing Systems specialization) at Pondicherry Engineering College, Puducherry, India.