Welcome to Scribd, the world's digital library. Read, publish, and share books and documents. See more
P. 1
Data Fusion - Jens Bleiholder and Felix Naumann

Data Fusion - Jens Bleiholder and Felix Naumann

Ratings: (0)|Views: 274|Likes:
Published by Deep

More info:

Published by: Deep on Sep 19, 2009
Copyright:Attribution Non-commercial

Availability:

Read on Scribd mobile: iPhone, iPad and Android.
download as PDF, TXT or read online from Scribd
See more
See less

03/12/2012

pdf

text

original

 
1Data Fusion
JENS BLEIHOLDER and FELIX NAUMANN
 Hasso-Plattner-Institut
The development of the Internet in recent years has made it possible and useful to access many differentinformation systems anywhere in the world to obtain information. While there is much research on the inte-gration of heterogeneous information systems, most commercial systems stop short of the actual integrationof available data. Data fusion is the process of fusing multiple records representing the same real-worldobject into a single, consistent, and clean representation.This article places data fusion into the greater context of data integration, precisely defines the goals of data fusion, namely, complete, concise, and consistent data, and highlights the challenges of data fusion,namely, uncertain and conflicting data values. We give an overview and classification of different ways of fusing data and present several techniques based on standard and advanced operators of the relationalalgebra and SQL. Finally, the article features a comprehensive survey of data integration systems fromacademia and industry, showing if and how data fusion is performed in each.Categories and Subject Descriptors: H.2.5 [
Database Management
]: Heterogeneous DatabasesGeneral Terms: Algorithms, Languages Additional Key Words and Phrases: Data cleansing, data conflicts, data consolidation, data integration, datamerging, data quality
 ACM Reference Format:
Bleiholder, J. and Naumann, F. 2008. Data fusion. ACM Comput. Surv., 41, 1, Article 1 (December 2008), 41pages DOI
=
10.1145/1456650.1456651 http://doi.acm.org/10.1145/1456650.1456651
1. INTRODUCTION
With more and more information sources available via inexpensive network connec-tions, either over the Internet or in company intranets, the desire to access all thesesourcesthroughaconsistentinterfacehasbeenthedrivingforcebehindmuchresearchinthefieldofinformationintegration.Duringthelastthreedecadesmanysystemsthattry to accomplish this goal have been developed, with varying degrees of success. Oneof the advantages of information integration systems is that the user of such a systemobtains a complete yet concise overview of all existing data without needing to accessalldatasourcesseparately:completebecausenoobjectisforgottenintheresult;concisebecause no object is represented twice and the data presented to the user is without
This research was supported by the German Research Society (DFG grant no. NA 432). Authors’ addresses: J. Bleiholder and F. Naumann, Hasso-Plattner-Institut, Prof.-Dr.-Helmert-Str. 2-3,D-14482 Potsdam, Germany; email:
{
Jens.Bleiholder,Felix.Naumann
}
@hpi.uni-potsdam.de.Permission to make digital or hard copies of part or all of this work for personal or classroom use is grantedwithout fee provided that copies are not made or distributed for profit or direct commercial advantage andthat copies show this notice on the first page or initial screen of a display along with the full citation. Copy-rights for components of this work owned by others than ACM must be honored. Abstracting with credit ispermitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any componentof this work in other works requires prior specific permission and/or a fee. Permissions may be requestedfrom Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax
+
1 (212)869-0481, or permissions@acm.org.c
2008 ACM 0360-0300/2008/12-ART1 $5.00. DOI 10.1145/1456650.1456651 http://doi.acm.org/10.1145/ 1456650.1456651
 ACM Computing Surveys, Vol. 41, No. 1, Article 1, Publication date: December 2008.
 
1:2
J. Bleiholder and F. Naumann
Fig. 1
. A data integration process.
contradiction.Thelatterisdifficultbecauseinformationaboutentitiesisstoredinmorethan one source. After major technical problems of connecting different data sources on different ma-chines are solved, the biggest challenge remains: overcoming semantic heterogeneity,that is, overcoming the effect of the same information being stored in different ways.Themainproblemsarethedetectionofequivalentschemaelementsindifferentsources(schema matching) and the detection of equivalent object descriptions (duplicate detec-tion)indifferentsourcestointegratedataintoonesingleandconsistentrepresentation.However, the problem of actually integrating, or fusing, the data and coping with theexisting data inconsistencies is often ignored.The problem of contradictory attribute values when integrating data from differentsources is first mentioned by Dayal [1983]. Since then, the problem has often beenignored, yet a few approaches and techniques have emerged. Many of them try to avoidthe data conflicts by resolving only the uncertainty of missing values, but quite a fewof them use different kinds of resolution techniques to resolve conflicts.With this survey we introduce the inclined reader to the process of 
data fusion
in thewider context of information integration. This process is also referred to in the litera-ture as
data merging
,
data consolidation
,
entity resolution
, or
finding representations/survivors
. We also present and compare existing approaches to implement such a datafusion step as part of an information integration process and enable users to choose themost suitable among them for the current integration task at hand.The survey begins in Section 2 with a brief introduction into information integrationand the different tasks that need to be carried out, before describing the process of datafusion in more detail. Section 3 presents, describes, and classifies relational techniquesfor data fusion, before Section 4 gives a detailed overview and classifies integratedinformation systems and their data fusion capabilities.
2. DATA FUSION
Integrated (relational) information systems provide users with a unified view of multi-ple heterogeneous data sources. Querying the underlying data sources, combining theresults, and presenting them to the user is performed by the integration system.In this article we assume an integration scenario with a three-step data integrationprocess, as shown in Figure 1 [Naumann et al. 2006]. First, we need to identify corre-sponding attributes that are used to describe the information items in the source. Theresult of this step is a
schema mapping
, which is used to transform the data presentin the sources into a common representation (renaming, restructuring). Second, the
 ACM Computing Surveys, Vol. 41, No. 1, Article 1, Publication date: December 2008.
 
 Data Fusion
1:3differentobjectsthataredescribedinthedatasourcesneedtobeidentifiedandaligned.In this way, using
duplicate detection
techniques, multiple, possibly inconsistent repre-sentations of the same real-world objects are found. Third, in a last step, the duplicaterepresentations are combined and fused into a single representation while inconsis-tencies in the data are resolved. This last step is referred to as
data fusion
1
and is themainfocusinthisarticle.Webrieflydiscussthetwootherstepsasprerequisitestodatafusion.In the remainder of this section we present common solutions to the first two stepsand then go into more detail in the field of data fusion. Apart from completeness andconcisenessindataintegration,weregardthedifferentkindsofconflictsthatmayoccurduring integration and different possible semantics that can be used in implementingthe final step of data fusion. We give a motivating example at the end of this section,which is used throughout the article to show the differences between the techniquesand systems.The following section is used to present relational techniques for data fusion(Section 3)andexaminestowhatdegreetheyaccomplishit.Thenextsection(Section 4)surveys various data integration systems and specifically identifies their data fusioncapabilities.
2.1. Data Transformation
Integrated information systems must usually deal with heterogeneous schemata. Inorder to present to the user query results in a single unified schema, the schematicheterogeneities must be bridged. Data from the data sources must be transformed toconform to the global schema of the integrated information system. Two approachesare common to bridge heterogeneity and thus specify data transformation: schemaintegration and schema mapping. The former approach is driven by the desire tointegrate a known set of data sources. Schema integration regards the individualschemata and tries to generate a new schema that is complete and correct with re-spect to the source schemata, that is minimal, and that is understandable. Batini et al.[1986] give an overview of the difficult problems involved and the techniques to solvethem.The latter approach, schema mapping, assumes a given target schema; that is, itis driven by the need to include a set of sources in a given integrated informationsystem. A set of correspondences between elements of a source schema and elementsof the global schema are generated to specify how data is to be transformed [Popa et al.2002; Melnik et al. 2005]. A particularly interesting addition to schema mapping are
schema matching
techniques, which semi-automatically find correspondences betweentwo schemata. Rahm and Bernstein have classified these techniques based on whatinput information the methods use [Rahm and Bernstein 2001]. There is much ongoingresearch in both the areas of schema matching and schema mapping—comprehensivesurveys are yet missing.The goal of both approaches, schema integration and schema mapping, is the same:transform data of the sources so that it conforms to a common global schema. Given aschema mapping, either to an integrated or to a new schema, finding such a completeand correct transformation is a considerable problem [Fagin et al. 2005]. The data
1
There are two other fields in computer science that also use the term
(data) fusion
. In information retrievalit means the combination of search results of different search engines into one single ranking, therefore it isalso called
rank merging
. In networking it means the combination of data from a network of sensors to inferhigh-level knowledge, therefore also called
sensor fusion
. Beyond computer science, in market research, theterm data fusion is used when referring to the process of combining two datasets on different, similar, butnot identical objects that overlap in their descriptions.
 ACM Computing Surveys, Vol. 41, No. 1, Article 1, Publication date: December 2008.

Activity (5)

You've already reviewed this. Edit your review.
1 thousand reads
1 hundred reads
Jose Macedo liked this
master8577 liked this

You're Reading a Free Preview

Download
scribd
/*********** DO NOT ALTER ANYTHING BELOW THIS LINE ! ************/ var s_code=s.t();if(s_code)document.write(s_code)//-->