Dataspaces

Dataspaces are an abstraction in data management that aims to overcome some of the problems encountered
in data integration systems. The aim is to reduce the effort required to set up a data integration system by
relying on existing matching and mapping generation techniques, and to improve the system in a
"pay-as-you-go" fashion as it is used. Labor-intensive aspects of data integration are postponed until they are
absolutely needed.[1][2][3][4][5][6][7][8]

Traditionally, data integration and data exchange systems have aimed to offer many of the purported
services of dataspace systems. Dataspaces can be viewed as a next step in the evolution of data integration
architectures, but are distinct from current data integration systems in the following way. Data integration
systems require semantic integration before any services can be provided. Hence, although there is not a
single schema to which all the data conforms and the data resides in a multitude of host systems, the data
integration system knows the precise relationships between the terms used in each schema. As a result,
significant up-front effort is required in order to set up a data integration system.

Dataspaces shift the emphasis to a data co-existence approach providing base functionality over all data
sources, regardless of how integrated they are. For example, a DataSpace Support Platform (DSSP) can
provide keyword search over all of its data sources, similar to that provided by existing desktop search
systems. When more sophisticated operations are required, such as relational-style queries, data mining, or
monitoring over certain sources, then additional effort can be applied to more closely integrate those
sources in an incremental fashion. Similarly, in terms of traditional database guarantees, initially a dataspace
system can only provide weaker guarantees of consistency and durability. As stronger guarantees are
desired, more effort can be put into making agreements among the various owners of data sources, and
opening up certain interfaces (e.g., for commit protocols).
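
The pay-as-you-go idea above can be sketched in a few lines of Python. This is only an illustrative sketch, not the interface of any actual DSSP: the `Source` class, the `keyword_search` and `total_balance` functions, and the sample data are all hypothetical. Keyword search works over every source with no integration effort, while the structured query works only for the one source whose layout someone has taken the trouble to understand.

```python
from dataclasses import dataclass


@dataclass
class Source:
    """A hypothetical data source: a name plus items that are text or records."""
    name: str
    items: list


def item_text(item):
    """Flatten an item to plain text so that any source can be searched."""
    if isinstance(item, dict):
        return " ".join(f"{key} {value}" for key, value in item.items())
    return str(item)


def keyword_search(sources, keyword):
    """Base service: best-effort keyword search over all sources."""
    needle = keyword.lower()
    hits = []
    for source in sources:
        for item in source.items:
            if needle in item_text(item).lower():
                hits.append((source.name, item))
    return hits


def total_balance(source):
    """Richer service: needs knowledge of one source's record layout."""
    return sum(row["balance"] for row in source.items if isinstance(row, dict))


# Hypothetical sample data: one unstructured source, one structured source.
notes = Source("notes", ["Call the bank about the savings account"])
accounts = Source("accounts", [
    {"bank": "First", "balance": 120.0},
    {"bank": "Second", "balance": 80.5},
])

hits = keyword_search([notes, accounts], "bank")
balance = total_balance(accounts)
```

Here `keyword_search` needs no schema knowledge, whereas `total_balance` stands for the additional, incremental effort invested in a single source once a relational-style query is actually needed.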

Data graphs play an important role in dataspace systems. They use a fact-based data modeling approach
(triples, or "data entities", made up of subject-predicate-object)[9] that supports the "pay-as-you-go"
techniques described above. They support data co-existence and are therefore an ideal technique for
semantic integration. Search, relational-style queries and analytics can all work simultaneously on data
graphs, which is another important property of dataspaces.
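
As a rough illustration of the triple model, the following Python sketch stores facts as subject-predicate-object tuples and runs both a keyword search and a structured pattern match over the same data. The sample facts and the `match` and `search` helpers are hypothetical and are not taken from any particular graph engine.

```python
# Hypothetical facts, each stored as a (subject, predicate, object) triple.
triples = [
    ("paper42", "type", "Paper"),
    ("paper42", "title", "Indexing dataspaces"),
    ("paper42", "acknowledges", "grant7"),
    ("exp3", "associated_with", "paper42"),
]


def match(facts, subject=None, predicate=None, obj=None):
    """Structured query: None acts as a wildcard for that position."""
    return [
        fact for fact in facts
        if (subject is None or fact[0] == subject)
        and (predicate is None or fact[1] == predicate)
        and (obj is None or fact[2] == obj)
    ]


def search(facts, keyword):
    """Keyword query: match the keyword against any part of a triple."""
    needle = keyword.lower()
    return [
        fact for fact in facts
        if any(needle in str(part).lower() for part in fact)
    ]


acknowledgements = match(triples, predicate="acknowledges")
mentions = search(triples, "paper42")

# Facts from a new source can be added without changing any schema.
triples.append(("sheet9", "has_column", "variance"))
```

Because every fact has the same shape, a new source contributes triples without any up-front schema design, and both kinds of query continue to work over the combined data.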

Applications of dataspaces

Personal information management

The goal of personal information management (PIM) is to offer easy access to, and manipulation of, all of the
information on a person's desktop, with possible extension to mobile devices, personal information on the
Web, or even all the information accessed during a person's lifetime. Recent desktop search tools are an
important first step for PIM, but are limited to keyword queries. Our desktops typically contain some
structured data (e.g., spreadsheets) and there are important associations between disparate items on the
desktop. Hence, the next step for PIM is to allow the user to search the desktop in more meaningful ways.
For example, "find the list of juniors who took my database course last quarter," or "compute the aggregate
balance of my bank accounts." We would also like to search by association, e.g., "find the email that John
sent me the day I came back from Hawaii," or "retrieve the experiment files associated with my SIGMOD
paper this year." Finally, we would like to query about sources, e.g., "find all the papers where I
acknowledged a particular grant," "find all the experiments run by a particular student," or "find all
spreadsheets that have a variance column."

The principles of dataspaces in play in this example are that

1. a PIM tool must enable accessing all the information on the desktop, and not just an
explicitly or implicitly chosen subset, and
2. while PIM often involves integrating data from multiple sources, we cannot assume users
will invest the time to integrate. Instead, most of the time the system will have to provide
best-effort results, and tighter integrations will be created only in cases where the benefits
will clearly outweigh the investment.

Scientific data management

Consider a scientific research group working on environmental observation and forecasting, such as the
CORIE system. They may be monitoring a coastal ecosystem through weather stations, shore- and
buoy-mounted sensors and remote imagery. In addition they could be running atmospheric and fluid-dynamics
models that simulate past, current and near future conditions. The computations may require importing data
and model outputs from other groups, such as river flows and ocean circulation forecasts. The observations
and simulations are the inputs to programs that generate a wide range of data products, for use within the
group and by others: comparison plots between observed and simulated data, images of surface-temperature
distributions, animations of salt-water intrusion into an estuary. Such a group can easily amass millions of
data products in just a few years. While it may be that for each file, someone in the group knows where it is
and what it means, no one person may know the entire holdings nor what every file means. People
accessing this data, particularly from outside the group, would like to search a master inventory that had
basic file attributes, such as time period covered, geographic region, height or depth, physical variable
(salinity, temperature, wind speed), kind of data product (graph, isoline plot, animation), forecast or
hindcast, and so forth. Once data products of interest are located, understanding the lineage is paramount in
being able to analyze and compare products: What code version was used? Which finite element grid?
How long was the simulation time step? Which atmospheric dataset was used as input?

Groups will need to federate with other groups to create scientific dataspaces of regional or national scope.
They will need to easily export their data in standard scientific formats, and at granularities (sub-file or
multiple file) that don't necessarily correspond to the partitions they use to store the data. Users of the
federated dataspace may want to see collections of data that cut across the groups in the federation, such as
all observations and data products related to water velocity, or all data related to a certain stretch of coastline
for the past two months. Such collections may require local copies or additional indices for fast search.

This scenario illustrates several dataspace requirements, including

1. a dataspace-wide catalog,
2. support for data lineage and
3. creating collections and indexes over entities that span more than one participating source.
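
The three requirements can be sketched together in Python. This is a minimal sketch under assumed names (`Entry`, `Catalog`, `lineage`, `collection`, and the sample entries are all hypothetical) and is not a description of how CORIE or any real dataspace platform is built.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """A hypothetical catalog record for one data product."""
    entry_id: str
    source: str                      # participating group that holds the data
    attributes: dict                 # basic attributes, e.g. physical variable
    derived_from: list = field(default_factory=list)  # ids of direct inputs


class Catalog:
    """Dataspace-wide catalog covering entries from several sources."""

    def __init__(self):
        self.entries = {}

    def register(self, entry):
        self.entries[entry.entry_id] = entry

    def lineage(self, entry_id):
        """Return the ids of all recorded ancestors of an entry."""
        seen = []
        stack = list(self.entries[entry_id].derived_from)
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.append(current)
            parent = self.entries.get(current)
            if parent is not None:
                stack.extend(parent.derived_from)
        return seen

    def collection(self, **criteria):
        """Return sorted ids of entries, from any source, matching all criteria."""
        return sorted(
            entry.entry_id for entry in self.entries.values()
            if all(entry.attributes.get(k) == v for k, v in criteria.items())
        )


catalog = Catalog()
catalog.register(Entry("obs1", "group_a", {"variable": "salinity"}))
catalog.register(Entry("sim1", "group_b", {"variable": "salinity"}, ["obs1"]))
catalog.register(Entry("plot1", "group_b",
                       {"variable": "salinity", "kind": "graph"},
                       ["sim1", "obs1"]))

ancestors = catalog.lineage("plot1")
salinity_products = catalog.collection(variable="salinity")
```

The catalog holds only descriptions and derivation links, so the data itself can stay with the participating groups. A collection such as `salinity_products` cuts across sources in the way the federation scenario describes.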

See also
Data mapping
Data integration
Semantic integration
Information integration
Semantic query

References
1. Belhajjame, K.; Paton, N. W.; Embury, S. M.; Fernandes, A. A. A.; Hedeler, C. (2013).
"Incrementally improving dataspaces based on user feedback". Information Systems. 38 (5):
656. CiteSeerX 10.1.1.303.1957 (https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.
303.1957). doi:10.1016/j.is.2013.01.006 (https://doi.org/10.1016%2Fj.is.2013.01.006).
2. Belhajjame, K.; Paton, N. W.; Embury, S. M.; Fernandes, A. A. A.; Hedeler, C. (2010).
"Feedback-based annotation, selection and refinement of schema mappings for
dataspaces". Proceedings of the 13th International Conference on Extending Database
Technology - EDBT '10. p. 573. doi:10.1145/1739041.1739110 (https://doi.org/10.1145%2F1
739041.1739110). ISBN 9781605589459.
3. Talukdar, P. P.; Ives, Z. G.; Pereira, F. (2010). "Automatically incorporating new sources in
keyword search-based data integration" (https://repository.upenn.edu/cis_papers/622).
Proceedings of the 2010 international conference on Management of data - SIGMOD '10.
p. 387. doi:10.1145/1807167.1807211 (https://doi.org/10.1145%2F1807167.1807211).
ISBN 9781450300322. S2CID 14566848 (https://api.semanticscholar.org/CorpusID:145668
48).
4. Sarma, A. D.; Dong, X. L.; Halevy, A. Y. (2009). "Data Modeling in Dataspace Support
Platforms". Conceptual Modeling: Foundations and Applications. Lecture Notes in Computer
Science. Vol. 5600. p. 122. doi:10.1007/978-3-642-02463-4_8 (https://doi.org/10.1007%2F9
78-3-642-02463-4_8). ISBN 978-3-642-02462-7.
5. Dong, X. L.; Halevy, A.; Yu, C. (2008). "Data integration with uncertainty". The VLDB Journal.
18 (2): 469. CiteSeerX 10.1.1.176.3648 (https://citeseerx.ist.psu.edu/viewdoc/summary?doi=
10.1.1.176.3648). doi:10.1007/s00778-008-0119-9 (https://doi.org/10.1007%2Fs00778-008-
0119-9). S2CID 8035903 (https://api.semanticscholar.org/CorpusID:8035903).
6. Howe, B.; Maier, D.; Rayner, N.; Rucker, J. (2008). "Quarrying dataspaces: Schemaless
profiling of unfamiliar information sources". 2008 IEEE 24th International Conference on
Data Engineering Workshop. p. 270. doi:10.1109/ICDEW.2008.4498331 (https://doi.org/10.1
109%2FICDEW.2008.4498331). ISBN 978-1-4244-2161-9. S2CID 14039616 (https://api.se
manticscholar.org/CorpusID:14039616).
7. Dong, X.; Halevy, A. (2007). "Indexing dataspaces". Proceedings of the 2007 ACM SIGMOD
international conference on Management of data - SIGMOD '07. p. 43.
doi:10.1145/1247480.1247487 (https://doi.org/10.1145%2F1247480.1247487).
ISBN 9781595936868. S2CID 1184444 (https://api.semanticscholar.org/CorpusID:118444
4).
8. Franklin, M.; Halevy, A.; Maier, D. (2005). "From databases to dataspaces". ACM SIGMOD
Record. 34 (4): 27. doi:10.1145/1107499.1107502 (https://doi.org/10.1145%2F1107499.110
7502). S2CID 14092111 (https://api.semanticscholar.org/CorpusID:14092111).
9. "Actian adds SPARQL City's graph analytics engine to its arsenal"
(https://www.zdnet.com/actian-adds-sparql-citys-graph-analytics-engine-to-its-arsenal-7000035397/). ZDNet.

Further reading
Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer,
Zachary G. Ives, Fernando Pereira, Sudipto Guha: Learning to create data-integrating
queries. PVLDB 1(1): 785-796 (2008)
Michael J. Franklin, Alon Y. Halevy, David Maier: A first tutorial on dataspaces (http://www.vld
b.org/pvldb/1/1454217.pdf). PVLDB 1(2): 1516-1517 (2008)
Jens-Peter Dittrich, Marcos Antonio Vaz Salles: iDM: A Unified and Versatile Data Model for
Personal Dataspace Management (http://www.vldb.org/conf/2006/p367-dittrich.pdf). VLDB
2006: 367-378.

External links
Dataspaces by Refinement (http://dataspaces.cs.manchester.ac.uk)

Retrieved from "https://en.wikipedia.org/w/index.php?title=Dataspaces&oldid=1162511175"
