Use Data Virtualization to Help Resolve Data Silos

Published: 20 May 2016 | ID: G00295605

Analyst(s): Mark A. Beyer, Eric Thoo, Nick Heudecker

Data silos slow organizations' transformation into digital businesses by limiting information
discovery and access. Data virtualization offers information leaders a mitigation strategy that can
lead to new analytical and operational opportunities.

Key Challenges
■ Data silos increasingly slow the transformation to digital business by limiting the discovery
of, and access to, information.
■ Traditional integration approaches consume time and resources, and generally focus on
meeting narrow, project-based information needs.
■ Operational data often remains beyond the reach of even elaborate integration deployments, so
its potential value is squandered.
■ Information leaders may be unaware of recent trends in, and the new optimization capabilities
of, rapidly maturing data virtualization tools.

Recommendations
Information leaders should:

■ Create a data exploration capability via data virtualization to identify data silos that are
candidates for consolidation or that may persist as federated data use cases.
■ Use data virtualization to help select the best option for opening up data silos.
■ Apply data virtualization to both operational and analytic data.
■ Select data virtualization tools that exploit new capabilities to address physics and scalability
limits.
■ Utilize data use, view creation, frequency-of-access, performance and capacity utilization
metadata to determine how long views can remain virtual and when they should be converted
to other platforms for optimization.

Table of Contents

Strategic Planning Assumption
Introduction
Analysis
    Create a Data Exploration Capability via Data Virtualization to Identify Silos and Data Usage
    Select the Best Option for Opening Up Data Silos
    Apply Data Virtualization to Both Operational and Analytic Data
    Understand How Data Virtualization Tools Tackle Four Well-Known Issues
Gartner Recommended Reading

Strategic Planning Assumption


Through 2020, 35% of enterprises will implement some form of data virtualization as one enterprise
production option for data integration.

Introduction
"How do I consolidate data silos?" is a long-standing question asked by IT and information leaders.
Until recently, it was a practical issue: After a merger or acquisition, disparate but similar systems
needed a common infrastructure to reflect common processes. Clashes of business culture and
internal politics often prevented organizations from dispensing easily with existing data
management systems. But consolidation is now an acute problem, for two reasons:

■ The broad growth of the digital business model, which is fueling demand for more and better
data.
■ The accompanying need to access, share, analyze and act on information quickly and easily.

To answer the aforementioned question, we first have to define a data silo. Data silos are about
limits. Put simply, a data silo is "some of the data, for some of the people, for some of the time." A
data silo occurs whenever:

■ A data store (or access tier) limits the way in which data can be joined together.
■ A time slice of data is intentionally limited or partitioned across platforms.
■ Some data is excluded because it does not support the intended analysis.

The conventional solution to the problem of data silos has been to implement an enterprise data
warehouse (EDW). But accommodating new data assets significantly different from those for which the
EDW was originally designed has generally meant rebuilding the entire physical data repository, or
creating another warehouse or mart and, therefore, another silo. Traditionally, "consolidation"
means physical integration. There is still a place for this type of solution, but given the scale
and speed needed by
today's businesses, it's best used as a targeted remedy. Additional demands will require different
solutions. Focusing only on integration by physical consolidation ignores new approaches that can
increase flexibility by creating the opposite of a silo — approaches that offer "all of the needed data,
for the relevant people, at the time it's needed."

The same demands to break through the limits of analytic data silos are apparent when it comes to
operational and transactional data. A clutch of transactional applications may need access to
master data in a separate master data management (MDM) system. Many organizations lack a
systematic approach to create, manage and maintain that access, especially in legacy or off-the-shelf
systems.

A data silo is a silo because it has been placed in some commonly understood context — this
creates a semantic relationship between the data use case and the data as it is stored. However, all
data is semantic in nature as it always represents something else — usually a concept or object
from the "real" world. To break down the barriers between silos, a neutral semantic that removes
use case bias is needed, and virtualization introduces this opportunity without damaging the current
silo for its currently intended use case.

Data virtualization — a style of federation — offers a viable approach for:

■ Monitoring data usage, to help identify where silos exist and how their data is being used (or not
used)
■ Extending an organization's integration strategy to gain the necessary speed and flexibility for
data discovery

Importantly, data virtualization avoids the issue of how to resolve separate approaches to
governance and management: it generally accepts the governance model and management approach of
the connected sources. Experience has shown that attempting to embed governance or management
functions in the virtualization layer is difficult. Generally, it is best for connecting
applications to perform those functions and to use the virtualization layer as a semantic interface.

Analysis
Create a Data Exploration Capability via Data Virtualization to Identify Silos and Data
Usage
The purpose of data integration is to get beyond silo boundaries and limits. Data virtualization or
federation complements an organization's physical integration strategy and augments existing
integration architectures in three ways:

■ It allows organizations to combine new and existing data sources.
■ It deals with formally scoped and targeted purposes.
■ It offers greater flexibility to advanced data users.

Virtualization lets information leaders create a managed data exploration capability that can be
delivered as a "virtual sandbox" to business analysts for the reconfiguration of existing data models
or the addition of new data as prototypes. Data virtualization is not necessarily about new
capabilities: It is a faster way to determine levels of user engagement, specifications, repeatability
and pervasiveness of use. With this information, data leaders can identify whether, and when, the
need for stable data access methods is overwhelming the need for agility and flexibility. That
identification then guides action on continued or expanded use of virtual solutions or the use of
physical data stores instead.
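
As a minimal, runnable sketch of such a virtual sandbox, the following Python example uses the
standard library's sqlite3 module, with its ATTACH command standing in for a data virtualization
tool; the file, table and view names are illustrative assumptions. Two "silos" are joined at query
time through a temporary view, and no data is copied into a new repository.

    import sqlite3

    # Stand-in "silos": an orders system and a customer master. In practice
    # these would be separate DBMSs reached through a virtualization tool;
    # sqlite3 ATTACH is used here only to make the pattern runnable.
    ops = sqlite3.connect("orders.db")
    ops.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER, cust_id INTEGER, amount REAL)")
    ops.execute("DELETE FROM orders")
    ops.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                    [(1, 100, 250.0), (2, 101, 75.5)])
    ops.commit()
    ops.close()

    mdm = sqlite3.connect("mdm.db")
    mdm.execute("CREATE TABLE IF NOT EXISTS customer (cust_id INTEGER, name TEXT)")
    mdm.execute("DELETE FROM customer")
    mdm.executemany("INSERT INTO customer VALUES (?, ?)",
                    [(100, "Acme Corp"), (101, "Globex")])
    mdm.commit()
    mdm.close()

    # The virtual sandbox: an in-memory session that resolves a view against
    # both silos at query time. A TEMP view is used because it may reference
    # attached databases; nothing is persisted to a new repository.
    sandbox = sqlite3.connect(":memory:")
    sandbox.execute("ATTACH DATABASE 'orders.db' AS ops_src")
    sandbox.execute("ATTACH DATABASE 'mdm.db' AS master")
    sandbox.execute("""
        CREATE TEMP VIEW order_360 AS
        SELECT o.id AS order_id, c.name AS customer, o.amount
        FROM ops_src.orders AS o
        JOIN master.customer AS c ON c.cust_id = o.cust_id
    """)

    for row in sandbox.execute("SELECT * FROM order_360 ORDER BY order_id"):
        print(row)  # (1, 'Acme Corp', 250.0) then (2, 'Globex', 75.5)

If usage metadata later shows that the view is queried heavily and repeatably, the SELECT behind it
becomes the specification for a physical table, which is the virtual-to-physical decision path
described above.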

Virtualization tools can support decision-making and planning processes in an analytics workspace.
They provide, for example, the opportunity to build a single view of customer sentiment in real time
to support customer-facing processes involving an MDM solution to resolve data mappings. They
also offer the chance to facilitate integration of data and systems for fraud- and security-related
analysis.

The traditional, stand-alone data warehouse, which focused almost exclusively on the use of a
single, physically integrated data model in a dedicated repository, is passé. Today, the logical data
warehouse (LDW) can use data federation/virtualization technology to read data in place, enabling
data access services to support the LDW's use of abstracted interfaces for various processing
needs. Introducing virtualization into the mix of data delivery capabilities enables those capabilities
to model views from across combinations of repositories at the speed required by the business.

Select the Best Option for Opening Up Data Silos


Information and IT leaders are no longer limited to traditional integration approaches to data silos.
Data virtualization is one technology that can support several requirements that cannot be met by
physical stores.

One way to identify opportunities for infrastructure flexibility is outlined in Gartner's Information
Capabilities Framework (ICF). The ICF describes an approach for architecting a modern information
infrastructure. Using it, information and IT leaders can disconnect information capabilities from
specific use cases, which enables those capabilities to be reused across systems and processes
throughout an organization (see "Introduction to Gartner's Information Capabilities Framework").

Conceptually, the ICF lays out four relevant semantic styles for addressing data silos:

■ Dedicated: Deployment of a new model in a single-use-case tool, for a targeted set of analysts,
that includes the new data in a use-case-specific repository. This approach is how silos
originate.
■ Registry: Use of a form of logical federation in which existing underlying schemas remain intact
but are now accessed by a resolving, access-only tier. In practice, this access tier is typically
implemented using one of three methods:
■ Data virtualization via a dedicated virtual federation tool
■ A DBMS external-access capability (external files/tables, DBLinks and so on)
■ The semantic tier of a business intelligence or data analytics platform
■ Consolidation: Rebuilding of the entire physical data store/repository, or a complete redesign
of a use-case-neutral semantic access tier, to include all desired data and represent a new
logical access schema. This is the traditional EDW.
■ Orchestration: A combination of the above three styles for delivery against multiple use cases,
typically using a data virtualization or data services/API tier to create first-level access and to
pass requests to any dedicated, registry or consolidated solution in an underlying infrastructure.
This is the LDW.

In practice, these concepts translate into five implementation choices:

■ If only an EDW exists: Consider using data virtualization to extend the warehouse with, in
effect, an "awning" that keeps new data "dry" without forcing the slow requirements-and-analysis
cycle. Do this by segmenting the users (see "Making Big Data Normal Begins With Self-Classifying
and Self-Disciplined Users"), and let only analysts and miners under the virtual awning, while
screening out lower-skilled casual users.
■ If an EDW exists and Hadoop has been deployed: Consider analyzing the output from
competing Hadoop processes to determine which version of that data will be most useful
across a broader base of use cases, and make the preferred output available through data
virtualization. Create views on top of the EDW to provide a federated view and data access
function for the Hadoop environment, combined with a "straight through" view to the EDW. Then
let only analysts and miners into the Hadoop view. Your data miners should provide support for
this environment.
■ If Hadoop is strong and data warehouses or data marts are weak and extremely
proprietary (but highly political due to ownership): Use data virtualization to federate the
"weak" repositories. Then consider using Hadoop for first-level staging and experimentation
only by miners and possibly data scientists. Use outputs to populate the EDW or the view layer.
However, remain acutely aware that some source structures and models do not lend themselves
readily to data virtualization approaches and may require some preprocessing in a data lake.
■ If, in all cases, it becomes optimal for a view to be moved to a repository (for reasons of
performance, audit/compliance or commonality of use case): Use data virtualization as a
first data preparation layer to examine and qualify data for later deployment using batch data
integration processing.
■ If a time-variant data warehouse loaded in less than real time needs to be combined, in
near-real-time fashion, with new data from a stream or a transactional system: Use data
virtualization by adding current data views that access operational source data or a fast copy of
that data in a staging area, an operational data store or a data lake (a minimal sketch follows
this list).
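
To make that last choice concrete, here is a minimal sketch, again using sqlite3 as a stand-in for
a virtualization tool and with illustrative table names: a current data view unions stable,
batch-loaded warehouse history with a fast copy of today's operational data in a staging table,
resolving the combination at query time.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        -- Time-variant warehouse table, loaded in batch (less than real time).
        CREATE TABLE dw_sales (sale_date TEXT, store TEXT, amount REAL);
        -- Fast copy of today's transactions landed in a staging area.
        CREATE TABLE stg_sales_today (sale_date TEXT, store TEXT, amount REAL);

        INSERT INTO dw_sales VALUES ('2016-05-19', 'east', 1200.0);
        INSERT INTO dw_sales VALUES ('2016-05-19', 'west', 890.0);
        INSERT INTO stg_sales_today VALUES ('2016-05-20', 'east', 310.0);

        -- The "current data view": stable history plus near-real-time data,
        -- combined at query time instead of waiting for the next batch load.
        CREATE VIEW sales_current AS
            SELECT sale_date, store, amount FROM dw_sales
            UNION ALL
            SELECT sale_date, store, amount FROM stg_sales_today;
    """)

    print(con.execute("SELECT SUM(amount) FROM sales_current").fetchone()[0])  # 2400.0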

Apply Data Virtualization to Both Operational and Analytic Data


The benefits of data virtualization extend beyond analytic data to operational and transactional data.
Virtualization can be key to bringing these data sources together in a timely way. For example, a
virtualization layer can let numerous transactional systems easily access master data in an MDM
system.

One approach is to use data virtualization to create a data service layer on top of operational
applications: data is federated from transactional applications and databases by creating a
virtual, integrated view. This supports analytic requirements and dashboard applications by
federating data at query time from these operational sources. Where use cases mainly involve
well-anticipated data and business requirements, information leaders can still decide to
preintegrate and/or preaggregate data into physical stores.

A virtual view can also be used to preprocess data from multiple systems. In this case, the view is
treated as simply another data source independent of the underlying physical or syntactical
representation. It can then be exposed as a reusable service. One example is a virtual view acting
as an input to a batch-oriented process involving an extraction, transformation and loading (ETL)
tool; or as provisioning data for an enterprise service bus (ESB) that orchestrates business flows.
Another is using that same view as a data source for MDM systems that become additional
consumers of the virtual view.
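
Continuing the earlier sandbox sketch (and its hypothetical order_360 view), the following Python
fragment shows a virtual view wrapped as a reusable service with two kinds of consumer; the
function and table names are assumptions for illustration, not any specific tool's API.

    import sqlite3
    from typing import Iterator, Tuple

    def order_360_service(conn: sqlite3.Connection) -> Iterator[Tuple]:
        """Reusable data service: resolves the federated view at call time.

        Consumers (a dashboard, an ETL job, an MDM hub) depend only on this
        interface, never on the physical layout of the underlying silos.
        """
        yield from conn.execute("SELECT order_id, customer, amount FROM order_360")

    def etl_load_orders(view_conn: sqlite3.Connection,
                        dw_conn: sqlite3.Connection) -> None:
        """Batch consumer: the same virtual view feeds an ETL-style load."""
        dw_conn.execute("CREATE TABLE IF NOT EXISTS fact_orders "
                        "(order_id INTEGER, customer TEXT, amount REAL)")
        dw_conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)",
                            order_360_service(view_conn))
        dw_conn.commit()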

It is important at this point to address the cloud. Many organizations have begun to deploy new
applications in the cloud at an increasing rate (some organizations exist entirely in the cloud). For
those organizations currently deploying both in the cloud and on-premises, a new type of data silo
is developing. Some applications are simply "lifted and shifted" from the premises to the cloud,
often creating two instances of the same application with different data or data that needs
synchronization. Similarly, some cloud-deployed apps use data from the premises, but manage it
separately — changing values, having different rates of change and so on. Data virtualization only
provides the ability to connect such systems — it does not specifically address issues like location,
security and quality. When extensive mitigation of multiple issues begins to become necessary, the
use of virtualization between cloud and on-premises deployments becomes challenging.

Virtualization enables the creation of abstracted interfaces to data sources, including big data,
content, data from operational technologies and social data. Developers can use these interfaces to
construct integration processes that serve consuming applications without having to understand the
structure of, or interact directly with, the data sources. Data abstraction also supports data
migration efforts. When data virtualization is used at the beginning of, and throughout, such
processes, the potential to continue to use data virtualization to repeat the processes or to combine
old and new data increases significantly. After applications or systems complete their migration,
access to data sources can be reconfigured and modified without impacting the participating
applications.
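
Extending the same hypothetical sandbox, the reconfiguration step might look like the following
sketch: after the orders application migrates (here to an assumed orders_cloud.db exposing the same
schema), the view is re-pointed and consumers of order_360 continue to work unchanged.

    # Re-point the virtual view at the migrated source. Consumers of
    # order_360 are unaffected as long as the new source exposes the same
    # schema; 'orders_cloud.db' is an illustrative name.
    sandbox.execute("DROP VIEW order_360")
    sandbox.execute("DETACH DATABASE ops_src")
    sandbox.execute("ATTACH DATABASE 'orders_cloud.db' AS ops_src")
    sandbox.execute("""
        CREATE TEMP VIEW order_360 AS
        SELECT o.id AS order_id, c.name AS customer, o.amount
        FROM ops_src.orders AS o
        JOIN master.customer AS c ON c.cust_id = o.cust_id
    """)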

The separation and, therefore, insulation that data virtualization promises is never guaranteed. A
similar result can be achieved with a centralized repository like a warehouse, by simply using the
connecting tool's semantic layer and the database's capability to render views. This approach does
not allow interactive optimization of the connected systems but uses existing advanced database
administration skills and the in-DBMS optimization strategies after data is retrieved from the
connected sources. Frequently, end users prefer to use their own tool's semantic tier in their
business intelligence platform as well (and data virtualization can provide a direct view to the data to
help avoid political issues regarding tool preference). Although this approach is favored by users, it
requires that each tool build its own semantic layer because the tools do not share semantics with
each other.

Finally, data virtualization can support application prototyping by ensuring data-related aspects of
business functionality are adequately supported. Using federated data as a source, developers can
experiment with different sets of data before physically consolidating data into its specific
consumption points, such as a business application or an analytic solution. This supports
prototyping work by federating or incorporating various kinds of data access and usage required by
the business to ensure favorable application outcomes.

Understand How Data Virtualization Tools Tackle Four Well-Known Issues


Data virtualization tools now exploit technology-driven approaches to address four well-known and
troublesome issues concerning the physics of, and access to, data:

■ Access: Virtual views do not work if data sources are unavailable due to network issues, a
system is offline, or administrators turn off access. Vendors have introduced caching and the
ability to write out data constructs to their own specified storage. In days past, virtualization
tools would do live reads of data at the source and cache query results. But if new queries
could not be satisfied by what was in their cache (almost always the case when exploring new
data), the query would run again on the source. System administrators did not like this as it
slowed their systems. In some cases, the "solution" was to turn off access. A caching tier is one
part of the solution to this problem. The other part is good design: Cache raw data queries and
then, on top of that, let users build their semantics in a second tier of virtual views.
■ Network capacity: Today's data is "bigger" than before and has the potential to strain all but
the most generous network capacity. To remedy this, some virtualization tools have introduced
multitiered caching with incremental update capability. These tools look for changes using
functions like log reads in the sources, accumulate the changes frequently (for example, every
five minutes), and then use that as a buffer to load the current hour's cache tier. They then
roll the hour into today, and today into the long-term cache (a minimal sketch of this rollup
follows this list). Eventually, the cache grows very large, so organizations will develop their
own tolerance and policies for instantiating data somewhere outside the data virtualization layer.
■ Data volume: Volume is actually a reflection of access and network capacity, and the problem
of excessively large or unplanned data volumes can be solved in the same way. But there is
another option: query rewrites for distributed processing. In this case, the virtualization tool
keeps significant statistics on data volume, capacity for each source, performance expectations
by time of day for each source, and even for itself, as well as myriad other possible statistics.
Then, when a new access point is requested, a cost-based optimization engine examines all the
patterns it has already collected in internal metadata and compares the access plan to the
current conditions. Based on factors such as system availability and current capacity, the
optimization engine rewrites the query into various subprocesses and distributes those
processes for the current best optimization plan. In some tools, this includes fast replication of
small datasets across some or all of the other connected systems to do prejoins and
preprocessing on each source — for example, quickly copying a chart of accounts to 10 to 25
(or more) different systems to prefetch data for each source system, and sending only
summaries or subselected records forward to be processed elsewhere in the environment.
■ Activity instance: Commonly referred to as "number of users" or "number of queries," this is
really the number of user-driven activities. Some virtualization tools can use the optimization
techniques described above to determine when it is time to use the caching approach and write
out their own repository for the best access rates from their internal models. This makes for
very fast interfacing and interchange of data between various types of storage device, up to and
including main memory.
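
The multitiered caching with incremental update described under "Network capacity" can be pictured
with a short sketch. This minimal Python illustration shows only the rollup pattern; the tier
names, schedule and in-process lists are assumptions, since real tools drive their tiers from
source change logs and dedicated storage.

    from typing import Any, Dict, List

    class TieredCache:
        """Minimal sketch of multitiered caching with incremental update."""

        def __init__(self) -> None:
            self.buffer: List[Dict[str, Any]] = []     # ~5-minute change accumulation
            self.hour: List[Dict[str, Any]] = []       # current hour's cache tier
            self.today: List[Dict[str, Any]] = []      # current day's cache tier
            self.long_term: List[Dict[str, Any]] = []  # long-term cache

        def capture_change(self, row: Dict[str, Any]) -> None:
            """Record a change read from a source's log."""
            self.buffer.append(row)

        def flush_buffer(self) -> None:
            """Every few minutes: load buffered changes into the hour tier."""
            self.hour.extend(self.buffer)
            self.buffer.clear()

        def roll_hour(self) -> None:
            """At the top of the hour: roll the hour into today."""
            self.today.extend(self.hour)
            self.hour.clear()

        def roll_day(self) -> None:
            """At day end: roll today into the long-term tier, where retention
            policies apply as the cache grows."""
            self.long_term.extend(self.today)
            self.today.clear()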

Gartner Recommended Reading


Some documents may not be available as part of your current Gartner subscription.

"Adopt Data Federation/Virtualization to Support Business Analytics and Data Services"

"Harness Data Federation/Virtualization as Part of Your Enterprise's Comprehensive Data


Integration Strategy"

"Decision Point for Choosing an Application Services Implementation Architecture"

Evidence
The findings in this report draw on:

■ Gartner client inquiry data (over 700 inquiries on data virtualization annually)
■ Gartner analysts' assessments of product capabilities across a range of data integration, DBMS
and data virtualization tools, from 2013 to 2015
■ Calls with product reference customers conducted by Gartner analysts
■ A survey of reference customers conducted for Gartner's "Magic Quadrant for Data Integration
Tools"
