
BEST PRACTICES REPORT

Q2 2019

Cloud Data
Management
Integrating and Processing Data in Modern
Cloud and Hybrid Environments

By Philip Russom
Research Sponsors

Actian

Couchbase

Datameer

Denodo

Hitachi

SAP

Snowflake

Trifacta
Table of Contents

Research Methodology and Demographics
Executive Summary
Introduction to Cloud Data Management
Defining Cloud Data Management (CDM)
Related Terms and Concepts for Cloud Data Management
Real-World Use Cases for CDM
The Point of CDM and Similar Hybrid Practices
Benefits and Barriers for CDM
CDM: Problem or Opportunity?
Benefits of CDM
Barriers to CDM
The State of CDM
Is CDM important?
Why is CDM important?
Cloud Adoption: Decision Disciplines Are Catching up to Operational Ones
CDM Successes
CDM Failures
Organizational Matters
CDM Owners
CDM Workers
Hiring and Training for CDM Skills
Holistic Data Governance for Hybrid CDM
Multiphase Plans for Migrating Data to Cloud
CDM Best Practices
Data Platforms for Hybrid CDM
Data Management Tools for Hybrid CDM
Modern Data Semantics as CDM Enabler and Unifier of HDAs
Data Virtualization as an Agile and Non-Intrusive CDM Method
Distributing Data Across a Hybrid Data Architecture
Top Ten Priorities for Cloud Data Management

© 2019 by TDWI, a division of 1105 Media, Inc. All rights reserved. Reproductions
in whole or in part are prohibited except by written permission. Email requests or
feedback to info@tdwi.org.

Product and company names mentioned herein may be trademarks and/or
registered trademarks of their respective companies. Inclusion of a vendor,
product, or service in TDWI research does not constitute an endorsement by TDWI
or its management. Sponsorship of a publication should not be construed as an
endorsement of the sponsor organization or validation of its claims.

This report is based on independent research and represents TDWI’s findings;
reader experience may differ. The information contained in this report was
obtained from sources believed to be reliable at the time of publication. Features
and specifications can and do change frequently; readers are encouraged to
visit vendor websites for updated information. TDWI shall not be liable for any
omissions or errors in the information in this report.


About the Author


PHILIP RUSSOM, Ph.D., is senior director of TDWI Research for data management and is a well-
known figure in data warehousing, integration, and quality. He has published more than 600
research reports, magazine articles, opinion columns, and speeches over a 20-year period. Before
joining TDWI in 2005, Russom was an industry analyst covering data management at Forrester
Research and Giga Information Group. He also ran his own business as an independent industry
analyst and consultant, was a contributing editor with leading IT magazines, and a product
manager at database vendors. His Ph.D. is from Yale. You can reach him at prussom@tdwi.org,
@prussom on Twitter, and on LinkedIn at linkedin.com/in/philiprussom.

About TDWI Research


TDWI Research provides research and advice for data professionals worldwide. TDWI Research
focuses exclusively on data management and analytics issues and teams up with industry
thought leaders and practitioners to deliver both broad and deep understanding of the business
and technical challenges surrounding the deployment and use of data management and analytics
solutions. TDWI Research offers in-depth research reports, commentary, inquiry services, and
topical conferences as well as strategic planning services to user and vendor organizations.

About the TDWI Best Practices Reports Series


This series is designed to educate technical and business professionals about new business
intelligence, analytics, AI, and data management technologies, concepts, or approaches that
address a significant problem or issue. Research is conducted via interviews with industry
experts and leading-edge user companies and is supplemented by surveys of business and IT
professionals. To support the program, TDWI seeks vendors that collectively wish to evangelize
a new approach to solving problems or an emerging business and technology discipline.
By banding together, sponsors can validate a new market niche and educate organizations
about alternative solutions to critical problems or issues. To suggest a topic that meets these
requirements, please contact TDWI senior research directors Fern Halper (fhalper@tdwi.org),
Philip Russom (prussom@tdwi.org), or David Stodder (dstodder@tdwi.org).

Acknowledgments
TDWI would like to thank many people who contributed to this report. First, we appreciate the
many professionals who responded to our survey, especially those who agreed to our requests
for phone interviews. Second, we thank our report sponsors, who diligently reviewed outlines, survey
questions, and report drafts. Finally, we would like to recognize TDWI’s production team:
James Powell, Peter Considine, Lindsay Stares, and Michael Boyda.

Sponsors
Actian, Couchbase, Datameer, Denodo, Hitachi, Snowflake, SAP, and Trifacta sponsored
this report.

Research Methodology and Demographics

Report Scope. Cloud data management (CDM) is about managing old and new
data in hybrid data architectures that span on-premises and cloud systems.
CDM involves database management systems, data integration tools, related
platforms, and the numerous best practices users perform with them. The
primary goal of CDM is to unify heterogeneous and hybrid data so businesses
have complete data they can trust, govern, and leverage for business value. This
is a challenge because clouds are making data more heavily distributed than it
has ever been.

Audience. This report targets business and technical managers who are
responsible for modernizing data environments that consolidate traditional
enterprise data and modern big data using on-premises and cloud-based tools
and platforms, typically to support use cases in reporting, analytics, and
business operations.

Survey Methodology. In January 2019, TDWI sent an invitation via e-mail to
the data management professionals in our database, asking them to complete
an Internet-based survey. The invitation was also distributed via websites,
newsletters, and publications from TDWI and other firms. The survey drew
responses from 116 survey respondents. From these, we excluded respondents
who identified themselves as vendor employees or academics. The resulting
responses of 108 respondents form the core data sample for this report.

Research Methods. In addition to the survey, TDWI Research conducted
telephone interviews with technical users, business sponsors, and recognized
data management experts. TDWI also received product briefings from vendors
that offer products and services related to the best practices under discussion.

Survey Demographics. The majority of survey respondents are IT or
BI professionals (53%). Others are business sponsors or users (33%) and
consultants (14%). We asked consultants to fill out the survey with a recent
client in mind.

The respondent population is dominated by industries in consulting (12%),
software/Internet (11%), and financial services (10%), followed by retail/
wholesale/distribution (8%), insurance (7%), and other industries. Most survey
respondents reside in the U.S. (66%), followed by Asia (11%), Canada (8%), and
Europe (7%). For the most part, respondents are evenly distributed across all
sizes of organizations.

Position
Corporate IT or BI professionals 53%
Business sponsors/users 33%
Consultants 14%

Industry
Consulting/professional services 12%
Software/Internet 11%
Financial services 10%
Retail/wholesale/distribution 8%
Insurance 7%
Education 5%
Manufacturing (non-computers) 5%
Transportation/logistics 5%
Agriculture 4%
Food/beverage 4%
Healthcare 4%
Media/entertainment/publishing 4%
Government: Federal 3%
Government: State/Local 3%
Telecommunications 3%
Other 12%
(“Other” consists of multiple industries, each represented by less than 3% of respondents.)

Geography
United States 66%
Asia 11%
Canada 8%
Europe 7%
Mexico, Central, or South America 4%
Australia/New Zealand 2%
Africa 1%
Middle East 1%

Company Size by Revenue
Less than $100 million 21%
$100–$499 million 15%
$500 million–$999 million 10%
$1–$4.9 billion 19%
$5–$9.9 billion 8%
$10 billion or greater 19%
Don’t know 8%

Based on 108 survey respondents.


Executive Summary
Cloud computing has joined traditional paradigms, making IT hybrid.

The world of IT continues to become more hybrid in that some information systems and data
remain on premises while others are increasingly deployed to the cloud. Technical users are
lured to the cloud because of its speed, scale, elasticity, and low level of maintenance, while
business people are drawn to its agility, low cost, and ability to support new data-driven
business practices.

Data isn’t just big. It’s also hybrid.

The rise of the cloud has ramifications, especially in the realm of data management. As if
managing data of increasing size weren’t hard enough, organizations are now challenged to
monitor business processes, assemble complete views of customers, and weave a cohesive
analysis of corporate performance based on hybrid data that is strewn across the traditional
enterprise and multiple clouds. The long list of data platform and tool types that manage hybrid
data coalesces into hybrid data architectures that are difficult to understand and optimize.

The new hybrid cloud context demands an updated approach to data management.

Cloud data management (CDM) has risen to address these new challenges. CDM is the latest
evolution of data management, and it has been greatly updated and extended to support new
cloud data platforms, applications, and use cases. It also integrates data from those with
traditional on-premises sources and targets. CDM promises to enable the next level of business
analytics and data-driven operational innovations based on data from platforms old and new.

CDM is real and in use in decision-making solutions, marketing, and for data sync across
applications.

Among users surveyed, CDM is already in use in cloud data warehousing, advanced analytics,
multichannel marketing, real-time operational dashboards, and for data sync among on-premises
and software-as-a-service (SaaS) applications. According to our survey results, the
leading benefits of CDM are scalability, elasticity, analytics, real-time operations, and agility.
Barriers to successful CDM include issues in governance, data migration, data quality, and tool
maturity. A whopping 96% of users surveyed say CDM is an opportunity, which explains why so
many are adopting cloud systems or migrating to them. Although most data is managed on
premises today, survey data suggests that the amount on cloud platforms in the typical
organization will at least triple over the next three years.

Our survey says that the infrastructure and system architectures for cloud data management
are usually owned and maintained by central IT or a group responsible for enterprise data
architecture. However, groups for data warehousing, DataOps, and general data management
contribute to CDM solution designs and data requirement definitions. Among these workers,
almost half of job titles include the word “architect” because creating unified solutions in a
multiplatform and hybrid context requires deep architecture expertise for data, integration,
systems, and applications.

Cloud data management enables interoperability and data sharing among a range of systems, old
and new, on premises and in the cloud.

The hybrid IT ecosystem under discussion includes an amazing variety of systems, each fulfilling
a specific purpose, while also interoperating with many other systems. On the data side of things,
the relational database continues its hegemony, though it has evolved to run natively on clouds
and to focus on analytics (e.g., columnar, graph, and NoSQL databases). Also, data management
tools (for integration, quality, metadata, and virtualization) have evolved to execute processing
on servers and inside data platforms, both on premises and in the cloud, via interfaces for all
these. On the application side of things, traditional on-premises applications are today regularly
mixed with SaaS applications in clouds. Despite the eclectic nature of these massive software
portfolios and the massive amounts of distributed data involved, cloud data management
integrates hybrid data across hybrid data architectures with scale, speed, quality, and compliance.

This report explains in detail what CDM is and does so data professionals and their business
counterparts can understand what CDM can do for them and how they might organize a
successful program.

Introduction to Cloud Data Management


As data continues to evolve, how we manage it and use it for business value evolves, too.

Data continues to evolve into greater diversity in the sense of more data structures, schema,
sources, containers, and latencies. Likewise, user organizations continue to diversify the ways
they use data for business value, especially via use cases in advanced analytics, reporting,
data warehousing, marketing, and time-sensitive business operations. At the same time, new
platforms have emerged for data management, the most disruptive ones being open source
and cloud-based.

The tools and platforms we use to manage data are also evolving to keep up with data, its
changing management requirements, and its new use cases. Many are leveraging the power
and affordability of the cloud. This includes the new database management systems (DBMSs)
and data warehouse platforms built from the bottom up specifically for the cloud. Likewise,
data integration platforms and analytics tools have evolved to run natively in the cloud
(while still running on premises) as well as to interoperate with SaaS applications and other
new cloud systems.

Across industries, users are adopting cloud systems and integrating them with on-premises ones.

Users are adopting new cloud-based platforms and tools for data because they are optimized for
the cloud’s high performance and scale. They also offer relatively low cost, highly agile, and
flexible deployments. These cloud characteristics help organizations innovate by enabling
modern data-driven practices in analytics, real-time operations, and self-service data discovery,
prep, visualization, and analytics.

However, despite the steady migration of data, tools, and applications to the cloud, user
organizations are also maintaining their on-premises legacy systems. After all, these legacies
represent considerable investments in money, time, and physical assets. Furthermore, they
still deliver valuable automation for traditional use cases in business process optimization and
decision making.

Integrating new cloud and on-premises systems results in hybrid architectures for data and
applications.

The result is an increasingly hybrid IT ecosystem in the sense that many tools, applications, data
platforms, and data sets (and their servers, etc.) are still physically located on premises while an
increasing number are deployed in the cloud, sometimes multiple clouds. In terms of
information assets, this means that bespoke hybrid data is broadly distributed and persisted in
numerous data platforms, whether on premises, in the cloud, or both.

Data’s recent distribution to the cloud results in hybrid data architectures that are inherently
complex. More complexity comes from trends that usually accompany cloud usage, namely
a greater diversity of data structures, multiple database brands (old and new, from software
vendors and open source projects), and the capture of big data and other new data assets from
web applications, social media, customer channels, and the Internet of Things (IoT).

Managing distributed data is difficult in the best of times. Managing it in today’s multiplatform
hybrid data architectures is a whole new next-level challenge. That’s where cloud data
management saves the day.


Defining Cloud Data Management (CDM)


CDM has two halves: data persistence and data integration.

Cloud data management is simply data management that involves clouds. As with most forms of
data management, CDM involves tasks and technologies for data persistence and data
integration. For example:
• When focused on data persistence, CDM provides cloud-native data storage and optimized
processing for the burgeoning volumes of enterprise data, big data, and data from new
sources that users are choosing to manage and use in clouds.

• When focused on data integration, CDM provides data integration infrastructure that unifies
multicloud and hybrid environments. Note that data integration infrastructure includes
development tools and deployment servers for multiple forms of data integration (ETL/ELT,
replication, virtualization, event processing, etc.) as well as data disciplines related to
integration (data quality, metadata management, master data management, etc.). All these
interoperate with on-premises and cloud systems and may execute on either.
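
The ETL and ELT styles of data integration named above differ mainly in where the transformation step runs. The following is a minimal sketch of that difference, not a production pattern. It assumes an in-memory SQLite database as a stand-in for a cloud data platform, and the table names, sample rows, and cleansing rule are all hypothetical.

```python
import sqlite3

# Hypothetical rows extracted from an on-premises source system.
SOURCE_ROWS = [("  Alice ", "us"), ("BOB", " ca"), ("carol  ", "US ")]


def clean(name, country):
    """Cleansing rule used by the ETL path: trim names, normalize country codes."""
    return name.strip(), country.strip().upper()


def run_etl(conn):
    """ETL: transform in the integration layer, then load only the clean rows."""
    conn.execute("CREATE TABLE customers_etl (name TEXT, country TEXT)")
    cleaned = [clean(name, country) for name, country in SOURCE_ROWS]
    conn.executemany("INSERT INTO customers_etl VALUES (?, ?)", cleaned)


def run_elt(conn):
    """ELT: load raw rows first, then transform inside the target platform with SQL."""
    conn.execute("CREATE TABLE customers_raw (name TEXT, country TEXT)")
    conn.executemany("INSERT INTO customers_raw VALUES (?, ?)", SOURCE_ROWS)
    conn.execute(
        "CREATE TABLE customers_elt AS "
        "SELECT TRIM(name) AS name, UPPER(TRIM(country)) AS country "
        "FROM customers_raw"
    )


# SQLite stands in for the target platform (an assumption for illustration only).
conn = sqlite3.connect(":memory:")
run_etl(conn)
run_elt(conn)
etl_rows = sorted(conn.execute("SELECT name, country FROM customers_etl"))
elt_rows = sorted(conn.execute("SELECT name, country FROM customers_elt"))
print(etl_rows == elt_rows)  # both paths yield the same cleansed rows
```

In the ELT path the raw table stays available on the target platform, which is one reason that style is often paired with scalable cloud storage and in-database processing.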

CDM’s purpose is to extend data management technologies and practices to deeply support cloud
environments, especially regarding data persistence and integration.

The cloud’s many desirable traits all apply to CDM.

The cloud offers many benefits that apply to data management. These include the cloud’s elastic
scalability, zero system integration and administration, massive storage capacity, and low cost.
Cloud users have reaped those benefits with common use cases such as SaaS applications and
advanced analytics. Now they need the power of the cloud to give data management solutions
greater speed and scale with less administration and cost.

CDM is already giving users business value in production solutions for cloud data warehousing,
cloud data integration, cloud-based analytics, modern data semantics, and modern data hubs.
CDM also contributes to operational environments such as those for digital marketing, the
online supply chain, the Internet of Things (IoT), and many kinds of SaaS tools and applications.

Related Terms and Concepts for Cloud Data Management


Most users know what CDM is, although they may call it by another name.

The vast majority of people responding to this report’s survey feel that they know what CDM is
and does. (See Figure 1.) As we’ll see later, the majority of survey respondents already have
hands-on experience using or deploying data-driven cloud applications (for analytics, reporting,
marketing, and sales automation) or cloud platforms (for data warehousing and data lakes).

Do you feel you know what CDM is and does?

19% No

81% Yes

Figure 1. Based on 108 respondents.

Although users know what cloud data management is, they may refer to it by another name. To
get a sense of which terms users apply to the general practice of CDM, the survey asked, “What
term(s) do you or your team use for data management that involves clouds?” (See Figure 2.) Each
of the terms adds nuance to our understanding of CDM.

Cloud data warehousing (59%): The survey population is dominated by data warehouse
professionals, so it is no surprise that they use the industry standard term cloud data warehousing.
This also indicates that data warehousing is now firmly established as a cloud use case.

Cloud data management (50%): For many years, TDWI has been using the phrase data
management as a broad umbrella term that includes numerous technologies and user practices.
These tend to fall into two categories:
• Database management systems (DBMSs) and data persistence platforms that resemble them
(e.g., Hadoop and cloud storage)

• Data integration and related disciplines (e.g., data quality, metadata management, master
data management, replication, data services, etc.)

TDWI members and other people working in data management have also adopted the umbrella
term data management. Many place an adjective at the beginning of the phrase to describe
specific applications, as in cloud data management (50%) and its synonym hybrid data
management (31%). The latter reminds us that CDM is not solely about clouds; it is also about
combinations of computing platforms, commonly the mixture of on-premises and cloud systems.

Its synonyms reveal that CDM is distributed, service-oriented, hybrid, and multiplatform.

Data-as-a-service (DaaS, 44%): Cloud-native systems (and progressively others, too) often
interface and integrate via services of various types. These include web services (based on
SOAP or REST), emerging microservices, homegrown services, and proprietary APIs. The
term DaaS underscores how important service architectures are for modern computing, including
data management. It also reminds us that a modern hybrid ecosystem will not unify into a
cohesive data architecture without an appropriate integration infrastructure, as discussed later
in this report.

Distributed data management (24%): This term goes back to the early days of client-server
computing, which was among the first architectures to integrate data that originated in and
was managed by multiple IT systems. Though long in the tooth, the term is still relevant,
given how clouds provide even more options for distributing data as well as integrating
distributed data.

Enterprise data architecture (EDA, 40%): This equally aged term names a mature and indispensable
discipline in larger enterprises that want data consistency, quality, and accessibility across
diverse enterprise systems. The unification of platform diversity that EDA was designed to
address is still with us, but greatly expanded by the addition of cloud platforms in hybrid
architectures.

Multiplatform data architecture (19%): TDWI Research originated this term in a recent
research report.1 The term is more descriptive than most because it stresses how hybrid data
environments consist of multiple platforms of diverse types, often spread between on premises
and the cloud. It also stresses that a collection of platforms is a mere “bucket of silos” that will
not achieve full business value unless there is a data architecture that unites and integrates them.


1 For a discussion of the many data architectures possible in today’s multiplatform and hybrid data environments, see the 2018 TDWI
Best Practices Report: Multiplatform Data Architectures, online at tdwi.org/bpreports.

What term(s) do you or your team use for data management that involves clouds? Select all
that apply.
Cloud data warehousing 59%
Cloud data management 50%
Data-as-a-service 44%
Enterprise data architecture 40%
Hybrid data management 31%
Distributed data management 24%
Multiplatform data architecture 19%
Multicloud data sync 12%
Other 4%

Figure 2. Based on 306 responses from 108 respondents (2.8 responses on average).
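
Because respondents could select more than one term, the percentages in Figure 2 add up to more than 100%. The short sketch below shows how the caption's response total and per-respondent average can be recovered from the published percentages. It assumes each percentage is a share of the 108 respondents; since the published figures are rounded, the result is approximate.

```python
# Percentages from Figure 2, each taken as a share of the 108 respondents.
FIGURE_2 = {
    "Cloud data warehousing": 59,
    "Cloud data management": 50,
    "Data-as-a-service": 44,
    "Enterprise data architecture": 40,
    "Hybrid data management": 31,
    "Distributed data management": 24,
    "Multiplatform data architecture": 19,
    "Multicloud data sync": 12,
    "Other": 4,
}
RESPONDENTS = 108


def total_responses(percentages, respondents):
    """Estimate the number of selections made across all answer options."""
    return sum(pct / 100 * respondents for pct in percentages.values())


total = total_responses(FIGURE_2, RESPONDENTS)
average = total / RESPONDENTS
print(round(total), round(average, 1))  # 306 2.8
```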

Real-World Use Cases for CDM


Users surveyed use CDM mostly for decision disciplines, namely reporting, analytics, and data
warehousing.

Cloud data management practices are in place today, supporting numerous production use cases
in both analytics and operations. (See Figure 3.) For example, the top four items in Figure 3 are
all for use cases in analytics, whereas the other items are for operational use cases. This
illustrates that CDM is a real-world best practice, providing valuable support for a wide range
of business functions. It also shows that, in the average enterprise today, these business
functions and departments are all involved with cloud-based data and applications to some
degree. In other words, the cloud has penetrated the enterprise broadly, affecting just about
every business process.

For what enterprise functions or use cases is your organization applying cloud data
management? Select all that apply.
Analytics 64%
Data warehousing 52%
Reporting 47%
Data science 43%
Marketing and sales 35%
Finance 29%
Customer service 27%
Enterprise resource planning 27%
Human resources 24%
Research and development (R&D) 21%
Supply chain 14%
Other 6%

Figure 3. Based on 306 responses from 108 respondents (2.8 responses on average).

The Point of CDM and Similar Hybrid Practices


Business value is the top priority of CDM. Others include analytics, collecting web data, and
tapping cloud power.

To better understand users’ priorities for CDM, our survey asked respondents to rank common
CDM goals in priority order. (See Figure 4.)

Get more business value from data, whether in operations or analytics. Business value is
rightfully at the top of the list, as it should be for all data management. CDM supports business
value by repurposing and provisioning data for multiple business use cases from operations
to analytics, even in hybrid data environments that require “many-to-many” data integration
across multiple platforms.

Capture and leverage emerging data types and sources. This includes data from IoT, social
media, customer channels, big data, and web apps. You cannot leverage data for business value if
you cannot capture, store, process, and present that data according to its unique requirements—
and that’s where CDM plays a valuable role. New and unique data sources (many of them cloud-
based) come online almost daily in some organizations, which drives up the need for CDM that
supports diverse platforms and tools.

Users surveyed know that CDM improves architecture, analytics, real-time processes, and data
sharing.

Embrace new systems architectures and data architectures that involve cloud platforms.
Many data warehouses were originally designed exclusively for the repeatable and auditable
structures of reporting, which makes them ill-suited to the open-ended and unpredictable
explorations of advanced analytics. Instead of trying to satisfy radically different data
requirements in one warehouse, many users prefer to complement the warehouse with a cloud-based
data lake or sandbox for analytics. Similarly, marketers keep sensitive customer data and
customer masters on premises while managing massive stores of data for analytics or campaign
management in the cloud. Hybrid and distributed architectures such as these need CDM to
integrate and synchronize data across the multiple platforms involved in each use case.

Expand analytics into more advanced forms, such as machine learning and AI. Some forms
of advanced analytics work best with massive amounts of data. This includes machine learning,
data mining, graph, clustering, and statistics. Furthermore, analytics training data for these
should come from many sources and timeframes and have rich details, as is typical of raw source
data from operational systems. That’s because analytics needs broad data to make complex
correlations among far-flung data points, which is what leads to business insights. Managing and
processing data of this volume and diversity in a traditional on-premises system—such as an MPP
configuration of a relational database—would be cost prohibitive and difficult to scale. However,
the same analytics data and processing in the cloud is far more affordable and scalable.

Put the right data on the right platform for the right storage, processing, or provisioning.
For example, the data requirements for set-based OLAP or SQL are radically different than
those for algorithmic analytics for statistics, mining, and graph. Real-time analytics requires
real-time data ingestion. Sentiment analysis requires a platform that can handle unstructured
data. In addition, analytics processing increasingly runs inside a database or other persistence
layer where data is stored. Accommodating all these requirements may demand a multiplatform
storage portfolio, which nowadays generally leads to a hybrid mix of on-premises and cloud-
based platforms.

Real-time access to all data, whether on premises or in the cloud. Interfaces to and from
cloud systems vary considerably. However, the trend is toward more reliance on web services
and microservices, whether based on SOAP or REST or developed as a proprietary API. Many
services of this sort can perform in real time or close to it. This makes cloud data platforms and
analytics tools good fits for fast-paced business processes such as logistics, trading, e-commerce,
and utilities.


Take advantage of cloud characteristics. These include elastic scalability, low cost, and
managed services. Many users are drawn to the cloud in order to scale with big data analytics,
to reduce costs as compared to traditional on-premises data centers, and to hand off system
integration and administrative tasks to the cloud provider.

Simplify the process of sharing data inside and outside of an organization. Cloud data
platforms and integration tools can be used many ways, as we’ve already seen. Another way is to
use the cloud as a central, shared point of collocation or consolidation for disparate data.
For example, cloud data warehouses, data lakes, and data hubs do this (as well as many other
useful things).

What is the point of CDM? Rank the following in priority order.
Get more business value from data, whether in operations or analytics 5.7
Capture and leverage emerging data types and sources 5.3
Embrace new systems architectures and data architectures that involve cloud platforms 5.0
Expand analytics into more advanced forms, such as machine learning and AI 5.0
Put the right data on the right platform for the right storage, processing, or provisioning 4.4
Real-time access to all data, whether on premises or cloud 3.9
Take advantage of cloud characteristics 3.7
Simplify the process of sharing data inside and outside of an organization 3.4

Figure 4. Based on 97 respondents. Possible scores range from 1 to 8.
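
The report does not state how the priority scores in Figure 4 were calculated. One common way to score a ranking question, shown here purely as an assumption, is to award each item (number of items + 1 − rank) points per respondent and then average across respondents. With eight items, that yields scores between 1 and 8, and the eight scores average 4.5 when every respondent ranks every item. The scores in Figure 4 average about 4.55, which is at least consistent with such a scheme. The ballots below are hypothetical and use three items for brevity.

```python
from statistics import mean


def priority_scores(ballots, n_items):
    """Average (n_items + 1 - rank) per goal; rank 1 means highest priority.

    Assumed scoring scheme for illustration; the survey's actual method is not stated.
    """
    goals = ballots[0].keys()
    return {goal: mean(n_items + 1 - ballot[goal] for ballot in ballots) for goal in goals}


# Hypothetical ballots from two respondents ranking three goals.
ballots = [
    {"business value": 1, "new data types": 2, "data sharing": 3},
    {"business value": 2, "new data types": 1, "data sharing": 3},
]
scores = priority_scores(ballots, n_items=3)
print(scores)
```

Under this scheme, a higher score means respondents tended to place the goal nearer the top of their lists, which matches how Figure 4 is read in the text.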

USER STORY CONSULTANTS CAN HELP WITH CLOUD ADOPTION AND MIGRATION.
“I work for a professional services firm, where we help our clients adopt cloud for data management,
analytics, and other use cases,” said Ash Naseer, vice president for analytics at Cloudnile. “As a first
step, we conduct advisory consulting engagements for companies that have legacy data environments
to determine what kinds of commitments they should make to cloud and what their strategy for
adoption and migration should be. As the next step, I organize implementations that migrate data and
build cloud data management solutions.
“Our clients want to leverage the abilities of cloud, especially elasticity and scale. Cost plays into it,
too, and cloud can reduce capital expenditures for new implementations. However, our clients are more
interested in finding the right tool and technology for specific use cases, and cloud-based systems
increasingly fit their requirements, whether a pure cloud or hybrid solution.
“Our consulting practices highlight that the core principles of data warehousing and analytics will
remain valid, although adapted to cloud and other new platforms such as Hadoop. Despite big data,
increasingly advanced analytics, and new data-driven business practices, we still need the relational
paradigm and SQL fully supported for most use cases, even on cloud-based data platforms.”

Benefits and Barriers for CDM


CDM: Problem or Opportunity?
Almost all survey respondents consider CDM an opportunity, not a problem.

Embracing new best practices and technologies can uncover previously unknown challenges as
well as new insights and improvements to technical infrastructure. Users are right to ponder the
balance of risk and reward before making a commitment. Such is the case with cloud data
management today. To test whether CDM is worth the effort, this report’s survey asked: Is cloud
data management (CDM) a problem or an opportunity? (See Figure 5.)

The vast majority of respondents (96%) consider CDM an opportunity. Responses to other
survey questions reveal that users are hopeful that CDM will help them scale to big data, expand
analytics programs, draw business value from new data assets, and extend data warehouses.

A tiny minority (4%) consider CDM a problem. As we’ll see later in this report, many users
are rightfully concerned about the difficulties of data governance, privacy, and security when
data is located on numerous platforms, both on premises and in the cloud. A growing number of
organizations have resolved these concerns to manage and use hybrid data in compliant ways.

Is cloud data management (CDM) a problem or an opportunity?

4% Problem: because its complexity is difficult to integrate, optimize, and govern
96% Opportunity: because it provides many useful options for data management, analytics, operational applications, etc.

Figure 5. Based on 107 respondents.

Benefits of CDM
The leading benefits of CDM include scalability, elasticity, analytics, real-time access, and agility.

In the perceptions of survey respondents, CDM offers several potential benefits. (See Figure 6.)
A few areas stand out in their responses.

The general benefits of the cloud also apply to cloud data management. In fact, cloud
performance characteristics ranked at the top of survey responses, namely scalability (51%),
elasticity (44%), and support for new data sources and structures (30%). In phone interviews
with users that TDWI conducted for this report, users repeatedly listed scalability, elasticity,
multistructured data, advanced analytics, and real-time data practices as their reasons for
migrating data and applications to the cloud.

CDM enables modern practices in analytics and real time. For example, CDM enables
advanced analytics inexpensively at scale (35%), thereby generating new insights that make
the business more profitable, efficient, and competitive. Also, CDM enhances real-time access
to all data, whether on premises or in the cloud (35%), which enables innovative business
practices such as e-commerce recommendations, fast-paced performance management, just-
in-time inventory or maintenance, and real-time analytics with data from IoT and streaming
sources. Depending on the hybrid combination of platforms involved, CDM helps new solution
development be agile, yielding a short time-to-business use compared to implementing with
purely on-premises platforms (20%).

Data assets are more fully leveraged in the cloud. TDWI regularly encounters users who
believe that most clouds are well suited to consolidating diverse data so it can support more use
cases, which in turn yields greater business value. Similarly, some users consider the cloud a
“neutral Switzerland” that fosters collaboration among diverse users, both internal and external.
This way, the cloud makes it easier to share with external suppliers, partners, and customers
(30%). CDM makes hybrid data architectures possible, which puts data platforms near web and
IoT data sources (15%), and puts data platforms near third-party data providers (9%).

CDM fosters improvements across the board. From a technical perspective, CDM can
modernize a mature data management infrastructure (30%). From a business perspective,
managing hybrid data more effectively can improve existing business processes (29%), customer
experience and service (28%), and employee efficiency (22%).

CDM and clouds can save money. Adopting data platforms that are cloud-based can save time
and money because the platform is already set up and optimized for the cloud (alleviating time-
consuming system integration on premises). Furthermore, “renting” cloud hardware avoids
initial capital expenses for on-premises hardware, thereby making CDM an operational expense
(15%). Depending on your cloud provider and whether you signed up for a managed service,
administration for DBMSs and Hadoop can be easier (and therefore less expensive) in the cloud
than on premises (14%). In many cases, cloud personnel do work that would require new hires on
premises. For these reasons, startup expenses are low for new cloud-based data-driven programs
and solutions (16%).

If your organization were to implement cloud data management, what would its leading
benefits be? Select seven or fewer.
Scalability for data storage and integration workloads 51%
Automatic and elastic resource management 44%
Enables advanced analytics, at scale but inexpensively 35%
Enhances real-time access to all data, whether on premises or in the cloud 35%
Data and other assets or resources are more fully leveraged 32%
Cloud data platforms support new data sources and structures at scale 30%
Makes data easier to share with external suppliers, partners, and customers 30%
Modernizes our mature data management infrastructure 30%
Improves existing business processes 29%
Customer experience, service, analytics, etc. improve 28%
Improves employee efficiency 22%
Short time-to-use compared to implementing on-premises platforms 20%
Cost reduction, especially for start-up expenses 16%
Finances data management as operational expense, not capital expenditure 15%
Puts data platforms near web and IoT data sources 15%
Admin for DBMSs and Hadoop is easier on cloud than on premises 14%
Security for enterprise data is easier to implement and maintain 14%
Complies with our cloud-first corporate mandate 13%
Fits our IT outsourcing strategy 10%
Puts data platforms near third-party data providers 9%

Figure 6. Based on 605 responses from 98 respondents. 6.2 responses on average.

Barriers to CDM
Barriers to successful CDM may arise in governance, migrations, data quality, and tool maturity.

In the opinion of survey respondents, CDM and the cloud present some potential barriers. (See
Figure 7.) A few areas stand out in their responses.

Data governance and related issues keep users awake at night. Near the top of the barrier
list, users’ greatest fears center on the challenges of modernizing data governance to cover all
data on all platforms (38%). Users have other governance concerns about CDM and the cloud,
including data privacy issues (40%), data security threats (36%), the risk of exposing personally
identifiable information and other sensitive data (32%), and a variety of data usage compliance
issues (27%).

These concerns are valid given that the hybrid data architectures that CDM unites distribute data
across many platform types and enable more users and applications to access a broader range of
data. Even so, as users have migrated applications, data, and users to the cloud, TDWI has seen
many organizations across industries succeed with the cloud and CDM. They succeed by revising
and expanding their programs in data governance, stewardship, curation, and security to assure
that the cloud and CDM are compliant, secure, and governed.

Migrations are big projects that must be completed before CDM starts. Many users are
replatforming to cloud-based systems (38%), meaning that they have pre-existing applications
on premises, with data, users, and business processes they will migrate to the cloud. These
users need to allocate generous time and money for such projects because migrating data and
applications to the cloud is time-consuming (22%) and expensive in terms of acquiring and
supporting cloud platforms and apps (14%).

Promises of “lift-and-shift” are exaggerated. Replatforming usually involves development work.

In a related problem, too many organizations underestimate the scope of migration by thinking
that they can perform a simple “lift-and-shift” from on-premises systems to cloud-based
systems. In reality, many migration projects are more similar to new development than
“lift-and-shift,” as explained later in this report. For example, the more different the source and
target platforms are, the more development it will take to have a well-performing system with
optimal data at the end of the project. For all migrations, users should create a plan that organizes
each migration as a series of easily dispatched phases, instead of a risky “big bang” project.
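
The phased approach recommended above can be sketched in code. The following is only a minimal illustration, assuming a hypothetical `Phase` record whose `run`, `validate`, and `rollback` callables stand in for whatever a real migration tool would do; none of these names come from the survey or from any specific product.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Phase:
    """One small, independently testable step of a migration (hypothetical)."""
    name: str
    run: Callable[[], None]        # moves or transforms one slice of data
    validate: Callable[[], bool]   # returns True if the migrated slice looks correct
    rollback: Callable[[], None]   # undoes this phase only


def migrate_in_phases(phases: List[Phase]) -> List[str]:
    """Run phases in order; on a failed validation, roll back that phase and stop.

    Returns the names of the phases that completed successfully.
    """
    completed: List[str] = []
    for phase in phases:
        phase.run()
        if not phase.validate():
            phase.rollback()
            break
        completed.append(phase.name)
    return completed


# Illustrative usage: the second phase fails validation and is rolled back.
location = {"customers": "on_prem", "orders": "on_prem"}
plan = [
    Phase("customers",
          run=lambda: location.update(customers="cloud"),
          validate=lambda: True,
          rollback=lambda: location.update(customers="on_prem")),
    Phase("orders",
          run=lambda: location.update(orders="cloud"),
          validate=lambda: False,  # simulated validation failure
          rollback=lambda: location.update(orders="on_prem")),
]
finished = migrate_in_phases(plan)
```

In this sketch a failure affects only the phase in progress, so earlier phases keep their results. That is the practical difference from a “big bang” cutover, where one failure puts the whole migration at risk.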

A variety of problems with data can hinder cloud adoption and CDM. Many of the use
cases that cloud and CDM enable involve sharing data, as in analytics, reporting, data-driven
operations, and self-service data access. These use cases are imperiled when the owners of data
and platforms are not willing to embrace the cloud fully (24%), which increases the difficulty
of sharing data across all parties, inside and outside your organization (27%). As with any
data-oriented technology, CDM success may be threatened by the poor quality of data from
traditional sources (24%) or new sources (8%). Finally, maintaining a single version of the truth
that is current and accurate (31%) in a hybrid cloud environment requires expert skills in data
integration and synchronization.

Some users still consider best practices and tool support for clouds immature. This
includes immaturity in the CDM concept (21%) and CDM relative to requirements for advanced
analytics (11%). Likewise, CDM may be hamstrung by immature tool support for clouds and SaaS
applications (15%) and a data integration infrastructure that is weak on cloud support (19%).

If your organization were to attempt cloud data management, what would its leading
BARRIERS be? Select seven or fewer.
Data privacy issues 40%
Data governance, modernized to cover all data on all platforms 38%
Replatforming to cloud-based systems (e.g., for data warehouse, analytics) 38%
Data security threats 36%
Risk of exposing sensitive data (e.g., personally identifiable information) 32%
Maintaining a single version of the truth that is current and accurate 31%
Data usage compliance issues 27%
Difficulty of sharing data across all parties, inside and outside our organization 27%
Owners of data and platforms not willing to go to the cloud 24%
Poor quality of data from traditional sources 24%
Migrating data and apps to cloud is time-consuming and expensive 22%
Architecting solutions for environments that are excessively heterogeneous & hybrid 21%
Immaturity of the CDM concept 21%
Integration infrastructure is weak on cloud support 19%
Immaturity of tool support for clouds and SaaS apps 15%
IT won’t support the cloud brand or cloud tools we want 15%
Cost of acquiring and supporting cloud platforms and apps 14%
No compelling business case 13%
Immaturity of CDM relative to requirements for advanced analytics 11%
Poor quality of data from new sources 8%
Other 4%

Figure 7. Based on 512 responses from 107 respondents (4.8 responses on average).
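
The per-respondent averages quoted in the captions of Figures 6 and 7 follow from dividing total responses by the number of respondents. As a small worked check (the function name `average_selections` is illustrative and not part of the survey methodology):

```python
def average_selections(total_responses: int, respondents: int) -> float:
    """Average number of answers each respondent selected, rounded to one decimal."""
    return round(total_responses / respondents, 1)


# Totals taken from the figure captions.
figure_6_average = average_selections(605, 98)   # benefits question
figure_7_average = average_selections(512, 107)  # barriers question
```

With the caption totals, this gives 6.2 selections per respondent for the benefits question and 4.8 for the barriers question, both within the stated limit of seven.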

EXPERT COMMENT: CLOUD, DATA LAKE, AND DATA PREP IS AN EMERGING COMBINATION.
David Stodder is a senior research director at TDWI and a recognized expert in the new discipline
of data preparation, which he sees increasingly used with clouds and data lakes. “With more data
originating in the cloud through online and mobile marketing, e-commerce, social media, and other
apps and data services, the resulting ‘data gravity’ is attracting even greater investment in cloud-native
analytics, data preparation, and data management.
“Data warehouses, although still valuable, are being augmented or replaced by data lakes that can
store all kinds of raw, detailed data. First built with Apache Hadoop clusters on premises, many
organizations today are choosing to base their data lakes in the cloud so they can take advantage of
service-based flexibility, scale, and the promise of better economics while avoiding having to develop,
configure, and maintain data lakes on their own.
“Data preparation is essential to gaining value from all this data. Data preparation processes begin
with data ingestion and collection and include steps for profiling the data and improving its quality,
consistency, and completeness. Data preparation also includes transformation, wrangling, conversion,
cleansing, and enrichment to make the data ready for analysis, modeling, and consumption.
“Cloud computing is evolving quickly, which is having a big impact on data strategies. Agile, loosely
coupled microservices and containerization are shifting cloud-native environments away from
monolithic structures. Organizations need data preparation processes and solutions that can help them
accelerate cloud data lake adoption, take advantage of cloud’s evolution, and reduce time to value.”2

The State of CDM


Is CDM Important?
Most organizations consider CDM an important piece of their enterprise data strategy.

To gauge the urgency of cloud data management, this report’s survey asked respondents to rank
the importance of CDM relative to their organization’s data strategy. (See Figure 8.)

Few respondents (14%) say that CDM is not a pressing issue. Therefore, we can conclude
that CDM contributes significantly to enterprise data strategies.

Most respondents (86%) recognize the importance of CDM. Many feel that CDM is extremely
important (39%), while others see it as moderately important (47%).

How important is implementing CDM to the success of your organization’s data strategy?

14% Not currently a pressing issue

39% Extremely important

47% Moderately important

Figure 8. Based on 91 respondents.


2 Read more of what David Stodder has to say in the 2019 TDWI Checklist Report: How to Use Data Preparation to Accelerate Cloud Data
Lake Adoption, online at tdwi.org/checklists.

Why Is CDM Important?


Users have many good reasons for considering CDM important.

Why is CDM important? The survey asked the open-ended question, “In your own words, why is
implementing CDM important (or not important)?” The respondents’ comments reveal a number
of use cases, needs, and trends, as seen in the representative excerpts in Figure 9. Note that the
users quoted work in many different industries and geographic regions. Cloud and CDM are top
of mind for most data professionals and their business sponsors in many contexts worldwide.

In your own words, why is implementing CDM important (or not important)?
• “CDM allows us to expand our use of data with a lower up-front investment.” – Head of data
center of excellence, consulting, Asia

• “[CDM’s] scale and speed are commodities we cannot afford to compromise on.” – BI
manager, media and entertainment, U.S.

• “[CDM] aligns with the company strategy to move to the cloud, so it is important that the
move is done correctly and properly.” – Group manager of IT, media and entertainment,
Australia

• “[Our] IT mandate to move all systems to the cloud makes CDM an absolute necessity.” –
Data analyst, consulting, Africa

• “Cost-effective use of technology and data warehousing. Leverage dollars, repurpose,
provide broader access to tools and infrastructure.” – Senior analyst, transportation, Canada

• “Cloud computing facilitates the access of applications and data from any location worldwide
and from any device with an internet connection.” – Database marketing, insurance, U.S.

• “Need to modernize the data management technologies to lay a foundation for cost-effective
machine learning and deep learning [analytics].” – Consultant, financial services, U.S.

• “As business processes move to the cloud, it is vital that data processes are also supported.”
– Product manager, hospitality, Europe

• “[CDM is important] to modernize the data warehouse and add business value.” – BI
manager, food and beverage, U.S.

• “It’s important to be able to stay up-to-date on the latest software updates. The cloud puts
this on the vendors.” – Director of data governance, retail, U.S.

• “New opportunities while leveraging existing data assets.” – System administrator,
education, Canada

• “It is important to be prepared for the future and not just do what we’ve always done.” –
Manager of DW/BI, healthcare, U.S.

Figure 9. Drawn from the text responses of 108 respondents.

Cloud Adoption: Decision Disciplines Are Catching up to Operational Ones


To sort survey respondents according to their exposure to CDM, our survey asked, "Do you
personally have experience implementing and/or using tools, apps, or platforms that participate
in some form of cloud data management?" (See Figure 10.)

Most organizations surveyed are already doing CDM in some form.

The majority of respondents (81%) have direct experience with some form of CDM. This
percentage reveals how pervasive clouds and data management on or around them have become.
However, why the large response?

As numerous technologies and practices have emerged over the history of IT, a recurring trend is
for operational applications to adopt the technology first, followed after a few years by decision-
making disciplines such as reporting, analytics, data warehousing, and data integration. In
recent years, TDWI has observed organizations of all sizes and from all industries making
deep commitments to many forms of cloud-based operational applications, licensed in the
software-as-a-service (SaaS) model. These include SaaS applications for sales force automation
(SFA), customer relationship management (CRM), marketing campaign management, financials,
call center, human resources, and so on. We are now at the phase of the cloud adoption and
maturation cycles where decision-making disciplines are catching up to operational ones.

Do you personally have experience implementing and/or using tools, apps, or platforms that
participate in some form of cloud data management?

19% No

81% Yes

Figure 10. Based on 91 respondents.

Note that this report’s survey branched according to how respondents answered this question.
Respondents answering “yes” (81% in Figure 10) were presented with detailed questions about
CDM technologies, teams, and best practices, as seen in many of the following figures.

CDM Successes
CDM successes commonly occur in reporting, analytics, data warehousing, and data integration.

CDM is similar to any IT discipline. There are successful programs and programs that fail
outright. However, success and failure are more often a matter of degree, in that some aspects of
a program succeed while other aspects fail. The program continues into the future and users
improve the lackluster parts as they go.
To get a sense of which aspects of CDM are currently succeeding or failing, our survey presented
two open-ended questions that allowed respondents to describe their successes and failures in
their own words. (See Figures 11 and 12.) Note that these questions were posed to respondents
who reported having direct exposure to CDM; they speak from real-world experience.

Figure 11 assembles several representative comments about aspects of CDM that are succeeding
today. The survey population is dominated by people who work in decision-making disciplines,
so it’s no surprise that their successes primarily concern reporting, analytics, data warehousing,
and data integration, with additional successes for many types of operational applications and
their data needs. Clearly, respondents’ successes prove that CDM is real, it works, and it provides
business value.

Briefly list some areas where cloud data management has succeeded in your organization.
Reporting

• “We have had success converting over reports and dashboards”

• “Self-service reporting”

• “Enterprise reporting, single source of truth”

Analytics

• “Customer analytics”

• “Machine learning, data science, near-real-time data”

• “Managing real-time data and analytics”

• “Using a public cloud for heavy calculations on data and wiping it after use”

Data warehousing and other databases

• “Data warehouse modernization and move to the cloud has worked well”

• “Migrating on-premises data warehouses to the cloud”

• “Data platforms, data semantics, and data integration”

• “Master data management project enabled to combine data from 27+ organizations”

• “Reducing costs and processing data faster with more resources”

Operational applications

• “Salesforce implementation and integration with on-premises DWs”

• “Subsidiary companies that don’t own existing infrastructure”

• “Back-end office finance”

• Many mentions of marketing, sales, CRM, HR, supply chain, IoT data, mobile data

Figure 11. Drawn from the text responses of 50 respondents who have CDM experience.

CDM Failures
CDM failures involve consultants, inadequate governance, poor quality data, and poor planning.

Figure 12 assembles several representative comments about aspects of CDM that sometimes fail.
The figure also includes commentary about possible causes and remedies for the failures. Note
that none of these are total failures. Instead, they are partial failures, each isolated to an aspect
of CDM, such that the failures can be corrected and the solution expanded for long-term success.

Briefly list some areas where cloud data management has FAILED in your organization.
• “No failures yet.” Forty percent of respondents reported having no failures. For some, they
simply have not done CDM long enough to suffer partial or complete failures. With others,
it’s because their CDM solution works fine and delivers value, which is a very good sign.

• “Allowing consultants to be the only ones who know the implementation.” When attempting
something that is new to you and your organization (as is often the case with cloud
solutions), a tried-and-true IT practice is to hire consultants who have the experience you
lack. However, for long-term success, this process must include a thorough knowledge
transfer from consultants to internal personnel.

• “Consolidating all data into one place takes enterprise governance. The tools and process
to do this are not mature.” Data governance (DG) is clearly a success factor for CDM. As
we noted earlier, DG is especially hard in cloud and hybrid data architectures where data
is distributed geographically. However, DG is mostly about people, process, and policy
making—rarely about tools—so that’s where the correction should be applied.

• “Data quality. During the retrieval process the data was not accurate.” If this was a problem
with the quality of data from the source systems involved in a migration to the cloud, then
basic data profiling should have revealed required corrections. Besides, you should never
just move data; always improve data as you move it. However, this problem might be due
to faulty ETL logic. Regardless of cause, a good migration will be designed as a multiphase
project that includes testing of data and logic at all phases, with rollback as a contingency.

• “Simply moving data from relational to Hadoop and using a presentation layer.” TDWI is
seeing a lot of disappointment in Hadoop of late. For technical and business users who are
used to mature relational databases, the light relational functionality you can retrofit onto
Hadoop is not very satisfying. This explains why most replatforming today is no longer
focused on Hadoop but instead on the new cloud-based databases built specifically to deliver
deep relational functionality, albeit with cloud’s speed, scale, and low cost.

• “Initial phases are taking too long. No 'quick wins' for the process.” A good cloud
implementation or migration plan should start with small, low-risk goals that fairly quickly
demonstrate technical prowess and business value. Otherwise, everyone involved can
become dispirited and management may even cancel the project.

• “Tightly integrated legacy applications are hard to get off our on-premises data management
[infrastructure].” This is another reason why “lift-and-shift” migrations to the cloud don’t
always work as advertised. The more mature an incumbent solution is, the more similar to
new development the migration to the cloud will be. That’s because the data management
components of the legacy solution will probably include platform-specific functionality
such as database stored procedures, user-defined functions, and proprietary APIs, as well
as hand-coded integration routines. None of that will port unaltered. Legacy-to-cloud
migrations are the hardest and longest. Plan for a multiyear transition.

Figure 12. Drawn from the text responses of 46 respondents who have CDM experience.
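
The data quality comment in Figure 12 recommends basic data profiling and testing at each migration phase. One way such a check might look is sketched below. This is an assumption-laden illustration that uses plain Python dictionaries as rows; the helper names `profile` and `compare_profiles` are hypothetical and do not refer to any particular profiling tool.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def profile(rows: List[Row], key: str) -> Dict[str, object]:
    """Summarize a data set: row count, distinct key values, and empty values per column."""
    columns = sorted({col for row in rows for col in row})
    return {
        "row_count": len(rows),
        "distinct_keys": len({row.get(key) for row in rows}),
        "nulls": {c: sum(1 for row in rows if row.get(c) in (None, "")) for c in columns},
    }


def compare_profiles(source: Dict[str, object], target: Dict[str, object]) -> List[str]:
    """Return the names of profile metrics that differ between source and target."""
    return [metric for metric in source if source[metric] != target.get(metric)]


# Illustrative data: the migrated copy contains a duplicated row.
source_rows = [{"id": "1", "email": "a@example.com"}, {"id": "2", "email": None}]
target_rows = [{"id": "1", "email": "a@example.com"}, {"id": "2", "email": ""},
               {"id": "2", "email": ""}]

differences = compare_profiles(profile(source_rows, "id"), profile(target_rows, "id"))
```

Here `differences` lists `row_count` and `nulls`, which flags the duplicated row before users query the migrated data. A real project would add checks suited to its own data, such as value ranges or referential integrity.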

USER STORY: HYBRID DATA WAREHOUSES CAN BE CHALLENGING TO GOVERN.


“About three years ago, we hired a large consulting firm to evaluate the legacy data warehouse we
had at the time and to determine our future requirements for reporting and analytics. Based on their
recommendations, we built a hybrid data warehouse,” said a data warehouse professional currently
assigned to data governance at a financial institution.
“Our new hybrid data warehouse architecture centers around a traditional warehouse on premises,
deployed on a large MPP configuration of a leading relational database brand. However, the architecture
also includes large data sets for analytics and self-service on two different cloud providers, plus cloud-
based tools for data integration and data governance. Over time, we’ll progressively move more data
and its processing to the cloud, while maintaining the hybrid architecture.
“Data governance for the new warehouse is different from governance for the old one. Replatforming
and migrating data revealed some data quality issues, and distributing warehouse components across
multiple platform types—some in the cloud, some physically located on premises—makes data
access tracking more challenging. That’s why I’ve been assigned to revamp data governance for
data warehousing.
“My first priority in governance is to improve data quality across the distributed warehouse
environment, largely for the sake of regulatory compliance, risk reduction, and financial reporting.
We’ve updated our library of quality metrics and we hope to soon achieve 80% to 90% quality per
data set.
“My second priority is to restructure the governance organization. In addition to the data warehouse
team, we now have a warehouse governance team. It consists mostly of data stewards who are
subject matter experts from the business, so they know the data and its impact on specific business
processes.”

Organizational Matters
CDM Owners
The scope of CDM can be as narrow as a few simple data integration jobs that move data from
SaaS apps to a data warehouse. Alternately, it can be a very broad infrastructure that provides
deep interoperability and integration among multiple applications, tools, and data platforms,
distributed across multiple clouds and on-premises sites. The scope of your CDM solution affects
its ownership, design, funding, and maintenance. The broader the scope, the more likely it will
be owned by a large centralized organization within your enterprise.

CDM is usually owned by enterprise groups for data architecture or central IT.

For example, according to survey results, the two most common CDM owners are the data
architecture group (47%) and central IT (42%). (See Figure 13.) This makes sense because many
midsize to large organizations have an enterprise data architecture (EDA) team responsible for
data architectures and data standards for enterprise environments that span multiple platforms
and business units, as is common with the hybrid data architectures integrated by CDM. The role
of central IT varies widely, but in many modern organizations its first priority is to provide
enterprise-scope infrastructure (networks, storage subsystems, racks of servers, etc.), and CDM
demands hefty infrastructure.

Data teams contribute to CDM’s design, integration, and data requirements.

Ownership aside, multiple teams must contribute to the design of data structures and data
integration solutions involved in successful CDM solutions. These teams include various data
management groups or DataOps (34%) and the data warehouse group (27%), with possible
involvement of application groups or DevOps (14%). Note that CDM is more often broad than
narrow in scope, which is why it is rarely owned by a business unit or department (14%), although
those organizations may be heavy users of CDM and cloud systems and therefore have data
requirements that need attention.

Who primarily designs and maintains the CDM solution you work with? Select three or fewer
answers.
Data architecture group 47%
Central IT 42%
Data management group or DataOps 34%
Data warehouse group 27%
Application group or DevOps 14%
Business unit or department 14%
Research or analysis group 8%
Third party (e.g., managed service or cloud provider) 8%
Other 4%

Figure 13. Based on 146 responses from 74 respondents who have CDM experience (2 responses per
respondent on average).

CDM Workers
Architects, engineers, analysts, and upper management make CDM happen.

This report’s survey asked respondents with CDM experience to enter the job titles of people who
design and implement CDM solutions. (See Figure 14.)

CDM does not happen without architects, lots and lots of architects. On the data side of
IT, this includes data architects (16%), data warehouse architects (4%), enterprise data architects
(2%), and a new title: cloud data architects (2%). On the applications and systems side of IT,
architects take the form of solution architects (7%), enterprise architects (6%), and systems
architects (2%).

It makes sense that CDM requires so many architects. After all, data architecture is usually
about how diverse data sets relate through shared data structures or the data flows and pipelines
of data integration. It is also about enterprise standards that should apply to most enterprise
data sets. Similarly, architectures for solutions and servers concern how diverse application
modules communicate within a single application or across multiple ones. Today’s hybrid cloud
environments are bursting with multiple data platforms, data sets, and applications, which
require architects to make them integrate and interoperate in an organized fashion that lends
itself to optimization and maintenance.

Note that data architecture is rarely green field; it’s like archeology in that architects dig into a
data ecosystem to gain an understanding of its existing systems and data. From that knowledge,
they envision a global design that will unify disparate data platforms and data sets at an
appropriate and realistic level. Architects also suggest local changes to data models, databases,
interfaces, and data standards to make data more easily shared across the multiple platforms
typical of today’s modern digital enterprises, especially those that have embraced cloud
computing.

The cloud and its data management tend to be top heavy, organizationally speaking. This
is clear from the numerous management titles revealed by the survey, including managers (15%),
chief officers (8%), directors (5%), and vice presidents (4%). Among these, CDOs, CTOs, CIOs,
and VPs get projects moving and keep them focused on the business goals of upper management,
whereas directors and managers direct the quotidian work.

Enter the job titles of people who contribute significantly to the design and implementation of
cloud data management.

DATA SPECIALISTS
Data architects 16%
Data integration engineers 9%
Data warehouse architects 4%
Data analysts and scientists 3%
BI developers 2%
Cloud data architects 2%
Enterprise data architects 2%

APPLICATION or SYSTEM SPECIALISTS
Solution architects 7%
Enterprise architects 6%
System architects 2%
Database administrators 2%

MANAGEMENT
Managers 15%
Chief officers (CDO, CTO, CIO) 8%
Directors 5%
Vice presidents 4%

MISCELLANEOUS
Business owners, sponsors, users, SMEs 7%
Other 6%

Figure 14. Based on 123 responses from 54 respondents (2.3 responses per respondent on average).

Hiring and Training for CDM Skills


Fill the skills gap by hiring new people, training employees, and engaging consultants.

There are multiple ways to get the skills you need for working with clouds and CDM. (See
Figure 15.)

Hiring new employees with architectural and integration experience. Finding data
professionals who are truly qualified for advanced work in architecture and integration continues
to be a challenge for IT and data management teams. Yet, 73% of the organizations surveyed
report that they are able to do so successfully.

Training existing employees for new skills in architecture and integration. Cross-training
data management professionals works because these professionals enjoy learning new
skills, know that it’s good for resumes and job security, and are fully capable of working with
increasingly complex combinations of data platforms and user best practices, which is the very
hallmark of the multiplatform hybrid data architectures under discussion here.

Depending on consultants for new skills. Many system integrator and consulting firms now
have mature practices devoted to the cloud in general, including the variations of cloud data
management discussed in this report. Hence, consultants are an available and reliable source of
cloud skills—if you have the budget to afford them. If you tap consulting resources, be sure they
work side-by-side with employees to assure a knowledge transfer that will position employees to
eventually take over the project successfully.

How is your organization staffing CDM design and implementation? Select all that apply.

Hiring new employees with architectural and integration experience 73%
Training existing employees for new skills in architecture and integration 58%
Depending on consultants for new skills 45%

Figure 15. Based on 130 responses from 74 respondents who have CDM experience (1.8 responses
on average).

Holistic Data Governance for Hybrid CDM


Data should be governed holistically across all platforms, including clouds. This is true
whether data exists on premises, in the cloud, or both (as is common in today’s hybrid data
architectures). It is also true whether data migrates to a cloud, originates there, migrates
off a cloud, or in some combination. Data governance should be holistic despite the extreme
complexity of modern hybrid data architectures.

The goal of modern DG is to govern all data and platforms via fewer, but more comprehensive, policies.

Holistic data governance should apply evenly across enterprise platforms. Governance
policies are too often made one platform, data set, or user constituency at a time. Instead, data
governors should design policies broadly so the policies apply directly to data access and use in
many scenarios. This way, business and technical users can apply such policies unaltered (or
updated slightly) to cover new applications, data platforms, and data sets, whether on premises,
in the cloud, or in a hybrid mix. After all, “compliance is compliance” in most enterprise
scenarios, especially when guided by legislated regulations (e.g., U.S. HIPAA and EU GDPR),
certain data domains (consumers and patients), internal privacy policies, and enterprise
requirements for stewardship and curation. As you migrate data to the cloud or generate data
from new cloud systems, be sure that established data governance policies and processes are
applied with minimal revisions and without creating new and potentially conflicting policies.

Consider where the data will live. Replatforming and migrations to the cloud move data
physically, and that is a potential problem when data travels great distances. For example,
long-standing data protection regulations set up by the European Union limit the movement
of certain data domains across certain national borders. When replatforming moves data to a
cloud, verify that the location of the cloud provider’s data center complies with legislated
regulations that apply to your data.
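
The residency check described above can be sketched as a simple policy lookup that runs before any data set is replatformed. This is a minimal illustration, not a compliance tool: the data domain names, region codes, and the functions `region_is_allowed` and `blocked_moves` are all hypothetical and do not come from any regulation or cloud provider API.

```python
# Hypothetical residency policy: data domain -> regions where it may be stored.
# Domain names and region codes are illustrative assumptions only.
ALLOWED_REGIONS = {
    "eu_customer_pii": {"eu-west", "eu-central"},
    "us_patient_records": {"us-east", "us-west"},
    "public_reference_data": None,  # None means no residency restriction
}


def region_is_allowed(data_domain: str, target_region: str) -> bool:
    """Return True if the policy permits storing this domain in the region."""
    if data_domain not in ALLOWED_REGIONS:
        # Unknown domains are rejected so they get reviewed before migrating.
        return False
    allowed = ALLOWED_REGIONS[data_domain]
    return allowed is None or target_region in allowed


def blocked_moves(plan):
    """Return the (domain, region) pairs in a migration plan that violate policy."""
    return [(d, r) for d, r in plan if not region_is_allowed(d, r)]
```

Calling `blocked_moves` on a list of planned (domain, region) pairs before each migration phase gives governors a short list of moves to review, which keeps one policy table in force across every platform instead of one rule set per cloud.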

New data-driven practices need new DG guidance.

Rethink governance for new platforms that enable new business and technology practices.
At the top of that list is self-service data access, which is a foundation for self-service data
exploration, data prep, visualization, and analytics. The broad access and use of data that
self-service practices assume will inevitably lead to governance infractions unless governance
policies are put in place as soon as the new practices arrive. In addition to policies, new practices
should also be guided more granularly by data stewards and data curators.

More than compliance, DG also sets data standards.

A governance committee should enforce enterprise-scope data standards. Data governance
is not only about compliance. A mature governance program will also establish data standards
for data models, integration methods, quality, and semantics. Standards give data the
consistency it needs to be shared across multiple IT systems and business units. That’s important
because sharing data helps the modern business succeed with single views of customers,
complete data for reports and analyses, and up-to-date status information for operational
processes. Governed data standards are also useful as more data travels across the hybrid data
architectures that are typical of cloud scenarios.3

Multiphase Plans for Migrating Data to the Cloud4


Organizations of any size or maturity will already have a data warehouse deployed and in
operation. Modernizing a warehouse regularly involves migrating data from platform to platform,
increasingly from on-premises to the cloud. This is because replatforming is a common strategy
and the cloud is the most modern platform available for warehouses today.

Some data warehouse modernization programs seek to simplify bloated portfolios of databases
(or to take control of rogue data marts) by consolidating them onto fewer platforms. The cloud
is an easily centralized and globally available platform, which makes it an ideal target for data
consolidation. Hence, users who modernize a data warehouse need to plan carefully for the
complexity, time, business disruption, risks, and costs of migrating and/or consolidating data,
with special considerations for cloud platforms.

The key is to plan carefully before migrating data, its management, and its users to the cloud.

Create a multiphase plan before migrating data warehouses and other databases to cloud.

Don’t bite off more than you can chew. We all know the risks of a "big bang" project, where
size and complexity raise the probability of failure. Such risks are easily mitigated by a
multiphase project plan, which segments work into multiple manageable pieces, each with a
technical goal that adds business value.

Start with a low-risk, high-value segment of work. For example, successful data migration
or replatforming projects focus the first phase on a data subset or use case that is both easy to
construct and in high demand by the business. Prioritize early phases so they give everyone
confidence by demonstrating technical prowess and business value. Save problematic phases
for later.
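
One way to make the "low-risk, high-value first" rule concrete is to score each candidate phase and sort on the scores. The sketch below assumes a team assigns 1-to-5 scores for business value and risk; the `Phase` class, the scoring scale, and the example phases are illustrative assumptions, not a method prescribed by this report.

```python
from dataclasses import dataclass


@dataclass
class Phase:
    name: str
    business_value: int  # 1 (low) to 5 (high), scored by the team
    risk: int            # 1 (low) to 5 (high), scored by the team


def order_phases(phases):
    """Sort phases so that high-value, low-risk work comes first."""
    # Highest (value - risk) first; ties go to the lower-risk phase.
    return sorted(phases, key=lambda p: (-(p.business_value - p.risk), p.risk))


# Hypothetical candidate phases for a warehouse migration.
candidates = [
    Phase("Legacy stored procedures", business_value=3, risk=5),
    Phase("Sales reporting mart", business_value=5, risk=1),
    Phase("Finance consolidation", business_value=4, risk=3),
]
plan = [p.name for p in order_phases(candidates)]
```

With these example scores, the sales reporting mart comes first and the legacy stored procedures come last, which matches the advice to save problematic phases for later.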

Note that you’re not just migrating data. You’re also migrating business processes, groups of
end users, reports, applications, analysts, developers, and data management solutions. Plan to
migrate all these elements with minimal disruption to business operations.

3 Learn about how governance and other data management practices can be adapted to cloud use cases in the 2017 TDWI Checklist
Report: Data Management Best Practices for Cloud and Hybrid Architectures, online at tdwi.org/checklists.
4 This section of the current report is borrowed from the 2018 TDWI Checklist Report: Modernizing Data Warehouses via the Cloud, online
at tdwi.org/checklists. Read that report for more information about migrations to the cloud and related topics.

Plan contingencies for risky milestones. Expect to fail, but be ready to recover via rollback.
Don’t be too eager to unplug the old platforms because you may need them for rollback. It’s
inevitable that old and new data warehouse platforms will operate simultaneously for months or
years, depending on the size and complexity of the data, user groups, and business processes you
are migrating.

Some migrations are simple. Others are like new development.

Expect development work, not just migration and consolidation work. Replatforming can
easily feel like new development when data being migrated or consolidated requires much work.
For example, some data and solution components will "lift-and-shift" quickly, and work pretty
well on the new platform. Others will not. Even when the so-called "lift-and-shift" works,
developers may need to tweak data models and interfaces for maximum performance on the new
platform.

When the new platform offers little or no backward compatibility with the old one, development
may be needed for platform-specific components such as stored procedures, user-defined
functions, and other hand-coded routines. Similarly, poor data quality and modeling should
be remediated during migration; otherwise you’re just bringing your old problems into the
new platform. In all data management work, when you move data you should also endeavor to
improve data.

Data migrations to the cloud affect many types of people. Your plan should protect them all.

Assemble a diverse team for modernizing and replatforming a data warehouse.

• Obviously, data management professionals are required. Data warehouse modernization and
replatforming usually need specialists in warehousing, integration, analytics, and reporting.
When tweaks and new development are required, experts in data modeling, architecture, and
data languages may be required. Don’t overlook the maintenance work required of database
administrators (DBAs), systems analysts, and various IT staff.

• Affected parties must be part of the process. A mature data warehouse will serve a long
list of end users who consume reports, dashboards, metrics, analyses, and other products
of data warehousing and business intelligence. These people report to a line-of-business
manager and other middle managers. Affected parties (i.e., managers and sometimes end
users, too) should be involved in planning a data warehouse modernization. First, their
input should affect the whole project from the beginning so they get what they need to be
successful. Second, the new platform rollout should take into consideration the productivity
and process needs of all affected parties.

• External parties may need coordination. In some scenarios, such as those for supply
chain, e-commerce, and business-to-business relationships, the migration plan should
stipulate dates and actions for partners, suppliers, clients, customers, and other external
entities. Light technical work may be required of external parties, as when customers or
suppliers have online access to reports or analytics supported by a cloud data warehouse
platform.

EXPERT COMMENT: MODERNIZATION IS MORE THAN MIGRATING AND REPLATFORMING.


David Loshin is the president of Knowledge Integrity, Inc. and a well-known expert in data management.
He has much to say about system modernization and data migration: “Disruptive technologies
such as Hadoop and cloud computing have motivated rallies for system modernization. What does
‘modernization’ really mean?
“A common misconception suggests that modernization consists of merely moving existing applications
to a new platform. However, modernization is actually more about refactoring or reengineering an
existing legacy system to align it with current business demands. Segregating business processes from
their original implementations helps eliminate dependencies that were hard-coded into legacy systems
to accommodate those business processes.
“Because modernization involves a more sophisticated approach to reengineering, organizations that
choose to do so face many challenges that may impede their abilities to effectively modernize their
data warehouse environments, whether they are on premises or in the cloud. Becoming aware of
the challenges allows you to prepare a foundation for brainstorming approaches to addressing and
overcoming those challenges.”5

CDM Best Practices


Data Platforms for Hybrid CDM
CDM involves a rich selection of tools and platforms.

The list of tools and platforms involved in cloud data management continues to evolve, as users
adjust their software portfolios, usually to include more cloud-based systems. To quantify the
mix that users are working with today, our survey asked, "For the CDM solution that you use
most, what types of use cases, data, and compute platforms are being supported today?" (See
Figure 16.) The overall message from survey responses is that a cloud of some kind is being used
by almost all user organizations. Furthermore, data is being created, managed, integrated, and
used for business advantage across hybrid data architectures, as seen in the following examples.

Operational applications. Given that on premises is still the norm (compared to the cloud), it
is no surprise that most of the users surveyed have a variety of on-premises systems (86%), both
operational and analytic. However, the surprise is that a minority of organizations (perhaps as
high as 14%) would prefer to have no significant on-premises footprint. In phone interviews,
TDWI found a couple of these—new firms with digital products (software applications or third-
party data). Because their products are cloud-based, it makes sense that they also made internal
IT as cloud-based as possible.

Obviously, traditional operational applications are still with us (ERP, CRM, SFA, etc.) (72%),
although most brands of packaged applications may now be flexibly deployed on premises or in
the cloud. Among end users, the platform trend for operational applications is clearly toward
software-as-a-service (SaaS) applications (67%). In fact, two-thirds of organizations surveyed
already have these.

5 Read more from David Loshin in the 2019 TDWI Checklist Report: Overcoming Challenges to Data Warehouse Modernization, online at
tdwi.org/checklists.

Cloud types. Plans for cloud migrations typically start by selecting a cloud provider. Most users
choose a third-party cloud (56%) instead of a private cloud (42%), usually because of the time
and cost of building and maintaining a private one. Most organizations prefer a third-party cloud
because it meets their goals of outsourcing IT infrastructure, avoiding system integration, and
reducing administrative costs. In a related trend, TDWI sees users increasingly looking for third-
party cloud providers that have a managed service (40%) for platforms they need in the cloud,
typically a favored brand of database, distribution of Hadoop, analytics platform, or tool for
data integration.

Every layer of the decision-making technology stack is now established in the cloud.

Decision-making technologies. The prominent layers of the decision-making technology stack
are represented amply in survey results, both on premises and in the cloud. These decision-making
technologies include data warehouses (77% on premises, 53% cloud), data integration
platforms (77% on premises, 49% cloud), analytics tools (74% on premises, 58% cloud), and
data lakes (49% on premises, 42% cloud). Note that each category of decision-making technology
is currently more prominent on premises than in the cloud, but not by much. TDWI expects the
“cloud gap” to shrink as more users gain confidence in the cloud and as cloud providers and
software vendors improve their offerings.
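
The size of the "cloud gap" can be worked out directly from the percentages quoted above. The short sketch below only restates the figures from this report; the variable names are illustrative.

```python
# Percentages as reported in Figure 16: (on premises, cloud).
adoption = {
    "Data warehouse": (77, 53),
    "Data integration platform": (77, 49),
    "Analytics tools": (74, 58),
    "Data lake": (49, 42),
}

# The "cloud gap" is the on-premises share minus the cloud share, in points.
cloud_gap = {name: on_prem - cloud for name, (on_prem, cloud) in adoption.items()}

# List the categories from the smallest gap to the largest.
for name, gap in sorted(cloud_gap.items(), key=lambda item: item[1]):
    print(f"{name}: {gap} percentage points")
```

The gap ranges from 7 points for data lakes to 28 points for data integration platforms, so the newest category is the one closest to parity between on-premises and cloud deployment.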

DBMSs have adapted well to the cloud.

Database management systems (DBMSs). The relational DBMS is still the most common
paradigm for data management, even in the cloud (72%). Even so, users continue to diversify the
range of DBMS types they use, because data itself and business uses of it are diversifying (as
discussed at the start of this report). In fact, nonrelational DBMSs (40%), NoSQL DBMSs (37%),
and analytic DBMSs (40%) are all firmly established in users’ software portfolios, in both
enterprise and cloud deployments.

Do not forget that the relational DBMS is evolving, too. In one direction, most so-called
columnar, graph, and NoSQL DBMSs support the relational paradigm. In another direction,
multiple DBMSs are now available as native applications on multiple public clouds, whether these
are mature brands ported to the cloud or the new relational DBMSs built from the ground up for
the cloud. In a related trend, object storage in the cloud (53%) is gaining popularity because it
resembles the relational and DBMS paradigms but with the cloud’s speed, scale, and low cost, as
well as mechanisms for embedding object storage in a number of application technology stacks.

Open source software. The operating system LINUX years ago proved the value, performance,
reliability, and low cost of open source software. In recent years, many open source tools and
platforms have proved themselves useful in data management, including cloud solutions (70%).
In particular, Hadoop (42%) is now common in data warehousing and analytics, both on premises
and in the cloud, followed by Spark (35%) and open source containers (e.g., Docker) (35%).

For the CDM solution that you use most, what types of use cases, data, and compute
platforms are being supported today?

OPERATIONAL APPLICATIONS
On-premises systems 86%
Traditional applications (ERP, CRM, SFA, etc.) 72%
Software-as-a-service (SaaS) applications 67%

CLOUD TYPES
Public or third-party cloud 56%
Private cloud 42%
Managed service provider 40%

DECISION TECHNOLOGIES
Data integration platforms on premises 77%
Data warehouse on premises 77%
Analytics tools on premises 74%
Analytics tools in the cloud 58%
Data warehouse in the cloud 53%
Data integration platforms in the cloud 49%
Data lake on premises 49%
Data lake in the cloud 42%

DATABASE MANAGEMENT SYSTEMS (DBMSs)
Relational DBMSs 72%
Object storage in the cloud 53%
Nonrelational DBMSs 40%
Analytic DBMSs 40%
NoSQL DBMSs 37%

OPEN SOURCE
Open source software (except ubiquitous LINUX) 70%
Hadoop 42%
Spark 35%
Containers (e.g., Docker) 35%

Figure 16. Based on 57 respondents who have CDM experience.

Data Management Tools for Hybrid CDM


DM tools extended for the cloud make CDM happen and unify hybrid ecosystems.

Hybrid cloud environments and CDM need a comprehensive data integration infrastructure and
related data management tools to manage, move, document, and provision fit-for-purpose data
that’s clean, compliant, and governed. To quantify the mix of data management platforms and
tools users are working with today, our survey asked, "What data management capabilities do
you need for successful CDM today?" (See Figure 17.)

Data stores. Both operational and analytics use cases involving hybrid data architectures
benefit from mid-tier data stores where extremely diverse data is integrated, aggregated, and
repurposed for these use cases. For that function, the data warehouse (80%) is still alive and
well, having modernized recently to leverage the strengths of new platforms, such as the cloud,
Hadoop, and NoSQL.

However, most warehouses continue to be optimized for carefully cleansed, documented, and
structured data sets, typically for standard reports and dashboards. Therefore, users increasingly
complement a data warehouse with a data lake (48%) that is optimized for massive volumes
of detailed source data, typically for operations, operational reporting, data exploration or
discovery, and analytics. This way, the warehouse and lake complement each other; together
they create a more comprehensive solution.

Data quality, integration, and virtualization are the meat and potatoes of cloud data management.

Core data management. Absolute “must haves” for CDM include core data management
functions for data integration (70%), data prep (57%), and data quality (55%). Data virtualization
(DV, 43%) is a near-real-time and virtual alternative to batch-driven ETL/ELT-style data
integration. DV is an excellent fit for CDM because it specializes in interoperability with many
far-flung sources (typical of hybrid cloud data ecosystems) while instantiating integrated data
sets with close to real-time performance.

Special user features. To get full business value from CDM’s unification of hybrid data, business
people and other user constituencies need self-service data access and exploration tools (50%)
so they can perform modern self-service practices such as data prep and data visualization. Note
that these self-service practices depend on business metadata or its equivalent (defined below)
that is best provided by the data integration tools that are fundamental to CDM. Ideally, such
tools will also have features tuned for multiple user types (from both technology and business)
(45%) via data-sharing functions (48%) and stewardship and curation features (31%).

Real-time performance. CDM for hybrid data clouds regularly supports time-sensitive use
cases, such as business monitoring, operational reporting, and data capture for IoT. For these,
CDM’s data management infrastructure must support real-time data interfaces (45%) and event
processing (38%).

The diverse platforms of hybrid environments demand many kinds of interfaces and semantics.

Cross-platform interfaces. For CDM to unify a hybrid data architecture, it needs heavy doses of
data pipelining (39%) and other modern approaches to cross-platform interfaces. To help
optimize complex integration solutions and avoid resource conflicts, CDM’s data management
tooling needs orchestration and workflow management (36%). When the number of
interfaces in a hybrid environment is large, the environment may benefit from interface and API
management (43%). In-memory functions (41%) are instrumental for the high speed and low I/O
that data pipelining and orchestration assume. Increasingly, tools are exposing their data
management functions via interfaces and methods for data-as-a-service (DaaS, 29%) and
microservices for data (25%).

Data semantics. As we’ll see in the next section of this report, CDM that satisfies the broad
requirements of many users and use cases will rely on multiple approaches to data semantics,
including metadata management (43%), data catalogs (39%), and business glossaries (36%). A
number of functions can be built atop semantics, including those for impact analysis (32%) and
data lineage (25%).
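
To illustrate how impact analysis and data lineage can be built atop semantics, the sketch below treats lineage metadata as a directed graph and walks it in both directions. The data set names and the `LINEAGE` mapping are hypothetical; a real catalog or metadata tool would supply these edges.

```python
from collections import deque

# Hypothetical lineage edges: each data set maps to the data sets derived from it.
LINEAGE = {
    "crm.customers": ["staging.customers"],
    "erp.orders": ["staging.orders"],
    "staging.customers": ["dw.dim_customer"],
    "staging.orders": ["dw.fact_sales"],
    "dw.dim_customer": ["report.customer_360", "dw.fact_sales"],
    "dw.fact_sales": ["report.sales_dashboard"],
}


def downstream(dataset, lineage=LINEAGE):
    """Impact analysis: every data set affected by a change to `dataset`."""
    seen, queue = set(), deque([dataset])
    while queue:
        for child in lineage.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def upstream(dataset, lineage=LINEAGE):
    """Data lineage: every data set that `dataset` is derived from."""
    # Invert the edges, then reuse the same breadth-first traversal.
    inverted = {}
    for parent, children in lineage.items():
        for child in children:
            inverted.setdefault(child, []).append(parent)
    return downstream(dataset, inverted)
```

Both functions read the same metadata, which is why the two capabilities tend to arrive together once a shared semantic layer spans the platforms of a hybrid environment.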

What data management capabilities do you need for successful CDM today? Select all that
apply.

DATA STORES
Data warehouse 80%
Data lake 48%

DATA MANAGEMENT
Data integration 70%
Data prep for simplified data integration and analytics 57%
Data quality 55%
Data virtualization 43%

SPECIAL USER FEATURES
Self-service for data access and exploration tools 50%
Data-sharing functions 48%
Tool features tuned for multiple user types (tech, dev, biz, steward) 45%
Stewardship and curation features 31%

REAL-TIME PERFORMANCE
Real-time data interfaces 45%
Event processing 38%

CROSS-PLATFORM INTERFACES
Interface and API management 43%
In-memory functions 41%
Data pipelining 39%
Orchestration and workflow management for cross-platform data pipelines 36%
Data-as-a-service 29%
Microservices for data 25%

DATA SEMANTICS
Metadata management 43%
Data catalog 39%
Business glossary 36%
Impact analysis 32%
Data lineage 25%

Figure 17. Based on 56 respondents who have CDM experience.

Data integration upgraded to deeply support clouds is a success factor for data-driven cloud use cases.

Beef up your data management program as preparation for cloud success. As shown in
Figure 17, organizations involved in cloud data management are using every known tool type and
function, and using them deeply. This reveals how important data management is to a healthy
hybrid ecosystem that requires the timely movement of current data for both operations and
analytics. For users planning upgrades to data management infrastructure, to keep pace with the
demanding requirements of cloud, we offer these recommendations.

6 For more tips about optimizing data integration and other data management disciplines for the cloud, see the 2017 TDWI Checklist
Report: Data Management Best Practices for Cloud and Hybrid Architectures, online at tdwi.org/checklists.

Assume that data integration involving cloud requires many interface types. Cloud-based
applications and data platforms—as well as clouds themselves—typically support standard
interfaces (ODBC/JDBC), proprietary APIs, call interfaces, and both standard and proprietary
file or document formats. Ask your application, data platform, and cloud providers which
methods they work best with; depending on the use case, the best method could be a standard
interface, an API, or a file. Likewise, be sure your data integration tool set supports the interfaces,
protocols, and data formats of popular cloud-based applications and data platforms in addition to
the usual on-premises enterprise systems.

Evaluate cloud providers for integration prowess, not just data platforms.

Seek optimal elasticity. In varying degrees, clouds allocate and re-allocate resources
autonomically. This is called elasticity, and it is one of the leading benefits of cloud because it
assures speed, scalability, and automatic capacity without much planning. As you evaluate cloud
providers, get a sense of how elastic their system is, plus what you must do to achieve maximum
elasticity via data modeling, file formats, interface selection, and bulk load options.

Integrate with third-party clouds. Firms in industries with active supply chains (e.g., retail
and manufacturing) often turn to cloud-based data brokers to facilitate business-to-business
communication and data exchange. Similarly, customer-facing firms turn to cloud-based data
aggregators to purchase additional data about consumers. In such cases, your data management
infrastructure must support whatever the third-party provider requires. For example, today’s
cloud-based B2B gateways can significantly reduce the time and expense of onboarding partners
and enable the exchange of data through standard protocols (such as EDI, SWIFT, and HL7) as
well as via on-demand and orchestrated APIs (such as REST).

Cloud requirements for interfaces and metadata differ from on premises.

Know the interface points of new platform types. For example, TDWI sees users increasingly
adopting cloud-based Hadoop, which involves multiple interface points, including MapReduce,
Pig, Hive, HBase, Spark, Drill, and Presto.

Select from multiple right-time interface types. Data coming from or going to clouds
increasingly travels in real time or close to it. Therefore, your data integration tools and data
management infrastructure should address multiple right-time interfaces, ranging from offline
batch and microbatch to real-time streams and IoT.
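
The range of right-time interfaces mentioned above differs mainly in how many records travel per delivery. The sketch below makes that concrete with plain Python; the `load` function and its mode names are illustrative assumptions and do not correspond to any particular integration tool.

```python
from itertools import islice


def micro_batches(records, batch_size):
    """Group an iterable of records into lists of at most `batch_size` items."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def load(records, mode, batch_size=3):
    """Deliver records according to a right-time mode; returns the deliveries."""
    if mode == "batch":        # one delivery with everything
        return [list(records)]
    if mode == "microbatch":   # small, frequent deliveries
        return list(micro_batches(records, batch_size))
    if mode == "stream":       # one delivery per record
        return [[r] for r in records]
    raise ValueError(f"unknown mode: {mode}")
```

For seven records and a batch size of three, batch mode makes one delivery, microbatch mode makes three, and stream mode makes seven. Choosing among them is a trade-off between latency and per-delivery overhead for each use case.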

Modernize your metadata management. For years, TDWI has seen organizations depend on
data integration tools for multiplatform metadata management. This trend continues with clouds,
though clouds demand modern approaches to metadata. Be sure your DM infrastructure supports
multiple metadata types (technical, business, and operational).

Modern Data Semantics as CDM Enabler and Unifier of HDAs


Metadata still rules, but today’s requirements demand multiple semantic approaches.

There are now several established forms of data semantics, namely metadata management and
multiple forms of metadata (e.g., technical, business, and operational metadata), as well as
emerging semantics for business glossaries, data profiling, and data cataloging. Because modern
users want to query, browse, and search semantic descriptions of data (which leads to accessing
the data), a modern semantic facility must support multiple forms of indexing. Sophisticated
users (such as those working with hybrid data architectures) are using all these approaches to
data semantics, often in a single project. The long list of approaches comes together in what
TDWI calls the semantic array.

The modern semantic array does a lot for hybrid data architectures (HDAs). When the array
is centralized and shared, it presents a comprehensive inventory of data for all the platforms of
the HDA, whereas traditional semantics rarely reaches beyond a single platform. The semantic
array enables the creation of custom views of distributed data, such as business metadata or
glossaries for business users. The same array can also enable sophisticated data virtualization
and certain applications of it, such as the logical data warehouse or logical data lake. Finally,
note that semantics-driven views or virtual applications can impose architectural unity upon the
siloed chaos of HDA without the risk, cost, and distraction of time-consuming data migration
and consolidation projects. Hence, the semantic array is becoming one of the leading tools for
the unification of HDAs and other complex, hybrid, and distributed data architectures.8

7 As examples of such brokers and related standards, visit web sites for the Global Data Synchronization Network (GDSN)
and 1WorldSync.

Metadata management across new platforms. For example, modern metadata management
tools are now appearing as software-as-a-service (SaaS) platforms. The benefits of SaaS and
cloud-based tools apply to metadata management, namely minimal tool set up, tool maintenance,
and capital investment, with short time-to-use and elastic scale in production. As a completely
different example, metadata tools must interface with data management functions on new
platforms such as SaaS operational apps, cloud-based systems and storage, Hadoop, and other
open source products. Finally, given the multiplatform (hybrid) data environments becoming
popular today, it sometimes makes sense to deploy a hybrid metadata repository that stores
metadata on diverse platforms, although its interface makes distributed metadata look like a
single source.
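
The idea of a hybrid metadata repository that "looks like a single source" can be sketched as a thin facade over several stores. Everything in this example is hypothetical: the class name, the platform names, and the plain dictionaries that stand in for real metadata stores.

```python
class HybridMetadataRepository:
    """Facade that makes several metadata stores look like a single source."""

    def __init__(self, stores):
        # `stores` maps a platform name to a dict of {data set name: metadata}.
        self._stores = stores

    def lookup(self, dataset):
        """Return (platform, metadata) for the first store that knows `dataset`."""
        for platform, store in self._stores.items():
            if dataset in store:
                return platform, store[dataset]
        raise KeyError(dataset)

    def search(self, term):
        """Return sorted names of data sets whose name or description contains `term`."""
        term = term.lower()
        return sorted(
            name
            for store in self._stores.values()
            for name, meta in store.items()
            if term in name.lower() or term in meta.get("description", "").lower()
        )


# Two illustrative stores: one on premises, one in the cloud.
repo = HybridMetadataRepository({
    "on_prem_warehouse": {"dim_customer": {"description": "Conformed customer dimension"}},
    "cloud_object_store": {"clickstream_raw": {"description": "Raw web events per customer"}},
})
```

A caller asks `repo.search("customer")` without knowing where each description is stored, which is the property that lets the metadata stay distributed while users see one inventory.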

Drawing a holistic “big picture” is critical for HDA success. To cope with the complexity
of HDAs, data management professionals need CDM tool functionality that can draw the “big
picture” of a hybrid environment’s data inventory, server platforms, local data structures, and
overall data architecture. To unify complex environments, data management professionals turn
to technologies designed for cross-platform data operations in cloud and hybrid architectures,
such as data virtualization, query federation, integration hubs, data flows, and data replication.
However, they also need semantics that draw the big picture, as is done by enterprise data
catalogs, business glossaries, and modern approaches to metadata management (as discussed
earlier in this report). These big-picture functions contribute to multiple contexts, including
CDM development, runtime deployment, data governance, and self-service data access.

Data Virtualization as an Agile and Non-Intrusive CDM Method


Data virtualization is a form of integration that provides abstraction and services layers as
a virtual complement to physical integration.

Data virtualization is a platform type for modern data integration. It performs many of the
same transformation and quality functions as traditional data integration (namely ETL,
replication, federation, and messaging) but without the latency, redundancy, and rigidity that
are typical of traditional systems.

Data federation is a subset of data virtualization. Data federation simply aggregates
heterogeneous data from disparate sources and presents it as a single result set or point of access.
Data virtualization goes beyond federation to also provide an abstraction layer and data services.
Furthermore, federation via a DV platform supports advanced query planning, caching,
in-memory, and hybrid strategies for optimizing cross-platform performance.
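
To make the idea of federation concrete, the following is a minimal sketch, not a description of any DV product. It assumes two hypothetical sources (an in-memory SQLite table standing in for an on-premises DBMS, and a JSON payload standing in for a SaaS or NoSQL source) and shows them being queried at request time and presented as a single result set. All table, field, and function names are invented for illustration.

```python
import json
import sqlite3

# Hypothetical source 1: a relational table (stands in for an on-premises DBMS).
onprem = sqlite3.connect(":memory:")
onprem.execute("CREATE TABLE customers (id INTEGER, name TEXT, region TEXT)")
onprem.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Acme", "EMEA"), (2, "Globex", "APAC")],
)

# Hypothetical source 2: JSON documents (stands in for a SaaS app or NoSQL store).
saas_payload = json.dumps(
    [{"customerId": 3, "displayName": "Initech", "geo": "AMER"}]
)


def from_onprem():
    """Read the relational source and map each row to a common shape."""
    query = "SELECT id, name, region FROM customers"
    for cust_id, name, region in onprem.execute(query):
        yield {"id": cust_id, "name": name, "region": region, "source": "onprem"}


def from_saas():
    """Read the document source and map its field names to the same shape."""
    for doc in json.loads(saas_payload):
        yield {
            "id": doc["customerId"],
            "name": doc["displayName"],
            "region": doc["geo"],
            "source": "saas",
        }


def federated_customers():
    """Single point of access: query every source on demand and merge the results."""
    rows = list(from_onprem()) + list(from_saas())
    return sorted(rows, key=lambda row: row["id"])


result = federated_customers()
for row in result:
    print(row)
```

Nothing is copied into a new store here; each call reads the sources again. A real DV platform would add the abstraction layer, query planning, and caching described above, which this sketch deliberately omits.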

Data virtualization provides abstraction and service layers for heterogeneous and
distributed data. By integrating data from disparate sources, locations, and formats without
replicating it, data virtualization (DV) creates a single “virtual” data layer that delivers unified
data services to support multiple applications and users. For a hybrid data ecosystem, DV can
unify the diversity and simplify the chaos of distributed data.

8 The semantic array is also a critical success factor for data hubs, as explained in the 2018 TDWI Checklist Report: The Modern
Data Hub, online at tdwi.org/checklists.

DV offers compelling use cases for hybrid data architectures:

• Many virtualized data services can operate in real time (or close to it) to instantiate
fresh data that is time-sensitive for business processes or updated repeatedly during the
business day.

• Virtual views are often designed to be business-friendly and can simplify access to HDA
systems and data, as required of self-service data prep, exploration, and visualization.

• DV reduces data replication and relocation, reducing network and storage loads.

• Data may be migrated through data virtualization’s abstraction layer for fast prototyping
and testing. DV infrastructure also facilitates migration across an HDA’s heterogeneous
platforms whether on premises or in the cloud.

• DV techniques create data interfaces and communication channels among the many
components of an HDA, which in turn unifies the large-scale architecture of the HDA.

Virtualize hybrid data instead of migrating or consolidating it. A data management
professional’s knee-jerk reaction to HDA complexity is to migrate data from several systems to
fewer ones, and then do the time-consuming and unpredictable hard work of consolidating data
sets of heterogeneous schema. However, consolidating hybrid data is not a compelling solution
because it heavily consumes time and other resources.

In many cases, data virtualization is an effective alternative to data migration and consolidation.
Data virtualization can create logical views in which the data looks consolidated even though it
has not been migrated or physically altered. Data virtualization also involves a fraction of the
time, cost, risk, and disruption of data migration and consolidation projects. It avoids arguments
about data ownership and budgets. Because logical representations are agile, once the logical
models are built, updating them to keep pace with evolving data platforms and use cases is
faster than with physical methods. Furthermore, advancements in DV techniques and hardware
speed make the instantiation of logical data structures fast enough to satisfy most service-level
agreements.9
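
As a rough illustration of a logical view in which data looks consolidated without being migrated, consider the sketch below. It uses SQLite only as a stand-in: two attached in-memory databases play the role of two silos with different schemas, and a temporary view presents them as one customer list. The database names, table names, and column names are hypothetical, and a commercial DV tool would work across genuinely separate platforms rather than within one engine.

```python
import sqlite3

# Autocommit mode keeps the ATTACH statements outside of any open transaction.
conn = sqlite3.connect(":memory:", isolation_level=None)

# Two attached databases stand in for two silos (names are hypothetical).
conn.execute("ATTACH DATABASE ':memory:' AS legacy_erp")
conn.execute("ATTACH DATABASE ':memory:' AS cloud_crm")

# Each silo keeps its own heterogeneous schema.
conn.execute("CREATE TABLE legacy_erp.cust (cust_no INTEGER, cust_nm TEXT)")
conn.execute("CREATE TABLE cloud_crm.accounts (account_id INTEGER, account_name TEXT)")
conn.execute("INSERT INTO legacy_erp.cust VALUES (101, 'Acme')")
conn.execute("INSERT INTO cloud_crm.accounts VALUES (202, 'Globex')")

# The logical model: a TEMP view, because in SQLite only temporary views
# may reference tables in more than one attached database.
conn.execute(
    """
    CREATE TEMP VIEW all_customers AS
    SELECT cust_no AS customer_id, cust_nm AS customer_name,
           'legacy_erp' AS source_system
    FROM legacy_erp.cust
    UNION ALL
    SELECT account_id, account_name, 'cloud_crm'
    FROM cloud_crm.accounts
    """
)

consolidated = conn.execute(
    "SELECT customer_id, customer_name, source_system "
    "FROM all_customers ORDER BY customer_id"
).fetchall()
print(consolidated)

# A row added to one silo is visible through the view with no reload or copy.
conn.execute("INSERT INTO cloud_crm.accounts VALUES (203, 'Initech')")
row_count = conn.execute("SELECT COUNT(*) FROM all_customers").fetchone()[0]
print(row_count)
```

Changing the logical model means redefining the view, not moving data, which is the agility argument made in the paragraph above.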

Distributing Data Across a Hybrid Data Architecture


Hybrid data architectures are both a blessing and a curse.

As you embrace the cloud and new data, rethink how you load balance storage and processing
in hybrid architectures.

As we saw in the discussions of Figures 16 and 17, relational data and database management
systems are common in cloud and hybrid environments. However, big data and many new data
sources generate nonrelational data of diverse structures or no structure. For example, consider
the proprietary and constantly evolving record structures generated by web applications and
sensors on the Internet of Things. Furthermore, unstructured data is on the increase, from
traditional enterprise applications (e.g., customer conversations captured by call center apps, the
claims process in insurance) and modern web apps (e.g., social media, e-commerce).

The result of all this is extremely hybrid data, which is a blessing in terms of the new insights and
innovative business management methods it can inspire and enable. However, it is also a curse
because its diversity leads to complex architectures and bulging software portfolios that are
expensive to assemble and difficult to maintain over time.

Users find it increasingly difficult to manage data with on-premises platforms (75%, see
Figure 18). This is no wonder given the rise of nonrelational data. Traditional data platforms
are not going away because they still manage large volumes of highly valuable data and they fit
into business processes quite ably. Users are maintaining older platforms while adding new ones
to address new data requirements as well as to scale at a reasonable cost. That is why cloud and
hybrid data architectures have become so diverse and complex in terms of data structures and
the platform types that manage them.

9 For a detailed discussion of data virtualization in the context of hybrid data architectures, see the 2017 TDWI Checklist Report:
Architecting a Hybrid Data Ecosystem, online at tdwi.org/checklists.

As your organization adopts new data types and taps new data sources, do you find it
increasingly difficult to capture and use data with on-premises platforms?

25% No

75% Yes

Figure 18. Based on 91 respondents.

Upgrade your data management architecture. Adopting the cloud affects architectures,
which is an opportunity to fix some of the mistakes of the past. For example, too many DM
solutions evolve into a plague of point-to-point interfaces and integrations, which results in a
convoluted hairball that is hard to optimize, control, and maintain. Many organizations fix this
common data management design problem by restructuring the hairball as a data integration
hub with controllable spokes. With a hub-and-spoke architecture and the right hub tools, users
can orchestrate data flowing through the hub to control access (for security and governance),
improve data (for quality and modeling), and make data accessible to a wider range of users (via
self-service and publish/subscribe). Orchestration via a hub can apply to all data, including data
flowing to and from clouds.10
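
The hairball problem can be quantified with simple arithmetic. The sketch below assumes, purely for illustration, that in the worst case every pair of systems ends up with its own interface, and that a hub needs one spoke per system. The function names are hypothetical and real environments will fall somewhere between these extremes.

```python
def point_to_point_interfaces(n_systems: int) -> int:
    """Worst case: one interface for every pair of systems, n * (n - 1) / 2."""
    return n_systems * (n_systems - 1) // 2


def hub_and_spoke_interfaces(n_systems: int) -> int:
    """One spoke per system, each connecting that system to the hub."""
    return n_systems


for n in (5, 10, 20):
    print(
        f"{n} systems: "
        f"{point_to_point_interfaces(n)} point-to-point interfaces, "
        f"{hub_and_spoke_interfaces(n)} spokes"
    )
```

Under these assumptions, the pairwise count grows quadratically while the spoke count grows linearly, which is why the gap widens as more platforms join a hybrid architecture.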

Enterprise data in the cloud will double or triple—at least—over three years.

The proliferation of new data types and new data platforms has increased the presence of cloud-
based systems in modern enterprises such that an increasing amount of data “lives” in the cloud,
regardless of where it may have originated or may be going. Data management professionals need
to track this development (from a capacity-planning viewpoint) as well as to assure that the data
is where it needs to be for specific use cases, compliance reasons, or processing requirements. To
get a sense of how the physical distribution of data will shift, our survey asked, "Which of the
following best describes the location of data across your organization relative to on premises
versus cloud systems?" (See Figure 19.)

Plan capacity for heavier loads on cloud platforms.

Today, most enterprises surveyed have most of their data on premises. Respondents report
having their data almost exclusively on premises (30%) or mostly on premises (48%).
Only single-digit percentages of respondents report near equal doses on premises and cloud (9%),
mostly cloud (7%), or almost exclusively cloud (6%) today.

In three years, most will have doubled or tripled the percentage of data on cloud. Some
respondents expect their data to be mostly on premises (29%) in three years. However, others
think their data will be in near equal doses on premises and cloud (16%), excessively hybrid and
mostly cloud (27%), or almost exclusively cloud or multicloud (21%). Conversely, a mere 2% think
their data will be almost exclusively on premises.

10 For detailed discussions of hybrid data architectures and architectures for data integration, read the 2018 TDWI Best Practices
Report: Multiplatform Data Architectures and the 2014 TDWI Best Practices Report: Evolving Data Warehouse Architectures, both online
at tdwi.org/bpreports.

Which of the following best describes the location of data, across your organization, relative
to on premises versus cloud systems? Answer for both Today and In Three Years.

                                                     Today   In 3 Years
Almost exclusively on premises                        30%        2%
Increasingly hybrid, but still mostly on premises     48%       29%
Near equal doses on premises and cloud                 9%       16%
Excessively hybrid and mostly cloud                    7%       27%
Almost exclusively cloud or multicloud                 6%       21%
Don't know                                             0%        5%

Figure 19. Based on 56 respondents who have CDM experience.
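
The shift shown in Figure 19 can be tallied with a few lines of arithmetic. The sketch below is illustrative only: grouping "mostly cloud" with "almost exclusively cloud" into a single cloud-majority bucket is an assumption made here for convenience, not a TDWI definition, and the variable names are hypothetical.

```python
# Percent of respondents per category from Figure 19: (today, in three years).
figure_19 = {
    "Almost exclusively on premises": (30, 2),
    "Increasingly hybrid, but still mostly on premises": (48, 29),
    "Near equal doses on premises and cloud": (9, 16),
    "Excessively hybrid and mostly cloud": (7, 27),
    "Almost exclusively cloud or multicloud": (6, 21),
    "Don't know": (0, 5),
}

# Assumed grouping: categories in which most data lives in the cloud.
cloud_majority = [
    "Excessively hybrid and mostly cloud",
    "Almost exclusively cloud or multicloud",
]

total_today = sum(today for today, _ in figure_19.values())
total_future = sum(future for _, future in figure_19.values())

cloud_today = sum(figure_19[name][0] for name in cloud_majority)
cloud_future = sum(figure_19[name][1] for name in cloud_majority)
growth_factor = cloud_future / cloud_today

print(f"Column totals: {total_today}% today, {total_future}% in three years")
print(f"Cloud-majority respondents: {cloud_today}% today, {cloud_future}% in three years")
print(f"Growth factor: {growth_factor:.1f}x")
```

With this grouping, cloud-majority respondents rise from 13% today to 48% in three years. Note that this measures the share of respondents, not the volume of data, so it is only an indirect indicator for capacity planning.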

Data processing is just as distributed as data storage.

When and where data is processed moves as data migrates to new storage. As user
organizations go deeper into hybrid data architectures, they will regularly redistribute data
across multiple platforms of differing types. However, there is more to this process than simply
re-examining storage. As data moves, where it is processed will, too. Given the increasing size of
data, moving it from platform to platform for processing gets less tenable, which is why platforms
and user best practices are evolving to accommodate more data processing inside a platform.
This has many names: in situ processing, in-database processing, in-database analytics,
push-down processing, and ELT.
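
The difference between moving data to the processing and pushing processing down to the data can be sketched as follows. This is a toy example under stated assumptions: an in-memory SQLite database stands in for any data platform, and the table, columns, and function names are hypothetical. The first function extracts every row and aggregates in the client; the second sends the aggregation to the platform and retrieves only the result.

```python
import sqlite3

# A stand-in data platform holding a hypothetical sales table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (region TEXT, amount REAL)")
db.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("EMEA", 100.0), ("EMEA", 50.0), ("APAC", 75.0)],
)


def totals_extract_then_aggregate(conn):
    """Move every row out of the platform, then aggregate in the client."""
    totals = {}
    rows_moved = 0
    for region, amount in conn.execute("SELECT region, amount FROM sales"):
        totals[region] = totals.get(region, 0.0) + amount
        rows_moved += 1
    return totals, rows_moved


def totals_pushdown(conn):
    """Aggregate inside the platform and move only the summarized rows."""
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"
    ).fetchall()
    return dict(rows), len(rows)


extracted_totals, extracted_rows = totals_extract_then_aggregate(db)
pushdown_totals, pushdown_rows = totals_pushdown(db)

print(extracted_totals, "rows moved:", extracted_rows)
print(pushdown_totals, "rows moved:", pushdown_rows)
```

Both approaches return the same totals, but the push-down version moves one row per region instead of one row per sale. At the data volumes discussed in this report, that difference is the reason in-database processing and ELT are gaining ground.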

Technical users will need to revise their integration and processing solutions for data
aggregation, transformation, and quality. This sounds like a problem, but it’s actually an
opportunity, because new platforms—both on premises and cloud—offer more options than ever
for users to grow into. Users should rely on tools that support native processing on all platforms
available as well as the tool’s own server.

Data processing occurs on premises, in the cloud, or both.

To determine where processing is taking place today in hybrid data architectures, our survey
asked, "For DM tools and platforms involved with CDM, where does the server software execute
today?" (See Figure 20.)

In hybrid data architectures surveyed, most data processing takes place exclusively on
premises. This is especially true of large systems, namely relational DBMSs (46%) and data
warehouse platforms (37%). It is also true of the core disciplines of data management, namely
data quality (35%), metadata (35%), and data integration (30%).

Data processing in the cloud is firmly established. The leader here is analytics tools and
sandboxes (30%). This is no surprise because, in the recent uptick of new analytics programs,
TDWI has seen many designed from the start to capture data and process it for analytics
exclusively in the cloud.

The healthy amount of processing on both proves that data has truly gone hybrid.
Prominent systems that process data both on premises and in the cloud include analytics tools
and sandboxes (40%), data integration (37%), relational DBMSs (33%), and data warehouse
platforms (30%).


For DM tools and platforms involved with CDM, where does the server software execute
today? Select one answer per row.

                                  On premises   Cloud   Both   Not using   Don't know
Relational DBMSs                      46%        17%    33%       2%          2%
Data warehouse platform               37%        25%    30%       3%          5%
Data quality                          35%        14%    27%      19%          5%
Metadata management                   35%        12%    19%      25%          9%
Data integration                      30%        28%    37%       2%          3%
Hadoop                                25%        14%    23%      33%          5%
Data virtualization                   23%        24%    21%      16%         16%
Nonrelational and NoSQL DBMSs         21%        25%    21%      26%          7%
Analytics tools and sandboxes         21%        30%    40%       4%          5%

Figure 20. Based on 57 respondents who have CDM experience. Sorted by the On premises column.

USER STORY THE MODERN DATA WAREHOUSE IS INCREASINGLY IN THE CLOUD, BUT STILL
RELATIONAL.
“We modernized our enterprise data warehouse two years ago, migrating it to the cloud in the process,”
said Jean-Paul Saliou, senior director of business intelligence at Genesys, which develops software and
services for running small to large call centers. “Our legacy warehouse was on a relational database,
and most of our reporting and analytics tools require SQL support. We were also planning to expand
our solutions for self-service access to data, reports, and visualization, which involves relational
requirements. In a related move, our central IT group decided years ago to give preference to the
cloud when planning new implementations, so we knew that the database platform for our next data
warehouse should be relational and in the cloud.
“To find the right cloud-based data warehouse platform, we conducted proof-of-concept exercises with
two large cloud providers and a small new one. We chose the latter because of its high performance,
ease of migration, and consumption-based licensing.
“Today, all BI data is in the cloud, with nothing left on premises. Moving into the future, we’ll expand
self-service, increase the adoption of our visualization tool as a spreadsheet replacement, and start a
new program for machine learning and other advanced analytics on cloud platforms.”

Top Ten Priorities for Cloud Data Management


In closing, let’s summarize this report by distilling from it the top ten priorities for cloud data
management (CDM). Let’s also reflect on why these priorities are important. Think of the
priorities as recommendations, requirements, or rules that can guide user organizations through
a successful CDM program.

Don’t let a cloud be a silo.

1. Don’t just adopt the cloud. Integrate the cloud, too. Many useful applications, tools, and
data platforms are now available on a variety of clouds, and user organizations should avail
themselves of these. However, modern software can become a bucket of silos just as quickly
as older enterprise applications did. In particular, TDWI has seen organizations go on a
shopping spree, acquiring multiple SaaS applications without a plan for integrating and
sharing data across these and older applications. Cloud data management can prevent
modern silos and get more business value from SaaS applications as well as the many other
sources of new data discussed here.

Be mindful of CDM’s benefits, barriers, and key use cases.

2. Perform CDM for the benefits. The general benefits of the cloud apply to data
management, especially scale, speed, elasticity, minimal setup, and maintenance performed
by the cloud provider. According to this report’s survey, these favorable cloud characteristics
enhance business solutions for analytics, reporting, business activity monitoring, and agility.

3. Beware of CDM’s barriers. Survey respondents redlined issues in governance, migrations,
data quality, and tool maturity as common barriers to successful CDM and hybrid data
architectures. Even so, these are not catastrophic failures; each is relatively easy to avoid
or fix.

4. Know the successful use cases and start with these. Numerous users surveyed reported
success while applying CDM to use cases in decision-making practices, especially reporting
and dashboards, data warehousing, and advanced analytics. They have also successfully
applied CDM to migrations of applications and data to the cloud as well as implementations
of SaaS apps.

Cloud database solutions should be on every evaluation list.

5. Consider new cloud-based data platforms. Some of the most exciting new products of
recent years are the relational databases purpose-built for data warehousing and analytics
in the cloud. These are built from the bottom up to tap the power of the cloud, though at a
reasonable cost. One or more of these should always be on any list of data platforms to
evaluate when replatforming, modernizing, migrating, or designing databases for
warehousing, lakes, operations, analytics, integration, and so on.

6. Deploy significant data integration infrastructure for the cloud. Software for CDM is not
just the portfolio of data platforms. You also need appropriate tooling for data integration
and other data management disciplines. This is because hybrid data travels relentlessly into,
across, and out of the platforms of hybrid data architectures.

Data integration is a key success factor for CDM, including virtual approaches.

7. When possible, virtualize hybrid data instead of consolidating it. In other words, use
data virtualization tools to create logical views of existing, siloed data environments
without migrating or consolidating their data physically or restructuring their platforms.
Migration and consolidation projects are always more expensive, time-consuming, and
distracting for both business and technical people than anyone expects, and data
virtualization is a viable alternative to these problematic projects. Furthermore, for
time-sensitive practices (business monitoring, e-commerce, operational reporting) virtualization
provisions fresher data than more latent integration practices can.


Rethink governance, migrations, and architectures.

8. Govern HDAs (and whole enterprises) holistically instead of per platform. Too often,
data governance policies are made on a per application, data set, or use case basis. As the
number of these increases (as is typical with successful hybrid data architectures), DG is
unable to scale to the massive volume of policies required. Furthermore, the mass of policies
confuses users and inevitably leads to contradictory policies. Holistic DG seeks to create as
few policies as possible but also make individual policies that apply broadly to many apps,
data sets, and use cases. With fewer policies, DG can scale to the complexity of hybrid data
environments with fewer opportunities for confusion.

9. Organize migrations to the cloud as multiphase projects. Sometimes you can "lift-and-
shift" data from one system to another with minimal work to optimize that data on the new
platform and sometimes you cannot. Organizations facing migrations of older applications
and data to the cloud should assume that "lift-and-shift" will be inadequate because of the
exaggerated differences of old and new platforms discussed earlier. When "lift-and-shift"
cannot produce the desired results, create a multiphase project plan (instead of a "big bang"
project) that sets proper expectations for time and other resources, then work through the
project in a controlled, low-risk manner.

10. Don’t forget architecture. Without data architecture, a hybrid data environment is merely
a bucket of siloed data sets. Likewise, the large number of hardware and software servers
involved in hybrid IT deserves a systems architecture. Finally, CDM also merits its own
architecture in the same way that a data integration solution can have an architecture that
differs from the data platform architectures it reads from and writes to. Architecture, when
applied to a hybrid data environment, improves its design, maintenance, optimization,
usability, data quality, and data standards.

Research Sponsors

actian.com

Actian, the hybrid data management, analytics, and integration company, delivers data as a
competitive advantage to thousands of customers worldwide. Through the deployment of innovative
hybrid data technologies and solutions, Actian ensures that business-critical systems can transact
and integrate at their very best: on premises, in the cloud, or both. Thousands of forward-thinking
organizations around the globe trust Actian to help them solve the toughest data challenges to
transform how they run their businesses, today and in the future. For more information, visit
http://www.actian.com.

couchbase.com

Couchbase's mission is to be the database platform that enables a revolution in application
innovation. To make this possible, Couchbase created an enterprise-class NoSQL database to help
deliver ever richer and ever more personalized customer and employee experiences. Built with the
most powerful NoSQL technology, Couchbase was architected on top of an open source foundation for
the massively interactive enterprise. Our geo-distributed database provides unmatched developer
agility and manageability, as well as unparalleled performance at any scale, from any cloud to
the edge.

Couchbase has become pervasive in our everyday lives; our customers include industry leaders
Amadeus, AT&T, BD (Becton, Dickinson, and Company), Carrefour, Cisco, Comcast, Disney, DreamWorks
Animation, eBay, Marriott, Neiman Marcus, Tesco, Tommy Hilfiger, United, Verizon, and Wells Fargo,
as well as hundreds of other household names. For more information, visit www.couchbase.com.

datameer.com

Datameer is a unified data preparation and exploration platform that enables enterprises to
rapidly transform raw data into analytics-ready data sets for tremendously faster time-to-insight,
from weeks to hours. It is the only platform that delivers the scale, governance, and
operationalization required by enterprises, yet is so easy to use that analysts, data scientists,
and data engineers can collaborate on a centralized view of data and rapidly personalize it,
empowering everyone to discover insights. Datameer’s unique visual data exploration tools enable
any persona to discover the right data for their analytics problems, enabling more accurate
analytics and machine learning.

The cloud-native Datameer platform provides deep integration with cloud platforms and services
for a seamless, scalable experience. This includes elastic compute services, scalable storage,
native security and encryption, and cloud-based data sources and analytic tools. Datameer also
supports hybrid environments for secure, scalable integration between on-premises and cloud-based
data and analytics for easy expansion and/or migration of analytic workloads to the cloud.
Datameer works with customers from every industry including firms like Dell, Vodafone, Citibank,
UPS, and more. Learn more at datameer.com.

denodo.com

Denodo is the leader in data virtualization providing agile, high-performance data integration,
data abstraction, and real-time data services across the broadest range of enterprise, cloud,
big data, and unstructured data sources at half the cost of traditional approaches. Denodo’s
customers across every major industry have gained significant business agility and ROI by enabling
faster and easier access to unified business information for agile BI, big data analytics, web and
cloud integration, single-view applications, and enterprise data services.

The Denodo Platform offers the broadest access to structured and unstructured data residing in
enterprise, big data, and cloud sources, in both batch and real-time, exceeding the performance
needs of data-intensive organizations for both analytical and operational use cases, delivered in
a much shorter time frame than traditional data integration tools.

The Denodo Platform drives agility, faster time to market, and increased customer engagement by
delivering a single view of the customer and operational efficiency from real-time business
intelligence and self-serviceability.

Founded in 1999, Denodo is privately held, with main offices in Palo Alto (CA), Madrid (Spain),
Munich (Germany), and London (UK).

For more information visit denodo.com, follow Denodo via Twitter @denodo, or contact us to request
an evaluation copy at info@denodo.com.

hitachivantara.com

Data is your greatest asset, if you know how to use it. It reveals your path to innovation and
outcomes that matter for business and society. Hitachi Vantara combines 100 years of OT and 60
years of IT experience to help data-driven leaders unlock the value in their data. Our unique
Stairway to Value model uses machine learning and artificial intelligence to deliver tangible
benefits driven by your data. We help you store, enrich, activate and monetize your data to
improve customer experiences, create new revenue streams and lower costs. We listen. We
understand. We work with you.

sap.com

As the cloud company powered by SAP HANA®, SAP is the market leader in enterprise application
software, helping companies of all sizes and industries run better. From back office to boardroom,
warehouse to storefront, desktop or mobile device to the cloud, SAP empowers people and
organizations to work together more efficiently and use business insight more effectively to stay
ahead of the competition. SAP applications and services enable more than 335,000 customers to
operate profitably, adapt continuously, and grow sustainably. SAP helps simplify technology for
companies of all sizes so they can consume our software the way they want and without disruption.
With an extensive global network of customers, partners, employees, and thought leaders around the
world, SAP helps the world run better and improve people’s lives. For more information, visit
www.sap.com.

snowflake.com

Snowflake started with a clear vision: Make modern data warehousing effective, affordable, and
accessible to all data users. Snowflake enables the data-driven enterprise with instant
elasticity, secure data sharing, and per-second pricing across multiple clouds. Because
traditional on-premises and cloud solutions struggle at this, Snowflake developed a new product
with a new built-for-the-cloud architecture that combines the power of data warehousing, the
flexibility of big data platforms, and the elasticity of the cloud at a fraction of the cost of
traditional solutions. Snowflake: Your data, no limits. Find out more at snowflake.com.

trifacta.com

Trifacta is the industry pioneer and established leader of the global market for data preparation
technology. The company draws on decades of academic research in machine learning and data
visualization to make the process of preparing data faster and more intuitive. More than 100,000
data wranglers in 10,000 companies worldwide use Trifacta solutions across cloud, hybrid, and
on-premises environments to support a variety of analytic and operational use cases. Leading
organizations such as Deutsche Boerse, Google, Kaiser Permanente, New York Life, and PepsiCo count
on Trifacta to accelerate time-to-insight and discover opportunities that drive success. Learn
more at trifacta.com.

TDWI Research provides research and advice for data
professionals worldwide. TDWI Research focuses
exclusively on data management and analytics issues and
teams up with industry thought leaders and practitioners
to deliver both broad and deep understanding of the
business and technical challenges surrounding the
deployment and use of data management and analytics
solutions. TDWI Research offers in-depth research reports,
commentary, inquiry services, and topical conferences
as well as strategic planning services to user and vendor
organizations.

555 S. Renton Village Place, Ste. 700
Renton, WA 98057-3295
T 425.277.9126
F 425.687.2842
E info@tdwi.org
tdwi.org
