
Volume 3 Number 2 ■ SUMMER 2017

Securing IoT With Automated Connectivity Management  6
Are the Public Clouds Too Big to Fail?  35
Improving the ROI of Big Data and Analytics  39
ENTERPRISE ARCHITECTURE: TECHNOLOGIES AND SKILLS THAT BUILD THE FOUNDATION FOR DATA MANAGEMENT  12

WWW.DBTA.COM
CONTENTS
BIG DATA QUARTERLY
Summer 2017

editor’s note | Joyce Wells


2 Faster, Better, Smarter

departments

3 BIG DATA BRIEFING


Key news on big data product launches, partnerships, and acquisitions

6 INSIGHTS | Shaun Kirby
Securing IoT With Automated Connectivity Management: Lessons Learned From the Connected Car

9 INSIGHTS | Alberto Pan
In-Memory Parallel Processing and Data Virtualization Redefine Analytics Architectures

24 TRENDING NOW
The Growth of Hybrid IT and What It Means: Q&A With SolarWinds' Kong Yang

28 INSIGHTS | Roman Stanek
Getting Real Business Value From Artificial Intelligence

31 TRENDING NOW
DevOps for Big Data: Q&A With Pepperdata's Ash Munshi

features

4 THE VOICE OF BIG DATA
Securing the Modern Enterprise: Q&A With McAfee's Steve Grobman

12 COVER STORY | Joe McKendrick
Technologies and Skills That Build the Foundation for Data Management

26 BIG DATA BY THE NUMBERS
DevOps and the Need for Speed

columns

33 HADOOP PLAYBOOK | Jim Scott
Making the Most of the Cloud

34 BIG DATA BASICS | Lindy Ryan
Overcoming Common Problems With Data Visualization

35 CLOUD CURRENTS | Michael Corey & Don Sullivan
Are the Public Clouds Too Big to Fail?

37 GOVERNING GUIDELINES | Anne Buff
The Science of Data Governance Matter

38 THE IoT INSIDER | Bart Schouw
Never Mind Fake News, Fake Data Is Far Worse

39 DATA SCIENCE DEEP DIVE | Bart Baesens
Improving the ROI of Big Data and Analytics

PUBLISHED BY Unisphere Media—a Division of Information Today, Inc.
EDITORIAL & SALES OFFICE 121 Chanlon Road, New Providence, NJ 07974
CORPORATE HEADQUARTERS 143 Old Marlton Pike, Medford, NJ 08055

Thomas Hogan Jr., Group Publisher, 609-654-6266; thoganjr@infotoday
Joyce Wells, Managing Editor, 908-795-3704; Joyce@dbta.com
Joseph McKendrick, Contributing Editor; Joseph@dbta.com
Adam Shepherd, Advertising and Sales Coordinator, 908-795-3705; ashepherd@dbta.com
Stephanie Simone, Editorial Assistant, 908-795-3520; ssimone@dbta.com
Don Zayacz, Advertising Sales Assistant, 908-795-3703; dzayacz@dbta.com
Celeste Peterson-Sloss, Lauree Padgett, Alison A. Trotta, Editorial Services
Tiffany Chamenko, Production Manager
Lori Rice, Senior Graphic Designer
Jackie Crawford, Ad Trafficking Coordinator
Sheila Willison, Marketing Manager, Events and Circulation, 859-278-2223; sheila@infotoday.com
DawnEl Harris, Director of Web Events; dawnel@infotoday.com

ADVERTISING
Stephen Faig, Business Development Manager, 908-795-3702; Stephen@dbta.com

INFORMATION TODAY, INC. EXECUTIVE MANAGEMENT
Thomas H. Hogan, President and CEO
Roger R. Bilboul, Chairman of the Board
John C. Yersak, Vice President and CAO
Thomas Hogan Jr., Vice President, Marketing and Business Development
Richard T. Kaser, Vice President, Content
Bill Spence, Vice President, Information Technology

BIG DATA QUARTERLY (ISBN: 2376-7383) is published quarterly (Spring, Summer, Fall, and Winter) by Unisphere Media, a division of Information Today, Inc.

POSTMASTER: Send all address changes to: Big Data Quarterly, 143 Old Marlton Pike, Medford, NJ 08055

Copyright 2017, Information Today, Inc. All rights reserved. PRINTED IN THE UNITED STATES OF AMERICA

Big Data Quarterly is a resource for IT managers and professionals providing information on the enterprise and technology issues surrounding the 'big data' phenomenon and the need to better manage and extract value from large quantities of structured, unstructured, and semi-structured data. Big Data Quarterly provides in-depth articles on the expanding range of NewSQL, NoSQL, Hadoop, and private/public/hybrid cloud technologies, as well as new capabilities for traditional data management systems. Articles cover business- and technology-related topics, including business intelligence and advanced analytics, data security and governance, data integration, data quality and master data management, social media analytics, and data warehousing.

No part of this magazine may be reproduced by any means—print, electronic, or any other—without written permission of the publisher.

COPYRIGHT INFORMATION: Authorization to photocopy items for internal or personal use, or the internal or personal use of specific clients, is granted by Information Today, Inc., provided that the base fee of US $2.00 per page is paid directly to Copyright Clearance Center (CCC), 222 Rosewood Drive, Danvers, MA 01923, phone 978-750-8400, fax 978-750-4744, USA. For those organizations that have been granted a photocopy license by CCC, a separate system of payment has been arranged. Photocopies for academic use: Persons desiring to make academic course packs with articles from this journal should contact the Copyright Clearance Center to request authorization through CCC's Academic Permissions Service (APS), subject to the conditions thereof. Same CCC address as above. Be sure to reference APS. Creation of derivative works, such as informative abstracts, unless agreed to in writing by the copyright owner, is forbidden.

Acceptance of advertisement does not imply an endorsement by Big Data Quarterly. Big Data Quarterly disclaims responsibility for the statements, either of fact or opinion, advanced by the contributors and/or authors. The views in this publication are those of the authors and do not necessarily reflect the views of Information Today, Inc. (ITI) or the editors.

SUBSCRIPTION INFORMATION: Subscriptions to Big Data Quarterly are available at the following rates (per year): Subscribers in the U.S.—$96.95; Single issue price: $25

© 2017 Information Today, Inc.
EDITOR’S NOTE

Faster, Better, Smarter


By Joyce Wells

THE DEMANDS FOR faster development cycles, better software and services that are more secure and always-on, and smarter decisions fueled by real-time data are creating a strong force in the big data world today.

As a result, big data technologies to support software development and data management are constantly evolving. And in this issue of Big Data Quarterly, the rapidly changing enterprise requirements and IT trends are explored from a range of vantage points in interviews, articles, and columns.

The rise of security issues posed by cloud, as well as mobility and big data, is considered in an interview with Steve Grobman, CTO of McAfee. There are tremendous efficiencies and cost benefits that organizations can achieve by moving to cloud-based architectures. However, organizations must also be aware that the value of data in these cloud environments also makes it a sought-after target for cybercriminals, he observes. And, if such a breach occurs in a multitenant cloud, the impact could be severe.

Moreover, with the growing use of cloud technology in the enterprise, there is a keen awareness that the loss of services even for a short time in a cloud environment can wreak havoc, a point noted by a number of BDQ writers in this issue. To help alleviate the risk of a major disruption, Navisite's Michael Corey and VMware's Don Sullivan suggest there needs to be a stronger embrace of hybrid approaches in addition to an understanding of the features and capabilities being delivered by a cloud provider.

The expansion of cloud technologies is also driving the use of DevOps, an approach that helps increase the speed and alignment of software development and IT operations. Classical DevOps was about creating velocity, but performance needs to be a first-level player in DevOps for big data, notes Pepperdata CEO Ash Munshi in an interview.

The use of containers can help organizations gain greater agility since the approach supports multi-cloud deployments, while also allowing them to deploy software more easily and better utilize their resources, adds MapR's Jim Scott. Microservices, a technology being used more frequently with containers, is another approach that helps organizations achieve much-needed agility, notes Unisphere Research analyst Joe McKendrick in his cover article on the technology changes taking place in enterprise architecture.

With this issue, we also welcome Software AG's Bart Schouw as the new author of the IoT Insider column. In his piece, Schouw highlights the security advantages of blockchain for connected devices. Blockchain was originally intended for data integrity, but why not use it for device integrity?

And, there are many other great articles in this issue on the changes taking place in the world of big data. To stay on top of the latest big data developments, research reports, and news, visit www.dbta.com/bigdataquarterly.



BIG DATA BRIEFING
Key news on big data product launches, partnerships, and acquisitions

Data analytics platform provider LOOKER recently closed an $81.5 million Series D funding round led by CapitalG, Alphabet's growth equity investment fund. https://looker.com

In an effort to alleviate an impending critical shortage of developers, the CLOUD FOUNDRY FOUNDATION is launching a cloud-native developer certification initiative. www.cloudfoundry.org

IBM is making the NVIDIA Tesla P100 GPU accelerator available on the cloud. The combination of NVIDIA's acceleration technology with IBM's Cloud platform is intended to help organizations more efficiently run compute-heavy workloads, such as AI, deep learning, and high-performance data analytics. www.ibm.com

Melissa Data, a provider of global contact data quality and identity verification solutions, has rebranded as "MELISSA" to reflect the company's increased focus on enabling global business intelligence. www.melissa.com

CLOUDERA, which provides an analytics platform built on Hadoop and other open source software, has unveiled the Cloudera Data Science Workbench, a new self-service tool for data science that is based on the company's recent acquisition of Sense.io. www.cloudera.com

KINETICA, a provider of an in-memory analytics database accelerated by GPUs, is partnering with Safe Software and created FME connectors that read and write data from Kinetica into and out of FME workspaces. www.kinetica.com

MAPR TECHNOLOGIES has added a small footprint edition of the MapR Converged Data Platform to address the need to capture, process, and analyze data generated by IoT devices close to the source. www.mapr.com

REDPOINT GLOBAL, a provider of data management and customer engagement technology, is introducing the RedPoint Customer Engagement Hub solution, providing enterprises with tools to overcome challenges caused by the gap between customer expectations and the actual experience brands deliver. www.redpoint.net

DATAGUISE, a provider of sensitive data governance, has announced that DgSecure now provides sensitive data monitoring and masking in Apache Hive. www.dataguise.com

ORACLE is expanding its Internet of Things portfolio with four new cloud solutions to help businesses take advantage of digital supply chains. www.oracle.com

ALATION and TRIFACTA are extending their partnership to deliver an integrated solution for self-service data discovery and preparation that enables users to access the data catalog and data wrangling features within a single interface. www.alation.com, www.trifacta.com

HAZELCAST, provider of an open source in-memory data grid, has joined the Confluent Partner Program as a Technology Partner. www.hazelcast.com

ELASTIC and GOOGLE have formed a partnership to bring managed support of Elastic's open source search and analytics platform to the Google Cloud Platform. www.elastic.co, www.google.com

ZOOMDATA, developer of a visual analytics platform for big data, has announced the launch of a new Smart Connector for the Vertica Advanced Analytics database from Hewlett Packard Enterprise. www.zoomdata.com

SNAPLOGIC is launching a new technology that uses artificial intelligence to automate highly repetitive, low-level development tasks, eliminating the integration backlog that stifles most technology initiatives. www.snaplogic.com

SOFTWARE AG has updated its Zementis Predictive Analytics product to support IBM z Systems and Adabas and Natural applications and databases. Zementis supports artificial intelligence and machine learning models in batch or real-time transactions, which in turn delivers operational AI for fast-moving, big data applications. www.softwareag.com

Sisense has launched SISENSE PULSE, a new alerting system that provides proactive notifications about a user's most important business events, in real time when changes occur. www.sisense.com



THE VOICE OF BIG DATA
SECURING THE
MODERN ENTERPRISE
Data—now universally understood to be the lifeblood of businesses—is at risk like never before in the form of both malicious attacks and innocent indiscretions. Recently, Steve Grobman, CTO for McAfee, discussed the range of threats to data security and what companies must do to defend themselves.

When you look at the data security forecasts that came out of the recent McAfee Labs 2017 Threat Predictions report, was there anything that stood out to you?
We are seeing IT aggressively moving to cloud-based architectures in order to improve efficiency and decrease their costs. There are tremendous benefits from doing so, and many of those will be security benefits as well—in that cloud providers can inherently invest in building a strong security architecture. But we also need to recognize that, given the value of the data that will be held in these cloud environments, the benefit to bad actors in breaching those environments will be very high.

How so?
Whether that is from a low-level perspective like virtualization technologies, or the orchestration capabilities that tie it all together, or even the bridging technologies that allow cloud architectures to interoperate with traditional environments—all of those will be targeted. One of the most profound areas from my perspective is that when a multitenant cloud system is breached, the impact can be much more severe than breaching a single company's application or data architecture. The reason for that is that the bad actor could either steal or corrupt many parties' data versus just a single organization's. I think we will start to see issues related to the cloud becoming much more common.

It is possible for a data breach of any type to cause damage even if there is no financial opportunity.
You are right that using data as a weapon as opposed to just monetizing it by stealing it and selling it is key. That can benefit a cybercriminal in many different ways and one of the easiest is simply by extortion and threatening a company that if it does not pay a ransom in bitcoin, all the email archives of their top executives—with information about salaries and off-color comments—will be released. There is the potential to use data to threaten harm and extort companies. But then, the other part of that is an offshoot of what we saw during the presidential election cycle.

Please explain.
Compromised data can be augmented with fabricated data to make things even worse. For example, in a corporate environment, if you had your CEO's email stolen, a bad actor could release the legitimate stolen data to establish credibility but then add fabricated data to do even more harm. Think of a scenario where the bad actor's objective is to make money on manipulating the stock price of an organization. Data stolen from a key executive, that when vetted and evaluated will be found to be legitimate, can be interlaced with fabricated data that makes it appear as though there were scandals, or corruption, or illegal activity.

What needs to be done?
It is very important to make the point to the general public that we need to be very suspicious of data that is identified from a data leak. With the media continuously reporting the content of breached data, the general public is being conditioned to essentially trust this information. Think of something as mundane as the Ashley Madison breach—there was nothing to prevent the hackers from adding names to those lists.



How do you classify data risks?
Data breaches generally can be categorized in three ways. One is what I would term the "accidental breach." A lot of data leaves an organization, not through malice, but through employees simply trying to get their jobs done. They forward sensitive data to their cloud account so they can work on it at home or they use other mechanisms to move data to places that they shouldn't and then it ends up going to someone that should not have access to it.

And the next?
The second category of breach is caused by the intentional insider. And that is a much harder problem if you have a sophisticated insider who wants to smuggle data out. Part of the problem is that there are certain types of data smuggling that are very difficult to prevent with technology. An example of that is what we call the "analog problem" which is a fancy term meaning that it is difficult to prevent somebody from doing something such as taking a picture of their screen with their cellphone.

What else?
The third situation is when it is an actor from outside the organization who is using a combination of malicious tools and techniques in order to break into an organization and then exfiltrate the data.

How do you approach these challenges?
Our strategy is to break the problem primarily into two sets of technologies. The first big part of the technology is to provide a Pervasive Data Protection architecture across not only traditional systems but also the cloud as well as personal devices to create policy and controls on where and how data flows. The other big arm of the strategy is giving organizations a comprehensive set of technologies to defend their environment against bad actors that are using offensive cyber capabilities to break into an organization and exfiltrate data, and that is our Threat Defense set of products and capabilities.

Are there different challenges today to data security? You mentioned cloud and mobile.
Those are two, and a third one is the challenge around big data and using data from many places. Organizations want to be able to analyze data but also ensure that they are honoring the privacy and data access restrictions for that data. Enabling many different groups to have access to large pools of data for analysis is highly beneficial but by doing that you are making data accessible to individuals that otherwise wouldn't have access. The challenge is enabling new forms of big data analytics, machine learning, where you really need to have access to large quantities of potentially sensitive data but not introduce new data privacy or data access issues by doing that.

There have to be controls.
One of the intelligence challenges that led to 9/11 was limited information sharing between intelligence organizations. As a result, there were procedures put in place to make moving data between agencies easier but then that put data at higher levels of risk, leading to things like the leak from Edward Snowden.

How can companies deal with this?
What every organization needs to do is find the right balance between efficient operations where you can take advantage of data sharing while still having tolerable levels of risk.

Are newer approaches such as AI, machine learning, and predictive analytics being deployed in data security?
Most definitely. We are using machine learning and artificial intelligence within our products, and really across the cybersecurity industry, it is one of the newer technologies that is being heavily used. But it is important to understand that it is possible to poison machine-learning algorithms or subject an organization to large numbers of false positives, so that it has to recalibrate its models, which then allows a criminal actor to have a viable infiltration vector. So, although it is an effective and very interesting new field that is being embraced by the cybersecurity and defense industry, we do need to be mindful that it has limitations.

What should an organization do to ensure that its data protection stance is adequate?
The key step is to recognize that protection is needed against both the accidental breach and the intentional breach and that the technology, systems, and processes for each will be different. Organizations also have to be careful to not just focus on the last incident that resulted in a data loss. Very often, companies that have lost data due to a breach or an unintentional loss put a lot of effort into that vector and in some cases ignore the other vectors.

What keeps you up at night?
The thing that keeps me up at night is the thought of having some of our large multitenant cloud-based data systems breached where you could have many of the companies on the Fortune 500 critically impacted. If you think of the number of organizations that rely on cloud-based CRM systems or cloud-based storage systems, if one of these systems were breached, it wouldn't be a single organization but potentially a very large percentage of businesses and organizations worldwide and I think that those are the things that we need to pay a lot of attention to.

This interview was conducted, condensed, and edited by Joyce Wells.



INSIGHTS

Securing IoT With Automated Connectivity Management: Lessons Learned From the Connected Car

By Shaun Kirby

OVER THE NEXT 6 years, the Internet of Things (IoT) market is expected to reach $883.55 billion, as connected devices continue to pour into just about every aspect of our lives. For enterprises, the IoT is helping to transform products into connected services, capable of creating recurring revenue streams, reducing costs, and enhancing customers' experiences.

Despite these benefits, the IoT brings a flood of security risks. Vulnerability to hackers, privacy concerns, and sheer uncertainty of what these devices are capable of doing are just a few of the IoT's inherent implications.

Organizations and their networks are unprepared for the tsunami of devices on the way. According to AT&T's Cybersecurity Insights Report, 85% of enterprises are in the process of or intend to deploy IoT devices. Yet, just 10% are confident in their ability to secure devices against hackers. In order to reap the benefits of the IoT, securing these devices and streams of data that flow between them must be top-of-mind.

The Connected Car: The Epitome of IoT Hopes … And Fears

Looking at the wealth of devices entering the market, the connected car is the most talked about, and perhaps the most controversial. A "data center on wheels," the connected car is emblematic of everyone's best hopes and worst fears of the IoT.

On one hand, connected cars promise to promote safer and more efficient driving through technologies such as collision avoidance systems, remote diagnostics, predictive maintenance tools, and on-board GPS. Connected transportation systems and "vehicle-to-vehicle" (V2V) communication technologies tout further benefits, enabling cars and trucks to "talk" to one another to steer drivers clear of accidents and other hazards. In fact, the National Highway Traffic Safety Administration (NHTSA) estimates that V2V technology could prevent more than half a million accidents and save more than 1,000 lives each year in the U.S. And, while we are still a few years away from fully autonomous or driverless vehicles zipping down our highways, the possibilities of safer roads, less traffic, and reduced emissions are extremely enticing. Connected, highly automated cars also open up vistas for whole new experiences such as immersive entertainment and collaboration, adding to the value.

On the other hand, connected cars pose threats not only to privacy (given all the data they collect), but also to safety, when not properly secured. Case in point: 2 years ago, a pair of hackers demonstrated the potential dangers of a cybersecurity attack on a connected car by remotely hijacking and crashing a Jeep over the internet. The incident led to the recall of 1.4 million Chrysler vehicles. Considering this dramatic demonstration, it is no surprise that consumers have been hesitant to "turn on" the various connected devices in their connected cars. A 2016 Spireon survey revealed that despite interest in connected cars, 54% of participants said they have not actually used connected car features. Although their concerns are slowly but surely diminishing (willingness to pay for connected services went from 21% in 2014 to 32% in 2015), auto manufacturers are missing out on the opportunity to capitalize on a $155 billion market and the recurring revenue that stems from subscriptions to different connected services.

Shaun Kirby is director, Automotive & Connected Car, at Cisco.



One of the keys to securing the connected car's large, potential "attack surface" is enabling the right levels of connectivity at the right times throughout the vehicle's lifecycle—from the manufacturing and testing facility, to the shipping container, to the dealership showroom, to the driver's garage and beyond. When should connectivity be on or off? What should a vehicle be allowed to do with that connectivity? With 69 million vehicles built with internet connectivity expected to ship globally in 2020, addressing these questions is no easy task.

IoT Connectivity Management Platforms Pave the Way for End-to-End Security

Whether for a connected car, home, factory, or business—you name it—end-to-end security is crucial to thwarting cybersecurity threats and keeping users safe. Ultimately, the ability to secure data that these connections generate requires organizations to constantly identify and monitor how that data should be used. Here's where IoT connectivity management platforms can help. These platforms are capable of automating how and when a device connects and what it is allowed to do with that connection.

Let's go back to the connected car example. Automated connectivity management platforms allow manufacturers to identify exactly what vehicles are allowed to do with their connectivity at each phase of the car's lifecycle. For instance, connectivity must be "on" during vehicle testing so that automakers can verify that connected services are properly functioning. Then, when the vehicle is in its shipping container, the manufacturer can automatically disable these services to prevent hackers from sabotaging the vehicle while it is en route to the dealership. All the while, some connectivity must remain on to enable real-time tracking of the vehicle during its journey. When the vehicle arrives at the dealership, an automated system allows OEMs to safely resume connections so that salespeople can demo the vehicle and its connected services to the buyer. If at any time the connections do anything else, the platform can detect that anomalous behavior and automatically shut off the connectivity, preventing illicit activity that could compromise the vehicle's security and safety.

As networks evolve to accommodate the millions of connected devices entering the market, security must be front and center every step of the way. Although IoT connectivity platforms can help address many security concerns, there is an ecosystem of responsibility at play for IoT. For the connected car, it's not only the auto manufacturers who have a hand in securing connections but also the dealership, developers of aftermarket services and subscriptions, and even the customer. But, no matter the device, security must be a priority from Day 1 so that everyone can work together to experience all the benefits IoT has to offer.



INSIGHTS

In-Memory Parallel Processing and Data Virtualization Redefine Analytics Architectures

By Alberto Pan

THE TIDE IS CHANGING for analytics architectures. Traditional approaches, from the data warehouse to the data lake, implicitly assume that all relevant data can be stored in a single, centralized repository. But this approach is slow and expensive, and sometimes not even feasible, because some data sources are too big to be replicated, and data is often too distributed to make a "full centralization" strategy successful.

That is why leading analysts such as Gartner and Forrester recommend architectures such as the logical data warehouse. In these architectures, data is distributed across several specialized data stores such as data warehouses, Hadoop clusters, and cloud databases, and there is a common infrastructure which allows unified querying, administration, and metadata management. Logical architectures are the only option when data collection/replication becomes unfeasible, and, in addition, they greatly reduce the need for effort-intensive ETL processes, providing much shorter times to production and significant cost reductions. Gartner has recently estimated cost savings of 40% by deploying logical architectures, and some companies report even bigger improvements.

Logical architectures for analytics are typically implemented using data virtualization (DV), which makes all the underlying data sources seem to be a single system with a unified query interface (see Figure 1). As a result, data virtualization creates real-time logical views from several sources and publishes the results to other applications in multiple formats, such as JDBC, ODBC, or REST data services. This way, consuming applications do not need to know where data resides or the query language that is used by each source system, as they are also abstracted from changes in data management technologies such as moving from Hive to Spark.

[Figure 1: The logical data warehouse architecture]

Alberto Pan is chief technical officer at Denodo, a provider of data virtualization software, and an associate professor at University of A Coruña.
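As a small illustration of that abstraction, a consuming application might query a published logical view through a standard interface such as ODBC. The sketch below is hypothetical: the DSN, credentials, and view name are invented, and the point is simply that the query does not change when the underlying source does.

```python
# Hypothetical sketch: a consuming application queries a logical view published by a
# data virtualization layer over ODBC. The DSN, credentials, and view name are
# illustrative only; the application never touches the underlying EDW, Hadoop, or
# cloud sources directly.
import pyodbc

conn = pyodbc.connect("DSN=logical_dw;UID=analyst;PWD=example")
cursor = conn.cursor()

# The same query keeps working even if the source behind the view moves, say, from Hive to Spark.
cursor.execute("SELECT country, total_sales FROM v_sales_by_country WHERE sale_year >= 2016")
for country, total_sales in cursor.fetchall():
    print(country, total_sales)

conn.close()
```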



Obtain Total Sales By Customer Country in the Last 2 Years

Figure 2: Simple federation (A) versus advanced data virtualization approach (B)

It is also straightforward to create different logical views over the same physical data, adapted to the needs of each type of user. Furthermore, data virtualization provides a single entry point to apply global security and governance policies across all the underlying systems.

Nevertheless, to realize the full potential of logical architectures, it is crucial that the data virtualization system includes query optimization techniques specifically designed for combining large distributed datasets. In turn, many of the data federation/virtualization systems available in the market reuse the query optimizers of conventional databases, with only slight adaptations. This is the case of the data federation extensions recently introduced by some database and BI tool vendors. However, those optimizers cannot apply the most effective optimization techniques for logical architectures, and the associated performance penalty can be very significant.

More precisely, there are two types of capabilities needed to achieve the best performance in these scenarios:

1. Applying automatic optimizations to minimize network traffic, pushing down as much processing as possible to the data sources.
2. Using parallel in-memory computation to perform at the DV layer the post-processing operations that cannot be pushed down to the data sources.

The following example illustrates how both capabilities work, while the figure below shows a simplified logical data warehouse scenario:

• An enterprise data warehouse (EDW) contains sales from the current year (290 million rows) and a Hadoop system contains the sales data from previous years (3 billion rows). Sales data include, among other information, the customer ID associated to each sale.
• A CRM database contains customer data (5 million rows, one row for each customer). The information for each customer includes its country of origin.

Parts A and B of the figure show two alternative strategies to calculate a certain report that looks at the total amount of sales by customer country in the last 2 years. As depicted, the report needs sales data from both the current and previous years and the country of origin of each customer. Therefore, it needs to combine data from the three data sources.

Data federation tools, using extensions of conventional query optimizers, would compute this report using Strategy A, while DV tools with optimizers designed for logical architectures would use Strategy B. In fact, the most sophisticated DV tools would consider additional strategies to Strategy B and choose the best one using cost information.

In Strategy A, the federation tool pushes down the filtering conditions to the data sources and retrieves the data required to calculate the report.
Since the report includes one filtering condition because only sales from the last 2 years are needed, only 400 million rows are retrieved from Hadoop instead of the total 3 billion rows. In addition, the report required the full 290 million rows from the EDW and the full 5 million rows from the CRM. Therefore, even though the filters have been pushed down to the data sources, alternative A still needs to retrieve 695 million rows through the network, and then post-process all the data in the federation system. Therefore, execution will be very slow.

In turn, in Strategy B, the optimizer of the data virtualization system introduces additional group by operations, as the circles above the EDW and Hadoop data sources show, to divide the computation of the report in two steps. The first step is executed at the EDW and the Hadoop systems, which compute the total sales by customer for the data in each of these systems. The second step is performed at the data virtualization system: It adds the partial results obtained for each customer in both data sources and groups the resulting data by country using the information retrieved from the CRM. Since the first step does not require information about the country of origin, it can be entirely pushed down to the data sources. This means we only need to retrieve 5 million rows, or one row for each customer, from both the EDW and the Hadoop systems. Therefore, network traffic is drastically reduced from 695 million to 15 million rows. Notice these are the type of techniques only an optimizer specifically designed for logical architectures will consider. A conventional query optimizer designed to work in a physical architecture couldn't add additional operations to the query plan, as alternative B did, because it usually makes no sense in physical environments.

The second capability required for best performance in logical architectures is parallel in-memory computation. Notice that in the alternative B, the data virtualization system still needs to post-process 15 million rows. While this can be done in acceptable time with conventional serial execution, the process can be further optimized using a parallel in-memory grid, as illustrated in the upper-right part of the figure.

The in-memory grid should be installed in a cluster of machines connected to the DV system through a high-speed network. When the DV system needs to post-process a significant amount of data, it can use the in-memory grid to execute such operations in parallel. For this, the DV optimizer should partition the data into the cluster nodes to maximize the degree of parallelism. In this example, the DV optimizer would partition the data by "customer_id," thus ensuring that all the sales from the same customer end in the same node, and the join and group by post-processing operations can be parallelized almost in full. Additionally, the data does not need to be materialized on disk at either the DV system or at the in-memory grid, so data can be streamed directly as it arrives from the data sources. As a result, parallel computation starts almost immediately with the first chunk of data.

Using the parallel in-memory grid can result in much faster execution of the post-processing operations than in conventional serial architectures. However, the parallel in-memory computation capabilities are of no help to minimize network traffic. For instance, in the above example, parallel databases with simple federation capabilities would still use the execution strategy A. This means 695 million rows would need to be transferred through the network before the parallel processing even starts, resulting in query execution times unacceptable for many applications. Therefore, both the advanced techniques for minimizing network traffic and the parallel in-memory computation capabilities are needed to achieve the best performance.

Benefits of Redefining Analytics Architectures

Logical architectures for analytics provide shorter times to production and are cheaper to create and maintain. Data virtualization is a key component of these architectures, providing them with abstraction, unified query execution, and unified security capabilities. To guarantee best performance, it is crucial that the query optimizer of the data virtualization system includes specific techniques designed for minimizing network traffic in logical architectures. More importantly, the query optimizer should leverage in-memory parallel processing to perform post-processing operations that cannot be pushed down to the data sources.
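To make the two-step plan behind Strategy B a little more tangible, here is a minimal pandas sketch of the same idea under invented table and column names: partial group by operations run at each source, so only one row per customer crosses the network before the final join and group by country at the virtualization layer.

```python
# Minimal pandas sketch of Strategy B, with made-up tables and column names:
# each source computes its own partial "total sales by customer" (step 1, pushed down),
# and the virtualization layer merges the small partial results, joins them with the
# CRM customer table, and groups by country (step 2).
import pandas as pd

def edw_partial():
    # Stands in for: SELECT customer_id, SUM(amount) FROM sales GROUP BY customer_id (current year, at the EDW)
    return pd.DataFrame({"customer_id": [1, 2], "total": [500.0, 120.0]})

def hadoop_partial():
    # Same partial aggregation executed inside the Hadoop system (previous years)
    return pd.DataFrame({"customer_id": [1, 3], "total": [200.0, 80.0]})

def crm_customers():
    return pd.DataFrame({"customer_id": [1, 2, 3], "country": ["ES", "US", "US"]})

# Step 2 at the DV layer: only one row per customer crossed the network.
partials = pd.concat([edw_partial(), hadoop_partial()])
per_customer = partials.groupby("customer_id", as_index=False)["total"].sum()
report = (per_customer.merge(crm_customers(), on="customer_id")
                      .groupby("country", as_index=False)["total"].sum())
print(report)  # total sales by customer country over the last 2 years
```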



ENTERPRISE ARCHITECTURE:

TECHNOLOGIES AND SKILLS


THAT BUILD THE FOUNDATION
FOR DATA MANAGEMENT




By Joe McKendrick

WHAT ARE THE enabling technologies that make enterprise architecture what it is today? There are a range of new-generation technologies and approaches shaping today's data environments. The key is putting them all together to help enterprise architecture fit into the enterprise's vision of itself as a data-driven organization. Tools and technologies emerging within today's data-driven enterprise include cloud, data lakes, real-time analytics, microservices, containers, Spark, Hadoop, and open source trends.




CLOUD

Cloud computing, in its current form, has been on the scene for close to a decade. It has only been within the past 2–3 years, however, that it has hit its stride as the solution of choice for data environments. "The acceleration to the cloud has passed the point of no return," said Matthew Glickman, vice president of product management for Snowflake. "More and more companies, regardless of scale, are all in the cloud."

Organizations are embracing cloud "to reach new levels of agility, increase the speed of innovation, and improve time-to-market rates," said Mat Keep, director of product and market analysis for MongoDB. "We estimate that the majority of our deployments today are in the cloud, and we're seeing those numbers increase."

There are a range of benefits enterprises are already seeing from cloud, including the ability to "scale applications to new geographies, decrease investments in local data center resources, and improve the ability to deliver apps quickly—all while reducing application and infrastructure provisioning," said Keep. For the most part, startups "will never have their own data centers, opting instead to be cloud natives," according to Joe Pasqua, executive vice president of products for MarkLogic.

Benefits also include "agility, in which, for example, public clouds allow for quick spin-up or spin-down of infrastructure," as well as "scale, in which public clouds allow for nearly-unlimited storage and compute, enabling customers to burst data and/or analytics into a cloud on an as-needed basis," said Jack Norris, senior vice president of data and applications for MapR. Finally, there are cost savings, in which "public clouds allow for a pay-as-you-go model, where customers are charged based on resources used."

For smaller operations, cloud is the de facto platform, as pointed out by Eric Mizell, vice president of global solution engineering for Kinetica. "Most startups are 100% cloud, as it's easier to spin up and down instances versus standing up servers in an office."

At the same time, Mizell sees movement even among the largest data centers "away from traditional datacenters for most workloads." That is the case, he says, because "it is now essential to have global collection and processing zones in the cloud for easier and faster data handling around the world. They say that data has gravity, and what is collected in the cloud stays in the cloud." Moreover, the infrastructure behind the cloud keeps getting faster and more powerful.

Areas where cloud is gaining the most traction include "newer digital business projects that provide responsive and personalized customer and employee-centric experiences using mobile, web, and IoT applications," said Ravi Mayuram, senior vice president of products and engineering for Couchbase. "We see these new systems being built across many industries, including ecommerce, travel and hospitality, digital health, digital media, financial services, and gaming." While tech and media were the early cloud adopters, other industries are now joining the cloud movement, Glickman agreed.

Be careful not to associate cloud exclusively with "public" cloud services, Norris cautioned. There's a key role for on-premises data centers, as well. "Cloud is less about which sites to deploy to, and more about taking advantage of all physical sites available," Norris said. "Hybrid models, where services or resources are managed in some combination of on-premises and public cloud, are quite prevalent."

Interestingly, the cloud "is already starting to be seen as the more secure place to operate your business," Glickman said. "Regulators might soon begin rewarding their constituents who operate in the cloud since they can provide greater transparency to their respective businesses." Ultimately, he added, "cloud adoption will reach its point of full-on adoption once everyone stops talking about cloud adoption."

DATA LAKES

Industry experts are bullish on the concept of the data lake. As Syed Mahmood, director of product marketing at Hortonworks, pointed out, "The data lake is a natural extension of a company's decision to embark on its big data journey."

However, they disagree about whether Spark or Hadoop is being used to support these environments. The urgency of the data lake concept is acute. "The need to bring data from different systems together into a centralized repository for analytics and reporting is nothing new but with data volumes exploding, and much of that data now being semi-structured and unstructured, traditional enterprise data warehouses are buckling under the load," said Keep. "Data lakes augment, rather than replace, the enterprise data warehouse." He noted that in building data lakes, Hadoop isn't the only solution available, and likely introduces complexity. "If organizations go the Hadoop route, they need to consider how they will integrate the analytics created in the data lake with the operational systems that need to consume those analytics in real time. This demands the integration of a highly scalable, highly flexible operational database layer."
While Hadoop has made the data lake possible, it also introduces challenges, such as "the potential to become a data dump, security issues, lack of skill sets, and slow performance, causing smaller or less agile companies to either not try or give up on Hadoop," said Keep. He noted that he has seen many companies add "a fast data layer on top of Hadoop to help increase its value." According to Keep, "Spark offers new life to the data lake concept. It brings performance and machine-learning algorithms that enable the desired data munging businesses want. It also plays well in the cloud by enabling data in cloud storage to be processed faster than ever before."

Still, some experts caution against diving too deep into a data lake. "Unfortunately, many companies have seen their data lakes turn into data swamps," said Norris. With respect to Hadoop and Spark, "we see two types of customer adoption patterns," he said. "The first group started with Hadoop and then adopted Spark and are using both technologies. The second group adopted Spark initially and use Spark independent of Hadoop." Spark's streaming analytics, he added, benefits significantly from running on a data platform that is not limited by Hadoop's batch constraints.

REAL TIME

What are the best technologies for enabling real-time analytics? For Dinesh Nirmal, vice president of analytics development at IBM, the answer is Spark. "Spark radically simplifies the analysis of large datasets, enabling even those without advanced data science degrees to access information faster and more reliably than ever before," he explained.

Apache Spark is appealing for real-time environments "because users can compute analytics very quickly, which is especially important in today's highly responsive customer-facing applications," said Mayuram. He pointed to another real-time enabler, Apache Kafka, which provides a "standard way to move data from an application context into a broker, so your web application team doesn't need to worry about how to make it available to downstream consumers—their responsibility ends at Kafka. Likewise, different application teams can build analytics on the website data by consuming it from Kafka—no prearrangement required."

A notable benefit of both Kafka and Spark, Mayuram continued, "is the ability to support real-time data streaming, which significantly reduces the traditional time lag between when data enters the system and when the results of ETL and analytical processes are available."

Ultimately, for the success of analytics and real-time solutions, data needs to be trusted. "Most analytics technologies fall down in this area," said Pasqua. He noted that "the goal of many real-time analytic processes is to determine as much as possible about an individual entity as opposed to a population. While many people think about analytics in terms of statistics over large groups of data, in real-time analytics, you often want to be able to scope your analysis to a very fine target."

MICROSERVICES AND CONTAINERS

Containers and microservices play a key role in helping to achieve agility in hybrid cloud or on-premises environments, industry observers agree. "Containers and microservices were born out of the cloud environment and are critical components to help developers be more agile," said Jason McGee, IBM fellow, vice president, and CTO for IBM Cloud Platform. "It's all about enabling developers to progress and iterate quickly. Developers have to spend a lot of time setting up the environments that support their application, installing and configuring software, setting up infrastructure, and moving applications between development, test, and production systems. Containers solve this challenge by standardizing how developers package their applications and dependencies, making it super simple to create, move, and maintain applications and allowing more time for what developers really want to do, which is create." Keep agreed that containers provide much-needed application portability, "making it simpler to move services between on-prem and cloud environments, facilitated increasingly by the public cloud vendors rolling out container services."

For their part, microservices contribute to agility "by enabling the formation of smaller teams that do not have to coordinate as much with the larger organization," McGee continued. Keep added that "the large, monolithic code bases that traditionally power enterprise applications make it difficult to quickly launch new services. In the last few years, microservices—often enabled by containers—have come to the forefront of the conversation. Containers work very well in a microservices environment as they isolate services to an individual container. Updating a service becomes a simple process to automate and manage, and changing one service will not impact other services."




Containers and microservices may go together, but are not joined at the hip. "Just to be clear, containers are not required for microservices, nor are microservices required for containers," said Mayuram. "While it's correct that both containers and microservices are frequently used together in today's modern web, mobile, and IoT applications, they are not a requirement for each other."

Flexibility and adaptability are critical to container and microservices success. "Choose a database that meets the requirements of microservices and continuous delivery," Keep said. "When you build a new service that changes the data model, you shouldn't have to update all of the existing records, something that can take weeks for a relational database." Instead, Keep noted, it is important "to ensure that you can quickly iterate and model data against an ever-changing microservices landscape, resulting in faster time to market and greater agility."

One risk is the distributed nature of microservices, Keep said. "There are more potential failure points. Microservices should be designed with redundancy in mind." Automation is also essential to these environments, he added. "With a small number of services, it is not difficult to manage tasks manually. As the number of services grows, productivity can stall if there is not an automated process in place to handle the growing complexity." Finally, he advises, "learn from the experiences of others."

SPARK VERSUS HADOOP

While Hadoop has emerged as a popular open source framework in recent years, another contender, Apache Spark, is stealing its thunder. "Our customers, especially those who are building newer big data projects, tend to choose Spark over Hadoop for big data processing," said Couchbase's Mayuram. "Spark performs better, is easier to manage, and provides additional functionality like machine learning, which tends to make it much more attractive than Hadoop for big data processing."

Hadoop is dying in the enterprise, Glickman agreed. "Hadoop-based projects are slowly failing and will eventually be replaced with cloud-based services that are better suited to the tasks Hadoop tried to solve on-premises. Apache Spark, on the other hand, is thriving. By being data-source agnostic by design, Spark never had a tight coupling to Hadoop, or more precisely, HDFS."

Some industry observers, however, believe Spark and Hadoop can coexist and deliver impressive synergies. "We don't view this as a Spark-versus-Hadoop debate," said Mahmood. "We believe that analysts and data scientists require a centralized platform to develop predictive applications. Apache Hadoop provides this foundational platform for big data processing with HDFS for storage and YARN for compute management. We believe that Apache Spark is more effective when it operates as part of a Hadoop platform. With the burden of the platform being taken care of by Hadoop, data scientists can be more productive by simply focusing on building predictive applications."

OPEN SOURCE

Open source is also gaining traction and, in particular, a number of key Apache projects are getting a foothold in the enterprise. "We often see different technologies being brought in to address application development, data management, and operational challenges," said Mayuram. Some of the more common Apache projects that Couchbase sees within enterprise customers are Spark, Kafka, ActiveMQ, Flume, Arrow, TomEE, Web Server, Cordova, Axis, ZooKeeper, Mesos, Groovy, Commons, OpenJPA, ServiceMix, Zeppelin, and Lucene.

Mahmood sees another solution, Apache Ranger, also gaining traction among enterprises that "are increasingly concerned about providing secure and authorized access to data such that it can be widely used across the organization, while also keeping sensitive information safe. Apache Ranger is being used by some of the largest companies across industries to provide a framework for authorization, auditing, and encryption and key management capabilities across big data infrastructure." Other open source tools include Apache Atlas, which addresses data management and governance, and Apache Zeppelin, which assures "access to data is democratized and citizen data scientists can use a web-based tool to explore data, create models and interact with machine learning models," Mahmood stated.

BLOCKCHAIN

And, finally, there is an increasing role for blockchain—the global, distributed database—in today's enterprise environments. While the direction and impact of this technology is not yet clear, blockchain promises to disrupt many data management approaches. "Blockchain technology excels at building trust between groups of inherently untrusting legal entities," said Jerry Cuomo, IBM fellow and vice president of blockchain technologies. "If everyone trusts each other like in a private enterprise, we really don't need a blockchain. However, every enterprise has business-to-business relationships where value is exchanged." For example, he noted, in a supply chain, partners, suppliers, and shippers manage the exchange of goods across enterprises. "This is where blockchain shines."



Best Practices Series

The Rise of Fast Data Management & Analytics

Sponsored content in this section:
Oracle (page 20): A Fast Data Journey on the Road to Big Data
Aerospike (page 22): Cacheless Architectures for Digital Transformation: Why They Matter, and What You Need to Know
Informatica (page 23): To Stream or Not to Stream—Just Flip a Switch

Eight Rules of the Road for Fast Data Management and Analytics
These days, end users—be they employees or consumers visiting a site—expect information delivered in seconds, if not nanoseconds. Applications tied into networks of connected devices and sensors are powering operations and making adjustments on a real-time basis.

This calls for fast and intelligent data—also often referred to as "streaming data"—sought by enterprises to compete in an intensive global economy. From a consumer's point of view, no user experience is complete without some smattering of analysis or intelligence, tied to recommendation engines that provide additional insights or courses of action for users to follow. The challenge for data and development teams is to not only build on these types of intelligent services but to also find the best ways to deliver data interactively and in real time—or near real time.

The next generation of technologies and methodologies emerging—from in-memory databases to machine learning to alternative forms of databases—promises to deliver on this potential. Fast data is at the forefront of the real-time revolution. It means large volumes of data need to be moved through systems and across decision makers' desks, enabling real-time views of events as they happen, be it a customer problem, an inventory shortage, or a systems glitch. It's a matter of identifying the moments of truth, in real time, when an end user—be it an employee or customer—is ready to move on to make the next decision.

This is a sea change from the typical data environments, which, until recently, were tasked with delivering static reports, most likely on historical data. Now, there is a drive to what ranges from real-time analytics to streaming analytics to operational intelligence, in which information viewed by decision makers is refreshed on an instantaneous basis. The next phase in this evolution is toward predictive analytics, built upon a constant feedback loop of real-time data from sensors and systems feeding into operations.

Enterprises also recognize that real-time capabilities deliver greater value to their organizations and customers than traditional batch-mode processing. It helps keep applications and related information being provided refreshed on a constant basis. Batch mode, on the other hand, means periodic updates of large sets of data, likely on a 24-hour cycle, with no real-time interaction.

Many enterprises recognize the role that fast data is playing in current and future growth plans. A recent survey of 4,000 data professionals by OpsClarity, Inc. found that 92% of companies plan to leverage stream-processing applications, while 79% intend to reduce or eliminate their investments in batch-only processing. It's going to take some time, however. While 65% of respondents claim to have real-time or near-real-time pipelines currently in production, they are still leveraging a wide mix of data processing technologies—batch, micro-batch, or streaming.

There are many solutions on the horizon that promise to converge real-time capabilities with data environments. However, integrating these multiple approaches can be daunting. Streaming analytics, for example, can be supported through open source solutions such as Apache Spark, a fast cluster computing system, and Apache Kafka, a distributed streaming platform, but integrating these various environments can be challenging. Plus, these newer solutions often get built on top of existing data environments, such as data warehouses.

The good news is a fast data infrastructure doesn't have to be a drag on performance. Data managers need to take proactive measures to build, maintain, and support today's generation of highly interactive and intelligent applications.
formance is an important piece of the puzzle, and that’s where Employ machine learning and other real-time approaches.
database technology converges with the drive to real time. Behind just about every analytics-driven interaction is an algo-
Here are the key elements to consider in moving to a fast or rithm that employs techniques to gather data and do some type
streaming data environment: of pattern matching to measure preferences or predict future
Mind your storage. Fast data requires a technology compo- outcomes. Machine learning approaches enable these systems to
nent that is essential: abundant and responsive storage. This is adjust software to data streams without time-consuming manual
where data managers and their business counterparts need to intervention.
understand when and where data pulsing through their orga- Look to the cloud. Today’s cloud services support many
nizations need only to be read once and discarded, or stored of the components required for fast or streaming data—from
for historical purposes. Many forms of data—such as constant machine-learning algorithms to in-memory technologies. Most
streams of normal readings from sensors—simply aren’t import- respondents in the OpsClarity survey (68%) cite the use of either
ant enough to invest in archival storage. public cloud or hybrid deployments as the preferred mechanism
Consider alternative databases. Much of the data that is for hosting their streaming data pipelines.
being sought across enterprises these days is the unstructured, Pump up your skills base. The next-generation approaches
non-relational variety—video, graphical, log data, and so forth. required for delivering fast or streaming data and analytics also
Relational data systems, for example, tend to be slower than call for new types of skills in these areas. Data professionals need
necessary for the tasks that employ unstructured data streams. greater familiarity with new tools and frameworks, including
NoSQL databases, for example, have lighter-weight footprints Apache Spark or Apache Kafka. Organizations must increase
and can process these data streams at faster rates than established their levels of training for current data management staffs, as
relational database environments. well as seek out these skills in the market.
Employ analytics close to the data. It may also be helpful to Look at data lifecycle management. It’s important to be able
use data analytics that are embedded with database solutions for to filter the data that is required for eventual long-term storage,
many basic queries. This enables greater response times, versus versus the data that is only valuable in the moment. Otherwise,
routing data and queries through networks and centralized algo- the amount of data that would need to be stored would be over-
rithms that may drag on performance and increase wait times. whelming—and mostly unnecessary. A way to address potential
Examine in-memory options. The delivery of highly intelligent, storage overload is data lifecycle management, in which certain
interactive experiences requires that back-end systems and applica- types of data are either eliminated or moved to low-cost storage
tions operate at peak performance. That requires movement and vehicles, such as tape, after a predetermined amount of time.
delivery of data at blazing speeds, recognizing that every nanosec-
ond counts in a user interaction. In-memory technologies—which
can support entire datasets in memory— can deliver this speed. —Joe McKendrick
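To make the Spark-plus-Kafka streaming pattern above concrete, here is a minimal, hypothetical PySpark sketch; the broker address, topic name, and windowing choice are illustrative assumptions rather than anything prescribed in the article, and the job assumes the Spark Kafka connector package is on the classpath.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("fast-data-sketch").getOrCreate()

# Read a stream of sensor readings from a Kafka topic (names are placeholders).
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "sensor-readings")
          .load())

# Count events per sensor over 1-minute windows, a stand-in for the
# "real-time view of events as they happen" described above.
counts = (events
          .selectExpr("CAST(key AS STRING) AS sensor_id", "timestamp")
          .groupBy(window(col("timestamp"), "1 minute"), col("sensor_id"))
          .count())

# Stream the running counts out continuously; a real deployment would
# target a dashboard, an in-memory store, or another Kafka topic.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()

The same pipeline could swap the console sink for an in-memory or NoSQL sink, which is where the storage and in-memory considerations above come back into play.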



sponsored content

A Fast Data Journey On the Road to Big Data

There are many possible journeys on the Road to Big Data. In this article we will look at handling streaming data and how to deliver effective analytics on high-speed data. Most importantly, we want to provide you a path to doing this successfully in a short amount of time.

USE CASES
Many common big data use cases rest on streaming data. Successful B2C companies need to interact with, and make decisions with, their customers in real-time. Fast analytics on fast data underpins everything from understanding your customer’s next best move to delivering targeted promotions over multiple channels to handling inquiries promptly.
In industries as diverse as banking, transportation, building maintenance and particle physics, predictive maintenance solutions rely on rapid analysis of sensor data.
And the fraud detection that we all experience on credit card transactions as well as ecommerce sites is only possible with immediate analytics on transactions as they stream in.

MORE DATA—BETTER RESULTS
But it takes more than streaming data to bring these use cases to life. Building the right predictive analytics requires a rich collection of historical data against which to build and test different models. Adding more data is almost always more effective than building better algorithms on less data, which means access not just to that streaming data, but also to the full richness of your new and existing datasets. Let’s look at what that takes.

THREE STEPS
We can break this down into three simple steps:
1. Establishing a data lake that can host all the non-streaming data you need, along with the analytics to capitalize on it.
2. Capturing your data stream(s), establishing real-time analytics, and depositing that data into your data lake for future use (see the sketch below).
3. Integrating analytics to act upon both your data stream(s) and your data lake.
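As a hedged illustration of step 2 above, the PySpark fragment below captures a Kafka stream and lands the raw events in object storage for later batch modeling; the broker, topic, and bucket paths are hypothetical, and the s3a-style URI simply stands in for whichever object store connector the environment provides.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-to-data-lake").getOrCreate()

# Capture the stream (step 2): topic and broker names are placeholders.
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "transactions")
          .load()
          .selectExpr("CAST(value AS STRING) AS payload", "timestamp"))

# Persist raw events as Parquet in the object-store data lake so the
# historical models of step 1 and the combined analytics of step 3 can
# reuse the full history later.
(stream.writeStream
 .format("parquet")
 .option("path", "s3a://example-data-lake/raw/transactions/")
 .option("checkpointLocation", "s3a://example-data-lake/checkpoints/transactions/")
 .start()
 .awaitTermination())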


A QUICK PROOF OF CONCEPT
Seeing is believing. Building a fast data environment need not be a time-consuming project. With the right examples, and the right experimental data to work with, it’s possible to spin up a cloud environment and start getting to work. Oracle’s Big Data Cloud Platform offers a comprehensive environment that lets you quickly:
• Load data into an object store—the “new data lake”—and do basic batch analyses
• Set up a Kafka stream to ingest and analyze fast, streaming data
• Integrate contextual data in your data lake with streaming data using the power of Kafka, Spark and Hadoop

THE NEW DATA LAKE SOLUTION
Big Data Cloud Service—Compute Edition is the right platform to get started with. It offers a simple, developer-friendly interface, can provision a new cluster in minutes, and encrypts data-at-rest and data-in-motion to keep it secure. Event Hub Cloud Service delivers the power of Kafka as a managed platform to handle streaming data. And Object Storage is the foundation for the most cost effective and flexible data lake.

OBJECT STORAGE—THE BEST DATA LAKE PLATFORM
Historically (if that’s the right term for something that is not yet a teenager), data lakes have been based on Hadoop and HDFS. But in the cloud there’s a better option. Object storage automatically replicates and distributes data across multiple data centers to increase availability and data integrity. With object storage you can:
• Detach compute from storage to allow each to grow independently
• Persist data in a lower cost store that offers greater durability
• Maintain a centralized, multi-tenant platform that has the flexibility to handle new workloads, new types of data and new software frameworks as the complexity of use cases increases.
Hadoop HDFS’ strategy of intrinsically tying storage and compute is increasingly becoming an inefficient use of resources when it comes to enterprise data lakes. Think of object store as the lowest tier in your storage hierarchy. Object Store allows you to decouple storage from compute, giving organizations more flexibility, durability and cost savings. Our guidance is to store everything in Object Store and read only the data you need into the application tier on demand. At the end of the day, the cost of copying this data as needed is small compared with the savings and the increased flexibility.

CALL TO ACTION
The best way to learn is by doing. Visit oracle.com/bigdatajourney for step-by-step instructions on how to get started using sample data and a real use case. You’ll also find instructions on how to redeem $300 in free cloud credits towards Oracle Cloud Services necessary to build “The New Data Lake.”

ORACLE
www.oracle.com



sponsored content

Cacheless Architectures for Digital Transformation:
Why They Matter, and What You Need to Know

ORGANIZATIONS ARE EMBRACING digital transformation to create a competitive advantage by either building entirely new systems of engagement (SOEs) or by attempting to retool existing SOEs. SOEs are real-time, operationally focused, edge-based systems and services, such as mobile and social apps, videos, cloud, and big data, that are core to many businesses. SOEs require data consistency, speed, uptime, and availability.
In many cases, organizations are enabling new SOEs via traditional approaches such as an RDBMS or other storage layer, perhaps similar to a 1st generation NoSQL DB, along with a caching solution to enable the real-time speed these systems require. See Table 1 for an example architecture used to enable these types of systems. [Table 1: example cache-based architecture—image not reproduced]
Part of the challenge organizations face is that architecture and infrastructure approaches have changed dramatically in the last few years, creating a significant competitive advantage for organizations willing to embrace next-generation architectures to solve real-time problems. There have also been advances in hardware storage (NVMe and Optane storage) and DB architecture, which fundamentally change how organizations solve these types of problems.
Organizations that have embraced a new cacheless hybrid memory DB architecture have demonstrated they can increase overall performance of real-time applications, at scale, improve uptime and availability, and dramatically reduce total cost of ownership (TCO) as compared to traditional RDBMS/cache-based solutions and 1st generation NoSQL solutions.
See Table 2 for a significantly different architectural approach. [Table 2: cacheless hybrid memory architecture—image not reproduced] True hybrid memory architectures or cacheless architectures remove a layer from the technology stack, creating enormous simplification and significantly lower TCO, all while improving overall performance for lower latency SOE applications.
Why is a hybrid memory architecture important? True hybrid memory systems:
1. Store indices in memory (DRAM) and data on SSD storage, reducing server count by up to 10 times. They also use NVMe drives at one-sixth the cost of DRAM and dropping.
2. Are multi-threaded, massively parallel systems.
3. Significantly improve uptime and availability, without manual DevOps processes.
4. Require data consistency. A true hybrid memory system is capable of mixed workload (true concurrent read/write), synchronization within a cluster, and asynchronous communication across remote clusters.
5. Provide predictable performance, irrespective of workload, WITHOUT a traditional caching layer.
6. Lower TCO, creating a competitive advantage and enhanced business value.
A true cacheless architecture reduces server footprints by a factor of three or more and hardware costs by a factor of six or more.
Organizations embracing digital transformation to build SOEs should consider a cacheless hybrid memory architecture to deliver predictable performance and create a competitive advantage, at a lower TCO.

AEROSPIKE
www.aerospike.com
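To make the “remove a layer from the stack” contrast concrete, here is a generic Python sketch (deliberately not any vendor’s client API) comparing the Table 1-style cache-aside read path with the Table 2-style single-hop read against a hybrid-memory store; the class and key names are illustrative only.

class DictStore:
    """Stand-in for a database, a cache, or a hybrid-memory store."""
    def __init__(self, data=None):
        self.data = dict(data or {})

    def get(self, key):
        return self.data.get(key)

    def set(self, key, value):
        self.data[key] = value

def read_with_cache(key, cache, database):
    # Table 1 style: the application manages two tiers and their consistency.
    record = cache.get(key)
    if record is None:                 # cache miss means a second hop to the database
        record = database.get(key)
        cache.set(key, record)         # and extra code to keep the cache populated
    return record

def read_cacheless(key, store):
    # Table 2 style: one tier; the store itself keeps indexes in DRAM and
    # data on SSD/NVMe, so there is no separate caching layer to manage.
    return store.get(key)

if __name__ == "__main__":
    database, cache = DictStore({"user:42": {"name": "Ada"}}), DictStore()
    hybrid_store = DictStore({"user:42": {"name": "Ada"}})
    print(read_with_cache("user:42", cache, database))
    print(read_cacheless("user:42", hybrid_store))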



sponsored content

To Stream or Not To Stream—Just Flip A Switch

THE CONTINUOUS STREAM PROCESSING OPPORTUNITY AND RISK
Data is now rushing into organizations at blazing speeds. The growth in connected devices and sensors has made new datasets like clickstreams from web servers, geo-location data, and industrial data coming from sensors and machines available for analysis. But the ability to turn this real-time streaming data into valuable business insights on time is what differentiates competitive innovators from the rest. Despite the increased maturity of new streaming technologies like Apache Kafka and Amazon Kinesis, very few organizations have been able to demonstrate fast and repeatable success. In a world of growing data velocity, organizations need faster, simpler, more reliable, and more repeatable approaches to managing stream processing and delivering connected experiences.

MULTI-LATENT ABSTRACTED APPROACH TO FAST DATA LAKE MANAGEMENT
The traditional approach to manually coding real-time stream processing functions imposes excessive development times and a highly rigid approach to developing and maintaining code. With so much innovation in real-time big data and fast data technologies, building code manually only creates technical debt. Moreover, these approaches require you to isolate real-time streaming processes from more traditional batch-oriented processes, leading to future delays and complexity. In contrast, a multi-latent and abstracted approach to data streaming and data lakes accelerates development projects by abstracting streaming logic from the underlying execution, thus accelerating the development times for projects, while also providing flexibility to leverage existing batch processes or create new ones with a flip of a switch.

SMARTER, MORE RELIABLE INTERNET OF THINGS WITH MULTI-LATENT ABSTRACTION
Internet of Things projects are delayed due to excessively manual approaches to specialized development. One-off platform-specific code built using frameworks like Apache Kafka, Apache Spark Streaming, or Amazon Kinesis creates a higher risk of project delays, project rewrites, and long-term technical debt. Moreover, one-off approaches generally lack the dependability and availability that is required for enterprise-grade Internet of Things projects.
Multi-latent abstraction is the development, testing, and operations of data flows at any latency without knowledge of or implementation at a lower level. By building data flows at a higher level of business logic, the development time for projects is accelerated. Furthermore, downstream testing and maintenance of data flows is radically easier and faster, thus dramatically lowering the total cost of ownership for project environments. Ultimately, organizations simply focus on the business logic of how data should be ingested and processed, while leaving the underlying execution to the abstraction layer.
Multi-latent abstraction clearly accelerates projects and lowers overall TCO, while unleashing flexibility. As underlying processing engines like Apache Kafka, Apache Spark Streaming, or Amazon Kinesis continue to evolve, organizations face zero risk of code re-writes. In fact, converting data flows between batch execution and real-time execution is literally just a matter of flipping a switch. Meanwhile, the management framework of an abstraction layer also enables other capabilities to maximize reliability, such as session playback should any data flows experience interruption. Multi-latent abstraction therefore drives speed, efficiency, flexibility, and reliability.

BRINGING THE RIGHT PEOPLE, PROCESS, AND TECHNOLOGY TOGETHER
Succeeding with the Internet of Things is often best served by building a center of excellence. Cognizant is a leading provider of business and technology services for helping organizations get more value from their data assets. Cognizant believes integrated investments in business strategy, design thinking, industry expertise, and technology delivery are critical to success for Internet of Things and data lake environments. Training and change management can help define the right structures, roles, and operating procedures to effectively bring together the right people and processes for maximum return on investments.
Meanwhile, Informatica is the #1 independent provider of data management solutions for enterprises. Informatica’s comprehensive Cloud Data Lake Management solutions, including our unique Intelligent Streaming product, deliver accurate and consistent big data assets you can trust to power faster business decisions. Data security and data governance are automatically enforced through metadata intelligence that ultimately enables more data consumers to quickly and repeatedly get more trusted big data without more risk. As close partners, Informatica and Cognizant work together to deliver the right people, process, and technology to build Internet of Things and data lake projects with speed, efficiency, flexibility, and reliability.
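As a generic, hedged illustration of the batch-versus-streaming “switch” described above (ordinary PySpark rather than Informatica’s Intelligent Streaming API), the same business logic below can be pointed at a batch file read or a live Kafka stream; the topic, broker, schema, and paths are hypothetical.

from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("flip-a-switch-sketch").getOrCreate()

# Hypothetical record layout for a stream of payment events.
schema = StructType([StructField("account", StringType()),
                     StructField("amount", DoubleType())])

def business_logic(df: DataFrame) -> DataFrame:
    # The latency-independent part: what to do with each payment record.
    return df.filter(col("amount") > 0).withColumn("large", col("amount") > 10000)

def build_source(streaming: bool) -> DataFrame:
    # The "switch": identical downstream logic, different execution mode.
    if streaming:
        raw = (spark.readStream.format("kafka")                    # needs the Kafka connector package
               .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
               .option("subscribe", "payments")                    # placeholder topic
               .load())
        return raw.select(from_json(col("value").cast("string"), schema).alias("p")).select("p.*")
    return spark.read.schema(schema).json("/data/payments/")       # placeholder batch path

STREAMING = False  # flip to True to run the same logic over the live stream
scored = business_logic(build_source(STREAMING))

if STREAMING:
    # Streaming runs continuously and writes through writeStream with a checkpoint.
    (scored.writeStream.format("parquet")
     .option("path", "/data/payments_scored/")
     .option("checkpointLocation", "/checkpoints/payments_scored/")
     .start().awaitTermination())
else:
    scored.write.mode("overwrite").parquet("/data/payments_scored/")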



TRENDING NOW
The Growth of Hybrid IT and What It Means
Q&A With SolarWinds’ Kong Yang

ACCORDING TO Unisphere Research, over the past 5 years, storing data in the cloud has become an increasingly important feature of the overall data management infrastructure, and the amount of data stored in the cloud continues to expand at a healthy rate.
The growing assortment of private cloud, virtual private cloud, and public infrastructure-as-a-service offerings means IT professionals must start thinking about how to be successful in this increasingly hybrid context as well as strategies for managing what are becoming highly complex environments.
Kong Yang, head geek at SolarWinds, a provider of IT management software, believes the rise of the mobile workforce and the pressure to implement new technologies means that modern IT professionals must be able to quickly evolve beyond the confines of on-premises deployment and shift into the realm of hybrid IT. Here, Yang reflects on some of the ways that IT professionals can begin that journey.

What is changing in IT environments now with respect to cloud and on-premises deployments?
As the technology industry continues to transform, IT environments are becoming increasingly hybrid. A hybrid IT environment encompasses a mix of cloud services and on-premises deployments, and has positive and negative effects on IT professionals and organizations in general.
For example, hybrid IT gives organizations the opportunity to consider a workload’s resource, security, and performance needs before determining whether it’s a better fit for the cloud or if it should remain on-premises. Public cloud vendors supply IT organizations with the services necessary to implement hybrid IT on an as-needed basis; this ultimately gives organizations opportunities to choose services and scale as they are needed.
While convenient, affordable, and full of choices, hybrid IT also creates a host of problems for IT professionals as they have to manage mission-critical layers of their application services across networks, systems, and services that they neither own nor control completely. This decreases their visibility into performance and challenges their authority to identify and resolve problems such as downtime and outages. With deployments that were previously onsite spread across cloud service providers, IT administrators need to monitor their environments more efficiently and effectively than ever, and develop new skills to succeed.

What do these changes mean for IT professionals?
With these changes, IT professionals must continue to develop new skills to keep pace and avoid being “left behind.” IT professionals can no longer have a single area of expertise; as their IT environments become increasingly de-siloed, their areas of expertise must also extend beyond their usual discipline.
Discerning what can be moved outside the data center to best realize the benefits of hybrid IT requires a deep understanding of cloud services and how they integrate with on-premises deployments. IT professionals must learn to use this knowledge to determine what services and applications are best suited for on-premises, as the decision to migrate a portion of existing IT services to the cloud should not be taken lightly. IT professionals and their ultimate value will be in balancing the cloud’s benefits with performance, cost, governance, and security objectives.

What are the skills needed to succeed in a hybrid context?
An IT professional operating in a hybrid IT environment must surpass traditional roles and develop a keen understanding of enterprise networks, data centers, and application delivery; these skills are actually a mix of adapting existing skills and acquiring new ones.



They must hone their skills of managing infrastructure services and vendors, integrating cloud services, and ensuring quality-of-service that meets business performance needs for any given service.
Embracing monitoring as a discipline is of great importance to successfully implementing and maintaining hybrid IT. IT professionals are still responsible for overall performance and availability regardless of the direct control and visibility issues hybrid IT creates. It is critical to monitor resource utilization, saturation, and errors across the application stack regardless of location.

Where does DevOps fit in?
DevOps’ core tenets of increased collaboration and communication with continuous integration and delivery of services need to be applied to hybrid IT. The introduction of cloud services creates more complexity with change management, as the service being consumed as-a-service—either software-as-a-service, platform-as-a-service, or anything-as-a-service—can change. That’s part of the cloud premise—a high rate and large amplitude of change.
Applying DevOps principles, such as monitoring with discipline and continuous collaboration, will help IT professionals mitigate the risk associated with this high frequency of change. The biggest benefit of applying a DevOps culture to a hybrid IT environment is allowing the IT team to quickly deliver services that the business and customers need exactly when those services are needed, with better quality assurance.

You introduced the DART framework as a series of skills that virtualization administrators can use. What does it stand for?
The DART framework encompasses four key principles to successfully adopt monitoring as a discipline. These skills apply to any IT professional, especially one looking to enable hybrid IT service models. The four tenets of the DART framework are discover, alert, remediate, and troubleshoot. Through these principles, IT professionals can ascertain what’s going on in their environments, learn when something is going wrong (without having to constantly sit in front of a monitor), fix problems fast, and determine the root of those problems to prevent future issues.

What does it add that is missing?
Hybrid IT is leading to increasingly opaque IT environments. The principles of the DART framework enable the success of hybrid IT at every layer, from surfacing truths into an IT environment to resolving problems in the application stack and preparing for future integration. This framework is built to quickly surface the single point of truth in the hybrid IT environment.

If all goes according to your expectations, 5 years from now, what do you think will be different in enterprise environments?
In 5 years, enterprise environments will be increasingly affected by changes that are already taking place and the host of additional problems, resources, and skills that this hybrid IT world has created. We will see changes from every angle, from expanded cloud vendor services to IT professionals’ skills. While some systems and applications will remain on-premises, new services will give IT professionals and end users more incentives to move to the cloud. I anticipate more seamlessly integrated hybrid IT with more cloud services, especially around serverless architecture, artificial intelligence, and virtual/augmented/mixed reality. And, in the coming years, IT professionals will need fully developed cloud skills. In fact, IT pros without cloud skills will likely become obsolete.

This interview was conducted, condensed, and edited by Joyce Wells



BIG DATA BY THE NUMBERS
DEVOPS AND THE NEED FOR SPEED

The need for speed and agility are among the key drivers of the growing DevOps movement, which seeks to better align software development and IT operations. Yet, challenges still exist. As organizations embrace new models such as DevOps, a requirement for greater visibility and operational efficiency is also driving tool consolidation.

• Organizations are using between 4 and 10 tools to manage their growing portfolios of custom apps.
• 42% are deploying and updating apps more frequently than in the past.
• 68% plan to adopt DevOps practices, or are already doing so on a limited or trial basis.
Source: “The New Normal: Cloud, DevOps, and SaaS Analytics Tools Reign in the Modern App Era,” from Sumo Logic with UBM Technology (March 2017)

Workloads are growing faster than headcount:
• 63% of respondents see their workloads increasing.
• 44% expect to see an increase in the size of their development teams.
• 33% expect to see an increase in the size of their operations teams.
Source: Chef Survey 2017: Community Metrics

DevOps requires organizational, cultural, and technical changes. Key elements to successfully implementing a DevOps approach include:
• Support from executive leadership: 39%
• Flexible, available resources: 39%
• Cross-business, cross-functional teams spanning operations, development, and infrastructure personnel: 34%
• Inclusion of infrastructure as part of the continuous delivery process: 11%

Most significant technical benefits of the DevOps approach:
• Continuous software delivery: 46%
• Improved deployment success rates: 33%
• Faster resolution of issues: 25%
• Less complex problems to fix: 10%
Source: “The Current State and Adoption of DevOps,” produced by Unisphere Research, a Division of Information Today, Inc., and sponsored by Quest Software (September 2016)

Companies are struggling to bring DevOps to their databases:
• 47% of respondents have already adopted DevOps practices across some or all of their IT projects.
• 33% plan to adopt a DevOps approach within the next 2 years.
• Only 20% of respondents are already applying DevOps practices such as continuous integration, automated testing, or automated deployment to their databases.
Source: Redgate Software’s State of Database DevOps Survey (January 2017)

Most significant barriers to DevOps success: top responses ranged from 14% down to 11% (bar labels not reproduced in this extract).
Source: Quali’s 2016 DevOps Survey (March 2017)
INSIGHTS
Getting Real Business Value From Artificial Intelligence
By Roman Stanek

TODAY’S HEADLINES ARE filled with news about artificial intelligence (AI), proclaiming variously that robots will take our jobs, cure cancer, or change industries in ways unseen since the industrial revolution. One thing is clear to those of us watching closely, however: It’s not all hype. In 2016 alone, the quantity of AI startup acquisitions was remarkable, but most of these massive investments were made by an elite corps of companies, such as Amazon, Google, Apple, Facebook, and a few others.
The fact that these heavy hitters are leading the charge makes sense. AI investments, deployments, and resources are largely siloed within a small Silicon Valley circle, and the high cost of development or acquisition combined with the fact that there is currently only a small cadre of truly talented AI experts means that only a small number of companies have the resources to deliver AI innovation at scale. But what does that mean for everyone else? Does this signify that only the giants will profit from AI, while the rest of us languish?
The answer is an emphatic no. AI has the potential to benefit businesses of every size, but right off the bat, it’s important to understand that generating ROI from AI is not easy. There is still a gap between research into AI and delivering actual tangible business results in the real world; in fact, a recent Forrester survey found that while 58% of companies are researching AI, only 12% are actually using AI systems in their businesses.
Not everyone will be able to develop their own AI solution in-house, so to catch up with the tech giants, many smaller companies will seek technology partners to set up the least expensive, most effective way to harness AI and machine learning. To that end, here are the three keys that will help you achieve the most success in setting up your AI strategy.

Roman Stanek is CEO of GoodData.



Train Like You Mean It
Achieving value from AI requires training your organization in new skills, as well as creating a crystal clear plan about how investing in these types of technologies will generate ROI or achieve a specific strategic business goal. Integrating AI into your business model is a complicated strategic problem to which there is simply no one-size-fits-all solution. In order to succeed in this fast-paced, early adopter market, organizations need to create dedicated teams that are solely focused on getting these initiatives to market quickly, extracting key learnings, and optimizing them to deliver maximum impact and value.
It’s not enough to simply add an AI solution as a shiny new element to your existing product road map; your business and technology teams have to work together from the get-go, otherwise you’ll be left with expensive technology investments that are looking for problems instead of solving them.
Organizations that are winning at AI are doing so by establishing clear strategic objectives with the business team, and then handing their execution over to the professionals that live and breathe data and technology. Leaders in the roles of CIO, CDO, or CTO have the experience, mindset, and authority required to reorient your corporate culture to treat data as a strategic asset as well as the ability and flexibility to determine how best to leverage emerging technology to achieve the desired strategic business objective.
By placing your AI deployment under an experienced data executive and arming that person with clear strategic business goals and the organizational support needed to execute on those visions, you empower your company with the tools, talent, and mindset required to define and execute these initiatives successfully.

Hire and Grow the Right People
Because every business has different goals for its AI deployments, it is impossible to outsource these initiatives. Of course, adapting your existing teams to the task is essential. They know your business inside and out and have the experience and vision to know where you come from and where you need to go. That said, the complex nature of these systems means that almost inevitably you will have to hire dedicated AI experts to steer the ship in the right direction and plug knowledge gaps that your current teams might have. This means companies have to start thinking immediately about how to acquire and retain the talent they’ll need to successfully go to market with AI.
Right now there is a shortage of experienced AI professionals, and competition for their skills is fierce. You should make hiring the right talent a top priority, and focus on creating a culture where they want to stay. However, don’t forget that these initiatives are a team sport that requires a blend of experienced hands and new blood, so make sure you invest in keeping the talent that you have so that you can set, execute, and grow your AI strategy without disruption.

Find the Right Partner
Make no mistake: AI will transform the business landscape, with applications for every industry. According to an Ovum report on tech trends in 2017, machine learning will be the “biggest disruptor for big data analytics in 2017,” and become “table stakes for data preparation and other tools related to managing curation of data.”
But even as IBM Watson, Google Now, Alexa, Siri, and other platforms have opened the public’s eyes to the potential of AI, the fact remains that the advanced computer networks that make up the “brain” of these systems still struggle to match the ability of an average human in areas such as context and analytical capability.
To make these systems work will require advanced predictive analytics that can rapidly make sense of massive volumes of data, and it is critically important to find a flexible, scalable, end-to-end data and analytics partner who can help you see the past and anticipate the future. Choose carefully, as you are going to need to lean on that company’s expertise for years to come.

Get Ready for the AI Takeover
2017 will be a huge year for AI. Forrester Research expects “enterprise interest in, and use of, AI to increase as software vendors roll out AI platforms and build AI capabilities into applications,” as “enterprises that plan to invest in AI expect to improve customer experiences, improve products and services, and disrupt their industry with new business models.”
Until recently, harnessing AI was on the strategic agenda of only the most progressive companies, while most others viewed it as a futuristic concept and adopted a “wait and see” attitude. Such a posture is no longer viable. As money continues to pour into development, and businesses from every sector start ramping up their efforts to integrate AI into their offerings, companies must invest immediately in their own AI solutions or risk falling behind.



TRENDING NOW
DevOps for Big Data
Q&A With Pepperdata’s Ash Munshi

DESPITE THE LARGE investments that organizations are making in big data applications, difficulties still persist for developers and operators who need to find efficient ways to adjust and correct their application’s code. To address these challenges, Pepperdata has introduced a new product based on the Dr. Elephant project that gives developers an understanding of bottlenecks and provides suggestions on how to fix them throughout the big data DevOps lifecycle.
Pepperdata CEO Ash Munshi recently discussed the need for DevOps for big data, and the role of Dr. Elephant, which was open sourced in 2016 by LinkedIn and is available under the Apache v2 License.

What is happening in the big data space now?
We are seeing more and more customers going to production with big data. Our company tripled its growth last year and we have a nice vector for this year planned, as well. Our customers are proof that the technology and the solutions are leaving the lab and actually becoming business-critical.

What is changing as customers go into production with big data?
When we think about big data going into production, there are three big components. The first is making sure that things are reliable; the second is that they scale; and the third is that they perform. It is the performance aspect that we focus on as a company. The reason that performance is so hard for big data is that you are dealing with hundreds and thousands of computers, you are dealing with datasets that are usually two orders of magnitude larger than what classic IT has dealt with, and you are dealing with data that changes rapidly. And then, you are dealing with a lot of people that are doing things simultaneously. They are doing interactive work and decision support all on the same machines. That combination of variables is very hard to get your hands around, and the performance implications of that are even more difficult to understand. That is really why performance is such a big deal for big data. We like to say that performance can mean the difference between business-critical and “business-useless” for big data systems.

How is DevOps for big data different?
Classical DevOps was all about creating velocity between business requirements and needs—the developers writing the code, the systems that actually embody the code and solve the business problems—and it was all around processes like Agile, continuous integration, and continuous delivery. That is all well and good, but for big data, there is another big component, which is this performance aspect. We believe that performance needs to be a first-level player in DevOps for big data.

What does this involve?
Taking information about how things are actually performing and providing feedback to earlier parts of the DevOps chain is vital for big data. In particular, if you collect information about resource utilization, contention for the resources, the applications, and whether the places where they are deployed match or don’t match, and give that back to the people who do the release part of it, and can also say that the developers made a set of changes but they might be detrimental to performance—that is an important aspect of feedback. In addition, going back to the developers and saying that they might want to change their algorithm because it is not using the cluster efficiently or it is taking up too many resources, and then going back to the people who actually provisioned the cluster and saying that the assumptions that they made around the number of users, data volume, and the workload are not actually resulting in the response times that they are expecting—these are all important feedback loops back into the DevOps chain, and they are vital for big data. That is really our fundamental thesis.
The products that we have today—the Cluster Analyzer, the Capacity Optimizer, and the Policy Enforcer—provide that type of feedback to the operators.
The Cluster Analyzer gathers all of the data and answers questions about what resource is being used for what and how they are correlated. The Capacity Optimizer takes automated action and says there are additional resources that are available now to run more jobs or run them quicker if more resources are allocated to them—so it does automated analysis using machine learning to use the resources better.



And, then the Policy Enforcer guarantees that the important jobs are never starved for resources. They are all focused on providing performance feedback to the operators.

What are you adding?
With a new product, which is the Application Profiler, we are providing that performance feedback all the way up to the developers. We heard from operators that they wanted feedback to be provided to the developers because if the developers can make changes in their code, it will make the operators’ jobs much easier. That is why we embraced it and why we are going in that direction.

What does it do?
Strategically, the more we can provide to the developers, the more issues we can catch at an earlier value stage, which in turn means that there are fewer problems in production later on. The Application Profiler takes the data that was gathered and provides recommendations to the developers to make changes in their code so the code will run more efficiently.

How is it deployed?
The Application Profiler is built on an open source project called Dr. Elephant that was originally started by LinkedIn. We are now actively contributing to that project and we have integrated that code into our suite of products. That means that our customers who buy Application Profiler from us don’t have to go and install Dr. Elephant on a separate cluster with a separate user interface. It is provided as a software-as-a-service solution that is integrated into our suite.
The importance of integrating this into our suite is that in addition to providing recommendations to the developer, it is critical for the developers to understand the context that the jobs ran in. It is necessary for the developer to understand what was happening on the cluster at the time that the job was running in order for them to be able to determine how seriously to take some of these recommendations.
Dr. Elephant by itself doesn’t provide that, but by integrating it into our dashboard and our Cluster Analyzer, we are able to provide that context, so it makes Dr. Elephant much more powerful, in addition to the fact that we take the headache of deploying and supporting it away from mere mortals, so to speak. It is the integration, plus the hosted solution, that is the power of the Application Profiler.

What’s next?
We are contributing everything we are doing back to the community and we will continue to do that. The heuristics are going to be contributed back to the main code base with Dr. Elephant. We think that is a really important thing to do and, obviously, the community will benefit. And, as the community makes changes, we will also benefit, and so it is a very important step to take. LinkedIn has embraced us, we have embraced them, and others have started to join, as well.

This interview was conducted, condensed, and edited by Joyce Wells



HADOOP PLAYBOOK

Making the Most of the Cloud


WHEN PEOPLE talk about the next generation of applications or infrastructure, what is often echoed throughout the industry is the cloud. On the application side, the concept of “serverless” is becoming less of a pipe dream and more of a reality. The infrastructure side has already proven that it is possible to deliver the ability to pay for compute on an hourly or more granular basis.
In February of 2017, Amazon had a “hiccup” which caused a massive blackout impacting websites across the world. This doesn’t make Amazon a bad choice as a cloud provider, but it should cause people to think about and remember what the real value proposition of the cloud is, and that is infrastructure-as-a-service. If you think the cloud is a viable option for operating part or all of your business, then consider taking advantage of it. However, as the old adage goes, “Variety is the spice of life,” and in the context of cloud, this means “prepare to go multi-cloud.”
Utilizing multiple clouds simultaneously delivers the same benefits as running multiple private data centers. Determine where to run select parts of your business applications, or even load balance your business applications by deploying all of your services in a highly available and redundant way. This will protect your business from any single failure in infrastructure and deliver better overall business continuity plans as well.
Container technologies such as Docker are a great way to leverage cloud offerings. While containers are not required, they make deploying software easier and allow you to better utilize the resources for which you are paying. When considering the use of Docker containers, think about deploying your own internal Docker repository and mirroring your repository between data centers. Then, as new container images are created, it becomes very easy to ensure the software is available in each of the data center locations, regardless of how many are being leveraged. One of the main benefits of Docker containers is that they enable you to start viewing your infrastructure as just a pool of resources waiting to be utilized.
Deploying a container on a server still requires having someplace safe to persist all of the data. Protecting against the failure of an individual Docker container or server failure is imperative to a successful implementation. Because of those considerations, storage is a necessity when it comes to containerized applications. Whether your application is trying to write log files, maintain the internal state of an application, leverage a database for general data model persistence, or even trying to use decoupled messaging for communication, persistence is a necessity.
Converging all of these services into a single data platform is ideal, as this allows simplified management and deployment of persistent application client containers. With a converged platform, regardless of which server in your cluster of hardware your container gets deployed on, it will always be able to find and write data of any type. This is a major benefit when contemplating a move toward a serverless architecture.
Enabling a separation in authority between the software architecture and the data administration is critical to leveraging cloud infrastructure. Software engineers should not have to be concerned about the cost of storage options or which storage facility needs to be used. Moreover, they should not have to worry about rewriting software when a new storage class becomes available or is a better fit based on cost or performance.
The line-of-business and systems administrators are the folks in an organization who should determine the ongoing balancing of the costs and performance of a system. If your data platform supports using ultrafast NVM Express solid-state drives or super-slow, yet reliable, object storage, then take advantage of that and pick and choose where your data lives within the data platform.
In order to maintain agility within your organization, do not force your engineers to write code specific to where data should land based on costs. Software engineers should worry about writing good software to meet the needs of the business; systems administrators and business owners should pick and choose where the data will reside. When looking toward the cloud, it isn’t acceptable to be required to make a trade-off between data agility, application agility, and infrastructure agility. You deserve them all, but it is up to you to take advantage of the options presented.

Jim Scott, director of Enterprise Strategy and Architecture at MapR, is the co-founder of the Chicago Hadoop Users Group (CHUG), where he coordinates the Chicago Hadoop community.
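As a small, hypothetical illustration of that separation of authority, the Python sketch below has application code resolve a logical dataset name through a placement map that administrators own; the file path, dataset names, and storage URIs are assumptions for illustration, not anything prescribed in the column.

import json
import os

# Administrators own this mapping (normally a config file they update);
# engineers only ever refer to logical dataset names.
DEFAULT_PLACEMENT = {
    "clickstream": "s3a://cold-object-store/clickstream/",     # cheap, durable tier
    "session_state": "maprfs:///fast/nvme/session_state/",     # low-latency tier
}

def load_placement(path="/etc/dataplatform/placement.json"):
    # Read the admin-managed placement map, falling back to the inline example.
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return DEFAULT_PLACEMENT

def dataset_uri(logical_name, placement):
    # Application code never hardcodes a storage tier, cost class, or cloud location.
    return placement[logical_name]

if __name__ == "__main__":
    placement = load_placement()
    print("clickstream lives at:", dataset_uri("clickstream", placement))
    print("session state lives at:", dataset_uri("session_state", placement))

When administrators later move a dataset to a cheaper or faster tier, they change only the placement map; the application code, like the engineers who wrote it, never needs to know.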



BIG DATA BASICS
Overcoming Common Problems
With Data Visualization
ORGANIZATIONS ARE EMBRACING data visualization as more than a tool to “see” trends and patterns in data, but as a pathway to a dynamic culture of visual data discovery. As with any type of cultural shift, there are going to be a few bumps along the road as innovative ways to transform data into actionable insights through the power of data visualization are sought. However, with a few considerations kept top-of-mind in the early stages of data visualization adoption, common problems can be avoided.
1. Take Time for Due Diligence Upfront
Avoiding data visualization problems begins with being sure you’re bringing the right tools into the mix in the beginning. No tool is a magic bullet for every data need, especially out of the box. It’s the organization’s responsibility to look beyond bells and whistles and thoroughly evaluate tool offerings to assess strengths and weaknesses in line with the needs and expectations of the organization before purchasing. Visualization tools are part of a larger data ecosystem, and knowing how a tool fits will help you maximize the value of your investment, streamline implementation, and minimize surprises. Additionally, it is important to have a clear idea of how users will apply the tool to solve business problems and where use cases, training opportunities, or requirements may exist that affect how tools are compared. Crafting a vision statement prior to embarking on tool evaluations will ensure tools are evaluated in line with the organization’s plans today and for the future.
2. Establish a Common Glossary
Establish a common business glossary to facilitate a shared understanding of what data visualization is and how it is used that is understood, trusted, and utilized within the organization. Variations in definitions cause confusion, inconsistency, and redundancy, and reduce the validity of analysis activities. Conversely, standard definitions promote consistency, streamline visualization activities (for example, a single visualization or dashboard can be shared by many rather than each group recreating the same visual), and can facilitate fruitful conversation because everyone is working on the same set of assumptions and meanings of the data, its business use, and its context. Additionally, expectations for version control and updating of the business glossary should be established, so that there is a level of assurance that this tool will continue to generate value within the business. Determine how often and who will be responsible for leading efforts to make sure the vocabulary stays up-to-date and accurate.
3. Know Your Data Stewards
Using data visualization to nurture the democratization of data highlights the need for clear boundaries and data tracking and monitoring, especially among those who have the ability to make changes to the data or the visualization through their visual discovery activities. Data stewards do not necessarily own the data; they are the go-to people for questions, concerns, or doubts on the data’s use, quality, or context. They are also the people that can be counted on to contribute a meaningful definition of the data back into the business vocabulary. These stewards are often intermediaries between business and IT and speak to both sides. They understand the business drivers and needs and how the data supports them, and are also versed in the entire lifecycle of the data, spanning generation or acquisition, where it lives in the data architecture, how it is administered, its security and access controls, and how it’s leveraged as a visual information asset in the business.
4. Respect the Limitations of Self-Service
The demand for self-service is heard throughout the tiers of data users. However, it is necessary to exercise caution. There is a wide gap between empowering users with self-service and those users becoming self-sufficient. Different users will use data visualization for an assortment of needs, through a variety of form factors, and will bring to the table varying levels of expertise in visual design, data visualization best practices, and even data storytelling. There will be users who desire to be hands-on and deeply involved in building data visualizations, and there will be those who prefer to consume visual assets. Thus, it is important to know both the audience of data visualization and whether they will need to be presented data within either a hands-on or a consumption environment.

Based in the greater New York City area, Lindy Ryan researches and teaches business analytics and data communication at a major East Coast university, and is the author of The Visual Imperative: Creating a Culture of Visual Discovery. Follow her on Twitter @lindy_ryan.

AD INDEX
Melissa Data ..... Cover 4
Best Practices:
Aerospike ..... 22
Informatica ..... 23
Oracle ..... 20



CLOUD CURRENTS

Are the Public Clouds Too Big to Fail?


IN OCTOBER OF 2008, Congress enacted the Emergency Economic Stabilization Act, more commonly known as the bailout of the financial system. It was deemed that certain U.S. financial companies and institutions were too important to the systemic stability of the system to be allowed to become insolvent. The understanding was that catastrophic financial consequences would be the result of the failure of these entities and that those aggregate failures could devastate the U.S. As a result, they have been heavily regulated and controlled with the intention of protecting against that type of exposure again.
The recent major outages in public cloud services inevitably lead to the same question being asked of this new industry. That is, whether or not certain public cloud services have become so critical to the functioning of the U.S. economy that those systems should be subject to the same strict scrutiny and control as the financial systems. Are they so intertwined with U.S. commerce that a “Cloud Dodd-Frank” should be considered by the 115th session of Congress? Although hard to precisely verify, it has been reported that a recent public cloud outage affected service to more than 50% of the top 100 online retailers. What happens when a major public cloud problem brings down half of the 911 systems across the U.S., or even one-third of certain critical state and local government systems? Everyone, from the federal, state, and local governments to public and private industry, is using public cloud services, and the trend is continuing to accelerate.
As this growth perpetuates, our reliance on the public clouds for critical services will also continue to grow accordingly. Can we continue to allow public clouds to be unregulated? The often-used metaphor of the “Wild West” is quite appropriate in this case. At what point will our use of the public clouds be so critical to the functioning U.S. infrastructure that it will require prudent regulations applied to protect the critical functions and even infrastructure that they support? It is reasonable to conclude that the public clouds are sufficiently embedded in critical infrastructure as to be considered equivalently critical to other public utilities.
As the rapid expansion of public clouds continues and the complexity of the clouds becomes daunting, customers must consider how to protect themselves from the inevitable calamities that will occur if and when a public cloud service experiences a serious and sustained outage.
Let’s discuss some of the steps customers should consider taking so that they can avoid becoming a casualty when a major outage occurs.

Going Hybrid—or Not
Many companies confuse the expertise obtained when doing development work in the cloud with building the skills needed to deploy enterprise-wide applications in the cloud. It is very simple to enter a corporate credit card and immediately start utilizing the compute resources, network, and storage resources provided in the cloud. It’s so easy—if you need more resources, you just ask.
It is true that the cost of using infrastructure is diminishing due to the economies of scale enabled by the cloud. However, the costs of engineering reliability, security, recoverability, and scalability are significant. Many companies don’t understand that the exercise of “forklifting” applications into the cloud is very different than architecting a valid strategy for effective and sound business operations in the cloud. Expecting the cloud to be perpetually available without interruption is a recipe for disaster. Just ask the top retail sites that were disrupted due to the recent outages if it was good for their business. It is likely that they regretted their unquestioned faith in the reliability of their public cloud infrastructure.
These recent outages reinforce the need to partner with a qualified provider of cloud infrastructure who can work with you to make sure there is a clear understanding of the business requirements around availability, security, and recoverability. The public cloud should be treated as a commodity. It is a wise approach to work with an infrastructure provider who can offer access to multiple public clouds. A well-developed cloud infrastructure will be architected with the specific requirements pertaining to reliability, recoverability, and security for each individual customer. It is important to work with an infrastructure provider who understands the different strengths and weaknesses of the various clouds and can match an organization’s needs to the right solution.

Michael Corey, director—cloud computing evangelist at Spectrum Enterprise Navisite, was recognized in 2017 as one of the top 100 people who influence the cloud. He is a Microsoft Data Platform MVP, Oracle Ace, VMware vExpert, and a past president of the IOUG. Check out his blog at Michaelcorey.com.

Don Sullivan has been with VMware since 2010 and is the product line marketing manager for business critical applications and databases.



Vendor Lock-In

Many of the cloud service providers offer unique features and capabilities. Each specific feature or capability may be proprietary to a specific vendor and may result in vendor lock-in. It's necessary to work with a provider who can help determine whether the advantages of that feature or capability outweigh its disadvantages within the context of the specific customer requirements. In 2015, a major coffee retail chain experienced a public cloud outage that took down thousands of stores in the U.S. and Canada. That company learned firsthand the dangers of putting "all your eggs" in the proverbial basket of a single public cloud provider. Apple put its infrastructure eggs in one basket that same year and had 200 million iCloud users affected by a disruption in service.

Both these situations could have been avoided with a properly designed hybrid-cloud approach. Moving forward, if these organizations choose to adopt a hybrid cloud solution, they may come to regret decisions to build on proprietary features offered in a single vendor's public cloud.
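At the application level, the mechanics of avoiding that single-basket dependency can be illustrated with a very small sketch. The provider names, endpoints, and timeout below are hypothetical assumptions for illustration only, not a recommendation tied to any particular vendor; the idea is simply that a workload probes equivalent deployments in more than one cloud and routes to whichever responds as healthy.

# Illustrative sketch only: probe equivalent deployments in two public clouds
# and prefer the first one that answers its health check. Endpoints are hypothetical.
import urllib.request

ENDPOINTS = [
    ("primary-cloud", "https://app.primary-cloud.example.com/health"),
    ("secondary-cloud", "https://app.secondary-cloud.example.com/health"),
]

def first_healthy(endpoints, timeout=2):
    # Return the name and URL of the first endpoint that answers HTTP 200.
    for name, url in endpoints:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status == 200:
                    return name, url
        except OSError:
            continue  # unreachable or timed out; try the next provider
    return None, None

provider, url = first_healthy(ENDPOINTS)
if provider is None:
    print("No public cloud is healthy; fail over to on-premises capacity or queue work.")
else:
    print("Routing traffic to", provider, "at", url)

In practice this logic usually lives in DNS failover, load balancers, or a service mesh rather than in application code, but the principle is the same: never assume that a single cloud is always up.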
Software as a Service

Forbes magazine reported a few years ago that 83% of healthcare organizations were using cloud-based applications. Did these organizations put all their infrastructure eggs in one basket? How would the quality of care be affected if the cloud vendors used by these hospitals experienced an outage? And what happens to our personal HIPAA-protected information if these healthcare organizations stop using these vendors?

The same types of questions should be asked of any company in any industry doing business in the cloud. When a company uses Salesforce, it is common practice to store a copy of the contract in the application. Today, it is common practice to store source code in the cloud. What happens when a vendor experiences a security breach? And what happens when a vendor used by one of the cloud service providers gets breached? The recursive logic is daunting. Will you get notified if this multi-level exposure occurs? Security is only as strong as the weakest link, and along the same lines, a firm's level of cybersecurity is only as good as the cybersecurity of its vendors. As more organizations co-mingle data in the cloud, the exposure increases exponentially. Are all the parties involved taking the proper steps to make sure the data is secure? The question is clearly rhetorical.

New Capabilities, New Challenges

According to the IDC "Worldwide Semi-Annual Public Cloud Spending Guide," spending on public cloud services will soar to $122.5 billion this year. The cloud infrastructure that exists today to support all this spending is immense. Arguably, it is the most complex technology infrastructure that has ever existed. With all this complexity, there will be future outages as we continue to learn how to support it all. But as more critical infrastructure relies on the cloud, will there be a time when critical infrastructure regulation will be required? When will cloud infrastructure be too big to be allowed to fail? With companies moving more of their critical capabilities into the cloud, it is important that they work with an infrastructure provider who has the skill and expertise to help properly assess the requirements of recoverability, security, and availability.

As the cost of using cloud-based infrastructure drops due to the economies of scale of the cloud, investments must be made in engineering cloud solutions that guarantee reliability, scalability, recoverability, and security. Because this new paradigm of cloud computing co-mingles a company's infrastructure and data with that of other partners, it is important to remember that this chain of new capabilities is only as strong as its weakest link.

Michael Corey, director—cloud computing evangelist at Spectrum Enterprise Navisite, was recognized in 2017 as one of the top 100 people who influence the cloud. He is a Microsoft Data Platform MVP, Oracle ACE, VMware vExpert, and a past president of the IOUG. Check out his blog at Michaelcorey.com.

Don Sullivan has been with VMware since 2010 and is the product line marketing manager for business-critical applications and databases.




GOVERNING GUIDELINES
The Science of Data Governance Matter
YOU WILL OFTEN hear experienced practitioners and consultants suggest that there is both an art and a science to effective data governance. The art is in the details of fine-tuning a data governance program to fit your culture and address specific business needs. But the fundamental principles of data governance are best understood and executed through science.

Data governance is a physical science. For some, this is a very exciting and intriguing proclamation, and for the rest of us, the very thought of chemistry and physics brings on a cold sweat. But, through years—and now decades—of experience in the world of data, the laws of physical science have resurfaced to provide a unique perspective on the core elements (pun intended) of successful data governance.

States of Matter. The existence of matter is expressed in states or phases based on physical properties and the behavior of particles. Effective data governance programs operate in different states with properties similar to solids, liquids, and gases. Some phases persist for long periods with low energy, while others require sustained energy sources. The key is that while physical properties change, the basic elements remain constant across all states. The effort and energy required to support and maintain each state will vary, sometimes significantly.

Energy. The energy required to support and sustain data governance is substantial, particularly as the program grows larger (mass). Inertia can come from the mandates of top-down initiatives or the momentum of grass-roots, bottom-up projects. Hot topics that drive the business will generate their own energy, while the pressures of external forces may create new, and sometimes unexpected, sources of energy. Authority, budget, resources, and time are all factors in the energy equations of data governance.

Solid. This represents a rigid physical state of matter that retains a fixed shape. Particles are densely packed and have little to no movement. Solids can only change shape through applied force. The solid state of data governance is the core framework, the non-negotiable set of policies, processes, and standards that apply across the enterprise. Once established, the solid state of data governance requires little energy to sustain and support, as there is little to no flexibility for the pieces and parts. For solid-state data governance to change, significant force, such as an executive mandate or a substantial change in the overarching business model, is required.

Liquid. When energy is applied to matter in a solid state, the matter becomes liquid. The liquid state of matter is incompressible and assumes the shape of the container it occupies. As various forms of energy (resources, budget, authority, etc.) are applied to the solid state of governance, fluidity and flexibility are generated and the state of data governance changes. The liquid state of governance maintains the core elements of the established, solid-state data governance framework while allowing autonomous data decisions to flow throughout business units, taking on various shapes and sizes and addressing specific requirements as necessary. The additional energy of resources such as data stewards and accountable business stakeholders creates elasticity by applying domain and technical expertise to accelerate business processes.

Gas. When additional energy is applied to matter in a liquid state, the matter becomes gas. Gas is a compressible fluid with no definitive shape or form that permeates or fills the entire container it occupies. In energy-rich data governance environments, the fundamental elements of the data governance framework are baked into day-to-day business processes and tasks. The ownership and accountability of responsible data management permeate all levels of the organization. Like gas, mature data governance programs evolve through phases with continuously applied energy, which can be difficult to maintain. The key to sustaining a mature data governance program is to ensure the sources of energy (resources, budget, authority, etc.) are renewable and constant.

Similar to elements and compounds, data governance can exist in multiple phases or states, sometimes within the same company. It is because of these physical science principles that no two data governance programs are the same. For data governance to be most effective, you must understand the observable states of governance within your business, identify the ideal or desired state, and determine the energy required to reach the points of change given environmental conditions.

"Don't get set into one form, adapt it and build your own, and let it grow, be like water." – Bruce Lee: A Warrior's Journey (2000)

Anne Buff is a business solutions manager for SAS Best Practices, a thought leadership organization at SAS Institute. She specializes in the topics of data governance, MDM, data integration, and data monetization.



THE IoT INSIDER
Never Mind Fake News,
Fake Data Is Far Worse
WITH THE FUROR over fake news, where the truth is massaged for commercial or political gain, the focus has gone off fake data—which can have far more perilous consequences.

Imagine for a moment that you have an insulin pump embedded in your body. It is equipped with wireless connectivity, remote monitoring, and near-field communication technology. It is, in fact, an Internet of Things (IoT) device. Because the device is accessible by your health professionals, its security could be vulnerable to attack.

It might sound like a Cold War spy novel, but someone could hack into your device and give you a fatal dose of insulin. Hackers could even enter fake insulin-level data to trigger the overdose. This frightening scenario is just one example of how IoT devices present challenges to manufacturers and users alike.

Fake data arises from unchecked IoT devices, and the security of these devices is increasingly being questioned. There are even reports that devices have been hacked and added to botnets that carry out malicious attacks. Because IoT devices generally have a weak infrastructure, they are easy targets, and the sensitive data they often contain makes them even more enticing to hackers.

There are basically two scenarios to consider. In one, the hacker gains access to the device and compromises the software of the sensor itself, so the readings become unreliable. In the other, the hacker compromises the communications device and alters data that flows from the device to the decision point.

While currently most of the focus is on the latter case, pushing developers to increase the security and data encryption of communications devices, the former still seems grossly neglected. This can be a problem, as we have already seen that device software might be the more successful route for hackers. The first time the implications of fake data became apparent was with the Stuxnet worm that targeted Iran's nuclear facilities. That was back in 2010, when hardly anyone had heard of IoT.

When we consider the scope and scale of IoT devices today, it becomes even more frightening. Just recently, a teddy bear "leaked" private messages and email addresses through a hacking network where the data was held for "ransom."

As demand for many devices is cost-sensitive, devices are often built with little security. Plus, many companies tend to downplay future-proofing; what might be declared safe now could be unsafe in the future.

In IoT, we are at the brink of an era where devices will be participants in financial transactions. To cite an overused scenario, what if you have a smart washing machine that has detergent cartridges built in? When the cartridge is nearly empty, it would be logical for the smart machine to automatically order more detergent from the manufacturer.

But what if a daring detergent competitor hacked your machine and reinstructed the sensor in such a way that you ended up with a truckload of detergent, all charged to your credit card? What if the competitor not only did that with your washing machine, but with all the other machines out there? Imagine the economic impact on the vendor.

There are specialists arguing that security issues will be the reason IoT will fail. My view is that the most promising new technology on the block that could solve the issue of device integrity is blockchain—pun intended. As blockchain was originally designed for data integrity, why not use it for device integrity?

By sealing your sensor software with a cryptographic hash and by placing this hash in the blockchain, you could test at any moment whether a device's integrity has been compromised. Simply compute a checksum and compare the result with the value stored on the blockchain. It is just a matter of time before we see the first security solutions based on blockchain hitting the mainstream market.
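As a rough sketch of that checksum idea (the firmware path and the reference value below are hypothetical placeholders, and the blockchain lookup is abstracted away; a real solution would anchor the reference hash in a ledger transaction and verify it with signed code on the device), the comparison itself takes only a few lines:

# Illustrative sketch: verify a firmware image against a recorded reference hash.
# In the scenario above, the reference hash would be written to a blockchain at
# manufacturing time; here it is a hypothetical stored value.
import hashlib

def firmware_checksum(path):
    # Compute the SHA-256 digest of the firmware image, reading it in chunks.
    digest = hashlib.sha256()
    with open(path, "rb") as fw:
        for chunk in iter(lambda: fw.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

REFERENCE_HASH = "9f2c..."  # placeholder; the real value would be immutable on-chain

current = firmware_checksum("/firmware/sensor.bin")  # hypothetical device path
if current != REFERENCE_HASH:
    print("Integrity check failed: firmware does not match the recorded hash.")
else:
    print("Firmware matches the recorded hash.")

The hard problems are key management and ensuring the device reports its own state honestly, not the hash arithmetic itself.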
Let's be clear: There are still some serious issues with blockchain to be solved. In that sense, it is like the 1990s, when the internet appeared in our lives; we had quite a number of protocols, such as Gopher, before the Hypertext Transfer Protocol became the standard. Similar to the internet, which started out so slowly, we will see some very fast movement in this area once the industry agrees on a standard.

While fake news might thrill some people, fake data can take down an electricity grid, a stock exchange, your organization, or, even worse, your life. It has to go.

Bart Schouw is IoT solutions director at Software AG. Based in the Netherlands, he has nearly 20 years of experience across all areas of IT.



DATA SCIENCE DEEP DIVE
Improving the ROI of
Big Data and Analytics
BIG DATA AND ANALYTICS are all around these days. Most companies already have their first analytical models in production and are thinking about further boosting their performance. However, far too often, these companies focus on the analytical techniques rather than on the key ingredient: data. The best way to boost the performance and ROI of an analytical model is by investing in new sources of data that can help to further unravel complex customer behavior and improve key analytical insights.

Let's explore the various types of data sources that could be worth pursuing in order to squeeze more economic value out of your analytical models.

A first option concerns the exploration of network data by carefully studying relationships between customers. These relationships can be explicit or implicit. Examples of explicit networks are calls between customers, shared board members between firms, and social connections, such as family and friends. Explicit networks can be readily distilled from underlying data sources, such as call logs, and their key characteristics can then be summarized using "featurization" procedures, resulting in new characteristics that can be added to the modeling dataset. Research has shown network data to be highly predictive for both customer churn prediction and fraud detection.
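As a minimal sketch of such featurization (the column names, the toy call records, and the churn labels are invented for illustration and do not come from the studies mentioned here), call logs can be turned into per-customer network features with a few lines of pandas:

# Illustrative sketch: derive simple network features from call records.
# The data and column names are invented for the example.
import pandas as pd

calls = pd.DataFrame({
    "caller": ["a", "a", "b", "c", "c", "d"],
    "callee": ["b", "c", "c", "d", "a", "a"],
})
labels = pd.Series({"a": 0, "b": 1, "c": 0, "d": 1}, name="churned")

# Treat the call graph as undirected: one row per (customer, neighbor) pair.
edges = pd.concat([
    calls.rename(columns={"caller": "customer", "callee": "neighbor"}),
    calls.rename(columns={"callee": "customer", "caller": "neighbor"}),
]).drop_duplicates()

# Degree and the share of neighbors who churned become new model features.
features = (
    edges.join(labels, on="neighbor")
         .groupby("customer")
         .agg(degree=("neighbor", "nunique"),
              churned_neighbors=("churned", "sum"))
)
features["share_churned_neighbors"] = features["churned_neighbors"] / features["degree"]
print(features)  # new columns that can be added to the modeling dataset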
However, implicit or pseudo networks are more challenging to define and featurize. In one study, a network of customers was built in which links were defined based upon which customers transferred money to the same entities (e.g., retailers), using data from a major bank. When combined with non-network data, this way of defining a network based upon similarity instead of explicit social connections gave better lift and generated more profit for almost any targeting budget. In another award-winning study, a geosimilarity network was built among users based upon location-visitation data in a mobile environment. In this model, two devices are considered similar, and thus connected, when they share at least one visited location. They are more similar if they share more locations, particularly locations visited by fewer people. This implicit network can then be leveraged to target advertisements to the same user on different devices or to users with similar tastes, thus improving online interactions. Both of these examples illustrate the potential of implicit networks as an important data source. A key challenge here is to think creatively about how to define these networks based upon the goal of the analysis.
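A hedged sketch of the shared-entity idea (the payee table is invented, and the published studies use far richer weighting): customers can be linked whenever they pay the same entity, which amounts to a self-join on a customer-to-payee table.

# Illustrative sketch: build an implicit "pseudo network" by linking customers
# who transferred money to the same entity. The data is invented for the example.
import pandas as pd

transfers = pd.DataFrame({
    "customer": ["a", "a", "b", "b", "c", "d"],
    "payee":    ["shop1", "shop2", "shop1", "shop3", "shop2", "shop3"],
})

# Self-join on payee: every pair of customers sharing a payee becomes an edge.
pairs = transfers.merge(transfers, on="payee")
pairs = pairs[pairs["customer_x"] < pairs["customer_y"]]

# Edge weight = number of shared payees; rarer payees could be up-weighted.
edges = (pairs.groupby(["customer_x", "customer_y"])
              .size()
              .reset_index(name="shared_payees"))
print(edges)

The same pattern applies to the geosimilarity example, with visited locations taking the place of payees.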
Data is often branded as the new oil. Hence, firms such as Equifax, Experian, Moody's, S&P, Nielsen, and Dun & Bradstreet capitalize on this by gathering various types of data, analyzing them in innovative and creative ways, and selling the results. These firms consolidate publicly available data, data scraped from websites or social media, survey data, and data contributed by other firms. By doing so, they can perform all kinds of aggregated analyses, build generic scores, and sell these to interested parties. Because of the low entry barrier in terms of investment, externally purchased analytical models are sometimes adopted by smaller firms to take their first steps in analytics. In addition to commercially available external data, open data—such as industry and government data, weather data, news data, and search data—can also be a valuable source of information. Both commercial and open external data can boost the performance and the economic return of an analytical model.

Macro-economic data is another source of information. Many analytical models are developed using a snapshot of data at a particular moment in time. This is obviously conditional on the external environment at that moment. Macro-economic upturns or downturns can have an impact on the performance and, thus, the ROI of the analytical model. The state of the macro-economy can be summarized using measures such as gross domestic product, inflation, and unemployment. Incorporating these effects enables further improvement of the performance of analytical models and makes them more robust against external influences.
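As a small, hypothetical illustration (all figures are made up), attaching the macro-economic state to each observation is usually just a join on the observation date:

# Illustrative sketch: enrich a modeling snapshot with macro-economic measures.
# All values are invented; real series would come from a statistical agency.
import pandas as pd

snapshot = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "snapshot_quarter": ["2016Q4", "2017Q1", "2017Q1"],
    "balance": [1200.0, 80.0, 560.0],
})

macro = pd.DataFrame({
    "quarter": ["2016Q4", "2017Q1"],
    "gdp_growth": [0.019, 0.021],
    "inflation": [0.013, 0.018],
    "unemployment": [0.047, 0.045],
})

enriched = snapshot.merge(macro, left_on="snapshot_quarter", right_on="quarter")
print(enriched.drop(columns="quarter"))  # macro features now sit next to customer features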



Another type of data to consider is textual data. Examples are product reviews, Facebook posts, tweets, book recommendations, complaints, and legislation. Textual data is difficult to process analytically since it is unstructured and cannot be directly represented in a matrix format. Moreover, this data depends upon linguistic structure and is typically quite "noisy" due to grammatical or spelling errors, synonyms, and homographs. However, this type of data can contain very relevant information for an analytical modeling exercise. Just as with network data, it is important to find ways to featurize text documents and combine them with other structured data. A popular way of doing this is by using a document-term matrix indicating which terms appear, and how frequently, in which documents. Such a matrix will be large and sparse. Dimension reduction will thus be very important, making it necessary to represent every term in lowercase; remove terms that are uninformative, such as stop words and articles; use synonym lists to map synonymous terms to a single term; stem all terms to their root; and remove terms that occur in only a single document.

Even after these activities have been performed, the number of dimensions may still be too large for practical analysis. Singular value decomposition (SVD) offers a more advanced way to do dimension reduction. SVD works in a way that is similar to principal component analysis (PCA) and summarizes the document-term matrix into a set of singular vectors, also called latent concepts, which are linear combinations of the original terms. These reduced dimensions can then be added as new features to an existing, structured dataset.
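A minimal sketch of that pipeline with scikit-learn (the three toy documents and the choice of two latent concepts are illustrative assumptions; stemming and synonym mapping would be added with a separate text-processing library):

# Illustrative sketch: build a document-term matrix and reduce it with SVD.
# The documents are toy examples; min_df=2 drops terms seen in only one document.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "Great product, fast delivery and great support.",
    "Support was slow and the product arrived damaged.",
    "Fast delivery, product as described, great support team.",
]

# Lowercasing and stop-word removal happen inside the vectorizer.
vectorizer = CountVectorizer(lowercase=True, stop_words="english", min_df=2)
dtm = vectorizer.fit_transform(docs)  # sparse document-term matrix

svd = TruncatedSVD(n_components=2, random_state=0)
latent = svd.fit_transform(dtm)       # one row per document, one column per latent concept

print(vectorizer.get_feature_names_out())
print(latent)  # two new features per document, ready to join to structured data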
Besides textual data, other types of unstructured data, such as audio, images, video, fingerprint, GPS, and RFID data, can be considered. To successfully leverage these types of data in analytical models, it is critical to think carefully about creative ways of featurizing them. When doing so, it is recommended that any accompanying metadata be taken into account. For example, in fraud detection, not only the image itself may be relevant but also who took it, where, and at what time.

The bottom line is that the best way to boost the performance and ROI of analytical models is by investing in data first. And remember that alternative data sources can contain valuable information about the behavior of customers.

Bart Baesens is a professor at KU Leuven (Belgium) and the University of Southampton (U.K.) doing research on big data and analytics, web analytics, fraud detection, and credit risk management. See dataminingapps.com for an overview of his research.

