
APRIL 2016

SHAPING THE FUTURE OF DATA WAREHOUSING
THROUGH OPEN SOURCE SOFTWARE
Market Overview: Open Source Data Warehousing
Big Data Analytics: The Importance of MPP Data Warehouses
MPP Data Warehouses: The Open Source Perspective
About Pivotal

Sponsored by Pivotal Software, Inc.

tdwi.org

MARKET OVERVIEW:
OPEN SOURCE DATA WAREHOUSING
What are the benefits and risks of using an open source data
warehouse, and why are they just coming to market now? We
look at the basics of open source data warehousing with
Jeff Kelly, a data market strategist at Pivotal Software, Inc.

TDWI: What is an open source data warehouse?

Jeff Kelly: An open source data warehouse is a specialized database
built entirely on open source software code that supports enterprise,
production-grade data analytics and reporting. An open source data
warehouse should also support large-scale exploratory analytics and
data science workloads including machine learning.
What role does the open source community play as it relates to
data warehousing?
Like other open source technologies and projects, community
involvement leads to faster development cycles. Rather than being
beholden to the slow development cycles of a given data warehouse
vendor, open source data warehouse practitioners benefit from
continuous improvements made by the open source community,
which practitioners themselves can participate in to influence
product direction.
Most people associate open source with lower total cost of
ownership (TCO) compared to proprietary technology. Is that
the case with open source data warehousing?
Yes, open source data warehousing significantly reduces TCO. With
open source data warehousing, there are no software licensing costs
and no expensive proprietary hardware to purchase. The code is free
and generally runs on inexpensive commodity hardware. Open source
data warehouses are also an ideal environment for important but
less complex workloads many enterprises currently run on expensive
proprietary appliances, such as large-scale extract, transform, and
load (ETL) workloads.


Open source is not a new phenomenon, but it hasn’t been
associated with data warehousing until now. Why?

Open source data warehousing has been held back due to a lack of
vendor support and reluctance on the part of practitioners to throw
their lot in with the few, untested open source options that were
available. With the speed of business today, practitioners require
more agile, more powerful approaches to data warehousing and
analytics that are simultaneously cost-effective to scale. Only open
source data warehousing can meet those requirements.
How does open source data warehousing relate to the larger big
data technology stack, much of which is based on open source
technology itself?
Data warehousing is an important part of the big data stack,
and open source data warehousing in particular is a perfect
complement to the other open source technologies in that stack,
such as Hadoop. Using a common open source consumption
model for all the components of your big data platform makes
administration that much easier. Open source data warehousing is
also complementary to other big data technologies from a workload
perspective, providing flexible, high-performance analytics and
reporting capabilities that complement other important workloads such
as streaming and unstructured data analysis.
Do you expect other data warehouse vendors to move their
proprietary products to open source?
Potentially, but the challenge most vendors face is that open source
is a threat to their business models, which are based on expensive,
proprietary appliances that lead to vendor lock-in. Open source
data warehouses are generally deployed on inexpensive commodity
hardware, and they significantly reduce the risk of lock-in because
practitioners can stop paying their vendor at any time yet continue to
use the software indefinitely!
Are there risks to relying on open source data warehousing?

There are in the sense that for most organizations, the reporting and
analytics that data warehousing supports are mission-critical to
the business, so it is important that they select a hardened, battle-
tested, and reliable open source data warehouse. Failure is not an
option.

What are other important criteria to consider when evaluating
an open source data warehousing option?

As I mentioned, data warehouses support mission-critical workloads,
so it is important to select an open source data warehouse with an
active, growing community that is continuously developing the code
base. It is also critical to pick an open source data warehouse that
has at least one but preferably several trusted vendors backing it up
with world-class support.


BIG DATA ANALYTICS: THE IMPORTANCE OF MPP DATA WAREHOUSES

MPP data warehouses are ideal for the dynamic, mixed
workloads of today, and they can store, manage, and process
information at big data volumes.

The data warehouse plays a critical role in storing, managing,
and processing information at big data scale. This might sound
counterintuitive, especially now that Hadoop, Cassandra, MongoDB,
and other NoSQL platforms are marketed as replacements for the
data warehouse. True, one or more SQL query engines exist for all
of these platforms, but a SQL query engine does not a data
warehouse make.

“If all you want to do is take some flat files and execute SQL
across them, that doesn’t actually require a database. It requires
a translator from SQL to execution. In order to design and build a
massively parallel processing [MPP] database, some of the most
difficult problems to solve are maintaining consistency across a
huge database that runs on a distributed cluster, where you have
concurrent access to that data,” notes Ivan Novick, a product
manager with Pivotal Software, Inc., which markets Greenplum, an
open source MPP database.

The number of ACID-compliant (atomic, consistent, isolated,
durable) MPP analytical data warehouses is remarkably small. This
doesn’t mean MPP is a prohibitively expensive proposition, however.
Not anymore. Thanks to the “commodification” of MPP software
and server hardware, and a little assist from the world of open
source software, MPP performance is surprisingly affordable.

The MPP Data Warehouse Reimagined

Of course, NoSQL systems can also claim very strong price
performance. When it comes to query processing, however, NoSQL’s
price-performance benefits disappear. NoSQL query engines cannot
process analytical queries as efficiently, as richly, or as reliably as
MPP databases.

For one thing, none of the extant SQL engines fully adheres to
modern versions of the ANSI SQL standard. (Few, if any, fully
implement the ANSI SQL-92 standard; most implement only portions
of ANSI SQL-1999 and later.) Second, a SQL query engine is only as
good—only as useful, powerful, and valuable—as the underlying
database it’s querying. Hadoop, Cassandra, and MongoDB are not
relational database systems. They lack the guarantees—such as
support for ACID transactions as well as rich metadata management
features—that ensure data is reliably ingested, structured,
cataloged, and, as it were, “true.”

What’s more, Hadoop, MongoDB, and similar NoSQL technologies
are general purpose parallel processing platforms, not analytical
MPP platforms. They likewise can’t efficiently process concurrent
SQL queries from multiple simultaneous users. “Basically, it’s very
difficult to build a data warehouse as opposed to just a SQL engine,”
Novick explains.


“The difference between a data warehouse and a SQL engine is that
the data warehouse is a guarantor of ‘truth.’ The data can be
consistently guaranteed and ‘true.’ At the same time, the warehouse
can support concurrent access across dozens or hundreds of users.
These NoSQL engines compromise on either concurrency, ACID, or
richness of SQL expression.”

Fifteen years ago, only the largest companies could afford MPP
performance. Thanks to a combination of technological innovation
and ongoing commodification, MPP performance is now much more
affordable. Commercial offerings from Microsoft (which markets
its SQL Server Parallel Data Warehouse), Amazon (which markets
Redshift, an MPP data warehouse in the cloud), and Pivotal (which
offers both commercial and free versions of its open source
Greenplum MPP data warehouse) are priced at a fraction of the
cost of traditional MPP databases. MPP hardware has become less
expensive and more scalable, too. MPP databases used to be sold
as software-hardware bundles. This meant that an MPP server
node stuffed with the latest and greatest Intel Pentium Pro or Xeon
processors would cost more—sometimes, much more—than would
equivalent hardware from manufacturers such as Dell, HP, or IBM.

To some extent, this was a necessary evil. An MPP database
distributes data across all of the nodes in a cluster; this can entail
significant data movement. MPP also relies on a technique called
message passing to coordinate communications between and
among nodes. For this reason, MPP database server nodes used to
be outfitted with proprietary interconnects designed specifically for
high data throughput and low latency.

Today, commodity high-throughput, low-latency technologies such
as 10-gigabit Ethernet are available at much lower price points.
Add it all up and the hardware market has changed dramatically,
Novick argues. “Essentially, what’s ideal is to use the sweet spot of
commodity hardware. In today’s world, basically, Intel is commodity,
and generally a server that has two Intel processors, literally two, is
the sweet spot. If you put four Intel processors in a computer, then
it becomes a proprietary design and it becomes very expensive, and
you don’t get the bang for the buck,” he explains.

“Because the data warehouse technologies today are scale out,
you can combine 10, 50, 100, even 500 or more servers in a single
cluster and build a data warehouse.”

Moving to MPP isn’t an overly complex or costly process, Novick
maintains. You’d move an existing data warehouse over to an MPP
data warehouse much like you’d migrate or transition from one
conventional data warehouse to another. “The way to do it is to
identify and pull out strategic applications based on business value.
Build a new system based on one, two, or three business-critical
applications and add to that over time,” he says. “If you’re running
a Netezza data warehouse, you could probably migrate in one shot,
but if you have a Teradata system that’s supporting 200 different
use cases, you’d start by migrating data and applications to the new
system piecemeal.”

Designing a schema for an MPP database system isn’t overly
complicated, either. Novick recommends a vertical partitioning
scheme with a clearly defined fact table based on a star or
snowflake schema.

What’s vertical partitioning? He offers an example. “In addition
to dividing data into pieces per machine—[a technique called]
horizontal sharding, which is what all scale-out systems do—
vertical partitioning divides it by, say, time. If I have 500 days of
data, I could have a different partition [distributed across all of the
nodes in a cluster] for each day,” Novick says.

“Now, imagine you’re an analyst in a financial services company
and someone says to you, ‘Give me the count of trades done on
this one specific day’; then all 100 machines will operate in parallel
and they’ll all loop through the data. However, because the data is
partitioned separately [and] you have partitioned it by time, [this
query] will only process one day (1/500th) of the data [it would scan]
if you had not partitioned it.”
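
In Greenplum-flavored SQL, the two techniques Novick describes map
to the DISTRIBUTED BY clause (horizontal sharding across nodes) and
the PARTITION BY clause (slicing by time). A minimal sketch, using a
hypothetical trades table; the names are illustrative, not from the
interview:

    -- Hypothetical fact table: hash-distributed across all segments
    -- (horizontal sharding), then range-partitioned by day so a
    -- single-day query touches only a sliver of the data.
    CREATE TABLE trades (
        trade_id bigint,
        symbol   varchar(10),
        trade_dt date,
        price    numeric(18,4),
        shares   int
    )
    DISTRIBUTED BY (trade_id)        -- spread rows evenly across nodes
    PARTITION BY RANGE (trade_dt)    -- one partition per trading day
    (
        START (date '2015-01-01') INCLUSIVE
        END   (date '2016-05-01') EXCLUSIVE
        EVERY (INTERVAL '1 day')
    );

    -- Partition pruning: every node scans in parallel, but only the
    -- one partition for that day, not the whole table.
    SELECT count(*) FROM trades WHERE trade_dt = date '2016-03-15';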


He also touts a technique that Pivotal and a few other vendors call
dual ETL. It’s an alternative to techniques such as data replication
or changed data capture (CDC), wherein data is replicated from
a “live” master system to a “standby” backup system. Dual ETL
describes a scheme in which two distinct clusters both act as
“live” or “hot” systems. Both systems are fed by the same ETL
processes and enforce the same data validation, data consistency,
and data quality rules and mechanisms. Both are likewise available
to users for querying and advanced analytics. The clusters can
also be geographically distributed for disaster recovery or business
continuity purposes, Novick says.

“The customer might say, ‘I want to use a replicated solution,’ which
means that as data is modified on one warehouse, it’s automatically
updated on the second one. That is expensive and troublesome on
a big data system. Just the pure transfer rate of the data, plus the
syncing of the data, especially at big data scale, would be expensive
and vulnerable to failure,” he explains.

Thanks to the commodification of MPP server and software kits,
dual ETL is cost-effective on its own terms, Novick argues. “You
essentially build two clusters and all of the data warehouse inputs
are done on each cluster. For example, let’s say you wanted to
create a data warehouse for a retail store chain. You would have
two clusters and you would feed, say, retail store purchase data
locally into both systems. What that enables you to do is to have
two systems that are both live. This allows you to distribute your
workloads between both systems. It allows you to do upgrades to
one cluster and not to the other. It allows you to have downtime
on a cluster if you want to do hardware changes, and so on.” This
approach permits DBAs to more efficiently boost concurrency, too.
A dual ETL topology can permit an organization to achieve double
the rate of concurrency, supporting thousands or potentially tens of
thousands of simultaneous users.

Dual ETL isn’t everything, of course. MPP systems such as
Greenplum use workload management facilities to manage
concurrency, too. “The key to this, which has been proven across
multiple vendors, is to have a good workload management system
where you can define and enforce dynamic rules. This is … a rule-
based workload management system where you can set different
thresholds and conditions and based on that allow different queries
of different priorities to run at different times,” Novick argues,
noting that not all queries (or all user classes) are equal—and that
some of the users or groups who initiate queries are more important
(or more trustworthy) than others. “If you know users of a certain
group are problematic, it’s important to have the ability to terminate
their overly expensive queries.”
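
In Greenplum, this style of rule-based workload management is
expressed through resource queues. A minimal sketch of the pattern
Novick describes, with hypothetical queue and role names (catalog
column names follow the Greenplum 4.x-era documentation):

    -- Cap concurrency and priority for ad hoc analysts.
    CREATE RESOURCE QUEUE adhoc_queue WITH
        (ACTIVE_STATEMENTS=5,   -- at most 5 queries run at once
         PRIORITY=LOW);         -- yield CPU to higher-priority queues

    CREATE RESOURCE QUEUE reporting_queue WITH
        (ACTIVE_STATEMENTS=20,
         PRIORITY=HIGH);

    -- Bind user classes to queues.
    ALTER ROLE adhoc_analyst RESOURCE QUEUE adhoc_queue;
    ALTER ROLE reporting_app RESOURCE QUEUE reporting_queue;

    -- Terminate a problematic user's long-running, expensive query.
    SELECT pg_terminate_backend(procpid)
    FROM pg_stat_activity
    WHERE usename = 'adhoc_analyst'
      AND query_start < now() - interval '1 hour';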
In addition, off-loading infrequently accessed data to non-MPP
storage can simplify data archiving as well as boost performance.
Frequently accessed data is still available in an online, MPP context,
which means cluster resources can be allocated to the workloads
that need them most. Data that is less frequently accessed can be
saved on an external system and queried using external tables,
seamlessly, from the same SQL interface as internal data, albeit
with a performance penalty. This is ideal for meeting the
requirements of regulatory agencies that mandate online access to
older data.
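
In Greenplum, the external-table mechanism is ordinary-looking DDL.
A sketch with a hypothetical archive of aged-out trade data served
by the gpfdist file server:

    -- Cold data exported as delimited files and served by gpfdist.
    CREATE EXTERNAL TABLE trades_archive (
        trade_id bigint,
        symbol   varchar(10),
        trade_dt date,
        price    numeric(18,4),
        shares   int
    )
    LOCATION ('gpfdist://archive-host:8081/trades/*.txt')
    FORMAT 'TEXT' (DELIMITER '|');

    -- Hot and cold data share one SQL interface; the external scan
    -- is slower, but the older data stays online for auditors.
    SELECT symbol, count(*)
    FROM trades_archive
    WHERE trade_dt < date '2015-01-01'
    GROUP BY symbol;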
Column Storage, Cloud Storage, and
Data Warehouse Futures

Conventional relational databases store data in rows, which means
that even if a query only needs to pull data from a single column, the
database still has to scan the contents of each and every column.
This increases I/O as well as seek time and latency. By contrast, a
columnar design helps minimize I/O contention, as well as drastically
reduces disk seek times. For these and other reasons, including
superior compressibility, a columnar architecture is generally
advantageous for analytical workloads. Columnar isn’t a panacea,
however; there are analytical workloads (e.g., queries on very wide
tables) for which a row store is superior.

Columnar versus row store isn’t an either/or proposition, Novick
argues; some database systems, including Pivotal Greenplum,
support both.

“We have a hybrid approach where you define the storage format
at the time of table creation, and that can include both row versus
column, as well as compression. You can do it not only at the table
level but also at the partition level. The point is that you have that
flexibility,” he says.
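
Greenplum exposes that flexibility through storage options declared
at table (or partition) creation time. A sketch, reusing the
hypothetical trades table from the earlier example:

    -- Column-oriented, compressed storage as the table default...
    CREATE TABLE trades_hybrid (
        trade_id bigint,
        symbol   varchar(10),
        trade_dt date,
        price    numeric(18,4)
    )
    WITH (appendonly=true, orientation=column, compresstype=zlib)
    DISTRIBUTED BY (trade_id)
    PARTITION BY RANGE (trade_dt)
    (
        -- ...older partitions columnar and compressed for scans,
        PARTITION cold START (date '2015-01-01') INCLUSIVE
            WITH (appendonly=true, orientation=column,
                  compresstype=zlib),
        -- while the recent, hot partition stays row-oriented for
        -- wide lookups.
        PARTITION hot START (date '2016-04-01') INCLUSIVE
            END (date '2016-07-01') EXCLUSIVE
            WITH (appendonly=true, orientation=row)
    );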


Cloud, too, isn’t an either/or proposition. For many (if not most)
customers, it will likely be both/and.

“I think the first question about data warehousing in the cloud is
a simple one: Where is the source of your data? If your source is in
the cloud, then it makes sense to have the data warehouse be there
because the data itself is already there. If you have to migrate a
huge amount of data from an on-premises location to the cloud,
that’s another matter,” he points out.

Again, for most customers, both on-premises and cloud deployments
will likely make sense. There’s no shortage of cloud data warehouse
services, with offerings from Amazon and Microsoft, in addition
to specialty providers, but organizations must safeguard against
vendor lock-in. The promise of the cloud is openness and portability,
but cloud platform-as-a-service (PaaS) offerings aren’t always as
portable or open as they seem.

“If you’re going to locate your data warehouse in the cloud, don’t use
a single vendor’s cloud platform. Use the cloud for infrastructure-as-
a-service (IaaS). Instead of consolidating your whole data warehouse
on a single vendor’s cloud stack, go with something like Amazon
Web Services or Microsoft Azure for the IaaS—the servers, the
network, and storage capacity—but use data warehouse software
that is portable,” he argues.

Novick’s isn’t a disinterested opinion: Pivotal’s Greenplum database
can run in both traditional on-premises environments and in cloud
IaaS environments. (Pivotal has its own highly successful cloud
PaaS offering, too—Cloud Foundry—Novick points out.) Running in
cloud IaaS makes it possible to more easily shift data warehousing
workloads between on-premises and cloud environments.

“Rather than being locked into some specific provider’s system, you
can easily redeploy your data warehouse to another cloud provider
or to on-premises hardware. You can more easily take advantage of
external cloud storage, too. Right now, cloud providers are offering
very cheap storage in the form of such things as Amazon S3. Cheap
cloud storage can also be leveraged for data archiving, as, for
example, when you off-load infrequently accessed or cold data.”

Conclusion
In the era of big data, MPP database systems are ideally suited
for many if not most analytical workloads. They’re able to support
high concurrency rates and, in some cases, new kinds of advanced
analytics beyond SQL. For example, some MPP platforms can parallelize
and run different types of algorithms in the context of the database
engine. Take the Apache MADlib (incubating) machine-learning
library, which runs in the context of the Greenplum database engine.
This permits it to benefit from Greenplum’s MPP processing.
This is just one example, says Novick. “The MPP data warehouses of
today are utilizing a cluster of servers to store and process data. You
can run a machine-learning algorithm that leverages the CPUs of all
of the servers in that cluster to do the analysis in parallel.”
Hadoop and other NoSQL platforms have positive, distinctive roles
to play in the big data architectures of today and tomorrow. NoSQL
platforms are well suited for storing and managing multistructured
data (e.g., text files, multimedia content, and binary objects), as
well as for storing relational data at truly massive scale. The MPP
data warehouse, however, is ideal for dynamic, mixed workloads,
as well as for storing, managing, and processing information at
big data volumes.
“There’re really only about seven products in the world that can do
that—support big data volumes and big data analytics in a data
warehouse. This is why we fully embrace the term data warehouse.
We’re targeting people to use our system who are running serious
businesses,” Novick says.


MPP DATA WAREHOUSES: THE OPEN SOURCE PERSPECTIVE

Enterprises looking for alternatives to proprietary data
warehouses may find big benefits in open source solutions.

Open source software has transformed software development,
delivery, and licensing. It hasn’t just changed how businesses use
software but what they expect from the software they use.

Open source has also transformed the price-performance calculus
organizations use to evaluate and make decisions about their
strategic IT investments. In an era of open source innovation, it’s
becoming increasingly difficult for organizations to justify the cost
and the inflexibility of proprietary platforms. This is even true of
highly specialized product segments, such as machine learning,
data mining, statistical analysis, and, yes, the massively parallel
processing (MPP) data warehouse.

“Many customers are looking for alternatives to proprietary data
warehouses, especially in the open source space, and the main
reason for this is that not only are they paying premium prices,
but there’s a vendor lock-in coming that companies cannot justify
any longer,” says Cesar Rojas, product marketing director for
the Greenplum open source data warehouse at Pivotal Software,
Inc. “Proprietary platforms used to make sense because of their
uniqueness in the market, but they’re not making sense anymore.
Customers feel trapped in that environment and this is made even
worse by the sky-high total cost. In addition to their software, many
vendors push their proprietary appliances in a big way, and with their
appliances come expensive consulting and implementation services.”

Pivotal’s Greenplum database is an open source MPP data
warehouse. Greenplum itself is based on the open source PostgreSQL
database, which has a rich open source pedigree. Greenplum wasn’t
always an open source offering, though; it wasn’t until October 2015
that it became available under the Apache License Version 2. Rojas
says that Pivotal made this decision for Greenplum—and all Pivotal
data products—“because it was the right thing to do for Pivotal’s
customers.”

“A large number of customers want to start new data warehouse
projects or migrate away from proprietary technologies to
Greenplum’s open source platform because it frees the organization
from any vendor lock-in,” he explains.

“This is definitely something customers are looking for because they
want to be able to run a variety of use cases including reporting,
advanced analytics, and data science in a massively scalable and
open source environment. In some ways, we’re able to classify
ourselves in a very unique way because no one else is doing what
we’re currently doing.”

Open Source MPP Data Warehouse

Open source software has lowered the proverbial bar with respect to
cost of entry, cost of maintenance, and total cost of ownership (TCO)
in many once-specialized markets.


Take the open source GNU-Linux operating system, which turns 25
this year. A quarter of a century ago, the dominant UNIX operating
systems were proprietary and ran on costly RISC hardware. GNU-
Linux isn’t technically UNIX, but it’s UNIX-like—and its market
share now trounces that of its proprietary UNIX rivals.

Here’s another example: the open source R statistical programming
environment. Disciplines don’t get much more specialized than
statistics and data mining, which were dominated by proprietary
vendors such as SAS Institute Inc. and the former SPSS Inc. (now
IBM SPSS) for decades. R hasn’t just mounted a challenge to
the dominance of SAS and SPSS; it’s arguably already won. Most
graduates of college business, engineering, social scientific, and,
of course, statistical programs learned their craft on R, not on
proprietary platforms.

There’s no shortage of open source database offerings. PostgreSQL
and MySQL are just two of the more prominent open source database
platforms. Non-MPP platforms use a technology called symmetric
multiprocessing, or SMP, to scale up (or scale vertically). A MySQL or
standard SQL Server database is designed to run on a single server,
or node, and to scale across all of the processors or cores on that
node. Ideally, an SMP database would scale linearly; in practice, this
is never the case because as you add more cores, the ability of the
database to use those cores diminishes.
An MPP database can scale up (within a single SMP node) across all
of the available cores in a server node. However, an MPP database
also scales horizontally in the sense that it’s distributed across
multiple SMP nodes in a cluster. When an MPP database processes
a query, each of the nodes in the cluster independently processes a
piece of the query—so instead of, say, 24 cores, an MPP database
can muster 192 cores, 384 cores, 768 cores, and more.

There are several commercial MPP data warehouse platforms, Rojas
says, but Greenplum is the only open source MPP database. There
isn’t another credible open source alternative. In a sense, he
maintains, Greenplum’s own pedigree demonstrates the difficulty of
developing an open source MPP database technology from scratch.
Unlike Linux and R, Greenplum started out as a commercial, best-of-
breed database. Its designers forked PostgreSQL and spent a decade
enhancing Greenplum as a proprietary product.
Unlike its non-MPP open source database alternatives, Greenplum
supports both row-based and columnar storage. “Greenplum is fully
compliant with SQL. We provide both columnar and row orientation;
we call it polymorphic storage,” Rojas explains. “We obviously
are an MPP database, but we have also developed as part of this
technology the first query optimizer specifically for big data. It’s a
very new open source development (project name: GPORCA) that is
modular and independent of the Greenplum engine,” he continues.

What does it mean to “develop” a query optimizer for big data?
“When Greenplum utilizes GPORCA to optimize queries, it considers
many, many more alternatives than other query optimizers. It
optimizes a much wider range of queries and it uses memory
extensively to do it,” Rojas says.
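
In Greenplum builds of this era, GPORCA can be toggled per session
through a configuration parameter, which makes it easy to compare
plans. A sketch; the parameter name follows the Greenplum 4.3-era
documentation, and the query reuses the hypothetical trades table:

    -- Plan the same query with GPORCA and with the legacy planner.
    SET optimizer = on;     -- use the GPORCA optimizer
    EXPLAIN SELECT symbol, avg(price) FROM trades GROUP BY symbol;

    SET optimizer = off;    -- fall back to the legacy planner
    EXPLAIN SELECT symbol, avg(price) FROM trades GROUP BY symbol;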


Elsewhere, Pivotal plans to offer a cloud infrastructure-as-a-service
(IaaS) deployment option for Greenplum. Once again, even though
there’s no shortage of cloud databases, it’s possible to count the
number of cloud MPP data warehouses on a single hand—and to
have fingers left over.

“We’re running right now on Amazon Web Services, but at the same
time we are in the process of working with our Pivotal Cloud Foundry
[service] to be able to deploy in any potential client environment that
is available to the customer,” he says. “This year, we have a very
extensive pipeline of cloud innovations. One of the things that is
coming in the near term is the ability to [write to] external [database]
tables running on Amazon S3. Anything we do in the cloud is going
to help us move faster to a managed services type of environment,
which makes our technology more elastic.”

Machine Learning to the Max

Last fall, Pivotal donated its MADlib machine learning framework
to the Apache Software Foundation, or ASF. “Apache MADlib
(incubating)” describes a collection of more than 30 machine
learning algorithms. It’s one of several machine learning, predictive
analytics, data mining, and statistical algorithms or libraries that
can run in the context of the Greenplum database engine itself,
says Rojas.

“The MADlib library runs in this massively parallel environment. In
addition to that, we also run other kinds of in-database analytics,”
he continues, citing PostGIS, a spatial database extension for the
PostgreSQL database, which also runs in-database in Greenplum.
“We also provide in-database programming, so anything that is
called ‘PL/’: PL/R, which enables R to run in-database, PL/Perl,
PL/Python. Those aren’t just running in-database, they’re running in
this MPP environment.”

In other words, it’s possible to parallelize machine learning,
predictive analytics, data mining, and other advanced analytics
workloads across a Greenplum MPP cluster. Because these
workloads are running on distributed nodes, using the discrete
processing, storage, and network resources of those nodes, they
can execute much faster—sometimes several orders of magnitude
faster—than on a single-system SMP database. (Not to mention
that not all single-system SMP databases can run machine learning
algorithms or other types of analytical workloads in-database.)

With machine learning and other advanced analytics practices,
iteration—the ability to rapidly build and test prototypes, or
hypotheses—is key. The idea is to fail faster. By rapidly iterating
through what doesn’t work, you more quickly arrive at what does.
MPP permits extremely rapid iteration, Rojas says.

“Let’s say you’re playing with R, you have your R model, you’re
working in your own little environment [on a test-dev system]. When
you want to execute [your prototype] on a massive scale, you take
that R model and you execute it in this MPP infrastructure. You’re
able to run everything in parallel and you get the results in parallel.
As an analyst, you’re able to iterate much more quickly.”
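
The pattern Rojas sketches maps to PL/R: the analyst’s R code is
wrapped in a SQL function, and Greenplum invokes it in parallel on
the segments. A minimal, hypothetical example, assuming the PL/R
extension is installed:

    -- Wrap an R computation in a SQL function; the body is plain R.
    CREATE OR REPLACE FUNCTION r_median(vals float8[])
    RETURNS float8 AS
    $$
        return(median(vals))
    $$ LANGUAGE 'plr';

    -- Applied per group, the R code runs where the data lives.
    SELECT symbol, r_median(array_agg(price)) AS median_price
    FROM trades
    GROUP BY symbol;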
philosophy. We’ve designed all of the components [of the Greenplum
MADlib permits an MPP database to iterate even faster, Rojas database] in a way that is very modular. We want to work with the
maintains. Just as important, he points out, it exposes a SQL community and innovate with the community.”
interface, so analysts who aren’t well versed in Java or Python
can write SQL code to exploit MADlib algorithms. “MADlib provides
you with MPP implementations of mathematical, statistical, and
machine learning [algorithms] for both structured and unstructured
data. MADlib also has full SQL execution on top of it, [along with]
embedded functions that are also run as SQL,” he explains. “For
those who are not familiar with Java development, MADlib definitely
democratizes the access to analytics by giving the SQL-fluent
analyst access to very complete algorithms that would be out of
their reach if they needed to start coding Java.”
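
That SQL interface amounts to ordinary function calls. A hedged
sketch of MADlib-style logistic regression, with hypothetical table
and column names and the madlib.logregr_train signature as
documented for MADlib 1.x:

    -- Train a logistic regression model in-database; MADlib
    -- parallelizes the work across every Greenplum segment.
    SELECT madlib.logregr_train(
        'customer_history',         -- source table (hypothetical)
        'churn_model',              -- output table for coefficients
        'churned',                  -- dependent variable (boolean)
        'ARRAY[1, tenure, spend]'   -- independent variables, intercept first
    );

    -- Score new rows with the fitted coefficients, still in SQL.
    SELECT c.customer_id,
           madlib.logregr_predict(m.coef, ARRAY[1, c.tenure, c.spend])
    FROM customer_scoring c, churn_model m;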

Communal Innovation
Rojas says Pivotal isn’t just paying lip service to the importance of
community.
In addition to MADlib, which became an ASF incubator project late
last year, it also donated other valuable proprietary IP: namely,
HAWQ—a port of Greenplum to run natively in Hadoop, with full
SQL support, RDBMS-like transactional consistency guarantees, and
MPP-database-like parallelization, courtesy of Hadoop.
This is further proof that the proprietary data warehouse is an
endangered species, Rojas argues. This isn’t to equate the term
proprietary with intellectual property (IP), as if killer IP no longer
mattered. IP was, is, and ever shall be a critical differentiator.
Instead, it’s to make a distinction between closed source and
open source IP.

“The collaboration with the open source community has been
incredible. Every single day, there’s either a GitHub pull request or
a comment. Every day, either our engineers or the community at
large is answering questions,” he concludes. “We’re seeing more and
more collaboration not only with the core database engine but also
at the tool level. This is very exciting for the engineers. They were
working on their own, by themselves, and now everybody wants to
collaborate with them. There’s also a lot of collaboration with the
PostgreSQL community. They’re pretty much embracing the fact
that we went open source. There’s also integration with the main
PostgreSQL project. From that point of view, we want to be able to
be 100 percent integrated with the latest PostgreSQL release.”

In the end, Rojas concludes, community is the backbone of open
source software. “We believe all of this [IP] is going to be utilized by
the larger community, not only us. This is core to our development
philosophy. We’ve designed all of the components [of the Greenplum
database] in a way that is very modular. We want to work with the
community and innovate with the community.”


pivotal.io

Pivotal’s Cloud Native platform drives software innovation for many
of the world’s most admired brands. With millions of developers in
communities around the world, Pivotal technology touches billions
of users every day. After shaping the software development culture
of Silicon Valley’s most valuable companies for over a decade, today
Pivotal leads a global technology movement transforming how the
world builds software.

• Pivotal Greenplum: The Open Source Massively Parallel
  Data Warehouse

• Greenplum Database: The World’s First Open Source Massively
  Parallel Data Warehouse

tdwi.org

TDWI is your source for in-depth education and research on all
things data. For 20 years, TDWI has been helping data professionals
get smarter so the companies they work for can innovate and grow
faster.

TDWI provides individuals and teams with a comprehensive portfolio
of business and technical education and research to acquire
the knowledge and skills they need, when and where they need
them. The in-depth, best-practices-based information TDWI offers
can be quickly applied to develop world-class talent across your
organization’s business and IT functions to enhance analytical, data-
driven decision making and performance.

TDWI advances the art and science of realizing business value
from data by providing an objective forum where industry experts,
solution providers, and practitioners can explore and enhance data
competencies, practices, and technologies.

TDWI offers five major conferences, topical seminars, onsite
education, a worldwide membership program, business intelligence
certification, live webinars, resourceful publications, industry news,
an in-depth research program, and a comprehensive website:
tdwi.org.

© 2016 by TDWI, a division of 1105 Media, Inc. All rights reserved.


Reproductions in whole or in part are prohibited except by written permission. Email requests or feedback to info@tdwi.org.

Product and company names mentioned herein may be trademarks and/or registered trademarks of their respective companies.

