What’s Trending

Big Data in the Public Cloud

Eli Collins
Cloudera
eli@cloudera.com

IEEE Cloud Computing, July 2014. 2325-6095/14/$31.00 © 2014 IEEE

In this article, we’ll look at some of the developments in big data public cloud offerings. Before we arrive at the present, however, let’s look at how big data on the public cloud got started. The early innovators in big data infrastructure (Google, Microsoft, Yahoo, and Facebook) were of course large public cloud companies, but ran on their own private infrastructures. Although the public cloud companies have been developing big data infrastructure since their inception, only more recently have big data workloads been running in the public cloud.

Data Processing in the Cloud

Data processing was the first big data workload to run on the public cloud. Amazon launched the core parts of its Amazon Web Services (AWS), Elastic Compute Cloud (EC2) and Simple Storage Service (S3), in 2006. The Apache Hadoop project added support for running Hadoop on EC2 and S3 that same year. In his article, “Self-Service, Prorated Supercomputing Fun!” Derek Gottfrid described how the New York Times used Hadoop on Amazon AWS in 2007 to create PDFs from its archives [1]; in 2008, NYT used Hadoop to process archived images as well.

In 2008, Amazon launched its Elastic MapReduce (EMR) service for large-scale data processing, arguably the first big data service offered on the public cloud. In 2013, the company reported that 5.5 million clusters had been launched since 2010 [2].

Other large public cloud companies soon followed suit. Google launched BigQuery, a Web service for querying massive datasets, in 2010. In 2012, the company launched Compute Engine, an infrastructure-as-a-service (IaaS) offering that lets users run existing big data infrastructure on Google’s virtual machines. That year, Qubole also launched its Hadoop-based big data service, Qubole Data Service. The following year, Microsoft launched both Azure IaaS and HDInsight, a cloud-based Hadoop service. Google recently announced Cloud Dataflow, a software development kit (SDK) and managed service for big and fast parallel data analysis pipelines.

Cloud-Based Data Stores

Scale-out, schemaless data stores were the next big data systems to run on the public cloud. Just as the Google File System (GFS) [3] and MapReduce [4] papers led to the development of Apache Hadoop, Google’s Bigtable [5] and Amazon’s Dynamo [6] led to the Apache HBase and Apache Cassandra projects, respectively.

As with data processing, users have run these systems on cloud infrastructure (IaaS), and the public cloud companies have launched datastore services. In 2012, Amazon added Apache HBase to its Elastic MapReduce (EMR), and launched its own managed NoSQL database, DynamoDB (which, despite the name, isn’t based on Dynamo). The next year, Google released Cloud Datastore, a managed solution for storing nonrelational data based on its High Replication Datastore (HRD), which appeared as part of Google App Engine in 2008. Microsoft recently announced Azure DocumentDB, a managed, highly scalable document database.

Document Search

Large-scale document search is one of the most recent big data public cloud services. This is no surprise given that much of the early big data infrastructure was motivated by the development of Web search from early public cloud companies. Cloud
users commonly ran distributed search systems built on Apache Lucene, such as Apache Solr and ElasticSearch, on public cloud IaaS offerings.

Amazon announced its native search service in 2012. The CloudSearch Web service offers Amazon’s existing search technology to developers via a managed service that supports document search. Microsoft soon followed with the Bing Search API on the Azure Data Marketplace, based on its existing search technology.

Moving Beyond the Big Three

Big data infrastructure in the public cloud isn’t limited to the big three public cloud companies (Amazon, Google, and Microsoft). IBM, HP, and Oracle, the traditional big three independent software vendors (ISVs)/original equipment manufacturers (OEMs), have been working on big data cloud offerings that will run on their public clouds. Existing telecom and managed service providers, such as AT&T, CenturyLink Savvis, IBM SoftLayer, Rackspace, T-Systems, and Verizon Terremark, have all partnered with big data infrastructure providers to offer managed big data solutions.

These have typically been based on Apache Hadoop or other popular open source projects. For example, in 2013, ObjectRocket, now part of Rackspace, launched a cloud datastore service based on MongoDB, another popular open source, scale-out document database. Open source software has allowed users to access the same big data software infrastructure across many deployment options.

Native Cloud Services

Although it’s increasingly common to run big data infrastructure on cloud IaaS, cloud native platform infrastructure services (for example, for data processing, query, and search) continue to flourish, delivering services for higher-level activities.

An early example of this was Google’s Prediction API service, launched in 2010. Google Prediction API implements supervised learning: the user submits labeled training data, and the service trains a model and then serves queries against it. Google Prediction API can be used for everything from document classification to building recommendation systems.

Microsoft recently launched a preview of Azure Machine Learning (ML), a public cloud service for predictive analytics. Like Google Prediction API, it lets users build, test, and deploy models.

Startups have also built cloud services targeted at analysts and data scientists. Databricks recently launched Databricks Cloud, a service based on Apache Spark that’s designed to facilitate data scientists’ common tasks. Big data infrastructure is also increasingly being exposed to users via public cloud versions of popular analytics and business intelligence (BI) tools. Recent examples are MicroStrategy Cloud, SAS Cloud Analytics, Tableau Online, and Microsoft Power BI.

Although big data and the public cloud have overlapping histories and have been growing steadily, they’ve only recently started to intersect in the enterprise. Internet companies such as Google, Facebook, and Twitter were early adopters, and frequently the progenitors, of big data infrastructure, but they ran their big data infrastructure on their own “bare metal” servers in their own datacenters.

Early enterprise adopters of big data wanted to park this new infrastructure near their existing systems (Web/application servers, operational databases, data warehouses, and so on) that were generating the data that would be ingested into their big data platforms. Aside from the performance benefits of colocating their big data infrastructure in the same datacenter, it spared them from having to manage multiple environments and the data security and governance challenges that continue to make running big data infrastructures in the public cloud difficult. For highly regulated industries, the public cloud might not even be an option. Large users have been able to reap many of the cloud’s benefits by operating their big data infrastructure as a private service for their internal customers.

However, things are changing. As the cloud matures, the systems storing or generating data might move outside the enterprise firewall, and big data infrastructure will likely follow. Users won’t only move to a model in which their software is deployed on cloud infrastructure (on behalf
of the user, that is, IaaS), they’re also consuming the software-as-a-service (SaaS) offering. Although in some cases, the distinction between these two models might be subtle, limited to packaging and pricing, in others it will represent a fundamental change in technical architecture, or vertically integrating products all the way up to the user. Either way, they’re surely impacting where people run their big data workloads.

References

1. D. Gottfrid, “Self-Service, Prorated Supercomputing Fun!” New York Times, 1 Nov. 2007, http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/comment-page-3.
2. W. Vogels, “Navigating the Cloud,” opening keynote, AWS Summit 2013, Singapore, www.slideshare.net/AmazonWebServices/opening-keynote-24487829.
3. S. Ghemawat, H. Gobioff, and S.-T. Leung, “The Google File System,” Proc. 19th ACM Symp. Operating Systems Principles (SOSP 03), 2003, pp. 29–43; http://research.google.com/archive/gfs.html.
4. J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Proc. 6th Symp. Operating System Design and Implementation (OSDI 04), 2004; http://research.google.com/archive/mapreduce.html.
5. F. Chang et al., “Bigtable: A Distributed Storage System for Structured Data,” Proc. 7th Symp. Operating System Design and Implementation (OSDI 06), 2006; http://research.google.com/archive/bigtable.html.
6. G. DeCandia et al., “Dynamo: Amazon’s Highly Available Key-value Store,” Proc. 21st ACM Symp. Operating Systems Principles (SOSP 07), 2007; http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf.

Eli Collins is Cloudera’s chief technologist. His research interests include cloud computing and data management. Collins has an MS in computer science from the University of Wisconsin–Madison.
