
CLOUD COMPUTING: BIG DATA IS THE FUTURE OF IT
Winter 2009 | Ping Li | ping@accel.com

Cloud computing has been generating considerable hype these days. Every participant in the datacenter and IT ecosystem has been rolling out “cloud” initiatives and strategies, from hardware vendors and ISVs to SaaS providers and Web 2.0 companies; startups and incumbents are equally active. Cloud computing promises to transform IT infrastructure and deliver scalability, flexibility, and efficiency, as well as new services and applications that were previously unthinkable. Despite all of this activity, cloud computing remains as amorphous today as its name suggests. However, one critical trend shines through the cloud: Big Data. Indeed, it is the core driver of cloud computing and will define the future of IT.
1. Source: “Approaching the Zettabyte Era.” Cisco, 16 June 2008. <http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11481374_ns827_Networking_Solutions_White_Paper.html>

BIG DATA – THE PERFECT STORM
Cloud computing has been driven fundamentally by the need to process an exploding quantity of data. Data is no longer measured in gigabytes but in exabytes as we are “Approaching the Zettabyte Era.”¹ Moreover, data types, whether structured, semi-structured, or unstructured, continue to proliferate at an alarming rate as more information is digitized, from family pictures to historical documents to genome mapping to financial transactions to utility metering. The list is truly unbounded. But today, data is not only being generated by users and applications. It is increasingly “machine-generated,” and such data is growing exponentially, leading the charge in the Big Data world. In a recent article, The Economist called this phenomenon the “Data Deluge” (http://www.economist.com/opinion/displaystory.cfm?story_id=15579717).

One can argue that Web 2.0 companies have been pushing the upper bounds of large-scale data processing more than anyone. That said, this data explosion is not sparing any vertical industry; financial services, health care, biotech, advertising, energy, telecom, and others are all grappling with this perfect storm. Below are just a few stats:

• Two years ago, Google was already processing more than 400PB of data per month in just one application
• The New York Times is processing an 11-million-story archive dating back to 1851
• eBay processes more than 50TB per day in its data warehouse
• CERN is processing 2GB per second from its most recent particle accelerator
• Facebook crunches 15TB per day into a 2.5PB data warehouse

Without question, data represents the competitive advantage of any enterprise, and every organization is now encumbered with the task of storing, managing, analyzing, and extracting value from this exponential data growth, as inexpensively as possible.

Previous computing platform transitions involved technology dislocations similar to cloud computing, but along different dimensions. The shift from mainframe to client-server was fueled by disruptive innovation in computing horsepower that enabled distributed microprocessing environments. The following shift to web applications and web services during the last decade was enabled by the open networking of applications and services through the internet buildout. While cloud computing will leverage these prior waves of technology, computing and networking, it will also embrace deep innovations in storage and data management to tackle Big Data.

Along these lines, many of the early uses of cloud computing have been focused less on “computing” and more on “storage.” For example, a significant portion of the initial applications on AWS primarily leveraged just S3, with the applications themselves executing behind the firewall. Popular storage applications, like Jungle Disk and SmugMug, were early AWS customers. This explosion of data has driven enterprises (and consumers, for that matter) to seek cheap, on-demand storage in unlimited quantities, which cloud storage promises to deliver. Until now, massive tape archives in the middle of nowhere (like Iron Mountain) have been the only means to achieve that cheap storage. However, enterprises today need more; they need quick data retrieval for multiple reasons, from compliance to business analytics. It is simply no longer sufficient to have “cold” data; rather, data needs to be online and resilient (and cheap, of course); hence the accelerating shift toward storing every piece of data in memory or on disk (Data Domain smartly rode this trend). The need to balance data availability and usability against cost effectiveness has prompted significant innovation in both on-premise and hosted cloud storage: cloud storage systems (Caringo, EMC Atmos, and ParaScale, to name just a few) and flash-based storage systems (Fusion-io, Nimble Storage, Pliant, etc.) are just some current examples. Furthermore, hierarchical storage management (HSM, which has always sounded great but has been implemented only rarely) will become an important element in storage workflows. Enterprises will require the seamless capability to move data across different tiers of storage, both on-premise and in the cloud, based on policy and data type, in order to optimize retrieval costs. As cloud computing matures, true cloud applications will be (re)written to leverage hierarchical and cloud-like storage tiers and retrieve data dynamically from different storage layers.
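To make the tiering idea concrete, here is a minimal sketch, in Python, of how a policy engine might place data across hot, warm, and archival tiers based on data type and age. The tier names, age thresholds, and sample objects are hypothetical illustrations and do not describe any of the products mentioned above.

    # Illustrative sketch of policy-based storage tiering (hypothetical tiers
    # and thresholds, not taken from any specific HSM or cloud storage product).

    from dataclasses import dataclass

    @dataclass
    class DataObject:
        name: str
        data_type: str   # e.g. "transaction", "log", "backup"
        age_days: int
        size_gb: float

    # Hypothetical tiers, ordered from fastest/most expensive to cheapest/slowest.
    TIERS = ["flash", "disk", "cloud_object_storage", "tape_archive"]

    def choose_tier(obj: DataObject) -> str:
        """Pick a storage tier from policy: data type first, then age."""
        if obj.data_type == "transaction" and obj.age_days < 30:
            return "flash"                    # hot, latency-sensitive data
        if obj.age_days < 90:
            return "disk"                     # warm data, still queried regularly
        if obj.age_days < 365:
            return "cloud_object_storage"     # cool data kept online for analytics/compliance
        return "tape_archive"                 # cold data, cheapest possible retention

    if __name__ == "__main__":
        objects = [
            DataObject("orders-2009-q4", "transaction", age_days=10, size_gb=120.0),
            DataObject("clickstream-2009-06", "log", age_days=200, size_gb=900.0),
            DataObject("email-archive-2005", "backup", age_days=1500, size_gb=4000.0),
        ]
        for obj in objects:
            print(f"{obj.name} -> {choose_tier(obj)}")

In a real HSM or cloud storage gateway, the same decision would also weigh retrieval-latency requirements, compliance rules, and per-tier pricing rather than age alone.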

A NEW CLOUD STACK

In order for cloud computing to become a mainstream approach, core platform capabilities, such as security, availability, access control, systems management, application management, provisioning, etc., will be a prerequisite before IT organizations are able to adopt the cloud completely. Just like prior computing platform transitions (client/server, web services, etc.), a new “cloud” stack (like the mainframe and OSI stacks before it) will likely emerge. Clearly, this stack will exist in a different representation than prior platform layers in order to embrace a cloud environment; location agnosticism and multi-tenancy are two critical elements, among others. Simply replicating the current computing stack and allowing it to reside off-premise will not achieve the scale, capabilities, and economies of cloud computing. Instead, this new cloud framework needs the ability to process data in increasingly greater orders of magnitude, and to do it at a fraction of the cost, by leveraging high-powered commodity servers. In many ways, this cloud stack has been implemented already, albeit in a primitive form, at large-scale internet datacenters.

In addition, as applications become more open and standard, it will be critical for cloud applications to be able to “talk” to each other. Standard “data” APIs will emerge as part of the new cloud stack to allow disparate environments to interoperate and to avoid vendor lock-in; data integration across cloud platforms will be more of an obstacle than application integration, and data migration challenges are perhaps the greatest factor today locking users into a particular cloud platform. Over time, these APIs and layers will harden and become tailored, depending on the use case and workload of particular applications. The adoption of these new frameworks will ultimately make cloud computing “safe” and broaden its penetration into enterprises of all sizes.

WHAT’S BREWING IN A CLOUD?

Despite constant comparisons to grid and utility computing, cloud computing has the potential to address a much broader set of applications and use cases beyond the limited HPC environments served traditionally by grid computing. This breadth of cloud computing is engendered in a new set of underlying technology forces. Virtualization technologies (now accepted as a “standard” tier in datacenters), low-cost/high-bandwidth connectivity, high-performance multi-threaded servers for storage and computing, concurrent/multi-threaded programming models, and open source software stacks are all technology building blocks that can deliver the high performance and scalability of grid/utility computing. These technology drivers enable applications and users to be abstracted cleanly from particular IT infrastructure resources (computing, storage, networking, etc.) while concurrently, and importantly, being served by underlying commodity resources. In essence, cloud computing is abstracting a web services interface for infrastructure IT.

From log files to clickstream data to web indexing, internet datacenters are collecting massive volumes of data that need to be processed cheaply in order to drive monetization value. The challenge of processing terabytes of data daily at Google, Facebook, and Amazon drove them to adopt a new data architecture, one that is essentially Martian to traditional enterprise datacenter architects. Hadoop is an open source data management framework that has become widely deployed for massively parallel computation and distributed file systems in a cloud environment. Hadoop has allowed the largest web properties (Yahoo!, Facebook, LinkedIn, etc.) to store and analyze any data in near real time at a fraction of the cost that traditional data management and data warehouse approaches could even contemplate. Although the framework has its roots in internet datacenters, Hadoop is quickly penetrating broader enterprise use cases; the diverse set of participants at Hadoop World NYC, hosted by Cloudera, clearly points to this trend.

Managing non-transactional data has become even more daunting. Internet datacenters quickly encountered the scaling limitations of SQL databases as the volume of data exploded. No longer are ACID-compliant relational databases assumed to back-end every application; instead, scalable, distributed non-SQL data stores are being developed internally and implemented at scale. BigTable and Cassandra are among the many variants, and this “non-database database” trend has proliferated to the point of having its own conference: NoSQL. Database caching layers (i.e., NorthScale’s memcached) are also being implemented to further drive application performance.
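To ground the Hadoop discussion, here is a toy sketch of the map/reduce pattern that frameworks like Hadoop parallelize across commodity servers, written in plain Python over an in-memory list of log lines. The log format and function names are invented for illustration; a real Hadoop job would read from a distributed file system and run the same two phases, sharded, across many nodes.

    # Toy illustration of the map/reduce model popularized by Hadoop.
    # This runs in a single process; Hadoop's value is running the same logic
    # across a distributed file system and thousands of commodity servers.

    from collections import defaultdict

    log_lines = [
        "GET /index.html 200",
        "GET /products 200",
        "GET /index.html 404",
        "GET /products 200",
    ]

    def map_phase(line):
        """Emit (key, 1) pairs -- here, one pair per requested URL."""
        _, url, _ = line.split()
        yield (url, 1)

    def reduce_phase(key, values):
        """Aggregate all values emitted for a key -- here, a simple count."""
        return key, sum(values)

    # "Shuffle" step: group intermediate pairs by key, as the framework would.
    grouped = defaultdict(list)
    for line in log_lines:
        for key, value in map_phase(line):
            grouped[key].append(value)

    results = [reduce_phase(key, values) for key, values in grouped.items()]
    print(results)   # [('/index.html', 2), ('/products', 2)]

The appeal for the web properties named above is that the mapper and reducer stay this simple even when the input grows from four lines to petabytes; the framework handles partitioning, scheduling, and failure recovery.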
SECURING THE CLOUD

Given this data-intensive nature, any widely adopted cloud computing platform will inevitably account for richer security requirements. The security challenges will be focused less on point network and data-level security; instead, the primary security challenges will stem from “control.” User authentication will become increasingly challenging as applications are federated outside the firewall because of SaaS adoption, and as applications reside in different public and private clouds, managing and reconciling user identities across individual user directories for each SaaS/cloud application will present further security issues. Much like web applications in the 90s created an SSO layer, cloud computing will demand a similar unified “authentication/entitlement layer.”

In addition to federated user authentication, cloud computing will also require “data” authentication and security. This will drive the need to ensure data authentication and policy control for the volumes of data flowing between cloud applications, although high-bandwidth encryption solutions and sophisticated key management will be needed to match massively parallel computational cloud environments. In particular, given the multi-tenancy paradigm of cloud environments, policy granularity will be paramount to ensure security and compliance. Imperva’s database firewall is an example of an increasingly important class of cloud security product.
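As one illustration of the “data authentication and policy control” idea, the sketch below signs each data object with a per-tenant HMAC and consults a coarse policy table before releasing it to another cloud application. The tenant keys, policy entries, and function names are hypothetical and do not describe any particular product mentioned here.

    # Hypothetical sketch: per-tenant data authentication (HMAC) plus a simple
    # policy check before data is handed to another cloud application.

    import hmac
    import hashlib

    # Per-tenant secret keys (in practice these would live in a key management system).
    TENANT_KEYS = {
        "tenant-a": b"secret-key-a",
        "tenant-b": b"secret-key-b",
    }

    # Coarse-grained policy: which destination applications may receive which data classes.
    POLICY = {
        ("crm-app", "customer-records"): True,
        ("analytics-app", "clickstream"): True,
        ("analytics-app", "customer-records"): False,
    }

    def sign(tenant: str, payload: bytes) -> str:
        """Authenticate data at rest or in flight with the tenant's key."""
        return hmac.new(TENANT_KEYS[tenant], payload, hashlib.sha256).hexdigest()

    def release(tenant: str, payload: bytes, signature: str,
                destination_app: str, data_class: str) -> bytes:
        """Verify the signature and the policy before releasing data."""
        if not hmac.compare_digest(sign(tenant, payload), signature):
            raise PermissionError("data failed authentication check")
        if not POLICY.get((destination_app, data_class), False):
            raise PermissionError("policy forbids this data flow")
        return payload

    if __name__ == "__main__":
        blob = b"alice,42,gold-tier"
        sig = sign("tenant-a", blob)
        print(release("tenant-a", blob, sig, "crm-app", "customer-records"))

A production system would add key rotation, encryption of the payload itself, and far finer-grained policy, but the shape of the check, authenticate first, then evaluate policy per tenant and per data class, is the point being made above.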

STILL IN THE EARLY DAYS

Despite the high energy surrounding cloud computing and the successes of early cloud offerings, cloud computing for enterprise services is definitely still in its formative stages. Many different definitions of cloud computing have surfaced. Rather than posit yet another, it is worth noting that several characteristics are resident in any cloud instance: (i) self-provisioning (either by the user, the developer, or IT); (ii) elasticity (on-demand allocation of any computing, storage, and networking resources); (iii) multi-“anything” (multi-user, multi-application, multi-session, etc.); and (iv) portability (applications are abstracted from physical infrastructure and can be migrated easily). These capabilities allow enterprises to shift IT resources from capex to opex, a usage-based model that is particularly appealing under recent economic constraints. Unlike traditional HPC grid environments, which were designed for a specific application in a single company, cloud computing enables disparate applications and entities to harness a shared pool of resources. These cloud prerequisites will yield a powerful set of use cases beyond grid computing that are unique to cloud platforms.

Interestingly, just as in prior computing platform transitions, adoption has started outside the IT mainstream; this time it is consumers who have already adopted cloud computing technologies via their use of Web 2.0 services. One could argue that web companies like Google, Yahoo!, Facebook, and Salesforce have been teaching the typical early technology adopters inside enterprises the effectiveness of cloud computing. These Web 2.0/SaaS offerings clearly exhibit the core cloud characteristics outlined above and in turn are delivering new, value-added services previously considered unthinkable. It is important to note that, unlike typical three-tier “traditional” enterprise datacenters, the internet datacenters of Facebook, Yahoo!, Google, and their peers were not encumbered by legacy enterprise stacks, which in turn enabled them to be built from the ground up with cloud stacks that elastically handle large-scale consumer transactions for multiple applications.

Increasingly, empowering developers and line-of-business owners to innovate and deploy new applications without the shackles of IT will be a motivating driver for cloud adoption. No longer do users need IT’s blessing, or IT’s timeline, to get their jobs done. Many early users of cloud computing are developers launching applications without requiring the involvement of IT (in the case of a Web 2.0 startup, there is no IT department to involve). Within enterprises, there are early signs of developers (QA environments, batch processing, and developer prototyping) and line-of-business or departmental teams leveraging cloud computing. These Web 2.0 start-ups and early enterprise users represent a powerful trend in the role of developers in driving cloud computing adoption.

Unlike traditional hosting providers that cater to IT/operations, such as Rackspace, Amazon Web Services has experienced tremendous success from its developer-centric platform APIs; Amazon went after developers first and has only recently begun to add the functionality that will appeal to broader enterprise IT. This developer-centric nature was a primary motivator of VMware’s strategic acquisition of SpringSource: in addition to inheriting significant Java technology, VMware now has a distinct opportunity to transition SpringSource’s dominant Java developer mindshare onto VMware’s private cloud platform. Alternatively, certain developers will prefer to interface with a cloud provider at a higher level of abstraction, such as Google App Engine, as opposed to a more bare-metal API, such as Amazon Web Services. For example, an application may choose to run on MSFT Azure to leverage SQL/MSFT services, or on Salesforce’s Force.com platform for CRM integration and distribution advantages.
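To illustrate what self-provisioning and elasticity look like from the developer’s side, here is a hedged sketch of a provisioning call against a cloud API. The endpoint, authentication scheme, and request fields are entirely hypothetical stand-ins for whatever a given provider actually exposes; the point is simply that a developer can request capacity with a few lines of code instead of an IT ticket.

    # Hypothetical sketch of developer self-provisioning against a cloud API.
    # The endpoint, auth scheme, and request fields below are invented for
    # illustration; real providers each define their own API.

    import json
    import urllib.request

    API_ENDPOINT = "https://api.example-cloud.test/v1/servers"   # hypothetical
    API_TOKEN = "REPLACE_WITH_REAL_TOKEN"

    def provision_server(cpu_cores: int, memory_gb: int, image: str) -> dict:
        """Request a new server instance on demand (elastic, no IT ticket needed)."""
        payload = json.dumps({
            "cpu_cores": cpu_cores,
            "memory_gb": memory_gb,
            "image": image,
        }).encode("utf-8")
        request = urllib.request.Request(
            API_ENDPOINT,
            data=payload,
            headers={
                "Authorization": f"Bearer {API_TOKEN}",
                "Content-Type": "application/json",
            },
            method="POST",
        )
        with urllib.request.urlopen(request) as response:
            return json.loads(response.read())

    # Example: a developer scales out by asking for two more worker instances.
    # (Commented out because the endpoint above is fictitious.)
    # for _ in range(2):
    #     print(provision_server(cpu_cores=4, memory_gb=8, image="web-worker"))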
Amazon’s internet datacenter, in particular, was easily adapted to make the company the first and leading “public computing” provider. Today, however, the enterprise use of cloud computing sits at opposite ends of a spectrum: (i) Web 2.0 start-ups seeking to launch applications quickly and cheaply, and (ii) compute-intensive enterprises that need batch processing for bursty, large-scale applications. Although these users are driving the early adoption of cloud technology, it is unlikely that these limited use cases alone will establish cloud computing as a pervasive platform. Cloud computing instead will need to penetrate mainstream IT infrastructure gradually and support a broader set of enterprise applications. It will certainly take significant time and effort for enterprise IT infrastructure gatekeepers to evolve their current architectures to embrace a new cloud platform. Luckily, enterprises can reap the technology innovation coming out of internet datacenters (much of it open source) to accelerate this transition, and it is not uncommon for new platform technologies to start at the “fringes” of IT before mainstream adoption takes place.

Cloud computing will reach its full potential when a whole new set of applications, never possible before, is created purpose-built for the cloud. It is likely that these innovative applications will require new programming models and potentially languages yet to be hardened. For example, one can envision powerful collaboration applications that enable internal enterprise users and external users to cooperate seamlessly, something previously impossible with users and data isolated on disparate enterprise islands. Applications can also be “broken up” in the cloud where, as one example, computing resources may reside on the client while the data is accessed portably from multiple cloud locations.

MORE THAN ONE FLAVOR

There have been analogies drawn between cloud computing and public utilities (electric, gas, etc.), where the value is all about economies of scale. According to this hypothesis, the world will end up with only a few cloud providers that reach maximum efficient scale. It is quite unlikely that this will happen. Instead, multiple cloud models will emerge depending on the user, the workload, and the application. Today, one can break cloud platforms into roughly two camps: developer-centric (Amazon, Google, MSFT) and IT-centric (EMC, VMware).

Cloud platforms will remain distinct and diverse as long as they continue to deliver unique value-add for their particular use cases and users, and innovation will abound to solve the specific issues in all of these various cloud environments. Several start-ups (Cirtas, CloudSwitch, and Zetta among them) are building products that make the cloud “safe” for enterprises. To drive this cloud diversity point further, “private clouds” behind the firewall present yet another flavor of cloud computing, as enterprises leverage the benefits of cloud frameworks while maintaining the security, control, and compliance of their internal datacenters. In addition, hybrid clouds that bridge private and public clouds on a permanent or temporary basis (the latter also known as “cloud bursting”) will come to fruition for certain applications or as a migration path for enterprises. Lastly, the concept of a “cloud within a cloud” is also emerging, where distinct services, such as data warehousing, can be built atop a more generic cloud platform to provide a higher-layer cloud service.

LOOKING AHEAD

To further parse all this, I hosted a cloud computing panel with an esteemed group of technology thought leaders at Accel’s 15th Stanford Technology Symposium. The panel brought together technologists who view cloud computing from distinctly different lenses: private cloud innovators, public cloud providers, cloud-enabling technology solutions, and cloud infrastructure applications. Needless to say, these panelists had plenty of deep insights, opinions, and predictions about cloud computing. In wrapping up the panel session, I asked each speaker to conjure up a single prediction for cloud computing in the next few years. Here’s what the experts said:

Jonathan Bryce, CEO/Co-founder, Mosso (Rackspace): “…I think cloud computing is going to be a mindshift. There’s a technology impact, but I actually think it’s going to really make CIOs rethink their jobs, less on the technology, more on the operational side. You can have a server administrator, a network administrator, an application administrator, and they’re all silos… but you need your general practitioner. And that’s really missing right now in the cloud. It’s got to be a generalized IT person, whether that’s the CIO or somebody he or she appoints…”

Jayshree Ullal, President and CEO, Arista Networks: “Well, I would say for the deployment of this, it’s going to take a while…”

Mike Schroepfer, VP Engineering, Facebook: “…one of the things that is going to happen is that people are going to figure out that we need a more blended workload between the cloud and the client. We’ve moved toward most of the computing happening in the cloud, and my browser effectively being its own terminal. You know, in the last 2 or 3 years, the speed and capability of browsers has been outpacing that of most chips; you’re seeing 2x to 4x improvements in core performance on the engines and VMs in those browsers year on year, which is way outpacing the speed of chip design… So I believe that there will be a couple of people who will figure out ways to blend computation and storage on the client more gracefully with that on the server, but still provide you with all of the benefits of basically access to my data anywhere I need, and the kind of reliability of the cloud, all of those kinds of things…”

Raghu Ramakrishnan, Chief Scientist for Audience and Research, Yahoo! Research: “So a lot of the companies that are out there today – Yahoo!, Facebook, Google – they’re all exposing data APIs, and you have the opportunity to fuse that with all the data that’s out there. Imagine what’s going to happen once large clouds are routinely available to build their own applications and you start aggregating your own data. We’ve been operating kind of in the cycle of reincarnation in computer science. So if I had to make a prediction, I predict that in the next 10 years, computer science as computer science isn’t really going to be the place that smart young guys are going to find tremendously rewarding careers. I think that science will be done maybe not even in the lab on the wet bench anymore, but with computer systems looking at vast amounts of data… I think that the application of these new compute systems to large data in the sciences will advance humankind substantially.”

Rich Wolski, Professor of Computer Science, University of California, Santa Barbara, and CTO/Founder, Eucalyptus Systems: “…there’s another revolution coming that’s going to intersect the cloud revolution, and that has to do with data simulation… pretty much everything you own is going to be trying to send you data. And you’re going to need, personally, a great deal of storage and compute capacity to be able to deal with that. I think the cloud is going to make that revolution that much quicker to come to us.”

Mike Olson, Cloudera: “I think that a lot of what’s been said around here about data is really right on. But I think an economy like this is actually a huge opportunity for entrepreneurs… I think this is a time when resources are scarce – that’s when great businesses end up getting built. Someone’s going to figure out the next big thing, by taking 2 + 2 and coming up with 20. And I think part of what’s going to enable some of those businesses is cloud computing, and being able to get started with a lower barrier to entry, a lower price point.”

These predictions depict cloud computing as still being in its formative phases, but they also suggest that it will emerge as a fundamental breakthrough in datacenter and IT infrastructure in the years to come. Despite the current macro headwinds, deep innovation and market opportunities in cloud computing will persist. Once this economic storm passes, I’m convinced the sun will shine through, and cloud computing is sure to have many silver linings.

Ping Li is a partner at Accel Partners in Palo Alto and focuses primarily on Information Technology infrastructure and digital media platforms.