
Cloud computing performance

A Bitcurrent study on the performance of cloud computing platforms, June 2010

Sponsored by Neustar Webmetrics

Executive Summary

Cloud computing is a significant shift in the way companies build and run IT resources. It promises pay-as-you-go economics and elastic capacity. Every major change in IT forces IT professionals to rebalance their application strategy: just look at client-server computing, the web, or mobile devices. Today, cloud computing is prompting a similar reconsideration of IT strategy. But it's still early days for clouds. Many enterprises are skeptical of on-demand computing, because it forces them to relinquish control over the underlying networks and architectures on which their applications run.

In late 2009, performance monitoring firm Webmetrics approached us to write a study on cloud performance. We decided to assess several cloud platforms, across several dimensions, using Webmetrics' testing services to collect performance data. Over the course of several months, we created test agents for five leading cloud providers that would measure network, CPU, and I/O constraints. We also analyzed five companies' sites running on each of the five clouds. As you might imagine, this resulted in a considerable amount of data, which we then processed and browsed for informative patterns that would help us understand the performance and capacity of these platforms. This report is the result of that effort.

Testing across several platforms is by its very nature imprecise. Different clouds require different programming techniques, so no two test agents were alike. Some clouds use large-scale data storage that's optimized for quick retrieval; others rely on traditional databases. As a result, the data in this report should serve only as a guideline for further testing: your mileage will vary greatly.

Acknowledgements

First and foremost, this research would not have been possible without Webmetrics/Neustar. The company funded the development of the testing agents, allowed us to use their systems for data collection, and underwrote the cost of generating the study. They did so without constraints or editorial input, in the hopes of contributing to ongoing discussions and industry dialogue about cloud performance. This kind of altruism is rare, and we're grateful for their assistance.

This research was a team effort. Eric Packman and Pete Taylor worked hard to develop, deploy, and maintain multiple software agents across several clouds, and to collect and analyze the data, looking for anomalies and addressing monitoring issues. Lenny Rachitsky and Shirin Rejali of Webmetrics supported the idea of independent research that would help the IT community, and funded this work. Sean Power continues to be a great colleague and co-author on all things to do with web monitoring. Ian Rae, Dan Koffler, and the team at Syntenic offered first-hand cloud experience and assisted with analysis and early feedback. Finally, cloud experts, including Shlomo Swidler, Randy Bias, Jeremy Edberg, and many other end users, provided invaluable insight.

Contents

Executive Summary
Acknowledgements
Contents
The state of web performance
    Elements of latency
    A shift towards composed designs
    The reasons performance matters
The state of cloud computing
    What do we mean by clouds?
    The uncertainty of a shared resource
    The problem with computing far away
    Cloud architectures
The big questions
Testing methodology
Test limitations
Test results
    Real website tests: high-level metrics
    Agent tests: high-level metrics
    Performance histograms
    The performance of individual clouds
    How do different clouds handle workloads?
Noteworthy observations
Conclusions
Further research and reading
    Peter Van Eijk
    Cloudstatus
    Cloudharmony
    Alan Williamson and Cloudkick
    The Bitsource
Cloud test agent code
    Simple objects
    CPU test
    I/O test

The state of web performance

This report focuses primarily on the performance of cloud computing. While website performance optimization has come a long way from the early days of static sites, there are still many reasons that applications are slow.

Elements of latency
Web latency comes from four main factors:

- Service discovery involves finding the website. This is usually a Domain Name Service (DNS) lookup in which the hostname in the URL is resolved to an IP address. There may be other delays in this process, for example when a site redirects the client to another site. Clouds change how lookups happen, because they may redirect visitors to different destinations.
- Network latency is the time spent travelling across a network. This is a function of two basic things: the round-trip time between the browser and the web server, and the number of round trips required to load a page. A page with few objects will take far less time to load, even over a slower link, than a page with many components on it.
- Processing latency, or host time, is the work the server has to do when preparing content for the browser. This is the primary focus of our study, since cloud computing changes how the server works. Host latency may come from simply responding to a request; from computationally intensive calculations; from retrieving data from other sources such as a database; or from connecting to back-end systems behind the server itself.
- Client-side latency is the time the browser takes to assemble and present the web page content. While client-side latency is an increasingly important component of web monitoring in modern websites, clients don't care (much) whether the server is a physical machine or a cloud, so we won't concentrate on this kind of delay in this report.
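To make the network factor concrete, here is a back-of-envelope model (our own illustration, not part of the study) of how round-trip time and object count combine. The batch size of six parallel connections is an assumption, roughly matching browsers of the era:

```python
# A rough model of page load time: total time grows with the number of round
# trips, and each round trip costs one RTT. All parameters are illustrative.
import math

def estimated_load_time(rtt, num_objects, parallel_connections=6):
    """Estimate page load time in seconds from round-trip time and object count.

    Assumes one round trip for DNS, one for the TCP handshake, and one round
    trip per batch of objects fetched over `parallel_connections` connections.
    """
    round_trips = 2 + math.ceil(num_objects / parallel_connections)
    return round_trips * rtt

# A lean page over a slow link can beat a heavy page over a fast one:
slow_link_lean_page = estimated_load_time(0.200, num_objects=5)    # ~0.6s
fast_link_heavy_page = estimated_load_time(0.050, num_objects=90)  # ~0.85s
```

The point of the sketch: trimming objects from a page often buys more than a faster link does.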

What's more, in a mashup site that contains embedded components from elsewhere, such as JavaScript for analytics or third-party embedded DIV tags, the browser must retrieve those additional components from other servers.

A shift towards composed designs


Cloud computing is a loaded term. There are two important ways to look at clouds: as a technology, and as a business model. Clouds-as-a-technology rely on several innovations (virtualization, automation, self-service provisioning, chargebacks, service-centric components, and so on) to improve the efficiency of IT while reducing the time it takes to deploy applications. On the other hand, clouds-as-a-business rely on new ways for third-party providers to offer IT on demand. Because of the separation and efficiency that's now possible, service providers can rent a virtual machine for an hour at a time. They're able to

achieve economies of scale through automation, and as a result are cheaper than dedicated infrastructure for many applications. A consequence of these technologies and business models is the emergence of the developer as the central role in IT. Today's developer can spin up machines easily, without long purchase cycles or the burden of administration. She can experiment with several versions of an application for only a few pennies an hour. And she can scale up and down with demand. At the root of this is the idea of a composed design: an application that consists of many sub-applications stitched together. An application might rely on a virtual machine as a computing resource, but use another service to store large objects; another to queue messages for processing; and another to quickly store and retrieve key-value pair data. This is a different way of building applications from earlier, more monolithic approaches.

The reasons performance matters


We've always wanted fast applications. It's only recently, however, that we've known why. Studies from Google, Microsoft, Shopzilla, and Strangeloop Networks all show that faster websites improve every measure of online success. Visitors who have a better online experience search more, buy more things, stay on sites longer, and so on. Search engines like Google factor page load time into their ranking of pages, which in turn affects search result placement, a key factor in driving traffic to a site.

There are four main types of site; each benefits from performance in its own way:

- E-commerce sites, which make their money from transactions, see an increase in conversion rates (the percentage of visitors who buy something) as well as in quantity purchased.
- Collaboration sites, which encourage visitors to create and share content, see visitors who are more likely to return, a clear sign of engagement. They also place higher in search rankings, marking them as an authority on a subject.
- Software-as-a-Service companies have more productive users, and face fewer SLA refunds due to delay, both of which reduce churn and encourage upselling to additional services.
- Media sites, which make their money from advertising and sponsorship, find that visitors stay on the site longer, and view more pages, when their pages load quickly. This means more exposures to pay-per-view advertising and more chances for pay-per-click advertising to work.

Regardless of what kind of site you're running, you should care about performance. If you're considering cloud computing, you need to know how that move will affect your business.

The state of cloud computing

The term cloud computing has only been around a few years, and there's been a lot of confusion about what a cloud is, and isn't. Given the huge "it just works" appeal of clouds, many providers and vendors have clothed themselves in the cloud mantle in the hopes of revitalizing their web-based applications.

What do we mean by clouds?


As a result, cloud definitions are, well, cloudy. Just to be clear, here's what we mean by clouds.

Private and public

Clouds may run on-premise (sometimes known as private clouds) or on a third party's infrastructure (public clouds). If you're running a private cloud, you control the bare metal on which the cloud runs, as well as the network and architecture. While the cloudy nature of a virtualized, on-demand environment might make things harder to diagnose, you're not risking much by adopting this model. We're going to focus on public clouds for this report.

Infrastructure, Platform, and Software

A cloud is all about separation. The dividing line between you and your provider is the most important aspect of a cloud offering. While there's still considerable variety among providers, the industry has settled on three basic offerings:

- An Infrastructure-as-a-Service (IaaS) model offers virtual machines. You copy your machine to the cloud, and that machine can run whatever a normal machine would. You're in charge of the operating system, the application stack, and so on. Examples of this model include Amazon Web Services' EC2, the Rackspace Cloud, and Terremark's cloud.
- A Platform-as-a-Service (PaaS) model offers code execution. You copy your code to the cloud, and it runs. You don't see individual machines (you just worry about the code) but you may have limited options for your environment. In Salesforce's Force.com cloud, you write your code in a programming language called Apex; Google's App Engine allows only Java and Python, and requires that you use Google's storage model, Bigtable, to store and retrieve data.
- A Software-as-a-Service (SaaS) model offers a complete application. You simply add content. Google Apps, Netsuite, Freshbooks, and Hotmail are all examples of SaaS applications. While many people consider this a cloud, it's not a platform on which things are built, so we won't include it in this report.

In other words, for this study we're looking exclusively at public PaaS and IaaS clouds.

The uncertainty of a shared resource


Almost all the concerns around public cloud computing have to do with a shared resource. You may be worried about neighboring cloud tenants with bad security practices. You might fear that other applications will consume all available resources. Or perhaps you think that the cloud provider has chosen an underlying architecture that helps them, but won't make the most of your application. All of these concerns are legitimate. The pact you're making with a public cloud, for better or worse, is that the advantages of elasticity and pay-as-you-go economics outweigh any problems you'll face.

But even within public clouds, there are many factors that will affect how you fare. Take, for example, IaaS versus PaaS models. IaaS applications give you more control, because you can choose the operating system, programming languages, and tools at your disposal. The IaaS clouds we tested could do an order of magnitude more CPU-intensive computation than the PaaS clouds. On the other hand, when many requests happened at once, the IaaS clouds got slow, while the PaaS clouds handled the traffic just fine. This is because IaaS is still infrastructure: you need to spin up additional machines to handle load. In a PaaS environment, there's no upper limit to processing because you're not running on one machine. In fact, PaaS providers like Salesforce and Heroku have to create false ceilings (called governors) to stop one customer's code from consuming the entire system.

Figure 1: Governor limit reached on a Force.com application

The problem with computing far away


Even if a public cloud were identical to a private one, the simple fact that it's farther away from you changes things. As Microsoft's Jim Gray pointed out in 2003, compared to the cost of moving bytes around, every other aspect of computing is virtually free. That means you're going to have to think about how to get content into a cloud. Even if you're getting to the cloud relatively quickly, not all storage systems are the same. In our research, it took nearly three days to insert data into Google's Bigtable storage model. This slow-insert, rapid-retrieval model is a characteristic of

Google's computing system, which allows very fast searches across the entire Internet. Indeed, once our data was in Bigtable, queries were very fast. Humans also expect responsiveness. If you're building an internal application for people in your building, a cloud may introduce delay simply because it's not in the same building as your users. On the other hand, if you're building an application for users throughout the world, then Google's App Engine has over 20 points of presence and may speed things up; similarly, Amazon's CloudFront offering can speed up the delivery of static content to your users.

Cloud architectures
The application you're building will dictate the cloud model you adopt:

- If you're building a simple three-tiered website, clouds are a simple choice, and you can launch a pre-configured LAMP stack in an IaaS environment.
- If your site experiences spiky traffic, and you're willing to edit your code, you might consider a PaaS model to handle scale and pay only for CPU cycles.
- If you're doing data mining across a large data set, you'll want a framework like Hadoop that can process things in parallel.
- If you want a messaging platform, you may want a message queue service.
- If you're trying to broadcast media to many destinations, you'll likely involve a content delivery network.

In the end, you have architectural control over clouds, but it's at a much higher level. Rather than worrying about which processes are running on an individual server, you're selecting which clouds and cloud services to use as you assemble your application.


The big questions


For our research, we wanted to answer a few basic questions about clouds:

- How do different clouds perform for web users?
- What are the differences across clouds for specific functions: network delivery, computation, and back-end I/O?
- How much is one cloud user affected by its neighbors?
- How do IaaS and PaaS clouds vary in performance?
- How do different platforms handle spikes in concurrent requests?
- How variable, or predictable, are specific cloud platforms?

Testing methodology
To answer these questions, we chose five cloud platforms: three IaaS and two PaaS. For each, we created a test application that could exercise the various elements of latency in which we were interested:

- A one-pixel GIF, to test raw response time and caching.
- A 2-MByte object, to test network throughput and congestion.
- A CPU-intensive task (repeatedly calculating a sine function) that would consume processing capacity.
- An I/O-intensive task (searching through a database for a specific record, then clearing the cache) to measure back-end systems and resource contention.
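The real agents were rewritten for each cloud in whatever language the platform required; the following Python sketch is only an illustration of the four workloads, with payload contents and function names of our own invention:

```python
# Illustrative versions of the four test workloads; not the per-cloud agents.
import math

# 1x1 transparent GIF: tests raw response time and caching
ONE_PIXEL_GIF = (
    b"GIF89a\x01\x00\x01\x00\x80\x00\x00\xff\xff\xff\x00\x00\x00"
    b"!\xf9\x04\x01\x00\x00\x00\x00"
    b",\x00\x00\x00\x00\x01\x00\x01\x00\x00\x02\x02D\x01\x00;"
)

# 2-MByte payload: tests network throughput and congestion
LARGE_OBJECT = b"\x00" * (2 * 1024 * 1024)

def cpu_test(operations=1_000_000):
    """Consume processing capacity by repeatedly calculating a sine function."""
    total = 0.0
    for i in range(operations):
        total += math.sin(i)
    return total

def io_test(store, target):
    """Search a backing store for a specific record (a stand-in for the
    database lookup; the real agents also cleared the cache between runs)."""
    return [record for record in store if target in record]
```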

We also chose five companies whose web businesses used each cloud platform, according to the following criteria:

- They had to be reputable, real-world companies rather than experiments or personal sites.
- The sites had to be dynamic, rather than static.
- They had to be coded; that is, not simply parameterized versions of turnkey SaaS applications that had been skinned to match a particular brand.

Despite the popularity of cloud computing, this wasn't as easy as it sounds. Many websites that claim to be running in the cloud are actually customer- or partner-portal front ends for SaaS applications, while others are little more than static marketing websites. For each site, we tested a single object (usually a one-pixel GIF or a favicon.ico image) and a full page load. This meant that each cloud had fourteen tests: five real sites tested two ways each (ten tests), plus the four agent tests.


These tests were run from multiple locations worldwide using Webmetrics' testing service. We generated the analysis of performance from both Webmetrics reports and post-processing of the logfiles generated by the Webmetrics service. The goal of this research isn't to recommend a particular cloud; indeed, our research shows that different platforms work better for different application types. We've named the five cloud platforms on which we tested, but we've hidden the names of the production applications that were running on each cloud.

Test limitations

First and foremost: this is not a scientific, repeatable test. The very nature of clouds is that they're inconsistent, multitenant, and subject to seasonal and daily variances. Consider:

- There are huge differences between the ways data is stored in cloud platforms. Some use databases; others use massively parallel data storage.
- Some clouds are busy, and we're sharing machines with others; on other clouds, we may have the luxury of a machine to ourselves. We don't know.
- The way a particular function (such as a sine calculation) is performed varies from cloud to cloud.
- The physical machine on which a virtual machine is running will vary from test to test.
- Factors beyond the cloud providers' control (Internet congestion, DDoS attacks, problems with DNS) may change the results.


- Much of the delay within an application may be the result of something the developer does; we saw several tested sites change performance, and go down, during the test.
- We purposely switched between sequential and simultaneous test modes during the data collection period in order to understand the impact of contention on resources. This is a reasonable simulation of a website that's facing changing traffic patterns: sometimes visits are spread out; other times they happen all at once.
- PaaS platforms artificially limit the number of instructions we can run in a period of time, which means we had to reduce the number of sine calculations performed. So it's not accurate to say from this data that PaaS is faster than IaaS, because the PaaS platform is actually doing far less work. In fact, when we tried the same number of computations on PaaS and IaaS platforms, the IaaS platform was so much faster that the results weren't easily measurable.

Nevertheless, the results are valuable precisely because they show the variance and uncertainty, and help us understand what to look for across clouds.


Test results

We'll first look at high-level results across platforms and test types, then look at individual test types, individual clouds, and trends among the websites running on each cloud.

Real website tests: high-level metrics


First, let's look at how our five clouds handled the real-world sites running on them. This gives us a rough idea of the health of each platform, but we don't know anything about the code they're running: a bad developer could give a good cloud terrible results. Throughout this report, we'll use histograms to represent the performance measurements we collected. While averages are nice to look at, a histogram like the one below is more meaningful. It shows the number of tests that completed at a particular performance level. For example, in the chart below, we see that roughly 27% of all tests of the five sites running on Salesforce.com took less than a second to complete.
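For reference, a histogram of this kind can be built by bucketing each measurement into one-second bins with an overflow bucket, then converting counts to percentages. This sketch assumes that bucketing scheme; the report's exact bin edges may differ:

```python
# Bucket latency samples into one-second bins with an overflow bucket, and
# report each bin as a percentage of all tests. Bin edges are assumptions.
from collections import Counter

def latency_histogram(samples, overflow=10):
    """Turn a list of latencies (seconds) into {bin label: percent of tests}."""
    buckets = Counter()
    for s in samples:
        label = str(int(s)) if s < overflow else f"{overflow}+"
        buckets[label] += 1
    total = len(samples)
    return {label: 100.0 * count / total for label, count in buckets.items()}

latency_histogram([0.4, 0.9, 1.2, 3.7, 14.0])
# {'0': 40.0, '1': 20.0, '3': 20.0, '10+': 20.0}
```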
Figure: Performance of production websites running on five cloud platforms. Histogram of percent of all tests by delay in seconds, for Terremark, Amazon, Rackspace, Google, and Salesforce.


Agent tests: high-level metrics


For a more controlled analysis of cloud performance, we need to measure known workloads. Here are the results for our four agent tests. The clouds all handled the small image well. PaaS clouds were more efficient at delivering the large object, possibly because of their ability to distribute workload out to caching tiers better than an individual virtual machine can do. Even within PaaS clouds, CPU and I/O varied widely.
Figure: Average latency of the four agents (I/O, CPU, 2MB GIF, and 1-pixel GIF) across five cloud platforms, in seconds.

Here's a data table of average load times across all five clouds for a two-day period in early April.

Table: average load time in seconds for the 1-pixel GIF (light test), CPU test, 2-MByte GIF (heavy test), and I/O test on Salesforce, Google, Rackspace, Amazon, and Terremark.

Averages can be misleading, however, as some clouds experienced far greater latency. Here's the median latency:


Figure: Median latency of the four agents (I/O, CPU, 2MB GIF, and 1-pixel GIF) across the five clouds, in seconds.

And here's the mode of latency:


Figure: Mode of latency of the four agents across the five clouds, in seconds.

Sometimes variance is as important as latency, however, so we looked at the standard deviation across each cloud test:
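The summary statistics used in this section (average, median, mode, and standard deviation) can all be computed from a list of latency samples with Python's standard library; the sample values below are illustrative, not measurements from the study:

```python
# Computing the report's four summary statistics over illustrative latencies.
import statistics

samples = [0.8, 1.1, 0.9, 4.2, 1.0, 0.9]   # illustrative latencies, in seconds

mean   = statistics.mean(samples)     # pulled upward by the 4.2s outlier
median = statistics.median(samples)   # half of the tests were faster than this
mode   = statistics.mode(samples)     # the single most common measurement
stdev  = statistics.stdev(samples)    # the spread, i.e. how unpredictable
```

Note how a single slow outlier moves the mean well above the median, which is why the report leans on histograms rather than averages alone.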


Figure: Standard deviation of latency for each cloud test (I/O, CPU, 2MB GIF, and 1-pixel GIF), in seconds.

Note that IaaS clouds consistently show greater variance than PaaS clouds, and that, because it's measuring network performance, the 2-MByte GIF test varies significantly when tested from many locations on the Internet. In other words, the variance here is largely due to geography: WAN latency varies the most.

Performance histograms
A better way to understand latency is to look at a histogram of performance, which shows how many tests experienced a particular level of latency. The following chart combines all four kinds of test (since network, server responsiveness, computation, and back-end I/O are all components of an application) to show how each cloud compares.


Figure: Performance by cloud, all four agents. Histogram of percent of tests by delay in seconds, for Terremark, Amazon, Rackspace, Google, and Salesforce.

The performance of individual clouds

Now let's look at each cloud provider in detail.


Salesforce's Force.com

This is a PaaS cloud featuring its own programming language (Apex). It's designed primarily for extending the Salesforce.com SaaS CRM, and developers have access to their CRM's data structures, such as contacts and sales funnels.

Here's the performance histogram for the five sites running on Force.com.


Figure: Five sites on the Salesforce cloud. Histogram of percentage of tests by seconds of latency.

Here's how Salesforce handled the various computing tasks we gave it.

And the performance histogram for the four tasks:


Figure: Salesforce test profile. Histogram of test count by delay in seconds for the I/O, CPU, 2MB, and 1-pixel tests.


Google App Engine

Google's cloud platform is a PaaS. Developers can use either Java or Python, and are billed for the computation they use. Their applications must use the Google-specific data store, which allows scalability and fault tolerance. Here's how the five sites running on Google App Engine did during the test:

Here's the performance histogram for the five sites running on App Engine:


Figure: Five sites on Google App Engine. Histogram of percentage of tests by seconds of latency.
This is how Google App Engine handled our four test agents:

And here's the performance histogram:


Figure: Google test profile. Histogram of test count by delay in seconds for the I/O, CPU, 2MB, and 1-pixel tests.


Rackspace Cloud

Rackspace.com is an established managed service provider, offering bare-metal hosting, managed hosting, and IaaS and PaaS clouds. Here's how the five companies whose sites run on Rackspace fared during our test.

Here's the performance histogram of the websites we evaluated. Note that performance is fairly consistent across the test period, but that one site is slow compared to all the others and affects results significantly.


Figure: Five sites on the Rackspace Cloud. Histogram of percentage of tests by seconds of latency.
When it came to our tests running on Rackspace, here's what performance looked like:

Here's the performance histogram for the four test functions on Rackspace.


Figure: Rackspace test profile. Histogram of test count by delay in seconds for the I/O, CPU, 2MB, and 1-pixel tests.

The CPU test suffered from two spikes seen in the test period.


Amazon Web Services

Amazon is the clear market leader, and for many technologists its model is synonymous with cloud computing. Most customers use the EC2 virtual machines and S3 storage, but other services aren't as broadly adopted. Here's what the five sites running on Amazon looked like:

Here's the performance histogram for the five sites running on AWS:


Figure: Five sites on Amazon Web Services. Histogram of percentage of tests by seconds of latency.
As for our four tests, here's how they compared during the test:

Here's the performance histogram for the four tests in the same period:


Figure: Amazon test profile. Histogram of test count by delay in seconds for the I/O, CPU, 2MB, and 1-pixel tests.


Terremark vCloud

Unlike most public clouds, which rely on the open-source Xen hypervisor for virtualization, Terremark's cloud offering is based on VMware's virtualization technology. Here's how five Terremark customers fared during the test period:

Here's the performance histogram for the five sites:

Figure: Five sites on the Terremark Cloud. Histogram of percentage of tests by seconds of latency.


Here's how our four test agents performed:

This is the performance histogram for those tests:

Figure: Terremark test profile. Histogram of test count by delay in seconds for the I/O, CPU, 2MB, and 1-pixel tests.


How do different clouds handle workloads?

Now let's look at the agents we created for each cloud. Here's a high-level view of how well the clouds handled each type of test.

Figure: Performance by test type, all clouds. Histogram of percent of tests by delay in seconds for the I/O, CPU, 2MB GIF, and 1-pixel GIF tests.


1-pixel GIF

Here are the results for the retrieval of a one-pixel GIF from each of the clouds.

Here's the performance histogram for each cloud. This is a measure of how well the cloud caches static content and responds quickly to requests.

Figure: Latency of the 1-pixel GIF test. Histogram of percent of tests by delay in seconds, for Terremark, Amazon, Rackspace, Google, and Salesforce.


2-MByte GIF

The second test evaluates network throughput. Here's what it looked like across a month of testing.

Here's the performance histogram for each cloud running the test.

Figure: Latency of the 2MB GIF test. Histogram of percent of tests by delay in seconds, for Terremark, Amazon, Rackspace, Google, and Salesforce.


CPU test

The third test looks at processing capacity, asking the cloud platform to compute 1,000,000 sine and sum operations.

Note that the measurements for Salesforce.com's cloud must not be compared directly to the others: because of governors on the PaaS platform that limit the number of instructions that can be carried out, this cloud executed only 100,000 operations, one tenth of what the other clouds were processing. Here's the performance histogram for the test:


Latency, CPU test


[Histogram: percent of tests vs. delay in seconds (0 to 10+), one series per cloud: Terremark, Amazon, Rackspace, Google, Salesforce]

Amazon's processing of the CPU tests was slow, but we chose the smallest virtual machine available from their service catalog; bigger machines would have handled this more quickly. We also looked at error rates for the CPU test:

Errors seen, CPU test (Mar 15 - Apr 15)

             Salesforce  Google  Rackspace  Amazon   Terremark
  Uptime     99.96%      99.99%  99.93%     100.00%  100.00%
  Errors     11          3       20         0        0
  Successes  29106       32212   32470      30371    31426


I/O test

In the fourth test, we search a storage system for a specific string. Here's the result of that process over time; this is the most revealing of our tests in terms of interesting data.

Note the two spikes in latency within Rackspace; the massive drop in both Terremark and Amazon, followed by an increase; and the gradual growth in delay within Google's App Engine. The drop, and subsequent increase, are caused by switching the testing model between sequential (where tests happen one after another) and simultaneous (where all tests occur at once). Both the Terremark and Amazon IaaS agents become much slower when tests are simultaneous because they're unable to scale automatically; however, Rackspace, which is also an IaaS model, wasn't affected by the change. We'll cover this in more detail below. Here's the performance histogram:


Latency, I/O test


[Histogram: percent of tests vs. delay in seconds (0 to 10+), one series per cloud: Terremark, Amazon, Rackspace, Google, Salesforce]

The PaaS providers did well, largely because of their shared storage model, which is optimized for large data sets spread across many machines; Rackspace's cloud performed very well but had occasional slow-downs.


Noteworthy observations

As we conducted this research, we saw many anomalies and interesting patterns. Here are a few of them.


Gradual increase in latency on Google App Engine's CPU

Over the course of the test, we saw a significant increase in the CPU latency of Google's App Engine. While Google's service was still faster than other platforms at this specific task, we saw the delay roughly double over this time.


Spikes in Rackspace's CPU performance

Rackspace's CPU tests encountered two significant spikes during the test, when latency jumped from a respectable 2 seconds to three times as much.


A problem with a subset of customers in an IaaS

In this chart, we see several Rackspace Cloud customers experiencing a slow-down at the same time. One becomes completely unavailable, while two others are slow. Our static test agent remains fast throughout the test, suggesting that this is an internal problem affecting several sites simultaneously.


How big a problem is WAN latency?

In this chart, we compare WAN performance for our I/O test. WAN latency is only a fraction of the total delay here. As a whole, there's a roughly two-second difference between the fastest and slowest locations from which we tested the agent.


Spiky I/O performance and outages

In this chart, we see the Rackspace I/O test varying significantly, while one of the Rackspace customers we're watching also goes offline, suggesting that a back-end resource problem affected us, and also affected an I/O-dependent application belonging to a company with which we shared the cloud.


High variance within Salesforce's cloud

This chart shows performance on the Salesforce cloud. Early on, CPU is very slow; then it recovers, and proceeds to get slower again. During that first spike, other sites occupying the PaaS platform also become spiky and slow.


The impact of availability zones

This data was presented by Eran Shir, CTO of advertising service Dapper, at Web2Expo San Francisco. We include it here to emphasize the issues with availability zones and performance. In Amazon's model, customers can choose an availability zone from which their application will be served. This is partly for compliance reasons, so customers know what legislation applies to their data, and partly to provide separation for failover and SLAs. But all zones are not the same. Consider the latency of an application running in Availability Zone US-East-1A:

Compare this to US-East-2A:

In the first zone, performance often spikes to 2 seconds or more; in the second, it's consistently fast. This kind of performance degradation is essential to watch.


WAN latency variance

Here's a good example of performance degradation in one part of the world (the West Coast) that doesn't affect another.


Slow-downs affecting all sites on one day

This chart of February 23 shows many sites running on Salesforce's cloud all slowing down concurrently.


Slow-downs affecting a single site

Degradation doesn't always affect every site on a cloud. Here are several applications running on App Engine, where only one of them has a problem.


When the whole Internet is slow

Some days, many clouds see a problem, which is likely a result of intermittent issues with the Internet as a whole.


Money spent on clouds

For the duration of the tests, here's how much we spent on each platform:

  Cloud                               Amount
  Amazon Web Services (EC2 and EBS)   $127.69
  Google App Engine                   $4.71
  Rackspace.com                       $61.44
  Force.com                           $0.00
  Terremark                           $102.87

Note that we didn't hit any of the rate limits for the Salesforce Force.com cloud, so we didn't pay for the capacity we used as part of the test; but because of Force.com's governors, we were performing only 10% of the CPU operations of the other clouds.


IO contention on Rackspace

During our analysis, we saw wide swings in I/O performance. At one point, 56,356 rsec/s meant only 74.26% utilization of the system; at another, 584 rsec/s meant 100.00% utilization. This is a clear sign that the system is competing with other processes for resources. Here's the iostat data for the two periods:

            Device  rrqm/s  wrqm/s     r/s   w/s    rsec/s  wsec/s  avgrq-sz  avgqu-sz   await  svctm   %util
  Period 1  sda1      0.00    0.00  673.27  0.00  56356.44    0.00     83.71      4.07    6.04   1.10   74.26
  Period 2  sda1      0.00    0.00   19.00  0.00    584.00    0.00     30.74      2.95  147.37  52.63  100.00
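Dividing read throughput by utilization gives a rough estimate of the capacity the device was actually able to deliver in each period; the two estimates differ by roughly two orders of magnitude, which is what contention with other tenants looks like. A back-of-envelope sketch using the iostat figures above:

```python
# Effective read capacity: sectors/s delivered, normalized by how busy
# the device reported itself to be (%util).
period1_rsec, period1_util = 56356.44, 74.26
period2_rsec, period2_util = 584.00, 100.00

cap1 = period1_rsec / (period1_util / 100)  # ~75,900 sectors/s available
cap2 = period2_rsec / (period2_util / 100)  # 584 sectors/s available

# The same "fully utilized" device delivered ~130x less in period 2.
print(round(cap1 / cap2))
```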


Simultaneous versus spread-out tests

When we began the testing, we launched several requests from several cities simultaneously. This meant that the test applications had to deal with three or four users at exactly the same moment, and our CPU- and I/O-intensive tests did not deal with this well in an Infrastructure-as-a-Service model.

Later, we switched testing to spread requests out more evenly (the sequential setting), and the IaaS platforms dealt with the load far better. Consider the following chart, which shows latency of the I/O test on an IaaS cloud before and after the change:
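The effect of simultaneous arrivals on a fixed-size instance can be illustrated with a toy single-worker queue model (the service time and arrival times below are invented for illustration, not measured):

```python
def latencies(arrivals, service=2.0):
    """Completion latency per request for a single FIFO worker that
    needs `service` seconds per request."""
    free_at, out = 0.0, []
    for t in sorted(arrivals):
        start = max(t, free_at)   # wait until the worker is free
        free_at = start + service
        out.append(free_at - t)   # queueing delay + service time
    return out

simultaneous = latencies([0, 0, 0, 0])    # four probes at the same moment
sequential = latencies([0, 5, 10, 15])    # the same probes, spread out

print(simultaneous)  # [2.0, 4.0, 6.0, 8.0]: later probes queue
print(sequential)    # [2.0, 2.0, 2.0, 2.0]: no queueing at all
```

A PaaS that adds capacity per request behaves more like the sequential case regardless of the arrival pattern, which is consistent with what we observed.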


Now consider the same test volume, for the same period, on a PaaS cloud. Remember that PaaS has no notion of a machine; the system scales to handle load automatically.


While there's some variance, it's far less than in the IaaS example above. Here's a graph of three cloud platforms before and after the switch: the two IaaS platforms (Terremark and Amazon) fare far better when the requests are spread out, while the PaaS platform (Google) barely changes.


Here's Rackspace's performance across this period, showing the improvement in I/O latency from spreading the requests out.


Spikes in a single cloud

The following chart shows I/O performance across all cloud providers. We can see that I/O on one of them has spiked significantly, but that other clouds also seem to have sporadic slow-downs.

Here's the same period within Webmetrics' reporting system.


Variance within Google App Engine

This chart shows a three-day period during which App Engine's performance varied significantly. While the small-image test was relatively consistent in its performance, the other three tests spiked, and there appeared to be a correlation between I/O, network, and CPU latency.


Conclusions

First and foremost: there's a lot to watch. Clouds can fail in many unexpected ways; here are some of the lessons we've learned.

Watch your neighbors. We've seen good evidence that several cloud applications slow down at once, so you'll definitely be affected by others using the same cloud as you.

Understand the profile of your cloud. The histograms shown here demonstrate that different clouds are good at different tasks. You'll need to choose the size of your virtual machines, in terms of CPU, memory, and so on, in order to deliver good performance.

You need an agent on the inside. When you plan a monitoring strategy, you need custom code that exercises back-end functions so you can triage problems quickly.

Choose PaaS or IaaS. If you're willing to re-code your application to take advantage of big data systems like Bigtable, you can scale well by choosing a PaaS cloud. On the other hand, if you need individual machines, you'll have to build elasticity into your IaaS configuration.

Big data doesn't come for free. Using large, sharded data stores might seem attractive, but it takes time to put the data in there, which may not suit your application's usage patterns.

Monitor usage and governors. In PaaS, if you exceed your rate limits, your users will get errors.

Troubleshooting gets harder. You need data on the Internet as a whole, your cloud provider as a whole, and your individual application's various tiers in order to properly triage a problem. When you were running on dedicated infrastructure, you didn't have to spend as much time eliminating third-party issues (such as contention for shared bandwidth or I/O blocking).

PaaS means you're in the same basket. We noticed that on a PaaS, when the cloud gets slow, everyone gets slow. With IaaS, there's more separation of the CPU and the server's responsiveness, but you're still contending for shared storage and network bandwidth.

Watch several zones. When you rely on availability zones to distribute the risk of an outage, you'll also need to deploy additional monitoring to compare those zones to one another.


Further research and reading

There's already a good body of research on cloud performance in the public domain. In preparing this study, we consulted many online resources; here are some of the most relevant ones:

Peter Van Eijk


Peter used a variety of approaches to deduce the size and scale of cloud offerings, and to look at changes in their performance over time. One of the most notable aspects of Peter's research was the rate at which cloud offerings are improving. He found that in a single month, Amazon's CloudFront underwent two significant performance improvements:


Figure 2: Peter Van Eijk's research into Amazon Cloudfront performance from NYC

His results were presented at the Computer Measurement Group's annual meeting and are available on Slideshare. [1]

Cloudstatus
Now part of VMware, Hyperic has a variety of agents that collect cloud-specific metrics across several public clouds. [2]

[1] http://www.slideshare.net/pveijk/cloud-encounters-sept-2009-for-cmg-dec-6
[2] http://www.cloudstatus.com


Figure 3: A sample of Hyperic's Cloudstatus dashboard

What's notable here is that each cloud has unique things to measure: on Amazon, for example, it might include SimpleDB or SQS data, while on Google it might include Bigtable lookup times.

Cloudharmony

Over a two-month period, Cloudharmony paid end users to run its testing tool and measure the speed of clouds. [3]

Figure 4: Aggregate results of cloud performance collected by Cloudharmony

[3] http://blog.cloudharmony.com/2010/02/cloud-speed-test-results.html

The primary focus of the testing was end-user throughput, rather than web page latency, but they were able to cover many different platforms. Cloudharmony's client-side monitoring tool can exercise many cloud functions as part of its testing routine.

Figure 5: A one-time test of cloud storage and platforms by Cloudharmony's agent

Alan Williamson and Cloudkick

Alan Williamson published data on oversubscription and contention within an IaaS cloud, [4] referencing data gathered by Cloudkick. [5] The data indicates increasing congestion within Amazon's resources. These issues seem to plague Amazon's US-East availability zone, something we've heard from several sources.

[4] http://alan.blog-city.com/has_amazon_ec2_become_over_subscribed.htm
[5] https://www.cloudkick.com/blog/2010/jan/12/visual-ec2-latency/

Figure 6: Network latency increasing in Amazon's US-East availability zone

While network congestion is only one factor that can lead to poor end-user experience, it's an important one. As clouds grow in popularity, it's important to remember that they're a shared resource, and to ensure that the provider is scaling infrastructure capacity commensurate with demand.

The Bitsource
The Bitsource did an independent comparison of Rackspace and Amazon [6] to measure performance within the clouds themselves.

Figure 7: The Bitsource's comparison of compilation time across cloud virtual machines

This comparison focused more on applying traditional benchmarks, such as compilation or I/O operations, across two systems.

[6] http://www.thebitsource.com/featured-posts/rackspace-cloud-servers-versus-amazon-ec2-performance-analysis/

Cloud test agent code

Here's some additional information on how the test agents were constructed. As noted above, the construction of tests varies widely by platform; while we attempted to be consistent across platforms, the results should by no means be used to pick a particular cloud provider without further testing. Most notably, the Force.com CPU tests consisted of 100,000, rather than 1,000,000, operations.

Simple objects
The one-pixel and 2-MByte tests are simply objects retrieved by URL. The one-pixel retrieval does not include any cookies or other information, so it can be retrieved quickly and is primarily a test of network round-trip time and server responsiveness; in many cases, the cloud moved this object to a cache. The larger image also tests network throughput and congestion.

CPU test
To test processing resources, we conduct 1,000,000 SIN operations and 1,000,000 SUM operations. The code for the CPU load is the same on all clouds, except that in one PaaS environment (Force.com), we reduce this to 50,000 operations in order to comply with the platform's governors; despite this, it's still slower than the IaaS platforms. The code is as follows:
<?
$lcnt = 1000000;
$a = '<h1>CPU Header '.$lcnt.'</h1><br/>';
print($a);
$sinsum = 0;
// Fill an array with sine values...
for ($y = 0; $y < $lcnt; ++$y) {
    $temp = $y;
    $x[$y] = sin($temp);
}
// ...then sum them.
for ($y = 0; $y < $lcnt; ++$y) {
    $sinsum += $x[$y];
}
print($sinsum.' <br/>');
?>

I/O test
In each case, we loaded the available data store (a database in IaaS environments, or the built-in store in a PaaS environment) with 500,000 records. IO.php runs a full table scan of the 500,000 rows, and then flushes disk buffers to ensure the data isn't cached. In IaaS environments, the code for IO.php is:
<?
$chandle = mysql_pconnect("localhost", "root", "barfly")
    or die("ERROR: Connection Failure to Database");
mysql_select_db("MainObj", $chandle) or die("Database not found.");

// Search for a string that appears in only one row, forcing a full table scan.
$query = "select ip_addr, comment, value from vote where comment like '%OIYDGOAPSL%'";
$result = mysql_query($query);
echo "<h1>IO</h1><br />";

// Check the result. This shows the actual query sent to MySQL, and the error.
// Useful for debugging.
if (!$result) {
    $message = 'ERROR: ' . mysql_error() . "<br />";
    $message .= 'Whole query: ' . $query;
    die($message);
}
while ($row = mysql_fetch_assoc($result)) {
    echo $row['ip_addr'];
    echo "<br />";
}
mysql_free_result($result);

// Flush the disk cache for next time, or we're effectively cheating the load.
system("/sbin/fc");
?>

/sbin/fc assumes superuser privileges to flush all disk caches. This is done at the end of the code to prevent other load issues as the OS recovers from having the buffers suddenly all marked dirty. It forces the next web hit to search the entire database not from cache, but through disk I/O, as though it were the first query ever. The code for fc.c is:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main() {
    /* The binary is installed setuid root: become root, then drop the
       page cache, dentries, and inodes. */
    setreuid(0, 0);
    system("echo '3' > /proc/sys/vm/drop_caches");
    return 0;
}

To prove this works, run iostat -x 1 and visit the web page IO.php a few times; you'll see all the I/O load anew each time. Rename IO.php and you'll note that the entire DB fits into RAM, and there's no disk I/O anymore. The code for PaaS I/O tests varies significantly by platform. Both Force.com and Google App Engine add visible and invisible data per row. Force.com adds author, creator, and other metadata, such that 524k rows is about 1.1 GB, whereas it's only 190 MB in a MySQL database. While we can't flush the cache in a PaaS cloud, the fact that we're running a full scan of a gigabyte of data in a shared environment suggests that it's less likely to be cached.


Google's "insert slow to query fast" philosophy meant that data was inserted at a rate of 31,160 300-byte rows per CPU-hour. We quickly burned through the 6.5 free CPU-hours Google offers its users, and in the end it took nearly 3 days to push all of the test data into App Engine's Bigtable storage. Google App Engine does not support searching for a substring in a string field. As such, the usual search for the 6-byte string found only in row 11692 (which was the basis for the I/O test) is replaced by a search for the entire field (all 255 bytes). This is essentially still a full index scan, as this column is not normally indexed, and should generate, give or take, about the same aggregate I/O load as the other tests.
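The insertion numbers above can be sanity-checked with some quick arithmetic, assuming the ~524k-row data set mentioned earlier and that the 6.5 free CPU-hours renew daily (an assumption on our part):

```python
# Rough check of the App Engine bulk-load figures (approximate inputs).
rows = 524000             # test data set, ~524k rows
rate = 31160              # rows inserted per CPU-hour
free_hours_per_day = 6.5  # assumed daily free CPU-hour quota

cpu_hours = rows / rate
days = cpu_hours / free_hours_per_day
print(round(cpu_hours, 1))  # ~16.8 CPU-hours of insertion work
print(round(days, 1))       # ~2.6 days, i.e. nearly 3 days
```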
