On Designing and Deploying Internet-Scale Services

James Hamilton – Windows Live Services Platform
ABSTRACT
The system-to-administrator ratio is commonly used as a rough metric to understand administrative costs in high-scale services. With smaller, less automated services this ratio can be as low as 2:1, whereas on industry leading, highly automated services, we've seen ratios as high as 2,500:1. Within Microsoft services, Autopilot [1] is often cited as the magic behind the success of the Windows Live Search team in achieving high system-to-administrator ratios. While auto-administration is important, the most important factor is actually the service itself. Is the service efficient to automate? Is it what we refer to more generally as operations-friendly? Services that are operations-friendly require little human intervention, and both detect and recover from all but the most obscure failures without administrative intervention. This paper summarizes the best practices accumulated over many years in scaling some of the largest services at MSN and Windows Live.
Introduction
This paper summarizes a set of best practices for designing and developing operations-friendly services. Designing and deploying high-scale services is a rapidly evolving subject area and, consequently, any list of best practices will likely grow and morph over time. Our aim is to help others

1. deliver operations-friendly services quickly, and
2. avoid the early morning phone calls and meetings with unhappy customers that non-operations-friendly services tend to yield.

The work draws on our experiences over the last 20 years in high-scale data-centric software systems and internet-scale services, most recently from leading the Exchange Hosted Services team (at the time, a mid-sized service of roughly 700 servers and just over 2.2M users). We also incorporate the experiences of the Windows Live Search, Windows Live Mail, Exchange Hosted Services, Live Communications Server, Windows Live Address Book Clearing House (ABCH), MSN Spaces, Xbox Live, Rackable Systems Engineering Team, and the Messenger Operations teams, in addition to that of the overall Microsoft Global Foundation Services Operations team. Several of these contributing services have grown to more than a quarter billion users. The paper also draws heavily on the work done at Berkeley on Recovery Oriented Computing [2, 3] and at Stanford on Crash-Only Software [4, 5].

Bill Hoffman [6] contributed many best practices to this paper, but also a set of three simple tenets worth considering up front:

1. Expect failures. A component may crash or be stopped at any time. Dependent components might fail or be stopped at any time. There will be network failures. Disks will run out of space. Handle all failures gracefully.
2. Keep things simple. Complexity breeds problems. Simple things are easier to get right. Avoid unnecessary dependencies. Installation should be simple. Failures on one server should have no impact on the rest of the data center.
3. Automate everything. People make mistakes. People need sleep. People forget things. Automated processes are testable, fixable, and therefore ultimately much more reliable. Automate wherever possible.

These three tenets form a common thread throughout much of the discussion that follows.
Recommendations
This section is organized into ten sub-sections, each covering a different aspect of what is required to design and deploy an operations-friendly service. These sub-sections include overall service design; designing for automation and provisioning; dependency management; release cycle and testing; hardware selection and standardization; operations and capacity planning; auditing, monitoring and alerting; graceful degradation and admission control; customer and press communications plan; and customer self provisioning and self help.
Overall Application Design
We have long believed that 80% of operations issues originate in design and development, so this section on overall service design is the largest and most important. When systems fail, there is a natural tendency to look first to operations since that is where the problem actually took place. Most operations issues, however, either have their genesis in design and development or are best solved there.

Throughout the sections that follow, a consensus emerges that firm separation of development, test, and operations isn't the most effective approach in the services world. The trend we've seen when looking across many services is that low-cost administration correlates highly with how closely the development, test, and operations teams work together.
In addition to the best practices on service design discussed here, the subsequent section, "Designing for Automation, Management and Provisioning," also has substantial influence on service design. Effective automatic management and provisioning are generally achieved only with a constrained service model. This is a repeating theme throughout: simplicity is the key to efficient operations. Rational constraints on hardware selection, service design, and deployment models are a big driver of reduced administrative costs and greater service reliability.

Some of the operations-friendly basics that have the biggest impact on overall service design are:
Design for failure
This is a core concept when developing large services that comprise many cooperating components. Those components will fail, and they will fail frequently. The components don't always cooperate and fail independently either. Once the service has scaled beyond 10,000 servers and 50,000 disks, failures will occur multiple times a day. If a hardware failure requires any immediate administrative action, the service simply won't scale cost-effectively and reliably. The entire service must be capable of surviving failure without human administrative interaction. Failure recovery must be a very simple path and that path must be tested frequently. Armando Fox of Stanford [4, 5] has argued that the best way to test the failure path is never to shut the service down normally. Just hard-fail it. This sounds counter-intuitive, but if the failure paths aren't frequently used, they won't work when needed [7].
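To make the "just hard-fail it" advice concrete, a failure drill can kill a worker process outright rather than stopping it cleanly. The sketch below is illustrative rather than from the paper; the process name and the pgrep-based lookup are assumptions.

```python
import os
import random
import signal
import subprocess

def service_pids(name: str) -> list[int]:
    """Hypothetical helper: find the PIDs of the service's worker processes."""
    out = subprocess.run(["pgrep", "-f", name], capture_output=True, text=True)
    return [int(p) for p in out.stdout.split()]

def hard_fail_one(name: str) -> None:
    """Kill one worker with SIGKILL rather than shutting it down normally,
    so each drill exercises the recovery path instead of the clean-shutdown path."""
    pids = service_pids(name)
    if not pids:
        print(f"no running processes found for {name}")
        return
    victim = random.choice(pids)
    os.kill(victim, signal.SIGKILL)
    print(f"hard-failed pid {victim}; recovery should complete without operator action")

if __name__ == "__main__":
    hard_fail_one("my-service-worker")  # assumed process name, not from the paper
```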
Redundancy and fault recovery
The mainframe model was to buy one very large, very expensive server. Mainframes have redundant power supplies, hot-swappable CPUs, and exotic bus architectures that provide respectable I/O throughput in a single, tightly-coupled system. The obvious problem with these systems is their expense. And, even with all the costly engineering, they still aren't sufficiently reliable. In order to get the fifth 9 of reliability, redundancy is required. Even getting four 9's on a single-system deployment is difficult. This concept is fairly well understood industry-wide, yet it's still common to see services built upon fragile, non-redundant data tiers.

Designing a service such that any system can crash (or be brought down for service) at any time while still meeting the service level agreement (SLA) requires careful engineering. The acid test for full compliance with this design principle is the following: is the operations team willing and able to bring down any server in the service at any time without draining the work load first? If they are, then there is synchronous redundancy (no data loss), failure detection, and automatic take-over. As a design approach, we recommend one commonly used to find and correct potential service security issues: security threat modeling. In security threat modeling, we consider each possible security threat and, for each, implement adequate mitigation. The same approach can be applied to designing for fault resiliency and recovery.

Document all conceivable component failure modes and combinations thereof. For each failure, ensure that the service can continue to operate without unacceptable loss in service quality, or determine that this failure risk is acceptable for this particular service (e.g., loss of an entire data center in a non-geo-redundant service). Very unusual combinations of failures may be determined sufficiently unlikely that ensuring the system can operate through them is uneconomical. Be cautious when making this judgment. We've been surprised at how frequently "unusual" combinations of events take place when running thousands of servers that produce millions of opportunities for component failures each day. Rare combinations can become commonplace.
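One lightweight way to apply the threat-modeling discipline to faults is to keep the failure-mode inventory in a machine-checkable form so nothing is left undecided. A minimal sketch, with invented component names; a real inventory would also enumerate failure combinations.

```python
from dataclasses import dataclass

@dataclass
class FailureMode:
    component: str
    failure: str
    mitigation: str = ""          # how the service keeps operating through this failure
    accepted_risk: bool = False   # explicitly accepted instead of mitigated

# Invented examples; the real inventory should cover every component.
FAILURE_MODES = [
    FailureMode("front-end", "single server crash", "load balancer retries on a peer"),
    FailureMode("user store", "disk failure", "fail over to synchronous replica"),
    FailureMode("data center", "total loss", accepted_risk=True),  # non-geo-redundant service
]

def check_inventory(modes: list[FailureMode]) -> None:
    """Every failure mode must carry a mitigation or an explicit risk acceptance."""
    for m in modes:
        assert m.mitigation or m.accepted_risk, \
            f"undecided failure mode: {m.component} / {m.failure}"

check_inventory(FAILURE_MODES)
```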
Commodity hardware slice
All components of the service should target a commodity hardware slice. For example, storage-light servers will be dual socket, 2- to 4-core systems in the $1,000 to $2,500 range with a boot disk. Storage-heavy servers are similar servers with 16 to 24 disks. The key observations are:

1. large clusters of commodity servers are much less expensive than the small number of large servers they replace,
2. server performance continues to increase much faster than I/O performance, making a small server a more balanced system for a given amount of disk,
3. power consumption scales linearly with servers but cubically with clock frequency, making higher performance servers more expensive to operate, and
4. a small server affects a smaller proportion of the overall service workload when failing over.
Single-version software
Two factors that make some services less expensive to develop and faster to evolve than most packaged products are that the software needs to only target a single internal deployment, and that previous versions don't have to be supported for a decade, as is the case for enterprise-targeted products.

Single-version software is relatively easy to achieve with a consumer service, especially one provided without charge. But it's equally important when selling subscription-based services to non-consumers.
Enterprises are used to having significant influence over their software providers and to having complete control over when they deploy new versions (typically slowly). This drives up the cost of their operations and the cost of supporting them, since so many versions of the software need to be supported.

The most economic services don't give customers control over the version they run, and only host one version. Holding this single-version software line requires

1. care in not producing substantial user experience changes release-to-release, and
2. a willingness to allow customers that need this level of control to either host internally or switch to an application service provider willing to provide this people-intensive multi-version support.
Multi-tenancy
Multi-tenancy is the hosting of all companies or end users of a service in the same service without physical isolation, whereas single tenancy is the segregation of groups of users in an isolated cluster. The argument for multi-tenancy is nearly identical to the argument for single version support and is based upon providing a fundamentally lower cost of service built upon automation and large scale.

In review, the basic design tenets and considerations we have laid out above are:
design for failure,
implement redundancy and fault recovery,
depend upon a commodity hardware slice,
support single-version software, and
implement multi-tenancy.

We are constraining the service design and operations model to maximize our ability to automate and to reduce the overall costs of the service. We draw a clear distinction between these goals and those of application service providers or IT outsourcers. Those businesses tend to be more people intensive and more willing to run complex, customer specific configurations.

More specific best practices for designing operations-friendly services are:
Quick service health check
This is the services version of a build verification test. It's a sniff test that can be run quickly on a developer's system to ensure that the service isn't broken in any substantive way. Not all edge cases are tested, but if the quick health check passes, the code can be checked in.
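A quick health check might look like the following sketch: probe a few endpoints of a local single-server deployment and fail loudly if any of them break. The URLs and the smoketest account are hypothetical; the real list would cover whatever "not broken in any substantive way" means for the service.

```python
import sys
import urllib.request

# Hypothetical endpoints on a local single-server deployment.
CHECKS = [
    "http://localhost:8080/health",
    "http://localhost:8080/login?user=smoketest",
    "http://localhost:8080/mail/inbox?user=smoketest",
]

def quick_health_check() -> int:
    """Return the number of failed probes; zero means the check-in gate is open."""
    failures = 0
    for url in CHECKS:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status != 200:
                    raise RuntimeError(f"unexpected status {resp.status}")
            print(f"ok   {url}")
        except Exception as exc:
            failures += 1
            print(f"FAIL {url}: {exc}")
    return failures

if __name__ == "__main__":
    sys.exit(1 if quick_health_check() else 0)
```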
Develop in the full environment
Developers should be unit testing their components, but should also be testing the full service with their component changes. Achieving this goal efficiently requires single-server deployment (section 2.4), and the preceding best practice, a quick service health check.
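As a sketch of how the two practices combine, a pre-check-in gate might run the component's unit tests, bring up a local single-server deployment, and then run the quick health check against it. Every command below is a placeholder for whatever the service actually uses.

```python
import subprocess
import sys

# Placeholder commands; substitute the service's real build, deploy, and test steps.
STEPS = [
    ["python", "-m", "pytest", "tests/"],        # unit tests for the changed component
    ["./deploy_single_server.sh", "--local"],    # hypothetical single-server deployment
    ["python", "quick_health_check.py"],         # the sniff test from the previous sketch
]

def gate() -> int:
    for cmd in STEPS:
        print("running:", " ".join(cmd))
        if subprocess.run(cmd).returncode != 0:
            print("check-in blocked by:", " ".join(cmd))
            return 1
    print("all checks passed; safe to check in")
    return 0

if __name__ == "__main__":
    sys.exit(gate())
```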
Zero trust of underlying components
Assume that underlying components will fail and ensure that components will be able to recover and continue to provide service. The recovery technique is service-specific, but common techniques are to continue to operate on cached data in read-only mode, or to continue to provide service to all but a tiny fraction of the user base during the short time while the service is accessing the redundant copy of the failed component.
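The read-only-on-cached-data technique can be sketched as below. The fetch function and the profile cache are invented for illustration; a production service would bound staleness, refuse writes while degraded, and surface the degraded state to monitoring.

```python
class ProfileService:
    """Serve user profiles; fall back to cached, read-only data if the store fails."""

    def __init__(self, fetch_from_store):
        self._fetch = fetch_from_store        # underlying component that may fail at any time
        self._cache: dict[str, dict] = {}     # last known-good values
        self.read_only = False                # true while operating in degraded mode

    def get_profile(self, user_id: str) -> dict:
        try:
            profile = self._fetch(user_id)    # may raise on dependency failure
            self._cache[user_id] = profile
            self.read_only = False
            return profile
        except Exception:
            # Dependency is down: keep serving the last known-good copy and
            # flag read-only mode so writes are refused until the store recovers.
            self.read_only = True
            if user_id in self._cache:
                return self._cache[user_id]
            raise  # no cached copy: this request is in the small affected fraction
```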
Do not build the same functionality in multiple components
Foreseeing future interactions is hard, and fixes have to be made in multiple parts of the system if code redundancy creeps in. Services grow and evolve quickly. Without care, the code base can deteriorate rapidly.
One pod or cluster should not affect another pod or cluster
Most services are formed of pods or sub-clusters of systems that work together to provide the service, where each pod is able to operate relatively independently. Each pod should be as close to 100% independent as possible, without inter-pod correlated failures. Global services, even with redundancy, are a central point of failure. Sometimes they cannot be avoided, but try to have everything that a cluster needs inside the cluster.
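Pod independence usually starts with routing: each user maps deterministically to one pod that holds everything needed to serve that user, so a pod failure touches only its own users. A minimal sketch with invented pod addresses:

```python
import hashlib

# Invented pod endpoints; each pod contains the full stack and state for its users.
PODS = [
    "https://pod-0.example.net",
    "https://pod-1.example.net",
    "https://pod-2.example.net",
]

def pod_for_user(user_id: str) -> str:
    """Deterministically map a user to a single pod; no cross-pod dependencies in the request path."""
    digest = hashlib.sha256(user_id.encode()).digest()
    return PODS[int.from_bytes(digest[:4], "big") % len(PODS)]

# A failure in pod-1 affects only the users hashed to it; the other pods keep
# serving their users because nothing global sits in the request path.
print(pod_for_user("alice@example.com"))
```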
Allow (rare) emergency human intervention
The common scenario for this is the movement of user data due to a catastrophic event or other emergency. Design the system to never need human interaction, but understand that rare events will occur where combined failures or unanticipated failures require human interaction. These events will happen, and operator error under these circumstances is a common source of catastrophic data loss. An operations engineer working under pressure at 2 a.m. will make mistakes. Design the system to first not require operations intervention under most circumstances, but work with operations to come up with recovery plans if they need to intervene. Rather than documenting these as multi-step, error-prone procedures, write them as scripts and test them in production to ensure they work. What isn't tested in production won't work, so periodically the operations team should conduct a "fire drill" using these tools. If the service-availability risk of a drill is excessively high, then insufficient investment has been made in the design, development, and testing of the tools.
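Rather than a multi-step runbook, the emergency procedure itself becomes a script that can be dry-run in production during fire drills. The sketch below invents list_user_data, copy_user_data, and verify_user_data helpers (stubbed here) to show the shape: idempotent steps, verification before cut-over, and a dry-run default.

```python
import argparse

def list_user_data(user_id: str, pod: str) -> list[str]:
    """Hypothetical helper, stubbed for illustration: enumerate the user's objects on a pod."""
    return [f"{user_id}/mailbox", f"{user_id}/contacts"]

def copy_user_data(item: str, source: str, target: str) -> None:
    """Hypothetical helper, stubbed: copy one object; must be safe to re-run."""
    print(f"copying {item}: {source} -> {target}")

def verify_user_data(user_id: str, source: str, target: str) -> bool:
    """Hypothetical helper, stubbed: confirm the target copy matches before cut-over."""
    return True

def move_user(user_id: str, source: str, target: str, execute: bool) -> None:
    for item in list_user_data(user_id, source):
        if execute:
            copy_user_data(item, source, target)
        else:
            print(f"dry-run: would copy {item} from {source} to {target}")
    if execute and not verify_user_data(user_id, source, target):
        raise RuntimeError("verification failed; aborting before cut-over")

if __name__ == "__main__":
    p = argparse.ArgumentParser(description="Move one user's data between pods")
    p.add_argument("user_id")
    p.add_argument("source")
    p.add_argument("target")
    p.add_argument("--execute", action="store_true", help="default is a dry run")
    args = p.parse_args()
    move_user(args.user_id, args.source, args.target, args.execute)
```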
Keep things simple and robust
Complicated algorithms and component interactions multiply the difficulty of debugging, deploying, etc. Simple and nearly stupid is almost always better in a high-scale service; the number of interacting failure modes is already daunting before complex optimizations are delivered. Our general rule is that optimizations that bring an order of magnitude improvement are worth considering, but percentage or even small factor gains aren't worth it.