Hamilton On Designing and Deploying Internet-Scale Services
non-consumers. Enterprises are used to having significant influence over their software providers and to having complete control over when they deploy new versions (typically slowly). This drives up the cost of their operations and the cost of supporting them since so many versions of the software need to be supported.

The most economic services don’t give customers control over the version they run, and only host one version. Holding this single-version software line requires (1) care in not producing substantial user experience changes release-to-release and (2) a willingness to allow customers that need this level of control to either host internally or switch to an application service provider willing to provide this people-intensive multi-version support.
• Multi-tenancy. Multi-tenancy is the hosting of all companies or end users of a service in the same service without physical isolation, whereas single tenancy is the segregation of groups of users in an isolated cluster. The argument for multi-tenancy is nearly identical to the argument for single-version support and is based upon providing fundamentally lower cost of service built upon automation and large scale.

In review, the basic design tenets and considerations we have laid out above are:
• design for failure,
• implement redundancy and fault recovery,
• depend upon a commodity hardware slice,
• support single-version software, and
• implement multi-tenancy.

We are constraining the service design and operations model to maximize our ability to automate and to reduce the overall costs of the service. We draw a clear distinction between these goals and those of application service providers or IT outsourcers. Those businesses tend to be more people intensive and more willing to run complex, customer-specific configurations.

More specific best practices for designing operations-friendly services are:
• Quick service health check. This is the services version of a build verification test. It’s a sniff test that can be run quickly on a developer’s system to ensure that the service isn’t broken in any substantive way. Not all edge cases are tested, but if the quick health check passes, the code can be checked in.
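As a rough illustration of the quick health check idea, the sketch below runs a handful of named checks under a time budget and reports which ones failed. The function name `run_quick_health_check`, the `budget_seconds` parameter, and the shape of the checks are assumptions made for this example, not something the paper prescribes.

```python
import time
from typing import Callable, Dict, List, Tuple


def run_quick_health_check(
    checks: Dict[str, Callable[[], bool]],
    budget_seconds: float = 60.0,  # hypothetical budget; "quick" is service-specific
) -> Tuple[bool, List[str]]:
    """Run each named check once and collect failures.

    A check passes if it returns True. Returning False or raising an
    exception counts as a failure. Checks that would start after the
    time budget is spent are reported as skipped.
    """
    failures: List[str] = []
    start = time.monotonic()
    for name, check in checks.items():
        if time.monotonic() - start > budget_seconds:
            failures.append(f"{name}: skipped, time budget exceeded")
            continue
        try:
            passed = check()
        except Exception as exc:  # a sniff test should report, not crash
            failures.append(f"{name}: raised {exc!r}")
            continue
        if not passed:
            failures.append(f"{name}: failed")
    return (not failures, failures)
```

A developer might wire this to a few cheap probes against a single-server deployment (for example, "front end answers", "one read and one write succeed") and gate check-in on the first element of the returned tuple.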
• Develop in the full environment. Developers should be unit testing their components, but should also be testing the full service with their component changes. Achieving this goal efficiently requires single-server deployment (section 2.4), and the preceding best practice, a quick service health check.
• Zero trust of underlying components. Assume that underlying components will fail and ensure that components will be able to recover and continue to provide service. The recovery technique is service-specific, but common techniques are to continue to operate on cached data in read-only mode or continue to provide service to all but a tiny fraction of the user base during the short time while the service is accessing the redundant copy of the failed component.
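One of the recovery techniques named above, continuing on cached data in read-only mode, can be sketched roughly as follows. The class name `CachedReadOnlyFallback`, the `fetch` and `store` callables, and the choice of `ConnectionError` and `TimeoutError` as the failure signal are all assumptions for illustration; a real service would choose its own failure detection and cache policy.

```python
from typing import Any, Callable, Dict, Tuple, Type


class ReadOnlyModeError(Exception):
    """Raised when a write is refused because the backing component is down."""


class CachedReadOnlyFallback:
    """Wrap an underlying component and keep serving reads from a cache
    when that component fails (hypothetical helper, not from the paper)."""

    def __init__(
        self,
        fetch: Callable[[str], Any],
        store: Callable[[str, Any], None],
        failure_types: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    ) -> None:
        self._fetch = fetch
        self._store = store
        self._failure_types = failure_types
        self._cache: Dict[str, Any] = {}
        self.degraded = False  # True while operating on cached data

    def read(self, key: str) -> Any:
        try:
            value = self._fetch(key)
        except self._failure_types:
            self.degraded = True
            if key in self._cache:
                return self._cache[key]  # possibly stale, but still service
            raise  # nothing cached for this key, surface the failure
        self.degraded = False
        self._cache[key] = value
        return value

    def write(self, key: str, value: Any) -> None:
        try:
            self._store(key, value)
        except self._failure_types as exc:
            self.degraded = True
            raise ReadOnlyModeError(f"write to {key!r} refused") from exc
        self.degraded = False
        self._cache[key] = value
```

The design choice here is that reads degrade gracefully while writes fail loudly, so callers can tell users the service is temporarily read-only rather than silently losing updates.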
• Do not build the same functionality in multiple components. Foreseeing future interactions is hard, and fixes have to be made in multiple parts of the system if code redundancy creeps in. Services grow and evolve quickly. Without care, the code base can deteriorate rapidly.
• One pod or cluster should not affect another pod or cluster. Most services are formed of pods or sub-clusters of systems that work together to provide the service, where each pod is able to operate relatively independently. Each pod should be as close to 100% independent as possible, without inter-pod correlated failures. Global services, even with redundancy, are a central point of failure. Sometimes they cannot be avoided, but try to have everything that a cluster needs inside the cluster.
• Allow (rare) emergency human intervention. The common scenario for this is the movement of user data due to a catastrophic event or other emergency. Design the system to never need human interaction, but understand that rare events will occur where combined failures or unanticipated failures require human interaction. These events will happen and operator error under these circumstances is a common source of catastrophic data loss. An operations engineer working under pressure at 2 a.m. will make mistakes. Design the system to first not require operations intervention under most circumstances, but work with operations to come up with recovery plans if they need to intervene. Rather than documenting these as multi-step, error-prone procedures, write them as scripts and test them in production to ensure they work. What isn’t tested in production won’t work, so periodically the operations team should conduct a “fire drill” using these tools. If the service-availability risk of a drill is excessively high, then insufficient investment has been made in the design, development, and testing of the tools.
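To make the "scripts instead of multi-step procedures" advice concrete, here is a minimal sketch of a recovery runner. It assumes a plan can be expressed as an ordered list of steps; the names `Step` and `run_recovery` and the dry-run flag are hypothetical choices for this example rather than anything specified in the paper.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    description: str             # what the operator would otherwise read in a runbook
    action: Callable[[], None]   # the code that actually performs the step


def run_recovery(steps: List[Step], dry_run: bool = True) -> List[str]:
    """Execute recovery steps in order and return a log of what happened.

    With dry_run=True nothing is executed; the plan is only listed, which
    lets an engineer review it before committing. On a real run the script
    stops at the first failing step so later steps never act on a
    half-recovered system.
    """
    log: List[str] = []
    for number, step in enumerate(steps, start=1):
        if dry_run:
            log.append(f"[dry-run] step {number}: {step.description}")
            continue
        try:
            step.action()
        except Exception as exc:
            log.append(f"step {number} FAILED: {step.description} ({exc!r})")
            break
        log.append(f"step {number} ok: {step.description}")
    return log
```

Defaulting to a dry run is a deliberate choice in this sketch: an engineer under pressure has to opt in to changes explicitly, and the same script can be exercised during a periodic fire drill.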
• Keep things simple and robust. Complicated algorithms and component interactions multiply the difficulty of debugging, deploying, etc. Simple and nearly stupid is almost always better in a high-scale service; the number of interacting failure modes is already daunting before complex optimizations are delivered. Our general rule is that optimizations that bring an order of magnitude improvement are worth considering, but
21st Large Installation System Administration Conference (LISA ’07)