
Holistic Configuration Management at Facebook

Chunqiang Tang, Thawan Kooburat, Pradeep Venkatachalam, Akshay Chander,


Zhe Wen, Aravind Narayanan, Patrick Dowell, and Robert Karl
Facebook Inc.
{tang, thawan, pradvenkat, akshay, wenzhe, aravindn, pdowell, robertkarl}@fb.com

Abstract

Facebook's web site and mobile apps are very dynamic. Every day, they undergo thousands of online configuration changes, and execute trillions of configuration checks to personalize the product features experienced by hundreds of millions of daily active users. For example, configuration changes help manage the rollouts of new product features, perform A/B testing experiments on mobile devices to identify the best echo-canceling parameters for VoIP, rebalance the load across global regions, and deploy the latest machine learning models to improve News Feed ranking. This paper gives a comprehensive description of the use cases, design, implementation, and usage statistics of a suite of tools that manage Facebook's configuration end-to-end, including the frontend products, backend systems, and mobile apps.

1. Introduction

The software development and deployment cycle has accelerated dramatically [13]. A main driving force comes from the Internet services, where frequent software upgrades are not only possible with their service models, but also a necessity for survival in a rapidly changing and competitive environment. Take Facebook, for example. We roll facebook.com onto new code twice a day [29]. The site's various configurations are changed even more frequently, currently thousands of times a day. In 2014, thousands of engineers made live configuration updates to various parts of the site, which is even more than the number of engineers who made changes to Facebook's frontend product code.

Frequent configuration changes executed by a large population of engineers, along with unavoidable human mistakes, lead to configuration errors, which are a major source of site outages [24]. Preventing configuration errors is only one of the many challenges. This paper presents Facebook's holistic configuration management solution. Facebook uses Chef [7] to manage OS settings and software deployment [11], which is not the focus of this paper. Instead, we focus on the homegrown tools for managing applications' dynamic runtime configurations that may be updated live multiple times a day, without application redeployment or restart. Examples include gating product rollouts, managing application-level traffic, and running A/B testing experiments.

Below, we outline the key challenges in configuration management for an Internet service and our solutions.

Configuration sprawl. Facebook internally has a large number of systems, including frontend products, backend services, mobile apps, data stores, etc. They impose different requirements on configuration management. Historically, each system could use its own configuration store and distribution mechanism, which made the site as a whole hard to manage. To curb the configuration sprawl, we use a suite of tools built on top of a uniform foundation to support the diverse use cases. Currently, the tools manage hundreds of thousands of configs (i.e., configuration files) from a central location, and distribute them to hundreds of thousands of servers and more than one billion mobile devices.

Configuration authoring and version control. A large-scale distributed system often has many flexible knobs that can be tuned live. The median size of a config at Facebook is 1KB, with large ones reaching MBs or GBs. Manually editing these configs is error prone. Even a minor mistake could potentially cause a site-wide outage. We take a truly configuration-as-code approach to compile and generate configs from high-level source code. We store the config programs and the generated configs in a version control tool.

Defending against configuration errors. We safeguard configs in multiple ways. First, the configuration compiler automatically runs validators to verify invariants defined for configs. Second, a config change is treated the same as a code change and goes through the same rigorous code review process. Third, a config change that affects the frontend products automatically goes through continuous integration tests in a sandbox. Lastly, the automated canary testing tool rolls out a config change to production in a staged fashion,
monitors the system health, and rolls back automatically in case of problems. A main hurdle we have to overcome is to reliably determine the health of numerous backend systems.

Configuration dependency. Facebook.com is powered by a large number of systems developed by many different teams. Each system has its own config, but there are dependencies among the configs. For example, after the monitoring tool's config is updated to enable a new monitoring feature, the monitoring configs of all other systems might need to be updated accordingly. Our framework expresses configuration dependency as source code dependency, similar to the include statement in a C++ program. The tool automatically extracts dependencies from source code, without the need to manually edit a makefile.

Scalable and reliable configuration distribution. Our tools manage a site that is much larger than the previously reported configuration distribution system [30], and support a much more diverse set of applications, including mobile. The size of a config can be as small as a few bytes or as large as GBs. Given the scale and the geographically distributed locations, failures are the norm. It is a significant challenge to distribute configs to all servers and mobile devices in a timely and reliable manner, and not to make the availability of the configuration management tools become a bottleneck of the applications' availability.

In this paper, we describe our solutions to these challenges. We make the following contributions:

• Runtime configuration management is an important problem for Internet services, but it is not well defined in the literature. We describe the problem space and the real use cases from our experience, in the hope of motivating future research in this area.

• We describe Facebook's configuration management stack, which addresses many challenges not covered by prior work, e.g., gating product rollouts, config authoring, automated canary testing, mobile config, and a hybrid subscription-P2P model for large config distribution. This is the first published solution of holistic configuration management for Internet services.

• We report the statistics and experience of operating a large-scale configuration management stack, which are made available in the literature for the first time. For example, do old configs become dormant, and how often do config changes expose code bugs?

2. Use Cases and Solution Overview

We focus on the problem of managing an Internet service's dynamic runtime configurations that may be updated live multiple times a day, without application redeployment or restart. Configuration management is an overloaded term. We describe several real use cases to make the problem space more concrete. Note that they are just a tiny sample set out of the hundreds of thousands of configs we manage today.

Gating new product features. Facebook releases software early and frequently. It forces us to get early feedback and iterate rapidly. While a new product feature is still under development, we commonly release the new code into production early but in a disabled mode, and then use a tool called Gatekeeper to incrementally enable it online. Gatekeeper can quickly disable the new code if problems arise. It controls which users will experience the new feature, e.g., Facebook employees only or 1% of the users of a mobile device model. The target can be changed live through a config update.

Conducting experiments. Good designs often require A/B tests to guide data-driven decision making. For example, the echo-canceling parameters for VoIP on Facebook Messenger need tuning for different mobile devices because of the hardware variation. Our tools can run live experiments in production to test different parameters through config changes.

Application-level traffic control. Configs help manage the site's traffic in many ways. Automation tools periodically make config changes to shift traffic across regions and perform load tests in production. In case of emergency, a config change kicks off automated cluster/region traffic drain, and another config change disables resource-hungry features of the site. During shadow tests, a config change starts or stops duplicating live traffic to testing servers. During a fire drill, a config change triggers fault injection into a production system to evaluate its resilience.

Topology setup and load balancing. Facebook stores user data in a large-scale distributed data store called TAO [5]. As the hardware setup changes (e.g., a new cluster is brought online), the macro traffic pattern shifts, or failures happen, the application-level configs are updated to drive topology changes for TAO and rebalance the load.

Monitoring, alerts, and remediation. Facebook's monitoring stack is controlled through config changes: 1) what monitoring data to collect, 2) monitoring dashboards (e.g., the layout of the key-metric graphs), 3) alert detection rules (i.e., what is considered an anomaly), 4) alert subscription rules (i.e., who should be paged), and 5) automated remediation actions [27], e.g., rebooting or reimaging a server. All of these can be dynamically changed without a code upgrade, e.g., as troubleshooting requires collecting more monitoring data.

Updating machine learning models. Machine learning models are used to guide search ranking, News Feed ranking, and spam detection. The models are frequently retrained with the latest data and distributed to the servers without a code upgrade. This kind of model is used for many products. Its data size can vary from KBs to GBs.

Controlling an application's internal behavior. This is one of the most common use cases. A production system often has many knobs to control its behavior. For example, a data store's config controls how much memory is reserved for caching, how many writes to batch before writing to the disk, how much data to prefetch on a read, etc.
[Figure 1 diagram omitted. It shows MobileConfig layered on top of Gatekeeper, other A/B testing tools, Sitevars, and PackageVessel, all of which are built on top of Configerator.]

Figure 1: Facebook's configuration management tools. MobileConfig supports mobile apps. All the other tools support applications running in data centers.

2.1 Tool Overview

Figure 1 shows Facebook's configuration management tools. They work together to support the diverse use cases.

Configerator provides all the foundational functions, including version control, authoring, code review, automated canary testing, and config distribution. Other tools are built on top of Configerator and provide specialized functions.

Gatekeeper controls the rollouts of new product features. Moreover, it can also run A/B testing experiments to find the best config parameters. In addition to Gatekeeper, Facebook has other A/B testing tools built on top of Configerator, but we omit them in this paper due to the space limitation.

PackageVessel uses peer-to-peer file transfer to assist the distribution of large configs (e.g., GBs of machine learning models), without sacrificing the consistency guarantee.

Sitevars is a shim layer that provides an easy-to-use configuration API for the frontend PHP products.

MobileConfig manages mobile apps' configs on Android and iOS, and bridges them to the backend systems such as Configerator and Gatekeeper. MobileConfig is not bridged to Sitevars because Sitevars is for PHP only. MobileConfig is not bridged to PackageVessel because currently there is no need to transfer very large configs to mobile devices.

We describe these tools in the following sections.

3. Configerator, Sitevars, and PackageVessel

Among other things, Configerator addresses the challenges in configuration authoring, configuration error prevention, and large-scale configuration distribution.

3.1 Configuration Authoring

Our hypotheses are that 1) most engineers prefer writing code to generate a config (i.e., a configuration file) instead of manually editing the config, and 2) most config programs are easier to maintain than the raw configs themselves. We will use data to validate these hypotheses in Section 6.1.

Following these hypotheses, Configerator literally treats "configuration as code". Figure 2 shows an example. A config's data schema is defined in the platform-independent Thrift [2] language (see "job.thrift"). The engineer writes two Python files, "create_job.cinc" and "cache_job.cconf", to manipulate the Thrift object. A call to "export_if_last()" writes the config as a JSON [20] file. To prevent invalid configs, the engineer writes another Python file, "job.thrift-cvalidator", to express the invariants for the config. The validator is automatically invoked by the Configerator compiler to verify every config of type "Job".

The source code of config programs and the generated JSON configs are stored in a version control tool, e.g., git [14]. In the upper-left side of Figure 3, the engineer works in a "development server" with a local clone of the git repository. She edits the source code and invokes the Configerator compiler to generate JSON configs. At the top of Figure 3, config changes can also be initiated by an engineer via a Web UI, or programmatically by an automation tool invoking the APIs provided by the "Mutator" component.

The example in Figure 2 separates create_job.cinc from cache_job.cconf so that the former can be reused as a common module to create configs for other types of jobs. Hypothetically, three different teams may be involved in writing the config code: the scheduler team, the cache team, and the security team. The scheduler team implements the scheduler software and provides the shared config code, including the config schema job.thrift, the reusable module create_job.cinc, and the validator job.thrift-cvalidator, which ensures that configs provided by other teams do not accidentally break the scheduler. The cache team generates the config for a cache job by simply invoking create_job(name="cache"), while the security team generates the config for a security job by simply invoking create_job(name="security").

Code modularization and reuse are the key reasons why maintaining config code is easier than manually editing JSON configs. Config dependencies are exposed as code dependencies through import_thrift() and import_python(). An example is shown below.

Python file "app_port.cinc"
APP_PORT = 8089

Python file "app.cconf"
import_python("app_port.cinc", "*")
app_cfg = AppConfig(port = APP_PORT ...)
export_if_last(app_cfg)

Python file "firewall.cconf"
import_python("app_port.cinc", "*")
firewall_cfg = FirewallConfig(port = APP_PORT ...)
export_if_last(firewall_cfg)

"app.cconf" instructs an application to listen on a specific port. "firewall.cconf" instructs the OS to allow traffic on that port. Both depend on "app_port.cinc". The "Dependency Service" in Figure 3 automatically extracts dependencies from source code. If APP_PORT in "app_port.cinc" is changed, the Configerator compiler automatically recompiles both "app.cconf" and "firewall.cconf", and updates their JSON configs in one git commit, which ensures consistency. Dependency can be expressed using any Python language construct, not limited to shared constants.
Figure 2: The Configerator compiler generates a JSON configuration file from the Python and Thrift source code.
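Figure 2 itself (the source files and the generated JSON) is not reproduced in this text. The sketch below illustrates what the four files described in Section 3.1 could look like; the Thrift fields, the create_job() parameters, and the validator checks are our own illustrative assumptions, not Facebook's actual schema.

Thrift file "job.thrift"
// Hypothetical schema; the real fields are not shown in the paper.
struct Job {
  1: string name;
  2: i32 num_workers;
  3: i64 memory_limit_mb;
}

Python file "create_job.cinc"
# Reusable module provided by the scheduler team.
import_thrift("job.thrift")
def create_job(name, num_workers=10, memory_limit_mb=1024):
    return Job(name=name, num_workers=num_workers,
               memory_limit_mb=memory_limit_mb)

Python file "cache_job.cconf"
# The cache team creates its config by reusing the shared module.
import_python("create_job.cinc", "*")
export_if_last(create_job(name="cache"))

Python file "job.thrift-cvalidator"
# Invoked automatically by the Configerator compiler for every config of type Job.
def validate(job):
    assert job.num_workers > 0, "a job needs at least one worker"
    assert job.memory_limit_mb <= 65536, "memory limit is unreasonably high"

The compiler executes cache_job.cconf, checks the resulting Job object with job.thrift-cvalidator, and writes it out as a JSON file.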

3.2 Improving Usability through UI and Sitevars

Configerator is designed as a uniform platform to support all use cases. It must be sufficiently flexible and expressive in order to support complex configs. On the other hand, simple configs may not benefit much from the complexity of the Python and Thrift code in Figure 2. The Configerator UI allows an engineer to directly edit the value of a Thrift config object without writing any code. The UI automatically generates the artifacts needed by Configerator.

The Sitevars tool is a shim layer on top of Configerator to support simple configs used by frontend PHP products. It provides configurable name-value pairs. The value is a PHP expression. An engineer uses the Sitevars UI to easily update a sitevar's PHP content without writing any Python/Thrift code. A sitevar can have a checker implemented in PHP to verify the invariants, similar to the validator in Figure 2.

Because PHP is weakly typed, sitevars are more prone to configuration errors, e.g., typos. Engineers are encouraged to define a data schema for a newly created sitevar. A legacy sitevar may predate this best practice. The tool automatically infers its data type from its historical values. For example, it infers whether a sitevar's field is a string. If so, it further infers whether it is a JSON string, a timestamp string, or a general string. If a sitevar update deviates from the inferred data type, the UI displays a warning message to the engineer.
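The type inference can be as simple as testing each historical value against progressively more specific interpretations. The sketch below is our own illustration of the idea in Python; the actual Sitevars checker is written in PHP and its heuristics are not described in the paper.

import json
from datetime import datetime

def infer_string_kind(historical_values):
    # Classify a sitevar string field as "json", "timestamp", or "general".
    def parses_as_json(v):
        try:
            json.loads(v)
            return True
        except ValueError:
            return False
    def parses_as_timestamp(v):
        try:
            datetime.strptime(v, "%Y-%m-%d %H:%M:%S")  # assumed format
            return True
        except ValueError:
            return False
    if all(parses_as_json(v) for v in historical_values):
        return "json"
    if all(parses_as_timestamp(v) for v in historical_values):
        return "timestamp"
    return "general"

A new value that no longer matches the inferred kind (e.g., a malformed JSON string) would trigger the UI warning described above.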
3.3 Preventing Configuration Errors

Configuration errors are a major source of site outages [24]. We take a holistic approach to prevent configuration errors, including 1) config validators to ensure that invariants are not violated, 2) code review for both config programs and generated JSON configs, 3) manual config testing, 4) automated integration tests, and 5) automated canary tests. They complement each other to catch different configuration errors. We follow the flow in Figure 3 to explain them.

To manually test a new config, an engineer runs a command to temporarily deploy the new config to some production servers or testing servers, and verifies that everything works properly. Once satisfied, the engineer submits the source code, the JSON configs, and the testing results to a code review system called Phabricator [26]. If the config is related to the frontend products of facebook.com, the "Sandcastle" tool automatically performs a comprehensive set of synthetic, continuous integration tests of the site under the new config in a sandbox environment. Sandcastle posts the testing results to Phabricator for reviewers to access. Once the reviewers approve, the engineer pushes the config change to the remote "Canary Service".

The canary service automatically tests a new config on a subset of production machines that serve live traffic. It complements manual testing and automated integration tests. Manual testing can execute tests that are hard to automate, but may miss config errors due to oversight or shortcuts taken under time pressure. Continuous integration tests in a sandbox can have broad coverage, but may miss config errors due to the small-scale setup or other environment differences.

A config is associated with a canary spec that describes how to automate testing the config in production. The spec defines multiple testing phases. For example, in phase 1, test on 20 servers; in phase 2, test in a full cluster with thousands of servers. For each phase, it specifies the testing target servers, the healthcheck metrics, and the predicates that decide whether the test passes or fails. For example, the click-through rate (CTR) collected from the servers using the new config should not be more than x% lower than the CTR collected from the servers still using the old config. The canary service talks to the "Proxies" running on the servers under test to temporarily deploy the new config (see
the bottom of Figure 3). If the new config passes all testing phases, the canary service asks the remote "Landing Strip" to commit the change into the master git repository.
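The paper does not show a concrete canary spec. The following is a hypothetical sketch of a two-phase spec of the kind described above, written in the same config-as-code style; the CanarySpec and Phase types, field names, and thresholds are assumptions.

Python file "my_service_canary.cconf"
# Hypothetical canary spec for a config of service "my_service".
canary = CanarySpec(phases = [
    Phase(targets = "20 servers in one cluster",
          healthchecks = ["crash_rate", "error_log_growth", "ctr"],
          # Fail the phase if CTR on canary servers drops more than x% below
          # the CTR of servers still running the old config.
          predicates = ["ctr_canary >= ctr_control * (1 - x/100)"]),
    Phase(targets = "one full cluster (thousands of servers)",
          healthchecks = ["crash_rate", "latency_p99", "backend_load"],
          predicates = ["latency_p99_canary <= 1.05 * latency_p99_control"]),
])
export_if_last(canary)

The second, cluster-wide phase exists to catch load-related problems that are invisible at a 20-server scale; an incident described later in Section 6 motivated adding exactly such a phase.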
[Figure 3 diagram omitted. It depicts the flow described in this section: a Web UI (for Configerator, Gatekeeper, and Sitevars) and miscellaneous automation tools drive config changes through the Mutator; engineers work on development servers with git checkouts, publish diffs to Phabricator for code review, and query canary status and dependencies; Sandcastle posts continuous integration results; the Canary Service temporarily deploys a config for canary testing and the Landing Strip commits it to the master git repository; the Git Tailer feeds the Zeus ensemble (a leader, followers, and multiple observers in each cluster); and on each production server a Proxy serves configs to applications, such as products running on HHVM (the PHP VM) with the Gatekeeper runtime as an HHVM extension.]

Figure 3: Architecture of Configerator. It uses git for version control, Zeus for tree-structured config distribution, Phabricator for code review, Sandcastle for continuous integration tests, Canary Service for automated canary tests, Landing Strip for committing changes, Mutator for providing APIs to support automation tools, and Dependency Service for tracking config dependencies. Applications on a production server interact with the Proxy to access their configs.

3.4 Scalable and Reliable Configuration Distribution

Configerator distributes a config update to hundreds of thousands of servers scattered across multiple continents. In such an environment, failures are the norm. In addition to scalability and reliability, other properties important to Configerator are 1) availability (i.e., an application should continue to run regardless of failures in the configuration management tools); and 2) data consistency (i.e., an application's instances running on different servers should eventually receive all config updates delivered in the same order, although there is no guarantee that they all receive a config update exactly at the same time). In this section, we describe how Configerator achieves these goals through the push model.

In Figure 3, the git repository serves as the ground truth for committed configs. The "Git Tailer" continuously extracts config changes from the git repository, and writes them to Zeus for distribution. Zeus is a forked version of ZooKeeper [18], with many scalability and performance enhancements in order to work at the Facebook scale. It runs a consensus protocol among servers distributed across multiple regions for resilience. If the leader fails, a follower is converted into a new leader.

Zeus uses a three-level, high-fanout distribution tree, leader→observer→proxy, to distribute configs through the push model. The leader has hundreds of observers as children in the tree. A high-fanout tree is feasible because the data-center network has high bandwidth and only small data is sent through the tree. Large data is distributed through a peer-to-peer protocol separately (see Section 3.5). The three-level tree is simple to manage and sufficient for the current scale. More levels can be added in the future as needed.

Each Facebook data center consists of multiple clusters. Each cluster consists of thousands of servers, and has multiple servers designated as Zeus observers. Each observer keeps a fully replicated, read-only copy of the leader's data. Upon receiving a write, the leader commits the write on the followers, and then asynchronously pushes the write to each observer. If an observer fails and then reconnects to the leader, it sends the latest transaction ID it is aware of, and requests the missing writes. The commit log of ZooKeeper's consensus protocol helps guarantee in-order delivery of config changes.

Each server runs a Configerator "Proxy" process, which randomly picks an observer in the same cluster to connect to. If the observer fails, the proxy connects to another observer. Unlike an observer, the proxy does not keep a full replica of the leader's data. It only fetches and caches the configs needed by the applications running on the server.

An application links in the Configerator client library to access its config. On startup, the application requests the proxy to fetch its config. The proxy reads the config from
the observer with a watch so that later the observer will notify the proxy if the config is updated. The proxy stores the config in an on-disk cache for later reuse. If the proxy fails, the application falls back to read from the on-disk cache directly. This design provides high availability. So long as a config exists in the on-disk cache, the application can access it (though outdated), even if all Configerator components fail, including the git repository, Zeus leader/followers, observers, and proxy.
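A minimal sketch of this availability behavior from the application's point of view; the function names and the cache path are our assumptions, not the client library's actual API.

# Hypothetical illustration of the client library's read path.
class ProxyUnavailableError(Exception):
    pass

def read_config(name, fetch_from_proxy, cache_dir="/var/configerator/cache"):
    # fetch_from_proxy: callable that asks the local Proxy for the config; the
    # Proxy in turn reads it from an observer with a watch and persists it on disk.
    try:
        return fetch_from_proxy(name)
    except ProxyUnavailableError:
        # If the Proxy (or anything upstream of it) is down, serve the last
        # value persisted on disk, which may be outdated but keeps the
        # application running.
        with open(cache_dir + "/" + name) as f:
            return f.read()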
Configerator uses the push model. How does it compare with the pull model [19, 30]? The biggest advantage of the pull model is its simplicity of implementation, because the server side can be stateless, without storing any hard state about individual clients, e.g., the set of configs needed by each client (note that different machines may run different applications and hence need different configs). However, the pull model is less efficient for two reasons. First, some polls return no new data and hence are pure overhead. It is hard to determine the optimal poll frequency. Second, since the server side is stateless, the client has to include in each poll the full list of configs needed by the client, which is not scalable as the number of configs grows. In our environment, many servers need tens of thousands of configs to run. We opt for the push model in our environment.

3.5 Distributing Large Configs through PackageVessel

Some configs can be large, e.g., machine learning models for News Feed ranking. As these large configs may change frequently, it is not scalable to deliver them through Zeus' distribution tree, because it would overload the tree's internal nodes that have a high fan-out. Moreover, it is hard to guarantee the quality of service if the distribution paths overlap between large configs and small (but critical) configs.

Our PackageVessel tool solves the problem by separating a large config's metadata from its bulk content. When a large config changes, its bulk content is uploaded to a storage system. It then updates the config's small metadata stored in Configerator, including the version number of the new config and where to fetch the config's bulk content. Configerator guarantees the reliable delivery of the metadata to servers that subscribe to the config. After receiving the metadata update, a server fetches the config's bulk content from the storage system using the BitTorrent [8] protocol. Servers that need the same large config exchange the config's bulk content among themselves in a peer-to-peer (P2P) fashion to avoid overloading the centralized storage system. Our P2P protocol is locality aware so that a server prefers exchanging data with other servers in the same cluster. We recommend using PackageVessel for configs larger than 1MB.

A naive use of P2P cannot guarantee data consistency. Our hybrid subscription-P2P model does not have this limitation. Zeus' subscription model guarantees the consistency of the metadata, which in turn drives the consistency of the bulk content. For example, Facebook's spam-fighting system updates and distributes hundreds of MBs of config data to thousands of global servers many times a day. Our statistics show that PackageVessel consistently and reliably delivers the large configs to the live servers in less than four minutes.
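A sketch of the hybrid subscription-P2P flow on a subscribing server. The metadata field names and helper functions are assumptions; only the overall flow (small metadata pushed through Zeus, bulk content fetched over BitTorrent) follows the description above.

# Hypothetical handler invoked when Configerator delivers new PackageVessel
# metadata for a large config this server subscribes to.
def on_metadata_update(meta, current_version, fetch_bulk, activate):
    # meta: e.g., {"name": "feed_ranking_model", "version": 42,
    #              "bulk_location": "storage://models/feed/42"}
    if meta["version"] <= current_version(meta["name"]):
        return  # stale or duplicate notification; metadata arrives in order
    # Fetch the bulk content via BitTorrent, preferring peers in the same
    # cluster, so the central storage system is not overloaded.
    path = fetch_bulk(meta["bulk_location"], prefer_same_cluster=True)
    activate(meta["name"], meta["version"], path)

Because the metadata travels over Zeus' consistent subscription channel, every subscriber converges on the same version even though the bulk bytes arrive through an unordered P2P swarm.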
3.6 Improving Commit Throughput

Multiple engineers making concurrent config commits into a shared git repository causes contention and slows down the commit process. We explain it through an example. When an engineer tries to push a config diff X to the shared git repository, git checks whether the local clone of the repository is up to date. If not, she has to first bring her local clone up to date, which may take tens of seconds to finish. After the update finishes, she tries to push diff X to the shared repository again, but another diff Y from another engineer might have just been checked in. Even if diff X and diff Y change different files, git considers the engineer's local repository clone outdated, and again requires an update.

The "Landing Strip" in Figure 3 alleviates the problem by 1) receiving diffs from committers, 2) serializing them in first-come-first-served order, and 3) pushing them to the shared git repository on behalf of the committers, without requiring the committers to bring their local repository clones up to date. If there is a true conflict between a diff being pushed and some previously committed diffs, the shared git repository rejects the diff, and the error is relayed back to the committer. Only then does the committer have to update her local repository clone and resolve the conflict.
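A minimal sketch of the landing strip's serialization loop; this is our illustration of the idea, not the actual implementation.

class ConflictError(Exception):
    pass

def landing_strip(diffs, push_to_shared_repo, notify):
    # diffs: a FIFO queue (e.g., queue.Queue) of (committer, diff) pairs,
    # drained in first-come-first-served order.
    # push_to_shared_repo: applies one diff to the shared git repository and
    # raises ConflictError only on a true conflict with earlier commits.
    while True:
        committer, diff = diffs.get()
        try:
            push_to_shared_repo(diff)
            notify(committer, "committed")
        except ConflictError as err:
            # Only in this case must the committer update her local clone
            # and resolve the conflict herself.
            notify(committer, "rejected: %s" % err)

Committers never wait on each other's clone updates; they only wait for their position in the queue.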
The landing strip alleviates the commit-contention problem, but does not fundamentally solve the commit-throughput problem, because 1) a shared git repository can only accept one commit at a time, and 2) git operations become slower as the repository grows larger. Configerator started with a single shared git repository. To improve the commit throughput, we are in the process of migrating to multiple smaller git repositories that collectively serve a partitioned global name space. Files under different paths (e.g., "/feed" and "/tao") can be served by different git repositories that can accept commits concurrently. Cross-repository dependency is supported.

import_python("feed/A.cinc", "*")
import_python("tao/B.cinc", "*")
...

In the example above, the config imports two other configs. The code is the same regardless of whether those configs are in the same repository or not. The Configerator compiler uses metadata to map the dependent configs to their repositories and automatically fetches them if they are not checked out locally. Along with the partitioning of the git repositories, the components in Figure 3 are also partitioned. Each git repository has its own mutator, landing strip, and tailer. New repositories can be added incrementally. As a repository grows large, some of its files can be migrated into a new repository. It only requires updating the metadata that
lists all the repositories and the file paths they are responsible for. The contents of the migrated files require no change.

3.7 Fault Tolerance

Every component in Figure 3 has built-in redundancy across multiple regions. One region serves as the master. Each backup region has its own copy of the git repository, and receives updates from the master region. The git repository in a region is stored on NFS and mounted on multiple servers, with one as the master. Each region runs multiple instances of all the services, including the mutator, canary service, landing strip, and dependency service. Configerator supports failover both within a region and across regions.

3.8 Summary

Configerator addresses the key challenges in configuration authoring, configuration error prevention, and configuration distribution. It takes a configuration-as-code approach to compile and generate configs from high-level source code. It expresses configuration dependency as source code dependency, and encourages config code modularization and reuse. It takes a holistic approach to prevent configuration errors, including config validators, code review, manual config testing, automated integration tests, and automated canary tests. It uses a distribution tree to deliver small configs through the push model, and uses a P2P protocol to deliver the bulk contents of large configs. It avoids commit contention by delegating commits to the landing strip. It improves commit throughput by using multiple git repositories that collectively serve a partitioned global name space.

4. Gatekeeper

Facebook releases software early and frequently. It forces us to get early feedback and iterate rapidly. It makes troubleshooting easier, because the delta between two releases is small. It minimizes the use of code branches that complicate maintenance. On the other hand, frequent software releases increase the risk of software bugs breaking the site. This section describes how Gatekeeper helps mitigate the risk by managing code rollouts through online config changes.

While a new product feature is still under development, Facebook engineers commonly release the new code into production early but in a disabled mode, and then use Gatekeeper to incrementally enable it online. If any problem is detected during the rollout, the new code can be disabled instantaneously. Without changing any source code, a typical launch using Gatekeeper goes through multiple phases. For example, initially Gatekeeper may only enable the product feature for the engineers developing the feature. Then Gatekeeper can enable the feature for an increasing percentage of Facebook employees, e.g., 1%→10%→100%. After successful internal testing, it can target 5% of the users from a specific region. Finally, the feature can be launched globally with an increasing coverage, e.g., 1%→10%→100%.

if (gk_check("ProjectX", $user_id)) {
    // Show the new feature to the user.
    ...
} else {
    // Show the old product behavior.
    ...
}

Figure 4: Pseudocode of a product using Gatekeeper to control the rollout of a product feature.

bool gk_check($project, $user_id) {
    if ($project == "ProjectX") {
        // The gating logic for "ProjectX".
        if ($restraint_1($user_id) AND
            $restraint_2($user_id)) {
            // Cast the die to decide pass or fail.
            return rand($user_id) < $pass_prob_1;
        } else if ($restraint_4($user_id)) {
            return rand($user_id) < $pass_prob_2;
        } else {
            return false;
        }
    }
    ...
}

Figure 5: Pseudocode of Gatekeeper's gating logic.

Figure 4 shows how a piece of product code uses a Gatekeeper "project" (i.e., a specific gating logic) to enable or disable a product feature. Figure 5 shows the project's internal logic, which consists of a series of if-then-else statements. The condition in an if-statement is a conjunction of predicates called restraints. Examples of restraints include checking whether the user is a Facebook employee and checking the type of a mobile device. Once an if-statement is satisfied, it probabilistically determines whether to pass or fail the gate, depending on a configurable probability that controls user sampling, e.g., 1% or 10%.

The code in Figure 5 is conceptual. A Gatekeeper project's control logic is actually stored as a config that can be changed live without a code upgrade. Through a Web UI, the if-then-else statements can be added or removed (actually, it is a graphical representation, without code to write¹); the restraints in an if-statement can be added or removed; the probability threshold can be modified; and the parameters specific to a restraint can be updated. For example, the user IDs in the "ID()" restraint can be added or removed so that only specific engineers will experience the new product feature during the early development phase.

¹ Code review is supported even if changes are made through the UI. The UI tool converts a user's operations on the UI into a text file, e.g., "Updated Employee sampling from 1% to 10%". The text file and a screenshot of the config's final graphical representation are submitted for code review.

A Gatekeeper project is dynamically composed out of restraints through configuration. Internally, a restraint is statically implemented in PHP or C++. Currently, hundreds of
restraints have been implemented, which are used to compose tens of thousands of Gatekeeper projects. The restraints check various conditions of a user, e.g., country/region, locale, mobile app, device, new user, and number of friends.

A Gatekeeper project's control logic is stored as a JSON config in Configerator. When the config is changed (e.g., expanding the rollout from 1% to 10%), the new config is delivered to production servers (see the bottom of Figure 3). The Gatekeeper runtime reads the config and builds a boolean tree to represent the gating logic. Similar to how an SQL engine performs cost-based optimization, the Gatekeeper runtime can leverage execution statistics (e.g., the execution time of a restraint and its probability of returning true) to guide efficient evaluation of the boolean tree.
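A sketch of how such a gating config could be represented and evaluated. The clause layout, statistics, and ordering heuristic are our assumptions; only the DNF-plus-sampling structure and the cost-based idea come from the paper.

import random

# Hypothetical in-memory form of one Gatekeeper project: an ordered list of
# clauses, each a conjunction of restraints plus a pass probability.
project = [
    {"restraints": ["employee", "ios_device"], "pass_prob": 0.01},
    {"restraints": ["region_us"],              "pass_prob": 0.10},
]

def gk_check(project, user, restraint_fns, stats):
    for clause in project:
        # Evaluate cheap and highly selective restraints first, analogous to
        # cost-based query optimization: ascending cost / P(restraint fails).
        ordered = sorted(clause["restraints"],
                         key=lambda r: stats[r]["cost"] / max(1e-9, 1.0 - stats[r]["p_true"]))
        if all(restraint_fns[r](user) for r in ordered):
            # Like rand($user_id) in Figure 5: deterministic per-user sampling,
            # so a given user's experience is stable across requests.
            return random.Random(user["id"]).random() < clause["pass_prob"]
    return False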
The example in Figure 5 is similar to disjunctive normal form (DNF), except for the use of rand() to sample a subset of users. Sampling is inherent to feature gating, i.e., rolling out a feature to an increasingly larger population, e.g., 1% → 10%. The negation operator is built inside each restraint. For example, the employee restraint can be configured to check "not an employee". As a result, the gating logic has the full expressive power of DNF.

Gatekeeper uses DNF of restraints and user sampling to form the gating logic. It strikes a balance among flexibility, usability, and safety. Theoretically, it is possible not to impose any structure on the gating logic (i.e., not limited to the form in Figure 5), by allowing an engineer to write arbitrary gating code in a dynamic programming language (e.g., PHP), and immediately distributing it for execution as a live config update. This approach offers maximum flexibility, but increases the risk of configuration errors. Moreover, it is harder to use for most engineers, compared with simply selecting restraints from Gatekeeper's UI without writing any code. Finally, its additional flexibility over Gatekeeper is limited, because Facebook rolls out PHP code twice a day and new restraints can be added quickly.

Some gating logic is computationally too expensive to execute in real time inside a restraint. In one example, a product feature should only be exposed to users whose recent posts are related to the current trending topics. This computation requires continuous stream processing. In another example, it needs to run a MapReduce job to analyze historical data to identify users suitable for a product feature. Gatekeeper provides a key-value-store interface to integrate with these external systems. A special "laser()" restraint invokes get("$project-$user_id") on a key-value store called Laser. If the return value is greater than a configurable threshold, i.e., get(...) > T, the restraint passes. Any system can integrate with Gatekeeper by putting data into Laser. Laser stores data on flash or in memory for fast access. It has automated data pipelines to load data from the output of a stream processing system or a MapReduce job. The MapReduce job can be re-run periodically to refresh the data for all users.
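The laser() restraint's contract is simple. A sketch follows; the key format mirrors the text above, while the client API is an assumption.

def laser_restraint(project, user_id, threshold, laser_get):
    # laser_get: a lookup against the Laser key-value store, whose contents are
    # loaded by stream-processing or periodically re-run MapReduce pipelines.
    value = laser_get("%s-%s" % (project, user_id))
    # The restraint passes when the precomputed score exceeds the configurable
    # threshold T; a missing key fails the restraint.
    return value is not None and value > threshold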
[Figure 6 diagram omitted. On the mobile device side, app code such as

myCfg = Factory.get(MY_CONFIG);
bool x = myCfg.getBool(FEATURE_X);
int y = myCfg.getInt(VOIP_ECHO);

calls the cross-platform C++ MobileConfig client library, which periodically pulls config changes from the server side and can also receive emergency pushes. On the server side, the MobileConfig Translation Layer maps each config field to a backend system, e.g.,

// Map MY_CONFIG to backend systems.
MY_CONFIG {
    FEATURE_X => Gatekeeper("ProjX")
    VOIP_ECHO => GKRestraintExperiment("ECHO")
}

with Configerator, Gatekeeper, and other A/B testing tools as the backends.]

Figure 6: MobileConfig architecture.

5. MobileConfig

Configuration management for mobile apps differs from that for applications running in data centers, because of the unique challenges in a mobile environment. First, the mobile network is a severe limiting factor. Second, mobile platforms are diverse, with at least Android and iOS to support. Lastly, legacy versions of a mobile app linger around for a long time, raising challenges in backward compatibility.

MobileConfig addresses these challenges while maximizing the reuse of the configuration management tools already developed for applications running in data centers. Figure 6 shows the architecture of MobileConfig. Every config appears as a context class in a mobile app's native language. The app invokes the context class' getter methods to retrieve the values of the config fields. The client library that supports the context class is implemented in C++ so that it is portable across Android and iOS.

Because push notification is unreliable, MobileConfig cannot solely rely on the push model for config distribution. The client library polls the server for config changes (e.g., once every hour) and caches the configs on flash for later reuse. To minimize the bandwidth consumption, the client sends to the server the hash of the config schema (for schema versioning) and the hash of the config values cached on the client.² The server sends back only the configs that have changed and are relevant to the client's schema version. In addition to pull, the server occasionally pushes emergency config changes to the client through push notification, e.g., to immediately disable a buggy product feature. A combination of push and pull makes the solution simple and reliable.

² One future enhancement is to make the server stateful, i.e., remembering each client's hash values to avoid repeated transfer of the hash values.
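A sketch of the hash-based pull from the client's side; the message fields and function names are assumptions, and a real client would also handle the emergency-push path.

import hashlib, json

def poll_for_changes(schema, cached_values, server_rpc):
    # Send two small hashes instead of a full config list: one identifying the
    # config schema this app build was compiled against, and one summarizing
    # the values currently cached on flash.
    schema_hash = hashlib.sha1(json.dumps(schema, sort_keys=True).encode()).hexdigest()
    values_hash = hashlib.sha1(json.dumps(cached_values, sort_keys=True).encode()).hexdigest()
    reply = server_rpc({"schema_hash": schema_hash, "values_hash": values_hash})
    # The server replies only with configs that changed and are relevant to
    # this schema version; an empty reply means the flash cache is current.
    cached_values.update(reply.get("changed_configs", {}))
    return cached_values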

To cope with legacy versions of a mobile app, separating abstraction from implementation is a first-class citizen in MobileConfig. The translation layer in Figure 6 provides one level of indirection to flexibly map a MobileConfig field to a backend config. The mapping can change. For example, initially VOIP_ECHO is mapped to a Gatekeeper-backed experiment, where satisfying different if-statements in Figure 5 gives VOIP_ECHO a different parameter value to experiment with. After the experiment finishes and the best parameter is found, VOIP_ECHO can be remapped to a constant stored in Configerator. In the long run, all the backend systems (e.g., Gatekeeper and Configerator) may be replaced by new systems. It only requires changing the mapping in the translation layer to smoothly finish the migration. To scale to more than one billion mobile devices, the translation layer runs on many servers. The translation mapping is stored in Configerator and distributed to all the translation servers.
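A sketch of that indirection on a translation server. The mapping mirrors the snippet in Figure 6; the lookup function and backend interfaces are assumptions.

# Hypothetical translation mapping, itself stored in Configerator and
# distributed to every translation server. It can be changed live, e.g.,
# remapping VOIP_ECHO to a Configerator constant once the experiment ends.
MAPPING = {
    "MY_CONFIG": {
        "FEATURE_X": ("gatekeeper", "ProjX"),
        "VOIP_ECHO": ("gk_experiment", "ECHO"),
    },
}

def resolve_field(config_name, field, user, backends):
    # backends: one resolver callable per backend system
    # (Configerator constants, Gatekeeper projects, experiments, ...).
    backend, key = MAPPING[config_name][field]
    return backends[backend](key, user)

Because clients only ever see the MobileConfig abstraction, a backend can be swapped out by editing this mapping, without waiting for old app versions to die out.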
6. Usage Statistics and Experience

We described Facebook's configuration management stack. Given the space limit, this section will primarily focus on the usage statistics of the production systems and our experience, because we believe those production data are more valuable than experimental results in a sandbox environment. For the latter, we report Configerator's commit-throughput scalability test in a sandbox, because the data cannot be obtained from production, and currently commit throughput is Configerator's biggest scalability challenge. We attempt to answer the following questions:

• Does the configuration-as-code hypothesis hold, i.e., do most engineers prefer writing config-generating code?
• What are the interesting statistics about config update patterns, e.g., do old configs become dormant quickly?
• What are the performance and scale of the tools?
• What are the typical configuration errors?
• How do we scale our operation to support thousands of engineers and manage configs on hundreds of thousands of servers and a billion or more mobile devices?
• How does an organization's engineering culture impact its configuration management solution?

6.1 Validating the Configuration-as-Code Hypothesis

Configerator stores different types of files: 1) the Python and Thrift source code as shown in Figure 2; 2) the "compiled configs", i.e., the JSON files generated by the Configerator compiler from the source code; and 3) the "raw configs", which are not shown in Figure 2 for brevity. Configerator allows engineers to check in raw configs of any format. They are not produced by the Configerator compiler, but are distributed in the same way as compiled configs. They are either manually edited or produced by other automation tools.

[Figure 7 omitted: the number of compiled and raw configs (in hundreds of thousands) versus days since the creation of the config repository (roughly days 100 to 1,400), with an annotation marking when Gatekeeper migrated to Configerator.]

Figure 7: Number of configs in the repository. The scale on the y-axis is removed due to confidentiality, but we keep the unit to show the order of magnitude.

[Figure 8 omitted: CDFs of raw-config size and compiled-config size, over sizes from 100 bytes to 10,000,000 bytes.]

Figure 8: Cumulative distribution function (CDF) of config size. Note that the x-axis is neither linear nor logscale.

Figure 7 shows the number of configs stored in Configerator. The growth is rapid. Currently it stores hundreds of thousands of configs. The compiled configs grow faster than the raw configs. Out of all the configs, 75% of them are compiled configs. Moreover, about 89% of the updates to raw configs are done by automation tools, i.e., not manually edited. This validates one important hypothesis of Configerator, i.e., engineers prefer writing config-generating code to manually editing configs. The custom automation tools are used to generate raw configs because they suit the specific (often simpler) jobs better, or they predate Configerator.

Figure 8 shows the CDF of config size. Many configs have significant complexity and are not trivial name-value pairs. The complexity is one reason why engineers prefer not to edit configs manually. Compiled configs are more complex than raw configs. The P50's of raw config size and compiled config size are 400 bytes and 1KB, respectively. The P95's are 25KB and 45KB, respectively. The largest configs are 8.4MB and 14.8MB, respectively. Even larger configs are distributed through PackageVessel and only their small metadata is stored in Configerator.

On average, a raw/compiled/source config gets updated 44/16/10 times during its lifetime, respectively. Raw configs get updated 175% more frequently than compiled configs, because most (89%) of the raw config changes are done by automation tools. Compiled configs are generated from config source code, but the former change 60% more frequently than the latter, because the change of one source code file may generate multiple new compiled configs, similar to a header file change in C++ causing recompilation of multiple .cpp files. This indicates that writing code to generate configs reduces the burden of manually keeping track of changes, thanks to code reuse.

6.2 Config Update Statistics

This section reports config update statistics, which hopefully can help guide the design of future configuration management systems and motivate future research.

Are configs fresh or dormant, i.e., are they updated recently? Figure 9 shows that both fresh configs and dormant configs account for a significant fraction. Specifically, 28% of the configs were either created or updated in the past 90 days. On the other hand, 35% of the configs were not updated even once in the past 300 days.

Do configs created a long time ago still get updated? Figure 10 shows that both new configs and old configs get updated. 29% of the updates happen on configs created in the past 60 days. On the other hand, 29% of the updates happen on configs created more than 300 days ago, which is significant because those old configs only account for 50% of the configs currently in the repository. The configs do not stabilize as quickly as we initially thought.

[Figure 9 omitted: the CDF of days since a config was last modified (1 to more than 700 days), rising from 0.5% at one day, through 28% at 90 days and 65% at 300 days, to 100%.]

Figure 9: Freshness of configs.

[Figure 10 omitted: the CDF of a config's age at the time of an update (1 to more than 700 days), rising from 4% at one day, through 29% at 60 days and 71% at 300 days, to 100%.]

Figure 10: Age of a config at the time of an update.

The frequency of config updates is highly skewed, as shown in Table 1. It is especially skewed for raw configs because their commits are mostly done by automation tools. At the low end, 56.9% of the raw configs are created but then never updated. At the high end, the top 1% of the raw configs account for 92.8% of the total updates to raw configs. By contrast, 25.0% of the compiled configs are created but then never updated, and the top 1% of the compiled configs account for 64.5% of the total updates.

No. of updates in lifetime   Compiled config   Raw config
1                            25.0%             56.9%
2                            24.9%             23.7%
3                            14.1%             5.2%
4                            7.5%              3.2%
[5, 10]                      15.9%             6.6%
[11, 100]                    11.6%             3.0%
[101, 1000]                  0.8%              0.7%
[1001, ∞)                    0.2%              0.7%

Table 1: Number of times that a config gets updated. How to read the first cell: 25.0% of compiled configs are written only once, i.e., created and then never updated.

When a config gets updated, is it a big change or a small change? Table 2 shows that most changes are very small. In the output of the Unix diff tool, it is considered a one-line change to add a new line or delete an existing line in a file. Modifying an existing line is considered a two-line change: first deleting the existing line, and then adding a new line. Table 2 shows that roughly 50% of the config changes are very small one-line or two-line changes. On the other hand, large config changes are not a negligible fraction: 8.7% of the updates to compiled configs modify more than 100 lines of the JSON files.

No. of line changes in a config update   Compiled config   Source code   Raw config
1                                        2.5%              2.7%          2.3%
2                                        49.5%             44.3%         48.6%
[3, 4]                                   9.9%              13.5%         32.5%
[5, 6]                                   3.9%              4.6%          4.2%
[7, 10]                                  7.4%              6.1%          3.6%
[11, 50]                                 15.3%             19.3%         5.7%
[51, 100]                                2.8%              2.3%          1.1%
[101, ∞)                                 8.7%              7.3%          2.0%

Table 2: Number of line changes in a config update. How to read the table: 49.5% of the updates to compiled configs are two-line changes.

How many co-authors update the same config? Table 3 shows that most configs are only updated by a small number of co-authors. Specifically, 79.6% of the compiled configs are updated by only one or two authors, whereas 91.5% of the raw configs are updated by only one or two authors. It is more skewed for raw configs, because most raw configs are updated by automation tools, which are counted as a single author. On the other hand, some configs are updated by a large number of co-authors. For example, there is a sitevar updated by 727 authors over its two years of lifetime. For future work, it would be helpful to automatically flag high-risk updates on these highly-shared configs.

Is a config's co-authorship pattern significantly different from that of regular C++/Python/Java code? No. This is because Facebook adopts the DevOps model, where engineers developing a software feature also do the configuration-related operational work in production. The "fbcode" column in Table 3 shows the breakdown for Facebook's backend code repository, which is primarily C++, Python, and Java. There is no big difference between the "compiled config" column and the "fbcode" column. One subtle but important difference is at the high end: 0.24% of the compiled configs have more than 50 co-authors, whereas only 0.027% of the files in fbcode have more than 50 co-authors. The relative difference is large.

No. of co-authors   Compiled config   Raw config   fbcode
1                   49.5%             70.0%        44.0%
2                   30.1%             21.5%        37.7%
3                   9.2%              5.1%         7.6%
4                   3.9%              1.4%         3.6%
[5, 10]             5.7%              1.2%         5.6%
[11, 50]            1.3%              0.6%         1.4%
[51, 100]           0.2%              0.1%         0.02%
[101, ∞)            0.04%             0.002%       0.007%

Table 3: Number of co-authors of configs. How to read the first cell: 49.5% of the compiled configs have a single author throughout their lifetime.

6.3 Configerator & Gatekeeper Performance

Figure 11 compares Configerator's daily commit throughput (i.e., the number of times code/configs are checked into the git repository) with those of www (the frontend code repository) and fbcode (the backend code repository). (Facebook has other code repositories not shown here.) Figure 11 highlights the scaling challenge we are facing. In 10 months, the peak daily commit throughput grew by 180%. The periodic peaks and valleys are due to the weekly pattern, i.e., fewer commits on weekends. Configerator has a high commit throughput even on weekends, because a significant fraction of config changes are automated by tools and done continuously. Specifically, Configerator's weekend commit throughput is about 33% of the busiest weekday commit throughput, whereas this ratio is about 10% for www and 7% for fbcode.

[Figure 11 omitted: daily commit throughput (in thousands of commits per day) of the Configerator, www, and fbcode repositories from 4/1/2014 to 1/1/2015, with a dip marked at Christmas.]

Figure 11: Daily commit throughput of repositories.

Figure 12 shows Configerator's hourly commit throughput in a week. It exhibits both a weekly pattern (low activity during the weekend) and a daily pattern (peaks between 10AM and 6PM). There are a steady number of commits throughout the nights and weekends. Those are automated commits, which account for 39% of the total commits.

[Figure 12 omitted: hourly commit throughput (in hundreds of commits per hour) during the week of 11/3 to 11/9.]

Figure 12: Configerator's hourly commit throughput.

Figure 13 shows Configerator's maximum commit throughput as a function of the git repository size. It is measured from a sandbox setup under a synthetic stress load test. The git repository is built up by replaying Configerator's production git history from the beginning. To go beyond Configerator's current repository size, we project the repository's future growth by generating synthetic git commits that follow the statistical distribution of past real git commits. Figure 13 shows that the commit throughput is not scalable with respect to the repository size, because the execution time of many git operations increases with the number of files in the repository and the depth of the git history. The right y-axis is simply latency = 60 / throughput (with throughput in commits per minute), which makes the trend more obvious. This "latency" is just the execution time, excluding any queueing time. To improve throughput and reduce latency, Configerator is in the process of migrating to multiple smaller git repositories that collectively serve a partitioned global name space (see Section 3.6).

[Figure 13 omitted: maximum commit throughput (commits per minute, left y-axis, up to about 250) and the corresponding commit latency (seconds, right y-axis, up to about 8) versus the number of files in the git repository (0 to 1,000 thousands).]

Figure 13: Configerator's maximum commit throughput (measured from a sandbox instead of production).

When an engineer saves a config change, it takes about ten minutes to go through automated canary tests. This long testing time is needed in order to reliably determine whether the application is healthy under the new config. After canary tests, how long does it take to commit the change and propagate it to all servers subscribing to the config? This latency can be broken down into three parts: 1) it takes about 5 seconds to commit the change into the shared git repository, because git is slow on a large repository; 2) the git tailer (see Figure 3) takes about 5 seconds to fetch config changes from the shared git repository; and 3) the git tailer writes the change to Zeus, which propagates the change to all subscribing servers through a distribution tree. The last step takes about 4.5 seconds to reach hundreds of thousands of servers distributed across multiple continents.

Figure 14: Latency between committing a config change and the new config reaching the production servers (the week of 11/3/2014). [Plot omitted: propagation latency in seconds (0–40) for each day from 11/3 to 11/9.]

Figure 15: Gatekeeper check throughput (the week of 2/15/2015). [Plot omitted: Gatekeeper check rate in billions per second for each day from 2/15 to 2/21.]

Figure 14 shows the end-to-end latency of the three steps above. It is measured from production. The baseline latency is about 14.5 seconds, but it increases with the load. It shows both a daily pattern (due to the low load at night) and a weekly pattern (due to the low load on weekends). The major challenge in reducing the latency is to speed up git operations on a large repository. On the other hand, latency is less critical for Configerator, because automated canary tests take about ten minutes anyway.

Gatekeeper projects manage product feature rollouts. When a user accesses facebook.com, the Gatekeeper projects are checked in realtime to determine what features to enable for the user. Figure 15 shows the total number of Gatekeeper checks across the site. Because the check throughput is high (billions of checks per second) and some Gatekeeper restraints are data intensive, Gatekeeper currently consumes a significant percentage of the total CPU of the frontend clusters, which consist of hundreds of thousands of servers. We constantly work on improving Gatekeeper's efficiency. On the other hand, we consider this "overhead" worthwhile, because it enables Facebook engineers to iterate rapidly on new product features. This is evidenced by the popularity of Gatekeeper. In 2014, tens of thousands of Gatekeeper projects were created or updated to actively manage the rollouts of a huge number of micro product features.

6.4 Configuration Errors

To illustrate the challenges in safeguarding configs, we discuss several real examples of configuration errors as well as a statistical analysis of configuration errors.

The first example is related to the config rollout process. When both the client code and the config schema are updated together, they may not get deployed to a server at the same time. Typically, the new client code can read the old config schema, but the old client code cannot read the new config schema. The latter is what happened in one incident. An engineer changed both the client code and the config schema. She checked in the client code and thought it would be included in the code deployment happening later that day. However, the release branch was cut earlier, and her new code unknowingly missed that release. Five days later, the engineer committed a new config using the new schema. The automated canary testing tool initially deployed the new config to only 20 servers in production and monitored their health. It compared the error logs of those 20 servers with those of the rest of the production servers, and detected a log spew, i.e., rapid growth of error logs. It aborted the rollout and prevented an outage.

Another incident was less lucky. An engineer made a config change, which was rejected by the automated canary tool because it caused some instances of the application to crash. The engineer stared at the config change. It seemed such a trivial and innocent change that nothing could possibly go wrong. "It must be a false positive of the canary tool!" She overrode the tool's rejection and deployed the config, which caused more crashes. She mitigated the problem by immediately reverting the config change. It turned out that the config change itself was indeed correct, but it caused the application to exercise a new code path and triggered a subtle race-condition bug in the code. This incident highlights that problems can arise in unexpected ways.

Enhancing automated canary tests is a never-ending battle, as shown in the example below. An engineer introduced a configuration error that sent mobile requests down a rare code path to fetch data from a backend store. It put too much load on the data store and dramatically increased the latency. At that time, automated canary tests did not catch the problem, because the testing was done on a limited number of servers, and the small-scale testing was insufficient to cause any load issue. The latency increase became evident only after the config was deployed site wide. Since then, we have added a canary phase that tests a new config on thousands of servers in a cluster in order to catch cluster-level load issues.

In addition to the examples above, we manually analyzed the high-impact incidents during a three-month period. We found that 16% of the incidents were related to configuration management, while the rest were dominated by software bugs. The table below shows the breakdown of the configuration-related incidents; a sketch of the canary comparison used in the first example follows the table.

Type of Config Issues                                 Percentage
Type I: common config errors                          42%
Type II: subtle config errors                         36%
Type III: valid config changes exposing code bugs     22%
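The log-spew detection described in the first example boils down to comparing an error-rate statistic on the canary servers against the rest of the fleet. Below is a minimal sketch of such a comparison; the metric names, thresholds, and simple ratio test are hypothetical simplifications, not Facebook's actual canary tool.

# Illustrative sketch (not Facebook's canary tool): flag a config rollout if the
# canary group's error-log rate is far above the baseline group's rate.
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ServerStats:
    errors: int    # error-log lines observed during the canary window
    requests: int  # requests served during the same window


def error_rate(group: Sequence[ServerStats]) -> float:
    errors = sum(s.errors for s in group)
    requests = max(1, sum(s.requests for s in group))
    return errors / requests


def canary_is_healthy(canary: Sequence[ServerStats],
                      baseline: Sequence[ServerStats],
                      max_ratio: float = 3.0,
                      min_errors: int = 100) -> bool:
    """Abort the rollout (return False) on a 'log spew': the canary group's
    error rate exceeds the baseline's by more than max_ratio, with enough
    absolute errors to rule out noise."""
    canary_errors = sum(s.errors for s in canary)
    if canary_errors < min_errors:
        return True  # too few errors to call it a spew
    return error_rate(canary) <= max_ratio * max(error_rate(baseline), 1e-9)


# Example: 20 canary servers vs. the rest of the fleet.
canary = [ServerStats(errors=500, requests=10_000) for _ in range(20)]
rest = [ServerStats(errors=5, requests=10_000) for _ in range(2_000)]
if not canary_is_healthy(canary, rest):
    print("log spew detected: aborting config rollout")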

Type I errors are obvious once spotted, e.g., typos, out-of-bound values, and referencing an incorrect cluster. They can benefit from more careful config reviews. Type II errors are harder to anticipate ahead of time, e.g., load-related issues, failure-induced issues, or butterfly effects. The root causes of type III issues are actually in the code rather than in the configs. The high percentage of type III issues was a surprise to us initially. All types of config issues can benefit from better canary testing, which is a focus of our ongoing work.

6.5 Operational Experience

Facebook's configuration management team adopts the DevOps model. A small group of engineers (i.e., the authors) are responsible for 1) implementing new features and bug fixes, 2) deploying new versions of the tools into production, 3) monitoring the health of the tools and resolving production issues, and 4) supporting the engineering community, e.g., answering questions and reviewing code related to the use of the tools. Everyone does everything, without separation of roles such as architect, software engineer, test engineer, and system administrator. The small team is highly leveraged, because we support thousands of engineers using our tools and manage the configs on hundreds of thousands of servers and more than one billion mobile devices.

Engineers on the team rotate through an oncall schedule. The oncall shields the rest of the team so they can focus on development work. Monitoring tools escalate urgent issues to the oncall through automated phone calls. Tool users post questions about non-urgent issues in the related Facebook groups, which are answered by the oncall when she is available, or by users helping each other. We strive to educate our large user community and make them self-sufficient. We give lectures and lab sessions in the bootcamp, where new hires go through weeks of training.

Although Configerator's architecture in Figure 3 seems complex, we rarely have outages of the Configerator components, thanks in part to the built-in redundancy and automated failure recovery. On the other hand, there are more outages caused by configuration errors breaking products, and we strive to guard against them, e.g., by continuously enhancing automated canary tests.

The Configerator proxy is the most sensitive component, because it runs on almost every server and it is hard to anticipate all possible problems in diverse environments. Many product teams enroll their servers in a testing environment to verify that a new proxy does not break their products. The proxy rollout is always done carefully in a staged fashion.

6.6 Configuration Management Culture

An Internet service's configuration management process and tools reflect the company's engineering culture. At the conservative extreme, all config changes must go through a central committee for approval and are carefully executed by a closed group of operation engineers. At the moving-fast extreme, every engineer can change any config directly on the site and claim "test it live!", which is not uncommon in startups. The authors experienced both extremes (as well as something in-between) at different companies.

Over the years, Facebook has evolved from optional diff review and optional manual testing for config changes to mandatory diff review and mandatory manual testing. We also put more emphasis on automated canary tests and automated continuous integration tests. The tools do support access control (i.e., only white-listed engineers can change certain critical configs), but that is an exception rather than the norm. We empower individual engineers to use their best judgment to roll out config changes quickly, and build various automated validation and testing tools as the safety net. We expect Facebook's configuration management culture to evolve further, even in significant ways.

Facebook engineers iterate rapidly on product features by releasing software early and frequently. This is inherent to Facebook's engineering culture. It requires tools to manage product rollouts and run A/B tests to identify promising product features. Gatekeeper and other A/B testing tools are developed as first-class citizens to support this engineering culture, and are widely used by product engineers.

7. Related Work

There is little published work on runtime configuration management for Internet services. The closest to our work is Akamai's ACMS system [30]. ACMS is a configuration storage and distribution system, similar to the Zeus component in our stack. ACMS uses the pull model whereas Zeus uses the push model. Moreover, the scope of our work is much broader, including configuration authoring and validation, code review, version control, and automated testing. Our use cases are also much broader, from incremental product launch to testing parameters on mobile devices. In terms of config distribution, just on the server side, our system is 10-100 times larger than ACMS; counting in mobile devices, our system is 10,000-100,000 times larger.

Configuration management is an overloaded term. It has been used to mean different things: 1) controlling an application's runtime behavior; 2) source code version control [9]; and 3) software deployment, e.g., Chef [7], Puppet [28], Autopilot [19], and other tools covered in the survey [10]. ACMS and our tools fall into the first category.

"Configuration-as-code" has different forms of realization. Chef [7] executes code on the target server at deployment time, whereas Configerator executes code during the development phase and then pushes the JSON file to all servers during the deployment phase. The difference stems from their focuses: software installation vs. managing runtime behavior. Configerator does have some special use cases where scripts are stored as raw configs and pushed to servers for execution, but those are very rare.

Chubby [6] and ZooKeeper [18] provide coordination services for distributed systems, and can be used to store application metadata. We use Zeus, an enhanced version of
ZooKeeper, to store configs, and use observers to form a distribution tree. Thialfi [1] delivers object update notifications to clients that registered their interest in the objects. It could potentially be used to deliver config change notifications.

Oppenheimer et al. [24] studied why Internet services fail, and identified configuration errors as a major source of outages. Configuration error is a well-studied research topic [17, 21, 25, 33, 35–37]. Our focus is to prevent configuration errors through validators, code review, manual tests, automated canary tests, and automated integration tests. Configuration debugging tools [3, 31, 32] are complementary to our work. We can benefit from these tools to diagnose the root cause of a configuration error. Spex [34] infers configuration constraints from software source code. It might help automate the process of writing Configerator validators.

Like LBFS [23] and DTD [22], MobileConfig uses hashes to detect and avoid duplicate data transfer. It is unique in that legacy versions of an app may access the same config using different schemas and need to fetch different data.

Configerator not only uses git for version control, but also uses a git push to trigger an immediate config deployment. Many cloud platforms [4, 15, 16] adopt a similar mechanism, where a git push triggers a code deployment.

Our tools follow many best practices in software engineering, including code review, verification (in the form of validators), continuous integration [12], canary tests, and deployment automation. Our contribution is to apply these principles to large-scale configuration management.

8. Conclusion

We presented Facebook's configuration management stack. Our main contributions are 1) defining the problem space and the use cases, 2) describing a holistic solution, and 3) reporting usage statistics and operational experience from Facebook's production system. Our major future work includes scaling Configerator, enhancing automated canary tests, expanding MobileConfig to cover more apps, improving the config abstraction (e.g., introducing config inheritance), and flagging high-risk config updates based on historical data.

The technology we described is not exclusive to large Internet services. It matters for small systems as well. Anecdotally, the first primitive version of the Sitevars tool was introduced more than ten years ago, when Facebook was still small. What are the principles or knowledge that might be applied beyond Facebook? We summarize our thoughts below, but caution readers about the potential risk of over-generalization.

• Agile configuration management enables agile software development. For example, gating product rollouts via config changes reduces the risks associated with frequent software releases. A/B testing tools allow engineers to quickly prototype product features and fail fast. Even if these specific techniques are not suitable for your organization, consider other dramatic improvements in your configuration management tools to help accelerate your software development process.

• With proper tool support, it is feasible for even a large organization to practice "open" configuration management, i.e., almost every engineer is allowed to make online config changes. Although it seems risky and might not be suitable for every organization, it is indeed feasible and can be beneficial to agile software development.

• It takes a holistic approach to defend against configuration errors, including config authoring, validation, code review, manual testing, automated integration tests, and automated canary tests.

• Although the use cases of configuration management can be very diverse (e.g., from gating product rollouts to A/B testing), it is feasible and beneficial to support all of them on top of a uniform and flexible foundation, with additional tools providing specialized functions (see Figure 1). Otherwise, inferior wheels will be reinvented. At Facebook, there is a long history of fragmented solutions converging onto Configerator.

• For nontrivial configs, it is more productive and less error prone for engineers to write programs that generate the configs, as opposed to manually editing the configs.

• In data centers, it is more efficient to use the push model to distribute config updates through a tree.

• The hybrid pull-push model is more suitable for mobile apps, because push notification alone is unreliable.

• Separating the distribution of a large config's small metadata from the distribution of its bulk content (through a P2P protocol) makes the solution scalable, without sacrificing the data consistency guarantee.

• A typical git setup cannot provide sufficient commit throughput for large-scale configuration management. The solution is to use multiple git repositories to collectively serve a partitioned global name space, and to delegate commits to the landing strip to avoid contention.

• The config use cases and usage statistics we reported may motivate future research. For example, our data show that old configs do get updated, and many configs are updated multiple times. It would be helpful to automatically flag high-risk updates based on past history, e.g., a dormant config that is suddenly changed in an unusual way.

Acknowledgments

We would like to thank the anonymous reviewers and our shepherd, Shan Lu, for their valuable feedback. The authors are the current or most recent members of Facebook's configuration management team. In the past, many other colleagues contributed to Configerator, Gatekeeper, Sitevars, and Zeus. We thank all of them for their contributions. We thank our summer intern, Peng Huang, for writing scripts to replay the git history used in Figure 13.

References

[1] Adya, A., Cooper, G., Myers, D., and Piatek, M. Thialfi: a client notification service for internet-scale applications. In Proceedings of the 23rd ACM Symposium on Operating Systems Principles (2011), pp. 129–142. SOSP'11.
[2] Apache Thrift. http://thrift.apache.org/.
[3] Attariyan, M., and Flinn, J. Automating configuration troubleshooting with dynamic information flow analysis. In Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation (2010), pp. 237–250. OSDI'10.
[4] AWS Elastic Beanstalk. http://aws.amazon.com/elasticbeanstalk/.
[5] Bronson, N., Amsden, Z., Cabrera, G., Chakka, P., Dimov, P., Ding, H., Ferris, J., Giardullo, A., Kulkarni, S., Li, H., et al. TAO: Facebook's distributed data store for the social graph. In Proceedings of the 2013 USENIX Annual Technical Conference (2013). USENIX ATC'13.
[6] Burrows, M. The Chubby lock service for loosely-coupled distributed systems. In Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation (2006), pp. 335–350. OSDI'06.
[7] Chef. http://www.opscode.com/chef/.
[8] Cohen, B. Incentives build robustness in BitTorrent. In Proceedings of the 1st Workshop on Economics of Peer-to-Peer Systems (2003), pp. 68–72.
[9] Conradi, R., and Westfechtel, B. Version models for software configuration management. ACM Computing Surveys 30, 2 (1998), 232–282.
[10] Delaet, T., Joosen, W., and Van Brabant, B. A survey of system configuration tools. In Proceedings of the 24th Large Installation System Administration Conference (2010). LISA'10.
[11] Dibowitz, P. Really large scale systems configuration: config management @ Facebook, 2013. https://www.socallinuxexpo.org/scale11x-supporting/default/files/presentations/cfgmgmt.pdf.
[12] Duvall, P. M., Matyas, S., and Glover, A. Continuous Integration: Improving Software Quality and Reducing Risk. Pearson Education, 2007.
[13] Fowler, M., and Highsmith, J. The agile manifesto. Software Development 9, 8 (2001), 28–35.
[14] Git. http://git-scm.com/.
[15] Google App Engine. https://appengine.google.com/.
[16] Heroku. https://www.heroku.com/.
[17] Huang, P., Bolosky, W. J., Singh, A., and Zhou, Y. ConfValley: A systematic configuration validation framework for cloud services. In Proceedings of the 10th European Conference on Computer Systems (2015), p. 19. EuroSys'15.
[18] Hunt, P., Konar, M., Junqueira, F. P., and Reed, B. ZooKeeper: wait-free coordination for internet-scale systems. In Proceedings of the 2010 USENIX Annual Technical Conference (2010), pp. 11–11. USENIX ATC'10.
[19] Isard, M. Autopilot: automatic data center management. ACM SIGOPS Operating Systems Review 41, 2 (2007), 60–67.
[20] JavaScript Object Notation (JSON). http://www.json.org/.
[21] Mahajan, R., Wetherall, D., and Anderson, T. Understanding BGP misconfiguration. In Proceedings of the ACM SIGCOMM 2002 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (2002), pp. 3–16. SIGCOMM'02.
[22] Mogul, J. C., Chan, Y.-M., and Kelly, T. Design, implementation, and evaluation of duplicate transfer detection in HTTP. In Proceedings of the 1st Symposium on Networked Systems Design and Implementation (2004), pp. 43–56. NSDI'04.
[23] Muthitacharoen, A., Chen, B., and Mazieres, D. A low-bandwidth network file system. In Proceedings of the 18th ACM Symposium on Operating Systems Principles (2001), pp. 174–187. SOSP'01.
[24] Oppenheimer, D., Ganapathi, A., and Patterson, D. A. Why do Internet services fail, and what can be done about it? In Proceedings of the 4th USENIX Symposium on Internet Technologies and Systems (2003). USITS'03.
[25] Pappas, V., Xu, Z., Lu, S., Massey, D., Terzis, A., and Zhang, L. Impact of configuration errors on DNS robustness. In Proceedings of the ACM SIGCOMM 2004 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (2004), pp. 319–330. SIGCOMM'04.
[26] Phabricator. http://phabricator.org/.
[27] Power, A. Making Facebook Self-Healing, 2011. https://www.facebook.com/notes/facebook-engineering/making-facebook-self-healing/10150275248698920.
[28] Puppet. https://puppetlabs.com/.
[29] Rossi, C. Ship early and ship twice as often. https://www.facebook.com/notes/facebook-engineering/ship-early-and-ship-twice-as-often/10150985860363920.
[30] Sherman, A., Lisiecki, P. A., Berkheimer, A., and Wein, J. ACMS: the Akamai configuration management system. In Proceedings of the 2nd Symposium on Networked Systems Design & Implementation (2005), pp. 245–258. NSDI'05.
[31] Su, Y.-Y., Attariyan, M., and Flinn, J. AutoBash: improving configuration management with operating system causality analysis. In Proceedings of the 21st ACM Symposium on Operating Systems Principles (2007), pp. 237–250. SOSP'07.
[32] Whitaker, A., Cox, R. S., and Gribble, S. D. Configuration debugging as search: Finding the needle in the haystack. In Proceedings of the 6th USENIX Symposium on Operating Systems Design and Implementation (2004), pp. 77–90. OSDI'04.
[33] Wool, A. A quantitative study of firewall configuration errors. IEEE Computer 37, 6 (2004), 62–67.
[34] Xu, T., Zhang, J., Huang, P., Zheng, J., Sheng, T., Yuan, D., Zhou, Y., and Pasupathy, S. Do not blame users for misconfigurations. In Proceedings of the 24th ACM Symposium on Operating Systems Principles (2013), pp. 244–259. SOSP'13.
[35] Xu, T., and Zhou, Y. Systems approaches to tackling configuration errors: A survey. ACM Computing Surveys 47, 4 (2015), 70.
[36] Yin, Z., Ma, X., Zheng, J., Zhou, Y., Bairavasundaram, L. N., and Pasupathy, S. An empirical study on configuration errors in commercial and open source systems. In Proceedings of the 23rd ACM Symposium on Operating Systems Principles (2011), pp. 159–172. SOSP'11.
[37] Zhang, J., Renganarayana, L., Zhang, X., Ge, N., Bala, V., Xu, T., and Zhou, Y. EnCore: Exploiting system environment and correlation information for misconfiguration detection. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (2014), pp. 687–700. ASPLOS'14.
