You are on page 1of 8



Sign in


Observability at Twitter: technical overview,

part II
Tuesday, March 22, 2016 | By Anthony Asta (@anthonyjasta), Senior Engineering Manager, Observability
Engineering [19:00 UTC]

This post is part II of a two part series focused on observability at Twitter.

This is the second post of two part series on observability engineering at Twitter. In this post, we discuss
visualization, alerting, distributed tracing systems, log aggregation/analytics platform, utilization, and
lessons learned.
While collecting and storing the data is important, it is of no use to our engineers unless it is visualized in a
way that can immediately tell a relevant story. Engineers use the CQL query language to plot time series
data on charts inside a browser. A chart is the most basic, fundamental visualization unit in observability
products. Charts are often embedded and organized into dashboards, but can also be created ad hoc in
order to quickly share information while performing a deploy or diagnosing an incident. Also available to
engineers are a command line tool for dashboard creation, libraries of reusable monitoring components,
and an API for automation.
We improved the users cognitive model of monitoring data by unifying visualization and alerting
configurations. Alerts, described in the next section, are simply predicates applied to the same time series
data used for visualization and diagnosis. This makes it easier for the engineers to reason about the state
of their service because all the related data is in one place.




Dashboards and charts are equipped with many tools to help engineers drill down into their metrics. They
can change the arrangement and presentation of their data with stack and fill options, they can toggle
between linear and logarithmic chart scales, they can select different time granularities (per-minute, perhour, or per-day). Additionally, engineers can choose to view live, near real-time data as it comes into the
pipeline or dive back into historical data. When strolling through the offices, its common to see these
dashboards on big screens or an engineers monitor. Engineers at Twitter live in these dashboards!

Visualization use cases include hundreds of charts per dashboard and thousands of data points per chart.
To meet the required browser chart performance, an in-house charting library was developed.
Our alerting system tells our engineers when their service is degraded or broken. To use alerting, the
engineer sets conditions on their metrics and we notify them when those conditions are met.
The alerting system can handle over 25,000 alerts, evaluated minutely. Alert evaluation is partitioned
across multiple boxes for scalability and redundancy with failover in the case of node failure.
While our legacy system has served Twitter well, we have migrated to a new distributed alerting system
that has additional benefits:

Inter-data center alert failover in the case of zone failures

Alert evaluation catchup in the case of node failures

Alert execution isolation so one bad alert wont take down others

Non-impacting deployments so users dont lose visibility

Unified object model for alerting and visualization configurations




The visualization service allows engineers to interact with the alerting system and provides a UI for actions
such as viewing alert state, referencing runbook, and snoozing alerts.

Dynamic configuration
As our system becomes more complex, we need a lightweight mechanism to deploy configuration changes
to a large number of servers so that we can iterate quickly as part of the development process without
restarting the service. Our dynamic configuration library provides a standard way of deploying and
updating configuration for both Mesos/Aurora services and services deployed on dedicated machines. The
library uses Zookeeper as a source of truth for the configuration. We use a command line tool to parse the
configuration files and update the configuration data in Zookeeper. Services relying on this data receive a
notification of the changes within a few seconds:

Distributed tracing system (Zipkin)

Because of the limited number of engineers on the team, we wanted to tap into the growing Zipkin open
source community, which has been working on the OSS Twitter Zipkin, to accelerate our development




velocity. As a result, the observability team decided to completely open source Zipkin through the Open
Zipkin project. We have since worked with the open source community to establish governance and
infrastructure models to ensure change is regularly reviewed, merged and released. These models have
proven to work well: 380 pull requests have been merged into 70 community-driven releases in 8 months.
All documentation and communication originates from the Open Zipkin community. Going forward, Twitter
will deploy zipkin builds directly from the Open Zipkin project into our production environments.
Log aggregation/analytics platform
LogLens is a service that provides indexing, search, visualization, and analytics of service logs. It was
motivated by two specific gaps in developer experience when running services on Aurora/Mesos.

The coupling between the lifetime of service logs and the lifetime of the transient resource containers
the task was scheduled on caused a lot of uncertainty in our ability to triage recent incidents because of
lost logs.

The difficulty in quickly searching through all of the distinct logs generated by the many components
that comprised a service increased the response time for live incidents.

The LogLens service was designed around the following prioritizations ease of onboarding, prioritizing
availability of live logs over cost, prioritizing cost over availability for older logs, and the ability to operate
the service reliably with limited developer investment.
Customers can onboard their services through a self-service portal that provisions an index for their
service logs with reserved capacity and burst headroom. Logs are retained on HDFS for 7 days and a cache
tier serves the last 24 hours of logs in real time and older logs on demand.

As Twitter and observability grow, service owners want visibility into their usage of our platform. We track




all the read and write requests to Cuckoo, and use them to calculate a simple utilization metric, defined as
the read/write ratio. This tracking data is also useful for our growth projection and capacity planning.
Our data pipeline aggregates event data on a daily basis, and we store the output in both HDFS and
Vertica. Our users can access the data in three different ways. First, we send out periodic utilization and
usage reports to individual teams. Second, users can visualize the Vertica data with Tableau, allowing them
to do deep analysis of the data. Finally, we also provide our users with a Utilization API with detailed
actionable suggestions. This API, beyond just showing the basic utilization and usage numbers, is also
designed to help users drill down into which specific groups of metrics are not used.

Since this initiative came into play, these tools have allowed users to close the gap between their reads and
writes in two ways: by simply reducing the number of unused metrics they write, or by replacing individual
metrics with aggregate metrics. As a result, some teams have been able to reduce their metric footprint by
an order of magnitude.
Lessons learned

Pull vs push in metrics collection: At the time of our previous blog post, all our metrics were
collected by pulling from our collection agents. We discovered two main issues:

There is no easy way to differentiate service failures from collection agent failures. Service response
time out and missed collection request are both manifested as empty time series.

There is a lack of service quality insulation in our collection pipeline. It is very difficult to set an
optimal collection time out for various services. A long collection time from one single service can
cause a delay for other services that share the same collection agent.

In light of these issues, we switched our collection model from pull to push and increased our service
isolation. Our collection agent on each host only collects metrics from services running on that specific




host. Additionally, each collection agent sends separate collection status tracking metrics in addition to the
metrics emitted by the services.
We have seen a significant improvement in collection reliability with these changes. However, as we moved
to self service push model, it becomes harder to project the request growth. In order to solve this problem,
we plan to implement service quota to address unpredictable/unbounded growth.

Fault tolerance: As one of the most critical services at Twitter, we bear the responsibility of providing
high available observability services even in the event of catastrophic failures, such as a complete DC
outage. In order to achieve that, we followed two principles

Cross-DC redundancy: Some of our most critical metrics are sent to more than one DC for
redundancy. This makes us resistant to a single DC failure.

Eliminate/decouple unnecessary dependencies on other libraries/services: In some cases of our

development, we intentionally remove dependencies on some widely used internal infrastructures,
such as the Twitter Front End, TFE, to avoid downtime event caused by failures of those systems. In
other cases, we use dedicated clusters and instances, like Manhattan and ZooKeeper, to decouple
our failure from that of the services we monitor.

Learn more
Want to know more about some of the challenges faced building Twitters observability stack? check out
the following:

Twitter Flight 2015 talk by Caitie McCaffrey

Observability Engineering team: Anthony Asta, Jonathan Cao, Hao Huang, Megan Kanne, Caitie McCaffrey,
Mike Moreno, Sundaram Narayanan, Justin Nguyen, Aras Saulys, Dan Sotolongo, Ning Wang, Si Wang

Tags: developers, infrastructure, and visualizations

Older post

Newer post




Under the hood

Tools, projects & community

Related posts

Observability at Twitter: technical overview, part I

Observability at Twitter
Distributed Systems Tracing with Zipkin
Manhattan, our real-time, multi-tenant distributed database for Twitter scale












advertising (1)
A/B testing (1)
infrastructure (7)
MoPub (1)
downloads (1)
international (1)
science (2)
experiments (5)
women in tech (1)
Show more tags













Ads info



2016 Twitter, Inc.