
FACILITATING THE SPREAD OF KNOWLEDGE AND INNOVATION IN ENTERPRISE SOFTWARE DEVELOPMENT

Scalability
eMag Issue 11 - April 2014

Interview with Raffi Krikorian on Twitter's Infrastructure

Raffi Krikorian, Vice President of Platform Engineering at Twitter,
gives an insight on how Twitter prepares for unexpected traffic peaks
and how system architecture is designed to support failure. PAGE 4

INTERVIEW: ADRIAN COCKCROFT ON HIGH AVAILABILITY, BEST PRACTICES, AND LESSONS LEARNED IN THE CLOUD P. 9
TO EXECUTION PROFILE OR TO MEMORY PROFILE? THAT IS THE QUESTION. P. 12
VIRTUAL PANEL: USING JAVA IN LOW LATENCY ENVIRONMENTS P. 16
RELIABLE AUTO-SCALING USING FEEDBACK CONTROL P. 25
Contents

Interview with Raffi Krikorian on Twitter’s Infrastructure Page 4


Raffi Krikorian, Vice President of Platform Engineering at Twitter, gives an insight on how
Twitter prepares for unexpected traffic peaks and how system architecture is designed to
support failure.

Interview: Adrian Cockcroft on High Availability, Best Practices, and Lessons Learned in the Cloud Page 9
Netflix is a widely referenced case study for how to effectively operate a cloud application
at scale. While their hyper-resilient approach may not be necessary at most organizations,
Netflix has advanced the conversation about what it means to build modern systems. In this
interview, InfoQ spoke with Adrian Cockcroft who is the Cloud Architect for the Netflix
platform.

To Execution Profile or to Memory Profile? That Is the Question Page 12
There are times when memory profiling will provide a clearer picture than execution
profiling to find execution hot spots. In this article Kirk Pepperdine talks through some
indicators for determining when to use which kind of profiler.

Virtual Panel: Using Java in Low Latency Environments Page 16


Java is increasingly being used for low latency work where previously C and C++ were the
de-facto choice. InfoQ brought together four experts in the field to discuss what is driving the
trend, and some of the best practices when using Java in these situations.

Reliable Auto-Scaling using Feedback Control Page 25


Philipp K. Janert explains how to reliably auto-scale systems using a reactive approach based
on feedback control which provides a more accurate solution than deterministic or rule-
based ones.

Interview with Raffi Krikorian on Twitter's Infrastructure
by Xuefeng Ding

Twitter’s Raffi Krikorian gives insight on how the company prepares for unexpected
traffic peaks and how system architecture is designed to support failure.
InfoQ: Hi, Raffi. Would you please introduce yourself to the audience and the readers of InfoQ?

Raffi: Sure. My name is Raffi Krikorian. I'm the vice-president of platform engineering at Twitter. We're the team that runs basically the backend infrastructure for all of Twitter.

InfoQ: With the help of Castle in the Sky, Twitter created a new peak tweets-per-second record. How does Twitter deal with unpredictable peak traffic?

Raffi: What you're referring to is the Castle in the Sky event, which is what we call it internally. That was a television show that aired in Tokyo. We set our new record of around 34,000 tweets a second coming into Twitter during that event. Normally, Twitter experiences something on the order of 5,000 to 10,000 tweets a second, so this is pretty far out of our standard operating bounds. I think it says a few things about us. I think it says how Twitter reacts to the world at large, like things happen in the world and they get reflected on Twitter.

So the way that we end up preparing for something like this is really years of work beforehand. This type of event could happen at any time without real notice. So we do load tests against the Twitter infrastructure. We'd run those on the order of every month – I don't know what the exact schedule is these days – and then we do analyses of every single system at Twitter.

When we build architecture and systems at Twitter, we look at the performance of all those systems on a weekly basis to really understand what the theoretical capacity of the systems looks like, right now on a per-service basis, and then we try to understand what the theoretical capacity looks like overall. From that, we can decide whether we have the right number of machines in production at any given time or whether we need to buy more computers, and we can have a rational conversation on whether or not the system is operating efficiently.

So if we have certain services, for example, that can only take half the number of requests a second as other services, we should look at those and understand architecturally: are they performing correctly or do we need to make a change?


So for us, the architecture to get to something like the Castle in the Sky event is a slow evolutionary process. We make a change, we see how that change reacts and how that change behaves in the system, and we make a decision on a slow-rolling basis of whether or not this is acceptable to us. We make a tradeoff, like do we buy more machinery or do we write new software in order to withstand this?

While we have never experienced an event like Castle in the Sky before, some of our load tests have pushed us to those limits already, so we were comfortable when it happened in real life. We're like, "Yes, it actually worked."

InfoQ: Are there any emergency plans in Twitter? Do you practice for unusual times, such as shutting down some servers or switches?

Raffi: Yeah. We do two different things, basically, as our emergency planning – maybe three, depending on how you look at it. Every system is carefully documented for what would turn it on and what would turn it off. We have what we call "runbooks" for every single system so we understand what we would do in an emergency. We've already thought through the different types of failures. We don't believe we've thought through everything but we think we've documented at least the most common ones and we understand what we need to do.

Two, we're always running tests against production, so we understand what the system would look like when we hit it really hard and we can practice. So we hit it really hard and teams on call might get a page or something, and we can try to decide whether or not we do need to do something differently and how to react to that.

And third, we've taken some inspiration from Netflix. Netflix has what they call their Chaos Monkey, which kills machines in production. We have something similar to that within Twitter that helps make sure that we didn't accidentally introduce a single point of failure somewhere. We can randomly kill machines within the data center and make sure that the service doesn't see a blip while that's happening.

All this requires us to have excellent transparency with respect to the success rate of all the different systems. We have a massive board. It's a glass wall with all these graphs on it that show us what's going on within Twitter. And when these events happen, we can see in an instant whether or not something is changing, whether it would be traffic to Twitter or a failure within a data center, so that we can react to it as quickly as we can.

InfoQ: How do you isolate the broken module in the system? When something goes wrong, what's your reaction at the first moment?

Raffi: The way that Twitter is architected these days is that a failure should stay relatively constrained to the feature in which the failure occurred. Of course, the deeper you get down the stack, the bigger the problem becomes. So if our storage mechanisms all of a sudden have a problem, a bunch of different systems would show something going wrong. For example, if someone made a mistake on the Web site, it won't affect the API these days.

The way that we know that something is going wrong again is by being able to see the different graphs of the system. We have alerts set up over different thresholds on a service-by-service basis. So, if the success rate of the API fell below some number, a bunch of pagers immediately go off; there's always someone on call for every single service at Twitter and they can react to that as quickly as they can.

Our operations team and our network command center will also see this and might try some really rudimentary things, the equivalent of "should we turn it off and on again and see what happens?" Meanwhile, the actual software developers on a second track try to understand what is going wrong with the system. So, operations is trying to make sure the site comes back as quickly as it can while software development is trying to understand what actually went wrong and to determine whether we have a bug that we need to take care of.

So this is how we end up reacting. But, like I said, the architecture at Twitter keeps failure fairly well constrained. If we think it's going to propagate or we think that, for example, the social graph is having a problem, the social-graph team will then start immediately notifying everyone else just in case they should be on alert for something going wrong.

One of our strengths these days, I like to say jokingly, is emergency management: what we do in a case of disaster because it could happen at any time. My contract with the world is that Twitter will be up so you don't have to worry about it.
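The alerting rule Krikorian describes, paging the on-call engineer when a service's success rate drops below its threshold, reduces to a very small check. The sketch below is not Twitter's code; the service names, rates, and threshold are invented purely to show the shape of such a check.

import java.util.HashMap;
import java.util.Map;

// Toy illustration of per-service success-rate alerting; all values are made up.
public class SuccessRateAlerts {
    public static void main(String[] args) {
        Map<String, Double> successRate = new HashMap<>();
        successRate.put("api", 0.971);   // below threshold, should page
        successRate.put("web", 0.999);   // healthy

        double threshold = 0.995;        // a real system keeps a threshold per service

        for (Map.Entry<String, Double> entry : successRate.entrySet()) {
            if (entry.getValue() < threshold) {
                System.out.printf("PAGE on-call: %s success rate %.1f%% is below %.1f%%%n",
                        entry.getKey(), entry.getValue() * 100, threshold * 100);
            }
        }
    }
}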


InfoQ: The new architecture helps a lot in stability and performance. Could you give us a brief introduction to it?

Raffi: Sure. When I joined Twitter a couple of years ago, we ran the system on what we call the monolithic codebase. Everything you had to do with the software at Twitter was in one codebase that anyone could deploy, anyone could touch, anyone could modify. That sounds great. In theory, that's actually excellent. It means that every developer in Twitter is empowered to do the right thing.

In practice, however, there's a balancing act. Developers then need to understand how everything actually works in order to make a change. And in practical reality, the concern I would have is that with the speed at which Twitter is writing new code, people don't give deep thought to places they haven't seen before. I think this is standard in the way developers write software. It's like, "I don't understand what I fully need to do to make this change, but if I change just this one line it probably gets the effect I want." I'm not saying that this is a bad behavior. It's a prudent and expedient behavior. But this means that technical debt builds up when you do that.

So what we've done is we've taken this monolithic codebase and broken it up into hundreds of different services that comprise Twitter. This way, we can have actual real owners for every single piece of business logic and every single piece of functionality at Twitter. There's actually a team responsible for managing photos for Twitter. There's another team who manages the URLs for Twitter. There are now subject experts throughout the company, and you could consult them when you want to make a feature change that would change something – for example, how URLs work.

Breaking up the codebase in all these different ways and having subject-matter experts also allows things that we've spoken about: isolation for failure and isolation for feature development. If you want to change the way tweets work, you only have to change a certain number of systems. You don't have to change everything in Twitter anymore, so we can have good isolation both for failure and for development.

InfoQ: What's the role of Decider in the system?

Raffi: Decider is one of our runtime configuration mechanisms at Twitter. What I mean by that is we can turn off features and software in Twitter without doing a deployment. Every single service at Twitter is constantly looking to the Decider system for the current runtime values of Twitter. How that practically maps is, for example, the Discover homepage has a Decider value that wraps it, and that Decider value tells Discover whether it's on or off right now.

So I can deploy Discover into Twitter and have it deployed in the state that Decider says it should be. We don't get an inconsistent state. The Discover page, or any feature at Twitter, runs across many machines. You don't want to get in the inconsistent state where some of the machines have the feature and some of them don't. So we can deploy it in the off state using Decider and then, when it is on all the machines that we want it to be on, we can turn it on across the data center by flipping a Decider switch.

This also gives us the ability to do percentage-based control. I can say that now that it's on all of the machines, I only want 50% of users to get it. I can actually make that decision as opposed to it being a side effect of the way that things are being deployed in Twitter. This allows us to have runtime control over Twitter without having to push code. Pushing code is a dangerous thing; the highest correlation to failure in a system like ours, not just Twitter but any big system, is software-development error. This way we can deploy software in a relatively safe way because it's off. Turn it on really slowly, purposefully, make sure it's good, and then ramp it up as fast as I want.
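The Decider mechanism described above is essentially a dynamic switch with a percentage dial. The class below is not Twitter's Decider, just a hypothetical sketch of the idea: a value of 0 keeps a feature dark, and a deterministic hash of the user ID means a 50% setting covers the same half of users on every machine that reads the same value.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical Decider-style switch: 0 = off, 100 = on for everyone.
public class FeatureDecider {
    // In a real system this map would be refreshed from a central configuration service.
    private final Map<String, Integer> percentages = new ConcurrentHashMap<>();

    public void update(String feature, int percentage) {
        percentages.put(feature, percentage);
    }

    // Deterministic per-user decision so a 50% rollout is stable across requests and machines.
    public boolean isOn(String feature, long userId) {
        int pct = percentages.getOrDefault(feature, 0);
        return Math.floorMod(Long.hashCode(userId), 100) < pct;
    }

    public static void main(String[] args) {
        FeatureDecider decider = new FeatureDecider();
        decider.update("discover_homepage", 0);   // deployed dark
        decider.update("discover_homepage", 50);  // ramp to half of users
        System.out.println(decider.isOn("discover_homepage", 12345L));
    }
}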


InfoQ: How does Twitter push code online? Would you please share the deployment process with us? For example, how many different stages? Do you choose daily pushing or weekly pushing or both?

Raffi: Twitter deployment, because we have this services architecture, is up to the control of every individual team. So the onus is on the team to make sure that when they're deploying code, everyone that may be affected by it should know that you're doing it, and the network control center should also know what you're doing so they have a global view of the system. But it's really up to every single team to decide when and if they want to push.

On average, I would say teams have a bi- or tri-weekly deployment schedule. Some teams deploy every single day; some teams only deploy once a month. But the deployment process looks about the same to everybody: you deploy into a development environment. This is so developers can hack on it quickly, make changes, look at the product manager, look at the designer, and make sure it does the right thing. Then we deploy into what we call the "Canary system" within Twitter, which means that it's getting live production traffic but we don't rely on its results just yet. So it's just basically loading it to make sure it handles it efficiently, and we can look at the results that would have been returned and inspect them to make sure that it did what we thought it would do given live traffic.

Our testing scenarios may not have covered all the different edge cases that the live traffic gets, so it's one way we learn what the real testing scenarios should look like. After we go into Canary, we deploy dark, then we slowly start to ramp it up to really understand what it looks like at scale. That ramp-up could take anywhere from a day to a week. We've had different products that we've ramped to 100% in the course of a week or two. We've added other products that we've ramped up to 100% in the course of minutes.

Again, it's really up to the team. And each team is responsible for their feature, is responsible for their service. So it's their call on how they want to do it, but those stages of development – Canary, dark reading, ramp up by Decider – is the pattern that everyone follows.

InfoQ: There are huge amounts of data in Twitter. You must have some special infrastructure (such as Gizzard and Snowflake) and methods to store the data, and even to process it in real time.

Raffi: That's really two different questions, I think. There is how do we ingest all this data that's coming into Twitter, because Twitter is a real-time system with latency for a tweet to get delivered in milliseconds to Twitter users. And then there's the second question of what we do with all that data.

For the first one, you're right; we have systems like Snowflake, Gizzard, and things like that to handle tweet ingestion. Tweets are only one piece of data that comes into Twitter, obviously. We have things like favorites. We have retweets. We have people sending direct messages. People change their avatar images, their background images, and things like that. People click on URLs and load Web pages. These are all events that are coming into Twitter.

So we begin to ingest all this and log it so we can do analysis. It's a pretty hard thing. We actually have different SLAs depending on what kind of data comes in. Tweets, we measure in milliseconds. In order to get around database locking, for example, we developed Snowflake, which can generate unique IDs for us incredibly quickly and do it decentralized so that we don't have a single point of failure in generating IDs for us.

We have Gizzard, which handles data flowing in and shards it as quickly as possible so that we don't have hot spots on different clusters in the system. It tries to probabilistically spread the load so that the amount of data coming in doesn't overload the databases. Again, tweets go through very fast on the system.

Logs of, for example, people clicking on things or viewing tweets have their SLA measured in minutes as opposed to milliseconds. Those go into a completely different pipeline. Most of it is based on Scribe these days. So, those slowly trickle through, get aggregated, get collected, and get dumped into HDFS so we can analyze them later.

For long-term retention, all of the data, whether it be real-time or not, ends up in HDFS and that's where we run massive MapReduce and Hadoop jobs to really understand what's going on in the system.

So, we try to achieve a balance of what needs to be taken care of right now, especially given the onslaught of data we have, and where we put things, because this data accumulates very fast. If Twitter sees 400 million tweets a day and has been running for a couple of years now… you can imagine the size of our corpus. HDFS handles all that for us, and we can run these massive MapReduce jobs that way.
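Snowflake, mentioned above, avoids a single point of failure by letting every worker mint IDs locally from a timestamp, a worker number, and a per-millisecond sequence. The class below is not Twitter's implementation; the bit widths and epoch are assumptions chosen only to illustrate how decentralized, roughly time-ordered IDs can be composed.

// A simplified Snowflake-style ID generator (hypothetical constants):
// 41 bits of milliseconds since a custom epoch | 10 bits of worker ID | 12 bits of sequence.
public class SnowflakeStyleIdGenerator {
    private static final long CUSTOM_EPOCH_MS = 1288834974657L; // assumed epoch
    private final long workerId;        // 0..1023, unique per machine
    private long lastTimestampMs = -1L;
    private long sequence = 0L;         // resets every millisecond

    public SnowflakeStyleIdGenerator(long workerId) {
        this.workerId = workerId & 0x3FF;
    }

    public synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now == lastTimestampMs) {
            sequence = (sequence + 1) & 0xFFF;   // up to 4096 IDs per ms per worker
            if (sequence == 0) {                 // sequence exhausted: wait for the next millisecond
                while (now <= lastTimestampMs) {
                    now = System.currentTimeMillis();
                }
            }
        } else {
            sequence = 0;
        }
        lastTimestampMs = now;
        return ((now - CUSTOM_EPOCH_MS) << 22) | (workerId << 12) | sequence;
    }

    public static void main(String[] args) {
        SnowflakeStyleIdGenerator generator = new SnowflakeStyleIdGenerator(7);
        System.out.println(generator.nextId());
        System.out.println(generator.nextId());
    }
}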


InfoQ: Twitter is an amazing place for engineers. What's the growth path of an engineer in Twitter? How would one become a successful geek like you?

Raffi: Well, I can't say I'm a successful engineer since I don't write software anymore. I started at Twitter as an engineer, and I've risen into this position of running a lot of engineering these days.

Twitter has a couple of different philosophies and mentalities around it, but we have a career path for engineers that basically involves tackling harder and harder and harder problems. We would like to say that it doesn't actually matter how well the feature you built does. In some cases, it does. But we really like the level of technical thought and technical merit you've put into the project you work on.

So growth in Twitter is done very much through a peer-based mechanism. To be promoted from one level to the next level at Twitter requires consensus. It requires a bunch of engineers at that higher level to agree that, yes, you've done the work needed in order to get to this level at Twitter.

To help with that, managers make sure projects go to engineers that are looking for big challenges. Engineers can move between teams. They're not stuck on the tweet team, for example, or a timeline team. If an engineer says, "I want to work on the mobile team because that's interesting. I think there's career growth for me," my job as a person that manages a lot of this is to make that possible. You can do almost whatever you want within Twitter. I tell engineers what my priorities are in running engineering and what the company's priorities are in user growth or money or features to build. And then engineers should flow to the projects that they think they can make the biggest impact on.

On top of that, I run a small university within Twitter that we call Twitter University. It's a group of people whose whole job is training. For example, if an engineer wants to join the mobile team but is a back-end Java developer, we say, "Great. We've created a training class so you can learn Android engineering or iOS engineering, and you can take a one-week-long class that will get you to the place that you've committed to with that codebase, and then you can join that team for real." This gives you a way to sort of expand your horizons within Twitter and a way to safely decide whether or not you want to go and try something new.

We invest in our engineers because, honestly, they're the backbone of the company. The engineers build the thing that we all thrive on within Twitter, and that the world uses, so I give them as many opportunities as I can in order to try different things and to geek out in lots of different ways.

ABOUT THE INTERVIEWEE
Raffi Krikorian is vice-president of platform engineering at Twitter. His teams manage the business logic, scalable delivery, APIs, and authentication of Twitter's application. His group helped create the iOS 5 Twitter integration as well as The X Factor Twitter voting mechanism.

READ THIS INTERVIEW ONLINE ON InfoQ


Interview: Adrian Cockcroft on High Availability, Best Practices, and Lessons Learned in the Cloud
by Richard Seroter

Netflix is a widely referenced case study for how to effectively operate a cloud
application at scale. While their hyper-resilient approach may not be necessary at
most organizations – and the jury is out on that assumption – Netflix has advanced
the conversation about what it means to build modern systems. InfoQ spoke with
Adrian Cockcroft, who is the cloud architect for the Netflix platform.
InfoQ: What does "high availability 101" look like for a new Netflix engineer? How do they learn best practices, and what are the main areas of focus?

Cockcroft: We run an internal "boot camp" every few months for new engineers. The most recent version is a mixture of presentations about how everything works and some hands-on work making sure that everyone knows how to build code that runs in the cloud. We use a version of the Netflix OSS RSS Reader as a demo application.

InfoQ: Are there traditional Web-development techniques or patterns that you often ask engineers to "forget" when working with cloud-scale distributed systems?

Cockcroft: Sticky session-based programming doesn't work well so we make everything request scoped, and any cross-request information must be stored in memcached using our EVcache mechanism (which replicates the data across zones).

InfoQ: You and others at Netflix have spoken at length about expecting failures in distributed systems. How do you specifically recommend that architects build out circuit breakers and employ other techniques for preventing cascading failures in systems?

Cockcroft: The problem with dependencies between services is that it rapidly gets complicated to keep track of them, and it's important to multi-thread calls to different dependencies, which gets tricky when managing nested calls and responses. Our solution to this is based on the functional reactive pattern that we've implemented using RxJava, with a backend circuit-breaker pattern wrapped around each dependency using Hystrix. To test that everything works properly under stress, we use Latency Monkey to inject failures and high latency into dependent service calls. This makes sure we have the timeouts and circuit breakers calibrated properly, and uncovers any "unsafe" dependencies that are being called directly, since those can still cause cascading failures.
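For readers who have not used Hystrix, the snippet below shows the general shape of wrapping a single dependency call in a circuit breaker. It is a minimal sketch rather than Netflix code: the command name, group key, and placeholder remote call are invented, and a production command would also configure timeouts and thread pools explicitly.

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

// Hypothetical dependency wrapped in a Hystrix circuit breaker.
public class RecommendationsCommand extends HystrixCommand<String> {
    private final String userId;

    public RecommendationsCommand(String userId) {
        super(HystrixCommandGroupKey.Factory.asKey("RecommendationService"));
        this.userId = userId;
    }

    @Override
    protected String run() {
        // Runs on a Hystrix thread pool, subject to the configured timeout.
        return callRemoteService(userId);
    }

    @Override
    protected String getFallback() {
        // Served when the call fails, times out, or the circuit is open.
        return "default-recommendations";
    }

    private String callRemoteService(String userId) {
        return "recommendations-for-" + userId; // stand-in for a real HTTP/RPC call
    }
}

A caller would run new RecommendationsCommand("42").execute() for a blocking call, or observe() to obtain an RxJava Observable, which is where the functional reactive pattern and the circuit breaker meet.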


InfoQ: Netflix OSS projects cover a wide range of services including application deployment, billing, and more. Which of these projects do you consider most indispensable to your team at Netflix, and why?

Cockcroft: One of our most powerful mechanisms and somewhat overlooked Netflix OSS projects is the Zuul gateway service. This acts as an intelligent routing layer, which we can use for many purposes: handling authentication; geographic and content aware routing; scatter/gather of underlying services into a consistent external API; etc. It's dynamically programmable, and can be reconfigured in seconds. In order to route traffic to our Zuul gateways, we need to be able to manage a large number of DNS endpoints with ensemble operations. We've built the Denominator library to abstract away multiple DNS vendor interfaces to provide the same high levels of functionality. We have found many bugs and architectural problems in the commonly used DNS-vendor-specific APIs, so as a side effect we have been helping fix DNS management in general.

InfoQ: Frameworks often provide a useful abstraction on top of complex technology. However, are there cases where an abstraction shields developers from truly understanding something more complex, but useful?

Cockcroft: Garbage collection lets developers forget about how much memory they are using and consuming. While it helps them quickly write code, the sheer volume of garbage and number of times data is copied from one memory location to another is not usually well understood. While there are some tools to help (we open-sourced our JVM GCviz tool), it's a common blind spot. The tuning parameters for setting up heaps and garbage-collection options are confusing and are often set poorly.

InfoQ: Netflix is a big user of Cassandra, but is there any aspect of the public-facing Netflix system that uses a relational database? How do you think that modern applications should decide between NoSQL and relational databases?

Cockcroft: The old Netflix DVD-shipping service still runs on the old code base on top of a few large Oracle databases. The streaming service has all its customer-request-facing services running on Cassandra, but we do use MySQL for some of our internal tools and non-customer-facing systems such as the processes that we use to ingest metadata about new content. If you want to scale and be highly available, use NoSQL. If you are doing rapid continuous delivery of functionality, you will eventually want to denormalize your data model and give each team its own data store so they can iterate their data models independently. At that point, most of the value of a unified relational schema is gone anyway.

InfoQ: Can you give us an example of something at Netflix that didn't work because it was too sophisticated and made you opt for a simpler approach?

Cockcroft: There have been cases where teams decided that they wanted to maintain strong consistency, so they invented complex schemes that they thought would also keep their services available, but this tends to end up with a lot of downtime, and eventually a much simpler and more highly available model takes over. There is less consistency guarantee with the replacement, and perhaps we had to build a data-checking process to fix things up after the event if anything goes wrong. A lot of Netflix outages around two years ago were due to an attempt to keep a datacenter system consistent with the cloud, and cutting the tie to the datacenter so that Cassandra in the cloud became the master copy made a big difference.

InfoQ: How about something you built at Netflix that failed because it was too simple?

Cockcroft: Some groups use Linux load-average as a metric to tell if their instances are overloaded. They then want to use this as an input to autoscaling. I don't like this because load-average is time-decay weighted so it's slow to respond, and it's non-linear so it tends to make autoscaler rules over-react. As a simple rule, total (user+system) CPU utilization is a much better metric, but it can still react too slowly. We're experimenting with more sophisticated algorithms that have a lot more inputs, and hope to have a Netflix Tech Blog post on this issue fairly soon (keep watching http://techblog.netflix.com for technology discussion and open source project announcements).
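The difference between the two signals Cockcroft contrasts can be observed directly from the JVM. This is not Netflix tooling, only a minimal sketch: getSystemLoadAverage() returns the time-decayed load average he argues against, while getSystemCpuLoad() from the com.sun.management extension approximates the total CPU utilization he prefers as an autoscaling input (the cast assumes a HotSpot/OpenJDK runtime).

import java.lang.management.ManagementFactory;

// Minimal comparison of the two scaling signals discussed above.
public class ScalingSignals {
    public static void main(String[] args) {
        java.lang.management.OperatingSystemMXBean std =
                ManagementFactory.getOperatingSystemMXBean();
        com.sun.management.OperatingSystemMXBean sun =
                (com.sun.management.OperatingSystemMXBean) std; // HotSpot/OpenJDK-specific view

        double loadAverage = std.getSystemLoadAverage();   // 1-minute, decay-weighted
        double cpuUtilization = sun.getSystemCpuLoad();    // 0.0-1.0, recent total CPU

        System.out.printf("load average: %.2f, total CPU: %.0f%%%n",
                loadAverage, cpuUtilization * 100);
    }
}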


InfoQ: How do you recommend that developers (at Netflix and other places) set up appropriate sandboxes to test their solutions at scale? Do you use the same production-level deployment tools to push to developer environments? Should each developer get their own?

Cockcroft: Our build system delivers into a test AWS account that contains a complete running set of Netflix services. We automatically refresh test databases from production backups every weekend (overwriting the old test data). We have multiple "stacks" or tagged versions of specific services that are being worked on, and ways to direct traffic by tags are built into Asgard, our deployment tool. There's a complete integration stack that is intended to be close to production availability but reflect the next version of all the services. Each developer has their own tagged stack of things they are working on, that others will ignore by default, but they share the common versions. We re-bake AMIs from a test account to push to the production account with a few changes to environment variables. There is no tooling support to build an AMI directly for the production account without launching it in test first.

InfoQ: Given the size and breadth of the Netflix cloud deployment, how and when do you handle tuning and selection of the ideal AWS instance size for a given service? Do you run basic performance profiling on a service to see if it's memory-bound, I/O-bound, or CPU-bound, and then choose the right type of instance? At what stage of the service's lifecycle do you make these assessments?

Cockcroft: Instance type is chosen primarily based on memory need. We're gradually transitioning where possible from the m2 family of instances to the m3 family, which have a more modern CPU base (Intel E5 Sandy Bridge) that runs Java code better. We then run enough instances to get the CPU we need. The only instances that are I/O intensive are Cassandra, and we use the hi1.4xlarge for most of them. We've built a tool to measure how efficiently we use instances, and it points out the cases where a team is running more instances than they need.

ABOUT THE INTERVIEWEE
Adrian Cockcroft has had a long career working at the leading edge of technology. Before joining Battery in 2013, Adrian helped lead Netflix's migration to a large-scale, highly available public-cloud architecture and the open sourcing of the cloud-native NetflixOSS platform. Prior to that at Netflix he managed a team working on personalization algorithms and service-oriented refactoring. He graduated from The City University, London with a BSc in Applied Physics and Electronics, and was named one of the top leaders in Cloud Computing in 2011 and 2012 by SearchCloudComputing magazine. He can usually be found on Twitter @adrianco.

READ THIS INTERVIEW ONLINE ON InfoQ


To Execution Profile or to Memory Profile? That Is the Question
by Kirk Pepperdine

I recently had a group of developers troubleshoot a problem-riddled application from my performance workshop. After dispensing with a couple of easy wins, the group was faced with a CPU that was running very hot. The group reacted in exactly the same way that I see most teams do when faced with a hot CPU; they fired up an execution profiler hoping that it would help them sort things out. In this particular case, the problem was related to how the application was burning through memory. Now, while an execution profiler can find these problems, memory profilers will paint a much clearer picture. My group had somehow missed a key metric that was telling them that they should have been using a memory profiler. Let's run through a similar exercise here so that we can see when and why it is better to use a memory profiler.

Profilers work by either sampling the top of stack or instrumenting the code with probes, or a combination of both. These techniques are good at finding computations that happen frequently or take a long time. As my group experienced, the information gathered by execution profilers often correlates well with the source of the memory inefficiency. However, it points to an execution problem, which can sometimes be confusing.

The code found in Listing 1 defines the lookup API findCustomer(String,String). The problem here isn't so much in the API itself but more in how the method treats the String parameters. The code concatenates the two strings to form a key that is used to look up the data in a map. This misuse of strings is a code smell in that it indicates that there is a missing abstraction. As we will see, that missing abstraction is not only at the root of the performance problem, but adding it also improves the readability of the code. In this case, the missing abstraction is a CompositeKey<String,String>, a class that wraps the two strings and implements both the equals(Object) and hashCode() methods.

public class CustomerList {

    private final Map customers = new ConcurrentHashMap();

    public Customer addCustomer(String firstName, String lastName) {
        Customer person = new Customer(firstName, lastName);
        customers.put(firstName + lastName, person);
        return person;
    }

    public Customer findCustomer(String firstName, String lastName) {
        return (Customer) customers.get(firstName + lastName);
    }
}

Listing 1. Source for CustomerList
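To make the problem concrete, the driver below shows the kind of load that exposes it. It is not the workshop's actual benchmark; it assumes the CustomerList and Customer classes from Listing 1, and the names and iteration count are invented. Every call to findCustomer(String,String) builds a brand-new key String, which is exactly the churn the rest of the article measures.

public class CustomerListBench {
    public static void main(String[] args) {
        CustomerList customers = new CustomerList();
        customers.addCustomer("Jane", "Doe");

        long hits = 0;
        for (int i = 0; i < 100_000_000; i++) {
            // Each lookup concatenates "Jane" + "Doe" into a fresh String (and its backing char[]).
            if (customers.findCustomer("Jane", "Doe") != null) {
                hits++;
            }
        }
        System.out.println("lookups that hit: " + hits);
    }
}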


Another downside to the style of API used in this example is that it will limit scalability because of the amount of data the CPU is required to write to memory. In addition to the extra work to create the data, the volume of data being written to memory by the CPU creates a back pressure that will force the CPU to slow down. Though this benchmark is artificial in how it presents the problem, it's a problem that is not so uncommon in applications using the popular logging frameworks. That said, don't be fooled into thinking only String concatenation can be at fault. Memory pressure can be created by any application that is churning through memory, regardless of the underlying data structure.

The easiest way to determine if our application is burning through memory is to examine the garbage-collection (GC) logs. GC logs report on heap occupancy before and after each collection. Subtracting occupancy after the previous collection from the occupancy before the current collection yields the amount of memory allocated between collections. If we do this for many records, we can get a pretty clear picture of the application's memory needs. Moreover, getting the needed GC log is cheap and, with the exception of a couple of edge cases, will have no impact on the performance of your application. I used the flags -Xloggc:gc.log and -XX:+PrintGCDetails to create a GC log with a sufficient level of detail. I then loaded the GC log file into Censum, jClarity's GC-log analysis tool.

Table 1. Summary of garbage-collection activity

Censum provides a whole host of statistics (see Table 1), of which we're interested in the "Collection Type Breakdown" (at the bottom). The "% Paused" column (the sixth column in Table 1) tells us that the total time paused for GC was 0.86%. In general, we'd like GC pause time to be less than 5%, which it is. The number suggests that the collectors are able to reclaim memory without too much effort. Keep in mind, however, that when it comes to performance, a single measure rarely tells you the whole story. In this case, we need to see the allocation rates, and in Chart 1 we can see just that.

Chart 1. Allocation rates

In this chart, we can see that the allocation rates initially start out at about 2.75 GB per second. The laptop that I used to run this benchmark under ideal conditions can sustain an allocation rate of about 4 GB per second. Thus this value of 2.75 GB/s represents a significant portion of the total memory bandwidth. In fact, the machine is not able to sustain this rate, as is evidenced by the drop over time in allocation rates. While your production servers may have a larger capacity to consume memory, it is my experience that any machine trying to maintain object-creation rates greater than 500 MB per second will spend a significant amount of time allocating memory. It will also have a very limited ability to scale. Since memory efficiency is the overriding bottleneck in our application, the biggest wins will come from making it more memory efficient.
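The allocation-rate arithmetic described above is simple enough to sketch by hand. The numbers below are hypothetical rather than taken from the article's log; the point is only the subtraction and division that Censum automates across thousands of GC records.

// Occupancy before the current collection minus occupancy after the previous
// collection gives the memory allocated in between; divide by elapsed time for the rate.
public class AllocationRate {
    public static void main(String[] args) {
        double[] gcTimeSec = {10.0, 10.2, 10.4};            // when each GC ran
        long[] beforeKb    = {512_000, 576_000, 576_000};   // heap occupancy before each GC
        long[] afterKb     = {64_000, 64_000, 64_000};      // heap occupancy after each GC

        long allocatedKb = 0;
        for (int i = 1; i < gcTimeSec.length; i++) {
            allocatedKb += beforeKb[i] - afterKb[i - 1];
        }
        double elapsedSec = gcTimeSec[gcTimeSec.length - 1] - gcTimeSec[0];
        System.out.printf("~%.0f MB/s allocated between collections%n",
                (allocatedKb / 1024.0) / elapsedSec);
    }
}

With these made-up figures the application allocates roughly 2.5 GB per second, the kind of value Chart 1 reports.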


Execution Profiling

It should go without saying that if we're looking to improve memory efficiency, we should be using a memory profiler. However, when faced with a hot CPU, our group decided that they should use execution profiling, so let's start with that and see where it leads. I used the NetBeans profiler running in VisualVM in its default configuration to produce the profile in Chart 2.

Chart 2. Execution profile

Looking at the chart, we can see that outside of the Worker.run() method, most of the time is spent in CustomerList.findCustomer(String,String). If the source code were a bit more complex, you could imagine it being difficult to understand why the code is a problem or what you should do to improve performance. Let's contrast this view with the one presented by memory profiling.

Memory Profiling

Ideally, I would like my memory profiler to show me how much memory is being consumed and how many objects are being created. I would also like to know the causal execution paths – that is, the path through the source code that is responsible for churning through memory. I can get these statistics using the NetBeans profiler, once again running in VisualVM. However, I will need to configure the profiler to collect allocation stack traces. This configuration can be seen in Figure 1.

Figure 1. Configuring NetBeans memory profiler

Note that the profiler will not collect for every allocation but only for every 10th allocation. Sampling in this manner should produce the same result as if you were capturing data from every allocation but with much less overhead. The resulting profile is shown in Chart 3.

Chart 3. Memory profile

The chart identifies char[] as the most popular object. Having this information, the next step is to take a snapshot and then look at the allocation stack traces for char[]. The snapshot can be seen in Chart 4.

Chart 4. char[] allocation stack traces

The chart shows three major sources of char[] creation, of which one is opened up so that you can see the details. In all three cases, the root can be traced back to the firstName + lastName operation.

It was at this point that the group tried to come up with numerous alternatives. However, none of the proposed solutions were as efficient as the code produced by the compiler. It was clear that to have the application run faster, we were going to have to eliminate the concatenation. The solution that eventually solved the problem was to introduce a Pair class that took the first and last names as arguments. We called this class CompositeKey as it introduced the missing abstraction. The improved code can be seen in Listing 2.

public class CustomerList {

    private final Map customers = new ConcurrentHashMap();

    public Customer addCustomer(String firstName, String lastName) {
        Customer person = new Customer(firstName, lastName);
        customers.put(new CompositeKey(firstName, lastName), person);
        return person;
    }

    public Customer findCustomer(String firstName, String lastName) {
        return (Customer) customers.get(new CompositeKey(firstName, lastName));
    }
}

Listing 2. Improved implementation using CompositeKey abstraction
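The CompositeKey class used in Listing 2 is not shown in the article, so the following is a plausible sketch of what it looks like; the field names are assumptions. The essential point is that equals(Object) and hashCode() work from the two strings directly, so no concatenated String (or its backing char[]) is ever created during a lookup.

public final class CompositeKey {
    private final String first;
    private final String second;

    public CompositeKey(String first, String second) {
        this.first = first;
        this.second = second;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof CompositeKey)) return false;
        CompositeKey that = (CompositeKey) other;
        return first.equals(that.first) && second.equals(that.second);
    }

    @Override
    public int hashCode() {
        return 31 * first.hashCode() + second.hashCode();
    }
}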


CompositeKey implemented both hashCode() and equals(), thus eliminating the need to concatenate the strings together. While the first benchmark completed in ~63 seconds, the improved version ran in ~21 seconds, a threefold improvement. The garbage collector ran four times, making it impossible to get an accurate picture, but the application consumed in aggregate just under 3 GB of data as opposed to the more than 141 GB consumed by the first implementation.

Two Ways to Fill a Water Tower

A colleague of mine once said that you can fill a water tower one teaspoon at a time. This example proves that you certainly can. However, it's not the only way to fill the tower; you could also run a large hose to fill it very quickly. In those cases, it's unlikely that an execution profiler would pick up on the problem. However, the garbage collector will see the allocation and the recovery, and certainly the memory profiler will see the allocation in sheer byte count. In one application where these large allocations predominated, the development team had exhausted the vast majority of the gains they were going to get by using an execution profiler, yet they still needed to squeeze more out of the app. At that point, we turned on the memory profiler and it exposed one allocation hotspot after another, and with that information we were able to extract a number of significant performance gains. What that team learned is that not only was memory profiling giving them the right view, it was giving them the only view into the problem. This is not to say that execution profiling isn't productive. What it is saying is that sometimes it's not able to tell you where your application is spending all of its time – and in those cases, getting a different perspective on the problem can make all the difference in the world.

ABOUT THE AUTHOR
Kirk Pepperdine has worked in high-performance and distributed computing for nearly 20 years. Since 1998, Kirk has been working on all aspects of performance and tuning in each phase of a project life cycle. In 2005, he helped author the foremost Java-performance tuning workshop that has been presented to hundreds of developers worldwide. Author, speaker, consultant, Kirk was recognized in 2006 as a Java Champion for his contributions to the Java community. He was the first non-Sun employee to present a technical lab at JavaOne, an achievement that opened the opportunity for others in the industry to do so. He was named a JavaOne Rockstar in 2011 and 2012 for his talks on garbage collection. You can reach him by e-mail at kirk@kodewerk.com or on Twitter @kcpeppe.

READ THIS ARTICLE ONLINE ON InfoQ


Virtual Panel: Using Java in Low-Latency Environments
by Charles Humble

Java is increasingly being used for low-latency work where previously C and C++
were the de facto choices.
InfoQ brought together four experts in the field to discuss what is driving the trend
and some of the best practices when using Java in these situations.
The Participants

Peter Lawrey is a Java consultant interested in low-latency and high-throughput systems. He has worked for a number of hedge funds, trading firms, and investment banks.

Martin Thompson is a high-performance and low-latency specialist, with over two decades working with large-scale transactional and big-data systems, in the automotive, gaming, financial, mobile, and content-management domains.

Todd L. Montgomery is vice-president of architecture for Informatica Ultra Messaging and the chief designer and implementer of the 29West low-latency messaging products.

Andy Piper recently joined Push Technology as chief technology officer, from Oracle.

The Questions

Q1: What do we mean by low latency? Is it the same thing as real time? How does it relate to high-performance code in general?

Lawrey: A system with a measured latency requirement that is too fast to see. This could be anywhere from 100 ns to 100 ms.

Montgomery: Real time and low latency can be quite different. The majority view on real time would be determinism over pure speed with very closely controlled, or even bounded, outliers. However, low latency typically implies that pure speed is given much higher priority and some outliers may be, however slightly, more tolerable. This is certainly the case when thinking about hard real time. One of the key prerequisites for low latency is a keen eye for efficiency. From a system view, this efficiency must permeate the entire application stack, the OS, and the network. This means that low-latency systems have to have a high degree of mechanical sympathy to all those components. In addition, many of the techniques that have emerged in low-latency systems over the last several years have come from high-performance techniques in OSs, languages, VMs, protocols, other system-development areas, and even hardware design.


Thompson: Performance is about two things: throughput, i.e. units per second, and response time, otherwise known as latency. It is important to define the units and not just say something should be fast. Real time has a very specific definition and is often misused. Real time is to do with systems that have a real-time constraint from input event to response time regardless of system load. In a hard real-time system, if this constraint is not honored then a total system failure can occur. Good examples are heart pacemakers or missile control systems.

With trading systems, real time tends to have a different meaning in that the system must have high throughput and react as quickly as possible to an event, which can be considered low latency. Missing a trading opportunity is typically not a total system failure so you cannot really call this real time.

A good trading system will have a high quality of execution for which one aspect is to have a low-latency response with little deviation in response time.

Piper: Latency is simply the delay between decision and action. In the context of high-performance computing, low latency has typically meant that transmission delays across a network are low or that the overall delays from request to response are low. What defines "low" depends on the context – low latency over the Internet might be 200 ms whereas low latency in a trading application might be 2 μs. Technically, low latency is not the same as real time – low latency typically is measured as percentiles where the outliers (situations in which latency has not been low) are extremely important to know about. With real time, guarantees are made about the behavior of the system – so instead of measuring percentile delays, you are enforcing a maximum delay. You can see how a real-time system is also likely to be a low-latency system, whereas the converse is not necessarily true. Today, however, the notion of enforcement is gradually being lost so that many people now use the terms interchangeably.

If latency is the overall delay from request to response then it is obvious that many things contribute to this delay – CPU, network, OS, application, even the laws of physics! Thus low-latency systems typically require high-performance code so that software elements of latency can be reduced.

Q2: Some of the often-cited advantages of using Java in other situations include access to the rich collection of libraries, frameworks, application servers, and so on, and also the large number of available programmers. Do these advantages apply when working on low-latency code? If not, what advantages does Java have over C++?

Lawrey: If your application spends 90% of the time in 10% of your code, Java makes optimizing that 10% harder, but writing and maintaining 90% of your code easier, especially for teams of mixed ability.

Montgomery: In the capital markets, especially algorithmic trading, there are a number of factors that come into play. Often, the faster an algorithm can be put into the market, the more advantage it has. Many algorithms have a shelf life and quicker time to market is key in taking advantage of that. With the community around Java and the options available, it can definitely be a competitive advantage, as opposed to C or C++ where the options may not be as broad for the use case. Sometimes, though, pure low latency can rule out other concerns. I think the current difference in performance between Java and C++ is so close that it's not a black-and-white decision based solely on speed. Improvements in GC techniques, JIT optimizations, and managed runtimes have made traditional Java weaknesses with respect to performance into some very compelling strengths that are not easy to ignore.

Thompson: Low-latency systems written in Java tend to not use third-party or even standard libraries for two major reasons. Firstly, many libraries have not been written with performance in mind and often do not have sufficient throughput or response time. Secondly, they tend to use locks when concurrent, and they generate a lot of garbage. Both of these contribute to highly variable response times, due to lock contention and garbage collection respectively.

Java has some of the best tooling support of any language, which results in significant productivity gains. Time to market is often a key requirement when building trading systems, and Java can often get you there sooner.


Piper: In many ways the reverse is true: writing good low-latency code in Java is relatively hard since the developer is insulated from the guarantees of the hardware by the JVM itself. The good news is that this is changing. Not only are JVMs constantly getting faster and more predictable but developers are now able to take advantage of hardware guarantees through a detailed understanding of the way that Java works – in particular, the Java memory model – and how it maps to the underlying hardware. (Indeed, Java was the first popular language to provide a comprehensive memory model that programmers could rely on. C++ only provided one later on.) A good example is the lock-free, wait-free techniques that Martin Thompson has been promoting and that our company, Push, has adopted into its own development with great success. Furthermore, as these techniques become more mainstream, we are starting to see their uptake in standard libraries (e.g. the Disruptor) so that developers can adopt the techniques without needing such a detailed understanding of the underlying behavior.

Even without these techniques, the safety advantages of Java (memory management, thread management, etc.) can often outweigh the perceived performance advantages of C++, and of course JVM vendors have claimed for some time that modern JVMs are often faster than custom C++ code because of the holistic optimizations that they can apply across an application.

Q3. How does the JVM support concurrent programs?

Lawrey: Java has had built-in multi-threading support from the start and high-level concurrency support as standard for almost 10 years.

Montgomery: The JVM is a great platform for concurrent programs. The memory model allows a consistent model for developers to utilize lock-free techniques across hardware, which is a great plus for getting the most out of the hardware by applications. Lock-free and wait-free techniques are great for creating efficient data structures, something we very desperately need in the development community. In addition, some of the standard library constructs for concurrency are quite handy and can make for more resilient applications. With C++11, certain specifics aside, Java is not the only one with access to a lot of these constructs. And the C++11 memory model is a great leap forward for developers.

Thompson: Java (1.5) was the first major language to have a specified memory model. A language-level memory model allows programmers to reason about concurrent code at an abstraction above the hardware. This is critically important, as hardware and compilers will aggressively reorder our code to gain performance, which has visibility issues across threads. With Java, it is possible to write lock-free algorithms that when done well can provide some pretty amazing throughput at low and predictable latencies. Java also has rich support for locks. However, when locks are contended the operating system must get involved as an arbitrator with huge performance costs. The latency difference between a contended and uncontended lock is typically three orders of magnitude.

Piper: Support for concurrent programs in Java starts with the Java Language Specification itself – the JLS describes many Java primitives and constructs that support concurrency. At a basic level, this is the java.lang.Thread class for the creation and management of threads and the synchronized keyword for the mediation of access to shared resources from different threads. On top of this, Java provides a whole package of data structures optimized for concurrent programs (java.util.concurrent), from concurrent hash tables to task schedulers to different lock types. One of the biggest areas of support, however, is the Java memory model (JMM) that was incorporated into the JLS as part of JDK 5. This provides guarantees around what developers can expect when dealing with multiple threads and their interactions. These guarantees have made it much easier to write high-performance, thread-safe code. In the development of Diffusion, we rely very heavily on the JMM in order to achieve the best possible performance.
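The happens-before guarantee the panellists keep returning to is easiest to see in a tiny example. The code below is not from the panel; it is a minimal sketch of safe publication under the JMM. Because ready is volatile, a reader that observes ready == true is also guaranteed to see the fully constructed Config written before it, with no locks involved.

import java.util.concurrent.atomic.AtomicLong;

public class SafePublication {
    static final class Config {
        final int batchSize;
        Config(int batchSize) { this.batchSize = batchSize; }
    }

    private static Config config;                 // plain write, ordered before the volatile write
    private static volatile boolean ready;        // volatile write/read establishes happens-before
    private static final AtomicLong reads = new AtomicLong(); // lock-free counter from java.util.concurrent

    public static void main(String[] args) throws InterruptedException {
        Thread writer = new Thread(() -> {
            config = new Config(512);
            ready = true;                          // publish
        });
        Thread reader = new Thread(() -> {
            while (!ready) { /* busy-spin; fine for a demo */ }
            reads.incrementAndGet();
            System.out.println("reader saw batchSize=" + config.batchSize);
        });
        reader.start();
        writer.start();
        reader.join();
        writer.join();
    }
}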


cover the entire "big picture". I have seen many C/C++ projects spend a lot of time drilling down to the low level and still ending up with longer latencies end to end.

Montgomery: That is kind of tough. The only obvious one would be warm up for JVMs to do appropriate optimizations. However, some of the class and method call optimizations that can be done via class-hierarchy analysis at runtime are not possible currently in C++. Most other techniques can also be done in C++ or, in some cases, don't need to be done. Low-latency techniques in any language often involve what you don't do that can have the biggest impact. In Java, there are a handful of things to avoid that can have undesirable side effects for low-latency applications. One is the use of specific APIs, such as the Reflection API. Thankfully, there are often better choices for how to achieve the same end result.

Thompson: You mention most of the issues in your question. :-) Basically, Java must be warmed up to get the runtime to a steady state. Once in this steady state, Java can be as fast as native languages and in some cases faster. One big Achilles heel for Java is lack of memory-layout control. A cache miss is a lost opportunity to have executed ~500 instructions on a modern processor. To avoid cache misses, we need control of memory layout and then we must access it in a predictable fashion. To get this level of control, and reduce GC pressure, we often have to create data structures in direct byte buffers or go off heap and use Unsafe. Both of these allow for the precise layout of data structures. This need could be removed if Java introduced support for arrays of structures. This does not need to be a language change and could be introduced by some new intrinsics.
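As a rough illustration of the direct-byte-buffer approach described above, the following minimal sketch (the record layout, field names, and offsets are invented for this example) stores fixed-size records in a direct ByteBuffer so that the memory layout is explicit and no per-record objects are created:

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    // Sketch: a flyweight over a direct ByteBuffer giving explicit, predictable
    // memory layout for fixed-size records (no per-record object allocation).
    public final class TradeFlyweight {
        // Invented layout: [price:8][quantity:8][instrumentId:4] = 20 bytes
        private static final int PRICE_OFFSET = 0;
        private static final int QUANTITY_OFFSET = 8;
        private static final int INSTRUMENT_OFFSET = 16;
        private static final int RECORD_SIZE = 20;

        private final ByteBuffer buffer;

        public TradeFlyweight(int maxRecords) {
            buffer = ByteBuffer.allocateDirect(maxRecords * RECORD_SIZE)
                               .order(ByteOrder.nativeOrder());
        }

        public void put(int index, long price, long quantity, int instrumentId) {
            int base = index * RECORD_SIZE;
            buffer.putLong(base + PRICE_OFFSET, price);
            buffer.putLong(base + QUANTITY_OFFSET, quantity);
            buffer.putInt(base + INSTRUMENT_OFFSET, instrumentId);
        }

        public long price(int index) {
            return buffer.getLong(index * RECORD_SIZE + PRICE_OFFSET);
        }

        public long quantity(int index) {
            return buffer.getLong(index * RECORD_SIZE + QUANTITY_OFFSET);
        }
    }

Because the records sit contiguously at fixed offsets, sequential access is predictable and cache-friendly; production implementations (for example those built on Unsafe or specialized libraries) add bounds checking and concurrency handling that are omitted here.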
Piper: The question seems to be based on false premises. At the end of the day, writing a low-latency program is very similar to writing other programs where performance is a concern; the input is code provided by a developer (whether C++ or Java), which executes on a hardware platform with some level of indirection in between (e.g. through the JVM or through libraries, compiler optimizers, etc. in C++). The fact that the specifics vary makes little difference. This is essentially an exercise in optimization and the rules of optimization are, as always:

1. Don't.

2. Don't yet (for experts only).

And if that does not get you where you need to be:

1. See if you actually need to speed it up.

2. Profile the code to see where it's actually spending its time.

3. Focus on the few high-payoff areas and leave the rest alone.

Now, of course, the tools you would use to achieve this and the potential hotspots might be different between Java and C++, but that's just because they are different. Granted, you might need to understand in a little more detail than your average Java programmer would what is going on, but the same is true for C++, and, of course, by using Java there are many things you don't need to understand so well because they are adequately catered for by the runtime. In terms of the types of things that might need optimizing – these are the usual suspects of code paths, data structures, and locks. In Diffusion, we have adopted a benchmark-driven approach where we are constantly profiling our application and looking for optimization opportunities.

Q5. How has managing GC behavior affected the way people code for low latency in Java?

Lawrey: There are different solutions for different situations. My preferred solution is to produce so little garbage that it no longer matters. You can cut your GCs to less than once a day.

At this point, the real reason to reduce garbage is to ensure you are not filling your CPU caches with garbage. Reducing the garbage you are producing can improve the performance of your code by two to five times.

Montgomery: Most of the low-latency systems I have seen in Java have gone to great lengths to minimize or even try to eliminate the generation of garbage. As an example, avoiding the use of Strings altogether is not uncommon. Informatica Ultra Messaging (UM) itself has provided specific Java methods to cater to the needs of many users with respect to object reuse and avoiding some usage patterns. If I had to guess, the most common implication has been the prevalent use of object
reuse. This pattern has also influenced many other non-low-latency libraries such as Hadoop. It's a common technique now within the community to provide options or methods for users of an API or framework to utilize them in a low or zero garbage manner.

In addition to the effect on coding practices, there is also an operational impact for low-latency systems. Many systems will take some, shall we say, creative control of GC. It's not uncommon to only allow GC to occur at specific times of the day. The implications on application design and operational requirements are a major factor in controlling outliers and gaining more determinism.

Thompson: Object pools are employed or, as mentioned in the previous response, most data structures need to be managed in byte buffers or off heap. This results in a C style of programming in Java. If we had a truly concurrent garbage collector then this could be avoided.
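As a minimal sketch of the object-reuse style discussed above (class and field names are invented for the example), events are borrowed from a pool, reset, and returned instead of being allocated per message, so the steady state produces little or no garbage:

    import java.util.ArrayDeque;

    // Sketch: reuse event objects instead of allocating one per message.
    public final class EventPool {
        private final ArrayDeque<MarketEvent> free = new ArrayDeque<>();

        public MarketEvent acquire() {
            MarketEvent e = free.poll();
            return (e != null) ? e : new MarketEvent(); // allocate only while warming up
        }

        public void release(MarketEvent e) {
            e.reset();      // clear state so stale data cannot leak between uses
            free.push(e);
        }

        public static final class MarketEvent {
            long timestamp;
            double price;

            void reset() {
                timestamp = 0L;
                price = 0.0d;
            }
        }
    }

This sketch is single-threaded; in a real pipeline the pool would either be confined to one thread or replaced with a concurrent structure.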
Piper: How long is a piece of java.lang.String? Sorry, I'm being facetious. The truth is that some of the biggest changes to GC behaviour have come about through JVM improvements rather than through programmers' individual coding decisions. HotSpot, for instance, has come an incredibly long way from the early days when you could measure GC pauses in minutes. Many of these changes have been driven by competition – it used to be that BEA JRockit behaved far better than HotSpot from a latency perspective, creating much lower jitter. These days, however, Oracle is merging the JRockit and HotSpot codebases precisely because the gap has narrowed so much. Similar improvements have been seen in other, more modern JVMs such as Azul's Zing, and in many cases developer attempts to improve GC behavior have actually had no net benefit or made things worse.

However, that's not to say that there aren't things that developers can do to manage GC – for instance, by reducing object allocations through either pooling or using off-heap storage to limit memory churn. It's still worth bearing in mind, however, that these are problems that JVM developers are also very focused on, so it still may well be either not necessary to do anything at all or easier to simply buy a commercial JVM. The worst thing you can do is prematurely optimize this area of your applications without knowing whether it is actually a problem or not, since these kinds of techniques increase application complexity through the bypass of a very useful Java feature (GC) and therefore can be hard to maintain.

Q6. When analyzing low-latency applications, are there any common causes or patterns you see behind spikes or outliers in performance?

Lawrey: Waiting for I/O of some type. CPU instruction or data-cache disturbances. Context switches.

Montgomery: In Java, GC pauses are beginning to be well understood and, thankfully, we have better GCs that are available. System effects are common for all languages though. OS scheduling delay is one of the many causes behind spikes. Sometimes it is the direct delay and sometimes it is a knock-on effect caused by the delay that is the real killer. Some OSs are better than others when it comes to scheduling under heavy load. For many developers, the impact that poor application choices can make on scheduling often comes as a surprise and is often hard to debug sufficiently. Of a related note is the delay inherent in I/O and the contention that I/O can cause on some systems. A good assumption to make is that any I/O call may block and will block at some point. Thinking through the implications inherent in that is very often key. And remember, network calls are I/O.

There are a number of network-specific causes for poor performance to cover as well. Let me list the key items to consider.

• Networks take time to traverse. In WAN environments, the time it takes to propagate data across the network is non-trivial.

• Ethernet networks are not reliable; it is the protocols on them that provide reliability.

• Loss in networks causes delay due to retransmission and recovery as well as second-order effects such as TCP head-of-line blocking.

• Loss in networks can occur on the receiver side due to resource starvation in various ways when UDP is in use.

• Loss in networks can occur within switches and routers due to congestion. Routers and switches are natural contention points and when contended for, loss is the tradeoff.
• Reliable network media, like InfiniBand, trade off loss for delay at the network level. The end result of loss causing delay is the same, though.

To a large degree, low-latency applications that make heavy use of networks often have to look at a whole host of causes of delay and additional sources of jitter within the network. Besides network delay, loss is probably a high contender for the most common cause of jitter in many low-latency applications.

Thompson: I see many causes of latency spikes. Garbage collection is the one most people are aware of, but I also see a lot of lock contention, TCP-related issues, and many Linux-kernel issues related to poor configuration. Many applications have poor algorithm design that does not amortize expensive operations like I/O and cache misses under bursty conditions, and thus suffer queuing effects. Algorithm design is often the largest cause of performance issues and latency spikes in the applications I've seen.

Time to safepoint (TTS) is a major consideration when dealing with latency spikes. Many JVM operations require all user threads to be stopped by bringing them to a safepoint. Safepoint checks are typically performed on method returns. The need for safepoints can be anything from revoking biased locks or some JNI interactions and de-optimizing code, through to many GC phases. Often, the time taken to bring all threads to a safepoint is more significant than the work to be done. The work is then followed by the significant cost of waking all those threads to run again. Getting a thread to a safepoint quickly and predictably is often not a considered or optimized part of many JVMs, e.g. during object cloning and array copying.

Piper: The most common cause of outliers is GC pauses; however, the most common cure for GC pauses is GC tuning rather than actual code changes. For instance, simply changing from the parallel collector that is used by default in JDK 6 and JDK 7 to the concurrent mark-sweep collector can make a huge difference to stop-the-world GC pauses that typically cause latency spikes. Beyond tuning, another thing to bear in mind is the overall heap size being used. Very large heaps typically put more pressure on the garbage collector and can cause longer pause times – often simply eliminating memory leaks and reducing memory usage can make a big difference to the overall behavior of a low-latency application.

Apart from GC, lock contention is another major cause of latency spikes, but this can be rather harder to identify and resolve due to its often non-deterministic nature. It's worth remembering also that any time the application is unable to proceed, it will yield a latency spike. This could be caused by many things, even things outside the JVM's control – e.g. access to kernel or OS resources. If these kinds of constraint can be identified then it is perfectly possible to change an application to avoid the use of these resources or to change the timing of when they are used.

Q7. Java 7 introduced support for Sockets Direct Protocol (SDP) over InfiniBand fabric. Is this something you've seen exploited in production systems yet? If it isn't being used, what other solutions are you seeing in the wild?

Lawrey: I haven't used it for Ethernet because it creates quite a bit of garbage. In low-latency systems, you want to minimize the number of network hops and usually it's the external connections that are the only ones you cannot remove. These are almost always Ethernet.

Montgomery: We have not seen this that much. It has been mentioned, but we have not seen it being seriously considered. Ultra Messaging is used as the interface between SDP and the developer using messaging. SDP fits much more into a (R)DMA access pattern than a push-based usage pattern. Turning a DMA pattern into a push pattern is possible, but SDP is not that well-suited for it, unfortunately.

Thompson: I've not seen this used in the wild. Most people use a stack like OpenOnload and network adapters from the likes of Solarflare or Mellanox. At the extreme I've seen RDMA over InfiniBand with custom lock-free algorithms accessing shared memory directly from Java.

Piper: Oracle's Exalogic and Coherence products have used Java and SDP for some time, so in that sense we've seen usage of this feature in production systems for some time also. In terms of developers actually using the Java SDP support directly rather than through some third-party product, no, not so much – but if it adds business benefit then we expect this to change. We ourselves have made use of
latency-optimized hardware (e.g. Solarflare 10GbE adapters) where the benefits are accrued from kernel-driver installation rather than specific Java tuning.

Q8. Perhaps a less Java-specific question, but why do we need to try and avoid contention? In situations where you can't avoid it, what are the best ways to manage it?

Lawrey: For ultra-low latency, this is an issue, but for multi-microsecond latencies, I don't see it as an issue. In situations where you can't avoid it, be aware of and minimize the impact of any resource contention.

Montgomery: Contention is going to happen. Managing it is crucial. One of the best ways to deal with contention is architecturally. The "single-writer principle" is an effective way to do that. In essence, just don't have the contention: assume a single writer and build around that base principle. Minimize the work on that single writer and you would be surprised what can be done.

Asynchronous behavior is a great way to avoid contention. It all revolves around the principle of "always be doing useful work".

This also normally turns into the single-writer principle. I often like a lock-free queue in front of a single writer on a contended resource and use a thread to do all the writing. The thread does nothing but pull off a queue and do the writing operation in a loop. This works great for batching as well. A wait-free approach on the enqueue side pays off big here, and that is where asynchronous behavior comes into play for me from the perspective of the caller.
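As a minimal sketch of this single-writer arrangement (the Journal interface and all names are invented stand-ins for whatever contended resource is being written), producers enqueue records on a lock-free queue and one dedicated thread drains it in a loop:

    import java.util.concurrent.ConcurrentLinkedQueue;

    // Sketch: many producers enqueue, but only one thread ever writes to the resource.
    public final class SingleWriter implements Runnable {
        public interface Journal {
            void append(byte[] record);
        }

        private final ConcurrentLinkedQueue<byte[]> commands = new ConcurrentLinkedQueue<>();
        private final Journal journal;
        private volatile boolean running = true;

        public SingleWriter(Journal journal) {
            this.journal = journal;
        }

        // Called from any thread: enqueueing is lock-free and callers never block on the resource.
        public void submit(byte[] record) {
            commands.offer(record);
        }

        public void stop() {
            running = false;
        }

        @Override
        public void run() {                 // the only thread that touches 'journal'
            while (running) {
                byte[] record = commands.poll();
                if (record != null) {
                    journal.append(record); // no lock needed: single writer
                } else {
                    Thread.yield();         // back off briefly when the queue is empty
                }
            }
        }
    }

Because only one thread ever calls journal.append(), the resource itself needs no lock, and batching falls out naturally whenever the queue backs up.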
Montgomery: Entirely changed. Ultra Messaging
Thompson: Once we have contention in an started in 2004. At the time, the thought of using
algorithm, we have a fundamental scaling bottleneck. Java for low latency was just not a very obvious
Queues form at the point of contention and Little’s choice. But a few certainly did consider it. And more
law kicks in. We can also model the sequential and more have ever since. Today I think the landscape
constraint of the contention point with Amdahl’s is totally changed. Java is not only viable, it may be
law. Most algorithms can be reworked to avoid the predominant option for low-latency systems. It’s
contention from multiple threads or execution the awesome work done by Martin Thompson and
contexts, giving a parallel speed up, often via [Azul Systems’] Gil Tene that has really propelled this
pipelining. If we really must manage contention on change in attitude within the community.
a given data resource then the atomic instructions
provided by processors tend to be a better Thompson: The main change over the past few
solution than locks because they operate in user years has been the continued refinement of lock-
space without ever involving the kernel. The next free and cache-friendly algorithms. I often have fun
generation of Intel processors (Haswell) expands on getting involved in language shootouts that just keep
these instructions to provide hardware transactional proving that the algorithms are way more important

Contents Page 22
Scalability / eMag Issue 11 - April 2014

than the language to performance. Clean code that displays mechanical sympathy tends to give amazing performance, regardless of language.

Piper: Java VMs and hardware are constantly changing, so low-latency development is always an arms race to stay in the sweet spot of target infrastructure. JVMs have also gotten more robust and dependable in their implementation of the Java memory model and concurrent data structures that rely on underlying hardware support, so that techniques such as lock-free/wait-free have moved into the mainstream. Hardware also is now on a development track of increasing concurrency based on increasing execution cores, so that techniques that take advantage of these changes and minimize disruption (e.g. by giving more weight to avoiding lock contention) are becoming essential to development activities.

In Diffusion, we have now got down to single-digit microsecond latency all on stock Intel hardware using stock JVMs.

Q10. Is Java suitable for other performance-sensitive work? Would you use it in a high-frequency trading system, for example, or is C++ still a better choice here?

Lawrey: For time to market, maintainability, and support from teams of mixed ability, I believe Java is the best. The space for C or C++ between where you would use Java and FPGAs or GPUs is getting narrower all the time.

Montgomery: Java is definitely an option for most high-performance work. For HFT, Java already has most everything needed. There is more room for work, though: more intrinsics is an obvious one. In other domains, Java can work well, I think. Just like low latency, I think it will take developers willing to try to make it happen, though.

Thompson: With sufficient time, I can make a C/C++/ASM program perform better than Java, but there is not that much in it these days. Java is often the much quicker delivery route. If Java had a good concurrent garbage collector, control of memory layout, unsigned types, and some more intrinsics for access to SIMD and concurrent primitives then I'd be a very happy bunny.

Piper: I see C++ as an optimization choice. Java is by far the preferred development environment from a time-to-market, reliability, higher-quality perspective, so I would always choose Java first and then switch to something else only if bottlenecks are identified that Java cannot address. It's the optimization mantra all over again.

ABOUT THE PANELISTS


Peter Lawrey is a Java consultant interested in low-latency and high-throughput systems. He has worked for a number of hedge funds, trading firms, and investment banks. Peter is third for Java on StackOverflow, his technical blog gets 120K page views per month, and he is the lead developer for the OpenHFT project on GitHub. The OpenHFT project includes Chronicle, which supports up to 100 million persisted messages per second. Peter offers free hourly sessions on different low-latency topics twice a month to the Performance Java User's Group.

Todd L. Montgomery is vice-president of architecture for the Messaging Business Unit of 29West, now part of Informatica. As the chief architect of Informatica's Messaging Business Unit, Todd is responsible for the design and implementation of the Ultra Messaging product family, which has over 170 production deployments within the financial services sector. In the past, Todd has held architecture positions at TIBCO and Talarian, as well as research and lecture positions at West Virginia University. He has contributed to the IETF and performed research for NASA in various software fields. With a deep background in messaging systems, reliable multicast, network security, congestion control, and software assurance, Todd brings a unique perspective tempered by 20 years of practical development experience.


Martin Thompson is a high-performance and low-latency specialist, with experience gained over two decades working with large-scale transactional and big-data domains, including automotive, gaming, financial, mobile, and content management. He believes mechanical sympathy - applying an understanding of the hardware to the creation of software - is fundamental to delivering elegant, high-performance solutions. Martin was the co-founder and CTO of LMAX, until he left to specialize in helping other people achieve great performance with their software. The Disruptor concurrent programming framework is just one example of what his mechanical sympathy has created.

Andy Piper recently joined the Push Technology team as chief technology officer. Previously a technical director at Oracle Corporation, Andy has over 18 years' experience working at the forefront of the technology industry. In his role at Oracle, Andy led development for Oracle Complex Event Processing (OCEP) and drove global product strategy and innovation. Prior to Oracle, Andy was an architect for the WebLogic Server Core at BEA Systems, a provider of middleware infrastructure technologies.

READ THIS ARTICLE ONLINE ON InfoQ


Reliable Auto-Scaling
Using Feedback Control
by Philipp K. Janert

Introduction
When deploying a server application to production, we need to decide on the number of active server instances to use. This is a difficult decision, because we usually do not know how many instances will be required to handle a given traffic load. As a consequence, we are forced to use more, possibly significantly more, instances than actually required in order to be safe. Since servers cost money, this makes things unnecessarily expensive.

In fact, things are worse than that. Traffic is rarely constant throughout the day. If we deploy instances with peak traffic in mind, we basically guarantee that most of the provisioned servers will be underutilized most of the time. In particular, in a cloud-based deployment scenario where instances can come and go at any moment, we should be able to realize significant savings by having only as many instances active as are required to handle the load at any moment.

One approach to this problem is to use a fixed schedule, in which we somehow figure out the required number of instances for each hour of the day. The difficulty is that such a fixed schedule cannot handle random variations: if for some reason traffic is 10% higher today than yesterday, the schedule will not be capable of providing the additional instances that are required to handle the unexpected load. Similarly, if traffic peaks half an hour early, a system based on a fixed schedule will not be able to cope.

Instead of a fixed (time-based) schedule, we might consider a rule-based solution: we have a rule that specifies the number of server instances to use for any given traffic intensity. This solution is more flexible than the time-based schedule, but it still requires us to predict how many servers we need for each traffic load. And what happens when the nature of the traffic changes — as may happen, for example, if the fraction of long-running queries increases? The rule-based solution will not be able to respond properly.

Feedback control is a design paradigm that is fully capable of handling all these challenges. Feedback works by constantly monitoring some quality-of-service metric (such as the response time), then making appropriate adjustments (such as adding or removing servers) if this metric deviates from its desired value. Because feedback bases its control actions on the actual behavior of the controlled system, it is capable of handling even unforeseen events, such as traffic that exceeds all expectations. Moreover, and in contrast to the rule-based solution sketched earlier, feedback control requires very little a priori information about the controlled system. The reason is that feedback is truly self-correcting: because the quality-of-service metric is monitored constantly, any deviation from the desired value is spotted and corrected immediately, and this process repeats as necessary. To put it simply: if the response time deteriorates, a feedback controller will simply activate additional instances, and if that does not help, it will add more. That's all.


Feedback control has long been a standard method in mechanical and electrical engineering, but it does not seem to be used much as a design concept in software architecture. As a paradigm that specifically applies in situations of incomplete information and random variation, it is rather different from the deterministic, algorithmic solutions typical of computer science.

Although feedback control is conceptually simple, deploying an actual controller to a production environment requires knowledge and understanding of some practical tricks in order to work. In this article, we will introduce the concepts and point out some of the difficulties.

Nature of a Feedback Loop
The basic structure of a feedback loop is shown in the figure. On the right, we see the controlled system. Its output is the relevant quality-of-service metric. The value of this metric is continuously supplied to the controller, which compares it to its desired value, which is supplied from the left. (The desired value of the system's output metric is referred to as the "setpoint".) Based on the two inputs of the desired and the actual value of the quality-of-service metric, the controller computes an appropriate control action for the controlled system. For instance, if the actual value of the response time is worse than the desired value, the control action might consist of activating a number of additional server instances.

The figure shows the generic structure of all feedback loops. Its essential components are the controller and the controlled system. Information flows from the system's output via the return path to the controller, where it is compared to the setpoint. Given these two inputs, the controller decides on an appropriate control action.

So, what does a controller actually do? How does it determine what action to take?

To answer these questions, it helps to remember that the primary purpose of feedback control is to minimize the deviation of the actual system output from the desired output. This deviation can be expressed as the "tracking error":

error = actual - desired

The controller can do anything it deems suitable to reduce this error. We have absolute freedom in designing the algorithm — but we will want to take knowledge of the controlled system into account.

Let's consider again the data-center situation. We know that increasing the number of servers reduces the average response time. So, we can choose a control strategy that will increase the number of active servers by one whenever the actual response time is worse than its desired value (and decrease the server count in the opposite case). But we can do better than that, because this algorithm does not take the magnitude of the error into account, only its sign. Surely, if the tracking error is large, we should make a larger adjustment than when the tracking error is small. In fact, it is common practice to let the control action be proportional to the tracking error:

action = k × error

Here k is a numerical constant, often called the controller gain.

With this choice of control algorithm, large deviations lead to large corrective actions, whereas small deviations lead to correspondingly smaller corrections. Both aspects are important. Large actions are required in order to reduce large deviations quickly, but it is also important to let control actions become small if the error is small — only if we do this does the control loop ever settle to a steady state. Otherwise, the behavior will always oscillate around the desired value, an effect we usually wish to avoid.
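To make the proportional rule concrete, here is a minimal sketch in Java (the ServerFarm interface, the gain value, and the update cadence are assumptions for this example, not something prescribed by the article):

    // Sketch: proportional control (action = k * error) applied to a server count.
    public final class ProportionalAutoscaler {
        public interface ServerFarm {
            int activeInstances();
            void setDesiredInstances(int count);
        }

        private final ServerFarm farm;
        private final double setpointSeconds; // desired average response time
        private final double k;               // controller gain (a tuning parameter)

        public ProportionalAutoscaler(ServerFarm farm, double setpointSeconds, double k) {
            this.farm = farm;
            this.setpointSeconds = setpointSeconds;
            this.k = k;
        }

        // Called once per control interval with the measured (smoothed) response time.
        public void update(double measuredSeconds) {
            double error = measuredSeconds - setpointSeconds;  // tracking error
            int adjustment = (int) Math.round(k * error);      // action = k * error
            int target = Math.max(1, farm.activeInstances() + adjustment);
            farm.setDesiredInstances(target);
        }
    }

The sign convention follows the text: a response time above the setpoint produces a positive error and therefore adds instances, while a response time below the setpoint removes them.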


We said earlier that there is considerable freedom in choosing a particular algorithm for the implementation of the feedback controller, but it is usually a good idea to keep it simple. The magic of feedback control lies in the loopback structure of the information flow, not so much in a particularly sophisticated controller. Feedback control incurs a more complicated system architecture in order to allow for a simpler controller.

One thing, however, is essential: the control action must be applied in the correct direction. In order to guarantee this, we need to have some understanding of the behavior of the controlled system. Usually, this is not a problem: we know that more servers means better response times and so on. But it is a crucial piece of information that we must have.

Implementation Issues
Thus far, our description of feedback control has been largely conceptual. However, when attempting to turn these high-level ideas into a concrete realization, some implementation details need to be settled. The most important of these concerns the magnitude of the control action that results from a tracking error of a given size. (If we use the formula given earlier, this amounts to choosing a value for the numerical constant, k.)

The process of choosing specific values for the numerical constants in the controller implementation is known as "controller tuning". Controller tuning is the expression of an engineering tradeoff: if we choose to make relatively small control actions, then the controller will respond slowly and tracking errors will persist for a long time. If, on the other hand, we choose to make rather large control actions, then the controller will respond much faster, but at the risk of over-correcting and incurring an error in the opposite direction. If we let the controller make even larger corrections, it is possible for the control loop to become unstable. If this happens, the controller tries to compensate for each deviation with an ever-increasing sequence of control actions, swinging wildly from one extreme to the other while increasing the magnitude of its actions all the time. Instability of this form is highly detrimental to smooth operations and therefore must be avoided. The challenge of controller tuning therefore amounts to finding control actions that are as large as possible without making the loop unstable.

A first rule of thumb in choosing the size of control actions is to work backwards: given a tracking error of a certain size, how large would a correction need to be to eliminate this error entirely? Remember that we do not need to know this information precisely: the self-correcting nature of feedback control assures that there is considerable tolerance in choosing values for the tuning parameters. But we do need to get at least the order of magnitude right. In other words, to improve the average query response time by 0.1 seconds, do we need to add roughly one server, 10 servers, or 100?
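A quick sanity check with invented numbers: if adding one server typically improves the average response time by about 0.05 seconds, then eliminating a 0.1-second tracking error calls for roughly two additional servers, which corresponds to a gain on the order of k = 2 / 0.1 = 20 instances per second of error. That only fixes the order of magnitude; the final value is then refined during tuning.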
Some systems are slow to respond to control actions. For instance, it may take several minutes before a newly requested (virtual) server instance is ready to receive requests. If this is the case, we must take this lag or delay into account: while the additional instances spin up, the tracking error will persist, and we must prevent the controller from requesting further and further instances. Otherwise, we will eventually have way too many active servers online! Systems that do not respond immediately pose specific challenges and require more care, but systematic methods exist to tune such systems. (Basically, one first needs to understand the duration of the lag or delay before using specialized plug-in formulas to obtain values for the tuning parameters.)

Special Considerations
We must keep in mind that feedback control is a reactive control strategy: things must first go out of whack, at least a little, before any corrective action can take place. If this is not acceptable, feedback control might not be suitable. In practice, this is usually not a problem: a well-tuned feedback controller will detect and respond even to very small deviations and generally keep a system much closer to its desired behavior than a rule-based strategy or a human operator would.

A more serious concern is that no reactive control strategy is capable of handling disturbances that occur much faster than it can apply its control actions. For instance, if it takes several minutes to bring additional server instances online, we will not be able to respond to traffic spikes that build up within a few seconds or less. (At the same time, we will have no problem handling changes in traffic that build up over several minutes or hours.) If we need to handle spiky loads, we must either find a way to speed up control actions (for instance, by having
servers on hot standby) or employ mechanisms that are not reactive (such as message buffers).

Another question that deserves some consideration is the choice of the quality-of-service metric to be used. Ultimately, the only thing the feedback controller does is to keep this quantity at its desired value, hence we should make sure that the metric we choose is indeed a good proxy for the behavior that we want to maintain. At the same time, this metric must be available, immediately and at all times. (We cannot build an effective control strategy on some metric that is only available after a significant delay, for instance.)

A final consideration is that this metric should not be too noisy, because noise tends to confuse the controller. If the relevant metric is naturally noisy, then it usually needs to be smoothed before it can be used as a control signal. For instance, the average response time over the last several requests provides a better signal than just the response time of the most recent request. Taking the average has the effect of smoothing out random variations.
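As a small illustration of this smoothing step (a sketch only, not tied to any particular monitoring system), a fixed-window moving average over the last N response-time samples can serve as the control signal:

    // Sketch: smooth a noisy metric with a moving average over the last N samples.
    public final class MovingAverage {
        private final double[] window;
        private int count;
        private int next;
        private double sum;

        public MovingAverage(int size) {
            this.window = new double[size];
        }

        public double record(double sample) {
            if (count == window.length) {
                sum -= window[next];          // drop the oldest sample once the window is full
            } else {
                count++;
            }
            window[next] = sample;
            sum += sample;
            next = (next + 1) % window.length;
            return sum / count;               // current smoothed value
        }
    }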

Summary
Although we have introduced feedback control here in terms of data-center autoscaling, it has a much wider area of applicability: wherever we need to maintain some desired behavior, even in the face of uncertainty and change, feedback control should be considered an option. It can be more reliable than deterministic approaches and simpler than rule-based solutions, but it requires a novel way of thinking and knowledge of some special techniques to be effective.

Further Reading
This short article can only introduce the basic notions of feedback control. More information is available on my blog and in my book on the topic (Feedback Control for Computer Systems, O'Reilly, 2013).

ABOUT THE AUTHOR
Philipp K. Janert provides consulting services for data analysis and mathematical modeling, drawing on his previous careers as physicist and software engineer. He is the author of the best-selling Data Analysis with Open Source Tools (O'Reilly), as well as Gnuplot in Action: Understanding Data with Graphs (Manning Publications). In his latest book, Feedback Control for Computer Systems, he demonstrates how the same principles that govern cruise control in your car also apply to data-center management and other enterprise systems. He has written for the O'Reilly Network, IBM developerWorks, and IEEE Software. He holds a Ph.D. in theoretical physics from the University of Washington. Visit his company's Web site.

READ THIS ARTICLE ONLINE ON InfoQ

