
Load Balancing in the Cloud:

Tools, Tips, and Techniques


A TECHNICAL WHITE PAPER
Brian Adler, Solutions Architect, RightScale, Inc.
RightScale www.rightscale.com

Abstract
Load Balancing is a method to distribute workload across one or more servers, network interfaces, hard
drives, or other computing resources. Typical datacenter implementations rely on large, powerful (and
expensive) computing hardware and network infrastructure, which are subject to the usual risks
associated with any physical device, including hardware failure, power and/or network interruptions, and
resource limitations in times of high demand.
Load balancing in the cloud differs from classical thinking on load-balancing architecture and
implementation by using commodity servers to perform the load balancing. This provides for new
opportunities and economies-of-scale, as well as presenting its own unique set of challenges.
The discussion to follow details many of these architectural decision points and implementation
considerations, while focusing on several of the cloud-ready load balancing solutions provided by
RightScale, either directly from our core components, or from resources provided by members of our
comprehensive partner network.
1 Introduction
A large percentage of the systems (or deployments in RightScale vernacular) managed by the
RightScale Cloud Management Platform employ some form of front-end load balancing. As a result of
this customer need, we have encountered, developed, architected, and implemented numerous load
balancing solutions. In the process we have accumulated experience with solutions that excelled in their
application, as well as discovering the pitfalls and shortcomings of other solutions that did not meet the
desired performance criteria. Some of these solutions are open source and are fully supported by
RightScale, while others are commercial applications (with a free, limited version in some cases)
supported by members of the RightScale partner network.
In this discussion we will focus on the following technologies that support cloud-based load balancing:
HAProxy, Amazon Web Services Elastic Load Balancer (ELB), Zeus Technologies Load Balancer (with
some additional discussion of their Traffic Manager features), and aiCache's Web Accelerator. While it
may seem unusual to include a caching application in this discussion, we will describe the setup in a
later section that illustrates how aiCache can be configured to perform strictly as a load balancer.
The primary goal of the load balancing tests performed in this study is to determine the maximum
connection rate that the various solutions are capable of supporting. For this purpose we focused on
retrieving a very small web page from backend servers via the load balancer under test. Particular use-
cases may see more relevance in testing for bandwidth or other metrics, but we have seen more
difficulties surrounding scaling to high connection rates than any other performance criterion, hence the
focus of this paper. As will be seen, the results provide insight into other operational regimes and
metrics as well.
Section 2 will describe the test architecture and the method and manner of the performance tests that
were executed. Application- and/or component-specific configurations will be described in each of the
subsections describing the solution under test. Wherever possible, the same (or similar) configuration
options were used in an attempt to maintain a compatible testing environment, with the goal being
relevant and comparable test results. Section 3 will discuss the results of these tests from a pure load
balancing perspective, with additional commentary on specialized configurations pertinent to each
solution that may enhance its performance (with the acknowledgement that these configurations/options
may not be available with the other solutions included in these evaluations). Section 4 will describe an
enhanced testing scenario used to exercise the unique features of the ELB, and section 5 will summarize
the results and offer suggestions with regards to best practices in the load balancing realm.
2 Test Architecture and Setup
In order to accomplish a reasonable comparison among the solutions exercised, an architecture typical
of many RightScale customer deployments (and cloud-based deployments in general) was utilized. All
tests were performed in the AWS EC2 US-East cloud, and all instances (application servers, server
under test, and load-generation servers) were launched in a single availability zone.
A single EC2 large instance (m1.large, 2 virtual cores, 7.5GB memory, 64-bit platform) was used for the
load balancer under test for each of the software appliances (HAProxy, Zeus Load Balancer, and
aiCache Web Accelerator). As the ELB is not launched as an instance, we will address it as an
architectural component as opposed to a server in these discussions. A RightImage (a RightScale-
created and supported Machine Image) utilizing CentOS 5.2 was used as the base operating system on
the HAProxy and aiCache servers, while an Ubuntu 8.04 RightImage was used with the Zeus Load
Balancer. A total of five identically configured web servers were used in each test to handle the
responses to the HTTP requests initiated by the load-generation server. These web servers were run on
EC2 small instances (m1.small, 1 virtual core, 1.7GB memory, 32-bit platform), and utilized a CentOS 5.2
RightImage. Each web server was running Apache version 2.2.3 and the web page being requested
was a simple text-only page with a size of 147 bytes. The final server involved in the test was the load-
generation server. This server was run on an m1.large instance, and also used a CentOS 5.2
RightImage. The server configurations used are summarized in Table 1 below.
Solution          Load Balancer                 Web Server                              Load Generator
                  RightImage OS   AWS Instance  RightImage OS   AWS Instance   Apache   RightImage OS   AWS Instance
HAProxy 1.3.19    CentOS 5.2      m1.large      CentOS 5.2      m1.small       2.2.3    CentOS 5.2      m1.large
AWS ELB           N/A             N/A           CentOS 5.2      m1.small       2.2.3    CentOS 5.2      m1.large
Zeus 6.0          Ubuntu 8.04     m1.large      CentOS 5.2      m1.small       2.2.3    CentOS 5.2      m1.large
aiCache 6.107     CentOS 5.2      m1.large      CentOS 5.2      m1.small       2.2.3    CentOS 5.2      m1.large
Table 1 Summary of server configurations
The testing tool used to generate the load was ApacheBench, and the command used during the tests
was the following:
ab -k -n 100000 -c 100 http://<Public_DNS_name_of_EC2_server>/
The full list of options available is described in the ApacheBench man page
(http://httpd.apache.org/docs/2.2/programs/ab.html), but the options used in these tests were:
-k
    Enable the HTTP KeepAlive feature, i.e., perform multiple
    requests within one HTTP session. Default is no KeepAlive.
-n requests
    Number of requests to perform for the benchmarking session.
    The default is to just perform a single request which usually
    leads to non-representative benchmarking results.
-c concurrency
    Number of multiple requests to perform at a time. Default is
    one request at a time.
Additional tests were performed on the AWS ELB and on HAProxy using httperf as an alternative to
ApacheBench. These tests are described in sections to follow.
An architectural diagram of the test setup is shown in Figure 1.
Figure 1 Test setup architecture
In all tests, a round-robin load balancing algorithm was used. CPU utilization on all web servers was
tracked during the tests to ensure this tier of the architecture was not a limiting factor on performance.
The CPU idle value for each of the five web servers was consistently between 65% and 75% during the
entire timeline of all tests. The CPU utilization of the load-generating server was also monitored during all
tests, and the idle value was consistently above 70% on both cores (the httperf tests more fully
utilized the CPU, and these configurations will be discussed in detail in subsequent sections).
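As a minimal sketch of the round-robin algorithm named above (not the implementation used by any of the products under test), the rotation can be expressed in a few lines of Python. The backend names are hypothetical placeholders for the five web servers.

```python
from itertools import cycle


def round_robin(backends):
    """Return an endless iterator that hands out backends in strict rotation."""
    return cycle(backends)


# Hypothetical names standing in for the five backend web servers.
backends = ["web1", "web2", "web3", "web4", "web5"]
picker = round_robin(backends)

# Seven consecutive requests wrap around to the first backend again.
first_seven = [next(picker) for _ in range(7)]
print(first_seven)
```

With equal-cost requests such as the 147-byte page used here, this rotation spreads load evenly, which is why uniform CPU idle values across the web servers are a reasonable sanity check.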
As an additional test, two identical load-generating servers were used to simultaneously generate load
on the load balancer. In each case, the performance seen by the first load-generating server was halved
as compared to the single load generator case, with the second server performing equally. Thus, the
overall performance of the load balancer remained the same. As a result, the series of tests that
generated the results discussed herein were run with a single load-generating server to simplify the test
setup and results analysis. The load-generation process was handled differently in the ELB test to more
adequately test the auto-scaling aspects of the ELB. Additional details are provided in section 4.1,
which discusses this test setup and results.
The metric collected and analyzed in all tests was the number of requests per second that were
processed by the server under test (referred to as responses/second hereafter). Other metrics may be
more relevant for a particular application, but pure connection-based performance was the desired
metric for these tests.
2.1 Additional Testing Scenarios
Due to the scaling design of the AWS ELB, adequately testing this solution requires a different and more
complex test architecture. The details of this test configuration are described in section 4 below. With
this more involved architecture in place, additional tests of HAProxy were performed to confirm the
results seen in the more simplistic architecture described above. The HAProxy results were consistent
between the two test architectures, lending validation to the base test architecture. Additional details on
these HAProxy tests are provided in section 4.2.
3 Test Results
Each of the ApacheBench tests described in Section 2 was repeated a total of ten times against each
load balancer under test with the numbers quoted being the averages of those tests. Socket states
were checked between tests (via the netstat command) to ensure that all sockets closed correctly
and the server had returned to a quiescent state. A summary of all test results is included in Appendix
A.
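The averaging step can be sketched as follows, assuming the text output of each ApacheBench run has been captured as a string. The function names and the two sample rates are illustrative only and are not measurements from this study.

```python
import re
from statistics import mean

# ab prints a summary line such as:
#   Requests per second:    4982.00 [#/sec] (mean)
RPS_LINE = re.compile(r"^Requests per second:\s+([\d.]+)", re.MULTILINE)


def requests_per_second(ab_output):
    """Extract the requests/second figure from one ab run's text output."""
    match = RPS_LINE.search(ab_output)
    if match is None:
        raise ValueError("no 'Requests per second' line found")
    return float(match.group(1))


def average_rate(outputs):
    """Average the requests/second figure over several ab runs."""
    return mean(requests_per_second(text) for text in outputs)


# Made-up sample outputs, used only to exercise the parser.
sample_runs = [
    "Requests per second:    4900.00 [#/sec] (mean)\n",
    "Requests per second:    5100.00 [#/sec] (mean)\n",
]
print(average_rate(sample_runs))
```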
3.1 HAProxy
HAProxy is an open-source software application that provides high-availability and load balancing
features (http://haproxy.1wt.eu/). In this test, version 1.3.19 was used and the health-check
option was enabled, but no other additional features were configured. The CPU utilization was less than
50% on both cores of the HAProxy server during these tests (HAProxy does not utilize multiple cores,
but monitoring was performed to ensure no other processes were active and consuming CPU cycles),
and the addition of another web server did not increase the number of requests serviced, nor change the
CPU utilization of the HAProxy server. HAProxy performance tuning as well as Linux kernel tuning was
performed. The tuned parameters are indicated in the results below, and are summarized in Appendix
B. HAProxy does not support the keep-alive mode of the HTTP transactional model, thus its response
rate is equal to the TCP connection rate.
3.1.1 HAProxy Baseline
In this test, HAProxy was run with the standard configuration file (Appendix C) included with the
RightScale frontend ServerTemplates (a ServerTemplate is a RightScale concept, and defines the base
OS image and series of scripts to install and configure a server at boot time). The results of the initial
HAProxy tests were:
Requests per second: 4982 [#/sec]
This number will be used as a baseline for comparison with the other load balancing solutions under
evaluation.
3.1.2 HAProxy with nbproc Modification
The nbproc option to HAProxy is used to set the number of haproxy processes when run in daemon
mode. This is not the preferred mode in which to run HAProxy as it makes debugging more difficult, but
it may result in performance improvements on certain systems. As mentioned previously, the HAProxy
server was run on an m1.large instance, which has two cores, so the nbproc value was set to 2 for this
test. Results:
Requests per second: 4885 [#/sec]
This is a performance reduction of approximately 2% compared with the initial tests (in which nbproc
was set to the default value of 1), so this difference is considered statistically insignificant, with the
conclusion that in this test scenario, modifying the nbproc parameter has no effect on performance.
This is most likely an indicator that user CPU load is not the limiting factor in this configuration.
Additional tests described in section 4 add credence to this assumption.
3.1.3 HAProxy with Kernel Tuning
There are numerous kernel parameters that can be tuned at runtime, all of which can be found under
the /proc/sys directory. The ones mentioned below are not an exhaustive or all-inclusive list of the
parameters that would positively (or negatively) affect HAProxy performance, but they have been found
to be beneficial in these tests. Alternate values for these (and other) parameters may have positive
performance implications depending on the traffic patterns a site encounters and the type of content
being served. The following kernel parameters were modified by adding them to the
/etc/sysctl.conf file and executing the sysctl -p command to load them into the kernel:
net.ipv4.conf.default.rp_filter=1
net.ipv4.conf.all.rp_filter=1
net.core.rmem_max = 8738000
net.core.wmem_max = 6553600
net.ipv4.tcp_rmem = 8192 873800 8738000
net.ipv4.tcp_wmem = 4096 655360 6553600
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_max_tw_buckets = 360000
vm.min_free_kbytes = 65536
vm.swappiness = 0
net.ipv4.ip_local_port_range = 30000 65535
With these modifications in place, the results of testing were:
Requests per second: 5239 [#/sec]
This represents about a 5.2% improvement over the initial HAProxy baseline tests. To ensure accuracy
and repeatability of these results, the same tests (the HAProxy baseline with no application or kernel
tuning, and the current test) were rerun. The 5%-6% performance improvement was consistent across
these tests. Additional tuning of the above-mentioned parameters was performed, with the addition of
other network- and buffer-related parameters, but no significant improvements to these results were
observed. Setting the haproxy process affinity also had a positive effect on performance (and negated
any further gains from kernel tuning). This process affinity modification is described in section 4.2.
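Because each tunable listed in /etc/sysctl.conf corresponds to a file under /proc/sys, the mapping can be sketched in Python to check which values are actually loaded. This is an illustrative helper, not part of the test setup; the function names are hypothetical, and the simple dot-to-slash mapping assumes keys without dots inside a component (true of the parameters listed above).

```python
from pathlib import PurePosixPath


def proc_path(key):
    """Map a sysctl key such as net.ipv4.tcp_rmem to its /proc/sys file."""
    return str(PurePosixPath("/proc/sys", *key.split(".")))


def parse_sysctl(text):
    """Parse sysctl.conf-style text into {key: [values]}, skipping comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.split()
    return settings


# Two of the settings from the list above, in both spacing styles.
conf = "net.ipv4.tcp_rmem = 8192 873800 8738000\nvm.swappiness=0\n"
for key, values in parse_sysctl(conf).items():
    print(proc_path(key), values)
```

Reading each resulting path on the load balancer and comparing it with the parsed values is one way to confirm that sysctl -p applied the file as intended.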
It is worth noting that HAProxy can be configured for both cookie-based and IP-based session stickiness
(IP-based if a single HAProxy load balancer is used). This can enhance performance, and in certain
application architectures, it may be a necessity.
3.2 Zeus Load Balancer
Zeus Technologies (http://www.zeus.com/) is a RightScale partner which has created a cloud-
ready ServerTemplate available from the RightScale Dashboard. Zeus is a fee-based software
application, with different features being enabled at varying price points. The ServerTemplate used in
these tests utilized version 6.0 of the Zeus Traffic Manager. This application provides many advanced
features that support caching, SSL termination (including the ability to terminate SSL for multiple fully-
qualified domain names on the same virtual appliance), cookie- and IP-based session stickiness,
frontend clustering, as well as numerous other intelligent load balancing features. In this test only the
Zeus Load Balancer (a feature subset of the Zeus Traffic Manager) was used to provide more feature-
compatible tests with the other solutions involved in these evaluations. By default, Zeus enables both
HTTP keep-alives as well as TCP keep-alives on the backend (the connections to the web servers), thus
avoiding the overhead of unnecessary TCP handshakes and tear-downs. With a single Zeus Load
Balancer running on an m1.large (consistent with all other tests), the results were:
Requests per second: 6476 [#/sec]
This represents a 30% increase over the HAProxy baseline, and a 24% increase over the tuned HAProxy
test results. As mentioned previously, the Zeus Traffic Manager is capable of many advanced load
balancing and traffic managing features, so depending on the needs and architecture of the application,
significantly improved performance may be achieved with appropriate tuning and configuration. For
example, enabling caching would increase performance dramatically in this test since a simple static
text-based web page was used. We will see a use case for this in the following section discussing
aiCache's Web Accelerator. However, for these tests standard load balancing with particular attention to
requests served per second was the desired metric, so the Zeus Load Balancer features were exercised,
and not the extended Zeus Traffic Manager capabilities.
3.3 aiCache Web Accelerator
aiCache implements a software solution to provide frontend web server caching
(http://aicache.com/). aiCache is a RightScale partner that has created a ServerTemplate to deploy their
application in the cloud through the RightScale platform. The aiCache Web Accelerator is also a fee-
based application. While it may seem out of place to include a caching application in an evaluation of
load balancers, the implementation of aiCache lends itself nicely to this discussion. If aiCache does not
find the requested object in its cache, it will load the object into the cache by accessing the origin
servers (the web servers used in these discussions) in a round-robin fashion. aiCache does not support
session stickiness by default, but it can be enabled via a simple configuration file directive. In the tests
run as part of this evaluation, aiCache was configured with the same five web servers on the backend as
in the other tests, and no caching was enabled, thus forcing the aiCache server to request the page from
a backend server every time. With this setup and configuration in place, the results were:
Requests per second: 4785 [#/sec]
This performance is comparable with that of HAProxy (it is 4% less than the HAProxy baseline, and 9%
less than the tuned HAProxy results). As mentioned previously, aiCache is designed as a caching
application to be placed in front of the web servers of an application, and not as a load balancer per se.
But as these results show, it performs this function quite well. Although it is a bit out of scope with
regard to the intent of these discussions on load balancing, a simple one line change to the aiCache
configuration file allowed caching of the simple web page being used in these tests. With this one line
change in place, the same tests were run, and the results were:
Requests per second: 15342 [#/sec]
This is a large improvement over the initial aiCache load balancing test (roughly 3.2 times the response
rate), and the comparison with the HAProxy tests is similar (about 3.1 times the HAProxy baseline, and
2.9 times the tuned HAProxy results).
Caching is most beneficial in applications that serve primarily static content. In this simple test it was
applicable in that the requested object was a static, text-based web page. As mentioned above in the
discussion of the Zeus solution, depending on the needs, architecture, and traffic patterns associated
with an application, significantly improved results can be obtained by selecting the correct application for
the task, and tuning that application correctly.
3.4 Amazon Web Services Elastic Load Balancer (ELB)
Elastic Load Balancing facilitates distributing incoming traffic among multiple AWS instances (much like
HAProxy). Where ELB differs from the other solutions discussed in this white paper is that it can span
Availability Zones (AZ), and can distribute traffic to different AZs. While this is possible with HAProxy,
Zeus Load Balancer, and aiCache Web Accelerator, there is a cost associated with cross-AZ traffic
(traffic within the same AZ via private IPs is at no cost, while traffic between different AZs is fee-based).
However, an ELB has a cost associated with it as well (an hourly rate plus a data transfer rate), so some
of this inter-AZ traffic cost may be equivalent to the ELB charges depending on your application
architecture. Multiple AZ configurations are recommended for applications that demand high reliability
and availability, but an entire application can be (and often is) run within a single AZ. AWS has not
released details on how ELB is implemented, but since it is designed to scale based on load (which will
be shown in sections to follow), it is most likely a software-based virtual appliance. The initial release of
ELB did not support session stickiness, but cookie-based session afnity is now supported.
AWS does not currently have different sizes or versions of ELBs, so all tests executed were run with the
standard ELB. Additionally, no performance tuning or configuration is currently possible on ELBs. The
only configuration that was set with regard to the ELB used in these tests was that only a single AZ was
enabled for traffic distribution.
Two sets of tests were run. The first was functionally equivalent to the tests run against the other load
balancing solutions in that a single load-generating server was used to generate a total of 100,000
requests (and then repeated 10 times to obtain an average). The second test was designed to exercise
the auto-scaling nature of ELB, and additional details are provided in section 4.1. For the first set of
tests, the results were:
Requests per second: 2293 [#/sec]
This performance is about 46% of that of the HAProxy baseline tests, and approximately 44% of the
tuned HAProxy results. This result is consistent with tests several of RightScale's customers have run
independently. As a comparison to this simple ELB test, a test of HAProxy on an m1.small instance was
conducted. The results of this HAProxy test are as follows:
Requests per second: 2794 [#/sec]
In this test scenario, the ELB performance is approximately 82% that of HAProxy running on an
m1.small. However, due to the scaling design of the ELB solution discussed previously, another testing
methodology is required to adequately test the true capabilities of ELB. This test is fundamentally
different from all others performed in this investigation, so it will be addressed separately in section 4.
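The percentage comparisons quoted throughout section 3 can be reproduced from the reported requests/second figures. The sketch below only restates those figures; the dictionary labels and the helper name percent_of are illustrative.

```python
# Requests per second as reported in sections 3.1 through 3.4.
results = {
    "HAProxy baseline": 4982,
    "HAProxy nbproc=2": 4885,
    "HAProxy tuned kernel": 5239,
    "Zeus Load Balancer": 6476,
    "aiCache (no caching)": 4785,
    "aiCache (caching)": 15342,
    "AWS ELB": 2293,
    "HAProxy on m1.small": 2794,
}


def percent_of(value, reference):
    """Express value as a percentage of reference."""
    return 100.0 * value / reference


baseline = results["HAProxy baseline"]
for name, rate in results.items():
    print(f"{name:22s} {rate:6d}  {percent_of(rate, baseline):6.1f}% of baseline")

# ELB compared with HAProxy on the smaller instance type.
elb_vs_small = percent_of(results["AWS ELB"], results["HAProxy on m1.small"])
print(f"ELB vs HAProxy on m1.small: {elb_vs_small:.0f}%")
```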
4 Enhanced Testing Architecture
In this test, a much more complex and involved testing architecture was implemented. Instead of five
backend web servers as used in the previous tests, a total of 25 identical backend web servers were
used, with 45 load-generating servers utilized instead of a single server. The reason for the change is
that fully exercising ELB requires that requests are issued to a dynamically varying number of IP
addresses returned by the DNS resolution of the ELB endpoint. In effect, this is the first stage of load
balancing employed by ELB in order to distribute incoming requests across a number of IP addresses
which correspond to different ELB servers. Each ELB server then in turn load balances across the
registered application servers.
The load-generation servers used in this test were run on AWS c1.medium servers (2 virtual cores,
1.7GB memory, 32-bit platform). As a result of observing the load-generating servers in the previous
tests, it was determined that memory was not a limiting factor and the 7.5GB available was far more
than was necessary for the required application. CPU utilization was high on the load-generator, so the
c1.medium was used to add an additional 25% of computing power. As mentioned previously, instead
of a single load-generating server, up to 45 servers were used, each running the following httperf
command in an endless loop:
httperf --hog --server=$ELB_IP --num-conns=50000 --rate=500 --timeout=5
In order to spread the load among the ELB IPs that were automatically added by AWS, a DNS query
was made at the beginning of each loop iteration so that subsequent runs would not necessarily use the
same IP address. These 45 load-generating servers were added in groups at specific intervals, which
will be detailed below.
The rate of 500 requests per second (the --rate=500 option to httperf) was determined via
experimentation on the load-generating server. With rates higher than this, non-zero fd-unavail error
counts were observed, which is an indication that the client has run out of file descriptors (or more
accurately, TCP ports), and is thus overloaded. The number of total connections per iteration was set to
50,000 (--num-conns=50000) in order to keep each test run fairly short in duration (typically less than
two minutes) such that DNS queries would occur at frequent intervals in order to spread the load as the
ELB scaled.
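The per-server loop described above (resolve the ELB name, then run one bounded httperf iteration against the returned address) might be sketched as follows. The function names are hypothetical, and picking the first address returned by the resolver is an assumption; the paper does not state how the address was chosen.

```python
import socket
import subprocess


def resolve_ipv4(hostname):
    """Return the IPv4 addresses currently published for hostname."""
    _, _, addresses = socket.gethostbyname_ex(hostname)
    return addresses


def httperf_command(server_ip, num_conns=50000, rate=500, timeout=5):
    """Build the httperf invocation used for one loop iteration."""
    return [
        "httperf", "--hog",
        f"--server={server_ip}",
        f"--num-conns={num_conns}",
        f"--rate={rate}",
        f"--timeout={timeout}",
    ]


def run_forever(elb_hostname):
    """Re-resolve the ELB name before every run so new ELB IPs get traffic."""
    while True:
        ip = resolve_ipv4(elb_hostname)[0]  # assumption: use the first address
        subprocess.run(httperf_command(ip), check=False)


print(" ".join(httperf_command("$ELB_IP")))
```

Keeping each iteration short matters here: a run that lasted much longer than the DNS record lifetime would pin a load generator to an address that may no longer reflect the current set of ELB IPs.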
4.1 ELB performance
The first phase of the ELB test utilized all 25 backend web servers, but only three load-generating
servers were launched initially (which would generate about 1500 requests/second: three servers at 500
requests/second each). Some reset/restart time was incurred between each loop iteration running the
httperf commands, so a sustained 500 requests/second per load-generating server was not quite
achievable. DNS queries initially showed three IPs for the ELB. As shown in Figure 2 (label (a)) an
average of about 1000 requests/second were processed by the ELB at this point.
Approximately 20 minutes into the test, an additional three load-generating servers were added, resulting
in a total of six, generating about 3000 requests/second (see Figure 2 (b)). The ELB scaled up to five IPs
over the course of the next 20 minutes (c), and the response rate leveled out at about 3000/second at
this point. The test was left to run in its current state for the next 45 minutes, with the number of ELB
IPs monitored periodically, as well as the response rate. As Figure 2 shows (d), the response rate
remained fairly stable at about 3000/second during this phase of the test. The number of IPs returned
via DNS queries for the ELB varied between seven and 11 during this time.
At this point, an additional 19 load-generating servers were added (for a total of 25, see Figure 2 (e)),
which generated about 12500 requests/second. The ELB added IPs fairly quickly in response to this
load, and averaged between 11 and 15 within 10 minutes. After about 20 minutes (Figure 2 (f)), an
average of 10500 responses/second was realized (again, due to the restart time between iterations of
the httperf loop, the theoretical maximum of 12500 requests/second was not quite realized).
The test was left to run in this state for about 20 minutes, where it remained fairly stable in terms of
response rate, but the number of IPs for the ELB continued to vary between 11 and 15. An additional
20 load-generating servers (for a total of 45, see Figure 2 (g)) were added at this time. About 10 minutes
were required before the ELB scaled up to accommodate this new load, with a result of between 18 and
23 IPs for the ELB. The response rate at this time averaged about 19000/second (Figure 2 (h)). The test
was allowed to run for approximately another 20 minutes before all servers were terminated. The
response rate during this time remained around 19000/second, and the ELB varied in the number of IPs
between 19 and 22.
Figure 2 httperf responses per second through the AWS ELB. Each color corresponds to
the responses received from an individual ELB IP address. The quantization is due to the
fact that each load generating server is locked to a specific IP address for a 1-2 minute
period during which it issues 500 requests/second.
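The offered load in each phase of the ELB test follows directly from the number of load generators and the 500 requests/second each was configured to issue. The short calculation below compares that theoretical figure with the approximate response rates reported in section 4.1; the variable and function names are illustrative.

```python
PER_SERVER_RATE = 500  # httperf --rate=500 on each load generator

# (load generators, approximate observed responses/second) per test phase
phases = [
    (3, 1000),
    (6, 3000),
    (25, 10500),
    (45, 19000),
]


def offered_load(generators, rate=PER_SERVER_RATE):
    """Theoretical maximum requests/second for a number of load generators."""
    return generators * rate


for generators, observed in phases:
    offered = offered_load(generators)
    print(f"{generators:2d} generators: offered {offered:5d}/s, "
          f"observed {observed:5d}/s ({100 * observed / offered:.0f}%)")
```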
To ensure that the backend servers were not overtaxed during these tests, the CPU activity of each was
monitored. Figure 3 shows the CPU activity on a typical backend server. Additionally, the interface traffic
on the load-generating servers and the number of Apache requests on the backend servers were
monitored. Figures 4 and 5 show graphs for these metrics.
Figure 3 CPU activity on typical backend web server
Figure 4 Interface traffic on typical load-generating server
Figure 5 Apache requests on typical backend web server. Peak is with
45 load-generating servers.
It would appear that the theoretical maximum response rate using an ELB is almost limitless, assuming
that the backend servers can handle the load. Practically this would be limited by the capacity of the
AWS infrastructure, and/or by throttles imposed by AWS with regards to an ELB. These test results
were shared with members of the AWS network engineering team, who confirmed that there are activity
thresholds that will trigger an inspection of traffic to ensure it is legitimate (and not a DoS/DDoS attack,
or similar). We assume that the tests performed here did not surpass this threshold and that additional
requests could have been generated before the alert/inspection mechanism would have been
triggered. If the alert threshold is met, and after inspection the traffic is deemed to be legitimate, the
threshold is lifted to allow additional AWS resources to be allocated to meet the demand. In addition,
when using multiple availability zones (as opposed to the single AZ used in this test) supplemental ELB
resources become available.
While the ELB does scale up to accommodate increased traffic, the ramp-up is not instantaneous, so the
ELB may not be suitable for all applications. In a deployment that experiences a slow and steady
load increase, an ELB is an extremely scalable solution, but in a flash-crowd or viral event, ELB scaling
may not be rapid enough to accommodate the sudden influx of traffic, although artificial pre-warming
of the ELB may be feasible.
4.2 Enhanced Test Configuration with HAProxy
In order to validate the previous HAProxy results, the enhanced test architecture described above was
used to test a single instance running HAProxy on an m1.large (2 virtual cores, 7.5GB memory, 64-bit
architecture). In this test configuration, 16 load-generating servers were used as opposed to the 45
used in the ELB tests. (No increase in performance was seen beyond 10 load-generators, so the test
was halted once 16 had been added.) The backend was populated with 25 web servers as in the ELB
test, and the same 147-byte text-only web page was the requested object. Figure 6 shows a graph of
the responses/second handled by HAProxy. The average was just above 5000, which is consistent with
the results obtained in the tests described in section 3.1 above.
Figure 6 HAProxy responses/second
The gap in the graph was the result of a restart of HAProxy once kernel parameters had been modified.
The graph tails off at the end as DNS TTLs expired, which pushed the traffic to a different HAProxy
server running on an m1.xlarge. Results of this m1.xlarge test are described below.
In the initial test run in the new configuration, an average of about 5000 responses/second was
observed. During this time frame, CPU-0 was above 90% utilization (see Figure 7), while CPU-1 was
essentially idle. By setting the HAProxy process affinity for a single CPU (essentially moving all system-
related CPU cycles to a separate CPU), performance was increased approximately 10% to the 5000
responses/second shown in Figure 6. When the affinity was set (using the taskset -p 2
<haproxy_pid> command) CPU-0's utilization dropped to less than 5%, and CPU-1's changed
from 0% utilized to approximately 60% utilization (due to the fact that the HAProxy process was moved
exclusively to CPU-1). (See Figure 8.) Additionally, when the HAProxy process affinity was set, tuning
the kernel parameters no longer had any noticeable effect.
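The argument 2 in the taskset command is a CPU bitmask in which bit n selects CPU n, so a mask of 2 (binary 10) selects only CPU-1. The following sketch decodes and builds such masks; the helper names are illustrative and are not part of taskset.

```python
def cpus_in_mask(mask):
    """List the CPU indexes selected by an affinity bitmask."""
    return [cpu for cpu in range(mask.bit_length()) if mask & (1 << cpu)]


def mask_for_cpus(cpus):
    """Build the affinity bitmask that selects the given CPU indexes."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask


print(cpus_in_mask(2))             # CPUs selected by "taskset -p 2 <pid>"
print(hex(mask_for_cpus([0, 1])))  # mask that would allow both cores
```

On Linux, Python's os.sched_setaffinity(pid, cpus) offers a comparable way to pin a process, taking a collection of CPU indexes rather than a bitmask.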
Figure 7 CPU-0 activity on HAProxy server
Figure 8 CPU activity on CPU-0 and CPU-1 after HAProxy affinity is set to CPU-1
The interface on the HAProxy server averaged approximately 100 MBits/second total (in and out
combined) during the test (see Figure 9). In previous tests of m1.large instances in the same availability
zone, throughput in excess of 300 MBits/second has been observed, thus confirming that the instance's
bandwidth was not the bottleneck in these tests.
Figure 9 Interface utilization on HAProxy server
With unused CPU cycles on both cores, and considerable bandwidth available on the interface, the
bottleneck in the HAProxy solution is not readily apparent. The HAProxy test described above was also
run on an m1.xlarge (4 virtual cores, 15GB memory, 64-bit platform) with the same configuration. The
results observed were identical to those of the m1.large. Since HAProxy is not memory-intensive, and
does not utilize additional cores, these results are not surprising, and support the reasoning that the
throttling factor may be an infrastructure- or hypervisor-related limitation.
During these HAProxy tests, it was observed that the virtual interface was peaking at approximately
110K packets per second (pps) in total throughput (input + output). As a result of this observation, the
ttcp utility was run in several configurations to attempt to validate this finding. Tests accessing the
instance via its internal IP and its external IP, as well as a test with two concurrent transmit sessions,
were executed (see Figure 10).
Figure 10 Packets per second as generated by ttcp
The results of these tests were fairly consistent in that a maximum of about 125K pps was achieved,
with an average of 118K-120K being more typical. These results were shared with AWS network
engineering representatives, who confirmed that we were indeed hitting limits in the virtualization layer,
which involves the traversal of two network stacks.
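A rough calculation helps show why packet rate, rather than bandwidth, is the constraint here. The sketch below is illustrative only: it assumes that the roughly 100 MBits/second interface average and the roughly 110K pps peak reported above describe the same traffic, and that one MBit is 10^6 bits. The function name average_packet_bytes is hypothetical.

```python
def average_packet_bytes(bits_per_second, packets_per_second):
    """Average packet size in bytes implied by a bit rate and a packet rate."""
    return bits_per_second / 8 / packets_per_second


# Figures taken from the measurements described above (assumed to be comparable).
OBSERVED_BITS_PER_SECOND = 100e6      # ~100 MBits/second, in and out combined
OBSERVED_PACKETS_PER_SECOND = 110e3   # ~110K pps, in and out combined

avg_bytes = average_packet_bytes(OBSERVED_BITS_PER_SECOND, OBSERVED_PACKETS_PER_SECOND)
```

Under these assumptions the average packet is only about 114 bytes, which is what one would expect when serving a 147-byte page along with TCP control packets. With packets this small, the interface reaches its packet-rate ceiling long before it approaches the 300 MBits/second that the instance has been observed to sustain.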
The takeaway from these experiments is that in high-traffic applications, the network interface should be
monitored and additional capacity should be added when the interface approaches 100K pps,
regardless of other resources that may still be available on the instance.
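One way to monitor the packet rate on a Linux instance is to sample the interface counters in /proc/net/dev. The following is a minimal sketch, not the monitoring used in these tests; the interface name eth0, the one-second interval, and all function names are assumptions chosen for illustration.

```python
import time

PPS_THRESHOLD = 100_000  # the approximate practical limit observed in these tests


def parse_packet_counters(proc_net_dev_text, interface):
    """Return (rx_packets, tx_packets) for one interface from /proc/net/dev text."""
    for line in proc_net_dev_text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and name.strip() == interface:
            fields = rest.split()
            # Field 1 is receive packets; field 9 is transmit packets.
            return int(fields[1]), int(fields[9])
    raise KeyError("interface not found: " + interface)


def packets_per_second(before, after, seconds):
    """Combined (input + output) packet rate between two counter samples."""
    return (sum(after) - sum(before)) / seconds


def sample_pps(interface="eth0", interval=1.0, path="/proc/net/dev"):
    """Sample the counters twice and return the combined packets per second."""
    with open(path) as f:
        before = parse_packet_counters(f.read(), interface)
    time.sleep(interval)
    with open(path) as f:
        after = parse_packet_counters(f.read(), interface)
    return packets_per_second(before, after, interval)


def needs_more_capacity(pps, threshold=PPS_THRESHOLD):
    """True when the measured rate has reached the chosen threshold."""
    return pps >= threshold
```

In practice an alert would be raised somewhat below the threshold (the margin is a deployment decision) so that additional load balancers can be launched before the limit is reached.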
These findings also explain why the results for HAProxy, aiCache, and Zeus are very similar. With all
three appliances the practical limit is about 100K packets per second. The minor performance
differences between the three are primarily due to keep-alive versus non-keep-alive HTTP connections
and internal buffer strategies that may distribute payloads over more or fewer packets in different
situations.
5 Conclusions
At RightScale we have encountered numerous and varied customer architectures, applications, and use
cases, and the vast majority of these deployments use, or can benefit from, front-end load balancing.
Through assisting these customers, both in a consulting capacity and through professional services
engagements, we have amassed a broad range of experience with load balancing solutions. The intent of
this discussion was to give a brief overview of the load balancing options currently available in the
cloud via the RightScale platform, and to compare and contrast these solutions using a specific
configuration and metric. Through these comparisons, we hope to have illustrated that there is no
"one size fits all" when it comes to load balancing. Depending on the particular application's
architecture, technology stack, traffic patterns, and numerous other variables, there may be one or more
viable solutions, and the decision on which mechanism to put in place will often come down to a
tradeoff between performance, functionality, and cost.
Appendices
[A] Summary of all tests performed
Solution                  Test #   Configuration               Responses/sec
HAProxy                   1        Baseline                    4982
HAProxy                   2        With nbproc modification    4883
HAProxy                   3        With kernel tuning          3239
AWS ELB                   4        Test 1                      2293
AWS ELB                   5        Test 2                      19000+
Zeus                      6        Baseline                    6476
aiCache Web Accelerator   7        Baseline                    4783
aiCache Web Accelerator   8        With caching                13342
[B] Kernel tuning parameters
net.ipv4.conf.default.rp_filter=1
net.ipv4.conf.all.rp_filter=1
net.core.rmem_max = 8738000
net.core.wmem_max = 6553600
net.ipv4.tcp_rmem = 8192 873800 8738000
net.ipv4.tcp_wmem = 4096 655360 6553600
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_max_tw_buckets = 360000
vm.min_free_kbytes = 65536
vm.swappiness = 0
net.ipv4.ip_local_port_range = 30000 65535
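To confirm that settings like those listed above are actually in effect on an instance, the values can be compared with what the kernel reports under /proc/sys. The following is a minimal sketch assuming a Linux host; the function names are illustrative, and the mapping from a parameter name to a /proc/sys path (dots replaced by slashes) is the usual convention and holds for the parameters listed here.

```python
def parse_sysctl(text):
    """Parse sysctl-style "key = value" lines into a dict, skipping comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError("not a key=value line: " + raw)
        # Normalize whitespace so "8192 873800" and "8192\t873800" compare equal.
        settings[key.strip()] = " ".join(value.split())
    return settings


def proc_path(key):
    """Map a parameter name such as net.core.rmem_max to its /proc/sys path."""
    return "/proc/sys/" + key.replace(".", "/")


def find_drift(settings):
    """Return {key: (wanted, actual)} for parameters whose live value differs."""
    drift = {}
    for key, wanted in settings.items():
        with open(proc_path(key)) as f:
            actual = " ".join(f.read().split())
        if actual != wanted:
            drift[key] = (wanted, actual)
    return drift
```

For example, passing the contents of the file that holds these settings (commonly /etc/sysctl.conf, although the location used in these tests is not stated) to parse_sysctl and then to find_drift would list any parameter that has not been applied.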
[C] HAProxy configuration file
# Copyright (c) 2007 RightScale, Inc, All Rights Reserved Worldwide.
#
# THIS PROGRAM IS CONFIDENTIAL AND PROPRIETARY TO RIGHTSCALE
# AND CONSTITUTES A VALUABLE TRADE SECRET. Any unauthorized use,
# reproduction, modification, or disclosure of this program is
# strictly prohibited. Any use of this program by an authorized
# licensee is strictly subject to the terms and conditions,
# including confidentiality obligations, set forth in the applicable
# License Agreement between RightScale.com, Inc. and
# the licensee.
global
    stats socket /home/haproxy/status user haproxy group haproxy
    log 127.0.0.1 local2 info
    # log 127.0.0.1 local5 info
    maxconn 4096
    ulimit-n 8250
    # typically: /home/haproxy
    chroot /home/haproxy
    user haproxy
    group haproxy
    daemon
    quiet
    pidfile /home/haproxy/haproxy.pid

defaults
    log global
    mode http
    option httplog
    option dontlognull
    retries 3
    option redispatch
    maxconn 2000
    contimeout 5000
    clitimeout 60000
    srvtimeout 60000

# Configuration for one application:
# Example: listen myapp 0.0.0.0:80
listen www 0.0.0.0:80
    mode http
    balance roundrobin
    # When acting in a reverse-proxy mode, mod_proxy from Apache adds X-Forwarded-For,
    # X-Forwarded-Host, and X-Forwarded-Server request headers in order to pass information to
    # the origin server; therefore, the following option is commented out
    #option forwardfor
    # Haproxy status page
    stats uri /haproxy-status
    #stats auth @@LB_STATS_USER@@:@@LB_STATS_PASSWORD@@
    # when cookie persistence is required
    cookie SERVERID insert indirect nocache

    # When internal servers support a status page
    #option httpchk GET @@HEALTH_CHECK_URI@@
    # Example server line (with optional cookie and check included)
    # server srv3.0 10.253.43.224:8000 cookie srv03.0 check inter 2000 rise 2 fall 3
    server i-570a243f 10.212.69.176:80 check inter 3000 rise 2 fall 3 maxconn 255
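The stats socket declared in the global section of this configuration can be queried directly, which is useful for checking backend state without going through the status page. The following is a minimal sketch, assuming the socket path from the configuration above and HAProxy's "show stat" socket command, which returns CSV with a header line beginning with "# ". The function names are illustrative.

```python
import csv
import socket


def query_stats(socket_path="/home/haproxy/status", command="show stat"):
    """Send one command to the HAProxy stats socket and return the raw reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(command.encode("ascii") + b"\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # HAProxy closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("ascii", "replace")


def parse_stats(csv_text):
    """Turn the CSV reply into a list of dicts keyed by column name."""
    lines = [line for line in csv_text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith("# "):
        raise ValueError("unexpected stats output: missing header line")
    header = lines[0][2:].split(",")
    rows = []
    for row in csv.reader(lines[1:]):
        # Lines end with a trailing comma, which yields an empty column name.
        rows.append({name: value for name, value in zip(header, row) if name})
    return rows
```

For example, parse_stats(query_stats()) returns one entry per frontend, server, and backend; the svname and status columns show whether each server listed in the www section is currently considered up. The calling user needs permission to read and write the socket.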