
Applying Queuing Theory to Optimizing the Performance of Enterprise Software Applications

Henry H. Liu, Ph.D.
BMC Software
1030 West Maude Avenue, Sunnyvale, CA 94085, USA

Performance is one of the most stringent requirements for large-scale enterprise software applications. It is crucial in determining the success or failure of a large project, and it spans the stages of a software product's life cycle from design through development to final delivery to the customer. In this paper, using the two most fundamental concepts, wait events and service demand, we demonstrate quantitatively how well-known queuing theory can be leveraged to achieve the best possible performance for large-scale enterprise software applications, both efficiently and effectively.
1. Introduction
Delivering an enterprise-class software application product on budget and on schedule, satisfying the customer's functional and performance requirements, is always the goal of every large software development organization. In order to achieve this goal, every major performance optimization opportunity has to be incorporated into the entire development cycle at each stage, from design and coding to performance assurance and acceptance testing. When and how various performance optimization opportunities should be considered and implemented has been debated for decades, as evidenced by a few interesting quotes below [WEB01]:

"More computing sins are committed in the name of efficiency (without necessarily achieving it) than for any other single reason - including blind stupidity." - W. A. Wulf

"Premature optimization is the root of all evil." - Tony Hoare and Donald Knuth

"Bottlenecks occur in surprising places, so don't try to second guess and put in a speed hack until you have proven that's where the bottleneck is." - Rob Pike

In this paper, we will not try to parse all the implications behind various mentalities on software performance optimization. Differing opinions, purely in a technical context, are helpful and healthy balances for avoiding major mistakes that would blow the budget and schedule away. However, software performance optimization is probably the only thing on the soft side of computing that is closer to science than to art. That's because the performance of a software application can be quantified rigorously for a given workload and hardware. Functionality tests generate discrete outcomes, either success or failure, whereas performance tests generate outcomes that are continuous, like analog signals.
Any phenomenon in nature that is continuous and dynamic can be modeled analytically. This applies to the performance of software applications as well. However, software performance is not an isolated island. In fact, analysis of software performance is deeply rooted in traditional queuing theory. Some of the seminal works on analyzing the performance of software and computer systems include Little's Law [LITT61] and Jackson's theorem [JACK63]. Since then, more elaborate queuing models have been developed and applied to solving various system performance related challenges in traditional computing systems [LAZO84, JAIN91], in communication and switching networks [HARR89, GUNT91], in storage networks [SIMI03], in eCommerce applications [MENA00, MENA04], as well as in enterprise software applications as a specific category [LIU04, LIU05].

It is probably fair to say that if you perceive software performance optimization as art, it may take you years to learn. However, if you accept it as science, it may take you only a few hours to learn the basics and then a few more months of practice to become proficient in it, even if you are a software practitioner with no software performance experience. Also, software performance work can be both exciting and rewarding, as sometimes you make a small change and the entire application becomes 2 to 10 times faster.

In this paper, we focus specifically on applying queuing theory to optimizing the performance of enterprise software applications. We advocate applying queuing theory rather than second-guessing, because there is less chance of floundering through a software development project if all major decisions on improving performance are based on proven theories rather than on speculative instincts. We concentrate on enterprise software applications, as this is where performance work is most complex, with multi-tiered software architectural topologies from the client tier to the web tier, to the mid-tier or application server tier, and to the database tier. Each tier in an enterprise-class software application needs to perform optimally in order to achieve the best overall performance for the entire application.
The rest of this paper is organized as follows. First,
we introduce the basics of queuing theory as a
foundation for discussing how to apply queuing theory
to optimizing the performance of enterprise
applications. Then, we elaborate on the two most
important performance elements, wait events and
service demands, derived from queuing theory, in the
context of enterprise applications. A few optimization
techniques are presented to show significant
performance improvement by reducing wait times and
service demands. An optimization guideline derived
from balanced queuing systems is proposed with a
realistic supporting example.

2. Basics of Queuing Theory

Queuing theory is generic. It was not developed specifically for improving the performance of software applications. It is a model for evaluating the efficiency of a system that consumes multiple resources, both physical and logical, in order to realize intended values for its user. Queuing theory can be applied to evaluating the efficiency of a manufacturing process that adopts assembly lines, a customer service line that fulfills customers' service requests, or a computer system that executes software requests or transactions on behalf of its human user.

In fact, it's not that hard to grasp the basic concepts of queuing theory. Everybody has experience of visiting a banking center. When we deposit or withdraw at a banking center, most of the time we have to wait in line in order to be serviced by a teller. We don't want to wait too long in line, and we care about how long it takes for a teller to complete servicing us so that we can leave for the next thing on our daily agenda.

Fortunately, some pioneering researchers studied such queuing phenomena and published their research results [LITT61, JACK63]. The earlier research results have been summarized into published texts, with additional research carried out during the past two decades [LAZO84, JAIN91, HARR89, GUNT98, MENA00, MENA04, LIU04, LIU05]. For software developers and architects, or, no offence, for those who are not math-oriented, we recommend Gunther's book, which is an excellent source for further study on queuing theory [GUNT98].

With this banking example, we have already touched upon most of the concepts of queuing theory. Here is a list of the basic concepts involved in queuing theory in general:

- Server - banking center fulfilling customers' service requests
- Customer - initiator of service requests
- Wait time - time duration a customer has to spend waiting in line
- Service time - time duration a teller has to spend in order to complete the service for a customer
- Arrival rate - rate at which customers arrive for service
- Utilization - portion of a teller's time actually servicing customers rather than idling
- Queue length - total number of customers both waiting and being serviced
- Response time - the sum of wait time and service time for one visit to the teller
- Residence time - total time if the teller is visited multiple times for one transaction
- Throughput - rate at which customers are serviced. A banking center certainly is interested in knowing how fast it can service its customers without losing them because of long wait times.

So from banking efficiency and customer satisfaction perspectives, we already know that both the banking center and the customers are interested in:

- Minimizing wait time, so that the banking business is more efficient and the customer is happier
- Minimizing service time, so that the banking center can run its business more efficiently
- Knowing the average customer arrival rate, so that the banking center will neither over- nor under-staff itself
- Minimizing both response time and residence time
- Striving for the highest possible throughput, to stay profitable while keeping customers happy

Using the previous banking example as an easy entry point, we can now formally transition to queuing theory in a more rigorous form in the context of performance optimization for enterprise software applications, which is the subject of this paper. Queuing theory is typically introduced with a graphic representation as shown in Fig. 1 below. Here, we have replaced "customer" in the banking example with "request," which is more pertinent to the subject of this paper. However, throughout this paper, "customer" and "software request" will be used interchangeably, whichever is more pertinent to the context.
Each bean in Fig. 1 represents a request, which could be a system or user call from a caller to a callee. There is a queue in front of the server which stores all arriving requests for the server to process. However, in a computer system, the server could be any type of resource, for example, processors, disks, or networks, which are equivalent to the bank tellers in the banking example described before.

[Figure: arriving requests -> queue with waiting requests -> server -> completed requests]
Fig. 1 Graphic representation of a queuing system

With enterprise software applications, we are most concerned with two performance metrics, response time and throughput. Typically, response time is used for measuring the performance of an OLTP (On-Line Transaction Processing) system, whereas throughput is used for measuring the performance of batch jobs. The purpose of applying queuing theory to optimizing the performance of an enterprise software application is either to reduce response time or to increase throughput. It's important to keep in mind that both response time and throughput are statistical in nature, as will be clarified in the next section.

2.1 Software Performance and Statistics

When we quantitatively measure the performance of an enterprise software application, we need to be mindful that there are many underlying factors that make the measurement results not exactly repeatable. For batch jobs, throughput numbers could fluctuate within 5% across multiple measurements under the same conditions, even if the tests were conducted in a well-isolated environment. For an OLTP system, fluctuations in response time measurements from request to request could be much larger than 5%.

Therefore, the response time of an OLTP system is better characterized with a statistical distribution. To emphasize further: every time a specific response time number is quoted, it should be implicitly taken as an average, rather than an exact number for every user or request at any time.

With this notion that the performance of an enterprise software application is stochastic in mind, we are now ready to introduce the basics of queuing theory in the next few sections.

2.2 Kendall Notation and Genealogy of Queuing Models

Based on the type of arrival process, the service time distribution, and other characteristics of a queuing system, Kendall devised a set of notations for defining different types of queues symbolically [KEND81]. Because of its convenience for describing and characterizing a queuing system, Kendall notation has become the language of queuing theory. Fortunately, Kendall notation is not as complex as a programming language. It's very simple and intuitive, as described in Table 1 below.

Table 1 Kendall notation (A/S/m/B/N/Q)

A - the type of probability distribution for the arrival process, e.g., Markovian, General, etc.
S - the type of probability distribution for service time
m - number of servers at the queuing center
B - buffer size or storage capacity at the queuing center
N - the allowed population size, which may be finite or infinite
Q - the type of service policy, e.g., FIFO

In Kendall notation, a generic queue is represented symbolically as A/S/m/B/N/Q. Hence, the Kendall descriptor M/M/m/∞/∞/FIFO represents a queuing center with a Markovian arrival process, exponentially distributed service time, m servers, infinite queuing capacity, an infinite population, and a First-In-First-Out (FIFO) service policy. Such a queuing center is conventionally denoted as an M/M/m queue.

What is a Markovian (M) process? A Markovian process is characterized by its memorylessness: the future states of the process are independent of its past history and depend solely on the present state. This type of process is named after A. A. Markov, who defined it a century ago. It has turned out to be the most generic probability model for describing various kinds of stochastic processes in nature, not only in software systems but also in many other disciplines.
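The memoryless property can be checked numerically. The sketch below, with an assumed rate of 1.0 and two arbitrary time offsets s and t, draws exponentially distributed inter-arrival samples and compares P(X > s + t | X > s) against P(X > t); for a memoryless process the two should agree:

```python
import random

random.seed(42)

# Empirically check the memoryless property of the exponential distribution
# underlying a Markovian (M) arrival process:
#   P(X > s + t | X > s) should equal P(X > t).
rate = 1.0          # assumed arrival rate
s, t = 0.7, 1.3     # arbitrary time offsets
samples = [random.expovariate(rate) for _ in range(200_000)]

p_beyond_t = sum(x > t for x in samples) / len(samples)
survivors = [x for x in samples if x > s]
p_conditional = sum(x > s + t for x in survivors) / len(survivors)

# Both estimates should be close to exp(-rate * t), about 0.27 here.
print(round(p_beyond_t, 2), round(p_conditional, 2))
```
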

Based on the permutations of the various parameters in Kendall notation and the research results on the queuing models summarized in [GUNT98], the genealogy for classifying the many proliferated queuing models was condensed onto a chart that can help quickly identify the exact queuing model in question [LIU05]. Out of that genealogy, the two most commonly used queuing models will be introduced in the next two sections of this paper: the M/M/1 model and the M/M/m/N/N model.

Before we introduce these two most basic queuing models, it might be helpful to introduce all the mathematical symbols used for describing them. Early acquaintance with these symbols helps in comprehending queuing theory more easily, as grasping queuing theory is a two-step process: first get familiar with the symbols and understand what they represent, and then understand what the formulas mean and know how to apply them.

Throughout this paper, we assume that a queuing system is composed of a series of queuing nodes and that each node conforms to the M/M/1 or M/M/m/N/N model. The following table lists all symbols used in this paper for describing queuing theory. The subscript i indicates that the quantity is meant for queuing node i. You will gain a deeper understanding of the meanings of these symbols when they are presented later in the context of concrete queuing theory.
Table 2 Symbols used in queuing theory

Symbol | Semantics
Di     | Service demand
Vi     | Number of visits to the queuing node
Si     | Service time
X0     | Throughput at the system level
λ      | Arrival rate
Ui     | Utilization
Wi     | Wait time
Ri     | Response time
R'i    | Residence time
Ni     | Total queue length (waiting + being serviced)

2.3 M/M/1 Model (Open)

Let's start with the simplest queue, the M/M/1 open model. This is a special case of M/M/m/∞/∞/FIFO with m = 1 and all remaining parts implicit. Although it is simple, it illustrates all the basic concepts and elements of queuing theory.

To be generic, we consider an M/M/1 queuing center with feedback, as shown in Fig. 2. In such a queuing center, some customers may come back and visit the queuing center more than once.

[Figure]
Fig. 2 An M/M/1 queue with feedback (λ and λ1 are the external and internal arrival rates, respectively, and p is the probability at which customers return to the same queue)

As multiple arrival streams may be joining the queue and there may be feedback, the arrival process is not Poisson. However, Jackson's theorem states that although the arrivals into the queue are not Poisson, the center's behavior statistically still follows the laws governing Poisson processes [JACK63]. Jackson's work helped shed light on the fact that it is really the service demand, not the service time, that is most fundamental to the performance of a queuing system. This observation becomes obvious after we show how response time is calculated next.

The two input parameters, service demand (Di) and arrival rate (λ), are assumed given when calculating the performance metrics of an M/M/1 queuing model. The service demand is related to the service time as follows:

Di = Vi x Si,    (1)

where Vi is the average number of visits to queue i and Si is the average service time of a request at queue i per visit. Without feedback, service demand and service time are the same.

For open models, it is well known that under operational equilibrium the average throughput of the overall system, X0, is equal to the average arrival rate λ, namely,

X0 = λ.    (2)

Knowing the system throughput and the service demand allows us to calculate the utilization (Ui) of resource i according to

Ui = X0 x Di.    (3)

With the service demand Di and utilization Ui given, the average residence time of a request (R'i) at queue i can be calculated as follows:

R'i = Vi x Ri = Di / (1 - Ui),    (4)

where Ri is the average response time of a request at queue i, defined as the sum of the average wait time and average service time per visit to the queue, i.e.,

Ri = Wi + Si.    (5)

The response time can also be expressed as

Ri = Si / (1 - Ui).    (6)

The total average response time of the system, R0, is equal to the sum of the residence times over all K queues, namely,

R0 = Σi R'i.    (7)

In summary, here is how a system's response time is calculated in a chain-like fashion:

1. For given values of Vi and Si, calculate Di using Eq. (1).
2. For a given λ, and hence X0, calculate Ui using Eq. (3).
3. Calculate R'i using Eq. (4).
4. Calculate R0 using Eq. (7).

As is seen, an essential element in queuing theory is the service demand associated with each resource. This observation holds true not only for open models but also for closed models, as described in the next section.
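The four-step chain above can be sketched in a few lines of Python; the visit counts, service times, and arrival rate used here are illustrative values, not measurements from the paper:

```python
# Response-time chain for an open queuing network (Eqs. 1-4 and 7).

def open_model_response_time(nodes, arrival_rate):
    """nodes: list of (V_i, S_i) pairs; returns (R0, per-node residence times)."""
    x0 = arrival_rate                        # Eq. (2): X0 = lambda in equilibrium
    residence = []
    for v_i, s_i in nodes:
        d_i = v_i * s_i                      # Eq. (1): D_i = V_i x S_i
        u_i = x0 * d_i                       # Eq. (3): U_i = X0 x D_i
        if u_i >= 1.0:
            raise ValueError("node saturated (U_i >= 1)")
        residence.append(d_i / (1.0 - u_i))  # Eq. (4): R'_i = D_i / (1 - U_i)
    return sum(residence), residence         # Eq. (7): R0 = sum of R'_i

# Hypothetical app server, database, and disk nodes: (visits, service time in s)
r0, per_node = open_model_response_time([(1, 0.010), (5, 0.002), (10, 0.001)], 20.0)
print(round(r0, 4))
```

With each node at 20% utilization, the three residence times of 12.5 ms sum to an overall response time of 37.5 ms.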
2.4 M/M/m/N/N Model (Closed)

A closed queuing center, denoted as M/M/m/N/N, is described by the following set of recursive equations for each node:

R'i[n] = Di (1 + Qi[n-1]),    (8.1a)
X[n] = n / (Z + Σi R'i[n]),    (8.1b)
Qi[n] = X[n] R'i[n],    (8.1c)

where n is the number of customers in the system, R'i[n] the residence time at queuing node i, X[n] the system throughput, Qi[n-1] the queue length at queuing node i, and Z the think time. It can be solved with the Mean Value Analysis (MVA) algorithm [REIS80]. The iteration starts with the initial conditions of n = 1 and Qi[0] = 0, which leads to

R'i[1] = Di.    (9)

Again, the service demand (Di) is the starting point for solving this set of recursive equations.

2.5 Little's Law

An overview of queuing theory is incomplete without mentioning Little's law [LITT61]. Little's law simply states that the number of customers, both waiting for and receiving service, is equal to the product of throughput and response time:

Ni = Xi x Ri,    (10)

where Xi = Vi x X0 is the throughput at queuing node i. With Eqs. (3), (5), and (6), we can further obtain:

Ni = Ui / (1 - Ui).    (11)

Eq. (11) is another form of Little's law for open queuing models. It is expressed as a function of system resource utilization only.

In the next section, we will introduce the concept of bottlenecks for enterprise software applications, based on the wait time and service demand that we have introduced.

3. Wait Events and Service Demands in the Context of Enterprise Software Applications

Enterprise software applications typically adopt a multi-tier architecture consisting of a backend tier, a middle tier, and a front tier. The backend tier hosts data; the middle tier provides enterprise services that retrieve data from, or add/update data at, the backend; and the front tier sends user requests to the middle tier and renders results to the user. The application server and database server constitute the bulk of an enterprise software application, as shown in Fig. 3 below.

[Figure: application server <-> database server <-> storage]
Fig. 3 Topology for an enterprise software application

The database server and application server communicate with each other using TCP/IP. Enterprise data are stored on external storage configured at proper RAID levels. Database servers access data through gigabit Storage Area Network (SAN) fabrics in order to minimize latency.

A complete transaction in an enterprise software application consists of a wait chain, as shown in Fig. 4 below. In this wait chain, database servers wait for data to come back from the disk arrays, and application servers wait for data to come back from the database servers. Therefore, the execution of a service call in an enterprise application can be considered as cycles of service-wait-service-wait, as illustrated in earlier research [LIU05].

[Figure]
Fig. 4 Wait chain in an enterprise application

If we treat each layer as a queuing node, we can immediately apply the queuing theory introduced in the previous section to shed some light on how the performance of an enterprise application can be optimized. Assuming that the service demands and wait times are represented as Dapp, Ddb, Ddisk, Wapp, Wdb, and Wdisk for the application server, database server, and disks in Fig. 4, respectively, then, according to Eq. (7), the total response time for an OLTP system at the system level can be decomposed into two parts as follows:

R0 = Σi (Vi x Wi + Di),    (12)

where i iterates over the set {app, db, disk}. Note that in this equation, the first term represents the wait times associated with various resource wait events (in database terms) and the second term represents the service demands. In an un-contended environment, wait times are essentially zero.

Similarly, applying Eqs. (10) and (12) at the system level, we can calculate the system throughput for batch jobs as follows:

X0 = N0 / Σi (Vi x Wi + Di),    (13)

where N0 is the total number of requests in the system.

It is clear from Eqs. (12) and (13) that the efforts of optimizing the performance of an enterprise software application essentially fall into two categories:

1. Minimize wait times at each layer as much as possible
2. Reduce service demands at each layer as much as possible

Actual performance optimization efforts start with identifying bottlenecks. A resource is a bottleneck if it has the largest sum of wait time and service demand, as implied in Eqs. (12) and (13). When the bottleneck at a resource is removed, the next bottleneck is identified and removed, until a balance is reached among all resources. Removing bottlenecks one after another leads to a balanced queuing system, which will be detailed later in this paper.

Characteristically, bottlenecks are implied when the type of the application is known, as shown in the following table. As enterprise applications incur intensive I/O activities and heavy validation logic, either I/O or CPU could be the bottleneck.

Table 3 Application resource bottlenecks

Type                   | Bottleneck
Graphics, encryption   | CPU, memory
Web application        | Network
Enterprise application | I/O or CPU

Note that from Eqs. (12) and (13), in an un-contended test environment, both response time and throughput are determined by the sum of the service demands from all queuing nodes.

In the next section, we will present a number of optimization techniques that have worked well for most enterprise software applications.
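The closed-model recursion of Section 2.4 can be made concrete for an app/db/disk wait chain. Below is a minimal sketch of the exact MVA iteration (Eqs. 8.1a-c); the service demands, think time, and user count are illustrative values, not measurements from the paper:

```python
# Exact Mean Value Analysis (MVA) for a closed queuing network (Eqs. 8.1a-c).

def mva(demands, think_time, n_users):
    """demands: {node: D_i} in seconds; returns (X, R0) at n_users customers."""
    q = {node: 0.0 for node in demands}        # initial condition Q_i[0] = 0
    x = r0 = 0.0
    for n in range(1, n_users + 1):
        # Eq. (8.1a): R'_i[n] = D_i * (1 + Q_i[n-1])
        r = {node: d * (1.0 + q[node]) for node, d in demands.items()}
        r0 = sum(r.values())
        x = n / (think_time + r0)              # Eq. (8.1b)
        q = {node: x * r[node] for node in r}  # Eq. (8.1c)
    return x, r0

# Hypothetical service demands per tier (seconds), 1 s think time, 50 users.
x, r0 = mva({"app": 0.020, "db": 0.010, "disk": 0.005}, think_time=1.0, n_users=50)
print(round(x, 2), round(r0, 3))
```

Note that the throughput produced never exceeds 1/Dmax, the bound set by the bottleneck node with the largest service demand, which is the app tier in this sketch.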

4. Effective Optimization Techniques

The performance of an enterprise software application can be improved both from the hardware side and from the software side. During the design and development stages, optimizations are applied from the software perspective, whereas after the product is released, the performance of an enterprise software application can be further improved by using the fastest possible hardware and applying system-level tuning. In this section, we give some examples from both perspectives.
4.1 Array Processing - Reducing Vi

Large-scale enterprise software applications require careful use of various design patterns and performance patterns for maximum flexibility and the highest possible performance. One of the performance prescriptions in this regard is that round-trip communications between the various tiers, especially between the application tier and the database tier, should be minimized to the largest extent possible. This way, the service demand,

Di = Vi x Si,

can be decreased accordingly, from which better response time and throughput will result.

Array processing helps improve the performance of an enterprise software application by reducing the number of round-trips between the application server and the database server through batch operations on common data operations such as insert and update. Instead of issuing insert and update SQLs one by one, a bunch of inserts and updates, say with a batch size of 10 or 100, can be issued to the database server as one call from the application server. This seems like common sense, but it is often ignored by developers until it is raised during the performance assurance and acceptance test cycle.

Array processing is not a new performance optimization technique. It has been documented as a generic, programming-language-agnostic performance pattern [SMIT02]. It is supported explicitly in JDBC 2 / the Oracle JDBC driver [BONA01] with an executeBatch API on the Statement object. Fig. 5 below shows the improvement in the performance of the two APIs of batch job #1 before and after using array processing implemented with JDBC 2 and the Oracle JDBC driver.

[Figure: service demand of API #1 dropped from 46 ms to 5.8 ms and of API #2 from 42 ms to 10 ms]
Fig. 5 Efficacy of array processing with two APIs of batch job 1

It is seen that the first API was 8 times faster and the second API was 4 times faster after the original implementation was replaced with array processing.

Fig. 6 below shows the improvement in the performance of the two APIs of batch job #2 before and after implementing array processing. It is seen that the first API was about 3 times faster and the second API was about 7 times faster after the original implementation was replaced with array processing.

[Figure: service demands of 21 ms for API #1 and 92 ms for API #2 before array processing, reduced by roughly 3x and 7x after]
Fig. 6 Efficacy of array processing with two APIs of batch job 2

Fig. 7 below shows the effects of varying the array size in the array processing implementation on the throughput of the two batch jobs.

[Figure: throughput (records/s) versus batch size:
batch size:  1   10   20   30   40   50   100
batch 1:    16   33   36   36   38   41    45
batch 2:    13   24   28   30   31   33    31]
Fig. 7 Effects of batch size in array processing implementation on the throughput of two batch jobs

As is seen, a batch size of 10 gives a sharp improvement, by as much as 200%, in the throughput of the two batch jobs compared with no array processing or a batch size of 1. Because of this significant performance improvement, every enterprise software application should adopt and implement array processing, even during the early stages of the product development life cycle before performance assurance and acceptance tests begin.
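The paper's measurements use the JDBC 2 executeBatch API; as a language-neutral illustration of the same pattern, here is a sketch using Python's sqlite3 module, whose executemany call plays the role of a batched statement. The table and data are made up, and since sqlite3 is in-process, the sketch shows the API shape rather than real network round-trip savings:

```python
import sqlite3

# Array processing sketch: issue many INSERTs as one batched call instead of
# one call per row, cutting the visit count V_i (and hence D_i = V_i x S_i).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")

rows = [(i, i * 1.5) for i in range(1000)]

# One-by-one: one database call per row.
for row in rows[:500]:
    conn.execute("INSERT INTO orders VALUES (?, ?)", row)

# Array processing: the remaining 500 rows go down in a single batched call.
conn.executemany("INSERT INTO orders VALUES (?, ?)", rows[500:])
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(count)  # 1000
```
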

A batch size of 100 can be used as a default setting before finding the optimal batch size through later performance tests using realistic workloads.

Fig. 8 below shows the normalized performance improvement of an enterprise application built in Java/C/C++ after implementing array processing during the later cycle of performance assurance tests. In this case, batch job A was run first and then batch job B was run. The throughputs of the two batch jobs were improved by 65% and 20%, respectively. A batch size of 100 was hard-coded into the application, which was verified to be optimal. These improvements are significant, as these batch jobs may run for days, depending on the volume of data to be processed.

[Figure: normalized performance gains of 1.65 for batch A and 1.2 for batch B relative to the baseline]
Fig. 8 Performance improvement using array processing on an enterprise application built in Java/C/C++

4.2 Caching - Reducing Wait Time (Wi)

Caching helps reduce wait time, or data latency, for objects that are fairly static and reused frequently. It is the most common performance optimization technique and is certainly implemented in every enterprise software application. It's safe to say that without caching, none of today's enterprise applications would perform.

Caching can be implemented at various levels, for example, at the application server level, at the database level, and at the storage level. Developers are usually very good at implementing caching at the application level. However, human beings make mistakes, especially under tight schedules. Here are two examples where you can second-guess why performance suddenly degraded significantly:

- In order to add a functionality requested by the customer, a small change was made to a production enterprise application that had been in use by millions of users for online shopping. Suddenly, the system was so slow that users were no longer able to place purchase orders online. The customer was losing revenue as every minute went by. What was the performance defect that caused this incident?

- A large-scale enterprise application had undergone a thorough overhaul, retrofitting and combining all major parts to make it perform better. Then, routine regression performance tests showed that it was 10 to 30 times slower than before. What could cause such an enormous performance degradation?

In both cases, the performance degradation was caused by objects that had slipped away from caching in the code. And in both cases, the defects were fixed in time without causing any damage to the customer's business. The moral of the story is that caching objects that are static at the application level plays a vital role in delivering decent performance for every large-scale enterprise software application. When the performance of an application suddenly drops drastically after some changes to its code, it is a smart guess that some objects may have accidentally slipped away from caching.

Another interesting example of caching is at the storage level. Let's say you have an I/O-intensive enterprise application. Initial performance tests were done with local disks installed internally on the database server. Later, you had a chance to test with more advanced hardware that uses a SAN as the storage for the database server. What kind of data latency would you anticipate between local disks and the SAN? The following table shows the difference between local disks (320 MB/s SCSI) and a SAN for about 200k physical reads and writes.

Table 4 Data latency comparison between local disks and SAN

I/O                  | Local disk | SAN
Reads                | 201,400    | 201,800
Avg Reads / s        | 160        | 400
Avg Read Time (ms)   | 9          | 0.21
Writes               | 238,300    | 124,700
Avg Writes / s       | 190        | 250
Buffer Waits         | 5,300      | 12,000
Avg Buffer Wait (ms) | 11         | 0.12

It is seen that the read / buffer wait latency was in the range of 10 ms with local disks, but in the sub-millisecond range with the SAN. The difference comes from the fact that the SAN under test had a huge cache of 2 GB with cache-write enabled. With such a huge disparity in I/O latency, the throughput differed by as much as 2 to 3 times with the same workload.
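At the application level, caching of fairly static objects can be sketched with simple memoization; the 2 ms backend delay below is a stand-in for a database or storage round-trip, not a measurement from the paper:

```python
import functools
import time

# Caching sketch: memoize lookups of fairly static objects so that repeated
# requests skip the slow backend fetch entirely (wait time W_i -> ~0).

def fetch_from_backend(key):
    time.sleep(0.002)            # simulated backend latency (stand-in value)
    return {"key": key, "value": key.upper()}

@functools.lru_cache(maxsize=1024)
def fetch_cached(key):
    return fetch_from_backend(key)

fetch_cached("currency_table")           # cold call: pays the backend latency
start = time.perf_counter()
hit = fetch_cached("currency_table")     # warm call: served from the cache
warm_latency = time.perf_counter() - start

print(hit["value"], warm_latency < 0.002)
```

The defect pattern described above corresponds to a code change that bypasses fetch_cached and calls fetch_from_backend directly, silently restoring the full latency on every request.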

4.3 Data in Index Reducing Service Demand (Di)

Database SQL index tuning is a much broader category than can be covered in this paper. Many texts on SQL tuning are readily available [HARR01, TOW03], and most developers know how to index simple queries. However, queries with deep joins, as often found in data warehouse applications, are an exception.

In this section, we present an example beyond simple SQL indexing that can help combat excessive logical I/Os, or buffer gets in Oracle terms. Excessive logical I/Os consume significant amounts of database CPU and severely affect the throughput of batch jobs.

Assume that we have the following two SQLs, which are similar to each other:

Select C3 from T1 where C1=<value> and C2=<value> order by 1 ASC;

and

Select C3 from T2 where C1=<value> and C2=<value> order by 1 ASC;

The difference between these two SQLs is that they access two different tables, T1 and T2. These are large tables that can contain millions of rows. Performance tests showed that throughput was low even with both queries properly indexed on the C1 and C2 columns. The Oracle performance diagnostics tool indicated that excessive buffer gets associated with these two SQLs were responsible for the large amounts of CPU time consumed on the database server.

Closer examination revealed that these two SQLs return only one column from the data tables. This is a perfect situation where data-in-index (DII) can play a role. By appending the C3 column to the indexes on the C1 and C2 columns of both tables, there is no need for the two queries to touch the data tables at all, which not only saves physical I/Os to the data tables but also avoids bringing too much data into the database buffer cache.

Fig. 9 below shows the significant reduction of buffer gets on both queries after this data-in-index optimization was applied. As is seen, the number of buffer gets was reduced from 164 million to 42 million for the first query and from 101 million to 24 million for the second query. This optimization not only improved the throughput of the batch job by as much as 55% but also improved the batch job's scalability: without this fix, more and more data would be brought into the database buffer cache as the volume of data to be processed grows.

[Bar chart: number of buffer gets (million) for SQL1 and SQL2, regular index (164 and 101) versus data-in-index (42 and 24)]
Fig. 9 Combating excessive buffer gets with the data-in-index optimization technique

4.4 Cursor-sharing Reducing Service Demand (Di)

This section presents another example of how the performance of an enterprise application can be improved by optimization techniques that help reduce service demands. This example is Oracle 10g specific, but might be applicable to other database products as well.
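The data-in-index technique of Section 4.3 is known generically as a covering index, and its effect is easy to reproduce outside Oracle. The sketch below uses SQLite (table and index names are illustrative), whose query planner explicitly reports when an index alone satisfies a query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T1 (C1 INTEGER, C2 INTEGER, C3 TEXT)")

query = "SELECT C3 FROM T1 WHERE C1 = ? AND C2 = ? ORDER BY 1 ASC"

def plan(sql):
    # Concatenate the query-plan detail strings reported by SQLite.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, (1, 2)).fetchall()
    return " ".join(row[3] for row in rows)

# Regular index on the WHERE columns only: the engine must still visit
# the table rows to fetch C3, paying extra logical I/O per matching row.
conn.execute("CREATE INDEX i_regular ON T1 (C1, C2)")
p_regular = plan(query)

# Data-in-index: append the selected column C3 to the index, so the
# query never needs to touch the data table at all.
conn.execute("DROP INDEX i_regular")
conn.execute("CREATE INDEX i_dii ON T1 (C1, C2, C3)")
p_dii = plan(query)

print(p_regular)
print(p_dii)
```

SQLite labels the second plan a COVERING INDEX scan, which is exactly the effect DII achieved in the Oracle case above: the query is answered entirely from the index.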

When a query is received by a database server, it must be parsed first. Parsing can be either a hard parse or a soft parse. A hard parse is very expensive, since the database server treats the query as brand new: all data structure setup and all validation logic has to be repeated. A soft parse is less expensive, as the database server may re-use the data structures already set up when the SQL was executed the first time, replacing only the literals in the WHERE clause for subsequent executions.
With Oracle 10g, an initialization parameter named CURSOR_SHARING determines how a SQL is parsed. This parameter has three settings: EXACT (default), SIMILAR, and FORCE. Setting CURSOR_SHARING to EXACT incurs excessive hard parses and may hurt performance severely. Statements that are identical except for the values of some literals are called similar statements. Setting CURSOR_SHARING to either SIMILAR or FORCE allows similar statements to share SQL. The difference is that SIMILAR forces similar statements to share the SQL area without deteriorating execution plans, whereas FORCE forces similar statements to share the executable SQL area at the risk of deteriorated execution plans. Hence, FORCE should be used as a last resort, when the risk of suboptimal plans is outweighed by the improvements in cursor sharing.
Using CURSOR_SHARING = SIMILAR or FORCE can significantly improve cursor sharing on applications that issue many similar statements, resulting in reduced memory usage, faster parses, and reduced latch contention. Fig. 10 below is an example showing how different CURSOR_SHARING settings can affect the performance of an application.

[Bar chart: normalized throughput of a batch job under the EXACT, SIMILAR, and FORCE settings]
Fig. 10 Effects of CURSOR_SHARING settings on the throughput of an enterprise application batch job

Table 5 below shows how resource consumption on the database server varied among the different CURSOR_SHARING settings for the following counters:

a) latch: library cache %Total Call Time
b) % of DB Time of Parse Time Elapsed
c) % of DB Time of Hard Parse Elapsed Time

Note that the reduction in each counter effectively reduced service demands on the database server, which resulted in better performance.

Table 5 Resource consumption with various CURSOR_SHARING settings

Counter    EXACT    SIMILAR    FORCE
a)         55       3.6        0.5
b)         61       26         10
c)         35       13         0.3

In general, SIMILAR and FORCE are faster than EXACT. However, there is no guarantee that FORCE will always be faster than SIMILAR, or vice versa. A performance test with a realistic workload is the only way to find out which one is faster.

4.5 Eliminating Extraneous Logic Reducing Service Demand (Di)

Enterprise applications are heavy on validation of business rules that must be enforced. Any business rule validation may consume a significant amount of CPU resources on both the application server and the database server. Without checking with performance tests, it's easy for developers to code more validation logic than is actually necessary.

For example, when new objects are inserted into the database, the application may fire database triggers to validate various attributes of each object before it is allowed into the database. To make things worse, triggers may fire sub-triggers. In one real case, before tuning, as many as 130 triggers were fired every time a new object was stored into the database. After careful examination of the validation logic, it was found that as many as half of the triggers never actually needed to be fired. Removing those extraneous triggers increased the throughput of the application by 35%. Eliminating extraneous triggers effectively reduced service demands on the database server, which resulted in better throughput.

Before performance assurance and acceptance testing, it's hard to identify unnecessary code logic that wastes CPU resources and degrades the performance of the application. For large-scale enterprise software applications, rigorous performance testing has always been helpful for preventing severe performance defects from being discovered after release to the customers.
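The trigger overhead described in Section 4.5 can be reproduced in miniature. The sketch below uses SQLite with illustrative names (the real case involved Oracle triggers); it simply counts how many trigger bodies run per insert, since every extraneous trigger adds service demand to each insert:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (id INTEGER PRIMARY KEY, attr TEXT)")
conn.execute("CREATE TABLE fired (trigger_name TEXT)")  # audit of trigger firings

N_TRIGGERS = 10  # stand-in for the 130 triggers observed in the real case
for i in range(N_TRIGGERS):
    conn.execute(f"""
        CREATE TRIGGER validate_{i} AFTER INSERT ON objects
        BEGIN
            INSERT INTO fired VALUES ('validate_{i}');
        END""")

# Every insert pays for every trigger on the table.
conn.execute("INSERT INTO objects (attr) VALUES ('x')")
fires = conn.execute("SELECT COUNT(*) FROM fired").fetchone()[0]
print(fires)  # 10

# Dropping the extraneous half cuts per-insert service demand accordingly.
for i in range(N_TRIGGERS // 2):
    conn.execute(f"DROP TRIGGER validate_{i}")
conn.execute("DELETE FROM fired")
conn.execute("INSERT INTO objects (attr) VALUES ('y')")
fires2 = conn.execute("SELECT COUNT(*) FROM fired").fetchone()[0]
print(fires2)  # 5
```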
4.6 Faster Storage Reducing Data Latency (Wi)

I/O is an essential part of every enterprise application. It should be treated as the first potential bottleneck if it's known that the application incurs heavy I/O activity. In the context of queuing theory, faster I/O reduces data latency and therefore improves throughput, as is shown in the following table with the same workload but different storage configurations. Both local disk configurations were SCSI 320, but on different platforms.

Table 6 Throughput versus I/O rates

Configuration     Reads/s    Writes/s    Throughput
Local disk (I)    160        190         42
Local disk (II)   54         52          27
SAN (RAID 0)      400        250         90
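In queuing terms, the benefit of faster storage can be sketched with the M/M/1 response-time formula R = S / (1 - U), where S is the per-I/O service time and U = λ x S is the device utilization: reducing S shrinks both the service and the waiting components of data latency. The numbers below are illustrative, not taken from the tests above:

```python
def mm1_response_time(service_time_s: float, arrival_rate: float) -> float:
    """M/M/1 response time R = S / (1 - U), with utilization U = lambda * S."""
    u = arrival_rate * service_time_s
    assert u < 1.0, "queue is unstable at or above 100% utilization"
    return service_time_s / (1.0 - u)

rate = 150.0  # I/Os per second offered to the device (illustrative)

slow = mm1_response_time(0.005, rate)   # 5 ms per I/O: U = 0.75
fast = mm1_response_time(0.0005, rate)  # 0.5 ms per I/O (cached SAN): U = 0.075

print(f"slow disk latency: {slow * 1000:.1f} ms")   # 20.0 ms
print(f"fast disk latency: {fast * 1000:.2f} ms")   # 0.54 ms
```

Note the nonlinearity: a 10x faster device yields far more than a 10x latency reduction here, because the queuing (wait) component collapses along with the service time.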

5. Balanced Queuing Systems

Traditionally, a balanced queuing system is defined as one in which all queuing nodes have the same service demand. This poses a great challenge for actually finding balanced queuing systems, as it's not easy to quantify service demands exactly [LIU05]. Fortunately, there are alternative approaches to achieving the same goal. A closer look at Eq. (3),

Ui = X0 x Di,

reveals that for a stable system with constant throughput (X0), the service demand (Di) and the utilization (Ui) are proportional to each other. This leads to a new definition of balanced queuing systems using utilizations: a queuing system is balanced if all nodes have the same utilization.

A guideline for optimizing the performance of an enterprise application can be derived from this utilization-based definition of a balanced system:

An enterprise software application performs better if it is a balanced system.

We can't say "performs best," as the performance of an enterprise application depends on numerous factors, including the hardware it runs on.
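This utilization-based view is straightforward to compute: given a measured throughput X0 and per-node service demands Di, the utilization law Ui = X0 x Di yields each node's utilization, and the node with the highest utilization bounds throughput at 1/Dmax. The service-demand figures below are illustrative:

```python
# Utilization law: Ui = X0 * Di (Eq. (3) in the text).
service_demands = {"app server": 0.004, "db server": 0.006}  # seconds/transaction
X0 = 120.0  # measured system throughput, transactions per second

utilizations = {node: X0 * d for node, d in service_demands.items()}
for node, u in utilizations.items():
    print(f"{node}: {u:.0%} busy")

# The system is balanced when all utilizations are (nearly) equal;
# otherwise the highest-utilization node is the one to tune, and it
# caps the achievable throughput at 1 / Dmax.
bottleneck = max(utilizations, key=utilizations.get)
max_throughput = 1.0 / service_demands[bottleneck]
print(f"tune the {bottleneck}; throughput ceiling ~ {max_throughput:.0f} tx/s")
```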
To demonstrate the utility of this guideline, we show the experimental results of two systems, System A and System B. The two systems were similar except that one was unbalanced and the other was balanced.

Fig. 11 shows System A, which had a non-constant CPU ratio between the application server and the database server: initially the utilizations were about the same, and then with time they bifurcated. The database server CPU utilization kept growing while the application server CPU utilization kept decreasing.

Fig. 11 An unbalanced system with bifurcating CPU utilizations of the application server (blue curve) and the database server (yellow curve)

Fig. 12 shows a constant CPU ratio between the application server and the database server with exactly the same workload, after some tuning was applied. At the end of the test run, the throughput was 107, which was 73% better than System A. To some extent, a balanced system means a stable system, as is demonstrated by this test.

Fig. 12 A balanced system with a constant CPU ratio of the application server (blue curve) to the database server (yellow curve)

Fig. 13 below shows the CPU patterns of the application server and the database server with the tuning turned on and off. Such tunings are useful for maintaining a balanced condition for the enterprise application so that it performs to the expectation of the customer.

Fig. 13 CPU patterns of the application server (blue curve) and the database server (yellow curve) with the tuning turned on and off

The utility of this performance optimization guideline is obvious: the utilization of a device can be measured and monitored far more easily than its service demand. It can be used as a generic guide to optimizing and tuning the performance of an enterprise software application. For example, let's say we have an enterprise application installed on a single system, that is, both the application server and the database server sit on one box. We can then monitor the CPU utilizations of the application server process and the database server process simultaneously while the performance tests are running. If the CPU ratio of the application server to the database server is significantly disparate, the busier one is the one to optimize or tune. Iterating this way, we can squeeze out every bit of performance for the application under development.
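The iterative procedure just described can be mechanized: sample the CPU utilizations of the two server processes during a test run and flag the tier whose utilization is disparately high. The samples and the tolerance threshold below are illustrative:

```python
def tier_to_tune(app_cpu: float, db_cpu: float, tolerance: float = 0.15) -> str:
    """Compare CPU utilizations (0..1) of the app and db server processes.
    Within tolerance the system is considered balanced; otherwise the
    busier tier is the next optimization target."""
    if abs(app_cpu - db_cpu) <= tolerance:
        return "balanced"
    return "app server" if app_cpu > db_cpu else "db server"

# Samples taken at intervals during a test run; this sequence mimics the
# bifurcating pattern of an unbalanced system like the one in Fig. 11.
samples = [(0.50, 0.52), (0.45, 0.60), (0.35, 0.75), (0.30, 0.85)]
for app, db in samples:
    print(f"app {app:.0%} / db {db:.0%} -> {tier_to_tune(app, db)}")
```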

In order to efficiently optimize the performance of an enterprise software application and reach a balanced state faster, it's desirable to use a high-quality application API profiler. A profile such as the one shown in Fig. 14 immediately reveals the following useful information for each API of the application:

- The percentage of elapsed time or CPU time
- The absolute elapsed time or CPU time
- The number of times an API was called

Fig. 14 An example profile for a J2EE enterprise application (each node is a Java method; red edges indicate hot APIs and expensive performance paths)

A profile can help nail down hot APIs immediately and drive the system to the balanced state quickly. Without such information available, one has to make inferences based on the hot SQLs captured on the database side.
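For an application written in Python rather than J2EE, the three pieces of information listed above come straight out of the standard-library profiler; the workload function here is an illustrative stand-in for a hot API:

```python
import cProfile
import io
import pstats

def expensive_api():
    # Stand-in for a hot application API.
    return sum(i * i for i in range(10_000))

def workload():
    for _ in range(50):
        expensive_api()

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Per-function call counts and cumulative times, sorted by cumulative time --
# the same data (percentage, absolute time, call count) a J2EE profiler shows.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
report = buf.getvalue()
print(report)
```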

6. Conclusions
In this paper, we introduced the basics of queuing theory and demonstrated its applicability to optimizing the performance of enterprise software applications. We presented a number of optimization techniques that resulted in immediate performance improvements when applied.
We also proposed a new definition of balanced systems using utilizations instead of service demands. The two definitions are equivalent, but the new one is more practical, as utilizations can be measured and monitored more easily.
Based on the new definition of balanced systems, we derived a performance optimization guideline which states that an enterprise software application performs better if it is a balanced system. This has turned out to be a very useful guide for optimizing and tuning the performance of enterprise software applications both under development and after deployment. Future research will clarify this guideline further.

Acknowledgement
I'd like to thank Dr. Timothy Harris for discussions and for reviewing this paper.

References
[WEB01] http://en.wikipedia.org/wiki/Premature_optimization
[LITT61] J. D. C. Little, "A Proof for the Queuing Formula: L = λW," Operations Research, 9(3), 1961.
[JACK63] J. R. Jackson, "Jobshop-like Queuing Systems," Management Science, 10(1), 1963.
[LAZO84] E. D. Lazowska et al., Quantitative System Performance: Computer System Analysis Using Queueing Network Models, Prentice-Hall, Inc., 1984.
[JAIN91] R. Jain, The Art of Computer Systems Performance Analysis, John Wiley & Sons, Inc., 1991.
[HARR93] P. G. Harrison et al., Performance Modelling of Communication Networks and Computer Architectures, Addison-Wesley, Wokingham, U.K., 1993.
[GUNT98] N. Gunther, The Practical Performance Analyst, McGraw-Hill, 1998.
[SIMI03] H. Simitci, Storage Network Performance Analysis, Wiley, 2003.
[MENA00] D. A. Menasce and V. A. F. Almeida, Scaling for E-Business, Prentice Hall PTR, 2000.
[MENA04] D. A. Menasce et al., Performance by Design, Prentice Hall, 2004.
[LIU04] H. Liu and P. Crain, "An Analytic Model for Predicting the Performance of SOA-based Enterprise Software Applications," CMG Proceedings, Vol. 2, pp. 821-832, Las Vegas, Dec. 2004.
[LIU05] H. Liu, "Service Demand Models for Enterprise Software Applications," CMG Proceedings, Orlando, Florida, Dec. 2005.
[KEND81] D. G. Kendall, J. Royal Statistical Society, Series B, 13, pp. 151-185, 1951.
[REIS80] M. Reiser and S. S. Lavenberg, "Mean-Value Analysis of Closed Multichain Queuing Networks," J. of the ACM, 27(2), pp. 313-322, 1980.
[HARR01] G. Harrison, Oracle SQL High-Performance Tuning, Prentice Hall PTR, 2001.
[TOW03] D. Tow, SQL Tuning, O'Reilly & Associates, Inc., 2003.
[SMIT02] C. Smith and L. Williams, Performance Solutions: A Practical Guide to Creating Responsive, Scalable Software, Addison-Wesley, 2002.
[BONA01] E. Bonazzi and G. Stokol, Oracle 8i & Java: From Client/Server to E-Commerce, Prentice Hall PTR, 2001.
