
CHAPTER 10  COMPARING SYSTEMS

“The method that proceeds without analysis is like the groping of a blind man.”
—Socrates

10.1 Introduction
In many cases, simulations are conducted to compare two or more alternative de-
signs of a system with the goal of identifying the superior system relative to some
performance measure. Comparing alternative system designs requires careful
analysis to ensure that differences being observed are attributable to actual differ-
ences in performance and not to statistical variation. This is where running either
multiple replications or batches is required. Suppose, for example, that method A
for deploying resources yields a throughput of 100 entities for a given time period
while method B results in 110 entities for the same period. Is it valid to conclude
that method B is better than method A, or might additional replications actually
lead to the opposite conclusion?
You can evaluate alternative configurations or operating policies by perform-
ing several replications of each alternative and comparing the average results
from the replications. Statistical methods for making these types of comparisons
are called hypothesis tests. For these tests, a hypothesis is first formulated (for
example, that methods A and B both result in the same throughput) and then a test
is made to see whether the results of the simulation lead us to reject the hypothe-
sis. The outcome of the simulation runs may cause us to reject the hypothesis that
methods A and B both result in equal throughput capabilities and conclude that the
throughput does indeed depend on which method is used.
This chapter extends the material presented in Chapter 9 by providing statis-
tical methods that can be used to compare the output of different simulation mod-
els that represent competing designs of a system. The concepts behind hypothesis
testing are introduced in Section 10.2. Section 10.3 addresses the case when two
alternative system designs are to be compared, and Section 10.4 considers the
case when more than two alternative system designs are to be compared. Addi-
tionally, a technique called common random numbers is described in Section 10.5
that can sometimes improve the accuracy of the comparisons.

10.2 Hypothesis Testing


An inventory allocation example will be used to further explore the use of hy-
pothesis testing for comparing the output of different simulation models. Suppose
a production system consists of four machines and three buffer storage areas.
Parts entering the system require processing by each of the four machines in a se-
rial fashion (Figure 10.1). A part is always available for processing at the first
machine. After a part is processed, it moves from the machine to the buffer stor-
age area for the next machine, where it waits to be processed. However, if the
buffer is full, the part cannot move forward and remains on the machine until a
space becomes available in the buffer. Furthermore, the machine is blocked and
no other parts can move to the machine for processing. The part exits the system
after being processed by the fourth machine.
The question for this system is how best to allocate buffer storage between the
machines to maximize the throughput of the system (number of parts completed
per hour). The production control staff has identified two candidate strategies for
allocating buffer capacity (number of parts that can be stored) between machines,
and simulation models have been built to evaluate the proposed strategies.

FIGURE 10.1  Production system with four workstations and three buffer storage areas.

Suppose that Strategy 1 and Strategy 2 are the two buffer allocation strategies
proposed by the production control staff. We wish to identify the strategy that
maximizes the throughput of the production system (number of parts completed
per hour). Of course, the possibility exists that there is no significant difference in
the performance of the two candidate strategies. That is to say, the mean through-
put of the two proposed strategies is equal. A starting point for our problem is to
formulate our hypotheses concerning the mean throughput for the production
system under the two buffer allocation strategies. Next we work out the details of
setting up our experiments with the simulation models built to evaluate each strat-
egy. For example, we may decide to estimate the true mean performance of each
strategy (µ1 and µ2) by simulating each strategy for 16 days (24 hours per day)
past the warm-up period and replicating the simulation 10 times. After we run
experiments, we would use the simulation output to evaluate the hypotheses
concerning the mean throughput for the production system under the two buffer
allocation strategies.
In general, a null hypothesis, denoted H0, is drafted to state that the value of µ1
is not significantly different than the value of µ2 at the α level of significance. An
alternate hypothesis, denoted H1, is drafted to oppose the null hypothesis H0. For
example, H1 could state that µ1 and µ2 are different at the α level of significance.
Stated more formally:
H0: µ1 = µ2    or equivalently    H0: µ1 − µ2 = 0
H1: µ1 ≠ µ2    or equivalently    H1: µ1 − µ2 ≠ 0
In the context of the example problem, the null hypothesis H0 states that the
mean throughputs of the system due to Strategy 1 and Strategy 2 do not differ. The
alternate hypothesis H1 states that the mean throughputs of the system due to
Strategy 1 and Strategy 2 do differ. Hypothesis testing methods are designed such
that the burden of proof is on us to demonstrate that H0 is not true. Therefore, if
our analysis of the data from our experiments leads us to reject H0, we can be con-
fident that there is a significant difference between the two population means. In
our example problem, the output from the simulation model for Strategy 1 repre-
sents possible throughput observations from one population, and the output from
the simulation model for Strategy 2 represents possible throughput observations
from another population.
The α level of significance in these hypotheses refers to the probability of
making a Type I error. A Type I error occurs when we reject H0 in favor of H1
when in fact H0 is true. Typically α is set at a value of 0.05 or 0.01. However,
the choice is yours, and it depends on how small you want the probability of
making a Type I error to be. A Type II error occurs when we fail to reject H0 in
favor of H1 when in fact H1 is true. The probability of making a Type II error is
denoted as β. Hypothesis testing methods are designed such that the probability
of making a Type II error, β, is as small as possible for a given value of α. The
relationship between α and β is that β increases as α decreases. Therefore, we
should be careful not to make α too small.
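
To make the role of α concrete, here is a small Monte Carlo sketch (ours, not from the text; it assumes NumPy and SciPy are available) showing that when H0 really is true, a test run at α = 0.05 commits a Type I error about 5 percent of the time:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha, n, trials, rejections = 0.05, 10, 10_000, 0
for _ in range(trials):
    a = rng.normal(56.0, 1.3, n)   # both "strategies" share the same true mean
    b = rng.normal(56.0, 1.3, n)   # so H0 is true by construction
    if stats.ttest_ind(a, b, equal_var=False).pvalue < alpha:
        rejections += 1            # a rejection here is a Type I error
print(rejections / trials)         # should land close to 0.05
```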
We will test these hypotheses using a confidence interval approach to
determine if we should reject or fail to reject the null hypothesis in favor of the
alternative hypothesis. The reason for using the confidence interval method is that
it is equivalent to conducting a two-tailed test of hypothesis with the added bene-
fit of indicating the magnitude of the difference between µ1 and µ2 if they are in
fact significantly different. The first step of this procedure is to construct a confi-
dence interval to estimate the difference between the two means (µ1 − µ2 ). This
can be done in different ways depending on how the simulation experiments are
conducted (we will discuss this later). For now, let’s express the confidence inter-
val on the difference between the two means as
P[(x̄1 − x̄2 ) − hw ≤ µ1 − µ2 ≤ (x̄1 − x̄2 ) + hw] = 1 − α
where hw denotes the half-width of the confidence interval. Notice the similari-
ties between this confidence interval expression and the one given on page 227 in
Chapter 9. Here we have replaced x̄ with x̄1 − x̄2 and µ with µ1 − µ2 .
If the two population means are the same, then µ1 − µ2 = 0, which is our
null hypothesis H0. If H0 is true, our confidence interval should include zero with
a probability of 1 − α. This leads to the following rule for deciding whether to re-
ject or fail to reject H0. If the confidence interval includes zero, we fail to reject H0
and conclude that the value of µ1 is not significantly different than the value of µ2
at the α level of significance (the mean throughput of Strategy 1 is not signifi-
cantly different than the mean throughput of Strategy 2). However, if the confi-
dence interval does not include zero, we reject H0 and conclude that the value of
µ1 is significantly different than the value of µ2 at the α level of significance
(throughput values for Strategy 1 and Strategy 2 are significantly different).
FIGURE 10.2  Three possible positions of a confidence interval relative to zero: (a) the interval contains zero, so we fail to reject H0; (b) the interval lies entirely to the left of zero, so we reject H0; (c) the interval lies entirely to the right of zero, so we reject H0. Each interval is (x̄1 − x̄2) ± hw.

Figure 10.2(a) illustrates the case when the confidence interval contains zero,
leading us to fail to reject the null hypothesis H0 and conclude that there is no sig-
nificant difference between µ1 and µ2. The failure to obtain sufficient evidence to
pick one alternative over another may be due to the fact that there really is no dif-
ference, or it may be a result of the variance in the observed outcomes being too
high to be conclusive. At this point, either additional replications may be run or one
of several variance reduction techniques might be employed (see Section 10.5).
Figure 10.2(b) illustrates the case when the confidence interval is completely to the
left of zero, leading us to reject H0. This case suggests that µ1 − µ2 < 0 or, equiv-
alently, µ1 < µ2 . Figure 10.2(c) illustrates the case when the confidence interval
is completely to the right of zero, leading us to also reject H0. This case suggests
that µ1 − µ2 > 0 or, equivalently, µ1 > µ2 . These rules are commonly used in
practice to make statements about how the population means differ
(µ1 > µ2 or µ1 < µ2 ) when the confidence interval does not include zero (Banks
et al. 2001; Hoover and Perry 1989).
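
The decision rule is easy to mechanize. The following sketch (illustrative only; the function name and numeric values are ours) returns the conclusion implied by a confidence interval on µ1 − µ2:

```python
def decide(xbar1, xbar2, hw):
    """Apply the Figure 10.2 decision rule to a confidence interval on mu1 - mu2."""
    lower = (xbar1 - xbar2) - hw
    upper = (xbar1 - xbar2) + hw
    if lower <= 0.0 <= upper:
        return "fail to reject H0: no significant difference detected"
    if upper < 0.0:
        return "reject H0: mu1 < mu2"
    return "reject H0: mu1 > mu2"

# Example with made-up values: sample means 56.3 and 54.6, half-width 1.2
print(decide(56.3, 54.6, 1.2))   # -> reject H0: mu1 > mu2
```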

10.3 Comparing Two Alternative System Designs


In this section, we will present two methods based on the confidence interval
approach that are commonly used to compare two alternative system designs. To
facilitate our understanding of the confidence interval methods, the presentation
will relate to the buffer allocation example in Section 10.2, where the production
control staff has identified two candidate strategies believed to maximize the
throughput of the system. We seek to discover if the mean throughputs of the
system due to Strategy 1 and Strategy 2 are significantly different. We begin by es-
timating the mean performance of the two proposed buffer allocation strategies (µ1
and µ2) by simulating each strategy for 16 days (24 hours per day) past the warm-
up period. The simulation experiment was replicated 10 times for each strategy.
Therefore, we obtained a sample size of 10 for each strategy (n 1 = n 2 = 10). The
average hourly throughput achieved by each strategy is shown in Table 10.1.
TABLE 10.1  Comparison of Two Buffer Allocation Strategies

(A)                 (B)                 (C)
Replication         Strategy 1          Strategy 2
                    Throughput x1       Throughput x2

 1                  54.48               56.01
 2                  57.36               54.08
 3                  54.81               52.14
 4                  56.20               53.49
 5                  54.83               55.49
 6                  57.69               55.00
 7                  58.33               54.88
 8                  57.19               54.47
 9                  56.84               54.93
10                  55.29               55.84

Sample mean x̄i, for i = 1, 2                     56.30     54.63
Sample standard deviation si, for i = 1, 2        1.37      1.17
Sample variance si², for i = 1, 2                 1.89      1.36

As in Chapter 9, the methods for computing confidence intervals in this chapter
require that our observations be independent and normally distributed. The 10
observations in column B (Strategy 1 Throughput) of Table 10.1 are independent
because a unique segment (stream) of random numbers from the random number
generator was used for each replication. The same is true for the 10 observations in
column C (Strategy 2 Throughput). The use of random number streams is discussed
in Chapters 3 and 9 and later in this chapter. At this point we are assuming
cussed in Chapters 3 and 9 and later in this chapter. At this point we are assuming
that the observations are also normally distributed. The reasonableness of assum-
ing that the output produced by our simulation models is normally distributed is
discussed at length in Chapter 9. For this data set, we should also point out that two
different sets of random numbers were used to simulate the 10 replications of each
strategy. Therefore, the observations in column B are independent of the observa-
tions in column C. Stated another way, the two columns of observations are not
correlated. Therefore, the observations are independent within a population (strat-
egy) and between populations (strategies). This is an important distinction and will
be employed later to help us choose between different methods for computing the
confidence intervals used to compare the two strategies.
From the observations in Table 10.1 of the throughput produced by each strat-
egy, it is not obvious which strategy yields the higher throughput. Inspection of the
summary statistics indicates that Strategy 1 produced a higher mean throughput
for the system; however, the sample variance for Strategy 1 was higher than for
Strategy 2. Recall that the variance provides a measure of the variability of the data
and is obtained by squaring the standard deviation. Equations for computing the
sample mean x̄, sample variance s2, and sample standard deviation s are given in
Chapter 9. Because of this variation, we should be careful when making conclu-
sions about the population of throughput values (µ1 and µ2) by only inspecting the
point estimates (x̄1 and x̄2 ). We will avoid the temptation and use the output from
the 10 replications of each simulation model along with a confidence interval to
make a more informed decision.
We will use an α = 0.05 level of significance to compare the two candidate
strategies using the following hypotheses:
H0: µ1 − µ2 = 0
H1: µ1 − µ2 ≠ 0
where the subscripts 1 and 2 denote Strategy 1 and Strategy 2, respectively. As
stated earlier, there are two common methods for constructing a confidence
interval for evaluating hypotheses. The first method is referred to as the Welch
confidence interval (Law and Kelton 2000; Miller 1986) and is a modified two-
sample-t confidence interval. The second method is the paired-t confidence inter-
val (Miller et al. 1990). We’ve chosen to present these two methods because their
statistical assumptions are more easily satisfied than are the assumptions for other
confidence interval methods.

10.3.1 Welch Confidence Interval for Comparing Two Systems
The Welch confidence interval method requires that the observations drawn from
each population (simulated system) be normally distributed and independent
within a population and between populations. Recall that the observations in
Table 10.1 are independent and are assumed normal. However, the Welch confi-
dence interval method does not require that the number of samples drawn from
one population (n1) equal the number of samples drawn from the other population
(n2) as we did in the buffer allocation example. Therefore, if you have more ob-
servations for one candidate system than for the other candidate system, then by
all means use them. Additionally, this approach does not require that the two pop-
ulations have equal variances (σ1² = σ2² = σ²) as do other approaches. This is
useful because we seldom know the true value of the variance of a population.
Thus we are not required to judge the equality of the variances based on the sam-
ple variances we compute for each population (s1² and s2²) before using the Welch
confidence interval method.
The Welch confidence interval for an α level of significance is
P[(x̄1 − x̄2 ) − hw ≤ µ1 − µ2 ≤ (x̄1 − x̄2 ) + hw] = 1 − α
where x̄1 and x̄2 represent the sample means used to estimate the population
means µ1 and µ2; hw denotes the half-width of the confidence interval and is
computed by

hw = tdf,α/2 √(s1²/n1 + s2²/n2)

where df (degrees of freedom) is estimated by

df ≈ [s1²/n1 + s2²/n2]² / ( [s1²/n1]²/(n1 − 1) + [s2²/n2]²/(n2 − 1) )

and tdf,α/2 is a factor obtained from the Student’s t table in Appendix B based on
the value of α/2 and the estimated degrees of freedom. Note that the degrees of
freedom term in the Student’s t table is an integer value. Given that the estimated
degrees of freedom will seldom be an integer value, you will have to use interpo-
lation to compute the tdf,α/2 value.
For the example buffer allocation problem with an α = 0.05 level of signifi-
cance, we use these equations and data from Table 10.1 to compute
df ≈ [1.89/10 + 1.36/10]² / ( [1.89/10]²/(10 − 1) + [1.36/10]²/(10 − 1) ) ≈ 17.5

and

hw = t17.5,0.025 √(1.89/10 + 1.36/10) = 2.106 √0.325 = 1.20 parts per hour
where tdf,α/2 = t17.5,0.025 = 2.106 is determined from Student’s t table in Appen-
dix B by interpolation. Now the 95 percent confidence interval is
(x̄1 − x̄2 ) − hw ≤ µ1 − µ2 ≤ (x̄1 − x̄2 ) + hw
(56.30 − 54.63) − 1.20 ≤ µ1 − µ2 ≤ (56.30 − 54.63) + 1.20
0.47 ≤ µ1 − µ2 ≤ 2.87
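
For readers who prefer to script the calculation, here is a sketch of the same Welch interval computed from the Table 10.1 data (variable names are ours). It assumes NumPy and SciPy are available; scipy.stats.t.ppf accepts non-integer degrees of freedom directly, so no table interpolation is needed:

```python
import numpy as np
from scipy import stats

x1 = np.array([54.48, 57.36, 54.81, 56.20, 54.83, 57.69, 58.33, 57.19, 56.84, 55.29])
x2 = np.array([56.01, 54.08, 52.14, 53.49, 55.49, 55.00, 54.88, 54.47, 54.93, 55.84])

alpha = 0.05
n1, n2 = len(x1), len(x2)
v1, v2 = x1.var(ddof=1), x2.var(ddof=1)          # sample variances s1^2 and s2^2

# Welch's estimated degrees of freedom
df = (v1/n1 + v2/n2)**2 / ((v1/n1)**2/(n1 - 1) + (v2/n2)**2/(n2 - 1))

hw = stats.t.ppf(1 - alpha/2, df) * np.sqrt(v1/n1 + v2/n2)   # half-width
diff = x1.mean() - x2.mean()
print(f"df ~ {df:.1f}, interval: [{diff - hw:.2f}, {diff + hw:.2f}]")
# Expected to be close to the hand calculation: df ~ 17.5, interval [0.47, 2.87]
```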

With approximately 95 percent confidence, we conclude that there is a significant
difference between the mean throughputs of the two strategies because the in-
terval excludes zero. The confidence interval further suggests that the mean
throughput µ1 of Strategy 1 is higher than the mean throughput µ2 of Strategy 2
(an estimated 0.47 to 2.87 parts per hour higher).

10.3.2 Paired-t Confidence Interval for Comparing Two Systems
Like the Welch confidence interval method, the paired-t confidence interval
method requires that the observations drawn from each population (simulated sys-
tem) be normally distributed and independent within a population. However, the
paired-t confidence interval method does not require that the observations between
populations be independent. This allows us to use a technique called common
random numbers to force a positive correlation between the two populations of
observations in order to reduce the half-width of our confidence interval without
increasing the number of replications. Recall that the smaller the half-width, the
better our estimate. The common random numbers technique is discussed in Sec-
tion 10.5. Unlike the Welch method, the paired-t confidence interval method does
require that the number of samples drawn from one population (n1) equal the num-
ber of samples drawn from the other population (n2) as we did in the buffer alloca-
tion example. And like the Welch method, the paired-t confidence interval method
does not require that the populations have equal variances (σ1² = σ2² = σ²). This is
useful because we seldom know the true value of the variance of a population. Thus
we are not required to judge the equality of the variances based on the sample
variances we compute for each population (s1² and s2²) before using the paired-t
confidence interval method.
Given n observations (n 1 = n 2 = n), we pair the observations from each
population (x1 j and x2 j ) to define a new random variable x(1−2) j = x1 j − x2 j , for
j = 1, 2, 3, . . . , n. The x1 j denotes the jth observation from the first population
sampled (the output from the jth replication of the simulation model for the first
alternative design), x2 j denotes the jth observation from the second population
sampled (the output from the jth replication of the simulation model for the
second alternative design), and x(1−2) j denotes the difference between the jth
observations from the two populations. The point estimators for the new random
variable are
Sample mean = x̄(1−2) = [ Σ(j=1 to n) x(1−2)j ] / n

Sample standard deviation = s(1−2) = √{ Σ(j=1 to n) [x(1−2)j − x̄(1−2)]² / (n − 1) }

where x̄(1−2) estimates µ(1−2) and s(1−2) estimates σ(1−2) .



The half-width equation for the paired-t confidence interval is


hw = (tn−1,α/2) s(1−2) / √n
where tn−1,α/2 is a factor that can be obtained from the Student’s t table in Appen-
dix B based on the value of α/2 and the degrees of freedom (n − 1). Thus the
paired-t confidence interval for an α level of significance is
 
P[ x̄(1−2) − hw ≤ µ(1−2) ≤ x̄(1−2) + hw ] = 1 − α
Notice that this is basically the same confidence interval expression presented in
Chapter 9 with x̄(1−2) replacing x̄ and µ(1−2) replacing µ.
Let’s create Table 10.2 by restructuring Table 10.1 to conform to our new
“paired” notation and paired-t method before computing the confidence interval
necessary for testing the hypotheses
H0: µ1 − µ2 = 0, or using the new “paired” notation, µ(1−2) = 0
H1: µ1 − µ2 ≠ 0, or using the new “paired” notation, µ(1−2) ≠ 0
where the subscripts 1 and 2 denote Strategy 1 and Strategy 2, respectively.
The observations in Table 10.2 are identical to the observations in Table 10.1.
However, we added a fourth column. The values in the fourth column (col-
umn D) are computed by subtracting column C from column B. This fourth
column represents the 10 independent observations (n = 10) of our new random
variable x(1−2) j .

TABLE 10.2  Comparison of Two Buffer Allocation Strategies Based on the Paired Differences

(A)                (B)                (C)                (D)
Replication (j)    Strategy 1         Strategy 2         Throughput Difference (B − C)
                   Throughput x1j     Throughput x2j     x(1−2)j = x1j − x2j

 1                 54.48              56.01              −1.53
 2                 57.36              54.08               3.28
 3                 54.81              52.14               2.67
 4                 56.20              53.49               2.71
 5                 54.83              55.49              −0.66
 6                 57.69              55.00               2.69
 7                 58.33              54.88               3.45
 8                 57.19              54.47               2.72
 9                 56.84              54.93               1.91
10                 55.29              55.84              −0.55

Sample mean x̄(1−2)                        1.67
Sample standard deviation s(1−2)           1.85
Sample variance s²(1−2)                    3.42

Now, for an α = 0.05 level of significance, the confidence interval on the
new random variable x(1−2)j in the Throughput Difference column of Table 10.2
is computed using our equations as follows:

x̄(1−2) = [ Σ(j=1 to 10) x(1−2)j ] / 10 = 1.67 parts per hour

s(1−2) = √{ Σ(j=1 to 10) [x(1−2)j − 1.67]² / (10 − 1) } = 1.85 parts per hour

hw = (t9,0.025) s(1−2) / √10 = (2.262) 1.85 / √10 = 1.32 parts per hour
where tn−1,α/2 = t9,0.025 = 2.262 is determined from the Student’s t table in
Appendix B. The 95 percent confidence interval is
x̄(1−2) − hw ≤ µ(1−2) ≤ x̄(1−2) + hw
1.67 − 1.32 ≤ µ(1−2) ≤ 1.67 + 1.32
0.35 ≤ µ(1−2) ≤ 2.99
With approximately 95 percent confidence, we conclude that there is a signifi-
cant difference between the mean throughputs of the two strategies given that
the interval excludes zero. The confidence interval further suggests that the
mean throughput µ1 of Strategy 1 is higher than the mean throughput µ2 of
Strategy 2 (an estimated 0.35 to 2.99 parts per hour higher). This is basically the
same conclusion reached using the Welch confidence interval method presented
in Section 10.3.1.
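
A scripted version of the paired-t calculation for the Table 10.2 data follows (a sketch assuming NumPy and SciPy are available; variable names are ours):

```python
import numpy as np
from scipy import stats

x1 = np.array([54.48, 57.36, 54.81, 56.20, 54.83, 57.69, 58.33, 57.19, 56.84, 55.29])
x2 = np.array([56.01, 54.08, 52.14, 53.49, 55.49, 55.00, 54.88, 54.47, 54.93, 55.84])

d = x1 - x2                        # paired differences x_(1-2)j from Table 10.2
n, alpha = len(d), 0.05
hw = stats.t.ppf(1 - alpha/2, n - 1) * d.std(ddof=1) / np.sqrt(n)
print(f"mean difference = {d.mean():.2f}, "
      f"interval: [{d.mean() - hw:.2f}, {d.mean() + hw:.2f}]")
# Expected to be close to the hand calculation: 1.67 and [0.35, 2.99]
```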
Now let’s walk back through the assumptions made when we used the paired-t
method. The main requirement is that the observations in the Throughput
Difference column be independent and normally distributed. Pairing the through-
put observations for Strategy 1 with the throughput observations for Strategy 2 and
subtracting the two values formed the Throughput Difference column of Table
10.2. The observations for Strategy 1 are independent because nonoverlapping
streams of random numbers were used to drive each replication. The observations
for Strategy 2 are independent for the same reason. The use of random number
streams is discussed in Chapters 3 and 9 and later in this chapter. Therefore, there is
no doubt that these observations meet the independence requirement. It then
follows that the observations under the Throughput Difference column are also sta-
tistically independent. The assumption that has been made, without really knowing
if it’s true, is that the observations in the Throughput Difference column are nor-
mally distributed. The reasonableness of assuming that the output produced by our
simulation models is normally distributed is discussed at length in Chapter 9.

10.3.3 Welch versus the Paired-t Confidence Interval


It is difficult to say beforehand which method would produce the smaller con-
fidence interval half-width for a given comparison problem. If, however, the
observations between populations (simulated systems) are not independent, then
the Welch method cannot be used to compute the confidence interval. For this
case, use the paired-t method. If the observations between populations are inde-
pendent, the Welch method would be used to construct the confidence interval
should you have an unequal number of observations from each population
(n1 ≠ n2) and you do not wish to discard any of the observations in order to pair
them up as required by the paired-t method.

10.4 Comparing More Than Two Alternative System Designs


Sometimes we use simulation to compare more than two alternative designs of a
system with respect to a given performance measure. And, as in the case of com-
paring two alternative designs of a system, several statistical methods can be used
for the comparison. We will present two of the most popular methods used in sim-
ulation. The Bonferroni approach is presented in Section 10.4.1 and is useful for
comparing three to about five designs. A class of linear statistical models useful
for comparing any number of alternative designs of a system is presented in Sec-
tion 10.4.2. A brief introduction to factorial design and optimization experiments
is presented in Section 10.4.3.

10.4.1 The Bonferroni Approach for Comparing More Than Two Alternative Systems
The Bonferroni approach is useful when there are more than two alternative sys-
tem designs to compare with respect to some performance measure. Given K al-
ternative system designs to compare, the null hypothesis H0 and alternative
hypothesis H1 become
H0: µ1 = µ2 = µ3 = · · · = µK = µ    for K alternative systems
H1: µi ≠ µi′    for at least one pair i ≠ i′

where i and i′ are between 1 and K and i < i′. The null hypothesis H0 states that
the means from the K populations (mean output of the K different simulation
models) are not different, and the alternative hypothesis H1 states that at least one
pair of the means are different.
The Bonferroni approach is very similar to the two confidence interval
methods presented in Section 10.3 in that it is based on computing confidence
intervals to determine if the true mean performance of one system (µi ) is signif-
icantly different than the true mean performance of another system (µi  ). In fact,
either the paired-t confidence interval or the Welch confidence interval can be
used with the Bonferroni approach. However, we will describe it in the context of
using paired-t confidence intervals, noting that the paired-t confidence interval
method can be used when the observations across populations are either inde-
pendent or correlated.
The Bonferroni method is implemented by constructing a series of confidence
intervals to compare all system designs to each other (all pairwise comparisons).
The number of pairwise comparisons for K candidate designs is computed by
K(K − 1)/2. A paired-t confidence interval is constructed for each pairwise
comparison. For example, four candidate designs, denoted D1, D2, D3, and D4,
require the construction of six [4(4 − 1)/2] confidence intervals to evaluate the
differences µ(D1−D2), µ(D1−D3), µ(D1−D4), µ(D2−D3), µ(D2−D4), and µ(D3−D4). The
six paired-t confidence intervals are

P[ x̄(D1−D2) − hw ≤ µ(D1−D2) ≤ x̄(D1−D2) + hw ] = 1 − α1
P[ x̄(D1−D3) − hw ≤ µ(D1−D3) ≤ x̄(D1−D3) + hw ] = 1 − α2
P[ x̄(D1−D4) − hw ≤ µ(D1−D4) ≤ x̄(D1−D4) + hw ] = 1 − α3
P[ x̄(D2−D3) − hw ≤ µ(D2−D3) ≤ x̄(D2−D3) + hw ] = 1 − α4
P[ x̄(D2−D4) − hw ≤ µ(D2−D4) ≤ x̄(D2−D4) + hw ] = 1 − α5
P[ x̄(D3−D4) − hw ≤ µ(D3−D4) ≤ x̄(D3−D4) + hw ] = 1 − α6
The rule for deciding if there is a significant difference between the true mean
performance of two system designs is the same as before. Confidence intervals
that exclude zero indicate a significant difference between the mean performance
of the two systems being compared.
In a moment, we will gain some experience using the Bonferroni approach
on our example problem. However, we should discuss an important issue about
the approach first. Notice that the number of confidence intervals quickly grows
as the number of candidate designs K increases [number of confidence intervals
= K (K − 1)/2]. This increases our computational workload, but, more
importantly, it has a rather dramatic effect on the overall confidence we can place
in our conclusions. Specifically, the overall confidence in the correctness of our
conclusions goes down as the number of candidate designs increases. If we pick
any one of our confidence intervals, say the sixth one, and evaluate it separately
from the other five confidence intervals, the probability that the sixth confidence
interval statement is correct is equal to (1 − α6 ). Stated another way, we are
100(1 − α6 ) percent confident that the true but unknown mean (µ(D3−D4) ) lies
within the interval (x̄(D3−D4) − hw) to (x̄(D3−D4) + hw). Although each confi-
dence interval is computed separately, it is the simultaneous interpretation of all
the confidence intervals that allows us to compare the competing designs for a
system. The Bonferroni inequality states that the probability of all six confidence
intervals being simultaneously correct is at least equal to 1 − Σ(i=1 to 6) αi. Stated
more generally,

P(all m confidence interval statements are correct) ≥ (1 − α) = 1 − Σ(i=1 to m) αi

where α = Σ(i=1 to m) αi is the overall level of significance and m = K(K − 1)/2
is the number of confidence interval statements.
If, in this example for comparing four candidate designs, we set α1 = α2 =
α3 = α4 = α5 = α6 = 0.05, then the overall probability that all our conclusions
are correct is as low as (1 − 0.30), or 0.70. Being as low as 70 percent confident in
our conclusions leaves much to be desired. To combat this, we simply lower the val-
ues of the individual significance levels (α1 = α2 = α3 = · · · = αm ) so their sum
is not so large. However, this does not come without a price, as we shall see later.
One way to assign values to the individual significance levels is to first es-
tablish an overall level of significance α and then divide it by the number of pair-
wise comparisons. That is,
αi = α / [K(K − 1)/2]    for i = 1, 2, 3, . . . , K(K − 1)/2
Note, however, that it is not required that the individual significance levels be as-
signed the same value. This is useful in cases where the decision maker wants to
place different levels of significance on certain comparisons.
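
As a small illustration (the function name is ours, not from the text), splitting an overall α evenly across the m = K(K − 1)/2 comparisons can be done as follows:

```python
def bonferroni_alpha(alpha_overall, K):
    """Split an overall significance level evenly across all pairwise comparisons."""
    m = K * (K - 1) // 2          # number of pairwise comparisons
    return alpha_overall / m, m

alpha_i, m = bonferroni_alpha(0.06, 3)
print(m, alpha_i)                 # -> 3 comparisons at alpha_i = 0.02 each
```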
Practically speaking, the Bonferroni inequality limits the number of system de-
signs that can be reasonably compared to about five designs or less. This is because
controlling the overall significance level α for the test requires the assignment
of small values to the individual significance levels (α1 = α2 = α3 = · · · = αm )
if more than five designs are compared. This presents a problem because the width
of a confidence interval quickly increases as the level of significance is reduced.
Recall that the width of a confidence interval provides a measure of the accuracy
of the estimate. Therefore, we pay for gains in the overall confidence of our test by
reducing the accuracy of our individual estimates (wide confidence intervals).
When accurate estimates (tight confidence intervals) are desired, we recommend
not using the Bonferroni approach when comparing more than five system designs.
For comparing more than five system designs, we recommend that the analysis of
variance technique be used in conjunction with perhaps the Fisher’s least signifi-
cant difference test. These methods are presented in Section 10.4.2.
Let’s return to the buffer allocation example from the previous section and
apply the Bonferroni approach using paired-t confidence intervals. In this case,
the production control staff has devised three buffer allocation strategies to com-
pare. And, as before, we wish to determine if there are significant differences
between the throughput levels (number of parts completed per hour) achieved
by the strategies. Although we will be working with individual confidence
intervals, the hypotheses for the overall α level of significance are
H0: µ1 = µ2 = µ3 = µ
H1: µ1 ≠ µ2  or  µ1 ≠ µ3  or  µ2 ≠ µ3
where the subscripts 1, 2, and 3 denote Strategy 1, Strategy 2, and Strategy 3,
respectively.
To evaluate these hypotheses, we estimated the performance of the three
strategies by simulating the use of each strategy for 16 days (24 hours per day)
past the warm-up period. And, as before, the simulation was replicated 10 times
for each strategy. The average hourly throughput achieved by each strategy is
shown in Table 10.3.
The evaluation of the three buffer allocation strategies (K = 3) requires
that three [3(3 − 1)/2] pairwise comparisons be made. The three pairwise
comparisons are shown in columns E, F, and G of Table 10.3. Also shown in Table
10.3 are the sample means x̄(i−i′) and sample standard deviations s(i−i′) for each
pairwise comparison.

TABLE 10.3  Comparison of Three Buffer Allocation Strategies (K = 3) Based on Paired Differences

(A)    (B)         (C)         (D)         (E)                       (F)                       (G)
Rep.   Strategy 1  Strategy 2  Strategy 3  Difference (B − C)        Difference (B − D)        Difference (C − D)
(j)    Throughput  Throughput  Throughput  Strategy 1 − Strategy 2   Strategy 1 − Strategy 3   Strategy 2 − Strategy 3
       x1j         x2j         x3j         x(1−2)j                   x(1−3)j                   x(2−3)j

 1     54.48       56.01       57.22       −1.53                     −2.74                     −1.21
 2     57.36       54.08       56.95        3.28                      0.41                     −2.87
 3     54.81       52.14       58.30        2.67                     −3.49                     −6.16
 4     56.20       53.49       56.11        2.71                      0.09                     −2.62
 5     54.83       55.49       57.00       −0.66                     −2.17                     −1.51
 6     57.69       55.00       57.83        2.69                     −0.14                     −2.83
 7     58.33       54.88       56.99        3.45                      1.34                     −2.11
 8     57.19       54.47       57.64        2.72                     −0.45                     −3.17
 9     56.84       54.93       58.07        1.91                     −1.23                     −3.14
10     55.29       55.84       57.81       −0.55                     −2.52                     −1.97

x̄(i−i′), for all i and i′ between 1 and 3, with i < i′                 1.67    −1.09    −2.76
s(i−i′), for all i and i′ between 1 and 3, with i < i′                 1.85     1.58     1.37
Let’s say that we wish to use an overall significance level of α = 0.06 to eval-
uate our hypotheses. For the individual levels of significance, let’s set α1 = α2 =
α3 = 0.02 by using the equation

αi = α/3 = 0.06/3 = 0.02    for i = 1, 2, 3
The computation of the three paired-t confidence intervals using the method out-
lined in Section 10.3.2 and data from Table 10.3 follows:

Comparing µ(1−2): α1 = 0.02

tn−1,α1/2 = t9,0.01 = 2.821 from Appendix B

hw = (t9,0.01) s(1−2) / √n = (2.821) 1.85 / √10 = 1.65 parts per hour

The approximate 98 percent confidence interval is

x̄(1−2) − hw ≤ µ(1−2) ≤ x̄(1−2) + hw
1.67 − 1.65 ≤ µ(1−2) ≤ 1.67 + 1.65
0.02 ≤ µ(1−2) ≤ 3.32

Comparing µ(1−3): α2 = 0.02

tn−1,α2/2 = t9,0.01 = 2.821 from Appendix B

hw = (t9,0.01) s(1−3) / √n = (2.821) 1.58 / √10 = 1.41 parts per hour

The approximate 98 percent confidence interval is

x̄(1−3) − hw ≤ µ(1−3) ≤ x̄(1−3) + hw
−1.09 − 1.41 ≤ µ(1−3) ≤ −1.09 + 1.41
−2.50 ≤ µ(1−3) ≤ 0.32

Comparing µ(2−3): The approximate 98 percent confidence interval is

−3.98 ≤ µ(2−3) ≤ −1.54
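
The three intervals above can also be generated in a few lines. The sketch below (ours, assuming NumPy and SciPy are available) reproduces the Table 10.3 comparisons using paired-t intervals with αi = 0.02:

```python
import numpy as np
from scipy import stats

x = {
    1: np.array([54.48, 57.36, 54.81, 56.20, 54.83, 57.69, 58.33, 57.19, 56.84, 55.29]),
    2: np.array([56.01, 54.08, 52.14, 53.49, 55.49, 55.00, 54.88, 54.47, 54.93, 55.84]),
    3: np.array([57.22, 56.95, 58.30, 56.11, 57.00, 57.83, 56.99, 57.64, 58.07, 57.81]),
}
alpha_i, n = 0.02, 10

for i, k in [(1, 2), (1, 3), (2, 3)]:
    d = x[i] - x[k]                                         # paired differences
    hw = stats.t.ppf(1 - alpha_i/2, n - 1) * d.std(ddof=1) / np.sqrt(n)
    lo, hi = d.mean() - hw, d.mean() + hw
    verdict = "significant" if (lo > 0 or hi < 0) else "not significant"
    print(f"mu{i} - mu{k}: [{lo:.2f}, {hi:.2f}]  {verdict}")
```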
Given that the confidence interval about µ(1−2) excludes zero, we conclude
that there is a significant difference in the mean throughput produced by Strate-
gies 1 (µ1 ) and 2 (µ2 ). The confidence interval further suggests that the mean
throughput µ1 of Strategy 1 is higher than the mean throughput µ2 of Strategy 2
(an estimated 0.02 parts per hour to 3.32 parts per hour higher). This conclusion
should not be surprising because we concluded that Strategy 1 resulted in a higher
throughput than Strategy 2 earlier in Sections 10.3.1 and 10.3.2. However, notice
that this confidence interval is wider than the one computed in Section 10.3.2 for
comparing Strategy 1 to Strategy 2 using the same data. This is because the earlier
confidence interval was based on a significance level of 0.05 and this one is based
on a significance level of 0.02. Notice that we went from using t9,0.025 = 2.262
for the paired-t confidence interval in Section 10.3.2 to t9,0.01 = 2.821 for this
confidence interval, which increased the width of the interval to the point that it is
very close to including zero.
Given that the confidence interval about µ(1−3) includes zero, we conclude
that there is no significant difference in the mean throughput produced by Strate-
gies 1 (µ1 ) and 3 (µ3 ). And from the final confidence interval about µ(2−3) , we
conclude that there is a significant difference in the mean throughput produced by
Strategies 2 (µ2 ) and 3 (µ3 ). This confidence interval suggests that the through-
put of Strategy 3 is higher than the throughput of Strategy 2 (an estimated 1.54
parts per hour to 3.98 parts per hour higher).
Recall that our overall confidence for these conclusions is approximately
94 percent. Based on these results, we may be inclined to believe that Strategy 2
is the least favorable with respect to mean throughput while Strategies 1 and 3 are
the most favorable with respect to mean throughput. Additionally, the difference
in the mean throughput of Strategy 1 and Strategy 3 is not significant. Therefore,
we recommend that you implement Strategy 3 in place of your own Strategy 1
because Strategy 3 was the boss’s idea.
In this case, the statistical assumptions for using the Bonferroni approach are
the same as for the paired-t confidence interval. Because we used the Student’s t
distribution to build the confidence intervals, the observations in the Throughput
Difference columns of Table 10.3 must be independent and normally distributed.
It is reasonable to assume that these two assumptions are satisfied here for the
Bonferroni test using the same logic presented at the conclusion of Section 10.3.2
for the paired-t confidence interval.

10.4.2 Advanced Statistical Models for Comparing More Than Two Alternative Systems
Analysis of variance (ANOVA) in conjunction with a multiple comparison test
provides a means for comparing a much larger number of alternative system de-
signs than does the Welch confidence interval, paired-t confidence interval, or
Bonferroni approach. The major benefit that the ANOVA procedure has over the
Bonferroni approach is that the overall confidence level of the test of hypothe-
sis does not decrease as the number of candidate system designs increases.
There are entire textbooks devoted to ANOVA and multiple comparison tests
used for a wide range of experimental designs. However, we will limit our focus
to using these techniques for comparing the performance of multiple system
designs. As with the Bonferroni approach, we are interested in evaluating the
hypotheses
H0: µ1 = µ2 = µ3 = · · · = µK = µ    for K alternative systems
H1: µi ≠ µi′    for at least one pair i ≠ i′

where i and i′ are between 1 and K and i < i′. After defining some new terminol-
ogy, we will formulate the hypotheses differently to conform to the statistical
model used in this section.
An experimental unit is the system to which treatments are applied. The sim-
ulation model of the production system is the experimental unit for the buffer
allocation example. A treatment is a generic term for a variable of interest and a
factor is a category of the treatment. We will consider only the single-factor case
with K levels. Each factor level corresponds to a different system design. For the
buffer allocation example, there are three factor levels—Strategy 1, Strategy 2,
and Strategy 3. Treatments are applied to the experimental unit by running the
simulation model with a specified factor level (strategy).
An experimental design is a plan that causes a systematic and efficient ap-
plication of treatments to an experimental unit. We will consider the completely
randomized (CR) design—the simplest experimental design. The primary as-
sumption required for the CR design is that experimental units (simulation models)
are homogeneous with respect to the response (model’s output) before the treat-
ment is applied. For simulation experiments, this is usually the case because a
model’s logic should remain constant except to change the level of the factor
under investigation. We first specify a test of hypothesis and significance level,
say an α value of 0.05, before running experiments. The null hypothesis for the
buffer allocation problem would be that the mean throughputs due to the applica-
tion of treatments (Strategies 1, 2, and 3) do not differ. The alternate hypothesis
states that the mean throughputs due to the application of treatments (Strategies 1,
2, and 3) differ among at least one pair of strategies.

TABLE 10.4  Experimental Results and Summary Statistics for a Balanced Experimental Design

                    Strategy 1        Strategy 2        Strategy 3
Replication (j)     Throughput x1j    Throughput x2j    Throughput x3j

 1                  54.48             56.01             57.22
 2                  57.36             54.08             56.95
 3                  54.81             52.14             58.30
 4                  56.20             53.49             56.11
 5                  54.83             55.49             57.00
 6                  57.69             55.00             57.83
 7                  58.33             54.88             56.99
 8                  57.19             54.47             57.64
 9                  56.84             54.93             58.07
10                  55.29             55.84             57.81

Sum xi = Σ(j=1 to n) xij, for i = 1, 2, 3                 563.02     546.33     573.92
Sample mean x̄i = [Σ(j=1 to n) xij]/n, for i = 1, 2, 3      56.30      54.63      57.39
We will use a balanced CR design to help us conduct this test of hypothesis.
In a balanced design, the same number of observations are collected for each fac-
tor level. Therefore, we executed 10 simulation runs to produce 10 observations
of throughput for each strategy. Table 10.4 presents the experimental results and
summary statistics for this problem. The response variable (xi j ) is the observed
throughput for the treatment (strategy). The subscript i refers to the factor level
(Strategy 1, 2, or 3) and j refers to an observation (output from replication j) for
that factor level. For example, the mean throughput response of the simulation
model for the seventh replication of Strategy 2 is 54.88 in Table 10.4. Parameters
for this balanced CR design are
Number of factor levels = number of alternative system designs = K = 3
Number of observations for each factor level = n = 10
Total number of observations = N = n K = (10)3 = 30
Inspection of the summary statistics presented in Table 10.4 indicates that
Strategy 3 produced the highest mean throughput and Strategy 2 the lowest.
Again, we should not jump to conclusions without a careful analysis of the
experimental data. Therefore, we will use analysis of variance (ANOVA) in con-
junction with a multiple comparison test to guide our decision.

Analysis of Variance
Analysis of variance (ANOVA) allows us to partition the total variation in the out-
put response from the simulated system into two components—variation due to
the effect of the treatments and variation due to experimental error (the inherent
variability in the simulated system). For this problem case, we are interested in
knowing if the variation due to the treatment is sufficient to conclude that the per-
formance of one strategy is significantly different than the other with respect to
mean throughput of the system. We assume that the observations are drawn from
normally distributed populations and that they are independent within a strategy
and between strategies. Therefore, the variance reduction technique based on
common random numbers (CRN) presented in Section 10.5 cannot be used with
this method.
The fixed-effects model is the underlying linear statistical model used for the
analysis because the levels of the factor are fixed and we will consider each pos-
sible factor level. The fixed-effects model is written as

xij = µ + τi + εij    for i = 1, 2, 3, . . . , K and j = 1, 2, 3, . . . , n
where τi is the effect of the ith treatment (ith strategy in our example) as a devia-
tion from the overall (common to all treatments) population mean µ and εi j is the
error associated with this observation. In the context of simulation, the εi j term
represents the random variation of the response xi j that occurred during the jth
replication of the ith treatment. Assumptions for the fixed-effects model are that
the sum of all τi equals zero and that the error terms εi j are independent and nor-
mally distributed with a mean of zero and common variance. There are methods
for testing the reasonableness of the normality and common variance assump-
tions. However, the procedure presented in this section is reported to be somewhat
insensitive to small violations of these assumptions (Miller et al. 1990). Specifi-
cally, for the buffer allocation example, we are testing the equality of three
treatment effects (Strategies 1, 2, and 3) to determine if there are statistically sig-
nificant differences among them. Therefore, our hypotheses are written as
H0: τ1 = τ2 = τ3 = 0
H1: τi ≠ 0    for at least one i, for i = 1, 2, 3
Basically, the previous null hypothesis that the K population means are
all equal (µ1 = µ2 = µ3 = · · · = µ K = µ) is replaced by the null hypothesis
τ1 = τ2 = τ3 = · · · = τ K = 0 for the fixed-effects model. Likewise, the alterna-
tive hypothesis that at least two of the population means are unequal is replaced
by τi ≠ 0 for at least one i. Because only one factor is considered in this problem,
a simple one-way analysis of variance is used to determine FCALC, the test statis-
tic that will be used for the hypothesis test. If the computed FCALC value exceeds
a threshold value called the critical value, denoted FCRITICAL, we shall reject the
null hypothesis that states that the treatment effects do not differ and conclude that
there are statistically significant differences among the treatments (strategies in
our example problem).
To help us with the hypothesis test, let’s summarize the experimental results
shown in Table 10.4 for the example problem. The first summary statistic that we
will compute is called the sum of squares (SSi) and is calculated for the ANOVA
for each factor level (Strategies 1, 2, and 3 in this case). In a balanced design
where the number of observations n for each factor level is a constant, the sum of
squares is calculated using the formula
SSi = Σ(j=1 to n) xij² − [Σ(j=1 to n) xij]² / n    for i = 1, 2, 3, . . . , K

For this example, the sums of squares are

SS1 = Σ(j=1 to 10) x1j² − [Σ(j=1 to 10) x1j]² / 10
SS1 = [(54.48)² + (57.36)² + · · · + (55.29)²] − (563.02)²/10 = 16.98
SS2 = 12.23
SS3 = 3.90

The grand total of the N observations (N = nK) collected from the output re-
sponse of the simulated system is computed by

Grand total = x.. = Σ(i=1 to K) Σ(j=1 to n) xij = Σ(i=1 to K) xi

The overall mean of the N observations collected from the output response of
the simulated system is computed by
Overall mean = x̄.. = [ Σ(i=1 to K) Σ(j=1 to n) xij ] / N = x.. / N
Using the data in Table 10.4 for the buffer allocation example, these statistics are

Grand total = x.. = Σ(i=1 to 3) xi = 563.02 + 546.33 + 573.92 = 1,683.27

Overall mean = x̄.. = x../N = 1,683.27/30 = 56.11
Our analysis is simplified because a balanced design was used (equal obser-
vations for each factor level). We are now ready to define the computational
formulas for the ANOVA table elements (for a balanced design) needed to
conduct the hypothesis test. As we do, we will construct the ANOVA table for the
buffer allocation example. The computational formulas for the ANOVA table
elements are
Degrees of freedom total (corrected) = df(total corrected) = N − 1
= 30 − 1 = 29
Degrees of freedom treatment = df(treatment) = K − 1 = 3 − 1 = 2
Degrees of freedom error = df(error) = N − K = 30 − 3 = 27

and

Sum of squares error = SSE = Σ(i=1 to K) SSi = 16.98 + 12.23 + 3.90 = 33.11

Sum of squares treatment = SST = (1/n) [ Σ(i=1 to K) xi² − x..²/K ]

SST = (1/10) [ (563.02)² + (546.33)² + (573.92)² − (1,683.27)²/3 ] = 38.62

Sum of squares total (corrected) = SSTC = SST + SSE = 38.62 + 33.11 = 71.73

and

Mean square treatment = MST = SST / df(treatment) = 38.62 / 2 = 19.31

Mean square error = MSE = SSE / df(error) = 33.11 / 27 = 1.23

and finally

Calculated F statistic = FCALC = MST / MSE = 19.31 / 1.23 = 15.70
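
The ANOVA quantities above are straightforward to compute programmatically. The following sketch (ours, assuming NumPy and SciPy are available) rebuilds SST, SSE, and FCALC from the Table 10.4 data and cross-checks the F statistic with SciPy's built-in one-way ANOVA:

```python
import numpy as np
from scipy import stats

data = np.array([
    [54.48, 57.36, 54.81, 56.20, 54.83, 57.69, 58.33, 57.19, 56.84, 55.29],  # Strategy 1
    [56.01, 54.08, 52.14, 53.49, 55.49, 55.00, 54.88, 54.47, 54.93, 55.84],  # Strategy 2
    [57.22, 56.95, 58.30, 56.11, 57.00, 57.83, 56.99, 57.64, 58.07, 57.81],  # Strategy 3
])
K, n = data.shape
N = K * n

sse = sum(((row - row.mean())**2).sum() for row in data)   # within-treatment variation
sst = n * ((data.mean(axis=1) - data.mean())**2).sum()     # between-treatment variation
mst, mse = sst / (K - 1), sse / (N - K)
f_calc = mst / mse
f_crit = stats.f.ppf(0.95, K - 1, N - K)                    # F(2, 27; 0.05)
print(f"SST={sst:.2f}  SSE={sse:.2f}  FCALC={f_calc:.2f}  FCRITICAL={f_crit:.2f}")

# Cross-check against SciPy's built-in one-way ANOVA
print(stats.f_oneway(*data).statistic)
```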

Table 10.5 presents the ANOVA table for this problem. We will compare the value
of FCALC with a value from the F table in Appendix C to determine whether to
reject or fail to reject the null hypothesis H0 : τ1 = τ2 = τ3 = 0. The values
obtained from the F table in Appendix C are referred to as critical values and
are determined by F(df(treatment), df(error); α). For this problem, F(2,27; 0.05) = 3.35 =
FCRITICAL , using a significance level (α) of 0.05. Therefore, we will reject H0 since
FCALC > FCRITICAL at the α = 0.05 level of significance. If we believe the data in
Table 10.4 satisfy the assumptions of the fixed-effects model, then we would con-
clude that the buffer allocation strategy (treatment) significantly affects the mean

throughput of the system. We now have evidence that at least one strategy produces
better results than the other strategies. Next, a multiple comparison test will be
conducted to determine which strategy (or strategies) causes the significance.

TABLE 10.5  Analysis of Variance Table

Source of Variation       Degrees of Freedom    Sum of Squares    Mean Square    FCALC

Total (corrected)         N − 1 = 29            SSTC = 71.73
Treatment (strategies)    K − 1 = 2             SST = 38.62       MST = 19.31    15.70
Error                     N − K = 27            SSE = 33.11       MSE = 1.23

Multiple Comparison Test


Our final task is to conduct a multiple comparison test. The hypothesis test sug-
gested that not all strategies are the same with respect to throughput, but it did not
identify which strategies performed differently. We will use Fisher’s least signifi-
cant difference (LSD) test to identify which strategies performed differently. It is
generally recommended to conduct a hypothesis test prior to the LSD test to de-
termine if one or more pairs of treatments are significantly different. If the hy-
pothesis test failed to reject the null hypothesis, suggesting that all µi were the
same, then the LSD test would not be performed. Likewise, if we reject the null
hypothesis, we should then perform the LSD test. Because we first performed a
hypothesis test, the subsequent LSD test is often called a protected LSD test.
The LSD test requires the calculation of a test statistic used to evaluate all
pairwise comparisons of the sample mean from each population (x̄1 , x̄2 ,
x̄3 , . . . , x̄ K ). In our example buffer allocation problem, we are dealing with the
sample mean throughput computed from the output of our simulation models for
the three strategies (x̄1 , x̄2 , x̄3 ). Therefore, we will make three pairwise compar-
isons of the sample means for our example, recalling that the number of pairwise
comparisons for K candidate designs is computed by K (K − 1)/2. The LSD test
statistic is calculated as

LSD(α) = t(df(error), α/2) √(2(MSE)/n)
The decision rule states that if the difference in the sample mean response values
exceeds the LSD test statistic, then the population mean response values are sig-
nificantly different at a given level of significance. Mathematically, the decision
rule is written as
If |x̄i − x̄i  | > LSD(α), then µi and µi  are significantly different at the
α level of significance.
For this problem, the LSD test statistic is determined at the α = 0.05 level of
significance:
 
LSD(0.05) = t27,0.025 √(2(MSE)/n) = 2.052 √(2(1.23)/10) = 1.02
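
A short sketch of the protected LSD comparisons follows (ours, assuming SciPy is available; it reuses MSE = 1.23, n = 10, and df(error) = 27 from the ANOVA above):

```python
from itertools import combinations
from scipy import stats

means = {1: 56.30, 2: 54.63, 3: 57.39}      # sample mean throughputs from Table 10.4
mse, n, df_error, alpha = 1.23, 10, 27, 0.05

lsd = stats.t.ppf(1 - alpha/2, df_error) * (2 * mse / n) ** 0.5
print(f"LSD(0.05) = {lsd:.2f}")
for i, k in combinations(means, 2):
    diff = abs(means[i] - means[k])
    flag = "significant" if diff > lsd else "not significant"
    print(f"|x{i} - x{k}| = {diff:.2f}  {flag}")
```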
Table 10.6 presents the results of the three pairwise comparisons for the LSD
analysis. With 95 percent confidence, we conclude that each pair of means is dif-
ferent (µ1 ≠ µ2, µ1 ≠ µ3, and µ2 ≠ µ3). We may be inclined to believe that the
best strategy is Strategy 3, the second best strategy is Strategy 1, and the worst
strategy is Strategy 2.
TABLE 10.6  LSD Analysis

                    Strategy 2             Strategy 1
                    x̄2 = 54.63             x̄1 = 56.30

Strategy 3          |x̄2 − x̄3| = 2.76       |x̄1 − x̄3| = 1.09
x̄3 = 57.39          Significant            Significant
                    (2.76 > 1.02)          (1.09 > 1.02)

Strategy 1          |x̄1 − x̄2| = 1.67
x̄1 = 56.30          Significant
                    (1.67 > 1.02)

Recall that the Bonferroni approach in Section 10.4.1 did not detect a sig-
nificant difference between Strategy 1 (µ1) and Strategy 3 (µ3). One possible
explanation is that the LSD test is considered to be more liberal in that it will in-
dicate a difference before the more conservative Bonferroni approach. Perhaps if
the paired-t confidence intervals had been used in conjunction with common
random numbers (which is perfectly acceptable because the paired-t method
does not require that observations be independent between populations), then the
Bonferroni approach would have also indicated a difference. We are not sug-
gesting here that the Bonferroni approach is in error (or that the LSD test is in
error). It could be that there really is no difference between the performances of
Strategy 1 and Strategy 3 or that we have not collected enough observations to
be conclusive.
There are several multiple comparison tests from which to choose. Other
tests include Tukey’s honestly significant difference (HSD) test, Bayes LSD
(BLSD) test, and a test by Scheffe. The LSD and BLSD tests are considered to be
liberal in that they will indicate a difference between µi and µi  before the more
conservative Scheffe test. A book by Petersen (1985) provides more information
on multiple comparison tests.

10.4.3 Factorial Design and Optimization


In simulation experiments, we are sometimes interested in finding out how differ-
ent decision variable settings impact the response of the system rather than simply
comparing one candidate system to another. For example, we may want to measure
how the mean time a customer waits in a bank changes as the number of tellers (the
decision variable) is increased from 1 through 10. There are often many decision
variables of interest for complex systems. And rather than run hundreds of experi-
ments for every possible variable setting, experimental design techniques can be
used as a shortcut for finding those decision variables of greatest significance (the
variables that significantly influence the output of the simulation model). Using ex-
perimental design terminology, decision variables are referred to as factors and the
output measures are referred to as responses (Figure 10.3). Once the response of in-
terest has been identified and the factors that are suspected of having an influence
on this response defined, we can use a factorial design method that prescribes how
many runs to make and what level or value to use for each factor.

FIGURE 10.3 Relationship between factors (decision variables) and output responses: the factors (X1, X2, . . . , Xn) are the inputs to the simulation model, and the output responses are what the model produces.
The natural inclination when experimenting with multiple factors is to test
the impact that each individual factor has on system response. This is a simple and
straightforward approach, but it gives the experimenter no knowledge of how fac-
tors interact with each other. It should be obvious that experimenting with two or
more factors together can affect system response differently than experimenting
with only one factor at a time and keeping all other factors the same.
One type of experiment that looks at the combined effect of multiple factors on
system response is referred to as a two-level, full-factorial design. In this type of ex-
periment, we simply define a high and a low setting for each factor and, since it is a
full-factorial experiment, we try every combination of factor settings. This means
that if there are five factors and we are testing two different levels for each factor,
we would test each of the 2⁵ = 32 possible combinations of high and low factor
levels. For factors that have no range of values from which a high and a low can be
chosen, the high and low levels are arbitrarily selected. For example, if one of the
factors being investigated is an operating policy (like first come, first served or last
come, last served), we arbitrarily select one of the alternative policies as the high-
level setting and a different one as the low-level setting.
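
To make the growth in the number of runs concrete, the following Python sketch enumerates a hypothetical two-level, full-factorial design with five factors. The factor names, level values, and the placeholder simulate_model call are invented for illustration; only the counting logic matters here.

from itertools import product

# Hypothetical factors, each with an arbitrarily chosen low and high level
factors = {
    "number_of_tellers": [1, 10],
    "queue_policy": ["first come, first served", "last come, last served"],
    "buffer_capacity": [2, 8],
    "arrival_rate_per_hour": [8.0, 12.0],
    "mean_service_minutes": [4.0, 6.0],
}

runs = list(product(*factors.values()))
print(f"{len(runs)} factor-level combinations to simulate")   # 2**5 = 32

for levels in runs:
    settings = dict(zip(factors.keys(), levels))
    # simulate_model(settings)  # placeholder: run the model at these settings
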
For experiments in which a large number of factors are considered, a two-
level, full-factorial design would result in an extremely large number of combina-
tions to test. In this type of situation, a fractional-factorial design is used to strate-
gically select a subset of combinations to test in order to “screen out” factors with
little or no impact on system performance. With the remaining reduced number of
factors, more detailed experimentation such as a full-factorial experiment can be
conducted in a more manageable fashion.
After fractional-factorial experiments and even two-level, full-factorial ex-
periments have been performed to identify the most significant factor level com-
binations, it is often desirable to conduct more detailed experiments, perhaps over
the entire range of values, for those factors that have been identified as being the
most significant. This provides more precise information for making decisions re-
garding the best, or optimal, factor values or variable settings for the system. For
a more detailed treatment of factorial design in simulation experimentation, see
Law and Kelton (2000).
In many cases, the number of factors of interest prohibits the use of even
fractional-factorial designs because of the many combinations to test. If this is
the case and you are seeking the best, or optimal, factor values for a system, an
alternative is to employ an optimization technique to search for the best combina-
tion of values. Several optimization techniques are useful for searching for the
combination that produces the most desirable response from the simulation model
without evaluating all possible combinations. This is the subject of simulation op-
timization and is discussed in Chapter 11.

10.5 Variance Reduction Techniques


One luxury afforded to model builders is that the variance of a performance mea-
sure computed from the output of simulations can be reduced. This is a luxury be-
cause reducing the variance allows us to estimate the mean value of a random
variable within a desired level of precision and confidence with fewer replications
(independent observations). The reduction in the required number of replications
is achieved by controlling how random numbers are used to “drive” the events in
the simulation model. These time-saving techniques are called variance reduction
techniques. The use of common random numbers (CRN) is perhaps one of the
most popular variance reduction techniques. This section provides an introduction
to the CRN technique, presents an example application of CRN, and discusses
how CRN works. For additional details about CRN and a review of other variance
reduction techniques, see Law and Kelton (2000).

10.5.1 Common Random Numbers


The common random numbers (CRN) technique was invented for comparing
alternative system designs. Recall the proposed buffer allocation strategies for the
production system presented in Section 10.2. The objective was to decide which
strategy yielded the highest throughput for the production system. The CRN tech-
nique was not used to compare the performance of the strategies using the paired-t
confidence interval method in Section 10.3.2. However, it would have been a good
idea because the CRN technique provides a means for comparing alternative
system designs under more equal experimental conditions. This is helpful in
ensuring that the observed differences in the performance of two system designs
are due to the differences in the designs and not to differences in experimental
conditions. The goal is to evaluate each system under the exact same circumstances
to ensure a fair comparison.
Suppose a system is simulated to measure the mean time that entities wait in
a queue for service under different service policies. The mean time between
arrivals of entities to the system is exponentially distributed with a mean of
5.5 minutes. The exponentially distributed variates are generated using a stream
of numbers that are uniformly distributed between zero and 1, having been pro-
duced by the random number generator. (See Chapter 3 for a discussion on gen-
erating random numbers and random variates.) If a particular segment of the
random number stream resulted in several small time values being drawn from the
exponential distribution, then entities would arrive to the system faster. This
would place a heavier workload on the workstation servicing the entities, which
would tend to increase how long entities wait for service. Therefore, the simula-
tion of each policy should be driven by the same stream of random numbers to en-
sure that the differences in the mean waiting times are due only to differences in
the policies and not because some policies were simulated with a stream of ran-
dom numbers that produced more extreme conditions.
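
A rough sketch of how such a stream drives the arrival process is shown below, using the inverse transform method to convert uniform(0, 1) numbers into exponential interarrival times with a mean of 5.5 minutes. The Python code and seed value are illustrative rather than a description of ProModel's generator; the point is that a run of small uniform values in the stream yields a run of short interarrival times and therefore a heavier workload.

import math
import random

def exponential_variate(u, mean):
    # Inverse transform: X = -mean * ln(1 - U), where U is uniform on (0, 1)
    return -mean * math.log(1.0 - u)

stream = random.Random(9)   # one random number stream, started at a particular seed

# Ten interarrival times drawn from the same segment of the stream
interarrival_times = [exponential_variate(stream.random(), mean=5.5) for _ in range(10)]
print(interarrival_times)
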

FIGURE 10.4 Unique seed value assigned for each replication. A single random number stream is divided into segments by seed values (Rep. 1 begins at seed 9, Rep. 2 at seed 5, Rep. 3 at seed 3, and so on); the numbers drawn from each segment feed the distributions for arrival times, service times, and so on, which in turn drive the simulation models for each alternative.

The goal is to use the exact random number from the stream for the exact
purpose in each simulated system. To help achieve this goal, the random number
stream can be seeded at the beginning of each independent replication to keep it
synchronized across simulations of each system. For example, in Figure 10.4, the
first replication starts with a seed value of 9, the second replication starts with a
seed value of 5, the third with 3, and so on. If the same seed values for each repli-
cation are used to simulate each alternative system, then the same stream of ran-
dom numbers will drive each of the systems. This seems simple enough. How-
ever, care has to be taken not to pick a seed value that places us in a location on
the stream that has already been used to drive the simulation in a previous repli-
cation. If this were to happen, the results from replicating the simulation of a sys-
tem would not be independent because segments of the random number stream
would have been shared between replications, and this cannot be tolerated.
Therefore, some simulation software provides a CRN option that, when selected,
automatically assigns seed values to each replication to minimize the likelihood
of this happening.
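
The bookkeeping can be sketched as follows, assuming a greatly simplified stand-in for the real simulation model. The same list of replication seeds is reused for every alternative, so replication j of each system is driven by the identical stream segment, while different replications use different seeds and therefore remain independent. The seed values, the simulate function, and the two policies are hypothetical.

import random

replication_seeds = [9, 5, 3, 17, 23, 31, 41, 47, 53]   # one seed per replication (illustrative values)

def simulate(policy, seed):
    rng = random.Random(seed)   # this replication's random number stream
    # Placeholder model: the response depends on both the policy and the stream
    base = sum(rng.expovariate(1 / 5.5) for _ in range(100))
    return base * (1.00 if policy == "Policy A" else 0.95)

for j, seed in enumerate(replication_seeds, start=1):
    a = simulate("Policy A", seed)   # both policies see the same stream in replication j
    b = simulate("Policy B", seed)
    print(f"Replication {j}: difference = {a - b:.2f}")
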
A common practice that helps to keep random numbers synchronized across
systems is to assign a different random number stream to each stochastic element
in the model. For this reason, most simulation software provides several unique
streams of uniformly distributed random numbers to drive the simulation. This
concept is illustrated in Figure 10.5 in that separate random number streams are
used to generate service times at each of the four machines. If an alternative
design for the system was the addition of a fifth machine, the effects of adding the
fifth machine could be measured while holding the behavior of the original four
machines constant.
In ProModel, up to 100 streams (1 through 100) are available to assign to any
random variable specified in the model. Each stream can have one of 100 differ-
ent initial seed values assigned to it. Each seed number starts generating random
numbers at an offset of 100,000 from the previous seed number (seed 1 generates
100,000 random numbers before it catches up to the starting point in the cycle of
seed 2). For most simulations, this ensures that streams do not overlap. If you do

FIGURE 10.5 Unique random number stream assigned to each stochastic element in the system. Streams 1, 2, 5, and 7 drive the service time distributions of Machines 1 through 4, respectively, and within each stream a unique seed value is assigned to each replication (for example, Stream 1 uses seed 99 for Rep. 1, seed 75 for Rep. 2, and seed 3 for Rep. 3).

If you do not specify an initial seed value for a stream that is used, ProModel will use the
same seed number as the stream number (stream 3 uses the third seed). A detailed
explanation of how random number generators work and how they produce
unique streams of random numbers is provided in Chapter 3.
Complete synchronization of the random numbers across different models is
sometimes difficult to achieve. Therefore, we often settle for partial synchro-
nization. At the very least, it is a good idea to set up two streams with one stream
of random numbers used to generate an entity’s arrival pattern and the other
stream of random numbers used to generate all other activities in the model.
That way, activities added to the model will not inadvertently alter the arrival
pattern because they do not affect the sample values generated from the arrival
distribution.
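
A minimal sketch of this two-stream arrangement appears below. It uses Python's random module rather than ProModel's stream mechanism, and the seed offset and distributions are illustrative only.

import random

def make_streams(replication_seed):
    arrival_rng = random.Random(replication_seed)          # stream 1: arrival pattern only
    activity_rng = random.Random(replication_seed + 1000)  # stream 2: all other activities
    return arrival_rng, activity_rng

arrival_rng, activity_rng = make_streams(replication_seed=9)

interarrival_time = arrival_rng.expovariate(1 / 5.5)   # minutes between arrivals
service_time = activity_rng.expovariate(1 / 4.0)       # hypothetical service distribution

# Because service times come from their own stream, adding more activities that
# draw from activity_rng leaves the sequence of interarrival times unchanged.
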

10.5.2 Example Use of Common Random Numbers


In this section the buffer allocation problem of Section 10.2 will be simulated
using CRN. Recall that the amount of time to process an entity at each machine is
a random variable. Therefore, a unique random number stream will be used to
draw samples from the processing time distributions for each machine. There is no
need to generate an arrival pattern for parts (entities) because it is assumed that a
part (entity) is always available for processing at the first machine. Therefore, the
assignment of random number streams is similar to that depicted in Figure 10.5.
Next, we will use the paired-t confidence interval to compare the two competing
buffer allocation strategies identified by the production control staff. Also, note
that the common random numbers technique can be used with the Bonferroni
approach when paired-t confidence intervals are used.
There are two buffer allocation strategies, and the objective is to determine if
one strategy results in a significantly different average throughput (number of
parts completed per hour). Specifically, we will test at an α = 0.05 level of sig-
nificance the following hypotheses:
H0: µ1 − µ2 = 0 or, using the paired notation, µ(1−2) = 0
H1: µ1 − µ2 ≠ 0 or, using the paired notation, µ(1−2) ≠ 0
where the subscripts 1 and 2 denote Strategy 1 and Strategy 2, respectively. As
before, each strategy is evaluated by simulating the system for 16 days (24 hours
per day) past the warm-up period. For each strategy, the experiment is replicated
10 times. The only difference is that we have assigned individual random number
streams to each process and have selected seed values for each replication. This
way, both buffer allocation strategies will be simulated using identical streams of
random numbers. The average hourly throughput achieved by each strategy is
paired by replication as shown in Table 10.7.
Using the equations of Section 10.3.2 and an α = 0.05 level of significance,
a paired-t confidence interval using the data from Table 10.7 can be constructed:
x̄(1−2) = 2.67 parts per hour
s(1−2) = 1.16 parts per hour

TABLE 10.7 Comparison of Two Buffer Allocation Strategies Using Common Random Numbers

(A)                  (B)                (C)                (D)
                     Strategy 1         Strategy 2         Throughput Difference (B − C)
Replication (j)      Throughput x1j     Throughput x2j     x(1−2)j = x1j − x2j

 1                   79.05              75.09              3.96
 2                   54.96              51.09              3.87
 3                   51.23              49.09              2.14
 4                   88.74              88.01              0.73
 5                   56.43              53.34              3.09
 6                   70.42              67.54              2.88
 7                   35.71              34.87              0.84
 8                   58.12              54.24              3.88
 9                   57.77              55.03              2.74
10                   45.08              42.55              2.53

Sample mean x̄(1−2)                                         2.67
Sample standard deviation s(1−2)                           1.16
Sample variance s²(1−2)                                    1.35

and
hw = (t9,0.025 × s(1−2)) / √n = (2.262 × 1.16) / √10 = 0.83 parts per hour
where tn−1,α/2 = t9,0.025 = 2.262 is determined from the Student’s t table in
Appendix B. The 95 percent confidence interval is

x̄(1−2) − hw ≤ µ(1−2) ≤ x̄(1−2) + hw


2.67 − 0.83 ≤ µ(1−2) ≤ 2.67 + 0.83
1.84 ≤ µ(1−2) ≤ 3.50

With approximately 95 percent confidence, we conclude that there is a significant
difference between the mean throughputs of the two strategies because the interval
excludes zero. The confidence interval further suggests that the mean throughput
µ1 of Strategy 1 is higher than the mean throughput µ2 of Strategy 2 (an estimated
1.84 to 3.50 parts per hour higher).
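
These figures are easy to verify. The following Python sketch recomputes the paired-t interval directly from the Table 10.7 differences; scipy supplies the t quantile, and the printed bounds agree (to rounding) with the 1.84 and 3.50 reported above.

from math import sqrt
from statistics import mean, stdev
from scipy.stats import t

# Column D of Table 10.7: paired throughput differences, Strategy 1 minus Strategy 2
differences = [3.96, 3.87, 2.14, 0.73, 3.09, 2.88, 0.84, 3.88, 2.74, 2.53]

n = len(differences)
xbar = mean(differences)                  # about 2.67 parts per hour
s = stdev(differences)                    # about 1.16 parts per hour
hw = t.ppf(0.975, n - 1) * s / sqrt(n)    # half-width, about 0.83 parts per hour

print(f"{xbar - hw:.2f} <= mu(1-2) <= {xbar + hw:.2f}")
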
The interesting point here is that the half-width of the confidence interval
computed from the CRN observations is considerably shorter than the half-width
computed in Section 10.3.2 without using CRN. In fact, the half-width using CRN
is approximately 37 percent shorter. This is due to the reduction in the sample
standard deviation s(1−2) . Thus we have a more precise estimate of the true mean
difference without making additional replications. This is the benefit of using
variance reduction techniques.

10.5.3 Why Common Random Numbers Work


Using CRN does not actually reduce the variance of the output from the simula-
tion model. It is the variance of the observations in the Throughput Difference
column (column D) in Table 10.7 that is reduced. This happened because the ob-
servations in the Strategy 1 column are positively correlated with the observations
in the Strategy 2 column. This resulted from “driving” the two simulated systems
with exactly the same (or as close as possible) stream of random numbers. The
effect of the positive correlation is that the variance of the observations in the
Throughput Difference column will be reduced.
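
The standard identity for the variance of a difference makes this explicit. For the paired observations in replication j,

Var(x(1−2)j) = Var(x1j) + Var(x2j) − 2 Cov(x1j, x2j)

so the positive covariance induced by a well-synchronized CRN experiment subtracts from the sum of the individual variances, whereas a negative covariance would add to it.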
Although the observations between the strategy columns are correlated, the
observations down a particular strategy’s column are independent. Therefore, the
observations in the Throughput Difference column are also independent. This is
because each replication is based on a different segment of the random number
stream. Thus the independence assumption for the paired-t confidence interval
still holds. Note, however, that you cannot use data produced by using CRN to
calculate the Welch confidence interval or to conduct an analysis of variance.
These procedures require that the observations between populations (the strategy
columns in this case) be independent, which they are not when the CRN technique
is used.
As the old saying goes, “there is no such thing as a free lunch,” and there is a
hitch with using CRN. Sometimes the use of CRN can produce the opposite effect
and increase the sample standard deviation of the observations in the Throughput
Difference column. Without working through the mathematics, know that this
occurs when a negative correlation is created between the observations from each
system instead of a positive correlation. Unfortunately, there is really no way of
knowing beforehand if this will happen. However, the likelihood of realizing a
negative correlation in practice is low, so the ticket to success lies in your ability
to synchronize the random numbers. If good synchronization is achieved, then
the desired result of reducing the standard deviation of the observations in the
Difference column will likely be realized.

10.6 Summary
An important point to make here is that simulation, by itself, does not solve a
problem. Simulation merely provides a means to evaluate proposed solutions by
estimating how they behave. The user of the simulation model has the responsi-
bility to generate candidate solutions either manually or by use of automatic
optimization techniques and to correctly measure the utility of the solutions based
on the output from the simulation. This chapter presented several statistical
methods for comparing the output produced by simulation models representing
candidate solutions or designs.
When comparing two candidate system designs, we recommend using either
the Welch confidence interval method or the paired-t confidence interval. Also, a
variance reduction technique based on common random numbers can be used in
conjunction with the paired-t confidence interval to improve the precision of the
confidence interval. When comparing between three and five candidate system
designs, the Bonferroni approach is useful. For more than five designs, the
ANOVA procedure in conjunction with Fisher’s least significant difference (LSD) test
is a good choice, assuming that the population variances are approximately equal.
Additional methods useful for comparing the output produced by different simu-
lation models can be found in Goldsman and Nelson (1998).

10.7 Review Questions


1. The following simulation output was generated to compare four
candidate designs of a system.
a. Use the paired-t confidence interval method to compare Design 1 with
Design 3 using a 0.05 level of significance. What is your conclusion?
What statistical assumptions did you make to use the paired-t
confidence interval method?
b. Use the Bonferroni approach with Welch confidence intervals to
compare all four designs using a 0.06 overall level of significance.
What are your conclusions? What statistical assumptions did you
make to use the Bonferroni approach?
c. Use a one-way analysis of variance (ANOVA) with α = 0.05 to
determine if there is a significant difference between the designs.
What is your conclusion? What statistical assumptions did you make
to use the one-way ANOVA?

Waiting Time in System

Replication Design 1 Design 2 Design 3 Design 4

1 53.9872 58.1365 58.5438 60.1208
2 58.4636 57.6060 57.3973 59.6515
3 55.5300 58.5968 57.1040 60.5279
4 56.3602 55.9631 58.7105 58.1981
5 53.8864 58.3555 58.0406 60.3144
6 57.2620 57.0748 56.9654 59.1815
7 56.9196 56.0899 57.2882 58.3103
8 55.7004 59.8942 57.3548 61.6756
9 55.3685 57.5491 58.2188 59.6011
10 56.9589 58.0945 59.5975 60.0836
11 55.0892 59.2632 60.5354 61.1175
12 55.4580 57.4509 57.9982 59.5142

2. Why is the Bonferroni approach to be avoided when comparing more than
about five alternative designs of a system?
3. What is a Type I error in hypothesis testing?
4. What is the relationship between a Type I and a Type II error?
5. How do common random numbers reduce variation when comparing
two models?
6. Analysis of variance (ANOVA) allows us to partition the total variation in
the output from a simulation model into two components. What are they?
7. Why can common random numbers not be used if the Welch confidence
interval method is used?

References
Banks, Jerry; John S. Carson; Barry L. Nelson; and David M. Nicol. Discrete-Event Sys-
tem Simulation. Englewood Cliffs, NJ: Prentice Hall, 2001.
Bateman, Robert E.; Royce O. Bowden; Thomas J. Gogg; Charles R. Harrell; and Jack
R. A. Mott. System Improvement Using Simulation. Orem, UT: PROMODEL Corp.,
1997.
Goldsman, David, and Barry L. Nelson. “Comparing Systems via Simulation.” Chapter 8
in Handbook of Simulation. New York: John Wiley and Sons, 1998.
Hines, William W., and Douglas C. Montgomery. Probability and Statistics in Engineering
and Management Science. New York: John Wiley & Sons, 1990.
Hoover, Stewart V., and Ronald F. Perry. Simulation: A Problem-Solving Approach.
Reading, MA: Addison-Wesley, 1989.
Law, Averill M., and W. David Kelton. Simulation Modeling and Analysis. New York:
McGraw-Hill, 2000.
Miller, Irwin R.; John E. Freund; and Richard Johnson. Probability and Statistics for
Engineers. Englewood Cliffs, NJ: Prentice Hall, 1990.
Miller, Rupert G. Beyond ANOVA: Basics of Applied Statistics. New York: Wiley, 1986.
Montgomery, Douglas C. Design and Analysis of Experiments. New York: John Wiley &
Sons, 1991.
Petersen, Roger G. Design and Analysis of Experiments. New York: Marcel Dekker, 1985.
