CHAPTERS 8 .. 11
© 2023
Hoai V. Tran and Duy Phuong Nguyen †
NOTE: Courtesy of Google Inc. [122] for picturesque icons/images of chapter covers.
Contents
8.8 ASSIGNMENT II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
11.4.1 Comparison between using coded units and engineering units . . . . . . . . . 393
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443
• Chapter 10 presents an entirely new methodology, called Designed Experiments (DOE), with advanced uses of statistical experimental designs for system performance evaluation in diverse sectors. For example, in Computing the DOE approach to SPE is useful in both software engineering and hardware manufacturing, where the “experiment” is the execution of a computer simulation model (or simply a computer model). Chapter 11 is meaningful since many classes of Fractional Factorial Designs find their way into Quality Analytics and Performance Evaluation through their cost and efficiency optimality.
• Finally, Chapter ?? (R) proposes Performance Analytics Projects with Further Insights and Views.
Chapter 8
Introduction
1. Firstly, suppose that a complex device or machine is to be built and launched. Before it happens,
its performance is simulated, and this allows us to evaluate its adequacy and associated risks
carefully.
2. We surely prefer to evaluate the reliability and safety of a new module of a space station by means of computer simulations rather than during the actual mission.
In the design phase of a system there is no system available yet, so we cannot rely on measurements for generating a probability density function (pdf). In such extreme cases, we may use simulation. Large complex system simulation has become common practice in many areas.
Step (i) building a computer model that describes the behavior of a system;
Once we have a computer simulation model of the actual system, we need to generate values for
the random quantities that are part of the system input (to the model).
To conduct Step (i) correctly and meaningfully, close collaboration between mathematicians and statisticians on one side, and engineers and experts in specific areas on the other, is vital.
Once an organization realizes that a system is not operating as desired, it will look for ways to improve its performance. To do so, it is sometimes possible to experiment with the real system and, through observation and the aid of probabilistic methods and statistics, reach valid conclusions for future system improvement.
• Sometimes it is not feasible, or even impossible, to build a prototype, yet we may still obtain a mathematical model (through equations and constraints) describing the essential behavior of the system.
PROBLEM 1’s analysis may be done through analytical or numerical methods, but the model may be too complex to be dealt with. ■
♣ QUESTION. How do we proceed when a mathematical model is not feasible, or too complex?
[Figure 8.1: a simulation model maps input probability distributions to probability distributions for important outputs.]
A brief simulation process (as in Figure 8.1) essentially consists of passing the inputs through the simulation model to obtain outputs to be analyzed later.
♣ QUESTION. How do we choose a good system design from multiple system designs?
From the STATISTICAL SIMULATION view, a good solution firstly comes from combining Statistical Design and Inference with Simulation, while looking out for clues to the next two questions:
2. How do we capture the uncertain variation of systems and express it up to some proper precision?
• So far, variance analysis works well for a single population or one system.
Such inferences exploiting Fisher distribution [ref. part 8.6.2] and Chi-square distribution [see
8.11.3] are used for the comparison of
• Section 8.6 presents methods for comparison of performance via risk (variance) analysis, with
F-tests comparing two population variances.
Start-up time 𝑋 of computers, it is conjectured, could be related to the operating system (OS) used
on the machines.
Two groups of laptops are randomly assigned to one of two OS: Windows or Linux.
A measure of start-up time 𝑋 (in seconds) is then obtained for each of the subjects:
Assumptions:
♣ QUESTION. Compare the start-up times of the two operating systems using the above data
and assumptions.
With COMPLEMENT 7B of ?? showing key mathematical ways of generating random numbers, we then present the generation of continuous random variables in Section 8.1.1; lastly, Section 8.1.2 gives a short discussion on using exponential variables.
8.1
Generation of Random Variables and Its Usage
Output: values x of X.
2. Generate values x via the transformation X = F⁻¹(V); in other words, solve the equation F(X) = V for X.

⟺ X = F_X⁻¹(V) = −(1/λ) log(1 − V) = −(1/λ) log U, with U ∼ Uni([0, 1]).
SYSTEM PERFORMANCE EVALUATION
8.1. Mathematical Generation of Random Variables 11
Hence X = −(1/λ) log U, so the negative log of a uniform U, scaled by 1/λ, is exponentially distributed with rate λ. When λ = 1, X ∼ E(1); for any constant c > 0, cX is exponential with mean c.
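The inverse-transform recipe above can be sketched in code (a Python illustration; the book's own examples use R):

```python
import math
import random

def exp_inverse_transform(lam: float) -> float:
    """Generate X ~ E(lam) by inverse transform: X = -(1/lam) * log(U)."""
    u = 1.0 - random.random()    # U in (0, 1]; avoids log(0)
    return -math.log(u) / lam    # solves F(X) = 1 - exp(-lam*X) = V for X

# Quick check: the sample mean should approach 1/lam.
random.seed(1)
xs = [exp_inverse_transform(2.0) for _ in range(100_000)]
print(sum(xs) / len(xs))  # close to 1/2
```

The function name and the empirical check are illustrative only; any uniform generator on (0, 1] can be substituted.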
The shape parameter n and the scale parameter β determine Gamma(n, β) completely. The probability density function of X ∼ Gamma(n, β) is

g(x; n, β) = x^{n−1} e^{−x/β} / (βⁿ Γ(n))  if x ≥ 0,  and  g(x; n, β) = 0  if x < 0.   (8.1)
A Gamma variable can generally be generated as a sum of n independent (identically distributed) exponentials, each with rate β, that is

X = Gamma(n, β) ∼ Σ_{i=1}^{n} E_i(β) ∼ G(n, β)

or

X ∼ Σ_{i=1}^{n} −(1/β) log U_i = −(1/β) log(U_1 ··· U_n), where U_i ∼ Uni([0, 1]). (WHY?)
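Following the formula above, a Gamma(n, β) draw is minus the scaled log of a product of n uniforms (a Python sketch; here β is the rate of each exponential term, so the sample mean approaches n/β):

```python
import math
import random

def gamma_from_uniforms(n: int, beta: float) -> float:
    """Generate Gamma(n, beta) as -(1/beta) * log(U1 * U2 * ... * Un)."""
    prod = 1.0
    for _ in range(n):
        prod *= 1.0 - random.random()   # uniform in (0, 1], avoids log(0)
    return -math.log(prod) / beta

# Each term -(1/beta)*log(Ui) is E(beta) with mean 1/beta,
# so the sum has mean n/beta.
random.seed(2)
xs = [gamma_from_uniforms(3, 2.0) for _ in range(50_000)]
print(sum(xs) / len(xs))  # close to 3/2
```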
A Poisson variable X counts the number of (rare) events randomly occurring in one unit of time; denoted X ∼ Pois(λ), it is determined by 3 components:
The constant λ > 0 is the rate or speed of events, i.e., the average number of events occurring in one time unit.
The exponential variable E(·) and the Poisson variable Pois(λ) have a strong, close bond in engineering and services, as in Queuing Theory and Systems. For a simple queuing system with one server we can describe some parameters and measures of performance. ■
■ NOTATION 1.
Suppose that customers (or entities) entering a queuing system are assigned numbers with the
𝑖-th arriving customer called customer-𝑖. Let
• 𝐴𝑖 denote the time when the 𝑖-th customer arrives, and thereby
Arrival times:  0 ‖--A_1---A_2 ··· A_n----A_{n+1}-----A_{n+2}-->
with inter-arrival times X_1 = A_2 − A_1, ..., X_n = A_{n+1} − A_n, and service time S_{n+1}.
If the inter-arrival times {X_i} are exponentially distributed with an arrival rate of λ [with mean E[X_i] = 1/λ and pdf f_X(t) = λ e^{−λt}],
then the number of arrivals 𝑁 (𝑡) in the time interval [0, 𝑡) forms a Poisson process with param-
eter 𝜆 𝑡.
EXAMPLE 8.1 provides a way to generate an exponential X ∼ E(·). We may generate Poisson random variables using E(·), as in queuing theory, as follows.
We already know that the number of arrivals (events) N(1) in the time interval [0, 1) is Poisson distributed with mean λ.
• The n-th event will occur at time Σ_{i=1}^{n} X_i, so the number of events by time 1 is

N(1) = max{ n : Σ_{i=1}^{n} X_i ≤ 1 }.   (8.5)
That is, the number of events by time 1 is equal to the largest 𝑛 for which the 𝑛-th event has
occurred by time 1. E.g., if the 4th event occurred by time 1 but the 5th event did not, then clearly
there would have been a total of four events by time 1.
• Hence, we use the results of EXAMPLE 8.1 to generate N = N(1), a Poisson random variable with mean λ, by generating random numbers U_1, U_2, ···, U_n, ... and setting

N = max{ n : Σ_{i=1}^{n} −(1/λ) log U_i ≤ 1 } = max{ n : U_1 ··· U_n ≥ e^{−λ} }.  WHY?
We conclude that a Poisson random variable N with mean λ can be generated by successively generating random numbers until their product falls below e^{−λ}, then setting N to one less than the number of random numbers generated.
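The product-of-uniforms rule translates directly into code (a Python sketch):

```python
import math
import random

def poisson_by_products(lam: float) -> int:
    """Generate N ~ Pois(lam): multiply uniforms until the product drops
    below e^{-lam}; N is one less than the count of uniforms used."""
    threshold = math.exp(-lam)
    n, prod = 0, 1.0
    while True:
        prod *= 1.0 - random.random()   # uniform in (0, 1]
        if prod < threshold:
            return n                    # product first fell below e^{-lam}
        n += 1

random.seed(3)
ns = [poisson_by_products(3.0) for _ in range(20_000)]
print(sum(ns) / len(ns))  # close to lam = 3
```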
The following table shows key probability distributions useful in System Performance Evaluation (SPE).

Distribution   Notation         Parameters   Mean   Variance
Gauss          X ∼ N(μ, σ²)     μ, σ²        μ      σ²
Exponential    X ∼ E(λ)         λ            1/λ    1/λ²
Chi-square     X ∼ χ²_n         n            n      2n
Student        T ∼ t_n          n            0      n/(n − 2)
We will learn how to quantify the simulation’s precision in Section 8.3, but before that we first present Monte Carlo Simulation.
8.2
The Monte Carlo Simulation Methodology
1. Monte Carlo methods are those based on computer simulations involving random numbers. To
perform a simulation, we need
• a way to generate random numbers (according to your model) using a computer. The data
that are generated from your model can then be studied as if they were observations.
Statistically, Monte Carlo methods are mostly used for the computation of probabilities, expected
values, and other distribution characteristics (such as variances).
² Statistical Physics - in particular, during the development of the atomic bomb - but they are now widely used in statistics and machine learning... The term Monte Carlo originally referred to simulations that involved random walks and was first used by John von Neumann in the 1940s.
Today, the Monte Carlo method refers to any simulation that involves the use of random numbers
We will show, via a few examples, that Monte Carlo simulations (or experiments) are a feasible way to understand the phenomena of interest.
A) Forecasting in Climate Science. Given just a basic distribution model, it is often very difficult to make reasonable long-range predictions. Often a one-day development depends on the results obtained during all the previous days; perhaps the binomial distribution and the Markov property of a stochastic process would be involved.
However, simulation of such a process can be easily performed daily (or even minute by minute).
Based on present results, we simulate the next day. And thus, we can simulate the day after that,
etc.
As a result, when designing a queuing system or a server facility, it is important to evaluate its vital
performance characteristics, including
ELUCIDATION
• In both applications A and B above, we saw how different types of phenomena can be computer-simulated. However, one simulation is not enough for estimating probabilities and expectations. After we understand how to program the given phenomenon once, we can embed it in a do-loop³ and repeat similar simulations a large number of times, generating a long run.
Multiple-queue or multiple-station systems appear in many places, in theme parks (with parallel
queues), or in industrial factories (with sequential queues), seen in Chapter ??.
Key performance measures of a stable queuing system will be briefed in FACT ??. With more powerful methods introduced in the next chapter, such as Discrete Event Simulation (DES), Section ?? then shows us how to combine DES with other tools to simulate a multiserver system.
³ Since the simulated variables are random, we will generally obtain a number of different realizations, from which we calculate probabilities and expectations as long-run frequencies and averages.
⁴ Sample: a proper subset of a population.
First we generate 𝑅 i.i.d. (independent and identically distributed) samples from the distribution,
call them X1 , · · · , X𝑅 (of the same size 𝑛 ≥ 1).
In other words, one runs R independent computer experiments replicating the random variable (r.v.) X, and then computes μ̂ from the sample.
• The use of random sampling as a method for computing a probability or expectation is often called Monte Carlo approximation or, generally, the Monte Carlo method.
• When the estimator μ̂_R of μ = E[X] is an average of i.i.d. copies of X as in (8.6), meaning g(X) = X only, then we refer to μ̂ as an ordinary Monte Carlo (OMC, also CMC - crude MC) estimator.
⁵ It is in fact the area bounded by the pdf curve, the horizontal axis y = 0, and the vertical lines at x = −∞ and x = t.
For the periodic function g(x) = [cos(50x) + sin(20x)]² with cdf F(t) = F_g(t), we consider evaluating its integral over [0, 1],

I = F(1) − F(0) = ∫₀¹ g(x) dx.
It can be seen as a uniform expectation on [0, 1]; we therefore generate U_1, U_2, ···, U_n iid Uniform[0, 1] random variables, and approximate

F_g(t) = ∫₀ᵗ g(x) dx.
[Figure: left, the function g(x) on [0, 1] (values about 0 to 3); right, the running Monte Carlo estimate of its integral (values about 0.8 to 1.2).]
NOTE: The R command cumsum is quite handy in that it computes all the partial sums of a sequence at once and thus allows the immediate representation of the sequence of estimators, specifically when monitoring Monte Carlo convergence, an issue that will be fully addressed in the next chapter.
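The integral of g above can be approximated as follows (a Python sketch; `itertools.accumulate` plays the role of R's `cumsum` for the running estimate, and the exact value of the integral is about 0.965):

```python
import math
import random
from itertools import accumulate

def g(x: float) -> float:
    return (math.cos(50 * x) + math.sin(20 * x)) ** 2

random.seed(4)
R = 100_000
values = [g(random.random()) for _ in range(R)]

# Running estimates I_hat_n = (1/n) * sum_{i<=n} g(U_i), as cumsum gives in R.
running = [s / n for n, s in enumerate(accumulate(values), start=1)]
print(running[-1])  # final crude Monte Carlo estimate of the integral
```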
Here you (presumably) cannot do it by exact methods (integration or summation using pencil, a
computer algebra system, or exact numerical methods).
• The principle of the Monte Carlo method for approximating E[g(X)] is to simulate/generate a random sample X_1, X_2, ···, X_R from the density f [the samples are i.i.d., having the same distribution as X]. Define

μ̂_R = (1/R) Σ_{i=1}^{R} g(X_i).   (8.8)
Knowledge Box 1. A few essential facts of the OMC are summarized as follows.
Let 𝑌𝑖 = 𝑔(𝑋𝑖 ) then the 𝑌𝑖 ∼ 𝑌 are also i.i.d. with mean 𝜇 and variance
We can use Monte Carlo sums to calculate a normal cdf. Consider the standard normal r.v. Z ∼ N(0, 1) with pdf f. We generate a sample (X_1, X_2, ···, X_R) ∼ N(0, 1) of size R and set the Monte Carlo estimator

Φ̂(t) = (1/R) Σ_{i=1}^{R} g(X_i) = (number of observations ≤ t) / R. ■
QUIZ 1.
1. Write your own R code to confirm that with t = 2, the true answer is Φ(2) = .9772, and the Monte Carlo estimate with R = 10,000 yields Φ̂(2) = 0.9751. Use R = 100,000 to get .9771.
2. Why are the variables Id(x_i ≤ t) independent Bernoulli with success probability Φ(t)?
REMARK: The Monte Carlo approximation of a probability distribution function illustrated by this example has nontrivial appli-
cations since it can be used in assessing the distribution of a test statistic, such as a likelihood ratio test under a null hypothesis.
8.3
How to achieve a simulation with high precision?
We firstly present a motivation; it is all right if you do not know all the mathematical facts, but APPENDIX ?? will be essentially helpful for the remaining chapters.
♦ EXAMPLE 8.6.
Consider a customer survey conducted by AIA, an insurance firm in Bangkok. The firm’s quality assurance team uses a customer survey to measure the satisfaction of customers.
How do we summarize the data? We rate customer satisfaction by asking for satisfaction scores, in the range 0..60. A sample of n = 100 customers is surveyed, giving a sample mean x̄ = 42 of the customer scores
x = satis-score = [48, 55, 35, 31, ···, 29, 31, 29, 39, 32, 44, 50].
N = the number of all customers, and n = 100 (the number of customers we asked).
P( x̄ − 1.96 σ/√n < μ < x̄ + 1.96 σ/√n ) = 0.95.
(II) If we don’t know σ but have a sample of large size n > 40, we use its estimate s from

s² = Σ_{i=1}^{n} (x_i − x̄)² / (n − 1).
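Formulas (I)-(II) combine into a short routine (a Python sketch; as a toy input we reuse only the eleven scores listed in the example, since the middle of the data is elided):

```python
import math

def mean_ci(data, z: float = 1.96):
    """Confidence interval for the mean, using the sample std s in place of sigma."""
    n = len(data)
    xbar = sum(data) / n
    s2 = sum((x - xbar) ** 2 for x in data) / (n - 1)   # sample variance
    half = z * math.sqrt(s2 / n)                         # z * s / sqrt(n)
    return xbar - half, xbar + half

lo, hi = mean_ci([48, 55, 35, 31, 29, 31, 29, 39, 32, 44, 50])
print(lo, hi)
```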
How can the above solution be applied in SIMULATION with reliable conclusions?
PROBLEM 8.1 (When does the last customer leave a Service Center under random arrivals?).
Consider a service system in which no new customers are allowed to enter after 5 p.m. Suppose
that each day follows the same probability law and that we are interested in
i) estimating the expected time at which the last customer departs the system.
ii) ensuring that our estimated answer will not differ from the true value by more than 15 seconds.
CRITICAL THINKING: does the 2nd request need a simulation with high precision?
• continually generate data values relating to the time at which the last customer departs (each time
by doing a simulation run) until we have generated a total of 𝑛 values, where 𝑛 ≥ 100, and
• the simulated data of size n satisfy a small enough “precision threshold”, i.e.

constant · standard error = z_{α/2} · σ/√n ≈ 1.96 S/√n < 15,

where S is the sample standard deviation (std, measured in seconds) of the data, and α is the significance level of your conclusion.
ANSWER: Our estimate of the expected time at which the last customer departs will simply be the average X̄_n of the n data values. WHY? ■
Suppose in a simulation, we have the option of continually generating additional data values 𝑋𝑖 . If
our objective is to estimate the value of E[𝑋𝑖 ] = 𝜃, when should we stop generating new data values?
3. Continue to generate additional data values, stopping when you have generated n values and

R(α) = z_{α/2} · S/√n < d,

where S is the sample std based on the sample.
4. The estimate of μ is given by X̄_n = (Σ_i X_i)/n.
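The stopping rule in steps 3-4 can be sketched as follows (Python; the simulation run `gen` is a placeholder Gaussian with mean 300 s and std 30 s, an assumption for illustration only):

```python
import math
import random

def simulate_until_precise(gen, d: float, z: float = 1.96, n_min: int = 100):
    """Generate values until z * S / sqrt(n) < d (with at least n_min values);
    return the sample mean (the estimate of mu) and the sample size used."""
    data = []
    while True:
        data.append(gen())
        n = len(data)
        if n < n_min:
            continue
        xbar = sum(data) / n
        s = math.sqrt(sum((x - xbar) ** 2 for x in data) / (n - 1))
        if z * s / math.sqrt(n) < d:
            return xbar, n

random.seed(7)
est, n = simulate_until_precise(lambda: random.gauss(300.0, 30.0), d=15.0)
print(est, n)   # stops once 1.96 * S / sqrt(n) < 15 seconds
```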
EXPLAINED SOLUTION (statistical): A practical and feasible answer to this question is that we should first choose an acceptable value d for the error R(α) = z_{α/2} · se, where

se = S.E.(X̄) = σ/√n

is the standard error of our estimator (say the sample mean X̄ of θ = μ).
The higher the precision of our simulation, the smaller the value d that should be found, fulfilling

R(α) = z_{α/2} · se ≈ 1.96 S/√n < d = 15,

using the significance level α = 0.05 [of an interval estimate of μ] in Equation 8.12 below:

L = x̄ − z_{α/2} · s/√n ≤ μ ≤ x̄ + z_{α/2} · s/√n = U.   (8.12)
p = 1 − α/2 = Φ(z):   99.5%   97.5%   95%    90%    80%    75%    50%
z = Φ⁻¹(p):           2.576   1.960   1.645  1.282  0.842  0.674  0
How do we find R(α)? The random variable X̄ is determined by the mean E[X̄] = μ and the variance Var[X̄] = σ²/n, so we set

• the estimator μ̂ = X̄, with the standard error of X̄ being se = σ/√n,

or equivalently

P[ X̄ − z_{α/2} · σ/√n ≤ μ ≤ X̄ + z_{α/2} · σ/√n ] = P[ |μ − X̄| < z_{α/2} · σ/√n ] = 1 − α.   (8.13)
ELUCIDATION
1. In practice, when the population is generic, being either normal or not, we often use the interval (8.12). We then need a large sample, of size n > 100, and compute the sample standard deviation

s = √( Σ_{i=1}^{n} (x_i − x̄)² / (n − 1) )

replacing σ.
We have already exploited the Central Limit Theorem (CLT) in (??), which says the standardized sample mean

Z_n = (X̄ − μ)/(S/√n) ≈ (X̄ − μ)/(σ/√n) −→ N(0, 1).
2. If the population is arbitrary, σ is unknown, and we cannot generate a large simulated sample, we must replace Z by the Student distribution T.
3. REMARK: Since the sample standard deviation S may not be a particularly good estimate of σ (nor may the normal approximation be valid) when the sample size is small (we must then use the Student T distribution instead of the Gaussian), we recommend the following procedure.
8.4
Variance Reduction Technique (VRT)
Now we focus on statistical efficiency (although programming efficiency also matters), as measured by the variances of the output random variables from a simulation.
If we can somehow reduce the variance of an output random variable of interest, such as (i) the average delay in queue or (ii) the average cost per month in an inventory system, without disturbing its expectation, we can obtain greater precision. Mathematically, higher precision means either achieving a desired precision with less simulating,
• or having smaller confidence intervals [e.g., of the mean defined in Equation 8.12] for the same amount of simulating.
The methods of getting better precision by reducing the variance of a parameter of interest are grouped into a computational class called variance reduction techniques (VRT). The most popular ones include Control Variables, discussed here, and the advanced method named Conditioning, discussed in more detail in Section 8.4.2.
Suppose we are interested in computing the parameter θ = E[g(X_1, X_2, ···, X_n)], the mean of a given function g().
2. Repeat similarly step 1 in 𝑘 independent times, until you have generated 𝑘 (some predeter-
mined number) sets, and so have also computed 𝑌1 , 𝑌2 , . . . , 𝑌𝑘 .
3. Now, Y_1, Y_2, ..., Y_k are independent and identically distributed random variables, each having the same distribution as g(X_1, X_2, ···, X_n). Thus, if we let Ȳ denote the average of these k random variables, that is, Ȳ = Σ_{i=1}^{k} Y_i / k, then
It is often the case that it is not possible to analytically compute the preceding, and in such cases we attempt to use simulation to estimate θ. Variance-reduction methods include Variance Reduction by Control Variables and by Conditioning, both based on the above original steps.
(iii) The variance of our estimator Ȳ, V[Ȳ] = kV[Y_i]/k² = V[Y_i]/k, which is usually not known in advance, must be estimated from the generated values Y_1, Y_2, ···, Y_k. ■
Let X be an output random variable, such as the total delay time of the first 100 = 99 + 1 = n + 1 customers delayed in queue, and assume we want to estimate θ = E[X]. [In the general case we might want to estimate θ_g = E[g(X)] for a given function g().]
Suppose that 𝑍 is another random variable (in general take 𝑍 := 𝑓 (X) another function of 𝑋)
involved in the simulation that is thought to be correlated with 𝑋 (either positively or negatively),
and that we know the value of 𝜈 = 𝜇𝑍 = E[𝑍].
***
For instance, redefine Y = X as an extension of X, and take the function g() as the delay time of customer 100 in Problem 8.5,

g(Y) = D_{n+1} = D_100 = ?
Then 𝑍 could be the sum of the service times of the first 𝑛 = 99 customers who complete their service
in the queueing model mentioned above, so we would know its expectation since we generated the
service-time variates Y = 𝑆1 , 𝑆2 , · · · , 𝑆99 from some known input distribution:
It is reasonable to suspect that larger-than-average service times (i.e., Z > ν) tend to lead to longer-than-average delays (Y > θ) and vice versa. This means Z is correlated with Y, in this case positively. An essential conclusion is that we use our knowledge of Z’s expectation to pull Y (down or up) toward its expectation θ, thus reducing its variability about θ from one run to the next. ■
Let Cov[B_1, B_2] and Corr[B_1, B_2] respectively be the covariance and correlation between any pair of random variables B_1 and B_2. It is well known that the variance of the sum B_1 + B_2 is V[B_1 + B_2] = V[B_1] + V[B_2] + 2 Cov[B_1, B_2]. Since V[B_1], V[B_2] ≥ 0, we can reduce the variance V[B_1 + B_2] by pushing Cov[B_1, B_2] < 0, as negative as possible.
• Suppose that for some other function 𝑓 , the expected value of random function 𝑍 := 𝑓 (X) is known,
say 𝜇 = 𝜇𝑍 = E[𝑍].
𝑊 = 𝑔(X) + 𝐴 [𝑓 (X) − 𝜇]
And, for this value of A, the variance of W is V[W] = V[g(X)] − A² V[f(X)].
Because V[f(X)] and Cov[f(X), g(X)] are usually unknown, the simulated data should be used to estimate these quantities. Dividing the previous equation by V[g(X)] gives

V[W] / V[g(X)] = 1 − Corr²[f(X), g(X)] ≤ 1.   (8.17)
Knowledge Box 3.
Consequently, the use of a control variate Z := f(X) will greatly reduce the variance of the simulation estimator W of θ = E[g(X)], since the ratio

V[W] / V[g(X)] < 1.
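A control-variate estimator along the lines of (8.17) can be sketched as follows (Python; the toy target is θ = E[e^U] for U uniform, with control Z = U whose mean μ_Z = 1/2 is known; the optimal coefficient A = −Cov(Y, Z)/Var(Z) is estimated from the same simulated data):

```python
import math
import random

def control_variate_mean(pairs, mu_z: float) -> float:
    """Estimate E[Y] from (y, z) pairs using a control variate Z with known mean:
    W = Y + A * (Z - mu_z), with the variance-minimizing A = -Cov(Y, Z)/Var(Z)."""
    n = len(pairs)
    ybar = sum(y for y, _ in pairs) / n
    zbar = sum(z for _, z in pairs) / n
    cov = sum((y - ybar) * (z - zbar) for y, z in pairs) / (n - 1)
    var_z = sum((z - zbar) ** 2 for _, z in pairs) / (n - 1)
    a = -cov / var_z
    return ybar + a * (zbar - mu_z)   # average of the W_i

random.seed(8)
pairs = [(math.exp(u), u) for u in (random.random() for _ in range(10_000))]
print(control_variate_mean(pairs, mu_z=0.5))  # close to e - 1 = 1.71828...
```

Because Corr(e^U, U) is very high, the ratio (8.17) is tiny here and the estimate is far tighter than the plain average of the y values.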
Conditional expectation
We first need a few concepts for the 2nd method of Variance Reduction.
• E[Y | x] in (8.18) is defined like E[Y], as a weighted average of all the possible values of Y, but now with the weight given to the value y being equal to the conditional probability p_Y(y|x) given X = x. So E[Y|X] is a random variable, and its mean exists.
• E[𝑔(𝑌 )|𝑋] as a function of 𝑋, taking values E[𝑔(𝑌 )|𝑋 = 𝑥], so is a random variable whose mean
can be calculated.
• Just as conditional probabilities satisfy all the properties of ordinary probabilities, so do the condi-
tional expectations satisfy all the properties of ordinary expectations.
B/ Extending Eq. (8.20) gives the conditional mean E[𝑔(𝑌 )|𝑋] of 𝑔(𝑌 ), and
• The conditional variance of Y, given the value X = x, is the variance of Y with respect to the conditional distribution of Y given X = x.
That is, V[Y | X] equals the (conditional) expected square of the difference between Y and its (conditional) mean E[Y|X] when the value of X is given. In other words, V[Y | X] is exactly analogous to the usual definition of variance, but now all expectations are conditional on X.
PROOF
From V[Y | X] = E[Y²|X] − (E[Y|X])², taking the expectation of both sides w.r.t. X gives

E[ V[Y | X] ] = E[ E[Y²|X] ] − E[ (E[Y|X])² ] = E[Y²] − E[ (E[Y|X])² ].

Since E[ E[Y|X] ] = E[Y], we get V[ E[Y|X] ] = E[ (E[Y|X])² ] − (E[Y])². Adding the two,

E[ V[Y|X] ] + V[ E[Y|X] ] = E[Y²] − (E[Y])² = V[Y]. ■
Let 𝑋1 , 𝑋2 , · · · , 𝑋𝑁 be a random sample (iid) from a certain distribution, where 𝑁 itself is a natural-
valued random variable (having its own distribution).
The compound random variable of 𝑋𝑖 and 𝑁 is given by
𝑆𝑁 := 𝑋1 + 𝑋2 + · · · + 𝑋𝑁 . (8.25)
In practice, 𝑁 may be the number of people stopping at a service station in a day, and the 𝑋𝑖 are
the amounts of gas they purchased.
One can find the mean and variance of 𝑆𝑁 if observations are random.
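Under the extra assumption that N is independent of the X_i, Wald's identity gives E[S_N] = E[N] · E[X]; a quick simulation check (a Python sketch with hypothetical toy distributions for N and the X_i):

```python
import random

def compound_sum(gen_n, gen_x) -> float:
    """One draw of S_N = X_1 + ... + X_N with a random number of terms N."""
    return sum(gen_x() for _ in range(gen_n()))

random.seed(9)
# Toy model: N uniform on {0,...,10} (mean 5), X exponential with mean 2.
draws = [compound_sum(lambda: random.randint(0, 10),
                      lambda: random.expovariate(0.5))
         for _ in range(20_000)]
print(sum(draws) / len(draws))  # close to E[N] * E[X] = 5 * 2 = 10
```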
PROOF.
If for some new random variable Z we can compute E[Y | Z], then, from the conditional variance formula (8.24), we get a smaller variance. Indeed, in the conditional variance formula V[Y] = E[ V[Y|Z] ] + V[ E[Y|Z] ], we always have V[Y | Z] ≥ 0, hence the mean E[ V[Y|Z] ] ≥ 0, so

V[ E[Y|Z] ] ≤ V[Y].
Observe the delays D_1, D_2, ..., D_{N(T)} of the first N(T) arrivals during the time interval [0, T] in queue QS, and put

Y = g(D_1, D_2, ..., D_{N(T)}) = Σ_{i=1}^{N(T)} D_i.
If we simulate D_i ∼ D as iid delays of customers in QS, and assume the mean E[D] is known (or, if not, compute its sample mean D̄), then we use

E[Y] = E[ Σ_{i=1}^{N(T)} D_i ] = m(T) E[D],
where 𝑚(𝑇 ) := E[𝑁 (𝑇 )] is the expected number of renewals in the time interval (0, 𝑇 ).
The average time that a customer spends in the system QS is obviously found as

ACS = E[D] = E[Y] / m(T)   (8.28)
where Y = Σ_{i=1}^{N(T)} D_i is the sum of the times spent in QS by all arrivals up to T.
Arrival times:  0 ‖------A_n----A_{n+1}------A_{n+2}--->
with X_n = A_{n+1} − A_n and service time S_{n+1}.
Note that the delay in queue of customer n + 1 is as given in PROBLEM 8.5; hence the moment when that customer leaves the queue is L_{n+1} = D_{n+1} [continued from PROBLEM 8.5].
USAGE 2: Compute the mean E[Z], where Z is the total service time in [0, T]. Taking Z as our control variable and replacing Y by E[Y|Z], we get an estimator E[Y|Z] of smaller variance, by the above argument (8.27): V[ E[Y|Z] ] ≤ V[Y].
Here N(T) is the number of arrivals by time T. The quantity m(T) might be known or unknown, but evidently N(T) is a natural simulation estimator of m(T). Of course, when the arrival process follows a homogeneous Poisson process with constant rate λ, then

m(T) = E[N(T)] = ∫₀ᵀ λ dt = λT. ■
NEXT OBJECTIVES:
To present several types of comparison and the problem of design selection (choosing the best among competing system designs) that have been found useful in simulation, together with appropriate statistical procedures for their solution, and numerical examples.
We will discuss statistical analyses of the output from several different simulation models that
might represent competing system designs or alternative operating policies. This is a very important
subject, since the real utility of simulation lies in comparing such alternatives before implementation.
8.5
Comparison of Alternative System Configurations
Many decision-making problems require determining whether the means, proportions, or other parameters of two populations or systems are the same or different. In general, the two-sample problem arises when two systems or processes are to be compared, for instance:
2. Saliva concentration of respondents who were lying versus those who were truthful.
• In Example 1, Start-up time is the response variable (measured at a specified computer); the
explanatory variable or treatment is the OS type, say
• Taking Example 4, suppose a random sample of 16 patients is taken. Randomly allocate eight of
the patients to treatment 1 and the remaining eight to treatment 2.
• To assess the effectiveness of the treatments it is usual that one of the treatments is a ‘control’ i.e.,
no treatment or the usual treatment. Thus
• First, construct confidence intervals on the difference in means of two normal distributions.
If the two performance indicators are population means, we study their difference in Section 8.5.1. If they are population proportions or variances, we study their ratio in Sections 8.5.2 and 8.6.
However, whether the two populations X and Y are independent or not depends on how the simulations are executed, and this determines which of the two confidence-interval approaches discussed in the next parts applies.
Often we set Δ_0 = 0 and observe the two systems X, Y to get data 𝒟 = x, y; then we employ the fact:
Data 𝒟 supporting the null H_0 (we do not reject it) is equivalent to the fact that 0 ∈ CI(ξ) up to some significance level α ∈ (0, 1).
Hence we will choose a suitable way (Hypothesis Testing or Confidence Interval) to describe our solution.
4. Compute an appropriate test statistic (from the observed data 𝒟) of the standardized variable G of G_m with

G = ( X̄ − Ȳ − (μ_X − μ_Y) ) / √( σ²_X/n_1 + σ²_Y/n_2 );

G follows the standard Gauss N(0, 1) distribution.
T = ( x̄ − ȳ − Δ_0 ) / ( s_p √( 1/n_1 + 1/n_2 ) )

with n_1 + n_2 − 2 d.o.f., where the pooled variance s²_p depends on the variances s²_x, s²_y.
5. State the rejection criteria for the statistic at a certain significance level α.
6. Draw appropriate conclusions with either diagram 8.3 for G or 8.4 for T. If using T, just apply the same argument, but remember to use the degrees of freedom n_1 + n_2 − 2 when locating the suitable critical value t_α or t_{α/2}.
The T statistic with critical value t has d.f. n_1 + n_2 − 2 since the whole data 𝒟 = x, y is composed of two independent samples having d.f. n_1 − 1 and n_2 − 1 respectively. ■
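Steps 4-6 for the pooled-variance T statistic can be sketched as follows (Python; the two small input lists are purely illustrative):

```python
import math

def pooled_t(x, y, delta0: float = 0.0):
    """Two-sample T statistic with pooled variance; returns (t, degrees of freedom)."""
    n1, n2 = len(x), len(y)
    xbar, ybar = sum(x) / n1, sum(y) / n2
    s2x = sum((v - xbar) ** 2 for v in x) / (n1 - 1)
    s2y = sum((v - ybar) ** 2 for v in y) / (n2 - 1)
    sp2 = ((n1 - 1) * s2x + (n2 - 1) * s2y) / (n1 + n2 - 2)   # pooled variance
    t = (xbar - ybar - delta0) / (math.sqrt(sp2) * math.sqrt(1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

t, df = pooled_t([2, 4, 6], [1, 2, 3])
print(t, df)
```

Compare |t| against the critical value t_{α/2} with df degrees of freedom to decide whether to reject H_0.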
Start-up time 𝑋 of computers, it is conjectured, could be related to the operating system (OS) used
on the machines.
Two groups of laptops are randomly assigned to one of two OS: Windows or Linux.
A measure of start-up time 𝑋 (in seconds) is then obtained for each of the subjects:
Assumptions:
♣ QUESTION. Compare the start-up times of the two operating systems using the above data and
assumptions.
Notation. 𝑋1 = 𝑊 , 𝑋2 = 𝐿.
HINT: For the Windows-run laptops , let 𝑥1 be the sample mean, 𝑠22 be the sample variance;
For the Linuxs-run laptops , let 𝑥2 be the sample mean, 𝑠22 be the sample variance,
Hypotheses:
H_0 : μ_1 = μ_2, or μ_1 − μ_2 = Δ_0 = 0, versus H_1 : μ_1 ≠ μ_2.
We do not have to assume that X and Y are independent here, so we compute a Confidence Interval and use Gosset’s T distribution in (??).
• Let 𝑛1 = 𝑛2 = 𝑛 (say, or we are willing to discard some observations from the system on which we
actually have more data), we can pair 𝑋1𝑗 with 𝑋2𝑗 to define ‘gap’ variable 𝑍 = 𝑋 − 𝑌 = 𝑋1 − 𝑋2 ,
with the 𝑗th observation 𝑍𝑗 = 𝑋𝑗 − 𝑌𝑗 = 𝑋1𝑗 − 𝑋2𝑗 for 𝑗 = 1, 2, . . . , 𝑛.
• Then the 𝑍𝑗 ’s are IID random variables and E[𝑍𝑗 ] = 𝜉 = E[𝑍], the quantity for which we want to
construct a confidence interval. Thus, we can let a sample ‘mean gap’
Z̄(n) = ( Σ_{j=1}^{n} Z_j ) / n
and a ‘variance estimator’ of 𝑍(𝑛) as [explain why yourself]
V̂[Z̄(n)] = Var̂[Z̄(n)] = S²_{Z̄(n)} = Σ_{j=1}^{n} [Z_j − Z̄(n)]² / ( n(n − 1) )
DISCUSSION.
1. If the 𝑍𝑗 ’s are normally distributed, this confidence interval CI (𝑍(𝑛)) is exact, i.e., it covers 𝜉 = E[𝑍]
with probability 1 − 𝛼; otherwise, we rely on the central limit theorem (CLT, see Equation 8.52),
which implies that this coverage probability will be near 1 − 𝛼 for large 𝑛.
2. We did not assume that 𝑋 and 𝑌 are independent, nor did we have to assume that Var[𝑋] =
Var[𝑌 ].
3. Using (8.32) we essentially reduced the two-system problem to one involving a single sample,
namely the 𝑍𝑗 ’s . In the next two sections we discuss estimating measures of performance other
than means, namely proportion and variance.
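The paired approach above reduces to a one-sample interval on the Z_j's, which can be sketched as (Python; `t_crit` is the Student critical value t_{n−1, α/2} supplied by the user, and the paired data are illustrative):

```python
import math

def paired_ci(x, y, t_crit: float):
    """CI for xi = E[Z], with Z_j = X_j - Y_j, using the variance
    estimator of Z-bar(n) from the text."""
    z = [a - b for a, b in zip(x, y)]
    n = len(z)
    zbar = sum(z) / n
    var_zbar = sum((v - zbar) ** 2 for v in z) / (n * (n - 1))  # V-hat[Z-bar(n)]
    half = t_crit * math.sqrt(var_zbar)
    return zbar - half, zbar + half

# Toy paired data; 3.182 is t_{3, 0.025}, for a 95% interval with n = 4.
lo, hi = paired_ci([5, 6, 7, 8], [1, 2, 3, 5], t_crit=3.182)
print(lo, hi)
```

If 0 lies outside (lo, hi), the data do not support H_0: ξ = 0 at the corresponding significance level.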
L_w = E[X_w] = (1/T) ∫₀ᵀ Q(t) dt
where 𝑋𝑤 = 𝑄(𝑡) is the queue length function at time 𝑡.
the true proportion of spam emails that bombard the mail server of a firm,
the true proportion of stocks of a stock market that go up or down each week,
the true proportion of households in a country that own personal computers ...
• The procedure to find the confidence interval for a population proportion is similar to that for the population mean, but the formulas are a bit different although conceptually identical.
• While the formulas are different, they are based upon the same mathematical foundation given to us by the Central Limit Theorem. Because of this we will see the same basic format using the same three pieces of information, namely a) the sample value (point estimate) of the key parameter, b) the standard error of that estimate, and c) the number of standard deviations (critical value) we need to have in our estimated confidence interval.
Denote by 𝑍 ∼ B(𝑝) a binary (Bernoulli) random variable, and suppose that we would like to estimate the probability 𝑝 = P[𝑍 ∈ 𝐵] (success probability), where 𝐵 is a set of real numbers, like 𝐵 = [0, 5).
Make 𝑛 independent replications and let 𝑍1, 𝑍2, . . . , 𝑍𝑛 be the resulting IID Bernoulli random variables. Put 𝑋 = ∑_{𝑖=1}^{𝑛} 𝑍𝑖, just the number of 𝑍𝑖 ’s that fall in the set 𝐵. The probability 𝑝 expresses the likelihood of the event 𝐵 occurring, or the proportion 𝑃 of successful cases 𝑍𝑖 ∈ 𝐵.
Distribution used for a proportion 𝑃 : Firstly, the underlying distribution of a proportion 𝑃 of inter-
est is a binomial distribution. Why? 𝑋 clearly represents the number of successes in 𝑛 trials, then
𝑋 is a binomial variable, and 𝑋 ∼ Bin(𝑛, 𝑝) where 𝑛 is the number of trials.
Secondly, the mean and standard deviation (standard error) of the estimator 𝑃̂:

E[𝑃̂] = 𝜇_𝑃̂ = E[𝑋/𝑛] = 𝑛𝑝/𝑛 = 𝑝,  let 𝑞 = 1 − 𝑝,
V[𝑃̂] = V[𝑋/𝑛] = 𝜎²_𝑋 /𝑛² = 𝑛𝑝𝑞/𝑛² = 𝑝𝑞/𝑛 = 𝜎²_𝑃̂.    (8.33)
This example shows that comparing two or more systems by some sort of mean system response may result in misleading conclusions.
Consider a bank with five tellers and one queue, which opens its doors at 9 a.m., closes its doors at 5 p.m., but stays open until all customers in the bank at 5 p.m. have been served. Assume that we simulated this dynamic system in 10 independent replications and obtained the data in Table 8.2, under the assumptions below:
i) customers arrive in accordance with a Poisson process at rate 𝜆 = 1 per minute (i.e., IID exponential interarrival times with mean 1/𝜆 = 1 minute),
ii) service times are IID exponential random variables with mean 1/𝜇 = 4 minutes, and
iii) customers are served in a FIFO manner.
Table 8.2: Results for 10 independent replications of the bank model ([7])
The 2nd column of Table 8.2, for instance, gives the total number of customers served in a work day (9 a.m. till 5 p.m., more or less).
The utilization factor 𝜌 = 𝜆/(5𝜇) = 1/(5 · 1/4) = 0.8 applies to the 𝑀/𝑀/5 queue.
We want to compare the policy of having one queue for each teller (a parallel system) with the policy of having one queue feed all five tellers (the 𝑀/𝑀/5) on the basis of the mean delay in queue E[𝑊] and the mean queue length 𝐿𝑤 above.⁷
⁷ Table 8.2 shows several typical output statistics from 10 independent replications of a simulation of the bank, assuming that no customers are present initially. Table 8.3 gives the results of making one simulation run of each policy.
We obtain a point estimate of the average system response E[𝑊] over a day, which is given by

E[𝑊] = E[ (1/𝑁) ∑_{𝑖=1}^{𝑁} 𝐷𝑖 ] = 2.03
Table 8.3: Simulation results for the two bank policies via the means

Measure of performance (estimates of mean)     Five queues   One queue
Mean operating time 𝑂𝑝, hours                  8.14          8.14
Mean average delay, minutes                     5.57          5.57
Mean average number of customers in queue       5.52          5.52

Table 8.3 gives average results of a typical simulation run for each of the two bank policies.⁸ On the basis of “average system response” (the row of mean average delay E[𝑊]), it would appear that the two policies are equivalent (shown in columns 2 and 3). However, this is clearly not the case.
⁸ Under the three queue assumptions above (the arrival process and the service time of the 𝑖th customer, 𝑖 = 1, 2, . . . , 𝑁, were taken to be the same for both policies).
Table 8.4: Simulation results for the two bank policies: proportions
Rows of Table 8.4 give estimates, computed from the same two simulation runs used above, of the
expected proportion of customers with a delay in the interval [0, 5) (in minutes), ..., the expected
proportion of customers with a delay in [40, 45) for both policies.
• Since customers need not be served in the order of their arrival with the multiqueue policy, we
would expect this policy to result in greater variability of a customer’s delay.
delays greater than or equal to 20 minutes for the five-queue and one-queue policies, respec-
tively. ■
The formula for the confidence interval for a population proportion follows the same format as that for an estimate of a population mean. Therefore, we can assert that

P[ −𝑧_{𝛼/2} < 𝑍 < 𝑧_{𝛼/2} ] = 1 − 𝛼,  with 𝑍 = (𝑃̂ − 𝑃)/𝜎_𝑃̂    (8.35)

where 𝑧_{𝛼/2} is the value above which we find an area of 𝛼/2 under the standard normal curve. Substituting for 𝑍 with the 𝜎_𝑃̂ obtained in (8.34), we write:

P[ −𝑧_{𝛼/2} < (𝑃̂ − 𝑃)/𝜎_𝑃̂ < 𝑧_{𝛼/2} ] = 1 − 𝛼,    (8.36)

this gives us the CI = 𝑃̂ ± 𝑧_{𝛼/2} √(𝑝𝑞/𝑛) of 𝑃 with significance level 𝛼, meaning

𝑃̂ − 𝑧_{𝛼/2} √(𝑝𝑞/𝑛) < 𝑃 < 𝑃̂ + 𝑧_{𝛼/2} √(𝑝𝑞/𝑛)
MATHEMATICAL MODELS, DESIGNS And ALGORITHMS
CHAPTER 8. STATISTICAL SIMULATION: FUNDAMENTALS
70 FOR SYSTEM PERFORMANCE EVALUATION
CONCLUSION
• A clear conclusion from the above example is that comparing alternative systems or policies on the basis of average system behavior alone can sometimes result in misleading conclusions. Furthermore, proportions can be a useful measure of system performance.
• More precisely, 𝑝̂ is the numerical value of the statistic 𝑃̂; the estimated proportion of successes 𝑝̂ is also a point estimate for 𝑃, the true population proportion.
♦ EXAMPLE 8.10.
(A) [Marketing Research.] Suppose that a market research firm is hired to estimate the percent
of adults living in a large city who have cell phones. Five hundred randomly selected adult residents
in this city are surveyed to determine whether they have cell phones. Of the 500 people sampled,
421 responded yes - they own cell phones.
Using a 95% confidence level, compute a confidence interval estimate for the true proportion of
adult residents of this city who have cell phones.
Let 𝑋 = the number of people in the sample who have cell phones. 𝑋 is binomial: the random
variable is binary, people either have a cell phone or they do not.
Interpretation: We estimate with 95% confidence that between 81% and 87.4% of all adult resi-
dents of this city have cell phones.
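The arithmetic of part (A) can be checked with a short script; a minimal Python sketch (the critical value 1.96 = 𝑧_{0.025} is taken as given):

```python
import math

n, x = 500, 421                # sample size and number of "yes" answers
p_hat = x / n                  # point estimate of the true proportion
q_hat = 1 - p_hat
z = 1.96                       # z_{alpha/2} for a 95% confidence level
margin = z * math.sqrt(p_hat * q_hat / n)  # z times the estimated standard error

lower, upper = p_hat - margin, p_hat + margin
print(f"p_hat = {p_hat:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
# p_hat = 0.842, 95% CI = (0.810, 0.874)
```

The same script with x = 300 and z = 1.645 handles part (B).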
(B) [Finance Study.] A financial officer for a company wants to estimate the percent of accounts
receivable that are more than 30 days overdue. He surveys 500 accounts and finds that 300 are
more than 30 days overdue. Compute a 90% confidence interval for the true percent of accounts
receivable that are more than 30 days overdue, and interpret the confidence interval.
Two mutual funds promise the same expected return; however, one of them recorded a 10% higher volatility over the last 30 days.
Is this significant evidence for a conservative investor to prefer the other mutual fund?
Denote by 𝜎²_𝑋 = V[𝑋] and 𝜎²_𝑌 = V[𝑌] the variances of the two populations; we’ll see how to test the null hypothesis 𝐻0: 𝜎²_𝑋 = 𝜎²_𝑌 against 𝐻𝐴: 𝜎²_𝑋 ≠ 𝜎²_𝑌. Now, to compare variances, two independent samples X = (𝑋1, 𝑋2, · · · , 𝑋𝑛) and Y = (𝑌1, 𝑌2, · · · , 𝑌𝑚) are collected, one from each population, as in Figure 8.6.
Unlike population means or proportions, variances are scale factors, and they are compared through their ratio

𝜃 = 𝜎²_𝑋 / 𝜎²_𝑌.

A natural estimator for the ratio of population variances 𝜃 = 𝜎²_𝑋 / 𝜎²_𝑌 is the ratio of sample variances

𝜃̂ = 𝑠²_𝑋 / 𝑠²_𝑌 = { [ ∑_{𝑖=1}^{𝑛} (𝑋𝑖 − 𝑋̄)² ]/(𝑛 − 1) } / { [ ∑_{𝑖=1}^{𝑚} (𝑌𝑖 − 𝑌̄)² ]/(𝑚 − 1) }    (8.37)
𝐹 = (𝑠²_𝑋 /𝜎²_𝑋) / (𝑠²_𝑌 /𝜎²_𝑌) = (𝑠²_𝑋 𝜎²_𝑌) / (𝑠²_𝑌 𝜎²_𝑋)

From Knowledge Box 7 we see that for normal data, both ratios 𝑠²_𝑋 /𝜎²_𝑋 and 𝑠²_𝑌 /𝜎²_𝑌 follow 𝜒²-distributions. We can now conclude that the ratio of two independent 𝜒² variables, each divided by its degrees of freedom, has the Fisher distribution.
Historical fact: The distribution of this statistic was obtained in 1918 by the famous English statistician and biologist Sir Ronald Fisher (1890-1962) and was developed and formalized in 1934 by the American mathematician George Snedecor (1881-1974). Its standard form, after we divide each sample variance in formula (8.37) by the corresponding population variance, is therefore called the Fisher-Snedecor distribution or simply F-distribution with (𝑛 − 1) and (𝑚 − 1) degrees of freedom.
We now discuss the Fisher distribution and its statistics, which will be useful in the next chapters.
• Any F-distributed variable is a ratio of two non-negative continuous random variables, hence it is also non-negative and continuous. The numerator degrees of freedom are always mentioned first.
Figure 8.7: Critical values of the F-distribution and their reciprocal property.
• Interchanging the degrees of freedom changes the distribution, so the order is important because
in the first case we deal with 𝐹 (𝑛−1, 𝑚−1) distribution, and in the second case with 𝐹 (𝑚−1, 𝑛−1).
This leads us to an important general conclusion:
If 𝐹 has the 𝐹(𝑢, 𝑣) distribution, then the distribution of 1/𝐹 is 𝐹(𝑣, 𝑢). More precisely, the critical values of 𝐹(𝑢, 𝑣) and 𝐹(𝑣, 𝑢) satisfy

𝐹_{1−𝛼}[𝑢, 𝑣] = 1 / 𝐹_{𝛼}[𝑣, 𝑢].    (8.38)
• The F distributions are not symmetric but are right-skewed. The peak of the F density curve
is near 1; values far from 1 in either direction provide evidence against the hypothesis of equal
standard deviations. Critical values of F-distribution are visualized in Figure 8.7 and given in Table
A7, and we will use them to test hypothesis of comparing two variances.
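The reciprocal property (8.38) can also be illustrated by simulation, very much in the spirit of this chapter: build 𝐹(𝑢, 𝑣) samples as ratios of independent chi-square variables divided by their degrees of freedom and compare empirical critical values. A hedged Python sketch (the seed, the sample size, and the choice 𝑢 = 19, 𝑣 = 29 are arbitrary):

```python
import random

def f_sample(u, v, rng):
    # one draw from F(u, v): ratio of two chi-squares over their degrees of freedom
    chi_u = sum(rng.gauss(0, 1) ** 2 for _ in range(u))
    chi_v = sum(rng.gauss(0, 1) ** 2 for _ in range(v))
    return (chi_u / u) / (chi_v / v)

rng = random.Random(2023)
u, v, N, alpha = 19, 29, 30_000, 0.05
fs_uv = sorted(f_sample(u, v, rng) for _ in range(N))
fs_vu = sorted(f_sample(v, u, rng) for _ in range(N))

f_low_uv = fs_uv[int(alpha * N)]       # F_{1-alpha}[u, v]: lower critical value of F(u, v)
f_up_vu = fs_vu[int((1 - alpha) * N)]  # F_alpha[v, u]: upper critical value of F(v, u)
print(f_low_uv, 1 / f_up_vu)           # the two estimates nearly coincide, as (8.38) predicts
```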
TESTING PROCEDURE:

𝐹 = (𝑠²_𝑋 /𝜎²_𝑋) / (𝑠²_𝑌 /𝜎²_𝑌).
• Assumption 2: When we only need to know if two variances are equal, then we choose 𝜃0 = 1.
Under the null 𝐻0: 𝜎²_𝑋 = 𝜎²_𝑌 the test statistic becomes

𝐹0 = 𝐹_obs = 𝑠²_𝑋 / 𝑠²_𝑌    (8.40)
We can compute the rejection region or find the P-value, using the 𝐹(𝑛 − 1, 𝑚 − 1) distribution in both cases, as in Figure 8.10. Critical values of the F-distribution are given in Table A7.
For marketing purposes, a survey of users of two operating systems is conducted. Twenty users
of operating system Windows record the average level of satisfaction of 77 on a 100-point scale,
with a sample variance of 220.
Thirty users of operating system MacOS have the average satisfaction level 70 with a sample
variance of 155. We already know how to compare the mean satisfaction levels (testing means
of two populations).
Should we assume equality of the population variances, 𝜎²_𝑋 = V[𝑋] and 𝜎²_𝑌 = V[𝑌], and use the pooled variance? Here 𝑛 = 20, x̄ = 77, 𝑠²_𝑋 = 220; 𝑚 = 30, ȳ = 70, and 𝑠²_𝑌 = 155. To compare the population means by a suitable method, we have to test whether the two population variances are equal or not.
• We test 𝐻0: 𝜎²_𝑋 = 𝜎²_𝑌 vs 𝐻𝐴: 𝜎²_𝑋 ≠ 𝜎²_𝑌 with the test statistic 𝑓0 = 𝑠²_𝑋 /𝑠²_𝑌 = 220/155 = 1.42.
• This is a two-sided test, so the P-value is 𝑃 = 2 min{ P[𝐹 ≥ 𝑓0], P[𝐹 ≤ 𝑓0] }.
How do we compute these probabilities for the F-distribution with 𝑛 − 1 = 19 and 𝑚 − 1 = 29 d.o.f.?
• In Matlab use fcdf(a, u, v) and in R use pf(a, u, v) for calculating cdf at value 𝑎 with 𝑢 and 𝑣
degrees of freedom. Hence
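Lacking F tables or statistical software, these probabilities can also be estimated by plain Monte Carlo, in the spirit of this chapter. A Python sketch for this example (the seed and replication count are arbitrary choices):

```python
import random

def f_sample(u, v, rng):
    # one draw from F(u, v): a ratio of two chi-squares over their degrees of freedom
    chi_u = sum(rng.gauss(0, 1) ** 2 for _ in range(u))
    chi_v = sum(rng.gauss(0, 1) ** 2 for _ in range(v))
    return (chi_u / u) / (chi_v / v)

rng = random.Random(8)
f_obs = 220 / 155                    # observed statistic, about 1.42
N = 50_000
upper_tail = sum(f_sample(19, 29, rng) >= f_obs for _ in range(N)) / N
p_value = 2 * min(upper_tail, 1 - upper_tail)
print(round(p_value, 2))             # roughly 0.4, far above 0.05
```

With a P-value around 0.4 there is no evidence against equal variances, so the pooled-variance comparison of means is defensible here.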
We are asked to compare volatilities of two mutual funds and decide if one of them is more risky
than the other. So, this is a one-sided test of
𝐻0 : 𝜎𝑋 = 𝜎𝑌 (or 𝜎𝑋 /𝜎𝑌 = 𝜃0 = 1)  vs  𝐻1 : 𝜎𝑋 > 𝜎𝑌 .
Are all the conditions met? Can we use F-distribution for inference here?
• The data collected over the period of 30 days show a 10% higher volatility of the first mutual fund, i.e., 𝑠𝑋 /𝑠𝑌 = 1.1. So, this is a standard F-test, right?
• A careless statistician would immediately proceed to the test statistic 𝐹_obs = 𝑠²_𝑋 /𝑠²_𝑌 = (1.1)² = 1.21 [as in Eqn. 8.40], compare it against the critical value from Table A7 with 𝑛 − 1 = 29 and 𝑚 − 1 = 29 d.f., and jump to the conclusion that there is no evidence that the first mutual fund carries a higher risk.
• Indeed, why not? Well, every statistical procedure has its assumptions, conditions under which
our conclusions are valid. A careful statistician always checks the assumptions before reporting
any results.
• If we conduct an F-test and refer to the F-distribution, what conditions are required?
𝐹 = (𝑠²_𝑋 /𝜎²_𝑋) / (𝑠²_𝑌 /𝜎²_𝑌) = (𝑠²_𝑋 𝜎²_𝑌) / (𝑠²_𝑌 𝜎²_𝑋)

has the F-distribution with (𝑛 − 1) and (𝑚 − 1) degrees of freedom.
♣ OBSERVATION.
1. Apparently, for the F-statistic to have F-distribution under 𝐻0 , each of our two samples has to con-
sist of independent and identically distributed normal random variables, and the two samples
have to be independent of each other.
2. The F-test is quite robust. It means that a mild departure from the assumptions 1-3 will not affect
our conclusions severely, and we can treat our result as approximate.
3. However, if the assumptions are not met even approximately, for example, the distribution of our
data is asymmetric and far from normal, then the P-value computed above is simply wrong.
When they are not satisfied, the obtained results may be wrong and misleading.
Therefore, unless there are reasons to believe that all the conditions are met, they have to be
tested statistically.
A queuing system involves spontaneous arrivals of jobs, their random waiting time, assignment to
servers, and finally, their random service time and departure.
When designing a queuing system or a server facility, it is important to evaluate its vital perfor-
mance characteristics. Precisely, five important performance measures of an 𝑀/𝑀/1 queue system
are briefed in the fact box below.
INPUTS:
1. four servers;
3. a Poisson process of arrivals with the rate of 1 arrival every 4 min, independent of service times;
5. Suppose that after 15 minutes of waiting, jobs withdraw from a queue if their service has not
started.
• the average waiting time; the longest waiting time; the number of withdrawn jobs...
Readers should complete the case study within at most 12 weeks and write a report with at least three parts. You can choose your favorite programming language, such as MATLAB or R...
Server 𝛼 𝜆
I 6 0.3
II 10 0.2
III 7 0.7
IV 5 1.0
We study an M/M/4 system and write code to get the simulated data Y = (X, 𝑆) at the 4 servers, given that the service time 𝑆 is distributed as Gamma(𝛼, 𝜆) [see COMPLEMENT 16B in Section ??] with the parameters given in the table above. We generate 𝑋1, 𝑋2, · · · , 𝑋𝑛 from arrival times, coded by arrival in the program below. You might write code in R or any other language; here we give a guided MATLAB code segment.
Computation in MATLAB
The team writes code to compute performance metrics (ref. [76, Section 7.6]). We start by entering the model parameters.
The queuing system is ready to work! We start a “while”-loop over the number of arriving jobs. It will run until the end of the day, when the arrival time T reaches 14 hours, or 840 minutes. The length of this loop, the total number of arrived jobs, is random.
mu = 4;                   % mean interarrival time (1 arrival every 4 minutes)
j = 0; arrival = [];      % job counter and the vector of arrival times
T = 0;
while T < 840             % until the end of the day
    j = j + 1;            % next job
    T = T - mu*log(rand); % arrival time of job j (exponential increment)
    arrival = [arrival T];
end
% Next, parameters for the service times of the servers
k = 4;                    % the system has 4 servers
alpha = [6 10 7 5]; lambda = [0.3 0.2 0.7 1.0];
% we use Gamma distributed service times, which need 2 parameters
The arrival time 𝑇 is obtained by incrementing the previous arrival time by an Exponential interar-
rival time. Next, we need to assign the new job 𝑗 to a server, following the rule of random assignment.
There are two cases here: either all servers are busy at the arrival time 𝑇 , or some servers are avail-
able.
1. Think yourself how to make a sample vector 𝑆1 , 𝑆2 , · · · , 𝑆𝑛 at one of four servers, assuming that
𝑆 ∼ Gamma(𝛼, 𝜆) = Gamma(6, 0.3) say, then build data
Y = 𝑋1 , 𝑋2 , · · · , 𝑋𝑛 , 𝑆1 , 𝑆2 , · · · , 𝑆𝑛
2. Confirm numerically, at each server, that the control variable 𝑓(Y) given in Equation (8.45) and the delay time 𝐷𝑛+1 = 𝑔(X, 𝑆) = 𝑔(Y) as in Equation (8.44) are positively correlated for that data.
3. Finally, try your best to summarize the mean statistic of each key parameter [plus the utilization 𝜌
in Knowledge Box 3.3] from the simulated data at all 4 servers.
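For task 1, here is a minimal Python sketch, assuming (as an illustration) server I with 𝑆 ∼ Gamma(𝛼 = 6, 𝜆 = 0.3), where 𝜆 is read as a rate so that E[𝑆] = 𝛼/𝜆 = 20 minutes, and a mean interarrival time of 4 minutes as in the INPUTS:

```python
import random

rng = random.Random(42)
n = 5_000
mean_interarrival = 4.0          # one arrival every 4 minutes on average

arrivals = []                    # arrival times X_1 < X_2 < ... < X_n
t = 0.0
for _ in range(n):
    t += rng.expovariate(1 / mean_interarrival)   # exponential interarrival gap
    arrivals.append(t)

# service times S ~ Gamma(alpha = 6, rate lambda = 0.3) for server I;
# random.gammavariate takes shape and SCALE, so the scale is 1/lambda
alpha, lam = 6.0, 0.3
services = [rng.gammavariate(alpha, 1 / lam) for _ in range(n)]

print(sum(services) / n)         # sample mean, close to alpha/lam = 20 minutes
```

Repeating this with the (𝛼, 𝜆) pairs of servers II-IV yields the data Y = (𝑋1, · · · , 𝑋𝑛, 𝑆1, · · · , 𝑆𝑛) for tasks 2 and 3.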
In transition diagrams of a system-process, the nodes usually are the values of the system variable
𝑋, the number of jobs in the queueing system.
8.8 ASSIGNMENT II
■ PROBLEM: In Queueing Theory, Batch Service is the service that a system provides to customers under the same policy if they are assigned to the same group. The Batch Service rule is often designed by a time factor or another common factor.
If the time factor is concerned, the policy means that the same service rate will be applied to all units in a certain time interval. The service time of a server is then said to be group-demand adaptive, or just demand adaptive.
In an M/G/1 system the manager can design a demand-adaptive service time 𝑆 by employing the current queue length, following 3 rules:
For instance, at 9:00 am he observes 𝑚 = 12 customers waiting in the queue, so he sets the mean service time E[𝑆] = 1/𝑚 = 1/12 hours = 5 minutes, applied over the whole time interval [9 am, 10 am] to all of those customers.
DATA: Assume visitors arrive in an airline office with rate 𝜆 = 18 persons per hour. At 10:00 am
QUESTIONS: Working in units of minutes, students apply the rules of the above demand-adaptive service time 𝑆 to the time interval [10 am, 11 am] to compute
SUMMARY
The 𝑀/𝑀/1 system includes the two key processes of Arrival and Service, in which
It is precisely the Poisson variable 𝑁(𝑡) with rate 𝜆𝑡, and probability mass function

P[𝑁(𝑡) = 𝑛] = (𝜆𝑡)ⁿ 𝑒^{−𝜆𝑡} / 𝑛!,  𝑛 = 0, 1, 2, ...

The interarrival times are exponentially distributed with the same cdf 𝐹 ∼ E(𝜆 = 𝜇𝐹), i.e. {𝑋𝑛} ∼ 𝑋 = E(𝜇𝐹). Hence the mean time between two arrivals, or the mean inter-arrival time, is E[𝑋] = 1/𝜆 (see Theorem ??).
3. The service times {𝑆𝑛 } ∼ 𝒮 = E(𝜇𝐺 ) are i.i.d., with the cdf
For a given stable M/M/1 queue, we study the following performance indicators.
1. The number of jobs in the system at time point 𝑡 is denoted by 𝑋(𝑡). Probability of exactly 𝑛
jobs in the system at 𝑡 is
𝑝𝑛 (𝑡) = P[𝑋(𝑡) = 𝑛], 𝑛 ∈ Range(𝑋(𝑡)) (8.42)
3. The service demand is described via the arrival rate 𝜆𝐴 = 𝜆 ∈ ℝ⁺ and the service (processing) rate 𝜆𝑆 = 𝜇 of the server, where 𝑆 is the service time of a customer.
4. Utilization 𝑟 or 𝜌: If the queuing system consists of a single server, then the utilization 𝜌 is the
fraction of the time in which the server is busy, i.e., occupied.
𝜌 = arrival rate / service rate = 𝜆𝐴 /𝜆𝑆 = 𝜆/𝜇.    (8.43)
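As a sanity check of (8.43), a single-server FIFO queue can be simulated directly: the fraction of time the server is busy should approach 𝜌 = 𝜆/𝜇. A Python sketch with the illustrative values 𝜆 = 1 and 𝜇 = 1.25 (so 𝜌 = 0.8):

```python
import random

rng = random.Random(5)
lam, mu = 1.0, 1.25          # arrival and service rates, rho = lam/mu = 0.8
n = 100_000

t = 0.0          # arrival clock
depart = 0.0     # time at which the server next becomes free
busy = 0.0       # accumulated busy (service) time
for _ in range(n):
    t += rng.expovariate(lam)        # next Poisson arrival
    start = max(t, depart)           # wait if the server is still busy (FIFO)
    s = rng.expovariate(mu)          # exponential service time
    depart = start + s
    busy += s

rho_hat = busy / depart              # fraction of time the server was busy
print(round(rho_hat, 2))             # close to 0.8
```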
PROBLEM
Hypotheses:  𝐻0 : 𝜇1 = 𝜇2, or 𝜇1 − 𝜇2 = Δ0 = 0,  versus  𝐻1 : 𝜇1 ≠ 𝜇2.
Test Statistic: Since we do not know 𝜎1, 𝜎2 and the sample sizes are small, use the sample variances:

𝑇 = (x̄1 − x̄2) / ( 𝑠𝑝 √(1/𝑛1 + 1/𝑛2) )
The degrees of freedom are 𝑛1 + 𝑛2 − 2 = 6 + 7 − 2 = 11. We reject when 𝑇 > 𝑡_{𝑛1+𝑛2−2; 𝛼/2} or 𝑇 < −𝑡_{𝑛1+𝑛2−2; 𝛼/2}; that means 𝑇 > 2.20 or 𝑇 < −2.20.
Hence the test statistic 𝑇 = ... = 2.97 > 2.20: we reject the null hypothesis 𝐻0.
We conclude that there is a significant difference between the start-up times of the two groups (start-up time values are lower by an estimated 9.09 in the Linux-run group).
This problem will be extended to a small case study project in Part ??.
1. the {𝑋𝑛 } are mutually independent and identically distributed (i.i.d.) with distribution 𝐹 having
mean 𝜇𝐹 , and
2. the {𝑋𝑛} are also independent of the service times {𝑆𝑛}, which are i.i.d. with distribution 𝐺 having mean 𝜇𝐺.
QUESTION. Since the functional variable 𝑔(·) = 𝐷𝑛+1 is unknown (in fact it is a random variable), we want to figure out a pattern of 𝐷𝑛+1, then find its estimator by simulation.
• With the delay in queue 𝐷𝑛+1 of the (𝑛 + 1)st customer, and taking into account the possibility that the simulated 𝑋𝑖, 𝑆𝑖 may randomly be quite different from what might be expected, we can
• We see that 𝑓(Y) = 𝑍; and is 𝑔(Y) = 𝐷𝑛+1 = 𝑆𝑛 − 𝑋𝑛 + 𝑆𝑛+1 for all 𝑛 ≥ 1? Or is 𝑔(Y) = 𝐷𝑛+1 = ∑_{𝑖=1}^{𝑛} 𝑋𝑖 + ∑_{𝑖=1}^{𝑛} 𝑆𝑖 ?
PROBLEM 8.5 would be utilized in Section ??, about a Simulation Project of 𝑀/𝑀/𝑘 queue when
𝑘 > 1.
a) Suppose that there are 𝑚 terrorists in a group of 𝑁 visitors arriving per day in all airports of the
U.S., with 𝑚 ≪ 𝑁 . If you choose randomly 𝑛 visitors from that group, 𝑛 < 𝑁 , compute the expected
number of terrorists.
b) Use the moment generating function to prove that both the mean E[𝑋] and variance V[𝑋] of a
Poisson random variable 𝑋 with parameter 𝜆 are E[𝑋] = 𝜆; V[𝑋] = 𝜆.
c) Consider a Poisson process {𝐾(𝑡)} on time interval [0, 𝑡] with 𝑡 > 0 and positive rate of 𝜆 events
per time unit.
a) There are 𝑚 terrorists in a group of 𝑁 visitors. You chose randomly 𝑛 visitors from that group, 𝑛 < 𝑁. Denote by 𝑋 the number of terrorists in that random sample of 𝑛 visitors; then 𝑋 = 𝐵1 + 𝐵2 + · · · + 𝐵𝑛 where each 𝐵𝑖 ∼ B(𝑝) marginally, with the same probability 𝑝 = 𝑚/𝑁 for each 𝐵𝑖. (Strictly speaking, when sampling without replacement the 𝐵𝑖 are not independent, so 𝑋 is hypergeometric rather than Bin(𝑛, 𝑝); but linearity of expectation does not require independence.) The linearity of expectation gives

E[𝑋] = 𝑛𝑝 = 𝑛𝑚/𝑁.
b) Prove that the mean E[𝑋] and variance V[𝑋] are given by E[𝑋] = 𝜆; V[𝑋] = 𝜆.
The moment generating function of a Poisson(𝜆) variable is 𝑀(𝑡) = E[𝑒^{𝑡𝑋}] = 𝑒^{𝜆(𝑒^𝑡 − 1)}, hence

𝑀′(𝑡) = 𝜆 𝑒^𝑡 𝑀(𝑡),
𝑀″(𝑡) = (𝜆² 𝑒^{2𝑡} + 𝜆 𝑒^𝑡) 𝑀(𝑡).

Using

𝑀^{(𝑛)}(𝑡)|_{𝑡=0} = 𝜇𝑛 = E[𝑋ⁿ] = 𝑀^{(𝑛)}(0)    (8.47)

we obtain

E[𝑋] = 𝜇 = 𝑀′(0) = 𝜆,
V[𝑋] = E[𝑋²] − E[𝑋]² = 𝑀″(0) − 𝑀′(0)² = (𝜆² + 𝜆) − 𝜆² = 𝜆.    (8.48)
⁹ The NHPP relaxes the Poisson process assumption of stationary increments. Thus it allows for the possibility that the arrival rate need not be constant but can vary with time.
8.10
COMPLEMENT 8A:
Non-homogeneous Poisson Process
Definition 8.6. {𝑁 (𝑡), 𝑡≥0} is a non-homogeneous (or non-stationary) Poisson process with inten-
sity (rate) function 𝜆(𝑡) if the following conditions are satisfied:
𝑝𝑘(𝑡) = P[𝑁(𝑡) = 𝑘] = 𝑒^{−𝑚(𝑡)} [𝑚(𝑡)]^𝑘 / 𝑘!,  𝑘 ≥ 0.    (8.50)

The function 𝑚(𝑡) is also called the principal function of the process.
• The NHPP 𝑁 (𝑡) follows a Poisson distribution with mean 𝑚(𝑡). The mean value function 𝑚(𝑡) of
this process is defined by Equation (8.49) .
• In the non-homogeneous case, the rate parameter 𝜆(𝑡) now depends on 𝑡. That is,
P{𝑁(𝜏 + 𝑡) − 𝑁(𝜏) = 1} ≈ 𝜆(𝜏) 𝑡, as 𝑡 → 0.
The second result follows because 𝑁 (𝑡) is Poisson random variable with mean 𝑚(𝑡), and if we let
𝑋(𝑡) = 𝑁 (𝑚−1 (𝑡)), then 𝑋(𝑡) is Poisson with mean 𝑚(𝑚−1 (𝑡)) = 𝑡.
To simulate the first 𝑇 time units of a non-homogeneous Poisson process with intensity function 𝜆(𝑡), 0 ≤ 𝑡 < ∞, let the constant 𝜆 be such that 𝜆(𝑡) ≤ 𝜆 for all 𝑡 ≤ 𝑇.
IDEAS
• Such a non-homogeneous Poisson process can be generated by a random selection of the event
times of a Poisson process {𝑁 (𝑡), 𝑡 ≥ 0} having rate 𝜆. That is, if an event of a Poisson process
with rate 𝜆 that occurs at time 𝑡 is counted (independently of what has transpired previously) with
probability 𝑝(𝑡) = 𝜆(𝑡)/𝜆,
then the process {𝑁𝑐 (𝑡), 𝑡 ≥ 0} of counted events is a non-homogeneous Poisson process with
intensity function 𝜆(𝑡) = 𝜆 𝑝(𝑡) for all 0 ≤ 𝑡 ≤ 𝑇.
(Timeline of the construction: 0 −− 𝑋1 −− 𝑋1 + 𝑋2 −− · · · −− ∑_{𝑖=1}^{𝑗} 𝑋𝑖 −− · · · −− 𝑇 −− ∑_{𝑖=1}^{𝑁} 𝑋𝑖 −−>)
Non-homogeneous Poisson (Input):
1. Generate independent random variables 𝑋1, 𝑈1, 𝑋2, 𝑈2, · · · where the 𝑋𝑖 are exponential with rate 𝜆 and the 𝑈𝑖 are random numbers, stopping at round

𝑁 = min{ 𝑛 : ∑_{𝑖=1}^{𝑛} 𝑋𝑖 > 𝑇 }.
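The counted-events (thinning) construction above can be sketched in Python; the intensity 𝜆(𝑡) = 2 + sin(𝑡) and the bound 𝜆 = 3 are illustrative assumptions:

```python
import math
import random

def thinning_nhpp(lam_of_t, lam, T, rng):
    # Event times of an NHPP on [0, T] by thinning a rate-lam Poisson process.
    # Requires lam_of_t(t) <= lam for all 0 <= t <= T.
    t, events = 0.0, []
    while True:
        t += rng.expovariate(lam)             # candidate event of the rate-lam process
        if t > T:
            return events
        if rng.random() < lam_of_t(t) / lam:  # count it with probability p(t) = lam(t)/lam
            events.append(t)

rng = random.Random(11)
T = 1000.0
events = thinning_nhpp(lambda t: 2 + math.sin(t), 3.0, T, rng)
# the expected count is m(T) = 2T + 1 - cos(T), about 2000 here
print(len(events))
```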
8.11
COMPLEMENT 8B: Statistical Inference for SPE
It is a well-known fact that the subject of statistical inference is mathematically classified into the two broad categories of Parameter Estimation and Hypothesis Testing. From the practical viewpoint, the statistical inference process is briefly formed by 4 steps as follows.
1. Define the problem area: determine the areas in which our interest, problems or open questions lie;
2. In that domain of interest, decide what kind of information we need to collect from observing the real world, or from experimenting in labs;
3. Analyze the observed data (what to know, why to do it, and how to conduct it with suitable methods);
4. Present the outcomes/decisions made from the whole process to the boss!
When studying Quality Analytics in previous chapters we partially employed statistical inference and often focused on population means.
Now, aiming at the study of System Performance Evaluation (SPE) and other fields, we switch to discussing statistical inference methods for population proportions and variances. Practically, estimating and testing the population variance are meaningful, to make sure that
[Source KPA]
The null hypothesis is that the ratio of the variances of the populations from which x and y were drawn, or in the data to which the linear models x and y were fitted, is equal to ratio.
Assume random variables 𝑋1, 𝑋2, · · · , 𝑋𝑛 ∼ᵢ.ᵢ.ᵈ. 𝑋 (the 𝑋𝑖 are independent and have the same distribution as a common random variable 𝑋) with mean 𝜇 and variance 𝜎².
The normal population: If the population 𝑋 ∼ N(𝜇, 𝜎²) then for any 𝑛 the sample mean

X̄ = (1/𝑛) ∑_{𝑖=1}^{𝑛} 𝑋𝑖

of observations 𝑋1, 𝑋2, · · · , 𝑋𝑛 has expectation E[X̄] = 𝜇 and variance V[X̄] = 𝜎²/𝑛. Moreover X̄𝑛 is normal (i.e. follows the Gauss distribution) for any 𝑛:

X̄𝑛 ∼ N(𝜇, 𝜎²/𝑛).
Generic population: If 𝑋 is not normal, then X̄𝑛 is approximated by the Gauss variable N(𝜇, 𝜎²/𝑛) only when 𝑛 is large (𝑛 > 30). The C.L.T. briefly says:
the sampling distribution of the sample mean tends to normality asymptotically.
♣ QUESTION. Only one concern remains: if the population 𝑋 is not normal but 𝑛 ≤ 30, what is the sampling distribution of X̄? Knowledge Box 6 will answer this.
Knowledge Box 6 (Sampling distributions of a generic statistic- The CLT as a special case).
Under appropriate conditions, generally if 𝑆 is a statistic of interest, and 𝜎𝑆 = √V[𝑆] is its standard error, then approximately its standardization 𝑍 = (𝑆 − E[𝑆])/𝜎𝑆 is standard Gaussian.
1. The fact 𝑍 = (𝑆 − E[𝑆])/𝜎𝑆 ∼ N(0, 1) equivalently shows that the squared standardization

𝑍² = [𝑆 − E[𝑆]]² / V[𝑆] ∼ 𝜒²(1).    (8.51)
2. Examples include the CLT, saying the sample mean of a random variable 𝑋 follows a normal distribution, meaning X̄ ∼ N(𝜇, 𝜎²/𝑛). Here we fix 𝑆 = X̄ and exploit Theorem 8.9, with E[𝑆] = E[X̄] = 𝜇 and 𝜎𝑆 = 𝜎/√𝑛. Then

𝑍𝑛 = (X̄𝑛 − 𝜇)/(𝜎/√𝑛)  satisfies  lim_{𝑛→∞} 𝑍𝑛 = 𝑍 ∼ N(0, 1).    (8.52)
Several important tests of statistical hypotheses are based on the Chi-square distribution. Chi-
square distribution was introduced around 1900 by a famous English mathematician Karl Pearson
(1857-1936) who is regarded as a founder of the entire field of Mathematical Statistics.
In the present section we assume that 𝑋1, 𝑋2, · · · , 𝑋𝑛 are i.i.d. N(𝜇, 𝜎²) random variables. We have seen that 𝜎² is estimated unbiasedly and consistently by the sample variance

𝑠² = (1/(𝑛 − 1)) ∑_{𝑖=1}^{𝑛} (𝑋𝑖 − X̄)².
• The summands (𝑋𝑖 − 𝑋)2 are not quite independent, as the Central Limit Theorem requires, be-
cause they all depend on X . Nevertheless, the distribution of 𝑠2 is approximately normal, under
mild conditions, when the sample is large. For small to moderate samples, the distribution of 𝑠2 is
not normal at all. It is not even symmetric.
When observations 𝑋1, 𝑋2, · · · , 𝑋𝑛 are i.i.d. N(𝜇, 𝜎²) (independent and normal) with V[𝑋𝑖] = 𝜎², the distribution of

(𝑛 − 1)𝑠²/𝜎² = ∑_{𝑖=1}^{𝑛} ( (𝑋𝑖 − X̄)/𝜎 )² ∼ 𝜒²[𝑛 − 1],    (8.54)

meaning it is Chi-square 𝜒²[𝑛 − 1] with (𝑛 − 1) degrees of freedom.
Hence, 𝑠² = (1/(𝑛 − 1)) ∑_{𝑖=1}^{𝑛} (𝑋𝑖 − X̄)² is an unbiased and consistent estimator of the population variance 𝜎², and by (8.54) the variance estimator 𝑠² fulfills

𝑠² ∼ (𝜎²/(𝑛 − 1)) 𝜒²[𝑛 − 1].
Let us construct a (1−𝛼) 100% confidence interval for the population variance 𝜎 2 , based on a sample
of size 𝑛. As always, we start with the estimator, the sample variance 𝑠2 .
• Obviously, since the distribution of 𝑠2 is a chi-square distribution, not symmetric, our confidence
interval won’t have the form “estimator ± margin” as before.
• We may use Table A6 - Figure 8.16 to find the critical values 𝜒2𝛼/2 [𝑛 − 1] and 𝜒21−𝛼/2 [𝑛 − 1] of the
Chi-square distribution with 𝑣 = 𝑛 − 1 degrees of freedom.
Similarly, the (1 − 𝛼) 100% lower and upper confidence bounds for 𝜎² are

(𝑛 − 1)𝑠² / 𝜒²_𝛼[𝑛 − 1] ≤ 𝜎²;  and  𝜎² ≤ (𝑛 − 1)𝑠² / 𝜒²_{1−𝛼}[𝑛 − 1]    (8.56)
Figure 8.17: Chi-square curve and critical values at a specific significance level
An automated filling machine is used to fill bottles with liquid detergent. A random sample of 20
bottles results in a sample variance of fill volume 𝑋 of 𝑠2 = 0.0153 (liter)2 . If the variance of fill volume
exceeds 0.01 (liter)2 , an unacceptable proportion of bottles will be underfilled or overfilled.
We will assume that the fill volume is approximately normally distributed. A 95% upper confidence
bound is found from Formula 8.56 as follows:
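A Python sketch of this computation; having no chi-square table at hand, the critical value 𝜒²_{0.95}[19] (the point with area 0.95 to its right) is estimated by Monte Carlo, so the bound lands near the table-based answer of about 0.029 (liter)²:

```python
import random

rng = random.Random(3)
n, s2 = 20, 0.0153           # sample size and sample variance of fill volume
df = n - 1

# Monte Carlo stand-in for the chi-square table: estimate chi2_{0.95}[19],
# i.e. the 5th percentile of a chi-square(19) variable
N = 100_000
draws = sorted(sum(rng.gauss(0, 1) ** 2 for _ in range(df)) for _ in range(N))
chi2_095 = draws[int(0.05 * N)]      # close to the tabulated value 10.117

upper_bound = df * s2 / chi2_095     # 95% upper confidence bound for sigma^2
print(round(upper_bound, 4))         # about 0.029 (liter)^2, above 0.01
```

Since the bound exceeds 0.01 (liter)², the data do not rule out an unacceptably large fill-volume variance.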
Many practical applications require testing independence of two factors, say 𝐴 and 𝐵. Apparently,
chi-square statistic can help us test
If there is a significant association or dependency between two features, it helps to understand the
cause-and-effect relationships.
• For example, is it true that smoking causes lung cancer? Do the data confirm that drinking and
driving increases the chance of a traffic accident?
• In computing, does customer satisfaction with their PC depend on the operating system?
Definition 8.10.

𝑇 = 𝜒² = ∑_{𝑘=1}^{𝑁} (𝑂𝑘 − 𝐸𝑘)² / 𝐸𝑘    (8.57)

Here the sum goes over the 𝑁 categories or groups of data defined depending on our testing,
• 𝑂𝑘 is the observed number of sampling units in category 𝑘, and
• 𝐸𝑘 = E[𝑂𝑘 | 𝐻0] is the expected number of sampling units in category 𝑘 if the null hypothesis 𝐻0 is true.
♣ OBSERVATION. This is always a one-sided, right-tail test. That is because only the low values of
𝜒2 show that the observed counts are close to what we expect them to be under the null hypotheses,
and therefore, the data support 𝐻0 . On the contrary, large 𝜒2 occurs when observations 𝑂𝑘 are far
from expected numbers 𝐸𝑘 , which shows inconsistency of the data and does not support 𝐻0 .
Theorem 8.11.
Under 𝐻0, 𝑇 ⇝ 𝜒²[𝑁 − 1]. Hence the test that rejects 𝐻0 if 𝑇 > 𝜒²_𝛼[𝑁 − 1] has asymptotic level 𝛼.
A 3-step procedure:
1. A level-𝛼 rejection region for this chi-square test is 𝑅 = [𝜒²_𝛼[𝑁 − 1], +∞), and the P-value is found as 𝑃 = P[𝜒² ≥ 𝜒²_obs].
2. Pearson showed that the null distribution of 𝜒2 converges to the Chi-square distribution with (𝑁 − 1)
degrees of freedom, due to Theorem 8.11, as the sample size increases to infinity.
This follows from a suitable version of the Central Limit Theorem. To apply it, we need to make sure the sample size is large enough, namely
𝐸𝑘 = E[𝑂𝑘 | 𝐻0] ≥ 5
for all 𝑘 = 1, . . . , 𝑁 . If that is the case, then we can use the 𝜒2 distribution to construct rejection
regions and compute P-values. If a count in some category is less than 5, then we should merge
this category with another one, and recalculate the 𝜒2 statistic.
♦ EXAMPLE 8.14 (Internet shopping on different days of the week).
A web designer suspects that the chance for an internet shopper to make a purchase through her
web site varies depending on the day of the week. To test this claim, she collects data during one
week, when the web site recorded 3758 hits.
Observed, 𝑥 Mon Tue Wed Thu Fri Sat Sun Total
No purchase 399 261 284 263 393 531 502 2633
Single purchase 119 72 97 51 143 145 150 777
Multiple purchases 39 50 20 15 41 97 86 348
Total 557 383 401 329 577 773 738 3758
Testing independence (i.e., probability of making a purchase or multiple purchases is the same
on any day of the week), we compute the estimated expected counts, then apply the above 3-step
procedure.
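A Python sketch of the computation for this contingency table; the critical value 𝜒²_{0.05}[12] ≈ 21.03, with d.f. = (3 − 1)(7 − 1) = 12, is taken from a standard table:

```python
rows = [
    [399, 261, 284, 263, 393, 531, 502],   # no purchase
    [119,  72,  97,  51, 143, 145, 150],   # single purchase
    [ 39,  50,  20,  15,  41,  97,  86],   # multiple purchases
]
row_tot = [sum(r) for r in rows]
col_tot = [sum(c) for c in zip(*rows)]
total = sum(row_tot)                        # 3758 hits in all

# chi-square statistic: sum over all cells of (observed - expected)^2 / expected,
# with expected counts estimated as (row total)(column total)/(grand total)
T = 0.0
for i, r in enumerate(rows):
    for j, obs in enumerate(r):
        exp = row_tot[i] * col_tot[j] / total
        T += (obs - exp) ** 2 / exp

df = (len(rows) - 1) * (len(col_tot) - 1)   # (3-1)(7-1) = 12
print(f"T = {T:.1f} on df = {df}; reject independence: {T > 21.03}")
```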
Workload Characterization
With Data Analytics
Introduction
In this chapter we introduce the concept of system workload and discuss many data-driven methods for studying computer system workload and its characterization.
Next we promote the approach named Performance by Design (PbD) that is similar to the well-
known Quality by Design (QbD) in industrial manufacturing.
To specify what Performance by Design means, we first summarize the SPE theory developed from Chapter ??, then illustratively propose Performance Evaluation Analytics projects that include most of the essences of system performance evaluation, with Analyzing Product-Form Queues discussed in Section ?? and Non-Markovian Queues in Chapter ??, respectively.
Chapter Blueprint
4. Know and employ popular techniques for workload study, based on Data Analytics, at least Principal Component Analysis and 𝐾-means Clustering.
There are three main factors that affect the performance of a computer system:
– LINPACK - in high performance computing - is a software library for performing numerical linear algebra on digital computers.¹
¹ Wiki: LINPACK was written in Fortran by Jack Dongarra, Jim Bunch, Cleve Moler, and Gilbert Stewart, and intended for use on supercomputers in the 1970s and early 1980s. It has been largely superseded by LAPACK, which runs more efficiently on modern architectures.
2. Level of detail
3. Representativeness
4. Timeliness
2) Level of detail - the services exercised can be recorded at several levels of detail. Explicitly,
• (A) Most frequent request: select the most frequently requested service as the workload.
E.g., an addition instruction.
• (B) Frequency of request types: a list of [ service, frequency ] pairs.
E.g., instruction mixes.
• (C) Distribution of resource demands: a complete probability distribution is needed, e.g., for analytical and simulation modeling.
3) Representativeness
• Resource demands
4) Timeliness
Workload modeling is the attempt to create a simple and general model, which can then be used to generate synthetic workloads at will, possibly with slight (but well-controlled!) modifications.
GOAL (of workload modeling): typically to create synthetic workloads that can be used in
performance evaluation studies, and this synthetic workload is supposed to be similar to those that
occur in practice on real systems.
DATA ROLE: Workload modeling always starts with measured data about the workload. This data
is often recorded as a trace, or log, of workload-related events that happened in a certain system.
For example, a job log may include data about
• arrival time
• Necessary to study real-user environments, observe key characteristics, and develop a workload model
NOTE: We should use those parameters that depend on the workload rather than on the system.
E.g., response time is not appropriate.
Workload components
[Figure: workload components such as simulation, editor, compiler, and mail programs driving the System Under Test (SUT).]
Example:
Unsupervised and Supervised Learning techniques, coupled with the statistical software R, are used in this part.
3. Markov models
5. Clustering
They (mean and dispersion) are all statistical notions; see REMINDER: On Central and Spreading tendency in Section 9.4.
• Arithmetic mean:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i.$$
Specifying dispersion
• Standard deviation: $s = \sqrt{\dfrac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$.
Table 9.1: C1: Various programs data
The resource demands of various programs executed on 6 university sites were measured for 6
months for two categories, named Various programs and Editors.
• C2. Editors
• Make a histogram and then fit a probability distribution to the shape of the histogram
[Figure: state transition diagram of a Markov model with states CPU, disk, and network; transition probabilities (e.g., 0.3, 0.4, 0.8) label the arcs.]
A Markov model in discrete time, also called a discrete-time Markov chain (DTMC) or simply a Markov chain, is summarized as follows.
(B) A stationary Markov chain (a time-homogeneous, or homogeneous, DTMC) with three components $M = (Q, p, \mathbf{P})$ is specified by the state transition probabilities $p_{ij}$, the transition probability from state $i$ to state $j$, subject to two conditions:
$$p_{ij} \ge 0, \qquad \sum_j p_{ij} = p_{i1} + p_{i2} + p_{i3} + \dots + p_{is} = 1 \quad \text{for all } i.$$
EVOLUTION- Given the stochastic model just described, the Markov chain is specified in terms of a
sequence of random variables 𝑋0 , 𝑋1 , 𝑋2 , . . . whose values are taken from the set 𝑄 in accordance
with the Markov property
The Markov property is described through the probabilities $p_{ij}$, represented by the state transition matrix
$$\mathbf{P} = \begin{bmatrix}
p_{11} & p_{12} & p_{13} & \dots & p_{1s} \\
p_{21} & p_{22} & p_{23} & \dots & p_{2s} \\
p_{31} & p_{32} & p_{33} & \dots & p_{3s} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
p_{s1} & p_{s2} & p_{s3} & \dots & p_{ss}
\end{bmatrix} \qquad (9.2)$$
Unless stated otherwise, we assume and work with homogeneous Markov chains 𝑀 .
a/ the initial distribution- the probability distribution of starting position of the concerned object
at time point 0, and b/ the transition probabilities; and
we want to determine the probability distribution of position $X_n$ for any time point $n > 0$.
• The initial probabilities $p(0)$ are obtained at the current time (the beginning of a study).
In most cases, the major concern is using P and 𝑝(0) to predict future.
$p_{ij}^{(h)} = \mathrm{Prob}(X_{m+h} = j \mid X_m = i)$, the $h$-step transition probabilities.
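To make the prediction concrete, the evolution $p(n) = p(0)\,\mathbf{P}^n$ can be computed by repeated vector-matrix multiplication. The following Python sketch uses a hypothetical 2-state chain, chosen only for illustration.

```python
# Evolving a homogeneous DTMC: p(n) = p(0) P^n.
# The 2-state transition matrix below is hypothetical.
P = [[0.9, 0.1],
     [0.4, 0.6]]          # each row sums to 1
p0 = [1.0, 0.0]           # initial distribution: start in state 1

def step(p, P):
    """One step of the chain: p(n+1) = p(n) P."""
    s = len(P)
    return [sum(p[i] * P[i][j] for i in range(s)) for j in range(s)]

p = p0
for n in range(3):        # three steps: p(3)
    p = step(p, P)
print([round(x, 4) for x in p])   # → [0.825, 0.175]
```

Each iteration multiplies the current distribution by $\mathbf{P}$, so after $n$ steps the result equals $p(0)\mathbf{P}^n$, exactly the prediction problem stated above.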
HOW TO DO?
The first Principal Component, having the highest variance, is found as $Y_1 = \delta_1^\top x$, where the $p$-dimensional vector $\delta_1$ solves
$$\delta_1 = \arg\max_{\|\delta\|=1} \operatorname{Var}(\delta^\top x) \qquad (9.3)$$
The second PC is the linear combination with the second largest variance and orthogonal to
the first PC, and so on.
Advantages
Idea: different weightings give different classes of workload, so we use the weighted sum $y = \sum_{j=1}^{n} w_j x_j$
• Mean characteristics may not correspond to any member component in case of manually assigning
poor weights.
SIMPLE ALGORITHM
• PCA produces a set of principal factors ⃗𝑦 = [𝑦1 , 𝑦2 , . . . , 𝑦𝑚 ]⊤ so that the following holds:
$$\vec{y} = A\vec{x}$$
– 𝑦’s form an ordered set (𝑦1 explains the highest percentage of the variance).
See PCA theory in Section 9.6 and practical R Computation in Section 9.7.
Studying all of the workload components is far from realistic. As a result, a basic approach is to break the workload into categories by clustering algorithms. The basic steps are
1. Take a sample (Sampling), that is, a subset of workload components (e.g., several thousand user profiles), Section 9.3.1
2. Select workload parameters (see Section 9.3.2 and PCA in Section 9.6)
3. Transform workload parameters (if necessary), Remove outliers and Data scaling, Section 9.3.3
(*) Lastly, if the estimation error is large, change the parameters or the number of clusters, and repeat steps 3–5.
We sequentially study the steps above with a variety of methods. Basic concepts of Statistics are reminded in Section 9.4.
Let 𝑃 designate a finite population of 𝑁 units (assumed that the population size 𝑁 is known). We
first make a list 𝐿𝑁 = {𝑢1 , 𝑢2 , · · · , 𝑢𝑁 } of all the elements of the population, which are all labeled for
identification purposes. Let 𝑋 be a variable of interest and 𝑥𝑖 = 𝑋(𝑢𝑖 ), 𝑖 = 1...𝑁 the value ascribed
by 𝑋 to the 𝑖-th unit, 𝑢𝑖 ∈ 𝑃 .
The population mean and population variance for the variable $X$ are
$$\mu_N = \frac{1}{N}\sum_{i=1}^{N} x_i, \quad \text{and} \quad \sigma_N^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu_N)^2, \qquad (9.4)$$
• In one study, 2% of the population was chosen for analysis; later 99% of the population could be
assigned to the clusters obtained.
• Criteria:
– Impact on performance
– Small Variance
• Transformation: If the distribution is highly skewed, consider a function of the parameter, e.g.,
function log(𝑋) of CPU time 𝑋.
– Outliers affect normalization.
– They can be excluded only if they do not consume a significant portion of the system resources (e.g., backup).
3. Range Normalization:
$$x'_{ik} = \frac{x_{ik} - x_{\min,k}}{x_{\max,k} - x_{\min,k}},$$
which is affected by outliers.
– Percentile Normalization:
$$x'_{ik} = \frac{x_{ik} - x_{2.5,k}}{x_{97.5,k} - x_{2.5,k}}.$$
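A minimal Python sketch of both normalizations, on a hypothetical parameter column with one outlier; the `percentile` helper uses simple linear interpolation, one of several common conventions.

```python
# Range vs. percentile normalization of one parameter column.
x = [2.0, 4.0, 6.0, 8.0, 10.0, 100.0]   # hypothetical values; 100.0 is an outlier

def range_norm(x):
    lo, hi = min(x), max(x)
    return [(v - lo) / (hi - lo) for v in x]

def percentile(x, p):
    """Linear-interpolated p-th percentile (0 <= p <= 100)."""
    xs = sorted(x)
    k = (len(xs) - 1) * p / 100
    f = int(k)
    c = min(f + 1, len(xs) - 1)
    return xs[f] + (xs[c] - xs[f]) * (k - f)

def percentile_norm(x):
    lo, hi = percentile(x, 2.5), percentile(x, 97.5)
    return [(v - lo) / (hi - lo) for v in x]

print([round(v, 3) for v in range_norm(x)])
# → [0.0, 0.02, 0.041, 0.061, 0.082, 1.0]
# The outlier squeezes the range-normalized values toward 0;
# the percentile version is less affected by it.
```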
2. Manhattan distance: $d(u_j, v_j) = |u_j - v_j| \Longrightarrow d(u, v) = \sum_{j=1}^{m} |u_j - v_j|$.
3. Hamming distance: $m_H(u, v) = \sum_{i=1}^{m} \delta(u_i - v_i)$, and the Hamming mean
$$d(u, v) = \frac{m_H(u, v)}{m}. \qquad (9.6)$$
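The three distance measures above can be sketched in a few lines of Python; the vectors `u`, `v` are illustrative only.

```python
# Euclidean, Manhattan, and Hamming measures of Section 9.3.4.
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def hamming(u, v):
    """Number of positions where u and v differ."""
    return sum(a != b for a, b in zip(u, v))

u, v = (0, 0, 1, 1), (1, 0, 1, 0)
print(euclidean(u, v), manhattan(u, v), hamming(u, v))
# → 1.4142135623730951 2 2
```

Dividing `hamming(u, v)` by `len(u)` gives the Hamming mean of Eq. (9.6).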
3. Triangle inequality,
If only the triangle inequality is not satisfied, the function is called a semimetric.
Similarity in clustering means that the value of 𝑆(𝑢, 𝑣) is large when 𝑢 and 𝑣 are two similar
samples; the value of 𝑆(𝑢, 𝑣) is small otherwise.
Definition 9.2.
• For a data set with $N$ data objects, we can define an $N \times N$ symmetric matrix, called a proximity matrix, whose $(i, j)$-th element represents the similarity or dissimilarity measure for the $i$-th and $j$-th objects ($i, j = 1, \dots, N$).
The median 𝑀 is the midpoint of a distribution. Half the observations are smaller than the median
and the other half are larger than the median.
The median 𝑀 is the value in the middle when the data 𝑥1 , · · · , 𝑥𝑛 of size 𝑛 is sorted in ascending
order (smallest to largest).
Indeed, since $n = 12$ is even, the middle two values of data $x^*$ are 2890 and 2920; the median $M$ is the average of these values:
$$M = \frac{2890 + 2920}{2} = 2905 = M(x^*).$$
Remark: Whenever a data set contains extreme values, the median is often preferred to the mean as the measure of central location.
$$x^* = [2710, 2755, 2850, 2880, 2880, 2890, 2920, 2940, 2950, 3050, 3130, 10000].$$
Data $x^*$ contains an extreme value (outlier), 10000, so the new sample mean is
$$\bar{x}^* = \frac{\sum_{i=1}^{n} x_i^*}{n} = 3496 \gg 2940 = \text{the old mean of data } x.$$
But the median is unchanged, reflecting the central tendency better:
$$M(x) = M(x^*) = \frac{2890 + 2920}{2} = 2905.$$
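The outlier effect can be checked directly from the data $x^*$ given above, here in a short Python sketch.

```python
# The data x* from the text: the outlier 10000 inflates the mean
# but leaves the median unchanged.
xs = [2710, 2755, 2850, 2880, 2880, 2890, 2920, 2940, 2950, 3050, 3130, 10000]

mean = sum(xs) / len(xs)
s = sorted(xs)
n = len(s)
median = (s[n // 2 - 1] + s[n // 2]) / 2   # n = 12 is even

print(round(mean, 2), median)   # → 3496.25 2905.0
```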
The median and mean are the most common measures of the center of a distribution.
If the distribution is exactly symmetric, the mean and median are exactly the same. In a skewed
distribution, the mean is farther out in the long tail than is the median.
– the relative frequency of $A$ is $n_A / n$,
So we just choose a specific value 𝐴 with greatest frequency (or greatest relative frequency) from
the histogram.
a) Percentiles provide information about how the data are spread over the interval from the
smallest value to the largest value.
The $p$th percentile, for any $0 < p < 1$, is a value $m$ such that
$$\mathrm{P}[X \le m] = p;$$
• that is, $100p$ percent of the observations are at most this value,
• and $100(1 - p)$ percent of the observations are greater than this value.
(Equivalently, percentiles are often indexed by $100p \in [0, 100]$; in practice we usually allow rational values.)
Often we divide data into four equal parts, each part contains approximately one-fourth, or 25% of
the observations. The division points are called the quartiles, and defined as:
𝑄1 = first quartile, or 25th percentile
𝑄2 = second quartile, or 50th percentile (also the median)
𝑄3 = third quartile, or 75th percentile
In R: quantile(x, p);
• The sample standard deviation $s = \sqrt{s^2}$.
The Coefficient of Variation $CV$ measures relative dispersion, i.e., it compares how large the standard deviation is relative to the mean:
$$CV = \left(\frac{\sigma}{\mu}\right) \times 100\% \quad \text{for populations}$$
and
$$CV = \left(\frac{s_x}{\bar{x}}\right) \times 100\% \quad \text{for samples } x.$$
We now consider the relationship between variables via two most important descriptive measures:
Covariance measures the co-movement of two separate distributions and Correlation. Let us start
by looking at the example below.
A positive covariance indicates that 𝑋 and 𝑌 move together in relation to their means.
Remark that
As a result,
In our example, $s_{xy} = 99/9 = 11$, indicating a positive linear relationship between the number $x$ of television commercials shown and the sales $y$ at the multimedia equipment store.
But the value of the covariance depends on the measurement units for $x$ and $y$. Is there a more precise, unit-free measure of this relationship?
$$r_{xy} = \frac{s_{xy}}{s_x s_y}$$
We get $-1 \le r_{xy} \le 1$. Moreover, if $x$ and $y$ are linearly related by the equation
$$y = a + bx,$$
then $r_{xy} = +1$ when $b > 0$ and $r_{xy} = -1$ when $b < 0$.
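The sample covariance and correlation can be sketched as follows; the small dataset is hypothetical, not the commercials data discussed above.

```python
# Sample covariance s_xy and correlation r_xy = s_xy / (s_x * s_y).
import math

x = [1, 2, 3, 4, 5]   # hypothetical data
y = [2, 4, 5, 4, 5]

def mean(v):
    return sum(v) / len(v)

def cov(x, y):
    """Sample covariance (divides by n - 1)."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def corr(x, y):
    return cov(x, y) / math.sqrt(cov(x, x) * cov(y, y))

r = corr(x, y)
print(round(cov(x, y), 3), round(r, 3))   # → 1.5 0.775
```

Unlike the covariance, the correlation is unit-free and always lies in $[-1, 1]$.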
• An input to a cluster analysis can be described as an ordered pair (𝒟, 𝑠), or (𝒟, 𝑑), where 𝒟 is a
set of objects (or their descriptions) represented with sample points
𝒟 := {𝑥(𝑖) }𝑁
𝑖=1
and 𝑠 and 𝑑 are measures for similarity or dissimilarity among points, respectively.
𝐺1 ∪ 𝐺2 ∪ . . . ∪ 𝐺𝐾 = 𝒟, and 𝐺𝑖 ∩ 𝐺𝑗 = ∅, 𝑖 ̸= 𝑗.
REMARK 9.1.
In discovery-based clustering, both the cluster and its descriptions or characterizations are
generated as a result of a clustering procedure.
• There is no clustering technique that is universally applicable in uncovering the variety of struc-
tures present in multidimensional data sets. But we can utilize three basic schemata for cluster
representation.
NOTE:
• An object (or substance) is isotropic if its physical properties have the same value when measured in different directions.
• 𝐾-means Clustering essentially illustrates the 1st scheme [Fig. 9.3(a) Centroid], and
• Hierarchical Clustering illustrates the 2nd scheme [Fig. 9.3(b) Clustering tree].
Recall the initial motivation in Section 9.2.5 and Clustering basic steps
4. Remove outliers.
7. Perform clustering.
8. Interpret results.
• The weighted Euclidean distance is used if the parameters have not been scaled or if the parameters have significantly different levels of importance.
• Goal: Partition into groups so the members of a group are as similar as possible and different
groups are as dissimilar as possible.
• Statistically, the intragroup variance should be as small as possible, and inter-group variance
should be as large as possible.
• Nonhierarchical techniques:
• Hierarchical Techniques:
5. Repeat steps 2 through 4 until all components are part of one cluster.
Dendrogram
• Purpose: Obtain clusters for any given maximum allowable intra-cluster distance.
1. In 𝐾-means clustering, you attempt to separate the data 𝒟 into 𝐾 clusters, where the number 𝐾
is determined by you.
2. The data usually has to be in the form of numeric vectors, we denote input signal 𝑥(𝑖) as an 𝑚-
dimension vector
𝑥(𝑖) = [𝑥𝑖1 , 𝑥𝑖2 , . . . , 𝑥𝑖𝑗 , . . . , 𝑥𝑖𝑚 ]𝑇 ∈ 𝒟, here each feature 𝑥𝑖𝑗 ∈ R. (9.11)
3. The method of $K$-means clustering will work as long as you have a way of computing distances between data points.
Let $\mathcal{D} := \{x^{(i)}\}_{i=1}^{N}$ be a set of multidimensional observations that is to be partitioned into a prescribed number of clusters.
The distance between $u$ and $v$ generally is just $d(u, v) = \sum_{j=1}^{m} d(u_j, v_j)$, where $d(u_j, v_j)$ is the Euclidean, Manhattan, or Hamming metric; see Section 9.3.4.
A measure of dissimilarity between every pair 𝑢, 𝑣 ∈ 𝒟 is the distance 𝑑(𝑢, 𝑣). The points 𝑢 and 𝑣
are close together if their gap 𝑑(𝑢, 𝑣) is small, and far away if 𝑑(𝑢, 𝑣) is large. When the measure
𝑑(𝑢, 𝑣) is small enough, both 𝑢 and 𝑣 are assigned to the same cluster; otherwise, they are assigned
to different clusters.
$$\sum_i (\mathbf{a}^{(i)} - \mathbf{c}) = 0 \iff \mathbf{c} = \frac{1}{n} \sum_i \mathbf{a}^{(i)}. \qquad (9.12)$$
Suppose we have already determined the clustering or the partitioning into clusters 𝐺1 , 𝐺2 , . . . , 𝐺𝐾 .
What are the best centers for the clusters?
Lemma 9.7. Let 𝐺 = {a(𝑖) }𝑛𝑖=1 = {a(1) , a(2) , . . . , a(𝑛) } be a cluster (of points, 𝑛 ≤ 𝑁 ).
The sum of the squared distances of the a(𝑖) to any point 𝑥 equals the sum of the squared distances
to the centroid c of 𝐺 plus 𝑛 times the squared distance from 𝑥 to the centroid. That is,
$$\sum_i |\mathbf{a}^{(i)} - x|^2 = \sum_i |\mathbf{a}^{(i)} - \mathbf{c}|^2 + n\,|\mathbf{c} - x|^2,$$
where $\mathbf{c} = \frac{1}{n}\sum_i \mathbf{a}^{(i)}$ is the centroid of the set of points [see Eq. (9.12)].
As a result, the centroid $\mathbf{c}$ minimizes the sum of squared distances, since the first term $\sum_i |\mathbf{a}^{(i)} - \mathbf{c}|^2$ does not depend on $x$.
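Lemma 9.7 can be checked numerically; the three 2-D points below are a hypothetical cluster and $x$ an arbitrary reference point.

```python
# Numerical check of Lemma 9.7 on a small hypothetical cluster.
points = [(1.0, 2.0), (3.0, 4.0), (5.0, 0.0)]
n = len(points)
c = tuple(sum(p[k] for p in points) / n for k in range(2))   # centroid

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

x = (0.0, 0.0)   # any reference point
lhs = sum(sq_dist(p, x) for p in points)
rhs = sum(sq_dist(p, c) for p in points) + n * sq_dist(c, x)
print(abs(lhs - rhs) < 1e-9)   # → True
```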
■ NOTATION 2.
Λ = {𝐺1 , 𝐺2 , . . . , 𝐺𝐾 }
STEPS
2. Find the centroid and intra-cluster variance for 𝑖th cluster, for 𝑖 = 1, 2, . . . , 𝑘.
3. Find the cluster with the highest variance and arbitrarily divide it into two clusters.
– Find the two components that are farthest apart, assign other components according to their
distance from these points.
– Place all components below the centroid in one cluster and all components above this hyperplane in the other.
4. Adjust the points in the two new clusters until the inter-cluster distance between the two clusters is
maximum.
Definition 9.9. A many-to-one map, called the encoder 𝐶, a kind of relationship from 𝒟 to 𝒮, is
defined as
𝑗 = 𝐶(𝑖), 𝑖 = 1, 2, . . . , 𝑁 (9.13)
assigning the 𝑖-th observation 𝑥(𝑖) to the 𝑗-th cluster 𝐺𝑗 according to a rule yet to be defined.
The following cost function (Hastie et al., 2001), for a given encoder $C$:
$$J(C) = \sum_{j=1}^{K} \sum_{C(i)=j} |x^{(i)} - \hat{\mu}_j|^2 \qquad (9.14)$$
is used to optimize the clustering process. The inner summation in this equation is
$$\hat{\sigma}_j^2 := \sum_{C(i)=j} |x^{(i)} - \hat{\mu}_j|^2 \qquad (9.15)$$
• AIM: For a prescribed $K$, the requirement is to find the encoder $C(i) = j$ for which $J(C)$ is minimized.
• IDEAS: The $K$-means starts with many different random choices for the means $\hat{\mu}_j$ for the proposed size $K$, then chooses the particular set for which $J(C)$ assumes the smallest value.³
ELUCIDATION
Starting from some initial choice of the encoder 𝐶, the algorithm goes back and forth between
these two steps 2 and 3 until there is no further change in the cluster assignments. The 𝐾-means
algorithm mathematically proceeds in two steps:
• The algorithm essentially works by first guessing at 𝐾 “centers” of proposed clusters; in Equation (9.17) we view the cluster means as these centers.
• Then Equation (9.18) says each data point is assigned to the cluster it is closest to, creating a grouping of the data, and then all centers are moved to the centroids of their newly assigned groups.
1. Initialize a set of cluster means $CM := \{\hat{\mu}_j\}_{j=1}^{K} = \{\mathbf{c}_j\}_{j=1}^{K}$.
2. For a given encoder $C$, the total cluster variance is minimized with respect to the assigned set of cluster means $CM$; that is, we minimize the score $S = J(C)$ for that given $C$:
$$\min_{CM} S = \min_{CM} J(C) = \min_{\{\hat{\mu}_j\}_{j=1}^{K}} \sum_{j=1}^{K} \sum_{C(i)=j} |x^{(i)} - \hat{\mu}_j|^2 = \min_{\{\hat{\mu}_j\}_{j=1}^{K}} \sum_{j=1}^{K} \hat{\sigma}_j^2 \qquad (9.17)$$
3. Each data point $i$ is then assigned to the cluster with the nearest mean,
$$C(i) = j_0 = \arg\min_{j} |x^{(i)} - \hat{\mu}_j|^2. \qquad (9.18)$$
Go back to step 2 with the new encoder $C$.
k.means = function(X, K, iteration = 20) {
  n = nrow(X); p = ncol(X); center = array(dim = c(K, p))
  y = sample(1:K, n, replace = TRUE)   # random initial assignment
  scores = NULL
  for (h in 1:iteration) {
    # Update step: recompute each cluster's center
    for (k in 1:K) {
      if (sum(y[] == k) == 0) center[k, ] = Inf   # empty cluster
      ## sum(y[]==k) expresses the number of i s.t. y[i]=k
      else for (j in 1:p) center[k, j] = mean(X[y[] == k, j])
    }
    # Assignment step: move each point to its nearest center
    S.total = 0
    for (i in 1:n) {
      S.min = Inf
      for (k in 1:K) {
        S = sum((X[i, ] - center[k, ])^2)
        if (S < S.min) { S.min = S; y[i] = k }
      }
      S.total = S.total + S.min
    }
    scores = c(scores, S.total)   # track the total within-cluster score
  }
  return(list(clusters = y, scores = scores))
}
♦ REMARK 1.
• The score 𝑆 does not increase with each update of steps 2 and 3 during the execution of 𝐾-means clustering.
• Because the initial centers are randomly chosen, different calls to the function will not necessarily
lead to the same result. At the very least, we would expect
the labeling of clusters to be different between the various calls. Generally, square-error parti-
tional algorithms (with 𝐾-means Clustering as a specific case) attempt to obtain a partition that
minimizes the within-cluster scatter or maximizes the between-cluster scatter.
• The result of K-means clustering depends on the randomly selected initial clusters, which means that even if K-means is applied, there is no guarantee that an optimum solution will be obtained.⁴
Let us see the algorithm in action with the realistic iris dataset; we remove the Species column to get a numerical matrix. The R function for 𝐾-means clustering, kmeans, wants numerical data. We need to specify 𝐾, the number of centers, in the parameters to kmeans(), and we choose three. We know that there are three species, so this is a natural choice.
4
These methods are nonhierarchical because all resulting clusters are groups of samples at the same level of partition. To guarantee that
an optimum solution has been obtained, one has to examine all possible partitions of the 𝑁 samples with 𝑚 dimensions into 𝐾 clusters (for a
given 𝐾), but that retrieval process is not computationally feasible.
library(ggplot2)
clusters = kmeans(newiris, centers = 3)  # newiris: iris without the Species column
CCluster = clusters$cluster
head(CCluster); table(CCluster)
newiris |>
  cbind(Cluster = CCluster) |>
  ggplot() +
  geom_bar(aes(x = SS, fill = as.factor(Cluster)),  # SS: the Species labels
           position = "dodge") +
  scale_fill_discrete("Cluster")
The function returns an object with information about the clustering. The two most interesting
pieces of information are the centers, the variable centers, and the cluster assignment, the vari-
able cluster.
• The variable centers: These are simply vectors of the same form as the input data points. They
are the center of mass for each of the three clusters we have computed.
• The variable cluster: The cluster assignment is simply an integer vector with a number for each
data point specifying which cluster that data point is assigned to.
There are 50 data points for each species so if the clustering perfectly matched the species we
should see 50 points for each cluster as well.
The clustering is not perfect, but we can try plotting the data and see how well the clustering
matches the species class.
We can first plot how many data points from each species are assigned to each cluster. We combine the iris data set with the cluster association from clusters and then make a bar plot. The position argument is "dodge", so the cluster assignments are plotted next to each other instead of stacked on top of each other.
Now let us consider how the clustering does at predicting the species more formally. This returns
us to familiar territory: we can build a confusion matrix between species and clusters.
ELUCIDATION.
• One problem here is that the clustering doesn’t know about the species, so even if there were
a one-to-one correspondence between clusters and species, the confusion matrix would only be
diagonal if the clusters and species were in the same order.
• We can associate each species to the cluster most of its members are assigned to. This isn't a perfect solution (two species could be assigned to the same cluster this way, and then we still would not be able to construct a confusion matrix), but it will work for us in the case we consider here.
We can count how many observations from each cluster are seen in each species, as in the code paragraph below.
• Since 𝐾 is a parameter that needs to be specified, how do you pick it? Here, we knew that there
were three species, so we picked 𝐾 = 3 as well.
• But when we do not know if there is any clustering in the data, to begin with, or if there is a lot,
how do we choose 𝐾? Unfortunately, there isn’t a general answer to this. There are several rules
of thumb, but no perfect solution you can always apply.
The basic principle of dimensionality reduction techniques (such as PCA) is to transform the data into a new space that summarizes the properties of the whole data set along a reduced number of dimensions. These reduced, informative dimensions are then ideal candidates for visualizing the data.
Principal Component Analysis (PCA) is a technique that transforms the original 𝑛-dimensional
data into a new 𝑛-dimensional space.
• These new dimensions are linear combinations of the original data, i.e. they are composed of
proportions of the original variables.
• Along these new dimensions, called principal components, the data expresses most of its vari-
ability along the first PC, then second, . . .
• Principal components are orthogonal to each other, i.e., uncorrelated.⁵
5
PCA is probably the oldest and best known of the techniques of multivariate analysis. Being based on the covariance matrix of the variables,
it is a second-order method. In various fields, it is also known as the singular value decomposition (SVD), the Karhunen-Loève transform, the
Hotelling transform, and the empirical orthogonal function (EOF) method.
The central idea of principal component analysis is to reduce the dimensionality of a data set in
which there are a large number of interrelated variables, while retaining as much as possible of the
variation present in the data set.
This reduction is achieved by transforming to a new set of variables, the principal components,
which are uncorrelated, and which are ordered so that the first few retain most of the variation present
in all of the original variables.
The purpose of PCA is to summarize the matrix $X$ as $Y_1, \dots, Y_m$ ($1 \le m \le p$): the smaller the $m$, the more compressed the information in $X$. Note that there exist $\lambda_1, \lambda_2, \dots$ such that
$$X^T X Y_1 = \lambda_1 Y_1, \qquad (9.19)$$
$$X^T X Y_2 = \lambda_2 Y_2, \dots \qquad (9.20)$$
they are just the nonnegative eigenvalues, and the $Y_i$ the eigenvectors, of $X^T X$ (a nonnegative definite matrix). Moreover, the $Y_i$ are mutually orthogonal. Hence, we choose the $m$ principal components $Y_1, \dots, Y_m$ with the largest ordered eigenvalues $\lambda_1 \ge \dots \ge \lambda_m \ge 0$.
• In essence, PCA seeks to reduce the dimension of the data by finding a few orthogonal linear
combinations (the PCs) of the original variables with the largest variance.
The first PC, 𝑌1 , is the linear combination with the largest variance. We have
• The 2nd PC is the linear combination with the 2nd largest variance and orthogonal to the 1st PC, and so on. There are as many PCs as there are original variables.
• For many datasets, the first several PCs explain most of the variance, so that the rest can be disregarded with minimal loss of information.⁶
Data standardization
It can be shown that the PCs are given by the 𝑝 rows of the 𝑝 × 𝑛 matrix 𝑌 , where
𝑌 = 𝑉 ⊤𝑋 ⊤. (9.24)
6
Since the variance depends on the scale of the variables, it is customary to first standardize each variable to have mean zero and standard
deviation one. After the standardization, the original variables with possibly different unit of measurement are all in comparable units.
Performing PCA using (9.24) (i.e., by initially finding the eigenvalues of the sample covariance and
then finding the corresponding eigenvectors) is already simple and computationally fast.
Enhanced Computation. However, the computation can be further enhanced by utilizing the connection between PCA and the singular value decomposition (SVD) of the mean-centered data matrix $X$, which takes the form:
𝑋 = 𝑈 𝑆𝑉 ⊤ , (9.25)
where
𝑈 ⊤ 𝑈 = ℐ𝑝 , 𝑉 𝑉 ⊤ = 𝑉 ⊤ 𝑉 = ℐ𝑝
It can be verified easily that the matrices $V$ in equations (9.24) and (9.25) are the same, and the principal component scores are given by $US$.
Evaluation of the PCA. The weighting of the PCs tells us in which directions, expressed in the original coordinates, the best variance explanation is obtained. A measure of how well the first $k$ PCs explain variation is given by the relative proportion:
$$\psi_k = \frac{\sum_{j=1}^{k} l_j}{\sum_{j=1}^{p} l_j} = \frac{\sum_{j=1}^{k} \operatorname{Var}(Y_j)}{\sum_{j=1}^{p} \operatorname{Var}(Y_j)}. \qquad (9.26)$$
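Given the ordered eigenvalues, $\psi_k$ is a one-line computation; the eigenvalue list below is hypothetical.

```python
# Proportion of variance explained, psi_k of Eq. (9.26), from a
# hypothetical list of ordered eigenvalues l_1 >= ... >= l_p.
eigenvalues = [4.2, 2.4, 0.9, 0.5]
total = sum(eigenvalues)

def psi(k):
    return sum(eigenvalues[:k]) / total

print([round(psi(k), 3) for k in range(1, len(eigenvalues) + 1)])
# → [0.525, 0.825, 0.938, 1.0]
```

Here the first two PCs already explain over 80% of the variance, so the remaining two could be disregarded with modest loss of information.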
Using the R language function eigen, given the matrix $X \in \mathbb{R}^{n \times p}$ as an input, we can construct the function pca that outputs the vector with the elements $\lambda_1 \ge \dots \ge \lambda_p$ and the matrix with the columns $Y_1, \dots, Y_p$, as follows.
Even if we do not use the function below, the function prcomp is available in R.
pca = function(x) {
  n = nrow(x); p = ncol(x); center = array(dim = p)
  # Center each column of x
  for (j in 1:p) center[j] = mean(x[, j])
  for (j in 1:p) x[, j] = x[, j] - center[j]
  sigma = t(x) %*% x
  lambda = eigen(sigma)$values
  phi = eigen(sigma)$vectors
  return(list(lambdas = lambda, vectors = phi, centers = center))
}
We use the iris dataset, removing the Species column to get a numerical matrix to give to the function.
The R function kmeans for 𝐾-means clustering needs numerical data. We need to specify 𝐾, the
number of centers in the parameters to kmeans(), and we choose three. We know that there are
three species, so this is a natural choice.
QUESTIONS
We study 𝐾-means Clustering and Analyzing with PCA (principal component analysis),
1. Explain all components of the output from the command
HINT: The function returns an object with information about the clustering. The two most inter-
esting pieces of information are the centers, the variable centers, and the cluster assignment,
the variable cluster.
in code: STEP 1: use four features only; then STEP 2: use the original iris data. ■
Students should try the code in the R STUDIO environment while answering the questions and writing the report. You might use the code in the Section on FITTING the Model with PCA and Predictive Model to support your answers to the above questions.
We aim to visually examine how the clustering result matches where the actual data points fall in,
using PCA (principal component analysis). We can do this by plotting the individual data points and
see how the classification and clustering looks.
We can map data points from the numeric features of the data to the principal components using the predict() function. This works both for the original iris data used to make the PCA and for the centers we get from the 𝐾-means clustering.
We can now plot the first two components against each other:

install.packages("tibble"); library(tibble)

mapped_iris |>
  as_tibble() |>
  cbind(SS) |>
  ggplot() + geom_point(aes(x = PC1, y = PC2, colour = SS))

n_mapped_iris |>
  as_tibble() |>
  cbind(SS, Clusters = as.factor(n_clusters$cluster)) |>
  ggplot() +
  geom_point(aes(x = PC1, y = PC2, colour = SS, shape = Clusters)) +
  geom_point(aes(x = PC1, y = PC2), size = 5, shape = "X",
             data = as_tibble(n_mapped_centers))
(a) The Euclidean distance between points. (b) The city block (Manhattan) distance.
How to compute the Minkowski distance 𝑑𝑝 (𝑢, 𝑣) between vectors 𝑢, 𝑣 ∈ R𝑚 ? Here 𝑝 is the order
of the norm of the difference.
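As a sketch of the answer, the Minkowski distance generalizes both earlier metrics: order $p = 1$ gives the Manhattan distance and $p = 2$ the Euclidean distance. A minimal Python illustration:

```python
# Minkowski distance of order p between vectors u, v in R^m.
def minkowski(u, v, p):
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1 / p)

u, v = (0.0, 0.0), (3.0, 4.0)
print(minkowski(u, v, 1), minkowski(u, v, 2))   # → 7.0 5.0
```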
5. Given the samples 𝑋1 = {1, 0}, 𝑋2 = {0, 1}, 𝑋3 = {2, 1}, and 𝑋4 = {3, 3}, suppose that the
samples are randomly clustered into two clusters 𝐺1 = {𝑋1, 𝑋3} and 𝐺2 = {𝑋2, 𝑋4}.
(b) What are the new centroids? How can you prove that the new distribution of samples is better
than the initial one?
(d) Apply the second iteration of the 𝐾-means algorithm and discuss the changes in clusters.
We extensively use random variables in both Stochastic Simulation and Machine Learning to approx-
imate the sampling distribution of data, and to propagate this to the sampling distribution of statistical
estimates and procedures.
Machine Learning is, however, more trendy nowadays in Applied Statistics and Engineering for
several reasons. All disciplines and sectors exploiting Machine Learning algorithms and methods
are data-centric fields, including
a/ memorizing facts and data, understanding rules, principles and theory in general,
b/ trying to apply what we know in upcoming problem situations with similar data patterns, to obtain good or optimal answers and solutions.
Definition 9.10 (Statistical model). The simplest statistical model is the linear model. A linear model
$$H = h(X) = \alpha + \beta X + \varepsilon \qquad (*)$$
empirically expresses a possible linear relation between an (independent) predictor variable $X$ [or several] and a (dependent) response variable $Y = H$.
• Here $\varepsilon$ is the stochastic noise (usually assumed to be normally distributed with mean $\mathbb{E}[\varepsilon] = 0$).
ℎ = E[𝐻] = 𝛼 + 𝛽 𝑥. (9.28)
Most statistical models are examples of Machine Learning . Why? We can essentially learn about
the “causality” (cause-effect) between 𝑋 and 𝑌 via (9.28). Such regression is as much machine
learning as neural networks are.
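As a sketch of "learning" the linear model (9.28) from data, the closed-form least-squares estimates of $\alpha$ and $\beta$ can be computed in a few lines; the noisy dataset below is hypothetical.

```python
# Least-squares fit of the linear model h = alpha + beta * x
# to hypothetical noisy data, via the closed-form estimates.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]   # roughly y = 2x

mx = sum(xs) / len(xs)
my = sum(ys) / len(ys)
# beta = sum((x - mx)(y - my)) / sum((x - mx)^2), alpha = my - beta * mx
beta = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
       sum((x - mx) ** 2 for x in xs)
alpha = my - beta * mx
print(round(alpha, 3), round(beta, 3))
```

The fitted slope lands near 2, recovering the relation the data was generated from; this is exactly the "learning from data" step the text refers to.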
Definition 9.11 (Artificial neural network). An artificial neural network is characterized as follows.
3. The communication links interconnecting the source and computation nodes of the graph carry no
weight; they merely provide directions of signal flow in the graph.
SUMMARY
(0) Learning Processes are achieved via at least by a statistical model or ANN.
(3) Journey of Knowledge discovery: Both statistical model and Artificial neural network model
follow the key path of going from proper knowledge representation (hidden in dataset) to effectively
learning with that data.
■ NOTATION 3.
$x^{(i)} = [x_{i1}, x_{i2}, \dots, x_{ij}, \dots, x_{im}]^T$, here each feature value $x_{ij} \in \mathbb{R}$, or $x_{ij} \in \mathcal{S}$ (9.29)
meaning all of input entries are either real number, or taken value in a discrete set 𝒮, the superscript
𝑇 denotes matrix transposition.
We treat $m$ as the number of features (or, statistically, covariates); the vector $x^{(i)}$ [referring to features, attributes] then defines a point in $\mathbb{R}^m$, the $m$-dimensional Euclidean space. In general, however, $x^{(i)}$ could be
either a complex structured object, such as an email message, a graph, a time series,
Definition 9.12. A set of training data (or sample) is a set of input-output pairs
each pair consists of an input signal (covariate) $x^{(i)}$ and the corresponding desired response variable $y^{(i)}$, which may take a discrete or a continuous value.
Figure 9.7: Input (covariate) 𝑋 in a training data set could be a complex structured or
unstructured object (Courtesy of Phuc Son Nguyen)
■ CONCEPT.
“A computer program is said to learn from experience 𝐸 w. r. t. some class of tasks 𝑇 and per-
formance measure 𝑃 , if its performance at tasks or actions 𝑎 ∈ 𝑇 , as measured by 𝑃 , improves
with experience 𝐸”.
Machine Learning (ML) is a kind of learning algorithm: it learns from data to update our understanding of reality. Machine Learning is thus the discipline of developing and applying models and algorithms for learning from data.
• To develop algorithms like that, you need a deep understanding of your problem.
• Given a model (in mathematical formula pattern) $y = f(x)$, you cannot usually figure out what the relationship is between $y$ and $x$ without looking at data.
In Definition 9.12, if response 𝑦 (𝑖) is discrete we have classification, if 𝑦 (𝑖) gets continuous value
we have regression problem.
• Briefly,
A = Supervised Learning = {classification, regression, decision trees, ...},
B = Unsupervised Learning (e.g., clustering), and
C = 𝑈 ∖ (𝐴 ∪ 𝐵).
C/ Semi-supervised Learning consists of other methods.
In supervised learning the algorithm is given inputs and told which specific outputs should be associated with them.
Mathematically, supervised learning is model creation where the model describes a relationship between the features and the target variable. The model estimates the value of the target variable $Y$ as a function $h$ (possibly a probabilistic function) of the features.
The goal of this learning is to learn a mapping or model from inputs 𝑥 to outputs 𝑦, given a
labeled set of input-output pairs
• where each 𝑥(𝑖) symbolically represents an object to be classified, most typically an 𝑚-dimensional
vector of real and/or discrete values as given in Equation 9.29;
We divide supervised learning based on whether the outputs 𝑦 are drawn from a small finite set (branch 1, named classification, CASE A1) or from the real numbers (branch 2, regression, CASE A2). Technically, the form of the output 𝑦 can in principle be anything, but two cases are most commonly assumed.
CASE A1: the response 𝑦 (𝑖) = 𝑦𝑖 is a categorical variable (with values such as male or female) from some finite set,
𝑦𝑖 ∈ {1, 2, . . . , 𝐶}.
Convention: Many textbooks use 𝑥𝑖 instead of 𝑥(𝑖) . We find that notation somewhat difficult to
manage when 𝑥(𝑖) is itself a vector and we need to talk about its elements.
CASE A2: 𝑦 (𝑖) is an ordinal or real-valued variable (such as income level). When 𝑦 (𝑖) is a single real value, that is 𝑦 (𝑖) = 𝑦𝑖 ∈ R (single response), or more generally 𝑦 (𝑖) ∈ R𝑘 , 𝑘 ≥ 1 (multiple responses), we call A2 a regression problem [studied in Section 10.3 of Chapter 10].
Practically, we would find a real-valued function ℎ(·) that transforms 𝑥 to a theoretical value ℎ = ℎ(𝑥) ∈ R. There might obviously be some noise in our observation, so ℎ does not map perfectly from 𝑥 to the actually observed output 𝑦: in general ℎ − 𝑦 ≠ 0, i.e. ℎ = 𝑦 + error.
(Footnote 7) Classification problems are a kind of supervised learning, because the desired output (or class) 𝑦𝑖 is specified for each of the training examples 𝑥(𝑖) .
ℎ𝑖 = 𝑦𝑖 + 𝜀𝑖 ⇐⇒ 𝜀𝑖 = ℎ(𝑥(𝑖) ) − 𝑦𝑖 . (9.31)
Here 𝜀𝑖 is the error in the real observation 𝑦𝑖 , so we put the error vector 𝜀 = (𝜀1 , . . . , 𝜀𝑛 ). Both ℎ(𝑥(𝑖) )
and 𝑦𝑖 are real numbers, put vector
ℎ(𝜃) := ℎ = (ℎ1 , ℎ2 , · · · , ℎ𝑛 ) ∈ R𝑛
♦ EXAMPLE 9.4 (Optimal estimator, or best fit to the data in Euclidean space).
In R𝑚 with the square norm ‖ 𝑥 ‖2 , the estimated value of 𝜃 minimizing ‖ ℎ(𝜃) − 𝑦 ‖ (called the best fit to the data) is symbolically denoted by 𝜃̂ = arg min𝜃 ‖ ℎ(𝜃) − 𝑦 ‖.
Q: What if we do not use linear regression, and the norm is not square-norm?
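The squared-norm best fit of Example 9.4 has a closed form for linear models; a minimal sketch in Python with NumPy (the data below is hypothetical, and the text's own examples use R):

```python
import numpy as np

# Hypothetical data: n = 5 observations, one feature plus an intercept column.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])  # roughly y = 1 + 2x plus noise

# Best fit in the squared norm: theta minimizes ||X @ theta - y||_2.
theta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)

# For any other choice of theta the squared error can only be larger.
err_opt = np.sum((X @ theta - y) ** 2)
err_other = np.sum((X @ np.array([1.0, 1.0]) - y) ** 2)
```

For a norm other than the square norm (the question above), there is generally no closed form, and the error would instead be minimized numerically.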
Unsupervised learning doesn’t involve learning a function from inputs to outputs based on a set of
input-output pairs. Instead, one is given a data set and generally expected to find some patterns
or structure inherent in it.
For supervised learning, we have one or more targets 𝑌 we want to predict using a set of explanatory variables 𝑋𝑗 , using a data set 𝒟𝑛 = {(𝑥(𝑖) , 𝑦 (𝑖) ) : 𝑖 = 1, 2, 3, . . . , 𝑛}.
For Unsupervised learning, we are only given a collection of data points 𝒟 = {𝑥(𝑖) }𝑛𝑖=1 , you now
want to discover what generic patterns there are in the data. This is sometimes called knowledge
discovery.
(Footnote 8) Unsupervised learning is arguably more typical of human and animal learning. It is also more widely applicable than supervised learning, since it does not require a human expert to manually label the data as above.
B0/ Density estimation: we predict conditional probabilities, [such as probability of machine fail-
ure]. Given sample 𝑥 = 𝑥(1) , 𝑥(2) , . . . , 𝑥(𝑛) drawn IID from some distribution 𝑓𝑋 (𝑥) = Prob(𝑥),
the goal is to predict the probability 𝑓𝑋 (𝑥(𝑛+1) ) of an element 𝑥(𝑛+1) drawn from the same distri-
bution.
Technically, viewing the input 𝑥(𝑖) ∈ 𝒜, a finite set, in the binary case we assume output 𝑦𝑖 ∈ {−1, 1}, and we will find a probability density 𝑓 = 𝑓𝑋 : 𝒜 −→ [0, 1] such that the value 𝑓𝑋 (𝑥(𝑛+1) ) is as close to the mass P[𝑌 = 1|𝑋 = 𝑥(𝑛+1) ] as possible.
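A minimal sketch of density estimation in Python, under the simplifying assumption of a discrete sample (the data is hypothetical): estimate 𝑓𝑋 by empirical relative frequencies and use it to score a new draw.

```python
from collections import Counter

# Hypothetical IID sample drawn from an unknown discrete distribution f_X.
sample = ["ok", "ok", "fail", "ok", "ok", "fail", "ok", "ok", "ok", "fail"]

counts = Counter(sample)
n = len(sample)

def f_hat(x):
    """Empirical estimate of f_X(x) = Prob(x) from the sample."""
    return counts.get(x, 0) / n

# Predicted probability that the (n+1)-th draw equals "fail".
p_fail = f_hat("fail")
```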
B1/ Dimensionality reduction is a method used when you have high-dimensional data and want to map it down into fewer dimensions. The purpose is usually to visualize the data in order to spot patterns from plots.
B2/ Clustering methods seek to find similarities between data points and group data according
to these similarities.
Clustering means grouping data into clusters that “belong” together - objects within a cluster
are more similar to each other than to those in other clusters.
Technically, viewing the input 𝑥(𝑖) ∈ 𝒜, a finite set of 𝑛 points, we assume output 𝑦𝑖 ∈ {1, 2, . . . , 𝐶} = 𝒮, a set of 𝐶 clusters, and we will find a function 𝑔 : 𝒜 −→ 𝒮 such that the likelihood value P[𝑔(𝑥(𝑛+1) ) = 𝑘 ∈ 𝒮] is as high as possible.
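The clustering idea in B2 can be sketched with a plain k-means loop in Python; the points and the deterministic initialization below are illustrative assumptions, not a method prescribed by the text:

```python
import math

def kmeans(points, k, iters=20):
    """Plain k-means on 2-D points: repeatedly assign each point to its
    nearest centroid, then recompute each centroid as the cluster mean."""
    centroids = points[:k]  # deterministic init: the first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical data: two visually separated groups of 3 points each.
pts = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 5.0), (5.1, 5.2)]
cents, cls = kmeans(pts, k=2)
```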
B3/ Association Rules search for patterns in your data by picking out subsets of the data, 𝑋 and 𝑌 , based on predicates on the input variables, and evaluate a logical implication of the form 𝑋 =⇒ 𝑌, called a rule. The algorithm evaluates all rules to figure out how good each rule is.
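A minimal sketch of evaluating one rule 𝑋 =⇒ 𝑌 in Python, using hypothetical market-basket transactions; support and confidence are the usual measures of how good a rule is:

```python
# Hypothetical transaction data; each transaction is a set of items.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y):
    """How good the rule X => Y is: P(Y | X) estimated from the data."""
    return support(X | Y) / support(X)

sup = support({"bread", "butter"})       # support of the rule bread => butter
conf = confidence({"bread"}, {"butter"})
```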
SUMMARY
[Learning Processes (LP) and The Essence of Unsupervised Learning]
• Supervised Learning essentially includes the classification problem [if the response variables are discrete] and the regression problem [if the response variables are continuous, with real values].
• Unsupervised Learning is a much less well-defined problem (compared with Supervised Learning), since we are not told what kinds of patterns to look for, and there is no obvious error metric to use (unlike supervised learning, where we can compare our prediction of 𝑦 for a given 𝑥 to the observed value). ■
Supervised learning is used when you have variables you want to predict using other variables.
We often formalize the problem using function approximation.
We assume 𝑦 = ℎ(𝑥) for some unknown function ℎ; the goal of learning is to estimate the function ℎ given a labeled training set {(𝑥𝑖 , 𝑦𝑖 )}, and then to make predictions using 𝑦̂ = ℎ̂(𝑥).
The goal of Supervised learning is to learn a mapping from inputs 𝑋 to outputs 𝑌 , given a labeled data set 𝒟𝑛 = {(𝑥(𝑖) , 𝑦𝑖 )}𝑛𝑖=1 . Here 𝒟𝑛 is called the training set; the pairs (𝑥(𝑖) , 𝑦𝑖 ) are named “examples,” “instances with labels,” or “observations”; and 𝑛 is the number of training cases.
• The learning process consists of choosing parameters 𝜃 such that we minimize the errors,
that is, such that ℎ(𝑥(𝑖) ; 𝜃) = ℎ𝑖 (𝜃) = ℎ𝑖 is as close to 𝑦𝑖 as we can get.
Learning algorithm
Let ℋ be a set of all classifiers/learning models ℎ = ℎ(𝑥; 𝜃). A learning algorithm is a procedure 𝑃 that takes a data set 𝒟𝑛 as input and returns an element ℎ ∈ ℋ; schematically, 𝒟𝑛 −→ 𝑃 (𝒟𝑛 , ℋ) −→ ℎ. We next discuss the following.
[ Then in Chapter ?? we continue with specific Classification methods: Decision Trees and Ran-
dom Forests, with R illustrated example.]
Generic Classification
In this type of task, the computer program is asked to specify which of 𝐶 categories some input belongs to. The most common cases are Binary Classification (𝐶 = 2) and Ternary Classification (𝐶 = 3). To solve this task, the learning algorithm is usually asked to produce a function
ℎ : 𝒳 ⊂ R𝑚 −→ 𝒮 = {1, . . . , 𝐶}, 𝐶 ≥ 2.
• Here 𝒳 = {𝑥(𝑖) } and 𝑦 = ℎ(𝑥) says that the model ℎ assigns an input described by vector 𝑥 to a
category identified by numeric code 𝑦.
• When 𝐶 = 2, we might rewrite 𝒮 = {−1, +1}, or also 𝒮 = {0, 1}. The map ℎ is named a binary classifier.
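A minimal binary classifier ℎ : R𝑚 → {−1, +1} sketched in Python; the linear score with hypothetical weights is just one possible choice of ℎ, normally learned from data:

```python
def binary_classifier(x, w, b):
    """Assign input vector x to class +1 or -1 by the sign of a linear
    score. The weights w and intercept b are hypothetical here."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

w, b = [1.0, -2.0], 0.5
label_pos = binary_classifier([3.0, 1.0], w, b)  # score = 3 - 2 + 0.5 = 1.5
label_neg = binary_classifier([0.0, 1.0], w, b)  # score = -2 + 0.5 = -1.5
```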
NOTE: A popular classification task is object recognition or pattern recognition, where the input is an image (usually described as a set of pixel brightness values), and the output is a numeric code identifying the object in the image.
Source: http://www.statlab.uni-heidelberg.de/data/iris/ .
Courtesy Dennis Kramb and SIGNA
Motivation- Data: The learning goal is to distinguish 3 different kinds of iris flower.
Approach- Methods: Rather than working directly with images, a botanist has already extracted 4 useful features or characteristics: sepal length, sepal width, petal length and petal width. Such feature extraction is an important but difficult task.
Output- Analysis- Conclusion: It is always a good idea to perform exploratory data analysis, such
as plotting the data, before applying a machine learning method.
Normed space - Let us explicitly define the normed space 𝑆 := R𝑚 with the usual inner product. The space R𝑚 , shortly, is equipped with the inner product
⟨𝑥, 𝑦⟩ = ∑𝑚𝑖=1 𝑥𝑖 𝑦𝑖 ,
then the square norm or Euclidean distance is determined by its properties, assuming 𝑥, 𝑦, 𝑧 ∈ 𝑆, as follows.
* In the space R𝑚 , at property (iii), the notation ‖ · ‖ is called the norm/distance; it is induced by the inner product via ‖𝑥‖ = ⟨𝑥, 𝑥⟩1/2 .
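The inner product and its induced norm can be sketched directly in Python:

```python
import math

def inner(x, y):
    """Usual inner product <x, y> = sum_i x_i * y_i on R^m."""
    return sum(xi * yi for xi, yi in zip(x, y))

def norm(x):
    """Euclidean norm induced by the inner product: ||x|| = sqrt(<x, x>)."""
    return math.sqrt(inner(x, x))

x, y = [3.0, 4.0], [1.0, 2.0]
d = norm([xi - yi for xi, yi in zip(x, y)])  # Euclidean distance ||x - y||
```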
Generic Regression
In this type of task, the computer program is asked to predict a numerical value given some input.
To solve this task, the learning algorithm is asked to output a function
𝑔 : R𝑚 −→ R.
This type of task is similar to classification, except that the format of output is different.
Statistically, we obtain input pairs (𝑥(𝑖) , 𝑦𝑖 ) ∈ 𝒟𝑛 , a finite set of 𝑛 points, assuming output 𝑦𝑖 ∈ R, the set of reals. We will find a function 𝑔 : 𝒜 = {𝑥(𝑖) } −→ R such that the total error ∑𝑖 ‖ 𝑔(𝑥(𝑖) ) − 𝑦𝑖 ‖ is as small as possible.
Here with some input variables 𝑥 you want a good approximation function 𝑔(.) that predicts output
(or response) variables 𝑌 , as close to observed response 𝑦𝑖 as possible. The simplest model 𝑔() is
linear in one single input variable 𝑥 (𝑚 = 1 predictor), defined as
𝑌 = 𝑔(𝑥) + 𝜀 = 𝜃0 + 𝜃1 𝑥 + 𝜀, with 𝜃0 , 𝜃1 ∈ R.
This Linear Regression model predicts the mean of output (or response) variables 𝑦 from input 𝑥.
We observed sample data with speed (the 𝑥 value) and braking distance (the 𝑦 value) and want to fit this data to see the relationship.
In R we need the function lm() and a few libraries, particularly ggplot2.
We see that there is a very clear linear relationship between speed and distance.
[Figure: scatter plot of braking distance (dist, 0-125) against speed (5-25), showing a clear linear trend]
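The text fits this model in R with lm(); an equivalent closed-form least-squares fit can be sketched in Python (the speed/distance sample below is hypothetical, not the data set plotted above):

```python
# Hypothetical (speed, braking distance) observations.
data = [(5, 10), (10, 22), (15, 38), (20, 55), (25, 70)]

n = len(data)
mx = sum(x for x, _ in data) / n
my = sum(y for _, y in data) / n

# Least-squares slope and intercept for: dist = b0 + b1 * speed.
b1 = sum((x - mx) * (y - my) for x, y in data) / sum((x - mx) ** 2 for x, _ in data)
b0 = my - b1 * mx

predicted_at_12 = b0 + b1 * 12  # predicted braking distance at speed 12
```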
We have looked at classical statistical regression (linear models) and classification, but there are many other machine learning algorithms for both; they are available as R packages.
FURTHER READING
See also Performance Analytics III about Hierarchical Clustering in Section ?? of CHAPTER ??.
PROBLEM
• Consider a system of two servers where customers from outside the system arrive at server 1 at a
Poisson rate 8 and at server 2 at a Poisson rate 12.
• The service rates of server 1 and server 2 are respectively 16 and 24.
whereas a departure from server 2 will go 25 percent of the time to server 1, and will depart the system otherwise (i.e., 𝑃2,1 = 1/4, 𝑃2,2 = 0).
3. Compute the average number of customers in the system 𝐿, and the mean E[𝑅].
——————————
DATA - Assume that PCs enter Q via system I with arrival rate 𝜆 = 20 PCs/ hour,
[Figure: two sequential queues with one-way flow (e.g., as appear in the industrial sector); jobs leave the system when all services are done]
server S2 of system II can package products with rate 𝜇2 = 60 PCs per hour.
QUESTIONS
b) Find the average number of PCs in the whole system Q = (I, II).
——————————
REMINDER:
• The Jackson networks typically represent open queueing networks with a single class of jobs.
The network is analyzed by considering each of the individual stations one by one.
• Therefore, the main technique would be to decompose the queueing network into individual queues
or stations and develop characteristics of arrival processes for each individual station. The basic
assumptions of these networks are still Poisson arrivals and exponential service times.
A Jackson network is a queueing network with a special structure that consists of 𝑘 ≥ 2 connected service stations (called nodes), hence 𝑘 queues, but the stations operate independently. Briefly, node 𝑖 (i.e. the 𝑖th service station) of the network can be considered as an independent M/M/𝑚𝑖 system with 𝑚𝑖 servers, arrival rate 𝛾𝑖 , and service rate 𝜇𝑖 .
[Figure: a Jackson network of three stations; external arrivals enter stations 1, 2 and 3; station 1 has 1 server, station 2 has 2 servers, station 3 has 1 server; jobs leave the system when all services are done]
A SCENARIO IN SERVICE:
Let us now consider a Jackson network of 𝑘 = 3 service stations in an airport where customers
can arrive in any node of the network. Assume that:
• Arrivals at node 𝑖 (for 𝑖 = 1, 2, 3) follow Poisson processes with respectively associated arrival rates
𝛾1 = 4, 𝛾2 = 2, and 𝛾3 = 1.
• The service rates at those nodes are (𝜇1 , 𝜇2 , 𝜇3 ) = (3, 6, 12) respectively.
QUESTIONS
b) Write down the switching probability matrix P = [𝑝𝑖,𝑗 ], where the probabilities are
𝑝1,2 = 0.55, 𝑝1,3 = 0.3; 𝑝2,1 = 0.3, 𝑝2,3 = 0.6; 𝑝3,1 = 0.05, 𝑝3,2 = 0.45,
c) The vector 𝜆 of total input rates into nodes is calculated by the matrix equation
𝜆 = 𝛾[I −P]−1 ,
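This matrix equation can be checked numerically in Python with NumPy, using the rates given above (𝛾 = (4, 2, 1) and the switching matrix from part b); the computed vector satisfies the traffic equations 𝜆 = 𝛾 + 𝜆P:

```python
import numpy as np

gamma = np.array([4.0, 2.0, 1.0])          # external Poisson arrival rates
P = np.array([[0.00, 0.55, 0.30],          # switching probabilities p_{i,j}
              [0.30, 0.00, 0.60],
              [0.05, 0.45, 0.00]])

# Total input rates: lambda = gamma [I - P]^{-1} (row-vector convention).
lam = gamma @ np.linalg.inv(np.eye(3) - P)
```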
Introduction
The starting point of this chapter is the use of Experimental Design terminology and philosophy, in which the input parameters and structural assumptions composing a model are called factors, and the output performance measures are called responses.
The decision as to which parameters and structural assumptions are considered fixed aspects of
a model and which are experimental factors depends on the goals of the study rather than on the
inherent form of the model.
5. The full 2𝑚 factorial design in 𝑚 binary factors for SPE in Section 10.5
Recall the ten key steps in a System Performance Evaluation (SPE) project introduced in Section ??, in which steps 3, 4, 5, then steps 8 and 9, are viewed as most useful for obtaining the best response or an optimum system, given a set of factors.
4. List Parameters
7. Select Workload
The three steps 3, 4 and 5 are actually meaningful for utilizing Factorial Designs in SPE, and are fully explored from Section 10.2.1 onward from the data-centric view. Step 8 will be discussed in Section 10.4 with terminology defined prior in Section 10.2.2, and we perform in Section 10.6 the Analyze and Interpret Data Output step via the experimental-error analysis from regression.
10.1 Performance Metrics to Factorial Designs
We first look at Steps 3, 4 and 5 of the whole SPE process. Step 9 is discussed in Section 10.6. The more complex stochastic analysis of output, from both looking at structural components of a system and running its simulation models, is postponed until Chapters 8 and ??.
We would use a systematic way to select right metrics, called path to metrics.
♣ OBSERVATION.
(1) We know that Step 2 lists Services and Outcomes, upon the basic awareness
• a/ that a system always provides a set of services, and each service has its own outcomes;
• b/ that all possible outcomes of a service, moreover, must be listed.
(2) Selecting the right metrics is extremely important in SPE; for instance, the next diagram shows the path from System to various metrics like time, rate, probability . . .
[Diagram: path to metrics, with 3 outcome categories.] This is helpful to determine what data needs to be collected before or during the analysis.
1. Via the above diagram the causality from Step 2 to 3 can be used for various systems:
2. On Aspects of a metric
• Mean and Variability: both need to be considered. If we denote a metric by a random variable 𝑋, then its mean and variability are denoted respectively by E[𝑋] and V[𝑋].
We should list all possible parameters that affect performance. Few parameter types are
i) the arrival process is Poisson [the inter-arrival times between events are exponential],
ii) the service process is exponential, and iii) with a single server.
• Parameters which have a high impact on performance are preferably selected as factors.
• Gradually extend factor list and per-factor levels. Note further that
• Non-factor parameters must be fixed (and also feasible and low impact).
♦ EXAMPLE 10.2.
10.2 Factorial Designs for Performance Evaluation
WHY DOE? Experimental design (DOE) is the best known way of learning about relationships between the variables of a system. DOE has the goal: obtain the maximum information (about a system) or the optimum value of a key target factor (named the response variable) with the minimum number of experiments (treatment combinations). ■
Experimental design (DoE) in SPE is unique among statistical methodologies in that it is completely proactive. The key point is that rather than analyzing existing data we carefully plan which specific data would be most helpful in addressing the issue or problem of the process or system.
• The effect of variation from sources outside our scope [as ...]
can be minimized by planning the data-collection process ahead of time.
and so (2) we increase the speed with which we understand the relationship between the variables of interest.
HOW TO? Based on list of factors and their levels, we decide a sequence of experiments that
♦ EXAMPLE 10.3.
The factors below are considered to compare remote pipes and remote procedure calls (RPC).
AIM: Find optimal process settings that produce the best results at lowest cost.
1. Determine and quantify the factors that have the biggest impact on the output
2. Identify factors that do not have a big impact on quality or on time (and therefore can be set
at the most convenient and/or least costly levels)
3. Screen a large number of factors quickly to determine the most important ones
4. Reduce the time and number of experiments needed to test multiple factors by using
Fractional Factorial DOE - FFD, see next part 10.2.2.
• Factor 𝐶- CPU: Intel Core I3@1.9GHz, Intel Core I7@2.67GHz, or AMD Athlon IIX4 605e.
Key Terminologies
Factor- either controlled or uncontrolled input variable, also called predictor in our model; factors
classified into primary, secondary...
Fractional Factorial DOE- Looks at only a fraction of all the possible combinations contained in a full factorial. If many factors are being investigated, information can be obtained with a smaller investment; see Chapter 11, about Fractional Factorial Designs.
Effect- The change in the response variable that occurs as experimental conditions change
Interaction- co-relationship between the defined factors, occurs when the effect of one factor on
the response 𝑌 depends on the setting of another factor
Run or Treatment combination- A single setup in a DOE from which data is gathered. For example, a 3-factor full factorial DOE with each factor at 2 levels has 2³ = 8 runs.
Replication- experiment repetition, precisely replicating (duplicating) the entire experiment in a time
sequence with different setups between each run.
Design: a collection 𝐷 of tuples, each tuple said to be an experiment, comprised of all factors. Therefore 𝐷 = {𝑒𝑥𝑝𝑒𝑟𝑖𝑚𝑒𝑛𝑡 1, · · · , 𝑒𝑥𝑝𝑒𝑟𝑖𝑚𝑒𝑛𝑡 𝑁 }, and so |𝐷| is the number of experiments (runs, or treatment combinations, i.e. one level combination over all factors), allowing some number of replications of each experiment.
Simple Designs
A Full Factorial of 𝑑 factors utilizing every possible combination at all levels of all 𝑑 factors gives the
total number of runs
𝑁 = 𝑛1 · 𝑛2 · · · 𝑛𝑑 = ∏𝑑𝑖=1 𝑛𝑖 .
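The run count 𝑁 and the design itself can be generated mechanically; a sketch in Python (the factors and levels below are hypothetical):

```python
from itertools import product
from math import prod

# Hypothetical factors with their levels (n1 = 2, n2 = 3, n3 = 2).
factors = {
    "CPU":    ["i3", "i7"],
    "Memory": ["4GB", "8GB", "16GB"],
    "Disk":   ["HDD", "SSD"],
}

# Full factorial: every combination of all levels of all factors.
design = list(product(*factors.values()))
N = prod(len(levels) for levels in factors.values())  # N = n1 * n2 * n3
```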
Briefly, the most popular FFDs include the full binary 2𝑑 , the ternary 3𝑚 , the mixed regular designs 2𝑑 × 3𝑚 , and the fractional designs 2𝑑−𝑝 of resolution 𝑅, . . . ; see a summary in Section 10.2.3.
Consider a specific 2³ experiment for studying the relationship between diet scheme and blood pressure. We conducted experiments to assess the effects of diet on blood pressure 𝑌 in (say, American) males. Three factors are to be measured:
E.g. experiment
𝑟1 = (1) = (low fruits, low fat, low dairy product ), (bottom left),
𝑟7 = (𝑎, 𝑏, 𝑐) = (high fruits, high fat, high dairy product ) (top, counterclockwise) ■
High cost is essentially avoided by reducing the total number 𝑛 of runs, by:
2. Reduce number of levels for each factor, hence applying 2𝑚 design first, then more levels added
per factor.
3. Use fractional factorial designs (fractions), to be discussed fully from Section 10.5.
Reusing Example 10.4, suppose we no longer consider factor 𝑁 (no. of SSD disks, with 4 choices); we then obtain a new 3⁴ full design 𝒟1 = 𝐶 × 𝑀 × 𝑊 × 𝑈 in 4 factors, with 3⁴ = 81 experiments in total. But we can apply a 3⁴⁻² fractional factorial design as in the table below.
• Each of the four factors is used three times at each of its three levels. Such a fractional design allows us to save time and expense, but we get less information (we may not get all interactions on the responses 𝑌 , like production cost).
• The good side is: if some interactions are negligible, then this is not a problem.
Note that the number of experiments here must be
𝑛 ≥ 1 + ∑𝑑𝑖=1 (𝑛𝑖 − 1) = 1 + 8 = 9,
From the last two examples, we might exploit a naive method with 2 steps
• Start with a configuration 𝑟𝑢𝑛 =< 𝑖3, 8𝐺𝐵, 1𝑑𝑖𝑠𝑘, managerial, college >.
• The (full) factorial design (FD) with respect to these factors is the Cartesian product
𝒟 = 𝑄1 × 𝑄2 × . . . × 𝑄𝑑 = 𝑋1 × 𝑋2 × . . . × 𝑋𝑑 .
𝑓 : 𝒟 → R,
♦ Determine main effects that the manipulated factors will have on response 𝑌
♦ Determine effects that factor interactions will have on 𝑌 or many response variables.
2. Advantages
– Provides information about all main effects, and all interactions as well.
3. The most popular one is the 2-level design 2𝑑 , where 𝑑 is the number of factors to be investigated and 2𝑑 = #Runs.
When a firm’s budget is limited, practically the firm’s manager must accept using a subset 𝐹 of 𝐷 when investigating properties of a new product.
• Look at only selected subsets of the possible combinations in a FD, as Example 10.7
• Advantages: Allows you to screen many factors (separating significant from non-significant factors) with a smaller investment in research time and costs, using
2𝑑−𝑝 (resolution 𝑅) = #Runs.
(Footnote 3) 𝐹 ⊆ 𝐷 = 𝐶𝑜 × 𝑆ℎ × 𝑊 𝑒𝑖 × 𝑀 𝑎 × . . . × 𝑃 𝑙𝑎𝑐𝑒 of a 2¹¹ full design, in which each of these features can take on only two possible values. Hence the full factorial requires 2¹¹ = 2048 experiments. This design 𝐹, however, is a fractional factorial design with only 12 runs, of strength 2 (resolution III), so it allows us to separate all main effects on the response 𝑌 = 𝛽0 + 𝛽1 𝐶𝑜 + 𝛽2 𝑆ℎ + . . . + 𝛽11 𝑃 𝑙𝑎𝑐𝑒. ■
In EXAMPLE 10.7 we began with 𝑑 = 11, but the design 𝐹 has 12 runs; the equation 2𝑑−𝑝 = 12 has no solution 𝑝, so 𝐹 does not belong to the class of 2-level FFDs with pattern 2𝑑−𝑝 for any resolution 𝑅. In fact 𝐹 does belong to another class, named Orthogonal Arrays!
The naive method for FFD, for both regular designs (like the class 2𝑑−𝑝 of resolution 𝑅) and non-regular ones (like Orthogonal Arrays), however, must fulfill the Rao bound (Lemma ??). In Example 10.7 we see clearly that 𝑛 ≥ 1 + ∑11𝑖=1 (2 − 1) = 12, so the design is optimal in the sense that it allows correct estimation of the Input/Output (I/O) function that is implicitly determined by the underlying simulation model.
Binary designs and key properties: Good candidates of screening designs are the classic two-
level factorial designs that will be discussed in Section 10.5. These binary designs clearly require
the simulation of at least 𝑛 = 𝑑 + 1 factor combinations where 𝑑 denotes the number of factors in
the experiment. In such a design, each factor has two values or levels; these levels may denote quantitative or qualitative values.
To study a few interactions plus all main factor effects of the chosen design we could use strength
𝑡 ≥ 3 or resolution 𝑅 ≥ 𝑡 + 1 = 4 fractional designs.
When 𝑡 = 3 (the smallest odd natural number above 1), the total number of intercept, main effect and two-factor interaction parameters of a generic orthogonal array 𝐹 = OA(𝑁 ; 𝑟1 , 𝑟2 , · · · , 𝑟𝑑 ; 𝑡) is
1 + ∑𝑑𝑖=1 (𝑟𝑖 − 1) + ∑1≤𝑖<𝑗≤𝑑 (𝑟𝑖 − 1)(𝑟𝑗 − 1),
where 𝑟𝑖 is the number of levels of factor 𝑖.
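This parameter count can be computed mechanically; a sketch in Python (the level counts used in the checks are hypothetical):

```python
from itertools import combinations

def n_parameters(r):
    """Intercept + main effects + two-factor interaction parameters
    for factors with level counts r_1, ..., r_d (the strength t = 3 case)."""
    main = sum(ri - 1 for ri in r)
    two_way = sum((ri - 1) * (rj - 1) for ri, rj in combinations(r, 2))
    return 1 + main + two_way

count_mixed = n_parameters([2, 3, 3])   # hypothetical mixed-level design
count_binary = n_parameters([2] * 11)   # 11 binary factors
```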
♣ QUESTION.
Shall we get the same conclusion for the 3⁴ full design 𝒟1 = 𝐶 × 𝑀 × 𝑊 × 𝑈 in 4 factors in Example 10.6, that a new number of design points 𝑁 satisfies
𝑁 = 𝑛 ≥ 1 + ∑𝑑𝑖=1 (𝑛𝑖 − 1) = 1 + ∑4𝑖=1 (3 − 1) = 9?
ANSWER:
𝑌 = 𝑎0 + 𝑎1 𝐶 + 𝑎2 𝐶 2 + 𝑚1 𝑀 + 𝑚2 𝑀 2 + · · · + 𝑢1 𝑈 + 𝑢2 𝑈 2 + 𝑡1 + · · · + 𝑡𝐾
• perform the statistical tests and confidence procedures that are analogous to those for simple
linear regression, and check for model adequacy. All will be used in Section 10.5.
10.3 Multiple Linear Regression (MLR)
We practically generalize the simple linear regression to cases where the variability of a variable 𝑌 of interest can be explained, to a large extent, by the linear relationship between 𝑌 and the 𝑚 = 𝑝 − 1 predicting or explanatory variables 𝑋1 , 𝑋2 , · · · , 𝑋𝑝−1 . Applications of the special Multiple
Linear Regression (MLR) analysis can be found in all areas of R & D. MLR, for instance, plays a
meaningful role in the statistical planning and control of
Model the influence of advertising time 𝑥 on the number of positive reactions 𝑦 from the public.
We have a single linear regression model, described by
𝑌 = 𝛽0 + 𝛽1 𝑋 + 𝜀.
𝑌^ = 𝜇𝑌 = E[𝑌 |𝑋 = 𝑥] = 𝛽0 + 𝛽1 𝑥.
Here 𝑚 := 𝑝 − 1 = 1, 𝑌 holds the number of positive reactions caused by the amount of advertising
time 𝑥, then the number of observations 𝑛 ≥ 2. ■
(Footnote 4) Multiple regression analysis (MRA) in general (possibly nonlinear) is an important statistical tool for exploring the relationship between the response 𝑌 and the set of predictors 𝑋𝑖 .
10.3.1 Setting
𝑌 = 𝑓 (𝑋1 , 𝑋2 , · · · , 𝑋𝑝−1 ) + 𝜀 = 𝛽0 + ∑𝑝−1𝑗=1 𝛽𝑗 𝑋𝑗 + 𝜀
1. If 𝑌̂ is a linear function of the coefficients 𝛽𝑖 (being estimated from data), then it may serve as a suitable approximation to several nonlinear functional relationships.
3. Thirdly, simplicity is best: many relationships in reality just need a linear function of predictors
𝑋1 , 𝑋2 , · · · , 𝑋𝑝−1 to describe the context. The linear models would guarantee the inclusion of
important variables, and the exclusion of unimportant variables.
As part of a recent study titled “Predicting Success for Actuarial Students in Undergraduate
Mathematics Courses,” data from 106 Mahidol Uni. actuarial graduates were obtained. The
researchers were interested in describing how students’ overall math grade point averages
(GPA) are explained by SAT Math and SAT Verbal scores, class rank, and faculty of science’s
If the change in the mean 𝑦 value associated with a 1-unit increase in one independent variable
depends on the value of a second independent variable, there is interaction between these two
variables. Denoting the two independent variables by 𝑋1 , 𝑋2 ,
• When 𝑋1 and 𝑋2 do interact, this model will usually give a much better fit to resulting data than
would the no-interaction model.
• Failure to consider a model with interaction too often leads an investigator to conclude incorrectly
that the relationship between 𝑌 and a set of independent variables is not very substantial. In
application, quadratic predictors 𝑋12 and 𝑋22 are often included to model a curved relationship. This
leads to the full quadratic or complete second-order model
𝑦 = E[𝑌 |𝑋1 = 𝑥1 , 𝑋2 = 𝑥2 ] = 𝛽0 + 𝛽1 𝑥1 + 𝛽2 𝑥2 + 𝛽3 𝑥1 𝑥2 + 𝛽4 𝑥1² + 𝛽5 𝑥2². (10.3)
Suppose that an industrial chemist is interested in studying how the product yield (𝑌 ) of a polymer is influenced by two independent variables or predictors 𝑋1 , 𝑋2 , and possibly their interaction. Here 𝑋1 = reaction temperature and
𝑌 = 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + 𝛽12 𝑋1 𝑋2 + 𝜀 (10.4)
Generally, interaction implies that the effect produced by changing one variable (𝑥1 , say) depends
on the level of the other variable (𝑥2 ). This figure shows that changing 𝑥1 from 2 to 8 produces a
much smaller change in E[𝑌 ] when 𝑥2 = 2 than when 𝑥2 = 10.
Notice further that, although these models are all linear regression models, the shape of the surface
that is generated by the model is not linear.
• (X = [𝑥𝑖𝑗 ]) is called the observed matrix of predictors (predictor matrix), 𝑥𝑖𝑗 is the value of the 𝑗-th
predictor 𝑋𝑗 at the 𝑖-th observation (𝑖 = 1, 2, . . . , 𝑛 and 𝑗 = 1, 2, . . . , 𝑘).
where 𝛽0 , 𝛽1 , 𝛽2 , · · · , 𝛽𝑘 are the linear regression coefficients, and 𝑒𝑖 are random errors, 𝑒𝑖 ∼ N(0, 𝜎 2 ),
i.e. they are normally distributed with mean 0 and standard deviation 𝜎.
b = 𝛽̂ = (X′ X)−1 X′ y. (10.6)
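Equation (10.6) can be sketched directly in Python with NumPy (the small design matrix and response below are hypothetical):

```python
import numpy as np

# Hypothetical data: n = 4 observations, one predictor plus intercept column.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([2.1, 3.9, 6.2, 7.8])

# b = beta_hat = (X'X)^{-1} X'y, the least-squares estimate of Eq. (10.6).
b = np.linalg.inv(X.T @ X) @ X.T @ y

y_hat = X @ b          # fitted values
residuals = y - y_hat  # estimation errors
```

The check X′(𝑦 − 𝑦̂) = 0 below is the normal-equations property of the least-squares solution.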
Use the dataset 𝒟 = (𝑦, X) = (𝑦 1 𝑥(1) · · · 𝑥(𝑘) ) with 𝑛 observations on 𝑘 predictors, where 𝑦 is the response vector; we fit the multiple regression model
𝑦̂ = E[𝑌 ] = X 𝛽̂ = X 𝑏,
where
𝑦̂ = (𝑦̂1 , . . . , 𝑦̂𝑛 )𝑇 and ȳ = (ȳ, . . . , ȳ)𝑇
are respectively the vector of fitted values and the vector of identical response means.
We then write the total sum of squares, measuring the total variation of responses, as
𝑆𝑆𝑇 := 𝑆𝑦𝑦 = ∑𝑛𝑖=1 (𝑦𝑖 − ȳ)2 = (𝑦 − ȳ)𝑇 (𝑦 − ȳ). (10.7)
• The first sum [with df 𝑇 = (𝑛 − 1) degrees of freedom] can be split into 𝑆𝑆𝑇 = 𝑆𝑆𝑅 + 𝑆𝑆𝐸, with
𝑆𝑆𝑅 = ∑𝑛𝑖=1 (𝑦̂𝑖 − ȳ)2 = (𝑦̂ − ȳ)𝑇 (𝑦̂ − ȳ), (10.8)
and 𝑆𝑆𝐸 = 𝑆𝑆𝑇 − 𝑆𝑆𝑅 = ∑𝑛𝑖=1 (𝑦𝑖 − 𝑦̂𝑖 )2 = (𝑦 − 𝑦̂)𝑇 (𝑦 − 𝑦̂).
• 𝑆𝑆𝑅- the regression sum of squares, measures the response’s variation being explained by the
regression model,
• 𝑆𝑆𝐸- the error sum of squares, describes the variation attributed to randomness; this is the quantity that we minimized when we applied the method of least squares.
𝑅2 = 𝑆𝑆𝑅/𝑆𝑆𝑇 (10.9)
measures the amount of variability in the data explained or accounted for by the regression model that is built up by all predictors.
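The decomposition 𝑆𝑆𝑇 = 𝑆𝑆𝑅 + 𝑆𝑆𝐸 and the statistic 𝑅2 can be sketched in Python for a simple least-squares fit (the data below is hypothetical; the identity holds exactly for least-squares fits with an intercept):

```python
# Hypothetical data; fit y = b0 + b1*x by least squares, then decompose.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.1, 5.9, 8.0]

n = len(y)
x_bar = sum(x) / n
y_bar = sum(y) / n
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
     / sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

SST = sum((yi - y_bar) ** 2 for yi in y)               # total variation (10.7)
SSR = sum((fi - y_bar) ** 2 for fi in y_hat)           # explained part (10.8)
SSE = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained part
R2 = SSR / SST                                         # Eq. (10.9)
```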
• When we add new predictors to our model, we explain additional portions of 𝑆𝑆𝑇 ; therefore,
𝑅2 can only go up. Thus, we should expect to increase 𝑅2 and generally, get a better fit by
going from uni-variate to multivariate regression.
This concept is extremely useful in practice (as in Transportation Science and SPE), and also for multiple regression in later chapters. The following notable properties of 𝑅2 , however, should be used with caution.
1. The statistic 𝑅2 should be used with caution because it is always possible to make 𝑅2 unity by
simply adding enough terms to the model. In general, 𝑅2 will increase if we add a variable to the
model, but this does not necessarily imply that the new model is superior to the old one.
a) In general, 𝑅2 does not measure the magnitude of the slope of the regression line.
c) Furthermore, 𝑅2 does not measure the appropriateness of the model because it can be artifi-
cially inflated by adding higher order polynomial terms in 𝑥 to the model.
df 𝐸 = df 𝑇 − df 𝑅 = 𝑛 − 1 − 𝑘 = 𝑛 − (𝑘 + 1) (10.10)
degrees of freedom are left for 𝑆𝑆𝐸. This is the sample size 𝑛 minus 𝑘 estimated slopes of 𝛽𝑖 and 1
estimated intercept of 𝛽0 . We can then write the ANOVA table,
Multivariate ANOVA
We use this theory to analyze experimental errors from a regression in Section 10.6. Section
11.5 later discusses a more complex theory of ‘Temporal’ Linear Regression with Lagging for both
predictors and responses in multivariate realm.
10.4 Product Quality and System Performance by Design of Experiments
The quality of a product and/or the reliability/efficiency of a process or a system is typically quantified
by quality and performance measures. Examples include measures such as
piston cycle time, yield of a production process, output voltage of an electronic circuit,
These performance measures are affected by several factors that have to be set at specific levels
to get desired results. We will discuss tools compatible with the top of the classic Quality Ladder
(in Figure 10.4), and then, to some extent, propose methods for the newly modified Quality and
Performance ladder developed so far in this text.
Quality by Design, historically, is the comprehensive quality engineering approach partially developed in the 1950s by the Japanese engineer Genichi Taguchi. Taguchi labeled his methodology off-line quality control.
(Footnote 5) Taguchi’s impact on Japan has expanded to a wide range of industries. He won the 1960 Deming Prize for application of quality, as well as three Deming Prizes for literature on quality in 1951, 1953 and 1984.
Applications of off-line quality control range now from the design of automobiles, copiers and electronic systems to cash-flow optimization in
banking, improvements in computer response times and runway utilization in an airport.
Figure 10.4: Organizations higher up on the Quality Ladder are more efficient
at solving problems with increased returns on investments
The aim of off-line quality control is to determine the factor-level combination that gives the least
variability to the appropriate performance measure, while keeping the mean value of the measure
on target. The goal is to control both accuracy and variability. Optimization problems of products or
processes can take many forms that depend on the objectives to be reached. These objectives are
typically derived from customer requirements.
• Performance parameters such as dimensions, pressure or velocity usually have a target or nominal
value. The objective is to reach the target within a range bounded by upper and lower specification
limits. We call such cases “nominal is best.”
• Noise levels, shrinkage factors, amount of wear and deterioration are usually required to be as low
as possible. We call such cases “the smaller the better.”
• When we measure strength, efficiency, yields or time to failure our goal is, in most cases, to reach
the maximum possible levels. Such cases are called “the larger the better.”
These three types of cases require different objective (target) functions to optimize. Taguchi intro-
duced the concept of loss function determining the appropriate optimization procedure.
Figure 10.5: Dr. Genichi Taguchi, a pioneer in using Experimental Designs for Industry
When “nominal is best” is considered, specification limits are typically two-sided with an upper spec-
ification limit (USL) and a lower specification limit (LSL), see Knowledge Box ??. These limits are
used to differentiate between conforming and nonconforming products. Nonconforming products
are usually repaired, retested and sometimes downgraded or simply scrapped. In all cases defec-
tive products carry a loss to the manufacturer. Taguchi proposed a quadratic function as a simple
approximation to a graduated loss that measures loss on a continuous scale.
𝐿(𝑦, 𝑀 ) = 𝐾 (𝑦 − 𝑀 )2 , (10.11)
where 𝑦 is the value of the performance characteristic of a product, 𝑀 is the target value of this
characteristic and 𝐾 is a positive constant, which yields monetary or other utility value to the loss.
For example, suppose that (𝑀 − Δ, 𝑀 + Δ) is the customer’s tolerance interval around the target
(note that this is different from the statistical tolerance interval). When 𝑦 falls out of this interval the
product has to be repaired or replaced at a cost of $𝐴. Then, for this product,
𝐴 = 𝐾 Δ2 or 𝐾 = 𝐴/Δ2 . (10.12)
• The manufacturer’s tolerance interval is generally tighter than that of the customer, namely
(𝑀 − 𝛿, 𝑀 + 𝛿), where 𝛿 < Δ. We can obtain the value of 𝛿. Suppose the cost to the manufacturer
to repair a product that exceeds the customer’s tolerance, before shipping the product, is $𝐵,
𝐵 < 𝐴. Then
B = (A/Δ²) (Y − M)², or Y = M ± Δ (B/A)^{1/2}. (10.13)
Thus,
δ = Δ √(B/A). (10.14)
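As a quick numeric sketch of these relations, with assumed values for M, Δ and the costs A and B (the numbers are illustrative, not from the text), the loss coefficient of (10.12) and the manufacturer's tolerance of (10.14) follow directly:

```python
# Sketch of Taguchi's quadratic loss and the tolerance relations (Eqs. 10.11-10.14).
# M, Delta, A_cost, B_cost are hypothetical values chosen for illustration.
import math

M, Delta = 10.0, 0.5        # target and customer half-tolerance
A_cost, B_cost = 8.0, 2.0   # customer repair cost A, in-factory repair cost B (B < A)

K = A_cost / Delta**2                  # Eq. (10.12): loss coefficient K = A / Delta^2
def loss(y):                           # Eq. (10.11): L(y, M) = K (y - M)^2
    return K * (y - M)**2

delta = Delta * math.sqrt(B_cost / A_cost)   # Eq. (10.14): manufacturer half-tolerance

print(K)                 # 32.0
print(loss(M + Delta))   # 8.0: the loss at the customer limit equals A
print(delta)             # 0.25: tighter than the customer's Delta = 0.5
```

Note that the loss at y = M ± Δ recovers the repair cost A exactly, which is how K was calibrated.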
• The manufacturer should reduce the variability in the product performance characteristic so that
process capability 𝐶𝑝𝑘 [defined in Knowledge Box ??] relative to the tolerance interval (𝑀 −𝛿, 𝑀 +𝛿)
should be high. See Figure 10.6 for a schematic presentation of these relationships. Notice that
E[(Y − M)²] = Bias² + V,
where Bias = μ − M, μ = E[Y] and V = E[(Y − μ)²] is the variance. Thus, the objective is to have a manufacturing process with the mean μ (of product feature Y) as close as possible to the target M, and with variance σ² as small as possible (σ < δ/3, so that C_pk > 1). ■
The previous section and chapters (in Part B and C) dealt with measuring the impact of such
variability. In this section we discuss methods for actually reducing variability.
System design is the stage where engineering skills, innovation, and technology are pooled to-
gether to create a basic design. Once the design is ready to go into production, one has to specify
tolerances of parts and sub-assemblies so that the product or process meets its requirements.
Loose tolerances are typically less expensive than tight tolerances.
Taguchi proposed changing the classical approach to the design of products and processes.
Thus, the three major stages in designing a product (or process) from the Robust Engineering
viewpoint are
1. System Design – This is when the product architecture and technology are determined.
2. Parameter Design – At this stage a planned optimization program is carried out in order to mini-
mize variability and costs.
3. Tolerance Design – Once the optimum performance is determined tolerances should be specified,
so that the product or process stays within specifications. The setting of optimum values of the
tolerance factors is called tolerance design, [see more info in [107, Part V]].
Design parameters and noise factors: Taguchi classifies the variables which affect the perfor-
mance characteristics into two categories: design parameters and source of noise. All factors which
cause variability are included in the source of noise.
1. Sources of noise are classified into two categories: external sources and internal sources.
• External sources are those external to the product, like environmental conditions (temperature,
humidity, dust, etc.); human variations in operating the product and other similar factors.
• Internal sources of variability are those connected with manufacturing imperfections and prod-
uct degradation or natural deterioration.
2. The design parameters, on the other hand, are controllable factors which can be set at predeter-
mined values (level). The product designer has to specify the values of the design parameters to
achieve the objectives. This is done by running an experiment which is called parameter design.
Parameter design is the first priority in the improvement of measuring precision, stability, and/or
reliability. When parameter design is completed, tolerance design is used to further reduce error
factor influences.
Variables that may cause product malfunctioning are called noise. Types of noise include:
1. Outer noise: variation caused by environmental conditions (e.g., temperature, humidity, dust, input voltage).
Parameter design is used to select the best control-factor level combinations so that the effect of all of the noise above can be minimized.⁶
Practical guidelines
⁶ Parameter design is in general the most important step in developing stable (robust) products, good or high-performance systems, or reliable manufacturing processes. With this technique, nonlinearity may be utilized positively. The purpose of parameter design is to investigate the overall variation caused by inner and outer noise when the levels of the control factors are allowed to vary widely. The next step is to find a stable or robust design that is essentially unaffected by inner or outer noise. Therefore, the most likely types of inner and outer noise factors must be identified and their influence must be investigated. Kindly see more details in [107, Chapter 15].
[Diagram: noise factors and control factors acting on the system.]
There are two types of experiments, physical experiments and computer based simulation
experiments, and we discuss the latter only.
• Let 𝜃 = (𝜃1 , · · · , 𝜃𝑘 ) be the vector of control design parameters. The vector of noise variables is
denoted by 𝑥 = (𝑥1 , · · · , 𝑥𝑚 ). The response function 𝑌 = 𝑓 (𝜃, 𝑥) involves in many situations the
factors 𝜃 and 𝑥 in a non-linear fashion. The objective of parameter design experiments is to take
advantage of the effects of the non-linear relationship.
The strategy is to perform a factorial experiment to investigate the effects of the design parameters
(controllable factors).
• If we learn from the experiments that certain design parameters affect the mean μ of Y but not its variance while, on the other hand, other design factors affect the variance but not the mean, we can use the latter group to reduce the variance of Y as much as possible, and then adjust the levels of the parameters in the first group to set μ close to the target M (see Figure 10.6).
• Parameter design can be performed both by simulation and by physical experimentation. In simulation, mathematical equations can be used, especially for complicated systems.
10.5 The full factorial design in 𝑚 binary factors
We now employ concepts and ideas defined in the preceding sections; Chapter 8 and Chapter ?? present the related simulation techniques.
In the full binary factorial design in k factors (non-random variables in regression models, or controllable design parameters in the Parameter Design theory above), denoted 2^k, each factor has 2 levels. Consider first the full 2² design, combined with a linear regression analysis on a concrete numerical design.
♦ EXAMPLE 10.11 (Very small factorial designs: the 2² and 2³ designs in few factors).
Assume a response Y (such as the heat from a CPU, or the performance in MIPS of a workstation) depends on binary factors A, B. We study the impact of these factors on Y, encoding the levels (choices) of the factors A, B by the symbols in parentheses. The value (−1) can also be replaced by (0), and in general the levels need not be numbers.
Hence a full binary design describes a factorial experiment 2^k with k binary factors (at 2 levels each), having 2^k experiments and allowing the estimation of 2^k effects, including
• k main effects, C(k,2) two-factor interactions, C(k,3) three-factor interactions, . . .
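A quick sanity check of this effect count for k = 3 (a sketch; together with the grand mean the binomial counts exhaust all 2^k estimable quantities):

```python
# Counting the effects of a 2^k design: k main effects, C(k,2) two-factor
# interactions, C(k,3) three-factor interactions, ..., plus the grand mean.
from math import comb

k = 3
counts = [comb(k, r) for r in range(k + 1)]  # [grand mean, main, 2-factor, 3-factor]
print(counts)        # [1, 3, 3, 1]
print(sum(counts))   # 8 = 2**3 estimable quantities in total
```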
Regression analysis: Assuming two binary factors A = X_A and B = X_B impact Y, the performance Y can be regressed using the following regression model (with a nonlinear interaction term):
Observations in the above table give a linear system in the 4 unknowns q = [q*] = [q0, qA, qB, qAB]:
15 = q0 − qA − qB + qAB
45 = q0 − qA + qB − qAB
25 = q0 + qA − qB − qAB
75 = q0 + qA + qB + qAB
Interpretation: Solving gives q0 = 40 (the grand mean), effect of memory qA = 20 MIPS, effect of cache qB = 10 MIPS, and interaction between memory and cache qAB = 5 MIPS.
Run no. Id 𝐴 𝐵 𝑦
1 1 −1 −1 𝑦1
2 1 1 −1 𝑦2
3 1 −1 1 𝑦3
4 1 1 1 𝑦4
where the rows correspond to the experiments, we get the roots of the last system:
q0 = (1/4)( y1 + y2 + y3 + y4) = ȳ = 40
qA = (1/4)(−y1 + y2 − y3 + y4)
qB = (1/4)(−y1 − y2 + y3 + y4)
qAB = (1/4)( y1 − y2 − y3 + y4)
• Hence, the effects q* are linear combinations of the responses; for qA, qB, qAB the sum of the coefficients is zero, so we call these expressions contrasts.
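These contrast formulas can be verified numerically; a minimal sketch using the four responses of the example:

```python
# Solving the 2^2 sign-table system for the effects (coded levels -1/+1),
# using the four responses of Example 10.11.
y1, y2, y3, y4 = 15, 45, 25, 75

q0  = ( y1 + y2 + y3 + y4) / 4   # grand mean
qA  = (-y1 + y2 - y3 + y4) / 4   # main effect of A (memory)
qB  = (-y1 - y2 + y3 + y4) / 4   # main effect of B (cache)
qAB = ( y1 - y2 - y3 + y4) / 4   # A*B interaction

print(q0, qA, qB, qAB)   # 40.0 20.0 10.0 5.0
```

The coefficient pattern in each line is exactly the corresponding signed column of the sign table.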
The statistical model of designs with two generic factors with a and b levels (fully provided in Section 10.5.2 next) is modified to the case where the two factors are binary (so a = b = 2) and n = 1 replication; hence the index k in all formulae is dropped:
Y_ij = μ + τ_i^A + τ_j^B + τ_ij^AB + ε_ij, i = 1, 2, j = 1, 2,
where μ is the overall mean effect, τ_i^A is the effect of A_i (the i-th level of factor A), τ_j^B is the effect of B_j, and τ_ij^AB is the effect of the interaction between A_i and B_j.
Define the Total Variation of the response as the total sum of squared deviations from the mean,
SST = Σ_{i=1}^{N} (y_i − ȳ)² = Σ_{i=1}^{a} Σ_{j=1}^{b} (Y_ij − Ȳ)²
(a special case of Eq. 10.25 when n = 1, a single replicate), where the sample size is N = 2^m.
The last computation (10.16) in fact used the sign-table method with the data in the table above.
In general, with n ≥ 1 replications we need another index k. There are a·b treatment combinations (A_i, B_j), i = 1, 2, · · · , a, j = 1, · · · , b. Suppose also that n independent replicates are made at each of the treatment combinations.
The analysis of variance for full factorial designs is done to test the hypotheses that main-effect or interaction parameters are equal to zero.
Model (10.18) here, for instance, consists of 𝜏𝑖𝑗𝐴𝐵 as the interaction effect between 𝐴𝑖 and 𝐵𝑗 ,
represents deviations of the treatment effects relative to both 𝜏𝑖𝐴 and 𝜏𝑗𝐵 . Effects which involve
comparisons between levels of only one factor are called main effects of that factor, and effects
which involve comparisons for more than a single factor are called interactions. We define precisely
effects as follows.
1. The effect of a factor is defined to be the change in response produced by a change in the level
of the factor. This is frequently called a main effect
If a factor 𝐴 has levels of High and Low, then the main effect of 𝐴 is
symbolically 𝜏 𝐴 = y 𝐴=𝐻𝑖𝑔ℎ − y 𝐴=𝐿𝑜𝑤 , the difference between the average response at the high level
and the low level of 𝐴.
2. In some experiments, we may find that the difference in response between the levels of one factor
is not the same at all levels of the other factors. When this occurs, there is an interaction between
the factors.
(i) If the effect of one factor varies depending on the level of another factor, then there is
interaction between the two factors.
(ii) Interaction = degree of difference from the sum of the separate effects.
The Analysis of Variance (ANOVA) is generally needed for testing the significance of main effects
and interactions. The ANOVA for full factorial designs is built to test the hypotheses that main-effects
or interaction parameters are equal to zero. We present the ANOVA for a design of factors 𝐴 and 𝐵
with a statistical model given in Equation (10.18). The method can be generalized to any number of
factors.
Let
Ȳ_ij = (1/n) Σ_{k=1}^{n} Y_ijk, (10.21)
Ȳ_i. = (1/b) Σ_{j=1}^{b} Ȳ_ij, i = 1, · · · , a, (10.22)
Ȳ_.j = (1/a) Σ_{i=1}^{a} Ȳ_ij, j = 1, · · · , b, (10.23)
and
Ȳ = (1/ab) Σ_{i=1}^{a} Σ_{j=1}^{b} Ȳ_ij. (10.24)
The ANOVA procedure for generic two-factor designs includes three following steps.
1. The ANOVA first partitions the total sum of squares of deviations from Ȳ,
SST = Σ_{i=1}^{a} Σ_{j=1}^{b} Σ_{k=1}^{n} (Y_ijk − Ȳ)² [total sum of squared errors], (10.25)
into two components:
SSW = Σ_{i=1}^{a} Σ_{j=1}^{b} Σ_{k=1}^{n} (Y_ijk − Ȳ_ij)² [sum of squared errors within the whole design], (10.26)
SSBF = n Σ_{i=1}^{a} Σ_{j=1}^{b} (Ȳ_ij − Ȳ)² [sum of squared errors among factor levels]. (10.27)
It is straightforward to show that 𝑆𝑆𝑇 = 𝑆𝑆𝑊 + 𝑆𝑆𝐵𝐹. [𝑆𝑆𝐵𝐹 is also called the sum of square
errors between factors.]
2. In the second stage, the sum of squares of deviations SSBF is partitioned into three components SSA, SSB, and the interaction sum SSI := SSAB, as
SSI = n Σ_{i=1}^{a} Σ_{j=1}^{b} (Ȳ_ij − Ȳ_i. − Ȳ_.j + Ȳ)² [errors caused by interaction effects], (10.28)
SSA = n b Σ_{i=1}^{a} (Ȳ_i. − Ȳ)² [sum of squared errors caused by factor effect A], (10.29)
SSB = n a Σ_{j=1}^{b} (Ȳ_.j − Ȳ)² [sum of squared errors caused by factor effect B], (10.30)
Source of variation DF SS MS F
𝐴 𝑎−1 𝑆𝑆𝐴 𝑀 𝑆𝐴 𝐹𝐴
𝐵 𝑏−1 𝑆𝑆𝐵 𝑀 𝑆𝐵 𝐹𝐵
𝐴𝐵 (𝑎 − 1)(𝑏 − 1) 𝑆𝑆𝐼 𝑀 𝑆𝐴𝐵 𝐹𝐴𝐵
Between 𝑎𝑏 − 1 𝑆𝑆𝐵𝐹 - -
Within 𝑎𝑏(𝑛 − 1) 𝑆𝑆𝑊 𝑀 𝑆𝑊 -
Total 𝑁 −1 𝑆𝑆𝑇 - -
that is, SSBF = SSI + SSA + SSB. All these terms are collected in the ANOVA Table 10.4. Consequently,
SST = SSW + SSBF = SSW + SSI + SSA + SSB = SSW + SSA + SSB + SSAB. (10.31)
The proportions of the variation explained by factors A, B and their interaction are quantified respectively by the ratios
PV_A = SSA/SST, PV_B = SSB/SST, PV_AB = SSAB/SST. (10.32)
The mean squares are
MSA = SSA/(a − 1), MSB = SSB/(b − 1), MSAB = SSI/((a − 1)(b − 1)) = SSAB/((a − 1)(b − 1)), (10.33)
and, when n > 1,
MSW = SSW/(ab(n − 1)).
3. Finally, we compute the F-statistics
F_A = MSA/MSW, (10.34)
F_B = MSB/MSW, (10.35)
and
F_AB = MSAB/MSW. (10.36)
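The whole partition (10.25)–(10.36) can be sketched in a few lines of Python; the data below are the replicated 2² responses that appear later in this section, arranged as an a × b × n array:

```python
# Two-factor ANOVA partition (Eqs. 10.25-10.36) for a = b = 2 levels
# and n = 3 replicates per cell.
import numpy as np

Y = np.array([[[15, 18, 12], [45, 48, 51]],    # A level 1: cells (B1), (B2)
              [[25, 28, 19], [75, 75, 81]]])   # A level 2: cells (B1), (B2)
a, b, n = Y.shape

grand = Y.mean()                 # Ybar, Eq. (10.24)
cell  = Y.mean(axis=2)           # Ybar_ij, Eq. (10.21)
rowm  = cell.mean(axis=1)        # Ybar_i., Eq. (10.22)
colm  = cell.mean(axis=0)        # Ybar_.j, Eq. (10.23)

SST = ((Y - grand) ** 2).sum()                 # Eq. (10.25)
SSW = ((Y - cell[:, :, None]) ** 2).sum()      # Eq. (10.26)
SSA = n * b * ((rowm - grand) ** 2).sum()      # Eq. (10.29)
SSB = n * a * ((colm - grand) ** 2).sum()      # Eq. (10.30)
SSI = n * ((cell - rowm[:, None] - colm[None, :] + grand) ** 2).sum()  # Eq. (10.28)

assert np.isclose(SST, SSW + SSA + SSB + SSI)  # the identity (10.31)

MSW = SSW / (a * b * (n - 1))
FA  = (SSA / (a - 1)) / MSW                    # Eq. (10.34)
FB  = (SSB / (b - 1)) / MSW                    # Eq. (10.35)
FAB = (SSI / ((a - 1) * (b - 1))) / MSW        # Eq. (10.36)
print(SST, SSW, SSA, SSB, SSI)
print(FA, FB, FAB)
```

With these numbers the partition identity (10.31) holds exactly, and all three F-statistics are large relative to typical critical values.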
F_A, F_B and F_AB are the test statistics for the significance of the main effects of A, of B, and of the interaction AB on the response, respectively. A few cases to consider:
1. If F_A < F_{1−α}[a − 1, ab(n − 1)], we cannot reject the null hypothesis H_0^A : τ_i^A = 0 against H_1^A : τ_i^A ≠ 0 (for at least one i); similarly for F_B.
2. Hence, every decision is based on properly computing the (Fisher) F-statistics and employing the F critical values given in Table A7 above, or obtained with software such as R.
3. Also, if F_AB < F_{1−α}[(a − 1)(b − 1), ab(n − 1)], we cannot reject the null hypothesis H_0^{A*B} : τ_11^{AB} = · · · = τ_ab^{AB} = 0. The interaction effects are significant if the alternative H_1^{A*B} is accepted; the main effects should then not be interpreted on their own, whether or not they are individually significant. ■
SUMMARY. The use of ANOVA Table 10.4 will be illustrated in Quality Analytics 0 of Section ??, where we also see that different factors (new controllable factors or scenarios) would require new designs whose existence is uncertain! We exploit the notation below.
We study the use of k = 3 binary factors for SPE with the simulated data below.
Table 10.5: Response at a 2³ factorial experiment
Three factors A, B, C, as controllable variables for memory, cache and operating system (binary value Windows/Linux), have effects on the output Y (MIPS) of a computer system. In order to estimate the main effects of A, B, C, a 2³ factorial experiment was conducted in n = 4 replicates, giving the responses y1, y2, y3 and y4 above.
Each treatment combination was repeated 4 times, at the 'low' and 'high' levels of two noise factors. The results are given in Table 10.5, with the design size N = 2³ = 8.
The mean Y , and standard deviation 𝑆 of 𝑌 at the 8 treatment combinations are listed below.
𝜈 Y 𝑆
0 60.875 0.4918
1 46.800 0.3391
2 91.675 0.4323
3 70.950 0.6103
4 65.375 0.9010
5 50.025 0.5262
6 90.475 1.0871
7 76.675 0.9523
• Regressing the column Ȳ on the 3 orthogonal columns under A, B, C in Table 10.5 gives the fitted equation Ŷ = 69.1 − 7.99A + 13.3B + 1.53C, with R² = 0.991 [see Equation (10.9) for the multiple regression case]. Moreover, the coefficient 1.53 of C is not significant (p-value = 0.103).
• Thus, the significant main effects on the mean yield are those of factors A and B only. Regressing the column of S on A, B, C, we obtain an equation with R² = .805 in which only the main effect of C is significant; factors A and B have no effect on the standard deviation. The strategy is therefore to set C so as to minimize the standard deviation, and then use the values of A and B to adjust the mean response to be equal to the optimal target value M.
• If 𝑀 = 85, we find 𝐴 and 𝐵 to solve the equation 69.1 − 7.99𝐴 + 13.3𝐵 = 85. Putting 𝐵 = 0.75 then
𝐴 = −0.742.
The optimal setting of the design parameters is 𝐴 = −.742, 𝐵 = 0.75 and 𝐶 = −1. ■
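A one-line check of this back-solution, using the fitted coefficients quoted above:

```python
# Back-solving the fitted mean-response equation 69.1 - 7.99*A + 13.3*B = 85
# for A, with B fixed at 0.75 as in the text.
B = 0.75
A = (69.1 + 13.3 * B - 85) / 7.99
print(round(A, 3))   # -0.742, matching the optimal setting in the text
```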
♣ OBSERVATION.
1. The applications of factorial analysis described so far deal with only two levels of each factor, for example low and high, or −1 and +1. If there are only two points, they can only be joined by a straight line. This implies an assumed rectilinear relationship between the magnitude of the factor and the response.
2. If this assumption is not true, then a maximum or minimum value of the response may occur
between the chosen levels of the factors and this would not be detected. Therefore, if a rectilinear
relationship cannot be safely assumed, we should use more than two levels, like the simple two
non-binary factor design 3 × 4 later.
Now suppose that in a 2³ factorial, three binary factors A, B, C are to be studied. The number of combinations is eight, and with n replicates we have N = n·2³ = 8n observations to be analyzed for their influence on a response.
• The presence of the corresponding lower-case letter in the treatment-combination column indicates that the factor is at its high level. As in the 2² case, if the three factors are all quantitative (such as temperature, pressure, time), then the linear regression representation (10.37) of the response Y for the 2³ design is used.
We present the ANOVA table for the three-factor fixed effects model with factors 𝐴, 𝐵, 𝐶, then apply
for binary case 23 with the number of factor levels 𝑎 = 𝑏 = 𝑐 = 2.
• We see that when a = b = c = 2 for the 2³ design, there are seven degrees of freedom between the eight treatment combinations.
Three degrees of freedom are associated with the main effects of A, B, and C.
Four degrees of freedom are associated with interactions: one each with AB, AC, and BC, and one with the 3-factor interaction ABC.
Table 10.7: Table of ANOVA for a 3-factor factorial experiment
Source of variation DF SS MS F
A a − 1 = 1 SSA MSA F_A = MSA/MSE
B b − 1 = 1 SSB MSB F_B
C c − 1 = 1 SSC MSC F_C
A*B interaction (a − 1)(b − 1) SSAB MSAB F_AB
A*C interaction (a − 1)(c − 1) SSAC MSAC F_AC
B*C interaction (b − 1)(c − 1) SSBC MSBC F_BC
A*B*C interaction (a − 1)(b − 1)(c − 1) SSABC MSABC F_ABC
Error df_E = abc(n − 1) SSE MSE -
Total df_T = N − 1 = abcn − 1 SST - -
• The F-tests F_x on main effects and interactions follow directly from the mean squares MS_x, where x = A, B, C, AB, AC, BC, ABC; for example, F_A = MSA/MSE.
Suppose that both of our design factors are quantitative (such as temperature, pressure, living
media, time). Therefore, we write 𝑋1 for factor 𝐴, and 𝑋2 for factor 𝐵.
• The response 𝑌 is expressed linearly via two binary factors 𝐴 and 𝐵 as a linear regression
𝑌 = 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + 𝛽12 𝑋1 𝑋2 + 𝜀
where 𝑌 is the response random variable, 𝜀 is a random error term, with E[𝜀] = 0, and 𝛽0 , . . . , 𝛽12
are regression coefficients.
• The variables 𝑋1 , 𝑋2 are viewed as non-random, with values 𝑥 = (𝑥1 , 𝑥2 ) after conducting experi-
ments. By taking expectation, we get
E[Y | X = x] := E[Y | A ≡ X1 = x1, B ≡ X2 = x2] = β0 + β1 x1 + β2 x2 + β12 x1 x2. (10.38)
                    M = m1 (Medium 1)    M = m2 (Medium 2)
T = t1 = 12 hours   21 23 20 22 28 26    25 24 29 26 25 27
T = t2 = 18 hours   37 38 35 39 38 36    31 29 30 34 33 35
She performs 𝑛 = 6 (balanced design) replicates for each of the 4 𝑀 * 𝑇 combinations. Here
𝐴1 𝐵1 means 𝑚1 𝑡1 =(Medium 1, 12 hours), and so on. The 𝑁 = 24 measurements were taken in a
completely randomized order. The results are given in Table 10.8. The virologist wants to answer
the following questions.
1. What effects do media 𝑀 and growing time 𝑇 have on the virus’s proliferation?
2. Is there a choice of living media that give uniformly strong proliferation regardless of time?
A detailed solution with numerical analysis is shown in PROBLEM 10.3.⁷
⁷ The 2nd question is particularly important. We may find a media alternative that is not greatly affected by time. If this is so, we can make the living media robust to time variation in the actual environment. This is an example of using experimental design for robust product design, a very important engineering and scientific problem.
FACT: With generic factor 𝐵 crossed with factor 𝐴, the crossed design becomes a randomized
complete block design (RCB or RCBD) with levels 𝐵𝑗 of 𝐵 as blocks:
Suppose there are 𝑎 = 3 different diets (of factor 𝐴- Diet), we use the same 𝑏 = 4 tanks (of
factor 𝐵- Tank) in each level of diet, and put 6 fish per tank. In this setting, tanks would have been
crossed with diets, and assume the ANOVA table is obtained as the following table
QUESTIONS
Factor 𝐵
Factor 𝐴 𝐵=1 𝐵=2 𝐵=3 𝐵=4
𝐴=1 𝑦1,1 𝑦1,2 𝑦1,3 𝑦1,4
𝐴=2 𝑦2,1 𝑦2,2 𝑦2,3 𝑦2,4
𝐴=3 ? ? ? ?
1. Explain the number 60 in df_Fish = df_E. Fill in the correct numbers in the cells marked with ?.
2. Decide on the significance of the main effects of A, B and the interaction A*B on the response (weight increase of the fish), using α = 0.05.
The statistical analysis of 2^m designs is summarized below; nowadays a computer software package (like R or MATLAB) is usually employed in this analysis process.
EFPRAI
3. Perform statistical testing with ANOVA (using software when data available)
4. Refine model
5. Analyze residuals
STEP 1. Estimate factor effects: We calculate a main effect as the gap between the response
means at high and low level. For factor 𝐴, for example, the main effect is
𝜏 𝐴 = y 𝐴+ − y 𝐴− .
We use the analysis of variance to formally test for the significance of main effects and interaction.
Table 10.10 shows the general form of an analysis of variance for a 2𝑚 factorial design with 𝑛
replicates.
STEP 4. Refine model removing any non-significant variables from the full model.
STEP 5. Analyze residuals is the usual residual analysis to check for model adequacy and as-
sumptions.
Source of variation DF SS MS F
m main effects:
A 1 SSA MSA F_A = MSA/MSE
B 1 SSB MSB F_B
...
M 1 SSM MSM F_M
C(m,2) two-factor interactions:
A*B interaction 1 SSAB MSAB F_AB
A*C interaction 1 SSAC MSAC F_AC
...
L*M interaction 1 SSLM MSLM F_LM
C(m,3) three-factor interactions:
A*B*C interaction 1 SSABC MSABC F_ABC
...
Error df_E = 2^m (n − 1) SSE MSE -
Total df_T = N − 1 = 2^m n − 1 SST - -
For any binary factor, we equivalently use the symbol −1 or − for the low level, and +1 or + for the high level.
Let us now illustrate the first step of this procedure for m = 3 (so 2^{m−1} = 4), with n replicates. We study three binary factors A, B, C, and repeat the 2³ design n times. The total number of experiments is N = n · 2³ = 8n, and so there are N/2 = 4n runs at each level.
STEP 1. Estimate factor effects - We calculate a main effect as the gap between the response
means at high and low level. For factor 𝐴, for example, the main effect is
𝜏 𝐴 = y 𝐴+ − y 𝐴− .
𝑇𝐴2 = 𝑦𝑎 + 𝑦𝑎𝑏 + 𝑦𝑎𝑐 + 𝑦𝑎𝑏𝑐 as the total of responses over all other factors for level 2 of 𝐴,
𝑇𝐴1 = 𝑦(1) + 𝑦𝑏 + 𝑦𝑐 + 𝑦𝑏𝑐 as the response totals over all other factors for level 1 of 𝐴.
𝑁 is the total number of units in the experiment (𝑁/2 for level 1, 𝑁/2 for level 2).
The response means at the high and low levels of A are respectively
ȳ_{A+} = T_{A2}/(N/2) = T_{A2}/(4n), ȳ_{A−} = T_{A1}/(N/2) = T_{A1}/(4n).
The main effect for factor A then (dropping the y in each response y△ and writing only △) is
τ_A = (T_{A2} − T_{A1})/(N/2) = [a + ab + ac + abc − (1) − b − c − bc]/(4n),
similarly for B,
τ_B = (T_{B2} − T_{B1})/(N/2) = [b + ab + bc + abc − (1) − a − c − ac]/(4n),
and
τ_C = (T_{C2} − T_{C1})/(N/2) = [? − ?]/(4n) (DIY). (10.41)
Note: In the last three equations, the quantities in brackets of main effects for 𝐴, 𝐵, 𝐶 are contrasts
in the treatment combinations.
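A small Python sketch of these contrast computations, with hypothetical treatment totals keyed by the Yates labels (the numbers are made up for illustration, not from the text):

```python
# Main-effect contrasts for the 2^3 design from treatment-combination totals.
# The totals T are hypothetical; n is the number of replicates per combination.
n = 1
T = {'(1)': 60, 'a': 72, 'b': 52, 'ab': 83, 'c': 54, 'ac': 45, 'bc': 80, 'abc': 94}

# tau_A = (T_A2 - T_A1)/(N/2) = [a + ab + ac + abc - (1) - b - c - bc]/(4n)
tau_A = (T['a'] + T['ab'] + T['ac'] + T['abc']
         - T['(1)'] - T['b'] - T['c'] - T['bc']) / (4 * n)
# The analogous contrast for C sums the totals where c is present, minus the rest.
tau_C = (T['c'] + T['ac'] + T['bc'] + T['abc']
         - T['(1)'] - T['a'] - T['b'] - T['ab']) / (4 * n)
print(tau_A, tau_C)   # 12.0 1.5 for these made-up totals
```

Note that each numerator is a contrast: its eight coefficients (+1 or −1) sum to zero.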
10.6 Regression with Experimental Error analysis
Y = β0 + β1 X + e.⁸
Assumption 1: Linearity between predictor and response, namely X and Y.
⁸ X could be family size, an interest rate, a project input, or the number of drunk men per day in BKK; Y could be electricity consumption, the return on an investment project, or the number of traffic accidents in Bangkok.
At the i-th observation, the predictor X_i is considered non-random, and we assume a linear relationship between Y_i and X_i of the form Y_i = β0 + β1 X_i + e_i.
Experimental errors e_i := y_i − ŷ_i (as in Equation 10.42) are summarized by the quantity
SSE = Σ_{i=1}^{N} e_i² = Σ_{i=1}^{N} (y_i − ŷ_i)² = Σ_{i=1}^{N} (y_i − b0 − b1 x_i)², (10.43)
where x_i is the value of factor X at the i-th experiment or observation, and e_i is the error with E[e_i] = 0, for all i = 1, 2, . . . , N; here N is the sample size. Since the total variation decomposes as SST = SSR + SSE, note that:
• the regression sum of squares SSR has df_R = 1 degree of freedom (the dimension of the corresponding space (X, Y) is 1);
• the total sum of squares SST = Σ_{i=1}^{N} (y_i − ȳ)² = (N − 1) s²_y has N − 1 degrees of freedom, because it is computed directly from the sample variance s²_y.
For further analysis, we introduce two other standard assumptions of linear regression, applied when m = 1 and then extended to the general case m ≥ 1.
The estimators of the linear regression coefficients β0, β1 are normally distributed; these estimates, denoted (b0, b1) = b, are found by Equation 10.6 in Knowledge Box 12.
With this information, we can unbiasedly estimate the response variance σ² = V[Y] by the sample regression variance
s² = σ̂² = MSE = SSE/(N − 2). (10.44)
Definition 10.4.
• RMSE: The estimated sample regression variance s² = MSE = SSE/(N − 2) gives the estimated standard deviation s, called the root mean squared error or RMSE.
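A minimal least-squares sketch illustrating (10.43) and (10.44) on a small made-up sample:

```python
# Fitting Y = b0 + b1*X by least squares and computing SSE, MSE and RMSE
# (Eqs. 10.43-10.44). The x, y data are made up for illustration.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
N = len(y)

# Least-squares estimates b1, b0 of the regression coefficients.
b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()

e = y - (b0 + b1 * x)          # residuals e_i = y_i - yhat_i
SSE = (e ** 2).sum()           # Eq. (10.43)
MSE = SSE / (N - 2)            # Eq. (10.44): unbiased estimate of sigma^2
RMSE = MSE ** 0.5              # the root mean squared error of Definition 10.4
print(b0, b1, SSE, RMSE)
```

The divisor N − 2 reflects the two estimated coefficients b0 and b1, matching the df bookkeeping above.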
A standard way to present estimation of experimental errors and analysis of variance of the re-
sponse is the ANOVA table 10.11.
Univariate ANOVA
To assess the underlying experimental error, some degree of replication must be accepted, quantified by a natural number n ≥ 1. Consider the smallest multivariate regression analysis, studying k = 2 predictors with binary choices. We see that the 2² factorial design cannot estimate experimental errors if no experiment is repeated, that is, if only n = 1 replicate is used.
• Experimental errors are quantified by replications. In other words, analyzing experimental errors requires the number of replicates to satisfy n > 1.
• If n > 1 replicates are used, then each experiment in the 2^k factorial design is repeated n times, so there are in total N = n·2^k experimental runs based on the standard 2^k factorial.
Y = q0 + qA xA + qB xB + qAB xA xB + e, assuming E[e] = 0.
Id → 𝑞0 𝐴 𝐵 𝐴𝐵 𝑦 (responses) y
1 −1 −1 1 (15, 18, 12) 15
1 1 −1 −1 (45, 48, 51) 48
1 −1 1 −1 (25, 28, 19) 24
1 1 1 1 (75, 75, 81) 77
164 86 38 20 Total= 164
41 21.5 9.5 5 Total/4 = 41 = y = 𝑞0
To compute the estimated response for each factor-level combination, we use the model
Ŷ = Ĝ(A_i, B_i) = ŷ_i = q0 + qA x_{A_i} + qB x_{B_i} + qAB x_{A_i} x_{B_i}.
Then, extending (10.43) to the case of two factors, the difference between measured and estimated values is
e_ij = y_ij − ŷ_i,
and we get the sum of squared errors via the total variation SST as
SSE = SST − (SSA + SSB + SSAB) = SSY − SS0 − (SSA + SSB + SSAB).
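With the replicated 2² data of the sign table above, the pooled within-run deviations give SSE directly; a short sketch:

```python
# Experimental-error sum of squares for the replicated 2^2 design:
# with n = 3 replicates per run, SSE pools the within-run squared deviations,
# because the saturated model q0, qA, qB, qAB reproduces each run mean exactly.
import numpy as np

runs = np.array([[15, 18, 12],    # A-, B-
                 [45, 48, 51],    # A+, B-
                 [25, 28, 19],    # A-, B+
                 [75, 75, 81]])   # A+, B+

yhat = runs.mean(axis=1)                    # fitted value per run = run mean
SSE = ((runs - yhat[:, None]) ** 2).sum()   # sum of e_ij^2
print(yhat)   # [15. 48. 24. 77.]
print(SSE)    # 102.0
```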
♣ QUESTION.
1. Why does experimental error analysis matter in SPE with the DOE method, both mathematically and practically? What other factorial-design techniques have been developed for SPE?
SUMMARY
Regarding the use of 2^m designs, we observe the following.
1. Presence or absence of interactions is not a function of the experimental plan, but is a function
of the scientific problem under investigation.
3. Usually, estimating the main effects, two-factor interactions, and possibly three-factor interactions
is a good starting place in a research program.
4. If a treatment contrast also involves a contrast among block effects, the two contrasts are said to be confounded: the treatment effect and the block effect cannot then be estimated separately.
PROBLEM
PROBLEM 10.2 (A 2 × 3 Two-Factor Factorial with analysis in R).
An experiment was run to investigate how the type of glass G and the type of phosphorescent coating P affect the brightness of a light bulb.
The response variable Y is the current (measured in microamps) needed to obtain a specified brightness. The data, with a = 2, b = 3 and n = 3, are given in the table below.
             Phosphor Type
              A    B    C
Glass    1   278  297  273
Type         291  304  284
             285  296  288
         2   229  259  228
             235  249  225
             241  241  235
[Interaction plots: mean of light vs. glass type (lines by phosphor A, B, C) and mean of light vs. phosphor type (lines by glass 1, 2); the profiles are nearly parallel.]
• The 𝐺 * 𝑃 = 𝐺𝑙𝑎𝑠𝑠 * 𝑃 ℎ𝑜𝑠𝑝ℎ𝑜𝑟 interaction is not significant (p-value = .9130). This is obvious from
the strong parallelism in the interaction plots.
PROBLEM 10.3.
PROBLEM FORMULATION:
A virologist is interested in studying the effects of environment and proliferating time on the growth of a particular virus, using a two-factor factorial experiment with the data below.
                    M = m1 (Medium 1)    M = m2 (Medium 2)
T = t1 = 12 hours   21 23 20 22 28 26    25 24 29 26 25 27
T = t2 = 18 hours   37 38 35 39 38 36    31 29 30 34 33 35
She performs 𝑛 = 6 (balanced design) replicates for each of the 4 𝑀 * 𝑇 combinations. Here
𝐴1 𝐵1 means 𝑚1 𝑡1 =(Medium 1, 12 hours), and so on. The 𝑁 = 24 measurements were taken in a
completely randomized order. The results are given in Table 10.12. The virologist wants to answer
the following questions.
♣ QUESTION.
1. What effects do media 𝑀 and growing time 𝑇 have on the virus’s proliferation?
2. Is there a choice of living media that would give uniformly strong proliferation regardless of time?
GUIDANCE for solving.
The 2nd question is particularly important. We may find a media alternative that is not greatly
affected by time. If this is so, we can make the living media robust to time variation in the actual
environment. This is an example of using experimental design for robust product design, a very
important engineering and scientific problem.
                    M = m1 (Medium 1)       M = m2 (Medium 2)
T = t1 = 12 hours   ȳ11 = 140/n = 23.3      ȳ12 = 156/n = 26
T = t2 = 18 hours   ȳ21 = 223/n = 37.16     ȳ22 = 192/n = 32
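These cell means, and the differing time effects within each medium, can be checked directly from the raw data of Table 10.12; a short sketch:

```python
# Reproducing the cell means of Table 10.13 and the change-in-response
# comparisons that suggest an M*T interaction (raw data from Table 10.12).
m1_t1 = [21, 23, 20, 22, 28, 26]
m2_t1 = [25, 24, 29, 26, 25, 27]
m1_t2 = [37, 38, 35, 39, 38, 36]
m2_t2 = [31, 29, 30, 34, 33, 35]

mean = lambda v: sum(v) / len(v)
print(mean(m1_t1), mean(m2_t1))   # about 23.33 and 26.0
print(mean(m1_t2), mean(m2_t2))   # about 37.17 and 32.0

# Effect of changing T from 12 to 18 hours, within each medium:
print(mean(m1_t2) - mean(m1_t1))  # about 13.83 in medium 1
print(mean(m2_t2) - mean(m2_t1))  # 6.0 in medium 2: the effects clearly differ
```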
From the response means at each cell (i, j) in Table 10.13, and from Definition 10.3 (a factor's effect is the change in response produced by a change in the level of the factor), we can distinguish two cases.
1. Either the effect on the response of changing T from 12 to 18 hours depends on the level of M, as shown in the left panel of the figure below;
[Interaction plots: mean of growth vs. time (lines by medium 1, 2) and mean of growth vs. medium (lines by time 12, 18).]
2. Or the effect on the response of changing M from medium 1 to medium 2 depends on the level of T, as shown in the right panel of the figure above.
If these pairs of effects are significantly different, then we say there is a significant interaction between the factors M and T. Here, for both cases, the two effects clearly differ.
The command lm(response ~ A * B, data = DATAFRAME) returns the linear model determined by Equation (10.38). The last command, qqnorm(), produces the normal probability plot, which helps to detect real high-order interactions, as discussed later in Section 10.5.6.
[Normal probability plot: Sample Quantiles vs. Theoretical Quantiles.]
Figure 10.10: The normal probability plot for factor Time and Medium
Source of variation DF SS MS F
𝐴 𝑎−1=1 𝑆𝑆𝐴 𝑀 𝑆𝐴 𝐹𝐴
𝐵 𝑏−1=1 𝑆𝑆𝐵 𝑀 𝑆𝐵 𝐹𝐵
𝐴 * 𝐵 interaction (𝑎 − 1)(𝑏 − 1) = 1 𝑆𝑆𝐴𝐵 𝑀 𝑆𝐴𝐵 𝐹𝐴𝐵
Error 𝑑𝑓𝐸 = 𝑎𝑏(𝑛 − 1) = 20 𝑆𝑆𝐸 𝑀 𝑆𝐸 -
Total 𝑑𝑓𝑇 = 𝑁 − 1 = 𝑎𝑏𝑛 − 1 𝑆𝑆𝑇 - -
CONCLUSION
• Using the above ANOVA table, for a 2-factor factorial experiment, we observe that there are three
degrees of freedom between the four treatment combinations in the 22 design.
Two degrees of freedom are associated with the main effects of 𝐴 and 𝐵, and 1 degree of freedom
is associated with a 2-interaction 𝐴𝐵.
10.8 The simplest case of 3^k factorial design
The simplest 3^k design has two factors, each at 3 levels; it is denoted the 3² design and shown in Figure 10.11. There are 3² = 9 treatment combinations (runs), and 8 degrees of freedom between these treatment combinations. The nine runs are denoted by either a_i b_j or just (i, j), where i and j assume the values 0, 1, 2. The common analysis of variance with polynomial decomposition takes the form shown in the table below.
Table 10.14: Analysis of Variance Table, 32 design
Source df
𝐴 2 linear and quadratic
𝐵 2 linear and quadratic
𝐴*𝐵 4 linear × linear, linear × quadratic
quadratic × linear, quadratic × quadratic
1. The main effects of 𝐴 and 𝐵 each have two degrees of freedom, and the 𝐴𝐵 interaction has four
degrees of freedom. If there are 𝑛 replicates, there will be 𝑛·3² − 1 total degrees of freedom and
3²(𝑛 − 1) degrees of freedom for error.
2. We fit the data to estimate both the linear and the quadratic effects of each factor.
In total there are two main-effect terms for each factor (linear and quadratic) and 4 interaction effects.
3. In the ANOVA table, the sums of squares for 𝐴, 𝐵 and 𝐴*𝐵 may be computed by the usual methods
for factorial designs, presented in Section 10.5.2.
* Each main effect can be represented by a linear and a quadratic component, each with a single
degree of freedom. This is meaningful only if the factor is quantitative. For example, 𝐴’s main
effect includes terms 𝛽1 𝑥1 and 𝛽2 𝑥21 .
The 2-factor interaction 𝐴 * 𝐵 may be partitioned in two ways: (A) linear model or (B) orthogonal
Latin squares.
The 2-factor interaction 𝐴 * 𝐵 may be partitioned by subdividing 𝐴 * 𝐵 into the four single-degree-
of-freedom components corresponding to 𝐴 * 𝐵𝐿𝐿 , 𝐴 * 𝐵𝐿𝑄 , 𝐴 * 𝐵𝑄𝐿 , and 𝐴 * 𝐵𝑄𝑄 .
This can be done by fitting the terms 𝛽4 𝑥1 𝑥2 , 𝛽7 𝑥1 𝑥22 , 𝛽5 𝑥21 𝑥2 , and 𝛽8 𝑥21 𝑥22 respectively.
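Collecting the terms just listed, the full regression model for a quantitative 3² design has nine parameters, one per cell. The 𝛽-numbering follows the components named above; 𝛽₃ and 𝛽₆ (our labels here, not given explicitly in the text) carry 𝐵's linear and quadratic effects:

```latex
y = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \beta_3 x_2 + \beta_6 x_2^2
    + \beta_4 x_1 x_2 + \beta_5 x_1^2 x_2 + \beta_7 x_1 x_2^2
    + \beta_8 x_1^2 x_2^2 + \varepsilon .
```

The nine coefficients match the 3² = 9 runs of the design, so without replication this model is saturated.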
Now suppose there are three factors (A, B, and C) under study and that each factor is at three levels
arranged in a factorial experiment. This is a 33 factorial design, and the experimental layout and
treatment combination notation are shown in Figure 10.11.
Factorial structure:
♦ EXAMPLE 10.15.
Oikawa (1987) reported the results of a 3³ factorial experiment to investigate the effects of three factors 𝐴, 𝐵, 𝐶
on the stress levels of a membrane 𝑌.
The data is given in file STRESS.csv. The first three columns of the data file provide the levels of
the three factors, and column 4 presents the stress values 𝑌 .
> data(STRESS)
> summary(lm(stress ~ (A+B+C+I(A^2)+I(B^2)+I(C^2))^3, data=STRESS))
Call: lm.default(formula = stress ~ (A + B + C + I(A^2) + I(B^2) +
I(C^2))^3, data = STRESS)
Residuals:
ALL 27 residuals are 0: no residual degrees of freedom!
Coefficients: (15 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 191.8000 NA NA NA
A 38.5000 NA NA NA
B -46.5000 NA NA NA
C 63.0000 NA NA NA
I(A^2) 0.2000 NA NA NA
I(B^2) 14.0000 NA NA NA
I(C^2) -27.3000 NA NA NA
A:B -32.7500 NA NA NA
A:C 26.4500 NA NA NA
(Footnote: Oikawa, T. and Oka, T. (1987). New Techniques for Approximating the Stress in Pad-Type Nozzles Attached to a Spherical Shell. Transactions of the American Society of Mechanical Engineers, May, 188-192.)
A:I(A^2) NA NA NA
...
I(A^2):I(B^2):I(C^2) NA
> summary(aov(stress ~ (A+B+C)^3 +I(A^2)+I(B^2)+I(C^2), data=STRESS))
Df Sum Sq Mean Sq F value Pr(>F)
A 1 36315 36315 378.470 1.47e-12 ***
B 1 32504 32504 338.751 3.43e-12 ***
C 1 12944 12944 134.904 3.30e-09 ***
I(A^2) 1 183 183 1.911 0.185877
I(B^2) 1 2322 2322 24.199 0.000154 ***
I(C^2) 1 4536 4536 47.270 3.73e-06 ***
A:B 1 3290 3290 34.289 2.44e-05 ***
A:C 1 6138 6138 63.971 5.56e-07 ***
B:C 1 183 183 1.910 0.185919
A:B:C 1 32 32 0.338 0.569268
Residuals 16 1535 96
Table 10.15: The LSE (least squares estimates) of the parameters of the 3³ system
In Figures 10.12 and 10.13 we present the main-effects and interaction plots.
10.9
COMPLEMENT: Ternary Factorial Design
We discuss in the present section estimation and testing of model parameters when the design is a
full factorial 3^𝑚, of 𝑚 factors each at 𝑝 = 3 levels. We assume that the levels are measured on a
continuous scale and labeled Low, Medium and High.
When the factors are quantitative, we use the indices 𝑖𝑗 (𝑗 = 1, · · · , 𝑚), which assume the values
0, 1, 2, or alternatively −1, 0, 1, for the Low, Medium and High levels of each factor, correspondingly. This
facilitates fitting a regression model relating the response to the factor levels.
Each treatment combination in the 3^𝑚 design is thus denoted by 𝑚 digits (𝑖1, 𝑖2, · · · , 𝑖𝑚), where the
first digit indicates the level of factor 𝐴, the second digit the level of factor 𝐵, and so on.
For example, in a 32 design, 00 denotes the treatment combination corresponding to 𝐴 and 𝐵 both
at the low level, and 01 denotes the treatment combination corresponding to 𝐴 at the low level and
𝐵 at the medium (intermediate) level.
It is simple to transform the values (levels) of each factor from the 0, 1, 2 system to the −1, 0, 1 system: use

X_j = \begin{cases} -1, & \text{if } i_j = 0 \\ 0, & \text{if } i_j = 1 \\ 1, & \text{if } i_j = 2, \end{cases}

that is, X_j = i_j - 1.
However, the matrix of coefficients 𝑋 that is obtained when we include quadratic and interaction
parameters is not orthogonal. This then requires the use of a computer to obtain the least squares estimates.
• Let Y 𝜈 denote the mean yield of 𝑛 replicas of the 𝜈-th treatment combination, 𝑛 ≥ 1. Since we
obtain the yield at three levels of each factor we can, in addition to the linear effects, estimate also
the quadratic effects of each factor.
The concepts utilized in the 32 and 33 designs can be readily extended to the case of m factors, each
at three levels, that is, to a 3𝑚 factorial design.
Thus, for example, the vector (0, 0, · · · , 0) represents the grand mean 𝜇 = 𝛾0 ;
a vector (0, 0, · · · , 1, 0, · · · , 0), with 1 at the 𝑖-th component, represents the linear effect of the 𝑖-th
factor. Similarly, (0, 0, · · · , 2, 0, · · · , 0), with 2 at the 𝑖-th component, represents the quadratic effect
of the 𝑖-th factor.
When 𝑚 = 4, the series 0120 represents a treatment combination in a 34 design with 𝐴 and
𝐷 at the low levels, 𝐵 at the intermediate level, and 𝐶 at the high level. There are 3𝑚 treatment
combinations, with 3𝑚 − 1 degrees of freedom between them.
\omega = \sum_{j=1}^{m} \lambda_j 3^{j-1}, \qquad \omega = 0, \ldots, 3^m - 1.
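As a quick check of this indexing, with 𝑚 = 3 the parameter 𝐴𝐵² (i.e. (𝜆1, 𝜆2, 𝜆3) = (1, 2, 0)) receives the index

```latex
\omega = 1 \cdot 3^{0} + 2 \cdot 3^{1} + 0 \cdot 3^{2} = 7,
```

and the 3³ = 27 values 𝜔 = 0, 1, …, 26 enumerate 𝜇 together with all main effects and interactions exactly once.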
ELUCIDATION
If 𝑚 is not too large, it is also customary to label the factors by the letters 𝐴, 𝐵, 𝐶, · · · and the
parameters by 𝐴𝜆1 𝐵 𝜆2 𝐶 𝜆3 ··· . In this notation a letter to the zero power is omitted. When 𝑚 = 3, the
main effects and interactions are listed in Table 10.16.
The size of the design increases rapidly with 𝑚. For example, a 33 design has 27 treatment
combinations per replication, a 34 design has 81, a 35 design has 243, and so on. Therefore, only a
single replicate of the 3𝑚 design is frequently considered, and higher order interactions are combined
to provide an estimate of error.
SUMMARY
MATHEMATICAL MODELS, DESIGNS And ALGORITHMS
CHAPTER 10. STATISTICALLY DESIGNED EXPERIMENTS
360 FOR SYSTEM PERFORMANCE EVALUATION
allowing up to 𝑚-order interaction effects. Briefly, in this additive model of the response we have
in total 3^𝑚 parameters, where

3^m = \sum_{j=0}^{m} \binom{m}{j} 2^j .
Introduction
We discuss in this chapter a few topics aimed at understanding the key principles of various designs,
at properly analyzing experimental results when conducting experiments in practice, and at using
designs to compare different system configurations.
We study
* Differences between using coded design variables and engineering raw units,
Learning Outcomes
11.1
What are good Fractional Factorial Designs?
We used the sign-table method in EXAMPLE 10.14 to study and analyze the simple binary 2² design. The
influences of the predictors on the response are explained via the variation, specifically expressed via the
sums of squares and the relation 𝑆𝑆𝑌 = 𝑆𝑆𝑂 + 𝑆𝑆𝐴 + 𝑆𝑆𝐵 + 𝑆𝑆𝐴𝐵 + 𝑆𝑆𝐸, after obtaining the regression
model fitted to the empirical observations, 𝑦̂𝑖 = 𝑞0 + 𝑞𝐴 𝑥𝐴𝑖 + 𝑞𝐵 𝑥𝐵𝑖 + 𝑞𝐴𝐵 𝑥𝐴𝑖 𝑥𝐵𝑖.
Id →  𝑞0     𝐴     𝐵     𝐴𝐵    𝑦 (responses)    ȳ
       1    −1    −1     1    (15, 18, 12)     15
       1     1    −1    −1    (45, 48, 51)     48
       1    −1     1    −1    (25, 28, 19)     24
       1     1     1     1    (75, 75, 81)     77
     164    86    38    20    Total = 164
      41  21.5   9.5     5    Total/4 = 41 = ȳ̄ = 𝑞0
Good factorial designs have a small number of runs, and their sign vectors are orthogonal; more
precisely, mutual orthogonality holds.
Proposition 11.1. Mutual Orthogonality between sign vectors of factors in the regression of binary
designs 2𝑘 (and their fractions 2𝑘−𝑝 in Section 11.1.2) include
For example, with 𝑚 = 3, check the mentioned mutual orthogonality of the design in 3 factors with
4 replicates, and 𝑝 = 0 given in table below.
Exp. no. A B C D E F G
1 -1 -1 -1 1 1 1 -1
2 1 -1 -1 -1 -1 1 1
3 -1 1 -1 -1 1 -1 1
4 1 1 -1 1 -1 -1 -1
5 -1 -1 1 1 -1 -1 1
6 1 -1 1 -1 1 -1 -1
7 -1 1 1 -1 -1 1 -1
8 1 1 1 1 1 1 1
𝑦 = 𝑞0 + 𝑞𝐴 𝑥𝐴 + 𝑞𝐵 𝑥𝐵 + 𝑞𝐶 𝑥𝐶 + 𝑞𝐷 𝑥𝐷 + 𝑞𝐸 𝑥𝐸 + 𝑞𝐹 𝑥𝐹 + 𝑞𝐺 𝑥𝐺
with only main effects, the orthogonality property supports the formulation as follows.
q_A = \frac{1}{8} \sum_i y_i x_{Ai} = \frac{-y_1 + y_2 - y_3 + y_4 - y_5 + y_6 - y_7 + y_8}{8}, \quad \ldots,

q_G = \frac{1}{8} \sum_i y_i x_{Gi} = \frac{-y_1 + y_2 + y_3 - y_4 + y_5 - y_6 - y_7 + y_8}{8}.
• Important parameters are not controlled; Effects of different factors are not isolated
These issues motivate new solutions and the effective use of fractional designs in the next parts.
11.2
Binary Fractional Designs- Computation
So far we have seen that it is possible to use a single replicate of a factorial (set of treatments) to
obtain estimates of main effects and two-factor interactions, using high-order interaction MS (mean
square) to estimate 𝜎 2 . The full set of requirements which need to be considered when proposing
the use of a single-replicate design are (due to [65, Chapter 14]):
4. and it should be possible to obtain an estimate of 𝜎 2 using higher-order interaction MS for interac-
tions likely to have only small effects.
We will develop designs using only a fraction (of a full binary factorial), which will enable most of
these four crucial requirements to be satisfied. A design using only a fraction of the possible factorial
treatment combinations is called a fractional design, or just a fraction.
• Fractional replicates can be useful in a variety of forms ranging from using half of the possible
combinations and achieving all four requirements to using a tiny proportion of the possible combi-
nations in a saturated design, and being able to satisfy only Item 1.
• In these designs we shall ignore Item 3, the estimation of three-factor interactions, since in practice it is not often relevant to look for interactions involving three factors, and it is very difficult to
make a sensible interpretation when such interactions appear to be large.
♦ EXAMPLE 11.3 (Minimum size Fractional Design of the 211 full binary).
Suppose that we wish to investigate the first six factors (among a total of 11 binary factors in cell phone
manufacturing), say C = Color, S = Shape, W = Weight, M = Material, P = Price
and O = OS (operating system), each at two levels, but that an experiment of 64 observations is too
large for the available resources.
C S W M P O Cam. Wifi Anti. Ant. Place   Run
0 0 0 0 0 0  0    0    0    0    0        1
1 1 1 0 1 1  0    1    0    0    0        2
0 1 1 1 0 1  1    0    1    0    0        3
0 0 1 1 1 0  1    1    0    1    0        4
0 0 0 1 1 1  0    1    1    0    1        5
1 0 0 0 1 1  1    0    1    1    0        6
0 1 0 0 0 1  1    1    0    1    1        7
1 0 1 0 0 0  1    1    1    0    1        8
1 1 0 1 0 0  0    1    1    1    0        9
0 1 1 0 1 0  0    0    1    1    1       10
1 0 1 1 0 1  0    0    0    1    1       11
1 1 0 1 1 0  1    0    0    0    1       12
To assess all the main effects and two-factor interactions would require 6 and 15 df, respectively.
Using half of the 26 = 64 combinations would give a total of 31 df so that after estimating the
main effects and two-factor interactions there could be 10 df available to estimate 𝜎 2 . Therefore
the question from Item 4. is whether we can identify a suitable set of 32 combinations from the total
64 combinations, which allows us to estimate the effects and 𝜎 2 ? ■
Full factorial experiments with a large number of factors, even in the binary case of 2^𝑘, might be impractical. For example, if there are 𝑘 = 12 factors, even at ℎ = 2 levels, the total number of treatment
combinations is ℎ^𝑘 = 2^12 = 4096. This size of an experiment is generally not necessary, because
most of the high order interactions might be negligible and there is no need to estimate 4096 param-
eters.
• If only main effects and first-order interactions are considered a priori of importance, while all the
rest are believed to be negligible, we have to estimate and test only

1 + k + \binom{k}{2} = 1 + 12 + \binom{12}{2} = 1 + 12 + 66 = 79

parameters.
• A fraction of the experiment of size 2^7 = 128 would be sufficient. Such a fraction can even be
replicated several times.
Definition 11.2.
Formally, we fix 𝑑 finite sets 𝑄1 , 𝑄2 , . . . , 𝑄𝑑 called factors, where 1 < 𝑑 ∈ N. The elements of a
factor are called its levels.
• The (full) factorial design (also factorial experiment design- FED) with respect to these factors
is the Cartesian product 𝐷 = 𝑄1 × 𝑄2 × . . . × 𝑄𝑑 .
(a) Constructing and/or designing: to learn how to construct those experiments, given the scope of
expected commodities and the parameters of components;
(b) Exploring and selecting: to investigate some design characteristics (proposed by researchers)
to choose good designs. For instance, in factorial designs we learn how to detect interactions
between factors and, if they exist, calculate how strongly they could affect the outcomes; finally
(c) Implementing, analyzing & consulting: study how to use (i.e., conduct experiments in applications,
measure outcomes, analyze data obtained, and consult clients).
The goal is to use such new understanding to improve product, to answer questions as:
♣ QUESTION 1. In consideration of using fractional factorial design, how do we choose the fraction
of the full factorial in such a way that desirable properties of
Generally, for smaller fractions, which are more likely the case if the factorial is larger, we will need
more than one defining equation (or parameter). In the next two sections we shall look at smaller
fractions for 2^𝑚 factorial structures. We first consider a powerful method, called fractionization,
for blocking the 2^𝑚 design into 2^𝑝 blocks, where 𝑝 is the number of defining equations and 0 < 𝑝 < 𝑚.
We call such a fraction a fractional factorial experiment, or just a fractional design.
In a fractional design, only a fraction of the treatment combinations are observed. This has the
advantage of saving time and money in running the experiment, but the disadvantage that each
main-effect and interaction contrast will be confounded, or aliased, with one or more other main-effect
and interaction contrasts, and so cannot be estimated separately. (Fractional factorial experiments
are used frequently in industry, especially in various stages of product development and in process
and quality improvement.)
Definition 11.3.
Treatment combinations that are confounded with each other are called aliases.
The aliases are obtained by multiplying the parameter of interest by the defining parameter. An
alias set consists of all treatment combinations that are estimated by the same contrast.
DESIGN AIM
We have to design the fractions that will be assigned to each block in such a way that, if there
are significant differences between the blocks,
then the block effects will not confound or obscure factors of interest.
Consider a specific 2³ experiment for studying the relationship between diet scheme and blood
pressure. We conducted experiments to assess the effects of diet on blood pressure in (say,
American) males. Three factors are to be measured:
• Each treatment combination 𝑟 will be administered as follows: A subject will have a baseline blood
pressure reading taken, then will be fed (at a laboratory) according to one of the eight diet plans
(treatments 𝑟).
• After three weeks, another blood pressure reading will be taken. Unfortunately, administering the
diet plans is very labor-intensive, and only four treatment combinations can be run at one time.
Thus, the experiment will be run in two blocks, each lasting three weeks. The following design, with
𝑏 = 2 blocks, was decided upon:
With eight subjects per treatment combination, the total number of runs is 𝑁 = 8𝑛 = 8 · 8 = 64 and 𝑑𝑓𝑇 = 𝑁 − 1 =
63; the ANOVA looks like the table below, where there are only 6 degrees of freedom for treatments
because of the confounding with blocks. Indeed, if we break down the 7 degrees of freedom for
treatments we can study the confounding in blocks.
Source df
Blocks 𝑑𝑓𝐵 = 𝑏 − 1 = 2 − 1 = 1
Trts 𝑑𝑓𝑇 𝑟𝑡𝑠 = 6
𝑇 ×𝐵 0
Within error 𝑑𝑓𝐸 = 2³(𝑛 − 1) = 8 · 7 = 56 = 𝑑𝑓𝑇 − 𝑑𝑓𝐵 − 𝑑𝑓𝑇𝑟𝑡𝑠
Total 𝑑𝑓𝑇 = 63
• We look at the treatment combinations corresponding to the component main effects and interac-
tions, and we can see the confounding in Table 11.1.
We see that the 𝐴𝐵𝐶 interaction is confounded with blocks, as Block 1 has all high and Block 2
has all low levels. Every other effect is balanced between the blocks, in that there are two high
levels and two low levels in each block, so no other effect is confounded with blocks. This is why
the above ANOVA table has only 6 degrees of freedom for treatments: the block sum of squares is
exactly the sum of squares due to the three-way interaction.
Definition 11.4.
The resolution of a 2𝑚−𝑘 design is the length of the smallest word (shortest parameter, excluding
𝜇) in the subgroup of defining (generating) parameters or just generators.
• Resolution III designs: designs in which no main effects are aliased with any other main effect,
but main effects are aliased with two-factor interactions, and some two-factor interactions may be
aliased with each other.
• Resolution IV designs: designs in which no main effect is aliased with any other main effect or
2-factor interactions, but 2-factor interactions can be aliased with each other.
• Resolution V designs: no main effect or two-factor interaction is aliased with any other main effect
or two-factor interaction, but two-factor interactions are aliased with three-factor interactions.
Designs of resolution 𝑅 = 𝐼𝐼𝐼, 𝐼𝑉 are useful in factor screening experiments.
Fractionization
♦ EXAMPLE 11.5.
We illustrate the construction of fractions specifically via the 2𝑚−𝑝 = 28−4 design. Here we construct
two fractions (blocks), each of size 16. As discussed before, 𝑝 = 4, so four generating parameters
should be specified. (Note that orthogonal arrays (Definition ??) include both regular designs and
irregular designs, the latter not definable by generator words; see Montgomery [19] and Wu [112].)
Let these generators be 𝐵𝐶𝐷𝐸, 𝐴𝐶𝐷𝐹, 𝐴𝐵𝐶𝐺, 𝐴𝐵𝐷𝐻. These parameters generate a resolution
IV design where the degree of fractionation is 𝑝 = 4.
• The blocks can be indexed 0, 1, · · · , 15. Each index is determined by the signs of the four
generators, which determine the blocks. Each block is a fractional design having 2𝑚−𝑝 = 16
runs.
• Thus, the signs (−1, −1, 1, 1) correspond to (0, 0, 1, 1), which yields the index

\sum_{j=1}^{4} i_j 2^{j-1} = 0 \cdot 1 + 0 \cdot 2 + 1 \cdot 4 + 1 \cdot 8 = 12.
In the following table two blocks (fractions), derived with the software R, are printed.
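Such a fraction can also be built directly in base R. The following sketch is our illustration (not the book's code): it constructs the principal fraction in which each generating word equals +1, taking A, B, C, D as base factors and deriving E = BCD, F = ACD, G = ABC, H = ABD, one common choice consistent with the generators above.

```r
# Principal 2^(8-4) fraction: full 2^4 in the base factors A, B, C, D,
# with E, F, G, H derived so that BCDE = ACDF = ABCG = ABDH = +1.
base <- expand.grid(A = c(-1, 1), B = c(-1, 1), C = c(-1, 1), D = c(-1, 1))
frac <- base
frac$E <- with(base, B * C * D)   # from word BCDE
frac$F <- with(base, A * C * D)   # from word ACDF
frac$G <- with(base, A * B * C)   # from word ABCG
frac$H <- with(base, A * B * D)   # from word ABDH
nrow(frac)    # 16 runs = 2^(8-4)
# Resolution IV: the eight main-effect columns are mutually orthogonal,
# so the cross-product matrix is 16 * I_8.
M <- as.matrix(frac)
all(crossprod(M) == 16 * diag(8))   # TRUE
```

Each of the 16 block indices described above corresponds to one choice of signs for the four generators; flipping a generator's sign flips the corresponding derived column.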
11.3
Binary Fractional Factorial Designs- Analysis
As an essential illustration of full factorial designs, we present the 2^𝑚 factorial designs, the simplest
full factorials of 𝑚 factors, each factor at two levels.
Thus 𝑎 denotes the combination where 𝐴 is at the high and 𝐵 is at the low level,
𝑏 denotes the combination where 𝐴 is at the low and 𝐵 is at the high level, and the treatment
combination
𝑎𝑏 denotes the combination where both treatments are at the high level.
𝑎𝑏𝑐 denotes the combination where each treatment is at the high level, and
𝑐 denotes the combination where 𝐴, 𝐵 are at the low, and 𝐶 is at high level.
• (III) We also symbolically denote by 𝐴1, 𝐴2 the levels of 𝐴, and by 𝐵1, 𝐵2 the levels of 𝐵, and use
(coupled with the above notation) these new symbolic representations to define their main effects in
the regression model and ANOVA computation.
• 𝐴2𝐵1: 𝐴 at high level, 𝐵 at low level; 𝐴2𝐵2: 𝐴 at high level, 𝐵 at high level. The design 𝐴 × 𝐵 = 2²
is a complete factorial design without replication.
In this design, the term factorial signifies the inclusion of all combinations of levels of factors in
the experiment (no connection between this term and the factorial function).
Table 11.2: Four systems of notation for interactions in 22 design with Yates’ order
Here the quantity represented by 𝐴1 𝐵1 merely got renamed to (1) in the 3rd column (Modern
notation). The special symbol (1) is used to represent the control, or the treatment combination with
both factors at the low level. All four systems are used in practice, and we will use at least the last
three of them. The modern notation is more compact and makes it a lot easier to extend our analysis.
The explicit inclusion of a letter in modern notation indicates that the factor is at its high level; thus,
(𝐴) represents 𝐴 at the high level and, by the absence of 𝐵's letter, 𝐵 at the low level. The combinations
are listed in Yates' standard order; that is, each letter is followed by all combinations of that letter and
the letters previously introduced.
The response value at a generic treatment combination (𝑇) is denoted by 𝑦𝑡 ≡ 𝑦(𝑇); so 𝑦(1) = 𝑦𝐴=1,𝐵=1,
𝑦𝑎 ≡ 𝑦(𝐴), 𝑦𝑏 ≡ 𝑦(𝐵), etc.
𝑇𝐴2 = 𝑦𝑎 + 𝑦𝑎𝑏 := 𝑦(𝐴) + 𝑦(𝐴𝐵) as the total of responses over factor 𝐵 for level 2 of 𝐴,
𝑇𝐴1 = 𝑦(1) + 𝑦𝑏 = 𝑦(1) + 𝑦(𝐵) as the total of responses over factor 𝐵 at level 1 of 𝐴, as in Table 11.3.
The response means at the high and low levels of 𝐴 are respectively summarized by

\bar{y}_{A_2} = \frac{T_{A_2}}{N/2}, \qquad \bar{y}_{A_1} = \frac{T_{A_1}}{N/2},

where 𝑁 is the total number of units in the experiment (𝑁/2 = 2𝑛 units at each level of 𝐴) and 𝑛 is the
number of replicates. We estimate the main effect of factor 𝐴 as

\tau_A = \bar{y}_{A_2} - \bar{y}_{A_1} = \frac{T_{A_2} - T_{A_1}}{2n} = \frac{ab + a - b - (1)}{2n}, \qquad (11.2)
(by convention of dropping 𝑦 in response 𝑦△ to only △), and the main effect of factor 𝐵 similarly as
\tau_B = \bar{y}_{B_2} - \bar{y}_{B_1} = \frac{T_{B_2} - T_{B_1}}{2n} = \frac{ab + b - a - (1)}{2n}. \qquad (11.3)
The interaction effect of 𝐴 and 𝐵, denoted by 𝜏𝐴𝐵, is determined as the gap between the
average responses at both 'extreme' choices of 𝐴, 𝐵 and those at their mixed choices:

\tau_{AB} = \frac{(T_{A_2,B_2} + T_{A_1,B_1}) - (T_{A_2,B_1} + T_{A_1,B_2})}{2} = \frac{ab + (1) - a - b}{2}, \qquad (11.4)

where 𝑇𝐴2,𝐵2 = 𝑦𝑎𝑏 = 𝑎𝑏 is the response with factors 𝐴, 𝐵 both at level 2 (high),
𝑇𝐴1,𝐵1 = 𝑦(1) = (1) is the response with 𝐴, 𝐵 both at level 1 (low),
𝑇𝐴2,𝐵1 = 𝑦𝑎 = 𝑎 is the response with 𝐴 at level 2 and 𝐵 at level 1, and
𝑇𝐴1,𝐵2 = 𝑦𝑏 = 𝑏 is the response with 𝐴 at level 1 and 𝐵 at level 2. ■
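For a quick numerical illustration (using the cell means 15, 48, 24, 77 of the sign-table example in Section 11.1 as single responses, so 𝑛 = 1):

```latex
\tau_A = \frac{77 + 48 - 24 - 15}{2} = 43, \quad
\tau_B = \frac{77 + 24 - 48 - 15}{2} = 19, \quad
\tau_{AB} = \frac{77 + 15 - 48 - 24}{2} = 10,
```

each exactly twice the corresponding regression coefficient 𝑞𝐴 = 21.5, 𝑞𝐵 = 9.5, 𝑞𝐴𝐵 = 5 found there, since an effect measures the response change over two coded units.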
NOTE: The study of Confounding in Binary Factorial Designs will be detailed later in Section 11.7.
We now connect binary design with linear regression.
11.4
Work with Coded Design Variables
2) For a 2^𝑚 design, should we estimate all the 2^𝑚 parameters or terms in the coupled regression
model?
We have so far performed all of the analysis and model fitting for a 2^𝑚 design in terms of the coded design variables,
• and not the design factors in their original units (sometimes called actual, natural, or engineering
units).
When the engineering units are used in analyzing observed data, we can obtain numerical results
different from the coded-unit analysis, and often the results will not be as easy to interpret. The
analysis of these data via empirical modeling lends some insight into the value of coded units and
the engineering units in designed experiments. To illustrate some of the differences between the two
analyses, consider the following experiment.
A simple DC-circuit is constructed in which two different resistors, 1 and 2Ω, can be connected.
The circuit also contains an ammeter and a variable-output power supply. With a resistor installed
in the circuit, the power supply is adjusted until a current flow of either 4 or 6 amps is obtained.
Then the voltage output of the power supply is read from a voltmeter (in the last column). Two
replicates of a 2² factorial design are performed, and Table 11.4 presents the results.
We present the regression models obtained using the design variables in the usual coded variables
(𝑥1 = current and 𝑥2 = resistance) and then in the engineering units, respectively. If the coded
variables 𝑥1 = 𝐼 and 𝑥2 = 𝑅 use only values −1 and +1 then orthogonality here means the
inner product of two coded vectors 𝑥1 and 𝑥2 equals 0. They clearly are orthogonal.
♣ QUESTION. What is the frequency of treatment combinations from the two engineering design
variables 𝐼 and 𝑅?
1. Consider first the coded variable analysis with R code and ANOVA output.
* The regression equation is 𝑉 = 7.50 + 1.52𝑥1 + 2.53𝑥2 + 0.458 𝑥1 𝑥2 . Notice that both main ef-
fects (𝑥1 = current) and (𝑥2 = resistance) are significant as is the interaction. In the coded variable
analysis, the magnitudes of the model coefficients are directly comparable; that is, they all are di-
mensionless, and they measure the effect of changing each design factor over a one-unit interval.
* Furthermore, they are all estimated with the same precision (notice that the standard error of all
three coefficients is 0.053). Coded variables are very effective for determining the relative size of
factor effects.
2. Now consider the analysis based on the engineering units, as shown below.
In this model, only the interaction is significant. The model coefficient for the interaction term is
0.917, and the standard error is 0.1046. ■
SUMMARY.
1. Note that the regression coefficients are not dimensionless and that they are estimated with
differing precision. This is because the experimental design, with the factors in the engineering
units, is not orthogonal.
2. Generally, we conclude that the engineering units are not directly comparable, but they may
have physical meaning as in the present example. This could lead to possible simplification based
on the underlying mechanism.
3. The fact that coded variables let an experimenter see the relative importance of the design factors
is useful in practice.
• The levels of the 𝑖-th factor (𝑖 = 1, · · · , 𝑚) are fixed at 𝑥𝑖1 and 𝑥𝑖2, where 𝑥𝑖1 < 𝑥𝑖2. By a simple
transformation all factor levels can be reduced to

c_i = \begin{cases} +1, & \text{if } x = x_{i2} \\ -1, & \text{if } x = x_{i1}, \end{cases} \qquad i = 1, \cdots, m.
• In such a factorial experiment there are 2^𝑚 treatment combinations (or just treatments). Denote
by (𝑖1, · · · , 𝑖𝑚) a treatment combination, where 𝑖1, · · · , 𝑖𝑚 are indices with

i_j = \begin{cases} 0, & \text{if } c_j = -1 \\ 1, & \text{if } c_j = 1. \end{cases}
𝜈   𝑖1  𝑖2  𝑖3
0   0   0   0
1   1   0   0
2   0   1   0
3   1   1   0
4   0   0   1
5   1   0   1
6   0   1   1
7   1   1   1
Thus, if there are 𝑚 = 3 factors, the number of possible treatment combinations is 23 = 8. These
are given in Table 11.5.
We discuss now the estimation of the main effects and interaction parameters.
An important rule of thumb is that an interaction between two factors should be considered,
and acknowledged in an experimental design, unless there is an explicit understanding of why it
is acceptable to assume that it is zero.
3. A vector (0, 0, · · · , 1, 0, · · · , 0), where the 1 is in the 𝑖-th component, represents the main effect of the
𝑖-th factor (𝑖 = 1, · · · , 𝑚).
4. A vector with two ones, at the 𝑖-th and 𝑗-th components (𝑖 = 1, · · · , 𝑚 − 1; 𝑗 = 𝑖 + 1, · · · , 𝑚), represents
the first-order interaction between factors 𝑖 and 𝑗.
5. A vector with three ones, at the 𝑖-th, 𝑗-th and 𝑘-th components, represents the second-order interaction between
factors 𝑖, 𝑗, 𝑘, etc. Put

\omega = \sum_{i=1}^{m} j_i 2^{i-1} .
6. Let 𝐶2𝑚 be the matrix of coefficients, obtained recursively by the equations

C_2 = \begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix}, \qquad (11.6)
DETERMINATION OF THE REGRESSION PARAMETERS 𝛽^(2^𝑚)
Let 𝑌^(2^𝑚) be the response vector. Then the linear model relating 𝑌^(2^𝑚) to the vector

\beta^{(2^m)} = (\beta_\omega) = (\beta_0, \beta_1, \cdots, \beta_{2^m - 1})'

is

Y^{(2^m)} = C_{2^m} \cdot \beta^{(2^m)} + e^{(2^m)}. \qquad (11.8)
The least squares estimator: the column vectors of 𝐶2𝑚 are orthogonal, so
(C_{2^m})' C_{2^m} = 2^m I_{2^m}, and the least squares estimator (LSE) of 𝛽^(2^𝑚) is

\hat{\beta}^{(2^m)} = \frac{1}{2^m} (C_{2^m})' Y^{(2^m)}. \qquad (11.9)
The matrix 𝐶2𝑚 is (up to column order and signs) a Hadamard matrix. Accordingly, the LSE of 𝛽𝜔 is

\hat{\beta}_\omega = \frac{1}{2^m} \sum_{\nu=0}^{2^m - 1} c_{(\nu+1),(\omega+1)} \bar{Y}_\nu , \qquad (11.10)

where 𝑐𝑖,𝑗 is the element in the 𝑖-th row and 𝑗-th column of 𝐶2𝑚; i.e., multiply the components of the
vector 𝑌^(2^𝑚) by those of column 𝜔 + 1 of 𝐶2𝑚, then sum and divide by 2^𝑚.
• The variance 𝜎² can be estimated by the pooled variance estimator, obtained from the between-replication variance within each treatment combination. That is, if 𝑌𝜈𝑗, 𝑗 = 1, 2, · · · , 𝑛, are the
observed values at the 𝜈-th treatment combination, then

\hat{\sigma}^2 = \frac{1}{(n-1)\, 2^m} \sum_{\nu=1}^{2^m} \sum_{j=1}^{n} (Y_{\nu j} - \bar{Y}_\nu)^2 . \qquad (11.13)
SUMMARY − For 2𝑚 design, we do not have to estimate all the 2𝑚 parameters or terms in
regression model, but can restrict attention only to parameters of interest. Finally, note that factors
not studied may be influential. The primary ways of addressing these uncontrolled factors are as
follows:
In SPE or Pollution Studies, for instance, assume that at each time point 𝑡 ∈ N𝑛 = {1, 2, . . . , 𝑛} we
observe 𝑘 pollutant variables 𝑌𝑡1, 𝑌𝑡2, . . . , 𝑌𝑡𝑘, concatenated to form the random vector

\mathbf{Y}_t := [Y_{t1}, Y_{t2}, \ldots, Y_{tj}, \ldots, Y_{tk}]^T.
Multivariate linear regression (MLR for short) expresses 𝑘 output responses 𝑌𝑗 as linearly related to
the 𝑟 inputs 𝑧𝑖 (predictors, 𝑖 = 1, 2, . . . , 𝑟).
We assume the noises 𝑤𝑡𝑗 are correlated over the identifier 𝑗, but are still independent over
time, i.e. Cov[𝑤𝑠 𝑖 , 𝑤𝑡 𝑗 ] = 𝜎𝑖𝑗 for time point 𝑠 = 𝑡, and
Cov[𝑤𝑠 𝑖 , 𝑤𝑡 𝑗 ] = 0 for 𝑠 ̸= 𝑡.
Matrix form of MLR- Model fitting and selection with information criteria
𝑦𝑡 = 𝛽1 𝑧𝑡 1 + 𝛽2 𝑧𝑡 2 + · · · + 𝛽𝑟 𝑧𝑡 𝑟 + 𝑤𝑡 , 𝑡 = 1, 2, . . . , 𝑛. (11.15)
𝑦 𝑡 = ℬ 𝑧 𝑡 +𝑤𝑡 , 𝑡 = 1, 2, . . . , 𝑛. (11.16)
Here, 𝑧𝑡 := [𝑧𝑡1, 𝑧𝑡2, . . . , 𝑧𝑡𝑟]^T is the input vector at time 𝑡 of the 𝑟 predictors, and
the error process {𝑤𝑡} is assumed to consist of independent vectors of size 𝑘 × 1, with
We next fit model (11.16) from data, i.e. estimate regression coefficient matrix ℬ.
• With 𝜎̂𝑗𝑗 the 𝑗-th diagonal element of Σ̂𝑤, the estimated standard error is

\mathrm{se}(\hat{\beta}_{ij}) := \sqrt{c_{ii}\, \hat{\sigma}_{jj}}, \qquad i = 1, 2, \ldots, r; \; j = 1, 2, \ldots, k. \qquad (11.20)
We use a specific univariate time series model to fit data sets involving one time series, namely the
autoregressive model AR(𝑝), given in COMPLEMENT 11.6. We then use the MLR setting above to
study possible dynamics of phenomena of interest from data sets involving more than one time
series, via the vector autoregressive (VAR) model.
♦ Each factor 𝑥𝑡,𝑗 can be a place, temperature, pollutant level, mortality count, etc., measured at time
point 𝑡. The vector 𝑥𝑡 = [𝑥𝑡,𝑗] describes the values of all factors 𝑗, 𝑗 = 1, 2, . . . , 𝑘.
We then modify the vector 𝑥𝑡 to 𝑥𝑡(𝑠), still a 𝑘 × 1 vector, but measuring the 𝑘 factors at time 𝑡 and
point 𝑠 ∈ 𝐷.
If E[𝑥𝑡] = 𝜇 then 𝛼 = (Id − Φ)𝜇. Furthermore, if 𝜇 = 0 then we get the standard VAR(1) model.
• Note the similarity between the VAR model and the MLR model (11.16). The regression formulas
carry over by letting

y_t = x_t, \qquad \mathcal{B} = (\alpha, \Phi), \qquad z_t = (1, x'_{t-1})'. \qquad (11.24)
Observing data 𝑥1, 𝑥2, . . . , 𝑥𝑛, we might fit Model (11.21) with the estimated coefficients ℬ̂ = (𝛼̂, Φ̂),
found by the conditional MLE (11.18). The estimated covariance matrix is modified from (11.19) as

\hat{\Sigma}_w = \frac{1}{n-1} \sum_{t=2}^{n} \hat{w}_t \hat{w}_t' . \qquad (11.25)
Recall the VAR(1) model (11.21) expressing simultaneously 𝑘 responses at time 𝑡, namely
where Φ is a 𝑘 × 𝑘 transition matrix that represents the dependence of responses 𝑥𝑡 at time 𝑡 on 𝑥𝑡−1
(the responses just one time unit before time 𝑡).
• The conditional maximum likelihood estimator of the error covariance matrix Σ𝑤 = E[𝑤𝑡𝑤𝑡′] is

\hat{\Sigma}_w = \frac{\mathrm{SSE}}{n - p}, \qquad (11.28)

as in the multivariate regression case, except that now only 𝑛 − 𝑝 residuals enter the SSE.
• The selection criteria used for choosing 'good' VAR models are the popular AIC, the AICc given by

\mathrm{AIC}_c = \ln |\hat{\Sigma}_w| + \frac{k(r+n)}{n - (k + r + 1)}, \qquad (11.29)

and the BIC, given by

\mathrm{BIC} = \ln |\hat{\Sigma}_w| + \frac{k^2 p \ln n}{n}. \qquad (11.30)
We use the R package vars to fit vector AR models to the DienChau data via least squares.
We will select a VAR(𝑝) model and then fit the model automatically, using the R function VARselect
with AIC-based information criteria.
setwd("E:/Computation/COVID19-ANALYSIS/"); getwd()
# datdc <- read.csv("E:/Computation/COVID19ANALYS/Dien_Chau_data.csv")
datdc <- read.csv(file.choose())
tempr <- datdc$temp                  # temperature, column 2
part  <- datdc$dewp                  # column 3, representing particulate level PM2.5
cmort <- as.numeric(datdc$all)       # all-cause mortality, column 10
y <- cbind(cmort, tempr, part); ts.plot(y, col = 1:3); library(vars)
VARselect(y, lag.max = 10, type = "both")    # compare information criteria
summary(fit <- VAR(y, p = 2, type = "both")) # "both" fits constant + trend
OUTPUT
$selection
AIC(n)  HQ(n)  SC(n) FPE(n)
     4      3      1      4
VAR Estimation Results:
Endogenous variables: cmort, tempr, part
Deterministic variables: both
Sample size: 357
Log Likelihood: -2276.193
Roots of the characteristic polynomial:
0.9481 0.629 0.2508 0.2508 0.1567 0.09881
Call: VAR(y = y, p = 2, type = "both")
Estimation results for equation part: [more interesting!]
=====================================
part = cmort.l1 + tempr.l1 + part.l1 + cmort.l2 + tempr.l2 + part.l2 + const + trend
Estimate Std. Error t value Pr(>|t|)
cmort.l1 -0.0644332 0.0583374 -1.104 0.270141
tempr.l1 0.3844768 0.1004902 3.826 0.000154 ***
part.l1 0.9333421 0.0620427 15.044 < 2e-16 ***
cmort.l2 -0.0491436 0.0570606 -0.861 0.389689
tempr.l2 -0.0982051 0.1023111 -0.960 0.337786
part.l2 -0.1824492 0.0613116 -2.976 0.003126 **
const 11.0149952 1.8269184 6.029 4.19e-09 ***
trend -0.0003455 0.0014298 -0.242 0.809174
Correlation matrix of residuals:
cmort tempr part
cmort 1.00000 0.02441 0.02515
tempr 0.02441 1.00000 0.54087
part 0.02515 0.54087 1.00000
SUMMARY
• Significantly, the particulate level PM2.5 is fairly strongly correlated with the temperature, ρ_PT = 0.54.
• VARselect returns the information criteria and the final prediction error for lag orders increasing sequentially up to a VAR(p) process; all are based on the same sample size.
• Note that BIC (SC) picks an order p = 1 model while AIC and FPE (Final Prediction Error) pick an order p = 4 model. Using the p = 2 fit above and the notation of the previous example, the prediction model for particulate PM2.5 is
𝑃̂︀𝑡 = 11 − 0.001𝑡 − 0.06 𝑀𝑡−1 + 0.38 𝑇𝑡−1 + 0.93 𝑃𝑡−1 − 0.05 𝑀𝑡−2 − 0.1 𝑇𝑡−2 − 0.2 𝑃𝑡−2
• When BIC (SC) picks the order p = 1, the all-cause mortality is estimated as M̂_t = α + βt + . . .
What should we call the term α + βt = 11 − 0.001t in the fitted model P̂_t?
It is viewed as the trend effect on the response, and such a term is generally named an exogenous factor.
11.6 COMPLEMENT: Autoregressive Process AR(p)
Definition 11.7. A simple autoregressive model of order p, denoted AR(p), is a univariate time series {X_t} where X_t (the present value at a given time t) is given as a linear combination of the p past values X_{t-1}, X_{t-2}, . . . , X_{t-p}. Precisely, AR(p) is given as
X_t = φ_1 X_{t-1} + φ_2 X_{t-2} + · · · + φ_p X_{t-p} + W_t,
where X_t is stationary, the φ_i are constants (φ_p ≠ 0), and W_t is white noise, i.e., W_t ∼ WN(0, σ_w²).
Suppose we consider the Gaussian white noise series W_t [Definition ??] as input and calculate the output using the second-order equation (for t = 1, 2, . . . , 500)
X_t = X_{t-1} − 0.9 X_{t-2} + W_t. (11.33)
Equation (11.33) represents a regression or prediction of the current value 𝑋𝑡 of a time series as a
function of the past two values of the series, and, hence, the term autoregression of order 𝑝 = 2 is
suggested. ■
[Figure: the simulated autoregression series X_t plotted against Time.]
In R, such a series can be simulated with arima.sim or with filter.
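The book works in R; as a self-contained cross-check, here is a Python/numpy sketch of the same second-order recursion (assuming, as in this classic example, the coefficients X_t = X_{t-1} − 0.9X_{t-2} + W_t):

```python
import numpy as np

rng = np.random.default_rng(0)
n, burn = 500, 50                     # 50 extra values to damp start-up effects
w = rng.standard_normal(n + burn)     # Gaussian white noise input
x = np.zeros(n + burn)
for t in range(2, n + burn):
    # second-order autoregression: X_t = X_{t-1} - 0.9 X_{t-2} + W_t
    x[t] = x[t - 1] - 0.9 * x[t - 2] + w[t]
x = x[burn:]                          # discard the burn-in segment
print(len(x))                         # 500 simulated values
```

Plotting x against time reproduces the quasi-periodic behavior seen in the figure above.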
We will need two simple but useful operators for representing time series models.
𝐵𝑋𝑡 = 𝑋𝑡−1
𝐵 𝑘 𝑋𝑡 = 𝑋𝑡−𝑘 (11.34)
we get X_t = φ_1 BX_t + φ_2 B²X_t + · · · + φ_p B^p X_t + W_t, or
(1 − φ_1 B − φ_2 B² − · · · − φ_p B^p) X_t = W_t, (11.35)
written compactly as
φ(B) X_t = W_t. (11.36)
The autoregressive operator is the above polynomial φ(B), defined in terms of the backshift operator B. The AR(p) process can then be viewed as a solution to equation (11.36), i.e.,
X_t = (1/φ(B)) W_t. (11.38)
Equation 𝜑(𝐵) = 0 is called the characteristic equation for the autoregressive model AR(𝑝).
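As an illustrative numerical check (our addition, not from the text), the causality condition "all roots of the characteristic equation φ(z) = 0 lie outside the unit circle" can be verified directly; here for the second-order model with coefficients 1 and −0.9 used above:

```python
import numpy as np

# phi(z) = 1 - z + 0.9 z^2 for the AR(2) model X_t = X_{t-1} - 0.9 X_{t-2} + W_t;
# np.roots expects coefficients from the highest degree down: 0.9 z^2 - z + 1.
roots = np.roots([0.9, -1.0, 1.0])
print(np.abs(roots))                     # both moduli equal sqrt(1/0.9)
print(bool(np.all(np.abs(roots) > 1)))   # True: the AR(2) is causal
```

The roots form a complex-conjugate pair of modulus sqrt(1/0.9) ≈ 1.054 > 1, so this AR(2) is causal.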
We iterate backwards k times:
X_t = φX_{t-1} + W_t = φ(φX_{t-2} + W_{t-1}) + W_t
    = φ²X_{t-2} + φW_{t-1} + W_t = · · · = φ^k X_{t-k} + ∑_{j=0}^{k-1} φ^j W_{t-j}.
Provided that X_t is stationary, by continuing to iterate backward we can represent an AR(1) model as the linear process
X_t = ∑_{j=0}^{∞} φ^j W_{t-j}. (11.39)
This is called the stationary solution of the model. Hence, when an AR(1) is stationary, the convergence of this infinite summation implies that |φ| < 1.
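A numerical sanity check (our illustration, not from the text): starting the recursion at X_0 = W_0 makes X_t equal to the finite sum ∑_{j=0}^{t} φ^j W_{t-j} exactly, so the recursive and linear-process representations can be compared term by term:

```python
import numpy as np

rng = np.random.default_rng(1)
phi, n = 0.5, 40
w = rng.standard_normal(n)

# AR(1) recursion with X_0 = W_0
x = np.zeros(n)
x[0] = w[0]
for t in range(1, n):
    x[t] = phi * x[t - 1] + w[t]

# linear-process (moving-average) form: X_t = sum_{j=0}^{t} phi^j W_{t-j}
x_ma = np.array([sum(phi**j * w[t - j] for j in range(t + 1)) for t in range(n)])
print(np.max(np.abs(x - x_ma)))   # essentially zero
```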
In operator form,
φ(B) X_t = W_t ⇐⇒ X_t = (1/φ(B)) W_t =: ψ(B) W_t, (11.41)
where
φ(B) = 1 − φ_1 B − φ_2 B² − · · · − φ_p B^p. (11.42)
Here
1/φ(B) = ψ(B) ⇐⇒ φ(B) ψ(B) = 1, (11.43)
and the polynomial ψ(B) = φ(B)^{-1} = ∑_{j=0}^{∞} ψ_j B^j will be utilized later.
• The parameter 𝜇 is the mean of the process. Think of the term 𝜑 (𝑋𝑡−1 − 𝜇) as representing
“memory” or “feedback” of the past into the present value of the process.
• The parameter 𝜑 determines the amount of feedback, with a larger absolute value of 𝜑 resulting in
more feedback, and φ = 0 implying that X_t = μ + W_t, so that X_t ∼ WN(μ, σ_w²).
In applications, one can think of 𝑊𝑡 as representing the effect of new information. Information that
is truly new cannot be anticipated, so the effects of today’s new information should be independent
of the effects of yesterday’s news. This is why we model new information as white noise.
Consider two AR(1) models of the form
X_t = φX_{t-1} + W_t, (11.47)
one with φ = 0.9 and one with φ = −0.9; in both cases, σ_w² = 1.
In the first case ρ(h) = 0.9^h, so observations close together in time are positively correlated with each other. In R we can try
par(mfrow=c(2,1))
plot(arima.sim(list(order=c(1,0,0), ar=.9), n=100), ylab="x",
main=(expression(AR(1)~~~phi==+.9)))
plot(arima.sim(list(order=c(1,0,0), ar=-.9), n=100), ylab="x",
main=(expression(AR(1)~~~phi==-.9)))
Look again at the operator form (11.36) of AR(1); assuming that the inverse operator exists,
X_t = φ^{-1}(B) W_t. Here φ(B) = 1 − φB is the autoregressive operator of AR(1), with |φ| < 1.
With B the backshift operator, rewrite (11.39) in operator form, yielding a polynomial ψ(B):
X_t = ∑_{j=0}^{∞} φ^j W_{t-j} = ∑_{j=0}^{∞} ψ_j W_{t-j} =: ψ(B) W_t, (11.48)
where ψ_j := φ^j, and ψ(B) := ∑_{j=0}^{∞} ψ_j B^j.
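As an illustrative numerical check (our addition), the ψ-weights determine the AR(1) autocovariance through γ(h) = σ_w² ∑_j ψ_j ψ_{j+h} = σ_w² φ^h/(1 − φ²); a truncated sum reproduces the closed form:

```python
import numpy as np

phi, sigma2 = 0.9, 1.0
J = 400                                 # truncation point; phi**(2*J) is negligible
psi = phi ** np.arange(J)               # psi_j = phi^j for AR(1)

for h in range(5):
    gamma_num = sigma2 * np.sum(psi[: J - h] * psi[h:])   # sum_j psi_j psi_{j+h}
    gamma_closed = sigma2 * phi**h / (1 - phi**2)
    print(h, gamma_num, gamma_closed)
```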
Fact 11.1.
For any polynomial P(z) = 1 − az, where z is a complex number and |a| < 1, the inverse is
P^{-1}(z) = 1/P(z) = 1 + az + a²z² + · · · + a^j z^j + · · · , |z| < 1. (11.49)
We could view ψ(B) as a one-sided generating function, and treat the backshift operator B as a complex number. In particular, it will often be necessary to consider the cases |B| < 1, |B| = 1, or |B| > 1, that is, when the complex number B lies inside, on, or outside the unit circle. By Fact 11.1, φ^{-1}(B) is exactly the polynomial ψ(B) in Equation (11.48): φ^{-1}(B) W_t = ψ(B) W_t. These results will be generalized in our discussion of ARMA models in the next chapters.
Definition 11.8 (Causal and explosive AR processes).
• When an AR process is stationary (it does not depend on the future), we will say the process is causal.
(E.g., the stationary AR(1) with |φ| < 1 is causal.)
• Hence, an AR process with |𝜑| ≥ 1 is nonstationary, and the mean, variance, and correlation
are not constant. In particular, with |𝜑| > 1, the AR process is future dependent, it is not causal,
we say the process is explosive.
When 𝜑 = 1, we get a special nonstationary process, often called random walk, given by
𝑋𝑡 = 𝑋𝑡−1 + 𝑊𝑡 .
Then E[X_t | X_0] = X_0 for all t, which is constant but depends entirely on the arbitrary starting point X_0. Moreover, the variance V[X_t | X_0] = t σ_w², which is not stationary but rather increases linearly with time. The process therefore is not mean-reverting. ■
♦ EXAMPLE 11.11 (Explosive AR Models and Causality).
Consider the AR(1) model
X_t = φX_{t-1} + W_t (11.51)
with |φ| > 1, or |φ|^{-1} < 1. Such processes are called explosive because the values of the time series quickly become large in magnitude.
Indeed, the finite sum S_k = ∑_{j=0}^{k} φ^j W_{t-j} of series (11.48) will not converge (in mean square) as k → ∞ [because |φ|^j increases without bound as j → ∞], so the intuition used to obtain (11.48) (which converged) will not work directly.
Rewriting
X_{t+1} = φX_t + W_{t+1} ⇐⇒ X_t = φ^{-1}X_{t+1} − φ^{-1}W_{t+1} = . . .
and iterating forward k steps (X_{t+1} = φ^{-1}X_{t+2} − φ^{-1}W_{t+2}, and so on), we get
X_t = φ^{-k} X_{t+k} − ∑_{j=1}^{k} φ^{-j} W_{t+j}.
Now |φ|^{-j} < 1 for all j = 1, 2, . . ., so this result suggests the future-dependent representation
X_t = − ∑_{j=1}^{∞} φ^{-j} W_{t+j}.
It requires us to know the future to be able to predict the present! In this explosive case, the process is stationary, but it is also future dependent, and not causal. ■
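A numerical illustration of the forward representation (our addition): truncating X_t = −∑_{j≥1} φ^{-j} W_{t+j} at J terms and checking that it satisfies the recursion X_t = φX_{t-1} + W_t up to a φ^{-J} truncation error:

```python
import numpy as np

rng = np.random.default_rng(2)
phi, n, J = 2.0, 30, 60                  # |phi| > 1: the explosive case
w = rng.standard_normal(n + J + 1)

# truncated future-dependent solution: X_t = -sum_{j=1}^{J} phi^{-j} W_{t+j}
x = np.array([-sum(phi**-j * w[t + j] for j in range(1, J + 1)) for t in range(n)])

# it satisfies the AR(1) recursion up to a phi**-J truncation error
lhs = x[1:]
rhs = phi * x[:-1] + w[1:n]
print(np.max(np.abs(lhs - rhs)))         # essentially zero (error ~ phi**-J)
```

The construction only uses future noise values, which is exactly why the explosive solution is not causal.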
Knowledge Box 15 (Summary of AR(1) model).
1. If |φ| < 1, the AR(1) process is stationary and causal.
HINT: Use the pattern (11.48) to check E[X_t] = 0, and compute the covariance
Cov[X_{t+h}, X_t] = · · · = σ_w² φ^{|h|} / (1 − φ²).
2. If |φ| ≥ 1, then the AR(1) process is nonstationary, and the mean, variance, covariances and correlations are not constant.
11.7 Confounding in Factorial Designs (Extra reading)
• Briefly, given a factorial experiment with a few factors of interest and a response Y, confounding between two effects E_1 and E_2 (possibly main or interaction effects) means we cannot separate the impacts of E_1 and E_2 on Y.
• Mathematically, confounding is also the name of a design technique for arranging a complete factorial experiment [as of Definition 11.2] in blocks, where the block size is smaller than the number of treatment combinations in one replicate; certain treatment effects are then deliberately confounded with block effects. That explains why confounding is essentially based on blocking.
For instance, when we study the quality (delicious sensory experience) of drinking COFFEE, consider two factors:
• Sugar, with two choices: Yes, No (or High, Low);
• Milk, also with two choices: With (Much) and Without (Little).
Questions:
1) How many types of coffee can we make? By combinatorial mathematics: 2 × 2 = 4.
2) How can you select your favorite coffee? By experimentation and data analysis!
In this simple factorial design we have raised two primary questions, and confounding between the two main effects A and B may occur in answering the second question: we cannot tell whether the coffee is delicious because of the milk or the sugar, can we? More examples can be seen in Example 11.12.
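The first question can be answered by direct enumeration; a minimal sketch (our illustration):

```python
from itertools import product

sugar = ["Yes", "No"]
milk = ["With", "Without"]

# every treatment combination of the 2 x 2 factorial
coffees = list(product(sugar, milk))
print(len(coffees))   # 4 types of coffee
print(coffees)
```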
Blocking designs
Full factorial experiments with a large number of factors can be impractical. Trouble particularly occurs when all factors have 3 levels: we must deal with the 3^m system, defined in the COMPLEMENT on Ternary Factorial Design in Section 10.9. Moreover, in practice it is sometimes impossible to perform all of the runs of a 2^m or 3^m factorial experiment under homogeneous conditions.
• For example, a single batch of raw material might not be large enough to make all of the required
runs. In other cases, it might be desirable to deliberately vary the experimental conditions to ensure
that the treatments are equally effective (i.e., robust) across many situations that are likely to be
encountered in practice.
• A chemical engineer may run a pilot plant experiment with several batches⁴.
The design technique used in these situations is blocking. The emphasis here is on the fundamentals; confounding issues that become important in large, very expensive experiments are discussed under another topic, namely Split-Plot Designs.
Aim, Ideas and Notations: The aim of this chapter is to introduce techniques for assigning parts of
replicates to smaller blocks, with the assignment based on the factorial treatment structure. We look
at ways to keep blocks small and homogeneous while retaining the desirable features and efficiency
of large factorial experiments.
♦ EXAMPLE 11.12 (WHY BLOCKING?).
If there are m = 3 factors, even at p = 2 levels, the total number of treatment combinations is 2³ = 8. If each combination takes two hours to run, 16 hours will be required to complete the experiment. Over such a long period, many influences could occur that are not of interest to us in this experiment, but that might make the interpretation of our results unclear.
⁴ of raw material, because he knows that raw-material batches of different quality grades are likely to be used in the actual full-scale process.
1. Suppose we have only eight hours available in a day, and so are forced to run our 16-hour
experiment over two days: a block of four treatment combinations on Monday and another
block of four on Tuesday.
2. Hence, we are not able to run all eight treatment combinations as one large block under
homogeneous conditions; instead, we have to split the one large block of eight treatment com-
binations into two smaller blocks of four.
Other “nuisance” factors can pollute data, rendering the interpretation problematic:
• Personnel changes, as when the day-shift radiologist is replaced by the night-shift radiologist in a hospital radiology experiment,
• The humidity in the photo lab might shift from cool in the morning to warm in the afternoon.
■
Definition 11.9 (What is Confounding?).
Confounding between two effects E_1 and E_2 (either main or interaction effects) means that we cannot separate the impacts of E_1 and E_2 on Y. When this phenomenon happens, we say E_1 and E_2 are confounded or aliased together.
1. The basic idea of this chapter is to assign treatment combinations to blocks in such a way that
effects of most interest can be estimated from within block information while sacrificing estimates
of effects of lesser importance.
2. In a factorial experiment we are often most interested in main effects and two or three-factor inter-
actions. Experience shows that the higher-order interactions are often much smaller in magnitude
than the main effects. This is fortunate since they also tend to be much more difficult to interpret.
The technique causes information about certain treatment effects (usually high-order interactions)
to be indistinguishable from, or confounded with, blocks.
Note that even though the designs presented are incomplete block designs because each block
does not contain all the treatments or treatment combinations, the special structure of the 2𝑚
factorial system allows a simplified method of analysis. We consider the construction and analysis
of the 2𝑚 factorial design in 2𝑘 incomplete blocks, where 𝑘 < 𝑚.
• When k = 2, these designs can be run in four blocks; when k = 3, in eight blocks, and so on.
• If batches of raw material are considered as blocks, then we must assign two of the four treatment
combinations to each block. The geometric view, as Figure 11.2.a, indicates that treatment
combinations on opposing diagonals are assigned to different blocks. In Fig. 11.2.b, block 1
contains the treatment combinations (1) and 𝑎𝑏 and block 2 has 𝑎 and 𝑏.
• Of course, the order in which the treatment combinations are run within a block is randomly
determined. Suppose we estimate the main effects of A and B just as if no blocking had occurred. With T_{A2} = y_a + y_{ab} = a + ab (the total of responses over factor B for level 2 of A), and T_{A1} = y_{(1)} + y_b = (1) + b, from Section 11.3.1, the main effect of A is
τ_A = ȳ_{A2} − ȳ_{A1} = (T_{A2} − T_{A1})/(2n) = [ab + a − b − (1)]/(2n),
by Eqn. (11.2), and Equation (11.3) gives the main effect of B as
τ_B = ȳ_{B2} − ȳ_{B1} = [ab + b − a − (1)]/(2n).
For a single replicate, the main effects and the AB interaction effect respectively are
τ_A = [ab + a − b − (1)]/2,  τ_B = [ab + b − a − (1)]/2,  τ^{AB} = [ab + (1) − a − b]/2.
In brief, in blocking a 2^m design we conclude on confounding between the block effect and an interaction:
Because the two treatment combinations with the plus sign [the ab and (1)] are in block 1 and the two with the minus sign [the a and b] are in block 2, the block effect and the AB interaction are identical.
That is, AB is confounded (or aliased) with blocks.
The reason for this is apparent from the table of plus and minus signs for the 2² design, produced as in Table 11.6. From this table, we see that all treatment combinations that have a plus sign on AB are assigned to block 1, whereas all treatment combinations that have a minus sign on AB are assigned to block 2:

Table 11.6: Table of Plus and Minus Signs for the 2² Design

AB = +1 ⟹ B_1 = B_+ = {(1), ab},
AB = −1 ⟹ B_2 = B_− = {a, b}. (11.53)
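The sign pattern behind Table 11.6 can be generated mechanically; a small sketch (our illustration, using the ±1 coding for factor levels):

```python
from itertools import product

# 2^2 design in +/-1 coding; label each run by the lowercase-letter convention
runs = []
for a_lv, b_lv in product([-1, 1], repeat=2):
    label = ("a" if a_lv == 1 else "") + ("b" if b_lv == 1 else "") or "(1)"
    runs.append((label, a_lv, b_lv, a_lv * b_lv))   # last entry: sign on AB

block1 = [r[0] for r in runs if r[3] == +1]   # plus sign on AB
block2 = [r[0] for r in runs if r[3] == -1]   # minus sign on AB
print(block1)   # ['(1)', 'ab']
print(block2)   # ['b', 'a']
```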
QUIZ 2. As a second example, fix m = 3 and consider a 2³ design with 8 treatment (level) combinations. While the experimenter has only 8 experimental units, there is some reason to believe that the units can be put into two blocks of four experimental units each, in such a manner that the variability among units within blocks is much smaller than the variability among the eight units as a set.
AIM: to confound the three-factor interaction 𝐴𝐵𝐶 with blocks.
DIY by filling in the blanks in Table 11.7 below, which is similar to Table 11.6.
Table 11.7: Table of Plus and Minus Signs for the 23 Design
GUIDANCE for solving: Let λ_i = 0, 1 (i = 1, 2, 3) and let A^{λ1} B^{λ2} C^{λ3} represent the 8 treatment combinations. When the number of factors is not large, we represent the treatment combinations by lowercase letters a, b, c, . . . .
* The letter a means a = A¹B⁰C⁰ = A, saying that factor A is at the High level (x_1 = 1) and B, C are at the Low level (x_2 = x_3 = −1); similarly for the other combinations.
We then assign the treatment combinations that are plus on 𝐴𝐵𝐶 to block 1 and those that are
minus on ABC to block 2.
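Filling Table 11.7 amounts to computing the sign of ABC for each run; a sketch of the assignment (our illustration):

```python
from itertools import product

blocks = {+1: [], -1: []}
for a_lv, b_lv, c_lv in product([-1, 1], repeat=3):
    label = ("a" if a_lv == 1 else "") + ("b" if b_lv == 1 else "") \
            + ("c" if c_lv == 1 else "") or "(1)"
    blocks[a_lv * b_lv * c_lv].append(label)   # sign on the ABC interaction

print(blocks[+1])   # block 1 (plus on ABC): a, b, c, abc
print(blocks[-1])   # block 2 (minus on ABC): (1), ab, ac, bc
```

Block 1 thus contains {a, b, c, abc} and block 2 contains {(1), ab, ac, bc}, confounding ABC with blocks.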
List of Tables
10.1 Typical causality diagram shows cause-effect relationship between key events up to uncertainty (p. 246)
10.2 A factorial design with binary factors A, B, C (p. 257)
10.3 Linear regression models with different shapes (p. 274)
10.4 Organizations higher up on the Quality Ladder are more efficient at solving problems with increased returns on investments (p. 284)
10.5 Dr. Genichi Taguchi, a pioneer in using Experimental Designs for Industry (p. 285)
10.6 Quadratic loss and tolerance intervals (p. 288)
10.7 Schematic parameter design (p. 293)
10.8 F distribution values (p. 310)
[19] Douglas C. Montgomery, George C. Runger, Applied Statistics and Probability for Engineers, Sixth Edition, Wiley (2014)
[20] A. J. Duncan, Quality Control and Industrial Statistics, 5th edition, Irwin, Homewood, Illinois (1986)
[21] Eiichi Bannai, Etsuko Bannai, Hajime Tanaka and Yan Zhu, Design Theory from the Viewpoint of Algebraic Combinatorics, Graphs and Combinatorics 33 (2017) 1-41.
[22] Brian Bergstein, AI still gets confused about how the world works, pp. 62-65, MIT Technology Review, The predictions issue, Vol 123 (2), 2020
[23] Doebling, S. W., Farrar, C. R., Prime, M. B., and Shevitz, D. W., "Damage Identification and Health Monitoring of Structural and Mechanical Systems From Changes in Their Vibration Characteristics: A Literature Review," Los Alamos National Laboratory Report LA-13070-MS, 1996.
[24] Donoho, David. High-dimensional data analysis: The curses and blessings of dimensionality, 2000.
[25] Farrar, Charles R., and Keith Worden. "An introduction to structural health monitoring." Philosophical Transactions. Series A, Mathematical, Physical, and Engineering Sciences 365, no. 1851 (2007): 303-15.
[26] Fodor, Imola. A Survey of Dimension Reduction Techniques. Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, 2002.
[27] Eastment, H. T., and W. J. Krzanowski. "Cross-Validatory Choice of the Number of Components from a Principal Component Analysis." Technometrics 24, no. 1 (1982): 73-77.
[28] Garcia, Gabriel V., and Roberto A. Osegueda. "Combining damage detection methods to improve probability of detection." In Smart Structures and Materials 2000: Smart Systems for Bridges, Structures, and Highways, Shih-Chi Liu, 135-142. SPIE, 2000.
[29] Google Earth, Digital Globe, 2014-2019
[30] Halfpenny, Angela. "A Frequency Domain Approach for Fatigue Life Estimation from Finite Element Analysis." Key Engineering Materials 167-168 (1999): 401-410.
[31] Härdle, Wolfgang, and Léopold Simar. Applied Multivariate Statistical Analysis. 2nd ed. Springer, 2007.
[32] Haywood, Jonathan, Wieslaw J. Staszewski, and Keith Worden. "Impact Location in Composite Structures Using Smart Sensor Technology and Neural Networks." In The 3rd International Workshop on Structural Health Monitoring, 1466-1475. Stanford, California, 2001.
[33] Hotelling, Harold. "Relations Between Two Sets of Variates." Biometrika 28, no. 3-4 (1936): 321-377.
[34] Hedayat, A.S., Seiden, E., Stufken, J., On the maximal number of factors and the enumeration of 3-symbol orthogonal arrays of strength 3. Journal of Statistical Planning and Inference, Vol 58 (1997) 43-63.
[35] Hedayat, A. S. et al., Orthogonal Arrays, Springer-Verlag, Germany, 1999.
[36] Hoai V. Tran, SPE lectures @ HCMUT, VNUHCM, Vietnam (2022)
[37] Jérémie Gallien, Systems Optimization and Analysis (15.066J), OCW MIT (Accessed Spring 2023)
[38] Judea Pearl, Causality: Models, Reasoning, and Inference, 2nd Edition, Cambridge University Press (2009)
[39] Monique Laurent, Strengthened Semidefinite Bounds for Codes, Mathematical Programming 109(2-3) (2007) 239-261.
[40] Mahmut Parlar, Interactive Operations Research with Maple, Methods and Models, Birkhäuser (2000)
[41] Brouwer, A. E., Cohen, A. M. and Nguyen, M. V. M. (2006), Orthogonal arrays of strength 3 and small run sizes, Journal of Statistical Planning and Inference, 136, 3268-3280.
[42] Eric D. Schoen, Pieter T. Eendebak, and Man Nguyen, Complete enumeration of pure-level and mixed-level orthogonal arrays, Journal of Combinatorial Designs 18(2) (2010) 123-140.
[43] Hien Phan, Ben Soh and Man VM. Nguyen, A Parallelism Extended Approach for the Enumeration of Orthogonal Arrays, ICA3PP 2011, Part I, Lecture Notes in Computer Science, Vol. 7016, Y. Xiang et al., eds., Springer-Verlag Berlin Heidelberg, pp. 482-494, 2011.
[44] Hien Phan, Ben Soh and Man VM. Nguyen, A Step-by-Step Extending Parallelism Approach for Enumeration of Combinatorial Objects, ICA3PP 2010, Part I, Lecture Notes in Computer Science, Vol. 6081, C.-H. Hsu et al., eds., Springer-Verlag Berlin Heidelberg, pp. 463-475, 2010.
[45] Man Van Minh Nguyen. PROBABILITY and STATISTICS: Inference, Causal Analysis and Stochastic Analysis. ISBN: 978-620-0-08656-3, LAP LAMBERT Academic Publishing (2019)
[46] Man Van Minh Nguyen. DATA ANALYTICS - STATISTICAL FOUNDATION: Inference, Linear Regression and Stochastic Processes. ISBN: 978-620-2-79791-7, LAP LAMBERT Academic Publishing (2020)
[47] Mien TN. Nguyen and Man VM. Nguyen. Application of Thin-Plate Spline and Distributed Lag Non-Linear Model to Describe the Interactive Effect of Two Predictors on Count Outcomes, DOI 978-1-6654-5422-3/22, © 2022 IEEE, special issue of the 9th NAFOSTED Conference on Information and Computer Science (NICS), 2022, Vietnam
[48] Mien TN. Nguyen, Man VM. Nguyen and Ngoan T. Le. Using the Discrete Lindley Distribution to Deal with Over-dispersion in Count Data. Austrian Journal of Statistics, to appear in July 2023
[49] Uyen Huynh, Nabendu Pal, and Man Nguyen. Regression model under skew-normal error with applications in predicting groundwater arsenic level in the Mekong Delta Region. Environmental and Ecological Statistics Vol. 28, pp. 323-353, DOI doi.org/10.1007/s10651-021-00488-2, Springer Nature 2021
[50] Man Van Minh Nguyen. Quality Engineering with Balanced Factorial Experimental Designs, Southeast Asian Bulletin of Mathematics, Vol 44 (6), pp. 819-844 (2020)
[51] Uyen Huynh, Nabendu Pal, Buu-Chau Truong and Man Nguyen. A Statistical Profile of Arsenic Prevalence in the Mekong Delta Region, Thailand Statistician Journal 2020
[52] Man VM. Nguyen and Nhut C. Nguyen. Analyzing Incomplete Spatial Data For Air Pollution Prediction, Southeast-Asian J. of Sciences, Vol. 6, No 2, pp. 111-133 (2018)
[53] Nguyen V. Minh Man. A Survey on Computational Algebraic Statistics and Its Applications, East-West Journal of Mathematics, Vol. 19, No 2, pp. 1-44 (2017)
[54] Nguyen V. Minh Man. Permutation Groups and Integer Linear Algebra for Enumeration of Orthogonal Arrays, East-West Journal of Mathematics, Vol. 15, No 2 (2013)
[55] Man Nguyen, Tran Vinh Tan and Phan Phuc Doan, Statistical Clustering and Time Series Analysis for Bridge Monitoring Data, Recent Progress in Data Engineering and Internet Technology, Lecture Notes in Electrical Engineering 156 (2013) pp. 61-72, Springer-Verlag
[56] Man Nguyen and Le Ba Trong Khang. Maximum Likelihood For Some Stock Price Models, Journal of Science and Technology, Vol. 51, no. 4B (2013) pp. 70-81, VAST, Vietnam
[58] Nguyen Van Minh Man and Scott H. Murray. Mixed Orthogonal Arrays: Constructions and Applications, talk at the International Conference on Applied Probability and Statistics, December 28-31, 2011, The Chinese Univ. of Hong Kong, Hong Kong
[59] Man Nguyen and Tran Vinh Tan. Selecting Meaningful Predictor Variables: A Case Study with Bridge Monitoring Data, Proceedings of the First Regional Conference on Applied and Engineering Mathematics (RCAEM I) (2010), University of Perlis, Malaysia.
[60] Man Nguyen and Phan Phuc Doan. A Combined Approach to Damage Identification for Bridge, Proceedings of the 5th Asian Mathematical Conference, pp. 629-636 (2009), Universiti Sains Malaysia in collaboration with UNESCO, Malaysia.
[61] Nguyen, Man V. M. Some New Constructions of strength 3 Orthogonal Arrays, the Memphis 2005 Design Conference Special Issue of the Journal of Statistical Planning and Inference, Vol 138, Issue 1 (Jan 2008) pp. 220-233.
[62] Nguyen Van Minh Man, Computer-Algebraic Methods for the Construction of Designs of Experiments, Ph.D. thesis, Eindhoven University of Technology (TU/e), Netherlands (2005)
[64] Giesbrecht, Marcia L. Gumpertz, Planning, Construction, and Statistical Analysis of Comparative Experiments, Wiley (2004)
[65] R. Mead, S.G. Gilmour, and A. Mead, Statistical Principles for the Design of Experiments, Cambridge University Press (2012)
[66] M. F. Fecko et al., Combinatorial designs in Multiple faults localization for Battlefield networks, IEEE Military Communications Conf., Vienna, 2001.
[67] Glonek, G.F.V. and Solomon, P.J., Factorial and time course designs for cDNA microarray experiments, Biostatistics 5, 89-111, 2004.
[68] Hedayat, A. S., Sloane, N. J. A. and Stufken, J., Orthogonal Arrays, Springer, 1999.
[69] Joel Cutcher-Gershenfeld, ESD.60 Lean/Six Sigma Systems, LFM, MIT
[70] John J. Borkowski's Home Page, www.math.montana.edu/~jobo/courses.html/
[71] Joseph A. de Feo, Juran's Quality Management And Analysis, McGraw-Hill, 2015.
[72] Jay L. Devore and Kenneth N. Berk, Modern Mathematical Statistics with Applications, 2nd Edition, Springer (2012)
[73] Google Earth, Digital Globe, 2014-2019
[74] Robert V. Hogg, Joseph W. McKean, Allen T. Craig, Introduction to Mathematical Statistics, Seventh Edition, Pearson, 2013.
[75] Bulutoglu, D.A. and Margot, F., Classification of orthogonal arrays by integer programming, Journal of Statistical Planning and Inference 138 (2008) 654-666.
[76] Michael Baron, Probability and Statistics for Computer Scientists, 2nd Edition (2014), CRC Press, Taylor & Francis Group
[77] R. H. Myers, Douglas C. Montgomery and Christine M. Anderson-Cook, Response Surface Methodology: Process and Product Optimization Using Designed Experiments, Wiley, 2009.
[78] Nathabandu T. Kottegoda, Renzo Rosso. Applied Statistics for Civil and Environmental Engineers, 2nd edition (2008), Blackwell Publishing Ltd and The McGraw-Hill Inc
[79] Paul Mac Berthouex, L. C. Brown. Statistics for Environmental Engineers, 2nd edition (2002), CRC Press
[80] Peter Goos, The optimal design of blocked and split-plot experiments, Springer (2002)
[81] Peter Goos and Bradley Jones, Optimal Design of Experiments: A Case Study Approach, John Wiley (2011)
[82] P. K. Bhattacharya and Prabir Burman. Linear Model. In Theory and Methods of Statistics, pages 309-382. Academic Press, 2016.
[83] Nathaniel E. Helwig. Multivariate Linear Regression. 2017
[84] Heather Turner. Introduction to Generalized Linear Models. University of Warwick, UK. 2008.
[85] Chitavorn Jirajan, Triage in Emergency Department.
[86] Marie-Pierre De Bellefon, Jean-Michel Floch. Handbook of Spatial Analysis. Chapter 9, 231-254. 2018.
[87] Nelder, J. and R. Wedderburn (1972). Generalized linear models. Journal of the Royal Statistical Society, Series A 135, 370-384.
[88] McCullagh, Peter and Nelder, John Ashworth, Generalized Linear Models, 2nd ed., Chapman and Hall, 1989.
[89] David Ardia, Financial Risk Management with Bayesian Estimation of GARCH Models, Springer (2008)
[90] Peter K. Dunn and Gordon K. Smyth, Generalized Linear Models With Examples in R (2018), Springer Nature.
[91] Philippe Jorion, Value at Risk: The New Benchmark for Managing Financial Risk, 3rd Edition, McGraw Hill (2007)
[92] Ron S. Kenett, Shelemyahu Zacks. Modern Industrial Statistics with applications in R, MINITAB, 2nd edition (2014), Wiley
[93] Sheldon M. Ross. Introduction to Probability Models, 10th edition (2010), Elsevier Inc.
[94] Sheldon M. Ross. Introduction to Simulation, Third edition (2002), Academic Press
[95] Simon Hubbert, Essential Mathematics for Market Risk Management, Wiley (2012)
[96] Soren Asmussen and Peter W. Glynn, Stochastic Simulation: Algorithms and Analysis, Springer (2007)
[97] A. Stewart Fotheringham, Chris Brunsdon, Martin Charlton. Geographically Weighted Regression: the analysis of spatially varying relationships. Wiley, England, 2002.
[98] Scheffé, H. (1959) The Analysis of Variance, John Wiley & Sons, Inc., New York.
[99] Online: news.samsung.com/global/samsung-announces-new-and-enhanced-quality-assurance-measures-to-improve-product-safety
[100] Online: samsungengineering.com/sustainability/quality/common/suView
[101] Sudhir Gupta, Balanced Factorial Designs for cDNA Microarray Experiments, Communications in Statistics: Theory and Methods, Volume 35, Number 8, pp. 1469-1476 (2006)
[102] Sung H. Park, Six-Sigma for Quality and Productivity Promotion, Asian Productivity Organization, 1-2-10 Hirakawacho, Chiyoda-ku, Tokyo, Japan, 2003.
[103] Sloane, N.J.A., neilsloane.com/hadamard/index.html/
[104] John Stufken and Boxin Tang, Complete Enumeration of Two-Level Orthogonal Arrays of Strength D With D + 2 Constraints, The Annals of Statistics 35(2), pp. 793-814 (2007)
[105] Online: toyota-global.com/company/history-of-toyota/75years/data/company-information/management-and-finances/management/tqm/change.html
[106] Vo Ngoc Thien An, Design of Experiment for Statistical Quality Control, Master thesis, LHU, Vietnam (2011)
[107] Genichi Taguchi, Subir Chowdhury and Yuin Wu (2005), Taguchi's Quality Engineering Handbook, John Wiley & Sons
[108] Trevor Hastie, Robert Tibshirani and Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Ed., Springer (2017)
[109] Wang, J.C. and Wu, C. F. J. (1991), An approach to the construction of asymmetrical orthogonal arrays, Journal of the American Statistical Association, 86, 450-456.
[110] Larry Wasserman, All of Statistics: A Concise Course in Statistical Inference, Springer (2003)
[111] William J. Stevenson, Operations Management, 12th ed., McGraw-Hill
[112] C.F. Jeff Wu, Michael Hamada, Experiments: Planning, Analysis and Parameter Design Optimization, Wiley, 2000.
[113] Inada, T., Shimamura, Y., Todoroki, A., Kobayashi, H., and Nakamura, H., Damage Identification Method for Smart Composite Cantilever Beams with Piezoelectric Materials, Structural Health Monitoring 2000, Stanford University, Palo Alto, California, 1999, pp. 986-994.
[114] Jolliffe, I. T. Principal Component Analysis. 2nd ed. Springer, 2002.
[115] Ron S. Kenett, Shelemyahu Zacks. Modern Industrial Statistics with applications in R, MINITAB, 2nd edition (2014), Wiley
[116] Lapin, L.L., Probability and Statistics for Modern Engineering, PWS-Kent Publishing, 2nd Edition, Boston, Massachusetts, 1990.
[117] Ljung, L. System Identification: Theory for the User, Prentice Hall, Englewood Cliffs, NJ, 1987
[118] Masri, S.F., Smyth, A.W., Chassiakos, A.G., Caughey, T.K., and Hunter, N.F., Application of Neural Networks for Detection of Changes in Nonlinear Systems, Journal of Engineering Mechanics, July 2000, pp. 666-676.
[119] Papadimitriou, C. "Optimal sensor placement methodology for parametric identification of structural systems." Journal of Sound and Vibration 278, no. 4-5 (2004): 923-947.
[120] Rytter, A., Vibration based inspection of civil engineering structures. Ph.D. Dissertation, Department of Building Technology and Structural Engineering, Aalborg University, Denmark, 1993.
[121] Rytter, A., and Kirkegaard, P., Vibration Based Inspection Using Neural Networks, Structural Damage Assessment Using Advanced Signal Processing Procedures, Proceedings of DAMAS '97, University of Sheffield, UK, 1997, pp. 97-108.
[122] Google Earth, Digital Globe, 2014-2019
[123] Silverman, B.W., Density Estimation for Statistics and Data Analysis, Chapman and Hall, New York, New York, 1986.
[124] Sithole, M.M., and S. Ganeshanandam. "Variable selection in principal component analysis to preserve the underlying multivariate data structure." In ASC XII - 12th Australian Stats Conference. Monash University, Melbourne, Australia, 1994.
[125] Sohn, Hoon. "Effects of environmental and operational variability on structural health monitoring." Philosophical Transactions. Series A, Mathematical, Physical, and Engineering Sciences 365, no. 1851 (2007): 539-60.
[126] Sohn, Hoon, and Charles R. Farrar. Damage diagnosis using time series analysis of vibration signals. Smart Materials and Structures. Vol. 10, 2001.
[127] Sohn, Hoon, David W. Allen, Keith Worden and Charles R. Farrar, Statistical damage classification using sequential probability ratio test, Structural Health Monitoring, 2003, pp. 57-74.
[128] Sohn, Hoon, Keith Worden, Charles R. Farrar, Statistical Damage Classification under Changing Environmental and Operational Conditions, Journal of Intelligent Materials Systems and Structures, 2007
[129] Sohn, Hoon, Charles R. Farrar, Francois M. Hemez, Devin D. Shunk, Daniel W. Stinemates, Brett R. Nadler, and Jerry J. Czarnecki. A Review of Structural Health Monitoring Literature: 1996-2001. Struc-
[134] Wald, A. Sequential Analysis, John Wiley and Sons, New York, 1947
[135] Worden, K., and Lane, A.J., Damage Identification Using Support Vector Machines, Smart Materials and Structures, Vol. 10, 2001, pp. 540-547.
[136] Worden, K., Pierce, S.G., Manson, G., Philp, W.R., Staszewski, W.J., and Culshaw, B., Detection of De-
tural Health Monitoring. Los Alamos National Labo- fects in Composite Plates Using Lamb Waves and
ratery Report, 2004. Novelty Detection, International Journal of Systems
Science, Vol. 31,2000, pp. 1,397-1,409
[130] Todd, M.D., and Nichols, J.M., Structural Damage
Assessment Using Chaotic Dynamic Interrogation, [137] Worden, K., and Fieller, N.R.J., Damage Detection
Proceedings of 2002 ASME International Mechani- Using Outlier Analysis, Journal of Sound and Vibra-
cal Engineering Conference and Exposition, New Or- tion, Vol. 229, No. 3,1999, pp. 647-667.
leans, Louisiana, 2002.
[131] Vanik, M. W., Beck, J. L., and Au, S. K. , Bayesian [138] Yang, Lingyun, Jennifer M. Schopf, Catalin L. Du-
Probabilistic Approach to Structural Health Monitor- mitrescu, and Ian Foster. ”Statistical Data Reduction
ing, Journal of Engineering Mechanics, Vol. 126, No. for Efficient Application Performance Monitoring.” CC-
7, 2000,pp. 738-745. GRID (2006).
[132] Vapnik, V., Statistical Learning Theory, John Wiley & [139] Q.W.Zhang, Statistical damage identification for
Sons, Inc., New York,1998 bridges using ambient vibration data, Elsevier, 2006.
p.476-485.
[133] Vo Ngoc Thien An, Design of Experiment for Sta-
tistical Quality Control, Master thesis, LHU, Vietnam [140] Larry Wasserman, All of Statistics- A Concise Course
(2011) in Statistical Inference, Springer, (2003)