CHAPTERS 8 .. 11
© 2023
Hoai V. Tran and Duy Phuong Nguyen †
NOTE: Courtesy of Google Inc. [122] for picturesque icons/images of chapter covers.
Contents
8.8 ASSIGNMENT II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
11.4.1 Comparison between using coded units and engineering units . . . . . . . . . 393
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443
• Chapter 10 presents an entirely new methodology, called Designed Experiments (DOE), with advanced uses of statistical experimental designs for system performance evaluation in diverse sectors. For example, in Computing the DOE approach to SPE is useful in both software engineering and hardware manufacturing, where the “experiment” is the execution of a computer simulation model (or simply a computer model). Chapter 11 is meaningful since many classes of Fractional Factorial Designs find their way into Quality Analytics and Performance Evaluation through their cost and efficiency optimality.
• Finally, Chapter ?? (R) proposes Performance Analytics Projects with Further Insights and Views.
Chapter 8
Introduction
1. Firstly, suppose that a complex device or machine is to be built and launched. Before it happens,
its performance is simulated, and this allows us to evaluate its adequacy and associated risks
carefully.
2. We surely prefer to evaluate the reliability and safety of a new module of a space station by means of computer simulations rather than during the actual mission.
In the design phase of a system there is no system available yet, so we cannot rely on measurements for generating a probability density function (pdf). In such extreme cases, we may use simulation. Large complex system simulation has become common practice in many areas.
Step (i) building a computer model that describes the behavior of a system;
Once we have a computer simulation model of the actual system, we need to generate values for
the random quantities that are part of the system input (to the model).
To conduct Step (i) correctly and meaningfully, close collaboration between mathematicians and statisticians on one side, and engineers and experts in specific areas on the other, is vital.
Once an organization realizes that a system is not operating as desired, it will look for ways to improve its performance. To do so, it is sometimes possible to experiment with the real system and, through observation and the aid of probabilistic methods and statistics, reach valid conclusions for future system improvement.
• Sometimes it is not feasible, or even impossible, to build a prototype, yet we may still obtain a mathematical model (through equations and constraints) describing the essential behavior of the system.
PROBLEM 1’s analysis may be done through analytical or numerical methods, but the model may be too complex to be dealt with. ■
♣ QUESTION. How do we proceed when a mathematical model is not feasible, or too complex?
[Figure 8.1: a simulation model maps input probability distributions to probability distributions for important outputs.]
A brief simulation process (as in Figure 8.1) essentially consists of passing the inputs through the simulation model to obtain outputs to be analyzed later.
♣ QUESTION. How do we choose a good system design from multiple system designs?
From the STATISTICAL SIMULATION view, a good solution firstly comes from combining Statistical Design and Inference with Simulation, while looking out for clues to the next two questions:
2. How do we capture the uncertain variation of systems and express it up to some proper precision?
• So far, variance analysis works well for a single population or one system.
Such inferences exploiting Fisher distribution [ref. part 8.6.2] and Chi-square distribution [see
8.11.3] are used for the comparison of
• Section 8.6 presents methods for comparison of performance via risk (variance) analysis, with
F-tests comparing two population variances.
Start-up time 𝑋 of computers, it is conjectured, could be related to the operating system (OS) used
on the machines.
Two groups of laptops are randomly assigned to one of two OS: Windows or Linux.
A measure of start-up time 𝑋 (in seconds) is then obtained for each of the subjects:
Assumptions:
♣ QUESTION. Compare the start-up times of the two operating systems using the above data
and assumptions.
With COMPLEMENT 7B of ?? showing key mathematical ways of generating random numbers, we then present the generation of continuous random variables in Section 8.1.1; lastly, Section 8.1.2 gives a short discussion on using exponential variables.
8.1
Generation of Random Variables and Its Usage
Output: values x of X.
2. Generate values x via the transformation X = F⁻¹(V); in other words, solve the equation F(X) = V for X.

⟺ X = F_X⁻¹(V) = −(1/λ) log(1 − V) = −(1/λ) log U, with U ∼ Uni([0, 1]).
SYSTEM PERFORMANCE EVALUATION
8.1. Mathematical Generation of Random Variables 11
Hence X = −(1/λ) log U, so the negative log of a uniform U, scaled by 1/λ, is exponentially distributed with rate λ. When λ = 1, X ∼ E(1); for any constant c > 0, cX is exponential with mean c.
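The inverse-transform recipe above can be sketched in code (a Python illustration; the book's own examples use R):

```python
import math
import random

def exp_inverse_transform(lam: float) -> float:
    """Generate X ~ E(lam) by inverse transform: X = -(1/lam) * log(U)."""
    u = 1.0 - random.random()    # U in (0, 1]; avoids log(0)
    return -math.log(u) / lam    # solves F(X) = 1 - exp(-lam*X) = V for X

# Quick check: the sample mean should approach 1/lam.
random.seed(1)
xs = [exp_inverse_transform(2.0) for _ in range(100_000)]
print(sum(xs) / len(xs))  # close to 1/2
```

The function name and the empirical check are illustrative only; any uniform generator on (0, 1] can be substituted.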
The shape parameter n and the scale parameter β determine Gamma(n, β) completely. The probability density function of X ∼ Gamma(n, β) is

g(x; n, β) = x^{n−1} e^{−x/β} / (βⁿ Γ(n))  if x ≥ 0,  and  g(x; n, β) = 0  if x < 0.   (8.1)
A Gamma variable can generally be generated as a sum of n independent (identically distributed) exponentials, each with rate β, that is

X = Gamma(n, β) ∼ Σ_{i=1}^{n} E_i(β) ∼ G(n, β)

or

X ∼ Σ_{i=1}^{n} −(1/β) log U_i = −(1/β) log(U_1 ··· U_n), where U_i ∼ Uni([0, 1]). (WHY?)
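Following the formula above, a Gamma(n, β) draw is minus the scaled log of a product of n uniforms (a Python sketch; here β is the rate of each exponential term, so the sample mean approaches n/β):

```python
import math
import random

def gamma_from_uniforms(n: int, beta: float) -> float:
    """Generate Gamma(n, beta) as -(1/beta) * log(U1 * U2 * ... * Un)."""
    prod = 1.0
    for _ in range(n):
        prod *= 1.0 - random.random()   # uniform in (0, 1], avoids log(0)
    return -math.log(prod) / beta

# Each term -(1/beta)*log(Ui) is E(beta) with mean 1/beta,
# so the sum has mean n/beta.
random.seed(2)
xs = [gamma_from_uniforms(3, 2.0) for _ in range(50_000)]
print(sum(xs) / len(xs))  # close to 3/2
```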
A Poisson variable X counts the number of (rare) events randomly occurring in one unit of time; denoted X ∼ Pois(λ), it is determined by 3 components:
The constant λ > 0 is the rate or speed of events, i.e., the average number of events occurring in one time unit.
The exponential variable E(·) and the Poisson variable Pois(λ) have a strong, close bond in engineering and services, as in Queuing Theory and Systems. For a simple queuing system with one server we can describe some parameters and measures of performance. ■
■ NOTATION 1.
Suppose that customers (or entities) entering a queuing system are assigned numbers with the
𝑖-th arriving customer called customer-𝑖. Let
• 𝐴𝑖 denote the time when the 𝑖-th customer arrives, and thereby
Arrival times:  0 ‖--A_1---A_2 ··· A_n----A_{n+1}-----A_{n+2}-->
with inter-arrival times X_1 = A_2 − A_1, ..., X_n = A_{n+1} − A_n, and service time S_{n+1}.
If the inter-arrival times {X_i} are exponentially distributed with an arrival rate of λ [with mean E[X_i] = 1/λ and pdf f_X(t) = λ e^{−λt}],
then the number of arrivals 𝑁 (𝑡) in the time interval [0, 𝑡) forms a Poisson process with param-
eter 𝜆 𝑡.
EXAMPLE 8.1 provides a way to generate an exponential X ∼ E(·). We may generate Poisson random variables using E(·), as in queuing theory, as follows.
We already know that the number of arrivals (events) N(1) in the time interval [0, 1) is Poisson distributed with mean λ.
• The n-th event will occur at time Σ_{i=1}^{n} X_i, so the number of events by time 1 is

N(1) = max{ n : Σ_{i=1}^{n} X_i ≤ 1 }.   (8.5)
That is, the number of events by time 1 is equal to the largest 𝑛 for which the 𝑛-th event has
occurred by time 1. E.g., if the 4th event occurred by time 1 but the 5th event did not, then clearly
there would have been a total of four events by time 1.
• Hence, we use the results of EXAMPLE 8.1 to generate N = N(1), a Poisson random variable with mean λ, by generating random numbers U_1, U_2, ···, U_n, ... and setting

N = max{ n : Σ_{i=1}^{n} −(1/λ) log U_i ≤ 1 } = max{ n : U_1 ··· U_n ≥ e^{−λ} }.  WHY?
We conclude that a Poisson random variable N with mean λ can be generated by successively generating random numbers until their product falls below e^{−λ}, then setting N to one less than the number of random numbers generated.
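The product-of-uniforms rule translates directly into code (a Python sketch):

```python
import math
import random

def poisson_by_products(lam: float) -> int:
    """Generate N ~ Pois(lam): multiply uniforms until the product drops
    below e^{-lam}; N is one less than the count of uniforms used."""
    threshold = math.exp(-lam)
    n, prod = 0, 1.0
    while True:
        prod *= 1.0 - random.random()   # uniform in (0, 1]
        if prod < threshold:
            return n                    # product first fell below e^{-lam}
        n += 1

random.seed(3)
ns = [poisson_by_products(3.0) for _ in range(20_000)]
print(sum(ns) / len(ns))  # close to lam = 3
```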
The following table shows key probability distributions useful in System Performance Evaluation (SPE).

Distribution   Notation         Parameters   Mean   Variance
Gauss          X ∼ N(μ, σ²)     μ, σ²        μ      σ²
Exponential    X ∼ E(λ)         λ            1/λ    1/λ²
Chi-square     X ∼ χ²_n         n            n      2n
Student        T ∼ t_n          n            0      n/(n − 2)
We will learn how to quantify the simulation’s precision in Section 8.3, but before that we first present Monte Carlo Simulation.
8.2
The Monte Carlo Simulation Methodology
1. Monte Carlo methods are those based on computer simulations involving random numbers. To
perform a simulation, we need
• a way to generate random numbers (according to your model) using a computer. The data
that are generated from your model can then be studied as if they were observations.
Statistically, Monte Carlo methods are mostly used for the computation of probabilities, expected
values, and other distribution characteristics (such as variances).
² Statistical Physics - in particular, during the development of the atomic bomb - but they are now widely used in statistics and machine learning... The term Monte Carlo originally referred to simulations that involved random walks and was first used by John von Neumann in the 1940s.
Today, the Monte Carlo method refers to any simulation that involves the use of random numbers
We will show, via a few examples, that Monte Carlo simulations (or experiments) are a feasible way to understand the phenomena of interest.
A) Forecasting in Climate Science. Given just a basic distribution model, it is often very difficult to make reasonable long-range predictions. Often a one-day development depends on the results obtained during all the previous days; perhaps the binomial distribution and the Markov property of a stochastic process would be involved.
However, simulation of such a process can be easily performed daily (or even minute by minute).
Based on present results, we simulate the next day. And thus, we can simulate the day after that,
etc.
As a result, when designing a queuing system or a server facility, it is important to evaluate its vital
performance characteristics, including
ELUCIDATION
• In both applications A and B above, we saw how different types of phenomena can be computer-simulated. However, one simulation is not enough for estimating probabilities and expectations. After we understand how to program the given phenomenon once, we can embed it in a do-loop³ and repeat similar simulations a large number of times, generating a long run.
Multiple-queue or multiple-station systems appear in many places, in theme parks (with parallel
queues), or in industrial factories (with sequential queues), seen in Chapter ??.
Key performance measures of a stable queuing system will be briefed in FACT ??. With more powerful methods introduced in the next chapter, such as Discrete Event Simulation (DES), Section ?? then shows us how to combine DES with other tools to simulate a multiserver system.
³ Since the simulated variables are random, we will generally obtain a number of different realizations, from which we calculate probabilities and expectations as long-run frequencies and averages.
⁴ Sample: a proper subset of a population.
First we generate 𝑅 i.i.d. (independent and identically distributed) samples from the distribution,
call them X1 , · · · , X𝑅 (of the same size 𝑛 ≥ 1).
In other words, one runs R independent computer experiments replicating the random variable (r.v.) X, and then computes μ̂ from the sample.
• The use of random sampling as a method for computing a probability or expectation is often called Monte Carlo approximation or, generally, the Monte Carlo method.
• When the estimator μ̂_R of μ = E[X] is an average of i.i.d. copies of X as in (8.6), meaning g(X) = X only, then we refer to μ̂ as an ordinary Monte Carlo (OMC, also CMC - crude MC) estimator.
⁵ It is in fact the area bounded by the pdf curve, the horizontal axis y = 0, and the vertical lines at x = −∞ and x = t.
For the periodic function g(x) = [cos(50x) + sin(20x)]² with cdf F(t) = F_g(t), we consider evaluating its integral over [0, 1],

I = F(1) − F(0) = ∫₀¹ g(x) dx.
It can be seen as a uniform expectation on [0, 1]; we therefore generate U_1, U_2, ···, U_n iid Uniform[0, 1] random variables, and approximate

F_g(t) = ∫₀ᵗ g(x) dx.
[Figure: left, the function g(x) on [0, 1] (values about 0 to 3); right, the running Monte Carlo estimate of its integral (values about 0.8 to 1.2).]
NOTE: The R command cumsum is quite handy in that it computes all the partial sums of a sequence at once and thus allows the immediate representation of the sequence of estimators, specifically when monitoring Monte Carlo convergence, an issue that will be fully addressed in the next chapter.
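The integral of g above can be approximated as follows (a Python sketch; `itertools.accumulate` plays the role of R's `cumsum` for the running estimate, and the exact value of the integral is about 0.965):

```python
import math
import random
from itertools import accumulate

def g(x: float) -> float:
    return (math.cos(50 * x) + math.sin(20 * x)) ** 2

random.seed(4)
R = 100_000
values = [g(random.random()) for _ in range(R)]

# Running estimates I_hat_n = (1/n) * sum_{i<=n} g(U_i), as cumsum gives in R.
running = [s / n for n, s in enumerate(accumulate(values), start=1)]
print(running[-1])  # final crude Monte Carlo estimate of the integral
```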
Here you (presumably) cannot do it by exact methods (integration or summation using pencil, a
computer algebra system, or exact numerical methods).
• The principle of the Monte Carlo method for approximating E[g(X)] is to simulate/generate a random sample X_1, X_2, ···, X_R from the density f [the samples are i.i.d., having the same distribution as X]. Define

μ̂_R = (1/R) Σ_{i=1}^{R} g(X_i).   (8.8)
Knowledge Box 1. A few essential facts of the OMC are summarized as follows.
Let 𝑌𝑖 = 𝑔(𝑋𝑖 ) then the 𝑌𝑖 ∼ 𝑌 are also i.i.d. with mean 𝜇 and variance
We can use Monte Carlo sums to calculate a normal cdf. Consider the standard normal r.v. Z ∼ N(0, 1) with pdf f. We generate a sample (X_1, X_2, ···, X_R) ∼ N(0, 1) of size R and set the Monte Carlo estimator

Φ̂(t) = (1/R) Σ_{i=1}^{R} g(X_i) = (number of observations ≤ t) / R. ■
QUIZ 1.
1. Write your own R code to confirm that with t = 2, the true answer is Φ(2) = .9772, and the Monte Carlo estimate with R = 10,000 yields Φ̂(2) = 0.9751. Use R = 100,000 to get .9771.
2. Why are the variables Id(x_i ≤ t) independent Bernoulli with success probability Φ(t)?
REMARK: The Monte Carlo approximation of a probability distribution function illustrated by this example has nontrivial appli-
cations since it can be used in assessing the distribution of a test statistic, such as a likelihood ratio test under a null hypothesis.
8.3
How to achieve a simulation with high precision?
We firstly present a motivation; it is all right if you do not know all the mathematical facts, but APPENDIX ?? will be essentially helpful for the remaining chapters.
♦ EXAMPLE 8.6.
Consider a customer survey conducted by AIA, an insurance firm in Bangkok. The firm’s quality assurance team uses a customer survey to measure the satisfaction of customers.
How do we summarize the data? We rate customer satisfaction by asking for satisfaction scores, in the range 0..60. A sample of n = 100 customers is surveyed, giving a sample mean x̄ = 42 of the customer scores
x = satis-score = [48, 55, 35, 31, ···, 29, 31, 29, 39, 32, 44, 50].
N = the number of all customers, and n = 100 (the number of customers we asked).
P( x̄ − 1.96 σ/√n < μ < x̄ + 1.96 σ/√n ) = 0.95.
(II) If we don’t know σ but have a sample of large size n > 40, we use its estimate s from

s² = Σ_{i=1}^{n} (x_i − x̄)² / (n − 1).
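Formulas (I)-(II) combine into a short routine (a Python sketch; as a toy input we reuse only the eleven scores listed in the example, since the middle of the data is elided):

```python
import math

def mean_ci(data, z: float = 1.96):
    """Confidence interval for the mean, using the sample std s in place of sigma."""
    n = len(data)
    xbar = sum(data) / n
    s2 = sum((x - xbar) ** 2 for x in data) / (n - 1)   # sample variance
    half = z * math.sqrt(s2 / n)                         # z * s / sqrt(n)
    return xbar - half, xbar + half

lo, hi = mean_ci([48, 55, 35, 31, 29, 31, 29, 39, 32, 44, 50])
print(lo, hi)
```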
How can the above solution be applied in SIMULATION with reliable conclusions?
PROBLEM 8.1 (When does the last customer leave a Service Center under random arrivals?).
Consider a service system in which no new customers are allowed to enter after 5 p.m. Suppose
that each day follows the same probability law and that we are interested in
i) estimating the expected time at which the last customer departs the system.
ii) ensuring that our estimated answer will not differ from the true value by more than 15 seconds.
CRITICAL THINKING: does the 2nd request need a simulation with high precision?
• continually generate data values relating to the time at which the last customer departs (each time
by doing a simulation run) until we have generated a total of 𝑛 values, where 𝑛 ≥ 100, and
• the simulated data of size n satisfy a small enough “precision threshold”, i.e.

constant · standard error = z_{α/2} · σ/√n ≈ 1.96 S/√n < 15,

where S is the sample standard deviation (std, measured in seconds) of the data, and α is the significance level of your conclusion.
ANSWER: Our estimate of the expected time at which the last customer departs will simply be the average X̄_n of the n data values. WHY? ■
Suppose in a simulation, we have the option of continually generating additional data values 𝑋𝑖 . If
our objective is to estimate the value of E[𝑋𝑖 ] = 𝜃, when should we stop generating new data values?
3. Continue to generate additional data values, stopping when you have generated n values and

R(α) = z_{α/2} · S/√n < d,

where S is the sample std based on the sample.
4. The estimate of μ is given by X̄_n = (Σ_i X_i)/n.
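The stopping rule in steps 3-4 can be sketched as follows (Python; the simulation run `gen` is a placeholder Gaussian with mean 300 s and std 30 s, an assumption for illustration only):

```python
import math
import random

def simulate_until_precise(gen, d: float, z: float = 1.96, n_min: int = 100):
    """Generate values until z * S / sqrt(n) < d (with at least n_min values);
    return the sample mean (the estimate of mu) and the sample size used."""
    data = []
    while True:
        data.append(gen())
        n = len(data)
        if n < n_min:
            continue
        xbar = sum(data) / n
        s = math.sqrt(sum((x - xbar) ** 2 for x in data) / (n - 1))
        if z * s / math.sqrt(n) < d:
            return xbar, n

random.seed(7)
est, n = simulate_until_precise(lambda: random.gauss(300.0, 30.0), d=15.0)
print(est, n)   # stops once 1.96 * S / sqrt(n) < 15 seconds
```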
EXPLAINED SOLUTION (statistical): A practical and feasible answer to this question is that we should first choose an acceptable value d for the error R(α) = z_{α/2} · se, where

se = S.E.(X̄) = σ/√n

is the standard error of our estimator (say the sample mean X̄ of θ = μ).
The higher the precision of our simulation, the smaller the value d that should be found, fulfilling

R(α) = z_{α/2} · se ≈ 1.96 S/√n < d = 15,

using the significance level α = 0.05 [of an interval estimate of μ] in Equation 8.12 below:

L = x̄ − z_{α/2} · s/√n ≤ μ ≤ x̄ + z_{α/2} · s/√n = U.   (8.12)
p = 1 − α/2 = Φ(z):   99.5%   97.5%   95%    90%    80%    75%    50%
z = Φ⁻¹(p):           2.576   1.960   1.645  1.282  0.842  0.674  0
How do we find R(α)? The random variable X̄ is determined by the mean E[X̄] = μ and the variance Var[X̄] = σ²/n, so we set

• the estimator μ̂ = X̄, with the standard error of X̄ being se = σ/√n,

or equivalently

P[ X̄ − z_{α/2} · σ/√n ≤ μ ≤ X̄ + z_{α/2} · σ/√n ] = P[ |μ − X̄| < z_{α/2} · σ/√n ] = 1 − α.   (8.13)
ELUCIDATION
1. In practice, when the population is generic, being either normal or not, we often use the interval (8.12). We then need a large sample, of size n > 100, and compute the sample standard deviation

s = √( Σ_{i=1}^{n} (x_i − x̄)² / (n − 1) )

replacing σ.
We have already exploited the Central Limit Theorem (CLT) in (??), which says the standardized sample mean

Z_n = (X̄ − μ)/(S/√n) ≈ (X̄ − μ)/(σ/√n) −→ N(0, 1).
2. If the population is arbitrary, σ is unknown, and we cannot generate a large simulated sample, we must replace Z by the Student distribution T.
3. REMARK: Since the sample standard deviation S may not be a particularly good estimate of σ (nor may the normal approximation be valid) when the sample size is small (we must then use the Student T distribution instead of the Gaussian), we recommend the following procedure.
8.4
Variance Reduction Technique (VRT)
Now we focus on statistical efficiency (although programming efficiency also matters), as measured by the variances of the output random variables from a simulation.
If we can somehow reduce the variance of an output random variable of interest, such as (i) the average delay in queue or (ii) the average cost per month in an inventory system, without disturbing its expectation, we can obtain greater precision. Mathematically, higher precision means either achieving a desired precision with less simulating,
• or having smaller confidence intervals [e.g., of the mean defined in Equation 8.12] for the same amount of simulating.
The methods of getting better precision by reducing the variance of a parameter of interest are grouped into a computational class called variance reduction techniques (VRT). The most popular ones include Control Variables, discussed here, and the advanced method named Conditioning, discussed in more detail in Section 8.4.2.
Suppose we are interested in computing the parameter θ = E[g(X_1, X_2, ···, X_n)], the mean of a given function g().
2. Repeat similarly step 1 in 𝑘 independent times, until you have generated 𝑘 (some predeter-
mined number) sets, and so have also computed 𝑌1 , 𝑌2 , . . . , 𝑌𝑘 .
3. Now, Y_1, Y_2, ..., Y_k are independent and identically distributed random variables, each having the same distribution as g(X_1, X_2, ···, X_n). Thus, if we let Ȳ denote the average of these k random variables, that is, Ȳ = Σ_{i=1}^{k} Y_i / k, then
It is often the case that it is not possible to analytically compute the preceding, and in such cases we attempt to use simulation to estimate θ. Variance-reduction methods include Variance Reduction by Control Variables and by Conditioning, both based on the above original steps.
(iii) The variance of our estimator Ȳ, V[Ȳ] = kV[Y_i]/k² = V[Y_i]/k, which is usually not known in advance, must be estimated from the generated values Y_1, Y_2, ···, Y_k. ■
Let X be an output random variable, such as the total delay time of the first 100 = 99 + 1 = n + 1 customers delayed in queue, and assume we want to estimate θ = E[X]. [In the general case we might want to estimate θ_g = E[g(X)] for a given function g().]
Suppose that 𝑍 is another random variable (in general take 𝑍 := 𝑓 (X) another function of 𝑋)
involved in the simulation that is thought to be correlated with 𝑋 (either positively or negatively),
and that we know the value of 𝜈 = 𝜇𝑍 = E[𝑍].
***
For instance, redefine Y = X as an extension of X, and take the function g() as the delay time of customer 100 in Problem 8.5,

g(Y) = D_{n+1} = D_100 = ?
Then 𝑍 could be the sum of the service times of the first 𝑛 = 99 customers who complete their service
in the queueing model mentioned above, so we would know its expectation since we generated the
service-time variates Y = 𝑆1 , 𝑆2 , · · · , 𝑆99 from some known input distribution:
It is reasonable to suspect that larger-than-average service times (i.e., Z > ν) tend to lead to longer-than-average delays (Y > θ) and vice versa. This means Z is correlated with Y, in this case positively. An essential conclusion is that we use our knowledge of Z’s expectation to pull Y (down or up) toward its expectation θ, thus reducing its variability about θ from one run to the next. ■
Let Cov[B_1, B_2] and Corr[B_1, B_2] respectively be the covariance and correlation between any pair of random variables B_1 and B_2. It is well known that the variance of the sum B_1 + B_2 is V[B_1 + B_2] = V[B_1] + V[B_2] + 2 Cov[B_1, B_2]. Since V[B_1], V[B_2] ≥ 0, we can reduce the variance V[B_1 + B_2] by pushing Cov[B_1, B_2] < 0, as negative as possible.
• Suppose that for some other function 𝑓 , the expected value of random function 𝑍 := 𝑓 (X) is known,
say 𝜇 = 𝜇𝑍 = E[𝑍].
𝑊 = 𝑔(X) + 𝐴 [𝑓 (X) − 𝜇]
And, for this value of A, the variance of W is V[W] = V[g(X)] − A² V[f(X)].
Because V[f(X)] and Cov[f(X), g(X)] are usually unknown, the simulated data should be used to estimate these quantities. Dividing the previous equation by V[g(X)] gives

V[W] / V[g(X)] = 1 − Corr²[f(X), g(X)] ≤ 1.   (8.17)
Knowledge Box 3.
Consequently, the use of a control variate Z := f(X) will greatly reduce the variance of the simulation estimator W of θ = E[g(X)], since the ratio

V[W] / V[g(X)] < 1.
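A control-variate estimator along the lines of (8.17) can be sketched as follows (Python; the toy target is θ = E[e^U] for U uniform, with control Z = U whose mean μ_Z = 1/2 is known; the optimal coefficient A = −Cov(Y, Z)/Var(Z) is estimated from the same simulated data):

```python
import math
import random

def control_variate_mean(pairs, mu_z: float) -> float:
    """Estimate E[Y] from (y, z) pairs using a control variate Z with known mean:
    W = Y + A * (Z - mu_z), with the variance-minimizing A = -Cov(Y, Z)/Var(Z)."""
    n = len(pairs)
    ybar = sum(y for y, _ in pairs) / n
    zbar = sum(z for _, z in pairs) / n
    cov = sum((y - ybar) * (z - zbar) for y, z in pairs) / (n - 1)
    var_z = sum((z - zbar) ** 2 for _, z in pairs) / (n - 1)
    a = -cov / var_z
    return ybar + a * (zbar - mu_z)   # average of the W_i

random.seed(8)
pairs = [(math.exp(u), u) for u in (random.random() for _ in range(10_000))]
print(control_variate_mean(pairs, mu_z=0.5))  # close to e - 1 = 1.71828...
```

Because Corr(e^U, U) is very high, the ratio (8.17) is tiny here and the estimate is far tighter than the plain average of the y values.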
Conditional expectation
We first need a few concepts for the 2nd method of Variance Reduction.
• E[Y | x] in (8.18) is defined like E[Y], as a weighted average of all the possible values of Y, but now with the weight given to the value y being equal to the conditional probability p_Y(y|x) given X = x. So E[Y|X] is a random variable, and its mean exists.
• E[𝑔(𝑌 )|𝑋] as a function of 𝑋, taking values E[𝑔(𝑌 )|𝑋 = 𝑥], so is a random variable whose mean
can be calculated.
• Just as conditional probabilities satisfy all the properties of ordinary probabilities, so do the condi-
tional expectations satisfy all the properties of ordinary expectations.
B/ Extending Eq. (8.20) gives the conditional mean E[𝑔(𝑌 )|𝑋] of 𝑔(𝑌 ), and
• The conditional variance of Y, given the value X = x, is the variance of Y with respect to the conditional distribution of Y given X = x.
That is, V[Y | X] equals the (conditional) expected square of the difference between Y and its (conditional) mean E[Y|X] when the value of X is given. In other words, V[Y | X] is exactly analogous to the usual definition of variance, but now all expectations are conditional on X.
PROOF
From V[Y | X] = E[Y²|X] − (E[Y|X])², taking the expectation of both sides w.r.t. X gives

E[ V[Y | X] ] = E[ E[Y²|X] ] − E[ (E[Y|X])² ] = E[Y²] − E[ (E[Y|X])² ].

Since E[ E[Y|X] ] = E[Y], we get V[ E[Y|X] ] = E[ (E[Y|X])² ] − (E[Y])². Adding the two,

E[ V[Y|X] ] + V[ E[Y|X] ] = E[Y²] − (E[Y])² = V[Y]. ■
Let 𝑋1 , 𝑋2 , · · · , 𝑋𝑁 be a random sample (iid) from a certain distribution, where 𝑁 itself is a natural-
valued random variable (having its own distribution).
The compound random variable of 𝑋𝑖 and 𝑁 is given by
𝑆𝑁 := 𝑋1 + 𝑋2 + · · · + 𝑋𝑁 . (8.25)
In practice, 𝑁 may be the number of people stopping at a service station in a day, and the 𝑋𝑖 are
the amounts of gas they purchased.
One can find the mean and variance of 𝑆𝑁 if observations are random.
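Under the extra assumption that N is independent of the X_i, Wald's identity gives E[S_N] = E[N] · E[X]; a quick simulation check (a Python sketch with hypothetical toy distributions for N and the X_i):

```python
import random

def compound_sum(gen_n, gen_x) -> float:
    """One draw of S_N = X_1 + ... + X_N with a random number of terms N."""
    return sum(gen_x() for _ in range(gen_n()))

random.seed(9)
# Toy model: N uniform on {0,...,10} (mean 5), X exponential with mean 2.
draws = [compound_sum(lambda: random.randint(0, 10),
                      lambda: random.expovariate(0.5))
         for _ in range(20_000)]
print(sum(draws) / len(draws))  # close to E[N] * E[X] = 5 * 2 = 10
```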
PROOF.
If for some new random variable Z we can compute E[Y | Z], then, from the conditional variance formula (8.24), we get a smaller variance. Indeed, in the conditional variance formula V[Y] = E[ V[Y|Z] ] + V[ E[Y|Z] ], we always have V[Y | Z] ≥ 0, hence the mean E[ V[Y|Z] ] ≥ 0, so

V[ E[Y|Z] ] ≤ V[Y].
Observe the delays D_1, D_2, ..., D_{N(T)} of the first N(T) arrivals during the time interval [0, T] in queue QS, and put

Y = g(D_1, D_2, ..., D_{N(T)}) = Σ_{i=1}^{N(T)} D_i.
If we simulate D_i ∼ D as iid delays of customers in QS, and assume the mean E[D] is known (or, if not, compute its sample mean D̄), then we use

E[Y] = E[ Σ_{i=1}^{N(T)} D_i ] = m(T) E[D],
where 𝑚(𝑇 ) := E[𝑁 (𝑇 )] is the expected number of renewals in the time interval (0, 𝑇 ).
The average time that a customer spends in the system QS is obviously found as

ACS = E[D] = E[Y] / m(T)   (8.28)
where Y = Σ_{i=1}^{N(T)} D_i is the sum of the times spent in QS by all arrivals up to T.
Arrival times:  0 ‖------A_n----A_{n+1}------A_{n+2}--->
with X_n = A_{n+1} − A_n and service time S_{n+1}.
Note that the delay in queue of customer n + 1 is as given in PROBLEM 8.5; hence the moment when that customer leaves the queue is L_{n+1} = D_{n+1} [continued from PROBLEM 8.5].
USAGE 2: Compute the mean E[Z], where Z is the total service time in [0, T]. Taking Z as our control variable and replacing Y by E[Y|Z], we get an estimator E[Y|Z] of smaller variance, by the above argument (8.27): V[ E[Y|Z] ] ≤ V[Y].
Here N(T) is the number of arrivals by time T. The quantity m(T) might be known or unknown, but evidently N(T) is a natural simulation estimator of m(T). Of course, when the arrival process follows a homogeneous Poisson process with constant rate λ, then

m(T) = E[N(T)] = ∫₀ᵀ λ dt = λT. ■
NEXT OBJECTIVES:
To present several types of comparison and the problem of design selection (choosing the best among competing system designs) that have been found useful in simulation, together with appropriate statistical procedures for their solution, and numerical examples.
We will discuss statistical analyses of the output from several different simulation models that
might represent competing system designs or alternative operating policies. This is a very important
subject, since the real utility of simulation lies in comparing such alternatives before implementation.
8.5
Comparison of Alternative System Configurations
Many decision-making problems require determining whether the means, proportions, or other parameters of two populations or systems are the same or different. In general, the two-sample problem arises when two systems or processes are to be compared, for instance:
2. Saliva concentration of respondents who were lying versus those who were truthful.
• In Example 1, Start-up time is the response variable (measured at a specified computer); the
explanatory variable or treatment is the OS type, say
• Taking Example 4, suppose a random sample of 16 patients is taken. Randomly allocate eight of
the patients to treatment 1 and the remaining eight to treatment 2.
• To assess the effectiveness of the treatments it is usual that one of the treatments is a ‘control’ i.e.,
no treatment or the usual treatment. Thus
• First, construct confidence intervals on the difference in means of two normal distributions.
If the two performance indicators are population means, we study their difference in Section 8.5.1. If they are population proportions or variances, we study their ratio in Sections 8.5.2 and 8.6.
However, whether the two populations X and Y are independent or not depends on how the simulations are executed, and this determines which of the two confidence-interval approaches discussed in the next parts applies.
Often we set Δ_0 = 0 and observe the two systems X, Y to get data 𝒟 = x, y; then we employ the fact:
Data 𝒟 supporting the null H_0 (we do not reject it) is equivalent to the fact that 0 ∈ CI(ξ) up to some significance level α ∈ (0, 1).
Hence we will choose a suitable way (Hypothesis Testing or Confidence Interval) to describe our solution.
4. Compute an appropriate test statistic (from the observed data 𝒟) of the standardized variable G of G_m with

G = ( X̄ − Ȳ − (μ_X − μ_Y) ) / √( σ²_X/n_1 + σ²_Y/n_2 );

G follows the standard Gauss N(0, 1) distribution.
T = ( x̄ − ȳ − Δ_0 ) / ( s_p √( 1/n_1 + 1/n_2 ) )

with n_1 + n_2 − 2 d.o.f., where the pooled variance s²_p depends on the variances s²_x, s²_y.
5. State the rejection criteria for the statistic at a certain significance level α.
6. Draw appropriate conclusions with either diagram 8.3 for G or 8.4 for T. If using T, just apply the same argument, but remember to use the degrees of freedom n_1 + n_2 − 2 when locating the suitable critical value t_α or t_{α/2}.
The T statistic with critical value t has d.f. n_1 + n_2 − 2 since the whole data 𝒟 = x, y is composed of two independent samples having d.f. n_1 − 1 and n_2 − 1 respectively. ■
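Steps 4-6 for the pooled-variance T statistic can be sketched as follows (Python; the two small input lists are purely illustrative):

```python
import math

def pooled_t(x, y, delta0: float = 0.0):
    """Two-sample T statistic with pooled variance; returns (t, degrees of freedom)."""
    n1, n2 = len(x), len(y)
    xbar, ybar = sum(x) / n1, sum(y) / n2
    s2x = sum((v - xbar) ** 2 for v in x) / (n1 - 1)
    s2y = sum((v - ybar) ** 2 for v in y) / (n2 - 1)
    sp2 = ((n1 - 1) * s2x + (n2 - 1) * s2y) / (n1 + n2 - 2)   # pooled variance
    t = (xbar - ybar - delta0) / (math.sqrt(sp2) * math.sqrt(1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

t, df = pooled_t([2, 4, 6], [1, 2, 3])
print(t, df)
```

Compare |t| against the critical value t_{α/2} with df degrees of freedom to decide whether to reject H_0.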
Start-up time 𝑋 of computers, it is conjectured, could be related to the operating system (OS) used
on the machines.
Two groups of laptops are randomly assigned to one of two OS: Windows or Linux.
A measure of start-up time 𝑋 (in seconds) is then obtained for each of the subjects:
Assumptions:
♣ QUESTION. Compare the start-up times of the two operating systems using the above data and
assumptions.
Notation. 𝑋1 = 𝑊 , 𝑋2 = 𝐿.
HINT: For the Windows-run laptops , let 𝑥1 be the sample mean, 𝑠22 be the sample variance;
For the Linuxs-run laptops , let 𝑥2 be the sample mean, 𝑠22 be the sample variance,
Hypotheses:
H_0 : μ_1 = μ_2, or μ_1 − μ_2 = Δ_0 = 0, versus H_1 : μ_1 ≠ μ_2.
We do not have to assume that X and Y are independent here, so we compute a Confidence Interval and use Gosset’s T distribution in (??).
• Let 𝑛1 = 𝑛2 = 𝑛 (say, or we are willing to discard some observations from the system on which we
actually have more data), we can pair 𝑋1𝑗 with 𝑋2𝑗 to define ‘gap’ variable 𝑍 = 𝑋 − 𝑌 = 𝑋1 − 𝑋2 ,
with the 𝑗th observation 𝑍𝑗 = 𝑋𝑗 − 𝑌𝑗 = 𝑋1𝑗 − 𝑋2𝑗 for 𝑗 = 1, 2, . . . , 𝑛.
• Then the 𝑍𝑗 ’s are IID random variables and E[𝑍𝑗 ] = 𝜉 = E[𝑍], the quantity for which we want to
construct a confidence interval. Thus, we can let a sample ‘mean gap’
Z̄(n) = ( Σ_{j=1}^{n} Z_j ) / n
and a ‘variance estimator’ of 𝑍(𝑛) as [explain why yourself]
V̂[Z̄(n)] = Var̂[Z̄(n)] = S²_{Z̄(n)} = Σ_{j=1}^{n} [Z_j − Z̄(n)]² / ( n(n − 1) )
DISCUSSION.
1. If the 𝑍𝑗 ’s are normally distributed, this confidence interval CI (𝑍(𝑛)) is exact, i.e., it covers 𝜉 = E[𝑍]
with probability 1 − 𝛼; otherwise, we rely on the central limit theorem (CLT, see Equation 8.52),
which implies that this coverage probability will be near 1 − 𝛼 for large 𝑛.
2. We did not assume that 𝑋 and 𝑌 are independent, nor did we have to assume that Var[𝑋] =
Var[𝑌 ].
3. Using (8.32) we essentially reduced the two-system problem to one involving a single sample,
namely the 𝑍𝑗 ’s . In the next two sections we discuss estimating measures of performance other
than means, namely proportion and variance.
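The paired approach above reduces to a one-sample interval on the Z_j's, which can be sketched as (Python; `t_crit` is the Student critical value t_{n−1, α/2} supplied by the user, and the paired data are illustrative):

```python
import math

def paired_ci(x, y, t_crit: float):
    """CI for xi = E[Z], with Z_j = X_j - Y_j, using the variance
    estimator of Z-bar(n) from the text."""
    z = [a - b for a, b in zip(x, y)]
    n = len(z)
    zbar = sum(z) / n
    var_zbar = sum((v - zbar) ** 2 for v in z) / (n * (n - 1))  # V-hat[Z-bar(n)]
    half = t_crit * math.sqrt(var_zbar)
    return zbar - half, zbar + half

# Toy paired data; 3.182 is t_{3, 0.025}, for a 95% interval with n = 4.
lo, hi = paired_ci([5, 6, 7, 8], [1, 2, 3, 5], t_crit=3.182)
print(lo, hi)
```

If 0 lies outside (lo, hi), the data do not support H_0: ξ = 0 at the corresponding significance level.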
L_w = E[X_w] = (1/T) ∫₀ᵀ Q(t) dt
where 𝑋𝑤 = 𝑄(𝑡) is the queue length function at time 𝑡.
the true proportion of spam emails that bombard the mail server of a firm,
the true proportion of stocks of a stock market that go up or down each week,
the true proportion of households in a country that own personal computers ...
• The procedure to find the confidence interval for a population proportion is similar to that for the population mean, but the formulas are a bit different although conceptually identical.
• While the formulas are different, they are based upon the same mathematical foundation given to us by the Central Limit Theorem. Because of this we will see the same basic format using the same three pieces of information, namely a) the sample value (point estimate) of the key parameter, b) the standard error of that estimate, and c) the number of standard deviations (critical value) we need to have in our estimated confidence interval.
Denote by 𝑍 ∼ B(𝑝) a binary (Bernoulli) random variable, and suppose that we would like to estimate the probability 𝑝 = P[𝑍 ∈ 𝐵] (success probability), where 𝐵 is a set of real numbers, like 𝐵 = [0, 5).
Make 𝑛 independent replications and let 𝑍1, 𝑍2, . . . , 𝑍𝑛 be the resulting IID Bernoulli random variables. Put 𝑋 = ∑_{𝑖=1}^{𝑛} 𝑍𝑖, just the number of 𝑍𝑖 ’s that fall in the set 𝐵. The probability 𝑝 expresses the likelihood of the event 𝐵 occurring, or the proportion 𝑃 of successful cases 𝑍𝑖 ∈ 𝐵.
Distribution used for a proportion 𝑃 : Firstly, the underlying distribution of a proportion 𝑃 of inter-
est is a binomial distribution. Why? 𝑋 clearly represents the number of successes in 𝑛 trials, then
𝑋 is a binomial variable, and 𝑋 ∼ Bin(𝑛, 𝑝) where 𝑛 is the number of trials.
Secondly, the mean and standard deviation (standard error) of the estimator 𝑃̂:

E[𝑃̂] = 𝜇_𝑃̂ = E[𝑋/𝑛] = 𝑛𝑝/𝑛 = 𝑝,  let 𝑞 = 1 − 𝑝,
V[𝑃̂] = V[𝑋/𝑛] = 𝜎²_𝑋 /𝑛² = 𝑛𝑝𝑞/𝑛² = 𝑝𝑞/𝑛 = 𝜎²_𝑃̂.    (8.33)
This example shows that comparing two or more systems by some sort of mean system response may result in misleading conclusions.
Consider a bank with five tellers and one queue, which opens its doors at 9 a.m., closes its doors at 5 p.m., but stays open until all customers in the bank at 5 p.m. have been served. Assume that we simulated this dynamic system in 10 independent replications and obtained the data in Table 8.2, under the assumptions below:
i) customers arrive in accordance with a Poisson process at rate 𝜆 = 1 per minute (i.e., IID exponential interarrival times with mean 1/𝜆 = 1 minute),
ii) service times are IID exponential random variables with mean 1/𝜇 = 4 minutes, and
iii) customers are served in a FIFO manner.
Table 8.2: Results for 10 independent replications of the bank model ([7])
The 2nd column of Table 8.2, for instance, gives the total number of customers served in a work day (9 a.m. till 5 p.m., more or less).
The utilization factor 𝜌 = 𝜆/(5𝜇) = 1/(5 · 1/4) = 0.8 applies to the 𝑀/𝑀/5 queue.
We want to compare the policy of having one queue for each teller (a parallel system) with the policy of having one queue feed all five tellers (the 𝑀/𝑀/5) on the basis of the mean delay in queue E[𝑊] and the mean queue length 𝐿𝑤 above.⁷
⁷ Table 8.2 shows several typical output statistics from 10 independent replications of a simulation of the bank, assuming that no customers are present initially. Table 8.3 gives the results of making one simulation run of each policy.
We obtain a point estimate of the average system response E[𝑊] over a day, which is given by

E[𝑊] = E[ (1/𝑁) ∑_{𝑖=1}^{𝑁} 𝐷𝑖 ] = 2.03
Table 8.3: Simulation results for the two bank policies via the means

Measure of performance (estimates of mean)     Five queues   One queue
Mean operating time 𝑂𝑝, hours                  8.14          8.14
Mean average delay, minutes                     5.57          5.57
Mean average number of customers in queue       5.52          5.52

Table 8.3 gives average results of a typical simulation run for each of the two bank policies.⁸ On the basis of “average system response” (the row of mean average delay E[𝑊]), it would appear that the two policies are equivalent (shown in columns 2 and 3). However, this is clearly not the case.
⁸ Under the three queue assumptions above (the arrival process and the service time of the 𝑖th customer, 𝑖 = 1, 2, . . . , 𝑁, were taken to be the same for both policies).
Table 8.4: Simulation results for the two bank policies: proportions
Rows of Table 8.4 give estimates, computed from the same two simulation runs used above, of the
expected proportion of customers with a delay in the interval [0, 5) (in minutes), ..., the expected
proportion of customers with a delay in [40, 45) for both policies.
• Since customers need not be served in the order of their arrival with the multiqueue policy, we
would expect this policy to result in greater variability of a customer’s delay.
delays greater than or equal to 20 minutes for the five-queue and one-queue policies, respec-
tively. ■
The formula for the confidence interval for a population proportion follows the same format as that for an estimate of a population mean. Therefore, we can assert that

P[ −𝑧_{𝛼/2} < 𝑍 < 𝑧_{𝛼/2} ] = 1 − 𝛼,  with 𝑍 = (𝑃̂ − 𝑃)/𝜎_𝑃̂    (8.35)

where 𝑧_{𝛼/2} is the value above which we find an area of 𝛼/2 under the standard normal curve. Substituting for 𝑍 with the 𝜎_𝑃̂ obtained in (8.34), we write:

P[ −𝑧_{𝛼/2} < (𝑃̂ − 𝑃)/𝜎_𝑃̂ < 𝑧_{𝛼/2} ] = 1 − 𝛼,    (8.36)

this gives us the CI = 𝑃̂ ± 𝑧_{𝛼/2} √(𝑝𝑞/𝑛) of 𝑃 with significance level 𝛼, meaning

𝑃̂ − 𝑧_{𝛼/2} √(𝑝𝑞/𝑛) < 𝑃 < 𝑃̂ + 𝑧_{𝛼/2} √(𝑝𝑞/𝑛)
MATHEMATICAL MODELS, DESIGNS And ALGORITHMS
CHAPTER 8. STATISTICAL SIMULATION: FUNDAMENTALS
70 FOR SYSTEM PERFORMANCE EVALUATION
CONCLUSION
• A clear conclusion from the above example is that comparing alternative systems or policies on the basis of average system behavior alone can sometimes result in misleading conclusions. Furthermore, proportions can be a useful measure of system performance.
• More precisely, 𝑝̂ is the numerical value of the statistic 𝑃̂; the estimated proportion of successes 𝑝̂ is also a point estimate for 𝑃, the true population proportion.
♦ EXAMPLE 8.10.
(A) [Marketing Research.] Suppose that a market research firm is hired to estimate the percent
of adults living in a large city who have cell phones. Five hundred randomly selected adult residents
in this city are surveyed to determine whether they have cell phones. Of the 500 people sampled,
421 responded yes - they own cell phones.
Using a 95% confidence level, compute a confidence interval estimate for the true proportion of
adult residents of this city who have cell phones.
Let 𝑋 = the number of people in the sample who have cell phones. 𝑋 is binomial: the random
variable is binary, people either have a cell phone or they do not.
Interpretation: We estimate with 95% confidence that between 81% and 87.4% of all adult resi-
dents of this city have cell phones.
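The arithmetic of part (A) can be checked with a short script; a minimal Python sketch (the critical value 1.96 = 𝑧_{0.025} is taken as given):

```python
import math

n, x = 500, 421                # sample size and number of "yes" answers
p_hat = x / n                  # point estimate of the true proportion
q_hat = 1 - p_hat
z = 1.96                       # z_{alpha/2} for a 95% confidence level
margin = z * math.sqrt(p_hat * q_hat / n)  # z times the estimated standard error

lower, upper = p_hat - margin, p_hat + margin
print(f"p_hat = {p_hat:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
# p_hat = 0.842, 95% CI = (0.810, 0.874)
```

The same script with x = 300 and z = 1.645 handles part (B).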
(B) [Finance Study.] A financial officer for a company wants to estimate the percent of accounts
receivable that are more than 30 days overdue. He surveys 500 accounts and finds that 300 are
more than 30 days overdue. Compute a 90% confidence interval for the true percent of accounts
receivable that are more than 30 days overdue, and interpret the confidence interval.
Two mutual funds promise the same expected return; however, one of them recorded a 10% higher volatility over the last 30 days.
Is this significant evidence for a conservative investor to prefer the other mutual fund?
Denote by 𝜎²_𝑋 = V[𝑋] and 𝜎²_𝑌 = V[𝑌] the variances of the two populations; we’ll see how to test the null hypothesis 𝐻0: 𝜎²_𝑋 = 𝜎²_𝑌 against 𝐻𝐴: 𝜎²_𝑋 ≠ 𝜎²_𝑌. Now, to compare variances, two independent samples X = (𝑋1, 𝑋2, · · · , 𝑋𝑛) and Y = (𝑌1, 𝑌2, · · · , 𝑌𝑚) are collected, one from each population, as in Figure 8.6.
Unlike population means or proportions, variances are scale factors, and they are compared through their ratio

𝜃 = 𝜎²_𝑋 / 𝜎²_𝑌.

A natural estimator for the ratio of population variances 𝜃 = 𝜎²_𝑋 / 𝜎²_𝑌 is the ratio of sample variances

𝜃̂ = 𝑠²_𝑋 / 𝑠²_𝑌 = { [ ∑_{𝑖=1}^{𝑛} (𝑋𝑖 − 𝑋̄)² ]/(𝑛 − 1) } / { [ ∑_{𝑖=1}^{𝑚} (𝑌𝑖 − 𝑌̄)² ]/(𝑚 − 1) }    (8.37)
𝐹 = (𝑠²_𝑋 /𝜎²_𝑋) / (𝑠²_𝑌 /𝜎²_𝑌) = (𝑠²_𝑋 𝜎²_𝑌) / (𝑠²_𝑌 𝜎²_𝑋)

From Knowledge Box 7 we see that for normal data, both ratios 𝑠²_𝑋 /𝜎²_𝑋 and 𝑠²_𝑌 /𝜎²_𝑌 follow 𝜒²-distributions. We can now conclude that the ratio of two independent 𝜒² variables, each divided by its degrees of freedom, has the Fisher distribution.
Historical fact: The distribution of this statistic was obtained in 1918 by the famous English statistician and biologist Sir Ronald Fisher (1890-1962) and was developed and formalized in 1934 by the American mathematician George Snedecor (1881-1974). Its standard form, after we divide each sample variance in formula (8.37) by the corresponding population variance, is therefore called the Fisher-Snedecor distribution or simply F-distribution with (𝑛 − 1) and (𝑚 − 1) degrees of freedom.
We now discuss the Fisher distribution and its statistics, which will be useful in the next chapters.
• Any F-distributed variable is a ratio of two non-negative continuous random variables, hence it is also non-negative and continuous. The numerator degrees of freedom are always mentioned first.
Figure 8.7: Critical values of the F-distribution and their reciprocal property.
• Interchanging the degrees of freedom changes the distribution, so the order is important because
in the first case we deal with 𝐹 (𝑛−1, 𝑚−1) distribution, and in the second case with 𝐹 (𝑚−1, 𝑛−1).
This leads us to an important general conclusion:
If 𝐹 has the 𝐹(𝑢, 𝑣) distribution, then the distribution of 1/𝐹 is 𝐹(𝑣, 𝑢). More precisely, the critical values of 𝐹(𝑢, 𝑣) and 𝐹(𝑣, 𝑢) satisfy

𝐹_{1−𝛼}[𝑢, 𝑣] = 1 / 𝐹_{𝛼}[𝑣, 𝑢].    (8.38)
• The F distributions are not symmetric but are right-skewed. The peak of the F density curve
is near 1; values far from 1 in either direction provide evidence against the hypothesis of equal
standard deviations. Critical values of F-distribution are visualized in Figure 8.7 and given in Table
A7, and we will use them to test hypothesis of comparing two variances.
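The reciprocal property (8.38) can also be illustrated by simulation, very much in the spirit of this chapter: build 𝐹(𝑢, 𝑣) samples as ratios of independent chi-square variables divided by their degrees of freedom and compare empirical critical values. A hedged Python sketch (the seed, the sample size, and the choice 𝑢 = 19, 𝑣 = 29 are arbitrary):

```python
import random

def f_sample(u, v, rng):
    # one draw from F(u, v): ratio of two chi-squares over their degrees of freedom
    chi_u = sum(rng.gauss(0, 1) ** 2 for _ in range(u))
    chi_v = sum(rng.gauss(0, 1) ** 2 for _ in range(v))
    return (chi_u / u) / (chi_v / v)

rng = random.Random(2023)
u, v, N, alpha = 19, 29, 30_000, 0.05
fs_uv = sorted(f_sample(u, v, rng) for _ in range(N))
fs_vu = sorted(f_sample(v, u, rng) for _ in range(N))

f_low_uv = fs_uv[int(alpha * N)]       # F_{1-alpha}[u, v]: lower critical value of F(u, v)
f_up_vu = fs_vu[int((1 - alpha) * N)]  # F_alpha[v, u]: upper critical value of F(v, u)
print(f_low_uv, 1 / f_up_vu)           # the two estimates nearly coincide, as (8.38) predicts
```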
TESTING PROCEDURE:

𝐹 = (𝑠²_𝑋 /𝜎²_𝑋) / (𝑠²_𝑌 /𝜎²_𝑌).
• Assumption 2: When we only need to know if two variances are equal, then we choose 𝜃0 = 1.
Under the null 𝐻0: 𝜎²_𝑋 = 𝜎²_𝑌 the test statistic becomes

𝐹0 = 𝐹_obs = 𝑠²_𝑋 / 𝑠²_𝑌    (8.40)
We can compute the rejection region or find the P-value, using the 𝐹(𝑛 − 1, 𝑚 − 1) distribution in both cases, as in Figure 8.10. Critical values of the F-distribution are given in Table A7.
For marketing purposes, a survey of users of two operating systems is conducted. Twenty users
of operating system Windows record the average level of satisfaction of 77 on a 100-point scale,
with a sample variance of 220.
Thirty users of operating system MacOS have the average satisfaction level 70 with a sample
variance of 155. We already know how to compare the mean satisfaction levels (testing means
of two populations).
Should we assume equality of the population variances, 𝜎²_𝑋 = V[𝑋] and 𝜎²_𝑌 = V[𝑌], and use the pooled variance? Here 𝑛 = 20, x̄ = 77, 𝑠²_𝑋 = 220; 𝑚 = 30, ȳ = 70, and 𝑠²_𝑌 = 155. To compare the population means by a suitable method, we have to test whether the two population variances are equal or not.
• We test 𝐻0: 𝜎²_𝑋 = 𝜎²_𝑌 vs 𝐻𝐴: 𝜎²_𝑋 ≠ 𝜎²_𝑌 with the test statistic 𝑓0 = 𝑠²_𝑋 /𝑠²_𝑌 = 220/155 = 1.42.
• This is a two-sided test, so the P-value is 𝑃 = 2 min{ P[𝐹 ≥ 𝑓0], P[𝐹 ≤ 𝑓0] }.
How do we compute these probabilities for the F-distribution with 𝑛 − 1 = 19 and 𝑚 − 1 = 29 d.o.f.?
• In Matlab use fcdf(a, u, v) and in R use pf(a, u, v) for calculating cdf at value 𝑎 with 𝑢 and 𝑣
degrees of freedom. Hence
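Lacking F tables or statistical software, these probabilities can also be estimated by plain Monte Carlo, in the spirit of this chapter. A Python sketch for this example (the seed and replication count are arbitrary choices):

```python
import random

def f_sample(u, v, rng):
    # one draw from F(u, v): a ratio of two chi-squares over their degrees of freedom
    chi_u = sum(rng.gauss(0, 1) ** 2 for _ in range(u))
    chi_v = sum(rng.gauss(0, 1) ** 2 for _ in range(v))
    return (chi_u / u) / (chi_v / v)

rng = random.Random(8)
f_obs = 220 / 155                    # observed statistic, about 1.42
N = 50_000
upper_tail = sum(f_sample(19, 29, rng) >= f_obs for _ in range(N)) / N
p_value = 2 * min(upper_tail, 1 - upper_tail)
print(round(p_value, 2))             # roughly 0.4, far above 0.05
```

With a P-value around 0.4 there is no evidence against equal variances, so the pooled-variance comparison of means is defensible here.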
We are asked to compare volatilities of two mutual funds and decide if one of them is more risky
than the other. So, this is a one-sided test of
𝐻0 : 𝜎𝑋 = 𝜎𝑌 (or 𝜎𝑋 /𝜎𝑌 = 𝜃0 = 1)  vs  𝐻1 : 𝜎𝑋 > 𝜎𝑌 .
Are all the conditions met? Can we use F-distribution for inference here?
• The data collected over the period of 30 days show a 10% higher volatility of the first mutual fund, i.e., 𝑠𝑋 /𝑠𝑌 = 1.1. So, this is a standard F-test, right?
• A careless statistician would immediately proceed to the test statistic 𝐹_obs = 𝑠²_𝑋 /𝑠²_𝑌 = (1.1)² = 1.21 [as in Eqn. 8.40], compare it against the critical value from Table A7 with 𝑛 − 1 = 29 and 𝑚 − 1 = 29 d.f., and jump to the conclusion that there is no evidence that the first mutual fund carries a higher risk.
• Indeed, why not? Well, every statistical procedure has its assumptions, conditions under which
our conclusions are valid. A careful statistician always checks the assumptions before reporting
any results.
• If we conduct an F-test and refer to the F-distribution, what conditions are required?
𝐹 = (𝑠²_𝑋 /𝜎²_𝑋) / (𝑠²_𝑌 /𝜎²_𝑌) = (𝑠²_𝑋 𝜎²_𝑌) / (𝑠²_𝑌 𝜎²_𝑋)

has the F-distribution with (𝑛 − 1) and (𝑚 − 1) degrees of freedom.
♣ OBSERVATION.
1. Apparently, for the F-statistic to have F-distribution under 𝐻0 , each of our two samples has to con-
sist of independent and identically distributed normal random variables, and the two samples
have to be independent of each other.
2. The F-test is quite robust. It means that a mild departure from the assumptions 1-3 will not affect
our conclusions severely, and we can treat our result as approximate.
3. However, if the assumptions are not met even approximately, for example, the distribution of our
data is asymmetric and far from normal, then the P-value computed above is simply wrong.
When they are not satisfied, the obtained results may be wrong and misleading.
Therefore, unless there are reasons to believe that all the conditions are met, they have to be
tested statistically.
A queuing system involves spontaneous arrivals of jobs, their random waiting time, assignment to
servers, and finally, their random service time and departure.
When designing a queuing system or a server facility, it is important to evaluate its vital perfor-
mance characteristics. Precisely, five important performance measures of an 𝑀/𝑀/1 queue system
are briefed in the fact box below.
INPUTS:
1. four servers;
3. a Poisson process of arrivals with the rate of 1 arrival every 4 min, independent of service times;
5. Suppose that after 15 minutes of waiting, jobs withdraw from a queue if their service has not
started.
• the average waiting time; the longest waiting time; the number of withdrawn jobs...
Readers should complete the case study within at most 12 weeks and write a report with at least three parts. You can choose your favorite programming language, such as MATLAB or R...
Server 𝛼 𝜆
I 6 0.3
II 10 0.2
III 7 0.7
IV 5 1.0
We study an M/M/4 system and write code to get the simulated data Y = (X, 𝑆) at the 4 servers, given that the service time 𝑆 is distributed as Gamma(𝛼, 𝜆) [see COMPLEMENT 16B in Section ??] with the parameters given in the table above. We generate 𝑋1, 𝑋2, · · · , 𝑋𝑛 from arrival times, coded by arrival in the program below. You might write code in R or any other language; here we give a guided MATLAB code segment.
Computation in MATLAB
The team writes code to compute performance metrics (ref. [76, Section 7.6]). We start by entering the model parameters.
The queuing system is ready to work! We start a “while”-loop over the number of arriving jobs. It will run until the end of the day, when the arrival time T reaches 14 hours, or 840 minutes. The length of this loop, the total number of arrived jobs, is random.
mu = 4;                   % mean interarrival time (1 arrival every 4 minutes)
j = 0; arrival = [];      % job counter and the vector of arrival times
T = 0;
while T < 840             % until the end of the day
    j = j + 1;            % next job
    T = T - mu*log(rand); % arrival time of job j (exponential increment)
    arrival = [arrival T];
end
% Next, parameters for the service times of the servers
k = 4;                    % the system has 4 servers
alpha = [6 10 7 5]; lambda = [0.3 0.2 0.7 1.0];
% we use Gamma distributed service times, which need 2 parameters
The arrival time 𝑇 is obtained by incrementing the previous arrival time by an Exponential interar-
rival time. Next, we need to assign the new job 𝑗 to a server, following the rule of random assignment.
There are two cases here: either all servers are busy at the arrival time 𝑇 , or some servers are avail-
able.
1. Think yourself how to make a sample vector 𝑆1 , 𝑆2 , · · · , 𝑆𝑛 at one of four servers, assuming that
𝑆 ∼ Gamma(𝛼, 𝜆) = Gamma(6, 0.3) say, then build data
Y = 𝑋1 , 𝑋2 , · · · , 𝑋𝑛 , 𝑆1 , 𝑆2 , · · · , 𝑆𝑛
2. Confirm numerically, at each server, that the control variable 𝑓(Y) given in Equation (8.45) and the delay time 𝐷𝑛+1 = 𝑔(X, 𝑆) = 𝑔(Y) as in Equation (8.44) are positively correlated for that data.
3. Finally, try your best to summarize the mean statistic of each key parameter [plus the utilization 𝜌
in Knowledge Box 3.3] from the simulated data at all 4 servers.
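For task 1, here is a minimal Python sketch, assuming (as an illustration) server I with 𝑆 ∼ Gamma(𝛼 = 6, 𝜆 = 0.3), where 𝜆 is read as a rate so that E[𝑆] = 𝛼/𝜆 = 20 minutes, and a mean interarrival time of 4 minutes as in the INPUTS:

```python
import random

rng = random.Random(42)
n = 5_000
mean_interarrival = 4.0          # one arrival every 4 minutes on average

arrivals = []                    # arrival times X_1 < X_2 < ... < X_n
t = 0.0
for _ in range(n):
    t += rng.expovariate(1 / mean_interarrival)   # exponential interarrival gap
    arrivals.append(t)

# service times S ~ Gamma(alpha = 6, rate lambda = 0.3) for server I;
# random.gammavariate takes shape and SCALE, so the scale is 1/lambda
alpha, lam = 6.0, 0.3
services = [rng.gammavariate(alpha, 1 / lam) for _ in range(n)]

print(sum(services) / n)         # sample mean, close to alpha/lam = 20 minutes
```

Repeating this with the (𝛼, 𝜆) pairs of servers II-IV yields the data Y = (𝑋1, · · · , 𝑋𝑛, 𝑆1, · · · , 𝑆𝑛) for tasks 2 and 3.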
In transition diagrams of a system-process, the nodes usually are the values of the system variable
𝑋, the number of jobs in the queueing system.
8.8 ASSIGNMENT II
■ PROBLEM: In Queueing Theory, Batch Service is the service that a system provides to customers under the same policy if they are assigned to the same group. The Batch Service rule is often designed by a time factor or another common factor.
If the time factor is concerned, the policy means that the same service rate will be applied to all units in a certain time interval. The service time of a server is then said to be group-demand adaptive, or just demand adaptive.
In an M/G/1 system the manager can design a demand-adaptive service time 𝑆 by employing the current queue length, following 3 rules:
For instance, at 9:00 am he observes 𝑚 = 12 customers waiting in the queue, so he sets the mean service time E[𝑆] = 1/𝑚 = 1/12 hours = 5 minutes, applied over the whole time interval [9 am, 10 am] to all of those customers.
DATA: Assume visitors arrive in an airline office with rate 𝜆 = 18 persons per hour. At 10:00 am
QUESTIONS: Working in units of minutes, students apply the rules of the above demand-adaptive service time 𝑆 to the time interval [10 am, 11 am] to compute
SUMMARY
The 𝑀/𝑀/1 system includes the two key processes of Arrival and Service, in which
It is precisely the Poisson variable 𝑁(𝑡) with rate 𝜆𝑡, and probability mass function

P[𝑁(𝑡) = 𝑛] = (𝜆𝑡)ⁿ 𝑒^{−𝜆𝑡} / 𝑛!,  𝑛 = 0, 1, 2, ...

The interarrival times are exponentially distributed with the same cdf 𝐹 ∼ E(𝜆 = 𝜇𝐹), i.e. {𝑋𝑛} ∼ 𝑋 = E(𝜇𝐹). Hence the mean time between two arrivals, or the mean inter-arrival time, is E[𝑋] = 1/𝜆 (see Theorem ??).
3. The service times {𝑆𝑛 } ∼ 𝒮 = E(𝜇𝐺 ) are i.i.d., with the cdf
For a given stable M/M/1 queue, we study the following performance indicators.
1. The number of jobs in the system at time point 𝑡 is denoted by 𝑋(𝑡). Probability of exactly 𝑛
jobs in the system at 𝑡 is
𝑝𝑛 (𝑡) = P[𝑋(𝑡) = 𝑛], 𝑛 ∈ Range(𝑋(𝑡)) (8.42)
3. The service demand is described via the arrival rate 𝜆𝐴 = 𝜆 ∈ ℝ⁺ and the service (processing) rate 𝜆𝑆 = 𝜇 of the server, where 𝑆 is the service time of a customer.
4. Utilization 𝑟 or 𝜌: If the queuing system consists of a single server, then the utilization 𝜌 is the
fraction of the time in which the server is busy, i.e., occupied.
𝜌 = arrival rate / service rate = 𝜆𝐴 /𝜆𝑆 = 𝜆/𝜇.    (8.43)
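As a sanity check of (8.43), a single-server FIFO queue can be simulated directly: the fraction of time the server is busy should approach 𝜌 = 𝜆/𝜇. A Python sketch with the illustrative values 𝜆 = 1 and 𝜇 = 1.25 (so 𝜌 = 0.8):

```python
import random

rng = random.Random(5)
lam, mu = 1.0, 1.25          # arrival and service rates, rho = lam/mu = 0.8
n = 100_000

t = 0.0          # arrival clock
depart = 0.0     # time at which the server next becomes free
busy = 0.0       # accumulated busy (service) time
for _ in range(n):
    t += rng.expovariate(lam)        # next Poisson arrival
    start = max(t, depart)           # wait if the server is still busy (FIFO)
    s = rng.expovariate(mu)          # exponential service time
    depart = start + s
    busy += s

rho_hat = busy / depart              # fraction of time the server was busy
print(round(rho_hat, 2))             # close to 0.8
```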
PROBLEM
Hypotheses:  𝐻0 : 𝜇1 = 𝜇2, or 𝜇1 − 𝜇2 = Δ0 = 0,  versus  𝐻1 : 𝜇1 ≠ 𝜇2.
Test Statistic: Since we do not know 𝜎1, 𝜎2 and the sample sizes are small, use the sample variances:

𝑇 = (x̄1 − x̄2) / ( 𝑠𝑝 √(1/𝑛1 + 1/𝑛2) )
The degrees of freedom are 𝑛1 + 𝑛2 − 2 = 6 + 7 − 2 = 11. We reject when 𝑇 > 𝑡_{𝑛1+𝑛2−2; 𝛼/2} or 𝑇 < −𝑡_{𝑛1+𝑛2−2; 𝛼/2}; that means 𝑇 > 2.20 or 𝑇 < −2.20.
Hence the test statistic 𝑇 = ... = 2.97 > 2.20: we reject the null hypothesis 𝐻0.
We conclude that there is a significant difference between the start-up times of the two groups (start-up time values are lower by an estimated 9.09 in the Linux-run group).
This problem will be extended to a small case study project in Part ??.
1. the {𝑋𝑛 } are mutually independent and identically distributed (i.i.d.) with distribution 𝐹 having
mean 𝜇𝐹 , and
2. the {𝑋𝑛} are also independent of the service times {𝑆𝑛}, which are i.i.d. with distribution 𝐺 having mean 𝜇𝐺.
QUESTION. Since the functional variable 𝑔(·) = 𝐷𝑛+1 is unknown (in fact it is a random variable), we want to figure out a pattern of 𝐷𝑛+1, then find its estimator by simulation.
• With the delay in queue 𝐷𝑛+1 of the (𝑛 + 1)st customer, and taking into account the possibility that the simulated 𝑋𝑖, 𝑆𝑖 may randomly be quite different from what might be expected, we can
• We see that 𝑓(Y) = 𝑍; and is 𝑔(Y) = 𝐷𝑛+1 = 𝑆𝑛 − 𝑋𝑛 + 𝑆𝑛+1 for all 𝑛 ≥ 1? Or is 𝑔(Y) = 𝐷𝑛+1 = ∑_{𝑖=1}^{𝑛} 𝑋𝑖 + ∑_{𝑖=1}^{𝑛} 𝑆𝑖 ?
PROBLEM 8.5 would be utilized in Section ??, about a Simulation Project of 𝑀/𝑀/𝑘 queue when
𝑘 > 1.
a) Suppose that there are 𝑚 terrorists in a group of 𝑁 visitors arriving per day in all airports of the
U.S., with 𝑚 ≪ 𝑁 . If you choose randomly 𝑛 visitors from that group, 𝑛 < 𝑁 , compute the expected
number of terrorists.
b) Use the moment generating function to prove that both the mean E[𝑋] and variance V[𝑋] of a
Poisson random variable 𝑋 with parameter 𝜆 are E[𝑋] = 𝜆; V[𝑋] = 𝜆.
c) Consider a Poisson process {𝐾(𝑡)} on time interval [0, 𝑡] with 𝑡 > 0 and positive rate of 𝜆 events
per time unit.
a) There are 𝑚 terrorists in a group of 𝑁 visitors. You chose randomly 𝑛 visitors from that group, 𝑛 < 𝑁. Denote by 𝑋 the number of terrorists in that random sample of 𝑛 visitors; then 𝑋 = 𝐵1 + 𝐵2 + · · · + 𝐵𝑛 where each 𝐵𝑖 ∼ B(𝑝) marginally, with the same probability 𝑝 = 𝑚/𝑁 for each 𝐵𝑖. (Strictly speaking, when sampling without replacement the 𝐵𝑖 are not independent, so 𝑋 is hypergeometric rather than Bin(𝑛, 𝑝); but linearity of expectation does not require independence.) The linearity of expectation gives

E[𝑋] = 𝑛𝑝 = 𝑛𝑚/𝑁.
b) Prove that the mean E[𝑋] and variance V[𝑋] are given by E[𝑋] = 𝜆; V[𝑋] = 𝜆.
The moment generating function of a Poisson(𝜆) variable is 𝑀(𝑡) = E[𝑒^{𝑡𝑋}] = 𝑒^{𝜆(𝑒^𝑡 − 1)}, hence

𝑀′(𝑡) = 𝜆 𝑒^𝑡 𝑀(𝑡),
𝑀″(𝑡) = (𝜆² 𝑒^{2𝑡} + 𝜆 𝑒^𝑡) 𝑀(𝑡).

Using

𝑀^{(𝑛)}(𝑡)|_{𝑡=0} = 𝜇𝑛 = E[𝑋ⁿ] = 𝑀^{(𝑛)}(0)    (8.47)

we obtain

E[𝑋] = 𝜇 = 𝑀′(0) = 𝜆,
V[𝑋] = E[𝑋²] − E[𝑋]² = 𝑀″(0) − 𝑀′(0)² = (𝜆² + 𝜆) − 𝜆² = 𝜆.    (8.48)
⁹ The NHPP relaxes the Poisson process assumption of stationary increments. Thus it allows for the possibility that the arrival rate need not be constant but can vary with time.
8.10
COMPLEMENT 8A:
Non-homogeneous Poisson Process
Definition 8.6. {𝑁 (𝑡), 𝑡≥0} is a non-homogeneous (or non-stationary) Poisson process with inten-
sity (rate) function 𝜆(𝑡) if the following conditions are satisfied:
𝑝𝑘(𝑡) = P[𝑁(𝑡) = 𝑘] = 𝑒^{−𝑚(𝑡)} [𝑚(𝑡)]^𝑘 / 𝑘!,  𝑘 ≥ 0.    (8.50)

The function 𝑚(𝑡) is also called the principal function of the process.
• The NHPP 𝑁 (𝑡) follows a Poisson distribution with mean 𝑚(𝑡). The mean value function 𝑚(𝑡) of
this process is defined by Equation (8.49) .
• In the non-homogeneous case, the rate parameter 𝜆(𝑡) now depends on 𝑡. That is,
P{𝑁(𝜏 + 𝑡) − 𝑁(𝜏) = 1} ≈ 𝜆(𝜏) 𝑡, as 𝑡 → 0.
The second result follows because 𝑁 (𝑡) is Poisson random variable with mean 𝑚(𝑡), and if we let
𝑋(𝑡) = 𝑁 (𝑚−1 (𝑡)), then 𝑋(𝑡) is Poisson with mean 𝑚(𝑚−1 (𝑡)) = 𝑡.
To simulate the first 𝑇 time units of a non-homogeneous Poisson process with intensity function 𝜆(𝑡), 0 ≤ 𝑡 < ∞, let the constant 𝜆 be such that 𝜆(𝑡) ≤ 𝜆 for all 𝑡 ≤ 𝑇.
IDEAS
• Such a non-homogeneous Poisson process can be generated by a random selection of the event
times of a Poisson process {𝑁 (𝑡), 𝑡 ≥ 0} having rate 𝜆. That is, if an event of a Poisson process
with rate 𝜆 that occurs at time 𝑡 is counted (independently of what has transpired previously) with
probability 𝑝(𝑡) = 𝜆(𝑡)/𝜆,
then the process {𝑁𝑐 (𝑡), 𝑡 ≥ 0} of counted events is a non-homogeneous Poisson process with
intensity function 𝜆(𝑡) = 𝜆 𝑝(𝑡) for all 0 ≤ 𝑡 ≤ 𝑇.
(Timeline of the construction: 0 −− 𝑋1 −− 𝑋1 + 𝑋2 −− · · · −− ∑_{𝑖=1}^{𝑗} 𝑋𝑖 −− · · · −− 𝑇 −− ∑_{𝑖=1}^{𝑁} 𝑋𝑖 −−>)
Non-homogeneous Poisson (Input):
1. Generate independent random variables 𝑋1, 𝑈1, 𝑋2, 𝑈2, · · · where the 𝑋𝑖 are exponential with rate 𝜆 and the 𝑈𝑖 are random numbers, stopping at round

𝑁 = min{ 𝑛 : ∑_{𝑖=1}^{𝑛} 𝑋𝑖 > 𝑇 }.
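The counted-events (thinning) construction above can be sketched in Python; the intensity 𝜆(𝑡) = 2 + sin(𝑡) and the bound 𝜆 = 3 are illustrative assumptions:

```python
import math
import random

def thinning_nhpp(lam_of_t, lam, T, rng):
    # Event times of an NHPP on [0, T] by thinning a rate-lam Poisson process.
    # Requires lam_of_t(t) <= lam for all 0 <= t <= T.
    t, events = 0.0, []
    while True:
        t += rng.expovariate(lam)             # candidate event of the rate-lam process
        if t > T:
            return events
        if rng.random() < lam_of_t(t) / lam:  # count it with probability p(t) = lam(t)/lam
            events.append(t)

rng = random.Random(11)
T = 1000.0
events = thinning_nhpp(lambda t: 2 + math.sin(t), 3.0, T, rng)
# the expected count is m(T) = 2T + 1 - cos(T), about 2000 here
print(len(events))
```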
8.11
COMPLEMENT 8B: Statistical Inference for SPE
It is a well-known fact that the subject of statistical inference is mathematically classified into the two broad categories of Parameter Estimation and Hypothesis Testing. From the practical viewpoint, the statistical inference process is briefly formed by 4 steps as follows.
1. Define the problem area: determine the areas in which our interest, problems or open questions lie;
2. In that domain of interest, decide what kind of information we need to collect from observing the real world, or from experimenting in labs;
3. Analyze the observed data (what to know, why to do it, and how to conduct it with suitable methods);
4. Present the outcomes/decisions made from the whole process to the boss!
When studying Quality Analytics in previous chapters we partially employed statistical inference and often focused on population means.
Now, aiming at the study of System Performance Evaluation (SPE) and other fields, we switch to discussing statistical inference methods for population proportions and variances. Practically, estimating and testing the population variance are meaningful, to make sure that
[Source KPA]
The null hypothesis is that the ratio of the variances of the populations from which x and y were drawn, or in the data to which the linear models x and y were fitted, is equal to ratio.
Assume random variables 𝑋1, 𝑋2, · · · , 𝑋𝑛 ∼ᵢ.ᵢ.ᵈ. 𝑋 (the 𝑋𝑖 are independent and have the same distribution as a common random variable 𝑋) with mean 𝜇 and variance 𝜎².
The normal population: If the population 𝑋 ∼ N(𝜇, 𝜎²) then for any 𝑛 the sample mean

X̄ = (1/𝑛) ∑_{𝑖=1}^{𝑛} 𝑋𝑖

of observations 𝑋1, 𝑋2, · · · , 𝑋𝑛 has expectation E[X̄] = 𝜇 and variance V[X̄] = 𝜎²/𝑛. Moreover X̄𝑛 is normal (i.e. follows the Gauss distribution) for any 𝑛:

X̄𝑛 ∼ N(𝜇, 𝜎²/𝑛).
Generic population: If 𝑋 is not normal, then X̄𝑛 is approximated by the Gauss variable N(𝜇, 𝜎²/𝑛) only when 𝑛 is large (𝑛 > 30). The C.L.T. briefly says:
the sampling distribution of the sample mean tends to normality asymptotically.
♣ QUESTION. Only one concern remains: if the population 𝑋 is not normal but 𝑛 ≤ 30, what is the sampling distribution of X̄? Knowledge Box 6 will answer this.
Knowledge Box 6 (Sampling distributions of a generic statistic- The CLT as a special case).
Under appropriate conditions, generally if 𝑆 is a statistic of interest, and 𝜎𝑆 = √V[𝑆] is its standard error, then approximately its standardization 𝑍 = (𝑆 − E[𝑆])/𝜎𝑆 is standard Gaussian.
1. The fact 𝑍 = (𝑆 − E[𝑆])/𝜎𝑆 ∼ N(0, 1) equivalently shows that the squared standardization

𝑍² = [𝑆 − E[𝑆]]² / V[𝑆] ∼ 𝜒²(1).    (8.51)
2. Examples include the CLT, saying the sample mean of a random variable 𝑋 follows a normal distribution, meaning X̄ ∼ N(𝜇, 𝜎²/𝑛). Here we fix 𝑆 = X̄ and exploit Theorem 8.9, with E[𝑆] = E[X̄] = 𝜇 and 𝜎𝑆 = 𝜎/√𝑛. Then

𝑍𝑛 = (X̄𝑛 − 𝜇)/(𝜎/√𝑛)  satisfies  lim_{𝑛→∞} 𝑍𝑛 = 𝑍 ∼ N(0, 1).    (8.52)
Several important tests of statistical hypotheses are based on the Chi-square distribution. Chi-
square distribution was introduced around 1900 by a famous English mathematician Karl Pearson
(1857-1936) who is regarded as a founder of the entire field of Mathematical Statistics.
In the present section we assume that 𝑋1, 𝑋2, · · · , 𝑋𝑛 are i.i.d. N(𝜇, 𝜎²) random variables. We have seen that 𝜎² is estimated unbiasedly and consistently by the sample variance

𝑠² = (1/(𝑛 − 1)) ∑_{𝑖=1}^{𝑛} (𝑋𝑖 − X̄)².
• The summands (𝑋𝑖 − 𝑋)2 are not quite independent, as the Central Limit Theorem requires, be-
cause they all depend on X . Nevertheless, the distribution of 𝑠2 is approximately normal, under
mild conditions, when the sample is large. For small to moderate samples, the distribution of 𝑠2 is
not normal at all. It is not even symmetric.
When observations 𝑋1, 𝑋2, · · · , 𝑋𝑛 are i.i.d. N(𝜇, 𝜎²) (independent and normal) with V[𝑋𝑖] = 𝜎², the distribution of

(𝑛 − 1)𝑠²/𝜎² = ∑_{𝑖=1}^{𝑛} ( (𝑋𝑖 − X̄)/𝜎 )² ∼ 𝜒²[𝑛 − 1],    (8.54)

meaning it is Chi-square 𝜒²[𝑛 − 1] with (𝑛 − 1) degrees of freedom.
Hence, 𝑠² = (1/(𝑛 − 1)) ∑_{𝑖=1}^{𝑛} (𝑋𝑖 − X̄)² is an unbiased and consistent estimator of the population variance 𝜎², and by (8.54) the variance estimator 𝑠² fulfills

𝑠² ∼ (𝜎²/(𝑛 − 1)) 𝜒²[𝑛 − 1].
Let us construct a (1−𝛼) 100% confidence interval for the population variance 𝜎 2 , based on a sample
of size 𝑛. As always, we start with the estimator, the sample variance 𝑠2 .
• Obviously, since the distribution of 𝑠2 is a chi-square distribution, not symmetric, our confidence
interval won’t have the form “estimator ± margin” as before.
• We may use Table A6 - Figure 8.16 to find the critical values 𝜒2𝛼/2 [𝑛 − 1] and 𝜒21−𝛼/2 [𝑛 − 1] of the
Chi-square distribution with 𝑣 = 𝑛 − 1 degrees of freedom.
Similarly, the (1 − 𝛼) 100% lower and upper confidence bounds for 𝜎² are

(𝑛 − 1)𝑠² / 𝜒²_𝛼[𝑛 − 1] ≤ 𝜎²;  and  𝜎² ≤ (𝑛 − 1)𝑠² / 𝜒²_{1−𝛼}[𝑛 − 1]    (8.56)
Figure 8.17: Chi-square curve and critical values at a specific significance level
An automated filling machine is used to fill bottles with liquid detergent. A random sample of 20
bottles results in a sample variance of fill volume 𝑋 of 𝑠2 = 0.0153 (liter)2 . If the variance of fill volume
exceeds 0.01 (liter)2 , an unacceptable proportion of bottles will be underfilled or overfilled.
We will assume that the fill volume is approximately normally distributed. A 95% upper confidence
bound is found from Formula 8.56 as follows:
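A Python sketch of this computation; having no chi-square table at hand, the critical value 𝜒²_{0.95}[19] (the point with area 0.95 to its right) is estimated by Monte Carlo, so the bound lands near the table-based answer of about 0.029 (liter)²:

```python
import random

rng = random.Random(3)
n, s2 = 20, 0.0153           # sample size and sample variance of fill volume
df = n - 1

# Monte Carlo stand-in for the chi-square table: estimate chi2_{0.95}[19],
# i.e. the 5th percentile of a chi-square(19) variable
N = 100_000
draws = sorted(sum(rng.gauss(0, 1) ** 2 for _ in range(df)) for _ in range(N))
chi2_095 = draws[int(0.05 * N)]      # close to the tabulated value 10.117

upper_bound = df * s2 / chi2_095     # 95% upper confidence bound for sigma^2
print(round(upper_bound, 4))         # about 0.029 (liter)^2, above 0.01
```

Since the bound exceeds 0.01 (liter)², the data do not rule out an unacceptably large fill-volume variance.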
Many practical applications require testing independence of two factors, say 𝐴 and 𝐵. Apparently,
chi-square statistic can help us test
If there is a significant association or dependency between two features, it helps to understand the
cause-and-effect relationships.
• For example, is it true that smoking causes lung cancer? Do the data confirm that drinking and
driving increases the chance of a traffic accident?
• In computing, does customer satisfaction with their PC depend on the operating system?
Definition 8.10.

𝑇 = 𝜒² = ∑_{𝑘=1}^{𝑁} (𝑂𝑘 − 𝐸𝑘)² / 𝐸𝑘    (8.57)

Here the sum goes over the 𝑁 categories or groups of data defined depending on our testing,
• 𝑂𝑘 is the observed number of sampling units in category 𝑘, and
• 𝐸𝑘 = E[𝑂𝑘 | 𝐻0] is the expected number of sampling units in category 𝑘 if the null hypothesis 𝐻0 is true.
♣ OBSERVATION. This is always a one-sided, right-tail test. That is because only the low values of
𝜒2 show that the observed counts are close to what we expect them to be under the null hypotheses,
and therefore, the data support 𝐻0 . On the contrary, large 𝜒2 occurs when observations 𝑂𝑘 are far
from expected numbers 𝐸𝑘 , which shows inconsistency of the data and does not support 𝐻0 .
Theorem 8.11.
Under 𝐻0, 𝑇 ⇝ 𝜒²[𝑁 − 1]. Hence the test that rejects 𝐻0 if 𝑇 > 𝜒²_𝛼[𝑁 − 1] has asymptotic level 𝛼.
A 3-step procedure:
1. A level-𝛼 rejection region for this chi-square test is 𝑅 = [𝜒²_𝛼[𝑁 − 1], +∞), and the P-value is found as 𝑃 = P[𝜒² ≥ 𝜒²_obs].
2. Pearson showed that the null distribution of 𝜒2 converges to the Chi-square distribution with (𝑁 − 1)
degrees of freedom, due to Theorem 8.11, as the sample size increases to infinity.
This follows from a suitable version of the Central Limit Theorem. To apply it, we need to make sure the sample size is large enough, namely
𝐸𝑘 = E[𝑂𝑘 | 𝐻0] ≥ 5
for all 𝑘 = 1, . . . , 𝑁 . If that is the case, then we can use the 𝜒2 distribution to construct rejection
regions and compute P-values. If a count in some category is less than 5, then we should merge
this category with another one, and recalculate the 𝜒2 statistic.
♦ EXAMPLE 8.14 (Internet shopping on different days of the week).
A web designer suspects that the chance for an internet shopper to make a purchase through her
web site varies depending on the day of the week. To test this claim, she collects data during one
week, when the web site recorded 3758 hits.
Observed, 𝑥 Mon Tue Wed Thu Fri Sat Sun Total
No purchase 399 261 284 263 393 531 502 2633
Single purchase 119 72 97 51 143 145 150 777
Multiple purchases 39 50 20 15 41 97 86 348
Total 557 383 401 329 577 773 738 3758
Testing independence (i.e., probability of making a purchase or multiple purchases is the same
on any day of the week), we compute the estimated expected counts, then apply the above 3-step
procedure.
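A Python sketch of the computation for this contingency table; the critical value 𝜒²_{0.05}[12] ≈ 21.03, with d.f. = (3 − 1)(7 − 1) = 12, is taken from a standard table:

```python
rows = [
    [399, 261, 284, 263, 393, 531, 502],   # no purchase
    [119,  72,  97,  51, 143, 145, 150],   # single purchase
    [ 39,  50,  20,  15,  41,  97,  86],   # multiple purchases
]
row_tot = [sum(r) for r in rows]
col_tot = [sum(c) for c in zip(*rows)]
total = sum(row_tot)                        # 3758 hits in all

# chi-square statistic: sum over all cells of (observed - expected)^2 / expected,
# with expected counts estimated as (row total)(column total)/(grand total)
T = 0.0
for i, r in enumerate(rows):
    for j, obs in enumerate(r):
        exp = row_tot[i] * col_tot[j] / total
        T += (obs - exp) ** 2 / exp

df = (len(rows) - 1) * (len(col_tot) - 1)   # (3-1)(7-1) = 12
print(f"T = {T:.1f} on df = {df}; reject independence: {T > 21.03}")
```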
Workload Characterization
With Data Analytics
Introduction
In this chapter we introduce the concept of system workload and discuss many data-driven methods for studying computer system workload and its characterization.
Next we promote the approach named Performance by Design (PbD) that is similar to the well-
known Quality by Design (QbD) in industrial manufacturing.
To specify what Performance by Design means, we first summarize the SPE theory developed from Chapter ??, then illustratively propose Performance Evaluation Analytics projects that include most of the essences of system performance evaluation, with Analyzing Product-Form Queues discussed in Section ?? and Non-Markovian Queues in Chapter ??, respectively.
Chapter Blueprint
4. Know and employ popular techniques for workload study, based on Data Analytics, at least Principal Component Analysis and 𝐾-means Clustering.
There are three main factors that affect the performance of a computer system:
– LINPACK - in high performance computing - is a software library for performing numerical linear algebra on digital computers.¹
¹ Wiki: LINPACK was written in Fortran by Jack Dongarra, Jim Bunch, Cleve Moler, and Gilbert Stewart, and intended for use on supercomputers in the 1970s and early 1980s. It has been largely superseded by LAPACK, which runs more efficiently on modern architectures.
2. Level of detail
3. Representativeness
4. Timeliness
2) Level of detail - the services exercised can be recorded at several levels of detail. Explicitly,
• (A) Most frequent request: select the most frequently requested service as the workload.
E.g., an addition instruction.
• (B) Frequency of request types: a list of [ service, frequency ] pairs.
E.g., instruction mixes.
• (C) Distribution of resource demands: a complete probability distribution is needed, e.g., for analytical and simulation modeling.
3) Representativeness
• Resource demands
4) Timeliness
Workload modeling is the attempt to create a simple and general model, which can then be used to generate synthetic workloads at will, possibly with slight (but well-controlled!) modifications.
GOAL (of workload modeling): typically to create synthetic workloads that can be used in
performance evaluation studies, and this synthetic workload is supposed to be similar to those that
occur in practice on real systems.
DATA ROLE: Workload modeling always starts with measured data about the workload. This data
is often recorded as a trace, or log, of workload-related events that happened in a certain system.
For example, a job log may include data about
• arrival time
• Necessary to study real-user environments, observe key characteristics, and develop a workload model
NOTE: We should use those parameters that depend on the workload rather than on the system.
E.g., response time is not appropriate.
Workload components
[Figure: workload components such as simulation, editor, compiler, and mail programs driving the System Under Test (SUT).]
Example:
Unsupervised and Supervised Learning techniques, coupled with the statistical software R, are used in this part.
3. Markov models
5. Clustering
They (mean and dispersion) are all statistical notions; see REMINDER: On Central and Spreading tendency in Section 9.4.
• Arithmetic mean:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i.$$
Specifying dispersion
• Standard deviation: $s = \sqrt{\dfrac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$.
Table 9.1: C1: Various programs data
The resource demands of various programs executed on 6 university sites were measured for 6
months for two categories, named Various programs and Editors.
• C2. Editors
• Make a histogram and then fit a probability distribution to the shape of the histogram
[Figure: state transition diagram of a Markov model with states CPU, disk, and network; transition probabilities (e.g., 0.3, 0.4, 0.8) label the arcs.]
A Markov model in discrete time, also called a discrete-time Markov chain (DTMC) or simply a Markov chain, is summarized as follows.
(B) A stationary Markov chain (a time-homogeneous, or homogeneous, DTMC) with three components $M = (Q, p, \mathbf{P})$ is specified by the state transition probabilities $p_{ij}$, the transition probability from state $i$ to state $j$, subject to two conditions:
$$p_{ij} \ge 0, \qquad \sum_j p_{ij} = p_{i1} + p_{i2} + p_{i3} + \dots + p_{is} = 1 \quad \text{for all } i.$$
EVOLUTION- Given the stochastic model just described, the Markov chain is specified in terms of a
sequence of random variables 𝑋0 , 𝑋1 , 𝑋2 , . . . whose values are taken from the set 𝑄 in accordance
with the Markov property
The Markov property is described through the probabilities $p_{ij}$, represented by the state transition matrix
$$\mathbf{P} = \begin{bmatrix}
p_{11} & p_{12} & p_{13} & \dots & p_{1s} \\
p_{21} & p_{22} & p_{23} & \dots & p_{2s} \\
p_{31} & p_{32} & p_{33} & \dots & p_{3s} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
p_{s1} & p_{s2} & p_{s3} & \dots & p_{ss}
\end{bmatrix} \qquad (9.2)$$
Unless stated otherwise, we assume and work with homogeneous Markov chains 𝑀 .
a/ the initial distribution- the probability distribution of starting position of the concerned object
at time point 0, and b/ the transition probabilities; and
we want to determine the probability distribution of position $X_n$ for any time point $n > 0$.
• The initial probabilities $p(0)$ are obtained at the current time (the beginning of a study).
In most cases, the major concern is using P and 𝑝(0) to predict future.
$p_{ij}^{(h)} = \mathrm{Prob}(X_{m+h} = j \mid X_m = i)$, the $h$-step transition probabilities.
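To make the prediction concrete, the evolution $p(n) = p(0)\,\mathbf{P}^n$ can be computed by repeated vector-matrix multiplication. The following Python sketch uses a hypothetical 2-state chain, chosen only for illustration.

```python
# Evolving a homogeneous DTMC: p(n) = p(0) P^n.
# The 2-state transition matrix below is hypothetical.
P = [[0.9, 0.1],
     [0.4, 0.6]]          # each row sums to 1
p0 = [1.0, 0.0]           # initial distribution: start in state 1

def step(p, P):
    """One step of the chain: p(n+1) = p(n) P."""
    s = len(P)
    return [sum(p[i] * P[i][j] for i in range(s)) for j in range(s)]

p = p0
for n in range(3):        # three steps: p(3)
    p = step(p, P)
print([round(x, 4) for x in p])   # → [0.825, 0.175]
```

Each iteration multiplies the current distribution by $\mathbf{P}$, so after $n$ steps the result equals $p(0)\mathbf{P}^n$, exactly the prediction problem stated above.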
HOW TO DO?
The first Principal Component, having the highest variance, is found as $Y_1 = \delta_1^\top x$, where the $p$-dimensional vector $\delta_1$ solves
$$\delta_1 = \arg\max_{\|\delta\|=1} \operatorname{Var}(\delta^\top x) \qquad (9.3)$$
The second PC is the linear combination with the second largest variance and orthogonal to
the first PC, and so on.
Advantages
Idea: different weightings give different classes of workload, so we use the weighted sum $y = \sum_{j=1}^{n} w_j x_j$
• Mean characteristics may not correspond to any member component in case of manually assigning
poor weights.
SIMPLE ALGORITHM
• PCA produces a set of principal factors ⃗𝑦 = [𝑦1 , 𝑦2 , . . . , 𝑦𝑚 ]⊤ so that the following holds:
$$\vec{y} = A\vec{x}$$
– 𝑦’s form an ordered set (𝑦1 explains the highest percentage of the variance).
See PCA theory in Section 9.6 and practical R Computation in Section 9.7.
Studying all of the workload components is far from realistic. As a result, a basic approach is to break the workload into categories by clustering algorithms. The basic steps are
1. Take a sample (Sampling), that is, a subset of workload components (e.g., several thousand user profiles), Section 9.3.1
2. Select workload parameters (see Section 9.3.2 and PCA in Section 9.6)
3. Transform workload parameters (if necessary), Remove outliers and Data scaling, Section 9.3.3
(*) Lastly, if the estimation error is large, change the parameters or the number of clusters, and repeat steps 3–5.
We sequentially study the steps above with a variety of methods. Basic concepts of Statistics are reminded in Section 9.4.
Let 𝑃 designate a finite population of 𝑁 units (assumed that the population size 𝑁 is known). We
first make a list 𝐿𝑁 = {𝑢1 , 𝑢2 , · · · , 𝑢𝑁 } of all the elements of the population, which are all labeled for
identification purposes. Let 𝑋 be a variable of interest and 𝑥𝑖 = 𝑋(𝑢𝑖 ), 𝑖 = 1...𝑁 the value ascribed
by 𝑋 to the 𝑖-th unit, 𝑢𝑖 ∈ 𝑃 .
The population mean and population variance for the variable $X$ are
$$\mu_N = \frac{1}{N}\sum_{i=1}^{N} x_i, \quad \text{and} \quad \sigma_N^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu_N)^2, \qquad (9.4)$$
• In one study, 2% of the population was chosen for analysis; later 99% of the population could be
assigned to the clusters obtained.
• Criteria:
– Impact on performance
– Small Variance
• Transformation: If the distribution is highly skewed, consider a function of the parameter, e.g.,
function log(𝑋) of CPU time 𝑋.
– Outliers affect normalization.
– They can be excluded only if they do not consume a significant portion of the system resources (e.g., backup).
3. Range Normalization:
$$x'_{ik} = \frac{x_{ik} - x_{\min,k}}{x_{\max,k} - x_{\min,k}},$$
which is affected by outliers.
– Percentile Normalization:
$$x'_{ik} = \frac{x_{ik} - x_{2.5,k}}{x_{97.5,k} - x_{2.5,k}}.$$
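A minimal Python sketch of both normalizations, on a hypothetical parameter column with one outlier; the `percentile` helper uses simple linear interpolation, one of several common conventions.

```python
# Range vs. percentile normalization of one parameter column.
x = [2.0, 4.0, 6.0, 8.0, 10.0, 100.0]   # hypothetical values; 100.0 is an outlier

def range_norm(x):
    lo, hi = min(x), max(x)
    return [(v - lo) / (hi - lo) for v in x]

def percentile(x, p):
    """Linear-interpolated p-th percentile (0 <= p <= 100)."""
    xs = sorted(x)
    k = (len(xs) - 1) * p / 100
    f = int(k)
    c = min(f + 1, len(xs) - 1)
    return xs[f] + (xs[c] - xs[f]) * (k - f)

def percentile_norm(x):
    lo, hi = percentile(x, 2.5), percentile(x, 97.5)
    return [(v - lo) / (hi - lo) for v in x]

print([round(v, 3) for v in range_norm(x)])
# → [0.0, 0.02, 0.041, 0.061, 0.082, 1.0]
# The outlier squeezes the range-normalized values toward 0;
# the percentile version is less affected by it.
```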
2. Manhattan distance: $d(u_j, v_j) = |u_j - v_j| \Longrightarrow d(u, v) = \sum_{j=1}^{m} |u_j - v_j|$.
3. Hamming distance: $m_H(u, v) = \sum_{i=1}^{m} \delta(u_i - v_i)$, and the Hamming mean
$$d(u, v) = \frac{m_H(u, v)}{m}. \qquad (9.6)$$
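The three distance measures above can be sketched in a few lines of Python; the vectors `u`, `v` are illustrative only.

```python
# Euclidean, Manhattan, and Hamming measures of Section 9.3.4.
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def hamming(u, v):
    """Number of positions where u and v differ."""
    return sum(a != b for a, b in zip(u, v))

u, v = (0, 0, 1, 1), (1, 0, 1, 0)
print(euclidean(u, v), manhattan(u, v), hamming(u, v))
# → 1.4142135623730951 2 2
```

Dividing `hamming(u, v)` by `len(u)` gives the Hamming mean of Eq. (9.6).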
3. Triangle inequality,
If only the triangle inequality is not satisfied, the function is called a semimetric.
Similarity in clustering means that the value of 𝑆(𝑢, 𝑣) is large when 𝑢 and 𝑣 are two similar
samples; the value of 𝑆(𝑢, 𝑣) is small otherwise.
Definition 9.2.
• For a data set with $N$ data objects, we can define an $N \times N$ symmetric matrix, called a proximity matrix, whose $(i, j)$-th element represents the similarity or dissimilarity measure for the $i$-th and $j$-th objects ($i, j = 1, \dots, N$).
The median 𝑀 is the midpoint of a distribution. Half the observations are smaller than the median
and the other half are larger than the median.
The median 𝑀 is the value in the middle when the data 𝑥1 , · · · , 𝑥𝑛 of size 𝑛 is sorted in ascending
order (smallest to largest).
Indeed, since $n = 12$ is even, the middle two values of data $x^*$ are 2890 and 2920; the median $M$ is the average of these values:
$$M = \frac{2890 + 2920}{2} = 2905 = M(x^*).$$
Remark: Whenever a data set contains extreme values, the median is often preferred to the mean as the measure of central location.
$$x^* = [2710, 2755, 2850, 2880, 2880, 2890, 2920, 2940, 2950, 3050, 3130, 10000].$$
Data $x^*$ contains an extreme value (outlier), 10000, so the new sample mean is
$$\bar{x}^* = \frac{\sum_{i=1}^{n} x_i^*}{n} = 3496 \gg 2940 = \text{the old mean of data } x.$$
But the median is unchanged, reflecting the central tendency better:
$$M(x) = M(x^*) = \frac{2890 + 2920}{2} = 2905.$$
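The outlier effect can be checked directly from the data $x^*$ given above, here in a short Python sketch.

```python
# The data x* from the text: the outlier 10000 inflates the mean
# but leaves the median unchanged.
xs = [2710, 2755, 2850, 2880, 2880, 2890, 2920, 2940, 2950, 3050, 3130, 10000]

mean = sum(xs) / len(xs)
s = sorted(xs)
n = len(s)
median = (s[n // 2 - 1] + s[n // 2]) / 2   # n = 12 is even

print(round(mean, 2), median)   # → 3496.25 2905.0
```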
The median and mean are the most common measures of the center of a distribution.
If the distribution is exactly symmetric, the mean and median are exactly the same. In a skewed
distribution, the mean is farther out in the long tail than is the median.
– the relative frequency of $A$ is $n_A / n$,
So we just choose a specific value 𝐴 with greatest frequency (or greatest relative frequency) from
the histogram.
a) Percentiles provide information about how the data are spread over the interval from the
smallest value to the largest value.
The $p$th percentile, for any $0 < p < 1$, is a value $m$ such that
$$\mathrm{P}[X \le m] = p;$$
• that is, $100p$ percent of the observations are at most this value,
• and $100(1 - p)$ percent of the observations are greater than this value.
(Equivalently, percentiles are often indexed by $100p \in [0, 100]$; in practice we usually allow rational values.)
Often we divide data into four equal parts, each part contains approximately one-fourth, or 25% of
the observations. The division points are called the quartiles, and defined as:
𝑄1 = first quartile, or 25th percentile
𝑄2 = second quartile, or 50th percentile (also the median)
𝑄3 = third quartile, or 75th percentile
In R: quantile(x, p);
• The sample standard deviation $s = \sqrt{s^2}$.
The Coefficient of Variation $CV$ measures relative dispersion, i.e., it compares how large the standard deviation is relative to the mean:
$$CV = \left(\frac{\sigma}{\mu}\right) \times 100\% \quad \text{for populations}$$
and
$$CV = \left(\frac{s_x}{\bar{x}}\right) \times 100\% \quad \text{for samples } x.$$
We now consider the relationship between variables via two most important descriptive measures:
Covariance measures the co-movement of two separate distributions and Correlation. Let us start
by looking at the example below.
A positive covariance indicates that 𝑋 and 𝑌 move together in relation to their means.
Remark that
As a result,
In our example, $s_{xy} = 99/9 = 11$, indicating a positive linear relationship between the number $x$ of television commercials shown and the sales $y$ at the multimedia equipment store.
But the value of the covariance depends on the measurement units for $x$ and $y$. Is there a more precise, unit-free measure of this relationship?
$$r_{xy} = \frac{s_{xy}}{s_x s_y}$$
We get $-1 \le r_{xy} \le 1$. Moreover, if $x$ and $y$ are linearly related by the equation
$$y = a + bx,$$
then $r_{xy} = +1$ when $b > 0$ and $r_{xy} = -1$ when $b < 0$.
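The sample covariance and correlation can be sketched as follows; the small dataset is hypothetical, not the commercials data discussed above.

```python
# Sample covariance s_xy and correlation r_xy = s_xy / (s_x * s_y).
import math

x = [1, 2, 3, 4, 5]   # hypothetical data
y = [2, 4, 5, 4, 5]

def mean(v):
    return sum(v) / len(v)

def cov(x, y):
    """Sample covariance (divides by n - 1)."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def corr(x, y):
    return cov(x, y) / math.sqrt(cov(x, x) * cov(y, y))

r = corr(x, y)
print(round(cov(x, y), 3), round(r, 3))   # → 1.5 0.775
```

Unlike the covariance, the correlation is unit-free and always lies in $[-1, 1]$.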
• An input to a cluster analysis can be described as an ordered pair (𝒟, 𝑠), or (𝒟, 𝑑), where 𝒟 is a
set of objects (or their descriptions) represented with sample points
𝒟 := {𝑥(𝑖) }𝑁
𝑖=1
and 𝑠 and 𝑑 are measures for similarity or dissimilarity among points, respectively.
𝐺1 ∪ 𝐺2 ∪ . . . ∪ 𝐺𝐾 = 𝒟, and 𝐺𝑖 ∩ 𝐺𝑗 = ∅, 𝑖 ̸= 𝑗.
REMARK 9.1.
In discovery-based clustering, both the cluster and its descriptions or characterizations are
generated as a result of a clustering procedure.
• There is no clustering technique that is universally applicable in uncovering the variety of struc-
tures present in multidimensional data sets. But we can utilize three basic schemata for cluster
representation.
NOTE:
• An object (or substance) is isotropic if its physical properties have the same value when measured in different directions.
• 𝐾-means Clustering essentially illustrates the 1st scheme [Fig. 9.3(a) Centroid], and
• Hierarchical Clustering illustrates the 2nd scheme [Fig. 9.3(b) Clustering tree].
Recall the initial motivation in Section 9.2.5 and Clustering basic steps
4. Remove outliers.
7. Perform clustering.
8. Interpret results.
• The weighted Euclidean distance is used if the parameters have not been scaled or if the parameters have significantly different levels of importance.
• Goal: Partition into groups so the members of a group are as similar as possible and different
groups are as dissimilar as possible.
• Statistically, the intragroup variance should be as small as possible, and inter-group variance
should be as large as possible.
• Nonhierarchical techniques:
• Hierarchical Techniques:
5. Repeat steps 2 through 4 until all components are part of one cluster.
Dendrogram
• Purpose: Obtain clusters for any given maximum allowable intra-cluster distance.
1. In 𝐾-means clustering, you attempt to separate the data 𝒟 into 𝐾 clusters, where the number 𝐾
is determined by you.
2. The data usually has to be in the form of numeric vectors, we denote input signal 𝑥(𝑖) as an 𝑚-
dimension vector
𝑥(𝑖) = [𝑥𝑖1 , 𝑥𝑖2 , . . . , 𝑥𝑖𝑗 , . . . , 𝑥𝑖𝑚 ]𝑇 ∈ 𝒟, here each feature 𝑥𝑖𝑗 ∈ R. (9.11)
3. The method of $K$-means clustering will work as long as you have a way of computing distances between data points.
Let $\mathcal{D} := \{x^{(i)}\}_{i=1}^{N}$ be a set of multidimensional observations that is to be partitioned into a prescribed number of clusters.
The distance between $u$ and $v$ generally is just $d(u, v) = \sum_{j=1}^{m} d(u_j, v_j)$, where $d(u_j, v_j)$ is the Euclidean, Manhattan, or Hamming metric; see Section 9.3.4.
A measure of dissimilarity between every pair 𝑢, 𝑣 ∈ 𝒟 is the distance 𝑑(𝑢, 𝑣). The points 𝑢 and 𝑣
are close together if their gap 𝑑(𝑢, 𝑣) is small, and far away if 𝑑(𝑢, 𝑣) is large. When the measure
𝑑(𝑢, 𝑣) is small enough, both 𝑢 and 𝑣 are assigned to the same cluster; otherwise, they are assigned
to different clusters.
$$\sum_i (\mathbf{a}^{(i)} - \mathbf{c}) = 0 \iff \mathbf{c} = \frac{1}{n} \sum_i \mathbf{a}^{(i)}. \qquad (9.12)$$
Suppose we have already determined the clustering or the partitioning into clusters 𝐺1 , 𝐺2 , . . . , 𝐺𝐾 .
What are the best centers for the clusters?
Lemma 9.7. Let 𝐺 = {a(𝑖) }𝑛𝑖=1 = {a(1) , a(2) , . . . , a(𝑛) } be a cluster (of points, 𝑛 ≤ 𝑁 ).
The sum of the squared distances of the a(𝑖) to any point 𝑥 equals the sum of the squared distances
to the centroid c of 𝐺 plus 𝑛 times the squared distance from 𝑥 to the centroid. That is,
$$\sum_i |\mathbf{a}^{(i)} - x|^2 = \sum_i |\mathbf{a}^{(i)} - \mathbf{c}|^2 + n\,|\mathbf{c} - x|^2,$$
where $\mathbf{c} = \frac{1}{n}\sum_i \mathbf{a}^{(i)}$ is the centroid of the set of points [see Eq. (9.12)].
As a result, the centroid $\mathbf{c}$ minimizes the sum of squared distances, since the first term $\sum_i |\mathbf{a}^{(i)} - \mathbf{c}|^2$ does not depend on $x$.
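Lemma 9.7 can be checked numerically; the three 2-D points below are a hypothetical cluster and $x$ an arbitrary reference point.

```python
# Numerical check of Lemma 9.7 on a small hypothetical cluster.
points = [(1.0, 2.0), (3.0, 4.0), (5.0, 0.0)]
n = len(points)
c = tuple(sum(p[k] for p in points) / n for k in range(2))   # centroid

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

x = (0.0, 0.0)   # any reference point
lhs = sum(sq_dist(p, x) for p in points)
rhs = sum(sq_dist(p, c) for p in points) + n * sq_dist(c, x)
print(abs(lhs - rhs) < 1e-9)   # → True
```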
■ NOTATION 2.
Λ = {𝐺1 , 𝐺2 , . . . , 𝐺𝐾 }
STEPS
2. Find the centroid and intra-cluster variance for 𝑖th cluster, for 𝑖 = 1, 2, . . . , 𝑘.
3. Find the cluster with the highest variance and arbitrarily divide it into two clusters.
– Find the two components that are farthest apart, assign other components according to their
distance from these points.
– Place all components below the centroid in one cluster and all components above this hyperplane in the other.
4. Adjust the points in the two new clusters until the inter-cluster distance between the two clusters is
maximum.
Definition 9.9. A many-to-one map, called the encoder 𝐶, a kind of relationship from 𝒟 to 𝒮, is
defined as
𝑗 = 𝐶(𝑖), 𝑖 = 1, 2, . . . , 𝑁 (9.13)
assigning the 𝑖-th observation 𝑥(𝑖) to the 𝑗-th cluster 𝐺𝑗 according to a rule yet to be defined.
The following cost function (Hastie et al., 2001), for a given encoder $C$:
$$J(C) = \sum_{j=1}^{K} \sum_{C(i)=j} |x^{(i)} - \hat{\mu}_j|^2 \qquad (9.14)$$
is used to optimize the clustering process. The inner summation in this equation is
$$\hat{\sigma}_j^2 := \sum_{C(i)=j} |x^{(i)} - \hat{\mu}_j|^2 \qquad (9.15)$$
• AIM: For a prescribed $K$, the requirement is to find the encoder $C(i) = j$ for which $J(C)$ is minimized.
• IDEAS: The $K$-means starts with many different random choices for the means $\hat{\mu}_j$ for the proposed size $K$, then chooses the particular set for which $J(C)$ assumes the smallest value.³
ELUCIDATION
Starting from some initial choice of the encoder 𝐶, the algorithm goes back and forth between
these two steps 2 and 3 until there is no further change in the cluster assignments. The 𝐾-means
algorithm mathematically proceeds in two steps:
• The algorithm essentially works by first guessing at 𝐾 “centers” of proposed clusters; in Equation (9.17) we view the cluster means as these centers.
• Then Equation (9.18) says each data point is assigned to the cluster it is closest to, creating a grouping of the data, and then all centers are moved to the centroids of their newly assigned groups.
1. Initialize a set of cluster means $CM := \{\hat{\mu}_j\}_{j=1}^{K} = \{\mathbf{c}_j\}_{j=1}^{K}$.
2. For a given encoder $C$, the total cluster variance is minimized with respect to the assigned set of cluster means $CM$; that is, we minimize the score $S = J(C)$ for that given $C$:
$$\min_{CM} S = \min_{CM} J(C) = \min_{\{\hat{\mu}_j\}_{j=1}^{K}} \sum_{j=1}^{K} \sum_{C(i)=j} |x^{(i)} - \hat{\mu}_j|^2 = \min_{\{\hat{\mu}_j\}_{j=1}^{K}} \sum_{j=1}^{K} \hat{\sigma}_j^2 \qquad (9.17)$$
3. Each data point $i$ is then assigned to the cluster with the nearest mean,
$$C(i) = j_0 = \arg\min_{j} |x^{(i)} - \hat{\mu}_j|^2. \qquad (9.18)$$
Go back to step 2 with the new encoder $C$.
k.means = function(X, K, iteration = 20) {
  n = nrow(X); p = ncol(X); center = array(dim = c(K, p))
  y = sample(1:K, n, replace = TRUE)   # random initial assignment
  scores = NULL
  for (h in 1:iteration) {
    # Update step: recompute each cluster's center
    for (k in 1:K) {
      if (sum(y[] == k) == 0) center[k, ] = Inf   # empty cluster
      ## sum(y[]==k) expresses the number of i s.t. y[i]=k
      else for (j in 1:p) center[k, j] = mean(X[y[] == k, j])
    }
    # Assignment step: move each point to its nearest center
    S.total = 0
    for (i in 1:n) {
      S.min = Inf
      for (k in 1:K) {
        S = sum((X[i, ] - center[k, ])^2)
        if (S < S.min) { S.min = S; y[i] = k }
      }
      S.total = S.total + S.min
    }
    scores = c(scores, S.total)   # track the total within-cluster score
  }
  return(list(clusters = y, scores = scores))
}
♦ REMARK 1.
• The score 𝑆 does not increase with each update of steps 2 and 3 during the execution of 𝐾-means clustering.
• Because the initial centers are randomly chosen, different calls to the function will not necessarily
lead to the same result. At the very least, we would expect
the labeling of clusters to be different between the various calls. Generally, square-error parti-
tional algorithms (with 𝐾-means Clustering as a specific case) attempt to obtain a partition that
minimizes the within-cluster scatter or maximizes the between-cluster scatter.
• The result of K-means clustering depends on the randomly selected initial clusters, which means that even if K-means is applied, there is no guarantee that an optimum solution will be obtained.⁴
Let us see the algorithm in action with the realistic iris dataset; we remove the Species column to get a numerical matrix. The R function for 𝐾-means clustering, kmeans, wants numerical data. We need to specify 𝐾, the number of centers, in the parameters to kmeans(), and we choose three. We know that there are three species, so this is a natural choice.
4
These methods are nonhierarchical because all resulting clusters are groups of samples at the same level of partition. To guarantee that
an optimum solution has been obtained, one has to examine all possible partitions of the 𝑁 samples with 𝑚 dimensions into 𝐾 clusters (for a
given 𝐾), but that retrieval process is not computationally feasible.
library(ggplot2)
clusters = kmeans(newiris, centers = 3)  # newiris: iris without the Species column
CCluster = clusters$cluster
head(CCluster); table(CCluster)
newiris |>
  cbind(Cluster = CCluster) |>
  ggplot() +
  geom_bar(aes(x = SS, fill = as.factor(Cluster)),  # SS: the Species labels
           position = "dodge") +
  scale_fill_discrete("Cluster")
The function returns an object with information about the clustering. The two most interesting
pieces of information are the centers, the variable centers, and the cluster assignment, the vari-
able cluster.
• The variable centers: These are simply vectors of the same form as the input data points. They
are the center of mass for each of the three clusters we have computed.
• The variable cluster: The cluster assignment is simply an integer vector with a number for each
data point specifying which cluster that data point is assigned to.
There are 50 data points for each species so if the clustering perfectly matched the species we
should see 50 points for each cluster as well.
The clustering is not perfect, but we can try plotting the data and see how well the clustering
matches the species class.
We can first plot how many data points from each species are assigned to each cluster. We combine the iris data set with the cluster association from clusters and then make a bar plot. The position argument is "dodge", so the cluster assignments are plotted next to each other instead of stacked on top of each other.
Now let us consider how the clustering does at predicting the species more formally. This returns
us to familiar territory: we can build a confusion matrix between species and clusters.
ELUCIDATION.
• One problem here is that the clustering doesn’t know about the species, so even if there were
a one-to-one correspondence between clusters and species, the confusion matrix would only be
diagonal if the clusters and species were in the same order.
• We can associate each species to the cluster most of its members are assigned to. This isn't a perfect solution (two species could be assigned to the same cluster this way, and then we still would not be able to construct a confusion matrix), but it will work for us in the case we consider here.
We can count how many observations from each cluster are seen in each species, as in the code paragraph below.
• Since 𝐾 is a parameter that needs to be specified, how do you pick it? Here, we knew that there
were three species, so we picked 𝐾 = 3 as well.
• But when we do not know if there is any clustering in the data, to begin with, or if there is a lot,
how do we choose 𝐾? Unfortunately, there isn’t a general answer to this. There are several rules
of thumb, but no perfect solution you can always apply.
The basic principle of dimensionality reduction techniques (such as PCA) is to transform the data into a new space that summarizes the properties of the whole data set along a reduced number of dimensions. These reduced, informative dimensions are then ideal candidates for visualizing the data.
Principal Component Analysis (PCA) is a technique that transforms the original 𝑛-dimensional
data into a new 𝑛-dimensional space.
• These new dimensions are linear combinations of the original data, i.e. they are composed of
proportions of the original variables.
• Along these new dimensions, called principal components, the data expresses most of its vari-
ability along the first PC, then second, . . .
• Principal components are orthogonal to each other, i.e., uncorrelated.⁵
5
PCA is probably the oldest and best known of the techniques of multivariate analysis. Being based on the covariance matrix of the variables,
it is a second-order method. In various fields, it is also known as the singular value decomposition (SVD), the Karhunen-Loève transform, the
Hotelling transform, and the empirical orthogonal function (EOF) method.
The central idea of principal component analysis is to reduce the dimensionality of a data set in
which there are a large number of interrelated variables, while retaining as much as possible of the
variation present in the data set.
This reduction is achieved by transforming to a new set of variables, the principal components,
which are uncorrelated, and which are ordered so that the first few retain most of the variation present
in all of the original variables.
The purpose of PCA is to summarize the matrix $X$ as $Y_1, \dots, Y_m$ ($1 \le m \le p$): the smaller the $m$, the more compressed the information in $X$. Note that there exist $\lambda_1, \lambda_2, \dots$ such that
$$X^T X Y_1 = \lambda_1 Y_1, \qquad (9.19)$$
$$X^T X Y_2 = \lambda_2 Y_2, \dots \qquad (9.20)$$
they are just the nonnegative eigenvalues, and the $Y_i$ the eigenvectors, of $X^T X$ (a nonnegative definite matrix). Moreover, the $Y_i$ are mutually orthogonal. Hence, we choose the $m$ principal components $Y_1, \dots, Y_m$ with the largest ordered eigenvalues $\lambda_1 \ge \dots \ge \lambda_m \ge 0$.
• In essence, PCA seeks to reduce the dimension of the data by finding a few orthogonal linear
combinations (the PCs) of the original variables with the largest variance.
The first PC, 𝑌1 , is the linear combination with the largest variance. We have
• The 2nd PC is the linear combination with the 2nd largest variance and orthogonal to the 1st PC, and so on. There are as many PCs as there are original variables.
• For many datasets, the first several PCs explain most of the variance, so that the rest can be disregarded with minimal loss of information.⁶
Data standardization
It can be shown that the PCs are given by the 𝑝 rows of the 𝑝 × 𝑛 matrix 𝑌 , where
𝑌 = 𝑉 ⊤𝑋 ⊤. (9.24)
6
Since the variance depends on the scale of the variables, it is customary to first standardize each variable to have mean zero and standard
deviation one. After the standardization, the original variables with possibly different unit of measurement are all in comparable units.
Performing PCA using (9.24) (i.e., by initially finding the eigenvalues of the sample covariance and
then finding the corresponding eigenvectors) is already simple and computationally fast.
Enhanced Computation. However, the computation can be further enhanced by utilizing the connection between PCA and the singular value decomposition (SVD) of the mean-centered data matrix $X$, which takes the form:
𝑋 = 𝑈 𝑆𝑉 ⊤ , (9.25)
where
𝑈 ⊤ 𝑈 = ℐ𝑝 , 𝑉 𝑉 ⊤ = 𝑉 ⊤ 𝑉 = ℐ𝑝
It can be verified easily that the matrices $V$ in equations (9.24) and (9.25) are the same, and the principal component scores are given by $US$.
Evaluation of the PCA. The weighting of the PCs tells us in which directions, expressed in the original coordinates, the best variance explanation is obtained. A measure of how well the first $k$ PCs explain variation is given by the relative proportion:
$$\psi_k = \frac{\sum_{j=1}^{k} l_j}{\sum_{j=1}^{p} l_j} = \frac{\sum_{j=1}^{k} \operatorname{Var}(Y_j)}{\sum_{j=1}^{p} \operatorname{Var}(Y_j)}. \qquad (9.26)$$
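Given the ordered eigenvalues, $\psi_k$ is a one-line computation; the eigenvalue list below is hypothetical.

```python
# Proportion of variance explained, psi_k of Eq. (9.26), from a
# hypothetical list of ordered eigenvalues l_1 >= ... >= l_p.
eigenvalues = [4.2, 2.4, 0.9, 0.5]
total = sum(eigenvalues)

def psi(k):
    return sum(eigenvalues[:k]) / total

print([round(psi(k), 3) for k in range(1, len(eigenvalues) + 1)])
# → [0.525, 0.825, 0.938, 1.0]
```

Here the first two PCs already explain over 80% of the variance, so the remaining two could be disregarded with modest loss of information.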
Using the R language function eigen, given the matrix $X \in \mathbb{R}^{n \times p}$ as an input, we can construct the function pca that outputs the vector with the elements $\lambda_1 \ge \dots \ge \lambda_p$ and the matrix with the columns $Y_1, \dots, Y_p$, as follows.
Even if we do not use the function below, the function prcomp is available in R.
pca = function(x) {
  n = nrow(x); p = ncol(x); center = array(dim = p)
  # Center each column of x
  for (j in 1:p) center[j] = mean(x[, j])
  for (j in 1:p) x[, j] = x[, j] - center[j]
  sigma = t(x) %*% x
  lambda = eigen(sigma)$values
  phi = eigen(sigma)$vectors
  return(list(lambdas = lambda, vectors = phi, centers = center))
}
We use the iris dataset, removing the Species column to get a numerical matrix to give to the function.
The R function kmeans for 𝐾-means clustering needs numerical data. We need to specify 𝐾, the
number of centers in the parameters to kmeans(), and we choose three. We know that there are
three species, so this is a natural choice.
QUESTIONS
We study 𝐾-means Clustering and Analyzing with PCA (principal component analysis),
1. Explain all components of the output from the command
HINT: The function returns an object with information about the clustering. The two most inter-
esting pieces of information are the centers, the variable centers, and the cluster assignment,
the variable cluster.
in code: STEP 1: use four features only; then STEP 2: use the original iris data. ■
Students should try the code in the R STUDIO environment while answering the questions and writing the report. You might use the code in the Section on FITTING the Model with PCA and Predictive Model to support your answers to the above questions.
We aim to visually examine how the clustering result matches where the actual data points fall in,
using PCA (principal component analysis). We can do this by plotting the individual data points and
see how the classification and clustering looks.
We can map data points from the numeric features of the data to the principal components using the predict() function. This works both for the original iris data used to make the PCA and for the centers we get from the 𝐾-means clustering.
We can now plot the first two components against each other:

install.packages("tibble"); library(tibble)

mapped_iris |>
  as_tibble() |>
  cbind(SS) |>
  ggplot() + geom_point(aes(x = PC1, y = PC2, colour = SS))

n_mapped_iris |>
  as_tibble() |>
  cbind(SS, Clusters = as.factor(n_clusters$cluster)) |>
  ggplot() +
  geom_point(aes(x = PC1, y = PC2, colour = SS, shape = Clusters)) +
  geom_point(aes(x = PC1, y = PC2), size = 5, shape = "X",
             data = as_tibble(n_mapped_centers))
(a) The Euclidean distance between points. (b) The city block (Manhattan) distance.
How to compute the Minkowski distance 𝑑𝑝 (𝑢, 𝑣) between vectors 𝑢, 𝑣 ∈ R𝑚 ? Here 𝑝 is the order
of the norm of the difference.
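As a sketch of the answer, the Minkowski distance generalizes both earlier metrics: order $p = 1$ gives the Manhattan distance and $p = 2$ the Euclidean distance. A minimal Python illustration:

```python
# Minkowski distance of order p between vectors u, v in R^m.
def minkowski(u, v, p):
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1 / p)

u, v = (0.0, 0.0), (3.0, 4.0)
print(minkowski(u, v, 1), minkowski(u, v, 2))   # → 7.0 5.0
```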
5. Given the samples 𝑋1 = {1, 0}, 𝑋2 = {0, 1}, 𝑋3 = {2, 1}, and 𝑋4 = {3, 3}, suppose that the
samples are randomly clustered into two clusters 𝐺1 = {𝑋1, 𝑋3} and 𝐺2 = {𝑋2, 𝑋4}.
(b) What are the new centroids? How can you prove that the new distribution of samples is better
than the initial one?
(d) Apply the second iteration of the 𝐾-means algorithm and discuss the changes in clusters.
We extensively use random variables in both Stochastic Simulation and Machine Learning to approx-
imate the sampling distribution of data, and to propagate this to the sampling distribution of statistical
estimates and procedures.
Machine Learning is, however, more trendy nowadays in Applied Statistics and Engineering for
several reasons. All disciplines and sectors exploiting Machine Learning algorithms and methods
are data-centric fields, including
a/ memorizing facts and data, understanding rules, principles and theory in general,
b/ trying to apply what we know in upcoming problem situations with similar data patterns, to obtain good or optimal answers and solutions.
Definition 9.10 (Statistical model). The simplest statistical model is the linear model. A linear model
$$H = h(X) = \alpha + \beta X + \varepsilon \qquad (*)$$
empirically expresses a possible linear relation between an (independent) predictor variable $X$ [or several] and a (dependent) response variable $Y = H$.
• Here $\varepsilon$ is the stochastic noise (usually assumed to be normally distributed with mean $\mathbb{E}[\varepsilon] = 0$).
ℎ = E[𝐻] = 𝛼 + 𝛽 𝑥. (9.28)
Most statistical models are examples of Machine Learning . Why? We can essentially learn about
the “causality” (cause-effect) between 𝑋 and 𝑌 via (9.28). Such regression is as much machine
learning as neural networks are.
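As a sketch of "learning" the linear model (9.28) from data, the closed-form least-squares estimates of $\alpha$ and $\beta$ can be computed in a few lines; the noisy dataset below is hypothetical.

```python
# Least-squares fit of the linear model h = alpha + beta * x
# to hypothetical noisy data, via the closed-form estimates.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]   # roughly y = 2x

mx = sum(xs) / len(xs)
my = sum(ys) / len(ys)
# beta = sum((x - mx)(y - my)) / sum((x - mx)^2), alpha = my - beta * mx
beta = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
       sum((x - mx) ** 2 for x in xs)
alpha = my - beta * mx
print(round(alpha, 3), round(beta, 3))
```

The fitted slope lands near 2, recovering the relation the data was generated from; this is exactly the "learning from data" step the text refers to.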
Definition 9.11 (Artificial neural network). An artificial neural network is characterized as follows.
3. The communication links interconnecting the source and computation nodes of the graph carry no
weight; they merely provide directions of signal flow in the graph.
SUMMARY
(0) Learning Processes are achieved via at least by a statistical model or ANN.
(3) Journey of Knowledge discovery: Both statistical model and Artificial neural network model
follow the key path of going from proper knowledge representation (hidden in dataset) to effectively
learning with that data.
■ NOTATION 3.
$x^{(i)} = [x_{i1}, x_{i2}, \dots, x_{ij}, \dots, x_{im}]^T$, here each feature value $x_{ij} \in \mathbb{R}$, or $x_{ij} \in \mathcal{S}$ (9.29)
meaning all of input entries are either real number, or taken value in a discrete set 𝒮, the superscript
𝑇 denotes matrix transposition.
We treat $m$ as the number of features (or, statistically, covariates); the vector $x^{(i)}$ [referring to features, attributes] then defines a point in $\mathbb{R}^m$, the $m$-dimensional Euclidean space. In general, however, $x^{(i)}$ could be
either a complex structured object, such as an email message, a graph, a time series,
Definition 9.12. A set of training data (or sample) is a set of input-output pairs
each pair consists of an input signal (covariate) $x^{(i)}$ and the corresponding desired response variable $y^{(i)}$, which may take a discrete or a continuous value.
Figure 9.7: Input (covariate) 𝑋 in a training data set could be a complex structured or
unstructured object (Courtesy of Phuc Son Nguyen)
■ CONCEPT.
“A computer program is said to learn from experience 𝐸 w. r. t. some class of tasks 𝑇 and per-
formance measure 𝑃 , if its performance at tasks or actions 𝑎 ∈ 𝑇 , as measured by 𝑃 , improves
with experience 𝐸”.
Machine Learning (ML) is a kind of learning algorithm: it learns from data to update our understanding of reality. Machine Learning is thus the discipline of developing and applying models and algorithms for learning from data.
• To develop algorithms like that, you need a deep understanding of your problem.
• Given a model (in mathematical formula pattern) $y = f(x)$, you cannot usually figure out what the relationship is between $y$ and $x$ without looking at data.
In Definition 9.12, if response 𝑦 (𝑖) is discrete we have classification, if 𝑦 (𝑖) gets continuous value
we have regression problem.
• Briefly,
A = Supervised Learning = {classification, regression, decision trees, ...},
B = Unsupervised Learning (e.g., clustering), and
C = 𝑈 ∖ (𝐴 ∪ 𝐵).
C/ Semi-supervised Learning consists of other methods.
In supervised learning the algorithm is given inputs and told which specific outputs should be associated with them.
Mathematically, supervised learning is model creation where the model describes a relationship between the features and the target variable. The model estimates the value of the target variable $Y$ as a function $h$ (possibly a probabilistic function) of the features.
The goal of this learning is to learn a mapping or model from inputs 𝑥 to outputs 𝑦, given a
labeled set of input-output pairs
• where each 𝑥(𝑖) symbolically represents an object to be classified, most typically an 𝑚-dimensional
vector of real and/or discrete values as given in Equation 9.29;
We divide supervised learning based on whether the outputs 𝑦 are drawn from a small finite set (branch 1, named classification, CASE A1) or from the real numbers (branch 2, regression, CASE A2). Technically, the form of the output 𝑦 can in principle be anything, but two cases are most commonly assumed.
CASE A1: the response 𝑦 (𝑖) = 𝑦𝑖 is a categorical variable (with values such as male or female) from some finite set,
𝑦𝑖 ∈ {1, 2, . . . , 𝐶}.
Convention: Many textbooks use 𝑥𝑖 instead of 𝑥(𝑖) . We find that notation somewhat difficult to
manage when 𝑥(𝑖) is itself a vector and we need to talk about its elements.
CASE A2: 𝑦 (𝑖) is an ordinal or real-valued variable (such as income level). When 𝑦 (𝑖) is a single real value, that is 𝑦 (𝑖) = 𝑦𝑖 ∈ R (single response), or more generally 𝑦 (𝑖) ∈ R𝑘 , 𝑘 ≥ 1 (multiple responses), we call A2 a regression problem [studied in Section 10.3 of Chapter 10].
Practically, we would find a real-valued function ℎ(·) that transforms 𝑥 to a theoretical value ℎ = ℎ(𝑥) ∈ R. There might obviously be some noise in our observation, so ℎ does not map perfectly from 𝑥 to the actually observed output 𝑦: in general ℎ − 𝑦 ≠ 0, i.e. ℎ = 𝑦 + error.
(Footnote 7) Classification problems are a kind of supervised learning, because the desired output (or class) 𝑦𝑖 is specified for each of the training examples 𝑥(𝑖) .
ℎ𝑖 = 𝑦𝑖 + 𝜀𝑖 ⇐⇒ 𝜀𝑖 = ℎ(𝑥(𝑖) ) − 𝑦𝑖 . (9.31)
Here 𝜀𝑖 is the error in the real observation 𝑦𝑖 , so we put the error vector 𝜀 = (𝜀1 , . . . , 𝜀𝑛 ). Both ℎ(𝑥(𝑖) )
and 𝑦𝑖 are real numbers, put vector
ℎ(𝜃) := ℎ = (ℎ1 , ℎ2 , · · · , ℎ𝑛 ) ∈ R𝑛
♦ EXAMPLE 9.4 (Optimal estimator, or best fit to the data in Euclidean space).
In R𝑚 with the square norm ‖ 𝑥 ‖2 , the estimated value of 𝜃 minimizing ‖ ℎ(𝜃) − 𝑦 ‖ (called the best fit to the data) is symbolically denoted by 𝜃̂ = arg min𝜃 ‖ ℎ(𝜃) − 𝑦 ‖.
Q: What if we do not use linear regression, and the norm is not square-norm?
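The squared-norm best fit of Example 9.4 has a closed form for linear models; a minimal sketch in Python with NumPy (the data below is hypothetical, and the text's own examples use R):

```python
import numpy as np

# Hypothetical data: n = 5 observations, one feature plus an intercept column.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])  # roughly y = 1 + 2x plus noise

# Best fit in the squared norm: theta minimizes ||X @ theta - y||_2.
theta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)

# For any other choice of theta the squared error can only be larger.
err_opt = np.sum((X @ theta - y) ** 2)
err_other = np.sum((X @ np.array([1.0, 1.0]) - y) ** 2)
```

For a norm other than the square norm (the question above), there is generally no closed form, and the error would instead be minimized numerically.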
Unsupervised learning doesn’t involve learning a function from inputs to outputs based on a set of
input-output pairs. Instead, one is given a data set and generally expected to find some patterns
or structure inherent in it.
For supervised learning, we have one or more targets 𝑌 we want to predict using a set of explanatory variables 𝑋𝑗 , using a data set 𝒟𝑛 = {(𝑥(𝑖) , 𝑦 (𝑖) ) : 𝑖 = 1, 2, 3, . . . , 𝑛}.
For Unsupervised learning, we are only given a collection of data points 𝒟 = {𝑥(𝑖) }𝑛𝑖=1 , you now
want to discover what generic patterns there are in the data. This is sometimes called knowledge
discovery.
(Footnote 8) Unsupervised learning is arguably more typical of human and animal learning. It is also more widely applicable than supervised learning, since it does not require a human expert to manually label the data as above.
B0/ Density estimation: we predict conditional probabilities, [such as probability of machine fail-
ure]. Given sample 𝑥 = 𝑥(1) , 𝑥(2) , . . . , 𝑥(𝑛) drawn IID from some distribution 𝑓𝑋 (𝑥) = Prob(𝑥),
the goal is to predict the probability 𝑓𝑋 (𝑥(𝑛+1) ) of an element 𝑥(𝑛+1) drawn from the same distri-
bution.
Technically, viewing the input 𝑥(𝑖) ∈ 𝒜, a finite set, in the binary case we assume output 𝑦𝑖 ∈ {−1, 1}, and we will find a probability density 𝑓 = 𝑓𝑋 : 𝒜 −→ [0, 1] such that the value 𝑓𝑋 (𝑥(𝑛+1) ) is as close to the mass P[𝑌 = 1|𝑋 = 𝑥(𝑛+1) ] as possible.
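A minimal sketch of density estimation in Python, under the simplifying assumption of a discrete sample (the data is hypothetical): estimate 𝑓𝑋 by empirical relative frequencies and use it to score a new draw.

```python
from collections import Counter

# Hypothetical IID sample drawn from an unknown discrete distribution f_X.
sample = ["ok", "ok", "fail", "ok", "ok", "fail", "ok", "ok", "ok", "fail"]

counts = Counter(sample)
n = len(sample)

def f_hat(x):
    """Empirical estimate of f_X(x) = Prob(x) from the sample."""
    return counts.get(x, 0) / n

# Predicted probability that the (n+1)-th draw equals "fail".
p_fail = f_hat("fail")
```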
B1/ Dimensionality reduction is a method used when you have high-dimensional data and want to map it down into fewer dimensions. The purpose is usually to visualize the data in order to spot patterns from plots.
B2/ Clustering methods seek to find similarities between data points and group data according
to these similarities.
Clustering means grouping data into clusters that “belong” together - objects within a cluster
are more similar to each other than to those in other clusters.
Technically, viewing the input 𝑥(𝑖) ∈ 𝒜, a finite set of 𝑛 points, we assume output 𝑦𝑖 ∈ {1, 2, . . . , 𝐶} = 𝒮, a set of 𝐶 clusters, and we will find a function 𝑔 : 𝒜 −→ 𝒮 such that the likelihood value P[𝑔(𝑥(𝑛+1) ) = 𝑘 ∈ 𝒮] is as high as possible.
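The clustering idea in B2 can be sketched with a plain k-means loop in Python; the points and the deterministic initialization below are illustrative assumptions, not a method prescribed by the text:

```python
import math

def kmeans(points, k, iters=20):
    """Plain k-means on 2-D points: repeatedly assign each point to its
    nearest centroid, then recompute each centroid as the cluster mean."""
    centroids = points[:k]  # deterministic init: the first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical data: two visually separated groups of 3 points each.
pts = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 5.0), (5.1, 5.2)]
cents, cls = kmeans(pts, k=2)
```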
B3/ Association Rules search for patterns in your data by picking out subsets of the data, 𝑋 and 𝑌 , based on predicates on the input variables, and evaluate a logical implication of the form 𝑋 =⇒ 𝑌, called a rule. The algorithm evaluates all rules to figure out how good each rule is.
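A minimal sketch of evaluating one rule 𝑋 =⇒ 𝑌 in Python, using hypothetical market-basket transactions; support and confidence are the usual measures of how good a rule is:

```python
# Hypothetical transaction data; each transaction is a set of items.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y):
    """How good the rule X => Y is: P(Y | X) estimated from the data."""
    return support(X | Y) / support(X)

sup = support({"bread", "butter"})       # support of the rule bread => butter
conf = confidence({"bread"}, {"butter"})
```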
SUMMARY
[Learning Processes (LP) and The Essence of Unsupervised Learning]
• Supervised Learning essentially includes the classification problem [if the response variables are discrete] and the regression problem [if the response variables are continuous, with real values].
• Unsupervised Learning is a much less well-defined problem (compared with Supervised Learning), since we are not told what kinds of patterns to look for, and there is no obvious error metric to use (unlike supervised learning, where we can compare our prediction of 𝑦 for a given 𝑥 to the observed value). ■
Supervised learning is used when you have variables you want to predict using other variables.
We often formalize the problem using function approximation.
We assume 𝑦 = ℎ(𝑥) for some unknown function ℎ; the goal of learning is to estimate the function ℎ given a labeled training set {(𝑥𝑖 , 𝑦𝑖 )}, and then to make predictions using 𝑦̂ = ℎ̂(𝑥).
The goal of Supervised learning is to learn a mapping from inputs 𝑋 to outputs 𝑌 , given a labeled data set 𝒟𝑛 = {(𝑥(𝑖) , 𝑦𝑖 )}𝑛𝑖=1 . Here 𝒟𝑛 is called the training set; the pairs (𝑥(𝑖) , 𝑦𝑖 ) are named “examples,” “instances with labels,” or “observations”; and 𝑛 is the number of training cases.
• The learning process consists of choosing parameters 𝜃 such that we minimize the errors,
that is, such that ℎ(𝑥(𝑖) ; 𝜃) = ℎ𝑖 (𝜃) = ℎ𝑖 is as close to 𝑦𝑖 as we can get.
Learning algorithm
Let ℋ be a set of all classifiers/learning models ℎ = ℎ(𝑥; 𝜃). A learning algorithm is a procedure 𝑃 that takes a data set 𝒟𝑛 as input and returns an element ℎ ∈ ℋ; schematically, 𝒟𝑛 −→ 𝑃 (𝒟𝑛 , ℋ) −→ ℎ. We next discuss the following.
[ Then in Chapter ?? we continue with specific Classification methods: Decision Trees and Ran-
dom Forests, with R illustrated example.]
Generic Classification
In this type of task, the computer program is asked to specify which of 𝐶 categories some input belongs to. The most common cases are Binary Classification (𝐶 = 2) and Ternary Classification (𝐶 = 3). To solve this task, the learning algorithm is usually asked to produce a function
ℎ : 𝒳 ⊂ R𝑚 −→ 𝒮 = {1, . . . , 𝐶}, 𝐶 ≥ 2.
• Here 𝒳 = {𝑥(𝑖) } and 𝑦 = ℎ(𝑥) says that the model ℎ assigns an input described by vector 𝑥 to a
category identified by numeric code 𝑦.
• When 𝐶 = 2, we might rewrite 𝒮 = {−1, +1}, or also 𝒮 = {0, 1}. The map ℎ is named a binary classifier.
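A minimal binary classifier ℎ : R𝑚 → {−1, +1} sketched in Python; the linear score with hypothetical weights is just one possible choice of ℎ, normally learned from data:

```python
def binary_classifier(x, w, b):
    """Assign input vector x to class +1 or -1 by the sign of a linear
    score. The weights w and intercept b are hypothetical here."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

w, b = [1.0, -2.0], 0.5
label_pos = binary_classifier([3.0, 1.0], w, b)  # score = 3 - 2 + 0.5 = 1.5
label_neg = binary_classifier([0.0, 1.0], w, b)  # score = -2 + 0.5 = -1.5
```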
NOTE: A popular classification task is object recognition or pattern recognition, where the input is an image (usually described as a set of pixel brightness values), and the output is a numeric code identifying the object in the image.
Source: http://www.statlab.uni-heidelberg.de/data/iris/ .
Courtesy Dennis Kramb and SIGNA
Motivation- Data: The learning goal is to distinguish 3 different kinds of iris flower.
Approach- Methods: Rather than working directly with images, a botanist has already extracted 4 useful features or characteristics: sepal length, sepal width, petal length and petal width. Such feature extraction is an important but difficult task.
Output- Analysis- Conclusion: It is always a good idea to perform exploratory data analysis, such
as plotting the data, before applying a machine learning method.
Normed space - Let us explicitly define the normed space 𝑆 := R𝑚 with the usual inner product. The space R𝑚 , shortly, is equipped with the inner product
⟨𝑥, 𝑦⟩ = ∑𝑚𝑖=1 𝑥𝑖 𝑦𝑖 ,
then the square norm or Euclidean distance is determined by its properties, assuming 𝑥, 𝑦, 𝑧 ∈ 𝑆, as follows.
* In the space R𝑚 , at property (iii), the notation ‖ · ‖ is called the norm/distance; it is induced by the inner product via ‖𝑥‖ = ⟨𝑥, 𝑥⟩1/2 .
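The inner product and its induced norm can be sketched directly in Python:

```python
import math

def inner(x, y):
    """Usual inner product <x, y> = sum_i x_i * y_i on R^m."""
    return sum(xi * yi for xi, yi in zip(x, y))

def norm(x):
    """Euclidean norm induced by the inner product: ||x|| = sqrt(<x, x>)."""
    return math.sqrt(inner(x, x))

x, y = [3.0, 4.0], [1.0, 2.0]
d = norm([xi - yi for xi, yi in zip(x, y)])  # Euclidean distance ||x - y||
```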
Generic Regression
In this type of task, the computer program is asked to predict a numerical value given some input.
To solve this task, the learning algorithm is asked to output a function
𝑔 : R𝑚 −→ R.
This type of task is similar to classification, except that the format of output is different.
Statistically, we obtain input pairs (𝑥(𝑖) , 𝑦𝑖 ) ∈ 𝒟𝑛 , a finite set of 𝑛 points, assuming output 𝑦𝑖 ∈ R, the set of reals. We will find a function 𝑔 : 𝒜 = {𝑥(𝑖) } −→ R such that the total error ∑𝑖 ‖ 𝑔(𝑥(𝑖) ) − 𝑦𝑖 ‖ is as small as possible.
Here with some input variables 𝑥 you want a good approximation function 𝑔(.) that predicts output
(or response) variables 𝑌 , as close to observed response 𝑦𝑖 as possible. The simplest model 𝑔() is
linear in one single input variable 𝑥 (𝑚 = 1 predictor), defined as
𝑌 = 𝑔(𝑥) + 𝜀 = 𝜃0 + 𝜃1 𝑥 + 𝜀, with 𝜃0 , 𝜃1 ∈ R.
This Linear Regression model predicts the mean of output (or response) variables 𝑦 from input 𝑥.
We observed sample data with speed (the 𝑥 value) and braking distance (the 𝑦 value) and want to fit this data to see the relationship.
In R we need the function lm() and a few libraries, particularly ggplot2.
We see that there is a very clear linear relationship between speed and distance.
[Figure: scatter plot of braking distance (dist, 0-125) against speed (5-25), showing a clear linear trend]
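The text fits this model in R with lm(); an equivalent closed-form least-squares fit can be sketched in Python (the speed/distance sample below is hypothetical, not the data set plotted above):

```python
# Hypothetical (speed, braking distance) observations.
data = [(5, 10), (10, 22), (15, 38), (20, 55), (25, 70)]

n = len(data)
mx = sum(x for x, _ in data) / n
my = sum(y for _, y in data) / n

# Least-squares slope and intercept for: dist = b0 + b1 * speed.
b1 = sum((x - mx) * (y - my) for x, y in data) / sum((x - mx) ** 2 for x, _ in data)
b0 = my - b1 * mx

predicted_at_12 = b0 + b1 * 12  # predicted braking distance at speed 12
```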
We have looked at classical statistical regression (linear models) and classification, but there are many other machine learning algorithms for both; they are available as R packages.
FURTHER READING
See also Performance Analytics III about Hierarchical Clustering in Section ?? of CHAPTER ??.
PROBLEM
• Consider a system of two servers where customers from outside the system arrive at server 1 at a
Poisson rate 8 and at server 2 at a Poisson rate 12.
• The service rates of server 1 and server 2 are respectively 16 and 24.
whereas a departure from server 2 will go 25 percent of the time to server 1, and will depart the system otherwise (i.e., 𝑃2,1 = 1/4, 𝑃2,2 = 0).
3. Compute the average number of customers in the system 𝐿, and the mean E[𝑅].
——————————
DATA - Assume that PCs enter Q via system I with arrival rate 𝜆 = 20 PCs/ hour,
[Figure: two sequential queues with one-way flow (e.g., as appear in the industrial sector); jobs leave the system when all services are done]
server S2 of system II can package products with rate 𝜇2 = 60 PCs per hour.
QUESTIONS
b) Find the average number of PCs in the whole system Q = (I, II).
——————————
REMINDER:
• The Jackson networks typically represent open queueing networks with a single class of jobs.
The network is analyzed by considering each of the individual stations one by one.
• Therefore, the main technique would be to decompose the queueing network into individual queues
or stations and develop characteristics of arrival processes for each individual station. The basic
assumptions of these networks are still Poisson arrivals and exponential service times.
A Jackson network is a queueing network with a special structure that consists of 𝑘 ≥ 2 connected service stations (called nodes), hence 𝑘 queues, but the stations operate independently. Briefly, node 𝑖 (i.e. the 𝑖th service station) of the network can be considered as an independent M/M/𝑚𝑖 system with 𝑚𝑖 servers, arrival rate 𝛾𝑖 , and service rate 𝜇𝑖 .
[Figure: a Jackson network of three stations; external arrivals enter stations 1, 2 and 3; station 1 has 1 server, station 2 has 2 servers, station 3 has 1 server; jobs leave the system when all services are done]
A SCENARIO IN SERVICE:
Let us now consider a Jackson network of 𝑘 = 3 service stations in an airport where customers
can arrive in any node of the network. Assume that:
• Arrivals at node 𝑖 (for 𝑖 = 1, 2, 3) follow Poisson processes with respectively associated arrival rates
𝛾1 = 4, 𝛾2 = 2, and 𝛾3 = 1.
• The service rates at those nodes are (𝜇1 , 𝜇2 , 𝜇3 ) = (3, 6, 12) respectively.
QUESTIONS
b) Write down the switching probability matrix P = [𝑝𝑖,𝑗 ], where the probabilities are
𝑝1,2 = 0.55, 𝑝1,3 = 0.3; 𝑝2,1 = 0.3, 𝑝2,3 = 0.6; 𝑝3,1 = 0.05, 𝑝3,2 = 0.45,
c) The vector 𝜆 of total input rates into nodes is calculated by the matrix equation
𝜆 = 𝛾[I −P]−1 ,
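This matrix equation can be checked numerically in Python with NumPy, using the rates given above (𝛾 = (4, 2, 1) and the switching matrix from part b); the computed vector satisfies the traffic equations 𝜆 = 𝛾 + 𝜆P:

```python
import numpy as np

gamma = np.array([4.0, 2.0, 1.0])          # external Poisson arrival rates
P = np.array([[0.00, 0.55, 0.30],          # switching probabilities p_{i,j}
              [0.30, 0.00, 0.60],
              [0.05, 0.45, 0.00]])

# Total input rates: lambda = gamma [I - P]^{-1} (row-vector convention).
lam = gamma @ np.linalg.inv(np.eye(3) - P)
```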
Introduction
The starting point of this chapter is the use of Experimental Design terminology and philosophy, in which the input parameters and structural assumptions composing a model are called factors, and the output performance measures are called responses.
The decision as to which parameters and structural assumptions are considered fixed aspects of
a model and which are experimental factors depends on the goals of the study rather than on the
inherent form of the model.
5. The full 2𝑚 factorial design in 𝑚 binary factors for SPE in Section 10.5
Recall the ten key steps in a System Performance Evaluation (SPE) project introduced in Section ??, in which steps 3, 4, 5, then steps 8 and 9, are viewed as most useful for obtaining the best response or an optimum system, given a set of factors.
4. List Parameters
7. Select Workload
The three steps 3, 4 and 5 are actually meaningful for utilizing Factorial Designs in SPE, and are fully explored from Section 10.2.1 onward from the data-centric view. Step 8 will be discussed in Section 10.4 with terminology defined prior in Section 10.2.2, and we perform in Section 10.6 the Analyze and Interpret Data Output step via the experimental-error analysis from regression.
10.1 Performance Metrics to Factorial Designs
We first look at Steps 3, 4 and 5 of the whole SPE process. Step 9 is discussed in Section 10.6. The more complex stochastic analysis of output, from both looking at structural components of a system and running its simulation models, is postponed until Chapters 8 and ??.
We would use a systematic way to select right metrics, called path to metrics.
♣ OBSERVATION.
(1) We know that Step 2 lists Services and Outcomes, upon the basic awareness
• a/ that a system always provides a set of services, and each service has its own outcomes;
• b/ that all possible outcomes of a service, moreover, must be listed.
(2) Selecting the right metrics is extremely important in SPE; for instance, the next diagram shows the path from System to various metrics like time, rate, probability . . .
[Diagram: path to metrics, with 3 outcome categories.] This is helpful to determine what data needs to be collected before or during the analysis.
1. Via the above diagram the causality from Step 2 to 3 can be used for various systems:
2. On Aspects of a metric
• Mean and Variability: both need to be considered. If we denote a metric by a random variable 𝑋, then its mean and variability are denoted respectively by E[𝑋] and V[𝑋].
We should list all possible parameters that affect performance. Few parameter types are
i) the arrival process is Poisson [the inter-arrival times between events are exponential],
ii) the service process is exponential, and iii) with a single server.
• Parameters which have a high impact on performance are preferably selected as factors.
• Gradually extend factor list and per-factor levels. Note further that
• Non-factor parameters must be fixed (and also feasible and low impact).
♦ EXAMPLE 10.2.
10.2 Factorial Designs for Performance Evaluation
WHY DOE? Experimental design (DOE) is the best known way of learning about relationships between the variables of a system. DOE has the goal: obtain the maximum information (about a system) or the optimum value of a key target factor (named the response variable) with the minimum number of experiments (treatment combinations). ■
Experimental design (DoE) in SPE is unique among statistical methodologies in that it is completely proactive. The key point is that rather than analyzing existing data we carefully plan which specific data would be most helpful in addressing the issue or problem of the process or system.
• The effect of variation from sources outside our scope [as ...]
can be minimized by planning the data-collection process ahead of time.
and so (2) we increase the speed with which we understand the relationship between the variables of interest.
HOW TO? Based on list of factors and their levels, we decide a sequence of experiments that
♦ EXAMPLE 10.3.
The factors below are considered to compare remote pipes and remote procedure calls (RPC).
AIM: Find optimal process settings that produce the best results at lowest cost.
1. Determine and quantify the factors that have the biggest impact on the output
2. Identify factors that do not have a big impact on quality or on time (and therefore can be set
at the most convenient and/or least costly levels)
3. Screen a large number of factors quickly to determine the most important ones
4. Reduce the time and number of experiments needed to test multiple factors by using
Fractional Factorial DOE - FFD, see next part 10.2.2.
• Factor 𝐶- CPU: Intel Core I3@1.9GHz, Intel Core I7@2.67GHz, or AMD Athlon IIX4 605e.
Key Terminologies
Factor- either controlled or uncontrolled input variable, also called predictor in our model; factors
classified into primary, secondary...
Fractional Factorial DOE- Looks at only a fraction of all the possible combinations contained in a full factorial. If many factors are being investigated, information can be obtained with a smaller investment; see Chapter 11, about Fractional Factorial Designs.
Effect- The change in the response variable that occurs as experimental conditions change
Interaction- co-relationship between the defined factors, occurs when the effect of one factor on
the response 𝑌 depends on the setting of another factor
Run or Treatment combination- A single setup in a DOE from which data is gathered. For example, a 3-factor full factorial DOE with each factor at 2 levels has 2³ = 8 runs.
Replication- experiment repetition, precisely replicating (duplicating) the entire experiment in a time
sequence with different setups between each run.
Design: a collection 𝐷 of tuples, each tuple said to be an experiment, comprised of all factors. Therefore 𝐷 = {𝑒𝑥𝑝𝑒𝑟𝑖𝑚𝑒𝑛𝑡 1, · · · , 𝑒𝑥𝑝𝑒𝑟𝑖𝑚𝑒𝑛𝑡 𝑁 }, and so |𝐷| is the number of experiments (runs, or treatment combinations, i.e. one level combination over all factors), allowing some number of replications of each experiment.
Simple Designs
A Full Factorial of 𝑑 factors utilizing every possible combination at all levels of all 𝑑 factors gives the
total number of runs
𝑁 = 𝑛1 · 𝑛2 · · · 𝑛𝑑 = ∏𝑑𝑖=1 𝑛𝑖 .
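The run count 𝑁 and the design itself can be generated mechanically; a sketch in Python (the factors and levels below are hypothetical):

```python
from itertools import product
from math import prod

# Hypothetical factors with their levels (n1 = 2, n2 = 3, n3 = 2).
factors = {
    "CPU":    ["i3", "i7"],
    "Memory": ["4GB", "8GB", "16GB"],
    "Disk":   ["HDD", "SSD"],
}

# Full factorial: every combination of all levels of all factors.
design = list(product(*factors.values()))
N = prod(len(levels) for levels in factors.values())  # N = n1 * n2 * n3
```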
Briefly, the most popular FFDs include the full binary 2𝑑 , the ternary 3𝑚 , the mixed regular designs 2𝑑 × 3𝑚 , and the fractional designs 2𝑑−𝑝 of resolution 𝑅, . . . ; see a summary in Section 10.2.3.
Consider a specific 2³ experiment for studying the relationship between diet scheme and blood pressure. We conducted experiments to assess the effects of diet on blood pressure 𝑌 in (say, American) males. Three factors are to be measured:
E.g. experiment
𝑟1 = (1) = (low fruits, low fat, low dairy product ), (bottom left),
𝑟7 = (𝑎, 𝑏, 𝑐) = (high fruits, high fat, high dairy product ) (top, counterclockwise) ■
High cost is essentially avoided by reducing the total number 𝑛 of runs, by:
2. Reduce number of levels for each factor, hence applying 2𝑚 design first, then more levels added
per factor.
3. Use fractional factorial designs (fractions), to be discussed fully from Section 10.5.
Reusing Example 10.4, suppose we no longer consider factor 𝑁 (no. of SSD disks, with 4 choices); we then obtain a new 3⁴ full design 𝒟1 = 𝐶 × 𝑀 × 𝑊 × 𝑈 in 4 factors, with 3⁴ = 81 experiments in total. But we can apply a 3⁴⁻² fractional factorial design as in the table below.
• Each of the four factors is used three times at each of its three levels. Such a fractional design allows us to save time and expense, but we get less information (we may not get all interactions on the responses 𝑌 , like production cost).
• The good side is: if some interactions are negligible, then this is not a problem.
Note that the number of experiments here must be
𝑛 ≥ 1 + ∑𝑑𝑖=1 (𝑛𝑖 − 1) = 1 + 8 = 9,
From the last two examples, we might exploit a naive method with 2 steps
• Start with a configuration 𝑟𝑢𝑛 =< 𝑖3, 8𝐺𝐵, 1𝑑𝑖𝑠𝑘, managerial, college >.
• The (full) factorial design (FD) with respect to these factors is the Cartesian product
𝒟 = 𝑄1 × 𝑄2 × . . . × 𝑄𝑑 = 𝑋1 × 𝑋2 × . . . × 𝑋𝑑 .
𝑓 : 𝒟 → R,
♦ Determine main effects that the manipulated factors will have on response 𝑌
♦ Determine effects that factor interactions will have on 𝑌 or many response variables.
2. Advantages
– Provides information about all main effects, and all interactions as well.
3. The most popular one is the 2-level design 2𝑑 , where 𝑑 is the number of factors to be investigated and 2𝑑 = #Runs.
When a firm’s budget is limited, practically the firm’s manager must accept using a subset 𝐹 of 𝐷 when investigating properties of a new product.
• Look at only selected subsets of the possible combinations in a FD, as Example 10.7
• Advantages: Allows you to screen many factors (separating significant from non-significant factors) with a smaller investment in research time and costs, using
2𝑑−𝑝 (resolution 𝑅) = #Runs.
(Footnote 3) 𝐹 ⊆ 𝐷 = 𝐶𝑜 × 𝑆ℎ × 𝑊 𝑒𝑖 × 𝑀 𝑎 × . . . × 𝑃 𝑙𝑎𝑐𝑒 of a 2¹¹ full design, in which each of these features can take on only two possible values. Hence the full factorial requires 2¹¹ = 2048 experiments. This design 𝐹, however, is a fractional factorial design with only 12 runs, of strength 2 (resolution III), so it allows us to separate all main effects on the response 𝑌 = 𝛽0 + 𝛽1 𝐶𝑜 + 𝛽2 𝑆ℎ + . . . + 𝛽11 𝑃 𝑙𝑎𝑐𝑒. ■
In EXAMPLE 10.7 we began with 𝑑 = 11, but the design 𝐹 has 12 runs; the equation 2𝑑−𝑝 = 12 has no solution 𝑝, so 𝐹 does not belong to the class of 2-level FFDs with pattern 2𝑑−𝑝 for any resolution 𝑅. In fact 𝐹 does belong to another class, named Orthogonal Arrays!
The naive method for FFD, for both regular designs (like the class 2𝑑−𝑝 of resolution 𝑅) and non-regular ones (like Orthogonal Arrays), however, must fulfill the Rao bound (Lemma ??). In Example 10.7 we see clearly that 𝑛 ≥ 1 + ∑11𝑖=1 (2 − 1) = 12, so the design is optimal in the sense that it allows correct estimation of the Input/Output (I/O) function that is implicitly determined by the underlying simulation model.
Binary designs and key properties: Good candidates of screening designs are the classic two-
level factorial designs that will be discussed in Section 10.5. These binary designs clearly require
the simulation of at least 𝑛 = 𝑑 + 1 factor combinations where 𝑑 denotes the number of factors in
the experiment. In such a design, each factor has two values or levels; these levels may denote quantitative or qualitative values.
To study a few interactions plus all main factor effects of the chosen design we could use strength
𝑡 ≥ 3 or resolution 𝑅 ≥ 𝑡 + 1 = 4 fractional designs.
When 𝑡 = 3 (the smallest odd natural number above 1), the total number of intercept, main effect and two-factor interaction parameters of a generic orthogonal array 𝐹 = OA(𝑁 ; 𝑟1 , 𝑟2 , · · · , 𝑟𝑑 ; 𝑡) is
1 + ∑𝑑𝑖=1 (𝑟𝑖 − 1) + ∑1≤𝑖<𝑗≤𝑑 (𝑟𝑖 − 1)(𝑟𝑗 − 1),
where 𝑟𝑖 is the number of levels of factor 𝑖.
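This parameter count can be computed mechanically; a sketch in Python (the level counts used in the checks are hypothetical):

```python
from itertools import combinations

def n_parameters(r):
    """Intercept + main effects + two-factor interaction parameters
    for factors with level counts r_1, ..., r_d (the strength t = 3 case)."""
    main = sum(ri - 1 for ri in r)
    two_way = sum((ri - 1) * (rj - 1) for ri, rj in combinations(r, 2))
    return 1 + main + two_way

count_mixed = n_parameters([2, 3, 3])   # hypothetical mixed-level design
count_binary = n_parameters([2] * 11)   # 11 binary factors
```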
♣ QUESTION.
Shall we get the same conclusion for the 3⁴ full design 𝒟1 = 𝐶 × 𝑀 × 𝑊 × 𝑈 in 4 factors in Example 10.6, that a new number of design points 𝑁 satisfies
𝑁 = 𝑛 ≥ 1 + ∑𝑑𝑖=1 (𝑛𝑖 − 1) = 1 + ∑4𝑖=1 (3 − 1) = 9?
ANSWER:
𝑌 = 𝑎0 + 𝑎1 𝐶 + 𝑎2 𝐶 2 + 𝑚1 𝑀 + 𝑚2 𝑀 2 + · · · + 𝑢1 𝑈 + 𝑢2 𝑈 2 + 𝑡1 + · · · + 𝑡𝐾
• perform the statistical tests and confidence procedures that are analogous to those for simple
linear regression, and check for model adequacy. All will be used in Section 10.5.
10.3 Multiple Linear Regression (MLR)
We practically generalize the simple linear regression to cases where the variability of a variable 𝑌 of interest can be explained, to a large extent, by the linear relationship between 𝑌 and the 𝑚 = 𝑝 − 1 predicting or explanatory variables 𝑋1 , 𝑋2 , · · · , 𝑋𝑝−1 . Applications of the special Multiple
Linear Regression (MLR) analysis can be found in all areas of R & D. MLR, for instance, plays a
meaningful role in the statistical planning and control of
Model the influence of advertising time 𝑥 on the number of positive reactions 𝑦 from the public.
We have a single linear regression model, described by
𝑌 = 𝛽0 + 𝛽1 𝑋 + 𝜀.
𝑌^ = 𝜇𝑌 = E[𝑌 |𝑋 = 𝑥] = 𝛽0 + 𝛽1 𝑥.
Here 𝑚 := 𝑝 − 1 = 1, 𝑌 holds the number of positive reactions caused by the amount of advertising
time 𝑥, then the number of observations 𝑛 ≥ 2. ■
(Footnote 4) Multiple regression analysis (MRA) in general (possibly nonlinear) is an important statistical tool for exploring the relationship between the response 𝑌 and the set of predictors 𝑋𝑖 .
10.3.1 Setting
𝑌 = 𝑓 (𝑋1 , 𝑋2 , · · · , 𝑋𝑝−1 ) + 𝜀 = 𝛽0 + ∑𝑝−1𝑗=1 𝛽𝑗 𝑋𝑗 + 𝜀
1. If 𝑌̂ is a linear function of the coefficients 𝛽𝑖 (being estimated from data), then it may serve as a suitable approximation to several nonlinear functional relationships.
3. Thirdly, simplicity is best: many relationships in reality just need a linear function of predictors
𝑋1 , 𝑋2 , · · · , 𝑋𝑝−1 to describe the context. The linear models would guarantee the inclusion of
important variables, and the exclusion of unimportant variables.
As part of a recent study titled “Predicting Success for Actuarial Students in Undergraduate
Mathematics Courses,” data from 106 Mahidol Uni. actuarial graduates were obtained. The
researchers were interested in describing how students’ overall math grade point averages
(GPA) are explained by SAT Math and SAT Verbal scores, class rank, and faculty of science’s
If the change in the mean 𝑦 value associated with a 1-unit increase in one independent variable
depends on the value of a second independent variable, there is interaction between these two
variables. Denoting the two independent variables by 𝑋1 , 𝑋2 ,
• When 𝑋1 and 𝑋2 do interact, this model will usually give a much better fit to resulting data than
would the no-interaction model.
• Failure to consider a model with interaction too often leads an investigator to conclude incorrectly
that the relationship between 𝑌 and a set of independent variables is not very substantial. In
application, quadratic predictors 𝑋12 and 𝑋22 are often included to model a curved relationship. This
leads to the full quadratic or complete second-order model
𝑦 = E[𝑌 |𝑋1 = 𝑥1 , 𝑋2 = 𝑥2 ] = 𝛽0 + 𝛽1 𝑥1 + 𝛽2 𝑥2 + 𝛽3 𝑥1 𝑥2 + 𝛽4 𝑥1² + 𝛽5 𝑥2². (10.3)
Suppose that an industrial chemist is interested in studying how the product yield (𝑌 ) of a polymer is influenced by two independent variables or predictors 𝑋1 , 𝑋2 , and possibly their interaction. Here 𝑋1 = reaction temperature and
𝑌 = 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + 𝛽12 𝑋1 𝑋2 + 𝜀 (10.4)
Generally, interaction implies that the effect produced by changing one variable (𝑥1 , say) depends
on the level of the other variable (𝑥2 ). This figure shows that changing 𝑥1 from 2 to 8 produces a
much smaller change in E[𝑌 ] when 𝑥2 = 2 than when 𝑥2 = 10.
Notice further that, although these models are all linear regression models, the shape of the surface
that is generated by the model is not linear.
• (X = [𝑥𝑖𝑗 ]) is called the observed matrix of predictors (predictor matrix), 𝑥𝑖𝑗 is the value of the 𝑗-th
predictor 𝑋𝑗 at the 𝑖-th observation (𝑖 = 1, 2, . . . , 𝑛 and 𝑗 = 1, 2, . . . , 𝑘).
where 𝛽0 , 𝛽1 , 𝛽2 , · · · , 𝛽𝑘 are the linear regression coefficients, and 𝑒𝑖 are random errors, 𝑒𝑖 ∼ N(0, 𝜎 2 ),
i.e. they are normally distributed with mean 0 and standard deviation 𝜎.
b = 𝛽̂ = (X′ X)−1 X′ y. (10.6)
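Equation (10.6) can be sketched directly in Python with NumPy (the small design matrix and response below are hypothetical):

```python
import numpy as np

# Hypothetical data: n = 4 observations, one predictor plus intercept column.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([2.1, 3.9, 6.2, 7.8])

# b = beta_hat = (X'X)^{-1} X'y, the least-squares estimate of Eq. (10.6).
b = np.linalg.inv(X.T @ X) @ X.T @ y

y_hat = X @ b          # fitted values
residuals = y - y_hat  # estimation errors
```

The check X′(𝑦 − 𝑦̂) = 0 below is the normal-equations property of the least-squares solution.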
Use the dataset 𝒟 = (𝑦, X) = (𝑦 1 𝑥(1) · · · 𝑥(𝑘) ) with 𝑛 observations on 𝑘 predictors, where 𝑦 is the response vector; we fit the multiple regression model
𝑦̂ = E[𝑌 ] = X 𝛽̂ = X 𝑏,
where
𝑦̂ = (𝑦̂1 , . . . , 𝑦̂𝑛 )𝑇 and ȳ = (ȳ, . . . , ȳ)𝑇
are respectively the vector of fitted values and the vector of identical response means.
We then write the total sum of squares, measuring the total variation of responses, as
𝑆𝑆𝑇 := 𝑆𝑦𝑦 = ∑𝑛𝑖=1 (𝑦𝑖 − ȳ)2 = (𝑦 − ȳ)𝑇 (𝑦 − ȳ). (10.7)
• The first sum [with df 𝑇 = (𝑛 − 1) degrees of freedom] can be split into 𝑆𝑆𝑇 = 𝑆𝑆𝑅 + 𝑆𝑆𝐸, with
𝑆𝑆𝑅 = ∑𝑛𝑖=1 (𝑦̂𝑖 − ȳ)2 = (𝑦̂ − ȳ)𝑇 (𝑦̂ − ȳ), (10.8)
and 𝑆𝑆𝐸 = 𝑆𝑆𝑇 − 𝑆𝑆𝑅 = ∑𝑛𝑖=1 (𝑦𝑖 − 𝑦̂𝑖 )2 = (𝑦 − 𝑦̂)𝑇 (𝑦 − 𝑦̂).
• 𝑆𝑆𝑅- the regression sum of squares, measures the response’s variation being explained by the
regression model,
• 𝑆𝑆𝐸- the error sum of squares, describes the variation attributed to randomness; this is the quantity that we minimized when we applied the method of least squares.
𝑅2 = 𝑆𝑆𝑅/𝑆𝑆𝑇 (10.9)
measures the amount of variability in the data explained or accounted for by the regression model that is built up by all predictors.
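The decomposition 𝑆𝑆𝑇 = 𝑆𝑆𝑅 + 𝑆𝑆𝐸 and the statistic 𝑅2 can be sketched in Python for a simple least-squares fit (the data below is hypothetical; the identity holds exactly for least-squares fits with an intercept):

```python
# Hypothetical data; fit y = b0 + b1*x by least squares, then decompose.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.1, 5.9, 8.0]

n = len(y)
x_bar = sum(x) / n
y_bar = sum(y) / n
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
     / sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

SST = sum((yi - y_bar) ** 2 for yi in y)               # total variation (10.7)
SSR = sum((fi - y_bar) ** 2 for fi in y_hat)           # explained part (10.8)
SSE = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained part
R2 = SSR / SST                                         # Eq. (10.9)
```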
• When we add new predictors to our model, we explain additional portions of 𝑆𝑆𝑇 ; therefore,
𝑅2 can only go up. Thus, we should expect to increase 𝑅2 and generally, get a better fit by
going from uni-variate to multivariate regression.
This concept is extremely useful in practice (as in Transportation Science and SPE), and also for multiple regression in later chapters. The following notable properties of 𝑅2 , however, should be used with caution.
1. The statistic 𝑅2 should be used with caution because it is always possible to make 𝑅2 unity by
simply adding enough terms to the model. In general, 𝑅2 will increase if we add a variable to the
model, but this does not necessarily imply that the new model is superior to the old one.
a) In general, 𝑅2 does not measure the magnitude of the slope of the regression line.
c) Furthermore, 𝑅2 does not measure the appropriateness of the model because it can be artifi-
cially inflated by adding higher order polynomial terms in 𝑥 to the model.
df 𝐸 = df 𝑇 − df 𝑅 = 𝑛 − 1 − 𝑘 = 𝑛 − (𝑘 + 1) (10.10)
degrees of freedom are left for 𝑆𝑆𝐸. This is the sample size 𝑛 minus 𝑘 estimated slopes of 𝛽𝑖 and 1
estimated intercept of 𝛽0 . We can then write the ANOVA table,
Multivariate ANOVA
We use this theory to analyze experimental errors from a regression in Section 10.6. Section
11.5 later discusses a more complex theory of ‘Temporal’ Linear Regression with Lagging for both
predictors and responses in multivariate realm.
10.4 Product Quality and System Performance by Design of Experiments
The quality of a product and/or the reliability/efficiency of a process or a system is typically quantified
by quality and performance measures. Examples include measures such as
piston cycle time, yield of a production process, output voltage of an electronic circuit,
These performance measures are affected by several factors that have to be set at specific levels
to get desired results. We will discuss tools compatible with the top of the classic Quality Ladder
(in Figure 10.4), and then, to some extent, propose methods for the newly modified Quality and
Performance ladder developed so far in this text.
Quality by Design, historically, is the comprehensive quality engineering approach partially developed in the 1950s by the Japanese engineer Genichi Taguchi. Taguchi labeled his methodology off-line quality control.
(Footnote 5) Taguchi’s impact on Japan has expanded to a wide range of industries. He won the 1960 Deming Prize for application of quality, as well as three Deming Prizes for literature on quality in 1951, 1953 and 1984.
Applications of off-line quality control range now from the design of automobiles, copiers and electronic systems to cash-flow optimization in
banking, improvements in computer response times and runway utilization in an airport.
Figure 10.4: Organizations higher up on the Quality Ladder are more efficient
at solving problems with increased returns on investments
The aim of off-line quality control is to determine the factor-level combination that gives the least
variability to the appropriate performance measure, while keeping the mean value of the measure
on target. The goal is to control both accuracy and variability. Optimization problems of products or
processes can take many forms that depend on the objectives to be reached. These objectives are
typically derived from customer requirements.
• Performance parameters such as dimensions, pressure or velocity usually have a target or nominal
value. The objective is to reach the target within a range bounded by upper and lower specification
limits. We call such cases “nominal is best.”
• Noise levels, shrinkage factors, amount of wear and deterioration are usually required to be as low
as possible. We call such cases “the smaller the better.”
• When we measure strength, efficiency, yields or time to failure our goal is, in most cases, to reach
the maximum possible levels. Such cases are called “the larger the better.”
These three types of cases require different objective (target) functions to optimize. Taguchi intro-
duced the concept of loss function determining the appropriate optimization procedure.
Figure 10.5: Dr. Genichi Taguchi, a pioneer in using Experimental Designs for Industry
When “nominal is best” is considered, specification limits are typically two-sided with an upper spec-
ification limit (USL) and a lower specification limit (LSL), see Knowledge Box ??. These limits are
used to differentiate between conforming and nonconforming products. Nonconforming products
are usually repaired, retested and sometimes downgraded or simply scrapped. In all cases defec-
tive products carry a loss to the manufacturer. Taguchi proposed a quadratic function as a simple
approximation to a graduated loss that measures loss on a continuous scale.
𝐿(𝑦, 𝑀 ) = 𝐾 (𝑦 − 𝑀 )2 , (10.11)
where 𝑦 is the value of the performance characteristic of a product, 𝑀 is the target value of this
characteristic and 𝐾 is a positive constant, which yields monetary or other utility value to the loss.
For example, suppose that (𝑀 − Δ, 𝑀 + Δ) is the customer’s tolerance interval around the target
(note that this is different from the statistical tolerance interval). When 𝑦 falls out of this interval the
product has to be repaired or replaced at a cost of $𝐴. Then, for this product,
𝐴 = 𝐾 Δ2 or 𝐾 = 𝐴/Δ2 . (10.12)
• The manufacturer’s tolerance interval is generally tighter than that of the customer, namely
(𝑀 − 𝛿, 𝑀 + 𝛿), where 𝛿 < Δ. We can obtain the value of 𝛿. Suppose the cost to the manufacturer
to repair a product that exceeds the customer’s tolerance, before shipping the product, is $𝐵,
𝐵 < 𝐴. Then
B = (A/Δ²) (Y − M)², or Y = M ± Δ (B/A)^{1/2}. (10.13)
Thus,
δ = Δ √(B/A). (10.14)
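As a quick numeric sketch of these relations, with assumed values for M, Δ and the costs A and B (the numbers are illustrative, not from the text), the loss coefficient of (10.12) and the manufacturer's tolerance of (10.14) follow directly:

```python
# Sketch of Taguchi's quadratic loss and the tolerance relations (Eqs. 10.11-10.14).
# M, Delta, A_cost, B_cost are hypothetical values chosen for illustration.
import math

M, Delta = 10.0, 0.5        # target and customer half-tolerance
A_cost, B_cost = 8.0, 2.0   # customer repair cost A, in-factory repair cost B (B < A)

K = A_cost / Delta**2                  # Eq. (10.12): loss coefficient K = A / Delta^2
def loss(y):                           # Eq. (10.11): L(y, M) = K (y - M)^2
    return K * (y - M)**2

delta = Delta * math.sqrt(B_cost / A_cost)   # Eq. (10.14): manufacturer half-tolerance

print(K)                 # 32.0
print(loss(M + Delta))   # 8.0: the loss at the customer limit equals A
print(delta)             # 0.25: tighter than the customer's Delta = 0.5
```

Note that the loss at y = M ± Δ recovers the repair cost A exactly, which is how K was calibrated.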
• The manufacturer should reduce the variability in the product performance characteristic so that
process capability 𝐶𝑝𝑘 [defined in Knowledge Box ??] relative to the tolerance interval (𝑀 −𝛿, 𝑀 +𝛿)
should be high. See Figure 10.6 for a schematic presentation of these relationships. Notice that
E[(Y − M)²] = Bias² + V,
where Bias = μ − M, μ = E[Y] and V = E[(Y − μ)²] is the variance. Thus, the objective is to have a manufacturing process with the mean μ (of product feature Y) as close as possible to the target M, and with variance σ² as small as possible (σ < δ/3, so that C_pk > 1). ■
The previous section and chapters (in Part B and C) dealt with measuring the impact of such
variability. In this section we discuss methods for actually reducing variability.
System design is the stage where engineering skills, innovation, and technology are pooled to-
gether to create a basic design. Once the design is ready to go into production, one has to specify
tolerances of parts and sub-assemblies so that the product or process meets its requirements.
Loose tolerances are typically less expensive than tight tolerances.
Taguchi proposed changing the classical approach to the design of products and processes.
Thus, the three major stages in designing a product (or process) from the Robust Engineering
viewpoint are
1. System Design – This is when the product architecture and technology are determined.
2. Parameter Design – At this stage a planned optimization program is carried out in order to mini-
mize variability and costs.
3. Tolerance Design – Once the optimum performance is determined tolerances should be specified,
so that the product or process stays within specifications. The setting of optimum values of the
tolerance factors is called tolerance design, [see more info in [107, Part V]].
Design parameters and noise factors: Taguchi classifies the variables which affect the perfor-
mance characteristics into two categories: design parameters and source of noise. All factors which
cause variability are included in the source of noise.
1. Sources of noise are classified into two categories: external sources and internal sources.
• External sources are those external to the product, like environmental conditions (temperature,
humidity, dust, etc.); human variations in operating the product and other similar factors.
• Internal sources of variability are those connected with manufacturing imperfections and prod-
uct degradation or natural deterioration.
2. The design parameters, on the other hand, are controllable factors which can be set at predeter-
mined values (level). The product designer has to specify the values of the design parameters to
achieve the objectives. This is done by running an experiment which is called parameter design.
Parameter design is the first priority in the improvement of measuring precision, stability, and/or
reliability. When parameter design is completed, tolerance design is used to further reduce error
factor influences.
Variables that may cause product malfunctioning are called noise. Types of noise include:
1. Outer noise: variation caused by environmental conditions (e.g., temperature, humidity, dust, input voltage).
Parameter design is used to select the best control-factor level combinations so that the effect of all of the noise above can be minimized.⁶
Practical guidelines
⁶ Parameter design is in general the most important step in developing stable (robust) products, good or high-performance systems, or reliable manufacturing processes. With this technique, nonlinearity may be utilized positively. The purpose of parameter design is to investigate the overall variation caused by inner and outer noise when the levels of the control factors are allowed to vary widely. The next step is to find a stable or robust design that is essentially unaffected by inner or outer noise. Therefore, the most likely types of inner and outer noise factors must be identified and their influence must be investigated. Kindly see more details in [107, Chapter 15].
[Diagram: noise factors and control factors acting on the system.]
There are two types of experiments, physical experiments and computer based simulation
experiments, and we discuss the latter only.
• Let 𝜃 = (𝜃1 , · · · , 𝜃𝑘 ) be the vector of control design parameters. The vector of noise variables is
denoted by 𝑥 = (𝑥1 , · · · , 𝑥𝑚 ). The response function 𝑌 = 𝑓 (𝜃, 𝑥) involves in many situations the
factors 𝜃 and 𝑥 in a non-linear fashion. The objective of parameter design experiments is to take
advantage of the effects of the non-linear relationship.
The strategy is to perform a factorial experiment to investigate the effects of the design parameters
(controllable factors).
• If we learn from the experiments that certain design parameters affect the mean μ of Y but not its variance while, on the other hand, other design factors affect the variance but not the mean, we can use the latter group to reduce the variance of Y as much as possible, and then adjust the levels of the parameters in the first group to set μ close to the target M (see Figure 10.6).
• Parameter design can be performed both by simulation and by physical experimentation. In simulation, mathematical equations can be used, especially for complicated systems.
10.5 The full factorial design in 𝑚 binary factors
We now employ concepts and ideas defined in the preceding sections; Chapter 8 and Chapter ?? present the related simulation techniques.
In the full binary factorial design in k factors (non-random variables in regression models, or controllable design parameters in the Parameter Design theory above), denoted 2^k, each factor has 2 levels. Consider first the full 2² design, combined with a linear regression analysis on a concrete numerical design.
♦ EXAMPLE 10.11 (Very small factorial designs: the 2² and 2³ designs in few factors).
Assume a response Y (such as the heat from a CPU, or the performance in MIPS of a workstation) depends on binary factors A, B. We study the impact of these factors on Y, encoding the levels (choices) of the factors A, B by the symbols in parentheses. The value (−1) can also be replaced by (0), and in general the levels need not be numbers.
Hence a full binary design describes a factorial experiment 2^k with k binary factors (at 2 levels each), having 2^k experiments and allowing the estimation of 2^k effects, including
• k main effects, C(k,2) two-factor interactions, C(k,3) three-factor interactions, . . .
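A quick sanity check of this effect count for k = 3 (a sketch; together with the grand mean the binomial counts exhaust all 2^k estimable quantities):

```python
# Counting the effects of a 2^k design: k main effects, C(k,2) two-factor
# interactions, C(k,3) three-factor interactions, ..., plus the grand mean.
from math import comb

k = 3
counts = [comb(k, r) for r in range(k + 1)]  # [grand mean, main, 2-factor, 3-factor]
print(counts)        # [1, 3, 3, 1]
print(sum(counts))   # 8 = 2**3 estimable quantities in total
```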
Regression analysis: Assuming two binary factors A = X_A and B = X_B impact Y, the performance Y can be regressed using the following regression model (with a nonlinear interaction term):
Observations in the above table give a linear system in the 4 unknowns q = [q*] = [q0, qA, qB, qAB]:
15 = q0 − qA − qB + qAB
45 = q0 − qA + qB − qAB
25 = q0 + qA − qB − qAB
75 = q0 + qA + qB + qAB
Interpretation: Solving gives q0 = 40 (the grand mean), effect of memory qA = 20 MIPS, effect of cache qB = 10 MIPS, and interaction between memory and cache qAB = 5 MIPS.
Run no. Id 𝐴 𝐵 𝑦
1 1 −1 −1 𝑦1
2 1 1 −1 𝑦2
3 1 −1 1 𝑦3
4 1 1 1 𝑦4
where the rows correspond to the experiments, we get the roots of the last system:
q0 = (1/4)( y1 + y2 + y3 + y4) = ȳ = 40
qA = (1/4)(−y1 + y2 − y3 + y4)
qB = (1/4)(−y1 − y2 + y3 + y4)
qAB = (1/4)( y1 − y2 − y3 + y4)
• Hence, the effects q* are linear combinations of the responses; for qA, qB, qAB the sum of the coefficients is zero, so we call these expressions contrasts.
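These contrast formulas can be verified numerically; a minimal sketch using the four responses of the example:

```python
# Solving the 2^2 sign-table system for the effects (coded levels -1/+1),
# using the four responses of Example 10.11.
y1, y2, y3, y4 = 15, 45, 25, 75

q0  = ( y1 + y2 + y3 + y4) / 4   # grand mean
qA  = (-y1 + y2 - y3 + y4) / 4   # main effect of A (memory)
qB  = (-y1 - y2 + y3 + y4) / 4   # main effect of B (cache)
qAB = ( y1 - y2 - y3 + y4) / 4   # A*B interaction

print(q0, qA, qB, qAB)   # 40.0 20.0 10.0 5.0
```

The coefficient pattern in each line is exactly the corresponding signed column of the sign table.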
The statistical model of designs with two generic factors with a and b levels (fully provided in Section 10.5.2 next) is modified to the case where the two factors are binary (so a = b = 2) and n = 1 replication; hence the index k in all formulae is dropped:
Y_ij = μ + τ_i^A + τ_j^B + τ_ij^AB + ε_ij, i = 1, 2, j = 1, 2,
where μ is the overall mean effect, τ_i^A is the effect of A_i (the i-th level of factor A), τ_j^B is the effect of B_j, and τ_ij^AB is the effect of the interaction between A_i and B_j.
Define the Total Variation of the response as the total sum of squared deviations from the mean,
SST = Σ_{i=1}^{N} (y_i − ȳ)² = Σ_{i=1}^{a} Σ_{j=1}^{b} (Y_ij − Ȳ)²
(a special case of Eq. 10.25 when n = 1, a single replicate), where the sample size is N = 2^m.
The last computation (10.16) in fact used the sign-table method with the data in the table above.
In general, with n ≥ 1 replications we need another index k. There are a·b treatment combinations (A_i, B_j), i = 1, 2, · · · , a, j = 1, · · · , b. Suppose also that n independent replicates are made at each of the treatment combinations.
The analysis of variance for full factorial designs is done to test the hypotheses that main-effect or interaction parameters are equal to zero.
Model (10.18) here, for instance, consists of 𝜏𝑖𝑗𝐴𝐵 as the interaction effect between 𝐴𝑖 and 𝐵𝑗 ,
represents deviations of the treatment effects relative to both 𝜏𝑖𝐴 and 𝜏𝑗𝐵 . Effects which involve
comparisons between levels of only one factor are called main effects of that factor, and effects
which involve comparisons for more than a single factor are called interactions. We define precisely
effects as follows.
1. The effect of a factor is defined to be the change in response produced by a change in the level
of the factor. This is frequently called a main effect
If a factor 𝐴 has levels of High and Low, then the main effect of 𝐴 is
symbolically 𝜏 𝐴 = y 𝐴=𝐻𝑖𝑔ℎ − y 𝐴=𝐿𝑜𝑤 , the difference between the average response at the high level
and the low level of 𝐴.
2. In some experiments, we may find that the difference in response between the levels of one factor
is not the same at all levels of the other factors. When this occurs, there is an interaction between
the factors.
(i) If the effect of one factor varies depending on the level of another factor, then there is
interaction between the two factors.
(ii) Interaction = degree of difference from the sum of the separate effects.
The Analysis of Variance (ANOVA) is generally needed for testing the significance of main effects
and interactions. The ANOVA for full factorial designs is built to test the hypotheses that main-effects
or interaction parameters are equal to zero. We present the ANOVA for a design of factors 𝐴 and 𝐵
with a statistical model given in Equation (10.18). The method can be generalized to any number of
factors.
Let
Ȳ_ij = (1/n) Σ_{k=1}^{n} Y_ijk, (10.21)
Ȳ_i. = (1/b) Σ_{j=1}^{b} Ȳ_ij, i = 1, · · · , a, (10.22)
Ȳ_.j = (1/a) Σ_{i=1}^{a} Ȳ_ij, j = 1, · · · , b, (10.23)
and
Ȳ = (1/ab) Σ_{i=1}^{a} Σ_{j=1}^{b} Ȳ_ij. (10.24)
The ANOVA procedure for generic two-factor designs includes three following steps.
1. The ANOVA first partitions the total sum of squares of deviations from Ȳ,
SST = Σ_{i=1}^{a} Σ_{j=1}^{b} Σ_{k=1}^{n} (Y_ijk − Ȳ)² [total sum of squared errors], (10.25)
into two components:
SSW = Σ_{i=1}^{a} Σ_{j=1}^{b} Σ_{k=1}^{n} (Y_ijk − Ȳ_ij)² [sum of squared errors within the whole design], (10.26)
SSBF = n Σ_{i=1}^{a} Σ_{j=1}^{b} (Ȳ_ij − Ȳ)² [sum of squared errors among factor levels]. (10.27)
It is straightforward to show that 𝑆𝑆𝑇 = 𝑆𝑆𝑊 + 𝑆𝑆𝐵𝐹. [𝑆𝑆𝐵𝐹 is also called the sum of square
errors between factors.]
2. In the second stage, the sum of squares of deviations SSBF is partitioned into three components SSA, SSB, and the interaction sum SSI := SSAB, as
SSI = n Σ_{i=1}^{a} Σ_{j=1}^{b} (Ȳ_ij − Ȳ_i. − Ȳ_.j + Ȳ)² [errors caused by interaction effects], (10.28)
SSA = n b Σ_{i=1}^{a} (Ȳ_i. − Ȳ)² [sum of squared errors caused by factor effect A], (10.29)
SSB = n a Σ_{j=1}^{b} (Ȳ_.j − Ȳ)² [sum of squared errors caused by factor effect B], (10.30)
Source of variation DF SS MS F
𝐴 𝑎−1 𝑆𝑆𝐴 𝑀 𝑆𝐴 𝐹𝐴
𝐵 𝑏−1 𝑆𝑆𝐵 𝑀 𝑆𝐵 𝐹𝐵
𝐴𝐵 (𝑎 − 1)(𝑏 − 1) 𝑆𝑆𝐼 𝑀 𝑆𝐴𝐵 𝐹𝐴𝐵
Between 𝑎𝑏 − 1 𝑆𝑆𝐵𝐹 - -
Within 𝑎𝑏(𝑛 − 1) 𝑆𝑆𝑊 𝑀 𝑆𝑊 -
Total 𝑁 −1 𝑆𝑆𝑇 - -
that is, SSBF = SSI + SSA + SSB. All these terms are collected in the ANOVA Table 10.4. Consequently,
SST = SSW + SSBF = SSW + SSI + SSA + SSB = SSW + SSA + SSB + SSAB. (10.31)
The proportions of the variation explained by factors A, B and their interaction are quantified respectively by the ratios
PV_A = SSA/SST, PV_B = SSB/SST, PV_AB = SSAB/SST. (10.32)
The mean squares are
MSA = SSA/(a − 1), MSB = SSB/(b − 1), MSAB = SSI/((a − 1)(b − 1)) = SSAB/((a − 1)(b − 1)), (10.33)
and, when n > 1,
MSW = SSW/(ab(n − 1)).
3. Finally, we compute the F-statistics
F_A = MSA/MSW, (10.34)
F_B = MSB/MSW, (10.35)
and
F_AB = MSAB/MSW. (10.36)
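The whole partition (10.25)–(10.36) can be sketched in a few lines of Python; the data below are the replicated 2² responses that appear later in this section, arranged as an a × b × n array:

```python
# Two-factor ANOVA partition (Eqs. 10.25-10.36) for a = b = 2 levels
# and n = 3 replicates per cell.
import numpy as np

Y = np.array([[[15, 18, 12], [45, 48, 51]],    # A level 1: cells (B1), (B2)
              [[25, 28, 19], [75, 75, 81]]])   # A level 2: cells (B1), (B2)
a, b, n = Y.shape

grand = Y.mean()                 # Ybar, Eq. (10.24)
cell  = Y.mean(axis=2)           # Ybar_ij, Eq. (10.21)
rowm  = cell.mean(axis=1)        # Ybar_i., Eq. (10.22)
colm  = cell.mean(axis=0)        # Ybar_.j, Eq. (10.23)

SST = ((Y - grand) ** 2).sum()                 # Eq. (10.25)
SSW = ((Y - cell[:, :, None]) ** 2).sum()      # Eq. (10.26)
SSA = n * b * ((rowm - grand) ** 2).sum()      # Eq. (10.29)
SSB = n * a * ((colm - grand) ** 2).sum()      # Eq. (10.30)
SSI = n * ((cell - rowm[:, None] - colm[None, :] + grand) ** 2).sum()  # Eq. (10.28)

assert np.isclose(SST, SSW + SSA + SSB + SSI)  # the identity (10.31)

MSW = SSW / (a * b * (n - 1))
FA  = (SSA / (a - 1)) / MSW                    # Eq. (10.34)
FB  = (SSB / (b - 1)) / MSW                    # Eq. (10.35)
FAB = (SSI / ((a - 1) * (b - 1))) / MSW        # Eq. (10.36)
print(SST, SSW, SSA, SSB, SSI)
print(FA, FB, FAB)
```

With these numbers the partition identity (10.31) holds exactly, and all three F-statistics are large relative to typical critical values.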
F_A, F_B and F_AB are the test statistics for the significance of the main effects of A, of B, and of the interaction AB on the response, respectively. A few cases to consider:
1. If F_A < F_{1−α}[a − 1, ab(n − 1)], we cannot reject the null hypothesis H_0^A : τ_i^A = 0 against H_1^A : τ_i^A ≠ 0 (for at least one i); similarly for F_B.
2. Hence, every decision is based on properly computing the (Fisher) F-statistics and employing the F critical values given in Table A7 above, or obtained with software such as R.
3. Also, if F_AB < F_{1−α}[(a − 1)(b − 1), ab(n − 1)], we cannot reject the null hypothesis H_0^{A*B} : τ_11^{AB} = · · · = τ_ab^{AB} = 0. The interaction effects are significant if the alternative H_1^{A*B} is accepted; the main effects should then not be interpreted on their own, whether or not they are individually significant. ■
SUMMARY. The use of ANOVA Table 10.4 will be illustrated in Quality Analytics 0 of Section ??, where we also see that different factors (new controllable factors or scenarios) would require new designs whose existence is uncertain! We exploit the notation below.
We study the use of k = 3 binary factors for SPE with the simulated data below.
Table 10.5: Response at a 2³ factorial experiment
Three factors A, B, C, as controllable variables for memory, cache and operating system (binary value Windows/Linux), have effects on the output Y (MIPS) of a computer system. In order to estimate the main effects of A, B, C, a 2³ factorial experiment was conducted in n = 4 replicates, giving the responses y1, y2, y3 and y4 above.
Each treatment combination was repeated 4 times, at the 'low' and 'high' levels of two noise factors. The results are given in Table 10.5, with the design size N = 2³ = 8.
The mean Y , and standard deviation 𝑆 of 𝑌 at the 8 treatment combinations are listed below.
𝜈 Y 𝑆
0 60.875 0.4918
1 46.800 0.3391
2 91.675 0.4323
3 70.950 0.6103
4 65.375 0.9010
5 50.025 0.5262
6 90.475 1.0871
7 76.675 0.9523
• Regressing the column Ȳ on the 3 orthogonal columns under A, B, C in Table 10.5 gives the fitted equation Ŷ = 69.1 − 7.99A + 13.3B + 1.53C, with R² = 0.991 [see Equation (10.9) for the multiple regression case]. Moreover, the coefficient 1.53 of C is not significant (p-value = 0.103).
• Thus, the significant main effects on the mean yield are those of factors A and B only. Regressing the column of S on A, B, C, we obtain an equation with R² = .805 in which only the main effect of C is significant; factors A and B have no effect on the standard deviation. The strategy is therefore to set C so as to minimize the standard deviation, and then use the values of A and B to adjust the mean response to be equal to the optimal target value M.
• If 𝑀 = 85, we find 𝐴 and 𝐵 to solve the equation 69.1 − 7.99𝐴 + 13.3𝐵 = 85. Putting 𝐵 = 0.75 then
𝐴 = −0.742.
The optimal setting of the design parameters is 𝐴 = −.742, 𝐵 = 0.75 and 𝐶 = −1. ■
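A one-line check of this back-solution, using the fitted coefficients quoted above:

```python
# Back-solving the fitted mean-response equation 69.1 - 7.99*A + 13.3*B = 85
# for A, with B fixed at 0.75 as in the text.
B = 0.75
A = (69.1 + 13.3 * B - 85) / 7.99
print(round(A, 3))   # -0.742, matching the optimal setting in the text
```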
♣ OBSERVATION.
1. The applications of factorial analysis described so far deal with only two levels of each factor, for example low and high, or −1 and +1. If there are only two points, they can only be joined by a straight line. This implies an assumed rectilinear relationship between the magnitude of the factor and the response.
2. If this assumption is not true, then a maximum or minimum value of the response may occur
between the chosen levels of the factors and this would not be detected. Therefore, if a rectilinear
relationship cannot be safely assumed, we should use more than two levels, like the simple two
non-binary factor design 3 × 4 later.
Now suppose that in a 2³ factorial, three binary factors A, B, C are to be studied. The number of combinations is eight, and with n replicates we have N = n·2³ = 8n observations to be analyzed for their influence on a response.
• The presence of the corresponding lower-case letter in the treatment-combination column indicates that the factor is at its high level. As in the 2² case, if the three factors are all quantitative (such as temperature, pressure, time), then the linear regression representation (10.37) of the response Y for the 2³ design is used.
We present the ANOVA table for the three-factor fixed effects model with factors 𝐴, 𝐵, 𝐶, then apply
for binary case 23 with the number of factor levels 𝑎 = 𝑏 = 𝑐 = 2.
• We see that when a = b = c = 2 for the 2³ design, there are seven degrees of freedom between the eight treatment combinations.
Three degrees of freedom are associated with the main effects of A, B, and C.
Four degrees of freedom are associated with interactions: one each with AB, AC, and BC, and one with the 3-factor interaction ABC.
Table 10.7: Table of ANOVA for a 3-factor factorial experiment
Source of variation DF SS MS F
A a − 1 = 1 SSA MSA F_A = MSA/MSE
B b − 1 = 1 SSB MSB F_B
C c − 1 = 1 SSC MSC F_C
A*B interaction (a − 1)(b − 1) SSAB MSAB F_AB
A*C interaction (a − 1)(c − 1) SSAC MSAC F_AC
B*C interaction (b − 1)(c − 1) SSBC MSBC F_BC
A*B*C interaction (a − 1)(b − 1)(c − 1) SSABC MSABC F_ABC
Error df_E = abc(n − 1) SSE MSE -
Total df_T = N − 1 = abcn − 1 SST - -
• The F-tests F_x on main effects and interactions follow directly from the mean squares MS_x, where x = A, B, C, AB, AC, BC, ABC; for example, F_A = MSA/MSE.
Suppose that both of our design factors are quantitative (such as temperature, pressure, living
media, time). Therefore, we write 𝑋1 for factor 𝐴, and 𝑋2 for factor 𝐵.
• The response 𝑌 is expressed linearly via two binary factors 𝐴 and 𝐵 as a linear regression
𝑌 = 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + 𝛽12 𝑋1 𝑋2 + 𝜀
where 𝑌 is the response random variable, 𝜀 is a random error term, with E[𝜀] = 0, and 𝛽0 , . . . , 𝛽12
are regression coefficients.
• The variables 𝑋1 , 𝑋2 are viewed as non-random, with values 𝑥 = (𝑥1 , 𝑥2 ) after conducting experi-
ments. By taking expectation, we get
E[Y | X = x] := E[Y | A ≡ X1 = x1, B ≡ X2 = x2] = β0 + β1 x1 + β2 x2 + β12 x1 x2. (10.38)
                    M = m1 (Medium 1)    M = m2 (Medium 2)
T = t1 = 12 hours   21 23 20 22 28 26    25 24 29 26 25 27
T = t2 = 18 hours   37 38 35 39 38 36    31 29 30 34 33 35
She performs 𝑛 = 6 (balanced design) replicates for each of the 4 𝑀 * 𝑇 combinations. Here
𝐴1 𝐵1 means 𝑚1 𝑡1 =(Medium 1, 12 hours), and so on. The 𝑁 = 24 measurements were taken in a
completely randomized order. The results are given in Table 10.8. The virologist wants to answer
the following questions.
1. What effects do media 𝑀 and growing time 𝑇 have on the virus’s proliferation?
2. Is there a choice of living media that give uniformly strong proliferation regardless of time?
A detailed solution with numerical analysis is shown in PROBLEM 10.3.⁷
⁷ The 2nd question is particularly important. We may find a media alternative that is not greatly affected by time. If this is so, we can make the living media robust to time variation in the actual environment. This is an example of using experimental design for robust product design, a very important engineering and scientific problem.
FACT: With generic factor 𝐵 crossed with factor 𝐴, the crossed design becomes a randomized
complete block design (RCB or RCBD) with levels 𝐵𝑗 of 𝐵 as blocks:
Suppose there are 𝑎 = 3 different diets (of factor 𝐴- Diet), we use the same 𝑏 = 4 tanks (of
factor 𝐵- Tank) in each level of diet, and put 6 fish per tank. In this setting, tanks would have been
crossed with diets, and assume the ANOVA table is obtained as the following table
QUESTIONS
Factor 𝐵
Factor 𝐴 𝐵=1 𝐵=2 𝐵=3 𝐵=4
𝐴=1 𝑦1,1 𝑦1,2 𝑦1,3 𝑦1,4
𝐴=2 𝑦2,1 𝑦2,2 𝑦2,3 𝑦2,4
𝐴=3 ? ? ? ?
1. Explain the number 60 in df_Fish = df_E. Fill in the correct numbers in the cells marked with ?.
2. Decide on the significance of the main effects of A, B and the interaction A*B on the response (weight increase of the fish), using α = 0.05.
The statistical analysis of 2^m designs is summarized below; nowadays a computer software package (like R or MATLAB) is usually employed in this analysis process.
EFPRAI
3. Perform statistical testing with ANOVA (using software when data available)
4. Refine model
5. Analyze residuals
STEP 1. Estimate factor effects: We calculate a main effect as the gap between the response
means at high and low level. For factor 𝐴, for example, the main effect is
𝜏 𝐴 = y 𝐴+ − y 𝐴− .
We use the analysis of variance to formally test for the significance of main effects and interaction.
Table 10.10 shows the general form of an analysis of variance for a 2𝑚 factorial design with 𝑛
replicates.
STEP 4. Refine model removing any non-significant variables from the full model.
STEP 5. Analyze residuals is the usual residual analysis to check for model adequacy and as-
sumptions.
Source of variation DF SS MS F
m main effects:
A 1 SSA MSA F_A = MSA/MSE
B 1 SSB MSB F_B
...
M 1 SSM MSM F_M
C(m,2) two-factor interactions:
A*B interaction 1 SSAB MSAB F_AB
A*C interaction 1 SSAC MSAC F_AC
...
L*M interaction 1 SSLM MSLM F_LM
C(m,3) three-factor interactions:
A*B*C interaction 1 SSABC MSABC F_ABC
...
Error df_E = 2^m (n − 1) SSE MSE -
Total df_T = N − 1 = 2^m n − 1 SST - -
For any binary factor, we equivalently use the symbol −1 or − for the low level, and +1 or + for the high level.
Let us now illustrate the first step of this procedure for m = 3 (so 2^{m−1} = 4), with n replicates. We study three binary factors A, B, C, and repeat the 2³ design n times. The total number of experiments is N = n · 2³ = 8n, and so there are N/2 = 4n runs at each level.
STEP 1. Estimate factor effects - We calculate a main effect as the gap between the response
means at high and low level. For factor 𝐴, for example, the main effect is
𝜏 𝐴 = y 𝐴+ − y 𝐴− .
𝑇𝐴2 = 𝑦𝑎 + 𝑦𝑎𝑏 + 𝑦𝑎𝑐 + 𝑦𝑎𝑏𝑐 as the total of responses over all other factors for level 2 of 𝐴,
𝑇𝐴1 = 𝑦(1) + 𝑦𝑏 + 𝑦𝑐 + 𝑦𝑏𝑐 as the response totals over all other factors for level 1 of 𝐴.
𝑁 is the total number of units in the experiment (𝑁/2 for level 1, 𝑁/2 for level 2).
The response means at the high and low levels of A are respectively
ȳ_{A+} = T_{A2}/(N/2) = T_{A2}/(4n), ȳ_{A−} = T_{A1}/(N/2) = T_{A1}/(4n).
The main effect for factor A then (dropping the y in each response y△ and writing only △) is
τ_A = (T_{A2} − T_{A1})/(N/2) = [a + ab + ac + abc − (1) − b − c − bc]/(4n),
similarly for B,
τ_B = (T_{B2} − T_{B1})/(N/2) = [b + ab + bc + abc − (1) − a − c − ac]/(4n),
and
τ_C = (T_{C2} − T_{C1})/(N/2) = [? − ?]/(4n) (DIY). (10.41)
Note: In the last three equations, the quantities in brackets of main effects for 𝐴, 𝐵, 𝐶 are contrasts
in the treatment combinations.
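A small Python sketch of these contrast computations, with hypothetical treatment totals keyed by the Yates labels (the numbers are made up for illustration, not from the text):

```python
# Main-effect contrasts for the 2^3 design from treatment-combination totals.
# The totals T are hypothetical; n is the number of replicates per combination.
n = 1
T = {'(1)': 60, 'a': 72, 'b': 52, 'ab': 83, 'c': 54, 'ac': 45, 'bc': 80, 'abc': 94}

# tau_A = (T_A2 - T_A1)/(N/2) = [a + ab + ac + abc - (1) - b - c - bc]/(4n)
tau_A = (T['a'] + T['ab'] + T['ac'] + T['abc']
         - T['(1)'] - T['b'] - T['c'] - T['bc']) / (4 * n)
# The analogous contrast for C sums the totals where c is present, minus the rest.
tau_C = (T['c'] + T['ac'] + T['bc'] + T['abc']
         - T['(1)'] - T['a'] - T['b'] - T['ab']) / (4 * n)
print(tau_A, tau_C)   # 12.0 1.5 for these made-up totals
```

Note that each numerator is a contrast: its eight coefficients (+1 or −1) sum to zero.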
10.6 Regression with Experimental Error analysis
Y = β0 + β1 X + e.⁸
Assumption 1: Linearity between predictor and response, namely X and Y.
⁸ X could be family size, an interest rate, a project input, or the number of drunk men per day in BKK; Y could be electricity consumption, the return on an investment project, or the number of traffic accidents in Bangkok.
At the i-th observation, the predictor X_i is considered non-random, and we assume a linear relationship between Y_i and X_i of the form Y_i = β0 + β1 X_i + e_i.
Experimental errors e_i := y_i − ŷ_i (as in Equation 10.42) are summarized by the quantity
SSE = Σ_{i=1}^{N} e_i² = Σ_{i=1}^{N} (y_i − ŷ_i)² = Σ_{i=1}^{N} (y_i − b0 − b1 x_i)², (10.43)
where x_i is the value of factor X at the i-th experiment or observation, and e_i is the error with E[e_i] = 0, for all i = 1, 2, . . . , N; here N is the sample size. Since the total variation decomposes as SST = SSR + SSE, note that:
• the regression sum of squares SSR has df_R = 1 degree of freedom (the dimension of the corresponding space (X, Y) is 1);
• the total sum of squares SST = Σ_{i=1}^{N} (y_i − ȳ)² = (N − 1) s²_y has N − 1 degrees of freedom, because it is computed directly from the sample variance s²_y.
For further analysis, we introduce two other standard assumptions of linear regression, applied when m = 1 and then extended to the general case m ≥ 1.
The estimators of the linear regression coefficients β0, β1 are normally distributed; these estimates, denoted (b0, b1) = b, are found by Equation 10.6 in Knowledge Box 12.
With this information, we can unbiasedly estimate the response variance σ² = V[Y] by the sample regression variance
s² = σ̂² = MSE = SSE/(N − 2). (10.44)
Definition 10.4.
• RMSE: The estimated sample regression variance s² = MSE = SSE/(N − 2) gives the estimated standard deviation s, called the root mean squared error or RMSE.
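A minimal least-squares sketch illustrating (10.43) and (10.44) on a small made-up sample:

```python
# Fitting Y = b0 + b1*X by least squares and computing SSE, MSE and RMSE
# (Eqs. 10.43-10.44). The x, y data are made up for illustration.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
N = len(y)

# Least-squares estimates b1, b0 of the regression coefficients.
b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()

e = y - (b0 + b1 * x)          # residuals e_i = y_i - yhat_i
SSE = (e ** 2).sum()           # Eq. (10.43)
MSE = SSE / (N - 2)            # Eq. (10.44): unbiased estimate of sigma^2
RMSE = MSE ** 0.5              # the root mean squared error of Definition 10.4
print(b0, b1, SSE, RMSE)
```

The divisor N − 2 reflects the two estimated coefficients b0 and b1, matching the df bookkeeping above.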
A standard way to present estimation of experimental errors and analysis of variance of the re-
sponse is the ANOVA table 10.11.
Univariate ANOVA
To assess the underlying experimental error, some degree of replication must be accepted, quantified by a natural number n ≥ 1. Consider the smallest multivariate regression analysis, studying k = 2 predictors with binary choices. We see that the 2² factorial design cannot estimate experimental errors if no experiment is repeated, that is, if only n = 1 replicate is used.
• Experimental errors are quantified by replications. In other words, analyzing experimental errors requires the number of replicates to satisfy n > 1.
• If n > 1 replicates are used, then each experiment in the 2^k factorial design is repeated n times, so there are in total N = n·2^k experimental runs based on the standard 2^k factorial.
Y = q0 + qA xA + qB xB + qAB xA xB + e, assuming E[e] = 0.
Id → 𝑞0 𝐴 𝐵 𝐴𝐵 𝑦 (responses) y
1 −1 −1 1 (15, 18, 12) 15
1 1 −1 −1 (45, 48, 51) 48
1 −1 1 −1 (25, 28, 19) 24
1 1 1 1 (75, 75, 81) 77
164 86 38 20 Total= 164
41 21.5 9.5 5 Total/4 = 41 = y = 𝑞0
To compute the estimated response for each factor-level combination, we use the model
Ŷ = Ĝ(A_i, B_i) = ŷ_i = q0 + qA x_{A_i} + qB x_{B_i} + qAB x_{A_i} x_{B_i}.
Then, extending (10.43) to the case of two factors, the difference between measured and estimated values is
e_ij = y_ij − ŷ_i,
and we get the sum of squared errors via the total variation SST as
SSE = SST − (SSA + SSB + SSAB) = SSY − SS0 − (SSA + SSB + SSAB).
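With the replicated 2² data of the sign table above, the pooled within-run deviations give SSE directly; a short sketch:

```python
# Experimental-error sum of squares for the replicated 2^2 design:
# with n = 3 replicates per run, SSE pools the within-run squared deviations,
# because the saturated model q0, qA, qB, qAB reproduces each run mean exactly.
import numpy as np

runs = np.array([[15, 18, 12],    # A-, B-
                 [45, 48, 51],    # A+, B-
                 [25, 28, 19],    # A-, B+
                 [75, 75, 81]])   # A+, B+

yhat = runs.mean(axis=1)                    # fitted value per run = run mean
SSE = ((runs - yhat[:, None]) ** 2).sum()   # sum of e_ij^2
print(yhat)   # [15. 48. 24. 77.]
print(SSE)    # 102.0
```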
♣ QUESTION.
1. Why does experimental error analysis matter in SPE with the DOE method, both mathematically and practically? What other factorial-design techniques have been developed for SPE?
SUMMARY
Regarding the use of 2^m designs, we observe the following.
1. Presence or absence of interactions is not a function of the experimental plan, but is a function
of the scientific problem under investigation.
3. Usually, estimating the main effects, two-factor interactions, and possibly three-factor interactions
is a good starting place in a research program.
4. If a treatment contrast also involves a contrast among block effects, the two contrasts are said to be confounded: the treatment effect and the block effect cannot then be estimated separately.
PROBLEM
PROBLEM 10.2 (A 2 × 3 Two-Factor Factorial with analysis in R).
An experiment was run to investigate how the type of glass G and the type of phosphorescent coating P affect the brightness of a light bulb.
The response variable Y is the current (measured in microamps) needed to obtain a specified brightness. The data, with a = 2, b = 3 and n = 3, are given in the table below.
             Phosphor Type
              A    B    C
Glass    1   278  297  273
Type         291  304  284
             285  296  288
         2   229  259  228
             235  249  225
             241  241  235
[Interaction plots: mean of light vs. glass type (lines by phosphor A, B, C) and mean of light vs. phosphor type (lines by glass 1, 2); the profiles are nearly parallel.]
• The 𝐺 * 𝑃 = 𝐺𝑙𝑎𝑠𝑠 * 𝑃 ℎ𝑜𝑠𝑝ℎ𝑜𝑟 interaction is not significant (p-value = .9130). This is obvious from
the strong parallelism in the interaction plots.
PROBLEM 10.3.
PROBLEM FORMULATION:
A virologist is interested in studying the effects of environment and proliferating time on the growth of a particular virus, using a two-factor factorial experiment with the data below.
                    M = m1 (Medium 1)    M = m2 (Medium 2)
T = t1 = 12 hours   21 23 20 22 28 26    25 24 29 26 25 27
T = t2 = 18 hours   37 38 35 39 38 36    31 29 30 34 33 35
She performs 𝑛 = 6 (balanced design) replicates for each of the 4 𝑀 * 𝑇 combinations. Here
𝐴1 𝐵1 means 𝑚1 𝑡1 =(Medium 1, 12 hours), and so on. The 𝑁 = 24 measurements were taken in a
completely randomized order. The results are given in Table 10.12. The virologist wants to answer
the following questions.
♣ QUESTION.
1. What effects do media 𝑀 and growing time 𝑇 have on the virus’s proliferation?
2. Is there a choice of living media that would give uniformly strong proliferation regardless of time?
GUIDANCE for solving.
The 2nd question is particularly important. We may find a media alternative that is not greatly
affected by time. If this is so, we can make the living media robust to time variation in the actual
environment. This is an example of using experimental design for robust product design, a very
important engineering and scientific problem.
                    M = m1 (Medium 1)       M = m2 (Medium 2)
T = t1 = 12 hours   ȳ11 = 140/n = 23.3      ȳ12 = 156/n = 26
T = t2 = 18 hours   ȳ21 = 223/n = 37.16     ȳ22 = 192/n = 32
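These cell means, and the differing time effects within each medium, can be checked directly from the raw data of Table 10.12; a short sketch:

```python
# Reproducing the cell means of Table 10.13 and the change-in-response
# comparisons that suggest an M*T interaction (raw data from Table 10.12).
m1_t1 = [21, 23, 20, 22, 28, 26]
m2_t1 = [25, 24, 29, 26, 25, 27]
m1_t2 = [37, 38, 35, 39, 38, 36]
m2_t2 = [31, 29, 30, 34, 33, 35]

mean = lambda v: sum(v) / len(v)
print(mean(m1_t1), mean(m2_t1))   # about 23.33 and 26.0
print(mean(m1_t2), mean(m2_t2))   # about 37.17 and 32.0

# Effect of changing T from 12 to 18 hours, within each medium:
print(mean(m1_t2) - mean(m1_t1))  # about 13.83 in medium 1
print(mean(m2_t2) - mean(m2_t1))  # 6.0 in medium 2: the effects clearly differ
```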
From the response means at each cell (i, j) in Table 10.13, and from Definition 10.3 (a factor's effect is the change in response produced by a change in the level of the factor), we can distinguish two cases.
1. Either the effect on the response of changing T from 12 to 18 hours depends on the level of M, as shown in the left panel of the figure below;
[Interaction plots: mean of growth vs. time (lines by medium 1, 2) and mean of growth vs. medium (lines by time 12, 18).]
2. Or the effect on the response of changing M from medium 1 to medium 2 depends on the level of T, as shown in the right panel of the figure above.
If these pairs of effects are significantly different, then we say there is a significant interaction between the factors M and T. Here, for both cases, the two effects clearly differ.
The command lm(response ~ A * B, data = DATAFRAME) returns the linear model determined by Equation (10.38). The last command, qqnorm(), produces the normal probability plot, which helps to detect real high-order interactions, as discussed later in Section 10.5.6.
[Normal probability plot: Sample Quantiles vs. Theoretical Quantiles.]
Figure 10.10: The normal probability plot for factor Time and Medium
Source of variation DF SS MS F
𝐴 𝑎−1=1 𝑆𝑆𝐴 𝑀 𝑆𝐴 𝐹𝐴
𝐵 𝑏−1=1 𝑆𝑆𝐵 𝑀 𝑆𝐵 𝐹𝐵
𝐴 * 𝐵 interaction (𝑎 − 1)(𝑏 − 1) = 1 𝑆𝑆𝐴𝐵 𝑀 𝑆𝐴𝐵 𝐹𝐴𝐵
Error 𝑑𝑓𝐸 = 𝑎𝑏(𝑛 − 1) = 20 𝑆𝑆𝐸 𝑀 𝑆𝐸 -
Total 𝑑𝑓𝑇 = 𝑁 − 1 = 𝑎𝑏𝑛 − 1 𝑆𝑆𝑇 - -
CONCLUSION
• Using the above ANOVA table, for a 2-factor factorial experiment, we observe that there are three
degrees of freedom between the four treatment combinations in the 22 design.
Two degrees of freedom are associated with the main effects of 𝐴 and 𝐵, and 1 degree of freedom
is associated with a 2-interaction 𝐴𝐵.
10.8 The simplest case of 3^k factorial design
The simplest 3^k design has two factors, each at 3 levels; it is denoted the 3² design and shown in Figure 10.11. There are 3² = 9 treatment combinations (runs), and 8 degrees of freedom between these treatment combinations. The nine runs are denoted by either a_i b_j or just (i, j), where i and j assume the values 0, 1, 2. The common analysis of variance with polynomial decomposition takes the form shown in the table below.
Table 10.14: Analysis of Variance Table, 32 design
Source df
𝐴 2 linear and quadratic
𝐵 2 linear and quadratic
𝐴*𝐵 4 linear × linear, linear × quadratic
quadratic × linear, quadratic × quadratic
1. The main effects of 𝐴 and 𝐵 each have two degrees of freedom, and the 𝐴𝐵 interaction has four
degrees of freedom. If there are 𝑛 replicates, there will be 𝑛·3² − 1 total degrees of freedom and
3²(𝑛 − 1) degrees of freedom for error.
2. We fit the data to estimate both the linear and the quadratic effects of each factor.
In total there are two main-effect terms for each factor (linear and quadratic) and 4 interaction effects.
3. In the ANOVA table, the sums of squares for 𝐴, 𝐵 and 𝐴*𝐵 may be computed by the usual methods
for factorial designs, presented in Section 10.5.2.
* Each main effect can be represented by a linear and a quadratic component, each with a single
degree of freedom. This is meaningful only if the factor is quantitative. For example, 𝐴’s main
effect includes terms 𝛽1 𝑥1 and 𝛽2 𝑥21 .
The 2-factor interaction 𝐴 * 𝐵 may be partitioned in two ways: (A) linear model or (B) orthogonal
Latin squares.
The 2-factor interaction 𝐴 * 𝐵 may be partitioned by subdividing 𝐴 * 𝐵 into the four single-degree-
of-freedom components corresponding to 𝐴 * 𝐵𝐿𝐿 , 𝐴 * 𝐵𝐿𝑄 , 𝐴 * 𝐵𝑄𝐿 , and 𝐴 * 𝐵𝑄𝑄 .
This can be done by fitting the terms 𝛽4 𝑥1 𝑥2 , 𝛽7 𝑥1 𝑥22 , 𝛽5 𝑥21 𝑥2 , and 𝛽8 𝑥21 𝑥22 respectively.
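Collecting the terms just listed, the full regression model for a quantitative 3² design has nine parameters, one per cell. The 𝛽-numbering follows the components named above; 𝛽₃ and 𝛽₆ (our labels here, not given explicitly in the text) carry 𝐵's linear and quadratic effects:

```latex
y = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \beta_3 x_2 + \beta_6 x_2^2
    + \beta_4 x_1 x_2 + \beta_5 x_1^2 x_2 + \beta_7 x_1 x_2^2
    + \beta_8 x_1^2 x_2^2 + \varepsilon .
```

The nine coefficients match the 3² = 9 runs of the design, so without replication this model is saturated.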
Now suppose there are three factors (A, B, and C) under study and that each factor is at three levels
arranged in a factorial experiment. This is a 33 factorial design, and the experimental layout and
treatment combination notation are shown in Figure 10.11.
Factorial structure:
♦ EXAMPLE 10.15.
Oikawa (1987) reported the results of a 3³ factorial experiment to investigate the effects of three factors 𝐴, 𝐵, 𝐶
on the stress levels of a membrane 𝑌.
The data is given in file STRESS.csv. The first three columns of the data file provide the levels of
the three factors, and column 4 presents the stress values 𝑌 .
> data(STRESS)
> summary(lm(stress ~ (A+B+C+I(A^2)+I(B^2)+I(C^2))^3, data=STRESS))
Call: lm.default(formula = stress ~ (A + B + C + I(A^2) + I(B^2) +
I(C^2))^3, data = STRESS)
Residuals:
ALL 27 residuals are 0: no residual degrees of freedom!
Coefficients: (15 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 191.8000 NA NA NA
A 38.5000 NA NA NA
B -46.5000 NA NA NA
C 63.0000 NA NA NA
I(A^2) 0.2000 NA NA NA
I(B^2) 14.0000 NA NA NA
I(C^2) -27.3000 NA NA NA
A:B -32.7500 NA NA NA
A:C 26.4500 NA NA NA
(Footnote: Oikawa, T. and Oka, T. (1987). New Techniques for Approximating the Stress in Pad-Type Nozzles Attached to a Spherical Shell. Transactions of the American Society of Mechanical Engineers, May, 188-192.)
A:I(A^2) NA NA NA
...
I(A^2):I(B^2):I(C^2) NA
> summary(aov(stress ~ (A+B+C)^3 +I(A^2)+I(B^2)+I(C^2), data=STRESS))
Df Sum Sq Mean Sq F value Pr(>F)
A 1 36315 36315 378.470 1.47e-12 ***
B 1 32504 32504 338.751 3.43e-12 ***
C 1 12944 12944 134.904 3.30e-09 ***
I(A^2) 1 183 183 1.911 0.185877
I(B^2) 1 2322 2322 24.199 0.000154 ***
I(C^2) 1 4536 4536 47.270 3.73e-06 ***
A:B 1 3290 3290 34.289 2.44e-05 ***
A:C 1 6138 6138 63.971 5.56e-07 ***
B:C 1 183 183 1.910 0.185919
A:B:C 1 32 32 0.338 0.569268
Residuals 16 1535 96
Table 10.15: The LSE (least squares estimates) of the parameters of the 3³ system
In Figures 10.12 and 10.13 we present the main-effects and interaction plots.
10.9
COMPLEMENT: Ternary Factorial Design
We discuss in the present section estimation and testing of model parameters when the design is a
full factorial 3^𝑚, of 𝑚 factors each at 𝑝 = 3 levels. We assume that the levels are measured on a
continuous scale and labeled Low, Medium and High.
When the factors are quantitative, we use the indices 𝑖𝑗 (𝑗 = 1, · · · , 𝑚), which assume the values
0, 1, 2, or alternatively −1, 0, 1, for the Low, Medium and High levels of each factor, correspondingly. This
facilitates fitting a regression model relating the response to the factor levels.
Each treatment combination in the 3^𝑚 design is thus denoted by 𝑚 digits (𝑖1, 𝑖2, · · · , 𝑖𝑚), where the
first digit indicates the level of factor 𝐴, the second digit the level of factor 𝐵, and so on.
For example, in a 32 design, 00 denotes the treatment combination corresponding to 𝐴 and 𝐵 both
at the low level, and 01 denotes the treatment combination corresponding to 𝐴 at the low level and
𝐵 at the medium (intermediate) level.
It is simple to transform the values (levels) of each factor from the 0, 1, 2 system to the −1, 0, 1 system: use

X_j = \begin{cases} -1, & \text{if } i_j = 0 \\ 0, & \text{if } i_j = 1 \\ 1, & \text{if } i_j = 2, \end{cases}

that is, X_j = i_j - 1.
However, the matrix of coefficients 𝑋 that is obtained when we include quadratic and interaction
parameters is not orthogonal. This then requires the use of a computer to obtain the least squares estimates.
• Let Y 𝜈 denote the mean yield of 𝑛 replicas of the 𝜈-th treatment combination, 𝑛 ≥ 1. Since we
obtain the yield at three levels of each factor we can, in addition to the linear effects, estimate also
the quadratic effects of each factor.
The concepts utilized in the 32 and 33 designs can be readily extended to the case of m factors, each
at three levels, that is, to a 3𝑚 factorial design.
Thus, for example, the vector (0, 0, · · · , 0) represents the grand mean 𝜇 = 𝛾0 ;
a vector (0, 0, · · · , 1, 0, · · · , 0), with 1 at the 𝑖-th component, represents the linear effect of the 𝑖-th
factor. Similarly, (0, 0, · · · , 2, 0, · · · , 0), with 2 at the 𝑖-th component, represents the quadratic effect
of the 𝑖-th factor.
When 𝑚 = 4, the series 0120 represents a treatment combination in a 34 design with 𝐴 and
𝐷 at the low levels, 𝐵 at the intermediate level, and 𝐶 at the high level. There are 3𝑚 treatment
combinations, with 3𝑚 − 1 degrees of freedom between them.
\omega = \sum_{j=1}^{m} \lambda_j 3^{j-1}, \qquad \omega = 0, \ldots, 3^m - 1.
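As a quick check of this indexing, with 𝑚 = 3 the parameter 𝐴𝐵² (i.e. (𝜆1, 𝜆2, 𝜆3) = (1, 2, 0)) receives the index

```latex
\omega = 1 \cdot 3^{0} + 2 \cdot 3^{1} + 0 \cdot 3^{2} = 7,
```

and the 3³ = 27 values 𝜔 = 0, 1, …, 26 enumerate 𝜇 together with all main effects and interactions exactly once.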
ELUCIDATION
If 𝑚 is not too large, it is also customary to label the factors by the letters 𝐴, 𝐵, 𝐶, · · · and the
parameters by 𝐴𝜆1 𝐵 𝜆2 𝐶 𝜆3 ··· . In this notation a letter to the zero power is omitted. When 𝑚 = 3, the
main effects and interactions are listed in Table 10.16.
The size of the design increases rapidly with 𝑚. For example, a 33 design has 27 treatment
combinations per replication, a 34 design has 81, a 35 design has 243, and so on. Therefore, only a
single replicate of the 3𝑚 design is frequently considered, and higher order interactions are combined
to provide an estimate of error.
SUMMARY
MATHEMATICAL MODELS, DESIGNS And ALGORITHMS
CHAPTER 10. STATISTICALLY DESIGNED EXPERIMENTS
360 FOR SYSTEM PERFORMANCE EVALUATION
allowing up to 𝑚-order interaction effects. Briefly, in this additive model of the response we have
in total 3^𝑚 parameters, where

3^m = \sum_{j=0}^{m} \binom{m}{j} 2^j .
Introduction
We discuss in this chapter a few topics aimed at understanding the key principles of various designs,
at properly analyzing experimental results when conducting experiments in practice, and at using
designs to compare different system configurations.
We study
* Differences between using coded design variables and engineering raw units,
Learning Outcomes
11.1
What are good Fractional Factorial Designs?
We used the sign-table method in EXAMPLE 10.14 to study and analyze the simple binary 2² design. The
influences of the predictors on the response are explained via the variation, specifically expressed via the
sums of squares and the relation 𝑆𝑆𝑌 = 𝑆𝑆𝑂 + 𝑆𝑆𝐴 + 𝑆𝑆𝐵 + 𝑆𝑆𝐴𝐵 + 𝑆𝑆𝐸, after obtaining the regression
model fitted to the empirical observations, 𝑦̂𝑖 = 𝑞0 + 𝑞𝐴 𝑥𝐴𝑖 + 𝑞𝐵 𝑥𝐵𝑖 + 𝑞𝐴𝐵 𝑥𝐴𝑖 𝑥𝐵𝑖.
Id →  𝑞0     𝐴     𝐵     𝐴𝐵    𝑦 (responses)    ȳ
       1    −1    −1     1    (15, 18, 12)     15
       1     1    −1    −1    (45, 48, 51)     48
       1    −1     1    −1    (25, 28, 19)     24
       1     1     1     1    (75, 75, 81)     77
     164    86    38    20    Total = 164
      41  21.5   9.5     5    Total/4 = 41 = ȳ̄ = 𝑞0
Good factorial designs have a small number of runs, and their sign vectors are orthogonal; more
precisely, mutual orthogonality holds.
Proposition 11.1. Mutual Orthogonality between sign vectors of factors in the regression of binary
designs 2𝑘 (and their fractions 2𝑘−𝑝 in Section 11.1.2) include
For example, with 𝑚 = 3, check the mentioned mutual orthogonality of the design in 3 factors with
4 replicates, and 𝑝 = 0 given in table below.
Exp. no. A B C D E F G
1 -1 -1 -1 1 1 1 -1
2 1 -1 -1 -1 -1 1 1
3 -1 1 -1 -1 1 -1 1
4 1 1 -1 1 -1 -1 -1
5 -1 -1 1 1 -1 -1 1
6 1 -1 1 -1 1 -1 -1
7 -1 1 1 -1 -1 1 -1
8 1 1 1 1 1 1 1
𝑦 = 𝑞0 + 𝑞𝐴 𝑥𝐴 + 𝑞𝐵 𝑥𝐵 + 𝑞𝐶 𝑥𝐶 + 𝑞𝐷 𝑥𝐷 + 𝑞𝐸 𝑥𝐸 + 𝑞𝐹 𝑥𝐹 + 𝑞𝐺 𝑥𝐺
with only main effects, the orthogonality property supports the formulation as follows.
q_A = \frac{1}{8} \sum_i y_i x_{Ai} = \frac{-y_1 + y_2 - y_3 + y_4 - y_5 + y_6 - y_7 + y_8}{8}, \quad \ldots,

q_G = \frac{1}{8} \sum_i y_i x_{Gi} = \frac{-y_1 + y_2 + y_3 - y_4 + y_5 - y_6 - y_7 + y_8}{8}.
• Important parameters are not controlled; Effects of different factors are not isolated
These issues motivate new solutions and the effective use of fractional designs in the next parts.
11.2
Binary Fractional Designs- Computation
So far we have seen that it is possible to use a single replicate of a factorial (set of treatments) to
obtain estimates of main effects and two-factor interactions, using high-order interaction MS (mean
square) to estimate 𝜎 2 . The full set of requirements which need to be considered when proposing
the use of a single-replicate design are (due to [65, Chapter 14]):
4. and it should be possible to obtain an estimate of 𝜎 2 using higher-order interaction MS for interac-
tions likely to have only small effects.
We will develop designs using only a fraction (of a full binary factorial), which will enable most of
these four crucial requirements to be satisfied. A design using only a fraction of the possible factorial
treatment combinations is called a fractional design, or just a fraction.
• Fractional replicates can be useful in a variety of forms ranging from using half of the possible
combinations and achieving all four requirements to using a tiny proportion of the possible combi-
nations in a saturated design, and being able to satisfy only Item 1.
• In these designs we shall ignore Item 3, the estimation of three-factor interactions, since in practice it is not often relevant to look for interactions involving three factors, and it is very difficult to
make a sensible interpretation when such interactions appear to be large.
♦ EXAMPLE 11.3 (Minimum size Fractional Design of the 211 full binary).
Suppose that we wish to investigate the first six factors (among a total of 11 binary factors in cell phone
manufacturing), say C = Color, S = Shape, W = Weight, M = Material, P = Price
and O = OS (operating system), each at two levels, but that an experiment of 64 observations is too
large for the available resources.
C S W M P O Cam. Wifi Anti. Ant. Place   Run
0 0 0 0 0 0  0    0    0    0    0        1
1 1 1 0 1 1  0    1    0    0    0        2
0 1 1 1 0 1  1    0    1    0    0        3
0 0 1 1 1 0  1    1    0    1    0        4
0 0 0 1 1 1  0    1    1    0    1        5
1 0 0 0 1 1  1    0    1    1    0        6
0 1 0 0 0 1  1    1    0    1    1        7
1 0 1 0 0 0  1    1    1    0    1        8
1 1 0 1 0 0  0    1    1    1    0        9
0 1 1 0 1 0  0    0    1    1    1       10
1 0 1 1 0 1  0    0    0    1    1       11
1 1 0 1 1 0  1    0    0    0    1       12
To assess all the main effects and two-factor interactions would require 6 and 15 df, respectively.
Using half of the 26 = 64 combinations would give a total of 31 df so that after estimating the
main effects and two-factor interactions there could be 10 df available to estimate 𝜎 2 . Therefore
the question from Item 4. is whether we can identify a suitable set of 32 combinations from the total
64 combinations, which allows us to estimate the effects and 𝜎 2 ? ■
Full factorial experiments with a large number of factors, even in the binary case of 2^𝑘, might be impractical. For example, if there are 𝑘 = 12 factors, even at ℎ = 2 levels, the total number of treatment
combinations is ℎ^𝑘 = 2^12 = 4096. This size of an experiment is generally not necessary, because
most of the high order interactions might be negligible and there is no need to estimate 4096 param-
eters.
• If only main effects and first-order interactions are considered a priori of importance, while all the
rest are believed to be negligible, we have to estimate and test only

1 + k + \binom{k}{2} = 1 + 12 + \binom{12}{2} = 1 + 12 + 66 = 79

parameters.
• A fraction of the experiment of size 2^7 = 128 would be sufficient. Such a fraction can even be
replicated several times.
Definition 11.2.
Formally, we fix 𝑑 finite sets 𝑄1 , 𝑄2 , . . . , 𝑄𝑑 called factors, where 1 < 𝑑 ∈ N. The elements of a
factor are called its levels.
• The (full) factorial design (also factorial experiment design- FED) with respect to these factors
is the Cartesian product 𝐷 = 𝑄1 × 𝑄2 × . . . × 𝑄𝑑 .
(a) Constructing and/or designing: to learn how to construct those experiments, given the scope of
expected commodities and the parameters of components;
(b) Exploring and selecting: to investigate some design characteristics (proposed by researchers)
to choose good designs. For instance, in factorial designs we learn how to detect interactions
between factors and, if they exist, calculate how strongly they could affect the outcomes; finally
(c) Implementing, analyzing & consulting: study how to use (i.e., conduct experiments in applications,
measure outcomes, analyze data obtained, and consult clients).
The goal is to use such new understanding to improve product, to answer questions as:
♣ QUESTION 1. In consideration of using fractional factorial design, how do we choose the fraction
of the full factorial in such a way that desirable properties of
Generally, for smaller fractions, which are more likely the case if the factorial is larger, we will need
more than one defining equation (or parameter). In the next two sections we shall look at smaller
fractions for 2^𝑚 factorial structures. We first consider a powerful method, called fractionization,
for blocking the 2^𝑚 design into 2^𝑝 blocks, where 𝑝 is the number of defining equations and 0 < 𝑝 < 𝑚.
We call such a fraction a fractional factorial experiment, or just a fractional design.
In a fractional design, only a fraction of the treatment combinations are observed. This has the
advantage of saving time and money in running the experiment, but the disadvantage that each
main-effect and interaction contrast will be confounded, or aliased, with one or more other main-effect
and interaction contrasts, and so cannot be estimated separately. (Fractional factorial experiments
are used frequently in industry, especially in various stages of product development and in process
and quality improvement.)
Definition 11.3.
Treatment combinations that are confounded with each other are called aliases.
The aliases are obtained by multiplying the parameter of interest by the defining parameter. An
alias set consists of all treatment combinations that are estimated by the same contrast.
DESIGN AIM
We have to design the fractions that will be assigned to each block in such a way that, if there
are significant differences between the blocks,
then the block effects will not confound or obscure factors of interest.
Consider a specific 2³ experiment for studying the relationship between diet scheme and blood
pressure. We conducted experiments to assess the effects of diet on blood pressure in (say,
American) males. Three factors are to be measured:
• Each treatment combination 𝑟 will be administered as follows: A subject will have a baseline blood
pressure reading taken, then will be fed (at a laboratory) according to one of the eight diet plans
(treatments 𝑟).
• After three weeks, another blood pressure reading will be taken. Unfortunately, administering the
diet plans is very labor-intensive, and only four treatment combinations can be run at one time.
Thus, the experiment will be run in two blocks, each lasting three weeks. The following design, with
𝑏 = 2 blocks, was decided upon:
With eight subjects per treatment combination, the total number of runs is 𝑁 = 8𝑛 = 8 · 8 = 64 and 𝑑𝑓𝑇 = 𝑁 − 1 =
63; the ANOVA looks like the table below, where there are only 6 degrees of freedom for treatments
because of the confounding with blocks. Indeed, if we break down the 7 degrees of freedom for
treatments we can study the confounding in blocks.
Source df
Blocks 𝑑𝑓𝐵 = 𝑏 − 1 = 2 − 1 = 1
Trts 𝑑𝑓𝑇 𝑟𝑡𝑠 = 6
𝑇 ×𝐵 0
Within error 𝑑𝑓𝐸 = 2³(𝑛 − 1) = 8 · 7 = 56 = 𝑑𝑓𝑇 − 𝑑𝑓𝐵 − 𝑑𝑓𝑇𝑟𝑡𝑠
Total 𝑑𝑓𝑇 = 63
• We look at the treatment combinations corresponding to the component main effects and interac-
tions, and we can see the confounding in Table 11.1.
We see that the 𝐴𝐵𝐶 interaction is confounded with blocks, as Block 1 has all high and Block 2
has all low levels. Every other effect is balanced between the blocks, in that there are two high
levels and two low levels in each block, so no other effect is confounded with blocks. This is why
the above ANOVA table has only 6 degrees of freedom for treatments: the block sum of squares is
exactly the sum of squares due to the three-way interaction.
Definition 11.4.
The resolution of a 2𝑚−𝑘 design is the length of the smallest word (shortest parameter, excluding
𝜇) in the subgroup of defining (generating) parameters or just generators.
• Resolution III designs: designs in which no main effects are aliased with any other main effect,
but main effects are aliased with two-factor interactions, and some two-factor interactions may be
aliased with each other.
• Resolution IV designs: designs in which no main effect is aliased with any other main effect or
2-factor interactions, but 2-factor interactions can be aliased with each other.
• Resolution V designs: no main effect or two-factor interaction is aliased with any other main effect
or two-factor interaction, but two-factor interactions are aliased with three-factor interactions.
Designs of resolution 𝑅 = 𝐼𝐼𝐼, 𝐼𝑉 are useful in factor screening experiments.
Fractionization
♦ EXAMPLE 11.5.
We illustrate the construction of fractions specifically via the 2𝑚−𝑝 = 28−4 design. Here we construct
two fractions (blocks), each of size 16. As discussed before, 𝑝 = 4, so four generating parameters
should be specified. (Note that orthogonal arrays (Definition ??) include both regular designs and
irregular designs, the latter not definable by generator words; see Montgomery [19] and Wu [112].)
Let these generators be 𝐵𝐶𝐷𝐸, 𝐴𝐶𝐷𝐹, 𝐴𝐵𝐶𝐺, 𝐴𝐵𝐷𝐻. These parameters generate a resolution
IV design where the degree of fractionation is 𝑝 = 4.
• The blocks can be indexed 0, 1, · · · , 15. Each index is determined by the signs of the four
generators, which determine the blocks. Each block is a fractional design having 2𝑚−𝑝 = 16
runs.
• Thus, the signs (−1, −1, 1, 1) correspond to (0, 0, 1, 1), which yields the index

\sum_{j=1}^{4} i_j 2^{j-1} = 0 \cdot 1 + 0 \cdot 2 + 1 \cdot 4 + 1 \cdot 8 = 12.
In the following table two blocks (fractions), derived with the software R, are printed.
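Such a fraction can also be built directly in base R. The following sketch is our illustration (not the book's code): it constructs the principal fraction in which each generating word equals +1, taking A, B, C, D as base factors and deriving E = BCD, F = ACD, G = ABC, H = ABD, one common choice consistent with the generators above.

```r
# Principal 2^(8-4) fraction: full 2^4 in the base factors A, B, C, D,
# with E, F, G, H derived so that BCDE = ACDF = ABCG = ABDH = +1.
base <- expand.grid(A = c(-1, 1), B = c(-1, 1), C = c(-1, 1), D = c(-1, 1))
frac <- base
frac$E <- with(base, B * C * D)   # from word BCDE
frac$F <- with(base, A * C * D)   # from word ACDF
frac$G <- with(base, A * B * C)   # from word ABCG
frac$H <- with(base, A * B * D)   # from word ABDH
nrow(frac)    # 16 runs = 2^(8-4)
# Resolution IV: the eight main-effect columns are mutually orthogonal,
# so the cross-product matrix is 16 * I_8.
M <- as.matrix(frac)
all(crossprod(M) == 16 * diag(8))   # TRUE
```

Each of the 16 block indices described above corresponds to one choice of signs for the four generators; flipping a generator's sign flips the corresponding derived column.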
11.3
Binary Fractional Factorial Designs- Analysis
As an essential illustration of full factorial designs, we present the 2^𝑚 factorial designs, the simplest
full factorials of 𝑚 factors, each factor at two levels.
Thus 𝑎 denotes the combination where 𝐴 is at the high and 𝐵 is at the low level,
𝑏 denotes the combination where 𝐴 is at the low and 𝐵 is at the high level, and the treatment
combination
𝑎𝑏 denotes the combination where both treatments are at the high level.
𝑎𝑏𝑐 denotes the combination where each treatment is at the high level, and
𝑐 denotes the combination where 𝐴, 𝐵 are at the low, and 𝐶 is at high level.
• (III) We also symbolically denote by 𝐴1, 𝐴2 the levels of 𝐴, and by 𝐵1, 𝐵2 the levels of 𝐵, and use
(coupled with the above notation) these new symbolic representations to define their main effects in
the regression model and ANOVA computation.
• 𝐴2𝐵1: 𝐴 at high level, 𝐵 at low level; 𝐴2𝐵2: 𝐴 at high level, 𝐵 at high level. The design 𝐴 × 𝐵 = 2²
is a complete factorial design without replication.
In this design, the term factorial signifies the inclusion of all combinations of levels of factors in
the experiment (no connection between this term and the factorial function).
Table 11.2: Four systems of notation for interactions in 22 design with Yates’ order
Here the quantity represented by 𝐴1 𝐵1 merely got renamed to (1) in the 3rd column (Modern
notation). The special symbol (1) is used to represent the control, or the treatment combination with
both factors at the low level. All four systems are used in practice, and we will use at least the last
three of them. The modern notation is more compact and makes it a lot easier to extend our analysis.
The explicit inclusion of a letter in modern notation indicates that the factor is at its high level; thus,
(𝐴) represents 𝐴 at the high level and, by the absence of 𝐵's letter, 𝐵 at the low level. The combinations
are listed in Yates' standard order; that is, each letter is followed by all combinations of that letter and
the letters previously introduced.
The response value at a generic treatment combination (𝑇) is denoted by 𝑦𝑡 ≡ 𝑦(𝑇); so 𝑦(1) = 𝑦𝐴=1,𝐵=1,
𝑦𝑎 ≡ 𝑦(𝐴), 𝑦𝑏 ≡ 𝑦(𝐵), etc.
𝑇𝐴2 = 𝑦𝑎 + 𝑦𝑎𝑏 := 𝑦(𝐴) + 𝑦(𝐴𝐵) as the total of responses over factor 𝐵 for level 2 of 𝐴,
𝑇𝐴1 = 𝑦(1) + 𝑦𝑏 = 𝑦(1) + 𝑦(𝐵) as the total of responses over factor 𝐵 at level 1 of 𝐴, as in Table 11.3.
The response means at the high and low levels of 𝐴 are respectively summarized by

\bar{y}_{A_2} = \frac{T_{A_2}}{N/2}, \qquad \bar{y}_{A_1} = \frac{T_{A_1}}{N/2},

where 𝑁 is the total number of units in the experiment (𝑁/2 = 2𝑛 units at each level of 𝐴) and 𝑛 is the
number of replicates. We estimate the main effect of factor 𝐴 as

\tau_A = \bar{y}_{A_2} - \bar{y}_{A_1} = \frac{T_{A_2} - T_{A_1}}{2n} = \frac{ab + a - b - (1)}{2n}, \qquad (11.2)
(by convention of dropping 𝑦 in response 𝑦△ to only △), and the main effect of factor 𝐵 similarly as
\tau_B = \bar{y}_{B_2} - \bar{y}_{B_1} = \frac{T_{B_2} - T_{B_1}}{2n} = \frac{ab + b - a - (1)}{2n}. \qquad (11.3)
The interaction effect of 𝐴 and 𝐵, denoted by 𝜏𝐴𝐵, is determined as the gap between the
average responses at both 'extreme' choices of 𝐴, 𝐵 and those at their mixed choices:

\tau_{AB} = \frac{(T_{A_2,B_2} + T_{A_1,B_1}) - (T_{A_2,B_1} + T_{A_1,B_2})}{2} = \frac{ab + (1) - a - b}{2}, \qquad (11.4)

where 𝑇𝐴2,𝐵2 = 𝑦𝑎𝑏 = 𝑎𝑏 is the response with factors 𝐴, 𝐵 both at level 2 (high),
𝑇𝐴1,𝐵1 = 𝑦(1) = (1) is the response with 𝐴, 𝐵 both at level 1 (low),
𝑇𝐴2,𝐵1 = 𝑦𝑎 = 𝑎 is the response with 𝐴 at level 2 and 𝐵 at level 1, and
𝑇𝐴1,𝐵2 = 𝑦𝑏 = 𝑏 is the response with 𝐴 at level 1 and 𝐵 at level 2. ■
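For a quick numerical illustration (using the cell means 15, 48, 24, 77 of the sign-table example in Section 11.1 as single responses, so 𝑛 = 1):

```latex
\tau_A = \frac{77 + 48 - 24 - 15}{2} = 43, \quad
\tau_B = \frac{77 + 24 - 48 - 15}{2} = 19, \quad
\tau_{AB} = \frac{77 + 15 - 48 - 24}{2} = 10,
```

each exactly twice the corresponding regression coefficient 𝑞𝐴 = 21.5, 𝑞𝐵 = 9.5, 𝑞𝐴𝐵 = 5 found there, since an effect measures the response change over two coded units.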
NOTE: The study of Confounding in Binary Factorial Designs will be detailed later in Section 11.7.
We now connect binary design with linear regression.
11.4
Work with Coded Design Variables
2) For a 2^𝑚 design, should we estimate all the 2^𝑚 parameters or terms in the coupled regression
model?
We have so far performed all of the analysis and model fitting for a 2^𝑚 design in terms of the coded design variables,
• and not the design factors in their original units (sometimes called actual, natural, or engineering
units).
When the engineering units are used in analyzing observed data, we can obtain numerical results
different from the coded-unit analysis, and often the results will not be as easy to interpret. The
analysis of these data via empirical modeling lends some insight into the value of coded units and
the engineering units in designed experiments. To illustrate some of the differences between the two
analyses, consider the following experiment.
A simple DC-circuit is constructed in which two different resistors, 1 and 2Ω, can be connected.
The circuit also contains an ammeter and a variable-output power supply. With a resistor installed
in the circuit, the power supply is adjusted until a current flow of either 4 or 6 amps is obtained.
Then the voltage output of the power supply is read from a voltmeter (in the last column). Two
replicates of a 2² factorial design are performed, and Table 11.4 presents the results.
We present the regression models obtained using the design variables in the usual coded variables
(𝑥1 = current and 𝑥2 = resistance) and then in the engineering units, respectively. If the coded
variables 𝑥1 = 𝐼 and 𝑥2 = 𝑅 use only values −1 and +1 then orthogonality here means the
inner product of two coded vectors 𝑥1 and 𝑥2 equals 0. They clearly are orthogonal.
♣ QUESTION. What is the frequency of treatment combinations from the two engineering design
variables 𝐼 and 𝑅?
1. Consider first the coded variable analysis with R code and ANOVA output.
* The regression equation is 𝑉 = 7.50 + 1.52𝑥1 + 2.53𝑥2 + 0.458 𝑥1 𝑥2 . Notice that both main ef-
fects (𝑥1 = current) and (𝑥2 = resistance) are significant as is the interaction. In the coded variable
analysis, the magnitudes of the model coefficients are directly comparable; that is, they all are di-
mensionless, and they measure the effect of changing each design factor over a one-unit interval.
* Furthermore, they are all estimated with the same precision (notice that the standard error of all
three coefficients is 0.053). Coded variables are very effective for determining the relative size of
factor effects.
2. Now consider the analysis based on the engineering units, as shown below.
In this model, only the interaction is significant. The model coefficient for the interaction term is
0.917, and the standard error is 0.1046. ■
SUMMARY.
1. Note that the regression coefficients are not dimensionless and that they are estimated with
differing precision. This is because the experimental design, with the factors in the engineering
units, is not orthogonal.
2. Generally, we conclude that the engineering units are not directly comparable, but they may
have physical meaning as in the present example. This could lead to possible simplification based
on the underlying mechanism.
3. The fact that coded variables let an experimenter see the relative importance of the design factors
is useful in practice.
• The levels of the 𝑖-th factor (𝑖 = 1, · · · , 𝑚) are fixed at 𝑥𝑖1 and 𝑥𝑖2, where 𝑥𝑖1 < 𝑥𝑖2. By a simple
transformation all factor levels can be reduced to

c_i = \begin{cases} +1, & \text{if } x = x_{i2} \\ -1, & \text{if } x = x_{i1}, \end{cases} \qquad i = 1, \cdots, m.
• In such a factorial experiment there are 2^𝑚 treatment combinations (or just treatments). Denote
by (𝑖1, · · · , 𝑖𝑚) a treatment combination, where 𝑖1, · · · , 𝑖𝑚 are indices with

i_j = \begin{cases} 0, & \text{if } c_j = -1 \\ 1, & \text{if } c_j = 1. \end{cases}
𝜈   𝑖1  𝑖2  𝑖3
0   0   0   0
1   1   0   0
2   0   1   0
3   1   1   0
4   0   0   1
5   1   0   1
6   0   1   1
7   1   1   1
Thus, if there are 𝑚 = 3 factors, the number of possible treatment combinations is 23 = 8. These
are given in Table 11.5.
We discuss now the estimation of the main effects and interaction parameters.
An important rule of thumb is that an interaction between two factors should be considered,
and acknowledged in an experimental design, unless there is an explicit understanding of why it
is acceptable to assume that it is zero.
3. A vector (0, 0, · · · , 1, 0, · · · , 0), where the 1 is in the 𝑖-th component, represents the main effect of the
𝑖-th factor (𝑖 = 1, · · · , 𝑚).
4. A vector with two ones, at the 𝑖-th and 𝑗-th components (𝑖 = 1, · · · , 𝑚 − 1; 𝑗 = 𝑖 + 1, · · · , 𝑚), represents
the first-order interaction between factors 𝑖 and 𝑗.
5. A vector with three ones, at the 𝑖-th, 𝑗-th and 𝑘-th components, represents the second-order interaction between
factors 𝑖, 𝑗, 𝑘, etc. Put

\omega = \sum_{i=1}^{m} j_i 2^{i-1} .
6. Let 𝐶2𝑚 be the matrix of coefficients, obtained recursively by the equations

C_2 = \begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix}, \qquad (11.6)
DETERMINATION OF THE REGRESSION PARAMETERS 𝛽^(2^𝑚)
Let 𝑌^(2^𝑚) be the response vector. Then the linear model relating 𝑌^(2^𝑚) to the vector

\beta^{(2^m)} = (\beta_\omega) = (\beta_0, \beta_1, \cdots, \beta_{2^m - 1})'

is

Y^{(2^m)} = C_{2^m} \cdot \beta^{(2^m)} + e^{(2^m)}. \qquad (11.8)
The least squares estimator: the column vectors of 𝐶2𝑚 are orthogonal, so
(C_{2^m})' C_{2^m} = 2^m I_{2^m}, and the least squares estimator (LSE) of 𝛽^(2^𝑚) is

\hat{\beta}^{(2^m)} = \frac{1}{2^m} (C_{2^m})' Y^{(2^m)}. \qquad (11.9)
The matrix 𝐶2𝑚 is (up to column order and signs) a Hadamard matrix. Accordingly, the LSE of 𝛽𝜔 is

\hat{\beta}_\omega = \frac{1}{2^m} \sum_{\nu=0}^{2^m - 1} c_{(\nu+1),(\omega+1)} \bar{Y}_\nu , \qquad (11.10)

where 𝑐𝑖,𝑗 is the element in the 𝑖-th row and 𝑗-th column of 𝐶2𝑚; i.e., multiply the components of the
vector 𝑌^(2^𝑚) by those of column 𝜔 + 1 of 𝐶2𝑚, then sum and divide by 2^𝑚.
• The variance 𝜎² can be estimated by the pooled variance estimator, obtained from the between-replication variance within each treatment combination. That is, if 𝑌𝜈𝑗, 𝑗 = 1, 2, · · · , 𝑛, are the
observed values at the 𝜈-th treatment combination, then

\hat{\sigma}^2 = \frac{1}{(n-1)\, 2^m} \sum_{\nu=1}^{2^m} \sum_{j=1}^{n} (Y_{\nu j} - \bar{Y}_\nu)^2 . \qquad (11.13)
SUMMARY − For 2𝑚 design, we do not have to estimate all the 2𝑚 parameters or terms in
regression model, but can restrict attention only to parameters of interest. Finally, note that factors
not studied may be influential. The primary ways of addressing these uncontrolled factors are as
follows:
In SPE or Pollution Studies, for instance, assume that at each time point 𝑡 ∈ N𝑛 = {1, 2, . . . , 𝑛} we
observe 𝑘 pollutant variables 𝑌𝑡1, 𝑌𝑡2, . . . , 𝑌𝑡𝑘, concatenated to form the random vector

\mathbf{Y}_t := [Y_{t1}, Y_{t2}, \ldots, Y_{tj}, \ldots, Y_{tk}]^T.
Multivariate linear regression (MLR for short) expresses 𝑘 output responses 𝑌𝑗 as linearly related to
the 𝑟 inputs 𝑧𝑖 (predictors, 𝑖 = 1, 2, . . . , 𝑟).
We assume the noises 𝑤𝑡𝑗 are correlated over the identifier 𝑗, but are still independent over
time, i.e. Cov[𝑤𝑠 𝑖 , 𝑤𝑡 𝑗 ] = 𝜎𝑖𝑗 for time point 𝑠 = 𝑡, and
Cov[𝑤𝑠 𝑖 , 𝑤𝑡 𝑗 ] = 0 for 𝑠 ̸= 𝑡.
Matrix form of MLR- Model fitting and selection with information criteria
𝑦𝑡 = 𝛽1 𝑧𝑡 1 + 𝛽2 𝑧𝑡 2 + · · · + 𝛽𝑟 𝑧𝑡 𝑟 + 𝑤𝑡 , 𝑡 = 1, 2, . . . , 𝑛. (11.15)
𝑦 𝑡 = ℬ 𝑧 𝑡 +𝑤𝑡 , 𝑡 = 1, 2, . . . , 𝑛. (11.16)
Here, 𝑧𝑡 := [𝑧𝑡1, 𝑧𝑡2, . . . , 𝑧𝑡𝑟]^T is the input vector at time 𝑡 of the 𝑟 predictors, and
the error process {𝑤𝑡} is assumed to consist of independent vectors of size 𝑘 × 1, with
We next fit model (11.16) from data, i.e. estimate regression coefficient matrix ℬ.
• With 𝜎̂𝑗𝑗 the 𝑗-th diagonal element of Σ̂𝑤, the estimated standard error is

\mathrm{se}(\hat{\beta}_{ij}) := \sqrt{c_{ii}\, \hat{\sigma}_{jj}}, \qquad i = 1, 2, \ldots, r; \; j = 1, 2, \ldots, k. \qquad (11.20)
We use a specific univariate time series model to fit data sets involving one time series, namely the
autoregressive model AR(𝑝), given in COMPLEMENT 11.6. We then use the MLR setting above to
study possible dynamics of phenomena of interest from data sets involving more than one time
series, via the vector autoregressive (VAR) model.
♦ Each factor 𝑥𝑡,𝑗 can be a place, temperature, pollutant level, mortality count, etc., measured at time
point 𝑡. The vector 𝑥𝑡 = [𝑥𝑡,𝑗] describes the values of all factors 𝑗, 𝑗 = 1, 2, . . . , 𝑘.
We then modify the vector 𝑥𝑡 to 𝑥𝑡(𝑠), still a 𝑘 × 1 vector, but measuring the 𝑘 factors at time 𝑡 and
point 𝑠 ∈ 𝐷.
If E[𝑥𝑡] = 𝜇 then 𝛼 = (Id − Φ)𝜇. Furthermore, if 𝜇 = 0 then we get the standard VAR(1) model.
• Note the similarity between the VAR model and the MLR model (11.16). The regression formulas
carry over by letting

y_t = x_t, \qquad \mathcal{B} = (\alpha, \Phi), \qquad z_t = (1, x'_{t-1})'. \qquad (11.24)
Observing data 𝑥1, 𝑥2, . . . , 𝑥𝑛, we might fit Model (11.21) with the estimated coefficients ℬ̂ = (𝛼̂, Φ̂),
found by the conditional MLE (11.18). The estimated covariance matrix is modified from (11.19) as

\hat{\Sigma}_w = \frac{1}{n-1} \sum_{t=2}^{n} \hat{w}_t \hat{w}_t' . \qquad (11.25)
Recall the VAR(1) model (11.21) expressing simultaneously 𝑘 responses at time 𝑡, namely
where Φ is a 𝑘 × 𝑘 transition matrix that represents the dependence of responses 𝑥𝑡 at time 𝑡 on 𝑥𝑡−1
(the responses just one time unit before time 𝑡).
• The conditional maximum likelihood estimator of the error covariance matrix Σ𝑤 = E[𝑤𝑡𝑤𝑡′] is

\hat{\Sigma}_w = \frac{\mathrm{SSE}}{n - p}, \qquad (11.28)

as in the multivariate regression case, except that now only 𝑛 − 𝑝 residuals enter the SSE.
• The selection criteria used for choosing 'good' VAR models are the popular AIC, the AICc given by

\mathrm{AIC}_c = \ln |\hat{\Sigma}_w| + \frac{k(r+n)}{n - (k + r + 1)}, \qquad (11.29)

and the BIC, given by

\mathrm{BIC} = \ln |\hat{\Sigma}_w| + \frac{k^2 p \ln n}{n}. \qquad (11.30)
We use the R package vars to fit vector AR models to the DienChau data via least squares.
We will select a VAR(𝑝) model and then fit the model automatically, using the R function VARselect
with AIC-based information criteria.
setwd("E:/Computation/COVID19-ANALYSIS/"); getwd()
# datdc <- read.csv("E:/Computation/COVID19ANALYS/Dien_Chau_data.csv")
datdc <- read.csv(file.choose())
tempr <- datdc$temp                  # temperature, column 2
part  <- datdc$dewp                  # column 3, representing particulate level PM2.5
cmort <- as.numeric(datdc$all)       # all-cause mortality, column 10
y <- cbind(cmort, tempr, part); ts.plot(y, col = 1:3); library(vars)
VARselect(y, lag.max = 10, type = "both")    # compare information criteria
summary(fit <- VAR(y, p = 2, type = "both")) # "both" fits constant + trend
OUTPUT
$selection
AIC(n)  HQ(n)  SC(n) FPE(n)
     4      3      1      4
VAR Estimation Results:
Endogenous variables: cmort, tempr, part
Deterministic variables: both
Sample size: 357
Log Likelihood: -2276.193
Roots of the characteristic polynomial:
0.9481 0.629 0.2508 0.2508 0.1567 0.09881
Call: VAR(y = y, p = 2, type = "both")
Estimation results for equation part: [more interesting!]
=====================================
part = cmort.l1 + tempr.l1 + part.l1 + cmort.l2 + tempr.l2 + part.l2 + const + trend
Estimate Std. Error t value Pr(>|t|)
cmort.l1 -0.0644332 0.0583374 -1.104 0.270141
tempr.l1 0.3844768 0.1004902 3.826 0.000154 ***
part.l1 0.9333421 0.0620427 15.044 < 2e-16 ***
cmort.l2 -0.0491436 0.0570606 -0.861 0.389689
tempr.l2 -0.0982051 0.1023111 -0.960 0.337786
part.l2 -0.1824492 0.0613116 -2.976 0.003126 **
const 11.0149952 1.8269184 6.029 4.19e-09 ***
trend -0.0003455 0.0014298 -0.242 0.809174
Correlation matrix of residuals:
cmort tempr part
cmort 1.00000 0.02441 0.02515
tempr 0.02441 1.00000 0.54087
part 0.02515 0.54087 1.00000
SUMMARY
• Significantly, the particulate level PM2.5 is fairly strongly correlated with the temperature, ρ_PT = 0.54.
• VARselect returns the information criteria and the final prediction error for lag orders increasing sequentially up to a VAR(p) process; all are based on the same sample size.
• Note that BIC (SC) picks an order p = 1 model while AIC and FPE (Final Prediction Error) pick an order p = 4 model. Using the p = 2 fit above and the notation of the previous example, the prediction model for particulate PM2.5 is
𝑃̂︀𝑡 = 11 − 0.001𝑡 − 0.06 𝑀𝑡−1 + 0.38 𝑇𝑡−1 + 0.93 𝑃𝑡−1 − 0.05 𝑀𝑡−2 − 0.1 𝑇𝑡−2 − 0.2 𝑃𝑡−2
• When BIC (SC) picks the order p = 1, the all-cause mortality is estimated as M̂_t = α + βt + . . .
What should we call the term α + βt = 11 − 0.001t in the fitted model P̂_t?
It is viewed as the trend effect on the response, and such a term is generally named an exogenous factor.
11.6 COMPLEMENT: Autoregressive Process AR(p)
Definition 11.7. A simple autoregressive model of order p, denoted AR(p), is a univariate time series {X_t} where X_t (the present value at a given time t) is given as a linear combination of the p past values X_{t-1}, X_{t-2}, . . . , X_{t-p}. Precisely, AR(p) is given as
X_t = φ_1 X_{t-1} + φ_2 X_{t-2} + · · · + φ_p X_{t-p} + W_t,
where X_t is stationary, the φ_i are constants (φ_p ≠ 0), and W_t is white noise, i.e., W_t ∼ WN(0, σ_w²).
Suppose we consider the Gaussian white noise series W_t [Definition ??] as input and calculate the output using the second-order equation (for t = 1, 2, . . . , 500)
X_t = X_{t-1} − 0.9 X_{t-2} + W_t. (11.33)
Equation (11.33) represents a regression or prediction of the current value 𝑋𝑡 of a time series as a
function of the past two values of the series, and, hence, the term autoregression of order 𝑝 = 2 is
suggested. ■
[Figure: the simulated autoregression series X_t plotted against Time.]
In R, such a series can be simulated with arima.sim or with filter.
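The book works in R; as a self-contained cross-check, here is a Python/numpy sketch of the same second-order recursion (assuming, as in this classic example, the coefficients X_t = X_{t-1} − 0.9X_{t-2} + W_t):

```python
import numpy as np

rng = np.random.default_rng(0)
n, burn = 500, 50                     # 50 extra values to damp start-up effects
w = rng.standard_normal(n + burn)     # Gaussian white noise input
x = np.zeros(n + burn)
for t in range(2, n + burn):
    # second-order autoregression: X_t = X_{t-1} - 0.9 X_{t-2} + W_t
    x[t] = x[t - 1] - 0.9 * x[t - 2] + w[t]
x = x[burn:]                          # discard the burn-in segment
print(len(x))                         # 500 simulated values
```

Plotting x against time reproduces the quasi-periodic behavior seen in the figure above.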
We will need two simple but useful operators for representing time series models.
𝐵𝑋𝑡 = 𝑋𝑡−1
𝐵 𝑘 𝑋𝑡 = 𝑋𝑡−𝑘 (11.34)
we get X_t = φ_1 BX_t + φ_2 B²X_t + · · · + φ_p B^p X_t + W_t, or
(1 − φ_1 B − φ_2 B² − · · · − φ_p B^p) X_t = W_t, (11.35)
written compactly as
φ(B) X_t = W_t. (11.36)
The autoregressive operator is the above polynomial φ(B), defined in terms of the backshift operator B. The AR(p) process can then be viewed as a solution to equation (11.36), i.e.,
X_t = (1/φ(B)) W_t. (11.38)
Equation 𝜑(𝐵) = 0 is called the characteristic equation for the autoregressive model AR(𝑝).
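As an illustrative numerical check (our addition, not from the text), the causality condition "all roots of the characteristic equation φ(z) = 0 lie outside the unit circle" can be verified directly; here for the second-order model with coefficients 1 and −0.9 used above:

```python
import numpy as np

# phi(z) = 1 - z + 0.9 z^2 for the AR(2) model X_t = X_{t-1} - 0.9 X_{t-2} + W_t;
# np.roots expects coefficients from the highest degree down: 0.9 z^2 - z + 1.
roots = np.roots([0.9, -1.0, 1.0])
print(np.abs(roots))                     # both moduli equal sqrt(1/0.9)
print(bool(np.all(np.abs(roots) > 1)))   # True: the AR(2) is causal
```

The roots form a complex-conjugate pair of modulus sqrt(1/0.9) ≈ 1.054 > 1, so this AR(2) is causal.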
We iterate backwards k times:
X_t = φX_{t-1} + W_t = φ(φX_{t-2} + W_{t-1}) + W_t
    = φ²X_{t-2} + φW_{t-1} + W_t = · · · = φ^k X_{t-k} + ∑_{j=0}^{k-1} φ^j W_{t-j}.
Provided that X_t is stationary, by continuing to iterate backward we can represent an AR(1) model as the linear process
X_t = ∑_{j=0}^{∞} φ^j W_{t-j}. (11.39)
This is called the stationary solution of the model. Hence, when an AR(1) is stationary, the convergence of this infinite summation implies that |φ| < 1.
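A numerical sanity check (our illustration, not from the text): starting the recursion at X_0 = W_0 makes X_t equal to the finite sum ∑_{j=0}^{t} φ^j W_{t-j} exactly, so the recursive and linear-process representations can be compared term by term:

```python
import numpy as np

rng = np.random.default_rng(1)
phi, n = 0.5, 40
w = rng.standard_normal(n)

# AR(1) recursion with X_0 = W_0
x = np.zeros(n)
x[0] = w[0]
for t in range(1, n):
    x[t] = phi * x[t - 1] + w[t]

# linear-process (moving-average) form: X_t = sum_{j=0}^{t} phi^j W_{t-j}
x_ma = np.array([sum(phi**j * w[t - j] for j in range(t + 1)) for t in range(n)])
print(np.max(np.abs(x - x_ma)))   # essentially zero
```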
In operator form,
φ(B) X_t = W_t ⇐⇒ X_t = (1/φ(B)) W_t =: ψ(B) W_t, (11.41)
where
φ(B) = 1 − φ_1 B − φ_2 B² − · · · − φ_p B^p. (11.42)
Here
1/φ(B) = ψ(B) ⇐⇒ φ(B) ψ(B) = 1, (11.43)
and the polynomial ψ(B) = φ(B)^{-1} = ∑_{j=0}^{∞} ψ_j B^j will be utilized later.
• The parameter 𝜇 is the mean of the process. Think of the term 𝜑 (𝑋𝑡−1 − 𝜇) as representing
“memory” or “feedback” of the past into the present value of the process.
• The parameter 𝜑 determines the amount of feedback, with a larger absolute value of 𝜑 resulting in
more feedback, and φ = 0 implying that X_t = μ + W_t, so that X_t ∼ WN(μ, σ_w²).
In applications, one can think of 𝑊𝑡 as representing the effect of new information. Information that
is truly new cannot be anticipated, so the effects of today’s new information should be independent
of the effects of yesterday’s news. This is why we model new information as white noise.
Consider two AR(1) models of the form
X_t = φX_{t-1} + W_t, (11.47)
one with φ = 0.9 and one with φ = −0.9; in both cases, σ_w² = 1.
In the first case ρ(h) = 0.9^h, so observations close together in time are positively correlated with each other. In R we can try
par(mfrow=c(2,1))
plot(arima.sim(list(order=c(1,0,0), ar=.9), n=100), ylab="x",
main=(expression(AR(1)~~~phi==+.9)))
plot(arima.sim(list(order=c(1,0,0), ar=-.9), n=100), ylab="x",
main=(expression(AR(1)~~~phi==-.9)))
Look again at the operator form (11.36) of AR(1); assuming that the inverse operator exists,
X_t = φ^{-1}(B) W_t. Here φ(B) = 1 − φB is the autoregressive operator of AR(1), with |φ| < 1.
With B the backshift operator, rewrite (11.39) in operator form, yielding a polynomial ψ(B):
X_t = ∑_{j=0}^{∞} φ^j W_{t-j} = ∑_{j=0}^{∞} ψ_j W_{t-j} =: ψ(B) W_t, (11.48)
where ψ_j := φ^j, and ψ(B) := ∑_{j=0}^{∞} ψ_j B^j.
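As an illustrative numerical check (our addition), the ψ-weights determine the AR(1) autocovariance through γ(h) = σ_w² ∑_j ψ_j ψ_{j+h} = σ_w² φ^h/(1 − φ²); a truncated sum reproduces the closed form:

```python
import numpy as np

phi, sigma2 = 0.9, 1.0
J = 400                                 # truncation point; phi**(2*J) is negligible
psi = phi ** np.arange(J)               # psi_j = phi^j for AR(1)

for h in range(5):
    gamma_num = sigma2 * np.sum(psi[: J - h] * psi[h:])   # sum_j psi_j psi_{j+h}
    gamma_closed = sigma2 * phi**h / (1 - phi**2)
    print(h, gamma_num, gamma_closed)
```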
Fact 11.1.
For any polynomial P(z) = 1 − az, where z is a complex number and |a| < 1, the inverse is
P^{-1}(z) = 1/P(z) = 1 + az + a²z² + · · · + a^j z^j + · · · , |z| < 1. (11.49)
We could view ψ(B) as a one-sided generating function, and treat the backshift operator B as a complex number. In particular, it will often be necessary to consider the cases |B| < 1, |B| = 1, or |B| > 1, that is, when the complex number B lies inside, on, or outside the unit circle. By Fact 11.1, φ^{-1}(B) is exactly the polynomial ψ(B) in Equation (11.48): φ^{-1}(B) W_t = ψ(B) W_t. These results will be generalized in our discussion of ARMA models in the next chapters.
Definition 11.8 (Causal and explosive AR processes).
• When an AR process is stationary (it does not depend on the future), we will say the process is causal.
(E.g., the stationary AR(1) with |φ| < 1 is causal.)
• Hence, an AR process with |𝜑| ≥ 1 is nonstationary, and the mean, variance, and correlation
are not constant. In particular, with |𝜑| > 1, the AR process is future dependent, it is not causal,
we say the process is explosive.
When 𝜑 = 1, we get a special nonstationary process, often called random walk, given by
𝑋𝑡 = 𝑋𝑡−1 + 𝑊𝑡 .
Then E[X_t | X_0] = X_0 for all t, which is constant but depends entirely on the arbitrary starting point X_0. Moreover, the variance V[X_t | X_0] = t σ_w², which is not stationary but rather increases linearly with time. The process therefore is not mean-reverting. ■
♦ EXAMPLE 11.11 (Explosive AR Models and Causality).
Consider the AR(1) model
X_t = φX_{t-1} + W_t (11.51)
with |φ| > 1, or |φ|^{-1} < 1. Such processes are called explosive because the values of the time series quickly become large in magnitude.
Indeed, the finite sum S_k = ∑_{j=0}^{k} φ^j W_{t-j} of series (11.48) will not converge (in mean square) as k → ∞ [because |φ|^j increases without bound as j → ∞], so the intuition used to obtain (11.48) (which converged) will not work directly.
Rewriting
X_{t+1} = φX_t + W_{t+1} ⇐⇒ X_t = φ^{-1}X_{t+1} − φ^{-1}W_{t+1} = . . .
and iterating forward k steps (X_{t+1} = φ^{-1}X_{t+2} − φ^{-1}W_{t+2}, and so on), we get
X_t = φ^{-k} X_{t+k} − ∑_{j=1}^{k} φ^{-j} W_{t+j}.
Now |φ|^{-j} < 1 for all j = 1, 2, . . ., so this result suggests the future-dependent representation
X_t = − ∑_{j=1}^{∞} φ^{-j} W_{t+j}.
It requires us to know the future to be able to predict the present! In this explosive case, the process is stationary, but it is also future dependent, and not causal. ■
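A numerical illustration of the forward representation (our addition): truncating X_t = −∑_{j≥1} φ^{-j} W_{t+j} at J terms and checking that it satisfies the recursion X_t = φX_{t-1} + W_t up to a φ^{-J} truncation error:

```python
import numpy as np

rng = np.random.default_rng(2)
phi, n, J = 2.0, 30, 60                  # |phi| > 1: the explosive case
w = rng.standard_normal(n + J + 1)

# truncated future-dependent solution: X_t = -sum_{j=1}^{J} phi^{-j} W_{t+j}
x = np.array([-sum(phi**-j * w[t + j] for j in range(1, J + 1)) for t in range(n)])

# it satisfies the AR(1) recursion up to a phi**-J truncation error
lhs = x[1:]
rhs = phi * x[:-1] + w[1:n]
print(np.max(np.abs(lhs - rhs)))         # essentially zero (error ~ phi**-J)
```

The construction only uses future noise values, which is exactly why the explosive solution is not causal.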
Knowledge Box 15 (Summary of AR(1) model).
1. If |φ| < 1, the AR(1) process is stationary and causal.
HINT: Use the pattern (11.48) to check E[X_t] = 0, and compute the covariance
Cov[X_{t+h}, X_t] = · · · = σ_w² φ^{|h|} / (1 − φ²).
2. If |φ| ≥ 1, then the AR(1) process is nonstationary, and the mean, variance, covariances and correlations are not constant.
11.7 Confounding in Factorial Designs (Extra reading)
• Briefly, given a factorial experiment with a few factors of interest and a response Y, confounding between two effects E_1 and E_2 (possibly main or interaction effects) means we cannot separate the impacts of E_1 and E_2 on Y.
• Mathematically, confounding is also the name of a design technique for arranging a complete factorial experiment [as of Definition 11.2] in blocks, where the block size is smaller than the number of treatment combinations in one replicate; certain treatment effects are then deliberately confounded with block effects. That explains why confounding is essentially based on blocking.
For instance, when we study the quality (delicious sensory experience) of drinking COFFEE, consider two factors:
• Sugar, with two choices: Yes, No (or High, Low);
• Milk, also with two choices: With (Much) and Without (Little).
Questions:
1) How many types of coffee can we make? By combinatorial mathematics: 2 × 2 = 4.
2) How can you select your favorite coffee? By experimentation and data analysis!
In this simple factorial design we have raised two primary questions, and confounding between the two main effects A and B may occur in answering the second question: we cannot tell whether the coffee is delicious because of the milk or the sugar, can we? More examples can be seen in Example 11.12.
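The first question can be answered by direct enumeration; a minimal sketch (our illustration):

```python
from itertools import product

sugar = ["Yes", "No"]
milk = ["With", "Without"]

# every treatment combination of the 2 x 2 factorial
coffees = list(product(sugar, milk))
print(len(coffees))   # 4 types of coffee
print(coffees)
```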
Blocking designs
Full factorial experiments with a large number of factors can be impractical. Trouble particularly occurs when all factors have 3 levels: we must deal with the 3^m system, defined in the COMPLEMENT on Ternary Factorial Design in Section 10.9. Moreover, in practice it is sometimes impossible to perform all of the runs of a 2^m or 3^m factorial experiment under homogeneous conditions.
• For example, a single batch of raw material might not be large enough to make all of the required
runs. In other cases, it might be desirable to deliberately vary the experimental conditions to ensure
that the treatments are equally effective (i.e., robust) across many situations that are likely to be
encountered in practice.
• A chemical engineer may run a pilot plant experiment with several batches⁴.
The design technique used in these situations is blocking. The emphasis here is on the fundamentals; confounding issues that become important in large, very expensive experiments are discussed under another topic, namely Split-Plot Designs.
Aim, Ideas and Notations: The aim of this chapter is to introduce techniques for assigning parts of
replicates to smaller blocks, with the assignment based on the factorial treatment structure. We look
at ways to keep blocks small and homogeneous while retaining the desirable features and efficiency
of large factorial experiments.
♦ EXAMPLE 11.12 (WHY BLOCKING?).
If there are m = 3 factors, even at p = 2 levels, the total number of treatment combinations is 2³ = 8. If each combination takes two hours to run, 16 hours will be required to complete the experiment. Over such a long period, many influences could occur that are not of interest to us in this experiment, but that might make the interpretation of our results unclear.
⁴ of raw material, because he knows that raw-material batches of different quality grades are likely to be used in the actual full-scale process.
1. Suppose we have only eight hours available in a day, and so are forced to run our 16-hour
experiment over two days: a block of four treatment combinations on Monday and another
block of four on Tuesday.
2. Hence, we are not able to run all eight treatment combinations as one large block under
homogeneous conditions; instead, we have to split the one large block of eight treatment com-
binations into two smaller blocks of four.
Other “nuisance” factors can pollute data, rendering the interpretation problematic:
• Personnel changes, as when the day-shift radiologist is replaced by the night-shift radiologist in a hospital radiology experiment,
• The humidity in the photo lab might shift from cool in the morning to warm in the afternoon.
■
Definition 11.9 (What is Confounding?).
Confounding between two effects E_1 and E_2 (either main or interaction effects) means that we cannot separate the impacts of E_1 and E_2 on Y. When this phenomenon happens, we say E_1 and E_2 are confounded or aliased together.
1. The basic idea of this chapter is to assign treatment combinations to blocks in such a way that
effects of most interest can be estimated from within block information while sacrificing estimates
of effects of lesser importance.
2. In a factorial experiment we are often most interested in main effects and two or three-factor inter-
actions. Experience shows that the higher-order interactions are often much smaller in magnitude
than the main effects. This is fortunate since they also tend to be much more difficult to interpret.
The technique causes information about certain treatment effects (usually high-order interactions)
to be indistinguishable from, or confounded with, blocks.
Note that even though the designs presented are incomplete block designs because each block
does not contain all the treatments or treatment combinations, the special structure of the 2𝑚
factorial system allows a simplified method of analysis. We consider the construction and analysis
of the 2𝑚 factorial design in 2𝑘 incomplete blocks, where 𝑘 < 𝑚.
• When k = 2, these designs can be run in four blocks; when k = 3, in eight blocks, and so on.
• If batches of raw material are considered as blocks, then we must assign two of the four treatment
combinations to each block. The geometric view, as Figure 11.2.a, indicates that treatment
combinations on opposing diagonals are assigned to different blocks. In Fig. 11.2.b, block 1
contains the treatment combinations (1) and 𝑎𝑏 and block 2 has 𝑎 and 𝑏.
• Of course, the order in which the treatment combinations are run within a block is randomly
determined. Suppose we estimate the main effects of A and B just as if no blocking had occurred. With T_{A2} = y_a + y_{ab} = a + ab (the total of responses over factor B for level 2 of A), and T_{A1} = y_{(1)} + y_b = (1) + b, from Section 11.3.1, the main effect of A is
τ_A = ȳ_{A2} − ȳ_{A1} = (T_{A2} − T_{A1})/(2n) = [ab + a − b − (1)]/(2n),
by Eqn. (11.2), and Equation (11.3) gives the main effect of B as
τ_B = ȳ_{B2} − ȳ_{B1} = [ab + b − a − (1)]/(2n).
For a single replicate, the main effects and the AB interaction effect respectively are
τ_A = [ab + a − b − (1)]/2,  τ_B = [ab + b − a − (1)]/2,  τ^{AB} = [ab + (1) − a − b]/2.
In brief, in blocking a 2^m design we conclude on confounding between the block effect and an interaction:
Because the two treatment combinations with the plus sign [the ab and (1)] are in block 1 and the two with the minus sign [the a and b] are in block 2, the block effect and the AB interaction are identical.
That is, AB is confounded (or aliased) with blocks.
The reason for this is apparent from the table of plus and minus signs for the 2² design, produced as in Table 11.6. From this table, we see that all treatment combinations that have a plus sign on AB are assigned to block 1, whereas all treatment combinations that have a minus sign on AB are assigned to block 2:

Table 11.6: Table of Plus and Minus Signs for the 2² Design

AB = +1 ⟹ B_1 = B_+ = {(1), ab},
AB = −1 ⟹ B_2 = B_− = {a, b}. (11.53)
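The sign pattern behind Table 11.6 can be generated mechanically; a small sketch (our illustration, using the ±1 coding for factor levels):

```python
from itertools import product

# 2^2 design in +/-1 coding; label each run by the lowercase-letter convention
runs = []
for a_lv, b_lv in product([-1, 1], repeat=2):
    label = ("a" if a_lv == 1 else "") + ("b" if b_lv == 1 else "") or "(1)"
    runs.append((label, a_lv, b_lv, a_lv * b_lv))   # last entry: sign on AB

block1 = [r[0] for r in runs if r[3] == +1]   # plus sign on AB
block2 = [r[0] for r in runs if r[3] == -1]   # minus sign on AB
print(block1)   # ['(1)', 'ab']
print(block2)   # ['b', 'a']
```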
QUIZ 2. As a second example, fix m = 3 and consider a 2³ design with 8 treatment (level) combinations. While the experimenter has only 8 experimental units, there is some reason to believe that the units can be put into two blocks of four experimental units each, in such a manner that the variability among units within blocks is much smaller than the variability among the eight units as a set.
AIM: to confound the three-factor interaction 𝐴𝐵𝐶 with blocks.
DIY by filling in the blanks in Table 11.7 below, which is similar to Table 11.6.
Table 11.7: Table of Plus and Minus Signs for the 23 Design
GUIDANCE for solving: Let λ_i = 0, 1 (i = 1, 2, 3) and let A^{λ1} B^{λ2} C^{λ3} represent the 8 treatment combinations. When the number of factors is not large, we represent the treatment combinations by lowercase letters a, b, c, . . . .
* The letter a means a = A¹B⁰C⁰ = A, saying that factor A is at the High level (x_1 = 1) and B, C are at the Low level (x_2 = x_3 = −1); similarly for the other combinations.
We then assign the treatment combinations that are plus on 𝐴𝐵𝐶 to block 1 and those that are
minus on ABC to block 2.
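Filling Table 11.7 amounts to computing the sign of ABC for each run; a sketch of the assignment (our illustration):

```python
from itertools import product

blocks = {+1: [], -1: []}
for a_lv, b_lv, c_lv in product([-1, 1], repeat=3):
    label = ("a" if a_lv == 1 else "") + ("b" if b_lv == 1 else "") \
            + ("c" if c_lv == 1 else "") or "(1)"
    blocks[a_lv * b_lv * c_lv].append(label)   # sign on the ABC interaction

print(blocks[+1])   # block 1 (plus on ABC): a, b, c, abc
print(blocks[-1])   # block 2 (minus on ABC): (1), ab, ac, bc
```

Block 1 thus contains {a, b, c, abc} and block 2 contains {(1), ab, ac, bc}, confounding ABC with blocks.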
List of Tables
10.1 Typical causality diagram shows cause-effect relationship between key events up to uncertainty (p. 246)
10.2 A factorial design with binary factors A, B, C (p. 257)
10.3 Linear regression models with different shapes (p. 274)
10.4 Organizations higher up on the Quality Ladder are more efficient at solving problems with increased returns on investments (p. 284)
10.5 Dr. Genichi Taguchi, a pioneer in using Experimental Designs for Industry (p. 285)
10.6 Quadratic loss and tolerance intervals (p. 288)
10.7 Schematic parameter design (p. 293)
10.8 F distribution values (p. 310)
[19] Douglas C. Montgomery, George C. Runger, Applied Statistics and Probability for Engineers, Sixth Edition, Wiley (2014)
[20] A. J. Duncan, Quality Control and Industrial Statistics, 5th edition, Irwin, Homewood, Illinois (1986)
[21] Eiichi Bannai, Etsuko Bannai, Hajime Tanaka and Yan Zhu, Design Theory from the Viewpoint of Algebraic Combinatorics, Graphs and Combinatorics 33 (2017) 1-41.
[22] Brian Bergstein, AI still gets confused about how the world works, pp. 62-65, MIT Technology Review, The predictions issue, Vol 123 (2), 2020
[23] Doebling, S. W., Farrar, C. R., Prime, M. B., and Shevitz, D. W., "Damage Identification and Health Monitoring of Structural and Mechanical Systems From Changes in Their Vibration Characteristics: A Literature Review," Los Alamos National Laboratory Report LA-13070-MS, 1996.
[24] Donoho, David. High-dimensional data analysis: The curses and blessings of dimensionality, 2000.
[25] Farrar, Charles R., and Keith Worden. "An introduction to structural health monitoring." Philosophical Transactions. Series A, Mathematical, Physical, and Engineering Sciences 365, no. 1851 (2007): 303-15.
[26] Fodor, Imola. A Survey of Dimension Reduction Techniques. Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, 2002.
[27] Eastment, H. T., and W. J. Krzanowski. "Cross-Validatory Choice of the Number of Components from a Principal Component Analysis." Technometrics 24, no. 1 (1982): 73-77.
[28] Garcia, Gabriel V., and Roberto A. Osegueda. "Combining damage detection methods to improve probability of detection." In Smart Structures and Materials 2000: Smart Systems for Bridges, Structures, and Highways, Shih-Chi Liu, 135-142. SPIE, 2000.
[29] Google Earth, Digital Globe, 2014-2019
[30] Halfpenny, Angela. "A Frequency Domain Approach for Fatigue Life Estimation from Finite Element Analysis." Key Engineering Materials 167-168 (1999): 401-410.
[31] Härdle, Wolfgang, and Léopold Simar. Applied Multivariate Statistical Analysis. 2nd ed. Springer, 2007.
[32] Haywood, Jonathan, Wieslaw J. Staszewski, and Keith Worden. "Impact Location in Composite Structures Using Smart Sensor Technology and Neural Networks." In The 3rd International Workshop on Structural Health Monitoring, 1466-1475. Stanford, California, 2001.
[33] Hotelling, Harold. "Relations Between Two Sets of Variates." Biometrika 28, no. 3-4 (1936): 321-377.
[34] Hedayat, A.S., Seiden, E., Stufken, J., On the maximal number of factors and the enumeration of 3-symbol orthogonal arrays of strength 3. Journal of Statistical Planning and Inference, Vol 58 (1997) 43-63.
[35] Hedayat, A. S. et al., Orthogonal Arrays, Springer-Verlag, Germany, 1999.
[36] Hoai V. Tran, SPE lectures @ HCMUT, VNUHCM, Vietnam (2022)
[37] Jérémie Gallien, Systems Optimization and Analysis (15.066J), OCW MIT (Accessed Spring 2023)
[38] Judea Pearl, Causality: Models, Reasoning, and Inference, 2nd Edition, Cambridge University Press (2009)
[39] Monique Laurent, Strengthened Semidefinite Bounds for Codes, Mathematical Programming 109(2-3) (2007) 239-261.
[40] Mahmut Parlar, Interactive Operations Research with Maple, Methods and Models, Birkhäuser (2000)
[41] Brouwer, A. E., Cohen, A. M. and Nguyen, M. V. M. (2006), Orthogonal arrays of strength 3 and small run sizes, Journal of Statistical Planning and Inference, 136, 3268-3280.
[42] Eric D. Schoen, Pieter T. Eendebak, and Man Nguyen, Complete enumeration of pure-level and mixed-level orthogonal arrays, Journal of Combinatorial Designs 18(2) (2010) 123-140.
[43] Hien Phan, Ben Soh and Man VM. Nguyen, A Parallelism Extended Approach for the Enumeration of Orthogonal Arrays, ICA3PP 2011, Part I, Lecture Notes in Computer Science, Vol. 7016, Y. Xiang et al., eds., Springer-Verlag Berlin Heidelberg, pp. 482-494, 2011.
[44] Hien Phan, Ben Soh and Man VM. Nguyen, A Step-by-Step Extending Parallelism Approach for Enumeration of Combinatorial Objects, ICA3PP 2010, Part I, Lecture Notes in Computer Science, Vol. 6081, C.-H. Hsu et al., eds., Springer-Verlag Berlin Heidelberg, pp. 463-475, 2010.
[45] Man Van Minh Nguyen. PROBABILITY and STATISTICS: Inference, Causal Analysis and Stochastic Analysis. ISBN: 978-620-0-08656-3, LAP LAMBERT Academic Publishing (2019)
[46] Man Van Minh Nguyen. DATA ANALYTICS - STATISTICAL FOUNDATION: Inference, Linear Regression and Stochastic Processes. ISBN: 978-620-2-79791-7, LAP LAMBERT Academic Publishing (2020)
[47] Mien TN. Nguyen and Man VM. Nguyen. Application of Thin-Plate Spline and Distributed Lag Non-Linear Model to Describe the Interactive Effect of Two Predictors on Count Outcomes, DOI 978-1-6654-5422-3/22, © 2022 IEEE, special issue of the 9th NAFOSTED Conference on Information and Computer Science (NICS), 2022, Vietnam
[48] Mien TN. Nguyen, Man VM. Nguyen and Ngoan T. Le. Using the Discrete Lindley Distribution to Deal with Over-dispersion in Count Data. Austrian Journal of Statistics, to appear in July 2023
[49] Uyen Huynh, Nabendu Pal, and Man Nguyen. Regression model under skew-normal error with applications in predicting groundwater arsenic level in the Mekong Delta Region. Environmental and Ecological Statistics Vol. 28, pp. 323-353, DOI doi.org/10.1007/s10651-021-00488-2, Springer Nature 2021
[50] Man Van Minh Nguyen. Quality Engineering with Balanced Factorial Experimental Designs, Southeast Asian Bulletin of Mathematics, Vol 44 (6), pp. 819-844 (2020)
[51] Uyen Huynh, Nabendu Pal, Buu-Chau Truong and Man Nguyen. A Statistical Profile of Arsenic Prevalence in the Mekong Delta Region, Thailand Statistician Journal 2020
[52] Man VM. Nguyen and Nhut C. Nguyen. Analyzing Incomplete Spatial Data For Air Pollution Prediction, Southeast-Asian J. of Sciences, Vol. 6, No 2, pp. 111-133 (2018)
[53] Nguyen V. Minh Man. A Survey on Computational Algebraic Statistics and Its Applications, East-West Journal of Mathematics, Vol. 19, No 2, pp. 1-44 (2017)
[54] Nguyen V. Minh Man. Permutation Groups and Integer Linear Algebra for Enumeration of Orthogonal Arrays, East-West Journal of Mathematics, Vol. 15, No 2 (2013)
[55] Man Nguyen, Tran Vinh Tan and Phan Phuc Doan, Statistical Clustering and Time Series Analysis for Bridge Monitoring Data, Recent Progress in Data Engineering and Internet Technology, Lecture Notes in Electrical Engineering 156 (2013) pp. 61-72, Springer-Verlag
[56] Man Nguyen and Le Ba Trong Khang. Maximum Likelihood For Some Stock Price Models, Journal of Science and Technology, Vol. 51, no. 4B (2013) pp. 70-81, VAST, Vietnam
[58] Nguyen Van Minh Man and Scott H. Murray. Mixed Orthogonal Arrays: Constructions and Applications, talk at the International Conference on Applied Probability and Statistics, December 28-31, 2011, The Chinese Univ. of Hong Kong, Hong Kong
[59] Man Nguyen and Tran Vinh Tan. Selecting Meaningful Predictor Variables: A Case Study with Bridge Monitoring Data, Proceedings of the First Regional Conference on Applied and Engineering Mathematics (RCAEM I) (2010), University of Perlis, Malaysia.
[60] Man Nguyen and Phan Phuc Doan. A Combined Approach to Damage Identification for Bridge, Proceedings of the 5th Asian Mathematical Conference, pp. 629-636 (2009), Universiti Sains Malaysia in collaboration with UNESCO, Malaysia.
[61] Nguyen, Man V. M. Some New Constructions of strength 3 Orthogonal Arrays, the Memphis 2005 Design Conference Special Issue of the Journal of Statistical Planning and Inference, Vol 138, Issue 1 (Jan 2008) pp. 220-233.
[62] Nguyen Van Minh Man, Computer-Algebraic Methods for the Construction of Designs of Experiments, Ph.D. thesis, Eindhoven University of Technology (TU/e), Netherlands (2005)
[64] Giesbrecht, Marcia L. Gumpertz, Planning, Construction, and Statistical Analysis of Comparative Experiments, Wiley (2004)
[65] R. Mead, S.G. Gilmour, and A. Mead, Statistical Principles for the Design of Experiments, Cambridge University Press (2012)
[66] M. F. Fecko et al., Combinatorial designs in Multiple faults localization for Battlefield networks, IEEE Military Communications Conf., Vienna, 2001.
[67] Glonek, G.F.V. and Solomon, P.J., Factorial and time course designs for cDNA microarray experiments, Biostatistics 5, 89-111, 2004.
[68] Hedayat, A. S., Sloane, N. J. A. and Stufken, J., Orthogonal Arrays, Springer, 1999.
[69] Joel Cutcher-Gershenfeld, ESD.60 Lean/Six Sigma Systems, LFM, MIT
[70] John J. Borkowski's Home Page, www.math.montana.edu/~jobo/courses.html/
[71] Joseph A. de Feo, Juran's Quality Management And Analysis, McGraw-Hill, 2015.
[72] Jay L. Devore and Kenneth N. Berk, Modern Mathematical Statistics with Applications, 2nd Edition, Springer (2012)
[73] Google Earth, Digital Globe, 2014-2019
[74] Robert V. Hogg, Joseph W. McKean, Allen T. Craig, Introduction to Mathematical Statistics, Seventh Edition, Pearson, 2013.
[75] Bulutoglu, D.A. and Margot, F., Classification of orthogonal arrays by integer programming, Journal of Statistical Planning and Inference 138 (2008) 654-666.
[76] Michael Baron, Probability and Statistics for Computer Scientists, 2nd Edition (2014), CRC Press, Taylor & Francis Group
[77] R. H. Myers, Douglas C. Montgomery and Christine M. Anderson-Cook, Response Surface Methodology: Process and Product Optimization Using Designed Experiments, Wiley, 2009.
[78] Nathabandu T. Kottegoda, Renzo Rosso. Applied Statistics for Civil and Environmental Engineers, 2nd edition (2008), Blackwell Publishing Ltd and The McGraw-Hill Inc
[79] Paul Mac Berthouex, L. C. Brown. Statistics for Environmental Engineers, 2nd edition (2002), CRC Press
[80] Peter Goos, The optimal design of blocked and split-plot experiments, Springer (2002)
[81] Peter Goos and Bradley Jones, Optimal Design of Experiments: A Case Study Approach, John Wiley (2011)
[82] P. K. Bhattacharya and Prabir Burman. Linear Model. In Theory and Methods of Statistics, pages 309-382. Academic Press, 2016.
[83] Nathaniel E. Helwig. Multivariate Linear Regression. 2017
[84] Heather Turner. Introduction to Generalized Linear Models. University of Warwick, UK. 2008.
[85] Chitavorn Jirajan, Triage in Emergency Department.
[86] Marie-Pierre De Bellefon, Jean-Michel Floch. Handbook of Spatial Analysis. Chapter 9, 231-254. 2018.
[87] Nelder, J. and R. Wedderburn (1972). Generalized linear models. Journal of the Royal Statistical Society, Series A 135, 370-384.
[88] McCullagh, Peter and Nelder, John Ashworth, Generalized Linear Models, 2nd ed., Chapman and Hall, 1989.
[89] David Ardia, Financial Risk Management with Bayesian Estimation of GARCH Models, Springer (2008)
[90] Peter K. Dunn and Gordon K. Smyth, Generalized Linear Models With Examples in R (2018), Springer Nature.
[91] Philippe Jorion, Value at Risk: The New Benchmark for Managing Financial Risk, 3rd Edition, McGraw Hill (2007)
[92] Ron S. Kenett, Shelemyahu Zacks. Modern Industrial Statistics with applications in R, MINITAB, 2nd edition (2014), Wiley
[93] Sheldon M. Ross. Introduction to Probability Models, 10th edition (2010), Elsevier Inc.
[94] Sheldon M. Ross. Introduction to Simulation, Third edition (2002), Academic Press
[95] Simon Hubbert, Essential Mathematics for Market Risk Management, Wiley (2012)
[96] Soren Asmussen and Peter W. Glynn, Stochastic Simulation: Algorithms and Analysis, Springer (2007)
[97] A. Stewart Fotheringham, Chris Brunsdon, Martin Charlton. Geographically Weighted Regression: the analysis of spatially varying relationships. Wiley, England, 2002.
[98] Scheffé, H. (1959) The Analysis of Variance, John Wiley & Sons, Inc., New York.
[99] Online: news.samsung.com/global/samsung-announces-new-and-enhanced-quality-assurance-measures-to-improve-product-safety
[100] Online: samsungengineering.com/sustainability/quality/common/suView
[101] Sudhir Gupta, Balanced Factorial Designs for cDNA Microarray Experiments, Communications in Statistics: Theory and Methods, Volume 35, Number 8, pp. 1469-1476 (2006)
[102] Sung H. Park, Six-Sigma for Quality and Productivity Promotion, Asian Productivity Organization, 1-2-10 Hirakawacho, Chiyoda-ku, Tokyo, Japan, 2003.
[103] Sloane, N.J.A., neilsloane.com/hadamard/index.html/
[104] John Stufken and Boxin Tang, Complete Enumeration of Two-Level Orthogonal Arrays of Strength D With D + 2 Constraints, The Annals of Statistics 35(2), pp. 793-814 (2007)
[105] Online: toyota-global.com/company/history-of-toyota/75years/data/company-information/management-and-finances/management/tqm/change.html
[106] Vo Ngoc Thien An, Design of Experiment for Statistical Quality Control, Master thesis, LHU, Vietnam (2011)
[107] Genichi Taguchi, Subir Chowdhury and Yuin Wu (2005), Taguchi's Quality Engineering Handbook, John Wiley & Sons
[108] Trevor Hastie, Robert Tibshirani and Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Ed., Springer (2017)
[109] Wang, J.C. and Wu, C. F. J. (1991), An approach to the construction of asymmetrical orthogonal arrays, Journal of the American Statistical Association, 86, 450-456.
[110] Larry Wasserman, All of Statistics: A Concise Course in Statistical Inference, Springer (2003)
[111] William J. Stevenson, Operations Management, 12th ed., McGraw-Hill
[112] C.F. Jeff Wu, Michael Hamada, Experiments: Planning, Analysis and Parameter Design Optimization, Wiley, 2000.
[113] Inada, T., Shimamura, Y., Todoroki, A., Kobayashi, H., and Nakamura, H., Damage Identification Method for Smart Composite Cantilever Beams with Piezoelectric Materials, Structural Health Monitoring 2000, Stanford University, Palo Alto, California, 1999, pp. 986-994.
[114] Jolliffe, I. T. Principal Component Analysis. 2nd ed. Springer, 2002.
[115] Ron S. Kenett, Shelemyahu Zacks. Modern Industrial Statistics with applications in R, MINITAB, 2nd edition (2014), Wiley
[116] Lapin, L.L., Probability and Statistics for Modern Engineering, PWS-Kent Publishing, 2nd Edition, Boston, Massachusetts, 1990.
[117] Ljung, L. System Identification: Theory for the User, Prentice Hall, Englewood Cliffs, NJ, 1987
[118] Masri, S.F., Smyth, A.W., Chassiakos, A.G., Caughey, T.K., and Hunter, N.F., Application of Neural Networks for Detection of Changes in Nonlinear Systems, Journal of Engineering Mechanics, July 2000, pp. 666-676.
[119] Papadimitriou, C. "Optimal sensor placement methodology for parametric identification of structural systems." Journal of Sound and Vibration 278, no. 4-5 (2004): 923-947.
[120] Rytter, A., Vibration based inspection of civil engineering structures. Ph.D. Dissertation, Department of Building Technology and Structural Engineering, Aalborg University, Denmark, 1993.
[121] Rytter, A., and Kirkegaard, P., Vibration Based Inspection Using Neural Networks, Structural Damage Assessment Using Advanced Signal Processing Procedures, Proceedings of DAMAS '97, University of Sheffield, UK, 1997, pp. 97-108.
[122] Google Earth, Digital Globe, 2014-2019
[123] Silverman, B.W., Density Estimation for Statistics and Data Analysis, Chapman and Hall, New York, New York, 1986.
[124] Sithole, M.M., and S. Ganeshanandam. "Variable selection in principal component analysis to preserve the underlying multivariate data structure." In ASC XII - 12th Australian Stats Conference. Monash University, Melbourne, Australia, 1994.
[125] Sohn, Hoon. "Effects of environmental and operational variability on structural health monitoring." Philosophical Transactions. Series A, Mathematical, Physical, and Engineering Sciences 365, no. 1851 (2007): 539-60.
[126] Sohn, Hoon, and Charles R. Farrar. Damage diagnosis using time series analysis of vibration signals. Smart Materials and Structures. Vol. 10, 2001.
[127] Sohn, Hoon, David W. Allen, Keith Worden and Charles R. Farrar, Statistical damage classification using sequential probability ratio test, Structural Health Monitoring, 2003, pp. 57-74.
[128] Sohn, Hoon, Keith Worden, Charles R. Farrar, Statistical Damage Classification under Changing Environmental and Operational Conditions, Journal of Intelligent Materials Systems and Structures, 2007
[129] Sohn, Hoon, Charles R. Farrar, Francois M. Hemez, Devin D. Shunk, Daniel W. Stinemates, Brett R. Nadler, and Jerry J. Czarnecki. A Review of Structural Health Monitoring Literature: 1996-2001. Struc-
[134] Wald, A. Sequential Analysis, John Wiley and Sons, New York, 1947
[135] Worden, K., and Lane, A.J., Damage Identification Using Support Vector Machines, Smart Materials and Structures, Vol. 10, 2001, pp. 540-547.
[136] Worden, K., Pierce, S.G., Manson, G., Philp, W.R., Staszewski, W.J., and Culshaw, B., Detection of De-
tural Health Monitoring. Los Alamos National Labo- fects in Composite Plates Using Lamb Waves and
ratery Report, 2004. Novelty Detection, International Journal of Systems
Science, Vol. 31,2000, pp. 1,397-1,409
[130] Todd, M.D., and Nichols, J.M., Structural Damage
Assessment Using Chaotic Dynamic Interrogation, [137] Worden, K., and Fieller, N.R.J., Damage Detection
Proceedings of 2002 ASME International Mechani- Using Outlier Analysis, Journal of Sound and Vibra-
cal Engineering Conference and Exposition, New Or- tion, Vol. 229, No. 3,1999, pp. 647-667.
leans, Louisiana, 2002.
[131] Vanik, M. W., Beck, J. L., and Au, S. K. , Bayesian [138] Yang, Lingyun, Jennifer M. Schopf, Catalin L. Du-
Probabilistic Approach to Structural Health Monitor- mitrescu, and Ian Foster. ”Statistical Data Reduction
ing, Journal of Engineering Mechanics, Vol. 126, No. for Efficient Application Performance Monitoring.” CC-
7, 2000,pp. 738-745. GRID (2006).
[132] Vapnik, V., Statistical Learning Theory, John Wiley & [139] Q.W.Zhang, Statistical damage identification for
Sons, Inc., New York,1998 bridges using ambient vibration data, Elsevier, 2006.
p.476-485.
[133] Vo Ngoc Thien An, Design of Experiment for Sta-
tistical Quality Control, Master thesis, LHU, Vietnam [140] Larry Wasserman, All of Statistics- A Concise Course
(2011) in Statistical Inference, Springer, (2003)