
CE 6013:

Statistical Methods in Civil Engineering


Lecture# 1
Introduction, Descriptive Statistics and Visualization

Dr. Sheikh Mokhlesur Rahman


Associate Professor, Dept. of CE
Contact: smrahman@ce.buet.ac.bd
Introduction
The Engineering Process

➢ An engineer is someone who solves problems of interest to society with the efficient application of scientific principles by:
• Refining existing products
• Designing new products or processes

Figure 1 The engineering method


April 2023 Semester CE 6013_Introduction
Why Is Statistics Relevant to Engineers?

➢ Engineers must know how to efficiently plan experiments, collect data, analyze and interpret the data, and understand how the observed data relate to the model proposed to solve a problem.
➢ The field of statistics deals with the collection,
presentation, analysis, and use of data to:
• Make decisions
• Solve problems
• Design products and processes
➢ It is the science of learning information from data.

“Statistics is the science of Data”


Experiments & Processes Are Not Deterministic
➢ Statistical techniques are useful for describing and understanding variability.
➢ By variability, we mean successive observations of a
system or phenomenon do not produce exactly the
same result.
➢ Statistics gives us a framework for describing this
variability and for learning about potential sources of
variability.

An Engineering Example of Variability-1

An engineer is designing a nylon connector to be used in an automotive engine application. The engineer is considering establishing the design specification on wall thickness at 3/32 inch, but is somewhat uncertain about the effect of this decision on the connector pull-off force. If the pull-off force is too low, the connector may fail when it is installed in an engine. Eight prototype units are produced and their pull-off forces measured (in pounds):

12.6, 12.9, 13.4, 12.3, 13.6, 13.5, 12.6, 13.1.

An Engineering Example of Variability-2

➢ The dot diagram is a very useful plot for displaying a small body of data, say up to about 20 observations.
➢ This plot allows us to see easily two features of the data: the location, or the middle, and the scatter or variability.
12.6, 12.9, 13.4, 12.3, 13.6, 13.5, 12.6, 13.1.

Dot diagram of the pull-off force data when wall thickness is 3/32 inch.
An Engineering Example of Variability-3

➢ The engineer considers an alternate design; eight prototypes are built and their pull-off force is measured.
➢ The dot diagram can be used to compare two sets
of data.

Dot diagram of pull-off force for two wall thicknesses.

An Engineering Example of Variability-4

➢ Since pull-off force varies or exhibits variability, it is a random variable.
➢ A random variable, X, can be modeled by:
$X = \mu + \epsilon$
where $\mu$ is a constant and $\epsilon$ is a random disturbance.
➢ The constant remains the same with every measurement, but small changes in the environment, variance in test equipment, differences in the individual parts themselves, etc. change the value of $\epsilon$.
➢ If there were no disturbances, $\epsilon$ would always equal zero and X would always be equal to the constant $\mu$.
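The model X = μ + ε can be illustrated with a short simulation. This is only a sketch: the normal distribution for the disturbance, the function name `simulate_measurements`, and the values μ = 13.0 and σ = 0.5 are assumptions chosen for illustration, not quantities given in the lecture.

```python
import random

def simulate_measurements(mu, sigma, n, seed=0):
    """Return n simulated observations X = mu + epsilon.

    epsilon is drawn from a normal distribution with mean 0 and standard
    deviation sigma (an assumption made for this sketch).
    """
    rng = random.Random(seed)
    return [mu + rng.gauss(0.0, sigma) for _ in range(n)]

# With no disturbance every observation equals the constant mu.
no_noise = simulate_measurements(mu=13.0, sigma=0.0, n=8)
# With a disturbance, successive observations differ.
noisy = simulate_measurements(mu=13.0, sigma=0.5, n=8)

print(no_noise)
print([round(x, 2) for x in noisy])
```

Setting `sigma=0.0` reproduces the "no disturbances" case described above, where X always equals μ.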
Populations & Samples

A sample is a subset of the population.

Sample: Eight prototype connectors

Population: All connectors that will be produced and sold to customers

Two Directions of Reasoning

Statistical inference is one type of reasoning.

Basic Types of Studies

Three basic methods for collecting data:


➢ A retrospective study using historical data
• Data collected in the past for other purposes.
➢ An observational study
• Data, presently collected, by a passive
observer.
➢ A designed experiment
• Data collected in response to process input
changes.

Hypothesis Tests

➢ Hypothesis Test
• A statement about a process behavior value.
• Compared to a claim about another process value.
• Data is gathered to support or refute the claim.

➢ One-sample hypothesis test:


• Example: Ford avg mpg = 30 vs. avg mpg < 30
➢ Two-sample hypothesis test:
• Example: Ford avg mpg – Chevy avg mpg = 0 vs. > 0.

Factor Experiment Example-1

➢ Consider a petroleum distillation column:


• Output is acetone concentration
• Inputs (factors) are:
▪ Reboil temperature
▪ Condensate temperature
▪ Reflux rate
➢ Output changes as the inputs are changed by the experimenter.
➢ Each factor is set at 2 reasonable levels (-1 and +1).
➢ $2^3 = 8$ runs are made, one at every combination of factor levels, to observe acetone output.
➢ The resultant data are used to create a mathematical model of the process representing cause and effect.
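The eight runs of the two-level, three-factor design can be enumerated directly. The sketch below assumes the coded levels -1 and +1 from the slide; the identifier names for the factors are illustrative.

```python
from itertools import product

# Factor names follow the slide; the identifiers themselves are illustrative.
factors = ["reboil_temp", "condensate_temp", "reflux_rate"]
levels = (-1, +1)  # coded low and high settings

# Every combination of levels across the three factors: 2**3 = 8 runs.
runs = [dict(zip(factors, combo)) for combo in product(levels, repeat=len(factors))]

for number, run in enumerate(runs, start=1):
    print(number, run)
```

Each printed row is one setting of the distillation column at which the acetone concentration would be observed.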

Factor Experiment Example-2

The Designed Experiment (Factorial Design) for the Distillation Column

Factor Experiment Example-3

The factorial experiment for the distillation column.

Factor Experiment Example-4

➢ Now consider a new design of the distillation column (fourth factor):
• Repeat the settings for the new design, obtaining 8
more data observations of acetone concentration.
• Resultant data is used to create a mathematical
model of the process representing cause and effect
of the new process.
• The response of the old and new designs can now
be compared.
• The most desirable process and its settings are
selected as optimal.

Factor Experiment Example-5

A four-factor factorial experiment for the distillation column ($2^4 = 16$ settings)
Factor Experiment Considerations

➢ Factor experiments can get too large. For example, 8 factors will require $2^8 = 256$ experimental runs of the distillation column.
➢ Certain combinations of factor levels can be deleted
from the experiments without degrading the resultant
model.
➢ The result is called a fractional factorial experiment.
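One common way to pick a one-half fraction of a four-factor design is to keep only the runs whose coded levels multiply to +1 (the defining relation I = ABCD). The slides do not say which eight settings are selected, so this choice, and the generic factor labels A to D, are assumptions for illustration.

```python
from itertools import product

# Full 2**4 factorial in coded units for four generic factors A, B, C, D.
full_design = list(product((-1, 1), repeat=4))

# Half fraction under the assumed defining relation I = ABCD:
# keep the runs where the product of the four coded levels is +1.
half_fraction = [run for run in full_design
                 if run[0] * run[1] * run[2] * run[3] == 1]

print(len(full_design), "runs in the full design")
print(len(half_fraction), "runs in the half fraction")
for run in half_fraction:
    print(run)
```

In this fraction each factor still appears four times at -1 and four times at +1, which is why the reduced design remains balanced.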



Fractional Factorial Experiment Example

A fractional factorial experiment for the distillation column (one-half fraction): $2^4 / 2 = 8$ circled settings
Observing Process Over Time

➢ Whenever data are collected over time, it is important to plot the data over time.
➢ Phenomena that might affect the system or process
often become more visible in a time-oriented plot and
the concept of stability can be better judged.

Distribution of acetone concentration taken hourly from 30 Distillation Column Runs
The dot diagram illustrates data centrality and variation, but does not
identify any time-oriented problem.
Time-oriented Plot

A time series plot of concentration provides more information than a dot diagram: it shows a developing trend.

How Is the Change Detected?

➢ A control chart is used. Its characteristics are:


• Time-oriented horizontal axis, e.g., hours.
• Variable-of-interest vertical axis, e.g., % acetone.
➢ Long-term average is plotted as the center-line.
➢ Long-term usual variability is plotted as an upper and
lower control limit around the long-term average.
➢ A sample of size n is taken hourly and the averages
are plotted over time. If the plot points are between the
control limits, then the process is normal; if not, it
needs to be adjusted.
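The control-chart logic above can be sketched in a few lines. Several things here are assumptions rather than the lecture's method: the limits are placed at the baseline mean plus or minus three sample standard deviations, individual readings are used instead of averages of samples of size n, and the concentration values are hypothetical.

```python
from statistics import mean, stdev

def control_limits(baseline, k=3.0):
    """Return (lower limit, center line, upper limit) from baseline data.

    Assumes limits at the mean +/- k sample standard deviations.
    """
    center = mean(baseline)
    spread = stdev(baseline)
    return center - k * spread, center, center + k * spread

def out_of_control(points, lcl, ucl):
    """Return the 1-based positions of points outside the control limits."""
    return [i for i, x in enumerate(points, start=1) if x < lcl or x > ucl]

# Hypothetical hourly % acetone readings from a stable period.
baseline = [90.1, 89.8, 90.3, 89.9, 90.0, 90.2, 89.7, 90.1]
lcl, center, ucl = control_limits(baseline)

# Hypothetical later readings; the third one has dropped noticeably.
new_points = [90.0, 89.9, 88.0, 90.1]
print(round(lcl, 2), round(center, 2), round(ucl, 2))
print(out_of_control(new_points, lcl, ucl))
```

Points flagged by `out_of_control` correspond to the hours at which the process would be investigated and adjusted.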



How Is the Change Detected Graphically?

A control chart for the chemical process concentration data.
Control limits are set based on the first 20 hours of data. The process steps outside the control limits at hours 24 and 29. Shut down and adjust the process.
Use of Control Chart

1. Enumerative studies: Control chart of past production lots. Used for lot-by-lot acceptance sampling.
2. Analytic studies: Real-time control of a production process.

Enumerative versus analytic study.


Mechanistic vs Empirical Models

➢ A mechanistic model is built from our underlying knowledge of the basic physical mechanism that relates several variables.
Example: Ohm’s law, current = voltage/resistance,
$I = E/R$ or $I = E/R + \epsilon$
➢ The form of the function is known.

➢ An empirical model is built from our engineering and scientific knowledge of the phenomenon, but is not directly developed from our theoretical or first-principles understanding of the underlying mechanism.
➢ The form of the function is not known a priori.

An Example of an Empirical Model

➢ Suppose the average molecular weight (Mn) of a polymer is related to the viscosity of the material (V), and it also depends on the amount of catalyst (C) and the temperature (T) in the polymerization reactor when the material is manufactured.
➢ The relationship between Mn and these variables is
Mn = f(V,C,T)
where the form of the function f is unknown.
➢ We estimate the model from experimental data as the
following form where the 𝛽’s are unknown parameters.

Empirical Model
Another Example of an Empirical Model

➢ In a semiconductor manufacturing plant, the finished semiconductor is wire-bonded to a frame. In an observational study, the variables recorded were:
• Pull strength to break the bond (y)
• Wire length (x1)
• Die height (x2)
➢ One form of the empirical model can be:

Table 1-2 Wire Bond Pull Strength Data

Developed Empirical Model

➢ In general, this type of empirical model is called a regression model.
➢ The least squares method is used to estimate the model.
➢ The estimated regression relationship is given by:

Visualizing the Data

Three-dimensional plot of the pull strength (y), wire length (x1) and die height (x2) data.

Visualizing the Empirical Model

Plot of the predicted values (a plane) of pull strength from the empirical regression model.

Models Can Also Reflect Uncertainty

➢ Probability models help quantify the risks involved in statistical inference, that is, risks involved in decisions made every day.
➢ Probability provides the framework for the study and
application of statistics.



Descriptive Statistics and Visualization
Numerical Summaries of Data

➢ Data are the numeric observations of a phenomenon of interest.
➢ The totality of all observations is a population. A
portion used for analysis is a random sample.
➢ We gain an understanding of this collection, possibly
massive, by describing it numerically and graphically,
usually with the sample data.
➢ We describe the collection in terms of shape, outliers,
center, and spread (SOCS).
➢ The center is measured by the mean.
➢ The spread is measured by the variance.

Populations & Samples

A population is described, in part, by its parameters, i.e., mean ($\mu$) and standard deviation ($\sigma$). A random sample of size n is drawn from a population and is described, in part, by its statistics, i.e., mean ($\bar{x}$) and standard deviation (s). The statistics are used to estimate the parameters.
Mean

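➢ If the $n$ observations in a random sample are denoted by $x_1, x_2, \ldots, x_n$, the sample mean is
$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$
➢ For the $N$ observations in a population denoted by $x_1, x_2, \ldots, x_N$, the population mean is
$\mu = \sum_{i=1}^{N} x_i \cdot f(x_i) = \frac{\sum_{i=1}^{N} x_i}{N}$
where each observation carries equal weight, $f(x_i) = 1/N$.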

Exercise 6-1: Sample Mean

Consider 8 observations ($x_i$) of pull-off force from engine connectors as shown in the table.

$\bar{x} = \frac{\sum_{i=1}^{8} x_i}{8} = \frac{12.6 + 12.9 + \cdots + 13.1}{8} = \frac{104}{8} = 13.0$ pounds

i       x_i
1       12.6
2       12.9
3       13.4
4       12.3
5       13.6
6       13.5
7       12.6
8       13.1
Mean    13.00

The sample mean is the balance point.
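The balance-point property can be checked numerically with the pull-off force data from the exercise. This is a minimal sketch; the helper name `sample_mean` is illustrative.

```python
# Pull-off force data (pounds) from the exercise.
pull_off_force = [12.6, 12.9, 13.4, 12.3, 13.6, 13.5, 12.6, 13.1]

def sample_mean(xs):
    """Sum of the observations divided by the sample size."""
    return sum(xs) / len(xs)

x_bar = sample_mean(pull_off_force)

# Deviations from the mean sum to zero (up to floating-point rounding),
# which is what "balance point" means.
deviations = [x - x_bar for x in pull_off_force]

print(round(x_bar, 2))
print(round(sum(deviations), 10))
```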

Variance

➢ If the $n$ observations in a random sample are denoted by $x_1, x_2, \ldots, x_n$, the sample variance is
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
➢ For the $N$ observations in a population denoted by $x_1, x_2, \ldots, x_N$, the population variance is
$\sigma^2 = \sum_{i=1}^{N} (x_i - \mu)^2 \cdot f(x_i) = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$

Rationale for the Variance

The $x_i - \bar{x}$ values above are the deviations from the mean. Since the mean is the balance point, the sum of the left deviations (negative) equals, in magnitude, the sum of the right deviations (positive), so the deviations always add up to zero. If the deviations are squared, they are all non-negative and become a measure of the data spread. The variance is the average squared deviation.

Standard Deviation

➢ The standard deviation is the square root of the variance.
➢ σ is the population standard deviation symbol.
➢ s is the sample standard deviation symbol.
➢ The units of the standard deviation are the same as:
• The data.
• The mean.

Example 6-2: Sample Variance

The table displays the quantities needed to calculate the variance for the pull-off force data.

$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

i       x_i       x_i - xbar    (x_i - xbar)^2
1       12.6      -0.40         0.1600
2       12.9      -0.10         0.0100
3       13.4       0.40         0.1600
4       12.3      -0.70         0.4900
5       13.6       0.60         0.3600
6       13.5       0.50         0.2500
7       12.6      -0.40         0.1600
8       13.1       0.10         0.0100
sums =  104.00     0.00         1.6000
        divide by 8             divide by 7
        mean = 13.00            variance = 0.2286
                                standard deviation = 0.48

Units:
• x_i is in pounds.
• Mean is in pounds.
• Variance is in pounds^2.
• Standard deviation is in pounds.
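The tabulated calculation can be reproduced as a short sketch that follows the definition directly. The function name `sample_variance` is illustrative; the comparison against Python's standard-library `statistics.variance` is included only as a cross-check.

```python
from math import sqrt
from statistics import variance

pull_off_force = [12.6, 12.9, 13.4, 12.3, 13.6, 13.5, 12.6, 13.1]

def sample_variance(xs):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)

s2 = sample_variance(pull_off_force)   # pounds squared
s = sqrt(s2)                           # pounds

print(round(s2, 4), round(s, 2))
# The standard library uses the same n - 1 divisor.
print(round(variance(pull_off_force), 4))
```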



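Computation of s^2

The prior calculation is definitional and tedious. A shortcut is derived here and involves just 2 sums.

$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} (x_i^2 + \bar{x}^2 - 2 x_i \bar{x})}{n - 1}$

$= \frac{\sum_{i=1}^{n} x_i^2 + n\bar{x}^2 - 2\bar{x} \sum_{i=1}^{n} x_i}{n - 1}$

$= \frac{\sum_{i=1}^{n} x_i^2 + n\bar{x}^2 - 2\bar{x} \cdot n\bar{x}}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1}$

$= \frac{\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 / n}{n - 1}$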

Example 6-3: Variance by Shortcut

$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 / n}{n - 1} = \frac{1{,}353.60 - (104.0)^2 / 8}{7} = \frac{1.60}{7} = 0.2286$ pounds^2

$s = \sqrt{0.2286} = 0.48$ pounds

i       x_i       x_i^2
1       12.6      158.76
2       12.9      166.41
3       13.4      179.56
4       12.3      151.29
5       13.6      184.96
6       13.5      182.25
7       12.6      158.76
8       13.1      171.61
sums =  104.0     1,353.60
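The shortcut needs only the two sums, which makes it easy to express in code. The sketch below uses an illustrative function name, `variance_shortcut`.

```python
pull_off_force = [12.6, 12.9, 13.4, 12.3, 13.6, 13.5, 12.6, 13.1]

def variance_shortcut(xs):
    """Sample variance from the sum of x and the sum of x squared."""
    n = len(xs)
    sum_x = sum(xs)
    sum_x2 = sum(x * x for x in xs)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)

s2 = variance_shortcut(pull_off_force)
print(round(sum(pull_off_force), 1))                   # sum of x_i
print(round(sum(x * x for x in pull_off_force), 2))    # sum of x_i squared
print(round(s2, 4))
```

One caution when using this form in software: it subtracts two large, nearly equal numbers, so with floating-point arithmetic it can lose precision when the mean is large relative to the spread. The definitional form is usually the safer choice on a computer.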



What is this “n–1”?

➢ The population variance is calculated with N, the population size. Why isn’t the sample variance calculated with n, the sample size?
➢ The true variance is based on data deviations from the true mean, μ.
➢ The sample calculation is based on the data deviations from $\bar{x}$, not μ. $\bar{x}$ is an estimator of μ; close but not the same. So, the n-1 divisor is used to compensate for the error in the mean estimation.

Degrees of Freedom

➢ The sample variance is calculated with the quantity n-1.
➢ This quantity is called the “degrees of freedom”.
➢ Origin of the term:
• There are n deviations from $\bar{x}$ in the sample.
• The sum of the deviations is zero. (Balance point)
• n-1 of the observations can be freely determined, but the nth observation is fixed to maintain the zero sum.
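The zero-sum constraint can be shown with the deviations from the pull-off force example: once seven deviations are known, the eighth is forced. The helper name `last_deviation` is illustrative.

```python
def last_deviation(free_deviations):
    """Given n - 1 deviations from the mean, return the one remaining
    deviation that makes all n deviations sum to zero."""
    return -sum(free_deviations)

# First seven deviations (x_i - xbar) from the pull-off force table.
free = [-0.4, -0.1, 0.4, -0.7, 0.6, 0.5, -0.4]

eighth = last_deviation(free)
print(round(eighth, 2))  # matches the tabulated deviation for observation 8
```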

Sample Range

If the n observations in a sample are denoted by $x_1, x_2, \ldots, x_n$, the sample range is:

$r = \max(x_i) - \min(x_i)$

From Example 6-3:
r = 13.6 – 12.3 = 1.30

Note that: population range ≥ sample range

Dot Diagrams

Dots representing data are plotted on the number line.

Dot diagram of the pull-off force data when wall thickness is 3/32 inch.

Stem-and-Leaf Diagrams

➢ Dot diagrams (dotplots) are useful for small data sets. Stem & leaf diagrams are better for large data sets.
➢ Steps to construct a stem-and-leaf diagram:
1) Divide each number (xi) into two parts: a stem,
consisting of the leading digits, and a leaf, consisting
of the remaining digit.
2) List the stem values in a vertical column (no skips).
3) Record the leaf for each observation beside its stem.
4) Write the units for the stems and leaves on the display.
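The construction steps above can be sketched as a small function. It assumes integer data with a one-digit leaf (stem = value // 10, leaf = value % 10); the function name `stem_and_leaf` is illustrative, and the eight values are a small subset of the alloy strength data used only to keep the output short.

```python
def stem_and_leaf(values):
    """Map each stem (leading digits) to its leaves (last digit).

    Assumes non-negative integers. Stems with no data are kept as empty
    rows so that the stem column has no skips. Leaves are left unordered.
    """
    stems = {}
    for v in values:
        stem, leaf = divmod(v, 10)
        stems.setdefault(stem, []).append(leaf)
    lowest, highest = min(stems), max(stems)
    return {s: stems.get(s, []) for s in range(lowest, highest + 1)}

sample = [105, 221, 183, 186, 121, 181, 180, 143]
display = stem_and_leaf(sample)

print("Stem: tens and hundreds (psi); Leaf: ones (psi)")
for stem, leaves in display.items():
    print(f"{stem:>2} | {''.join(str(leaf) for leaf in leaves)}")
```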

Example 6-4: Alloy Strength

Compressive Strength (psi) of 80 Aluminum-Lithium Specimens
105 221 183 186 121 181 180 143
97 154 153 174 120 168 167 141
245 228 174 199 181 158 176 110
163 131 154 115 160 208 158 133
207 180 190 193 194 133 156 123
134 178 76 167 184 135 229 146
218 157 101 171 165 172 158 169
199 151 142 163 145 171 148 158

Stem-and-leaf diagram for Alloy strength data. Center is about 155 and most data is between 110 and 200. Leaves are unordered.
Quartiles

➢ The three quartiles partition the data into four equally sized counts or segments.
• 25% of the data is less than q1.
• 50% of the data is less than q2, the median.
• 75% of the data is less than q3.

Quartiles

➢ Calculated as Index = f*(n+1) where:
• Index (I) is the Ith item (interpolated) of the sorted data list.
• f is the fraction associated with the quartile (f = 0.25 for q1, f = 0.5 for q2, f = 0.75 for q3).
• n is the sample size.
➢ For the Alloy Strength data (n = 80):

                  Value of indexed item
f       Index     Ith     (I+1)th     quartile
0.25    20.25     143     145         143.50
0.50    40.50     160     163         161.50
0.75    60.75     181     181         181.00
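The Index = f*(n+1) rule with linear interpolation can be sketched as follows. The function name `quantile_by_index` is illustrative, and the handling of indices that fall before the first or after the last observation (clamping to the extreme value) is an assumption, since the slide does not cover that case. The demonstration uses the eight pull-off force values rather than the alloy data.

```python
def quantile_by_index(data, f):
    """Quantile at fraction f using Index = f * (n + 1) on the sorted data,
    interpolating linearly between the Ith and (I+1)th items (1-based)."""
    xs = sorted(data)
    n = len(xs)
    index = f * (n + 1)
    i = int(index)            # whole part: position of the Ith item
    fraction = index - i      # fractional part used for interpolation
    if i < 1:                 # assumption: clamp below the first item
        return xs[0]
    if i >= n:                # assumption: clamp above the last item
        return xs[-1]
    lower, upper = xs[i - 1], xs[i]
    return lower + fraction * (upper - lower)

pull_off_force = [12.6, 12.9, 13.4, 12.3, 13.6, 13.5, 12.6, 13.1]
q1 = quantile_by_index(pull_off_force, 0.25)
q2 = quantile_by_index(pull_off_force, 0.50)
q3 = quantile_by_index(pull_off_force, 0.75)
print(round(q1, 3), round(q2, 3), round(q3, 3))
```

Python's standard-library `statistics.quantiles(data, n=4, method="exclusive")` uses the same (n+1)-based positioning, so it can serve as a cross-check.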

Percentiles

➢ Quartiles are a special case of percentiles.
➢ Percentiles partition the data into 100 segments.
➢ q1 is the 25th percentile, q2 is the 50th percentile or median, and q3 is the 75th percentile.
➢ The Index = f*(n+1) methodology is the same.
➢ The 37-percentile is calculated as follows:
• Index = 0.37(81) = 29.97
• 37-percentile = 153 + 0.97(154 – 153) = 153.97

Interquartile Range

➢ The interquartile range (IQR) is defined as: IQR = q3 – q1.
➢ From Alloy Strength data:
IQR = 181.00 – 143.50 = 37.50
➢ It can be used as a measure of variability.
➢ Impact of outlier data:
• The IQR is not affected by an extreme outlier.
• The range is directly affected.

Frequency Distributions

➢ A frequency distribution is a compact summary of data, expressed as a table, graph, or function.
➢ The data is gathered into bins or cells, defined by class
intervals.
➢ The number of classes, multiplied by the class interval,
should exceed the range of the data. The square root
of the sample size is a guide.
➢ The boundaries of the class intervals should be
convenient values, as should the class width.

Frequency Distribution Table

Frequency Distribution of Alloy Strength Data

Considerations:
• Range = 245 – 76 = 169
• Sqrt(80) = 8.9
• Trial class width = 169/8.9 = 18.9
Decisions:
• Number of classes = 9
• Class width = 20
• Range of classes = 20 * 9 = 180
• Starting point = 70

Class             Frequency    Relative Frequency    Cumulative Relative Frequency
70 ≤ x < 90       2            0.0250                0.0250
90 ≤ x < 110      3            0.0375                0.0625
110 ≤ x < 130     6            0.0750                0.1375
130 ≤ x < 150     14           0.1750                0.3125
150 ≤ x < 170     22           0.2750                0.5875
170 ≤ x < 190     17           0.2125                0.8000
190 ≤ x < 210     10           0.1250                0.9250
210 ≤ x < 230     4            0.0500                0.9750
230 ≤ x < 250     2            0.0250                1.0000
Total             80           1.0000
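The binning decisions above translate into a short routine. This sketch uses an illustrative function name, `frequency_table`, with classes of the form lower ≤ x < upper as in the table; the eight values passed in are only a small subset of the alloy data, so the counts do not match the full table.

```python
from math import sqrt

def frequency_table(values, start, width, n_classes):
    """Return rows of (lower, upper, frequency, relative frequency,
    cumulative relative frequency) for classes lower <= x < upper."""
    counts = [0] * n_classes
    for v in values:
        k = int((v - start) // width)
        if not 0 <= k < n_classes:
            raise ValueError(f"{v} falls outside the class range")
        counts[k] += 1
    total = len(values)
    rows, running = [], 0
    for k, count in enumerate(counts):
        running += count
        lower = start + k * width
        rows.append((lower, lower + width, count, count / total, running / total))
    return rows

# Trial class width from the slide: range divided by sqrt(sample size).
trial_width = (245 - 76) / sqrt(80)
print(round(trial_width, 1))

# Small subset of the alloy strength data, for illustration only.
subset = [105, 221, 183, 186, 121, 181, 180, 143]
table = frequency_table(subset, start=70, width=20, n_classes=9)
for lower, upper, freq, rel, cum in table:
    print(f"{lower} <= x < {upper}: {freq}  {rel:.4f}  {cum:.4f}")
```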
Histograms

➢ A histogram is a visual display of a frequency distribution, similar to a bar chart or a stem-and-leaf diagram.
➢ Steps to build one with equal bin widths:
1) Label the bin boundaries on the horizontal scale.
2) Mark & label the vertical scale with the frequencies or
relative frequencies.
3) Above each bin, draw a rectangle whose height is
equal to the frequency or relative frequency.

Histogram of the Alloy Strength Data

Histogram of compressive strength of 80 aluminum-lithium alloy specimens. Note these features: (1) horizontal scale bin boundaries & labels with units, (2) vertical scale measurements and labels, (3) histogram title at top or in legend.

Histograms with Unequal Bin Widths

➢ If the data is tightly clustered in some regions and scattered in others, it is visually helpful to use narrow class widths in the clustered region and wide class widths in the scattered areas.
➢ In this approach, the rectangle area, not the height, must be proportional to the class frequency.

$\text{Rectangle height} = \frac{\text{bin frequency}}{\text{bin width}}$
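A quick numerical illustration of the height rule: with unequal bins, dividing each frequency by its bin width keeps the rectangle areas equal to the frequencies. The bin edges and counts below are hypothetical, and `rectangle_heights` is an illustrative name.

```python
def rectangle_heights(edges, counts):
    """Height of each histogram rectangle = bin frequency / bin width."""
    return [count / (upper - lower)
            for lower, upper, count in zip(edges, edges[1:], counts)]

# Hypothetical bins: narrow where data cluster, wide where data are sparse.
edges = [0, 10, 20, 40, 80]
counts = [10, 10, 10, 10]

heights = rectangle_heights(edges, counts)
areas = [h * (upper - lower)
         for h, lower, upper in zip(heights, edges, edges[1:])]

print(heights)  # equal counts, but wider bins get shorter rectangles
print(areas)    # areas recover the original frequencies
```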

Poor Choices in Drawing Histograms-1

Histogram of compressive strength of 80 aluminum-lithium alloy specimens. Errors: too many bins (17) create jagged shape, horizontal scale not at class boundaries, horizontal axis label does not include units.

Poor Choices in Drawing Histograms-2

Histogram of compressive strength of 80 aluminum-lithium alloy specimens. Errors: horizontal scale not at class boundaries (cutpoints), horizontal axis label does not include units.
Cumulative Frequency Plot

Cumulative histogram of compressive strength of 80 aluminum-lithium alloy specimens.
Easy to see cumulative probabilities, hard to see distribution shape.

Shape of a Frequency Distribution

Histograms of symmetric and skewed distributions.
(b) A symmetric distribution has identical mean, median and mode measures.
(a & c) Skewed distributions are positive or negative, depending on the direction of the long tail. Their measures occur in alphabetical order (mean, median, mode) as the distribution is approached from the long tail.

Histograms for Categorical Data

➢ Categorical data is of two types:


• Ordinal: categories have a natural order, e.g., year in
college, military rank.
• Nominal: Categories are simply different, e.g., gender,
colors.
➢ Histogram bars are for each category, are of equal
width, and have a height equal to the category’s
frequency or relative frequency.
➢ A Pareto chart is a histogram in which the categories
are sequenced in decreasing order. This approach
emphasizes the most and least important categories.



Example 6-6: Categorical Data Histogram

Airplane production in 1985. (Source: Boeing Company)
Comment: Illustrates nominal data in spite of the numerical names; categories are shown at the bin’s midpoint; it is a Pareto chart since the categories are in decreasing order.

Box Plot or Box-and-Whisker Chart

➢ A box plot is a graphical display showing center, spread, shape, and outliers (SOCS).
➢ It displays the 5-number summary: min, q1, median, q3, and max.

Description of a box plot


Box Plot of Alloy Strength Data

Box plot of compressive strength of 80 aluminum-lithium alloy specimens. Comment: The box plot may be shown vertically or horizontally; the data reveal three outliers and no extreme outliers.
The lower outlier limit is: 143.5 – 1.5*(181.0 – 143.5) = 87.25.
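The outlier limits can be computed from the quartiles alone. The 1.5 × IQR rule is the one used on the slide; treating points beyond 3 × IQR as "extreme" outliers is a common convention assumed here, since the slide does not state that multiplier. Function names are illustrative.

```python
def outlier_limits(q1, q3, inner=1.5, outer=3.0):
    """Return ((inner low, inner high), (outer low, outer high)) limits.

    inner=1.5 follows the slide; outer=3.0 for extreme outliers is an
    assumed convention.
    """
    iqr = q3 - q1
    return ((q1 - inner * iqr, q3 + inner * iqr),
            (q1 - outer * iqr, q3 + outer * iqr))

def classify(x, q1, q3):
    """Label a value as 'inside', 'outlier' or 'extreme outlier'."""
    (in_low, in_high), (out_low, out_high) = outlier_limits(q1, q3)
    if x < out_low or x > out_high:
        return "extreme outlier"
    if x < in_low or x > in_high:
        return "outlier"
    return "inside"

inner_limits, outer_limits = outlier_limits(143.5, 181.0)
print(inner_limits, outer_limits)
for value in (76, 158, 245):   # values that appear in the alloy data
    print(value, classify(value, 143.5, 181.0))
```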

Comparative Box Plots

Comparative box plots of a quality index at three manufacturing plants.


Comment: Plant 2 has too much variability. Plants 2 & 3 need to raise
their quality index performance.
Limitations of Box Plots

https://datavizpyr.com/violinplot-vs-boxplot-when-violinplot-can-be-more-useful/

Violin Plots
Hintze and Nelson, The American Statistician, 52(2), pp. 181-184, 1998
In addition to the box, we can plot the density.

A violin plot combines the box plot and the density trace (or smoothed histogram) into a single display that reveals structure found within the data.

Further Demonstration

https://www.autodesk.com/research/publications/same-stats-different-graphs

Time Sequence Plots/ Time Series Plots

➢ A time series plot shows the data value, or statistic, on the vertical axis with time on the horizontal axis.
➢ A time series plot reveals trends, cycles or other
time-oriented behavior that could not be otherwise
seen in the data.

Company sales by year (a) & by quarter (b). The annual time interval
masks cyclical quarterly variation, but shows consistent progress.

Digidot Plot of Alloy Strength Data

A digidot plot of the alloy compressive strength data. It combines a time series with a stem-and-leaf plot. The variability in the frequency distribution, as shown by the stem-and-leaf plot, is distorted by the apparent trend in the time series data.
Digidot Plot of Chemical Concentration Data

A digidot plot of chemical concentration readings, observed hourly.
Comment: For the first 20 hours, the mean concentration is about 90. For the last 9 hours, the mean concentration has dropped to about 85. This shows that the process has changed and might need adjustment. The stem-and-leaf plot does not highlight this shift.



Scatter Plots/ Scatter Diagram

Scatter diagram of wine quality and color.
A scatter diagram is an excellent exploratory tool and can be very useful in identifying potential relationships between two variables.



Scatter Plots/ Scatter Diagram

Matrix of scatter diagrams for the wine quality data.
If the number of variables is more than two, the matrix of scatter diagrams may be useful in looking at all of the pairwise relationships between the variables in the sample. Here there is a weak potential linear relationship between quality and pH, and somewhat stronger potential relationships between quality and color density and between quality and color.
Correlation Coefficient

➢ The sample correlation coefficient $r_{xy}$ is a quantitative measure of the strength of the linear relationship between two random variables x and y.
➢ The sample correlation coefficient is defined as
$r_{xy} = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$
➢ If the two variables are perfectly linearly related with a positive slope, then $r_{xy} = 1$, and if they are perfectly linearly related with a negative slope, then $r_{xy} = -1$. If no linear relationship between the two variables exists, then $r_{xy} = 0$.
➢ The simple correlation coefficient is also sometimes called the Pearson correlation coefficient, after Karl Pearson.
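The definition can be turned into a short function and checked against the three cases described above. The function name `sample_correlation` and the small data sets are illustrative; the last data set is a perfect quadratic relationship, chosen to show that r measures only the linear part of a relationship.

```python
from math import sqrt

def sample_correlation(xs, ys):
    """Pearson sample correlation coefficient r_xy."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have the same length, at least 2")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

xs = [1, 2, 3, 4, 5]
r_pos = sample_correlation(xs, [2 * x + 1 for x in xs])    # positive slope
r_neg = sample_correlation(xs, [10 - 3 * x for x in xs])   # negative slope

# y = x**2 on a symmetric range: strongly related, but not linearly.
sym = [-2, -1, 0, 1, 2]
r_zero = sample_correlation(sym, [x ** 2 for x in sym])

print(round(r_pos, 6), round(r_neg, 6), round(r_zero, 6))
```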

Correlation Types

Potential relationship between variables.

Pairwise Correlation-Wine Quality Data

Pairwise sample correlations between the five variables of wine quality data



Thank you

Acknowledgement: Most of the slides of this lecture are based on D. C. Montgomery and G. C. Runger, “Applied Statistics and Probability for Engineers”, Seventh Edition, Wiley, 2018.
