
STAT319 LAB MANUAL

For Instructors

Revised by

Nasir Abbas
Jimoh Ajadi
Mohammed Saleh
Emmanuel Afuecheta

First Revision, November 2022


Contents

1 Introduction
  1.1 The R Software: Getting and Installing It
  1.2 Starting the Program
  1.3 Reading data from files

2 Descriptive Statistics
  2.1 Introduction
  2.2 The Population and the Sample
  2.3 Stem-and-Leaf Plot
  2.4 Frequency Tables
  2.5 Graphs of Frequency Distributions
  2.6 The Bar Chart and the Pie Chart
  2.7 Numerical Measures
  2.8 The Empirical Rule (ER)
  2.9 The Box Plot
  2.10 Approximate Mean and Variance of Grouped Data

3 Discrete Random Variables
  3.1 The Binomial Distribution
  3.2 The Geometric Distribution
  3.3 The Hypergeometric Distribution
  3.4 The Poisson Distribution

4 Continuous Random Variables
  4.1 The Exponential Distribution
  4.2 The Normal Distribution

5 Sampling Distributions
  5.1 Sampling Distributions of Sums and Means and the Central Limit Theorem
  5.2 The Normal Approximation to the Binomial Distribution
  5.3 Drawing a Random Sample from a known Distribution
  5.4 Use of t, χ² and F Tables

6 Statistical Estimation
  6.1 Point Estimation
  6.2 Confidence Interval Estimation for the Population Mean
  6.3 Computing Confidence Interval for a Population Mean Using R
  6.4 Large Sample Confidence Interval Estimation of a Population Proportion

7 Tests of Hypotheses
  7.1 Testing Hypotheses about a Population Mean
  7.2 Testing for the Population Mean Using R
  7.3 Large Sample Tests of Proportions
  7.4 Testing a Population Proportion Using R

8 Linear Regression
  8.1 Scatter Diagram
  8.2 The Correlation Coefficient
  8.3 Estimating the Line of Best Fit
  8.4 Testing the Slope of the Regression Line
  8.5 Testing the Significance of the Regression by Analysis of Variance
  8.6 Confidence Interval Estimation of Regression Parameters
  8.7 Prediction Interval (PI) for a Future Observation y0
  8.8 Checking Model Assumptions
  8.9 Multiple Linear Regression

Chapter 1

Introduction

This chapter provides an introduction to the R statistical computing software. It
covers how to read data into R, run various calculations, and obtain summary
statistics of data.
R is a free, open-source statistical environment which was originally designed by Ross
Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is now
developed and managed by the R Development Core Team. In 1997, the software’s
initial version was released. It is an implementation of the S language, which J.
M. Chambers created in the late 1980s at AT&T’s Bell Laboratories. There is a
commercial version of S called S-PLUS which is sold by TIBCO Software Inc. S-PLUS
is rather expensive but works in a very similar way to the freely available R, so why not
use R? There are a number of materials (textbooks, online notes, etc.) available which
discuss using R for statistical data analysis. An excellent introductory text is

Crawley, M. J. (2005) Statistics, An Introduction using R. (Wiley)

which can be found online in the R Library.

1.1 The R Software: Getting and Installing It


The most recent version of R is available for free download and installation on your
personal computer. The website for the R Project for Statistical Computing may be
found at
http://www.r-project.org/.
This website has a wealth of R-related materials, including FAQs, user guides, and
mailing list links.
Alternatively, visit the UK CRAN mirror at the following University of Bristol
website to download the most recent version of the software on your personal computer:

http://www.stats.bris.ac.uk/R/
Once there, click on “base”, then “download and install R”, choosing Windows from
the “Download and Install R” section if that’s the operating system you use. R
is also available for the Macintosh and Linux operating systems. The binaries of
user-contributed packages of functions are located in the subdirectory “contrib”. It
may be simpler to install only the packages you need, as and when you require them,
rather than installing them all at once.
Under “Documentation” within these sites mentioned earlier, you may wish to
download the document titled “An Introduction to R” by W. N. Venables, D. M. Smith
and the R Development Core Team (2022). R is explained in detail in this document,
along with instructions on how to use it for statistical analysis and graphics. An
Introduction to R can also be found under Help> Manuals (in PDF) in the software.
It is available for download as a pdf file, which you may store on your computer for
future use.

1.2 Starting the Program


You can start up an R session by

(i) clicking the Windows Start Menu, then All Programs, and

(ii) clicking the R program icon.

When the package has loaded, the program screen will look similar to the one shown
in Figure 1.1 below.
The sub-window to the fore is called the commands window. To use R, the necessary
command is entered into this sub-window. The R prompt, represented by the symbol
>, will be seen at the start of the command line. Following the >, type your command.
When you hit return, R will either display the results of your command on the following
line or in a new window, depending on the command, or it will display > once more
on the following line if the command’s result is assigned to an R object. A plus sign +
is used as the continuation prompt when a command is too big to fit on a single line.
Because R commands are case sensitive, it is advisable to use mainly lower-case
letters. A # symbol marks the start of a comment: the text that follows it on a line
is treated as a comment rather than as part of a command. Once you begin the
practice session, you will understand how this works more clearly.

1.3 Reading data from files


During any of your lab sessions or practice, you will be required to read data sets
into R. You can type your data directly into R’s interactive console. However, it is

Figure 1.1: The R desktop window at start time.

very likely that you will be dealing with large data objects for any kind of serious
work. Large data sets are typically read as values from external files (such as the
STAT319 module Blackboard site or your desktop) instead of being typed at the
keyboard during an R session. Assuming you have your data, say Data1.txt, on the
module Blackboard site, the easiest thing to do is to copy and paste the data from
Blackboard into separate Notepad documents or files and then save them with the same
filename as above into a folder on a memory stick or at c:/Work or My Documents on
the hard drive. You
might give this new folder a name like “MyRsession” or any name of your choice to
describe what it contains. Afterwards, you click the menu item in R to change the
working directory to the same folder in which you have put the data. That is:

File > Change dir

and finding the appropriate folder by utilizing the browse feature. Once this is done,
the read.table() function can then be used to read the data frame directly as

> Data <- read.table("Data1.txt", header=TRUE, sep="")

where the header=TRUE option is a logical value indicating whether the file contains
the names of the variables as its first line, and sep="" specifies the field separator
character (the default "" means any white space). Values on each line of the file are
separated by this character.
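If you want to experiment with read.table() before creating a file, it can also parse a small data set supplied through its text argument. A minimal sketch (the column names x and y are made up for illustration):

```r
# Read a tiny two-column data set from an inline string instead of a file.
# The first line supplies the variable names, so header = TRUE is used.
Data <- read.table(text = "x y
1 10
2 20
3 30", header = TRUE)
Data$x      # the column named x
nrow(Data)  # number of rows
```

This behaves exactly as if the three lines had been stored in a file and read from disk.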
It is important to note that R has an excellent built-in help facility. This help
facility can be accessed through the “Help” drop-down menu at the top of the R
console window. However, if you do not want to use this approach to access help

and more information regarding a specific function in R, you can use the alternative
approach. That is,

> help(function_name)

where function_name may be, for example, length, seq, rep, var, sd or mean, as in
help(length).
In the course of an R session, objects are created and stored by name. To display
the names of the objects stored within R, we can employ the command ls(). These
objects can also be removed. To do that, we employ the command rm() and include
the name(s) of the objects to be deleted between the brackets, separated by commas.
Notably, the command rm(list=ls()) removes everything; use it with caution, since
deleted objects cannot be recovered.

Saving your R session


The contents of your current workspace can be saved by using the menu option
File>Save Workspace. Once File>Save Workspace is selected, you will be asked to
choose a filename; include the extension .RData in the specified filename. Similarly,
you can store your command history by choosing File>Save History and including
the extension .Rhistory in the specified filename. Provided that your workspace and
command history are properly saved, you can resume a saved session by opening R
and choosing File>Load Workspace and/or File>Load History, in each case choosing
the relevant previously saved file.
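The same saving and loading can be done from the command line rather than the menus. A sketch, assuming you have permission to write in the working directory (the filenames are examples; savehistory() is only available in some consoles, so it is wrapped in try() here):

```r
# Command-line equivalents of the File menu options (filenames are examples):
save.image("MySession.RData")                          # save all workspace objects
try(savehistory("MySession.Rhistory"), silent = TRUE)  # console-only; may be unavailable
# ...later, in a new session:
load("MySession.RData")                                # restore the saved objects
```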

Using R
You have to practice using R in this course and during your Lab session by copying
and typing the commands which are given following the ">" prompt in the notes below
and then assessing/observing the output. At the end, there are also some exercises for
you to try. The purpose of this is to create as much interest in R as possible. Thus,
you can use these exercises to test your comprehension of basic understanding of R.

(i) Using R as a calculator: By entering an arithmetic command following the


> prompt and pressing the RETURN key, we can use R as a calculator. For
example:

> 30+20-9
[1] 41
> 20*20

[1] 400
> 2786/256
[1] 10.88281
> 4^3
[1] 64
> 2+4*20/10 # R operator precedence rules are conventional
[1] 10
> pi*10^2 # pi is the usual constant 3.14159..
[1] 314.1593

From the above results, you may observe that each result line starts with [1]. This
indicates that the answer is the first element of a vector, which in each of the cases
above has length 1.
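To see these bracketed indices in action, print a vector long enough to wrap over several output lines; the number in brackets at the start of each line is the position of that line's first element. (The exact line breaks depend on your console width, so no output is shown here.)

```r
# A 20-element vector; when printed, each output line begins with the
# bracketed index of its first element (line breaks depend on console width)
long <- seq(2, 40, by = 2)
long
```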

(ii) Storing data in objects: We can store data in objects and these data can be
easily manipulated in R. For example:

> a<-25 # assigns the value 25 to the object "a".

Bear in mind that the two-character combination <- denotes assignment.
(Alternatively, the equals sign = may be used.) So, when you simply type "a",
R prints its value. That is,

> a
[1] 25

Now, the variable names can be used to perform calculations. For example:

> b<-6
> b
[1] 6
> a*b
[1] 150
> a/b
[1] 4.166667
> c<-15*a+b
> c
[1] 381
> ls()
[1] "a" "b" "c"
> rm(a, b, c)
> ls()
character(0)

(iii) Vectors of length greater than one: The objects a, b and c in (ii) above are
examples of vectors of length 1. To type and store data in a numeric vector of
length greater than one, we can do the following:

> x<-c(1.5, 2.8, 4.10, 5.6, 7.9, 10.3, 14.3, 15.1, 16.10)
> x
[1] 1.5 2.8 4.1 5.6 7.9 10.3 14.3 15.1 16.1

Note that in this case the function c() concatenates its arguments (the list of
numbers in the brackets) into a vector with the name "x". We can now perform
numerical operations on this vector as follows:

> 2*x
[1] 3.0 5.6 8.2 11.2 15.8 20.6 28.6 30.2 32.2
> 2*x+1
[1] 4.0 6.6 9.2 12.2 16.8 21.6 29.6 31.2 33.2
> 1/x
[1] 0.66666667 0.35714286 0.24390244 0.17857143
[5] 0.12658228 0.09708738 0.06993007 0.06622517
[9] 0.06211180

Keep in mind that the commands affect every component of the vector. Addi-
tional examples of functions that can be used on x include:

> length(x) # computes the number of elements in x


[1] 9
> sqrt(x)
[1] 1.224745 1.673320 2.024846 2.366432 2.810694
[6] 3.209361 3.781534 3.885872 4.012481
> sum(x)
[1] 77.7
> mean(x)
[1] 8.633333
> var(x)
[1] 30.9575
> sd(x)
[1] 5.563946
> sqrt(var(x))
[1] 5.563946
> s<-sqrt(sum((x-mean(x))^2/(length(x)-1)))
> s
[1] 5.563946
> v<-s^2
> v
[1] 30.9575

> ls()
[1] "s" "v" "x"

(iv) Accessing a particular element or sets of elements in a vector: Now,


using a few examples, we’ll explore how to retrieve specific items or groups of
components within a vector.

> x[3] # the 3rd element of x


[1] 4.1

> x[-3] # all but the 3rd element


[1] 1.5 2.8 5.6 7.9 10.3 14.3 15.1 16.1

> x[2:5] # the 2nd to 5th elements


[1] 2.8 4.1 5.6 7.9

> x[(length(x)-2):length(x)] # the last 3 elements


[1] 14.3 15.1 16.1

> x[c(2, 4, 8)] # the 2nd, 4th and 8th elements


[1] 2.8 5.6 15.1

> x[x>12] # all elements > 12


[1] 14.3 15.1 16.1

> x[x<2 | x>15] # elements <2 or >15


[1] 1.5 15.1 16.1

> x[x>6 & x<12] # elements >6 and <12


[1] 7.9 10.3

(v) Defining logical vectors: In addition to defining numerical vectors, we can


also define logical vectors. Such a vector’s elements are produced by conditions
and are marked as either TRUE or FALSE. For example:

> x
[1] 1.5 2.8 4.1 5.6 7.9 10.3 14.3 15.1 16.1
> lv<-x>5 #Note that this is letter "l" in the vector name "lv"
> lv
[1] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE

In R, the comparison operators which can be used include <, >, <=, >=, == and
!=, meaning “less than”, “greater than”, “less than or equal to”, “greater than or
equal to”, “equal to” and “not equal to”, respectively. Also, if c1 and c2 are logical
expressions, then c1 & c2 is their intersection (“and”), c1 | c2 is their union (“or”)
and !c1 is the negation of c1. We can also use logical vectors in ordinary arithmetic;
in this case they are transformed into numeric vectors, with FALSE becoming 0 and
TRUE becoming 1.
For example:

> sum(lv)
[1] 6

> sum(x>14)
[1] 3
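Since TRUE counts as 1, the mean of a logical vector gives the proportion (rather than the count) of elements satisfying the condition, which is often useful:

```r
# TRUE counts as 1 and FALSE as 0, so the mean of a logical vector is the
# proportion of TRUEs; here 6 of the 9 elements of x exceed 5
x <- c(1.5, 2.8, 4.1, 5.6, 7.9, 10.3, 14.3, 15.1, 16.1)
mean(x > 5)
# [1] 0.6666667
```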

Exercises
1.1 (cf. Foster, P., UoM, 2015) Let the vectors y = (1, 2, 4, 12, 16)ᵀ and z = (3, 4, 8, 9, 15, 18)ᵀ.
Read the data into two vectors y and z in R. Before carrying out each of the
following commands, try to guess the result first.
(i) y - 3.
(ii) z * 3 + 1.
(iii) median(y) and median(z).
(iv) sum(z[z > 10]).
(v) mean(y[2:4]).
(vi) mean(z[-(4:6)]).
(vii) y + z. Why does this not work?
(viii) sum(y >= 4) and sum(y[y >= 4]).
(ix) sum(z > 5 & z < 12).

1.2 (cf. Foster, P., UoM, 2015) The following data are the closing prices (in pence) of
a share in a particular company over a twelve day period: 38; 42; 44; 43; 46; 48;
52; 54; 50; 47; 48; 48
(a) Enter the data into a vector s in R.
(b) Use R to find the minimum, maximum, range, mean and standard deviation
of the twelve values.
(c) Oops, the first value of 38 was a mistake and it should correctly have been
entered as the value 48. Edit the data so that it is now correct and then
recalculate the mean. Remember, we access the elements in a vector with
[ ].

(d) Based on the corrected data from part (c), use R to find the percentage of
days when the closing share price was less than 45 pence.
(e) Use the function diff on the data in s and store the results in the vector d.
What are the values stored in d? What is the length of d?

1.3 (cf. Foster, P., UoM, 2015) Categorical data. Categorical data is data that
corresponds to the category of each record in the data file. For example, a small
survey asked a small sample of 15 people whether they were part of an employer’s
pension scheme or not. The responses recorded were:

yes, yes, no, yes, no, no, yes, no, no, yes, no, yes, yes, no, no

We can enter these data into R by again using the c() command and summarize
them with the table command, i.e.,

> response<-c("yes", "yes", "no", "yes", "no", "no",


"yes", "no", "no", "yes", "no", "yes", "yes",
"no", "no")
> response
[1] "yes" "yes" "no" "yes" "no" "no" "yes" "no"
[9] "no" "yes" "no" "yes" "yes" "no" "no"

> table(response)
response
no yes
8 7

Factors: It is common practice with categorical data to assign numerical values to
each of the categories, e.g. in the pension scheme data above we could assign the
number 1 to the category “yes” and the number 0 to “no”. In R, categorical data,
whether recorded as characters or numerically, should be declared as a factor.

> responsef<-factor(response)
> levels(responsef)
[1] "no" "yes"
> responsef
[1] yes yes no yes no no yes no
[9] no yes no yes yes no no
Levels: no yes

> levels(responsef)<-c(0, 1)
> levels(responsef)
[1] "0" "1"

> responsef
[1] 1 1 0 1 0 0 1 0 0 1 0 1 1 0 0
Levels: 0 1

The levels() command allows us to identify the different levels/categories/values
the factor can take, and we can see clearly how the levels “no” and “yes” can be
recoded as 0 and 1.
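One caution, a common pitfall not shown above: applying as.numeric() directly to a factor returns the internal level codes (1, 2, ...), not the recoded labels. To recover the 0/1 values, convert to character first. A small sketch with a shortened response vector:

```r
response <- c("yes", "no", "yes")
responsef <- factor(response)        # levels are "no", "yes" (alphabetical order)
levels(responsef) <- c(0, 1)         # recode "no" -> 0, "yes" -> 1
as.numeric(responsef)                # internal level codes: 2 1 2 (not 1 0 1!)
as.numeric(as.character(responsef))  # the recoded values: 1 0 1
```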

1.4 (cf. Foster, P., UoM, 2015) In an opinion poll conducted in February 2008 voters
were asked: “If there were a general election tomorrow which party do you think
you would vote for?” A summary of the responses is as follows:

Party               Party code   Number of supporters
Conservative             1              366
Labour                   2              344
Liberal-Democrats        3              212
Other                    4               78

Table 1.1: Opinion poll responses, February 2008.

As an exercise, use the read.table command to read the opinion poll data into
R which are contained in the file opinion_poll.txt. You will need to check the file
first to see whether the first element is a variable name or not. In the file, the
preferences are coded as 1 (Conservative), 2 (Labour), 3 (Liberal-Democrats) and
4 (Others). Firstly, specify the data as factors and then recode the numeric values
1 − 4 into character values. Use the table command to summarize the data in
tabular form.

Chapter 2

Descriptive Statistics

2.1 Introduction
The description of a data set includes, among other things:

• Presentation of the data by tables and graphs.

• Examination of the overall shape of the graphed data for important features,
including symmetry or departures from it.

• Scanning the graphed data for any unusual observation that seems to stick far
out from the major mass of the data.

• Computation of numerical measures for a typical or representative value of the
center of the data.

• Measuring the amount of spread or variation present in the data.

2.2 The Population and the Sample


Population: A population is a complete collection of all observations of interest
(scores, people, measurements, and so on). The collection is complete in the sense
that it includes all subjects to be studied.

Sample: A sample is a collection of observations representing only a portion of the
population.

2.3 Stem-and-Leaf Plot


One useful way to summarize data is to split each observation in the data into two
parts, a “stem” and a “leaf”. First of all, we represent all the observations by the
same number of digits, possibly by putting 0’s at the beginning or at the end of an
observation as needed. If there are r digits in an observation, the first x (1 ≤ x ≤ r)
of them constitute the stem, and the last (r − x) digits, called the leaf, are put
against the stem. If there are many observations in a stem (in a row), the stem may
be split into two rows by defining a rule for dividing the leaves between them.
Example 2.2 (cf. Vining, 1998) In a galvanized coating process for large pipes,
standards call for an average coating weight of 200 lbs per pipe. These data are the
coating weights for a random sample of 30 pipes.

216 202 208 208 212 202 193 208 206 206
206 213 204 204 204 218 204 198 207 218
204 212 212 205 203 196 216 200 215 202

Step 1: Divide each observation in the sample into a stem and a leaf. For 3-digit
observations there would be two choices:

• stem = first digit, leaf = last two digits

• stem = first two digits, leaf = third digit.

The choice of stem and leaf that makes the stem-and-leaf plot compact is preferred.
The first choice would make only two stems with too many leaves in a stem while the
second choice would make 3 stems with a reasonable number of leaves in each stem.
So the second choice is preferred.
Step 2: List the stems in order in a column.
Step 3: Proceed through the data set, placing the leaf for each observation in the
appropriate stem or row.
Leaves are sometimes ordered and the corresponding display is called Ordered Stem-
and-leaf Display.

Stem Leaf Frequency


19 368 3
20 022234444458886667 18
21 222356688 9
Total 30

Table 2.1: Stem-and-Leaf Display for the Coating Weight Data

Stem-and-Leaf Plot Using R


x<-c(216,202,208,208,212,202,193,208,206,206,
206,213,204,204,204,218,204,198,207,218,
204,212,212,205,203,196,216,200,215,202)
stem(x)

Figure 2.1: Stem and leaf Plot.

Example 2.3: A sample of n = 25 Job CPU Times (in seconds) is selected from 1000
CPU times (See Mendenhall and Sincich, 1995, 25).

1.17 1.61 1.16 1.38 3.53 1.23 3.76 1.94 0.96
4.75 0.15 2.41 0.71 0.02 1.59 0.19 0.82 0.47
2.16 2.01 0.92 0.75 2.59 3.07 1.40

Construct a Stem and Leaf Plot of the data.


Step 1: Divide each observation in the sample into two parts, the stem and the leaf.
For 3-digit observations, there would be two choices:

• stem = first digit, leaf = last two digits

• stem = first two digits, leaf = third digit

For the CPU data, the first choice is better.


Step 2: List the stems in order in a column.
Step 3: Proceed through the data set, placing the leaf for each observation in the
appropriate stem or row.
The first entry corresponds to 0.02, the second to 0.15, and so on. It is not a bad
idea to put the decimal point in the place where it occurs in the observation, though
this is not common practice.

Stem Leaf Frequency


0 02 15 19 47 71 75 82 92 96 9
1 16 17 23 38 40 59 61 94 8
2 01 16 41 59 4
3 07 53 76 3
4 75 1
Total 25

Table 2.2: Ordered Stem-and-Leaf Display for the CPU Data

These steps result in the stem and leaf plot as shown in Figure 2.1. For example,
the second row contains 196 and 198.
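As with the coating data, R's stem() function can produce this display directly. A sketch (the stem splitting R chooses automatically may differ slightly from Table 2.2):

```r
# The 25 CPU times from Example 2.3
cpu <- c(1.17, 1.61, 1.16, 1.38, 3.53, 1.23, 3.76, 1.94, 0.96,
         4.75, 0.15, 2.41, 0.71, 0.02, 1.59, 0.19, 0.82, 0.47,
         2.16, 2.01, 0.92, 0.75, 2.59, 3.07, 1.40)
stem(cpu)
```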

2.4 Frequency Tables


When summarizing a large set of data it is often useful to classify the data into classes
or categories and to determine the number of individuals belonging to each class,
called the class frequency. A tabular arrangement of data by classes together with
the corresponding frequencies is called a frequency distribution or simply a frequency
table. Consider the following definitions:
Frequency: The number of observations in a class.
Relative Frequency: The ratio of the frequency of a class to the total number of
observations in the data set.
Cumulative Frequency: The total frequency of all values less than the upper class
limit.
Relative Cumulative Frequency: The cumulative frequency divided by the total
frequency.
Example 2.5: Consider the data in Example 2.2. The steps needed to prepare a
frequency distribution for the data set are described below:

• Step 1: Range = Largest observation − Smallest observation = 218 − 193 = 25.

• Step 2: Divide the range into classes of (preferably) equal width. A rule of
thumb for the number of classes is √n, so that

    Class Width ≈ Range / (number of classes).

Since we have a sample of size 30, the number of classes in the histogram should be
around √30 ≈ 5.48. In this case, the class width would be approximately 25/5.48 =
4.56 ≈ 5. The smallest observation is 193. The first class boundary may well start at
193 or a little below it, say at 190 (in general, just to avoid the smallest observation
falling on a class boundary). Thus the first class is given by [190, 195), the second
class by [195, 200), and so on. Complete the class boundaries for all classes.
Step 3: For each class, count the number of observations that fall in that class. This
number is called the class frequency.
Step 4: The relative frequency of a class is calculated by f /n where f is the frequency
of the class and n is the number of observations in the data set.
The Cumulative Relative Frequency of a class is the total of the relative frequencies
up to and including that class. To avoid rounding in every class, one may accumulate
the frequencies up to a class (giving the cumulative frequency F) and then divide by
n. The resulting quantity Relative
Cumulative Frequency (F/n) is just the same as Cumulative Relative Frequency and
is desirable in a frequency table. For the data in Example 2.2, we have the following
frequency distribution:

Class        Count        f    F    Relative f   Relative F

[190, 195)   /            1    1    0.033        0.033
[195, 200)   //           2    3    0.067        0.100
[200, 205)   //////////  10   13    0.333        0.433
[205, 210)   ////////     8   21    0.267        0.700
[210, 215)   ////         4   25    0.133        0.833
[215, 220]   /////        5   30    0.167        1.000

We construct the basic frequency distribution using R as

data=c(216,202,208,208,212,202,193,208,206,206,
206,213,204,204,204,218,204,198,207,218,
204,212,212,205,203,196,216,200,215,202)
table(data)

data
193 196 198 200 202 203 204 205 206 207 208 212 213 215 216 218
1 1 1 1 3 1 5 1 3 1 3 3 1 1 2 2
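Note that table(data) tallies the individual values rather than the grouped classes. The grouped frequency table above can be reproduced with cut() and cumsum(); a sketch (right = FALSE puts a boundary value such as 200 into the class above it, matching the counts shown earlier):

```r
data <- c(216,202,208,208,212,202,193,208,206,206,
          206,213,204,204,204,218,204,198,207,218,
          204,212,212,205,203,196,216,200,215,202)
# Classify each weight into one of the classes 190-195, 195-200, ..., 215-220;
# right = FALSE makes each class include its lower boundary but not its upper one
f <- table(cut(data, breaks = seq(190, 220, by = 5), right = FALSE))
cbind(f = f, F = cumsum(f),
      "Relative f" = round(f / 30, 3), "Relative F" = round(cumsum(f) / 30, 3))
```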

We can also use the epiDisplay package with the tab1() function. This displays the
frequency table and a bar chart to show relative frequencies. The arguments of the
function include:

• sort.group can either be “increasing” (sort the groups in ascending order) or
“decreasing” (sort in descending order)

• cum.percent indicates whether to include cumulative relative frequencies or not

install.packages("epiDisplay")
library("epiDisplay")

#Loading required package: foreign


#Loading required package: survival
#Loading required package: MASS
#Loading required package: nnet

tab1(data,sort.group="decreasing",cum.percent=TRUE)

Figure 2.2: Frequency distribution.

data: Frequency Percent Cum. percent


204 5 16.7 16.7
212 3 10.0 26.7
208 3 10.0 36.7
206 3 10.0 46.7
202 3 10.0 56.7
218 2 6.7 63.3
216 2 6.7 70.0
215 1 3.3 73.3
213 1 3.3 76.7
207 1 3.3 80.0
205 1 3.3 83.3
203 1 3.3 86.7
200 1 3.3 90.0
198 1 3.3 93.3
196 1 3.3 96.7
193 1 3.3 100.0
Total 30 100.0 100.0

2.5 Graphs of Frequency Distributions

Frequency Histogram
A frequency histogram is a bar diagram in which the bar over a class represents the
frequency of that class. We can use the following R code to construct a frequency
histogram for the data in Example 2.2:

hist(data)

Figure 2.3: Histogram.

We can change the color and add a title and axis labels to the plot, where

• col adds color to the plot

• xlab and ylab add labels to the horizontal and vertical axes respectively

• main adds a title to the plot

hist(data, col="blue", xlab="weights", ylab="Frequency",
     main="Histogram of weights")


Figure 2.4: Histogram.
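By default hist() chooses its own class boundaries. To force the same classes as the frequency table in Section 2.4, pass them explicitly through the breaks argument; a sketch (right = FALSE again makes the classes left-closed so the counts match the table):

```r
data <- c(216,202,208,208,212,202,193,208,206,206,
          206,213,204,204,204,218,204,198,207,218,
          204,212,212,205,203,196,216,200,215,202)
# Same class boundaries as the frequency table; right = FALSE gives
# left-closed classes so that e.g. 200 falls in the 200-205 class
h <- hist(data, breaks = seq(190, 220, by = 5), right = FALSE,
          col = "blue", xlab = "weights", main = "Histogram of weights")
h$counts  # the class frequencies: 1 2 10 8 4 5
```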

2.6 The Bar Chart and the Pie Chart


Both bar and pie charts are used to represent discrete and qualitative data.

Bar Chart
A bar chart gives the frequency (or relative frequency) corresponding to each category,
with the height or length of the bar proportional to the category frequency (or relative
frequency). To make a bar chart, the classes are marked along the horizontal axis
and a vertical bar of height equal to the class frequency is drawn over the respective
classes.
Example 2.6: Consider the following example of different brands of disks:

##
## 1 Sony Imation Verbatim Imation Verbatim Sony Verbatim Sony
## 2 Verbatim Verbatim Sony Verbatim Verbatim Verbatim Sony Verbatim
## 3 Sony Verbatim Sony Verbatim Sony Verbatim Verbatim Verbatim
## 4 Verbatim Verbatim Verbatim Sony Verbatim Verbatim Verbatim Verbatim
## 5 Verbatim Verbatim Verbatim Verbatim Verbatim Verbatim Sony Imation

## 6 Sony Verbatim Imation Verbatim Sony Sony Verbatim Verbatim
## 7 Verbatim Verbatim Verbatim Sony Verbatim Verbatim Sony Sony
## 8 Verbatim Sony Verbatim Verbatim Verbatim Verbatim Verbatim Verbatim
## 9 Sony Verbatim Sony Verbatim Verbatim Sony Verbatim Verbatim
## 10 Verbatim Verbatim Verbatim Sony Imation Verbatim Verbatim Imation
## 11 Imation Verbatim Verbatim Verbatim Verbatim Verbatim Sony Verbatim
## 12 Verbatim Verbatim Sony Verbatim Verbatim Sony Verbatim Sony
## 13 Verbatim Imation Verbatim Sony Verbatim Verbatim Verbatim Verbatim
## 14 Sony Verbatim Sony Verbatim Verbatim Sony Imation Imation
## 15 Verbatim Verbatim Verbatim Sony Verbatim Verbatim Verbatim Verbatim
## 16 Verbatim Verbatim Verbatim Verbatim Sony Verbatim Sony Sony
## 17 Sony Verbatim Verbatim Verbatim Verbatim Imation Verbatim Verbatim
## 18 Verbatim Imation Verbatim Verbatim Verbatim Verbatim Verbatim Sony

To draw a Pie Chart using R, follow the steps:

• First make sure your dataset is in data frame format:

Brand=read.csv("table 5.csv")
df=as.data.frame(table(Brand))

• Install the ggplot2 package. (This package is an R package dedicated to data
visualization. It increases one’s efficiency in producing graphics in R and at the
same time enhances the quality and beauty of these graphics. With ggplot2, you
can create practically any kind of chart.)

• Then use the following code:

#install.packages("ggplot2")
library(ggplot2)
##
## Attaching package: ’ggplot2’
## The following object is masked from ’package:epiDisplay’:
##
## alpha
ggplot(df, aes(x = "", y = Freq, fill = Brand)) +
geom_col() +
coord_polar(theta = "y")

Figure 2.5: Pie chart.

Alternatively, one may directly obtain the pie chart from the summary table without
using ggplot2 as follows:

Brand=read.csv("table 5.csv")
df=as.data.frame(table(Brand))
df
Brand Freq
1 Imation 12
2 Sony 34
3 Verbatim 97

#####
x <- c(12, 34, 97) ## define a vector containing the summary data, df
labels <- c("Imation", "Sony", "Verbatim") #gives the name of the labels
pie(x,labels) # produces the pie chart

#### Adding color and title


pie(x, labels, main = "Brand pie chart", col = rainbow(length(x)))

#### Adding slice percentage and a chart legend


piepercent<- round(100*x/sum(x), 1)
pie(x, labels = piepercent, main = "Brand pie chart",col = rainbow(length(x)))
legend("topright", c("Imation","Sony","Verbatim"), cex = 0.8,
fill = rainbow(length(x)))

The procedure for constructing a bar chart is similar to that for the pie chart.

ggplot(df, aes(x = Brand, y = Freq, fill=Brand)) +


geom_col()

Figure 2.6: Bar chart.

Alternatively, one may directly obtain the bar chart from the summary table as follows:

Brand=read.csv("table 5.csv")
df=as.data.frame(table(Brand))
df
Brand Freq
1 Imation 12
2 Sony 34
3 Verbatim 97

x <- c(12, 34, 97)


labels <- c("Imation", "Sony", "Verbatim")
barplot(x)

barplot(x,names.arg=labels,xlab="Brand",ylab="Freq",col="blue",
main="Bar chart",border="red")

2.7 Numerical Measures
Sometimes we are interested in a number which is representative or typical of the data
set. The mean and the median are such numbers. Similarly, we define the range of the
data which gives some idea about the variation or dispersion of observations in the
data. The most important measure of dispersion is the sample standard deviation.

Measures of Location
Population Mean: The population mean is denoted by $\mu$, and for a finite population is defined by
$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \quad \text{where the } x_i\text{'s are the population values}$$

Sample Mean: The mean $\bar{x}$ of a sample is the average of the observations $x_1, x_2, \cdots, x_n$ in the sample. It is given by
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Example 2.8 Consider the bursting strengths of a sample of 5 soft drink bottles:
251 255 254 253 252
The sample mean is given by
$$\bar{x} = \frac{251 + 255 + 254 + 253 + 252}{5} = 253$$

Using R to compute the sample mean

x=c(251,255,254,253,252)
mean(x)
[1] 253

Sample Median: The median of a sample of $n$ observations $x_1, x_2, \cdots, x_n$ is the middle observation when the observations are arranged in ascending or descending order if the number of observations is odd. If the number of observations is even, it is the average of the middle two observations. In other words, for any sample of size $n$, the median $\tilde{x}$ is given by
$$\text{Median} = \tilde{x} = \begin{cases} x_{\left(\frac{n+1}{2}\right)} & \text{if } n \text{ is odd} \\[1ex] \dfrac{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}}{2} & \text{if } n \text{ is even} \end{cases}$$

For the bottle bursting strength data, the median is 253. There are 2 observations
below it and 2 above it.
Example 2.9: Marks obtained by 6 students in STAT 319 are 81 82 98 83 80 85.
The ordered sample observations are 80 81 82 83 85 98, so that the median is
$\tilde{x} = (82 + 83)/2 = 82.5$.
Using R to compute the sample median

x=c(81, 82, 98,83, 80, 85)


median(x)
[1] 82.5

Mode: The mode of a sample is the observation occurring the maximum number of
times i.e. the observations with the largest frequency.
Example 2.10: The following samples provide prices, in Saudi Riyals (SR), of a
computer monitor.

(a) 1200, 1000, 1500, 1200, 1000, 1200

(b) 1300, 1200, 1000

What is the modal price?

Solution:

(a) The modal price is SR1200.

(b) There is no modal price.

Example 2.11: The following table shows the hourly wages in SR earned by the
employees of a small company and the number of employees who earn each wage.

Wage per hour    Number of employees
6                3
8                5
10               4
13               4

The modal wage per hour is SR 8.
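In R, one way to find the mode of such a grouped table is to expand it with rep() and read off the most frequent value (a small sketch; mode_wage is our own variable name):

```r
# Expand the frequency table into raw observations
wages <- rep(c(6, 8, 10, 13), times = c(3, 5, 4, 4))
# The mode is the value occurring with the largest frequency
mode_wage <- as.numeric(names(which.max(table(wages))))
mode_wage  # 8
```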


Computing Mode with R: There is no direct built-in function for the statistical mode in R (the base function mode() returns the storage mode of an object). However, we can calculate the mode using the following code:

- table creates the frequency table.
- sort returns the data sorted in ascending order; the ‘-’ sign before the table function allows the value with the highest frequency to be sorted first.
- names gives only the values without their frequencies. The first value is then displayed.

d=c(1200, 1000, 1500, 1200, 1000, 1200)


names(sort(-table(d)))[1]
[1] "1200"

Alternatively, one can write a brief code as follows to compute the mode:

ModeFunction <- function(Data)
{
  UNIDATA <- unique(Data)
  UNIDATA[which.max(tabulate(match(Data, UNIDATA)))]
}

##unique: removes duplicate values in the data


##match: match returns a vector of the positions of (first)
## matches of its first argument in its second.

Data <-c(1200, 1000, 1500, 1200, 1000, 1200)##Data

OUTPUT<-ModeFunction (Data)# Mode is calculated.

print(OUTPUT) #mode is printed

Measures of Variability
Population Variance: The variance of a population is denoted by $\sigma^2$ and, when $N$ is finite, is defined by
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$$
Sample Variance: For a sample of size $n$, the variance, denoted by $s^2$, is the Total Sum of Squares (TSS) of the observations around their mean divided by $n-1$. That is,
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

Note that the TSS can also be written as
$$\text{TSS} = \sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2$$

The following R codes calculate the sample variance

x=c(251,255,254,253,252)
var(x)
[1] 2.5
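The shortcut formula can be checked against var() directly; a quick sketch using the bottle data above:

```r
x <- c(251, 255, 254, 253, 252)
n <- length(x)
# Total sum of squares via the shortcut formula
tss <- sum(x^2) - n * mean(x)^2
s2 <- tss / (n - 1)
s2  # 2.5, agreeing with var(x)
```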

Standard Deviation: The standard deviation is the positive square root of the variance and is given by
$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2} \quad \text{for the population,}$$
$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2} = \sqrt{\frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right)} \quad \text{for the sample.}$$

For example, the standard deviation for the data in Example 2.8 is given by
$$s = \sqrt{\frac{1}{4}\left[320055 - 5(253)^2\right]} = \sqrt{\frac{1}{4}\left[320055 - 320045\right]} = \sqrt{\frac{10}{4}} \approx 1.58114$$

The following codes calculate the sample standard deviation

x=c(251,255,254,253,252)
sd(x)
[1] 1.581139

Percentiles: The $\alpha$th percentile $P_\alpha$ is the value that exceeds $\alpha\%$ of the data, and is obtained by the following steps:

- Step 1: Determine $R_\alpha = \alpha(n+1)/100$, $\alpha = 1, 2, \cdots, 99$.
- Step 2: Separate $i$ (the largest integer not exceeding $R_\alpha$) and the decimal part $d$ of $R_\alpha$, and write $R_\alpha = i + d$.
- Step 3: Order the observations in an ascending manner.
- Step 4: The $\alpha$th percentile is then given by
$$P_\alpha = x_{(i)} + d(x_{(i+1)} - x_{(i)}) = (1-d)x_{(i)} + dx_{(i+1)}, \quad \alpha = 1, 2, \cdots, 99$$

where

- $x_{(i)}$ is the $i$th observation after ordering the observations ascendingly.
- The 25th percentile is called the 1st quartile and is denoted by $Q_1$.
- The 50th percentile is called the 2nd quartile and is denoted by $Q_2$.
- The 75th percentile is called the 3rd quartile and is denoted by $Q_3$.
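The four steps above translate into a small R function (a sketch; the name pctile is ours, not a standard R function, and it assumes $\alpha$ is such that $1 \le i < n$):

```r
# Percentile via the rank R_alpha = alpha*(n+1)/100 with linear interpolation
pctile <- function(x, alpha) {
  xs <- sort(x)                       # Step 3: order the data
  r  <- alpha * (length(x) + 1) / 100 # Step 1: rank R_alpha
  i  <- floor(r)                      # Step 2: integer part
  d  <- r - i                         # Step 2: decimal part
  (1 - d) * xs[i] + d * xs[i + 1]     # Step 4: interpolate
}

x <- c(2, 3, 5, 6, 8, 10, 12, 15, 18, 20)
pctile(x, 25)  # 4.5
```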

Example 2.12 (cf. Vining, 1998, 193). An independent consumer group tested radial
tires from a major brand to determine expected tread life. The data (in thousands of
miles) are given below:

50 54 52 47 56 51 51
48 56 53 43 56 58 42

Find the 1st , 2nd and 3rd quartiles.


Solution: The ordered sample observations are given by
42 43 47 48 50 51 51 52 53 54 56 56 56 58
The ranks of the quartiles are
$$R_{25} = \frac{25(n+1)}{100} = \frac{25(14+1)}{100} = 3.75, \quad (i = 3,\ d = 0.75)$$
$$R_{50} = \frac{50(n+1)}{100} = \frac{50(14+1)}{100} = 7.5, \quad (i = 7,\ d = 0.5)$$
$$R_{75} = \frac{75(n+1)}{100} = \frac{75(14+1)}{100} = 11.25, \quad (i = 11,\ d = 0.25)$$

so that the quartiles are given by:

Q1 = 3.75th obs = (1 − 0.75)(3rd obs) + 0.75(4th obs) = 0.25(47) + 0.75(48) = 47.75


Q2 = 7.50th obs = (1 − 0.50)(7th obs) + 0.50(8th obs) = 0.50(51) + 0.5(52) = 51.50
Q3 = 11.25th obs = (1 − 0.25)(11th obs) + 0.25(12th obs) = 0.75(56) + 0.25(56) = 56

The following code calculates percentiles in R. The first argument is the data, while the second argument is the desired percentile(s).

x=c(42, 43, 47, 48, 50, 51, 51, 52, 53, 54, 56, 56, 56, 58)

quantile(x,probs=c(0.25,0.50,0.75))

## 25% 50% 75%


## 48.5 51.5 55.5
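Note that quantile() defaults to a different interpolation rule (type 7), which is why 48.5 and 55.5 differ from the hand-computed 47.75 and 56. Type 6 interpolates at the rank $R_\alpha = \alpha(n+1)/100$, matching the hand method above:

```r
x <- c(42, 43, 47, 48, 50, 51, 51, 52, 53, 54, 56, 56, 56, 58)
# type = 6 uses the rank alpha*(n+1)/100, as in the hand computation
quantile(x, probs = c(0.25, 0.50, 0.75), type = 6)
#   25%   50%   75%
# 47.75 51.50 56.00
```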

2.8 The Empirical Rule (ER)


If the relative frequency distribution of the data is approximately mound shaped (i.e. bell shaped), then

1 Approximately 68% of the measurements will lie within 1 standard deviation of


their mean, i.e. within the interval [µ − σ, µ + σ] for a population, [x̄ − s, x̄ + s] for
a sample.

2 Approximately 95% of the measurements will lie within 2 standard deviations of


their mean, i.e. within the interval [µ − 2σ, µ + 2σ] for a population, [x̄ − 2s, x̄ + 2s]
for a sample.

3 Almost all the measurements (i.e. 100%) will lie within 3 standard deviations of
their mean, i.e. within the interval [µ − 3σ, µ + 3σ] for a population,[x̄ − 3s, x̄ + 3s]
for a sample.

A population/sample satisfying the above three properties is said to satisfy the


empirical rule, though in many cases, it may not guarantee a bell shaped distribution.
Example 2.13 Consider again the observations of Example 2.3. For the data, we have $\bar{x} = 1.63$, $s = 1.19$.

1 The interval $[\bar{x} - s, \bar{x} + s] = [0.437, 2.823]$ contains 18 observations, which leads to the proportion $\frac{18}{25} = 72\%$, which is not close to the 68% expected by the Empirical Rule. Since the rule is violated, we say the ER is not satisfied by the sample.

2 The interval $[\bar{x} - 2s, \bar{x} + 2s] = [-0.755, 4.015]$ contains 24 observations, which leads to the proportion $\frac{24}{25} = 96\%$, which is not far from the 95% expected by the Empirical Rule.

3 The interval $[\bar{x} - 3s, \bar{x} + 3s] = [-1.948, 5.208]$ contains all 25 observations, which leads to the proportion $\frac{25}{25} = 100\%$, exactly as expected by the Empirical Rule.

If all the three rules are approximately satisfied by the sample, we say that the rule is
satisfied. Thus, for this data set the empirical rule is not satisfied.
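The proportion of observations inside $\bar{x} \pm ks$ can be computed directly in R; a sketch using the bottle-strength data of Example 2.8:

```r
x <- c(251, 255, 254, 253, 252)
m <- mean(x)
s <- sd(x)
# Proportion of observations within k standard deviations of the mean
props <- sapply(1:3, function(k) mean(x >= m - k * s & x <= m + k * s))
props  # 0.6 1.0 1.0
```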

Coefficient of Variation
The sample coefficient of variation relates variability in the sample to the mean. It is defined by
$$CV = \frac{s}{\bar{x}}$$
Example 2.14 Suppose that calibration inspection time based on a sample of 100 observations has a mean of 14.342 and a standard deviation of 1.72 (Lapin, 1997, p. 22). The coefficient of variation of the sample is given by
$$CV = \frac{1.72}{14.342} \approx 0.12$$
It indicates that the sample standard deviation is only 12% as large as the mean.
Since our sample yields a CV = 0.12, therefore we conclude that the sample does not
have much variation relative to the mean.
The following codes calculate the coefficient of variation in R

x=c(251,255,254,253,252)
sd(x)/mean(x)
## [1] 0.006249561

Coefficient of Skewness
A measure of skewness indicates the direction of the relative frequency distribution, either skewed toward lower values or toward higher values. The sample coefficient of skewness is given by
$$CS = \frac{\bar{x} - \tilde{x}}{s/3}$$
A negative value of CS implies that the relative frequency distribution is negatively skewed (left-tailed distribution) while a positive value of CS implies that the relative frequency distribution is positively skewed (right-tailed distribution).
For the CPU data in Example 2.13 the coefficient of skewness is given by
$$CS = \frac{1.63 - 1.38}{1.1928/3} = 0.629$$
which indicates that the sample is positively skewed, i.e. the relative frequency
histogram has a long right tail.
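The CS formula itself needs only base R summaries; a sketch applied to an illustrative sample:

```r
x <- c(18, 15, 12, 6, 8, 2, 3, 5, 20, 10)
# Coefficient of skewness: (mean - median) / (s / 3)
cs <- (mean(x) - median(x)) / (sd(x) / 3)
round(cs, 3)  # 0.432, a mildly right-skewed sample
```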
The following code calculates a (moment-based) coefficient of skewness in R after installation of the ‘moments’ package; note that moments::skewness is a different measure from the CS formula above, so the two values generally differ numerically.

x=c(18, 15, 12, 6, 8, 2, 3, 5, 20, 10)
#install.packages("moments")
library(moments)

skewness(x)

## [1] 0.3359345

Proportion
The population proportion is defined as $p = X/N$, where $X$ is the number of observations in the population possessing a particular characteristic, and $N$ is the population size. The sample proportion is given by $\hat{p} = x/n$, where $n$ is the sample size and $x$ is the number of observations possessing that particular characteristic in the sample.
In a statistics course, 30 students sat for the final exam; 6 got A, 3 failed, and the rest got other grades (B, C, D). Then the proportion of students who got A is $6/30 = 0.20$, and the proportion of failing students is $3/30 = 0.10$.
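In R, a sample proportion is conveniently obtained as the mean of a logical vector; a small sketch with made-up grade data matching the counts above:

```r
# Hypothetical grades: 6 A's, 3 F's, 21 other grades
grades <- c(rep("A", 6), rep("F", 3), rep("other", 21))
p_A <- mean(grades == "A")  # proportion of A students
p_F <- mean(grades == "F")  # proportion of failing students
c(p_A, p_F)  # 0.2 0.1
```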

2.9 The Box Plot


A box plot is a box whose edges are aligned with the first and third quartiles, with the median marked at the appropriate place on the scale. The box is extended in both directions up to the smallest and largest values; these extensions may be called arms (or whiskers). This technique displays the structure of the data set by using the quartiles and the extreme values of a sample.
The following intervals, called inner fences and outer fences, are used to detect outliers.

Inner fences : [Q1 − 1.5(IQR), Q3 + 1.5(IQR)] = [LIF, U IF ]


Outer fences : [Q1 − 3.0(IQR), Q3 + 3.0(IQR)] = [LOF, U OF ]

where IQR = Q3 − Q1 is the interquartile range and LIF , U IF are Lower and
Upper Inner Fence and LOF , U OF are Lower and Upper Outer Fence.
Observations that fall between the inner and outer fences are deemed suspected outliers, and those falling outside the outer fences are highly suspect outliers (Sincich, 1992).
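The fences are easy to compute in R; a sketch using type = 6 quantiles so that $Q_1$ and $Q_3$ match the hand method of Section 2.7:

```r
x <- c(18, 15, 12, 6, 8, 2, 3, 5, 20, 10)
q <- unname(quantile(x, c(0.25, 0.75), type = 6))  # Q1 and Q3
iqr <- q[2] - q[1]                                 # interquartile range
inner <- c(q[1] - 1.5 * iqr, q[2] + 1.5 * iqr)     # inner fences
outer <- c(q[1] - 3.0 * iqr, q[2] + 3.0 * iqr)     # outer fences
inner  # -12.375 32.625
outer  # -29.25  49.5
any(x < inner[1] | x > inner[2])  # FALSE: no suspected outliers
```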

Example 2.15: Construct the box plot for the scores in Example 2.4.
Solution: The quartiles are given by
$$R_{25} = \frac{25(11)}{100} = 2.75 \implies Q_1 = 0.25(3) + 0.75(5) = 4.5$$
$$R_{50} = \frac{50(11)}{100} = 5.5 \implies Q_2 = 0.5(8) + 0.5(10) = 9$$
$$R_{75} = \frac{75(11)}{100} = 8.25 \implies Q_3 = 0.75(15) + 0.25(18) = 15.75$$
$$IQR = Q_3 - Q_1 = 15.75 - 4.5 = 11.25$$
The Inner Fences are given by $[Q_1 - 1.5\,IQR,\ Q_3 + 1.5\,IQR] = [4.5 - 1.5(11.25),\ 15.75 + 1.5(11.25)] = (-12.375,\ 32.625)$, while the Outer Fences are given by $[Q_1 - 3\,IQR,\ Q_3 + 3\,IQR] = [4.5 - 3(11.25),\ 15.75 + 3(11.25)] = (-29.25,\ 49.5)$. Clearly there is no outlier by the inner fence method.
Since the second quartile $Q_2$ is closer to the first quartile $Q_1$ than it is to the third quartile $Q_3$, i.e. $Q_2 - Q_1 < Q_3 - Q_2$, the distribution is positively skewed.
To construct a box plot in R for the data in Example 2.4, follow the steps:

x=c(18, 15, 12, 6, 8, 2, 3, 5, 20, 10)


boxplot(x)

Figure 2.7: Box plot.

2.10 Approximate Mean and Variance of Grouped
Data
The CPU data in Example 2.3 has been used to make the following frequency distri-
bution.
Class Class interval Midvalue f Relative f F Relative F
1 [0, 1) 0.5 9 0.36 9 0.36
2 [1, 2) 1.5 8 0.32 17 0.68
3 [2, 3) 2.5 4 0.16 21 0.84
4 [3, 4) 3.5 3 0.12 24 0.96
5 [4, 5) 4.5 1 0.04 25 1.00

The above table is ‘equivalent’ to the CPU data with mid-values as given below:

0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5
2.5 2.5 2.5 2.5
3.5 3.5 3.5
4.5

The sample mean of the above sample can now be calculated by the usual formula
$$\bar{x} = \frac{0.5 + 0.5 + \cdots + 4.5}{25} = 1.66$$
Note the discrepancy between the sample mean (1.63) calculated from the ungrouped
data in Example 2.3 and the sample mean (1.66) calculated from the grouped data.
The expression for the mean can also be written in terms of the distinct values as
$$\bar{x} = \frac{1}{25}\left[0.5(9) + 1.5(8) + 2.5(4) + 3.5(3) + 4.5(1)\right] = \frac{1}{n}\sum_{i=1}^{k} x_i f_i$$
where $k$ is the number of classes in the frequency table.
The sample variance can be calculated as follows:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{k}(x_i - \bar{x})^2 f_i = \frac{1}{n-1}\left[\sum_{i=1}^{k} x_i^2 f_i - \frac{\left(\sum_{i=1}^{k} x_i f_i\right)^2}{n}\right]$$

Thus, for the data consisting of the above mid-values we have $s^2 = 1.39$.
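These grouped-data formulas translate directly into R; a sketch using the mid-values and frequencies from the table above:

```r
mid <- c(0.5, 1.5, 2.5, 3.5, 4.5)  # class mid-values
f   <- c(9, 8, 4, 3, 1)            # class frequencies
n   <- sum(f)
xbar <- sum(mid * f) / n                          # grouped mean
s2   <- (sum(mid^2 * f) - n * xbar^2) / (n - 1)   # grouped variance
c(xbar, s2)  # 1.66 1.39
```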

Exercise
2.1 Refer to Example 2.1 and do the following:

(a) Select a SRS of size 12 using a random number table.


(b) Select a SRS of size 20 using R.
(c) Construct a frequency distribution using the class intervals and so on.
(d) Draw the histogram corresponding to the frequency distribution in part (a).
How would you describe the shape of this histogram?
(e) Draw a stem and leaf plot for the above data.
(f) Draw a box plot and comment on the symmetry and shape of the data.

2.2 (cf. Devore, J. L. and Peck, R., 1997, 72). The paper “The Pedaling Technique
of Elite Endurance Cyclists” (Int. J. of Sport Biomechanics, 1991, pp. 29-53)
reported the accompanying data on single-leg power at a high workload.

244 191 160 187 180 176 174 205 211 183 211 180 194 200

(a) Find the mean, median, standard deviation, variance, lower and upper quartiles, range, inter-quartile range, coefficient of variation, and coefficient of skewness for the above data.
(b) Do the data satisfy the empirical rule?

2.3 (cf. Montgomery, D. C., et. al 2001, 25-26). The following data are direct solar
intensity measurements (watts/m-sq) on different days at a location in southern
Spain:

562 869 708 775 775 704 809 856 655 806 878 909
918 558 768 870 918 940 946 661 820 898 935 952
957 693 835 905 939 955 960 498 653 730 753.

(a) Calculate the following summary statistics for this sample Mean, median,
standard deviation, variance, co-efficient of variation, co-efficient of skewness,
range, lower and upper quartiles, inter-quartile range.
(b) Construct the box plot.

2.4 (Montgomery, D. C., et. al, 2001, 25-26). The following data are the compressive
strengths in pounds per square inch (psi) of 80 specimens of a new aluminum-
lithium alloy undergoing evaluation as a possible material for aircraft structural
elements.

105 221 183 186 121 181 180 143 97 154 153 174 120 168 167 141
245 228 174 199 181 158 176 110 163 131 154 115 160 208 158 133
207 180 190 193 194 133 156 123 134 178 76 167 184 135 229 146
218 157 101 171 165 172 158 169 199 151 142 163 145 171 148 158
160 175 149 87 160 237 150 135 196 201 200 176 150 170 118 149

(a) Construct a frequency distribution and a frequency histogram starting from


70 and the step size 20.
(b) Construct a stem and leaf plot.

2.5 Refer to Exercise 2.1 draw a random sample of size 20 using the random number
table at the end of your manual.

(a) With replacement


(b) Without replacement

2.6 (cf. Johnson, R. A., 2001, 53). The following measurements of the diameters (in
feet) of Indian mounds in southern Wisconsin were gathered by examining reports
in the Wisconsin Archeologist.
22 24 24 30 22 20 28 30 24 34 36 15 37

(a) Find the upper and lower quartiles and 90th percentile for the above data.
(b) Find the range and the inter quartile range of this data.
(c) Calculate the mean, median & standard deviation.
(d) Find the proportion of the observations that are in the intervals.

x̄ ± s, x̄ ± 2s, x̄ ± 3s

(e) Compare the results in part (d) with the empirical guidelines.
(f) Display the data in the form of a box plot.

2.7 (Johnson, R. A., 2000, 22). Consider the following humidity readings rounded to
the nearest percent:

29 44 12 53 21 34 39 25 48 23
17 24 27 32 34 15 42 21 28 37

(a) Construct a frequency distribution and histogram starting from 10 and with
a width (step size) of the intervals 10.
(b) Construct a stem and leaf plot of the above data.

2.8 (Devore, J. L. and Farnum, N. R., 1999, 16). Corrosion reinforcing steel is a
serious problem in concrete structures located in environments affected by severe
weather conditions. For this reason researchers have been investigating the use
of reinforcing bars made of composite material. One study was carried out to
develop guidelines for bonding glass-fiber-reinforced plastic rebars to concrete.
Consider the following 48 observations on measured bond strength:

11.5 12.1 9.9 9.3 7.8 6.2 6.6 7.0 13.4 17.1 9.3 5.6
5.7 5.4 5.2 5.1 4.9 10.7 15.2 8.5 4.2 4.0 3.9 3.8
3.6 3.4 20.6 25.5 13.8 12.6 13.1 8.9 8.2 10.7 14.2 7.6
5.2 5.5 5.1 5.0 5.2 4.8 4.1 3.8 3.7 3.6 3.6 3.6

(a) Construct a stem-and-leaf display for these data.


(b) Construct a frequency distribution and histogram, starting from 2 and with a
step size 2.

2.9 (cf. Montgomery, D. C., et. al, 2001, 25). In Applied Life Data Analysis (Wiley,
1982), Wayne Nelson presents the break-down time of an insulating fluid between
electrodes at 34 kV. The times in minutes, are as follows:

0.19 0.78 0.96 1.31 2.78 3.16 4.15 4.67 4.85 5.81
6.50 7.35 8.01 8.27 12.06 13.75 32.52 33.91 36.71 72.89

(a) Calculate the sample average and the sample standard deviation.
(b) Calculate the coefficient of variation and coefficient of skewness.

2.10 (cf. Montgomery, D. C., et. al, 2001, 25). An article in the Journal of Structural
Engineering (1989, p115) describes an experiment to test the yield strength of
circular tubes with caps welded to the ends. The first yields (in kN) are

96 102 102 102 104 104 108 126 126 128 128 140 156 160 160 164 170

Calculate the sample median, upper and lower quartile and construct a box plot.

2.11 (cf. Montgomery, D. C., et. al, 2001, 25). The data on visual accommodation (a
function of eye movement) when recognizing a speckle pattern on a high resolution
CRT screen is as follows:

36.45 67.90 38.77 42.18 26.72 50.77 39.30 49.71 67.90


38.77 42.18 26.72 50.77 39.30 67.90 38.77 42.18 26.72
50.77 39.30 67.90 38.77 42.18 26.72 50.77 39.30 29.12

(a) Calculate the sample mean, median, mode, variance and the sample standard
deviation.

(b) Calculate the coefficient of variation and coefficient of skewness and interpret
these values.
(c) Prepare a stem-and-leaf plot of the above data and comment on the shape of
the data.
(d) Construct a frequency histogram, and compare it with stem-and-leaf plot.
(e) Draw a cumulative relative frequency curve and determine the 40th percentile,
the 70th percentile. Explain these quantities.

2.12 (cf. Montgomery, D. C., et. al, 2001, 30). The following data are the numbers
of cycles to failure of aluminum test coupons subjected to repeated alternating
stress at 21,000 psi, 18 cycles per second:

1115 1567 1223 1782 1055 1310 1883 375 1522 1764
1540 1203 2265 1792 1330 1502 1270 1910 1000 1608
1258 1015 1018 1820 1535 1315 845 1452 1940 1781
1085 1674 1890 1120 1750 798 1016 2100 910 1501
1020 1102 1594 1730 1238 865 1605 2023 1102 990
2130 706 1315 1578 1468 1421 2215 1269 758 1512
1109 785 1260 1416 1750 1481 885 1888 1560 1642

(a) Construct a stem-and-leaf display for these data.


(b) Construct a frequency distribution and histogram, starting from 750 and with
a step size 200.
(c) Is the empirical rule satisfied?

2.13 (cf. Montgomery, D. C., et. al, 2001, 42). The pH of a solution is measured
eight times by one operator using the same instrument. She obtains the following
data:

7.05 7.20 7.18 7.19 7.20 7.15 7.20 7.18 7.19 7.20 7.21 7.16
7.15 7.20 7.08 7.19 7.25 7.21 7.16 7.15 7.20 7.18 7.19 7.20
7.21 7.16 7.21 7.16 7.15 7.26 7.18 7.19 7.20 7.21 7.16 7.19.

Calculate the following summary statistics: Mean, Median, Range, IQR, Standard
Deviation and Variance

2.14 (cf. Montgomery, D. C., et. al, 2001, 42). A sample of 30 resistors yielded the
following resistances (ohms):

38 47 45 41 35 35 34 45 44 47 45 41 35 35 36
34 45 34 45 44 47 45 41 35 47 45 41 35 43 43

Compute summary statistics for this data.

2.15 (cf. Montgomery, D. C., et. al, 2001, 37). An article in the Transactions of
the Institution of Chemical Engineers (1956, 34, 280-293) reported data from
an experiment investigating the effect of several process variables on the vapor
phase oxidation of naphthalene. A sample of percentage mole conversion of
naphthalene to maleic anhydride follows:

4.2 4.7 5.0 3.8 3.6 3.0 5.1 3.1 3.8 4.8
4.0 5.2 4.3 2.8 2.0 2.8 3.3 4.8 5.0.

(a) Calculate the sample mean, variance, standard deviation, range, coefficient of
variation and skewness.
(b) Calculate the sample median, lower and upper quartiles, inter-quartile-range.
(c) Construct a box plot of the data.

2.16 (cf. Montgomery, D. C., et. al, 2001, 37). The following data are the temperatures
of effluent at discharge from a sewage treatment facility on consecutive days:

43 47 51 48 52 50 46 49 45 52 46 51
44 49 46 51 49 45 44 50 48 50 49 50

(a) Calculate the sample mean, variance, standard deviation, range, coefficient of
variation and skewness.
(b) Calculate the sample median, lower and upper quartiles, inter-quartile-range.
(c) Construct a box plot of the data.
(d) Find the 5th and 95th percentiles of the temperature.
(e) Construct a dot plot for the temperature data.

2.17 (Devore, J. L. and Farnum, N. R., 1999, 4-5). The tragedy that befell the
space shuttle Challenger and its astronauts in 1986 led to a number of studies
to investigate the reasons for mission failure. Attention quickly focused on the
behavior of the rocket engine’s O-rings. Here is data consisting of observations on
O-ring Temperature (F) for each test firing or actual launch of the shuttle rocket
engine (Presidential Commission on the Space Shuttle Challenger Accident, 1986,
1, pp.129-131).

84 49 61 40 83 67 45 66 70 69 80 58 68 60 67 72 73 70
57 63 70 78 52 67 53 67 75 61 70 81 76 79 75 76 58 31

(a) Prepare a dot plot of the sample.


(b) Construct a stem-and-leaf display for these data.
(c) Construct a frequency distribution and histogram, starting from 25 and with
a step size 10.

2.18 (Devore, J. L. and Farnum, N. R., 1999, 18). In the manufacture of printed circuit
boards, finished boards are subjected to a final inspection before they are shipped
to customers. Here is data on the type of defect for each board rejected at final
inspection during a particular time period:

Type of defect Frequency


Low copper plating 112
Poor electrolyses coverage 35
Lamination problems 10
Plating separation 8
Etching problems 5
Miscellaneous 12

Make a bar chart and a pie chart of the above data.

2.19 (Devore, J. L., 2000, 18). Power companies need information about customer
usage to obtain accurate forecast of demands. Investigators from Wisconsin Power
and Light determined energy consumption (BTUs) during a particular period for
a sample of 90 gas-heated homes. An adjusted consumption value was calculated
as follows:

Class Frequency
1−3 1
3−5 1
5−7 11
7−9 21
9 − 11 25
11 − 13 17
13 − 15 9
15 − 17 4
17 − 19 1

(a) Find the mean, median, standard deviation, variance, lower and upper quartiles, range, inter-quartile range, coefficient of variation, and coefficient of skewness for the above data.
(b) Do the above data satisfy the Empirical Rule?
(c) Construct a frequency histogram of the above data.

Chapter 3

Discrete Random Variables

The following are well known discrete probability distributions.

Distribution     Probability function                                                               Mean                Variance
Binomial         $P(X=x) = \binom{n}{x} p^x (1-p)^{n-x}$, $x = 0, 1, 2, \cdots, n$                  $np$                $np(1-p)$
Geometric        $P(X=x) = q^{x-1} p$, $x = 1, 2, \cdots$, where $q = 1-p$                          $1/p$               $q/p^2$
Hypergeometric   $P(X=x) = \dfrac{\binom{N_o}{x}\binom{N-N_o}{n-x}}{\binom{N}{n}}$, $x = 0, 1, \cdots, \min(n, N_o)$   $n\frac{N_o}{N}$    $n\frac{N_o}{N}\left(1-\frac{N_o}{N}\right)\frac{N-n}{N-1}$
Poisson          $P(X=x) = \dfrac{\lambda^x e^{-\lambda}}{x!}$, $x = 0, 1, 2, \cdots$               $\lambda$           $\lambda$

Table 3.1: Discrete Probability Distributions.
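The probability functions in Table 3.1 correspond to R's built-in d-functions; a quick sketch cross-checking two of them against the formulas (the parameter values are chosen arbitrarily):

```r
# Binomial: choose(n, x) * p^x * (1 - p)^(n - x)  vs  dbinom
stopifnot(all.equal(dbinom(2, 5, 0.1), choose(5, 2) * 0.1^2 * 0.9^3))
# Poisson: lambda^x * exp(-lambda) / x!  vs  dpois
stopifnot(all.equal(dpois(3, 2), 2^3 * exp(-2) / factorial(3)))
```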

3.1 The Binomial Distribution


Example 3.1 An oil drilling company ventures into various locations, and their successes or failures are independent from one location to another. Suppose the probability of a success at any specific location is 0.1. If a driller drills 5 locations,

1. Find the probability that there will be exactly two successes.

Solution: Let $X$ be the number of successful drillings in the 5 locations, so that $X \sim B(n = 5, p = 0.1)$ and $P(X = 2) = \binom{5}{2}(0.1)^2(0.9)^3 = 0.0729$.


2. Find the probability that there will be at least two successes.

Solution: Let $X$ be the number of successful drillings in the 5 locations, so that $X \sim B(n = 5, p = 0.1)$ and
$$P(X \ge 2) = 1 - P(X < 2) = 1 - P(X = 0) - P(X = 1) = 1 - \binom{5}{0}(0.1)^0(0.9)^5 - \binom{5}{1}(0.1)^1(0.9)^4 = 0.08146$$

Computing Binomial Probabilities Using R


To solve example 3.1 in R, we compute binomial probability using the function binom,
which is the R name corresponding to the binomial probability distribution. For
Bin(n, p), the probability distribution can be obtained as: dbinom(x, n, p). R has four
particular functions available for each distribution. These are

Name Description
dname(x= , other arguments) Density or probability mass function
pname(q= , other arguments) Cumulative distribution function
qname(p= , other arguments) Quantile function
rname(n= , other arguments) Random deviates

Table 3.2: Name and Description.

that is, depending on what you want to calculate, you prefix the R distribution name with one of the letters “d”, “p”, “q”, or “r”. If you are altering any default values that have been established for the distribution's parameters, you must specify the new values in the function call. The “help” function in R can be used to examine any additional arguments that a function may have with predetermined values. When calculating any of the four quantities tabulated in Table 3.2 (the pdf or pmf, the cdf $P(X \le x)$, the quantile or inverse-cdf function, or random observations), you need to supply a scalar or vector of values through the appropriate argument. For instance, to calculate the pdf (pmf), specify a scalar or vector of values using the argument x = ...; to calculate the cdf, i.e. $P(X \le q)$, specify a scalar or vector of x-values using q = ...; and so on. Now, recomputing Example 3.1 in R, we have

1. > dbinom(2, 5, 0.1)


[1] 0.0729

2. > 1-pbinom(1, 5, 0.1)


[1] 0.08146
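The other three prefixes work the same way for the binomial distribution; a brief sketch (set.seed makes the random draws reproducible):

```r
pbinom(1, 5, 0.1)    # cdf: P(X <= 1) = 0.91854
qbinom(0.5, 5, 0.1)  # quantile: smallest x with P(X <= x) >= 0.5, here 0
set.seed(1)
rbinom(3, 5, 0.1)    # three random B(5, 0.1) deviates
```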

3.2 The Geometric Distribution
Example 3.2 A manufacturer uses electrical fuses in an electronic system. The fuses
are purchased in large lots and tested sequentially until the first defective fuse is
observed. Assume that the lot contains 10% defective fuses. What is the probability
that the

(a) first defective fuse is observed on the first test?


(b) first defective fuse is observed on the second test?
(c) first defective fuse is observed on the third test?

Solution: (a) $0.10$ (b) $(0.90)(0.10) = 0.09$ (c) $(0.90)^2(0.10) = 0.081$

Computing geometric Probabilities Using R


To solve example 3.2 in R, we compute geometric probability using the function geom,
which is the R name corresponding to the geometric probability distribution. For
Geometric(p), the probability distribution can be obtained as: dgeom(x − 1, p). Now,
recomputing example 3.2 in R we have

(a) > dgeom(0,0.1)


[1] 0.1
(b) > dgeom(1,0.1)
[1] 0.09
(c) > dgeom (2,0.1)
[1] 0.081
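R's dgeom counts the number of failures before the first success, which is why x - 1 is passed above; a quick sketch confirming it against the $q^{x-1}p$ formula:

```r
p <- 0.1
q <- 1 - p
x <- 1:3  # trial on which the first defective fuse appears
# dgeom takes the number of failures before the first success, i.e. x - 1
stopifnot(all.equal(dgeom(x - 1, p), q^(x - 1) * p))
dgeom(x - 1, p)  # 0.100 0.090 0.081
```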

3.3 The Hypergeometric Distribution


Example 3.3 Suppose that a random sample of size 2 is selected without replacement,
from a lot of 100 laser printers and it is known that 5% of the items in the lot are
defective. What is the probability that

(a) none of them is defective?


(b) one of them is defective?
(c) both are defective?

Solution: (a) $\dfrac{\binom{5}{0}\binom{95}{2}}{\binom{100}{2}} = 0.902$  (b) $\dfrac{\binom{5}{1}\binom{95}{1}}{\binom{100}{2}} = 0.096$  (c) $\dfrac{\binom{5}{2}\binom{95}{0}}{\binom{100}{2}} = 0.002$

Computing Hypergeometric Probabilities Using R
To solve example 3.3 in R, we compute hypergeometric probability using the func-
tion hyper, which is the R name corresponding to the hypergeometric probability
distribution. For Hypergeometric distribution with parameters as shown in Table
3.1, the associated probability can be obtained in R as: dhyper(x, No , N -No , n). Now,
recomputing example 3.3 in R we have

(a) > dhyper(0,5,95,2)


[1] 0.9020202

(b) > dhyper(1,5,95,2)


[1] 0.0959596

(c) > dhyper(2,5,95,2)


[1] 0.002020202

3.4 The Poisson Distribution


Example 3.4 The number of failures of a testing instrument from contamination
particles on the product is a Poisson random variable with a mean of 0.02 failures per
hour.

(a) What is the probability that the instrument does not fail in an 8-hour shift?

(b) What is the probability of at least one failure in 30 minutes?

Solution

(a) Let X = number of failures of the testing instrument in an 8-hour shift. Then the
expected number of failures in an 8-hour shift is given by

λ = 0.02(8) = 0.16

so that

P (X = 0) = e−λ = e−0.16 = 0.852

(b) Let X = number of failures of the testing instrument in 30 minutes. Then the
expected number of failures in 30 minutes is given by

λ = 0.02(30/60) = 0.01

so that

$$P(X \ge 1) = 1 - P(X = 0) = 1 - e^{-\lambda} = 1 - e^{-0.01} \approx 0.00995$$

Computing Poisson Probabilities Using R
To solve example 3.4 in R, we compute Poisson probability using the function pois,
which is the R name corresponding to the Poisson probability distribution. For
P oisson(λ), the probability distribution can be obtained as: dpois(x, λ). Now, recom-
puting example 3.4 in R we have

(a) > dpois(0,0.16)


[1] 0.8521438

(b) > 1-dpois(0,0.01)


[1] 0.009950166

OR

> 1-ppois(0,0.01)
[1] 0.009950166

Exercises
3.1 (Johnson, R. A., 2000, 139). If the probability is 0.20 that a downtime of an
automated production process will exceed 2 minutes, find the probability that 3
of 8 downtimes of the process will exceed 2 minutes.

3.2 (Johnson, R. A., 2000, 139). If the probability that a fluorescent light has a useful
life of at least 500 hours is 0.85, find the probabilities that among 20 such lights

(a) 18 will have a useful life of at least 500 hours.


(b) at least 15 will have a useful life of at least 500 hours.
(c) at most 10 will not have a useful life of at least 500 hours.

3.3 (Johnson, R. A., 2000, 125). It is known that 5% of the books bound at a certain
bindery have defective bindings. Find the probability that at least 20 of 100 books
bound by this bindery will have defective bindings.

3.4 (cf. Devore, J. L., 2000, 123). Suppose that 20% of all copies of a particular
textbook fail a certain binding strength test. Let X denote the number among 15
randomly selected copies that fail the test. Then X has a binomial distribution
with n = 15, p = 0.2

(a) Complete the probability and cumulative probability distribution for the
number of failures.
(b) Draw the probability and cumulative probability histograms.

(c) Find the probability that at most 8 fail the test.
(d) Find the probability that exactly 8 fail the test.
(e) Find the probability that at least 8 fail.
(f) Find the probability that between 4 and 7, inclusive, fail the test.

3.5 (cf. Devore, J. L., 2000, 123). An electronic manufacturer claims that 10% of its
power supply units need service during the warranty period. To investigate this
claim, technicians at a testing laboratory purchase 20 units and subject each unit
to accelerated testing to simulate use during the warranty period.

(a) Find the complete probability and cumulative probability distributions for
the number of units that need repair during the warranty period.
(b) Draw the probability and cumulative probability histograms.
(c) Find the probability that at most 6 need repair during the warranty period.
(d) Find the probability that exactly 12 need repair during the warranty period.
(e) Find the probability that between 5 and 10, inclusive, need repair during the
warranty period.

3.6 (cf. Devore, J. L., 2000, 125). Compute the following binomial probabilities

(a) Bin(3; 8; 0.6).


(b) P (3 < X ≤ 5) when n = 8 and p = 0.6.
(c) P (X < 3) when n = 12 and p = 0.1
(d) P (X ≥ 4) when n = 10 and p = 0.3
(e) P (X ≥ 15) when n = 30 and p = 0.3

3.7 (cf. Devore, J. L., 2000, 125). When circuit boards used in the manufacture of
compact disc players are tested, the long run percentage of defectives is 5%. Let
X = number of defective boards in a random sample of size n = 35. Determine:

(a) P (X ≥ 10).
(b) P (X ≤ 20).
(c) P (8 ≤ X ≤ 25).
(d) What is the probability that none of the 35 boards are defective?
(e) Calculate the expected value and standard deviation of X .

3.8 (cf. Devore, J. L., 2000, 125-126). A company that produces fine crystal knows
from experience that 10% of its goblets have cosmetic flaws and must be classified
as “seconds.”

(a) Among six randomly selected goblets, how likely is it that only one is a
second?

(b) Among six randomly selected goblets, what is the probability that at least
two are seconds?
(c) If goblets are examined one by one, what is the probability that at most 5 of
six are seconds?

3.9 (cf. Devore, J. L., 2000, 126). Suppose that only 20% of all drivers come to a
complete stop at an intersection having flashing red lights in all directions when
no other cars are visible. What is the probability that, of 20 randomly chosen
drivers coming to an intersection under these conditions:

(a) at most 7 will come to a complete stop?


(b) more than 6 will come to a complete stop?
(c) at least 8 will come to a complete stop?
(d) not all 20 will come to a stop?

3.10 (cf. Walpole, R. E, et. al, 2002, 134). In a certain manufacturing process it
is known that, on the average, 1 in every 100 items is defective. What is the
probability that

(a) the fifth item inspected is the first defective item found?
(b) at least four defective items are checked before the first nondefective item?
(c) at most four defective items are checked before the first nondefective item?

3.11 (cf. Walpole, R. E, et. al, 2002, 135). At busy times a telephone exchange is very
near capacity, so callers have difficulty placing their calls. It may be of interest to
know the number of attempts necessary in order to gain a connection. Suppose
that we let p = 0.05 be the probability of a connection during busy time. What
is the probability that

(a) 5 attempts are necessary for a successful call?


(b) 12 attempts are necessary for a successful call?
(c) 20 attempts are necessary for a successful call?

3.12 (cf. Walpole, R. E, et. al, 2002, 139). The probability that a student pilot passes
the written test for a private pilot’s license is 0.7. Find the probability that the
student will pass the test

(a) on the third try.


(b) on the seventh try.
(c) on the ninth try.

3.13 (Johnson, R. A., 2000, 139). A basketball player makes 90% of his free throws.
What is the probability that he will miss for the first time on the seventh shot?
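For Exercises 3.12 and 3.13, note that R's geometric functions count failures before the first success, so P(first success on trial k) is dgeom(k - 1, p). A minimal sketch:

```r
# R's dgeom(x, p) gives P(x failures before the first success),
# so P(first success occurs on trial k) = dgeom(k - 1, p)
dgeom(2, prob = 0.7)   # Exercise 3.12(a): pass on the third try
dgeom(6, prob = 0.1)   # Exercise 3.13: first miss on the seventh shot
```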

3.14 (cf. Walpole, R. E, et. al, 2002, 137). During a laboratory experiment the average
number of radioactive particles passing through a counter in 1 millisecond is 4.
What is the probability that

(a) 6 particles enter the counter in a given millisecond?


(b) more than 8 particles enter the counter in a given millisecond?
(c) no less than 2 particles enter the counter in a given millisecond?
(d) less than 10 particles enter the counter in a given millisecond?
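The Poisson probabilities in Exercise 3.14 can be sketched in R with dpois and ppois; note that "no less than 2" means P(X >= 2) and "less than 10" means P(X <= 9):

```r
# Exercise 3.14: X ~ Poisson(lambda = 4) particles per millisecond
dpois(6, lambda = 4)       # (a) P(X = 6)
1 - ppois(8, lambda = 4)   # (b) P(X > 8)
1 - ppois(1, lambda = 4)   # (c) P(X >= 2)
ppois(9, lambda = 4)       # (d) P(X < 10) = P(X <= 9)
```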

3.15 (cf. Walpole, R. E, et. al, 2002, 137). Ten is the average number of oil tankers
arriving each day at a certain port city. The facilities at the port can handle at
most 15 tankers per day. What is the probability that on a given day tankers
have to be turned away?

3.16 (Walpole, R. E, et. al, 2002, 139). On average, a certain intersection results in 3
traffic accidents per month. What is the probability that for any given month at
this intersection

(a) exactly 5 accidents will occur?


(b) less than 3 accidents will occur?
(c) at least 2 accidents will occur?

3.17 (Walpole, R. E, et. al, 2002, 139). A secretary makes 2 errors per page on average.
What is the probability that on the next page, he will make

(a) 4 or more errors?


(b) no errors?
(c) less than 10 errors?

3.18 (Walpole, R. E, et. al, 2002, 139). A certain area in the eastern United States is,
on average, hit by 6 hurricanes per year. Find the probability that for a given
year that area will be hit by

(a) fewer than 8 hurricanes.


(b) anywhere from 4 to 12 hurricanes.
(c) more than 10 hurricanes.

3.19 (Walpole, R. E, et. al, 2002, 139). The average number of field mice per acre in a
wheat field is estimated to be 12. Find the probability that on a given acre

(a) fewer than 7 field mice are found.


(b) no less than 10 field mice are found.
(c) no fewer than 5 field mice are found.

3.20 (Johnson, R. A., 2000, 128). If a bank receives on the average 6 bad checks per
day, what are the probabilities that it will receive

(a) 4 bad checks in any given day?


(b) 10 bad checks over any two consecutive days?

3.21 (Johnson, R. A., 2000, 128). In the inspection of tinplate produced by a continuous
electrolytic process, 0.2 imperfections are spotted on the average per minute. Find
the probabilities of spotting

(a) one imperfection in 3 minutes.


(b) at least 2 imperfections in 5 minutes.
(c) at most one imperfection in 15 minutes.

3.22 (Johnson, R. A., 2000, 131). The switchboard of a consultant’s office receives
on the average 0.6 calls per minute. Find the probabilities that

(a) in a given minute, there will be at least 1 call.


(b) in a given minute, there will be at least 5 calls.

3.23 (Johnson, R. A., 2000, 131). At a checkout counter customers arrive at an average
of 1.5 per minute. Find the probabilities that

(a) at most 4 will arrive in any given minute.


(b) at least 3 will arrive during an interval of 2 minutes.
(c) at most 15 will arrive during an interval of 6 minutes.

3.24 (Johnson, R. A., 2000, 129). If on the average three trucks arrive per hour to be
unloaded at a warehouse, find the probability that at most 20 will arrive during
an 8-hour day shift.

3.25 (Johnson, R. A., 2000, 140). The number of weekly breakdowns of a computer
is a random variable having a Poisson distribution with λ = 0.3. What is the
probability that the computer will operate without a breakdown for 2 consecutive
weeks?

3.26 In a shipment of 50 hard disks, five are defective. If four of the disks are randomly
selected for inspection, what is the probability that more than 2 are defective?
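Exercise 3.26 is hypergeometric, and a sketch in R uses dhyper/phyper, whose arguments are (x, m = number of defectives, n = number of non-defectives, k = sample size):

```r
# Exercise 3.26: 5 defective disks among 50, sample of 4 without replacement
1 - phyper(2, m = 5, n = 45, k = 4)   # P(X > 2), i.e. P(X = 3 or 4)
```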

3.27 A foundry ships engine blocks in lots of 20. Three items are selected and tested.
If the lot actually contains five defective items, find the probability that there will
be at least 2 defective blocks in the sample?

3.28 During the course of an hour 1000 bottles of soft drinks are filled by a particular
machine. Each hour a sample of 20 bottles is randomly selected and the number
of ounces of soft drink per bottle is checked. Suppose that during a particular hour
100 underfilled bottles are produced. What is the probability that at least 3
underfilled bottles will be among those sampled?

3.29 Twenty microprocessor chips are in stock. Three have etching errors that cannot be
detected by the naked eye. Five chips are selected and installed in field equipment.
Find the probability that at least one chip with an etching error will be chosen.

3.30 Production line workers assemble 15 automobiles per hour. During a given hour,
four are produced with improperly fitted doors. Three automobiles are selected at
random and inspected. Find the probability that at most one will be found with
improperly fitted doors.

Chapter 4

Continuous Random Variables

The following table contains some of the most well-known, and often used continuous
distributions in Engineering

Distribution   Probability function                                  Mean                 Variance

Exponential    f(x) = λ e^(−λx),                                     1/λ                  1/λ²
               x > 0, λ > 0

Normal         f(x) = (2πσ²)^(−1/2) e^(−(x−µ)²/(2σ²)),              µ                    σ²
               −∞ < x < ∞

Gamma          f(x) = x^(α−1) e^(−x/β) / (β^α Γ(α)),                αβ                   αβ²
               0 ≤ x < ∞, 0 < α < ∞, 0 < β < ∞

Weibull        f(x) = (α/β) x^(α−1) e^(−x^α/β),                     β^(1/α) Γ(1 + 1/α)   β^(2/α) [Γ(1 + 2/α) − Γ²(1 + 1/α)]
               0 ≤ x < ∞, 0 < α < ∞, 0 < β < ∞

Lognormal      f(x) = (1/(xσ√(2π))) e^(−(ln x − µ)²/(2σ²)),         exp(µ + σ²/2)        [e^(σ²) − 1] e^(2µ+σ²)
               0 < x < ∞, µ > 0, σ > 0

Beta           f(x) = [Γ(α+β)/(Γ(α)Γ(β))] x^(α−1) (1 − x)^(β−1),   α/(α+β)              αβ/[(α+β)²(α+β+1)]
               0 ≤ x ≤ 1, 0 < α < ∞, 0 < β < ∞

Table 4.1: Some Continuous Random Variables and Their Means and Variances.
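Each distribution in Table 4.1 has a corresponding family of base-R functions (prefix d for the PDF, p for the CDF). A word of caution for instructors: R's parameterizations do not always match the table, so the mapping sketched below (in particular shape/scale for the gamma and Weibull) should be checked against the help pages before use:

```r
# PDF (d*) and CDF (p*) calls for the families in Table 4.1
dexp(1, rate = 0.5);                pexp(1, rate = 0.5)                 # Exponential: rate = lambda
dnorm(1, mean = 0, sd = 1);         pnorm(1, mean = 0, sd = 1)          # Normal
dgamma(1, shape = 2, scale = 3);    pgamma(1, shape = 2, scale = 3)     # Gamma: alpha = shape, beta = scale
dweibull(1, shape = 2, scale = 3);  pweibull(1, shape = 2, scale = 3)   # Weibull (R's own shape/scale form)
dlnorm(1, meanlog = 0, sdlog = 1);  plnorm(1, meanlog = 0, sdlog = 1)   # Lognormal
dbeta(0.5, shape1 = 2, shape2 = 3); pbeta(0.5, shape1 = 2, shape2 = 3)  # Beta
```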

4.1 The Exponential Distribution
Example 4.1 Life length of a particular type of battery follows exponential distribution
with mean 2 hundred hours. Find the probability that the

(a) life length of a particular battery of this type is less than 2 hundred hours.

(b) life length of a particular battery of this type is more than 4 hundred hours.

(c) life length of a particular battery of this type is less than 2 hundred hours or more
than 4 hundred hours.

Solution Let X = life length of a battery. Then X ∼ Exp(λ) with µ = 1/λ = 2
(given), so that λ = 1/2 = 0.5.

(a) P(X < 2) = F(2) = 1 − e^(−(0.5)(2)) = 1 − e^(−1) = 0.632

(b) P(X > 4) = 1 − P(X ≤ 4) = 1 − [1 − e^(−(0.5)(4))] = e^(−2) = 0.135

(c) P(X < 2 or X > 4) = P(X < 2) + P(X > 4) = (1 − e^(−1)) + e^(−2) = 0.767

Computing Exponential Probabilities Using R


To solve Example 4.1 in R, we compute exponential probabilities using the family of
functions with root name exp, the R name for the exponential distribution. For an
exponential distribution with rate parameter λ as in Table 4.1, the associated PDF
and CDF are obtained in R as dexp(x, λ) and pexp(x, λ), respectively; note that the
second argument is the rate λ, the reciprocal of the mean.
Now, recomputing example 4.1 in R we have

(a) > pexp(2, 0.5)


[1] 0.6321206

(b) > 1-pexp(4, 0.5)


[1] 0.1353353

(c) > pexp(2, 0.5)+ 1-pexp(4, 0.5)


[1] 0.7674558

4.2 The Normal Distribution
When the mean of a normal distribution equals 0, and the variance equals 1, we get
what we call a standard normal random variable Z. Its density is given by

f(z) = (1/√(2π)) e^(−z²/2),   −∞ < z < ∞

Computing Standard Normal Probabilities Using Tables
Using the standard normal probability table in Appendix A2 we can find the
following:

P(Z < 2.13) = 0.9834
P(Z > −1.68) = 0.9535
P(−1.02 < Z < 1.51) = 0.9345 − 0.1539 = 0.7806
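The same table look-ups can be verified in R (mean = 0 and sd = 1 are the defaults of pnorm):

```r
# Standard normal probabilities via pnorm
pnorm(2.13)                  # P(Z < 2.13)
1 - pnorm(-1.68)             # P(Z > -1.68)
pnorm(1.51) - pnorm(-1.02)   # P(-1.02 < Z < 1.51)
```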

Computing Normal Probabilities Using Tables


Example 4.2 A manufacturing process has a machine that fills coke to 300 ml bottles.
Over a long period of time, the average amount dispensed into the bottles is 300 ml,
but there is a standard deviation of 5 ml in this measurement. If the amounts of
fill per bottle can be assumed to be normally distributed, find the probability that
the machine will dispense between 295 and 310 ml of liquid in any one bottle. (cf.
Scheaffer and McClave, 1995, 216-217).
Solution Let X = amount of fill in a bottle. Then X ∼ N (300, 52 ).

 
P(295 < X < 310) = P((295 − 300)/5 < (X − 300)/5 < (310 − 300)/5)
                 = P(−1 < Z < 2)
                 = 0.9772 − 0.1587 = 0.8185
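Example 4.2 can be checked directly in R, without standardizing by hand:

```r
# Example 4.2: X ~ N(300, 5^2); P(295 < X < 310)
pnorm(310, mean = 300, sd = 5) - pnorm(295, mean = 300, sd = 5)
```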

Example 4.3 The compressive strength of samples of cement can be modeled by


a normal distribution with a mean of 6000 kilograms per square centimeter and a
standard deviation of 100 kilograms per square centimeter.

(a) What is the probability that the strength of a sample is less than 6164.5 kg/cm2 ?

(b) What compressive strength is exceeded 95% of the time?
(c) What compressive strength is exceeded 5% of the time?

Solution

(a) P(X < 6164.5) = P((X − 6000)/100 < (6164.5 − 6000)/100) = P(Z < 1.645) ≈ 0.95

(b) P(X > u) = 0.95 → P((X − µ)/σ > (u − µ)/σ) = 0.95.

From the Standard Normal Probability Table, P(Z > −1.645) = 0.95 so that by
comparison we have (u − µ)/σ = −1.645 → u = µ − 1.645σ = 5835.5.

(c) Similarly, P(X > u) = 0.05 gives (u − µ)/σ = 1.645 → u = µ + 1.645σ = 6164.5.

Computing Normal Probabilities Using R


To solve Example 4.3 in R, we compute normal probabilities using the family of
functions with root name norm, the R name for the normal probability distribution.
For a normal distribution with parameters as shown in Table 4.1, the associated PDF
and CDF can be obtained in R as dnorm(x, µ, σ) and pnorm(x, µ, σ), respectively.
Now, recomputing Example 4.3 in R we have

(a) > pnorm(6164.5, 6000,100)


[1] 0.9500151
(b) > qnorm(0.95,mean=6000,sd=100, lower.tail=FALSE)
[1] 5835.515

OR

> qnorm(0.05,mean=6000,sd=100)
[1] 5835.515

(c) > qnorm(0.05,mean=6000,sd=100,lower.tail=FALSE)


[1] 6164.485

OR

> qnorm(0.95,mean=6000,sd=100)
[1] 6164.485

Note: qnorm() deals by default with areas below the given boundary value. For
instance, in part (b) above, computing qnorm(0.95,mean=6000,sd=100) without
setting the lower.tail argument would have returned the 95th percentile, 6164.485,
rather than the desired value 5835.5, which is the 5th percentile.

Computing Standard Normal Probabilities Using R
Consider now the standard Normal N (µ = 0, σ 2 = 1) distribution. It is possible to
determine the height of the density curve given a value of z, the cumulative area given
a value of z, or a z value given a cumulative area. For instance, to find the Area to
the Left of z = 1.39 i.e P (Z < 1.39), we will proceed as follows

> pnorm(1.39, 0,1)


[1] 0.9177356

It is important to mention that the default settings are for the mean=0 and sd=1, so
we could obtain the area to the left without mentioning these parameters in the call
to the function pnorm above. This is also the case with the functions dnorm, qnorm
and rnorm. That is

> pnorm(1.39)
[1] 0.9177356

To find the Area to the Right of a value, find the probability to the left and subtract
it from 1, e.g. 1-pnorm(1.39). Alternatively, you could set the lower.tail argument
to obtain the Area to the Right of a value, i.e. pnorm(1.39,lower.tail=FALSE)

Calculate a z Value Given the Cumulative Probability
To find the z value for a cumulative probability of 0.025, i.e. P(Z < z) = 0.025

> qnorm(0.025) OR qnorm(p=0.025,mean=0, sd=1)


[1] -1.959964

Finding the Values zα and zα/2 of the Standard Normal Random Variable
The value of the standard normal random variable for which the probability is α to its
right is denoted by zα and α is called the tail probability. For instance, if we have
P (Z > 1.439531) = 0.075, then the value 1.439531 has a probability of 0.075 to its
right, i.e., z0.075 = 1.439531. In R, we compute α as follows

> pnorm(1.439531,lower.tail=FALSE)
[1] 0.07500007
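Going the other way, given a tail probability α, the value zα can be recovered with qnorm:

```r
# Recover z_alpha from the tail probability alpha = 0.075
qnorm(0.075, lower.tail = FALSE)   # z_0.075
qnorm(1 - 0.075)                   # equivalent: the 92.5th percentile
```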

Computing and Plotting Probabilities of Normal
Random Variables Using R
Consider the Normal N (4, 102 ) distribution. To calculate the height of the PDF at
x = 0 in R we use:

> dnorm(x=0, mean=4, sd=10)


[1] 0.03682701

To plot the PDF curve for this distribution, we use the following scripts:

> xn<-seq(from=-26, to=34, by=0.2)


> yxn<-dnorm(xn, mean=4, sd=10)
> plot(xn, yxn, type="l") #This is the letter "l"

When the plot’s type is specified as “l” a line plot is generated. This implies that
instead of plotting symbols at each of the coordinates, the density curve is represented
graphically by connecting successive points with lines. Using the main="" argument
of the plot function, you can provide a title.
Then, the CDF values can be calculated as follows:

> pnorm(q=4, mean=4, sd=10)


[1] 0.5
> pnorm(q=c(4, 8, 12), mean=4, sd=10)
[1] 0.5000000 0.6554217 0.7881446

We can calculate the quantile or inverse CDF function as follows:

> qnorm(p=0.975)
[1] 1.959964
> qnorm(p=0.975, mean=4, sd=10)
[1] 23.59964
> qnorm(p=0.5, mean=4, sd=10)
[1] 4

One or more random observations can be generated from a specified Normal distribu-
tion. For instance, n = 5 from the N (4, 102 ) distribution.

> rnorm(n=5, mean=4, sd=10)


[1] 8.245934 10.212281 -7.596698 31.125897 18.880339

Plotting a PDF on a histogram: Given the previous description, utilizing the
"dname" (e.g. dnorm) and "lines" functions to superimpose a PDF on a histogram is
fairly simple.
For instance, to overlay a Normal PDF, we would first generate the coordinates
through which the line will pass using the dnorm function, and then build the line
on the plot using the graphical lines function. This can be demonstrated using the
following simulated data:

> xsim<-rgamma(n=100, shape=200, scale=2) # simulates n=100


> #observations from the
> #Ga(shape=200, scale=2)
> #distribution.
> #Nb. in R, the Gamma pdf with
> #shape=a and scale=s is
> #f(y)= const. y^(a-1).exp(-y/s).
> hist(xsim, freq=F) # specify the number of bins required
> #as well if you wish.
> xv<-seq(from=310, to=470, by=0.5)
> yv<-dnorm(x=xv, mean=mean(xsim), sd=sd(xsim)) # Normal PDF evaluated
> #with the sample mean and sd
> #of the simulated data xsim.
> lines(xv, yv) # adds a line to the histogram plot (the
> #currently active plot) which passes through all
> #the (xv[i], yv[i]) pairs.

Exercises
4.1 The tread wear (in thousands of kilometers) that car owners get with a certain
kind of tire is a random variable whose probability density is given by
f(x) = (1/30) e^(−x/30),   0 ≤ x < ∞
(a) Find the probability that one of these tires will last at most 18000 kilometers.
(b) Find the probability that one of these tires will last anywhere from 27000 to
36000 kilometers.
(c) Comment on the probability in (a) if the mean time to failure is β =
10000, 20000, 30000, 40000, 50000, 60000.
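Parts (a) and (b) of Exercise 4.1 can be sketched in R with pexp; with a mean of 30 (thousand kilometers), the rate is 1/30, and 18000 km is x = 18 in the units of the density:

```r
# Exercise 4.1: tread wear X ~ Exp(rate = 1/30), in thousands of km
pexp(18, rate = 1/30)                           # (a) P(X <= 18)
pexp(36, rate = 1/30) - pexp(27, rate = 1/30)   # (b) P(27 <= X <= 36)
```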

4.2 A transistor has an exponential time to failure distribution with mean time to
failure of β = 20, 000 hours.

(a) What is the probability that the transistor fails by 30,000 hours?
(b) The transistor has already lasted 20, 000 hours in a particular application.
What is the probability that it fails by 30, 000 hours?
(c) Comment on the probability in (a) if β = 10000; 20000; 30000; 40000; 50000; 60000.

4.3 The lifetime X (in hours) of the central processing unit of a certain type of
microcomputer is an exponential random variable with parameter 0.001. What is
the probability that the unit will work at least 1, 500 hours?

4.4 The lifetime (in hours) of the central processing unit of a certain type of micro-
computer is an exponential random variable with mean β = 1000.

(a) What is the probability that a central processing unit will have a lifetime of
at least 2000 hours?
(b) What is the probability that a central processing unit will have a lifetime of
at most 2000 hours?

4.5 The amount of raw sugar that one plant in a sugar refinery can process in one
day can be modeled as having an exponential distribution with a mean of 4 tons.
What is the probability that any plant processes more than 4 ln 2 tons of sugar on
a day?

4.6 (Johnson, R. A., 2000, 172). The amount of time that a surveillance camera will
run without having to be reset is a random variable having the exponential
distribution with λ = 50 days. Find the probabilities that such a camera will

(a) have to be reset in less than 20 days;

(b) not have to be reset in at least 60 days.

4.7 (Johnson, R. A., 2000, 197). Consider a random variable having the exponential
distribution with parameter λ = 0.25. Find the probabilities that

(a) it takes values more than 200;


(b) it takes values less than 300.

4.8 (Johnson, R. A., 2000, 168). If on the average three trucks arrive per hour to be
unloaded at a warehouse, find the probability that the time between the arrivals
of successive trucks will be less than 5 minutes.

4.9 (Johnson, R. A., 2000, 172). The number of weekly breakdowns of a computer is
a random variable having a Poisson distribution with λ = 0.3. Find the percent
of the time that the interval between the breakdowns of the computer will be

(a) less than one week;


(b) at least 5 weeks.

4.10 (Johnson, R. A., 2000, 173). The switchboard of a consultant’s office receives
on the average 0.6 calls per minute. Find the probabilities that the time
between the successive calls arriving at the switchboard of the consulting firm will
be

(a) less than 1/2 minute;

(b) more than 3 minutes.

4.11 Let Z have a standard normal distribution. Then evaluate the following:

(a) P (Z ≥ 1.96)
(b) P (Z ≤ −1.96)
(c) P (Z ≥ −1.96)
(d) P (−1.645 < Z ≤ −1.28)
(e) P (−1.64 < Z ≤ 1.96)
(f) P (|Z| ≤ 1.96)
(g) P (|Z| > 1.96)

4.12 Solve the following Probability equations to find normal percentiles

(a) P (Z ≥ z) = 0.05
(b) P (Z ≤ −z) + P (Z ≥ z) = 0.25
(c) P (Z ≥ −z) + P (Z ≤ z) = 0.95
(d) P (Z ≤ −z) = 0.05
(e) P (Z ≤ z) = 0.95
(f) P (−1.645 < Z ≤ −z) = 0.15
(g) P (−1.645 < Z ≤ −z) = 0.90
(h) P (|Z| ≤ z) = 0.95
(i) P (|Z| > z) = 0.05.
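A few parts of Exercise 4.12 can be sketched with qnorm; the key step is translating each equation into a cumulative (lower-tail) probability first:

```r
# Selected parts of Exercise 4.12 via qnorm
qnorm(0.95)    # (a) P(Z >= z) = 0.05  gives  z = z_0.05
qnorm(0.05)    # (d) P(Z <= -z) = 0.05 gives  -z
qnorm(0.975)   # (h) P(|Z| <= z) = 0.95 gives z = z_0.025
```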

4.13 Complete the table where the α’s are the tail probabilities of the standard normal
random variable.

α 1−α Zα Z1−α Zα/2


0.005 2.575829
- 2.326348
2.170090
2.241403
0.05
1.644854
0.200
0.250
0.7
0.600
0.750
0.900
0.990

4.14 (cf. Devore, J. L., 2000, 171). Let X denote the number of flaws along a 100-m
reel of magnetic tape. Suppose X has approximately a normal distribution with
mean µ and standard deviation σ. Calculate the probability that the number of
flaws is

(a) between 20 and 30.


(b) at most 30.
(c) less than 30.
(d) not more than 25.
(e) at most 10

4.15 (Johnson, R. A., 2000, 196). If a random variable has the standard normal
distribution, find the probability that it will take on a value

(a) between 0 and 2.50;


(b) between 1.22 and 2.35;
(c) between –1.33 and –0.33;
(d) between –1.60 and 1.80.

4.16 The length of each component in an assembly is normally distributed with mean 6
inches and standard deviation σ inch. Specifications require that each component
be between 5.7 and 6.3 inches long. What proportion of components will pass
these requirements? Comment by varying σ as 0.05, 0.10, 0.15, 0.20, 0.25, 0.30,
0.35, etc.

4.17 A machining operation produces steel shafts having diameters that are normally
distributed with a mean of 1.005 inches and a standard deviation of 0.01 inch.
Specifications call for diameters to fall within the interval 1.00 ± 0.02 inches.

(a) What percentage of the output of this operation will fail to meet specifications?
(b) Comment on the percentage in (a) if σ increases.

4.18 The weekly amount spent for maintenance and repairs in a certain company has
approximately a normal distribution with a mean of $400 and a standard deviation
of $20.

(a) If $450 is budgeted to cover repairs for next week, what is the probability
that the actual costs will exceed the budgeted amount?
(b) Comment on the probability in part (a) if µ changes, keeping σ fixed.
(c) Comment on the probability in part (a) if σ changes, keeping µ fixed

4.19 A type of capacitor has resistance that varies according to a normal distribution
with a mean of 800 megohms and a standard deviation of 200 megohms (Nelson,
Industrial Quality Control, 1967, pp. 261-268). A certain application specifies
capacitors with resistances between 900 and 1000 megohms. If 30 capacitors are
randomly chosen from a lot of capacitors of this type, what is the probability that
at least 4 of them will satisfy the specification?

4.20 The fracture strengths of a certain type of glass average 14 (in thousands of pounds
per square inch) and have a standard deviation of 1.9 thousand psi. What proportion
of these glasses will have fracture strength exceeding 14.5 thousand psi?

4.21 Suppose examination scores are normally distributed with mean 60 and variance
25.

(a) What value exceeds 25% of the scores?


(b) What value is exceeded by 25% of the scores?
(c) What is the minimum score to get A+ if the top 3% students get A+?
(d) What is the maximum score leading to failure if the bottom 20% of students
fails?

4.22 The life of a semi-conductor laser at a constant power is normally distributed with
a mean of 7000 hours and a standard deviation of 600 hours.

(a) What is the probability that the laser fails before 5000 hours?
(b) What is the life in hours that 95% of the lasers exceed?
(c) What life (in hours) is exceeded by 5% of the lasers?

4.23 The reaction time of a driver to visual stimulus is normally distributed with a
mean of 0.40 seconds and a standard deviation of 0.05 second.

(a) What is the probability that a reaction requires more than 0.50 second?
(b) What is the probability that a reaction requires between 0.4 and 0.5 second?
(c) What is the reaction time that is exceeded 90% of the time?
(d) What reaction time is exceeded 10% of the time?

4.24 The personnel manager of a large company requires job applicants to take a certain
test and achieve a score of 500 or more. The test scores are distributed with mean
485 and standard deviation 30. What score is exceeded by 75% of the applicants?
Assume that the test scores are normally distributed.

4.25 The personnel manager of a large company requires employees to take a certain test.
The test scores are normally distributed with mean 485 and standard deviation 30.
The manager will promote those applicants whose scores exceed the 75th percentile,
and terminate those with scores less than the 25th percentile.

(a) What is the minimum score to have promotion in the job?


(b) What is the maximum score for getting terminated from the job?

4.26 A Company produces light bulbs whose lifetimes follow a normal distribution with
mean 500 hours and standard deviation 50 hours.

(a) If a light bulb is chosen randomly from the company’s output, what is the
probability that its lifetime will be between 417.75 and 582.25 hours?
(b) If thirty light bulbs are chosen at random, what is the probability that more
than half of them will survive more than the average lifetime?

4.27 (Devore, J. L., 2000, 164). The breakdown voltage of a randomly chosen diode of
a particular type is known to be normally distributed. What is the probability
that a diode’s breakdown voltage is within 1 standard deviation of its mean value?

4.28 (Devore, J. L., 2000, 164). The time that it takes a driver to react to the brake
lights on a decelerating vehicle is critical in helping to avoid rear-end collisions.
The article “Fast-Rise Brake Lamp as a Collision-Prevention Device” (Ergonomics,
1993: 391-395) suggests that reaction time for an in-traffic response to brake signal
from standard brake lights can be modeled with a normal distribution having
mean value 1.25 sec and standard deviation of .46 sec. What is the probability
that

(a) the reaction time is between 1.00 sec and 1.75 sec?
(b) the reaction time exceeds 2 sec?
(c) the reaction time is no more than 1.45 sec?

4.29 (Devore, J. L., 2000, 169). Suppose that the force acting on a column that helps
to support a building is normally distributed with mean 15.0 kips and standard
deviation 1.25 kips. What is the probability that the force

(a) is at most 17 kips?
(b) is between 10 and 12 kips?
(c) differs from 15 kips by at most 2 standard deviations?

4.30 (Johnson, R. A., 2000, 197). The burning time of an experimental rocket is a
random variable having the normal distribution with mean = 4.76 seconds and
standard deviation = 0.04 second. What is the probability that this kind of
rocket will burn

(a) less than 4.66 seconds?


(b) more than 4.80 seconds?
(c) anywhere from 4.70 to 4.82 seconds?

4.31 (Johnson, R. A., 2000, 172). If a random variable has the gamma distribution
with α = 2 and β = 2, find the probability that the random variable will take on
a value less than 4.

4.32 (Johnson, R. A., 2000, 172). In a certain city, the daily consumption of electric
power (in millions of kilowatt-hours) can be treated as a random variable having
a gamma distribution with α = 3 and β = 2. If the power plant of the city has
a daily capacity of 12 million kilowatt-hours, what is the probability that this
power supply will be adequate on any given day?
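Exercises 4.31 and 4.32 can be sketched with pgamma, matching Table 4.1's parameterization (α = shape, β = scale; R's second positional argument is the rate, so the scale must be passed by name):

```r
# Gamma probabilities with alpha = shape, beta = scale (passed by name)
pgamma(4, shape = 2, scale = 2)    # 4.31: P(X < 4)
pgamma(12, shape = 3, scale = 2)   # 4.32: P(X <= 12)
```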

4.33 (Johnson, R. A., 2000, 171). Suppose that the lifetime of a certain kind of an
emergency backup battery (in hours) is a random variable X having the Weibull
distribution with α = 0.1 and β = 0.5. Find

(a) the probability that such a battery will last more than 300 hours
(b) the probability that such a battery will last less than 380 hours
(c) the probability that such a battery will not last 100 hours.
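Assuming Johnson's Weibull has CDF F(x) = 1 − exp(−α x^β), while R's pweibull(q, shape, scale) has CDF 1 − exp(−(q/scale)^shape), the translation is shape = β and scale = α^(−1/β). This mapping is an assumption about the textbook's parameterization and should be verified against the source before relying on it; under it, Exercise 4.33 reads:

```r
# Exercise 4.33 assuming F(x) = 1 - exp(-alpha * x^beta), alpha = 0.1, beta = 0.5
alpha <- 0.1; beta <- 0.5
sc <- alpha^(-1 / beta)                        # translate to R's scale parameter
1 - pweibull(300, shape = beta, scale = sc)    # (a) P(X > 300)
pweibull(380, shape = beta, scale = sc)        # (b) P(X < 380)
pweibull(100, shape = beta, scale = sc)        # (c) P(X < 100)
```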

4.34 (Johnson, R. A., 2000, 173). Suppose that the time to failure (in minutes) of certain
electronic components subjected to continuous vibration may be looked upon as a
random variable having the Weibull distribution with α = 1/5 and β = 1/3. What
is the probability that such a component will fail in less than 5 hours?

4.35 (Johnson, R. A., 2000, 173). Suppose that the service life (in hours) of a semicon-
ductor is a random variable having the Weibull distribution with α = 0.025 and
β = 0.500. What is the probability that such a semiconductor will still be in
operating condition after 4,000 hours?

4.36 (Johnson, R. A., 2000, 197). A mechanical engineer models the bending strength
of a support beam in a transmission tower as a random variable having the Weibull
distribution with α = 0.02 and β = 3.0. What is the probability that the beam
can support a load of 4.5?

4.37 (Johnson, R. A., 2000, 169). In a certain country the proportion of highway
sections requiring repairs in any given year is a random variable having a beta
distribution with α = 3 and β = 2.

(a) On the average what percentage of the highway sections requires repair in any
given year?
(b) Find the probability that at most half of the highway sections will require
repair in any given year?

4.38 (Johnson, R. A., 2000, 173). If the annual proportion of erroneous income tax
returns filed with the IRS can be looked upon as a random variable having a beta
distribution with α = 2 and β = 9, what is the probability that in any given year
there will be fewer than 10% erroneous returns?

4.39 (Johnson, R. A., 2000, 173). Suppose that the proportion of the defectives shipped
by a vendor, which varies somewhat from shipment to shipment, is a random
variable having the beta distribution with α = 1 and β = 4. Find the probability
that a shipment from this vendor will contain 25% or more defectives.
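The beta exercises reduce to pbeta calls; for instance, parts of Exercises 4.37 and 4.39 can be sketched as:

```r
# Beta probabilities with shape1 = alpha, shape2 = beta
pbeta(0.5, shape1 = 3, shape2 = 2)        # 4.37(b): P(X <= 1/2)
1 - pbeta(0.25, shape1 = 1, shape2 = 4)   # 4.39: P(X >= 0.25)
```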

Chapter 5

Sampling Distributions

In this chapter we will introduce one of the most important concepts in statistics, that
is the sampling distribution.
Example 5.1 A population consisting of exam grades of N = 5 students taking the
Engineering Statistics course is described below

Name Ordinal Grade Grade Points


Mohammad B 3
Abdallah C 2
Khaled B 3
Ali A 4
Saad C 2

Table 5.1: Distribution of Grades for the Engineering Statistics Class

It is easy to check that for grade points (x), the mean and variance are given by

µ = (1/N) Σᵢ xᵢ = 14/5 = 2.8,   σ² = (1/N) [Σᵢ xᵢ² − N µ²] = (42 − 5(2.8)²)/5 = 0.56

Now consider a sample of size 2, and prepare all possible samples without replacement.
In fact, there will be C(5, 2) = 10 samples without replacement. The sample means
and the corresponding probabilities are tabulated below:

x̄ Sample Units P (X̄ = x̄)
2.0 (Abdallah, Saad) 0.10
2.5 (Mohammad, Abdallah), (Mohammad, Saad) 0.40
(Abdallah, Khaled), (Khaled, Saad)
3.0 (Mohammad, Khaled), (Abdallah, Ali), (Ali, Saad) 0.30
3.5 (Mohammad, Ali), (Khaled, Ali) 0.20

Table 5.2: All Possible Samples of Size 2 with their Probabilities

The expected value of the sample mean is

E(X̄) = 2.0(0.10) + 2.5(0.40) + 3.0(0.30) + 3.5(0.20) = 2.8

which is the same as the population mean µ. The expected value of X̄ 2 is given by

E(X̄ 2 ) = 2.02 (0.10) + 2.52 (0.40) + 3.02 (0.30) + 3.52 (0.20) = 8.05

so that V(X̄) = 8.05 − (2.8)² = 0.21. It may be shown that

V(X̄) = [(N − n)/(N − 1)] (σ²/n) = [(5 − 2)/(5 − 1)] (0.56/2) = 0.21

That is, the variance of the sample mean X̄ can be calculated from the population
variance σ². Note that in the case of sampling with replacement, or when N is large,
(N − n)/(N − 1) ≈ 1 and consequently V(X̄) = σ²/n.

5.1 Sampling Distributions of Sums and Means and the Central Limit Theorem
Example 5.2 One hundred bolts are packed in a plastic box. The weight of the
empty box can be ignored. However, each bolt weighs around 1 ounce with standard
deviation σ = 0.01 ounce. Assume that weights of bolts follow a normal distribution.

(a) Find the probability that a box filled with hundred bolts weighs more than 100.196
ounces.

(b) Find the probability that the mean weight of 100 bolts is more than 1.00196
ounces.

Solution

(a)
X
Xi ∼ N 100, 100(0.01)2 = N 100, 0.12
 
i
! P 
X
i Xi − 100 100.196 − 100
P Xi > 100.196 = P >
i
0.1 0.1
= P (Z > 1.96) ≈ 0.025

(b) Since

X̄ ∼ N(1, 0.01²/100) = N(1, 10⁻⁶) = N(1, (10⁻³)²)

we have

P(X̄ > 1.00196) = P( (X̄ − 1)/10⁻³ > (1.00196 − 1)/10⁻³ ) = P(Z > 1.96) ≈ 0.025

Note that the events in (a) and (b) are equivalent.
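Both probabilities can be checked directly in R with pnorm (here the second and third arguments are the mean and the standard deviation):

```r
# (a) total weight: N(100, 0.1^2)
1 - pnorm(100.196, mean = 100, sd = 0.1)   # approximately 0.025
# (b) mean weight: N(1, 0.001^2)
1 - pnorm(1.00196, mean = 1, sd = 0.001)   # approximately 0.025, same event as (a)
```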


Example 5.3 The weights of ball bearings have a distribution with a mean of 22.40
ounces and a standard deviation of 0.048 ounces. If a random sample of size 36 is
drawn from this population, find the probability that the sample mean lies between
22.39 and 22.42.
Solution Let X = weight of a ball bearing. Then

X ∼ (µ = 22.40, σ² = 0.048²)

and, by the Central Limit Theorem,

X̄ ∼ N(22.40, 0.048²/36) = N(22.40, 0.008²)
and hence


P(22.39 < X̄ < 22.42) = P(−1.25 < Z < 2.5) ≈ 0.8882
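The same probability can be obtained directly in R with pnorm:

```r
# P(22.39 < X-bar < 22.42) where X-bar ~ N(22.40, 0.008^2)
pnorm(22.42, mean = 22.40, sd = 0.008) - pnorm(22.39, mean = 22.40, sd = 0.008)
# approximately 0.8881
```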

5.2 The Normal Approximation to the Binomial Distribution

Let X be a binomial random variable with n trials and success probability p. Then
probabilities of events related to X can be approximated by a normal distribution
with mean np and variance np(1 − p) if the conditions np ≥ 5 and n(1 − p) ≥ 5 are
satisfied.

Continuity Correction

It is the adjustment made to an integer-valued discrete random variable when it is
approximated by a continuous random variable. For a binomial random variable, we
inflate the events by adding or subtracting 0.5 to the event as follows:

{X : X = 4} = {X : 3.5 ≤ X ≤ 4.5}
{X : X < 4} = {X : X ≤ 3} = {X : X ≤ 3.5}
{X : X ≤ 4} = {X : X ≤ 4.5}
{X : X > 4} = {X : X ≥ 5} = {X : X ≥ 4.5}
{X : X ≥ 4} = {X : X ≥ 3.5}

The continuity correction should be applied anytime a discrete random variable is
being approximated by a continuous random variable.
Example 5.4 The pass mark in an examination is the median mark. A random
sample of 10 candidates is chosen after the examination.

(a) What is the distribution of the random variable “the number of students who
passed the examination”?

(b) Find the probability that more than 2 of the selected candidates passed the
examination.

(c) Find the probability that at least 2 of the selected candidates passed the exami-
nation.

(d) Solve parts (b) and (c) using the normal approximation.

Solution

(a) Any candidate picked either passes or fails the examination (i.e. mutually exclusive
outcomes at each trial). Since the pass mark is the median mark, it means that
50% of the candidates passed the examination so that p = 0.5. The trials are
assumed to be independent. Thus, the random variable “the number of students
who passed the examination” is a binomial random variable with n = 10 and
p = 0.5.

(b) If we represent the random variable in (a) by X, then we are interested in the
probability P (X > 2). By the use of binomial probability we have

P (X > 2) = 1 − P (X ≤ 2) = 1 − [P (X = 2) + P (X = 1) + P (X = 0)]
= 1 − 0.055 = 0.945

(c) By the use of binomial probability we have
P (X ≥ 2) = 1 − P (X ≤ 1) = 1 − [P (X = 1) + P (X = 0)]
= 1 − 0.011 = 0.989

(d) Since np = n(1 − p) = 5, we use the normal approximation to the binomial, and so
the random variable X has approximately a normal distribution with mean np = 5 and
variance np(1 − p) = 2.5. To solve the problems based on the approximation in R,
we employ either dnorm(x, µ, σ) or pnorm(q, µ, σ) or both.

(b) P (X > 2) = P (X ≥ 2.5)

1-pnorm(2.5,5, sqrt(2.5))
[1] 0.9430769

(c) P (X ≥ 2) = P (X ≥ 1.5)

> 1-pnorm(1.5,5, sqrt(2.5))


[1] 0.9865717
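For comparison, the exact binomial answers in parts (b) and (c) can be computed in R with pbinom:

```r
# (b) P(X > 2) = 1 - P(X <= 2), X ~ Binomial(10, 0.5)
1 - pbinom(2, size = 10, prob = 0.5)   # 0.9453125, i.e. 0.945 to three decimals
# (c) P(X >= 2) = 1 - P(X <= 1)
1 - pbinom(1, size = 10, prob = 0.5)   # 0.9892578, i.e. 0.989 to three decimals
```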

5.3 Drawing a Random Sample from a known Distribution

Random samples are drawn from a known distribution by evaluating the inverse
cumulative distribution function at random probabilities. For example, suppose a
random variable has a cumulative distribution function F, i.e. F(x) = P(X ≤ x) = p,
where x is a value in the domain of X, and hence x = F⁻¹(p). Thus, by supplying
random values between 0 and 1 for p, we obtain values of x, which constitute a
random sample from the distribution with cumulative distribution function F. This
approach follows from the transformation of random variables in basic probability
theory and is probably the simplest technique for simulating samples from specified
univariate distributions. The procedure is very easy to implement when F is easy to
invert. For example, suppose that X ∼ Exp(λ); then F(x) = 1 − exp(−λx). Setting
u = p = F(x) and rearranging gives x = F⁻¹(u):

u = 1 − exp(−λx)
exp(−λx) = 1 − u
−λx = log(1 − u)
x = −(1/λ) log(1 − u)
Now, it is important to note that V = 1 − U ∼ U[0, 1], and therefore if G(x) = 1 − F(x),
then G⁻¹(V) also has distribution function F. Put differently, if u₁, u₂, · · · denote
realizations from U[0, 1], then the values yᵢ = G⁻¹(uᵢ) constitute a random sample
from F.

Now, for this exponential example,

G−1 (u) = −λ−1 log(u)

Therefore, depending on which is most practical, we can employ either F or G. To
obtain a sample of size n from the exponential distribution with parameter λ, i.e. Exp(λ),
the aforementioned procedure can be easily implemented in R.
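As a quick sanity check (a short sketch, not part of the original lab code), the inverse F⁻¹(u) = −log(1 − u)/λ derived above agrees with R's built-in quantile function qexp:

```r
lambda <- 0.5
u <- c(0.1, 0.5, 0.9)           # probabilities playing the role of p = F(x)
x <- -log(1 - u) / lambda       # hand-derived inverse CDF
x                               # identical to qexp(u, rate = lambda)
qexp(u, rate = lambda)
# G^{-1}(u) = -log(u)/lambda yields samples from the same
# distribution because 1 - U ~ U[0, 1]
```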
Sampling via the Uniform Distribution To develop a simulation of the sampling
distribution of the mean from an exponentially distributed population, using 100
samples of size n = 30, we have

nSim<-100
n<-30
lambda=0.5
means<-array(0, dim=nSim) #Arrays storing the sample means
for(i in 1:nSim) #loop iterating across simulations
{
u=runif(n,0,1) #Shorthand for this command would be u=runif(n)
x=-log(1-u)/lambda
means[i]<-mean(x)
}
hist(means) #Histogram of the sample means
means #This gives the output for the Means of 100
#random samples of Size 30 from the exp(lambda)

Alternatively, we may quickly generate random observations from the exponential
distribution in R by using the rexp function. For example, the code below will give
results similar to those above.

set.seed(1000)#Create simulated values that are reproducible


n <- 30
lambda <- 0.5
simu <- 100
dataset <- matrix(rexp(n*simu, lambda), nrow = n, ncol = simu)
sampleMeans <- NULL
for (i in 1:simu)
{
sampleMeans[i] <- mean(dataset[, i])
}
SamMean <- mean(sampleMeans) ##This is the value of the sample mean
TheoMean <- 1/lambda ##Calculate the theoretical mean of the distribution
hist(sampleMeans)

Sample x̄ Sample x̄ Sample x̄ Sample x̄
1 2.03083 26 2.06614 51 2.14173 76 1.77623
2 1.93572 27 1.68173 52 1.43473 77 1.64866
3 1.56200 28 1.41266 53 2.04715 78 2.21201
4 2.28174 29 2.33124 54 2.13129 79 1.79238
5 1.73629 30 2.04047 55 2.11185 80 2.24587
6 1.53249 31 1.84683 56 1.66601 81 1.95927
7 2.31160 32 2.09427 57 2.22853 82 1.91424
8 2.09425 33 2.49579 58 2.15453 83 2.11859
9 1.71859 34 1.98080 59 2.03769 84 1.51129
10 2.55306 35 2.56042 60 2.17910 85 1.63826
11 2.41056 36 1.41034 61 2.48720 86 2.21952
12 1.91694 37 1.44917 62 1.64028 87 2.30895
13 2.51130 38 2.26028 63 1.75977 88 1.84262
14 2.33719 39 2.47431 64 2.55026 89 1.88155
15 2.41996 40 1.66230 65 2.70323 90 1.76893
16 2.40229 41 1.95616 66 2.65496 91 1.81046
17 2.69608 42 2.05711 67 1.85997 92 1.81857
18 2.13224 43 1.88952 68 1.92766 93 1.97177
19 1.66660 44 1.99126 69 1.76134 94 2.36127
20 1.88309 45 1.37156 70 1.89358 95 1.70884
21 1.70845 46 2.36135 71 2.21756 96 2.29309
22 2.24863 47 1.86295 72 2.45860 97 1.57450
23 2.49552 48 2.37365 73 2.17030 98 2.61525
24 1.96742 49 1.85437 74 2.80880 99 2.34621
25 1.98953 50 2.17150 75 1.80652 100 1.78427

Table 5.3: Means of 100 Random Samples of Size 30 from the Exponential Distribution

5.4 Use of t, χ2 and F Tables

Student’s T: Using the student t table in Appendix A3 we can find the following:
For a t random variable with 8 degrees of freedom, P (t > 2.896) = 0.01.
For a t random variable with 17 degrees of freedom, P (t < –2.898) = 0.005.
For a t random variable with 21 degrees of freedom, P (t < –1.721) = 0.05

Finding the area under t-distribution using R

In R, to find P(t ≤ x) with n degrees of freedom (df), we employ the function name t,
which is the R name corresponding to the Student's t probability distribution. For the
t distribution with n degrees of freedom, the associated PDF and CDF can be
obtained in R as dt(x, df = n) and pt(q, df = n), respectively. For the above given
examples, we have

[Histogram omitted: frequencies (0 to 20) of the 100 sample means, which range over about 1.5 to 3.0]

Figure 5.1: Histogram of the sample means in Table 5.3

> 1-pt(2.896, df=8)


[1] 0.01000705
OR
> pt(2.896, df=8, lower.tail=FALSE)
[1] 0.01000705

> pt(-2.898,df=17)
[1] 0.005002443

> pt(-1.721, df=21)


[1] 0.04997625

Chi-square: Using the χ2 table in Appendix A4 we can find the following:
For a χ2 random variable with 7 degrees of freedom, P (χ2 > 2.17) = 0.95
For a χ2 random variable with 17 degrees of freedom, P (χ2 < 33.41) = 0.99
For a χ2 random variable with 28 degrees of freedom, P (χ2 > 48.28) = 0.01

Finding the area under χ2 distribution using R

In R, to find P(χ² ≤ x) with n degrees of freedom (df), we employ the function name
chisq, which is the R name corresponding to the chi-square (χ²) probability
distribution. For the χ² distribution with n degrees of freedom, the associated PDF
and CDF can be obtained in R as dchisq(x, df = n) and pchisq(q, df = n),
respectively. For the above given examples, we have

> pchisq(2.17, df=7,lower.tail=FALSE)


[1] 0.9498349
OR
1-pchisq(2.17, df=7)
[1] 0.9498349

> pchisq(33.41, df=17)


[1] 0.9900039

> pchisq(48.28, df=28,lower.tail=FALSE)


[1] 0.009995604
OR
> 1-pchisq(48.28, df=28)
[1] 0.009995604

F : Using the F table in Appendix A5 we can find the following:


For an F random variable with 4 and 7 degrees of freedom, P (F > 4.12) = 0.05
For an F random variable with 15 and 21 degrees of freedom, P (F > 2.534) = 0.025
For an F random variable with 12 and 9 degrees of freedom, P (F < 5.111) = 0.99

Finding the area under F distribution using R

In R, to find P(F ≤ x) with n₁ and n₂ degrees of freedom (df1 and df2), we employ the
function name f, which is the R name corresponding to the F probability distribution.
For the F distribution with n₁ and n₂ degrees of freedom, the associated PDF and
CDF can be obtained in R as df(x, df1 = n₁, df2 = n₂) and pf(q, df1 = n₁, df2 = n₂),
respectively. For the above given examples, we have

> pf(4.12, df1=4, df2=7,lower.tail=FALSE)
[1] 0.05000849
OR
1-pf(4.12, df1=4, df2=7)
[1] 0.05000849

> pf(2.534, df1=15, df2=21,lower.tail=FALSE)


[1] 0.02498872
OR
1-pf(2.534, df1=15, df2=21)
[1] 0.02498872

> pf(5.111, df1=12, df2=9)


[1] 0.9899971
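Conversely, the table lookups above can be inverted in R with the quantile functions qt, qchisq and qf, which return the critical value for a given tail area (useful for the exercises that ask for tα, χ²α and Fα):

```r
qt(0.01, df = 8, lower.tail = FALSE)            # t value with upper-tail area 0.01, df = 8: 2.896
qchisq(0.95, df = 7, lower.tail = FALSE)        # chi-square value with upper-tail area 0.95, df = 7: 2.17
qf(0.05, df1 = 4, df2 = 7, lower.tail = FALSE)  # F value with upper-tail area 0.05: 4.12
```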

Exercises

5.1 (Johnson, R. A., 2000, 212). A random sample of size 100 is taken from an infinite
population having mean 76 and variance 256. What is the probability that the
sample mean will be between 75 and 78?

5.2 (Johnson, R. A., 2000, 212). A wire-bonding process is said to be in control if the
mean pull-strength is 10 pounds. It is known that the pull-strength measurements
are normally distributed with a standard deviation of 1.5 pounds. Periodic random
samples of size 4 are taken from this process and the process is said to be "out of
control" if a sample mean is less than 7.75 pounds. Comment.

5.3 The weights of ball bearings have a distribution with a mean of 22.40 ounces and
a standard deviation of 0.048 ounces. If a random sample of size 49 is drawn from
this population, find the probability that the

(a) sample mean lies between 22.36 and 22.41,


(b) sample mean is more than 22.38,
(c) sample mean is no more than 22.43,
(d) sample mean is greater than or equal to 22.41

5.4 Suppose X is normally distributed with mean 50 and variance 9. Let X̄ be a


random variable in the sense of drawing repeated samples of size 16 from the
distribution of X. Find the probability that

(a) X̄ differs from the mean by less than 2.5 units,


(b) X̄ differs from the mean by more than 1.5 units,

(c) X̄ is between 1.8 and 2.6.

5.5 Consider a binomial random variable with 50 trials and success probability 0.39.

(a) Compute the probability that it is at least equal to 15.


(b) Compute the probability that it is at most equal to 12.
(c) Compute the probability that it is equal to 20.
(d) Compute the probability that it is equal to 30.
(e) Repeat (a) to (d) using the normal approximation to the binomial

5.6 Consider a binomial random variable with 100 trials and success probability 0.45.

(a) Compute the probability that it is at least equal to 25.


(b) Compute the probability that it is at most equal to 32.
(c) Compute the probability that it is less than or equal to 20.
(d) Compute the probability that it is equal to 40.
(e) Repeat (a) to (d) using the normal approximation to the binomial.

5.7 Draw a random sample of size 200 from a normal distribution with mean 40 and
variance 36. Compute the mean and standard deviation of your sample.

5.8 Draw a random sample of size 60 from a normal distribution with mean 10 and
variance 4. Compute the mean and standard deviation of your sample.

5.9 Draw a random sample of size 100 from a normal distribution with mean 20 and
variance 25. Compute the mean and standard deviation of your sample.

5.10 Draw a random sample of size 100 from an exponential distribution with λ = 4.
Compute the mean and standard deviation of your sample.

5.11 Draw a random sample of size 120 from an exponential distribution with mean 2.
Compute the mean and standard deviation of your sample

5.12 Suppose X is normally distributed with mean 60 and variance 16. Find the
probability that X̄ based on samples of size 9, differs from the mean by less than
2.5 units.

5.13 Consider a binomial random variable with 20 trials and success probability 0.45.

(a) Compute the probability that it is at least equal to 3.


(b) Compute the probability that it is at most equal to 12.
(c) Compute the probability that it is equal to 3.
(d) Compute the probability that it is equal to 12.
(e) Repeat (a) to (d) using the normal approximation to binomial

5.14 Draw a random sample of size 36 from a normal distribution with mean 10 and
variance 4. Compute the mean and standard deviation of your sample.
5.15 Consider an exponential distribution with expected value 10. Draw r = 100 samples
of size n = 30 from the above population. Draw a relative frequency histogram
and relative frequency curve for the 100 sample means. Repeat the experiment
with r = 100 samples of size n = 100. What is the sampling distribution of the
sample means?
5.16 Consider the t random variable with 10 df .
(a) Find the proportion of the area to the right of 2.1.
(b) Find the probability that t is less than 2.
(c) Find the proportion of the area to the left of –2.1.
(d) Find the proportion of the area between –2.1 and +2.1.
(e) Find the proportion of the area between –1.2 and +2.1.
5.17 Consider a t random variable with 9 degrees of freedom.
Find t0.005 , t0.01 , t0.015 , t0.02 , t0.05 , t0.1 .
5.18 Complete the table for a t random variable with 19 df. Note that the relationship
between tα and t1−α is tα = −t1−α (0 ≤ α ≤ 1).

α 1−α tα tα/2 t1−α/2


0.01 2.539488
0.02
0.05
0.010 1.729133
0.02

5.19 Consider the chi-square random variable with 25 degrees of freedom.


(a) Find the probability that it is less than 20.
(b) Find the probability that it is greater than 25.
(c) Find the probability that it is between 21 and 24.
5.20 Consider a t random variable with 32 degrees of freedom.
Find t0.005 , t0.01 , t0.015 , t0.02 , t0.05 , t0.1
5.21 Consider a t random variable with 119 degrees of freedom.
Find t0.005 , t0.01 , t0.015 , t0.02 , t0.05 , t0.1 and compare them with
Z0.005 , Z0.01 , Z0.015 , Z0.02 , Z0.05 , Z0.1

5.22 Consider a χ2 random variable with 9 degrees of freedom. Find χ20.975 , χ20.025

5.23 Consider an F random variable with 3 and 4 degrees of freedom. Find F0.01 , F0.05 , F0.035 .

Chapter 6

Statistical Estimation

6.1 Point Estimation

The following table contains some of the well-known population parameters and their
point estimates based on a random sample.

             Population    Sample
Mean         µ             x̄
Variance     σ²            s²
Proportion   p = X/N       p̂ = x/n

Table 6.1: Population parameters and their point estimates

6.2 Confidence Interval Estimation for the Population Mean

Point estimates may be far away from the true parameter if the estimators have large
variances. So we want to estimate parameters by confidence intervals that consider
the variability and the sampling distribution.

Confidence Interval Estimation on the Mean of a Normal Population with Variance Known

A 100(1 − α)% confidence interval for mean µ is given by:

x̄ ± zα/2 √(σ²/n)
or

x̄ − zα/2 √(σ²/n) ≤ µ ≤ x̄ + zα/2 √(σ²/n)

where zα/2 is the 100(1 − α/2)th percentile of the standard normal distribution.

Example 6.1 The yield of a chemical process is being studied. From previous
experience, yield is known to be normally distributed and σ = 3. The past five days of
plant operation have resulted in the following percent yields: 91.6, 88.75, 90.8, 89.95,
and 91.3. Find a 95% two-sided confidence interval on the true mean yield.
Solution Since the population standard deviation is σ = 3, we assume that the yields
follow a normal distribution N(µ, 3²). With 1 − α = 0.95, zα/2 = z0.025 = 1.96, so that
a 95% CI for µ is given by

90.48 ± 1.96 √(3²/5), i.e. 87.85 ≤ µ ≤ 93.11

Large Sample Confidence Interval for the Population Mean

A 100(1 − α)% confidence interval (CI) for the population mean µ is given by

x̄ ± zα/2 √(s²/n)

Example 6.2 A meat inspector has randomly measured 30 packs of acclaimed 95%
lean beef. The sample resulted in a mean of 96.2% with a sample standard deviation
of 0.8%. Find a 98% confidence interval for the mean of similar packs (Walpole, R. E.
et al., 2002, 236-237).
Solution A 98% CI for µ is given by

96.2 ± 2.326 √((0.8)²/30), i.e. 95.8603 ≤ µ ≤ 96.5397
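The same interval can be computed by hand in R, with qnorm supplying the critical value (2.326 in the solution above is the rounded value of z0.01):

```r
xbar <- 96.2; s <- 0.8; n <- 30
z <- qnorm(0.99)                      # z_{0.01} = 2.326 for a 98% interval
xbar + c(-1, 1) * z * sqrt(s^2 / n)   # approximately 95.8602 96.5398
```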

An Experiment for Large Sample Confidence Intervals

One hundred samples, each of size 30, have been drawn from an exponential distribution
with mean 2, and a 95% confidence interval has been calculated for each sample. The
sample mean, LCL (Lower Confidence Limit) and UCL (Upper Confidence Limit) are
given in Table 6.2. An interval that contains the true mean 2 is followed by a Y,
otherwise by an N.

Sample Mean LCL UCL Y/ N
C1 2.1790 1.5788 2.7792 Y
C2 1.9778 1.1719 2.7836 Y
C3 2.5806 1.6660 3.4951 Y
C4 2.1161 1.2453 2.9868 Y
C5 1.9929 1.2145 2.7714 Y
C6 2.3214 1.4795 3.1633 Y
C7 2.0379 1.3489 2.7271 Y
C8 2.4378 1.3352 3.5405 Y
C9 2.1490 1.3694 2.9287 Y
C10 1.7582 1.2117 2.3047 Y
C11 1.7566 1.1819 2.3312 Y
C12 2.0279 1.1847 2.8710 Y
C13 2.0105 1.2118 2.8094 Y
C14 1.8774 1.1075 2.6473 Y
C15 1.7119 1.1854 2.2386 Y
C16 1.8232 1.1598 2.4867 Y
C17 2.7502 1.8101 3.6902 Y
C18 2.1466 1.2467 3.0465 Y
C19 1.9332 1.2452 2.6212 Y
C20 2.0821 1.3345 2.8297 Y
C21 1.4457 0.9776 1.9138 N
C22 1.7582 1.2355 2.2809 Y
C23 1.9975 1.2987 2.6962 Y
C24 1.7588 0.8082 2.7094 Y
C25 1.8509 1.1329 2.5689 Y
C26 1.3934 0.8597 1.9271 N
C27 1.8494 1.2896 2.4092 Y
C28 1.8064 1.1125 2.5002 Y
C29 2.3428 1.5917 3.0939 Y
C30 1.8725 1.2316 2.5134 Y
C31 1.9489 1.2072 2.6906 Y
C32 1.8292 0.9996 2.6588 Y
C33 1.4579 1.0048 1.9111 N
C34 2.1429 1.2687 3.0173 Y
C35 2.0964 1.0399 3.1528 Y
C36 1.9006 1.2708 2.5310 Y
C37 2.0545 1.2358 2.8732 Y
C38 2.0777 1.1938 2.9616 Y
C39 2.1266 1.2661 2.9872 Y
C40 2.0915 1.3087 2.8744 Y
C41 2.3041 1.3396 3.2686 Y
C42 1.5656 0.9763 2.1549 Y
C43 2.7001 1.6032 3.7970 Y

C44 2.3216 1.1147 3.5285 Y
C45 2.2822 1.4709 3.0935 Y
C46 1.5049 0.9677 2.0423 Y
C47 2.3985 1.5324 3.2645 Y
C48 2.2272 1.1277 3.3267 Y
C49 2.3145 1.5413 3.0877 Y
C50 1.8669 1.0764 2.6576 Y
C51 1.6973 0.9257 2.4689 Y
C52 1.4834 0.9449 2.0219 Y
C53 2.1219 1.4426 2.8013 Y
C54 2.0054 1.3186 2.6923 Y
C55 2.2493 1.3744 3.1243 Y
C56 1.7336 1.1152 2.3519 Y
C57 1.7186 1.0996 2.3376 Y
C58 1.2960 0.7826 1.8095 N
C59 2.3322 1.4760 3.1884 Y
C60 1.9447 1.2595 2.6299 Y
C61 2.3604 1.5601 3.1608 Y
C62 2.8159 1.9349 3.6969 Y
C63 2.4363 1.5238 3.3489 Y
C64 2.1681 1.3227 3.0135 Y
C65 1.6833 1.1838 2.1828 Y
C66 2.2904 1.3307 3.2501 Y
C67 1.6014 0.8826 2.3203 Y
C68 1.6944 1.1902 2.1986 Y
C69 2.2796 1.5889 2.9702 Y
C70 1.7836 1.2667 2.3005 Y
C71 1.5878 1.0871 2.0885 Y
C72 2.6200 1.6938 3.5463 Y
C73 1.6798 0.8418 2.5177 Y
C74 1.2743 0.9735 1.5749 N
C75 2.1606 1.3443 2.9769 Y
C76 1.3105 0.7919 1.8289 N
C77 2.1495 1.3038 2.9952 Y
C78 2.2101 1.4763 2.9438 Y
C79 2.4033 1.4772 3.3293 Y
C80 1.8017 1.3024 2.3009 Y
C81 1.7591 1.1780 2.3401 Y
C82 2.0183 1.3443 2.6923 Y
C83 1.5494 0.9228 2.1761 Y
C84 2.4460 1.5185 3.3736 Y
C85 1.7547 1.2521 2.2572 Y
C86 1.9683 1.1867 2.7499 Y
C87 1.7297 1.2564 2.2032 Y
C88 1.9599 1.3147 2.6051 Y

C89 2.0622 1.4277 2.6968 Y
C90 1.8086 1.2125 2.4047 Y
C91 1.3786 0.7675 1.9897 N
C92 1.9505 1.2360 2.6649 Y
C93 1.8886 1.1770 2.6001 Y
C94 2.0094 1.3489 2.6699 Y
C95 2.1840 1.4007 2.9673 Y
C96 1.8148 1.2167 2.4129 Y
C97 1.8457 1.2152 2.4762 Y
C98 2.2048 1.1911 3.2184 Y
C99 2.5565 1.6199 3.4932 Y
C100 1.9319 1.1761 2.6877 Y

Table 6.2: Large Sample Confidence Intervals by R

We observe that 93% of the intervals envelope (trap) the true mean µ = 1/λ = 2,
whereas the theory says that about 95 out of 100 should include it. This, however, is a
fairly good agreement between theory and application. You may draw 100 samples each
of size 50, calculate 100 confidence intervals and see the difference! For interested
readers, the R code for doing so is given as:

xbar = LCL = UCL = Fall = c()


for(i in 1:100){
dataset <- rexp(n=50, rate=1/2)
test1 <- t.test(x=dataset, alternative = "two.sided", conf.level = 0.95);
xbar[i] <- test1$estimate
LCL[i] <- test1$conf.int[1];
UCL[i] <- test1$conf.int[2];

Fall[i] <- ifelse(2 < LCL[i] | 2 > UCL[i] , "No" , "Yes")


}
cbind(xbar,LCL,UCL,Fall)
table(Fall)

Confidence Interval on the Mean of a Normal Population with Variance Unknown

A 100(1 − α)% confidence interval (CI) for the population mean µ is given by

x̄ ± tα/2 √(s²/n)

where tα/2 is the 100(1 − α/2)th percentile of the Student t distribution with n − 1
degrees of freedom.

Example 6.3 The contents of 7 similar containers of sulfuric acid are 9.8, 10.2, 10.4,
9.8, 10.0, 10.2 and 9.6 liters. Find a 99% confidence interval for the mean of all such
containers, assuming an approximately normal distribution (Walpole, R. E. et al., 2002,
pp. 236-237).
Solution For 6 degrees of freedom, tα/2 = t0.005 = 3.707 using Appendix A3. A 99%
confidence interval for the mean µ is given by

10 ± 3.707 √((0.2828)²/7), i.e. 9.6037 ≤ µ ≤ 10.3964

6.3 Computing Confidence Interval for a Population Mean Using R

Normal Confidence Interval

Example 6.1 You can use R to form a confidence interval estimate for the mean
when σ is known. Run the following code in R:

install.packages("BSDA") # for installing the BSDA package


library(BSDA); # for initiating the BSDA package
dataset <- c(91.6, 88.75, 90.8, 89.95, 91.3);
z.test(x=dataset, sigma.x=3, alternative = "two.sided", conf.level = 0.95);

where dataset represents the column where the data are stored, sigma.x is the known
population standard deviation, the alternative argument gives you the option of
creating a two sided, upper sided or lower sided confidence interval by two.sided,
greater or less, respectively, and conf.level is the predefined confidence level,
which is 0.95 in our example.
The following output of results will be displayed:

One-sample z-Test

data: dataset
z = 67.44, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
87.85043 93.10957
sample estimates:
mean of x
90.48

From the output, the sample mean yield is 90.48 and the required 95% confidence
interval is 87.85043 ≤ µ ≤ 93.10957.
In case you have the summary of the data in terms of the sample mean and sample size,
you can compute the same confidence interval by using the following R code:

zsum.test(mean.x = 90.48, n.x = 5, sigma.x = 3,
          alternative = "two.sided", conf.level = 0.95);

Student t Confidence Interval

Example 6.3. You can use R to form a confidence interval estimate for the mean
when σ is unknown. Run the following code in R:

dataset <- c(9.8, 10.2, 10.4, 9.8, 10.0, 10.2, 9.6);


t.test(x=dataset, alternative = "two.sided", conf.level = 0.95);

where dataset represents the column where the data are stored, the alternative
argument gives you the option of creating a two sided, upper sided or lower sided
confidence interval by two.sided, greater or less, respectively, and conf.level is
the predefined confidence level, which is 0.95 in our example.
The following output of results will be displayed:

One Sample t-test

data: dataset
t = 93.541, df = 6, p-value = 1.006e-10
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
9.738414 10.261586
sample estimates:
mean of x
10

From the output, the sample mean content is 10 liters and the required 95% confidence
interval is 9.738414 ≤ µ ≤ 10.261586.
In case you have the summary of the data in terms of the sample mean, sample standard
deviation and sample size, you can compute the same confidence interval by using the
following R code:

tsum.test(mean.x = 10, s.x = 0.2828427, n.x = 7,
          alternative = "two.sided", conf.level = 0.95);

6.4 Large Sample Confidence Interval Estimation of a Population Proportion

A 100(1 − α)% confidence interval for p is given by

p̂ ± zα/2 √(p̂(1 − p̂)/n)
Example 6.4 In certain water-quality studies, it is important to check for the presence
or absence of various types of microorganisms. Suppose 20 out of 100 randomly selected
samples of a fixed volume show the presence of a particular microorganism. Estimate
the true proportion of microorganism with a 90% confidence interval (Scheaffer and
McClave, 1995, 369).
Solution Since the sample size is large, we use a z-interval. A 90% confidence interval
for p is given by

0.2 ± 1.645 √(0.2(0.8)/100) = 0.2 ± 0.066, i.e. 0.134 ≤ p ≤ 0.266

Computing Confidence Interval of a Population Proportion Using R

Example 6.4. You can use R to form a confidence interval estimate for the proportion.
Run the following code in R:

binom.test(x = 20, n = 100, p = 0.5,
           alternative = "two.sided", conf.level = 0.9)

where x represents the number of successes, n is the number of trials, the alternative
argument gives you the option of creating a two sided, upper sided or lower sided
confidence interval by two.sided, greater or less, respectively, and conf.level is
the predefined confidence level, which is 0.90 in our example.
The following output of results will be displayed:

Exact binomial test

data:  20 and 100
number of successes = 20, number of trials = 100, p-value = 1.116e-09
alternative hypothesis: true probability of success is not equal to 0.5
90 percent confidence interval:
 0.1366613 0.2772002
sample estimates:
probability of success
                   0.2

From the output, the required 90% confidence interval is 0.1366613 ≤ p ≤ 0.2772002.
Note that this is the exact confidence interval from the binomial distribution and is
slightly different from the one calculated from the normal approximation.
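The normal-approximation interval computed by hand in Example 6.4 can also be reproduced with a few lines of R (coding the formula explicitly, since binom.test returns the exact interval):

```r
phat <- 20 / 100; n <- 100
z <- qnorm(0.95)                                   # z_{0.05} = 1.645 for a 90% interval
phat + c(-1, 1) * z * sqrt(phat * (1 - phat) / n)  # approximately 0.134 0.266
```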

Exercises

6.1 (cf. Walpole, R. E. et al., 2002, 246). The following measurements were recorded
for the drying time, in hours, of a certain brand of latex paint:

3.4 4.8 2.8 3.3 4.0 4.4 5.2 5.6 4.3 5.6

Assuming that the measurements represent a random sample from a normal
population, find a 99% confidence interval for the true mean drying time.

(a) Assume that population standard deviation is 1.3.


(b) Assume that population standard deviation is unknown.

6.2 (cf. Devore, J. L., 2000, 287). A random sample of fifteen heat pumps of a certain
type yielded the following observations on lifetime (in years):

2.0 1.3 6.0 1.9 5.1 0.4 1.0 5.3 15.7 0.7 4.8 0.9 12.2 5.3 0.6

(a) Obtain a 95% confidence interval for expected (true average) lifetime.
(b) Obtain a 99% confidence interval for expected (true average) lifetime.

6.3 (cf. Devore, J. L., 2000, 299). Consider the following sample of fat content (in
percentage) of ten randomly selected hot dogs:

25.2 21.3 22.8 17.0 29.8 21.0 25.5 16.0 20.9 19.5

Assuming that these were selected from a normal population distribution, find a
95% C.I. for the population mean of fat content.

(a) Assume that population standard deviation is 2.4.


(b) Assume that population standard deviation is unknown.

6.4 (Devore, J. L., 2000, 303). A study of the ability of individuals to walk in a
straight line reported the accompanying data on cadence (stride per second) for a
sample of twenty randomly selected healthy men:

0.95 0.85 0.92 0.95 0.93 0.86 1.00 0.92 0.85 0.81
0.78 0.93 1.05 0.93 1.06 1.06 0.96 0.81 0.96 0.93

Calculate and interpret a 95% confidence interval for population mean cadence.

6.5 (Devore, J. L., 2000, 306). The following observations were made on fracture
toughness of a base plate of 18% nickel steel:

69.5 71.9 72.6 73.1 73.3 73.5 75.5 75.7 75.8 76.1 76.2
76.2 77.0 77.9 78.1 79.6 79.7 79.9 80.1 82.2 83.7 93.7

Calculate a 99% CI for the actual mean of the fracture toughness.

6.6 (Devore, J. L., 2000, 307). For each of 18 preserved cores from oil-wet carbonate
reservoirs, the amount of residual gas saturation after a solvent injection was
measured at water flood-out. Observations, in percentage of pore volume, were:

23.5 31.5 34.0 46.7 45.6 32.5 41.4 37.2 42.5


46.9 51.5 36.4 44.5 35.7 33.5 39.3 22.0 51.2

Calculate a 98% CI for the true average amount of residual gas saturation.

6.7 (cf. Vining, G. G., 1998, 176). In a study of the thickness of metal wires produced
in a chip-manufacturing process, the wires should ideally have a target thickness
of 8 microns. These are the sample data:

8.4 8.0 7.8 8.0 7.9 7.7 8.0 7.9 8.2 7.9 8.1 7.8 8.2
7.9 8.2 7.9 7.8 7.9 7.9 8.0 8.0 7.6 8.2 8.1 8.3 7.8
8.0 8.0 8.3 7.8 8.2 8.3 8.0 8.0 7.8 8.2 7.7 7.8 8.3
7.8 7.9 8.4 7.7 8.0 7.9 8.0 7.7 7.7 7.8 8.3 8.0 7.5

Construct a 95% confidence interval for the true mean thickness.

6.8 (cf. Vining, G. G., 1998, 177). In a study of aluminum contamination in recycled
PET plastic from a pilot plant operation at Rutgers University, researchers collected
26 samples and measured, in parts per million (ppm), the amount of aluminum
contamination. The maximum acceptable level of aluminum contamination, on
average, is 220 ppm. The data are listed here:

291 222 125 79 145 119 244 118 182 119 120 30 115
63 30 140 101 102 87 183 60 191 511 172 90 90

Construct a 95% confidence interval for the true mean concentration.

6.9 (cf. Vining, G. G., 1998, 178). Researchers discuss the production of polyol,
which is reacted with isocyanate in a foam molding process. Variations in the
moisture content of polyol cause problems in controlling the reaction with isocyanate.
Production has set a target moisture content of 2.125%. The following data represent
27 moisture analyses over a 4-month period.

2.29 2.22 1.94 1.90 2.15 2.02 2.15 2.09 2.18
2.00 2.06 2.02 2.15 2.17 2.17 1.90 1.72 1.75
2.12 2.06 2.00 1.98 1.98 2.02 2.14 2.10 2.05

Construct a 99% confidence interval for the true mean moisture content

6.10 (cf. Vining, G. G., 1998, 178). In a study of a galvanized coating process for large
pipes, standards call for an average coating weight of 200 lb per pipe. These data
are the coating weights for a random sample of 30 pipes:

216 202 208 208 212 202 193 208 206 206
206 213 204 204 204 218 204 198 207 218
204 212 212 205 203 196 216 200 215 202

Construct 97%, 98% and 99% confidence intervals for the true mean coating
weight.

6.11 (cf. Vining, G. G., 1998, 179). Researchers studied a batch operation at a chemical
plant where an important quality characteristic was the product viscosity, which
had a target value of 14.90. Production personnel use a viscosity measurement
for each 12-hour batch to monitor this process. These are the viscosities for the
past ten batches:

13.3 14.5 15.3 15.3 14.3 14.8 15.2 14.9 14.6 14.1

Construct a 90% confidence interval for the true mean viscosity.

6.12 (cf. Vining, G. G., 1998, 179). Scientists looked at the average particle size of
a product with a specification of 70 – 130 microns and a target of 100 microns.
Production personnel measure the particle size distribution using a set of screening
sieves. They test one sample a day to monitor this process. The average particle
sizes for the past 25 days are listed here:

99.6 92.1 103.8 95.3 101.6 102.3 93.8 102.7 94.9 94.9
102.8 100.9 100.5 102.7 96.9 103.2 97.5 98.3 105.8 100.6
101.5 96.7 96.8 97.8 104.7

Construct a 95% confidence interval for the true mean particle size assuming that
true standard deviation is 5.2.

6.13 (cf. Vining, G. G., 1998, 184). In a study of a cylinder boring process for an engine
block, specifications require that the bores fall within stated limits. Management is
concerned that the true proportion of cylinder bores outside the specifications is excessive.
Current practice is willing to tolerate up to 10% outside the specifications. Out
of a random sample of 165, 36 were outside the specifications. Construct a 99%
confidence interval for the true proportion of bores outside the specifications.

6.14 (cf. Vining, G. G., 1998, 184). Consider nonconforming brick from a brick
manufacturing process. Typically, 5% of the brick produced is not suitable for all
purposes. Management monitors this process by periodically collecting random
samples and classifying the bricks as conforming or nonconforming. A recent
sample of 214 bricks yielded 18 nonconforming. Construct a 98% confidence
interval for the true proportion of nonconforming bricks.

6.15 (cf. Vining, G. G., 1998, 185). Consider a study examining a process for manufacturing
electrical resistors that have a nominal resistance of 100 ohms with stated specification
limits. Suppose management has expressed a concern that the true proportion of
resistors with resistances outside the specifications has increased from the historical
level of 10%. A random sample of 180 resistors yielded 46 with resistances outside
the specifications. Construct a 95% confidence interval for the true proportion of
resistors outside the specification.

6.16 (Vining, G. G., 1998, 185). An automobile manufacturer gives a 5-year/60,000-mile
warranty on its drive train. Historically, 7% of this manufacturer’s automobiles
have required service under this warranty. Recently, a design team proposed an
improvement that should extend the drive train’s life. A random sample of 200
cars underwent 60,000 miles of road testing; the drive train failed for 12. Construct
a 95% confidence interval for the true proportion of automobiles with drive trains
that fail.

6.17 (Vining, G. G., 1998, 185). Historically, 10% of the homes in Florida have radon
levels higher than recommended by the Environmental Protection Agency. Radon
is a weakly radioactive gas known to contribute to health problems. A city in north
central Florida has hired an environmental consulting group to determine whether
it has a greater than normal problem with this gas. A random sample of 200 homes
indicated that 25 had radon levels exceeding EPA recommendations. Construct
a 95% confidence interval for the true proportion of homes with excessive levels of
radon.

Chapter 7

Tests of Hypotheses

7.1 Testing Hypotheses about a Population Mean

Testing Hypotheses on the Mean of a Normal Population, Variance Known


Possible hypotheses, rejection regions and p-values are summarized in the
following table:

H0 vs Ha Rejection Region p-value


µ = µ0 vs µ < µ0 z < −zα P (Z < z)
µ = µ0 vs µ > µ0 z > zα P (Z > z)
µ = µ0 vs µ ̸= µ0 z < −zα/2 or z > zα/2 2P (Z > |z|)

Table 7.1: Testing hypotheses about a population mean using z tests

where z is the test statistic, which can be written

z = (x̄ − µ0)/√(σ²/n)

and α is the significance level of the test


Example 7.1 The average zinc concentration recovered from a sample of zinc
measurements in 36 different locations is found to be 2.6 grams per milliliter. Assume
that the population standard deviation is 0.3. It is believed that the average zinc
concentration of such measurements is less than 3 grams per milliliter. Set up
suitable hypotheses and test at 1% level of significance.
Solution From the sample we have n = 36, x̄ = 2.6. The hypotheses are given by

H0 : µ = 3 vs Ha : µ < 3

The value of the test statistic z is given by

z = (2.6 − 3)/√((0.3)²/36) = −8

Since z = −8 < −zα = −z0.01 = −2.33, we reject the null hypothesis H0 , i.e. there is
sufficient evidence to reject the hypothesis that the mean zinc concentration is 3.
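The calculation above can also be reproduced in base R without any add-on package; the following sketch (using the summary statistics from Example 7.1) computes the test statistic and its one-sided p-value directly:

```r
# Base-R computation of the z test in Example 7.1
xbar <- 2.6; mu0 <- 3; sigma <- 0.3; n <- 36
z <- (xbar - mu0) / (sigma / sqrt(n))   # test statistic: -8
p.value <- pnorm(z)                     # left-tailed p-value P(Z < z)
z; p.value
```

The critical value can be obtained with qnorm(0.01), which returns approximately −2.33.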

Large Sample Test of the Population Mean

Refer to Table 1 for the hypotheses. The test statistic is given by


z = (x̄ − µ0)/√(s²/n)

Example 7.2 A manufacturer of sports equipment has developed a new synthetic
fishing line that he claims has a mean breaking strength of 8 kilograms. At 1% level of
significance, test the hypothesis that the mean breaking strength is 8 kilograms against
the alternative that mean breaking strength is not 8 kilograms if a random sample of
50 lines is tested and found to have a mean breaking strength of 7.8 kilograms and a
standard deviation of 0.5 kilogram.
Solution The hypotheses are given by

H0 : µ = 8 vs Ha : µ ̸= 8

The value of the test statistic z is given by

z = (7.8 − 8)/√((0.5)²/50) = −2.83

Since z = −2.83 < −z0.005 = −2.575 we reject H0 and conclude that the average
breaking strength is not equal to 8.
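Because n = 50 is large, the sample standard deviation replaces σ in the z statistic. As a hedged sketch, the computation for Example 7.2 can be done in base R from the summary statistics:

```r
# Large-sample z test for Example 7.2 using the summary statistics
xbar <- 7.8; mu0 <- 8; s <- 0.5; n <- 50
z <- (xbar - mu0) / (s / sqrt(n))   # test statistic: about -2.828
p.value <- 2 * pnorm(-abs(z))       # two-sided p-value
z; p.value
```

Alternatively, the zsum.test() function from the BSDA package (introduced in Section 7.2) accepts the sample standard deviation in place of sigma.x for large-sample tests.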

Testing the Mean of a Normal Population, Variance Unknown

H0 vs Ha Rejection Region p-value


µ = µ0 vs µ < µ0 t < −tα P (T < t)
µ = µ0 vs µ > µ0 t > tα P (T > t)
µ = µ0 vs µ ̸= µ0 t < −tα/2 or t > tα/2 2P (T > |t|)

Table 7.2: Testing hypotheses about a population mean using t tests

The test statistic t is given by

t = (x̄ − µ0)/√(s²/n)

and tα is the 100(1 − α)th percentile of the student t distribution with n − 1 degrees
of freedom.
Example 7.3 It is claimed that a vacuum cleaner expends an average of 46 kilowatt-
hours per year. If a random sample of 12 homes included in a planned study indicates
that vacuum cleaners expend the following kilowatt-hours per year

30 44 40 45 46 40 47 48 46 45 41 50

Does this suggest, at the 5% level of significance, that vacuum cleaners expend, on
average, an amount different from 46 kilowatt-hours annually? Assume the population
of kilowatt-hours to be normal.
Solution The hypotheses are given by

H0 : µ = 46 vs Ha : µ ̸= 46

The value of the test statistic t is given by

t = (43.5 − 46)/√(5.265669²/12) = −1.6447

Since t = −1.6447 is not less than −tα/2 = −t0.025 = −2.201, we cannot reject the null
hypothesis, i.e. we cannot conclude that µ differs from 46.

7.2 Testing for the Population Mean Using R

Normal Population with Variance Known

Example 7.1 You can use R to perform the test for the mean when σ is known. Run
the following code in R:

install.packages("BSDA") # for installing the BSDA package
library(BSDA); # for loading the BSDA package
zsum.test(mean.x = 2.6, n.x = 36, sigma.x = 0.3, mu = 3,
alternative = "less", conf.level = 0.99);

where mean.x is the sample mean, n.x is the sample size, sigma.x is the known
population standard deviation, mu is the value of mean to be tested in H0 , alternative
argument gives you the option for setting the direction of H1 i.e. “two.sided”,
“greater” or “less”, conf.level is the predefined confidence level if one is interested
in the confidence interval too.
The following output of results will be displayed:

One-sample z-Test

data: Summarized x
z = -8, p-value = 6.221e-16
alternative hypothesis: true mean is less than 3
99 percent confidence interval:
NA 2.716317
sample estimates:
mean of x
2.6

From the output, the calculated z-statistic is −8 and the corresponding p-value is very
close to zero. In case you have the original data, you can perform the same hypothesis
test by using following R code after storing the data in dataset:

z.test(x = dataset, sigma.x = 0.3, mu = 3,
alternative = "less", conf.level = 0.99);

Normal Population with Variance Unknown

Example 7.3. You can use R to perform the test for the mean when σ is unknown.
Run the following code in R:

tsum.test(mean.x = 43.5, n.x = 12, s.x = 5.265669, mu = 46,
alternative = "two.sided", conf.level = 0.95);

where mean.x is the sample mean, n.x is the sample size, s.x is the sample standard
deviation, mu is the value of mean to be tested in H0 , alternative argument gives
you the option for setting the direction of H1 i.e. “two.sided”, “greater” or “less”,
conf.level is the predefined confidence level if one is interested in the confidence
interval too.
The following output of results will be displayed:

One-sample t-Test

data: Summarized x
t = -1.6447, df = 11, p-value = 0.1283
alternative hypothesis: true mean is not equal to 46
95 percent confidence interval:
40.15435 46.84565
sample estimates:
mean of x
43.5

From the output, the calculated t-statistic is −1.6447 and the corresponding p-value
is 0.1283. In case you have the original data, you can perform the same hypothesis
test by using following R code after storing the data in dataset:

t.test(x = dataset, mu = 46, alternative = "two.sided",conf.level = 0.95);

7.3 Large Sample Tests of Proportions

Testing a Population Proportion

Tests on p are summarized in the following table:

H0 vs Ha Rejection Region p-value


p = p0 vs p < p0 z < −zα P (Z < z)
p = p0 vs p > p0 z > zα P (Z > z)
p = p0 vs p ̸= p0 z < −zα/2 or z > zα/2 2P (Z > |z|)

Table 7.3: Testing hypotheses for the population proportion

The test statistic z is given by

z = (p̂ − p0)/√(p0(1 − p0)/n)

Example 7.4 In certain water-quality studies, it is important to check for the presence
or absence of various types of microorganisms. Suppose 20 out of 100 randomly selected
samples of a fixed volume show the presence of a particular microorganism. At the
5% level of significance, test the hypothesis that the true proportion of the presence
of a particular microorganism is at least 0.30.
Solution The hypotheses are given by

H0 : p ≥ 0.30 vs Ha : p < 0.30

With α = 0.05, we will reject H0 if z < −zα , where z0.05 = 1.645. The observed value is

z = (0.20 − 0.30)/√(0.30(0.70)/100) = −2.18
Since z = −2.18 is less than −1.645, we reject the null hypothesis and conclude that
the true proportion of the presence of a particular microorganism is less than 0.30.

7.4 Testing a Population Proportion Using R

Example 7.4 You can use R to perform the test for the proportion. Run the following
code in R:

binom.test(x = 20, n = 100, p = 0.3, alternative = "less", conf.level = 0.95)

where x represents the number of successes, n is the number of trials, p is the value
of proportion to be tested in H0 , alternative argument gives you the option for
setting the direction of H1 i.e. “two.sided”, “greater” or “less”, conf.level is
the predefined confidence level if one is interested in the confidence interval too.
The following output of results will be displayed:

Exact binomial test

data: 20 and 100


number of successes = 20, number of trials = 100, p-value = 0.01646
alternative hypothesis: true probability of success is less than 0.3
95 percent confidence interval:
0.0000000 0.2772002
sample estimates:
probability of success
0.2

From the output, the p-value came out to be 0.01646 leading to rejection of H0 .
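Note that binom.test() performs an exact test. If instead you want a test based on the normal-approximation z statistic of Table 7.3, the base-R prop.test() function (with the continuity correction turned off) gives a matching result, since its reported X-squared statistic is the square of z:

```r
# Normal-approximation test of H0: p = 0.30 vs Ha: p < 0.30 (Example 7.4)
prop.test(x = 20, n = 100, p = 0.3, alternative = "less", correct = FALSE)
# The reported X-squared, 4.7619, equals z^2 = (-2.18)^2 up to rounding
```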

Exercises

7.1 Refer to Exercise 6.1 and conduct the most appropriate hypothesis test using a 0.05
significance level.

7.2 Refer to Exercise 6.2 and conduct the most appropriate hypothesis test using a 0.01
significance level.

7.3 Refer to Exercise 6.3 and conduct the most appropriate hypothesis test using a 0.01
significance level.

7.4 Refer to Exercise 6.4 and conduct the most appropriate hypothesis test using a 0.01
significance level.

7.5 Refer to Exercise 6.5 and conduct the most appropriate hypothesis test using a 1%
significance level.

7.6 Refer to Exercise 6.6 and conduct the most appropriate hypothesis test using a 0.05
significance level.

7.7 Refer to Exercise 6.7 and conduct the most appropriate hypothesis test using a 0.01
significance level.

7.8 Refer to Exercise 6.8 and conduct the most appropriate hypothesis test using a 0.01
significance level.

7.9 Refer to Exercise 6.9 and conduct the most appropriate hypothesis test using a 0.05
significance level.

7.10 Refer to Exercise 6.10 and conduct the most appropriate hypothesis test using a 0.05
significance level.

7.11 Refer to Exercise 6.11 and conduct the most appropriate hypothesis test using a 0.05
significance level.

7.12 Refer to Exercise 6.12 and conduct the most appropriate hypothesis test using a 0.05
significance level.

7.13 Refer to Exercise 6.13 and conduct the appropriate hypothesis test using a 0.05
significance level.

7.14 Refer to Exercise 6.14 and conduct the appropriate hypothesis test using a 0.10
significance level.

7.15 Refer to Exercise 6.15 and conduct the appropriate hypothesis test using a 0.05
significance level.

7.16 Refer to Exercise 6.16 and conduct the appropriate hypothesis test using a 0.01
significance level.

7.17 Refer to Exercise 6.17 and conduct the appropriate hypothesis test using a 0.05
significance level.

Chapter 8

Linear Regression

8.1 Scatter Diagram

Example 8.1 A chemical engineer is investigating the effect of process operating
temperature (x) on product yield (y). The study results in the following data:

x 100 110 120 130 140 150 160 170 180 190
y 45 51 54 61 66 70 74 78 85 89

(Hines and Montgomery, 1990, p 457) Check if there is any linear relationship between
temperature and product yield.
Importing the data into R is the first step in any data analysis. R can import data
from a wide variety of file types, including those found on your PC, the internet,
and relational databases. As mentioned in Chapter one, this data importation can
be done in a number of ways. Now assume that your data are in an Excel file. Below,
we explore two distinct methods for importing an Excel file into the R programming
language.
Solution:
Method 1 Using RStudio’s built-in menu options: Suppose that the data are
provided in an Excel file named dataset, as shown in the following snapshot:
To import these data in R, follow these steps:

• Click on “Import Data” and then on “From Excel” as shown in the snapshot
below:

• Click on “Browse” to browse your computer for the data file, as shown in the
snapshot below:

• After locating the data file, the last step of importing the data is to click on
“Import” as shown in the snapshot below:

97
• After the import, R will automatically open the new data for viewing. Click
on the previously opened script to start writing the program, as shown in the
snapshot below:

In the above picture, it can be seen that the data frame is saved with the name
“dataset”, i.e. the same as the name of our source Excel file.
Method 2 Using read_excel(): In this approach, it is assumed that you are using
base R rather than RStudio. In this case, call the read_excel() function from the
readxl package, with the file name as the argument, in order to import an Excel file
into R.
Suppose that the data are provided in Excel file named dataset, as shown before. To

import these data in R, follow these steps:

• Save your data using the filename dataset, into a folder at either c:/Work or
Desktop or Documents on the hard drive or a folder on a memory stick. For
example, assuming I call this new folder Rsession1, or something similar, to
indicate what the folder contains, then save the folder on the desktop as:

• Then, in R, you should change the working directory to the same folder you

have stored the data in by clicking on the menu item: File > Change dir as
shown in the snapshot below

• From Change dir, use the browse facility to locate the relevant folder Rsession1
on the desktop and click “OK” as shown in the snapshot below:

• The final step is to run the following code in R:

install.packages("readxl")
library(readxl)
dataset=read_excel("dataset.xlsx")
View(dataset)

Again, in the above picture, it can be seen that the data frame is saved with the name
“dataset”, i.e. the same as the name of our source Excel file.
Run the following R code to create a scatter plot:

plot(x = dataset$temp, y = dataset$yield, main = "Scatter Plot",
     xlab = "Temperature", ylab = "Yield", xlim = c(90,200),
     ylim = c(40,90), pch = 19, col = "red", lwd = 5)

where x is the variable for x-axis, y is the variable for y-axis, main is main title of plot,
xlab is the label of x-axis, ylab is the label of y-axis, xlim is the range of x-axis, ylim
is the range of y-axis, pch is the plot character with 25 different characters available
in R, col is the color of characters and lwd is the size/width of characters.
The following output of plot will be displayed:
Since the scatter diagram between temperature (X) and product yield (Y ) shows a
linear trend, one recommends estimating the line of best fit.

8.2 The Correlation Coefficient

The strength of the linear relationship between x and y is measured by the correlation
coefficient, defined by

r = Sxy/√(Sxx Syy)

[Figure: scatter plot titled “Scatter Plot” of Yield (40–90) versus Temperature (100–200), showing a clear increasing linear trend]

where

Sxx = Σ(xi − x̄)² = (n − 1)Sx² = Σxi² − (Σxi)²/n = Σxi² − n x̄²
Syy = Σ(yi − ȳ)² = (n − 1)Sy² = Σyi² − (Σyi)²/n = Σyi² − n ȳ²
Sxy = Σ(xi − x̄)(yi − ȳ) = Σxiyi − (Σxi)(Σyi)/n = Σxiyi − n x̄ȳ
In Example 8.1, we have Sxy = 3985, Sxx = 8250, Syy = 1932.1

r = 3985/√((8250)(1932.10)) = 0.99813
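The intermediate quantities Sxx, Syy and Sxy can be checked numerically; this sketch assumes the data frame dataset of Example 8.1 with columns temp and yield:

```r
# Verify the hand-computed sums of squares for Example 8.1
x <- dataset$temp; y <- dataset$yield
Sxx <- sum((x - mean(x))^2)                  # 8250
Syy <- sum((y - mean(y))^2)                  # 1932.1
Sxy <- sum((x - mean(x)) * (y - mean(y)))    # 3985
Sxy / sqrt(Sxx * Syy)                        # 0.99813
```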

For finding the correlation coefficient in R, run the following code:

cor(x = dataset$temp, y = dataset$yield)

8.3 Estimating the Line of Best Fit

The simple linear regression model is of the form

y = β0 + β1 x + ϵ

where y and x are respectively the dependent (response) and independent (explanatory)
variables, ϵ is the random error component, and β0 and β1 are the y-intercept and slope
of the regression line respectively. The conditional mean of y at x is µy,x = β0 + β1 x.
The least squares estimators of the regression parameters are given by

β̂1 = Sxy/Sxx and β̂0 = ȳ − β̂1 x̄

Once the parameters are estimated, the equation ŷ = β̂0 + β̂1 x will be called the
estimated regression line, the prediction line, the line of best fit or the least squares
line. It should be noted that ŷ0 = β̂0 + β̂1 x0 can be used as a point estimate of µy,x0 ,
the conditional mean of y at x0 , or a predictor of the response at x0 .
For the data in Example 8.1,

β̂1 = 0.4830 and β̂0 = −2.7393

so that the line of best fit is given by

ŷ = −2.7393 + 0.4830x

At a temperature of 140°C, we predict the yield to be

ŷ0 = β̂0 + 140β̂1 = −2.7393 + 140(0.4830) = 64.8848
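The least squares formulas can likewise be evaluated directly in R; this sketch assumes the data frame dataset of Example 8.1 with columns temp and yield:

```r
# Least squares estimates for Example 8.1 from the Sxy/Sxx formulas
x <- dataset$temp; y <- dataset$yield
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # 0.48303
b0 <- mean(y) - b1 * mean(x)                                     # -2.73939
b0 + b1 * 140   # predicted yield at 140 degrees C: 64.88485
```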

Estimating β0 and β1 Using R

To estimate β0 and β1 by using R run the following code:

reg <- lm(formula = yield ~ temp, data = dataset)


summary(reg)

where formula is the argument to specify the response and predictor variables, data
is the argument to specify the frame where data are saved.
Following output will be displayed:

Call:
lm(formula = yield ~ temp, data = dataset)

Residuals:
Min 1Q Median 3Q Max
-1.3758 -0.5591 0.1242 0.7470 1.1152

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.73939 1.54650 -1.771 0.114
temp 0.48303 0.01046 46.169 5.35e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9503 on 8 degrees of freedom


Multiple R-squared: 0.9963,Adjusted R-squared: 0.9958
F-statistic: 2132 on 1 and 8 DF, p-value: 5.353e-11

For the data in Example 8.1 you will get ŷ = −2.73939 + 0.48303x, which means that
the estimates of β0 and β1 are given by β̂0 = −2.73939 and β̂1 = 0.48303 respectively.
Thus, the fitted simple linear regression model is given by

ŷ = −2.73939 + 0.48303x
To display the fitted values and residuals, you can run the following code:

reg$fitted.values
reg$residuals

8.4 Testing the Slope of the Regression Line

The following table has a list of possible null hypotheses involving the slope β1 , the
critical region and the p-value in each case
The test statistic for these hypotheses is

t = (β̂1 − β10)/√(MSE/Sxx)

H0 vs Ha Rejection Region p-value

β1 = β10 vs β1 ̸= β10 t < −tα/2 or t > tα/2 2P (T > |t|)
β1 = β10 vs β1 < β10 t < −tα P (T < t)
β1 = β10 vs β1 > β10 t > tα P (T > t)

Table 8.1: Hypotheses about β1 and their respective rejection regions and p-values

The hypothesis H0 : β1 = 0 vs Ha : β1 ̸= 0 is known as the hypothesis of the significance
of the regression.
In Example 8.1, test the hypothesis H0 : β1 = 0 vs Ha : β1 ̸= 0 at significance level
α = 0.05. The value of the test statistic is

t = (0.4830 − 0)/√(0.90303/8250) = 46.1689

Since t = 46.1689 > t0.025 = 2.306, we reject H0 in favor of the alternative hypothesis
Ha , and conclude that the regression is significant.
For testing the slope of regression line using R, run the following code:

reg <- lm(formula = yield ~ temp, data = dataset)


summary(reg)

where formula is the argument to specify the response and predictor variables, data
is the argument to specify the frame where data are saved.
Following output will be displayed:

Call:
lm(formula = yield ~ temp, data = dataset)

Residuals:
Min 1Q Median 3Q Max
-1.3758 -0.5591 0.1242 0.7470 1.1152

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.73939 1.54650 -1.771 0.114
temp 0.48303 0.01046 46.169 5.35e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9503 on 8 degrees of freedom


Multiple R-squared: 0.9963,Adjusted R-squared: 0.9958
F-statistic: 2132 on 1 and 8 DF, p-value: 5.353e-11

The standard error of β̂1 , denoted by se(β̂1 ), is given as 0.01046. For testing the
hypotheses H0 : β1 = 0 against H1 : β1 ̸= 0, the calculated t-statistic is equal to 46.169

and the corresponding p-value is 5.35 × 10−11 . Since the p-value for testing the null
hypothesis H0 : β1 = 0 against the alternative hypothesis Ha : β1 ̸= 0 is almost 0,
we reject the null hypothesis in favor of the alternative hypothesis.

8.5 Testing the Significance of the Regression by Analysis of Variance

The variation in the dependent variable, say T SS = Syy , is attributed partly to that in
the independent variable. The rest is attributed to what is called the Sum of Squares
Due to Errors, defined by SSE = Σ(yi − ŷi )², which can be calculated in the following
table:

x y ŷ y − ŷ = e e²
100 45 45.5636 -0.5636 0.3176
110 51 50.3939 0.6061 0.3674
120 54 55.2242 -1.2242 1.4987
130 61 60.0545 0.9455 0.8940
140 66 64.8848 1.1152 1.2437
150 70 69.7151 0.2849 0.0812
160 74 74.5454 -0.5454 0.2975
170 78 79.3757 -1.3757 1.8926
180 85 84.2060 0.7940 0.6304
190 89 89.0363 -0.0363 0.0013
Total 7.2244

The SSE = 7.2244 here; compare these errors (or residuals) with those obtained by R.
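As suggested, the residual sum of squares in this table can be checked against the residuals R produces; this sketch assumes the fitted model reg from Section 8.3:

```r
# SSE from the fitted model: should agree with the hand table (about 7.224)
sum(resid(reg)^2)   # sum of squared residuals
deviance(reg)       # equivalent shortcut for the residual sum of squares
```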

Decomposition of the Sum of Squares

It can be proved that T SS = SSR + SSE, where T SS is the Total Sum of Squares,
SSR is the Sum of Squares due to Regression and SSE is the Sum of Squares of Errors,
also known as the residual sum of squares.

T SS = Syy and SSR = β̂1 Sxy = S²xy /Sxx
The coefficient of determination is defined by

SSR
R2 =
T SS

In Example 8.1, T SS = 1932.1 and SSR = 1924.8757, so that SSE = T SS − SSR = 7.2243.

Note that the expression SSR = β̂1² Sxx may not be computationally efficient.

The coefficient of determination is R² = 0.9963.
In order to test the hypothesis H0 : β1 = 0 vs Ha : β1 ̸= 0 at the 5% significance level
using an F test, we reproduce here the ANOVA table of Example 8.1:

Source SS DF MS F
Regression 1924.875758 1 1924.8757 2131.5738
Error 7.22424243 8 0.90303
Total 1932.10 9

The test statistic for the above hypothesis is

F = MSR/MSE = (SSR/1)/(SSE/(n − 2))

The observed value of the test statistic is f = 2131.5738. Since f = 2131.5738 >
f0.05 = 5.32, the critical value from the F distribution with 1 and n − 2 = 8 degrees of
freedom, we reject H0 : β1 = 0 in favor of the alternative hypothesis Ha at the 5%
significance level.
For generating the analysis of variance (ANOVA) table in R, run the following code:

reg <- lm(formula = yield ~ temp, data = dataset)


anova(reg)

Following output will appear:

Analysis of Variance Table

Response: yield
Df Sum Sq Mean Sq F value Pr(>F)
temp 1 1924.88 1924.9 2131.6 5.353e-11 ***
Residuals 8 7.22 0.9
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

where SSR is equal to 1924.88, SSE is equal to 7.22, M SR is equal to 1924.9, M SE
is equal to 0.9 and, for testing the significance of the full model, the F -statistic given by
F = M SR/M SE is equal to 2132 and the corresponding p-value is 5.353 × 10−11 .
The coefficient of determination can be found from summary of the regression model
by running the following code:

reg <- lm(formula = yield ~ temp, data = dataset)


summary(reg)

Following output will be displayed:

Call:
lm(formula = yield ~ temp, data = dataset)

Residuals:
Min 1Q Median 3Q Max
-1.3758 -0.5591 0.1242 0.7470 1.1152

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.73939 1.54650 -1.771 0.114
temp 0.48303 0.01046 46.169 5.35e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9503 on 8 degrees of freedom


Multiple R-squared: 0.9963,Adjusted R-squared: 0.9958
F-statistic: 2132 on 1 and 8 DF, p-value: 5.353e-11

In the above output, the coefficient of determination, denoted by R² = 1 − SSE/SST ,
is equal to 0.9963. The adjusted coefficient of determination, denoted by
R²adj = 1 − [SSE/(n − k − 1)]/[SST /(n − 1)], is equal to 0.9958.

8.6 Confidence Interval Estimation of Regression Parameters

Confidence Interval (CI) for the slope parameter β1

A 100(1 − α)% CI for β1 is given by β̂1 ± tα/2 √(M SE/Sxx ), where tα/2 is the
100(1 − α/2)th percentile of the t-distribution with df = n − 2, and

M SE = SSE/(n − 2), the estimate for σ²

For the data in Example 8.1, σ̂² = 0.9030, and thus a 95% CI for β1 is given by

0.4830 ± (2.306)√(0.9030/8250)

In other words

0.458877405 ≤ β1 ≤ 0.507156201

The resulting output shows the Total Sum of Squares (T SS), the sum of squares
due to regression (SSR), the sum of squares due to errors (SSE), the mean
squares and the F value. For the data in Example 8.1 above, we have T SS = 1932.1,
SSR = 1924.876 and SSE = 7.224.
To create a 95% confidence interval for β1 , you can run the following R code:

n = 10; k = 1;
Lower <- summary(reg)$coefficients[2,1] -
qt(p = 1-0.05/2, df = n-k-1) * summary(reg)$coefficients[2,2]
Upper <- summary(reg)$coefficients[2,1] +
qt(p = 1-0.05/2, df = n-k-1) * summary(reg)$coefficients[2,2]
print(c(Lower,Upper))

Following output will appear:

[1] 0.4589044 0.5071562
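The same interval is also available in one call through R's built-in confint() function, which computes t-based confidence intervals for all estimated coefficients; this assumes the fitted model reg from Section 8.3:

```r
# One-line alternative to the manual calculation above
confint(reg, level = 0.95)
# The "temp" row reproduces the interval (0.4589044, 0.5071562) for beta1
```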

8.7 Prediction Interval (PI) for a Future Observation y0

A (1 − α)100% PI for a future observation y0 at x0 is given by

(β̂0 + β̂1 x0 ) ± tα/2 √( (1 + 1/n + (x0 − x̄)²/Sxx ) M SE )

For the data in Example 8.1, a 95% prediction interval for the yield at 140°C is given by

64.8848 ± (2.306)√((1 + 0.103)(0.90303)) = 64.8848 ± 2.3014, i.e. 62.5834 ≤ y0 ≤ 67.1862

Confidence Interval for the Conditional Mean µy,x

A 100(1 − α)% CI for the conditional mean at x0 is given by

(β̂0 + β̂1 x0 ) ± tα/2 √( (1/n + (x0 − x̄)²/Sxx ) M SE )

In Example 8.1, a 95% CI for the conditional mean at 140°C is given by

64.8848 ± (2.306)√((0.103)(0.90303)) = 64.8848 ± 0.7033, i.e. 64.1814 ≤ µy,140 ≤ 65.5882

To do the prediction and estimation of mean response, you need to enter the new data
in a way similar to the existing data by giving the same variable name(s). Suppose
we want to do the prediction at two levels of temperature, i.e. 140°C and 165°C. The
R code for entering the new data in a data frame named dataset2 is given as:

dataset2 <- data.frame(temp = c(140,165))

To find the predicted values and the corresponding prediction intervals, run the
following R code:

predict(object = reg, newdata = dataset2, interval = "prediction", level = 0.95)

Following output will be displayed:

fit lwr upr
1 64.88485 62.58338 67.18632
2 76.96061 74.61220 79.30902

For the temperature at 140°C, the predicted yield is 64.88485 with a 95% prediction
interval [62.58338, 67.18632]. Similarly, the second row of output is for 165°C.
Also, a 95% confidence interval for mean yield when the temperature is 140°C can be
found by running the following R code:

predict(object = reg, newdata = dataset2, interval = "confidence", level = 0.95)

Following output will be displayed:

fit lwr upr
1 64.88485 64.18146 65.58823
2 76.96061 76.11620 77.80501

8.8 Checking Model Assumptions

We now discuss how to verify the assumptions that the random errors are normally
distributed and that they have a constant variance.

Checking the Assumption of Normality

To check the assumption that the errors follow a normal distribution, a normal QQ
plot of residuals is drawn. If the plot is approximately linear, then the assumption
is justified, otherwise, the assumption is not justified. To get the normal QQ plot of
residuals, we run the following code in R assuming that we have the data and model
of Example 8.1 in the Regression Module named reg.

qqnorm(rstandard(reg), pch = 19, col = "darkblue", main = "Normal QQ Plot")
qqline(rstandard(reg), col = "red", lwd = 2)

Following Normal QQ plot of standardized residuals will be displayed in the plots


area:

Since the normal probability plot of residuals for the data in Example 8.1 exhibits
a linear trend, the normality assumption is not seriously violated. Moreover,
a histogram of standardized residuals can also give you some idea of the normality of
errors if the sample size is reasonably large. The following R code can be used to draw
a histogram in R:

hist(rstandard(reg), col = "steelblue", main = "Histogram",
     xlab = "Standardized Residuals")

Following output will be displayed in the plots area:

Checking the Assumption of Constancy of Variance

To check the assumptions of linearity and constant variance, a versus fit graph of
the standardized residuals versus the fitted values is plotted. If the graph shows no
pattern, then the assumptions are justified. Otherwise, either or both of the two
assumptions are not justified.
To get the versus fit, we run the following code:

plot(x = fitted(reg), y = rstandard(reg), pch = 19, col = "red",
     main = "Versus Fit", xlab = "Fitted Values",
     ylab = "Standardized Residuals")
abline(0,0)

Following versus fit plot will be displayed in the plots area:

The above graph is the residual plot for example 8.1. Since it does not exhibit any
pattern, we conclude that the linearity and constant variance assumptions are justified.
To check the assumption that the errors are independent, a versus order graph of the

standardized residuals versus the observation number is plotted. If the graph shows no
pattern, then the assumption is justified. Otherwise, the assumption is not justified.
To get the versus order plot, we run the following code:

plot(rstandard(reg), type = "o", col = "red", main = "Versus Order",
     ylab = "Standardized Residuals", pch = 19)

Following versus order plot will be displayed in the plots area:

The versus order plot does not give any clear evidence of positive/negative autocorre-
lation which means that the assumption of independence is justified.
If one is interested in plotting all the above 4 graphs on to a single plot, following is
the R code:

par(mfrow = c(2,2), cex = 0.6)

qqnorm(rstandard(reg), pch = 19, col = "darkblue", main = "Normal QQ Plot")
qqline(rstandard(reg), col = "red", lwd = 2)

plot(x = fitted(reg), y = rstandard(reg), pch = 19, col = "red",
     main = "Versus Fit", xlab = "Fitted Values",
     ylab = "Standardized Residuals")
abline(0,0)

hist(rstandard(reg), col = "steelblue", breaks = 9,
     xlab = "Standardized Residuals", main = "Histogram")

plot(rstandard(reg), type = "o", col = "red", main = "Versus Order",
     ylab = "Standardized Residuals", pch = 19)

where the par will create the plot with first 2 graphs in first row and the last 2 graphs
in the second row. cex indicates the amount by which plotting text and symbols
should be scaled relative to the default.
Following 4 in 1 plot will appear:

8.9 Multiple Linear Regression

The multiple regression model is a mathematical model which explains the relationship
between the dependent variable and two or more independent variables. For example, a
manufacturer wants to model the quality of a product (Y ) as a function of temperature
(x1 ) and pressure (x2 ) at which it is produced.
The multiple linear regression model with 2 independent variables x1 and x2 is given
by

Y = β0 + β1 x1 + β2 x2 + ϵ

where β0 and ϵ are the intercept and the random error term respectively. We shall
refer to the β ′ s in the model as the regression parameters.
Example 8.2 Consider the problem of predicting gasoline mileage (in miles per
gallon), where the independent variables are fuel octane rating x1 and average speed
(mile per hour) x2 . The sample data obtained from 20 test runs with cars at various
speeds are as follows:

y x1 x2
24.8 88 52
30.6 93 60
31.1 91 28
28.2 90 52
31.6 90 55
29.9 89 46
31.5 92 58
27.2 87 46
33.3 94 55
32.6 95 62
30.6 88 47
28.1 89 58
25.2 90 63
35.0 93 54
29.2 91 53
31.9 92 52
27.7 89 52
31.7 94 53
34.2 93 54
30.1 91 58

Estimate the linear regression model and interpret your results.


Solution To solve this problem using R, first import the data on the three variables
using the method explained in Section 8.1. Suppose the data frame is imported with
the name multipledata and the column (variable) names are y, x1 and x2 . Run the
following R code to repeat everything we did for Example 8.1, including model
estimation, a confidence interval for the coefficient β1 , a prediction interval, a confidence
interval for the mean response, and residual analysis:

reg2 <- lm(formula = y ~ x1 + x2, data = multipledata)
summary(reg2)

n = 20; k = 2;
Lower <- summary(reg2)$coefficients[2,1] -
  qt(p = 1-0.05/2, df = n-k-1) * summary(reg2)$coefficients[2,2]
Upper <- summary(reg2)$coefficients[2,1] +
  qt(p = 1-0.05/2, df = n-k-1) * summary(reg2)$coefficients[2,2]
print(c(Lower,Upper))

multipledata2 <- data.frame(x1 = c(91), x2 = c(51))
predict(object = reg2, newdata = multipledata2,
        interval = "prediction", level = 0.95)
predict(object = reg2, newdata = multipledata2,
        interval = "confidence", level = 0.95)

par(mfrow = c(2,2), cex = 0.6)

qqnorm(rstandard(reg2), pch = 19, col = "darkblue", main = "Normal QQ Plot")
qqline(rstandard(reg2), col = "red", lwd = 2)

plot(x = fitted(reg2), y = rstandard(reg2), pch = 19, col = "red",
     main = "Versus Fit", xlab = "Fitted Values",
     ylab = "Standardized Residuals")
abline(0,0)

hist(rstandard(reg2), col = "steelblue", breaks = 9,
     xlab = "Standardized Residuals", main = "Histogram")

plot(rstandard(reg2), type = "o", col = "red", main = "Versus Order",
     ylab = "Standardized Residuals", pch = 19)

The following outputs will be displayed:

Model Summary
Call:
lm(formula = y ~ x1 + x2, data = multipledata)

Residuals:
    Min      1Q  Median      3Q     Max
-2.9712 -1.0873  0.2258  0.7949  2.8037

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -56.79512 16.65462 -3.410 0.00333 **
x1 1.01930 0.19149 5.323 5.61e-05 ***
x2 -0.10747 0.05743 -1.871 0.07861 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.777 on 17 degrees of freedom


Multiple R-squared: 0.6251,  Adjusted R-squared: 0.581
F-statistic: 14.17 on 2 and 17 DF, p-value: 0.000239

Confidence interval for the coefficient β1

[1] 0.6152958 1.4233044
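The interval above was computed manually with qt(); base R's confint() performs the same computation. The sketch below is self-contained on simulated data (the seed, data frame d, and coefficient values are illustrative assumptions, not the text's multipledata); for the text's fitted model, confint(reg2, parm = "x1") gives the same interval as the Lower/Upper calculation.

```r
# confint() reproduces the manual qt()-based interval (illustrative sketch)
set.seed(2)                                        # assumed seed
d <- data.frame(x1 = rnorm(20, 91, 2), x2 = rnorm(20, 52, 5))
d$y <- -57 + 1.02 * d$x1 - 0.11 * d$x2 + rnorm(20, sd = 1.8)
m <- lm(y ~ x1 + x2, data = d)
confint(m, parm = "x1", level = 0.95)              # 95% CI for the x1 slope
```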

Prediction Interval

       fit      lwr      upr
1 30.48016 26.63041 34.32991

Confidence interval for mean response

       fit      lwr      upr
1 30.48016 29.60863 31.35169

Residual analysis

For the data in Example 8.2, one can read the estimates of the regression parameters
as:

βb0 = −56.79512, βb1 = 1.0193 and βb2 = −0.10747

Thus, the predicted multiple linear regression model for the given data is

Yb = −56.8 + 1.019X1 − 0.1075X2

If ‘average speed’ (X2 ) is held constant, it is estimated that a 1-unit increase in
‘octane’ (X1 ) would result in a 1.019-unit increase in the expected ‘gasoline mileage’.
Similarly, if ‘octane’ (X1 ) is held constant, it is estimated that a 1-unit increase in
‘average speed’ (X2 ) would result in a 0.1075-unit decrease in the expected ‘gasoline
mileage’.
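As a quick arithmetic check, substituting x1 = 91 and x2 = 51 into the estimated equation reproduces the fit value reported in the prediction output above. The coefficients below are copied from the summary output; the tiny discrepancy from 30.48016 comes only from rounding them to five decimals.

```r
# Fitted value at x1 = 91, x2 = 51 by direct substitution into the
# estimated equation (coefficients taken from summary(reg2))
b0 <- -56.79512; b1 <- 1.01930; b2 <- -0.10747
yhat <- b0 + b1 * 91 + b2 * 51
print(round(yhat, 2))   # agrees with the reported fit of 30.48016
```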

Exercises

8.1 (cf. Devore, J. L., 2000, 510). The following data represent the burner area
liberation rate (= x) and the NOx emission rate (= y):

x 100 125 125 150 150 200 200 250 250 300 300 350 400
y 150 140 180 210 190 320 280 400 430 440 390 600 610

(a) Assuming that the simple linear regression model is valid, obtain the least
squares estimate of the true regression line.
(b) What is the estimate of the expected NOx emission rate when the burner area
liberation rate equals 225?
(c) Estimate the amount by which you expect the NOx emission rate to change
when the burner area liberation rate is decreased by 50.

8.2 (cf. Devore, J. L., 2000, 510). The following data represent the wet deposition
(NO3) (= x) and lichen N (% dry weight) (= y):

x 0.05 0.10 0.11 0.12 0.31 0.42 0.58 0.68 0.68 0.73 0.85
y 0.48 0.55 0.48 0.50 0.58 0.52 0.86 1.0 0.86 0.88 1.04

(a) What are the least square estimates of β0 and β1 ?

(b) Predict lichen N for an NO3 deposition value of 0.5.
(c) Test the significance of regression at 5% level of significance.

8.3 (Devore, J. L., 2000, 510). The following data represent x = available travel space
in feet, and y = separation distance:

x 12.8 12.9 12.9 13.6 14.5 14.6 15.1 17.5 19.5 20.8
y 5.5 6.2 6.3 7.0 7.8 8.3 7.1 10.0 10.8 11.0

(a) Derive the equation of the estimated line.


(b) What separation distance would you predict if the available travel space is
15.0?

8.4 (Devore, J. L., 2000, 511). Consider the following data set in which the variables
of interest are x = commuting distance and y = commuting time:

x 15 16 17 18 19 20 5 10 15 20 25 50 5 10 15 20 25 50
y 42 45 35 42 49 46 16 32 44 45 63 115 8 16 22 23 31 60

Obtain the least squares estimate of the regression model.

8.5 (cf. Devore, J. L., 2000, 584). Soil and sediment adsorption, the extent
to which chemicals collect in a condensed form on the surface, is an important
characteristic influencing the effectiveness of pesticides and various agricultural
chemicals. The article “Adsorption of Phosphate, Arsenate, Methanearsonate,
and Cacodylate by Lake and Stream Sediments: Comparisons with Soils” (J. of
Environ. Qual., 1984: 499-504) gives the accompanying data on y = phosphate
adsorption index, x1 = amount of extractable iron, and x2 = amount of extractable
aluminum:

x1 61 175 111 124 130 173 169 169 160 244 257 333 199
x2 13 21 24 23 64 38 33 61 39 71 112 88 54
y 4 18 14 18 26 26 21 30 28 36 65 62 40

(a) Find the least squares estimates of the parameters and write the equation of
the estimated model.
(b) Predict the adsorption index resulting from an extractable iron = 250 and
extractable aluminum = 55.
(c) Test the null hypothesis that β2 = 0 against the alternative hypothesis that
β2 ̸= 0 at 5% level of significance.

8.6 (Johnson, R. A., 2000, 345). The following table shows how many weeks a sample
of 6 persons have worked at an automobile inspection station and the number of
cars each one inspected between noon and 2 P.M. on a given day:

Number of weeks employed (x) 2 7 9 1 5 12


Number of cars inspected (y) 13 21 23 14 15 21

(a) Find the equation of the least squares line, which will enable us to predict y
in terms of x.
(b) Use the result of part (a) to estimate how many cars someone who has been
working at the inspection station for 8 weeks can be expected to inspect
during the given 2-hour period.

8.7 (cf. Devore, J. L., 2000, 590). An investigation of a die casting process resulted in
the accompanying data on x1 = furnace temperature, x2 = die close time, and y =
temperature difference on the die surface (“A Multiple-Objective Decision-Making
Approach for Assessing Simultaneous Improvement in Die Life and Casting Quality
in a Die Casting Process,” Quality Engineering, 1994: 371-383).

x1 1250 1300 1350 1250 1300 1250 1300 1350 1350


x2 6 7 6 7 6 8 8 7 8
y 80 95 101 85 92 87 96 106 108

(a) Write the equation of the estimated model.


(b) Test the null hypothesis that β1 = 0 against the alternative hypothesis that
β1 ̸= 0 at 5% level of significance.

8.8 (cf. Johnson, R. A., 2000, 334). The following are measurements of the air velocity
(cm/sec) and the evaporation coefficient (mm2/sec) of burning fuel droplets in an
impulse engine:

Air Velocity x 20 60 100 140 180 220 160 300 340 380
Evaporation coefficient y 0.18 0.37 0.35 0.78 0.56 0.75 1.18 1.36 1.17 1.65

(a) Fit a straight line to the data by the method of least squares and use it to
estimate the evaporation coefficient of a droplet when the air velocity is
190 cm/sec.
(b) Test the null hypothesis that β1 = 0 against the alternative hypothesis β1 ̸= 0
at the 0.05 level of significance.

8.9 (Johnson, R. A., 2000, 344). A chemical company, wishing to study the effect of
extraction time on the efficiency of an extraction operation, obtained the data
shown in the following table:

Extraction time (minutes) (x) 27 45 41 19 35 39 19 49 15 31


Extraction efficiency (%) (y) 57 64 80 46 62 72 52 77 57 68

(a) Draw a scattergram to verify that a straight line will provide a good fit to
the data.
(b) Draw a straight line to predict the extraction efficiency one can expect when
the extraction time is 35 minutes.

8.10 (cf. Johnson, R. A., 2000, 347). The cost of manufacturing a lot of certain product
depends on the lot size, as shown by the following sample data:

Cost (Dollars) 30 70 140 270 530 1010 2500 5020


Lot Size 1 5 10 25 50 100 250 500

(a) Draw a scattergram to verify the assumption that the relationship is linear,
letting lot size be x and cost y.
(b) Fit a straight line to these data by the method of least squares, using lot size
as the independent variable, and draw its graph on the diagram obtained in
part (a).

8.11 (Johnson, R. A., 2000, 345). In the following table, x is the tensile force applied
to a steel specimen in thousands of pounds, and y is the resulting elongation in
thousandths of an inch:

x  1  2  3  4  5  6
y 14 33 40 63 76 85

(a) Graph the data to verify that it is reasonable to assume that the regression of
y on x is linear.
(b) Find the equation of the least square line, and use it to predict the elongation
when the tensile force is 3.5 thousand pounds.

8.12 (Johnson, R. A., 2000, 385). The following are data on the number of twists (y)
required to break a certain kind of forged alloy bar and the percentages of two
alloying elements (x1 and x2 ) present in the metal:

y 41 49 69 65 40 50 58 57 31 36 44 57 19 31 33 43
x1 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
x2 5 5 5 5 10 10 10 10 15 15 15 15 20 20 20 20

Fit a least squares regression plane and use its equation to estimate the number of
twists required to break one of the bars when x1 = 2.5 and x2 = 12.

8.13 (Johnson, R. A., 2000, 263). Twelve specimens of cold-reduced sheet steel, having
different copper contents (x1 ) and annealing temperatures (x2 ), are measured for
hardness (y) with the following results:

y 78.9 65.1 55.2 56.4 80.9 69.7 57.4 55.4 85.3 71.8 60.7 58.9
x1 0.02 0.02 0.02 0.02 0.10 0.10 0.10 0.10 0.18 0.18 0.18 0.18
x2 1000 1100 1200 1300 1000 1100 1200 1300 1000 1100 1200 1300

Fit an equation of the form y = β0 + β1 x1 + β2 x2 , where x1 represents the copper


content, x2 represents the annealing temperature, and y represents the hardness.

8.14 Suppose the following data give the mass of adults in kilograms sampled from
three villages A, B, and C:

A B C
71.5 68.0 62.0 65.0 60.5 62.8
63.3 73.5 71.3 73.1 58.4 58.7
73.6 74.1 64.7 72.5 65.5 58.1
78.1 73.6 66.0 73.0 50.6 58.5
66.5 76.5 72.6 66.8 62.1 52.6
74.3 73.3 71.1 71.9 64.5 58.8
76.3 76.0 65.4 65.0 60.6 63.7
62.0 64.0 59.1 69.9 62.9 61.6
62.8 69.6 69.7 69.8 60.2 67.2
72.9 69.2 77.1 78.5 63.4 58.2

(a) Assuming that these samples are independent, run t-tests to determine which
of the villages have identical mean mass of adults, stating clearly the hypotheses
you are testing. State your conclusions based on the p-value as well as the
t-value.
(b) State the assumption under which your tests are valid.

8.15 (cf. Dougherty, 1990, 595). When smoothing a surface with an abrasive, the
roughness of the finished surface decreases as the abrasive grain becomes finer.
The following data give measurements of surface roughness (in micrometers) in
terms of the grit numbers of the grains, finer grains possessing larger grit numbers.

x 24 30 36 46 54 60
y 0.34 0.30 0.28 0.22 0.19 0.18

(a) Draw a scatter diagram. Do you recommend fitting a linear regression model?
(b) How strong is the linear correlation between the two variables?
(c) Do you think that there is strong nonlinear correlation between the two
variables?

8.16 (cf. Johnson, R. A., 2000, 578). The article “How to Optimize and Control
the Wire Bonding Process: Part II” (Solid State Technology, Jan. 1991: 67-72)
described an experiment carried out to assess the impact of the variables x1 =
force (gm), x2 = power (mw), x3 = temperature (◦ C), and x4 = time (ms) on y =
ball bond shear strength (gm). The following data were generated to be consistent
with the information given in the article:

Observations Force Power Temperature Time Strength


1 30 60 175 15 26.2
2 40 60 175 15 26.3
3 30 90 175 15 39.8
4 40 90 175 15 39.7
5 30 60 225 15 38.6
6 40 60 225 15 35.5
7 30 90 225 15 48.8
8 40 90 225 15 37.8
9 30 60 175 25 26.6
10 40 60 175 25 23.4
11 30 90 175 25 38.6
12 40 90 175 25 52.1
13 30 60 225 25 39.5
14 40 60 225 25 32.3
15 30 90 225 25 43.0
16 40 90 225 25 56.0
17 25 75 200 20 35.2
18 45 75 200 20 46.9
19 35 45 200 20 22.7
20 35 105 200 20 58.7
21 35 75 150 20 34.5
22 35 75 250 20 44.0

23 35 75 200 10 35.7
24 35 75 200 30 41.8
25 35 75 200 20 36.5
26 35 75 200 20 37.6
27 35 75 200 20 40.3
28 35 75 200 20 46.0
29 35 75 200 20 27.8
30 35 75 200 20 40.3

(a) Find the least squares estimates of the parameters and write the equation of
the estimated model.
(b) Make a prediction of strength resulting from a force of 35 gm, power of 75
mw, temperature of 200 degrees, and time of 20 ms.
(c) Test the null hypothesis that β3 = 0 against the alternative hypothesis that
β3 ̸= 0 at 5% level of significance.

