Statistics of Engineers
STATISTICS
BY
Zzigwa Marvin & Samuel Kibirango
WHAT IS STATISTICS?
• Statistics is the science of creating, developing, and applying techniques such that
decisions can be made and evaluated in the face of uncertainty.
Statistic
• Any numerical or descriptive measure of a sample is referred to as a statistic. Usually, a given population
parameter corresponds to a sample statistic. A statistic varies from sample to sample and within samples.
Estimate
• A sample statistic ESTIMATES a given population parameter. Since it is generally not possible, or very difficult to
calculate the value of the population parameter directly, we calculate a sample statistic that corresponds to that
parameter and it is hopefully close to its true value.
Experiment/Trial/event
• An experiment is the process by which an observation (or measurement) is obtained. The experiment, when
performed, is often called a trial. An event is the outcome of an experiment. Example: A coin is tossed. This is an
experiment. The outcome is either ‘Heads’ or ‘Tails’. The outcome that is obtained in the experiment is the event.
BASIC COMMON CONCEPTS IN STATISTICS
Variable (variate)
• Independent variables are those phenomena which are capable of influencing other
variables. For instance, the weight of an individual is a dependent variable, while the
dietary habits which influence that weight are the independent variable.
• Different sorts of variables are encountered by statisticians and it is desirable to distinguish among
them. This distinction is illustrated in the diagram below.
• Variate
  - Quantitative
      - Discrete
      - Continuous
  - Qualitative
      - Nominal/categorical
      - Ordinal
a) A quantitative variate includes data that represent the amount or quantity of something, and measurement is made
over a range of values. The observations of a quantitative variate possess a natural order or ranking.
i) A continuous quantitative variate is one for which all values in some range are possible e.g., height, weight, etc.
ii) A discrete quantitative variate is one for which only certain values are possible. They are usually consecutive
integers; for example, size of household, number of insects in an insect trap, etc.
b) A qualitative variate describes a characteristic or attribute of an individual. It includes data that can be categorized
but not quantified.
• Variates of this type are often represented by ‘scores’ such as 1 – 2 – 3, and as such are erroneously treated
statistically as discrete variables.
i) Nominal/categorical variates are associated with some quality, characteristic or attribute which the variable
possesses; for example, eye color (blue – grey - green – brown); vehicle type (bus – truck – car – bicycle); etc.
Random variate
• The observations of a random variate can assume any of the possible values in a population with equal likelihood.
Model
• A model is an attempt to construct a simplified explanation of the complex reality in a manner which can be easily
understood, and whose nature can generally be said to be possible at all times.
• Models are often used to describe phenomena in social sciences because the interaction of variables is of a
complex nature and it is impossible to run controlled experiments to determine the nature of the interaction of
such variables. For example, the habits of any consumer population can only be modeled, because we cannot lock
any member or members of the population or any two members of the population in a laboratory for a long time
to study their consumer habits.
Survey
• In a survey, data is collected over various locations, with no control over the factors under study. In a survey,
extraneous factors cannot be held constant.
Hypothesis
A hypothesis is an educated guess of the nature of the relationship between variables which is capable of being verified
during the process of modeling or experimentation. Hypotheses can never be proven; they are only verified.
TYPES OF STUDIES
• Comparative vs absolute
In a comparative experiment, two or more procedures or treatments are tested against one another by
means of some criterion for performance. In an absolute experiment, a single procedure is studied
and/or the purpose is to obtain a single certain measurement.
• Objective experimentation
Here, the focus is to: (i) verify a hypothesis, and (ii) generate ‘empirical knowledge’ (knowledge based
on observation or experience rather than theory). Such a study uses the ‘empirical method’.
• Observational experiments
These are carried out to demonstrate (usually visually) the effectiveness or otherwise of particular
practices or recommendations
DESCRIPTIVE STATISTICS
▪ Descriptive statistics are used to describe the basic features of the data in a study.
▪ They provide simple summaries about the sample and the measures.
▪ Together with simple graphics analysis, they form the basis of virtually every quantitative analysis
of data.
▪ With inferential statistics, by contrast, we try to reach conclusions that extend beyond the immediate data
alone. For instance, we use inferential statistics to try to infer from the sample data what the
population might think.
▪ How to properly describe data through statistics and graphs is important. Typically, there are two
general types of statistic that are used to describe data:
▪ Measures of central tendency: these are ways of describing the central position of a frequency
distribution for a group of data. There are three major types of estimates of central tendency:
mode, median, and mean.
▪ Measures of spread: these are ways of summarizing a group of data by describing how spread out
the scores are. For example, the mean score of our 100 students may be 65 out of 100. However, not
all students will have scored 65 marks. Rather, their scores will be spread out. Some will be lower
and others higher.
▪ Measures of spread help us to summarize how spread out these scores are. To describe this spread,
a number of statistics are available to us, including the range, quartiles, variance and standard
deviation.
GRAPHICAL METHODS OF DATA DESCRIPTION
▪ Data can be described using types of graphs depending on the type of data (variable)
(a) Continuous data
- Histograms
- Line graphs (frequency polygons)
- Cumulative frequency curves (ogive)
- Boxplot
(b) Discrete data
- Column charts
- Bar graphs
- Pie charts
(c) Qualitative data (nominal and ordinal)
- Column charts
- Bar graphs
- Pie charts
▪ Consider results of a survey in which the gross income of 8,644 households is obtained.
In raw form, such information would somewhat be meaningless because of bulkiness.
In such situations it is necessary to summarize the data in such a way that some ‘picture’
is obtained. The table below is the result of such a summary.
(a) Whenever possible ensure that class intervals are the same for all classes.
(b) Ensure that each item (observation or measurement) is allocated to one class
only.
MEAN
• The mean (informally, the “average”) is found by adding all of the numbers
together and dividing by the number of items in the set: (10 + 10 + 20 + 40 + 70) / 5
= 150 / 5 = 30.
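As a quick check of the arithmetic, the same calculation can be sketched in Python. The variable names are illustrative only.

```python
# Arithmetic mean of the five values used in the text.
values = [10, 10, 20, 40, 70]

# Add all the numbers, then divide by how many there are.
mean = sum(values) / len(values)

print(mean)
```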
MEAN FOR GROUPED DATA
• Example: The following table gives the frequency distribution of the number of
orders received each day during the past 50 days at the office of a mail-order
company. Calculate the mean.
▪ Start by getting the class boundaries.
▪ Then get the midpoint X of each class by adding the class limits and dividing by 2.
▪ The mean (average) is $\bar{X} = \frac{\sum fX}{n}$
▪ $= \frac{832}{50}$
▪ $= 16.64$
MEDIAN
▪ The median is the middle number in a data set. To find the median, list
your data points in ascending order and then find the middle number.
The middle number in this set is 28 as there are 4 numbers below it and
4 numbers above:
23, 24, 26, 26, 28, 29, 30, 31, 33
▪ Note: If you have an even number of data points, average the middle two to
find the median. For example, the median of this set of numbers is 28.5,
that is, (28 + 29) / 2.
23, 24, 26, 26, 28, 29, 30, 31, 33, 34
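Both cases can be checked with Python's standard `statistics` module. This is only a sketch of the rule described above; the list names are illustrative.

```python
from statistics import median

# Nine values: the median is the single middle value.
odd_set = [23, 24, 26, 26, 28, 29, 30, 31, 33]

# Ten values: the median is the average of the two middle values (28 and 29).
even_set = [23, 24, 26, 26, 28, 29, 30, 31, 33, 34]

median_odd = median(odd_set)
median_even = median(even_set)

print(median_odd, median_even)
```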
MEDIAN FOR GROUPED DATA
▪ $\text{Median} = L_m + \left(\frac{\frac{n}{2} - F}{f_m}\right) i$
▪ Where:
n = the total frequency
F = the cumulative frequency before the median class
i = the class width
$L_m$ = the lower boundary of the median class
$f_m$ = the frequency of the median class
▪ Based on the grouped data below, find the median:
SOLUTION
▪ $\text{Median} = 20.5 + \left(\frac{25 - 22}{12}\right) 10$
▪ $= 23$
QUARTILES
▪ Using the same method of calculation as for the median, we can get the Q1 and Q3 equations as follows:
▪ $Q_1 = L_{Q_1} + \left(\frac{\frac{n}{4} - F}{f_{Q_1}}\right) i$ and $Q_3 = L_{Q_3} + \left(\frac{\frac{3n}{4} - F}{f_{Q_3}}\right) i$
▪ Example: Based on the grouped data below, find the interquartile range.
SOLUTION
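▪ First Step: Construct the cumulative frequency distribution.
▪ Second Step: Determine Q1 and Q3.
▪ Position of the Q1 class: $\frac{n}{4} = \frac{50}{4} = 12.5$
▪ $Q_1 = 10.5 + \left(\frac{12.5 - 8}{14}\right) 10 = 13.7143$
▪ Position of the Q3 class: $\frac{3n}{4} = \frac{3(50)}{4} = 37.5$
▪ $Q_3 = 30.5 + \left(\frac{37.5 - 34}{9}\right) 10 = 34.3889$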
INTERQUARTILE RANGE (IQR)
▪ $\text{IQR} = Q_3 - Q_1$
▪ $= 34.3889 - 13.7143 = 20.6746$
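The interpolation used for the grouped median and quartiles can be sketched as one small Python helper. The data table is not reproduced in the text, so the class boundaries and frequencies below are an assumption inferred from the figures in the worked solutions (lower boundaries 10.5, 20.5 and 30.5, class width 10, cumulative frequencies 8, 22 and 34, n = 50). The function name `grouped_quantile` is hypothetical.

```python
def grouped_quantile(boundaries, freqs, fraction):
    """Interpolate a quantile from grouped data.

    boundaries: class boundaries, one more entry than freqs
    freqs:      class frequencies
    fraction:   0.25 for Q1, 0.5 for the median, 0.75 for Q3
    """
    n = sum(freqs)
    target = fraction * n          # n/4, n/2 or 3n/4
    cumulative = 0                 # F: cumulative frequency before the class
    for k, f in enumerate(freqs):
        if cumulative + f >= target:
            lower = boundaries[k]
            width = boundaries[k + 1] - boundaries[k]
            return lower + (target - cumulative) / f * width
        cumulative += f
    raise ValueError("fraction must be between 0 and 1")


# Assumed table, inferred from the numbers used in the solutions.
boundaries = [0.5, 10.5, 20.5, 30.5, 40.5, 50.5]
freqs = [8, 14, 12, 9, 7]

q1 = grouped_quantile(boundaries, freqs, 0.25)
med = grouped_quantile(boundaries, freqs, 0.50)
q3 = grouped_quantile(boundaries, freqs, 0.75)
iqr = q3 - q1

print(round(q1, 4), round(med, 4), round(q3, 4), round(iqr, 4))
```

Using one function for all three positions makes it visible that the median and the quartiles differ only in the target position (n/2, n/4 or 3n/4).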
MODE
Mode is the value that has the highest frequency in a data set.
For example, the mode in the set 21, 21, 21, 23, 24, 26, 26, 28, 29, 30, 31 and 33 is 21.
▪ For grouped data, the modal class is the class with the highest frequency.
▪ To find the mode for grouped data, use the following formula:
▪ $\text{Mode} = L_{mo} + \left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right) i$
▪ Where
▪ i is the class width
▪ $\Delta_1$ is the difference between the frequency of the modal class and the frequency of the class before the modal class.
▪ $\Delta_2$ is the difference between the frequency of the modal class and the frequency of the class after the modal class.
▪ $L_{mo}$ is the lower boundary of the modal class.
▪ Based on the grouped data below, find the mode
SOLUTION
Based on the table above,
▪ $\text{Mode} = 10.5 + \left(\frac{6}{6 + 2}\right) 10 = 10.5 + 7.5 = 18.0$
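The grouped-mode formula can be sketched as follows. The three frequencies (8, 14 and 12) are an assumption inferred from the differences 6 and 2 and the cumulative frequencies used in the earlier solutions, since the table itself is not reproduced. The function name `grouped_mode` is hypothetical.

```python
def grouped_mode(lower_boundary, width, f_before, f_modal, f_after):
    """Mode of grouped data by interpolation inside the modal class."""
    delta_before = f_modal - f_before  # modal class vs. the class before it
    delta_after = f_modal - f_after    # modal class vs. the class after it
    return lower_boundary + delta_before / (delta_before + delta_after) * width


# Modal class assumed to have boundary 10.5, width 10 and frequency 14,
# with neighbouring frequencies 8 (before) and 12 (after).
mode = grouped_mode(10.5, 10, f_before=8, f_modal=14, f_after=12)

print(mode)
```

The expression evaluated is 10.5 + 6 / (6 + 2) × 10.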
▪ Population variance: $\sigma^2 = \frac{\sum fX^2 - \frac{(\sum fX)^2}{N}}{N}$
▪ Variance of sample data: $s^2 = \frac{\sum fX^2 - \frac{(\sum fX)^2}{n}}{n - 1}$
▪ Standard deviation
▪ Population: $\sigma = \sqrt{\sigma^2}$
▪ Sample: $s = \sqrt{s^2}$
Example
▪ Find the variance and standard deviation for the following data
▪ Variance: $s^2 = \frac{\sum fX^2 - \frac{(\sum fX)^2}{n}}{n - 1}$
▪ $= \frac{14216 - \frac{832^2}{50}}{50 - 1}$
▪ $= 7.5820$
▪ Standard deviation: $s = \sqrt{s^2}$
▪ $= \sqrt{7.5820}$
▪ $= 2.75$
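The shortcut formula for the sample variance can be checked from the two sums quoted in the solution. This is a minimal sketch; the variable names are illustrative, and only the totals (not the underlying table) are taken from the text.

```python
from math import sqrt

n = 50            # total frequency
sum_fx = 832      # sum of f * X
sum_fx2 = 14216   # sum of f * X^2

# Sample variance: (sum fX^2 - (sum fX)^2 / n) / (n - 1)
sample_variance = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)
sample_sd = sqrt(sample_variance)

print(round(sample_variance, 4), round(sample_sd, 2))
```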
To design a rural multilane road, vehicle speed surveys were carried out on the Entebbe Expressway,
as shown in the table below.
▪ A probability experiment is the process that leads to well defined results called outcomes.
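$P(E) = \frac{\text{Number of outcomes in } E}{\text{Total number of outcomes in the sample space}} = \frac{n(E)}{n(S)}$
▪ First addition rule, for mutually exclusive events A and B: $P(A \text{ or } B) = P(A) + P(B)$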
Example
▪ A day of the week is selected at random. Find the probability that it is a weekend day.
- The second addition rule is used when two events are not mutually exclusive. If A and B are not
mutually exclusive, the probability that A or B will occur is given by:
$P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$
▪ The second formula also works for mutually exclusive events, because P(A and B) is then always zero:
both events cannot happen at the same time.
COMPLEMENT RULE
▪ The complement of an event is the set of outcomes that are not included in the outcomes of the
event. If the probability of an event is known, the probability of its complement can be got by
subtracting it from one since one is the highest probability. Likewise, if the probability of the
complement is known, the probability of the event can be got by subtracting the complement
probability from one. The complement of event A is denoted by a bar on top of A.
Example
▪ The probability that beans will germinate is $\frac{1}{5}$. Find the probability that beans will not germinate.
Solution
▪ The multiplicative rules are used to find probability of two or more events that occur in sequence.
For example if a coin is tossed and a die is rolled, one can get the probability of getting a tail on the
coin and then getting a two in the die. These two events are independent since the outcome of
tossing a coin does not affect the outcome of rolling a die.
- Multiplication rule 1
▪ When two events are independent, the probability of both occurring is given by:
$P(A \text{ and } B) = P(A) \cdot P(B)$
- Multiplication rule 2
▪ This is used to find probability when the outcome of the first event affects the outcome of the
second event. These are called dependent events. The formula for probability is given by:
$P(A \text{ and } B) = P(B) \cdot P(A \mid B)$
▪ The conditional probability of an event can be obtained by dividing both sides of the
equation for multiplication rule 2 by the probability of B:
▪ $P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)}$
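The coin-and-die example mentioned earlier can be worked through with exact fractions. This is a sketch only; it assumes a fair coin and a fair six-sided die, and the variable names are illustrative.

```python
from fractions import Fraction

p_tail = Fraction(1, 2)   # P(tail) on a fair coin
p_two = Fraction(1, 6)    # P(two) on a fair die

# Independent events: P(tail and two) = P(tail) * P(two)
p_tail_and_two = p_tail * p_two

# Conditional probability: P(tail | two) = P(tail and two) / P(two)
p_tail_given_two = p_tail_and_two / p_two

print(p_tail_and_two, p_tail_given_two)
```

Because the two events are independent, the conditional probability of a tail given a two equals the plain probability of a tail.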
BINOMIAL PROBABILITY DISTRIBUTION
▪ A binomial distribution consists of the outcomes of a binomial experiment and their corresponding
probabilities. In binomial experiments, the outcomes are classified as either successes or
failures. A binomial experiment is a probability experiment that satisfies the following
requirements:
- Each trial can only have two outcomes or outcomes can be reduced to two outcomes.
- The probability of a success must remain the same for each trial.
The binomial probability distribution formula is expressed as follows;
$P(X) = \frac{n!}{(n - X)!\,X!} \cdot p^X q^{n - X}, \quad 0 \le X \le n$
The mean, variance and standard deviation for the binomial distribution are given by:
Mean, $\mu = n \cdot p$
Variance, $\sigma^2 = n \cdot p \cdot q$
Standard deviation, $\sigma = \sqrt{n \cdot p \cdot q}$
▪ A survey showed that 30% of teenage consumers receive their spending money from part-time
jobs. If 5 teenagers are selected at random, find the probability that at least 3 of them will have
part-time jobs.
Solution
▪ To find the probability that at least 3 have part-time jobs, we find the individual probabilities of
3, 4, or 5 and add them together to get the total probability.
▪ $P(3) = \frac{5!}{(5 - 3)!\,3!} \cdot (0.3)^3 (0.7)^2 = 0.132$
▪ $P(4) = \frac{5!}{(5 - 4)!\,4!} \cdot (0.3)^4 (0.7)^1 = 0.028$
▪ $P(5) = \frac{5!}{(5 - 5)!\,5!} \cdot (0.3)^5 (0.7)^0 = 0.002$
▪ $P(\text{at least } 3) = 0.132 + 0.028 + 0.002$
▪ $= 0.162$
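The same calculation can be sketched with Python's `math.comb`, which gives the n!/((n − X)! X!) term directly. The helper name `binomial_pmf` is illustrative.

```python
from math import comb


def binomial_pmf(x, n, p):
    """P(X = x) for a binomial experiment with n trials and success probability p."""
    q = 1 - p
    return comb(n, x) * p ** x * q ** (n - x)


n, p = 5, 0.3

# At least 3 successes means X = 3, 4 or 5.
p_at_least_3 = sum(binomial_pmf(x, n, p) for x in range(3, n + 1))

print(round(p_at_least_3, 3))
```

Summing the unrounded terms gives about 0.163; the hand calculation gives 0.162 because each term is rounded to three decimals before adding.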
NORMAL DISTRIBUTIONS
NORMAL DISTRIBUTION
▪ The normal distribution is the continuous, symmetric and bell shaped distribution of a variable. The
standard normal distribution is a normal distribution with a mean of zero and standard deviation one.
▪ A standard normal curve provides both a visual and a numerical representation of how a given variable is
distributed across a population
- It is bell shaped
- The mean, median and mode are equal and located at the center of the distribution.
▪ If the Z-score is positive, the data value is above the mean; if it is negative, the data
value is below the mean. This helps in comparing relative standards.
▪ $Z = \frac{\text{Data value} - \text{Mean}}{\text{Standard deviation}}$
▪ $Z = \frac{X - \bar{X}}{s}$, where X is the data value, $\bar{X}$ is the sample mean and s is the sample standard deviation.
▪ $Z = \frac{X - \mu}{\sigma}$, where $\mu$ is the population mean, X is the data value and $\sigma$ is the population standard deviation.
▪ A farmer harvested 65kgs of Nakati in a plot of land mulched with grass and the mean harvest for that plot
was 50kgs and the standard deviation was 10. The same farmer harvested 30kgs of Nakati in a plot mulched
with timber dust. The mean for this plot was 25kgs and the standard deviation was 5. Compare the farmer’s
relative harvest in the two plots.
Solution
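▪ Grass-mulched plot: $Z = \frac{X - \bar{X}}{s} = \frac{65 - 50}{10} = 1.5$
▪ Timber-dust-mulched plot: $Z = \frac{X - \bar{X}}{s} = \frac{30 - 25}{5} = 1.0$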
▪ Since the Z-score for grass is larger, the farmer's relative harvest in the grass-mulched plot is higher than the
relative harvest in the timber-dust-mulched plot.
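The comparison can be sketched in a few lines of Python. The function name `z_score` is illustrative.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / sd


z_grass = z_score(65, mean=50, sd=10)   # plot mulched with grass
z_timber = z_score(30, mean=25, sd=5)   # plot mulched with timber dust

print(z_grass, z_timber)
```

The harvests are in different ranges (65 kg against 30 kg), so only the Z-scores, not the raw weights, say which harvest was better relative to its own plot.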
FINDING THE AREA UNDER THE NORMAL CURVE
▪ The area under the entire normal curve, no matter its precise shape, is assigned the value
1.0. All partial areas under the normal curve are thus decimal numbers between 0 and 1 and
can be converted easily to percentages by multiplying them by 100.
▪ For solutions of problems using a standard normal distribution, a four step process is
recommended;
▪ 1. Find the Z- value
▪ 2. Draw a picture(curve)
▪ 4. The next step depends on the position of the shaded area on the curve.
▪ NOTE:
▪ or 49.04%.
▪ When the shaded area is in any tail of the curve, look up for the z-value in the table and then
subtract the area from 0.5000.
▪ Find the area (z < -1.93)
Solution
▪ Again the z-table gives the area for positive z-values. But from the symmetric
property, the area for the negative z-value is the same as the area for the positive z-value.
▪ Look for the area between 0 and +1.93, which is 0.4732, and subtract it from 0.5000.
▪ = 0.0268
▪ or 2.68%
▪ When the shaded area lies between any two z values on the same side of the mean, look for
the z values and get the areas, subtract the smaller area from the larger area.
▪ Find the area between z=2.00 and z=2.47
Solution
▪ For this situation, look for the area from z=0 to z=2.47, which is equal to 0.4932, and
the area from z=0 to z=2.00, which is 0.4772.
▪ The difference is the desired area, which is 0.4932 - 0.4772
▪ = 0.0160
▪ or 1.60%.
▪ When the shaded area is to the left of a z-value that is greater than the mean, look up
the z-value to get the area and add 0.5000 to the area.
▪ Find the area to the left of z=1.99
Solution
▪ Since the z-table only gives the area between z=0 and z=1.99, you add 0.5000
(one half) to the table area.
▪ The area between z=0 and z=1.99 is 0.4767.
▪ Therefore, the total area = 0.4767 + 0.5000
▪ = 0.9767
▪ or 97.67%.
▪ If the shaded area lies in both tails of the curve, look for the z values to get the areas, subtract
each area from 0.5000 and add the two answers.
▪ Find the area to the right of z=2.43 and to the left of z=-3.01
▪ The area to the right of 2.43 is 0.5000 - 0.4925 = 0.0075.
▪ The area to the left of -3.01 is 0.5000 - 0.4987 = 0.0013.
▪ The total area = 0.0075 + 0.0013
▪ = 0.0088
▪ or 0.88%.
▪ When the shaded areas are on opposite sides of the mean, look up for z values to get the
areas and add up the areas.
▪ Find the area between z=1.68 and z=-1.37
Solution
▪ Since the two areas are on opposite sides of z=0, we find both areas and add them up.
▪ The area between z=0 and z=1.68 is 0.4535.
▪ The area between z=0 and z=-1.37 is 0.4147.
▪ The total area = 0.4535 + 0.4147
▪ = 0.8682
▪ or 86.82%.
▪ When the shaded area is to the right of any z value and the z-value is less than the mean, look
up for the z value to get the area and add 0.5000 to get the total area.
EXAMPLE
▪ Find the area to the right of z=-1.16
Solution
▪ The area between z=0 and z=-1.16 is 0.3770.
▪ Hence the total area is 0.5000 + 0.3770
▪ = 0.8770
▪ or 87.70%.
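The table look-ups in the examples above can be cross-checked with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). Note the swapped technique: the code uses the cumulative distribution function, which returns the whole area to the left of z, rather than the 0-to-z areas printed in the table. The variable and function names are illustrative.

```python
from statistics import NormalDist

std = NormalDist()  # standard normal: mean 0, standard deviation 1


def area_between(a, b):
    """Area under the standard normal curve between z = a and z = b."""
    return std.cdf(b) - std.cdf(a)


left_tail = std.cdf(-1.93)                           # z < -1.93
same_side = area_between(2.00, 2.47)                 # between two positive z values
left_of = std.cdf(1.99)                              # z < 1.99
both_tails = (1 - std.cdf(2.43)) + std.cdf(-3.01)    # z > 2.43 or z < -3.01
opposite_sides = area_between(-1.37, 1.68)           # spans the mean
right_of = 1 - std.cdf(-1.16)                        # z > -1.16

for area in (left_tail, same_side, left_of, both_tails, opposite_sides, right_of):
    print(round(area, 4))
```

The results agree with the table-based answers to within the rounding of a four-decimal z-table.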
APPLICATIONS OF NORMAL DISTRIBUTIONS
▪ The standard normal distribution curve can be used to solve a wide variety of practical
problems.
▪ The only requirement is that the variable must be normally or approximately normally
distributed.
▪ To solve problems by using the standard normal distribution, we transform the original
variable to a standard normal distribution variable by using the formula;
▪ $Z = \frac{\text{Data value} - \text{Mean}}{\text{Standard deviation}}$
▪ or $Z = \frac{X - \mu}{\sigma}$
▪ The mean number of hours a year two student spends on a computer is 3.1 hours per day.
Assume the standard deviation is 0.5 hours. Find the percentage of students who spend less
than 3.5 hours on the computer. Assume the variable is normally distributed.
Solution
Step 1: Draw the figure showing the area
▪ Step 2: Find the z-value corresponding to 3.5
▪ $Z = \frac{X - \mu}{\sigma} = \frac{3.5 - 3.1}{0.5} = 0.80$
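The remaining step, turning the z-value into a percentage, can be sketched with `statistics.NormalDist`. The code is an illustration under the stated assumption that the variable is normally distributed; it is not the table look-up itself.

```python
from statistics import NormalDist

mean_hours = 3.1
sd_hours = 0.5
x = 3.5

# Step 2: convert the data value to a z-value.
z = (x - mean_hours) / sd_hours

# Area to the left of x, i.e. the share of students below 3.5 hours.
share_below = NormalDist(mu=mean_hours, sigma=sd_hours).cdf(x)

print(round(z, 2), round(share_below * 100, 2))
```

The area to the left of z = 0.80 is about 0.7881, so roughly 78.8% of students spend less than 3.5 hours per day on the computer.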
Since the populations are always big, the values got from samples are just estimates of
the true parameter.
An important question in estimation is that of the sample size, how large should it be to
make an accurate estimate?
A point estimate is therefore used to infer the population mean from the sample mean.
A point estimate is a specific numerical value used as an estimate of a parameter. The best point
estimate of the population mean µ is the sample mean $\bar{X}$.
Sample means, rather than sample modes or medians, are used to estimate the population
mean because the means of samples vary less than the other statistics (mode and
median).
It should be unbiased
It should be consistent
Likewise, there is a 95% probability that the interval specified by $\bar{X} \pm 1.96\frac{\sigma}{\sqrt{n}}$ will
contain µ, which can also be stated as:
$\bar{X} - 1.96\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}$
The value used for the 95% CI is obtained from the bottom of the t-distribution
table.
Since other CIs are also used, the symbol $z_{\alpha/2}$ is used in the general formula.
At 95% CL, $\alpha$ = 1 - 0.95 = 0.05; at 99% CL, $\alpha$ = 0.01; and at 90% CL, $\alpha$ = 0.1.
The general formula for the CI is given by:
$\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$
For 90% CL, $z_{\alpha/2}$ = 1.65; for 95% CL, $z_{\alpha/2}$ = 1.96; and for 99% CL, $z_{\alpha/2}$ = 2.58.
The term $z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$ is called the maximum error of estimate.
This is the likely difference between the point estimate of a parameter and the
actual value of the parameter.
Example
A survey of 30 adults showed that the mean age of a person’s vehicle is 5.6
years. Assuming the standard deviation of the population is 0.8 year, find the best
point estimate of the population mean and the 99% confidence interval of the
population mean.
Solution
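The best point estimate of the population mean is the sample mean, 5.6 years.
$5.6 - 2.58\left(\frac{0.8}{\sqrt{30}}\right) < \mu < 5.6 + 2.58\left(\frac{0.8}{\sqrt{30}}\right)$
$5.6 - 0.38 < \mu < 5.6 + 0.38$
$5.22 < \mu < 5.98$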
Hence, one can be 99% confident that the mean age of all vehicles is between
5.2 and 6.0 years, based on 30 vehicles.
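The interval can be reproduced with a short Python sketch. The value 2.58 is the 99% table value quoted in the text; the function name `z_interval` is illustrative.

```python
from math import sqrt


def z_interval(sample_mean, sigma, n, z):
    """Confidence interval for the mean when the population sigma is known."""
    margin = z * sigma / sqrt(n)   # maximum error of estimate
    return sample_mean - margin, sample_mean + margin


low, high = z_interval(sample_mean=5.6, sigma=0.8, n=30, z=2.58)

print(round(low, 2), round(high, 2))
```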
Sample size
It is bell shaped
The mean, mode and median are equal to 0 and located at the center of the
distribution.
As the sample size increases, the t distribution approaches the standard normal
distribution.
The t Distribution
Degrees of freedom (d.f) are the number of values that are free to vary after a
sample statistic has been computed.
For example, if the mean of 10 values is 20, then 9 out of the 10 values are free to
vary.
Hence the degrees of freedom are 10-1=9 (subtracting 1 from the sample size; d.f
= n-1).
The formula for a specific confidence interval for the mean when $\sigma$ is unknown
and n < 30 is:
$\bar{X} - t_{\alpha/2}\frac{s}{\sqrt{n}} < \mu < \bar{X} + t_{\alpha/2}\frac{s}{\sqrt{n}}$
Example 1
Find the $t_{\alpha/2}$ value for a 95% confidence interval when the sample size is 22.
Solution
Look for 21 in the left column of the t distribution table and 95% in the row
labeled confidence intervals.
The intersection where the two meet gives the value for $t_{\alpha/2}$, which is 2.080.
Note: At the bottom of the t distribution table, where z = ∞, the $z_{\alpha/2}$ values can be
found.
Example 2
Ten randomly selected plants were studied and their height was measured. The
mean height was 0.32cm and the standard deviation was 0.08cm. Find the 95% CI of
the mean height assuming that the variable is approximately normally distributed.
Solution
The t distribution is used and, with 9 d.f at 95% CL, $t_{\alpha/2}$ = 2.262.
Using $\bar{X} - t_{\alpha/2}\frac{s}{\sqrt{n}} < \mu < \bar{X} + t_{\alpha/2}\frac{s}{\sqrt{n}}$,
$0.32 - (2.262)\frac{0.08}{\sqrt{10}} < \mu < 0.32 + (2.262)\frac{0.08}{\sqrt{10}}$
$0.26 < \mu < 0.38$
Hence, one can be 95% confident that the mean height of the population lies between 0.26 and
0.38cm based on a sample of ten plants.
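A minimal sketch of the same calculation follows. The Python standard library has no t-distribution quantile function, so the critical value 2.262 is passed in as the table value for 9 degrees of freedom quoted in the solution. The function name `t_interval` is illustrative.

```python
from math import sqrt


def t_interval(sample_mean, s, n, t_crit):
    """Confidence interval for the mean when sigma is unknown and n < 30."""
    margin = t_crit * s / sqrt(n)
    return sample_mean - margin, sample_mean + margin


# t_crit = 2.262 is read from the t table (9 d.f., 95% confidence).
low, high = t_interval(sample_mean=0.32, s=0.08, n=10, t_crit=2.262)

print(round(low, 2), round(high, 2))
```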
Confidence intervals and sample size for proportions
Proportions also represent probabilities and they can be obtained from samples
or populations.
p = population proportion
$\hat{p} = \frac{x}{n}$ = sample proportion, and $\hat{q} = 1 - \hat{p}$
where x is the number of sample units that possess the characteristic of interest
and n is the sample size.
For example
In a study, 200 people were asked if they were satisfied with their
jobs; 162 said they were.
In this case, n = 200, x = 162, $\hat{p} = \frac{x}{n} = \frac{162}{200} = 0.81$ or 81%.
Confidence intervals about the proportion must meet the criteria np ≥ 5 and nq ≥ 5.
$\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}} < p < \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}$
Solution
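Substituting in the formula $\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}} < p < \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}$
$\hat{p} = \frac{x}{n} = \frac{60}{500} = 0.12$, $\hat{q} = 1 - \hat{p} = 1 - 0.12 = 0.88$
$0.12 - 1.65\sqrt{\frac{(0.12)(0.88)}{500}} < p < 0.12 + 1.65\sqrt{\frac{(0.12)(0.88)}{500}}$
$0.12 - 0.024 < p < 0.12 + 0.024$
$0.096 < p < 0.144$
$9.6\% < p < 14.4\%$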
Hence, one can be 90% confident that the percentage of applicants who were men is between 9.6% and 14.4%.
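The substitution can be sketched in Python using the figures from the solution (x = 60, n = 500, and the 90% table value 1.65). The function name `proportion_interval` is illustrative.

```python
from math import sqrt


def proportion_interval(x, n, z):
    """Confidence interval for a population proportion."""
    p_hat = x / n          # sample proportion
    q_hat = 1 - p_hat
    margin = z * sqrt(p_hat * q_hat / n)
    return p_hat - margin, p_hat + margin


low, high = proportion_interval(x=60, n=500, z=1.65)

print(round(low, 3), round(high, 3))
```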
Sample size for proportions
This is derived from $E = z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}$
$n = \hat{p}\hat{q}\left(\frac{z_{\alpha/2}}{E}\right)^2$
Confidence intervals for Variance and Standard Deviations
To calculate these CIs, we use the chi-square distribution.
The chi-square variable is similar to the t variable in that its distribution is a family of curves
based on the number of degrees of freedom.
For example, to find the table values corresponding to the 95% CI, one must first
change 95% to a decimal and subtract from 1 (1-0.95 = 0.05).
Then divide the answer by 2 ($\alpha/2 = 0.05/2 = 0.025$). This is the column on the right side
of the table used to get values for $\chi^2_{right}$.
To get values for $\chi^2_{left}$, subtract the value of $\alpha/2$ from 1 ($1 - 0.05/2 = 0.975$).
Finally, find the appropriate row corresponding to the degrees of freedom (n - 1).
Find the values for $\chi^2_{right}$ and $\chi^2_{left}$ for a 90% CI when n = 25.
Solution
Hence, use the 0.95 and 0.05 columns and the row corresponding to
24 d.f.
$\chi^2_{right} = 36.415$
$\chi^2_{left} = 13.848$
Formula for Confidence Interval for a Variance and
Standard Deviation
For variance, the CI is given by $\frac{(n-1)s^2}{\chi^2_{right}} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_{left}}$
For standard deviation, the CI is expressed as $\sqrt{\frac{(n-1)s^2}{\chi^2_{right}}} < \sigma < \sqrt{\frac{(n-1)s^2}{\chi^2_{left}}}$
Example
Find the 95% confidence interval for the variance and standard deviation of the
nitrogen content of a fertilizer if a sample of 20 packs has a standard deviation of
1.6 milligrams.
Solution
The corresponding values of $\chi^2_{right}$ and $\chi^2_{left}$ are 32.852 and 8.907 respectively.
Substituting in the formula for variance, $\frac{(n-1)s^2}{\chi^2_{right}} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_{left}}$
$\frac{(20-1)(1.6)^2}{32.852} < \sigma^2 < \frac{(20-1)(1.6)^2}{8.907}$
$1.5 < \sigma^2 < 5.5$
Hence, one can be 95% confident that the true variance for the nitrogen content lies between
1.5 and 5.5.
For standard deviation,
$\sqrt{\frac{(n-1)s^2}{\chi^2_{right}}} < \sigma < \sqrt{\frac{(n-1)s^2}{\chi^2_{left}}}$
$1.2 < \sigma < 2.3$
Hence, one can be 95% confident that the true standard deviation for the nitrogen
content lies between 1.2 and 2.3 milligrams based on a sample of 20 packs.
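Both intervals can be checked with a short sketch. The chi-square values 32.852 and 8.907 are the table values quoted in the solution (19 d.f.) and are passed in as constants, since the standard library has no chi-square quantile function. The function name `variance_interval` is illustrative.

```python
from math import sqrt


def variance_interval(n, s, chi2_right, chi2_left):
    """Confidence interval for the population variance."""
    numerator = (n - 1) * s ** 2
    return numerator / chi2_right, numerator / chi2_left


# Table values for 19 degrees of freedom at 95% confidence.
var_low, var_high = variance_interval(n=20, s=1.6, chi2_right=32.852, chi2_left=8.907)

# The interval for the standard deviation is the square root of each limit.
sd_low, sd_high = sqrt(var_low), sqrt(var_high)

print(round(var_low, 1), round(var_high, 1), round(sd_low, 1), round(sd_high, 1))
```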
HYPOTHESIS TESTING
This has been used since the hypothesis testing method was formulated.
There are two types of hypotheses for each situation, the null hypothesis and the
alternative hypothesis.
The null hypothesis symbolized by H0, is a statistical hypothesis that states that
there is no difference between a parameter and a specific value, or that there is
no difference between two parameters.
A chemist invents an additive to increase the shelf life of food. If the mean
shelf life is 36 days, then her hypotheses are:
$H_0: \mu = 36$ and $H_1: \mu > 36$
In this situation, the chemist is only interested in increasing the food shelf life,
so her alternative hypothesis is that the mean shelf life will be greater than 36
days.
The probability of a type II error is symbolized by β but in most hypothesis testing situations, β cannot be easily computed.
However, α and β are related in that decreasing one increases the other.
Statisticians generally agree on using three arbitrary significance levels: the 0.10, 0.05, and 0.01 levels. If the null hypothesis is
rejected, the probability of a type I error will be 10%, 5%, or 1%, depending on the significance level used.
After the significance level is chosen, a critical value (CV) is selected from a table for the appropriate test. If a z-test is used, the
z-table is consulted to find the critical value.
The critical value determines the critical and noncritical regions. It separates the critical region from the noncritical region.
The critical or rejection region is the range of values of the test value that indicates that there is a significant difference and that
the null hypothesis should be rejected.
The noncritical or nonrejection region is the range of values of the test value that indicates that the difference was probably due
to chance and that the null hypothesis should not be rejected.
The CV can be on the right side of the mean or on the left side for a one-tailed
test and this depends on the inequality sign of the alternative hypothesis.
A one-tailed test indicates that the null hypothesis should be rejected when the
test value is in the critical region on one side of the mean.
In a two-tailed test, the null hypothesis should be rejected when the test value is
in either of the two critical regions.
For a two-tailed test, the critical region must be split into two equal parts. If α
= 0.01, then one-half of the area, 0.005, must be to the right of the mean and the
other half to the left of the mean.
In this case, the area to be found in the z-table is (0.5000 - 0.005) = 0.4950 and the
critical values are +2.58 and -2.58.
Procedure for finding CVs for specific values using the z-table
Step 2: For a one-tailed test, subtract the area equivalent to α from 0.5000. For
the two-tailed test, subtract the area equivalent to $\alpha/2$ from 0.5000.
Step 3: Look for the area in the z-table corresponding to the value obtained in step
2. If the exact value cannot be found in the table, use the closest value.
Step 4: Find the z-value that corresponds to the area; this will be the critical
value.
Step 5: Determine the sign of the CV. For a one-tailed test, if the test is left-tailed,
the CV will be negative and if the test is right-tailed, the CV will be positive. For a
two-tailed test, one value will be positive and the other negative.
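As a cross-check on the table procedure, critical values can also be computed from the inverse of the standard normal CDF. This swaps the table look-up for `statistics.NormalDist.inv_cdf` (Python 3.8 or later), so the results are unrounded and differ slightly from the table values (about 1.645 and 2.576 against the rounded 1.65 and 2.58). The function name `z_critical` and the tail labels are illustrative.

```python
from statistics import NormalDist


def z_critical(alpha, tail):
    """Critical z value(s) for tail in {"left", "right", "two"}."""
    std = NormalDist()
    if tail == "two":
        cv = std.inv_cdf(1 - alpha / 2)   # alpha is split between both tails
        return -cv, cv
    cv = std.inv_cdf(1 - alpha)           # all of alpha sits in one tail
    return -cv if tail == "left" else cv  # sign follows the tail


print(z_critical(0.05, "right"))
print(z_critical(0.05, "left"))
print(z_critical(0.01, "two"))
```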
z Test for a mean
Many hypotheses are tested using a statistical test based on the following general
formula:
$\text{Test value} = \frac{(\text{observed value}) - (\text{expected value})}{\text{standard error}}$
The z test is a statistical test for the mean of a population. It can be used when n ≥ 30.
$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$
Where
$\bar{X}$ = sample mean,
$\mu$ = hypothesized population mean,
$\sigma$ = population standard deviation,
n = sample size
Procedure for hypothesis testing using the traditional
method
Step 4: Make the decision to reject or not to reject the null hypothesis
Solution
Step 2: Find the CV. Since α = 0.05 and the test is a right-tailed test, the CV is z=+1.65
Step 5: Summarize the results. There is not enough evidence to support the claim that
junior agronomists earn more than $42,000 a year.
p-Value method for Hypothesis Testing
The p-value (probability value) is the probability of getting a sample statistic (e.g., the
mean) at least as extreme as the one observed, in the direction of the alternative hypothesis, when the null hypothesis is true.
First find the area under the normal distribution curve corresponding to the z-value,
then subtract this area from 0.5000 to get the p-value.
Note: To get the p-value for a two-tailed test, we double the area after subtracting.
Step 4: Make the decision to reject or not to reject the null hypothesis
Step 3: Find the p-value. Using the z-table, the corresponding area for z=2.10 is 0.4821. Subtract this value
from 0.5000 to get the area in the right tail. 0.5000-0.4821 = 0.0179
Hence the p-value is 0.0179
Step 4: Since the p-value is less than α (0.05), the decision is to reject the null hypothesis.
Step 5: There is enough evidence to support the claim that the average age of year two students is greater
than 24 years.
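The p-value step can be sketched with `statistics.NormalDist`. The code computes the tail area directly from the CDF instead of subtracting a table area from 0.5000; the two routes give the same tail. The function name `p_value` and the tail labels are illustrative.

```python
from statistics import NormalDist


def p_value(z, tail="right"):
    """p-value of a z test for tail in {"left", "right", "two"}."""
    std = NormalDist()
    if tail == "right":
        return 1 - std.cdf(z)
    if tail == "left":
        return std.cdf(z)
    return 2 * (1 - std.cdf(abs(z)))   # two-tailed: double the one-tail area


alpha = 0.05
p = p_value(2.10, tail="right")
reject_null = p < alpha

print(round(p, 4), reject_null)
```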
THE CHI-SQUARE STATISTIC
THE CHI-SQUARE TEST
In some research situations one may be confronted with qualitative variables
where the Normal Distribution is meaningless because of the small sizes of
samples involved, or because the quantities being measured cannot be
quantified in exact terms.
Qualities such as marital status, stem color of a plant, sex of the respondents
being observed, etc., are the relevant variables; instead of exact numbers which
we have used to measure variables all our lives.
𝑂𝑖 = The observed frequency of the characteristic of interest.
𝐸𝑖 = The expected frequency of the characteristic of interest.
k = The number of paired groups in each class comprising the observed frequencies 𝑂𝑖
and the expected frequencies 𝐸𝑖 .
c = An individual observation of the paired groups
The Chi-Square distribution is such that if the observed values differ very little from the
expected values then the chi-square value is very small. On the other hand, when the
differences are large, the chi-square value is also very large.
This means that the mode of this distribution tends to be located among the fewest
observations.
An outline of the general procedure for a chi-squared test
Get the Chi-square tabulated using the chi-square table and degrees of
freedom where degree of freedom = n - 1
Get the Chi-square calculated using the formula $\chi^2 = \sum_{c=1}^{k} \frac{(O_i - E_i)^2}{E_i}$
Machine    No. of defective items
A          25
B          24
C          20
D          28
E          23
Total      120
$\chi^2 = \sum_{c=1}^{k} \frac{(O_i - E_i)^2}{E_i}$
$\chi^2 = \frac{34}{24}$
$\chi^2$ calculated = 1.42
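The calculated chi-square value can be reproduced from the machine table. The sketch assumes that every machine is expected to produce the same number of defective items (120 / 5 = 24), which is what the denominator 24 in the solution implies; that assumption and the variable names are not stated in the text.

```python
# Observed defective items per machine, from the table.
observed = {"A": 25, "B": 24, "C": 20, "D": 28, "E": 23}

# Assumed expected frequency: defects spread equally across machines.
expected = sum(observed.values()) / len(observed)   # 120 / 5 = 24

# Sum of (O - E)^2 / E over all machines.
chi_square = sum((o - expected) ** 2 / expected for o in observed.values())

degrees_of_freedom = len(observed) - 1

print(round(chi_square, 2), degrees_of_freedom)
```

The squared differences are 1, 0, 16, 16 and 1, which add up to the 34 in the numerator.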
▪ $r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$
▪ $= 0.897$
▪ The correlation coefficient suggests a strong positive relationship between age and
blood pressure.
SIGNIFICANCE OF THE CORRELATION COEFFICIENT
▪ Since the value of r is computed from data obtained from samples, there are two
possibilities when r is not equal to zero:
▪ Either the value of r is high enough to conclude that there is a significant linear
relationship between the variables, or the value of r was due to chance.
▪ To make this decision, a hypothesis testing procedure is used;
▪ $H_0: \rho = 0$ The null hypothesis means that there is no correlation between the x and y
variables in the population.
▪ $H_1: \rho \ne 0$ This alternative hypothesis means that there is a significant correlation
between the variables in the population.
▪ When the null hypothesis is rejected at a given level, it means that there is a significant
difference between the value of r and 0.
▪ When the null hypothesis is not rejected, it means that the value of r is not significantly
different from 0 and is probably due to chance.
▪ Formula for the t-test for the correlation coefficient
▪ $t = r\sqrt{\frac{n - 2}{1 - r^2}}$
▪ The two-tailed critical values are used and they are found in the t distribution table.
▪ Also, when testing the significance of a correlation coefficient, both the x and y variables
must come from a normally distributed population.
Example
▪ Test the significance of the correlation coefficient from the previous example. Use α =
0.05 and r was 0.897.
Solution
▪ Step 1: State the hypotheses
▪ $H_0: \rho = 0$ and $H_1: \rho \ne 0$
▪ Step 2: Find the critical values. Since α = 0.05 and there are 6 - 2 = 4 degrees of freedom, the critical values
obtained from the t distribution table are ±2.776.
▪ Step 3: Compute the test value.
▪ $t = r\sqrt{\frac{n - 2}{1 - r^2}} = (0.897)\sqrt{\frac{6 - 2}{1 - (0.897)^2}} = 4.059$
▪ Step 4: Make the decision. Reject the null hypothesis since the test value falls in the critical region.
▪ Step 5: Summarize the results. There is a significant relationship between the variables
of blood pressure and age.
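The test value and the decision can be sketched as follows. The critical value 2.776 is the two-tailed table value for 4 degrees of freedom at α = 0.05 quoted in the solution; the function name `correlation_t` is illustrative.

```python
from math import sqrt


def correlation_t(r, n):
    """t test value for the significance of a correlation coefficient."""
    return r * sqrt((n - 2) / (1 - r ** 2))


r, n = 0.897, 6
t_value = correlation_t(r, n)

# Two-tailed critical value from the t table (d.f. = n - 2 = 4, alpha = 0.05).
critical = 2.776
reject_null = abs(t_value) > critical

print(round(t_value, 3), reject_null)
```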