3.4.1 Exercise
3.5 Median
3.5.1 Exercise
3.6 Mode
3.6.1 Group task
3.6.2 Exercise
3.7 Conclusion
3.8 References & Further Reading
Unit 4: Measures of Dispersion/Variability
3.1 Introduction
3.2 Aim of the Unit
3.3 Structure of the Unit
3.4 Range
3.5 Interquartile Range (IQR)
3.6 Standard Deviation
3.7 Skewness
3.8 Kurtosis
3.9 Exercise
3.10 Conclusion
3.11 References & Further Reading
4.0 Introduction
4.1 Aim of the unit
4.2 Structure of the Unit
4.3 Probability Concepts
4.4 Sample Space
4.4.1 Example 1: Tossing Two Coins
4.4.2 Example 2: Throwing a dice
4.5 Probability tree diagram
4.5.1 Exercise 1
4.5.2 Exercise 2
4.6 Probability Distributions
4.6.1 Discrete Probability Distributions
4.6.1.1 Binomial Probability Distribution
Mean of a Binomial Distribution
Variance of a Binomial distribution
4.6.1.2 Poisson Probability Distribution
Mean and Variance
4.6.1.3 Exercise 1 (Discrete probability distributions)
4.6.2 Continuous probability distributions
4.7 Hypothesis testing
4.7.1 What is a Hypothesis?
4.7.2 Characteristics of hypothesis
4.7.3 Concepts of Hypothesis Testing
4.8 Exercise
4.9 Conclusion
4.10 References & Further Reading
Unit 5: Bivariate Analysis
5.0 Introduction
5.1 Aim of the Unit
5.2 Cross-tabulation Table
5.3 The Chi-Squared Test for Independence of Association
5.4 Correlation
5.5 Conclusion
5.6 References & Further Reading
Unit 6: Regression Analysis
6.0 Introduction
6.1 Aim of the Unit
6.2 Objectives of Regression Analysis
6.2.1 Exercise 1
Exercise 2: From the output given below,
6.2.3 Computer based practical Exercise 1 (Microsoft Excel)
6.3 Conclusion
6.4 References & Further Reading
Unit 7: Multivariate analysis
7.0 Introduction
7.1 Aim of the Unit
7.2 Structure of the Unit
7.3 The objective of Multivariate Analysis (MVA)
7.3.1 Exercise 1 (Multiple Regression)
7.4 Multicollinearity
7.5 Analysis of Variance (ANOVA)
7.6 Exercise 1 (Multiple regression)
7.7 Conclusion
7.8 References & Further Reading
1.0 Module Rationale
The purpose of this module is to introduce basic statistics to undergraduate level one
students. This is an introductory course that assumes no prior knowledge of statistics.
The main objective of this course is to introduce and equip students with basic statistical
concepts and techniques. The course aims at imparting knowledge and analytical skills
to students in issues related to the following: descriptive statistics; presentation of data;
probability; inferential statistics and multivariate analysis. The calculations will be done
using spreadsheet or statistical software, such as Excel or the Statistical Package for the
Social Sciences (SPSS). The aforementioned issues are meant to enhance students’
professional and academic performance in the subject.
1.1 Module Structure
This module is structured as follows:
Unit 1: Introduction;
Unit 2: Levels of Measurement and Frequency Distributions;
Unit 3: Measures of Central Tendency;
Unit 4: Measures of Dispersion/Variability;
Unit 5: Probability;
Unit 6: Bivariate Analysis;
Unit 7: Regression Modelling; and
Unit 8: Multivariate analysis.
1.2 Aims and Learning Outcomes
For this module, the student has to master the following outcomes:
Explain the basic concepts of descriptive and inferential statistics;
Present data;
Calculate and interpret basic descriptive and inferential statistics;
Determine when, why, and how various statistical tests are used; and,
Analyze data using spreadsheet or statistical software (e.g. Excel, SPSS).
1.3 Methods of Assessment
The approach adopted covers the normal examination at the end of the academic
semester, constituting 50% of the overall course mark, and coursework making up the
remaining 50%. Students will be given a series of exercises, in-class tests, written and
practical assignments that will constitute the continuous assessment (coursework).
Students are advised to seriously consider the continuous assessment exercises as
they contribute significantly towards the overall course mark.
Unit 1 Introduction
Data refers to observations and measurements which have been collected in some
way, often through research. Data is the actual values (numbers) or outcomes recorded
on a random variable.
Some examples of random variables and their data are:
the travel distances of delivery vehicles (data: 34 km, 13 km, 21 km)
the daily occupancy rates of hotels in Harare (data: 45%, 72%, 54%)
the duration of machine downtime (data: 14 min, 25 min, 6 min)
brand of coffee preferred (data: Nescafe, Ricoffy, Frisco).
Data that is recorded as numbers (and therefore measures quantities) is quantitative
data.
Quantitative variables are variables that can be counted or measured. For example, age is
numerical, and people can be ranked in order according to the value of their ages. Other
examples are height, weight and temperature.
The following are examples of quantitative random variables with real numbers as data:
the age of an employee (e.g. 46 years; 28 years; 32 years);
machine downtime (e.g. 8 min; 32.4 min; 12.9 min);
the price of a product in different stores (e.g. R6.75; R7.45; R7.20; R6.99); and
delivery distances travelled by a courier vehicle (e.g. 14.2 km; 20.1 km; 17.8 km).
Data that is recorded as text (and therefore records qualities) is qualitative data.
Categorical data is data which is grouped into categories, such as responses to yes/no
questions, or data on gender, smoking status or marital status.
Numerical data include both discrete and continuous variables.
Discrete variables: values can be counted. A discrete variable may, but need not, have a
finite number of values.
The most common type of discrete numerical variable that we will encounter produces a
response that comes from a counting process, e.g. the number of women who have given
birth or the number of children born to a woman.
Continuous variables: assume an infinite number of values between any two specific
values.
A continuous variable may take on any value within a given range of real numbers and
usually arises from a measurement, e.g. temperature, height, weight and distance.
For example, a temperature may be recorded as 36.5 degrees Celsius or a height as
6.2 metres.
Descriptive statistics, in short, help describe and understand the features of a specific
data set by giving short summaries about the sample and measures of the data.
2.8 Conclusion
In this unit we introduced basic terms and concepts of Statistics, described the different
data types and explained the different components of Statistics. We managed to
distinguish between qualitative and quantitative random variables and discussed
descriptive and inferential statistics.
2.9 References & Further Reading
Bhattacharyya, G. K., and R. A. Johnson, (1997). Statistical Concepts and Methods,
John Wiley and Sons, New York.
https://www.chi2innovations.com/blog/data-types-101/
https://www.sagepub.com/sites/default/files/upm-binaries/40006_Chapter1.pdf
Unit 2 Levels of Measurement and Frequency Distributions
When we talk about levels of measurement, we are talking about how we measure a
variable. First, knowing the level of measurement helps you decide how to interpret the
data from that variable. Second, knowing the level of measurement helps you decide
what statistical analysis is appropriate on the values that were assigned.
The graphic below should help you visualize the four different levels of measurement.
See the definitions and examples below for each.
Interval differs from the ordinal level in that the values can be ranked in order and
there are precise differences between the units. Meaning is given to the difference between
measurements, e.g. temperature (15°C, 22°C) or IQ tests (IQ of 100 or IQ of 110).
An interval scale indicates rank and distance from an arbitrary zero, measured in unit
intervals.
Ratio data has all the characteristics of interval measurement, i.e. it indicates both
rank and distance, but from a natural zero, so that ratios of two measures have
meaning. Ratio scales have precise differences between units of measure and a
true zero, e.g. weight (500 g, 20 kg, 25 kg) or height (1.5 m, 1.6 m, 1.8 m). Also, the
ratio between two values on a ratio scale is meaningful, e.g. if one person can lift
200 kg and another can lift 100 kg, then the ratio between them is 2 to 1.
Graphical/Pictorial Methods
There are several graphical and pictorial methods that enhance researchers'
understanding of individual variables and the relationships between variables. Graphical
and pictorial methods provide a visual representation of the data. Graphs are used to
display data because it is easier to see trends in the data when it is displayed visually
compared to when it is displayed numerically in a table. Complicated data can often be
displayed and interpreted more easily in a graph format than in a data table.
In a graph, the X-axis runs horizontally (side to side) and the Y-axis runs vertically (up
and down). Typically, the independent variable will be shown on the X axis and the
dependent variable will be shown on the Y axis. Some of the data presentation methods
include:
Bar chart/graph
Frequency table
Histogram
Line graph
Ogive
Pie chart
Scatter plot
Bar Chart/Graph
Bar charts are used to compare measurements between different groups. Bar charts
should be used when your data is not continuous, but rather is divided into different
categories. If you counted the number of people in each of the 10 provinces in
Zimbabwe, each province would be its own category. There is no value between
provinces, so this data is not continuous. Figure below shows an example of a bar
chart/graph.
Figure 1: Example of a bar chart/graph
Frequency Table
A frequency table may also contain relative frequencies (proportions) or percentages.
Frequency tables may be computed for both discrete and continuous variables and may take
either an ungrouped or a grouped format. The table below shows an example of a frequency
table.
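As a rough illustration of an ungrouped frequency table, the sketch below counts each distinct value and derives the relative frequency and percentage. It is written in Python rather than Excel, and the data values and variable names are hypothetical, chosen only for this example.

```python
from collections import Counter

# Hypothetical responses (number of children per household); not from the module.
responses = [0, 2, 1, 2, 3, 1, 2, 0, 2, 1]

counts = Counter(responses)   # frequency of each distinct value
total = len(responses)

# Each row holds: value, frequency, relative frequency (proportion), percentage.
frequency_table = [
    (value, freq, freq / total, 100 * freq / total)
    for value, freq in sorted(counts.items())
]

print("Value  Frequency  Proportion  Percent")
for value, freq, proportion, percent in frequency_table:
    print(f"{value:>5}  {freq:>9}  {proportion:>10.2f}  {percent:>6.1f}%")
```

The frequencies always add up to the number of observations, and the proportions add up to 1, which is a quick check that no value has been missed.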
Histogram
A histogram is used to summarize discrete or continuous numerical data. It provides a
visual interpretation of the data by showing the number of data points that fall within
each specified range of values. It is similar to a vertical bar graph; however, unlike a
vertical bar graph, a histogram shows no gaps between the bars. The heights of the bars
correspond to the frequency values, and the bars are drawn adjacent to each other.
The histogram is used to:
Identify the most common process outcome
Identify data symmetry
Spot deviations
Verify equal distribution
Spot areas that require little effort
The figure below shows an example of a histogram.
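The counting step behind a histogram can be sketched as follows: each value is assigned to a class of equal width, and the class counts become the bar heights. This is a minimal Python sketch; the data, the class width of 10 and the starting point of 0 are assumptions made only for the illustration.

```python
# Hypothetical machine downtimes in minutes; not from the module.
downtimes = [6, 8, 12.9, 14, 25, 32.4, 18, 21, 9, 27]

# Assumed classes: 0 to under 10, 10 to under 20, 20 to under 30, 30 to under 40.
lower, width, n_classes = 0, 10, 4

counts = [0] * n_classes
for x in downtimes:
    index = int((x - lower) // width)   # which class the value falls into
    counts[index] += 1

# A crude text histogram: one '#' per observation in the class.
for i, count in enumerate(counts):
    start = lower + i * width
    print(f"{start:>3} to under {start + width:<3} | {'#' * count}")
```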
Figure 2: Example of a histogram
Line graph
Line graphs are the best type of graph to use when you are displaying a change in
something over a continuous range/over time. For example, you could use a line graph
to display a change in student performance, incomes or the inflation rate over time. An
important use of a line graph is to track changes over short and long periods of
time. It is also used to compare changes over the same period of time for different
groups. A line graph is generally preferable to a bar graph when the changes are
small. The figure below shows an example of a line graph.
Figure 3: An example of a line graph
Ogive graph
An ogive graph is a plot used in statistics to show cumulative frequencies. It is used to
quickly estimate the number of observations that are less than or equal to a particular
value. There are two types of ogives:
Less than ogive: Plot the points with the upper limits of the classes as
abscissae and the corresponding less than cumulative frequencies as
ordinates. Join the points by a free-hand smooth curve to get the less
than cumulative frequency curve, or “Less than Ogive”. It is a rising curve.
Greater than ogive: Plot the points with the lower limits of the classes as
abscissae and the corresponding greater than cumulative frequencies as
ordinates. Join the points by a free-hand smooth curve to get the “Greater
than Ogive” (also called the “More than Ogive”). It is a falling curve.
The figure below shows an example of the two types of ogive.
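The points plotted for the two ogives can be worked out from a grouped frequency table as in the Python sketch below. The class limits and frequencies are hypothetical, and the variable names are chosen only for this illustration.

```python
from itertools import accumulate

# Hypothetical grouped data (class limits and frequencies); not from the module.
lower_limits = [0, 10, 20, 30]
upper_limits = [10, 20, 30, 40]
frequencies = [3, 3, 3, 1]

total = sum(frequencies)

# Less than ogive: running total of observations below each upper limit.
less_than = list(accumulate(frequencies))

# Greater than ogive: observations at or above each lower limit.
greater_than = [total - below for below in [0] + less_than[:-1]]

# (x, y) points to plot: abscissa is the class limit, ordinate the cumulative frequency.
less_than_points = list(zip(upper_limits, less_than))
greater_than_points = list(zip(lower_limits, greater_than))

print(less_than_points)
print(greater_than_points)
```

The less than values rise to the total number of observations, while the greater than values start at the total and fall, which matches the rising and falling curves described above.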
Figure 4: An example of the two types of ogives graph
Pie chart
A Pie Chart is a type of graph that displays data in a circular graph. The pieces of the
graph are proportional to the fraction of the whole in each category. In other
words, each slice of the pie is relative to the size of that category in the group as a
whole. The entire “pie” represents 100 percent of a whole, while the pie “slices”
represent portions of the whole. A pie chart is best used when trying to work out the
composition of something. Figure below shows an example of a pie chart.
Figure 5: An example of a pie chart
Scatter plot
This is used when you are showing the relationship between two variables (x and y
axis), for example a person's weight and height. Essentially, each of these data points
looks “scattered” around the graph, giving this type of data visualization its name.
Scatter plots can also be known as scatter diagrams or x-y graphs, and the point of
using one of these is to determine if there are patterns between two variables. Figure
below shows an example of a scatter diagram.
3.5 Assessment activity
Using the data below on the population of Zimbabwe by province, construct any two
graphs using Excel.
3.6 Conclusion
This unit identified the four levels of measurement, that is, Nominal, Ordinal,
Interval and Ratio, and gave examples of each. A number of approaches to summarize statistical
data and present the results graphically for easier interpretation were discussed. Charts,
such as the pie chart, the simple bar chart, are all used to pictorially display categorical
data from qualitative random variables. Numeric random variables are summarized into
numeric frequency distributions, which are most often displayed graphically in the form
of a histogram. This chapter also introduced Excel (2007) to create summary tables
(pivot tables) and display them graphically using the various chart options. In
conclusion, graphical representations should always be considered when statistical
findings are to be presented. A graphical representation promotes more rapid
assimilation of the information to be conveyed than written reports and tables.
Unit 3: Measures of Central Tendency
3.1 Introduction
This unit presents the measures of central tendency. A measure of central tendency is a
single value that attempts to describe a set of data by identifying the central position
within that set of data. As such, measures of central tendency are sometimes called
measures of central location. They are also classed as summary statistics. The mean
(often called the average) is most likely the measure of central tendency that you are
most familiar with, but there are others, such as the median and the mode.
The mean represents the sum of all values in a dataset divided by the total number of
the values. It allows us to characterize the centre of the frequency distribution of
a quantitative variable by considering all of the observations, with the same weight
afforded to each. The sample mean is computed by summing all of the values for a
particular variable in the sample and dividing by the number of values in the sample.
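In symbols, the calculation just described can be written as below. The notation is an assumption chosen to match the worked examples later in this module, with X standing for an individual value and N for the number of values.

```latex
\[
\text{Mean} = \frac{\sum X}{N}
\]
```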
Calculating the Average in Excel is much simpler than the above formula. Use the
Average function and select the range which needs to be averaged for example
=AVERAGE(B2:B12).
Advantages
The mean can be used for both continuous and discrete numeric data;
It summarizes the essential features of a series and enables data to be compared;
It uses all the data values in its calculation;
It is used in further statistical calculations; and
It is a relatively stable measure of central tendency.
Disadvantages/Limitations
As the mean includes every value in the distribution the mean is influenced by
outliers (an outlier is an extreme value in a data set) and skewed distributions; and
It may lead to wrong impressions, particularly when the item values are not given
with the average.
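The sensitivity of the mean to an outlier can be seen in a small Python sketch. The scores are hypothetical and are used only to illustrate the point; the same check could be done in Excel with the AVERAGE function.

```python
from statistics import mean

# Hypothetical scores; not from the module.
scores = [10, 12, 11, 13, 12]
with_outlier = scores + [100]      # add one extreme value

mean_without = mean(scores)        # 58 / 5
mean_with = mean(with_outlier)     # 158 / 6

print(f"Mean without outlier: {mean_without:.2f}")
print(f"Mean with outlier:    {mean_with:.2f}")
```

A single extreme value more than doubles the mean here, even though five of the six scores lie between 10 and 13.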
3.4.1 Exercise
Calculate the mean for each of the following datasets. Comment in line with the advantages
and disadvantages of the mean.
Dataset 1: 1,2,3,4,5,6,7,8,9, 10
Dataset 2: 1,5,5,6,7,8,15,25,35,50
3.5 Median
The median is the middle score for a set of data that has been arranged in order of
magnitude.
It divides the sample into two halves. The median can be used to get an idea of what
values fall above the midpoint and what values fall below the midpoint. In one half all
items are less than median, whereas in the other half all items have values higher than
median. The median provides a helpful measure of the centre of a dataset also known
as ‘positional average’.
Follow these steps to calculate the median for ungrouped (raw) numeric data:
Arrange the data in order of magnitude.
Find the median by first identifying the middle position in the data set as follows:
If n is odd, the median value lies in the ((n+1)/2)th position in the data set.
If n is even, the median is the average of the values in the (n/2)th and ((n/2)+1)th
positions in the data set.
Calculating the Median in Excel is much simpler than the above procedure. Use the MEDIAN
function and select the range, and you will find your median, for example
=MEDIAN(B2:B12).
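For readers who want to see the positional rule spelled out, here is a minimal Python sketch. The function name and the data are hypothetical; the even-n branch uses the usual convention of averaging the two middle values.

```python
def median(values):
    """Return the median of a list of numbers using the positional rule."""
    ordered = sorted(values)           # arrange in order of magnitude
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data")
    mid = n // 2
    if n % 2 == 1:
        # n odd: the value in position (n + 1) / 2 (counting from 1)
        return ordered[mid]
    # n even: average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

# Hypothetical data; not from the module.
odd_data = [7, 3, 9, 1, 5]
even_data = [7, 3, 9, 1]

print(median(odd_data))
print(median(even_data))
```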
Advantage
Median is a positional average and is particularly useful in the context of qualitative
phenomena that can be ranked, for example, in estimating intelligence, etc., which are
often encountered in sociological fields; and
The median is less affected by outliers and skewed data than the mean, and is
usually the preferred measure of central tendency when the distribution is not
symmetrical.
Disadvantage/Limitations
Median is not useful where items need to be assigned relative importance;
It is not frequently used in sampling statistics; and
The median cannot be identified for categorical nominal data, as it cannot be
logically ordered.
3.5.1 Exercise
3.6 Mode
The mode is the most frequent score in our data set. It’s a measure that tells you the
most popular choice or most common characteristic of your sample.
To find the mode in Excel, use the MODE function and select the range for which you want
to find the mode, for example =MODE(B2:B12).
Advantage
The mode has an advantage over the median and the mean as it can be found for
both numerical and categorical data.
It is not affected by the values of extreme items; it is, therefore, useful in all
situations where we want to eliminate the effect of extreme variations.
Mode is useful in the study of popular sizes.
Disadvantage
It is not unique, so it leaves us with problems when we have two or more values that
share the highest frequency
It is considered unsuitable in cases where we want to give relative importance to
items under consideration.
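Two of the points above, that the mode works for categorical data and that it need not be unique, can be illustrated with a short Python sketch. The data are hypothetical, and `statistics.multimode` requires Python 3.8 or later.

```python
from statistics import multimode

# Hypothetical data; not from the module.
shirt_sizes = ["M", "L", "M", "S", "L", "XL"]   # categorical data
test_scores = [2, 3, 3, 4, 5]                   # numerical data

# multimode returns every value that shares the highest frequency.
size_modes = multimode(shirt_sizes)
score_modes = multimode(test_scores)

print(size_modes)
print(score_modes)
```

The shirt sizes have two modes because "M" and "L" each occur twice, which is the situation described in the first disadvantage.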
3.6.2 Exercise
Calculate and describe the three measures of central tendency for the following dataset
3.7 Conclusion
Each central tendency measure was defined and the conditions under which each
would be appropriate to use were identified. The data type and the
presence of outliers were identified as the primary criteria determining the choice of a
suitable measure to describe sample data. Advantages and disadvantages of each
measure were discussed. All descriptive measures can be computed in Excel.
3.8 References & Further Reading
Gravetter, F. J., and L. B. Wallnau, (2000). Statistics for the Behavioral Sciences,
5th edition, Wadsworth – Thomson Learning, Belmont.
McCabe, M, (2006). Introduction to the Practice of Statistics, 5th edition, Freeman.
Sundaram, K. R., S. N. Dwivedi, and V. Sreenivas, (2010). Statistics Principles and
Methods, 1st edition, B.I Publications Pvt Ltd, New Delhi.
https://www.abs.gov.au/websitedbs/D3310114.nsf/Home/Statistical+Language+-+measures+of+central+tendency
https://statistics.laerd.com/statistical-guides/measures-central-tendency-mean-mode-median.php
Unit 4: Measures of Dispersion/Variability
3.1 Introduction
This unit presents measures of dispersion/variability. The measures of central tendency
alone are not adequate to describe data. Two data sets can have the same mean, but they
can be entirely different. Thus, to describe data, one needs to know the extent of the
dispersion. In statistics, dispersion (also called variability, scatter, or spread) is the
extent to which a distribution is stretched or squeezed. Range, interquartile range, and
standard deviation are the three commonly used measures of dispersion.
A measure of spread gives us an idea of how well the mean, for example, represents
the data. If the spread of values in the data set is large, the mean is less
representative of the data than if the spread is small. This is because a large
spread indicates that there are probably large differences between individual scores.
The two plots below show the difference graphically for distributions with the same
mean but more and less dispersion. The panel on the left shows a distribution that is
tightly clustered around the average, while the distribution in the right panel is more
spread out.
3.2 Aim of the Unit
After studying this chapter, you should be able to:
Define and find the Range and Interquartile Range
Define and calculate the Standard Deviation and variance of Ungrouped and
Grouped Data
Distinguish between Skewness and Kurtosis
3.4 Range
The range is the difference between the largest and the smallest observation in the
data.
Calculation of Range
Example:
10, 60, 50, 30, 40, 25
Range = 60 - 10 = 50
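The same calculation can be sketched in Python using the data from the example above; the variable names are chosen only for this illustration.

```python
# Data from the range example above.
data = [10, 60, 50, 30, 40, 25]

# Range = largest observation minus smallest observation.
data_range = max(data) - min(data)

print(f"Range = {max(data)} - {min(data)} = {data_range}")
```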
Advantages
It is easy to calculate and easy to understand; and
It is useful in frequency distributions where only two extreme observations are
considered.
Disadvantages
It is very sensitive to outliers and does not use all the observations in a data set;
It is susceptible to considerable distortion if there is an unusual extreme value;
It can be greatly influenced by one value which is very different from all of the
others; and
It also ignores all but two of the values thus, is likely to provide an inadequate
measure of the general dispersion of the values around the mean or median.
3.5 Interquartile Range (IQR)
The interquartile range is defined as the spread of the middle 50% of the elements, that
is, the difference between the 75th and 25th percentiles. The IQR measures how
spread out the data points in a set are around the median of the data set. The higher the
IQR, the more spread out the data points; in contrast, the smaller the IQR, the more
bunched up the data points are around the median.
The quartiles divide the ordered data into three parts:
Bottom 25%;
Middle 50%; and
Top 25%.
The inter-quartile range is the difference between the 3rd quartile and the 1st quartile,
i.e.
IQR = Q3 – Q1.
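A minimal Python sketch of the calculation is given below, using `statistics.quantiles` (Python 3.8 or later) on the six values from the range example. Note the assumption: several conventions exist for locating quartiles, and the default "exclusive" method used here may give slightly different values from Excel or a hand calculation.

```python
from statistics import quantiles

# Data from the range example, used here only as an illustration.
data = [10, 60, 50, 30, 40, 25]

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median) and Q3.
# Assumption: the default 'exclusive' interpolation method is acceptable here.
q1, q2, q3 = quantiles(data, n=4)

iqr = q3 - q1

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```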
Advantages
It can be used as a measure of variability if the extreme values are not being
recorded exactly (as in case of open-ended class intervals in the frequency
distribution);
It is not affected by extreme values and, as a result, is more likely to provide an
accurate reflection of the spread or dispersion of the elements; and
Good for ordinal data.
Disadvantages
It ignores information from the top and the bottom 25% of elements. For example,
we could have two sets of elements with the same inter-quartile range, but with
more extreme values in one set than in the other; and
The difference in spread or dispersion between the two sets of elements would not
be detected by the inter-quartile range.
3.6 Standard Deviation
When the values in a dataset are grouped closer together, you have a smaller standard
deviation. On the other hand, when the values are spread out more, the standard
deviation is larger because the typical distance of the values from the mean is greater.
Variance is the average squared difference of scores from the mean score of a
distribution. Standard deviation is the square root of the variance.
The following formulae define these measures
Population Sample
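In standard notation, the two versions are usually written as shown here. This is a sketch under stated assumptions: M denotes the mean and N (or n) the number of scores, to match the worked example in this unit, and the sample version uses the conventional divisor n − 1, which the worked example does not use because it treats the scores as a population.

```latex
\[
\text{Population:}\quad
\sigma^2 = \frac{\sum (X - M)^2}{N},
\qquad
\sigma = \sqrt{\frac{\sum (X - M)^2}{N}}
\]
\[
\text{Sample:}\quad
s^2 = \frac{\sum (X - M)^2}{n - 1},
\qquad
s = \sqrt{\frac{\sum (X - M)^2}{n - 1}}
\]
```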
Step 1: Work out the mean (Manual & using Microsoft Excel).
Mean = X = 50+45+52+56+65 = 268 = 53.6
N 5 5
Where: X = Number of scores; and
N = Number of respondents.
Step 2: Subtract the mean from each score (X - M), as shown in column 4.
Step 3: Square each of the differences in the 4th column, $(X - M)^2$.
Step 4: Work out the variance, i.e. total all the squared differences $(X - M)^2$, then divide by
the number of respondents:
Variance = $\sigma^2 = \frac{(50-53.6)^2 + (45-53.6)^2 + (52-53.6)^2 + (56-53.6)^2 + (65-53.6)^2}{5}$
$= \frac{12.96 + 73.96 + 2.56 + 5.76 + 129.96}{5}$
$= \frac{225.2}{5} = 45.04$
Step 5: The standard deviation is the square root of the variance:
Standard Deviation: $\sigma = \sqrt{45.04} \approx 6.71$
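The five steps can be checked with a short Python sketch. It uses the five scores from the worked example and the population versions of the functions in Python's standard `statistics` module (which divide by N, as in Step 4); the variable names are illustrative.

```python
from statistics import mean, pstdev, pvariance

scores = [50, 45, 52, 56, 65]

m = mean(scores)               # Step 1: the mean
variance = pvariance(scores)   # Steps 2-4: average squared difference from the mean
std_dev = pstdev(scores)       # Step 5: square root of the variance

print("mean =", m)
print("variance =", round(variance, 2))
print("standard deviation =", round(std_dev, 2))
```

In Excel the equivalent population functions are VAR.P and STDEV.P.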
Calculating Variance and Standard Deviation for Grouped Data
$\sigma^2 = \frac{\sum f_i (X_i - M)^2}{\sum f_i}$
Where: $f_i$ = frequency of the ith class
$X_i$ = Class mid-point of the ith class
Mid-point = $\frac{\text{Upper class boundary} + \text{Lower class boundary}}{2}$
Example:
For Class boundary 4.5-9.5
Mid-point = (4.5+9.5)/2 = 7
$\sigma^2 = \frac{4375}{100} = 43.75$
$\sigma = \sqrt{43.75} = 6.61$
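The grouped-data formula can be sketched in Python as follows. The frequency table below is hypothetical (it is not the table behind the 4375/100 result above); only the method, mid-points weighted by class frequencies, follows the text.

```python
from math import sqrt

# Hypothetical classes: ((lower boundary, upper boundary), frequency)
classes = [((4.5, 9.5), 3), ((9.5, 14.5), 7), ((14.5, 19.5), 6), ((19.5, 24.5), 4)]

midpoints = [(lower + upper) / 2 for (lower, upper), _ in classes]
freqs = [f for _, f in classes]
total = sum(freqs)

# Grouped mean: each mid-point weighted by its class frequency.
grouped_mean = sum(f * x for f, x in zip(freqs, midpoints)) / total

# Grouped variance: frequency-weighted squared distances from the mean.
grouped_var = sum(f * (x - grouped_mean) ** 2 for f, x in zip(freqs, midpoints)) / total
grouped_sd = sqrt(grouped_var)

print(midpoints)
print(grouped_mean, grouped_var, round(grouped_sd, 2))
```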
Advantages
It is more difficult to calculate than the range or interquartile range but generally
provides a more accurate measure of the spread of elements.
It is useful in theoretical work and statistical methods and inference.
It can show which scores are within one Standard Deviation (SD) of the mean.
Using the SD, we have a “standard” way of knowing what is normal, and what is
extra large or extra small.
The SD takes account of all the scores and provides a sensitive measure of
dispersion.
It describes the spread of the scores in a normal distribution with great precision.
Disadvantages
It is hard to calculate manually and much harder to work out than the other
measures of dispersion.
Because variance relies on the squared differences of scores from the mean, a
single outlier has greater impact on the size of the variance than does a single
score near the mean.
3.7 Skewness
The term ‘skewness’ refers to a lack of symmetry in a distribution about its mean.
It describes deviations from the mean that are greater on one side than on
the other, i.e. the distribution has one tail longer or heavier than the other.
In a skewed distribution, the curve is stretched towards either the left or the right.
When the plot extends further towards the right, the distribution has positive
skewness; when it extends further towards the left, it has
negative skewness.
3.8 Kurtosis
Kurtosis indicates the flatness or peakedness of the frequency distribution
curve and measures the heaviness of the tails (the presence of outliers) of the distribution.
It is usually reported relative to the normal distribution (excess kurtosis).
Positive kurtosis means that the distribution is more peaked than the normal
distribution.
Negative kurtosis means that the distribution is less peaked than the normal
distribution.
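A minimal Python sketch of both measures is given below, assuming the simple moment-based definitions (third and fourth central moments scaled by powers of the variance). Spreadsheet functions such as Excel's SKEW and KURT apply sample-size adjustments, so they will not match these values exactly. The data sets are hypothetical.

```python
def central_moment(data, k):
    """Average of (x - mean)**k over the data."""
    m = sum(data) / len(data)
    return sum((x - m) ** k for x in data) / len(data)

def skewness(data):
    # Positive: longer right tail; negative: longer left tail; 0: symmetric.
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    # Subtracting 3 makes the normal distribution the zero reference point.
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]
right_tailed = [2, 3, 3, 4, 4, 4, 5, 5, 12]

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_tailed), excess_kurtosis(right_tailed))
```

The symmetric set has zero skewness and negative excess kurtosis (flatter than normal), while the single large value in the second set produces positive skewness.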
3.9 Exercise
Using Excel
Calculate the Mean and Range from the data related to number of flowers per plant
for 15 plants of a Vernonia species.
16, 23, 5, 12, 17, 21, 11, 28, 10, 7, 13, 19, 14, 19, 22
Using the following student age data find the inter-quartile range.
18, 19, 18, 25, 22, 20, 21, 45, 33, 20, 18, 18
Using the student age data find the variance and the standard deviation
18, 19, 18, 25, 22, 20, 21, 45, 33, 20, 18, 18
3.10 Conclusion
In this unit we have covered the measures of dispersion which are Range, Interquartile
Range, Standard Deviation for ungrouped data and Standard Deviation for grouped
data. The standard deviation and variance are the most commonly used measures of
dispersion in the social sciences because both take into account the precise difference
between each score and the mean. Consequently, these measures are based on a
maximum amount of information. Advantages and disadvantages of each of the
measures of dispersion were discussed. Skewness and Kurtosis were also explained.
Gravetter FJ, Wallnau LB. 5th ed. Belmont: Wadsworth – Thomson Learning; 2000.
Statistics for the behavioral sciences.
McCabe, M, (2006). Introduction to the Practice of Statistics, 5th edition, Freeman.
https://www.analyticsvidhya.com/blog/2020/07/what-is-skewness-statistics/
Unit 4: Probability
4.0 Introduction
Many decisions are made under conditions of uncertainty. Probability theory provides
the foundation for quantifying and measuring uncertainty. It is used to estimate the
reliability in making inferences from samples to populations, as well as to quantify the
uncertainty of future events. It is therefore necessary to understand the basic concepts
and laws of probability to be able to manage uncertainty.
Probability is the chance, or likelihood, that a particular event will occur.
A probability value lies only between 0 and 1 inclusive (i.e. 0 ≤ P(A) ≤ 1).
The sum of the probabilities of all possible events (i.e. the mutually exclusive and
collectively exhaustive set of events) equals one, i.e. P(A1) + P(A2) + P(A3) + … + P(Ak) = 1,
for k possible events.
The following concepts are relevant when calculating probabilities associated with two
or more events:
The intersection of two events A and B is the set of all outcomes that belong to both
A and B simultaneously. It is written as A ∩ B (i.e. A and B), and the keyword is ‘and’.
The union of two events A and B is the set of all outcomes that belong to either event
A or B or both. It is written as A ∪ B (i.e. either A or B or both) and the key word is ‘or’.
Events are mutually exclusive if they cannot occur together on a single trial of a random
experiment (i.e. not at the same point in time).
Events are collectively exhaustive when the union of all possible events is equal to the
sample space.
Two events, A and B, are statistically independent if the occurrence of event A has no
effect on the probability of event B occurring, and vice versa.
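These concepts map directly onto set operations. The Python sketch below uses one throw of a fair die as a hypothetical experiment, assumes all six outcomes are equally likely, and checks independence with the product rule P(A ∩ B) = P(A) × P(B). The events A, B, C and D are made up for illustration.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Probability of an event when every outcome is equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))

A = {2, 4, 6}   # an even number
B = {4, 5, 6}   # a number greater than 3
C = {1, 3, 5}   # an odd number
D = {1, 2}      # a number less than 3

print("A and B:", A & B, prob(A & B))             # intersection
print("A or B:", A | B, prob(A | B))              # union
print("A, C mutually exclusive:", A & C == set())
print("A, C collectively exhaustive:", A | C == sample_space)
print("A, D independent:", prob(A & D) == prob(A) * prob(D))
print("A, B independent:", prob(A & B) == prob(A) * prob(B))
```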
Probability Experiment
A chance process that leads to well-defined results called outcomes.
Tossing a coin, throwing a die or drawing a card from a deck are examples of experiments.
Outcome
Result of a single trial of a probability experiment.
One or more of the possible outcomes of doing something, e.g. tossing a coin.
Event
An event is a collection of outcomes or a subset of the sample space.
A head is an event.
E.g., if you toss a coin – head or tail.
The first letter of each couple corresponds to the first coin, while the second letter
corresponds to the second coin.
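The sample space for tossing two coins can be listed mechanically. The following Python sketch is one way to do it, with H and T standing for head and tail; each two-letter string gives the first coin's result followed by the second coin's result.

```python
from itertools import product

# Every ordered pair of results for two coins.
sample_space = ["".join(pair) for pair in product("HT", repeat=2)]
print(sample_space)

# With fair coins each outcome is equally likely.
p_each = 1 / len(sample_space)
print(p_each)
```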
4.4.2 Example 2: Throwing a die
4.5.1 Exercise 1
Find the sample space for the sex of three children in a family.
Working
4.5.2 Exercise 2
Table below shows the percentage frequency table for the petrol brand most preferred
by
(b) What is the chance that a randomly selected motorist does not prefer Puma?
(c) What is the probability of finding a motorist who prefers either the Total, Trek, Engen
1! =1.
Use of the binomial distribution requires three assumptions:
Each replication of the process results in one of two possible outcomes (success or
failure),
The probability of success is the same for each replication, and
The replications are independent, meaning here that a success in one
observation/trial does not influence the probability of success in another.
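Under these three assumptions, the probability of exactly k successes in n replications is C(n, k) p^k (1 − p)^(n − k). A short Python sketch follows; the values n = 5 and p = 0.5 are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 5, 0.5   # hypothetical: five tosses of a fair coin
p_two = binomial_pmf(2, n, p)
total = sum(binomial_pmf(k, n, p) for k in range(n + 1))

print(p_two)   # probability of exactly two heads
print(total)   # probabilities over all possible k sum to 1
```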
The Poisson distribution is used for counts of events that happen at a constant average
rate over time. It gives the probability of X events occurring in a time interval T; that
is, the random variable X is the number of occurrences of an event over some interval.
The occurrences are random and independent of one another.
Demand of patients,
Number of accidents at an intersection, etc.
The processes are described by variables which take discrete integer values only.
There are 200 typographical errors randomly distributed in a 500-page manuscript. Find
the probability a given page contains exactly 3 errors.
It has been observed that at a particular section along 4th Ave, 12 fatal accidents occur
every 6 months. What is the probability of 3 accidents happening in the next 6 months?
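Both questions can be worked with the Poisson formula P(X = k) = e^(−λ) λ^k / k!, where λ is the average number of occurrences per interval. The Python sketch below assumes λ = 200/500 = 0.4 errors per page for the manuscript and λ = 12 accidents per six months for the road section; the function name is illustrative.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k occurrences when the average is lam."""
    return exp(-lam) * lam ** k / factorial(k)

p_errors = poisson_pmf(3, 200 / 500)   # exactly 3 errors on a given page
p_accidents = poisson_pmf(3, 12)       # exactly 3 accidents in 6 months

print(round(p_errors, 4))
print(round(p_accidents, 4))
```

Both probabilities are small: 3 is far above the average of 0.4 errors per page and far below the average of 12 accidents per period.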
The probability that a continuous random variable will assume a particular value is
zero. As a result, a continuous probability distribution cannot be expressed in tabular
form.
Instead, an equation or formula is used to describe a continuous probability
distribution.
The mean, mode and median are all equal, i.e. Mean = Median = Mode.
All normal distributions can be converted into the standard normal curve by subtracting
the mean and dividing by the standard deviation:
$Z = \frac{X - \mu}{\sigma}$
Example
What’s the probability of getting a score of 75 or less in an in-class test, if $\mu = 45$ and
$\sigma = 20$?
$Z = \frac{75 - 45}{20} = 1.5$, i.e. a score of 75 is 1.5 standard deviations above the mean. Then look up
Z = 1.5 in the standard normal chart (Statistical Tables).
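The table lookup can be cross-checked in Python. This sketch assumes a mean of 45 and a standard deviation of 20 for the test scores, which is consistent with a score of 75 lying 1.5 standard deviations above the mean, and uses `NormalDist` from the standard `statistics` module.

```python
from statistics import NormalDist

mu, sigma = 45, 20            # assumed mean and standard deviation of the scores
z = (75 - mu) / sigma         # standardised score

# Cumulative probability P(X <= 75), equivalently P(Z <= 1.5).
p = NormalDist(mu=mu, sigma=sigma).cdf(75)
p_standard = NormalDist().cdf(z)

print(z, round(p, 4), round(p_standard, 4))
```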
4.6.2.2 Exercise (Normal Probability Distribution)
If birth weights in a population are normally distributed with a mean of 109 oz and a
standard deviation of 13 oz,
the next thing is to analyze data and then accept or reject the hypothesis. The goal of
hypothesis testing is to determine how likely it is that a claim about a population
parameter, such as the mean, is true.
4.7.3 Concepts of Hypothesis Testing
Any statistical test revolves around the choice between two hypotheses. These are
labelled H0 and H1:
Type I error means rejecting a null hypothesis that should have been accepted (it is true);
Type II error means accepting a null hypothesis that should have been rejected (it is false).
The probability of rejecting a true null hypothesis is denoted α; it is kept ‘small’ and is called the
significance level.
The probability of failing to reject the null hypothesis when it is true is (1 – α).
The probability of making a Type II error when the null hypothesis is false is denoted
β.
The probability of rejecting a false null hypothesis is (1 – β), called the power of the test.
Level of Significance
This is a very important concept in the context of hypothesis testing. It is always some
percentage (usually 5%) which should be chosen with great care, thought and reason.
In case we take the significance level at 5 per cent, then this implies that the researcher
is willing to take as much as a 5 per cent risk of rejecting the null hypothesis when it
(H0) happens to be true. Thus, the significance level is the maximum value of the
probability of rejecting H0 when it is true.
6. Based on the chosen level for α, compare the test statistic with a value from a
table (the critical value).
A researcher takes a sample and has a value in mind for the population mean μ. Then
the question is, does the sample mean (X) contradict this value significantly?
Two versions:
Z Test
Z-test refers to a univariate statistical analysis used to test the hypothesis that
proportions from two independent samples differ greatly.
It determines how far a data point is from the mean of the data set, measured in
standard deviations.
Assumptions of Z-test:
• n is sample size
2. Find the critical value(s).
Example
In the town of Mutare, the average IQ score is 101.5. The variable is normally distributed
and the population SD is 15. A regional education officer claims that the students in her
school district have an IQ higher than the average of 101.5. She selects a random
sample of 30 students and finds the mean of the test scores is 106.4.
Since α = 0.05 and the test is a right-tailed test, the critical value is z = + 1.65.
$z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{106.4 - 101.5}{15 / \sqrt{30}} = 1.79$
Step 4: Make the decision to reject or not reject the null hypothesis.
Since the test value 1.79 > the critical value 1.65, the decision is to reject the null
hypothesis.
There is enough evidence to support the claim that the IQ of the students is higher than
the town average IQ.
Comment:
The difference is said to be statistically significant. However, when the null hypothesis is
rejected, there is always a chance of a type I error. In this case, the probability of a type
I error is at most 0.05, or 5%.
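The IQ example can be reproduced with a few lines of Python. The sketch takes the figures from the example (sample mean 106.4, hypothesised mean 101.5, population SD 15, n = 30, α = 0.05, right-tailed). The p-value is an extra quantity not computed in the text, and the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def z_statistic(sample_mean, mu0, sigma, n):
    """Distance of the sample mean from mu0, in standard errors."""
    return (sample_mean - mu0) / (sigma / sqrt(n))

alpha = 0.05
z = z_statistic(106.4, 101.5, 15, 30)

# Right-tailed test: critical value leaves alpha in the upper tail (about 1.645).
critical = NormalDist().inv_cdf(1 - alpha)
p_value = 1 - NormalDist().cdf(z)
reject_h0 = z > critical

print(round(z, 2), round(critical, 3), round(p_value, 4), reject_h0)
```

The computed critical value is about 1.645, which statistical tables commonly round to 1.65.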
Example 2
An engineer measured the Brinell hardness of 40 pieces of ductile iron that were sub-
critically annealed. The engineer hypothesized that the mean Brinell hardness of all
such ductile iron pieces is greater than 170. The average Brinell hardness of the 40
pieces of ductile iron was 172.52 with a standard deviation of 10.31. The engineer set
his significance level α at 0.05. Test the hypothesis.
Since α = 0.05 and the test is a right-tailed test, the critical value is z = + 1.65.
$z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{172.52 - 170}{10.31 / \sqrt{40}} = 1.55$
Step 4: Make the decision to reject or not reject the null hypothesis.
Since the test value 1.55 < the critical value 1.65, the decision is to not reject the null
hypothesis.
There is not enough evidence to support the claim that the Brinell hardness of the
pieces of ductile iron is > 170.
T-Test for Mean
A t-test is a hypothesis test used to examine how the means taken from two
independent samples differ. T-test follows t-distribution, which is appropriate when the
sample size is small, and the population standard deviation is not known.
Assumptions of T-test:
The critical values for the test are given in Table 4. at the end of this module.
For a one-tailed test, find the α level looking at the top row of the table and finding the
appropriate column. Find the degrees of freedom by looking down the left-hand column.
N.B. The degrees of freedom are given for values from 1-30, then at intervals above 30.
Find the critical t value for α = 0.05 with df. = 28 for a right-tailed t test.
Answer
Find the 0.05 column in the top row labelled One tail and 28 in the left-hand column.
The critical value is found where the row and column meet.
Find the critical t value for α = 0.01 with df. = 22 for a left-tailed t test.
Answer
Find the 0.01 column in the top row labelled One tail and 22 in the left-hand column.
Find the critical t value for α = 0.10 with df. = 18 for a two-tailed t test.
Answer
Find the 0.10 column in the row labelled Two tails and 18 in the column labelled df.
Example 3
The average starting annual salary for a teacher is $79,500. A researcher does not
agree and wishes to test the claim that the starting salary is less than $79,500.
A random sample of 8 teachers is selected and their salaries are shown below.
82,000 68,000 70,200 75,200
83,500 64,300 78,600 79,000
Is there enough evidence to support the researcher’s claim at α = 0.10 and degrees of
freedom = 7? Assume that the variables are normally distributed.
Step 3: Compute the test value. First, the mean and standard deviation must be found.
M = 75,100
s = $\sqrt{\frac{\sum (X_i - M)^2}{n}}$ = 6,487.49 (dollars)
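The mean and standard deviation of the eight salaries can be recomputed with Python's standard `statistics` module. The sketch shows both conventions: `pstdev` divides by n, as the formula above does, while `stdev` divides by n − 1, the sample version that many textbooks use for t-tests. Which one to carry into the test statistic is left as an assumption here.

```python
from statistics import mean, pstdev, stdev

salaries = [82000, 68000, 70200, 75200, 83500, 64300, 78600, 79000]

m = mean(salaries)
sd_n = pstdev(salaries)           # divides by n
sd_n_minus_1 = stdev(salaries)    # divides by n - 1

print(m, round(sd_n, 2), round(sd_n_minus_1, 2))
```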
4.8 Exercise
Q1) A sample of 400 male students is found to have a mean height 67.47 inches. Can it
be reasonably regarded as a sample from a large population with mean height 67.39
inches and standard deviation 1.30 inches? Test at 5% level of significance.
Q3) The specimens of copper wire drawn from a large lot have the following breaking
strengths (in kg. weight):
578, 572, 570, 568, 572, 578, 570, 572, 596, 544
Test (using Student’s t-statistic) whether the mean breaking strength of the lot may be
taken to be 578 kg. weight (Test at 5 per cent level of significance).
Q4) A Restaurant has been having average sales of 500 tea cups per day. Because of
the development of bus stand nearby, it expects to increase its sales. During the first 12
days after the start of the bus stand, the daily sales were as under:
550, 570, 490, 615, 505, 580, 570, 460, 600, 580, 530, 526
On the basis of this sample information, can one conclude that the Restaurant’s sales
have increased?
4.9 Conclusion
This chapter introduced the concept of probabilities as the foundation for inferential
statistics. The term ‘probability’ is a measure of the uncertainty associated with the
outcome of a specific event, and the properties of probabilities were defined. Also
examined were the concepts of probabilities, such as the union and intersection of
events, mutually exclusive events, collectively exhaustive sets of events and statistically
independent events. These concepts describe the nature of events for which
probabilities are calculated.
Gravetter FJ, Wallnau LB. 5th ed. Belmont: Wadsworth – Thomson Learning; 2000.
Statistics for the behavioral sciences.
McCabe, M, (2006). Introduction to the Practice of Statistics, 5th edition, Freeman.
Unit 5: Bivariate Analysis
5.0 Introduction
This unit presents bivariate analysis from cross tabulation, chi-square test and
correlation.
Group 1 A B A+B
Group 2 C D C+D
Easy to understand: no advanced statistical degree is needed to interpret cross
tabulation. The results are easy to read and explain. This makes it useful in
any type of presentation.
The data can be displayed in a contingency table where each row represents a category
for one variable and each column represents a category for the other variable. For
example, say a researcher wants to examine the relationship between gender (male vs.
female) and empathy (high vs. low). The chi-square test of independence can be used
to examine this relationship. The null hypothesis for this test is that there is no
relationship between gender and empathy. The alternative hypothesis is that there is a
relationship between gender and empathy (e.g. there are more high-empathy females
than high-empathy males).
Once we have gathered our data, we summarize the data in the two-way contingency
table.
$\chi^2 = \sum_{i=1}^{rc} \frac{(O_i - E_i)^2}{E_i}$
where $O_i$ is the observed value (data) in cell i, $E_i$ is the expected value (from theory), and rc is
the number of different data cells or categories (rows × columns).
The test statistic follows a Chi-Square distribution with degrees of freedom equal to
(r − 1)(c − 1), where r is the number of rows and c is the number of columns.
There are five steps to conduct this test.
Step 1: Formulate the hypotheses
Null Hypothesis: H0: There is no significant association between students’ educational
level and their preference for online or face-to-face instruction.
Alternative Hypothesis: Ha: There is a significant association between students’
educational level and their preference for online or face-to-face instruction.
Step 2: Specify the expected values for each cell of the table (when the null
hypothesis is true)
The expected values specify what the values of each cell of the table would be if there
was no association between the two variables. The formula for computing the expected
values requires the sample size, the row totals, and the column totals.
$E = \frac{(\text{row total})(\text{column total})}{\text{total sample size}}$
Step 3: To see if the data give convincing evidence against the null hypothesis,
compare the observed counts from the sample with the expected counts,
assuming H0 is true.
The observed values are the actual counts computed from the sample.
Step 4: Compute the test statistic
The chi-square statistic compares the observed values to the expected values. This test
statistic is used to determine whether the difference between the observed and
expected values is statistically significant.
$\chi^2 = \sum_{i=1}^{rc} \frac{(O_i - E_i)^2}{E_i}$
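Steps 2 to 4 can be sketched in Python as follows. The 2 × 2 table of observed counts is hypothetical; the function computes each expected count as (row total)(column total)/(total sample size), sums (O − E)²/E over all cells, and returns the degrees of freedom (r − 1)(c − 1).

```python
def chi_square(observed):
    """Chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n   # expected count under H0
            statistic += (o - e) ** 2 / e

    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, df

# Hypothetical counts: rows are two groups, columns are two response categories.
observed = [[30, 20],
            [10, 40]]
statistic, df = chi_square(observed)
print(round(statistic, 2), df)
```

The resulting statistic is then compared with the critical value from a chi-square table at the chosen significance level and these degrees of freedom.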
Example
A company wanted to know if providing the vaccine made a difference. To answer this
question, they must choose a statistic that can test for differences when all the variables
are nominal. The χ2 statistic was used to test the question, “Was there a difference in
incidence of pneumonia between the two groups?” At the end of the winter, the table
below was constructed to illustrate the occurrence of pneumonia among the employees.
Solution
State the null and alternative hypothesis
H0: There is no difference in occurrence of pneumococcal pneumonia between the
vaccinated and unvaccinated groups
H1: There is a difference in occurrence of pneumococcal pneumonia between the
vaccinated and unvaccinated groups
Calculate the sum of each row, and the sum of each column.
Cell expected values
Calculate the cell χ² values. For example, the cell χ² for the first cell in the case study data is
calculated as follows: (23 − 13.93)²/13.93 = 5.92
Chi-square values
Once the cell χ² values have been calculated, they are summed to obtain the χ² statistic
for the table. In this case, the χ² is 12.35 (rounded). The Chi-square table requires the
table’s degrees of freedom (df) in order to determine the significance level of the
statistic. The degrees of freedom for a χ² table are calculated with the formula:
(Number of rows − 1) × (Number of columns − 1) = (3 − 1) × (2 − 1) = 2
Using a χ² table, the significance of a Chi-square value of 12.35 with 2 df equals P <
0.005.
Conclusion
The researcher rejects the null hypothesis and accepts the alternate hypothesis: “There
is a difference in occurrence of pneumococcal pneumonia between the vaccinated and
unvaccinated groups.”
5.4 Correlation
A variable is an attribute which assumes different values, e.g. age. It is very important to
know if there is a relationship between variables under study. Is there any relationship
between variables X and Y? If there is a relationship:
How are they related – linear or non linear?
How strong is the relationship?
Is the relationship causal – does X cause Y or does Y cause X?
Correlation deals with relationship. When the fluctuation of one variable reliably predicts
a similar fluctuation in another variable, there’s often a tendency to think that means that
the change in one causes the change in the other.
However, correlation does not imply causation. There may be an unknown factor that
influences both variables similarly.
Definition
Correlation is a statistical technique that can show whether and how strongly pairs of
variables are related.
A researcher collects data on certain variables and would want to establish if there is a
relationship between two variables. The two variables under study are called the
independent variable (x) and the dependent variable (y).
Independent variable is the variable in regression that can be controlled and
manipulated.
Dependent variable is the variable in regression that cannot be controlled or manipulated.
Plot a graph called a scatter plot and you can come up with any type of the following
scatter plots:
Positive relationship – As the independent variable X increases, the dependent
variable Y also increases.
Negative Relationship - As the independent variable X increases, the dependent
variable Y decreases.
Curvilinear Relationship - As X increases, Y also increases, but only up to a
certain point, after which, as X continues to increase, Y decreases. The graph
could be an inverted-U or a U-shaped curve.
No relationship.
Strong positive correlation between x and y: the points lie close to a straight line, with
y increasing as x increases.
Weak positive correlation between x and y: the trend is that y increases as x
increases, but the points are not close to a straight line.
No correlation between x and y: the points are distributed randomly on the graph.
Weak negative correlation between x and y: the trend is that y decreases as x
increases, but the points do not lie close to a straight line.
Strong negative correlation between x and y: the points lie close to a straight line, with
y decreasing as x increases.
Example 1
A researcher wishes to establish a relationship between moisture increase in an
environment and the growth of mould spores. To do so, she/he must select a sample. The
data is collected and is presented below.
Independent variable – moisture increase.
Dependent variable – growth of mould spores.
Step 1 - Construct a scatter plot from the data collected.
Step 2 – Determine the type of relationship:
There is a linear relationship and a positive one – positive linear relationship.
As the moisture increases, the mould spores content also increases.
Examples of Negative Correlation
As weather gets colder, air conditioning costs decrease.
If a train increases speed, the length of time to get to the final point decreases.
A student who has many absences has a decrease in grades
Correlation Coefficient
A measure used to determine the strength of the linear relationship between two
variables.
The symbol for the sample correlation coefficient is (r).
If there is a strong positive linear relationship between the variables, the value
of r will be close to +1.
If there is a strong negative linear relationship between the variables, the
value of r will be close to -1.
When there is no linear relationship between the variables, or only a weak one,
the value of r will be close to 0.
When the value of r is 0 or close to 0, it implies only that there is no linear
relationship between the variables.
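A minimal Python sketch of the sample correlation coefficient is shown below, assuming the usual Pearson formula (the sum of cross-products of deviations divided by the square root of the product of the sums of squared deviations). Both data sets are hypothetical. The second follows a U-shaped curve, so its r is 0 even though y clearly depends on x.

```python
from math import sqrt

def pearson_r(x, y):
    """Sample (Pearson) correlation coefficient between paired lists x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data with a positive linear trend.
r_linear = pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])

# Hypothetical U-shaped data: y = x squared.
r_curved = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])

print(round(r_linear, 4), r_curved)
```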
Exercise
Conclusion: The correlation coefficient suggests a strong positive linear relationship
between the no. of cars a rental agency has and its annual revenue. The more cars
a rental agency has, the more annual revenue the company will have.
Coefficient of Determination r² / R²
The coefficient of determination is the ratio of the explained variation to the total
variation.
It is useful because it gives the proportion of the variance (fluctuation) of one
variable that is predictable from the other variable. It is a measure that allows us to
determine how certain one can be in making predictions from a certain model/graph.
The coefficient of determination is such that 0 ≤ r² ≤ 1, and denotes the strength of
the linear association between x and y.
The coefficient of determination represents the proportion of the variation in the data that is
accounted for by the line of best fit. For example, if r = 0.922, then r² = 0.850, which means that
85% of the total variation in y can be explained by the linear relationship between x
and y (as described by the regression equation). The other 15% of the total
variation in y remains unexplained.
The coefficient of determination is a measure of how well the regression line
represents the data. If the regression line passes exactly through every point on the
scatter plot, it would be able to explain all of the variation. The further the line is
away from the points, the less it is able to explain.
5.5 Conclusion
This unit introduced cross tabulation and the chi-squared test statistic. The Chi-
square is a valuable analysis tool that provides considerable information about the
nature of research data. It is a powerful statistic that enables researchers to test
hypotheses about variables measured at the nominal level. Correlation, r, and the
coefficient of determination, r², were discussed. Correlation analysis identifies the
strength of the relationships and determines which variables are useful in predicting
the response variable.
Gravetter FJ, Wallnau LB. 5th ed. Belmont: Wadsworth – Thomson Learning; 2000.
Statistics for the behavioral sciences.
McCabe, M, (2006). Introduction to the Practice of Statistics, 5th edition, Freeman.
Unit 6: Regression Analysis
6.0 Introduction
This chapter presents regression analysis. Regression analysis involves identifying and
evaluating the relationship between a dependent variable and one or more independent
variables. Linear regression explores relationships that can be readily described by
straight lines or their generalization to many dimensions.
In the regression model, the independent variable is labelled the X variable, and the
dependent variable the Y variable.
The relationship between X and Y can be shown on a graph, with the independent
variable X along the horizontal axis, and the dependent variable Y along the vertical
axis.
The aim of the regression model is to determine the straight-line relationship that
connects X and Y. In simple linear regression, the straight line connecting any two
variables X and Y can be stated algebraically as Y = a + bX
In simplest terms, the purpose of regression is to try to find the best line or equation that
expresses the relationship between Y and X.
The intercept (or constant) is the point at which the regression line crosses the y-axis. It
gives the mean of the dependent variable when the slope(s) are zero.
A zero slope means that the independent variable does not have any influence on the dependent
variable.
6.2.1 Exercise 1
Using Microsoft Excel or the Statistical Package for the Social Sciences (SPSS),
calculate the simple linear regression of study time versus exam mark.
Calculate the coefficient of determination. Interpret your results.
4. In the Regression dialog box, click the "Input Y Range" box and select the
5. Click the "Input X Range" box and select the independent variable data (S&P 500
returns).
Errors in Prediction
Errors of prediction are defined as the differences between the observed values of the
dependent variable and the predicted values for that variable obtained using a given
regression equation and the observed values of the independent variable.
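To make prediction errors concrete, the Python sketch below fits Y = a + bX to a small hypothetical data set using the standard least-squares formulas (slope = sum of cross-products of deviations divided by the sum of squared X deviations; intercept = mean of Y minus slope times mean of X). It then computes the error for each observation and the share of variation explained. The data and function name are illustrative, not taken from the exercises.

```python
def fit_line(x, y):
    """Least-squares intercept a and slope b for Y = a + bX."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

x = [1, 2, 3, 4, 5]   # hypothetical independent variable
y = [2, 4, 5, 4, 5]   # hypothetical dependent variable

a, b = fit_line(x, y)
predicted = [a + b * xi for xi in x]

# Error of prediction: observed value minus predicted value.
errors = [yi - pi for yi, pi in zip(y, predicted)]

# Coefficient of determination: 1 - unexplained variation / total variation.
mean_y = sum(y) / len(y)
sse = sum(e ** 2 for e in errors)
sst = sum((yi - mean_y) ** 2 for yi in y)
r_squared = 1 - sse / sst

print(round(a, 2), round(b, 2), [round(e, 2) for e in errors], round(r_squared, 2))
```

For a least-squares line the errors always sum to zero (up to rounding), which is a quick check on the arithmetic.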
6.2.3 Computer-based practical Exercise 1 (Microsoft Excel)
The training manager of a company that assembles and exports pool pumps wants to
know if there is a link between the number of hours spent by assembly workers in
training and their productivity on the job. A random sample of 10 assembly workers was
selected and their performances evaluated.
Training hours 20 36 20 38 40 33 32 28 40 24
Output 40 70 44 56 60 48 62 54 63 38
(a) Construct a scatter plot of the sample data and comment on the relationship
(b) Calculate a simple regression line, using the method of least squares, to identify a
linear relationship between the hours of training received by assembly workers and their
output (i.e. number of units assembled per day).
(c) Calculate the coefficient of determination between training hours received and
worker output. Interpret its meaning and advise the training manager.
(d) Estimate the average daily output of an assembly worker who has received only
twenty-five hours of training.
6.3 Conclusion
Simple linear regression analysis is a technique that builds a straight-line relationship
between a single independent variable, x, and a dependent variable, y. The purpose of
the regression equation is to estimate y-values from known, or assumed, x-values by
substituting the x-value into a regression equation. The data for both the independent
variable and the dependent variable must be numeric.
The method of least squares is used to find the best-fit equation to express this
relationship. The coefficients of the regression equation, b0 and b1 (the intercept a and
slope b in Y = a + bX), are weights that measure the importance of each of the independent
variables in estimating the y-variable. The simple linear regression equation, which is always
based on sample data, must be tested for statistical significance before it can be used to
produce valid and reliable estimates of the true mean value of the dependent variable. In simple linear
regression, a test of significance of the simple correlation coefficient, r, between x and y
will establish whether x is significant in estimating y.
Gravetter FJ, Wallnau LB. 5th ed. Belmont: Wadsworth – Thomson Learning; 2000.
Statistics for the behavioral sciences.
McCabe, M, (2006). Introduction to the Practice of Statistics, 5th edition, Freeman.
Unit 7: Multivariate analysis
7.0 Introduction
This unit presents multivariate analysis. Multivariate means that more than one variable
lies behind the resultant outcome. Most things that happen in the world or in business are not due
to one reason but to multiple reasons; such outcomes are described as multivariate.
Grouping and Sorting the data- MVA has multiple variables. The variables are
grouped based on their unique features.
Multiple regression is an extension of simple linear regression. It is used when we
want to predict the value of a variable based on the value of two or more other
variables.
$Y = a + b_1 X_1 + b_2 X_2 + \dots + b_n X_n$
where X1, X2, …, Xn are the independent variables and Y is the dependent variable.
In multiple regression analysis, the regression coefficients (viz., b1, b2, …, bn) become
less reliable as the degree of correlation between the independent variables (viz., X1,
X2, …, Xn) increases.
In such a situation we should use only one of the correlated independent variables to make our
estimate, as adding a second variable, say X2, that is correlated with the first variable,
say X1, distorts the values of the regression coefficients.
Advantages
The analysis is tested and conclusions are drawn. The drawn conclusions are
close to real-life situations.
Disadvantages
The analysis requires a large number of observations for multiple variables, which
must be collected and tabulated. This observation process is time-consuming.
7.4 Multicollinearity
Multicollinearity is the occurrence of high intercorrelations among two or more
independent variables in a multiple regression model. Multicollinearity can lead to
skewed or misleading results when a researcher or analyst attempts to determine how
effectively each independent variable can be used to predict or understand
the dependent variable in a statistical model.
If we conclude that multicollinearity poses a problem for our regression model, we can
attempt a handful of basic fixes.
Using techniques such as Partial Least Squares regression (PLS) and
Principal Component Analysis (PCA). PLS can reduce the variables to a smaller set
with no correlation between them. PLS, like PCA, is a dimensionality reduction
technique. PCA reduces the dimension of data through the decomposition of data into
independent factors. Therefore, new variables with no correlation between them are
created.
Centering the variables. Centering is defined as subtracting a constant (typically the
variable's mean) from every value of a variable.
Analysis of variance asks whether different sample means of a numeric random variable
come from the same population, or whether at least one sample mean comes from a different
population. The test statistic used to test this hypothesis is called the F-statistic.
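As a hedged illustration, the Python sketch below computes the F-statistic for a one-factor design using the standard definition: the variation between sample means (per degree of freedom) divided by the variation within samples (per degree of freedom). The three samples are hypothetical.

```python
def f_statistic(groups):
    """One-factor ANOVA F: between-group mean square over within-group mean square."""
    k = len(groups)                               # number of samples
    n = sum(len(g) for g in groups)               # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n

    # Variation of the sample means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own sample mean.
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical samples for three levels of one factor.
samples = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_value = f_statistic(samples)
print(round(f_value, 2))
```

A large F suggests that at least one sample mean differs by more than chance variation would explain; the value is compared with a critical value from an F table with (k − 1, n − k) degrees of freedom.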
of an influencing factor rather than chance. This chapter will consider the case in which
only one factor influences the differences in sample means. Hence the method known
as one-factor
7.7 Conclusion
7.8 References & Further Reading
Bhattacharyya, G. K., and R. A. Johnson, (1997). Statistical Concepts and Methods,
John Wiley and Sons, New York.
Gravetter FJ, Wallnau LB. 5th ed. Belmont: Wadsworth – Thomson Learning; 2000.
Statistics for the behavioral sciences.
McCabe, M, (2006). Introduction to the Practice of Statistics, 5th edition, Freeman.