
UNIVERSITY OF ZIMBABWE

FACULTY OF SOCIAL AND BEHAVIOURAL SCIENCES

DEPARTMENT OF DEMOGRAPHY SETTLEMENT AND DEVELOPMENT

Module Title: Introduction to Statistics


Module Code: HPSAD102
Level: 1.1
Lecturer: Mr A. Milanzi
Table of Contents

1.0 Module Rationale
1.1 Module Structure
1.2 Aims and Learning Outcomes
1.3 Methods of Assessment
Unit 1 Introduction
2.1 Unit Introduction
2.2 Aim of the Unit
2.3 Structure of the Unit
2.4 Basic definitions in Statistics
2.5 Types of Data
2.6 Components of Statistics
2.6.1 Descriptive Statistics
2.6.2 Inferential Statistics
2.7 Group Tasks
2.8 Conclusion
2.9 References & Further Reading
Unit 2 Levels of Measurement and Frequency Distributions
3.1 Introduction of the unit
3.2 Aim of the Unit
3.3 Levels of measurements
3.4 Presentation of Descriptive statistics
3.5 Assessment activity
3.6 Conclusion
3.7 References & Further Reading
Unit 3: Measures of Central Tendency
3.1 Introduction
3.2 Aim of the Unit
3.3 Structure of the Unit
3.4 The Mean
3.4.1 Exercise
3.5 Median
3.5.1 Exercise
3.6 Mode
3.6.1 Group task
3.6.2 Exercise
3.7 Conclusion
3.8 References & Further Reading
Unit 4: Measures of Dispersion/Variability
3.1 Introduction
3.2 Aim of the Unit
3.3 Structure of the Unit
3.4 Range
3.5 Interquartile Range (IQR)
3.6 Standard Deviation
3.7 Skewness
3.8 Kurtosis
3.9 Exercise
3.10 Conclusion
3.11 References & Further Reading
4.0 Introduction
4.1 Aim of the unit
4.2 Structure of the Unit
4.3 Probability Concepts
4.4 Sample Space
4.4.1 Example 1: Tossing Two Coins
4.4.2 Example 2: Throwing a dice
4.5 Probability tree diagram
4.5.1 Exercise 1
4.5.2 Exercise 2
4.6 Probability Distributions
4.6.1 Discrete Probability Distributions
4.6.1.1 Binomial Probability Distribution
Mean of a Binomial Distribution
Variance of a Binomial distribution
4.6.1.2 Poisson Probability Distribution
Mean and Variance
4.6.1.3 Exercise 1 (Discrete probability distributions)
4.6.2 Continuous probability distributions
4.7 Hypothesis testing
4.7.1 What is a Hypothesis?
4.7.2 Characteristics of hypothesis
4.7.3 Concepts of Hypothesis Testing
4.8 Exercise
4.9 Conclusion
4.10 References & Further Reading
Unit 5: Bivariate Analysis
5.0 Introduction
5.1 Aim of the Unit
5.2 Cross-tabulation Table
5.3 The Chi-Squared Test for Independence of Association
5.4 Correlation
5.5 Conclusion
5.6 References & Further Reading
Unit 6: Regression Analysis
6.0 Introduction
6.1 Aim of the Unit
6.2 Objectives of Regression Analysis
6.2.1 Exercise 1
Exercise 2: From the output given below,
6.2.3 Computer based practical Exercise 1 (Microsoft Excel)
6.3 Conclusion
6.4 References & Further Reading
Unit 7: Multivariate analysis
7.0 Introduction
7.1 Aim of the Unit
7.2 Structure of the Unit
7.3 The objective of Multivariate Analysis (MVA)
7.3.1 Exercise 1 (Multiple Regression)
7.4 Multicollinearity
7.5 Analysis of Variance (ANOVA)
7.6 Exercise 1 (Multiple regression)
7.7 Conclusion
7.8 References & Further Reading

1.0 Module Rationale
The purpose of this module is to introduce basic statistics to level-one undergraduate
students. It is an introductory course that assumes no prior knowledge of statistics.
Its main objective is to equip students with basic statistical concepts and techniques,
and to impart knowledge and analytical skills in the following areas: descriptive
statistics; presentation of data; probability; inferential statistics; and multivariate
analysis. Calculations will be done using software such as Microsoft Excel or the
Statistical Package for the Social Sciences (SPSS). These skills are meant to enhance
students’ professional and academic performance in the subject.
1.1 Module Structure
This module is structured as follows:
Unit 1: Introduction;
Unit 2: Levels of Measurement and Frequency Distributions;
Unit 3: Measures of Central Tendency;
Unit 4: Measures of Dispersion/Variability;
Unit 5: Probability;
Unit 6: Bivariate Analysis;
Unit 7: Regression Modelling; and
Unit 8: Multivariate analysis.
1.2 Aims and Learning Outcomes
For this module, the student has to master the following outcomes:
 Explain the basic concepts of descriptive and inferential statistics;
 Present data;
 Calculate and interpret basic descriptive and inferential statistics;
 Determine when, why, and how various statistical tests are used; and,
 Analyze data using statistical software (e.g. Excel, SPSS).
1.3 Methods of Assessment
Assessment consists of a final examination at the end of the academic semester, which
constitutes 50% of the overall course mark, and coursework, which makes up the
remaining 50%. Students will be given a series of exercises, in-class tests, and written
and practical assignments that together constitute the continuous assessment
(coursework). Students are advised to take the continuous assessment exercises
seriously, as they contribute significantly towards the overall course mark.

How to use the module /instructions

Unit 1 Introduction

2.1 Unit Introduction


This chapter introduces the basic concepts in statistics.
2.2 Aim of the Unit
After studying this chapter, you should be able to:
 Describe the different data types;
 Explain the basic terms and concepts of Statistics and provide examples;
 Explain the different components of Statistics;
 Distinguish between qualitative and quantitative random variables; and
 Discuss descriptive and inferential statistics.
2.3 Structure of the Unit
 Basic definitions in Statistics;
 Data types;
 Descriptive Statistics; and
 Inferential Statistics.
2.4 Basic definitions in Statistics
Statistics is the science of conducting studies to collect, organize, summarize and draw
conclusions from data.
A random variable is any attribute of interest on which data is collected and analyzed.

Data refers to observations and measurements which have been collected in some
way, often through research. Data is the actual values (numbers) or outcomes recorded
on a random variable.
Some examples of random variables and their data are:
 the travel distances of delivery vehicles (data: 34 km, 13 km, 21 km)
 the daily occupancy rates of hotels in Harare (data: 45%, 72%, 54%)
 the duration of machine downtime (data: 14 min, 25 min, 6 min)
 brand of coffee preferred (data: Nescafe, Ricoffy, Frisco).
Data that is recorded as numbers (and therefore measures quantities) is quantitative
data.

Quantitative variables are variables that can be counted or measured. For example, age
is numerical, and people can be ranked in order according to their ages. Other examples
are height, weight and temperature.
The following are examples of quantitative random variables with real numbers as data:
 the age of an employee (e.g. 46 years; 28 years; 32 years);
 machine downtime (e.g. 8 min; 32.4 min; 12.9 min);
 the price of a product in different stores (e.g. R6.75; R7.45; R7.20; R6.99); and
 delivery distances travelled by a courier vehicle (e.g. 14.2 km; 20.1 km; 17.8 km).

Data that is recorded as text (and therefore records qualities) is qualitative data.

Qualitative variables are variables that have distinct categories according to some
characteristic or attribute. For example, if subjects are classified according to sex
(male or female), then the variable ‘sex’ is qualitative. For analysis its categories may
be coded with numbers (e.g. male = 1, female = 2), but these codes are only labels and
do not measure quantities. Other examples include religion, place of residence (rural,
urban) and province.
The following are examples of qualitative random variables with categories as data:
 The gender of a consumer is either male or female.
 An employee’s highest qualification is either A level, a diploma or a degree.
 A company operates in either the financial, retail, mining or industrial sector.
 A consumer’s choice of mobile phone service provider is either Econet, Netone or
Telecel.
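
The difference between the two kinds of variable can be sketched in a few lines of Python. Python is not one of the tools this module uses (Excel and SPSS are), so the sketch is for illustration only, and the names and the coded values are assumptions made up for the example. It shows that a mean is meaningful for a quantitative variable such as age, whereas a qualitative variable such as sex is summarized by counting categories, even when its categories are stored as number codes.

```python
from collections import Counter
from statistics import mean

# Quantitative variable: ages in years (values taken from the examples above).
ages = [46, 28, 32]

# Qualitative variable stored as codes (illustrative coding: male = 1, female = 2).
sex_codes = [1, 2, 2]
sex_labels = {1: "male", 2: "female"}

# A mean is meaningful for a quantitative variable.
mean_age = mean(ages)

# For a qualitative variable we count how often each category occurs;
# averaging the codes would produce a number with no real meaning.
sex_counts = Counter(sex_labels[code] for code in sex_codes)

print("Mean age:", round(mean_age, 1))
print("Counts by sex:", dict(sex_counts))
```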

2.5 Types of Data


Variables can be classified in several ways. One classification refers to the type
and amount of information contained in the data.
Data are either categorical or numerical.

Categorical data is data that is grouped into categories, such as responses to yes/no
questions, or data on gender, smoking status or marital status.
Numerical data include both discrete and continuous variables.

Discrete variables take values that can be counted. They may, but need not, have a
finite number of values.
The most common type of discrete numerical variable that we will encounter arises from
a counting process, e.g. the number of women who have given birth or the number of
children born to a woman.

Continuous variables can assume an infinite number of values between any two specific
values.
A continuous variable may take on any value within a given range of real numbers and
usually arises from a measurement, e.g. temperature, height, weight and distance. For
example, a temperature may be recorded as 36.5 degrees Celsius or a height as 6.2
metres.

2.6 Components of Statistics


Statistics consists of two major components: descriptive statistics and inferential
statistics.

2.6.1 Descriptive Statistics


Descriptive statistics are brief descriptive coefficients that summarize a given data
set, which may represent either an entire population or a sample of it. Descriptive
statistics are commonly broken down into measures of central tendency and measures
of variability (spread). Measures of central tendency include the mean, median and
mode, while measures of variability include the standard deviation, variance, and the
minimum and maximum values. Kurtosis and skewness describe the shape of the
distribution.

Descriptive statistics, in short, help describe and understand the features of a specific
data set by giving short summaries of the sample and its measurements.

Descriptive statistics can be useful for two purposes:


1) to provide basic information about variables in a dataset; and
2) to highlight potential relationships between variables.
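
As a minimal sketch of what such summaries look like in practice, the Python example below computes several of the measures named in this section using the standard `statistics` module. The module itself works in Excel or SPSS, so Python is used here only for illustration, and the list of ages is a hypothetical data set.

```python
from statistics import mean, median, mode, stdev, variance

# Hypothetical sample: ages of five employees, in years.
ages = [46, 28, 32, 28, 41]

summary = {
    "mean": mean(ages),          # measure of central tendency
    "median": median(ages),      # middle value of the sorted data
    "mode": mode(ages),          # most frequent value
    "minimum": min(ages),
    "maximum": max(ages),
    "variance": variance(ages),  # sample variance (divides by n - 1)
    "std_dev": stdev(ages),      # sample standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

Note that `variance` and `stdev` are the sample versions; the `statistics` module also provides `pvariance` and `pstdev` for data that cover an entire population.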

2.6.2 Inferential Statistics

Inferential statistics consists of generalizing from samples to populations, performing
estimations and hypothesis tests, determining relationships among variables, and making
predictions. Inferential statistics goes beyond description and makes generalizations
about the whole population.
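
The idea of generalizing from a sample to a population can be illustrated with a small simulation. The sketch below is only an illustration in Python, not a method prescribed by this module. The population of ages is randomly generated (hypothetical), and the sample size of 50 is an arbitrary choice. The sample mean is used as an estimate of the population mean, which in real studies is usually unknown.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so that the run is repeatable

# Hypothetical population: ages of 1,000 people between 18 and 65.
population = [random.randint(18, 65) for _ in range(1000)]

# In practice we can rarely measure everyone, so we draw a random sample.
sample = random.sample(population, 50)

population_mean = mean(population)  # normally unknown to the researcher
sample_mean = mean(sample)          # our estimate of the population mean

print(f"Sample mean (estimate): {sample_mean:.1f}")
print(f"Population mean:        {population_mean:.1f}")
```

The two values will usually be close but not identical; quantifying that gap is what estimation and hypothesis testing are about.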

2.7 Group Tasks


 Group Task 1: Describe the different data types. (Task given at the beginning of
the module; due during the lecture.)
 Group Task 2: Discuss descriptive and inferential statistics and their uses. (Task
given at the beginning of the module; due during the lecture.)

2.8 Conclusion
In this unit we introduced the basic terms and concepts of Statistics, described the
different data types and explained the different components of Statistics. We also
distinguished between qualitative and quantitative random variables and discussed
descriptive and inferential statistics.

2.9 References & Further Reading
Bhattacharyya, G. K. and Johnson, R. A. (1997). Statistical Concepts and Methods,
John Wiley and Sons, New York.

Dixon, W. J. and Massey, F. J. (2005). Introduction to Statistical Analysis, McGraw-Hill,
New York.

Lane, D. (2003). Introduction to Statistics. Rice University.

McCabe, M. (2006). Introduction to the Practice of Statistics, 5th edition, Freeman.

Online Statistics Education: A Multimedia Course of Study (http://onlinestatbook.com/).
Project Leader: David M. Lane, Rice University.

Watkins, J. C. (2019). An Introduction to the Science of Statistics: From Theory to
Implementation.

https://www.chi2innovations.com/blog/data-types-101/

https://www.sagepub.com/sites/default/files/upm-binaries/40006_Chapter1.pdf

Unit 2 Levels of Measurement and Frequency Distributions

3.1 Introduction of the unit


This unit presents levels of measurement and frequency distributions.
3.2 Aim of the Unit
After studying this chapter, you should be able to:
 Describe the levels of measurement and give examples;
 Distinguish between nominal, ordinal, interval and ratio levels of
measurement;
 Construct and interpret Bar Charts, Pie Charts, Boxplots, Scatter Plots,
Histograms, Ogives, Line graphs; and
 Use Excel to produce tables and charts.
3.3 Levels of measurements

The level of measurement describes how a variable is measured, and it matters for two
reasons. First, knowing the level of measurement helps you decide how to interpret the
data from that variable. Second, it helps you decide which statistical analysis is
appropriate for the values that were assigned.

Variables have 4 different levels of measurement:


 Nominal
 Ordinal
 Interval
 Ratio
These four levels of measurement fall under two broad types of variables:
Categorical – variables where data are grouped into categories
 Nominal
 Ordinal
Continuous – assume an infinite number of values between any two specific values
 Interval
 Ratio

The graphic below should help you visualize the four different levels of measurement.
See the definitions and examples below for each.

 Nominal level classifies data into mutually exclusive (non-overlapping) categories.
No ordering or ranking can be imposed on the data, e.g. Religion – Catholic,
Protestant, Apostolic, Pentecostal etc.; Gender – male, female; and Residence
– rural, urban.

 Ordinal level classifies data into categories that can be ordered or ranked. As with
nominal data, the values are words that describe responses. The differences
between the categories do not have any precise meaning, e.g. ranking on
satisfaction – poor, satisfactory, good, excellent.

 Interval level differs from the ordinal level in that the data can be ranked in order
and there are precise differences between the units. Meaning is given to the
difference between measurements, e.g. temperature (15 °C, 22 °C) or IQ tests (IQ of
100 or IQ of 110). An interval scale indicates rank and distance from an arbitrary
zero, measured in unit intervals.

 Ratio level has all the characteristics of interval measurement, i.e. it indicates both
rank and distance, but from a natural (true) zero, so that ratios of two measures
have meaning. Examples are weight (500 g, 20 kg, 25 kg) and height (1.5 m, 1.6 m,
1.8 m). For instance, if one person can lift 200 kg and another can lift 100 kg, then
the ratio between them is 2 to 1.
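
The practical difference between the interval and ratio levels can be checked with a short calculation. The Python sketch below is an illustration only; it uses the weights and temperatures mentioned above, and the conversion to Kelvin is added here as an assumption to show what a true zero changes. Differences are meaningful on both scales, but a ratio is meaningful only when zero means "none of the quantity".

```python
# Ratio scale: weight has a true zero, so ratios are meaningful.
weight_a_kg = 200
weight_b_kg = 100
weight_ratio = weight_a_kg / weight_b_kg  # 2.0: one lifts twice as much

# Interval scale: the Celsius zero is arbitrary.
temp_a_c = 22
temp_b_c = 15
celsius_difference = temp_a_c - temp_b_c  # 7 degrees: a meaningful difference
celsius_ratio = temp_a_c / temp_b_c       # about 1.47, but NOT "1.47 times as hot"

# On the Kelvin scale, which has a true zero, the ratio is far smaller.
kelvin_ratio = (temp_a_c + 273.15) / (temp_b_c + 273.15)

print("Weight ratio:", weight_ratio)
print("Celsius difference:", celsius_difference)
print("Celsius ratio:", round(celsius_ratio, 3))
print("Kelvin ratio:", round(kelvin_ratio, 3))
```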

3.4 Presentation of Descriptive statistics

Graphical/Pictorial Methods

There are several graphical and pictorial methods that enhance researchers'
understanding of individual variables and the relationships between variables. Graphical
and pictorial methods provide a visual representation of the data. Graphs are used to
display data because it is easier to see trends in the data when it is displayed visually
compared to when it is displayed numerically in a table. Complicated data can often be
displayed and interpreted more easily in a graph format than in a data table.

In a graph, the X-axis runs horizontally (side to side) and the Y-axis runs vertically (up
and down). Typically, the independent variable will be shown on the X axis and the
dependent variable will be shown on the Y axis. Some of the data presentation methods
include:
 Bar chart/graph
 Frequency table
 Histogram
 Line graph
 Ogive
 Pie chart
 Scatter plot

Bar Chart/Graph

Bar charts are used to compare measurements between different groups. Bar charts
should be used when your data is not continuous, but rather is divided into different
categories. For example, if you counted the number of people in each of the 10
provinces in Zimbabwe, each province would be its own category. There is no value
between provinces, so this data is not continuous. Figure 1 below shows an example of
a bar chart/graph.
Figure 1: Example of a bar chart/graph

Frequency Table

Frequency is a measure of the number of occurrences of a particular score in a given
set of data. A frequency table is a method of organizing raw data in a compact form by
displaying a series of scores in ascending or descending order, together with their
frequencies, i.e. the number of times each score occurs in the data set. A frequency
table typically includes a column for the scores and a column showing the frequency of
each score in the data set. More detailed tables may also contain relative frequencies
(proportions) or percentages. Frequency tables may be computed for both discrete and
continuous variables and may take either an ungrouped or a grouped format. Table 1
below shows an example of a frequency table.

Table 1: Example of a frequency table
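
The construction of an ungrouped frequency table can be sketched in Python as follows. This is an illustration only (the module itself uses Excel pivot tables), and the list of scores is hypothetical. For each distinct score the sketch records the frequency, the relative frequency (proportion) and the percentage.

```python
from collections import Counter

# Hypothetical raw data: ten test scores.
scores = [3, 5, 4, 3, 5, 5, 2, 4, 5, 3]

counts = Counter(scores)  # frequency of each distinct score
n = len(scores)

# One row per score, in ascending order:
# (score, frequency, relative frequency, percentage)
frequency_table = []
for score in sorted(counts):
    frequency = counts[score]
    frequency_table.append((score, frequency, frequency / n, 100 * frequency / n))

print(f"{'Score':>5} {'Freq':>5} {'Rel.':>6} {'%':>6}")
for score, frequency, relative, percent in frequency_table:
    print(f"{score:>5} {frequency:>5} {relative:>6.2f} {percent:>6.1f}")
```

The frequencies always add up to the number of observations, and the relative frequencies add up to 1, which is a useful check on any frequency table.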

Histogram
A histogram is used to summarize discrete or continuous numerical data. It provides a
visual interpretation of the data by showing the number of data points that fall within
each specified range of values. It is similar to a vertical bar graph. However, unlike a
vertical bar graph, a histogram has no gaps between the bars: the bars are drawn
adjacent to each other, and their heights correspond to the frequency values. The
histogram is used to:
 Identify the most common process outcome
 Identify data symmetry
 Spot deviations
 Verify equal distribution
 Spot areas that require little effort
Figure 2 below shows an example of a histogram.

Figure 2: Example of a histogram
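
The counting step behind a histogram, in which each value is assigned to a class interval, can be sketched as follows. This Python example is an illustration only; the ages, the starting point of 20 and the class width of 10 are all assumptions chosen for the example. The printed rows of `#` symbols act as a rough text version of the bars.

```python
# Hypothetical data: ages of twelve respondents.
ages = [21, 34, 45, 23, 37, 52, 29, 41, 33, 26, 48, 35]

class_width = 10   # assumed width of each class interval
first_lower = 20   # assumed lower limit of the first class

# Count how many values fall in each class [lower, lower + width).
class_counts = {}
for age in ages:
    lower = first_lower + ((age - first_lower) // class_width) * class_width
    class_counts[lower] = class_counts.get(lower, 0) + 1

# Adjacent classes with no gaps between them, as in a histogram.
for lower in sorted(class_counts):
    upper = lower + class_width
    print(f"{lower} to <{upper}: {'#' * class_counts[lower]}")
```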

Line graph
Line graphs are the best type of graph to use when you are displaying a change in
something over a continuous range or over time. For example, you could use a line
graph to display a change in student performance, incomes or the inflation rate over
time. An important use of the line graph is to track changes over short and long periods
of time. It is also used to compare changes over the same period of time for different
groups. A line graph is generally preferable to a bar graph when the changes are
small. Figure 3 below shows an example of a line graph.
Figure 3: An example of a line graph

Ogive graph
An ogive graph is a plot used in statistics to show cumulative frequencies. It is used to
quickly estimate the number of observations that are less than or equal to a particular
value. There are two types of ogives:
 Less than ogive: Plot the points with the upper limits of the classes as
abscissae and the corresponding less than cumulative frequencies as
ordinates. Join the points by a smooth freehand curve to get the less than
cumulative frequency curve, or “less than ogive”. It is a rising curve.
 Greater than ogive: Plot the points with the lower limits of the classes as
abscissae and the corresponding greater than cumulative frequencies as
ordinates. Join the points by a smooth freehand curve to get the “greater
than ogive” (also called the “more than ogive”). It is a falling curve.
Figure 4 below shows an example of the two types of ogive.
Figure 4: An example of the two types of ogive
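
The points to be plotted for the two ogives can be worked out from a grouped frequency table. The Python sketch below is an illustration under assumed data: the class limits and frequencies are hypothetical, and each class is taken to include its lower limit but not its upper limit. The less than points pair each upper limit with the running total of frequencies, while the greater than points pair each lower limit with the number of observations at or above it.

```python
from itertools import accumulate

# Hypothetical grouped data: (lower limit, upper limit) and class frequencies.
classes = [(20, 30), (30, 40), (40, 50), (50, 60)]
frequencies = [4, 4, 3, 1]

total = sum(frequencies)

# Less than cumulative frequencies: running total up to each upper limit.
less_than_cum = list(accumulate(frequencies))

# Greater than cumulative frequencies: observations at or above each lower limit.
greater_than_cum = [total - cum + freq
                    for cum, freq in zip(less_than_cum, frequencies)]

# (x, y) points: upper limits for the rising curve, lower limits for the falling one.
less_than_points = [(upper, cum)
                    for (_, upper), cum in zip(classes, less_than_cum)]
greater_than_points = [(lower, cum)
                       for (lower, _), cum in zip(classes, greater_than_cum)]

print("Less than ogive points:   ", less_than_points)
print("Greater than ogive points:", greater_than_points)
```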

Pie chart
A pie chart is a circular graph divided into slices. The size of each slice is
proportional to the fraction of the whole that falls in that category. The entire “pie”
represents 100 percent of the whole, while the “slices” represent portions of the
whole. A pie chart is best used when trying to show the composition of something.
Figure 5 below shows an example of a pie chart.

Figure 5: An example of a pie chart
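
The size of each slice follows directly from the category counts. The Python sketch below is an illustration only, and the counts per provider are made-up numbers, not real market shares. Each slice's share of the whole is expressed both as a percentage and as an angle out of the 360 degrees of the circle.

```python
# Hypothetical counts of survey respondents by mobile service provider.
subscribers = {"Econet": 60, "Netone": 25, "Telecel": 15}

total = sum(subscribers.values())

# For each category: (percentage of the whole, angle of the slice in degrees).
slices = {}
for provider, count in subscribers.items():
    percentage = 100 * count / total
    angle = 360 * count / total
    slices[provider] = (percentage, angle)

for provider, (percentage, angle) in slices.items():
    print(f"{provider}: {percentage:.1f}% of the pie, {angle:.1f} degrees")
```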

Scatter plot
A scatter plot is used to show the relationship between two variables, one plotted on
the x-axis and the other on the y-axis, for example a person's weight and height. Each
observation is plotted as a point, and the points look “scattered” around the graph,
which gives this type of data visualization its name. Scatter plots are also known as
scatter diagrams or x-y graphs, and the purpose of using one is to determine whether
there is a pattern in the relationship between the two variables. Figure 6 below shows
an example of a scatter diagram.

Figure 6: An example of a scatter diagram

3.5 Assessment activity
 Using the data below on the population of Zimbabwe by province, construct any two
graphs using Excel.

 Class test on Unit 1 and 2

3.6 Conclusion
This unit identified the four levels of measurement, namely nominal, ordinal, interval
and ratio, and gave examples of each. A number of approaches to summarizing statistical
data and presenting the results graphically for easier interpretation were discussed.
Charts such as the pie chart and the simple bar chart are used to display categorical
data from qualitative random variables pictorially. Numeric random variables are
summarized into numeric frequency distributions, which are most often displayed
graphically in the form of a histogram. This unit also introduced Excel (2007) to create
summary tables (pivot tables) and display them graphically using the various chart
options. In conclusion, graphical representations should always be considered when
statistical findings are to be presented. A graphical representation promotes more rapid
assimilation of the information to be conveyed than written reports and tables.

3.7 References & Further Reading


Ott, Lyman and Michael Longnecker. An Introduction to Statistical Methods & Data
Analysis. 7th ed., Cengage Learning, 2016.
Mendenhall, William, et al. Introduction to Probability and Statistics. 14th ed., Cengage
Learning, 2013.
https://www.academia.edu/35746190/UNIT_13_DATA_PRESENTATION_AND_DESCRIPTIVE_STATISTICS

Unit 3: Measures of Central Tendency

3.1 Introduction
This unit presents the measures of central tendency. A measure of central tendency is a
single value that attempts to describe a set of data by identifying the central position
within that set of data. As such, measures of central tendency are sometimes called
measures of central location. They are also classed as summary statistics. The mean
(often called the average) is most likely the measure of central tendency that you are
most familiar with, but there are others, such as the median and the mode.

3.2 Aim of the Unit


After studying this chapter, you should be able to:
 describe the measures of central tendency;
 calculate and interpret the mean, mode and median;
 describe the appropriate central location measure for different data types; and
 use Excel to calculate the mean.

3.3 Structure of the Unit


 Mean
 Median
 Mode

3.4 The Mean


The mean (or average) is the most popular and well-known measure of central tendency. It can be used with both discrete and continuous data and it gives an overall picture of the data.

The mean is the sum of all values in a dataset divided by the total number of values. It characterizes the centre of the frequency distribution of a quantitative variable by giving every observation the same weight. The sample mean is computed by summing all of the values for a particular variable in the sample and dividing by the number of values in the sample:

$$\bar{x} = \frac{\sum x}{n}$$

where $\sum x$ is the sum of the values and $n$ is the number of values in the sample.

Calculating the average in Excel is much simpler than applying the formula by hand. Use the AVERAGE function and select the range to be averaged, for example =AVERAGE(B2:B12).

Advantages
 The mean can be used for both continuous and discrete numeric data;
 It summarizes the essential features of a series and enables data sets to be compared;
 It uses all the data values in its calculation;
 It is used in further statistical calculations; and
 It is a relatively stable measure of central tendency.

Disadvantages/Limitations
 As the mean includes every value in the distribution the mean is influenced by
outliers (an outlier is an extreme value in a data set) and skewed distributions; and
 It may lead to wrong impressions, particularly when the item values are not given
with the average.
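
The effect of an outlier on the mean can be illustrated with a short Python sketch. The income figures are hypothetical and are used only for illustration; `mean` comes from Python's standard `statistics` module.

```python
from statistics import mean

# Hypothetical monthly incomes (US$) for five households (illustrative data only).
incomes = [200, 220, 250, 260, 270]

# The same data with one extreme value (an outlier) added.
incomes_with_outlier = incomes + [3000]

mean_without_outlier = mean(incomes)              # 1200 / 5
mean_with_outlier = mean(incomes_with_outlier)    # 4200 / 6

print("Mean without outlier:", mean_without_outlier)
print("Mean with outlier:", mean_with_outlier)
```

A single extreme value moves the mean from 240 to 700, which is higher than the income of five of the six households. This is why the individual values should be inspected alongside the average.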

3.4.1 Exercise
Calculate the mean for each of the following datasets and comment on the results in line with the advantages and disadvantages of the mean.
Dataset 1: 1,2,3,4,5,6,7,8,9, 10
Dataset 2: 1,5,5,6,7,8,15,25,35,50

3.5 Median
The median is the middle score for a set of data that has been arranged in order of
magnitude.

It divides the sample into two halves. The median can be used to get an idea of what values fall above the midpoint and what values fall below the midpoint. In one half all items are less than or equal to the median, whereas in the other half all items are greater than or equal to the median. The median, also known as a ‘positional average’, provides a helpful measure of the centre of a dataset.

Follow these steps to calculate the median for ungrouped (raw) numeric data:

 Arrange the n data values in ascending order.

 Identify the middle position in the data set as follows:

 If n is odd, the median is the value in the ((n+1)/2)th position in the data set.

 If n is even, the median is the average of the values in the (n/2)th and ((n/2)+1)th positions.

Calculating the median in Excel is much simpler than working through these steps by hand. Use the MEDIAN function and select the range, for example =MEDIAN(B2:B12).
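
The steps for finding the median can be sketched in Python as follows. The function name `median` and the two small datasets are chosen here for illustration, and the even-n branch uses the usual convention of averaging the two middle values.

```python
def median(values):
    """Return the median of a list of numbers."""
    ordered = sorted(values)          # arrange the values in ascending order
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # n is odd: position (n + 1) / 2, i.e. index n // 2 when counting from 0
        return ordered[middle]
    # n is even: average the values in positions n / 2 and n / 2 + 1
    return (ordered[middle - 1] + ordered[middle]) / 2


odd_data = [50, 45, 52, 56, 65]         # n = 5
even_data = [10, 60, 50, 30, 40, 25]    # n = 6

print(median(odd_data))    # middle value of 45, 50, 52, 56, 65
print(median(even_data))   # average of 30 and 40
```

For the odd-sized data the median is 52; for the even-sized data it is 35, the average of the two middle values 30 and 40.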

Advantages
 The median is a positional average and is particularly suited to phenomena that can be ranked rather than measured exactly, for example in estimating intelligence, which are often encountered in sociological fields; and
 The median is less affected by outliers and skewed data than the mean, and is
usually the preferred measure of central tendency when the distribution is not
symmetrical.

Disadvantages/Limitations
 The median is not useful where items need to be assigned relative importance;
 It is not frequently used in sampling statistics; and
 The median cannot be identified for categorical nominal data, as such data cannot be logically ordered.
3.5.1 Exercise

Calculate the median for the following:


Dataset 1: 5,2,12,4,9, 10,8,6,7,3,15

3.6 Mode
The mode is the most frequent score in our data set. It’s a measure that tells you the
most popular choice or most common characteristic of your sample.

To find the mode in Excel, use the MODE function and select the range for which you want to find the mode, for example =MODE(B2:B12).

Advantages
 The mode has an advantage over the median and the mean in that it can be found for both numerical and categorical data;
 It is not affected by the values of extreme items and is, therefore, useful in situations where we want to eliminate the effect of extreme variations; and
 The mode is useful in the study of popular sizes.
Disadvantages
 It is not always unique, which leaves us with a problem when two or more values share the highest frequency; and
 It is considered unsuitable in cases where we want to give relative importance to items under consideration.
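
The following Python sketch shows how the mode can be found for both numerical and categorical data, and what happens when the mode is not unique. The helper name `modes` and the datasets are hypothetical and used only for illustration.

```python
from collections import Counter


def modes(values):
    """Return a list of all values that share the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]


shoe_sizes = [6, 7, 7, 8, 8, 9]              # numerical data with two modes
colours = ["red", "blue", "red", "green"]    # categorical data with one mode

print(modes(shoe_sizes))
print(modes(colours))
```

The shoe-size data returns two modes (7 and 8), which illustrates the disadvantage noted above, while the colour data returns the single mode "red".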

3.6.1 Group task


Discuss each of the three measures of central tendency.

3.6.2 Exercise
Calculate and describe the three measures of central tendency for the following dataset

Dataset: 5,2,12,4,9, 10,8,6,7,3,15,5,2,1,7,7,20,12,2,1,5,9,2

3.7 Conclusion
Each measure of central tendency was defined and the conditions under which each would be appropriate to use were identified. The data type and the presence of outliers were identified as the primary criteria determining the choice of a suitable measure to describe sample data. Advantages and disadvantages of each measure were discussed. All of these descriptive measures can be computed in Excel.

3.8 References & Further Reading


Bhattacharyya, G. K., and R. A. Johnson, (1997). Statistical Concepts and Methods,
John Wiley and Sons, New York.

D Lane, (2003). Introduction to Statistics. Rice University.

Gravetter, F. J., and L. B. Wallnau, (2000). Statistics for the Behavioral Sciences, 5th edition, Wadsworth – Thomson Learning, Belmont.

McCabe, M, (2006). Introduction to the Practice of Statistics, 5th edition, Freeman.

Online Statistics Education: A Multimedia Course of Study (http://onlinestatbook.com/). Project Leader: David M. Lane, Rice University.

Watkins, J. C., (2019). An Introduction to the Science of Statistics: From Theory to Implementation.

Sundaram, K. R., S. N. Dwivedi, and V. Sreenivas, (2010). Statistics Principles and Methods, 1st edition, B.I Publications Pvt Ltd, New Delhi.

https://www.abs.gov.au/websitedbs/D3310114.nsf/Home/Statistical+Language+-+measures+of+central+tendency
https://statistics.laerd.com/statistical-guides/measures-central-tendency-mean-mode-median.php

Unit 4: Measures of Dispersion/Variability

3.1 Introduction
This unit presents measures of dispersion/variability. The measures of central tendency are not adequate on their own to describe data. Two data sets can have the same mean and yet be entirely different. Thus, to describe data, one also needs to know the extent of the dispersion. In statistics, dispersion (also called variability, scatter, or spread) is the extent to which a distribution is stretched or squeezed. Range, interquartile range, and standard deviation are the three commonly used measures of dispersion.

A measure of spread gives us an idea of how well the mean, for example, represents the data. If the spread of values in the data set is large, the mean is less representative of the data than when the spread is small. This is because a large spread indicates that there are probably large differences between individual scores.

The two plots below show the difference graphically for distributions with the same
mean but more and less dispersion. The panel on the left shows a distribution that is
tightly clustered around the average, while the distribution in the right panel is more
spread out.

3.2 Aim of the Unit
After studying this chapter, you should be able to:
 Define and find the Range and Interquartile Range
 Define and calculate the Standard Deviation and variance of Ungrouped and
Grouped Data
 Distinguish between Skewness and Kurtosis

3.3 Structure of the Unit


 Range, Interquartile Range
 Standard Deviation
 Variance of Discrete Data
 Variance and Standard Deviation for Grouped Data
 Skewness and Kurtosis

3.4 Range
The range is the difference between the largest and the smallest observation in the
data.

Range is a measure of variability or scatteredness of the variates or observations among themselves and does not give an idea about the spread of the observations around some central value.

Calculation of Range

Range (R) = Highest value (H) – Lowest value (L)

Example:
10, 60, 50, 30, 40, 25

Step 1: Arrange the values in ascending order.

10, 25, 30, 40, 50, 60

Step 2: Subtract the lowest value from the highest value.

Range = 60 - 10 = 50
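
The same calculation can be sketched in Python. The extra value of 500 added in the second step is hypothetical and is included only to show how strongly a single extreme value affects the range.

```python
# Data from the worked example above.
observations = [10, 60, 50, 30, 40, 25]

# Range = highest value - lowest value (sorting is not needed for max/min).
data_range = max(observations) - min(observations)

# Add one hypothetical extreme value to see its effect on the range.
observations_with_outlier = observations + [500]
range_with_outlier = max(observations_with_outlier) - min(observations_with_outlier)

print("Range:", data_range)
print("Range with outlier:", range_with_outlier)
```

One additional extreme observation changes the range from 50 to 490, even though the other six values are unchanged.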

Advantages
 It is easy to calculate and easy to understand; and
 It is useful in frequency distributions where only two extreme observations are
considered.

Disadvantages
 It is very sensitive to outliers and does not use all the observations in a data set;
 It is susceptible to considerable distortion if there is an unusual extreme value;
 It can be greatly influenced by one value which is very different from all of the others; and
 It ignores all but two of the values and thus is likely to provide an inadequate measure of the general dispersion of the values around the mean or median.

3.5 Interquartile Range (IQR)

Interquartile range is defined as the spread of the middle 50% of the elements, that is, the difference between the 75th and 25th percentiles. The IQR measures how spread out the middle half of the data points is around the median of the data set. The higher the IQR, the more spread out the data points; in contrast, the smaller the IQR, the more bunched up the data points are around the median.

The data can be divided into the:

 Bottom 25%;
 Middle 50%; and
 Top 25%.

The median can be taken as the 2nd quartile.

The inter-quartile range is the difference between the 3rd quartile and the 1st quartile, i.e. IQR = Q3 – Q1.
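
A minimal Python sketch of the IQR calculation is given below. The ages are hypothetical, and the sketch assumes the "inclusive" quartile convention of `statistics.quantiles`; several conventions exist for locating quartiles, so other software or hand methods may give slightly different values.

```python
from statistics import quantiles


def interquartile_range(values):
    """Return (Q1, Q3, IQR) using the 'inclusive' quartile convention."""
    q1, _q2, q3 = quantiles(values, n=4, method="inclusive")
    return q1, q3, q3 - q1


# Hypothetical ages (illustrative data only).
ages = [12, 15, 17, 20, 22, 25, 27, 30, 32]

# The same data with the largest value replaced by an extreme value.
ages_with_outlier = [12, 15, 17, 20, 22, 25, 27, 30, 95]

print(interquartile_range(ages))
print(interquartile_range(ages_with_outlier))
```

Both datasets give Q1 = 17, Q3 = 27 and IQR = 10, so the extreme value of 95 leaves the IQR unchanged although it increases the range from 20 to 83.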

Advantages
 It can be used as a measure of variability if the extreme values are not being
recorded exactly (as in case of open-ended class intervals in the frequency
distribution);
 It is not affected by extreme values; as a result, it is more likely to provide an accurate reflection of the spread or dispersion of the elements; and
 It is good for ordinal data.

Disadvantages
It ignores information from the top and the bottom 25% of elements. For example:

 We could have two sets of elements with the same inter-quartile range, but with more extreme values in one set than in the other; and

 The difference in spread or dispersion between the two sets of elements would not
be detected by the inter-quartile range.

3.6 Standard Deviation

Standard deviation (SD) is the most widely used measure of dispersion. It is a measure of the spread of data about the mean. The SD is the square root of the sum of squared deviations from the mean divided by the number of observations.

When the values in a dataset are grouped closer together, the standard deviation is smaller. On the other hand, when the values are more spread out, the standard deviation is larger because the typical distance of a value from the mean is greater.

Variance is the average squared difference of scores from the mean score of a distribution. Standard deviation is the square root of the variance.

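The following formulae define these measures:

Population: $\sigma^2 = \frac{\sum (X - \mu)^2}{N}$ and $\sigma = \sqrt{\sigma^2}$

Sample: $s^2 = \frac{\sum (X - \bar{x})^2}{n - 1}$ and $s = \sqrt{s^2}$

Here $\mu$ and $N$ are the population mean and size, and $\bar{x}$ and $n$ are the sample mean and size.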

Calculating Variance and Standard Deviation for Ungrouped Data

 Step 1: Work out the mean (manually and using Microsoft Excel).
$$M = \frac{\sum X}{N} = \frac{50 + 45 + 52 + 56 + 65}{5} = \frac{268}{5} = 53.6$$
Where: $\sum X$ = sum of the scores; and
N = number of respondents.
 Step 2: Subtract the mean from each score (X – M), as shown in column 4.
 Step 3: Square each of the differences in the 4th column, (X – M)².
 Step 4: Work out the variance, i.e. add up all the squared differences (X – M)² and divide the total by the number of respondents:
$$\sigma^2 = \frac{(50-53.6)^2 + (45-53.6)^2 + (52-53.6)^2 + (56-53.6)^2 + (65-53.6)^2}{5} = \frac{12.96 + 73.96 + 2.56 + 5.76 + 129.96}{5} = \frac{225.2}{5} = 45.04$$
 Step 5: The standard deviation is the square root of the variance:
$$\sigma = \sqrt{45.04} \approx 6.71$$
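
The five steps above can be checked with a short Python sketch using the same five scores. The variable names are chosen here for illustration; `pvariance` from the standard `statistics` module is used only as a cross-check of the manual calculation, which divides by the number of respondents (the population formula).

```python
from math import sqrt
from statistics import pvariance

scores = [50, 45, 52, 56, 65]
n = len(scores)

# Step 1: work out the mean.
mean_score = sum(scores) / n

# Steps 2 and 3: subtract the mean from each score and square the result.
squared_deviations = [(x - mean_score) ** 2 for x in scores]

# Step 4: the variance is the total of the squared deviations divided by n.
variance = sum(squared_deviations) / n

# Step 5: the standard deviation is the square root of the variance.
standard_deviation = sqrt(variance)

print(round(mean_score, 2), round(variance, 2), round(standard_deviation, 2))

# Cross-check against the standard library's population variance.
library_variance = pvariance(scores)
```

The sketch gives a mean of 53.6, a variance of 45.04 and a standard deviation of about 6.71. Dividing by n - 1 instead of n would give the sample variance, which is slightly larger.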

Calculating Variance and Standard Deviation for Grouped Data

$$\sigma^2 = \frac{\sum f_i (X_i - M)^2}{\sum f_i}$$
Where: $f_i$ = frequency of the ith class;
$X_i$ = class mid-point of the ith class; and
M = mean of the grouped data.
Mid-point = (Upper class boundary + Lower class boundary) / 2
Example:
For class boundary 4.5-9.5
Mid-point = (4.5 + 9.5)/2 = 7

$$\sigma^2 = \frac{4375}{100} = 43.75$$
$$\sigma = \sqrt{43.75} \approx 6.61$$
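
The grouped-data formula can be sketched in Python as follows. The class boundaries and frequencies below are hypothetical (they are not the data behind the result above) and are chosen only to show the method: find each class mid-point, then the mean, then the frequency-weighted average of the squared deviations.

```python
from math import sqrt

# Hypothetical grouped data: (lower boundary, upper boundary, frequency).
classes = [(4.5, 9.5, 2), (9.5, 14.5, 5), (14.5, 19.5, 3)]

# Mid-point of each class = (upper boundary + lower boundary) / 2.
midpoints = [(lower + upper) / 2 for lower, upper, _ in classes]
frequencies = [frequency for _, _, frequency in classes]
total_frequency = sum(frequencies)

# Mean of the grouped data, using mid-points weighted by frequency.
grouped_mean = sum(f * x for f, x in zip(frequencies, midpoints)) / total_frequency

# Variance = sum of f * (mid-point - mean)^2, divided by the sum of f.
grouped_variance = (
    sum(f * (x - grouped_mean) ** 2 for f, x in zip(frequencies, midpoints))
    / total_frequency
)
grouped_sd = sqrt(grouped_variance)

print(midpoints, grouped_mean, grouped_variance, grouped_sd)
```

For these hypothetical classes the mid-points are 7, 12 and 17, the mean is 12.5, the variance is 12.25 and the standard deviation is 3.5.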

Advantages
 It is more difficult to calculate than the range or interquartile range but generally
provides a more accurate measure of the spread of elements.
 It is useful in theoretical work and statistical methods and inference.
 It can show which scores are within one Standard Deviation (SD) of the mean.
 Using the SD, we have a “standard” way of knowing what is normal, and what is
extra large or extra small.
 The SD takes account of all the scores and provides a sensitive measure of
dispersion.
 It describes the spread of the scores in a normal distribution with great precision.

Disadvantages
 It is hard to calculate manually and much harder to work out than the other
measures of dispersion.

 Because variance relies on the squared differences of scores from the mean, a
single outlier has greater impact on the size of the variance than does a single
score near the mean.

3.7 Skewness
The term ‘skewness’ means a lack of symmetry in the distribution of a dataset about its mean.

 It is characterised by deviations from the mean being greater on one side than the other, i.e. the distribution has one tail longer or heavier than the other.
 In a skewed distribution, the curve is extended to either the left or the right side. When the tail of the plot extends more towards the right side, it denotes positive skewness; when the tail is stretched more towards the left, it is called negative skewness.

3.8 Kurtosis
Kurtosis is used to indicate the flatness or peakedness of the frequency distribution curve and measures the heaviness of the tails (the tendency to produce outliers) of the distribution.

 Positive kurtosis indicates that the distribution is more peaked than the normal distribution.
 Negative kurtosis indicates that the distribution is less peaked than the normal distribution.
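
Skewness and kurtosis can be computed from the data, as the Python sketch below shows. The module does not give formulas for these measures, so the sketch assumes the common moment-based definitions (third and fourth central moments divided by powers of the variance, with 3 subtracted from the kurtosis so that the normal distribution scores zero). The function names and datasets are hypothetical, and statistical packages that apply sample-size adjustments may report somewhat different values.

```python
def central_moment(values, k):
    """Return the k-th central moment: the average of (x - mean)^k."""
    mean_value = sum(values) / len(values)
    return sum((x - mean_value) ** k for x in values) / len(values)


def skewness(values):
    """Moment coefficient of skewness (0 for a symmetric data set)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Kurtosis relative to the normal distribution (normal = 0)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


symmetric = [1, 2, 3, 4, 5]            # symmetric about the mean
right_skewed = [1, 1, 2, 2, 3, 10]     # long tail to the right
left_skewed = [-10, -3, -2, -2, -1, -1]  # mirror image: long tail to the left

print(skewness(symmetric), skewness(right_skewed), skewness(left_skewed))
print(excess_kurtosis(symmetric))
```

The symmetric data has zero skewness, the data with a long right tail has positive skewness, and its mirror image has negative skewness. The negative excess kurtosis of the symmetric data indicates a flatter shape than the normal distribution.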

3.9 Exercise
Using Excel

 Calculate the Mean and Range from the data related to number of flowers per plant
for 15 plants of a Vernonia species.

16, 23, 5, 12, 17, 21, 11, 28, 10, 7, 13, 19, 14, 19, 22

 Using the following student age data find the inter-quartile range.

18, 19, 18, 25, 22, 20, 21, 45, 33, 20, 18, 18

 Describe the measures of dispersion learnt in this unit.

 Using the student age data find the variance and the standard deviation

18, 19, 18, 25, 22, 20, 21, 45, 33, 20, 18, 18

3.10 Conclusion
In this unit we have covered the measures of dispersion, which are the Range, the Interquartile Range, and the Standard Deviation for ungrouped and grouped data. The standard deviation and variance are the most commonly used measures of dispersion in the social sciences because both take into account the precise difference between each score and the mean. Consequently, these measures are based on a maximum amount of information. Advantages and disadvantages of each of the measures of dispersion were discussed. Skewness and Kurtosis were also explained.

3.11 References & Further Reading


Bhattacharyya, G. K., and R. A. Johnson, (1997). Statistical Concepts and Methods,
John Wiley and Sons, New York.

D Lane, (2003). Introduction to Statistics. Rice University.

Gravetter, F. J., and L. B. Wallnau, (2000). Statistics for the Behavioral Sciences, 5th edition, Wadsworth – Thomson Learning, Belmont.
McCabe, M, (2006). Introduction to the Practice of Statistics, 5th edition, Freeman.

https://www.analyticsvidhya.com/blog/2020/07/what-is-skewness-statistics/

Unit 4: Probability

4.0 Introduction
Many decisions are made under conditions of uncertainty. Probability theory provides
the foundation for quantifying and measuring uncertainty. It is used to estimate the
reliability in making inferences from samples to populations, as well as to quantify the
uncertainty of future events. It is therefore necessary to understand the basic concepts
and laws of probability to be able to manage uncertainty.

4.1 Aim of the unit


After studying this chapter, you should be able to:
 understand the importance of probability in statistical analysis
 define the different types of probability
 describe the properties and concepts of probabilities
 apply the rules of probability to empirical data
 construct and interpret probabilities from joint probability tables
 understand the concept of a probability distribution
 describe three common probability distributions
 calculate and interpret probabilities for each of these distributions
 understand the concept of hypothesis testing
 distinguish when to use the z-test statistic or the t-test statistic
 correctly interpret the results of a hypothesis test
4.2 Structure of the Unit
 Basic Concepts
 Basic Rules to Probability
 Discrete Distributions - Binomial Distribution, Poisson Distribution
 Continuous Distributions - Normal Distribution
 Hypotheses testing

4.3 Probability Concepts

Probability is the chance, or likelihood, that a particular event will occur.

There are five basic properties that apply to every probability:

 A probability value lies only between 0 and 1 inclusive (i.e. 0 ≤ P(A) ≤ 1).

 If an event A cannot occur (i.e. an impossible event), then P(A) = 0.

 If an event A is certain to occur (i.e. a certain event), then P(A) = 1.

 The sum of the probabilities of all possible events (i.e. the mutually exclusive and collectively exhaustive set of events) equals one, i.e. P(A1) + P(A2) + P(A3) + … + P(Ak) = 1, for k possible events.

 Complementary probability: If P(A) is the probability of event A occurring, then the probability of event A not occurring is defined as P(Ā) = 1 − P(A).

Basic Rules Pertaining to Probability

The following concepts are relevant when calculating probabilities associated with two or more events occurring:

Intersection of two events (A ∩ B)

The intersection of two events A and B is the set of all outcomes that belong to both A and B simultaneously. It is written as A ∩ B (i.e. A and B), and the key word is ‘and’.

Union of two events (A ∪ B)

The union of two events A and B is the set of all outcomes that belong to either event A or B or both. It is written as A ∪ B (i.e. either A or B or both) and the key word is ‘or’.

Mutually exclusive events

Events are mutually exclusive if they cannot occur together on a single trial of a random experiment (i.e. not at the same point in time).

Collectively exhaustive events

Events are collectively exhaustive when the union of all possible events is equal to the sample space.

Statistically independent events

Two events, A and B, are statistically independent if the occurrence of event A has no effect on the outcome of event B, and vice versa.

Probability Experiment
 Is a chance process that leads to well-defined results called outcomes.

 Tossing a coin, throwing a die or drawing a card from a deck are called experiments.

Outcome
 Result of a single trial of a probability experiment.

 One or more of the possible outcomes of doing something, e.g. tossing a coin.

 When a coin is tossed - two possible outcomes: Head or Tail.

 A head is the outcome.

 Tossing is the experiment.

Event
 An event is a collection of outcomes or a subset of the sample space.

 E.g., tossing a coin – two possible outcomes – head or tail.

 A head is an event.

4.4 Sample Space


 A Sample space is a set of all possible outcomes of a probability experiment.

 E.g., if you toss a coin – head or tail.

 Throwing a die – six possible outcomes – 1, 2, 3, 4, 5, 6.

 The elements of a sample space are called sample points.

4.4.1 Example 1: Tossing Two Coins

Find the sample space for tossing two coins.


 Looking at the distribution of heads and tails on the two coins, all the possible outcomes of this experiment form the following set:

 {(H, H), (H, T), (T, H), (T, T)}

 The first letter of each pair corresponds to the first coin, while the second letter corresponds to the second coin.

 In this case, every elementary result of the experiment can be regarded as an element of the set.

 This set is called the sample space.
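
The sample space for two coins can also be listed with a short Python sketch. The probability at the end assumes that both coins are fair, so that the four outcomes are equally likely; that assumption is added here for illustration.

```python
from itertools import product

# Each coin can land Head (H) or Tail (T); list every ordered pair of results.
sample_space = list(product("HT", repeat=2))

# Event of interest: at least one head appears.
at_least_one_head = [outcome for outcome in sample_space if "H" in outcome]

# Assuming fair coins, every outcome in the sample space is equally likely.
probability = len(at_least_one_head) / len(sample_space)

print(sample_space)
print(probability)
```

The sketch lists the four sample points (H, H), (H, T), (T, H) and (T, T). Three of them contain at least one head, giving a probability of 0.75 under the fair-coin assumption. Changing `repeat=2` to `repeat=3` lists the eight outcomes for three coins.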

4.4.2 Example 2: Throwing a die

4.5 Probability tree diagram


A probability tree diagram is a graphical way to apply probability rules where there are
multiple events that occur in sequence and these events can be represented by
branches (similar to a tree).

4.5.1 Exercise 1
Find the sample space for the sex of three children in a family.

Working

4.5.2 Exercise 2
The table below shows the percentage frequency table for the petrol brand most preferred by 50 motorists who live in Harare.

Table 1: Petrol brand preference

Petrol brand    Frequency    Percentage (%)
Total           13           26
Trek            9            18
Engen           6            12
Puma            22           44
Total           50           100

(a) What is the likelihood that a randomly selected motorist prefers Engen?

(b) What is the chance that a randomly selected motorist does not prefer Puma?

(c) What is the probability of finding a motorist who prefers either the Total, Trek, Engen or Puma brand of petrol?

4.6 Probability Distributions

All probability distributions can be classified as discrete probability distributions or as continuous probability distributions, depending on whether they define probabilities associated with discrete variables or continuous variables. This section will describe two discrete probability distributions, the Binomial and the Poisson distributions, and one continuous probability distribution, the Normal distribution.

4.6.1 Discrete Probability Distributions


Discrete probability distributions assume that the outcomes of a random variable under
study can take on only specific (usually integer) values.
4.6.1.1 Binomial Probability Distribution
A discrete random variable follows the Binomial distribution if it satisfies the following four conditions:
 The random variable is observed n times: "n" denotes the number of observations or the number of times the process is repeated, and "x" denotes the number of "successes" or events of interest occurring during the "n" observations;
 There are only two, mutually exclusive and collectively exhaustive, outcomes associated with the random variable on each object in the sample. These two outcomes are labelled success and failure (e.g. a product is defective or not defective; an employee is absent or not absent from work; a consumer prefers brand A or not brand A);
 Each outcome has an associated probability:

– The probability for the success outcome is denoted by p.

– The probability for the failure outcome is denoted by 1 − p.

 The objects are assumed to be independent of each other (i.e. the outcome on any object is not influenced by the outcome on any other object). This means that p is the same (constant) for each of the n objects.

The binomial equation also uses factorials.

The factorial of a non-negative integer k is denoted by k! and is the product of all positive integers less than or equal to k. For example,
 5! = 5 x 4 x 3 x 2 x 1 = 120,
 3! = 3 x 2 x 1 = 6,
 1! = 1.

By convention, 0! = 1.

The binomial distribution model is defined as:

$$P(x) = \frac{n!}{x!\,(n - x)!}\, p^{x} (1 - p)^{n - x}, \quad x = 0, 1, 2, \ldots, n$$

Use of the binomial distribution requires three assumptions:

 Each replication of the process results in one of two possible outcomes (success or
failure),
 The probability of success is the same for each replication, and
 The replications are independent, meaning here that a success in one
observation/trial does not influence the probability of success in another.

Mean of a Binomial Distribution

Mean = np

Variance of a Binomial Distribution

Variance = npq, where q = 1 − p is the probability of failure.
σ² = npq
σ = √(npq)
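
A minimal Python sketch of the binomial calculation is shown below. The function name `binomial_probability` and the values n = 5 and p = 0.2 are hypothetical and chosen only for illustration; `comb(n, x)` from the standard `math` module computes n! / (x!(n − x)!).

```python
from math import comb, sqrt


def binomial_probability(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


# Hypothetical example: 5 items are sampled and each is defective with p = 0.2.
n, p = 5, 0.2

prob_two_defective = binomial_probability(2, n, p)

# The probabilities over all possible values x = 0, 1, ..., n add up to one.
total_probability = sum(binomial_probability(x, n, p) for x in range(n + 1))

binomial_mean = n * p                   # mean = np
binomial_variance = n * p * (1 - p)     # variance = npq, with q = 1 - p
binomial_sd = sqrt(binomial_variance)

print(round(prob_two_defective, 4), round(total_probability, 4))
print(binomial_mean, round(binomial_variance, 4), round(binomial_sd, 4))
```

For these values the probability of exactly two defective items is 0.2048, the mean is 1 and the variance is 0.8.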

4.6.1.2 Poisson Probability Distribution


A Poisson process is also a discrete process. It measures the number of occurrences of a particular outcome of a discrete random variable in a predetermined interval of time, space or volume, for which the average number of occurrences of the outcome is known or can be determined.

The Poisson distribution is used for counts of events that happen at a constant average rate. It gives the probability of X events occurring in an interval of time T; that is, the random variable X is the number of occurrences of an event over some interval. The occurrences happen randomly and are independent of one another.

It is used to describe a number of processes, including the:

 Distribution of calls going through a switchboard,

 Demand of patients,

 Number of accidents at an intersection, etc.

These processes are described by variables which take discrete integer values only.

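The Poisson distribution is defined as:

$$P(X = x) = \frac{e^{-\lambda} \lambda^{x}}{x!}, \quad x = 0, 1, 2, \ldots$$

where λ is the average number of occurrences in the interval and e ≈ 2.71828.

Mean and Variance


Mean = Variance = λ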
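
The Poisson probability can be sketched in Python as follows. The function name `poisson_probability` and the switchboard rate of 2 calls per minute are hypothetical and used only for illustration.

```python
from math import exp, factorial


def poisson_probability(x, lam):
    """Probability of exactly x occurrences when the average is lam."""
    return exp(-lam) * lam ** x / factorial(x)


# Hypothetical example: a switchboard receives an average of 2 calls per minute.
average_calls = 2

prob_three_calls = poisson_probability(3, average_calls)
prob_no_calls = poisson_probability(0, average_calls)

print(round(prob_three_calls, 4))
print(round(prob_no_calls, 4))
```

With an average of 2 calls per minute, the probability of exactly 3 calls in a given minute is about 0.1804. The average must refer to the same interval as the question, so a rate quoted for a different interval has to be rescaled first.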

4.6.1.3 Exercise 1 (Discrete probability distributions)


If new cases of Cholera in Harare are occurring at a rate of about 20 per month, what is
the probability that 30 cases of cholera will occur in Harare in the next month?

There are 200 typographical errors randomly distributed in a 500-page manuscript. Find
the probability a given page contains exactly 3 errors.

It has been observed that at a particular section along 4th Ave, 12 fatal accidents occur
every 6 months. What is the probability of 3 accidents happening in the next 6 months?

4.6.2 Continuous probability distributions

If a random variable is a continuous variable, its probability distribution is called a continuous probability distribution.

A continuous probability distribution differs from a discrete probability distribution in several ways.

 The probability that a continuous random variable will assume a particular value is zero. As a result, a continuous probability distribution cannot be expressed in tabular form. Instead, an equation or formula is used to describe a continuous probability distribution.

 The equation used to describe a continuous probability distribution is called a probability density function. Sometimes, it is referred to as a density function, a PDF, or a pdf.

The following are continuous probability distributions: the Normal probability distribution, Student's t distribution, the Chi-square distribution and the F distribution. We will discuss only the Normal distribution in this module.

4.6.2.1 Normal Probability Distribution

The Normal probability distribution has the following properties:

 The curve is bell-shaped.
 It is symmetrical about a central mean value, μ.
 The tails of the curve never touch the x-axis, meaning that there is always a non-zero probability density associated with every value in the problem domain (i.e. the curve is asymptotic).
 The distribution is always described by two parameters: a mean (μ) and a standard deviation (σ).
 The total area under the curve will always equal one, since it represents the total sample space. Because of symmetry, the area under the curve below μ is 0.5, and above μ is also 0.5.
 The probability associated with a particular interval of x-values is defined by the area under the normal distribution curve between the limits of x1 and x2.
 The mean, mode and median are all equal, i.e. Mean = Median = Mode.

The Standard Normal Distribution (Z)

All normal distributions can be converted into the standard normal curve by subtracting the mean and dividing by the standard deviation:

$$Z = \frac{X - \mu}{\sigma}$$

Example
What is the probability of getting a score of 75 or less in an in-class test, if μ = 45 and σ = 20?

$$Z = \frac{75 - 45}{20} = 1.5$$

i.e. a score of 75 is 1.5 standard deviations above the mean. Looking up Z = 1.5 in the standard normal table (Statistical Tables) gives P(Z ≤ 1.5) ≈ 0.9332.
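
The table lookup in this example can be reproduced with a short Python sketch using `NormalDist` from the standard `statistics` module. The variable names are chosen here for illustration.

```python
from statistics import NormalDist

mu, sigma = 45, 20    # mean and standard deviation of the test scores
score = 75

# Convert the score to a Z value: subtract the mean, divide by the SD.
z = (score - mu) / sigma

# Area under the standard normal curve to the left of z.
prob_from_z = NormalDist().cdf(z)

# The same probability computed directly from the original distribution.
prob_direct = NormalDist(mu, sigma).cdf(score)

print(z, round(prob_from_z, 4), round(prob_direct, 4))
```

Both approaches give a probability of about 0.9332 of scoring 75 or less. The probability of scoring more than 75 is the complement, 1 − 0.9332 = 0.0668.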

4.6.2.2 Exercise (Normal Probability Distribution)
If birth weights in a population are normally distributed with a mean of 109 oz and a
standard deviation of 13 oz,

a) What is the chance of obtaining a birth weight of 141 oz or heavier when sampling birth records at random?
b) What is the chance of obtaining a birth weight of 120 oz or lighter?

4.7 Hypothesis testing


Hypothesis testing is a formal procedure for investigating our ideas about the world using statistics. Once a researcher has formulated a hypothesis and accumulated data, the next step is to analyze the data and then decide whether or not to reject the hypothesis. The goal of hypothesis testing is to determine whether the sample data provide sufficient evidence against a claim about a population parameter, such as the mean.

4.7.1 What is a Hypothesis?


Ordinarily, when one talks about a hypothesis, one simply means a mere assumption or some supposition to be proved or disproved.

But for researchers, a research hypothesis is a predictive statement, capable of being tested by scientific methods, that relates an independent variable to some dependent variable.

Hypotheses are capable of being objectively verified and tested.

4.7.2 Characteristics of hypothesis


 A hypothesis should be clear and precise. If the hypothesis is not clear and precise, the inferences drawn on its basis cannot be taken as reliable;
 A hypothesis should be capable of being tested. A hypothesis “is testable if other deductions can be made from it which, in turn, can be confirmed or disproved by observation”;
 A hypothesis should state the relationship between variables, if it happens to be a relational hypothesis;
 A hypothesis should be limited in scope and must be specific;
 A hypothesis should be stated as far as possible in the simplest terms so that it is easily understandable by all concerned;
 A hypothesis should be consistent with most known facts, i.e. it must be consistent with a substantial body of established facts. In other words, it should be one which judges accept as being the most likely;
 A hypothesis should be amenable to testing within a reasonable time. One should not use even an excellent hypothesis if it cannot be tested in reasonable time; and
 A hypothesis must explain the facts that gave rise to the need for explanation. It must actually explain what it claims to explain, and it should have empirical reference.

4.7.3 Concepts of Hypothesis Testing
Any statistical test revolves around the choice between two hypotheses. These are labelled H0 and H1:

• Null hypothesis (H0) – A maintained hypothesis that is held to be true unless there is strong evidence against it. It states the exact opposite of what an investigator or experimenter predicts or expects. The null hypothesis is the hypothesis we are trying to reject.

• Alternative hypothesis (H1) – A statement of the result or outcome that the investigator or researcher expects. It is usually the one which one wishes to prove.

A Type I error means rejecting a null hypothesis that is true (i.e. rejecting a hypothesis which should have been accepted);

A Type II error means failing to reject a null hypothesis that is false (i.e. accepting a hypothesis which should have been rejected).

The probability of rejecting a true null hypothesis is denoted by α and is chosen to be ‘small’; it is called the significance level.

The probability of failing to reject the null hypothesis when it is true is (1 – α).

The probability of making a Type II error, i.e. failing to reject the null hypothesis when it is false, is denoted by β.

The probability of rejecting a false null hypothesis is (1 – β); it is called the power of the test.

Level of Significance

This is a very important concept in the context of hypothesis testing. It is always some
percentage (usually 5%) which should be chosen with great care, thought and reason.

In case we take the significance level at 5 per cent, then this implies that the researcher
is willing to take as much as a 5 per cent risk of rejecting the null hypothesis when it
(H0) happens to be true. Thus, the significance level is the maximum value of the
probability of rejecting H0 when it is true.

It is usually determined in advance before testing the hypothesis.

Steps in Hypothesis Testing

1. Identify which test is appropriate.

2. Choose a level of significance.

3. Formulate H0 and H1.

4. Determine whether the test is one-tailed or two-tailed. If it is one-tailed, determine
whether it is an upper- or lower-tail test.

5. Calculate the test statistic.

6. Based on the chosen level for α, compare the test statistic with a value from a
table (the critical value).

7. Conclude by either rejecting or not rejecting H0.

Test for Mean

A researcher takes a sample and has a value in mind for the population mean μ. Then
the question is, does the sample mean (X) contradict this value significantly?

Two versions:

 z test – used when σ is known; and

 t test – used when σ is unknown.

Z Test

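A z-test is a statistical test whose test statistic follows the standard normal
distribution when the null hypothesis is true. Here it is used to test whether a sample
mean differs significantly from a hypothesised population mean when σ is known. The
same approach is also used to test whether proportions from two independent samples
differ.

The z value measures how far the sample mean lies from the hypothesised mean, in units
of the standard error.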

Assumptions of Z-test:

• All sample observations are independent.

• Sample size should be more than 30.

• Distribution of Z is normal, with a mean zero and variance 1.

The test statistic is:

$$ z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} $$

where:

• x̄ is the sample mean

• σ is population standard deviation

• n is sample size

• μ is the population mean

Steps for Solving Hypothesis Testing Problem

1. State the hypotheses and identify the claim.

2. Find the critical value(s).

3. Compute the test value.

4. Make the decision to reject or not reject the null hypothesis.

5. Summarise the results.

Example

In the town of Mutare, the average IQ score is 101.5. The variable is normally distributed
and the population SD is 15. A regional education officer claims that the students in her
school district have an IQ higher than the average of 101.5. She selects a random
sample of 30 students and finds the mean of the test scores is 106.4.

Test the claim at α = 0.05.

Step 1: State the hypotheses and identify the claim.

H0: μ = 101.5 and H1: μ > 101.5 (claim)

Step 2: Find the critical value.

Since α = 0.05 and the test is a right-tailed test, the critical value is z = + 1.65.

Step 3: Compute the test value.

$$ z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} = \frac{106.4 - 101.5}{15 / \sqrt{30}} = 1.79 $$

Step 4: Make the decision to reject or not reject the null hypothesis.

Since the test value 1.79 > the critical value 1.65, the decision is to reject the null
hypothesis.

Step 5: Summarise the results.

There is enough evidence to support the claim that the IQ of the students is higher than
the town average IQ.

Comment:

The difference is said to be statistically significant. However, when the null hypothesis is
rejected, there is always a chance of a type I error. In this case, the probability of a type
I error is at most 0.05, or 5%.

Example 2

An engineer measured the Brinell hardness of 40 pieces of ductile iron that were
subcritically annealed. The engineer hypothesized that the mean Brinell hardness of all
such ductile iron pieces is greater than 170. The average Brinell hardness of the 40
pieces of ductile iron was 172.52 with a standard deviation of 10.31. The engineer set
his significance level α at 0.05. Test the hypothesis.

Step 1: State the hypotheses and identify the claim.

H0: μ = 170 and H1: μ > 170 (claim)

Step 2: Find the critical value.

Since α = 0.05 and the test is a right-tailed test, the critical value is z = + 1.65.

Step 3: Compute the test value.

$$ z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} = \frac{172.52 - 170}{10.31 / \sqrt{40}} = 1.55 $$

Step 4: Make the decision to reject or not reject the null hypothesis.

Since the test value 1.55 < the critical value 1.65, the decision is not to reject the null
hypothesis.

Step 5: Summarise the results.

There is not enough evidence to support the claim that the mean Brinell hardness of the
pieces of ductile iron is greater than 170.
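
The two worked examples can be checked with a short script. This is a minimal sketch using only the Python standard library; the function name is illustrative. It uses the exact normal critical value (about 1.645) rather than the rounded table value 1.65, which does not change either decision.

```python
from math import sqrt
from statistics import NormalDist

def z_test_right_tailed(sample_mean, mu0, sigma, n, alpha=0.05):
    """One-sample, right-tailed z test.

    Returns (z, critical value, p-value, whether H0 is rejected).
    """
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    z_crit = NormalDist().inv_cdf(1 - alpha)   # about 1.645 for alpha = 0.05
    p_value = 1 - NormalDist().cdf(z)          # area to the right of z
    return z, z_crit, p_value, z > z_crit

# Example 1: IQ scores in Mutare
z1, crit1, p1, reject1 = z_test_right_tailed(106.4, 101.5, 15, 30)
# Example 2: Brinell hardness (the sample SD is used in place of sigma, n = 40)
z2, crit2, p2, reject2 = z_test_right_tailed(172.52, 170, 10.31, 40)

print(f"Example 1: z = {z1:.2f}, p = {p1:.4f}, reject H0: {reject1}")
print(f"Example 2: z = {z2:.2f}, p = {p2:.4f}, reject H0: {reject2}")
```

The p-value gives the same decision as the critical-value comparison: H0 is rejected when the p-value is smaller than α.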

T-Test for Mean
A t-test is a hypothesis test about means. The one-sample version used here tests whether
a sample mean differs significantly from a hypothesised population mean; a related version
compares the means of two independent samples. The test statistic follows a t-distribution,
which is appropriate when the sample size is small and the population standard deviation
is not known.

The shape of a t-distribution depends on the degrees of freedom.

Note: The degrees of freedom are the number of independent observations in a given set of
observations. For the one-sample t-test, df = n − 1.

Assumptions of T-test:

 All data points are independent.


 The sample size is small. Generally, a sample of more than 30 units is regarded as
large and anything less as small, but the sample should not be smaller than 5 for the
t-test to be applied.

The test statistic is:

$$ t = \frac{\bar{x} - \mu}{s / \sqrt{n}} $$

with n − 1 degrees of freedom, where:

• x̄ is the sample mean
• s is sample standard deviation
• n is sample size
• μ is the population mean

The critical values for the test are given in Table 4. at the end of this module.

For a one-tailed test, find the α level looking at the top row of the table and finding the
appropriate column. Find the degrees of freedom by looking down the left-hand column.

N.B. The degrees of freedom are given for values from 1-30, then at intervals above 30.

Example 1: Right-Tailed Test

Find the critical t value for α = 0.05 with df. = 28 for a right-tailed t test.

Answer

Find the 0.05 column in the top row labelled One tail and 28 in the left-hand column.

The critical value is found where the row and column meet.

Thus, the critical value is + 1.701.

Example 2: Left-Tailed Test

Find the critical t value for α = 0.01 with df. = 22 for a left-tailed t test.

Answer

Find the 0.01 column in the top row labelled One tail and 22 in the left-hand column.

The critical value is - 2.508.

Example 3: Two-tailed Test

Find the critical t value for α = 0.10 with df. = 18 for a two-tailed t test.

Answer

Find the 0.10 column in the row labelled Two tails and 18 in the column labelled df.

The critical values are +1.734 and -1.734.
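
Where a printed table is not at hand, the critical values can be approximated numerically. The sketch below is one possible approach using only the Python standard library: it integrates the t density with Simpson's rule and finds the critical value by bisection. The function names are illustrative and are not part of the module.

```python
from math import gamma, pi, sqrt

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """P(T <= x), using Simpson's rule on [0, |x|] and the symmetry of the curve."""
    a = abs(x)
    h = a / steps                            # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3                     # area between 0 and |x|
    return 0.5 + area if x >= 0 else 0.5 - area

def t_critical(alpha, df, tails=1):
    """Positive critical value: upper-tail area alpha (one tail) or alpha / 2 (two tails)."""
    upper_area = alpha / tails
    lo, hi = 0.0, 100.0
    for _ in range(60):                      # bisection on the upper-tail area
        mid = (lo + hi) / 2
        if 1 - t_cdf(mid, df) > upper_area:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(round(t_critical(0.05, 28), 3))            # right-tailed, df = 28
print(-round(t_critical(0.01, 22), 3))           # left-tailed, df = 22 (negative sign)
print(round(t_critical(0.10, 18, tails=2), 3))   # two-tailed, df = 18 (use + and -)
```

The printed values should agree with the table values quoted in the three examples to about three decimal places.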

Example 4
The average starting annual salary for a teacher is $79,500. A researcher does not
agree and wishes to test the claim that the starting salary is less than $79,500.

A random sample of 8 teachers is selected and their salaries are shown below.

82,000    68,000    70,200    75,200
83,500    64,300    78,600    79,000

Is there enough evidence to support the researcher’s claim at α = 0.10 and degrees of
freedom = 7? Assume that the variables are normally distributed.

Step 1: State the hypotheses and identify the claim.

H0: μ = $79,500 and H1: μ < $79,500 (claim)

Step 2: Find the critical value.

Since α = 0.10, df. = 7 and the test is left-tailed, the critical value is - 1.415.

Step 3: Compute the test value. First, the mean and standard deviation must be found.

M = 75,100

$$ s = \sqrt{\frac{\sum (X_i - M)^2}{n}} = 6487.49 $$
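
The worked example stops after the standard deviation. The sketch below continues the calculation under stated assumptions: it takes the critical value −1.415 from the t table as given, and it computes the test value twice, once with the divisor n (which reproduces the 6,487.49 quoted in the text) and once with the divisor n − 1 that is usually used for a sample standard deviation. It is an illustration, not the module's own solution.

```python
from math import sqrt
from statistics import mean, pstdev, stdev

salaries = [82000, 68000, 70200, 75200, 83500, 64300, 78600, 79000]
mu0 = 79500        # hypothesised mean under H0
t_crit = -1.415    # left-tailed critical value for alpha = 0.10, df = 7 (t table)

n = len(salaries)
m = mean(salaries)

def t_value(s):
    """Test value t = (sample mean - mu0) / (s / sqrt(n))."""
    return (m - mu0) / (s / sqrt(n))

s_n = pstdev(salaries)    # divisor n, matches the value quoted in the text
s_n1 = stdev(salaries)    # divisor n - 1, the usual sample standard deviation

print(f"mean = {m}")
for label, s in (("divisor n", s_n), ("divisor n - 1", s_n1)):
    t = t_value(s)
    print(f"{label}: s = {s:.2f}, t = {t:.3f}, reject H0: {t < t_crit}")
```

With either divisor the test value falls below the critical value, so the decision in Step 4 would be to reject the null hypothesis, and the summary in Step 5 would be that there is enough evidence to support the claim that the starting salary is less than $79,500.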

4.8 Exercise
Q1) A sample of 400 male students is found to have a mean height 67.47 inches. Can it
be reasonably regarded as a sample from a large population with mean height 67.39
inches and standard deviation 1.30 inches? Test at 5% level of significance.

Q2) The mean of a certain production process is known to be 50 with a standard
deviation of 2.5. The production manager may welcome any change in the mean value
towards the higher side but would like to safeguard against decreasing values of the mean. He
takes a sample of 12 items that gives a mean value of 48.5. What inference should the
manager draw about the production process on the basis of the sample results? Use a 5 per cent
level of significance for the purpose.

Q3) Specimens of copper wire drawn from a large lot have the following breaking
strengths (in kg. weight):

578, 572, 570, 568, 572, 578, 570, 572, 596, 544

Test (using Student’s t-statistic) whether the mean breaking strength of the lot may be
taken to be 578 kg. weight (Test at 5 per cent level of significance).

Q4) A Restaurant has been having average sales of 500 tea cups per day. Because of
the development of bus stand nearby, it expects to increase its sales. During the first 12
days after the start of the bus stand, the daily sales were as under:

550, 570, 490, 615, 505, 580, 570, 460, 600, 580, 530, 526

On the basis of this sample information, can one conclude that the Restaurant’s sales
have increased?

Use 5 per cent level of significance.

4.9 Conclusion
This chapter introduced the concept of probabilities as the foundation for inferential
statistics. The term ‘probability’ is a measure of the uncertainty associated with the
outcome of a specific event, and the properties of probabilities were defined. Also
examined were the concepts of probabilities, such as the union and intersection of
events, mutually exclusive events, collectively exhaustive sets of events and statistically
independent events. These concepts describe the nature of events for which
probabilities are calculated.

4.10 References & Further Reading


Bhattacharyya, G. K., and R. A. Johnson, (1997). Statistical Concepts and Methods,
John Wiley and Sons, New York.

D Lane, (2003). Introduction to Statistics. Rice University.

Gravetter, F. J., and L. B. Wallnau, (2000). Statistics for the Behavioral Sciences, 5th
edition, Wadsworth – Thomson Learning, Belmont.

McCabe, M, (2006). Introduction to the Practice of Statistics, 5th edition, Freeman.

Unit 5: Bivariate Analysis

5.0 Introduction
This unit presents bivariate analysis from cross tabulation, chi-square test and
correlation.

5.1 Aim of the Unit


After studying this chapter, you should be able to:
 understand the crosstabulation table
 understand the concept and rationale of the chi-squared statistic
 perform the Chi-square Test
 interpret the results of the various chi-squared tests.
 compute the Correlation Coefficient

Structure of the Unit
 Cross tabulation, Chi-square Test
 Strength of the Relationship
 Computing the Correlation Coefficient

5.2 Cross-tabulation Table


A cross-tabulation table (also called a contingency table) summarizes the joint
responses of two categorical variables. The table shows the number (and/or
percentage) of observations that jointly belong to each combination of categories of the
two categorical variables. This summary table is used to examine the association
between two categorical measures.

            Success    Failure    Total

Group 1     A          B          A+B

Group 2     C          D          C+D

Total       A+C        B+D        A+B+C+D
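
A table of this form can be built from raw records by counting. The sketch below is a minimal illustration using Python's standard library; the records are hypothetical and were invented only to show the mechanics.

```python
from collections import Counter

# Hypothetical raw records: (group, outcome) for each observation
records = [
    ("Group 1", "Success"), ("Group 1", "Failure"), ("Group 1", "Success"),
    ("Group 2", "Failure"), ("Group 2", "Success"), ("Group 2", "Failure"),
    ("Group 2", "Failure"), ("Group 1", "Success"),
]

cell_counts = Counter(records)                            # joint counts A, B, C, D
row_totals = Counter(group for group, _ in records)       # A+B and C+D
col_totals = Counter(outcome for _, outcome in records)   # A+C and B+D

groups = ["Group 1", "Group 2"]
outcomes = ["Success", "Failure"]

# Print the table with row and column totals
print(f"{'':10}" + "".join(f"{o:>10}" for o in outcomes) + f"{'Total':>10}")
for g in groups:
    cells = "".join(f"{cell_counts[(g, o)]:>10}" for o in outcomes)
    print(f"{g:10}{cells}{row_totals[g]:>10}")
print(f"{'Total':10}" + "".join(f"{col_totals[o]:>10}" for o in outcomes)
      + f"{len(records):>10}")
```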

The Importance of Cross Tabulation

 Clean, usable data: cross tabulation makes data simple to interpret. The clarity it
offers helps deliver clean data that can be used to improve decisions throughout an
organization.

 Easy to understand: no advanced statistical degree is needed to interpret a cross
tabulation. The results are easy to read and explain, which makes it useful in
any type of presentation.

5.3 The Chi-Squared Test for Independence of Association


The Chi-Square test of independence is used to determine if there is a significant
relationship between two nominal (categorical) variables.  The frequency of each
category for one nominal variable is compared across the categories of the second
nominal variable. 

The data can be displayed in a contingency table where each row represents a category
for one variable and each column represents a category for the other variable.   For
example, say a researcher wants to examine the relationship between gender (male vs.
female) and empathy (high vs. low).  The chi-square test of independence can be used
to examine this relationship.  The null hypothesis for this test is that there is no
relationship between gender and empathy.  The alternative hypothesis is that there is a
relationship between gender and empathy (e.g. there are more high-empathy females
than high-empathy males).

In many situations, the chi-squared statistic is used to test for independence of
association. This test establishes whether two categorical random variables are
statistically related (i.e. dependent or independent of each other). Statistical
independence means that the outcome of one random variable in no way influences (or
is influenced by) the outcome of a second random variable.

Once we have gathered our data, we summarize the data in the two-way contingency
table.

Chi-Square Test Statistic

$$ \chi^2 = \sum_{i=1}^{rc} \frac{(O_i - E_i)^2}{E_i} $$

where O_i is the observed value (data) in cell i, E_i is the expected value (from theory)
in cell i, and rc is the number of cells in the table.
The test statistic follows a Chi-Square distribution with degrees of freedom equal to
(r-1)(c-1), where r is the number of rows and c is the number of columns.

There are five steps to conduct this test.
Step 1: Formulate the hypotheses
Null Hypothesis: H0: There is no significant association between students’ educational
level and their preference for online or face-to-face instruction.
Alternative Hypothesis: Ha: There is a significant association between students’
educational level and their preference for online or face-to-face instruction.

Step 2: Specify the expected values for each cell of the table (when the null
hypothesis is true)
The expected values specify what the values of each cell of the table would be if there
was no association between the two variables. The formula for computing the expected
values requires the sample size, the row totals, and the column totals.
$$ E = \frac{(\text{row total})(\text{column total})}{\text{total sample size}} $$

Step 3: To see if the data give convincing evidence against the null hypothesis,
compare the observed counts from the sample with the expected counts,
assuming H0 is true.
The observed values are the actual counts computed from the sample.
Step 4: Compute the test statistic
The chi-square statistic compares the observed values to the expected values. This test
statistic is used to determine whether the difference between the observed and
expected values is statistically significant.
$$ \chi^2 = \sum_{i=1}^{rc} \frac{(O_i - E_i)^2}{E_i} $$

Step 5: Decide if Chi-square is statistically significant


The final step of the chi-square test of significance is to determine if the value of the chi-
square test statistic is large enough to reject the null hypothesis.
The value calculated from the formula above is compared with values in the chi-square
distribution table.

Example
A company wanted to know if providing a vaccine against pneumonia to its employees made
a difference. To answer this
question, they must choose a statistic that can test for differences when all the variables
are nominal. The χ2 statistic was used to test the question, “Was there a difference in
incidence of pneumonia between the two groups?” At the end of the winter, the table
below was constructed to illustrate the occurrence of pneumonia among the employees.

Results of the vaccination program.

Health Outcome                            Unvaccinated    Vaccinated

Sick with pneumococcal pneumonia          23              5
Sick with non-pneumococcal pneumonia      8               10
No pneumonia                              61              77

Solution
State the null and alternative hypothesis
H0: There is no difference in occurrence of pneumococcal pneumonia between the
vaccinated and unvaccinated groups
H1: There is a difference in occurrence of pneumococcal pneumonia between the
vaccinated and unvaccinated groups
Calculate the sum of each row, and the sum of each column.

Health Outcome                            Not vaccinated   Vaccinated   Row marginals
                                          (Col 1)          (Col 2)      (Row sum)

Sick with pneumococcal pneumonia          23               5            28
Sick with non-pneumococcal pneumonia      8                10           18
Stayed healthy                            61               77           138

Column marginals (sum of the column)      92               92           N = 184

Calculate the expected values for each cell.

Each expected value is (row total × column total) / N. For example, the expected value for
the first cell is 28 × 92 / 184 = 14.

Cell expected values

Health outcome                            Not vaccinated   Vaccinated

Sick with pneumococcal pneumonia          14.00            14.00
Sick with non-pneumococcal pneumonia      9.00             9.00
Stayed healthy                            69.00            69.00

Calculate the χ2 value for each cell as (observed − expected)2 / expected. For example, the
cell χ2 for the first cell in the case study data is calculated as follows:
(23 − 14)2 / 14 = 5.79

Chi-square values

Health outcome                            Not vaccinated   Vaccinated

Sick with pneumococcal pneumonia          (5.79)           (5.79)
Sick with non-pneumococcal pneumonia      (0.11)           (0.11)
Stayed healthy                            (0.93)           (0.93)

Once the cell χ2 values have been calculated, they are summed to obtain the χ2 statistic
for the table. In this case, the χ2 is 13.65 (rounded). The Chi-square table requires the
table’s degrees of freedom (df) in order to determine the significance level of the
statistic. The degrees of freedom for a χ2 table are calculated with the formula:
(Number of rows − 1) × (Number of columns − 1) = (3 − 1) × (2 − 1) = 2
Using a χ2 table, the significance of a Chi-square value of 13.65 with 2 df equals P <
0.005.
Conclusion
The researcher rejects the null hypothesis and accepts the alternate hypothesis: “There
is a difference in occurrence of pneumococcal pneumonia between the vaccinated and
unvaccinated groups.”
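
The calculation can be reproduced directly from the observed counts. The sketch below is a minimal illustration using only the Python standard library. The p-value line relies on the fact that, for 2 degrees of freedom only, the upper-tail probability of the chi-square distribution is exp(−χ²/2); for other degrees of freedom a table or a statistics package would be needed.

```python
from math import exp

observed = [
    [23, 5],     # sick with pneumococcal pneumonia
    [8, 10],     # sick with non-pneumococcal pneumonia
    [61, 77],    # stayed healthy
]                # columns: not vaccinated, vaccinated

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count for each cell: (row total)(column total) / N
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Chi-square statistic: sum over all cells of (O - E)^2 / E
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

df = (len(observed) - 1) * (len(observed[0]) - 1)

# Closed form for the upper-tail probability, valid for df = 2 only
p_value = exp(-chi2 / 2) if df == 2 else None

print(f"expected = {expected}")
print(f"chi-square = {chi2:.2f}, df = {df}, p = {p_value:.4f}")
```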
5.4 Correlation
A variable is an attribute which assumes different values, e.g. age. It is very important to
know if there is a relationship between variables under study. Is there any relationship
between variables X and Y? If there is a relationship:
 How are they related – linear or non linear?
 How strong is the relationship?
 Is the relationship causal – does X cause Y or does Y cause X?
Correlation deals with relationship. When the fluctuation of one variable reliably predicts
a similar fluctuation in another variable, there’s often a tendency to think that means that
the change in one causes the change in the other.

However, correlation does not imply causation. There may be an unknown factor that
influences both variables similarly.
Definition
Correlation is a statistical technique that can show whether and how strongly pairs of
variables are related.
A researcher collects data on certain variables and would want to establish if there is a
relationship between two variables. The two variables under study are called the
independent variable (x) and the dependent variable (y).
Independent variable is the variable in regression that can be controlled and
manipulated.
Dependent variable is the variable in regression that cannot be controlled or manipulated.
Plot a graph called a scatter plot and you can come up with any type of the following
scatter plots:
 Positive relationship – As the independent variable X increases, the dependent
variable Y also increases.
 Negative Relationship - As the independent variable X increases, the dependent
variable Y decreases.
 Curvilinear Relationship - As X increases, Y also increases, but only up to a
certain point, after which, as X continues to increase, Y decreases. The graph
could be an inverted-U or a U-shaped curve. 
 No relationship.

 Strong positive correlation between x and y- the points lie close to a straight line with
y increasing as x increases.
 Weak, positive correlation between x and y- the trend is that y increases as x
increases but the points are not close to a straight line.
 No correlation between x and y- the points are distributed randomly on the graph.
 Weak, negative correlation between x and y- the trend is that y decreases as x
increases but the points do not lie close to a straight line.
 Strong, negative correlation -the points lie close to a straight line, with y decreasing
as x increases.
Example 1
A researcher wishes to establish a relationship between moisture increase in an
environment and the growth of mould spores. She/he must select a sample. The
data is collected and is presented below.
Independent variable – moisture increase.
Dependent variable – growth of mould spores.

Step 1 - Construct a scatter plot from the data collected.
Step 2 – Determine the type of relationship:
There is a linear relationship and a positive one – positive linear relationship.
As the moisture increases, the mould spores content also increases.

Examples of Negative Correlation
As weather gets colder, air conditioning costs decrease.
If a train increases speed, the length of time to get to the final point decreases.
A student who has many absences has a decrease in grades

Correlation Coefficient
A measure used to determine the strength of the linear relationship between two
variables.
The symbol for the sample correlation coefficient is (r).

Pearson Product Moment Correlation


The linear correlation coefficient ranges from -1 to +1.

 If there is a strong positive linear relationship between the variables, the value
of r will be close to +1.
 If there is a strong negative linear relationship between the variables, the
value of r will be close to -1.
 When there is no relationship between the variables, or only a weak relationship,
the value of r will be close to 0.
 When the value of r is 0 or close to 0, it implies only that there is no linear
relationship between the variables.
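
A commonly used computational form of the Pearson product moment correlation coefficient is shown below. It is given here as the standard textbook formula, on the assumption that it is the one intended by the module:

$$ r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}} $$

The sketch below applies this formula in Python. The data values are hypothetical and serve only to illustrate the calculation.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient from the computational formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
print(f"r = {r:.3f}")
```

Points that lie exactly on a rising straight line give r = +1, and points that lie exactly on a falling straight line give r = −1.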

Finding the Value of the Linear Correlation Coefficient

Exercise

Substitute in the formula and find the value of r = 0.982

Conclusion: The correlation coefficient suggests a strong positive linear relationship
between the number of cars a rental agency has and its annual revenue. The more cars
a rental agency has, the more annual revenue the company tends to have.
Coefficient of Determination (r² or R²)
The coefficient of determination is the ratio of the explained variation to the total
variation.
It is useful because it gives the proportion of the variance (fluctuation) of one
variable that is predictable from the other variable. It is a measure that allows us to
determine how certain one can be in making predictions from a certain model/graph.
The coefficient of determination is such that 0 ≤ r² ≤ 1, and denotes the strength of
the linear association between x and y.
The coefficient of determination represents the proportion of the variation in y that is
accounted for by the line of best fit. For example, if r = 0.922, then r² = 0.850, which
means that 85% of the total variation in y can be explained by the linear relationship
between x and y (as described by the regression equation). The other 15% of the total
variation in y remains unexplained.
The coefficient of determination is a measure of how well the regression line
represents the data.  If the regression line passes exactly through every point on the
scatter plot, it would be able to explain all of the variation. The further the line is
away from the points, the less it is able to explain.
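
The ratio of explained to total variation can be computed directly. The sketch below fits a least-squares line to hypothetical data and splits the total variation in y into an explained part and an unexplained part. All data values and variable names are illustrative.

```python
# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares slope and intercept of the line of best fit
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
intercept = mean_y - slope * mean_x
predicted = [intercept + slope * xi for xi in x]

total_variation = sum((yi - mean_y) ** 2 for yi in y)
explained_variation = sum((pi - mean_y) ** 2 for pi in predicted)
unexplained_variation = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))

r_squared = explained_variation / total_variation
print(f"r squared = {r_squared:.2f}")
print(f"explained + unexplained = {explained_variation + unexplained_variation:.2f}, "
      f"total = {total_variation:.2f}")
```

For a least-squares line the explained and unexplained variation add up to the total variation, which is why r² can be read as a proportion.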

5.5 Conclusion
This unit introduced cross tabulation and the chi-squared test statistic. The Chi-
square is a valuable analysis tool that provides considerable information about the
nature of research data. It is a powerful statistic that enables researchers to test
hypotheses about variables measured at the nominal level. Correlation, r, and the
coefficient of determination, r2 were discussed. Correlation analysis identifies the
strength of the relationships and determines which variables are useful in predicting
the response variable.

5.6 References & Further Reading


Bhattacharyya, G. K., and R. A. Johnson, (1997). Statistical Concepts and Methods,
John Wiley and Sons, New York.

D Lane, (2003). Introduction to Statistics. Rice University.

Gravetter, F. J., and L. B. Wallnau, (2000). Statistics for the Behavioral Sciences, 5th
edition, Wadsworth – Thomson Learning, Belmont.

McCabe, M, (2006). Introduction to the Practice of Statistics, 5th edition, Freeman.

Unit 6: Regression Analysis

6.0 Introduction
This chapter presents regression analysis. Regression analysis involves identifying and
evaluating the relationship between a dependent variable and one or more independent
variables. Linear regression explores relationships that can be readily described by
straight lines or their generalization to many dimensions.

6.1 Aim of the Unit


After studying this chapter, you should be able to:
 explain the meaning of regression analysis
 identify practical examples where regression analysis can be used
 construct a simple linear regression model
 use the regression line for prediction purposes

Structure of the Unit
 Understanding Simple linear Regression
 Simple linear Regression Equation
 Errors in Prediction
 Predicting Variability
6.2 Objectives of Regression Analysis
The primary objective of regression is to develop a linear relationship between a
response variable and explanatory variables for the purposes of prediction.

Regression analysis is used to:

• explain variability in the dependent variable by means of one or more independent
or control variables, and to analyze relationships among variables
• answer the question of how much the dependent variable changes with changes in
each of the independent variables.

In the regression model, the independent variable is labelled the X variable, and the
dependent variable the Y variable.

The relationship between X and Y can be shown on a graph, with the independent
variable X along the horizontal axis, and the dependent variable Y along the vertical
axis.

The aim of the regression model is to determine the straight-line relationship that
connects X and Y. In simple linear regression, the straight line connecting any two
variables X and Y can be stated algebraically as Y = a + bX

• a is called the Y intercept, or simply the intercept

• b is the slope of the line (regression coefficient).

In simplest terms, the purpose of regression is to try to find the best line or equation that
expresses the relationship between Y and X.

Intercept or Constant is the point at which the regression line crosses the y-axis. The
intercept gives the predicted mean of the dependent variable when the independent
variable is zero.

Slope (Regression coefficient)

The slope is how much we expect y to change when x increases by one unit.

Zero slope means that the independent variable does not have any linear influence on the
dependent variable.
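
Under the method of least squares, the slope and intercept of Y = a + bX are usually computed as

$$ b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}, \qquad a = \bar{y} - b\bar{x} $$

The sketch below applies these formulas, makes a prediction, and lists the errors in prediction (observed minus predicted values). The data and the function name are hypothetical and serve only as an illustration.

```python
def fit_line(x, y):
    """Return (a, b): least-squares intercept and slope of Y = a + bX."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Hypothetical data, for illustration only
x = [2, 4, 6, 8]
y = [3, 7, 8, 12]

a, b = fit_line(x, y)
print(f"Y = {a:.2f} + {b:.2f}X")

# Prediction for a new x-value
x_new = 5
print(f"predicted Y at X = {x_new}: {a + b * x_new:.2f}")

# Errors in prediction: observed value minus predicted value
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
print("residuals:", [round(e, 2) for e in residuals])
```

The slope b is read as the expected change in Y for a one-unit increase in X, and the residuals of a least-squares line always sum to zero.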

6.2.1 Exercise 1
 Using Microsoft Excel or the Statistical Package of Social Sciences (SPSS),
calculate the simple linear regression of study time versus exam mark.
 Calculate the coefficient of determination. Interpret your results.

Student    No. of hours studying per week    Exam mark

1          10                                55
2          15                                69
3          12                                46
4          34                                77
5          12                                65
6          35                                60
7          23                                78

4. In the Regression dialog box, click the "Input Y Range" box and select the
dependent variable data (exam mark).
5. Click the "Input X Range" box and select the independent variable data (number of
hours studying per week).

6. Click "OK" to run the results.

Exercise 2: From the output given below,


 Identify the coefficient of determination and interpret;
 Write down the regression equation and interpret.

Errors in Prediction
Errors of prediction are defined as the differences between the observed values of the
dependent variable and the predicted values for that variable obtained using a given
regression equation and the observed values of the independent variable.

6.2.3 Computer based practical Exercise 1(Microsoft Excel)
The training manager of a company that assembles and exports pool pumps wants to
know if there is a link between the number of hours spent by assembly workers in
training and their productivity on the job. A random sample of 10 assembly workers was
selected and their performances evaluated.

Training hours 20 36 20 38 40 33 32 28 40 24

Output 40 70 44 56 60 48 62 54 63 38

(a) Construct a scatter plot of the sample data and comment on the relationship
between hours of training and output.

(b) Calculate a simple regression line, using the method of least squares, to identify a
linear relationship between the hours of training received by assembly workers and their
output (i.e. number of units assembled per day).

(c) Calculate the coefficient of determination between training hours received and
worker output. Interpret its meaning and advise the training manager.

(d) Estimate the average daily output of an assembly worker who has received only
twenty-five hours of training.

6.3 Conclusion
Simple linear regression analysis is a technique that builds a straight-line relationship
between a single independent variable, x, and a dependent variable, y. The purpose of
the regression equation is to estimate y-values from known, or assumed, x-values by
substituting the x-value into a regression equation. The data for both the independent
variable and the dependent variable must be numeric.

The method of least squares is used to find the best-fit equation to express this
relationship. The coefficients of the regression equation, the intercept a and the slope b
(also written b0 and b1), determine the estimated y-value for a given x-value; the slope
measures how much y is expected to change per unit change in x. The simple linear
regression equation, which is always based on sample data, must be tested for statistical
significance before it can be used to produce valid and reliable estimates of the true
mean value of the dependent variable. In simple linear regression, a test of significance
of the simple correlation coefficient, r, between x and y will establish whether x is
significant in estimating y.

6.4 References & Further Reading


Bhattacharyya, G. K., and R. A. Johnson, (1997). Statistical Concepts and Methods,
John Wiley and Sons, New York.

D Lane, (2003). Introduction to Statistics. Rice University.

Gravetter, F. J., and L. B. Wallnau, (2000). Statistics for the Behavioral Sciences, 5th
edition, Wadsworth – Thomson Learning, Belmont.

McCabe, M, (2006). Introduction to the Practice of Statistics, 5th edition, Freeman.

Unit 7: Multivariate analysis

7.0 Introduction
This unit presents multivariate analysis. Multivariate means that more than one variable
lies behind the resulting outcome. Most outcomes in the world or in business are due not
to a single cause but to several, and analysing these causes together is known as
multivariate analysis.

7.1 Aim of the Unit


After studying this chapter, you should be able to:
 explain the meaning of multiple regression analysis;
 identify practical examples where multiple regression analysis can be used;
 construct a multiple linear regression model; and
 define multicollinearity.

7.2 Structure of the Unit


 Multiple regression modelling;
 Multicollinearity; and
 Analysis of variance.

7.3 The objective of Multivariate Analysis (MVA)


 Reduction in data or simplification of the structure: MVA helps to simplify the
data as much as possible without losing critical information. This aids
interpretation later.

 Grouping and sorting the data: MVA involves multiple variables, which are
grouped based on their features.

 Verifying the data based on the variables: the collected data is checked against an
understanding of the variables. Establishing the status of each variable is critical, since
a variable can be independent or dependent on the other variables.

 Establishing a connection between the variables: understanding the relationship
between the variables is vital for explaining their behaviour, based on the
observations and the other variables present.

 Construction and testing of hypotheses: a statistical hypothesis based on the
parameters of the multivariate data is created and tested, to establish whether the
assumptions are correct.

Multiple regression is an extension of simple linear regression. It is used when we
want to predict the value of a variable based on the value of two or more other
variables.

Multiple regression equation assumes the form

Y = a + b1X1 + b2X2+.....+bnXn

where X1, X2, …, Xn are the independent variables and Y is the dependent variable.
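
The coefficients a, b1, …, bn are usually estimated by least squares. The sketch below is one way to do this with only the Python standard library: it builds the normal equations (X'X)b = X'y and solves them by Gaussian elimination. The function names are illustrative, and the data are hypothetical values generated exactly from Y = 1 + 2X1 + 3X2 so that the recovered coefficients are easy to check. In practice a package such as Excel or SPSS would be used.

```python
def solve_linear_system(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        # Move the row with the largest entry in this column into pivot position
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - known) / aug[r][r]
    return solution

def fit_multiple_regression(predictors, y):
    """Return [a, b1, ..., bn] from the normal equations (X'X)b = X'y."""
    rows = [[1.0] + list(values) for values in zip(*predictors)]  # 1.0 for the intercept
    k = len(rows[0])
    xtx = [[sum(row[i] * row[j] for row in rows) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Hypothetical data generated exactly from Y = 1 + 2*X1 + 3*X2
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 6]
y = [1 + 2 * u + 3 * v for u, v in zip(x1, x2)]

a, b1, b2 = fit_multiple_regression([x1, x2], y)
print(f"Y = {a:.2f} + {b1:.2f} X1 + {b2:.2f} X2")
```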

7.3.1 Exercise 1 (Multiple Regression)


 Exam mark could be determined by a student’s number of study hours, lecture
attendance and IQ. Identify the dependent and independent variable?
 From the output given below
a) Identify the coefficient of determination and interpret;
b) Write down the regression equation and interpret.

In multiple regression analysis, the regression coefficients (viz., b 1 b2, ….bn) become
less reliable as the degree of correlation between the independent variables (viz., X 1,
X2,…,Xn) increases.

If there is a high degree of correlation between independent variables, we have a


problem of what is commonly described as the problem of Multicollinearity.

In such a situation we should use only one of the correlated independent variables to make our estimate, because adding a second variable, say X2, that is correlated with the first variable, say X1, distorts the values of the regression coefficients.

Advantages

 MVA considers multiple variables, which can be independent of or dependent on each other. Because the analysis takes these factors into account, it can draw more accurate conclusions.

 The analysis is tested and conclusions are drawn from it. These conclusions tend to be close to real-life situations.

Disadvantages

 MVA is laborious because it involves complex computations.

 The analysis requires a large number of observations on multiple variables to be collected and tabulated, which is time-consuming.

7.4 Multicollinearity
Multicollinearity is the occurrence of high intercorrelations among two or more independent variables in a multiple regression model. It can lead to skewed or misleading results when a researcher or analyst attempts to determine how effectively each independent variable can be used to predict or understand the dependent variable in a statistical model.
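One simple way to look for multicollinearity is to compute the correlation between every pair of independent variables and flag the pairs that are highly correlated. The sketch below does this with the standard library only. The predictor names, the data and the 0.8 cut-off are all hypothetical choices made for illustration; the module does not prescribe a threshold.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical independent variables for six students
predictors = {
    "study_hours":   [2, 4, 6, 8, 10, 3],
    "library_hours": [1, 3, 4, 6, 8, 2],   # rises almost in step with study_hours
    "attendance":    [5, 3, 8, 6, 9, 7],
}

THRESHOLD = 0.8  # illustrative cut-off (an assumption, not a fixed rule)

names = list(predictors)
flagged = []
for i, first in enumerate(names):
    for second in names[i + 1:]:
        r = pearson_r(predictors[first], predictors[second])
        print(f"r({first}, {second}) = {r:.3f}")
        if abs(r) > THRESHOLD:
            flagged.append((first, second))

print("Highly correlated pairs:", flagged)
```

With these made-up figures only the pair (study_hours, library_hours) is flagged, which suggests keeping just one of the two in the regression model.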

How to address multicollinearity

If we conclude that multicollinearity poses a problem for our regression model, we can
attempt a handful of basic fixes.

 Removing variables. A straightforward way of correcting multicollinearity is to remove one or more of the variables that show a high correlation. This reduces the multicollinearity among the correlated features.
 More data. Statistically, a regression model fitted to a larger sample is likely to produce estimates with less variance, which can reduce the impact of multicollinearity.

 Using techniques such as Partial Least Squares regression (PLS) and Principal Component Analysis (PCA). PLS reduces the variables to a smaller set of components with no correlation between them. PLS, like PCA, is a dimensionality reduction technique. PCA reduces the dimension of the data by decomposing it into uncorrelated factors, so the new variables it creates have no correlation between them.
 Centering the variables. Centering means subtracting a constant, usually the variable's mean, from every value of that variable.
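The effect of centering can be seen in a small sketch. It assumes a model that contains both a variable x and its square, a common source of multicollinearity, and uses hypothetical values 1 to 10 for x. Centering is mainly helpful in this kind of situation, where one term is built from another; it does not remove the correlation between two genuinely different predictors.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical predictor and its square, both entered in the same model
x = list(range(1, 11))
x_squared = [v ** 2 for v in x]

# Centering: subtract the mean of x from every value of x
mean_x = sum(x) / len(x)
x_centered = [v - mean_x for v in x]
x_centered_squared = [v ** 2 for v in x_centered]

r_raw = pearson_r(x, x_squared)
r_centered = pearson_r(x_centered, x_centered_squared)

print(f"Correlation of x with x squared before centering: {r_raw:.3f}")
print(f"Correlation of x with x squared after centering:  {r_centered:.3f}")
```

Before centering, x and its square are very highly correlated. After centering, the correlation is essentially zero for these symmetric values.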

7.5 Analysis of Variance (ANOVA)


Analysis of variance (ANOVA) is a hypothesis test approach to test for equality of means across multiple populations. It is an extension of the z-test or t-test, which only test for equality of means between two populations.

Analysis of variance asks whether different sample means of a numeric random variable come from the same population, or whether at least one sample mean comes from a different population. The test statistic used to test this hypothesis is called the F-statistic.

If significant differences between sample means are found to exist, it is assumed to be the result of an influencing factor rather than chance. This unit will consider the case in which only one factor influences the differences in sample means. Hence the method known as one-factor ANOVA will be used to test for differences in means.
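The F-statistic for one-factor ANOVA compares the variation between the sample means with the variation within the samples. The sketch below computes it from first principles. The three groups of exam marks (for example, three teaching methods) are hypothetical, and the function name is an illustrative choice.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    k = len(groups)                                   # number of groups
    n_total = sum(len(g) for g in groups)             # total observations
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation between group means (explained by the factor)
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation within groups (unexplained, attributed to chance)
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical exam marks under three teaching methods
marks = [
    [61, 62, 63],
    [62, 63, 64],
    [66, 67, 68],
]

f_stat, df_between, df_within = one_way_anova_f(marks)
print(f"F = {f_stat:.2f} with ({df_between}, {df_within}) degrees of freedom")
```

For these figures the between-groups sum of squares is 42 and the within-groups sum of squares is 6, giving F = (42/2) / (6/6) = 21 with (2, 6) degrees of freedom. This exceeds the 5% critical value of about 5.14 from standard F tables, so the hypothesis of equal means would be rejected.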

7.6 Exercise 1 (Multiple regression)


 Conduct multiple regression modelling on a given dataset and describe the findings. The work is due at the end of the lecture.

7.7 Conclusion

7.8 References & Further Reading
Bhattacharyya, G. K., and R. A. Johnson, (1997). Statistical Concepts and Methods,
John Wiley and Sons, New York.

Lane, D., (2003). Introduction to Statistics, Rice University.

Gravetter, F. J., and L. B. Wallnau, (2000). Statistics for the Behavioral Sciences, 5th edition, Wadsworth-Thomson Learning, Belmont.

McCabe, M, (2006). Introduction to the Practice of Statistics, 5th edition, Freeman.

Appendix 1 List of Statistical Tables


Table 1 Standard normal distribution (z)
