
UNIT 1: INTRODUCTION

OUTCOMES
This unit deals with the role of statistics in the data analysis process. Concepts that are basic to the study of
statistics are discussed.

After completion of this unit you will be able to:


 recognize the role of statistics in life
 understand the language of statistics
 select suitable measuring scales for different types of data
 understand the role of computers in statistics

We live in an era where we are faced with increasing amounts of information, referred to as data. To perform
many tasks efficiently we need a basic understanding of statistical methods. The field of statistics covers a
problem-solving process that seeks answers to questions through data. To be an informed consumer of
information you must be able to:
 extract information from tables and graphs
 follow numerical arguments
 understand the basics of how data should be gathered, summarised and analysed to draw statistical
conclusions

1.1. PROBLEM-SOLVING STEPS


Solving a statistical problem typically comprises the following steps:
1. Identify the problem.
2. Collect the information (data): how to measure it; what is the appropriate data source (existing or new);
use population or sample; which sampling method?
3. Organize and summarise the information: tables, graphs, numerical summaries; understand important
characteristics of the data, which give guidance in selecting appropriate methods for further analysis.
4. Interpret results: draw conclusions, make recommendations, assess the risk of an incorrect decision.
This involves generalizing from a sample of individuals or objects that we have studied to a larger
population.

Example 1.1
As part of a weekly check to assess the calibration of a filling machine, the quality control manager randomly
selects 50 bottles of beer that were filled on a specific day.

1. Identify the problem

2. Collect the information

3. Organise and summarise the information

4. Interpret results

Key components of statistical thinking


 Use data whenever possible to guide the analysis
 Look for connections and relationships
 Understand why data values differ from one another

1.2. DEFINITION
Statistics is the scientific discipline that provides methods to help us make sense of data by:
 collecting data in a methodical way
 analysing data using methods to organise and summarise data using tables, graphs and numbers
 interpreting data to draw conclusions or to answer questions

The field of statistics is subdivided into descriptive and inferential statistics:
 Descriptive statistics includes the collection and summarising of data to give an overview of the
information collected
 Inferential statistics is the process of making an estimate, prediction or decision about a population
based on sample data
o A population is almost always very large
o A sample is drawn and data summarised using descriptive techniques
o The results are used to make decisions about the population
o Reliability of decisions/conclusions is measured
 Confidence level: the proportion of times that an estimating procedure will be correct
in the long run
 Significance level: how frequently the conclusion will be wrong

1.3. THE LANGUAGE OF STATISTICS


 An experiment or investigation is any process of observation or measurement
 Elements are the people or objects about which information is collected
 A population is a complete collection of elements you wish to study
o If the population contains a countable number of items it is said to be finite
o If the number of items in the population is unlimited it is said to be infinite
o A study of the entire population is known as a census
o A parameter is a numerical measure that describes the population
 It is usually indicated by a letter from the Greek alphabet (e.g. μ, σ, π)
 A sample is a portion of data drawn from the population and must be representative of the population
o A statistic is a numerical measure that describes a sample
 It is usually indicated by a letter from the Roman alphabet (e.g. 𝑥̅ , s, p)
 A variable is a characteristic of interest about each element of a population or sample
o It varies from element to element
o The observed values of the variable are the data we will use in a statistical investigation
 Variables can be classified as quantitative or qualitative
o Qualitative or categorical variables provide information that is non-numerical
 E.g. marital status, type of job, gender, etc.
 Is sometimes coded to make it appear quantitative but will have no numerical meaning
o Quantitative variables provide numerical measurements of the elements of the study
 E.g. height, weight, age, etc.
 Arithmetic operations can be performed on the values of the variable

 Quantitative variables are further classified as discrete or continuous
o Discrete variables are countable
 Values are obtained through counting
 E.g. the number of students in the class
o Continuous variables have an infinite number of possible values that are not countable
 Values are obtained through measuring or weighing
 E.g. weight, length, time taken to complete a task, age, etc.

Example 1.2
Distinguish between qualitative and quantitative variables.

1. Gender

2. Temperature

3. Postal code

4. Number of drinks at a party for a couple of friends

Example 1.3
Distinguish between discrete and continuous variables.

1. Number of heads obtained after flipping a coin five times

2. Number of cars that arrive at the KFC drive-through between 10h00 and 12h00

3. Distances different model cars with the same tank capacity can drive in city driving conditions

4. Temperature
1.4. MEASUREMENT
Measurement is the process we use to assign a value to the observations or elements of a variable. There are
four levels or scales of measurement: nominal, ordinal, interval, ratio. The analyses depend on the scale used
to measure a variable.

1.4.1. Nominal scale


 This categorical level applies to data that consist of names, labels and categories in no specific order
 Numbers or symbols are used to identify groups to which various observations belong
 In counting males and females, the male group can be assigned the code 1 and the females the code 2
o These numbers serve only as a label for the group and the measurement consists of placing the
data in the correct group
 No arithmetic operations can be performed on such numbers other than counting the groups and the
number of elements falling into each group
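
As a small illustration of the nominal scale, the following Python sketch uses hypothetical observations coded 1 (male) and 2 (female). The data and variable names are assumptions made for this example only. It suggests why counting the elements per group is meaningful while an average of the codes is not.

```python
from collections import Counter

# Hypothetical nominal data: 1 = male, 2 = female (the codes are labels only)
codes = [1, 2, 2, 1, 2, 2, 1, 2]
labels = {1: "male", 2: "female"}

# Counting the elements in each group is the only meaningful operation
counts = Counter(labels[c] for c in codes)
print(counts["male"], counts["female"])

# The "mean" of the codes can be computed, but it has no interpretation:
# relabelling the groups as 10 and 20 would change it without changing the data
meaningless_mean = sum(codes) / len(codes)
print(meaningless_mean)
```

The group counts stay the same whichever codes are chosen, which is why they are the only permissible summary at this level.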

1.4.2. Ordinal scale


 The categories into which data are grouped are ranked in some order
 Differences between data values are meaningless, e.g. income levels such as low, medium or high
 The permissible analysis methods for ordinal data include techniques generally associated with the
order of the observations

1.4.3. Interval scale


 Interval scaled data are always numerical
 Data can be arranged in order and the distances between data values are meaningful, but ratios of data
are not because of the absence of a true zero point
 E.g. temperature in degrees Celsius:
o There is a zero, but it is an arbitrary reference point: it does not represent “nothingness”
o The increase on the Celsius scale between 10 and 20 is the same as the increase between
30 and 40
o However, heat cannot be measured in absolute terms (0 ºC does not mean no heat) and it is not
possible to say that 40 ºC is twice as hot as 20 ºC
 Arithmetic operations can be performed on the difference between numbers, not the numbers
themselves
 The following are examples of data at the interval level of measurement:
o IQ score, scores on the Myers-Briggs Scale, belt/shoe sizes, calendar dates, time of day
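
The Celsius example can be checked with a short Python sketch. The conversion to the Kelvin scale, which does have a true zero, is not part of the notes and is added here only as an illustration; the function name `to_kelvin` is hypothetical.

```python
# Temperatures on the Celsius scale (interval scale: no true zero)
low_c, high_c = 20.0, 40.0

# Differences are meaningful: 10 -> 20 is the same increase as 30 -> 40
assert (20 - 10) == (40 - 30)

# The ratio of the Celsius readings suggests "twice as hot" ...
ratio_celsius = high_c / low_c

# ... but on the Kelvin scale, which has a true zero, the ratio is far from 2
def to_kelvin(celsius):
    return celsius + 273.15

ratio_kelvin = to_kelvin(high_c) / to_kelvin(low_c)
print(ratio_celsius, round(ratio_kelvin, 3))
```

The ratio changes when the zero point moves, whereas the differences do not. This is what separates interval data from ratio data.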

1.4.4. Ratio scale
 Data can be arranged in order
 Both differences between data values and ratios of data values are meaningful
 This scale must contain a zero value that indicates that nothing exists for the variable at the zero point
 Arithmetic operations can be performed on the numeric values themselves.
 E.g. money:
o The zero point is meaningful, i.e. at zero you have none
o The difference between R10 and R20 is the same as the difference between R50 and R60
o R10 is twice as much as R5
 Variables such as distance, height, weight, and time use the ratio scale

Activity 1.1
Categorise these measurements according to level.
Variable/measurements Measurement scale
1. Species of fish in the Vaaldam
2. Cost of rod and reel
3. Time of return home
4. Rating of fishing area: Poor, fair, good
5. Number of fish caught
6. Temperature of water

Activity 1.2
The student council at a university with 10000 students is interested in the proportion of students who favour
a change in the admission requirements at the university. Two hundred students are interviewed to determine
their attitude towards this proposed change. Of the 200, 64 (32%) are in favour of the change. The student
council announced that less than 35% of all students are in favour of a change.

a) What is the question to be answered in this investigation?

b) What is the variable of interest?

c) Classify the variable in terms of type and measurement scale

d) What is the population of interest?

e) What group of students constitute the sample in this problem?

f) What is the sample statistic?

g) What is the population parameter?

1.5. ROLE OF THE COMPUTER IN STATISTICS


 The Internet can provide access to data across continents at low costs
 Spreadsheets or statistical and mathematical software packages make such analysis readily available
to everyone.
 Use a computer for:
o Large volume of input
o Repetition of projects
o Desired greater speed in processing
o Greater accuracy
o Processing complexities that require electronic help
 It can help you develop your ideas about how to organize the information by using a ‘try and refine’
approach, which can take too long to carry out manually.

UNIT 2: COLLECTION OF DATA

OUTCOMES
This unit deals with how and where to obtain data that can be used to make informed decisions. The quality
of the final product depends on the quality of the raw material used.
 Remember the acronym GIGO - garbage in, garbage out.

After completion of this unit you will be able to:


 Distinguish between primary and secondary data sources
 Examine various sources of primary data
 Appreciate the art of questionnaire design
 Distinguish between probability and non-probability samples
 Select a sample
 Distinguish between different methods of data collection

Methods of data collection depend upon:


 The nature of the problem
 The time available
 The money available
 Data sources
 The degree of accuracy required

2.1. SOURCES OF DATA: WHERE TO GET THE DATA


 Secondary data:
o Already available in processed form, e.g. database, Internet, libraries, record keeping within
your company
o Often available at low cost
o Must be suitable for the problem of interest
o Internal data
 Comes from within the organisation for its own use
o External data
 Collected from sources outside the organisation, e.g. trade publications, CPI,
newspapers, libraries, official statistics, etc.
 Primary data:
o Collected for own use
o Reliable and relevant to your purpose
o Can take a long time to collect and may be expensive
o Sources:
 Experiments, observation, focus groups, survey questionnaires
o Methods:
 Face-to-face, by phone, by post, via Internet

2.2. PRIMARY DATA SOURCES


2.2.1. Conducting an experiment
 Impose some treatment on individuals or objects in order to observe the responses
 The purpose of an experiment is to study whether the treatment causes a change in the response

Example 2.1: TV Viewing


To determine if there is any relationship between the hours of TV viewing and the channel someone views,
we selected a random sample of students and told each one of them which TV channel they must watch over
the weekend. Each student recorded the number of hours they watched TV.

2.2.2. Observations
Collecting data relies on watching or listening, and then counting or measuring events as they happen without
anything being done to the individuals or objects. Draw up an observation sheet and keep count of the
observations using a tally table. Keep count by using straight lines for each item counted ( | | | | ). The fifth line
is a line across the first four lines so that we can count in multiples of five ( | | | | ).

Example 2.2:
The police wanted to determine whether motorists using a certain road wore seatbelts. They observed if the
driver used a seatbelt as the cars passed by and counted how many wore seatbelts and how many did not.

The number of motorists wearing seatbelts between 7am and 8am on 27 February 2013
                            Tally                     f
Wearing seatbelts           |||| |||| |||| |||| ||
Not wearing seatbelts       |||| |||| ||||
Total number of motorists
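
The tallying procedure can be sketched in Python as follows. The observation list and the helper `tally_marks` are hypothetical, and the fifth (crossing) stroke is shown as `/` because plain text cannot draw a line across the first four.

```python
from collections import Counter

def tally_marks(count):
    """Render a count as tally marks in groups of five (fifth stroke shown as '/')."""
    groups, rest = divmod(count, 5)
    parts = ["||||/"] * groups
    if rest:
        parts.append("|" * rest)
    return " ".join(parts)

# Hypothetical observation sheet: one entry per passing motorist
observations = ["belt", "belt", "no belt", "belt", "belt", "belt", "no belt", "belt"]

frequencies = Counter(observations)
for category, f in frequencies.items():
    print(f"{category:10s} {tally_marks(f):12s} f = {f}")
print("Total", sum(frequencies.values()))
```

The frequency column f is the number of strokes in each row, and the total should equal the number of motorists observed.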
2.2.3. Focus groups
 Small group of respondents
 Group discussion to explore a topic, stimulate new ideas and concepts, provide a broader
understanding of behaviour, determine reasons for attitudes and beliefs
 This facilitates design of questionnaires and other research tools

2.2.4. Survey by means of asking questions


Design a list of questions, known as a questionnaire, to make sure that the same questions are asked in the
same way. Record the answers in written form.

Personal interview
 Obtain data verbally and face to face
 Candidates are selected at random
 Popular method for conducting market research about products
 Interviewers must be trained to ask questions and record responses, which makes this method more
costly and time consuming
 You can obtain in-depth responses, not only by listening to the answer but also by interpreting their
body language
 Difficult questions can be clarified
 Show visual displays or products to the respondent to provide better communication and motivation
to participate in survey

Telephone interview
 Present questionnaire by telephone
 Telephone surveys are less costly than personal interviews and can be conducted over wider
geographical areas
 People are more open in their opinions as there is no face-to-face contact
 Some people in the sample will not have phones or will not be home when you call them

Mail questionnaire
 Respondents are asked to complete and return a questionnaire, which they receive in the mail,
newspaper, magazine or attached to a product
 Can cover a wider geographical area than telephone or personal interviews since it is the
least expensive method
 Can cover a larger sample as it is relatively cheap
 Respondents can remain anonymous and will be more open and honest in their opinion
 Disadvantages include a low response rate, inappropriate answers to questions and illiteracy of some
people included in the sample

Internet-based questionnaires
 E-mail with a link to the questionnaire on a secure website
 Quick and inexpensive, but often less detailed
 Disadvantages include a low response rate, excluding people who do not have a computer or are
computer illiterate

Activity 2.1
Rate the survey method as either 1 (most appropriate), 2 (less appropriate) or 3 (least appropriate), under the
following circumstances:

Telephone Personal Mail


Large geographical area
Small sample
Difficult questions
Keeping the cost low
Body language
If speed is a factor
Response rate
Illiteracy
Training of interviewers
Confidentiality
Market research for a product
2.3. QUESTIONNAIRE DESIGN
Decide which questions will be asked and how they will be asked. A good questionnaire has well thought-out
questions and a logical flow from question to question.

2.3.1. Questionnaire sections


 Administrative; Classification; Subject-matter of inquiry

2.3.2. Question Wording


 Should be appropriate to research topic
 Short and easy to understand
 Unbiased
 Should not be phrased emotively
 Should not be offensive or embarrassing
 Use closed questions where possible
 Confidentiality should be assured

2.3.3. Types of questions we can ask


Closed questions
 Give a series of possible answers from which one must be chosen
 Easy to record information
 Reduces interviewer bias
Open-ended questions
 Allows respondents’ own opinions in own words
 Answers may vary a great deal in length and detail

Activity 2.2: Distinguishing between open-ended and closed questions


1 How do you feel about violence in your neighbourhood?
2 Do you regularly watch soccer on TV? Yes or no
3 How often do you watch soccer on TV?
Never Sometimes One game every week Two or more games per week
4 What will you do to improve attendance at your school’s sporting events?
5 How reliable is your calculator?
6 How would you rate the reliability of your calculator?
Superior Very good Good Poor
2.4. SELECTING A SAMPLE
 Data collected on all the elements of the population is a census, e.g. the national census
 A complete list of all population units is called the sampling frame
 A sample is taken from the population using the sampling frame

2.4.1. Advantages of sampling


 Costs are reduced
 Collection time is reduced
 Overall accuracy is improved
 The only method of data collection for infinite populations or testing procedures that entail the
destruction of the item being tested
 By studying the behaviour of a sample, we can get a good idea of the behaviour of the population from
which the sample was drawn

2.4.2. Sampling laws


 Sampling is justified only if the sample is representative of the population
 Sampling is based on two laws:
o The Law of Statistical Regularity:
 A reasonably large number of items selected at random from a large group of items will,
on the average, have characteristics representative of the population
o The Law of Inertia of Large Numbers:
 Large groups of data show more stability than small ones

2.4.3. Sampling error


 We cannot expect that the sample results will be the same as the population results
 The difference between the sample statistic and the actual population parameter is known as sampling
error
 The smaller the sampling error, the more accurate the estimate for the population parameter.
 Factors that influence sampling error:
o Sample size: The larger the sample, the more similar the sample statistics will be to the
population parameter
o Variation among the values in the population: If you investigate the amount of pocket money
children receive and these amounts are more or less the same, the variability in the population
is small and a small sample will be sufficient. If the amounts differ a lot, the variability is
greater and we will need a larger sample
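
The effect of sample size on sampling error can be explored with a small simulation. This is only a sketch under stated assumptions: the population of pocket-money amounts is randomly generated and hypothetical, and `sampling_error` is an illustrative helper, not a standard function.

```python
import random
from statistics import mean

def sampling_error(population, n, rng):
    """Absolute difference between a sample mean and the population mean."""
    sample = rng.sample(population, n)
    return abs(mean(sample) - mean(population))

# Hypothetical population: weekly pocket money (in rand) of 1000 children
rng = random.Random(1)  # fixed seed so the run is repeatable
population = [rng.randint(10, 200) for _ in range(1000)]

for n in (10, 100, 1000):
    print(n, float(sampling_error(population, n, rng)))
```

When the sample contains the whole population (n = 1000 here) the sampling error is zero. For smaller samples the error varies from sample to sample, but it tends to shrink as n grows.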
2.4.4. Sample size (n)
Factors that influence sample size:
 The size of the population (N): Use a bigger sample for a bigger population
 Resources available such as time and money: Less time available will result in a smaller sample
 Error that can be tolerated: The bigger the sample you select, the more accurate the conclusions you
draw about the population based on the sample will be
 Variation in the population: The more similar the elements of the population are to one another, the
less variation there is within the population, so the sample can be smaller

2.4.5. Sample design


 Sample design = method used to select a sample from the population
 Two categories: non-random samples and random samples

2.5. NON-RANDOM OR NON-PROBABILITY SAMPLING METHODS


 Sample items are selected using personal convenience, expert judgment, or any type of conscious
researcher selection
 Sample selection is not done by chance
 Samples like these can produce unrepresentative data and are not desirable for use in inferential statistics
 Include convenience sampling, judgment sampling, voluntary response sampling and snowball
sampling

2.5.1. Convenience sampling


 Researcher chooses elements that are readily available, nearby, or willing to participate
 When both time and money are limited, convenience samples are widely used
 For example:
o Man-in-the-street interviews
o Lunch-hour interviews
o Interviewing close friends or family
o Door-to-door interviews

2.5.2. Judgment sampling


 Items are chosen from the population based on the experience and judgment of the researcher
 This method usually results in making systematic errors in one direction leading to what are called
biases
 For example:
o Four of the most influential economists were asked to estimate next year’s rate of inflation
2.5.3. Voluntary response sampling
 Samples consist of people who choose themselves by responding to a broad appeal, such as online
polls or newspaper questionnaires
 People who take the trouble to respond to an open invitation are usually not representative of any
clearly defined population, because people with strong opinions are the most likely to respond

2.5.4. Snowball sampling


 Sample elements are selected based on referral from other survey respondents
 The researcher identifies a person who fits the profile wanted for the study and asks this person for the
names and locations of others who also fit this profile
 Sample elements can be identified cheaply and efficiently, which is particularly useful when survey
subjects are difficult to locate

2.6. RANDOM SAMPLING


 Every unit in the population has an equal chance of selection
 Method is considered fair and free from bias
 Includes simple random sampling, systematic random sampling, stratified sampling, cluster sampling

2.6.1. Simple random sampling


 This technique is the basis for the other random techniques
 Each unit of the sampling frame is numbered from 1 to N (where N is the size of the population)
o Impractical for large populations
 Two of the random techniques are:
o The ‘goldfish bowl’ or lottery technique:
 Similar to drawing names from a hat
 Works well with a small sample
 Place a numbered card for each element in the population in a bowl, mix them
thoroughly, and select as many cards as needed in the sample
o Table of random numbers:
 Random number tables consist of rows and columns in which the numbers 0 - 9 appear
 A random number generator is a computer program that generates these numbers
 Any series of numbers read across or down the table is considered random
 See random number table in Appendix 4 of the textbook
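
Where a computer is available, a random number generator can play the role of the bowl or the printed table. The sketch below uses Python's standard `random` module; the function name `simple_random_sample` and the population of 100 numbered units are assumptions for illustration.

```python
import random

def simple_random_sample(population_size, sample_size, seed=None):
    """Draw sample_size distinct unit numbers from 1..population_size."""
    rng = random.Random(seed)
    frame = range(1, population_size + 1)   # the numbered sampling frame
    return rng.sample(frame, sample_size)   # sampling without replacement

chosen = simple_random_sample(population_size=100, sample_size=10, seed=42)
print(sorted(chosen))
```

Fixing the seed only makes the run repeatable for checking purposes. In practice the seed would be left unset so that the selection is not predictable.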
Example 2.3
Assume that you have 100 employees in a company and wish to interview a random sample of 10. Assign
every employee a unique two-digit number from 00 to 99. You can then use two digits of each number from
the random number list to select employees.

Decide where in the random table you should start. You can choose to use the first two digits, the middle two
digits or the last two digits. You can even choose which columns to use. You can make this decision by using
the “goldfish-bowl” technique or by closing your eyes and pointing to a spot in the table.

Suppose you have decided to start in the first column with the first two digits. If we reach the bottom of the
last column on the right and are still short of our desired number, we can go back to the beginning and start
reading the third and fourth digits of each number. According to the table, employee numbers 70, 23, 20, 22,
53, 39, 48, 64, 12 and 45 will be in the sample of ten.

If a number occurs more than once, you skip it. You cannot use any population ID twice because there is a
unique ID assigned to each element in the population.
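
The reading procedure in this example can be sketched in Python. The entries are the first column of the random number table printed under Activity 2.3, read from top to bottom; the function `read_random_table` is a hypothetical helper written for this illustration.

```python
def read_random_table(numbers, digits, population_ids, sample_size):
    """Select IDs by reading the leading `digits` of each table entry in order."""
    selected = []
    for entry in numbers:
        candidate = entry[:digits]
        # Skip IDs that are not in the population or were already selected
        if candidate in population_ids and candidate not in selected:
            selected.append(candidate)
        if len(selected) == sample_size:
            break
    return selected

# First column of the random number table, read from top to bottom
first_column = ["7081", "2318", "2099", "2293", "5359", "3971", "4833",
                "6492", "1227", "4505", "9515", "9889", "5737"]
employee_ids = {f"{i:02d}" for i in range(100)}   # "00" to "99"

sample = read_random_table(first_column, 2, employee_ids, 10)
print(sample)  # ['70', '23', '20', '22', '53', '39', '48', '64', '12', '45']
```

The same helper would also skip repeated or out-of-range numbers, which matters when the population IDs do not use every possible digit combination.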

Activity 2.3
Each student at the university has a mailbox on campus. The mailboxes are numbered 0000 to 9000. Use the
random number table and select 10 mailbox numbers in your sample.

7081 8887 2876 1705 4260 5065 5528 8241 5997
2318 0139 6986 4900 2408 2027 1676 4382 3370
2099 3526 7912 3824 5108 1033 7363 0183 8479
2293 4424 9209 5979 5022 4849 1960 1771 7961
5359 3108 7453 9978 3538 8963 9562 5437 6806
3971 9260 0760 1284 1020 0961 2666 0255 5957
4833 6395 4528 0665 5386 3539 5918 9165 2088
6492 9493 1058 9069 7725 0094 9513 2735 2915
1227 1585 3239 0593 4703 4737 5851 2551 2824
4505 9108 0031 9578 0077 9836 5817 3221 1174
9515 4576 4486 8388 1343 4507 0031 2209 1921
9889 6933 2616 3883 9008 3389 3672 6952 5839
5737 6911 3388 3682 7271 1110 7272 5674 1650
2.6.2. Stratified sampling
 Identify non-overlapping groups within the population
 Select a simple random sample from each group
 Each of these groups is represented proportionally in the sample
 For example, if a researcher needs to estimate the average mass of a large group of people, he or she
first divides the group into two strata - male and female - and then selects a proportional simple random
sample from each stratum
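
A minimal Python sketch of proportional stratified sampling follows. The strata sizes (600 men, 400 women), the unit IDs and the function name are hypothetical, and simple rounding is assumed to be adequate for the allocation.

```python
import random

def proportional_stratified_sample(strata, sample_size, seed=None):
    """Draw a simple random sample from each stratum, proportional to its size."""
    rng = random.Random(seed)
    total = sum(len(units) for units in strata.values())
    sample = {}
    for name, units in strata.items():
        # Proportional allocation; rounding may need manual adjustment
        # when the shares are not whole numbers
        n_stratum = round(sample_size * len(units) / total)
        sample[name] = rng.sample(units, n_stratum)
    return sample

# Hypothetical population: 600 men and 400 women, identified by ID
strata = {
    "male": [f"M{i}" for i in range(600)],
    "female": [f"F{i}" for i in range(400)],
}
sample = proportional_stratified_sample(strata, sample_size=50, seed=7)
print({name: len(units) for name, units in sample.items()})  # {'male': 30, 'female': 20}
```

Each stratum contributes the same share of the sample as it has of the population: 60% and 40% in this illustration.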

2.6.3. Systematic sampling


 Select the starting number (a value between 1 and k) at random and each successive number
systematically from an orderly list of the population to obtain the sample
 Every kth item is selected to produce a sample of size n from a population of size N
 The value of k can be determined by the following formula: $k = \frac{N}{n}$
 For example, a quality controller selects every 100th smart phone of a specific brand from the assembly
line and conducts a quality test
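
The selection rule can be sketched in Python as follows. The list of 1000 numbered phones is hypothetical, and the sketch assumes that N is a multiple of n so that k is a whole number (otherwise k is rounded down).

```python
import random

def systematic_sample(population, sample_size, seed=None):
    """Select every k-th unit after a random start between 1 and k."""
    rng = random.Random(seed)
    k = len(population) // sample_size      # k = N / n, rounded down
    start = rng.randint(1, k)               # random start, 1-based position
    # Convert the 1-based positions start, start + k, ... to list indices
    return [population[start - 1 + i * k] for i in range(sample_size)]

# Hypothetical ordered list of N = 1000 phones; a sample of n = 10 gives k = 100
phones = list(range(1, 1001))
sample = systematic_sample(phones, sample_size=10, seed=3)
print(sample)
```

Only the starting point is random. Once it is chosen, the rest of the sample is fixed by the interval k.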

2.6.4. Cluster sampling


 Some populations have non-overlapping groups, each of which represents the full range of views in the
population, for example a town, university or a file of invoices
 It will be more convenient and cost effective to select one or more of these clusters at random and then
select a sample from the clusters, or carry out a census within the selected clusters
 Sometimes the clusters are too large and a second set of clusters is taken from the original chosen
clusters
o This technique is called multi-stage sampling
 A large geographical area is often divided into more manageable provinces or clusters
 Select a few provinces and then select a few towns from each province
 Out of each town select a few blocks, and out of each block select individual families
at random
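
A single-stage version of cluster sampling might look like the sketch below, where whole clusters are chosen at random and every element in a chosen cluster is included. The invoice files and the function name are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Pick whole clusters at random and take a census within each one."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: clusters[name] for name in chosen}

# Hypothetical population: invoices grouped into files (the clusters)
files = {
    "file A": ["A1", "A2", "A3"],
    "file B": ["B1", "B2"],
    "file C": ["C1", "C2", "C3", "C4"],
    "file D": ["D1", "D2"],
}
selected = cluster_sample(files, n_clusters=2, seed=5)
print(selected)
```

For multi-stage sampling, the census inside each chosen cluster would be replaced by a further random selection, for example by applying the same function to sub-clusters.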
TEST YOURSELF
For each of the following, identify the most appropriate sampling technique:

Question 2
Select a random sample of 25 of the 371 active telephone area codes in South Africa.

Question 3
At a party, 30 students are 21 or older and 15 are under 21. Select a sample of size 5 to measure attitudes
towards alcohol.

Adjusted from Question 4


Forest areas in KZN are divided into large sectors. We are interested in the tree species count in a 20 × 25 m
rectangle.

Adjusted from Question 5


Sample 4 houses in a neighbourhood.
UNIT 3: SUMMARISING DATA USING TABLES AND GRAPHS

OUTCOMES
Describe data by summarizing and displaying it using tables and graphs so that the salient features of the
dataset are more easily understood.

After completion of this unit you will be able to:


 recognize the difference between grouped and ungrouped data
 construct a frequency distribution
 draw graphs based on qualitative and discrete data
 draw graphs based on continuous data
 recognize the usefulness of visual aids in presenting data

Data formats
 Raw data: list of the observations for each variable
o Provide little information
o Raw data must be organised and summarised to get the overall picture
 Frequency distribution/table: arrange raw data in a summarised form
 Graphs: visual representation of frequency tables
o Complements a table by showing the data’s general structure more clearly and to reveal
relationships that might be overlooked in a table.
o Type of graph depends on the type of data, the complexity of the data and the requirements of
the user

3.1. SUMMARISING QUALITATIVE DATA


3.1.1. Frequency distributions
 A frequency distribution is a table that records each value that a variable might have and the number
of times (frequency) that each one occurs in the data set
 A frequency distribution for qualitative data displays the possible categories along with the
corresponding frequencies. The frequency for a category is the number of times the category appears
in the data set
 The relative frequency for a category is the proportion or percentage of the observations within a
category
$\text{Relative frequency} = \dfrac{\text{frequency}}{\text{total frequency}}$

 If the table includes relative frequencies, it is referred to as a relative frequency distribution


Steps
 Construct a column containing the values or categories for the variable of interest
 List the corresponding number of times that the category or value occurs (frequency column)
 Add the frequency column to make sure that the total is the same as the number of observations
 Interpret the table results
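
The steps above can be sketched in Python. The raw data (mode of transport of ten students) is hypothetical and `frequency_distribution` is an illustrative helper, not a library function.

```python
from collections import Counter

def frequency_distribution(observations):
    """Return rows of (category, frequency, relative frequency)."""
    counts = Counter(observations)
    total = sum(counts.values())
    return [(category, f, f / total) for category, f in counts.most_common()]

# Hypothetical raw data: mode of transport of 10 students
raw = ["bus", "car", "bus", "taxi", "bus", "car", "taxi", "bus", "foot", "car"]

table = frequency_distribution(raw)
for category, f, rel in table:
    print(f"{category:6s} {f:3d} {rel:6.2f} {rel:7.0%}")

# Check: the frequencies must add up to the number of observations
assert sum(f for _, f, _ in table) == len(raw)
```

Each printed row shows the category, its frequency, its relative frequency and its percent frequency, matching the columns of a relative frequency distribution.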

Example 3.1

Activity 3.1
A bio-kinetics instructor wants to study the different types of rehabilitation required by her patients. She
selected a random sample of her patients and recorded the body part requiring rehabilitation. The following
results were obtained:

Hand Back Ankle Shoulder Back Back
Back Shoulder Back Wrist Knee Knee
Neck Ankle Hip Knee Back Neck
Wrist Shoulder Back Back Back Shoulder
Knee Back Back Knee Hand Wrist

Construct a frequency distribution and a relative frequency distribution to describe the data. Give a short
interpretation of your results.

Category Frequency Relative frequency Percent frequency

Total
3.1.2. Cross-tabulation
 Summary table with two categorical variables (bivariate) is known as a two-way or contingency
frequency table
 One variable in the rows and the other variable in the columns
 Each row and column combination is called a cell
o Observed cell counts = number of times each combination occurs in the data for all cells (joint
data/information)
o Marginal totals = total observed cell counts in each row and also in each column
o Grand total = total of all the observed cell counts in the table
 E.g. data collected at a university to compare students, staff and management on the basis of their
transportation to campus (taxi, bus, car, train, motorcycle, bicycle or on foot)
o 3 × 7 two-way frequency table
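
The cell counts, marginal totals and grand total of a contingency table can be computed as in the following Python sketch. The six paired observations are hypothetical and far fewer than a real study would use.

```python
from collections import Counter

# Hypothetical paired observations: (group, transport) for each person
observations = [
    ("student", "bus"), ("student", "taxi"), ("staff", "car"),
    ("student", "bus"), ("staff", "car"), ("staff", "bus"),
]

cell_counts = Counter(observations)                          # joint information
row_totals = Counter(group for group, _ in observations)     # marginal totals (rows)
col_totals = Counter(mode for _, mode in observations)       # marginal totals (columns)
grand_total = sum(cell_counts.values())

print(cell_counts[("student", "bus")], row_totals["student"],
      col_totals["bus"], grand_total)
```

The row totals and the column totals each add up to the grand total, which is a useful check when a table is built by hand.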

Activity 3.2
People believe that organic foods are healthier than conventionally grown fruit and vegetables. An investigation
was carried out on a sample of 10000 food items. The following table displays the frequencies for all possible
category combinations of the two variables: food type and pesticide status. Briefly comment on these results.
                      Pesticides
Food type       Present   Not present    Total
Organic              28            99      127
Conventional       9085           788     9873
Total              9113           887    10000

Dimensions:

Marginal information:

Joint information:
Activity 3.3
One hundred students, majoring in Sciences were classified according to gender and year of study. Ten were
1st year women, 20 were senior women, 40 were 1st year men and 30 were senior men. Arrange the data in a
contingency table. Briefly comment on your results.

Dimensions:

Marginal information:

Joint information:

3.1.3. Bar Graph: single data set


 Bar graphs are a quick and easy way of showing variation in, or between variables
 One of the axes is used to represent the categories in the frequency table and the other axis is used to
represent the frequencies or relative frequencies
 Single bars representing each category are drawn either vertically or horizontally

Steps
 Communicate a single variable
 Identify, scale and label axes
o Variable categories on x-axis
o Frequency, relative frequency, frequency percentage on y-axis
 Category axis
o Bars represent categories
o The bars must all have the same width
o Make the bars reasonably wide so that they can be clearly seen
o Gap between bars

Example 3.2
DIY

Activity 3.4
Draw a simple bar chart showing the ages of employees and draw conclusions from your results.
Age Number of employees
20 11
21 4
22 8
23 6
24 5

1) Identify the units of observation / elements of the sample

2) Identify the variable of interest

3) What is the sample size?

4) What percentage of the sample are under 22 years old?

5) Graph:
3.1.4. Comparative bar graphs
 To compare two or more data sets
 Use contingency data as input
 Multiple bar chart:
o Bars are grouped together in each category
o Can use relative frequency if sample sizes are different
 Segmented /stacked bar chart:
o Categories are stacked
o Useful to emphasise relative proportions of components that make up the category
 Use a key to distinguish between the categories

Example 3.3 / 3.4

Activity 3.5 / 3.6


1) Complete the following table
Age   Male (f)   Male (%f)   Female (f)   Female (%f)
20    3          ____        8            ____
21    1          ____        3            ____
22    4          ____        4            ____
23    2          ____        4            ____
24    5          ____        4            ____
Total ____       ____        ____         ____

2) Draw a multiple bar chart showing the ages of male and female employees
3) Draw a percentage component bar chart showing the ages of male and female employees

3.1.5. Pie chart


A pie chart represents the data in a circle divided into “slices” representing the possible categories. This allows
a quick overall view of the relative sizes of the categories, but offers little potential for comparison.

Steps
 Draw a circle to represent the entire data set
 Keep the categories to 10 or fewer
 For each category calculate the slice size, relative to the whole circle
 Put any labelling outside the circle
 Look for categories that form large and small proportions of the data set when interpreting the chart

Example 3.5
DIY

Activity 3.7
Draw a pie chart to portray how you might divide up your day, using the following data:
Travelling 10%
Working 30%
Eating 10%
Sleeping 28%
Social life 7%
Other 15%

3.1.6. Pictogram
Not included in curriculum
3.2. SUMMARISING QUANTITATIVE DATA IN TABLES
3.2.1. The ordered array of data
 If there are not too many observations we can use the collected data in its raw form, also known as
ungrouped data
 Arrange the data in an array: sort the data in numerical order from smallest to largest
o Necessary for some statistical procedures, such as the median, percentiles and quartiles

Example 3.6
Arrange the following data in an array:
Data: 4 80 50 10 5
Array: 4 5 10 50 80
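As a small illustration of how software produces an ordered array, the sketch below sorts the data of Example 3.6 in Python. The variable names are illustrative only.

```python
data = [4, 80, 50, 10, 5]    # raw (ungrouped) data from Example 3.6
array = sorted(data)         # ordered array, smallest to largest
print(array)                 # [4, 5, 10, 50, 80]
```

`sorted` returns a new list and leaves the raw data unchanged, so the original order of collection is not lost.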

Activity 3.8
Arrange the following numbers in an array:
Data: 67 23 56 45 56 41 34 33 0 18 23
Array:

3.2.2. Dot plot


 Portrays individual observations
 Used for relatively small data sets (usually not more than 20)

Steps
 Construct a horizontal axis and label it
 Mark the axis with a scale to fit the smallest to largest value in the data set
 For each observation, place a dot above its value on the number line
 If there are two or more observations with the same value, stack the dots vertically
 The number of dots above a value represents the frequency of occurrence of that value
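The steps above can be sketched in code as a simple text dot plot. This is only an illustration under stated assumptions: the function `dot_plot` is a hypothetical helper, it assumes non-negative whole-number values, and the data are made up for the example.

```python
from collections import Counter

def dot_plot(values):
    """Return a text dot plot: one column per axis value, dots stacked upwards."""
    counts = Counter(values)                            # frequency of each value
    axis = list(range(min(values), max(values) + 1))    # scale from smallest to largest
    width = len(str(max(axis))) + 1                     # column width for the labels
    rows = []
    for level in range(max(counts.values()), 0, -1):    # build the top row first
        row = "".join(("*" if counts[v] >= level else "").rjust(width) for v in axis)
        rows.append(row.rstrip())
    rows.append("".join(str(v).rjust(width) for v in axis))   # the labelled axis
    return "\n".join(rows)

scores = [2, 3, 3, 5, 5, 5, 6]    # hypothetical data, not from the textbook
print(dot_plot(scores))
```

The number of stars stacked above a value is its frequency, and an empty column (here above 4) shows a gap in the data.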

Example 3.7
Activity 3.9
The following table lists 15 popular cereals and the amounts of sodium and sugar in a single serving of 180 ml.
Cereal Sodium (mg) Sugar (g) Cereal Sodium (mg) Sugar (g)
A 290 2 I 250 10
B 200 3 J 125 14
C 230 3 K 220 3
D 125 13 L 0 7
E 260 5 M 220 12
F 200 11 N 170 3
G 210 12 O 140 10
H 140 10

1) Can you explain the “0” sodium value for Cereal L?

2) Construct a dot plot for the sugar values

3) What does the plot tell us about the data?


3.2.3. Stem-and-leaf plot
 Portrays individual observations in order and shows the shape
 All the information in the original data list is shown and we could reconstruct the original list of values
 The stem-and-leaf plot represents data by separating each value into two parts:
o Stem: the leading digit or digits, displayed in a column on the left-hand side of
a vertical line
o Leaf: the remaining digit(s) of the value, displayed to the right of the line
 Important to note the unit of measurement for the stem and the leaf
o E.g. for the value 76 the 7 will be the stem (unit = tens) and the 6 will be the leaf (unit = ones)

Stem Leaf
7 6

 What would this number be if the leaf unit is:


o 10
o 0.1

Steps
 Select leading digit(s) for the stem
 Find the smallest and the largest values in the data to determine the first stem and the last stem
 List all possible stems in increasing order, to the left of the line
 The trailing digit(s) become the leaves
o Place the leaves with the same stem on the same row as the stem.
o Arrange the leaves in each row from lowest to highest to form a stem-and-leaf plot.
 Use a label to indicate the units for stems and leaves in the display.
 The count of the number of leaves per row is the frequency of each row
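A minimal sketch of these steps in Python is given below. It assumes non-negative whole numbers with a leaf unit of 1 (stem = tens and higher digits, leaf = ones digit); the function `stem_and_leaf` and the data are hypothetical and serve only as an illustration.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Stem = tens (and higher) digits, leaf = ones digit; assumes non-negative integers."""
    leaves = defaultdict(list)
    for v in sorted(values):                 # sorting puts the leaves in increasing order
        leaves[v // 10].append(v % 10)
    # List every possible stem from the smallest to the largest, even if it has no leaves.
    stems = range(min(values) // 10, max(values) // 10 + 1)
    return {stem: leaves.get(stem, []) for stem in stems}

marks = [76, 52, 58, 91, 73, 79, 70, 94]     # hypothetical data
plot = stem_and_leaf(marks)

print("Stem | Leaf   (stem unit = 10, leaf unit = 1)")
for stem, row in plot.items():
    print(f"{stem:>4} | {' '.join(map(str, row))}   (f = {len(row)})")
```

Stems without leaves are kept in the display on purpose, because they reveal gaps in the data.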

The display conveys information about:


 the location of peaks to show a typical value
 the spread of the data values
 the presence of any gaps in the data
 the extent of symmetry in the distribution
 the presence of outliers
Consider the following stem-and-leaf plot (leaf unit 0.1):
Stem Leaf
0 2 8
1 1 6 7
2 4 4 5 6 6 6
3
4 1 5 9

1) How many observations are there in the sample?

2) What are the minimum and maximum values?

Activity 3.10
The following is an array of the daily litres of used sunflower oil bought by a biodiesel plant.
58 63 69 69 70 71 71 72 72 72 73 73 74 75 77
79 80 82 84 84 85 88 91 91 91 94 96 97 99 100

1) Construct a stem-and-leaf plot for the data

2) Comment on the graph


3.2.4. Frequency distribution tables
These tables condense the raw data into a more manageable form that increases our ability to detect patterns
and meaning. This is done by keeping count of how many times a particular value occurs.
 A frequency is the number of times a value occurs

Ungrouped frequency distribution


 Use quantitative data
 Group it into an ungrouped frequency distribution
o ‘ungrouped’ because each x-value in the distribution stands alone

Example 3.9

Activity 3.11
Value (x) Frequency (f)
4 1
6 2
14 1
15 4
17 1
18 1
Total

1) From this frequency distribution:


(a) Which values occurred only once?
(b) Which value(s) occur most often?
(c) How many values are there in the distribution?
2) Form an ungrouped frequency distribution of the following data and comment on the frequency of
each value:
1 2 1 4 0 2 0 1 4 1 6
Grouped frequency distribution
If the number of observed values is relatively large you will have a fairly lengthy list of data, which is not easy
to interpret. It is then necessary to summarise the data in a grouped frequency distribution by grouping adjacent
x-values into intervals, known as classes. In this summary we lose the detail of individual values, but it makes
the data much easier to read and understand.

 A frequency distribution is a summary of numerical data by grouping it into several non-overlapping
class intervals showing the number of observations (frequency) in each interval
 Data organized into a frequency distribution are called grouped data
 Frequency distributions can vary in final shape and design because they are constructed according to
the individual researcher's taste

Constructing a frequency distribution


1) Determine the range of the raw data: R = maximum − minimum
2) Determine the number of class intervals: $K \approx \sqrt{n}$
o between 5 and 20 classes
o round K to nearest whole number
3) Determine the width of the class interval: $c \approx \frac{R}{K}$
o Round answer to a whole number or to the same number of decimals as the raw data
4) Test if $K \times c > R$; if not, increase c or K
5) Choose the lower and upper class boundaries of each interval
o Must include all data values
o Must not overlap
o Lowest value of first class interval could be minimum value or a little smaller
o The upper class limit of the first interval is the same as the lower class limit of the second
interval
o The last class upper limit must be equal to or more than the maximum value
6) Sort the raw data into the classes by making use of the tally method
7) Count the number of tallies (observations) in each class to obtain the frequency (f) for each class
8) The sum of the frequencies for all class intervals must equal the number of original data values
9) It is possible to come to some conclusions like: In which class do you find the majority of the values
or the least number of values?
Note
1) The number of classes should be small enough to provide an effective summary but large enough to
display the relevant characteristics of the data
2) Class boundaries must be selected in such a way that the smallest value is included in the first interval
and the largest value in the last interval
3) Avoid overlapping of intervals so that an observation falls in one class only
4) The width of all classes should be equal
5) Open-ended class intervals should be avoided although they may be useful when a few values are
extremely large or small in comparison with the rest of the values
6) Class intervals with a frequency of zero should be avoided (unless this naturally forms part of the
distribution of the variable)

Additional information
 Different formats are used to denote the class intervals
 E.g. from 1 to under 10 could be represented as:
o [1, 10)
o $1 \le x < 10$
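The construction steps can be sketched in code as follows. This is an illustration under stated assumptions, not the prescribed method: it assumes the number of classes is taken as roughly the square root of n (kept between 5 and 20), it chooses to increase c rather than K when the test in step 4 fails, and it starts the first class at the minimum value. The function `grouped_frequency` and the data are hypothetical.

```python
import math

def grouped_frequency(data, decimals=0):
    """Group data into equal-width classes [lower, upper) and count each class."""
    n = len(data)
    r = max(data) - min(data)                    # 1) range R
    k = min(20, max(5, round(math.sqrt(n))))     # 2) K, kept between 5 and 20 classes
    c = round(r / k, decimals)                   # 3) class width c, rounded like the data
    while k * c <= r:                            # 4) test K x c > R; if not, increase c
        c = round(c + 10 ** -decimals, decimals)
    start = min(data)                            # 5) first lower boundary = minimum value
    classes = [
        (round(start + i * c, decimals), round(start + (i + 1) * c, decimals))
        for i in range(k)
    ]
    # 6) and 7) tally: each value falls in exactly one class [lower, upper)
    freq = [sum(1 for x in data if lower <= x < upper) for lower, upper in classes]
    if sum(freq) != n:                           # 8) frequencies must add up to n
        raise ValueError("some observations fall outside the classes")
    return classes, freq

# Hypothetical data set of n = 25 whole numbers
data = [12, 15, 17, 18, 21, 22, 22, 24, 25, 27, 28, 29, 30,
        31, 33, 34, 35, 36, 38, 40, 41, 43, 45, 47, 50]
classes, freq = grouped_frequency(data)
for (lower, upper), f in zip(classes, freq):
    print(f"[{lower:g}, {upper:g})  {f}")
```

Because the result depends on the choices made for K, c and the first lower boundary, two people can correctly produce different tables from the same data.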

Example 3.10

Activity 3.12
Frequency distribution of acrylamide levels in Big Mac potato fries (from Example 3.10)
Class intervals Frequency
[151, 187) 3
[187, 223) 5
[223, 259) 6
[259, 295) 6
[295, 331) 8
[331, 367) 2
Total 30

1) The number of franchises with acrylamide levels in their French fries between 259 and 295 is____
2) The frequency for the class with acrylamide levels between 187 and 223 is____
3) The upper boundary of the first class is____
4) The lower boundary of the third class is____
5) The total number of observations in the data set is____
Relative frequency distribution
 When the proportion of observations in each class interval instead of the actual number of observations
is recorded, the distribution is known as a relative frequency distribution
 Relative frequency distributions are useful for comparing two data sets, especially when the sample
sizes or measurement scales differ substantially
 A relative frequency of a class is the observed frequency of the class divided by the total number of
observations in the data set. If the percentage is required, multiply the result by 100

Class midpoint or class mark


 The class mark or midpoint (x) divides a class interval into two equal parts
 It is obtained by adding the upper and lower boundaries of each class interval and dividing the result
by two
 This middle value (or x) represents the class interval in calculations

Cumulative frequency distribution


 To determine the number of observations that lie below or above a certain value
 A cumulative frequency for a class is the sum of the frequencies for that class and all previous classes
 We read it as the total of all the frequencies less than the upper boundary of each interval
 A cumulative relative frequency distribution is a ratio calculated by dividing a cumulative frequency
of a class by the total number of observations in the data set
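The relative frequencies, class midpoints and cumulative frequencies described above can be computed directly from a frequency table. The sketch below uses the acrylamide table from Activity 3.12 as input; the variable names are illustrative assumptions.

```python
# Class boundaries [lower, upper) and frequencies from the acrylamide table
classes = [(151, 187), (187, 223), (223, 259), (259, 295), (295, 331), (331, 367)]
freq = [3, 5, 6, 6, 8, 2]

n = sum(freq)                                            # total number of observations
midpoints = [(lower + upper) / 2 for lower, upper in classes]
rel_freq = [f / n for f in freq]                         # multiply by 100 for percentages

cum_freq = []
running = 0
for f in freq:
    running += f                                         # this class plus all previous classes
    cum_freq.append(running)
cum_rel_freq = [cf / n for cf in cum_freq]

print("Class        f    x      rel f   cum f   cum rel f")
for (lower, upper), f, x, rf, cf, crf in zip(
        classes, freq, midpoints, rel_freq, cum_freq, cum_rel_freq):
    print(f"[{lower}, {upper})  {f:>2}  {x:>5.1f}  {rf:>6.3f}  {cf:>5}  {crf:>8.3f}")
```

The last cumulative frequency must equal n and the last cumulative relative frequency must equal 1, which is a quick check on the arithmetic.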

Activity 3.13 / 3.14


A study was carried out to determine the amount of time that non-secretarial office staff spend using computer
terminals. The time spent using computers, in hours per week, for 50 staff are as follows:
0.3 2.2 3.9 5.2 7.8 10.3 12.2 12.8 14.4 16.5
0.5 2.2 4.2 5.5 8.2 10.5 12.5 13.1 14.9 17.0
0.7 2.5 4.8 6.5 8.8 10.5 12.5 13.5 15.5 17.7
1.2 2.5 5.0 7.0 9.0 10.8 12.7 13.6 16.0 18.0
1.8 2.8 5.1 7.5 9.5 11.6 12.8 14.1 16.0 22.0
Note: data from book have been sorted

1) Range:
2) Number of class intervals:
3) Width of class interval:

4) Test:
5) Lower and upper class boundaries of each interval (in table)
6) Tally (data are sorted)
7) Frequency (in table)
8) Total frequency (in table)

Class intervals   Frequency   Midpoint   Relative frequency   Cumulative frequency

3.3. SUMMARISING QUANTITATIVE DATA USING GRAPHS


3.3.1. The histogram and relative histogram
A histogram is a continuous series of rectangles of equal width but different heights drawn to display the class
frequencies.

Steps
1) Mark the class boundaries on the x-axis. The class intervals are equal in width; therefore the points
must be equidistant from one another
2) Use either f or %f on the y-axis. A proper scale showing the true zero must be used on the y-axis in
order not to misrepresent the character of the data
3) Draw a rectangle for each class directly above the corresponding interval. The height of each rectangle
is the frequency (or relative frequency) of the corresponding class
4) There are no gaps between the bars of the histogram

To interpret the histogram you must look for:


 The overall pattern and obvious deviations from this pattern
 The overall pattern can be described by its shape, centre and spread
 Location and number of peaks
 Presence of gaps and outliers
Possible shapes of the histogram
Symmetric:
 Right-hand side is a mirror image of the left-hand side

Positively skewed / Skewed to the right:


 The tail of larger values extends much farther to the right

Negatively skewed / Skewed to the left:


 The tail of smaller values extends much farther to the left

Uniform distribution:
 Frequencies are evenly distributed across the scale
Example 3.12
[Figure: histogram of acrylamide levels. x-axis: Acrylamide levels, with class boundaries marked from 115 to 403 in steps of 36; y-axis: Frequency, scaled from 0 to 9]

Comment on the shape of the distribution

Activity 3.15
Construct a histogram for time spent using computers (from Activity 3.13) and comment on the shape
Class intervals Frequency
[0, 3.5) 10
[3.5, 7.0) 8
[7.0, 10.5) 8
[10.5, 14.0) 13
[14.0, 17.5) 8
[17.5, 21.0) 2
[21.0, 24.5) 1
Total 50
3.3.2. The polygon and relative polygon
The polygon is a line graph that can be used to portray the shape of the distribution. A polygon that uses the
relative frequencies of the intervals is called a relative polygon. It has the same shape as the frequency polygon,
but uses a percentage scale / relative frequency on the y-axis.

Steps
1) Determine the class midpoint (x) of each class
2) Mark the class midpoints on the x-axis
3) Mark the frequencies on the y-axis using a proper scale
4) Plot each midpoint together with its corresponding frequency
5) Connect the successive dots with straight lines to form the polygon
6) Begin and end the line on the horizontal axis with a frequency of zero

Activity 3.16
Construct a frequency polygon for time spent using computers (from Activity 3.13)
Class intervals Frequency Midpoint
[0, 3.5) 10 1.75
[3.5, 7.0) 8 5.25
[7.0, 10.5) 8 8.75
[10.5, 14.0) 13 12.25
[14.0, 17.5) 8 15.75
[17.5, 21.0) 2 19.25
[21.0, 24.5) 1 22.75
3.3.3. Ogive (cumulative curve) and relative ogive
An ogive is a smooth curve that can be used to estimate graphically the number of observations below or
above a set level. Therefore an ogive requires cumulative class frequencies.

Steps
1) The frequency distribution must show class boundaries and cumulative frequencies
2) The frequency scale on the y-axis must extend to the total of the frequencies
3) Mark the class boundaries on the x-axis
4) For each class plot the upper boundary together with the cumulative ‘less-than’ class frequency
5) Draw a smooth curve through the points. The ‘less-than’ curve slopes upwards and to the right
6) The ‘less-than’ ogive begins on the horizontal axis at the lower class boundary of the first class
7) If the cumulative frequencies are expressed as percentages of the total, a relative ogive can be drawn
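The points that are plotted for a 'less-than' ogive, and a graphical-style estimate read off from it, can be sketched in code. The example uses the acrylamide table from Activity 3.12. It assumes straight-line interpolation between plotted points instead of a hand-drawn smooth curve, so the estimate is approximate; the functions `ogive_points` and `count_below` are hypothetical helpers.

```python
def ogive_points(classes, freq):
    """Return (x, cumulative frequency) points for a 'less-than' ogive."""
    points = [(classes[0][0], 0)]            # start at the first lower boundary with 0
    running = 0
    for (lower, upper), f in zip(classes, freq):
        running += f
        points.append((upper, running))      # upper boundary with 'less-than' frequency
    return points

def count_below(points, level):
    """Estimate the number of observations below `level` by linear interpolation."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= level <= x1:
            return y0 + (level - x0) * (y1 - y0) / (x1 - x0)
    raise ValueError("level lies outside the class boundaries")

classes = [(151, 187), (187, 223), (223, 259), (259, 295), (295, 331), (331, 367)]
freq = [3, 5, 6, 6, 8, 2]

points = ogive_points(classes, freq)
print(points)
print(count_below(points, 277))    # estimated number of observations below 277
```

Dividing each cumulative frequency by the total (30 here) gives the points of the relative ogive.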

Activity 3.17
Construct an ogive of the amount of time staff spent on computers (from Activity 3.13). Comment on the
graph
Class intervals   x-value   Cumulative frequency
[0, 3.5)          ____      10
[3.5, 7.0)        ____      18
[7.0, 10.5)       ____      26
[10.5, 14.0)      ____      39
[14.0, 17.5)      ____      47
[17.5, 21.0)      ____      49
[21.0, 24.5)      ____      50
3.4. USING SOFTWARE
 There are a number of useful software packages available for data presentation which are easy to use
 Computers can help you develop your ideas about how to organize the information by using a ‘try and
refine’ approach, which would take too long to carry out manually
o For example, if you decide to break the information down in a certain way, and the results are
not what you need, it is a simple matter to create new ones and experiment again
 Computer software can produce accurate and professional graphs and charts from data, but they are
only as useful as the data and instructions used to make them
UNIT 4: SUMMARISING DATA USING NUMERICAL DESCRIPTORS

In this unit we look at numerical measures that can be used to describe the characteristics of data collected in
their raw form (ungrouped data) as well as for data summarised into frequency distributions (grouped data).

After completion of this unit you will be able to:


 Compute the mean, median and mode for both grouped and ungrouped data
 Describe the characteristics of the mean, median and mode
 Compute the range and standard deviation
 Compute and interpret the coefficient of variation and the coefficient of skewness
 Locate and plot the mean, median and mode for symmetrical and skewed distributions
 Compute measures of relative standing

Descriptive measures are numbers used to describe data sets


 Statistics = summary measures used to describe a sample
 Parameters = summary measures used to describe a population
 For the purpose of this course: our data sets are samples, unless otherwise stated

Data are characterised based on:


 Location (measures of central tendency)
 Dispersion (measures of variability / spread / dispersion)
 Shape (measures of skewness)

4.1. MEASURES OF CENTRAL TENDENCY


These measures are numbers that represent the typical or middle values in a distribution. The commonly used
measures of central tendency are:
 Arithmetic mean; Median; Mode

4.1.1. Arithmetic mean


This is the most commonly used measure of central tendency and is referred to as the average or the mean.
The mean of a data set is the centre of gravity. It is the middle of the actual numerical values of all the
observations. The arithmetic mean is the sum of all the values in a data set divided by the number of
observations.

Population mean (parameter): $\mu$
Sample mean (statistic): $\bar{x}$
Ungrouped / raw data
 Ungrouped or raw data = a list of numbers in any order

 The sample mean for a variable X is $\bar{x} = \frac{\sum x}{n}$
o Add the individual observations: $\sum x$
o Count the number of observations: n
o Divide the sum of all values by n
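As a short illustration of the three steps, the sketch below computes the sample mean of the data from Example 3.6 in Python and compares it with the standard library function `statistics.mean`. The variable names are illustrative only.

```python
import statistics

data = [4, 80, 50, 10, 5]      # raw data from Example 3.6

total = sum(data)              # add the individual observations
n = len(data)                  # count the number of observations
mean = total / n               # divide the sum by n

print(total, n, mean)                  # 149 5 29.8
print(statistics.mean(data))           # same result from the standard library
```

Note how the single large value 80 pulls the mean (29.8) well above most of the observations.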

Activity 4.1
A city planner working on bikeways needs information about local bicycle commuters. She designs a
questionnaire. One of the questions asks how many minutes it takes the rider to pedal from home to his or her
destination. A sample of 12 local bicycle commuters yielded the following times:
22 29 27 30 12 22 31 15 26 16 48 23

Determine the mean traveling time.

Grouped / frequency data


 We cannot calculate exact values of the mean without raw data
 If the data are in the form of an ungrouped frequency table, the mean can be calculated exactly
 If the data are in the form of a grouped frequency table, the mean can only be approximated / estimated
 For both ungrouped and grouped frequency table data we use the same formula, but different
procedures:

o $\bar{x} = \frac{\sum xf}{n}$
o Procedure for ungrouped frequency table data:
 Multiply every value of the variable (x) by the corresponding frequency (f)
 Sum the values of xf
 Divide the sum by n
o Procedure for grouped frequency table data:
 Multiply the midpoint of every class (x) by the corresponding frequency (f)
 Sum the values of xf
 Divide the sum by n
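Both procedures use the same calculation and differ only in what is used for x. The sketch below illustrates this with a hypothetical helper `frequency_mean`: first on a small made-up ungrouped frequency table, then on the acrylamide table from Activity 3.12 with class midpoints as the x-values. The grouped result is an estimate only.

```python
def frequency_mean(x_values, frequencies):
    """Mean from a frequency table: sum of x*f divided by n, where n is the sum of f."""
    n = sum(frequencies)
    total = sum(x * f for x, f in zip(x_values, frequencies))
    return total / n

# Ungrouped frequency table (hypothetical data): x is the value itself, so the mean is exact
print(frequency_mean([1, 2, 3], [2, 5, 3]))

# Grouped frequency table (acrylamide data): x is the class midpoint, so the mean is an estimate
classes = [(151, 187), (187, 223), (223, 259), (259, 295), (295, 331), (331, 367)]
freq = [3, 5, 6, 6, 8, 2]
midpoints = [(lower + upper) / 2 for lower, upper in classes]
grouped_mean = frequency_mean(midpoints, freq)
print(grouped_mean)
```

For the grouped table the sum of xf is 7842 and n is 30, giving an estimated mean of 261.4.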
Extra activity
Calculate the mean number of cellphones in a household from the following ungrouped frequency table:
Number of cellphones (x) Frequency (f) xf
0 5
1 18
2 26
3 8
4 3
Total (n)

Activity 4.2
Calculate the mean number of hours of personal computer usage per week from the following grouped
frequency table:
Class Frequency (f) Midpoint (x) xf
[1.95, 3.95) 2
[3.95, 5.95) 5
[5.95, 7.95) 5
[7.95, 9.95) 3
[9.95, 11.95) 1
Total (n)
Characteristics of the arithmetic mean
1. It is the arithmetic average of all the quantitative measurements in the data set
2. Every numerical data set has only one mean
3. It is reliable because it reflects all the values in the data set
4. It is sensitive to every value in the data set and can be greatly affected by the presence of even a single
extreme value (or outlier)
Note: An outlier is an unusually large or small observation in comparison with the rest of the values
5. It is useful for further inferential statistical procedures
6. It can be calculated using a pocket calculator with pre-programmed formulae
o Setup → Down arrow → 3:Stat → 1:On
o Mode → 2:Stat → 1:(1-VAR)
o Enter x-values (raw data, discrete values or class midpoints from frequency table) under X
o Enter frequencies under FREQ
o STAT → 4:Var → 2: $\bar{x}$

4.1.2. Median
The median is the value that occupies the middle position of a data set arranged in a numerical order. This
means that there are an equal number of data values in the ordered distribution that are above it and below it.
 Interpretation: at most 50% of observations are below the median value and at most 50% of
observations are above the median value

Ungrouped / raw data


 Arrange the data in numerical order
 Determine the median position = $\frac{n + 1}{2}$
o If n is odd, median position is a whole number
o If n is even, median position is a fraction
 Find the median value:
o If n is odd, the median is the value in the middle of the data set, i.e. the value in the median
position
o If n is even, the median is the average of the two middle values in the data set
 Round down the median position value and find the x-value in that position of the data
set
 Round up the median position value and find the x-value in that position of the data set
 Take an average of the two x-values
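The position rule can be sketched in code as follows. The function `median_by_position` is a hypothetical helper; the odd case uses the data of Example 3.6 and the even case uses made-up data.

```python
import math

def median_by_position(data):
    """Return (median position, median value) using the (n + 1) / 2 rule."""
    ordered = sorted(data)                        # arrange the data in numerical order
    n = len(ordered)
    position = (n + 1) / 2                        # whole number if n is odd, fraction if even
    lower = ordered[math.floor(position) - 1]     # value at the rounded-down position
    upper = ordered[math.ceil(position) - 1]      # value at the rounded-up position
    return position, (lower + upper) / 2          # both are the same value when n is odd

print(median_by_position([4, 80, 50, 10, 5]))     # n odd: position 3.0, median 10.0
print(median_by_position([3, 9, 5, 7]))           # n even (hypothetical data): 2.5, 6.0
```

Positions are counted from 1 in the ordered data, which is why 1 is subtracted when indexing the Python list.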
Activity 4.3 (1)
Determine the median typing speed (in words per minute) of five secretaries from the following data.
30 90 45 25 55

Ordered set:

Median position:

Median value:

Interpretation:

Activity 4.3 (2)


Determine the median calorie content of pizzas from different outlets (data in textbook sorted below)

Numerical order: 275 296 299 322 323 332 333 337 347 350 353 357 358 393

Median position:

Median value:

Interpretation:
Grouped / frequency data
For ungrouped frequency data, we use the cumulative frequencies to find the median value (not in textbook).
 Add cumulative frequencies to the table
 Determine the median position = $\frac{n + 1}{2}$
 Find the median value:
o If n is odd
 The median is the first x-value for which the cumulative frequency is greater than or
equal to the median position
o If n is even
 Round down the median position value and find the first x-value for which the
cumulative frequency is greater than or equal to the rounded position value
 Round up the median position value and find the first x-value for which the cumulative
frequency is greater than or equal to the rounded position value
 Take an average of the two x-values

Extra activity
Calculate the median number of cellphones in a household from the following ungrouped frequency table:
Number of cellphones (x) Frequency (f) Cumulative frequency
0 5
1 18
2 26
3 8
4 3
Total (n) 60
With grouped frequency data we are unable to determine where the true middle value falls, but we can use a
formula or the ogive to estimate the median.

Using the formula:


 Add cumulative frequencies to the table
 Determine the location of the median = $\frac{n}{2}$
 The median class is the first interval for which the cumulative frequency is greater than or equal to the
median position
 Estimate the value of the median using the formula:
o $\text{Median} = L + \frac{\left(\frac{n}{2} - F\right) \times c}{f_m}$
 L = lower boundary of median class
 $f_m$ = frequency of median class
 c = class width
 F = cumulative frequency just before the median class
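A minimal sketch of the formula in Python is shown below, applied to the acrylamide table from Activity 3.12. The function `grouped_median` is a hypothetical helper written for this illustration.

```python
def grouped_median(classes, freq):
    """Estimate the median of grouped data: L + (n/2 - F) * c / f_m."""
    n = sum(freq)
    position = n / 2                                  # location of the median
    cumulative = 0                                    # F: cumulative frequency so far
    for (lower, upper), f in zip(classes, freq):
        if cumulative + f >= position:                # first class reaching the position
            c = upper - lower                         # class width
            return lower + (position - cumulative) * c / f
        cumulative += f
    raise ValueError("frequency table is empty")

classes = [(151, 187), (187, 223), (223, 259), (259, 295), (295, 331), (331, 367)]
freq = [3, 5, 6, 6, 8, 2]
median = grouped_median(classes, freq)
print(median)
```

Here n/2 = 15, the median class is [259, 295) with L = 259, F = 14, f_m = 6 and c = 36, so the estimate is 259 + (15 − 14) × 36 / 6 = 265.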

Activity 4.4
Estimate the median number of hours of personal computer usage per week for a sample of 16 people.
Class Frequency (f) Cumulative frequency
[1.95, 3.95) 2
[3.95, 5.95) 5
[5.95, 7.95) 5
[7.95, 9.95) 3
[9.95, 11.95) 1
Total (n) 16

Median position:

Median value:
Using the ogive:
 Find the median position $\frac{n}{2}$ on the y-axis
 Draw a straight horizontal line from the y-axis to the ogive
 Draw a straight vertical line from the ogive to the x-axis
 The corresponding value on the x-axis is the estimated median

Use the ogive of this data to estimate the median value and compare your answer to the estimate using the
formula (not in textbook)
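[Figure: ogive of hours of personal computer usage. x-axis: Hours, with class boundaries 1.95, 3.95, 5.95, 7.95, 9.95 and 11.95; y-axis: Cumulative frequency; plotted cumulative frequencies: 0, 2, 7, 12, 15, 16]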

Characteristics of the median


1. It is the central value: 50% of the measurements lie above it and 50% lie below it
2. There is only one median for a data set
3. Extreme measurements (outliers) do not affect the median as strongly as they do the mean, because
the median depends only on the value(s) in the middle position
4. It is a better measure of central tendency to use than the mean when the data are very skewed
4.1.3. Mode
The mode of a data set is the value that occurs most frequently. It can be a good measure to represent a typical
value such as the most popular shirt size.

Ungrouped / raw data


 Count the number of observations that occur for each data value
 If there is no value that occurs more often than the others, then there is no mode
o Note: This is not the same as a mode of zero
 A set of data can have no mode, one mode (uni-modal) or more than one mode (bi-modal or
multi-modal)
o This is referred to as the modality

Activity 4.5
A telephone company conducted a study on the length of calls. Determine the modal length of calls for a
sample of 10 calls (in minutes): 1.4 15.5 2.1 8.0 15.5 1.4 17.7 7.2 9.1 15.5

Ordered array

Mode

Modality

Interpretation

Grouped / frequency data


For ungrouped frequency table data the mode is simply the value(s) of the variable that occur(s) more than
any other value, i.e. highest frequency. Identify the mode of number of cellphones:
Number of cellphones (x) Frequency (f)
0 5
1 18
2 26
3 8
4 3
Total (n) 60
For grouped frequency table data the mode can be estimated either graphically or by making use of a
formula. Grouped data do not show a single most frequently occurring value, so we assume that the mode
occurs in the interval with the highest frequency.

Using the formula


 Select the class containing the highest frequency as the modal class
 Determine the $\Delta_1$ value by subtracting the frequency of the class preceding the modal class from the
frequency of the modal class
 Determine $\Delta_2$ by subtracting the frequency of the class following the modal class from the frequency
of the modal class
 Use the formula to estimate the modal value: $\text{Mode} = L + \left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right) \times c$
o L = lower boundary of modal class
o c = class width
o $\Delta_1$ = frequency of modal class minus frequency of previous class
o $\Delta_2$ = frequency of modal class minus frequency of following class
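The formula can be sketched in Python as follows, using the acrylamide table from Activity 3.12 as input. The function `grouped_mode` is a hypothetical helper. Two choices in it are assumptions made for this illustration: if several classes share the highest frequency it uses the first one, and a missing neighbouring class (when the modal class is the first or last class) is treated as having a frequency of zero.

```python
def grouped_mode(classes, freq):
    """Estimate the mode of grouped data: L + d1 / (d1 + d2) * c."""
    i = freq.index(max(freq))                         # modal class (first one if tied)
    lower, upper = classes[i]
    previous_f = freq[i - 1] if i > 0 else 0          # assumption: no class before -> 0
    following_f = freq[i + 1] if i < len(freq) - 1 else 0   # assumption: no class after -> 0
    d1 = freq[i] - previous_f                         # delta 1
    d2 = freq[i] - following_f                        # delta 2
    if d1 + d2 == 0:
        raise ValueError("neighbouring classes have the same frequency as the modal class")
    return lower + d1 / (d1 + d2) * (upper - lower)

classes = [(151, 187), (187, 223), (223, 259), (259, 295), (295, 331), (331, 367)]
freq = [3, 5, 6, 6, 8, 2]
mode = grouped_mode(classes, freq)
print(mode)
```

The modal class is [295, 331) with frequency 8, so delta 1 = 8 − 6 = 2 and delta 2 = 8 − 2 = 6, giving 295 + (2 / 8) × 36 = 304.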

Activity 4.6
Estimate the modal number of hours of personal computer usage per week from the following grouped
frequency table data:
Class Frequency (f)
[1.95, 3.95) 2
[3.95, 5.95) 5
[5.95, 7.95) 9
[7.95, 9.95) 3
[9.95, 11.95) 1
Total (n) 20

Modal class:

Modal value:
Using the histogram
 Identify the tallest bar on the histogram as the modal bar
 Draw a line from the top right corner of the modal bar to the top right corner of the bar to its immediate
left
 Draw a line from the top left corner of the modal bar to the top left corner of the bar to its immediate
right
 Draw a line parallel to the y-axis through the intersection point of the previous two lines down to the
x-axis
 The value on the x-axis approximates the modal value

Use the histogram of this data to estimate the modal value and compare your answer to the estimate using the
formula (not in textbook)
[Figure: histogram of hours of personal computer usage. x-axis: Hours, with classes [1.95, 3.95), [3.95, 5.95), [5.95, 7.95), [7.95, 9.95) and [9.95, 11.95); y-axis: Frequency, scaled from 0 to 10]

Characteristics of the mode


 It is the most frequently occurring value in the data set
 It is not based on all the data in the sample
 It is not affected by extreme values
 It can be used for both quantitative and qualitative data
 Sometimes the mode does not exist or it is possible that there is more than one mode
o For these reasons, many people consider the mode as an unreliable measure of central tendency
4.1.4. Choose between the mean, median or mode
An average (measure of central tendency) should convey an impression of a distribution in a single value. It
is important to use the right type of average. All the averages are different measures with different uses. The
factors that play a role in choosing the right average are the following:
1. Is the nature of the data numerical or non-numerical?
 The mode, which is the value that occurs most often, is the only measure of central tendency useful
for qualitative (non-numerical) data that you cannot rank in any way. You can also use the mode
for all other qualitative or quantitative (numerical) data sets.
 If you can rank qualitative data sets, you can use the median. The median is also valid for all
quantitative data sets.
 The arithmetic mean involves arithmetic and is appropriate only for quantitative data sets.
2. What does each average tell us?
Depending on the problem, one measure may be superior to another, and in some other cases, you can
use all three in conjunction.
 The mode identifies the most common or ‘typical’ value, or the value that occurs more often than
the others do. The mode conveys the least amount of information about the data set as a whole. In
some samples, the mode may be in the middle of the distribution but in others, it may be a value at
one end of the distribution. It is also possible to have more than one mode, which eliminates the
mode as an option. Outliers do not influence the mode at all; it stays at the peak of the distribution.
 The median indicates the centre of the distribution. There are the same number of observations that
lie above and below the median, regardless of how far above or how far below. This means that it
is unlikely that outliers at either end of the distribution will affect the median very much.
 The mean is the most frequently used average because it includes all the values in the data set. This
feature makes it the most sensitive to extreme values.
3. What is the shape of the distribution?
 In a symmetrical distribution, the mean, median and mode will be the same or very close together.
Whichever one you choose will give you the same answer.
 If there are extreme values present on one side of the data set, the distribution is skewed. If the
mean is very different from the median, the median will be a better option to use.

Activity 4.7
The National Housing Department conducted a survey to estimate the average number of livable square meters
for low cost housing. The reported mean was 24.5 square meters and the median 22.2 square meters. Which
measure of central tendency is more appropriate? Explain your answer.
4.2. MEASURES OF DISPERSION
An average summarises a set of data in just one number. Two sets of data can have the same mean and yet be
very different if one is more spread out than the other. To describe this difference quantitatively, we use a
measure of dispersion / variability / spread. This is a descriptive measure that indicates the amount of variation
in a data set.

Some measures of dispersion are:


 The range
 Mean absolute deviation
 Standard deviation
 Variance
 Coefficient of variation

4.2.1. The range


The range is the difference between the highest and lowest values in a data set. It measures the distance across
the entire set of data but its usefulness as a measure of dispersion is limited. For grouped frequency data the
range is the difference between the upper boundary of the last interval and the lower boundary of the first
interval.

Range = largest value − smallest value

The midrange is the average of the largest and smallest values: (largest value + smallest value) / 2.

4.2.2. Mean absolute deviation (MAD)


Not included in curriculum

4.2.3. Standard deviation (s)


This is the most widely used measure of dispersion. It measures how far, on average, each data value is
from the mean. To prevent negative deviations from the mean from cancelling positive deviations, the
differences are squared.
1. It uses all the entries in the data set and is therefore sensitive to outliers
2. The larger the standard deviation the larger the variation in the data. A standard deviation of zero
means there is no variation
3. It is useful for further inferential statistical procedures because most statistical theories are based
on distributions described by their mean and standard deviation
4. It is expressed in the original units of measurement (Rands, meters, etc.)
5. It can be calculated using a calculator with pre-programmed formulae
Ungrouped / raw data
The standard deviation is calculated as the square root of the average squared deviation from the mean, using
the following formula:

$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$

• $\bar{x}$ = sample mean
• $(x - \bar{x})$ = deviation from the mean
• $(x - \bar{x})^2$ = squared deviation from the mean
• $\sum (x - \bar{x})^2$ = sum of squared deviations from the mean
• $\frac{\sum (x - \bar{x})^2}{n - 1}$ = average squared deviation from the mean
 For the sample standard deviation (s) the average of the squared deviations is obtained by dividing by
(n – 1), not n
o This is known as degrees of freedom
o It corrects the bias in estimating the population standard deviation using the sample standard
deviation
 The population standard deviation is denoted by the Greek letter σ
 A large amount of variability in the data is indicated by a relatively large standard deviation
o Note: the value of s is interpreted in terms of how the variable was measured
o It is therefore difficult to directly interpret s and determine if it is large or small
o Under certain conditions and distributional assumptions, s has a practical interpretation (more
later when we do inference)
 If two samples are compared on the same variable, the sample with the smaller s has less variability
than the other sample
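
The calculation for ungrouped data can be sketched in Python as follows. This is only an illustration: the five data values are hypothetical and are not taken from any activity in this unit. The result is cross-checked against `statistics.stdev` from the Python standard library, which also divides by n − 1.

```python
import math
import statistics

# Hypothetical sample (illustrative values only, not from the study guide)
data = [4, 8, 6, 5, 7]

n = len(data)
data_range = max(data) - min(data)               # largest value - smallest value
mean = sum(data) / n                             # sample mean
squared_devs = [(x - mean) ** 2 for x in data]   # squared deviations from the mean
variance = sum(squared_devs) / (n - 1)           # divide by n - 1, not n
std_dev = math.sqrt(variance)                    # back to the original units

print(f"range = {data_range}, mean = {mean}")
print(f"variance = {variance}, standard deviation = {std_dev:.3f}")
print(f"library check: {statistics.stdev(data):.3f}")
```

Note that the variance is in squared units, while the standard deviation is in the same units as the data.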

4.2.4. Variance (s²)


The variance is the standard deviation squared (s²). This is a very popular measure in describing data, but the
main drawback is that the unit of measure is also squared.
Activity 4.10
In a study on bikeways the number of minutes it takes a sample of 12 local bicycle commuters to pedal from
home to their destination is recorded.
22 29 27 30 12 22 31 15 26 16 48 23

Determine the range, standard deviation and variance of the traveling time for the riders and interpret the result
in the context of the data.

Range =

x        $x - \bar{x}$        $(x - \bar{x})^2$
22
29
27
30
12
22
31
15
26
16
48
23
Sum =    Sum =                Sum =

Mean =

Standard deviation =

Variance =
Grouped data
For both ungrouped and grouped frequency data we can calculate / estimate the standard deviation using the
following formula:

$$s = \sqrt{\frac{\sum f (x - \bar{x})^2}{n - 1}}$$

• Need a frequency table with the columns: values / midpoints and frequencies
• Compute the mean $\bar{x}$
• Compute the squared deviation between each value / midpoint and the mean
• Multiply this by the corresponding frequency
• Add the results
• Divide by n − 1
• Take the square root to get the standard deviation
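
The steps above can be sketched in Python as follows. The frequency table is hypothetical (three classes represented by their midpoints) and serves only to illustrate the procedure. For grouped data the result is an estimate, because each observation is represented by its class midpoint.

```python
import math

# Hypothetical grouped frequency table: (class midpoint x, frequency f)
table = [(5, 2), (15, 5), (25, 3)]

n = sum(f for _, f in table)                       # total number of observations
mean = sum(x * f for x, f in table) / n            # weighted mean
weighted_sq_devs = sum(f * (x - mean) ** 2 for x, f in table)
variance = weighted_sq_devs / (n - 1)              # divide by n - 1
std_dev = math.sqrt(variance)

print(f"n = {n}, mean = {mean}")
print(f"variance = {variance:.3f}, standard deviation = {std_dev:.3f}")
```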

Extra example (ungrouped frequency data)


Calculate the range, standard deviation and variance of the number of cellphones per household.

Range =
Mean =

Number of cellphones (x)    Frequency (f)    $x - \bar{x}$    $(x - \bar{x})^2$    $(x - \bar{x})^2 f$
0                           5
1                           18
2                           26
3                           8
4                           3
Total (n)                   60

Standard deviation =

Variance =
Activity 4.11
Estimate the range, standard deviation and variance of number of hours of personal computer usage per week.

Range =
Mean =

Number of hours (x)    Frequency (f)    $x - \bar{x}$    $(x - \bar{x})^2$    $(x - \bar{x})^2 f$
[1.95, 3.95)           2
[3.95, 5.95)           5
[5.95, 7.95)           9
[7.95, 9.95)           3
[9.95, 11.95)          1
Total (n)              20

Standard deviation =

Variance =

4.2.5. Coefficient of variation (CV)


The coefficient of variation is a relative measure of dispersion: the ratio of the standard deviation to the
mean, expressed as a percentage. It is sometimes used as a measure of risk. It is calculated as follows:

$$CV = \frac{s}{\bar{x}} \times 100$$

This is a unit-free number because the standard deviation and mean are measured using the same units. This
measure is useful to compare two or more sets of data with different means, measurement units or sample
sizes. The higher the result the more variability there is in a set of data.
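
As a rough illustration, the following Python sketch compares the relative variability of two data sets measured in different units. The function name `coefficient_of_variation` and the summary statistics are hypothetical and chosen only for this example.

```python
def coefficient_of_variation(mean, std_dev):
    """Return the CV as a percentage: (s / mean) * 100."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return std_dev / mean * 100

# Hypothetical summary statistics measured in different units
cv_height = coefficient_of_variation(mean=170, std_dev=8.5)  # centimetres
cv_mass = coefficient_of_variation(mean=70, std_dev=7)       # kilograms

print(f"CV of heights = {cv_height:.1f}%")
print(f"CV of masses  = {cv_mass:.1f}%")
```

Because the CV is unit-free, the two percentages can be compared directly even though the standard deviations themselves (8.5 cm and 7 kg) cannot.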
Activity 4.12
Two growers of grapefruit have obtained the following statistics regarding the mass of their current crops.
Grower A: $\bar{x}$ = 300 g with s = 20 g
Grower B: $\bar{x}$ = 280 g with s = 40 g

Which grower’s grapefruit are more uniform in mass?

4.3. MEASURES OF SHAPE


The shape of a distribution can be described by:
1. Its skewness, i.e. its symmetry or lack of it
2. Its kurtosis, i.e. its peakedness

4.3.1. Skewness (SK)


• Skewness relates to the shape of the stem-and-leaf plot, histogram or polygon that could be drawn from
the data set
• The shape influences the locations of the mean, median and mode in the data set, e.g. whether the
mean is larger or smaller than the median

Symmetrical or normal distributions


 Left side is a mirror image of the right side
 Mean = Median = Mode
 No outliers
 Skewness coefficient (SK) = 0

[Figure: symmetrical distribution with the mean, median and mode at the centre]
Skewed distributions
A distribution is skewed if the tail of the graph extends more to one side than the other. In a skewed distribution,
the mode stays at the peak of the distribution because outliers do not influence the mode at all. Outliers have
the greatest influence on the arithmetic mean, because the mean is affected by all values in the data set,
including the extreme ones, and it therefore tends to be located toward the tail of the skewed distribution.
The median depends on the position of the values in the ordered data set rather than on their size, so it is
less sensitive to outliers than the mean. It is located somewhere between the mode and the mean.

Positively skewed distributions


 Majority of data values are concentrated on the left
 Tail to the right will be longer than to the left
 Mode < Median < Mean

[Figure: positively skewed distribution with the mode, median and mean marked]

Negatively skewed distributions


 Majority of data values are concentrated on the right
 Tail to the left will be longer than to the right
 Mean < Median < Mode

[Figure: negatively skewed distribution with the mean, median and mode marked]
Pearson’s second coefficient of skewness
• Arithmetic measure of skewness
• Indicates which pattern is in the data
• Calculated using the formula:

$$SK = \frac{3(\text{mean} - \text{median})}{\text{standard deviation}}$$

• SK is measured on a scale from −3 to +3
• If SK = 0 → symmetrical distribution
• If SK > 0 but less than +3 → positively skewed
• If SK < 0 but greater than −3 → negatively skewed
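
A minimal Python sketch of this calculation and its interpretation is given below. The function names and the summary statistics are hypothetical and serve only as an illustration.

```python
def pearson_skewness(mean, median, std_dev):
    """Pearson's second coefficient of skewness: 3(mean - median) / s."""
    return 3 * (mean - median) / std_dev

def describe_skewness(sk):
    """Translate the sign of SK into the pattern it indicates."""
    if sk > 0:
        return "positively skewed"
    if sk < 0:
        return "negatively skewed"
    return "symmetrical"

# Hypothetical summary statistics (illustrative values only)
sk = pearson_skewness(mean=52, median=50, std_dev=6)
print(f"SK = {sk:.2f}: {describe_skewness(sk)}")
```

In practice a sample value of SK is rarely exactly zero, so values close to zero are usually read as approximately symmetrical.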

Activity 4.13
In a sample showing the sodium contents (in milligrams per kilogram) of chocolate pudding made from instant
mix, the mean is 2965.20, the median is 2946 and the standard deviation is 543.52. Calculate a coefficient
of skewness for this distribution and interpret your answer.

4.3.2. Measures of kurtosis


• The amount of peakedness of a distribution
• The flatter the curve, the greater the spread of the data → the standard deviation is larger relative to the
mean
• A formula exists, but it is easier to determine the extent of the kurtosis by observing the frequency
polygon
• If the distribution is …
o High (peaked) and thin → leptokurtic
o Flat and spread → platykurtic
o More normal, i.e. neither very peaked nor flat → mesokurtic
4.4. INTERPRETING CENTRE AND VARIABILITY
1. Dispersion is the amount of spread that occurs in a data set. It can be interpreted as the size of a ‘typical’
deviation from the mean. If the values in the data set are close to the mean, the standard deviation is
small, but if the values are widely dispersed about their mean, the standard deviation is large.
2. In comparing two data sets with the same unit of measure, the one with the larger standard deviation
has the greater amount of variability and one with the smaller standard deviation is more consistent,
with less variability among the numbers in the data set.
3. If you have a single data set, the mean can be combined with the standard deviation to obtain
information about how values in a data set are distributed along a number line. To do this, we describe
how far away a particular observation is from the mean in terms of the standard deviation. For example,
we might say that an observation is 2 standard deviations above the mean or 1 standard deviation below
the mean. The number of standard deviations is known as the z-score or the standardised value:

$$z = \frac{x - \bar{x}}{s}$$

Consider a data set with a mean of 100 and a standard deviation of 15:
• Determine which values are within 1 standard deviation of the mean
o The mean minus 1 standard deviation = 100 − 15 = 85
▪ This means that “85” is 1 standard deviation below the mean
o The mean plus 1 standard deviation = 100 + 15 = 115
▪ This means that “115” is 1 standard deviation above the mean
o All observations that fall between “85” and “115” are within 1 standard deviation of the mean
• Determine how many standard deviations the values 70 and 130 are above / below the mean
o Value 70: $z = \frac{70 - 100}{15} = -2$
▪ The value “70” is 2 standard deviations below the mean
o Value 130: $z = \frac{130 - 100}{15} = 2$
▪ The value “130” is 2 standard deviations above the mean

4. The following two rules can be applied, depending on the shape of the distribution:
• If the distribution is symmetrical and bell-shaped, use the Empirical Rule to make a statement about
the proportion of data values that fall into an interval
o Approximately 68% of observations fall within 1 standard deviation from the mean
o Approximately 95% of observations fall within 2 standard deviations from the mean
o Approximately 99.7% of observations fall within 3 standard deviations from the
mean
• A more general interpretation of the standard deviation is derived from Chebyshev’s
theorem, which applies to distributions of all shapes
o At least $\left(1 - \frac{1}{z^2}\right) \times 100\%$ of observations lie within z standard deviations from the
mean (for z > 1)
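
The z-score and Chebyshev calculations can be sketched in Python as follows, using the example data set with a mean of 100 and a standard deviation of 15. The function names are hypothetical and chosen only for this illustration.

```python
def z_score(x, mean, s):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / s

def chebyshev_min_percent(z):
    """Minimum percentage of observations within z standard deviations of the mean."""
    if z <= 1:
        raise ValueError("Chebyshev's theorem is only informative for z > 1")
    return (1 - 1 / z**2) * 100

mean, s = 100, 15

for x in (70, 130):
    print(f"x = {x}: z = {z_score(x, mean, s):+.1f}")

for z in (2, 3):
    low, high = mean - z * s, mean + z * s
    print(f"{low} to {high}: at least {chebyshev_min_percent(z):.1f}% of observations")
```

For z = 2 Chebyshev's theorem guarantees at least 75%, which is weaker than the approximately 95% of the Empirical Rule. That is the price of a statement that holds for distributions of any shape.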

Activity 4.14
Consider the time (in minutes) taken to travel to work for a sample of 25 people from Gauteng, with:
 Mean = 31.7 minutes; Median = 30.88 minutes; Standard deviation = 7.94 minutes

Calculate Pearson’s second coefficient of skewness and comment on the results

Activity 4.15
The number of cars entering a parking area during a sample of ten-minute intervals yields the following
summary statistics:
 Mean = 19.6; Median = 22.5; Standard deviation = 8.72

Calculate Pearson’s second coefficient of skewness and comment on the results

Determine the range of values for 2 standard deviations around the mean

Use Chebyshev’s theorem to determine the proportion of the distribution that falls between 6.52 and 32.68
UNIT 6: SUMMARISING BIVARIATE DATA: SIMPLE REGRESSION AND CORRELATION ANALYSIS

Outcomes
This unit deals with methods to summarize data consisting of observations on two quantitative variables
(bivariate data) presented as ordered pairs. The purpose is to understand if there is a relationship between
two variables and what you can do if a relationship exists.

After completion of this unit you will be able to:


 Explain the purpose of regression and correlation analysis
 Compute and explain the meaning of the coefficients of correlation and determination
 Compute the regression equation and use this measure to do estimates
 Compute correlation by making use of ranking

Purpose
Regression and correlation analysis are statistical tools used to study the relationship between two variables,
of which one is dependent and the other independent. They are used to determine:
 Whether there is a relationship between the variables,
 How good that relationship is
 How the relationship can be used to make estimates

Examples of relationships
 Sales and earnings
 Cost and number of items produced
 Shoe size and intelligence
 Effort and results

6.1. RESPONSE VARIABLE (Y) AND EXPLANATORY VARIABLE (X)


 Each observation of bivariate data is a data pair of the form (x, y)
 The x is the explanatory (or independent) variable and y is the response (or dependent) variable
 The response variable y is the variable whose value depends on, or can be explained by, the value of
the explanatory or independent x-variable
 When we analyze data on two variables, the first step is to distinguish between the response variable
and the explanatory variable
Activity 6.1
Identify the independent x and dependent y variables for each of the following:
1. Size of advertisements of companies in the yellow pages and the number of calls to the business that
were generated by the advertisement
Dependent =
Independent =
2. Number of hours of training per employee and revenue of the company
Dependent =
Independent =
3. Hours of training to use Excel and number of errors made using the program
Dependent =
Independent =
4. Daily temperature and ice-cream sales
Dependent =
Independent =
5. Body mass and calorie intake
Dependent =
Independent =
6. Circulation of a magazine and advertising charges
Dependent =
Independent =
7. The hardness and tensile strength of die-cast aluminum
Dependent =
Independent =
8. Wattage of a heater and the effective heating area
Dependent =
Independent =
9. Number of cavities and sugar intake of children
Dependent =
Independent =
10. Number of police on the streets and number of crimes
Dependent =
Independent =
6.2. SCATTER DIAGRAM
A scatter diagram (or scatter plot) is a graph using data from two quantitative variables (bivariate data). This
graph can identify the type of relationship that might exist between two variables.

Steps
1. Collect pairs of data (x, y). The data are paired in a way that matches each value from one data set with
a corresponding value from a second data set
2. Select which variable is the dependent (y) variable and which is the independent (x) variable. The label
y goes to the variable which we want to predict. The other variable is then labeled as x
3. Arrange the data in two columns, x and y
4. Draw a set of axes
5. The horizontal axis represents the x-variable and is scaled so that any x value can be easily located
6. The vertical axis represents the y-variable and is scaled so that any y value can be easily located
7. Each pair of observations (x, y) is plotted as a point. That is where a vertical line from the value on the
x-axis meets a horizontal line from the value on the y-axis
8. The points are not connected
9. Scatter plots can take on the following patterns:
• The plot can show no relationship, because no pattern can be identified

• The plot can show a positive relationship because the dots start at the bottom left and move
upwards to the top right. Although the data points do not fall exactly on a line, they appear to
cluster about a line. A positive relationship means that as the x-variable increases, the y-variable
tends to increase as well
• The plot can show a negative relationship because the dots start at the top left and move
downwards to the bottom right. A negative relationship means that as the x-variable increases,
the y-variable tends to decrease

• If all the points fall exactly along a straight line, in a negative or positive direction, we say the
relationship is perfect

• Non-linear relationships are beyond the scope of this unit

Activity 6.2
During the baking of a certain type of bread roll on very low heat, each bread roll goes through a series of heat
processes. The length of time spent under this heat treatment is related to the lifespan of the bread rolls. A
sample of 8 bread rolls that underwent different baking times were selected and the life span (in hours) of each
was recorded. Draw a scatter plot and interpret.
[Blank axes for the scatter plot: Length of time (horizontal axis, 0 to 20) against Life span (vertical axis, 0 to 25)]
6.3. CORRELATION ANALYSIS (r)
A correlation exists between two variables when one of them is related to, or can be influenced by, the other in
some way. The linear correlation coefficient is a numerical measure that describes the strength and direction
of the linear relationship between two variables. A linear relationship means that when graphed, the points
approximate a straight line pattern. We use an equation, known as Pearson’s Product Moment Correlation
Coefficient, to measure this strength. This correlation coefficient is represented by the letter r. It is calculated
using the formula:

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$

 We need the sample size, the sum of the values, the sum of squares and the sum of the product
 If raw data are provided, you can enter the data in your calculator and find r

6.3.1. Characteristics of the linear correlation coefficient


 The sign of r indicates the direction of the relationship between variables x and y
 r ranges from −1 to +1 (inclusive)
o the closer to ±1 the stronger the linear relationship
o if close to 0 there is little or no linear relationship
 The strength of the linear relationship is not dependent on the direction
 r is a measure of association without a unit

6.3.2. Interpreting a correlation coefficient (r)


1. If an inverse (negative) relationship exists, y decreases as x increases and r will fall between 0 and −1
2. If there is a direct (positive) relationship, y increases as x increases and r will fall between 0 and +1
3. If there is no linear relationship between x and y, then r = 0 (or close to 0)
4. If the association between two variables is strong, then knowing the one variable helps a lot in
predicting the other. But when there is a weak association, information about one variable does not
help much in estimating the other
5. This measure enables us to make statements such as: the correlation is strong, weak, etc.
Size of r          General interpretation
± (0.9 to 1.0)     Very strong relationship
± (0.8 to 0.9)     Strong relationship
± (0.6 to 0.8)     Moderate relationship
± (0.2 to 0.6)     Weak relationship
± (0.0 to 0.2)     Very weak or no relationship
6.3.3. The coefficient of determination (r²)
The coefficient of determination (r²) is a more conservative measure of the relationship because it gives the
percentage of the total variation in the y variable that is explained by the variation in the x variable.
To determine the value of this measure, simply square the correlation coefficient and multiply the answer by
100. The answer will always lie in the range 0% to 100%. Because r is squared, the sign is lost, so r² cannot
be used to determine the direction of the relationship between the variables.

Coefficient of determination = r² × 100
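
As an illustration, the following Python sketch applies the formula for r to the eight bread roll observations tabulated in Activity 6.3 and then converts r to the coefficient of determination. The variable names are chosen for this example only.

```python
import math

# Bread roll observations from the Activity 6.3 table
x = [18, 13, 18, 15, 10, 12, 8, 4]     # length of time
y = [23, 20, 18, 16, 14, 11, 10, 7]    # life span

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(v * v for v in x)              # sum of squares of x
sum_y2 = sum(v * v for v in y)              # sum of squares of y
sum_xy = sum(a * b for a, b in zip(x, y))   # sum of products

numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2))
r = numerator / denominator
r2_percent = r**2 * 100

print(f"r = {r:.3f}")
print(f"coefficient of determination = {r2_percent:.1f}%")
```

The results agree with the Multiple R (0.870) and R Square (0.757) values in the regression output listed later in this unit.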

Activity 6.3
Calculate the correlation coefficient and coefficient of determination for the bread roll data (Activity 6.2).
Interpret your answers.
Observation    Length of time (x)    Life span (y)    x²     y²     xy
1              18                    23
2              13                    20               169    400    260
3              18                    18               324    324    324
4              15                    16               225    256    240
5              10                    14
6              12                    11               144    121    132
7              8                     10               64     100    80
8              4                     7                16     49     28
Sum
6.4. REGRESSION ANALYSIS
• If the correlation coefficient indicates that a relationship exists between the two variables, the next step
is to determine the equation of the straight line that best describes the pattern of the relationship between
the two variables
o This equation, known as the regression equation, can be used to predict a dependent variable
(y) if the independent variable (x) is known
• ‘Best’ refers to how close the predictions of y are to the actual values of y
• A linear relationship between the two variables means that the equation will result in a straight line if
plotted on a graph

6.4.1. Formulating the regression equation


Any linear function involving two variables can be expressed in the form:

$$\hat{y} = a + bx$$

Where
$\hat{y}$ = estimated / predicted value of y for a given value of x
a = intercept on the y-axis
b = slope (the average / predicted change in y for every 1-unit change in x)

The regression equation is calculated as follows:

• $b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}$
• $a = \bar{y} - b\bar{x} = \frac{\sum y}{n} - b\,\frac{\sum x}{n}$
• $\hat{y} = a + bx$

6.4.2. Using the regression equation to make predictions


 The regression equation allows you to use the independent variable to make predictions for the
dependent variable
 Substitute the independent x-value for which you require a prediction into the regression equation and
calculate the estimated y-value
o Note: estimated y-values are meaningful only for x-values in (or close to) the range of the given
data
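
A minimal Python sketch of estimating the regression equation and making a prediction is shown below. It uses the bread roll observations from Activity 6.3. The function name `predict` and the x-value of 10 are hypothetical choices for this illustration. The value 10 lies inside the observed range of x (4 to 18).

```python
# Bread roll observations from the Activity 6.3 table
x = [18, 13, 18, 15, 10, 12, 8, 4]     # length of time (independent variable)
y = [23, 20, 18, 16, 14, 11, 10, 7]    # life span (dependent variable)

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(v * v for v in x)
sum_xy = sum(p * q for p, q in zip(x, y))

# Slope and intercept from the formulas for b and a
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x**2)
a = sum_y / n - b * sum_x / n

def predict(x_value):
    """Estimated y for a given x, using y-hat = a + bx."""
    return a + b * x_value

x_new = 10  # hypothetical x-value within the range of the data
print(f"y-hat = {a:.4f} + {b:.4f}x")
print(f"predicted life span for x = {x_new}: {predict(x_new):.2f}")
```

Predictions for x-values far outside 4 to 18 would be extrapolations and should not be trusted.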

6.4.3. Plot the regression line on the scatter diagram


• The regression line is the “average” line through the data points
Activity 6.4
Bread roll data
• Estimate the regression equation
o Using the sums obtained in Activity 6.3 (n = 8)

$\sum x = 98$, $\sum x^2 = 1366$, $\sum y = 119$, $\sum y^2 = 1975$, $\sum xy = 1618$

[Blank axes for plotting the regression line: Length of time (horizontal axis, 0 to 20) against Life span (vertical axis, 0 to 25)]

• Interpret the slope

• If a bread roll spends 16 hours under heat treatment, how long can it be expected to remain fresh?
Interpreting regression output (not in textbook)

The following tables list the regression and correlation analysis output for the bread roll data.

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.870
R Square             0.757
Adjusted R Square    0.717
Standard Error       2.878
Observations         8

ANOVA
              df    SS       MS       F       Significance F
Regression    1     155.2    155.2    18.7    0.005
Residual      6     49.7     8.3
Total         7     204.9

              Coefficients    Standard Error    t Stat    P-value    Lower 95%    Upper 95%
Intercept     3.0144          2.924             1.031     0.342      -4.140       10.167
X             0.9683          0.224             4.328     0.005      0.421        1.516
6.5. SPEARMAN RANK CORRELATION COEFFICIENT (rₛ)
• This coefficient measures the strength of the relationship between two variables on the basis of their
ranks instead of values (ordinal data)
• Can be used in problems where one or both variables can be ranked even though they cannot be
measured on a numerical scale
• Interpreted in a manner similar to the correlation coefficient r, but because ranking discards much of
the information in the data, it provides a less reliable result
• It is calculated using the formula:

$$r_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$$
 Steps:
o Rank x and y values separately from smallest to largest (find the average rank for ties)
o Calculate the difference between ranks of the two variables (d)
o Square the differences and compute the sum
o Substitute into the equation and interpret
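
The steps above can be sketched in Python as follows, using the bread roll observations from Activity 6.5. The helper `average_ranks` is a hypothetical function written for this illustration. Note that the shortcut formula is exact only when there are no ties; with tied ranks, as for the two x-values of 18 here, it gives an approximation.

```python
def average_ranks(values):
    """Rank from smallest (1) to largest, giving tied values their average rank."""
    ordered = sorted(values)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1            # rank position of the first tied value
        count = ordered.count(v)                # number of values sharing this value
        ranks.append(first + (count - 1) / 2)   # average of the tied rank positions
    return ranks

# Bread roll observations from the Activity 6.5 table
x = [18, 13, 18, 15, 10, 12, 8, 4]     # length of time
y = [23, 20, 18, 16, 14, 11, 10, 7]    # life span

rank_x = average_ranks(x)
rank_y = average_ranks(y)
d = [rx - ry for rx, ry in zip(rank_x, rank_y)]   # differences between ranks
sum_d2 = sum(diff**2 for diff in d)               # sum of squared differences

n = len(x)
r_s = 1 - 6 * sum_d2 / (n * (n**2 - 1))

print(f"ranks of x: {rank_x}")
print(f"sum of d^2 = {sum_d2}, r_s = {r_s:.3f}")
```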

Activity 6.5
Determine Spearman’s rank correlation coefficient for the bread roll data and interpret your answer.
Observation    Length of time (x)    Life span (y)    Rank (x)    Rank (y)    d      d²
1              18                    23
2              13                    20               5           7
3              18                    18                           6           1.5    2.25
4              15                    16               6           5           1      1
5              10                    14               3           4           -1     1
6              12                    11               4           3           1      1
7              8                     10               2                       0      0
8              4                     7                1           1           0      0
Sum
UNIT 8: BASIC PROBABILITY CONCEPTS

The theory of probability grew out of the study of various games of chance using coins, dice, cards, lottery and
gambling machines. Since then probability theory has been developed to determine uncertainties in our everyday
lives as well.

After completion of this unit you will be able to:


 define probability
 describe the classical, empirical and subjective approaches to probability
• understand the meaning of basic terms used in probability theory
 apply the properties of probability
 calculate probabilities using the rules of addition and multiplication
 use the counting rules

In each of the following questions there is an implied condition of uncertainty:


 Is there a link between second-hand smoking and asthma in young children?
 What are my chances of getting that new job?
 What is the estimated influence of the AIDS epidemic on the population growth?
 What is the chance that it will rain today?

We make decisions in the face of uncertainty. We know that things could turn out in different ways, but we
simply do not know how probable each possibility is. Our need to cope with this risk, or the ‘chance that it
will happen’, leads us to the use of probability theory.

Inferential statistics involves using statistics obtained from a sample to make estimates and decisions concerning
the entire population. We can never be certain that our decisions are correct, but to assess how good they will be
we need to know how to measure ‘chance’ and ‘probability’. The science of measuring ‘uncertainty’ is called
probability.

Probability describes the relative possibility (chance or likelihood) that an event will occur.
8.1. LANGUAGE OF PROBABILITY
 An experiment is a trial or action that generates the uncertain outcomes to which we will assign
probabilities.
 A particular result of an experiment is an outcome.
 A sample space of a random experiment is a list of all the possible outcomes of the random experiment.
 The individual outcomes in a sample space are called simple events. An event is a collection of one or
more outcomes of an experiment.

Activity 8.1
The type of transmission - automatic (A) or manual (M) - is recorded for each of the next two cars purchased
from a certain dealer.
1. What is the random experiment?

2. What is the sample space?

3. List the outcomes in each of the following events:


a) That at least one car has an automatic transmission

b) That exactly one car has an automatic transmission

c) That neither car has an automatic transmission


8.2. APPROACHES TO ASSIGNING PROBABILITIES
Each outcome in a sample space has a probability. Each event has a probability. The method you will use to
calculate a probability depends on the approach you use.

8.2.1. Classical approach


Classical probability is used in situations where the outcomes of the experiment are equally likely. The probability
of an outcome or event happening is computed by dividing the number of favorable outcomes (or successes) by
the number of possible outcomes.

$$P(E) = \frac{\text{number of successes}}{\text{total number of outcomes}}$$

P(E) is read as the probability that event E will occur.

Activity 8.2
In drawing a card from a deck of 52 cards, what is the probability that it will be an ace?

8.2.2. Empirical probability


This approach is based on relative frequencies. The probability of an event happening is determined by observing
what proportion of the time similar events happened in the past or if the experiment is repeated many times under
identical conditions.

$$P(E) = \frac{\text{number of times the event occurred}}{\text{total number of observations}}$$

As you increase the number of times an experiment is repeated, the empirical probability of an event approaches
the classical probability of the event.

NOTE: Chance behaviour is unpredictable over the short run but has a regular and predictable pattern in the long
run.
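
The long-run behaviour described in the note can be illustrated with a small simulation. The sketch below is an assumption-laden example, not part of the prescribed material: it simulates rolling a fair six-sided die with Python's `random` module and compares the empirical probability of a six with the classical probability of 1/6. The seed is fixed only so that a run can be repeated.

```python
import random

random.seed(1)  # fixed seed so that the simulation is reproducible

classical = 1 / 6  # classical probability of rolling a six with a fair die

for n in (10, 100, 1000, 100000):
    rolls = [random.randint(1, 6) for _ in range(n)]   # n simulated rolls
    empirical = rolls.count(6) / n                     # relative frequency of a six
    print(f"n = {n:>6}: empirical = {empirical:.4f}, classical = {classical:.4f}")
```

With only a few rolls the relative frequency can be far from 1/6, while for a large number of rolls it tends to settle close to it.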
Activity 8.3
In a survey, a sample of 100 students were asked if they think that cloning of humans should be allowed. Ninety-
two said it should not be allowed, five said it should be allowed and three had no opinion. Calculate the
probabilities for each event in the survey.

Activity 8.4
We all know that fruit is good for us and that we don’t eat enough. In a recent study done among a random sample
of 75 teenage boys, the following information was collected:

Fruit servings per day Number of boys % of boys


0 20 27
1 15 20
2 15 20
3 12 16
4 8 11
5 5 6
Total 75 100

1. What is the probability of a teenage boy eating 3 fruit servings per day?

2. What is the probability of a teenage boy eating no fruit per day?


8.2.3. Subjective probability approach
Subjective probabilities are based on intuition and give a numerical estimate of the likelihood that a particular
outcome will occur using past experience, judgment, educated guesses or opinion.

Example 8.3
Given a patient’s health and extent of injuries, a doctor may feel that the patient has a 90% chance of full recovery.

Additional example
What is the probability that the traffic lights will work at the intersection of University Road and Kingsway Road?

8.3. PROPERTIES OF PROBABILITIES


1. The probability of any event is between 0 and 1 inclusive. A probability cannot be negative, nor can it
exceed one.

$$0 \le P(E) \le 1$$

 If P(E) = 0, then the event has no chance of occurring (impossible)


 If P(E) = 1, then the event is certain to occur
 The closer the probability is to 1, the better the chance that the event will happen
 The closer the probability to 0, the weaker the chance that the event will happen
 If P(E) = 0.5 we say the chance is 50-50 or there is an even chance
 A P(E) of more than 0.5 and less than 1, is said to be likely to occur
 A P(E) of less than 0.5 but more than 0, is said to be unlikely to occur

2. The sum of the probabilities for all the possible outcomes (or events) of an experiment must be equal to one:

$$\sum P(E) = 1$$

3. The complement rule: The complement of an event E is the event that E does not occur. It includes all the
outcomes in the sample space that do not belong to event E. The sum of the probabilities of all events in the
sample space is equal to 1. If the probability of event E occurring is P(E) and the probability that event E does
not occur is $P(\bar{E})$, then:

$$P(E) + P(\bar{E}) = 1 \quad \therefore \quad P(\bar{E}) = 1 - P(E)$$
Activity 8.5
The probability that a typist will make at most five mistakes is 0.64. What is the probability that she will make
more than five mistakes?

Activity 8.6
Consider the example from Activity 8.4:

Fruit servings per day Number of boys % of boys


0 20 27
1 15 20
2 15 20
3 12 16
4 8 11
5 5 6
Total 75 100

1. What is the probability of a teenage boy eating 3 fruit servings per day?

2. What is the probability of a teenage boy not eating 3 fruits per day?

3. What is the probability that teenage boys eat at least 1 fruit per day?
8.4. FORMING NEW EVENTS
Once some events have been specified, there are several useful ways of manipulating them to create new events
known as compound events.

8.4.1. Union of two events (A or B)


The union of events A and B is represented by the symbol $A \cup B$ and merges all the outcomes in event A with
those in event B. An outcome that appears in both A and B is listed only once in the combined event.

8.4.2. Intersection of two events (A and B)


The intersection of two events is represented by the symbol $A \cap B$ and contains all the outcomes that events A
and B share in common. The probability of the intersection is known as a joint probability.

Additional example
Consider an experiment of rolling two fair 4-sided dice (one red and one blue), where the sides are numbered as 1,
2, 3 and 4:

Sample space:

Complement rule:

Union:

Intersection:

Note: Section 8.4.3 (display events graphically) is discussed before Section 8.5.1 (addition rules).
8.5. PROBABILITY RULES FOR COMPOUND EVENTS
We use various rules of probability to compute the probabilities of the more complex, related events.
1. If an event (E) is made up of two (or more) simple events (A and B), E can be formed either as:
 The union of two or more events – all the outcomes that make up the two events
 The intersection of two events – the outcomes that fulfil the conditions for both events
2. Two or more events are mutually exclusive if the occurrence of one event means that none of the other events
can occur at the same time. An outcome can belong to event A or to event B but not to both. For example, if
you flip a coin, you can either have Heads or Tails. Both can’t happen at the same time!
3. Two events are independent if the occurrence of one is in no way affected by the occurrence of the other; that
is, they are unrelated. If you flip two coins and you obtained a Head on the one, it will have no influence on
the outcome of the second flip.
4. If there is a particular relationship between events such that the occurrence of one event affects the occurrence
of the second event, the events are dependent and the probability attached to the occurrence of such events is
known as conditional probability.

8.4.3. Display events graphically


Diagrams used to display events are known as Venn diagrams. They are a useful technique for visualising
relationships between events and for understanding probability. The collection of all possible outcomes, that is
the sample space, is shown as the interior of a rectangle with the various events drawn as circles inside the
rectangle. Each event is a subset of the sample space.

 Single event
 Union and intersection of two events

 Union and intersection of two mutually exclusive events

8.5.1. Addition rules


The addition rules are used if we want to find the probability of one event or another occurring.

A reminder of the notation:


• $P(A \text{ or } B) = P(A \cup B)$ → union of the two events

• $P(A \text{ and } B) = P(A \cap B)$ → intersection of the two events

Special rule of addition


• To apply the special rule of addition, the events must be mutually exclusive
o the two events are disjoint and therefore cannot occur simultaneously
• This rule states that the probability of event A or event B occurring equals the sum of their individual
probabilities
o $P(A \cup B) = P(A) + P(B)$
Activity 8.7
If the probabilities for different ratings of a new Shiraz are 0.05 (very poor), 0.14 (poor), 0.17 (fair), 0.33 (good),
0.20 (very good) and 0.11 (excellent), what are the probabilities that the Shiraz will be rated as:
1. Very poor or poor

2. Good, very good or excellent

General addition rule


• If the events are not mutually exclusive, it means that event A may occur or event B may occur, or both A
and B may occur in a single outcome
• This rule states that the probability that either event A or event B occurs equals the probability that event
A occurs plus the probability that event B occurs minus the probability that both occur
o $P(A \cup B) = P(A) + P(B) - P(A \cap B)$

• To avoid double counting the probability of the outcomes that fulfil the conditions for both events, i.e. the
intersection of the two events, P(A and B) is subtracted from the sum of the probability of A and B
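
The two addition rules can be checked by counting outcomes directly. The sketch below is only an illustration: the fair six-sided die, the events and the helper name `prob` are assumptions introduced here and do not come from the activities.

```python
from fractions import Fraction

def prob(event, sample_space):
    """Classical probability: favourable outcomes / equally likely outcomes."""
    return Fraction(len(event & sample_space), len(sample_space))

# Hypothetical experiment: one roll of a fair six-sided die.
sample_space = {1, 2, 3, 4, 5, 6}
even = {2, 4, 6}
greater_than_four = {5, 6}
one = {1}

# Special rule: 'even' and 'one' are mutually exclusive (no shared outcomes).
special = prob(even, sample_space) + prob(one, sample_space)

# General rule: 'even' and 'greater_than_four' share the outcome 6,
# so the intersection is subtracted to avoid double counting.
general = (prob(even, sample_space)
           + prob(greater_than_four, sample_space)
           - prob(even & greater_than_four, sample_space))

# Both results agree with counting the union of the events directly.
print(special == prob(even | one, sample_space))                # True
print(general == prob(even | greater_than_four, sample_space))  # True
```

Using `Fraction` keeps the arithmetic exact, so the comparison with the directly counted union is not affected by rounding.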

Activity 8.8
There are two secretaries in the office. The probability that the one secretary will be absent on any given day is
0.08 and the probability that the other one will be absent on any given day is 0.07. The probability that both will
be absent is 0.02. What is the probability that on a given day:
1. either or both secretaries will be absent

2. at least one secretary comes to work


8.5.2. Multiplication rules
These rules are used to determine the probability that event A and event B both happen in two successive outcomes.

Special rule of multiplication


• This rule states that when events A and B are independent, the probability of both occurring together is
the product of the individual probabilities
o Statistical independence
o $P(A \cap B) = P(A) \times P(B)$, i.e. P(A and B) = P(A) × P(B)

• Two events are independent if the occurrence of one does not change the probability of the next one
occurring
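
As a small check of the special rule of multiplication, the sketch below lists every outcome of two flips of a fair coin (a hypothetical experiment chosen for illustration) and compares the counted probability of two heads with the product of the individual probabilities. The helper name `prob` is an assumption.

```python
from fractions import Fraction
from itertools import product

# Hypothetical experiment: two flips of a fair coin, all outcomes equally likely.
outcomes = list(product("HT", repeat=2))  # ('H', 'H'), ('H', 'T'), ('T', 'H'), ('T', 'T')

def prob(condition):
    """Share of the equally likely outcomes that satisfy the condition."""
    favourable = [o for o in outcomes if condition(o)]
    return Fraction(len(favourable), len(outcomes))

p_a = prob(lambda o: o[0] == "H")            # A: head on the first flip
p_b = prob(lambda o: o[1] == "H")            # B: head on the second flip
p_a_and_b = prob(lambda o: o == ("H", "H"))  # A and B together

print(p_a * p_b, p_a_and_b)  # 1/4 1/4
```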

Activity 8.9
The quality control manager of a company questions the reliability of the two quality control checks in the food
processor manufacturing process. A worker who manually checks the processors performs one check and a
computer monitor performs a second check. The manager knows that 5% of the time the worker is apt to miss a
defective processor and that 2% of the time the computer will malfunction and fail to detect defective processors.
What is the probability that a worker will miss a defective processor and the computer will malfunction, causing
a defective processor to stop manufacturing?

General rule of multiplication


• This rule is used to combine events that are dependent
o Often referred to as conditional probability
• Conditional probability is the probability of an event B occurring given that event A has already occurred
• The probability of the second event B is affected by the outcome of the first event A
• Conditional probability is denoted by P(B|A) and is read as the “probability of B, given A happened”
o $P(A \cap B) = P(A) \times P(B \mid A)$

• It therefore follows that the conditional probability can be expressed as the ratio of the intersection (joint)
probability to the marginal probability of the conditioning event
o $P(B \mid A) = \dfrac{P(A \cap B)}{P(A)}$
Activity 8.10
A medical researcher has discovered a new test for tuberculosis. If the test indicates a person has tuberculosis, the
test is positive. Experimentation has shown that the probability of a positive test is 0.82, given that a person has
tuberculosis. The probability is 0.04 that the test registers positive and that the person does not have tuberculosis.
Assume that in the general population the probability that a person has tuberculosis is 0.20.

Identify all given probabilities (marginal / joint / conditional):

What is the probability that a person chosen at random will:


1. Have tuberculosis and a positive test

2. Not have tuberculosis

3. Have a positive test but not have tuberculosis


8.5.3. Calculating probabilities using a contingency table
A two-way contingency table or cross-tabulation table lists the frequency of each combination of the values of
two variables.
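
A contingency table gives joint, marginal and conditional probabilities by dividing the relevant counts. The sketch below uses made-up counts (100 hypothetical students cross-classified by result and attendance); the table, the labels and the function names are assumptions for illustration only.

```python
from fractions import Fraction

# Hypothetical counts, not taken from the study guide.
table = {
    ("passed", "attended"): 40,
    ("passed", "absent"): 10,
    ("failed", "attended"): 20,
    ("failed", "absent"): 30,
}
total = sum(table.values())

def joint(row, col):
    """P(row and col): one cell divided by the grand total."""
    return Fraction(table[(row, col)], total)

def marginal_row(row):
    """P(row): row total divided by the grand total."""
    return Fraction(sum(n for (r, _), n in table.items() if r == row), total)

def marginal_col(col):
    """P(col): column total divided by the grand total."""
    return Fraction(sum(n for (_, c), n in table.items() if c == col), total)

def row_given_col(row, col):
    """P(row | col) = P(row and col) / P(col)."""
    return joint(row, col) / marginal_col(col)

print(joint("passed", "attended"))          # 2/5
print(marginal_col("attended"))             # 3/5
print(row_given_col("passed", "attended"))  # 2/3
```

The conditional probability uses the column total as its denominator, which is the same as restricting attention to the column of the conditioning event.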

Activity 8.11
The following table shows the results of a study on 102 children in which a child’s IQ was examined and the
presence of a specific gene was found in the child:
              Gene present   Gene not present   Total
High IQ            33               19             52
Normal IQ          39               11             50
Total              72               30            102

Determine the probability that a child has:


1. A high IQ and the given gene

2. A normal IQ without the gene

3. The gene

4. A normal IQ

5. A high IQ, given that the child has the gene

6. The gene, if his IQ is normal


Additional example

Let $P(A) = 0.5$, $P(B) = 0.4$ and $P(A \cap B) = 0.2$

Find $P(A \cup B)$ using:

 A Venn diagram

 A probability rule

 A contingency table
8.5.4. Tree diagrams
Another useful method of calculating probabilities if there are several stages or trials in the experiment is to use
a probability tree. All the possible outcomes of the experiment are represented by the branches of the tree.

Steps
 Plot a dot on the left to represent the root of the tree.
 Construct a column for each trial.
 Start on the left and determine the possibilities for the first trial which forms the branches of the tree in
the first column.
 Branches grow from each of the original branches, representing the possibilities for the second trial. The
second stage is based on the choice made in the first stage.
 The branches of the tree are weighted by probabilities; therefore show the probabilities for each event on
the branches.
 List all the outcomes together with the joint probability for each combined outcome.
 Add the probabilities. Because the tree represents the sample space of the experiment, the sum of the
probabilities should equal 1.

Example 8.15
A bag contains 5 red balls and 3 black balls. Two balls are drawn from the bag (one after the other). Construct a
probability tree to list all the possible outcomes together with each outcome’s probability.

NOTE: The first stage of the tree shows the marginal probabilities and the second stage shows the conditional
probabilities
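
The tree for Example 8.15 can be sketched in code by multiplying along each branch. This is only a sketch and it assumes the first ball is not replaced before the second draw, which is consistent with the note that the second stage shows conditional probabilities. The variable and function names are assumptions.

```python
from fractions import Fraction

red, black = 5, 3  # bag contents from Example 8.15
total = red + black

# First stage: marginal probabilities.
first = {"R": Fraction(red, total), "B": Fraction(black, total)}

def second(first_colour):
    """Second stage: conditional probabilities, assuming no replacement."""
    r = red - (1 if first_colour == "R" else 0)
    b = black - (1 if first_colour == "B" else 0)
    return {"R": Fraction(r, total - 1), "B": Fraction(b, total - 1)}

# Joint probability of each path = first-stage branch x second-stage branch.
paths = {}
for c1, p1 in first.items():
    for c2, p2 in second(c1).items():
        paths[c1 + c2] = p1 * p2

for path, p in paths.items():
    print(path, p)  # RR 5/14, RB 15/56, BR 15/56, BB 3/28

print(sum(paths.values()))  # 1
```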
Activity 8.12
Approximately 10% of people are left-handed. Two people are selected at random. Construct a tree diagram.

Use a probability tree to determine the probability that:


1. Both are right handed

2. Both are left-handed

3. One is right-handed and the other left-handed

NOTE: Both the first and second stages of the tree show marginal probabilities since the events are independent.
If events A and B are independent: $P(A \mid B) = \dfrac{P(A \cap B)}{P(B)} = \dfrac{P(A) \times P(B)}{P(B)} = P(A)$
Additional example
A factory has 2 machines, A and B. Machine A is used 70% of the time. The probability that machine A produces
a defective item is 0.1 and the probability that machine B produces a defective item is 0.2. Draw a tree diagram
to represent the given probabilities:

Use the information from the tree diagram to construct a contingency table

What is the probability of producing good items at the factory?

An item is inspected and it is defective. What is the probability that it was produced by machine A?
8.6. COUNTING THE POSSIBILITIES
Probability is based on the number of successes and the number of possible outcomes, which make up the numerator
and denominator respectively. A collection of counting rules can be used to determine the number of outcomes that
can occur for a particular experiment.

8.6.1. Multiplication rule of counting


The multiplication counting rule (mn rule) states that if there are m ways in which event A can happen, and n ways
in which event B can happen, then there are m times n ways in which both can happen. This rule can be extended
if there are more events.
$m \times n$

Example 8.16
A computer password is to be made up consisting of four alphabetical characters. How many different computer
passwords can be designed if repetition of letters is allowed?
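
A minimal sketch of the multiplication rule applied to Example 8.16. It assumes a 26-letter alphabet with no distinction between upper and lower case; the function name `count_outcomes` is hypothetical.

```python
from math import prod

def count_outcomes(*ways):
    """Multiplication rule: multiply the number of ways for each position or event."""
    return prod(ways)

# Four positions, each with 26 possible letters (repetition allowed).
passwords = count_outcomes(26, 26, 26, 26)
print(passwords)  # 456976
```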

Activity 8.13
If a restaurant menu had a choice of 3 salads, 6 main dishes and six desserts, how many different possible dinners
can be ordered?

Extra example
How many Gauteng number plates are possible?
8.6.2. Permutation rule
The permutation rule is used to determine the number of ways to arrange n distinct objects taking x at a time in a
specific order, i.e. select a subset of x items out of n and arrange them in order.
$_nP_x = \dfrac{n!}{(n - x)!}$

• Read as “n-permutation-x”
• Note: n! (pronounced n factorial) is the product of the whole numbers from n downwards to 1.

Example 8.17
Select a group of 3 people from 10 to fill the roles of chairperson, secretary and treasurer in a committee. The
number of possible ways to fill these roles is:
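
The calculation for Example 8.17 can be sketched with Python's standard library. `math.perm` is available from Python 3.8; the second line repeats the calculation with the factorial formula as a cross-check.

```python
from math import factorial, perm

n, x = 10, 3  # 10 people, 3 distinct roles (order matters)

ways = perm(n, x)                                 # built-in permutation count
ways_by_formula = factorial(n) // factorial(n - x)  # n! / (n - x)!

print(ways, ways_by_formula)  # 720 720
```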

Activity 8.14
Assume there are five carriages that need to be unloaded at a dock but there is only enough time left in the day to
unload three of them. Since the goods in each of the carriages are needed by customers, the order of unloading is
important. In how many ways can three of the five carriages be unloaded in first, second and third order?

Extra example
How many 4-digit PIN numbers can you create if all the digits must be different?
8.6.3. Combination rule
The combination rule is used to determine the number of ways to select x objects from a larger set of n objects
without regard to the order in which the objects are selected.
$_nC_x = \dbinom{n}{x} = \dfrac{n!}{x!\,(n - x)!}$

• Read as “n-combination-x”

Example 8.18
A group of 7 mountain climbers wants to form a mountain climbing team of 5. How many different teams could
be formed?
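
The calculation for Example 8.18 can be sketched in the same way. `math.comb` is available from Python 3.8; the factorial formula is included as a cross-check.

```python
from math import comb, factorial

n, x = 7, 5  # 7 climbers, teams of 5 (order does not matter)

teams = comb(n, x)                                                  # built-in combination count
teams_by_formula = factorial(n) // (factorial(x) * factorial(n - x))  # n! / (x! (n - x)!)

print(teams, teams_by_formula)  # 21 21
```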

Activity 8.15
You are given a list of 10 books and you are to read 4 of them. How many possible combinations of 4 books are
available from the list of ten?

Extra example
A lottery game consists of selecting 3 numbers from 1 to 10. How many possible lottery numbers can be selected?
UNIT 9: PROBABILITY DISTRIBUTIONS

Outcomes
After completion of this unit, you will be able to:
 define a probability distribution
 distinguish between discrete and continuous random variables
 find the probability for a binomial investigation
 find the probability for a Poisson investigation
 find the probabilities for a normally distributed variable by transforming it into a standard random variable

Definitions
 In investigations involving chance it is difficult to predict the exact value of a variable and so it is known
as a random variable
 A probability distribution is a listing of all the possible outcomes of a random variable together with the
probability of its occurrence
o The sum of the probabilities of all possible outcomes is 1
 Probability distributions are classified as either discrete or continuous, depending on the type of random
variable:
o A random variable is discrete if it can assume a countable number of possible values such as 0, 1,
2, 3, etc.
o A continuous random variable has an infinite number of possible values and can take on any value
over a given interval of values

9.1. DISCRETE PROBABILITY DISTRIBUTION


Steps in constructing a discrete probability distribution:
 List all the possible outcomes of an investigation in a frequency distribution
 Add the frequencies of each possible outcome
 Find the probability of each possible outcome by dividing the frequency by the sum of the frequencies
 Make sure each probability is between 0 and 1 and the total of all the probabilities is 1

The mean of a probability distribution is referred to as its expected value. For a discrete probability distribution
the mean is denoted by:
$\mu = E(X) = \sum x \cdot P(x)$

• where $P(x) = P(X = x)$
Example 9.1 (empirical)
A survey asked a sample of 200 people how many times they donate blood each year. The results are summarised
as a probability distribution (i.e. the empirical distribution is used as an estimate of the population distribution).
The random variable (X) represents the number of donations for one year.

x        f      P(x)    x·P(x)
0        60     0.30    0.00
1        50     0.25    0.25
2        50     0.25    0.50
3        20     0.10    0.30
4        10     0.05    0.20
5        6      0.03    0.15
6        4      0.02    0.12
Total    200    1.00    1.52
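
The table for Example 9.1 can be reproduced with a few lines of code, following the construction steps listed in 9.1. The frequencies are taken from the example; the variable names are assumptions.

```python
from fractions import Fraction

# Observed frequencies from Example 9.1: number of donations -> number of people.
frequencies = {0: 60, 1: 50, 2: 50, 3: 20, 4: 10, 5: 6, 6: 4}

n = sum(frequencies.values())  # 200 respondents

# Probability of each outcome = frequency / total frequency.
distribution = {x: Fraction(f, n) for x, f in frequencies.items()}

# Every probability lies between 0 and 1 and together they sum to 1.
assert all(0 <= p <= 1 for p in distribution.values())
assert sum(distribution.values()) == 1

# Expected value: sum of x * P(x) over all outcomes.
expected = sum(x * p for x, p in distribution.items())
print(float(expected))  # 1.52
```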

Interpretation
• 30% of the people did not donate blood
• 25% of the people donated once
• The expected number of times someone will donate blood in a year is 1.52

Additional example (classical)


Consider an experiment where a fair coin is tossed 3 times. Construct a probability distribution for the number of
tails in 3 tosses of the coin.

Sample space

Probability distribution
x        P(x)    x·P(x)

Total
Expected value

P  X  1 

P  X  2 

9.1.1. The binomial distribution


 Investigations resulting in responses for which there are only two possible outcomes, such as yes or no,
accept or decline, pass or fail, etc. are known as binomial investigations
 A binomial investigation usually involves repeating the basic investigation a fixed number of times (n),
called trials, and observing the number of times that the outcome of interest (success) (x) occurs
 A binomial probability distribution allows us to calculate the probability of a specific number of successes
(x = 0, 1, 2, …, n) in n trials.

Properties of a binomial experiment


• Consists of n identical trials
• The trials are independent
• Each trial has only 2 possible outcomes: success or failure
• Each trial has the same probability of a success, i.e. π
• The probability of a failure is 1 – π
• The random variable X is the number of successes in n trials
• The probability distribution of X ~ B(n, π) is given by:
o $P(x) = \dbinom{n}{x} \pi^{x} (1 - \pi)^{n - x}$

Steps
• Find the probability (π) of a success in each trial
• Find the number of trials (n)
• Decide on the number of successes (x) for which you want to determine the probability
• Substitute the values into the formula
Activity 9.1
A shoe store’s records show that 30% of customers making a purchase use a credit card to make payment. This
morning 7 customers purchased shoes from the store.

Identify the distribution and its parameters

What is the probability that:


1. 3 customers will pay using a credit card

2. At least 2 customers will pay using a credit card

3. More than 5 customers will pay using a credit card

4. Exactly 3 customers will not pay using a credit card
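
The binomial formula from this section can be sketched as a short function. The values n = 4 and π = 0.5 below are hypothetical and are deliberately different from Activity 9.1; the function name `binomial_pmf` is an assumption.

```python
from math import comb

def binomial_pmf(x, n, pi):
    """P(X = x) for X ~ B(n, pi): C(n, x) * pi^x * (1 - pi)^(n - x)."""
    return comb(n, x) * pi ** x * (1 - pi) ** (n - x)

n, pi = 4, 0.5  # hypothetical: 4 trials, success probability 0.5

p_exactly_2 = binomial_pmf(2, n, pi)
p_at_least_1 = 1 - binomial_pmf(0, n, pi)  # complement of "no successes"
total = sum(binomial_pmf(x, n, pi) for x in range(n + 1))

print(p_exactly_2)   # 0.375
print(p_at_least_1)  # 0.9375
print(total)         # 1.0
```

"At least" and "more than" questions are usually easiest to answer by adding the individual probabilities or by using the complement, as in the second calculation.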


9.1.2. The Poisson distribution
 The Poisson distribution is a discrete distribution and is useful for calculating the probability that a certain
number of successes (x = 0, 1, 2, …) will occur over a specific period of time, area or other measurement
 There is no specific upper limit to the count (n is unknown) although a finite count is expected

Properties of a Poisson experiment


• The number of successes in one interval is independent of the number of successes in any other interval
• The probability of a success in one interval is the same for all intervals of equal size and is proportional
to the size of the interval
• The random variable X is the number of successes in an interval and can take on values x = 0, 1, 2, …
• The rate of occurrence or average number of successes in a given interval is denoted by the parameter λ
• The probability distribution of X ~ P(λ) is given by:
o $P(x) = \dfrac{\lambda^{x} e^{-\lambda}}{x!}$
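
A minimal sketch of the Poisson formula. The rate of 120 events per hour is hypothetical; it is rescaled to a one-minute interval because λ must match the interval in the question. The function name `poisson_pmf` is an assumption.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for X ~ P(lam): lam^x * e^(-lam) / x!"""
    return lam ** x * exp(-lam) / factorial(x)

rate_per_hour = 120       # hypothetical average rate
lam = rate_per_hour / 60  # average per one-minute interval: 2.0

p_none = poisson_pmf(0, lam)   # approximately 0.1353
p_two = poisson_pmf(2, lam)    # approximately 0.2707
p_at_least_one = 1 - p_none    # complement of "no events"

print(round(p_none, 4), round(p_two, 4), round(p_at_least_one, 4))
```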

Activity 9.2
A tollgate operator has observed that cars arrive randomly at an average rate of 360 cars per hour.

Identify the distribution and its parameters

Calculate the probability that:


1. only two cars will arrive during a specified one minute period

2. at least 3 cars will arrive during a specified two minute period


9.2. PROBABILITY DISTRIBUTIONS FOR CONTINUOUS RANDOM VARIABLES
A continuous distribution is a distribution in which the X variable may assume any value within a given range or
interval. The most widely used continuous probability distribution is the normal distribution.

9.2.1. The normal distribution


Many variables follow what we call a normal distribution: there are many occurrences in the middle of the
distribution, but relatively few at each end. For example, there are relatively few very tall people and
relatively few very short people, but many people of around average height. This means that the chance (or
probability) that a person's height is close to average, that is, near the middle of the distribution, is much
higher than the chance that the person is either very tall or very short. Events that occur near the middle of the
normal curve have a higher probability of occurring than those in the extremes. The normal distribution plays
a very important role in statistical inference.

Characteristics
• The graph for a continuous random variable X is a smooth curve
• The curve is unimodal (single mode)
• The distribution is symmetrical around its mean and bell-shaped in appearance
• The left and right hand tails of the distribution extend indefinitely
• The x-axis represents all the possible values of the X variable
• Two parameters describe the normal distribution: the mean (µ) describes where the distribution is centred
and the standard deviation (σ) describes how much the curve spreads out around the centre
o Notation: X ~ N(µ, σ²)
• Probabilities for continuous variables correspond to areas under the normal curve
• The total area under the normal curve is equal to 1 or 100% (the sum of all the probabilities of a
probability distribution = 1). This means that the areas to the left and the right of the mean will each
comprise 50% of the total area
Additional exercise
Consider the following three different normal distributions. Discuss the similarities and differences between these
distributions in terms of their means and variances/standard deviations.

[Figure: three normal curves labelled A, B and C]

1) A vs. B

2) B vs. C

3) A vs. C

Calculating probabilities under the normal curve


 We are interested in calculating probabilities for normally distributed random variables
 But calculating probabilities under a normal distribution is computationally intensive and an infinite
number of possible normal distributions can exist
 To solve this problem we use a very useful property of random variables to calculate the required
probabilities, namely the standard normal distribution
o Z ~ N(0, 1)
o A set of statistical tables exists showing probabilities for the Z-distribution
1. Any normal random variable can be converted to the standard normal random variable by calculating the
corresponding z-score. The z-score expresses the difference between the value of interest (x) and the mean
(µ) in units of the standard deviation (σ):
$z = \dfrac{x - \mu}{\sigma}$

2. This z-score shows the number of standard deviations that a specific value lies to the right or left of the
mean. Any x-value smaller than the mean will have a negative z-score and any x-value greater than the
mean will have a positive z-score. A negative z-score only indicates that the area is to the left of the mean.
3. The x-axis of the graph becomes the z-axis with z = 0 in the middle
4. The z-score is used to find the area under the standard normal curve in published tables
5. Because the distribution is symmetrical, the normal table deals only with positive z-scores. The negative
z-scores will give exactly the same areas as the positive ones
6. The area is always positive because all probabilities are positive
7. Within the same number of standard deviations (z) from the mean, all normal distributions will have the
same area or fraction of their probabilities

How to read an area from the normal table using a z-score


1. The first column in the table gives the z-score to the first decimal place, and the top row gives the second
decimal for a z-score
2. For example, to find the area of a z-score of 0.23 (i.e. between 0 and 0.23):
a) Find 0.2 in the first column
b) Go across with this row up to the column headed with the second decimal of 0.03
c) Where the corresponding row and column intersect, the area is 0.0910
d) This is the area between a z-score of 0 and 0.23 and is denoted as $P(0 \le z \le 0.23)$
e) A z-score of 1.02 or $P(0 \le z \le 1.02)$ will correspond to an area of 0.3461
3. This table always gives the area between the mean = 0 and the required z-score
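
The table values quoted above can be reproduced numerically. The sketch below uses the error function from Python's standard library, relying on the identity that the area between 0 and z under the standard normal curve is 0.5·erf(z/√2). The function names are assumptions.

```python
from math import erf, sqrt

def area_between_zero_and(z):
    """Area under the standard normal curve between 0 and z (sign of z ignored)."""
    return 0.5 * erf(abs(z) / sqrt(2))

def tail_area_beyond(z):
    """Area in the tail beyond z: half the curve minus the area between 0 and z."""
    return 0.5 - area_between_zero_and(z)

print(f"{area_between_zero_and(0.23):.4f}")  # 0.0910
print(f"{area_between_zero_and(1.02):.4f}")  # 0.3461

# Symmetry: a negative z-score gives the same area as the positive one.
print(area_between_zero_and(-1.02) == area_between_zero_and(1.02))  # True
```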

9.2.2. Different areas under the normal curve


Steps (from book)
 Draw the normal curve and indicate the mean in the middle
 Find the value of the x-variable on the x-axis and shade the area desired
 Convert x-values to z-values using the z-formula
 Use the standard normal table and the absolute value of the calculated z-score to find corresponding areas
and probabilities
Steps (alternative approach)
 Write the required probability in notation in terms of the original variable X
 Standardise to get the z-score(s)
 Draw the standard normal curve with 0 in the middle
 Mark the z-score(s) on the curve and shade the area desired
 Use the standard normal table and the absolute value(s) of the calculated z-score(s) to find corresponding
area(s)

Area between 0 and any z-score


Area between 2 values on the same side of the mean of 0

Area in any tail


Area to the right of any z-score, where z is less than the mean and the area to the left of any z-score, where z is
greater than the mean

Find the z-score for a given area (from book)


If a probability or area is given and you are required to determine an unknown x-value:
1. Calculate the area between µ and the unknown x-value
2. If the known area falls in the tail of the curve, subtract the tail area from 0.5
3. Compare this area with the areas in the body of the normal table
4. If the exact area is not listed, use the closest value
5. Read the z-score from the first column and top row
6. Use this z-score in the z-formula to obtain the unknown x-value

Note: If the area falls to the left of µ, the z is negative; if the area falls to the right of µ, the z is positive
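
The reverse lookup can also be sketched in code. The example assumes a hypothetical variable with µ = 50 and σ = 10 and asks for the value below which the lowest 10% of observations lie; `statistics.NormalDist` (Python 3.8 or later) takes the place of the printed table here.

```python
from statistics import NormalDist

mu, sigma = 50, 10  # hypothetical mean and standard deviation
area_to_left = 0.10  # lowest 10% of the distribution

# z-score with 10% of the standard normal area to its left.
# It is negative because the area lies to the left of the mean.
z = NormalDist().inv_cdf(area_to_left)

# Rearranging z = (x - mu) / sigma gives x = mu + z * sigma.
x = mu + z * sigma

print(round(z, 2), round(x, 2))  # -1.28 37.18
```

A printed table gives z ≈ 1.28 for an area of 0.40 between the mean and z; the sign is then chosen by reasoning about which side of the mean the value lies on.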

Alternative approach
• Write the given probability in notation
• Determine if the z score is on the positive or negative side of the graph (through logical reasoning)
• Shade the given area (probability)
• Find the z-score (often through logical reasoning)
• Find the x-value by substituting all known values into $z = \dfrac{x - \mu}{\sigma}$ and solving for the unknown x

Example 9.4
The time it takes a randomly selected person to perform a task is normally distributed with a mean value of 120
seconds and a standard deviation of 20 seconds.

1. The probability that a randomly selected employee will complete the task between 100 and 130 seconds:

2. The probability that a randomly selected employee will complete the task between 75 and 100 seconds:

3. The probability that a randomly selected employee will complete the task within 75 seconds:

4. The probability that a randomly selected employee will complete the task in more than 75 seconds:
5. The 10% of the employees who complete the task within the shortest time are to be given advanced
training. What task times qualify individuals for such training?

Activity 9.3
The lifetimes of a certain kind of battery have a mean of 300 hours and a standard deviation of 35 hours. Assume
that the lifetimes, measured to the nearest hour, follow a normal distribution.
1. Determine the percentage of batteries that have a lifetime of more than 320 hours

2. Determine the value above which the best 30% of the batteries lie
3. Determine the proportion of batteries that have a lifetime from 250 to 350 hours

4. Determine the proportion of batteries with a lifetime between 250 and 280 hours

5. Determine the maximum lifetime below which the weakest 20% of the batteries will lie

6. Determine the minimum lifetime above which the 60% of the batteries with the longest life will fall
UNIT 10: ESTIMATION

The objective of most statistical studies is inference. Inferential methods presented in the following two units use
information contained in a sample to reach conclusions about the characteristics of the population from which the
sample was drawn. There are two general procedures for making inferences about populations: estimation and
hypothesis testing.

After completion of this unit, you will be able to:


 Describe the purpose of sampling distributions
 Find the confidence interval for the mean
 Find the confidence interval for the proportion
 Determine the size of the sample required in order to make estimates to a specified degree of accuracy

10.1. STATISTICS AND PARAMETERS


Calculations based on a sample are known as sample statistics. The values calculated from population information
are referred to as population parameters. Sample statistics are used to estimate population parameters. To
distinguish between the two, Greek letters will be used to refer to population parameters and Roman letters will
refer to sample statistics.

                           Statistic   Parameter
Mean                           x̄           µ
Standard deviation             s           σ
Variance (not in book)         s²          σ²
Proportion                     p           π
Size                           n           N

10.2. SAMPLING DISTRIBUTION OF THE MEAN


A sampling distribution is a distribution of a statistic (such as the mean) of all the possible samples of the same
size selected from a population. The sampling distribution theorem forms the foundation of inferential statistics.
The theorem states that if you draw all the possible different samples of the same size (n) from a population, you
can calculate the mean of each sample and these means will most probably differ. If you list all these possible
samples together with their means, it is known as the sampling distribution of the mean.
Properties of the sampling distribution of the means
• The mean of all possible sample means is equal to the population mean:
o $\mu_{\bar{x}} = \mu$
• The standard error of the mean measures the variability in the sampling distribution:
o $\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}$
o The larger the sample size, the smaller the standard error of the mean and the better the estimate
of the population mean because of the lesser dispersion
• In most cases the population standard deviation (σ) is unknown and the sample standard deviation (s) is
used as an approximation to the population standard deviation, and the standard error of the mean
becomes:
o $s_{\bar{x}} = \dfrac{s}{\sqrt{n}}$
• If the population is normally distributed with a known population standard deviation, the sampling
distribution of the mean is also normal, regardless of the sample size
• The central limit theorem (CLT) states that if the population is not normal, the distribution of the sample
means will become more normal as the sample size increases
o A sample of n ≥ 30 is considered large enough to assume a normal distribution
• If the population is normal with an unknown population standard deviation, the distribution of the sample
means is referred to as Student’s t-distribution
o The t-distribution is more dispersed than the normal distribution
o It is distinguished by degrees of freedom: df = n – 1
o There is a different t-distribution for every sample size
o For large n the t-distribution can be approximated by the Z-distribution
o The t-distribution will only be used if σ is unknown and n < 30
• The sample proportion (p) is used to estimate the population proportion (π)
• For the sampling distribution of a proportion:
o The mean is: $\mu_p = \pi$
o And the standard error: $\sigma_p = \sqrt{\dfrac{\pi(1 - \pi)}{n}}$
o If π is unknown: $\sigma_p \approx \sqrt{\dfrac{p(1 - p)}{n}}$
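
The two standard error formulas can be sketched as small functions. The numbers used below (a standard deviation of 12 with n = 36, and a sample proportion of 0.2 with n = 100) are hypothetical, as are the function names.

```python
from math import sqrt

def standard_error_mean(sd, n):
    """Standard error of the mean: sd / sqrt(n), with sd = sigma (or s if sigma is unknown)."""
    return sd / sqrt(n)

def standard_error_proportion(p, n):
    """Standard error of a proportion: sqrt(p * (1 - p) / n)."""
    return sqrt(p * (1 - p) / n)

se_mean = standard_error_mean(12, 36)           # 12 / 6 = 2.0
se_prop = standard_error_proportion(0.2, 100)   # sqrt(0.16 / 100), about 0.04

print(se_mean, round(se_prop, 4))

# A larger sample gives a smaller standard error.
print(standard_error_mean(12, 144) < se_mean)  # True
```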
10.3. ESTIMATING POPULATION PARAMETERS

Two types of estimates can be made about a population: a point estimate or an interval estimate.

10.3.1. Point estimation


In order to form a point estimate for a population mean we take a sample of size n and compute the sample mean.
The sample mean is a point estimate or estimator for the population mean. The sample proportion p is the
estimator for the population proportion π.

The discrepancy between a sample statistic and its population parameter is called sampling error. Measuring
sampling error forms a large part of inferential statistics. A point estimate will almost never be exactly correct
because the sample is one point in the sample space of sample means. Each possible sample statistic will result
in a different estimate for the population parameter.

10.3.2. Confidence interval estimates


Because of sampling variability, a point estimate is much more useful if it is accompanied by a probability to
measure how close the sample mean is to the population mean. This is done using confidence intervals. A
confidence interval is a range of values within which the population parameter is expected to lie, with a certain
level of confidence.

The probability associated with an interval estimate is known as the confidence level (1 − α) and is a measure of
confidence that the interval estimate will include the population parameter. A high probability, say 99%, means
more confidence and hence a wider interval.

Components of a confidence interval of a single population mean µ:

$\bar{x} \pm z \dfrac{\sigma}{\sqrt{n}}$
where
$\bar{x}$ = point estimate for $\mu$
$z$ = critical value associated with the chosen level of confidence
$\dfrac{\sigma}{\sqrt{n}}$ = standard error of the estimate
$z \dfrac{\sigma}{\sqrt{n}}$ = margin of error (ME)
Calculating a confidence interval
• Depends on the problem statement, sample size, distribution

Problem statements/Type of sampling distribution:


• Single mean where the population is normal with σ known → Z-distribution
o $\bar{x} \pm z \dfrac{\sigma}{\sqrt{n}}$
• Single mean where the population is not normal / unknown with σ known and n ≥ 30 → Z-distribution via
CLT
o $\bar{x} \pm z \dfrac{\sigma}{\sqrt{n}}$
• Single mean where the population is normal or not normal / unknown with σ unknown and n ≥ 30 →
Z-distribution via CLT
o $\bar{x} \pm z \dfrac{s}{\sqrt{n}}$
• Single mean where the population is normal with σ unknown and n < 30 → t-distribution with df = n – 1
o $\bar{x} \pm t \dfrac{s}{\sqrt{n}}$
• Single proportion with nπ ≥ 5 and n(1 – π) ≥ 5 → Z-distribution
o $p \pm z \sqrt{\dfrac{p(1 - p)}{n}}$
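
The Z-based intervals above can be sketched as follows. All inputs are hypothetical, the function names are assumptions, and the critical value is taken from `statistics.NormalDist` rather than a printed table. The t-based interval is left out because the Python standard library has no t-distribution.

```python
from math import sqrt
from statistics import NormalDist

def critical_z(confidence):
    """Two-sided critical z-value for a confidence level such as 0.95."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

def mean_interval(x_bar, sd, n, confidence):
    """x_bar +/- z * sd / sqrt(n)."""
    margin = critical_z(confidence) * sd / sqrt(n)
    return x_bar - margin, x_bar + margin

def proportion_interval(p, n, confidence):
    """p +/- z * sqrt(p * (1 - p) / n)."""
    margin = critical_z(confidence) * sqrt(p * (1 - p) / n)
    return p - margin, p + margin

# Hypothetical sample: mean 50, standard deviation 12, n = 36, 95% confidence.
low, high = mean_interval(x_bar=50, sd=12, n=36, confidence=0.95)
print(f"{low:.2f} to {high:.2f}")  # 46.08 to 53.92

# Hypothetical sample proportion 0.2 with n = 100, 95% confidence.
p_low, p_high = proportion_interval(p=0.2, n=100, confidence=0.95)
print(f"{p_low:.4f} to {p_high:.4f}")  # 0.1216 to 0.2784
```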
Activity 10.1
Identify the critical z-values associated with 90%, 95%, 98% and 99% confidence levels.

Additional
Identify the critical t-values associated with 90%, 95%, 98% and 99% confidence levels for a sample of size 15.

Activity 10.2
In 36 randomly selected seawater samples, the mean sodium chloride concentration was 23 cm³/m³ and the
standard deviation was 6.7 cm³/m³. Construct a 98% confidence interval estimate for the mean sodium chloride
concentration.
Activity 10.3
The time taken to complete the same task (in minutes) was recorded for nine participants in a training exercise:
8 7 8 9 7 7 9 10 9

Construct a 95% confidence interval for the average time taken to complete the task.

Activity 10.4
A medical researcher wished to determine the percentage of females who take vitamins. A study of 180 females
showed that 25% took vitamins. Construct a 99% confidence interval for the percentage of females who take
vitamins.
10.4. SAMPLE SIZE (n)
An important question that needs to be answered in statistical inference is how large a sample is needed to
guarantee a certain level of confidence for a given margin of error. An appropriate sample size depends on:
• Level of confidence
• Variability in the population being studied
o The greater the variability in the population, the larger the sample required
• Maximum allowable error (E)
o This is the maximum amount a point estimate should differ from (above or below) the parameter being
estimated, i.e. the difference between the sample statistic and the population parameter

Formula for determining the sample size when estimating the population mean:

$n = \left( \dfrac{z \sigma}{E} \right)^{2}$

Formula for determining the sample size when estimating the population proportion:

$n = \dfrac{\pi (1 - \pi) z^{2}}{E^{2}}$
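
Both sample size formulas can be sketched as functions. The result is rounded up because a sample size must be a whole number that is at least as large as the calculated value. The inputs below are hypothetical and the function names are assumptions.

```python
from math import ceil
from statistics import NormalDist

def critical_z(confidence):
    """Two-sided critical z-value for a confidence level such as 0.95."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

def sample_size_for_mean(sigma, max_error, confidence):
    """n = (z * sigma / E)^2, rounded up to the next whole number."""
    z = critical_z(confidence)
    return ceil((z * sigma / max_error) ** 2)

def sample_size_for_proportion(pi, max_error, confidence):
    """n = pi * (1 - pi) * z^2 / E^2, rounded up to the next whole number."""
    z = critical_z(confidence)
    return ceil(pi * (1 - pi) * z ** 2 / max_error ** 2)

# Hypothetical inputs: sigma = 10, E = 2, 95% confidence.
print(sample_size_for_mean(sigma=10, max_error=2, confidence=0.95))  # 97

# Hypothetical inputs: prior proportion 0.3, E = 0.05, 95% confidence.
print(sample_size_for_proportion(pi=0.3, max_error=0.05, confidence=0.95))  # 323
```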

Activity 10.5
A cheese processing company wants to estimate the mean cholesterol content of all 50g servings of cheese. The
estimate must be within 0.5g of the population mean. Determine the minimum required sample size to construct
a 95% confidence interval for the population mean. Assume the population standard deviation is 2.8g.

Activity 10.6
A daily newspaper is investigating the reading habits of the home delivery customers. If a previous survey
indicated that 50% of readers read the editorial page, what sample size should be used to estimate, within 4%, the
proportion of people who read the editorial page if α = 0.1.
UNIT 11: HYPOTHESIS TESTING

In Unit 10 you learnt how to estimate the value of the population parameter of interest. In this unit you will learn
how to test a claim or hypothesis about a population parameter.

After completing this unit, you will be able to:


• explain the reasoning behind hypothesis testing
 explain the steps in the hypothesis-testing procedure
 distinguish between a one-tailed and two-tailed test
 conduct tests of hypothesis concerning values of the following parameters:
o population mean; large and small samples
o population proportion
 conduct chi-square tests

A hypothesis is a claim or statement about a population characteristic. Hypothesis testing is a decision-making
process to determine whether enough statistical evidence exists to enable us to conclude that a belief or hypothesis
about a population parameter is reasonable. The fundamental idea behind hypothesis testing procedures is: the
null hypothesis (H0) is rejected if the observed sample is very unlikely to have occurred when H0 is true.

To make this decision, it is necessary to decide whether the difference between the hypothesized
population parameter and the sample result is significant and therefore not supportive of the hypothesis, or
whether the difference is a chance difference and therefore consistent with the claim.

In the sampling process, any one of the samples in the sampling distribution might be selected. Most of the time
the sample mean will not equal the population mean exactly. A small difference of this kind is due to the sampling
process and is known as a chance difference; it is not large enough to cause concern. Results are statistically
significant if the difference between the sample result and the statement made in H0 is unlikely to occur due to
chance alone. A significant result indicates that the sample came from a population with a mean other than the
hypothesised mean.
11.1 A SINGLE SAMPLE CLASSICAL HYPOTHESIS TEST
Steps
Understand the problem
 Set up the null and alternative hypothesis.
Define the test procedure and decision rule
 Select the significance level (α) for the test
 Determine the type of sampling distribution (z or t)
 Determine the critical value(s) and corresponding rejection region
Collect and analyse the data
 Collect the data, calculate the test statistic
Draw conclusions and make recommendations
 Make the statistical decision by comparing the test statistic with the rejection region
 Interpret the statistical decision

11.4.1. Stating the hypothesis


To test a population parameter, you should identify a pair of hypotheses: the null hypothesis (H0) and its
complement, the alternative hypothesis (HA).

H0: population characteristic (μ or π) = hypothesised value


 A null hypothesis, denoted by H0, is a statement of equality or no difference
 The ‘null’ implies that there has been no change or no difference in the value of the parameter
 The null hypothesis is assumed true until evidence indicates otherwise

HA: State the alternative hypothesis


 The HA (or test hypothesis) states what the case will be if the null hypothesis is not true
 The hypothesised value appearing in HA must be identical to the value in H0

Different ways to set up the hypothesis:


 A test in which we want to find out whether a population parameter (μ or π) has changed (≠), regardless
of the direction of change, is referred to as a two-tailed test.
H0: population characteristic = hypothesized value (will remain unchanged)
HA: population characteristic ≠ hypothesized value (will be different)
 If we wish to determine whether the sample came from a population that has a parameter (μ or π) less
than or more than a hypothesized value, the attention is focused on the direction of change, and the test is
referred to as left-tailed or right-tailed
o Left-tailed: H0: population characteristic = hypothesised value
HA: population characteristic < hypothesised value
o Right-tailed: H0: population characteristic = hypothesised value
HA: population characteristic > hypothesised value

Some common phrases that indicate the direction of the test (note a few changes compared to the textbook)
> (right-tailed)  | < (left-tailed)          | ≠ (two-tailed)
Is greater than   | Is less than             | Is not equal to
Is above          | Is below                 | Is different
Is higher than    | Is lower than            | Has changed
Is longer than    | Is shorter than          |
Is bigger than    | Is smaller than          |
Has increased     | Has decreased or reduced |
An incline        | A decline                |

11.4.2. Select a level of significance (α) to be used


When you perform a hypothesis test, you can make one of two decisions: reject H0 or do not reject H0. Because
this decision is based on a sample and not on the population, there is always a possibility that the decision could
be wrong. The probability associated with this uncertainty is your level of significance. Just as we attach a level
of confidence to the construction of an interval, we can specify the probability of making an error. The
significance level is chosen by the researcher before the sample data are collected.
 A Type I error occurs if H0 is rejected when it is true; α is the probability of making a Type I error
 A Type II error occurs if H0 is not rejected when it is false

For example, when α = 0.10, there is a 10% chance of rejecting a true H0. You can decrease the probability of
rejecting H0 when it is actually true by lowering the significance level. The significance level is the maximum
probability of making a Type I error and is denoted by α. The purpose of the level of significance is to provide a
probability basis for deciding whether an observed difference between a sample statistic and a hypothesized
parameter is a chance difference or a statistically significant difference, since α is the probability that the test
statistic will fall in the rejection area when H0 is true. Usually tests are performed at an α-value of 0.01, 0.02,
0.05 or 0.10.
11.4.3. Determine the type of sampling distribution
This step will enable you to know whether to use the normal z-distribution or the t-distribution in determining the
rejection area.
 Use the normal z-distribution:
o when the distribution is approximately normal with σ known
o when the sample size n ≥ 30, with σ known or unknown
 Use the t-distribution:
o when σ is unknown and the sample size n < 30
o If n ≥ 30 the distribution approximates the normal curve via the Central Limit Theorem and you
use the normal z-table

11.4.4. Determine the critical value(s) and identify the rejection region
 The critical value represents the maximum number of standard errors the sample mean or proportion
can differ from the hypothesized value before H0 is rejected.
 The critical value separates the area under the curve into two regions - the non-rejection region and the
rejection region.
 The rejection region (or decision rule) is a range of values such that, if the test statistic falls into that range,
we reject H0.

Steps
 Specify the level of significance (α)
 Decide whether the test is two-tailed, left-tailed or right-tailed (based on the alternative hypothesis)
 Determine the critical value(s) and identify the rejection region
 Sketch the normal curve. Draw a vertical line at each critical value and shade the rejection region(s).
 State the rejection region in words.

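Alternative hypothesis | Right-tailed (HA: >)        | Left-tailed (HA: <)         | Two-tailed (HA: ≠)
Rejection region       | Right-hand tail (area α)    | Left-hand tail (area α)     | Both tails (area α/2 in each)
Critical value         | One positive critical value | One negative critical value | One negative and one positive critical value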
Example 11.1
A two-tailed hypothesis test at the 5% level of significance for a normal distribution places 2.5% (α/2) in each
tail. The area to look up in the normal table is (0.5 – 0.025) = 0.475. If you look up 0.475, the corresponding
z-value is -1.96 to the left and +1.96 to the right. These two values are your critical values.
 The rejection region is given as: Reject H0 if the z-test statistic > 1.96 or if the z-test statistic < -1.96

A one-tailed test to the right at a 5% level of significance means all 5% will go into the right-hand tail; the area
to look up is (0.5 – 0.05) = 0.45, which will result in a z-value of +1.64. This 1.64 is your critical value.
 The rejection region is given as: Reject H0 if the z-test statistic > 1.64

A one-tailed test to the left at a 5% level of significance means all 5% will go into the left-hand tail; the area
to look up is (0.5 – 0.05) = 0.45, which will result in a z-value of -1.64. This -1.64 is your critical value.
 The rejection region is given as: Reject H0 if the z-test statistic < −1.64
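The table look-up in Example 11.1 can be checked with a short Python sketch. It assumes Python 3.8 or later, whose standard library provides `statistics.NormalDist`; the helper name `z_critical` is hypothetical. Note that the computed one-tailed value at 5% is about 1.645, which the example rounds to 1.64 when reading the table.

```python
from statistics import NormalDist

def z_critical(alpha, tail):
    """Return the z critical value(s) for tail = 'two', 'right' or 'left'."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if tail == "two":
        z = std_normal.inv_cdf(1 - alpha / 2)  # alpha/2 in each tail
        return (-z, z)
    if tail == "right":
        return std_normal.inv_cdf(1 - alpha)   # all of alpha in the right tail
    if tail == "left":
        return std_normal.inv_cdf(alpha)       # all of alpha in the left tail
    raise ValueError("tail must be 'two', 'right' or 'left'")

lower, upper = z_critical(0.05, "two")
print(round(lower, 2), round(upper, 2))     # -1.96 1.96
print(round(z_critical(0.05, "right"), 3))  # 1.645
```

The same function can be used to check table readings at other levels of significance; it does not cover the t-distribution, for which the t-table is still needed.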
Activity 11.1
Find the critical value(s) and rejection region for a two-tailed test, a left-tailed test and a right-tailed test for an
approximately normal distribution at a level of significance of 1%, 2% and 10%.
1% 2% 10%
Two-tailed

Right-tailed

Left-tailed
Extra activity
Find the critical value(s) and rejection region for a two-tailed test, a left-tailed test and a right-tailed test for a
t-distribution, where n = 12, at a level of significance of 1%, 2% and 10%.
1% 2% 10%
Two-tailed

Right-tailed

Left-tailed
11.4.5. Conduct the statistical test (calculating the test statistic)
By making use of the appropriate formula for the sampling distribution we calculate how many standard errors
the sample result is away from the assumed population value.
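Parameter             | Distribution                    | Test statistic
Single mean (µ)       | z                               | $z = \dfrac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
Single mean (µ)       | t with n − 1 degrees of freedom | $t = \dfrac{\bar{x} - \mu}{s / \sqrt{n}}$
Single proportion (π) | z                               | $z = \dfrac{p - \pi}{\sqrt{\pi(1 - \pi) / n}}$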
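As a check on the formulas above, the sketch below computes the z test statistic for a single mean and for a single proportion. The function names are hypothetical. The inputs are the figures given in Activity 11.2 (n = 100, sample mean 125, standard deviation 15, hypothesised mean 130) and Example 11.4 (85 of 100 in favour, hypothesised proportion 0.9); the results agree with the test statistics of −3.33 and −1.67 quoted for these two problems in Section 11.2.

```python
import math

def z_stat_mean(x_bar, mu0, sd, n):
    """Number of standard errors between the sample mean and the hypothesised mean."""
    return (x_bar - mu0) / (sd / math.sqrt(n))

def z_stat_proportion(p, pi0, n):
    """Number of standard errors between the sample proportion and the hypothesised proportion."""
    return (p - pi0) / math.sqrt(pi0 * (1 - pi0) / n)

z_mean = z_stat_mean(x_bar=125, mu0=130, sd=15, n=100)   # Activity 11.2 data
z_prop = z_stat_proportion(p=85 / 100, pi0=0.9, n=100)   # Example 11.4 data
print(round(z_mean, 2), round(z_prop, 2))  # -3.33 -1.67
```

The t statistic uses the same calculation as `z_stat_mean` with the sample standard deviation s; only the table used for the critical value changes.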

11.4.6. Make the decision


 The decision rule is a statement that indicates the action to be taken, that is, do not reject H0 or reject H0
 Compare the test statistic with the critical value to see if it falls within the limits of the rejection area or
outside the limits
 If the test statistic falls within the rejection region, the sample evidence does not support the H0 that the
parameter has the specified value and we say reject H0
 If the test statistic falls in the non-rejection region, the sample evidence is consistent with the H0 that the
parameter has the specified value and we say do not reject H0
 We never accept H0. Sample evidence can never prove the null hypothesis to be true. When we do not
reject the null hypothesis, we are saying that the evidence indicates that the null hypothesis could be true
and that there is no statistical evidence to reject it. As long as conclusions are based on sample data, there
is a chance that an error could be made

11.4.7. Interpret the decision


The conclusion should be stated in the context of the original claim. The level of significance should be included
and you would say there is enough evidence to support the claim or there is not enough evidence to support the
claim.
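The decision step can be expressed as a small rule. The sketch below is illustrative only: the function name `decide` and the test statistics passed to it are hypothetical, and `critical` is taken to be the positive value read from the table.

```python
def decide(test_stat, critical, tail):
    """Compare a test statistic with the rejection region.

    critical is the positive table value; tail is 'two', 'right' or 'left'.
    """
    if tail == "two":
        reject = abs(test_stat) > critical
    elif tail == "right":
        reject = test_stat > critical
    elif tail == "left":
        reject = test_stat < -critical
    else:
        raise ValueError("tail must be 'two', 'right' or 'left'")
    return "Reject H0" if reject else "Do not reject H0"

# Hypothetical test statistics, for illustration only
print(decide(2.50, 1.96, "two"))    # Reject H0
print(decide(-1.20, 1.64, "left"))  # Do not reject H0
```

The function never returns "accept H0", in line with the point made in 11.4.6.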
Example 11.2
A machine is set to fire 30 g of dried fruit into a box of cereal moving along the production line. A sample of 36
boxes revealed that the average mass of fruit inserted was 30.3 g with a standard deviation of 0.5 g. Is the increase
in the amount of fruit inserted significant at the 0.01 level of significance?
 Identify the test:
 Identify the distribution:
 Identify the form of the alternative hypothesis:

Activity 11.2
A sample of 100 healthy adult males has a mean systolic blood pressure of 125 mmHg with a standard deviation of
15 mmHg. Test at a 2% level of significance whether the mean systolic blood pressure is different from the generally
accepted level of 130 mmHg.
 Identify the test:
 Identify the distribution:

1) Hypotheses

2) Critical value

3) Test statistic

4) Decision

5) Conclusion
Example 11.3
A process takes an average time of 35 minutes. It is thought that a certain modification would reduce this time,
and after being modified, the process is repeated 13 times, giving an average time of 33.3 minutes with a standard
deviation of 2.4 minutes. Is there any significant reduction in the time at a level of significance of 0.05?
 Identify the test:
 Identify the distribution:
 Identify the form of the alternative hypothesis:

Activity 11.3
From past records we know that the average unbroken sleep period of patients with a certain kind of insomnia is
2.8 hours. A new drug is tested on a sample of 25 patients and this yields an average of 3 hours of unbroken sleep
with a standard deviation of 0.8 hours. Is there a significant improvement in the number of hours of unbroken sleep?
Test at α = 2.5%.
 Identify the test:
 Identify the distribution:

1) Hypotheses

2) Critical value

3) Test statistic

4) Decision

5) Conclusion
Example 11.4
Directors of a company claim that 90% of the workforce supports a new shift pattern that they have suggested. A
random survey of 100 people in the workforce finds 85 in favour of the new scheme. Test at a 5% level if there is
a significant difference between the survey results and the directors’ claim.
 Identify the test:
 Identify the distribution:
 Identify the form of the alternative hypothesis:

Activity 11.4
To determine if new flavours of ice cream must be introduced into the market, a random sample of 320 people
was asked to taste and choose their favourite ice cream flavour. Of the 320 people surveyed, 58 responded that
they preferred the chocolate flake flavour. Test the claim that less than 25% of people prefer chocolate flake
flavour ice cream at the α = 0.01 level of significance.
 Identify the test:
 Identify the distribution:

1) Hypotheses

2) Critical value

3) Test statistic

4) Decision

5) Conclusion
11.2 HYPOTHESIS TESTING USING THE p-VALUE APPROACH
The p-value of the test statistic is the probability, when the null hypothesis is true, of obtaining a sample
statistic equal to, or more extreme than, the one calculated from the sample.

Steps
 State the null and alternative hypotheses
 Determine the sampling distribution (z or t)
 Calculate the test statistic
 Calculate the p-value
o Note: we will only calculate p-values for tests that use the Z-distribution

Alternative hypothesis | Right-tailed            | Left-tailed             | Two-tailed
p-value                | P(Z > z test statistic) | P(Z < z test statistic) | 2 × P(Z > |z test statistic|)

 Interpret the p-value


If the p-value is:                            | Decision:                             | Conclusion:
Less than the level of significance (< α)     | Reject H0                             | Sufficient evidence for Ha at α% level of significance
Greater than the level of significance (> α)  | Do not reject H0 / Fail to reject H0  | Insufficient evidence for Ha at α% level of significance

NOTE: for examination purposes in 2017, we will only focus on p-value calculations for hypothesis tests based
on the Z distribution.
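The p-value rules for the Z distribution can be sketched in Python as follows, assuming Python 3.8 or later for `statistics.NormalDist`. The function name `p_value`, the test statistic z = 2.10 and the level α = 0.05 are assumptions made for the illustration.

```python
from statistics import NormalDist

def p_value(z, tail):
    """p-value of a z test statistic for tail = 'right', 'left' or 'two'."""
    std_normal = NormalDist()
    if tail == "right":
        return 1 - std_normal.cdf(z)             # P(Z > z)
    if tail == "left":
        return std_normal.cdf(z)                 # P(Z < z)
    if tail == "two":
        return 2 * (1 - std_normal.cdf(abs(z)))  # 2 x P(Z > |z|)
    raise ValueError("tail must be 'right', 'left' or 'two'")

alpha = 0.05                 # hypothetical level of significance
p = p_value(2.10, "right")   # hypothetical test statistic
decision = "Reject H0" if p < alpha else "Do not reject H0"
print(round(p, 4), decision)  # 0.0179 Reject H0
```

Values obtained this way may differ slightly in the last decimal from those read off a printed normal table.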
Example 11.5
A cellphone operator’s manager believes that customer monthly cellphone bills average more than R85 per month.
To test this claim, a sample of 64 customer cellphone accounts was randomly selected. The calculated test statistic
was equal to 1.5.
 Hypotheses:
H0: µ = 85
Ha: µ > 85
 Test statistic = 1.5
 p-value:

 Decision:
 Conclusion:

Activity 11.2 (one sample z-test)


A sample of 100 healthy adult males has a mean systolic blood pressure of 125 mmHg with a standard deviation of 15 mmHg.
 Hypotheses:
H0: µ = 130
Ha: µ ≠ 130
 Test statistic = −3.33
 p-value =

 Decision:
 Conclusion:
Example 11.4 (one sample proportion)
Directors of a company claim that 90% of the workforce supports a new shift pattern that they have suggested. A
random survey of 100 people in the workforce finds 85 in favour of the new scheme.
 Hypotheses:
H0: π = 0.9
Ha: π ≠ 0.9
 Test statistic = −1.67
 p-value =

 Decision:
 Conclusion:

Activity 11.4 (one sample proportion)


To determine if new flavours of ice cream must be introduced into the market, a random sample of 320 people
was asked to taste and choose their favourite ice cream flavour. Of the 320 people surveyed, 58 responded that
they preferred the chocolate flake flavour.
 Hypotheses:
H0: π = 0.25
Ha: π < 0.25

 Test statistic =
 p-value =

 Decision:
 Conclusion:
11.3 TESTING FOR DIFFERENCES AMONG MEANS AND PROPORTIONS
Not for examination purposes for 2017

11.4 TESTS USING THE CHI-SQUARE DISTRIBUTION (χ2)


The chi-square hypothesis tests are used for qualitative data and involve counting the data items that fall into the
testing categories.
 The chi-square distribution (χ2) is a continuous distribution
 The distribution is positively skewed but approaches the normal distribution as the number of degrees of
freedom increases
 It is used to test whether the observed frequency corresponds with the expected frequency
 The χ2-value is never negative

11.4.1. Test for independence


This test is used to determine whether relationships exist between different variables from a contingency table, or
whether the variables may be considered independent. The dimensions of such a table are described by
the number of rows (r) and the number of columns (k), written as (r × k).

Steps
 State H0 and Ha: The null hypothesis states that the two variables are statistically independent. This means
that knowledge of the one variable does not help in predicting the other variable.
H0: The variables are independent (no relationship)
Ha: The variables are dependent (there is a relationship)
 Select the level of significance (α)
 State the decision rule by defining the rejection region
o Use the level of significance and degrees of freedom to find the critical value from the χ2 table
o df = (r – 1)(k – 1)
o The χ2 distribution is positively skewed, so the critical value will always be in the right-hand tail
o The rejection region for H0 goes from the critical value to the right, i.e. reject H0 if the test statistic is
greater than the critical value
[Figure: χ2 distribution starting at 0, with the rejection region to the right of the critical value χ2(α, df)]
 Calculate the value of the chi-square test statistic by substituting, cell by cell, the values from the fo and fe
tables into the formula:

$$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$$

 fo = observed frequencies in each cell
 fe = expected frequencies in each cell
 fe = (row total × column total) / overall total
 Compare the test statistic with the rejection region
o Make a decision in terms of H0
o Reach a conclusion in terms of Ha
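The cell-by-cell calculation for the test for independence can be sketched as follows. The function name and the 2 × 2 table of observed frequencies are hypothetical and serve only to illustrate the steps; the critical value 3.841 is the χ2 table value for α = 0.05 with 1 degree of freedom.

```python
def chi_square_independence(observed):
    """Return (chi-square statistic, degrees of freedom) for a contingency table.

    observed is a list of rows, each row a list of observed frequencies (fo).
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    overall = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, fo in enumerate(row):
            fe = row_totals[i] * col_totals[j] / overall  # expected frequency
            chi2 += (fo - fe) ** 2 / fe
    df = (len(observed) - 1) * (len(observed[0]) - 1)     # (r - 1)(k - 1)
    return chi2, df

# Hypothetical 2 x 2 table, for illustration only
observed = [[20, 30],
            [30, 20]]
chi2, df = chi_square_independence(observed)
critical = 3.841  # chi-square table value for alpha = 0.05, df = 1
print(chi2, df, "Reject H0" if chi2 > critical else "Do not reject H0")
# 4.0 1 Reject H0
```

Here every expected frequency is 25, so each of the four cells contributes (5)²/25 = 1 to the statistic.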

Activity 11.8
A manufacturer of women’s clothing is interested to know if age is a factor in whether women would buy a
particular garment, depending on its quality. A researcher sampled three age groups and each woman was asked
to rate the garment as excellent, average or poor. Test the hypothesis, at a 5% level of significance, that rating is
not related to age group.
 Rephrase: Test the hypothesis, at a 5% level of significance, if rating is related to age group

OBSERVED (rows: rating; columns: age group)
Rating    | 15 – 20 | 21 – 30 | 31 – 60 | Total
Excellent | 40      | 47      | 46      |
Average   | 51      | 74      | 57      |
Poor      | 29      | 19      | 37      |
Total     |         |         |         |

1) Hypotheses

2) Critical value
3) Test statistic

EXPECTED (rows: rating; columns: age group)
Rating    | 15 – 20 | 21 – 30 | 31 – 60 | Total
Excellent | 39.9    | 46.55   | 46.55   |
Average   |         | 63.7    | 63.7    |
Poor      | 25.5    | 29.75   | 29.75   |
Total     |         |         |         |

CHI-SQUARED TEST STATISTIC (rows: rating; columns: age group)
Rating    | 15 – 20 | 21 – 30 | 31 – 60 | Total
Excellent | 0.0003  | 0.0044  | 0.0065  |
Average   |         | 1.6655  | 0.7047  |
Poor      | 0.4804  | 3.8845  | 1.7668  |
Total     |         |         |         |

4) Decision

5) Conclusion
11.4.2. Goodness-of-fit tests
The χ2 goodness-of-fit test is used to determine whether a set of sample data differs
significantly from what is expected under a hypothesised distribution:
 Equal distribution across categories
 OR equal to some known distribution

Steps
 State H0 and Ha
o Equal distribution
H0: All categories are equal (π1 = π2 =…= πk)
Ha: At least 1 category is different
o Equal to some known distribution
H0: Distribution follows a known pattern
Ha: At least 1 category is different from the known pattern
 Select the level of significance (α)
 State the decision rule by defining the rejection region
o Use the level of significance and degrees of freedom to find the critical value from the χ2 table
o df = (k – 1), where k is the number of categories
 Calculate the value of the χ2 test statistic using the observed (fo) and expected (fe) frequencies:

$$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$$

 fo = observed frequencies in each cell
 fe = expected frequencies in each cell
 fe = πi × n
 Compare the test statistic with the rejection region
o Make a decision in terms of H0
o Reach a conclusion in terms of Ha
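A goodness-of-fit calculation follows the same pattern, with the expected frequencies obtained from fe = πi × n. In the sketch below the function name and the observed counts are hypothetical; the critical value 7.815 is the χ2 table value for α = 0.05 with 3 degrees of freedom.

```python
def chi_square_goodness_of_fit(observed, proportions):
    """Return (chi-square statistic, degrees of freedom, expected frequencies).

    proportions are the hypothesised category proportions and must sum to 1.
    """
    n = sum(observed)
    expected = [pi * n for pi in proportions]  # fe = pi_i x n
    chi2 = sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))
    df = len(observed) - 1                     # number of categories minus 1
    return chi2, df, expected

# Hypothetical counts tested against an equal distribution over four categories
observed = [30, 20, 25, 25]
chi2, df, expected = chi_square_goodness_of_fit(observed, [0.25] * 4)
critical = 7.815  # chi-square table value for alpha = 0.05, df = 3
print(chi2, df, "Reject H0" if chi2 > critical else "Do not reject H0")
# 2.0 3 Do not reject H0
```

For a known, unequal distribution only the list of proportions changes, for example `[0.45, 0.20, 0.20, 0.15]`.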

Example 11.12
A manufacturer of soap wishes to know if consumers have a preference for bath soap fragrances. To answer this
question, a random sample of 200 adult shoppers is offered a free bar of soap. The recipients may choose from
among four fragrances.
 H0: Equal distribution
Example 11.13
The distribution of car manufacturers’ shares of the national market are given. A random sample of 2000 car
owners in Pretoria revealed the following ownership pattern: Volkswagen 758, Toyota 680, Delta 300, BMW 162
and Mercedes 100. Does the ownership pattern in Pretoria differ significantly from the national pattern?
 H0: Equal to some known distribution

Activity 11.9
A delivery of assorted nuts is labelled as having 45% walnuts, 20% hazelnuts, 20% almonds and 15% brazil nuts.
By randomly picking several scoops of nuts from the bag delivered, the following count was obtained:
         | Walnuts | Hazelnuts | Almonds | Brazil nuts | Total
Observed | 92      | 69        | 32      | 42          |

Could these findings be a basis for an accusation of mislabelling at a 2.5% level of significance?

1) Hypotheses

2) Critical value

3) Test statistic

               | Walnuts | Hazelnuts | Almonds | Brazil nuts | Total
Expected       |         | 47        | 47      | 35.25       |

               | Walnuts | Hazelnuts | Almonds | Brazil nuts | Total
Test statistic |         | 10.2979   | 4.7872  | 1.2926      |

4) Decision

5) Conclusion
UNIT 7: TIME SERIES

This unit discusses the general use of forecasting in business and several methods that are available for making
forecasts.

After completion of this unit, you should be able to:


 explain the purpose of time series analysis
 explain the components of the multiplicative model
 use linear models to analyse and project the trend of a time series
 measure the seasonal effect in a time series
 use time series in forecasting

Forecasting is the science of predicting the future. It is used in the decision-making process to help business
people reach conclusions about buying, selling, producing and many other actions. Time series analysis is known
as a forecasting tool. The objective is to analyse how observed data changes over time in order to detect patterns
that will enable us to predict future values. Time series helps us cope with uncertainty about the future. Time
series data is numerical data gathered on a given characteristic over a period of time at regular intervals.

7.1 COMPONENTS OF A TIME SERIES


It is generally believed that the factors that have influenced data in the past and present will continue to do so in
more or less the same way in the future. The primary goal of time series analysis is to isolate and measure these
influencing factors or components for forecasting purposes.

The observed time series data consists of four separate components – trend, cyclical, seasonal and irregular.
 Secular trend (T) is the underlying long-term movement (increase or decrease) over time in the recorded
data values and is usually the result of long-term factors such as changes in the population size,
demographic characteristics of the population, technology and consumer preferences
 Cyclical variations (C) are medium-term changes caused by circumstances which repeat in cycles and
cause upward and downward swings, not of equal length, throughout the series
o E.g. the general business cycle of prosperity, recession, depression and recovery
 Random or irregular variations (I) occur over short intervals and are unpredictable with no pattern to their
behaviour
o Variations due to everyday unpredictable influences
o E.g. the weather, political unrest, crime, war, transport breakdowns
 Seasonal variations (S) are short-term fluctuations that tend to repeat themselves over regular periods of
days, weeks, months, quarters, etc.
o E.g. ice cream sales, an increase in the number of flu cases every winter, higher sales in shops
before Christmas, etc.

The classical time series model is known as the multiplicative model and is represented by the following equation
showing the four components that make up the time series and their relation to each other:
Y = T × C × S × I
o The components are multiplied to provide the value of the observed dependent variable Y
o Trend (T) is in the same units as Y
o The other components are expressed as percentage adjustments
 A value > 100 indicates an above-average effect of the component
 A value < 100 indicates a below-average effect of the component

For a time series composed only of annual data, there is no seasonal component. In that case the time series model
becomes:
Y=T×C×I

Periodically reported data, monthly, quarterly, weekly, daily, etc., include the influence of all four components of
the time series:
Y=T×C×S×I

The process of division can remove or isolate any component of this model and is called decomposition.

NOTE: Analysis of cyclical and irregular influences on data is useful for describing past variations but, because
of their unpredictability, their value in forecasting is very limited. Instead, a number of business indicators are
used to forecast cyclical turning points. Predicting cyclical and irregular variations requires techniques beyond the
scope of this unit.

If we ignore the C and I components, since by definition they cannot be predicted, the forecasting model will
become:

Ŷ = T × S

Activity 7.1 (ignore)


7.2 HISTORIGRAM
 A historigram portrays the behaviour over time of the variable of interest (Y)
 Time is the independent variable (X) and is always measured along the horizontal axis X
 The variable of interest is the dependent variable (Y) and is measured on the vertical axis Y
 Plot the time series values (on the vertical y-axis) against time (on the horizontal x-axis) as single points,
and join the points by straight-line segments

7.3 TIME-SERIES DECOMPOSITION


7.3.1. Trend analysis (T)
The trend can be thought of as the core component of the time series model about which the other components,
Cyclical (C), Seasonal (S) and Irregular (I) variations, fluctuate. The objective is to enable the underlying
movement of the data to be highlighted by making use of a straight line passing through the data with a positive
or negative slope. Thus, a business sales trend will normally show whether sales are moving up or down (or
remaining static) in the long term. This component is found by identifying separate Trend (T) values, each
corresponding to a time point. Methods that can be used to extract a trend from a set of time series values are:
 Method of least squares
 Method of semi-averages
 Method of moving average

Method of Least Squares


Steps
 The observed time-series data are the dependent Y variable since this is the variable that we want to predict
 The time variable is the independent X variable
o x = 1 is time period 1, x = 2 is time period 2, etc.
 The least squares trend line is obtained through a simple linear regression of Y (data) on X (time)

o Ŷ = a + bx
o If raw data are given, use your calculator to find the values of the intercept (a) and slope (b) using
regression analysis
Activity 7.2
The following table shows the number of traffic tickets issued by the Alberton Traffic Department for the first six
months of the year.
Month Time (X) Number of tickets (Y)
January 1 120
February 2 120
March 3 100
April 4 90
May 5 130
June 6 150

1. Use the method of least squares to find the trend line (different phrasing to the book)
o Add the index of time (X) from 1 to the last value in the time series
o Enter the X (independent) and Y (dependent) values in your calculator for regression analysis
o Trend line:
2. Forecast the number of tickets that the Traffic Department can expect to issue for the next three months
o July:

o August:

o September:

3. Graph the time series and the trend line

[Figure: number of tickets (vertical axis, 0 to 160) plotted against time (horizontal axis, 0 to 7), with the fitted
trend line y = 4.8571x + 101.33]
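The intercept and slope that the calculator's regression mode returns can also be computed directly from the least squares formulas. The sketch below uses the ticket counts from Activity 7.2 and reproduces the trend line shown on the graph (y = 4.8571x + 101.33); the function name is hypothetical, and the time index x = 1, 2, 3, ... is generated as described in the steps.

```python
def least_squares_trend(y):
    """Fit Y-hat = a + b*x to a series, with x = 1, 2, ..., n as the time index."""
    n = len(y)
    x = list(range(1, n + 1))
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # slope
    a = sum_y / n - b * sum_x / n                                  # intercept
    return a, b

tickets = [120, 120, 100, 90, 130, 150]  # January to June, Activity 7.2
a, b = least_squares_trend(tickets)
print(round(a, 2), round(b, 4))  # 101.33 4.8571
```

A forecast for a later period is obtained by substituting its time index into a + b·x.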
Method of semi-averages
This technique involves the calculation of two averages which, when plotted on a graph as two separate points
and joined up, form a straight line.

Steps
 Split the data into two equal groups
o If there is an odd number of years, simply omit the middle year
o It is important that the two groups in question have an equal number of data values
 Calculate the arithmetic mean for each group
 Plot the two means at the midpoints of the time intervals covered by the respective groups
 Join these two points with a straight line - this is the trend line

Forecast:
 Extend the straight line up to the required forecast period and use the graph to read the value off the y-axis
 OR calculate the average increase per year by determining the difference between the two averages and
divide this difference by the number of years between the two averages
o Add this increment an appropriate number of times to the mean of the latter group
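The semi-average steps and the arithmetic form of the forecast can be sketched as follows. The function names and the six data values are hypothetical; the middle value is dropped automatically when the series has an odd number of periods.

```python
def semi_average_trend(y):
    """Return the two semi-average points and the average increase per period."""
    n = len(y)
    half = n // 2                   # size of each group; the middle value is omitted if n is odd
    first, second = y[:half], y[n - half:]
    mean1 = sum(first) / half
    mean2 = sum(second) / half
    t1 = (1 + half) / 2             # midpoint of periods 1..half
    t2 = (n - half + 1 + n) / 2     # midpoint of the last `half` periods
    increase = (mean2 - mean1) / (t2 - t1)
    return (t1, mean1), (t2, mean2), increase

def semi_average_forecast(y, t):
    """Forecast for period t: mean of the latter group plus the increase per period."""
    _, (t2, mean2), increase = semi_average_trend(y)
    return mean2 + increase * (t - t2)

values = [10, 12, 14, 16, 18, 20]        # hypothetical yearly values
print(semi_average_trend(values))        # ((2.0, 12.0), (5.0, 18.0), 2.0)
print(semi_average_forecast(values, 7))  # 22.0
```

In this example the group means are 12 (at period 2) and 18 (at period 5), giving an average increase of 2 per period.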

Method of moving averages


This method shows the trend in a time series by eliminating any obscuring variations caused by seasonal, cyclical
or irregular fluctuations by averaging a fixed number of periods. One of the drawbacks is that values for some
periods are lost at the beginning and end of the series. A moving average is an artificially constructed time series
in which the value for a given period is replaced by the mean of that value and the values of some number of
preceding and succeeding time periods.

Steps
 If the moving average covers an odd number of periods (e.g. 3, 5 or 7), there will be a middle time point opposite
which to record the answers
o E.g. a 3-year moving average
o Add the Y-values for time periods 1, 2, 3 and divide by 3
 The answer corresponds to time period 2
o Move down one year and calculate the average Y-value for time periods 2, 3, 4 and divide by 3
 The answer corresponds to time period 3
o Continue until you no longer have 3 Y-values to add
 If the moving average covers an even number of periods (e.g. 2, 4 or 6), the resulting averages will fall
between two time points. But a trend value must coincide with a particular point in time, therefore an
extra step is required to centre the average, by averaging successive moving averages
o E.g. a 4-year moving average (example not in book)
o Add the Y-values for time periods 1, 2, 3, 4 and divide by 4
 The answer corresponds to time period 2.5
o Add the Y-values for time periods 2, 3, 4, 5 and divide by 4
 The answer corresponds to time period 3.5
o Take an average of the answer for times 2.5 and 3.5
 The final answer will correspond to time period 3
o Continue until you no longer have 4 Y-values to add
 The longer the time period covered in computing the average, the smoother the resulting curve. If the
period is too long, a straight line will result and the general direction of the curve will be lost
 The moving average forecast for the next year is the average of the preceding period
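Both the odd and the even (centred) moving averages can be sketched in a few lines. The function names and the six data values are hypothetical; the comments show which time period each result corresponds to.

```python
def moving_average(y, k):
    """k-period moving averages; the i-th result covers y[i], ..., y[i + k - 1]."""
    return [sum(y[i:i + k]) / k for i in range(len(y) - k + 1)]

def centred_moving_average(y, k):
    """For even k, average successive moving averages so that they fall on a time point."""
    ma = moving_average(y, k)
    if k % 2 == 1:
        return ma  # an odd k already falls on the middle time point
    return [(ma[i] + ma[i + 1]) / 2 for i in range(len(ma) - 1)]

y = [10, 20, 30, 40, 50, 60]         # hypothetical values for periods 1 to 6
print(moving_average(y, 3))          # [20.0, 30.0, 40.0, 50.0] -> periods 2, 3, 4, 5
print(moving_average(y, 4))          # [25.0, 35.0, 45.0]       -> periods 2.5, 3.5, 4.5
print(centred_moving_average(y, 4))  # [30.0, 40.0]             -> periods 3, 4
```

The output also shows the loss of values at the beginning and end of the series mentioned above: six observations give only two centred 4-period averages.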

Example 7.3
The quarterly sales of petrol at Jack’s Garage are represented in the table below.
Time | Quarter | Sales | 3-quarterly moving average | 4-quarterly average | Centred 4-quarterly average
1    | 1       | 40    |                            |                     |
2    | 2       | 37    | 46                         |                     |
3    | 3       | 61    | 52                         | 49.0                | 46.0
4    | 4       | 58    | 45                         | A                   | C
5    | 1       | 16    | 43                         | B                   | 50.1
6    | 2       | 55    | 51                         | 52.8                | 49.0
7    | 3       | 82    | 55                         | 45.3                | 48.3
8    | 4       | 28    | 50                         | 51.3                | 53.1
9    | 1       | 40    | 46                         | 55.0                | 51.0
10   | 2       | 70    | 53                         | 47.0                | 51.8
11   | 3       | 50    | 62                         | 56.5                | 58.4
12   | 4       | 66    | 57                         | 60.3                | 56.6
13   | 1       | 55    | 54                         | 53.0                | 58.0
14   | 2       | 41    | 62                         | 63.0                | 62.8
15   | 3       | 90    | 65                         | 62.5                |
16   | 4       | 64    |                            |                     |

A= B= C=
Activity 7.4
Construct a 4-year and a 5-year moving average and graph the results.
Time | Year | Y  | 5-year moving average | 4-year average | Centred 4-year average
1    | 2005 | 14 |                       |                |
2    | 2006 | 20 |                       |                |
3    | 2007 | 40 | 26.4                  | 26.0           | 27.8
4    | 2008 | 30 | 32.0                  | B              | D
5    | 2009 | 28 | 38.2                  | C              | 36.4
6    | 2010 | 42 | A                     | 37.8           | 37.1
7    | 2011 | 51 | 35.6                  | 36.5           | 37.0
8    | 2012 | 25 |                       | 37.5           |
9    | 2013 | 32 |                       |                |
7.3.2. Seasonal variation (S)
Seasonal variations occur within a period of one year or less. Therefore periodic data (daily, weekly, monthly or
quarterly) are required. Seasonal variation is generally expressed as an index number and can be identified using the
ratio-to-moving-average method.

Ratio-to-moving-average-method
Steps
1. List the data in date order.
2. Determine the period of the moving average: a 12-monthly moving average for monthly data, a
four-quarterly moving average for quarterly data, and a six-daily moving average for data recorded six days a week.
3. Calculate the required moving average. If the period is an even number, centre the averages by averaging
adjacent moving averages.
4. Express the original time series values as percentages of the corresponding (centred) moving averages:
divide each original value by its moving average and multiply the result by 100. These are the individual
seasonal percentages.
5. Summarise the seasonal percentages in a new table by grouping together the seasonal time periods, for
example all the first quarters together, all the second quarters together, etc. Use the modified mean
approach to compute an unadjusted seasonal index per season.
o The modified mean is the arithmetic mean of the values that remain after elimination of the
smallest and largest values in the column.
6. Add the unadjusted seasonal indexes.
7. Determine the factor needed to adjust the unadjusted index numbers to typical index numbers:
Total typical quarterly index = 100 × 4 = 400
Total typical monthly index = 100 × 12 = 1200
Factor = total typical index / total of unadjusted indexes
8. Calculate the adjusted typical seasonal index numbers by multiplying the unadjusted index numbers by
this factor.
9. Seasonal indexes can be included in short-term forecasts.
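Steps 3 to 8 can be sketched in Python as follows. This is a minimal illustration, not a prescribed implementation: the function names are hypothetical, the series is assumed to start in the first season of a cycle, and each season needs at least three percentages so that the modified mean is defined. The data are the quarterly income figures of Example 7.4. Because the sketch does not round intermediate values, its results agree with the tabulated indexes only after rounding to one decimal.

```python
def centred_moving_average(y, period):
    """Step 3: moving average of length `period`, centred if `period` is even.

    Returns (first, averages), where averages[j] belongs to y[first + j].
    """
    ma = [sum(y[i:i + period]) / period for i in range(len(y) - period + 1)]
    if period % 2 == 0:
        ma = [(ma[i] + ma[i + 1]) / 2 for i in range(len(ma) - 1)]
    return period // 2, ma


def modified_mean(values):
    """Step 5: mean of what remains after dropping the smallest and largest value."""
    if len(values) < 3:
        raise ValueError("need at least three values per season")
    trimmed = sorted(values)[1:-1]
    return sum(trimmed) / len(trimmed)


def seasonal_indexes(y, period):
    """Adjusted typical seasonal indexes (assumes y[0] is in the first season)."""
    first, cma = centred_moving_average(y, period)
    # Step 4: seasonal percentages, grouped by season
    by_season = [[] for _ in range(period)]
    for j, average in enumerate(cma):
        t = first + j
        by_season[t % period].append(100 * y[t] / average)
    # Step 5: unadjusted index per season
    unadjusted = [modified_mean(p) for p in by_season]
    # Steps 6 to 8: scale so that the indexes add up to 100 x period
    factor = 100 * period / sum(unadjusted)
    return [u * factor for u in unadjusted]


# Quarterly income (R'000) from Example 7.4, 2009 to 2012
income = [52, 67, 85, 54, 57, 75, 90, 61, 60, 77, 94, 63, 66, 84, 98, 67]
indexes = seasonal_indexes(income, 4)
print([round(i, 1) for i in indexes])  # [83.3, 107.2, 126.5, 83.0]
```

Scaling by the factor in the last step guarantees that the indexes sum to 400 for quarterly data, so an index above 100 marks a season that is above the trend and an index below 100 a season that is below it.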
Seasonalised forecasting for periodically reported data

1. Determine a linear trend over the given data using Ŷ = a + bx
2. Code the different time units (quarters, months, etc.) within the forecasting period
3. Calculate the trend value for each time unit within the forecasting period
4. Multiply the trend value for each time unit by the seasonal index of that time unit, expressed as a
proportion (index ÷ 100)
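The four steps can be sketched in Python. The sketch assumes the time periods are coded x = 1, 2, ..., n and uses the ordinary least-squares formulas for a and b; the function names are illustrative. The data and seasonal indexes are those of Example 7.4, and the figures below are computed under these assumptions rather than quoted from the guide.

```python
def fit_linear_trend(y):
    """Least-squares line y_hat = a + b*x with x coded 1, 2, ..., n (an assumption)."""
    n = len(y)
    x = range(1, n + 1)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


def seasonalised_forecast(a, b, x_values, indexes):
    """Trend value for each coded period, multiplied by its seasonal index / 100."""
    return [(a + b * x) * index / 100 for x, index in zip(x_values, indexes)]


# Quarterly income (R'000) from Example 7.4, coded x = 1..16
income = [52, 67, 85, 54, 57, 75, 90, 61, 60, 77, 94, 63, 66, 84, 98, 67]
seasonal_index = [83.3, 107.2, 126.5, 83.0]   # quarters 1 to 4, Example 7.4

a, b = fit_linear_trend(income)
# The four quarters of 2013 continue the coding: x = 17, 18, 19, 20
forecast_2013 = seasonalised_forecast(a, b, range(17, 21), seasonal_index)

print(round(a, 2), round(b, 3))              # 61.6 1.209
print([round(f, 1) for f in forecast_2013])  # [68.4, 89.4, 107.0, 71.2]
```

A different coding of x (for example, centred on zero) changes a and b but not the forecasts, provided the forecast periods are coded on the same scale.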

Deseasonalising data
The influence of seasonality can be removed from a time series by dividing each original value in the series by
the appropriate typical seasonal index for that period and then multiplying the result by 100. The result is known
as deseasonalised data. Deseasonalised data is used to compare values across seasons and to determine whether an
increase or decrease has taken place, irrespective of seasonal variation.
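As a brief illustration, the sketch below deseasonalises the 2012 income figures of Example 7.4 using that example's seasonal indexes. The function name deseasonalise is hypothetical, and the series is assumed to start in the first season.

```python
def deseasonalise(y, indexes):
    """Divide each value by its seasonal index and multiply by 100.

    Assumes y[0] falls in the first season and the seasons repeat in order.
    """
    period = len(indexes)
    return [100 * value / indexes[t % period] for t, value in enumerate(y)]


income_2012 = [66, 84, 98, 67]                # quarters 1 to 4 of 2012, Example 7.4
seasonal_index = [83.3, 107.2, 126.5, 83.0]   # seasonal indexes, Example 7.4

adjusted = deseasonalise(income_2012, seasonal_index)
print([round(v, 1) for v in adjusted])  # [79.2, 78.4, 77.5, 80.7]
```

The deseasonalised values lie much closer together than the raw figures (66 to 98), which suggests that most of the quarter-to-quarter movement in 2012 is seasonal.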

Example 7.4
The quarterly income (in R’000) of a soft drink company has been recorded for 4 years.

1              2    3                 4         5       6
Year  Quarter  y    4-quarter         Centred   %       Seasonal
                    moving average                      index
2009  1        52                                       83.3
      2        67                                       107.2
      3        85   64.5              65.2      130.4   126.5
      4        54   65.8              66.8      80.8    83.0
2010  1        57   67.8              68.4      83.3    83.3
      2        75   69.0              69.9      107.3   107.2
      3        90   70.8              71.2      126.4   126.5
      4        61   71.5              71.8      85.0    83.0
2011  1        60   72.0              72.5      82.8    83.3
      2        77   73.0              73.2      105.2   107.2
      3        94   73.5              74.2      126.6   126.5
      4        63   75.0              75.9      83.0    83.0
2012  1        66   76.8              77.3      85.4    83.3
      2        84   77.8              78.3      107.3   107.2
      3        98   78.8                                126.5
      4        67                                       83.0
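Column 1: Quarterly time periods
Column 2: Quarterly income (y)
Column 3: 4-quarterly moving average
• Time period 2.5: (52 + 67 + 85 + 54) / 4 = 64.5
• Time period 3.5: (67 + 85 + 54 + 57) / 4 = 65.8
Column 4: Centred moving average
• Time period 3: (64.5 + 65.8) / 2 = 65.2
Column 5: Seasonal percentages
• Time period 3: (85 / 65.2) × 100 = 130.4
Column 6: Seasonal index (the adjusted typical index for the quarter, derived in the table below)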


Year                  Q1      Q2      Q3      Q4      Total
2009                                  130.4   80.8
2010                  83.3    107.3   126.4   85.0
2011                  82.8    105.2   126.6   83.0
2012                  85.4    107.3
Unadjusted            83.3    107.3   126.6   83.0    400.2
Total typical index   400     400     400     400
Factor                0.9995  0.9995  0.9995  0.9995
Seasonal index        83.3    107.2   126.5   83.0    400.0

Forecast for 2013:

• Linear trend line:
• Quarter 1, 2013:
• Quarter 2, 2013:
• Quarter 3, 2013:
• Quarter 4, 2013:
Activity 7.5
The owner of a pizzeria recorded the number of pizzas sold per day over the past four weeks in order to determine
the influence of the day of the week on sales. Prepare a seasonal forecast for each day of week 5.

Time  Week  Day        Number sold  5-day moving average  %      Seasonal index
1     1     Monday     12                                        72.4
2           Tuesday    18                                        83.5
3           Wednesday  16           20.4                  78.4   84.5
4           Thursday   25           A                     B      124.2
5           Friday     31           20.0                  155    135.4
6     2     Monday     11           20.6                  53.4   72.4
7           Tuesday    17           20.4                  83.3   83.5
8           Wednesday  19           19.6                  96.9   84.5
9           Thursday   24           20.2                  118.8  124.2
10          Friday     27           20.0                  135    135.4
11    3     Monday     14           19.4                  72.2   72.4
12          Tuesday    16           20.2                  79.2   83.5
13          Wednesday  16           19.8                  80.8   84.5
14          Thursday   28           20.4                  137.3  124.2
15          Friday     25           21.4                  116.8  135.4
16    4     Monday     17           22.2                  76.6   72.4
17          Tuesday    21           21.4                  98.1   83.5
18          Wednesday  20           22.8                  87.7   84.5
19          Thursday   24                                        124.2
20          Friday     32                                        135.4

A=

B=
Week            Monday  Tuesday  Wednesday  Thursday  Friday  Total
1                                78.4       B         155
2               53.4    83.3     96.9       118.8     135
3               72.2    79.2     80.8       137.3     116.8
4               76.6    98.1     87.7
Unadjusted
Typical index   500     500      500        500       500
Factor
Seasonal index

Forecast for week 5:

• Linear trend:
• Monday, Week 5:
• Tuesday, Week 5:
• Wednesday, Week 5:
• Thursday, Week 5:
• Friday, Week 5:
