
SIT 212: STATISTICS

prepared by
Musa Mohammed (PhD)
FREQUENCY DISTRIBUTION
What is a Frequency Distribution Table?

Frequency tells you how often something happened. The frequency of an observation tells you the number of times the observation occurs in the data.

For example, in the following list of numbers, the frequency of the number 9 is 5
(because it occurs 5 times): 1, 2, 3, 4, 6, 9, 9, 8, 5, 1, 1, 9, 9, 0, 6, 9.

A table that lists each data value together with its frequency is called a frequency distribution table, or simply a frequency table.
Types of Frequency Distributions
There are two basic types of frequency distribution in statistics:

1. Ungrouped frequency distribution: It shows the frequency of each separate data value rather than of groups of data values.

2. Grouped frequency distribution: The data are arranged and separated into groups called class intervals. The frequency of the data belonging to each class interval is recorded in a frequency distribution table, so the grouped frequency table shows the distribution of frequencies across class intervals.
Ungrouped frequency distribution
How to make an ungrouped frequency table
1. Create a table with two columns and as many rows as there are values of the variable.
Label the first column using the variable name and label the second column
“Frequency.” Enter the values in the first column.
- For ordinal variables, the values should be ordered from smallest to largest in the
table rows.
- For nominal variables, the values can be in any order in the table. You may wish to
order them alphabetically or in some other logical order.
2. Count the frequencies. The frequencies are the number of times each value occurs.
Enter the frequencies in the second column of the table beside their corresponding
values.
- Especially if your dataset is large, it may help to count the frequencies by tallying.
Add a third column called “Tally.” As you read the observations, make a tick mark in
the appropriate row of the tally column for each observation. Count the tally marks
to determine the frequency.
Ungrouped frequency distribution CONT.
Example 1: Consider the marks scored by 20 students in a math test marked out of 20.
10, 8, 9, 8, 10, 10, 11, 12, 14, 15, 8, 9, 7, 8, 10, 10, 11, 12, 10, 8
The frequency distribution table of the above data is given below:

Marks Frequency
7 1
8 5
9 2
10 6
11 2
12 2
14 1
15 1
N = 20
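The counting step for Example 1 can also be sketched in Python. This is only an illustration, assuming the standard-library `collections.Counter`; the variable names `marks` and `frequency` are made up for the sketch.

```python
from collections import Counter

# Marks of the 20 students from Example 1.
marks = [10, 8, 9, 8, 10, 10, 11, 12, 14, 15,
         8, 9, 7, 8, 10, 10, 11, 12, 10, 8]

# Counter maps each distinct mark to the number of times it occurs.
frequency = Counter(marks)

print("Marks  Frequency")
for mark in sorted(frequency):
    print(f"{mark:>5}  {frequency[mark]:>9}")
print(f"N = {sum(frequency.values())}")
```

Sorting the keys gives the rows in the same smallest-to-largest order as the table.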
Ungrouped frequency distribution CONT.
Example 2: A jar contains beads of different colors: red, green, blue, black, red, green, blue, yellow, red, red, green, green, green, yellow, red, green, yellow.
To find the exact number of beads of each color, we classify the beads into categories. An easy way to do this is to use tally marks. Pick the beads one by one and enter a tally mark in the row for that color. Then write the frequency for each color in the table.
Grouped Frequency Distribution
How to make a grouped frequency table for a large amount of data
For a large data set, draw a frequency table using the following steps:

Step 1: Identify the smallest and the largest values in the data.

Step 2: Divide the data into class intervals of the same size. There are no firm rules on how to choose the width.
Step 3: A value that falls on a boundary shared by two classes always belongs to the higher class, e.g. with classes 10-20 and 20-30, the value 20 lies in 20-30.
Step 4: Count the data, writing tally bars in the frequency table.
Grouped Frequency Distribution CONT.
Example 2: Represent the following data in a frequency distribution table by
making appropriate class intervals.
4, 7, 12, 15, 50, 22, 25, 27, 29, 33, 39, 44, 47, 18, 51, 31, 20, 21, 41, 36
Solution:
Frequency distribution table:
Class interval Frequency
0-10 2
10-20 3
20-30 6
30-40 4
40-50 3
50-60 2
N = 20
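The grouping in this example can be sketched in Python as follows. It is an illustration under the assumption of a class width of 10 starting at 0, as in the table; the names `data`, `width` and `counts` are made up for the sketch.

```python
from collections import Counter

data = [4, 7, 12, 15, 50, 22, 25, 27, 29, 33,
        39, 44, 47, 18, 51, 31, 20, 21, 41, 36]
width = 10  # class width used in the table above

# Integer division gives the index of the class. A value on a shared
# boundary (such as 20 or 50) therefore falls in the higher class.
counts = Counter(value // width for value in data)

for index in range(min(counts), max(counts) + 1):
    lower = index * width
    print(f"{lower}-{lower + width}: {counts[index]}")
print(f"N = {sum(counts.values())}")
```

Using integer division is one simple way to apply the rule that a boundary value belongs to the higher class.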
MEASURES OF LOCATION
What is Central Tendency?

Central tendency (sometimes called measure of location, central location, or just center) is a summary measure that attempts to describe a whole set of data with a single value that represents the middle or centre of its distribution.

There are three main measures of central tendency:

1. mean

2. median

3. mode

Each of these measures gives a different indication of the typical or central value in the distribution.
Mean
The mean is also known as the average. It is calculated by adding up all the values in a data set and dividing by the total number of values.
The mean, or arithmetic mean, is the most commonly used of all the averages.
The formula:
- for ungrouped data
$$\bar{X} = \frac{\sum X}{n}$$
where $\bar{X}$ = mean, $\sum X$ = sum of the observations, n = number of observations
- for grouped data
$$\bar{X} = \frac{\sum fX}{\sum f} \quad \text{or} \quad \bar{X} = \frac{\sum fX}{n}$$
where $\sum f$ or n = sum of the frequencies, i.e. the total number of observations
$\sum fX$ = total of the products of each observation (X) and its frequency (f)
Median
The median is the middle value in the distribution when the values are arranged in ascending or descending order.
Formula:
For ungrouped data
If the number of observations n is odd, the median is the $\left(\frac{n+1}{2}\right)$th observation in order.
If the number of observations is even, the median is the arithmetic mean of the two middle numbers.
For grouped data
For grouped data that does not involve class intervals, apply the above formula after the cumulative frequencies have been obtained.
Median CONT.
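For grouped data that involves class intervals
$$\text{Median} = L_b + \left(\frac{\frac{n}{2} - \sum f_1}{f_m}\right) C$$

Where,

$L_b$ = lower class boundary of the median class (the median class is the one in which the $\left(\frac{n}{2}\right)$th observation lies in the cumulative frequency (C.F.) column)

n = number of items in the data (i.e. total frequency)

$\sum f_1$ = cumulative frequency of the class just before the median class

$f_m$ = frequency of the median class

C = size of the median class interval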


Mode
The mode is the most common value in a set of data, i.e. the item with the highest frequency.
A distribution may have no mode.
It may also have one mode (uni-modal), two (bi-modal), three (tri-modal) or more than three (multi-modal).
Formula:
For ungrouped data
The mode is the observation that appears most often in the distribution.
For grouped data without class interval
The mode is the value with the highest frequency
Mode CONT.
For grouped data with class interval
$$\text{Mode} = L_b + \left(\frac{f_1}{f_1 + f_2}\right) C$$

Where

$L_b$ = lower class boundary of the modal class (i.e. the class containing the mode)

$f_1$ = frequency of the modal class – frequency of the class just before it

$f_2$ = frequency of the modal class – frequency of the class just after it

C = size of the modal class interval


Mean, Median and Mode for Ungrouped Data
Example 1: In a small scale project, a contractor paid a worker the following amounts of money, in hundreds of naira (N00), over seven days.
7, 2, 5, 4, 3, 3, and 6
(a) Determine the average income earned by the worker per day
(b) Compute the median, and
(c) Compute the mode of the distribution

Solution:
(a) $\bar{X} = \frac{\sum X}{n} = \frac{7+2+5+4+3+3+6}{7} = \frac{30}{7} = 4.29$, therefore mean = N429
(b) Median: arranged in ascending order, the values are 2, 3, 3, 4, 5, 6, 7. The middle (4th) value is 4, so median = N400
(c) Mode: the value 3 occurs most often (twice), so mode = N300
Mean, Median and Mode for Ungrouped Data CONT
Example 2: A project manager was saddled with the responsibility of keeping the SIT clean and used the following numbers of cleaners over six days.
10, 40, 15, 30, 10, and 15
Compute (a) the mean number of cleaners per day, (b) the median, and (c) the mode
(a) $\bar{X} = \frac{\sum X}{n} = \frac{10+40+15+30+10+15}{6} = \frac{120}{6} = 20$ cleaners

(b) Median: arrange the numbers in order of magnitude

10, 10, 15, 15, 30, 40
The median is the arithmetic mean of the two middle numbers: $\frac{15+15}{2} = 15$ cleaners

(c) Mode = 10 and 15 cleaners (each occurs twice, so the distribution is bi-modal).
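The two ungrouped examples can be checked with Python's standard `statistics` module. This is a minimal sketch; the list names `pay` and `cleaners` are made up for illustration, and `statistics.multimode` (Python 3.8 or later) is used because it returns every value that shares the highest frequency.

```python
import statistics

# Example 1: daily pay in hundreds of naira.
pay = [7, 2, 5, 4, 3, 3, 6]
print(round(statistics.mean(pay), 2))   # arithmetic mean
print(statistics.median(pay))           # middle value of the sorted data
print(statistics.multimode(pay))        # all values with the highest frequency

# Example 2: cleaners per day (even number of observations, two modes).
cleaners = [10, 40, 15, 30, 10, 15]
print(statistics.mean(cleaners))
print(statistics.median(cleaners))      # mean of the two middle values
print(statistics.multimode(cleaners))
```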


Mean, Median and Mode for Grouped Data
Without class interval
Example 1: A project manager paid the following amounts of money, in thousands of naira (N000), to various categories of workers on a one-day project.
7, 1, 3, 7, 5, 5, 3, 2, 1, 3, 5, 2, 2, 7, 5, 7, 5, 5, 2
Compute (a) mean, (b) median, and (c) mode
X    F    FX    Cum. Freq.
1    2    2     2
2    4    8     6
3    3    9     9
5    6    30    15
7    4    28    19
Σf = 19, ΣfX = 77
The row X = 5 is both the median class and the modal class
Mean, Median and Mode for Grouped Data CONT.
(a) $\bar{X} = \frac{\sum fX}{\sum f} = \frac{77}{19} = 4.05$, therefore mean = N4,050
(b) Median: since the number of observations, 19, is odd,
median = the $\left(\frac{n+1}{2}\right)$th observation = the $\left(\frac{19+1}{2}\right)$th observation = the 10th observation
To locate the 10th observation, check whether 10 appears in the C.F. column. If it does, the corresponding X is the 10th observation. If it does not, pick the next value larger than 10 in the C.F. column; the corresponding X is the 10th observation.
Here 10 does not appear in the C.F. column. The next larger value is 15, and the corresponding X is 5.
Median = N5,000
(c) Mode = N5,000 (X = 5 has the highest frequency, 6)
Mean, Median and Mode for Grouped Data CONT.
With class interval
Example 2: The quantity of cement (in bags) purchased by a contractor in 39 days
is shown in the following distribution
54, 40, 38, 25, 32, 45, 46, 45, 35, 59, 42, 43, 46, 46, 28, 34, 40, 44, 44, 47, 51, 49,
49, 36, 31, 36, 41, 42, 37, 35, 45, 49, 48, 45, 46, 47, 48, 41, 44.
From the data
(a) Prepare a frequency distribution table, grouping the data into class intervals of 5 bags (e.g. 25-29, 30-34, etc.)
(b) What is the average cement purchase per day?
(c) Compute the median
(d) Compute the mode
Mean, Median and Mode for Grouped Data CONT.
Solution
(a) Frequency distribution table
Class    F     Midpoint (X)    FX     C.F.
25-29    2     27              54     2
30-34    3     32              96     5
35-39    7     37              259    12
40-44    10    42              420    22
45-49    15    47              705    37
50-54    2     52              104    39
Σf = 39, ΣfX = 1638
The median class is 40-44
The modal class is 45-49
Mean, Median and Mode for Grouped Data CONT.
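(b) $\bar{X} = \frac{\sum fX}{\sum f} = \frac{1638}{39} = 42$, so the average purchase is 42 bags of cement per day

(c) $$\text{Median} = L_b + \left(\frac{\frac{n}{2} - \sum f_1}{f_m}\right) C$$

$L_b$ = 39.5, $\sum f_1$ = 12 (the C.F. just before 22), $f_m$ = 10, C = 30 – 25 = 5

$$= 39.5 + \left(\frac{\frac{39}{2} - 12}{10}\right) 5 = 39.5 + \left(\frac{7.5}{10}\right) 5 = 43.25 \text{ bags of cement}$$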
Mean, Median and Mode for Grouped Data CONT.
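(d) $$\text{Mode} = L_b + \left(\frac{f_1}{f_1 + f_2}\right) C$$

$L_b$ = 44.5, $f_1$ = 15 – 10 = 5, $f_2$ = 15 – 2 = 13, C = 30 – 25 = 5

$$= 44.5 + \left(\frac{5}{5+13}\right) 5$$

$$= 44.5 + \frac{25}{18}$$

$$= 44.5 + 1.39$$

= 45.89 bags of cement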
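The grouped-data formulas for the mean, median and mode can be sketched in Python. The sketch starts from the class limits and frequencies in the frequency table of part (a) rather than from the raw data, and assumes class boundaries lie 0.5 below the lower class limit, as in the worked solution. All variable names are illustrative.

```python
# Class limits and frequencies copied from the frequency table in part (a).
classes = [(25, 29), (30, 34), (35, 39), (40, 44), (45, 49), (50, 54)]
freqs = [2, 3, 7, 10, 15, 2]
n = sum(freqs)
width = classes[1][0] - classes[0][0]          # C = 30 - 25 = 5

# Mean from the class midpoints: sum(f * X) / sum(f).
midpoints = [(lo + hi) / 2 for lo, hi in classes]
mean = sum(f * x for f, x in zip(freqs, midpoints)) / n

# Median: find the first class whose cumulative frequency reaches n / 2.
cumulative = 0
for i, f in enumerate(freqs):
    if cumulative + f >= n / 2:
        lower_boundary = classes[i][0] - 0.5
        median = lower_boundary + (n / 2 - cumulative) / f * width
        break
    cumulative += f

# Mode: class with the highest frequency, then the f1 / (f1 + f2) formula.
m = freqs.index(max(freqs))
f1 = freqs[m] - (freqs[m - 1] if m > 0 else 0)
f2 = freqs[m] - (freqs[m + 1] if m + 1 < len(freqs) else 0)
mode = (classes[m][0] - 0.5) + f1 / (f1 + f2) * width

print(mean, median, round(mode, 2))
```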


ASSIGNMENT 1
The following is a record showing the quantity of rice (in kg) consumed by various households in a month.
36, 45, 45, 50, 51, 22, 23, 36, 14, 54, 1, 17, 39, 31, 43, 33, 33, 36, 25, 21, 41, 56, 11,
40, 32, 26, 48, 50, 42, 29, 18, 44, 30, 36, 49, 55, 15, 37, 37, 52, 48, 47, 46, 38, 32,
31, 27, 28, 24, 38.
Required:
(a) Prepare a frequency distribution table, grouping in 10kg (e.g. 1-10, 11-20 etc.)
(b) Compute the mean
(c) Compute the median, and
(d) Compute the mode
MEASURE OF DISPERSION OR VARIABILITY
Measures of location are used to summarise data.
However, in trying to summarise information by the use of averages, some vital information is lost.
One such vital item of information is the spread of the data,
that is, whether the observations are clustered together or spread out.
The spread of the data can be measured in many ways, but we limit ourselves to the four major ones.
- Range
- Mean Deviation
- Standard Deviation, and
- Variance
Range
This is the simplest measure of variability. The range is the difference between the highest and the lowest values in the distribution.

Range = Highest Value – Lowest Value


Mean Deviation
It measures the dispersion of values around the arithmetic mean.

It is the sum of the absolute differences of all the values from the arithmetic mean, divided by the number of observations.

Symbolically, the mean deviation is

for ungrouped data

$$MD = \frac{\sum |X - \bar{X}|}{n} \quad \text{or} \quad \frac{\sum d}{n}, \text{ where } d = |X - \bar{X}|$$

for grouped data

$$MD = \frac{\sum f|X - \bar{X}|}{n}$$
Standard Deviation
It shows how far the observations are spread from, or how closely they cluster around, the mean.
When the observations are clustered around the mean, the standard deviation is small; when they are spread out, it is large.
It is the square root of the variance.
The formula:
for ungrouped data
$$S = \sqrt{\frac{\sum (X - \bar{X})^2}{n}}$$

for grouped data

$$S = \sqrt{\frac{\sum f(X - \bar{X})^2}{n}}$$
Variance
Variance is a measure of dispersion that takes into account the spread of all data points in a data set.
It is the most often used measure of dispersion, along with the standard deviation, which is simply the square root of the variance.
The formula:
for ungrouped data
$$S^2 = \frac{\sum (X - \bar{X})^2}{n}$$

for grouped data

$$S^2 = \frac{\sum f(X - \bar{X})^2}{n}$$
MD, SD and Variance for Ungrouped data
Example 1: The quantity of milk demanded by consumers in a week is shown in the following distribution.
3, 4, 5, 6, 6, 7, 8, 9.
Compute (a) the range, (b) the mean deviation, (c) the standard deviation, and (d) the variance
Solution:
(a) Range = Highest value – lowest value
= 9 – 3
= 6 units of milk.
MD, SD and Variance for Ungrouped data CONT.
The deviations are obtained with the help of a table
X    X – X̄    |X – X̄| or d    (X – X̄)²
3    -3       3               9
4    -2       2               4
5    -1       1               1
6    0        0               0
6    0        0               0
7    1        1               1
8    2        2               4
9    3        3               9
Σ|X – X̄| = 12, Σ(X – X̄)² = 28

$$\bar{X} = \frac{\sum X}{n} = \frac{48}{8} = 6$$
MD, SD and Variance for Ungrouped data CONT.
(b) $MD = \frac{\sum |X - \bar{X}|}{n} = \frac{12}{8} = 1.5$

(c) $S = \sqrt{\frac{\sum (X - \bar{X})^2}{n}} = \sqrt{\frac{28}{8}} = \sqrt{3.5} = 1.87$

(d) $S^2 = \frac{\sum (X - \bar{X})^2}{n} = \frac{28}{8} = 3.5$
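The four measures of dispersion for the milk data can be computed directly from their definitions. This is a minimal Python sketch, dividing by n as in the formulas of this section; the variable names are illustrative.

```python
import math

milk = [3, 4, 5, 6, 6, 7, 8, 9]
n = len(milk)
mean = sum(milk) / n

data_range = max(milk) - min(milk)
mean_deviation = sum(abs(x - mean) for x in milk) / n
variance = sum((x - mean) ** 2 for x in milk) / n   # divides by n, not n - 1
std_dev = math.sqrt(variance)

print(data_range, mean_deviation, variance, round(std_dev, 2))
```

Squaring each deviation from the mean of 6 gives 9, 4, 1, 0, 0, 1, 4, 9, which sum to 28.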
MD, SD and Variance for Grouped data
Example 2: The quantity of fertilizer (in bags) used by twenty farmers in one year is
shown in the following distribution.
2, 4, 5, 5, 4, 8, 6, 7, 6, 2, 4, 5, 6, 6, 7, 10, 12, 10, 5, 6.
Required: Compute
(a) The Range
(b) The Mean Deviation
(c) The Standard Deviation
(d) The Variance
solution:
(a) Range = Highest value – Lowest value
= 12 – 2
= 10 bags
MD, SD and Variance for Grouped data CONT.
X     F    FX    X – X̄    |X – X̄|    f|X – X̄|    (X – X̄)²    f(X – X̄)²
2     2    4     -4       4          8           16          32
4     3    12    -2       2          6           4           12
5     4    20    -1       1          4           1           4
6     5    30    0        0          0           0           0
7     2    14    1        1          2           1           2
8     1    8     2        2          2           4           4
10    2    20    4        4          8           16          32
12    1    12    6        6          6           36          36
Σf = 20, ΣfX = 120, Σf|X – X̄| = 36, Σf(X – X̄)² = 122

$$\bar{X} = \frac{\sum fX}{\sum f} = \frac{120}{20} = 6 \text{ bags}$$
MD, SD and Variance for Grouped data CONT.
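(b) $MD = \frac{\sum f|X - \bar{X}|}{n} = \frac{36}{20} = 1.8$

(c) $S = \sqrt{\frac{\sum f(X - \bar{X})^2}{n}} = \sqrt{\frac{122}{20}} = \sqrt{6.1} = 2.47$

(d) $S^2 = \frac{\sum f(X - \bar{X})^2}{n} = \frac{122}{20} = 6.1$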
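The grouped version of the same calculation weights each distinct value by its frequency. The sketch below is illustrative only; it builds the X and F columns with `collections.Counter` and then applies the grouped formulas for the mean deviation, variance and standard deviation.

```python
import math
from collections import Counter

bags = [2, 4, 5, 5, 4, 8, 6, 7, 6, 2, 4, 5, 6, 6, 7, 10, 12, 10, 5, 6]
freq = Counter(bags)                 # maps each value X to its frequency f
n = sum(freq.values())               # sum of the frequencies

mean = sum(f * x for x, f in freq.items()) / n
mean_deviation = sum(f * abs(x - mean) for x, f in freq.items()) / n
variance = sum(f * (x - mean) ** 2 for x, f in freq.items()) / n
std_dev = math.sqrt(variance)

print(mean, mean_deviation, variance, round(std_dev, 2))
```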
ASSIGNMENT 2
The quantity of cement (in bags) purchased by a contractor in 39 days is shown in
the following distribution

54, 40, 38, 25, 32, 45, 46, 45, 35, 59, 42, 43, 46, 46, 28, 34, 40, 44, 44, 47, 51, 49,
49, 36, 31, 36, 41, 42, 37, 35, 45, 49, 48, 45, 46, 47, 48, 41, 44.

Group the data into class intervals of 5 bags (e.g. 25-29, 30-34, etc.)
Required: Compute
(a) The Range
(b) The Mean Deviation
(c) The Standard Deviation
(d) The Variance
PROBABILITY
The literal meaning of probability is the chance that an event will occur or happen.
The probability of an event E is defined as the ratio of the number of outcomes favourable to E to the total number of possible outcomes.
i.e. $$P(E) = \frac{\text{Number of favourable outcomes of } E}{\text{Total number of outcomes}}$$

The probability of an event is measured by a value between 0 and 1 inclusive.

An event that is certain to occur has probability 1; an event that cannot occur has probability 0.
Probability cannot be negative or exceed 1.
Probability Range: 0 ≤ P(E) ≤ 1
PROBABILITY Cont.
Example 1: If a coin is tossed once, what is the probability that the face up is a head?
Solution:
Sample space or total outcome (H, T)
$$P(H) = \frac{1}{2}$$
Example 2: If a coin is thrown twice, find the probability of getting at least one tail
Solution:
Total outcome (HH, HT, TH, TT)
$$P(\text{getting at least one tail}) = \frac{1}{4} + \frac{1}{4} + \frac{1}{4} = \frac{3}{4}$$
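Counting favourable outcomes over the total outcomes can be sketched in Python for Example 2. The sketch assumes all four outcomes are equally likely and uses exact fractions; the names `sample_space` and `favourable` are made up for illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space for two tosses: HH, HT, TH, TT (all equally likely).
sample_space = ["".join(tosses) for tosses in product("HT", repeat=2)]

# Outcomes that contain at least one tail.
favourable = [outcome for outcome in sample_space if "T" in outcome]

p_at_least_one_tail = Fraction(len(favourable), len(sample_space))
print(sample_space, p_at_least_one_tail)
```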
PROBABILITY Cont.
Example 3: A die is thrown once. What is the probability that the face up shows
(a) 2
(b) An odd number
Solution:
Total outcome (1, 2, 3, 4, 5, 6)
For (a) we have A = face up shows 2
$$\therefore P(A) = \frac{1}{6}$$
For (b) we have B = face up shows an odd number (1, 3, 5)
$$\therefore P(B) = \frac{3}{6} = \frac{1}{2}$$
Law of Addition
If the events are mutually exclusive (they have no points in common), we use the addition rule for the probability of either A or B:

P(A or B) = P(A) + P(B)

If the events are not mutually exclusive (they have points in common), the probability of either A or B is:

P(A or B) = P(A) + P(B) – P(AB)

Again, if events A and B are independent, the probability that both occur is:

P(AB) = P(A) P(B)


Law of Addition CONT.
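Example 1: If a coin is thrown twice, the sample space (S), or total outcome (TO), is
S = (HH, HT, TH, TT)
A = (HH, HT), B = (HT, TT)
Find the probability of (a) A, (b) B, (c) AB, and (d) A + B
Solution:
(a) $P(A) = \frac{2}{4} = \frac{1}{2}$
(b) $P(B) = \frac{2}{4} = \frac{1}{2}$
(c) $P(AB) = P(A) P(B) = \frac{2}{4} \times \frac{2}{4} = \frac{4}{16} = \frac{1}{4}$ (A and B share the single outcome HT)
(d) $P(A+B) = P(A) + P(B) - P(AB) = \frac{1}{2} + \frac{1}{2} - \frac{1}{4} = \frac{3}{4}$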
Law of Multiplication
If two events A and B are independent, the probability of their simultaneous occurrence is equal to the product of their individual probabilities.
i.e. P(A and B) = P(A) x P(B) or
P(AB) = P(A) x P(B)
If the second event depends on the first, its probability is taken given that the first has occurred: P(AB) = P(A) x P(B given A).
Example 1: A bowl contains three oranges and two mangoes. Two fruits are drawn, one after the other, without replacement. What is the probability
(a) Of drawing an orange first
(b) That a mango is drawn the second time given that an orange was drawn the first time
(c) Of drawing an orange and a mango in that order.
Law of Multiplication CONT.
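Solution:

Let O = orange and M = mango

(a) $P(O) = \frac{3}{5}$

(b) After an orange has been drawn, four fruits remain, two of which are mangoes: $P(M \text{ given } O) = \frac{2}{4} = \frac{1}{2}$

(c) $P(OM) = P(O) \times P(M \text{ given } O) = \frac{3}{5} \times \frac{1}{2} = \frac{3}{10}$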
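The fruit-bowl example can be verified by listing every ordered way of drawing two fruits without replacement. This Python sketch assumes each ordered pair is equally likely; the names `bowl` and `draws` are illustrative.

```python
from fractions import Fraction
from itertools import permutations

bowl = ["O", "O", "O", "M", "M"]   # three oranges, two mangoes

# Every ordered pair of distinct fruits is an equally likely two-draw outcome.
draws = list(permutations(range(len(bowl)), 2))

orange_first = [d for d in draws if bowl[d[0]] == "O"]
orange_then_mango = [d for d in orange_first if bowl[d[1]] == "M"]

p_orange_first = Fraction(len(orange_first), len(draws))
p_mango_given_orange = Fraction(len(orange_then_mango), len(orange_first))
p_orange_then_mango = Fraction(len(orange_then_mango), len(draws))

print(p_orange_first, p_mango_given_orange, p_orange_then_mango)
```

Because the first fruit is not replaced, the second probability is computed only over the draws that began with an orange.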
BINOMIAL DISTRIBUTION
Let P be the probability that an event will happen in any single trial (called the probability of success) and let q = 1 – P be the probability that it will fail to happen in any single trial (called the probability of failure).
The binomial equation also uses factorials. In mathematics, the factorial of a
non-negative integer k is denoted by k!, for example,

4! = 4 x 3 x 2 x 1 = 24,

2! = 2 x 1 = 2,

1!=1.

There is one special case, 0! = 1.

With this notation in mind, the binomial distribution model is defined as:
BINOMIAL DISTRIBUTION Cont.
Then, the probability that the event will happen exactly X times in N trials (i.e. X successes and N – X failures will occur) is given by
$$P(X) = \binom{N}{X} P^X q^{N-X} = \frac{N!}{X!(N-X)!} P^X q^{N-X}$$

Where
N = number of trials
X = number of favourable outcomes (successes)
P = probability of success in a single trial
q = 1 – P = probability of failure in a single trial
BINOMIAL DISTRIBUTION Cont.
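Example 1: If a fair coin is tossed 6 times, what is the probability of getting exactly 2 heads?
Solution:
N = 6, P = $\frac{1}{2}$ since there are two outcomes per trial and only one is favourable, X = number of heads
$$P(X=2) = \frac{N!}{X!(N-X)!} P^X q^{N-X}$$
$$= \frac{6!}{2!(6-2)!} \left(\frac{1}{2}\right)^2 \left(1 - \frac{1}{2}\right)^{6-2}$$
$$= \frac{6 \times 5 \times 4 \times 3 \times 2 \times 1}{2 \times 1 \, (4 \times 3 \times 2 \times 1)} \left(\frac{1}{2}\right)^2 \left(\frac{1}{2}\right)^4$$
$$= \frac{30}{2} \left(\frac{1}{2}\right)^2 \left(\frac{1}{2}\right)^4$$
$$= 15 \times \frac{1}{64} = \frac{15}{64}$$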
BINOMIAL DISTRIBUTION Cont.
Example 2: If a ludo die is thrown five times, what is the probability of getting (a) one six, (b) two sixes, (c) three or more sixes?
Solution:
N = 5, P = $\frac{1}{6}$ since there are six outcomes per trial and only one is favourable, X = number of sixes
(a) $$P(X=1) = \frac{N!}{X!(N-X)!} P^X q^{N-X}$$
$$P(X=1) = \frac{5!}{1!(5-1)!} \left(\frac{1}{6}\right)^1 \left(1 - \frac{1}{6}\right)^{5-1}$$
$$= \frac{120}{24} \left(\frac{1}{6}\right) \left(\frac{5}{6}\right)^4$$
$$= \frac{75000}{186624}$$
= 0.40
BINOMIAL DISTRIBUTION Cont.
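(b) $$P(X=2) = \frac{N!}{X!(N-X)!} P^X q^{N-X}$$

$$= \frac{5!}{2!(5-2)!} \left(\frac{1}{6}\right)^2 \left(1 - \frac{1}{6}\right)^{5-2}$$

$$= \frac{120}{12} \left(\frac{1}{6}\right)^2 \left(\frac{5}{6}\right)^3$$

$$= \frac{15000}{93312}$$

= 0.16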
BINOMIAL DISTRIBUTION Cont.
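(c) P(X = 3 or more sixes) = P(X=3) + P(X=4) + P(X=5)

$$= \frac{5!}{3!(5-3)!} \left(\frac{1}{6}\right)^3 \left(1 - \frac{1}{6}\right)^{5-3} + \frac{5!}{4!(5-4)!} \left(\frac{1}{6}\right)^4 \left(1 - \frac{1}{6}\right)^{5-4} + \frac{5!}{5!(5-5)!} \left(\frac{1}{6}\right)^5 \left(1 - \frac{1}{6}\right)^{5-5}$$

$$= \frac{120}{12} \left(\frac{1}{6}\right)^3 \left(\frac{5}{6}\right)^2 + \frac{120}{24} \left(\frac{1}{6}\right)^4 \left(\frac{5}{6}\right)^1 + \frac{120}{120} \left(\frac{1}{6}\right)^5 (1)$$

= 0.03215 + 0.0032 + 0.0001

= 0.035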
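The binomial formula can be sketched as a small Python function. The function name `binomial_pmf` is made up for illustration; `math.comb` (Python 3.8 or later) supplies the N!/(X!(N–X)!) term.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Example 1: exactly 2 heads in 6 tosses of a fair coin.
p_two_heads = binomial_pmf(2, 6, 1 / 2)

# Example 2: sixes in 5 throws of a die.
p_one_six = binomial_pmf(1, 5, 1 / 6)
p_two_sixes = binomial_pmf(2, 5, 1 / 6)
p_three_or_more = sum(binomial_pmf(x, 5, 1 / 6) for x in range(3, 6))

print(p_two_heads, p_one_six, p_two_sixes, p_three_or_more)
```

Summing the function over x = 3, 4, 5 mirrors part (c), where the three separate probabilities are added.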
POISSON DISTRIBUTION
It describes the number of events that occur within a given interval.
The events in question must occur at random and must be independent of one another.
They must also be rare, i.e. their rate of occurrence must be very low.
It is possible to calculate the probability of any number of events in a defined interval if the mean number of events per interval is known.
The formula is
$$\Pr(x) = e^{-m} \frac{m^x}{x!}$$
Where
x = number of events (favourable outcomes)
e = a special mathematical constant (approximately 2.71828)
$e^{-m}$ = its value can be read from an exponential table
m = mean number of events per interval
POISSON DISTRIBUTION Cont.
Example: Customers arrive randomly at a departmental store at an average rate of 3.4 per minute. Assuming the customer arrivals follow a Poisson distribution, calculate the probability that
(a) No customer arrives in any particular minute
(b) Exactly one customer arrives in any particular minute
(c) Two or more customers arrive in any particular minute, and
(d) One or more customers arrive in any 30 second period.
Solution:
For a, b, and c, the interval for the Poisson distribution is one minute, with a given mean of 3.4. For d, the interval is just 30 seconds, so the mean must be adjusted to $\frac{3.4}{2} = 1.7$
POISSON DISTRIBUTION Cont.
(a) $\Pr(0) = e^{-m} \frac{m^x}{x!} = e^{-3.4} \frac{3.4^0}{0!} = e^{-3.4} = 0.0334$

(b) $\Pr(1) = e^{-m} \frac{m^x}{x!} = e^{-3.4} \frac{3.4^1}{1!} = (0.0334)(3.4) = 0.1136$

(c) Pr(2 or more) = 1 – [Pr(0) + Pr(1)] = 1 – [0.0334 + 0.1136] = 0.8530

(d) Pr(1 or more) = 1 – Pr(0) = 1 – $e^{-1.7}$ = 1 – 0.1827 = 0.8173
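The same Poisson probabilities can be computed without an exponential table. This is a minimal Python sketch with a made-up helper name, `poisson_pmf`; because it does not round intermediate values, its results may differ from the hand calculation in the last decimal place.

```python
from math import exp, factorial

def poisson_pmf(x, m):
    """Probability of exactly x events when the mean per interval is m."""
    return exp(-m) * m**x / factorial(x)

p_none = poisson_pmf(0, 3.4)                     # (a) one-minute interval
p_one = poisson_pmf(1, 3.4)                      # (b)
p_two_or_more = 1 - (p_none + p_one)             # (c)
p_one_or_more_30s = 1 - poisson_pmf(0, 3.4 / 2)  # (d) mean halved for 30 seconds

print(p_none, p_one, p_two_or_more, p_one_or_more_30s)
```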


NORMAL DISTRIBUTION
The normal distribution is the name given to a type of distribution of continuous data that occurs frequently in practice.
It is the distribution of natural phenomena such as weight, height, length, time, etc.
Since the data are continuous, it is not possible to find the probability of a precise value, only of a range of values.
To calculate probabilities for a normal distribution, the mean and standard deviation must be known.
The standard normal distribution table is used to determine the probability that an observation or score falls within a given interval of the distribution.
Formula:
$$Z = \frac{X - \mu}{\sigma}$$
Where: X = individual score, μ = mean, σ = standard deviation
NORMAL DISTRIBUTION Cont.
Example 1: Suppose that X is a normally distributed random variable with mean = 8 and standard deviation = 2. Find the probability that X lies between 6 and 11.
Solution:
Firstly, we find the area between the mean and X = 6
$$Z = \frac{X - \mu}{\sigma} = \frac{6 - 8}{2} = \frac{-2}{2} = -1.0$$
From the table, the area between Z = 0 and Z = -1.0 is 0.3413

Then the area between the mean and X = 11 is:

$$Z = \frac{X - \mu}{\sigma} = \frac{11 - 8}{2} = \frac{3}{2} = 1.5$$
From the table, the area between Z = 0 and Z = 1.5 is 0.4332

The probability that X lies between 6 and 11 is:

P(-1.0 < Z < 1.5) = 0.3413 + 0.4332 = 0.7745
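The table lookup in Example 1 can be cross-checked in Python. The sketch uses `statistics.NormalDist` from the standard library (Python 3.8 or later); taking the difference of two cumulative probabilities is a different route from adding two table areas, but it should give the same interval probability.

```python
from statistics import NormalDist

x = NormalDist(mu=8, sigma=2)

# Z-scores of the two end points, using Z = (X - mean) / standard deviation.
z_low = (6 - x.mean) / x.stdev
z_high = (11 - x.mean) / x.stdev

# P(6 < X < 11) as the difference of two cumulative probabilities.
p_between = x.cdf(11) - x.cdf(6)

print(z_low, z_high, round(p_between, 4))
```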
NORMAL DISTRIBUTION Cont.
Example 2: The yield of maize is normally distributed with a mean of 360 kg/ha and a standard deviation of 24 kg/ha. Find the probability that a farmer chosen at random has:
a. A yield of 420 kg/ha
b. A yield greater than 420 kg/ha
c. A yield of between 420 and 440 kg/ha
Solution
a. X = 420, μ = 360 kg/ha, σ = 24 kg/ha
$$Z = \frac{X - \mu}{\sigma} = \frac{420 - 360}{24} = \frac{60}{24} = 2.5$$
From the table, the area between the mean and Z = 2.5 is 0.4938
b. For a yield greater than 420 kg/ha:
P(Z > 2.5) = 0.5000 – 0.4938 = 0.0062
NORMAL DISTRIBUTION Cont.
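c. For a yield of between 420 and 440

Firstly, we find the area for a yield of 420 (X = 420, μ = 360 kg/ha, σ = 24 kg/ha)

$$Z = \frac{X - \mu}{\sigma} = \frac{420 - 360}{24} = \frac{60}{24} = 2.5, \text{ area} = 0.4938$$

Then the area for a yield of 440 (X = 440, μ = 360 kg/ha, σ = 24 kg/ha)

$$Z = \frac{X - \mu}{\sigma} = \frac{440 - 360}{24} = \frac{80}{24} = 3.33, \text{ area} = 0.4996$$

Both values lie on the same side of the mean, so the smaller area is subtracted from the larger:

P(2.5 < Z < 3.33) = 0.4996 – 0.4938 = 0.0058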


NORMAL DISTRIBUTION Cont.
ASSIGNMENT 3

If X is a normally distributed random variable with mean (μ) = 50 and standard deviation (σ) = 10, find the probability that X will take the value of 60.
STATISTICAL ESTIMATION
All the elements of interest in a particular study form the population.
Because of time, cost, and other considerations, data often cannot be collected
from every element of the population.
In such cases, a subset of the population, called a sample, is used to provide the
data.
Data from the sample are then used to develop estimates of the characteristics
of the larger population.
Therefore, statistical estimation involves estimating a population parameter with a sample statistic.
But if the population is completely known, or every element of the population could be observed, it would not be necessary to make estimates about its characteristics.
Types of Estimates
There are two types of estimates:

1. Point estimate: An estimate of the population parameter given by a single number. E.g., if we say the distance is measured as 5.28 meters (m), we are giving a point estimate.

2. Interval estimate: An estimate of the population parameter given by two numbers between which the parameter lies. E.g., if we say the distance is 5.28 ± 0.03 m (i.e. the distance lies between 5.25 and 5.31 m), we are giving an interval estimate.
Properties of a Good Estimator
1. Unbiasedness: An estimator is said to be unbiased if its expected value is identical with the population parameter being estimated. E.g., the expected value of the sample mean equals the population mean ($E(\bar{X}) = \mu$).

2. Efficiency: The most efficient estimator among a group of unbiased estimators is the one with the smallest variance.

3. Consistency: An unbiased estimator is said to be consistent if the difference between the estimator and the target population parameter becomes smaller as we increase the sample size.

4. Sufficiency: An estimator is said to be sufficient if it uses all the information about the population parameter that the sample can provide. E.g. the sample mean is sufficient because it uses all the observations.
HYPOTHESIS TESTING
A statistical hypothesis is an assumption about a population parameter (the object of interest).

The assumption may or may not be true.

Hypothesis testing refers to the formal procedures used by statisticians to accept or reject statistical hypotheses.

Hypothesis testing in statistics is a way for you to test the results of a survey or experiment to see if you have meaningful results.

It is always advisable to state a hypothesis that will be rejected.

The rejection of a hypothesis signifies that it is false.


Types of Hypotheses
There are two types of statistical hypotheses:

1. Null hypothesis: this is the hypothesis under investigation, which the researcher is willing to reject. It is denoted by H0. E.g.

H0: Men are, on average, not taller than women.

2. Alternative hypothesis: the opposite of the null hypothesis. It is denoted by H1 or HA. E.g.

H1: Men are, on average, taller than women.


Types of Error
A decision maker is liable to commit two types of error.

1. Type I error: The researcher rejects the null hypothesis when the null hypothesis is actually true (i.e. it should have been accepted).

2. Type II error: The researcher accepts the null hypothesis when the null hypothesis is actually false (i.e. it should have been rejected).
Choose level of significance
Type I error can be minimised by choosing the significance level appropriately. In practice, a significance level of 0.05 or 0.01 is customary. A level of 0.05 signifies that we are about 95% confident that we have made the right decision; a level of 0.01 signifies 99% confidence.

Type II error can be minimised by never accepting the null hypothesis. In some cases, though, we will accept the null hypothesis if the data gathered provide evidence for its acceptance, e.g. H0: COVID-19 vaccines do not provide 100% protection against the coronavirus.
Types of Test
1. One tailed test: Also known as a one sided test or directional test. When a hypothesis is stated so as to indicate the direction of the difference, the test is called a one-tailed test. For example: people who live in high altitude areas perform better in long distance races; people who have stout bodies do better in shot put; expensive cars perform better, etc.
The alternative hypothesis is stated as greater than (>) or less than (<) a value stated in the null hypothesis.
2. Two tailed test: Also known as a non-directional or two sided test. When a hypothesis is stated in such a way that it does not indicate a direction of difference, but agrees that a difference exists, we apply a two-tailed test of significance. Most null hypotheses are two tailed because they do not indicate the direction of the difference, as in social and management science. They merely state that there is no significant difference between A and B. For instance, there is no significant difference in academic performance between those who went to Federal Government Colleges and those who went to State Schools.
The alternative hypothesis is stated as not equal to (≠) the value stated in the null hypothesis.
Test Statistics
A test statistic is a sample statistic used to decide whether or not to reject the null hypothesis.
The distribution of the test statistic is divided into two regions, namely the region of rejection (critical region) and the region of non-rejection.

[Figure: Standard normal curve with critical region (0.05) and acceptance region (0.95). Results are significant in the two tails and not significant in the centre.]
Test Statistics CONT.
The total shaded area (0.05) is the significance level of the test.

It represents the probability of our being wrong in rejecting the hypothesis (i.e. the probability of making a type I error).

Thus, we say that the hypothesis is rejected at the 0.05 significance level, or that the z-score of the given sample statistic is significant at the 0.05 level.

Reject the null hypothesis at the 0.05 significance level if the z-score of the statistic lies outside the range -1.96 to 1.96 (i.e. either Z > 1.96 or Z < -1.96), that is, if |Z| > $Z_\alpha$ (1.96).

Accept the hypothesis otherwise (or, if desired, make no decision at all).


Test Statistics CONT.

The distribution of the standardised variable (or z-scores)

Level of significance α                      0.10               0.05               0.01             0.005            0.002
Critical value of z for one-tailed tests     -1.28 or 1.28      -1.645 or 1.645    -2.33 or 2.33    -2.58 or 2.58    -2.88 or 2.88
Critical value of z for two-tailed tests     -1.645 and 1.645   -1.96 and 1.96     -2.58 and 2.58   -2.81 and 2.81   -3.08 and 3.08
Test Concerning Mean (One Sample Distribution)
We may be interested in determining (testing) whether the sample mean ($\bar{X}$) is different from the population mean (μ).
We shall examine the one-sample case for n ≥ 30 (large sample).
The testing procedure can be itemised as follows:
1. State the null hypothesis (H0)
2. State the alternative hypothesis (H1)
3. State the α-significance level
4. State the decision rule
5. Compute the test statistic using sample data (calculated value)
6. Decision: Reject H0 if the calculated value is in the critical region; otherwise, accept H0.
Test Concerning Mean CONT.
When the mean and the standard deviation are known, we can use the Z statistic.
$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$$

Where
$\bar{X}$ = sample mean
µ = hypothesised or population mean
σ = standard deviation
n = sample size
Test Concerning Mean CONT.
Example 1: A beverage manufacturing company claims that the mean weight of its medium size products is 450g with a standard deviation of 15g. If a sample of 40 products was obtained and its mean weight found to be 442g, test whether the mean is significantly different from 450g at the 0.05 level of significance.
Solution:
1. H0: μ = 450
2. H1: μ ≠ 450 (two tailed)
3. α = 0.05
4. Critical region: Z < -1.96 or Z > 1.96, i.e. if |Z| > $Z_\alpha$ (1.96), reject H0 (results are significant)
Test Concerning Mean CONT.
5. Computation:

$\bar{X}$ = 442, μ = 450, σ = 15, n = 40

$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{442 - 450}{15 / \sqrt{40}} = -3.37$$

6. Conclusion: Since |Z| = 3.37 is greater than $Z_\alpha$ = 1.96, the results are highly significant; we reject H0 and conclude that the mean weight of the beverage is not equal to 450g.
Test Concerning Mean CONT.
Example 2: The average length of rods produced by a company was claimed to be 30m with σ = 0.88m. A distributor of the rods disputed the claim and said that it is less than 30m. A sample of 50 rods was taken and the average length was 29.8m. Test the hypothesis at α = 0.05 to ascertain which claim is correct, the manufacturer's or the distributor's.
Solution:
1. H0: 𝜇 = 30m
2. H1: 𝜇 < 30m (one tailed)
3. ∝ =0.05
4. Critical region: Z < -1.645 (one-tailed test at α = 0.05). Reject H0 if Z < $-Z_\alpha$ (results are significant)
Test Concerning Mean CONT.
5. Computation:

$\bar{X}$ = 29.8, μ = 30, σ = 0.88, n = 50

$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{29.8 - 30}{0.88 / \sqrt{50}} = -1.6071$$

6. Conclusion: Since Z = -1.6071 is not less than $-Z_\alpha$ = -1.645, the results are not significant; we accept H0 and conclude that the mean length of the rods is 30m.
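The computation step of both examples can be sketched in Python. The helper name `z_statistic` is made up for illustration, and the critical values 1.96 (two-tailed) and 1.645 (one-tailed) at the 0.05 level are assumed from the table of critical values for z given earlier.

```python
from math import sqrt

def z_statistic(sample_mean, mu, sigma, n):
    """Z = (sample mean - hypothesised mean) / (sigma / sqrt(n))."""
    return (sample_mean - mu) / (sigma / sqrt(n))

# Example 1: two-tailed test, reject H0 when |Z| > 1.96.
z_beverage = z_statistic(442, 450, 15, 40)
reject_beverage = abs(z_beverage) > 1.96

# Example 2: one-tailed (lower) test, reject H0 when Z < -1.645.
z_rods = z_statistic(29.8, 30, 0.88, 50)
reject_rods = z_rods < -1.645

print(round(z_beverage, 2), reject_beverage)
print(round(z_rods, 4), reject_rods)
```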
ANALYSIS OF VARIANCE (ANOVA)
An ANOVA test is a way to find out if survey or experiment results are significant. In other words, it helps you to figure out whether you need to reject the null hypothesis or accept it.

Basically, you’re testing groups to see if there’s a difference between them.

Examples of when you might want to test different groups:

• A manufacturer has two different processes to make light bulbs. They want to
know if one process is better than the other.

• Students from four colleges take the same exam. You want to see if one college outperforms the others.
ANALYSIS OF VARIANCE (ANOVA) Cont.
ANOVA is basically of two types:

1. One way classification or one factor experiments: It has one dependent and one independent variable (with 2 or 3 levels). E.g. yields in kg per acre of wheat grown in a particular type of soil treated with chemical A, B and C.

2. Two way classification or two factor experiments: It has one dependent variable with two independent variables. E.g. yields in kg per acre of wheat grown in a particular type of soil treated with chemical A, B and C, and with little, moderate and sufficient rainfall.

The null hypothesis (H0) of ANOVA is that there is no difference among the group means. The alternative hypothesis (Ha) is that there is a significant difference among the group means.
ANALYSIS OF VARIANCE (ANOVA) Cont.
One way ANOVA

Example: The table below shows the yields in bushels per acre of a certain variety
of wheat grown in a particular type of soil treated with chemical A, B, or C.

Chemical    Yield (bushels per acre)
A           48    49    50    49
B           47    49    48    48
C           49    51    50    50

H0: The mean yields in bushels per acre are the same for the three chemicals

H1: The mean yields in bushels per acre are not all the same (though stating H1 is not necessary)
ANALYSIS OF VARIANCE (ANOVA) Cont.
The ANOVA Table:

Variation                      Degrees of freedom    Mean square          F
Between treatments (factor)    a – 1                 SB² = VB/(a – 1)     F = SB²/SW²
VB = b Σ (X̄j – X̄)²                                                        with a – 1 and N – a
                                                                          degrees of freedom
Within treatments (error)      a(b – 1) or N – a     SW² = VW/(N – a)
VW = V – VB
Total                          ab – 1 or N – 1
V = VB + VW
  = Σ (Xjk – X̄)²

Where a = number of rows (treatments), b = number of columns (observations per
treatment), N = total number of observations, X̄j = row means,
X̄ = grand mean, Xjk = observations or values
ANALYSIS OF VARIANCE (ANOVA) Cont.
Substitution of values:

Row means

X̄1 = (48+49+50+49)/4 = 49,  X̄2 = (47+49+48+48)/4 = 48,  X̄3 = (49+51+50+50)/4 = 50

The grand mean

X̄ = (48+49+50+49+47+49+48+48+49+51+50+50)/12 = 49

The total variation

V = Σ (Xjk – X̄)² = (48-49)²+(49-49)²+(50-49)²+(49-49)²+(47-49)²+(49-49)²+(48-49)²
+(48-49)²+(49-49)²+(51-49)²+(50-49)²+(50-49)² = 14
ANALYSIS OF VARIANCE (ANOVA) Cont.
The variation between treatments:
VB = b Σ(X̄j – X̄)² = 4[(49 – 49)² + (48 – 49)² + (50 – 49)²] = 8
The variation within treatments:
VW = V – VB = 14 – 8 = 6
Then, the substitution of answers in the ANOVA Table:

Variation                      Degrees of freedom         Mean square                 F
Between treatments (factor)    a – 1 = 3 – 1 = 2          SB² = VB/(a – 1) = 8/2 = 4  F = SB²/SW² = 4/0.667 = 6
VB = 8                                                                                with 2 and 9 degrees of
                                                                                      freedom
Within treatments (error)      a(b – 1) = 3(4 – 1) = 9    SW² = VW/(N – a) = 6/9 = 0.667
VW = 6                         or N – a = 12 – 3 = 9
Total                          ab – 1 = 3(4) – 1 = 11
V = 14                         or N – 1 = 12 – 1 = 11
ANALYSIS OF VARIANCE (ANOVA) Cont.
• Decision Rule: If F calculated > F critical value, reject the null hypothesis
and conclude that the means of at least two groups are significantly different
(not the same).

• From the results, the F-calculated is 6. The F-critical value (the intersection of
the degrees of freedom 2 and 9 in the F-distribution table) is 8.02 at the 0.01
significance level. Since Fcal = 6 < 8.02, we do not reject the null hypothesis at the
0.01 level. At the 0.05 level the F-critical value is 4.26, and since Fcal = 6 > 4.26,
we reject the null hypothesis at that level and conclude that the
mean yields in bushels per acre are not all the same.
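
The one way ANOVA arithmetic can be reproduced with the standard library alone. The sketch below is illustrative rather than a library implementation: it uses the yields as they appear in the row-mean computation, the variable names are assumptions, and the critical value 8.02 is simply the tabulated figure quoted in the text.

```python
from statistics import mean

# Yields in bushels per acre for each chemical (treatment).
yields = {
    "A": [48, 49, 50, 49],
    "B": [47, 49, 48, 48],
    "C": [49, 51, 50, 50],
}

a = len(yields)                                            # number of treatments
all_values = [x for row in yields.values() for x in row]
N = len(all_values)                                        # total observations
grand_mean = mean(all_values)

# Total, between-treatment and within-treatment variation.
v_total = sum((x - grand_mean) ** 2 for x in all_values)
v_between = sum(len(row) * (mean(row) - grand_mean) ** 2 for row in yields.values())
v_within = v_total - v_between

# Mean squares and the F ratio with (a - 1, N - a) degrees of freedom.
ms_between = v_between / (a - 1)
ms_within = v_within / (N - a)
f_ratio = ms_between / ms_within

f_critical = 8.02                  # tabulated F(2, 9) value quoted in the text
print(v_total, v_between, v_within, round(f_ratio, 2), f_ratio > f_critical)
```

Changing `f_critical` to the tabulated value for another significance level shows how the decision depends on the chosen level.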
SIMPLE REGRESSION MODEL
• Is concerned with the mathematical form of the relationship between two variables.
• The simple regression model is:
Y = a + bX + e or Ŷ = â + bX
where
Y = dependent Variable
X = independent variable
a = intercept
b = the slope
e = error term
SIMPLE REGRESSION MODEL cont.
• a and b are constants whose values are to be estimated on the basis
of the given values of X and Y.
• The intercept ‘a’ represents the part of the dependent variable Y that does not depend
on X.
• The slope ‘b’ represents the change in Y per unit change in X. If b is positive, a unit
increase in X increases Y by b units; if b is negative, a unit increase in X
decreases Y by |b| units.
• The error term ‘e’ represents the difference between the actual value of Y and its
estimate.
• Various methods exist for obtaining the values of a and b; one such method is the
least squares estimate. The formulas are:
b = (nƩXY − ƩXƩY) / (nƩX² − (ƩX)²)
â = Ȳ − bX̄ or a = (ƩY − bƩX) / n
SIMPLE REGRESSION MODEL cont.
Coefficient of determination
• The coefficient of determination (r²) measures the goodness of fit of the fitted
regression to a set of data.
• It gives the proportion or percentage of the total variation in the dependent
variable Y that is explained by the explanatory variable X.
• r² lies between 0 and 1. If it is 1, the fitted regression explains 100% of the variation
in Y, and if it is 0, the model does not explain any of the variation in Y.
• The fit of the model is said to be “better” the closer r² is to 1.
r² = 1: 100% explained, error term = 0%
r² = 0.99: 99% explained, error term = 1%
r² = 0.56: 56% explained, error term = 44%
SIMPLE REGRESSION MODEL cont.

r² = b²Ʃx² / Ʃy²
or
r² = b² Σ(X − X̄)² / Σ(Y − Ȳ)²
SIMPLE REGRESSION MODEL cont.
Example: The following table contains observations of the quantity demanded
(Y) of a certain commodity at various prices (X).

Quantity Demanded in Kg (Y)    Price in N (X)
100                            4
60                             5
40                             9
70                             6
130                            4
80                             8

Required:
a. Fit the simple linear regression model of quantity demanded on price.
b. Predict the quantity that would be demanded when the price is N11.
c. Compute the coefficient of determination and state how well the fitted
equation fits the data.
SIMPLE REGRESSION MODEL cont.
Solution:
X        Y         XY          X²         x = X − X̄    y = Y − Ȳ    x²         y²
4        100       400         16         −2           20           4          400
5        60        300         25         −1           −20          1          400
9        40        360         81         3            −40          9          1600
6        70        420         36         0            −10          0          100
4        130       520         16         −2           50           4          2500
8        80        640         64         2            0            4          0
ƩX=36    ƩY=480    ƩXY=2640    ƩX²=238    Ʃx=0         Ʃy=0         Ʃx²=22     Ʃy²=5000

X̄ = 36/6 = 6, Ȳ = 480/6 = 80
SIMPLE REGRESSION MODEL cont.
a. Model: Ŷ = â + bX
b = (nƩXY − ƩXƩY) / (nƩX² − (ƩX)²) = (6(2640) − 36(480)) / (6(238) − 36²) = −1440/132 = −10.91

â = Ȳ − bX̄ or a = (ƩY − bƩX) / n
= 80 – (−10.91)(6) = 145.46
The fitted simple linear regression is:
Ŷ = 145.46 + (−10.91)X
b. Required to find Y when X = 11
Ŷ = 145.46 + (−10.91)(11)
= 25.45, implying 25.45 kg.
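
The least squares fit and the prediction can be reproduced as follows. This is a minimal sketch in plain Python; the function name `least_squares` is an illustrative choice. Because the code does not round b before computing the intercept, it gives an intercept of about 145.45 rather than the hand-rounded 145.46.

```python
prices = [4, 5, 9, 6, 4, 8]              # X, price in N
demand = [100, 60, 40, 70, 130, 80]      # Y, quantity demanded in kg

def least_squares(xs, ys):
    """Return (a, b) for the fitted line Y = a + bX using the sum formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b

a, b = least_squares(prices, demand)
predicted = a + b * 11                   # predicted demand when the price is N11

print(round(a, 2), round(b, 2), round(predicted, 2))   # 145.45 -10.91 25.45
```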
SIMPLE REGRESSION MODEL cont.

c. r² = b²Ʃx² / Ʃy² = b² Σ(X − X̄)² / Σ(Y − Ȳ)²

= (−10.91)²(22) / 5000 = 0.52

The fitted regression equation explains about 52% of the variation in quantity
demanded, which is a moderate fit to the data.
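
As a cross-check, r² can be computed from the deviations and compared with the equivalent definition 1 − SSE/SST, where SSE is the sum of squared residuals and SST = Ʃy². The sketch below is illustrative; the names `sxx`, `sxy`, `syy` and `sse` are assumptions, not notation from the text.

```python
prices = [4, 5, 9, 6, 4, 8]              # X
demand = [100, 60, 40, 70, 130, 80]      # Y

n = len(prices)
x_bar = sum(prices) / n
y_bar = sum(demand) / n

# Sums of squared deviations and cross-products (lowercase x and y in the table).
sxx = sum((x - x_bar) ** 2 for x in prices)
syy = sum((y - y_bar) ** 2 for y in demand)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(prices, demand))

b = sxy / sxx                            # slope in deviation form
a = y_bar - b * x_bar                    # intercept
r_squared = b ** 2 * sxx / syy

# Equivalent definition: one minus the unexplained share of the variation.
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(prices, demand))
r_squared_check = 1 - sse / syy

print(round(r_squared, 2), round(r_squared_check, 2))   # 0.52 0.52
```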


CORRELATION ANALYSIS
Correlation coefficients are used to measure how strong a relationship is
between two variables, e.g. “quantity demanded” and “price”.
The correlation coefficient does not distinguish between dependent
variables and independent variables.
There are several types of correlation coefficient, but the most popular is
Pearson’s. Pearson’s correlation (also called Pearson’s r) is a correlation
coefficient commonly used in linear regression.
Correlation coefficient formulas are used to find how strong a relationship is
between data. The formulas return a value between -1 and 1, where:
• 1 indicates a perfect positive linear relationship.
• -1 indicates a perfect negative linear relationship.
• A result of zero indicates no linear relationship at all.
CORRELATION ANALYSIS cont.
[Figure: diagram showing the rejection (shaded) and acceptance (unshaded) regions]

If the calculated r falls in the shaded area, reject H0; if it falls in the unshaded area,
accept H0 at the 0.05 and 0.01 significance levels.
To Find Pearson's Correlation Coefficient (by Hand)
Example: Find the value of the correlation coefficient from the following table:

Age (x)              43    21    25    42    57    59
Glucose Level (y)    99    65    79    75    87    81

H0: There is no statistically significant relationship between age and glucose level
CORRELATION ANALYSIS cont.
Solution: First compute the quantities required by the formula in a table.

Age (x)    Glucose Level (y)    xy           x²           y²
43         99                   4257         1849         9801
21         65                   1365         441          4225
25         79                   1975         625          6241
42         75                   3150         1764         5625
57         87                   4959         3249         7569
59         81                   4779         3481         6561
Σx=247     Σy=486               Σxy=20485    Σx²=11409    Σy²=40022

r = (nΣxy − ΣxΣy) / √([nΣx² − (Σx)²][nΣy² − (Σy)²])
  = (6(20485) − (247)(486)) / √([6(11409) − 247²][6(40022) − 486²])
  = 2868 / 5413.27 = 0.529809
CORRELATION ANALYSIS cont.
The range of the correlation coefficient is from -1 to 1. Our result is 0.5298,
which means the variables have a moderate positive correlation.
Conclusion: Since r = 0.5298 falls in the unshaded area (for n = 6, that is 4 degrees
of freedom, the two-tailed critical value at the 0.05 level is about 0.811), we accept H0 and
conclude that there is no statistically significant relationship between age
and glucose level.
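
The hand calculation of Pearson's r can be verified with a short script. This is a sketch in plain Python using the same sum formula; the function name `pearson_r` is an illustrative choice.

```python
import math

age = [43, 21, 25, 42, 57, 59]           # x
glucose = [99, 65, 79, 75, 87, 81]       # y

def pearson_r(xs, ys):
    """Pearson's correlation coefficient from the raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

r = pearson_r(age, glucose)
print(round(r, 4))   # 0.5298
```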
CHI-SQUARE TEST OF INDEPENDENCE
The Chi-Square test of independence is used to determine if there is a significant
relationship between two nominal (categorical) variables.

The frequency of each category for one nominal variable is compared across the
categories of the second nominal variable.

The data can be displayed in a contingency table where each row represents a
category for one variable and each column represents a category for the other
variable.
Example of Chi-Square Test
Question: A cellular phone company conducts a survey to determine the
ownership of cellular phones in different age groups. The results for 1000 households
are obtained as follows:

Summary of the responses from the respondents

Cellular Phone    Age 18 – 24    25 – 54    55 – 64    ≥ 65    Total
Yes               50             80         70         50      250
No                200            170        180        200     750
Total             250            250        250        250     1000


Example of Chi-Square Test CONT.
Solution:
Step 1: State the hypotheses
H0: The proportions owning cellular phones are the same for the different age
groups
H1: The proportions owning cellular phones are not the same for the different age
groups
Step 2: Chi-Square computation
Using the contingency table, compute the expected frequencies:
e = (Row Total × Column Total) / Grand Total

Observed frequencies, with expected frequencies in brackets:

Cellular Phone    18 – 24        25 – 54        55 – 64        ≥ 65           Total
Yes               50 (62.5)      80 (62.5)      70 (62.5)      50 (62.5)      250
No                200 (187.5)    170 (187.5)    180 (187.5)    200 (187.5)    750
Total             250            250            250            250            1000
Example of Chi-Square Test CONT.
Use a table to compute Chi-Square, χ² = Σ (o − e)² / e

Row, Column    o       e        (o – e)    (o – e)²    (o – e)² / e
1,1            50      62.5     -12.5      156.25      2.5
1,2            80      62.5     17.5       306.25      4.9
1,3            70      62.5     7.5        56.25       0.9
1,4            50      62.5     -12.5      156.25      2.5
2,1            200     187.5    12.5       156.25      0.8
2,2            170     187.5    -17.5      306.25      1.6
2,3            180     187.5    -7.5       56.25       0.3
2,4            200     187.5    12.5       156.25      0.8
Sum            1000    1000     0                      14.3

χ² = 14.3
Degrees of freedom, ν = (r − 1)(c − 1) = (2 − 1)(4 − 1) = 3
Example of Chi-Square Test CONT.
Conclusion: χ² at the 0.95 level with 3 degrees of freedom = 7.81, and χ² calculated = 14.3.
Since 14.3 is greater than 7.81, we reject the
null hypothesis and conclude that the proportions owning cellular
phones are not the same for the different age groups.
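
The expected frequencies and the chi-square statistic can be recomputed directly from the contingency table. The sketch below uses plain Python with illustrative variable names, and the critical value 7.81 is the tabulated figure quoted in the conclusion. Because the cell contributions are not rounded before summing, the code gives 14.4 rather than the 14.3 obtained from the rounded column; the decision is the same.

```python
# Observed counts: rows are Yes / No, columns are the four age groups.
observed = [
    [50, 80, 70, 50],
    [200, 170, 180, 200],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count for each cell = row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
dof = (len(observed) - 1) * (len(observed[0]) - 1)

critical_value = 7.81                    # tabulated value for 3 degrees of freedom at 0.05
print(round(chi_square, 1), dof, chi_square > critical_value)   # 14.4 3 True
```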
