Introduction
Organising Data
TABULATIONS
Frequency Distributions
Solution
19 20 19 27 37
30 23 20 20 33
22 19 23 21 22
21 19 42 32 20
18 24 30 28 20
22 27 37 20 24
27 21 22 20 25
41 20 24 27 19
27 22 18 19 30
28 19 19 21 20
Solution
Age Frequency
15-19 10
20-24 23
25-29 8
30-34 5
35-39 2
40-44 2
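The tally behind this frequency table can be reproduced with a short sketch. The function name `frequency_table` and its parameters are illustrative assumptions, not part of the notes; the classes are taken to be inclusive integer ranges such as 15-19.

```python
# Ages from the example above (50 observations).
ages = [
    19, 20, 19, 27, 37, 30, 23, 20, 20, 33,
    22, 19, 23, 21, 22, 21, 19, 42, 32, 20,
    18, 24, 30, 28, 20, 22, 27, 37, 20, 24,
    27, 21, 22, 20, 25, 41, 20, 24, 27, 19,
    27, 22, 18, 19, 30, 28, 19, 19, 21, 20,
]

def frequency_table(values, start, width, n_classes):
    """Count the values in each inclusive class [lower, lower + width - 1]."""
    table = []
    for k in range(n_classes):
        lower = start + k * width
        upper = lower + width - 1
        count = sum(1 for v in values if lower <= v <= upper)
        table.append((f"{lower}-{upper}", count))
    return table

age_table = frequency_table(ages, start=15, width=5, n_classes=6)
for label, count in age_table:
    print(label, count)
```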
Cumulative Frequencies/Percent
Example 2.3 Referring to the students' degree preference, we have the following
table incorporating relative, percent and cumulative percent frequencies.
Degree Preference   Relative Freq   Percent   Cumulative Percent
Accounting          0.24            24        24
Business            0.16            16        40
Tourism             0.08            8         48
Economics           0.16            16        64
BSc                 0.08            8         72
Engineering         0.04            4         76
BA                  0.08            8         84
Pharmacy            0.08            8         92
Medicine            0.08            8         100
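The percent and cumulative percent columns follow mechanically from the relative frequencies: each percent is the relative frequency times 100, and each cumulative percent is the running total. A minimal sketch, with illustrative variable names, rounding each percent to a whole number:

```python
from itertools import accumulate

# Relative frequencies from the degree preference table.
relative = {
    "Accounting": 0.24, "Business": 0.16, "Tourism": 0.08,
    "Economics": 0.16, "BSc": 0.08, "Engineering": 0.04,
    "BA": 0.08, "Pharmacy": 0.08, "Medicine": 0.08,
}

percent = [round(r * 100) for r in relative.values()]  # whole-number percents
cumulative = list(accumulate(percent))                 # running total

for name, p, c in zip(relative, percent, cumulative):
    print(f"{name:12s} {p:3d} {c:4d}")
```

The last cumulative percent should always be 100, which is a quick check on the table.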
Some graphs are ideal for qualitative data whilst some are ideal for quanti-
tative data.
Types of Graphs
Pie chart, bar graph, box and whisker diagram, histogram, stem and leaf,
scatter plot, population pyramid, etc.
Pie Chart
A pie chart is used to split a particular quantity into its component pieces.
It is a convenient way of representing percentages or relative frequencies
(rather than frequencies). To construct a pie chart, draw a line from the
center of the circle to the outer edge and then construct the various pieces
of the pie-chart by drawing the corresponding angles.
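The "corresponding angle" for each piece is its relative frequency multiplied by 360 degrees. As a hedged illustration, the sketch below applies this to the degree preference relative frequencies from Example 2.3; the variable names are assumptions made for the example.

```python
# Relative frequencies from Example 2.3.
relative = {
    "Accounting": 0.24, "Business": 0.16, "Tourism": 0.08,
    "Economics": 0.16, "BSc": 0.08, "Engineering": 0.04,
    "BA": 0.08, "Pharmacy": 0.08, "Medicine": 0.08,
}

# Angle of each slice in degrees: relative frequency x 360.
angles = {name: r * 360 for name, r in relative.items()}

for name, angle in angles.items():
    print(f"{name:12s} {angle:6.1f} degrees")
```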
Bar Graph
SEASON         QUARTERLY SALES
1st quarter    290
2nd quarter    180
3rd quarter    260
4th quarter    120
Histogram
A graph very similar to a bar graph is the histogram. It is ideal for continuous
data and, unlike a bar graph, it has no gaps between the bars. It is
constructed by first selecting the number of "intervals" (classes) to be used. The choice
is a trade-off: use few enough intervals to summarise the information, but enough
to show the shape of the distribution.
Construction of Histogram of Equal Class Width
Solution The classes have equal width, namely 25, so the heights of the
bars/rectangles are the corresponding frequencies for each class.
Stem and Leaf Diagram
They are useful in summarising reasonably sized data sets and unlike his-
tograms, result in no loss of information. By this we mean that it is possible
to retrieve the original data set from the stem and leaf diagram, which is not
the case when using a histogram. A stem and leaf diagram is constructed in
such a manner that each number is divided into two parts, a stem consisting
of one or more of the leading digits and a leaf, consisting of the remaining
digits.
• For each observation, record the remaining digits, that is, the leaf, to
the right in each of the rows corresponding to the appropriate stem.
• If there are too many leaves corresponding to one stem, we split the
stem, possibly into two.
Example 2.7 The following data gives time taken in minutes to interview
participants in a certain survey.
18 42 20 58 56 11 34 42 54 23
66 41 52 23 16 42 37 50 19 36
32 42 53 30 14 29 25 47 21 12
Solution
Stem Leaf
1 1 2 4 6 8 9
2 0 1 3 3 5 9
3 0 2 4 6 7
4 1 2 2 2 2 7
5 0 2 3 4 6 8
6 6
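The construction above can be sketched in a few lines for two-digit data, taking the tens digit as the stem and the units digit as the leaf. The function name `stem_and_leaf` is an illustrative assumption.

```python
from collections import defaultdict

# Interview times (minutes) from Example 2.7.
times = [18, 42, 20, 58, 56, 11, 34, 42, 54, 23,
         66, 41, 52, 23, 16, 42, 37, 50, 19, 36,
         32, 42, 53, 30, 14, 29, 25, 47, 21, 12]

def stem_and_leaf(values):
    """Map each stem (tens digit) to its sorted leaves (units digits)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        stems[stem].append(leaf)
    return dict(stems)

diagram = stem_and_leaf(times)
for stem in sorted(diagram):
    print(stem, "|", " ".join(str(leaf) for leaf in diagram[stem]))
```

Because every leaf is kept, the original observations can be rebuilt as `10 * stem + leaf`, which is the "no loss of information" property described above.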
Frequency Polygon
Activity 2.1
1. The total number of children per 1000 married women are as follows
for women between the ages 15 and 45.
249 180 190 230 349 299 300 175 305 205
225 160 195 395 245 275 155 225 360 180
205 239 305 260 155 255 310 230 305 210
320 168 309 380 190 225 365 302 285 395
(a) Determine the class intervals if 5 class intervals are desired for a
frequency distribution.
(b) Construct a cumulative frequency distribution.
(c) Construct a histogram for the data.
(d) What proportion of the apartments are priced over $299 per month?
3. The following data are examination scores obtained by 30 students at
a certain college.
45 88 65 58 45 58 56 68 71 66
87 54 51 74 52 66 64 59 63 54
51 72 53 62 42 32 68 47 91 52
DESCRIPTIVE MEASURES
The mean is the most commonly used measure of central tendency. In most
cases, we deal with data from a sample and we refer to the arithmetic mean
as simply the sample mean. If the observations in a sample of size $n$ are
$x_1, x_2, \ldots, x_n$, then
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Example 2.8 The data 3652, 4125, 9526, 2546 and 2328 are salaries of
Company Executives. Find the sample mean.
Solution
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{3652 + 4125 + \cdots + 2328}{5} = \frac{22177}{5} = 4435.4$$
Sometimes we may have to work with data in the form of a frequency distri-
bution, called grouped data, when the raw data are not available. We do not
have the data values used to make this frequency distribution and so we are
forced to approximate the sample statistics. Suppose data is grouped into
$k$ classes with frequencies $f_1, f_2, \ldots, f_k$ and midpoints $x_1, x_2, \ldots, x_k$, then the
arithmetic mean for grouped data is defined as
$$\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{n}$$
where $n = \sum_{i=1}^{k} f_i$.
Example 2.9 Making use of data in Example 2.6, calculate the sample
mean.
Money Spent (Dollars) LCB UCB Mid Point (xi ) Frequency (fi ) fi xi
0-25 0 25 12.5 36 450
25-50 25 50 37.5 24 900
50-75 50 75 62.5 12 750
75-100 75 100 87.5 9 787.5
100-125 100 125 112.5 9 1012.5
125-150 125 150 137.5 5 687.5
150-175 150 175 162.5 3 487.5
175-200 175 200 187.5 2 375
Sum 100 5450
$$\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{n} = \frac{12.5 \times 36 + 37.5 \times 24 + \cdots + 187.5 \times 2}{100} = \frac{5450}{100} = 54.5$$
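The same calculation can be checked with a short sketch that uses the midpoints and frequencies from the table in Example 2.9. The variable names are illustrative.

```python
# Class midpoints and frequencies from the table in Example 2.9.
midpoints = [12.5, 37.5, 62.5, 87.5, 112.5, 137.5, 162.5, 187.5]
frequencies = [36, 24, 12, 9, 9, 5, 3, 2]

n = sum(frequencies)                                        # total observations
total = sum(f * x for f, x in zip(frequencies, midpoints))  # sum of f_i * x_i
grouped_mean = total / n

print(n, total, grouped_mean)
```

Since each observation is replaced by its class midpoint, the result is an approximation to the mean of the raw data, not its exact value.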
Example 2.10 Find the median of the numbers 356, 147, 216, 215, 191, 209,
187, 153, 278 and 133.
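For ungrouped data the median is the middle value of the ordered observations, or the average of the two middle values when the number of observations is even. A minimal sketch applied to the numbers in Example 2.10 (the function below is written out for illustration; it is not taken from the notes):

```python
def median(values):
    """Middle value of the ordered data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

data = [356, 147, 216, 215, 191, 209, 187, 153, 278, 133]
print(sorted(data))
print(median(data))
```

Here n = 10 is even, so the median is the average of the 5th and 6th ordered values, 191 and 209.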
It is not possible to find the exact value for the median for grouped data.
However, the median can be obtained using two approaches, that is, the
graphical method and the arithmetic approach.
Graphical Method
Here we make use of the cumulative frequency curve to find the median. To
find the median, locate the 50th percentile, or $\left(\frac{n}{2}\right)$th position, of the absolute
cumulative frequencies on the vertical axis. Having located this point, move
horizontally until you reach the curve and then move vertically downward to
the horizontal axis. The value reached on the horizontal axis is the median.
Arithmetic method
$$\text{median} = L_m + \frac{c_m\left(\frac{n}{2} - F_{m-1}\right)}{f_m}$$
where $L_m$ is the lower limit of the median class,
$c_m$ is the class width of the median class,
$f_m$ is the frequency of the median class,
$F_{m-1}$ is the cumulative frequency of the class just before the median class and
$n = \sum_{i=1}^{k} f_i$ is the total number of observations.
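Example 2.11 Making use of data in Example 2.6, calculate the sample median.
Solution Since $\frac{n}{2} = 50$ and the cumulative frequencies of the first two classes are 36 and 60, the median class is 25-50. Thus $L_m = 25$, $c_m = 25$, $f_m = 24$ and $F_{m-1} = 36$. Hence,
$$\begin{aligned}
\text{median} &= L_m + \frac{c_m\left(\frac{n}{2} - F_{m-1}\right)}{f_m} \\
&= 25 + \frac{25\left(\frac{100}{2} - 36\right)}{24} \\
&= 25 + \frac{25(50 - 36)}{24} \\
&= 25 + \frac{350}{24} \\
&= 39.5833 \text{ (4 d.p.)}
\end{aligned}$$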
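The arithmetic method can be sketched as a small function that first locates the class containing the required position and then interpolates within it. The name `grouped_quantile` and its parameters are illustrative assumptions; the frequencies are those of the grouped data used in this example, with equal class width 25.

```python
def grouped_quantile(lower_bounds, width, frequencies, position):
    """Interpolate the value at a cumulative position in grouped data.

    Finds the first class whose cumulative frequency reaches `position`,
    then applies L + c * (position - F_prev) / f for that class.
    """
    cumulative = 0
    for lower, f in zip(lower_bounds, frequencies):
        if cumulative + f >= position:
            return lower + width * (position - cumulative) / f
        cumulative += f
    raise ValueError("position exceeds the total frequency")

lower_bounds = [0, 25, 50, 75, 100, 125, 150, 175]
frequencies = [36, 24, 12, 9, 9, 5, 3, 2]

n = sum(frequencies)
grouped_median = grouped_quantile(lower_bounds, 25, frequencies, n / 2)
print(round(grouped_median, 4))
```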
Mode for Ungrouped Data
Graphical Method
To identify the mode, first identify the tallest bar in the histogram. Join
the top-left corner of the tallest bar to the top-left corner of the bar on its right,
and the top-right corner of the tallest bar to the top-right corner of the bar on its left.
The two diagonals intersect at some point. Then draw a vertical line through the point
of intersection; the mode is the point where this vertical line intersects the horizontal axis.
Arithmetic Method
We determine the modal class, that is, the class with the highest frequency.
Then the mode is given by:
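$$\text{Mode} = L_m + \frac{c_m(f_m - f_{m-1})}{2f_m - (f_{m-1} + f_{m+1})}$$
where $L_m$, $c_m$ and $f_m$ are the lower limit, class width and frequency of the modal class, and $f_{m-1}$ and $f_{m+1}$ are the frequencies of the classes just before and just after the modal class.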
Example 2.12 Making use of data used in Example 2.11, calculate the mode.
Solution The modal class is 0-25, that is, the class with the highest frequency.
We can also note that $L_m = 0$, $c_m = 25$, $f_m = 36$, $f_{m-1} = 0$,
$f_{m+1} = 24$ and hence
$$\begin{aligned}
\text{Mode} &= L_m + \frac{c_m(f_m - f_{m-1})}{2f_m - (f_{m-1} + f_{m+1})} \\
&= 0 + \frac{25(36 - 0)}{2 \times 36 - (0 + 24)} \\
&= \frac{900}{48} \\
&= 18.75
\end{aligned}$$
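A hedged sketch of the same calculation follows. The function name `grouped_mode` is illustrative, and a missing neighbouring class (before the first or after the last class) is treated as having frequency 0, as in the worked solution.

```python
def grouped_mode(lower_bounds, width, frequencies):
    """Mode of grouped data with equal class widths, by the arithmetic method."""
    # Index of the modal class (first class with the highest frequency).
    m = max(range(len(frequencies)), key=frequencies.__getitem__)
    f_m = frequencies[m]
    f_prev = frequencies[m - 1] if m > 0 else 0
    f_next = frequencies[m + 1] if m + 1 < len(frequencies) else 0
    return lower_bounds[m] + width * (f_m - f_prev) / (2 * f_m - (f_prev + f_next))

lower_bounds = [0, 25, 50, 75, 100, 125, 150, 175]
frequencies = [36, 24, 12, 9, 9, 5, 3, 2]

mode = grouped_mode(lower_bounds, 25, frequencies)
print(mode)
```

The sketch does not handle the degenerate case where the modal class and both neighbours have equal frequencies, since the denominator is then zero.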
Activity 2.2
(a) mean
(b) median
(c) mode
(2) The following data were obtained from a survey requesting 30 different
families to list their weekly expenditure on food.
99 85 72 59 119 120 95 83 78 91
64 106 86 87 78 108 136 102 86 74
72 103 94 63 73 89 75 88 107 101
(3) The data below gives marks obtained by Applied Statistics students
Calculate,
Measures of Position
They cannot really be called measures of central tendency, but they are
measures of location in that they give position of specified observations and
the most commonly used are quartiles, deciles and percentiles.
Quartiles
The quartiles divide the set of measurements into four equal parts. Twenty-
five per cent of the measurements are less than the lower quartile, fifty per
cent of the measurements are less than the median and seventy-five per cent
of the measurements are less than the upper quartile. So, fifty per cent of
the measurements are between the lower quartile and the upper quartile.
These are denoted by the symbols Q1 (lower or first quartile), Q2 (second
quartile) and Q3 (upper or third quartile).
Note: median is equivalent to the second quartile.
There are different approaches to the calculation of the first and third quartiles.
The approach we are going to adopt in this course is the interpolation
method because most statistical software packages make use of this method. First we
order the data in ascending order. Irrespective of $n$, the sample
size, we have:
$$Q_1 = \tfrac{1}{4}(n + 1)\text{th observation}$$
$$Q_3 = \tfrac{3}{4}(n + 1)\text{th observation}$$
Usually $\frac{1}{4}(n + 1)$ and $\frac{3}{4}(n + 1)$ are fractions, and this is where
interpolation arises.
For example, suppose $\frac{1}{4}(n + 1) = a.b$, so that $Q_1$ is the $a.b$th observation, where $a$ is the integer part
and $b$ is the decimal part. In this case
$$Q_1 = a\text{th observation} + 0.b \times [(a + 1)\text{th observation} - a\text{th observation}]$$
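Example 2.13 Find the median, lower quartile, upper quartile and interquartile
range of the following data set of scores: 18 20 23 20 23 27 24 23 29
Solution
Arrange the values in ascending order of magnitude:
18 20 20 23 23 23 24 27 29
Since $n$ is odd, the median is the value of the $\frac{n+1}{2}$th observation, that is
$$\begin{aligned}
\text{median} &= \tfrac{n+1}{2}\text{th observation} \\
&= \tfrac{9+1}{2}\text{th observation} \\
&= 5\text{th observation} = 23
\end{aligned}$$
For the lower quartile we have,
$$\begin{aligned}
Q_1 &= \tfrac{1}{4}(n + 1)\text{th observation} \\
&= \tfrac{1}{4}(9 + 1)\text{th observation} = 2.5\text{th observation} \\
&= 2\text{nd observation} + 0.5[3\text{rd observation} - 2\text{nd observation}] \\
&= 20 + 0.5(20 - 20) = 20
\end{aligned}$$
For the upper or third quartile we have,
$$\begin{aligned}
Q_3 &= \tfrac{3}{4}(n + 1)\text{th observation} \\
&= \tfrac{3}{4}(9 + 1)\text{th observation} = 7.5\text{th observation} \\
&= 7\text{th observation} + 0.5[8\text{th observation} - 7\text{th observation}] \\
&= 24 + 0.5(27 - 24) = 25.5
\end{aligned}$$
The interquartile range is therefore $Q_3 - Q_1 = 25.5 - 20 = 5.5$.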
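The interpolation method used in this example can be sketched as follows. The function name `value_at_fraction` is an illustrative assumption; positions are 1-based, as in the notes, and positions that fall outside the data are clamped to the smallest or largest observation.

```python
def value_at_fraction(values, fraction):
    """Observation at position fraction * (n + 1), linearly interpolated."""
    ordered = sorted(values)
    n = len(ordered)
    position = fraction * (n + 1)
    a = int(position)      # integer part of the position
    b = position - a       # decimal part of the position
    if a < 1:
        return ordered[0]
    if a >= n:
        return ordered[-1]
    # a-th observation + b * [(a+1)-th observation - a-th observation]
    return ordered[a - 1] + b * (ordered[a] - ordered[a - 1])

scores = [18, 20, 23, 20, 23, 27, 24, 23, 29]
q1 = value_at_fraction(scores, 0.25)
q2 = value_at_fraction(scores, 0.50)
q3 = value_at_fraction(scores, 0.75)
print(q1, q2, q3, q3 - q1)
```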
Quartiles for Grouped Data
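The procedure is the same as the one for the median. The difference
lies in the identification of the quartile class and the quartile position. The
positions are identified by calculating $\frac{n}{4}$ and $\frac{3n}{4}$
for $Q_1$ and $Q_3$ respectively.
For the lower quartile we have,
$$Q_1 = L_q + \frac{c_q\left(\frac{n}{4} - F_{q-1}\right)}{f_q}$$
and for the upper quartile,
$$Q_3 = L_q + \frac{c_q\left(\frac{3n}{4} - F_{q-1}\right)}{f_q}$$
where $L_q$, $c_q$ and $f_q$ are the lower limit, class width and frequency of the class containing the quartile, and $F_{q-1}$ is the cumulative frequency of the class just before it.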
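Solution
$\frac{50}{4} = 12.5$, thus the class containing the lower quartile is 35-45. Then $L_q = 35$,
$c_q = 45 - 35 = 10$, $f_q = 11$, $F_{q-1} = 5$ and $n = 50$. Hence,
$$\begin{aligned}
Q_1 &= L_q + \frac{c_q\left(\frac{n}{4} - F_{q-1}\right)}{f_q} \\
&= 35 + \frac{10\left(\frac{50}{4} - 5\right)}{11} \\
&= 41.8182 \text{ (4 d.p.)}
\end{aligned}$$
$\frac{3 \times 50}{4} = 37.5$, thus the class containing the upper quartile is 55-65. Then $L_q = 55$,
$c_q = 65 - 55 = 10$, $f_q = 6$, $F_{q-1} = 34$ and $n = 50$. Hence,
$$\begin{aligned}
Q_3 &= L_q + \frac{c_q\left(\frac{3n}{4} - F_{q-1}\right)}{f_q} \\
&= 55 + \frac{10\left(\frac{3 \times 50}{4} - 34\right)}{6} \\
&= 60.8333 \text{ (4 d.p.)}
\end{aligned}$$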
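Both results can be checked by plugging the quantities stated in the solution into the grouped-data formula. The helper `interpolate_in_class` and its parameter names are illustrative assumptions; only the numbers quoted in the solution are used.

```python
def interpolate_in_class(lower, width, freq, cum_before, position):
    """L_q + c_q * (position - F_{q-1}) / f_q for the class containing the quartile."""
    return lower + width * (position - cum_before) / freq

n = 50
q1 = interpolate_in_class(lower=35, width=10, freq=11, cum_before=5, position=n / 4)
q3 = interpolate_in_class(lower=55, width=10, freq=6, cum_before=34, position=3 * n / 4)
print(round(q1, 4), round(q3, 4))
```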
Deciles
These values divide the observations into 10 equal parts and are denoted
by D1, D2, ..., D9. For example, D3 has 30% of values below it and 70% above
it.
Percentiles
These values divide the observations into 100 equal parts and are denoted by
P1, P2, P3, ..., P99. For example, P30 has 30% of values below it and 70% above
it.
Note
• Q1 = P25
• Q2 = D5 = P50
• Q3 = P75
Measures of Dispersion
Dispersion is the statistical term for the spread or variability of data. Mea-
sures of dispersion reflect the amount of spread or variability in a collection
of data. Consider the following data of height (cm) in two different teams.
Harare: 160, 161, 162, 162, 163, 164, 165, 167, 171, 175
Chinhoyi: 154, 156, 158, 159, 163, 164, 166, 172, 172, 186
Both teams have the same mean height of 165 cm, but the heights of the
team members in Chinhoyi are more spread out.
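A quick numerical check of this comparison, using the difference between the largest and smallest height as a first indication of spread (the dictionary `summary` is an illustrative name):

```python
harare = [160, 161, 162, 162, 163, 164, 165, 167, 171, 175]
chinhoyi = [154, 156, 158, 159, 163, 164, 166, 172, 172, 186]

summary = {}
for name, heights in [("Harare", harare), ("Chinhoyi", chinhoyi)]:
    mean = sum(heights) / len(heights)
    spread = max(heights) - min(heights)  # largest minus smallest height
    summary[name] = (mean, spread)
    print(name, mean, spread)
```

The means coincide, while the spread for Chinhoyi is roughly twice that for Harare.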
Range
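The range is the simplest measure of dispersion.
For ungrouped data
Range = maximum value − minimum value
For grouped data
Range = upper limit of the highest class − lower limit of the lowest class
Interquartile Range
IQR = Q3 − Q1
A further measure of spread can be derived from it, called the Semi
Interquartile Range
$$\text{Semi Interquartile Range} = \frac{1}{2}(Q_3 - Q_1)$$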
Variance and Standard Deviation
This is the most commonly used measure of variability in statistical analysis.
Unlike the range and IQR, it takes into account all the observations in the
data set. The greater the variability in the data, the higher the value of the
statistic.
Disadvantage It is difficult or time-consuming to compute manually.
The sample variance is computed in the following steps:
(i) Calculate the mean, $\bar{x}$.
(ii) Find the difference between each observation and the mean, $x_i - \bar{x}$.
(iii) Square the differences, $(x_i - \bar{x})^2$.
(iv) Sum the squared differences, $\sum_{i=1}^{n}(x_i - \bar{x})^2$.
(v) Divide the sum by $n - 1$ to obtain the sample variance, $s^2$.
For the sample standard deviation, take the square root of the result in (v), i.e.
$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
NB: In a normal distribution, approximately 68% of the data fall within one standard deviation of
the mean, 95% fall within 2 standard deviations and 99.7% (almost 100%)
fall within 3 standard deviations of the mean. This is useful for outlier detection.
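The steps can be traced in code. The sketch below takes step (i) to be computing the mean and step (v) to be dividing by n − 1, an assumption that is consistent with the formula for s; it is applied to the Harare heights from the start of this section, and the function name is illustrative.

```python
from math import sqrt

def sample_variance(values):
    n = len(values)
    mean = sum(values) / n                     # (i) mean (assumed step)
    deviations = [x - mean for x in values]    # (ii) differences from the mean
    squared = [d ** 2 for d in deviations]     # (iii) squared differences
    total = sum(squared)                       # (iv) sum of squared differences
    return total / (n - 1)                     # (v) divide by n - 1 (assumed step)

harare = [160, 161, 162, 162, 163, 164, 165, 167, 171, 175]
s2 = sample_variance(harare)
s = sqrt(s2)
print(s2, s)
```

For these data the sum of squared differences is 204, so the sample variance is 204/9.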
Computational Formulae
For ungrouped data
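$$\text{Population variance, } \sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$$
$$\text{Population standard deviation, } \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$
$$\text{Sample variance, } s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1}$$
$$\text{Sample standard deviation, } s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2} = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1}}$$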
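The second form of the sample variance follows from expanding the square and using $\sum_{i=1}^{n} x_i = n\bar{x}$. The short derivation below is added for completeness:

```latex
\begin{aligned}
\sum_{i=1}^{n}(x_i - \bar{x})^2
  &= \sum_{i=1}^{n} x_i^2 - 2\bar{x}\sum_{i=1}^{n} x_i + n\bar{x}^2 \\
  &= \sum_{i=1}^{n} x_i^2 - 2n\bar{x}^2 + n\bar{x}^2 \\
  &= \sum_{i=1}^{n} x_i^2 - n\bar{x}^2
\end{aligned}
```

Dividing both sides by $n - 1$ gives the computational formula for $s^2$.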
For grouped data
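If the data are grouped into $k$ classes with class intervals whose midpoints
are $x_1, x_2, \ldots, x_k$ with frequencies of occurrence $f_1, f_2, \ldots, f_k$ respectively, and $n = \sum_{i=1}^{k} f_i$, then
$$\text{Population variance, } \sigma^2 = \frac{1}{\sum_{i=1}^{k} f_i}\sum_{i=1}^{k} f_i(x_i - \mu)^2 = \frac{\sum_{i=1}^{k} f_i x_i^2}{\sum_{i=1}^{k} f_i} - \left(\frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i}\right)^2$$
$$\text{Population standard deviation, } \sigma = \sqrt{\frac{1}{\sum_{i=1}^{k} f_i}\sum_{i=1}^{k} f_i(x_i - \mu)^2} = \sqrt{\frac{\sum_{i=1}^{k} f_i x_i^2}{\sum_{i=1}^{k} f_i} - \left(\frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i}\right)^2}$$
$$\text{Sample variance, } s^2 = \frac{1}{n-1}\sum_{i=1}^{k} f_i(x_i - \bar{x})^2 = \frac{1}{n-1}\left(\sum_{i=1}^{k} f_i x_i^2 - \frac{\left(\sum_{i=1}^{k} f_i x_i\right)^2}{\sum_{i=1}^{k} f_i}\right)$$
$$\text{Sample standard deviation, } s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{k} f_i(x_i - \bar{x})^2} = \sqrt{\frac{1}{n-1}\left(\sum_{i=1}^{k} f_i x_i^2 - \frac{\left(\sum_{i=1}^{k} f_i x_i\right)^2}{\sum_{i=1}^{k} f_i}\right)}$$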
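As a hedged check that the two forms of the grouped sample variance agree, the sketch below evaluates both on the midpoints and frequencies from the table in Example 2.9. The variable names are illustrative.

```python
from math import isclose, sqrt

# Class midpoints and frequencies from the table in Example 2.9.
midpoints = [12.5, 37.5, 62.5, 87.5, 112.5, 137.5, 162.5, 187.5]
frequencies = [36, 24, 12, 9, 9, 5, 3, 2]

n = sum(frequencies)
sum_fx = sum(f * x for f, x in zip(frequencies, midpoints))
sum_fx2 = sum(f * x ** 2 for f, x in zip(frequencies, midpoints))
mean = sum_fx / n

# Definitional form: weighted squared deviations from the mean.
definitional = sum(f * (x - mean) ** 2 for f, x in zip(frequencies, midpoints)) / (n - 1)
# Computational form: uses only the sums of f*x and f*x^2.
shortcut = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)

print(definitional, shortcut, isclose(definitional, shortcut))
print(sqrt(shortcut))  # grouped sample standard deviation
```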
(1) range will decrease
If a constant k is added to every data point in a set of data, then the range,
IQR and standard deviation will not change.
If every data point in a set of data is multiplied by a constant k (k > 0), then
the range, IQR and the standard deviation will be k times the original
values.
NB: The last 3 notions are essential in ZIMSTAT work when the concept
of weighting is considered, say on the bread-basket composition and their
contribution in the computation of the Consumer Price Index, CPI.
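These properties are easy to confirm numerically. The sketch below assumes the second statement refers to multiplying every data point by a positive constant k; it uses the Harare heights and `statistics.stdev` from the Python standard library, and `value_range` is an illustrative helper.

```python
from statistics import stdev

def value_range(values):
    return max(values) - min(values)

harare = [160, 161, 162, 162, 163, 164, 165, 167, 171, 175]
k = 10

shifted = [x + k for x in harare]  # constant added to every data point
scaled = [x * k for x in harare]   # every data point multiplied by k (assumed case)

print(value_range(harare), value_range(shifted), value_range(scaled))
print(stdev(harare), stdev(shifted), stdev(scaled))
```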
Coefficient of Variation
For distributions having the same mean, the distribution with the largest
standard deviation has the greatest variation. But when considering distributions
with different means, decision-makers cannot compare the uncertainty
in the distributions only by comparing standard deviations. The coefficient of
variation is a measure used to compare the variability in one data set with
that in another in situations in which a direct comparison of standard deviations
is not convenient or realistic. For example, in a study of milk consumption
in USA, it is reported that the mean number of gallons of milk consumed
per family unit per week is 8 with a standard deviation of 3 gallons. A similar
study in Canada reports the mean consumption to be 12 litres with a sample
standard deviation of 4 litres. It makes no sense to compare these standard
deviations directly because they are reported in different units. The coefficient of
variation of a data set is simply the ratio of the standard deviation to the
mean expressed as a percentage.
$$\text{Coefficient of variation} = \frac{s}{\bar{x}} \times 100\%$$
NB: To interpret the coefficient of variation, we say that the data exhibit y%
relative variation from the mean, where y is the value obtained from the formula above.
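Applied to the milk consumption figures quoted above, the calculation looks like this (a minimal sketch; the function name is illustrative):

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean."""
    return std_dev / mean * 100

cv_usa = coefficient_of_variation(3, 8)      # figures reported in gallons
cv_canada = coefficient_of_variation(4, 12)  # figures reported in litres
print(cv_usa, round(cv_canada, 2))
```

Because the units cancel, the two percentages can be compared directly: the USA figures show slightly more relative variation, even though the Canadian standard deviation is the larger number.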
Measures of Shape
An important aspect of the description of a variable is the shape of its distribution, which
indicates the frequency of values from different ranges of the variable. One is
typically interested in knowing how well the distribution of the variable/data
can be approximated by the normal distribution.
Skewness
Skewness is a measure of the asymmetry of the distribution relative to the
normal distribution. The normal distribution is symmetrical about its mean, so
its skewness is equal to zero. A distribution with a significant positive skewness
has a long right tail, whilst a distribution with a significant negative
skewness has a long left tail. For a skewed distribution, the mean tends to lie
on the same side of the mode as the longer tail. Skewness is a dimensionless
quantity.
$$\text{skewness, } m_3 = \frac{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^3}{\left(\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^{3/2}}$$
Skewness can also be approximated using Pearson's coefficients of skewness:
$$m_3 \approx \frac{\text{mean} - \text{mode}}{\text{standard deviation}} \quad \text{or} \quad m_3 \approx \frac{3(\text{mean} - \text{median})}{\text{standard deviation}}$$
If $m_3 < 0$, then the data is negatively skewed.
If $m_3 = 0$, then the data is symmetrical.
If $m_3 > 0$, then the data is positively skewed.
Kurtosis is the degree of peakedness in a distribution, usually taken relative
to a normal distribution. The peakedness property means that there is an
excess frequency at the center of the distribution.
$$m_4 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^4}{(n - 1)s^4} - 3$$
Positive values of $m_4$ indicate longer and thicker tails than a normal distribution,
whereas negative values of $m_4$ indicate shorter and thinner tails. A
distribution with positive kurtosis is called Leptokurtic, and a distribution
with negative kurtosis is called Platykurtic. When $m_4 = 0$, the distribution
is called Mesokurtic.
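The moment-based formulas for skewness and kurtosis given in these notes can be evaluated directly. The sketch below is one possible reading of those formulas (with n − 1 divisors throughout), applied to the Harare heights; the function name `shape_measures` is an illustrative assumption.

```python
def shape_measures(values):
    """Return (m3, m4): skewness and excess kurtosis as defined in the notes."""
    n = len(values)
    mean = sum(values) / n
    s2 = sum((x - mean) ** 2 for x in values) / (n - 1)  # sample variance
    m3 = (sum((x - mean) ** 3 for x in values) / (n - 1)) / s2 ** 1.5
    m4 = sum((x - mean) ** 4 for x in values) / ((n - 1) * s2 ** 2) - 3
    return m3, m4

harare = [160, 161, 162, 162, 163, 164, 165, 167, 171, 175]
m3, m4 = shape_measures(harare)
print(m3, m4)
```

For these heights m3 is positive, reflecting the long right tail created by the two tallest players. Other textbooks and software packages use slightly different divisors, so values may not match exactly.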
Box and Whisker Diagram
This diagram was not mentioned when we discussed other graphical tech-
niques because it is constructed using some of the descriptive measures we
have just discussed as opposed to the former which used the raw data. A Box
and Whisker diagram or plot illustrates the spread and skewness of a data
set. It provides a graphical five-point summary of the set of data by showing
the quartiles and the extreme values of the data. It is useful in identifying
outliers i.e. unusually high or low values in a data set which can be due to
typographical errors.
The whiskers extend from the box to the maximum value or $Q_3 + \frac{3}{2}(IQR)$ and
to the minimum value or $Q_1 - \frac{3}{2}(IQR)$. If maximum value $< Q_3 + \frac{3}{2}(IQR)$, then
the whisker extends to the maximum value; otherwise it extends to $Q_3 + \frac{3}{2}(IQR)$.
Values greater than $Q_3 + \frac{3}{2}(IQR)$ are indicated by * and this is an indication
of outliers. If minimum value $> Q_1 - \frac{3}{2}(IQR)$, then the whisker extends
to the minimum value; otherwise it extends to $Q_1 - \frac{3}{2}(IQR)$. Values less than
$Q_1 - \frac{3}{2}(IQR)$ are indicated by * and this is an indication of outliers.
Below is the box and whisker plot of the Harare players' heights.
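The quantities needed to draw the plot for the Harare heights can be computed as follows. This is a sketch under the conventions of these notes: quartiles by the (n + 1) interpolation method and whisker limits at 3/2 of the IQR beyond the quartiles. The helper `quartile` and the variable names are illustrative.

```python
def quartile(ordered, fraction):
    """Observation at position fraction * (n + 1) in sorted data, interpolated.

    Assumes the position falls strictly inside the data (true for Q1 and Q3 here).
    """
    position = fraction * (len(ordered) + 1)
    a = int(position)
    b = position - a
    return ordered[a - 1] + b * (ordered[a] - ordered[a - 1])

harare = sorted([160, 161, 162, 162, 163, 164, 165, 167, 171, 175])

q1 = quartile(harare, 0.25)
q3 = quartile(harare, 0.75)
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Whiskers stop at the extreme values unless those lie beyond the limits.
whisker_low = max(min(harare), lower_limit)
whisker_high = min(max(harare), upper_limit)
outliers = [x for x in harare if x < lower_limit or x > upper_limit]

print(q1, q3, iqr)
print(whisker_low, whisker_high, outliers)
```

For this team both extreme heights lie inside the limits, so the whiskers run to the minimum and maximum and no values are flagged as outliers.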
Applied Statistics I, STAT102 Assignment 2
A1 Given data on heights of 100 randomly selected Certificate of Applied
Statistics students Compute the following statistic;
A2 Using the data on the two teams Harare and Chinhoyi in the notes on
measures of dispersion, calculate for each team the;