
3

Measures of Central Tendency

INTRODUCTION

Measures of central tendency are also referred to as “averages”. They indicate where the
center, the middle property, or the most typical value of a data set lies.

Why is the concept of measures of central tendency important? Because when data have been
collected, they have to be put into a form that will make it possible to summarize and interpret
them easily. The concept is also important because it is one of the first statistics computed for a
set of data.

An average is a single figure that stands for or represents a group of figures. For example, the
average contribution made to a certain fund drive gives, to some extent, an indication of the
amount paid by each contributor. We can also speak of the average maximum or minimum
temperature in April versus that in December.

In this module, we shall study three measures of central tendency namely, the mean, median,
and mode.

OBJECTIVES

At the end of this module study, you will be able to:

1. Explain the characteristics of the different measures of central tendency, namely:
   1.1 mean
   1.2 median
   1.3 mode
   1.4 other position measures
2. Determine the mean, median, and mode of both ungrouped and grouped data.
3. Apply the correct concept of measures of central tendency.

3.1 ARITHMETIC MEAN

We shall take up in this section the arithmetic mean, or computed average, of a distribution. The
mean is the sum of all the values in a data set divided by the total number of values. The
symbol for the mean of a population is μ, while the symbol for the mean of a sample is X̄.

3.1.1 Computation of the Mean for Ungrouped Data

The mean is determined by adding the scores and dividing the sum by the total number of
scores. Symbolically, this is written as:

$$\bar{X} = \frac{\sum X}{n} = \frac{\text{sum of the scores}}{\text{number of scores in the data set}}$$

where X̄ = mean
      X = value of each item
      n = number of items
      ∑ = "the sum of"

For a population, the mean is computed as:

$$\mu = \frac{\sum X}{N}$$

where μ = arithmetic mean of a population
      X = value of each item
      N = number of items in the population
      ∑ = "the sum of"

For example, given a sample of scores 3, 5, 7, 9, compute for the mean.

$$\bar{X} = \frac{\sum X}{n} = \frac{3 + 5 + 7 + 9}{4} = \frac{24}{4} = 6$$
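The same computation can be sketched in a few lines of Python. This is only an illustration of the formula above; the variable names are our own, and `statistics.mean` is the standard-library function for the arithmetic mean.

```python
from statistics import mean

scores = [3, 5, 7, 9]  # the sample of scores from the example above

# Mean = sum of the scores divided by the number of scores
x_bar = sum(scores) / len(scores)

print(x_bar)         # 6.0
print(mean(scores))  # the standard library gives the same value
```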

3.1.2 The Mean from an Ungrouped Frequency Distribution

When your data are arranged in an ungrouped frequency distribution table, the mean
is computed as:

$$\bar{X} = \frac{\sum fX}{N}$$

where f = frequency associated with each value of the variable X
      X = value of the variable
      N = total number of cases

In words, each value of X is multiplied by its frequency of occurrence, these products are then
summed, and the sum is divided by the total number of values in the distribution.
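As a minimal sketch of this procedure, the snippet below applies the formula to a small frequency table. The values and frequencies are hypothetical, chosen only for illustration; they are not taken from this module.

```python
# Hypothetical ungrouped frequency distribution (not from the module):
# each key is a value X, each entry is its frequency f.
freq = {10: 2, 20: 3, 30: 5}

n_cases = sum(freq.values())                  # N = sum of f
sum_fx = sum(x * f for x, f in freq.items())  # sum of fX
mean_fx = sum_fx / n_cases                    # mean = (sum of fX) / N

print(n_cases, sum_fx, mean_fx)  # 10 230 23.0
```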

To illustrate, look at Table 3.1, which shows the weight measurements of newborns in the
nursery of the Philippine General Hospital (PGH).

Table 3.1 Frequency Distribution of Weight Measurements (in gm) of Newborns in the Nursery of
PGH

X (weight)    f    fX

3500          1    3500
3400          1    3400
3350          2    6700
3300          2    6600
3250          1    3250
3200          1    3200

To compute for the mean, first multiply each frequency by the corresponding X value,
then get ∑fX and divide it by N.

$$\bar{X} = \frac{\sum fX}{N} = \frac{58400}{20} = 2920 \text{ gm}$$

Based on the computation, we can say that the mean weight of the 20 newborns confined in
the nursery of PGH on March 20, 1999 is 2,920 gm.

Let us pause for some exercises

SAQ 3-1

Go back to the scores obtained by Bertha Pila on her statistics exam.

1. What is the mean of the score?


2. The mean passing mark is 40. Did Bertha pass statistics? How far is her average from
the mean passing mark?

SAQ 3-2

Consider the daily earnings of the employees of a buy and sell firm:
210, 210, 850, 360, 310, 210, 210, 960, 210

1. Construct an ungrouped frequency distribution table which includes at least information
   on X and f.
2. Find the mean daily earnings of the employees of the buy and sell firm.

SAQ 3-3

Last summer, six salesmen in a heating and air-conditioning firm sold the following number of
air-conditioning units: 16, 9, 11, 6, 10, and 8. Find the average number of units sold and show
your solution.

3.1.3 Computation of the Mean for Grouped Data

When the number of cases becomes large, the computation of the mean may become tedious.
It is then useful to group the data into categories and to compute the mean from the resulting
frequency distribution. Sometimes, we find data already given to us in grouped form, and it will
be either impossible or impractical to go back to the original data for purposes of computation.
Census data, for example, are usually given in grouped form. We only know that there are a
number of persons aged 0 to 4 or 5 to 9, but the exact age of each individual is unknown.

In computing the mean for a grouped frequency distribution, the formula to use is:

$$\bar{X} = \frac{\sum fm}{N}$$

where f = frequency associated with each class interval
      m = midpoint of each class interval (i.e., the middle value in each class
          interval)
      N = total number of cases

The mean computed by this formula is only an estimate, since an exact mean cannot be
computed from a grouped distribution: every case in an interval is treated as if it fell at the
midpoint. Generally, the wider the interval, the more error we can expect in estimating the
mean from a grouped frequency distribution.

Table 3.2 shows the age distribution of patients at St. Henri Hospital.

Table 3.2 Age Distribution of Patients at St. Henri Hospital

Age      Frequency (f)    Midpoint (m)    fm

60-64         14               62          868
55-59         17               57          969
50-54         21               52         1092
45-49         19               47          893
40-44         13               42          546
35-39         30               37         1110
30-34         26               32          832
25-29         15               27          405
20-24         31               22          682
15-19         27               17          459
10-14         29               12          348
5-9           24                7          168
0-4           34                2           68

∑f = N = 300                          ∑fm = 8440

To compute for the mean:

(1) Determine the midpoint of each class interval. For example, for 60-64:

$$m = \frac{60 + 64}{2} = 62$$

(2) Multiply each frequency by the corresponding midpoint.
(3) Compute for ∑f and ∑fm.
(4) Get X̄:

$$\bar{X} = \frac{\sum fm}{N} = \frac{8440}{300} = 28.13 \text{ or } 28$$

Therefore, the average age of the patients admitted at St. Henri Hospital is 28 years.
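The four steps can be sketched in Python using the class intervals and frequencies of Table 3.2. The variable names are our own; the midpoint of each interval is taken as the average of its lower and upper limits, as in step (1).

```python
# Class intervals from Table 3.2 as (lower limit, upper limit, frequency f)
age_table = [
    (60, 64, 14), (55, 59, 17), (50, 54, 21), (45, 49, 19), (40, 44, 13),
    (35, 39, 30), (30, 34, 26), (25, 29, 15), (20, 24, 31), (15, 19, 27),
    (10, 14, 29), (5, 9, 24), (0, 4, 34),
]

n_cases = sum(f for _, _, f in age_table)                    # N = sum of f
sum_fm = sum(f * (lo + hi) / 2 for lo, hi, f in age_table)   # m = (lo + hi) / 2
grouped_mean = sum_fm / n_cases                              # mean = (sum of fm) / N

print(n_cases, sum_fm, round(grouped_mean, 2))  # 300 8440.0 28.13
```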

You must now be familiar with the arithmetic mean. Let's do the following SAQ.

SAQ 3-4

Mr. Allan Gali, a fabric store manager, is eager to see if the latest patterns for size 12 dresses show
a longer hemline than last year's. If this is so, he can expect to sell more fabric, since each
pattern will call for more material. He took a random sample of ten dress patterns and
measured in inches the length of each pattern from the neckline to the hemline. The ten dress
patterns have the following lengths:

41.5 42 39 44 43.5 45 43 45 42 46

1. Compute for n, ∑X, and X̄.
2. Last year, the mean length of size 12 dresses was 36 inches. Can Mr. Gali expect to sell
   more material per dress this year?

The mean has one major disadvantage; its value can be strongly influenced by extreme scores.
Specifically, the mean is pulled toward the outliers in an exaggerated fashion. When there are
extreme scores in a distribution, it is best to use other measures of central tendency, such as
the median.

3.2 MEDIAN
Let us now study the second indicator of central tendency, the median. The median of a
quantitative data set is the middle number of that set when the measurements are arranged in
ascending or descending order. Once the scores are ordered, the median is determined by
simply counting the scores until you reach the middle value. Therefore, the median (Md) is at the
(N + 1)/2 th position in a given data set, counting from either the top or the bottom of the ordered scores.

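1. If N is odd, the median is exactly the middle number.
2. If N is even, the median is the average of the middle two numbers.

Examples:

N = odd number: the scores 7, 4, 5, 6, 2, arranged in order, are 2, 4, 5, 6, 7.
The middle score is 5, so Md = 5.

N = even number: the scores 8, 4, 6, 7, arranged in order, are 4, 6, 7, 8.
The middle two scores are 6 and 7, so

$$Md = \frac{6 + 7}{2} = 6.5$$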

The median is a positional measure because it depends on the position of the items in the
ordered distribution rather than on the value of every item. It is not influenced by extreme
values: neither the highest nor the lowest value in the distribution enters into the computation
of the median. Thus, when there are extreme values in a distribution, it is better to compute for
the median rather than for the mean. This is the advantage of the median over the mean.
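The two rules for odd and even N can be sketched as a small Python function. The function name is our own. The first list reuses the scores 3, 5, 7, 9 from Section 3.1.1 in scrambled order; the extra value 40 in the second list is a hypothetical extreme score added only to show that the median depends on position, not on how large the extreme value is.

```python
def median(values):
    """Return the median: order the scores, then take the middle one(s)."""
    ordered = sorted(values)          # the scores must be arranged in order first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # N is odd: exactly the middle number
        return ordered[mid]
    # N is even: average of the middle two numbers
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([9, 3, 7, 5]))      # 6.0
print(median([9, 3, 7, 5, 40]))  # 7
```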

In computing the median for a grouped frequency distribution, there is a need to interpolate to
find the exact position of the median. The information needed to determine the median for a
grouped frequency distribution consists of the frequencies and cumulative frequencies.

Table 3.3 shows the age distribution of patients at St. Henri Hospital and their corresponding
frequencies and cumulative frequencies.

Table 3.3 Age Distribution of Patients at St. Henri Hospital

Age      Frequency (f)    Cumulative Frequency (cf)

60-64         14                  300
55-59         17                  286
50-54         21                  269
45-49         19                  248
40-44         13                  229
35-39         30                  216
30-34         26                  186
25-29         15                  160
20-24         31                  145
15-19         27                  114
10-14         29                   87
5-9           24                   58
0-4           34                   34

∑f = N = 300

To determine the median of this distribution, use the following formula:

$$Md = L + \left[\frac{N/2 - cf}{f}\right] i$$

where cf = cumulative frequency of the class interval below the class
           interval containing the median
      f = frequency of the interval containing the median
      L = lower exact limit of the interval containing the median
      i = width of the interval containing the median
      N = total number of scores or ∑f

It is important that we first list the cumulative frequencies and then locate the interval
containing the middle value, or (N/2)th case. Thus, for Table 3.3, 300 divided
by 2 is 150, so we are looking for the interval containing the 150th case. Now, under the column for
cf, look for the first value that is at least 150 (here, 160). The corresponding interval
is 25-29, so L = 24.5, cf = 145, f = 15, and i = 5. Then apply the formula.

$$Md = 24.5 + \left[\frac{150 - 145}{15}\right] 5$$

$$Md = 24.5 + \left[\frac{5}{15}\right] 5$$

$$Md = 24.5 + 1.67$$

$$Md = 26.17 \text{ or } 26$$

Thus, the median age of the patients at St. Henri Hospital is 26, which means that 50% of the
patients are 26 years old or younger.
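The interpolation can be sketched in Python with the frequencies of Table 3.3. The function name and the table layout are our own assumptions: the rows are listed from the lowest interval upward, class limits are whole numbers (so the lower exact limit is the lower limit minus 0.5), and the median interval is the first one whose cumulative frequency reaches N/2.

```python
def grouped_median(table):
    """table: rows of (lower limit, upper limit, f), lowest interval first."""
    n = sum(f for _, _, f in table)
    target = n / 2                      # position of the middle case, N/2
    cf_below = 0                        # cumulative frequency below the current interval
    for lower, upper, f in table:
        if cf_below + f >= target:      # this interval contains the median
            exact_lower = lower - 0.5   # L, the lower exact limit
            width = upper - lower + 1   # i, the interval width
            return exact_lower + (target - cf_below) * width / f
        cf_below += f

ages = [
    (0, 4, 34), (5, 9, 24), (10, 14, 29), (15, 19, 27), (20, 24, 31),
    (25, 29, 15), (30, 34, 26), (35, 39, 30), (40, 44, 13), (45, 49, 19),
    (50, 54, 21), (55, 59, 17), (60, 64, 14),
]

print(round(grouped_median(ages), 2))  # 26.17
```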

SAQ 3-5

Consider the following sample of measurements:

5 7 4 5 20 6 2

1. Calculate the median (Md) of this sample.
2. Eliminate the last measurement (the 2) and calculate the median of the remaining
   measurements.
3. Is the median affected by the measurement 20? Why?

In certain situations, the median may be a better measure of central tendency than the mean.
In particular, the median is less sensitive than the mean to extremely large or small
measurements, as shown in SAQ 3-5.

3.3 MODE

The third measure of central tendency is the mode (Mo). By definition, the mode is the
measurement that occurs most frequently in a data set. In an ungrouped frequency distribution, it is
easily identified by merely looking for the score or item which occurs most frequently.

In the case of grouped frequency distributions, the mode may be estimated as the midpoint of
the class interval with the highest frequency. However, this value is only an estimate of the
true mode of the distribution. The true mode cannot be computed from grouped data because
information is lost when scores are combined into class intervals.

Let us look for the mode of the following:

Example 1:

3 4 7 7 7 8 11 11 14 18 19

In this set of numbers, the mode is 7 because it is the most frequently occurring number.

Example 2:

What is the mode of the following values?

6 6 6 9 9 9 9 12 12 12
12 12 12 15 15 15 15 15 15 21
21 21 35 35

Mode = 12, 15

We have two modes for this set because both 12 and 15 occur 6 times.

The most frequently occurring score is usually somewhere near the center of a distribution.
When this happens, we can say that the mode is a legitimate index of central tendency.
Experience shows that the mode sometimes does not occur near the center of a distribution
and hence we cannot rely on it to accurately reflect the center of a set of scores. This makes the
mode an unreliable measure of central tendency.

Furthermore, there is no mode when all scores occur with equal frequency, as in the
following:

4 5 6 7 8 9

When there are two modes, the distribution is described as bimodal, and when there are more
than two modes, the distribution is multimodal.
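The three situations described above (one mode, two modes, and no mode) can be sketched with a small Python function. The function name is our own, and returning an empty list when every score occurs equally often is our way of encoding "no mode"; `collections.Counter` is the standard-library tally class.

```python
from collections import Counter

def modes(values):
    """Return the list of modes; empty when all values are equally frequent."""
    counts = Counter(values)                  # frequency of each distinct value
    highest = max(counts.values())
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []                             # every value occurs equally often: no mode
    return sorted(v for v, c in counts.items() if c == highest)

example_1 = [3, 4, 7, 7, 7, 8, 11, 11, 14, 18, 19]
example_2 = [6] * 3 + [9] * 4 + [12] * 6 + [15] * 6 + [21] * 3 + [35] * 2

print(modes(example_1))           # [7]
print(modes(example_2))           # [12, 15]  (bimodal)
print(modes([4, 5, 6, 7, 8, 9]))  # []        (no mode)
```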

Modes are especially useful to designers, salesmen, business people, producers, merchants,
and others who are in the business of selling products at specific outlets or markets. These
individuals are interested in knowing the most frequently bought sizes of shirts or shoes, or the most
frequently bought flavor of drinks or biscuits. Such information guides people in planning and making
decisions for the production of such frequently bought commodities.

Let us have some exercises to apply your knowledge of the three measures of central tendency.

SAQ 3-6
The College of Nursing, UP Manila makes a report to the Finance and Scholarship Committee
about the average credit hour load a full-time student takes. A 12-credit hour load is the
minimum requirement for full-time status. For the same tuition, students may take up to 21
credit hours. A random sample of 40 students yielded the following information in credit hours.

17 12 14 17 13 16 18 20 13 12
12 17 16 15 14 12 12 13 17 14
15 12 15 16 12 18 20 19 12 15
18 14 16 17 15 19 12 13 12 15

1. Create an ungrouped frequency distribution table.
2. What is the mode of this distribution? Is it different from the mean and the median?
3. If the Finance and Scholarship Committee is going to fund the College according to
   the average student credit hour load, which of the three measures of central tendency
   do you think the Committee should use, and why?

SAQ 3-7

The faculty of the College of Medicine had registered the following weights in kilograms in
March 1999.
74 82 78 72 78 73 78 73 78 72 78 81

Find the mean, median and mode.

Did you get it right? You will find that constant reading and faithfully doing the exercises will
help you come to like statistics.

Here is another activity.

SAQ 3-8

A random sample of 12 people gave their opinions about making age 50 a compulsory
retirement age for government employees. Opinions were given on a scale of 1-10, where 1 =
strongly disagree and 10 = strongly agree. Here is the result of the survey.

3 1 3 2 3 3 5 5 3 4 4 1

1. What is the mean?
2. What is the median?
3. What is the mode?
4. What is the interpretation?

3.4 COMPARISON OF MEAN, MEDIAN, AND MODE

Which among the three measures of central tendency is the best? To some extent, the answer
depends on the scale used to measure the variable. If the data are nominal, only the mode is
appropriate. If the data are ordinal, both the median and the mode may be appropriate. All
three measures of central tendency may be used if the data are either interval or ratio.

Characteristics of the mean

Of the three measures of central tendency, the mean is by far the most commonly
used because it is a stable measure to use when sample data are used to make inferences
about populations. It is the point that balances all the values on either side. A notable feature of the
mean from an applied standpoint is that it is strongly influenced by extreme scores. Specifically,
the mean is pulled toward the extreme scores in an exaggerated fashion. This sensitivity also
makes the mean an inappropriate measure of central tendency when the distribution contains
open-ended intervals, in the absence of additional information. When the distribution is
symmetrical, the mean is the best measure, and it is a useful measure for inferential statistics.

In choosing the most appropriate measure of central tendency, we should also consider how
the measure is to be used. If we wish to infer from samples to populations, the mean usually
has a distinct advantage. The mean can be manipulated mathematically in ways that are
inappropriate to the median or the mode. But if the purpose is primarily descriptive, then the
measure that best describes the data should be used.

Let us compare the properties of the mean and the median. The mean uses more information
than the median in the sense that all exact scores are used in computing the
mean, whereas the median only uses the relative position of the scores. Another important
difference is that the mean is affected by extreme values whereas the median is not.
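A short Python sketch makes the contrast concrete. The scores below are hypothetical, chosen only for illustration; `statistics.mean` and `statistics.median` are standard-library functions.

```python
from statistics import mean, median

typical = [20, 22, 25, 27, 31]   # hypothetical scores with no extreme value
with_extreme = typical + [500]   # the same scores plus one extreme value

print(mean(typical), median(typical))                       # 25 25
print(round(mean(with_extreme), 2), median(with_extreme))   # 104.17 26.0
```

One extreme score pulls the mean from 25 to about 104, while the median moves only from 25 to 26.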

The important difference between the mean and the median enables us to decide in most
instances which will be the more appropriate. Ordinarily, we want our measure to make use of
all the information available. We somehow have more intuitive faith in such a measure. Although at
this point it is impossible to bolster our faith with a sound statistical argument, some
justification for preferring the mean under ordinary circumstances can be given. It turns
out that the mean is generally a more stable measure than the median in the sense that it
varies less from sample to sample. When we turn our attention to inferential statistics, we shall
see that we are ordinarily much more interested in generalizing about a population than we are
in a particular sample. We are well aware that, had another sample been taken, the
results would not have been quite the same. Had a very large number of samples of the same
size been drawn, we would be able to see just how much the sample means differed among
themselves. What we are saying is that the sample medians will differ from one sample to the next more
than will the means. Since, in actuality, we usually draw only one sample, it is important to
know that the measure we use will give reliable results, in that there will be minimal variability
from one sample to the next. We can therefore state the following rule of thumb: When in
doubt, use the mean in preference to the median.

Because the mean uses all the data, whereas the median does not depend upon
the extreme values, the mean may give very misleading results under some circumstances. You
must keep in mind that in making use of a measure of central tendency, you are attempting to obtain a
simple description of what is typical of your scores. Thus, whenever a distribution is highly
skewed, i.e., whenever there are considerably more extreme cases in one direction than in the
other, the median will generally be more appropriate than the mean.

Another difference is that the computation of the mean requires an interval or ratio scale.
Without an interval or ratio scale, it would be meaningless, of course, to talk about summing
scores. The median, on the other hand, can be used for ordinal as well as interval or ratio scales.
The actual numerical score of the median will be meaningless unless we have an interval or
ratio scale, but it will certainly be possible to locate the middle score.

3.5 SHAPE OF DISTRIBUTION AND MEASURES OF CENTRAL TENDENCY

3.5.1 Symmetrical Distribution

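3.5.1.1 Normal

The mean, median, and mode fall on the same value under the normal curve.

[Figure: normal curve, frequency (f) on the vertical axis, with X̄, Md, and Mo at the same point]

3.5.1.2 Bimodal

[Figure: symmetrical bimodal curve, with Mo, X̄, and Md marked on the X axis]

3.5.1.3 Rectangular

[Figure: rectangular distribution, frequency (f) on the vertical axis, with X̄, Md, and Mo marked]

3.5.2 Skewed Distribution

3.5.2.1 Positively skewed

[Figure: positively skewed curve, frequency (f) against X, with Mo, Md, and X̄ marked from left to right]

3.5.2.2 Negatively skewed

[Figure: negatively skewed curve, frequency (f) on the vertical axis]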

SUMMARY

This module has shown us the three measures of central tendency, namely the mean, the
median, and the mode. In the next module, we shall study the measures of dispersion or
variability. The exercises incorporated in the text will guide you in developing your skills. Keep up
this interest, and thank you again.

4

Measures of Dispersion or Variability

INTRODUCTION

Module 3 discussed the measures of central tendency. Through those measures, a given set of
data could be described indicating the points where the items are centrally located. In terms of
distribution, however, we do not know how far or how close the data are to each other. We
need to know further how the observations spread out from the average. We need a statistical
cross reference.

This cross-reference should be a measure of the variance, or spread of the data. Descriptive
measures that are used to indicate the amount of variation in a data set are called measures of
dispersion, or variability or spread. When descriptive statistics are presented, there is usually at
least one measure of central tendency and at least one measure of variability.

A measure of dispersion or variability is a supplement to a measure of central tendency, giving
meaning to the measure of central tendency. The measures of dispersion or variability indicate
the nature or degree of clustering. The more concentrated the values are about the mean or
average, the more meaningful is the average as a measure of location. There is low variability
if the scores tend to crowd around the same point. On the other hand, if the scores are widely
scattered, the data indicate high variability.

In this module, we will study three measures of dispersion or variability: the range, the
semi-interquartile range, and the standard deviation. Furthermore, this module will also discuss
z-scores or standard scores. Take your time and, with a relaxed mind, study this module well.

OBJECTIVES

At the end of this module study, you will be able to:

1. Determine the variability of scores in terms of:
   1.1 range
   1.2 semi-interquartile range
   1.3 standard deviation
2. Standardize scores.
3. Interpret the computation or results obtained.

4.1 THE RANGE

The range is the simplest index of variability. Described as the distance between the highest
score and the lowest score in the distribution, the range is the least stable measure because it is not just
influenced by extreme scores; it is completely determined by these scores.

The range R is computed as:

R = highest score − lowest score + 1

Example 1: Given the following test scores, compute for the range.

85 79 86 84 92 97

R = (97 − 79) + 1 = 19

For a grouped distribution, the range is the difference between the midpoints of the extreme
categories plus one. See Example 2, Table 4.1, for illustration.

Table 4.1 Monthly Salary of Health Personnel at Our Father Hospital

Salary Interval    Frequency (f)    Midpoint (m)    Cumulative Frequency (cf)

P7001-8000               6          P7500.50              110
P6001-7000              11          P6500.50              104
P5001-6000              42          P5500.50               93
P4001-5000              21          P4500.50               51
P3001-4000              17          P3500.50               30
P2001-3000              13          P2500.50               13

N = 110
To compute for the range, first get the midpoint of the lowest interval (P2001-3000) and the
midpoint of the highest interval (P7001-8000).
Then, R = P7500.50 − P2500.50 + 1
= P5001
Simple, isn't it?
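Both range computations can be sketched in Python. The function name is our own; it simply applies the formula R = highest − lowest + 1 given above, first to the test scores of Example 1 and then to the interval midpoints of Table 4.1.

```python
def inclusive_range(values):
    """Range as defined in this module: highest - lowest + 1."""
    return max(values) - min(values) + 1

scores = [85, 79, 86, 84, 92, 97]   # test scores from Example 1
midpoints = [7500.50, 6500.50, 5500.50, 4500.50, 3500.50, 2500.50]  # Table 4.1

print(inclusive_range(scores))     # 19
print(inclusive_range(midpoints))  # 5001.0
```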

The extreme simplicity of the range as a measure of dispersion is both an advantage and a
disadvantage. The range may prove useful if it is desirable to obtain some very quick calculation
that can give a rough indication of dispersion, or if computations must be made by persons
unacquainted with statistics. If the data are to be presented to a relatively unsophisticated
audience, the range may be the only measure of dispersion that will be readily interpreted. The
disadvantage of the range is obvious. It is based on only two cases: the two extreme cases.
Since extremes are likely to be the rare or unusual cases in most empirical problems, it should
be recognized that it is usually a matter of chance if one happens to get one or two extreme
values in a sample.

Suppose, for example, that there is one millionaire in the community sampled. If we choose 10
persons at random, he or she will probably not be included. But suppose he or she is. The range
in income will then be extremely large and very misleading as a measure of dispersion. If you
use the range as your measure, you know nothing about the variability of scores between the
two extreme values except that the scores lie somewhere within the range. And, as implied in
the above example, the range will vary considerably from one sample to the next. Furthermore,
the range will ordinarily be greater for large samples than for small ones simply because in large
samples, you have a better chance of including the most extreme individuals. For these reasons,
the range is not ordinarily used in behavioral research except at the most exploratory levels.

4.2 THE SEMI-INTERQUARTILE RANGE

Recall the discussion of the median in Module 3. The median was described as the middle value
in a distribution, thus cutting the distribution in half. In a similar manner, we can also divide a
distribution into four equal parts. The values that divide a distribution into four equal parts are
called quartiles.

4.2.1 Quartiles

The first quartile (Q1) is the score that separates the lower 25% of the distribution from the rest.
The second quartile (Q2) is the score that has 50% of the distribution below it; Q2 is actually the
median of the distribution. Finally, the third quartile (Q3) separates the lower 75% of the
distribution from the rest. If you recall the discussion on percentile ranks in Module 2, each
percentile rank has a corresponding score, and this score is called the percentile. Thus, in terms
of quartiles, the 25th percentile (P25) is actually the first quartile (Q1), P50 = Q2 = Md, and P75 =
Q3. See Figure 4.1 for illustration.

[Figure: a distribution divided into four equal parts by Q1 (P25), Q2 (P50, the median Md), and Q3 (P75)]

Fig. 4.1 Parts in a Distribution

Keep in mind that quartiles are a natural extension of the median concept because they are
values which divide a set of data into equal parts. The difference lies in the fact that the median
divides the distribution into two parts, while the quartiles divide the distribution into four equal
parts. Since a quartile is an extension of the median, both basically use the same formula.

$$Q_3 = L + \left[\frac{3N/4 - cf}{f}\right] i$$

$$Q_2 = Md = L + \left[\frac{2N/4 - cf}{f}\right] i$$

$$Q_1 = L + \left[\frac{N/4 - cf}{f}\right] i$$

where L = lower exact limit of the interval that contains Q1, Q2, or Q3
      cf = cumulative frequency of the interval below the selected interval
      f = frequency of the selected interval
      i = width of the selected interval
      N = total number of cases in the distribution

4.2.1.1 Computation of the quartiles for ungrouped frequency distribution

For example, consider the first and third quartiles of the distribution.

Solving for Q1:

N = 40
N/4 = 40/4 = 10

Q1 = 77.5 + [(10 − 8)/3]
Q1 = 77.5 + 2/3
Q1 = 77.5 + 0.67
Q1 = 78.17

Solving for Q3:

N(3/4) = 40(3/4) = 30

Q3 = 86.5 + 1
Q3 = 87.5

Interpretation: Seventy-five percent of the 40 master's students have a score less than or equal
to 87.5; thus, 25% have scores greater than 87.5. Similarly, 25% of the students have a score
less than or equal to 78.17, while 75% scored higher than 78.17.

4.2.1.2 Computation of the quartiles for grouped frequency distribution

Given the distribution below, compute for the first and third quartiles.

Interval    f     %       cf    c%

95-99        3    7.5     40    100.0
90-94        5    12.5    37    92.5
85-89        8    20.0    32    80.0
80-84       11    27.5    24    60.0
75-79       10    25.0    13    32.5
70-74        3    7.5      3    7.5

Computation of Q1

N = 40, N/4 = 40/4 = 10

Q1 class = 75-79
L = 74.5
i = 5
cf = 3
f = 10

$$Q_1 = L + \left[\frac{N/4 - cf}{f}\right] i$$

$$Q_1 = 74.5 + \left[\frac{10 - 3}{10}\right] 5$$

$$Q_1 = 74.5 + 3.5$$

$$Q_1 = 78$$

Interpretation: Twenty-five percent of the students have a score of 78 and below.

Follow a similar procedure to compute for Q3.
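The quartile formula can be sketched as one Python function that takes the fraction of cases (1/4, 2/4, or 3/4) as a parameter. The function name and the table layout are our own assumptions: rows run from the lowest interval upward, class limits are whole numbers, and the selected interval is the first one whose cumulative frequency reaches the target position. The Q2 and Q3 values printed here are our own computations with that formula, not figures quoted from the text.

```python
def grouped_quantile(table, fraction):
    """table: rows of (lower limit, upper limit, f), lowest interval first."""
    n = sum(f for _, _, f in table)
    target = n * fraction               # e.g. N/4 for Q1, 3N/4 for Q3
    cf_below = 0                        # cf of the interval below the selected one
    for lower, upper, f in table:
        if cf_below + f >= target:      # selected interval found
            exact_lower = lower - 0.5   # L, the lower exact limit
            width = upper - lower + 1   # i, the interval width
            return exact_lower + (target - cf_below) * width / f
        cf_below += f

score_table = [
    (70, 74, 3), (75, 79, 10), (80, 84, 11),
    (85, 89, 8), (90, 94, 5), (95, 99, 3),
]

q1 = grouped_quantile(score_table, 1 / 4)
q2 = grouped_quantile(score_table, 2 / 4)
q3 = grouped_quantile(score_table, 3 / 4)
print(q1, round(q2, 2), q3)  # 78.0 82.68 88.25
```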

4.2.2 The Interquartile Range and the Semi-Interquartile Range

The distance between the first and the third quartiles is called the interquartile range (IQR).

IQR = Q3 − Q1

Half of the interquartile range is the semi-interquartile range, also called the quartile
deviation. It is a type of range, but instead of representing the difference between extreme values,
it is arbitrarily defined as half the distance between the first and third quartiles. The formula for
the semi-interquartile range is:

$$\text{Semi-interquartile range} = \frac{Q_3 - Q_1}{2}$$

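Using the data presented in Table 4.1, find the semi-interquartile range of the distribution.

With N = 110, N/4 = 27.5 falls in the interval P3001-4000 (L = 3000.50, cf = 13, f = 17, i = 1000),
and 3N/4 = 82.5 falls in the interval P5001-6000 (L = 5000.50, cf = 51, f = 42, i = 1000):

Q1 = 3000.50 + [(27.5 − 13)/17](1000) = P3853.44
Q3 = 5000.50 + [(82.5 − 51)/42](1000) = P5750.50

Semi-interquartile range = (5750.50 − 3853.44) / 2
= 1897.06 / 2
= P948.53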

Notice that the quartile deviation is one half of the range covered by the middle half of the
cases. Since Q1 and Q3 will vary less from sample to sample than the most extreme
cases, the quartile deviation is a far more stable measure than the range. But it does not take
advantage of all the information. We are not measuring the variability among the middle cases,
nor are we taking into consideration what is happening at the extremes of the distribution.
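As a worked illustration, the semi-interquartile range of the score distribution in Section 4.2.1.2 can be sketched as follows. Q1 = 78 is taken from the text. Q3 is not worked out there, so the value below is our own computation with the same formula, assuming the Q3 class is 85-89 (L = 84.5, cf = 24, f = 8, i = 5, with 3N/4 = 30).

```latex
\begin{align*}
Q_3 &= 84.5 + \left[\frac{30 - 24}{8}\right] 5 = 84.5 + 3.75 = 88.25 \\
\text{IQR} &= Q_3 - Q_1 = 88.25 - 78 = 10.25 \\
\text{Semi-interquartile range} &= \frac{10.25}{2} = 5.125
\end{align*}
```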

4.3 STANDARD DEVIATION AND VARIANCE

4.3.1 Variance

The variance as a measure of variability takes the mean as the reference point, taking into
account the deviations of the individual observations from the mean. Conceptually, the
variance is the average of the squared deviations from the mean.

In short, the variance can be stated as:

$$\text{Variance} = s^2 = \frac{\text{sum of the squared deviations}}{\text{number of observations}}$$

The formula of the variance for grouped frequency distributions can be presented as:

$$s^2 = \frac{\sum fm^2 - \dfrac{(\sum fm)^2}{n}}{n}$$

where f = frequency of each category
      m = midpoint of each category
      n = total number of observations

Before applying the formula to the distribution in Table 4.1, first compute m², fm, and fm²
for each category, as presented in Table 4.2. Note that ∑fm is squared only after the fm values
have been summed.

Table 4.2 Computation of Variance for Grouped Data

Salary Interval    f     m          m²               fm           fm²

P7001-8000          6    7500.50    56,257,500.25     45,003.0      337,545,001.50
P6001-7000         11    6500.50    42,256,500.25     71,505.5      464,821,502.75
P5001-6000         42    5500.50    30,255,500.25    231,021.0    1,270,731,010.50
P4001-5000         21    4500.50    20,254,500.25     94,510.5      425,344,505.25
P3001-4000         17    3500.50    12,253,500.25     59,508.5      208,309,504.25
P2001-3000         13    2500.50     6,252,500.25     32,506.5       81,282,503.25

n = 110                             ∑fm = 534,055.0          ∑fm² = 2,788,034,027.50

Apply the formula using the following substitutions: ∑fm² = 2,788,034,027.50; ∑fm = 534,055;
and n = 110.

Hence,

$$s^2 = \frac{2{,}788{,}034{,}027.50 - \dfrac{(534{,}055)^2}{110}}{110}$$

$$s^2 = \frac{2{,}788{,}034{,}027.50 - 2{,}592{,}861{,}300.23}{110}$$

$$s^2 = \frac{195{,}172{,}727.27}{110}$$

$$s^2 = 1{,}774{,}297.52$$

It can therefore be concluded that the average of the squared deviations from the mean is
1,774,297.52 (in squared pesos).

Meanwhile, the formula for the variance for ungrouped data is:

$$s^2 = \frac{\sum (X - \bar{X})^2}{n}$$

where X = values of the variable X
      X̄ = the mean
      n = number of cases
      ∑(X − X̄)² = sum of the squared deviations

4.3.1.1 Sum of Squares

The sum of the squared deviations can be shortened to the sum of squares and can be
symbolized as SS. Thus,

$$SS = \sum fm^2 - \frac{(\sum fm)^2}{n} \quad \text{for grouped data}$$

$$SS = \sum (X - \bar{X})^2 \quad \text{for ungrouped data}$$

As you may notice, SS is the numerator in the variance formula. Thus,

$$s^2 = \frac{SS}{n}$$

with n − 1 in place of n for the corrected variance discussed in Section 4.3.1.2.

For the data presented in Table 4.2 and the computation of the variance discussed previously, there
is no need to discuss separately how to compute for the sum of squares,
since one cannot come up with the variance without computing for the sum of squares, right?
So, if you still recall, what did you get as the sum of squares for the monthly salary of health
personnel at Our Father Hospital? Well, as presented earlier, the SS value, or the sum of the
squared deviations of the scores, was 195,172,727.27.

The sum of squares has one great advantage over the variance: sums of squares can be treated algebraically,
added to and subtracted from one another. This is particularly useful in the analysis of variance, in
which you try to break down the total variability of a set of data into various types and
sources of variability. This may sound confusing to you, but don't worry; the analysis of variance
will be discussed in Module 12.

Going back to the variance as a descriptive statistic for variability, the variance changes in value as a
function of the amount of variability seen in the data. When all scores are identical (and thus
fall exactly at the mean), such that there is no variability, s² = 0. As scores become more and
more dispersed around the mean, this increased variability will be reflected in the s² value.
The variance of a sample of scores is represented by the symbol s². The variance of a population,
as opposed to that of a sample, is represented by the lowercase Greek letter sigma, squared: σ².

4.3.1.2 Estimating the Population Variance

The variance computed according to the equation above is a sample variance. It tells us the
average squared deviation of scores around the mean in a sample drawn from some larger
population. As an estimate of the population variance, however, it is biased: s² will usually be
somewhat smaller than σ². It is easy enough to understand
why this is the case. A population consists of more cases than are found in a sample drawn
from that population, and it is likely that any given sample will not include some of the more
deviant cases that are included in the population. These extreme cases add to the population's
variance but, not being included in the sample, do not influence the sample variance.

There are occasions when we wish to estimate a population variance from a sample drawn from
the population, but we know that s² tends to give a low estimate of σ². To give an unbiased
estimate (corrected variance) of σ², the formula we should use is:

$$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1} \quad \text{for ungrouped data}$$

$$s^2 = \frac{\sum fm^2 - \left(\sum fm\right)^2 / n}{n - 1} \quad \text{for grouped data}$$

Where X = values of the variable
X̄ = the mean of X
n = number of cases
f = frequency of a class interval
m = midpoint of a class interval
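
The two corrected-variance formulas can be sketched in Python as follows. This is only a minimal
illustration: the function names, the scores, and the midpoints and frequencies are hypothetical
and do not come from the module's data.

```python
def sample_variance(scores):
    """Corrected variance for ungrouped data: sum((X - mean)^2) / (n - 1)."""
    n = len(scores)
    mean = sum(scores) / n
    return sum((x - mean) ** 2 for x in scores) / (n - 1)


def grouped_sample_variance(midpoints, frequencies):
    """Corrected variance for grouped data: (sum(fm^2) - (sum(fm))^2 / n) / (n - 1)."""
    n = sum(frequencies)
    sum_fm = sum(f * m for f, m in zip(frequencies, midpoints))
    sum_fm2 = sum(f * m ** 2 for f, m in zip(frequencies, midpoints))
    return (sum_fm2 - sum_fm ** 2 / n) / (n - 1)


# Hypothetical ungrouped scores with a mean of 5.
ungrouped = sample_variance([2, 4, 4, 6, 9])

# Hypothetical class midpoints and their frequencies (n = 10).
grouped = grouped_sample_variance([12, 17, 22], [2, 5, 3])

print(ungrouped, round(grouped, 2))
```

Both functions divide by n - 1, so they return the unbiased estimate rather than the
uncorrected sample variance.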

As you can see, this unbiased estimate is inflated slightly relative to the biased estimate by
using a denominator of n − 1 rather than n. This inflation brings the unbiased estimate closer into
line with the larger population variance.
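
One way to see the effect of the n − 1 denominator is to list every possible sample from a tiny
population and average the two estimates. The sketch below does this under stated assumptions: the
population of five values is hypothetical, samples are of size 2, and sampling is done with
replacement so that every ordered pair is equally likely.

```python
from itertools import product

population = [1, 2, 3, 4, 5]  # hypothetical population
n = 2                         # sample size


def variance(values, denominator):
    """Sum of squared deviations from the mean, divided by the given denominator."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / denominator


# Population variance divides by the population size N.
population_variance = variance(population, len(population))

# All 25 ordered samples of size 2 drawn with replacement.
samples = list(product(population, repeat=n))

# Average of each estimate over every possible sample.
mean_biased = sum(variance(s, n) for s in samples) / len(samples)
mean_unbiased = sum(variance(s, n - 1) for s in samples) / len(samples)

print(population_variance, mean_biased, mean_unbiased)
```

Averaged over all samples, the estimate that divides by n falls below the population variance,
while the estimate that divides by n − 1 matches it.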

Large samples show little difference between the biased and unbiased estimates because the
difference between n and n − 1 is negligible when n is large. On the other hand, when n is
small, the difference between n and n − 1 is proportionally much larger, and the difference
between the biased and unbiased estimates becomes quite noticeable. Thus, from now on, we will
use the unbiased formula instead of the biased one. In fact, most statistical software packages
compute the corrected variance by default.
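
To make the size of the correction concrete, note that for the same sum of squares the unbiased
estimate equals the biased estimate multiplied by n / (n − 1). The short sketch below prints this
inflation for a few sample sizes chosen only for illustration.

```python
# Ratio of the unbiased to the biased estimate for the same sum of squares:
# (SS / (n - 1)) / (SS / n) = n / (n - 1).
sample_sizes = [5, 10, 100, 1000]
inflation = {n: n / (n - 1) for n in sample_sizes}

for n, ratio in inflation.items():
    print(f"n = {n:>4}: unbiased estimate is {100 * (ratio - 1):.2f}% larger")
```

With five cases the correction raises the estimate by a quarter, while with a thousand cases
the change is about a tenth of one percent.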

4.3.2 Standard Deviation

Having learned some measures of variability, you can now turn your attention to the most
useful and frequently used measure, the standard deviation. It is defined as the square root of
the arithmetic mean of the squared deviations from the mean.

When data have been grouped, you may simplify your work considerably by treating each case
as though it were at the midpoint of its interval, even though this is not actually the case. Of
course, this may introduce certain inaccuracies, but the saving in time will be substantial if
computations must be carried out by hand.

The basic formula for the standard deviation using grouped data is:

$$s = \sqrt{\frac{\sum fm^2 - \left(\sum fm\right)^2 / n}{n - 1}}$$
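
The inaccuracy introduced by grouping can be seen by computing the standard deviation both ways
on the same data. The sketch below is an illustration under stated assumptions: the ten weekly
hours are hypothetical, and the class intervals are assumed to be 30-34, 35-39, and so on, each
with a width of 5.

```python
from collections import Counter
from math import sqrt
from statistics import stdev

# Hypothetical weekly hours for 10 workers (not from the module).
hours = [31, 33, 36, 37, 38, 41, 42, 44, 46, 54]


def midpoint(x, start=30, width=5):
    """Midpoint of the assumed class interval (30-34, 35-39, ...) containing x."""
    lower = start + width * ((x - start) // width)
    return lower + (width - 1) / 2


# Map each midpoint m to its frequency f.
frequencies = Counter(midpoint(x) for x in hours)

n = sum(frequencies.values())
sum_fm = sum(f * m for m, f in frequencies.items())
sum_fm2 = sum(f * m ** 2 for m, f in frequencies.items())

# Grouped formula: every case is treated as if it sat at its class midpoint.
grouped_sd = sqrt((sum_fm2 - sum_fm ** 2 / n) / (n - 1))

# Ungrouped formula on the raw scores (statistics.stdev divides by n - 1).
raw_sd = stdev(hours)

print(round(raw_sd, 2), round(grouped_sd, 2))
```

The two results are close but not identical, which is the price paid for the shorter
hand computation.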

The formula for ungrouped data is:

$$s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$$

If you notice, these formulas are similar to the variance formulas. The only difference is that,
with the standard deviation, you obtain the square root of the computed value. So, the standard
deviation can also be stated as:

Standard deviation = square root of the variance

$$s = \sqrt{s^2}$$
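
As a quick check of this relationship, the sketch below computes the corrected variance and the
corrected standard deviation of a hypothetical set of scores with Python's `statistics` module
and compares the latter with the square root of the former.

```python
from math import sqrt
from statistics import stdev, variance

scores = [2, 4, 4, 6, 9]  # hypothetical scores with a mean of 5

s_squared = variance(scores)  # sum of squared deviations / (n - 1)
s = stdev(scores)             # corrected standard deviation

print(s_squared, round(s, 4), round(sqrt(s_squared), 4))
```

Unlike the variance, the standard deviation is expressed in the same units as the original
scores, which makes it easier to interpret.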

In our previous example, the standard deviation is simply the square root of the variance:

$$s = \sqrt{s^2} = 3{,}198.25$$

The standard deviation tells us that the monthly salaries of the Our Father Hospital’s health
personnel typically deviate from the mean monthly salary by about P3,198.25. Since we obtained
a very large standard deviation, this indicates that the salaries of the health personnel are
generally far from the average salary, so the mean by itself does not describe the individual
salaries well. Of all the measures of variability, the standard deviation is the most useful,
especially because it is an important measure for inferential purposes.

4.4 Z-SCORES OR STANDARD SCORES

When you wish to compare scores from two different distributions, you may do so by standardizing
the distributions so that they share one standardized scale, where each value of X has a
standardized value denoted by Z, which is defined by the following formula:

$$Z = \frac{X - \mu}{\sigma} \quad \text{or} \quad z = \frac{X - \bar{X}}{s}$$

Where
X = raw score
µ = population mean
X̄ = sample mean
s = sample standard deviation
σ = population standard deviation
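
The sketch below standardizes a whole set of scores with the sample formula. The grades and the
function name `z_scores` are hypothetical; the point is only that the standardized scores always
end up with a mean of 0 and a standard deviation of 1, which is what puts different distributions
on a common scale.

```python
from statistics import mean, stdev


def z_scores(values):
    """Convert raw scores to z-scores using the sample mean and standard deviation."""
    x_bar = mean(values)
    s = stdev(values)
    return [(x - x_bar) / s for x in values]


grades = [70, 75, 80, 85, 90]  # hypothetical raw scores
standardized = z_scores(grades)

print([round(z, 2) for z in standardized])
print(round(mean(standardized), 6), round(stdev(standardized), 6))
```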

Properties of the Z-score
1. The sign of the Z-score indicates the location of the corresponding raw score relative to the
mean. If Z is positive, the score is above the mean, and if Z is negative, the score is below
the mean.
2. The Z-score can be directly transformed to a percentile score when a distribution is
normal.
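
The second property can be sketched with the standard normal distribution provided by Python's
`statistics.NormalDist`. The helper name `z_to_percentile` is hypothetical, and the conversion
holds only under the assumption that the scores are normally distributed.

```python
from statistics import NormalDist


def z_to_percentile(z):
    """Percentage of a normal distribution that falls below the given z-score."""
    return 100 * NormalDist().cdf(z)


for z in (-1.0, 0.0, 1.0):
    print(z, round(z_to_percentile(z), 1))
```

A z-score of 0 sits at the 50th percentile, negative z-scores fall below it, and positive
z-scores fall above it, in line with the first property.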

Let us take as an example Brenda Tag’s final examination results in three subjects of her
nursing course:

Subject            Brenda’s Grade   Class Mean   Standard Deviation
Pathophysiology    86               81           5.75
Theories           76               73           6.00
Statistics         91               93           6.50

In which subject did Brenda perform best? Worst? As they stand, the raw grades cannot answer
these questions. Transform Brenda’s scores to Z-scores, then compare.

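Solution:

1. Pathophysiology: $Z = \frac{X - \bar{X}}{s} = \frac{86 - 81}{5.75} = 0.87$

2. Theories: $Z = \frac{X - \bar{X}}{s} = \frac{76 - 73}{6.00} = 0.50$

3. Statistics: $Z = \frac{X - \bar{X}}{s} = \frac{91 - 93}{6.50} = -0.31$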

The z-score indicates the location of the score relative to the mean.

Interpretation: Among her three subjects, Brenda Tag performed best in Pathophysiology and
worst in Statistics.
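
The comparison can be checked with a few lines of Python. The grades, class means, and standard
deviations are taken from the table above; the variable names are chosen only for this sketch.

```python
# (Brenda's grade, class mean, standard deviation) for each subject
results = {
    "Pathophysiology": (86, 81, 5.75),
    "Theories": (76, 73, 6.00),
    "Statistics": (91, 93, 6.50),
}

# z = (X - mean) / s for each subject
z_scores = {
    subject: (grade - class_mean) / sd
    for subject, (grade, class_mean, sd) in results.items()
}

best = max(z_scores, key=z_scores.get)
worst = min(z_scores, key=z_scores.get)

for subject, z in z_scores.items():
    print(f"{subject}: z = {z:.2f}")
print("Best:", best, "| Worst:", worst)
```

Note that Statistics has the highest raw grade but the lowest z-score, because it is the only
subject in which Brenda scored below the class mean.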

Here are some exercises for you to do in order to apply what you learned in this module.

SAQ 4-1

Find the range and standard deviation of the following weights (in kilos) of 10 students:
50, 55, 48, 60, 54, 48, 57, 45, 52, 63

SAQ 4-2

Find the range, the semi-interquartile range and the standard deviation of the following
distribution:

Weekly Hours    No. of Workers
50-54            4
45-49           12
40-44           15
35-39           13
30-34            6
                N = 50

SAQ 4-3

On two final examinations (Anatomy and Pathophysiology), the class’ mean grade was 76 and
the standard deviation was 7.6. A Nursing student scored 71 in Anatomy and 75 in
Pathophysiology. In which examination was the student’s standing higher?

SAQ 4-4

A master’s student received a grade of 84 on a final examination in Research, for which the class
mean grade was 76 and the standard deviation was 10. On the final examination in Statistics, for
which the class mean grade was 82 and the standard deviation was 8, the master’s student
received a grade of 92. In which subject was the student’s standing higher?

Summary

This module discussed the measures of dispersion or variability. These measures supplement
the measures of central tendency.
The SAQs given in this module will develop your skills. Go back to the text, study the
illustrations, and soon you will master the measures of dispersion or variability. Keep on
reading and doing the SAQs. Our next module will be on presentation schemes.
