
Measures of Central Tendency

Introduction
Finding the average grade of students, the average daily sales of a department store, or the average salary per
month of employees in a company are common procedures that we do occasionally. Without us knowing it, we have been
doing statistical tasks at different times. Average is a term which can be associated with central tendency or central
location.
Measures of Central Tendency for Ungrouped Data
There are three types of measures of central tendency, namely, arithmetic mean, median, and mode.
Mean
To find the arithmetic mean, add all the items or observations, then divide the sum by the total number of
observations. In symbols,
$$Mn = \frac{\sum x}{N}$$
where x = ith observation
N = total number of observations
Example 1:
The grades of student A in five subjects are 78, 88, 89, 90, and 95. What is her mean grade?
Solution:

$$Mn = \frac{78 + 88 + 89 + 90 + 95}{5} = \frac{440}{5} = 88$$
Therefore, the average grade of the student in the five subjects is 88.
Example 2:
The heights in cm of 12 college freshmen are as follows: 152, 144, 156, 166, 172, 150, 153, 160, 154, 168, 165, and
170.
Find the mean height of the students.
Solution:
$$Mn = \frac{152 + 144 + 156 + 166 + 172 + 150 + 153 + 160 + 154 + 168 + 165 + 170}{12} = \frac{1910}{12} = 159.17$$

Therefore, the mean height of the freshmen students is 159.17 cm.
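
The mean formula translates directly into code. The following Python sketch recomputes Examples 1 and 2; the function name `arithmetic_mean` is an illustrative choice and not part of the text.

```python
def arithmetic_mean(values):
    """Return the sum of the observations divided by their count."""
    return sum(values) / len(values)

grades = [78, 88, 89, 90, 95]                      # Example 1
heights_cm = [152, 144, 156, 166, 172, 150,
              153, 160, 154, 168, 165, 170]        # Example 2

mean_grade = arithmetic_mean(grades)
mean_height = arithmetic_mean(heights_cm)

print(mean_grade)               # 88.0
print(round(mean_height, 2))    # 159.17
```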


Example 3:
Distribution of the Students’ Scores in a Statistics Examination
Score (x)    Frequency (f)
10           7
11           12
20           20
24           12
33           5
Total        56
Find the mean score of the students.

Solution:
$$Mn = \frac{7(10) + 12(11) + 20(20) + 12(24) + 5(33)}{56} = \frac{1055}{56} = 18.84$$
Therefore, the mean score of the students in the Statistics exam is 18.84.

Example 4:
Lorie Company has 6 employees whose monthly salaries are the following: two clerks with a salary of P13,500
each, one executive secretary with a salary of P18,000, one janitor with a salary of P8,000, one accountant with a salary of
P28,000, and one manager with a salary of P50,000. What is the mean monthly salary of the employees?
Solution:
$$Mn = \frac{2(13{,}500) + 1(18{,}000) + 1(8{,}000) + 1(28{,}000) + 1(50{,}000)}{6} = \frac{131{,}000}{6} = 21{,}833.33$$
Therefore, the mean monthly salary of the employees in Lorie Company is P21,833.33.
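
Examples 3 and 4 both weight each value by how often it occurs. A minimal Python sketch of that frequency-weighted mean is shown below; `weighted_mean` is a hypothetical helper name introduced only for illustration.

```python
def weighted_mean(values, frequencies):
    """Mean of values where each value x occurs f times: sum(f*x) / sum(f)."""
    total = sum(f * x for x, f in zip(values, frequencies))
    return total / sum(frequencies)

# Example 3: exam scores and their frequencies
score_mean = weighted_mean([10, 11, 20, 24, 33], [7, 12, 20, 12, 5])

# Example 4: monthly salaries (in pesos) and the number of employees earning each
salary_mean = weighted_mean([13500, 18000, 8000, 28000, 50000], [2, 1, 1, 1, 1])

print(round(score_mean, 2))     # 18.84
print(round(salary_mean, 2))    # 21833.33
```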

Median
The median is the midpoint of an array of numbers or observations. Let us denote the median by the symbol Md.
If a set of data contains an odd number of observations, the median is the only middle observation that divides the
set into two equal parts.
Example 5:
Find the median score of sophomore students in a Chemistry quiz.
12, 34, 23, 14, 16, 33, 41, 35, 10, 45, 25, 24, 50

Solution:
Write the observations in ascending order.

10, 12, 14, 16, 23, 24, 25, 33, 34, 35, 41, 45, 50
Thus, the median score is 25.

Example 6:
The ages of the patients at the pediatric ward of Hospital X are 10, 2, 5, 6, 5, 8, 9, and 9. Find the median of the patients’
ages.
Solution:
Write the ages in ascending order.

2, 5, 5, 6, 8, 9, 9, 10

$$Md = \frac{6 + 8}{2} = \frac{14}{2} = 7$$
The median for an array with an even number of observations can be obtained by finding the mean of the two
points in the middle. In Example 6, the points are 6 and 8. Therefore, the median is 7.
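
The odd and even cases can be handled by one routine. The sketch below is one possible Python rendering (the function name `median_of` is an assumption); it sorts the data first, exactly as the solutions to Examples 5 and 6 do.

```python
def median_of(values):
    """Middle observation of the sorted data; mean of the two middle ones if N is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                         # single middle observation
    return (ordered[mid - 1] + ordered[mid]) / 2    # mean of the two middle points

chemistry_scores = [12, 34, 23, 14, 16, 33, 41, 35, 10, 45, 25, 24, 50]   # Example 5
patient_ages = [10, 2, 5, 6, 5, 8, 9, 9]                                  # Example 6

print(median_of(chemistry_scores))   # 25
print(median_of(patient_ages))       # 7.0
```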

Mode
The mode is the observation that appears the most number of times in a distribution.
Example 7:
A store owner wants to know from his employee what shoe size is mostly bought by women for the month of June.
Using the sales report for June, the employee told the owner that the most common shoe size is 6.5, meaning, this size
appears the most number of times in the list. What the employee was actually telling the owner is that the mode of the
sizes of purchased shoes is 6.5.

Example 8:
What is the mode of the students’ scores in a Statistics test?
The scores are as follows:
12, 13, 12, 11, 10, 20, 24, 25, 10, 22, 20, 13, 16, 18, 20, 20, 20, 20.
Solution:
Score Frequency
12 2
13 2
11 1
10 2
20 6 highest frequency
24 1
25 1
22 1
16 1
18 1
Therefore, the mode is 20 since it is the observation or score that appeared the most number of times.

Example 9:
What is the mode of the following heights of freshmen college students?
152 cm, 166 cm, 176 cm, 150 cm, 168 cm, 155 cm, 149 cm
Solution:
There is no mode in this distribution.

Example 10:
What is the mode in the following set of data?
4 6 5 5 5 6 4 5 6 5 3 4 3 3 3 3 2 1 8 7
Solution:
There are two modes, 3 and 5, since both observations occurred the same number of times and are the most frequent.
In any given set of data, there can be one mode, two or more modes, or none at all.
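
Counting frequencies by hand, as in Example 8, can be sketched in Python with `collections.Counter`. The helper `modes` is hypothetical, and the convention that a data set has no mode when every value occurs exactly once is taken from Example 9; the text does not say how other ties should be treated.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); an empty list means there is no mode."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []            # every observation occurs once (Example 9)
    return sorted(v for v, c in counts.items() if c == highest)

example_8 = [12, 13, 12, 11, 10, 20, 24, 25, 10, 22, 20, 13, 16, 18, 20, 20, 20, 20]
example_9 = [152, 166, 176, 150, 168, 155, 149]
example_10 = [4, 6, 5, 5, 5, 6, 4, 5, 6, 5, 3, 4, 3, 3, 3, 3, 2, 1, 8, 7]

print(modes(example_8))    # [20]
print(modes(example_9))    # []
print(modes(example_10))   # [3, 5]
```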

Measures of Central Tendency for Grouped Data


Mean
Below are the characteristics of the mean of any distribution:
1. The mean is the most appropriate measure of central tendency when the data are in the interval or ratio scale.
2. The mean lies between the largest and smallest values or measurements.
3. There is only one value for the mean for a given set of values or measurements.
4. The mean is easily influenced by extreme values because all values contribute to the average. If there are extremely high values, the
mean tends to be high also. If there are extremely low values, the mean tends to be low also.
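For grouped data, the formula for finding the mean is as follows:
$$Mn = A.M. + i\left(\frac{\sum fd}{n}\right)$$
where A.M. = assumed mean
f = frequency
d = coded deviation
n = total frequency
i = interval
Equivalently, when the class marks x are used directly, $Mn = \frac{\sum fx}{N}$. This direct form is the one applied in Examples 11 and 12.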

Example 11:
Table 1
Ages of Patients in Coria Hospital, June 2004

Age (in years)    f     x     fx
60 – 68           1     64    64
51 – 59           2     55    110
42 – 50           3     46    138
33 – 41           2     37    74
24 – 32           20    28    560
15 – 23           5     19    95
6 – 14            3     10    30
                  N = 36      ∑fx = 1,071

The mean age of the patients in Coria Hospital, as shown in the table above, is:
$$Mn = \frac{1{,}071}{36} = 29.75$$
Therefore, the mean age of the patients in Coria Hospital is 29.75 years.

Example 12:
Find the mean of the scores in table 2.
Table 2
Scores of 165 Students in a Statistics Exam
Scores     f     x     fx       Class Boundaries     <Cumulative
                                Lower     Upper      Frequency
91 – 95    16    93    1,488    90.5      95.5       165
86 – 90    18    88    1,584    85.5      90.5       149
81 – 85    25    83    2,075    80.5      85.5       131
76 – 80    39    78    3,042    75.5      80.5       106
71 – 75    35    73    2,555    70.5      75.5       67
66 – 70    20    68    1,360    65.5      70.5       32
61 – 65    12    63    756      60.5      65.5       12
           N = 165     ∑fx = 12,860
Solution:
$$Mn = \frac{12{,}860}{165} = 77.9$$
Therefore, the mean of the students’ scores is 77.9.
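
The grouped mean can be checked both ways: directly from the class marks, and through the coded-deviation formula (assumed mean plus the interval times the sum of fd over the total frequency). The Python sketch below does this for table 2. Taking the class mark 78 as the assumed mean is an arbitrary choice for illustration; any class mark gives the same result.

```python
# (class mark x, frequency f) pairs from table 2
table2 = [(63, 12), (68, 20), (73, 35), (78, 39), (83, 25), (88, 18), (93, 16)]

n = sum(f for _, f in table2)                      # total frequency, 165

# Direct form: Mn = sum(f * x) / N
direct_mean = sum(f * x for x, f in table2) / n

# Coded-deviation form: d = (x - A.M.) / i, Mn = A.M. + i * sum(f * d) / n
assumed_mean = 78    # assumption: class mark of the 76-80 class
interval = 5
sum_fd = sum(f * (x - assumed_mean) / interval for x, f in table2)
coded_mean = assumed_mean + interval * sum_fd / n

print(round(direct_mean, 1), round(coded_mean, 1))   # 77.9 77.9
```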

Median
Characteristics of the Median:
1. The median is the most appropriate measure of central tendency for ordinal data.
2. The median lies between the highest and lowest measurements.
3. There is only one value for the median in a given set of measurements.
4. The median is not influenced by extreme values.
5. The median is used when the middle value is desired. It is the value where 50% or half of the distribution lies above it and
50% lies below it.
The formula for finding the median of grouped data is given by:
$$Md = LL + i\left(\frac{\frac{n}{2} - cf}{f_m}\right)$$
where LL = lower limit (lower class boundary) of the median class
cf = cumulative frequency below the median class
i = class size
fm = frequency of the median class
n = total frequency
Refer to table 2 for the following example in finding the median of a grouped data. To determine the median class,
calculate the value of N/2 . From the given data, the value of N/2 is 165/2 = 82.5. Next, look for the cf that is nearest to,
but not less than 82.5. The median class, therefore, is the class interval 76 – 80. The median is expected to be found in this
interval. The value of LL = 75.5 (the lower class boundary of the median class). The value of cf is 67, i = 5, and the frequency
of the median class is 39. Hence, the median of the scores is

$$Md = 75.5 + 5\left(\frac{82.5 - 67}{39}\right) = 75.5 + 1.99 = 77.49$$
Therefore, the median score is 77.49.
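
The search for the median class and the interpolation inside it can be sketched in Python as follows. The function name `grouped_median` and the representation of each class as a (lower boundary, frequency) pair are assumptions made for this illustration.

```python
def grouped_median(classes, class_size):
    """classes: (lower class boundary, frequency) pairs in ascending order."""
    n = sum(f for _, f in classes)
    half = n / 2
    cf = 0                       # cumulative frequency below the current class
    for lower_boundary, f in classes:
        if cf + f >= half:       # first class whose cumulative frequency reaches n/2
            return lower_boundary + class_size * (half - cf) / f
        cf += f

# Table 2, listed from the lowest class to the highest
table2 = [(60.5, 12), (65.5, 20), (70.5, 35), (75.5, 39),
          (80.5, 25), (85.5, 18), (90.5, 16)]

median_score = grouped_median(table2, class_size=5)
print(round(median_score, 2))    # 77.49
```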
Mode
Characteristics of the mode:
1. The mode is the most appropriate measure of central tendency when the data are nominal in scale.
2. The mode is the least reliable among the three measures of central tendency because its value is undefined in some
distributions.
3. The mode is used when we want to find the value which occurs most often.
4. The mode is a quick approximation of the average. The mode is sometimes referred to as an inspection average.
To find the mode of grouped data, the formula below is applied:
Mo = 3Md – 2Mn
For the distribution in table 2, the mean is 77.9 and the median is 77.49.
Mo = 3(77.49) – 2(77.9)
= 232.47 – 155.8
= 76.67

Measures of Variation
Introduction
Events of nature vary from time to time. People keep on changing their location, motion, physical appearance, skin
reaction to different chemicals, height, weight, hair color, eye color, ideas, and even values in life. Usually, the heights of a
group of people with the same race tend to converge to a certain common value. For example, if the mean height of Filipino
males is approximately 5 feet and 6 inches, then this means that most Filipino male adults have heights that are clustering
about this value. The extent of the clustering of the heights of the Filipino males about a central value is known as variation.
The measures of variation will enable you to know how varied the observations are, whether there are extreme values in
the distribution, or whether the values are very close to each other. If the measure of variation is zero, it means that there
is no variation at all and that the observations are all alike, or homogeneous. Otherwise, they are heterogeneous. The
common measures of variation are the range, mean absolute deviation, variance, standard deviation, coefficient of
variation, quartile deviation, and the percentile range.
Range
The range is the simplest form of measuring the variation of a distribution. To get the range, subtract the lowest
score or observation from the highest score.
R = Highest Observation – Lowest Observation
Example 1:
A group of scientists went on an expedition to the Sierra Madre mountain range in the Philippines to study the different species
of plants existing in that area. The ages of the scientists are 34, 35, 45, 56, 32, 25, and 40. What is the range of their ages?
Solution:
Highest Age = 56
Lowest Age = 25
R = Highest – Lowest
= 56 – 25
= 31
Therefore, the range of their ages is 31.
If the size of the population or sample is large, the range is not an excellent measure of variation because it will
only consider the highest and the lowest values and will not tell anything about the values between them. If one is
interested in the position of each observation relative to the mean of the set of data, other measures of variation may be
necessary. One such measure is the mean absolute deviation.

Mean Absolute Deviation


To find the mean absolute deviation, subtract the mean score from each raw score, then, using the absolute values
of the differences, get the sum of the results. The sum is called the sum of the absolute deviations from the mean. Next, divide this
sum by N, the total number of cases. In symbols,

$$MAD = \frac{\sum |x - \bar{x}|}{N} \quad \text{(for ungrouped data)}$$
where MAD = mean absolute deviation
x = raw score
x̄ = mean score
N = number of observations

$$MAD = \frac{\sum f\,|x - \bar{x}|}{N} \quad \text{(for grouped data)}$$
where MAD = mean absolute deviation
f = frequency
x = class mark
x̄ = mean score
N = number of observations

Example 2:
Take the MAD of the ages of the scientists in example 1.
Solution:
The ages are 34, 35, 45, 56, 32, 25, and 40.

$$\bar{x} = \frac{34 + 35 + 45 + 56 + 32 + 25 + 40}{7} = \frac{267}{7} = 38.14$$

x        x – x̄       |x – x̄|
34       -4.14       4.14
35       -3.14       3.14
45       6.86        6.86
56       17.86       17.86
32       -6.14       6.14
25       -13.14      13.14
40       1.86        1.86
Total                53.14

$$MAD = \frac{53.14}{7} = 7.59$$
Therefore, the mean absolute deviation is 7.59.
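
The steps for ungrouped data (find the mean, take absolute deviations, average them) can be sketched in a few lines of Python. The function name is an illustrative assumption.

```python
def mean_absolute_deviation(values):
    """Average of the absolute deviations of the observations from their mean."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)

ages = [34, 35, 45, 56, 32, 25, 40]      # ages of the scientists in Example 1
mad = mean_absolute_deviation(ages)
print(round(mad, 2))                     # 7.59
```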

Variance
Variance is another measure of variation which can be used instead of the range. The variance considers the
deviation of each observation from the mean. To obtain the variance of a distribution, first, square the deviation from the
mean of each raw score and add them together. Then, divide the resulting sum by N or the total number of cases.
a. Population Variance for Ungrouped Data

$$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$$
where σ² = population variance
x = raw score
μ = population mean
N = number of observations

b. Sample Variance for Ungrouped Data

$$V = \frac{\sum (x - Mn)^2}{N - 1}$$
where V = sample variance
x = raw score
Mn = sample mean
N = number of observations

c. Population Variance for Grouped Data

$$\sigma^2 = \frac{\sum f(x - \mu)^2}{N}$$
where σ² = population variance
f = frequency
x = class mark
μ = population mean
N = number of observations

d. Sample Variance for Grouped Data

$$V = \frac{N\sum fx^2 - \left(\sum fx\right)^2}{N(N - 1)}$$
where V = sample variance
N = number of observations
f = frequency
x = class mark

Except when specified that the population variance is to be used, you will always use the sample variance formula in the
examples and exercises throughout the book.

Example 3:
Find the population and sample variances of the following distribution:
34, 35, 45, 56, 32, 25, and 40
Solution:

$$\bar{x} = \frac{267}{7} = 38.14$$

x            |x – x̄|     (x – x̄)²
34           4.14        17.14
35           3.14        9.86
45           6.86        47.06
56           17.86       318.98
32           6.14        37.70
25           13.14       172.66
40           1.86        3.46
Total 267    53.14       606.86

a. Population Variance
$$\sigma^2 = \frac{\sum (x - \mu)^2}{N} = \frac{606.86}{7} = 86.69$$
b. Sample Variance
$$V = \frac{\sum (x - Mn)^2}{N - 1} = \frac{606.86}{6} = 101.14$$
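
Python's standard `statistics` module provides both divisors: `pvariance` divides by N and `variance` divides by N – 1. The short sketch below applies them to the data of Example 3 as a cross-check on the hand computation.

```python
from statistics import pvariance, variance

ages = [34, 35, 45, 56, 32, 25, 40]

population_variance = pvariance(ages)    # sum of squared deviations / N
sample_variance = variance(ages)         # sum of squared deviations / (N - 1)

print(round(population_variance, 2))     # 86.69
print(round(sample_variance, 2))         # 101.14
```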
Example 4:
Compute for the population and sample variances for the data in table 1.

Table 1
IQ Scores

IQ Scores    f     x      fx       x²       fx²       f(x – x̄)²
75 – 79      10    77     770      5,929    59,290    1,876.9
80 – 84      12    82     984      6,724    80,688    908.28
85 – 89      25    87     2,175    7,569    189,225   342.25
90 – 94      34    92     3,128    8,464    287,776   57.46
95 – 99      19    97     1,843    9,409    178,771   754.11
100 – 104    15    102    1,530    10,404   156,060   1,915.35
             N = 115      10,430            951,810   5,854.35

$$\bar{x} = \frac{10{,}430}{115} = 90.70$$
Solution:
Sample Variance

$$V = \frac{N\sum fx^2 - \left(\sum fx\right)^2}{N(N - 1)} = \frac{115(951{,}810) - (10{,}430)^2}{115(115 - 1)} = \frac{109{,}458{,}150 - 108{,}784{,}900}{13{,}110} = 51.35$$
Population Variance

$$\sigma^2 = \frac{\sum f(x - \mu)^2}{N} = \frac{5{,}854.35}{115} = 50.91$$
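
The two grouped-data formulas can be checked against the IQ table with a short Python sketch. The list of (class mark, frequency) pairs is simply the table re-entered by hand, and the variable names are illustrative assumptions.

```python
# (class mark x, frequency f) pairs from the IQ table
iq_classes = [(77, 10), (82, 12), (87, 25), (92, 34), (97, 19), (102, 15)]

n = sum(f for _, f in iq_classes)
sum_fx = sum(f * x for x, f in iq_classes)
sum_fx2 = sum(f * x * x for x, f in iq_classes)
mean_iq = sum_fx / n

# Sample variance: (N * sum(f*x^2) - (sum(f*x))^2) / (N * (N - 1))
sample_variance = (n * sum_fx2 - sum_fx ** 2) / (n * (n - 1))

# Population variance: sum(f * (x - mean)^2) / N
population_variance = sum(f * (x - mean_iq) ** 2 for x, f in iq_classes) / n

print(round(mean_iq, 2))               # 90.7
print(round(sample_variance, 2))       # 51.35
print(round(population_variance, 2))   # 50.91
```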
Standard Deviation
The standard deviation, σ for a population or s for a sample, is the square root of the value of the variance. In
symbols,
Population Standard Deviation (σ)
$$\sigma = \sqrt{\sigma^2}$$

Sample Standard Deviation (SD)
$$SD = \sqrt{V}$$

Unless specified, the sample standard deviation will be used in all the examples and exercises throughout the book.

Example 5:
Compute for the population and sample standard deviations for the data in table 1.
Solution:
Population Variance
$$\sigma^2 = 50.91$$

Therefore, the value of the population standard deviation is

$$\sigma = \sqrt{50.91} = 7.14$$

Sample Variance

V = 51.35

The sample standard deviation is

SD = √51.35 = 7.17

Example 6:
Find the standard deviation for the distribution in table 2.
Table 2
Scores in the Statistics Final Exam

Class Interval    f     x     fx       fx²
27 – 29           12    28    336      9,408
30 – 32           23    31    713      22,103
33 – 35           60    34    2,040    69,360
36 – 38           45    37    1,665    61,605
39 – 41           51    40    2,040    81,600
42 – 44           75    43    3,225    138,675
45 – 47           28    46    1,288    59,248
48 – 50           33    49    1,617    79,233
51 – 53           18    52    936      48,672
54 – 56           10    55    550      30,250
                  355         14,410   600,154

$$\bar{x} = \frac{14{,}410}{355} = 40.59$$

$$V = \frac{355(600{,}154) - (14{,}410)^2}{355(355 - 1)} = \frac{5{,}406{,}570}{125{,}670} = 43.02$$

$$SD = \sqrt{43.02} = 6.56$$

Therefore, the standard deviation of the score is 6.56.

Coefficient of Variation
When it is necessary to compare the variability of two or more groups, the task is easy if the means are the same.
For example, you can easily compare which group is more varied in height between the following groups:
Group 1: mean = 156 cm, standard deviation = 6
Group 2: mean = 156 cm, standard deviation = 10
Clearly, one can say that Group 2 is more varied because it has a higher standard deviation. The task becomes
more difficult if the means are not equal and the units are different, such as when comparing the weights of two groups
belonging to different age brackets or different genders. To compare the variability of the weights of 9 girls, having a mean
weight of 100 pounds and a standard deviation of 5 with that of the weight of 12 boys having a mean of 160 pounds and a
standard deviation of 8, a statistic called the coefficient of variation could help you. The formula is given by:
$$CV = \frac{SD}{\bar{x}} \times 100\%$$
where SD = standard deviation
x̄ = mean

Since SD and x̄ have the same units, their units cancel out, so CV has no unit.

Example 7:
Suppose two groups of students are to be compared in terms of height.

Group     Mean Height    Standard Deviation    CV
Male      162 cm         10 cm                 6.17%
Female    148 cm         4 cm                  2.70%

Solution:
$$\text{Male } CV = \frac{10}{162} \times 100\% = 6.17\%$$
$$\text{Female } CV = \frac{4}{148} \times 100\% = 2.70\%$$
Comparing the relative variations in height of the male and female students, it can be seen that the heights of the male
students have a higher coefficient of variation than those of the female students. Thus, the male students’ heights are more
varied.

Example 8:
Compare the variability of the heights and weights of the students given in the following data:
                      x̄         SD       CV
Height (in cm)        168 cm    12 cm    7.14%
Weight (in pounds)    200 lb    20 lb    10.00%
From the results, it can be seen that the weights of the students are more varied than the heights.
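
Because the coefficient of variation is a plain ratio, Examples 7 and 8 can be reproduced with a one-line function. The Python sketch below uses a hypothetical helper name, `coefficient_of_variation`.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return sd / mean * 100

cv_male = coefficient_of_variation(10, 162)      # Example 7
cv_female = coefficient_of_variation(4, 148)
cv_height = coefficient_of_variation(12, 168)    # Example 8
cv_weight = coefficient_of_variation(20, 200)

print(round(cv_male, 2), round(cv_female, 2))    # 6.17 2.7
print(round(cv_height, 2), round(cv_weight, 2))  # 7.14 10.0
```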

Quartile Deviation
The quartile deviation is another way of determining the spread of a distribution in terms of quartiles. The quartile
deviation formula is shown below:

$$QD = \frac{Q_3 - Q_1}{2}$$
where QD = quartile deviation
Q3 = 3rd quartile
Q1 = 1st quartile

Example 9:
Find the QD of the following scores:
23 25 25 30 35 39 40 44 47 51 60
Solution:

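Q3 = (3N/4)th item = (3(11)/4)th item = 8.25th item; taking the next whole position, the 9th item, Q3 = 47.

Q1 = (N/4)th item = (11/4)th item = 2.75th item; taking the 3rd item, Q1 = 25.

$$QD = \frac{47 - 25}{2} = \frac{22}{2} = 11$$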
Hence, the QD is 11.

Example 10:
Find the QD of the car battery lives (in years).
1.2 1.4 1.6 2.2 2.5 2.8 3.0 3.0 3.1 4.4
Solution:

Q3 = (3N/4)th item = (3(10)/4)th item = 7.5th item, which is 3.0 years (the value is
midway between the 7th and the 8th items, which is 3.0 in this example).

Q1 = (N/4)th item = (10/4)th item = 2.5th item, which is 1.5 years (since the number of
cases is even, the mean between the 2nd and the 3rd item, which is 1.5, is taken).

$$QD = \frac{3.0 - 1.5}{2} = \frac{1.5}{2} = 0.75$$
Hence, the QD is 0.75.
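
The positional rule used in Examples 9 and 10 can be sketched in Python. The rule coded here is inferred from those two examples and is an assumption, not something the text states in general: a position ending in .5 is averaged with its neighbor, any other fractional position is moved up to the next whole item. Statistical packages often use different interpolation conventions, so their quartiles may differ.

```python
import math

def item_at(ordered, position):
    """Value at a 1-based, possibly fractional, position in sorted data."""
    lower = math.floor(position)
    if position == lower:
        return ordered[lower - 1]
    if position - lower == 0.5:                       # midway: average the neighbors
        return (ordered[lower - 1] + ordered[lower]) / 2
    return ordered[math.ceil(position) - 1]           # otherwise take the next item

def quartile_deviation(values):
    ordered = sorted(values)
    n = len(ordered)
    q1 = item_at(ordered, n / 4)
    q3 = item_at(ordered, 3 * n / 4)
    return (q3 - q1) / 2

scores = [23, 25, 25, 30, 35, 39, 40, 44, 47, 51, 60]                 # Example 9
battery_years = [1.2, 1.4, 1.6, 2.2, 2.5, 2.8, 3.0, 3.0, 3.1, 4.4]    # Example 10

qd_scores = quartile_deviation(scores)
qd_battery = quartile_deviation(battery_years)
print(qd_scores)    # 11.0
```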

Example 11:
Find the QD of the scores in the following table:
Table 3
Scores in a Statistics Final Exam

Class Interval    f     x     fx       <CF    Class Boundaries
                                              Lower    Upper
27 – 29           12    28    336      12     26.5     29.5
30 – 32           23    31    713      35     29.5     32.5
33 – 35           60    34    2,040    95     32.5     35.5
36 – 38           45    37    1,665    140    35.5     38.5
39 – 41           51    40    2,040    191    38.5     41.5
42 – 44           75    43    3,225    266    41.5     44.5
45 – 47           28    46    1,288    294    44.5     47.5
48 – 50           33    49    1,617    327    47.5     50.5
51 – 53           18    52    936      345    50.5     53.5
54 – 56           10    55    550      355    53.5     56.5
                  N = 355

Solution:
For Q1: N/4 = 355/4 = 88.75; hence, LL = 32.5, cf = 35, i = 3, and fm = 60.
$$Q_1 = 32.5 + 3\left(\frac{88.75 - 35}{60}\right) = 32.5 + 2.69 = 35.19$$

For Q3: 3N/4 = 3(355)/4 = 266.25; hence, LL = 44.5, cf = 266, i = 3, and fm = 28.
$$Q_3 = 44.5 + 3\left(\frac{266.25 - 266}{28}\right) = 44.5 + 0.027 = 44.53$$
$$QD = \frac{44.53 - 35.19}{2} = \frac{9.34}{2} = 4.67$$
Hence, the quartile deviation is 4.67.
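
The same interpolation that gives the grouped median also gives any quartile once N/2 is replaced by N/4 or 3N/4. The Python sketch below applies it to table 3; the function name `grouped_quantile` and the (lower boundary, frequency) layout are assumptions for illustration.

```python
def grouped_quantile(classes, class_size, numerator, denominator):
    """Interpolated value at position numerator*N/denominator in grouped data.

    classes: (lower class boundary, frequency) pairs in ascending order.
    """
    n = sum(f for _, f in classes)
    target = numerator * n / denominator
    cf = 0                                   # cumulative frequency below the class
    for lower_boundary, f in classes:
        if cf + f >= target:                 # class that contains the target position
            return lower_boundary + class_size * (target - cf) / f
        cf += f

table3 = [(26.5, 12), (29.5, 23), (32.5, 60), (35.5, 45), (38.5, 51),
          (41.5, 75), (44.5, 28), (47.5, 33), (50.5, 18), (53.5, 10)]

q1 = grouped_quantile(table3, 3, 1, 4)
q3 = grouped_quantile(table3, 3, 3, 4)
qd = (q3 - q1) / 2
print(round(qd, 2))     # 4.67
```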

Percentile Range
The percentile range, PR, is the difference between the 90th percentile (P90) and the 10th percentile (P10). In symbols,
PR = P90 – P10

Example 12:
The following data represent the scores of students in a Physics final examination:

100 100 111 111 112 120 121 122 123


130 132 133 135 140 145 145 146 150
150 155 160 164 165 165 170 171 175 180
Calculate the percentile range of the scores.

Solution:

P90 = (90N/100)th item = (90(29)/100)th item = 26.1th item, which is 175.

P10 = (10N/100)th item = (10(29)/100)th item = 2.9th item, which is 111.
PR = P90 – P10
= 175 – 111
= 64
Hence, the percentile range of the scores is 64.

Normal Distribution
Introduction
The distribution of some human abilities and characteristics, such as mental ability, follows a certain specific shape called
the normal distribution. When the distribution is normal, most of the observations (about 68%) tend to converge at the
middle and the rest are distributed to the left and right ends of the distribution. The normal curve is bell-shaped. In a
normal distribution, the mean, median, and mode values are equal and coincide at one point when the graph is drawn.

Fig. 1 The normal curve

The Standard Normal Distribution


The normal curve is the graph of the equation
$$y = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}z^2}$$
where z is the z-score and σ is the population standard deviation.

z-scores
The scores of a normal distribution can be standardized by changing all the scores in the distribution into z-scores.
The graph using z-scores as points is the standard normal curve. The total area under the curve is 1. At the vertex of the normal curve
lie the mean, median, and mode values. Since it is bell-shaped, the right and the left sides of the curve are symmetric with
respect to a vertical axis. The area under the curve to the right of the vertical axis is 0.5 and the area under the curve to the
left is also 0.5. The z-score of 0 lies at the vertex. All z-scores to the right side are positive and those to the left are negative.

$$z = \frac{x - \mu}{\sigma}$$
where z = z-score
x = score
μ = mean
σ = population standard deviation

Example 1:
Convert the following scores to z-scores, where μ = 75 and σ = 5.
a. 75
b. 80
c. 58
Solution:
a. x = 75
$$z = \frac{75 - 75}{5} = \frac{0}{5} = 0$$
b. x = 80
$$z = \frac{80 - 75}{5} = \frac{5}{5} = 1$$
c. x = 58
$$z = \frac{58 - 75}{5} = \frac{-17}{5} = -3.4$$
5 5
Many statistical problems can be solved by using the z-scores and the areas under the normal curve. A table of values of
areas under the normal curve is given in Appendix A.
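
The conversion in Example 1 is a single subtraction and division, as the Python sketch below shows. The function name `z_score` is an illustrative assumption.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

z_values = [z_score(x, mean=75, sd=5) for x in (75, 80, 58)]
print(z_values)     # [0.0, 1.0, -3.4]
```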

Example 2:
The following are the scores of 27 students in a Biology quiz:
12 10 9 10 12 15 15 16 15
20 22 23 10 12 10 14 16 17
18 20 20 21 10 12 23 10 10
a. Convert scores of 10 and 20 to z-scores.
b. What percent of the class obtained scores higher than 20?
c. How many students obtained a score less than 20?
d. How many students scored between 10 and 20?
Solution:
a. μ = 14.9
σ = 4.5

Score z-score
9 -1.31
10 -1.09
12 -0.64
14 -0.20
15 0.02
16 0.24
17 0.47
18 0.69
20 1.13
21 1.36
22 1.58
23 1.80

b. z-score of 20 = 1.13
From the table of areas under the normal curve, the area under the curve when z is 1.13 is A = 0.3708 or 37.08%.
This represents the proportion of students who scored between 14.9 and 20. The percentage of students who
scored more than 20 (shaded portion in fig. 2) is 0.50 – 0.3708 = 0.1292 or equivalent to 12.92% of the whole class.
The desired area is the region lying to the right of z = 1.13. Therefore, 12.92% of the students scored higher than
20.

Fig. 2 Curve for Example 2b (region to the right of z = 1.13)

c. The number of students who scored less than 20 will be 0.3708 + 0.5 = 0.8708 or 87.08% of all the 27 students in
the class, which is 23.51. Hence, there are about 24 students who scored less than 20 in the test. In figure 3, the
desired area is the region lying to the left of z = 1.13.

Fig. 3 Curve for Example 2c (region to the left of z = 1.13)

d. z-score of 20 = 1.13, Area = 0.3708
z-score of 10 = -1.09, Area = 0.3621
In figure 4, the desired area lies between z = -1.09 and z = 1.13. Add 0.3708 and 0.3621 to determine the total area
between the two z-scores. The total area under the curve for the interval -1.09 < z < 1.13 is 0.7329 or 73.29% of the
27 students, which is about 20 students.

Fig. 4 Curve for Example 2d (region between z = -1.09 and z = 1.13)
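
The table look-ups in Example 2 can be cross-checked with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). Its `cdf` method gives the area to the left of a z-score, so the Appendix A value (the area between 0 and z) is `cdf(z) - 0.5`. This is a sketch that reuses the rounded z-scores 1.13 and -1.09 from the solution.

```python
from statistics import NormalDist

std_normal = NormalDist()        # standard normal: mean 0, standard deviation 1
class_size = 27

area_0_to_z20 = std_normal.cdf(1.13) - 0.5                        # table value for z = 1.13
share_above_20 = 1 - std_normal.cdf(1.13)                         # part (b)
share_below_20 = std_normal.cdf(1.13)                             # part (c)
share_10_to_20 = std_normal.cdf(1.13) - std_normal.cdf(-1.09)     # part (d)

print(round(area_0_to_z20, 4))               # 0.3708
print(round(share_above_20, 4))              # 0.1292
print(round(share_10_to_20 * class_size))    # 20
```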

Example 3:
What is the z-score that marks the upper 33% of the area under the normal curve?
Solution:

Fig. 5 Curve for Example 3 (upper 33% of the area, to the right of z)

The z-score that marks the upper 33% of the area under the normal curve can be determined using the table in
Appendix A. To be able to use the table, compute first for the area under the normal curve between z = 0 and the required
z-score. This area is equal to 50% - 33% = 17% or 0.17. In Appendix A, the z-score that corresponds to an area of 0.17 is 0.44.
Therefore, the given region (upper 33%) is the area under the normal curve for z > 0.44.
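
Reading the table backwards, from an area to a z-score, corresponds to the inverse of the cumulative distribution. As a sketch, `statistics.NormalDist.inv_cdf` in the Python standard library returns the z-score that has a given area to its left, so the upper 33% begins where the left-hand area is 0.67.

```python
from statistics import NormalDist

upper_share = 0.33
z_cutoff = NormalDist().inv_cdf(1 - upper_share)   # area to the left of the cutoff is 0.67

print(round(z_cutoff, 2))    # 0.44
```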

Skewness
Skewness refers to the symmetry or asymmetry of a frequency distribution, and its measure can be obtained by using the
formula:

$$Sk = \frac{3(\bar{x} - Md)}{SD}$$

Fig. 6 Positively skewed Fig. 7 Negatively skewed

If the values of the mean and the median are equal, Sk = 0 and the distribution is symmetric, as in the normal bell curve. If the
observations are concentrated at the left side of the vertical axis, with fewer observations at the right side, it is called a
positively skewed distribution. If the observations are concentrated at the right side, you have a negatively skewed
distribution. In a positively skewed distribution, the mean is higher than the median. In a negatively skewed distribution, the
mean is lower than the median. An example of a positively skewed distribution is the marrying age of women. Usually, women
marry between ages 18 and 30 years old. Marrying beyond 30 is considered late, so only a few will marry at 40, 50, or 60.

Example 4:
Calculate the degree of skewness of a distribution if the mean is 45, the median is 40, and the standard deviation is 5.
Solution:
$$Sk = \frac{3(\bar{x} - Md)}{SD} = \frac{3(45 - 40)}{5} = \frac{3(5)}{5} = 3$$
Hence, the distribution is positively skewed.

Example 5:
Find the degree of skewness for the following data:

Table 1
Frequency Distribution of Examination Marks in Statistics

Scores     f     x     fx       <CF    Class Boundaries    x²       fx²
                                       Lower    Upper
90 – 94    1     92    92       60     89.5     94.5       8,464    8,464
85 – 89    4     87    348      59     84.5     89.5       7,569    30,276
80 – 84    3     82    246      55     79.5     84.5       6,724    20,172
75 – 79    8     77    616      52     74.5     79.5       5,929    47,432
70 – 74    20    72    1,440    44     69.5     74.5       5,184    103,680
65 – 69    15    67    1,005    24     64.5     69.5       4,489    67,335
60 – 64    7     62    434      9      59.5     64.5       3,844    26,908
55 – 59    1     57    57       2      54.5     59.5       3,249    3,249
50 – 54    1     52    52       1      49.5     54.5       2,704    2,704
           N = 60      4,290                                        310,220
Solution:

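a.
$$\bar{x} = \frac{\sum fx}{N} = \frac{4{,}290}{60} = 71.5$$

b.
$$Md = LL + i\left(\frac{\frac{N}{2} - cf}{f_m}\right) = 69.5 + 5\left(\frac{30 - 24}{20}\right) = 69.5 + 1.5 = 71$$

c.
$$V = \frac{N\sum fx^2 - \left(\sum fx\right)^2}{N(N - 1)} = \frac{60(310{,}220) - (4{,}290)^2}{60(59)} = 59.07$$
$$SD = \sqrt{59.07} = 7.69$$

d.
$$Sk = \frac{3(\bar{x} - Md)}{SD} = \frac{3(71.5 - 71)}{7.69} = \frac{1.5}{7.69} = 0.20$$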
Hence, the distribution is positively skewed.
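
Once the mean, median, and standard deviation are known, the skewness measure is a single expression. The Python sketch below re-evaluates Examples 4 and 5 from the summary values given in the solutions; the function name `skewness` is an assumption.

```python
def skewness(mean, median, sd):
    """Sk = 3 * (mean - median) / SD; positive means skewed to the right."""
    return 3 * (mean - median) / sd

sk_example_4 = skewness(mean=45, median=40, sd=5)
sk_example_5 = skewness(mean=71.5, median=71, sd=7.69)

print(sk_example_4)              # 3.0
print(round(sk_example_5, 2))    # 0.2
```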

Kurtosis
The degree of peakedness or flatness of a curve is called kurtosis, denoted by Ku. This is also known as percentile
coefficient of kurtosis and its formula is given by
$$Ku = \frac{QD}{PR}$$
where QD = quartile deviation
PR = percentile range

When the value of Ku is:

a. equal to 0.263, the curve is a normal curve or mesokurtic.
b. greater than 0.263, the curve is platykurtic or flat.
c. less than 0.263, the curve is leptokurtic or thin.

Fig. 8 Mesokurtic, platykurtic, and leptokurtic curves

Example 6:

Calculate the percentile coefficient of kurtosis for the data below.


Table 2
Frequency Distribution of Examination Marks in Statistics

Scores     f     x     fx       <CF    Class Boundaries
                                       Lower    Upper
90 – 94    1     92    92       60     89.5     94.5
85 – 89    4     87    348      59     84.5     89.5
80 – 84    3     82    246      55     79.5     84.5
75 – 79    8     77    616      52     74.5     79.5
70 – 74    20    72    1,440    44     69.5     74.5
65 – 69    15    67    1,005    24     64.5     69.5
60 – 64    7     62    434      9      59.5     64.5
55 – 59    1     57    57       2      54.5     59.5
50 – 54    1     52    52       1      49.5     54.5
           N = 60      4,290
Solution:
a. Find QD.
N/4 = 60/4 = 15
LL = 64.5, cf = 9, fm = 15
$$Q_1 = 64.5 + 5\left(\frac{15 - 9}{15}\right) = 64.5 + 2 = 66.5$$

3N/4 = 3(60)/4 = 45
LL = 74.5, cf = 44, fm = 8
$$Q_3 = 74.5 + 5\left(\frac{45 - 44}{8}\right) = 74.5 + 0.625 = 75.125$$

$$QD = \frac{Q_3 - Q_1}{2} = \frac{75.125 - 66.5}{2} = 4.3125$$
b. Find PR.

10N/100 = 10(60)/100 = 6
LL = 59.5, cf = 2, fm = 7
$$P_{10} = 59.5 + 5\left(\frac{6 - 2}{7}\right) = 59.5 + 2.86 = 62.36$$
90N/100 = 90(60)/100 = 54
LL = 79.5, cf = 52, fm = 3
$$P_{90} = 79.5 + 5\left(\frac{54 - 52}{3}\right) = 79.5 + 3.33 = 82.83$$
PR = P90 – P10
= 82.83 – 62.36
= 20.47

c. Find Ku.
$$Ku = \frac{QD}{PR} = \frac{4.3125}{20.47} = 0.21$$

Since the coefficient is lower than 0.263, it means that the distribution is leptokurtic.
Fig. 9 Graph for Example 6
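
All four positional values in Example 6 (Q1, Q3, P10, P90) come from the same grouped-data interpolation, so the percentile coefficient of kurtosis can be sketched with one helper. In the Python below, `grouped_percentile` is a hypothetical function, and the classes are entered as (lower boundary, frequency) pairs from lowest to highest.

```python
def grouped_percentile(classes, class_size, k):
    """Interpolated k-th percentile of grouped data.

    classes: (lower class boundary, frequency) pairs in ascending order.
    """
    n = sum(f for _, f in classes)
    target = k * n / 100
    cf = 0                                   # cumulative frequency below the class
    for lower_boundary, f in classes:
        if cf + f >= target:
            return lower_boundary + class_size * (target - cf) / f
        cf += f

marks = [(49.5, 1), (54.5, 1), (59.5, 7), (64.5, 15), (69.5, 20),
         (74.5, 8), (79.5, 3), (84.5, 4), (89.5, 1)]

q1 = grouped_percentile(marks, 5, 25)
q3 = grouped_percentile(marks, 5, 75)
p10 = grouped_percentile(marks, 5, 10)
p90 = grouped_percentile(marks, 5, 90)

qd = (q3 - q1) / 2
pr = p90 - p10
ku = qd / pr
print(round(ku, 2))     # 0.21
```

Since 0.21 is below 0.263, the sketch agrees with the conclusion that the distribution is leptokurtic.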

Hypothesis Testing
Introduction
Testing the significance of the difference between two means, two standard deviations, two proportions, or two
percentages, is an important area of inferential statistics. Comparison between two or more variables often arises in
research or experiments and to be able to make valid conclusions regarding the result of the study, one has to apply an
appropriate test statistic. This chapter deals with the discussion of the different test statistics that are commonly used in
research studies.
Hypothesis
A hypothesis is a conjecture or statement which aims to explain certain phenomena in the real world. Many
hypotheses, statistical or not, are products of man’s curiosity. To seek answers to his questions, he tries to find and
present evidence, then tests the resulting hypothesis using statistical tools and analysis. In statistical analysis, assumptions
are given in the form of a null hypothesis, the truth of which will be either accepted or rejected within a certain critical
interval.
The Null and Alternative Hypotheses
The hypothesis that is subjected to testing to determine whether its truth can be accepted or rejected is the null
hypothesis denoted by Ho. This hypothesis states that there is no significant relationship or no significant difference
between two or more variables, or that one variable does not affect another variable. In statistical research, the hypotheses
should be written in null form. For example, suppose you want to know whether method A is more effective than method B
in teaching high school mathematics. The null hypothesis for this study will be: “There is no significant difference between
the effectiveness of method A and B.”
Another type of hypothesis is the alternative hypothesis, denoted by Ha. This is the hypothesis that challenges the
null hypothesis. The alternative hypothesis for the example above can be: “There is a significant difference between the
effectiveness of method A and method B,” or “Method A is more effective than method B,” or “Method A is less effective
than method B,” depending on whether the type of test is either one-tailed or two-tailed. These will be discussed in the
succeeding lessons.
Significance Level
To test the null hypothesis of no significance in the difference between the two methods in the above example,
one must set the level of significance first. This is the probability of committing a Type I error and is denoted by the symbol α. A
Type I error is the error of rejecting the null hypothesis, Ho, when, in fact, it is true. Accepting
the null hypothesis when, in fact, it is false is called a Type II error; its probability is denoted by the symbol β.
The most common level of significance is 5%.
One-Tailed and Two-Tailed Tests
A test is called a one-tailed test if the rejection region lies on one extreme side of the distribution and two-tailed if
the region is located on both ends of the distribution.

(a) (b)
Fig. 1 One-tailed (a) and two-tailed (b) tests
In fig. 1a (one-tailed), the rejection region is the area to the right of the vertical line under the bell curve. In fig. 1b
(two-tailed), the rejection region is the areas to the extreme left and right of the curve marked by the two vertical lines.
Suppose the null hypothesis of a certain research paper is given as Ho: x = y; then the alternative hypothesis
will be Ha: x ≠ y. A two-tailed or two-directional test will be used if there are two possibilities: either the left side is greater
than the right, or the left side is less than the right. If Ha: x < y, then a one-tailed or one-directional test will be used.
Testing Hypothesis
Below are the steps when testing the truth of a hypothesis.
a. Formulate the null hypothesis. Denote it as Ho and the alternative hypothesis as Ha.
b. Set the desired level of significance (a).
c. Determine the appropriate test statistic to be used in testing the null hypothesis.
d. Compute for the value of the statistic to be used.
e. Compute for the degrees of freedom.
f. Find the tabular value using the table of values for different tests from the appendix tables.
g. Compare the computed value, CV, to the tabular value, TV.
Decision Rule: If the CV is less than the TV, accept the null hypothesis. If the CV is greater than the TV, reject the null
hypothesis. Make a conclusion using the result of the comparison.
Degree of Freedom (df)
The degree of freedom gives the number of pieces of independent information available for computing variability.
For any statistical tool used in testing hypotheses, the number of degrees of freedom required will vary depending on the size
of the distribution. For a single group, the number of degrees of freedom is N – 1, where N is the size of the group.
For two groups, the formula for df is: N1 + N2 – 2 for the t-test and N – 2 for Pearson r. These test statistics will be discussed later
in this chapter.
Tests Concerning Means
z-test on the Comparison between the Population Mean and the Sample Mean
If the population mean (μ) and the population standard deviation (σ) are known, and μ will be compared to a sample mean
x̄, use the formula below.
\[ z = \frac{(\bar{x} - \mu)\sqrt{n}}{\sigma} \]
The tabular values of z can be obtained from the following table:
Table 1
Critical Values of z
Test Type           Level of Significance
                    0.10      0.05      0.025     0.01
One-tailed Test     ±1.28     ±1.645    ±1.96     ±2.33
Two-tailed Test     ±1.645    ±1.96     ±2.24     ±2.58
Decision Rule:
Reject Ho if |z| ≥ |z tabular|.
Example 1:
A company, which makes battery-operated toy cars, claims that its products have a mean life span of 5 years with a
standard deviation of 2 years. Test the null hypothesis that μ = 5 years against the alternative hypothesis that μ ≠ 5 years if
a random sample of 40 toy cars was tested and found to have a mean life span of only 3 years. Use a 0.05 level of
significance.
Solution:
1. Ho: The mean lifespan of the battery-operated toy cars is 5 years.
(μ = 5 years)
Ha: The mean lifespan of the battery-operated toy cars is not 5 years.
(μ ≠ 5 years)
2. α = 0.05; two-tailed
3. Use z-test as test statistic.
4. Computation:
x̄ = 3   μ = 5   n = 40   σ = 2
\[ z = \frac{(\bar{x} - \mu)\sqrt{n}}{\sigma} = \frac{(3 - 5)\sqrt{40}}{2} = -6.32 \]
5. Critical regions: z < -1.96 and z > 1.96
6. Decision: Reject the Ho and accept the proposition that the mean lifespan of the toys is not equal to 5 years since
|z|, which is 6.32, is greater than |z tabular|, which is 1.96.
7. The difference is significant.
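The computation and decision rule in this example can be checked with a short script. This is only a sketch: the function names `z_statistic` and `reject_null` are illustrative and do not come from the text, and the tabular value is typed in by hand from Table 1.

```python
import math

def z_statistic(sample_mean, pop_mean, pop_sd, n):
    """z = (sample mean - population mean) * sqrt(n) / population SD."""
    return (sample_mean - pop_mean) * math.sqrt(n) / pop_sd

def reject_null(computed, tabular):
    """Decision rule from the text: reject Ho when |computed| >= |tabular|."""
    return abs(computed) >= abs(tabular)

# Example 1: sample mean 3, claimed mean 5, sigma 2, n = 40, two-tailed at 0.05
z = z_statistic(sample_mean=3, pop_mean=5, pop_sd=2, n=40)
print(round(z, 2))           # -6.32
print(reject_null(z, 1.96))  # True
```

The same two functions can be reused for any z-test on a single mean by changing the arguments and the tabular value.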
Example 2:
A manufacturer of bicycle tires has developed a new design which he claims has an average lifespan of 5 years with
a standard deviation of 1.2 years. A dealer of the product claims that the average lifespan of 150 samples of the tires is only
3.5 years. Test the difference of the population and sample means at 5% level of significance.
Solution:
1. Ho: x̄ = μ
Ha: x̄ < μ
2. α = 5%; one-tailed
3. Test statistic: z-test
4. Computation:
x̄ = 3.5   μ = 5   n = 150   σ = 1.2   √n = 12.25
\[ z = \frac{(\bar{x} - \mu)\sqrt{n}}{\sigma} = \frac{(3.5 - 5)(12.25)}{1.2} = -15.31 \]
5. Critical area: at 5% level of significance (one-tailed), z < -1.645
6. Since |z|, which is 15.31, is greater than |z tabular|, which is 1.645, reject the null hypothesis and accept the alternative
hypothesis. The mean lifespan of the samples is less than the mean lifespan of the population.
7. It implies that the difference between the two means is significant at 5% level.
t-test on the Comparison between the Population Mean and the Sample Mean
The t-test can be used to compare the means when the population mean (μ) is known but the population standard deviation
(σ) is unknown.
When the population standard deviation is unknown but the sample standard deviation (SD) can be computed, the t-test
is used instead of the z-test. The formula is given below:
\[ t = \frac{(\bar{x} - \mu)\sqrt{n}}{SD} \]
The quantity SD/√n is called the standard error of the mean. It is the
standard deviation of the sampling distribution of the mean for random samples of size n.
Decision Rule:
Reject Ho if |t| ≥ |t tabular|.
Example 3:
The average length of time for people to vote using the old procedure during presidential election period in
precinct A is 55 minutes. Using computerization as a new election method, a random sample of 20 registrants was used and
found to have a mean length of voting of 30 minutes with a standard deviation of 1.5 minutes. Test the significance of the
difference between the population mean and the sample mean.
Solution:
1. Ho: x̄ = μ
Ha: x̄ < μ
2. α = 5%; one-tailed
3. t-test is the appropriate test statistic.
4. Computation:
\[ t = \frac{(\bar{x} - \mu)\sqrt{n}}{SD} = \frac{(30 - 55)\sqrt{20}}{1.5} = -74.54 \]
5. df = n – 1 = 20 – 1 = 19
6. Tabular value of t = 1.729 (one-tailed)
7. Decision: Reject the Ho since |t|, which is 74.54, is greater than |t tabular|, which is 1.729.
8. There is a significant difference between the means.
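As a quick check of the arithmetic in this example, the one-sample t statistic can be sketched as below. The function name `one_sample_t` is an assumption made for illustration; the tabular value 1.729 is copied from the solution rather than computed.

```python
import math

def one_sample_t(sample_mean, pop_mean, sample_sd, n):
    """Return (t, df) with t = (sample mean - population mean) * sqrt(n) / SD."""
    t = (sample_mean - pop_mean) * math.sqrt(n) / sample_sd
    return t, n - 1

# Example 3: old mean 55 minutes, sample of 20 with mean 30 and SD 1.5
t, df = one_sample_t(sample_mean=30, pop_mean=55, sample_sd=1.5, n=20)
print(round(t, 2), df)   # -74.54 19
print(abs(t) >= 1.729)   # True, so Ho is rejected
```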
t-test Concerning Means of Independent Samples
When two samples are drawn from normally distributed populations with the assumption that their variances are
equal, the t-test with the given formula should be used.
\[ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\left[\dfrac{(n_1 - 1)V_1^2 + (n_2 - 1)V_2^2}{n_1 + n_2 - 2}\right]\left[\dfrac{n_1 + n_2}{n_1 n_2}\right]}} \]
where x̄1, x̄2 = means
n1, n2 = sample sizes
V1², V2² = variances
Decision Rule: Reject Ho if |t| ≥ |t tabular|.
Example 4:
A course in physics was taught to 10 students using the traditional method. Another group of 11 students went
through the same course using another method. At the end of the semester, the same test was administered to each group.
The 10 students under method A got an average of 82 with a standard deviation of 5, while the 11 students under method B
got an average of 78 with a standard deviation of 6. Test the null hypothesis of no significant difference in the performance
of the two groups of students at 5% level of significance.
1. Ho : There is no significant difference between the average scores of the two groups of students. x̄1 = x̄2
Ha : The mean score of the first group is higher than the mean score of the second group. x̄1 > x̄2
2. α = 5%; one-tailed
3. Use the t-test as test statistic.
4. Computation:
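x̄1 = 82   x̄2 = 78
n1 = 10   n2 = 11
V1 = 5   V2 = 6
\[ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\left[\dfrac{(n_1 - 1)V_1^2 + (n_2 - 1)V_2^2}{n_1 + n_2 - 2}\right]\left[\dfrac{n_1 + n_2}{n_1 n_2}\right]}} \]
\[ = \frac{82 - 78}{\sqrt{\left[\dfrac{(10 - 1)(5)^2 + (11 - 1)(6)^2}{10 + 11 - 2}\right]\left[\dfrac{10 + 11}{(10)(11)}\right]}} \]
\[ = \frac{4}{\sqrt{\left[\dfrac{(9)(25) + (10)(36)}{19}\right]\left[\dfrac{21}{110}\right]}} = \frac{4}{2.4245} = 1.65 \]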
5. df = n1 + n2 – 2
= 10 + 11 – 2 = 19
6. Tabular value = 1.729
7. The null hypothesis will be accepted because |t| < |t tabular|. This means that the difference between the means is
not significant at 5% level of significance. This also implies that method A is as effective as method B.
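The pooled-variance computation above is easy to get wrong by hand, so a minimal sketch is given here. The name `pooled_t` is hypothetical; the function takes standard deviations (V1, V2) and squares them, as in the formula.

```python
import math

def pooled_t(mean1, mean2, sd1, sd2, n1, n2):
    """t-test for independent samples, assuming equal population variances."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    std_error = math.sqrt(pooled_var * (n1 + n2) / (n1 * n2))
    return (mean1 - mean2) / std_error, n1 + n2 - 2

# Example 4: method A (n = 10, mean 82, SD 5) vs method B (n = 11, mean 78, SD 6)
t, df = pooled_t(82, 78, 5, 6, 10, 11)
print(round(t, 2), df)   # 1.65 19
print(abs(t) >= 1.729)   # False, so Ho is accepted
```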

t-test on the Significance of the Difference Between Two Correlated Means


When comparing two correlated means, the t-test is the appropriate test statistic. A typical example is when
comparing the results of the pre-test and the post-test administered to a group of individuals. The two tests must be the
same.
\[ t = \frac{\sum d / n}{SD_d / \sqrt{n}} \]
where d = difference between the pre-test and post-test scores
n = number of observations
SDd = standard deviation of the differences
Example 5:
To determine whether the students’ performance in College Algebra improved after enrolling in the subject for
one term, a 60-item pre-test and post-test were administered to them on the first and the last days of classes, respectively.
The same test was given as pre-test and post-test.
The results are as follows:
Student Pre-test Score Post-test Score Difference, d d²
A 34 45 -11 121
B 23 32 -9 81
C 40 46 -6 36
D 31 57 -26 676
E 24 39 -15 225
F 45 48 -3 9
G 27 27 0 0
H 32 33 -1 1
I 12 18 -6 36
J 45 45 0 0
Σd = -77   Σd² = 1,185

Solution:
1. Ho : The students’ performance in Algebra did not improve. (μ1 = μ2)
Ha : The students’ performance in Algebra improved. (μ1 < μ2)
2. α = 1%; one-tailed test
3. t-test will be used.
4. Computations:
Sample variance:
\[ SD_d^2 = \frac{n\sum d^2 - (\sum d)^2}{n(n - 1)} = \frac{10(1{,}185) - (-77)^2}{10(9)} = 65.79 \]
SDd = 8.11
\[ \frac{\sum d}{n} = \frac{-77}{10} = -7.7 \]
√n = 3.16
\[ t = \frac{\sum d / n}{SD_d / \sqrt{n}} = \frac{-7.7}{8.11 / 3.16} = -3 \]
5. df = n – 1 = 10 – 1 = 9
6. Tabular value = 2.821 (one-tailed)
7. Reject Ho since |t| > |t tabular|. This means that the performance of the students in College Algebra significantly
improved.
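The paired computation can be reproduced from the raw scores. The sketch below uses Python's standard `statistics` module, whose `stdev` divides by n – 1 and therefore matches the sample variance formula used in the solution; the variable names are illustrative.

```python
import math
import statistics

pre = [34, 23, 40, 31, 24, 45, 27, 32, 12, 45]
post = [45, 32, 46, 57, 39, 48, 27, 33, 18, 45]

# d = pre-test minus post-test, as in the table
d = [a - b for a, b in zip(pre, post)]
n = len(d)

mean_d = sum(d) / n                 # sum(d) / n = -7.7
sd_d = statistics.stdev(d)          # sample SD of the differences, about 8.11
t = mean_d / (sd_d / math.sqrt(n))  # about -3.0

print(sum(d), sum(x * x for x in d))  # -77 1185
print(round(t, 2))                    # -3.0
print(abs(t) > 2.821)                 # True, so Ho is rejected at the 1% level
```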
z-test on the Significance of the Difference Between Two Independent Proportions
To determine if there is a significant difference between proportions of two variables, the z-test will be used.
\[ z = \frac{p_1 - p_2}{\sqrt{\dfrac{p_1 q_1}{n_1} + \dfrac{p_2 q_2}{n_2}}} \]
where p1 = proportion of the first sample
p2 = proportion of the second sample
q1 = 1 – p1
q2 = 1 – p2
n1 = number of cases in the first sample
n2 = number of cases in the second sample
Example 6:
A sample survey of a presidential candidate in the Philippines shows that 120 of 200 male voters dislike candidate
X and 175 of 250 female voters dislike the same candidate.
Determine whether the difference between the two sample proportions, 120/200 and 175/250, is significant or not at 1% level of significance.

Solution:
1. Ho: There is no significant difference between the proportion of the male voters and the proportion of female
voters. (p1 = p2)
Ha: There is a significant difference between the proportion of the male voters and the proportion of female voters.
(p1 ≠ p2)
2. Set α = 0.01; two-tailed
3. Use z-test as test statistic.
4. Computation:
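\[ z = \frac{p_1 - p_2}{\sqrt{\dfrac{p_1 q_1}{n_1} + \dfrac{p_2 q_2}{n_2}}} \]
\[ = \frac{\dfrac{120}{200} - \dfrac{175}{250}}{\sqrt{\dfrac{\left(\frac{120}{200}\right)\left(\frac{80}{200}\right)}{200} + \dfrac{\left(\frac{175}{250}\right)\left(\frac{75}{250}\right)}{250}}} \]
\[ = \frac{-0.1}{\sqrt{\dfrac{0.24}{200} + \dfrac{0.21}{250}}} = \frac{-0.1}{0.045} = -2.22 \]
5. The tabular value of z is 2.58 (two-tailed).
6. Since |z|, which is 2.22, is less than |z tabular|, which is 2.58, accept Ho.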
7. Therefore, there is no significant difference between the proportions of male and female voters in their dislike for
candidate X.
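A minimal sketch of this test is shown below, with the hypothetical function name `two_proportion_z`. It follows the unpooled formula given in the text. Because the script does not round the denominator to 0.045, it gives about -2.21 rather than -2.22; the decision is the same.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """z for the difference between two independent sample proportions."""
    p1, p2 = x1 / n1, x2 / n2
    q1, q2 = 1 - p1, 1 - p2
    return (p1 - p2) / math.sqrt(p1 * q1 / n1 + p2 * q2 / n2)

# Example 6: 120 of 200 male voters and 175 of 250 female voters
z = two_proportion_z(120, 200, 175, 250)
print(round(z, 3))
print(abs(z) >= 2.58)   # False, so Ho is accepted at the 1% level
```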

Significance of the Difference Between Variances

Analysis of Variance
When the means of two or more independent samples are to be compared, the appropriate test statistic to determine the
significance of their difference is the analysis of variance (ANOVA), which makes use of the F ratio or variance ratio. The
various groups being compared are assumed to belong to a population with a normal distribution, each group randomly
selected and independent from the other group. The variables from each group also have standard deviations that are
approximately equal.
Steps in Solving the Analysis of Variance
1. State the null hypothesis.
2. Set the level of significance.
3. Accomplish the ANOVA Table.
The ANOVA Table
Source of Variation   Sum of Squares   df            Mean Square       F
Between               SSB              dfB = k – 1   MSB = SSB / dfB   F = MSB / MSW
Within                SSW              dfW = N – k   MSW = SSW / dfW
Total                 TSS              dfT = N – 1

where
\[ SSB = \sum \frac{\left(\sum X_{Ai}\right)^2}{n_A} - \frac{\left(\sum X_i\right)^2}{N} \]
\[ TSS = \sum X_i^2 - \frac{\left(\sum X_i\right)^2}{N} \]
SSW = TSS – SSB
N = sample size
k = number of columns
X = observed value
n = number of rows
A = given factor or category
i = individual observation of cell
4. Find the tabular value of F at the given level of significance (from Appendix E).
5. Accept the null hypothesis if the computed value of F is less than the tabular value and reject if it is greater than
the tabular value.
6. Interpret the result.
Example 7:
Determine who among the three salesmen will most likely be promoted based on their monthly sales in pesos. Use
5% level of significance.

Table 2
Sales of Three Candidates for Promotion (A, B, C)
A B C A² B² C²
12,000 15,500 12,899 144,000,000 240,250,000 166,384,201
10,000 12,500 16,000 100,000,000 156,250,000 256,000,000
10,900 12,000 15,000 118,810,000 144,000,000 225,000,000
18,000 13,000 12,700 324,000,000 169,000,000 161,290,000
16,000 14,000 15,000 256,000,000 196,000,000 225,000,000
14,400 15,888 13,000 207,360,000 252,428,544 169,000,000
14,400 12,300 12,000 207,360,000 151,290,000 144,000,000
15,500 15,000 16,000 240,250,000 225,000,000 256,000,000
18,800 19,000 16,000 353,440,000 361,000,000 256,000,000
Total 130,000 129,188 128,599 1,951,220,000 1,895,218,544 1,858,674,201
ΣA = 130,000   ΣA² = 1,951,220,000
ΣB = 129,188   ΣB² = 1,895,218,544
ΣC = 128,599   ΣC² = 1,858,674,201
Solution:
1. Ho: There is no significant difference in the mean sales of the three candidates for promotion.
2. Set a = 5%.
3. Sum of Squares
\[ TSS = \left(\sum A^2 + \sum B^2 + \sum C^2\right) - \frac{\left(\sum A + \sum B + \sum C\right)^2}{N} \]
= (1,951,220,000 + 1,895,218,544 + 1,858,674,201) – (130,000 + 129,188 + 128,599)² / 27
= 5,705,112,745 – (387,787)² / 27
= 5,705,112,745 – 5,569,583,606
= 135,529,139
\[ SSB = \frac{\left(\sum A\right)^2 + \left(\sum B\right)^2 + \left(\sum C\right)^2}{n} - \frac{\left(\sum A + \sum B + \sum C\right)^2}{N} \]
= (130,000² + 129,188² + 128,599²) / 9 – (130,000 + 129,188 + 128,599)² / 27
= 5,569,693,572 – (387,787)² / 27
= 5,569,693,572 – 5,569,583,606
= 109,966
SSW = TSS – SSB = 135,529,139 – 109,966 = 135,419,173
Degrees of freedom
dfT = N – 1 = 27 – 1 = 26
dfB = k – 1 = 3 – 1 = 2
dfW = N – k = 27 – 3 = 24
Mean Squares
MSB = SSB / dfB = 109,966 / 2 = 54,983
MSW = SSW / dfW = 135,419,173 / 24 = 5,642,465.542
F = MSB / MSW = 54,983 / 5,642,465.542 = 0.00974
The ANOVA Table
Source of Variation   Sum of Squares   df   Mean Square     F
Between               109,966          2    54,983          0.00974
Within                135,419,173      24   5,642,465.542
Total                 135,529,139      26
4. Tabular value of F = 3.40
5. The computed value, 0.00974, is less than the tabular value, 3.40, so the null hypothesis is accepted.
6. Therefore, there is no significant difference between the sales of Salesmen A, B, and C. Hence, the three salesmen
have almost equal chances of promotion.
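The sums of squares in this example involve very large numbers, so a script is a convenient cross-check. The sketch below follows the same computational formulas (TSS, SSB, SSW = TSS – SSB); the function name `one_way_anova` is an assumption, and the tabular F of 3.40 is taken from the solution.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) using the sum-of-squares formulas."""
    all_values = [x for g in groups for x in g]
    n_total, k = len(all_values), len(groups)
    correction = sum(all_values) ** 2 / n_total           # (sum of all X)^2 / N
    tss = sum(x * x for x in all_values) - correction     # total sum of squares
    ssb = sum(sum(g) ** 2 / len(g) for g in groups) - correction
    ssw = tss - ssb
    df_between, df_within = k - 1, n_total - k
    f_ratio = (ssb / df_between) / (ssw / df_within)      # F = MSB / MSW
    return f_ratio, df_between, df_within

sales_a = [12000, 10000, 10900, 18000, 16000, 14400, 14400, 15500, 18800]
sales_b = [15500, 12500, 12000, 13000, 14000, 15888, 12300, 15000, 19000]
sales_c = [12899, 16000, 15000, 12700, 15000, 13000, 12000, 16000, 16000]

f_ratio, df_b, df_w = one_way_anova([sales_a, sales_b, sales_c])
print(round(f_ratio, 5), df_b, df_w)   # 0.00974 2 24
print(f_ratio > 3.40)                  # False, so Ho is accepted
```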
t-test for Samples with Correlated Variances
When the observations are paired, then they are correlated and their variances are not independent estimates. In
this case, the t-test given by the formula below should be used.

\[ t = \frac{\left(V_1^2 - V_2^2\right)\sqrt{N - 2}}{\sqrt{4V_1^2V_2^2\left(1 - r_{12}^2\right)}} \]
The degree of freedom for this distribution is N – 2 and r12 is the correlation between the two variables. This will be
discussed in the next chapter.

Correlation Analysis
Introduction
The measure of relationship between two variables is called correlation. Correlation analysis is a method of
measuring the strength of such relationship between the two variables. “When two social, physical, or biological
phenomena increase or decrease proportionately and simultaneously because of identical external factors, the phenomena
are positively correlated. If one increases in the same proportion that the other decreases, the two phenomena are
negatively correlated. Investigators calculate the degree of correlation by applying a coefficient of correlation of data
concerning the two phenomena.” (Microsoft Corporation © 1993 – 2003)
The following are examples of correlated variables:
1. The students’ mental ability and academic performance in school are related.
2. There is a close relationship between reading comprehension and mathematical ability.
3. The larger the mass of a body, the greater the amount of heat energy required to melt it.
4. In physics, the larger the force exerted to push a body, the greater the acceleration of the body will be.
5. In the linear equation y = x + 1, the higher the value of x to be assigned, the higher the corresponding value of the
dependent variable y.
Pearson Product-Moment Correlation Coefficient
The most common statistical tool in measuring the linear relationship between two random variables, x and y, is
the linear correlation coefficient commonly called the Pearson Product-Moment Correlation Coefficient or Pearson r for
short. This formula was developed and perfected by Karl Pearson, a colleague of Francis Galton who made behavioral
studies of humans. It became the basis of different theories in the fields of heredity, psychology, anthropometry, and
statistics. It can be used to determine the linearity of the relationships between two variables.
\[ r = \frac{N\sum xy - \sum x \sum y}{\sqrt{\left[N\sum x^2 - \left(\sum x\right)^2\right]\left[N\sum y^2 - \left(\sum y\right)^2\right]}} \]
Note that the results of r should be interpreted only after its value has been found to be significant, as shown
below.
r Verbal Interpretation
0.00 to ± 0.20 slight correlation
±0.21 to ±0.40 low correlation
±0.41 to ±0.60 moderate correlation
±0.61 to ±0.80 high correlation
±0.81 to ±1.00 very high correlation
Example 1:
Test the hypothesis that there is no significant correlation between mental ability and English proficiency at 5% level of
significance.
Table 1
Mental Ability and English Proficiency Test Scores
Mental Ability (x) English Proficiency (y)
50 200
54 198
50 200
51 203
49 186
46 205
48 185
47 197
44 183
44 171
46 179
45 185
48 184
53 190
54 191
33 170
34 168
Solution:
1. Ho: There is no significant correlation between mental ability and English proficiency.
2. α = 5%
3. Pearson r will be used to test the hypothesis.
4. Computation
Mental Ability (x) English Proficiency (y) xy x² y²
50 200 10,000 2,500 40,000
54 198 10,692 2,916 39,204
50 200 10,000 2,500 40,000
51 203 10,353 2,601 41,209
49 186 9,114 2,401 34,596
46 205 9,430 2,116 42,025
48 185 8,880 2,304 34,225
47 197 9,259 2,209 38,809
44 183 8,052 1,936 33,489
44 171 7,524 1,936 29,241
46 179 8,234 2,116 32,041
45 185 8,325 2,025 34,225
48 184 8,832 2,304 33,856
53 190 10,070 2,809 36,100
54 191 10,314 2,916 36,481
33 170 5,610 1,089 28,900
34 168 5,712 1,156 28,224
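Σx = 796   Σy = 3,195   Σxy = 150,401   Σx² = 37,834   Σy² = 602,625
\[ r = \frac{N\sum xy - \sum x \sum y}{\sqrt{\left[N\sum x^2 - \left(\sum x\right)^2\right]\left[N\sum y^2 - \left(\sum y\right)^2\right]}} \]
\[ = \frac{17(150{,}401) - (796)(3{,}195)}{\sqrt{\left[17(37{,}834) - (796)^2\right]\left[17(602{,}625) - (3{,}195)^2\right]}} \]
= 0.727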
5. df = N – 2 = 17 – 2 = 15
6. Tabular Value = 0.482 (from Appendix C)
7. Reject the null hypothesis because the computed value, 0.727, is greater than the tabular value, 0.482.
8. There is a significant linear relationship between mental ability and English proficiency. The verbal interpretation of
r shows there is a high correlation between the two variables.
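The column sums and the value of r in this example can be verified with the sketch below. It applies the computational formula for Pearson r directly; the function name `pearson_r` is illustrative, and the tabular value 0.482 is copied from the solution.

```python
import math

def pearson_r(x, y):
    """Pearson r from the raw-score (computational) formula."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

mental = [50, 54, 50, 51, 49, 46, 48, 47, 44, 44, 46, 45, 48, 53, 54, 33, 34]
english = [200, 198, 200, 203, 186, 205, 185, 197, 183, 171,
           179, 185, 184, 190, 191, 170, 168]

r = pearson_r(mental, english)
print(sum(mental), sum(english))   # 796 3195
print(round(r, 3))                 # 0.727
print(abs(r) >= 0.482)             # True, so Ho is rejected
```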
Regression Analysis
Regression analysis is used when predicting the behavior of a variable. The regression equation explains the
amount of variation observable in the dependent variable y in terms of the independent variable x. It is an equation of a straight line in the form:
y = a + bx
where y = criterion measure
x = predictor
a = y-intercept, or the point where the regression line crosses the y-axis
b = beta weight or the slope of the line
To get the regression equation, the values of a and b are computed using the formulas below.
\[ a = \frac{\left(\sum y\right)\left(\sum x^2\right) - \left(\sum x\right)\left(\sum xy\right)}{n\sum x^2 - \left(\sum x\right)^2} \]
and
\[ b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2} \]
where n = number of pairs
Example 2:
The data in the table represent the memberships at a university mathematics club during the past 5 years.
Number of Years (x) Membership (y)
1 25
2 30
3 32
4 45
5 50
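Fit a line of the form y = a + bx to predict the membership 5 years from now.
Solution:
Σy = 182   Σx = 15   n = 5   Σx² = 55   Σxy = 611
\[ a = \frac{\left(\sum y\right)\left(\sum x^2\right) - \left(\sum x\right)\left(\sum xy\right)}{n\sum x^2 - \left(\sum x\right)^2} = \frac{182(55) - 15(611)}{5(55) - (15)^2} = 16.9 \]
\[ b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2} = \frac{5(611) - 15(182)}{5(55) - (15)^2} = 6.5 \]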
y = a + bx
y = 16.9 + 6.5x
Since you need to predict the membership five years from now, or at year 10, substitute 10 for x in the equation.
Thus, 5 years from now, y = 16.9 + 6.5(10) = 81.9 ≈ 82
Five years from now, the club would have 82 members.
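The intercept, slope, and prediction can be reproduced with a few lines of code. This is a sketch of the same least-squares formulas for a and b; the function name `fit_line` is hypothetical.

```python
def fit_line(x, y):
    """Return (a, b) for the regression line y = a + b*x."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(p * q for p, q in zip(x, y))
    sxx = sum(p * p for p in x)
    denominator = n * sxx - sx ** 2
    a = (sy * sxx - sx * sxy) / denominator
    b = (n * sxy - sx * sy) / denominator
    return a, b

years = [1, 2, 3, 4, 5]
members = [25, 30, 32, 45, 50]

a, b = fit_line(years, members)
prediction = a + b * 10            # year 10 is five years after year 5
print(round(a, 1), round(b, 1))    # 16.9 6.5
print(round(prediction, 1))        # 81.9
```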
The graphical presentation of the data is called a scatter plot or scatter diagram.

Fig. 1 Scatter plot or scatter diagram


The line drawn in the scatter plot is called the trend line. It is drawn in such a way that the sum of the vertical distances
above the line is the same as the sum of the vertical distances below the line. The line can be used to estimate the number
of members five years from now. From the figure, one can see that the estimate is about 85, which is very close to the
computed value of 82.
Spearman’s Rank Correlation Coefficient (r)
When the entries in the set of data are ranks, the Spearman’s Rank Correlation Coefficient r (also known as the
Spearman rho) will be used in hypothesis testing.

\[ r = 1 - \frac{6\sum D^2}{N\left(N^2 - 1\right)} \]
where N = number of pairs
D = difference between two sets of ranks
Example 3:
Ten instructors were rated by third- and fourth-year students on their “mastery of the subject matter” and the results were
tabulated. What is the Spearman rho value for the data? At 5% level of significance, determine if there is a significant
difference in the scores obtained by the teachers.
Table 2
Rates of Instructors
Instructor 3rd Yr (x) 4th Yr (y) Rx Ry D D²
1 44 46 4 3.5 0.5 0.25
2 45 43 3 6 -3 9
3 38 40 6 7 -1 1
4 32 30 9 10 -1 1
5 46 39 2 8 -6 36
6 47 37 1 9 -8 64
7 37 44 7 5 2 4
8 35 46 8 3.5 4.5 20.25
9 27 48 10 2 8 64
10 40 50 5 1 4 16
ΣD² = 215.5
Solution:
1. Ho: There is no significant difference between the ratings given to the ten instructors.
2. α = 5%
3. Spearman rho will be used to test Ho.
\[ r = 1 - \frac{6\sum D^2}{N\left(N^2 - 1\right)} = 1 - \frac{6(215.5)}{10\left(10^2 - 1\right)} = 1 - 1.31 = -0.31 \]
4. df = N – 2 = 10 – 2 = 8
5. Tabular value = 0.643 (from Appendix F)
6. Since the absolute value of the computed value of r, which is 0.31, is smaller than the tabular value, which is
0.643, the null hypothesis will be accepted.
7. There is no significant difference in the ratings given to the instructors.
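The ranking step, including the tied scores of 46, is the part most prone to slips, so a sketch is given here. It ranks from highest (rank 1) to lowest and gives tied scores the mean of their positions, as in Table 2. The helper names `average_ranks` and `spearman_rho` are assumptions for illustration.

```python
def average_ranks(values):
    """Rank from highest (1) to lowest; tied values share the mean rank."""
    ordered = sorted(values, reverse=True)
    ranks = {}
    for v in set(values):
        first = ordered.index(v) + 1        # first 1-based position of v
        count = ordered.count(v)            # how many times v is tied
        ranks[v] = first + (count - 1) / 2  # mean of the tied positions
    return [ranks[v] for v in values]

def spearman_rho(x, y):
    """Spearman coefficient from r = 1 - 6*sum(D^2) / (N*(N^2 - 1))."""
    n = len(x)
    d2 = [(rx - ry) ** 2 for rx, ry in zip(average_ranks(x), average_ranks(y))]
    return 1 - 6 * sum(d2) / (n * (n ** 2 - 1)), sum(d2)

third_year = [44, 45, 38, 32, 46, 47, 37, 35, 27, 40]
fourth_year = [46, 43, 40, 30, 39, 37, 44, 46, 48, 50]

rho, sum_d2 = spearman_rho(third_year, fourth_year)
print(sum_d2)              # 215.5
print(round(rho, 2))       # -0.31
print(abs(rho) >= 0.643)   # False, so Ho is accepted
```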
When N is 10 or greater, the significance of r can also be tested using a t value given by the formula:
\[ t = r\sqrt{\frac{N - 2}{1 - r^2}} \]
The degree of freedom for this t distribution is N – 2. In the previous example, N = 10, so df = 10 – 2 = 8. At 5% level
of significance,
\[ t = -0.31\sqrt{\frac{8}{1 - (-0.31)^2}} = -0.92 \]
Using the table of values of t in the appendix, t is 1.86 at 5% level of significance with df = 8 for one-tailed test. It
implies that the null hypothesis will be accepted since the absolute value of the computed value is less than the tabular
value. Hence, the value of r indicates that there is no significant difference in the ratings given by the students to their
instructors.

Chi-Square
Introduction
The Chi-Square test is used when treating ordinal data in the form of frequencies or proportions. In this test, the
observed frequencies are compared to the expected frequencies and the difference is tested at a desired level of
significance. The tabular values are given in the appendix. The formula is
\[ \chi^2 = \sum_{i=1}^{k} \frac{\left(O_i - E_i\right)^2}{E_i} \]
where Oi = observed frequencies
Ei = expected frequencies
Contingency Table
This table shows a cross-tabulation of the classes of observations with the frequencies for each class shown. A one-
way classification table is one where only one row of observations is given. In a two-way classification table, there are r
number of rows and k number of columns of observed frequencies.
Test of Independence
In this test, two variables are involved. One variable will be tested for independence to the other variable. A two-
way classification table will be used to tabulate the observed frequencies.
Example 1:
In a recent survey conducted by company A to determine the effectiveness of its hair shampoo, five groups of
female respondents were given questionnaires. Their answers are as follows:
Table 1
Survey Results
Group Strongly Approve Approve Disapprove Strongly Disapprove Total
I 12 5 12 11 40
II 5 14 14 11 44
III 5 6 12 11 34
IV 10 8 13 13 44
V 15 20 5 5 45
Total 47 53 56 51 207
Test the significance of the difference between the observed frequencies and the expected frequencies at 1% level of
significance.
Solution:
1. Ho: The responses of the five groups do not differ significantly.
2. α = 1%
3. Use Chi-Square test.
4. Solve first for the expected values using the formula
\[ E = \frac{(\text{row total})(\text{column total})}{\text{overall total}} \]
For example,
E11 = sum of the observed values in the first column times the sum of the observed values in the first row divided
by the grand total = (47)(40) / 207 = 9.082126
The values of the expected frequencies are as follows:
E11 = 9.082126 E21 = 10.24155 E31 = 10.82126 E41 = 9.855072
E12 = 9.990338 E22 = 11.2657 E32 = 11.90338 E42 = 10.84058
E13 = 7.719807 E23 = 8.705314 E33 = 9.198068 E43 = 8.376812
E14 = 9.990338 E24 = 11.2657 E34 = 11.90338 E44 = 10.84058
E15 = 10.21739 E25 = 11.52174 E35 = 12.17391 E45 = 11.08696
Substitute these values to the formula for χ².
\[ \chi^2 = \sum_{i=1}^{k} \frac{\left(O_i - E_i\right)^2}{E_i} \]
χ² = (12 – 9.082126)²/9.082126 + (5 – 9.990338)²/9.990338 + (5 – 7.719807)²/7.719807
+ (10 – 9.990338)²/9.990338 + (15 – 10.21739)²/10.21739 + (5 – 10.24155)²/10.24155
+ (14 – 11.2657)²/11.2657 + (6 – 8.705314)²/8.705314 + (8 – 11.2657)²/11.2657
+ (20 – 11.52174)²/11.52174 + (12 – 10.82126)²/10.82126 + (14 – 11.90338)²/11.90338
+ (12 – 9.198068)²/9.198068 + (13 – 11.90338)²/11.90338 + (5 – 12.17391)²/12.17391
+ (11 – 9.855072)²/9.855072 + (11 – 10.84058)²/10.84058 + (11 – 8.376812)²/8.376812
+ (13 – 10.84058)²/10.84058 + (5 – 11.08696)²/11.08696
= 0.937444457 + 2.492755836 + 0.958229929 + 0.000009344 + 2.238669407 + 2.682586757 + 0.663642427 +
0.84071911 + 0.946660792 + 6.238718512 + 0.128397985 + 0.369291363 + 0.85352956 + 0.101028063 + 4.227481942 +
0.133013754 + 0.002344407 + 0.821447978 + 0.430151775 + 3.341861253
χ² = 28.40798465
5. df = (c – 1)(r – 1) = (4 – 1)(5 – 1) = (3)(4) = 12
6. Tabular value = 26.217
7. Decision: Reject Ho since the computed value of Chi-Square is greater than the tabular value. Hence, the opinions
of the respondents differ.
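With twenty cells, the expected frequencies and the Chi-Square sum are tedious to compute by hand. The sketch below derives every expected value from the row and column totals and accumulates (O – E)²/E; the function name `chi_square_independence` is illustrative, and the tabular value 26.217 is taken from the solution.

```python
def chi_square_independence(table):
    """Return (chi-square, df) for a two-way table of observed frequencies."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # E = (row total)(column total) / overall total
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected
    df = (len(col_totals) - 1) * (len(row_totals) - 1)
    return chi2, df

# Rows are groups I to V; columns run from Strongly Approve to Strongly Disapprove
survey = [
    [12, 5, 12, 11],
    [5, 14, 14, 11],
    [5, 6, 12, 11],
    [10, 8, 13, 13],
    [15, 20, 5, 5],
]

chi2, df = chi_square_independence(survey)
print(round(chi2, 3), df)   # 28.408 12
print(chi2 > 26.217)        # True, so Ho is rejected
```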

Goodness of Fit
In this test, only one set of observed frequencies will be compared to a set of expected or theoretical frequencies
and a one-way classification table will be used to tabulate the frequencies.
Example 2:
A group of university deans rated 100 professors into three categories: excellent, satisfactory, and needs
improvement. If teaching effectiveness is normally distributed in the population of college professors, determine if the
distribution of ratings by the deans differ significantly from the expected ratings.
Table 2
Classification of Professors
Excellent Satisfactory Needs Improvement Total
Observed 10 85 5 100
Expected 15.87 68.26 15.87 100
In a single row of observed values, the expected frequencies can be computed using the following steps:
1. Divide the normal curve into three divisions, as shown below.
Fig. 1 Normal curve with three divisions (15.87%, 68.26%, 15.87%)
The middle region represents 68.26% of the total population and the outer regions represent 31.74% of the total
population, half of which (15.87%) is found on the left side and the other half is found on the right side.
2. Multiply the total number of observations (100) by the percentage values, so that the expected frequencies add up to the
same total as the observed frequencies.
E11 = (15.87%)(100) = 15.87
E12 = (68.26%)(100) = 68.26
E13 = (15.87%)(100) = 15.87
\[ \chi^2 = \sum_{i=1}^{k} \frac{\left(O_i - E_i\right)^2}{E_i} \]
= (10 – 15.87)²/15.87 + (85 – 68.26)²/68.26 + (5 – 15.87)²/15.87
= 2.1712 + 4.1053 + 7.4453
= 13.7218
Solution:
a. Ho: The actual ratings are not significantly different from the expected ratings.
b. Set the level of significance at 5%.
c. Use the Chi-Square test.
d. Computed value = 13.7218
e. df = 3 – 1 = 2
f. Tabular value = 5.991
g. Since the computed value is greater than the tabular value, Ho will be rejected.
h. Therefore, there is a significant difference between the observed and the expected values.
Chi-Square in Testing the Significance of the Difference Between Proportions
The Chi-Square is more convenient to use when testing the significance of the difference between two proportions
as compared to the z-test which was earlier discussed in this chapter.
The formula is:
\[ \chi^2 = \frac{N(AD - BC)^2}{(A + B)(C + D)(A + C)(B + D)} \]
where N = total number of cases
A, B, C, and D are the observed frequencies
Below is a sample table showing the positions of the frequencies A, B, C, and D.
A   B
C   D
Use the Chi-Square in testing the significance of the difference between the proportions of male and female voters
in the data below.
Example 3:
A sample survey of a presidential candidate in the Philippines shows that 120 of 200 male voters dislike candidate
X and 175 of 250 female voters dislike the same candidate.
Determine whether the difference between the two sample proportions, 120/200 and 175/250, is significant or not at
1% level of significance.
Gender Voters who Dislike the Candidate Voters who Like the Candidate Total
Male 120 (A) 80 (B) 200 (A + B)
Female 175 (C) 75 (D) 250 (C + D)
Total 295 (A + C) 155 (B + D) 450(N)
Computing for χ², we have:
\[ \chi^2 = \frac{N(AD - BC)^2}{(A + B)(C + D)(A + C)(B + D)} = \frac{450\left[(120)(75) - (80)(175)\right]^2}{200(250)(295)(155)} = 4.92 \]
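The degree of freedom is (2 – 1)(2 – 1) = 1. The tabular value of χ² is 6.635 at 1% level of significance. Since the computed
value, 4.92, is less than the tabular value, the difference between the proportions of male and female voters who dislike the
candidate is not significant at the 1% level, which agrees with the z-test result for the same data. At 5% level of significance,
where the tabular value is 3.841, the difference would be significant. In both groups, there are more voters who
dislike the candidate than those who like him.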
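The 2 × 2 shortcut formula can be checked with the sketch below, which reproduces the computed value of 4.92 from the four cell frequencies. The function name `chi_square_2x2` is hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square for a 2 x 2 table laid out as [[A, B], [C, D]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Example 3: male voters 120 dislike / 80 like, female voters 175 dislike / 75 like
chi2 = chi_square_2x2(120, 80, 175, 75)
print(round(chi2, 2))   # 4.92
```

The result can then be compared with the tabular value for df = 1 at whichever level of significance is being used.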
