
Data Management Notes

Notes developed by Prof. Raflyn Manuel-Guillermo, PhD.


Lesson 2
Population and Sample (including Data Collection)

POPULATION AND SAMPLE

Population – the totality of all objects, individuals, or perceptions whose unique properties or
characteristics are the subject of statistical inquiry.
Finite Population – can be counted with relative ease, and the number obtained is limited.
Examples are: number of students in a class, number of patients in a hospital, number of
registered voters in a municipality.
Infinite Population – cannot be counted easily because of the large number involved or because of the
nature of the data. Examples are: the number of hair strands, number of stars in the sky, the exact
Philippine population.

Sample – representative part of a population


Parameter – characteristic of a population
Statistic – characteristic of a sample

REASONS FOR SAMPLING


1. The fewer the units that have to be contacted, the lower the cost of collecting and processing
data.
2. A complete survey is sometimes physically impossible, as when the number of units is infinitely
large or when some of them are totally inaccessible.
3. A complete survey is senseless whenever the acquisition of the desired information destroys the
elementary units of interest.
4. A complete survey is senseless whenever it produces information that comes too late.
5. For a given cost, sampling can provide more detailed information than a complete enumeration
(total population).

Cochran's Formula

Determination of Sample Size Using Yamane's Formula (also known as Slovin's Formula)

n = N / (1 + N·e²)

Where: n = sample size
N = population size
e = desired margin of error (percent of non-precision because of the use of the
sample instead of the population)

Illustrative example:
A researcher would like to determine the research capability of graduate school students in 4 universities in
the city. Let us consider the following hypothetical data.

1. Determine the Population Size

Table 1

Universities    Population (N1)
SLU             800
UB              700
UC              600
BCU             400
As a Whole      2500

2. Assuming the margin of error is 0.03 or 3%, determine the sample size using Yamane's
(Slovin's) formula.

n = 2500 / (1 + 2500(0.03)²) = 2500 / 3.25 = 769.23, rounded up to 770

3. Determine the proportion of the sample size as to the population size.


p = n/N = 770/2500 = 0.308

4. Determine the sub-sample in every sub-population

Table 2

Universities    Population (N1)    Sample Size (n1)
SLU             800                (0.308)(800) = 246
UB              700                (0.308)(700) = 216
UC              600                (0.308)(600) = 185
BCU             400                (0.308)(400) = 123
As a Whole      2500               770
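
The four steps of the illustrative example can be sketched in Python. This is only a minimal sketch: the function names are made up for illustration, and rounding each sub-sample to the nearest whole number is an assumption that happens to reproduce Table 2. With other data, the rounded sub-samples may not add up exactly to n.

```python
import math

def slovin_sample_size(population_size, margin_of_error):
    """Yamane's (Slovin's) formula, n = N / (1 + N * e^2), rounded up."""
    raw = population_size / (1 + population_size * margin_of_error ** 2)
    return math.ceil(raw)

def proportional_allocation(strata, sample_size):
    """Give each sub-population the share p = n / N of its own size."""
    total = sum(strata.values())
    proportion = sample_size / total
    return {name: round(size * proportion) for name, size in strata.items()}

universities = {"SLU": 800, "UB": 700, "UC": 600, "BCU": 400}
n = slovin_sample_size(sum(universities.values()), 0.03)
allocation = proportional_allocation(universities, n)
print(n, allocation)
```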

TYPES OF SAMPLE
Random and Non-random Sampling

Random Sampling is the most commonly used sampling technique in which each member in the
population is given an equal chance of being selected in the sample. Non-random Sampling is a
method of collecting a small portion of the population by which not all members of the population
are given the same chance to be included in the sample. Certain elements in the population are
deliberately left out from the selection for varied reasons.

PROPERTIES OF RANDOM SAMPLING


1. Equiprobability
2. Independence

RANDOM SAMPLING TECHNIQUES

1. Simple Random Sampling - the random selection process allows no discretion to the investigator as
to which particular units in the population enter the sample
- it tends to avoid the problem of unrepresentativeness
a. Lottery or Fishbowl Sampling

b. Sampling using the Table of Random Numbers (or similar material)

2. Systematic Sampling – every k-th member of the population is selected after a random start, where k is the common sampling interval
- Usually used if the population size is known
k = N/n (N = population size; n = sample size)

3. Stratified Random Sampling – the population is subdivided into strata


a. Simple stratified
b. Stratified Proportional (see Table 2)

4. Cluster Sampling or Multi-stage Sampling – a procedure of selection in which the unit of selection (cluster) contains
two or more population members. It is used when the population is large and spread over a
geographical area in which smaller sub-regions are easily sampled, where a simple random or a
stratified random sample may not be carried out easily, or when the selection of individuals from the
population is impractical.
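
Systematic sampling with k = N/n can be sketched as follows. The numbered list of 2500 students is hypothetical, and flooring k and cutting the result down to n members is just one possible way of handling a ratio N/n that is not a whole number.

```python
import random

def systematic_sample(population, sample_size, rng=random):
    """Pick every k-th member after a random start, with k = N // n."""
    interval = len(population) // sample_size      # k = N/n, floored
    start = rng.randrange(interval)                # random start among the first k members
    return population[start::interval][:sample_size]

# Hypothetical numbered list of 2500 students, sample of 770
students = list(range(1, 2501))
sample = systematic_sample(students, 770, random.Random(0))
print(sample[:5])
```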

NON-RANDOM SAMPLING

1. Convenience sampling is probably the most common of all sampling techniques. With
convenience sampling, the samples are selected because they are accessible to the researcher.

Subjects are chosen simply because they are easy to recruit. This technique is considered
easiest, cheapest and least time consuming.

2. Consecutive sampling is very similar to convenience sampling except that it seeks to include ALL
accessible subjects as part of the sample. This technique can be
considered the best of all non-probability samples because it includes all subjects that are
available, which makes the sample a better representation of the entire population.
3. Quota sampling is a non-probability sampling technique wherein the researcher ensures equal
or proportionate representation of subjects depending on which trait is considered as basis
of the quota.

For example, if basis of the quota is college year level and the researcher needs equal
representation, with a sample size of 100, he must select 25 1st year students, another 25 2nd year
students, 25 3rd year and 25 4th year students. The bases of the quota are usually age, gender,
education, race, religion and socioeconomic status.

4. Judgmental sampling is more commonly known as purposive sampling. In this type of sampling,
subjects are chosen to be part of the sample with a specific purpose in mind. With judgmental
sampling, the researcher believes that some subjects are more fit for the research compared
to other individuals. This is the reason why they are purposively chosen as subjects.

5. Snowball sampling is usually done when there is a very small population size. In this type of
sampling, the researcher asks the initial subject to identify another potential subject who also
meets the criteria of the research. The downside of using a snowball sample is that it is hardly
representative of the population.

6. Incidental or Opportunity Sampling applies to samples that are taken because they are
the most available and willing.

METHODS OF DATA COLLECTION

A. Interview (Direct) Method – a method of person-to-person exchange between the interviewer
and the interviewee.

Positive:
1) It provides consistent and more precise information since clarification may be
given by the interviewee.
2) Questions may be repeated or modified to suit the interviewee's level of
understanding.

Negative:
1) Time-consuming
2) Expensive
3) Limited field coverage

B. Self-Enumeration (Indirect) Method – in this method written responses are given to prepared
questions. A questionnaire is used to elicit answers to the problems of the study.
Questionnaires may be mailed or hand-carried.

Positive:
1) Inexpensive
2) Can cover a wide area in a shorter span of time.
3) Respondents may feel a greater sense of freedom to express views and
opinions because their anonymity is maintained.

Negative:
1) There's a strong possibility of non-response, especially when questionnaires are
mailed.
2) Questions not easily understood may not be answered.

C. Registration Method – this method of gathering information is enforced by law,
e.g. registration of births, deaths, vehicles, and licenses.
Positive:
1) Information is kept systematized.
2) Information is always made available to the public.

D. Observation Method – the investigator observes the behavior of the subject/respondent. It is
used when the subjects cannot talk or write.

Positive:
The recording of behavior at the appropriate time and situation is made possible.

E. Experiment Method – this method is used when the objective is to determine the cause-and-effect
relationship of certain phenomena under controlled conditions. It is usually
used by scientific researchers.

Data Presentation (including Frequency Distribution Table)
FORMS OF DATA PRESENTATION

1. Textual – combines text and numerical facts in statistical reports.
Collected data may be organized and presented in a narrative or paragraph form.

Example:
Population to double in 29 years

Based on the Census of Population and Housing conducted decennially by the National
Statistics Office, the total population of the Philippines as of May 1, 2000 was
76,504,077 persons. This was higher by 7,887,541 persons or about 10.31 percent from
the 1995 census (with September 1, 1995 as reference date). It was 10 times the
Philippine population in 1903 when the first census was undertaken.

The expansion of the Philippine population reflected a 2.36 percent average annual
growth rate in the 1995-2000 period. This figure recorded a slight increase from a
declining growth rate which started in the first half of the seventies. The last increase
recorded in population growth rates was during the intercensal period 1948 to 1960 at
3.07 percent. The recent growth rate was 0.04 percentage point higher than the annual
growth during the early part of the nineties. If the average annual growth rate continues,
the population of the Philippines is expected to double in 29 years.

Source: NSO, 2000 Census of Population and Housing

2. Tabular – a more concise and systematic manner of presenting numerical facts compared to textual
form. Tabular presentation facilitates the analysis of relationships.

Example:

21 provinces in the country had more than one million population

Among the 78 provinces in the country, Pangasinan (2.43 million persons) of
Region I (Ilocos), was the largest in terms of population size. Cebu (2.38 million
persons), Bulacan (2.23 million persons), Negros Occidental (2.14 million persons)
and Cavite (2.06 million persons) followed. These were the provinces which
surpassed the two millionth population mark. Of the 21 provinces with more than
one million population, 13 provinces were in Luzon, five in Visayas and three in
Mindanao.
On the contrary, the four smallest provinces with less than a hundred
thousand population were Batanes (16.5 thousand persons), Camiguin (74.2
thousand persons), Siquijor (81.6 thousand persons) and Apayao (97.1 thousand
persons).

Population Distribution by Region: 2000 (Source: NSO, Various Censuses of Population and Housing )

Region    Total Population    Percent
Philippines 76,504,077 100.00
NCR 9,932,560 12.98
CAR 1,365,412 1.78
I - Ilocos 4,200,478 5.49
II - Cagayan Valley 2,813,159 3.68
III - Central Luzon 8,030,945 10.50
IV - Southern Tagalog 11,793,655 15.42
V - Bicol 4,686,669 6.13
VI - Western Visayas 6,211,038 8.12
VII - Central Visayas 5,706,953 7.46
VIII - Eastern Visayas 3,610,355 4.72
IX - Western Mindanao 3,091,208 4.04
X - Northern Mindanao 2,747,585 3.59
XI - Southern Mindanao 5,189,335 6.78
XII - Central Mindanao 2,598,210 3.40
XIII - Caraga 2,095,367 2.74
ARMM 2,412,159 3.15

3. Graphical Presentation – an effective means of organizing and presenting statistical data because the
important relationships are brought out more clearly and creatively in virtually solid and
colorful figures.
Example: More single men than women

About 43.89 percent of the total population 10 years and over were single, while 45.66 percent were
married. The remaining 10.45 percent were either widowed, separated/divorced, with other arrangements or
with unknown marital status.

Among the single persons, the proportion was higher for males (52.94 percent) than for females (47.06
percent). In contrast, the proportion for widowed was higher for females (75.72 percent) than for males
(24.28 percent).
* Some Commonly Used Graphs
a. Scatter Graph
- a graph used to present measurements or values that are thought to be related.

[Figure: Scatter Graph of the Amount of Garbage Discarded and Household Size. x-axis: Household Size; y-axis: Amount of Garbage Discarded (in kg)]

b. Line Chart
- a graphical presentation of data especially useful for showing trends over a period of time.

[Figure: line chart, Age at First Marriage in the United States]

Following a sharp decline during and after World War II, the age at which men and
women in the United States first marry has steadily increased. In the mid-1990s, the
age of first marriage for women was higher and closer to the age at which men first
marry than at any time in the previous 100 years.

[Figure: Comparative Line Graph on Membership of Students. x-axis: year levels I to IV; series: Male (300, 280, 240, 140) and Female (500, 420, 360, 260)]

c. Pie Chart
- a circular graph that is useful in showing how a total quantity is distributed among a
group of categories. The “pieces of the pie” represent the proportions of the total
that fall into each category.

[Figure: pie chart, 5 Leading Causes of Deaths of Filipinos During the Year 1995. Categories: Heart Disease, Cancer, Stroke, Accidents, Lung Disease]

[Figure: Pie Chart on Graduate School Students. Slices: SLU, 800; UB, 700; UC, 600; BCU, 400]

d. Bar Graph
- It represents the frequencies or magnitudes of the categories as
bars rising vertically from the horizontal axis, with the height of each bar
proportional to the frequency or magnitude of the corresponding category.
- It may be simple or compound, and can be vertically or horizontally arranged. It is
used for both qualitative and quantitative data.

[Figure: Bar Graph on Membership of Students. Horizontal bars for year levels I to IV; axis: Number of Students (0 to 60)]

e. Frequency Polygon

- A frequency polygon can be made from a line graph by shading in the area beneath the graph. It can be
made from a histogram by joining the midpoints of the tops of the columns.

[Figure: Frequency Polygon on Graduate School Students. x-axis: SLU, UB, UC, BCU; y-axis: 0 to 1000]

f. Histogram
- A histogram displays continuous data in ordered columns. Categories are of continuous measure
such as time, inches, temperature, etc.

[Figure: Histogram on Range of Salary. x-axis classes: Below P150, 150-199, 200-249, 250-299, 300-349, 350-399, 400-449, 450-499, 500 and up; y-axis: 0 to 30]

g. Stem and Leaf Display

- Stem and leaf plots record data values in rows, and can easily be made into a histogram. Large data
sets can be accommodated by splitting stems.

Stem and Leaf Display on Number of Students Attending School Activities

4 | 3 4
5 | 2 7 9 9 9 9
6 | 2 3 4 5 7 9
7 | 1 1 2 2 3 3 5 7
8 | 3 3 4 5 6 8 9
9 | 4

NOTE: 6 | 3 means 63
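
As a rough sketch, a stem and leaf display like the one above can be built by splitting each two-digit value into its tens digit (stem) and units digit (leaf). The list of values is read back from the display itself, and the function name is only illustrative.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group two-digit values by tens digit (stem) with sorted units digits (leaves)."""
    rows = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        rows[stem].append(leaf)
    return dict(rows)

# Values read back from the display above
attendance = [43, 44, 52, 57, 59, 59, 59, 59, 62, 63, 64, 65, 67, 69,
              71, 71, 72, 72, 73, 73, 75, 77, 83, 83, 84, 85, 86, 88, 89, 94]
for stem, leaves in stem_and_leaf(attendance).items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```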

h. Pictograph
- A pictograph uses an icon to represent a quantity of data values in order to decrease the size of the
graph. A key must be used to explain the icon.

[Figure: Pictograph on Strawberries picked]

* Some Commonly Used Tabular Presentations

a. Simple Table

Number of Graduate School Students from Four Major Universities

School Number of Students


SLU 800
UB 700
UC 600
BCU 400
Total 2500

b. Cross Tabulation
Number of Graduate School Students from Four Major Universities Classified According to Gender

              Gender
Year Level    Male    Female    Total
I 300 500 800
II 280 420 700
III 240 360 600
IV 140 260 400
Total 960 1540 2500

Range of Daily Wage of Workers in the Three Sectors of the Economy
                       Sector
Range of Daily Wage    Agriculture    Industrial    Service    Total
Below P150 8 5 9 22
150-199 7 7 3 17
200-249 7 9 6 22
250-299 3 10 9 22
300-349 3 9 13 25
350-399 1 7 6 14
400-449 0 8 12 20
450-499 0 5 3 8
500 and up 0 3 3 6
Total 29 63 64 156

Notes:
- Simple tables are commonly used for ordinal and nominal variables.
- A combination of the two levels of measurement can be used for cross tabulation.

FREQUENCY DISTRIBUTION TABLE

A Frequency Distribution is a grouping of data into mutually exclusive categories showing the number of
observations in each class.

Frequency Distribution Table for Ungrouped Data:

Example:

Construct a frequency distribution table for the following data.

5, 1, 3, 4, 2, 1, 3, 5, 4, 2, 1, 5, 1, 3, 2, 1, 5, 3, 3, 2.

Solution:

From the data, we observe that the numbers 1, 2, 3, 4 and 5 are repeated. Hence, under the number
column, write the five numbers 1, 2, 3, 4 and 5, one below the other.

Now read the numbers one by one and put a tally mark in the tally mark column against the number.
For example, the first number is 5, so put a tally mark ('|') against the number 5. The next number is 1, so
put a tally mark ('|') against the number 1. Continue the process till all the numbers are exhausted.

Add the tally marks against the numbers 1, 2, 3, 4 and 5 and write the total in the corresponding
frequency column. Now add all the numbers under the frequency column and write the sum against the total.

Number    Tally marks    Frequency
1         |||||          5
2         ||||           4
3         |||||          5
4         ||             2
5         ||||           4
Total                    20
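
The tallying procedure described in the solution can be checked with a short Python sketch. `Counter` from the standard library does the counting; the printed layout is only an approximation of the table.

```python
from collections import Counter

data = [5, 1, 3, 4, 2, 1, 3, 5, 4, 2, 1, 5, 1, 3, 2, 1, 5, 3, 3, 2]
frequency = Counter(data)          # maps each number to how often it occurs

print("Number  Tally   Frequency")
for number in sorted(frequency):
    count = frequency[number]
    print(f"{number:<7} {'|' * count:<7} {count}")
print(f"Total           {sum(frequency.values())}")
```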

Frequency Distribution Table for Grouped Data:

- Class Interval: obtained by subtracting the lower limit of a class from the lower
limit of the next class. The class intervals should be equal.
- Class Frequency: the number of observations in each class.
- Class Midpoint: a point that divides a class into two equal parts. This is the average of the upper
and lower class limits.
- Class Boundaries: the lower boundary is the lower limit less 0.5; the upper boundary is the upper limit plus 0.5.
- Relative Frequency: the ratio of the class frequency to the total frequency.
- Cumulative Frequency: the cumulative frequency corresponding to a particular value is the sum of all the frequencies up to
and including that value.

Example:

The following are the marks obtained by 50 students in a mathematics test. Prepare a frequency distribution
table for the data.

45 68 41 87 61 44 67 30 54 8 39 60 37 50 19 86 42 29 32 61 25 77 62 98 47 36 15 40 9 25
34 50 61 75 51 96 20 13 18 35 43 88 25 95 68 81 29 41 45 87

Solution:

To decide the length of the class interval so that it takes in all the scores given in the problem, we have to find the
largest value and the smallest value from the given scores. This we can do by merely going through all the
scores. Here the largest value is 98 and the smallest value is 8.

Step One: Decide on the number of classes

(You can use the formula 2^c > n, where c = desired number of classes and
n = number of observations.)
o There are 50 observations, so n = 50.
o Two raised to the 6th power is 64, which is greater than 50.
o Therefore, we should have at least 6 classes, i.e., c = 6.

Step Two: Determine the class size or width using the formula i ≥ R/c

Range = Largest value - Smallest value
R = 98 - 8 = 90
i ≥ R/c = 90/6 = 15
o The class size is 15.
o Set the lower limit of the first class at 5.

NOTE: The researcher may decide on the class width to use. It is then advisable to use an odd
number for the class width to have a whole number for the class midpoint. The number of classes
should not be too few (at least 5) and not too many (at most 20).

Frequency distribution table of the marks taken by 50 students in a mathematics test

Class Intervals    Class Boundaries    Frequency    Relative Frequency    Cumulative Frequency    Cumulative Frequency
                                                                          less than               greater than
(CI)               (cb)                (f)          (Rf)                  (cfb<)                  (cfb>)
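
The table for the 50 marks can be filled in with a sketch like the one below. It assumes the choices made in the steps above (first lower limit 5, class size 15) and integer data, so boundaries are the limits less or plus 0.5. The function and key names are illustrative only. Under these assumptions the sixth class ends at 94, so a seventh class (95-109) is needed to hold the marks 95, 96 and 98.

```python
marks = [45, 68, 41, 87, 61, 44, 67, 30, 54, 8, 39, 60, 37, 50, 19, 86, 42,
         29, 32, 61, 25, 77, 62, 98, 47, 36, 15, 40, 9, 25, 34, 50, 61, 75,
         51, 96, 20, 13, 18, 35, 43, 88, 25, 95, 68, 81, 29, 41, 45, 87]

# Step One: smallest c with 2^c > n
min_classes = 1
while 2 ** min_classes <= len(marks):
    min_classes += 1

def frequency_table(values, first_lower_limit, width):
    """Build one row per class: limits, boundaries, f, Rf, cf< and cf>."""
    n = len(values)
    rows, lower, running = [], first_lower_limit, 0
    while lower <= max(values):
        upper = lower + width - 1
        f = sum(lower <= v <= upper for v in values)
        running += f
        rows.append({
            "CI": (lower, upper),
            "cb": (lower - 0.5, upper + 0.5),
            "f": f,
            "Rf": f / n,
            "cf<": running,              # observations up to and including this class
            "cf>": n - running + f,      # observations from this class upward
        })
        lower += width
    return rows

table = frequency_table(marks, first_lower_limit=5, width=15)
for row in table:
    print(row)
```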

Measures of Central Tendency, Location and Variation
MEASURES OF CENTRAL TENDENCY
Central Tendency refers to the central or typical location of the scores in a distribution. The most important measures of central
tendency are (1) the mean, (2) the median, and (3) the mode.

A. MEAN
It is obtained by adding all the observations and dividing the sum by the number of observations;
thus it is called the computational average.

UNGROUPED DATA
1. The arithmetic mean or average is represented by µ for the population and x̄ for the sample:
µ = Σx / N (population); x̄ = Σx / n (sample)

Illustrative Example:
Suppose ten people you had chosen from those entering the campus have ages:
15, 25, 18, 15, 20, 25, 18, 18, 20, and 25
What is the mean age of these people?

For ungrouped data, such as the one given above, the formula easily applies. But for data where
an observation occurs more than once, the weighted mean could be used.

2. Weighted mean
a. Since there are scores that occur more than once, we may want to list down the scores as follows:

Scores (X)    Frequency (f)    fX
15            2
18            3
20            2
25            3
              ∑f = 10          ∑fX =

x̄ = Σ f_i x_i / Σ f_i

The answer is the same but the method is shorter.
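
A quick check in Python shows that the plain mean of the ten ages and the frequency-weighted mean agree. The variable names are illustrative.

```python
ages = [15, 25, 18, 15, 20, 25, 18, 18, 20, 25]
simple_mean = sum(ages) / len(ages)

# Same data as a frequency table: score -> frequency
frequency = {15: 2, 18: 3, 20: 2, 25: 3}
sum_fx = sum(score * f for score, f in frequency.items())
weighted_mean = sum_fx / sum(frequency.values())
print(simple_mean, weighted_mean)
```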

b. If each observation has a different weight, w will take the place of f in the formula, making it x̄ = Σ w_i x_i / Σ w_i

Illustrative Example:
Here are scores obtained by an applicant for a certain job. The weight for each criterion is given:

Criteria                  Grades (X)    Weight (w)    Xw
Academic qualification    80            30%
Personality               85            20%
Technical Skills          82            25%
Experience                88            15%
Recommendations           85            10%
Total                                   100%
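
The applicant's weighted score can be computed with a small helper. This is a sketch that assumes the weights are entered as decimal fractions of 100%.

```python
def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

grades = [80, 85, 82, 88, 85]
weights = [0.30, 0.20, 0.25, 0.15, 0.10]
applicant_score = weighted_mean(grades, weights)
print(round(applicant_score, 2))
```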

c. If items are to be rated based on a scale then r would take the place of w in the formula making it:

Illustrative Example:
Mall goers were asked to rate the level of effectiveness of the inspection done by security forces in
prohibiting crimes in shopping malls in the city.

Level of Effectiveness      Number of Mall goers
Very effective (4)          97
Moderately effective (3)    132
Least effective (2)         176
Not effective (1)           170

x̄ = Σ f_i x_i / Σ f_i

Thus, the weighted mean of ______________ falls under the rating ______ which means that the
inspection done at the mall is _________________ in prohibiting the occurrence of crime.
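
For the rating-scale case, the same idea applies with the rating as the value and the number of respondents as the frequency. The sketch below only computes the weighted mean. The notes do not give the ranges that map a mean to a verbal rating, so the interpretation is left open.

```python
# rating value -> number of mall goers who chose it
responses = {4: 97, 3: 132, 2: 176, 1: 170}
total_respondents = sum(responses.values())
rating_mean = sum(r * f for r, f in responses.items()) / total_respondents
print(total_respondents, round(rating_mean, 2))
```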

GROUPED DATA
In case of large groups the formulas stated above may not be very usable. The more practical thing
to do is to make a frequency distribution table first.

a. Short Method (Class Deviation Method)

Illustrative Example:

Consumers were asked to try a new cracker and provide feedback. Their ages were
recorded and grouped with 5 as the class size. What is the mean age of the consumers who
agreed to try the product?

Scores X f d fd
35-39 37 5 3
30-34 32 8 2
25-29 27 9 1
20-24 22 6 0
15-19 17 7 -1
10-14 12 4 -2
5-9 7 1 -3
TOTAL 40

Scores X f d fd
35-39 37 5 6
30-34 32 8 5
25-29 27 9 4
20-24 22 6 3
15-19 17 7 2
10-14 12 4 1
5-9 7 1 0
Total 40

b. Long Method (Midpoint Method)

Illustrative Example:

Scores X f fx
35-39 37 5
30-34 32 8
25-29 27 9
20-24 22 6
15-19 17 7
10-14 12 4
5-9 7 1
TOTAL 40
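
The two grouped-data methods can be compared in a short sketch. The formulas are assumptions, since they did not survive in these notes: the long method is taken as x̄ = Σfx / n, and the short method as x̄ = A + (Σfd / n)·i, where A is the midpoint of the class with d = 0 and i is the class size. Both choices of A used in the tables above (22 and 7) should give the same mean.

```python
# (class midpoint X, frequency f) for classes 5-9 up to 35-39
classes = [(7, 1), (12, 4), (17, 7), (22, 6), (27, 9), (32, 8), (37, 5)]
class_size = 5
n = sum(f for _, f in classes)

def mean_long_method():
    """Sum of f * X divided by n."""
    return sum(f * x for x, f in classes) / n

def mean_short_method(assumed_mean):
    """Assumed mean plus (sum of f * d / n) times the class size."""
    sum_fd = 0
    for x, f in classes:
        d = (x - assumed_mean) // class_size   # deviation in class-width steps
        sum_fd += f * d
    return assumed_mean + (sum_fd / n) * class_size

print(mean_long_method(), mean_short_method(22), mean_short_method(7))
```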

B. MEDIAN
The Median is the midpoint of the values after they have been ordered from the smallest to the largest
or from the largest to the smallest observation. As such it is a positional average.
There are as many values above the median as below it in the data array.

UNGROUPED DATA
Illustrative Examples:

- For an even number of values, the median is the arithmetic average of the two middle
numbers.
The heights of four basketball players, in inches, are: 76, 73, 80, 75.
Arranging the data in ascending order gives: 73, 75, 76, 80

- For an odd number of values, the median is the ((n+1)/2)th ranked observation.
The ages for a sample of five college students are: 21, 25, 19, 20, 22.
Arranging the data in ascending order gives: 19, 20, 21, 22, 25.
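
Both cases can be checked with `statistics.median` from the Python standard library, which sorts the data and applies the even or odd rule automatically.

```python
import statistics

heights = [76, 73, 80, 75]        # even count: average of the two middle values
ages = [21, 25, 19, 20, 22]       # odd count: the (n + 1) / 2 ranked value

median_height = statistics.median(heights)
median_age = statistics.median(ages)
print(median_height, median_age)
```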

GROUPED DATA
If instead of ungrouped data, a frequency distribution is given, we cannot identify the middle score.
But we can tell in what class interval it is found. Here is the way to find the median.
Illustrative Example:
Scores cb f Cf<
35-39 5
30-34 8
25-29 9
20-24 6
15-19 7
10-14 4
5-9 1
Total 40

Steps:

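1. Include the “cumulative frequency less than” column.
2. Since the median is the mid-score, take n/2. In this case 40/2 = 20.
The 20th observation falls within the cumulative frequency 27, so the class interval 25-29 is the median
class.
3. Use the formula
Md = Lb + ((n/2 - cf) / f) × i
where Lb = lower boundary of the median class, cf = cumulative frequency below the median class,
f = frequency of the median class, and i = class size.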

You will note that the answer is a value within the range of the median class. The median is not
necessarily close to the mean. In this given distribution, the median is slightly greater than the mean.
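
The steps for the grouped median can be sketched as below. The interpolation used, lower boundary + ((n/2 − cf) / f) × class size, is the commonly taught form and is an assumption about the formula intended in these notes. It does give a median slightly above the grouped mean of 24.75, as the text states.

```python
# (lower limit, upper limit, frequency), lowest class first
classes = [(5, 9, 1), (10, 14, 4), (15, 19, 7), (20, 24, 6),
           (25, 29, 9), (30, 34, 8), (35, 39, 5)]

def grouped_median(classes):
    """Interpolate the median inside the class that contains the n/2-th score."""
    n = sum(f for _, _, f in classes)
    cumulative = 0                       # cumulative frequency below the current class
    for lower, upper, f in classes:
        if cumulative + f >= n / 2:      # first class whose cf< reaches n/2
            lower_boundary = lower - 0.5
            width = upper - lower + 1
            return lower_boundary + ((n / 2 - cumulative) / f) * width
        cumulative += f

median = grouped_median(classes)
print(round(median, 2))
```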

C. MODE

UNGROUPED DATA

The mode is the value in the distribution that occurs the most number of times. As the most
frequently occurring observation, it is a nominal average.

Illustrative Example:


Look closely at the ungrouped data below:
A 15 12 4 9 6 10 5
B 15 12 4 12 6 12 5
C 15 12 4 15 4 6 5

For distribution A, the mode is ______.
For distribution B, the mode is ______.
For distribution C, the mode is ______.

Evidently a distribution can have no mode, one mode or more than one mode. Thus, the mode is not a very
reliable measure of central tendency. However, there are instances when no other measure can be used
except the mode, like when the data is nominal. In determining the prevalent gender, civil status, or highest
educational attainment only the mode can be used, because no numerical values are assigned to these
variables.
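
A small sketch can report all modes of an ungrouped data set, returning an empty list when every value occurs only once. The function name and the "no mode" convention are illustrative choices.

```python
from collections import Counter

def modes(values):
    """Return all most frequent values, or [] when every value occurs once."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []
    return sorted(v for v, c in counts.items() if c == highest)

a = [15, 12, 4, 9, 6, 10, 5]
b = [15, 12, 4, 12, 6, 12, 5]
c = [15, 12, 4, 15, 4, 6, 5]
print(modes(a), modes(b), modes(c))
```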

GROUPED DATA
If instead of ungrouped data, a frequency distribution is given, we cannot easily identify the TRUE
MODE score. But we can tell in what class interval it is found. Here is the way to find the mode.

Illustrative Example:
Scores cb f
35-39 34.5 – 39.5 5
30-34 29.5 – 34.5 8
25-29 24.5 – 29.5 9
20-24 19.5 – 24.5 6
15-19 14.5 – 19.5 7
10-14 9.5 - 14.5 4
5-9 4.5 - 9.5 1
Total 40

Steps:

1. Find the class or classes with the highest frequency.


2. Use the formula

It is possible that there is more than one mode in grouped data. Illustrative Example:

Scores cb f
35-39 34.5 – 39.5 9
30-34 29.5 – 34.5 8
25-29 24.5 – 29.5 9
20-24 19.5 – 24.5 6
15-19 14.5 – 19.5 7
10-14 9.5 - 14.5 4
5-9 4.5 - 9.5 9
Total 52
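
The formula for the grouped mode did not survive in these notes, so the sketch below assumes one commonly used interpolation: Mo = Lb + (d1 / (d1 + d2)) × i, where d1 and d2 are the differences between the modal class frequency and the frequencies of the classes just below and just above it (taken as 0 at either end of the table). Other textbooks use different formulas, so treat the numbers as illustrative only.

```python
def grouped_modes(classes):
    """classes: (lower limit, upper limit, frequency), lowest class first."""
    freqs = [f for _, _, f in classes]
    highest = max(freqs)
    result = []
    for k, (lower, upper, f) in enumerate(classes):
        if f != highest:
            continue
        below = freqs[k - 1] if k > 0 else 0
        above = freqs[k + 1] if k + 1 < len(freqs) else 0
        d1, d2 = f - below, f - above
        width = upper - lower + 1
        if d1 + d2 == 0:                 # flat neighbours: fall back to the midpoint
            result.append((lower + upper) / 2)
        else:
            result.append((lower - 0.5) + d1 / (d1 + d2) * width)
    return result

one_mode = [(5, 9, 1), (10, 14, 4), (15, 19, 7), (20, 24, 6),
            (25, 29, 9), (30, 34, 8), (35, 39, 5)]
three_modes = [(5, 9, 9), (10, 14, 4), (15, 19, 7), (20, 24, 6),
               (25, 29, 9), (30, 34, 8), (35, 39, 9)]
print(grouped_modes(one_mode), grouped_modes(three_modes))
```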

Example 2
A survey was conducted at a Cafe which sells food and coffees. The reason for the survey was that they
were having trouble keeping up with the demand for Cappuccino coffees during peak periods.
The Barista suggested that they get a bigger machine to cope with the high demand. A bigger machine is
very expensive to buy, and so the owner had a two day survey done to find out how many Cappuccinos
were being made per hour in the Cafe.
From the survey results, they would be able to do some Graphs and Statistics, and better understand the
current problem situation.

The Histogram shows an even spread of data, indicating that sometimes the Coffee Shop is very busy,
while other times they are making less than eight cappuccinos per hour.

We now want to find the Average Number of Cappuccinos made every hour. There are three types of
Averages: the Mean, the Median, and the Mode.

SUMMARY NOTES:
MEAN:
1. All the scores or measurements are considered in the computation of the mean.
2. Very high or very low scores or observations affect the mean.
3. If every score is increased, decreased, multiplied, or divided by a constant k, the mean is
increased, decreased, multiplied, or divided by the same constant k.

MEDIAN:
1. Only the middle scores or measurements are considered in the computation of the median.
2. Very high or very low scores or observations do not affect the median.

MODE:
1. It is very easy to compute but is seldom used because it is very unstable.
2. It is most appropriate for nominal scale as a measure of popularity.

B. MEASURES OF VARIABILITY
Variability refers to how "spread out" a group of scores or numerical data is. To see what we mean by
spread out, consider graphs in Figure 1. These graphs represent the scores on food taste. The mean score
for each product is 7.0. Despite the equality of means, you can see that the distributions are quite
different. Specifically, the scores on Product 1 are more densely packed and those on Product 2 are more
spread out. The differences among consumers' scores were much greater on Product 2 than on Product 1.

[Figure 1: distributions of food taste scores for Product 1 and Product 2]

The terms variability, spread, and dispersion are synonyms, and refer to how spread out a distribution is.
Just as in the section on central tendency where we discussed measures of the center of a distribution of
scores, in this lesson we will discuss measures of the variability of a distribution. There are four frequently
used measures of variability: the range, interquartile range, variance, and standard deviation.

a. Range

The range is the simplest measure of variability to calculate, and one you have probably encountered
many times in your life. The range is simply the highest score minus the lowest score.

Illustrative Examples

1. In {4, 6, 9, 3, 7} the lowest value is 3, and the highest is 9. So the range is 9 − 3 = 6


2. What is the range of the price increase of bottled drinks: ₱10, ₱2, ₱5, ₱6, ₱7, ₱3, ₱4?

3. The following are the numbers of customers of a car company over 10 weeks: 99, 45, 23, 67, 45, 91, 82, 78, 62, 51.
What is the range?
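
The three range exercises can be checked with a one-line helper, shown here as a sketch with illustrative names.

```python
def value_range(values):
    """Highest value minus lowest value."""
    return max(values) - min(values)

example = [4, 6, 9, 3, 7]
price_increases = [10, 2, 5, 6, 7, 3, 4]                    # pesos
weekly_customers = [99, 45, 23, 67, 45, 91, 82, 78, 62, 51]
print(value_range(example), value_range(price_increases), value_range(weekly_customers))
```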

b. Interquartile Range

The interquartile range (IQR) is a measure of variability, based on dividing a data set into quartiles.

Quartiles divide a rank-ordered data set into four equal parts. The values that separate the parts are called
the first, second, and third quartiles; and they are denoted by Q1, Q2, and Q3, respectively.

- Q1 is the "middle" value in the first half of the rank-ordered data set.
- Q2 is the median value in the set.
- Q3 is the "middle" value in the second half of the rank-ordered data set.

The interquartile range is equal to Q3 minus Q1.

Odd set of numbers

1. Find the IQR for the following data set: 1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27.

Steps:

- Step 1: Put the numbers in order.
1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27.
- Step 2: Find the median.
1, 2, 5, 6, 7, [9], 12, 15, 18, 19, 27. The median is 9.
- Step 3: Place parentheses around the numbers below and above the median.
Not necessary statistically, but it makes Q1 and Q3 easier to spot.
(1, 2, 5, 6, 7), 9, (12, 15, 18, 19, 27).
- Step 4: Find Q1 and Q3.
Think of Q1 as the median of the lower half of the data and Q3 as the median of the upper half
of the data.
(1, 2, 5, 6, 7), 9, (12, 15, 18, 19, 27). Q1 = 5 and Q3 = 18.
- Step 5: Subtract Q1 from Q3 to find the interquartile range.
18 – 5 = 13.

Even Set of Numbers


2. Find the IQR for the following data set: 3, 5, 7, 8, 9, 11, 15, 16, 20, 21.

- Step 1: Put the numbers in order.
3, 5, 7, 8, 9, 11, 15, 16, 20, 21.
- Step 2: Make a mark in the center of the data:
3, 5, 7, 8, 9, | 11, 15, 16, 20, 21.
- Step 3: Place parentheses around the numbers below and above the mark you made in Step 2;
it makes Q1 and Q3 easier to spot.
(3, 5, 7, 8, 9), | (11, 15, 16, 20, 21).
- Step 4: Find Q1 and Q3.
Q1 is the median (the middle) of the lower half of the data, and Q3 is the median (the middle) of the
upper half of the data.
(3, 5, 7, 8, 9), | (11, 15, 16, 20, 21). Q1 = 7 and Q3 = 16.
- Step 5: Subtract Q1 from Q3.
16 – 7 = 9.
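
The median-of-halves procedure used in both examples can be sketched as follows. For an odd number of values the overall median is left out of both halves, as in the first example. Other quartile conventions exist (statistical software often interpolates), so library functions may return slightly different quartiles for the same data.

```python
import statistics

def interquartile_range(values):
    """Return (Q1, Q3, IQR), with Q1 and Q3 as medians of the lower and upper halves.

    For an odd count the overall median is left out of both halves.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower, upper = ordered[:half], ordered[-half:]
    q1, q3 = statistics.median(lower), statistics.median(upper)
    return q1, q3, q3 - q1

odd_set = [1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27]
even_set = [3, 5, 7, 8, 9, 11, 15, 16, 20, 21]
print(interquartile_range(odd_set), interquartile_range(even_set))
```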

c. Standard Deviation and Variance

The standard deviation is the most commonly used measure of the spread of values in a
distribution; it measures how spread out the numbers are around the mean. The variance is
defined as the average of the squared differences from the mean, and the standard deviation is its square root.

UNGROUPED DATA

Population Standard Deviation and Variance
σ = √( Σ(x − µ)² / N )        σ² = Σ(x − µ)² / N

Sample Standard Deviation and Variance
s = √( Σ(x − x̄)² / (n − 1) )        s² = Σ(x − x̄)² / (n − 1)

1. Here are the minutes employees were late for the day:
4, 2, 5, 8, 6

Mean = 5

x     x - mean         (x - mean)^2
4     (4 - 5) = -1
2
5
8
6

The standard deviation of the minutes the employees were late is ___________.

2. Here are the ages of nine cash registers in a supermarket:


5, 6, 8, 9, 4, 7, 9, 8, 5

Sample Standard Deviation and Variance

Illustrative Example:
1. A recent survey I worked on asked a question about what users thought of the visual appeal of the
software. Users were given a five point rating scale (from strongly disagree to strongly agree).

Here are the responses from 18 users:

5, 5, 5, 5, 4, 5, 3, 4, 5, 5, 5, 5, 4, 5, 1, 2, 3, 4

Because the question was just written for the survey, there's no historical or comparative data. To find
more meaning in this jumble of numbers, the first thing you need to do is compute the mean and
standard deviation. While you won't necessarily report them, you'll need them for some of the subsequent
steps.

Standard deviation and variance are best used to compare groups:

Lunch Expense of a Certain Employee in 2 Weeks

             Week 1                  Week 2
Day          Expense   Deviation     Expense   Deviation    Squared
                       from mean               from mean    deviation
Monday       P100      0             P100      0            0
Tuesday      100       0             50        50           2500
Wednesday    100       0             50        50           2500
Thursday     100       0             200       100          10000
Friday       100       0             50        50           2500
Saturday     100       0             150       50           2500
Total        P600      0             P600                   20000
Mean         P100                    P100
Range        0                       P150

Week 1
Mean = P100
Range = P100 - P100 = 0
Standard Deviation: s = 0
Variance: s² = 0

Week 2
Mean = P100
Range = P200 - P50 = P150

Standard Deviation

$$s = \sqrt{\frac{20000}{6 - 1}} = 63.2455532$$

s = P63.25

Variance

$$s^2 = \frac{20000}{6 - 1} = 4000$$

s² = P4,000

Note: The sample formulas are used here because the expenses for the week are treated as a
sample that is representative of all the employee's lunch expenses.

If the expenses for week 2 are instead regarded as the entire population, the population formulas
apply.

Standard Deviation

$$\sigma = \sqrt{\frac{20000}{6}} = 57.73502692$$

σ = P57.74

Variance

$$\sigma^2 = \frac{20000}{6} = 3333.33333$$

σ² = P3,333.33
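
The difference between the sample and population versions can be checked with Python's standard `statistics` module. The sketch below uses the Week 2 expenses from the table above; the variable names are illustrative only.

```python
from statistics import pstdev, pvariance, stdev, variance

# Week 2 lunch expenses (in pesos), Monday to Saturday.
week2 = [100, 50, 50, 200, 50, 150]

# Sample versions divide the sum of squared deviations by n - 1.
sample_variance = variance(week2)
sample_sd = stdev(week2)

# Population versions divide by N.
population_variance = pvariance(week2)
population_sd = pstdev(week2)

print(round(sample_sd, 2), round(sample_variance, 2))
print(round(population_sd, 2), round(population_variance, 2))
```

The sample values are larger because the same sum of squared deviations (20000) is divided by 5 instead of 6.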

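GROUPED DATA

In the formulas for grouped data, $x$ is the class midpoint and $f$ is the class frequency.

Population Standard Deviation

$$\sigma = \sqrt{\frac{\sum f (x - \mu)^2}{N}}$$

Population Variance

$$\sigma^2 = \frac{\sum f (x - \mu)^2}{N}$$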

Sample Standard Deviation

$$s = \sqrt{\frac{\sum f (x - \bar{x})^2}{n - 1}}$$

Sample Variance

$$s^2 = \frac{\sum f (x - \bar{x})^2}{n - 1}$$

Illustrative Example:
The following is a tabular presentation of the ages of people who participated in a product evaluation.
The mean is 24.75.

RANGE OF AGES   X (midpoint)   F     X - mean   F(X - mean)^2
35-39           37             5     12.25      750.313
30-34           32             8     7.25       420.5
25-29           27             9     2.25       45.5625
20-24           22             6     -2.75      45.375
15-19           17             7     -7.75      420.438
10-14           12             4     -12.75     650.25
5-9             7              1     -17.75     315.063
TOTAL                          40               2647.5

Sample Standard Deviation

$$s = \sqrt{\frac{2647.5}{40 - 1}} = 8.23921061$$

s = 8.24

Sample Variance

$$s^2 = \frac{2647.5}{40 - 1} = 67.88461538$$
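
The grouped-data computation can be verified with a short Python sketch. It uses the class midpoints and frequencies from the table above and applies the sample formula (division by n - 1); the variable names are illustrative.

```python
from math import sqrt

# Class midpoints and frequencies from the age table.
midpoints = [37, 32, 27, 22, 17, 12, 7]
freqs = [5, 8, 9, 6, 7, 4, 1]

n = sum(freqs)
grouped_mean = sum(f * x for x, f in zip(midpoints, freqs)) / n

# Sum of f * (x - mean)^2 over all classes.
sum_sq = sum(f * (x - grouped_mean) ** 2 for x, f in zip(midpoints, freqs))

sample_variance = sum_sq / (n - 1)
sample_sd = sqrt(sample_variance)

print(n, grouped_mean, sum_sq)
print(round(sample_sd, 2), round(sample_variance, 2))
```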

Hypothesis Testing

Hypothesis Testing

Hypothesis testing is the operation of deciding whether or not a data set obtained from a random
sample supports a particular hypothesis. A hypothesis is an assertion or conjecture
about one or more parameters of a population; it may also concern the type or nature of the population,
or the distributional form of the characteristics of interest.

Steps in Hypothesis Testing

STEPS IN HYPOTHESIS TESTING BY THE CRITICAL VALUE APPROACH


1) Formulate the null hypothesis and the alternative hypothesis.
2) Set the level of significance, $\alpha$.
3) Select the appropriate test statistic.
4) Establish the critical region.
5) Compute the value of the test statistic.
6) Decision:
• Reject Ho if the value of the test statistic belongs to the critical region.
• Do not reject Ho if the value of the test statistic does not belong to the critical region.
7) Conclusion.
• Test statistic – a sample statistic computed from the data. The value of the test statistic is used
in determining whether or not the null hypothesis is rejected.
• Critical or rejection region – a range of test statistic values for which the null hypothesis should
be rejected. This range of values indicates that there is a significant or large enough
difference between the hypothesized parameter value and the corresponding point estimate for
the parameter.
• Critical value – the first (boundary) value of the critical region. The set of values that are not in the
critical region is called the region of acceptance (noncritical or non-rejection region).

STEPS IN HYPOTHESIS TESTING BY THE P-VALUE APPROACH


A p-value is the smallest significance level at which a null hypothesis may be rejected.
1) Formulate the null hypothesis and the alternative hypothesis.
2) Set the level of significance, $\alpha$.
3) Select the appropriate test statistic.
4) Compute the value of the test statistic.
5) Determine the p–value of the test statistic.
6) Decision:
• Reject Ho if the p–value is less than $\alpha$.
• Do not reject Ho if the p–value is greater than or equal to $\alpha$.

7) Conclusion.

Types of Hypotheses

A. Null hypothesis, Ho – represents a theory that has been put forward, either because it is believed
to be true or because it is used as a basis for argument. This assertion is held as true until there is
sufficient statistical evidence to conclude otherwise. It states that there is no difference between a
parameter(s) and a specific value.
B. Alternative hypothesis, Ha or H1 – an assertion of all situations not covered by the null
hypothesis. It states that there is a difference between a parameter(s) and a specific value.

Illustrative Examples:
State the null hypothesis and the alternative hypothesis to be used. (Note: The equal sign must be
in the null hypothesis, regardless of the statement.)
1. New software is being integrated into the teaching of a course with the hope that it will
help to improve the overall average score for this course. The historical average score for
this course is 70.
Ho:
Ha:

2. A real estate agent claims that the average price for homes in a certain subdivision is
₱1.8M. You believe that the average price is lower. You plan to test his claim by taking a
random sample of the prices of the homes in the subdivision; formulate the set of
hypotheses.
Ho:
Ha:

3. An advertisement on the TV claims that a certain brand of tire has an average lifetime of
50,000 miles. Suppose you plan to test this claim by taking a sample of tires and putting
them on test. What is the correct set of hypotheses to set up?
Ho:
Ha:

Type I and Type II Errors


Statistical            True state of the null hypothesis
Decision               Ho is true           Ho is false
Reject Ho              Type I ERROR         Correct decision
Do not reject Ho       Correct decision     Type II ERROR

• $\alpha$ - probability of committing a Type I error
- area of the critical region
- also called the level of significance
• $\beta$ - probability of committing a Type II error

Notes:
• A Type I error is committed when a true null hypothesis is rejected.
• A Type II error is committed when a false null hypothesis is not rejected.

One-tailed and Two-tailed Tests

A. One tailed test: A test of any statistical hypothesis where the alternative hypothesis is one-sided.
This is also known as a directional alternative hypothesis. It could be left-tailed or right-tailed
(Ha: > or <).
B. Two tailed test: A test of any statistical hypothesis where the alternative hypothesis is two-sided
(Ha: ≠). A non-directional alternative hypothesis is concerned with both sides of the distribution.

I. Testing Hypothesis Concerning One Population Mean

A. Known Population Standard Deviation: One-sample z-test

$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$$

where: $\bar{x}$ = sample mean
$\mu_0$ = hypothesized mean
$\sigma$ = population standard deviation
$n$ = sample size

If the sampling distribution is normal, the test is appropriate for any sample size.

Alternative Hypothesis | Critical Region | p-value
Ha: $\mu > \mu_0$ | Reject Ho if the computed test statistic is greater than $z_{\alpha}$. | Reject Ho if the p-value, $P(Z > z_c)$, is less than $\alpha$.
Ha: $\mu < \mu_0$ | Reject Ho if the computed test statistic is less than $-z_{\alpha}$. | Reject Ho if the p-value, $P(Z < z_c)$, is less than $\alpha$.
Ha: $\mu \neq \mu_0$ | Reject Ho if the computed test statistic is greater than $z_{\alpha/2}$ or less than $-z_{\alpha/2}$. | Reject Ho if the p-value, $2P(Z > |z_c|)$, is less than $\alpha$.
Note: $z_c$ is the computed test statistic.
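
As a minimal sketch of the two decision rules for a right-tailed z-test, the Python code below uses the standard library's `NormalDist`. The numbers (sample mean 52, hypothesized mean 50, population standard deviation 8, n = 64) are hypothetical and are not taken from any example in these notes.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z(sample_mean, mu0, sigma, n):
    """Computed test statistic z_c = (x_bar - mu0) / (sigma / sqrt(n))."""
    return (sample_mean - mu0) / (sigma / sqrt(n))

alpha = 0.05
# Hypothetical inputs, for illustration only.
z_c = one_sample_z(sample_mean=52.0, mu0=50.0, sigma=8.0, n=64)

# Critical value approach (Ha: mu > mu0): reject Ho if z_c > z_alpha.
z_alpha = NormalDist().inv_cdf(1 - alpha)
reject_by_critical_value = z_c > z_alpha

# p-value approach: reject Ho if P(Z > z_c) < alpha.
p_value = 1 - NormalDist().cdf(z_c)
reject_by_p_value = p_value < alpha

print(round(z_c, 3), round(z_alpha, 3), round(p_value, 4))
print(reject_by_critical_value, reject_by_p_value)
```

The two approaches always lead to the same decision; they differ only in whether the comparison is made on the scale of the test statistic or on the scale of probability.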

B. Unknown Population Standard Deviation: One-sample t-test


The t distribution is the appropriate basis for determining the standardized test statistic when the
sampling distribution of the mean is normally distributed but $\sigma$ is not known.

$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$

where: $\bar{x}$ = sample mean
$\mu_0$ = hypothesized mean
$s$ = sample standard deviation
$n$ = sample size

Condition for t-test to be used: The level of measurement for the dependent variable must be interval or
ratio.
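Alternative Hypothesis | Critical Region | p-value
Ha: $\mu > \mu_0$ | Reject Ho if the computed test statistic is greater than $t_{\alpha}$. | Reject Ho if the p-value, $P(T > t_c)$, is less than $\alpha$.
Ha: $\mu < \mu_0$ | Reject Ho if the computed test statistic is less than $-t_{\alpha}$. | Reject Ho if the p-value, $P(T < t_c)$, is less than $\alpha$.
Ha: $\mu \neq \mu_0$ | Reject Ho if the computed test statistic is greater than $t_{\alpha/2}$ or less than $-t_{\alpha/2}$. | Reject Ho if the p-value, $2P(T > |t_c|)$, is less than $\alpha$.
Note: 1) $t_c$ is the computed test statistic.
2) $t_{\alpha}$, $t_{\alpha/2}$ and the p-values are based on $n - 1$ degrees of freedom.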
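
A rough sketch of the one-sample t computation is shown below, using the movie-visit data from the first illustrative example that follows (hypothesized mean 9.0, two-tailed, alpha = 0.05). The critical value 2.131 is the standard t-table entry for 15 degrees of freedom at alpha/2 = 0.025; it is typed in by hand here because the Python standard library has no t distribution.

```python
from math import sqrt
from statistics import mean, stdev

# Number of movies seen in a theater by 16 randomly selected students.
movies = [7, 13, 0, 10, 2, 8, 4, 11, 0, 5, 16, 6, 2, 12, 9, 0]
mu0 = 9.0                      # hypothesized (national) mean

n = len(movies)
x_bar = mean(movies)           # sample mean
s = stdev(movies)              # sample standard deviation (n - 1 divisor)
t_c = (x_bar - mu0) / (s / sqrt(n))

# Table value t_(0.025, 15); an assumption taken from a standard t table.
T_CRITICAL = 2.131
reject_h0 = abs(t_c) > T_CRITICAL

print(x_bar, round(s, 4), round(t_c, 2), reject_h0)
```

The same pattern applies to one-tailed tests by comparing the signed statistic with the one-tailed table value instead of its absolute value.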

Illustrative Examples:
1. Suppose that the movie industry wants to know how often college students go to the movies. Sixteen
college students are selected at random and asked how many movies they have seen in a theater in the
previous year. The data are as follows: 7, 13, 0, 10, 2, 8, 4, 11, 0, 5, 16, 6, 2, 12, 9, 0.
(a) What is the mean number of movie theater visits by all college students in a year?
(b) Suppose that the national average for movie visits in a year is 9.0. If alpha is set to 0.05, can you
reject the null hypothesis that the population mean for college students is the same as the mean for
the general population, using a two-tailed test?
a. Ho:

Ha:
b. Let $\alpha$ =
Critical Value:
Critical region:

Decision Rule:
c. Computation (using appropriate test statistics)

d. Decision:

e. Conclusion and Interpretation:

2. A random sample of 75 eleven-year-olds performed a simple task and the time taken, x minutes, noted
for each. The results were summarized as follows:
∑ ∑ . Test, at the 0.01 level of significance, whether there is evidence that the mean
time taken to perform the task is greater than 15 minutes.
a. Ho:

Ha:

b. Let $\alpha$ =
Critical Value:
Critical region:

Decision Rule:
c. Computation (using appropriate test statistics)

d. Decision:

e. Conclusion and Interpretation:

3. An athlete finds that her times for running a race are normally distributed with mean 10.6 seconds. She
trains intensively for a week and then records her times in the next 5 races. Her times, in seconds, are 10.70,
10.65, 10.75, 10.80, 10.60. Is there evidence, at the 5% level of significance, that training intensively has
improved her times?
a. Ho:
b. Ha:
c. Let $\alpha$ =
Critical Value:

Critical region:

Decision Rule:
d. Computation (using appropriate test statistics)

e. Decision:
f. Conclusion and Interpretation:

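C. Testing Hypothesis Concerning Two Population Means

In each case the null hypothesis is H0: $\mu_1 - \mu_2 = d_0$ (for paired observations, H0: $\mu_D = d_0$).

1. $\sigma_1$ and $\sigma_2$ known

Value of test statistic:

$$z = \frac{(\bar{x}_1 - \bar{x}_2) - d_0}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}$$

Critical region:
Ha: $\mu_1 - \mu_2 > d_0$: Reject H0 if $z > z_{\alpha}$
Ha: $\mu_1 - \mu_2 < d_0$: Reject H0 if $z < -z_{\alpha}$
Ha: $\mu_1 - \mu_2 \neq d_0$: Reject H0 if $z < -z_{\alpha/2}$ or $z > z_{\alpha/2}$

2. $\sigma_1 = \sigma_2$ but unknown

Value of test statistic:

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - d_0}{s_p\sqrt{1/n_1 + 1/n_2}}$$

where

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}, \qquad v = n_1 + n_2 - 2$$

Critical region:
Ha: $\mu_1 - \mu_2 > d_0$: Reject H0 if $t > t_{\alpha}$
Ha: $\mu_1 - \mu_2 < d_0$: Reject H0 if $t < -t_{\alpha}$
Ha: $\mu_1 - \mu_2 \neq d_0$: Reject H0 if $t < -t_{\alpha/2}$ or $t > t_{\alpha/2}$

3. $\sigma_1 \neq \sigma_2$ but unknown

Value of test statistic:

$$t' = \frac{(\bar{x}_1 - \bar{x}_2) - d_0}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$$

where

$$v = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}$$

Critical region:
Ha: $\mu_1 - \mu_2 > d_0$: Reject H0 if $t' > t_{\alpha}$
Ha: $\mu_1 - \mu_2 < d_0$: Reject H0 if $t' < -t_{\alpha}$
Ha: $\mu_1 - \mu_2 \neq d_0$: Reject H0 if $t' < -t_{\alpha/2}$ or $t' > t_{\alpha/2}$

4. Paired observations

Value of test statistic:

$$t = \frac{\bar{d} - d_0}{s_d / \sqrt{n}}, \qquad v = n - 1$$

where:

$$s_d = \sqrt{\frac{n\sum d^2 - \left(\sum d\right)^2}{n(n - 1)}}, \qquad \bar{d} = \frac{\sum d}{n}$$

$d_0$ = hypothesized difference
$n$ = number of paired observations
$d$ = difference of two dependent values

Critical region:
Ha: $\mu_D > d_0$: Reject H0 if $t > t_{\alpha}$
Ha: $\mu_D < d_0$: Reject H0 if $t < -t_{\alpha}$
Ha: $\mu_D \neq d_0$: Reject H0 if $t < -t_{\alpha/2}$ or $t > t_{\alpha/2}$

Legend:
$\mu_1$ = mean of population 1
$\mu_2$ = mean of population 2
$\sigma_1$ = standard deviation of population 1
$\sigma_2$ = standard deviation of population 2
$\bar{x}_1$ = mean of sample taken from population 1
$\bar{x}_2$ = mean of sample taken from population 2
$s_1$ = standard deviation of sample 1
$s_2$ = standard deviation of sample 2
$d_0$ = hypothesized difference of $\mu_1$ and $\mu_2$
$s_p^2$ = pooled variance
$v$ = degrees of freedom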

**Computational conditions for the t-test:

For the equal-variance t test, the observations should be independent, random samples from normal
distributions with the same population variance. For the unequal-variance t test, the observations should
be independent, random samples from normal distributions.
(1) The level of measurement for the dependent variable must be interval or ratio; e.g., weight, income,
degrees of self-care, and level of treatment effect can be used as dependent variables.
(2) The level of measurement for the independent variable must be nominal, e.g., "minority and
nonminority groups", "race", "gender", and "experimental and control groups."

Illustrative Examples:
1. The following data represent the running times of films produced by two different motion picture
companies:

Company Time (in minutes)


A 102 68 98 109 92
B 81 165 97 143 92 78 114

Test the hypothesis that the average running time of films produced by Company B exceeds the
average running time of films produced by Company A by 10 minutes against the alternative that
the difference is more than 10 minutes. Use a 0.05 level of significance and assume that the
distribution of times is approximately normal and that the population variances are equal.
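
A possible sketch of the equal-variance (pooled) t computation for this example is given below. It treats Company B as population 1 and Company A as population 2, so that the hypothesized difference is 10 minutes. The critical value 1.812 is the standard t-table entry for 10 degrees of freedom at a one-tailed alpha of 0.05 and is entered by hand as an assumption.

```python
from math import sqrt
from statistics import mean, variance

# Running times (in minutes) from the table above.
company_a = [102, 68, 98, 109, 92]
company_b = [81, 165, 97, 143, 92, 78, 114]
d0 = 10                        # hypothesized difference (B minus A)

n_a, n_b = len(company_a), len(company_b)
df = n_a + n_b - 2

# Pooled variance: weighted average of the two sample variances.
sp2 = ((n_a - 1) * variance(company_a) + (n_b - 1) * variance(company_b)) / df

# Test statistic for Ho: mu_B - mu_A = 10 versus Ha: mu_B - mu_A > 10.
t_c = ((mean(company_b) - mean(company_a)) - d0) / (
    sqrt(sp2) * sqrt(1 / n_a + 1 / n_b)
)

T_CRITICAL = 1.812             # t_(0.05, 10), taken from a standard t table
reject_h0 = t_c > T_CRITICAL

print(df, round(sp2, 2), round(t_c, 2), reject_h0)
```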

2. Two Groups X and Y of freshman students, 28 in each group, are paired for age and score on
Form A of the Otis Group Intelligence Scale. Three weeks later, both groups are given Form B of
the same test. Before the second test, Group X, the experimental group, is praised for its
performance on the first test and urged to try to score better than the other group. Group Y, the
control group, was given the second test without comment. Will the incentive (praise) cause the
final scores of group X and Group Y to differ significantly? Test the hypotheses at 0.01 level of
significance given the information below:

                                           Group X    Group Y
Mean Scores on Form B Final Test           88.63      83.24
Standard Deviation on Form B Final Test    24.36      21.62

3. A researcher wishes to determine if vitamin E supplements could increase cognitive ability among
elderly women. In 1999, the researcher recruits a sample of elderly women aged 75-80. At the time
of enrollment in the study, the women were randomized to take either vitamin E or a placebo
for six months. At the end of the six-month period, the women were given a cognition test. Higher
scores on this test indicate better cognition. The mean and standard deviation of the test scores
of the 81 women who took vitamin E supplements were 27 and 6.9, respectively. The mean and
standard deviation of the test scores of the 90 women who took placebo supplements were 24 and
6.2. Compute a 95% confidence interval for the mean difference in cognition test scores between
the vitamin E and placebo groups. What would you conclude from these study results? Assume
unequal variances.

4. Many companies that cater to teenagers have learned that young people respond to commercials
that provide dance-beat music, adventure, and a fast pace, rather than words. In one test, a group
of 128 teenagers were shown commercials featuring rock music, and their purchasing frequency
of the advertised products over the following month were recorded as a single score for each
person in the group. Then a group of 212 teenagers were shown commercials for the same
products, but with the music replaced by verbal persuasion. The purchase frequency scores of this
group were computed as well. The results for the music group were ̅ and ; and
the results for the verbal group were ̅ and . Assume that the two groups were
randomly selected from the entire teenager consumer population. Using the level of
significance, test the null hypothesis that both methods are equally effective versus the alternative
hypothesis that they are not equally effective.

5. An instructor wanted to measure the basic math skills of his students before and after his college
algebra course. A skills test was administered at the beginning of the semester, and the scores
were recorded. At the end of the semester, he administered the same test and recorded the
scores. The table below shows the before-and-after scores for the test for the students who
remained in the course until the end of the semester. The maximum possible score on the test
was 100 points.

Student #   1    2    3    4    5    6    7    8    9
Before      61   58   79   69   62   71   25   48   53
After       68   62   83   65   62   74   31   52   51

Test the hypothesis that learning took place at 0.01 level of significance.
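
The paired-observations formulas can be sketched as follows for the before-and-after scores above. The differences are taken as After minus Before so that learning corresponds to a positive mean difference, with a hypothesized difference of 0. The critical value 2.896 is the standard t-table entry for 8 degrees of freedom at a one-tailed alpha of 0.01, entered by hand as an assumption.

```python
from math import sqrt

before = [61, 58, 79, 69, 62, 71, 25, 48, 53]
after = [68, 62, 83, 65, 62, 74, 31, 52, 51]

# d = difference of the two dependent values for each student.
d = [a - b for a, b in zip(after, before)]
n = len(d)
d_bar = sum(d) / n

# s_d = sqrt((n * sum(d^2) - (sum d)^2) / (n * (n - 1)))
s_d = sqrt((n * sum(x * x for x in d) - sum(d) ** 2) / (n * (n - 1)))

# Test statistic with hypothesized difference d0 = 0 and v = n - 1.
t_c = (d_bar - 0) / (s_d / sqrt(n))

T_CRITICAL = 2.896             # t_(0.01, 8), taken from a standard t table
reject_h0 = t_c > T_CRITICAL

print(d, round(d_bar, 3), round(s_d, 3), round(t_c, 2), reject_h0)
```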

6. Kindergarten students were the participants in a study conducted by Susan Bazyk et al. (A-39).
The researchers studied the fine motor skills of 37 children receiving occupational therapy. They
used an index of fine motor skills that measured hand use, eye–hand coordination, and manual
dexterity before and after 7 months of occupational therapy. Higher values indicate stronger fine
motor skills. The scores appear in the following table.

Subject Pre Post Subject Pre Post


1 91 94 20 76 112
2 61 94 21 79 91
3 85 103 22 97 100
4 88 112 23 109 112
5 94 91 24 70 70
6 112 112 25 58 76
7 109 112 26 97 97
8 79 97 27 112 112
9 109 100 28 97 112
10 115 106 29 112 106
11 46 46 30 85 112
12 45 41 31 112 112
13 106 112 32 103 106
14 112 112 33 100 100
15 91 94 34 88 88
16 115 112 35 109 112
17 59 94 36 85 112
18 85 109 37 88 97
19 112 112
Source: Data provided courtesy of Susan Bazyk, M.H.S.

Can one conclude on the basis of these data that after 7 months, the fine motor skills in a population of
similar subjects would be stronger? Let $\alpha$ = .05.

II. Testing Hypothesis Concerning Three Or More Population Means

A. Analysis of Variance
Analysis of Variance (ANOVA) is used to test hypotheses about three or more population
means rather than population variances. The F-test, named after R.A. Fisher, is used to test the
significance of the differences among the population means.

Assumptions underlying the use of the ANOVA


1. The individuals in the various subgroups should be selected on the basis of random sampling from
normally distributed populations.
2. The variances of the subgroups should be homogeneous.
( $\sigma_1^2 = \sigma_2^2 = \sigma_3^2 = \dots = \sigma_k^2$ )
3. The samples that constitute the groups should be independent.

The purpose of ANOVA, as the term implies, is to establish the variations (or sources of differences)
between groups and within groups. In comparing the groups, there are three possible sources of
variation, these are:
1. Variation between groups (column means or treatments).
2. Variation within groups (experimental error).
3. Total variation among the values of all groups.

When solving ANOVA problems, it is helpful to organize the terms that will be used in the
computations into a table called the ANOVA table.

The following steps should be followed when employing ANOVA:


1. State the null and alternative hypothesis
2. Level of significance
3. Test to be used
4. Establish the critical region: Reject Ho if $f > f_{\alpha}(v_1, v_2)$
where $v_1 = k - 1$, $v_2 = N - k$.
5. Computations:
a. SSC, SSE, SST
b. ANOVA table
6. Decision
7. Conclusion

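Analysis of Variance for One Way Classification

Source of Variation | Sum of Squares | Degrees of Freedom | Mean Square | Computed f
Column Means | SSC | k – 1 | MSC = SSC / (k – 1) | f = MSC / MSE
Error | SSE | N – k | MSE = SSE / (N – k) |
Total | SST | N – 1 | |

$$SST = \sum_{i=1}^{k}\sum_{j=1}^{n_i} x_{ij}^2 - \frac{T_{..}^2}{N}$$

$$SSC = \sum_{i=1}^{k} \frac{T_{i.}^2}{n_i} - \frac{T_{..}^2}{N}$$

$$SSE = SST - SSC$$

where: $n_i$ = sample size of the ith group (a common $n$ when all groups have the same size)
$N$ = total number of observations
$k$ = number of groups
$T_{..}$ = grand total
$T_{i.}$ = total of the ith group/column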

B. Post Hoc Analysis


A significant F ratio tells us that there are differences between at least one pair of means. The
purpose of post hoc analysis is to find out exactly where those differences are. A variety of different
types of post hoc analysis allow us to make multiple pairwise comparisons and determine which pairs
are significantly different and which are not. The interpretation of this analysis is similar to that of the
two-sample t test.

1. For Equal Sample Size – use Tukey’s HSD


Tukey's HSD (honestly significant difference) is one of the most popular procedures used in post
hoc analysis. This test is used to test the hypothesis that all possible pairs of means are equal. To
perform this multiple comparison test, we select an overall significance level, which denotes the
probability that one or more of the null hypotheses is falsely rejected. Those pairs whose differences
exceed the HSD are considered significantly different. The formula for computing HSD is:

$$HSD = q\sqrt{\frac{MSE}{n}}$$

where: $q$ = value from Tukey's table of critical values
$MSE$ = error mean square
$n$ = sample size per group

2. For Unequal Sample Size – use Scheffe’s Multiple Comparison Test


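Scheffe's Multiple Comparison test will be computed using the formula:

$$\text{Critical } S_{RANGE} = s \cdot \hat{\sigma}_{i-j}$$

where:

$$s = \sqrt{(k - 1)\,F_{(\alpha, v_1, v_2)}}$$

$$\hat{\sigma}_{i-j} = \sqrt{MSE\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$$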

Illustrative Examples:
1. The following represent the number of hours of pain relief provided by 4 different brands of
headache tablets administered to 20 subjects. The 20 subjects were randomly divided into 4
groups and each group was treated with a different brand. Test the hypothesis at the 0.05 level of
significance that the mean number of hours of relief provided by the tablets is the same for all four
brands.

          Tablets
          A      B      C      D
          5      9      3      2
          4      7      5      3
          8      8      2      4
          6      6      3      1
          3      9      7      4
Total     26     39     20     14
Mean      5.2    7.8    4.0    2.8

Solution:
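1. State the Null and Alternative hypothesis:
Ho: $\mu_A = \mu_B = \mu_C = \mu_D$
Ha: at least two of the means are not equal
2. Level of significance: $\alpha$ = 0.05
3. Test Statistic: The test follows the F-distribution
4. Establish the critical region/Decision Rule:
v1 = k – 1 = 4 – 1 = 3
v2 = N – k = 20 – 4 = 16
Reject Ho if computed f-value > 3.24
5. Computation
a. $T_{..} = 99$
b. $\sum\sum x_{ij}^2 = 603$
c. SST:

$$SST = 603 - \frac{99^2}{20} = 112.95$$

d. SSC:

$$SSC = \frac{26^2 + 39^2 + 20^2 + 14^2}{5} - \frac{99^2}{20} = 68.55$$

e. SSE = SST – SSC = 112.95 – 68.55 = 44.40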


ANOVA Summary Table
Source of Variation | Sum of Squares | Degrees of Freedom | Mean Square | Computed f
Column Means | 68.55 | 3 | 22.85 | 8.22
Error | 44.4 | 16 | 2.78 |
Total | 112.95 | 19 | |

6. Decision: Since f = 8.22 > 3.24, reject Ho.

7. Conclusion: At least two of the means are not equal.
8. Post Hoc Analysis: use Tukey's HSD (equal sample size)
a. Determine the critical Trange:

$$T_{RANGE} = q\sqrt{\frac{MSE}{n}} = 4.05\sqrt{\frac{2.78}{5}} = 3.02$$

b. Compare the absolute mean difference to that of Trange (consider the table below; S = significant,
NS = not significant)

Pairs | Absolute Mean Difference | Critical TRANGE | Description
A–B | 2.6 | < 3.02 | NS
A–C | 1.2 | < 3.02 | NS
A–D | 2.4 | < 3.02 | NS
B–C | 3.8 | > 3.02 | S
B–D | 5.0 | > 3.02 | S
C–D | 1.2 | < 3.02 | NS

Interpretation: Pairs B–C and B–D are significantly different. The mean number of hours of relief
provided by B is significantly different from the mean number of hours of relief provided by C. Also, the
mean number of hours of relief provided by B is significantly different from the mean number of hours of
relief provided by D.
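
The whole computation for this example, including the Tukey comparison, can be sketched in Python as below. The helper `one_way_anova` is a hypothetical name; the value q = 4.05 is the table value used in the solution above. Because the code keeps MSE unrounded (2.775), the computed f comes out as 8.23 rather than the 8.22 obtained from the rounded 2.78.

```python
from itertools import combinations
from math import sqrt

# Hours of pain relief for the four tablet brands.
tablets = {
    "A": [5, 4, 8, 6, 3],
    "B": [9, 7, 8, 6, 9],
    "C": [3, 5, 2, 3, 7],
    "D": [2, 3, 4, 1, 4],
}

def one_way_anova(groups):
    """Return SST, SSC, SSE, MSE and the computed f for a one-way layout."""
    values = [x for g in groups.values() for x in g]
    big_n, k = len(values), len(groups)
    correction = sum(values) ** 2 / big_n              # T..^2 / N
    sst = sum(x * x for x in values) - correction
    ssc = sum(sum(g) ** 2 / len(g) for g in groups.values()) - correction
    sse = sst - ssc
    mse = sse / (big_n - k)
    f = (ssc / (k - 1)) / mse
    return {"SST": sst, "SSC": ssc, "SSE": sse, "MSE": mse, "f": f}

anova = one_way_anova(tablets)

# Tukey's HSD for equal group sizes; q = 4.05 is the table value from the text.
Q_CRITICAL = 4.05
n_per_group = 5
hsd = Q_CRITICAL * sqrt(anova["MSE"] / n_per_group)

means = {name: sum(g) / len(g) for name, g in tablets.items()}
significant_pairs = [
    (a, b) for a, b in combinations(means, 2) if abs(means[a] - means[b]) > hsd
]

print({key: round(val, 2) for key, val in anova.items()})
print(round(hsd, 2), significant_pairs)
```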

2. A large marketing firm owns many photocopy machines, several of each of different models. Over
the last six months, the office manager has tabulated for each machine the average number of
minutes per week that it is out of service due to repairs, resulting in the following data:

                                          Total    Mean
Model A:   56   68   42   82   70         318      63.6
Model B:   74   77   92   54              297      74.25
Model C:   25   36   56   44   48   38    247      41.17

Test at the 0.01 level of significance whether the differences among the three sample means are significant.

Solution:
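1. State the Null and Alternative hypothesis:
Ho: $\mu_A = \mu_B = \mu_C$
Ha: at least two of the means are not equal
2. Level of significance: $\alpha$ = 0.01
3. Test Statistic: The test follows the F-distribution
4. Establish the critical region/Decision Rule:
v1 = k – 1 = 3 – 1 = 2
v2 = N – k = 15 – 3 = 12
Reject Ho if computed f-value > 6.93
5. Computation:
a. $T_{..} = \sum\sum x_{ij} = 862$
b. $\sum\sum x_{ij}^2 = 54,674$
c. SST:

$$SST = 54,674 - \frac{862^2}{15} = 5137.73$$

d. SSC:

$$SSC = \left(\frac{318^2}{5} + \frac{297^2}{4} + \frac{247^2}{6}\right) - \frac{862^2}{15} = 2908.95$$

e. SSE = SST – SSC = 5137.73 – 2908.95 = 2228.78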

ANOVA Summary Table

Source of Variation | Sum of Squares | Degrees of Freedom | Mean Square | Computed f
Column Means | 2908.95 | 2 | 1454.48 | 7.831
Error | 2228.78 | 12 | 185.73 |
Total | 5137.73 | 14 | |

6. Decision: Since f = 7.831 > 6.93, reject Ho.


7. Conclusion: At least two of the means are not equal.
8. Post Hoc Analysis: use Scheffe's Multiple Comparison Test (unequal sample size)
a. Determine the critical Srange

$$s = \sqrt{(k - 1)\,F_{(\alpha, v_1, v_2)}} = \sqrt{(3 - 1)(6.93)} = 3.723$$

$$\hat{\sigma}_{A-B} = \sqrt{\left(\frac{1}{5} + \frac{1}{4}\right)185.73} = 9.14$$

$$\hat{\sigma}_{A-C} = \sqrt{\left(\frac{1}{5} + \frac{1}{6}\right)185.73} = 8.25$$

$$\hat{\sigma}_{B-C} = \sqrt{\left(\frac{1}{4} + \frac{1}{6}\right)185.73} = 8.80$$

b. Compare the absolute mean difference to that of critical Srange (consider the table below)

Pairs | Absolute Mean Difference | Critical SRANGE ($s \cdot \hat{\sigma}$) | Description
A–B | 10.65 | < 34.00 | NS
A–C | 22.43 | < 30.69 | NS
B–C | 33.08 | > 32.74 | S

Interpretation: Pair B–C is significantly different. The mean number of minutes per week that
Model B is out of service due to repairs is significantly different from that of Model C.
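
For unequal group sizes, the Scheffe comparison can be sketched as follows. The F critical value 6.93 is the table value used in the solution above; all variable names are illustrative. Small differences from the hand-computed critical ranges are due to rounding of the intermediate values in the text.

```python
from itertools import combinations
from math import sqrt

# Minutes per week out of service, by photocopier model.
downtime = {
    "A": [56, 68, 42, 82, 70],
    "B": [74, 77, 92, 54],
    "C": [25, 36, 56, 44, 48, 38],
}
F_CRITICAL = 6.93              # F(0.01; 2, 12), table value from the text

values = [x for g in downtime.values() for x in g]
big_n, k = len(values), len(downtime)
correction = sum(values) ** 2 / big_n
sst = sum(x * x for x in values) - correction
ssc = sum(sum(g) ** 2 / len(g) for g in downtime.values()) - correction
mse = (sst - ssc) / (big_n - k)

# Scheffe multiplier s = sqrt((k - 1) * F).
s = sqrt((k - 1) * F_CRITICAL)
means = {name: sum(g) / len(g) for name, g in downtime.items()}

comparisons = {}
for a, b in combinations(downtime, 2):
    diff = abs(means[a] - means[b])
    # Critical range = s * sqrt(MSE * (1/n_a + 1/n_b)).
    critical = s * sqrt(mse * (1 / len(downtime[a]) + 1 / len(downtime[b])))
    comparisons[(a, b)] = (diff, critical, diff > critical)

significant_pairs = [pair for pair, row in comparisons.items() if row[2]]

for pair, (diff, critical, sig) in comparisons.items():
    print(pair, round(diff, 2), round(critical, 2), "S" if sig else "NS")
```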

III. Correlation
Correlation analysis is a group of techniques used to measure the strength of the
association/relationship between variables.

A. Pearson Correlation Coefficient


The degree of linear association/relationship between two variables (at least of interval scale)
is measured by a correlation coefficient, denoted by $r$. It is sometimes called the Pearson correlation
coefficient (Pearson product moment correlation coefficient) in honor of its developer. If a curved line
is needed to express the relationship, other and more complicated measures of the correlation must
be used.
The correlation coefficient is measured on a scale that varies from $-1$ through 0 to $+1$.
Perfect correlation between two variables is expressed by either $+1$ or $-1$. Positive values indicate a
relationship between the $x$ and $y$ variables such that as values for $x$ increase, values for $y$ also increase.
Negative values indicate a relationship between $x$ and $y$ such that as values for $x$ increase, values for $y$
decrease. If there is no linear correlation or a weak linear correlation, $r$ is close to 0.
FORMULA:

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}}$$

Interpretation of 𝒓:
Absolute value of 𝒓 | Interpretation
1.0 | Perfect (Positive/Negative) Correlation
0.80 – 0.99 | Very Strong (Positive/Negative) Correlation
0.60 – 0.79 | Strong (Positive/Negative) Correlation
0.40 – 0.59 | Moderate (Positive/Negative) Correlation
0.20 – 0.39 | Weak (Positive/Negative) Correlation
0.01 – 0.19 | Very Weak (Positive/Negative) Correlation
0.0 | No Correlation
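
The computational formula for the Pearson coefficient can be sketched in Python as below. The data are the price and sales figures from the tie-shop problem in the illustrative examples of this section; the function name `pearson_r` is hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient from the raw-sum formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator

# Prices and number of ties sold in the 10 sales promotions.
prices = [649, 699, 749, 799, 849, 899, 949, 999, 1049, 1099]
ties_sold = [187, 149, 155, 148, 130, 132, 90, 99, 69, 51]

r = pearson_r(prices, ties_sold)
print(round(r, 2))
```

A value this close to -1 would fall in the "very strong negative" band of the interpretation table: higher prices go with fewer ties sold.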

B. Spearman Rank Correlation

Alternatively, the Spearman rank correlation (a non-parametric measure) is used for variables that
may be quantitative discrete or ordered categorical. Observations are replaced by their ranks in the
calculation of the correlation coefficient. It is used to determine a possible correlation (consistency)
between two ordinal variables.
This results in a simple formula for Spearman's rank correlation, $r_s$:

$$r_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$$

where:
$d$ = difference in the ranks of the two variables for a given respondent
$n$ = number of pairs of values of $x$ and $y$
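
A small Python sketch of the rank-difference formula is shown below. It uses the sales-contact data from the illustrative examples of this section purely as sample input, and the simple `ranks` helper assumes there are no tied values (ties would need averaged ranks, which this sketch does not handle).

```python
def ranks(values):
    """Rank from 1 (smallest) upward; assumes no tied values."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman_rs(x, y):
    """Spearman rank correlation: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Sales contacts and sales made by 9 salespersons.
contacts = [71, 64, 100, 105, 75, 79, 82, 68, 110]
sales = [25, 14, 37, 40, 18, 10, 22, 12, 42]

rs = spearman_rs(contacts, sales)
print(ranks(contacts), ranks(sales), rs)
```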

Illustrative Examples:
1. A men's tie shop ran 10 sales promotions to determine the number of men's neckties of a certain
type that customers would buy at various prices. Following are the sales results:

Price, x    Number of ties sold, y    Price, x    Number of ties sold, y
649         187                       899         132
699         149                       949         90
749         155                       999         99
799         148                       1,049       69
849         130                       1,099       51
Calculate the coefficient of correlation.

2. The following are the numbers of sales contacts made by 9 salespersons during a week and the
number of sales made. Compute the correlation coefficient.
Sales-person 1 2 3 4 5 6 7 8 9
Sales contact 71 64 100 105 75 79 82 68 110
Sales 25 14 37 40 18 10 22 12 42

IV. Linear Regression

A. Simple Linear Regression


A simple linear regression attempts to model the relationship between two variables by fitting a
linear equation to observed data. One variable is considered to be an explanatory variable, and the
other is considered to be a dependent variable.
• Dependent variable – the variable that is being estimated or predicted.
• Independent variable – the variable that provides a basis for estimation. It is the predictor
variable.

The linear regression model postulates that

$$y = \beta_0 + \beta_1 x + \varepsilon$$

where:
$y$ = dependent/response variable
$x$ = independent/explanatory variable
$\beta_0$ and $\beta_1$ = regression coefficients
$\beta_0$ = $y$-intercept of the regression line
$\beta_1$ = slope of the regression line
$\varepsilon$ = residual/random error

In general, the goal of linear regression is to find the line that best predicts $y$ from $x$, that is, to find
the line $\hat{y} = b_0 + b_1 x$ that best estimates the regression model by determining $b_0$ and $b_1$
that best estimate $\beta_0$ and $\beta_1$.

**Note that linear regression assumes that the data are linear and it finds the slope and intercept that
make a straight line best fit the data.

B. Method of Least Squares


The slope:

$$b_1 = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}$$

The $y$-intercept:

$$b_0 = \frac{\sum y - b_1 \sum x}{n} = \bar{y} - b_1\bar{x}$$

The goal of linear regression is to adjust the values of the slope and intercept to find the line
that best predicts $y$ from $x$. More precisely, the goal of regression is to minimize the sum of the
squares of the vertical distances of the points from the line.

C. The Coefficient of Determination


The coefficient of determination, $r^2$, is used to determine the proportion of the variance
(fluctuation) of one variable that is predictable from the other variable. It allows us to determine how
certain one can be in making predictions from a certain model/graph.
The coefficient of determination has values from 0 to 1, and measures how well the
regression line represents the data. It represents the proportion of the total variation in $y$ that is
explained by the line of best fit.
For example, if $r$ = 0.922, then $r^2$ = 0.850, which means that 85% of the total variation in $y$ can
be explained by the linear relationship between $x$ and $y$. The other 15% of the total variation in $y$
remains unexplained. If the regression line passes exactly through every point on the scatter plot, it
would be able to explain all of the variation. The further the line is away from the points, the less it is
able to explain.
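
The least-squares slope and intercept and the coefficient of determination can be sketched together in Python. The data are the advertising and sales figures from the first illustrative example that follows; the function name `least_squares` and the symbols `b0` and `b1` for the estimated intercept and slope are choices made for this sketch.

```python
from math import sqrt

def least_squares(x, y):
    """Return (b0, b1, r_squared) for the line y_hat = b0 + b1 * x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b0 = (sum_y - b1 * sum_x) / n
    r = (n * sum_xy - sum_x * sum_y) / (
        sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    )
    return b0, b1, r ** 2

# Weekly advertising costs and sales (both in hundreds of pesos).
advertising = [4, 2, 2.5, 2, 3, 5, 4, 2, 5, 4, 2.5, 5]
sales = [38.5, 40.0, 39.5, 36.5, 47.5, 44.0, 49.0, 42.0, 56.0, 52.5, 48.0, 51.0]

b0, b1, r_squared = least_squares(advertising, sales)
predicted_sales = b0 + b1 * 3.5    # estimate when advertising cost is 3.5

print(round(b0, 2), round(b1, 2), round(r_squared, 3), round(predicted_sales, 1))
```

An r-squared of roughly 0.40 means that only about 40% of the variation in weekly sales is explained by advertising costs in this data set, so predictions from the line should be read with caution.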

Illustrative Examples:
1. A study was made by a retail merchant to determine the relation between weekly advertising
expenditures and sales (both in hundreds of pesos). The following data were recorded:

Advertising Costs (₱)    Sales (₱)    Advertising Costs (₱)    Sales (₱)
4                        38.5         4                        49.0
2                        40.0         2                        42.0
2.5                      39.5         5                        56.0
2                        36.5         4                        52.5
3                        47.5         2.5                      48.0
5                        44.0         5                        51.0

a) Plot a scatter diagram.


b) Find the equation of the regression line to predict weekly sales from advertising expenditures.
c) Compute the coefficient of determination.
d) Estimate the weekly sales when advertising costs are 3.5 (in hundreds of pesos).

2. In the 1990s, research efforts focused on the problem of predicting a manufacturer's market
share using information on the quality of its product. Suppose that the following data are
available on market share, in percentage (Y), and product quality, on a scale of 0 to 100,
determined by an objective evaluation procedure (X).

X 27 39 73 66 33 43 47 55 60 68 70 75
Y 2 3 10 9 4 6 5 8 7 9 10 13

a) Draw the scatter diagram.


b) Estimate the simple linear regression relationship between market share and product quality
rating. Graph the line.
c) Compute for the coefficient of determination. Interpret.
d) Estimate the market share when the product quality is 95.

V. Chi-Square Test

A. Test for Independence


The chi-square test of independence is a nonparametric statistical test to determine if two
or more classifications of the samples are independent or not. It uses a contingency table
(sometimes referred to as a cross classification table) to examine the nature of the relationship
between these variables.
By independence, we mean that the row and column variables are unassociated (i.e.
knowing the value of a row variable will not help us predict the value of a column variable, and
likewise, knowing the value of a column variable will not help us predict the value of a row
variable).

Contingency Table

                           Variable 1
Variable 2     1        2        …     c        Total
1              O11      O12      …     O1c      R1
2              O21      O22      …     O2c      R2
...            ...      ...      …     ...      ...
r              Or1      Or2      …     Orc      Rr
Total          C1       C2       …     Cc       n

Data Consideration
a) Use ordered or unordered numeric categorical variables (ordinal or nominal levels of
measurement).
b) The data are assumed to be a random sample. The expected frequencies for each category
should be at least 1. No more than 20% of the categories should have expected frequencies
of less than 5. If not, use Fisher's Exact Test or other tests.

Steps in testing independence between two variables:

1) Formulate the null and alternative hypotheses.
   $H_0$: There is no relationship between the two variables (or the two variables are independent).
   $H_a$: There is some relationship between the variables (or the two variables are dependent).
2) Determine the significance level ($\alpha$).
3) Decision rule: Reject $H_0$ if $\chi^2 > \chi^2_{\alpha,\,df}$, where $df = (r-1)(c-1)$ for a table with
   $r$ rows and $c$ columns.
4) Calculate the chi-square test statistic.

   $$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}}, \qquad E_{ij} = \frac{R_i\,C_j}{n}$$

   where:
   $\chi^2$ = the test statistic that asymptotically approaches a chi-square distribution
   $O_{ij}$ = the observed frequency of the ith row and jth column
   $E_{ij}$ = the expected (theoretical) frequency of the ith row and jth column
   $R_i$ = total of the ith row
   $C_j$ = total of the jth column
   $n$ = grand total of all observed frequencies
5) Decision
6) Conclusion
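
The six steps can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference implementation: the function name `chi_square_independence`, the 2 by 2 counts, and the 0.05 significance level are all hypothetical, the degrees of freedom are taken as (rows - 1)(columns - 1), and the critical values are the standard upper 5% points of the chi-square distribution for 1 to 4 degrees of freedom, hard-coded so that no statistics library is needed.

```python
# Upper 5% critical values of the chi-square distribution, keyed by df
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}


def chi_square_independence(observed):
    """Return (chi_square, df) for an r x c table of observed frequencies."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi_square = 0.0
    for i, row in enumerate(observed):
        for j, o_ij in enumerate(row):
            e_ij = row_totals[i] * col_totals[j] / n   # expected frequency
            chi_square += (o_ij - e_ij) ** 2 / e_ij
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi_square, df


# Hypothetical counts (steps 1 and 2: H0 is independence, alpha assumed 0.05)
observed = [[20, 30],
            [30, 20]]
chi_square, df = chi_square_independence(observed)   # step 4
reject_h0 = chi_square > CRITICAL_05[df]             # steps 3 and 5
print(chi_square, df, reject_h0)   # 4.0 1 True
```

In this hypothetical table every expected frequency is 25, so each cell contributes (5)^2 / 25 = 1 to the statistic. The computed value 4.0 exceeds 3.841, so the null hypothesis of independence would be rejected at the assumed 5% level.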

B. Measures of Association
The chi-square test of independence can show whether the association between two
qualitative variables A and B can be regarded as statistically significant. The degree of
association can be evaluated directly using measures of association, which are based on the
computed chi-square value ($\chi^2$). The nearer the value of the measure of association is to 0,
the weaker the association between the two variables. Here are some measures of association:

a) Phi coefficient ($\phi$) is used in 2 by 2 tables.
b) Contingency coefficient C (Pearson's C) is only used for 5 by 5 tables or larger.
c) Cramer's V is the most popular measure of association regardless of table size.
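
For reference, the usual textbook definitions of these three measures are given below. They are stated here as an assumption about the intended formulas, with $n$ the grand total of the table, $r$ the number of rows, and $c$ the number of columns.

```latex
\phi = \sqrt{\frac{\chi^2}{n}}, \qquad
C = \sqrt{\frac{\chi^2}{\chi^2 + n}}, \qquad
V = \sqrt{\frac{\chi^2}{n \,\min(r-1,\; c-1)}}
```

For a 2 by 2 table, $\min(r-1, c-1) = 1$, so under these definitions Cramer's V reduces to the phi coefficient.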

C. Testing for Several Proportions


The steps are the same as in the test for independence; however, the null hypothesis is that the
several population proportions are all equal ($H_0$: $p_1 = p_2 = \dots = p_k$).

Illustrative Examples:
1. Grades in a statistics course and mathematical analysis for business taken simultaneously were as
follows for a group of students.

                    Mathematical Analysis for Business Grade
Statistics Grade    A     B     C     Others
A                   25    6     17    13
B                   17    16    15    6
C                   18    4     18    10
Others              10    8     11    20

Are the grades in statistics and mathematical analysis for business related? Use in reaching your
conclusion.

2. A random sample of students is asked their opinions on a proposed core curriculum change. The
results are as follows.

Opinion
Class Favoring Opposing
Freshman 120 80
Sophomore 70 130
Junior 60 70
Senior 40 60

Test the hypothesis that the proportions in the opinions on the change are the same for all year levels.
Use .
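
A minimal Python sketch of this test on the opinion data is shown below. The significance level is not recoverable from the exercise as printed here, so the sketch assumes a 0.05 level purely for illustration; the variable names are hypothetical, and 7.815 is the standard 5% critical value of the chi-square distribution with 3 degrees of freedom.

```python
ALPHA_ASSUMED = 0.05       # assumption: the exercise's own alpha is not shown
CRITICAL_VALUE = 7.815     # chi-square critical value for alpha = 0.05, df = 3

# Rows: year level; columns: (favoring, opposing)
opinions = {
    "Freshman": (120, 80),
    "Sophomore": (70, 130),
    "Junior": (60, 70),
    "Senior": (40, 60),
}

row_totals = {level: sum(counts) for level, counts in opinions.items()}
col_totals = [sum(col) for col in zip(*opinions.values())]
n = sum(col_totals)

# Sample proportion favoring the change in each year level
proportions = {level: counts[0] / row_totals[level]
               for level, counts in opinions.items()}

chi_square = 0.0
for level, counts in opinions.items():
    for j, observed in enumerate(counts):
        expected = row_totals[level] * col_totals[j] / n
        chi_square += (observed - expected) ** 2 / expected

df = (len(opinions) - 1) * (len(col_totals) - 1)
reject_h0 = chi_square > CRITICAL_VALUE

print(proportions)
print(f"chi-square = {chi_square:.2f}, df = {df}, reject H0: {reject_h0}")
```

Under these assumptions the statistic comes out at about 26.97 on 3 degrees of freedom, which exceeds 7.815, so the hypothesis of equal proportions would be rejected at the assumed 5% level. A different significance level would change only the critical value.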

3. A company has to choose among three pension plans. Management wishes to know whether the
preference for plans is independent of job classification and wants to use . The opinions of a
random sample of 500 employees are shown below:
Pension Plan
Job Classification 1 2 3
Salaried workers 160 140 40
Hourly workers 40 60 60

4. A survey cross-classifying respondents by gender and social class gave the results below. Use
the chi-square test of independence to determine whether the gender and social class of the
respondent are independent of each other. Use the 0.05 level of significance.

                Gender
Social Class    Male    Female
Upper Middle    33      29
Middle          153     181
Working         103     81
Lower           16      14

5. A survey of adults in X city was conducted to examine public attitudes toward government cuts in
social spending. Concerning these data, the researcher comments, “Respondents who knew
someone on social assistance were more likely to feel that welfare rates were too low...”

                    Knows someone on social assistance
Welfare Spending    Yes    No
Too little          40     6
About right         16     13
Too much            9      7
