Gmath Stat
Population – the totality of all objects, individuals or perceptions whose unique properties or
characteristics are the subject of statistical inquiry.
Finite Population can be counted with relative ease and the number obtained is limited.
Examples are: number of students in a class, number of patients in a hospital, number of
registered voters in a municipality.
Infinite Population cannot be counted easily because of the large number involved or because of the
nature of the data. Examples are: the number of hair strands, the number of stars in the sky, the exact
Philippine population.
Cochran's Formula
Illustrative example:
A researcher would like to determine the research capability of graduate school students in 4 universities in
the city. Let us consider the following hypothetical data.
2. Assuming the margin of error is 0.03 or 3%, determine the sample size using Yamane's
(Slovin's) formula, where N is the population size and e is the margin of error:
$n = \frac{N}{1 + Ne^2} = 769.23 \approx 770$
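The computation above can be sketched in Python. The table of hypothetical data did not survive in these notes, so the population size N = 2500 used here is an assumption: it is the value that gives 769.23 at a 3% margin of error. The function name is illustrative only.

```python
import math

def slovin_sample_size(population_size: int, margin_of_error: float) -> int:
    """Yamane's (Slovin's) formula n = N / (1 + N * e^2), rounded up."""
    n = population_size / (1 + population_size * margin_of_error ** 2)
    return math.ceil(n)  # a sample size is always rounded up to a whole unit

# ASSUMPTION: N = 2500 is back-solved from the result 769.23, not taken from the notes.
sample_size = slovin_sample_size(2500, 0.03)
print(sample_size)
```

The result is rounded up rather than to the nearest integer so that the margin of error is not exceeded.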
Random Sampling is the most commonly used sampling technique in which each member in the
population is given an equal chance of being selected in the sample. Non-random Sampling is a
method of collecting a small portion of the population by which not all members of the population
are given the same chance to be included in the sample. Certain elements in the population are
deliberately left out from the selection for varied reasons.
1. Simple Random Sampling - the random selection process allows no discretion to the investigator as
to which particular units in the population enter the sample
- it tends to avoid the problem of unrepresentativeness
a. Lottery or Fishbowl Sampling
2. Systematic Sampling – the use of a random start, followed by the selection of every k-th unit,
where k is the common sampling interval
- Usually used if the population size is known
k = N/n (N = population size; n = sample size)
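A minimal sketch of systematic sampling, assuming a hypothetical class list of 100 students numbered 1 to 100 and a desired sample of 10. The function name and the seed are illustrative, not part of the notes.

```python
import random

def systematic_sample(population, sample_size, seed=None):
    """Pick a random start, then take every k-th unit, where k = N // n."""
    rng = random.Random(seed)
    k = len(population) // sample_size   # sampling interval k = N / n
    start = rng.randrange(k)             # random start inside the first interval (0-based)
    return [population[start + i * k] for i in range(sample_size)]

# Hypothetical population: students numbered 1..100
students = list(range(1, 101))
sample = systematic_sample(students, 10, seed=1)
print(sample)
```

Every selected unit is exactly k positions after the previous one, so only the start is random.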
4. Cluster Sampling or Multi-stage Sampling – used when the population is large and spread over a
geographical area in which smaller sub-regions are easily sampled, where a simple random or a
stratified random sample may not be carried out easily, or when the selection of individuals of the
population is impractical. It is a procedure of selection in which the unit of selection (cluster) contains
two or more population members.
NON-RANDOM SAMPLING
1. Convenience sampling is probably the most common of all sampling techniques. With
convenience sampling, the samples are selected because they are accessible to the researcher.
2. Consecutive sampling is very similar to convenience sampling except that it seeks to include ALL
accessible subjects as part of the sample. This non-probability sampling technique can be
considered as the best of all non-probability samples because it includes all subjects that are
available, which makes the sample a better representation of the entire population.
3. Quota sampling is a non-probability sampling technique wherein the researcher ensures equal
or proportionate representation of subjects depending on which trait is considered as basis
of the quota.
For example, if basis of the quota is college year level and the researcher needs equal
representation, with a sample size of 100, he must select 25 1st year students, another 25 2nd year
students, 25 3rd year and 25 4th year students. The bases of the quota are usually age, gender,
education, race, religion and socioeconomic status.
4. Judgmental sampling is more commonly known as purposive sampling. In this type of sampling,
subjects are chosen to be part of the sample with a specific purpose in mind. With judgmental
sampling, the researcher believes that some subjects are more fit for the research compared
to other individuals. This is the reason why they are purposively chosen as subjects.
5. Snowball sampling is usually done when there is a very small population size. In this type of
sampling, the researcher asks the initial subject to identify another potential subject who also
meets the criteria of the research. The downside of using a snowball sample is that it is hardly
representative of the population.
6. Incidental or Opportunity Sampling is applied to those samples which are taken because they are
the most available and willing.
Positive:
1) It provides consistent and more precise information since clarification may be
given by the interviewee.
2) Questions may be repeated or may be modified to suit the interviewee's level of
understanding.
Negative:
1) Time-consuming
2) Expensive
3) Limited field coverage
Positive:
1) Inexpensive
2) Can cover a wide area in a shorter span of time.
3) Respondents may feel a greater sense of freedom to express views and
opinions because their anonymity is maintained.
Negative:
1) There's a strong possibility of non-response, especially when questionnaires are
mailed.
2) Questions not easily understood may not be answered.
Positive:
The recording of behavior at the appropriate time and situation is made possible.
E. Experiment Method - this method is used when the objective is to determine the cause-and-effect
relationship of certain phenomena under controlled conditions. It is usually
used by scientific researchers.
Based on the Census of Population and Housing conducted decennially by the National
Statistics Office, the total population of the Philippines as of May 1, 2000 was
76,504,077 persons. This was higher by 7,887,541 persons or about 10.31 percent from
the 1995 census (with September 1, 1995 as reference date). It was 10 times the
Philippine population in 1903 when the first census was undertaken.
The expansion of the Philippine population reflected a 2.36 percent average annual
growth rate in the 1995-2000 period. This figure recorded a slight increase from a
declining growth rate which started in the first half of the seventies. The last increase
recorded in population growth rates was during the intercensal period 1948 to 1960 at
3.07 percent. The recent growth rate was 0.04 percentage point higher than the annual
growth during the early part of the nineties. If the average annual growth rate continues,
the population of the Philippines is expected to double in 29 years.
Source: NSO, 2000 Census of Population and Housing
2. Tabular – a more concise and systematic manner of presenting numerical facts compared to textual
form. Tabular presentation facilitates the analysis of relationships.
Example:
3. Graphical Presentation – an effective means of organizing and presenting statistical data because the
important relationships are brought out more clearly and creatively in virtually solid and
colorful figures.
Example: More single men than women
About 43.89 percent of the total population 10 years and over were single, while 45.66 percent were
married. The remaining 10.45 percent were either widowed, separated/divorced, with other arrangements or
with unknown marital status.
Among the single persons, the proportion was higher for males (52.94 percent) than for females (47.06
percent). In contrast, the proportion for widowed was higher for females (75.72 percent) than for males
(24.28 percent).
Notes developed by Prof. Raflyn Manuel-Guillermo, PhD.
* Some Commonly Used Graphs
a. Scatter Graph
a graph used to present measurements or values that are thought to be related.
[Figure: scatter graph; horizontal axis: Household Size]
b. Line Chart
graphical presentation of data especially useful for showing trends over a period of time.
[Figure: line chart with Male and Female series plotted over periods I to IV; vertical axis from 0 to 600]
c. Pie Chart
a circular graph that is useful in showing how a total quantity is distributed among a
group of categories. The “pieces of the pie” represent the proportions of the total
that fall into each category.
[Figure: pie chart with slices for Heart Disease (45%), Cancer (33%), Stroke (10%), Lung Disease (6%), and Accidents (6%)]
d. Bar graph
It represents the frequency or magnitudes of quantities of each of the categories as
a bar rising vertically from the horizontal axis with the height of each bar
proportional to the frequency or magnitude of the corresponding category.
It may be simple, compound and can be vertically or horizontally arranged. It is
used for both qualitative and quantitative data
[Figure: horizontal bar graph of Number of Students for categories I to IV; horizontal axis from 0 to 60]
e. Frequency Polygon
A frequency polygon can be made from a line graph by shading in the area beneath the graph. It can be
made from a histogram by joining midpoints of each column.
[Figure: frequency polygon for SLU, UB, UC and BCU; vertical axis from 0 to 800]
f. Histogram
A histogram displays continuous data in ordered columns. Categories are of continuous measure
such as time, inches, temperature, etc.
[Figure: histogram with class intervals Below P150, 150-199, 200-249, 250-299, 300-349, 350-399, 400-449, 450-499, and 500 and up; vertical axis from 0 to 25]
a. Simple Table
b. Cross Tabulation
Number of Graduate School Students from Four Major Universities Classified According to Gender
Notes:
Simple tables are commonly used for ordinal and nominal variables.
A combination of the two levels of measurements can be used for cross tabulation
A Frequency Distribution is a grouping of data into mutually exclusive categories showing the number of
observations in each class.
Example:
5, 1, 3, 4, 2, 1, 3, 5, 4, 2, 1, 5, 1, 3, 2, 1, 5, 3, 3, 2.
Solution:
From the data, we observe that the numbers 1, 2, 3, 4 and 5 are repeated. Hence, under the number
column, write the five numbers 1, 2, 3, 4 and 5 one below the other.
Now read the numbers one by one and put a tally mark in the tally mark column against the number.
For example, the first number is 5, so put a tally mark ('|') against the number 5. The next number is 1, so
put a tally mark ('|') against the number 1. Continue the process until all the numbers are exhausted.
Add the tally marks against the numbers 1, 2, 3, 4 and 5 and write the total in the corresponding
frequency column. Now add all the numbers under the frequency column and write it against the total.
Number   Tally Marks   Frequency
1        |||||         5
2        ||||          4
3        |||||         5
4        ||            2
5        ||||          4
Total                  20
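The tallying procedure described above can be checked with a short sketch. It uses the 20 data values from the example; the variable names are illustrative.

```python
from collections import Counter

data = [5, 1, 3, 4, 2, 1, 3, 5, 4, 2, 1, 5, 1, 3, 2, 1, 5, 3, 3, 2]

# Counter plays the role of the tally column: one count per distinct value
frequency = Counter(data)

for value in sorted(frequency):
    # print the value, its tally marks, and its frequency
    print(value, "|" * frequency[value], frequency[value])

total = sum(frequency.values())
print("Total", total)
```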
Class interval: The class interval is obtained by subtracting the lower limit of a class from the lower
limit of the next class. The class intervals should be equal.
Class Frequency: The number of observations in each class.
Class Midpoint: A point that divides a class into two equal parts. This is the average of the upper
and lower class limits.
Class Boundaries
Lower boundary is the lower limit less 0.5
Upper boundary is the upper limit plus 0.5
Relative Frequency is the ratio of the class frequency to the total frequency.
Cumulative Frequency corresponding to a particular value is the sum of all the frequencies up to
and including that value.
Example:
The following are the marks obtained by 50 students in a mathematics test. Prepare a frequency distribution
table for the data.
45 68 41 87 61 44 67 30 54 8 39 60 37 50 19 86 42 29 32 61 25 77 62 98 47 36 15 40 9 25
34 50 61 75 51 96 20 13 18 35 43 88 25 95 68 81 29 41 45 87
Solution:
To decide the length of the class interval, we first find the largest value and the smallest value
among the given scores. We can do this by simply going through all the
scores. Here the largest value is 98 and the smallest value is 8.
(You can use the formula $2^c > n$, where c = desired number of classes and
n = number of observations.)
o There are 50 observations, so n = 50.
Step Two: Determine the class size or width using the formula
NOTE: The researcher may decide on the class width to use. It is then advisable to use an odd
number for the class width to have a whole number for the class midpoint. The number of classes
should not be too few ( at least 5) and not too many (at most 20).
CI cb f Rf cfb< cfb>
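A sketch of how the table for the 50 marks could be built. Two points are assumptions, because the original formula and table are missing here: the class width is taken as range divided by number of classes, rounded up, and the first class starts at the smallest score. With these choices the width is 15, and a seventh class is needed to hold the score 98.

```python
import math

marks = [45, 68, 41, 87, 61, 44, 67, 30, 54, 8, 39, 60, 37, 50, 19, 86, 42, 29,
         32, 61, 25, 77, 62, 98, 47, 36, 15, 40, 9, 25, 34, 50, 61, 75, 51, 96,
         20, 13, 18, 35, 43, 88, 25, 95, 68, 81, 29, 41, 45, 87]

n = len(marks)
c = 1
while 2 ** c <= n:            # smallest c such that 2^c > n
    c += 1

low, high = min(marks), max(marks)
width = math.ceil((high - low) / c)   # ASSUMED formula: range / number of classes

rows = []
lower, cumulative = low, 0
while lower <= high:                  # keep adding classes until the largest score is covered
    upper = lower + width - 1
    f = sum(lower <= m <= upper for m in marks)
    cumulative += f
    rows.append({
        "CI": f"{lower}-{upper}",                 # class interval
        "cb": (lower - 0.5, upper + 0.5),         # class boundaries
        "f": f,                                   # class frequency
        "Rf": f / n,                              # relative frequency
        "cf<": cumulative,                        # less-than cumulative frequency
    })
    lower += width

for row in rows:
    print(row)
```

A researcher could instead choose an odd width, as the note above advises; only the `width` line would change.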
tendency are (1) the mean, (2) the median, and (3) the mode.
A. MEAN
It is obtained by adding all the observations and dividing the sum by the number of observations.
UNGROUPED DATA
1. The arithmetic mean or average is represented by µ for the population and x̄ for the sample.
Illustrative Example:
Suppose ten people you had chosen from those entering the campus have ages:
15, 25, 18, 15, 20, 25, 18, 18, 20, and 25
What is the mean age of these people?
For ungrouped data, such as the one given above, the formula easily applies. But for data where
observations occur more than once, the weighted mean could be used.
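As a quick check of the ages example, the sketch below computes the mean directly and then again as a frequency-weighted mean. The weighted form, sum of f times x divided by sum of f, is the standard one and is assumed here because the formula itself is missing from these notes.

```python
from collections import Counter

ages = [15, 25, 18, 15, 20, 25, 18, 18, 20, 25]

# Simple arithmetic mean: sum of observations / number of observations
mean_age = sum(ages) / len(ages)

# Frequency-weighted mean: each distinct age x is weighted by how often it occurs (f)
freq = Counter(ages)
weighted_mean = sum(x * f for x, f in freq.items()) / sum(freq.values())

print(mean_age, weighted_mean)
```

Both routes give the same value because the frequencies simply group repeated observations.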
2. Weighted mean
a. Since there are scores that occur more than once, we may want to list down the scores as follows:
b. If each observation has a different weight, w will take the place of f in the formula making it
Illustrative Example:
Here are scores obtained by an applicant for a certain job. The weight for each criterion is given:
Total 100%
c. If items are to be rated based on a scale then r would take the place of w in the formula making it:
Illustrative Example:
Mall goers were asked to rate the level of effectiveness of the inspection done by security forces in
prohibiting crimes in shopping malls in the city.
Level of Effectiveness
Very effective (4)   Moderately effective (3)   Least effective (2)   Not effective (1)
Thus, the weighted mean of ______________ falls under the rating ______ which means that the
inspection done at the mall is _________________ in prohibiting the occurrence of crime.
GROUPED DATA
In case of large groups the formulas stated above may not be very usable. The more practical thing
Consumers were asked to try a new cracker and provide feedback. Their ages were
recorded and grouped with 5 as the class size. What is the mean age of the consumers who
agreed to try the product?
Scores X f d fd
35-39 37 5 3
30-34 32 8 2
25-29 27 9 1
20-24 22 6 0
15-19 17 7 -1
10-14 12 4 -2
5-9 7 1 -3
TOTAL 40
Scores X f d fd
35-39 37 5 6
30-34 32 8 5
25-29 27 9 4
20-24 22 6 3
15-19 17 7 2
10-14 12 4 1
5-9 7 1 0
Total 40
Illustrative Example:
Scores X f fx
35-39 37 5
30-34 32 8
25-29 27 9
20-24 22 6
15-19 17 7
10-14 12 4
5-9 7 1
TOTAL 40
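The three tables above can be completed with a short sketch. It assumes the usual grouped-data formulas, which are missing from these notes: the midpoint method (sum of fX over n) and the coded-deviation method (assumed mean plus sum of fd over n, times the class size). The assumed mean of 22 matches the first table, where d = 0 at the 20-24 class.

```python
midpoints = [37, 32, 27, 22, 17, 12, 7]   # X, class midpoints
freqs = [5, 8, 9, 6, 7, 4, 1]             # f, class frequencies
n = sum(freqs)                            # 40

# Midpoint method: mean = sum(f * X) / n
mean_midpoint = sum(f * x for f, x in zip(freqs, midpoints)) / n

# Coded-deviation method (ASSUMED formula): mean = A + (sum(f * d) / n) * i
assumed_mean = 22     # A, midpoint of the class where d = 0
class_size = 5        # i
d = [(x - assumed_mean) // class_size for x in midpoints]   # 3, 2, 1, 0, -1, -2, -3
mean_coded = assumed_mean + sum(f * di for f, di in zip(freqs, d)) * class_size / n

print(mean_midpoint, mean_coded)
```

Choosing a different assumed mean, such as 7 in the second table, changes the d column but not the final mean.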
B. MEDIAN
The Median is the midpoint of the values after they have been ordered from the smallest to the largest.
There are as many values above the median as below it in the data array.
UNGROUPED DATA
Illustrative Examples:
For an even number of values, the median will be the arithmetic average of the two middle
numbers.
The heights of four basketball players, in inches, are: 76, 73, 80, 75.
Arranging the data in ascending order gives: 73, 75, 76, 80
For an odd number of values, the median is found at the (n+1)/2 ranked observation.
The ages for a sample of five college students are: 21, 25, 19, 20, 22.
Arranging the data in ascending order gives: 19, 20, 21, 22, 25.
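Both rules can be sketched in one small function, applied to the two examples above. The function name is illustrative.

```python
def median(values):
    """Median of ungrouped data: middle value, or mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # the (n+1)/2-th ranked observation
    return (ordered[mid - 1] + ordered[mid]) / 2  # average of the two middle numbers

heights = [76, 73, 80, 75]    # even number of values
ages = [21, 25, 19, 20, 22]   # odd number of values
print(median(heights), median(ages))
```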
But we can tell in what class interval it is found. Here is the way to find the median.
Illustrative Example:
Scores cb f Cf<
35-39 5
30-34 8
25-29 9
20-24 6
15-19 7
10-14 4
5-9 1
Total 40
Steps:
necessarily close to the mean. In this given distribution, the median is slightly greater than the mean.
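The steps for the grouped median are missing from these notes, so the sketch below assumes the standard interpolation formula: median = LB + ((n/2 - cf) / f) * i, where LB is the lower boundary of the median class, cf is the cumulative frequency before it, f is its frequency and i is the class size. It uses the frequency table from the example.

```python
# (lower limit, upper limit, frequency), from the lowest class to the highest
classes = [(5, 9, 1), (10, 14, 4), (15, 19, 7), (20, 24, 6),
           (25, 29, 9), (30, 34, 8), (35, 39, 5)]

def grouped_median(classes):
    """Interpolated median for grouped data (ASSUMED standard formula)."""
    n = sum(f for _, _, f in classes)
    class_size = classes[0][1] - classes[0][0] + 1
    cumulative = 0                               # less-than cumulative frequency so far
    for lower, upper, f in classes:
        if cumulative + f >= n / 2:              # first class whose cf< reaches n/2
            lower_boundary = lower - 0.5
            return lower_boundary + (n / 2 - cumulative) / f * class_size
        cumulative += f

median_age = grouped_median(classes)
print(round(median_age, 2))
```

The median class is 25-29, and the result is a little above the grouped mean of 24.75, in line with the remark above.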
C. MODE
UNGROUPED DATA
The mode is the value in the distribution that occurs the most number of times. As the most
frequently occurring observation, it is a nominal average.
For distribution A,
For distribution B, the Mode .
For distribution C,
Evidently a distribution can have no mode, one mode or more than one mode. Thus, the mode is not a very
reliable measure of central tendency. However, there are instances when no other measure can be used
except the mode like when the data is nominal. In determining the prevalent gender, civil status, or highest
educational attainment only the mode can be used, because no numerical values are assigned to these
variables.
GROUPED DATA
If instead of ungrouped data, a frequency distribution is given we cannot easily identify the TRUE
MODE score. But we can tell in what class interval it is found. Here is the way to find the mode.
Steps:
It is possible that there is more than one mode in grouped data.
Illustrative Example:
Scores cb f
35-39 34.5 – 39.5 9
30-34 29.5 – 34.5 8
25-29 24.5 – 29.5 9
20-24 19.5 – 24.5 6
15-19 14.5 – 19.5 7
10-14 9.5 - 14.5 4
5-9 4.5 - 9.5 9
Total 52
The Histogram shows an even spread of data, indicating that sometimes the Coffee Shop is very busy,
while other times they are making less than eight cappuccinos per hour.
We now want to find the Average Number of Cappuccinos made every hour. There are three types of
Averages: the Mean, the Median, and the Mode.
SUMMARY NOTES:
MEAN:
1. All the scores or measurements are considered in the computation of the mean.
2. Very high or very low scores or observations affect the mean.
3. If a constant k is added, subtracted, multiplied or divided to the scores, the same constant k is
MODE:
1. It is very easy to compute but is seldom used because it is very unstable.
2. It is most appropriate for nominal scale as a measure of popularity.
B. MEASURES OF VARIABILITY
Variability refers to how "spread out" a group of scores or numerical data is. To see what we mean by
spread out, consider graphs in Figure 1. These graphs represent the scores on food taste. The mean score
for each product is 7.0. Despite the equality of means, you can see that the distributions are quite
different. Specifically, the scores on Product 1 are more densely packed and those on Product 2 are more
spread out. The differences among consumers' scores were much greater on Product 2 than on Product 1.
[Figure 1: distributions of food taste scores for Product 1 and Product 2]
a. Range
The range is the simplest measure of variability to calculate, and one you have probably encountered
many times in your life. The range is simply the highest score minus the lowest score.
Illustrative Examples
0 1 2 3 4 5 6 7 8 9 10
2. What is the range of the price increase of bottled drinks: ₱10, ₱2, ₱5, ₱6, ₱7, ₱3, ₱4?
3. The following numbers are customers of a car company in 10 weeks: 99, 45, 23, 67, 45, 91, 82, 78, 62, 51.
What is the range?
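Examples 2 and 3 can be checked with a one-line function. The function name is illustrative.

```python
def value_range(values):
    """Range = highest score minus lowest score."""
    return max(values) - min(values)

price_increases = [10, 2, 5, 6, 7, 3, 4]                   # pesos, example 2
customers = [99, 45, 23, 67, 45, 91, 82, 78, 62, 51]       # per week, example 3

print(value_range(price_increases), value_range(customers))
```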
b. Interquartile Range
The interquartile range (IQR) is a measure of variability, based on dividing a data set into quartiles.
Quartiles divide a rank-ordered data set into four equal parts. The values that divide each part are called
the first, second, and third quartiles; and they are denoted by Q1, Q2, and Q3, respectively.
Q1 is the "middle" value in the first half of the rank-ordered data set.
Q2 is the median value in the set.
Q3 is the "middle" value in the second half of the rank-ordered data set.
1. Find the IQR for the following data set: 1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27.
Steps:
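The steps can be sketched as follows, using the definition given above: Q1 is the middle value of the first half and Q3 the middle value of the second half. When the number of values is odd, this sketch leaves the median out of both halves, which is one common convention and is assumed here; other quartile conventions can give slightly different results.

```python
def median(ordered):
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def quartiles_and_iqr(values):
    """Return (Q1, Q3, IQR) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    lower_half = ordered[: n // 2]          # values below the median position
    upper_half = ordered[(n + 1) // 2:]     # values above the median position
    q1, q3 = median(lower_half), median(upper_half)
    return q1, q3, q3 - q1

data = [1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27]
q1, q3, iqr = quartiles_and_iqr(data)
print(q1, q3, iqr)
```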
The standard deviation is the most commonly used measure of the spread of values in a
distribution; it measures how spread out the numbers are. The Variance is
defined as the average of the squared differences from the Mean, and the standard deviation is its square root.
UNGROUPED DATA
1. Here are the minutes employees were late for the day:
4, 2, 5, 8, 6
x̄ = 5
x
4 (4-5) = -1
2
5
8
6
The standard deviation for the minutes employees were late for the day is ___________.
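The worksheet above can be filled in with a short sketch. The notes do not say whether the five values are a sample or a population, so both versions are computed; which one applies is left as an assumption for the reader.

```python
import math

minutes = [4, 2, 5, 8, 6]
n = len(minutes)
mean = sum(minutes) / n                       # 5.0

deviations = [x - mean for x in minutes]      # x - mean for each value
sum_of_squares = sum(d ** 2 for d in deviations)

sample_variance = sum_of_squares / (n - 1)    # divide by n - 1 for a sample
population_variance = sum_of_squares / n      # divide by n for a population

sample_sd = math.sqrt(sample_variance)
population_sd = math.sqrt(population_variance)
print(deviations, sum_of_squares, sample_sd, population_sd)
```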
Illustrative Example:
1. A recent survey I worked on asked a question about what users thought of the visual appeal of the
software. Users were given a five point rating scale (from strongly disagree to strongly agree).
5, 5, 5, 5, 4, 5, 3, 4, 5, 5, 5, 5, 4, 5, 1, 2, 3, 4
Because the question was just written for the survey, there's no historical or comparative data. To find
more meaning in this jumble of numbers, the first thing you need to do is compute the mean and
standard deviation. While you won't necessarily report them, you'll need them for some of the subsequent
steps.
Day         Week 1 (x)   x - mean   Week 2 (x)   |x - mean|   (x - mean)^2
Monday      P100         0          P100         0            0
Tuesday     100          0          50           50           2500
Wednesday   100          0          50           50           2500
Thursday    100          0          200          100          10000
Friday      100          0          50           50           2500
Saturday    100          0          150          50           2500
Total       P600         0          P600                      20000
Mean        P100                    P100
Range       0                       P150

Week 1
Range = P100 - P100 = 0
Standard Deviation: s = 0
Variance: s^2 = 0

Week 2
Range = P200 - P50 = P150
Standard Deviation: $s = \sqrt{\frac{20000}{6 - 1}} = \sqrt{4000} = 63.2455532$, so s = P63.25
Variance: $s^2 = \frac{20000}{6 - 1} = 4000$, so s^2 = P4,000

Note: The allowances for the week are representative of all the
allowances.
If we are to get the standard deviation and variance of the expense for week 2, then such is considered as
the population.
Standard Deviation: $\sigma = \sqrt{\frac{20000}{6}} = 57.73502692$, so σ = P57.74
Variance: $\sigma^2 = \frac{20000}{6} = 3333.33333$, so σ^2 = P3,333.33
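The Week 2 figures can be reproduced with Python's standard `statistics` module, which offers separate functions for the sample case (divide by n - 1) and the population case (divide by n).

```python
import statistics

week2 = [100, 50, 50, 200, 50, 150]   # daily allowances in pesos, Monday to Saturday

# Sample versions: the week is treated as representative of all allowances
sample_variance = statistics.variance(week2)
sample_sd = statistics.stdev(week2)

# Population versions: the week itself is the whole population of interest
population_variance = statistics.pvariance(week2)
population_sd = statistics.pstdev(week2)

print(sample_variance, sample_sd, population_variance, population_sd)
```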
GROUPED DATA
Illustrative Example:
The following is a tabular presentation of the ages of people who participated in product evaluation.
RANGE OF AGES   X    f    X - mean   f(X - mean)^2
35-39           37   5    12.25      750.313
30-34           32   8    7.25       420.5
25-29           27   9    2.25       45.5625
20-24           22   6    -2.75      45.375
15-19           17   7    -7.75      420.438
10-14           12   4    -12.75     650.25
5-9             7    1    -17.75     315.063
TOTAL                40              2647.5
Sample Standard Deviation
$s = \sqrt{\frac{2647.5}{40 - 1}} = 8.23921061$
s = 8.24
Sample Variance
$s^2 = \frac{2647.5}{40 - 1} = 67.88461538$
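The grouped computation can be verified with a short sketch that rebuilds the last column of the table from the midpoints and frequencies. Variable names are illustrative.

```python
import math

midpoints = [37, 32, 27, 22, 17, 12, 7]   # X
freqs = [5, 8, 9, 6, 7, 4, 1]             # f
n = sum(freqs)

mean = sum(f * x for f, x in zip(freqs, midpoints)) / n             # 24.75
sum_sq = sum(f * (x - mean) ** 2 for f, x in zip(freqs, midpoints))  # total of f(X - mean)^2

grouped_variance = sum_sq / (n - 1)       # sample variance, divides by n - 1
grouped_sd = math.sqrt(grouped_variance)
print(mean, sum_sq, grouped_variance, grouped_sd)
```

The variance is the square of the standard deviation, which is a quick consistency check on any reported pair.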
Hypothesis Testing
Hypothesis testing is the operation of deciding whether or not a data set obtained from a random
sample supports a particular hypothesis. A hypothesis is an assertion or conjecture
about a parameter(s) of a population; it may also concern the type or nature of the population, or the
distributional form of characteristics of interest.
Types of Hypotheses
A. Null hypothesis, Ho – represents a theory that has been put forward, either because it is believed
to be true or because it is used as a basis for argument. This assertion is held as true until there is
sufficient statistical evidence to conclude otherwise. It states that there is no difference between a
parameter(s) and a specific value.
B. Alternative hypothesis, H1 or Ha – an assertion of all situations not covered by the null
hypothesis. It states that there is a precise difference between a parameter(s) and a specific value.
Illustrative Examples:
State the null hypothesis and the alternative hypothesis to be used. (Note: The equal sign must be
in the null hypothesis, regardless of the statement.)
1. New software is being integrated into the teaching of a course with the hope that it will
help to improve the overall average score for this course. The historical average score for
this course is 70.
Ho:
Ha:
2. A real estate agent claims that the average price for homes in a certain subdivision is
₱1.8M. You believe that the average price is lower. You plan to test his claim by taking a
random sample of the prices of the homes in the subdivision; formulate the set of
hypotheses.
Ho:
Ha:
3. An advertisement on the TV claims that a certain brand of tire has an average lifetime of
50,000 miles. Suppose you plan to test this claim by taking a sample of tires and putting
them on test. What is the correct set of hypotheses to set up?
Ho:
Ha:
Notes:
A type I error will be committed when the true null hypothesis is rejected.
A type II error is committed when a false null hypothesis is not rejected.
A. One tailed test: A test of any statistical hypothesis where the alternative hypothesis is one-sided.
This is also known as Directional Alternative Hypothesis. It could be left-tailed or right-tailed.
(Ha: > or <).
B. Two tailed test: A test of any statistical hypothesis where the alternative hypothesis is two-sided.
Non-directional Alternative Hypothesis is concerned with the two sides of the distribution.
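Test statistic: $z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$
If the sampling distribution is normal, the test is appropriate for any sample size.
Alternative Hypothesis, Critical Region and p-value:
Ha: $\mu > \mu_0$
  Critical Region: Reject Ho if the computed test statistic is greater than $z_\alpha$.
  p-value: Reject Ho if the p-value, $P(Z > z_c)$, is less than $\alpha$.
Ha: $\mu < \mu_0$
  Critical Region: Reject Ho if the computed test statistic is less than $-z_\alpha$.
  p-value: Reject Ho if the p-value, $P(Z < z_c)$, is less than $\alpha$.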
Condition for t-test to be used: The level of measurement for the dependent variable must be interval or
ratio.
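Alternative Hypothesis, Critical Region and p-value:
Ha: $\mu > \mu_0$
  Critical Region: Reject Ho if the computed test statistic is greater than $t_\alpha$.
  p-value: Reject Ho if the p-value, $P(T > t_c)$, is less than $\alpha$.
Ha: $\mu < \mu_0$
  Critical Region: Reject Ho if the computed test statistic is less than $-t_\alpha$.
  p-value: Reject Ho if the p-value, $P(T < t_c)$, is less than $\alpha$.
Ha: $\mu \neq \mu_0$
  Critical Region: Reject Ho if the computed test statistic is greater than $t_{\alpha/2}$ or less than $-t_{\alpha/2}$.
  p-value: Reject Ho if the p-value, $P(|T| > |t_c|)$, is less than $\alpha$.
Note: 1) $t_c$ is the computed test statistic.
2) $t_\alpha$ and the p-values are based on n – 1 degrees of freedom.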
Ha:
b. Let α =
Critical Value:
Critical region:
Decision Rule:
c. Computation (using appropriate test statistics)
d. Decision:
2. A random sample of 75 eleven-year-olds performed a simple task and the time taken, x minutes, noted
for each. The results were summarized as follows:
∑ ∑ . Test, at the 0.01 level of significance, whether there is evidence that the mean
time taken to perform the task is greater than 15 minutes
a. Ho:
Ha:
b. Let α =
Critical Value:
Critical region:
Decision Rule:
c. Computation (using appropriate test statistics)
3. An athlete finds that her times for running a race are normally distributed with mean 10.6 seconds. She
trains intensively for a week and then records her time in the next 5 races. Her times, in seconds are, 10.70,
10.65, 10.75, 10.80, 10.60. Is there evidence, at the 5% level of significance, that training intensively has
improved her times?
a. Ho:
b. Ha:
c. Let α =
Critical Value:
Critical region:
Decision Rule:
d. Computation (using appropriate test statistics)
e. Decision:
f. Conclusion and Interpretation:
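A sketch of the computation step for this exercise, using a one-sample t test. Two things are assumptions: "improved" is read as "faster", so the alternative is Ha: µ < 10.6 (left-tailed), and the critical value 2.132 is the tabled t value for α = 0.05 with 4 degrees of freedom.

```python
import math
import statistics

times = [10.70, 10.65, 10.75, 10.80, 10.60]   # seconds, after training
mu0 = 10.6                                    # mean time before training

n = len(times)
x_bar = statistics.mean(times)
s = statistics.stdev(times)                   # sample standard deviation (n - 1)

# One-sample t statistic: t = (x_bar - mu0) / (s / sqrt(n))
t_computed = (x_bar - mu0) / (s / math.sqrt(n))

# ASSUMPTION: left-tailed test, critical value -t(0.05, df = 4) = -2.132 from a t table
critical_value = -2.132
reject_h0 = t_computed < critical_value
print(round(t_computed, 2), reject_h0)
```

The sample mean is above 10.6, so the statistic is positive and cannot fall in a left-tail rejection region; under these assumptions the data give no evidence that her times improved.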
n1 1 n2 1
̅ Reject H0 if .
Reject H0 if .
√ Reject H0 if or
where: where:
.
$d_0$ = hypothesized difference
$s_d = \sqrt{\frac{n(\sum d^2) - (\sum d)^2}{n(n - 1)}}$
$\bar{d} = \frac{\sum d}{n}$
where:
n = number of paired observations
d = difference of two dependent values
For the equal-variance t test, the observations should be independent, random samples from normal
distributions with the same population variance. For the unequal-variance t test, the observations should
be independent, random samples from normal distributions.
(1) The level of measurement for the dependent variable must be interval or ratio, e.g., weight, income,
degrees of self-care, and level of treatment effect can be used as dependent variables.
(2) The level of measurement for the independent variable must be nominal, e.g., “minority and
nonminority groups”, “race”, “gender”, and “experimental and control groups.”
Illustrative Examples:
1. The following data represent the running times of films produced by two different motion picture
companies:
Test hypothesis that the average running time of films produced by Company B exceeds the
average running time of films produced by Company A by 10 minutes against the alternative that
the difference is more than 10 minutes. Use a 0.05 level of significance and assume the
distribution of times to be approximately normal and the population variances are equal.
2. Two Groups X and Y of freshman students, 28 in each group, are paired for age and score on
Form A of the Otis Group Intelligence Scale. Three weeks later, both groups are given Form B of
the same test. Before the second test, Group X, the experimental group, is praised for its
performance on the first test and urged to try to score better than the other group. Group Y, the
control group, was given the second test without comment. Will the incentive (praise) cause the
final scores of group X and Group Y to differ significantly? Test the hypotheses at 0.01 level of
significance given the information below:
3. A researcher wishes to determine if vitamin E supplements could increase cognitive ability among
elderly women. In 1999, the researcher recruits a sample of elderly women age 75-80. At the time
of the enrollment of the study, the women were randomized to either take Vitamin E or a placebo
for six months. At the end of the six month period, the women were given a cognition test. Higher
scores on this test indicate better cognition. The mean and standard deviation of the test scores
4. Many companies that cater to teenagers have learned that young people respond to commercials
that provide dance-beat music, adventure, and a fast pace, rather than words. In one test, a group
of 128 teenagers were shown commercials featuring rock music, and their purchasing frequency
of the advertised products over the following month were recorded as a single score for each
person in the group. Then a group of 212 teenagers were shown commercials for the same
products, but with the music replaced by verbal persuasion. The purchase frequency scores of this
group were computed as well. The results for the music group were ̅ and ; and
the results for the verbal group were ̅ and . Assume that the two groups were
randomly selected from the entire teenager consumer population. Using the level of
significance, test the null hypothesis that both methods are equally effective versus the alternative
hypothesis that they are not equally effective.
5. An instructor wanted to measure the basic math skills of his students before and after his college
algebra course. A skills test was administered at the beginning of the semester, and the scores
were recorded. At the end of the semester, he administered the same test and recorded the
scores. The table below shows the before-and-after scores for the test for the students who
remained in the course until the end of the semester. The maximum possible score on the test
was 100 points.
Student # 1 2 3 4 5 6 7 8 9
Before 61 58 79 69 62 71 25 48 53
After 68 62 83 65 62 74 31 52 51
Test the hypothesis that learning took place at 0.01 level of significance.
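A sketch of the paired (dependent samples) t test for the before-and-after scores. It assumes that "learning took place" means the mean difference d = After - Before is greater than zero (right-tailed), that the hypothesized difference is 0, and that the critical value 2.896 is the tabled t value for α = 0.01 with 8 degrees of freedom.

```python
import math

before = [61, 58, 79, 69, 62, 71, 25, 48, 53]
after = [68, 62, 83, 65, 62, 74, 31, 52, 51]

d = [a - b for a, b in zip(after, before)]    # difference of two dependent values
n = len(d)                                    # number of paired observations

d_bar = sum(d) / n
# s_d = sqrt((n * sum(d^2) - (sum d)^2) / (n * (n - 1)))
s_d = math.sqrt((n * sum(x * x for x in d) - sum(d) ** 2) / (n * (n - 1)))

# Paired t statistic with hypothesized difference 0 (ASSUMPTION)
t_computed = d_bar / (s_d / math.sqrt(n))

critical_value = 2.896                        # ASSUMED: t(0.01, df = 8), right-tailed
reject_h0 = t_computed > critical_value
print(d, round(t_computed, 2), reject_h0)
```

Under these assumptions the statistic falls short of the critical value, so the null hypothesis is not rejected at the 0.01 level.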
Can one conclude on the basis of these data that after 7 months, the fine motor skills in a population of
similar subjects would be stronger? Let α = 0.05.
A. Analysis of Variance
Analysis of Variance (ANOVA) is used to test hypotheses about three or more population
means rather than population variances. The F-test, named after R.A. Fisher, is used to test the
significance of the differences among the population means.
The purpose of ANOVA, as the term implies, is to establish the variations (or sources of differences)
between groups and within groups. In comparing the groups, there are three possible sources of
variation, these are:
1. Variation between groups (column means or treatments).
2. Variation within groups (experimental error).
3. Total variation among the values of all groups.
When solving ANOVA problems, it is helpful to organize the terms that will be used in the
computations into a matrix called the ANOVA table.
∑∑
where: √
Illustrative Examples:
1. The following represent the number of hours of pain relief provided by 4 different brands of
headache tablets administered to 20 subjects. The 20 subjects were randomly divided into 4
groups and each group was treated with a different brand. Test the hypothesis at the 0.05 level of
significance that the mean number of hours of relief provided by the tablets is the same for all four
brands.
Tablets
A B C D
5 9 3 2
4 7 5 3
8 8 2 4
6 6 3 1
3 9 7 4
Total 26 39 20 14
Mean 5.2 7.8 4.0 2.8
Solution:
1. State the Null and Alternative hypothesis:
Ho: $\mu_A = \mu_B = \mu_C = \mu_D$
Ha: at least two of the means are not equal
2. Level of significance, α = 0.05
3. Test Statistic: The test follows the F-distribution
4. Establish the critical region/Decision Rule:
v1 = k – 1 = 4 – 1 = 3
v2 = N – k = 20 – 4 = 16
Reject Ho if computed f-value > 3.24
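5. Computation
a. $T = \sum x = 99$
b. $\sum x^2 = 603$
c. SST: $\sum x^2 - \frac{T^2}{N} = 603 - \frac{99^2}{20} = 112.95$
d. SSC: $\sum \frac{T_j^2}{n_j} - \frac{T^2}{N} = \frac{26^2 + 39^2 + 20^2 + 14^2}{5} - \frac{99^2}{20} = 68.55$
Source of Variation   Sum of Squares   df   Mean Square   F
Column Means          68.55            3    22.85         8.22
Error                 44.4             16   2.78
Total                 112.95           19
$T_{RANGE} = 4.05\sqrt{\frac{2.78}{5}} = 3.02$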
Compare the absolute mean difference to that of Trange (consider the table below)
Pairs   Absolute Mean Difference   Critical TRANGE   Description
A–B     2.6                        < 3.02            NS
A–C     1.2                        < 3.02            NS
A–D     2.4                        < 3.02            NS
B–C     3.8                        > 3.02            S
B–D     5.0                        > 3.02            S
C–D     1.2                        < 3.02            NS
Interpretation: Pairs B–C and B–D are significantly different. The mean number of hours of relief provided by B is
significantly different from the mean number of hours of relief provided by C. Also, the mean number of
hours of relief provided by B is significantly different from the mean number of hours of relief provided by
D.
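The sums of squares for the headache-tablet example can be reproduced with a short sketch, using the shortcut formulas SST = Σx² - T²/N and SSC = Σ(T_j²/n_j) - T²/N. The variable names are illustrative. The unrounded F ratio comes out near 8.23; the notes' 8.22 results from dividing by the rounded mean square error 2.78.

```python
groups = {
    "A": [5, 4, 8, 6, 3],
    "B": [9, 7, 8, 6, 9],
    "C": [3, 5, 2, 3, 7],
    "D": [2, 3, 4, 1, 4],
}

all_values = [x for g in groups.values() for x in g]
N = len(all_values)          # total number of observations
k = len(groups)              # number of groups (columns)
T = sum(all_values)          # grand total

correction = T ** 2 / N
sst = sum(x * x for x in all_values) - correction                       # total variation
ssc = sum(sum(g) ** 2 / len(g) for g in groups.values()) - correction   # between groups
sse = sst - ssc                                                          # within groups

msc = ssc / (k - 1)          # mean square for column means
mse = sse / (N - k)          # mean square error
f_computed = msc / mse

reject_h0 = f_computed > 3.24   # critical F(0.05; 3, 16) as given in the notes
print(round(sst, 2), round(ssc, 2), round(sse, 2), round(f_computed, 2), reject_h0)
```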
2. A large marketing firm owns many photocopy machines, several of each of different models. Over
the last six months, the office manager has tabulated for each machine the average number of
minutes per week that it is out of service due to repairs, resulting in the following data:
Model      Minutes out of service per week   Total   Mean
Model A:   56 68 42 82 70                    318     63.6
Model B:   74 77 92 54                       297     74.25
Model C:   25 36 56 44 48 38                 247     41.17
Solution:
1. State the Null and Alternative hypothesis:
Ho: $\mu_A = \mu_B = \mu_C$
Ha: at least two of the means are not equal
2. Level of significance, α = 0.01
3. Test Statistic: The test follows the F-distribution
4. Establish the critical region/Decision Rule:
v1 = k – 1 = 3 – 1 = 2
v2 = N – k = 15 – 3 = 12
Reject Ho if computed f-value > 6.93
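5. Computation:
a. $\sum x = 862$
b. $\sum x^2 = 54,674$
c. SST: $\sum x^2 - \frac{(\sum x)^2}{N} = 54,674 - \frac{862^2}{15} = 5,137.73$
d. SSC: $\frac{318^2}{5} + \frac{297^2}{4} + \frac{247^2}{6} - \frac{862^2}{15} = 2,908.95$
$s = \sqrt{(k - 1)F_{(\alpha, v_1, v_2)}} = \sqrt{(3 - 1)(6.93)} = 3.723$
A–C: $\sqrt{185.73\left(\frac{1}{5} + \frac{1}{6}\right)} = 8.25$
B–C: $\sqrt{185.73\left(\frac{1}{4} + \frac{1}{6}\right)} = 8.80$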
b. Compare each absolute mean difference with the critical S range (see the table below; S = significant, NS = not significant).
Pairs    Absolute Mean Difference         Critical S RANGE    Description
A–B              10.65               <         34.00             NS
A–C              22.43               <         30.69             NS
B–C              33.08               >         32.74             S
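The table above can be reproduced from the raw repair times. The sketch below assumes the procedure is Scheffé's test for unequal group sizes, with the critical range for each pair taken as s times the square root of MSE(1/n_i + 1/n_j); the F value 6.93 is copied from the decision rule, and the variable names are illustrative.

```python
from itertools import combinations
from math import sqrt

# Minutes out of service per week, copied from the problem's data.
samples = {
    "A": [56, 68, 42, 82, 70],
    "B": [74, 77, 92, 54],
    "C": [25, 36, 56, 44, 48, 38],
}
F_CRITICAL = 6.93  # value quoted in the example (alpha = 0.01, v1 = 2, v2 = 12)

k = len(samples)
n_total = sum(len(s) for s in samples.values())
means = {name: sum(s) / len(s) for name, s in samples.items()}

# Error mean square: within-group squared deviations over N - k.
sse = sum((x - means[name]) ** 2 for name, s in samples.items() for x in s)
mse = sse / (n_total - k)

s_value = sqrt((k - 1) * F_CRITICAL)

significant = []
for first, second in combinations(samples, 2):
    difference = abs(means[first] - means[second])
    spread = sqrt(mse * (1 / len(samples[first]) + 1 / len(samples[second])))
    critical = s_value * spread
    label = "S" if difference > critical else "NS"
    if label == "S":
        significant.append((first, second))
    print(f"{first}-{second}: {difference:.2f} vs {critical:.2f} -> {label}")
```

Because the script does not round intermediate values, its critical ranges can differ from the table in the second decimal place; the conclusions are unchanged.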
Interpretation: Pair B–C is significantly different. The mean number of minutes per week that
Model B is out of service due to repairs is significantly different from that of Model C.
Interpretation of r:
r              Interpretation
1.0            Perfect (Positive/Negative) Correlation
0.80 – 0.99    Very Strong (Positive/Negative) Correlation
0.60 – 0.79    Strong (Positive/Negative) Correlation
0.40 – 0.59    Moderate (Positive/Negative) Correlation
0.20 – 0.39    Weak (Positive/Negative) Correlation
0.01 – 0.19    Very Weak (Positive/Negative) Correlation
0.0            No Correlation
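The table can be turned into a small lookup. This is only a sketch: the function name `interpret_r` is hypothetical, and it assumes that the bands apply to the absolute value of r and that values falling between two printed bands (for example 0.795) belong to the lower band.

```python
def interpret_r(r):
    """Return the verbal interpretation of a correlation coefficient r."""
    size = abs(r)
    if size > 1:
        raise ValueError("a correlation coefficient must lie between -1 and 1")
    if size == 0:
        return "No Correlation"
    direction = "Positive" if r > 0 else "Negative"
    # Lower bounds of each band, checked from strongest to weakest.
    if size == 1:
        strength = "Perfect"
    elif size >= 0.80:
        strength = "Very Strong"
    elif size >= 0.60:
        strength = "Strong"
    elif size >= 0.40:
        strength = "Moderate"
    elif size >= 0.20:
        strength = "Weak"
    else:
        strength = "Very Weak"
    return f"{strength} {direction} Correlation"


print(interpret_r(0.45))
print(interpret_r(-0.96))
```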
Alternatively, the Spearman rank correlation (a non-parametric measure) is used for variables that
may be quantitative discrete or ordered categorical. Observations are replaced by their ranks in the
calculation of the correlation coefficient. It is used to determine a possible correlation (consistency)
between two ordinal variables.
This results in a simple formula for Spearman's rank correlation, $r_s$:
$r_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$
where:
$d$ = difference in the ranks of the two variables for a given respondent
$n$ = number of pairs of values of $x$ and $y$
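As an illustration of the rank formula, the sketch below applies it to the price and ties-sold figures listed in this section's first exercise. It assumes no tied values (true for that data), so the simple formula applies directly; the helper names `ranks` and `spearman_rs` are illustrative.

```python
# Prices and number of ties sold, copied from the exercise data in this section.
prices = [649, 699, 749, 799, 849, 899, 949, 999, 1049, 1099]
ties_sold = [187, 149, 155, 148, 130, 132, 90, 99, 69, 51]


def ranks(values):
    """Rank values from 1 (smallest) to n; this sketch does not handle ties."""
    if len(set(values)) != len(values):
        raise ValueError("tied values need average ranks, which this sketch omits")
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def spearman_rs(x, y):
    """Spearman's rank correlation: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


rs = spearman_rs(prices, ties_sold)
print(f"r_s = {rs:.4f}")
```

For this data the squared rank differences sum to 324, giving a coefficient of about −0.96, a very strong negative correlation under the interpretation table.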
Price, x    Number of ties sold, y
649         187
699         149
749         155
799         148
849         130
899         132
949          90
999          99
1,049        69
1,099        51
Calculate the coefficient of correlation.
2. The following are the numbers of sales contacts made by 9 salespersons during a week and the
number of sales made. Compute the correlation coefficient.
Sales-person 1 2 3 4 5 6 7 8 9
Sales contact 71 64 100 105 75 79 82 68 110
Sales 25 14 37 40 18 10 22 12 42
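One way to work this exercise is sketched below, assuming the intended measure is the Pearson product-moment coefficient computed with the usual raw-sums formula. The function name `pearson_r` is an illustrative choice.

```python
from math import sqrt

# Sales contacts and sales for the 9 salespersons, copied from the table.
contacts = [71, 64, 100, 105, 75, 79, 82, 68, 110]
sales = [25, 14, 37, 40, 18, 10, 22, 12, 42]


def pearson_r(x, y):
    """Pearson correlation from the raw sums of x, y, xy, x^2 and y^2."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r = pearson_r(contacts, sales)
print(f"r = {r:.4f}")
```

The result is roughly 0.90, which the interpretation table classifies as a very strong positive correlation.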
In general, the goal of linear regression is to find the line that best predicts $y$ from $x$, that is, to find
the line $\hat{y} = a + bx$ that best estimates the regression model $y = \alpha + \beta x + \varepsilon$ by determining $a$ and $b$
that best estimate $\alpha$ and $\beta$.
**Note that linear regression assumes that the data are linear and it finds the slope and intercept that
make a straight line best fit the data.
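The slope: $b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}$
The $y$-intercept: $a = \frac{\sum y - b\sum x}{n} = \bar{y} - b\bar{x}$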
The goal of linear regression is to adjust the values of slope and intercept to find the line
that best predicts $y$ from $x$. More precisely, the goal of regression is to minimize the sum of the
squares of the vertical distances of the points from the line.
Illustrative Examples:
1. A study was made by a retail merchant to determine the relation between weekly advertising
expenditures and sales (both in hundreds of pesos). The following data were recorded:
2. In the 1990's, research efforts focused on the problem of predicting a manufacturer's market
share using information on the quality of its product. Suppose that the following data are
available on market share, in percentage (Y), and product quality, on a scale of 0 to 100,
determined by an objective evaluation procedure (X).
X 27 39 73 66 33 43 47 55 60 68 70 75
Y 2 3 10 9 4 6 5 8 7 9 10 13
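A least-squares fit for this data can be sketched as follows, assuming the standard raw-sums formulas for the slope and the intercept and treating market share (Y) as the response and product quality (X) as the predictor. The function name `fit_line` is illustrative.

```python
# Product quality (X) and market share in percent (Y), copied from the table.
quality = [27, 39, 73, 66, 33, 43, 47, 55, 60, 68, 70, 75]
share = [2, 3, 10, 9, 4, 6, 5, 8, 7, 9, 10, 13]


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(p * q for p, q in zip(x, y))
    sum_x2 = sum(p * p for p in x)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    intercept = sum_y / n - slope * sum_x / n   # y-bar minus b times x-bar
    return intercept, slope


a, b = fit_line(quality, share)
print(f"predicted share = {a:.3f} + {b:.4f} * quality")

# Sum of squared vertical distances, the quantity least squares minimizes.
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(quality, share))
print(f"sum of squared residuals = {sse:.3f}")
```

The fitted slope is about 0.189 and the intercept about −3.16, so each additional quality point is associated with roughly 0.19 percentage points of market share within the observed range.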
V. Chi-Square Test
[Contingency table of observed frequencies with a Total row]
Data Consideration
a) Use ordered or unordered numeric categorical variables (ordinal or nominal levels of
measurement).
b) The data are assumed to be a random sample. The expected frequency for each category
should be at least 1, and no more than 20% of the categories should have expected frequencies
of less than 5. Otherwise, use Fisher's exact test or another suitable test.
$\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$
where:
$\chi^2$ = the test statistic that asymptotically approaches a chi-square distribution
$O_{ij}$ = the observed frequency of the ith row and jth column
$E_{ij}$ = the expected (theoretical) frequency of the ith row and jth column
B. Measures of Association
Use of the chi-square test of independence can provide information on whether the
association between two qualitative variables A and B can be regarded as statistically
significant or not. Direct evaluation of the degree of association can be done using measures of
association, which are based on the computed chi-square value ($\chi^2$). The nearer the value of the
Illustrative Examples:
1. Grades in a statistics course and mathematical analysis for business taken simultaneously were as
follows for a group of students.
                     Mathematical Analysis for Business Grade
Statistics Grade       A      B      C    Others
A                     25      6     17      13
B                     17     16     15       6
C                     18      4     18      10
Others                10      8     11      20
Are the grades in statistics and mathematical analysis for business related? Use in reaching your
conclusion.
2. A random sample of students is asked their opinions on a proposed core curriculum change. The
results are as follows.
Opinion
Class Favoring Opposing
Freshman 120 80
Sophomore 70 130
Junior 60 70
Senior 40 60
3. A company has to choose among three pension plans. Management wishes to know whether the
preference for plans is independent of job classification and wants to use . The opinions of a
random sample of 500 employees are shown below:
Pension Plan
Job Classification 1 2 3
Salaried workers 160 140 40
Hourly workers 40 60 60
3. A survey sampling example showing a cross-classification of gender by social class is given below.
Use the chi-square test of independence to determine if gender and social class of the respondent
are independent of each other. Use the 0.05 level of significance.
Gender
Social Class Male Female
Upper Middle 33 29
Middle 153 181
Working 103 81
Lower 16 14
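A possible way to carry out this test is sketched below. It assumes the usual chi-square statistic, with each expected frequency computed as (row total × column total) / grand total and (r − 1)(c − 1) degrees of freedom; the critical value 7.815 is the standard table entry for 3 degrees of freedom at the 0.05 level, and the function name is illustrative.

```python
# Observed frequencies (Male, Female) by social class, copied from the table.
observed = {
    "Upper Middle": [33, 29],
    "Middle": [153, 181],
    "Working": [103, 81],
    "Lower": [16, 14],
}


def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom) for an r x c table."""
    rows = list(table.values())
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for row, row_total in zip(rows, row_totals):
        for obs, col_total in zip(row, col_totals):
            expected = row_total * col_total / grand_total
            chi2 += (obs - expected) ** 2 / expected
    df = (len(rows) - 1) * (len(col_totals) - 1)
    return chi2, df


chi2, df = chi_square_independence(observed)
CHI2_CRITICAL = 7.815  # chi-square table value for alpha = 0.05, df = 3

print(f"chi-square = {chi2:.3f}, df = {df}")
print("Reject Ho" if chi2 > CHI2_CRITICAL else "Do not reject Ho")
```

The statistic comes out near 5.37, below 7.815, so under these assumptions the data do not show a significant association between gender and social class at the 0.05 level.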
4. A survey of a sample of adults in X city was conducted to examine public attitudes toward government
cuts in social spending. Concerning this data, the researcher comments, “Respondents who knew
someone on social assistance, were more likely to feel that welfare rates were too low...”