
CHAPTER 1

Introduction to Statistics

Objectives:
1. define Statistics and its concepts
2. identify the type of variables

1.1 What is Statistics?


Statistics is a branch of mathematics that deals with the collection, organization, presentation, analysis, and
interpretation of data.

Statistics is used in different fields such as:

• Education
• Business

1.2 Types of Statistics
Statistics has two types: descriptive and inferential.

• Descriptive statistics consists of methods for organizing and summarizing data.
• Inferential statistics consists of methods for drawing conclusions about a population from a sample and for
measuring the reliability of those conclusions.

1.3

CHAPTER 5
TESTS OF HYPOTHESIS

Objectives
The learner should be able to:
1. illustrate null and alternative hypotheses,
2. illustrate level of significance and rejection region,
3. formulate the appropriate null and alternative hypotheses on a population mean and
proportion,
4. identify the appropriate form of the test statistic,
5. compute the test statistic value, and
6. solve problems involving tests of hypothesis.

Lesson 5.1 Hypothesis Test

A hypothesis test is a statistical test that is used to determine whether there is enough
evidence in a sample of data to infer that a certain condition is true for the entire population.
A hypothesis test examines two opposing hypotheses about a population:
a. null hypothesis (Ho)
b. alternative hypothesis (Ha)

Null hypothesis
It is the statement being tested. Usually the null hypothesis is a statement of "no effect" or "no
difference".

Alternative hypothesis
It is the statement you want to be able to conclude is true. A common misconception is that
statistical hypothesis tests are designed to select the more likely of two hypotheses. Instead, a test will
remain with the null hypothesis until there is enough evidence (data) to support the alternative
hypothesis.

The general idea of hypothesis testing involves:


1. making an initial assumption.
2. collecting evidence.
3. deciding whether to reject or not reject the initial assumption based on the available
evidence

Example:
Problem: Is normal body temperature really 98.6 °F?
Solution:
Consider the population of many adults. A researcher hypothesized that the average adult body
temperature is lower than the often-advertised 98.6 degrees F. That is, the researcher wants an answer to
the question: "Is the average adult body temperature 98.6 degrees? Or is it lower?" To answer his
research question, the researcher starts by assuming that the average adult body temperature is 98.6
degrees F.

Then, the researcher goes out and tries to find evidence that refutes his initial assumption. In doing so, he
selects a random sample of 130 adults. The average body temperature of the 130 sampled adults is 98.25
degrees.
Then, the researcher uses the data he collected to make a decision about his initial assumption. It is
either likely or unlikely that the researcher would collect the evidence he did given his initial
assumption that the average adult body temperature is 98.6 degrees:

• If it is likely, then the researcher does not reject his initial assumption that the average adult
body temperature is 98.6 degrees. There is not enough evidence to do otherwise.

• If it is unlikely, then:
  - either the researcher's initial assumption is correct and he experienced a very unusual
    event;
  - or the researcher's initial assumption is incorrect.

Types of Test

1. Two-tailed test
A test to determine whether a population parameter has changed in either direction. The null hypothesis
can be rejected by observing a statistic that falls in either of the two tails of the sampling distribution.

2. One-tailed test
A test in which the null hypothesis can be rejected only by a statistic that falls in one tail of the
sampling distribution. It is used in either of the following cases:
a. the sample is suspected to come from a population whose parameter is less than the hypothesized value
(left-tailed test)
b. the sample is suspected to come from a population whose parameter is greater than the hypothesized
value (right-tailed test)

The table below may help to clarify the tests:

Type           Null             Alternative
Right-tailed   H0 : μ0 = μ1     HA : μ0 > μ1
Left-tailed    H0 : μ0 = μ1     HA : μ0 < μ1
Two-tailed     H0 : μ0 = μ1     HA : μ0 ≠ μ1
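To make the three test types concrete, the sketch below looks up where the rejection region begins for each type when the test statistic is a z-score. It is only an illustration: the function name `critical_z` and the tail labels "right", "left", and "two" are assumptions made for this example, and Python's standard `statistics.NormalDist` stands in for a printed z table.

```python
from statistics import NormalDist


def critical_z(alpha, tail):
    """Return the critical z value(s) for a test at significance level alpha.

    tail is "right", "left", or "two" (labels chosen for this sketch).
    """
    std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1
    if tail == "right":
        # All of alpha sits in the upper tail.
        return (std_normal.inv_cdf(1 - alpha),)
    if tail == "left":
        # All of alpha sits in the lower tail.
        return (std_normal.inv_cdf(alpha),)
    if tail == "two":
        # alpha is split equally between the two tails.
        upper = std_normal.inv_cdf(1 - alpha / 2)
        return (-upper, upper)
    raise ValueError("tail must be 'right', 'left', or 'two'")


print(critical_z(0.05, "two"))    # roughly (-1.96, 1.96)
print(critical_z(0.05, "right"))  # roughly (1.645,)
print(critical_z(0.05, "left"))   # roughly (-1.645,)
```

A two-tailed test splits the significance level between both tails, which is why its critical values lie farther from zero than the one-tailed value at the same level.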

Lesson 5.2 Type I and Type II Errors

We make our decision based on evidence, not on 100% guaranteed proof.

Note:
• If we reject the null hypothesis, we do not prove that the alternative
hypothesis is true.
• If we do not reject the null hypothesis, we do not prove that the null
hypothesis is true.
We merely state whether there is enough evidence to behave one way or the other. This is
always true in statistics. Because of this, whatever the decision, there is always a chance that
we made an error.

Two types of errors in hypothesis testing:

Decision             Null hypothesis is true   Null hypothesis is false
Do not reject null   OK                        Type II error
Reject null          Type I error              OK

• Type I error - when the null hypothesis is rejected even though it is true

• Type II error - when the null hypothesis is not rejected even though it is false
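The link between a Type I error and the level of significance can be seen in a small simulation. The sketch below is illustrative only: the population values (mean 100, standard deviation 15), the sample size, and the function name are all assumptions made for this example. It repeatedly draws samples from a population where the null hypothesis is true and counts how often a two-tailed z-test at α = 0.05 rejects anyway.

```python
import random
from statistics import NormalDist, mean


def type_i_error_rate(trials=2000, n=30, mu0=100.0, sigma=15.0, alpha=0.05, seed=0):
    """Estimate how often a true null hypothesis is rejected (Type I error)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    critical = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    rejections = 0
    for _ in range(trials):
        # The null hypothesis is TRUE here: data really come from mean mu0.
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        z = (mean(sample) - mu0) * n ** 0.5 / sigma
        if abs(z) > critical:
            rejections += 1  # rejecting a true null hypothesis
    return rejections / trials


rate = type_i_error_rate()
print(rate)  # expected to land near alpha = 0.05
```

The observed rejection rate should land close to the chosen significance level, which is the sense in which α is the probability of a Type I error.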

[Figures: rejection regions for the right-tailed test, the left-tailed test, and the two-tailed test]

Lesson 5.3 Hypothesis Testing Procedure

To analyze the data from a study, the following procedure must be followed:

1. State the hypotheses.
State both the null hypothesis and the alternative hypothesis.
2. Select the appropriate test statistic and level of significance.
When testing a hypothesis about a proportion, we use the z-statistic (z-test)
and the formula:

$z = \dfrac{p - P}{\sqrt{P(1 - P)/n}}$

where p is the sample proportion, P is the hypothesized population proportion, and n is the sample size.

When testing a hypothesis about a mean, we use the z-statistic or the t-statistic
according to the following conditions:
a. If the population standard deviation, σ, is known and either the data are normally
distributed or the sample size n > 30, we use the normal distribution (z-statistic).
b. If the population standard deviation, σ, is unknown and either the data are
normally distributed or the sample size is greater than 30 (n > 30), we use the
t-distribution (t-statistic).

Note: The guideline for choosing the level of significance is as follows:
   a. the 0.10 level for political polling,
   b. the 0.05 level for consumer research projects, and
   c. the 0.01 level for quality assurance work.
3. State the decision rules.
The decision rules state the conditions under which the null hypothesis will be
rejected or not rejected. The critical value for the test statistic is determined by the level of
significance. The critical value is the value that divides the non-rejection region from the
rejection region.
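4. Compute the appropriate test statistic and make the decision.

When we use the z-statistic, we use the formula:

$z = \dfrac{(\bar{x} - \mu)\sqrt{n}}{s}$, if there is only one sample mean, or

$z = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$, if there are two sample means.

(When the population standard deviation σ is known, it is used in place of s.)

When we use the t-statistic, we use the formula:

a. if there is only one sample mean

$t = \dfrac{(\bar{x} - \mu)\sqrt{n}}{s}$

b. if there are two sample means

$t = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} \cdot \dfrac{n_1 + n_2}{n_1 n_2}}}$

where: x̄1 - mean of sample 1
x̄2 - mean of sample 2
s1 - standard deviation of sample 1
s2 - standard deviation of sample 2
n1 - sample size of sample 1
n2 - sample size of sample 2
µ - population mean stated in the null hypothesis

For a single sample, x̄, s, and n are the mean, standard deviation, and size of that sample.

Compare the computed test statistic with the critical value.

• If the computed value is within the rejection region(s), reject the null
hypothesis.
• If not, do not reject the null hypothesis.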

5. Interpret the decision.


Based on the decision in Step 4, state a conclusion in the context of the
original problem.
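The decision in Step 4 can be written as a small rule. The following is a minimal sketch, assuming the critical value has already been read from the appropriate table; the function name `reject_null` and the tail labels are hypothetical names used only for this illustration.

```python
def reject_null(test_statistic, critical_value, tail):
    """Return True when the test statistic falls in the rejection region.

    critical_value is the tabled value (its sign is ignored);
    tail is "right", "left", or "two" (labels chosen for this sketch).
    """
    cutoff = abs(critical_value)
    if tail == "right":
        return test_statistic > cutoff       # rejection region: upper tail
    if tail == "left":
        return test_statistic < -cutoff      # rejection region: lower tail
    if tail == "two":
        return abs(test_statistic) > cutoff  # rejection region: both tails
    raise ValueError("tail must be 'right', 'left', or 'two'")


# A statistic far beyond the critical value is rejected.
print(reject_null(7.5, 1.96, "two"))    # True
# A statistic just inside the critical values is not rejected.
print(reject_null(-1.93, 1.96, "two"))  # False
```

Step 5 then turns the True/False outcome into a statement about the original problem, for example "the mean differs from the claimed value" or "there is not enough evidence of a difference".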

Lesson 5.3.1 Hypothesis Test for a Proportion

A hypothesis test of a proportion is appropriate when the following conditions are met:

1. The sampling method is simple random sampling.


2. Each sample point can result in just two possible outcomes. We call one of these
outcomes a success and the other, a failure.
3. The sample includes at least 10 successes and 10 failures.
4. The population size is at least 20 times as big as the sample size.

The P-value is the probability of observing a sample statistic at least as extreme as the test statistic,
assuming the null hypothesis is true. Since the test statistic is a z-score, the P-value is found from the
standard normal distribution. The z-score is computed as:

$z = \dfrac{p - P}{S}$

where: P - the hypothesized value of the population proportion in the null hypothesis
p - the sample proportion
S - the standard deviation of the sampling distribution

If the standard deviation is not given, use the formula below to obtain its value:

$S = \sqrt{\dfrac{P(1 - P)}{n}}$

where: n - the sample size
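The two formulas above can be combined in a few lines. The sketch below is a hedged illustration, not a worked example from this module: the numbers (a hypothesized proportion of 0.80, a sample of 100, and a sample proportion of 0.73) are made up for demonstration, and the function name is hypothetical. It reads the P-value from `statistics.NormalDist` rather than from a printed table.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z_test(sample_proportion, hypothesized_proportion, n):
    """Return (S, z, two-tailed P-value) for a one-proportion z-test."""
    # Standard deviation of the sampling distribution under the null hypothesis.
    s = sqrt(hypothesized_proportion * (1 - hypothesized_proportion) / n)
    z = (sample_proportion - hypothesized_proportion) / s
    # Two-tailed P-value: area in both tails beyond |z|.
    p_value = 2 * NormalDist().cdf(-abs(z))
    return s, z, p_value


# Hypothetical data: H0 says P = 0.80; 73 successes are seen in n = 100 trials.
s, z, p_value = one_proportion_z_test(0.73, 0.80, 100)
print(round(s, 3), round(z, 2), round(p_value, 3))
```

With these made-up numbers the P-value is larger than 0.05, so the null hypothesis would not be rejected at that level.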

Hypothesis Test: Difference between Proportions

This is a test to determine whether the difference between two proportions is significant. The
test procedure, called the two-proportion z-test, is appropriate when the following conditions are
met:
• The sampling method for each population is simple random sampling.
• The samples are independent.
• Each sample includes at least 10 successes and 10 failures.
• Each population is at least 20 times as big as its sample.

This approach consists of four steps:

1. State the null and alternative hypotheses.

Every hypothesis test requires the analyst to state a null hypothesis and an
alternative hypothesis. The table below shows three sets of hypotheses. Each makes a
statement about the difference, d, between two population proportions, P1 and P2. (In the table,
the symbol ≠ means " not equal to ").

Set   Null hypothesis   Alternative hypothesis   Number of tails
1     P1 - P2 = 0       P1 - P2 ≠ 0              2
2     P1 - P2 ≥ 0       P1 - P2 < 0              1
3     P1 - P2 ≤ 0       P1 - P2 > 0              1

The first set of hypotheses (Set 1) is an example of a two-tailed test, since an extreme value
on either side of the sampling distribution would cause a researcher to reject the null hypothesis. The
other two sets of hypotheses (Sets 2 and 3) are one-tailed tests, since an extreme value on only one
side of the sampling distribution would cause a researcher to reject the null hypothesis.
When the null hypothesis states that there is no difference between the two population
proportions (i.e., d = 0), the null and alternative hypothesis for a two-tailed test are often stated in the
following form.

H0: P1 = P2
Ha: P1 ≠ P2

2. Formulate an analysis plan

The analysis plan describes how to use sample data to reject or fail to reject the null hypothesis. It
should specify the following elements.

• Significance level. Often, researchers choose significance levels equal to 0.01, 0.05, or
0.10; but any value between 0 and 1 can be used.

• Test method. Use the two-proportion z-test to determine whether the hypothesized difference
between population proportions differs significantly from the observed sample difference.

3. Analyze sample data

Using sample data, complete the following computations to find the test statistic and its
associated P-value.

• Pooled sample proportion. Since the null hypothesis states that P1 = P2, we use a
pooled sample proportion (p) to compute the standard error of the sampling distribution.

$p = \dfrac{p_1 n_1 + p_2 n_2}{n_1 + n_2}$

where: p1 - the sample proportion from population 1
p2 - the sample proportion from population 2
n1 - the size of sample 1
n2 - the size of sample 2

• Standard error. Compute the standard error (SE) of the sampling distribution of the
difference between two proportions.

$SE = \sqrt{p(1 - p)\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}$

where: p - the pooled sample proportion
n1 - the size of sample 1
n2 - the size of sample 2

• Test statistic. The test statistic is a z-score (z) defined by the following equation.

$z = \dfrac{p_1 - p_2}{SE}$

where: p1 - the proportion from sample 1
p2 - the proportion from sample 2
SE - the standard error of the sampling distribution

• P-value. The P-value is the probability of observing a sample statistic at least as extreme as
the test statistic, assuming the null hypothesis is true.

The analysis described above is a two-proportion z-test.

4. Interpret results.
If the sample findings are unlikely, given the null hypothesis, the researcher rejects the
null hypothesis. Typically, this involves comparing the P-value to the significance level,
and rejecting the null hypothesis when the P-value is less than the significance
level.

Example:
Problem: Suppose a drug company develops a new drug designed to prevent colds. The company
states that the drug is equally effective for men and women. To test this claim, they choose a
simple random sample of 100 women and 200 men from a population of 100,000 volunteers.
At the end of the study, 38% of the women caught a cold, and 51% of the men caught a cold.
Based on these findings, can we reject the company's claim that the drug is equally effective
for men and women? Use a 0.05 level of significance.
Solution:

Step 1: State the hypotheses


Null hypothesis: P1 = P2
Alternative hypothesis: P1 ≠ P2

Note: These hypotheses constitute a two-tailed test. The null hypothesis will be rejected if the
proportion from population 1 is too big or if it is too small.

Step 2: Formulate an analysis plan.

significance level : 0.05


test method: two-proportion z-test.

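Step 3: Analyze sample data.

Using the sample data, we compute the pooled sample proportion (p), the standard error (SE),
and the z-score test statistic (z). Here p1 = 0.38 and n1 = 100 (women), p2 = 0.51 and n2 = 200 (men).

Pooled sample proportion:

$p = \dfrac{p_1 n_1 + p_2 n_2}{n_1 + n_2} = \dfrac{(0.38)(100) + (0.51)(200)}{100 + 200} = \dfrac{140}{300} = 0.467$

Standard error (SE):

$SE = \sqrt{p(1 - p)\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)} = \sqrt{(0.467)(0.533)(0.01 + 0.005)} = \sqrt{(0.2489)(0.015)} = \sqrt{0.00373} = 0.061$

z-score:

$z = \dfrac{p_1 - p_2}{SE} = \dfrac{0.38 - 0.51}{0.061} = -2.13$

Since we have a two-tailed test, the P-value is the probability that the z-score is less than
-2.13 or greater than 2.13. Thus, the P-value = 0.0166 + 0.0166 = 0.0332.

Note: From Table 3, when z = -2.13, P-value = 0.0166.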

Step 4: Interpret results.

Since the P-value (0.0332) is less than the significance level (0.05), we
reject the null hypothesis. The data do not support the claim that the drug is equally
effective for men and women.

Note: If you use this approach on an exam, you may also want to mention why this
approach is appropriate. Specifically, the approach is appropriate because the sampling
method was simple random sampling, the samples were independent, each population
was at least 20 times larger than its sample, and each sample included at least 10
successes and 10 failures.
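The arithmetic of this example can be checked with a short script. This is a sketch only: the function name is hypothetical, and the P-value is computed with `statistics.NormalDist` instead of being read from Table 3, so it differs slightly from the tabled figure because z is not rounded to two decimals first.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(p1, n1, p2, n2):
    """Return (pooled proportion, SE, z, two-tailed P-value)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * NormalDist().cdf(-abs(z))  # area in both tails
    return pooled, se, z, p_value


# Drug example: 38% of 100 women and 51% of 200 men caught a cold.
pooled, se, z, p_value = two_proportion_z_test(0.38, 100, 0.51, 200)
print(round(pooled, 3), round(se, 3), round(z, 2), round(p_value, 4))
print("reject H0" if p_value < 0.05 else "do not reject H0")
```

The script reproduces the pooled proportion, standard error, and z-score of the worked solution and reaches the same decision at the 0.05 level.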

Lesson 5.3.2 Hypothesis Test for a Mean

A hypothesis test of a mean is conducted when the following conditions are met:
• The sampling method is simple random sampling.
• The sampling distribution is normal or nearly normal.

We can say that the sampling distribution will be approximately normally distributed if any
of the following conditions apply:
• The population distribution is normal.
• The population distribution is symmetric, unimodal, without outliers, and the sample
size is 15 or less.
• The population distribution is moderately skewed, unimodal, without outliers, and the sample
size is between 16 and 40.
• The sample size is greater than 40, without outliers.

This approach consists of four steps:

• State the hypotheses. Every hypothesis test requires the analyst to state a null
hypothesis and an alternative hypothesis. The table below may be used, where M is the
hypothesized value of the population mean.

Set   Null hypothesis   Alternative hypothesis   Tails
1     μ = M             μ ≠ M                    Two-tailed
2     μ ≥ M             μ < M                    One-tailed
3     μ ≤ M             μ > M                    One-tailed

• Formulate an analysis plan. The analysis plan describes how to use sample data to
reject or fail to reject the null hypothesis. It should specify the following elements.

Significance level. Often, researchers choose significance levels equal to 0.01,
0.05, or 0.10; but any value between 0 and 1 can also be used.

Test method. Use the one-sample t-test to determine whether the
hypothesized mean differs significantly from the observed sample mean.

• Analyze sample data. Using sample data, conduct a one-sample t-test. This involves:

Standard error. Compute the standard error (SE) of the sampling distribution.

$SE = s\sqrt{\dfrac{1}{n} \cdot \dfrac{N - n}{N - 1}}$

where: s - the standard deviation of the sample
N - the population size
n - the sample size

When the population size is much larger (at least 20 times larger) than the sample
size, the standard error can be approximated by:

$SE = \dfrac{s}{\sqrt{n}}$

Degrees of freedom. The degrees of freedom (DF) are equal to the sample
size (n) minus one.

DF = n - 1

Test statistic. The test statistic is a t statistic (t) defined by the following equation.

$t = \dfrac{\bar{x} - \mu}{SE}$

where: t - the t-statistic value
x̄ - the sample mean
μ - the hypothesized population mean in the null hypothesis
SE - the standard error

P-value. The P-value is the probability of observing a sample statistic at least as extreme as the
test statistic, assuming the null hypothesis is true.

• Interpret results. This involves comparing the P-value to the significance level, and
rejecting the null hypothesis when the P-value is less than the significance level.

Example:
Problem:
Bon Air, an elementary school, has 1000 students. The principal of the school thinks that the average IQ of
students at Bon Air is at least 110. To test her claim, she administers an IQ test to 20 randomly
selected students. Among the sampled students, the average IQ is 108 with a
standard deviation of 10. Based on these results, should the principal reject or fail to reject her original
hypothesis? Assume a significance level of 0.01. (Assume that test scores in the population of students
are normally distributed.)

Solution:

Step 1: State the hypotheses. The first step is to state the null hypothesis and an alternative
hypothesis.

Null hypothesis: μ ≥ 110


Alternative hypothesis: μ < 110

Note: The hypotheses constitute a one-tailed test. The null hypothesis will be rejected if the
sample mean is too small.

Step 2: Formulate an analysis plan.

Significance level : 0.01


Test method: one-sample t-test.

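Step 3: Analyze sample data.

Standard error (SE). The population (1000) is more than 20 times the sample size (20), so the
approximate formula applies:

$SE = \dfrac{s}{\sqrt{n}} = \dfrac{10}{\sqrt{20}} = 2.236$

Degrees of freedom (DF) = n - 1
DF = 20 - 1
DF = 19

t-test statistic (t):

$t = \dfrac{\bar{x} - \mu}{SE} = \dfrac{108 - 110}{2.236} = -0.894$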

Logic of the Analysis:


Given the alternative hypothesis (μ <110), we want to know whether the observed sample
mean is small enough to cause us to reject the null hypothesis.

The observed sample mean produced a t-test statistic of -0.894, and P(t < -0.894) =
0.19. This means we would expect to find a sample mean of 108 or smaller in 19 percent of our
samples, if the true population IQ were 110. Thus, the P-value in this analysis is 0.19.

Step 4: Interpret results. Since the P-value (0.19) is greater than the significance level (0.01), we
cannot reject the null hypothesis.
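The numbers in this example, including the P-value, can be reproduced without a t table. The sketch below is an illustration under stated assumptions: it uses only Python's standard library, so the lower-tail probability of the t-distribution is approximated by integrating the t density with Simpson's rule (a numerical shortcut chosen for this sketch, not a method from this module). The function names are hypothetical.

```python
from math import gamma, pi, sqrt


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coeff = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)


def t_lower_tail(t, df, steps=1000):
    """Approximate P(T < t) using Simpson's rule on the interval [0, |t|]."""
    a = abs(t)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # area between 0 and |t|
    return 0.5 - area if t < 0 else 0.5 + area


# IQ example: sample mean 108, hypothesized mean 110, s = 10, n = 20.
se = 10 / sqrt(20)
t = (108 - 110) / se
df = 20 - 1
p_value = t_lower_tail(t, df)  # left-tailed test, so use the lower tail
print(round(se, 3), round(t, 3), df, round(p_value, 2))
print("reject H0" if p_value < 0.01 else "do not reject H0")
```

Because the alternative hypothesis is μ < 110, only the lower tail counts toward the P-value; a two-tailed test would double it.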

Hypothesis Test : Difference between Means

This lesson explains how to conduct a hypothesis test for the difference between two means.
The test procedure, called the two-sample t-test, is appropriate when the following conditions are
met:
• The sampling method for each sample is simple random sampling.
• The samples are independent.
• Each population is at least 20 times larger than its respective sample.

Lesson 5.3.3 Z- test

A Z-test is any statistical test for which the distribution of the test statistic under the null
hypothesis can be approximated by a normal distribution. It is used for testing hypotheses when:
1. the population standard deviation is known, or
2. the sample size is at least 30, so that the sample standard deviation is a close estimate of the
population standard deviation.

The one-sample z-test is used when there is only one sample in the experiment and both the hypothesized
mean and the standard deviation of the population are known. Likewise, the two-sample z-test is used to compare
the population means of two groups.

When the data are normally distributed, we follow these steps:

1. Formulate the null hypothesis and the alternative hypothesis.
2. Specify the level of significance; decide whether a two-tailed test or a one-tailed test (right-
tailed or left-tailed) shall be used; decide the test statistic to be used; and find the
critical value from TABLE 4.
3. Compute the value of the test statistic.
4. Graph the computed z value and the critical value, and make a decision.
5. State the conclusion.

Z- test Using One Sample Mean

Example 1:
Problem:
A researcher reported that the mean grade of Grade 11 students in statistics was 84%. A random
sample of 100 students showed a mean of 87% with a standard deviation of 4%. Is there a significant
difference between the reported mean and the sample mean? Use α = 0.05.
This problem may be solved in two ways: using either a one-tailed test or a
two-tailed test.

Solution 1: Using two-tailed test

Step 1: Formulate the null hypothesis and the alternative hypothesis.

Ho: µ = 84
Ha: µ ≠ 84

Step 2: Specify the level of significance and decide whether two-tailed test, or one-tailed test
(right-tailed test or left-tailed test) shall be used, and decide the test statistic to be
used, and find the critical value from TABLE 4.

Level of significance: α = 0.05


Tailed test: two-tailed test
Test statistic: Z-test
Critical value: ±1.96 (taken from Z tabular value/ TABLE 4)

Step 3: Compute the value of the test statistic.

$z = \dfrac{(\bar{x} - \mu)\sqrt{n}}{s}$

µ = 84
x̄ = 87
n = 100
s = 4

$z = \dfrac{(87 - 84)\sqrt{100}}{4} = 7.5$

Step 4: Graph the computed z value and the critical value, and make a decision.

Note: The shaded part is the rejection region, and the unshaded part is the
non-rejection region.

Since the computed z (7.5) is located in the rejection region, the null hypothesis is
rejected.

Step 5: State the conclusion.

At the 0.05 level, the sample does not support the reported mean grade of 84% for Grade 11
students. The difference of 3 percentage points between the hypothesized mean and the sample
mean is significant.

Solution 2: Using one-tailed test

Step 1: Formulate the null hypothesis and the alternative hypothesis.

Ho: µ = 84
Ha: µ >84 (since the sample mean is 87)

Step 2: Specify the level of significance and decide whether two-tailed test, or one-tailed test
(right-tailed test or left-tailed test) shall be used, and decide the test statistic to be used,
and find the critical value from TABLE 4.

Level of significance: α = 0.05


Tailed test: one-tailed test ( Right-tailed test )
Test statistic: Z-test
Critical value: + 1.645 (The value is taken from Z tabular value/ table 4, and the
positive value must be used since it conveys a right tailed test.)

Step 3: Compute the value of the test statistic.

$z = \dfrac{(\bar{x} - \mu)\sqrt{n}}{s}$

µ = 84
x̄ = 87
n = 100
s = 4

$z = \dfrac{(87 - 84)\sqrt{100}}{4} = 7.5$

Step 4: Graph the computed z value and the critical value, and make a decision.

Note: The shaded part is the rejection region, and the unshaded part is the
non-rejection region.

Since the computed z (7.5) is located in the rejection region, the null hypothesis is
rejected.

Step 5: State the conclusion.

At the 0.05 level, the sample does not support the reported mean grade of 84% for Grade 11
students. The sample mean is significantly higher than the hypothesized mean, by 3
percentage points.
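Both solutions to Example 1 can be checked in a few lines. The following is a minimal sketch, assuming critical values are computed with `statistics.NormalDist` rather than read from TABLE 4; the function name `one_sample_z` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z(sample_mean, hypothesized_mean, sd, n):
    """z statistic for one sample mean: (x-bar - mu) * sqrt(n) / s."""
    return (sample_mean - hypothesized_mean) * sqrt(n) / sd


alpha = 0.05
z = one_sample_z(87, 84, 4, 100)

# Critical values computed instead of looked up in a table.
two_tailed_critical = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96
right_tailed_critical = NormalDist().inv_cdf(1 - alpha)    # about 1.645

reject_two_tailed = abs(z) > two_tailed_critical
reject_right_tailed = z > right_tailed_critical

print(z)                    # 7.5
print(reject_two_tailed)    # True
print(reject_right_tailed)  # True
```

With z this far from zero, the choice between a one-tailed and a two-tailed test does not change the decision.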

Example 2:
Problem :
A supermarket owner believes that the mean family income of its customers is Php 45,000
per month. Forty-nine customers were randomly selected and asked their family income. The sample mean was
Php 42,200 per month and the standard deviation was Php 2,800. Is there enough evidence to say that
the mean family income differs from Php 45,000 per month at the 1% significance level?

Solution:

Step 1: Formulate the null hypothesis and the alternative hypothesis.

Ho: µ = 45,000

Ha: µ ≠45,000

Step 2: Specify the level of significance and decide whether two-tailed test, or one-tailed
test (right-tailed test or left-tailed test) shall be used, and decide the test statistic to be
used, and find the critical value from table 4.

Level of significance: α = 0.01

Tailed test: two-tailed test

Test statistic: Z-test

Critical value: ±2.575 (The value is taken from Z tabular value/ table 4)

Step 3: Compute the value of the test statistic.

$z = \dfrac{(\bar{x} - \mu)\sqrt{n}}{s}$

µ = 45,000
x̄ = 42,200
n = 49
s = 2,800

$z = \dfrac{(42{,}200 - 45{,}000)\sqrt{49}}{2{,}800} = \dfrac{(-2{,}800)(7)}{2{,}800} = -7$

Step 4: Graph the z value and the critical value, and make a decision.

Note: The shaded part is the rejection region, and the unshaded part is the
non-rejection region.

Since the computed z (-7) is less than -2.575, it is located in the rejection region;
therefore, the null hypothesis is rejected.

Step 5: State the conclusion.

At the 1% significance level, there is a significant difference between the hypothesized
mean family income (Php 45,000) and the sample mean (Php 42,200).

Z - Test Using Two Sample Means

Example 3:
Problem :
The average lifetimes of 120 Brand X AA batteries and 120 Brand Y AA batteries were found to
be 9.1 hours and 9.6 hours, respectively. Suppose the population standard deviations of lifetimes are 1.9
hours for Brand X and 2.1 hours for Brand Y batteries. Test the hypothesis that the two brands have the same
mean lifetime using α = 0.05.

Solution 1: Using two-tailed test


Step 1: Formulate the null hypothesis and the alternative hypothesis.

Ho: µ1 = µ2

Ha: µ1≠ µ2

Step 2: Specify the level of significance and decide whether two-tailed test or one-
tailed test (right-tailed test or left-tailed test) shall be used, and decide the test statistic
to be used, and find the critical value from Table 4.

Level of significance: α = 0.05

Tailed test: two-tailed test
Test statistic: Z-test
Critical value: ±1.96 (taken from Z tabular value/ Table 4)

Step 3: Compute the value of the test statistic.

The following formula is used when there are two sample means:

$z = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$

x̄1 = 9.1
x̄2 = 9.6
n1 = 120
n2 = 120
s1 = 1.9
s2 = 2.1

$z = \dfrac{9.1 - 9.6}{\sqrt{\dfrac{(1.9)^2}{120} + \dfrac{(2.1)^2}{120}}} = \dfrac{-0.5}{\sqrt{0.030 + 0.037}} = -1.93$

Step 4: Graph the computed z value and the critical value, and make a decision.

[Figure: normal curve with critical values -1.96 and +1.96; computed z = -1.93]

Note: The shaded part is the rejection region, and the unshaded part is the
non-rejection region.

Since the computed z (-1.93) lies between -1.96 and +1.96, it is located in the non-rejection
region; therefore, the null hypothesis is not rejected.

Step 5: State the conclusion.

At the 0.05 level, there is no significant difference between the mean lifetimes of Brand X and
Brand Y batteries.
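The two-sample z statistic in this solution can be verified with a short script. This is only a sketch: the function name is hypothetical, and the two-tailed critical value 1.96 is taken from the solution above rather than computed.

```python
from math import sqrt


def two_sample_z(mean1, sd1, n1, mean2, sd2, n2):
    """z statistic for the difference between two sample means."""
    standard_error = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / standard_error


# Brand X: mean 9.1 h, sd 1.9 h, n = 120; Brand Y: mean 9.6 h, sd 2.1 h, n = 120.
z = two_sample_z(9.1, 1.9, 120, 9.6, 2.1, 120)

critical = 1.96  # two-tailed critical value at alpha = 0.05, from the solution
reject = abs(z) > critical

print(round(z, 2))  # -1.93
print(reject)       # False
```

The statistic falls just inside the two-tailed critical values, so the decision is sensitive to rounding and to the choice of test direction.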

Solution 2: Using one-tailed test

Step 1: Formulate the null hypothesis and the alternative hypothesis.

Ho: µ1 = µ2 ; Ha: µ1< µ2

Step 2: Specify the level of significance and decide whether two-tailed test or one-tailed test
(right-tailed test or left-tailed test) shall be used, and decide the test statistic to be used,
and find the critical value from Table 4.

Level of significance: α = 0.05


Tailed test: one-tailed test ( Left-tailed test )
Test statistic: Z-test
Critical value: -1.645 (taken from Z tabular value/ table 4 and we
use negative since it was a left-tailed test)
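Step 3: Compute the value of the test statistic.

The following formula is used when there are two sample means:

$z = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$

x̄1 = 9.1
x̄2 = 9.6
n1 = 120
n2 = 120
s1 = 1.9
s2 = 2.1

$z = \dfrac{9.1 - 9.6}{\sqrt{\dfrac{(1.9)^2}{120} + \dfrac{(2.1)^2}{120}}} = \dfrac{-0.5}{\sqrt{0.030 + 0.037}} = -1.93$

Step 4: Graph the computed z value and the critical value, and make a decision.

[Figure: normal curve with critical value -1.645; computed z = -1.93]

Note: The shaded part is the rejection region, and the unshaded part is the
non-rejection region.

Since the computed z (-1.93) is less than the critical value (-1.645), it is located in the
rejection region; therefore, the null hypothesis is rejected.

Step 5: State the conclusion.

At the 0.05 level, the mean lifetime of Brand X batteries is significantly shorter than that of
Brand Y batteries. The one-tailed test rejects the null hypothesis where the two-tailed test did not
because the whole of α is placed in the left tail, which moves the critical value from -1.96 to -1.645.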

Lesson 5.3.4 T-test

A t-test is a statistical test of one or two population means. A t-test with two samples
is commonly used with small sample sizes, testing the difference
between the samples when the variances of two normal distributions are not known.
A t-test uses the t-statistic, the t-distribution, and the degrees of freedom to determine the
probability of a difference between populations; the test statistic in the test is known as the t-statistic.
We may use the t-test if:

• the population variance is unknown, and therefore has to be estimated from the
sample itself
• the sample size is less than 30 (n < 30)

To carry out a t-test, we follow these steps:

1. Formulate the null hypothesis and the alternative hypothesis.


2. Specify the level of significance and decide whether two tailed test, or one tailed test
(right-tailed test or left-tailed test) shall be used; decide the test statistic to be used; find
the degrees of freedom; and find the critical value from TABLE 5.

Degrees of freedom = n - 1 (for one sample mean)

Degrees of freedom = (n1 + n2) - 2 (for two sample means)

3. Compute the value of the test statistic.


4. Graph computed t value and critical value, and make a decision
5. State the conclusion.

T-test Using One Sample Mean

Example 1:
Problem:
A chemical company claims that the average weight of its bags of chemical is 50 kg. A sample of
25 bags was taken and revealed a mean weight of 48.1 kg with a standard deviation of 0.9 kg. If
the significance level is 1%, is there a significant difference between the claimed and the observed mean
weight of the chemical bags?

Solution 1: Using one-tailed test

Step 1: Formulate the null hypothesis and the alternative hypothesis.

Ho: µ = 50 ; Ha: µ <50

Step 2: Specify the level of significance and decide whether two-tailed test, or one-tailed test
(right-tailed test or left-tailed test) shall be used, and decide the test statistic to be
used. Find the degrees of freedom, and find the critical value from Table 5.

Level of significance: α = 0.01
Tailed test: one-tailed test (left-tailed test)
Test statistic: t-test
Degrees of freedom = (n - 1) (for one sample mean)
= (25 - 1)
= 24

Critical value: -2.492 (taken from Table 5)

Step 3: Compute the value of the test statistic.

The following formula is used when there is only one sample mean:

$t = \dfrac{(\bar{x} - \mu)\sqrt{n}}{s}$

x̄ = 48.1
µ = 50
n = 25
s = 0.9

$t = \dfrac{(48.1 - 50)\sqrt{25}}{0.9} = -10.56$

Step 4: Graph the computed t value and the critical value, and make a decision.

Note: The shaded part is the rejection region, and the unshaded part is the
non-rejection region.

Since the computed t (-10.56) is located in the rejection region, the null hypothesis is
rejected.

Step 5: State the conclusion.

At the 1% significance level, the mean weight of the bags is significantly less than the
claimed 50 kg.

Solution 2: Using two-tailed test

Step 1: Formulate the null hypothesis and the alternative hypothesis.

Ho: µ = 50 ; Ha: µ ≠50

Step 2: Specify the level of significance and decide whether two-tailed test, or one-tailed test
(right-tailed test or left-tailed test) shall be used, and decide the test statistic to be used,
find the degrees of freedom, and find the critical value from Table 5.

Level of significance: α = 0.01
Tailed test: two-tailed test
Test statistic: t-test
Degrees of freedom = (n - 1) (for one sample mean)
= (25 - 1)
= 24

Critical value: ± 2.797 (taken from Table 5)

Step 3: Compute the value of the test statistic.

The following formula is used when there is only one sample mean:

$t = \dfrac{(\bar{x} - \mu)\sqrt{n}}{s}$

x̄ = 48.1
µ = 50
n = 25
s = 0.9

$t = \dfrac{(48.1 - 50)\sqrt{25}}{0.9} = -10.56$

Step 4: Graph the computed t value and the critical value, and make a decision.

Note: The shaded part is the rejection region, and the unshaded part is the
non-rejection region.

Since the computed t (-10.56) is located in the rejection region, the null hypothesis is
rejected.

Step 5: State the conclusion.

At the 1% significance level, there is a significant difference between the claimed mean
weight (50 kg) and the sample mean (48.1 kg).

T-test Using Two Sample Means

Example 2:
Problem:
A study was conducted to examine the relationship between attitudes towards mathematics
and success at college-level mathematics. Twenty-two men and twenty women were identified as being
at high risk of failure. The students were asked to respond to a series of questions, and their answers
were used to obtain a math anxiety score. Summary values appear in the table below. Test the
hypothesis that the mean scores of the two groups are equal using a 0.05 level of significance.

Gender   n    x̄      s
Male     22   40.8   9.3
Female   20   37.5   10.2

Solution 1: Using one-tailed test

Step 1: Formulate the null hypothesis and the alternative hypothesis.

Ho: µ1 = µ2 ; Ha: µ1> µ2

Step 2: Specify the level of significance and decide whether two-tailed test, or one-tailed test
(right-tailed test or left-tailed test) shall be used, and decide the test statistic to be used,
find the degrees of freedom, and find the critical value from Table 5.

Level of significance: α = 0.05

Tailed test: one-tailed test (Right-tailed test )

Test statistic: t-test

Degrees of freedom = (n1 + n2) – 2 (for two sample mean)

= (22 + 20) – 2

= 42 – 2

= 40

Critical value: +1.684 (taken from Table 5 and the positive value
is used since it is a right-tailed test)

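Step 3: Compute the value of the test statistic.

Formula for two sample means:

$t = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} \cdot \dfrac{n_1 + n_2}{n_1 n_2}}}$

x̄1 = 40.8
x̄2 = 37.5
n1 = 22
n2 = 20
s1 = 9.3
s2 = 10.2

$t = \dfrac{40.8 - 37.5}{\sqrt{\dfrac{(22 - 1)(9.3)^2 + (20 - 1)(10.2)^2}{22 + 20 - 2} \cdot \dfrac{22 + 20}{(22)(20)}}}$

t = 1.1

Step 4: Graph the computed t value and the critical value, and make a decision.

[Figure: t curve with critical value +1.684; computed t = 1.1]

Since the computed t (1.1) is less than the critical value (+1.684), it is located in the
non-rejection region; therefore, the null hypothesis is not rejected.

Step 5: State the conclusion.

At the 0.05 level, there is not enough evidence that the mean math anxiety score of the men is
higher than that of the women.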

Solution 2: Using two-tailed test

Step 1: Formulate the null hypothesis and the alternative hypothesis.


Ho: µ1 = µ2
Ha: µ1 ≠ µ2

Step 2: Specify the level of significance and decide whether two-tailed test, or one-tailed test
(right-tailed test or left-tailed test) shall be used, and decide the test statistic to be used,
find the degrees of freedom, and find the critical value from Table 5.

Level of significance: α = 0.05


Tailed test: two-tailed test
Test statistic: t-test

Degrees of freedom = (n1 + n2) – 2 (for two sample mean)

= (22 + 20) – 2

= 42 – 2
= 40

Critical value: ± 2.021 (taken from Table 5 ; one is positive and the other
is negative since it is two-tailed )

Step 3: Compute the value of the test statistic.

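* Formula for two sample means

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}\cdot\dfrac{n_1 + n_2}{n_1 n_2}}}$$

x̄1 = 40.8
x̄2 = 37.5
n1 = 22
n2 = 20
s1 = 9.3
s2 = 10.2

$$t = \frac{40.8 - 37.5}{\sqrt{\dfrac{(22 - 1)(9.3)^2 + (20 - 1)(10.2)^2}{22 + 20 - 2}\cdot\dfrac{22 + 20}{(22)(20)}}}$$

t ≈ 1.1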

Step 4: Graph the computed t value and critical value, and make a decision.

critical value = - 2.021 critical value = + 2.021

t = 1.1

Note: The shaded part is the rejection region and the unshaded part is the
acceptance region.

Since the computed t (1.1) lies between the critical values -2.021 and +2.021, it is located in the
acceptance region; therefore, the null hypothesis is not rejected.

Step 5: State the conclusion.


There is no significant difference between the mean math anxiety scores of the
male and female students.
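
The decisions in Solution 1 and Solution 2 follow the same rule and differ only in the critical value and the tail. A minimal sketch, assuming the critical values 1.684 and 2.021 read from Table 5 above; `decide` is a hypothetical helper name introduced only for illustration.

```python
def decide(t, critical, tail):
    """Return the decision for a t test given a positive critical value.

    tail is "right", "left", or "two".
    """
    if tail == "right":
        reject = t > critical
    elif tail == "left":
        reject = t < -critical
    elif tail == "two":
        reject = abs(t) > critical
    else:
        raise ValueError("tail must be 'right', 'left', or 'two'")
    return "reject Ho" if reject else "do not reject Ho"


t = 1.1  # computed test statistic from Step 3
print(decide(t, 1.684, "right"))  # do not reject Ho
print(decide(t, 2.021, "two"))    # do not reject Ho
```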

CHAPTER TEST

Apply the appropriate test hypothesis steps and procedures for the following research
problems.

1. A supermarket owner believes that the mean income of its customers is Php50,000 per month. One
hundred customers are randomly selected and asked about their monthly income. The sample mean is
Php48,500 per month and the standard deviation is Php3,200. Is there sufficient evidence to indicate that
the mean income of the customers of the supermarket is Php50,000 per month? Use α = 0.05.

2. It is reported that the average monthly salary of accounting graduates in the accounting field is
Php18,000. A dean of a certain university conducted a survey of 60 accounting graduates and found
their average salary to be Php20,500 per month with a standard deviation of Php1,500 per month. Using
α = 0.05, is there a significant difference between the reported average salary and the average salary of
the surveyed accounting graduates?

3. A prospective MBA student was asked to estimate the difference in the monthly salaries of
professors in private and state colleges. He claimed that the difference in the starting salaries of
MBA graduates of the two colleges was relevant. An independent study of simple random
samples of the most recent MBA graduates of both colleges revealed the following statistics:

Colleges    Mean         Standard Deviation    Sample Size
Private     Php35,000    Php1,800              53
State       Php32,000    Php1,650              49

4. A distributor claims that the average strength of the brand X rope exceeds the average strength of
the brand Y rope. To test its claim, 25 pieces of each brand are tested under similar conditions.
Brand X had an average strength of 90.7 kilograms with a standard deviation of 7.82 kilograms,
while brand Y had an average strength of 93.7 kilograms with a standard deviation of 6.75
kilograms. Test whether the claim of the distributor is correct at 5% level of significance.

5. Job satisfaction as a function of work schedule was investigated in two factories. In the first
factory, the employees are on a fixed shift system, while in the second factory, the workers have
a rotating shift system. Using the data in the table below, determine if there is a significant
difference in job satisfaction between the two groups of workers. Use α = 0.01.

Shift Schedule Mean Standard deviation Sample size


Fixed 7.43 2.42 23
Rotating 6.18 2.15 29

CHAPTER 6

CORRELATION AND REGRESSION ANALYSIS

Objectives
The learner should be able to
1. construct a scatter plot,
2. describe shape, trend, and variation based on scatter plot,
3. estimate strength of association between the variables based on scatter plot,
4. calculate the Pearson’s sample correlation coefficient,
5. solve problem involving correlation analysis,
6. identify the independent and dependent variables,
7. draw the best-fit line on a scatter plot,
8. calculate the slope and y-intercept of the regression line,
9. predict the value of the dependent variable given the value of the independent
variable, and
10. solve problems involving regression analysis.

Correlation analysis is used to quantify the association between two continuous variables
(e.g., between an independent and a dependent variable or between two independent variables).
Regression analysis is a related technique to assess the relationship between an outcome
variable and one or more risk factors or confounding variables.
The outcome variable is also called the response or dependent variable and the risk
factors and confounders are called the predictors, or explanatory or independent variables.
In regression analysis, the dependent variable is denoted "y" and the independent
variables are denoted by "x".

Lesson 6.1 Types of Variables

Ambiguities in Classifying a Type of Variable

In some cases, the measurement scale for data is ordinal, but the variable is treated as
continuous. For example, a Likert scale that contains five values - strongly agree, agree, neither
agree nor disagree, disagree, and strongly disagree - is ordinal. However, where a Likert scale
contains seven or more values - strongly agree, moderately agree, agree, neither agree nor disagree,
disagree, moderately disagree, and strongly disagree - the underlying scale is sometimes treated as
continuous (although whether you should do this is a cause of great dispute).
It is worth noting that how we categorize variables is, to some extent, a choice. Whilst we
categorized gender as a dichotomous variable (you are either male or female), social scientists may
disagree with this, arguing that gender is a more complex variable involving more than two distinctions
and including categories such as genderqueer, intersex and transgender. At the same time,
some researchers would argue that a Likert scale, even with seven values, should never be treated as a
continuous variable.

Dependent and Independent Variables

Independent Variable
Sometimes called an experimental or predictor variable, it is a variable that is being manipulated in
an experiment in order to observe the effect on a dependent variable.

Dependent Variable
It is sometimes called an outcome variable. The dependent variable is simply a variable that
is dependent on an independent variable(s).

All experiments examine some kind of variable(s). A variable is not only something that we
measure, but also something that we can manipulate and something we can control for.

The dependent variable is just like the name sounds; it depends upon some factor that you,
the researcher, control. For example:

• How well you perform in a race depends on your training.


• How much you weigh depends on your diet.
• How much you earn depends upon the number of hours you work.

Whatever event you are expecting to change is always the dependent variable. In
the first example above, race performance is the variable you would expect to change if you changed
your training, so that is the dependent variable. In the second example, the dependent variable is weight,
and in the third example, the dependent variable is the amount earned.

Example:
Imagine that a tutor asks 100 students to complete a math test. The tutor wants to know why
some students perform better than others. Whilst the tutor does not know the answer to this, she thinks
that it might be because of two reasons: (1) some students spend more time revising for their test; and
(2) some students are naturally more intelligent than others. As such, the tutor decides to investigate the
effect of revision time and intelligence on the test performance of the 100 students. The dependent and
independent variables for the study are:
Dependent Variable: Test mark (measured from 0 to 100)
Independent Variables: Revision time (measured in hours)
Intelligence (measured using IQ score)

Categorical and Continuous Variables

Categorical variables are also known as discrete or qualitative variables. Categorical variables
can be further categorized as nominal, ordinal or dichotomous.

• Nominal variables are variables that have two or more categories, but which do not have an
intrinsic order. For example, a real estate agent could classify their types of property into
distinct categories such as houses, condos, co-ops or bungalows. So
"type of property" is a nominal variable with 4 categories called houses, condos, co-ops and
bungalows. Of note, the different categories of a nominal variable can also be referred to as
groups or levels of the nominal variable. Another example of a nominal variable would be
classifying where people live in the USA by state. In this case there will be many more levels of
the nominal variable (50 in fact).
• Dichotomous variables are nominal variables which have only two categories or levels. For
example, if we were looking at gender, we would most probably categorize somebody as either
"male" or "female". This is an example of a dichotomous variable (and also a nominal variable).
Another example might be if we asked a person if they owned a mobile phone. Here, we may
categorize mobile phone ownership as either "Yes" or "No". In the real estate agent example, if
type of property had been classified as either residential or commercial then "type of property"
would be a dichotomous variable.
• Ordinal variables are variables that have two or more categories, just like nominal variables, except that
the categories can also be ordered or ranked. So if you asked someone if they liked the policies
of the Democratic Party and they could answer either "Not very much", "They are OK" or "Yes,
a lot" then you have an ordinal variable. Why? Because you have 3 categories, namely "Not
very much", "They are OK" and "Yes, a lot" and you can rank them from the most positive
(Yes, a lot), to the middle response (They are OK), to the least positive (Not very much).
However, whilst we can rank the levels, we cannot place a "value" to them; we cannot say that
"They are OK" is twice as positive as "Not very much" for example.

Continuous variables are also known as quantitative variables. Continuous variables can be
further categorized as either interval or ratio variables.

• Interval variables are variables for which their central characteristic is that they can be
measured along a continuum and they have a numerical value (for example, temperature
measured in degrees Celsius or Fahrenheit). So the difference between 20°C and 30°C is the same
as 30°C to 40°C. However, temperature measured in degrees Celsius or Fahrenheit is NOT a
ratio variable.
• Ratio variables are interval variables, but with the added condition that 0 (zero) of the
measurement indicates that there is none of that variable. So, temperature measured in
degrees Celsius or Fahrenheit is not a ratio variable because 0°C does not mean there is no
temperature. However, temperature measured in Kelvin is a ratio variable as 0 Kelvin (often
called absolute zero) indicates that there is no temperature whatsoever. Other examples of ratio
variables include height, mass, distance and many more. The name "ratio" reflects the fact that
you can use the ratio of measurements. So, for example, a distance of ten meters is twice the
distance of 5 meters.

Experimental and Non-Experimental Research

Experimental research: In experimental research, the aim is to manipulate an independent
variable(s) and then examine the effect that this change has on a dependent variable(s). Since it is
possible to manipulate the independent variable(s), experimental research has the advantage of enabling
a researcher to identify a cause and effect between variables. For example, take 100 students
completing a math exam where the dependent
variable is the exam mark (measured from 0 to 100), and the independent variables are revision time
(measured in hours) and intelligence (measured using IQ score). Here, it would be possible to use an
experimental design and manipulate the revision time of the students. The tutor could divide the students
into two groups, each made up of 50 students. In "group one", the tutor could ask the students not to do
any revision. In contrast, "group two" could be asked to do 20 hours of revision in the two weeks prior to
the test. The tutor could then compare the marks that the students achieved.

Non-experimental research: In non-experimental research, the researcher does not
manipulate the independent variable(s). This is not to say that it is impossible to do so, but it will either be
impractical or unethical to do so. For example, a researcher may be interested in the effect of illegal,
recreational drug use (the independent variable(s)) on certain types of behavior (the dependent
variable(s)). However, whilst possible, it would be unethical to ask individuals to take illegal drugs in order
to study what effect this had on certain behaviors. As such, a researcher could ask both drug and non-drug
users to complete a questionnaire that had been constructed to indicate the extent to which they exhibited
certain behaviors. While it is not possible to identify the cause and effect between the variables, we can
still examine the association or relationship between them. In addition to understanding the difference
between dependent and independent variables, and experimental and non-experimental research, it is also
important to understand the different characteristics amongst variables.

Lesson 6.2 Nature of Bivariate Analysis

Bivariate analysis means the analysis of bivariate data. It is one of the simplest forms of
statistical analysis used to determine if there is a relationship between two sets of values. It usually
involves the variables X and Y.

• Univariate analysis is the analysis of one (“uni”) variable.
• Bivariate analysis is the analysis of exactly two variables.
• Multivariate analysis is the analysis of more than two variables.

The results from bivariate analysis can be stored in a two-column data table.

Example:

You might want to find out the relationship between the age of the students and their academic
achievement. The age would be your independent variable, X and the academic achievement would be
your dependent variable, Y.

Student Age Academic Achievement


1 15 85
2 16 86
3 14 89
4 15 84
5 18 79

Lesson 6.3 Scatter plot

Scatter Plot
It is a type of plot or mathematical diagram that uses Cartesian coordinates to display values for
typically two variables for a set of data. If the points are color-coded, one additional variable can be
displayed. The data is displayed as a collection of points, each having the value of one variable
determining the position on the horizontal axis and the value of the other variable determining the
position on the vertical axis.

A scatter plot can be used either when one continuous variable is under the control of the
experimenter and the other depends on it, or when both continuous variables are independent. If a
parameter exists that is systematically incremented and/or decremented by the other, it is called the
control parameter or independent variable and is customarily plotted along the horizontal axis. The
measured or dependent variable is customarily plotted along the vertical axis.
A scatter plot can suggest various kinds of correlations between variables with a certain
confidence interval.

Example:
Plotting Weight vs. Height. Weight would be on the y-axis and height would be on the x-axis.
Correlations may be positive (rising), negative (falling), or null (uncorrelated).

Pattern of dots

We can consider that there is a:

Positive correlation if the pattern of dots slopes from lower left to upper right.

Negative correlation if the pattern of dots slopes from upper left to lower right.

No correlation if the pattern of dots shows no definite slope.

Lesson 6.4 Best-Fit Line

A line of best fit (or "trend" line) is a straight line that best represents the data on a scatter
plot. This line may pass through some of the points, none of the points, or all of the points. A line of
best fit can be drawn in order to study the relationship between the variables. An equation for the
correlation between the variables can be determined by established best-fit procedures. For a linear
correlation, the best-fit procedure is known as linear regression and is guaranteed to generate a correct
solution in a finite time. No universal best-fit procedure is guaranteed to generate a correct solution for
arbitrary relationships. A scatter plot is also very useful when we wish to see how two comparable data
sets agree with each other. In this case, an identity line, i.e., a y = x line, or a 1:1 line, is often
drawn as a reference. The more the two data sets agree, the more the scatters tend to concentrate in the
vicinity of the identity line; if the two data sets are numerically identical, the scatters fall on the identity
line exactly.

Lesson 6.5 Pearson’s Correlation Coefficient

Correlation between sets of data is a measure of how well they are related. The most common
measure of correlation in statistics is the Pearson Correlation. The full name is the Pearson Product
Moment Correlation or PPMC. It shows the linear relationship between two sets of data. In
simple terms, it answers the question: Can I draw a line graph to represent the data? Two letters
are used to represent the Pearson correlation: the Greek letter rho (ρ) for a population and the letter “r”
for a sample. It tells you whether there is a linear relationship between the variables. To compute the
value of the Pearson Correlation we have the formula:

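Formula 1:

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$

where: r - the Pearson correlation coefficient
n - the number of paired observations
x - the independent variable
y - the dependent variable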

Formula 2:

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}$$

where: x̄ – mean of the x-values
ȳ – mean of the y-values
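
Formula 2 translates directly into code. The sketch below applies it to the age and academic achievement table from Lesson 6.2; the function name `pearson_r` is an assumption made for illustration only.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation from deviations about the means (Formula 2)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Numerator: sum of the products of paired deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the summed squared deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


ages = [15, 16, 14, 15, 18]      # X, from the Lesson 6.2 table
scores = [85, 86, 89, 84, 79]    # Y, from the Lesson 6.2 table
r = pearson_r(ages, scores)
print(round(r, 3))
```

The sign of the printed value shows the direction of the relationship in this small sample, and its size shows how closely the points follow a straight line.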

The result will be between -1 and 1. You will rarely see exactly 0, -1 or 1 as a result; you will usually
get a number somewhere in between those values. The closer the value of r gets to zero, the more the
data points vary around the line of best fit. To interpret the obtained result, the table below
may be used.

Possible result     Interpretation
0.5 to 1.0          High positive correlation
0.3 to 0.5          Medium positive correlation
0.01 to 0.3         Low positive correlation
-0.5 to -1.0        High negative correlation
-0.3 to -0.5        Medium negative correlation
-0.01 to -0.3       Low negative correlation
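
The table can be expressed as a small lookup. Because the ranges share their endpoints (0.3 and 0.5 each appear in two rows), the sketch below assumes that a boundary value belongs to the stronger category and that values smaller than 0.01 in magnitude count as no correlation. Both choices, and the name `interpret_r`, are assumptions rather than rules stated in the text.

```python
def interpret_r(r):
    """Map a Pearson r to the verbal labels used in the table."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    size = abs(r)
    if size < 0.01:
        return "No correlation"  # assumption: the table does not cover this case
    if size >= 0.5:
        strength = "High"
    elif size >= 0.3:
        strength = "Medium"
    else:
        strength = "Low"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} correlation"


# The two results obtained in the worked examples of this lesson.
print(interpret_r(0.5108))   # High positive correlation
print(interpret_r(-0.4241))  # Medium negative correlation
```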

Example 1:
Problem :
Researchers want to know if there is a significant relationship between the age of a person and their
glucose level. They used six (6) persons as their sample and obtained the data below.

Samples Age (s) Glucose Level


1 40 98
2 25 59
3 36 83
4 45 70
5 50 90
6 61 85

Solution :

Step 1: Make a chart. Use the given data, and add three more columns: xy, x², and y².

Samples Age (x) Glucose Level (y) xy x² y²
1 40 98
2 25 59
3 36 83
4 45 70
5 50 90
6 61 85

Step 2: Multiply x and y together to fill the xy column. For example, row 1 would be 40 × 98 = 3920.

Samples Age (x) Glucose Level (y) xy x² y²
1 40 98 3920
2 25 59 1475
3 36 83 2988
4 45 70 3150
5 50 90 4500
6 61 85 5185

Step 3: Take the square of the numbers in the x column, and put the result in the x² column.

Samples Age (x) Glucose Level (y) xy x² y²
1 40 98 3920 1600
2 25 59 1475 625
3 36 83 2988 1296
4 45 70 3150 2025
5 50 90 4500 2500
6 61 85 5185 3721

Step 4: Take the square of the numbers in the y column, and put the result in the y² column.

Samples Age (x) Glucose Level (y) xy x² y²
1 40 98 3920 1600 9604
2 25 59 1475 625 3481
3 36 83 2988 1296 6889
4 45 70 3150 2025 4900
5 50 90 4500 2500 8100
6 61 85 5185 3721 7225

Step 5: Add up all of the numbers in each column and put the result in the bottom row. The Greek
letter sigma (Σ) is a short way of saying “sum of” or “summation of”.

Samples Age (x) Glucose Level (y) xy x² y²
1 40 98 3920 1600 9604
2 25 59 1475 625 3481
3 36 83 2988 1296 6889
4 45 70 3150 2025 4900
5 50 90 4500 2500 8100
6 61 85 5185 3721 7225

∑ 257 485 21218 11767 40199

Step 6: Substitute the values obtained into the formula and compute.

The result is 0.5108, which means the variables have a High positive correlation.
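
Step 6 can be reproduced from the column totals in Step 5. The sketch below assumes the computational (raw-score) form of the Pearson formula, which uses exactly the sums built in the chart; `pearson_from_sums` is a hypothetical helper name.

```python
import math


def pearson_from_sums(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Pearson r from the totals of the x, y, xy, x^2 and y^2 columns."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2)
    )
    return numerator / denominator


# Totals from the bottom row of the Step 5 chart (n = 6 persons).
r = pearson_from_sums(6, 257, 485, 21218, 11767, 40199)
print(round(r, 4))  # 0.5108
```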

Example 2:
Problem :
Calculate the Pearson correlation coefficient of the obtained scores by 5 students in
algebra and trigonometry as given below:

Algebra 15 16 12 10 8
Trigonometry 18 11 10 20 17

Solution:
Complete the table by following steps 1 to 5.
x y xy x² y²
15 18 270 225 324
16 11 176 256 121
12 10 120 144 100
10 20 200 100 400
8 17 136 64 289
∑x = 61 ∑y = 76 ∑xy = 902 ∑x² = 789 ∑y² = 1234

Step 6: Substitute the values obtained into the formula and compute.

The result is -0.4241, which means the variables have a medium negative correlation.
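
For reference, the substitution behind this result can be written out as follows, assuming the computational form of the Pearson formula with n = 5 and the column totals from the table above:

$$r = \frac{5(902) - (61)(76)}{\sqrt{\left[5(789) - 61^2\right]\left[5(1234) - 76^2\right]}} = \frac{4510 - 4636}{\sqrt{(224)(394)}} = \frac{-126}{\sqrt{88256}} \approx -0.4241$$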

Activity:

1. Evaluate Pearson correlation coefficient of the following values for x and y:

x 5 6 4 2
y 7 3 9 8

2. The scores of 6 pupils in two subjects, Physics and Chemistry, are given below.
Calculate the coefficient of correlation.

Score Pupil A Pupil B Pupil C Pupil D Pupil E Pupil F

Chemistry 45 53 67 40 35 50

Physics 68 76 70 64 54 66

Lesson 6.6 Regression

Simple regression is used to examine the relationship between one dependent and one
independent variable. After performing an analysis, the regression statistics can be used to predict the
dependent variable when independent variable is known. Regression goes beyond correlation by adding
prediction capabilities.
People use regression on an intuitive level every day, such as:
• in business, a well-dressed man is thought to be financially successful;
• a mother knows that more sugar in her children’s diet results in higher energy levels; and
• the ease of waking up in the morning often depends on how late you went to bed the night
before.

The regression line (known as the least squares line) is a plot of the expected value of the
dependent variable for all values of the independent variable. Technically, it is the line that minimizes the
sum of the squared residuals. The regression line is the one that best fits the data on a scatter plot.
Using the regression equation, the dependent variable may be predicted from the independent
variable. The slope of the regression line (b) is defined as the rise divided by the run. The y-intercept
(a) is the point where the regression line crosses the y-axis. The slope
and y-intercept are incorporated into the regression equation. The intercept is usually called the
constant, and the slope is referred to as the coefficient. Since the regression model is usually not a
perfect predictor, there is also an error term in the equation.
In the regression equation, y is always the dependent variable and x is always the independent
variable. Here are three equivalent ways to mathematically describe a linear regression model :
y = intercept + ( slope × x ) + error

y = constant + ( coefficient × x ) + error

y = a + bx + e
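
The text does not give formulas for a and b, so the sketch below assumes the standard least-squares estimates, b = (nΣxy - ΣxΣy) / (nΣx² - (Σx)²) and a = ȳ - b·x̄, and applies them to the age and glucose data from Example 1 of Lesson 6.5. The function name `least_squares_line` is hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Slope: same numerator as the computational Pearson formula.
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x**2)
    # Intercept: the fitted line passes through the point of means.
    a = sum_y / n - b * sum_x / n
    return a, b


ages = [40, 25, 36, 45, 50, 61]       # x, from Example 1 in Lesson 6.5
glucose = [98, 59, 83, 70, 90, 85]    # y, from Example 1 in Lesson 6.5
a, b = least_squares_line(ages, glucose)
# The error term e is what remains after subtracting the fitted value.
residuals = [y - (a + b * x) for x, y in zip(ages, glucose)]
print(round(a, 2), round(b, 3))
```

For a least-squares line with an intercept, the residuals sum to zero (up to rounding), which is a quick check that the fit was computed correctly.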

The significance of the slope of the regression line is determined from the t-statistic. Its p-value is the
probability of obtaining a correlation at least as strong as the observed one by chance if the true
correlation is zero. Some researchers prefer to report the F-ratio instead of the t-statistic. In simple
regression, the F-ratio is equal to the t-statistic squared.
The t-statistic for the significance of the slope is essentially a test to determine if the regression
model (equation) is usable. If the slope is significantly different from zero, then we can use the
regression model to predict the dependent variable for any value of the independent variable.

Slope ( m ) Formula:

$$m = \frac{\text{rise}}{\text{run}} = \frac{\Delta y}{\Delta x} = \frac{y_2 - y_1}{x_2 - x_1}$$

Forms of Linear Equation

1. Slope-Intercept form: y = mx + b ( m – slope ; b – y-intercept )

2. Point – Slope form: ( y –y1 ) = m ( x – x1 )

3. Standard form: Ax + By = C ( A , B , and C are constants )

4. General form: Ax + By + C = 0

5. Intercept form: x/a + y/b = 1 ( a is the x-intercept and b is the y-intercept )

Lesson 6.7 Slope and Intercept of the Regression Line

The slope indicates the steepness of a line and the intercept indicates the location where it
intersects an axis. The slope and the intercept define the linear relationship between two variables, and
can be used to estimate an average rate of change. The greater the magnitude of the slope, the steeper the
line and the greater the rate of change.
By examining the equation of a line, you can quickly discern its slope and y-intercept (where the
line crosses the y-axis).

The slope is positive 1/2. When x increases by 2, y increases by 1. The y-intercept is 2.

y = -(3/4)x + 3

The slope is negative 3/4. When x increases by 4, y decreases by 3. The y-intercept is 3.

The slope is 0. When x increases by 1, y neither increases nor decreases. The y-intercept
is 2.

Usually, this relationship can be represented by the equation y = b0 + b1x, where b0 is the
y-intercept and b1 is the slope.

Example :
Problem:
A company determines that job performance for employees in a production department can be
predicted using the regression model y = 130 + 4.3x, where x is the hours of in-house training they
received (from 0 to 20) and y is their score on a job skills test. The value of the y-intercept (130)
indicates the average job skill score for an employee with no training. The value of the slope (4.3)
indicates that for each hour of training, the job skill score increases, on average, by 4.3 points.
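
As a quick illustration, the model can be wrapped in a function. The range check reflects the stated training range of 0 to 20 hours; refusing values outside it is a design choice of this sketch, and `predict_skill_score` is a hypothetical name.

```python
def predict_skill_score(hours):
    """Predicted job skills test score from the model y = 130 + 4.3x."""
    if not 0 <= hours <= 20:
        # The model was described only for 0 to 20 hours of training.
        raise ValueError("hours must be between 0 and 20")
    return 130 + 4.3 * hours


print(predict_skill_score(0))   # 130.0
print(predict_skill_score(10))  # 173.0
```

The first call returns the y-intercept (no training), and each additional hour adds the slope, 4.3 points, to the prediction.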

Lesson 6.8 Regression analysis

Regression analysis is a statistical process for estimating the relationships among variables.
It includes many techniques for modelling and analyzing several variables, when the focus is on the
relationship between a dependent variable and one or more independent variables. More
specifically, regression analysis helps one understand how the typical value of the dependent
variable changes when any one of the independent variables is varied, while the other independent
variables are held fixed. Regression Analysis estimates the conditional expectation of the dependent
variable given the independent variables – that is, the average value of the dependent variable when
the independent variables are fixed. In all cases, the estimation target is a function of the independent
variables called the regression function.

Activity:

Materials: Graphing paper, Pencil, spaghetti strand

Can we predict the number of total calories based upon the total fat grams?

a. Predict the total calories based upon 22 grams of fat


b. Predict the total calories based upon 18 grams of fat
c. Predict the total calories based upon 26 grams of fat
d. Predict the total calories based upon 33 grams of fat
e. Predict the total calories based upon 7 grams of fat

Sandwich Total Fat (g) Total Calories


Hamburger 9 260
Cheeseburger 13 320
Quarter Pounder 21 420
Quarter Pounder with Cheese 30 530
Big Mac 31 560
Arch Sandwich Special 31 550
Arch Special with Bacon 34 590
Crispy Chicken 25 500
Fish Fillet 28 560
Grilled Chicken 20 440
Grilled Chicken Light 5 300

Solution:
1. Prepare a scatter plot of the data on graph paper.

2. Using a strand of spaghetti, position the spaghetti so that the plotted points are as close to the
strand as possible.
3. Find two points that you think will be on the "best-fit" line.
4. We are choosing the points (9, 260) and (30, 530). (You may choose different points.)
5. Calculate the slope of the line through your two points (rounded to three decimal places).

Note: The formula of the slope is

$$m = \frac{y_2 - y_1}{x_2 - x_1} = \frac{530 - 260}{30 - 9} = \frac{270}{21} \approx 12.857$$

6. Write the equation of the line using the Point-Slope form.


y – y1= m(x – x1)

y – 260 = 12.857(x -9)


y = 12.857(x – 9) + 260

7. This equation can now be used to predict information that was not plotted in the scatter plot.
Question a: Predict the total calories based upon 22 grams of fat.

y = 12.857 (x – 9) + 260
y = 12.857 (22 – 9) + 260
y = 12.857 (13) + 260
y = 167.141 + 260
y = 427.141 calories

Question b: Predict the total calories based upon 18 grams of fat.

y = 12.857 (x – 9) + 260
y = 12.857 (18 – 9) + 260
y = 12.857 (9) + 260
y = 115.713 + 260
y = 375.713 calories

Question c: Predict the total calories based upon 26 grams of fat.

y = 12.857 (x – 9) + 260
y = 12.857 (26 – 9) + 260
y = 12.857 (17) + 260
y = 218.569 + 260
y = 478.569 calories

Question d: Predict the total calories based upon 33 grams of fat.

Question e: Predict the total calories based upon 7 grams of fat.
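
The same steps can be scripted. This sketch assumes the two points chosen in step 4, (9, 260) and (30, 530), and rounds the slope to three decimal places as the text does; `predict_calories` is a hypothetical helper name.

```python
X1, Y1 = 9, 260    # Hamburger
X2, Y2 = 30, 530   # Quarter Pounder with Cheese

# Slope through the two chosen points, rounded to three decimal places.
SLOPE = round((Y2 - Y1) / (X2 - X1), 3)


def predict_calories(fat_grams):
    """Point-slope prediction: y = m(x - x1) + y1."""
    return SLOPE * (fat_grams - X1) + Y1


for grams in (22, 18, 26):
    print(grams, round(predict_calories(grams), 3))
# 22 427.141
# 18 375.713
# 26 478.569
```

Calling `predict_calories(33)` and `predict_calories(7)` gives the estimates for Questions d and e. A different choice of two points changes the slope and therefore every prediction.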

CHAPTER TEST

Problem solving.

1. To interpret the relationship between years of education and salary potential, 5 persons were
surveyed. The results obtained on their number of years of higher education (college degree and
higher) and their monthly salaries are shown below. Compute the Pearson Product Moment Correlation
Coefficient and interpret the relationship between the variables.

Employee Salary(PHP) Years of Higher Education


A 21,400 4
B 15,300 3
C 27,400 5
D 45,000 8
E 26,600 5

2. A financial analyst believes that the interest rate on bonds is inversely related to the interest rate on
loans. Hence, bonds perform well when lending rates are down, and vice versa. The results of the
observation are shown in the table below. Find the slope and y-intercept from the data and predict the
interest rate on bonds (%) when the interest rate on loans is
a. 7
b. 11
c. 12
Interest Rate on Loan (%) Interest Rate on Bond (%)
10 6
5 9
8 7
6 8
8 6

Tables

Table 1: Areas Under Normal Curve

Table 2: Level of Confidence

Table 3: P Value Table

Table 4: Tabular value of Z at indicated levels of significance (α)
Test/α 0.005 0.01 0.05 0.1
One-tailed ±2.58 ±2.33 ±1.645 ±1.28
Two-tailed ±2.81 ±2.575 ±1.96 ±1.645

Table 5: Critical values of the t distribution

Name: Score
Grade / Section : Date :

Chapter 1 : Exercise

I. True or False. Write C if the statement is true and W if it is false.


1. A sample is only a part of a population.
2. The sum of probabilities could exceed 1.
3. The volume of a liquid is an example of a continuous variable.
4. Discrete variables are variables that are measurable.
5. Standard deviation is the square root of the variance.
II. Identify which of the following are discrete or continuous variables. Write D for discrete and C
for continuous.
1. number of printing mistakes in a book
2. number of siblings of an individual
3. age of a person
4. profit earned by the company
5. distance travelled by a car
III. In a coin toss experiment, how many coins were tossed together if there are:
1. 128 possible outcomes
2. 8 possible outcomes
3. 256 possible outcomes
4. 16 possible outcomes
5. 1024 possible outcomes
IV. Three coins are tossed simultaneously. Find the probability of :
1. at least 2 tails
2. no tail
3. 1 head
4. 1 tail and 2 heads
5. 2 heads and 2 tails
V. The number of adults living in homes in a randomly selected barangay in Tunasan is
described by the following probability distribution.

Number of adults Probability


x Pr ( x )
1 0.25
2 0.5
3 0.15
4 or more ?
Answer the following
1. What is the probability that 4 or more adults reside at a randomly selected home?
2. What is the mean of the probability distribution?
3. What is the variance ?
4. What is the standard deviation ?
( Use the back of this page for the solution )

Solution :

Name: Score
Grade / Section : Date :

Chapter 2 : Exercise

I. True or False. Write C if the statement is true and W if it is false.


1. The area under normal curve is always positive.
2. The shape of a normal curve distribution depends on the values of the mean and the
variance.
3. Values of the areas are added if the z scores are located on the same side of the
distribution.
4. In Pr ( z ≥ 0.5 ), the area is located on the left side of the normal curve.
5. The area under the normal curve indicates probability.
II. Find the area between the following z-scores.
1. z = -1 and z = 1
2. z = 0.7 and z = 2.3
3. z = - 2.75 and z = - 1.58
4. z = 0 and z = 1.95
5. z = 0.21 and z = - 1.65
III. Convert the following variable into a standard score with the following given.
1. x = 10 ; x̄ = 12 ; s = 4
2. x = 75 ; x̄ = 60 ; s = 20
3. x = 45 ; x̄ = 42 ; s = 5
4. x = 250 ; x̄ = 255 ; s = 2
5. x = 28 ; x̄ = 24 ; s = 2.5
IV. Draw the graph of the following probability.
1. Pr ( - 1.8 ≤ z ≤ 2.7 )

2. Pr ( 1.2 ≤ z ≤ 2.8 )

3. Pr ( z ≤ 1.5 )

V. Word problems.
1. Scores on a history test have an average of 80 with a standard deviation of 6. If there were
50 students who took the test on this subject, how many students got a score of at least 75 ?

2. The weight of chocolate bars from a particular chocolate factory has a mean of 8 ounces with a standard
deviation of 0.1 ounce. What is the z-score corresponding to a weight of 8.17 ounces?

3. Books in the library are found to have an average length of 350 pages with a standard deviation of
100 pages. If there are 10,000 books in the library , how many books have a corresponding length
of at least 80 pages?

4. The temperature is recorded at 60 airports in a region. The average temperature is 67 degrees
Fahrenheit with a standard deviation of 5 degrees. How many airports have a temperature of
68 degrees and above?

5. The mean growth of the thickness of trees in a forest is found to be 0.5 cm/year with a
standard deviation of 0.1 cm/year. What is the z-score corresponding to 1 cm/year?
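Worked illustration (with made-up numbers, not one of the problems above): a "how many" question can be handled by finding the probability of being at or above the cutoff and multiplying it by the group size. The sketch assumes the values are normally distributed; the function name and figures are hypothetical.

```python
from statistics import NormalDist


def expected_count_at_least(threshold, mean, std_dev, group_size):
    """Expected number of values >= threshold in a normally distributed group."""
    probability = 1 - NormalDist(mean, std_dev).cdf(threshold)
    return group_size * probability


# Hypothetical data: mean 100, standard deviation 15, 200 values, cutoff 115.
count = expected_count_at_least(115, 100, 15, 200)
print(round(count, 1))  # 31.7
```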

Name: Score
Grade / Section : Date :

Chapter 3 : Exercise

I. True or False. Write C if the statement is true and W if it is false.


1. Stratified random sampling is best done for a heterogeneous population.
2. A sample is a subset of a population.
3. In choosing the sample, the researcher should be objective in order to get reliable
information.
4. Simple random sampling is appropriate when the population from which the sample
is taken is homogeneous.
5. Stratified random sampling gives greater accuracy at a given cost than simple random
sampling.
II. Identification. Write the correct answer in the blank.
1. In stratified random sampling, this term means subgroup.
2. It is a characteristic of a population.
3. It is the square root of the variance.
4. It is the basic sampling technique.
5. A group of phenomena that have something in common.
6. It is a characteristic of a sample.
7. It is the likelihood that a certain event will occur.
8. It is a measure of the spread of the distribution.
9. It is the process of acquiring a section of the population for a
study.
10. It is the weighted average of the possible values.
III. Compute the number of possible samples ( combinations ) with the following given:
1. n = 4 ; r = 2

2. n = 6 ; r = 3

3. n = 8 ; r = 2

4. n = 9 ; r = 3

5. n = 10 ; r = 4
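Worked illustration (not one of the items above): when r values are drawn without replacement from a population of n values and order does not matter, the number of possible samples is the combination nCr = n! / ( r! (n - r)! ). The sketch assumes that reading of n and r; the numbers are hypothetical.

```python
import math

population_size = 5  # hypothetical n
sample_size = 2      # hypothetical r

# math.comb(n, r) computes n! / (r! * (n - r)!)
num_samples = math.comb(population_size, sample_size)
print(num_samples)  # 10
```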

IV. Problem solving.
A population consists of the 4 values 12, 14, 16, and 18. Compute the following when r = 2:
a. number of sample values or combinations
b. population mean
c. sample variance
d. standard deviation
e. Fill up the table below based on the computed values.

Sample number    Sample values ( x )    Mean ( x̄ )    Sample variance ( s² )    Std. deviation ( s )
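Worked illustration (using a smaller, hypothetical population): a table of this kind can be built by listing every possible sample of size r and computing the mean, sample variance, and standard deviation of each one. The sketch assumes sampling without replacement and the n - 1 divisor for the sample variance; the population 2, 4, 6 is made up.

```python
from itertools import combinations
from statistics import mean, stdev, variance

population = [2, 4, 6]  # hypothetical population
r = 2                   # sample size

rows = []
for number, sample in enumerate(combinations(population, r), start=1):
    # variance() and stdev() use the n - 1 (sample) divisor.
    rows.append((number, sample, mean(sample), variance(sample), stdev(sample)))

for number, sample, x_bar, s2, s in rows:
    print(number, sample, x_bar, s2, round(s, 4))
```

The three samples are (2, 4), (2, 6), and (4, 6), with means 3, 4, and 5.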

Solution:

Name: Score
Grade / Section : Date :

Chapter 4 : Exercise

I. Identification. Write the correct answer in the blank.


1. a factor used to compute the margin of error
2. a point estimate of the population mean
3. the difference between the observed sample mean and the true value
of the population mean
4. a range of values that is likely to contain the true value of the
parameter
5. a point estimate of the population variance
6. a symbol that denotes the probability of success
7. a measure that determines whether the population parameter is within the
interval
8. a point estimate of the population standard deviation
9. a rule that describes an estimate
10. refers to the process of using sample data to estimate the
parameters of the selected distribution
II. Compute the probability of success ( p ) and probability of failure based on the following given.
1. 20 out of 80 respondents agreed with the death penalty

2. 25 out of 40 students were present in the class

III. Find the value of the unknown with the following given:

1. α = ________ if CI ( confidence interval ) = 80 %

2. Z 0.01 = ________ if α = 0.01

3. CI = ________ if Z 0.05 = 1.645

4. α = ________ if CI = 0.95

5. CI = ________ if 2α = 0.01
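Worked illustration (not one of the items above): the level of significance is α = 1 - CI, and for a two-sided interval the critical z-value leaves α/2 in each tail. The sketch uses Python's standard `statistics.NormalDist`; the function names and the 90 % example are hypothetical, and the two-sided convention is an assumption.

```python
from statistics import NormalDist


def alpha_from_confidence(confidence_level):
    """Level of significance: alpha = 1 - CI."""
    return 1 - confidence_level


def critical_z(confidence_level):
    """Two-sided critical z-value, leaving alpha / 2 in each tail."""
    alpha = alpha_from_confidence(confidence_level)
    return NormalDist().inv_cdf(1 - alpha / 2)


print(round(alpha_from_confidence(0.90), 2))  # 0.1
print(round(critical_z(0.90), 3))             # 1.645
```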

IV. Problem solving.
1. There are hundreds of mangoes on the trees and we want to know if they are
big enough. 46 mangoes were randomly chosen and the following data were
obtained: ( use a confidence level of 95 % )
x̄ = 86
s = 6.2
Find : a. margin of error
b. true mean
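Worked illustration (with made-up numbers, not the mango data): assuming the z-based formula E = z · s / √n, the interval for the true mean runs from x̄ - E to x̄ + E. Whether a z-value or a t-value applies depends on the chapter's rule, so treat the z choice here as an assumption; the function name and figures are hypothetical.

```python
import math
from statistics import NormalDist


def margin_of_error(std_dev, n, confidence_level=0.95):
    """Margin of error E = z * s / sqrt(n), using a two-sided critical z-value."""
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    return z * std_dev / math.sqrt(n)


x_bar, s, n = 50, 4, 64  # hypothetical sample mean, standard deviation, size
e = margin_of_error(s, n)
print(round(e, 2))                               # 0.98
print(round(x_bar - e, 2), round(x_bar + e, 2))  # 49.02 50.98
```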

2. Lyceum of Alabang P.E. department wants to calculate the proportion of students who have attended a
women’s basketball game at the college. They use student email addresses, randomly choose 220
students, and email them. Of the 145 who responded, 22 had attended a women’s basketball game.

a. What is the sample proportion of students who have attended a women’s basketball game?

b. What is the sample proportion of students who have not attended a women’s basketball game ?

c. Can a normal distribution be used to model the sampling distribution for the sample
proportion ? Explain.

Name: Score
Grade / Section : Date :

Chapter 5 : Exercise

I. True or False. Write C if the statement is correct and W if it is false.


1. The t-test is used when the sample size is more than 30.
2. The null hypothesis ( Ho ) is accepted when the computed t-test value falls at the shaded
region of the normal distribution curve covered by the critical values.
3. Level of significance is equal to 1 minus level of confidence ( α = 1 – C ).
4. In a two tailed test, the alternative hypothesis ( Ha ) uses the inequality symbol of
< or > .
5. The t-test is used when the population variance ( σ² ) is unknown and therefore has to be
estimated from the sample itself.
6. The null hypothesis ( Ho ) and the alternative hypothesis ( Ha ) are mathematical
opposites.
7. The null hypothesis is the statement being tested.
8. Type I error is rejecting a true null hypothesis.
9. If the computed t-value is positive, it is located at the left side of the normal
distribution curve.
10. The tailed test for Ha: µ < 50 is a right-tailed test.

II. Determine if the z-test or the t-test is appropriate for the following given. Write Z or T in the blank
before each number. If neither of the two tests is applicable, write X.
1. s = 2.5 ; n = 50
2. n = 15 ;
3. s² = unknown ; n = 25
4. s = 16 ; n = 20
5. s = 36 ; n = 30

III. Compute the degrees of freedom based on the following given.
1. df = ________ if n1 = 16 and n2 = 20

2. df = ________ if n = 28

3. df = ________ if n1 = 24 and n2 = 26
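Worked illustration (not one of the items above): for a one-sample t-test the degrees of freedom are df = n - 1, and for two independent samples the usual pooled form is df = n1 + n2 - 2. The pooled form is an assumption about the rule this chapter uses; the function names and numbers are hypothetical.

```python
def df_one_sample(n):
    """Degrees of freedom for a one-sample t-test."""
    return n - 1


def df_two_sample(n1, n2):
    """Degrees of freedom for a pooled two-sample t-test (assumed form)."""
    return n1 + n2 - 2


print(df_one_sample(12))      # 11
print(df_two_sample(10, 15))  # 23
```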

IV. Find the critical value based on the following given:

1. cv = ________ if α = 0.01 , left-tailed test, t-test statistic, n = 24

2. cv = ________ if α = 0.0025, two-tailed test, t-test statistic, n1 = 22, n2 = 20

V. Problem solving.
1. The average heart rate for Americans is 72 beats/minute. A group of 25 individuals participated in an
aerobics fitness program to lower their heart rate. After six months the group was evaluated to determine
whether the program had significantly slowed their heart rate. The mean heart rate for the group was 69
beats/minute with a standard deviation of 6.5. Was the aerobics program effective in lowering heart
rate?
Answer the following:
a. Ho:
b. Ha:
c. α=
d. Test statistic
e. Tailed test
f. degrees of freedom
g. critical value
h. computed t-value
i. Graph

j. Conclusion:

2. The amount of a certain trace element in blood is known to vary with a standard deviation of 14.1
ppm (parts per million) for male blood donors and 9.5 ppm for female donors. Random samples of 75
male and 50 female donors yield concentration means of 28 and 33 ppm, respectively. What is the
likelihood that the population means of concentrations of the element are the same for men and
women?
Answer the following:
a. Ho:
b. Ha:
c. α=
d. Test statistic
e. Tailed test
f. degrees of freedom
g. critical value
h. computed t-value
i. Graph

j. Conclusion:
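Worked illustration (with made-up numbers, not the two problems above): a computed test value is the observed difference divided by its standard error. The sketch shows a one-sample t value, t = ( x̄ - μ ) / ( s / √n ), and a two-sample z value for known standard deviations, z = ( x̄1 - x̄2 ) / √( σ1²/n1 + σ2²/n2 ). Which form a given problem calls for is for the solver to decide; the function names and figures are hypothetical.

```python
import math


def one_sample_t(sample_mean, hypothesized_mean, sample_sd, n):
    """t = (x_bar - mu) / (s / sqrt(n))."""
    return (sample_mean - hypothesized_mean) / (sample_sd / math.sqrt(n))


def two_sample_z(mean1, mean2, sd1, sd2, n1, n2):
    """z = (x_bar1 - x_bar2) / sqrt(sd1**2 / n1 + sd2**2 / n2)."""
    standard_error = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / standard_error


print(round(one_sample_t(48, 50, 5, 25), 2))         # -2.0
print(round(two_sample_z(20, 22, 3, 4, 36, 64), 2))  # -2.83
```

A negative computed value lies to the left of the center of the curve.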

Name: Score
Grade / Section : Date :

Chapter 6 : Exercise

I. True or False. Write C if the statement is correct and W if it is false.


1. The letter x represents the independent variable.
2. Height is a continuous variable.
3. A line of best fit is drawn in order to study the relationship between the
variables.
4. Bivariate analysis is the analysis of exactly two variables.
5. In a positive correlation, the linear graph rises from lower right to upper left.
6. A Grade 8 student observes that he gets a high score in the test when he studies
longer. The independent variable here is getting a high score.
7. If the slope of the pattern of dots is indefinite, we could say that there is no correlation
between the dependent and independent variables.
8. The closer the value of Pearson’s correlation coefficient ( r ) to zero, the
stronger is the correlation or relationship between two variables.
9. A continuous variable is also known as a quantitative variable.
10. A positive correlation occurs when, as the value of the independent variable increases, the
corresponding value of the dependent variable decreases.

II. Find the slope and y-intercept with the following given.

1. 2y = 6x + 4    m = ________    y-intercept = ________

2. ( 4, 6 ) and ( 2, 18 )    m = ________    y-intercept = ________

3. -10x + 5y + 20 = 0    m = ________    y-intercept = ________

4. 2x/3 + y/4 = 1    m = ________    y-intercept = ________
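Worked illustration (not one of the items above): from two points the slope is m = ( y2 - y1 ) / ( x2 - x1 ) and the y-intercept is b = y1 - m · x1; from an equation written as a·x + b·y = c, solving for y gives slope -a/b and y-intercept c/b. The function names and numbers are hypothetical.

```python
def line_through(p1, p2):
    """Slope and y-intercept of the line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("a vertical line has no defined slope")
    m = (y2 - y1) / (x2 - x1)
    return m, y1 - m * x1


def from_standard_form(a, b, c):
    """Slope and y-intercept of a*x + b*y = c, rewritten as y = m*x + intercept."""
    if b == 0:
        raise ValueError("b must be nonzero")
    return -a / b, c / b


print(line_through((1, 3), (3, 7)))  # (2.0, 1.0)
print(from_standard_form(3, 2, 8))   # (-1.5, 4.0)
```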

III. Problem solving.

1. Find Pearson’s correlation coefficient ( r ) using the following data ( α = 0.02 ; two-tailed test )
and state the correlation.

Sample    x    y     xy    x²    y²
1         2    6
2         7    16
3         5    11
Σ
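Worked illustration (using different, hypothetical data): the column sums in a table of this kind feed the computational formula r = ( nΣxy - ΣxΣy ) / √( ( nΣx² - (Σx)² ) ( nΣy² - (Σy)² ) ). The function name and the data points are made up for the sketch.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r from the sums of x, y, xy, x squared, and y squared."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


# Hypothetical data: Σx = 6, Σy = 11, Σxy = 25, Σx² = 14, Σy² = 45
r = pearson_r([1, 2, 3], [2, 4, 5])
print(round(r, 4))  # 0.982
```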

2. You have to examine the relationship between the age and price for used cars sold in the last year
by a car dealership company.

Here is the table of the data:

Car Age ( in years )    Price ( in dollars )
4                       6300
5                       4500
7                       4200
8                       4100
10                      2100
12                      2200

Note : Use points ( 5, 4500 ) and ( 7, 4200 ) as basis for the computation of the slope.

Find :
a. Predict the price when the car age is 6 years.
b. Predict the price when the car age is 9 years.
c. Predict the price when the car age is 15 years.
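Worked illustration (with made-up points, not the car data): once the slope is computed from two basis points, a prediction follows from the point-slope form y = y1 + m ( x - x1 ). The function name and numbers are hypothetical; a line through two points is only a rough stand-in for a fitted regression line, and predictions far outside the observed range should be read with caution.

```python
def predict(x, p1, p2):
    """Predict y at x from the line through the two basis points p1 and p2."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)
    return y1 + slope * (x - x1)


# Hypothetical basis points (2, 100) and (4, 80); predict at x = 3.
print(predict(3, (2, 100), (4, 80)))  # 90.0
```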