
TMA1111 Mathematical Techniques Faculty of Information Science & Technology

TOPIC 6: HYPOTHESES TESTING & REGRESSION

Contents:

A. HYPOTHESES TESTING

6.1 Introduction

6.1.1 Statistical Hypothesis


6.1.2 Important Concepts in Hypothesis Testing
6.1.3 Tests of Hypotheses using Critical Region

6.2 Hypothesis Test about the Mean

6.2.1 Single Sample – Test Concerning the Mean


6.2.2 Two Samples - Test Concerning the Difference between Two Means

6.3 Hypothesis Test about the Proportion

6.3.1 Single Sample – Test on a Single Proportion

6.4 Hypothesis Test about the Variance

6.4.1 Single Sample - Test Concerning The Variance

B. REGRESSION & CORRELATION

6.5 The Simple Linear Regression Model


6.6 Estimating Model Parameters
6.7 The Coefficient of Determination & Correlation

TCK (2016) Page 1 of 41



A. HYPOTHESES TESTING

6.1 Introduction

When conducting research, one may propose a hypothesis (an assumption), then design an
experiment and collect data in order to carry out hypothesis testing.

The data lead to one of two outcomes:

• The data support the research hypothesis, or
• The data do not support the research hypothesis.

Since the conclusion is based on data from a sample of the population, there is
always a chance that our conclusion about the hypothesis turns out to be wrong.

6.1.1 Statistical Hypothesis

A statistical hypothesis, or just hypothesis, is a conjecture or claim (assertion)
concerning one or more populations.

- Examples of Claims:
• At least 50% of the students will skip the morning class on Friday
• The mean lifetime of a certain light bulb is 8000 hours
• The mean of batch 1 is different from the mean of batch 2.
• The variance of batch 1 is not different from the variance of batch 2.
• The defective percentage is less than 2%.

Evidence from the sample that is inconsistent with the stated hypothesis leads to a
rejection of the hypothesis, whereas evidence supporting the hypothesis leads to its
acceptance.

The null hypothesis, denoted by H0, is a claim concerning a population parameter that is
initially favoured as true.

The alternative hypothesis, denoted by H1 (or Ha), is the assertion that is contrary to
H0.

A test of hypothesis is a method of using sample data to decide whether the null
hypothesis H0 should be rejected (in favour of the alternative hypothesis).

There are two possible conclusions from a hypothesis-testing (H.T.) analysis:

• reject H0, or
• fail to reject H0.

If the sample data provide sufficient evidence to suggest that H0 is false, H0 is rejected
in favour of H1. Otherwise, we continue to believe in the truth of H0.

We will confine ourselves to H.T. of the following format, where θ denotes a population
parameter and θ0 a specified value:

H0: θ = θ0
H1: θ ≠ θ0 OR θ > θ0 OR θ < θ0

The parameters used to form the hypotheses are population parameters,
for example the mean (μ), the proportion (p), and the variance (σ²) or standard deviation (σ).

How to define hypothesis?

Given a scenario, read it carefully and determine the claim that
you want to test (refer to Table 1).

• H0 always carries the equality (=) sign (refer to the H0 column).

• If the claim suggests a simple direction such as more than, less than, superior to,
inferior to, and so on, then H1 is stated using the inequality symbol (< or >)
corresponding to the suggested direction (refer to rows 2 and 4).

• If the claim suggests a compound direction (equality as well as direction) such as
at least, equal to or greater than, at most, no more than, and so on, then this entire
compound direction (≥ or ≤) is expressed as H0, but using only the equality (=)
sign, and H1 is given by the opposite direction (refer to rows 1 and 3).

• If no direction whatsoever is suggested by the claim, then H1 is stated using the
not-equal symbol, ≠ (refer to row 5).

Row   Claim   Keywords (e.g.)   H0   H1
1     ≥       At least          =    <
2     >       More than         =    >
3     ≤       At most           =    >
4     <       Less than         =    <
5     ≠       Not equal         =    ≠
Table 1. The Claims and the Hypotheses
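
The rules summarised in Table 1 can be sketched as a small lookup. This is only an illustration: the function name `state_hypotheses` and the ASCII spellings of the relations (">=", "<=", "!=") are assumptions made for the example, not notation used in these notes.

```python
def state_hypotheses(parameter: str, claim_relation: str, value: float):
    """Return (H0, H1) as strings for a claim such as 'mu <= 1.5'."""
    # Direction used in H1 for each claim relation, following Table 1.
    h1_relation = {
        ">=": "<",   # "at least": the claim goes into H0, H1 takes the opposite direction
        ">": ">",    # "more than": the claim itself becomes H1
        "<=": ">",   # "at most": the claim goes into H0, H1 takes the opposite direction
        "<": "<",    # "less than": the claim itself becomes H1
        "!=": "!=",  # "not equal": two-sided alternative
        "=": "!=",   # no direction suggested: two-sided alternative
    }
    if claim_relation not in h1_relation:
        raise ValueError(f"unknown relation: {claim_relation!r}")
    # H0 always carries the equality sign.
    h0 = f"H0: {parameter} = {value}"
    h1 = f"H1: {parameter} {h1_relation[claim_relation]} {value}"
    return h0, h1


print(state_hypotheses("mu", "<=", 1.5))
print(state_hypotheses("p", "=", 0.6))
```

The first call corresponds to a "does not exceed 1.5" claim and the second to an undirected claim about a proportion.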

Example 6.1:

State the null and alternative hypotheses to be used in testing each claim:

(a) A manufacturer of a certain brand of rice cereal claims that the average saturated
fat content does not exceed 1.5 grams.
Claim: μ ≤ 1.5 ⇒ H0: μ = 1.5 vs. H1: μ > 1.5

(b) A manufacturer of a certain brand of rice cereal claims that the average saturated
fat content is more than 1.5 grams.
Claim: μ > 1.5 ⇒ H0: μ = 1.5 vs. H1: μ > 1.5

(c) A manufacturer of a certain brand of rice cereal claims that the average saturated
fat content is at least 1.5 grams.
Claim: μ ≥ 1.5 ⇒ H0: μ = 1.5 vs. H1: μ < 1.5

(d) A manufacturer of a certain brand of rice cereal claims that the average saturated
fat content is less than 1.5 grams.
Claim: μ < 1.5 ⇒ H0: μ = 1.5 vs. H1: μ < 1.5

(e) A real estate agent claims that 60% of all private residences being built today
are 3-bedroom homes. To test this claim, a large sample of new residences is
inspected; the proportion of these homes with 3 bedrooms is recorded and used
as our statistic.
Claim: p = 0.6 ⇒ H0: p = 0.6 vs. H1: p ≠ 0.6

Example 6.2:

For the following pairs of assertions, indicate which do not comply with our rules for
setting up hypotheses and why.

(a) H0: μ = 100, H1: μ > 100
These hypotheses comply with our rules.

(b) H0: σ = 20, H1: σ ≤ 20
H1 includes the equality claim (σ ≤ 20), which contradicts the rule that the equality
belongs to H0. ⇒ Does not comply.

(c) H0: p ≠ 0.25, H1: p = 0.25
H0 should contain the equality claim, so these are not legitimate. ⇒ Does not comply.

(d) H0: μ = 120, H1: μ = 150
H0 and H1 cannot both be equality claims. ⇒ Does not comply.

6.1.2 Important Concepts in Hypothesis Testing

A test procedure is specified by:

1. A test statistic, a function of the sample data on which the decision is to be based.
2. A critical region (or rejection region), the set of all test statistic values for which
H0 will be rejected (null hypothesis is rejected if and only if the test statistic value
falls in this region.)

A test statistic is the sample statistic that is used in the hypothesis testing process.
The calculated value of the test statistic is used for either rejecting or accepting the
null hypothesis. Examples of test statistics:

• Mean, μ:
$$ Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}, \quad \text{and} \quad T = \frac{\bar{X} - \mu}{s/\sqrt{n}} $$

• Proportion, p:
$$ Z = \frac{\hat{p} - p}{\sqrt{pq/n}} $$

• Variance, σ²:
$$ \chi^2 = \frac{(n-1)s^2}{\sigma^2} $$

The decision procedure could lead to either of two wrong conclusions. So, there are
two types of errors.

A Type I error consists of rejecting the null hypothesis H0 when it is true.

P(Type I error) = P(Reject H0 when it is true) = α

A Type II error involves accepting H0 when H0 is false.

P(Type II error) = P(Accept H0 when H0 is false) = β

               Accept H0        Reject H0
H0 is true     Correct          Type I error
H0 is false    Type II error    Correct

• The level of significance is the probability of committing a Type I error and is
denoted by α (alpha).
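
The meaning of α as a long-run error rate can be illustrated by simulation. The sketch below is an assumption-laden illustration rather than part of the notes' method: it repeatedly draws samples from a normal population for which H0 is actually true, applies a two-tailed z test, and counts how often H0 is wrongly rejected. The population values (mean 8, standard deviation 0.16, n = 25), the number of trials and the seed are arbitrary choices.

```python
import random
from math import sqrt
from statistics import NormalDist


def type_i_error_rate(mu0=8.0, sigma=0.16, n=25, alpha=0.05, trials=4000, seed=1):
    """Estimate P(reject H0 | H0 true) for a two-tailed z test by simulation."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    rejections = 0
    for _ in range(trials):
        # H0 is true here: the sample really comes from N(mu0, sigma^2).
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        x_bar = sum(sample) / n
        z = (x_bar - mu0) / (sigma / sqrt(n))
        if abs(z) >= z_crit:
            rejections += 1  # a Type I error
    return rejections / trials


rate = type_i_error_rate()
print(f"estimated Type I error rate: {rate:.3f}")
```

The estimated rate should come out close to the chosen α, up to simulation noise.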

A test of a statistical hypothesis in which the region of rejection lies on only one side of
the sampling distribution of the test statistic is called a one-tailed test. A one-tailed
test is either an upper-tailed (right-tailed) test or a lower-tailed (left-tailed) test.

A test of a statistical hypothesis in which the region of rejection lies on both sides of the
sampling distribution of the test statistic is called a two-tailed test.

We refer to H1 to determine whether the test is right-tailed, left-tailed or two-tailed.

The critical region is chosen according to three possible cases (upper-tailed/right-tailed
test, lower-tailed/left-tailed test, or two-tailed test), illustrated
with a test statistic that is a standard normal random variable under H0.

Three possible cases:

• A test of any statistical hypothesis where H1 is one-sided, such as
H0: θ = θ0 vs. H1: θ > θ0 (one-sided test; right-tailed test as ">" is used in H1).
The critical region:
Reject H0 if Z ≥ zα (upper-tailed test/right-tailed test)

• A test of any statistical hypothesis where H1 is one-sided, such as
H0: θ = θ0 vs. H1: θ < θ0 (one-sided test; left-tailed test as "<" is used in H1).
The critical region:
Reject H0 if Z ≤ −zα (lower-tailed test/left-tailed test)

• A test of any statistical hypothesis where H1 is two-sided, such as
H0: θ = θ0 vs. H1: θ ≠ θ0 (two-tailed test as "≠" is used in H1).
The critical region:
Reject H0 if Z ≥ zα/2 or Z ≤ −zα/2
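
The three critical regions above can be computed numerically instead of being read from a z table. The following is a minimal sketch using Python's standard-library `statistics.NormalDist`; the function name `z_critical_region` and the "right"/"left"/"two" labels are assumptions made for this example.

```python
from statistics import NormalDist


def z_critical_region(alpha: float, tail: str):
    """Return (lower, upper): reject H0 if z <= lower or z >= upper.

    A bound of None means there is no rejection region on that side.
    """
    std_normal = NormalDist()  # standard normal, mean 0 and sd 1
    if tail == "right":
        return None, std_normal.inv_cdf(1 - alpha)
    if tail == "left":
        return std_normal.inv_cdf(alpha), None
    if tail == "two":
        z_half = std_normal.inv_cdf(1 - alpha / 2)
        return -z_half, z_half
    raise ValueError("tail must be 'right', 'left' or 'two'")


for tail in ("right", "left", "two"):
    print(tail, z_critical_region(0.05, tail))
```

Computed values may differ from table values in the third decimal place because printed tables are rounded.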


Example 6.3:

State the null and alternative hypotheses to be used in testing each claim and determine
where the critical region is located:

(a) A manufacturer of a certain brand of rice cereal claims that the average
saturated fat content does not exceed 1.5 grams.
H0: μ = 1.5 vs. H1: μ > 1.5 (one-tailed test/right-tailed test)
Critical region: Reject H0 if z ≥ zα

(b) A manufacturer of a certain brand of rice cereal claims that the average
saturated fat content is more than 1.5 grams.
H0: μ = 1.5 vs. H1: μ > 1.5 (one-tailed test/right-tailed test)
Critical region: Reject H0 if z ≥ zα

(c) A manufacturer of a certain brand of rice cereal claims that the average
saturated fat content is at least 1.5 grams.
H0: μ = 1.5 vs. H1: μ < 1.5 (one-tailed test/left-tailed test)
Critical region: Reject H0 if z ≤ −zα

(d) A real estate agent claims that 60% of all private residences being built today
are 3-bedroom homes. To test this claim, a large sample of new residences
is inspected; the proportion of these homes with 3 bedrooms is recorded and
used as our statistic.
H0: p = 0.6 vs. H1: p ≠ 0.6 (two-tailed test)
Critical region: Reject H0 if z ≤ −zα/2 or z ≥ zα/2

6.1.3 Tests of hypotheses using critical region

Steps for conducting a test of hypotheses:

1. State the null hypothesis and the alternative hypothesis:
   H0: θ = θ0 vs. H1: θ ≠ θ0 (for 2-tailed test)
   or H1: θ > θ0 (for right-tailed test)
   or H1: θ < θ0 (for left-tailed test)
2. Determine whether it is a one or two-tailed test by referring to H1.
3. Decide on the sampling distribution of the test statistic under H0.
4. State the critical region for the selected significance level, α (or draw the curve).
5. Give the formula for the test statistic. Compute the value of the test statistic from
   the sample data.
6. Decide whether H0 should be rejected and state this conclusion in the problem
   context.

6.2 Hypothesis Test about the Mean

6.2.1 Single Sample - Test concerning the mean

Case 1a: σ² is known (for n ≥ 30 or n < 30)

Suppose we have a sample of size n taken from a population whose mean is μ and
variance is σ². We want to test whether this sample is taken from a population whose
mean is μ0. We know that the sample mean X̄ ~ N(μ, σ²/n) if n is large.

Steps:
(1) H0: μ = μ0 vs. H1: μ ≠ μ0 (for 2-tailed test)
    or H1: μ > μ0 (for right-tailed test)
    or H1: μ < μ0 (for left-tailed test)
(2) Determine whether it is a one or two-tailed test by referring to H1.
(3) Use z distribution.
(4) Critical region:

Alternative hypothesis    Critical/Rejection Region for Level α Test
H1: μ > μ0                z ≥ zα (upper-tailed test/right-tailed test)
H1: μ < μ0                z ≤ −zα (lower-tailed test/left-tailed test)
H1: μ ≠ μ0                z ≤ −zα/2 or z ≥ zα/2 (two-tailed test)

(5) Test statistic:
$$ Z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} $$
(6) Decision & Conclusion.

Case 1b: σ² is unknown but n ≥ 30 (large sample)

The steps are the same as in Case 1a, but σ is replaced with s in the test
statistic, i.e.
$$ Z = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} $$

Case 2: σ² is unknown and n < 30 (small sample)

Suppose we have a sample of size n, where n < 30, taken from a normal population
whose mean is μ and whose variance (σ²) is unknown. We want to test whether this sample is
taken from a population whose mean is μ0.

Steps:

(1) H0: μ = μ0 vs. H1: μ ≠ μ0 (for 2-tailed test)
    or H1: μ > μ0 (for right-tailed test)
    or H1: μ < μ0 (for left-tailed test)
(2) Determine whether it is a one or two-tailed test by referring to H1.
(3) Use t-distribution because n < 30 and σ² is unknown.
(4) Critical region (with n − 1 degrees of freedom):

Alternative hypothesis    Critical/Rejection Region for Level α Test
H1: μ > μ0                T ≥ tα, n−1 (upper-tailed test)
H1: μ < μ0                T ≤ −tα, n−1 (lower-tailed test)
H1: μ ≠ μ0                T ≤ −tα/2, n−1 or T ≥ tα/2, n−1 (two-tailed test)

(5) Test statistic:
$$ T = \frac{\bar{X} - \mu_0}{s/\sqrt{n}} $$
(6) Decision & Conclusion.

Example 6.4:


Suppose that it is known from experience that the standard deviation of the 8-cm
diameter CDs made by a certain company is 0.16 cm. To check whether its production
is under control on a given day, namely, to check whether the true average diameter of
the CDs is 8 cm, an employee selects a random sample of 25 CDs and finds
that their mean diameter is x̄ = 8.091 cm. Since the company stands to lose money
when μ > 8 and the customer loses out when μ < 8, test the null hypothesis μ = 8
against the alternative hypothesis μ ≠ 8 at α = 0.01.

Solution:

Let X be the diameter of the CD. Given: σ = 0.16, n = 25, x̄ = 8.091

(i) Hypotheses: H0: μ = 8
    H1: μ ≠ 8 (2-tailed test)
(ii) σ² is known ⇒ z-distribution
(iii) Critical region:
    α = 0.01 ⇒ α/2 = 0.01/2 = 0.005 (α is divided by 2 because the test is 2-tailed)
    ⇒ z0.005 = 2.57
    Reject H0 if z ≤ −2.57 or z ≥ 2.57
(iv) Test statistic:
$$ Z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{8.091 - 8}{0.16/\sqrt{25}} = 2.8438 $$
(v) Since z > 2.57, reject H0.
{Note: We can also draw the normal curve and shade the critical regions, which lie to the
left of −2.57 and to the right of 2.57. Since z = 2.8438 falls into the right-hand critical
region, we reject H0.}
{Note: Rejecting H0 means accepting H1, so use H1 to state the conclusion.}
(vi) Conclusion: μ ≠ 8. The sample suggests μ > 8, so the company stands to lose money.
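
The arithmetic in Example 6.4 can be checked with a short script. This is a sketch only: the function name `one_sample_z_test` and its argument names are assumptions for illustration, and the critical value is computed from the standard normal distribution (about 2.576) rather than read from a table, so it differs slightly from the rounded table value 2.57.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z_test(x_bar, mu0, sigma, n, alpha, tail="two"):
    """Return (z, z_crit, reject) for a z test about a single mean."""
    z = (x_bar - mu0) / (sigma / sqrt(n))  # test statistic
    std_normal = NormalDist()
    if tail == "two":
        z_crit = std_normal.inv_cdf(1 - alpha / 2)
        reject = abs(z) >= z_crit
    elif tail == "right":
        z_crit = std_normal.inv_cdf(1 - alpha)
        reject = z >= z_crit
    elif tail == "left":
        z_crit = std_normal.inv_cdf(1 - alpha)
        reject = z <= -z_crit
    else:
        raise ValueError("tail must be 'two', 'right' or 'left'")
    return z, z_crit, reject


# Values taken from Example 6.4: x_bar = 8.091, mu0 = 8, sigma = 0.16, n = 25, alpha = 0.01
z, z_crit, reject = one_sample_z_test(8.091, 8, 0.16, 25, 0.01)
print(f"z = {z:.4f}, critical value = {z_crit:.3f}, reject H0: {reject}")
```

The same function applies to Case 1b by passing the sample standard deviation s in place of σ when n ≥ 30.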

Example 6.5:
The daily yield for a local chemical plant has averaged 880 tons for the last several
years. The quality control manager would like to know whether this average has
changed in recent months. She randomly selects 50 days from the computer database
and computes the average and standard deviation as 871 and 21 tons respectively. Test
the appropriate hypothesis using α = 0.05.

Solution:

Let X be the daily yield for the local chemical plant. Given: n = 50, x̄ = 871, s = 21
(i) H0: μ = 880 vs. H1: μ ≠ 880 (2-tailed test)
{Note: ≠ is used in H1 because the claim is "whether this average has changed...",
which indicates that it may be either greater or lower.}
(ii) n > 30 ⇒ z-distribution
(iii) α = 0.05 ⇒ α/2 = 0.05/2 = 0.025 ⇒ z0.025 = 1.96
Critical region: Reject H0 if z ≤ −1.96 or z ≥ 1.96
(iv) Test statistic:
$$ z = \frac{871 - 880}{21/\sqrt{50}} = -3.0305 $$
(v) The critical regions lie to the left of −1.96 and to the right of 1.96.
Since z = −3.0305 falls into the critical region (z < −1.96), reject H0.
(vi) Conclusion: μ ≠ 880. The sample suggests μ < 880, i.e. the average has changed to a
value lower than 880 tons.
Example 6.6:


The specification for a certain kind of ribbon calls for a mean breaking strength of
185 pounds. If five pieces randomly selected from different rolls have a mean breaking
strength of 183.14 and a standard deviation of 8.219, test the null hypothesis μ = 185
pounds against the alternative hypothesis μ < 185 pounds at α = 0.05.

Solution:

Let X be the breaking strength for that certain kind of ribbon.

Given: n = 5, x̄ = 183.14, s = 8.219.

(i) H0: μ = 185 vs. H1: μ < 185 (left-tailed test)
(ii) σ² is unknown, n < 30 ⇒ t-distribution
(iii) α = 0.05 ⇒ t0.05, 4 = 2.132
Critical region: Reject H0 if t ≤ −2.132
(iv) Test statistic:
$$ t = \frac{183.14 - 185}{8.219/\sqrt{5}} = -0.506 $$
(v) The critical region lies to the left of −2.132.
Since t > −2.132 (t = −0.506 does not fall into the critical region),
do not reject H0.

(vi) Conclusion: the mean breaking strength is not significantly less than 185.
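
The calculation in Example 6.6 can be reproduced as follows. Python's standard library has no t-distribution quantile function, so this sketch assumes the critical value is looked up in a t table and passed in by hand (2.132 for α = 0.05 with 4 degrees of freedom, as in the example). The function names are hypothetical.

```python
from math import sqrt


def one_sample_t_statistic(x_bar, mu0, s, n):
    """Test statistic T = (x_bar - mu0) / (s / sqrt(n)) with n - 1 degrees of freedom."""
    return (x_bar - mu0) / (s / sqrt(n))


def reject_left_tailed(t, t_crit):
    """Left-tailed rule: reject H0 if t <= -t_crit (t_crit taken from a t table)."""
    return t <= -t_crit


# Values from Example 6.6: x_bar = 183.14, mu0 = 185, s = 8.219, n = 5
t = one_sample_t_statistic(183.14, 185, 8.219, 5)
t_crit = 2.132  # table value t(0.05, 4), as quoted in the example
print(f"t = {t:.3f}, reject H0: {reject_left_tailed(t, t_crit)}")
```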

6.2.2 Two Samples: Test Concerning the Difference between Two Means

Case 1: σ1² and σ2² are known, or σ1² and σ2² are unknown but n1, n2 ≥ 30

Two independent samples of sizes n1 and n2 are taken from populations with means μ1, μ2
and variances σ1² and σ2². To test whether these samples are taken from populations
whose means are equal:

Steps:
(1) For 2-tailed test:
H0: μ1 = μ2 vs. H1: μ1 ≠ μ2, or
H0: μ1 − μ2 = 0 vs. H1: μ1 − μ2 ≠ 0

For right-tailed test:
H0: μ1 = μ2 vs. H1: μ1 > μ2, or
H0: μ1 − μ2 = 0 vs. H1: μ1 − μ2 > 0

For left-tailed test:
H0: μ1 = μ2 vs. H1: μ1 < μ2, or
H0: μ1 − μ2 = 0 vs. H1: μ1 − μ2 < 0

(2) Determine 1-tailed or 2-tailed test.
(3) Use Z-distribution.
(4) For a particular value of α, determine the critical region.
Critical region:

Alternative hypothesis    Critical/Rejection Region for Level α Test
H1: μ1 − μ2 > 0           z ≥ zα (upper-tailed test/right-tailed test)
H1: μ1 − μ2 < 0           z ≤ −zα (lower-tailed test/left-tailed test)
H1: μ1 − μ2 ≠ 0           z ≤ −zα/2 or z ≥ zα/2 (two-tailed test)

(5) Test statistic:
$$ Z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} $$
where μ1 − μ2 = 0 under H0.
(6) Conclusion.

Note: Replace σ1² and σ2² with s1² and s2² in the test statistic if σ1² and σ2² are
unknown.

Example 6.7:

A company claims that its light bulbs are superior to those of its main competitor. A
study showed that a sample of 40 of its bulbs has a mean lifetime of 647 hours of
continuous use with a standard deviation of 27 hours, while a sample of 40 bulbs made
by its main competitor had a mean lifetime of 638 hours of continuous use with a standard
deviation of 31 hours. Does this substantiate the claim at the 0.05 level of significance?

Solution:

Let X1 and X2 be the lifetimes of the light bulbs made by that company and its main
competitor respectively.

Given: n1 = 40, x̄1 = 647, s1 = 27
       n2 = 40, x̄2 = 638, s2 = 31

(i) H0: μ1 = μ2 vs. H1: μ1 > μ2 (right-tailed test)
(or we can write H0: μ1 − μ2 = 0 vs. H1: μ1 − μ2 > 0)
(ii) σ1², σ2² unknown, n1, n2 > 30 ⇒ z-distribution
(iii) α = 0.05 ⇒ z0.05 = 1.645
Critical region: Reject H0 if z ≥ 1.645
(iv) Test statistic:
$$ z = \frac{(647 - 638) - 0}{\sqrt{\dfrac{27^2}{40} + \dfrac{31^2}{40}}} = 1.3846 $$
(v) The critical region lies to the right of 1.645.
Since z < 1.645 (z = 1.3846 does not fall into the critical region), do not reject
H0.
(vi) Conclusion: At the 0.05 level of significance there is insufficient evidence that
the company's light bulbs last longer on average than those of its main competitor, so
the claim is not substantiated.
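
The test statistic in Example 6.7 can be verified numerically. The sketch below assumes large samples, so that s1 and s2 stand in for the unknown population standard deviations; the function name `two_sample_z` and the parameter `delta0` (the hypothesised difference μ1 − μ2 under H0) are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z(x_bar1, s1, n1, x_bar2, s2, n2, delta0=0.0):
    """Z statistic for H0: mu1 - mu2 = delta0 with large independent samples."""
    standard_error = sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    return ((x_bar1 - x_bar2) - delta0) / standard_error


# Values from Example 6.7
z = two_sample_z(647, 27, 40, 638, 31, 40)
z_crit = NormalDist().inv_cdf(1 - 0.05)  # right-tailed critical value for alpha = 0.05
reject_h0 = z >= z_crit
print(f"z = {z:.4f}, critical value = {z_crit:.3f}, reject H0: {reject_h0}")
```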

6.3 Hypothesis Test about the Proportion

6.3.1 One Sample - Test on a Single Proportion

When a random sample of size n (n large) is drawn from a population in which each
observation results in one of two possible outcomes ("success" or "failure"), the sample
proportion of successes, p̂, can be used to test whether the sample could have been drawn
from a population whose proportion of successes is p0. This is the hypothesis test about
a proportion.

Steps:
(1) H0: p = p0 vs. H1: p ≠ p0 (for 2-tailed test)
    or H1: p > p0 (for right-tailed test)
    or H1: p < p0 (for left-tailed test)
(2) Determine whether it is a one or two-tailed test by referring to H1.
(3) Use z-distribution.
(4) For the required α, the critical region is determined.

Critical region:

Alternative hypothesis    Critical/Rejection Region for Level α Test
H1: p > p0                z ≥ zα (upper-tailed test/right-tailed test)
H1: p < p0                z ≤ −zα (lower-tailed test/left-tailed test)
H1: p ≠ p0                z ≤ −zα/2 or z ≥ zα/2 (two-tailed test)

(5) Test statistic:
$$ Z = \frac{\hat{p} - p_0}{\sqrt{p_0 q_0 / n}}, \quad q_0 = 1 - p_0 $$
(6) Conclusion regarding the acceptance/rejection of the null hypothesis based on the
rejection criteria.

Example 6.8:

If 4 out of 20 patients suffered serious side effects from a new medication, test the null
hypothesis p = 0.5 against the alternative hypothesis p ≠ 0.5 at α = 0.05. Here, p is
the true proportion of patients suffering side effects from the new medication.

Solution:
(i) H0: p = 0.5 vs. Ha: p ≠ 0.5 (2-tailed test)
(ii) α = 0.05 ⇒ α/2 = 0.05/2 = 0.025 ⇒ z0.025 = 1.96
Critical region: Reject H0 if z ≤ −1.96 or z ≥ 1.96
(iii) Test statistic: n = 20, p0 = 0.5, q0 = 1 − p0 = 0.5, p̂ = 4/20 = 0.2, and thus
$$ z = \frac{0.2 - 0.5}{\sqrt{\dfrac{0.5 \times 0.5}{20}}} = -2.68 $$
(iv) Since z < −1.96, reject H0.
(Note: We can also draw the normal curve and shade the critical region, as in the previous
examples, to decide whether to reject H0.)
(v) Conclusion: p ≠ 0.5
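
The computation in Example 6.8 can be checked with a few lines of code. This is a minimal sketch; the function name `proportion_z` is an assumption for illustration, and the critical value is computed from the standard normal distribution rather than read from a table.

```python
from math import sqrt
from statistics import NormalDist


def proportion_z(successes, n, p0):
    """Z statistic for H0: p = p0, using p0 (not p_hat) in the standard error."""
    p_hat = successes / n
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)


# Values from Example 6.8: 4 successes out of n = 20, p0 = 0.5, alpha = 0.05
z = proportion_z(4, 20, 0.5)
z_crit = NormalDist().inv_cdf(1 - 0.05 / 2)  # two-tailed critical value
reject_h0 = abs(z) >= z_crit
print(f"z = {z:.2f}, critical value = {z_crit:.2f}, reject H0: {reject_h0}")
```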

6.4 Hypothesis Test about the Variance


6.4.1 One Sample - Test Concerning the Variance

To test whether the population variance σ² equals a specific value σ0², we use the sample
variance s².

Steps:
(1) H0: σ² = σ0² vs. H1: σ² ≠ σ0² (two-tailed test)
    or H1: σ² > σ0² (right-tailed test)
    or H1: σ² < σ0² (left-tailed test)
(2) Determine whether it is a one or two-tailed test by referring to H1.
(3) Use the χ² distribution.
(4) Determine the critical region for the required value of α and the given n, based on
the χ² distribution table with n − 1 degrees of freedom.
Critical region:

Hypotheses                          Critical/Rejection Region for Level α Test
H0: σ² = σ0² vs. H1: σ² > σ0²       χ² > χ²α (right-tailed rejection region)
H0: σ² = σ0² vs. H1: σ² < σ0²       χ² < χ²1−α (left-tailed rejection region)
H0: σ² = σ0² vs. H1: σ² ≠ σ0²       χ² > χ²α/2 or χ² < χ²1−α/2 (two-tailed rejection region)

The rules in the table above are based on the following diagrams:

(i) Right-tailed test (right-tailed rejection region)
(ii) Left-tailed test (left-tailed rejection region)
(iii) Two-tailed test (two-tailed rejection region)

[Figures: χ² density curves with the corresponding rejection regions shaded]

(5) Test statistic:
$$ \chi^2 = \frac{(n-1)s^2}{\sigma_0^2} $$
(6) Conclusion regarding the acceptance/rejection of the null hypothesis based on the
rejection criteria.

Example 6.9:

Suppose that the thickness of a part used in a semiconductor is its critical dimension
and that measurements of the thickness of a random sample of 18 such parts have the
variance s² = 0.68. The process is said to be under control if the variation of the
thickness is given by a variance not greater than 0.36. Assuming that the measurements
constitute a random sample from a normal population, test the hypothesis that the process
is under control at α = 0.05.

Solution:

Let X be the thickness of a part used in the semiconductor.

n = 18, s² = 0.68
(i) H0: σ² = 0.36 vs. H1: σ² > 0.36 (right-tailed test)
(ii) α = 0.05 ⇒ χ²0.05, 17 = 27.587
Critical region: Reject H0 if χ² > 27.587
(Or draw a diagram: the χ² density curve with 17 degrees of freedom, with the area
α = 0.05 to the right of 27.587 shaded.)
(iii) Test statistic: n = 18, s² = 0.68 and thus
$$ \chi^2 = \frac{(18 - 1)(0.68)}{0.36} = 32.111 $$
(iv) Since χ² > 27.587, reject H0
(χ² = 32.111 falls in the critical region).
(v) Conclusion: σ² > 0.36.

Example 6.10:

You have a random sample of size 20, with a standard deviation of 125. You have good
reason to believe that the underlying population is normal. Is the population variance
different from 10,000, at the 0.05 significance level?

Solution:

n = 20, s = 125, σ0² = 10,000, α = 0.05.

(i) H0: σ² = 10,000
    H1: σ² ≠ 10,000 (two-tailed test)

(ii) α = 0.05. Since the test is 2-tailed, divide α by 2, i.e.
α/2 = 0.05/2 = 0.025

Rejection criteria:
Reject H0 if χ² > χ²α/2, n−1 or χ² < χ²1−α/2, n−1,
i.e. χ² > χ²0.025, 19 or χ² < χ²0.975, 19,
i.e. χ² > 32.852 or χ² < 8.907.
(Or draw a diagram: the χ² density curve with 19 degrees of freedom, with an area of 0.025
shaded in each tail, to the left of 8.907 and to the right of 32.852.)

(iii) Test statistic:
$$ \chi^2 = \frac{(n-1)s^2}{\sigma_0^2} = \frac{(20 - 1)(125^2)}{10000} = 29.688 $$

Since χ² = 29.688 does not fall in the rejection region (8.907 < χ² < 32.852),
do not reject H0.

(iv) Conclusion: There is insufficient evidence that the population variance differs
from 10,000.
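
The χ² statistics in Examples 6.9 and 6.10 can be checked as follows. Python's standard library has no χ² quantile function, so this sketch assumes the critical values are looked up in a χ² table and passed in by hand (the values quoted in the examples are used below). The function names are hypothetical.

```python
def chi_square_statistic(n, s_squared, sigma0_squared):
    """Test statistic (n - 1) * s^2 / sigma0^2 with n - 1 degrees of freedom."""
    return (n - 1) * s_squared / sigma0_squared


def reject_right_tailed(chi2, upper_crit):
    """Right-tailed rule: reject H0 if chi2 exceeds the upper table value."""
    return chi2 > upper_crit


def reject_two_tailed(chi2, lower_crit, upper_crit):
    """Two-tailed rule: reject H0 if chi2 falls in either tail."""
    return chi2 < lower_crit or chi2 > upper_crit


# Example 6.9: n = 18, s^2 = 0.68, sigma0^2 = 0.36, table value 27.587
chi2_a = chi_square_statistic(18, 0.68, 0.36)
print(f"Example 6.9:  chi2 = {chi2_a:.3f}, reject H0: {reject_right_tailed(chi2_a, 27.587)}")

# Example 6.10: n = 20, s = 125, sigma0^2 = 10,000, table values 8.907 and 32.852
chi2_b = chi_square_statistic(20, 125 ** 2, 10_000)
print(f"Example 6.10: chi2 = {chi2_b:.3f}, reject H0: {reject_two_tailed(chi2_b, 8.907, 32.852)}")
```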

B. REGRESSION & CORRELATION

6.5 The Simple Linear Regression Model

The simplest deterministic mathematical relationship between two variables x and y is
a linear relationship y = β0 + β1 x.

However, the relationship between two variables x and y may not be deterministic.

Regression analysis is the part of statistics that deals with investigation of the
relationship between two or more variables using probabilistic models.


For our discussion, we shall assume that values of the variable x are fixed by the
experimenter. The variable x is the independent (predictor, explanatory) variable.
For a fixed x, the second variable will be a random variable Y with observed value y,
referred to as the dependent (response) variable.

Usually observations will be made for a number of settings of the independent variable.
Let x1, x2, . . . , xn denote values of the independent variable for which observations
are made, and let Yi and yi respectively denote the random variable and observed value
associated with xi. The available bivariate data then consists of the n pairs (x1, y1), (x2,
y2), . . . , (xn, yn). A first step in regression analysis involving two variables is to
construct a scatter plot of the observed data. In such a plot, each (xi, yi) is represented
as a point plotted on a two-dimensional coordinate system.

Scatter Plot
A scatter plot is a useful summary of a set of bivariate data (two variables), usually drawn
before working out a linear correlation coefficient or fitting a regression line. It gives a
good visual picture of the relationship between the two variables, and aids the
interpretation of the correlation coefficient or regression model.

Each unit contributes one point to the scatter plot, on which points are plotted but not
joined. The resulting pattern indicates the type and strength of the relationship between the
two variables.

(from Statistics Glossary by Easton & McColl)

A simple linear regression model describes the linear relationship between the dependent
variable Y and a single independent variable x as

Y = β0 + β1 x + ε

where
Y is the response (dependent) variable,
x is the explanatory (predictor, independent) variable,
β0 and β1 are the regression coefficients,
ε is the random error, with E[ε] = 0 and Var[ε] = σ²;
β0, β1 and σ² are parameters.

Note:
(i) β0 is the y-intercept; it is meaningful only if the scope of the model includes the
value x = 0.
(ii) β1 indicates the change in the mean response associated with a one-unit increase in
x. (β1 is also the slope of the regression line.)
(iii) The true (or population) regression line y = β0 + β1 x is the line of mean
values; for a particular x value, y is the expected value of Y for that value of x.

Figure 1. Points corresponding to observations from the simple linear regression model

Linear models: The simplest relationship between two variables is a straight line, hence
the term simple linear regression. Given such a relationship,
y = β0 + β1 x, one may be able to predict Y at new values of x from
knowledge of the trend between x and Y.
Example 6.11

Suppose that in a certain chemical process the reaction time Y (hr) is related to the
temperature X (°F) in the chamber in which the reaction takes place according to the
simple linear regression model with equation Y = 5.00 − 0.01X and σ = 0.075.
a. What is the expected change in reaction time for a 1 °F increase in temperature?
For a 10 °F increase in temperature?

b. What is the expected reaction time when the temperature is 200 °F? When the
temperature is 250 °F?

Solution:
a. The slope β1 = −0.01 is the expected change in reaction time per 1 °F increase.
For a 1 °F increase, the expected change is β1 × 1 = −0.01 × 1 = −0.01 hr.
For a 10 °F increase, the expected change is β1 × 10 = −0.01 × 10 = −0.1 hr.
b. When X = 200 °F, expected Y = 5.00 − 0.01(200) = 3.00 hr.
When X = 250 °F, expected Y = 5.00 − 0.01(250) = 2.50 hr.

6.6 Estimating Model Parameters

Consider a given set of sample data {(x1, y1), (x2, y2), …, (xi, yi), …, (xn, yn)}. Let yi
be the observed value of a random variable (rv) Yi, where Yi = β0 + β1 xi + εi. The errors
εi are independent rv's.

If the line y = β0 + β1 x is used to fit the model, the fitted values ŷi are obtained via
ŷi = β0 + β1 xi. The residual ei = yi − ŷi = yi − (β0 + β1 xi) is the vertical
deviation of the point (xi, yi) from the fitted line y = β0 + β1 x.

The error sum of squares, denoted by SSE, is:

$$ SSE = \sum_i \varepsilon_i^2 = \sum_i (Y_i - \hat{Y}_i)^2 = \sum_i (Y_i - \beta_0 - \beta_1 X_i)^2 $$

It is used as the measure of goodness of fit.
Using the principle of least squares, we minimize this sum of squares to obtain the
estimated regression line or least squares line.

A line provides a good fit to the data if the vertical distances (deviations) from the
observed points to the line are "small" (see Figure 2).

Figure 2. Deviations of observed data from line y = 𝛽0 + 𝛽1 𝑥.

Minimizing SSE with respect to the linear regression parameters (β0, β1) gives the
least-squares (regression) line for the data:

y = β̂0 + β̂1 x.

Here the least squares estimate of the slope is

$$ \hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} $$

and the least squares estimate of the intercept is

$$ \hat{\beta}_0 = \frac{1}{n}\Big(\sum_i y_i - \hat{\beta}_1 \sum_i x_i\Big) = \bar{y} - \hat{\beta}_1 \bar{x} $$

where

$$ S_{xy} = \sum_i (x_i - \bar{x})(y_i - \bar{y}) = \sum_i x_i y_i - \frac{1}{n}\Big(\sum_i x_i\Big)\Big(\sum_i y_i\Big) $$

$$ S_{xx} = \sum_i (x_i - \bar{x})^2 = \sum_i x_i^2 - \frac{\big(\sum_i x_i\big)^2}{n} $$

Least squares estimators of 𝛽0 and 𝛽1 given above are unbiased and have minimum
variance among all other unbiased estimators.
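
The least squares formulas for the slope and intercept can be sketched in code as follows. The function name `least_squares_line` and the small data set are hypothetical, chosen only so that the arithmetic is easy to follow by hand (Sxy = 6, Sxx = 10).

```python
def least_squares_line(xs, ys):
    """Return (b0, b1), the least squares intercept and slope, via Sxy and Sxx."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    # Computational forms of Sxy and Sxx
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    b1 = s_xy / s_xx                   # slope estimate
    b0 = sum_y / n - b1 * sum_x / n    # intercept estimate: y_bar - b1 * x_bar
    return b0, b1


# Hypothetical illustrative data, not taken from the notes
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b0, b1 = least_squares_line(xs, ys)
print(f"fitted line: y = {b0:.2f} + {b1:.2f} x")
```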


In computing β̂0, keep extra digits in β̂1 (at least 4 decimal places), because if x̄ is
large in magnitude, rounding β̂1 will noticeably affect the final answer.

The Line

After estimating the model parameters, the fitted regression line can then be written
as:
y = 𝛽̂0 + 𝛽̂1 𝑥.

Note: It must be emphasized that before 𝛽̂0 and 𝛽̂1 are computed, a scatter plot should
be examined to see whether a linear probabilistic model is plausible. If the points do
not tend to cluster about a straight line with roughly the same degree of spread for all
x, other models should be investigated. In practice, plots and regression calculations
are usually done by using a statistical computer package.

Estimating 2 and 

The variance parameter, $\sigma^2$, determines the amount of variability inherent in the
regression model. After a regression model has been fitted, the fitted values $\hat{y}_i$ are
obtained via $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$, with residuals $e_i = y_i - \hat{y}_i$.

The residuals can be used to estimate $\sigma^2$. An unbiased estimate of $\sigma^2$ is given by

$$\hat{\sigma}^2 = s^2 = \frac{SSE}{n-2}$$

where SSE is the error sum of squares:

$$SSE = \sum_i e_i^2 = S_{YY} - \left(\frac{S_{XY}}{S_{XX}}\right) S_{XY} = S_{YY} - \hat{\beta}_1 S_{XY}$$

with

$$S_{YY} = \sum_i y_i^2 - \frac{1}{n}\left(\sum_i y_i\right)^2 \quad \text{and} \quad S_{XY} = \sum_i x_i y_i - \frac{1}{n}\left(\sum_i x_i\right)\left(\sum_i y_i\right)$$
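
The shortcut form of SSE follows from the definition of the residuals. The derivation below is a sketch added for clarity and is not part of the original notes; it uses the equivalent centred forms of $S_{XX}$, $S_{YY}$ and $S_{XY}$.

```latex
\begin{aligned}
SSE &= \sum_i (y_i - \hat{y}_i)^2
     = \sum_i \bigl[(y_i - \bar{y}) - \hat{\beta}_1 (x_i - \bar{x})\bigr]^2
     && \text{since } \hat{y}_i = \bar{y} + \hat{\beta}_1 (x_i - \bar{x}) \\
    &= S_{YY} - 2\hat{\beta}_1 S_{XY} + \hat{\beta}_1^2 S_{XX} \\
    &= S_{YY} - 2\hat{\beta}_1 S_{XY} + \hat{\beta}_1 S_{XY}
     && \text{since } \hat{\beta}_1 S_{XX} = S_{XY} \\
    &= S_{YY} - \hat{\beta}_1 S_{XY}
\end{aligned}
```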

Steps to Solve Simple Linear Regression Problem


𝑦 = 𝛽0 + 𝛽1 𝑥

Step 1: Draw the scatter plot of the (X,Y) data for visual inspection of the relationship
that may exist between X and Y. {Note: This step can be skipped if the scatter
diagram is not required in the question.}

Step 2: Construct the following table to facilitate computation.

k      X       Y       X²       Y²       XY
1      x1      y1      x1²      y1²      x1 y1
2      x2      y2      x2²      y2²      x2 y2
...    ...     ...     ...      ...      ...
n      xn      yn      xn²      yn²      xn yn
Sum    Σxi     Σyi     Σxi²     Σyi²     Σxi yi

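Step 3: Calculate the estimates of the linear regression parameters ($\beta_0$, $\beta_1$) using the formulas below:

$$\hat{\beta}_1 = \frac{S_{XY}}{S_{XX}} \quad \text{and} \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

where

$$S_{XY} = \sum_i x_i y_i - \frac{1}{n}\left(\sum_i x_i\right)\left(\sum_i y_i\right) \quad \text{and} \quad S_{XX} = \sum_i x_i^2 - \frac{1}{n}\left(\sum_i x_i\right)^2$$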

Step 4: The linear regression model of the data is given by

$$Y = \hat{\beta}_0 + \hat{\beta}_1 X$$

Additionally, we can compute $SSE = S_{YY} - \hat{\beta}_1 S_{XY}$ and hence an unbiased estimate
of $\sigma^2$:

$$\hat{\sigma}^2 = s^2 = \frac{SSE}{n-2}$$
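
Steps 2 to 4 can be sketched in Python as follows. This is an illustrative sketch rather than part of the original notes: the function names `regression_table` and `fit_from_sums` and the five data points are hypothetical.

```python
def regression_table(x, y):
    """Step 2: build rows (k, x, y, x^2, y^2, x*y) and the five column sums."""
    rows = [(k, xi, yi, xi * xi, yi * yi, xi * yi)
            for k, (xi, yi) in enumerate(zip(x, y), start=1)]
    sums = tuple(sum(row[j] for row in rows) for j in range(1, 6))
    return rows, sums

def fit_from_sums(n, sums):
    """Steps 3 and 4, plus SSE and s^2, using only the column sums."""
    sum_x, sum_y, sum_x2, sum_y2, sum_xy = sums
    s_xx = sum_x2 - sum_x ** 2 / n
    s_yy = sum_y2 - sum_y ** 2 / n
    s_xy = sum_xy - sum_x * sum_y / n
    b1 = s_xy / s_xx                  # slope estimate
    b0 = sum_y / n - b1 * sum_x / n   # intercept estimate
    sse = s_yy - b1 * s_xy            # error sum of squares
    s2 = sse / (n - 2)                # unbiased estimate of the error variance
    return {"b0": b0, "b1": b1, "sse": sse, "s2": s2}

# Hypothetical data, chosen only to illustrate the procedure.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
rows, sums = regression_table(x, y)
result = fit_from_sums(len(x), sums)
for row in rows:
    print(*row)
print("Sum", *sums)
print(result)
```

Working from the column sums mirrors the hand calculation: once the Sum row of the table is known, the individual data points are no longer needed.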
Example 6.12


A cloth manufacturer wants to determine the relationship between the thickness of a
synthetic fiber and its tensile strength. Researchers took measurements at various
pre-selected, known levels of fiber thickness, and the following data were collected.

Fiber thickness, X     40  31  34  44  49  36  41  50  39  45
Tensile strength, Y    83  74  72  70  75  73  70  76  79  72

If the fiber thickness is 45, what is the predicted tensile strength?
In addition, give an estimate of the standard deviation of the model error.

Solution:

Consider the simple linear regression model to fit the data:

$$Y = \beta_0 + \beta_1 X$$

Step 1: Draw the scatter plot of the (X,Y) data for visual inspection of the
relationship that may exist between X and Y. {Note: can be skipped}

[Scatter plot of tensile strength Y (vertical axis, 70 to 85) against fiber thickness X (horizontal axis, 30 to 50)]

Step 2: Construct the following table to facilitate computation.

k      X      Y      X²       Y²       XY
1      40     83     1600     6889     3320
2      31     74     961      5476     2294
3      34     72     ...      ...      ...
4      44     70     ...      ...      ...
5      49     75     ...      ...      ...
6      36     73     ...      ...      ...
7      41     70     ...      ...      ...
8      50     76     ...      ...      ...
9      39     79     ...      ...      ...
10     45     72     ...      ...      ...
Sum    409    744    17077    55504    30436

(Σxi = 409, Σyi = 744, Σxi² = 17077, Σyi² = 55504, Σxi yi = 30436)

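Step 3: Calculate the estimates of the linear regression parameters ($\beta_0$, $\beta_1$):

Using the table above with n = 10, we determine

$$\bar{x} = \frac{1}{n}\sum_i x_i = 40.9, \qquad \bar{y} = \frac{1}{n}\sum_i y_i = 74.4$$

$$S_{XY} = 30436 - \frac{(409)(744)}{10} = 6.4, \qquad S_{XX} = 17077 - \frac{409^2}{10} = 348.9$$

and

$$\hat{\beta}_1 = \frac{S_{XY}}{S_{XX}} = \frac{6.4}{348.9} = 0.01834$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 74.4 - (0.01834)(40.9) = 73.6499$$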

Step 4: The linear regression model of the data is given by

Y = 73.6499 + 0.0183X

When thickness, x = 45, the model predicts tensile strength Y to be


Y = 73.6499+0.0183(45) = 74.4734#

For SSE and the estimate of $\sigma^2$:


$$S_{yy} = \sum_i y_i^2 - \frac{\left(\sum_i y_i\right)^2}{n} = 55504 - \frac{744^2}{10} = 150.4$$

$$SSE = S_{yy} - \hat{\beta}_1 S_{xy} = 150.4 - (0.01834)(6.4) = 150.28$$

$$\hat{\sigma}^2 = s^2 = \frac{150.28}{8} = 18.785$$

An estimate of the standard deviation ($\sigma$) of the model error is $\sqrt{18.785} = 4.33$.
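
The hand calculation for Example 6.12 can be cross-checked with a short script. The Python sketch below is an addition to the notes and carries full floating-point precision throughout, so the intercept and the prediction differ slightly in the later decimals from the hand-rounded values (the prediction comes out at about 74.475 rather than 74.4734, which used the slope rounded to 0.0183).

```python
import math

x = [40, 31, 34, 44, 49, 36, 41, 50, 39, 45]   # fiber thickness
y = [83, 74, 72, 70, 75, 73, 70, 76, 79, 72]   # tensile strength
n = len(x)

sum_x, sum_y = sum(x), sum(y)
s_xx = sum(v * v for v in x) - sum_x ** 2 / n
s_yy = sum(v * v for v in y) - sum_y ** 2 / n
s_xy = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n

b1 = s_xy / s_xx                   # slope estimate
b0 = sum_y / n - b1 * sum_x / n    # intercept estimate
prediction = b0 + b1 * 45          # predicted strength at thickness 45

sse = s_yy - b1 * s_xy             # error sum of squares
s = math.sqrt(sse / (n - 2))       # estimated standard deviation of the model error

print(f"S_xy={s_xy:.1f}, S_xx={s_xx:.1f}, S_yy={s_yy:.1f}")
print(f"b1={b1:.5f}, b0={b0:.4f}, prediction={prediction:.4f}")
print(f"SSE={sse:.2f}, s={s:.2f}")
```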

Note: For the above example

1) $\hat{\beta}_0$ has no practical interpretation, since the range of the sample data does not include x = 0.

2) Within the scope of the model, we have a linear relationship between x and y.

3) We should not make inferences about the relationship between x and y for values outside
the range of the sample data.

Example 6.13

A chemical engineer is investigating the effect of process operating temperature on
product yield. The study results in the following data:

Temperature, °C (x)    100  110  120  130  140  150  160  170  180  190
Yield, % (y)            45   51   54   61   66   70   74   78   85   89

These pairs of points are plotted in Fig. 14-1. Such a display is called a scatter diagram.
Examination of this scatter diagram indicates that there is a strong relationship between
yield and temperature, and the tentative assumption of the straight-line model y = 𝛽0 +
𝛽1 𝑥 + 𝜀 appears to be reasonable. Find the regression line equation that represents this
set of data.


Solution:
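
One way to carry out the calculation is sketched below in Python. This sketch is an addition to the notes, not the author's worked solution; it applies the Step 3 formulas to the temperature and yield data of this example, and the variable names are assumptions.

```python
temperature = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]  # x, degrees C
yield_pct = [45, 51, 54, 61, 66, 70, 74, 78, 85, 89]              # y, percent
n = len(temperature)

sum_x, sum_y = sum(temperature), sum(yield_pct)
s_xx = sum(v * v for v in temperature) - sum_x ** 2 / n
s_xy = sum(a * b for a, b in zip(temperature, yield_pct)) - sum_x * sum_y / n

b1 = s_xy / s_xx                   # slope estimate
b0 = sum_y / n - b1 * sum_x / n    # intercept estimate
print(f"S_xx = {s_xx:.1f}, S_xy = {s_xy:.1f}")
print(f"fitted line: y = {b0:.4f} + {b1:.4f} x")
```

With these data the script gives a slope of about 0.4830 and an intercept of about -2.7394, so each additional degree of temperature is associated with roughly 0.48 percentage points more yield within the observed range.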


6.7 The Coefficient of Determination & Correlation

6.7.1 The Coefficient of Determination

The sample coefficient of determination, $r^2$, represents the proportion of the total
variation of the variable Y that can be explained by a linear relationship with the values
of X. It is widely used to determine how well a regression fits, in other words, how
close the points are to the regression line.

A quantitative measure of the total amount of variation in the observed y values is given
by the total sum of squares, SST.

SSE = the sum of squared deviations about the least squares line $Y = \hat{\beta}_0 + \hat{\beta}_1 X$,
SST = the sum of squared deviations about the horizontal line at height $\bar{y}$ (so SST = $S_{YY}$),
SSE/SST = the proportion of total variation that cannot be explained by the simple
linear regression model,
1 - SSE/SST = the proportion of observed y variation explained by the model.

Thus, $r^2 = 1 - SSE/SST$.

The higher the value of $r^2$, the more successful the simple linear regression model is in
explaining y variation.
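
The definition can be computed directly from the two sums of squares. The following Python sketch is illustrative only: the function name `r_squared` and the data are hypothetical, and it assumes the y values are not all equal (otherwise SST is zero).

```python
def r_squared(x, y):
    """Coefficient of determination r^2 = 1 - SSE/SST for a simple linear fit."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    # SSE: squared deviations about the fitted line
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    # SST: squared deviations about the horizontal line at height y_bar
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - sse / sst

# Hypothetical data, for illustration only.
r2 = r_squared([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
r2_perfect = r_squared([1, 2, 3], [2, 4, 6])   # points exactly on a line
print(round(r2, 4), round(r2_perfect, 4))
```

For the first data set SSE = 2.4 and SST = 6, so the model explains 60% of the variation in y; for points lying exactly on a line SSE is zero and the value is 1.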

6.7.2 Correlation

Correlation analysis is used to measure the strength of linear relation between X and Y
by means of a single number called a correlation coefficient.

The population correlation coefficient $\rho$ is defined as

$$\rho = \frac{\sigma_{XY}}{\sqrt{\sigma_{XX}\,\sigma_{YY}}}, \quad \text{with } -1 \le \rho \le +1.$$


Some useful indications of the correlation coefficient:

• $\rho = \pm 1$ occurs only when there is a perfect linear relationship between the two
variables,
• $\rho = +1$ implies a perfect linear relationship with a positive slope ($\beta_1 > 0$),
• $\rho = -1$ implies a perfect linear relationship with a negative slope ($\beta_1 < 0$).

Thus, if a sample's correlation coefficient is close to unity in magnitude, this implies
a good correlation or linear association between X and Y, whereas values near
zero indicate little or no linear correlation.

Sample Estimate of correlation coefficient

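The sample estimate of the correlation coefficient, r, is defined as

$$r = \frac{S_{XY}}{\sqrt{S_{XX}\,S_{YY}}} \quad \text{or} \quad r = \pm\sqrt{r^2}$$

where the sign of r is the same as the sign of $S_{XY}$ (and hence of $\hat{\beta}_1$).

The value of r ($-1 \le r \le 1$) measures how strong the linear relationship between X
and Y is.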
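
The two expressions for r agree because the coefficient of determination reduces to the square of the sample correlation coefficient. The short derivation below is added as a sketch; it uses SST = $S_{YY}$, SSE = $S_{YY} - \hat{\beta}_1 S_{XY}$ and $\hat{\beta}_1 = S_{XY}/S_{XX}$.

```latex
\begin{aligned}
r^2 &= 1 - \frac{SSE}{SST}
     = 1 - \frac{S_{YY} - \hat{\beta}_1 S_{XY}}{S_{YY}}
     = \frac{\hat{\beta}_1 S_{XY}}{S_{YY}} \\
    &= \frac{S_{XY}^2}{S_{XX}\, S_{YY}}
     = \left(\frac{S_{XY}}{\sqrt{S_{XX}\, S_{YY}}}\right)^2
\end{aligned}
```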

[Figure: four scatter plots of Y against X]

A value of r near 0 is not evidence that no strong relationship exists; it indicates only the
absence of a linear relationship.
A value of |r| between 0 and 0.5 is considered weak, a value between 0.8 and 1 is considered
strong, and a value in between is considered moderate. The ranges of r are summarized below:

Range of r        Relationship
-1 to -0.8        Strong negative relationship
-0.8 to -0.5      Moderate negative relationship
-0.5 to 0         Weak negative relationship
0 to 0.5          Weak positive relationship
0.5 to 0.8        Moderate positive relationship
0.8 to 1          Strong positive relationship
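
The ranges above can be expressed as a small lookup function. This Python sketch is illustrative: the function name `describe_correlation` is hypothetical, and the treatment of the boundary values (exactly 0.5 counted as weak, exactly 0.8 counted as strong, exactly 0 reported as no linear relationship) is an assumption, since the notes do not say which side the boundaries belong to.

```python
def describe_correlation(r):
    """Label r using the ranges above (boundary handling is an assumption)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    if r == 0:
        return "no linear relationship"
    magnitude = abs(r)
    if magnitude >= 0.8:
        strength = "strong"
    elif magnitude > 0.5:
        strength = "moderate"
    else:
        strength = "weak"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} relationship"

labels = [describe_correlation(r) for r in (0.9, -0.65, 0.3)]
print(labels)
```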

Example 6.14

Compute the correlation coefficient between X (test grade) and Y (number of years) if
SXX = 10.5, SYY = 1504.1, SXY = 114.5.

Solution:
$$r = \frac{S_{XY}}{\sqrt{S_{XX}\,S_{YY}}} = \frac{114.5}{\sqrt{(10.5)(1504.1)}}$$
= 0.9111#

In conclusion, the r value of 0.9111 indicates a strong positive linear correlation between test grade
and the number of years.
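
The arithmetic in Example 6.14 can be checked with a few lines of Python. This is a sketch added for illustration; the function name `correlation_from_sums` is hypothetical.

```python
import math

def correlation_from_sums(s_xx, s_yy, s_xy):
    """Sample correlation coefficient r = S_XY / sqrt(S_XX * S_YY)."""
    if s_xx <= 0 or s_yy <= 0:
        raise ValueError("S_XX and S_YY must be positive")
    return s_xy / math.sqrt(s_xx * s_yy)

r = correlation_from_sums(s_xx=10.5, s_yy=1504.1, s_xy=114.5)
print(f"r = {r:.4f}, r^2 = {r * r:.4f}")
```

Squaring r gives a coefficient of determination of about 0.83, so roughly 83% of the variation in Y is explained by the linear relationship with X.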

Statistical Tables
