
Lesson 8

Simple Linear Regression Analysis

In this lesson, you will learn that statistics can be used in making predictions.
These predictions are based on the fact that two variables are related.

A. Association and Causation

There is a strong association between cigarette smoking and death rate from
lung cancer. A study of British doctors found that smokers had 20 times the risk of
nonsmokers, and a larger study of American men ages 40 to 79 found 11 times
higher death rates among smokers. Does this mean that cigarette smoking causes
lung cancer?

We are asking whether a specific association is due to changes in one
variable (lung cancer) being caused by changes in another variable (smoking).
Some observed associations are due to cause and effect, but others are not.

Example:

In Taguig City, a law compelling residents to wear face masks because of
COVID-19 went into effect in March 2020. As time passed, an increasing
percentage of people complied and wore masks. A study found a high positive
correlation between the percent of people wearing masks and the percent reduction
in positive cases. This is a clear instance of cause and effect: wearing masks
prevents positive cases of COVID-19, so an increase in their use causes a drop in
positive cases.

So, association may be due to any of the following:

• Causation - changes in A cause changes in B.

• Common response - changes in both A and B are caused by changes in a
  third variable C.

• Confounding - changes in B are caused both by changes in A and by
  changes in a third variable C.

The tobacco industry has argued that the association between smoking and
lung cancer may be an instance of common response. Perhaps there is a genetic
factor that predisposes people both to nicotine addiction and to lung cancer. Then
smoking and lung cancer would be positively associated even if smoking had no
direct effect on health.
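
Common response can be illustrated with a small simulation. The sketch below is a hypothetical setup, not data from any study: a hidden variable C drives both A and B, and A has no effect on B. The function names `pearson_r` and `simulate_common_response`, the noise model, the sample size, and the seed are all assumptions chosen for illustration.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_common_response(n=5000, seed=0):
    """Generate A and B that both depend on a hidden C but not on each other."""
    rng = random.Random(seed)
    a_values, b_values = [], []
    for _ in range(n):
        c = rng.gauss(0, 1)                    # hidden third variable C
        a_values.append(c + rng.gauss(0, 1))   # A = C + independent noise
        b_values.append(c + rng.gauss(0, 1))   # B = C + independent noise
    return a_values, b_values


a_values, b_values = simulate_common_response()
r = pearson_r(a_values, b_values)
print(f"correlation between A and B: {r:.2f}")
```

Under these assumptions the theoretical correlation between A and B is 0.5, even though changing A would do nothing to B. The association comes entirely from C.
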
How can we detect causation?

1. Causation is not a simple idea.

• Rarely is A “the cause” of B. Smoking is at most a contributory cause of lung
  cancer. It is one of several circumstances that make lung cancer more likely.

• Exposure to asbestos and breathing polluted air may be other contributory
  causes.

2. Properly designed experiments are the best means of settling questions of
   causation.

• You should be able to imagine a randomized controlled experiment with
  human subjects that would determine beyond doubt whether cigarette
  smoking causes lung cancer.

B. Prediction

A strong relationship between two variables can be used to predict the value
of one when the value of the other is known.

The usefulness for prediction of an observed relationship depends on the
strength of the association. But the usefulness of an observed association in
predicting Y given the value of X does not depend on a cause-and-effect relation
between X and Y.

An employer who uses an aptitude test to screen potential employees does
not think that high test scores cause good job performance after the person is hired.
Rather, the test attempts to measure abilities that will really result in good
performance.

C. Regression Analysis

This section tackles the simplest type of prediction: predicting one
variable (Y) from knowledge of another variable (X). Much research in the
behavioral sciences focuses on problems of prediction.

Examples: predicting academic performance in school from scores on an
intelligence test; predicting the job performance of an applicant from information
available at the time of application.

Linear regression analysis is a simple technique for prediction. The regression line is

$$Y = a + bX$$

where:

Y = predicted score

a = the y-intercept

b = the slope of the line

The slope of the regression line for predicting y from x is represented by b, and the
point where the line cuts the y-axis, or simply the y-intercept, is represented by a.
Both can be determined with the formulas:

$$b = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2}$$

$$a = \bar{y} - b\bar{x}$$

where:

$\bar{y}$ = mean of the y values

$\bar{x}$ = mean of the x values
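
The slope and intercept formulas above can be sketched in Python as follows. The function name `fit_line` and the four sample points are illustrative assumptions, not part of the lesson. The points are made up to lie exactly on Y = 1 + 2X so the result is easy to check by eye.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line Y = a + b*X."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    # Slope: b = (n*Sxy - Sx*Sy) / (n*Sx2 - Sx^2)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Intercept: a = mean(y) - b * mean(x)
    a = sum_y / n - b * (sum_x / n)
    return a, b


# Made-up points that lie exactly on Y = 1 + 2X.
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)  # → 1.0 2.0
```

The denominator is zero when all x values are equal, so a production version would need to guard against that case.
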


Example:

1. Consider the data gathered from 10 administrators.

Administrator   Admin. Capability (x)   Productivity (y)   xy
1               4                       4                  16
2               4                       3                  12
3               3                       3                   9
4               5                       4                  20
5               3                       5                  15
6               3                       4                  12
7               4                       4                  16
8               5                       4                  20
9               5                       5                  25
10              4                       5                  20
Total           ∑x = 40                 ∑y = 41            ∑xy = 165

Mean of x = 4, mean of y = 4.1

∑x² = 166, ∑y² = 173

Finding the slope b, or the regression coefficient

Formula:

$$b = \frac{\Delta Y}{\Delta X} = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}$$

Substituting the values:

$$b = \frac{10(165) - (40)(41)}{10(166) - (40)^2} = \frac{1650 - 1640}{1660 - 1600} = \frac{10}{60} \approx 0.17$$

$$a = \bar{y} - b\bar{x} = 4.1 - 0.17(4) = 4.1 - 0.68 = 3.42$$

Regression line equation:

$$Y = a + bX$$

$$Y = 3.42 + 0.17X$$

Therefore, Productivity = 3.42 + 0.17 (administrative capability).
This means that for every one-unit increase in the level of administrative
capability of administrators, their predicted level of productivity increases by 0.17 units.
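
As a check on the arithmetic in this example, the sketch below recomputes b and a from the summary sums in the table (n = 10, ∑x = 40, ∑y = 41, ∑xy = 165, ∑x² = 166). The variable names are illustrative. It also shows, as an aside, how rounding b to two decimals before computing a shifts the intercept slightly.

```python
# Summary statistics from the table of 10 administrators.
n = 10
sum_x, sum_y = 40, 41
sum_xy, sum_x2 = 165, 166

b_exact = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
mean_x, mean_y = sum_x / n, sum_y / n

# Intercept computed with b rounded to 0.17, as in the worked solution.
a_from_rounded_b = mean_y - round(b_exact, 2) * mean_x
# Intercept computed with the unrounded slope b = 10/60.
a_exact = mean_y - b_exact * mean_x

print(f"b = {b_exact:.4f}")
print(f"a with b rounded to 0.17 = {a_from_rounded_b:.2f}")
print(f"a with unrounded b       = {a_exact:.2f}")
```

With the rounded slope the intercept is 3.42, matching the worked solution. Carrying the unrounded slope gives about 3.43. The difference is small here, but it is a reason to round only at the final step.
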

2. Anthony, an engineering statistics student, has a summer job with the division of
DENR. A new variety of tree was planted 6 years ago, and the trunk diameters were
taken after each year of growth. Neglect all environmental factors.

Year       1     2     3     4     5     6
Diameter   1.3   2.5   3.7   5.3   6.4   7.2

a. Determine the regression line equation.
b. Estimate the average diameter for 3.5-year-old trees.
c. Compute the error sum of squares for the regression line.

Solution:
a. The year is the independent variable x, since it is fixed, and the response
variable y is the diameter. In the table, Y denotes the diameter predicted by the
regression line.

Year (x)   Diameter (y)   xy      x²   Y      (y - Y)²
1          1.3              1.3    1   1.34   0.0016
2          2.5              5.0    4   2.57   0.0049
3          3.7             11.1    9   3.79   0.0081
4          5.3             21.2   16   5.01   0.0841
5          6.4             32.0   25   6.24   0.0256
6          7.2             43.2   36   7.46   0.0676
Total 21   26.4           113.8   91          0.1919

n = 6

$$\bar{x} = \frac{\sum x}{n} = \frac{21}{6} = 3.5$$

$$\bar{y} = \frac{\sum y}{n} = \frac{26.4}{6} = 4.4$$

Solving for the values of a and b:

$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} = \frac{6(113.8) - (21)(26.4)}{6(91) - (21)^2} = \frac{128.4}{105} = 1.223$$

$$a = \bar{y} - b\bar{x} = 4.4 - 1.223(3.5) = 0.120$$

Computing Y for each year, using Y = a + bx:

x = 1: Y = 0.12 + 1.223(1) = 1.34

x = 2: Y = 0.12 + 1.223(2) = 2.57

x = 3: Y = 0.12 + 1.223(3) = 3.79

x = 4: Y = 0.12 + 1.223(4) = 5.01

x = 5: Y = 0.12 + 1.223(5) = 6.24

x = 6: Y = 0.12 + 1.223(6) = 7.46

a. The regression equation is

$$Y = a + bx$$

$$Y = 0.12 + 1.223x$$

b. The average diameter for x = 3.5 years is found as

$$Y = 0.12 + 1.223(3.5) = 4.40$$

Therefore, the regression line passes through the point $(\bar{x}, \bar{y}) = (3.5, 4.4)$.

c. The error sum of squares is:

$$\sum e^2 = \sum (y - Y)^2 = 0.1919 \quad \text{(from the table)}$$
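
The three parts of this example can be reproduced with a short script. This is a minimal sketch using only the year and diameter data given in the problem. The helper name `predict` is an assumption for illustration, and no intermediate rounding is applied.

```python
years = [1, 2, 3, 4, 5, 6]
diameters = [1.3, 2.5, 3.7, 5.3, 6.4, 7.2]

n = len(years)
sum_x = sum(years)
sum_y = sum(diameters)
sum_xy = sum(x * y for x, y in zip(years, diameters))
sum_x2 = sum(x * x for x in years)

# Part a: slope and intercept of the regression line.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * (sum_x / n)


def predict(x):
    """Predicted diameter Y = a + b*x for a tree of age x years."""
    return a + b * x


# Part b: predicted diameter at 3.5 years.
estimate = predict(3.5)

# Part c: error sum of squares, the sum of (y - Y)^2 over all points.
sse = sum((y - predict(x)) ** 2 for x, y in zip(years, diameters))

print(f"Y = {a:.3f} + {b:.3f} x")
print(f"estimate at 3.5 years = {estimate:.2f}")
print(f"error sum of squares  = {sse:.4f}")
```

Without intermediate rounding the error sum of squares comes out near 0.191, slightly below the 0.1919 in the table, which was built from predictions rounded to two decimals.
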

Exercise 8

1. An FX service operator wants to determine the length of time it would take to
transport passengers within Manila to NAIA during non-peak times. A sample of 9
trips on a particular day during non-peak times gave the following:

Distance, km (x)   Time, min (y)
10                 19.75
11                 18.1
12                 21.9
14                 24.1
15                 27.15
16                 22.95
18                 29.4
20                 37.25
24                 40.5

a. Determine the regression line equation.

b. Estimate the length of time it would take to transport passengers 23
   km from NAIA.

c. Compute the error sum of squares for the regression line.

2. The kilometer-per-liter (km/L) figures for a new engine are recorded for fixed
speeds between 56 and 112 km/hr.

Speed, km/hr (x)   Mileage, km/L (y)
56                 14.7
104                13.2
64                 14.5
88                 13.2
112                12.8
96                 13.4
84                 13.3
68                 14.5
80                 13.8
100                13
60                 14.6
72                 4.3

a. Determine the regression line equation.

b. Compute the mean estimated mileage figure for a speed of 85
   km/hr.

c. Compute the error sum of squares for the regression line.
