
Basic Statistics for Counselling EDG 1503

Regression and Prediction
Dr. Madihah Khalid

Assignment 2 EDC 1503
1. Use the following formulas to calculate the regression model:

Interpret the results: what does b mean, what does a mean?


2. If you had another city with an unemployment rate of 18
percent, what number of disorders would you predict that this
city would have based on the regression model you calculated in
number 1?
3. Using the same data, compute the Pearson correlation
between unemployment and disorders. How would you evaluate
this relationship--is it strong, weak, or somewhere in between?

Introduction
Suppose you want to estimate what score you will get on the final
exam. Based on a discussion with the lecturer, you learn that the test
is designed to yield a mean of approximately 75. However, you already
know that you scored 74 on the midterm, where the mean was 70. How
can you use this information to make a prediction?
Further investigation yields the information that the SD of the midterm
was 4. Thus your z-score is (74 - 70)/4 = 1; that is, you scored 1
standard deviation above the mean. Therefore, if your performance is
consistent, the lecturer's prediction is correct, and the final also has
an SD of 4, you would probably score 79 on the final. Is there enough
information for you to make the prediction?
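The arithmetic behind this first guess can be sketched in a few lines of Python. The midterm values and the final mean come from the text; the final SD of 4 is an assumption here (it matches the value quoted later for the final), and the variable names are only illustrative.

```python
# Values from the text: midterm score 74, midterm mean 70, midterm SD 4,
# expected final mean 75.
midterm_score = 74
midterm_mean = 70
midterm_sd = 4
final_mean = 75
final_sd = 4  # assumption: the final has the same SD as the midterm

# z-score: how many SDs the midterm score lies above the midterm mean
z = (midterm_score - midterm_mean) / midterm_sd

# Naive prediction: assume the same z-score on the final (ignores r)
naive_prediction = final_mean + z * final_sd

print(z, naive_prediction)  # 1.0 79.0
```

This guess silently treats the midterm and final as perfectly correlated, which is the issue the next paragraph raises.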
What about the relationship between the midterm and final scores? If
there is one, you can make a more accurate prediction. If the
correlation coefficient is perfect (r = 1.00), then you are right. What
if r = 0? Then you cannot use the midterm to predict your final mark.
What if r = 0.86?
The first picture shows hypothetical plots when r = 0 and r = 0.86.

Line of best fit (regression line) for midterm and final grades, y = 11.4 + 0.915(x)

Example - Monthly salary and annual income


Employee   Monthly salary (X)   Annual income (Y)
A          2,000                24,000
B          2,050                24,600
C          2,200                26,400
D          2,275                27,300
E          2,350                28,200
F          2,425                29,100
G          2,500                30,000
H          2,600                31,200

Y = 12(X) and r = 1.00
Suppose that the firm adds a $1000 bonus at the end of the year. The
equation then becomes Y = 1000 + 12(X).
In behavioural research, it is rare for data to fall exactly on a straight line, so
we must find a line of best fit: the regression line.
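As a quick check of the salary example, the sketch below computes Pearson's r for the eight monthly salaries, once against Y = 12X and once against Y = 1000 + 12X. The helper name pearson_r is illustrative and not from the notes; it uses the standard deviation-score form of r.

```python
from math import sqrt

# Monthly salaries of the eight employees in the example
monthly = [2000, 2050, 2200, 2275, 2350, 2425, 2500, 2600]
annual = [12 * x for x in monthly]                 # Y = 12X
annual_bonus = [1000 + 12 * x for x in monthly]    # Y = 1000 + 12X


def pearson_r(xs, ys):
    """Pearson correlation from sums of deviation products."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


r_plain = pearson_r(monthly, annual)
r_bonus = pearson_r(monthly, annual_bonus)
print(round(r_plain, 6), round(r_bonus, 6))
```

Adding the constant bonus shifts the intercept but leaves the correlation perfect, because every point still falls exactly on a straight line.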

Determining the Regression Line

Look at the data below:

X    Y     Y'     Y - Y'   (Y - Y')^2
2    5     4.8     .2      .04
3    6     6.6    -.6      .36
4    9     8.4     .6      .36
5    10    10.2   -.2      .04

Plot the points and draw a regression line through them.

The line in the plot is the best fitting straight line; it has a slope of
1.8 and a Y-intercept of 1.2. Y' represents values along this
regression line. The general formula for Y' is Y' = bX + A, where b
is the slope and A is the Y-intercept.

$Y' = M_y + r \frac{s_y}{s_x} (X - M_x)$

Predicted Y = mean of Y + (correlation of X and Y) x (ratio of standard deviations, sy/sx) x (deviation score of X)

The formulas for b and A are $b = r \frac{s_y}{s_x}$ and $A = M_y - b M_x$, where r is
Pearson's correlation between X and Y, sy is the
standard deviation of Y, sx is the standard deviation of X, My is the
mean of Y and Mx is the mean of X.
Notice that b = r whenever sy = sx. When scores are standardized, sy
= sx = 1, b = r, and A = 0.
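A minimal sketch of these two formulas, assuming the four data points X = 2, 3, 4, 5 and Y = 5, 6, 9, 10. These points are not printed in full in the notes; they are recovered from the Y' values and the differences Y - Y' given in the example. The standard deviations use N in the denominator; the ratio sy/sx would be the same with N - 1.

```python
from math import sqrt

# Assumed data, recovered from the Y' and Y - Y' values in the example
xs = [2, 3, 4, 5]
ys = [5, 6, 9, 10]
n = len(xs)

mx = sum(xs) / n                                   # Mx, mean of X
my = sum(ys) / n                                   # My, mean of Y
sx = sqrt(sum((x - mx) ** 2 for x in xs) / n)      # SD of X
sy = sqrt(sum((y - my) ** 2 for y in ys) / n)      # SD of Y

# Pearson's r as the mean product of deviations divided by sx * sy
r = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n * sx * sy)

b = r * sy / sx        # slope
a = my - b * mx        # Y-intercept (A in the notes)

print(round(b, 4), round(a, 4))  # 1.8 1.2
```

The result agrees with the slope of 1.8 and intercept of 1.2 quoted for the plotted line.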
For the example, b = 1.8 and A = 1.2, and therefore Y' = 1.8X + 1.2.
The first value of Y' is 4.8, computed as (1.8)(2) + 1.2 = 4.8.
As stated earlier, the regression line is the best
fitting straight line through the data. More technically, the
regression line minimizes the sum of the squared differences
between Y and Y'. The table lists these differences (Y - Y')
together with their squares.
The sum of the squared differences (.04 + .36 + .36 + .04 = .80)
is smaller than it would be for any other straight line through the
data.
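The minimizing property can be checked numerically. The sketch below assumes the same four points (X = 2, 3, 4, 5 and Y = 5, 6, 9, 10, recovered from the example's Y' and Y - Y' values) and compares the sum of squared differences for Y' = 1.8X + 1.2 with a few arbitrarily chosen alternative lines. The function name sse and the alternative lines are illustrative only.

```python
# Assumed data, recovered from the Y' and Y - Y' values in the example
xs = [2, 3, 4, 5]
ys = [5, 6, 9, 10]


def sse(b, a):
    """Sum of squared differences between Y and Y' = b*X + a."""
    return sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))


best = sse(1.8, 1.2)
print(round(best, 2))  # 0.8

# A few other straight lines, chosen arbitrarily for comparison
alternatives = [(1.7, 1.2), (1.8, 1.5), (2.0, 0.5), (1.9, 0.9)]
for b_alt, a_alt in alternatives:
    print(b_alt, a_alt, round(sse(b_alt, a_alt), 2))
```

Every alternative line gives a larger sum than .80. Trying a handful of lines is an illustration of the criterion, not a proof of it.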
Since the sum of squared deviations is minimized, this criterion for
the best fit is called the "least squares criterion." Notice that the

Residual Variance & Standard Error of Estimate

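The sum of squared differences, $\sum (Y - Y')^2$, divided by N - 2
gives the residual variance: $s^2 = \frac{\sum (Y - Y')^2}{N - 2}$. The variance around the regression
line is also called error variation.

The standard deviation around the regression line (the standard error of
estimate) is the square root of the residual variance:
$S = \sqrt{\frac{\sum (Y - Y')^2}{N - 2}}$. Because $\sum (Y - Y')^2 = (1 - r^2) \sum (Y - M_y)^2$, where My is the mean of Y, S can also be calculated from r without finding each difference: $S = \sqrt{\frac{(1 - r^2) \sum (Y - M_y)^2}{N - 2}}$.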

Now, Y' is a good estimate of Y when we know about the relation
between the two variables, but unless the correlation is perfect, Y' will
not be perfectly accurate.
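As an illustration, the residual variance and standard error of estimate can be computed for the four-point example used earlier. The data (X = 2, 3, 4, 5 and Y = 5, 6, 9, 10) are an assumption recovered from that example's Y' and Y - Y' values, and the line Y' = 1.8X + 1.2 is taken from the notes.

```python
from math import sqrt

# Assumed data, recovered from the earlier worked example
xs = [2, 3, 4, 5]
ys = [5, 6, 9, 10]
n = len(xs)

b, a = 1.8, 1.2  # regression line from the notes: Y' = 1.8X + 1.2

# Sum of squared differences between observed Y and predicted Y'
ss_err = sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))

residual_variance = ss_err / (n - 2)   # divide by N - 2
s_est = sqrt(residual_variance)        # standard error of estimate

print(round(residual_variance, 3), round(s_est, 3))  # 0.4 0.632
```

With only four points the estimate is very rough; the example is meant to show the arithmetic, not a realistic sample size.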

Back to our earlier Example

Let us go back to predicting the final exam mark.

You scored 74 and your friend scored 63 on the
midterm.

The lecturer has given this test before and knows
that the mean of the final is 75, the SD is 4, and the correlation
between midterm and final is .60. How would you predict your and your
friend's final exam scores?
For you, $Y' = 75 + .6 \left(\frac{4}{4}\right)(74 - 70) = 75 + (.6)(4) = 77.4$
For your friend, $Y' = 75 + .6 \left(\frac{4}{4}\right)(63 - 70) = 75 + (.6)(-7) = 70.8$
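The two predictions can be reproduced with a small function. This is a sketch under the values stated in the notes (midterm mean 70 and SD 4, final mean 75 and SD 4, r = .60); the function and parameter names are illustrative.

```python
def predict_final(midterm, mean_x=70, sd_x=4, mean_y=75, sd_y=4, r=0.60):
    """Predicted final score: Y' = My + r * (sy / sx) * (X - Mx)."""
    return mean_y + r * (sd_y / sd_x) * (midterm - mean_x)


you = predict_final(74)
friend = predict_final(63)
print(round(you, 1), round(friend, 1))  # 77.4 70.8
```

Both predictions are pulled toward the final mean of 75 compared with the midterm deviations (+4 and -7), because r is less than 1.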

Explained and Unexplained Variation

From the graphs shown, there are 3 separate sums of squares. In each, My is the mean of Y.

1. The total sum of squares for Y: $SS_{total} = \sum (Y - M_y)^2$. This is basic to
finding the variance and SD of the distribution.
2. The sum of squares of explained variance in Y (the regression sum of
squares): $SS_{exp} = \sum (Y' - M_y)^2$. The greater the correlation, the
greater the predicted deviation from the mean.
3. The sum of squares of unexplained variance in Y: $SS_{err} = \sum (Y - Y')^2$.

$\sum (Y - M_y)^2 = \sum (Y' - M_y)^2 + \sum (Y - Y')^2$

Total Variation = Explained + Unexplained

$r^2 = \frac{SS_{exp}}{SS_{total}} = \frac{\sum (Y' - M_y)^2}{\sum (Y - M_y)^2}$
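The decomposition can be verified on the small example from the section on determining the regression line. The sketch assumes the four points X = 2, 3, 4, 5 and Y = 5, 6, 9, 10 (recovered from that example's Y' and Y - Y' values) and the line Y' = 1.8X + 1.2 given in the notes.

```python
# Assumed data, recovered from the earlier worked example
xs = [2, 3, 4, 5]
ys = [5, 6, 9, 10]
b, a = 1.8, 1.2                      # Y' = 1.8X + 1.2 from the notes

my = sum(ys) / len(ys)               # mean of Y
preds = [b * x + a for x in xs]      # predicted values Y'

ss_total = sum((y - my) ** 2 for y in ys)               # total variation
ss_exp = sum((p - my) ** 2 for p in preds)              # explained
ss_err = sum((y - p) ** 2 for y, p in zip(ys, preds))   # unexplained

r_squared = ss_exp / ss_total

print(round(ss_total, 2), round(ss_exp, 2), round(ss_err, 2))  # 17.0 16.2 0.8
print(round(r_squared, 4))  # 0.9529
```

Here the explained and unexplained parts (16.2 and 0.8) add up to the total of 17, and about 95 percent of the variation in Y is accounted for by X.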

Regression and Causation


When two variables are related, it is possible to predict the values of
one variable from another. The relationship between correlation and
prediction can lead to serious errors in reasoning. Correlation is a
necessary but not a sufficient condition for making causal inferences with
reasonable confidence.
In a widely studied example, numerous epidemiological studies
showed that women who were taking combined
hormone replacement therapy (HRT) also had a lower-than-average
incidence of coronary heart disease (CHD), leading doctors to propose
that HRT was protective against CHD. But randomized controlled trials
showed that HRT caused a small but statistically significant increase in
the risk of CHD. Re-analysis of the data from the epidemiological studies
showed that women undertaking HRT were more likely to be from
higher socio-economic groups (ABC1), with better-than-average diet
and exercise regimens. The use of HRT and the decreased incidence of
coronary heart disease were coincident effects of a common cause
(i.e. the benefits associated with a higher socio-economic status),
rather than cause and effect, as had been supposed.

Regression and Causation


When two variables are related, it is possible to predict the values of one
variable from another.
A strong relationship frequently carries the implication that one variable has
caused the other, especially when one precedes the other in time.
However, the variables may not be connected in any direct way but may
vary together through a common link to other variables.
Example: You may be tempted to conclude that the number of hours
spent studying causes the grade to vary because of the positive
correlation between the two. This is not necessarily the case, because
there are other factors such as intelligence, motivation, better study habits or
supportive companions, and longer hours of study may therefore be the
by-product of these factors.
Correlation is a necessary but not a sufficient condition to establish a
causal relationship between two variables.