Statistic-correlation

Regression and Prediction Dr. Madihah Khalid

Assignment 2 EDC

1. Using the data, apply the following formulas to calculate the regression model:

2. If you had another city with an unemployment rate of 18

percent, what number of disorders would you predict that this

city would have based on the regression model you calculated in

number 1?

3. Using the same data, compute the Pearson correlation

between unemployment and disorders. How would you evaluate

this relationship--is it strong, weak, or somewhere in between?

Introduction

Suppose you want to estimate what score you will get on the final exam. Based on a discussion with the lecturer, you learn that the test is designed to yield a mean of approximately 75. However, you already know that you scored 74 on the midterm, where the mean was 70. How can you use this information to make a prediction?

Further investigation reveals that the SD of the midterm was 4. Thus your z-score is (74 - 70)/4 = 1, that is, 1 standard deviation above the mean. Therefore, if your performance is consistent and the lecturer's prediction is correct, you would probably score 79 on the final (assuming the final also has an SD of 4). Is there enough information for you to make the prediction?

What about the relationship between the midterm and final scores? If there is one, you can make a more accurate prediction. If the correlation coefficient is perfect (r = 1.00), then your prediction of 79 is right. What if r = 0? Then you cannot use the midterm to predict your final mark. What if r = 0.86?
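The three cases can be compared numerically. The sketch below assumes the standardized prediction rule (predicted z on the final = r times z on the midterm) and assumes the final has an SD of 4, as the predicted score of 79 implies; the function name `predict_final` and its default arguments are illustrative only.

```python
def predict_final(midterm, r, mid_mean=70.0, mid_sd=4.0,
                  final_mean=75.0, final_sd=4.0):
    """Predict a final score from a midterm score and a correlation r."""
    # z-score on the midterm: distance from the mean in SD units
    z_mid = (midterm - mid_mean) / mid_sd
    # The predicted z-score on the final shrinks toward 0 as r gets smaller
    z_final = r * z_mid
    return final_mean + z_final * final_sd


for r in (1.0, 0.0, 0.86):
    print(f"r = {r:.2f}: predicted final = {predict_final(74, r):.2f}")
```

With r = 1.00 the prediction is the full 1 SD above the final mean; with r = 0 the best guess is simply the final mean; intermediate values of r fall in between.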

The first picture shows hypothetical plots when r = 0 and r = 0.86

Line of best fit (regression line) for midterm and final grades, y = 11.4 + 0.915(x)

Employee   Monthly salary   Annual income
A          2,000            24,000
B          2,050            24,600
C          2,200            26,400
D          2,275            27,300
E          2,350            28,200
F          2,425            29,100
G          2,500            30,000
H          2,600            31,200

Without a bonus, annual income is exactly 12 times the monthly salary: Y = 12(X). Suppose that the firm added a $1000 bonus at the end of the year. Now the equation becomes Y = 1000 + 12(X)
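A short sketch of the salary relationship as a straight line with slope 12 and an intercept equal to the bonus. The function name `annual_income` is an illustrative choice; the monthly salaries are the ones listed for the eight employees.

```python
# Monthly salaries of the eight employees
monthly_salaries = [2000, 2050, 2200, 2275, 2350, 2425, 2500, 2600]


def annual_income(monthly_salary, bonus=0):
    """Straight line Y = bonus + 12(X): slope 12, Y-intercept = bonus."""
    return bonus + 12 * monthly_salary


without_bonus = [annual_income(x) for x in monthly_salaries]
with_bonus = [annual_income(x, bonus=1000) for x in monthly_salaries]

for x, y0, y1 in zip(monthly_salaries, without_bonus, with_bonus):
    print(f"monthly {x:>5}  annual {y0:>6}  annual with bonus {y1:>6}")
```

The bonus shifts every point up by the same amount, so the slope stays at 12 and only the Y-intercept changes.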

In behavioural research, it is rare for data to fall exactly on a straight line. So we must find a line of best fit, the regression line.

Plot the points and draw a

regression line through

them

X     Y      Y'      Y - Y'    (Y - Y')²
2     5      4.8      .2        .04
3     6      6.6     -.6        .36
4     9      8.4      .6        .36
5     10     10.2    -.2        .04

The regression line is the best fitting straight line; it has a slope of 1.8 and a Y-intercept of 1.2. Y' represents values along this regression line. The general formula for Y' is: Y' = bX + A, where b is the slope and A is the Y-intercept.

The slope and intercept are calculated as b = r(sy/sx) and A = My - b(Mx), where r is the Pearson's correlation between X and Y, sy is the standard deviation of Y, sx is the standard deviation of X, My is the mean of Y and Mx is the mean of X.

Notice that b = r whenever sy = sx. When scores are standardized, sy

= sx = 1, b = r, and A = 0.
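The slope and intercept can be checked numerically. This is a minimal sketch that takes X = 2, 3, 4, 5 and Y = 5, 6, 9, 10 (values consistent with the example's Y' and Y - Y' figures) and computes b = r(sy/sx) and A = My - b(Mx); the helper names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

X = [2, 3, 4, 5]
Y = [5, 6, 9, 10]


def pearson_r(xs, ys):
    """Pearson's correlation from sums of squares and cross-products."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def regression_coefficients(xs, ys):
    """Return (b, A) for the line Y' = bX + A."""
    r = pearson_r(xs, ys)
    b = r * stdev(ys) / stdev(xs)   # b = r(sy/sx)
    a = mean(ys) - b * mean(xs)     # A = My - b(Mx)
    return b, a


b, a = regression_coefficients(X, Y)
print(f"b = {b:.2f}, A = {a:.2f}")
```

Only the ratio sy/sx enters the slope, so it makes no difference whether the sample or the population standard deviation is used, provided the same one is used for both variables.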

For the example, b = 1.8, A = 1.2 and therefore, Y' = 1.8X + 1.2. The first value of Y' is 4.8. This was computed as: (1.8)(2) + 1.2 = 4.8. As stated earlier, the regression line is the best fitting straight line through the data. More technically, the regression line minimizes the sum of the squared differences between Y and Y'. The Y - Y' column of the table shows these differences and the (Y - Y')² column shows the squared differences. The sum of these squared differences (.04 + .36 + .36 + .04 = .80) is smaller than it would be for any other straight line through the data.
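The least squares property can be illustrated by comparing the regression line with a few nearby lines. The sketch assumes X = 2, 3, 4, 5 and Y = 5, 6, 9, 10 (values consistent with the example's Y' and Y - Y' figures); the nudged slopes and intercepts are arbitrary choices for illustration, not an exhaustive proof.

```python
X = [2, 3, 4, 5]
Y = [5, 6, 9, 10]


def sum_squared_errors(xs, ys, slope, intercept):
    """Sum of squared differences between Y and Y' = slope * X + intercept."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# The regression line from the example: Y' = 1.8X + 1.2
best = sum_squared_errors(X, Y, 1.8, 1.2)

# Nearby lines: nudge the slope and/or the intercept a little
nudges = [(db, da) for db in (-0.1, 0.0, 0.1) for da in (-0.2, 0.0, 0.2)
          if (db, da) != (0.0, 0.0)]
others = [sum_squared_errors(X, Y, 1.8 + db, 1.2 + da) for db, da in nudges]

print(f"regression line: {best:.2f}")
print(f"smallest value among the nudged lines: {min(others):.2f}")
```

Every nudged line gives a larger sum of squared differences than the regression line does.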

Since the sum of squared deviations is minimized, this criterion for the best fit is called the "least squares criterion." Notice that the sum of squared differences, Σ(Y - Y')², when divided by N - 2 gives the residual variance: s² = Σ(Y - Y')² / (N - 2). The variance around the regression line is also called error variation.

The standard error of estimate is obtained from the square root of the residual variance: S = sqrt(Σ(Y - Y')² / (N - 2)). S can be calculated more simply by S = [formula missing in source]

The regression line summarizes the relationship between the two variables, but unless the correlation is perfect, Y' will not be perfectly accurate.
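As a rough numerical check, the residual variance and its square root can be computed for the small example, again assuming X = 2, 3, 4, 5 and Y = 5, 6, 9, 10 with the line Y' = 1.8X + 1.2. The variable names are illustrative.

```python
from math import sqrt

X = [2, 3, 4, 5]
Y = [5, 6, 9, 10]
slope, intercept = 1.8, 1.2

# Differences between observed Y and predicted Y'
residuals = [y - (slope * x + intercept) for x, y in zip(X, Y)]
sum_sq = sum(e ** 2 for e in residuals)

n = len(Y)
residual_variance = sum_sq / (n - 2)        # sum of squares divided by N - 2
standard_error = sqrt(residual_variance)    # square root of residual variance

print(f"sum of squared differences = {sum_sq:.2f}")
print(f"residual variance = {residual_variance:.2f}")
print(f"standard error of estimate = {standard_error:.3f}")
```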

Let's work through an example. You scored 74 and your friend scored 63 on the midterm, where the mean is 70 and the SD is 4. Suppose for the final that the mean is 75, the SD is 4, and the correlation is .60. How would you predict your and your friend's final exam scores?

For you, Y' = 75 + .6(4/4)(74 - 70) = 75 + (.6)(4) = 77.4

Your friend's Y' = 75 + .6(4/4)(63 - 70) = 75 + (.6)(-7) = 70.8
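The two predictions can be reproduced with a small function. This sketch assumes the prediction rule Y' = My + r(sy/sx)(X - Mx), with midterm mean 70 and SD 4, final mean 75 and SD 4, and r = .60; the function name `predict_score` is illustrative.

```python
def predict_score(x, mean_x, sd_x, mean_y, sd_y, r):
    """Predicted Y' = My + r(sy/sx)(X - Mx)."""
    return mean_y + r * (sd_y / sd_x) * (x - mean_x)


yours = predict_score(74, mean_x=70, sd_x=4, mean_y=75, sd_y=4, r=0.60)
friend = predict_score(63, mean_x=70, sd_x=4, mean_y=75, sd_y=4, r=0.60)

print(f"your predicted final score: {yours:.1f}")
print(f"your friend's predicted final score: {friend:.1f}")
```

Both predictions lie closer to the final mean of 75 than the raw midterm deviations would suggest, because the deviation is multiplied by r = .60.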

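The variation in Y can be split into three sums of squares:

1. The total sum of squares in Y (SStot) = Σ(Y - My)², the quantity used in finding the variance and SD of the distribution.

2. The sum of squares of explained variance in Y (SSexp) = Σ(Y' - My)² (regression sum of squares). The greater the correlation, the greater the predicted deviation from the mean.

3. The sum of squares of unexplained variance in Y (SSerr) = Σ(Y - Y')²

Σ(Y - My)² = Σ(Y' - My)² + Σ(Y - Y')²

Total variation = Explained variation + Unexplained variation

r² = SSexp / SStot = Explained variation / Total variation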
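The split into explained and unexplained variation can be verified on the small example. The sketch assumes X = 2, 3, 4, 5 and Y = 5, 6, 9, 10 with the line Y' = 1.8X + 1.2; the variable names are illustrative.

```python
X = [2, 3, 4, 5]
Y = [5, 6, 9, 10]
slope, intercept = 1.8, 1.2

mean_y = sum(Y) / len(Y)
predicted = [slope * x + intercept for x in X]

# Total variation: deviations of Y from its mean
ss_total = sum((y - mean_y) ** 2 for y in Y)
# Explained variation: deviations of the predictions Y' from the mean of Y
ss_explained = sum((p - mean_y) ** 2 for p in predicted)
# Unexplained variation: deviations of Y from the predictions Y'
ss_error = sum((y - p) ** 2 for y, p in zip(Y, predicted))

r_squared = ss_explained / ss_total

print(f"total = {ss_total:.2f}")
print(f"explained = {ss_explained:.2f}, unexplained = {ss_error:.2f}")
print(f"r squared = {r_squared:.4f}")
```

The explained and unexplained parts add up to the total, and their ratio to the total gives the proportion of variation in Y accounted for by X.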

When two variables are related, it is possible to predict the values of one variable from another. The relationship between correlation and prediction can lead to serious errors in reasoning. Correlation is a necessary but not a sufficient condition for making causal inferences with reasonable confidence.

In a widely studied example, numerous epidemiological studies

showed that women who were taking combined

hormone replacement therapy (HRT) also had a lower-than-average

incidence of coronary heart disease (CHD), leading doctors to propose

that HRT was protective against CHD. But randomized controlled trials

showed that HRT caused a small but statistically significant increase in

risk of CHD. Re-analysis of the data from the epidemiological studies

showed that women undertaking HRT were more likely to be from

higher socio-economic groups (ABC1), with better-than-average diet

and exercise regimens. The use of HRT and decreased incidence of

coronary heart disease were coincident effects of a common cause

(i.e. the benefits associated with a higher socioeconomic status),

rather than cause and effect, as had been supposed.

A strong relationship frequently carries the implication that one variable has caused the other, especially when one precedes the other in time. However, the variables may not be connected in any direct way, but may vary together through a common link to other variables.

Example: You may be tempted to conclude that the number of hours spent studying causes the grade to vary because of the positive correlation between the two. This is not necessarily the case, because there are other factors such as intelligence, motivation, better study habits, or supportive companions, and longer hours of study may be a by-product of these factors.
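A small simulation can make the common-cause idea concrete. Everything in this sketch is hypothetical: the data are randomly generated, "motivation" stands in for any shared factor, and by construction study hours have no direct effect on grades.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's correlation from sums of squares and cross-products."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 2000

# Hypothetical common cause, in standardized units
motivation = [rng.gauss(0, 1) for _ in range(n)]
# Hours and grade each depend on motivation plus independent noise
hours = [m + rng.gauss(0, 0.5) for m in motivation]
grade = [m + rng.gauss(0, 0.5) for m in motivation]

r_raw = pearson_r(hours, grade)

# Remove the common cause (its coefficient is 1 by construction)
hours_left = [h - m for h, m in zip(hours, motivation)]
grade_left = [g - m for g, m in zip(grade, motivation)]
r_after = pearson_r(hours_left, grade_left)

print(f"correlation between hours and grade: {r_raw:.2f}")
print(f"correlation after removing motivation: {r_after:.2f}")
```

Hours and grade correlate strongly even though neither influences the other in the simulation, and the correlation largely disappears once the shared factor is taken out.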

Correlation is a necessary but not sufficient condition to establish a causal relationship between two variables.

