
Publication page: https://www.researchgate.net/publication/329611627

Simple regression model

Conference Paper · May 2014


Authors: Mercedes Orús-Lacort (Independent researcher) and Christophe Jouis (Université de la Sorbonne Nouvelle Paris 3 & EHESS & CAMS-CNRS)


All content following this page was uploaded by Mercedes Orús-Lacort on 13 December 2018.



Simple linear regression

1. What is the purpose of simple linear regression?

Sometimes we have two quantitative variables that may be related, and the question we
want to answer is: can we predict the value of one of them from the known values of the
other?

To study this, we follow these steps:

- Draw a graph that shows the data of both variables. This graph is called a "scatter
plot".

- Calculate the Pearson correlation coefficient.

- Calculate a formula that allows us to predict the value of one of the variables
from the other. This formula is called the "regression line".

- Check whether the regression line can be considered valid. To do so, we carry out a
hypothesis test and calculate a coefficient called the "coefficient of goodness of
fit" (also called R-squared, or the coefficient of determination).

Let us first see what scatter plots are.

Suppose we want to predict the Benefits of a company from its spending on Advertising.

We will call Y the variable Benefits (the one we want to predict) and X the variable
Advertising. The variable Y is called the dependent variable and the variable X is
called the independent variable.

The values of the two variables are plotted as points in this diagram, and we may find
situations like the ones described below:

First situation:

In this case you may observe that:

- The points are close together: this means that there is a strong relationship between
the two variables.
- The points are right-oriented (they rise from left to right): this means that the two
variables are directly related, i.e. when spending on Advertising increases, the
Benefits also increase.

Second situation:

In this case you may observe that:

- The points are not very close together: this means that the relationship between the
two variables is not strong, so if we calculate the regression line, it will not fit
the data very well.
- The points are right-oriented (they rise from left to right): this means that the two
variables are directly related, i.e. when spending on Advertising increases, the
Benefits also increase.

Third situation:

In this case you may observe that:

- The points are very dispersed: this means that there is no apparent linear relationship
between the two variables, and it would not make sense to calculate a regression model.

Fourth situation:

In this case you may observe that:

- The points are close together: this means that there is a strong relationship between
the two variables.
- The points are left-oriented (they fall from left to right): this means that the two
variables are inversely related, i.e. when spending on Advertising increases, the
Benefits decrease.

2. Calculation of the Pearson correlation coefficient

If we have data from two random variables that we think may be related, the way to
confirm whether that relationship exists is to calculate the Pearson correlation
coefficient rxy. The value of this coefficient is always between -1 and 1.

To calculate it, we use the following formula:

$$
\begin{aligned}
r_{xy} = \frac{S_{XY}}{S_X S_Y}
&= \frac{\frac{1}{n-1}\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\frac{1}{n-1}\sum (x_i - \bar{x})^2}\,\sqrt{\frac{1}{n-1}\sum (y_i - \bar{y})^2}} \\
&= \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2}\,\sqrt{\sum (y_i - \bar{y})^2}} \\
&= \frac{\sum x_i y_i - \bar{y}\sum x_i - \bar{x}\sum y_i + n\bar{x}\bar{y}}{\sqrt{\sum x_i^2 - 2\bar{x}\sum x_i + n\bar{x}^2}\,\sqrt{\sum y_i^2 - 2\bar{y}\sum y_i + n\bar{y}^2}}
\end{aligned}
$$

If rxy is close to 1, X and Y are directly (positively) correlated.

If rxy is close to -1, X and Y are inversely (negatively) correlated.

If rxy is close to 0, X and Y are not linearly correlated.
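
As a rough illustration of the formula for rxy, the following Python sketch computes the coefficient from the deviations about the means. The function name `pearson_r` and the Advertising and Benefits figures are hypothetical, chosen only to keep the arithmetic easy to follow; they are not data from this paper.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: sum of cross-deviations over the root of the
    product of the two sums of squared deviations (the 1/(n-1) factors cancel)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx_dev = sum((x - mean_x) ** 2 for x in xs)
    sum_yy_dev = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy_dev / math.sqrt(sum_xx_dev * sum_yy_dev)

# Hypothetical data (assumption, for illustration only).
advertising = [1, 2, 3, 4, 5]
benefits = [2, 4, 5, 4, 5]

r = pearson_r(advertising, benefits)
print(round(r, 4))  # 0.7746
```

With these made-up values the coefficient is positive but not very close to 1, which would correspond to a right-oriented cloud with moderate scatter.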

Important: the sign (positive or negative) of this coefficient depends on the
orientation of the scatter plot. If the cloud of points is right-oriented (rising), the
sign of the coefficient is positive; if it is left-oriented (falling), the sign is
negative; and if the points are dispersed, the coefficient will have a value close to 0.
That is to say:

First situation:

In this case, rxy will have a positive sign and its value will be close to 1, e.g. rxy = 0.976.

Second situation:

In this case, rxy will have a positive sign, but its value will not be as close to 1, e.g.
rxy = 0.676.

Third situation:

In this case, rxy may have a positive or negative sign and its absolute value will be
closer to 0 than to 1, e.g. rxy = 0.215 or rxy = -0.215.

Fourth situation:

In this case, rxy will have a negative sign and its value will be close to -1, e.g.
rxy = -0.915.

3. Calculation of the simple linear regression model

It makes sense to compute it when the correlation coefficient is close to 1 or -1.

Using the regression line, we can predict the value of one of the variables from the
other. The variable whose value we are going to predict (say Y) is called the dependent
variable, and the other variable (say X) is called the independent variable.

We intend, therefore, to find a formula of the type Y = a + b·X that allows us to
predict the value of Y from the value of X and that fits the cloud of points in the
scatter plot as closely as possible.

For example, for the four situations we have seen above, we could have the following
regression lines:

[Figures: regression line drawn over the scatter plot of the first, second, third, and
fourth situations]

Calculation of the values of "a" and "b"

"b" is called the slope of the line, and it is calculated with the following formula:

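$$
\begin{aligned}
b = \frac{S_{XY}}{S_X^2}
&= \frac{\frac{1}{n-1}\sum (x_i - \bar{x})(y_i - \bar{y})}{\frac{1}{n-1}\sum (x_i - \bar{x})^2}
= \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} \\
&= \frac{\sum x_i y_i - \bar{y}\sum x_i - \bar{x}\sum y_i + n\bar{x}\bar{y}}{\sum x_i^2 - 2\bar{x}\sum x_i + n\bar{x}^2}
\end{aligned}
$$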

And if we know the rxy value, we can calculate it as follows:

$$
b = r_{xy}\,\frac{S_Y}{S_X}
$$
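
The equivalence between this expression and the definition of the slope as the covariance over the variance of X can be checked in one line, using only the quantities S_XY, S_X and S_Y already defined (a short verification, not part of the original derivation):

$$
b = \frac{S_{XY}}{S_X^2}
  = \frac{S_{XY}}{S_X S_Y}\cdot\frac{S_Y}{S_X}
  = r_{xy}\,\frac{S_Y}{S_X}
$$

Since S_X and S_Y are positive, the slope b always has the same sign as rxy.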

Once "b" has been calculated, "a", called the y-intercept, is calculated as follows:

$$
a = \bar{y} - b\,\bar{x}
$$
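
The two formulas for "b" and "a" can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function name `fit_line` and the data are hypothetical, and the value X = 6 used for the prediction is an arbitrary choice.

```python
def fit_line(xs, ys):
    """Return (a, b) for the regression line Y = a + b*X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: sum of cross-deviations over sum of squared deviations of X.
    sum_xy_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx_dev = sum((x - mean_x) ** 2 for x in xs)
    b = sum_xy_dev / sum_xx_dev
    # Intercept: the line passes through the point of means.
    a = mean_y - b * mean_x
    return a, b

# Hypothetical data (assumption, for illustration only).
advertising = [1, 2, 3, 4, 5]
benefits = [2, 4, 5, 4, 5]

a, b = fit_line(advertising, benefits)
predicted = a + b * 6  # predicted Benefits for Advertising = 6
print(round(a, 2), round(b, 2), round(predicted, 2))  # 2.2 0.6 5.8
```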

4. Hypothesis test for the slope

To know whether the regression model can be considered valid, we must carry out the
following hypothesis test:

H0: β = 0
Ha: β ≠ 0

where β represents the slope of the regression line.

To carry out this test, we calculate the test statistic, which follows a Student's t
distribution with n - 2 degrees of freedom, using the following formula:

$$
t = \frac{b}{S_b}
  = \frac{b}{\sqrt{\dfrac{\frac{1}{n-2}\sum_{i=1}^{n} (y_i - a - b x_i)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}}
$$

where:

- b is the slope of the regression line.

- Sb is the standard error of the slope.

Note that if we are given only the totals of the sums, and not the individual values of
the variables X and Y, we can calculate the standard error as shown below:

$$
\begin{aligned}
S_b &= \sqrt{\frac{\frac{1}{n-2}\sum_{i=1}^{n} (y_i - a - b x_i)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}} \\
&= \sqrt{\frac{\frac{1}{n-2}\left(\sum_{i=1}^{n} y_i^2 + n a^2 + b^2 \sum_{i=1}^{n} x_i^2 - 2a \sum_{i=1}^{n} y_i - 2b \sum_{i=1}^{n} x_i y_i + 2ab \sum_{i=1}^{n} x_i\right)}{\sum x_i^2 - 2\bar{x}\sum x_i + n\bar{x}^2}}
\end{aligned}
$$
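
A small Python sketch can help check the expanded version of the standard error, the one that needs only the totals of the sums. The function name `slope_se_from_sums` and all the totals are hypothetical: they correspond to five made-up points (1, 2), (2, 4), (3, 5), (4, 4), (5, 5), for which the formulas of section 3 give a = 2.2 and b = 0.6.

```python
import math

def slope_se_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy, a, b):
    """Standard error of the slope computed only from totals."""
    mean_x = sum_x / n
    # Expanded sum of squared residuals: sum of (y - a - b*x)^2.
    sse = (sum_y2 + n * a ** 2 + b ** 2 * sum_x2
           - 2 * a * sum_y - 2 * b * sum_xy + 2 * a * b * sum_x)
    # Expanded sum of squared deviations of X about its mean.
    sxx = sum_x2 - 2 * mean_x * sum_x + n * mean_x ** 2
    return math.sqrt((sse / (n - 2)) / sxx)

# Hypothetical totals (assumption, for illustration only).
s_b = slope_se_from_sums(n=5, sum_x=15, sum_y=20, sum_x2=55,
                         sum_y2=86, sum_xy=66, a=2.2, b=0.6)
print(round(s_b, 4))  # 0.2828
```

Because the expanded form subtracts large totals from each other, it can lose some precision in floating point; when the individual values are available, the deviation form is usually the safer choice.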

Then we make a decision in one of the following ways:

- Through the acceptance and rejection regions of the null hypothesis:

We look up in the table the critical values tn-2, α/2 and -tn-2, α/2, where α is the
significance level. If the test statistic t falls between them, Ho is accepted;
otherwise, Ho is rejected.

- Calculating the P Value:

P Value = 2·P(tn-2 > |t test|)

Therefore:

P Value > α: accept Ho

P Value < α: reject Ho and accept the alternative

- Calculating the confidence interval for the slope of the regression line:

b ± tn-2, α/2 · Standard error of the slope

So, if 0 falls within the interval, the null hypothesis is accepted.
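
The decision rules based on the critical value and on the confidence interval can be sketched as follows. This is a minimal illustration, not a full implementation: the function name `slope_t_test` and the data are hypothetical, and the critical value is passed in by hand, as if read from a Student's t table (3.182 is the two-sided table value for 3 degrees of freedom and α = 0.05). The P Value is left out because it needs the t distribution function, which the Python standard library does not provide.

```python
import math

def slope_t_test(xs, ys, t_crit):
    """Return (t statistic, confidence interval for the slope, reject H0?)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    a = mean_y - b * mean_x
    # Residual variance with n - 2 degrees of freedom.
    sse = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))
    s_b = math.sqrt((sse / (n - 2)) / sxx)
    t_stat = b / s_b
    interval = (b - t_crit * s_b, b + t_crit * s_b)
    reject = abs(t_stat) > t_crit
    return t_stat, interval, reject

# Hypothetical data (assumption); n = 5, so n - 2 = 3 degrees of freedom.
advertising = [1, 2, 3, 4, 5]
benefits = [2, 4, 5, 4, 5]

t_stat, (low, high), reject = slope_t_test(advertising, benefits, t_crit=3.182)
print(round(t_stat, 3), round(low, 3), round(high, 3), reject)
```

For these made-up values the statistic stays inside the acceptance region and 0 lies inside the interval, so both rules lead to the same decision, as expected.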

5. Calculation of the coefficient R2

Another way to see whether the model fits well is to calculate the coefficient R
squared, also called the coefficient of determination or the coefficient of goodness of
fit. To calculate it, we use the following formula:

$$
R^2 = r_{xy}^2
$$

This coefficient takes values between 0 and 1, so that:

If R2 is close to 0, the model does not fit well.

If R2 is close to 1, the model fits well.
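
As a final illustration, the sketch below computes R2 as the square of rxy and compares it with the share of the variation of Y explained by the line, 1 - SSE/SST. That second expression is not given in this text; it is added here as a cross-check, and for a simple regression line fitted with the formulas of section 3 both values coincide. The data are hypothetical.

```python
import math

# Hypothetical data (assumption, for illustration only).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)  # total variation of Y (SST)

# R2 as the square of the Pearson coefficient.
r = sxy / math.sqrt(sxx * syy)
r2_from_r = r ** 2

# Cross-check: 1 - SSE/SST using the fitted line Y = a + b*X.
b = sxy / sxx
a = mean_y - b * mean_x
sse = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))
r2_from_residuals = 1 - sse / syy

print(round(r2_from_r, 4), round(r2_from_residuals, 4))  # 0.6 0.6
```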
