
Simple Linear Regression

Regression is a powerful statistical tool we can use to understand the relationship between
variables or to predict the value of one variable based on another variable.
We will be discussing two types of regression: simple linear and multiple linear.
Simple linear regression involves only two variables.
Multiple linear regression involves three or more variables.
What are these variables? We have two types – independent and dependent variables.
In an experiment, we set up a methodology in such a way that we have certain conditions or
variables we try to vary and control. Say in drying cacao pods, we can vary temperature and
humidity. These are things we can control or adjust. Maybe we are interested in trying out a
certain brand of fertilizer. Or maybe a new formulation for concrete. These types of variables
are called Independent Variables (IV).
On the other hand, by changing certain conditions or variables, the value we are trying to
measure will also change as a response. In our previous modules, we had examples of enrolling
in weight loss programs, comparing hourly wages of computer analysts and registered nurses,
etc. These things affected a certain value we measured. For weight loss programs, our weight
will change. For the hourly wages, it depends on the type of job you are in. Weight and hourly
wage here are dependent on another variable. Hence, they are called Dependent Variables (DV).
In simple terms, the independent variables are those we control and change while dependent
variables measure the resulting outcome due to our adjustments. Think of opening and closing
a faucet. The act of turning the knob is analogous to independent variables. On the other hand,
the rate at which the water flows out of the faucet is the dependent variable – it depends on
how closed or open the knob is.
To visually represent the relationship between independent and dependent variables, we
usually use a Scatter Plot. It is simply a Cartesian plane representing an x-axis and a y-axis.
Usually, the independent variable is plotted along the x-axis while the dependent variable is
along the y-axis.

Example
We have a construction company called Triple A Construction. They offer services in renovating
old homes. Their managers have observed that gross sales depend on the average
salary of the area they are working in. They presented the following data.
Gross Sales   Gross Sales      Average Local   Average Local Salary
($)           (in $100,000s)   Salary ($)      (in $10,000s)
600,000       6                30,000          3
800,000       8                40,000          4
900,000       9                60,000          6
500,000       5                40,000          4
450,000       4.5              20,000          2
950,000       9.5              50,000          5

Here, we can see that the gross sales depend on the average local salary. Thus, the gross sales is
the dependent variable while average local salary is the independent variable.
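
The rescaling used in the table can be sketched in a few lines of Python. This is only an illustration; the variable names (`gross_sales_dollars`, `local_salary_dollars`) are made up here and are not part of the company's data sheet.

```python
# Raw figures from the Triple A Construction table, in dollars.
gross_sales_dollars = [600_000, 800_000, 900_000, 500_000, 450_000, 950_000]
local_salary_dollars = [30_000, 40_000, 60_000, 40_000, 20_000, 50_000]

# Rescale so the numbers are small and easy to work with:
# Y is in units of $100,000 and X is in units of $10,000.
y = [sales / 100_000 for sales in gross_sales_dollars]    # dependent variable
x = [salary / 10_000 for salary in local_salary_dollars]  # independent variable

for xi, yi in zip(x, y):
    print(f"X = {xi:g}, Y = {yi:g}")
```

Working in these units keeps the later hand calculations short. Any prediction has to be converted back to dollars at the end.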

If we want to make a scatter diagram for this data set, we will set average local salary along the
x-axis and gross sales along the y-axis. You can easily perform this using Excel or do it manually.

[Figure: Scatter Plot of gross sales in $100,000s (y-axis) against average local salary in $10,000s (x-axis)]

Perhaps you have heard of the term “Line of Best Fit” in your high school mathematics
subjects. It is the line that passes as close as possible to the given data points, that is,
the line with the least total error. This line is the Linear Regression model
for the data set.

What are regression models? They are used to test if there is a relationship between variables.
A regression model is a mathematical model that tries to quantify the relationship, including a
random error that cannot be predicted. It can be presented in the form below.

𝑌 = 𝛽𝑜 + 𝛽1 𝑋 + 𝜀

where,

Y is the dependent variable

X is the independent variable

β0 is the intercept, which is the value of Y when X = 0

β1 is the slope of the regression line

ε is the random error

This model represents the true relationship between the variables, much like the population
parameters in previous topics. Of course, we do not know the exact relationship between the
variables. So, we have to take samples to estimate these values. By doing so, we are left with
the following model.
𝑌̂ = 𝑏𝑜 + 𝑏1 𝑋

where,

Ŷ is the predicted value of Y

b0 is the estimate of β0 based on sample results

b1 is the estimate of β1 based on sample results

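You might be wondering why the random error, ε, is missing. The reason is that Ŷ is the
predicted (average) value of Y for a given X, and the random error is assumed to average out to
zero. The error in each individual observation does not vanish. It shows up as the difference
between the actual value and the predicted value.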

Going back to our example on Triple A Construction Company, we can imagine the company
wanting to predict their gross sales based on average local salary. To do this, we let

Y = gross sales (in $100,000s)

X = average local salary (in $10,000s)

Again, the regression model is the line that minimizes the total squared error between actual
and predicted values. The error for each observation is calculated as

Error = Actual value − Predicted value

𝑒 = 𝑌 − 𝑌̂
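
To make the idea of error concrete, the sketch below totals the squared errors for two candidate lines on the Triple A data. Both candidate lines are hypothetical guesses chosen only for illustration, and the helper name `sum_squared_errors` is an assumption of this sketch. Squaring before adding is the usual least-squares convention, so that positive and negative errors do not cancel each other.

```python
x = [3, 4, 6, 4, 2, 5]        # average local salary in $10,000s
y = [6, 8, 9, 5, 4.5, 9.5]    # gross sales in $100,000s

def sum_squared_errors(b0, b1):
    """Total of e^2, where e = actual Y - predicted Y for the line b0 + b1*X."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Two hypothetical candidate lines (guesses, not fitted lines).
sse_a = sum_squared_errors(3, 1)      # Y-hat = 3 + 1X
sse_b = sum_squared_errors(0, 1.75)   # Y-hat = 0 + 1.75X

print(sse_a, sse_b)  # the line with the smaller total fits the data better
```

The regression line is the one line whose total squared error is smaller than that of every other candidate.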

Thus, our regression model based on samples taken will be

𝑌̂ = 𝑏𝑜 + 𝑏1 𝑋

Now, how do we calculate b0 and b1? We have the following equations.


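$$\bar{X} = \frac{\sum X}{n} = \text{average or mean of the X values}$$

$$\bar{Y} = \frac{\sum Y}{n} = \text{average or mean of the Y values}$$

Then,

$$b_1 = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2}$$

$$b_0 = \bar{Y} - b_1 \bar{X}$$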

Applying these equations to our example above, we get the following.

Y     X    (X − X̄)²        (X − X̄)       (Y − Ȳ)           (X − X̄)(Y − Ȳ)
6     3    (3 − 4)² = 1    3 − 4 = −1    6 − 7 = −1        (−1)(−1) = 1
8     4    (4 − 4)² = 0    4 − 4 = 0     8 − 7 = 1         (0)(1) = 0
9     6    (6 − 4)² = 4    6 − 4 = 2     9 − 7 = 2         (2)(2) = 4
5     4    (4 − 4)² = 0    4 − 4 = 0     5 − 7 = −2        (0)(−2) = 0
4.5   2    (2 − 4)² = 4    2 − 4 = −2    4.5 − 7 = −2.5    (−2)(−2.5) = 5
9.5   5    (5 − 4)² = 1    5 − 4 = 1     9.5 − 7 = 2.5     (1)(2.5) = 2.5

$$\bar{Y} = \frac{6 + 8 + 9 + 5 + 4.5 + 9.5}{6} = 7$$

$$\bar{X} = \frac{3 + 4 + 6 + 4 + 2 + 5}{6} = 4$$

$$\sum (X - \bar{X})^2 = 1 + 0 + 4 + 0 + 4 + 1 = 10$$

$$\sum (X - \bar{X})(Y - \bar{Y}) = 1 + 0 + 4 + 0 + 5 + 2.5 = 12.5$$

Calculating the Regression Coefficients

$$b_1 = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2} = \frac{12.5}{10} = 1.25$$

$$b_0 = \bar{Y} - b_1 \bar{X} = 7 - (1.25)(4) = 2$$

Therefore, our regression model is

𝑌̂ = 2 + 1.25𝑋
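
The hand calculation above can be double-checked with a short Python sketch of the same two formulas. The function name `fit_simple_linear_regression` is made up for this illustration.

```python
def fit_simple_linear_regression(x, y):
    """Return (b0, b1) for the least-squares line Y-hat = b0 + b1*X."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator and denominator of the slope formula.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1

x = [3, 4, 6, 4, 2, 5]        # average local salary in $10,000s
y = [6, 8, 9, 5, 4.5, 9.5]    # gross sales in $100,000s

b0, b1 = fit_simple_linear_regression(x, y)
print(f"Y-hat = {b0:g} + {b1:g}X")  # Y-hat = 2 + 1.25X
```
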
The next logical question is: how good is our regression model at explaining the
relationship? This can be quantified using the coefficient of determination and the coefficient
of correlation.

Calculating these values manually takes a long time. Thankfully, we have technology on our
side. Hence, we will use our good old friend, Mr. Excel. But before we get
there, what are the coefficient of determination and the coefficient of correlation?

The coefficient of determination is the proportion of the variability in Y explained by the
regression equation. This is given the symbol r². How do we understand this in simpler
terms? Recall our scatter plot.

Our regression model is a linear equation that tries to estimate and predict values. Imagine
taking an exam for which I gave you a paper containing possible solutions. However, the exam
you take might be different from the cheat sheet that I gave you. Hence, you will not
necessarily get 100% correct answers. The percentage of answers you get correct based on the
cheat sheet is analogous to the coefficient of determination. Hence, you can think of the
coefficient of determination as how good the regression model is at explaining your actual
results.
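
As a rough numerical illustration of "proportion of variability explained", the sketch below compares the variability left over after using the fitted line Ŷ = 2 + 1.25X with the total variability in Y. The names `sst` and `sse` follow common textbook usage and are assumptions here, not notation from this module.

```python
x = [3, 4, 6, 4, 2, 5]        # average local salary in $10,000s
y = [6, 8, 9, 5, 4.5, 9.5]    # gross sales in $100,000s
b0, b1 = 2, 1.25              # fitted model from the text

y_bar = sum(y) / len(y)
y_hat = [b0 + b1 * xi for xi in x]

# Total variability of Y around its mean.
sst = sum((yi - y_bar) ** 2 for yi in y)
# Variability the line fails to explain (squared errors).
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))

r_squared = 1 - sse / sst
print(round(r_squared, 3))  # 0.694
```

In words, about 69% of the variation in gross sales is explained by average local salary. The rest is left in the errors.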

The coefficient of correlation is an expression of the strength of the linear relationship. It
will always be between −1 and +1. It is given the symbol r, and its magnitude is the square root
of the coefficient of determination. Its sign is the same as the sign of the slope b1.

coefficient of correlation = ±√(coefficient of determination)

The coefficient of correlation shows us the relative strength of relationship between our
independent and dependent variable.
We can see here that the closer the coefficient of correlation is to +1, the more positive the
correlation is. In other words, as the independent variable increases, the dependent variable
also increases. This is like saying there is a “directly proportional” relationship.

When the coefficient of correlation is 0, there is no linear correlation between the two
variables. In other words, a straight line tells us nothing about how the dependent variable
changes as the independent variable changes.

On the flip side, the closer the coefficient of correlation is to -1, the more negative the
correlation is. In other words, as the independent variable increases, the dependent variable
decreases. This is like saying there is an “inversely proportional” relationship.

Also, the closer the coefficient of correlation is to ±1, the better our regression model fits
the data. Meaning, the errors our model gives become smaller and smaller.
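
The sign and size of the coefficient of correlation can be computed directly from the data. The sketch below uses the standard Pearson formula, which is one common way to obtain r and is assumed here rather than taken from this module.

```python
import math

x = [3, 4, 6, 4, 2, 5]        # average local salary in $10,000s
y = [6, 8, 9, 5, 4.5, 9.5]    # gross sales in $100,000s

x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)

s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

# Pearson coefficient of correlation; always between -1 and +1.
r = s_xy / math.sqrt(s_xx * s_yy)

direction = "positive" if r > 0 else "negative" if r < 0 else "none"
print(f"r = {r:.3f} ({direction}), r squared = {r * r:.3f}")
```

For the Triple A data, r is positive, which agrees with the positive slope of the regression line.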

The next question is “how can we be sure that our model is valid and correct?” We have to test
our model for significance. Yes, this is similar to applying a test of hypothesis on our regression
model. How? If we start with our regression model, we have the following.

𝑌 = 𝛽𝑜 + 𝛽1 𝑋 + 𝜀

The relationship between X and Y in this equation is solely dependent on 𝛽1. If 𝛽1 = 0, then we
see that

𝑌 = 𝛽𝑜 + 𝜀

This implies no relationship between X and Y as the variable X is no longer present in the
equation above. With this in mind, we can develop a test of hypothesis on the value of 𝛽1.
Without going into further details, we can see that the null and alternative hypotheses for
this should be

𝐻𝑜 : 𝛽1 = 0

𝐻𝑎 : 𝛽1 ≠ 0

If the null hypothesis is not rejected, we do not have enough evidence to conclude that there
is a linear relationship between X and Y. Meaning, our model is not significant and should not
be used.
If the null hypothesis is rejected, we can conclude that there is a linear relationship between X
and Y as given by the value of β1. Meaning, our model is good enough to be used.
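
The text skips the details of the test, so the following is only a sketch of one standard way to carry it out: compute the t statistic for the slope, t = b1 / SE(b1), and compare it with a critical value. The critical value 2.776 (two-tailed, α = 0.05, n − 2 = 4 degrees of freedom) is taken from a standard t table and hard-coded as an assumption. The variable names are illustrative.

```python
import math

x = [3, 4, 6, 4, 2, 5]        # average local salary in $10,000s
y = [6, 8, 9, 5, 4.5, 9.5]    # gross sales in $100,000s
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Mean squared error of the fitted line, with n - 2 degrees of freedom.
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
mse = sse / (n - 2)

# Standard error of the slope and the test statistic for H0: beta1 = 0.
se_b1 = math.sqrt(mse / s_xx)
t_stat = b1 / se_b1

T_CRITICAL = 2.776  # assumed: t table value for alpha = 0.05, df = 4
reject_h0 = abs(t_stat) > T_CRITICAL
print(round(t_stat, 3), reject_h0)
```

A t statistic beyond the critical value means the slope is too far from 0 to be explained by chance alone at the chosen level of significance.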

Take note, the value of the coefficient of correlation is NOT a measure of how significant our
model is. Do not make this mistake. It is possible to have a significant model with a low
correlation value OR a model that has a high correlation value but is NOT significant.

We can use Excel to determine our regression model as well as check for its significance. This is
what we will be doing here.

Example

Going back to our example above, we can easily calculate the regression model using Excel.

Gross Sales   Gross Sales      Average Local   Average Local Salary
($)           (in $100,000s)   Salary ($)      (in $10,000s)
600,000       6                30,000          3
800,000       8                40,000          4
900,000       9                60,000          6
500,000       5                40,000          4
450,000       4.5              20,000          2
950,000       9.5              50,000          5

However, there is a slight preparation that we must do to use Excel. The tool we need is
available in all versions of Excel. If you do not have a licensed copy of MS Office, you can use
your AdDU email to get a licensed copy from Microsoft. Google it.

When you open your Excel application, you will see the following.

Click on “File” to access the next window.

We can then see the following window on the right.

Click on “Options” shown below in the red rectangle.

This will open a new window shown below.


Next, click on “Add-ins”

Next, click on “Go”. This will open a new pop-up window shown below.
Activate both “Analysis ToolPak” and “Analysis ToolPak – VBA” then click OK.

This will activate the Data Analysis capability of Excel under the “Data” tab.

When you click “Data Analysis”, another pop-up window will show up.

Find and select “Regression” then click OK.

This will open another pop-up window.


Select the Y values for “Input Y Range” and the X values for “Input X Range”. Check the box
“Labels” if you included the column labels when selecting the input ranges for X and Y.
Afterwards, click OK.

This will open a new sheet showing the following.

Under the table “Regression Statistics”, we can find Multiple R and R Square.

Coefficient of Correlation = Multiple R = 0.833

Coefficient of Determination = R Square = 0.694

Easy, right? You will be allowed to determine these values using Excel.
Now, how do we check for the significance of the regression model? Look at the second table
labeled “ANOVA” and look at the columns F and Significance F. We should only focus on the
Significance F value. Then, we will compare this to the level of significance, α, which is usually
set at 0.05.
Reject the null hypothesis when Significance F < α
Do not reject the null hypothesis when Significance F > α

In this case, we can see that

Significance F < α

0.039 < 0.05

Thus, we can reject the null hypothesis. Hence, we accept the alternative hypothesis:

𝐻𝑎 : 𝛽1 ≠ 0

Meaning, there is a significant linear relationship between X and Y as given by 𝛽1.
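
The Significance F value can be reproduced approximately in Python. The sketch below assumes the standard result that, for simple linear regression, F = r²(n − 2)/(1 − r²) and that Significance F equals the two-tailed p-value of t = √F with n − 2 degrees of freedom. The helper names `t_pdf` and `two_tailed_p_value` are made up for this illustration, and the area under the t curve is approximated with Simpson's rule rather than an exact formula.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def two_tailed_p_value(t_stat, df, steps=2000):
    """P(|T| > |t_stat|): 1 minus twice the area from 0 to |t_stat| (Simpson's rule)."""
    upper = abs(t_stat)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * (total * h / 3)

x = [3, 4, 6, 4, 2, 5]        # average local salary in $10,000s
y = [6, 8, 9, 5, 4.5, 9.5]    # gross sales in $100,000s
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

r_squared = s_xy ** 2 / (s_xx * s_yy)
f_stat = r_squared / (1 - r_squared) * (n - 2)
p_value = two_tailed_p_value(math.sqrt(f_stat), n - 2)

print(round(f_stat, 2), round(p_value, 3))  # 9.09 0.039
print("Reject H0" if p_value < 0.05 else "Do not reject H0")
```
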

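The regression model can be found in the green rectangle, under the column “Coefficients”.
The Intercept row gives b0 and the row for the X variable gives b1.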

𝑌̂ = 2 + 1.25𝑋

This is the way to determine the regression model using Excel.

Predicting New Values using Simple Linear Regression Model

Using the linear regression model, we can predict future values of the dependent variable.

In the previous example, the regression model is

𝑌̂ = 2 + 1.25𝑋
So, if they will be working in a new city whose average local salary is $60,000, then X = 6
(since X is in $10,000s), and Triple A Construction Company can expect gross sales of

Ŷ = 2 + 1.25(6)

Ŷ = 9.5, or 9.5 × $100,000 = $950,000

We can predict that the company will receive $950,000 in gross sales.
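
The prediction step, including the unit conversions, can be sketched as a small Python function. The function name `predict_gross_sales` and its default arguments are made up for this illustration. The coefficients are the ones obtained above.

```python
def predict_gross_sales(avg_local_salary_dollars, b0=2, b1=1.25):
    """Predict gross sales in dollars from average local salary in dollars."""
    x = avg_local_salary_dollars / 10_000   # the model expects X in $10,000s
    y_hat = b0 + b1 * x                     # the model returns Y-hat in $100,000s
    return y_hat * 100_000                  # convert back to dollars

print(predict_gross_sales(60_000))  # 950000.0
```

Predictions like this are most trustworthy for salaries within the range of the sample data ($20,000 to $60,000), since the model was estimated from that range only.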
