Regression is a powerful statistical tool we can use to understand the relationship between
variables or to predict the value of one variable based on another variable.
We will be discussing two types of regression: Simple Linear and Multiple Linear.
Simple linear regression involves only two variables: one independent and one dependent.
Multiple linear regression involves three or more variables: two or more independent and one dependent.
What are these variables? We have two types – independent and dependent variables.
In an experiment, we set up a methodology in such a way that we have certain conditions or
variables we try to vary and control. Say in drying cacao pods, we can vary temperature and
humidity. These are things we can control or adjust. Maybe we are interested in trying out a
certain brand of fertilizer. Or maybe a new formulation for concrete. These types of variables
are called Independent Variables (IV).
On the other hand, by changing certain conditions or variables, the value we are trying to
measure will also change as a response. In our previous modules, we had examples of enrolling
in weight loss programs, comparing hourly wages of computer analysts and registered nurses,
etc. These things affected a certain value we measured. For weight loss programs, our weight
will change. For the hourly wages, it depends on the type of job you are in. Weight and hourly
wage here are dependent on another variable. Hence, they are called Dependent Variables (DV).
In simple terms, the independent variables are those we control and change while dependent
variables measure the resulting outcome due to our adjustments. Think of opening and closing
a faucet. The act of turning the knob is analogous to independent variables. On the other hand,
the rate at which the water flows out of the faucet is the dependent variable – it depends on
how closed or open the knob is.
To visually represent the relationship between independent and dependent variables, we
usually use a Scatter Plot. It is simply a Cartesian plane with an x-axis and a y-axis.
Usually, the independent variable is plotted along the x-axis while the dependent variable is
along the y-axis.
Example
We have a construction company called Triple A Construction. They offer services in renovating
old homes. Their managers have observed that gross sales depend on the average
salary of the area they are working in. They presented the following data.
Gross Sales    Gross Sales in $100,000s    Average Local Salary    Average Local Salary in $10,000s
$600,000       6                           $30,000                 3
$800,000       8                           $40,000                 4
$900,000       9                           $60,000                 6
$500,000       5                           $40,000                 4
$450,000       4.5                         $20,000                 2
$950,000       9.5                         $50,000                 5
Here, we can see that the gross sales depend on the average local salary. Thus, the gross sales is
the dependent variable while average local salary is the independent variable.
If we want to make a scatter diagram for this data set, we will set average local salary along the
x-axis and gross sales along the y-axis. You can easily perform this using Excel or do it manually.
[Figure: Scatter Plot of gross sales in $100,000s (y-axis) against average local salary in $10,000s (x-axis)]
Perhaps in one of your high school mathematics subjects, you have heard of the term
“Line of Best Fit”. It is the line that passes through the given data points with the
least overall error. This is analogous to the Linear Regression model
for the data set.
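What are regression models? These are used to test if there is a relationship between variables.
A regression model is a mathematical model that tries to quantify the relationship, including some
random error that cannot be predicted. It can be presented in the form below.
\[ Y = \beta_0 + \beta_1 X + \varepsilon \]
where,
\(Y\) = dependent variable
\(X\) = independent variable
\(\beta_0\) = Y-intercept of the true (population) line
\(\beta_1\) = slope of the true (population) line
\(\varepsilon\) = random error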
This model is analogous to the true relationship between the variables or the population
parameter in previous topics. Of course, we do not know the exact relationship of the
variables. So, we have to take samples to estimate these values. By doing so, we are left with
the following model
\[ \hat{Y} = b_0 + b_1 X \]
where,
\(\hat{Y}\) = predicted value of Y
\(b_0\) = sample estimate of \(\beta_0\)
\(b_1\) = sample estimate of \(\beta_1\)
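You might be wondering why the random error, \(\varepsilon\), is missing. The random error is
assumed to average out to zero, so it drops out when we predict the average value of Y for a
given X. Individual observations will still deviate from the line, and those deviations are
the errors the model tries to keep small.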
Going back to our example on Triple A Construction Company, we can imagine the company
wanting to predict their gross sales based on average local salary. To do this, we let
Y = gross sales (in $100,000s)
X = average local salary (in $10,000s)
Again, the regression model is the line that minimizes errors between actual and predicted
values. The error is calculated as
\[ e = Y - \hat{Y} \]
where
\[ \hat{Y} = b_0 + b_1 X \]
The line that minimizes the sum of the squared errors has
\[ b_1 = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2} \]
\[ b_0 = \bar{Y} - b_1 \bar{X} \]
Then,
Y     X    (X − X̄)²          X − X̄          Y − Ȳ              (X − X̄)(Y − Ȳ)
6     3    (3 − 4)² = 1      3 − 4 = −1     6 − 7 = −1         (−1)(−1) = 1
8     4    (4 − 4)² = 0      4 − 4 = 0      8 − 7 = 1          (0)(1) = 0
9     6    (6 − 4)² = 4      6 − 4 = 2      9 − 7 = 2          (2)(2) = 4
5     4    (4 − 4)² = 0      4 − 4 = 0      5 − 7 = −2         (0)(−2) = 0
4.5   2    (2 − 4)² = 4      2 − 4 = −2     4.5 − 7 = −2.5     (−2)(−2.5) = 5
9.5   5    (5 − 4)² = 1      5 − 4 = 1      9.5 − 7 = 2.5      (1)(2.5) = 2.5
\[ \bar{Y} = \frac{6 + 8 + 9 + 5 + 4.5 + 9.5}{6} = 7 \]
\[ \bar{X} = \frac{3 + 4 + 6 + 4 + 2 + 5}{6} = 4 \]
\[ \sum (X - \bar{X})^2 = 1 + 0 + 4 + 0 + 4 + 1 = 10 \]
\[ \sum (X - \bar{X})(Y - \bar{Y}) = 1 + 0 + 4 + 0 + 5 + 2.5 = 12.5 \]
\[ b_1 = \frac{12.5}{10} = 1.25 \]
\[ b_0 = \bar{Y} - b_1 \bar{X} = 7 - (1.25)(4) = 2 \]
\[ \hat{Y} = 2 + 1.25X \]
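The manual computation above can be double-checked with a few lines of code. The following is only a minimal sketch in Python (the module itself uses Excel); the function name `fit_simple_linear` and the variable names are made up for illustration. It applies the usual least-squares formulas for the slope and intercept to the Triple A data.

```python
# Data from the Triple A Construction example.
# x: average local salary in $10,000s, y: gross sales in $100,000s
x = [3, 4, 6, 4, 2, 5]
y = [6, 8, 9, 5, 4.5, 9.5]


def fit_simple_linear(x, y):
    """Return (b0, b1) for the least-squares line y_hat = b0 + b1 * x.

    Hypothetical helper written for this illustration only.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of squared deviations of X, and sum of cross-products of deviations
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx            # slope
    b0 = y_bar - b1 * x_bar   # intercept
    return b0, b1


b0, b1 = fit_simple_linear(x, y)
print(f"b0 = {b0}, b1 = {b1}")  # b0 = 2.0, b1 = 1.25
```

The printed values agree with the table: the slope is 12.5 / 10 = 1.25 and the intercept is 7 − (1.25)(4) = 2.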
The next logical question is: how good is our regression model at explaining the
relationship? This can be quantified using the coefficient of determination and coefficient of
correlation.
Calculating these values manually takes a long time. Thankfully, we have
technology on our side. Hence, we will use our good old friend, Mr. Excel. But before we get
there, what are the coefficient of determination and coefficient of correlation?
Our regression model is a linear equation that tries to estimate and predict values. Imagine
taking an exam for which I gave you a paper containing possible solutions. However, the exam
you take might be different from the cheat sheet that I gave you. Hence, you will not necessarily get
100% correct answers. The percentage of answers you get correct based on the cheat sheet
is analogous to the coefficient of determination. Hence, you can think of the
coefficient of determination, \(r^2\), as how good the regression model is at explaining your actual
results: it is the proportion of the variation in the dependent variable that the model explains.
The coefficient of correlation, \(r\), is an expression of the strength of the linear relationship. It will
always be between +1 and -1. It is the square root of the coefficient of
determination, \(r^2\), and takes the same sign as the slope \(b_1\).
The coefficient of correlation shows us the relative strength of relationship between our
independent and dependent variable.
The closer the coefficient of correlation is to +1, the more positive the
correlation is. In other words, as the independent variable increases, the dependent variable
also increases. This is like saying there is a “directly proportional” relationship.
When the coefficient of correlation is 0, there is no linear correlation between the two variables. In
other words, changes in the independent variable come with no linear trend in the dependent variable.
On the flip side, the closer the coefficient of correlation is to -1, the more negative the
correlation is. In other words, as the independent variable increases, the dependent variable
decreases. This is like saying there is an “inversely proportional” relationship.
Also, the closer the coefficient of correlation is to ±1, the better our regression model is.
Meaning, the errors our model provides become smaller and smaller.
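To make the two coefficients concrete, here is a rough sketch in Python that computes them for the Triple A data. It assumes the fitted line Ŷ = 2 + 1.25X from the worked example and uses the standard definition r² = 1 − SSE/SST, which this module does not spell out. The variable names are illustrative only.

```python
import math

# x: average local salary in $10,000s, y: gross sales in $100,000s
x = [3, 4, 6, 4, 2, 5]
y = [6, 8, 9, 5, 4.5, 9.5]
b0, b1 = 2.0, 1.25  # fitted line from the worked example

y_bar = sum(y) / len(y)
y_hat = [b0 + b1 * xi for xi in x]  # predicted values

# Total variation in Y around its mean
sst = sum((yi - y_bar) ** 2 for yi in y)
# Variation left unexplained by the line (sum of squared errors)
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))

r_squared = 1 - sse / sst
# r is the square root of r squared, carrying the sign of the slope
r = math.copysign(math.sqrt(r_squared), b1)

print(round(r_squared, 4), round(r, 4))  # 0.6944 0.8333
```

Under these assumptions, about 69% of the variation in gross sales is explained by average local salary, and the correlation is positive and fairly strong. These should correspond to the R Square and Multiple R values in the Excel output.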
The next question is “how can we be sure that our model is valid and correct?” We have to test
our model for significance. Yes, this is similar to applying a test of hypothesis on our regression
model. How? If we start with our regression model, we have the following.
\[ Y = \beta_0 + \beta_1 X + \varepsilon \]
The relationship between X and Y in this equation is solely dependent on \(\beta_1\). If \(\beta_1 = 0\), then we
see that
\[ Y = \beta_0 + \varepsilon \]
This implies no relationship between X and Y as the variable X is no longer present in the
equation above. With this in mind, we can develop a test of hypothesis on the value of \(\beta_1\).
Without going into further details, we can see that the null and alternative hypotheses for this
should be
\[ H_0: \beta_1 = 0 \]
\[ H_a: \beta_1 \neq 0 \]
If the null hypothesis is not rejected, we do not have enough evidence of a linear relationship
between X and Y. Meaning, our model is not significant and should not be used.
If the null hypothesis is rejected, we can conclude that there is a linear relationship between X
and Y as given by the value of \(\beta_1\). Meaning, our model is good enough to be used.
Take note, the value of the coefficient of correlation is NOT a measure of how significant our
model is. Do not make this mistake. It is possible to have a significant model with a low
correlation value OR a model that has a high correlation value but is NOT significant.
We can use Excel to determine our regression model as well as check for its significance. This is
what we will be doing here.
Example
Going back to our example above, we can easily calculate the regression model using Excel.
However, there is a slight preparation that we must do before using Excel. This is available for all
versions of Excel. If you do not have a licensed copy of MS Office, you can use your AdDU email
to get a licensed copy from Microsoft. Google it.
When you open your Excel application, you will see the following.
Next, click on “Go”. This will open a new pop-up window shown below.
Activate both “Analysis ToolPak” and “Analysis ToolPak – VBA” then click OK.
This will activate the Data Analysis capability of Excel under the “Data” tab.
When you click “Data Analysis”, another pop-up window will show up.
Under the table “Regression Statistics”, we can find Multiple R and R Square.
Easy right? You will be allowed to determine the said values using Excel.
Now, how do we check for the significance of the regression model? Look at the second table
labeled “ANOVA” and look at the columns F and Significance F. We should only focus on the
Significance F value. Then, we will compare this to the level of significance, α, which is usually
set at 0.05.
Reject the null hypothesis when Significance F < α
Do not reject the null hypothesis when Significance F > α
For our example,
Significance F < α
0.039 < 0.05
Thus, we can reject the null hypothesis and accept the alternative hypothesis
\[ H_a: \beta_1 \neq 0 \]
This means our regression model is significant and can be used.
\[ \hat{Y} = 2 + 1.25X \]
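For readers curious where the Significance F value comes from, the sketch below reproduces it in Python for the Triple A data. It relies on two standard facts not covered in this module. First, Significance F is the p-value of the F statistic. Second, with a single independent variable, that p-value equals the two-sided p-value of a t statistic with n − 2 degrees of freedom, where t² = F. The numerical integration used here (Simpson's rule on the t density) is a choice made for this illustration so that no extra libraries are needed. It is not a description of how Excel computes the value, and the helper names are hypothetical.

```python
import math

# x: average local salary in $10,000s, y: gross sales in $100,000s
x = [3, 4, 6, 4, 2, 5]
y = [6, 8, 9, 5, 4.5, 9.5]
n = len(x)
b0, b1 = 2.0, 1.25  # fitted line from the worked example

y_bar = sum(y) / n
# Variation explained by the line and variation left over
ssr = sum((b0 + b1 * xi - y_bar) ** 2 for xi in x)
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
df_resid = n - 2
f_stat = (ssr / 1) / (sse / df_resid)


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """P(|T| > t), using Simpson's rule on the density over [0, |t|]."""
    t = abs(t)
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * (total * h / 3)


p_value = two_sided_p(math.sqrt(f_stat), df_resid)
alpha = 0.05
print(round(f_stat, 2), round(p_value, 3), p_value < alpha)  # 9.09 0.039 True
```

The computed p-value rounds to 0.039, the same Significance F value used in the decision above, so the null hypothesis is rejected at α = 0.05.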
Using the linear regression model, we can predict future values of the dependent variable.
\[ \hat{Y} = 2 + 1.25X \]
So, if they will be working in a new city whose average local salary is $60,000 (X = 6), Triple A
Construction Company can expect gross sales of
\[ \hat{Y} = 2 + 1.25(6) \]
\[ \hat{Y} = 9.5 \]
Since Y is measured in $100,000s, we can predict that the company will receive $950,000 in gross sales.
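The prediction step, including the unit conversions, can be sketched as a small Python function. The function name `predict_gross_sales` is hypothetical, and the coefficients are the ones from the worked example.

```python
def predict_gross_sales(avg_salary_dollars, b0=2.0, b1=1.25):
    """Predict gross sales in dollars from average local salary in dollars.

    Hypothetical helper: the model works in scaled units, so the input
    is converted to $10,000s and the output back from $100,000s.
    """
    x = avg_salary_dollars / 10_000   # salary in $10,000s
    y_hat = b0 + b1 * x               # gross sales in $100,000s
    return y_hat * 100_000            # gross sales in dollars


print(predict_gross_sales(60_000))  # 950000.0
```

Keeping the conversion inside the function avoids the common slip of plugging 60,000 directly into an equation that expects X = 6.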