Multiple Linear Regression

Linear Regression, one of the most popular and widely discussed models, is
certainly the gateway to go deeper into Machine Learning (ML). Such a simple,
straightforward approach to modeling is worth learning as one of your first steps
into ML.

Before moving forward, let us recall that Linear Regression can be broadly
classified into two categories.

Simple Linear Regression: It’s the simplest form of Linear Regression, used
when there is a single input variable for the output variable.

If you are new to regression, then I strongly suggest you first read about Simple
Linear Regression from the link below, where you will understand the underlying
maths and the approach to this model using interesting data and hands-on coding.

Multiple Linear Regression: It’s a form of linear regression used when there are
two or more predictors.

We will see how multiple input variables together influence the output variable,
while also learning how the calculations differ from those of the Simple Linear
Regression model. We will also build a regression model using Python.

Finally, we will go deeper into Linear Regression and learn about things like
Collinearity, Hypothesis Testing, Feature Selection, and much more.

Now one might wonder: couldn’t we also use Simple Linear Regression to study our
output against each independent variable separately? That would make life much
easier, right?

No, it wouldn’t.


Why Multiple Linear Regression?


“To predict the outcome from multiple input variables. Duh!” But is that it?
Well, hold that thought.

Consider this: suppose you have to estimate the price of a certain house you want
to buy. You know the floor area, the age of the house, its distance from your
workplace, the crime rate of the area, etc.

Now, some of these factors will affect the price of the house positively. For
example, the more the area, the higher the price. On the other hand, factors like
distance from the workplace and the crime rate can influence your estimate of the
house negatively (unless you are a rich criminal with an interest in Machine
Learning looking for a hideout; yeah, I don’t think so).

Disadvantages of Simple Linear Regression → Running separate simple linear
regressions will lead to different outcomes when we are interested in just one.
Besides that, there may be an input variable that is itself correlated with, or
dependent on, some other predictor. This can cause wrong predictions and
unsatisfactory results.

This is where Multiple Linear Regression comes into the picture.

Mathematically…

Y = β₀ + β₁X₁ + β₂X₂ + … + βₙXₙ + ε

Here, Y is the output variable, the X terms are the corresponding input variables,
and ε is the error term. Notice that this equation is just an extension of Simple
Linear Regression, and each predictor has a corresponding slope coefficient (β).

The first β term (β₀) is the intercept constant and is the value of Y in the absence
of all predictors (i.e., when all X terms are 0). It may or may not hold any
significance in a given regression problem. It’s generally there to give a relevant
nudge to the line/plane of regression.

Let’s now understand this with the help of some data.

Visualizing the data


We are going to use the Advertising data, which is available on the site of the
USC Marshall School of Business. You can download it here.

If you have read my post on Simple Linear Regression, then you are already
familiar with this data. If you haven’t, let me give you a quick brief.

The advertising data set consists of the sales of a product in 200 different markets,
along with advertising budgets for three different media: TV, radio, and
newspaper. Here’s what it looks like:

Sales (*1000 units) vs Advertising budget (*1000 USD)

The first row of the data says that the advertising budgets for TV, radio, and
newspaper were $230.1k, $37.8k, and $69.2k respectively, and the corresponding
number of units that were sold was 22.1k (or 22,100).

In Simple Linear Regression, we can see how each advertising medium affects
sales when applied without the other two media. However, in practice, all three
might be working together to impact net sales. We did not consider the combined
effect of these media on sales.

Multiple Linear Regression solves the problem by taking account of all the
variables in a single expression. Hence, our Linear Regression model can now be
expressed as:

sales = β₀ + β₁·TV + β₂·radio + β₃·newspaper + ε

Finding the values of these constants (β) is what the regression model does, by
minimizing the error function and fitting the best line or hyperplane (depending
on the number of input variables).

This is done by minimizing the Residual Sum of Squares (RSS), which is
obtained by squaring the differences between actual and predicted outcomes.
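
Written out, the RSS is:

RSS = Σ (yᵢ − ŷᵢ)²

where yᵢ is the actual outcome for the i-th sample and ŷᵢ is the value predicted by
the model.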

Ordinary Least Squares


Because this method finds the smallest sum of squared errors, it is also known as
the Ordinary Least Squares (OLS) method. In Python, there are two primary ways
to implement the OLS algorithm.

Scikit-Learn: Just import the LinearRegression module from the sklearn
package and fit the model on the data. This method is pretty straightforward
and you can see how to use it below.

from sklearn.linear_model import LinearRegression
import pandas as pd

# Load the advertising data (give the complete path to your file if needed)
data = pd.read_csv("Advertising.csv")

# Fit on all predictors, with sales as the target
model = LinearRegression()
model.fit(data.drop('sales', axis=1), data.sales)

StatsModels: Another way is to use the Statsmodels package to implement OLS.
Statsmodels is a Python package that allows performing various statistical tests
on the data. We will use it here so that you can learn about this great Python
library, and because it will be helpful for us in the later sections.


Building the model and interpreting the coefficients
# Importing required libraries
import pandas as pd
import statsmodels.formula.api as sm

# Loading data - you can give the complete path to your data here
ad = pd.read_csv("Advertising.csv")

# Fitting the OLS on data
model = sm.ols('sales ~ TV + radio + newspaper', ad).fit()
print(model.params)

You should get the following output.

Intercept 2.938889
TV 0.045765
radio 0.188530
newspaper -0.001037

I encourage you to run the regression model using Scikit Learn as well and find
the above parameters using model.coef_ & model.intercept_. Did you see the
same results?

Now that we have these values, how do we interpret them? Here’s how:

If we fix the budget for TV & newspaper, then increasing the radio
budget by $1000 will lead to an increase in sales by around 189
units (0.189 × 1000).
Similarly, by fixing the radio & newspaper budgets, we infer an approximate
rise of 46 units of product per $1000 increase in the TV budget.
However, for the newspaper budget, since the coefficient is quite
negligible (close to zero), it’s evident that newspaper is not
affecting the sales. In fact, it’s on the negative side of zero (−0.001),
which, if the magnitude were big enough, could have meant that this
agent is rather causing the sales to fall. But we cannot make that kind
of inference with such a negligible value.

Let me tell you an interesting thing here. If we run Simple Linear
Regression using just the newspaper budget against sales, we’ll observe a
coefficient value of around 0.055, which is quite significant in comparison to
what we saw above. Now, why is that?
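
You can verify this yourself with a one-line fit, using the same sm.ols interface
as above (model_news is just an illustrative name):

model_news = sm.ols('sales ~ newspaper', ad).fit()
print(model_news.params)  # the newspaper coefficient comes out around 0.055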

Collinearity
To understand why, let’s see how these variables are correlated with each other.

ad.corr()

Correlation Matrix for advertising data


Let’s visualize these numbers using a heatmap.

import matplotlib.pyplot as plt
%matplotlib inline

# Plot the correlation matrix as a heatmap
plt.imshow(ad.corr(), cmap=plt.cm.GnBu, interpolation='nearest')
plt.colorbar()
tick_marks = [i for i in range(len(ad.columns))]
plt.xticks(tick_marks, ad.columns, rotation=45)
plt.yticks(tick_marks, ad.columns, rotation=45)

Correlation Heatmap for Advertising Data

Here the dark squares represent a strong correlation (close to 1), while the lighter
ones represent a weaker correlation (close to 0). That’s why all the diagonal
squares are dark blue: a variable is perfectly correlated with itself.

Now, the thing worth noticing here is that the correlation between newspaper and
radio is 0.35. This indicates a fair relationship between newspaper and radio
budgets. Hence, it can be inferred that → when the radio budget is increased for a
product, there’s a tendency to spend more on newspapers as well.

This is called collinearity: a situation in which two or more input variables are
linearly related.

Hence, even though the Multiple Regression model shows no impact on sales by
newspaper, the Simple Regression model still does, due to this multicollinearity
and the absence of the other input variables.

Sales & Radio → probable causation

Newspaper & Radio → multicollinearity

Sales & Newspaper → transitive correlation

Alright! We understood Linear Regression, we built the model, and we even
interpreted the results. What we have learned so far are the fundamentals of Linear
Regression. However, while dealing with real-world problems, we generally go
beyond this point to statistically analyze our model and make the necessary
changes if required.

Hypothesis Test for Predictors


One of the fundamental questions that should be answered while running Multiple
Linear Regression is whether at least one of the predictors is useful in predicting
the output.

We saw that the three predictors TV, radio, and newspaper have different degrees
of linear relationship with sales. But what if the relationship is just by chance,
and there is no actual impact on sales from any of the predictors?

The model can only give us numbers to establish a close enough linear
relationship between the response variable and the predictors. However, it cannot
prove the credibility of these relationships.

To have some confidence, we take help from statistics and do something known
as a Hypothesis Test. We start by forming a Null Hypothesis and a
corresponding Alternative Hypothesis.

Since our goal is to find out if at least one predictor is useful in predicting the
output, we are in a way hoping that at least one of the coefficients (not the
intercept) is non-zero, not just by random chance but due to an actual cause.

To do this, we start by forming a Null Hypothesis: all the coefficients are equal to
zero.

General Null Hypothesis for Multiple Linear Regression:

H₀: β₁ = β₂ = … = βₙ = 0

Null Hypothesis for the Advertising data:

H₀: β_TV = β_radio = β_newspaper = 0

Hence the Alternative Hypothesis would be: at least one coefficient is not zero. It
is established by rejecting the Null Hypothesis on the basis of strong statistical
evidence.

Alternative Hypothesis:

Hₐ: at least one βᵢ ≠ 0

The hypothesis test is performed using the F-statistic. The formula for this
statistic contains the Residual Sum of Squares (RSS) and the Total Sum of Squares
(TSS), which we don’t have to compute ourselves because the Statsmodels package
takes care of it. The summary of the OLS model that we fit above contains the
summary of all such statistics and can be obtained with this simple line of code:

print(model.summary2())
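
For reference, the F-statistic being tested here has the standard form (with n
samples and p predictors):

F = ((TSS − RSS) / p) / (RSS / (n − p − 1))

where TSS = Σ (yᵢ − ȳ)² is the Total Sum of Squares.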


If the value of F-statistic is equal to or very close to 1, then the results are in favor
of the Null Hypothesis and we fail to reject it.

But as we can see, the F-statistic is many times larger than 1, thus providing
strong evidence against the Null Hypothesis (that all coefficients are zero). Hence,
we reject the Null Hypothesis and are confident that at least one predictor is
useful in predicting the output.
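
If you want these numbers directly instead of scanning the full summary, the
fitted results object exposes them as attributes (part of statsmodels' regression
results API):

print(model.fvalue)    # the F-statistic
print(model.f_pvalue)  # its associated p-value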

Note that the F-statistic is not suitable when the number of predictors (p) is
large, or if p is greater than the number of data samples (n).

Hence, we can say that at least one of the three advertising agents is useful in
predicting sales.

But which one or which two are important? Are all of them important? To find
this out, we will perform Feature Selection, or variable selection. One way
of doing this is trying all possible combinations, i.e.

TV
radio
newspaper
TV & radio
TV & newspaper
radio & newspaper
TV, radio & newspaper

Here, it still looks feasible to try all 7 combinations, but if there are more
predictors, the number of combinations increases exponentially: with p predictors
there are 2ᵖ − 1 non-empty subsets to try. For example, by adding only one more
predictor to our case study, the total combinations would become 15. Just imagine
having a dozen predictors (4,095 combinations).

Hence we need more efficient ways to perform Feature Selection.

Feature Selection
Two of the most popular approaches to feature selection are:

Forward Selection: We start with a model without any predictors, just the
intercept term. We then perform simple linear regression for each predictor
to find the best performer (lowest RSS). We then add another variable to it
and check for the best 2-variable combination, again by calculating the
lowest RSS. After that, the best 3-variable combination is checked, and so
on. The approach is stopped when some stopping rule is satisfied. A code
sketch of this procedure follows just after this list.
Backward Selection: We start with all variables in the model, and
remove the variable that is the least statistically significant (greatest p-
value; check the model summary above to find the p-values of the variables).
This is repeated until a stopping rule is reached. For instance, we may
stop when there is no further improvement in the model score.
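
Before walking through forward selection manually, here’s what it can look like as
a small reusable function. This is a minimal sketch, not code from the original
post: it assumes the ad DataFrame and the sm (statsmodels.formula.api) import
from earlier, and it uses adjusted R² as the stopping rule so the search halts once
a new variable stops adding value.

def forward_select(data, response):
    # Candidate predictors are all columns except the response
    remaining = set(data.columns) - {response}
    selected = []
    best_so_far = float('-inf')
    while remaining:
        # Score every model that adds one more candidate variable
        scores = []
        for candidate in remaining:
            formula = response + ' ~ ' + ' + '.join(selected + [candidate])
            scores.append((sm.ols(formula, data).fit().rsquared_adj, candidate))
        score, candidate = max(scores)
        if score <= best_so_far:  # stop: adjusted R² no longer improves
            break
        remaining.remove(candidate)
        selected.append(candidate)
        best_so_far = score
    return selected

# Example (drop any index column first if your CSV has one):
# forward_select(ad[['TV', 'radio', 'newspaper', 'sales']], 'sales')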


In this post, I’ll walk you through the forward selection method. To begin with,
let’s understand how we are going to select or reject the added variable.

We are going to use 2 measures to evaluate our new model after each
addition: RSS and R².

We are already familiar with RSS, the Residual Sum of Squares, which is
calculated by squaring the differences between the actual outputs and the
predicted outcomes. It should be as small as possible for the model to perform
well.

R² is the measure of the degree to which the variance in the data is explained by
the model. Mathematically, it’s the square of the correlation between the actual
and predicted outcomes. An R² closer to 1 indicates that the model is good and
explains the variance in the data well. A value closer to zero indicates a poor
model.
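
In terms of the quantities above, R² can also be written as:

R² = 1 − RSS / TSS

so the smaller the residuals relative to the total variation, the closer R² gets
to 1.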

Luckily, it’s calculated for us by the OLS module in Statsmodels. So let’s begin.

# Defining a function to evaluate a model
def evaluateModel(model):
    print("RSS = ", ((ad.sales - model.predict())**2).sum())
    print("R^2 = ", model.rsquared)

Let’s first evaluate models with single predictors one by one, starting with TV.

# For TV
model_TV = sm.ols('sales ~ TV', ad).fit()
evaluateModel(model_TV)

RSS = 2102.5305831313512
R^2 = 0.611875050850071

# For radio
model_radio = sm.ols('sales ~ radio', ad).fit()
evaluateModel(model_radio)

RSS = 3618.479549025088
R^2 = 0.33203245544529525

# For newspaper
model_newspaper = sm.ols('sales ~ newspaper', ad).fit()
evaluateModel(model_newspaper)

RSS = 5134.804544111939
R^2 = 0.05212044544430516

We observe that for model_TV, the RSS is the lowest and the R² value is the
highest among all the models. Hence, we select model_TV as our base model to
move forward.

Now, we will add radio and newspaper one by one and check the new values.


# For TV & radio
model_TV_radio = sm.ols('sales ~ TV + radio', ad).fit()
evaluateModel(model_TV_radio)

RSS = 556.9139800676184
R^2 = 0.8971942610828957

As we can see, our values have improved tremendously: RSS has decreased and
R² has increased, compared to model_TV. That’s a good sign. Let’s now check the
same for TV and newspaper.

# For TV & newspaper


model_TV_radio = sm.ols('sales ~ TV + newspaper', ad).fit()
evaluateModel(model_TV_newspaper)

RSS = 1918.5618118968275
R^2 = 0.6458354938293271

The values have improved by adding newspaper too, but not as much as with
radio. Hence, at this step, we will proceed with the TV & radio model and
observe the difference when we add newspaper to it.

# For TV, radio & newspaper
model_all = sm.ols('sales ~ TV + radio + newspaper', ad).fit()
evaluateModel(model_all)

RSS = 556.8252629021872
R^2 = 0.8972106381789522

The values have not improved significantly. Hence, it’s best not to add
newspaper, and to finalize the model with TV and radio as the selected features.

So our final model can be expressed as below:

sales = β₀ + β₁·TV + β₂·radio
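
The fitted values of these coefficients come from the model_TV_radio that we
trained above; you can print them directly:

print(model_TV_radio.params)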

Plotting the variables TV, radio, and sales in a 3D graph, we can visualize how
our model has fit a regression plane to the data.

3D Plot to understand the regression plane. Image by Sangeet Aggarwal
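
If you’d like to reproduce a plot along these lines, here is a minimal matplotlib
sketch. It’s an illustrative supplement (the post’s own plotting code isn’t shown)
and assumes the ad DataFrame and model_TV_radio from above:

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

# Scatter the observed data points
ax.scatter(ad.TV, ad.radio, ad.sales, color='navy')

# Evaluate the fitted plane on a grid over the predictor ranges
tv_grid, radio_grid = np.meshgrid(
    np.linspace(ad.TV.min(), ad.TV.max(), 20),
    np.linspace(ad.radio.min(), ad.radio.max(), 20))
b0, b1, b2 = model_TV_radio.params  # Intercept, TV, radio
ax.plot_surface(tv_grid, radio_grid, b0 + b1*tv_grid + b2*radio_grid,
                alpha=0.3, color='orange')

ax.set_xlabel('TV'); ax.set_ylabel('radio'); ax.set_zlabel('sales')
plt.show()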

That’s it for Multiple Linear Regression. You can find the full code behind this
post here. I hope you had a good time reading and learning. For more, stay tuned.
