Multiple Linear Regression
Linear Regression, one of the most popular and discussed models, is certainly the
gateway to go deeper into Machine Learning (ML). Such a simple,
straightforward approach to modeling is worth learning as one of your first steps
into ML.
Before moving forward, let us recall that Linear Regression can be broadly
classified into two categories: Simple Linear Regression and Multiple Linear Regression.
If you are new to regression, then I strongly suggest you first read about Simple
Linear Regression from the link below, where you will find the underlying maths
and the approach to this model explained with interesting data and hands-on coding.
We will see how multiple input variables together influence the output variable,
while also learning how the calculations differ from those of the Simple LR model. We
will also build a regression model using Python.
Finally, we will go deeper into Linear Regression and learn things like
Collinearity, Hypothesis Testing, Feature Selection, and much more.
Now one might wonder: couldn't we also use Simple Linear Regression to study the
output against each independent variable separately? Wouldn't that have made life
much easier?
No, it wouldn't.
Consider this: suppose you have to estimate the price of a certain house you want
to buy. You know the floor area, the age of the house, its distance from your
workplace, the crime rate of the area, etc.
Now, some of these factors will affect the price of the house positively. For
example, the larger the area, the higher the price. On the other hand, factors like
distance from the workplace and the crime rate can influence your estimate of the
house negatively (unless you are a rich criminal with an interest in Machine Learning
looking for a hideout, but yeah, I don't think so).
Mathematically:
Y = β₀ + β₁X₁ + β₂X₂ + … + βₙXₙ
Here, Y is the output variable, and X terms are the corresponding input variables.
Notice that this equation is just an extension of Simple Linear Regression, and
each predictor has a corresponding slope coefficient (β).
The first β term (β₀) is the intercept constant and is the value of Y in the absence of
all predictors (i.e., when all X terms are 0). It may or may not hold any
significance in a given regression problem. It's generally there to give a relevant
nudge to the line/plane of regression.
If you have read my post on Simple Linear Regression, then you are already
familiar with this data. If you haven’t, let me give you a quick brief.
The advertising data set consists of the sales of a product in 200 different markets,
along with advertising budgets for three different media: TV, radio, and
newspaper. Here's what it looks like:
Sales (×1000 units) vs advertising budget (×1000 USD)
The first row of the data says that the advertising budgets for TV, radio, and
newspaper were $230.1k, $37.8k, and $69.2k respectively, and the corresponding
number of units that were sold was 22.1k (or 22,100).
In Simple Linear Regression, we can see how each advertising medium affects
sales when applied without the other two media. However, in practice, all three
might be working together to impact net sales; we did not consider that combined
effect of the media on sales.
Multiple Linear Regression solves the problem by taking all the
variables into account in a single expression. Hence, our Linear Regression model can now be
expressed as:
sales = β₀ + β₁ × TV + β₂ × radio + β₃ × newspaper
There are two straightforward ways to build this model in Python:
Statsmodels: fit an Ordinary Least Squares (OLS) model using an R-style
formula. This is the route the coefficients below come from.
SciKit Learn: just import the Linear Regression module from the
Sklearn package and fit the model on the data. This method is pretty
straightforward, and you can see how to use it below.
# Loading data - you can give the complete path to your data here
import pandas as pd

ad = pd.read_csv("Advertising.csv")
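Here is a minimal sketch of the Statsmodels fit that produces the parameters below, assuming the formula API is imported as sm (the same alias the later snippets use):

import statsmodels.formula.api as sm

# Sketch: fit ordinary least squares with all three media as predictors,
# using the 'ad' DataFrame loaded above
model = sm.ols('sales ~ TV + radio + newspaper', ad).fit()
print(model.params)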
Intercept 2.938889
TV 0.045765
radio 0.188530
newspaper -0.001037
I encourage you to run the regression model using Scikit Learn as well and find
the above parameters using model.coef_ & model.intercept_. Did you see the
same results?
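For reference, a minimal sketch of the Scikit Learn route (the variable name model_sk is my choice; the column names are assumed to match the dataset above):

from sklearn.linear_model import LinearRegression

# Inputs: the three advertising budgets; output: sales
X = ad[['TV', 'radio', 'newspaper']]
y = ad['sales']

model_sk = LinearRegression().fit(X, y)  # hypothetical variable name
print(model_sk.intercept_)  # should be close to 2.9389
print(model_sk.coef_)       # close to [0.0458, 0.1885, -0.0010]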
Now that we have these values, how do we interpret them? Here's how:
If we fix the budgets for TV & newspaper, then increasing the radio
budget by $1000 will lead to an increase in sales by around 189
units (0.189 × 1000).
Similarly, by fixing radio & newspaper, we infer an approximate
rise of 46 units of product per $1000 increase in the TV budget.
However, for the newspaper budget, since the coefficient is quite
negligible (close to zero), it's evident that newspaper is not
affecting the sales. In fact, it's on the negative side of zero (−0.001),
which, if the magnitude were big enough, could have meant that this
medium was causing sales to fall. But we cannot make that kind
of inference with such a negligible value.
Collinearity
To understand why newspaper seems to have no effect here, let's see how these variables are correlated with each other.
ad.corr()
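The post visualizes this correlation matrix as a heatmap. A minimal sketch of one way to reproduce it, assuming seaborn is installed:

import matplotlib.pyplot as plt
import seaborn as sns  # assumed to be available

# Darker squares indicate stronger correlation
sns.heatmap(ad.corr(), cmap='Blues', annot=True)
plt.show()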
Here the dark squares represent a strong correlation (close to 1), while the lighter
ones represent a weaker correlation (close to 0). That's why all the
diagonals are dark blue: every variable is perfectly correlated with itself.
Now, the thing worth noticing here is that the correlation between newspaper and
radio is 0.35. This indicates a fair relationship between the newspaper and radio
budgets. Hence, it can be inferred that when the radio budget is increased for a
product, there's a tendency to spend more on newspaper as well.
This is why, even though the Multiple Regression model shows no impact on sales by
newspaper, the Simple Regression model still does: with the other input variables
absent, this multicollinearity lets newspaper take credit for radio's effect.
We saw that the three predictors TV, radio, and newspaper had different degrees
of linear relationship with sales. But what if the relationship is just by chance
and there is no actual impact on sales from any of the predictors?
The model can only give us numbers to establish a close enough linear
relationship between the response variable and the predictors. However, it cannot
tell us how significant that relationship is.
To have some confidence, we take help from statistics and perform what is known
as a Hypothesis Test. We start by forming a Null Hypothesis and a
corresponding Alternative Hypothesis.
Since our goal is to find whether at least one predictor is useful in predicting the output,
we are in a way hoping that at least one of the coefficients (not the intercept) is non-
zero, not just by random chance but because of an actual effect.
To do this, we start by forming a Null Hypothesis: all the coefficients are equal to
zero.
Null Hypothesis (H₀): β₁ = β₂ = … = βₙ = 0
Hence the Alternative Hypothesis would be: at least one coefficient is not zero. We
support it by finding strong statistical evidence to reject the Null Hypothesis.
Alternative Hypothesis (Hₐ): at least one βᵢ ≠ 0
The hypothesis test is performed using the F-Statistic. The formula for this
statistic contains the Residual Sum of Squares (RSS) and the Total Sum of Squares
(TSS), which we don't have to worry about because the Statsmodels package
takes care of it. The summary of the OLS model that we fit above contains
all such statistics and can be obtained with this simple line of code:
print(model.summary2())
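For reference, the overall F-statistic takes the form F = [(TSS − RSS) / p] / [RSS / (n − p − 1)], where n is the number of observations and p is the number of predictors; Statsmodels computes it for us and reports it in this summary.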
If the value of the F-statistic is equal to or very close to 1, then the results are in favor
of the Null Hypothesis and we fail to reject it.
But as we can see, the F-statistic here is many times larger than 1, providing
strong evidence against the Null Hypothesis (that all coefficients are zero). Hence,
we reject the Null Hypothesis and are confident that at least one predictor is
useful in predicting the output.
So we can say that at least one of the three advertising media is useful in
predicting sales.
But which one, or which two, are important? Are all of them important? To find
out, we will perform Feature Selection (or variable selection). One way
of doing this is to try all possible combinations, i.e.:
TV
radio
newspaper
TV & radio
TV & newspaper
radio & newspaper
TV, radio & newspaper
Here, it still looks feasible to try all 7 combinations, but with more
predictors the number of combinations grows exponentially: n predictors give
2ⁿ − 1 non-empty combinations. By adding only one more predictor to our case study,
the total would become 15. Just imagine having a dozen predictors (4,095 combinations).
Feature Selection
Two of the most popular approaches to feature selection are:
Forward Selection
Backward Elimination
In this post, I’ll walk you through the forward selection method. To begin with,
let’s understand how we are going to select or reject the added variable.
We are going to use two measures to evaluate our new model after each
addition: RSS and R².
We are already familiar with RSS, the Residual Sum of Squares, which is
calculated by squaring the differences between actual outputs and predicted
outputs and summing them up; it should be as small as possible for the model to
perform well. R², the coefficient of determination, measures how much of the
variance in the output the model explains; the closer it is to 1, the better.
Luckily, both are calculated for us by the OLS module in Statsmodels. So let's begin.
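The snippets below rely on an evaluateModel helper to print these two measures. A minimal sketch of such a helper, assuming a fitted Statsmodels OLS result is passed in:

# Sketch of the helper used below: prints RSS and R^2
def evaluateModel(model):
    print("RSS =", model.ssr)       # residual sum of squares
    print("R^2 =", model.rsquared)  # coefficient of determination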
Let’s first evaluate models with single predictors one by one, starting with TV.
# For TV
model_TV = sm.ols('sales ~ TV', ad).fit()
evaluateModel(model_TV)
RSS = 2102.5305831313512
R^2 = 0.611875050850071
# For radio
model_radio = sm.ols('sales ~ radio', ad).fit()
evaluateModel(model_radio)
RSS = 3618.479549025088
R^2 = 0.33203245544529525
# For newspaper
model_newspaper = sm.ols('sales ~ newspaper', ad).fit()
evaluateModel(model_newspaper)
RSS = 5134.804544111939
R^2 = 0.05212044544430516
We observe that model_TV has the lowest RSS and the highest R² among
all the models. Hence we select model_TV as our base model to move forward.
Now, we will add radio and newspaper one by one and check the new values.
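Following the same pattern as the single-predictor fits (the variable name model_TV_radio is my choice):

# For TV & radio together (variable name assumed)
model_TV_radio = sm.ols('sales ~ TV + radio', ad).fit()
evaluateModel(model_TV_radio)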
RSS = 556.9139800676184
R^2 = 0.8971942610828957
As we can see, the values have improved tremendously: RSS has decreased
and R² has increased further compared to model_TV. It's a good sign. Let's
now check the same for TV and newspaper.
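Again following the same pattern (model_TV_news is an assumed name):

# For TV & newspaper together (variable name assumed)
model_TV_news = sm.ols('sales ~ TV + newspaper', ad).fit()
evaluateModel(model_TV_news)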
RSS = 1918.5618118968275
R^2 = 0.6458354938293271
The values have improved by adding newspaper too, but not as much as with
radio. So, at this step, we will proceed with the TV & radio model and
observe the difference when we add newspaper to it.
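One more fit in the same style (model_all is an assumed name):

# For TV, radio & newspaper together (variable name assumed)
model_all = sm.ols('sales ~ TV + radio + newspaper', ad).fit()
evaluateModel(model_all)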
RSS = 556.8252629021872
R^2 = 0.8972106381789522
The values have not improved with any significance. Hence, it's best not to
add newspaper, and to finalize the model with TV and radio as the selected features.
Plotting the variables TV, radio, and sales in a 3D graph, we can visualize how
our model has fit a regression plane to the data.
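A minimal sketch of one way to draw that plot with Matplotlib, reusing the model_TV_radio fit sketched above:

import numpy as np
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(projection='3d')

# Observed data points
ax.scatter(ad['TV'], ad['radio'], ad['sales'])

# Regression plane predicted by the TV & radio model (sketched earlier)
b0, b1, b2 = model_TV_radio.params
tv, rd = np.meshgrid(np.linspace(ad['TV'].min(), ad['TV'].max(), 20),
                     np.linspace(ad['radio'].min(), ad['radio'].max(), 20))
ax.plot_surface(tv, rd, b0 + b1 * tv + b2 * rd, alpha=0.3)

ax.set_xlabel('TV budget')
ax.set_ylabel('radio budget')
ax.set_zlabel('sales')
plt.show()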
That’s it for Multiple Linear Regression. You can find the full code behind this
post here. I hope you had a good time reading and learning. For more, stay tuned.