
MODULE 5

Module Handbook
Linear Regression

Objectives
At the end of this module, you should be able to:

● Work with Linear Regression


● Build Univariate Linear Regression Model
● Understand Gradient Descent Algorithm
● Implement Linear Regression with sklearn
● Multivariate Linear Regression
● Dummy Variables
● Feature Scaling
● Boston Housing Prices Prediction
● Cross Validation
● Regularization: Lasso & Ridge

Linear Regression
Before we move on to the focus topic, i.e. “Linear Regression”, let's first understand what we mean by Regression.
Regression is a statistical way to establish a relationship between a dependent variable and a set of
independent variable(s). For example, if we say that

Age = 5 + Height * 10 + Weight * 13

we are establishing a relationship between the Height & Weight of a person and his/her Age. This is a very
basic example of Regression.

What is Linear Regression?
A Supervised Learning Algorithm that learns from a set of training samples

“Linear Regression” can be defined as a statistical method to regress data where the dependent variable
has continuous values, while the independent variables can have either continuous or categorical values.

“Linear Regression” can also be described as a method to predict the dependent variable (Y) based on the
values of the independent variables (X).

In other words, it simply provides an estimate of the relationship between a dependent variable
(target/label) and one or more independent variables (predictors).

Assumptions of Linear Regression
No single size fits all, and the same is true for Linear Regression. In order to fit a linear regression
line, the data should satisfy a few basic but important assumptions. If your data doesn't follow these
assumptions, your results may be wrong as well as misleading.

I. Linearity & Additivity: There should be a linear relationship between the dependent and independent
variables, and the impact of a change in an independent variable's value should have an additive impact
on the dependent variable.

II. Normality of error distribution: The differences between actual and predicted values
(residuals) should be normally distributed.

III. Homoscedasticity: Variance of errors should be constant versus,


○ Time
○ The predictions
○ Independent variable values

IV. Statistical independence of errors: The error terms (residuals) should not have any correlation
among themselves. E.g., In case of time series data there shouldn’t be any correlation between
consecutive error terms.

Objective of Linear Regression

● Establish if there is a relationship between two variables.

Examples – the relationship between housing prices and the area of a house, the number of hours of study
and the marks obtained, income and spending, etc.

● Predict new possible values.

Based on the area of a house, predicting house prices in a particular month; based on the
number of hours studied, predicting the possible marks; predicting sales in the next 3 months, etc.

Linear Regression Use cases

● To model residential home prices as a function of the home's living area, bathrooms,
number of bedrooms, lot size.

● To analyze the effect of a proposed radiation treatment on reducing tumor sizes
based on patient attributes such as age or weight.

● To predict demand for goods and services. For example, restaurant chains can
predict the quantity of food depending on weather.

● To predict a company's sales based on the previous month's sales and the company's
stock prices.

Regression Types

● Univariate Linear Regression

● Multiple Linear Regression

● Polynomial Linear Regression


Simple Linear Regression


In simple linear regression, we predict scores on one variable from the scores on a second variable.

○ The variable we are predicting is called the criterion variable and is referred to as Y.

○ The variable we are basing our predictions on is called the predictor variable and is referred
to as X.

When there is only one predictor variable, the prediction method is called simple regression.

In simple linear regression, the topic of this section, the predictions of Y when plotted as a function
of X form a straight line.

In the example below, consider that we have a dataset containing the area of several
houses and their prices, and we wish to build a predictive model using this dataset which can predict
the price of a house based on the area fed to it.

In this example, the Area will be the input variable (input feature or independent variable) and the
Price will be the output variable (target or dependent variable).

Area (sq ft)    Price (INR)
1200            20,00,000
1800            42,00,000
3200            44,00,000
3800            25,00,000
4200            62,00,000

We can plot the given data as below, taking the x-axis as the area of the house and the y-axis as
the price of the house.
[Figure: scatter plot of the data, with Area in 1000 sq. feet on the x-axis and Price in 100,000 (INR) on the y-axis]


x: Independent variable, predictor variables or regressors.

y: Dependent Variable, criterion variable, or regressand.

Now let us consider a scenario: I know two people. One is my friend, who has just started in the real
estate business and has done only 2 deals so far; the other is my uncle, who has been in the real estate
business for the last 15 years. My uncle has seen far more data about properties during those 15 years.
Now, if I have to buy a property in the city, my uncle can be more accurate in suggesting the price than
my friend, and the reason is data: my uncle has consumed more data than my friend, which makes him more
accurate. Similarly, in Machine Learning, the more data you can feed to the algorithm, the better and
more effective the algorithm you can build.

Consider the scenario of my friend who did only 2 deals so far. That means we have the first two
samples only, and we have to predict the price for the third value of area, i.e. the price of a house of
3200 sq. ft.

How do we do it?

You might say that we can draw a line through the 2 existing points and extend that line to make a
prediction for the third point on the x-axis. That is actually one of the simplest and most reasonable ways to do it.

So let's consider a straight line as a predictive model.

In this case, we can say that the current straight line is not very accurate when it makes predictions for
all the known values; in other words, it does not fit the pattern of our current data.

It has a high error between the predicted price and the real/actual price for all the known values of x.

Let's calculate the error of the current straight line for x = 3200 sq. ft.

Here we have depicted our error as the difference between the actual price of the house and
the predicted price of the house.
In the real world, errors can be in both directions, meaning they can be both positive and negative.

In the image below we can see that in one case, i.e. for the 7th sample (at approximately x = 6000
sq. ft.), the error e is positive, whereas for the 8th sample (at approximately x = 7000 sq. ft.)
the error e is negative.

So when the error is positive we have to decrease it towards zero, and when the error is negative we
have to increase it towards zero, which makes this a two-directional optimization problem. To resolve
this issue we can take the mean of the squared errors, which is also known as the Mean Squared Error (MSE).
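
A sketch of this cost function in the notation used later in this module (n observations, fitted line ŷ = m·x + c; the 1/2 factor is included so that it cancels neatly when differentiating):

J(m, c) = \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - (m x_i + c) \right)^2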

In the above image we can say that the green line would have less overall error than the red line. But
how can we shift from the red line to the green line?

What should we change in the current equation of the line?

Probably you will say: change m & c. That means updating m & c in a certain direction may reduce the
cost function, so the factors affecting the value of the cost function are m & c.

Machine Learning is all about learning from experience and improving performance. If we keep
feeding data, calculate the overall error, and update m & c to minimize the
error function (cost function) iteration by iteration, then we can say that the line is learning to adapt
to the pattern of the data.

Let's talk more about the cost function.

The cost function in this case, for Linear Regression, is convex in nature. An important property of a
convex function is that it has a single global minimum, which is found at the point where its derivative equals zero.

So now, as a whole, we can describe the final understanding of predicting a continuous parameter in the
following manner:

Now the goal is to minimize the cost function by updating the parameters m and c.

So let's take the partial derivatives of the cost function with respect to m and c.
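
Assuming the 1/2n form of the cost given above, the partial derivatives work out to (a sketch, not necessarily the slide's exact notation):

\frac{\partial J}{\partial m} = \frac{1}{n} \sum_{i=1}^{n} \left( (m x_i + c) - y_i \right) x_i

\frac{\partial J}{\partial c} = \frac{1}{n} \sum_{i=1}^{n} \left( (m x_i + c) - y_i \right)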

As you can see, the constant 1/2 gets cancelled. In partial differentiation, we differentiate the entire
equation with respect to one variable while keeping the other variables constant. We also see that the partial
derivative of this cost function is just the difference between actual and predicted values, averaged over all
observations (n). To compute m & c more effectively, gradient descent comes into the picture. For a particular
value of m & c, gradient descent works like this:

1. First, it calculates the partial derivative of the cost function.

2. If the derivative is positive, it decreases the parameter value.
3. If the derivative is negative, it increases the parameter value.
4. The goal is to reach the lowest point of the convex curve, where the derivative is zero.
5. It progresses iteratively using a step size (η), also called the learning rate, which is defined by the
user. Make sure the step size isn't too large or too small: too small a step size will take
longer to converge, while too large a step size may overshoot and never reach the optimum.

Gradient Descent Algorithm

Repeat until convergence:

w_j := w_j - lr \cdot \frac{\partial J}{\partial w_j}    (simultaneously update for j = 0 and j = 1)

where w denotes the parameters (coefficient & constant, i.e. m & c).

Learning Rate (lr)

The learning rate lr controls how big a step we take while updating our parameter w.

- If lr is too small, gradient descent can be slow.

- If lr is too big, gradient descent can overshoot the minimum and may fail to converge.
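
As an illustration only (not the module's original code), a minimal batch gradient descent for the univariate line ŷ = m·x + c could look like this in Python, using toy numbers:

import numpy as np

# Toy data: area in 1000 sq. ft. (x) and price in 100,000 INR (y)
x = np.array([1.2, 1.8, 3.2, 3.8, 4.2])
y = np.array([20.0, 42.0, 44.0, 25.0, 62.0])

m, c = 0.0, 0.0      # initial guesses for slope and intercept
lr = 0.05            # learning rate (step size)
n = len(x)

for _ in range(10000):
    y_hat = m * x + c
    error = y_hat - y
    dm = (1 / n) * np.sum(error * x)   # dJ/dm for the 1/2n cost above
    dc = (1 / n) * np.sum(error)       # dJ/dc for the 1/2n cost above
    m -= lr * dm                       # simultaneous update of both parameters
    c -= lr * dc

print("learned m:", m, "learned c:", c)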

Regression Model Performance
Once you build the model, the next logical question that comes to mind is whether your model is
good enough to predict in the future, i.e. whether the relationship you built between the dependent and
independent variables is good enough or not.

For this purpose there are various metrics which we look into:

● Total Sum of Squares (TSS): TSS is a measure of the total variance in the response/
dependent variable Y and can be thought of as the amount of variability inherent in
the response before the regression is performed.
● Residual Sum of Squares (RSS): RSS measures the amount of variability that is left
unexplained after performing the regression.
● (TSS – RSS) measures the amount of variability in the response that is explained (or
removed) by performing the regression.
○ Where N is the number of observations used to fit the model, σx is the
standard deviation of x, and σy is the standard deviation of y.
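
For reference, a sketch of the standard definitions behind these quantities (with ŷᵢ the predicted value and ȳ the mean of the observed y values):

TSS = \sum_{i=1}^{N} (y_i - \bar{y})^2, \qquad RSS = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2, \qquad R^2 = 1 - \frac{RSS}{TSS}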

2. Root Mean Square Error (RMSE)

RMSE measures the dispersion of predicted values from actual values. The formula for
calculating RMSE is:
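
A sketch of the usual form (the slide's exact notation may differ), with yᵢ the actual and ŷᵢ the predicted value:

RMSE = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 }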

N : Total number of observations

Though RMSE is a good measure of error, the issue with it is that it is sensitive to the range of your
dependent variable. If your dependent variable has a narrow range, your RMSE will be low, and if the
dependent variable has a wide range, RMSE will be high. Hence, RMSE is a good metric to compare
different iterations of the same model, but less suitable for comparing across models.

3. Mean Absolute Percentage Error (MAPE)

To overcome the limitations of RMSE, analysts prefer MAPE over RMSE, since it expresses the
error in terms of percentages and is hence comparable across models. The formula for
calculating MAPE can be written as:
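
A commonly used form (a sketch; the slide's exact notation may differ), with yᵢ the actual and ŷᵢ the predicted value:

MAPE = \frac{100\%}{N} \sum_{i=1}^{N} \left| \frac{y_i - \hat{y}_i}{y_i} \right|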

N : Total number of observations


How to improve the accuracy of a regression model?

There is little you can do when your data violates the regression assumptions. An obvious solution is to
use tree-based algorithms, which capture non-linearity quite well. But if you want to stay with
regression, the following are some tips you can implement:

1. Some features may not have a linear relationship with the label; in that case your data is
suffering from non-linearity, and you should transform the independent variables using sqrt, log,
square, etc.

2. Sometimes your data may be suffering from heteroskedasticity; in that case transform the
dependent variable using sqrt, log, square, etc. In such a scenario you can also try the
weighted least squares method to tackle the problem.

3. If your data is suffering from multicollinearity, use a correlation matrix to check for
correlated variables. Let's say variables A and B are highly correlated. Instead of removing one
of them arbitrarily, use this approach: find the average correlation of A and B with the rest of
the variables, and remove whichever of the two has the higher average correlation. Alternatively,
you can use penalized regression methods such as lasso, ridge, elastic net, etc.

4. You can do variable selection based on p-values. If a variable shows a p-value > 0.05, we can
remove that variable from the model, since at p > 0.05 we fail to reject the null hypothesis that
the variable has no effect.

Multiple Linear Regression
So far we have been discussing the scenario where we have only one independent variable. If we
have more than one independent variable, the procedure for fitting a best-fit line is known as
“Multiple Linear Regression”.

Fundamentally there is no difference between ‘Simple’ & ‘Multiple’ linear regression. Both work on the
OLS principle, and the procedure to get the best line is also similar. In the case of the latter, the regression
equation takes a shape like:
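
A sketch of the general form, with β₀ … βₙ as the parameters and x₁ … xₙ as the independent variables:

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \varepsilon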

Here the number of parameters grows with the number of independent variables: if there are n
independent variables there are n + 1 parameters, and geometrically you can think of the fitted model
as an n-dimensional hyperplane.

Feature Engineering for Multivariate Linear Regression

One Hot Encoding

When some inputs are categories (e.g. gender) rather than numbers (e.g. age) we need to represent
the category values as numbers so they can be used in our linear regression equations.
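
As a minimal sketch (the DataFrame and column names here are hypothetical, not from the module's dataset), one-hot encoding can be done with pandas:

import pandas as pd

df = pd.DataFrame({"Age": [25, 32, 47],
                   "Gender": ["Male", "Female", "Male"]})

# Each category value becomes its own 0/1 (dummy) column.
encoded = pd.get_dummies(df, columns=["Gender"])
print(encoded)

Passing drop_first=True to pd.get_dummies drops one dummy column per category, which relates directly to the dummy variable trap discussed in the next section.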

Avoiding the Dummy variable trap

X = X[:, 1:]   # drop the first dummy column to avoid the dummy variable trap

NOTE: if you have n dummy variables, remove one dummy variable to avoid the dummy
variable trap. The linear regression implementations in R and Python generally take care of
this, but there is no harm in removing it ourselves.

Feature Scaling

When the input features are on very different scales (e.g. area in thousands of square feet versus number
of bedrooms), the features with larger ranges can dominate the model and slow down gradient descent.
Feature scaling (for example standardization or min-max normalization) brings all features onto a
comparable scale before fitting.
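
A minimal sketch of standardization with scikit-learn's StandardScaler (the feature values below are purely illustrative):

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1200.0, 2], [1800.0, 3], [3200.0, 4]])  # e.g. area, bedrooms
X_test = np.array([[2500.0, 3]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit the mean/std on training data only
X_test_scaled = scaler.transform(X_test)        # reuse the same mean/std on new data
print(X_train_scaled)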

Cross Validation

In k-fold cross-validation we split the data into k folds, train the model on k - 1 folds and evaluate it
on the remaining fold, repeating this k times so that every fold is used once as the test set.

• 5 folds = 5-fold CV

• 10 folds = 10-fold CV

• k folds = k-fold CV

More folds = more computationally expensive

Overfitting & Generalization

As we keep training our model, it may start to fit the training data more and
more accurately, but become worse at handling test data that we feed to it later.

This is known as “over-fitting” and results in an increased generalization error.

Large coefficients lead to overfitting

Penalizing large coefficients: Regularization

How to minimize?

• To minimize the generalization error we should

• Collect as much sample data as possible.

• Use a random subset of our sample data for training.

• Use the remaining sample data to test how well our model copes with data it was
not trained with.

L1 Regularization (Lasso)

(Least Absolute Shrinkage and Selection Operator)

● Having a large number of samples (n) relative to the number of dimensions (d)
increases the quality of our model.
● One way to reduce the effective number of dimensions is to use those that most contribute
to the signal and ignore those that mostly act as noise.
● L1 regularization achieves this by adding a penalty that results in the weight for the
dimensions that act as noise becoming 0.
● L1 regularization encourages a sparse vector of weights in which few are non-zero and many
are zero.

Depending on the regularization strength, certain weights can become zero, which makes the
LASSO also useful as a supervised feature selection technique:
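
A sketch of the L1-penalized cost in the notation used earlier (λ is the regularization strength; the slide's exact form may differ):

J(w) = \frac{1}{2n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j} |w_j|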

A limitation of the LASSO is that it selects at most n variables when the number of dimensions d is greater than n.

L2 Regularization (Ridge)

● Another way to reduce the complexity of our model and prevent overfitting to outliers is L2
regularization, which is also known as ridge regression.
● In L2 regularization we introduce an additional term to the cost function that has the effect
of penalizing large weights and thereby minimizing this skew.

Ridge regression is an L2 penalized model where we simply add the squared sum of the weights to
our least-squares cost function:
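
A sketch of the L2-penalized (ridge) cost, again with λ as the regularization strength:

J(w) = \frac{1}{2n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j} w_j^2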

By increasing the value of the hyperparameter λ , we increase the regularization strength and shrink
the weights of our model.

L1 & L2 Regularization (Elastic Net)

● L1 regularisation minimises the impact of dimensions that have low weights and are thus
largely “noise”.
● L2 regularisation minimises the impact of outliers in our training data.
● L1 & L2 regularisation can be used together, and the combination is referred to as Elastic Net
regularisation.
● Because the L1 penalty term is not differentiable everywhere, there is no simple closed-form
solution for w, so iterative methods such as gradient descent or coordinate descent are used instead.

Lasso regression for feature selection

● Can be used to select important features of a dataset.


● Shrinks the coefficients of less important features to exactly 0.
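
As an illustration on synthetic data (not the module's dataset), scikit-learn's Lasso shows this behaviour directly:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))   # 10 features, but only 3 actually matter
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)   # alpha plays the role of lambda

# Coefficients driven exactly to zero correspond to features LASSO has dropped.
print(lasso.coef_)
print("selected features:", np.where(lasso.coef_ != 0)[0])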

Multi-collinearity

Multi-collinearity describes the strength of the relationships among the independent variables
themselves. If there is multi-collinearity in our data, our beta coefficients may be misleading.
VIF (Variance Inflation Factor) is used to identify multi-collinearity; if a variable's VIF value is
greater than 4, we exclude that variable from our model building exercise.
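
A minimal sketch of computing VIF with statsmodels (the small DataFrame below is purely illustrative; replace it with your own independent variables):

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

df = pd.DataFrame({
    "area":     [1200, 1800, 3200, 3800, 4200],
    "bedrooms": [2, 3, 4, 4, 5],
    "age":      [10, 5, 8, 20, 2],
})

X = add_constant(df)   # VIF is normally computed with an intercept column included
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns)
print(vif)   # variables with VIF greater than 4 are candidates for removal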

Iterative Models

Model building is not a one-step process; one needs to run multiple iterations in order to reach a final
model. Take care of the p-value and VIF for variable selection, and R-squared & MAPE for model
selection.

Exercise 1 – Simple Linear Regression
Here we will take the salary dataset and apply simple linear regression to it using Python.
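
A minimal sketch of this exercise, assuming a CSV file named "Salary_Data.csv" with columns "YearsExperience" and "Salary" (adjust the file and column names to your copy of the dataset):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

dataset = pd.read_csv("Salary_Data.csv")
X = dataset[["YearsExperience"]].values   # independent variable (2-D array)
y = dataset["Salary"].values              # dependent variable

# Hold out part of the data so we can later check how well the line generalizes
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

regressor = LinearRegression()
regressor.fit(X_train, y_train)   # learns the slope (m) and intercept (c)

print("slope m    :", regressor.coef_[0])
print("intercept c:", regressor.intercept_)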

Predicting the test results
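
Continuing the sketch above, predictions on the held-out test set can be compared against the actual salaries:

from sklearn.metrics import mean_squared_error, r2_score

y_pred = regressor.predict(X_test)

print("RMSE:", mean_squared_error(y_test, y_pred) ** 0.5)
print("R^2 :", r2_score(y_test, y_pred))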

Exercise 2 - Multiple Linear Regression Model

The dataset for this example is taken from –

https://archive.ics.uci.edu/ml/machine-learning-databases/housing/

1. Title: Boston Housing Data

2. Sources:
(a) Origin: This dataset was taken from the StatLib library which is
maintained at Carnegie Mellon University.
(b) Creator: Harrison, D. and Rubinfeld, D.L. 'Hedonic prices and the
demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.
(c) Date: July 7, 1993

4. Relevant Information:

Concerns housing values in suburbs of Boston.

5. Number of Instances: 506

6. Number of Attributes: 13 continuous attributes (including "class"


attribute "MEDV"), 1 binary-valued attribute.

7. Attribute Information:

1. CRIM per capita crime rate by town


2. ZN proportion of residential land zoned for lots over
25,000 sq.ft.
3. INDUS proportion of non-retail business acres per town
4. CHAS Charles River dummy variable (= 1 if tract bounds
river; 0 otherwise)
5. NOX nitric oxides concentration (parts per 10 million)
6. RM average number of rooms per dwelling
7. AGE proportion of owner-occupied units built prior to 1940
8. DIS weighted distances to five Boston employment centres
9. RAD index of accessibility to radial highways
10. TAX full-value property-tax rate per $10,000
11. PTRATIO pupil-teacher ratio by town
12. B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks
by town
13. LSTAT % lower status of the population
14. MEDV Median value of owner-occupied homes in $1000's

8. Missing Attribute Values: None.
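
A minimal sketch of fitting a multiple linear regression to this dataset, assuming the raw "housing.data" file from the URL above has been downloaded locally (whitespace-separated, no header row):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

columns = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
           "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]
df = pd.read_csv("housing.data", sep=r"\s+", header=None, names=columns)

X = df.drop(columns="MEDV")   # the 13 predictors
y = df["MEDV"]                # median home value in $1000's (target)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
print("R^2 on the test set:", r2_score(y_test, model.predict(X_test)))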

Cross-validation in scikit-learn
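
A minimal sketch of k-fold cross-validation in scikit-learn, reusing the Boston housing features X and target y built in Exercise 2 above:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

model = LinearRegression()

# cv=5 gives 5-fold CV: each fold takes one turn as the held-out test set.
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print("R^2 per fold:", scores)
print("mean R^2    :", scores.mean())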

© Copyright BootUP 2019

BootUP
Noble House, 30th Floor
Jl. Dr.Ide Anak Agung Gde Agung
Mega Kuningan, Jakarta 12950

Jakarta, Indonesia
2019

INDONESIA
Noble House, 30th Floor
Jl. Dr. Ide Anak Agung Gde Agung
Mega Kuningan, Jakarta 12950
+62 811 992 1500
info@bootup.ai

UNITED STATES
68 Willow Rd
Menlo Park, CA 94025
Silicon Valley, USA
+1 800 493 1945
info@bootupventures.com

www.bootup.ai
