You can apply regression to scenarios that require prediction or causal inference.
You can use regression to understand the extent to which the area of a house
affects its price.
For example, the price of a wine bottle can vary depending on the average growing
season temperature.
For example, if the area of the house is the independent variable and the price of
the house is the dependent variable, you cannot conclude from the regression alone
that a larger area causes a higher price.
Topic Summary
Correlation is a measure that describes the strength of the relationship between two
variables.
Regression describes this relationship in more detail.
Measure of Quality
You have seen how to fit a model that best describes the data. However, you can
never get a perfect fit.
How will you measure the error (deviation) in a model that is fit to the data?
Sum of Squared Errors (SSE) is a measure of the quality of the regression line.
If there are n data points, then the SSE is the sum of the squares of the n residual
errors.
SSE is small for the line of best fit and large for the baseline model.
So, is there a better way to gauge the quality of the regression model?
RMSE
At times, the SSE is difficult to interpret because its units are the square of the
units of y. So, the alternative measure of quality is the Root Mean Square Error
(RMSE).
RMSE shrinks the magnitude of the error by taking the square root of SSE divided by
the number of observations (n): RMSE = sqrt(SSE / n).
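The two measures above can be computed in a few lines. This is a minimal sketch, assuming NumPy is available and borrowing the house price and units sold figures from the data listed later in these notes; the variable names are illustrative only.

```python
import numpy as np

# Data borrowed from the house-sales example later in these notes
# (price in thousands of dollars, number of houses sold).
price = np.array([160, 180, 200, 220, 240, 260, 280], dtype=float)
sales = np.array([126, 103, 82, 75, 82, 40, 20], dtype=float)

# Fit a straight line; for degree 1, np.polyfit returns [slope, intercept].
slope, intercept = np.polyfit(price, sales, 1)
predicted = intercept + slope * price

residuals = sales - predicted        # residual error of each data point
sse = np.sum(residuals ** 2)         # Sum of Squared Errors (squared units of y)
rmse = np.sqrt(sse / len(sales))     # Root Mean Square Error (same units as y)

print(f"SSE  = {sse:.2f}")
print(f"RMSE = {rmse:.2f}")
```

Because RMSE is expressed in the same units as y (houses sold), it is usually easier to read than SSE.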
The SSE value for the baseline model is the Total Sum of Squares (SST).
R-squared is defined as 1 - SSE/SST.
R-squared = 0 means the model is only as good as the baseline and there is no
improvement over the baseline model.
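As a rough illustration of how SSE, SST and R-squared relate, the sketch below treats "always predict the mean of y" as the baseline model and uses the standard definition R-squared = 1 - SSE/SST. The data is borrowed from the house-sales example later in these notes, and the variable names are assumptions made for this sketch.

```python
import numpy as np

# Data borrowed from the house-sales example later in these notes.
price = np.array([160, 180, 200, 220, 240, 260, 280], dtype=float)
sales = np.array([126, 103, 82, 75, 82, 40, 20], dtype=float)

# SSE of the fitted regression line.
slope, intercept = np.polyfit(price, sales, 1)
sse = np.sum((sales - (intercept + slope * price)) ** 2)

# Baseline model: predict the mean of y for every observation.
# The SSE of this baseline is the Total Sum of Squares (SST).
sst = np.sum((sales - sales.mean()) ** 2)

# R-squared: the fraction of the baseline error removed by the model.
r_squared = 1 - sse / sst
print(f"SST = {sst:.2f}, SSE = {sse:.2f}, R-squared = {r_squared:.3f}")
```

If the fitted line were no better than the baseline, SSE would equal SST and R-squared would be 0.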
Model Interpretation
This is the equation for the line of best fit:
y = 249.85714 - 0.7928571x
For a unit increase in the price of the house, about 0.793 fewer houses are sold.
B0 is 249.85714
B1 is -0.7928571
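One way to see where B0 and B1 come from is the closed-form least-squares solution for a single independent variable. The sketch below assumes the price and units sold figures listed later in these notes are the data behind this equation; on that data the formulas appear to reproduce the coefficients above.

```python
import numpy as np

# Price (x) and houses sold (y), taken from the data listed later in these notes.
x = np.array([160, 180, 200, 220, 240, 260, 280], dtype=float)
y = np.array([126, 103, 82, 75, 82, 40, 20], dtype=float)

# Closed-form simple linear regression:
#   B1 = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
#   B0 = mean_y - B1 * mean_x
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(f"B0 = {b0:.5f}, B1 = {b1:.7f}")

# Interpretation of B1: raising the price by one unit changes the
# predicted number of houses sold by exactly B1.
change = (b0 + b1 * 201) - (b0 + b1 * 200)
print(f"Change in prediction for a unit price increase: {change:.4f}")
```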
Descriptive Statistics
R-squared: Measure that says how well the model has performed with respect to the
baseline model.
# Note: load_boston was removed in scikit-learn 1.2, so this needs an older version
from sklearn.datasets import load_boston
import pandas as pd
import statsmodels.api as sm
# Load the Boston housing data into a DataFrame
boston = load_boston()
dataset = pd.DataFrame(data=boston.data, columns=boston.feature_names)
dataset['target'] = boston.target
print(dataset.head())
# Regress the target on RM only
X = dataset['RM']
Y = dataset['target']
X = sm.add_constant(X)  # add the intercept column
statsModel = sm.OLS(Y, X)
fittedModel = statsModel.fit()
print(fittedModel.summary())
r_squared = <type the value here>  # from the output of the previous step
Got the value - 0.48400, It should be 0.90
MLR Representation
y - Dependent variable
x - Independent variable
e - Error measure
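With these symbols, the general form of a multiple linear regression with n independent variables is commonly written as below. The subscripts and the coefficient names B_0 to B_n are notation assumed here, chosen to match the B0, B1, B2 used in these notes.

```latex
y = B_0 + B_1 x_1 + B_2 x_2 + \dots + B_n x_n + e
```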
MLR
Multiple Regression helps in predicting a single variable using multiple
independent variables. This can improve the accuracy of the model.
During the model fitting process, some variables will contribute significantly to
the model but some might not. It is better to remove variables that are not
significant to the model.
So, how do we check if a variable is significant for the output? Let's take a look
at that in the following cards.
For example,
MLR Data
Price(thousands of $) x
MLR Equation
The MLR equation is y = 252.85965 - 0.824935 x1 + 0.3592748 x2
The number of houses sold is a linear function of both the price of a house (x1) and
the number of cars sold (x2).
A unit increase in the number of cars sold increases the number of houses sold by
about 0.36, with the price held fixed.
B0 252.85965
B1 -0.824935
B2 0.3592748
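To make the interpretation of the coefficients concrete, the equation can be turned into a small prediction function. This is only a sketch: the function name predict_houses_sold and the example inputs (a price of 200 and 10 cars sold) are hypothetical and chosen for illustration.

```python
# Coefficients of the MLR equation given above.
B0 = 252.85965
B1 = -0.824935   # effect of house price
B2 = 0.3592748   # effect of number of cars sold

def predict_houses_sold(price, cars_sold):
    """Hypothetical helper: evaluate y = B0 + B1*x1 + B2*x2."""
    return B0 + B1 * price + B2 * cars_sold

base = predict_houses_sold(200, 10)
one_more_car = predict_houses_sold(200, 11)

print(f"Predicted houses sold: {base:.2f}")
# With the price held fixed, one more car sold shifts the prediction by B2.
print(f"Effect of one more car sold: {one_more_car - base:.4f}")
```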
You will see the computation of B0, B1 and B2 in the next set of cards.
The data consists of the price of the house, the number of units sold and the
number of cars sold.
Let us create a dataframe from the lists using the following code.
import pandas as pd
import statsmodels.api as sm
price = [160,180,200,220,240,260,280]
sale = [126,103,82,75,82,40,20]
cars = [0,9,19,5,25,1,20]
# x = house price, y = houses sold, z = cars sold
houseDf = pd.DataFrame({'x': price, 'y': sale, 'z': cars})
X = houseDf.drop(['y'], axis=1)  # independent variables x and z
y = houseDf.y                    # dependent variable
Xc = sm.add_constant(X)          # add the intercept column
linear_regression = sm.OLS(y, Xc)
fitted_model = linear_regression.fit()
print(fitted_model.summary())
x (house price): -0.557 is 0.098; this term is also significant in predicting the
output.
z (car sales): 0.322 is 0.668; this term is not so significant in predicting the
output.
If the coef is zero, that independent variable has no linear effect on the dependent
variable in the fitted model and does not help predict it.
Std err (standard error) denotes how much each coefficient estimate is expected to
vary from sample to sample.
Handling Multicollinearity
A good practice while fitting multiple regression model is to check if there is any
correlation among the independent variables.
In Python, for a pandas DataFrame X, the command to find the pairwise correlations
between its columns is X.corr().
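As a quick illustration, the sketch below applies corr() to the two independent variables of the house-sales example (x for price, z for cars sold). It assumes the data is held in a pandas DataFrame, since corr() is a DataFrame method and is not available on plain NumPy arrays.

```python
import pandas as pd

# Independent variables from the house-sales example.
X = pd.DataFrame({
    "x": [160, 180, 200, 220, 240, 260, 280],   # house price
    "z": [0, 9, 19, 5, 25, 1, 20],              # cars sold
})

# Pairwise Pearson correlation between every pair of columns.
corr_matrix = X.corr()
print(corr_matrix)
```

The diagonal is always 1 (each variable against itself); the off-diagonal entries are the ones to inspect for multicollinearity.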
Tips
Keep the coefficients with a low Pr(>|t|) value (p-value).
Consider rejecting a variable whose correlation with any other independent variable
falls outside the range -0.7 to 0.7.
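The correlation rule of thumb can be automated with a small helper. This is a sketch only: the function highly_correlated_pairs is hypothetical, and the demo data is made up for illustration (area_sqm is simply area_sqft converted to square metres, so the two columns are almost perfectly correlated).

```python
import pandas as pd

def highly_correlated_pairs(X, threshold=0.7):
    """Hypothetical helper: list column pairs whose |correlation| exceeds threshold."""
    corr = X.corr()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):   # upper triangle only, skip the diagonal
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], round(float(r), 3)))
    return pairs

# Made-up demo data, for illustration only.
demo = pd.DataFrame({
    "area_sqft": [800, 1000, 1200, 1500, 1800, 2100],
    "area_sqm": [74.3, 92.9, 111.5, 139.4, 167.2, 195.1],
    "age_years": [30, 5, 22, 12, 40, 8],
})

flagged = highly_correlated_pairs(demo)
print(flagged)
```

For each flagged pair, one of the two variables is a candidate for removal before fitting the model.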
Data Prep
Hope you've understood how to deal with multiple variables and perform multiple
regressions. Let us consider the dataset created using the following code for
further practice.
Hands On Prep
From the previous card, load all the variables other than target into a variable
named X.
Run a correlation among all independent variables to check for multicollinearity.
Occam's razor
When you have two Multiple Regression Models fit for a given data set, if one is
simple and another is complex, choose the simple model.
Whenever you are in the Model Building exercise, start with a simple model and
then build complexity on top of it.
Data Understanding
Understanding data is the most important step before building a model. You should
not apply regression for the sake of applying it. After applying regression, work to
interpret the results and derive the appropriate insights required for further
analysis.
TIPS
Do not discard theoretical considerations based on statistical measures.
Feature Scaling
Your dataset might contain features (independent variables, stored as columns)
with very different magnitudes. Bringing them to a common scale makes them easier
to work with. This process is called feature scaling.
You can achieve feature scaling with the help of either Normalization or
Standardization, depending on the magnitude of the variables.
Normalization
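Normalization is the process of re-scaling values to the range [-1, 1]. In the
example below, preprocessing.normalize does this by dividing each row by its
Euclidean (L2) norm.
Python has ready-made packages for re-scaling the data: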
from sklearn import preprocessing
import numpy as np
sampleData = np.array([[ -3., -1., 4.]])
normalized_sampleData = preprocessing.normalize(sampleData)
normalized_sampleData
output: array([[-0.58834841, -0.19611614, 0.78446454]])
Standardization
Standardization is the process of removing the arithmetic mean and dividing by the
standard deviation.
Standardization in Python is done in the following way:
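The original code for this step is not shown here. One possible sketch, assuming scikit-learn is available, uses preprocessing.scale, which standardizes each column (feature) separately. The sample values are the same as in the normalization example, arranged as a single column because a column with only one value cannot be standardized meaningfully.

```python
from sklearn import preprocessing
import numpy as np

# One feature (column) with three observations (rows).
sampleData = np.array([[-3.0], [-1.0], [4.0]])

# Subtract the column mean and divide by the column standard deviation.
standardized_sampleData = preprocessing.scale(sampleData)
print(standardized_sampleData)

# After standardization the column has mean 0 and standard deviation 1.
print(standardized_sampleData.mean(axis=0), standardized_sampleData.std(axis=0))
```

When the same scaling has to be reused on new data, scikit-learn's StandardScaler class is a common alternative, since it stores the mean and standard deviation learned from the training data.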