Linear regression is a predictive modeling technique for predicting a numeric response variable
based on one or more explanatory variables. The term "regression" in predictive modeling
generally refers to any modeling task that involves predicting a real number (as opposed to
classification, which involves predicting a category or class). The term "linear" in the name linear
regression refers to the fact that the method models the data as a linear combination of the
explanatory variables. A linear combination is an expression in which one or more variables are
scaled by constant factors and added together. In the case of a single
explanatory variable, the linear combination used in linear regression can be expressed as:
y = b0 + b1 * x
The right side of the equation defines a line with a certain y-intercept (b0) and slope (b1) times the
explanatory variable. In other words, linear regression in its most basic form fits a straight line to
the response variable. The model fits the line that minimizes the squared differences between the
observed responses and the line's predictions (these differences are also called errors or
residuals). We won't go into all the math behind how the model actually
minimizes the squared errors, but the end result is a line intended to give the "best fit" to the
data. Since linear regression fits data with a line, it is most effective in cases where the response
and explanatory variables have a linear relationship.
First, let's load some libraries and take a look at the data to get a sense of its shape:
• Pandas – package for data loading and preparation
• Sklearn – scikit-learn provides functions for many machine learning algorithms
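The setup cell itself is not shown in the capture; a minimal sketch that reproduces the data table below, assuming the values are entered inline rather than read from a file:

```python
import pandas as pd
import numpy as np

# Sample data matching the Out[3] table below (10 houses)
data = pd.DataFrame({
    'House Price': [245, 312, 279, 308, 199, 219, 405, 324, 319, 255],
    'Square Feet': [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700],
    'City':        ['N', 'Y', 'N', 'Y', 'N', 'N', 'Y', 'Y', 'Y', 'N'],
})
data  # displaying the frame at the end of the cell yields the Out[3] table
```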
Out[3]:
House Price Square Feet City
0 245 1400 N
1 312 1600 Y
2 279 1700 N
3 308 1875 Y
4 199 1100 N
5 219 1550 N
6 405 2350 Y
7 324 2450 Y
8 319 1425 Y
9 255 1700 N
In [6]: data.dtypes  # "dtypes" returns each column's data type; the object-typed
        # (categorical) column is the one we pass to "get_dummies"
        # to convert its categories to dummy columns.
DUMMY VARIABLES: To use categorical variables in our modelling, we need to
convert them into dummy variables.
First, look at the unique values in the categorical variable; when the data is huge, use
nunique() to count them.
If a categorical variable has n unique values, we introduce (n - 1) dummy variables.
Note that City is a categorical variable with 2 unique values, hence one dummy variable will
be created.
In [7]: obj = data.dtypes == np.object  # boolean mask selecting the object (categorical) columns
        for i in data.columns[obj]:
            dummy = pd.get_dummies(data[i], prefix='City', drop_first=True)
            # "drop_first" drops the first category in order to avoid the multicollinearity problem
            # "prefix" adds the given prefix to all dummy columns created for the categorical variable
        print(dummy)
City_Y
0 0
1 1
2 0
3 1
4 0
5 0
6 1
7 1
8 1
9 0
data1 = pd.concat([data, dummy], axis=1)  # append the dummy column (step implied by the "Head" output below)
print("Head:\n", data1.head())
obj1 = data1.dtypes == np.object
data1 = data1.drop(data1.columns[obj1], axis=1)  # "drop" removes the original column of the categorical variable
print("Head After Removal:\n", data1.head())
Head:
House Price Square Feet City City_Y
0 245 1400 N 0
1 312 1600 Y 1
2 279 1700 N 0
3 308 1875 Y 1
4 199 1100 N 0
Head After Removal:
House Price Square Feet City_Y
0 245 1400 0
1 312 1600 1
2 279 1700 0
3 308 1875 1
4 199 1100 0
In [11]: # Declare the dependent variable and create your independent and dependent datasets
         # Separate the dependent & independent variables into two objects.
dep='House Price'
X = data1.drop(dep, axis=1)
Y = data1[dep]
In [12]: X
Out[12]:
Square Feet City_Y
0 1400 0
1 1600 1
2 1700 0
3 1875 1
4 1100 0
5 1550 0
6 2350 1
7 2450 1
8 1425 1
9 1700 0
In [13]: Y
Out[13]: 0 245
1 312
2 279
3 308
4 199
5 219
6 405
7 324
8 319
9 255
Name: House Price, dtype: int64
Using visualisation, you should be able to judge which variables have a linear relationship with y.
Start by using Seaborn’s pairplot.
Additional parameters to use:
kind='reg': adds a line of best fit with a 95% confidence band. The line is fit by minimizing
the sum of squared errors.
STEP 4: SPLITTING OUR DATA INTO TRAIN & TEST DATA
Split the data into train & test sets for building and validating the model.
• We use the model_selection module in the sklearn package to split the data into train & test sets
By passing our X and y variables into the train_test_split method, we are able to capture the
splits in data by assigning 4 variables to the result.
XTrain:
Square Feet City_Y
2 1700 0
4 1100 0
7 2450 1
1 1600 1
0 1400 0
8 1425 1
6 2350 1
3 1875 1
XTest:
Square Feet City_Y
9 1700 0
5 1550 0
YTrain:
2 279
4 199
7 324
1 312
0 245
8 319
6 405
3 308
Name: House Price, dtype: int64
YTest:
9 255
5 219
Name: House Price, dtype: int64
The "test_size" parameter specifies the size of the test data, in our case 20%.
"random_state" seeds the random number generator so the same sample set is generated each
time, which helps in replicating the results.
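The split can be sketched as follows; the random_state value below is a placeholder, since the seed used to produce the split shown above is not visible in the capture:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

data1 = pd.DataFrame({
    'House Price': [245, 312, 279, 308, 199, 219, 405, 324, 319, 255],
    'Square Feet': [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700],
    'City_Y':      [0, 1, 0, 1, 0, 0, 1, 1, 1, 0],
})
X = data1.drop('House Price', axis=1)
Y = data1['House Price']

# 80/20 split; random_state seeds the shuffle so the split is reproducible
# (random_state=1 is an assumed placeholder, not the original seed)
XTrain, XTest, YTrain, YTest = train_test_split(X, Y, test_size=0.2, random_state=1)
```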
C:\Users\Shivani\Anaconda3\lib\site-packages\scipy\stats\stats.py:1390:
UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n
=8
"anyway, n=%i" % int(n))
Out[19]:
OLS Regression Results (summary table truncated in the capture; Time: 15:41:42,
Log-Likelihood: -41.550, Df Model: 2)
The positive slopes imply that both variables have a positive effect on house prices.
The p-value of 0.000 for Square Feet implies that its effect on house prices is
statistically significant (using p < 0.05 as the rejection rule).
The p-value of 0.522 for City_Y implies that the effect of City on house prices is statistically
insignificant for the current data set, and hence the variable should be removed from the data
(covered below).
R-squared is a statistical measure of how close the data are to the fitted regression line. It is also
known as the coefficient of determination, or the coefficient of multiple determination for multiple
regression.
R-Squared
The definition of R-squared is fairly straightforward: it is the percentage of the response-variable
variation that is explained by a linear model. In particular:
0% indicates that the model explains none of the variability of the response data around its
mean.
100% indicates that the model explains all the variability of the response data around its mean.
In general, the higher the R-squared, the better the model fits your data.
One caveat: if a model has too many predictors and high-order polynomial terms, it begins to model
the random noise in the data. This condition is known as overfitting the model, and it produces
misleadingly high R-squared values and a lessened ability to make predictions.
Adjusted R-squared
Both adjusted R-squared and predicted R-squared provide information that helps you assess the
number of predictors in your model:
Use adjusted R-squared to compare models with different numbers of predictors. Use
predicted R-squared to determine how well the model predicts new observations and whether the
model is too complicated.
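Since City_Y is not significant here, it can be dropped before refitting. A sketch of the cell that likely produced Out[20] below; the names X2 and Y2 appear in later cells, but data2 and the drop step itself are assumptions:

```python
import pandas as pd

data1 = pd.DataFrame({
    'House Price': [245, 312, 279, 308, 199, 219, 405, 324, 319, 255],
    'Square Feet': [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700],
    'City_Y':      [0, 1, 0, 1, 0, 0, 1, 1, 1, 0],
})
# Drop the statistically insignificant dummy and rebuild the X/Y objects
data2 = data1.drop('City_Y', axis=1)
X2 = data2.drop('House Price', axis=1)
Y2 = data2['House Price']
```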
Out[20]:
House Price Square Feet
0 245 1400
1 312 1600
2 279 1700
3 308 1875
4 199 1100
5 219 1550
6 405 2350
7 324 2450
8 319 1425
9 255 1700
In [23]: X2
Out[23]:
Square Feet
0 1400
1 1600
2 1700
3 1875
4 1100
5 1550
6 2350
7 2450
8 1425
9 1700
In [24]: Y2
Out[24]: 0 245
1 312
2 279
3 308
4 199
5 219
6 405
7 324
8 319
9 255
Name: House Price, dtype: int64
In [29]: # Run the model
         # Ordinary Least-Squares (OLS) Regression Method
         lm2 = sm.OLS(Y2_train, X2_train).fit()
         lm2.summary()  # to view the OLS regression results, we call the .summary() method
C:\Users\Shivani\Anaconda3\lib\site-packages\scipy\stats\stats.py:1390:
UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n
=8
"anyway, n=%i" % int(n))
Out[29]:
OLS Regression Results (summary table truncated in the capture; Df Model: 1)
In [30]: lm2.predict(X2_test)
Out[30]: 9 285.394541
5 260.212670
dtype: float64
2 285.394541
4 184.667056
7 411.303897
1 268.606627
0 235.030798
8 239.227777
6 394.515983
3 314.773391
dtype: float64
2 6.394541
4 -14.332944
7 87.303897
1 -43.393373
0 -9.969202
8 -79.772223
6 -10.484017
3 6.773391
dtype: float64
In [32]: #Predict
pred_test = lm2.predict(X2_test)
err_test = pred_test - Y2_test
print(pred_test)
print(err_test)
9 285.394541
5 260.212670
dtype: float64
9 30.394541
5 41.212670
dtype: float64
Out[33]: Text(0.5,1,'Main')
Out[34]: 35.8036053130929
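The Out[34] value matches the mean absolute error (MAE) of the two test residuals, (30.394541 + 41.212670) / 2; a sketch, assuming the err_test series from the cell above:

```python
import numpy as np
import pandas as pd

# Test-set residuals from the In [32] output above
err_test = pd.Series([30.394541, 41.212670], index=[9, 5])

# Mean Absolute Error: average magnitude of the residuals
mae = np.mean(np.abs(err_test))
print(mae)  # 35.8036...
```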
mape=np.mean(np.abs((Y2_train - pred_train) / Y2_train)) * 100
mape
Out[36]: 10.526505212851921
Three statistics are used in Ordinary Least Squares (OLS) regression to evaluate model fit: R-
squared, the overall F-test, and the Root Mean Square Error (RMSE). All three are based on two
sums of squares: Sum of Squares Total (SST) and Sum of Squares Error (SSE). SST measures
how far the data are from the mean, and SSE measures how far the data are from the model’s
predicted values. Different combinations of these two values provide different information about
how the regression model compares to the mean model.
The difference between SST and SSE is the improvement in prediction from the regression
model, compared to the mean model. Dividing that difference by SST gives R-squared. It is the
proportional improvement in prediction from the regression model, compared to the mean model.
It indicates the goodness of fit of the model.
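In symbols, the relationship just described is:

```latex
R^2 = \frac{SST - SSE}{SST} = 1 - \frac{SSE}{SST}
```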
R-squared has the useful property that its scale is intuitive: it ranges from zero to one, with zero
indicating that the proposed model does not improve prediction over the mean model, and one
indicating perfect prediction. Improvement in the regression model results in proportional
increases in R-squared.
RMSE
The RMSE is the square root of the variance of the residuals. It indicates the absolute fit of the
model to the data, that is, how close the observed data points are to the model's predicted values.
Whereas R-squared is a relative measure of fit, RMSE is an absolute measure of fit. As the
square root of a variance, RMSE can be interpreted as the standard deviation of the unexplained
variance, and has the useful property of being in the same units as the response variable. Lower
values of RMSE indicate better fit. RMSE is a good measure of how accurately the model
predicts the response, and it is the most important criterion for fit if the main purpose of the
model is prediction.
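As a sketch, RMSE can be computed directly from the residuals (here the train-set residuals printed earlier):

```python
import numpy as np

# Train-set residuals from the earlier output
resid = np.array([6.394541, -14.332944, 87.303897, -43.393373,
                  -9.969202, -79.772223, -10.484017, 6.773391])

# RMSE: square root of the mean squared residual, expressed
# in the same units as the response variable (house price)
rmse = np.sqrt(np.mean(resid ** 2))
print(rmse)  # roughly 45.2
```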
The best measure of model fit depends on the researcher's objectives, and more than one measure
is often useful. The statistics discussed above are applicable to regression models that use OLS
estimation.
Out[37]: Text(0,0.5,'Residuals')
In [38]: print(lm2.predict())
In [40]: #Homoscedasticity
resid = lm2.resid
plt.scatter(lm2.predict(), resid)
Out[40]: <matplotlib.collections.PathCollection at 0x24ea391b390>