
Unit-II

Model Diagnostics and Multiple Linear Regression

Simple Linear Model diagnostics


A regression model must be validated to ensure its validity and goodness of fit before it can be used
for practical applications.

The following measures are used to validate the simple linear regression models:
1. Co-efficient of determination (R-squared).
2. Hypothesis test for the regression coefficient.
3. Analysis of variance for overall model validity (important for multiple linear regression).
4. Residual analysis to validate the regression model assumptions.
5. Outlier analysis, since the presence of outliers can significantly impact the regression parameters.
Co-efficient of Determination (R-Squared or R2)
The primary objective of regression is to explain the variation in Y using the knowledge of X. The co-efficient of determination (R-squared or R2) measures the percentage of variation in Y explained by the model (B0 + B1X).
The total variation in the outcome variable of a simple linear regression model can be broken into:
1. Variation in the outcome variable explained by the model (SSR).
2. Unexplained variation (SSE), as shown in the equation:
SST = SSR + SSE, where SST = Σ(Yi − Ȳ)², SSR = Σ(Ŷi − Ȳ)², and SSE = Σ(Yi − Ŷi)².
R2 = SSR / SST = 1 − SSE / SST

Hypothesis Test for the Regression Co-efficient

The null and alternative hypotheses are
H0 : B1 = 0
HA : B1 ≠ 0
The corresponding t-statistic is given by t = b1 / Se(b1), where b1 is the estimated value of B1 and Se(b1) is its standard error.
Analysis of Variance (ANOVA) in Regression Analysis
ANOVA is used to check the overall validity of the regression model, in the case of a multiple linear
regression model with k features.
The null and alternative hypotheses are given by
H0 : B1 = B2 = … = Bk = 0
HA : Not all regression coefficients are zero
The corresponding F-statistic is given by
F = MSR / MSE = (SSR / k) / (SSE / (n − k − 1))
where n is the number of observations and k is the number of features.

The F-test is used for checking whether the overall regression model is statistically significant or not.
Regression Model Summary Using Python
The function summary2() prints the model summary, which contains the information required for
diagnosing a regression model.
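The snippets in this section assume a fitted simple linear regression model named salary_lm (a similar model appears later as mba_salary_lm). A minimal sketch of how such a model might be built; the file name and column names are assumptions for illustration:

import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Assumed dataset: MBA salary data with 'Percentage in Grade 10' as X and 'Salary' as Y
salary_df = pd.read_csv( 'MBA Salary.csv' )
X = sm.add_constant( salary_df['Percentage in Grade 10'] )  # adds the intercept term B0
salary_lm = sm.OLS( salary_df['Salary'], X ).fit()  # estimates B0 and B1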
salary_lm.summary2()

Residual Analysis
Residual (error) analysis is important to check whether the assumptions of regression models have been
satisfied. It is performed to check the following:
1. The residuals are normally distributed.
2. Variance of residual is constant (homoscedasticity).
3. The functional form of regression is correctly specified.
4. There are no outliers.
Check for Normal Distribution of Residual
The normality of residuals can be checked using the probability−probability plot (P-P plot). The P-P plot
compares the cumulative distribution functions of two probability distributions against each other.
salary_resid = salary_lm.resid
probplot = sm.ProbPlot( salary_resid )
probplot.ppplot( line = '45' )  # P-P plot with a 45-degree reference line
plt.show()

In the P-P plot, the diagonal line is the cumulative distribution of a normal distribution, whereas the dots
represent the cumulative distribution of the residuals.
Test of Homoscedasticity
Homoscedasticity can be observed by drawing a residual plot, which is a plot between the standardized
residual values and the standardized predicted values.
get_standardized_values() creates the standardized values of a series of values (variable).

def get_standardized_values( vals ):
    return (vals - vals.mean()) / vals.std()

plt.scatter( get_standardized_values( salary_lm.fittedvalues ),
             get_standardized_values( salary_resid ) )
plt.show()

Outlier Analysis
Outliers are observations whose values show a large deviation from the mean value. The following measures are used to identify outliers:
1. Z-Score
2. Mahalanobis Distance
3. Cook’s Distance
4. Leverage Values
Z-Score
Z-score is the standardized distance of an observation from its mean value. For the dependent variable Y, the Z-score of an observation Yi is given by
Z = (Yi − Ȳ) / σY
where Ȳ is the mean and σY is the standard deviation of Y. Observations with |Z| > 3 are usually treated as outliers.

from scipy.stats import zscore

salary_df['z_score_salary'] = zscore( salary_df.Salary )
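The computed Z-scores can then be used to flag outliers; a minimal sketch, assuming a cut-off of 3:

# rows whose salary deviates more than 3 standard deviations from the mean
salary_df[ (salary_df.z_score_salary > 3.0) | (salary_df.z_score_salary < -3.0) ]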
Cook’s Distance
Cook’s distance measures how much the predicted value of the dependent variable changes for all the
observations in the sample when a particular observation is excluded from the sample for the estimation
of regression parameters.
A Cook’s distance value of more than 1 indicates a highly influential observation.
mba_influence = mba_salary_lm.get_influence()
(c, p) = mba_influence.cooks_distance
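The distances can be plotted to spot influential observations; a minimal sketch using matplotlib:

plt.stem( np.arange( len( c ) ), np.round( c, 3 ), markerfmt=',' )  # one stem per observation
plt.xlabel( 'Row index' )
plt.ylabel( "Cook's distance" )
plt.show()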
Leverage Values
Leverage value of an observation measures the influence of that observation on the overall fit of the
regression function and is related to the Mahalanobis distance. A leverage value of more than 3(k + 1)/n is
treated as a highly influential observation, where k is the number of features and n is the number of observations.
from statsmodels.graphics.regressionplots import influence_plot
fig, ax = plt.subplots( figsize = (8, 6) )
influence_plot( mba_salary_lm, ax = ax )
plt.show()
Making Predictions and Measuring Accuracy
Predicting using the Validation Set
The model variable has a method predict(), which takes the X parameters and returns the predicted
values.
pred_y = mba_salary_lm.predict( test_X )
Finding R-Squared and RMSE
Mean Square Error (MSE), Root Mean Square Error (RMSE) and Mean Absolute Percentage Error (MAPE)
are some of the frequently used measures.
from sklearn.metrics import r2_score, mean_squared_error

np.abs( r2_score( test_y, pred_y ) )  # R-squared on the validation set
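RMSE can then be obtained as the square root of the MSE returned by sklearn; a minimal sketch:

rmse = np.sqrt( mean_squared_error( test_y, pred_y ) )  # root mean square error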
Calculating Prediction Intervals
wls_prediction_std() returns the standard errors and the prediction intervals while making a prediction.
from statsmodels.sandbox.regression.predstd import wls_prediction_std

pred_y = mba_salary_lm.predict( test_X )
# standard error and lower/upper limits of the interval; alpha = 0.1 gives a 90% prediction interval
_, pred_y_low, pred_y_high = wls_prediction_std( mba_salary_lm, test_X, alpha = 0.1 )

MULTIPLE LINEAR REGRESSION


Multiple Linear Regression (MLR) model is a supervised learning algorithm for finding the existence of
an association relationship between a dependent variable and several independent variables.
The functional form of MLR is given by
Y = B0 + B1X1 + B2X2 + … + BkXk + e
The regression coefficients B1, B2, …, Bk are called partial regression coefficients.
The assumptions that are made in multiple linear regression model are as follows:
1. The regression model is linear in regression parameters (B-values).
2. The residuals follow a normal distribution and the expected value (mean) of the residuals is zero.
3. In time series data, residuals are assumed to be uncorrelated.
4. The variance of the residuals is constant for all values of Xi . When the variance of the residuals is
constant for different values of Xi , it is called homoscedasticity. A non-constant variance of residuals is
called heteroscedasticity.
5. There is no high correlation between independent variables in the model (called multi-collinearity).
Multi-collinearity can destabilize the model and can result in an incorrect estimation of the regression
parameters
Developing Multiple Linear Regression Model Using Python
The various steps involved in developing a multiple linear regression model using Python are described below.
1. Loading the Dataset
Load the data from the IPL IMB381IPL2013.csv file and print the metadata.
ipl_auction_df = pd.read_csv( 'IPL IMB381IPL2013.csv' )
ipl_auction_df.info()
2. Displaying the First Five Records
The indexer df.iloc[] is used for displaying a subset of the dataset.
ipl_auction_df.iloc[0:5, 0:10]
The following statement selects the columns to be included as features for model building.
X_features = ipl_auction_df.columns  # here all columns are taken as candidate features

3. Encoding Categorical Features


Qualitative variables or categorical variables need to be encoded using dummy variables before
incorporating them in the regression model. If a categorical variable has n categories (e.g., the
player role in the data has four categories, namely, batsman, bowler, wicket-keeper and allrounder),
then we will need n − 1 dummy variables
ipl_auction_df['PLAYING ROLE'].unique()
array(['Allrounder', 'Bowler', 'Batsman', 'W. Keeper'], dtype=object)
pd.get_dummies( ipl_auction_df['PLAYING ROLE'] )[0:5]
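By default pd.get_dummies() creates n dummy columns; passing drop_first=True drops one to obtain the required n − 1 dummies. A sketch of encoding all categorical features at once (the categorical_features list is an assumption for illustration):

categorical_features = ['AGE', 'COUNTRY', 'PLAYING ROLE', 'CAPTAINCY EXP']  # assumed categorical columns
ipl_auction_encoded_df = pd.get_dummies( ipl_auction_df[X_features],
                                         columns = categorical_features,
                                         drop_first = True )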

4. Splitting the Dataset into Train and Validation Sets


Before building the model, we will split the dataset in an 80:20 ratio. The split function accepts
a parameter random_state, which seeds the random number generator so that the split is reproducible.
from sklearn.model_selection import train_test_split

X = sm.add_constant( ipl_auction_encoded_df )
Y = ipl_auction_df['SOLD PRICE']
train_X, test_X, train_y, test_y = train_test_split( X, Y, train_size = 0.8, random_state = 42 )
5. Building the Model on the Training Dataset
The summary provides details of the model accuracy, feature significance, and signs of any multi-
collinearity effects.
ipl_model_1 = sm.OLS(train_y, train_X).fit()
ipl_model_1.summary2()
Multi-Collinearity and Handling Multi-Collinearity
When the dataset has a large number of independent variables (features), it is possible that a few of
these independent variables may be highly correlated. The existence of a high correlation
between independent variables is called multi-collinearity.

1. Variance Inflation Factor (VIF)


Variance Inflation Factor (VIF) is a measure used for identifying the existence of multi-collinearity. For
example, consider two independent variables X1 and X2, and a regression of X1 on X2.

Let R1² be the R-squared value of this model. Then the VIF, which is a measure of multi-collinearity, is
given by
VIF = 1 / (1 − R1²)
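get_vif_factors() is not a library function; one possible implementation, built on variance_inflation_factor() from statsmodels, is sketched below:

from statsmodels.stats.outliers_influence import variance_inflation_factor

def get_vif_factors( X ):
    X_matrix = X.to_numpy()
    # VIF of each column, obtained by regressing it on all the other columns
    vif = [ variance_inflation_factor( X_matrix, i ) for i in range( X_matrix.shape[1] ) ]
    return pd.DataFrame( { 'column': X.columns, 'vif': vif } )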

vif_factors = get_vif_factors( X[X_features] )


2. Checking Correlation of Columns with Large VIFs
A VIF value of more than 4 usually requires further investigation.
columns_with_large_vif = vif_factors[vif_factors.vif > 4].column
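The correlations among these columns can then be inspected visually; a sketch using seaborn (the plotting library is an assumption, as the original code is not shown):

import seaborn as sns
sns.heatmap( X[columns_with_large_vif].corr(), annot = True )  # pairwise correlation matrix
plt.show()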
Building a New Model after Removing Multi-collinearity
train_X = train_X[X_new_features]  # X_new_features: features retained after dropping correlated columns
ipl_model_2 = sm.OLS( train_y, train_X ).fit()
ipl_model_2.summary2()
Residual Analysis in Multiple Linear Regression
Residuals should be normally distributed. This can be verified using the P-P plot.
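A minimal sketch, reusing the ProbPlot approach from the simple linear regression case:

sm.ProbPlot( ipl_model_2.resid ).ppplot( line = '45' )  # P-P plot of the MLR residuals
plt.show()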
Residual Plot for Homoscedasticity and Model Specification
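The same standardized residual plot used in the simple linear regression case applies here; a sketch reusing get_standardized_values():

plt.scatter( get_standardized_values( ipl_model_2.fittedvalues ),
             get_standardized_values( ipl_model_2.resid ) )
plt.xlabel( 'Standardized predicted values' )
plt.ylabel( 'Standardized residuals' )
plt.show()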

Detecting Influencers
Leverage values of more than 3(k + 1)/n are treated as highly influential observations. In this example,
the cut-off works out to approximately 0.178, so observations with leverage values of more than 0.178 are highly influential.
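A sketch of computing the leverage cut-off for this model (assuming train_X holds the training features):

k = train_X.shape[1]  # number of features
n = train_X.shape[0]  # number of observations
leverage_cutoff = 3 * (k + 1) / n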

Transforming Response Variable


Transformation is a process of deriving new dependent and/or independent variables to identify the
correct functional form of the regression model.
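For example, when the response variable is skewed, a square-root transformation is a common choice; a minimal sketch of re-fitting the model on the transformed response (the model name ipl_model_3 is an assumption):

train_y_sqrt = np.sqrt( train_y )  # transformed response variable
ipl_model_3 = sm.OLS( train_y_sqrt, train_X ).fit()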
Making Predictions on the Validation Set
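A sketch of predicting on the validation set and converting back to the original scale (assuming the square-root transformation above was applied):

pred_y_sqrt = ipl_model_3.predict( test_X[train_X.columns] )  # predictions on the sqrt scale
pred_y = np.power( pred_y_sqrt, 2 )  # square back to the original SOLD PRICE scale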

Auto-correlation Between Error Terms


When there is auto-correlation between error terms, the standard error estimate of the beta coefficient may be underestimated, and that
will result in over-estimation of the t-statistic value, which, in turn, will result in a low p-value.
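The Durbin−Watson statistic (also reported in the summary2() output) can be used to detect auto-correlation; a minimal sketch:

from statsmodels.stats.stattools import durbin_watson
durbin_watson( ipl_model_2.resid )  # values close to 2 indicate no auto-correlation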
