
REGRESSION ANALYSIS

LECTURER:
Humera Farooq, Ph.D.
Computer Sciences Department,
Bahria University (Karachi Campus)
Correlational Models
A correlational model shows a relationship between two variables such that
(1) changes in one are associated with changes in the
other or (2) particular attributes of one variable are
associated with particular attributes of the other.
(Babbie, 2007)
Correlational studies can suggest that there is a
relationship between two variables, but they cannot
prove that one variable causes a change in another
variable.
Calculating Correlations
• To calculate a numerical value for a correlation we use Pearson's product-moment correlation coefficient, denoted by the lowercase letter 'r'.
• A correlation coefficient ranges from -1.0 to +1.0, with -1.0 indicating a perfect negative linear correlation and +1.0 a perfect positive linear correlation.
Reporting Results
• Positive correlation: both variables increase or decrease at the same time. A correlation coefficient close to +1.00 indicates a strong positive correlation.
• Negative correlation: as the amount of one variable increases, the other decreases (and vice versa). A correlation coefficient close to -1.00 indicates a strong negative correlation.
• No correlation: there is no relationship between the two variables. A correlation coefficient of 0 indicates no correlation.
Interpreting Covariance

cov(x, y) = Σᵢ (xᵢ − X̄)(yᵢ − Ȳ) / (n − 1)

where n is the number of samples, x and y are the attributes, and X̄ and Ȳ are their mean values.

To calculate a numerical value for the correlation we use Pearson's product-moment correlation coefficient, denoted by the lowercase letter 'r':

r = cov(x, y) / √(var(x) · var(y))

• cov(X, Y) > 0: X and Y are positively correlated
• cov(X, Y) < 0: X and Y are inversely (negatively) correlated
• cov(X, Y) = 0: X and Y are uncorrelated (no linear relationship)
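As a quick illustration of these formulas, here is a minimal NumPy sketch that computes the sample covariance and Pearson's r; the data values are made up purely for demonstration.

```python
import numpy as np

# Illustrative data (made-up values, not from the lecture)
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.5, 3.1, 4.2, 6.3, 7.1])

n = len(x)

# Sample covariance: sum((x_i - x_bar)(y_i - y_bar)) / (n - 1)
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# Pearson's r: covariance divided by the product of the standard deviations
r = cov_xy / np.sqrt(x.var(ddof=1) * y.var(ddof=1))

print(cov_xy)   # same value as np.cov(x, y, ddof=1)[0, 1]
print(r)        # same value as np.corrcoef(x, y)[0, 1]
```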


Scatter Plots of Data with Various Correlation Coefficients

[Figure: six scatter plots of Y against X illustrating r = −1, r = −.6, r = 0, r = +1, r = +.3, and r = 0]

Slide from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall
Linear Correlation

[Figure: scatter plots of Y against X contrasting linear relationships with curvilinear relationships]

Slide from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall
Linear Correlation

[Figure: scatter plots of Y against X contrasting strong relationships with weak relationships]

Slide from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall
Linear Correlation

[Figure: scatter plot of Y against X showing no relationship]

Slide from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall
Interpretation of the Strength of Correlations

• .00 – .20: Very Weak
• .21 – .40: Weak
• .41 – .60: Moderate
• .61 – .80: Strong
• .81 – 1.00: Very Strong

Different statisticians may have similar but slightly different scales.


Simple Linear Regression

• Multiple R: the correlation between the actual and predicted values of the dependent variable (r varies from -1 to +1; r is negative if the slope is negative).
• R Square: the model's accuracy in explaining the dependent variable; R² varies from 0 (no fit) to 1 (perfect fit).
• Adjusted R Square: adjusts R² for the sample size and the number of X variables. As the sample size increases above 20 cases per variable, the adjustment is less needed (and vice versa).
• Standard Error: the variability between the observed and predicted Y values.
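The sketch below shows one way these summary measures can be computed from a set of observed and predicted Y values; the data, and the choice of k = 1 predictor, are illustrative assumptions rather than values from the lecture.

```python
import numpy as np

# Illustrative observed and predicted values (made up for this sketch)
y      = np.array([10.0, 12.0, 15.0, 19.0, 24.0, 30.0])
y_pred = np.array([ 9.5, 12.8, 15.9, 18.7, 25.1, 29.0])
k = 1                                              # number of X variables (assumed)
n = len(y)

multiple_r = np.corrcoef(y, y_pred)[0, 1]          # correlation of actual vs. predicted
ss_res = np.sum((y - y_pred) ** 2)                 # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)               # total sum of squares
r_square = 1 - ss_res / ss_tot                     # proportion of variance explained
adj_r_square = 1 - (1 - r_square) * (n - 1) / (n - k - 1)
std_error = np.sqrt(ss_res / (n - k - 1))          # standard error of the estimate

print(multiple_r, r_square, adj_r_square, std_error)
```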
Regression Models

• A statistical process for estimating the relationships among variables, when the focus is on the relationship between a dependent variable and one or more independent variables (or 'predictors').
• The analysis is carried out through the estimation of a relationship, and the results serve the following two purposes:
  1. Answer the question of how much Y changes with changes in each of the X's (X1, X2, ..., Xk).
  2. Forecast or predict the value of Y based on the values of the X's.
• Y is the dependent variable; X is the independent variable.
• In regression the output is continuous (function approximation) and is used for prediction and estimation.
• Simple linear regression involves a single independent variable.
• Multiple regression involves two or more independent variables.


Simple Linear Regression Model
• Lines come in the form y = mx + b, where m is the slope (β1) and b (β0) is the y-intercept.
• In statistics, we write the equation of a straight line as y = a + bx, where b (β1) is the slope and a (β0) is the intercept.
• The slope (β1) of the line is the amount by which y increases when x increases by 1 unit.
• The intercept (β0), sometimes called the vertical intercept, is the height of the line when x = 0.

The population model is:

Yi = β0 + β1Xi + εi

where Yi is the dependent variable, β0 is the population Y-intercept, β1 is the population slope coefficient, Xi is the independent variable, and εi is the random error term. The linear component is β0 + β1Xi and the random error component is εi.
Simple Linear Regression Model

Yi = β0 + β1Xi + εi

[Figure: for a given Xi, the observed value of Y differs from the predicted value on the regression line by the random error εi; the line has intercept β0 and slope β1.]
The Estimated Coefficients
To calculate the estimates of the line coefficients that minimize the differences between the data points and the line, use the formulas:

b1 = cov(X, Y) / sX² = sXY / sX²

b0 = Ȳ − b1X̄

The regression equation that estimates the equation of the first-order linear model is:

Ŷ = b0 + b1X
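A minimal sketch of these estimator formulas, assuming made-up sample data; the result can be cross-checked against np.polyfit.

```python
import numpy as np

# Made-up sample data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

s_xy = np.cov(x, y, ddof=1)[0, 1]   # sample covariance s_XY
s_xx = np.var(x, ddof=1)            # sample variance s_X^2

b1 = s_xy / s_xx                    # slope estimate b1 = cov(X, Y) / s_X^2
b0 = y.mean() - b1 * x.mean()       # intercept estimate b0 = Y_bar - b1 * X_bar

y_hat = b0 + b1 * x                 # fitted values from Y-hat = b0 + b1 * X
print(b1, b0)                       # should match np.polyfit(x, y, 1)
```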
Least Squares Line

• The least squares criterion is the most widely used criterion for measuring the goodness of fit of a line.
• The line that gives the best fit to the data is the one that minimizes the sum of squared deviations; it is called the least squares line.
• The slope of a regression line represents the rate of change in y as x changes. Because y is dependent on x, the slope describes the predicted values of y given x.
• b0 and b1 (the estimates of β0 and β1) are obtained by finding the values that minimize the sum of the squared differences between Y and Ŷ:

min Σ (Yi − Ŷi)² = min Σ (Yi − (b0 + b1Xi))²


Three Important Questions
To examine how useful or effective the line is in summarizing the relationship between x and y, we consider the following three questions.

1. Is a line an appropriate way to summarize the relationship between the two variables?
2. Are there any unusual aspects of the data set that we need to consider before proceeding to use the regression line to make predictions?
3. If we decide that it is reasonable to use the regression line as a basis for prediction, how accurate can we expect predictions based on the regression line to be?
Multiple Regression with Two Predictor Variables
• In the same way that linear regression produces an equation that uses values of X to predict values of Y, multiple regression produces an equation that uses two or more different (quantitative and qualitative) explanatory variables (X1 ….. Xn) to predict values of Y.
• The equation is determined by a least-squared-error solution that minimizes the squared distances between the actual Y values and the predicted Y values.
Introduction to Multiple Regression with Two Predictor Variables (cont.)
• For two predictor variables, the general form of the multiple regression equation is:

Ŷ = b1X1 + b2X2 + a

• The ability of the multiple regression equation to accurately predict the Y values is measured by first computing the proportion of the Y-score variability that is predicted by the regression equation and the proportion that is not predicted.
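As a rough illustration of the two-predictor equation Ŷ = b1X1 + b2X2 + a, the sketch below finds the least-squared-error coefficients with NumPy; the data are invented for the example.

```python
import numpy as np

# Made-up data: two predictors X1, X2 and a response Y
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
Y  = np.array([4.1, 4.9, 8.2, 8.8, 12.1, 12.7])

# Design matrix with a column of ones for the intercept a
A = np.column_stack([X1, X2, np.ones_like(X1)])

# Least-squares solution that minimizes the sum of squared (Y - Y_hat)
coeffs, *_ = np.linalg.lstsq(A, Y, rcond=None)
b1, b2, a = coeffs

Y_hat = b1 * X1 + b2 * X2 + a       # predicted values from the fitted equation
print(b1, b2, a)
```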
Multicollinearity

• High correlation between X variables.
• Coefficients measure their combined effect.
• Leads to unstable coefficients, depending on which X variables are in the model.
• It always exists; it is a matter of degree.
• Example: using both the total number of rooms and the number of bedrooms as explanatory variables in the same model.
Detecting Multicollinearity

• Examine the correlation matrix:
  - Correlations between pairs of X variables that are higher than their correlations with the Y variable are a warning sign.
• There are few remedies:
  - Obtain new sample data.
  - Eliminate one of the correlated X variables.
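One simple way to apply the correlation-matrix check is sketched below with NumPy; the rooms/price numbers are made up purely to mimic the rooms-versus-bedrooms example above.

```python
import numpy as np

# Made-up predictors: total_rooms and bedrooms are strongly related
total_rooms = np.array([5, 6, 7, 8, 9, 10], dtype=float)
bedrooms    = np.array([2, 3, 3, 4, 4, 5], dtype=float)
price       = np.array([100, 120, 135, 160, 170, 200], dtype=float)

# Correlation matrix of the two predictors and the response
corr = np.corrcoef([total_rooms, bedrooms, price])
print(corr.round(2))

# Warning sign: the X-X correlation exceeds each X-Y correlation
if abs(corr[0, 1]) > max(abs(corr[0, 2]), abs(corr[1, 2])):
    print("Possible multicollinearity between total_rooms and bedrooms")
```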
Evaluating Multiple Regression Model
• Examine variation measures.
• Do residual analysis.
• Test parameter significance:
  - the overall model,
  - portions of the model,
  - individual coefficients.
• Test for multicollinearity.
Types of Regression Models
• Logistic regression: 1 dependent variable (binary), 2+ independent variables (interval, ratio, or dichotomous).
• Ordinal regression: 1 dependent variable (ordinal), 1+ independent variables (nominal or dichotomous).
• Multinomial regression: 1 dependent variable (nominal), 1+ independent variables (interval, ratio, or dichotomous).
• Discriminant analysis: 1 dependent variable (nominal), 1+ independent variables (interval or ratio).
Non-Linear Regression Models
• In nonlinear regression, observational data are modeled by a function that is a nonlinear combination of the model parameters and depends on one or more independent variables.
• In nonlinear regression, a model has the form

Y = f(x, β)

• x is an independent variable, Y is the dependent variable, and f is nonlinear in the components of the parameter vector β.
• Such a function is nonlinear because it cannot be expressed as a linear combination of the βs.
• Other examples of nonlinear functions include exponential functions, logarithmic functions, trigonometric functions, power functions, the Gaussian function, and Lorentz distributions.
• Some functions, such as the exponential or logarithmic functions, can be transformed so that they become linear.
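As one example of fitting Y = f(x, β) with a nonlinear f, the sketch below fits an exponential model with SciPy's curve_fit; the model choice and the synthetic data are assumptions made only for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

# Nonlinear model y = f(x, beta): here an exponential, as one example
def f(x, beta0, beta1):
    return beta0 * np.exp(beta1 * x)

# Made-up noisy data roughly following y = 2 * exp(0.3 x)
x = np.linspace(0, 5, 20)
rng = np.random.default_rng(0)
y = 2.0 * np.exp(0.3 * x) + rng.normal(0, 0.2, size=x.size)

# Estimate the parameter vector beta by nonlinear least squares
beta_hat, _ = curve_fit(f, x, y, p0=(1.0, 0.1))
print(beta_hat)                      # estimates of (beta0, beta1)
```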
Polynomial Regression
• In polynomial regression, as in linear regression, we use the relationship between the independent variable x and the dependent variable y to find the best way to draw a line through the data points.
• The explanatory (independent) variables resulting from the polynomial expansion of the "baseline" variables are known as higher-degree terms. The relationship is modelled as an nth-degree polynomial in x.
• Such variables are also used in classification settings.
• Although polynomial regression fits a nonlinear model to the data, as a statistical estimation problem it is linear, in the sense that the regression function E(y | x) is linear in the unknown parameters that are estimated from the data.
• Polynomial regression is therefore often termed non-linear regression or linear-in-parameters regression.
• For this reason, polynomial regression is considered to be a special case of multiple linear regression.
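A short sketch of a degree-2 polynomial fit, which is linear in the parameters even though the fitted curve is not a straight line; the data are synthetic.

```python
import numpy as np

# Made-up data with a curved trend
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 4.1, 8.3, 15.2, 24.8, 37.1])

# Degree-2 polynomial: y' = a + b*x + c*x^2 (still linear in the parameters)
c2, c1, c0 = np.polyfit(x, y, deg=2)   # polyfit returns the highest-degree coefficient first
y_hat = c0 + c1 * x + c2 * x ** 2

print(c0, c1, c2)
```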
Polynomial Regression Shape

Y′ = A + BX + CX² + DX³ + … + QX^(N−1)

The highest-order regressor determines the overall shape of the relationship within the range of −1 to 1.

• Linear: Y′ = A + BX (zero bends)
• Quadratic: Y′ = A + BX + CX² (one bend)
• Cubic: Y′ = A + BX + CX² + DX³ (two bends)

There is one less bend than the highest order in the polynomial model.
Shape and Coefficient Sign
The sign of the coefficient for the highest-order regressor determines the direction of the curvature.

• Linear: Y′ = 0 + 1X vs. Y′ = 0 − 1X
• Quadratic: Y′ = 0 + 1X + 1X² vs. Y′ = 0 + 1X − 1X²
• Cubic: Y′ = 0 + 1X + 1X² + 1X³ vs. Y′ = 0 + 1X + 1X² − 1X³
Overfitting
• A modeling error that produces good results on training data but performs poorly on testing data.
• It is the result of an overly complex model relative to the amount of training data.
• The model fits the training data so well that it leaves very little or no room for generalization over new data. When overfitting occurs, we say that the model has "high variance".

Overfitting = Low Bias + High Variance

• Bias: the difference between the expected (average) prediction of the model and the actual value.
• Variance: how the predictions for a given point vary between different realizations of the model.
Underfitting
• A modeling error that fails to produce good results because of an oversimplified model.
• It is the result of a simple model with an insufficient number of training points.
• Such a model can neither model the training data nor generalize over new data. When such a situation occurs, we say that the model has "high bias".

Underfitting = High Bias + Low Variance


Regression Overfitting and Underfitting
• If the final "best fit" line crosses over every single data point by forming an unnecessarily complex curve, then the model is likely overfitting.
• In underfitting, the data points are laid out in a given pattern, but the model is unable to "fit" properly to the given data due to low model complexity.

[Figure: as the degree of the polynomial increases, the fit moves from high bias / low variance (underfit), to low bias / low variance (train and test accuracy both appropriate), to low bias / high variance (overfit).]
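The sketch below imitates this picture by fitting polynomials of increasing degree to synthetic data and comparing train and test errors; the degrees and data are arbitrary choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data: a quadratic trend plus noise, split into train and test sets
x_train = np.linspace(0, 1, 15)
y_train = 1 + 2 * x_train - 3 * x_train**2 + rng.normal(0, 0.1, x_train.size)
x_test  = np.linspace(0.03, 0.97, 15)
y_test  = 1 + 2 * x_test - 3 * x_test**2 + rng.normal(0, 0.1, x_test.size)

for degree in (1, 2, 9):             # underfit, reasonable fit, likely overfit
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    mse_train = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    mse_test  = np.mean((np.polyval(coeffs, x_test)  - y_test) ** 2)
    print(degree, round(mse_train, 4), round(mse_test, 4))
```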
Solutions for Overfitting
• Cross-validation
• Early stopping
• Regularization
• Train with more data
• Remove features
• Ensembling
Solutions for Underfitting
• Get more training data
• Increase the size or number of parameters in the model
• Increase the complexity of the model
• Increase the training time, until the cost function is minimized
Bias Variance Tradeoff
Regularization: An Overview
Shrinkage (Regularization) Methods

• The fitting procedure involves a loss function, known as the residual sum of squares (RSS). The coefficients are chosen such that they minimize this loss function.
• This adjusts the coefficients based on the training data. If there is noise in the training data, then the estimated coefficients won't generalize well to future data.
• This is where regularization comes in: it shrinks or regularizes these learned estimates towards zero.
Ridge Regularization (L2 Regularization)

• A small amount of bias is introduced to get better long-term predictions.
• It is used to reduce the complexity of the model and is also called L2 regularization.
• In this technique, the cost function is altered by adding a penalty term to it. The amount of bias added to the model is called the ridge regression penalty.
• It is calculated by multiplying lambda (λ) by the squared weight of each individual feature.
• A general linear or polynomial regression will fail if there is high collinearity between the independent variables; to solve such problems, ridge regression can be used.
• It also helps to solve problems where we have more parameters than samples.


Ridge Regularization (L2 Regularization)

• It is used to minimize the sum of squared errors plus the sum of the squared coefficients (β); coefficients (β) with a large magnitude produce peaks and deep slopes in the loss surface.
• Lambda (λ) is called the penalty factor and helps us to get a smooth surface instead of an irregular graph.
• This is L2 regularization, since it adds a penalty equivalent to the square of the magnitude of the coefficients:

Loss Function → Loss Function + Regularization term

Transforming the loss function in this way yields ridge regression.
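A minimal sketch of ridge regression using its closed-form solution (intercept omitted for brevity); the collinear data and the choice λ = 1.0 are assumptions made only to show the shrinkage effect compared with ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up data with two highly collinear predictors
n = 30
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)         # nearly identical to x1
y = 3 * x1 + rng.normal(scale=0.1, size=n)

X = np.column_stack([x1, x2])
lam = 1.0                                        # lambda, the penalty factor (assumed)

# Ridge estimate: minimize ||y - X b||^2 + lam * ||b||^2
# Closed form: b = (X'X + lam * I)^-1 X'y
b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Ordinary least squares for comparison (unstable under collinearity)
b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

print(b_ols.round(2), b_ridge.round(2))
```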


LASSO Regression (L1 Regularization)

• This is very similar to ridge regression, with a small difference in the penalty factor: it uses the magnitude of the coefficients instead of their squares.
• Many coefficients can become exactly zero, so the corresponding attributes/features are dropped from the model; this ultimately reduces the number of dimensions and supports dimensionality reduction.
• In effect, LASSO decides that those attributes/features are not suitable as predictors of the target value.
• This is L1 regularization, because it adds a penalty equivalent to the absolute value of the magnitude of the coefficients:

Loss Function → Loss Function + Regularization term

Transforming the loss function in this way yields LASSO regression.
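A small sketch, assuming scikit-learn is available, showing how the L1 penalty drives several coefficients to exactly zero; the data and the alpha value (scikit-learn's name for λ) are invented for the example.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)

# Made-up data: only the first 2 of 6 features actually matter
X = rng.normal(size=(100, 6))
y = 4 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

lasso = Lasso(alpha=0.5)       # alpha plays the role of lambda here
lasso.fit(X, y)

# With the L1 penalty, several coefficients shrink exactly to zero,
# effectively dropping those features from the model
print(lasso.coef_.round(2))
```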


The Lasso (cont.): Choosing λ

• In both ridge and LASSO regression, the larger our choice of the regularization parameter λ, the more heavily we penalize large values in β.
• If λ is close to zero, we recover the MSE, i.e. ridge and LASSO regression reduce to ordinary regression.
• If λ is sufficiently large, the MSE term in the regularized loss function will be insignificant and the regularization term will force the ridge and LASSO coefficients to be close to zero.
• To avoid ad-hoc choices, we should select λ using cross-validation.

• λ = 0: no impact on the coefficients (β); the model would be overfit and is not suitable for production.
• λ = minimal: a generalised model with acceptable accuracy, eligible for both train and test; fit for production.
• λ = high: very high impact on the coefficients (β), leading to underfit; ultimately not fit for production.
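One possible way to select λ by cross-validation is sketched below using scikit-learn's cross_val_score with ridge regression; the candidate grid and the synthetic data are illustrative assumptions (scikit-learn calls the parameter alpha).

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)

# Made-up data for the sketch
X = rng.normal(size=(80, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, -2.0]) + rng.normal(scale=0.3, size=80)

# Score a grid of candidate lambda (alpha) values by 5-fold cross-validation
for alpha in (0.01, 0.1, 1.0, 10.0, 100.0):
    scores = cross_val_score(Ridge(alpha=alpha), X, y, cv=5,
                             scoring="neg_mean_squared_error")
    print(alpha, round(-scores.mean(), 4))
# Pick the alpha with the smallest cross-validated MSE
```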
Elastic Net Regression

• Squared loss with both L1 and L2 regularization.

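A brief sketch, assuming scikit-learn, of an elastic net fit that combines the L1 and L2 penalties via l1_ratio; the data and settings are illustrative only.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(5)

# Made-up data, reused just to show the interface
X = rng.normal(size=(100, 6))
y = 3 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=100)

# l1_ratio mixes the two penalties: 0 = pure L2 (ridge), 1 = pure L1 (lasso)
enet = ElasticNet(alpha=0.5, l1_ratio=0.5)
enet.fit(X, y)
print(enet.coef_.round(2))
```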

Summary

• Linear regression and logistic regression are nice tools for many simple situations.
• But both force us to fit the data with one shape (a line or a sigmoid), which will often underfit.
• They give intelligible results.
• When the problem includes more arbitrary non-linearity, we need more powerful models.
• Though non-linear data transformations can help in these cases while still using a linear model for learning.
• These models are commonly used in data mining applications and also as a "first attempt" at understanding data trends, indicators, etc.
Interval Bands (simple regression)
