Lecture 03: Regression
LECTURER:
Humera Farooq, Ph.D.
Computer Sciences Department,
Bahria University (Karachi Campus)
Correlational Models
A correlational model shows a relationship between two variables such that (1) changes in one are associated with changes in the other, or (2) particular attributes of one variable are associated with particular attributes of the other (Babbie, 2007).
Correlational studies can suggest that there is a
relationship between two variables, but they cannot
prove that one variable causes a change in another
variable.
Calculating correlations
To calculate a numerical value of a correlation, we can use Pearson's product-moment correlation coefficient (often simply "the correlation coefficient"), denoted by the lowercase letter 'r'.
r = \frac{\sum_{i=1}^{n}(x_i - \bar{X})(y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{Y})^2}}, where i indexes the n paired observations.
[Figure: six scatter plots illustrating correlations of r = -1, r = -.6, r = 0 (top row) and r = +1, r = +.3, r = 0 (bottom row)]
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
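To make the formula concrete, here is a minimal sketch (not from the original slides) that computes r directly from the definition above; the sample data x and y are hypothetical.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's product-moment correlation coefficient, from the definition."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()  # deviations from the means
    return (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())

# Hypothetical sample data: a strong positive relationship
x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
print(pearson_r(x, y))          # close to +1
print(np.corrcoef(x, y)[0, 1])  # NumPy's built-in, shown as a cross-check
```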
Linear Correlation
[Figure: scatter plots contrasting linear relationships (left column) with curvilinear relationships (right column)]
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
Linear Correlation
[Figure: scatter plots contrasting strong relationships (left column) with weak relationships (right column)]
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
Linear Correlation
No relationship
[Figure: scatter plot showing no relationship between X and Y]
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
Interpretation of the Strength of Correlations
Multiple R
is the correlation between the actual and predicted values of the dependent variable; r varies from -1 to +1 (r is negative if the slope is negative).
R Square
measures the model's accuracy in explaining the dependent variable; R² varies from 0 (no fit) to 1 (perfect fit).
Adjusted R Square
adjusts R² for the sample size and the number of X variables; as the sample size increases above 20 cases per variable, less adjustment is needed (and vice versa).
Standard Error
measures the variability between the observed and predicted Y values.
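As an illustration beyond the slides, the helpers below compute R² and adjusted R² from observed and predicted values; the names y_obs, y_pred, n, and k are hypothetical.

```python
import numpy as np

def r_squared(y_obs, y_pred):
    """R^2 = 1 - SS_residual / SS_total."""
    y_obs, y_pred = np.asarray(y_obs), np.asarray(y_pred)
    ss_res = ((y_obs - y_pred) ** 2).sum()          # unexplained variation
    ss_tot = ((y_obs - y_obs.mean()) ** 2).sum()    # total variation
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n, k):
    """Adjust R^2 for n observations and k predictor (X) variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)
```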
Regression Models
A statistical process for estimating the relationships among variables, when the focus is on the relationship between a dependent variable and one or more independent variables (or 'predictors').
The analysis is carried out through the estimation of a relationship, and the results serve the following two purposes:
1. Answer the question of how much y changes with changes in each of the x's (x1, x2, ..., xk)
2. Predict the value of y for given values of the x's
Y is the dependent variable.
In regression the output is continuous (function approximation), and it is used for prediction and estimation.
Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i
where Y_i is the dependent variable, \beta_0 is the intercept, \beta_1 is the slope, and \varepsilon_i is the random error term for this X_i value.
[Figure: regression line showing the predicted value of Y for X_i, with intercept β0, slope β1, and random error εi]
The Estimated Coefficients
To calculate the estimates of the line coefficients that minimize the differences between the data points and the line, use the formulas:
b_1 = \frac{\mathrm{cov}(X, Y)}{s_X^2} = \frac{s_{XY}}{s_X^2}
b_0 = \bar{Y} - b_1 \bar{X}
The regression equation that estimates the equation of the first-order linear model is:
\hat{Y} = b_0 + b_1 X
Least Squares Line
The most widely used criterion for measuring the goodness of fit of a line is the sum of the squared differences between Ŷ and Y. The line that gives the best fit to the data is the one that minimizes this sum; it is called the least squares line.
b0 and b1 are obtained by finding the values that minimize this sum of squared differences.
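A minimal sketch of these formulas on hypothetical sample data; NumPy's polyfit is shown only as a cross-check.

```python
import numpy as np

# Hypothetical sample data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 3.9, 6.1, 8.0, 9.7])

# b1 = cov(X, Y) / s_X^2,  b0 = Ybar - b1 * Xbar
b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x             # estimated first-order linear model
print(b0, b1)
print(np.polyfit(x, y, 1))      # cross-check: returns [b1, b0]
```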
Discriminant analysis
1 dependent variable (nominal), 1+ independent variable(s) (interval or ratio)
Non-Linear Regression Models
In nonlinear regression observational data are modeled by a function which is a
nonlinear combination of the model parameters and depends on one or more
independent variables.
In nonlinear regression, a model has the form
Y = f(x, β)
where x is an independent variable, y is the dependent variable, and f is nonlinear in the components of the parameter vector β.
Such a function is nonlinear because it cannot be expressed as a linear combination of the βs.
Other examples of nonlinear functions include exponential functions, logarithmic
functions, trigonometric functions, power functions, Gaussian function, and Lorentz
distributions.
Some functions, such as the exponential or logarithmic functions, can be
transformed so that they are linear.
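As an illustration not part of the original lecture, the sketch below fits a hypothetical exponential model with SciPy's curve_fit, then shows the log transform that makes the same model linear in its parameters.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical nonlinear model: y = beta0 * exp(beta1 * x)
def model(x, beta0, beta1):
    return beta0 * np.exp(beta1 * x)

rng = np.random.default_rng(0)
x = np.linspace(0, 2, 50)
y = 1.5 * np.exp(0.8 * x) + rng.normal(0, 0.1, x.size)  # noisy observations

beta, _ = curve_fit(model, x, y, p0=[1.0, 1.0])  # iterative nonlinear least squares
print(beta)  # estimates of (beta0, beta1)

# Linearizing transform: log(y) = log(beta0) + beta1 * x is linear in the parameters
b1, log_b0 = np.polyfit(x, np.log(y), 1)
print(np.exp(log_b0), b1)
```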
Polynomial Regression
In polynomial regression, as in linear regression, we use the relationship between the independent variable x and the dependent variable y to find the best way to draw a line through the data points.
The explanatory (independent) variables resulting from the polynomial expansion of the "baseline" variables are known as higher-degree terms; the relationship is modelled as an nth-degree polynomial in x.
Such variables are also used in classification settings.
Although polynomial regression fits a nonlinear model to the data, as a statistical
estimation problem it is linear, in the sense that the regression function E(y | x) is
linear in the unknown parameters that are estimated from the data.
Polynomial regression is often termed non-linear regression or, more precisely, linear-in-parameters regression.
For this reason, polynomial regression is considered to be a special case of
multiple linear regression.
Polynomial Regression Shape
Y' = A + BX + CX^2 + DX^3 + \dots + QX^{N-1}
There is one less bend than the highest order in the polynomial model.
Shape and Coefficient Sign
The sign of the coefficient of the highest-order regressor determines the direction of the curvature.
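A short sketch of these ideas on hypothetical cubic data: a degree-3 polynomial has at most two bends, and the sign of its leading coefficient sets the direction of the curvature.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 40)
y = 1 + 2*x - 0.5*x**2 + 0.3*x**3 + rng.normal(0, 1, x.size)  # cubic + noise

# Fit an nth-degree polynomial; degree 3 allows at most two bends
coeffs = np.polyfit(x, y, deg=3)    # highest-order coefficient first
y_hat = np.polyval(coeffs, x)
print(coeffs)  # the sign of coeffs[0] sets the direction of the curvature
```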
Overfitting
is a modeling error that produces good results on training data but performs poorly on testing data.
The model fits the training data so well that it leaves very little or no room for generalization over new data. When overfitting occurs, we say that the model has "high variance".
Underfitting
A model that can neither fit the training data nor generalize over new data is underfitting. When such a situation occurs, we say that the model has "high bias".
In underfitting, the data points are laid out in a given pattern, but the model is unable to fit the given data properly due to low model complexity.
[Figure: three fits as the degree of the polynomial increases: underfit (high bias, low variance), good fit (low bias, low variance; train and test accuracy are both appropriate), and overfit (low bias, high variance)]
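The following sketch (an illustration, not from the slides) makes the high-variance behaviour visible on hypothetical data: as the polynomial degree increases, training error falls while test error rises.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)
x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.3, random_state=0)

for deg in (1, 3, 12):
    c = np.polyfit(x_tr, y_tr, deg)
    mse_tr = np.mean((np.polyval(c, x_tr) - y_tr) ** 2)
    mse_te = np.mean((np.polyval(c, x_te) - y_te) ** 2)
    print(deg, round(mse_tr, 4), round(mse_te, 4))
# degree 1 underfits (both errors high); degree 12 overfits
# (training error near zero, test error blows up)
```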
Solutions for Overfitting
Cross validation (see the sketch after this list)
Early stopping
Regularization
Train with more data
Remove features
Ensembling
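As a sketch of the cross-validation item above (assuming hypothetical sine-shaped data and scikit-learn), each candidate degree is scored by 5-fold cross-validated MSE:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = np.linspace(0, 1, 40).reshape(-1, 1)
y = np.sin(2 * np.pi * X.ravel()) + rng.normal(0, 0.2, 40)

# Score each candidate degree by 5-fold cross-validated MSE
for deg in range(1, 10):
    model = make_pipeline(PolynomialFeatures(deg), LinearRegression())
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(deg, round(mse, 4))
# choose the degree with the lowest cross-validated error
```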
Solutions for Underfitting
Get more training data
Increase the size or number of parameters in the model
Increase the complexity of the model
Increase the training time, until the cost function is minimized
Bias Variance Tradeoff
Regularization: An Overview
Shrinkage (Regularization)
Methods
The fitting procedure involves a loss function, known as the residual sum of squares (RSS).
The coefficients are chosen such that they minimize this loss function.
• This adjusts the coefficients based on the training data.
• If there is noise in the training data, the estimated coefficients won't generalize well to future data.
• This is where regularization comes in: it shrinks or regularizes these learned estimates towards zero.
Ridge Regularization (L2 regularization)
In this technique, the cost function is altered by adding a penalty term to it. The amount of bias added to the model is called the Ridge Regression penalty.
We calculate it by multiplying lambda by the squared weight of each individual feature.
A general linear or polynomial regression will fail if there is high collinearity between the independent variables; Ridge regression can be used to solve such problems.
It minimizes the sum of squared errors plus the sum of the squared coefficients (β).
Coefficients (β) with a large magnitude produce peaks and steep slopes in the fitted graph; lambda (λ), called the penalty factor, helps us obtain a smooth fit instead of an irregular one.
This is L2 regularization, since it adds a penalty equivalent to the square of the magnitude of the coefficients.
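A minimal illustration with scikit-learn (not from the original slides): Ridge's alpha parameter plays the role of λ, and the nearly collinear features are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge, LinearRegression

rng = np.random.default_rng(4)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(0, 0.01, n)      # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(0, 0.5, n)

print(LinearRegression().fit(X, y).coef_)  # unstable, large-magnitude betas
print(Ridge(alpha=1.0).fit(X, y).coef_)    # shrunk toward zero by the L2 penalty
```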
Lasso Regularization (L1 regularization)
Lasso is very similar to Ridge regression, with a small difference in the penalty factor: it uses the magnitude of the coefficients instead of their squares.
There is a possibility of many coefficients becoming exactly zero, so that the corresponding attributes/features drop out of the model; this ultimately reduces the dimensions and supports dimensionality reduction.
In effect, Lasso decides that those attributes/features are not suitable as predictors for predicting the target value.
This is L1 regularization, because it adds the absolute value of the magnitude of the coefficients as the penalty.
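A sketch with scikit-learn's Lasso on hypothetical data where only two of eight features matter; note how the L1 penalty drives most coefficients exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 8))
y = 2 * X[:, 0] - 3 * X[:, 1] + rng.normal(0, 0.5, 100)  # only 2 features matter

lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_)  # most coefficients are driven exactly to zero,
                    # dropping the corresponding features from the model
```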
In both Ridge and LASSO regression, the larger our choice of the regularization parameter λ, the more heavily we penalize large values in the coefficients β.
• If λ is close to zero, we recover the MSE, i.e. Ridge and LASSO regression reduce to ordinary regression.
• If λ is sufficiently large, the MSE term in the regularized loss function will be insignificant, and the regularization term will force the Ridge and LASSO coefficients to be close to zero.
To avoid ad-hoc choices, we should select λ using cross-validation, as sketched below.
[Figure: fitted curves for λ = 0, λ = minimal, and λ = high]
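As an illustration with scikit-learn's RidgeCV and LassoCV on hypothetical data, the penalty strength is chosen by cross-validation rather than ad hoc.

```python
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.0, 3.0]) + rng.normal(0, 0.5, 100)

# Both estimators pick the penalty strength by cross-validation
ridge = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X, y)
lasso = LassoCV(cv=5).fit(X, y)
print(ridge.alpha_, lasso.alpha_)  # the selected values of λ (alpha)
```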
Linear regression and logistic regression are nice tools for many simple situations, but both force us to fit the data with one shape (a line or a sigmoid).