
Regressions

Quick Overview
Regression Overview
• Regression modelling is a method for estimating the value of a
continuous (linear regression) or categorical (logistic regression) target attribute.
• Simple linear regression, where a straight line is used to approximate
the relationship between a single continuous input attribute and a
single continuous target attribute.
• Multiple regression, where several input attributes are used as
predictors to estimate a single target.
Regression works by choosing the regression line that minimizes the sum
of squared errors over all the data points.

Regression equation:
ý = 𝒂𝟎 + 𝒂𝟏𝒙
ý is the estimated value of the target attribute.
𝒂𝟎 is the y-intercept of the regression line.
𝒂𝟏 is the slope of the regression line.
𝒂𝟎 and 𝒂𝟏 are called the regression coefficients.

Example:
𝒄𝒆𝒓𝒆𝒂𝒍_𝒓𝒂𝒕𝒊𝒏𝒈 = 𝟓𝟗. 𝟐 − 𝟐. 𝟒𝟐(𝒔𝒖𝒈𝒂𝒓𝒔)
Interpretation: the estimated cereal rating (i.e. ý)
equals 59.2 minus 2.42 times the sugar content in
grams.
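The equation above can be applied directly; a minimal sketch (the helper name and the sample sugar values are illustrative, and the coefficients are taken from the example):

```python
# Sketch: applying the cereal-rating regression equation from the example.
def predict_rating(sugars_g: float) -> float:
    """Estimated cereal rating from sugar content in grams."""
    return 59.2 - 2.42 * sugars_g

print(predict_rating(0))   # with zero sugar, the estimate is the intercept
print(predict_rating(10))
```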
Linear Regression Prediction Formula
• In a linear regression, a prediction estimate for the target attribute is formed from a
simple linear combination of the inputs. The intercept centres the range of predictions,
and the remaining parameter estimates determine the strength of the relationship
between each input and the target.
• Intercept and parameter estimates are chosen to minimize the squared error between
the predicted and actual target values (least square estimation).
• Linear regressions are usually used for targets with an interval measurement scale.

prediction estimate: ŷ = ŵ0 + ŵ1·x1 + ŵ2·x2 + …

where ŵ0 is the intercept estimate and ŵ1, ŵ2, … are the parameter estimates.
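Least-squares estimation of the intercept and parameter estimates can be sketched with NumPy (the data below is made up for illustration):

```python
import numpy as np

# Sketch: least-squares estimation, choosing the intercept and parameter
# estimates that minimize the squared error between predicted and actual
# target values. Illustrative data with two inputs.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])  # inputs x1, x2
y = np.array([5.0, 4.0, 11.0, 10.0])                            # target

# Prepend a column of ones so w[0] acts as the intercept w0.
X1 = np.column_stack([np.ones(len(X)), X])
w, *_ = np.linalg.lstsq(X1, y, rcond=None)

y_hat = X1 @ w   # prediction = w0 + w1*x1 + w2*x2
```

Here the target was constructed as y = 1·x1 + 2·x2 exactly, so the fitted coefficients recover those values.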
Logistic Regression Prediction Formula
• In logistic regression, the expected value of the target is transformed by a link function to
restrict its value to the unit interval.
• A linear combination of the inputs generates a logit score, the log of the odds of the primary
outcome, in contrast to the linear regression’s direct prediction of the target.

Logit score:
𝑙𝑜𝑔(ṗ / (1 − ṗ)) = ŵ0 + ŵ1·x1 + ŵ2·x2 + … = 𝑙𝑜𝑔𝑖𝑡(ṗ)

• To obtain a prediction estimate, use a straightforward transformation of the logit
score, which is simply the inverse of the logit function:

ṗ = 1 / (1 + 𝑒^(−𝑙𝑜𝑔𝑖𝑡(ṗ)))
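The logit and its inverse can be sketched directly from these formulas:

```python
import math

# Sketch: the logit score and its inverse (the logistic function).
def logit(p: float) -> float:
    """Log of the odds of the primary outcome."""
    return math.log(p / (1.0 - p))

def inv_logit(score: float) -> float:
    """Transform a logit score back to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

# Round trip: applying the inverse to the logit recovers the probability.
p = inv_logit(logit(0.8))
```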
Missing Values Issues in Regressions
Problems:
• Training data rows with missing values on inputs used by a regression
model are ignored
• Prediction formulas cannot score rows/cases with missing values

Possible strategies for treating missing values:


• Replacement – replace the missing value with a fixed value (e.g. the mean or mode)
• Prediction – use a model to predict the missing values from the other inputs
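The replacement strategy can be sketched as follows (the columns and values are illustrative; the standard-library `statistics.mean` and `statistics.mode` stand in for whatever imputation routine a real tool provides):

```python
import statistics

# Sketch of the replacement strategy: fill missing continuous inputs with
# the column mean, and missing categorical inputs with the column mode.
ages = [23, None, 31, 27, None]                  # continuous input
colours = ["red", "blue", None, "red"]           # categorical input

mean_age = statistics.mean(v for v in ages if v is not None)
mode_colour = statistics.mode(v for v in colours if v is not None)

ages_filled = [mean_age if v is None else v for v in ages]
colours_filled = [mode_colour if v is None else v for v in colours]
```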
Inputs Selection
• Forward
• A sequence of models of increasing complexity
• Training begins with no input in the model - inputs are added sequentially to the model.
• The algorithm searches the set of one-input models and selects the model that most improves on the baseline model. It
then searches the set of two-input models that contain the input selected in the previous step and selects the model
showing the most significant improvement. The process of adding new inputs continues until no significant improvement
can be made.
• Backward
• A sequence of models of decreasing complexity
• Training begins with all available inputs in the model - inputs are sequentially removed from the model. At each step,
the input whose removal least reduces the overall model fit statistic is dropped, until the preset stay cut-off
significance level is met.
• Stepwise
• Combines the forward and backward methods
• Training begins as in the Forward model but may remove inputs (i.e. Backward) already in the model. This continues until
the stay significance level or the stop criterion is met.
• After each input is added, the algorithm re-evaluates the statistical significance of all the included inputs. If the p-value of
any of the inputs exceeds the stay cut-off, the input is removed from the model and re-entered into the pool of inputs
that are available for inclusion in a subsequent step. The process terminates when all inputs
available for inclusion in the model have p-values in excess of the entry cut-off and all inputs already included in the
model have p-values below the stay cut-off.
• None - All inputs are used to fit the model.
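A minimal sketch of the forward method, using reduction in SSE as the improvement criterion rather than a formal significance test (the function names and the stopping threshold are assumptions for illustration):

```python
import numpy as np

# Sketch of forward input selection: start with no inputs, repeatedly add
# the input that most reduces the sum of squared errors (SSE), and stop
# when the best improvement falls below a threshold. A real implementation
# would use a p-value entry/stay cut-off instead of a raw SSE threshold.
def sse(X, y):
    """SSE of a least-squares fit on the given input columns."""
    X1 = np.column_stack([np.ones(len(y)), X]) if X.size else np.ones((len(y), 1))
    w, *_ = np.linalg.lstsq(X1, y, rcond=None)
    r = y - X1 @ w
    return float(r @ r)

def forward_select(X, y, min_improvement=1e-6):
    remaining = list(range(X.shape[1]))
    chosen = []
    best_sse = sse(np.empty((len(y), 0)), y)   # baseline: intercept-only model
    while remaining:
        trial_sse, j = min((sse(X[:, chosen + [j]], y), j) for j in remaining)
        if best_sse - trial_sse < min_improvement:
            break                               # no significant improvement
        chosen.append(j)
        remaining.remove(j)
        best_sse = trial_sse
    return chosen

# Demo: the target depends only on the first input, so forward selection
# should choose column 0 and then stop.
X = np.array([[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [4.0, 1.0]])
y = 3.0 * X[:, 0]
selected = forward_select(X, y)
```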
Model Evaluation for Regressions
Regression models can be evaluated and analysed based on several
measurements:
Linear Regressions (Estimation)
• Squared Errors (ASE or MSE)
• F-test
• R Squared
Logistic Regressions (Classification)
• Chi-Square test
• Misclassification
• ROC
Errors

• Indicates how close a regression line is to a set of points (actual values). It does this by
taking the distances (i.e. errors) from the points to the regression line and squaring
them.
• The difference (i.e. the errors) between the estimated values and the actual values.
• Mean squared error (MSE) = SSE/DFE
• Average squared error (ASE) = SSE/N
Where SSE = Sum of Squared Errors
N = total sample size
DFE = Degrees of Freedom for Error
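Using the definitions above, the error measures can be computed as follows (the actual and predicted values are illustrative):

```python
# Sketch: computing SSE, MSE, and ASE for a fitted regression.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 6.5, 9.5]

n = len(actual)
p = 1                       # number of input attributes in the model
dfe = n - p - 1             # degrees of freedom for error

sse = sum((a - e) ** 2 for a, e in zip(actual, predicted))
mse = sse / dfe             # MSE = SSE / DFE
ase = sse / n               # ASE = SSE / N
```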
F-test / Chi-Square for Regressions
• The statistics test the overall significance of the regression model.
• The overall significance indicates whether a regression model provides a better
fit to the data than a model that contains no input attributes or using
mean/mode.
• If the p-value is less than the significance level, the sample data provide
sufficient evidence to conclude that statistically the regression model fits the
data better than the model with no input attributes.
R2 (R-Squared) for Linear Regression
• R2 is a statistical measure of how close the data are to the fitted regression line - the goodness of fit of the
regression.
• R2 is also known as the coefficient of determination.
• R2 indicates how much better the function predicts the dependent variable than just using the mean value
of the dependent variable.
• The adjusted R2 is a statistic adjusted for the number of parameters in the equation and the number of data
observations. It is a more conservative estimate of the percent of variance explained.
• R2 measures the strength of the relationship between a model and the target attribute.
• R2 is not a formal significance test for the relationship. The F-test of overall significance is the hypothesis
test for this relationship. If the overall F-test is significant, it can be concluded that the relationship between
the model and the target attribute is statistically significant.
• R2 is always between 0 and 100% (or between 0 and 1):
• 0% indicates that the model explains none of the variability of the target attribute data.
• 100% indicates that the model explains all the variability of the target attribute data.
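R2 and adjusted R2 follow directly from their definitions; a sketch with illustrative values:

```python
# Sketch: R-squared and adjusted R-squared from their definitions.
# R^2 = 1 - SSE/SST, where SST measures variability around the target mean,
# so R^2 captures how much better the model does than predicting the mean.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 6.5, 9.5]
p = 1                                   # number of input attributes

n = len(actual)
mean_y = sum(actual) / n
sse = sum((a - e) ** 2 for a, e in zip(actual, predicted))
sst = sum((a - mean_y) ** 2 for a in actual)

r2 = 1 - sse / sst
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)   # penalizes extra parameters
```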
Example Outputs of a Linear Regression Model
Example Outputs of a Logistic Regression Model
Example of Linear Regression Model Interpretation

The regression model has an adjusted R square value of 0.1525, meaning the input
variables in the model can explain 15.25% of the variability of the target variable,
i.e. the Result.

Model presentation:
result = 18.7833 + 0.7717(Medu=0) – 0.9251(Medu=1) – 0.7789(Medu=2) +
0.2555(Medu=3) – 0.4689(age) – 0.3951(goout) + 0.8672(studytime)

Each coefficient is interpreted with the other input variables held constant: for
example, provided the other inputs are constant, each additional unit of studytime
increases the estimated result by 0.8672.
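Scoring one case with this model presentation can be sketched as follows (the example case values are made up, and treating Medu levels outside 0–3 as the reference category is an assumption about the dummy coding):

```python
# Sketch: scoring one case with the fitted model above. Medu is dummy-coded,
# so exactly one (Medu=k) indicator contributes; any other level is assumed
# to be the reference category and contributes nothing.
medu_coef = {0: 0.7717, 1: -0.9251, 2: -0.7789, 3: 0.2555}

def predict_result(medu: int, age: float, goout: float, studytime: float) -> float:
    return (18.7833 + medu_coef.get(medu, 0.0)
            - 0.4689 * age - 0.3951 * goout + 0.8672 * studytime)

# Illustrative case: Medu level 3, age 16, goes out 2, study time 4.
estimate = predict_result(medu=3, age=16, goout=2, studytime=4)
```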
