Quick Overview
Regression Overview
• Regression modelling is a method for estimating the value of a
continuous (linear regression) or categorical (logistic regression) target attribute.
• Simple linear regression, where a straight line is used to approximate
the relationship between a single continuous input attribute and a
single continuous target attribute.
• Multiple regression, where several input attributes are used as
predictors to estimate a single target.
Regression works by choosing the regression line that minimizes the sum
of squared errors over all the data points.
Regression equation:
ŷ = a0 + a1·x + …
ŷ is the estimated value of the target attribute.
a0 is the y-intercept of the regression line.
a1 is the slope of the regression line.
a0 and a1 are called the regression coefficients.
Example:
𝒄𝒆𝒓𝒆𝒂𝒍_𝒓𝒂𝒕𝒊𝒏𝒈 = 𝟓𝟗. 𝟐 − 𝟐. 𝟓𝟏(𝒔𝒖𝒈𝒂𝒓𝒔)
Interpretation: the estimated cereal rating (i.e. ŷ)
equals 59.2 minus 2.51 times the sugar content in
grams.
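The least-squares fit described above can be sketched in a few lines. The data below are invented for illustration (not the actual cereal dataset); the slope and intercept are computed with the standard closed-form estimates.

```python
import numpy as np

# Hypothetical data: sugar content (grams) and cereal ratings (illustrative only).
sugars = np.array([1.0, 3.0, 6.0, 9.0, 12.0, 15.0])
rating = np.array([55.0, 52.0, 43.0, 37.0, 29.0, 21.0])

# Least-squares estimates:
#   a1 = cov(x, y) / var(x)      (slope)
#   a0 = mean(y) - a1 * mean(x)  (y-intercept)
a1 = np.cov(sugars, rating, bias=True)[0, 1] / np.var(sugars)
a0 = rating.mean() - a1 * sugars.mean()

def predict(x):
    """Estimated rating (y-hat) for a given sugar content."""
    return a0 + a1 * x

print(f"rating_hat = {a0:.2f} + ({a1:.2f}) * sugars")
```

A least-squares line always passes through the point of means, so `predict(sugars.mean())` equals `rating.mean()`.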
Linear Regression Prediction Formula
• In a linear regression, a prediction estimate for the target attribute is formed from a
simple linear combination of the inputs. The intercept centres the range of predictions,
and the remaining parameter estimates determine the strength of the association between
each input and the target.
• Intercept and parameter estimates are chosen to minimize the squared error between
the predicted and actual target values (least square estimation).
• Linear regressions are usually used for targets with an interval measurement scale.
(Formula diagram: intercept estimate and parameter estimates.)
Logistic Regression Prediction Formula
• In logistic regression, the expected value of the target is transformed by a link function to
restrict its value to the unit interval.
• A linear combination of the inputs generates a logit score, the log of the odds of the
primary outcome, in contrast to the linear regression's direct prediction of the target.
Logit score:
logit(p̂) = log( p̂ / (1 − p̂) ) = ŵ0 + ŵ1·x1 + ŵ2·x2 + …
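A small sketch of the link function at work: a linear combination of inputs produces a logit score, and the inverse of the logit (the sigmoid) maps it back to a probability in the unit interval. The coefficients and inputs below are made up for illustration.

```python
import math

def logit_to_prob(logit_score):
    """Inverse of the logit link: maps a logit score back to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit_score))

# Hypothetical coefficients (w0, w1, w2) and input values (x1, x2).
w0, w1, w2 = -1.5, 0.8, 0.3
x1, x2 = 2.0, 1.0

# Log of the odds of the primary outcome.
logit_score = w0 + w1 * x1 + w2 * x2

# Estimated probability, restricted to the unit interval.
p_hat = logit_to_prob(logit_score)

print(f"logit = {logit_score:.2f}, p_hat = {p_hat:.3f}")
```

Note that `log(p_hat / (1 - p_hat))` recovers the logit score, so the two forms of the equation above agree.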
Sum of Squared Errors (SSE)
• SSE indicates how close a regression line is to a set of points (actual values). It does this by
taking the distances (i.e. errors) from the points to the regression line and squaring
them.
• It measures the difference (i.e. variance/error) between the estimated values and the actual values.
• Mean squared error (MSE) = SSE/DFE
• Average squared error (ASE) = SSE/N
Where SSE = Sum of Squared Error
N = total sample size
DFE = degrees of freedom for error
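The error measures above can be computed directly. The actual and predicted values below are invented for illustration; `p` (the number of input attributes) is assumed to be 1, giving DFE = n − p − 1.

```python
import numpy as np

# Hypothetical actual vs. predicted target values (illustrative only).
actual = np.array([20.0, 23.0, 31.0, 25.0, 28.0])
predicted = np.array([21.0, 22.0, 29.0, 26.0, 27.0])

n = len(actual)
p = 1  # assumed number of input attributes in the model

sse = np.sum((actual - predicted) ** 2)  # Sum of Squared Errors
dfe = n - p - 1                          # degrees of freedom for error
mse = sse / dfe                          # Mean Squared Error = SSE / DFE
ase = sse / n                            # Average Squared Error = SSE / N

print(f"SSE = {sse}, MSE = {mse:.4f}, ASE = {ase:.4f}")
```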
F-test / Chi-Square for Regressions
• The statistics test the overall significance of the regression model.
• The overall significance indicates whether a regression model provides a better
fit to the data than a model that contains no input attributes (i.e. one that simply
predicts the mean or mode).
• If the p-value is less than the significance level, the sample data provide
sufficient evidence to conclude that statistically the regression model fits the
data better than the model with no input attributes.
R2 (R-Squared) for Linear Regression
• R2 is a statistical measure of how close the data are to the fitted regression line - the goodness of fit of the
regression.
• R2 is also known as the coefficient of determination.
• R2 indicates how much better the function predicts the dependent variable than just using the mean value
of the dependent variable.
• The adjusted R2 is a statistic adjusted for the number of parameters in the equation and the number of data
observations. It is a more conservative estimate of the percentage of variance explained.
• R2 measures the strength of the relationship between the model and the target attribute.
• R2 is not a formal significance test for the relationship. The F-test of overall significance is the hypothesis
test for this relationship. If the overall F-test is significant, it can be concluded that the relationship between
the model and the target attribute is statistically significant.
• R2 is always between 0 and 100% (or between 0 and 1):
• 0% indicates that the model explains none of the variability of the target attribute data.
• 100% indicates that the model explains all the variability of the target attribute data.
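R2 and adjusted R2 follow directly from the error measures defined earlier: R2 = 1 − SSE/SST, where SST is the total variation around the mean. The data and the assumed parameter count `p = 1` are illustrative only.

```python
import numpy as np

# Hypothetical actual vs. predicted target values (illustrative only).
actual = np.array([20.0, 23.0, 31.0, 25.0, 28.0])
predicted = np.array([21.0, 22.0, 29.0, 26.0, 27.0])

n = len(actual)
p = 1  # assumed number of input attributes in the model

sse = np.sum((actual - predicted) ** 2)      # unexplained variation
sst = np.sum((actual - actual.mean()) ** 2)  # total variation around the mean
r2 = 1 - sse / sst                           # coefficient of determination

# Adjusted R2 penalizes extra parameters relative to the sample size.
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(f"R2 = {r2:.4f}, adjusted R2 = {adj_r2:.4f}")
```

Since the adjustment only subtracts a penalty, adjusted R2 is never larger than R2, matching its description as the more conservative estimate.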
Example Outputs of a Linear Regression Model
Example Outputs of a Logistic Regression Model
Example of Linear Regression Model Interpretation
The regression model has an adjusted R-square value of 0.1525, meaning the input
variables in the model explain 15.25% of the variance of the target variable,
i.e. the Result.
Model presentation:
result = 18.7833 + 0.7717(Medu=0) – 0.9251(Medu=1) – 0.7789(Medu=2) +
0.2555(Medu=3) – 0.4689(age) – 0.3951(goout) + 0.8672(studytime)
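The fitted equation above can be turned into a scoring function. Medu appears as dummy-coded levels 0–3; it is assumed here (not stated in the source) that the remaining level, Medu = 4, is the reference category with coefficient 0.

```python
def predict_result(medu, age, goout, studytime):
    """Score a student using the fitted coefficients from the model above.

    Assumption: Medu is dummy-coded with levels 0-3 carrying the listed
    coefficients and level 4 acting as the reference category (coefficient 0).
    """
    medu_coef = {0: 0.7717, 1: -0.9251, 2: -0.7789, 3: 0.2555, 4: 0.0}
    return (18.7833 + medu_coef[medu]
            - 0.4689 * age - 0.3951 * goout + 0.8672 * studytime)

# Illustrative inputs: Medu level 4, age 16, goout 3, studytime 2.
print(round(predict_result(medu=4, age=16, goout=3, studytime=2), 4))
```

Each coefficient reads off directly: for example, holding the other inputs fixed, each additional unit of studytime raises the predicted result by 0.8672.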