
Regression Analysis - From Statistics to Machine Learning

Ronald Hochreiter
sensational.ai

1
Regression Analysis

• Descriptive Statistics: compute a (scalar) measure of dependency between two quantitative
attributes (covariance, correlation, correlation coefficient).
• Inferential Statistics: compute the statistical significance (p-Value) of the dependency
between two quantitative attributes.
• Data Science / Exploratory Data Analysis: compute an optimal prediction model based on
dependency, including the possibility to determine the optimal combination/selection of
attributes using model selection.

2
Regression Analysis - Descriptive Statistics

Dependency between (at least) two quantitative attributes, e.g.

𝑖 𝑋 𝑌
1 −0.61 −0.59
2 −1.25 0.34
3 1.45 −1.09
4 −0.10 1.16
5 0.71 1.56
How can we define the dependency between 𝑋 and 𝑌 ?
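
For example, the standard dependency measures for these five observations could be computed directly in R:

x <- c(-0.61, -1.25, 1.45, -0.10, 0.71)
y <- c(-0.59, 0.34, -1.09, 1.16, 1.56)
cov(x, y)   # sample covariance
cor(x, y)   # Pearson correlation coefficient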

3
Linear dependency: correlation coefficient

Properties of the (Pearson) correlation coefficient 𝜌:

• (Only) linear dependency is measured.


• −1 ≤ 𝜌 ≤ 1.
• The sign indicates the direction of the dependency (positive or negative).
• The absolute value |𝜌| indicates the strength of the (linear) dependency.
• Symmetric: 𝜌𝑥𝑦 = 𝜌𝑦𝑥 .
• Two identical variables (in particular, a variable with itself) exhibit a coefficient of 1: 𝜌𝑥𝑥 = 1.
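
For reference, 𝜌 normalizes the covariance of 𝑋 and 𝑌 by the two standard deviations:

\rho_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \, \sigma_Y}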

4
Correlation

[Scatter plot: correlation r = 0.8954]

5
Correlation

[Scatter plot: correlation r = −0.9114]

6
Correlation

[Scatter plot: correlation r = 0.6188]

7
Correlation

[Scatter plot: correlation r = −0.5338]

8
Correlation

[Scatter plot: correlation r = 0.0776]

9
Correlation

[Scatter plot: correlation r = 0.047]

10
Correlation

[Scatter plot: correlation r = −0.0202]

11
Correlation

[Scatter plot: correlation r = −0.0331]

12
Regression and Prediction - p-Value Refresher

p-Value - The golden rule


1. Rule: If 𝑝 is low, 𝐻0 has to go.
2. The hypothesis you want to prove is always formulated as 𝐻𝐴 .
3. 𝐻0 is the logical opposite of 𝐻𝐴 .
4. If 𝑝 is low, then the opposite (𝐻0 ) of the hypothesis you wanted to prove is rejected.

Effect in Regression Analysis


• The effect we are looking at in Regression Analysis is whether two (or more) variables depend
on each other (positively or negatively).
• Hypotheses are in general either “the higher X the higher Y” (positive dependence) or “the
higher X the lower Y” (negative dependence).

13
Regression and Prediction

Regression: understand your data


Build a model of data based on dependency. Simplify the dataset from 𝑛 observations of two
features to two coefficients as well as one p-Value and certain quality measures (R-squared,
AIC/BIC, …)

Prediction: use your data to predict unseen values


Use the regression model not just to understand the data, but also to predict 𝑌 values for
values of 𝑋 that were not in the original dataset from which the model was derived.

14
Quantitative Attributes/Features only

One of the main drawbacks of Regression Analysis is that it basically only works with quantitative
variables. However, the following techniques make it possible to use Regression Analysis even
when qualitative (categorical) attributes/features are involved.

Output is qualitative (binary classification)


Use Logistic Regression, which models and predicts a 0 − 1 output.
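
A minimal, self-contained sketch in R with simulated data (all variable names are illustrative, not from the slides):

set.seed(1)
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
d$y <- rbinom(50, 1, plogis(0.5 * d$x1 - 0.8 * d$x2))    # simulated 0-1 outcome
fit <- glm(y ~ x1 + x2, data = d, family = binomial)     # logistic regression
head(predict(fit, type = "response"))                    # predicted probabilities in [0, 1]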

(Some) attributes/features are qualitative


Transform every qualitative variable with 𝑘 categories into 𝑘 quantitative 0 − 1 variables. These
𝑘 variables are called Dummy Variables in Statistics and the process is called One-Hot Encoding
in Data Science.

15
Dummy Variables / One-Hot Encoding

Transform qualitative variables with 𝑘 categories into 𝑘 quantitative 0 − 1 (dummy) variables,


e.g. the attribute which shows which party a person in the sample voted for at the last general
election:

Person Elected Party

1 R
2 B
3 S
4 S
5 R
6 G

16
Dummy Variables / One-Hot Encoding

Transform qualitative variables with 𝑘 categories into 𝑘 quantitative 0 − 1 (dummy) variables,


e.g. the attribute which shows which party a person in the sample voted for at the last general
election:

Person R S B G

1 1 0 0 0
2 0 0 1 0
3 0 1 0 0
4 0 1 0 0
5 1 0 0 0
6 0 0 0 1
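
A minimal sketch of this transformation in R, using the "Elected Party" column from the table above:

party <- factor(c("R", "B", "S", "S", "R", "G"))
# model.matrix() creates one 0-1 column per category; "+ 0" drops the intercept so all k dummies are kept
model.matrix(~ party + 0)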

17
Ordinary Least Squares (OLS)

The regression coefficients have to be estimated from data such that, for each observed value 𝑥𝑖 ,
an optimal prediction 𝑦𝑖̂ can be calculated.

How to estimate optimal values? One possibility: minimize the residual sum of squares (RSS).

RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \left( y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \right)^2 .

Minimize RSS to obtain estimates 𝛽0̂ and 𝛽1̂ .
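
A minimal sketch in R, reusing the five example observations from the beginning; the closed-form estimates and R's built-in lm() agree:

x <- c(-0.61, -1.25, 1.45, -0.10, 0.71)
y <- c(-0.59, 0.34, -1.09, 1.16, 1.56)
b1 <- cov(x, y) / var(x)          # slope estimate (closed form)
b0 <- mean(y) - b1 * mean(x)      # intercept estimate
sum((y - (b0 + b1 * x))^2)        # the minimized RSS
coef(lm(y ~ x))                   # the same estimates via lm()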

18
Regression: Ordinary Least Squares

[Sequence of scatter plots: the same data with different candidate regression lines; the residual sum of squares decreases step by step (RSS = 9.00, 7.42, 6.01, 4.79, 3.75, 2.89, 2.20, 1.70, 1.38, 1.23) until the line with minimal RSS - the OLS solution - is reached.]
𝑅2 - Coefficient of determination

R^2 = 1 - \frac{SS_{\mathrm{res}}}{SS_{\mathrm{tot}}}

where 𝑆𝑆res is the residual sum of squares (the RSS minimized above) and 𝑆𝑆tot is the total sum of
squares of 𝑌 around its mean; 𝑅2 thus measures the proportion of the variation in 𝑌 explained by
the model.

33
Adjusted 𝑅2

The use of the Adjusted 𝑅2 is an attempt to take account of the phenomenon that the coefficient
of determination 𝑅2 automatically and spuriously increases when extra explanatory variables are
added to the model. It is defined as

\text{Adjusted } R^2 = 1 - (1 - R^2)\,\frac{n-1}{n-p-1}

where 𝑝 is the total number of explanatory variables in the model (not including the constant
term), and 𝑛 is the sample size.

There are other related measures too, e.g. AIC/BIC, …
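
These measures can be read directly off a fitted model in R; a minimal sketch reusing the small example data from above:

x <- c(-0.61, -1.25, 1.45, -0.10, 0.71)
y <- c(-0.59, 0.34, -1.09, 1.16, 1.56)
fit <- lm(y ~ x)
summary(fit)$r.squared         # R-squared
summary(fit)$adj.r.squared     # Adjusted R-squared
AIC(fit); BIC(fit)             # related information criteria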

34
Case Study: POTUS Election 2000

[Scatter plot: votes for Bush (x-axis, 0 to 300000) vs. votes for Buchanan (y-axis, 0 to 3500) across the Florida counties; Palm Beach stands out as a clear outlier.]

35
Case Study: POTUS Election 2000

• The POTUS election 2000 was one of the narrowest and most controversial elections.
• The Republican George W. Bush won marginally (fewer votes overall, but more presidential electors)
against the Democrat Al Gore.
• Florida was decisive: Bush won the state with a lead of just 537 votes.
• Some irregularities were observed - especially in Palm Beach County.
• Both the design of the ballot and the electronic ballot boxes were criticized. The complexity
of the ballot paper is suspected to have misled many voters.

36
Case Study: POTUS Election 2000

• One may suspect that quite a few voters erroneously cast their vote for Pat Buchanan instead
of Al Gore.
• To analyze this, the votes for the two conservative candidates, George W. Bush and Pat
Buchanan, are compared across all counties of Florida.
• A 2D scatter plot can be used to visualize the number of votes.
• The outlier Palm Beach can be recognized easily.

Questions
• Is there a relation/dependency between the amount of votes for Bush and Buchanan?
• How many votes for Buchanan do we expect in Palm Beach?
• Did G.W. Bush assume his presidency lawfully?

37
Case Study: POTUS Election 2000

[Same scatter plot as before: votes for Bush vs. votes for Buchanan across the Florida counties, with the outlier Palm Beach marked.]

38
Case Study: POTUS Election 2000

If the outlier Palm Beach is removed from the dataset and the coefficient of correlation is
computed between the number of votes for Bush and Buchanan, one obtains 𝜌 = 0.87. The
regression coefficients are: 𝛽0 = 65.573 and 𝛽1 = 0.00348 respectively, i.e.

Buchanan = 65.573 + 0.00348 × Bush

39
Case Study: POTUS Election 2000

Using the regression model to predict the number of votes for Buchanan in Palm Beach can be
done as follows: Bush received 152846 votes, so the predicted number of votes for Buchanan is

65.573 + 0.00348 × 152846 ≈ 598.

However, Buchanan actually received 3407 votes. That is a difference of 2809 votes relative to the
prediction and far larger than the margin of 537 votes by which Bush won Florida and ultimately
the election.
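
A minimal sketch of how this fit and prediction could be reproduced in R (the data frame florida and its columns county, bush and buchanan are illustrative assumptions, not part of the original slides):

florida_no_pb <- subset(florida, county != "Palm Beach")   # remove the outlier
cor(florida_no_pb$bush, florida_no_pb$buchanan)            # correlation, roughly 0.87
fit <- lm(buchanan ~ bush, data = florida_no_pb)           # OLS fit without Palm Beach
coef(fit)                                                  # intercept ~ 65.573, slope ~ 0.00348
predict(fit, newdata = data.frame(bush = 152846))          # predicted Buchanan votes, ~ 598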

40
Case Study: POTUS Election 2000

[Scatter plot repeated: votes for Bush vs. votes for Buchanan, with the outlier Palm Beach marked.]

41
Multiple Regression

Multiple Linear (!) Regression


𝑘 independent (explanatory) variables 𝑋1 , … , 𝑋𝑘 :

𝑌 = 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + … + 𝛽𝑘 𝑋𝑘 + 𝜖.

Bi-variate Linear Regression

Y = \underbrace{\beta_0 + \beta_1 X_1 + \beta_2 X_2}_{\text{linear function}} + \underbrace{\epsilon}_{\text{model error}}

42
Multiple Regression

Bi-variate Linear Regression

Y = \underbrace{\beta_0 + \beta_1 X_1 + \beta_2 X_2}_{\text{linear function}} + \underbrace{\epsilon}_{\text{model error}}

Interpretation of the coefficients


• 𝛽0 is the average value of 𝑌 , if 𝑋1 = 𝑋2 = 0.
• 𝛽1 : 𝑌 changes by 𝛽1 , if 𝑋1 is raised by 1 and 𝑋2 remains unchanged.
• 𝛽2 : 𝑌 changes by 𝛽2 , if 𝑋2 is raised by 1 and 𝑋1 remains unchanged.

43
Case Study: Used Cars - Data

Data from the Red Book (USA) is sometimes questionable. A used-car dealer collected information
on 100 cars of one specific model (Ford Taurus).

# Price (USD) Miles (x100) Color Services (#) Age (months)

1 5318 373.88 1 2 72
2 5061 447.58 1 2 74
3 5008 458.33 3 2 49
4 5795 308.62 3 4 52
5 5784 317.05 2 4 54
6 5359 340.10 2 2 57

• Price is the dependent variable/attribute.


• All other variables/attributes are independent.

Remark: Color cannot be used directly because it is qualitative, but it could be transformed into
dummy variables.

44
Case Study: Used Cars

First step: manually choose a regression model to explain price, e.g. price ~ miles, which leads
to the following result:

Estimate (𝛽 ) p-Value

Intercept 6533.3830 <2e-16 ***


miles -3.1105 <2e-16 ***

price = 6533 - 3.1 * miles


(Adjusted 𝑅2 = 0.6466)

As there are several attributes available, the main task is to find the optimal model, i.e. the one
that explains the data best from a statistical viewpoint - in this case measured by the Adjusted 𝑅2 .
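
A minimal sketch of this step in R, using only the six example rows shown two slides earlier (the numbers on the slides are based on the full sample of 100 cars):

used_cars <- data.frame(
  price   = c(5318, 5061, 5008, 5795, 5784, 5359),
  miles   = c(373.88, 447.58, 458.33, 308.62, 317.05, 340.10),
  color   = c(1, 1, 3, 3, 2, 2),
  service = c(2, 2, 2, 4, 4, 2),
  age     = c(72, 74, 49, 52, 54, 57)
)
fit_miles <- lm(price ~ miles, data = used_cars)
summary(fit_miles)   # coefficients, p-values and Adjusted R-squared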

45
Case Study: Used Cars

Next manually chosen model: price ~ miles + services

Estimate (𝛽 ) p-Value

Intercept 6206.12836 <2e-16 ***


miles -3.14627 <2e-16 ***
service 135.83749 <2e-16 ***

price = 6206 - 3.1 * miles + 136 * service


(Adjusted 𝑅2 = 0.9735)

46
Case Study: Used Cars

Another manually chosen model: price ~ miles + age

Estimate (𝛽 ) p-Value

Intercept 6582.9949 <2e-16 ***


miles -3.1105 <2e-16 ***
age -0.9487 0.53

price = 6582 - 3.1 * miles - 0.9 * age


(Adjusted 𝑅2 = 0.6444)

47
Case Study: Used Cars

Best model (according to Adjusted 𝑅2 ) manually found:

price = 6206 - 3.1 * miles + 136 * service

The bi-variate linear regression model, in which the price depends on the number of miles and the
number of services, can be summarized as follows:

• For each additional 100 miles the average price is reduced by USD 3.1 - assuming the same
number of services.
• Comparing two cars where one received one service more than the other, the price of the
better-serviced car is expected to be USD 136 higher.

48
Model selection - general outline

• The statistical model should be as small as possible, i.e. use as few attributes of the original
dataset as possible.
• Simultaneously the model should not lose a significant amount of explanatory power in
relation to models with more attributes.
• Different measures are available to quantify the explanatory power (R-squared, AIC/BIC, …)

Model selection: a manual model selection has been conducted above - however, an automatic
model selection takes the full model as input and returns the best model possible.

49
Model selection (two attributes)

A dataset with one target 𝑌 and two features/attributes 𝑋1 and 𝑋2 would allow for the following
models:

𝑀0 ∶ 𝑌 = 𝛽0 + 𝜖
𝑀1 ∶ 𝑌 = 𝛽0 + 𝛽1 𝑋1 + 𝜖
𝑀2 ∶ 𝑌 = 𝛽0 + 𝛽2 𝑋2 + 𝜖
𝑀12 ∶ 𝑌 = 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + 𝜖

There are different strategies to find the optimal model (model selection) - this is crucial if the
dataset contains many attributes, which is the case in most Data Science applications.

The exponential growth of the number of candidate models requires heuristics, e.g. step-wise
regression (backward and forward), as sketched below.
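
A minimal sketch of automatic step-wise selection in R (the data frame used_cars from the case study, with columns price, miles, service and age, is assumed):

full  <- lm(price ~ miles + service + age, data = used_cars)   # full model
empty <- lm(price ~ 1, data = used_cars)                       # intercept-only model M0
step(full, direction = "backward")                             # backward elimination (AIC-based)
step(empty, scope = formula(full), direction = "forward")      # forward selection (AIC-based)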

50
