Ronald Hochreiter
sensational.ai
Regression Analysis
Regression Analysis - Descriptive Statistics
 i      X       Y
 1   −0.61   −0.59
 2   −1.25    0.34
 3    1.45   −1.09
 4   −0.10    1.16
 5    0.71    1.56
How can we quantify the dependency between $X$ and $Y$?
Linear dependency: correlation coefficient
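As a first illustration, the correlation coefficient of the five observations from the table above can be computed directly, e.g. in R (a minimal sketch using the table's values):

```r
# The five observations from the table above
X <- c(-0.61, -1.25, 1.45, -0.10, 0.71)
Y <- c(-0.59, 0.34, -1.09, 1.16, 1.56)

# Pearson correlation coefficient: a first measure of the
# linear dependency between X and Y
cor(X, Y)
```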
Correlation
[Scatter plot: correlation r = 0.8954]
Correlation
[Scatter plot: correlation r = −0.9114]
Correlation
[Scatter plot: correlation r = 0.6188]
Correlation
[Scatter plot: correlation r = −0.5338]
Correlation
[Scatter plot: correlation r = 0.0776]
Correlation
[Scatter plot: correlation r = 0.047]
Correlation
[Scatter plot: correlation r = −0.0202]
Correlation
[Scatter plot: correlation r = −0.0331]
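Since the correlation coefficient captures only linear dependency, a relationship can be strong yet have r ≈ 0. A minimal sketch in R (simulated data, not the slides' data):

```r
set.seed(1)
X <- runif(100, -3, 3)   # symmetric around zero
Y <- X^2                 # Y depends deterministically on X

cor(X, Y)                # yet the Pearson correlation is close to 0
```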
Regression and Prediction - p-Value Refresher
Regression and Prediction
Quantitative Attributes/Features only
One of the main drawbacks of regression analysis is that it essentially only works with quantitative variables. However, the following techniques can be applied so that regression analysis can still be used when qualitative (categorical) attributes/features are present.
Dummy Variables / One-Hot Encoding
Person   Attribute
  1         R
  2         B
  3         S
  4         S
  5         R
  6         G
Dummy Variables / One-Hot Encoding
Person   R   S   B   G
  1      1   0   0   0
  2      0   0   1   0
  3      0   1   0   0
  4      0   1   0   0
  5      1   0   0   0
  6      0   0   0   1
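A minimal sketch in R of how such dummy variables can be generated; the codes R/S/B/G are taken from the table above, and model.matrix() is just one of several possible approaches:

```r
# Qualitative attribute for the six persons from the table
x <- factor(c("R", "B", "S", "S", "R", "G"))

# One dummy column per level; "+ 0" keeps all four levels
# instead of dropping one as the reference category
model.matrix(~ x + 0)
```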
Ordinary Least Squares (OLS)
The regression coefficients have to be estimated from the data such that for each observed value $x_i$ an optimal prediction $\hat{y}_i$ can be calculated.
How to estimate optimal values? One possibility: minimize the residual sum of squares (RSS).
$$\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \bigl(y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\bigr)^2.$$
Regression: Ordinary Least Squares

[Sequence of scatter plots: a small dataset with candidate regression lines; the residual sum of squares decreases step by step (RSS = 9.00, 7.42, 6.01, 4.79, 3.75, 2.89, 2.20, 1.70, 1.38, 1.23) until the line converges to the OLS solution]
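In practice the minimization is not done by trial and error as in the plots above; lm() computes the OLS solution directly. A minimal sketch in R (simulated data, since the slides' points are not given numerically):

```r
set.seed(1)
x <- rnorm(8)
y <- 0.5 * x + rnorm(8, sd = 0.5)

fit <- lm(y ~ x)          # OLS estimates of beta0 and beta1
coef(fit)
sum(residuals(fit)^2)     # the minimized RSS
```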
$R^2$ - Coefficient of Determination

$$R^2 = 1 - \frac{SS_{\mathrm{res}}}{SS_{\mathrm{tot}}},$$

where $SS_{\mathrm{res}} = \sum_i (y_i - \hat{y}_i)^2$ is the residual sum of squares and $SS_{\mathrm{tot}} = \sum_i (y_i - \bar{y})^2$ is the total sum of squares.
Adjusted $R^2$

The Adjusted $R^2$ attempts to account for the fact that the coefficient of determination $R^2$ automatically, and often spuriously, increases when extra explanatory variables are added to the model. It is defined as

$$\text{Adjusted } R^2 = 1 - (1 - R^2)\,\frac{n-1}{n-p-1},$$

where $p$ is the total number of explanatory variables in the model (not including the constant term) and $n$ is the sample size.
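A minimal sketch in R (simulated data, assumed) illustrating the difference: adding a useless variable increases $R^2$ but not the Adjusted $R^2$:

```r
set.seed(1)
x1 <- rnorm(50)
x2 <- rnorm(50)              # pure noise, unrelated to y
y  <- 1 + 2 * x1 + rnorm(50)

s1 <- summary(lm(y ~ x1))
s2 <- summary(lm(y ~ x1 + x2))

s2$r.squared - s1$r.squared          # always non-negative
s2$adj.r.squared - s1$adj.r.squared  # typically negative here
```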
Case Study: POTUS Election 2000
[Scatter plot: votes for Buchanan by Florida county (y-axis: votes for Buchanan, 0 to 3500); the outlier Palm Beach stands out near the top]
Case Study: POTUS Election 2000
• The POTUS election 2000 was one of the narrowest and most controversial elections.
• The Republican George W. Bush won marginally (fewer votes, but more presidential electors) against the Democrat Al Gore.
• Florida was decisive: Bush won the state with a lead of just 537 votes.
• Some irregularities were observed, especially in Palm Beach County.
• Both the design of the ballot and the electronic ballot boxes were criticized. The complexity of the ballot paper is suspected to have misled many voters.
Case Study: POTUS Election 2000
• One may suspect that quite a few voters erroneously cast their vote for Pat Buchanan instead of Al Gore.
• To analyze this, we compare the votes for the two conservative candidates, George W. Bush and Pat Buchanan, across all counties of Florida.
• A 2D scatter plot can be used to visualize the number of votes.
• The outlier Palm Beach is easy to recognize.

Questions

• Is there a relation/dependency between the number of votes for Bush and for Buchanan?
• How many votes for Buchanan do we expect in Palm Beach?
• Did G.W. Bush assume his presidency lawfully?
Case Study: POTUS Election 2000
[Scatter plot repeated: votes for Buchanan by Florida county, with the Palm Beach outlier]
Case Study: POTUS Election 2000
If the outlier Palm Beach is removed from the dataset and the correlation coefficient between the number of votes for Bush and for Buchanan is computed, one obtains $\rho = 0.87$. The regression coefficients are $\beta_0 = 65.573$ and $\beta_1 = 0.00348$ respectively, i.e.

$$\widehat{\text{Buchanan}} = 65.573 + 0.00348 \cdot \text{Bush}.$$
Case Study: POTUS Election 2000
Using the regression model to predict the number of votes for Buchanan in Palm Beach can be done as follows: Bush received 152846 votes, so the predicted number of votes for Buchanan is

$$65.573 + 0.00348 \cdot 152846 \approx 598.$$

However, Buchanan actually received 3407 votes. That equals a difference of 2809 votes relative to the prediction and is far larger than the winning margin of 537 votes by which Bush won Florida and ultimately the election.
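The same prediction can be reproduced with the coefficients reported on the previous slide (a minimal sketch):

```r
beta0 <- 65.573
beta1 <- 0.00348
bush  <- 152846                 # votes for Bush in Palm Beach

pred <- beta0 + beta1 * bush    # approx. 598 predicted votes
3407 - pred                     # approx. 2809 excess votes for Buchanan
```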
Case Study: POTUS Election 2000
[Scatter plot repeated: votes for Buchanan by Florida county, with the Palm Beach outlier]
Multiple Regression
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \epsilon.$$

$$Y = \underbrace{\beta_0 + \beta_1 X_1 + \beta_2 X_2}_{\text{linear function}} + \underbrace{\epsilon}_{\text{model error}}$$
Case Study: Used Cars - Data
Data from the Red Book (USA) is sometimes questionable. A used-car dealer collected information on 100 cars of one specific model (Ford Taurus).
 1   5318   373.88   1   2   72
 2   5061   447.58   1   2   74
 3   5008   458.33   3   2   49
 4   5795   308.62   3   4   52
 5   5784   317.05   2   4   54
 6   5359   340.10   2   2   57
 …
Remark: Color cannot be used directly because it is qualitative, but it could be transformed into dummy variables.
Case Study: Used Cars
First step: manually choose a regression model to explain the price, e.g. price ~ miles, which leads to the following result:

[Regression output: coefficient estimates ($\hat{\beta}$) and p-values]
As there are different attributes available, the main task is to find the optimal model, i.e. the one that explains the data best from a statistical viewpoint, in this case measured by the Adjusted $R^2$.
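A minimal sketch of this first step in R, assuming the dataset is available as a data frame cars100 with columns price and miles (names assumed, not given in the source):

```r
# Simple linear regression: explain price by mileage
fit <- lm(price ~ miles, data = cars100)
summary(fit)   # coefficient estimates and p-values
```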
Case Study: Used Cars
[Regression output: coefficient estimates ($\hat{\beta}$) and p-values]
Case Study: Used Cars
[Regression output: coefficient estimates ($\hat{\beta}$) and p-values]
Case Study: Used Cars
The bivariate linear regression model, in which the price depends on the number of miles and the number of services, can be summarized as follows:
• For each 100 miles, the average price is reduced by USD 3.1, assuming the same number of services.
• Comparing two cars where one received one more service than the other, the price of the better-serviced car is expected to be USD 136 higher.
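A minimal sketch of the bivariate model in R (column names price, miles, and services assumed, as before):

```r
fit2 <- lm(price ~ miles + services, data = cars100)
coef(fit2)                   # the slopes behind the two statements above
summary(fit2)$adj.r.squared  # explanatory power of the model
```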
Model selection - general outline
• The statistical model should be as small as possible, i.e. use as few attributes of the original dataset as possible.
• Simultaneously, the model should not lose a significant amount of explanatory power relative to models with more attributes.
• Different measures are available to quantify the explanatory power ($R^2$, AIC/BIC, …).

Model selection: a manual model selection was conducted above; an automatic model selection, however, takes the full model as input and returns the best possible model.
Model selection (two attributes)
A dataset with one target $Y$ and two features/attributes $X_1$ and $X_2$ would allow for the following models:

$$\begin{aligned}
M_0 &: \; Y = \beta_0 + \epsilon \\
M_1 &: \; Y = \beta_0 + \beta_1 X_1 + \epsilon \\
M_2 &: \; Y = \beta_0 + \beta_2 X_2 + \epsilon \\
M_{12} &: \; Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon
\end{aligned}$$
Different strategies exist to find the optimal model (model selection); this is crucial if the dataset contains many attributes, which is the case in most Data Science applications.
With $k$ attributes there are $2^k$ candidate models; this exponential growth of the number of models requires heuristics, e.g. step-wise regression (backward and forward).
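A minimal sketch of automatic step-wise selection in R (simulated data, assumed): step() starts from the full model $M_{12}$ and removes terms as long as the AIC improves:

```r
set.seed(1)
X1 <- rnorm(100)
X2 <- rnorm(100)              # no real effect on Y
Y  <- 1 + 2 * X1 + rnorm(100)

full <- lm(Y ~ X1 + X2)       # the full model M12
best <- step(full, direction = "backward", trace = FALSE)
formula(best)                 # typically selects M1: Y ~ X1
```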