
Regression Analysis - From Statistics to Machine Learning

Ronald Hochreiter
sensational.ai

1
Regression Analysis

• Descriptive Statistics: compute a (scalar) measure of dependency between two quantitative
attributes (covariance, correlation, correlation coefficient).
• Inferential Statistics: compute the statistical significance (p-Value) of the dependency
between two quantitative attributes.
• Data Science / Exploratory Data Analysis: compute an optimal prediction model based on
dependency, including the possibility to determine the optimal combination/selection of
attributes using model selection.

2
Regression Analysis - Descriptive Statistics

Dependency between (at least) two quantitative attributes, e.g.

𝑖 𝑋 𝑌
1 −0.61 −0.59
2 −1.25 0.34
3 1.45 −1.09
4 −0.10 1.16
5 0.71 1.56
How can we define the dependency between 𝑋 and 𝑌 ?
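
For example, the standard dependency measures for these five observations could be computed directly in R:

x <- c(-0.61, -1.25, 1.45, -0.10, 0.71)
y <- c(-0.59, 0.34, -1.09, 1.16, 1.56)
cov(x, y)   # sample covariance
cor(x, y)   # Pearson correlation coefficient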

3
Linear dependency: correlation coefficient

Properties of the (Pearson) correlation coefficient 𝜌:

• (Only) linear dependency is measured.


• −1 ≤ 𝜌 ≤ 1.
• The sign indicates the direction of the dependency (positive or negative).
• The absolute value |𝜌| indicates the strength of the (linear) dependency.
• Symmetric: 𝜌𝑥𝑦 = 𝜌𝑦𝑥 .
• Two identical variables (in particular, a variable with itself) exhibit a coefficient of 1: 𝜌𝑥𝑥 = 1.
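
For reference, 𝜌 normalizes the covariance of 𝑋 and 𝑌 by the two standard deviations:

\rho_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \, \sigma_Y}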

4
Correlation

[Scatter plot: correlation r = 0.8954]

5
Correlation

[Scatter plot: correlation r = −0.9114]

6
Correlation

[Scatter plot: correlation r = 0.6188]

7
Correlation

[Scatter plot: correlation r = −0.5338]

8
Correlation

[Scatter plot: correlation r = 0.0776]

9
Correlation

[Scatter plot: correlation r = 0.047]

10
Correlation

[Scatter plot: correlation r = −0.0202]

11
Correlation

[Scatter plot: correlation r = −0.0331]

12
Regression and Prediction - p-Value Refresher

p-Value - The golden rule


1. Rule: If 𝑝 is low, 𝐻0 has to go.
2. The hypothesis you want to prove is always formulated as 𝐻𝐴 .
3. 𝐻0 is the logical opposite of 𝐻𝐴 .
4. If 𝑝 is low, then the opposite (𝐻0 ) of the hypothesis you wanted to prove is rejected.

Effect in Regression Analysis


• The effect we are looking at in Regression Analysis is whether two (or more) variables depend
on each other (positively or negatively).
• Hypotheses are in general either “the higher X the higher Y” (positive dependence) or “the
higher X the lower Y” (negative dependence).

13
Regression and Prediction

Regression: understand your data


Build a model of data based on dependency. Simplify the dataset from 𝑛 observations of two
features to two coefficients as well as one p-Value and certain quality measures (R-squared,
AIC/BIC, …)

Prediction: use your data to predict unseen values


Use the regression model not just to understand the data, but also to predict 𝑌 values for
values of 𝑋 that were not in the original dataset from which the model was derived.

14
Quantitative Attributes/Features only

One of the main drawbacks of Regression Analysis is that it basically only works with quantitative
variables. However, the following techniques make it possible to use Regression Analysis even
when qualitative (categorical) attributes/features are involved.

Output is qualitative (binary classification)


Use Logistic Regression, which models and predicts a 0 − 1 output.
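
A minimal, self-contained sketch in R with simulated data (all variable names are illustrative, not from the slides):

set.seed(1)
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
d$y <- rbinom(50, 1, plogis(0.5 * d$x1 - 0.8 * d$x2))    # simulated 0-1 outcome
fit <- glm(y ~ x1 + x2, data = d, family = binomial)     # logistic regression
head(predict(fit, type = "response"))                    # predicted probabilities in [0, 1]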

(Some) attributes/features are qualitative


Transform every qualitative variable with 𝑘 categories into 𝑘 quantitative 0 − 1 variables. These
𝑘 variables are called Dummy Variables in Statistics and the process is called One-Hot Encoding
in Data Science.

15
Dummy Variables / One-Hot Encoding

Transform qualitative variables with 𝑘 categories into 𝑘 quantitative 0 − 1 (dummy) variables,


e.g. the attribute which shows which party a person in the sample voted for at the last general
election:

Person Elected Party

1 R
2 B
3 S
4 S
5 R
6 G

16
Dummy Variables / One-Hot Encoding

Transform qualitative variables with 𝑘 categories into 𝑘 quantitative 0 − 1 (dummy) variables,


e.g. the attribute which shows which party a person in the sample voted for at the last general
election:

Person R S B G

1 1 0 0 0
2 0 0 1 0
3 0 1 0 0
4 0 1 0 0
5 1 0 0 0
6 0 0 0 1
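
A minimal sketch of this transformation in R, using the "Elected Party" column from the table above:

party <- factor(c("R", "B", "S", "S", "R", "G"))
# model.matrix() creates one 0-1 column per category; "+ 0" drops the intercept so all k dummies are kept
model.matrix(~ party + 0)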

17
Ordinary Least Squares (OLS)

The regression coefficients have to be estimated from data such that, for each observed value 𝑥𝑖 ,
an optimal prediction 𝑦𝑖̂ can be calculated.

How to estimate optimal values? One possibility: minimize the residual sum of squares (RSS).

RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \left( y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \right)^2 .

Minimize RSS to obtain estimates 𝛽0̂ and 𝛽1̂ .
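
A minimal sketch in R, reusing the five example observations from the beginning; the closed-form estimates and R's built-in lm() agree:

x <- c(-0.61, -1.25, 1.45, -0.10, 0.71)
y <- c(-0.59, 0.34, -1.09, 1.16, 1.56)
b1 <- cov(x, y) / var(x)          # slope estimate (closed form)
b0 <- mean(y) - b1 * mean(x)      # intercept estimate
sum((y - (b0 + b1 * x))^2)        # the minimized RSS
coef(lm(y ~ x))                   # the same estimates via lm()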

18
Regression: Ordinary Least Squares

[Sequence of scatter plots: the same data with different candidate regression lines; the residual sum of squares decreases step by step (RSS = 9.00, 7.42, 6.01, 4.79, 3.75, 2.89, 2.20, 1.70, 1.38, 1.23) until the line with minimal RSS - the OLS solution - is reached.]
𝑅2 - Coefficient of determination

R^2 = 1 - \frac{SS_{\mathrm{res}}}{SS_{\mathrm{tot}}}

where 𝑆𝑆res is the residual sum of squares (the RSS minimized above) and 𝑆𝑆tot is the total sum of
squares of 𝑌 around its mean; 𝑅2 thus measures the proportion of the variation in 𝑌 explained by
the model.

33
Adjusted 𝑅2

The use of the Adjusted 𝑅2 is an attempt to take account of the phenomenon that the coefficient
of determination 𝑅2 automatically and spuriously increases when extra explanatory variables are
added to the model. It is defined as

\text{Adjusted } R^2 = 1 - (1 - R^2)\,\frac{n-1}{n-p-1}

where 𝑝 is the total number of explanatory variables in the model (not including the constant
term), and 𝑛 is the sample size.

There are other related measures too, e.g. AIC/BIC, …
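
These measures can be read directly off a fitted model in R; a minimal sketch reusing the small example data from above:

x <- c(-0.61, -1.25, 1.45, -0.10, 0.71)
y <- c(-0.59, 0.34, -1.09, 1.16, 1.56)
fit <- lm(y ~ x)
summary(fit)$r.squared         # R-squared
summary(fit)$adj.r.squared     # Adjusted R-squared
AIC(fit); BIC(fit)             # related information criteria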

34
Case Study: POTUS Election 2000

[Scatter plot: votes for Bush (x-axis, 0 to 300000) vs. votes for Buchanan (y-axis, 0 to 3500) across the Florida counties; Palm Beach stands out as a clear outlier.]

35
Case Study: POTUS Election 2000

• The POTUS election 2000 was one of the narrowest and most controversial elections.
• The Republican George W. Bush won marginally (fewer votes overall, but more presidential electors)
against the Democrat Al Gore.
• Florida was decisive: Bush won the state with a lead of just 537 votes.
• Some irregularities were observed - especially in Palm Beach County.
• Both the design of the ballot and the electronic ballot boxes were criticized. The complexity
of the ballot paper is suspected to have misled many voters.

36
Case Study: POTUS Election 2000

• One may suspect that quite a few voters erroneously cast their vote for Pat Buchanan instead
of Al Gore.
• To analyze this, the votes for the two conservative candidates, George W. Bush and Pat
Buchanan, are compared across all counties of Florida.
• A 2D scatter plot can be used to visualize the number of votes.
• The outlier Palm Beach can be recognized easily.

Questions
• Is there a relation/dependency between the amount of votes for Bush and Buchanan?
• How many votes for Buchanan do we expect in Palm Beach?
• Did G.W. Bush assume his presidency lawfully?

37
Case Study: POTUS Election 2000

[Same scatter plot as before: votes for Bush vs. votes for Buchanan across the Florida counties, with the outlier Palm Beach marked.]

38
Case Study: POTUS Election 2000

If the outlier Palm Beach is removed from the dataset and the coefficient of correlation is
computed between the number of votes for Bush and Buchanan, one obtains 𝜌 = 0.87. The
regression coefficients are: 𝛽0 = 65.573 and 𝛽1 = 0.00348 respectively, i.e.

Buchanan = 65.573 + 0.00348 × Bush

39
Case Study: POTUS Election 2000

Using the regression model to predict the number of votes for Buchanan in Palm Beach can be
done as follows: Bush received 152846 votes, so the predicted number of votes for Buchanan is

65.573 + 0.00348 × 152846 ≈ 598.

However, Buchanan actually received 3407 votes. That is a difference of 2809 votes relative to the
prediction and far larger than the margin of 537 votes by which Bush won Florida and ultimately
the election.
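
A minimal sketch of how this fit and prediction could be reproduced in R (the data frame florida and its columns county, bush and buchanan are illustrative assumptions, not part of the original slides):

florida_no_pb <- subset(florida, county != "Palm Beach")   # remove the outlier
cor(florida_no_pb$bush, florida_no_pb$buchanan)            # correlation, roughly 0.87
fit <- lm(buchanan ~ bush, data = florida_no_pb)           # OLS fit without Palm Beach
coef(fit)                                                  # intercept ~ 65.573, slope ~ 0.00348
predict(fit, newdata = data.frame(bush = 152846))          # predicted Buchanan votes, ~ 598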

40
Case Study: POTUS Election 2000

[Scatter plot repeated: votes for Bush vs. votes for Buchanan, with the outlier Palm Beach marked.]

41
Multiple Regression

Multiple Linear (!) Regression


𝑘 independent (explanatory) variables 𝑋1 , … , 𝑋𝑘 :

𝑌 = 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + … + 𝛽𝑘 𝑋𝑘 + 𝜖.

Bi-variate Linear Regression

Y = \underbrace{\beta_0 + \beta_1 X_1 + \beta_2 X_2}_{\text{linear function}} + \underbrace{\epsilon}_{\text{model error}}

42
Multiple Regression

Bi-variate Linear Regression

Y = \underbrace{\beta_0 + \beta_1 X_1 + \beta_2 X_2}_{\text{linear function}} + \underbrace{\epsilon}_{\text{model error}}

Interpretation of the coefficients


• 𝛽0 is the average value of 𝑌 , if 𝑋1 = 𝑋2 = 0.
• 𝛽1 : 𝑌 changes by 𝛽1 , if 𝑋1 is raised by 1 and 𝑋2 remains unchanged.
• 𝛽2 : 𝑌 changes by 𝛽2 , if 𝑋2 is raised by 1 and 𝑋1 remains unchanged.

43
Case Study: Used Cars - Data

Data from the Red Book (USA) is sometimes questionable. A used-car dealer collected information
on 100 cars of one specific model (Ford Taurus).

# Price (USD) Miles (x100) Color Services (#) Age (months)

1 5318 373.88 1 2 72
2 5061 447.58 1 2 74
3 5008 458.33 3 2 49
4 5795 308.62 3 4 52
5 5784 317.05 2 4 54
6 5359 340.10 2 2 57

• Price is the dependent variable/attribute.


• All other variables/attributes are independent.

Remark: Color cannot be used directly because it is qualitative, but it could be transformed into
dummy variables.

44
Case Study: Used Cars

First step: manually choose a regression model to explain price, e.g. price ~ miles, which leads
to the following result:

Estimate (𝛽 ) p-Value

Intercept 6533.3830 <2e-16 ***


miles -3.1105 <2e-16 ***

price = 6533 - 3.1 * miles


(Adjusted 𝑅2 = 0.6466)

As there are several attributes available, the main task is to find the optimal model, i.e. the one
that explains the data best from a statistical viewpoint - in this case measured by the Adjusted 𝑅2 .
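
A minimal sketch of this step in R, using only the six example rows shown two slides earlier (the numbers on the slides are based on the full sample of 100 cars):

used_cars <- data.frame(
  price   = c(5318, 5061, 5008, 5795, 5784, 5359),
  miles   = c(373.88, 447.58, 458.33, 308.62, 317.05, 340.10),
  color   = c(1, 1, 3, 3, 2, 2),
  service = c(2, 2, 2, 4, 4, 2),
  age     = c(72, 74, 49, 52, 54, 57)
)
fit_miles <- lm(price ~ miles, data = used_cars)
summary(fit_miles)   # coefficients, p-values and Adjusted R-squared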

45
Case Study: Used Cars

Next manually chosen model: price ~ miles + services

Estimate (𝛽 ) p-Value

Intercept 6206.12836 <2e-16 ***


miles -3.14627 <2e-16 ***
service 135.83749 <2e-16 ***

price = 6206 - 3.1 * miles + 136 * service


(Adjusted 𝑅2 = 0.9735)

46
Case Study: Used Cars

Another manually chosen model: price ~ miles + age

Estimate (𝛽 ) p-Value

Intercept 6582.9949 <2e-16 ***


miles -3.1105 <2e-16 ***
age -0.9487 0.53

price = 6582 - 3.1 * miles - 0.9 * age


(Adjusted 𝑅2 = 0.6444)

47
Case Study: Used Cars

Best model (according to Adjusted 𝑅2 ) manually found:

price = 6206 - 3.1 * miles + 136 * service

The bi-variate linear regression model, in which the price depends on the number of miles and the
number of services, can be summarized as follows:

• For each additional 100 miles the average price is reduced by USD 3.1 - assuming the same
number of services.
• Comparing two cars where one received one service more than the other, the price of the
better-serviced car is expected to be USD 136 higher.

48
Model selection - general outline

• The statistical model should be as small as possible, i.e. use as few attributes of the original
dataset as possible.
• Simultaneously the model should not lose a significant amount of explanatory power in
relation to models with more attributes.
• Different measures are available to quantify the explanatory power (R-squared, AIC/BIC, …)

Model selection: a manual model selection has been conducted above - however, an automatic
model selection takes the full model as input and returns the best model possible.

49
Model selection (two attributes)

A dataset with one target 𝑌 and two features/attributes 𝑋1 and 𝑋2 would allow for the following
models:

𝑀0 ∶ 𝑌 = 𝛽0 + 𝜖
𝑀1 ∶ 𝑌 = 𝛽0 + 𝛽1 𝑋1 + 𝜖
𝑀2 ∶ 𝑌 = 𝛽0 + 𝛽2 𝑋2 + 𝜖
𝑀12 ∶ 𝑌 = 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + 𝜖

There are different strategies to find the optimal model (model selection) - this is crucial if the
dataset contains many attributes, which is the case in most Data Science applications.

The exponential growth of the number of candidate models requires heuristics, e.g. step-wise
regression (backward and forward), as sketched below.
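
A minimal sketch of automatic step-wise selection in R (the data frame used_cars from the case study, with columns price, miles, service and age, is assumed):

full  <- lm(price ~ miles + service + age, data = used_cars)   # full model
empty <- lm(price ~ 1, data = used_cars)                       # intercept-only model M0
step(full, direction = "backward")                             # backward elimination (AIC-based)
step(empty, scope = formula(full), direction = "forward")      # forward selection (AIC-based)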

50
