
On_regression_start.ipynb

Part 1 (cross sectional)

1
Simple Regression (for prediction)

2
Load and Visualize the Data

3
Divide the data into test and train set
For a cross-sectional data set: Sklearn utility with Shuffle=True

20% is a typical test set size if the data set is not too large
For large data sets, the test set fraction can be smaller
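
A minimal sketch of the split described above, assuming synthetic data (the arrays X and y, the seeds and the noise level are illustrative, not the notebook's data set):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 2 features (not the notebook's data)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Hold out 20% as the test set. shuffle=True is the default and suits
# cross-sectional data, where the order of the rows carries no information.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42
)
print(X_train.shape, X_test.shape)
```

random_state is fixed here only to make the split reproducible.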

4
Train (=fit) the model on the training data

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
5
Test (=predict) the model on the testing data

6
Evaluate the model using a metric (=R2 score)
R2 = appropriate metric for a cross-sectional model

Evaluate model on train data and test data


Expect the performance on test data to be worse
than on train data
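
A minimal sketch of the train, test and evaluate steps, assuming synthetic data (the coefficients, seeds and noise level are illustrative, not the notebook's data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Hypothetical data: a linear signal plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

lr = LinearRegression()
lr.fit(X_train, y_train)          # train (=fit) on the training data
y_pred = lr.predict(X_test)       # test (=predict) on the testing data

# lr.score returns R2; evaluate on both train and test data
train_r2 = lr.score(X_train, y_train)
test_r2 = lr.score(X_test, y_test)
print(f"train R2 = {train_r2:.3f}, test R2 = {test_r2:.3f}")
```

lr.score(X_test, y_test) gives the same number as r2_score(y_test, y_pred).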

7
Scikit learn model evaluation choices

https://scikit-learn.org/stable/modules/model_evaluation.html
8
Evaluate the model using a metric (=R2
score)

The score function of scikit learn’s linear regression model (lr) is R2 by default

#How to know what type of metric lr.score represents (in this case R2)?
#help(lr)

9
Getting Info on the Model

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
10
Final Results

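Intercept and coefficients are fitted attributes of the lr model (intercept_ and coef_); in scikit-learn the trailing underscore marks values estimated by fit, not private properties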
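
A small sketch of reading the final results, assuming noise-free synthetic data so the recovered values are easy to check (the data and the true values 4, 2 and -3 are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free data: y = 4 + 2*x1 - 3*x2
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = 4.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]

lr = LinearRegression().fit(X, y)

# Attributes ending in "_" exist only after fit() has been called
print("intercept:", lr.intercept_)
print("coefficients:", lr.coef_)
```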

11
Logistic Regression (for classification)

12
Logistic regression
• Logistic regression is essentially regression
with a [0-1] target variable.
• But regular linear regression cannot be used
to predict this target variable because:
– The predicted values will not be restricted to the range 0-1,
so they cannot be interpreted as probabilities

13
Logistic function

(1)

$$\frac{1}{1+e^{-x}} \qquad \text{Simplest version (2)}$$
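
A minimal sketch of the simplest version in code (the function name logistic is chosen here for illustration):

```python
import numpy as np

def logistic(x):
    """Simplest logistic function: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))

# Any real input is mapped into the open interval (0, 1)
values = logistic([-5.0, 0.0, 5.0])
print(values)
```

The output rises monotonically with x and equals 0.5 at x = 0.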

14
Logistic regression programmatically

[Diagram: the inputs gre, gpa and school rank, plus an intercept, form a linear approximation of the probability of admission, which is the input into the logistic function.]
15
x has a name…
• If you express a probability p as a function of x using the
logistic function:

$$p = \frac{1}{1+e^{-x}} \qquad (1)$$

• Then, when you solve for x, you get:

$$x = \ln\left(\frac{p}{1-p}\right) \qquad (2)$$

• But it turns out that p/(1-p) is the odds, the ratio
of the chance of a win to the chance of a loss,
so x means something: it is the natural log of the odds,
a.k.a. the “LOGIT”.
• Ok, so it is interesting that x is something familiar to a
betting expert. So now let us approximate x by a linear
equation of some kind (for example, with gpa, gre and rank as inputs):

$$\beta_0+\beta_1 x_1+\beta_2 x_2+\dots+\beta_k x_k = x \qquad (3)$$

16
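
A sketch of the three steps in code, assuming made-up coefficients and one made-up applicant (none of these numbers come from the notebook): a linear equation gives x, the logistic function turns x into a probability, and the logit recovers x.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    # Natural log of the odds p / (1 - p)
    return np.log(p / (1.0 - p))

# Hypothetical coefficients: intercept, gpa, gre, rank
beta = np.array([-4.0, 1.0, 0.002, -0.5])
# Hypothetical applicant; the leading 1.0 multiplies the intercept
applicant = np.array([1.0, 3.5, 600.0, 2.0])

x = beta @ applicant      # linear equation: -4 + 3.5 + 1.2 - 1.0
p = logistic(x)           # probability of admission
print(x, p, logit(p))     # logit(p) recovers x
```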


Probability model
Specifically, if p is the probability of being in
group 1, we estimate the model in equation:

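$$y = \frac{1}{1+e^{-(\beta_0+\beta_1 x_1+\beta_2 x_2+\dots+\beta_k x_k)}} \qquad (1)$$

$$y = \frac{1}{1+e^{-x}} \qquad (2)$$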

17
Ordinary Least Squares
• Recall the OLS cost function, the sum of squared residuals:

Minimize: $$\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2$$

18
Logistic regression fitting
• Maximum likelihood estimation: for any model,
we create a likelihood function that is basically
the probability of seeing the results we actually
saw. Then we maximize this likelihood function
with respect to the unknown parameters, in this
case the regression coefficients. In other words,
we choose the regression coefficients so that the
resulting model is most in line with what we
actually observed.

19
Likelihood to maximize
• The likelihood L that we want to maximize
when we build a logistic regression model,
assuming that the individual samples in our
dataset are independent of one another, is as
follows:

(1)

(2)
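
For reference, a sketch of the standard form, assuming p_i denotes the model's predicted probability that sample i is in group 1 and y_i (0 or 1) is its observed label. These symbols are chosen here for illustration. Each sample contributes the factor p_i when y_i = 1 and 1 - p_i when y_i = 0, and independence lets the factors be multiplied:

```latex
L = \prod_{i=1}^{n} p_i^{\,y_i}\,(1-p_i)^{\,1-y_i}

\ln L = \sum_{i=1}^{n} \left[\, y_i \ln p_i + (1-y_i)\,\ln(1-p_i) \,\right]
```

Taking the logarithm turns the product into a sum, which is easier to maximize.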
20
Logistic regression cost function =
negated log likelihood
• Transform the previous log likelihood function (to be
maximized) into a cost function (to be minimized), the
negated log likelihood:

(1)

(2)

(3)

https://scikit-learn.org/stable/modules/linear_model.html
21
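
A minimal numeric sketch of the negated log likelihood, assuming five made-up labels and predicted probabilities (y_true and p_hat are illustrative). Dividing the cost by the number of samples gives the value that sklearn.metrics.log_loss reports by default:

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical labels and predicted probabilities of class 1
y_true = np.array([1, 0, 1, 1, 0])
p_hat = np.array([0.9, 0.2, 0.7, 0.6, 0.1])

# Log likelihood of independent 0/1 outcomes (to be maximized)
log_likelihood = np.sum(
    y_true * np.log(p_hat) + (1 - y_true) * np.log(1 - p_hat)
)

cost = -log_likelihood            # negated log likelihood (to be minimized)
mean_cost = cost / len(y_true)    # per-sample average
print(cost, mean_cost, log_loss(y_true, p_hat))
```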
Load and visualize the data

y values are categorical with two categories

22
Load and visualize the data

y values have two categories = two colors

23
Divide the data into test and train set

Identical setup as for linear regression, since we are dealing with cross sectional data
Shuffle=True by default

24
Train (=fit) the model on the training data

Same setup as for linear regression


Scikit learn’s API is standardized for all models

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
25
Test (=predict) the model on the testing data

Same setup as for linear regression


Scikit learn’s API is standardized for all models

26
Evaluate the model using a metric (=accuracy
score)

Evaluate model on train data and test data


Expect the performance on test data to be worse
than on train data

27
Evaluate the model using a metric (=accuracy
score)

The score function of scikit learn’s logistic regression model (lr) is accuracy by default
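
A minimal sketch of the same fit, predict and score workflow for classification, assuming synthetic two-category data (the data and seeds are illustrative, not the notebook's data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical data: two features, target with two categories (0 and 1)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

# For a classifier, lr.score returns accuracy
train_acc = lr.score(X_train, y_train)
test_acc = lr.score(X_test, y_test)
print(f"train accuracy = {train_acc:.3f}, test accuracy = {test_acc:.3f}")
```

lr.score(X_test, y_test) gives the same number as accuracy_score(y_test, y_pred).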

28
Final Result

Same setup as for linear regression


Scikit learn API is standardized for all models

29
LINEAR REGRESSION CROSS VALIDATION - a
way to increase the data (no scaling)

Validation data

“Validation” data is a subset of the training data
The real test data is entirely separate
Appropriate for cross sectional data

30
3 way data split
3 types of data: Training, Validation, Testing

[Diagram: the entire data is first split into training and test data; the training data is then split into training and validation parts, one row per fold, with the test data kept separate in every row.]

Here we have 3 folds, meaning the validation algorithm runs 3 times, where each time it:
• Separates data into folds
• Selects 33% = validation
• Calculates the score
Any number of folds may be defined

Original order of data is not important = cross sectional data
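
A small sketch of the 3-fold idea using scikit-learn's KFold, assuming 12 made-up training rows. shuffle=True is a choice made here for illustration, since row order is not important for cross sectional data:

```python
import numpy as np
from sklearn.model_selection import KFold

# 12 hypothetical training rows (the test set has already been set aside)
X_train = np.arange(12).reshape(-1, 1)

kf = KFold(n_splits=3, shuffle=True, random_state=0)
fold_sizes = []
validation_rows = []
for train_idx, val_idx in kf.split(X_train):
    # Each run holds out one third of the training rows as validation data
    fold_sizes.append(len(val_idx))
    validation_rows.extend(val_idx.tolist())
    print("train:", train_idx, "validation:", val_idx)
```

Every training row is used as validation data exactly once across the 3 runs.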


31
Cross validation workflow

32
Load and visualize the data
Divide the data into test and train set: 80% train 20% test

33
Separate out a percent of the train set as
validation data: 20% (5 folds cv=5)
Scikit learn utility cross_val_score

Take out a fold of the train set data to use as validation data
The size of a fold depends on the number of folds
cv=5 defines the number of folds to be 5 (each fold is 20% of the train set)

Validation data

34
Calculate the score on the validation data
cross_val_score()

n_jobs=-1 allows the use of all processors

scoring=None uses the score function of the model
(= R2 by default in the case of linear regression)

35
Calculate the score on the validation data
cross_val_score()

scores is an array containing the score of each validation subset (= each fold).
scores.mean() calculates the mean of the validation scores.
The score of the test set should be calculated separately!
The mean validation score should be similar to the test set score.

Having a validation set enables one to run the same model with different parameter settings,
with the goal of getting a better validation score.
This is called optimizing the parameters of a model.
Once the final parameter settings are chosen,
the score of the test set is calculated separately and compared with the validation score.
Ideally the two scores should be similar,
but optimizing the parameters of a model can lead to overfitting.
Overfitting is when the validation score average is much better than the test score.
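
A minimal sketch of this workflow, assuming synthetic data (the data, seeds and noise level are illustrative). n_jobs is left at None here so the sketch runs anywhere; n_jobs=-1 would use all processors:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Hypothetical data: a linear signal plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=500)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

lr = LinearRegression()

# 5 folds: each fold (20% of the train set) serves once as validation data.
# scoring=None falls back to lr.score, i.e. R2.
scores = cross_val_score(lr, X_train, y_train, cv=5, scoring=None, n_jobs=None)
print("validation scores:", scores, "mean:", scores.mean())

# The test score is calculated separately, on data never used above
lr.fit(X_train, y_train)
test_score = lr.score(X_test, y_test)
print("test score:", test_score)
```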

36
LINEAR REGRESSION GRID SEARCH - a way to
choose the best parameters (no scaling)

37
Load and visualize the data

Load and visualize

38
Divide the data into test and train set: 80%
train 20% test

separate out the test set

39
Separate out a percent of the train set as
validation data: 20% (5 folds cv=5)
Scikit learn utility GridSearchCV

separate out the validation sets: each validation set is 20%

scoring to keep track of (could have set scoring=None since R2 is the default score of lr)

n_jobs to use all processors

refit: refit an estimator using the best found parameters on the whole dataset.
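
A minimal sketch of a grid search over LinearRegression, assuming synthetic data and a small illustrative grid over fit_intercept and positive. The notebook's actual grid is not shown on the slide. n_jobs is left at None here so the sketch runs anywhere; n_jobs=-1 would use all processors:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Hypothetical data with a nonzero intercept and one negative coefficient
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 4.0 + X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=500)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Illustrative grid of hyper-parameters (an assumption, not the notebook's)
param_grid = {"fit_intercept": [True, False], "positive": [True, False]}

grid = GridSearchCV(
    LinearRegression(), param_grid, cv=5, scoring="r2", n_jobs=None, refit=True
)
grid.fit(X_train, y_train)

best_params = grid.best_params_
print("best hyper-parameters:", best_params)
print("mean validation R2:", grid.best_score_)
print("test R2:", grid.score(X_test, y_test))
```

With refit=True the best combination is refitted on all of X_train, so grid can predict and score directly.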
40
Scoring Values for GridSearchCV()

41
Get the best hyper-parameters

best_params contains the best hyper-parameter values

42
Get the best parameters

43
Instantiate and train (=fit) a new model with
the best hyper-parameters on the training data
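
A short sketch of this step, assuming the search returned the dictionary below (a hypothetical result) and using noise-free synthetic training data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical result of a grid search (stand-in for the real best parameters)
best_params = {"fit_intercept": True, "positive": False}

# Hypothetical noise-free training data: y = 1 + 2*x1 - 1*x2
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 2))
y_train = 1.0 + X_train @ np.array([2.0, -1.0])

# Instantiate a new model with the best hyper-parameters and train it
final_model = LinearRegression(**best_params)
final_model.fit(X_train, y_train)
print(final_model.intercept_, final_model.coef_)
```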

44
Training vs. Validation Error
1. Underfitting – Validation and training error high
– high bias or systematic error
2. Overfitting – Validation error is high, training error low
– high variance or sensitivity to random outliers
3. Good fit – Validation error low, slightly higher than the
training error
4. Unknown fit - Validation error low, training error 'high'

Python Machine Learning by Sebastian Raschka (2017)
45


Scoring: R SQUARED
regression

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$

SSres = sum of the squares of the residuals (of the errors) = SSE
SStot = total sum of squares

sklearn.metrics.r2_score

A full but easy-to-understand explanation of R-squared
can be found in WonnacottNewCh15.pdf, pp. 433ff
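
A small worked sketch of R-squared computed from SSres and SStot, assuming four made-up values, compared with sklearn.metrics.r2_score:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

ss_res = np.sum((y_true - y_pred) ** 2)          # 0.25 + 0.25 + 0 + 0.25 = 0.75
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # 9 + 1 + 1 + 9 = 20
r2 = 1 - ss_res / ss_tot                         # 1 - 0.75 / 20 = 0.9625
print(r2, r2_score(y_true, y_pred))
```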
46
Scoring: R SQUARED PROBLEM

#Note: seq(from = 1, to = 10, by = ((to - from)/(length.out - 1)))

same y
same mean squared error
different R2
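
A Python sketch of this effect, assuming a made-up line y = 2x plus noise. This is an illustrative setup, not a reproduction of the slide's figure. The same noise vector is used over a wide and a narrow x range, with np.linspace playing the role of R's seq with length.out:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)
noise = rng.normal(scale=0.9, size=100)   # identical errors in both runs

def fit_and_score(x):
    """Fit y = 2x + noise by linear regression; return (MSE, R2) on the fit."""
    y = 2.0 * x + noise
    X = x.reshape(-1, 1)
    y_hat = LinearRegression().fit(X, y).predict(X)
    return mean_squared_error(y, y_hat), r2_score(y, y_hat)

mse_wide, r2_wide = fit_and_score(np.linspace(1, 10, 100))    # x from 1 to 10
mse_narrow, r2_narrow = fit_and_score(np.linspace(1, 2, 100)) # x from 1 to 2
print(f"wide:   MSE = {mse_wide:.3f}, R2 = {r2_wide:.3f}")
print(f"narrow: MSE = {mse_narrow:.3f}, R2 = {r2_narrow:.3f}")
```

The mean squared error is the same in both runs, yet R2 drops when x covers a narrower range.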

https://data.library.virginia.edu/is-r-squared-useless/
http://www.stat.cmu.edu/~cshalizi/mreg/15/lectures/10/lecture-10.pdf
47
Scoring: R SQUARED PROBLEM 2
• http://www.stat.cmu.edu/~cshalizi/mreg/15/lectures/10/lecture-10.pdf
argues that if we use linear regression for
nonlinear data and the variance of the X data
is large compared to the noise in the target Y, then
R-squared will be close to one but not reliable.

48
Scoring: R SQUARED PROBLEM 3
• The problem is R-squared is based on the
assumption that the following equation holds:
• SST = SSR + SSE (This equation is discussed in
WonnacottNewCh15.pdf p.434)
• where
• SSR is the sum of squares due to regression
• SSE is the sum of squares error (=SSres), and
• SST is the sum of squares total.

• R-squared is derived as follows:


• SST/SST = SSR/SST + SSE/SST
• 1 = SSR/SST + SSE/SST
• 1 - SSE/SST = SSR/SST
• SSR/SST = R-squared

• But the equation SST = SSR + SSE does not hold
unless the model is linear
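
A sketch checking the equation numerically, assuming synthetic data. For a linear model fitted by least squares with an intercept, SST equals SSR + SSE on the fitting data. For the hand-picked nonlinear predictions y_hat_other (hypothetical, not fitted by least squares) it does not:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data with a nonlinear relationship
rng = np.random.default_rng(0)
x = rng.uniform(0, 3, size=200)
y = np.exp(x) + rng.normal(scale=0.5, size=200)

def sums_of_squares(y, y_hat):
    sst = np.sum((y - y.mean()) ** 2)       # sum of squares total
    ssr = np.sum((y_hat - y.mean()) ** 2)   # sum of squares due to regression
    sse = np.sum((y - y_hat) ** 2)          # sum of squares error
    return sst, ssr, sse

# Linear model fitted by least squares (with intercept)
X = x.reshape(-1, 1)
y_hat_linear = LinearRegression().fit(X, y).predict(X)
sst, ssr, sse = sums_of_squares(y, y_hat_linear)
print("linear:   ", sst, ssr + sse)

# Hand-picked nonlinear predictions, not a least squares fit
y_hat_other = 0.8 * np.exp(x)
sst2, ssr2, sse2 = sums_of_squares(y, y_hat_other)
print("nonlinear:", sst2, ssr2 + sse2)
```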

49
Scoring: Mean Squared Error
regression

$$MSE = \frac{1}{n} \cdot SS_{res} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2$$

$$RMSE = \sqrt{MSE}$$
sklearn.metrics.mean_squared_error
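
A small worked sketch, assuming four made-up values:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

mse = mean_squared_error(y_true, y_pred)   # (1 + 0 + 1 + 4) / 4 = 1.5
rmse = np.sqrt(mse)                        # same units as y
print(mse, rmse)
```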

50
Scoring: Mean Absolute Error
regression

sklearn.metrics.mean_absolute_error
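
A small worked sketch, assuming four made-up values. The mean absolute error averages the absolute differences between actual and predicted values:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical actual and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

mae = mean_absolute_error(y_true, y_pred)   # (1 + 0 + 1 + 2) / 4 = 1.0
print(mae)
```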

51
Scoring: MAPE
regression

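import numpy as np

def mean_absolute_percentage_error(y_true, y_pred):
    # Mean of |error / actual value|, in percent; undefined if any y_true is 0
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100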

52
Scoring: MAD/MEAN RATIO
regression

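import numpy as np

def MAE_mean_ratio(y_true, y_pred):
    # Absolute errors scaled by the mean of the actual values, in percent
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean(np.abs((y_true - y_pred) / np.mean(y_true))) * 100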

53
Scoring: Information Coefficient
regression
• The Information Coefficient is the (Pearson)
correlation between the return on a common
stock predicted by a valuation model or analyst
and the actual return.
• Some researchers convert predicted and actual
values to ranks before correlating.
• This means their correlation is Spearman's
rank correlation coefficient, Spearman's rho.
• Spearman's rho is similar to Pearson's r (ranges
between -1 and 1), but exchanges the Pearson’s r
linear relationship for a monotonic relationship.
• Spearman’s rho slightly understates the strength
of the Pearson r relationship, but its sampling
error is about the same.
• Any IC based on fewer than several thousand
samples will suffer from significant sampling
error.
• Associated with a p_value (significance)
• See: ICAndCorrelation.pdf on the sampling error

1. Spearman Rank Correlation RHO

2. Pearson Correlation RHO

$$\rho = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{X_i-\bar{X}}{s_X}\right)\left(\frac{Y_i-\bar{Y}}{s_Y}\right)$$

SX and SY = stdev of X and Y
X_bar and Y_bar are the means of X and Y
pp. 426ff of WonnacottNewCh15.pdf

https://archive.md/g6hOF
https://archive.md/VfNkG
54
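
A minimal sketch of both versions of the Information Coefficient using scipy.stats, assuming made-up predicted and actual returns (the data and the 0.3 signal strength are illustrative). The manual line follows the Pearson formula with sample standard deviations:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical returns: predictions carry a weak signal about the actuals
rng = np.random.default_rng(0)
actual = rng.normal(size=500)
predicted = 0.3 * actual + rng.normal(size=500)

# Each call returns the correlation and its p_value (significance)
ic_pearson, p_pearson = pearsonr(predicted, actual)
ic_spearman, p_spearman = spearmanr(predicted, actual)   # correlation of ranks

# Pearson by hand: mean product of standardized values, divided by n - 1
x, y = predicted, actual
manual = np.sum(
    (x - x.mean()) / x.std(ddof=1) * (y - y.mean()) / y.std(ddof=1)
) / (len(x) - 1)

print(ic_pearson, p_pearson)
print(ic_spearman, p_spearman)
print(manual)
```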
Scoring: Accuracy
logistic regression

• i.e. the number of correctly answered ‘yes’ vs. ‘no’
questions divided by the total number of ‘yes’ vs. ‘no’
questions
• Answers: Yes = 1, No = 0
• Requires balance: questions requiring ‘yes’ and
requiring ‘no’ should be roughly equal in number

55
Scoring: F1-score
logistic regression
1. F1-score

$$F_1 = 2\cdot\frac{Precision\cdot Recall}{Precision+Recall}$$

• Where:
2. Precision
• i.e. the number of correctly answered ‘yes’ questions divided by the
number of ‘yes’ answers (whether correct or not)
3. Recall
• i.e. the number of correctly answered ‘yes’ questions divided by the
number of questions requiring ‘yes’ (to be correctly answered).
• F1 is the harmonic mean of Precision and Recall
• Does not require balance
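
A small worked sketch, assuming ten made-up questions of which only two require ‘yes’ (an unbalanced case). Accuracy looks good while precision, recall and F1 show the weak ‘yes’ performance:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical answers: Yes = 1, No = 0
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]   # questions requiring yes / no
y_pred = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]   # answers given

accuracy = accuracy_score(y_true, y_pred)     # 8 correct of 10 = 0.8
precision = precision_score(y_true, y_pred)   # 1 correct yes of 2 yes answers = 0.5
recall = recall_score(y_true, y_pred)         # 1 correct yes of 2 requiring yes = 0.5
f1 = f1_score(y_true, y_pred)                 # harmonic mean of 0.5 and 0.5 = 0.5
print(accuracy, precision, recall, f1)
```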

56
Scoring: Phik
logistic regression
• Phik (𝜙k) is a new and practical
correlation coefficient that works
consistently between categorical,
ordinal, interval (continuous) and even
mixed variables, captures non-linear
dependency and reverts to the Pearson
correlation coefficient in case of a
bivariate normal input distribution.
• Like Cramér’s Phi or matthews_corrcoef
(Phi), Phik is based on the Pearson Chi-squared
test and ranges between 0 and
1, but it corrects the former’s defects.
• Associated with a p_value (significance)
• Captures both linear and non linear
correlation
• Does not require balance

https://towardsdatascience.com/phik-k-get-familiar-with-the-latest-correlation-coefficient-9
57
