On - Regression - Start - Part1 (Repeated For Reference)

On_regression_start.ipynb
Part 1 (cross sectional)
Simple Regression (for prediction)
Load and Visualize the Data
Divide the data into test and train set
For a cross sectional data set: Sklearn utility with shuffle=True
20% is a typical test set size if the data set is not too large
For large data sets, the test set fraction can be smaller
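A minimal sketch of the split, assuming the Sklearn utility meant here is train_test_split (the slide does not name it) and using synthetic data invented for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic cross sectional data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Hold out 20% of the rows as the test set; shuffle=True is the default
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42
)
print(X_train.shape, X_test.shape)
```

random_state is fixed here only so that the split is reproducible.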
Train (=fit) the model on the training data
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
Test (=predict) the model on the testing data
Evaluate the model using a metric (=R2 score)
R2 = appropriate metric for a cross sectional model
Evaluate the model on the train data and on the test data
Expect the performance on the test data to be worse than on the train data
Scikit learn model evaluation choices
https://scikit-learn.org/stable/modules/model_evaluation.html
Evaluate the model using a metric (=R2 score)
The score function of scikit learn’s linear regression model (lr) is R2 by default
#How to know what type of metric lr.score represents (in this case R2)?
#help(lr)
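The fit, predict and score steps can be sketched as follows. The data is synthetic and the variable names are assumptions, not the notebook's own; the check against r2_score only illustrates that lr.score returns R2.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): linear signal plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

lr = LinearRegression()
lr.fit(X_train, y_train)       # train (=fit) on the training data
y_pred = lr.predict(X_test)    # test (=predict) on the testing data

train_r2 = lr.score(X_train, y_train)
test_r2 = lr.score(X_test, y_test)
# lr.score is R2, so it agrees with r2_score on the same predictions
print(train_r2, test_r2, r2_score(y_test, y_pred))
```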
Getting Info on the Model
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
Final Results
Intercept and coefficients are fitted attributes of the lr model, marked by a trailing underscore (intercept_, coef_)
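As a small illustration, on noise-free data invented for this sketch the fitted attributes recover the generating intercept and coefficients:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Noise-free hypothetical data: y = 3 + 2*x1 - 1*x2
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1]

lr = LinearRegression().fit(X, y)
print(lr.intercept_)   # estimated intercept
print(lr.coef_)        # one estimated coefficient per feature
```

Both attributes exist only after fit has been called.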
Logistic Regression (for classification)
Logistic regression
• Logistic regression is essentially regression with a [0-1] target variable.
• But regular linear regression cannot be used to predict this target variable because:
– The predicted values will not necessarily lie between 0 and 1, so they cannot be interpreted as probabilities
Logistic function
(1)
Simplest version: $f(x) = \frac{1}{1 + e^{-x}}$ (2)
Logistic regression programmatically
[Diagram: gre, gpa, school rank and intercept → linear approximation → input into logistic function → probability of admission]
x has a name…
• If you express a probability as a function of x using the logistic function: $p = \frac{1}{1 + e^{-x}}$ (1)
• Then, when you solve for x, you get $x = \ln\frac{p}{1 - p}$ (2)
• But it turns out that p/(1-p) is the odds, the chance of a win divided by the chance of a loss, so x means something: it is the natural log of the odds, a.k.a. “LOGIT”.
• Ok, so it is interesting that x is something familiar to a betting expert. So now let us approximate x by a linear equation of some kind: $\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k = x$ (3), where $x_1, x_2, x_3$ are, for example, gpa, gre and rank.
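The relationship between the logistic function and the logit can be checked numerically. This is a sketch only: the coefficient values and the applicant's features below are hypothetical, not fitted to the admissions data.

```python
import numpy as np

def logistic(x):
    """Map a log-odds value x to a probability p."""
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    """Map a probability p back to the natural log of the odds."""
    return np.log(p / (1.0 - p))

# Hypothetical coefficients (intercept, gpa, gre, rank) and one applicant
beta = np.array([-4.0, 0.8, 0.002, -0.5])
features = np.array([1.0, 3.5, 600.0, 2.0])   # leading 1.0 multiplies the intercept

x = beta @ features    # linear approximation of the log-odds
p = logistic(x)        # probability of admission
print(x, p, logit(p))
```

Applying logit to p returns the original x, which is the sense in which x "has a name".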

Probability model
Specifically, if p is the probability of being in group 1, we estimate the model in equation:
y = (1)
$\frac{1}{1 + e^{-x}}$ (2)
Ordinary Least Squares
• Recall the OLS cost function:
Minimize: $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
Logistic regression fitting
• Maximum likelihood estimation: for any model, we create a likelihood function that is basically the probability of seeing the results we actually saw. Then we maximize this likelihood function with respect to the unknown parameters, in this case the regression coefficients. In other words, we choose the regression coefficients so that the resulting model is most in line with what we actually observed.
Likelihood to maximize
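• The likelihood L that we want to maximize when we build a logistic regression model, assuming that the individual samples in our dataset are independent of one another, is as follows:
$L = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i}$ (1)
$\ln L = \sum_{i=1}^{n} \left[ y_i \ln p_i + (1 - y_i) \ln(1 - p_i) \right]$ (2)
where $y_i \in \{0, 1\}$ is the observed class of sample i and $p_i$ is the model's probability that sample i is in group 1.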
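Logistic regression cost function = negated log likelihood
• Transform the previous log likelihood function (to be maximized) into a cost function (to be minimized), the negated log likelihood:
$J = -\ln L = -\sum_{i=1}^{n} \left[ y_i \ln p_i + (1 - y_i) \ln(1 - p_i) \right]$
where $y_i \in \{0, 1\}$ is the observed class and $p_i$ the predicted probability of group 1.
https://scikit-learn.org/stable/modules/linear_model.html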
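A minimal numeric sketch of the negated log likelihood for independent 0/1 samples. The observed classes and probabilities are made up, and this is the unregularized cost only; scikit-learn's own objective also includes a regularization term by default.

```python
import numpy as np

def negated_log_likelihood(y, p):
    """Cost to minimize: minus the log likelihood of independent 0/1 samples."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    return -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

y_obs = [1, 0, 1]            # hypothetical observed classes
p_good = [0.9, 0.1, 0.8]     # probabilities in line with the observations
p_bad = [0.4, 0.6, 0.3]      # probabilities mostly at odds with them

cost_good = negated_log_likelihood(y_obs, p_good)
cost_bad = negated_log_likelihood(y_obs, p_bad)
print(cost_good, cost_bad)
```

The probabilities that agree better with the observations give the lower cost, which is what fitting exploits.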
Load and visualize the data
y values are categorical with two categories
y values have two categories = two colors
Divide the data into test and train set
Identical setup as for linear regression, since we are dealing with cross sectional data
Shuffle=True by default
Train (=fit) the model on the training data
Same setup as for linear regression
Scikit learn’s API is standardized for all models
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
Test (=predict) the model on the testing data

Scikit learn’s API is standardized for all models
Evaluate the model using a metric (=accuracy score)
Evaluate the model on the train data and on the test data
Expect the performance on the test data to be worse than on the train data
Evaluate the model using a metric (=accuracy score)
The score function of scikit learn’s logistic regression model (lr) is accuracy by default
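The same fit, predict and score sequence for classification can be sketched as below, on synthetic two-class data invented for illustration. The comparison with accuracy_score only illustrates that lr.score is accuracy here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical two-class data: class 1 when a noisy linear score is positive
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

lr = LogisticRegression()
lr.fit(X_train, y_train)       # train (=fit)
y_pred = lr.predict(X_test)    # test (=predict)

train_acc = lr.score(X_train, y_train)
test_acc = lr.score(X_test, y_test)
print(train_acc, test_acc, accuracy_score(y_test, y_pred))
```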
Final Result

Scikit learn API is standardized for all models
LINEAR REGRESSION CROSS VALIDATION - a way to increase the data (no scaling)
Validation data
“Validation” data is a subset of the training data
The real test data is entirely separate
Appropriate for cross sectional data
3 way data split
3 types of data: Training, Validation, Testing
[Diagram: the entire data is split into Training and Test; the training part is then split three times into Training and Validation, each time with a different validation fold]
Here we have 3 folds, meaning the validation algorithm runs 3 times, where each time it:
• Separates data into folds
• Selects 33% = validation
• Calculates the score
Any number of folds may be defined
Original order of data is not important = cross sectional data
Cross validation workflow
Divide the data into test and train set: 80% train 20% test
Separate out a percent of the train set as validation data: 20% (5 folds, cv=5)
Scikit learn utility cross_val_score
Take out a fold of the train set data to use as validation data
Size of fold depends on number of folds
cv=5 defines the number of folds to be 5 (each fold is 20% of the train set)
Calculate the score on the validation data
cross_val_score()
n_jobs=-1 allows for the use of all processors
scoring=None uses the score function of the model (=by default R2 in the case of linear regression)
Calculate the score on the validation data
cross_val_score()
scores is an array containing the score of each validation subset (=each fold)
scores.mean() calculates the mean of the validation scores
The score of the test set should be calculated separately!
The mean validation score should be similar to the test set score
Having a validation set enables one to run the same model with different parameter settings, with the goal of getting a better validation score. This is called optimizing the parameters of a model.
Once the final parameter settings are chosen, the score of the test set is calculated separately and compared with the validation score.
Ideally the two scores should be similar, but optimizing the parameters of a model can lead to overfitting.
Overfitting is when the validation score average is much better than the test score.
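The workflow above can be sketched as follows, on synthetic data with assumed variable names. n_jobs is left at its default here; n_jobs=-1 would use all processors.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data (hypothetical): linear signal plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(250, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=250)

# The test set is split off first and is never seen by cross validation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

lr = LinearRegression()
# cv=5: each of the 5 folds of the train set serves once as validation data
scores = cross_val_score(lr, X_train, y_train, cv=5, scoring=None)
print(scores, scores.mean())

# The test score is calculated separately, after fitting on the full train set
test_score = lr.fit(X_train, y_train).score(X_test, y_test)
print(test_score)
```

On well-behaved data like this, the mean validation score and the test score should come out close to each other.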
LINEAR REGRESSION GRID SEARCH - a way to choose the best parameters (no scaling)
Load and visualize
Divide the data into test and train set: 80% train 20% test
Separate out the test set
Separate out a percent of the train set as validation data: 20% (5 folds, cv=5)
Scikit learn utility GridSearchCV
Separate out the validation sets: each validation set is 20% of the train set
scoring: the metric to keep track of (could have set scoring=None since R2 is the default score of lr)
n_jobs=-1 to use all processors
refit=True: refit an estimator using the best found parameters on the whole dataset.
Scoring Values for GridSearchCV()
Get the best hyper-parameters
best_params_ contains the best hyper-parameter values
Get the best parameters
Instantiate and train (=fit) a new model with the best hyper-parameters on the training data
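A minimal sketch of the grid search steps. The parameter grid is a hypothetical choice (the slides do not show which hyper-parameters were searched), and the data is synthetic with a non-zero intercept so that the search has something to find.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data (hypothetical) with intercept 5
rng = np.random.default_rng(0)
X = rng.normal(size=(250, 3))
y = 5.0 + X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=250)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hypothetical grid: try the model with and without an intercept
param_grid = {"fit_intercept": [True, False]}
grid = GridSearchCV(LinearRegression(), param_grid, cv=5, scoring="r2", refit=True)
grid.fit(X_train, y_train)

best_params = grid.best_params_
print(best_params, grid.best_score_)

# Instantiate and fit a new model with the best hyper-parameters
best_lr = LinearRegression(**best_params).fit(X_train, y_train)
test_score = best_lr.score(X_test, y_test)
print(test_score)
```

Because refit=True, grid.best_estimator_ already holds an equivalent fitted model; instantiating a new one simply mirrors the slide's step.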
Training vs. Validation Error
1. Underfitting – Validation and training error high
– high bias or systematic error
2. Overfitting – Validation error is high, training error low
– high variance or sensitivity to random outliers
3. Good fit – Validation error low, slightly higher than the training error
4. Unknown fit – Validation error low, training error 'high'
Python Machine Learning by Sebastian Raschka (2017)

Scoring: R SQUARED
regression
$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$
SSres = sum of the squares of the residuals (of the errors) = SSE
SStot = sum of the squares total
sklearn.metrics.r2_score
A full but easy to understand explanation of R-Squared can be found in WonnacottNewCh15.pdf pp. 433ff
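As a quick check of the definition, R2 computed by hand from SSres and SStot matches sklearn.metrics.r2_score. The numbers are invented for this sketch.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.0])   # hypothetical predictions

ss_res = np.sum((y_true - y_pred) ** 2)          # 0.25 + 0.25 + 0 + 0 = 0.5
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # mean is 6: 9 + 1 + 1 + 9 = 20
r2_manual = 1.0 - ss_res / ss_tot                # 1 - 0.5/20 = 0.975
print(r2_manual, r2_score(y_true, y_pred))
```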
Scoring: R SQUARED PROBLEM
#Note: seq(from = 1, to = 10, by = ((to - from)/(length.out - 1)))
Same y, same mean squared error, different R2
https://data.library.virginia.edu/is-r-squared-useless/
http://www.stat.cmu.edu/~cshalizi/mreg/15/lectures/10/lecture-10.pdf
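The effect can be reproduced in Python. This is a sketch in the spirit of the linked article, not its exact code; the coefficients, noise level and x ranges are assumptions. The same noise is fitted over a wide and a narrow x range, which leaves the mean squared error unchanged while R2 changes a great deal.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)
noise = rng.normal(scale=0.9, size=100)   # identical noise in both cases

def fit_and_score(x):
    """Fit y = 2 + 1.2*x + noise by OLS and return (MSE, R2) on the same data."""
    X = x.reshape(-1, 1)
    y = 2.0 + 1.2 * x + noise
    y_hat = LinearRegression().fit(X, y).predict(X)
    return mean_squared_error(y, y_hat), r2_score(y, y_hat)

mse_wide, r2_wide = fit_and_score(np.linspace(1, 10, 100))     # wide x range
mse_narrow, r2_narrow = fit_and_score(np.linspace(1, 2, 100))  # narrow x range
print(mse_wide, r2_wide)
print(mse_narrow, r2_narrow)
```

np.linspace plays the role of R's seq with length.out.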
Scoring: R SQUARED PROBLEM 2
• http://www.stat.cmu.edu/~cshalizi/mreg/15/lectures/10/lecture-10.pdf argues that if we use linear regression for nonlinear data and the variance of the X data is large relative to the noise in the target Y, then R-squared will be close to one but not reliable.
Scoring: R SQUARED PROBLEM 3
• The problem is R-squared is based on the
assumption that the following equation holds:
• SST = SSR + SSE (This equation is discussed in
WonnacottNewCh15.pdf p.434)
• where
• SSR is the sum of squares due to regression
• SSE is the sum of squares error (=SSres), and
• SST is the sum of squares total.
• R-squared is derived as follows:

• SST/SST = SSR/SST + SSE/SST
• 1 = SSR/SST + SSE/SST
• 1 - SSE/SST = SSR/SST
• SSR/SST = R-squared
• But the equation SST = SSR + SSE does not hold unless the model is linear
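The decomposition can be checked numerically. In this sketch the data and the nonlinear prediction are hypothetical; the identity holds (up to rounding) for a straight line with intercept fitted by least squares, and fails for a prediction that is not such a fit.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 3.0, 50)
y = np.exp(x) + rng.normal(scale=0.5, size=50)   # hypothetical nonlinear data

def sums_of_squares(y, y_hat):
    """Return (SST, SSR, SSE) for observations y and predictions y_hat."""
    sst = np.sum((y - y.mean()) ** 2)
    ssr = np.sum((y_hat - y.mean()) ** 2)
    sse = np.sum((y - y_hat) ** 2)
    return sst, ssr, sse

# Linear model with intercept, fitted by ordinary least squares
slope, intercept = np.polyfit(x, y, 1)
sst, ssr, sse = sums_of_squares(y, intercept + slope * x)
linear_gap = sst - (ssr + sse)

# A nonlinear prediction that is not a least squares fit (hypothetical choice)
sst_n, ssr_n, sse_n = sums_of_squares(y, np.exp(0.9 * x))
nonlinear_gap = sst_n - (ssr_n + sse_n)
print(linear_gap, nonlinear_gap)
```

When the gap is not zero, 1 - SSE/SST and SSR/SST are no longer the same number, so R-squared loses its "fraction of variance explained" reading.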
Scoring: Mean Squared Error
regression
$MSE = \frac{1}{n} SS_{res} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
$RMSE = \sqrt{MSE}$
sklearn.metrics.mean_squared_error
Scoring: Mean Absolute Error
regression
sklearn.metrics.mean_absolute_error
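A short sketch of both error metrics on made-up numbers. RMSE is taken as the square root of the MSE; the mean absolute error is the average of the absolute residuals.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 9.0, 9.0])   # hypothetical predictions

mse = mean_squared_error(y_true, y_pred)    # (1 + 0 + 4 + 0) / 4 = 1.25
rmse = np.sqrt(mse)                         # same units as y
mae = mean_absolute_error(y_true, y_pred)   # (1 + 0 + 2 + 0) / 4 = 0.75
print(mse, rmse, mae)
```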
Scoring: MAPE
regression
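import numpy as np

def mean_absolute_percentage_error(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)  # accept lists as well as arrays
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100  # in percent; undefined if y_true contains 0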
Scoring: MAD/MEAN RATIO
regression
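import numpy as np

def MAE_mean_ratio(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)  # accept lists as well as arrays
    return np.mean(np.abs((y_true - y_pred) / np.mean(y_true))) * 100  # MAE as a percent of the mean of y_true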
Scoring: Information Coefficient
regression
• The Information Coefficient is the (Pearson) correlation between the return on a common stock predicted by a valuation model or analyst and the actual return.
• Some researchers convert predicted and actual values to ranks before correlating.
• This means their correlation is the Spearman's rank correlation coefficient, Spearman's rho.
• Spearman's rho is similar to Pearson's r (ranges between -1 and 1), but exchanges the Pearson’s r linear relationship for a monotonic relationship.
• Spearman’s rho slightly understates the strength of the Pearson r relationship, but its sampling error is about the same.
• Any IC based on fewer than several thousand samples will suffer from significant sampling error.
• Associated with a p_value (significance)
• See: ICAndCorrelation.pdf on the sampling error
https://archive.md/g6hOF
https://archive.md/VfNkG
1. Spearman Rank Correlation RHO
2. Pearson Correlation RHO
$r = \frac{1}{n - 1} \sum_{i=1}^{n} \left( \frac{X_i - \bar{X}}{s_X} \right) \left( \frac{Y_i - \bar{Y}}{s_Y} \right)$
SX and SY = stdev of X and Y
X_bar and Y_bar are the means of X and Y
pp. 426ff in WonnacottNewCh15.pdf
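Both versions of the Information Coefficient, with their p-values, can be sketched with scipy.stats. The use of SciPy is an assumption, since the slides do not say which library was used, and the returns below are simulated.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical predicted and actual returns
rng = np.random.default_rng(0)
actual = rng.normal(scale=0.05, size=500)
predicted = 0.3 * actual + rng.normal(scale=0.05, size=500)

ic_pearson, p_pearson = pearsonr(predicted, actual)      # linear relationship
ic_spearman, p_spearman = spearmanr(predicted, actual)   # Pearson r on the ranks
print(ic_pearson, p_pearson)
print(ic_spearman, p_spearman)
```

With only 500 samples the two estimates can differ noticeably from run to run, in line with the point about sampling error above.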
Scoring: Accuracy
logistic regression
• i.e. number of correctly answered ‘yes’ vs. ‘no’ questions divided by total number of ‘yes’ vs. ‘no’ questions
• Answers: Yes = 1, No = 0
• Requires balance: questions requiring ‘yes’ and requiring ‘no’ should be roughly equal in number
Scoring: F1-score
logistic regression
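1. F1-score: $F_1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$
• Where:
2. Precision: $Precision = \frac{TP}{TP + FP}$
• i.e. number of correctly answered ‘yes’ questions divided by the number of ‘yes’ answers (whether correct or not)
3. Recall: $Recall = \frac{TP}{TP + FN}$
• i.e. number of correctly answered ‘yes’ questions divided by the number of questions requiring ‘yes’ (to be correctly answered).
• TP = true positives, FP = false positives, FN = false negatives
• F1 is the harmonic mean of Precision and Recall
• Does not require balance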
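A small worked example of the classification scores, using sklearn.metrics on eight made-up answers (3 true positives, 1 false positive, 1 false negative, 3 true negatives):

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # questions requiring yes (1) or no (0)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # hypothetical answers given by the model

accuracy = accuracy_score(y_true, y_pred)     # (3 + 3) / 8
precision = precision_score(y_true, y_pred)   # 3 / (3 + 1)
recall = recall_score(y_true, y_pred)         # 3 / (3 + 1)
f1 = f1_score(y_true, y_pred)                 # harmonic mean of precision and recall
print(accuracy, precision, recall, f1)
```

In this balanced example all four scores coincide at 0.75; on unbalanced data accuracy and F1 typically diverge.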
Scoring: Phik
logistic regression
• Phik (𝜙k) is a new and practical correlation coefficient that works consistently between categorical, ordinal, interval (continuous) and even mixed variables, captures non-linear dependency and reverts to the Pearson correlation coefficient in case of a bivariate normal input distribution.
• Like Cramer’s Phi or matthews_corrcoef (Phi), Phik is based on the Pearson Chi-squared test and ranges between 0 and 1, but it corrects the former’s defects.
• Associated with a p_value (significance)
• Captures both linear and non linear correlation
• Does not require balance
https://towardsdatascience.com/phik-k-get-familiar-with-the-latest-correlation-coefficient-9
