1
Simple Regression (for prediction)
2
Load and Visualize the Data
3
Divide the data into test and train set
For a cross-sectional data set
Sklearn utility
shuffle=True
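A minimal sketch of this step, assuming a small synthetic data set (the arrays `X` and `y`, the 80/20 ratio and the `random_state` value are illustrative and not taken from the notebook). `train_test_split` shuffles by default, which suits cross-sectional data where row order carries no meaning.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical cross-sectional data: 100 rows, 1 feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=100)

# shuffle=True is the default; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=0
)
print(X_train.shape, X_test.shape)
```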
4
Train (=fit) the model on the training data
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
5
Test (=predict) the model on the testing data
6
Evaluate the model using a metric (=R2 score)
R2 = appropriate metric for a cross-sectional model
7
Scikit learn model evaluation choices
https://scikit-learn.org/stable/modules/model_evaluation.html
8
Evaluate the model using a metric (=R2 score)
The score function of scikit-learn's linear regression model (lr) is R2 by default.
# How can you tell which metric lr.score represents (in this case R2)?
# help(lr)
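The train, test and evaluate steps can be sketched as follows, assuming synthetic data in place of the notebook's data set (the data and variable names other than `lr` are illustrative). The last line compares `lr.score` with `sklearn.metrics.r2_score` to confirm that the default score is R2.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Hypothetical data: y depends linearly on one feature plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(scale=0.3, size=200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

lr = LinearRegression()
lr.fit(X_train, y_train)          # train (= fit) on the training data
y_pred = lr.predict(X_test)       # test (= predict) on the testing data
score = lr.score(X_test, y_test)  # evaluate: R2 by default
print(score, r2_score(y_test, y_pred))
```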
9
Getting Info on the Model
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
10
Final Results
11
Logistic Regression (for classification)
12
Logistic regression
• Logistic regression is essentially regression with a binary (0 or 1) target variable.
• But regular linear regression cannot be used to predict this target variable because:
– The predicted values will not stay between 0 and 1, so they cannot be interpreted as probabilities
13
Logistic function
(1)
$f(x) = \frac{1}{1 + e^{-x}}$   Simplest version (2)
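A small sketch of the simplest logistic function, using only the standard library (the function name `logistic` is chosen here for illustration). It shows that any real input is mapped into the open interval (0, 1), with 0 mapped to 0.5.

```python
import math

def logistic(x: float) -> float:
    """Simplest logistic function: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

# Large negative inputs approach 0, large positive inputs approach 1.
for x in (-5.0, 0.0, 5.0):
    print(x, logistic(x))
```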
14
Logistic regression programmatically
[Diagram: the inputs gre, school rank, gpa and an intercept form a linear approximation, which is the input into the logistic function; the output is the probability of admission.]
15
x has a name…
• If you express a probability p as a function of x using the logistic function, $p = \frac{1}{1 + e^{-x}}$ (1), then solving for x gives $x = \ln\frac{p}{1 - p}$ (2).
• It turns out that p/(1-p) is the odds, the chance of a win divided by the chance of a loss, so x means something: it is the natural log of the odds, a.k.a. the "LOGIT".
• So x is something familiar to a betting expert. Now let us approximate x by a linear equation of some kind: $\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k = x$ (3)
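To make the logit concrete, here is a sketch under stated assumptions: the coefficients and the applicant's feature values below are invented for illustration and are not fitted to any data. It builds x as a linear combination, passes it through the logistic function, and checks that the log of the odds recovers x.

```python
import math

def logistic(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def logit(p: float) -> float:
    """Natural log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients and one applicant's features (illustration only).
beta = {"intercept": -4.0, "gre": 0.002, "gpa": 0.8, "rank": -0.5}
applicant = {"gre": 600.0, "gpa": 3.5, "rank": 2.0}

# Linear approximation of the log-odds: beta_0 + beta_1*x_1 + ... + beta_k*x_k
x = beta["intercept"] + sum(beta[k] * applicant[k] for k in applicant)
p = logistic(x)          # probability between 0 and 1
print(x, p, logit(p))    # logit(p) recovers x
```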
17
Ordinary Least Squares
• Recall the OLS cost function:
Minimize: $SS_{res} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
18
Logistic regression fitting
• Maximum likelihood estimation: for any model, we create a likelihood function, which is the probability of seeing the results we actually saw. Then we maximize this likelihood function with respect to the unknown parameters, in this case the regression coefficients. In other words, we choose the regression coefficients so that the resulting model is most in line with what we actually observed.
19
Likelihood to maximize
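• The likelihood L that we want to maximize when we build a logistic regression model, assuming that the individual samples in our dataset are independent of one another, is as follows:
$L = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i}$ (1)
where $p_i$ is the predicted probability that $y_i = 1$. For a sample with $y_i = 1$ the factor is $p_i$; for a sample with $y_i = 0$ it is $1 - p_i$.
$\ln L = \sum_{i=1}^{n} \left[ y_i \ln p_i + (1 - y_i) \ln(1 - p_i) \right]$ (2)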
20
Logistic regression cost function=
negated log likelihood
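• Transform the previous log likelihood function (to be maximized) into a cost function (to be minimized), the negated log likelihood:
$J = -\ln L = -\sum_{i=1}^{n} \left[ y_i \ln p_i + (1 - y_i) \ln(1 - p_i) \right]$
where $p_i$ is the predicted probability that $y_i = 1$.
https://scikit-learn.org/stable/modules/linear_model.html
21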
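A numeric sketch of the likelihood and its negated log, assuming the standard Bernoulli form (each sample contributes p when y = 1 and 1 - p when y = 0). The labels and probabilities are made up for illustration. `sklearn.metrics.log_loss` with `normalize=False` returns the same summed cost.

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical labels and predicted probabilities P(y = 1).
y = np.array([1, 0, 1, 1, 0])
p = np.array([0.9, 0.2, 0.7, 0.6, 0.1])

# Likelihood: product of the probabilities assigned to what was observed.
likelihood = np.prod(np.where(y == 1, p, 1 - p))

# Log likelihood (to be maximized) and its negation (cost to be minimized).
log_likelihood = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
cost = -log_likelihood

print(likelihood, cost, log_loss(y, p, normalize=False))
```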
Load and visualize the data
22
Load and visualize the data
23
Divide the data into test and train set
Identical setup as for linear regression, since we are dealing with cross-sectional data
shuffle=True by default
24
Train (=fit) the model on the training data
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
25
Test (=predict) the model on the testing data
26
Evaluate the model using a metric (=accuracy
score)
27
Evaluate the model using a metric (=accuracy
score)
The score function of scikit-learn's logistic regression model (lr) is accuracy by default.
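The classification workflow can be sketched the same way as for linear regression, again assuming synthetic data (the two features and the rule generating `y` are illustrative). The comparison with `accuracy_score` confirms that the default score is accuracy, and `predict_proba` returns values between 0 and 1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical binary target driven by two features plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

lr = LogisticRegression()
lr.fit(X_train, y_train)                 # train
y_pred = lr.predict(X_test)              # predicted classes (0 or 1)
proba = lr.predict_proba(X_test)[:, 1]   # predicted P(y = 1)
score = lr.score(X_test, y_test)         # accuracy by default
print(score, accuracy_score(y_test, y_pred))
```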
28
Final Result
29
LINEAR REGRESSION CROSS VALIDATION -a
way to increase the data (no scaling)
“Validation” data is a subset of the training data.
The real test data is entirely separate.
Appropriate for cross-sectional data.
30
3 way data split
3 types of data: Training, Validation, Testing
[Diagram: the entire data is split into training and test data; in each of three rows the training data is split again into training and validation parts, with the test part kept separate.]
Here we have 3 folds, meaning the validation algorithm runs 3 times, where each time it:
• Separates the data into folds
• Selects 33% as validation data
• Calculates the score
Any number of folds may be defined.
32
Load and visualize the data
Divide the data into test and train set: 80% train 20% test
33
Separate out a percent of the train set as
validation data: 20% (5 folds cv=5)
Scikit-learn utility: cross_val_score
Take out a fold of the train set data to use as validation data.
The size of a fold depends on the number of folds.
cv=5 sets the number of folds to 5 (each fold is 20% of the train set).
34
Calculate the score on the validation data
cross_val_score()
35
Calculate the score on the validation data
cross_val_score()
scores is an array containing the score of each validation subset (= each fold).
scores.mean() calculates the mean of the validation scores.
The score of the test set should be calculated separately!
The mean validation score should be similar to the test set score.
Having a validation set enables one to run the same model with different parameter settings, with the goal of getting a better validation score.
This is called optimizing the parameters of a model.
Once the final parameter settings are chosen, the score of the test set is calculated separately and compared with the validation score.
Ideally the two scores should be similar, but optimizing the parameters of a model can lead to overfitting.
Overfitting is when the validation score average is much better than the test score.
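A minimal sketch of the cross-validation step, assuming synthetic data and an 80/20 train/test split (data and coefficients are illustrative). `cross_val_score` is run on the train set only; the test score is computed separately afterwards and compared with the mean validation score.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Hypothetical data: three features with a linear relationship plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(250, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=250)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

lr = LinearRegression()
# cv=5: each fold (20% of the train set) serves once as validation data.
scores = cross_val_score(lr, X_train, y_train, cv=5)  # R2 per fold

# cross_val_score fits clones, so fit lr itself before scoring the test set.
lr.fit(X_train, y_train)
test_score = lr.score(X_test, y_test)
print(scores, scores.mean(), test_score)
```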
36
LINEAR REGRESSION GRID SEARCH -a way to
choose the best parameters (no scaling)
37
Load and visualize the data
38
Divide the data into test and train set: 80%
train 20% test
39
Separate out a percent of the train set as
validation data: 20% (5 folds cv=5)
Scikit-learn utility: GridSearchCV
scoring: the metric to keep track of (could have set scoring=None since R2 is the default score of lr)
41
Get the best hyper-parameters
42
Get the best parameters
43
Instantiate and train (=fit) a new model with
the best hyper-parameters on the training data
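The grid search steps can be sketched as below. The parameter grid is an assumption made for illustration (the slides do not list the grid that was searched); `fit_intercept` is one of the few hyper-parameters `LinearRegression` exposes. The data are synthetic, with a non-zero intercept so that the search has a clear winner.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Hypothetical data with a non-zero intercept.
rng = np.random.default_rng(0)
X = rng.normal(size=(250, 2))
y = X @ np.array([2.0, -1.0]) + 5.0 + rng.normal(scale=0.5, size=250)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hypothetical grid; every combination is scored with 5-fold validation.
param_grid = {"fit_intercept": [True, False]}
grid = GridSearchCV(LinearRegression(), param_grid, cv=5, scoring="r2")
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)

# Instantiate and train a new model with the best hyper-parameters.
best_lr = LinearRegression(**grid.best_params_)
best_lr.fit(X_train, y_train)
print(best_lr.score(X_test, y_test))
```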
44
Training vs. Validation Error
1. Underfitting – Validation and training error high
– high bias or systematic error
2. Overfitting – Validation error is high, training error low
– high variance or sensitivity to random outliers
3. Good fit – Validation error low, slightly higher than the
training error
4. Unknown fit - Validation error low, training error 'high'
SSres = sum of the squares of the residuals (of the errors) = SSE
SStot = sum of the squares total
$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$ (sklearn.metrics.r2_score)
[Figure labels: same y, same mean squared error, different R2]
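As a check on the definitions, the sketch below computes R2 by hand as 1 - SSres/SStot and compares it with `sklearn.metrics.r2_score`. The four observations and predictions are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual and predicted values.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # sum of squared residuals
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

print(ss_res, ss_tot, r2_manual, r2_score(y_true, y_pred))
```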
https://data.library.virginia.edu/is-r-squared-useless/
http://www.stat.cmu.edu/~cshalizi/mreg/15/lectures/10/lecture-10.pdf
47
Scoring: R SQUARED PROBLEM 2
• http://www.stat.cmu.edu/~cshalizi/mreg/15/lectures/10/lecture-10.pdf argues that if we use linear regression for nonlinear data and the variance of the X data is large compared to the target Y, then R-squared will be close to one but unreliable.
48
Scoring: R SQUARED PROBLEM 3
• The problem is R-squared is based on the
assumption that the following equation holds:
• SST = SSR + SSE (This equation is discussed in
WonnacottNewCh15.pdf p.434)
• where
• SSR is the sum of squares due to regression
• SSE is the sum of squares error (=SSres), and
• SST is the sum of squares total.
49
Scoring: Mean Squared Error
regression
$MSE = \frac{1}{n} SS_{res} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
$RMSE = \sqrt{MSE}$
sklearn.metrics.mean_squared_error
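A short sketch of both quantities on made-up values. The square root is taken with NumPy rather than through a keyword argument, since the option for returning RMSE directly has changed between scikit-learn versions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted values.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

mse_manual = np.mean((y_true - y_pred) ** 2)   # (1/n) * SSres
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                            # same units as y

print(mse_manual, mse, rmse)
```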
50
Scoring: Mean Absolute Error
regression
sklearn.metrics.mean_absolute_error
51
Scoring: MAPE
regression
52
Scoring: MAD/MEAN RATIO
regression
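The formulas for the three preceding scores are not in the slide text, so the sketch below assumes the common definitions: MAE is the mean of |y - ŷ|, MAPE is the mean of |y - ŷ| / |y| (as a fraction, which is what scikit-learn returns), and the MAD/mean ratio is the mean absolute error divided by the mean of the actual values. The data are made up for illustration.

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
)

# Hypothetical actual and predicted values.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)
mape = mean_absolute_percentage_error(y_true, y_pred)  # fraction, not percent
mape_manual = np.mean(np.abs(y_true - y_pred) / np.abs(y_true))

# Assumed definition: mean absolute deviation over the mean of the actuals.
mad_mean_ratio = mae / y_true.mean()

print(mae, mape, mad_mean_ratio)
```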
53
Scoring: Information Coefficient
regression
• The Information Coefficient is the (Pearson) correlation between the return on a common stock predicted by a valuation model or analyst and the actual return.
• Some researchers convert predicted and actual values to ranks before correlating.
• This means their correlation is the Spearman's rank correlation coefficient, Spearman's rho.
• Spearman's rho is similar to Pearson's r (ranges between -1 and 1), but exchanges the Pearson’s r linear relationship for a monotonic relationship.
• Spearman’s rho slightly understates the strength of the Pearson r relationship, but its sampling error is about the same.
• Any IC based on fewer than several thousand samples will suffer from significant sampling error.
• Associated with a p_value (significance)
• See: ICAndCorrelation.pdf on the sampling error
1. Spearman Rank Correlation RHO: the Pearson correlation computed on the ranks of X and Y
2. Pearson Correlation RHO:
$r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{X_i - \bar{X}}{s_X} \right) \left( \frac{Y_i - \bar{Y}}{s_Y} \right)$
$s_X$ and $s_Y$ = stdev of X and Y
$\bar{X}$ and $\bar{Y}$ are the means of X and Y
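Both versions of the Information Coefficient can be computed with SciPy, as in this sketch. The predicted and actual values are invented, and chosen so that the relationship is monotonic but not linear, which makes the difference between the two coefficients visible. Each function also returns the associated p-value.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical predicted and actual returns: monotonic, but not linear.
predicted = np.arange(1.0, 11.0)
actual = predicted ** 3

pearson_r, pearson_p = pearsonr(predicted, actual)       # linear relationship
spearman_rho, spearman_p = spearmanr(predicted, actual)  # rank (monotonic)

print(pearson_r, pearson_p)
print(spearman_rho, spearman_p)
```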
55
Scoring: F1-score
logistic regression
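1. F1-score: $F_1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$
• Where (TP = true positives, FP = false positives, FN = false negatives):
2. Precision: $\frac{TP}{TP + FP}$
• i.e. the number of correctly answered ‘yes’ questions divided by the number of ‘yes’ answers (whether correct or not)
3. Recall: $\frac{TP}{TP + FN}$
• i.e. the number of correctly answered ‘yes’ questions divided by the number of questions requiring ‘yes’ (to be correctly answered).
• F1 is the harmonic mean of Precision and Recall
• Does not require balanced classes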
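A worked example of the three scores on made-up labels, using the standard definitions in terms of true positives, false positives and false negatives. Here there are 2 true positives, 1 false positive and 2 false negatives.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels: 1 = 'yes', 0 = 'no'.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2 / 3
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2 / 4
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

print(precision, recall, f1)
```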
56
Scoring: Phik
logistic regression
• Phik (𝜙k) is a new and practical
correlation coefficient that works
consistently between categorical,
ordinal, interval (continuous) and even
mixed variables, captures non-linear
dependency and reverts to the Pearson
correlation coefficient in case of a
bivariate normal input distribution.
• Like Cramer’s Phi or matthews_corrcoef
(Phi), Phik is based on the Pearson Chi-
squared test and ranges between 0 and
1, but it corrects the former’s defects.
• Associated with a p_value (significance)
• Captures both linear and non linear
correlation
• Does not require balance
https://towardsdatascience.com/phik-k-get-familiar-with-the-latest-correlation-coefficient-9
57