
Machine Learning

Terminology is very important. Memorize Names and Terms. New terms are written in Red.

Table of Contents
Chapter 1 – Machine Learning
Chapter 2 – I am Machine
Chapter 3 – I am Model
Chapter 4 – I am Model's Customer
Chapter 5 – I am a Data Scientist – part 1
Chapter 6 – Linear Models – part 1
Chapter 7 – I am Data… Bias/Variance and Sample Bias
Chapter 8 – I am Error
Chapter 9 – Decision Trees
Chapter 10 – I am a Data Scientist – part 2
Chapter 11 – Ensemble of Models
Chapter 12 – Inside the Black-Box – SHAP Analysis
Chapter 13 – Neural Networks
Chapter 14 – Topics: Sample Split, Regularization, Over/Under-sampling
Chapter 15 – Unsupervised Learning: Clustering, PCA, Anomaly Detection, and Recommender Systems
Chapter 16 – Linear Models – part 2
Chapter 17 – Model Interpretation – to be completed
Chapter 18 – Gradient Descent, Advanced Recommender Systems, Advanced Neural Networks – to be completed
Appendix I – SQL Programming

Chapter 1 – Machine Learning

What is Machine Learning (ML)?

ML is the process in which Machine (Computer) learns to do something cool.

For example, it learns to talk to us, like in ChatGPT. Or it learns to estimate someone’s income, much
better than what we as humans can do.

The income model is an example of a simple ML model. ChatGPT is an example of a complex model. In this course, we will start with the income model and end with ChatGPT. In between, we will see a "Credit Risk" model, an "Algorithmic Trading" model, a "Recommender System" model, and a "Sentiment Analysis Language Model". You will see that all ML models, simple to complex, are trying to do the same thing: Find Similar Cases. But don't worry about that for now.

The output of a ML model is called the Model's Output!

How does Machine Learning help the World?

Output of ML model can be used to satisfy a need, improve business processes, …

For example, ChatGPT (or any other good Language Model) can give excellent responses to your questions. Sample applications are Customer Service Chatbots (or other types of chatbots) that can help 24 hours a day with many of the questions.

An income model, which estimates people's income, can be used to identify rich people. A possible application of such a model is a seller of luxury brands. The company can use the income model to identify rich people who are potential customers of luxury products.

How does machine learn?

Machine learns by receiving and processing information.

For example, machine can process information on people's conversations – all conversations on the internet, for example – and will learn how to talk.

Or it can receive information on many people's incomes, and learn how to estimate someone's income.

How should we give information to machine?

Machine receives information in the form of a table, called Data. Knowledge and understanding of Data and Data structure is an important part of ML. We will talk about Data a lot.

For example, here is an example of income data, on the income of some people.

Industry       Years of Experience   Annual Income ($)
Data Science   10                    250,000
Data Science   0                     130,000
Professor      5                     200,000
Professor      1                     185,000

We give this data to machine - data on the income of thousands or millions of people. Machine is great at processing a lot of data. It will process this data and come up with a model that estimates income based on "Industry" and "Years of Experience".

Estimated Income = Function of (Industry, Years of Experience)

This model will be used for a person whose "Industry" and "Years of Experience" are known, but whose "Income" is not known. This concept is very important. A ML model is Trained on Known data, to be used on Unknown data.

Going back to the example of the seller of luxury brands, the company may have information on someone's Industry and Years of Experience, but information on salary is often non-public. The company can use the income model to estimate potential customers' income. The marketing team can use this information to target potential customers who have enough money to buy a luxury product. For example, they may decide to advertise only to customers whose "Estimated Salary" from the model is higher than $200,000.

Question. If the ML Model says "Brad's income is $200,000", how much do you trust it? We will talk about that in Chapter 2.

For a language model, data may look like this:

Question Answer
How are you? Doing fine!
How are you doing? Awesome!
How are you? Awesome! Learning to Train Machine.

Then computer may learn to respond “Awesome.” in response to “How are you?”

Spoiler: Do you think Machine knows what it says? Does it understand concepts? Not as of now. It just generates a sequence of words, i.e. a sentence.

So, Data is how machine receives information, and then will learn from it. Data has columns and rows:

 Columns are the Attributes, also called Features or Variables. The income data above has 3 attributes: "Industry", "Years of Experience", and "Annual Income ($)". The text data has two attributes: "Question" and "Answer".

 Rows are Observations. For example, in the income data if we have data on 30,000 individuals,
then data will have 30,000 observations/rows. For the language model, we can have access to all
the text on the internet; billions, trillions of observations.

Throughout this course, we will use features to refer to columns/attributes/variables. We will talk a lot
about Features and Observations.

The feature that we are trying to estimate is called Target Variable. Also called Dependent Variable, and Y
Variable. In the income model, “Annual Income ($)” is the target variable. In the language model,
“Answer” is the target variable.

As we mentioned earlier, a ML model generates an output, also called Ŷ (Y-hat). The model's output is an estimation of the target variable. In the income model, the model's output is an estimation of someone's income (the target variable).

Other features, that are used in estimating target variable, are called Independent Variables, X variables,
or just features/variables/attributes.

So, in this course, our data has a target variable, and some features. ML model will estimate Target, using
Features.

Note: Later we will see that there are cases where we don't have a Target variable. We will talk about this later.

Important Conceptual Point: Note that the Target is actual. In the income data, the "Annual Income ($)" column, which is the target, is someone's actual income. But the Output is an estimate. The model may estimate that the income of Brad, a Data Scientist with 4 years of experience, is $181,932, while Brad actually makes $205,000.

All ML models have Error, defined as the difference between Target and Output (Reality vs. Model’s
Estimation), but their error can be much lower than ours. A large portion of our course will talk about
how to minimize error.

Error=Target−Output

Chapter 2 – I am Machine

How does machine learn?

This is an important question. The answer will tell us how machine thinks, so we know what machine means when it gives an answer (output).

For example, in the income model, how does machine estimate someone’s income? Or in the language
model, how does machine answer a question?

We will explain machine’s learning process in a philosophical, and a mathematical way.

Philosophical:

Machine’s response is based on similar observations. For example, in the income model, to estimate
income of “a data scientist with 10 years of experience”, machine checks data, and finds data scientists
who have around 10 years of experience, and may give average of those salaries. As you will see, you can
control the machine to give you median, rather than average, income of similar observations. In fact you
can ask for anything among similar observations: Min, Max, Different Percentiles, …

But the thing that does not change, is that machine first finds similar observations.

In the language model, when machine is asked a question, it will find similar questions that have been
asked, and based on responses to those questions, will come up with an answer. We will talk about how
machine analyzes text later, but basics are the same. Machine finds similar observations.

Mathematical:

As we mentioned, model always has error; i.e. model’s output is not exactly equal to target. Machine
tries to minimize error. In fact that is how machine learns: By Minimizing Error. Optimization is at the
core of Machine Learning.

In this chapter, we will discuss how model calculates error. We will later discuss how model minimizes
error. All the exciting topics that you may be waiting for, such as Linear models, Neural Networks, and
XGBoost, are basically optimization techniques to minimize error.

Let’s start the Math.

We know how to calculate error for a single observation:

Error=Target−Output
For example, if actual income (Y ) for someone is $180,000, and the model predicts $165,000, then
model’s error for this observation will be: 180,000−165,000=15,000 .

But how do we aggregate error across all observations, so we have the model's overall error? The function that aggregates error across all observations is called the Loss Function, one of the most important concepts in ML.

As a data scientist, it is your job to use an appropriate Loss Function. It will affect how well the machine learns, and how useful the model will be.

Let's discuss what a Loss Function is, with an example. Table 3 shows the same information as in Table 1, with two additional columns: "Estimated Income" and "Error". The assumption is that a ML model has been Trained on this data, and these columns show the model's output and error for each observation.

Industry       Years of Experience   Annual Income ($)   Estimated Income ($)   Error ($)
Data Science   10                    250,000             236,000                14,000
Data Science   0                     130,000             142,000                -12,000
Professor      5                     200,000             200,000                0
Professor      1                     185,000             186,000                -1,000

Table 3 shows that the model is doing a good job on the third observation, with 0 error. It has a small error on the last observation, and high errors on the first two.

To aggregate errors across all observations, we can sum them up:


Aggregated Error = Loss Function = Sum of Errors = ∑_{i=1}^{n} (Y_i − Ŷ_i) = ∑_{i=1}^{n} ε_i

where ε_i is the error on the i-th observation, and n is the number of observations. For example, in the table above:

Sum of Errors = ∑ ε_i = 14,000 − 12,000 + 0 − 1,000 = 1,000

The problem with this approach is that negative and positive errors cancel each other out. So it may look like the model has low error, while in reality that is not the case. In other words, the sign of the error is not important for us (negative or positive); we only care about the absolute size of the error. An error of -2,000 is as bad as an error of 2,000.

For example, Table 4 shows output of another model, on the same data.

Industry       Years of Experience   Annual Income ($)   Estimated Income ($)   Error ($)
Data Science   10                    250,000             249,000                1,000
Data Science   0                     130,000             129,800                200
Professor      5                     200,000             199,230                770
Professor      1                     185,000             184,343                657

While it seems like the second model is a better model - as it has lower errors across observations - the sum of errors is higher for the second model.

Sum of Errors = ∑ ε_i = 1,000 + 200 + 770 + 657 = 2,627

So we need a Loss function that does not care about the sign of the error. One option is to define the Loss function based on Squared Error:

Sum of Squared Errors = ∑_{i=1}^{n} (ε_i)²

The problem with this one, is that this Loss function will be higher when number of observations is
higher. We don’t want to judge a model based on number of observations. So we define our first Loss
function as Mean Squared Error or MSE (also called Quadratic Loss and L2 Loss).

MSE = (1/n) ∑_{i=1}^{n} (Y_i − Ŷ_i)² = (1/n) ∑_{i=1}^{n} (ε_i)²

MSE for Table 3 would be: (14,000² + (−12,000)² + 0² + (−1,000)²) / 4 = 85,250,000

MSE for Table 4 would be: (1,000² + 200² + 770² + 657²) / 4 = 516,137.25
So based on MSE loss function, the second model is a better model, as it has lower Loss.

Another metric is the square root of MSE, called Root Mean Square Error (RMSE). These two are basically the same metric, and are used interchangeably.

So, the machine's goal is to build a model that minimizes the Loss Function. Note that we have not yet discussed the details of this process. We have not yet discussed how the model is built, and how the model generates the Output. We just know that the model is trying to generate an Output (Ŷ) that minimizes the Loss function.

MSE is one of the most popular loss functions used in ML. Another popular Loss function is Mean
Absolute Error or MAE (also called L1 Loss).
MAE = (1/n) ∑_{i=1}^{n} |Y_i − Ŷ_i| = (1/n) ∑_{i=1}^{n} |ε_i|

MAE for Table 3 would be: (|14,000| + |−12,000| + |0| + |−1,000|) / 4 = 6,750

MAE for Table 4 would be: (|1,000| + |200| + |770| + |657|) / 4 = 656.75

Note that MSE penalizes large errors, because it is based on squared error. For example, in Table 3, the error of the first observation is 14,000, which squared becomes 196,000,000.

Another popular Loss function is Mean Absolute Percentage Error or MAPE, which emphasizes the percentage error in prediction, defined for each observation as (Target − Output) / Target.

MAPE = (1/n) ∑_{i=1}^{n} |(Y_i − Ŷ_i) / Y_i| = (1/n) ∑_{i=1}^{n} |ε_i / Y_i|
Assignment: Calculate MAPE for table 4. The answer is around 0.003.
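If you want to check these numbers yourself, here is a minimal Python sketch (NumPy only; the variable names are just illustrative) that computes MSE, MAE, and MAPE for the Table 4 values:

import numpy as np

# Target and output values from Table 4
y = np.array([250_000, 130_000, 200_000, 185_000], dtype=float)       # Annual Income (Target)
y_hat = np.array([249_000, 129_800, 199_230, 184_343], dtype=float)   # Estimated Income (Output)

errors = y - y_hat                    # Error = Target - Output

mse = np.mean(errors ** 2)            # Mean Squared Error
mae = np.mean(np.abs(errors))         # Mean Absolute Error
mape = np.mean(np.abs(errors / y))    # Mean Absolute Percentage Error

print(f"MSE:  {mse:,.2f}")   # 516,137.25
print(f"MAE:  {mae:,.2f}")   # 656.75
print(f"MAPE: {mape:.4f}")   # about 0.003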

Which Loss function should be used? As mentioned, the choice of Loss function is the modeler's decision, and it is one of the most important conceptual choices in a ML model.

For example, if you use MSE, then you are penalizing large errors. Why? Because MSE is based on Error², and raising to the power of two makes large numbers very large (an error of 1,000 becomes 1,000,000).

Chapter 3 – I am Model

So far we learned that ML is the process in which Machine Learns to do something cool. In more
technical terms, ML is the process in which we train machine on some data. Machine learns from the
data, relationships in data, …

In this chapter we will know ML models better.

What are the different types of ML models?

There are three types of ML models:

1. Supervised Models – Labeled Data
2. Unsupervised Models
3. Reinforcement Learning Models

In this course, we will cover Supervised and Unsupervised models. Reinforcement learning is out of the scope of this course. Just so you know, if you would like to create a game or a driverless car, that is the type of model you need to learn.

Supervised models are the most popular type of ML models. Supervised models are ML models in which we try to estimate a Target variable. Until now, we have been mainly talking about supervised models. Both the income model and the language model we discussed so far are supervised models. So a supervised model is a ML model in which there is a target variable (Y), and the model's output (Ŷ) is an estimate of the target.

Unsupervised models are models in which there is no target variable in the data. Imagine the income data without the income column - basically data on some people's job and years of experience. We can still use this data to build a model, an unsupervised model. We will discuss these models later: What are they? How do we build them? How do we use them? For now, let's focus on supervised models.

What are the different types of Supervised models?

There are two types of supervised models:

1. Regression
2. Classification

Difference between regression and classification models is in the type (or shape or format) of the target
variable. In regression models, target variable is a Real, Continuous Variable. The income model is an
example of a regression model, because income is a continuous number. It can be any (positive) value.

In a classification model, the target is a Categorical Variable. A categorical variable cannot take any value in a range; rather, it is composed of a few categories, also called Classes or Labels.

For example, in the income model, imagine that target variable is in the form of Rich or Poor rather than
Salary. So target variable is not continuous, rather it is a categorical variable with two classes or two
labels. Table 5 shows an example of this data.

Industry       Years of Experience   Rich or Poor
Data Science   10                    Rich
Data Science   0                     Poor
Professor      5                     Rich
Professor      1                     Rich

There are many examples of classification models. In fact, classification models are the most common
type of ML models in the industry. Many models such as Credit Risk model and many Language models
are classification models. In this course, we will work on several examples of these models.

In a classification model, if target has only two classes, it is also called a Binary Classification Model. The
above example is a binary model, as there are two classes: Rich and Poor. If target has more than two
labels, it is also called a Multi-Class or Multi-Label Classification Model. An example is a ML model that
predicts someone’s nationality. Target variable for this model is name of countries.

Can you suggest some features that might help with such a model? In other words what are some of the
features that help with guessing someone’s nationality? Assign useful features in the following table.
This is an important skill for a data scientist: Be able to define factors that help with a model.

Feature 1 Feature 2 Feature 3 Country

What is the output of a Classification model?

As we discussed, the output of a (supervised) ML model is an estimation of the model's target variable. For example, for the income model, the output is an estimation of income. But what about a classification model? For example in Table 5, where the target is Rich or Poor, what is the output?

Output of a classification model is the Probability that observation belongs to a specific class. For
example, in the above example, model’s output is the probability that a person is Rich (or Poor). So, in a
classification model, output is a number between 0 and 1, that shows probability that observation
belongs to a class. For example, in table 5, if model’s output for observation 1 is 0.73, it means the
person is rich with 73% probability. It also says the person is poor with 27% probability.

A good classification model, assigns high probability of 1 when target is 1, and gives low probability of 1
when target is 0.

In a binary classification model, we are often interested in one of the two classes. For example, in the
above example we might be interested in estimating probability of Rich. Then we call Rich the Response.
For this reason, target variable in a classification model is also called Response Variable. In table 5, 3 out
of 4 observations are Rich, which means Response Rate is 75%.

Q. Can machine work with text, like Rich and Poor? No. Machine only understands numbers. Before you
feed table 5 to a machine to build a model, you need to convert the target variable to number. How? You
can convert Rich to 1, and Poor to 0. So the table will look like following:

Industry       Years of Experience   Rich or Poor
Data Science   10                    1
Data Science   0                     0
Professor      5                     1
Professor      1                     1
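As a small illustration of this conversion, here is a minimal pandas sketch (the DataFrame and column names are hypothetical, chosen to mirror Table 5):

import pandas as pd

# Hypothetical data mirroring Table 5
df = pd.DataFrame({
    "Industry": ["Data Science", "Data Science", "Professor", "Professor"],
    "Years of Experience": [10, 0, 5, 1],
    "Rich or Poor": ["Rich", "Poor", "Rich", "Rich"],
})

# Convert the text labels to numbers: Rich -> 1, Poor -> 0
df["Rich or Poor"] = df["Rich or Poor"].map({"Rich": 1, "Poor": 0})
print(df)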

What do the Error and Loss function look like for a classification model?

For a regression model, we defined error as the difference between the Target and the Model's Output. We also defined three Loss functions that aggregate error across observations: MSE, MAE, MAPE.

But what about a classification model? Can we define Error and Loss function the same way? In a classification model, as we discussed, the output (Ŷ) is the probability of response. So, for example, the following could be the output and error for a model on the above table, where the output is the Probability that the Observation belongs to Class 1 (Rich).

Industry       Years of Experience   Rich or Poor   Model's Output (Probability of 1)   Error
Data Science   10                    1              0.76                                0.24
Data Science   0                     0              0.56                                -0.56
Professor      5                     1              0.34                                0.66
Professor      1                     1              0.83                                0.17

And MSE will look like:

MSE = (1/n) ∑_{i=1}^{n} (Y_i − Ŷ_i)² = (0.24² + (−0.56)² + 0.66² + 0.17²) / 4 = 0.208925

And the model will try to minimize this error. The error would be minimized if Output (Probability) is
close to 1 when Target is 1, and is close to 0 when target is 0.

As mentioned, "A good classification model assigns high probability of 1 when the target is 1, and gives low probability of 1 when the target is 0". So, the MSE optimization problem will give us a good model. There is only one possible improvement: if instead of Ŷ we write the Loss function based on ln Ŷ, the machine can solve the optimization problem much faster and more easily. This Loss function, which is by far the most common categorical Loss function, is called the Cross Entropy Loss (also called Log Loss) function.

Cross Entropy for a Binary classification problem is as follows:

Binary Cross Entropy = (1/n) ∑_{i=1}^{n} [ −Y_i ln Ŷ_i − (1 − Y_i) ln(1 − Ŷ_i) ]
Make sure you can follow this. To make sense of Cross Entropy, let's calculate it for the above table:

Binary Cross Entropy = −(1/4) × [ 1×ln(0.76) + (1−1)×ln(1−0.76) + 0×ln(0.56) + (1−0)×ln(1−0.56) + 1×ln(0.34) + (1−1)×ln(1−0.34) + 1×ln(0.83) + (1−1)×ln(1−0.83) ] ≈ 0.59
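As a quick numeric check, here is a minimal Python sketch (NumPy only, no ML library) that computes the Binary Cross Entropy for the four observations above:

import numpy as np

# Target and model output (probability of class 1) from the table above
y = np.array([1, 0, 1, 1], dtype=float)
y_hat = np.array([0.76, 0.56, 0.34, 0.83])

# Binary cross entropy, averaged over observations
bce = np.mean(-y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat))
print(round(bce, 4))  # about 0.59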
Does Cross Entropy make sense? In other words, how do we know that minimizing Cross Entropy gives us a good ML model? Try to answer it yourself, before checking the solution. It is fun.

As mentioned, “A good classification model, assigns high probability of 1 when target is 1, and gives low
probability of 1 when target is 0”. Let’s see how Cross Entropy works.

For a single observation, if the observation is a response, i.e. the target variable is 1, then only the first part of the Cross Entropy formula matters. We have:

Cross Entropy = −Y ln Ŷ − (1 − Y) ln(1 − Ŷ) = −Y ln Ŷ = −ln Ŷ

So if the observation is a response, the Cross Entropy for that observation is −ln Ŷ. Ŷ is the model's output, which is a probability, i.e. a number between 0 and 1. Therefore −ln Ŷ will be minimized if Ŷ is close to 1. Cross Entropy passes the first half of the condition to be a good model.

If the observation is a non-response, i.e. the target variable is 0, then only the second part of the Cross Entropy formula matters. We have:

Cross Entropy = −Y ln Ŷ − (1 − Y) ln(1 − Ŷ) = −(1 − Y) ln(1 − Ŷ) = −ln(1 − Ŷ)

So if the observation is a non-response, the Cross Entropy for that observation is −ln(1 − Ŷ). Ŷ is the model's output, which is a probability, i.e. a number between 0 and 1. Therefore −ln(1 − Ŷ) will be minimized if Ŷ is close to 0. Cross Entropy also passes the second half of the condition to be a good model.

So, so far we have MSE, MAE, and MAPE for Regression models, and Cross Entropy for Classification
models.

I’m Machine.

Let’s hear from machine, what we have given to it, and what it will give us.

You gave me a Matrix of numbers, which you call Data. You said you want to estimate one column of
this data (you call it Target variable), using the other columns (you call them Features). You told me to
use Cross Entropy Loss Function. I will build a model that minimizes this function. In other words, I will
build a model, that generates outputs that minimize this function.

We will see how the model will be built.

Chapter 4 – I am Model’s Customer

How do we know if a model is good?

How can we estimate ML model’s impact on the business results? Output of ML model would be used as
an input to a strategy. So to calculate impact of ML model, we should calculate its impact in the context
of a strategy. And that means we should first define an “Optimum Strategy” based on a ML model.

In this chapter we will discuss two examples of defining a strategy for a regression and a classification
model. We are in the domain of Data Analytics. We use output of a ML model to define a strategy. It
often means to define certain policies based on the output of ML model and some thresholds.

Note: Strategy design is a very innovative process. There are many pieces to define when designing a
data strategy, and process might be very different case by case. It also always requires understanding of
the business, and the business problem to solve. In contrast, ML model development process is much
more standard. The focus of our course is on development of ML models. We will discuss some examples
of data strategies (like in this chapter), but you will need to keep reading about different data strategies
in different industries. That will help you build a mental framework on strategy design, despite
differences between strategies.

Regression Example – Algorithmic Trading Model:

An Algorithmic Trading model is a ML model that predicts return on a financial asset, such as a stock. Return is the percentage change in price, which indicates profit in a trade. For example, if you invest $1 in a stock, and price increases by 10%, your $1 would become $1.10, and you have made 10 cents as profit.

Assume the following table shows output of an algorithmic trading model. Output column shows
model’s prediction, and Target column shows actual returns on some stocks. Strategy team wants to use
this data to define the following trading strategy:

“Company would buy a stock if Expected Return on the Stock (Model’s Output) is higher than a
Threshold.”

So, strategy team needs to define the optimum threshold.

Model's Output (Expected Return)   Target (Actual Return)
1.5%                               2%
-0.7%                              0.2%
2.3%                               0.3%
0.78%                              0.5%
0.5%                               -1.2%
1.1%                               -0.8%
4.2%                               6.1%

Threshold can be any real number, but we can not test all real numbers. So we test the middle point
between any two consecutive model’s outputs, as possible thresholds. To do so, we sort data based on
output’s column, and calculate middle points. For the above table, outputs sorted would be:

[-0.7, 0.5, 0.78, 1.1, 1.5, 2.3, 4.2]

So the possible thresholds would be:

[-0.1, 0.64, 0.94, 1.3, 1.9, 3.25]

Next, we calculate Profit based on any threshold.

If we use -0.1 as the threshold, it means the company would buy a stock if the Model's output is higher than -0.1%. It means the company will trade in all the above cases except for the second observation, where the expected return is -0.7%. The return from this threshold/strategy is the sum of the Actual Return on all trades; in this case: 2 + 0.3 + 0.5 - 1.2 - 0.8 + 6.1 = 6.9%.

Same way, we can calculate the Return for all other thresholds. The following table shows the results. Verify the results.

Threshold   Return on the Strategy
-0.1        6.9%
0.64        8.1%
0.94        7.6%
1.3         8.4%
1.9         6.4%
3.25        6.1%
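A quick way to verify this table is a small Python sketch like the following (plain Python; the outputs, targets, and candidate thresholds are the ones listed above):

# Model outputs and actual returns (in %), from the table above
outputs = [1.5, -0.7, 2.3, 0.78, 0.5, 1.1, 4.2]
targets = [2.0, 0.2, 0.3, 0.5, -1.2, -0.8, 6.1]

thresholds = [-0.1, 0.64, 0.94, 1.3, 1.9, 3.25]

for t in thresholds:
    # Buy every stock whose expected return (model output) is above the threshold,
    # and add up the actual returns of those trades.
    strategy_return = sum(actual for predicted, actual in zip(outputs, targets) if predicted > t)
    print(f"Threshold {t}: return {strategy_return:.1f}%")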

The threshold that gives the highest return is 1.3. So strategy suggests to buy any stock with estimated
return - based on ML model – higher than 1.3%. Expected Profit from this “ML Model/Strategy” is 8.4%.
If company already has a strategy, this profit will be compared against the existing strategy’s return, and
if better, will be used going forward.

Also, if there are several ML model candidates, Strategy team would define the optimum
threshold/strategy for each model, and the model/strategy that gives the highest expected profit would
be chosen.

Note that in a real case there are other factors to consider in the strategy design. For example, in the above case, assume management decides to trade at least 50% of the time. In that case what would be the optimum threshold? (answer: 0.64)

Question. Let's solve a more sophisticated strategy. Imagine the company is trying to assign two thresholds, for the following strategy. What are the optimum thresholds?

“Company would invest $2 in a stock if Expected Return on the Stock (Model’s Output) is higher than
Threshold2, and would invest $1 if model’s output is between Threshold1 and Threshold2. If models’
output is less than Threshold1, company would not invest in the asset.”

Classification Example – Credit Risk Model:

Profit and Loss of a strategy based on a classification model is analyzed by calculating Profit from True
Positive and True Negative, and Loss from False Positive and False Negative. We are going to discuss
these ideas in a Credit Risk model.

A Credit (Default) Risk model predicts the Probability of Default of a credit applicant, for credit products such as Credit Cards, Mortgages, Car Loans, Student Loans, …

When someone applies for credit, Lending Companies use these models to estimate the Probability that the applicant will pay back the credit, and based on that, the company decides whether to accept a credit application.

The following table shows target variable and model’s output for a Credit Risk model. Model’s output is
“Probability that an Applicant will Default”. Target Variable is “Did the applicant default in reality”.
Default is shown by 1 and no default by 0.

Model's Output - Probability of Default   Target – Default Indicator
0.02                                      0
0.31                                      0
0.15                                      0
0.35                                      1
0.83                                      1
0.63                                      0
0.12                                      0

Company wants to use the model to decide on a loan application. So we need a strategy/threshold based
on the model’s Probability of Default. For example, if threshold is 0.4, then company will accept an
applicant if "Probability of Default of Applicant based on the Credit Risk Model < 0.4”.

To calculate cost and benefit for each threshold, we need to have an estimate of cost of False Positive
and False Negative, and revenue from True Positive and True Negative. True Positive are cases where the
observation was response, and model/strategy predicted response. In other words Target=Output=1.

So we have:

 True Positive: Target=Output=1
 True Negative: Target=Output=0
 False Positive: Target=0, Output=1
 False Negative: Target=1, Output=0

The above concept is shown graphically in a Confusion Matrix, that shows number of observations in
each of the above categories.

                Actual
                0      1
Predicted   0   #TN    #FN
            1   #FP    #TP

In the above Credit Risk model, assume the threshold is 40%, i.e. applicants with Probability of Default (PD) < 0.4 will be accepted, and the rest will be rejected. Basically the company classifies anyone with PD < 0.4 as 0 (no-default customer), and anyone with PD >= 0.4 as 1 (default customer). So we will have:

Model's Output - Probability of Default   Target – Default Indicator   Model/Strategy's Output
0.02                                      0                            0
0.31                                      0                            0
0.15                                      0                            0
0.35                                      1                            0
0.83                                      1                            1
0.63                                      0                            1
0.12                                      0                            0

The confusion matrix will be:

                Actual
                0      1
Predicted   0   4      1
            1   1      1

Often False Positive and False Negative are associated with some cost, and True Positive and True
Negative are associated with some revenue. To calculate benefit from strategy, company needs an
estimate of the above cost and benefits. Lending companies often estimate those, using another ML
model called Loss Given Default (LGD).

For the above Credit Risk data assume the following cost and benefit.

 True Positive: In this model, TP means an applicant was a bad applicant, and model/strategy
correctly declined the application. Assume the profit from each TP is $0.
 True Negative: In this model, TN means an applicant was a good applicant, and model/strategy
correctly approved the application. Assume the profit from each TN is $100.
 False Positive: In this model, FP means an applicant was a good applicant, but model/strategy
incorrectly declined the application. Assume the cost of each FP is $10.
 False Negative: In this model, FN means an applicant was a bad applicant, but model/strategy
incorrectly approved the application. Assume the cost of each FN is $1000.

Using these assumptions, find the optimum strategy/threshold for the above Credit Risk data. Threshold
can be any real number between 0 and 1 (probability), but we can not test all real numbers. So we test
the middle point between any two consecutive model’s outputs, as possible thresholds. To do so, we sort
data based on output’s column, and calculate middle points. For the above table, outputs sorted would
be:

[0.02, 0.12, 0.15, 0.31, 0.35, 0.63, 0.83]

So the possible thresholds would be:

[0.07, 0.135, 0.23, 0.33, 0.49, 0.73]

Next, we calculate Profit based on any threshold.

If we use 0.07 as threshold, it means company would accept an application if PD is lower than 7%. It
means company will approve only the first observation with PD = 0.02, and will reject the rest. The
return from this threshold/strategy is sum of cost and benefit from TP/TN/FP/FN. In this threshold we
have one TN, two TP, zero FN and four FP. So Expected Profit from this strategy would be:

16
1 ×100+2 ×10=$ 120

Model's Output - Probability of Default   Target – Default Indicator   Model/Strategy's Output   TP/TN/FP/FN
0.02                                      0                            0                         TN
0.31                                      0                            1                         FP
0.15                                      0                            1                         FP
0.35                                      1                            1                         TP
0.83                                      1                            1                         TP
0.63                                      0                            1                         FP
0.12                                      0                            1                         FP

Same way, we can calculate the Expected Profit for all other thresholds. The following table shows the results. Verify the results.

Threshold   Expected Profit of Strategy
0.07        $60
0.135       $170
0.23        $280
0.33        $390
0.49        -$610
0.73        -$500

So the optimum threshold based on this model/data would be 0.33. This strategy suggests to accept an
application if PD of applicant is less than 33%. If company already has a strategy, expected profit from
this strategy would be compared against the existing strategy’s profit, and if better, will be used going
forward.

Also, if there are several ML model candidates, Strategy team would define the optimum
threshold/strategy for each model, and the model/strategy that gives the highest expected profit would
be chosen.

Again, note in a real case there are other factors to consider in the strategy design. For example lenders
(especially large ones) are restricted by regulators not to take excessive risk. For example, lending
company may not be permitted to have Default Rate > 5%. So the company needs to find the optimum threshold at which "Profit is Maximized, but Default Rate < 5%".

Question. How is the Default Rate for each threshold/strategy calculated? The default rate would be the response rate among approved cases. For example, in the above credit risk model, the default rate at the 49% threshold would be (5 applications with PD < 0.49 are approved, and 1 of them defaulted):

Default Rate at 49% Threshold = (# Approved and Defaulted) / (# Approved) = 1/5

In Assignment #1 you will write a program that can be used to find the optimum threshold.

Confusion Matrix Metrics:

Several metrics are defined based on the confusion matrix that are used frequently in model assessment. For a reason that we will discuss in the next chapter, data scientists (ML modelers) are not recommended to use these metrics when assessing a model, but data analysts use them a lot when assessing the performance of a strategy. These metrics are especially useful when cost and benefit for TP/TN/FP/FN cannot be assigned. Some of these metrics are as follows (a short code sketch after this list shows how they can be computed for the credit risk example at the 0.4 threshold):

 Accuracy = percentage correctly classified = (# Correctly Classified) / (# Total) = (TP + TN) / (TP + TN + FP + FN)
 Precision = (# Correctly Classified as Response) / (# Total Classified as Response) = TP / (TP + FP)
 Recall (Sensitivity) = Percentage of Responses Captured = (# Correctly Classified as Response) / (# Total Responses) = TP / (TP + FN)
 Specificity = Percentage of Non-Responses Captured = (# Correctly Classified as Non-Response) / (# Total Non-Responses) = TN / (TN + FP)
 F1-Score = (2 × Precision × Recall) / (Precision + Recall)
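As an illustration, here is a minimal Python sketch that builds the confusion matrix for the credit risk table at the 0.4 threshold and computes these metrics (plain Python, no ML library; note that in this data "response" means Default = 1, so a predicted 1 is a declined application):

# Model outputs (PD) and targets (default indicator) from the credit risk table
pd_scores = [0.02, 0.31, 0.15, 0.35, 0.83, 0.63, 0.12]
targets   = [0,    0,    0,    1,    1,    0,    0]

threshold = 0.4
predictions = [1 if p >= threshold else 0 for p in pd_scores]  # 1 = classified as default

tp = sum(1 for y, p in zip(targets, predictions) if y == 1 and p == 1)
tn = sum(1 for y, p in zip(targets, predictions) if y == 0 and p == 0)
fp = sum(1 for y, p in zip(targets, predictions) if y == 0 and p == 1)
fn = sum(1 for y, p in zip(targets, predictions) if y == 1 and p == 0)

accuracy    = (tp + tn) / (tp + tn + fp + fn)
precision   = tp / (tp + fp)
recall      = tp / (tp + fn)
specificity = tn / (tn + fp)
f1          = 2 * precision * recall / (precision + recall)

print("TP, TN, FP, FN:", tp, tn, fp, fn)   # 1 4 1 1
print(accuracy, precision, recall, specificity, f1)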
Question. Search on the internet: which metric is the most suitable to use?

Chapter 5 – I am a Data Scientist – part 1

How do we know if a model is good?

In the previous chapter we answered this question, from model customer’s point of view. Now we
answer it from model builder’s point of view. Modelers measure model’s performance in a different way
compared with model users, and we will see why.

As we will see, as a data scientist, you will build many versions of a model - 200, 1,000, … (in a process called Grid Search, to be discussed). For example, you may build 1,000 versions of the Income Model. At the end, you want to choose the best one, to be used by the business. In this section we discuss how to assess a model and choose the best one - the one that will help the business the most.

You may say a model is good if the model's customer is happy. In the previous chapter (I am Model's Customer), we saw how the model's customer assesses a model's performance. So the best model is the one with the best performance in the strategy. That is correct, but for one reason, modelers need other performance metrics.

The reason is that model is developed before strategy. Note that model’s output is an input to the whole
strategy (like estimated income is an input to the marketing strategy). So when modelers are building the
model, no strategy is developed based on that model. So there is no threshold, …

For example, in the trading model in the previous section, when modelers are building the model, they
are not aware of threshold on the “Expected Return” to be traded. So “Expected Return from Strategy”
can not be used to choose the best model.

Same argument holds for classification models. In the Credit Risk example in the previous chapter, when
modelers are building the model, they do not know the threshold for Accept/Reject an applicant. So
“Expected Profit from Strategy” can not be used to choose the best model.

For the same reason, it is not recommended to use metrics based on the confusion matrix, in the model
development process. Assume a modeler uses accuracy to measure performance of a Credit Default
model. By default, in all ML packages, accuracy is defined with 0.5 as threshold; i.e. cases with
“Probability of Response < 0.5” are classified as 0, and cases with “Probability of Response >= 0.5” are
classified as 1. But 0.5 is not the threshold strategy team will use. As we saw in the example of the
previous chapter, optimum threshold will not be 0.5. Choice of threshold affects model’s accuracy. So a
model that is better based on 0.5 threshold, might be worse based on say 0.2 threshold. Therefore as a
modeler we prefer a performance metric that is independent of threshold/Strategy.

Note: Some modelers use metrics based on the confusion matrix to assess a model's performance - in other words, using 0.5 as the threshold. But for the above reason, that is not the best practice.

So, how can a ML modeler measure performance of a ML model? Probably the first answer is to use Loss
(value of Loss Function). Since we asked machine to minimize Loss, it is fair to compare models based on
their Losses, and choose the one with the least Loss. So the first set of performance metrics are Loss
functions: MSE (or RMSE), MAE, MAPE, Cross Entropy, …

Q. Which of the above Loss Functions can be used to solve a Regression model, and which can be used to solve a Classification problem?

For two reasons, we need other performance metrics, that might be more practical than Loss.

1. Loss is a real number (say 123,001,201.76); not easy to interpret. We don't necessarily want to choose the model with the lowest Loss. In ML, we also Cherish Simplicity! i.e. Less Complex. Among two models that have almost the same Loss, the one that is less complex is the better option, even if it has slightly higher Loss. But how much difference in Loss is Low, or High? We need a scaled version of Loss.
2. The second reason why we need more practical performance metrics: We can define
performance metrics that are more in line with how the model will be used. So we need to also
understand how the model will be used (by Data Analysts, to solve a business problem). Defining
a Strategy based on the ML Model, is one of the greatest data skills. Let’s discuss an example
(and we will discuss more examples throughout the course)

For example, in the case of the income model and the seller of luxury products, the company wants to optimize their marketing efforts. How? By targeting the right customers, so the company will have more sales (and revenue) with lower (marketing) costs; i.e. higher Profit, which is a Business Optimization Problem!

To optimize marketing costs, we need to know “who will buy our product (with a high probability), if we
advertise to them?” That depends on many items:

 Whether they have money to buy a luxury brand


 Their attitude towards fashion, and high-end fashion
 Other priorities in life
 What else???

Maybe we can use Machine. So we go to the Modeling Team, and ask if they can build a ML model that
“estimates probability that someone will buy our product, if we advertise to them?” Modeling team goes
to Data Team, to check what data is available, and what questions can be answered? Seems like the
company doesn’t have data to estimate that probability, but we can build a ML model that estimates
people’s income, so maybe that model can be used to optimize the marketing process.

Here is the point where Data Analysis team should design a Strategy on how to use the model (i.e. how
to use the model’s output?)

Data Analysts look at data in a more granular way than Data Scientists. For example, instead of analyzing
each individual’s income, and how to treat them, they define Segments based on the output of the ML
model. For example a possible Strategy is “Advertise to those with Estimated Income higher than
$200,000”.

As a Data Analyst you should show this threshold ($200,000) is a good choice. How? by analyzing the
data, and by looking at different segments. Here there are two segments: lower and higher than
$200,000. One possible analysis in this case is to show that Average Actual Income in the 200,000+ group
is Significantly higher than Average Actual Income for the 200,000- group. How can you do that?

We will discuss more applications, how to define thresholds, and how to judge if they are good
thresholds/segments? As mentioned, it is an innovative process, but seeing a few examples would help.

Going back to the performance metrics, in this case, company is interested in identifying rich people. If
the model underestimates income of rich people, we will not contact them, and we lose the opportunity.
So, while we like the model has low error across all observations, we are especially interested in having
Low Error (Low Underestimation Error) for high income people (observations with high values of target
variable).

Normal Loss functions give same weight to all observations (and all errors). Is there a performance
metric that gives more weight to high income people? So we can use that to choose the best model,
even though that model may not have the least Loss. Let’s look at some of these other performance
metrics.

Some Popular Regression Performance Metrics:

 R-Squared: R-Squared is a number between 0 and 1. The higher the R-Squared, the better the model's performance.

R-Squared = 1 − [ ∑_{i=1}^{n} (y_i − ŷ_i)² ] / [ ∑_{i=1}^{n} (y_i − ȳ)² ]

ȳ is the sample average, i.e. the average of the Target variable across all observations. Average is the simplest type of Machine Learning.

The numerator in the above formula is the Sum of Squared Errors. If the model has no error, i.e. y = ŷ for all observations, then the numerator will be 0, and R-Squared will be 1.
Similarly, the denominator is the Sum of Squared Errors for a very simple model: the Average. If the model we have built is only as good as the average, then the numerator and denominator will be the same, which means R-Squared will be 0. In other words, an R-Squared of 0 means the model is no different from just using the average; the average would result in the same error.

In theory R-Squared can be less than 0. That would mean the model is even worse than a simple average!
Note that for a given sample of data, the denominator of R-Squared is constant. The only part that changes from model to model is the numerator, which is the Squared Error. In fact R-Squared is a scaled version of the Squared Error, bounded between 0 and 1. This makes it easier to compare models. For example, we may decide that two models with R-Squared of 0.78 and 0.80 are almost the same, and choose the one that is simpler.

 Adjusted R-Squared: Adjusted R-Squared is similar to R-Squared, but it penalizes the model for becoming too complex. Adjusted R-Squared is always less than R-Squared. Higher values of Adjusted R-Squared indicate a better model.
As we discussed, we prefer a simpler model if it has the same performance. One of the factors that makes a model more complex is having more features. Adjusted R-Squared penalizes the model for adding new features.

Adjusted R² = 1 − [ (1 − R²)(N − 1) ] / (N − p − 1)

In this formula, R² is the normal R-Squared, N is the number of observations, and p is the number of features.
As the formula shows, a higher R-Squared helps increase the Adjusted R-Squared, while more features (higher p) decrease it. While adding a new feature never decreases R-Squared, it will increase Adjusted R-Squared only if the new feature has a significant impact on the model's performance; otherwise adding that feature will decrease Adjusted R-Squared, and Adjusted R-Squared will favor the model with fewer features (less complexity).
So Adjusted R-Squared is a version of R-Squared that automatically controls for complexity.

 Percentage of Observations with Absolute Percentage Error Less than a Threshold: This is a very
practical performance metric. It shows percentage of observations with error less than a
threshold, say 10%.
For example, in the income model, we might be fine with ±10% prediction error; i.e. if the absolute percentage error in prediction is less than 10%, we are fine. So we might be interested in knowing what percentage of observations have error in that range. For example, imagine the following table shows the Target Variable and Model's Output for an income model. The last column shows that 2 out of the 7 observations have Absolute Percentage Error less than 10%. So the Percentage of Observations with Absolute Percentage Error less than 10% is 2/7 ≈ 29%.

Income (Target Variable)   Estimated Income (Model's Output)   Absolute Percentage Error
150,000                    163,800                             9.20%
133,000                    113,510                             14.65%
210,000                    205,132                             2.32%
410,000                    455,000                             10.98%
78,000                     103,455                             32.63%
80,000                     94,234                              17.79%
190,000                    167,432                             11.88%

A higher value of this metric indicates a better model; i.e. among a few models, the model with
higher value of this metric, is the best.
Choice of threshold (for example 10% or 5% or …) is modeler’s decision and is related to model’s
application, and error tolerance.

 Percentage of observations with Overvaluation (or undervaluation) higher than a threshold: This
is another practical performance metric. While the previous metric was based on the
observations that have low error, this one looks at observations that have high error. Model may
overestimate or underestimate the true value. In the case of overestimation, Model’s output is
higher than Target, and vice versa for undervaluation.
For example, the following table shows the same information as the previous table, except that the last column now shows Percentage Error, rather than Absolute Percentage Error. Percentage error is defined as: (Target − Output) / Target

Imagine management believes undervaluation higher than 20% results in losing potential
customers, and so they want to know what percentage of observations are underestimated by
more than 20%. The following table shows Percentage Error is never higher than 20%. Therefore
0% will have undervaluation higher than 20%.

Income (Target Variable)   Estimated Income (Model's Output)   Percentage Error
150,000                    163,800                             -9.20%
133,000                    113,510                             14.65%
210,000                    205,132                             2.32%
410,000                    455,000                             -10.98%
78,000                     103,455                             -32.63%
80,000                     94,234                              -17.79%
190,000                    167,432                             11.88%

A lower value of this metric indicates a better model; i.e. among a few models, the model with
lowest value of this metric, is the best.
Choice of threshold (for example 20% or 10% or …) is modeler’s decision and is related to
model’s application, and error tolerance.
As we discussed, undervaluation of high income is the main concern for the income model used
in the luxury brand example. So Percentage Undervaluation would be a good choice for this
project.

 Customized Loss Functions: You can also define a Loss Function that behaves in a way consistent with the application. For example, how does the following modified MSE help with the Income Model application we discussed? (A short code sketch after this list computes the regression metrics above on a small example.)

Modified MSE = (1/n) ∑_{i=1}^{n} Y_i (Y_i − Ŷ_i)²
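To make these regression metrics concrete, here is a minimal Python sketch (NumPy only) that computes R-Squared, Adjusted R-Squared, the percentage of observations within ±10% absolute error, and the modified (weighted) MSE for the seven-row income table above. Treat it as an illustration under stated assumptions (for example, p = 2 features is assumed), not a reference implementation:

import numpy as np

# Target and output from the income table above
y     = np.array([150_000, 133_000, 210_000, 410_000, 78_000, 80_000, 190_000], dtype=float)
y_hat = np.array([163_800, 113_510, 205_132, 455_000, 103_455, 94_234, 167_432], dtype=float)

n = len(y)
p = 2  # assumed number of features, e.g. Industry and Years of Experience

ss_err = np.sum((y - y_hat) ** 2)        # sum of squared errors of the model
ss_tot = np.sum((y - y.mean()) ** 2)     # sum of squared errors of the "average" model
r2 = 1 - ss_err / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

ape = np.abs((y - y_hat) / y)            # absolute percentage error per observation
pct_within_10 = np.mean(ape < 0.10)      # share of observations within +/-10%

modified_mse = np.mean(y * (y - y_hat) ** 2)  # weighted MSE: large incomes weigh more

print(r2, adj_r2, pct_within_10, modified_mse)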

Some Popular Classification Performance Metrics:

Similar to regression, one approach is to use Loss. So for example for a model trained using Cross
Entropy, the model with lower Cross Entropy Loss is better. But what about other performance metrics?

Performance metrics for classification models have a general philosophy behind them. Understanding
this philosophy is very important for a data analytics job.

Analysis of classification models starts with sorting observations based on model’s output, which is a
Probability. A good classification model assigns high probability of 1 when the target is 1, and vice versa.
A perfect classification model, assigns higher probabilities to all 1s. Let’s see an example. The following
table, shows output and target for a classification model. Observations are sorted based on the Output
(Probability of Y = 1).

As you can see, all the 1s are among observations with higher probabilities. In other words, model
assigns higher probabilities to observations where Y = 1. This is an example of a full-separation model;
i.e. 0s and 1s can be fully separated based on model’s results.

Target Probability
1 0.958435117
1 0.948792964
1 0.914152845
1 0.913101696
0 0.72336373
0 0.720844029
0 0.70689184
0 0.637502659
0 0.562053761
0 0.421980592
0 0.374999954
0 0.234611813
0 0.179724192
0 0.117490243
0 0.018149957

However in a real model, that is not the case. Next table shows an example of a more realistic model.
Here also observations are sorted based on the model’s output (Probability), but the model does not
fully separate them. In fact you can see that the observation with the highest probability is 0, and then
0s and 1s are kind of mixed.

Target   Probability
0        0.997130323
1        0.93443584
1        0.920676798
0        0.889554458
0        0.877450172
0        0.856804693
1        0.828945311
0        0.816167986
0        0.815909692
0        0.715436249
1        0.621659747
0        0.468148883
0        0.446199084
0        0.280205392
0        0.239599425

A classification model is good if it can capture 1s as soon as possible; i.e. once we sort observations
based on model’s output, we would like to see many 1s on top (observations with high probability of 1),
and few 1s in the bottom. For example, among the following two models, model A seems better, because
overall it is assigning higher probabilities to 1s.

Model A                      Model B
Target   Probability         Target   Probability
1        0.0925              0        0.936315987
1        0.0896              1        0.930511967
0        0.0753              0        0.811131956
1        0.0730              0        0.66873246
1        0.0696              1        0.618558349
0        0.0613              0        0.562027441
0        0.0580              0        0.55246359
0        0.0571              0        0.526800866
0        0.0540              1        0.34600946
0        0.0537              0        0.279729747
0        0.0509              0        0.274914859
0        0.0502              0        0.247539913
0        0.0292              0        0.242596644
0        0.0242              0        0.125014773
0        0.0082              1        0.00741651

But we cannot visualize the whole data to find the best model. So is there a metric that tells us how well the model Rank Orders observations? Yes, that metric is called AUC, and it is probably the most popular classification performance metric.

Also note the expression Rank Ordering. We will talk about it more.

 Area Under Curve (AUC):

The best way to explain AUC is to show how it is created. AUC is the area under the Receiver Operating Characteristic (ROC) Curve. For this reason, the metric is also called the ROC metric. To calculate AUC, first sort the observations based on the model's output, and then calculate the % of 1s and 0s captured.

Following table shows an example. First column shows Target, and the second column shows model’s
output. Data is sorted based on the output. The third column shows % of 1s captured up to that record.
Overall there are 6 observations with label = 1. The first observation is 1, so we have captured 1 out of 6;
i.e. 16.67%. Same way we calculate all the numbers in the third and fourth column.

Notice, eventually both columns reach 1 (100%), as eventually we capture all 0s and 1s.

Target   Output   % of 1s Captured   % of 0s Captured
1 0.99 16.67% 0.00%
1 0.98 33.33% 0.00%
0 0.87 33.33% 8.33%
0 0.83 33.33% 16.67%
1 0.83 50.00% 16.67%
1 0.73 66.67% 16.67%
1 0.62 83.33% 16.67%
0 0.57 83.33% 25.00%
0 0.43 83.33% 33.33%
0 0.34 83.33% 41.67%
0 0.33 83.33% 50.00%
1 0.29 100.00% 50.00%
0 0.27 100.00% 58.33%
0 0.23 100.00% 66.67%
0 0.21 100.00% 75.00%
0 0.18 100.00% 83.33%
0 0.17 100.00% 91.67%
0 0.12 100.00% 100.00%

ROC is the graph of column 3 versus column 4. In other words, in the ROC curve, the third column is the Y axis and the fourth column is the X axis. AUC is the area below this curve.

[Figure: ROC Curve – % of 1s captured (Y axis) plotted against % of 0s captured (X axis), both ranging from 0% to 100%]

Q. What is the AUC of the above curve?

AUC takes values between 0.5 and 1. A complete separation model, has an AUC of 1. A model that is not
different from a random guess, has AUC of 0.5. Random guess means we assign probabilities by random,
so no model. In other words, an AUC of 0.5 means no merit for a model.

Why does AUC work? As mentioned, a good model assigns high probabilities to 1s and low probabilities to 0s. If that happens, the values in the third column (% of 1s captured) go up quickly (because we are capturing 1s among the high probability observations). On the other hand, the values in the fourth column do not go up quickly (because we are not capturing many 0s among the high probability observations). As a result, the first points on the ROC curve will have high values of Y and low values of X, which means the ROC curve goes up with a high slope, and once it has captured almost all the 1s, it becomes flat.

If the model is not good, there are many 0s among the high probabilities, and many 1s among the low probabilities. So the values in the third column will not go up quickly, and the values in the fourth column will go up relatively fast. Therefore the ROC curve will rise more gradually (closer to the diagonal), with a lower AUC.

So, a high AUC means the model, on average, is assigning higher values to 1s and lower values to 0s. (A short code sketch after these metrics shows how the captured-percentage columns, the AUC, and the top-X% capture rate can be computed.)

 Percentage of Responses Captured among the top X% of Observations: This is another practical
performance metric for classification models. To calculate this metric, sort the observations
based on the model’s output, and then calculate % of responses among top X%. X is often 5 or
10 percent.
For example, imagine the data has 10,000 observations, and the response rate is 1%; i.e. 100 responses. Assume we sort the observations and look at the top 5% of observations with the highest probability of response. Top 5% means the top 500 observations. Imagine 90 out of these 500 are responses, which means 90% of all responses are among the top 5%. So we have captured 90% of responses among the top 5%.
Do we like the value of this metric to be higher or lower? Higher.
As you can see in this metric also we care about how good model rank orders observations, and
we like the model to quickly capture the responses. However in this metric we only care about
the top observations.
This metric is useful for models such as fraud, where the goal is to capture responses very
quickly. A fraud model’s output is probability that a transaction is fraudulent. A possible strategy
is to block transactions with very high probability of fraud. Notice that blocking a transaction
that is not fraud is very costly, and the cost is in the form of losing customer’s satisfaction. So it is
not possible to block many transactions. On the other hand, failing to block a fraudulent
transaction can be very very costly. So, we can only block a small portion of transactions, and we
are hoping to capture almost all the frauds in that small portion. That is where this metric can be
useful, as it shows percentage of fraud that is captured among a small portion of observations
with highest probability of fraud.
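The sketch below (not from the course code base) shows one way to compute both metrics in Python with pandas and numpy, reusing the small table above; the column names "target" and "output" are just illustrative choices.

import numpy as np
import pandas as pd

# The example table from the text: Target and the model's output
df = pd.DataFrame({
    "target": [1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    "output": [0.99, 0.98, 0.87, 0.83, 0.83, 0.73, 0.62, 0.57, 0.43, 0.34,
               0.33, 0.29, 0.27, 0.23, 0.21, 0.18, 0.17, 0.12],
})

# 1) Sort by the model's output, highest probability first
df = df.sort_values("output", ascending=False).reset_index(drop=True)

# 2) Cumulative % of 1s and 0s captured (columns 3 and 4 of the table)
pct_ones = (df["target"] == 1).cumsum() / (df["target"] == 1).sum()
pct_zeros = (df["target"] == 0).cumsum() / (df["target"] == 0).sum()

# 3) AUC = area under the curve of pct_ones (Y) versus pct_zeros (X), summed trapezoid by trapezoid
x = np.r_[0.0, pct_zeros.to_numpy()]
y = np.r_[0.0, pct_ones.to_numpy()]
auc = np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2)
print("AUC:", auc)   # compare with sklearn.metrics.roc_auc_score, which handles ties slightly differently

# 4) % of responses captured among the top X% of observations (here X = 5%)
top_x = 0.05
n_top = max(1, int(len(df) * top_x))
captured = df.loc[: n_top - 1, "target"].sum() / df["target"].sum()
print(f"% of responses captured in the top {top_x:.0%}: {captured:.1%}")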

So far we have talked about how the machine learns (in a supervised model): by minimizing the Loss function on
data composed of the Target variable and independent features. Now it is time to talk about different
Machine Learning Techniques. We will first talk about Linear Models, then Decision Trees, Ensemble
Models, and finally Neural Networks. All of these techniques can be used to solve both Regression and
Classification problems.

Chapter 6 – Linear Models - part 1


So, the Machine minimizes the Loss Function. The next step is to define the Functional Form of the Output. The Output
is a function of the features:

$\hat{Y} = f(X_1, X_2, X_3, \dots)$

What is the form of this function? We start from one of the simplest types, Linear Models, where the Output
is a linear function of the inputs.

Linear Regression:
Linear Regression estimates Y as a linear function of the features:

$\hat{Y} = \beta X = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots$

The $\beta$s are called model coefficients, and the Linear Regression model finds the coefficients that minimize the Loss
function. For example, for MSE, Linear Regression solves the following optimization problem:

$\min_{\beta} \sum_{i=1}^{n} (y_i - \beta X_i)^2$

Let's calculate the Loss function for the following three observations from the income data, so we get an
idea of what the machine sees.

Age Years of Experience Salary


29 3 55000
31 5 70000
38 12 140000

With two independent variables, there will be three coefficients. So $\hat{Y}$ will be:

$\hat{Y} = \beta_0 + \beta_1 X_1 + \beta_2 X_2$

$Loss = \sum_{i=1}^{3}(y_i - \hat{y}_i)^2 = (55000 - \beta_0 - 29\beta_1 - 3\beta_2)^2 + (70000 - \beta_0 - 31\beta_1 - 5\beta_2)^2 + (140000 - \beta_0 - 38\beta_1 - 12\beta_2)^2$

This is a function of the three coefficients, so the machine will solve this three-dimensional optimization
problem. Of course, in a real case there are more features and many more observations.
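As an illustration (not the course's official solution), the following sketch sets up the same three-observation MSE as a Python function and minimizes it numerically with scipy; in practice a linear regression would be fit with a closed-form solution or a library such as scikit-learn.

import numpy as np
from scipy.optimize import minimize

X = np.array([[29, 3], [31, 5], [38, 12]], dtype=float)   # Age, Years of Experience
y = np.array([55000, 70000, 140000], dtype=float)         # Salary

def mse(beta):
    # beta = [beta0, beta1, beta2]; y_hat = beta0 + beta1*Age + beta2*Experience
    y_hat = beta[0] + X @ beta[1:]
    return np.mean((y - y_hat) ** 2)

# A simple numerical optimizer (similar in spirit to Excel Solver);
# the result depends on the method and the starting point
result = minimize(mse, x0=np.zeros(3), method="Nelder-Mead")
print("coefficients:", result.x)
print("MSE:", result.fun)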

Linear Classification (Logistic Regression):
In a classification model the output is the "Probability of Response", i.e. the output is between 0 and 1. So it cannot
be of the form:

$\hat{y} = \beta X$

Why? Because $\beta X$ can take any real value, outside 0 and 1. We need a functional form whose
output is bounded between 0 and 1. To achieve this goal, we define a linear function for $Logit(\hat{y})$
rather than for $\hat{y}$. Logit is a mathematical function that maps [0,1] to [-∞, +∞], and it is strictly
monotonic, guaranteeing a unique mapping from $Logit(\hat{y})$ back to $\hat{y}$.
In other words, we estimate $Logit(\hat{y})$ and convert it to $\hat{y}$, which is the "Probability of Response".

Logit's Formula:

$Logit(x) = \ln\left(\frac{x}{1-x}\right)$

Practice. Plot Logit on the interval of [0,1].
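A quick sketch for this practice, using matplotlib (any plotting tool works):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0.001, 0.999, 500)      # avoid exactly 0 and 1, where Logit is +/- infinity
logit = np.log(x / (1 - x))

plt.plot(x, logit)
plt.xlabel("x")
plt.ylabel("Logit(x) = ln(x / (1 - x))")
plt.title("Logit maps (0, 1) to (-inf, +inf) and is strictly increasing")
plt.show()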

Linear Classification Model:

$Logit(\hat{y}) = \ln\left(\frac{\hat{y}}{1-\hat{y}}\right) = \beta X$

This looks like a regression model with $Logit(\hat{y})$ as the dependent variable. For this reason, the linear
classification model is called Logistic Regression.

The linear model gives $Logit(\hat{y})$; how do we convert it to $\hat{y}$?

$Logit(\hat{y}) = \ln\left(\frac{\hat{y}}{1-\hat{y}}\right) = \beta X \;\Rightarrow\; \hat{y} = \frac{e^{\beta X}}{1+e^{\beta X}}$

So the Cross Entropy Loss function will be:

$\text{Cross Entropy} = \frac{-1}{n}\sum_{i=1}^{n}\left[ y_i \ln(\hat{y}_i) + (1-y_i)\ln(1-\hat{y}_i) \right] = \frac{-1}{n}\sum_{i=1}^{n}\left[ y_i \ln\left(\frac{e^{\beta X_i}}{1+e^{\beta X_i}}\right) + (1-y_i)\ln\left(1-\frac{e^{\beta X_i}}{1+e^{\beta X_i}}\right) \right] = \frac{-1}{n}\sum_{i=1}^{n}\left[ y_i \ln\left(\frac{e^{\beta X_i}}{1+e^{\beta X_i}}\right) + (1-y_i)\ln\left(\frac{1}{1+e^{\beta X_i}}\right) \right]$

Let’s write it for the following three rows from a hypothetical classification data:

X1 X2 Y
1.2 22 0
-0.3 31 1
1.8 40 0

$\text{Cross Entropy} = \frac{-1}{3}\Big[\, 0\cdot\ln(\hat{y}_1) + (1-0)\ln(1-\hat{y}_1) + 1\cdot\ln(\hat{y}_2) + (1-1)\ln(1-\hat{y}_2) + 0\cdot\ln(\hat{y}_3) + (1-0)\ln(1-\hat{y}_3) \,\Big]$

where $\hat{y}_1 = \frac{e^{\beta_0 + 1.2\beta_1 + 22\beta_2}}{1+e^{\beta_0 + 1.2\beta_1 + 22\beta_2}}$, $\hat{y}_2 = \frac{e^{\beta_0 - 0.3\beta_1 + 31\beta_2}}{1+e^{\beta_0 - 0.3\beta_1 + 31\beta_2}}$, and $\hat{y}_3 = \frac{e^{\beta_0 + 1.8\beta_1 + 40\beta_2}}{1+e^{\beta_0 + 1.8\beta_1 + 40\beta_2}}$.

So machine will solve the above three-dimensional optimization problem.
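As a hedged illustration, the sketch below writes the same three-row Cross Entropy as a Python function of (β0, β1, β2) and hands it to a numerical optimizer. Note that with only three rows the two classes happen to be separable, so the unregularized coefficients can grow very large; the point is only to show the objective the machine minimizes.

import numpy as np
from scipy.optimize import minimize

X = np.array([[1.2, 22], [-0.3, 31], [1.8, 40]])
y = np.array([0, 1, 0])

def cross_entropy(beta):
    logit = beta[0] + X @ beta[1:]              # beta0 + beta1*X1 + beta2*X2
    y_hat = 1 / (1 + np.exp(-logit))            # = e^(logit) / (1 + e^(logit))
    eps = 1e-12                                 # guard against log(0)
    return -np.mean(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))

result = minimize(cross_entropy, x0=np.zeros(3), method="Nelder-Mead")
print("coefficients:", result.x)
print("Cross Entropy:", result.fun)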

Chapter 7 – I am Data… Bias/Variance and Sample Bias


The machine learns from data, and the model is only as good as the data. Whether you use the most
sophisticated modeling techniques or the simplest, the model will not be useful if the data is not good. If
you train ChatGPT only on conversations between philosophers, it probably will not be able to give a
straightforward answer to simple questions. Or if you train the income model on the salaries of data scientists,
it may have high errors predicting the income of, say, teachers, even if the model has high performance (MSE,
MAE, MAPE, …) predicting data scientist salaries.

How do we know if the model will be good?


In previous chapters we discussed "How do we know if the model is good?" and we said we can check the
model's performance using performance metrics. Now we want to make sure this model will be
good. In other words, will the model perform well on samples other than the train sample?

Any data model is developed to be used on a Population. Population refers to all potential use cases of a
model. For example, in the income model, where the company wants to use the model to target
customers, all potential customers are model’s Population. This includes everyone that goes through the
model, low and high income, at any time including all future customers.

We never have data on the whole population. If we do, then why do we need a model? Rather we have
data on a Sample from the Population. We train the model on this sample, and then will use it on
Unseen data. In the income model, we have the sample data - the Excel file – to train the model. We call
this the Train Sample, as this sample will be used to train the model. We train the model on this sample,
and calculate performance metric on that same sample. The very very very important question is “Will
the model have the same performance on other samples, from other parts of the population?” To
answer this, we check two things:

1. Does the model have the same performance on another similar sample?
2. Is the data we have for model development, similar to the whole population?

Answers to these questions are two of the most important concepts in Applied ML. To answer the first
question, we do Bias/Variance Analysis. To answer the second question, we do our best to train the
model on an Unbiased Sample.

Bias and Variance:

We built the income model on the Train sample, and the performance (say, measured by R-Squared) is 0.85.
Is it good? If yes, will the model perform the same on unseen data? In other words, what if we put the
model into production and it does not estimate people's income correctly (or as well as on the Train sample)?
Then everything we have done to build a good model to target customers would be useless.

To test how the model performs on unseen data, we use part of the Train Sample to test the model. For
example, if we have 30,000 observations in the Train sample, we would not use all of them to train the
model. Rather we use, say, 20,000 to train the model, and the other 10,000 to test the model.
We call this sample the Test Sample. The process of splitting the data into test and train is referred to as the Test/Train
Split. We will discuss it in more detail shortly.

Note we have the value of the Target variable for the Test data, so we can use this sample to calculate the
model's performance. However, we will not use the Test Sample at all in any step of model training. Why?
Because the Test Sample is supposed to represent unseen data. So it does not exist when we train the
model.

Test data should not be used in model fine-tuning, nor in data processing. We will discuss these
terms/steps in more detail when we talk about Ensemble Models and Neural Networks.

So we split the data into Test and Train. Now we can calculate the model's performance on the Train sample, as
well as on unseen data, i.e. the Test Sample. A ML model is good if it has good performance on both Test and
Train samples, and the performance metric does not vary a lot across different samples. Let's call the model's
performance on a single sample "Model's Bias", and let's call the variance of the performance metric across
different samples "Model's Variance". This is a practical definition for Bias and Variance. For a
conceptual discussion of Bias and Variance, check the internet. This is a good reference:
[Link]

A good ML model has Low Bias and Low Variance.

Let's go through some examples. The following tables show some models' performances on Test and
Train samples, along with the discussion related to each table. Note two things: first, performance and
Bias/Variance can be measured using any relevant performance metric. Second, we can have more than
one test sample; in fact that is recommended.

Example 1. Is it a good model? Performance is measured by AUC.

Train Test 1 Test 2


Model 1 0.85 0.83 0.82

The model has low bias (high AUC) on all samples. Also, the variance of AUC across the three samples looks low.
This looks like a good model.

Example 2. Which model is better? Performance is measured by “% of observations with significant


undervaluation (>30%)”

Train Test 1 Test 2


Model 1 22% 20% 23%
Model 2 10% 22% 15%
Model 3 13% 15% 16%

Lower values of this performance metric mean lower bias. Model 2 has the lowest bias on the train sample,
but it has high variance across the three samples. Models 1 and 3 have low variance, but Model 3 has lower
bias. So we may choose that as the best model.

Next, let's start modeling. We will build our first model in Excel, so we can see with our own eyes what is going
on.

File "Income_Data.csv" contains data on the income of 6,702 people. Sheet 1 shows all the data. As
mentioned, we will not build the model on all the data. Rather, 5,000 out of the 6,702 are copied into the Train sheet
and will be used as the train sample, to train the model. The rest, 1,702 observations, are copied into the Test
sheet and will be used to test the model.

Let's build a linear regression model on the train data. We will only use "Age" and "Years of Experience" as
independent variables, excluding "Gender", "Education", and "Job Title".

We need to define the Loss Function and ask Excel to minimize it. We will use Excel Solver, which is a
very straightforward optimization tool.

Let’s use MSE. We will use Train sample to build the model. To define MSE:

1. Define some empty cells as the coefficients. We need 3 coefficients: Cells M2:O2
2. For each observation (row) define the Output: Column G
3. For each observation define the Error: Column H
4. For each Error calculate the Squared Error: Column I
5. Calculate the Average of the Squared Errors: Cell J2
6. Use Excel Solver to Minimize J2 by changing M2:O2

Note: If you run the Solver, there is a high chance that you will get different results. That is because
Solver uses a simple optimization technique. We will discuss Numerical Optimization Techniques later.
The world goes only as far as the science of optimization goes. That is just a sentence I always say, with no proof.

7. Is it a good model? MSE is not easy to interpret. Is it high or low? We need a scaled performance
metric, like R-Squared: Cell K2

I got R-Squared of 56% for train. Let’s calculate model’s R-Squared on the test sample:
1. Use coefficients from Trained model (in Train sheet), to calculate Y-Hat for each observation in
the Test sheet: Cells M2:O2 and Column G
2. Calculate R-Squared of Test Sample: Cell K2

I got R-Squared of 80% on the test sample. Model’s variance is very high. 56% versus 80%.

Note: In practice it almost never happens that the Test performance is much higher than the Train performance,
because the model is optimized on the train sample. Still, it is not impossible to see the model perform
better on some unseen samples, which would mean a (slightly) better performance on a test sample.

Practical Question: What if we get results similar to the above, with much better performance on
unseen data; is that a good model? Well, not with this large a difference. A difference like this indicates the R-Squared could be
20% on the next sample.

Practical Question: How much difference is ok?


Welcome to the subjective parts of Data Science!

A lot of data science is art; i.e. you need to make judgment calls, with no exact mathematics. One of those
judgment calls is to choose the best model based on Bias and Variance. Let's see this with an example.

Look at the following few models and their performances, measured by KS. KS is a performance metric for
classification models that has similarities to AUC. It can range from 0 to 1, and 1 is the highest, i.e.
perfect separation. Which model is the best?

Train Test 1 Test 2


Model 1 0.75 0.71 0.72
Model 2 0.91 0.74 0.75
Model 3 0.64 0.63 0.65

Model 1 has good bias and variance. But what about Model 2? Although there is a drop from train to test, the test
performance looks stable, and is higher than Model 1's. What to do?

Everything is math on the machine's side. Almost nothing is math on the human's side; a lot of it is art.

Bias and Variance – Overfitting and Underfitting

Often, as the model becomes more complex, the error (and bias) on the train sample goes down; but if
the model becomes too complex, the bias on the test sample starts going up, which means the model's variance
starts to go up.

How does a model become more complex? A linear model becomes more complex when more features
are added. We will discuss what complexity means for more sophisticated techniques, such as Neural
Networks and Ensemble Models, in the next sections.

The following figure shows the relationship between the model's complexity and the error on the test and train
samples. The blue line shows the error on the train sample, and the red line shows the error on the test
sample. As we move to the right, the model becomes more complex.

[Figure: Model's Error (Y axis) versus Model's Complexity (X axis); the Train curve keeps decreasing while the Test curve decreases, bottoms out, then increases. From left to right: Underfitting Region, Optimum Region, Overfitting Region.]

As the model becomes more complex, the error on both test and train decreases, until a point where the
error on test starts to increase. This is called overfitting. It means the model is fitted so closely
to the train data that it captures trends that are specific to the train sample and cannot be
generalized to the whole population.

For example, in the income data, we may see that people with 5-6 years of experience make more
money than people with 7+ years of experience. This is probably a sample-specific trend and has
happened in this specific sample. It can have any reason. For example it might be because in this specific
sample there are some highly qualified people with 5-6 years of experience, but we don’t have data to
figure out how qualified a person is; i.e. no feature on that.

If we build a very complex model, the model would start becoming exactly like the train sample; i.e.
overfits to the train sample. The result is that error on the train sample goes down, but error will
increase on other samples, because they do not necessarily follow these sample-specific trends. A model
that is overfitted has high variance.

Q. Before we move to the underfitting region, a question for smart people. Try to answer before checking
the response. In the income data we have about 6,700 observations, and two features: Age and Years of
Experience. What is the minimum MSE we can achieve on the train sample with a Linear model? Feel
free to make the model as complex as possible.

A. The minimum MSE on the train sample would be zero. In other words, we can build a model that has
no error. How? Define features as powers of either of the two variables, for example Age², Age³, Age⁴, …, Age⁶⁷⁰⁰, …
Now what happens is that the linear model will be a polynomial of very high degree (degree equal to the number of
observations). This polynomial can go through all the points and predict the target variable exactly for all the
train observations. But it will have large error on any other sample. See the following graph, which shows
an overfitted model with zero error on the train sample and much higher error on the test sample.

[Figure: an overfitted curve that passes exactly through every Train point while missing the Test points by a wide margin.]

To the left of the figure is the underfitting region. An underfitted model is too simple and often has high
error on both test and train. In some cases the error on train can even be higher than on test (like the model we built in
Excel).

As a data scientist you need to make sure you are always in the optimum region, where the model is not
over or underfitted. How to know that? By looking at Bias and Variance; i.e. Bias/Variance analysis. You
are looking for models that have low bias and variance.

Note, as mentioned, there is no exact formula for the best model, optimum bias and variance, … It is
eventually modeler’s discretion to choose among a few good models.

Now let's go to Python and first build a good model on the income data, a model that hopefully has low
variance and a decent bias. We cannot expect to get a very low bias, because we are using only two
features that probably do not explain much of someone's income.

We will use the same income data: split it into test and train, train the model on the train sample, and
calculate the performance metric on both test and train.

The ultimate way to learn data science is to go through code. Go through the code carefully.

The model has an R-Squared of 66% for both test and train. A fair bias and excellent variance.
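A minimal sketch of that Python exercise is below; it assumes the file name "Income_Data.csv" and the column headers "Age", "Years of Experience", and "Salary", which you may need to adjust to your copy of the data.

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

df = pd.read_csv("Income_Data.csv")
X = df[["Age", "Years of Experience"]]
y = df["Salary"]

# Hold out about a quarter of the data as the Test sample
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = LinearRegression().fit(X_train, y_train)

print("Train R-Squared:", r2_score(y_train, model.predict(X_train)))
print("Test R-Squared:", r2_score(y_test, model.predict(X_test)))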

Next, let's make the model more complex by adding some features, and see if the model will overfit.
Q. How do we know if the model is overfitted?

Basically we just added one cell to the previous code, where we define some new features (a sketch of
this step follows below). We still see that the R-Squared for Test and Train are very close, so we were not able to create an overfitted model! More
features are needed, but you got the point, right?
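Here is a sketch of that extra cell, using polynomial powers of the two features (the degree of 4 is an arbitrary illustrative choice); it reuses X_train, X_test, y_train, y_test from the previous sketch.

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Expand Age and Years of Experience into polynomial features (degree 4 here)
poly = PolynomialFeatures(degree=4, include_bias=False)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

model = LinearRegression().fit(X_train_poly, y_train)
print("Train R-Squared:", r2_score(y_train, model.predict(X_train_poly)))
print("Test R-Squared:", r2_score(y_test, model.predict(X_test_poly)))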

So let's review what we did. The very very very important question we are trying to answer is: "Will the
model have the same performance on other samples, from other parts of the population?" To answer
this, we check two things:

1. Does the model have the same performance on another similar sample?
2. Is the data we have for model development, similar to the whole population?

We answered the first one. We check Variance. How? By defining test sample(s) and comparing
performance across different samples. Our goal is to build a model that has low bias and low variance.

Conceptual Discussion: As you can see, our goal is not simply to minimize the Loss function. In fact the machine can
always build a very complex model with zero loss on the train sample, but we don't want that to happen.
That is why we guide the machine not to overfit. How? By setting parameters that control the model's
complexity and variance. We will discuss these parameters later.

Ok, now the second question. A very important one. Is the data we have for model development, similar
to the whole population? Let’s start by an example.

In the income model, imagine we have data only for people with less than 5 years of experience. We split
the data to Test and Train, build the model on train, and check it on test. The model has low bias and
variance. But what if we want to use the model on someone with many years of experience? Does the
model have good performance on this group as well? Maybe not. In other words, model has low bias
and variance, but on a different population. If the seller of luxury brands is going to advertise to people
with several years of experience, this model (trained on people with low experience) may not be a good
option.

What we discussed is called Sample Bias. A sample is biased if it is not similar to the population. We
always need to make sure our Train and Test samples are unbiased; i.e. they represent the whole
population. Sample Bias is one of the biggest challenges when building a model. A lot of times, ML
projects can not be done due to lack of unbiased data. It is not enough just to have data; the data should
also be unbiased.

Let’s look at some examples:

1. Often when companies want to expand into new markets, they encounter sample
bias. Imagine a lending company that is active in the prime sector, i.e. lends to people with
high credit. The company decides to enter the sub-prime market, and needs a credit risk model that
shows the probability of default for an applicant. The company has built the model on its current
customers (prime customers). Can the same model be used on sub-prime customers? Most
probably not. The relationship between Y and the Xs in the sub-prime segment is most probably
different from the relationship in the prime segment.
2. The above is an example of cross-sectional sample bias. Sample bias also often happens in
time series, as a result of changes through time. Imagine you want to build a trading
model, and you have data from the past 6 months. If you build a model on this data, and the
model shows good bias and variance, does it guarantee that the model will perform well in
the future? The problem here is that financial markets are very volatile and trends change
frequently. If the market and macro conditions change in the future, then the data of the last 6
months would be a biased sample of the future population.
3. ChatGPT is trained on a huge corpus of text, but can it answer ML questions related to our
course? Probably not. Because its training set does not exactly represent our course
material.

This is one of the fastest growing trends in AI start-ups. Companies try to Fine-Tune ChatGPT (or
any other OpenAI model) to be able to answer questions related to a specific field. How do they
do it? By feeding text related to that specific application. For example, I am thinking of building a
smart TA for my class. Based on my understanding of myself, I will do it in the next 100 years.

So, we want our data to be an unbiased representation of the target population; otherwise low bias and low
variance probably will not indicate a useful, applicable model. Sample bias is one of the most
conceptual topics in applied ML. There is no formula to determine whether a sample is biased or not, and a lot of
times it is an expert opinion. The important thing is to always ask:

1. What is Target Population? i.e. on which population the model will be applied (used)?
2. Is the data we have for model development similar to the target population?

Some Solutions to Sample Bias:

There is a lot of room for innovation when it comes to sample bias. The following are some examples of
how to address it:

1. Buy data: If a company wants to enter a new market for which they have no data, then they may
buy that data. For example, in the lending company example above, the company may choose to
buy data on sub-prime customers from the Bureaus.
2. Resampling: This approach is used if the distribution (composition) of the target population is
different from the distribution of the development sample. For example, in the case of the income model,
assume we have a lot of data on low income customers, and little data on high income customers. What
will the machine do to solve the model? It will minimize the Loss function. But in the Loss function the
majority of observations are from the low income group. So the optimization process will put
more emphasis on optimizing the low income group, trying to find relationships in that group.
As a result, the model would have good performance on low income, and not very good
performance on high income.
To solve this issue, the modeling team may Oversample the high income cases or Under-sample
the low income cases, and use this sample in the train process. For example, if 90% of the
sample is low income and 10% high income, the modeling team may randomly select 10%-20% of
the lows, and use it to build the model in combination with the 10% high income. This is an example of
under-sampling of the low income group.

If the team chooses to do oversampling of high income, they may build a larger sample
of High Income cases by randomly choosing high income observations With Replacement. As a result,
many of the high income cases will appear in the train sample several times (repeated rows).
Note: As we will see, a lot of ML packages solve this issue by assigning a Weight to a specific
class, which means that class will have a higher weight in the Loss Function. For example, you can assign
a higher weight to High Income cases in the income model. If we assign a weight of 10, for example, then MAE will look like:

$MAE = 10\sum_{i=1}^{n} |y_i - \hat{y}_i| + \sum_{j=1}^{m} |y_j - \hat{y}_j|$

where there are n high income and m low income observations, and subscript i indicates High
Income and subscript j indicates Low Income cases. As the equation shows, the model will assign
more weight to high income cases, resulting in better performance for this segment.

How much over- or under-sampling is good? Again, there is no exact answer. In practice we try
several weights or resampling ratios and choose the model that has the best performance (i.e.
low bias and variance); see the sketch after this list. We will get back to this topic later.

3. Data Cleaning / Observation Exclusion: Observations can be excluded from the model for many
reasons. It is often the most time-consuming task in a ML project. Data received from the data
team (raw data) needs to be cleaned. For example, old and stale observations should be
removed.
Also, some observations may be excluded to mitigate sample bias. Imagine you want to build a
model to analyze the impact of the Covid vaccine on the elderly. If your raw data covers all ages, you
need to first exclude the young cases.
Interview Question. How old is young? i.e. which ages should be removed? Create a story of a
data analysis case to answer that. You have data on ["Age", "Vaccinated", "Hospitalized"], where
"Vaccinated" and "Hospitalized" are binary variables.

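The sketch below illustrates under-sampling, over-sampling with replacement, and loss weights on a made-up train DataFrame with a hypothetical "high_income" flag; it only shows the mechanics, not a recommended recipe.

import numpy as np
import pandas as pd

# Hypothetical train DataFrame; "high_income" flags the minority group
rng = np.random.default_rng(42)
df_train = pd.DataFrame({
    "Age": rng.integers(22, 65, size=1000),
    "Salary": rng.normal(80_000, 30_000, size=1000),
})
df_train["high_income"] = (df_train["Salary"] > 150_000).astype(int)

# 1) Under-sampling: keep all minority rows plus a random 20% of the majority rows
minority = df_train[df_train["high_income"] == 1]
majority = df_train[df_train["high_income"] == 0].sample(frac=0.2, random_state=42)
df_resampled = pd.concat([minority, majority]).sample(frac=1, random_state=42)

# 2) Over-sampling: draw minority rows with replacement, so some rows repeat
oversampled = minority.sample(n=len(majority), replace=True, random_state=42)

# 3) Weighting: keep all rows, but give minority cases weight 10 in the Loss function;
#    many scikit-learn estimators accept this via the sample_weight argument of .fit()
sample_weight = np.where(df_train["high_income"] == 1, 10.0, 1.0)
# e.g. LinearRegression().fit(X, y, sample_weight=sample_weight)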
So,

 How do you know if a model is good? By checking the model's performance, using a
performance metric.
 How do you know if the model will perform well on other samples?
 First, we need to make sure the development samples (i.e. the train and all test
samples) are unbiased representations of the population.
 We need to check whether the model's performance is persistent across other samples.
How? By checking the variance of the performance metric across the train and test
sample(s).

Chapter 8 – I am Error

So far we know how to pass information to the machine - in the form of a Dataset / DataFrame. And the
machine will solve an error optimization problem to fit a model. We can check the performance metric on
test and train samples, to make sure we have a good model; i.e. low bias and low variance.

So now we have the model's output. But we also said the machine always has error; i.e. the model's output is
different from the actual value. So what are the model's output and error, in plain English?

Data science is the science of probability. Almost nothing is certain. For this discussion we need to know
about two concepts from Probability Theory: Deterministic vs. Stochastic (Random) variables.

1. Deterministic Variable: A variable whose value is certain. Like price of a stock yesterday. It is just
a number, no uncertainty about it.
2. Stochastic (Random) Variable: A variable whose value is not certain. Like price of a stock
tomorrow. No one knows the value precisely.
Since we don’t know the exact value of a Stochastic variable, we analyze them in the form of
Probability Distributions. For example: “Price of Stock X tomorrow is Normally distributed with
Expected Value = 0 and Variance = 2”. So stochastic variables have a probability distribution,
expected value (µ), Standard Deviation (σ), and other possible statistics.
We can use these probabilities in making decisions. For example, once we know distribution of
price of stock tomorrow, we may calculate probability that price will be higher than today’s price
(which is a deterministic variable). A trader may buy the stock if its price will increase with 70%
probability. Another trader may be more risk averse, and would be willing to buy the stock only if
its price will increase with 80% probability…

Now that we know about Stochastic and Deterministic variables, let’s get back to our question: What is
model’s output and error, in plain English?

We built the income model, and it has estimated the following income for a customer.

Age Year of Experience Estimated Income


42 7 $250,000

Let’s see what we can say about this customer’s true income:

$Y = \hat{Y} + Error = 250{,}000 + Error$


Do we know Error? No. It is not deterministic. It is a Stochastic variable.

What do we know about its distribution?

 First, the Expected Value of the Error should be 0. If the expected value is not 0, the model systematically
makes errors. It is called a Biased model (yes, bias is everywhere in data science).
 To get an idea of the distribution of the Error, and of its standard deviation, we can look at the
distribution of the error in the data (train, test, …).

So Error is Stochastic with an Expected Value of 0. What about Y (the actual value)? It is also Stochastic. We
don't know the actual income of this person (if we knew, why would we need a model?).

The error formula ($Y = \hat{Y} + Error$) shows that Y has the same distribution as the Error, with the same variance.
Also, the Expected Value of Y is $\hat{Y} + \text{Expected Value of Error} = \hat{Y}$.

So, the actual value of the Target variable (in any model) is Stochastic. It has a distribution, an expected value, a
variance, … What the model gives as output ($\hat{Y}$) is the Expected Value of this distribution, or the Expected Value
of Y.

$\hat{Y} = \text{Expected Value of } Y$

For the above observation, we Expect the salary of this person to be $250,000. What is the probability that
this person's salary is exactly $250,000? 0, right?

Conditional Expected Value:

So ML model gives “expected value of the Target variable”. But it is more complete to say that ML model
gives “expected value of Target variable conditional on features”. For example, in the above example,
Expected Value of Income is $250,000 conditional on Age = 42, and Years of Experience = 7.

All ML models are conditional models. As soon as you add the first feature, you are building a conditional
model. Conditional on Xs.

We can write the above prediction in plainer language: "We expect someone who is 42 years old,
and has 7 years of experience, to make $250,000 per year."

Let's also make it more technical and complete by adding a phrase to the beginning: "Based on the
data we have, we expect someone who is 42 years old, and has 7 years of experience, to make $250,000
per year." What if our data is biased? What if we built the model on people from Argentina? Can we have
the same expectations for the U.S.?

Expressing technical concepts in plain English indicates deep understanding of a topic, and is an excellent
practice. Try to write technical concepts (like the output of a ML model, a business problem, …) as simply as
possible.

Chapter 9 – Decision Trees

Decision Trees (DT) also learn by minimizing Loss, but they do it in a different way compared with Linear
models. Linear models minimize Loss by "finding the best coefficients (βs) that minimize Loss", but a DT
learns by "finding the best Split that minimizes Loss". A Split means dividing the data into two groups, where $\hat{Y}$ is
the same for all observations in each group. An example of a split for the income model is:

If Age < 35 then $\hat{Y}$ = $240,112
Else if Age >= 35 then $\hat{Y}$ = $302,150

The above model estimates the income of someone who is 48 years old as $302,150.

A Decision Tree is like many nested if..then..else statements. A sample DT for a classification model would look like:

if X5 < 1 then
    if X3 > 0.34 then
        if X7 = 0 then Ŷ = 0.32
        else if X7 = 1 then Ŷ = 0.81
    else if X10 < 5.2 then
        if X1 = 1 then Ŷ = 0.11
        …

Based on the above model, for a case where X5 < 1, X3 > 0.34, and X7 = 0, the probability of response = 0.32.

In an actual model, there will often be many segments, and a lot of nested if..then..else statements.

Decision Trees are excellent models for visualization. Imagine the following income model:

If "Years of Experience" < 10 then
    If "Age" <= 32 then Ŷ = $180,002
    Else if "Age" > 32 then Ŷ = $240,798
Else if "Years of Experience" >= 10 then
    If "Years of Experience" <= 15 then Ŷ = $232,460
    Else if "Years of Experience" > 15 then Ŷ = $506,002

The model's visualization would be as follows, where we have added hypothetical counts for each node:

Node 1 (# observations: 10,000)
├── Years of Experience < 10 → Node 2 (# observations: 2,000)
│   ├── Age <= 32 → Node 4 (# observations: 500)
│   └── Age > 32 → Node 5 (# observations: 1,500)
└── Years of Experience >= 10 → Node 3 (# observations: 8,000)
    ├── Years of Experience <= 15 → Node 6 (# observations: 2,322)
    └── Years of Experience > 15 → Node 7 (# observations: 5,678)

Note that DT is doing what all ML models do: it finds Similar Observations (final nodes) and assigns
them a value (output).

How is this output calculated by the DT? It depends on the Loss function we use. We will discuss it
shortly.

Conceptual Discussion: DT is a non-linear model, which means the relationship between the Output and the
Features may not be linear. A linear relationship between the Output and a feature (say X1) means "if all
other features are held constant, then the relationship between the Output and X1 is linear". In a linear model that is
the case. The linear model is:

$\hat{Y} = \beta X = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots$

If all features other than X1 are constant, then the whole term $\beta_2 X_2 + \beta_3 X_3 + \dots$ is a constant. So the
linear model becomes:

$\hat{Y} = \text{Constant} + \beta_1 X_1$

This is the graph of a straight line.

This is not the case for a DT. In fact, it will not be the case for any other model we will discuss. Let's
analyze the concept of a Non-Linear relationship in the above DT. What is the relationship between "Years
of Experience" and Estimated Income, if the other features (here, only Age) are held constant? Assume Age = 35.
The following table and figure show the relationship between Years of Experience and Estimated Income for this
person.

Years of Experience Y-Hat Years of Experience Y-Hat Years of Experience Y-Hat


0 240,000 10 232,460 20 506,002
1 240,000 11 232,460 21 506,002
2 240,000 12 232,460 22 506,002
3 240,000 13 232,460 23 506,002
4 240,000 14 232,460 24 506,002
5 240,000 15 232,460 25 506,002
6 240,000 16 506,002 26 506,002
7 240,000 17 506,002 27 506,002
8 240,000 18 506,002 28 506,002
9 240,000 19 506,002 29 506,002
… …

[Figure: step plot of Estimated Income (Y axis, roughly $200,000 to $550,000) against Years of Experience (X axis, 1 to 30), based on the table above.]

In a linear model, we can only have a straight line.

Also notice the relationship is not Monotonic: among those who are older than 32, those with less
than 10 years of experience make more than those who have between 10 and 15 years of experience.

Is it a sample specific trend? i.e. overfitting? Maybe. How can we know? By analyzing Variance of the
Model. If we see high variance, what can we do? We can make the model less complex. For example, we
can limit the model to have at most two layers (i.e. only one split)

Of course in an actual model, there are many more features, observations, and layers. But the concept is
the same. Bias/Variance analysis to have a good model.

Linearity can be considered a constraint of a linear model. It is like optimizing the Loss function, but with a
constraint: the functional form of the Output should be Linear ($\hat{Y} = \beta X$ or $Logit(\hat{Y}) = \beta X$). So in case there are
non-linear relationships between the Output and a feature, a linear model is not able to fully capture them. For
example, what is the logical relationship between salary and age? Probably hump-shaped.

At the same time, as we saw, there are sometimes sample-specific trends, and a Linear model
can help with smoothing those out, and help with overfitting.

All the above considered, we have built a Linear model and a DT. Which one is better? The one that wins
the Bias/Variance analysis!

End of conceptual discussion □

Decision Tree’s Terminology:

The first node, containing all the observations, is called the Root Node. Nodes that do not split are called
Final Nodes or Leaf Nodes. Each Final Node generates a $\hat{Y}$. A Decision Tree generates as many distinct
outputs as the number of final nodes, in contrast to Linear models, which can generate infinitely many different
outputs, $\beta X$ or $\frac{e^{\beta X}}{1+e^{\beta X}}$.
Nodes 2 and 3 are the children of Node 1. Node 1 is their Parent. Node 2 is the Left Child. Node 3 is the Right
Child.

The above tree has three Layers.

Questions:

1. How does a Decision Tree decide on a split? For example, in the above model, the first split is based on
"Years of Experience", with 10 as the Split-Point. Why was the split based on this Feature/Split-Point?
Why not Age/32, or Age/42, or Years of Experience/17, or …?
2. How is the Output calculated?
3. When does the DT stop growing? In other words, how many times will the DT split? How many layers
and final nodes?

1. How does Decision Tree decide on split?

DT tries all possible Feature/Split-Points and will choose the one that minimizes Loss. For example,
imagine we have the following 4 observations for an Income model:

Age Years of Experience Income


34 4 $280,000
75 33 $610,000
23 0 $73,000
44 4 $250,000

The following are the possible Feature/Split-Points. Note that Split-Points are defined as the midpoint (i.e.
the average) between any two consecutive values of a feature in the train data.

Age / 28.5

Age / 39

Age / 59.5

Years of Experience / 2

Years of Experience / 18.5

For this parent node with 4 observations, DT will calculate Loss for the above 5 Feature/Split-Points, and
will choose the one with the lowest Loss.

But to calculate Loss, we need to have values for Output. So next question.

2. How is the Output calculated?

In DT, as in any other ML technique, the choice of Loss function determines how the model
estimates the output.

Let's see how MSE looks for a single split of a DT, based on any Feature/Split-Point. Assume there are
n observations in the parent node.

$MSE = \frac{\sum_{i=1}^{n}(Y_i-\hat{Y}_i)^2}{n}$

In a split, there are two child nodes and two outputs. We still don't know what this output is, but we
know it is the same for all observations in each child node. Assume that, out of the n observations, $n_1$ go to the
left child and $n_2$ go to the right child. Also, the output for child 1 is $\hat{Y}_1$ and the output for child 2 is $\hat{Y}_2$. We can
write MSE in terms of the Sum of Squared Errors of the two child nodes separately:

$MSE = \frac{\sum_{i=1}^{n}(Y_i-\hat{Y}_i)^2}{n} = \frac{\sum_{i=1}^{n_1}(Y_i-\hat{Y}_1)^2 + \sum_{j=1}^{n_2}(Y_j-\hat{Y}_2)^2}{n}$

Now let's find the $\hat{Y}_1$ and $\hat{Y}_2$ that minimize the above MSE. Note that our unknowns are $\hat{Y}_1$ and $\hat{Y}_2$.

To find the minimizing $\hat{Y}_1$ and $\hat{Y}_2$, calculate the Gradient of the Loss function with respect to $\hat{Y}_1$ and $\hat{Y}_2$, and find the
Gradient's zero point.

$\frac{\partial MSE}{\partial \hat{Y}_1} = \frac{\partial}{\partial \hat{Y}_1}\left[\frac{\sum_{i=1}^{n_1}(Y_i-\hat{Y}_1)^2 + \sum_{j=1}^{n_2}(Y_j-\hat{Y}_2)^2}{n}\right] = \frac{-2}{n}\sum_{i=1}^{n_1}(Y_i-\hat{Y}_1) = 0 \;\Rightarrow\; \hat{Y}_1 = \frac{\sum_{i=1}^{n_1}Y_i}{n_1} = \bar{Y}_1$

where $\bar{Y}_1$ is the sample average in child 1. In the same way you can show that $\hat{Y}_2 = \bar{Y}_2$, the sample average in
child 2. So if we use MSE, the output of each node is the sample average of all observations in
that node.

For example, in the following model that we visualized:

If "Years of Experience" < 10 then
    If "Age" <= 32 then Ŷ = $180,002
    Else if "Age" > 32 then Ŷ = $240,798
Else if "Years of Experience" >= 10 then
    If "Years of Experience" <= 15 then Ŷ = $232,460
    Else if "Years of Experience" > 15 then Ŷ = $506,002

If we use MSE, then the output of Node 4 would be the average of the 500 observations in that node
(see the graph), and this average is $180,002.

To find the best split, DT calculates the Loss for each split. If we use MSE, the output of each node is the sample
average, because the sample average gives the lowest MSE for each split. DT will choose the split that gives the
minimum MSE (like the Minimum of the Minimum MSEs of each Split).

Q. Prove that if we use MAE, the Sample Median minimizes the Loss for a split.

Q. What would the output of a node be, if we use MAPE?

Let’s find the best split for the 4 observations we discussed.

Age Years of Experience Income


34 4 $280,000
75 33 $610,000
23 0 $73,000
44 4 $250,000

Possible Splits, as we discussed:

Age / 28.5

Age / 39

Age / 59.5

Years of Experience / 2

Years of Experience / 18.5

For Age / 28.5 split we will have the following child nodes:

Parent Node (# observations: 4)
├── Age < 28.5 → Child 1 (# observations: 1), Y-Hat-1 = 73,000
└── Age >= 28.5 → Child 2 (# observations: 3), Y-Hat-2 = 380,000

Where Y-Hat-1 is the sample average of the only observation in Child 1, and Y-Hat-2 is the sample
average of 3 observations in Child 2. MSE for this split will be:
$MSE = \frac{\sum_{i=1}^{4}(Y_i-\hat{Y}_i)^2}{4} = \frac{\sum_{i=1}^{1}(Y_i-73{,}000)^2 + \sum_{j=1}^{3}(Y_j-380{,}000)^2}{4} = \frac{(73000-73000)^2 + (280000-380000)^2 + (610000-380000)^2 + (250000-380000)^2}{4}$

In the same way, DT would calculate the MSE for the other four possible Feature/Split-Points, and will choose the
Feature/Split-Point that gives the lowest Loss.

Verify that the MSE for "Years of Experience / 18.5" is:

$MSE = \frac{(280000-201000)^2 + (73000-201000)^2 + (250000-201000)^2 + (610000-610000)^2}{4}$
So DT evaluates the Loss function for all possible splits, and will choose the split that gives the minimum
Loss.

DT will do a split only if it improves the Loss; i.e. the Loss of the Split is less than the Loss of the Parent. In the above
example, the MSE of the parent is:

$MSE = \frac{(280000-303250)^2 + (73000-303250)^2 + (250000-303250)^2 + (610000-303250)^2}{4}$

The split happens only if the MSE of the best split (the Feature/Split-Point with the lowest Loss) is less than the
parent's Loss.
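The following sketch replays this split search in Python on the same four observations: it enumerates every Feature/Split-Point (midpoints between consecutive distinct values), computes the weighted MSE of each split, and compares it with the parent's MSE.

import numpy as np

data = {
    "Age":                 np.array([34, 75, 23, 44], dtype=float),
    "Years of Experience": np.array([4, 33, 0, 4], dtype=float),
}
income = np.array([280_000, 610_000, 73_000, 250_000], dtype=float)

def mse(values):
    # MSE of a node when its output is the node's sample average
    return np.mean((values - values.mean()) ** 2)

parent_mse = mse(income)
print(f"Parent MSE: {parent_mse:,.0f}")

best = None
for feature, x in data.items():
    points = np.unique(x)
    for split in (points[:-1] + points[1:]) / 2:     # midpoints between consecutive values
        left, right = income[x < split], income[x >= split]
        # MSE of the split = weighted average of the two children's MSEs
        split_mse = (len(left) * mse(left) + len(right) * mse(right)) / len(income)
        print(f"{feature} / {split}: MSE = {split_mse:,.0f}")
        if best is None or split_mse < best[2]:
            best = (feature, split, split_mse)

print("Best split:", best)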

Once a split happens, DT keeps splitting the child nodes, but now as a parent node. Imagine the
following graph we discussed:

Node 1 (# observations: 10,000)
├── Years of Experience < 10 → Node 2 (# observations: 2,000)
│   ├── Age <= 32 → Node 4 (# observations: 500)
│   └── Age > 32 → Node 5 (# observations: 1,500)
└── Years of Experience >= 10 → Node 3 (# observations: 8,000)
    ├── Years of Experience <= 15 → Node 6 (# observations: 2,322)
    └── Years of Experience > 15 → Node 7 (# observations: 5,678)

Let's say we use MAE. For the first split, DT goes through all possible Feature/Split-Points, and concludes
that "Years of Experience / 10" gives the best Feature/Split-Point for Node 1 (the root node). The MAE for
any split on Node 1 is as follows:

$MAE \text{ of Split} = \frac{\sum_{i=1}^{10000}|Y_i-\hat{Y}_i|}{10000} = \frac{\sum_{i=1}^{n_1}|Y_i-\hat{Y}_1| + \sum_{j=1}^{n_2}|Y_j-\hat{Y}_2|}{10000}$

where $n_1$ and $n_2$ are the number of observations in the left and right child, respectively, $\hat{Y}_1$ is the sample
median in the left node, and $\hat{Y}_2$ is the sample median in the right node.

Next, DT does the same on Node 2. It calculates the MAE of all possible Feature/Split-Points, this time only on
the 2,000 observations in Node 2. It also finds the best Feature/Split-Point for the 8,000 observations in
Node 3. And it keeps splitting (growing).

Now, the third question. When does DT stop growing? But before we answer this question, let’s discuss
Loss function, Split, and Output for a classification model.

Everyone, give it up for the Income Model. We are going to focus on other models; in particular we are going to
put more emphasis on Classification models, as the most common type of model in the industry. But
we all agree the Income Model did an excellent job explaining concepts in the simplest way.

When decision tree is used to solve a Regression model, it is called a Regression Tree. When DT is used
to solve a classification model, it is called a Classification Tree.

For a classification model, target looks like 0 or 1, and output is the probability of response. Response is
often labeled as 1. For example, in a Credit Risk model, defaulted cases are labeled as 1. Output in a
classification model is Probability of Response, and is a number between 0 and 1.

Let's see how Cross Entropy looks for a split in a Binary Classification model:

$\text{Cross Entropy} = \frac{-1}{n}\sum_{i=1}^{n}\big[ y_i\ln(\hat{y}_i) + (1-y_i)\ln(1-\hat{y}_i) \big]$

Imagine there are $n_1$ observations in the left child and $n_2$ observations in the right child. Also assume the
output (probability of response) of the left child is $\hat{y}_1$, and the output (probability of response) of the right
child is $\hat{y}_2$. The Cross Entropy will be:

$\text{Cross Entropy} = \frac{-1}{n}\left[ \sum_{i=1}^{n_1}\big( y_i\ln(\hat{y}_1) + (1-y_i)\ln(1-\hat{y}_1) \big) + \sum_{j=1}^{n_2}\big( y_j\ln(\hat{y}_2) + (1-y_j)\ln(1-\hat{y}_2) \big) \right]$

To find the $\hat{y}_1$ and $\hat{y}_2$ that minimize the above Loss, calculate the Gradient of the Loss with respect to $\hat{y}_1$ and $\hat{y}_2$ and
find the zero point of the Gradient.

$\frac{\partial\,\text{Cross Entropy}}{\partial \hat{y}_1} = \frac{\partial}{\partial \hat{y}_1}\left[ \frac{-1}{n}\sum_{i=1}^{n_1}\big( y_i\ln(\hat{y}_1) + (1-y_i)\ln(1-\hat{y}_1) \big) \right]$

Notice that:

 $y_i\ln(\hat{y}_1) + (1-y_i)\ln(1-\hat{y}_1) = \ln(\hat{y}_1)$ when $y_i = 1$
 $y_i\ln(\hat{y}_1) + (1-y_i)\ln(1-\hat{y}_1) = \ln(1-\hat{y}_1)$ when $y_i = 0$

So the above Gradient can be written as:

$\frac{\partial}{\partial \hat{y}_1}\left[ \frac{-1}{n}\Big( \sum_{p=1}^{k_1}\ln(\hat{y}_1) + \sum_{q=1}^{k_2}\ln(1-\hat{y}_1) \Big) \right]$

where p indexes observations with $Y=1$ (responses) and q indexes observations with $Y=0$ (non-responses).
There are $k_1$ responses and $k_2$ non-responses in the left child, and $k_1 + k_2 = n_1$.

So this split resulted in $n_1$ observations in the left node; out of these, $k_1$ are responses and $k_2$ are non-responses.
Setting the Gradient to zero:

$\frac{-1}{n}\left( \frac{k_1}{\hat{y}_1} - \frac{k_2}{1-\hat{y}_1} \right) = 0 \;\Rightarrow\; \hat{y}_1 = \frac{k_1}{k_1+k_2}$

What is $\frac{k_1}{k_1+k_2}$? It is the response rate in the node.

So, if we use Cross-Entropy as the Loss function, the output of each node will be the Response Rate in that node. DT
(a classification tree) will try all possible Feature/Split-Points, and will choose the one that minimizes the Loss.
If Cross Entropy is used for the Loss, the output would be the Response Rate of each node.

Let's try an example. Imagine we have the following 5 observations for a Spam Detection model. The Target is
a binary variable that is 1 if the email was Spam, and 0 if the email was not Spam. The model's output
would be the "Probability that an email is Spam". The strategy team can use this output and block emails where
the Prob. of Spam is higher than some threshold.

There are two features. The first feature shows the number of words in the email. The second feature is a binary
variable that is 1 if the email contains the word "Urgently", and is 0 otherwise.

# Words Urgently Spam


1512 0 0
7603 1 1
12432 1 0
73 0 1
32,009 0 1

Possible Splits:

Words / 792.5

Words / 4,557.5

Words / 10,017.5

Words / 22,220.5

Urgently / 0 versus 1

What is the Cross Entropy for the "Words / 4,557.5" split? As a result of this split, there will be two observations
in the left child (Words <= 4557.5) and three observations in the right child (Words > 4557.5). The response
rate (and $\hat{Y}$) in the left child is 50%, and the response rate (and $\hat{Y}$) in the right child is 67%.

Parent Node (# observations: 5)
├── Words <= 4557.5 → Child 1 (# observations: 2), Y-Hat-1 = 0.5
└── Words > 4557.5 → Child 2 (# observations: 3), Y-Hat-2 = 0.67


The Cross Entropy for this split would be:

$\text{Cross Entropy} = \frac{-1}{5}\Big[ \big(\ln(1-0.5) + \ln(0.5)\big) + \big(\ln(0.67) + \ln(1-0.67) + \ln(0.67)\big) \Big]$

DT will calculate the Cross Entropy for all of the above 5 Feature/Split-Points, and will choose the Feature/Split-Point with
the lowest Cross Entropy. For each split, Y-Hat is the response rate in each node; this is the Y-Hat that
gives the lowest Cross Entropy for any split.

A split will happen only if the Loss of the Split is less than the Loss of the parent node.

Q. In the above example, what is Cross Entropy for the parent node?

Weighted Loss:

So far we found if we use:

 MSE, output would be sample’s average of each node


 MAE, output would be sample’s median of each node
 Cross Entropy, output would be sample’s response rate of each node

Now we want to show that Loss for a Split is Weighted Average of Losses for two child nodes, where
weights are based on the number of observations in each node.

For MSE we have:

$MSE = \frac{\sum_{i=1}^{n}(Y_i-\hat{Y}_i)^2}{n} = \frac{\sum_{i=1}^{n_1}(Y_i-\bar{Y}_1)^2 + \sum_{j=1}^{n_2}(Y_j-\bar{Y}_2)^2}{n} = \frac{\sum_{i=1}^{n_1}(Y_i-\bar{Y}_1)^2}{n} + \frac{\sum_{j=1}^{n_2}(Y_j-\bar{Y}_2)^2}{n}$

Now we do a trick, and multiply and divide each term by the number of observations in that node:

$MSE = \frac{n_1}{n}\times\frac{\sum_{i=1}^{n_1}(Y_i-\bar{Y}_1)^2}{n_1} + \frac{n_2}{n}\times\frac{\sum_{j=1}^{n_2}(Y_j-\bar{Y}_2)^2}{n_2}$

Rearranging the terms, we have:

$MSE = \frac{\sum_{i=1}^{n_1}(Y_i-\bar{Y}_1)^2}{n_1}\times\frac{n_1}{n} + \frac{\sum_{j=1}^{n_2}(Y_j-\bar{Y}_2)^2}{n_2}\times\frac{n_2}{n}$

What is $\frac{\sum_{i=1}^{n_1}(Y_i-\bar{Y}_1)^2}{n_1}$? It is the MSE of the left child. What is $\frac{n_1}{n}$? It is the weight of the left child. The same
argument holds for the right child. So we have:

MSE of any Split = MSE of Left Child × Weight of Left Child + MSE of Right Child × Weight of Right Child

So, the MSE of any split is the weighted average MSE of the two child nodes.

Another thing to notice is that $\frac{\sum_{i=1}^{n_1}(Y_i-\bar{Y}_1)^2}{n_1}$ is the sample variance in the left node. Also, $\frac{\sum_{j=1}^{n_2}(Y_j-\bar{Y}_2)^2}{n_2}$ is the
sample variance in the right node. So the MSE of a Split is the Weighted Variance of the two child nodes:

MSE of any Split = Weighted Variance of Two Child Nodes = Variance of Left Child × Weight of Left Child + Variance of Right Child × Weight of Right Child

If MSE is used as the Loss function, DT calculates the Weighted Average Variance for all possible Feature/Split-Points,
and will choose the Feature/Split-Point that gives the lowest Weighted Average Variance (which is
the Weighted Average MSE = MSE of the Split).

For MAE we have:

$MAE \text{ of Split} = \frac{\sum_{i=1}^{n}|Y_i-\hat{Y}_i|}{n} = \frac{\sum_{i=1}^{n_1}|Y_i-Median_1| + \sum_{j=1}^{n_2}|Y_j-Median_2|}{n} = \frac{n_1}{n}\times\frac{\sum_{i=1}^{n_1}|Y_i-Median_1|}{n_1} + \frac{n_2}{n}\times\frac{\sum_{j=1}^{n_2}|Y_j-Median_2|}{n_2}$

So, the MAE of any split is the weighted average MAE of the two child nodes.

If MAE is used as the Loss function, DT calculates the Weighted Average MAE for all possible Feature/Split-Points,
and will choose the Feature/Split-Point that gives the lowest Weighted Average MAE (which is the
MAE of the Split).

For Cross Entropy we have:

$\text{Cross Entropy} = \frac{-1}{n}\sum_{i=1}^{n}\big[ y_i\ln(\hat{y}_i) + (1-y_i)\ln(1-\hat{y}_i) \big]$

Multiply and divide the term for each node by the number of observations in that node. So we will have:

$\text{Cross Entropy} = \frac{n_1}{n}\times\frac{-1}{n_1}\sum_{i=1}^{n_1}\big[ y_i\ln(\hat{y}_1) + (1-y_i)\ln(1-\hat{y}_1) \big] + \frac{n_2}{n}\times\frac{-1}{n_2}\sum_{j=1}^{n_2}\big[ y_j\ln(\hat{y}_2) + (1-y_j)\ln(1-\hat{y}_2) \big]$

What is $\frac{-1}{n_1}\sum_{i=1}^{n_1}\big[ y_i\ln(\hat{y}_1) + (1-y_i)\ln(1-\hat{y}_1) \big]$? It is the Cross Entropy of the left node. Also, $\frac{n_1}{n}$ is the weight of the left node. The same holds for the
right node. So:

Cross Entropy of any Split = Weighted Average Cross Entropy of Two Child Nodes = Cross Entropy of Left Child × Weight of Left Child + Cross Entropy of Right Child × Weight of Right Child

If Cross Entropy is used as the Loss function, DT calculates the Weighted Average Cross Entropy for all possible
Feature/Split-Points, and will choose the Feature/Split-Point that gives the lowest Weighted Cross Entropy
(which is the Cross Entropy of the Split).

Another popular metric (probably the most popular) that is used to find the best Feature/Split-Point in a
classification tree is called GINI. GINI for a node is defined as:

$GINI = 1 - \left(\%Zero^2 + \%One^2\right)$

where %Zero is the percentage of observations in the node where the target is 0, and %One is the percentage of
observations in the node where the target is 1.

For example, in the Spam detection model, the GINI of the parent node would be:

$GINI \text{ of Parent Node} = 1 - \left(\frac{3}{5}\right)^2 - \left(\frac{2}{5}\right)^2 = 1 - \frac{9}{25} - \frac{4}{25} = \frac{12}{25}$
The GINI of a split is the weighted average of the GINIs of the two child nodes. For example, in the Spam Detection model, for
the "Urgently / 0 versus 1" split, we have the following visualization and GINI.

Parent Node (# observations: 5; #0: 2, #1: 3)
├── Urgently = 0 → Child 1 (# observations: 3; #0: 1, #1: 2)
└── Urgently = 1 → Child 2 (# observations: 2; #0: 1, #1: 1)

$GINI \text{ of Left Child} = 1 - \left(\frac{1}{3}\right)^2 - \left(\frac{2}{3}\right)^2 = 1 - \frac{1}{9} - \frac{4}{9} = \frac{4}{9}$

$GINI \text{ of Right Child} = 1 - \left(\frac{1}{2}\right)^2 - \left(\frac{1}{2}\right)^2 = 1 - \frac{1}{4} - \frac{1}{4} = \frac{1}{2}$

$GINI \text{ of Split} = \text{Weight of Left Child} \times GINI \text{ of Left Child} + \text{Weight of Right Child} \times GINI \text{ of Right Child} = \frac{3}{5}\times\frac{4}{9} + \frac{2}{5}\times\frac{1}{2}$
DT will calculate GINI for all possible splits, and will choose the split that gives the lowest GINI.

What is minimum GINI of a node? 0; A GINI of 0 means all observations in a node are from the same
class (all 0 or 1). A pure node.

What is maximum GINI? 0.5; A GINI of 0.5 means half of observations in a node are 0 and half are 1.
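A small sketch that reproduces the GINI numbers of the Spam example above:

import numpy as np

spam = np.array([0, 1, 0, 1, 1])            # target for the 5 emails
urgently = np.array([0, 1, 1, 0, 0])        # 1 if the email contains "Urgently"

def gini(y):
    pct_one = y.mean()
    pct_zero = 1 - pct_one
    return 1 - (pct_zero ** 2 + pct_one ** 2)

left, right = spam[urgently == 0], spam[urgently == 1]
gini_split = (len(left) / len(spam)) * gini(left) + (len(right) / len(spam)) * gini(right)

print("GINI of parent:", gini(spam))                        # 12/25 = 0.48
print("GINI of left child (Urgently = 0):", gini(left))     # 4/9
print("GINI of right child (Urgently = 1):", gini(right))   # 1/2
print("GINI of split:", gini_split)                         # 3/5 * 4/9 + 2/5 * 1/2
print("GAIN:", gini(spam) - gini_split)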

GAIN: GAIN is the difference between the Loss of the Parent and the Loss of the Split:

 GAIN when MSE is used = MSE of Parent − MSE of Split
 GAIN when MAE is used = MAE of Parent − MAE of Split
 GAIN when Cross Entropy is used = Cross Entropy of Parent − Cross Entropy of Split
 GAIN when GINI is used = GINI of Parent − GINI of Split
 And the same for any other Loss function or metric used to split

A split will happen only if the GAIN is positive. Also, a higher GAIN indicates a better split.

Now that we understand how a decision tree splits, let's answer the third question:

3. When does DT stop growing?

What happens if DT keeps growing, generating many layers and nodes? The model becomes complex. As
we know, when a model becomes too complex it starts overfitting to the train sample; as a result it will
have high variance and perform poorly on test and other samples. So we need to control the DT. How? By
setting the parameters of the DT that control the model's complexity and variance.

The parameters of a DT depend on the DT package that is used. One of the most popular ML libraries in
Python is scikit-learn. The following are links to the scikit-learn Classification and Regression Tree packages.
The links show the parameters of the two classes.

Classifier: [Link]

Regressor: [Link]

Let’s look at some of these parameters. Make sure you understand all the parameters for all the ML
models we discuss.

 max_depth: defines maximum number of layers in a DT. For example, in the following income
model with 3 layers, if we set max_depth = 2, then nodes 2 and 3 will not split.

Node 1 (# observations: 10,000)
├── Years of Experience < 10 → Node 2 (# observations: 2,000)
│   ├── Age <= 32 → Node 4 (# observations: 500)
│   └── Age > 32 → Node 5 (# observations: 1,500)
└── Years of Experience >= 10 → Node 3 (# observations: 8,000)
    ├── Years of Experience <= 15 → Node 6 (# observations: 2,322)
    └── Years of Experience > 15 → Node 7 (# observations: 5,678)

 min_samples_split: The minimum number of observations a parent node must have in order to
split. For example, if we set min_samples_split = 5000, then Node 2 cannot split. Higher values
of min_samples_split result in a simpler model with fewer nodes, and decrease the model's
variance.
 min_samples_leaf: The minimum number of observations a child node must have as a result of a split. For
example, if we set min_samples_leaf = 1000, then the above split of Node 2 cannot happen,
because it results in a child node with 500 observations, which is less than 1000. Higher values of
min_samples_leaf result in a simpler model with fewer nodes, and decrease the model's variance.
 min_impurity_decrease: The minimum required GAIN as a result of a split. Higher values of
min_impurity_decrease mean the model requires more meaningful splits, i.e. splits with high
GAINs. Higher values of min_impurity_decrease result in a simpler model with fewer nodes, and
decrease the model's variance.

Another parameter of interest, which we will talk about more later, is max_features.

 max_features: The number of features considered at each split to find the best Feature/Split-Point.
Imagine the train dataset has 500 features. If you set max_features = 50, then for any split, DT
randomly chooses 50 out of the 500 features, and finds the best split among those 50. What is the
point of max_features? Why don't we try all possible features? Lower values of max_features
result in lower variance. It also increases the speed of training, as the model needs to check fewer splits.

These were some of the parameters. Make sure you understand all of them. Also start looking at
code: how to define a DT, how to pass data, how to set parameters, … Use the internet; a minimal sketch follows below.
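For example, a sketch of a scikit-learn regression tree with the parameters discussed above (the parameter values are illustrative, and X_train/X_test/y_train/y_test are assumed to come from an earlier train/test split of a DataFrame; "squared_error" requires a recent scikit-learn version).

from sklearn.tree import DecisionTreeRegressor, export_text

tree = DecisionTreeRegressor(
    criterion="squared_error",     # MSE -> each node's output is its sample average
    max_depth=3,                   # at most 3 layers
    min_samples_split=200,         # a node needs >= 200 observations to split
    min_samples_leaf=100,          # each child must keep >= 100 observations
    min_impurity_decrease=0.0,     # minimum required GAIN for a split
    max_features=None,             # consider all features at every split
    random_state=42,
)
tree.fit(X_train, y_train)

print("Train R-Squared:", tree.score(X_train, y_train))
print("Test R-Squared:", tree.score(X_test, y_test))
print(export_text(tree, feature_names=list(X_train.columns)))   # the nested if..then..else rules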

Last question. There are many parameters. How do we know which parameters give the best model? We
will discuss this shortly, when we talk about Ensemble models. We go through a process called Grid
Search, in which we build several models by setting different values for parameters, and choose the best
model, i.e. the model that has the lowest Bias and Variance.

To summarize, DT grows by finding the best split; i.e. split that gives the lowest Loss, GINI, or any metric
that is used to find the best split. In this way DT finds similar observations, and defines output based on
these similar observations. If MSE is used, output would be sample average of target variable in any final
node. If Cross Entropy is used, output is response rate in any final node, ….

We control complexity and variance of DT by setting its parameters.

Chapter 10 - I am a Data Scientist - part 2

Now that we know the core concepts of ML, we are ready to discuss our job as ML modelers. What are the
steps in building a ML model?

Steps in a ML model project are as following:

1. Model Design
1.1. Target Definition
1.2. Sample Definition
2. Data Collection
3. Data Cleaning
3.1. Feature Exclusion
3.2. Observation Exclusion
4. Data Processing
4.1. One-Hot Encoding
4.2. Outlier Treatment
4.3. Feature Scaling
4.4. Missing Value Imputation
5. Feature Reduction
6. Model Training
6.1. Grid Search (Hyper-parameter Tuning)
6.2. Bias/Variance Analysis and Finalizing the Model

In this course we will practice steps 3 to 6. The first 2 steps often need to be practiced in a real project
with real data. Descriptions of the steps are as follows:

1. Model Design: This step is often performed by senior data scientists, and requires a mix of knowledge
of the modeling process, the data structure, and the domain. The goal of this step is to clearly define
what we want to do, so junior data scientists can start coding, reading the data, …

Note: There are several practical concepts in the following discussion on Model Design. As
mentioned, model design requires some experience with the modeling process, and is often done by
more experienced modelers. I encourage you to think deeply about the following discussions, and try
to understand them. It will greatly improve your understanding of data and data models.

1.1. Target Definition: In this step, the Target variable is clearly defined, so it can be derived from
the raw data. For example, in the Credit Risk model, the Target is whether the credit applicant
defaulted. The company has monthly data on each borrower's performance; i.e. did the borrower
make the payment on their due date?

In this step, the modeler defines the Target (0 or 1) based on this raw data. We need to define default
for a customer who was approved for a loan. If the customer missed one payment, is it default?
What if they missed 2 payments? Or 3…? Also, how long after loan origination should we track the
loan? Should we track the loan for 6 months? Or 1 year? Or 5 years?

A possible definition of default is "the applicant misses three consecutive payments in the next
12 months". Now junior data scientists can use this definition to assign 0 or 1 to each applicant,
based on their monthly performance data.

Target Definition is often done in collaboration with the strategy team (the model's user). After all, the ML
model estimates the Target, and that output will be consumed by the model's user; so they should
participate in target definition.

Target definition can have an enormous role in the model's impact. For example, in the algorithmic
trading model that we discussed previously, the target was defined as "Return on Asset". First, we now
know that the target needs to be clearly defined, so it can be coded. What is missing in the
above definition? Time. A good definition would be "Return on the Asset 8 hours later". This
definition can be used by junior data scientists to define the Target based on the raw data, at any
time (see sample question 3). Raw data in an Algorithmic Trading model is often in the form of
Price and Volume of trades for an asset at different times.

Some modelers have noticed that defining the target in another way helps in building a better
model; a model that possibly has better performance, and can result in better trades and profit.
To do so, they define the Target as a binary variable. For example: "Does the Asset ever achieve X%
return in the next 8 hours?". X is a threshold, and needs to be defined. For example, if X = 5, and the
model's output for an asset is 70%, then in layman's terms the model says that "With 70%
probability the Asset will reach 5% return in the next 8 hours (conditional on Features)."

A sample strategy based on this model is: buy the asset if "Prob. of reaching 5% return in the
next 8 hours" > 0.8. Note that in the Target definition there are two thresholds that need to be defined
by modelers (probably in collaboration with strategists): % Return (5 in the above example), and
Holding Period (8 hours in the above example). Next we will need a threshold for the strategy as
well, 0.8 in the above example.

1.2. Sample Definition: In this step, the data used for modeling is defined. Which data sources are
available? Which ones will be used to build the model? Which vintages (time periods of data)? It is also
decided how to do the Test/Train split: which samples will be used as Test and Train?
This step is done at the same time as the previous step. In fact, the modeler needs to know what
data is available, so they can decide what problem they can solve.

Important Example - In the Credit Risk model, assume default is defined as "3 consecutive missed
payments in the 12 months after loan origination". Assume the model development project starts in
June 2023. The modelers decide to use 2 years of (origination) data to build the model. They will
use data on loan originations between June 2020 and May 2022. A sample Test/Train
split would be as follows:
 June 2020 to Dec 2020: Test 1
 Jan 2021 to Dec 2021: Train
 Jan 2022 to May 2022: Test 2

Important Question. We are building the model in June 2023. Why don't we use more recent
data to build the model? Why don't we use data after May 2022?
This is because we need 12 months of data to define the Target. For example, for loans originated in
May 2022 we will check the loan's performance in the next 12 months to define the target. So as of June
2023, the most recent origination vintage for which we have 12 months of performance is May
2022. As an example, to define the target for May 2022 customers, we check June 2022 to May 2023
to see if the customer has missed 3 consecutive payments.
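
To make the target definition concrete, the following is a minimal sketch of how such a 0/1 default flag could be derived from monthly performance data with pandas. The column names (loan_id, month, missed_payment) and the tiny example table are assumptions for illustration only, not the actual raw data.

import pandas as pd

# Hypothetical monthly performance data: one row per loan per month,
# missed_payment = 1 if the payment due that month was missed.
perf = pd.DataFrame({
    "loan_id":        [1, 1, 1, 1, 2, 2, 2, 2],
    "month":          [1, 2, 3, 4, 1, 2, 3, 4],
    "missed_payment": [0, 1, 1, 1, 0, 1, 0, 1],
})

def is_default(df, window=12, consecutive=3):
    """Default = `consecutive` missed payments in a row within the first `window` months."""
    s = df.sort_values("month").head(window)["missed_payment"]
    # a rolling sum equals `consecutive` only if that many misses happen back to back
    return int((s.rolling(consecutive).sum() == consecutive).any())

target = perf.groupby("loan_id").apply(is_default).rename("default")
print(target)  # loan 1 -> 1 (three misses in a row), loan 2 -> 0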

Factors to consider when defining the Development data. A model is only as good as its data. If you
feed bad data to a model, then the model will be bad (garbage in, garbage out). But what is good
data? Good data has good quality and quantity. Quality means the data should be unbiased.
Quantity means we should have enough data to build the model.

 Unbiased data: All samples - Train and Test(s) – should be unbiased representations of
the Population. We discussed these concepts in the chapter "I am Data". For example, in
the Algorithmic Trading model, the modeling team may want to use very recent data (say
the last 6 months). Why? Because trends in financial markets change very frequently. Old
data may not be a good representation of how markets behave now. And we want to
use the model now.
 Data Quantity: A data model is reliable only if it is trained on large enough data. This
concept has roots in one of the most fundamental laws of Statistics, the Law of Large
Numbers. For example, a Credit Risk model based on 100 loans may not be reliable. On
the other hand, modelers often prefer to use more recent data, because more recent data is a
less biased representation of the Production Population. The Production Population is the
population on which the model will be used in production. So the modeling team needs to

decide between the size and the age of the data. For example, in the above Credit Risk model,
modelers may decide that one year of data is enough, and there is no need to use 2 years.

Data availability can also impact the Target definition. More complex targets, i.e. targets that
depend on more factors (features), often require more data. We may not have enough
data to predict X, but may have enough data to predict Y, which is related to X.
For example, imagine a Credit Risk model used for Credit Card originations. The lender's final
goal is to estimate "Expected Profit from a Customer". But profit from a customer
depends on many factors, such as:
 (Cost Component) Whether the customer will default? (PD model)
 (Cost Component) After how many months will the customer default?
 (Revenue Component) How much does the customer spend each month? –
Credit Card companies often charge a fee on each transaction.
 (Revenue Component) How much of the monthly balance will the customer
not pay? – Credit Card companies often charge a high interest rate on balances not
paid each month. So spend only as much as you can pay!

So profit is a high-dimensional target. High-dimensional targets are difficult to estimate
and need more data. Therefore the modeling and strategy teams may conclude there is not
enough data to build a good profit model (one that has low Bias and Variance). Rather, they
may decide to build a Credit Risk model and a Revenue model separately. By breaking
Profit into its components, the target variables are less complex (have lower dimensions),
need less data, and can be estimated with more accuracy.

How much data is enough? There is no formula for this. The answer is often subjective.
One thing is for sure though: we like models that have low Bias and low Variance. Models
trained on not-enough data often cannot be generalized to other samples, which
means they would have high variance.

2. Data Collection: Once the model design is complete, it is time to build the model. The first step is to
get in touch with the data team, and ask for the data that is going to be used in the model.
3. Data Cleaning: Raw data received from the data team contains features and observations that may not
be used in the model, so this data needs to be cleaned. Data cleaning is often the most time-consuming
part of any data science project.
Note: Practicing data cleaning concepts requires access to actual data, and that is often not publicly
available. Almost all public datasets are already almost clean. Therefore, try to think about these
concepts, and understand them.

3.1. Feature Exclusion: Some features in the raw data do not add to the model, so they need to be
removed. Examples are IDs, Names, … There are also features that contain information that is
not available at the time the model is implemented. These features also need to be removed.
This category of features is often tricky, and sometimes modelers include them in the model by
mistake.

For example, think about the Credit Risk model that is used to decide whether a loan applicant
should be approved. When we collect raw data for customers, we may see a field like "Loan
Amount". Should we use this feature as an independent variable in the model? No, because at
the time the model is applied, there is no loan. The model will be used to decide whether a
customer should get a loan. Loan amount is information that is not available at the time the
model is used.

Important Note: When building a model, always think about the model's application, and what data
is available at the time the model will be applied. A lot of times modelers make the mistake of using
data that is not available at production time. The reason is that we always build the model using
historical data. In the above Credit Risk model, we are building a model in 2023, using 2022
data. There are fields (such as loan amount) that are available now, but they were not available
when the model was applied to decide on a loan application.

Always think about the time the model will be applied and which data is available at that time.

3.2. Observation Exclusion: Similar to features, some observations may not be used in building the
model. The main goal of observation exclusion is to remove Sample Bias, so observations that
do not represent the Target Population are removed. The Target Population is the population
on which the model will be applied.

For example, in the income model used for advertising a product to US customers, the best Dev
sample is composed of people living in the US. If the sample contains data from Brazil, for example, then
you probably want to exclude those observations. It is possible that the relationships between
features and the target variable are different in a sample from Brazil. So a model built on that
data may not perform well in the US.

Another example. Imagine a lending company that wants to launch a product for the Sub-Prime
segment. To build this model, the modeling team would probably be better off excluding data on Prime
customers. Prime customers have different behavior. Including them in the model may decrease the
model's performance on the sub-prime segment.

Another example. In an algorithmic trading model that will be used to trade stocks, you
probably don't want to include other assets, such as bonds, crypto, … Sample bias can also
happen in time. For example, if you think the macro condition in 2020 was completely different
from now, you should not include 2020 data in the model.

Observation exclusion can be very tricky. It also often requires some business knowledge.

4. Data Processing: Once the data is clean, it is time to process it. The goal of data processing is to
make the data ready to be consumed by the ML model. The following are the 4 most common data processing
steps. Depending on the data and modeling technique, other steps may be needed. For example, in
Linear models correlation among variables needs to be checked, and among features with high
correlation, only one should be kept in the model. Note that correlation is a linear concept, and
often is not a big concern for non-linear ML models, such as Ensemble models and NNs.

Note, DT-based models need minimal data processing. In fact, as you will see, many data processing
steps are not required for DT-based models (including Ensemble models). That helps with the time-to-
build for these models. Time saved on data processing can be spent on other activities such as model
design, parameter tuning, and trying different approaches.

4.1. One-Hot Encoding: The machine only understands numbers. It plays with numbers to solve the Loss
optimization problem. A lot of times there are features that are categorical (rather than
numerical). We need to convert these features to numbers before we can feed them to a ML
model. Categorical features are converted to numbers through a process called One-Hot
Encoding. In this process, a binary feature is created for each category of the categorical feature.
Then the original categorical feature is removed.

For example, in the income data (Income_Data.csv), there are three categorical features:
"Gender", "Education Level", and "Job Title". Take "Education Level" for example. It has 7
categories: [Bachelor's, Bachelor's Degree, High School, Master's, Master's Degree, PhD,
Missing]. Note we considered missing as a separate category. There are other ways to impute
missing values, but for categorical features, assigning them a separate category is almost always
the best practice.

As a modeler you may want to first merge Bachelor's and Bachelor's Degree, for example using
Bachelor's for both; same with Master's and Master's Degree. So you will end up with 5
categories. Next, you do one-hot encoding, in which a binary feature is created for each
category, and the original feature (Education Level) is excluded. The following shows an
example of the data before and after one-hot encoding of Education Level.

Before:

Age   Gender   Education Level   Job Title           Years of Experience   Salary
32    Male     Bachelor's        Software Engineer   5                     90000
28    Female   Master's          Data Analyst        3                     65000
45    Male     PhD               Senior Manager      15                    150000
36    Female   Bachelor's        Sales Associate     7                     60000
52    Male     Master's          Director            20                    200000

After (Education Level replaced by five binary columns: Education_High_School, Education_Bachelor's, Education_Master's, Education_PhD, Education_Missing):

Age   Gender   High_School   Bachelor's   Master's   PhD   Missing   Job Title           Years of Experience   Salary
32    Male     0             1            0          0     0         Software Engineer   5                     90000
28    Female   0             0            1          0     0         Data Analyst        3                     65000
45    Male     0             0            0          1     0         Senior Manager      15                    150000
36    Female   0             1            0          0     0         Sales Associate     7                     60000
52    Male     0             0            1          0     0         Director            20                    200000

Note, for each observation, only one of the new binary columns can be 1. Surprising?
In the same way, we need to one-hot encode Gender and Job Title. If you build a linear model using
Age and Education Level, the model would look like:

Ŷ = β0 + β1 × Age + β2 × Education_High_School + β3 × Education_Bachelor's + β4 × Education_Master's + β5 × Education_PhD

In a DT model, these binary features would be used just like any other binary feature.

Note 1: If you are building a linear model and your categorical feature has n categories, then
you only need to create binary features for n-1 categories. For example, in the above case,
you may exclude "Education_Missing", and use only the other 4 categories. The reason is that
the last category/column is a linear function of the other columns: the value of each column in a
one-hot encoded feature equals 1 − (sum of the other categories). The mathematics of Linear models does
not work if a feature is a linear function of other features.
Note 2: If you are building non-linear models (like DT, Ensemble, NN), the above rule is not
required, and a separate column can be created for all categories of the feature. In fact it is best
practice to do so, as it makes it easier for the ML model to use this information.
Note 3: For a binary categorical feature, which has two classes/categories, you only
need one column. For example, if the Gender column has only Male and Female as categories, then
a single variable (say Male) can capture all the information.

Note, there are two types of categorical features: Ordinal and Nominal. Ordinal features are not
numbers, but there is an ordering between their categories, while nominal features do not have an order.
For example, imagine a feature that shows "Years of Experience" but is stored as text, like [Less
than One Year, Two to Three Years, Four to Seven Years, Eight+ Years]. An ordinal feature like
this may be converted to numbers by simply replacing each category with a number; for example,
in the above case the modeler may assign [1, 2, 3, 4] to the categories. This
approach works especially well with DT-type models (including Ensemble models), as these
models only care about the order of values. But it may not work with other types of models, such as
linear models. The reason is that although there is an order between the categories, you cannot
define a distance between them, while when you replace them with numbers,
you are assuming a distance. For example, in the above case, if you simply
replace them with numbers, you assume there is a one-point difference between any two
consecutive categories, which is not the case.

So, Question. What is the point of One-Hot Encoding? Why don't we just replace categories with
numbers?
Because numbers have an order, but categories don't. If we simply replace categories with
numbers, the machine will solve the problem assuming an order. And as we saw, even when the categorical
feature is Ordinal, replacing categories with numbers may not be correct.

One-Hot Encoding is a necessary step in all ML techniques.
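
As a quick illustration, the following is a minimal sketch of one-hot encoding with pandas. The column name Education_Level and the small example frame are assumptions for illustration, not the actual income data.

import pandas as pd

# Hypothetical data with one categorical feature
df = pd.DataFrame({"Age": [32, 28, 45],
                   "Education_Level": ["Bachelor's", "Master's", "PhD"]})

# One binary column per category; drop_first=True would keep only n-1 columns
# (the option you would typically use for a linear model, per Note 1 above)
encoded = pd.get_dummies(df, columns=["Education_Level"], drop_first=False)
print(encoded.columns.tolist())
# ['Age', "Education_Level_Bachelor's", "Education_Level_Master's", 'Education_Level_PhD']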

4.2. Outlier Treatment: Outliers are values that are far from the other observations (either very large or
very small). Outliers mess up the model's performance. They affect the model mainly through the Loss
optimization process. For example, in the income model assume we use MSE, and assume
there are cases with very large income. The optimization process will put a lot of emphasis on
these observations (trying to predict them accurately); as a result, the model will be biased towards the
outliers, and will have worse performance on the other observations. Question. Why will the
optimization process give more weight to these observations?

Outliers can exist in both dependent and independent variables. There are several ways to treat
outliers. One way is to define a cap and floor for the feature based on some percentiles, for
example the 1st and 99th percentiles. Any value less than the 1st percentile is then replaced by
the 1st percentile, and any value higher than the 99th percentile is replaced by the 99th
percentile.
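
A minimal sketch of this cap/floor treatment with pandas; the column values and the 1st/99th percentile choice are just for illustration.

import pandas as pd

income = pd.Series([30_000, 45_000, 52_000, 61_000, 5_000_000])  # last value is an outlier

floor = income.quantile(0.01)   # 1st percentile
cap = income.quantile(0.99)     # 99th percentile

# Replace values below the floor with the floor, and values above the cap with the cap
income_treated = income.clip(lower=floor, upper=cap)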

Question. Outlier treatment is a necessary step for Linear models and Neural Networks, but do
we need to do outlier treatment for DT-based models (including ensemble models)?
We don't need to do it for the features, but we do need to do it for the target. The prediction of a
regression tree is based on the observations in the final nodes. If there are outliers in a final
node, they will impact the model's prediction. So we need to treat outliers in the target.
Why don't we need to do outlier treatment for the features in a DT?

4.3. Feature Scaling: Features in a ML model may have different scales. For example, imagine in a
Credit Risk model there are two features:
1. Balance on all credit products
2. Utilization on credit cards

The first feature can take any value above 0. It can have values such as 1000, 3000, 50000, …
The second feature is a percentage variable, and normally takes values between 0 and 1.
Therefore the first variable has a larger scale, and that is something we don't like. We prefer
all features to have similar scales.

Before we discuss how to solve the issue, the question is: why would a difference in scale be a
problem? The answer is in the numerical optimization process the machine uses to minimize the
Loss function. If there is a large difference in scale, the machine will put more emphasis on
features with a large scale. Therefore the information in the low-scale features will not be fully
captured, and the model will not be efficient (which results in lower performance).

There are several ways to scale the data. Two most popular techniques are:

1. Min Max Normalization


2. Standardization

In Min-Max Normalization, the minimum and maximum of the feature are used to normalize the
data. As a result, all normalized values will be between 0 and 1. The formula for Min-Max
Normalization of a feature X is:

Normalized X = (X − Min(X)) / (Max(X) − Min(X))
For example, imagine the following spam detection data. The #Words and Urgently columns have a
large difference in scale. We want to normalize the #Words column. The minimum of the column is 73 and the
maximum is 32,009. The normalized column is shown as the last column below. The normalized
column has the same scale as the Urgently column, and would be used in the model.

# Words Urgently Spam Normalized # Words
1512 0 0 0.05
7603 1 1 0.24
12432 1 0 0.39
73 0 1 0.00
32,009 0 1 1.00

The formula for Standardization is as follows:

Standardized X = (X − Mean(X)) / Standard Deviation(X)
For the #Words column, the average is 10,725.8 and the Standard Deviation is 11,524.01. So the
standardized column is shown as the last column below.

# Words   Urgently   Spam   Standardized # Words
1512      0          0      -0.80
7603      1          1      -0.27
12432     1          0       0.15
73        0          1      -0.92
32,009    0          1       1.85

Feature Scaling is a necessary step for Neural Networks. Also it is highly recommended for Linear
models. But it is not needed for DT-based models. Why?
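
If you want to verify the numbers above, the following is a minimal sketch using sklearn's scalers on the #Words column (note that StandardScaler uses the population standard deviation, which is what the table above uses).

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

words = np.array([[1512], [7603], [12432], [73], [32009]])  # the #Words column

print(MinMaxScaler().fit_transform(words).round(2).ravel())
# [0.05 0.24 0.39 0.   1.  ]

print(StandardScaler().fit_transform(words).round(2).ravel())
# [-0.8  -0.27  0.15 -0.92  1.85]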

4.4. Missing Value Imputation: Missing values need to be replaced with a value before feeding the data into
the model. This is called missing value imputation. There are several approaches for missing
value imputation; replacing missing values with the mean, median, or mode of the feature are
some examples. Eventually there is no perfect solution for missing values. After all, no one
knows their real value.

Some Ensemble packages can handle missing values with no imputation. As we will see, the XGB
package in Python is one of them. For NNs a good approach is to replace missing values with 0.
We will discuss this later.
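
A minimal sketch of median imputation with sklearn; the small feature matrix is made up for illustration.

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[25.0, 50_000.0],
              [40.0, np.nan],       # missing income
              [np.nan, 80_000.0]])  # missing age

# Replace each missing value with the median of its column
imputer = SimpleImputer(strategy="median")
X_imputed = imputer.fit_transform(X)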

5. Feature Reduction: After data processing, the data is ready for model training. However, often there are
many features in the data, and many of them do not have significant predictive power. In a typical
model, there are often hundreds of candidate features. Keeping all of these features has a very high
computational cost, with no benefit. So it is a good idea to keep only features that have some
predictive power, and exclude the rest.

Ensemble models (discussed in the next chapter) can be used for feature reduction. In fact they are
one of the best and most efficient feature reduction techniques. We will see their applications in the
sample codes.

6. Model Training: Now we have a clean, processed dataset, with a fair number of features. It is time to
train the model. ML models have several parameters (we will discuss the parameters of Ensemble
models and NNs in the next two chapters). In the model training step, many models are built based
on different combinations of parameters, and the model with the best parameters is chosen
as the final model. What is the best model? The model that has the best Bias and Variance.
We will practice Model Training, Grid Search, and Bias/Variance analysis in the sample projects.
6.1. Hyper-parameter Tuning (in the form of Grid Search or Random Search)
6.2. Bias/Variance Analysis and Finalizing the Model
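
As a preview of step 6.1, the following is a minimal sketch of a grid search with sklearn; the parameter grid, the synthetic data, and the scoring choice are placeholders for illustration, not recommendations.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Try every combination of these parameter values and keep the best one
param_grid = {"max_depth": [3, 5, 10],
              "min_samples_leaf": [20, 50, 100]}

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, scoring="roc_auc", cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)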

So far, we have covered the fundamental concepts of ML. From now on, there will be less explanation.
Rather, I would like you to make yourself familiar with searching ML topics. YouTube is always a good
choice. Look for videos with many views. Also, I often do not like very short videos …

For practical examples, with data and code, Kaggle is an excellent source. Go to the "Datasets" section of the
website, and search for any topic; for example, a technique such as XGBoost, or an application such as
Home Security Systems. Once you have found a good project, check the highest-voted codes.

Note, these are just projects done mainly by amateur modelers; so while you try to learn from them, don't
overfit to what they have done. A general rule about the internet, and life.

Chapter 11- Ensemble of Models


Ensemble models are a combination of several models, called Base Learners. Base learners can use any
modeling technique, but by default they are decision trees. So, Ensemble models are often combinations
of Decision Trees. For example, an ensemble model can be composed of 1000 decision trees. As a result,
a lot of the discussion in the DT section applies to these models.

How an Ensemble model combines these trees to generate the final output depends on the technique
and the Loss function. Ensemble models can decrease the Bias of a single decision tree, without
increasing variance. Ensemble models achieve this goal by building each model on a different subset of
the train sample; so even if a base learner is overfitted to its subsample, the overall model is not
overfitted to any of the samples (unless the model is irrationally complex).

There are three types of Ensemble models:

1. Bagging models: In these models, the DTs are independent of each other. Each tree is trained on a
sub-sample of the training sample, chosen at random. The model's output is often the average of the
outputs of the individual trees in regression models, and the majority vote in classification models.
However, it is not uncommon to aggregate trees in other ways, for example using the median prediction of the trees for
a regression model, or the average Loss value of the trees for classification (like the average Cross Entropy
generated by each tree).
The famous example of this group is Random Forest (see the sketch after this list). See the following for the documentation of the RF
classifier and RF regressor in sklearn. Make sure you understand these parameters, and how
they affect bias and variance: n_estimators, criterion, max_depth, min_samples_split,
min_samples_leaf, min_weight_fraction_leaf, max_features, max_leaf_nodes,
min_impurity_decrease, bootstrap, random_state, max_samples, and class_weight.
[Link]
[Link]
[Link]
[Link]

2. Boosting methods: In boosting models, the trees are not independent; rather each tree uses the results
of the previous tree. Famous examples of this group are Gradient Boosting (GB) and XGBoost
(XGB). XGB is a modified implementation of GB which is computationally more efficient.
The relationship between the trees depends on the boosting technique and the Loss function. For
example, in Gradient Boosting and XGBoost, if MSE is used as the Loss function, each tree uses the
Errors from the previous tree as its Target variable. In general, GB is an implementation of the Gradient
Descent technique on a dataset. If you are interested in a technical discussion of the method,
see GB's original paper: [Link]
And XGB's original paper: [Link]

Following is the documentation of the XGB package in Python. Make sure you understand these
parameters, and how they affect bias and variance: eta, gamma, max_depth, min_child_weight,
subsample, colsample_bytree, colsample_bylevel, colsample_bynode, lambda, alpha,
scale_pos_weight, tree_method, max_bin, monotone_constraints.
[Link]

Note the tree_method parameter. Using this parameter, you can force the model not to go
through all possible splits for a numerical feature, but rather to choose a few splitting points based on
the distribution of that feature. This can make the model more computationally efficient. The
distribution of the feature is approximated by a histogram; the number of histogram bins can be
defined using the max_bin parameter.

Note the monotone_constraints parameter. This parameter forces the model's output to have a monotonic
relationship with a feature. This parameter, while it may have a negative impact on the model's
performance, can help with the model's explainability, which is a big concern with ML models.

Do you know another constraint which is similar to the monotonicity constraint? The linearity constraint,
which is the basis of Linear models. In fact, in a Linear model, we assume the relationship
between Y and the X variables is linear. This is a big constraint that helps a lot with the model's
explainability.

3. Stacking: Stacking is basically combining several models. For example, you can build a Linear
model, an XGB, and a Neural Network, and then use the outputs from these models as inputs to
another model, say another XGB. Basically you have used the first-layer models to define features
(feature engineering).
4. Blending: Blending is very similar to stacking. The difference is in how the model is cross-validated. Read
about it online.
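
The following is a minimal sketch of defining a bagging model (Random Forest) and a boosting model (XGBoost) with a few of the parameters mentioned above; the parameter values are placeholders, not tuned recommendations.

from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Bagging: 500 independent trees, each trained on a random subsample of rows
# and considering a random subset of features at each split
rf = RandomForestClassifier(n_estimators=500, max_depth=6,
                            max_features=0.5, max_samples=0.8,
                            random_state=0)

# Boosting: trees built sequentially; eta (learning rate) controls how much each
# tree contributes; monotone_constraints forces a monotonic relationship with the
# first feature (+1) and leaves the second feature unconstrained (0)
xgb = XGBClassifier(n_estimators=500, eta=0.05, max_depth=4,
                    subsample=0.8, colsample_bytree=0.8,
                    monotone_constraints=(1, 0), random_state=0)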

Some practical points:

 The number of trees is often the most influential parameter in Ensemble models, followed by the
Learning Rate in Boosting algorithms.
 In ensemble models, the DTs (or base learners) can be considered feature engineering; like features
that are defined by combining the original features.
 By fitting each base learner on a subsample of the training sample, we are practically training the model
on different samples. This helps with overfitting and variance.
 One output of ensemble models is Feature Importance, which shows the power of an
independent variable in explaining the dependent variable. This output of ensemble models can
be used for feature reduction, before parameter tuning. A rule of thumb is to keep features
with FI higher than 1%, but this threshold is completely subjective.
Note: SHAP values, explained in the next section, are considered a better measure of the
explanatory power of a feature (compared with feature importance), and can be used for feature
reduction as well.

How is Feature Importance Calculated?

Each time a feature is used in a split, calculate the "Feature/Split Weighted Gain" as:

Feature/Split Weighted Gain = Number of Observations in the Split (in the Parent Node) × Gain of the Split

What are some of the measures of quality of a split? Variance (for regression when MSE is used), GINI or
Entropy (for classification when Cross Entropy is used).

The Importance of a feature is the sum of the Feature/Split Weighted Gains over all splits based on that feature. Note
that a feature can appear in several splits in a single tree, or may never be chosen for any split…

Splits that happen early in the tree are based on more observations, and generate more Weighted Gain.
Also, the higher the split's Gain, the higher the Weighted Gain.

Feature Importances are normalized across all features such that the sum of all FIs is 1.
So each FI is divided by the sum of all FIs.
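
A minimal sketch of using feature importances for feature reduction with the 1% rule of thumb mentioned above; the synthetic data and feature names are placeholders for illustration.

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=50, n_informative=8, random_state=0)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(50)])

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

# feature_importances_ is already normalized to sum to 1
fi = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
selected_features = fi[fi > 0.01].index.tolist()   # rule of thumb: keep features with FI > 1%
print(len(selected_features), "features kept out of", X.shape[1])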

Important point: While building a ML model, we do several things that may sound suboptimal; i.e. they
result in a less optimal value of the Loss function (on the train sample). One example is the monotonicity
constraint. Another example is the tree_method parameter mentioned above. In general, these are things we do to
reduce variance and complexity. All of these settings, while they make the Loss optimization suboptimal on the
train sample, often have three advantages:

1. They make the model less complex and reduce variance, which may improve the model's performance
(and Loss) on other samples.
2. By reducing complexity, they make the training process faster and more computationally efficient.
3. By reducing complexity, they may help with the model's explainability.

Chapter 12- Inside the Black-Box – SHAP Analysis


ML models suffer from a problem: they are almost impossible to interpret. This is a problem in some
applications, and a deal-breaker in others. Before going further, let's answer two
questions:

First, what does model interpretation mean? It means explaining the relationship between the dependent
and independent variables; i.e. how changes in an independent variable affect the dependent variable. For a
linear model, this relationship is clear, and is shown by the coefficient of a variable. For example, in the
following hypothetical linear income model, we can see there is a positive relationship between age and
income. Also, a one-unit increase in age increases income by $1000 (assuming all other independent
variables are kept constant).

Income = 150 + 1000 × Age + β × Other features


Second, why do we care about model interpretation? In other words, if we build a black-box model that
has low bias and low variance (i.e. it consistently generates accurate predictions), do we need to care
about how the model predicts? We need to interpret the model for two reasons:

1. To understand how changes in X affect Y. Examples: In a marketing model, you are interested in
understanding the impact of an ad on the customer's decision to buy. In another model you may
want to find the impact of a bonus program on the company's profit.
Another example is CCAR/CECL models in banking, where the model's output is the Expected Loss
on a Loan portfolio, and you want to simulate the Bank's losses under different macroeconomic
scenarios. In this case, you would include some macro factors in the model, and analyze how the
model's output changes as a result of changes in macro conditions (unemployment, rates,
inflation, …).
2. Another reason to interpret the model is to make sure the model makes economic (or in general,
logical) sense. This application is especially important if you deal with regulators, or with
model governance in general. Also, making sure the model makes sense can help with the model's variance.
Relationships that do not make sense can break easily, which would result in the model
underperforming in the future.

For the two aforementioned reasons we like to interpret the model. Linear models are very helpful with
model interpretation. In fact, the linearity constraint may have a negative impact on the model's
performance (because we are solving a constrained Loss optimization problem), but it has the advantage
that the model is interpretable.

What about other ML techniques? Other modeling techniques are less interpretable. In fact, as you will
see, some complex techniques are totally impossible to interpret.

There have been many attempts to interpret ML models, but few have industrial application. One of the
most popular model interpretation approaches that has found its way into industry is SHAP analysis. This
method is primarily used to interpret tree-based models (such as ensemble models).

What does SHAP look like? SHAP is a value assigned to each feature/observation pair, and shows the contribution
of that feature to that observation's output. For example, the following tables show hypothetical values for
the features and the output of a SHAP analysis for a model that tries to detect fake accounts. The features are a
guess; I am not sure how the features in such a model would look. When a feature has a high SHAP value for an
observation, it means the feature contributes to a higher value of the output for that observation, and vice
versa.

The output of the model is the Probability that the account is fake. The sum of the SHAP values for an observation
indicates the value of the output. In a regression model, the sum of the SHAPs is the output. In a classification model, the sum of the
SHAPs is the Logit of the output; i.e. ln(Prob. of Response / (1 − Prob. of Response)). Bias is a constant, just like an intercept. It is
based on the average of the target variable in the train sample.

For the first observation we have:

ln(Prob. of Response / (1 − Prob. of Response)) = −1.8 + 0.1 − 2.1 + 0.5 = −3.3 ⇒ Prob. of Fake Account ≈ 3.6%

Note that the same feature, with the same value, can have a different impact on two observations. For example,
Typing Speed is 110 for both the second and third observations; however the SHAP value for the second
observation is much higher than that of the third one. This is not what happens in a Linear model. A
linear model assigns a constant coefficient to each attribute, meaning that if a feature has the same value for
two observations it will have the same impact on the output of both observations.

The reason is that Linear models consider the marginal impact of a feature, independent of other features,
but tree-based models essentially combine all features to calculate the output. In fact, tree-based
models look at the whole profile of an observation, and do not estimate the impact of each feature
separately. In the above example, the model may argue that for the third observation, 110 words per minute is not an
indicator of a fake account, because the account has had several transactions in the last 30 days. The tree-
based interpretation is that "among observations with a high # of transactions in the last 30 days, very high
typing speed does not indicate a high probability of a fake account".

Feature Values
Typing Speed (Words per Minute)   Age of Account (Years)   # Transactions in the last 30 Days
50                                5                        0
110                               1                        0
110                               5                        9

SHAP Values
Bias   Typing Speed (Words per Minute)   Age of Account (Years)   # Transactions in the last 30 Days
-1.8   0.1                               -2.1                     0.5
-1.8   4.5                               3.2                      1.1
-1.8   0.2                               -3.3                     -10.1

SHAP generates two types of graphs that help with interpreting the model: Global and Local graphs.
Global graphs analyze all the observations, while local graphs look at a single observation and show how
each attribute contributed to the output of that observation. Check the following links on the idea
behind SHAP and the different graphs that the SHAP package generates. Specifically, the following three graphs
are used frequently:

1. Beeswarm plot: A global plot that shows the importance of each feature in the model. It can also
be used to find the overall relationship (positive or negative) between a feature and the output.
How?
2. Partial Dependence Plot: Another global plot that shows the overall relationship between a
feature and the output.
3. Force Plot: A local plot that shows the impact of each feature on the model's output for a single
observation. One application of the force plot is to explain which factors impact an observation's
output, and therefore how to improve the output. For example, in a credit risk model, the force plot
can be used to explain why a customer received a low credit score, and how they can improve
their score.
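
A minimal sketch of running SHAP on a tree-based model; the synthetic data is a placeholder, and the plot calls follow the shap package's newer plotting API (plots typically render in a notebook environment).

import shap
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
model = XGBClassifier(n_estimators=100).fit(X, y)

explainer = shap.TreeExplainer(model)   # fast explainer for tree-based models
shap_values = explainer(X)              # one SHAP value per feature per observation

shap.plots.beeswarm(shap_values)        # global: feature importance and direction of effect
shap.plots.scatter(shap_values[:, 0])   # global: dependence plot for one feature
shap.plots.force(shap_values[0])        # local: contributions for a single observation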

These are a few articles on SHAP, and its implementation in Python.

[Link]
ab81cc69ef30

[Link]

[Link]
works-732b3f40e137

And this is the SHAP website: [Link]

Chapter 13- Neural Networks
A Neural Network is like a Linear model that automatically generates new features. In this way, a NN
solves a higher-dimensional Loss optimization problem, and can therefore find lower Loss values than a
Linear model.

For example, imagine we have two independent variables: X1 and X2. A linear regression model would
be in the form of:

ŷ = β0 + β1 X1 + β2 X2

If we run a NN on this data, the NN will generate new features, and ŷ will be a function of those features.
For example, a NN may generate three new features: Z1, Z2, and Z3. A possible formula for Z1, Z2, and Z3
is:

Z1 = α0 + α1 X1 + α2 X2

Z2 = t0 + t1 X1   if X1 + X2 > 2
Z2 = w1 X1 X2     if X1 + X2 ≤ 2

Z3 = sigmoid(b0 + b1 X1 + b2 X2)

ŷ = h0 + h1 Z1 + h2 Z2 + h3 Z3

Note a few things. ŷ is still a linear function; not a function of the original features (the Xs), but a function of
some new features that the NN has created. These new features are more powerful than the original features.
Using these features, we will have a better model, with better performance, compared with the
original linear model.

The linear model solved a 3-dimensional loss minimization problem (to find the optimal values of β0, β1,
β2), but the above NN solves a 13-dimensional optimization problem (to find the optimal values of α0, α1,
α2, t0, t1, w1, b0, b1, b2, h0, h1, h2, h3). A higher-dimensional optimization problem can find better results
than a lower-dimensional one. So the NN can find lower values of the Loss function, which means
lower error and better performance compared with a Linear model. (Imagine you want to find the
lowest point in a mountain range. A linear model is like using a car (and seeing the world in two dimensions),
while a NN is like using a plane (and seeing the world in three dimensions).)

Note that training a higher-dimensional problem requires more data. If you have, say, 1000 observations,
a NN cannot be trained properly, and a Linear model would have better performance. In general, the
more complex the model, the more data needed. That is why a lot of real applications of Neural
Networks are in fields where a lot of data is available, such as Text Mining, Image Processing, Sound
Recognition, Recommender Systems, and Reinforcement Learning systems (driverless cars, games, …). On the
other hand, Linear models are frequently used in models with less data, such as models at the
portfolio or macro level (like a CCAR model that predicts loss on a portfolio of loans). In between are
ensemble models, which are more complex than linear models, but less complex than Neural Networks;
so they don't require as much data. Ensemble models are the optimal choice for many business
applications where data is at the customer or transaction level.

OK, back to NNs. NNs are drawn in the following way, as a sequence of layers. The following is a NN
representation of the above problem: it shows that the inputs (X1 and X2) will be transformed to Z1, Z2,
and Z3, and finally Ŷ will be a function of the Zs. The layer that defines the Zs is called the Hidden layer.

[Diagram: Input layer (X1, X2) → Hidden layer (Z1, Z2, Z3) → Output layer (Ŷ)]

To define a NN, like any other model, some parameters must be set. Through Grid Search you can find
which combination of parameters gives the best bias and variance. The following are some of the most
important parameters of a NN.

1. Number of Hidden layers. A NN can have many Hidden layers. Variables in each layer are a
function of the variables in the previous layer. A NN with more than one hidden layer is called a Deep
Neural Network; hence the name Deep Learning.
2. Number of Nodes in each Hidden layer. For example, in the above example, the modeler has
decided to put 3 nodes in the (only) hidden layer.
The input layer has as many nodes as the number of independent variables. In fact, you don't set
anything for the input layer when defining a Neural Network; you have already done that setting when
passing the input dataset.
The output layer has as many nodes as the number of outputs. In a regression and a binary classification,
the output layer has one node. In a multi-class classification, it has as many nodes as the number of classes
in the target variable.
3. Activation function. The activation function is the functional form of the Zs. It defines how the Zs are related
to the Xs. There are several functional forms that can be used in NNs, but the most famous ones are
ReLU, Sigmoid, and Tanh, with the following formulas:
ReLU(X) = X   if X > 0
ReLU(X) = 0   if X ≤ 0

tanh(X) = (e^X − e^(−X)) / (e^X + e^(−X))

Sigmoid(X) = e^X / (1 + e^X)

Activation functions are applied to a linear function of the nodes in the previous layer. For
example, in the above example, if the activation function for the hidden layer is ReLU, it means each
Z is the ReLU of a linear function of the Xs, like:

Z = ReLU(a0 + a1 X1 + a2 X2) = a0 + a1 X1 + a2 X2   if a0 + a1 X1 + a2 X2 > 0
Z = ReLU(a0 + a1 X1 + a2 X2) = 0                    if a0 + a1 X1 + a2 X2 ≤ 0

a0, a1, and a2 are model coefficients that the NN will estimate.

For hidden layers you may define any activation function, but for the output layer, the choice of
activation function depends on the target variable. If the target is continuous (so a regression
model), the activation function is Linear, like a linear regression. This basically means no activation
function; i.e. Ŷ is a linear function of the previous layer. In the above example, a Linear activation
function means:

Ŷ = Expected Value of Y = t0 + t1 Z1 + t2 Z2 + t3 Z3

If Y is binary, the output layer's activation function will be Sigmoid, like a Logistic regression:

Ŷ = Probability of Response = sigmoid(h0 + h1 Z1 + h2 Z2 + h3 Z3) = e^(h0 + h1 Z1 + h2 Z2 + h3 Z3) / (1 + e^(h0 + h1 Z1 + h2 Z2 + h3 Z3))

If Y is multiclass, the output layer's activation function would be Softmax, like a multiclass linear
classification model. In this case, the model generates n outputs (n = number of classes) for each
observation, each output showing the probability of one class. For example, if the target has three
classes, say Blue, Green, and Red, then (writing S = e^(a0 + a1 Z1 + a2 Z2 + a3 Z3) + e^(b0 + b1 Z1 + b2 Z2 + b3 Z3) + e^(c0 + c1 Z1 + c2 Z2 + c3 Z3)):

Ŷ1 = Probability of (Y = Blue) = e^(a0 + a1 Z1 + a2 Z2 + a3 Z3) / S

Ŷ2 = Probability of (Y = Green) = e^(b0 + b1 Z1 + b2 Z2 + b3 Z3) / S

Ŷ3 = Probability of (Y = Red) = e^(c0 + c1 Z1 + c2 Z2 + c3 Z3) / S

Note that Sigmoid is a special case of Softmax where there are two classes, and Ŷ1 + Ŷ2 = 1.

Let's review everything we said with an example. Assume you want to build a NN on the following binary
classification data, with the following network architecture (i.e. parameters):

 NN has two hidden layers.

 Hidden layer 1 has one node with Tanh activation.
 Hidden layer 2 has two nodes with Relu activation.

X1 X2 Y
2.2 -0.5 0
1.6 0 1

Draw the NN as a sequence of layers. Note, there are two independent features; therefore two nodes in the
input layer.

[Diagram: Input layer (X1, X2) → Hidden layer 1 (Z1) → Hidden layer 2 (T1, T2) → Output layer (Ŷ)]

Write the formulas for Z1, T1, T2, and Ŷ. Note, it is a binary classification model, so the activation function
for the output layer is Sigmoid.

The activation function for the first hidden layer is Tanh, so:

Z1 = tanh(a0 + a1 X1 + a2 X2) = (e^(a0 + a1 X1 + a2 X2) − e^(−a0 − a1 X1 − a2 X2)) / (e^(a0 + a1 X1 + a2 X2) + e^(−a0 − a1 X1 − a2 X2))

The activation function for the second hidden layer is ReLU, so:

T1 = ReLU(b0 + b1 Z1) = b0 + b1 Z1   if b0 + b1 Z1 > 0
T1 = ReLU(b0 + b1 Z1) = 0            if b0 + b1 Z1 ≤ 0

T2 = ReLU(c0 + c1 Z1) = c0 + c1 Z1   if c0 + c1 Z1 > 0
T2 = ReLU(c0 + c1 Z1) = 0            if c0 + c1 Z1 ≤ 0
And for Ŷ:

Ŷ = Sigmoid(d0 + d1 T1 + d2 T2) = e^(d0 + d1 T1 + d2 T2) / (1 + e^(d0 + d1 T1 + d2 T2))

The above Neural Network has 10 coefficients: a0, a1, a2, b0, b1, c0, c1, d0, d1, d2.

Imagine we have fitted the NN, and the values of all coefficients are 1 (for simplicity). Calculate the model's
output for the first observation.

We need to calculate values layer by layer.


Z1 = tanh(1 + 1×2.2 + 1×(−0.5)) = (e^2.7 − e^(−2.7)) / (e^2.7 + e^(−2.7)) ≈ 0.99

T1 = ReLU(1 + 0.99) = 1.99

T2 = ReLU(1 + 0.99) = 1.99

Ŷ = Sigmoid(1 + 1.99 + 1.99) = e^4.98 / (1 + e^4.98) ≈ 0.993
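
A minimal sketch of this forward-pass calculation in numpy, with all coefficients set to 1 as in the example above (the variable names are just for illustration):

import numpy as np

def relu(x):    return np.maximum(x, 0)
def sigmoid(x): return 1 / (1 + np.exp(-x))

x1, x2 = 2.2, -0.5                    # first observation

z1 = np.tanh(1 + 1 * x1 + 1 * x2)     # hidden layer 1 (Tanh), ~0.99
t1 = relu(1 + 1 * z1)                 # hidden layer 2 (ReLU), ~1.99
t2 = relu(1 + 1 * z1)                 # ~1.99
y_hat = sigmoid(1 + 1 * t1 + 1 * t2)  # output layer (Sigmoid), ~0.993
print(round(float(y_hat), 3))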

How does the NN find the coefficients? Just like any other model, by minimizing the Loss function. For
example, in the above example, if Cross Entropy is used as the Loss function, then the Loss function for the
above two observations would be:
Cross Entropy = −Σ (Yi ln Ŷi + (1 − Yi) ln(1 − Ŷi)) = −ln(1 − Ŷ1) − ln Ŷ2

Where Ŷ is a function of the Ts, the Ts are a function of Z, and Z is a function of the Xs (like the above
equations). This model, as discussed, has 10 coefficients (so it is a 10-dimensional optimization problem).
In other words, the NN finds the values of these 10 coefficients that minimize the above Loss function. Note
that the Ŷs can be written as a function of the 10 coefficients. We will see a numerical example in the last
chapter.

NNs often have many coefficients, and hence are complex optimization problems. Extreme cases are
Large Language Models with hundreds of billions of coefficients; hence something like a 500-billion-dimensional
optimization problem.

Question. What would be the number of coefficients in the above data with the following network
architecture:

 3 Hidden layers.
 Hidden layer 1, 3 nodes.
 Hidden layer 2, 2 nodes.

 Hidden layer 3, 5 nodes.

Answer: (2+1)×3 + (3+1)×2 + (2+1)×5 + (5+1)×1 = 9 + 8 + 15 + 6 = 38. Each node has one coefficient per node in the previous layer, plus an intercept.
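
A small sketch that counts the coefficients for any architecture of dense layers (number of inputs, hidden layer sizes, one output node), matching the calculation above; the function name is just for illustration.

def count_coefficients(n_inputs, hidden_sizes, n_outputs=1):
    """Each node has one weight per node in the previous layer, plus an intercept."""
    sizes = [n_inputs] + list(hidden_sizes) + [n_outputs]
    return sum((prev + 1) * curr for prev, curr in zip(sizes[:-1], sizes[1:]))

print(count_coefficients(2, [3, 2, 5]))  # 38, as in the answer above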

The NN uses a numerical optimization technique to solve these high-dimensional optimization problems. The
technique is called Gradient Descent, and is explained in the last chapter. Gradient Descent works in an
iterative process. It starts by assigning some initial random values to the model's coefficients. Then at
each iteration, the values of the parameters are adjusted towards the minimum of the Loss function. Normally
Gradient Descent continues until the minimum value of the Loss is achieved, but in a NN – due to the complexity of the
optimization problem – this may never happen, or may happen only after many iterations (which
is not computationally feasible). Therefore the modeler defines the number of iterations. Note that even if
we don't reach the minimum in this number of iterations, we still get a low-enough value of the loss
function, and a model that has good performance.

The way the modeler defines the number of iterations is through two other important parameters of a
NN: Batch and Epoch.

Batch defines the number of observations used in each iteration of Gradient Descent. For example, if
Batch = 20, then at each iteration of Gradient Descent, 20 observations (from the train sample) are
selected at random, and the model coefficients are updated based on those 20 observations, as if the
model is marginally trained on these observations. Example in the last chapter.

Once a batch of observations has been used in an iteration, another batch is used in the next iteration. Once all
the observations in the train sample have been used, that is called one epoch. A model can have many epochs,
which means it goes through all the observations several times. For example, assume there are 10,000
observations in the train sample. If the batch size is 10, it means each epoch will have 10,000 / 10 = 1,000
iterations. If the model has 20 epochs, it means this model will go through 20,000 iterations of Gradient
Descent.

Question. There are 500,000 observations in the development sample. We do a 70/30 train/test split.
The batch size is 30 and there are 15 epochs. How many iterations of Gradient Descent will this model go through?

Since we do a 70/30 split, there will be 350,000 observations in the train sample. The number of iterations in
one epoch would be 350,000 / 30 = 11,666.67, which means each epoch will have 11,667 iterations (note that
the last iteration has fewer than 30 observations). Since there are 15 epochs, the total number of iterations
would be 15 × 11,667 = 175,005.

TensorFlow is the most popular package for Neural Networks. Since TensorFlow is a little too technical,
another package called Keras was built on top of it, which acts like a user interface for TensorFlow. Keras is usually the
package of choice when building a Neural Network. The Keras documentation is an important source of information for
NN developers. It shows different types of layers, parameters for different types of layers, types of
activation functions, and much more.

Types of Layers in a NN: We mentioned that each node is an (activation) function of a linear function of the
nodes in the previous layer. That is the case for a Dense layer, the simplest type of layer in a NN. There
are other types of layers that are more complex, and apply a more complex transformation to the
previous nodes. Examples of other NN layers are Recurrent layers (used for sequence models),
Embedding layers (used in NLP models), and Convolutional layers (used in image and sound processing).
Some of these will be explained in the last chapter.

Following is the Keras documentation for the Dense layer. If you are interested in NNs, make sure you
understand all of these parameters. [Link]
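
To tie the pieces together, the following is a minimal Keras sketch of the binary classification architecture discussed above (two hidden layers, Tanh then ReLU, Sigmoid output), including the batch size and epochs parameters; the layer sizes and training settings are placeholders for illustration.

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

X = np.array([[2.2, -0.5], [1.6, 0.0]])   # two features
y = np.array([0, 1])                      # binary target

model = Sequential([
    Dense(1, activation="tanh", input_shape=(2,)),  # hidden layer 1: one node, Tanh
    Dense(2, activation="relu"),                    # hidden layer 2: two nodes, ReLU
    Dense(1, activation="sigmoid"),                 # output layer: binary target => Sigmoid
])

model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(X, y, batch_size=2, epochs=10, verbose=0)  # batch and epoch parameters
print(model.predict(X, verbose=0))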

Chapter 14 – Topics: Sample Split, Regularization, Over/Under-sampling

14.1. Sample Split:

Sample Split can be done in three ways:

1. Train/Test
2. Train/Validation/Test
3. Cross Validation/Test

A few points before we explain each method:

 The goal in all of the above is to analyze Bias and Variance (especially variance). More specifically,
the goal is to analyze Bias and Variance across several samples, to find the best set of
parameters. The Test sample is supposed to represent unseen data.
 Another important point: variance should be analyzed for both the model and the strategy. So just like
the model, we need to make sure the strategy's outcome is consistent across samples. The strategy's
performance can be estimated using a confusion matrix, for example.

Question. What is a good performance metric for a strategy if the goal is to compare models based on
what percentage of responses they capture? Read this article, for example, on the confusion matrix for
classification models: [Link]
performance-metrics-a0ebfc08408e

 Finally, all the samples (test, train, validation, …) should be unbiased representations of the
target population.

1. Train/Test split: You are already familiar with the train/test split. Here are some practical
considerations:
a. It is good practice to have more than 1 test sample, so you can collect more data
points on the model's variance.
b. Include enough data in all samples. What is enough? It is subjective and very problem-specific.
c. The most recent data should be used to train the model, so the model is trained on data that is
most similar to production. It is also good practice to use a small sample, very close to
production, for test. Here is an example:
Assume the development sample is from Jan to Dec 2023. The following is a good split:
 Train: March to November 2023.
 Test 1: Jan and Feb 2023.
 Test 2: Dec 2023 (right before production, so you can have a good measure of how the model
will work in production)

In the example above, if there are not enough observations in a month, you may consider using more
months for each sample.

Note: Sample split is an important part of a ML project. Decisions about it are made at the beginning of the
project, often by senior management.

2. Train/Validation/Test: This is also called Train/In-time validation/Out-of-time validation. The idea
is to train and validate the model on the train and validation samples, and eventually test it on the test
sample to make sure the model keeps its performance. So Grid Search¹ is done on the train
and validation samples, the best set of parameters is chosen, and eventually the final model's
performance is tested on the test sample.
In practice, this method is not different from the first option; just the naming is different.
Note that in the first option you can also define more than one test sample.

¹ An alternative to Grid Search is Random Search, where the model does not go through all the combinations in
the Grid, but rather evaluates a random subset of the combinations. Otherwise the idea is the same as Grid Search. See
the following for the Random Search package in sklearn:
[Link]

3. Cross Validation/Test: Cross Validation (K-Fold Cross Validation) can be used to train and validate
the model on the same sample. This method is especially useful when there is not enough data, so
all the data should be used for training (i.e. putting aside part of the data just for test is not feasible).
So a lot of the time there is no test sample in this method.
See this for an introduction to K-fold Cross Validation: [Link]
v=TIgfjmp-4BA
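
A minimal sketch of 5-fold cross validation with sklearn; the synthetic data and model settings are placeholders for illustration.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# 5-fold cross validation: the model is trained 5 times, each time validated on a
# different 20% fold that was held out from training
scores = cross_val_score(RandomForestClassifier(n_estimators=200, random_state=0),
                         X, y, cv=5, scoring="roc_auc")
print(scores.mean(), scores.std())  # average performance and its variability across folds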

Question. A Neural Network is trained on 1,000,000 observations using 5-fold cross validation. Grid
search is done on the following parameters. How many iterations of Gradient Descent will this model go
through? There are 20 epochs.

Grid Search components:

 Activation function: [Relu, Tanh]
 Batch size: [50, 100, 200]
 Dropout: [50%, 70%]

14.2. Regularization: Regularization is a method to control the model's variance and risk while fitting the
model, i.e. while minimizing the Loss function. Note that Grid Search is done on fitted models: once
a model is fitted, its bias and variance on different samples are analyzed. In regularization, by contrast,
we control the model's variance and complexity while fitting the model. In fact,
regularization parameters can be used as Grid Search parameters.

Regularization is a very cool topic, and can have a significant impact on how a model behaves. For
example, some people build superior trading models (and thus become rich) only by using proper
regularization techniques. Note that the techniques we introduce here are general regularization
techniques. You can use your mathematical creativity to define customized Loss functions that control specific
parts of the model. For example, after reading about Lasso and Ridge regression below, think about a
regularization term that penalizes only overvaluation in a regression model.

Regularization in Linear models: In a linear model, regularization helps with controlling the size of the
coefficients, so no feature has a very large coefficient (either positive or negative). You may ask,
what is the problem with a large coefficient? A large coefficient means a single feature (or a few features)
dominates the model, and that is what we don't like. In other words, we like models that are diversified in
terms of features.

Even if the model has low bias and variance, models that depend on only a few features often have high
risk. The reason is that slight changes in market dynamics and in the relationship between those few
dominant features and the target variable would jeopardize the whole model. Also, production issues,
such as missing information or errors in the dominant features, affect the model's performance
significantly. So we prefer models that are more diversified, with no feature dominating the model. The
solution is to regularize the model.

See the following article on regularization for linear models (also called penalized regression, as it
penalizes large coefficients). While reading the article, note the notion of the L1 and L2 norm. These are
common terms in Statistics; the L1 norm refers to regularization terms with absolute values, and L2 refers to
regularization terms with squares. Note how adding regularization terms affects the Loss minimization
process and the size of the coefficients.
[Link]
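
A minimal sketch of L2 (Ridge) and L1 (Lasso) regularized linear regression with sklearn; the alpha values are placeholders and would normally be tuned in a grid search.

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge

X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)

ols = LinearRegression().fit(X, y)   # no penalty on coefficient size
ridge = Ridge(alpha=10.0).fit(X, y)  # L2 penalty: shrinks large coefficients
lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty: can shrink some coefficients to exactly 0

print(abs(ols.coef_).max(), abs(ridge.coef_).max(), abs(lasso.coef_).max())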

Regularization in Neural Networks: In Neural Networks, regularization can be applied in two ways:

1. Regularizing coefficients of each hidden layer


2. Using Dropout regularization

The former works very similarly to linear models. See the following for the parameters of the Dense layer in Keras.
Note there are regularization options for the kernel (coefficients), for the bias (intercept), and even for the
activation function itself. [Link] For this course, you don't need
to know exactly how coefficient regularization for NNs works, although it is very similar to linear models.
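As a sketch of how those options might look (assuming an existing Keras Sequential model called NN_model, and illustrative penalty values):

# Sketch of coefficient (kernel) and intercept (bias) regularization on a Dense layer.
# Assumes NN_model is an existing Keras Sequential model; the 0.01 penalties are illustrative only.
from tensorflow.keras import regularizers
from tensorflow.keras.layers import Dense

NN_model.add(Dense(5,
                   kernel_regularizer=regularizers.l2(0.01),   # L2 penalty on the coefficients
                   bias_regularizer=regularizers.l1(0.01)))    # L1 penalty on the intercept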

Dropout regularization works by randomly dropping a percentage of the layer's units at each iteration of
gradient descent, so their coefficients are not updated during that iteration. Dropout is a parameter set for each layer.
Following is an example of 50% dropout added to a dense layer, which means that in each iteration of
gradient descent, 50% of the units of this layer (randomly chosen) are dropped and their coefficients are not updated.

# NN_model is an existing Keras Sequential model
NN_model.add(Dense(5))        # a dense layer with 5 units
NN_model.add(Dropout(0.5))    # 50% dropout applied to the layer above

Not updating all coefficients at each iteration makes the model less overfitted to the train sample, which
helps with the model's variance. Search online for articles on Dropout regularization.

As mentioned, regularization parameters, such as Dropout or in-layer parameters, can be changed during
a Grid search.

Regularization for Tree-Based models: Tree-based models also have their own regularization parameters.
For this class you don't need to know how those parameters impact the training process. Like other models,
the higher the regularization, the lower the variance. See the following link for the regularization parameters
in the XGB package (lambda and alpha). Note they also work through L1 and L2 norms.
[Link]
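A minimal sketch of how these parameters might be set through the XGB sklearn API (the values are illustrative, not recommendations):

# Sketch of L1/L2 regularization in the XGBoost sklearn API.
# reg_alpha is the L1 (alpha) term and reg_lambda the L2 (lambda) term;
# they would normally be tuned as grid search parameters.
from xgboost import XGBClassifier

model = XGBClassifier(
    n_estimators=200,
    max_depth=4,
    reg_alpha=0.5,     # L1 regularization on leaf weights
    reg_lambda=2.0,    # L2 regularization on leaf weights
)
# model.fit(X_train, y_train)   # X_train / y_train are whatever training data you have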

15.3. Imbalanced data: I decided to add this small section because I have noticed there is a myth in ML
that data should be balanced. This myth can result in building even worse models. So I found it a good
excuse to talk about sampling, and also to review an important concept in classification: rank ordering.

What is balanced data? The term usually refers to classification models. A dataset is completely balanced if
each class of the target variable has the same number of observations; so for example in a binary
classification model, 50% of observations should be from each class. Also, the higher the difference
between the number of observations in each class, the more imbalanced the data is. The myth is that data
should be balanced; so modelers may oversample the minority class, or under-sample the majority class,
in order to get a balanced dataset.

As mentioned, this is not a correct notion, and a lot of times even extremely imbalanced data is just fine
as is. For example, Fraud models often have very low response rates (in the order of 0.01%), but often no
under- or oversampling is needed. The source of this myth is the belief that if data is extremely
imbalanced, like the example above, then the machine will consider all cases as non-response, and so in the
above case the accuracy will be 99.99%. However, by now you should know that is not how the machine
works. The machine does not optimize accuracy; it optimizes a Loss function like Cross Entropy. To minimize
Cross Entropy (or any other classification Loss), responses should get high probability, and non-responses
should get low probability. No matter what percentage of observations are responses, the machine will try to assign high
probability to (even a few) responses. In other words, the machine will still try to Rank-Order observations,
even if few of them are responses; which means it will try to assign higher probabilities to responses
compared with non-responses. The result would be that in cases like the Fraud model, all observations
will have low probabilities (say, for example, the highest probability of response in the model might be
10%), but the model still tries to rank order, so Fraud cases will have higher probability than non-fraud.
In such a model, the strategy team can define low thresholds (say, for example, 9.5%), and get a good
accuracy. Are you following??

Note that classification performance metrics, such as AUC, are sensitive to rank ordering. For example, if
you have 100 responses out of 1,000,000 observations, AUC will be high, only if model assigns higher
probabilities to responses. Are you following?? If not, review AUC concepts. In a recent job posting, I
noticed one of the qualifications is “to have a good understanding of AUC”, which I think is a very clever
job posting.

Note that I am not trying to say that over- or under-sampling is never needed. Sometimes it actually
helps, and it can be tried as one of the model parameters during grid search. For example, we talked about
how you can play with the "scale_pos_weight" parameter in the XGB package to try different oversampling
ratios (and possibly get a model with better bias and variance).
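A minimal sketch of how scale_pos_weight could be tried as one more grid parameter; the counts and candidate weights below are made up for illustration:

# Sketch of trying different weighting ratios through scale_pos_weight in XGBoost.
# A common starting point is (# of non-responses) / (# of responses);
# treat it as a grid search parameter rather than a rule.
from xgboost import XGBClassifier

n_non_response, n_response = 999_900, 100          # illustrative counts
ratio = n_non_response / n_response

for w in [1, ratio / 10, ratio]:                    # candidate weights for the grid
    model = XGBClassifier(scale_pos_weight=w)
    # model.fit(X_train, y_train)  # evaluate bias/variance (e.g. AUC) for each w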

Chapter 15 – Unsupervised Learning: Clustering, PCA, Anomaly Detection, and Recommender Systems
So far we have been talking about Supervised models, where there is a Target variable (label), and the
model tries to predict the target. However sometimes there is no target variable in the data (or there is
only for a few observations), or if there is, we prefer not to use it (will be explained shortly). These types
of models - in which there is no target variable involved – are called unsupervised models.

In general, there are three types of ML techniques:

1. Supervised models: to predict a target
2. Unsupervised models: a model with no specific target variable
3. Reinforcement Learning: These models are out of the scope of our class. These are the models
used in applications like driverless cars, and they model multi-period systems. For a good introduction to RL,
check the following course by Andrew Ng: [Link] learning-recommenders-reinforcement-learning?specialization=machine-learning-introduction

Back to unsupervised models. There are four major types of unsupervised models:

1. Clustering: Goal of clustering techniques is to put observations in a few groups (clusters), based
on their similarities. For example, customers may be placed into three clusters based on their
profile characteristics and shopping behavior; then a separate marketing policy may be designed
for each group based on their characteristics.
2. Principal Component Analysis (PCA) and Variable Clustering: While goal of clustering is to put
observations in different clusters, goal of PCA and Variable Clustering is to define clusters for
features.
3. Anomaly Detection: While in clustering we try to find similar observations (and put them in
clusters), in Anomaly Detection the goal is to find observations that are different from other
observations (anomalies or outliers). The applications are in Fraud analytics (find abnormal
transactions), maintenance scheduling (detect anomalies in machinery’s condition, such as
temperature, …), and data quality check (find data entries that might be erroneous).
4. Recommender Systems: As the name suggests, these systems are used to suggest products or
solutions based on similar cases. These days you see the applications everywhere: Amazon,
YouTube, Netflix, …

Before we discuss each group in more detail, a discussion on why sometimes, even if we have a target
variable, we prefer not to use it (and prefer an unsupervised technique). This is a conceptual and
insightful argument.

Note that what we do in a supervised model is predict the Target based on some Features. What if the features are
weak, and for any reason do not have enough predictive power? In that case, even the best modeler and
the most complex technique cannot do anything.

For two reasons, features may have weak explanatory power:

1. Features are not strongly related to the target, or in the language of Statistics, the target is
independent of the features.
2. Dynamics of the target and features change frequently. Examples are trading and fraud models,
where relationships between target and features change dynamically and frequently; so any
training sample would be basically biased for future samples.

For the above reasons, sometimes even if we have enough data on the target, we may still decide not to use
a supervised model, since either the model will have low performance (high bias), or it will have high
variance. In these cases, the modeler may decide to go with a more general and less complex approach; i.e.
an unsupervised model.

With the increase in quality and quantity of data, many applications of unsupervised modeling are being
replaced by supervised models. An example is fraud models: supervised fraud models now have
significantly better performance compared with unsupervised models.

Next, we have a discussion of different unsupervised approaches.

1. Clustering: As mentioned, the goal of clustering models is to define clusters of observations where
each cluster is composed of similar observations. These are the famous families of clustering models:
1.1. Centroid-Based clustering: The famous example in this category is K-Means Clustering. This
is a good article on K-means clustering.
[Link]

The most important parameters of K-Means clustering are number of clusters, and measure of
similarity. Following is a good article on different measures of similarity:
[Link]
machine-learning-1f68b9ecb0a3

Think and search about different similarity measures. What are the advantages and
disadvantages of each? How to choose which one to use for a specific application?

Is Euclidean distance the same as Cosine similarity when features are scaled?

How is a new observation assigned to a cluster (in production)? Based on the similarity of the
observation with each of the cluster centroids, using the same similarity metric as the one used in model
training.

Read online about Kmeans++.
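As a minimal sketch (toy data, illustrative number of clusters), K-Means in sklearn might look like this:

# Minimal K-Means sketch; features are scaled first because distance-based
# methods are sensitive to scale. n_clusters=3 is an illustrative choice.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(500, 4))           # toy customer features
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X_scaled)

print(kmeans.labels_[:10])          # cluster assigned to each observation
print(kmeans.predict(X_scaled[:5])) # how new (scaled) observations would be assigned
print(kmeans.inertia_)              # within-cluster sum of squared distances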

1.2. Hierarchical-Based clustering: This is a good article on hierarchical clustering:


[Link]

The most important parameters of Hierarchical Clustering are measure of similarity and linkage
criteria.

How is a new observation assigned to a cluster (in production)? Based on the similarity of the
observation with each cluster, according to the linkage criteria; the same similarity measure and
linkage criteria as the ones used in model training.
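A similar minimal sketch for hierarchical clustering in sklearn (again with toy data and an illustrative linkage choice):

# Minimal hierarchical (agglomerative) clustering sketch.
# "linkage" is the linkage criterion discussed above; "average" is an illustrative choice.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).normal(size=(500, 4))  # toy features
X_scaled = StandardScaler().fit_transform(X)

hier = AgglomerativeClustering(n_clusters=3, linkage="average")
labels = hier.fit_predict(X_scaled)                 # cluster label for each observation
print(labels[:10])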

1.3. Density-Based methods: The most famous technique in this group is DBSCAN. This is a good
article on DBSCAN: [Link] clustering/

The most important parameters of DBSCAN are measure of similarity, Epsilon (radius to define
neighborhood), and Minimum Points.

Question. How is a new observation assigned to a cluster (in production)?

This method does not try to assign outliers to any cluster. In fact, outliers will remain alone with
no neighboring observations. As a result, this technique is not sensitive to outliers, and it can also be
used to detect outliers.
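A minimal DBSCAN sketch in sklearn, with illustrative values for Epsilon and Minimum Points:

# Minimal DBSCAN sketch; eps is the neighborhood radius (Epsilon) and
# min_samples the Minimum Points parameter. The values here are illustrative.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).normal(size=(500, 4))
X_scaled = StandardScaler().fit_transform(X)

db = DBSCAN(eps=0.8, min_samples=5).fit(X_scaled)
print(set(db.labels_))   # label -1 marks outliers that were not assigned to any cluster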

1.4. Distribution-Based models: The famous example in this category is Gaussian Mixture
Models. These models are similar to Density Based models in the sense that they try to find
areas of sample with higher density of observations; but in contrast to density based models,
they try to assign a distribution function (like Normal) to different parts of data. So they are
considered parametric models. For this class you don’t need to understand these models.

Practical points regarding Clustering models:

 Although techniques such as the Elbow method can be used to find the optimum number of clusters (in
K-Means and Hierarchical), in practice the number of clusters is defined based on business
judgment and requirements.
 Just like supervised models, there are several metrics to calculate the performance of an
unsupervised model. Make sure you understand the Inertia and Silhouette metrics (see the sketch
after this list). This is a good article on Silhouette: [Link] machine-learning-part-3-clustering-d69550662dc6.
Note that you can use any similarity metric to calculate any performance metric.
 Sometimes clustering can be used as a first stage for a supervised model. For example, in a
customer churn model, you may define a few clusters, and build a separate supervised
model for each cluster. This approach is more popular when building linear models. The
reason is that linear models assign coefficients based on the average impact in the whole
sample. So if you believe different subsamples behave differently (and therefore should have
different coefficients), a first-stage clustering may help. With the increased application of
non-linear models, such as Ensemble models and NNs, clustering is losing its application in
this area.
 Another possible application of clustering is in feature engineering. Cluster number assigned
to an observation, can be used as a feature in a supervised model.
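A minimal sketch of computing the Inertia and Silhouette metrics for a few candidate numbers of clusters (toy data, illustrative range of k):

# Sketch of two clustering performance metrics mentioned above: inertia and silhouette.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(np.random.default_rng(0).normal(size=(500, 4)))

for k in range(2, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_, silhouette_score(X, km.labels_))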

A word on the differences between supervised and unsupervised models, feature treatment, and role
of the modeler:

Note two differences between supervised and unsupervised models (among other differences):
Supervised models can find the most important features (feature selection), and also can assign higher
weights to features that are more important. They can do this because they see the target. In contrast,
unsupervised models are blind and have no target. An unsupervised model would use all the features,
and will assign the same importance to all of them (this is probably the main reason behind lower
performance of unsupervised models – no clue about relative importance of features).

Since unsupervised models have this shortcoming in finding the relevant features, it is important that
the modeler uses business judgment, and only uses features that are relevant in the context of the model's
application. In fact, the modeler needs to do the feature selection step, and we can say that the most important
parameter in any unsupervised model is the list of features used in model training.

Also considering the sensitivity of similarity metric to the scale of features, it is very important to scale all
features before feeding them into the model. Here, modeler can assign higher weights to a feature by
multiplying the Scaled feature by a number. For example, if modeler thinks feature #2 is twice as
important as other features, they can multiply values of scaled feature by 2.

Bias / Variance Analysis and Parameter Tuning in Clustering models: Note, it is always a good practice
to analyze bias and variance of the model. Whether you define a separate Test sample, or do Cross
validation, analyzing model’s performance metric, and making sure the model has low bias and variance
is important. Process is similar to supervised models: define test and train (or use cross validation), and
then perform Grid Search to find the parameters that result in the best combination of Bias and
Variance. See above for different performance metrics.

2. Principal Component Analysis: PCA is also a clustering method, but clustering of features, rather
than observations. PCA can be used as a feature reduction method that can make the data storage
and processing, more efficient.

PCA analysis generates new features (columns). Call them PCA1, PCA2, ... Each PCA is a linear
function of original independent variables. There will be as many PCAs as the number of original
independent variables. Therefore if there are 3 independent variables (X1, X2, X3), there will be 3
PCAs, and each is a linear function of X1, X2, and X3. For example:
PCA1 = −2 + 3×X1 − 0.6×X2 − X3

The point of PCA analysis is that the first few PCAs, store almost all the information that is in the
independent variables. Consider information as the knowledge we can get from data. Therefore, we
can use a few PCAs instead of original data, and the models will have almost the same performance.
For example, we may convert 60 features to 10 PCAs, gain the same amount of information, and pay
less for processing and storage costs.

PCA has a disadvantage though. PCAs are not easy to understand and interpret. It makes model
interpretation difficult, even for linear models. These days in the industry of data science,

interpretation is becoming more important, while storage and processing costs are decreasing. All
bad news for PCA analysis.

PCA is based on the concept of Eigenvalues and Eigenvectors in Linear Algebra.

An alternative to PCA that is actually used in industry is Variable Clustering. Variable clustering puts
variables into clusters, similar to K-Means for observations. The first few variables from each cluster
explain the majority of variation in that cluster (which is the stat way of saying they contain the majority of the
information in that cluster). So a few features can be chosen from each cluster; you gain almost the same
amount of information, pay less for processing and storage costs, and keep meaningful features.
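A minimal PCA sketch in sklearn, with toy data and an illustrative choice of 10 components:

# Reduce 60 scaled features to 10 components and check how much of the
# variation (information) the first few components keep. Sizes are illustrative.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).normal(size=(1000, 60))     # toy data with 60 features
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=10)
X_pca = pca.fit_transform(X_scaled)                      # the new PCA1 ... PCA10 columns

print(X_pca.shape)                                       # (1000, 10)
print(pca.explained_variance_ratio_.cumsum())            # share of variation kept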

3. Anomaly Detection: The goal of anomaly detection is to find observations that are very different
from other observations. Examples of applications are identifying machinery defects, errors in data,
and fraud detection. Following are the main anomaly detection methods:
3.1. Density-Based: This technique estimates a probability distribution for each of the features,
and calculates the probability that each observation happens. The lower the probability that an
observation happens, the higher the probability that the observation is an anomaly. Let's
discuss it with an example. Assume there are two features in the data: X1 and X2.
 The first step is to assume a probability distribution for X1 and X2. A common
assumption that often is almost correct is that features have Normal distribution. Even
if the features are not Normally distributed, we may transform the feature (often using
Log transform), so the transformed feature is almost Normally distributed. Normality
assumption can be verified simply by looking at the histogram of the feature. Also there
are several techniques to check how Normal is a distribution.
Note: Whenever we assume a distribution for the data, the model is called a Parametric
model. Other examples of parametric models are Linear models (when we make
assumptions about the distribution of error, see chapter on Econometrics), and
Distribution-based clustering.
 Once a distribution is assumed, parameters of the distribution are calculated based on
the sample. For example, if we assume X1 and X2 are Normally distributed, then we will
need to estimate Expected Value and Standard Deviation of Normal distribution.
Expected Value can be estimated using Sample Average, and Standard Deviation can be
calculated using Sample Standard Deviation.
 Assume the Sample Average of X1 is 2, and the Sample Average of X2 is -0.5. Also assume the
sample standard deviation of X1 is 1.2 and the sample standard deviation of X2 is 2.2. So:
µ_X1 = 2, σ_X1 = 1.2, µ_X2 = −0.5, σ_X2 = 2.2

Therefore, using the formula for the Normal distribution, the probability density functions for X1 and
X2 will be as follows:

PDF(X1) = (1 / (σ_X1 · √(2π))) · e^( −(1/2) · ((X1 − µ_X1) / σ_X1)² )

PDF(X2) = (1 / (σ_X2 · √(2π))) · e^( −(1/2) · ((X2 − µ_X2) / σ_X2)² )

 Using the above formula, the probability of each observation would be
Probability(X1) × Probability(X2). For example, for an observation with X1 = 4.3 and
X2 = -1.3 we have:

Probability(X1 = 4.3) = (1 / (1.2 · √(2π))) · e^( −(1/2) · ((4.3 − 2) / 1.2)² ) = 0.053

Probability(X2 = −1.3) = (1 / (2.2 · √(2π))) · e^( −(1/2) · ((−1.3 + 0.5) / 2.2)² ) = 0.17

Probability of Observation = Probability of X1 × Probability of X2 = 0.053 × 0.17 = 0.009
Following this approach, you can calculate the Probability of all observations in the sample. Note
that the farther X1 or X2 is from its sample average, the lower the probability for that
observation (indicating the observation might be an anomaly). Also, a lower standard deviation for the
features increases the probability (does this make sense?)
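The same arithmetic can be reproduced with scipy (a sketch; scipy.stats.norm is just one way to evaluate the Normal density):

# Densities of X1=4.3 and X2=-1.3 under the Normal distributions estimated from the sample.
from scipy.stats import norm

p_x1 = norm.pdf(4.3, loc=2, scale=1.2)      # ~0.053
p_x2 = norm.pdf(-1.3, loc=-0.5, scale=2.2)  # ~0.17
print(p_x1 * p_x2)                          # ~0.009: the joint density, assuming independence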

Another point to mention is that the assumption in the last step is that X1 and X2 are
independent. For independent random variables we have:

Prob.(X1, X2) = Prob.(X1) × Prob.(X2)


If the features are not independent, the above formula does not hold, and to calculate
Probability of Observation, we can not simply multiply Probability of Features. In practice, even if
X1 and X2 are not independent, the model results would still be ok.

So we calculated Probability of all observations. Next question is that which ones should be
considered as anomaly? In other words, what would be the threshold, so observations with
probability less than threshold are considered anomaly? In practice, there are two approaches in
defining the threshold:

1. Assume a percentage of anomalies in your data. Based on business knowledge you
may have some idea about the percentage of observations that are anomalies; say
0.1%. In that case you can define the threshold such that 0.1% of observations will have
Prob. less than the threshold (i.e. the threshold would be the 0.1 percentile of the probabilities
in the data).
2. Many times we have responses (labels) for a few observations (or if we don’t have
the response, we can gather it for a few observations at a reasonable cost).
Therefore we can define the threshold such that most of the responses have prob.
less than threshold (and therefore would be classified as anomaly). What does most
of the observations mean? Your guess is correct. You define it just like you do it for a
supervised model. You can assume cost for FN and FP, and benefit for TN and TP, and
based on that define the threshold that maximizes expected benefit.

The most important parameter in these models is the list of features. As discussed, unsupervised
models cannot assign importance to features, and often the modeler should judge which features to
include in the model.

Parameter Tuning and Bias/Variance analysis: As mentioned, in many anomaly detection models
we have labels for a few observations. We can use these cases to validate and test the model. The
common approach is to split the data into Train/Validation/Test. Do not put any of the responses in
the train sample; put half of the responses in the validation sample, and half in the test sample.

For example, assume we are building an anomaly detection model to detect possible defects in
machinery; so anomalous machines will be scheduled for maintenance (for example if machine
has abnormally high temperature, …). Assume we have 1,000,000 observations and 20 responses
(defects). A possible sampling for this model is to do a 70/15/15 split for Train/Validation/Test. 10
of the responses would be included in the validation sample, and 10 of them in the test sample.
Model would be trained on the train sample (i.e. distributions, mean, and variance would be
estimated based on the train sample). Validation sample will be used to define the threshold, and
Test sample would be used to test what would be the performance of Model/Threshold. You are
looking for a model that has high performance on both validation and test samples.

Note, you can use the above data and Bias/Variance analysis to find the best parameters of the
model, including choice of attributes. So you may build several models with different subsets of
features to find the best combination.

3.2. Hierarchical-Based: The famous technique in this category is Isolation Forest. The idea is to
build a model like an Ensemble model, so the model is a combination of Trees. Each tree
works by randomly splitting the observations (so in contrast to supervised trees, the split is not
based on Gain, but is totally random, based on a random Feature/Split). Each tree grows
until all the observations are separated. The output of each tree for each observation is the level
where the observation was separated. The model's output would be the average output of the trees.
So for example, if the model has 4 trees, and an observation was separated at the 4th, 15th,
23rd, and 10th level in the first, second, third, and fourth trees respectively, then the model's
output for this observation would be the average of [4, 15, 23, 10], which is 13.

Anomalies are the observations that are far from other observations. So, in a random split, we
expect outliers to separate from others early in the tree. In other words, model’s output for
outliers would be small. So the lower the observation’s output, the higher the probability that
the observation is outlier.

This is a good article on Isolation Forest: [Link] 799fceacdda4

How to define the threshold, so observations with output less than the threshold are considered
anomalies? The answer is similar to what we discussed in density-based anomaly detection
(and in fact similar to the way we define thresholds in all models). Define a threshold that
maximizes expected benefits, based on the model's output, the cost of FP and FN, and the benefit
of TP and TN.

The most important parameters for these models are the list of features and the number of trees. You
can do a Train/Validation/Test split, and do Parameter Tuning just like you did for density-based
anomaly detection.

4. Recommender Systems: Recommender Systems are now everywhere, and their applications are
increasing. They are basically used to recommend items to users, like in Netflix, Amazon, YouTube, …
So the question is: based on the information that we have about the user, such as demographics,
previous purchases, previous ratings, …, what are the best items to recommend to the user?
There are generally three types of Recommender systems: Average-Based, Content-Based Filtering,
and Collaborative Filtering. In this chapter we discuss some simple approaches to recommender
systems. In future chapters we will discuss more advanced approaches to recommender systems
that use Linear models and Neural Networks.

To explain the topics, imagine we have the following data on rankings some users have assigned to
some series. Rankings are from 1 to 5, where 5 shows the highest ranking. Throughout this example,
have in mind that in reality there are thousands of movies, and millions (or billions) of ratings.

        Friends | Game of Thrones | Lost | Sex and the City | 24 | The Office | Narcos
User 1: 5  5  5  5  5  5  5
User 2: 5  3  5  4
User 3: 2  5  3
User 4: 3  4
User 5: 3
User 6: (no ratings yet)

4.1. Average-Based Recommenders: This is the simplest approach, and should be used for cases
like User 6 where no information is available. Remember the question is "Which movies should we
recommend to User 6?". Since we have no information about User 6, the best approach is
to offer them the most popular movies. Popularity can be defined as the average rating for a
movie. Therefore we will have:

Popularity Index (Model's Output):
Friends 4.333333 | Game of Thrones 3.5 | Lost 5 | Sex and the City 4 | 24 4.333333 | The Office 4.25 | Narcos 4.5

So, the ranking for User 6 will be: [Lost, Narcos, [Friends, 24], The Office, Sex and the City,
Game of Thrones]. If you are supposed to recommend two movies, it would be Lost and Narcos.

You may decide to assign a threshold to the number of reviews, because estimates based on
few data points are not reliable. For example, you may decide to exclude movies with less than 2
reviews, and so Lost would be excluded, and the ranking would be: [Narcos, [Friends, 24], The Office,
Sex and the City, Game of Thrones]. If you are supposed to recommend 3 movies, it will
be Narcos, Friends, and 24.

Note that you can also change the way you define popularity. For example, assign a weight
based on the number of reviews for the movie. The following can be the output of such a
model:

Model's Output = Popularity Index = Average rating of the movie × (# of ratings of the movie / # of total ratings)
Using this formula, popularities would be:

Popularity Index (Model's Output):
Friends 0.782407 (= 4.333333 × 13/72) | Game of Thrones 0.340278 | Lost 0.347222 | Sex and the City 0.444444 | 24 0.782407 | The Office 1.003472 | Narcos 0.5625

So, the ranking would be [The Office, [Friends, 24], Narcos, Sex and the City, Lost, Game of Thrones].

I am sorry that Game of Thrones is always at the end in these randomly assigned rankings. In
Game of Thrones, the dead can be a representation of Machine and AI. Some argue that AI will
never replace humans, and among the most important reasons are Lack of Emotional Intellect
and Empathy.

4.2. Collaborative Filtering: What about other users who have rated some movies? How can we
use this information to improve recommendation system for them? The answer is
Collaborative Filtering. In Collaborative Filtering, Popularity Index is calculated based on the
movies similar to the rated movies, or users similar to the user.

Similarity is often calculated using Cosine similarity. So for example, the similarity between Game
of Thrones and The Office is the Cosine similarity of the vectors [5, 2] and [5, 3] (their ratings from
the users who rated both). Other similarity metrics might be used as well (Euclidean, Manhattan, …)
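A quick sketch of that cosine similarity calculation (just the two rating vectors above):

# Cosine similarity between Game of Thrones [5, 2] and The Office [5, 3]
# over the users who rated both.
import numpy as np

got = np.array([5.0, 2.0])
office = np.array([5.0, 3.0])

cosine = got @ office / (np.linalg.norm(got) * np.linalg.norm(office))
print(cosine)   # ~0.99: the two rating vectors point in a very similar direction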

Using the first approach, based on similar movies, one possible formula for Popularity Index
would be (calculating Popularity Index of Sex and the City for User 4):

Popularity index of Sex and the City for User 4 = Popularity index based on the similarity between Sex and the City and The Office + Popularity index based on the similarity between Sex and the City and Friends

Popularity index based on the similarity between Sex and the City and The Office = Rate the user assigned to The Office × Similarity between Sex and the City and The Office

In the same way, the term based on the similarity between Sex and the City and Friends can be calculated.

A question for you to think about and search. How is Cosine similarity calculated when there
is Missing information – a common problem in these systems? For example, Cosine similarity
between The Office and Narcos can be calculated in a two dimensional space. Is it
comparable to similarity between The Office and Friends, which is in a 3 dimensional space?
Should we assign a weight to dimension, since higher dimension means more data, and
more reliable measure? Maybe we can use dimension as a weight, so dimension adds to
popularity index. Following is an example:

Popularity index of Sex and the City for User 4 = Popularity index based on the similarity between Sex and the City and The Office + Popularity index based on the similarity between Sex and the City and Friends

Popularity index based on the similarity between Sex and the City and The Office = Rate the user assigned to The Office × Similarity between Sex and the City and The Office × Number of dimensions used in that similarity

In the same way, the term based on the similarity between Sex and the City and Friends can be calculated.

Using the second approach, based on similar users, one possible formula for Popularity Index
would be (calculating Popularity Index of Narcos for User 3):

Popularity index of Narcos for User 3 = Popularity index based on the similarity between User 1 and User 3

Popularity index based on the similarity between User 1 and User 3 = Rate User 1 assigned to Narcos × Similarity between User 1 and User 3
Can you think of alternative ways to define Popularity Index? Also which Distance Metric is the best?
Why don’t we use Euclidean Distance instead of Cosine Similarity? Always question all the steps in a ML
model. Any step can be improved depending on the question you are answering.

For example, another approach is to find the 100 users that are most similar to user i, calculate the
average rating for the movies they have rated, and use those averages as the Popularity Index of those movies
for user i. Note this approach does not assign any ranking to many movies, but it is Computationally
Feasible, in contrast to the above approaches, which probably are not feasible on very large datasets.
Notice that in real applications, due to the very large number of users, movies, and ratings, it is not feasible to
use all the data for each user/movie rating. In the aforementioned approach, the list of the 100 most similar users to any
user can be calculated offline and stored (so there will be 100 columns in the data for each user). This
will significantly increase the speed of the recommender, because for a specific user, the system only needs
data on the 100 similar users and the movies they have rated. The list of similar users can be calculated offline, and
updated, say, once a week.

Another approach! For each movie, save a list of the 100 most similar movies (100 columns for each movie).
To each user, recommend a few movies, randomly chosen from the movies similar to the last 5 movies the user
has watched.
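A possible sketch of the offline step, assuming a user-by-movie rating matrix with missing ratings filled as 0 (a simplification) and only 5 neighbors instead of 100 to keep the toy example small:

# Precompute the most similar movies offline so the online recommender stays fast.
import numpy as np
from sklearn.neighbors import NearestNeighbors

ratings = np.random.default_rng(0).integers(0, 6, size=(1000, 200)).astype(float)  # toy users x movies

nn = NearestNeighbors(n_neighbors=6, metric="cosine").fit(ratings.T)   # one row per movie
_, idx = nn.kneighbors(ratings.T)          # for each movie: itself plus its 5 most similar movies
similar_movies = idx[:, 1:]                # drop the movie itself; store this table offline
print(similar_movies[0])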

4.3. Content-Based Filtering: What if we have some information about the users and movies?
How can we use this information to improve the recommender system? The answer is
Content-Based filtering. Similar to Collaborative Filtering, Popularity Index can be calculated
based on the movies similar to the rated movie, or users similar to the user. However, this
time the additional information can be used to calculate similarity.

For example, imagine we have the following information about the movies. Ratings are as of
November 2023.

                    IMDb Rating | Comedy | Drama | Mystery
Friends             8.9 | 1 | 0 | 0
Game of Thrones     9.2 | 0 | 1 | 0
Lost                8.3 | 0 | 0 | 1
Sex and the City    7.3 | 1 | 0 | 0
24                  8.4 | 0 | 1 | 0
The Office          9   | 1 | 0 | 0
Narcos              8.8 | 0 | 1 | 0

Using this data, the similarity between Lost and The Office would be the similarity between the vectors
[8.3, 0, 0, 1] and [9, 1, 0, 0]. We can also add data on ratings, in which case the similarity would
be based on the vectors [8.3, 0, 0, 1, 5] and [9, 1, 0, 0, 5].

Did we miss any step in the above formula? Probably yes: we need to scale the IMDb Rating and
User Rating columns. You can divide IMDb ratings by 10 and user ratings by 5.

Question for you to think and search. What if you think some columns are more important? For
example, what if you want to give double weight to user ratings? If, after scaling, you multiply the values of
this column by 2, does it help with giving a higher weight to this column? Does the answer depend
on the similarity metric? Like, maybe it affects Euclidean distance but not Cosine similarity…

Chapter 16 – Linear Models – part 2
In this chapter we will have a quick review of Econometrics and Hypothesis Testing. This is normally a
full course (or several courses), and the goal here is just an introduction. If you are interested, I
encourage you to watch some videos on Econometrics.

The focus of Econometrics is on a model's explainability; i.e. the relationship between Y and the X variables. For
this reason, Econometrics is mainly concerned with Linear models, which are the most explainable,
non-black-box models.

Note that explainability is a big concern for many of the ML applications. Regulators and senior
management (in general non-technical stakeholders) are among those who love explainable models, in
which relationship between Y and X variables is clear. SHAP analysis is a progress in this way that makes
Ensemble models more explainable. To the best of my knowledge, a similar solution for NNs does not
exist, and due to complexity of these algorithms probably will never exist.

Going back to Econometrics. What is the question that Econometrics is trying to answer? Econometrics
tries to find the real Linear relationship between Y and X; i.e. the real coefficient of an independent
variable in a linear model.

Imagine the following linear regression model, where Y is assumed to be a Linear function of X1 and X2:

Y = β0 + β1·X1 + β2·X2 + Ɛ    (1)
where Ɛ is the error term, which will be explained shortly.

This is assumed to be the Real relationship between Y and Xs. We could probably know this relationship
if we had the whole population, but we never have the full population; all we have is a sample. We don’t
know this formula; in fact it may be wrong. We have only assumed that this relationship exists, and we
will use Econometrics to make conclusions about this relationship; i.e. the values of β 1 and β 2.
Eventually we may conclude that for example β 1=0 , which means there is no Linear relationship
between X1 and Y.

We don’t know the values of β 1 and β 2, but we can have an estimate of β 1 and β 2 by running a
regression on a sample of data that we have. Imagine we run a linear regression model and get the
following equation (i.e. model minimizes the Loss function and comes up with the following coefficients):

Y^ = ^β0 + β^ 1 X 1 + ^β 2 X 2 =2.3+1.1 X 1−0.2 X 2


^β =1.1 is an estimate of β 1. But this estimate is based on a sample of data, and may not indicate the
1
real relationship between X1 and Y. For example, we may run the model on another sample, and get a
completely different value for X1, say -3.4. So how can we make conclusions about the real linear
relationship between Y and X1 (i.e. β 1). The answer is Econometrics and a method called Hypothesis
Testing.

Hypothesis Testing is basically defining a hypothesis about the value of a coefficient and test it. A
Hypothesis is composed of a Null Hypothesis (which is what we are testing), and an Alternative
hypothesis (which is the opposite of null). For example, a popular test in Econometrics is whether a
coefficient is 0? The Hypothesis to test whether β 1 is zero, would be written as following:

H0: β1 = 0
H1: β1 ≠ 0
The Null hypothesis, as indicated above, is shown by H0, and the alternative is shown by H1. Once a
hypothesis is tested (this will be explained shortly), one of two conclusions will be made:

 Either we reject the Null hypothesis
 Or we fail to reject the Null hypothesis

In the above example, if we reject the null, it means we concluded that β 1 is not zero, which means
there is a Linear relationship between Y and X1. In the language of statistics, we say X1 has a significant
relationship with Y.

On the other hand, if we fail to reject the null, conclusion would be that X1 does not have a significant
relationship with Y; i.e. X1 does not impact Y in a linear way. Note that we never accept the Null, we just
fail to reject it (kind of Econometrics jargon).

Before we talk about how to test a hypothesis, let’s see an application of Hypothesis testing. Imagine a
company intends to analyze the impact of a marketing campaign on its sales (so if there is no impact, no
reason to invest in a campaign). Company runs the campaign in some months, and the goal is to
compare sales for months with campaign versus months without campaign. Modeling team runs the
following regression model:

Sales = β0 + β1·Campaign + β·X + Ɛ

Where Sales shows dollar sales in a month, Campaign is a binary variable which is 1 if there is a campaign
in that month and 0 otherwise, and the Xs are other independent variables that may impact sales. Some
examples of these variables are month Dummies (a dummy variable for each month to capture
seasonality), some macro variables, such as the unemployment rate, to capture business cycles, the lag of sales
(sales in the previous months), and … What other features can you think of?

Modeler can run the following hypothesis test to check whether Campaign has a positive impact on
sales:

H0: β1 ≤ 0
H1: β1 > 0
If null is rejected, we conclude that campaign has significant positive impact on sales. If we fail to reject
the null, company may decide not to continue the campaign.

Note the difference between this hypothesis and the previous one. In the previous hypothesis, we tested
β 1=0 versus β 1 not zero. This is called a two-sided test, because we reject the Null if β 1 is not zero in
any direction (whether positive or negative). The second hypothesis is called a one-sided test, because
we reject the null only if β 1 is higher than 0.

Also note that 0 does not always have to be the value being tested. For example, the company may want to invest in the
campaign only if it increases sales by more than 1 unit. In that case the proper test would be:

H0: β1 ≤ 1
H1: β1 > 1

Important Section/Discussion on Random Processes.

Next question, how to test the hypothesis? To answer this question, note that ^β (estimate of β) is a
stochastic variable, and its value depends on the sample. A stochastic (random) variable is a variable
whose exact value is not clear. This term is used in contrast to a deterministic variable whose value is
clear. For example, return on Google’s stock tomorrow is stochastic, while return yesterday is
deterministic. In data science we always train a model on deterministic data (known target variable), and
will use it to make predictions about stochastic variables, a best guess for a random outcome.

A stochastic variable has a distribution. The most famous random variable is a Normal random variable; a
variable that is Normally distributed (a statistics joke says that any variable that is not Normal is
abnormal). To understand hypothesis testing, you should be able to answer the following stat questions:

1. X is Normally distributed with mean 3, and variance of 2.4. Plot the Probability Distribution
Function (PDF) and show the area between Mean ± One Standard Deviation.

Normal distributions have a bell shape. The PDF of the above distribution looks like the following picture.
The Expected Value is 3. Since the variance is 2.4, Standard Deviation = √Variance = 1.55. The points 1.45 and 4.55
bound the area between Mean ± One Standard Deviation. In a Normal distribution, 68% of observations
fall between these two points; i.e.

Probability of X between Mean ± One Standard Deviation=68 % .

2. What does "Probability of X between 1.45 and 4.55 is 68%" mean? It means that if we have 100
Xs, (about) 68 of them will have a value between these two numbers. These 100 Xs are 100
random samples (draws) from this distribution.
For example, assume an algorithmic trading bot that monitors 5000 stocks and trades
automatically. X can be the daily return on the trades suggested by this bot. If the bot makes 100
transactions, the average return among them would be 3%. Also, 68% of the time, the return is between
1.45% and 4.55%.

** If you build a bot like that, it has a 3% daily return, i.e. roughly a 90% monthly return. This is in fact
doable with small investments, say $20,000. Note that in that case you would have a 90% monthly
return, i.e. $18,000 per month (which, assuming a 35% tax rate, would be around $11,700 per
month, passive income). Of course you need to spend your nights re-training the model.

3. What is the probability that X > 5.3?

Probability of X between Mean ± Standard Deviation is famous. But what about other non-famous
probabilities?

The following shows the area of X > 5.3. Probability of X > 5.3 = the area of this region. In mathematical terms,
this area is the integral of the PDF function from this point forward.

∫ from 5.3 to ∞ of PDF(x) dx = ∫ from 5.3 to ∞ of (1 / (σ·√(2π))) · e^( −(1/2)·((x − µ)/σ)² ) dx = ∫ from 5.3 to ∞ of (1 / (1.55·√(2π))) · e^( −(1/2)·((x − 3)/1.55)² ) dx

Calculating this integral, in general, is not easy. But for one specific Normal distribution, the probabilities for many
values are pre-calculated: the Normal Distribution with Mean = 0 and Standard Deviation = 1, called the
Standard Normal Distribution. These calculations are available by searching for a Standard Normal
Distribution table. Understanding these tables is crucial for Hypothesis testing.

For example, we want to calculate the above probability:

Probability of X > 5.3, where X ~ N(3, 1.55)

Using two formulas from Probability theory, we can convert X to a Standard Normal distribution, and
look up the above probability in the table. Assume A is a constant (deterministic variable), and X is a
random variable with Expected Value = µ and Standard Deviation = σ, then we have:

Expected Value of (X − A) = (Expected Value of X) − A = µ − A

Standard Deviation of (X / A) = (Standard Deviation of X) / A

Using the first formula we can see that for the above X:
Expected Value of (X − µ) = (Expected Value of X) − µ = µ − µ = 0
Which means that if we deduct the mean of a random variable from the variable, the resulting random
variable has a mean of zero. For example, in the algorithmic trading bot, assume 3% of each transaction
should be paid as a transaction fee. Therefore the return on each transaction would be distributed as X − 3.
Since the expected return of X is 3, the expected return of each transaction (net of fees) would be 3 − 3 = 0.

Transaction fees are killer.

Using the second formula we see that for the above X:

Standard Deviation of (X / σ) = (Standard Deviation of X) / σ = σ / σ = 1
Which means that if you divide a random variable by its standard deviation, the resulting random
variable has a standard deviation equal to 1.

Combine both formulas and you will see that if X ~ N(µ, σ) then (X − µ)/σ ~ N(0, 1);
i.e. (X − µ)/σ has a Standard Normal distribution. The Standard Normal distribution is often shown as Z, and is also
called the Z distribution.

We can use this fact, to calculate the above probability:

Probability of (X > 5.3) = Prob.((X − µ)/σ > (5.3 − µ)/σ) = Prob.(Z > (5.3 − 3)/1.55) = Prob.(Z > 1.48)

Where Z denotes the Standard Normal distribution. Now we go to the Z table. The following shows the Z table. This
table shows the probability of the dashed area; i.e. the probability that Z is less than some value (called the
Cumulative Probability). We can find Probability(Z < 1.48) and calculate
Probability(Z > 1.48) = 1 − Probability(Z < 1.48).
To find Probability(Z < 1.48), find the point where the combination of row and column becomes 1.48. This
is the red rectangle (row 1.4, column 0.08). So Probability(Z < 1.48) = 0.9306, and therefore
Probability(Z > 1.48) = 1 − Probability(Z < 1.48) = 1 − 0.9306 = 0.0694.
This means that if we have 10,000 Xs, (about) 694 of them will be higher than 5.3.

Let’s get back to hypothesis testing. We said ^β is stochastic, which means the exact value of ^β is not
clear, and depends on the sample. If we train the model on several samples, we would get different
values for ^β , so we can get an idea of distribution for ^β . Once we get the distribution of ^β we can use
Expected Value of that distribution as β (actual value of coefficient), and we are done with the test. For
example, in the above marketing example, we may run the model on 100 train samples, and conclude
that ^β is Normally distributed with mean = 2, and that can be used as a good approximation of actual β .

This approach sounds good, except that there is a small problem: we do not have many samples to
train the model on. In fact, as we will mention, Econometrics is mostly useful when we don't have a lot of
data to train the model. So we probably have a small sample, and only one estimate of ^β. How can we
use this estimate to make conclusions about the actual β?

Under specific assumptions (discussed later) on model 1 (which shows the actual relationship between Y
and Xs), ^β is Normally distributed with mean = β , and a variance that can be estimated using the
sample. We will explain the variance later. For now, the important idea is that Expected Value of ^β is the
unknown β .

The idea behind hypothesis testing is that, if we assume the Null hypothesis is correct, then we have the
expected value of ^β (which is the value of β under null hypothesis). For example, for the following
hypothesis, if the null is correct then ^β ~ N(0, Var(^β)):

H0: β = 0
H1: β ≠ 0
Also we have one sample from this Normal distribution, which is the coefficient of the linear model.
Note that this coefficient is calculated based on the sample, by minimizing the Loss function. It is a
random draw from the ^β distribution.

For the sake of notation we show the coefficient of linear model by b . We reject the null hypothesis if b
is far from Expected Value of ^β under Null. The idea is that if we randomly generate a number from a
Normal distribution, it is very unlikely to see a value far from the mean. So if based on the model we
have got a value far from the null, then we reject the null.

For example, assume in the above hypothesis, the calculated coefficient (b) is 5.3. If the null is correct then the
expected value of ^β is 0. Assume the variance of ^β is 1. For a Normal Distribution, 99.7% of observations fall in
the interval of Mean ± 3 Standard Deviations. So if the Null is correct, 99.7% of the observations fall in the
interval [-3, 3]. However, the estimate from the model is 5.3. If the null is correct, it is very unlikely that b
turns out to be 5.3. So we reject the null hypothesis that β is zero.

How far is far? In other words, what would be the value of b at which we reject the null. We reject the
null if b falls in a rejection area, defined by the modeler (a model’s parameter). For the above
hypothesis, the rejection area would look like the following:

If b falls in either of these two intervals, it would be considered too far, and the null would be rejected.
The modeler sets these thresholds by defining the probability of the Reject area; i.e.

Prob. of (^β < −a or ^β > a) = user defined value


This user defined value is called Level of Significance, and shown by alpha (α). Common values for level
of significance are 5% and 1%. To test a hypothesis, find the thresholds (a and -a). If b<-a or b>a then
reject the null, otherwise fail to reject the null.

Let’s review everything with an example. Assume we have built the following Income model.
^Income = 2500 + 35000 × Has Bachelor + 2.5 × Age
We intend to test whether the coefficient of Age is significant at 5% level of significance (i.e. whether at
5% level, we reject the null). So testing the following hypothesis:

H0: Coefficient of Age = 0
H1: Coefficient of Age ≠ 0
Assume the standard deviation of ^β (the coefficient of Age) is 1.3. Therefore, if the null is correct, ^β ~ N(0, 1.3). To
find the thresholds:

Prob.(^β < −a) + Prob.(^β > a) = 5%

Due to the symmetry of the Normal distribution, Prob.(^β < −a) = Prob.(^β > a)

Therefore: 2 × Prob.(^β > a) = 5%

Therefore: Prob.(^β > a) = 2.5%

To find the threshold a, we will convert ^β to a Standard Normal distribution, and use the table. Note that
under the null hypothesis, ^β ~ N(0, 1.3).

Prob.(^β > a) = Prob.((^β − 0)/1.3 > (a − 0)/1.3) = Prob.(Z > a/1.3) = 2.5%

From the Z table, we can find the point where the probability of being higher than it is 0.025. It is the point the
rectangle shows; i.e. 1.96. Therefore a/1.3 = 1.96 ⇒ a = 2.548. The reject regions for this test would be
> 2.548 and < −2.548. From the model, the value of the coefficient is 2.5. It is not in the reject area, so we fail to
reject the null. It means the coefficient of Age is not statistically significant at the 5% level of significance. So
Age does not have a significant impact on income.
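The same threshold can be checked with scipy (a sketch of the calculation above):

# Two-sided 5% rejection threshold for a Normal with standard deviation 1.3.
from scipy.stats import norm

z = norm.ppf(1 - 0.05 / 2)      # 1.96
a = z * 1.3                     # 2.548: reject if the estimated coefficient is beyond +/- 2.548
print(z, a, abs(2.5) > a)       # 2.5 is inside [-2.548, 2.548] -> fail to reject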

We can find a general formula for the reject threshold. Assume we want to test the following hypothesis
at α% level of significance:

H0: β = µ
H1: β ≠ µ

Where µ is a constant. Assume the standard deviation of ^β is σ. Under the null hypothesis, ^β ~ N(µ, σ). The reject areas
would look like the following:

To find the thresholds: Prob.(^β > µ + a) = α/2 ⇒ Prob.((^β − µ)/σ > (µ + a − µ)/σ) = α/2 ⇒ Prob.(Z > a/σ) = α/2

Show Z_(α/2) as the threshold where Prob.(Z > Z_(α/2)) = α/2. For example, in the above example where alpha
was 5%, Z_(α/2) = Z_2.5% = 1.96.

Therefore we can write: Prob.(Z > a/σ) = α/2 ⇒ a/σ = Z_(α/2) ⇒ a = Z_(α/2) × σ.

Therefore, the rejection thresholds are µ + a = µ + Z_(α/2) × σ and µ − a = µ − Z_(α/2) × σ. We reject the null if the
calculated coefficient from the linear model (b) falls in the above reject region; i.e. we reject the null if:

 b > µ + Z_(α/2) × σ ⇒ (b − µ)/σ > Z_(α/2), or
 b < µ − Z_(α/2) × σ ⇒ (b − µ)/σ < −Z_(α/2)

i.e. we fail to reject the null if (b − µ)/σ falls in the interval [−Z_(α/2), Z_(α/2)], and reject otherwise. (b − µ)/σ is called
the Test Statistic. [−Z_(α/2), Z_(α/2)] is called the Confidence Interval.

Example. From the regression, ^β1 = −3.5 and Standard Deviation(^β1) = 2.2. Test the following hypothesis at
1% level of significance:

H0: β1 = −1
H1: β1 ≠ −1

We calculate the test statistic, and check if it falls in the interval [−Z_(α/2), Z_(α/2)].

Test Statistic = (b − µ)/σ = (−3.5 − (−1))/2.2 = −1.14

Z_(α/2) = Z_(0.01/2) = Z_0.005 = 2.58

Since the test stat falls in the confidence interval of [-2.58, 2.58], we fail to reject the null; i.e. we fail to reject that
β1 = −1.

So far we assumed that we have Standard Deviation(^β). In reality we don't have it; rather, we
estimate the Standard Deviation of ^β using the sample. Call this estimate the Standard Error of ^β = SE(^β). We
mentioned that (^β − β)/STD(^β) ~ N(0, 1), called Z, and we used this fact to find reject areas based on a
hypothesis and level of significance. Since we don't have STD(^β), in practice we use its estimate:
SE(^β). By using that, the above variable will not have a Z distribution; rather it will have a t-distribution;
i.e.

(^β − β)/SE(^β) ~ t_k
Where k is called the degrees of freedom for the t distribution. The t distribution is similar to the Z distribution but it
has fatter tails, like the following.

So what happens to the hypothesis test? We calculate the test stat as (b − µ)/SE(^β), and our confidence
interval changes from [−Z_(α/2), Z_(α/2)] to [−t_(k,α/2), t_(k,α/2)]. The t values can be calculated from a t table. As mentioned,
k is the degrees of freedom and is equal to n − 2, where n is the number of observations.
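A quick sketch of looking up t critical values with scipy instead of a printed table (n = 30 here just as an illustration):

# df = n - 2 as described above.
from scipy.stats import t

n, alpha = 30, 0.05
df = n - 2
print(t.ppf(1 - alpha / 2, df))   # ~2.048: two-sided 5% threshold with 28 degrees of freedom
print(t.ppf(1 - alpha, df))       # ~1.701: one-sided 5% threshold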

Example: We have built the following regression model that shows return on a fund that you manage, as
a function of return on market portfolio (S&P 500).

^(Return on your Fund) = ^α + ^β × Return on S&P 500 = 0.5 + 1.5 × Return on S&P 500
Asset pricing theory says that Market Portfolio can explain all the return on any portfolio, and so
intercept in the above formula should not be different from 0 (should be statistically insignificant). So we
want to test the following hypothesis:

H0: α = 0
H1: α ≠ 0

^ , from the data, is 0.75. Test the significance of α at 5% level of significance. Model is
Standard Error of α
based on 30 observations.

Test Statistic = (^α − µ)/SE(^α) = (0.5 − 0)/0.75 = 0.67

Next we find t_(k,α/2) = t_(28, 0.025). Based on the following t table, the value is 2.048. Since the test stat = 0.67 falls in
the confidence interval of [-2.048, 2.048], we fail to reject the null; i.e. the intercept is not statistically
different from 0, which is consistent with Finance theory.

Note: α in the language of investment is called Risk-Adjusted return. If you can come up with a trading
strategy, and show that α for your portfolio is significant (i.e. reject the null of α =0 ), you and a few
generations after you, do not need to work anymore, unless you like to!

Note the term Confidence Level. Confidence Level is the probability of Confidence interval, which is 1-
Level of Significance.

Also note the Z table that we previously used, showed the area to the left of any number (the dashed
area on top of the table), but the above t-table shows the area to the right of any value. There are many
versions of these tables. When using a table, pay attention to the area that it shows. For example, the
following Z table, shows the area between any threshold and 0.

In the above sections, we were mainly talking about the two sided tests; i.e.

H0: β = µ
H1: β ≠ µ

As we mentioned, in a two-sided test the null is rejected if the estimated coefficient is far from the null value, in either
direction. We got the test stat as (b − µ)/SE(^β), and a confidence interval of [−t_(k,α/2), t_(k,α/2)]. The confidence interval
would look like this:

For one-sided tests, the idea is the same; i.e. we reject the null if the test stat falls in the reject region (out of the
confidence interval). Also the test stat is defined the same way: (b − µ)/SE(^β). But the confidence intervals would
change as follows.

For
H0: β ≥ µ
H1: β < µ
We reject the null if the estimated coefficient (b) is much smaller than µ. The confidence interval would
be [−t_(k,α), ∞], and would look like the following:

For
H0: β ≤ µ
H1: β > µ
We reject the null if the estimated coefficient (b) is much larger than µ. The confidence interval would
be [−∞, t_(k,α)], and would look like the following:

Example: In the marketing campaign example (explained above), modeling team builds the following
model.

^Sales = ^β0 + ^β1·Campaign + ^β·X = 2.3 + 1.33·Campaign + ^β·X
Modeling team wants to test whether campaign has significant positive impact on sales. So we test the
following hypothesis:

H0: β1 ≤ 0
H1: β1 > 0

If we reject the null, it means β 1(impact of campaign) is significantly higher than 0, and we should
continue with the campaign. Otherwise we fail to reject the null; i.e. we fail to conclude that campaign
impact is significantly higher than 0, and we may decide to discontinue the campaign.

Test the above hypothesis at 5% level of significance. Assume the Standard Error of ^β1 is 0.68. There are
10,000 observations in the sample.

Test Stat = (1.33 − 0)/0.68 = 1.96

Confidence Interval = [−∞, t_(k,α)] = [−∞, t_(9998, 0.05)] = [−∞, 1.645] (see the following table). Since the test stat does
not fall in the confidence interval, we reject the null hypothesis at 5% level of significance. So the
campaign has a positive impact on sales.

Type I Error: Type I Error refers to the probability of rejecting the null when the null is correct. Note that if
the null is correct (i.e. the actual β equals the value in the null hypothesis, call it µ), the probability distribution of ^β
would be N(µ, STD(^β)). The estimated coefficient that we get from the model (train sample) is a
random draw from this distribution. What is the probability that this random draw falls in the reject
region (out of the confidence interval)? The probability is α, the level of significance (verify this as an assignment).
So even if the Null is correct, there is an α% chance that we reject the null. For example, if we test whether a
feature has a significant impact on the target (i.e. H0: β = 0) at 5% level of significance, there is a 5% chance
that we reject the null, even if the feature does not have a significant impact.

Type II Error: Type II Error is the probability that we fail to reject a null, even if it is not correct (like a
coefficient is actually different from 0, but we fail to reject H0: β = 0). Power of a Test = 1 − Type II Error.

Increasing the level of significance increases the chance of Type I error, and decreases the chance of Type II
error (increases the test's power). The opposite happens for a lower level of significance. The only way to
decrease both errors is to add more, high-quality data.

In most models we are more concerned with Type I error. That is why we often choose low values for the level of significance; common values are 1%, 5%, and 10%. In a significance test (i.e. H_0: β = 0), if the null is rejected at 5%, we say β is significant. If the null is rejected at 1%, we say β is highly significant.

P-Value: P-Value is an important concept. ML packages often report P-Values, and the modeler can use them to test a hypothesis. The P-Value is the lowest level of significance at which we reject the null.

Note that as we decrease the level of significance, the confidence interval widens, which means there is a higher chance that the test stat falls inside it, i.e. that we fail to reject the null. So as we decrease the level of significance, there is a lower chance of rejecting the null. If the P-Value for a coefficient is 7%, it means the lowest level of significance at which we reject the null is 7%: at the 10% level we reject the null, but at 5% we fail to reject it. A variable with a P-Value of 7% is significant at 10% and insignificant at 5%.
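
For instance, the P-Value of the one-sided campaign test above can be computed directly from the test statistic. A small sketch; the scipy usage and degrees of freedom are my assumptions:

from scipy import stats

t_stat, df = 1.96, 9_998
p_one_sided = 1 - stats.t.cdf(t_stat, df)              # area to the right of the test stat
p_two_sided = 2 * (1 - stats.t.cdf(abs(t_stat), df))   # both tails
print(f"one-sided p-value = {p_one_sided:.3f}")        # about 0.025: significant at 5%, not at 1%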

Following is a graphical representation of the P-Value, where the t-stat, as always, is (b − µ) / SE(β̂):

Example: In the following classification model, the P-Value of β_0 is 0.086 and the P-Value of β_1 is 0.0001. What is the probability of response for an observation with X_1 = −1.1? Use only coefficients that are significant at the 5% level of significance.

Logit(Ŷ) = β̂_0 + β̂_1 X_1 = −3.3 + 1.2 X_1

Since the P-Value of β_0 is more than 5%, β_0 is not significant at 5%. So:

Prob. of Response = e^(1.2 × (−1.1)) / (1 + e^(1.2 × (−1.1))) ≈ 21%
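
The 21% figure can be verified with a couple of lines of Python (a minimal sketch):

import math

logit = 1.2 * (-1.1)                             # intercept dropped because it is not significant at 5%
prob = math.exp(logit) / (1 + math.exp(logit))
print(f"Probability of response = {prob:.1%}")   # about 21.1%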

Standard Error of Estimate: What is the impact of the standard error of the estimate, SE(β̂), on the significance of a coefficient? A higher SE means a smaller test stat in absolute value, which increases the chance that we fail to reject the null. In other words, the higher the SE, the higher the chance that we fail to reject the null (and so the lower the chance that the attribute will be significant).

Size of Data: As the size of the data increases, the standard error of the estimate decreases. In the limit, the standard error converges to 0, which means β̂ has a very low standard deviation, and the estimated coefficient from the model is almost the same as the actual β. So hypothesis testing is very easy if we have a lot of data, because we can just check the estimated coefficients; they essentially are the actual βs.

The interpretation is that as the train sample becomes larger, it becomes more similar to the population. As a result, the estimated coefficients become closer to the actual coefficients from the population. For this reason, econometrics is mainly useful for small datasets. What is small, what is big? There is really no definition for that. In traditional statistics, sometimes more than 30 observations is called big, but that is not really adequate. One thing is for sure: the more complex the relationship between X and Y, the more data is needed to be considered large.

Note that even though these days there is a lot of data available for many applications, there are several important industry areas for which not much data is available. One example is when you want to model the behavior of a portfolio (a portfolio of customers, products, …). In these cases, because you look at an aggregate portfolio (rather than individual cases), there are not many observations. An example is the Loan Portfolio in banks (the so-called CCAR models). The goal of these models is to predict losses on a portfolio of loans. A portfolio can be defined as all the loans booked in one month, so if you look at the last 20 years you will have 20 × 12 = 240 observations (one observation per month). To make conclusions about the coefficients of this model (to figure out which factors significantly impact losses on a portfolio of loans), we need to do hypothesis testing.

Another example is the marketing campaign model mentioned previously, where observations are also at the month level.

This was Hypothesis Testing, an important topic in Econometrics. But what is the rest of Econometrics?

At the beginning of the Hypothesis Testing section, we made an assumption: β̂ is Normally distributed with mean equal to β, the actual value of the coefficient, and a standard deviation that we calculated from the data. That is how we performed the z-test and the t-test. So if this assumption is not correct, all the hypothesis testing might be wrong, and hence the conclusions about the relationship between Y and X.

How do we know if the assumption about the distribution of β̂ is correct? This assumption is based on some more fundamental assumptions. Econometrics studies those fundamentals and explains what to do if they are violated. For example, Time Series modeling is a solution to the violation of one of these assumptions.

Assumptions:

We started the chapter with the following model, which shows the relationship between Y and the Xs. Before we talk about the assumptions of linear models, let's get to know this model, and especially the Error (Ɛ), better.

Y = β_0 + β_1 X_1 + β_2 X_2 + Ɛ   (1)

This model gives the value of Y conditional on the Xs. For example, assume the following is the relationship between Income and two independent variables, HasBachelor and Age:

Income = 50000 + 45000 × HasBachelor + 1500 × Age + Ɛ


For example, the model says that Income for someone who has a Bachelor's degree and is 35 years old is 147,500 + Error. What is the error? It is the combination of all other factors that affect someone's income, other than "Having a Bachelor's Degree" and "Age". Examples of these factors are "Industry", "Years of Experience", and possibly other functions of attributes already in the model, like Age². In other words, the model shows the Linear relationship between Income and the two features; all other factors that affect Income are in the Error.

Note that the Error is Stochastic (Random); i.e. we don't know its exact value. If we did, we would know the exact income of everyone. Since the Error is Stochastic, it has a distribution, a Mean, and a standard deviation. The assumptions of Linear models that are necessary to perform correct hypothesis testing are about the distribution of the Error. The assumptions are as follows:

1. Expected Value(Ɛ) = 0
2. Variance(Ɛ) is constant
3. Covariance(Ɛ_i, Ɛ_j) = 0
4. Covariance(Ɛ, X) = 0
5. Ɛ ~ N(0, σ²)

Next we discuss each assumption in detail. But before that, what is the point of the above assumptions? Let's review what we are doing. We want to analyze the relationship between some X and Y. To do so, we run a linear model on the sample we have, and we get β̂, the estimate of β (based on our sample), by minimizing the MSE loss function. This model is called OLS, or Ordinary Least Squares. It is called OLS because we just minimize the MSE, which is the squared error, with no changes to the model (we will discuss some possible changes later).
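
In Python, the OLS coefficients and the hypothesis tests discussed above are usually obtained together. The following is a minimal sketch using statsmodels on simulated data; the column names and the simulated coefficients follow the income example but are otherwise my assumptions.

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1_000
df = pd.DataFrame({
    "HasBachelor": rng.integers(0, 2, n),
    "Age": rng.integers(22, 65, n),
})
# simulate a target roughly consistent with the income example (the noise level is made up)
df["Income"] = 50_000 + 45_000 * df["HasBachelor"] + 1_500 * df["Age"] + rng.normal(0, 20_000, n)

X = sm.add_constant(df[["HasBachelor", "Age"]])   # adds the intercept (bias) column
ols = sm.OLS(df["Income"], X).fit()
print(ols.summary())   # coefficients, standard errors, t-stats, P-Values, confidence intervals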

If the above assumptions hold, then the β̂s of OLS will have the following characteristics.

1. Consistency; i.e. lim_{n→∞} Probability(|β̂ − β| > ε) = 0 for any ε > 0, where n is the number of observations.
In simple terms, consistency means that if we have many observations (a large train sample), then β̂ will be an exact estimate of β. How many observations is "many"? It depends on the complexity of the relationship between X and Y, but many of today's samples have enough data to qualify.
Note how strong this conclusion is: it says that if the above assumptions hold, the OLS estimator recovers the exact relationship between X and Y.

2. Unbiasedness; i.e. E(β̂) = β.
Note this is an important property for hypothesis testing; in fact it is an assumption we made when we set up the test statistic. If it is violated, the hypothesis test is not valid anymore. On the other hand, if the above assumptions hold, the OLS estimator will be unbiased.

3. Efficiency. This property means no other linear unbiased estimator has a lower Standard Deviation (than the OLS estimator). Remember β̂ is stochastic, and its value depends on the sample. Ideally we would like it to have a small standard deviation, because then it gives a closer estimate of β (it has lower variation around β). If the above assumptions hold, the OLS estimator has the lowest standard deviation among such estimators.
Question: What would be the impact of an increase in the standard deviation of β̂ on the hypothesis test?

So, if the above assumptions hold, the OLS estimator will be Consistent, Unbiased, and Efficient. The above assumptions are also called the assumptions of OLS, and if they hold, the OLS estimator is BLUE (Best Linear Unbiased Estimator).
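
Consistency and unbiasedness are easy to see in a small simulation: as the sample grows, the estimated slope settles on the true value. A sketch (the true coefficients and noise level are made up):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
beta0, beta1 = 2.0, 3.0                     # "true" coefficients, chosen for illustration
for n in (50, 500, 5_000, 50_000):
    x = rng.normal(size=n)
    y = beta0 + beta1 * x + rng.normal(scale=2.0, size=n)
    b = sm.OLS(y, sm.add_constant(x)).fit().params
    print(f"n={n:>6}:  b0={b[0]:.3f}  b1={b[1]:.3f}")
# the estimates move closer and closer to (2.0, 3.0) as n increases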

Next, we analyze each of these assumptions, what happens if they are violated, and what to do to solve
them.

Assumption 1 - Expected Value(Ɛ) = 0

The first assumption says the Expected Value of the Error is 0; the Error is stochastic, it has a distribution, and the Mean of that distribution is 0. In the above income model, for the person with a bachelor's degree who is 35 years old, Income is 147,500 + Error. The person might be in a low-paying industry with an income of, say, 85,000; in that case the error is −62,500. Or the person might be in a high-paying industry with an income of, say, 250,000; in that case the error is 102,500. In the same way, there is an error for any other individual, but the Expected Value of the error is 0; i.e. for the above person, the actual income is equally likely to be higher or lower than 147,500 (depending on other, unknown factors that impact income).

What happens if the assumption is violated? If this assumption is violated, then the OLS estimator will be biased, i.e. E(β̂) ≠ β, and this makes the whole hypothesis test invalid. Why?

This assumption always holds if we have an intercept. In fact, the intercept can be interpreted as the Expected Value of the Error, so Error (the impact of unknown features) = Intercept + Ɛ, where E(Ɛ) = 0. That is why the Intercept is also called the Bias: it captures the expected value of every feature that is not included in the model.

Important Point: In equation 1, the only stochastic term on the right-hand side is Ɛ. So Y is the sum of a deterministic term (βX) and a stochastic term (Ɛ). Therefore Y has the same shape of distribution and the same standard deviation as Ɛ; the only difference is in the Expected Value. If OLS is Unbiased, then the Expected Value of Ɛ is 0, and the Expected Value of Y is βX.

Assumption 2 - Variance(Ɛ) is constant. This assumption says that the variance of the error (and therefore of Y) is constant across all observations. In the above example, it means the variance of the error is a constant, independent of Age or Education. In contrast, we may observe that the variance of income is higher for older people; i.e. we may see more variation in Income for people with higher age than for younger people. If the variance of the error is not constant, the model is called Heteroskedastic.

Another example is in asset pricing. Imagine the following model that defines the return at time t as a function of the return at time t−1:

r_t = β_0 + β_1 × r_{t−1} + Ɛ

It is a well-known fact that the volatility of prices (and returns) goes up when returns are very high or very low. That means if r_{t−1} is very high or low, the volatility of r_t is higher than when r_{t−1} is close to zero. Therefore the above model is Heteroskedastic.

If the model is Heteroskedastic, it is still Unbiased and Consistent, but it is not Efficient anymore. Often the variance of the intercept is very high, while the variance of the slopes is small. How does a small variance of the slopes affect a hypothesis test about the significance of a coefficient? How does it impact the power of the test?

One way to solve for Heteroskedasticity is to use Generalized Least Squares (GLS) or Weighted Least Squares (WLS). In this approach, observations are rescaled by the cause of the Heteroskedasticity. For example, in the income model, assume the variance of the error grows with the square of age, i.e. Variance(Ɛ) = α × Age², where α is a constant. Then Ɛ/Age has constant variance, so we divide the whole model by Age:

Income = β_0 + β_1 × Age + β_2 × HasBachelor + Ɛ
⇒ Income/Age = β_0 × (1/Age) + β_1 + β_2 × (HasBachelor/Age) + Ɛ/Age

In other words, we divide all columns by Age, so we will have four columns: Income/Age, 1/Age, HasBachelor/Age, and Age/Age = 1. Now we run OLS on this data; this OLS model is not heteroskedastic anymore. Note that in this model β_1 becomes the intercept.

It is often not easy to find the exact source of heteroskedasticity.
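
In statsmodels, WLS takes weights proportional to the inverse of the error variance. A minimal sketch under the assumption Variance(Ɛ) = α × Age², using the income data frame from the earlier sketch (the data frame and column names are my assumptions):

import statsmodels.api as sm

# df is assumed to contain Income, Age, and HasBachelor columns
X = sm.add_constant(df[["Age", "HasBachelor"]])
# if Var(error_i) is proportional to Age_i**2, the WLS weights are 1 / Age_i**2
wls = sm.WLS(df["Income"], X, weights=1.0 / df["Age"] ** 2).fit()
print(wls.summary())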

Assumption 3 - Covariance(Ɛ_i, Ɛ_j) = 0. This assumption says there is no linear relationship between the errors of different observations; i.e. the errors are not correlated. This assumption is often violated in Time Series models, where observations are correlated. For example, in asset pricing, if there is Momentum in prices, there is positive correlation between consecutive prices (and returns, and errors), and if there is Mean-Reversion, there is negative correlation between consecutive prices (and returns, and errors). In both cases the errors are correlated and assumption 3 is violated. When this assumption is violated and the errors are correlated, the model is called Autocorrelated.

What happens if the model (data) is autocorrelated? The impact is similar to Heteroskedasticity: the coefficients will be Unbiased and Consistent, but not Efficient. With positive autocorrelation, the variance of the estimates will be underestimated; with negative autocorrelation, it will be overestimated.

What is the impact of positive or negative autocorrelation on the power of the test and on hypothesis testing?

How do we know if the model is autocorrelated? One way is to plot the data and see if there is a trend (correlation between observations). There are also formal tests to detect autocorrelation; the most famous ones are Durbin-Watson and Breusch-Godfrey.
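
Both tests are available in statsmodels. A small sketch, assuming ols is a fitted OLS results object (as in the earlier sketch); Durbin-Watson values near 2 suggest no first-order autocorrelation, values well below 2 suggest positive autocorrelation, and values well above 2 suggest negative autocorrelation.

from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import acorr_breusch_godfrey

dw = durbin_watson(ols.resid)                       # computed on the model residuals
lm_stat, lm_pvalue, f_stat, f_pvalue = acorr_breusch_godfrey(ols, nlags=1)
print(f"Durbin-Watson = {dw:.2f}, Breusch-Godfrey p-value = {lm_pvalue:.3f}")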

How to solve for autocorrelation? Here we mention three ways:

1. Use GLS: In this approach you assume a functional form between the errors, and use GLS to solve the model. Assume we have the following time series model:

Y_t = β_0 + β_1 X_t + Ɛ_t

In the same way, at time t−1 we have:

Y_{t−1} = β_0 + β_1 X_{t−1} + Ɛ_{t−1}

Since the model is autocorrelated, there is a linear relationship between the errors:

Ɛ_t = ρ Ɛ_{t−1} + u_t

We can therefore write:

Y_t − ρ Y_{t−1} = β_0 + β_1 X_t + Ɛ_t − ρ (β_0 + β_1 X_{t−1} + Ɛ_{t−1}) = β_0 (1 − ρ) + β_1 (X_t − ρ X_{t−1}) + u_t

So, in the data, we replace Y_t by Y_t − ρ Y_{t−1} and X_t by X_t − ρ X_{t−1}, and solve the model with OLS. This OLS model is not autocorrelated, as u_t and u_{t−1} are not correlated.

Of course this method requires knowledge of ρ (the autocorrelation factor).

2. Use differencing: Another way to solve for autocorrelation is to use the change in Y instead of Y. Often, although Y is autocorrelated, the change in Y is not; for example, while the price of an asset is autocorrelated, the change in price may not be. So if we have:

Y_t = β_0 + β_1 X_t + Ɛ_t
Y_{t−1} = β_0 + β_1 X_{t−1} + Ɛ_{t−1}

we can write:

Y_t − Y_{t−1} = β_1 (X_t − X_{t−1}) + Ɛ_t − Ɛ_{t−1} ⇒ ΔY_t = β_1 ΔX_t + u_t

(note that the intercept β_0 cancels out when we take the difference).

In the above formula, we have implicitly assumed there is unit correlation between Ɛ_t and Ɛ_{t−1}; i.e.

Ɛ_t = Ɛ_{t−1} + u_t

Therefore, this approach is basically the same as the first approach with ρ = 1 (a small sketch of both transformations appears after this list).
3. Use dynamic models: In this approach, we use Lags of Y_t in the model, so the model looks like the following:

Y_t = β_0 + β_1 X_t + β_2 Y_{t−1} + β_3 Y_{t−2} + … + Ɛ_t

But didn't we use Lags of Y in the previous approaches? We did, but we made an important assumption: when estimating Y_t, the values of Y_{t−1}, Y_{t−2}, … are known. In a dynamic model, the assumption is that the values of the Lags are not known; in other words, the Lags are stochastic, so some of the features are stochastic. For example, assume we want to predict the price over each of the next n days, day by day, using the formula:

P_t = β_0 + β_1 X_t + β_2 P_{t−1} + β_3 P_{t−2} + Ɛ_t

Say for day 14 we have:

P_14 = β_0 + β_1 X_14 + β_2 P_13 + β_3 P_12 + Ɛ_14

and P_13 and P_12 are not known as of now (they can be estimated using the same formula and previous data). In fact, X_14 may not be known either.

A complete analysis of this approach belongs to Time Series analysis and is beyond the scope of our course.
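
The transformations in approaches 1 and 2 can be written in a few lines of pandas. A sketch; the DataFrame ts, its column names, and the assumed value of ρ are mine:

import pandas as pd

# ts is assumed to be a DataFrame with columns "Y" and "X", ordered by time
def quasi_difference(ts: pd.DataFrame, rho: float) -> pd.DataFrame:
    """Approach 1: replace Y_t by Y_t - rho*Y_{t-1} and X_t by X_t - rho*X_{t-1}."""
    out = pd.DataFrame({
        "Y_star": ts["Y"] - rho * ts["Y"].shift(1),
        "X_star": ts["X"] - rho * ts["X"].shift(1),
    })
    return out.dropna()          # the first row has no lag

transformed = quasi_difference(ts, rho=0.7)    # approach 1, with an assumed (or separately estimated) rho
differenced = quasi_difference(ts, rho=1.0)    # approach 2: plain first differencing
# run OLS of Y_star on X_star; with rho = 1 the original intercept drops out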

Assumption 4 - Covariance(Ɛ, X) = 0. This assumption means there is no linear relationship between the Xs and the Error. It is always correct if the Xs are not stochastic (i.e. the exact values of the Xs are known when predicting Y). In fact, the assumption of non-stochastic Xs is stronger than zero correlation.

Note that in almost all the models we have discussed so far (including all previous chapters), we have made this assumption that the exact values of the features are known when predicting Y; i.e. the features are non-stochastic. An exception is the dynamic model mentioned above.

If assumption 4 is not correct, the coefficients are biased, and they are not consistent either; so even with a large dataset, we will not get the correct coefficients.

As mentioned, as long as the Xs are not stochastic, this assumption is not an issue.

Assumption 5 - Ɛ ~ N(0, σ²). As you know, this is another crucial assumption for hypothesis testing, making it possible to assume a t-distribution for the coefficients. Violation of this assumption does not affect the Consistency, Unbiasedness, or Efficiency of the coefficients, but the t-test cannot be used.

If the sample is large enough, this assumption is not a problem, because by the Central Limit Theorem the distribution of the estimated coefficients is approximately normal anyway.

Multi-Collinearity: One of the implicit assumptions of Linear models is that the features are linearly independent of each other (i.e. the correlation between any two independent variables is 0). Another term for linear independence is that the features are orthogonal to each other. In reality this does not happen, and features are always at least weakly correlated. Weak correlation is fine, but a problem arises when some features are highly correlated. As a rule of thumb, a correlation higher than 0.8 is considered high. This issue is called multi-collinearity. There are two main problems with multi-collinearity:

1. The coefficients of multi-collinear features are sometimes unstable and change a lot from sample to sample (so if you train the model on a different sample, you get totally different coefficients).
2. The Standard Error of collinear coefficients is higher than it should be, which results in lower test statistics and a lower probability of significance (a lower chance of rejecting the null hypothesis).

To solve for multi-collinearity, you can remove the feature with the higher P-Value (the less significant one of the correlated pair). Another solution is to collect more data. As mentioned, with more data the SE of the coefficients will be lower, and in the limit multi-collinearity is no longer a problem.
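
Besides looking at pairwise correlations, a widely used diagnostic for multi-collinearity (not covered above) is the Variance Inflation Factor (VIF). A sketch using statsmodels, where X is assumed to be a feature DataFrame that already includes the constant column:

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# one VIF per column of X; a common rule of thumb flags values above roughly 5-10
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)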

What is next? The science of Econometrics is the science of analyzing the above assumptions and solving any possible issues. The goal is to perform accurate hypothesis tests to analyze the relationship between X and Y. Time Series modeling, Stochastic Volatility models, and Panel Data are some sub-fields of Econometrics.

Chapter 17 – Model Interpretation – to be completed.


Model interpretation is an important application of ML models. The goal is to interpret the model, by
answering the following question:

How much does feature X_i (any independent variable) impact the Target?

For example, what is the impact of a marketing campaign (binary attribute) on company’s sales in the
next three months?

A Linear Model gives us the impact of a feature through the feature's coefficient. Following the above example, imagine this is the model for the company's sales:

Expected Sales in the next three months in Million Dollars (Ŷ) = 11.5 + 2.5 × (Having a Campaign in the next two weeks)
Our train data, after minimizing the Loss function, suggests this formula. If our data is not biased, we expect the same equation to hold in the future (at least the near future). That means holding a Campaign in the next two weeks increases expected sales by $2.5 Million. This number can be used, along with the cost of the campaign and the profit margin, to decide whether holding a campaign is profitable.
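
To make that decision rule concrete, here is a tiny sketch; the profit margin and campaign cost are hypothetical numbers, not from the text:

incremental_sales = 2.5e6     # expected sales lift from the model, in dollars
profit_margin = 0.30          # hypothetical margin on sales
campaign_cost = 0.5e6         # hypothetical cost of running the campaign

incremental_profit = incremental_sales * profit_margin - campaign_cost
print(f"Expected incremental profit: ${incremental_profit:,.0f}")   # positive, so run the campaign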

Linear models are the best in terms of model interpretation. Other ML models are more black-box. Neural Networks are almost totally black-box; as of now, I am not aware of any good technique to interpret a NN (i.e. to explain the relationship between the target and any of the features). Ensemble models are in between and are interpretable to some extent. We will discuss how to interpret an Ensemble Model at the end of this section.

Why do we care about the relationship between target and a feature?

Counterfactual Analysis

Often we want to know whether we can change the target through an action or policy. For example, in the campaign example above

Chapter 18 – Gradient Descent, Advanced Recommender Systems, Advanced Neural Networks – to be completed.

Appendix I – SQL Programming


Make sure you understand more complex queries in section 4.

This appendix is a quick and partial review of SQL. SQL syntax differs slightly across platforms (SQL Server, SAS, Python, …), but as long as you understand the general syntax, you should be able to adjust to any version pretty fast. The chapter starts from easy basics and continues to more complex queries.

A lot of SQL queries can also be done in alternative ways, such as with Pandas operations or other packages; however, SQL is the only all-purpose language when it comes to complex, real data projects. SQL also gives you a good understanding of data structure.
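
As an illustration of the pandas alternative, the group-by query in section 2.2 below could be written as follows (a sketch; btceth is assumed to be a DataFrame holding the BTCETH table):

import pandas as pd

# average close per pair, equivalent to the "group by pair" query in section 2.2
average_by_pair = (
    btceth.groupby("pair", as_index=False)["close"]
          .mean()
          .rename(columns={"close": "average_close"})
)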

The chapter discusses SQL queries through some examples. We will use the following 5-minute data on
price of Bitcoin and Ethereum to show the output of commands. Imagine this data is called BTCETH.

low high open close volume date pair


19345.44 19359.68 19345.44 19354.94 3.76788869 10/24/2022 22:40 BTC-USDT
19346.92 19354.49 19346.92 19352.5 2.83758659 10/24/2022 22:35 BTC-USDT
19327.6 19342.11 19328.39 19342.11 1.69112782 10/24/2022 22:30 BTC-USDT
19321.33 19329.98 19328.94 19329.98 0.29138719 10/24/2022 22:25 BTC-USDT
19319.44 19332.57 19332.57 19328.94 1.16095171 10/24/2022 22:20 BTC-USDT
1341.92 1342.84 1342.1 1341.92 13.39516738 10/25/2022 0:55 ETH-USDT
1341.2 1342.67 1342.67 1341.77 20.3724605 10/25/2022 0:50 ETH-USDT
1342.85 1344.67 1344.67 1342.86 21.70678606 10/25/2022 0:45 ETH-USDT
1343.5 1344.95 1344.41 1344.53 25.11637552 10/25/2022 0:40 ETH-USDT
1343.98 1344.67 1343.98 1344.67 4.1186 10/25/2022 0:35 ETH-USDT

1. Read and Choose:

1.1. Choose only BTC rows.


select *
from BTCETH
where pair = “BTC-USDT”

low high open close volume date pair


19345.44 19359.68 19345.44 19354.94 3.76788869 10/24/2022 22:40 BTC-USDT
19346.92 19354.49 19346.92 19352.5 2.83758659 10/24/2022 22:35 BTC-USDT
19327.6 19342.11 19328.39 19342.11 1.69112782 10/24/2022 22:30 BTC-USDT
19321.33 19329.98 19328.94 19329.98 0.29138719 10/24/2022 22:25 BTC-USDT
19319.44 19332.57 19332.57 19328.94 1.16095171 10/24/2022 22:20 BTC-USDT

1.2. In the above code, * indicates all the columns. If we want to choose a subset of columns:
select close, volume
from BTCETH
where pair = “BTC-USDT”

close volume
19354.94 3.76788869
19352.5 2.83758659
19342.11 1.69112782
19329.98 0.29138719
19328.94 1.16095171

1.3. Choose ETH when volume higher than 20.
select *
from BTCETH
where pair = “ETH-USDT” and volume > 20

low high open close volume date pair


1341.2 1342.67 1342.67 1341.77 20.3724605 10/25/2022 0:50 ETH-USDT
1342.85 1344.67 1344.67 1342.86 21.70678606 10/25/2022 0:45 ETH-USDT
1343.5 1344.95 1344.41 1344.53 25.11637552 10/25/2022 0:40 ETH-USDT

2. By Group analysis: A lot of times you want to do some analysis on different subsets of the data.

2.1. Show distinct pairs in the data


select distinct pair
from BTCETH

pair
BTC-USDT
ETH-USDT

Distinct is a frequently used command for group-by-group analysis. Alternatively, you can use the group by command, which is faster than distinct and has more functionality (as shown below).
select pair
from BTCETH
group by pair

pair
BTC-USDT
ETH-USDT

2.2. Calculate average close price for BTC and ETH.


select pair, avg(close) as average_close
from BTCETH
group by pair

pair average_close
BTC-USDT 19341.694
ETH-USDT 1343.15

2.3. Calculate average close, minimum low, maximum high, and STD of volume for BTC when volume
> 2.
select
pair,
avg(close) as average_close,
min(low) as min_low,
max(high) as max_high,
std(volume) as std_volume
from BTCETH
where pair = "BTC-USDT" and volume > 2
group by pair

min_low max_high average_close std_volume pair


19345.44 19359.68 19353.72 0.46515105 BTC-USDT

3. Join: Join commands are used to merge two tables based on one or more merging keys (features).
Since here we have only one table, we will create other tables based on this table, and merge them
together.

First, create a table that shows pair, and average (close) for each pair.

create table Average_By_Pair as
select pair, avg(close) as average_close
from BTCETH
group by pair

pair average_close
BTC-USDT 19341.694
ETH-USDT 1343.15

Also create the following table, which shows the average close only for BTC.

create table Average_BTC as
select *
from Average_By_Pair
where pair = "BTC-USDT"

pair average_close
BTC-USDT 19341.694

Practice Question: Calculate the above table using the original table

3.1. Inner Join: Inner join only keeps the matching observations in both tables.

select a.pair, a.close, b.average_close
from BTCETH as a, Average_BTC as b
where a.pair = b.pair

pair      close     average_close
BTC-USDT  19354.94  19341.694
BTC-USDT  19352.5   19341.694
BTC-USDT  19342.11  19341.694
BTC-USDT  19329.98  19341.694
BTC-USDT  19328.94  19341.694
Alternatively we could use Inner Join.

select a.pair, a.close, b.average_close
from BTCETH as a inner join Average_BTC as b
on a.pair = b.pair

pair      close     average_close
BTC-USDT  19354.94  19341.694
BTC-USDT  19352.5   19341.694
BTC-USDT  19342.11  19341.694
BTC-USDT  19329.98  19341.694
BTC-USDT  19328.94  19341.694

Note that when you use the explicit "join" syntax, "on" is used instead of "where".

Also we create another table that we will use shortly:

Create table BTCETH_with_Average as
select a.pair, a.close, b.average_close
from BTCETH as a, Average_By_Pair as b
where a.pair = b.pair
pair      close     average_close
BTC-USDT  19354.94  19341.694
BTC-USDT  19352.5   19341.694
BTC-USDT  19342.11  19341.694
BTC-USDT  19329.98  19341.694
BTC-USDT  19328.94  19341.694
ETH-USDT  1341.92   1343.15
ETH-USDT  1341.77   1343.15
ETH-USDT  1342.86   1343.15
ETH-USDT  1344.53   1343.15
ETH-USDT  1344.67   1343.15

Practice Question: How many rows will the following table have? Answer: 50

Create table test as
select a.pair
from BTCETH as a, BTCETH_with_Average as b
where a.pair = b.pair

3.2. Outer Join (also called full outer join): Returns all the records, whether there is a match or not.

select a.pair, a.close, b.average_close
from BTCETH as a full outer join Average_BTC as b
on a.pair = b.pair
pair      close     average_close
BTC-USDT  19354.94  19341.694
BTC-USDT  19352.5   19341.694
BTC-USDT  19342.11  19341.694
BTC-USDT  19329.98  19341.694
BTC-USDT  19328.94  19341.694
ETH-USDT  1341.92
ETH-USDT  1341.77
ETH-USDT  1342.86
ETH-USDT  1344.53
ETH-USDT  1344.67

For another example, let’s say we have the following table on average price of BTC and ADA. Call
the table BTCADA.

pair average_close
BTC-USDT 19341.694
ADA_USDT 1.2

select a.pair, a.close, b.average_close
from BTCETH as a full outer join BTCADA as b
on a.pair = b.pair
pair      close     average_close
BTC-USDT  19354.94  19341.694
BTC-USDT  19352.5   19341.694
BTC-USDT  19342.11  19341.694
BTC-USDT  19329.98  19341.694
BTC-USDT  19328.94  19341.694
ETH-USDT  1341.92
ETH-USDT  1341.77
ETH-USDT  1342.86
ETH-USDT  1344.53
ETH-USDT  1344.67
ADA_USDT            1.2

3.3. Left Join (and similarly Right Join): Keeps all the observations from the left table, whether or not there is a match.
select a.pair, a.close, b.average_close
from BTCETH as a left join BTCADA as b
on a.pair = b.pair
pair      close     average_close
BTC-USDT  19354.94  19341.694
BTC-USDT  19352.5   19341.694
BTC-USDT  19342.11  19341.694
BTC-USDT  19329.98  19341.694
BTC-USDT  19328.94  19341.694
ETH-USDT  1341.92
ETH-USDT  1341.77
ETH-USDT  1342.86
ETH-USDT  1344.53
ETH-USDT  1344.67

4. More complex code, combining different conditions.

Create a table (call it technical_trade) that shows Pair, Date, and the Average Close over the Last 10 minutes. This query can be used to calculate the Moving Average of Price over the last N minutes, a popular indicator used in technical trading.

Create table technical_trade as
select a.pair, a.date, avg(b.close) as average_10_min_close
from BTCETH as a, BTCETH as b
where a.pair = b.pair and a.date > b.date and (a.date - b.date) <= 10
group by a.pair, a.date

pair      date              average_10_min_close
BTC-USDT  10/24/2022 22:40  19347.305
BTC-USDT  10/24/2022 22:35  19336.045
BTC-USDT  10/24/2022 22:30  19329.46
BTC-USDT  10/24/2022 22:25  19328.94
ETH-USDT  10/25/2022 0:55   1342.315
ETH-USDT  10/25/2022 0:50   1343.695
ETH-USDT  10/25/2022 0:45   1344.6
ETH-USDT  10/25/2022 0:40   1344.67
Note: The condition (a.date - b.date) <= 10 is far from a correct SQL command. To find the difference between two date columns, some additional processing, or dialect-specific date functions, are needed.

Make sure you understand how every row is calculated. Note that the earliest row of each pair is dropped (why?). In a real project, the 4th and 8th rows might also be dropped due to data insufficiency.

Final Question. Add a date column to the technical_trade table that shows the first time the price goes above (current close * 1.001) in the next 10 minutes. (If this date column is null, it means the price never increased by 0.1% in the next 10 minutes, and vice versa. This field can be used to define a binary variable that shows whether there will be a 0.1% return in the next 10 minutes if we buy at the current close.)

Create table technical_trade as
select a.pair, a.date, a.average_10_min_close, min(b.date) as return_date
from technical_trade as a, BTCETH as b, BTCETH as c
-- c brings back the close at the current date, since technical_trade does not carry the close column
where a.pair = b.pair and a.pair = c.pair and a.date = c.date
and a.date < b.date and (b.date - a.date) <= 10
and b.close >= (c.close * 1.001)
group by a.pair, a.date, a.average_10_min_close

pair      date              average_10_min_close  return_date
BTC-USDT  10/24/2022 22:35  19336.045
BTC-USDT  10/24/2022 22:30  19329.46
BTC-USDT  10/24/2022 22:25  19328.94              10/24/2022 22:35
ETH-USDT  10/25/2022 0:50   1343.695
ETH-USDT  10/25/2022 0:45   1344.6
ETH-USDT  10/25/2022 0:40   1344.67

So only at 10/24/2022 22:25 we will have 0.1% return in the next 10 minutes.

