Wait
Why should you listen to me?
Who are you?
Who knows Python?
Some setup – Google Colab… anybody heard of it?
Download the workbook and dataset… be ready.
Class participation will fetch you 5%
AI
Source: Wikipedia, pixabay
Mom says I must study harder…
Why is this intelligence called “Artificial”?
Artificial : made by people, often as a copy of something natural
Example: Rule-based AI
(Flowchart: a home-loan application acceptor, starting from Start.)
Learning rules automatically
Given the observations, can rules be derived automatically?
ML
Machine Learning is the statistical approach to achieving AI: improve with experience.
(Diagram: Experience → statistical model-building algorithm → ML model.)
Examples of such algorithms:
• Linear Regression
• K-means Clustering
• Decision Trees
• Random Forest
… and many more
Statistical modeling is a mathematically formalized method for approximating reality (i.e., whatever generates your data). It helps make predictions based on that approximation.
You said model? What is a model?
Data for Modelling (example: data about birds)
1. Every object or entity is a data point.
2. Every data point has some attributes (or features).
More about Models
Another way of looking at training a model: try to reduce the prediction error.
Today we will learn an ML model building approach
Regression
Regression !!! What is it?
From dictionary: “a return to a former or less developed state.”
For us:
Regression is a statistical method that attempts to determine (or model) the relationship between
• one dependent variable (usually denoted by Y) and
• a series of other variables (known as independent variables).
Why should you know
Regression?
Variables
An independent variable stands alone and isn't changed by the other variables being measured.
A person’s weight: is it an independent variable or a dependent variable?
Simple linear regression is an approach for predicting a dependent variable using a single independent variable.
       income     happiness
  1    3.862647   2.314489
  2    4.979381   3.433490
  3    4.923957   4.599373
  4    3.214372   2.791114
  5    7.196409   5.596398
  6    3.729643   2.458556
  7    4.674517   3.192992
  8    4.498104   1.907137
  9    3.121631   2.942450
 10    4.639914   3.737942
 11    4.632840   3.175406
 12    2.773179   2.009046
 13    7.119479   5.951814
 14    7.466653   5.960547
 15    2.117742   1.445799
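For reference, this table can be rebuilt in code and inspected; a minimal sketch using pandas (values transcribed from the slide; units of income are not stated on the slide):

```python
import pandas as pd

# Income vs. happiness (0-10 scale), transcribed from the slide
data = {
    "income":    [3.862647, 4.979381, 4.923957, 3.214372, 7.196409,
                  3.729643, 4.674517, 4.498104, 3.121631, 4.639914,
                  4.632840, 2.773179, 7.119479, 7.466653, 2.117742],
    "happiness": [2.314489, 3.433490, 4.599373, 2.791114, 5.596398,
                  2.458556, 3.192992, 1.907137, 2.942450, 3.737942,
                  3.175406, 2.009046, 5.951814, 5.960547, 1.445799],
}
df = pd.DataFrame(data)

# The two columns move together: a strong positive correlation
corr = df["income"].corr(df["happiness"])
```

A quick look at `corr` already suggests a roughly linear relationship, which is what simple linear regression will model next.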
Equation of a line
y = mx + c
where m = Rise / Run is the slope and c is the y-intercept (figure: a line plotted on the X and Y axes).
Simple Linear Regression lets us identify this line. In other words, it lets us
“model” the relationship between dependent and independent variable.
Simple Linear Regression Model
Y = β0 + β1 X + ε
For the i-th data point: Yi = β0 + β1 Xi + εi, with slope β1 = Rise / Run and intercept β0 (figure: data points scattered around the model line, each off by a random error ε).
Yi is the value of the i-th data point’s dependent variable. Also known as the response variable or outcome variable.
Xi is the value of the i-th data point’s independent variable. Also known as the feature or predictor variable.
εi is the random error.
β0, β1 are the regression parameters.
Fitting the Model (line)
Find the line, i.e., parameters β0 and β1, such that the Sum of Squared Errors (SSE) is minimized:
SSE = Σ(i=1 to n) εi²,  where εi = Yi − Ŷi
(Figure: data points around the fitted line with residuals ε1 … ε7; β1 = slope, β0 = intercept.)
Yi: value of the i-th observation of the dependent variable
Ŷi: predicted value of the i-th observation of the dependent variable
εi: random error or residual for the i-th observation
Which one is the best fitting model?
Sum of Squared Errors (SSE) is minimized when…
SSE = Σ(i=1 to n) εi² = Σ(i=1 to n) (Yi − Ŷi)² = Σ(i=1 to n) (Yi − β0 − β1 Xi)²
Setting ∂SSE/∂β0 = 0 gives Ȳ = β̂0 + β̂1 X̄, i.e., β̂0 = Ȳ − β̂1 X̄. The model passes through the mean: at X̄, that’s what the model will predict.
Setting ∂SSE/∂β1 = 0 gives β̂1 = Σ(i=1 to n) (Xi − X̄)(Yi − Ȳ) / Σ(i=1 to n) (Xi − X̄)²
Yi: value of the i-th observation of the dependent variable
Ŷi: predicted value of the i-th observation of the dependent variable
εi: random error or residual for the i-th observation
Ȳ: mean value of the dependent variable
X̄: mean value of the independent variable
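These closed-form estimators can be checked numerically; a minimal sketch with invented toy data (the x and y values here are purely illustrative):

```python
import numpy as np

# Toy data, roughly on the line y = 2x (invented for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

x_bar, y_bar = x.mean(), y.mean()

# Slope from the normal equation: sum of cross-deviations over sum of squared deviations
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
# Intercept from "the line passes through the mean"
beta0_hat = y_bar - beta1_hat * x_bar
```

Note that `beta0_hat + beta1_hat * x_bar` equals `y_bar` exactly: the fitted line always passes through the point of means, as the derivation above shows.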
Show me some code….
2nd Lecture
Revisiting Income and happiness
Regressing happiness on income: β1 = 0.72038294 — if your income increases by $1,000, your happiness increases by about 0.72 units (0–10 scale).
Regressing income on happiness: β1 = 1.04290149 — if your happiness increases by 1 unit, your salary increases by about $1,042.
Validating the derived model – Empirically
Split the Data into Train Data and Test Data.
Fit Yi = β̂0 + β̂1 Xi + εi on the train data; on the test data, check how far the predictions fall from reality: Yi = Ŷi + εi.
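The split-then-evaluate idea can be sketched without any ML library; the data below is synthetic (a 0.7 slope plus noise is assumed purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100
X = rng.uniform(1, 8, n)                  # income-like feature (synthetic)
Y = 0.7 * X + rng.normal(0, 0.5, n)       # happiness-like response (synthetic)

# Hold out 20% of the data as the test set
idx = rng.permutation(n)
test_size = n // 5
test_idx, train_idx = idx[:test_size], idx[test_size:]
X_train, Y_train = X[train_idx], Y[train_idx]
X_test, Y_test = X[test_idx], Y[test_idx]

# Fit on the train data only
b1 = (np.sum((X_train - X_train.mean()) * (Y_train - Y_train.mean()))
      / np.sum((X_train - X_train.mean()) ** 2))
b0 = Y_train.mean() - b1 * X_train.mean()

# Residuals on unseen test data estimate real-world prediction error
test_residuals = Y_test - (b0 + b1 * X_test)
```

Evaluating on data the model never saw is what makes this an empirical validation rather than a fit-quality statistic.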
Coefficient of determination (R-squared or R²)
Ratio of the explained variation to the total variation of the dependent variable (with Yi = Ŷi + εi and Ȳ the mean):
R² = SSR / SST = Σ(i=1 to n) (Ŷi − Ȳ)² / Σ(i=1 to n) (Yi − Ȳ)²
How much of the variation in the dependent variable can be explained by taking the independent variable into account.
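The ratio is easy to compute by hand; a sketch reusing invented toy data (for an OLS fit with an intercept, SSR/SST and 1 − SSE/SST coincide):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # toy response (invented)

# Fit the simple linear regression in closed form
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)      # total variation
ssr = np.sum((y_hat - y.mean()) ** 2)  # explained variation
sse = np.sum((y - y_hat) ** 2)         # unexplained variation

r2 = ssr / sst
```

For this nearly-linear toy data, `r2` comes out close to 1: almost all of the variation in y is explained by x.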
Why call it linear?
In the model Y = β0 + β1 X + ε, the dependent variable Y is linearly related to the slope β1 = Rise / Run, with intercept β0. (Figure: data points around the model line, and an example plot with β1 = 3 over X = 1 … 4.)
Will linear regression fail to identify a non-linear relationship, given
Y = β0 + β1 X + ε ?
Let’s see something interesting
X = [2 , 2 , 2 , 3 , 1 , 1 , 2 , 3 , 1 , 4]
y = [109, 102, 98, 85, 95, 96, 98, 123, 94, 102]
𝛽1 = 3.01123596
Spurious-Correlations
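The slide’s slope can be reproduced with the closed-form estimator, and computing R² alongside it shows why a nonzero slope alone proves nothing:

```python
import numpy as np

X = np.array([2, 2, 2, 3, 1, 1, 2, 3, 1, 4], dtype=float)
y = np.array([109, 102, 98, 85, 95, 96, 98, 123, 94, 102], dtype=float)

# Closed-form simple-linear-regression estimates
b1 = np.sum((X - X.mean()) * (y - y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = y.mean() - b1 * X.mean()

# The slope is nonzero, yet almost none of the variation is explained
y_hat = b0 + b1 * X
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
```

`b1` matches the slide’s 3.01123596, but `r2` is tiny (under 0.1): the regression line explains almost nothing, which is exactly how spurious relationships look.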
Correlation is not Causation
If two variables appear correlated, it does not mean that one is caused by the other.
(Examples: fever and rashes are correlated; abandoned carts and uninstalled apps are correlated.)
The terms “dependent” and “independent” do not necessarily imply a causal relationship between the two variables.
Regression vs Correlation
Assumptions – Regression
The method of least squares gives the best equation under a set of assumptions (Harter 1974, 1975).
Multiple Linear Regression
Multiple Linear Regression
Extension of Simple Linear Regression to multiple independent variables:
Y = β0 + β1 X1 + β2 X2 + ⋯ + βn Xn + ε
Yi is the value of the i-th data point’s dependent variable. Also known as the response variable or outcome variable.
Xni is the value of the i-th data point’s n-th independent variable. Also known as a feature or predictor variable.
εi is the random error.
β0, β1, β2, …, βn are the regression parameters.
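The multiple-regression parameters can be estimated with a least-squares solve on a design matrix; a sketch with synthetic data (the true coefficients 2.0, 1.5 and −0.8 are assumed purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X1 = rng.normal(size=n)   # first predictor (synthetic)
X2 = rng.normal(size=n)   # second predictor (synthetic)
Y = 2.0 + 1.5 * X1 - 0.8 * X2 + rng.normal(0, 0.3, n)

# Design matrix: a column of ones for beta_0, then one column per predictor
A = np.column_stack([np.ones(n), X1, X2])

# Least-squares solution minimizes the SSE, just as in the simple case
beta_hat, *_ = np.linalg.lstsq(A, Y, rcond=None)
```

With enough data and modest noise, `beta_hat` lands close to the coefficients the data was generated from.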
Y = β0 + β1 X + β2 X² + β3 Z³ + ⋯ + βn Gⁿ + ε
Is this a linear equation? (It is: the model is still linear in the parameters β.)
Examples of Multiple Linear Regression
The treatment cost of a cardiac patient may depend on factors such as
• age,
• past medical history,
• body weight,
• blood pressure, and so on.
Salary of MBA students at the time of graduation may depend on factors such as
• their academic performance,
• prior work experience,
• communication skills,
• whether they carefully attended the Business Applications of AI & ML Techniques lectures.
Market share of a brand may depend on factors such as price, promotion expenses,
competitors’ price, etc.
Steps in building a Regression Model
Pre-process the Data
1) Data Quality (measured through several characteristics such as completeness, correctness, etc.): data completeness refers to the availability of the data necessary for developing the model.
2) Missing Data: many variables may have missing values. The data scientist has to come up with a strategy to handle missing values, such as data imputation, and specific techniques to carry out the imputation.
3) Handling Qualitative Variables: qualitative (categorical) variables need to be converted using dummy variables before incorporating them in the regression model.
4) Derive new variables (such as ratios and interaction variables), which may have a better association with the dependent variable.
Data Data Data…
Variables come in two kinds: Categorical (also called Qualitative Variables, e.g., birthplace) and Numeric (e.g., distance, income, height, review score, happiness score, fuel economy, bedrooms).
Example: 30 employees with Education (HS, G, PG, NA) and Salary. Let’s encode: HS: 1, G: 2, PG: 3, NA: 4.

S. No.  Education  Salary    S. No.  Education  Salary    S. No.  Education  Salary
 1      HS (1)      9800     11      G (2)      17200     21      PG (3)     21000
 2      HS (1)     10200     12      G (2)      17600     22      PG (3)     19400
 3      HS (1)     14200     13      G (2)      17650     23      PG (3)     18800
 4      HS (1)     21000     14      G (2)      19600     24      PG (3)     21000
 5      HS (1)     16500     15      G (2)      16700     25      NA (4)      6500
 6      HS (1)     19210     16      G (2)      16700     26      NA (4)      7200
 7      HS (1)      9700     17      G (2)      17500     27      NA (4)      7700
 8      HS (1)     11000     18      G (2)      15000     28      NA (4)      5600
 9      HS (1)      7800     19      PG (3)     18500     29      NA (4)      8000
10      HS (1)      8800     20      PG (3)     19700     30      NA (4)      9300

(Figure: scatter plot of Salary against the encoded Education value, 1–4.) Do you see the problem?
Regression Models with Qualitative Variables
Since categorical variables are not measured on an interval or ratio scale, we cannot include them directly in the model; including them directly will result in model misspecification.
What to do?
We have to pre-process the categorical variables
using dummy variables for building a
regression model.
Dummy Variable…
What are you talking about?
Regression Models with Qualitative Variables
Note that if we build a model Y = β0 + β1 × Education (with Education encoded 1–4), it will be incorrect.
We have to use 3 dummy variables, since there are 4 categories of educational qualification.
(Table: the Education column replaced by three dummy columns — HS, G, PG — alongside Salary.)
Y = β0 + β1 HS + β2 G + β3 PG + ε
The fourth category (none), for which we did not create an explicit dummy variable, is called the base category.
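The dummy-coding step can be sketched with pandas (the rows here are a small invented subset in the spirit of the earlier salary table):

```python
import pandas as pd

# Toy subset: education category plus salary (values illustrative only)
df = pd.DataFrame({
    "Education": ["HS", "G", "PG", "NA", "G", "PG"],
    "Salary":    [9800, 17200, 21000, 6500, 17600, 19400],
})

# One dummy column per category...
dummies = pd.get_dummies(df["Education"])
# ...then drop one so it serves as the base category ("NA", i.e., none)
dummies = dummies.drop(columns="NA")

# Design data: 3 dummies for 4 categories, as the slide prescribes
X = pd.concat([dummies, df["Salary"]], axis=1)
```

A row from the base category has zeros in all three dummy columns, so its effect is absorbed into the intercept β0.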
Something Interesting…
Dependent variable: Enjoyment
Independent variables: Food (pizza, salad, idli, ice-cream) and Dressing (oregano, ketchup, chocolate-sauce)
Interaction: an interaction occurs when an independent variable has a different effect on the outcome depending on the values of another independent variable.
Interaction Variables in Regression Models
Another example: the interaction between adding sugar to coffee and stirring the coffee. Neither of the two individual variables has much effect on sweetness, but the combination of the two does.
Interaction variables are variables included in the regression model that are a product of two independent variables (such as X1 × X2).

Model: Salary = β0 + β1 WE + β2 Gender + β3 WE × Gender
- Gender = 1 denotes female and 0 denotes male
- WE is the work experience in number of years

S. No.  Gender  WE  Salary    S. No.  Gender  WE  Salary
 1      1       2    6800     16      0       2   22100
 2      1       3    8700     17      0       1   20200
 3      1       1    9700     18      0       1   17700
 4      1       3    9500     19      0       6   34700
 5      1       4   10100     20      0       7   38600
 6      1       6    9800     21      0       7   39900
 7      0       2   14500     22      0       7   38300
 8      0       3   19100     23      0       3   26900
 9      0       4   18600     24      0       4   31800
10      0       2   14200     25      1       5    8000
11      0       4   28000     26      1       5    8700
12      0       3   25700     27      1       3    6200
13      0       1   20350     28      1       3    4100
14      0       4   30400     29      1       2    5000
15      0       1   19400     30      1       ?     800
https://www.medicine.mcgill.ca/epidemiology/joseph/courses/EPIB-621/interaction.pdf
https://www.theanalysisfactor.com/interpreting-interactions-in-regression/
𝑆𝑎𝑙𝑎𝑟𝑦 = 𝛽0 + 𝛽1 𝑊𝐸 + 𝛽2 𝐺𝑒𝑛𝑑𝑒𝑟 + 𝛽3 𝑊𝐸 × 𝐺𝑒𝑛𝑑𝑒𝑟
The fitted regression equation is
Y = 13442.895 − 7757.75 Gender + 3523.547 WE − 2913.908 Gender × WE
For males (Gender = 0) this is Y = 13442.895 + 3523.547 WE; for females (Gender = 1) it is Y = 5685.145 + 609.639 WE.
That is, the change in salary when WE increases by one year is 609.639 for females and 3523.547 for males.
That is, the salary of male workers increases at a higher rate compared to female workers.
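The per-gender slopes follow directly from the fitted coefficients; a small sketch that derives them from the equation above:

```python
# Coefficients reported on the slide
b0, b_gender, b_we, b_inter = 13442.895, -7757.75, 3523.547, -2913.908

def predicted_salary(we, gender):
    """gender: 1 = female, 0 = male (the slide's coding)."""
    return b0 + b_gender * gender + b_we * we + b_inter * we * gender

# Marginal effect of one extra year of experience, by gender:
# the interaction term makes the WE slope depend on Gender.
male_slope = predicted_salary(1, 0) - predicted_salary(0, 0)
female_slope = predicted_salary(1, 1) - predicted_salary(0, 1)
```

`male_slope` recovers 3523.547 and `female_slope` recovers 3523.547 − 2913.908 = 609.639, matching the interpretation above.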
Validating the Multiple Linear Regression Model – Mathematically
Coefficient of determination (R-squared or R²): ratio of the explained variation to the total variation of the dependent variable (with Yi = Ŷi + εi and Ȳ the mean):
R² = SSR / SST = Σ(i=1 to n) (Ŷi − Ȳ)² / Σ(i=1 to n) (Yi − Ȳ)² = (SST − SSE) / SST
How much of the variation in the dependent variable can be explained by taking the independent variables into account.
How to improve the model’s performance
I.e., get a higher coefficient of determination (R-squared or R²).
𝑌 = 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + 𝜀
Consider a researcher interested in predicting Mouse’s body length (L). She considers
• Mouse’s body weight (W) and
• Mouse’s tail length (T)
as independent variables
𝐿 = 𝛽0 + 𝛽1 𝑊 + 𝛽2 𝑇 + 𝜀
𝟐
𝑆𝑆𝑅 𝑆𝑆𝑇 − 𝑆𝑆𝐸 𝑆𝑆𝐸
𝑹 = = =1 −
𝑆𝑆𝑇 𝑆𝑆𝑇 𝑆𝑆𝑇
A problem with the R2, is that, it will always increase when more variables are added to the model, even if
those variables are only weakly associated with the response (James et al. 2014). A solution is to adjust the R2
by taking into account the number of predictor variables.
The adjustment in the “Adjusted R Square” value in the summary output is a correction for the number of x variables included in the prediction model.
Adjusted Coefficient of determination (Adjusted R²)
Adjusted R² = 1 − [SSE / (n − k − 1)] / [SST / (n − 1)]
where n is the number of observations and k the number of independent variables.
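Since R² = 1 − SSE/SST, this formula is algebraically the same as 1 − (1 − R²)(n − 1)/(n − k − 1); a sketch showing both forms agree and how the penalty behaves (the R² values used are invented for illustration):

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for n observations and k independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Both forms of the formula give the same number
sst, sse, n, k = 100.0, 20.0, 30, 2
direct = 1 - (sse / (n - k - 1)) / (sst / (n - 1))
via_r2 = adjusted_r2(1 - sse / sst, n, k)

# Adding 3 weak predictors nudges R^2 up slightly (0.800 -> 0.805),
# yet the adjusted R^2 goes DOWN: the penalty outweighs the gain.
base = adjusted_r2(0.800, n=30, k=2)
bigger = adjusted_r2(0.805, n=30, k=5)
```

This is exactly the correction described above: extra predictors must earn their keep, or the adjusted value falls.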
See you next time
3rd Lecture
Case Study – IPL Player Pricing
Case Study Document: https://hbsp.harvard.edu/tu/e6586d37
Case study Data: https://hbsp.harvard.edu/tu/8375b172
Can we predict the price of an IPL player using historical data or features about
the player?
Features (independent variables) for IPL player price..
Let’s see the code…
Multicollinearity
The presence of multicollinearity can destabilize the multiple linear regression model.
Why is Multicollinearity a Potential Problem?
A key goal of regression analysis is to isolate the relationship between each independent variable and the
dependent variable.
The interpretation of a regression coefficient is that it represents the mean change in the dependent variable
for each 1 unit change in an independent variable when you hold all of the other independent variables
constant.
Multicollinearity Types
Structural multicollinearity: This type occurs when we create a model term using other terms. In other words, it’s a byproduct of the model that we specify rather than being present in the data itself. For example, if you square term X to model curvature, clearly there is a correlation between X and X².
Data multicollinearity: This type of multicollinearity is present in the data itself rather than being an artifact of our model. Observational experiments are more likely to exhibit this kind of multicollinearity.
https://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/
Checking for Multicollinearity
VIF – Variance Inflation Factor, calculated for every independent variable:
VIFi = 1 / (1 − Ri²)
where Ri² is the coefficient of determination obtained by regressing the independent variable Xi on the other independent variables.
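The VIF can be computed from scratch by running exactly that auxiliary regression for each column; a sketch on synthetic data where one predictor is built to be nearly collinear with another:

```python
import numpy as np

def vif(X):
    """VIF for each column of X: regress it on the rest, then 1/(1 - R_i^2)."""
    n, p = X.shape
    out = []
    for i in range(p):
        y = X[:, i]
        # Design matrix of all OTHER columns, plus an intercept
        others = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - resid.var() / y.var()
        out.append(1 / (1 - r2))
    return np.array(out)

rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)             # independent of x1 -> VIF near 1
x3 = x1 + rng.normal(0, 0.1, 500)     # nearly collinear with x1 -> large VIF
vifs = vif(np.column_stack([x1, x2, x3]))
```

The collinear pair (`x1`, `x3`) gets a large VIF while `x2` stays near 1, which is the diagnostic pattern to look for. (In practice, statsmodels offers a ready-made `variance_inflation_factor` as well.)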
Extra Slides
If linear regression identifies linear relationships, then it should fail for
Y = β0 + β1 · (X² / 5) + ε
X² / 5 = [0, 0.2, 0.8, 1.8, 3.2, 5, 7.2, 9.8, 12.8, 16.2]
β1 = 0.59392553
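It doesn’t fail, because the model is linear in the parameters: regressing on the transformed feature X²/5 is still an ordinary linear regression. A sketch with a synthetic response (the slide does not show its Y values, so a true coefficient of 0.6 is assumed here purely for illustration):

```python
import numpy as np

X = np.arange(10.0)
Z = X ** 2 / 5                                 # the transformed feature from the slide
rng = np.random.default_rng(7)
Y = 2.0 + 0.6 * Z + rng.normal(0, 0.1, 10)     # synthetic response (assumed)

# Ordinary closed-form simple linear regression, applied to Z instead of X
b1 = np.sum((Z - Z.mean()) * (Y - Y.mean())) / np.sum((Z - Z.mean()) ** 2)
b0 = Y.mean() - b1 * Z.mean()
```

The fit recovers the slope on Z even though Y is non-linear in X: the non-linearity was moved into the feature, not the model.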