
Let's Begin

Wait
Why should you listen to me?
Who are you?
Who knows Python?
Some setup – Google Colab. Anybody heard of it?
Download the workbook and dataset. Be ready.
Class participation will fetch you 5%.

1
AI

Is this view correct?

2
Source: Wikipedia, pixabay
Mom says I must study harder…

3
Why is this intelligence called "Artificial"?
Artificial: made by people, often as a copy of something natural.

Example: Rule-based AI — a Home Loan Application Acceptor

A system that accomplishes artificial intelligence through a rule-based model
is known as a rule-based AI system.
• Rules are coded by humans.
• Simple artificial intelligence models: if-then coding statements.

Flowchart (Start → Accept/Reject):
• Is credit score > 850? — Yes → Is income > 0.1 of loan required? — Yes → Accept
• No → Is credit score > 800? — Yes → Is income > 0.12 of loan required? — Yes → Accept
• No → Is credit score > 750? — Yes → Is income > 0.13 of loan required? — Yes → Accept
• Otherwise → Reject

4
https://medium.com/@er.rameshkatiyar/what-is-rule-engine-86ea759ad97d
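As a sketch, the flowchart's rules translate into plain if-then code (the thresholds are taken from the flowchart; the function name is hypothetical):

```python
def accept_loan(credit_score, income_to_loan_ratio):
    """Hand-coded rules mirroring the flowchart: check credit-score bands,
    then require a minimum income-to-loan ratio for each band."""
    if credit_score > 850:
        return income_to_loan_ratio > 0.10
    if credit_score > 800:
        return income_to_loan_ratio > 0.12
    if credit_score > 750:
        return income_to_loan_ratio > 0.13
    return False  # credit score too low: reject outright

print(accept_loan(880, 0.11))  # True: top band, ratio above 0.10
print(accept_loan(760, 0.12))  # False: this band needs a ratio above 0.13
```

Every branch here is a rule a human wrote by hand, which is exactly the scaling problem the next slide raises.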
How many rules do you need to code?
If there are 5 variables where each one can take 2 values, then we have
2⁵ = 32 possible outcomes. How many rules will you code?
______

5 variables, each with 5 possible values: 5⁵ = 3125 outcomes.

5
Learning rules automatically
Given the observations, can rules be derived automatically?

Home Loan Application Historical Data

Customer ID | Credit Score | Income/Loan ratio | Accept loan application
2319 | 839 | 0.1 | No
6394 | 882 | 0.1 | Yes
1122 | 721 | 0.2 | Yes
8990 | 650 | 0.19 | No

Fire means HOT
6
ML
Machine Learning is the statistical approach to achieve AI: improve with experience.

• Linear Regression
• K-means Clustering
• Decision Trees
• Random Forest
… and many more

Experience → Statistical Model Building Algorithm → ML Model

Statistical modeling is a mathematically formalized method for
approximating reality (i.e., what generates your data). It helps in
making predictions based on that approximation.
7
You said model? What is a model?

A model tells us the most important aspects of the system being modelled.
It is easier to interact with and use than reality.

Why do we need a model? Sometimes it is the only option.
8
Simplest Statistical Models
• Mean
• Median
• Mode
• Variance

Modelling requires observations.
Observations are represented in the form of data.
Data comes in various formats.

9
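These four simple models are one-liners with Python's standard library (the observations below are toy numbers, not from the slides):

```python
import statistics

observations = [2, 4, 4, 4, 5, 5, 7, 9]  # toy data for illustration

print(statistics.mean(observations))       # 5.0  -> the "mean" model
print(statistics.median(observations))     # 4.5  -> the "median" model
print(statistics.mode(observations))       # 4    -> the "mode" model
print(statistics.pvariance(observations))  # 4.0  -> population variance
```

Even the mean is a model: it summarizes all observations with one number and lets you "predict" a typical value.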
Data about Birds
Data for Modelling: data points and attributes (or features)

1. Every object or entity is a data point.
2. Every data point has some attributes.

Table Source: https://www.chegg.com/homework-help/questions-and-answers/table-1-predation-rats-r-rattus-r-exulans-birds-included-table-bird-species-typical-stages-q51946047
10
Bank Marketing Data
age job marital education default housing loan contact month day_of_week duration
56 housemaid married basic.4y no no no telephone may mon 261
57 services married high.school unknown no no telephone may mon 149
37 services married high.school no yes no telephone may mon 226
40 admin. married basic.6y no no no telephone may mon 151
56 services married high.school no no yes telephone may mon 307
45 services married basic.9y unknown no no telephone may mon 198
59 admin. married professional.course no no no telephone may mon 139
41 blue-collar married unknown unknown no no telephone may mon 217
24 technician single professional.course no yes no telephone may mon 380
25 services single high.school no yes no telephone may mon 50
41 blue-collar married unknown unknown no no telephone may mon 55
25 services single high.school no yes no telephone may mon 222
29 blue-collar single high.school no no yes telephone may mon 137
57 housemaid divorced basic.4y no yes no telephone may mon 293
35 blue-collar married basic.6y no yes no telephone may mon 146
54 retired married basic.9y unknown yes yes telephone may mon 174
35 blue-collar married basic.6y no yes no telephone may mon 312
46 blue-collar married basic.6y unknown yes yes telephone may mon 440
50 blue-collar married basic.9y no yes yes telephone may mon 353
39 management single basic.9y unknown no no telephone may mon 195
30 unemployed married high.school no no no telephone may mon 38
55 blue-collar married basic.4y unknown yes no telephone may mon 262

Walmart Sales Data
Store Date Weekly_Sales Holiday_Flag Temperature Fuel_Price CPI Unemployment
1 5/2/2010 1643690.9 0 42.31 2.572 211.0963582 8.106
1 12/2/2010 1641957.44 1 38.51 2.548 211.2421698 8.106
1 19-02-2010 1611968.17 0 39.93 2.514 211.2891429 8.106
1 26-02-2010 1409727.59 0 46.63 2.561 211.3196429 8.106
1 5/3/2010 1554806.68 0 46.5 2.625 211.3501429 8.106
1 12/3/2010 1439541.59 0 57.79 2.667 211.3806429 8.106
1 19-03-2010 1472515.79 0 54.58 2.72 211.215635 8.106
1 26-03-2010 1404429.92 0 51.45 2.732 211.0180424 8.106
1 2/4/2010 1594968.28 0 62.27 2.719 210.8204499 7.808
1 9/4/2010 1545418.53 0 65.86 2.77 210.6228574 7.808
1 16-04-2010 1466058.28 0 66.32 2.808 210.4887 7.808
1 23-04-2010 1391256.12 0 64.84 2.795 210.4391228 7.808
1 30-04-2010 1425100.71 0 67.41 2.78 210.3895456 7.808
1 7/5/2010 1603955.12 0 72.55 2.835 210.3399684 7.808
1 14-05-2010 1494251.5 0 74.78 2.854 210.3374261 7.808
1 21-05-2010 1399662.07 0 76.44 2.826 210.6170934 7.808
1 28-05-2010 1432069.95 0 80.44 2.759 210.8967606 7.808
1 4/6/2010 1615524.71 0 80.69 2.705 211.1764278 7.808
1 11/6/2010 1542561.09 0 80.43 2.668 211.4560951 7.808
11
https://www.kaggle.com/datasets
Building ML (Statistical) Models
Often called "training" rather than "building"… remember, we are
making a machine learn, right?

Statistics is all about parameters.
Training or building the model thus involves fine-tuning such parameters.

Untuned Parameters → Tuned Parameters

12
More about Models

Models are rarely completely accurate.
There is always some error involved in prediction.
Another way of looking at training a model is as trying to reduce the prediction error.

13
14
Today we will learn an ML model building approach

Regression

15
Regression!!! What is it?
From the dictionary: "a return to a former or less developed state."

For us:
Regression is a statistical method that attempts to determine (or model) the relationship between
• one dependent variable (usually denoted by Y) and
• a series of other variables (known as independent variables).

The term was coined by Sir Francis Galton.
He wanted to understand whether the heights of parents and their children have any relation.

16
Why should you know Regression?

Uses of regression-based analysis:
• Description: describe the relationship between a dependent variable y (child's height) and explanatory variables x (parents' height).
• Prediction: predict the dependent variable y based on explanatory variables x.

17
Variables
An independent variable stands alone and isn't changed by the other variables being measured.

Problem 1: Predict a person's weight when his height, age, daily calorie intake, and calories burned are given.
Problem 2: Predict the fuel economy of a scooter when its owner's weight, average driving speed, and driving aggression are given.

Independent variable or dependent variable?
• A person's weight
• A person's age
• Her car's fuel economy
• Her husband's height
• Price of a product
• Discount offered on a product
• Air-conditioner's efficiency
18
Simple Linear Regression

Simple linear regression is an approach for predicting a dependent variable using a single independent variable.

income happiness
1 3.862647 2.314489
2 4.979381 3.43349
3 4.923957 4.599373
4 3.214372 2.791114
5 7.196409 5.596398
6 3.729643 2.458556
7 4.674517 3.192992
8 4.498104 1.907137
9 3.121631 2.94245
10 4.639914 3.737942
11 4.63284 3.175406
12 2.773179 2.009046
13 7.119479 5.951814
14 7.466653 5.960547
15 2.117742 1.445799
19
Equation of a line

y = mx + c

where m = Rise / Run is the slope and c is the y-intercept.

Simple linear regression lets us identify this line. In other words, it lets us
"model" the relationship between the dependent and independent variable.

Simple linear regression establishes the relationship between two
variables based on a line of best fit.

20
Simple Linear Regression Model

Y = β0 + β1·X + ε

Yi = β0 + β1·Xi + εi

where β1 = Rise / Run is the slope and β0 is the intercept.

• Yi is the value of the i-th data point's dependent variable. Also known as the response variable or outcome variable.
• Xi is the value of the i-th data point's independent variable. Also known as a feature or predictor variable.
• εi is the random error.
• β0, β1 are the regression parameters.
21
Fitting the Model (line)
Find the line, i.e., parameters β0 (intercept) and β1 (slope), such that the
Sum of Squared Errors (SSE) is minimized:

SSE = Σ εi²   (summing over i = 1…n)

where εi = Yi − Ŷi, and
• Yi : value of the i-th observation of the dependent variable
• Ŷi : predicted value of the i-th observation of the dependent variable
• εi : random error or residual for the i-th observation

22
Which one is the best fitting model?

If we use the best model the __________ will be minimized.

23
Sum of Squared Errors (SSE) is minimized when…

SSE = Σ εi² = Σ (Yi − Ŷi)²   (summing over i = 1…n)

Now, Ŷi = β0 + β1·Xi, because that's what the model will predict. So

SSE = Σ (Yi − β0 − β1·Xi)²

Setting ∂SSE/∂β0 = 0 gives  Ȳ = β̂0 + β̂1·X̄   (the model passes through the mean)

Setting ∂SSE/∂β1 = 0 gives  β̂1 = Σ (Xi − X̄)(Yi − Ȳ) / Σ (Xi − X̄)²

• Yi : value of the i-th observation of the dependent variable
• Ŷi : predicted value of the i-th observation of the dependent variable
• εi : random error or residual for the i-th observation
• Ȳ : mean value of the dependent variable
• X̄ : mean value of the independent variable

24
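The closed-form estimates derived above can be computed directly; the data here are made-up toy numbers, assumed only for illustration:

```python
# Closed-form least-squares estimates (toy data, roughly y = 2x).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.0, 9.8]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# beta1_hat = Sxy / Sxx, exactly the formula from the derivation above
beta1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
        sum((x - x_bar) ** 2 for x in xs)
# The fitted line passes through (x_bar, y_bar), so:
beta0 = y_bar - beta1 * x_bar

print(round(beta1, 3), round(beta0, 3))  # 1.95 0.15
```

This is all that libraries like scikit-learn do for simple linear regression, just vectorized and numerically hardened.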
Show me some code….
25
2nd Lecture

26
Revisiting Income and Happiness

Regressing happiness on income: β1 = 0.72038294.
If your income increases by $1000, then your happiness will increase by 0.72 units (on a 0–10 scale).

Regressing income on happiness: β1 = 1.04290149.
If you become happier by 1 unit, then your salary will increase by $1042.

27
Validating the derived model – Empirically

Train Data: used for building the model.
Test Data: used to see how accurate the model is.

28
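A minimal sketch of such a split, assuming no ML library (scikit-learn's `train_test_split` offers the same functionality):

```python
import random

def train_test_split(data, test_fraction=0.2, seed=42):
    """Shuffle the observations, then cut off the last test_fraction
    as held-out test data; the rest is used to fit the model."""
    rng = random.Random(seed)       # fixed seed -> reproducible split
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

train, test = train_test_split(list(range(10)))
print(len(train), len(test))  # 8 2
```

The key point: the test data must never be used while fitting, or the accuracy estimate is meaningless.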
How to validate the derived model – Mathematically
Coefficient of determination (R-squared or R²):
the ratio of explained variation to the total variation of the dependent variable.

Yi = β0 + β1·Xi + εi

Yi = Ŷi + εi

Yi (actual value of the dependent variable) = Ŷi (value predicted by the model) + εi (error in the model's prediction).

29
Coefficient of determination (R-squared or R²)
Ratio of explained variation to the total variation of the dependent variable.

Yi = Ŷi + εi

Total variation:              SST = Σ (Yi − Ȳ)²  — Sum of Squares, Total
Variation explained by model: SSR = Σ (Ŷi − Ȳ)²  — Sum of Squares due to Regression
Variation not explained:      SSE = Σ (Yi − Ŷi)² — Sum of Squared Errors, the unexplained variation
(all sums over i = 1…n)

R² = SSR / SST = Σ (Ŷi − Ȳ)² / Σ (Yi − Ȳ)²

R² measures how much of the variation in the dependent variable can be
explained by taking the independent variable into account.
30
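The decomposition above translates directly into code (the true and predicted values below are assumed toy numbers):

```python
def r_squared(y, y_hat):
    """R-squared = 1 - SSE/SST, equivalent to SSR/SST per the decomposition."""
    y_bar = sum(y) / len(y)
    sst = sum((yi - y_bar) ** 2 for yi in y)          # total variation
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # unexplained
    return 1 - sse / sst

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
print(r_squared(y_true, y_pred))  # 0.98: the model explains 98% of variation
```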
Why call it linear?

Y = β0 + β1·X + ε — a straight line with slope β1 = Rise / Run and intercept β0.

If this is linear, then these (curves) are non-linear… right?

A linear relationship between the dependent variable and the
independent variable.
31
The real reason for calling it linear

"Linear" in linear regression refers to the relationship between
the dependent variable Y and the model coefficient β1:
Y is linearly related to β1.

[Chart: non-linear relationships in X — X·β1, β1^X, logX(β1) — plotted for β1 = 3.]

32
Will linear regression fail to identify a non-linear relationship for

Y = β0 + β1·X + ε ?

33
Let's see something interesting
X = [2, 2, 2, 3, 1, 1, 2, 3, 1, 4]
y = [109, 102, 98, 85, 95, 96, 98, 123, 94, 102]

Fitting a line gives β1 = 3.01123596.

Do you feel it is a decent model fit?
Can we say that for every unit increase in X, the increase in Y is 3.011?

Spurious Correlations
34
Correlation is not Causation
If two variables appear correlated, it does not mean that one is caused by the other.

Example: illness → (causation) → fever, and illness → (causation) → rashes;
fever and rashes are merely correlated.
Example: too many steps to purchase → (causation) → abandoned carts, and → (causation) → uninstalled apps;
abandoned carts and uninstalled apps are merely correlated.

The terms dependent and independent do not necessarily imply a causal relationship between the two variables.
35
Regression vs Correlation

Regression is the study of the "existence of a relationship" between two
variables. The main objective is to estimate the change in the mean value of the
dependent variable.

Correlation is the study of the "strength of the relationship" between two
variables.

Beware of spurious regression.

36
Assumptions – Regression
The method of least squares gives the best equation under the assumptions stated below (Harter 1974,
1975):

1. The regression model is linear in the regression parameters.
2. The explanatory variable, X, is assumed to be non-stochastic (i.e., X is deterministic).
3. The conditional expected value of the residuals, E(εi | Xi), is zero.
4. In the case of time-series data, the residuals are uncorrelated, that is, Cov(εi, εj) = 0 for all i ≠ j.
5. The residuals, εi, follow a normal distribution.
6. The variance of the residuals is constant for all values of Xi. When the variance of the residuals is
constant for different values of Xi, it is called homoscedasticity. A non-constant variance of residuals
is called heteroscedasticity.

37
Multiple Linear Regression

38
Multiple Linear Regression
An extension of simple linear regression to multiple independent variables.

Y = β0 + β1·X1 + β2·X2 + ⋯ + βn·Xn + ε

Yi = β0 + β1·X1i + β2·X2i + ⋯ + βn·Xni + εi

• Yi is the value of the i-th data point's dependent variable. Also known as the response variable or outcome variable.
• Xni is the value of the i-th data point's n-th independent variable. Also known as a feature or predictor variable.
• εi is the random error.
• β0, β1, β2, …, βn are the regression parameters.

Y = β0 + β1·X + β2·Y² + β3·Z³ + ⋯ + βn·Gⁿ + ε
Is this a linear equation? 39
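A minimal sketch of fitting such a model with ordinary least squares via NumPy (toy data generated from known coefficients β0=1, β1=2, β2=−3 with no noise, so the fit should recover them):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))          # two independent variables, 50 points
y = 1 + 2 * X[:, 0] - 3 * X[:, 1]     # Y = b0 + b1*X1 + b2*X2, noiseless

# Prepend a column of ones so beta[0] plays the role of the intercept b0.
design = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
print(np.round(beta, 6))  # approximately [1, 2, -3]
```

With real (noisy) data the recovered coefficients would only approximate the true ones, which is exactly where the validation measures below come in.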
Examples of Multiple Linear Regression
The treatment cost of a cardiac patient may depend on factors such as
• age,
• past medical history,
• body weight,
• blood pressure, and so on.

The salary of MBA students at the time of graduation may depend on factors such as
• their academic performance,
• prior work experience, communication skills,
• whether they carefully attended the lecture on Business Applications of AI & ML Techniques.

The market share of a brand may depend on factors such as price, promotion expenses,
competitors' prices, etc.

40
Steps in building a Regression Model

41
Pre-process the Data
1) Data Quality (measured through several characteristics such as
completeness, correctness, etc.): data completeness refers to the availability of
the data necessary for developing the model.
2) Missing Data: many variables may have missing values. The data scientist
has to come up with a strategy to handle missing values, such as data
imputation, and specific techniques to carry out the imputation.
3) Handling Qualitative Variables: qualitative (categorical) variables
need to be converted using dummy variables before incorporating them into the
regression model.
4) Derive new variables (such as ratios and interaction variables), which may
have a better association with the dependent variable.

42
Data, Data, Data…
Example variables: distance, income, birthplace, height, review score, happiness score,
fuel economy, bedrooms, weight, name, hair color, price, IQ, area.

Data splits into:

Categorical (also called qualitative variables)
• Nominal Data: names or labels.
• Ordinal Data: has count and order, but it cannot be measured.

Numeric
• Discrete Data: e.g., number of cows.
• Continuous Data: e.g., average number of cows.

Good read: https://www.formpl.us/blog/categorical-numerical-data
43
Regression Models with Categorical Variables
What do you think could be the challenge?

Let's encode: HS : 1, G : 2, PG : 3, NA : 4

S. No. | Education | Salary
1  | HS (1) | 9800
2  | HS (1) | 10200
3  | HS (1) | 14200
4  | HS (1) | 21000
5  | HS (1) | 16500
6  | HS (1) | 19210
7  | HS (1) | 9700
8  | HS (1) | 11000
9  | HS (1) | 7800
10 | HS (1) | 8800
11 | G (2) | 17200
12 | G (2) | 17600
13 | G (2) | 17650
14 | G (2) | 19600
15 | G (2) | 16700
16 | G (2) | 16700
17 | G (2) | 17500
18 | G (2) | 15000
19 | PG (3) | 18500
20 | PG (3) | 19700
21 | PG (3) | 21000
22 | PG (3) | 19400
23 | PG (3) | 18800
24 | PG (3) | 21000
25 | NA (4) | 6500
26 | NA (4) | 7200
27 | NA (4) | 7700
28 | NA (4) | 5600
29 | NA (4) | 8000
30 | NA (4) | 9300

[Scatter plot: Salary (0–25000) vs encoded Education (0–6).]

Do you see the problem?
44
Regression Models with Qualitative Variables
Since the scale of a categorical variable is neither interval nor ratio, we cannot
include it directly in the model; doing so will result in model misspecification.

What to do?
We have to pre-process the categorical variables
using dummy variables before building a
regression model.

Dummy Variable…
What are you talking about? 45
Regression Models with Qualitative Variables
Note that building the model Y = β0 + β1 × Education would be incorrect.
We have to use 3 dummy variables, since there are 4 categories of educational qualification.

SNo | Education | HS | G | PG | Salary
1  | 1 | 1 | 0 | 0 | 9800
2  | 1 | 1 | 0 | 0 | 10200
3  | 1 | 1 | 0 | 0 | 14200
4  | 1 | 1 | 0 | 0 | 21000
5  | 1 | 1 | 0 | 0 | 16500
6  | 1 | 1 | 0 | 0 | 19210
7  | 1 | 1 | 0 | 0 | 9700
8  | 1 | 1 | 0 | 0 | 11000
9  | 1 | 1 | 0 | 0 | 7800
10 | 1 | 1 | 0 | 0 | 8800
11 | 2 | 0 | 1 | 0 | 17200
12 | 2 | 0 | 1 | 0 | 17600
13 | 2 | 0 | 1 | 0 | 17650
14 | 2 | 0 | 1 | 0 | 19600
15 | 2 | 0 | 1 | 0 | 16700
16 | 2 | 0 | 1 | 0 | 16700
17 | 2 | 0 | 1 | 0 | 17500
18 | 2 | 0 | 1 | 0 | 15000
19 | 3 | 0 | 0 | 1 | 18500
20 | 3 | 0 | 0 | 1 | 19700
21 | 3 | 0 | 0 | 1 | 21000
22 | 3 | 0 | 0 | 1 | 19400
23 | 3 | 0 | 0 | 1 | 18800
24 | 3 | 0 | 0 | 1 | 21000
25 | 4 | 0 | 0 | 0 | 6500
26 | 4 | 0 | 0 | 0 | 7200
27 | 4 | 0 | 0 | 0 | 7700
28 | 4 | 0 | 0 | 0 | 5600
29 | 4 | 0 | 0 | 0 | 8000
30 | 4 | 0 | 0 | 0 | 9300

Y = β0 + β1·HS + β2·G + β3·PG + ε

The fourth category (none), for which we did not create an explicit dummy variable, is called
the base category. 46
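A minimal sketch of the encoding, using the slide's education levels with "NA" as the base category (pandas' `get_dummies` with `drop_first` does the same job at scale):

```python
def dummy_encode(category, levels=("HS", "G", "PG")):
    """One 0/1 dummy per non-base level; the base category ('NA' here,
    i.e. anything not in `levels`) maps to all zeros."""
    return [1 if category == level else 0 for level in levels]

print(dummy_encode("HS"))  # [1, 0, 0]
print(dummy_encode("PG"))  # [0, 0, 1]
print(dummy_encode("NA"))  # [0, 0, 0] -> base category
```

Note there are only 3 dummies for 4 categories: the base category is represented by all dummies being zero, which avoids redundant (perfectly collinear) columns.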
Something Interesting…
Dependent variable: Enjoyment
Independent variables: Food (Pizza, Salad, Idli, Ice-cream) and Dressing (Oregano, Ketchup, Chocolate-sauce)

Do you like chocolate sauce on your food?

Enjoyment | Pizza | Salad | Idli | Ice-cream
Oregano |   |   |   |
Ketchup |   |   |   |
Chocolate-sauce |   |   |   |

Interaction: an interaction occurs when an independent variable has a different effect on the
outcome depending on the value of another independent variable.
47
Interaction Variables in Regression Models
Another example: the interaction between adding sugar to coffee and stirring the coffee. Neither of the two
individual variables has much effect on sweetness, but the combination of the two does.
Interaction variables are basically variables included in the regression model that are a product of two
independent variables (such as X1·X2).

- Gender = 1 denotes female and 0 denotes male
- WE is the work experience in number of years

Salary = β0 + β1·WE + β2·Gender + β3·WE × Gender

S. No. | Gender | WE | Salary
1  | 1 | 2 | 6800
2  | 1 | 3 | 8700
3  | 1 | 1 | 9700
4  | 1 | 3 | 9500
5  | 1 | 4 | 10100
6  | 1 | 6 | 9800
7  | 0 | 2 | 14500
8  | 0 | 3 | 19100
9  | 0 | 4 | 18600
10 | 0 | 2 | 14200
11 | 0 | 4 | 28000
12 | 0 | 3 | 25700
13 | 0 | 1 | 20350
14 | 0 | 4 | 30400
15 | 0 | 1 | 19400
16 | 0 | 2 | 22100
17 | 0 | 1 | 20200
18 | 0 | 1 | 17700
19 | 0 | 6 | 34700
20 | 0 | 7 | 38600
21 | 0 | 7 | 39900
22 | 0 | 7 | 38300
23 | 0 | 3 | 26900
24 | 0 | 4 | 31800
25 | 1 | 5 | 8000
26 | 1 | 5 | 8700
27 | 1 | 3 | 6200
28 | 1 | 3 | 4100
29 | 1 | 2 | 5000
30 | 1F 800
48
https://www.medicine.mcgill.ca/epidemiology/joseph/courses/EPIB-621/interaction.pdf
https://www.theanalysisfactor.com/interpreting-interactions-in-regression/
Salary = β0 + β1·WE + β2·Gender + β3·WE × Gender
The fitted regression equation is
Y = 13442.895 – 7757.75 Gender + 3523.547 WE – 2913.908 Gender × WE

The equation can be rewritten as:

For Female (Gender = 1):
Y = 13442.895 – 7757.75 + (3523.547 – 2913.908) × WE
Y = 5685.145 + 609.639 × WE

For Male (Gender = 0):
Y = 13442.895 + 3523.547 × WE

That is, the change in salary when WE increases by one year is 609.639 for females and
3523.547 for males.
That is, the salary of male workers increases at a higher rate compared to female workers.
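Plugging the fitted coefficients from the slide into the model confirms the two group-specific slopes (a sketch; only the coefficients above are used):

```python
# Fitted coefficients from the slide's regression output.
b0, b_we, b_gender, b_interaction = 13442.895, 3523.547, -7757.75, -2913.908

def salary(we, gender):
    return b0 + b_we * we + b_gender * gender + b_interaction * we * gender

# Slope for each group = predicted salary change per extra year of experience.
male_slope = salary(1, 0) - salary(0, 0)
female_slope = salary(1, 1) - salary(0, 1)
print(round(male_slope, 3))    # 3523.547
print(round(female_slope, 3))  # 609.639
```

Without the interaction term, the model would force both groups to share one slope; the product term WE × Gender is what lets the slope differ by gender.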
Validating a Multiple Linear Regression Model – Mathematically
Coefficient of determination (R-squared or R²): the ratio of explained variation to the
total variation of the dependent variable.

Yi = Ŷi + εi

Total variation:              SST = Σ (Yi − Ȳ)²  — Sum of Squares, Total
Variation explained by model: SSR = Σ (Ŷi − Ȳ)²  — Sum of Squares due to Regression
Variation not explained:      SSE = Σ (Yi − Ŷi)² — Sum of Squared Errors, the unexplained variation
(all sums over i = 1…n)

R² = SSR / SST = Σ (Ŷi − Ȳ)² / Σ (Yi − Ȳ)² = (SST − SSE) / SST

R² measures how much of the variation in the dependent variable can be
explained by taking the independent variables into account. 50
How to improve the model's performance
i.e., getting a higher coefficient of determination (R-squared or R²)

Y = β0 + β1·X1 + β2·X2 + ε
Consider a researcher interested in predicting a mouse's body length (L). She considers
• the mouse's body weight (W) and
• the mouse's tail length (T)
as independent variables:

L = 0.1 + 0.7·W + 0.5·T + ε

What happens if the mouse's tail length has no impact on its body length?
L = 0.5 + 0.8·W + 0·T + ε

What if the researcher considers more variables in her study, say:
- the kind of food the mouse eats
- the age of the mouse
- the number of rooms in the house where the mouse lives
- the density of the walls of the house where the mouse lives
51
https://www.youtube.com/watch?v=nk2CQITm_eo

L = β0 + β1·W + β2·T + β3·Food + β4·Age + β5·Rooms + β6·Density + ε′

L = β0 + β1·W + β2·T + ε

Due to chance events we might get ε > ε′.

R² = SSR / SST = (SST − SSE) / SST = 1 − SSE / SST

A problem with R² is that it always increases when more variables are added to the model, even if
those variables are only weakly associated with the response (James et al. 2014). A solution is to adjust R²
by taking into account the number of predictor variables.

The "Adjusted R Square" value in the summary output is a correction for the number of x
variables included in the prediction model. 52
Adjusted Coefficient of Determination (Adjusted R²)

Adjusted R² = 1 − [ SSE / (n − k − 1) ] / [ SST / (n − 1) ]

n : number of data points
k : number of independent variables

53
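A quick sketch of the formula with hypothetical SSE/SST numbers, showing how the penalty grows with the number of predictors k:

```python
def adjusted_r_squared(sse, sst, n, k):
    """Adjusted R-squared: like R-squared, but each extra predictor k
    shrinks the degrees of freedom (n - k - 1) and thus the score."""
    return 1 - (sse / (n - k - 1)) / (sst / (n - 1))

# Same fit quality (SSE/SST), different numbers of predictors.
print(adjusted_r_squared(sse=10.0, sst=100.0, n=30, k=2))   # ~0.8926
print(adjusted_r_squared(sse=10.0, sst=100.0, n=30, k=10))  # ~0.8474
```

Plain R² would be 0.90 in both cases; the adjusted version rewards the model that achieves the same error with fewer variables.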
See you next time

54
3rd Lecture

55
Case Study – IPL Player Pricing
Case Study Document: https://hbsp.harvard.edu/tu/e6586d37
Case study Data: https://hbsp.harvard.edu/tu/8375b172

Can we predict the price of an IPL player using historical data or features about
the player?

What are some relevant features?

Can we use linear regression for price prediction?

56
Features (independent variables) for IPL player price…

Can you identify the most significant features?

57
Let's see the code…

58
Multicollinearity

The existence of high correlation between independent variables is called multicollinearity.

The presence of multicollinearity can destabilize a multiple linear regression model.

59
Why is Multicollinearity a Potential Problem?
A key goal of regression analysis is to isolate the relationship between each independent variable and the
dependent variable.

The interpretation of a regression coefficient is that it represents the mean change in the dependent variable
for each 1-unit change in an independent variable when you hold all of the other independent variables
constant.

Multicollinearity Types

Structural multicollinearity: this type occurs when we create a model term using other terms. In other words,
it is a byproduct of the model that we specify rather than being present in the data itself. For example, if you
square the term X to model curvature, clearly there is a correlation between X and X².

Data multicollinearity: this type of multicollinearity is present in the data itself rather than being an
artifact of our model. Observational experiments are more likely to exhibit this kind of multicollinearity.

https://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/
Checking for Multicollinearity
VIF – Variance Inflation Factor, calculated for every independent variable:

VIFi = 1 / (1 − Ri²)

where Ri² is the coefficient of determination obtained by regressing the independent variable Xi
on all the other independent variables.

• VIFs start at 1 and have no upper limit.
• A value of 1 indicates that there is no correlation between this independent variable and any others.
• VIFs between 1 and 5 suggest that there is a moderate correlation, but it is not severe enough to
warrant corrective measures.
• VIFs greater than 5 represent critical levels of multicollinearity, where the coefficients are poorly
estimated and the p-values are questionable.

https://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/
62
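The VIF formula is a one-liner; the Ri² values below are hypothetical, chosen to land on the rule-of-thumb thresholds:

```python
def vif(r_squared_i):
    """Variance inflation factor for predictor i, given the R-squared of
    regressing X_i on all the remaining independent variables."""
    return 1 / (1 - r_squared_i)

print(round(vif(0.0), 6))  # 1.0  -> no correlation with other predictors
print(round(vif(0.8), 6))  # 5.0  -> at the "critical" threshold
print(round(vif(0.9), 6))  # 10.0 -> severe multicollinearity
```

In practice you would compute each Ri² by actually fitting that auxiliary regression (statsmodels, for example, ships a `variance_inflation_factor` helper for this).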
Extra Slides

63
If linear regression identifies linear relationships, then it should fail for

Y = β0 + β1·(X²/5) + ε

X²/5 = [0, 0.2, 0.8, 1.8, 3.2, 5, 7.2, 9.8, 12.8, 16.2]

Y = [1, 3, 2, 5, 7, 8, 8, 9, 10, 12]

Fitting gives β1 = 0.59392553.

Why didn't it fail?

64
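Running the closed-form least-squares slope on the transformed feature X²/5 reproduces the slide's coefficient: the model is non-linear in X but still linear in the parameters β0 and β1, so ordinary least squares works unchanged.

```python
# The transformed feature X^2/5 and the responses, both from the slide.
features = [0, 0.2, 0.8, 1.8, 3.2, 5, 7.2, 9.8, 12.8, 16.2]
ys = [1, 3, 2, 5, 7, 8, 8, 9, 10, 12]

n = len(features)
f_bar = sum(features) / n
y_bar = sum(ys) / n

# Same closed-form slope estimate as for plain X; the "feature" just
# happens to be a non-linear transform of the original variable.
beta1 = sum((f - f_bar) * (y - y_bar) for f, y in zip(features, ys)) / \
        sum((f - f_bar) ** 2 for f in features)
print(round(beta1, 8))  # ~0.5939255, matching the slide's beta1
```

This is the standard trick for capturing curvature with linear regression: transform the inputs, keep the model linear in the coefficients.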
