
Regression Models

BMSCE - ME
MCL - Python
Data Types

Continuous Values
• Variable that has infinite possible values
• Ex: Distance, Speed, Age
• Models: Linear Regression, SVM, Random Forest

Discrete/Categorical Values
• Variable that has finite possible values
• Ex: Gender, Country, Class
• Models: Logistic Regression, SVM, Decision Trees, Random Forest, Perceptron, kNN

[Figure: Average Temperature in Bengaluru. X axis: years 1970 to 2020. Y axis: 22.5 to 27.5.]

Data Relationships

• Deterministic Relationship
• Statistical Relationship

Can you try this?

 x    y          x    y
 2    5          2    8.49
 4    9          4   12.31
 1    3          1    6.20
 3    7          3   10.61
 7   15          7   18.07
 8   17          8   20.45
10   21         10   24.96
13   27         13   30.05
 5   11          5   14.73

Can you come up with the equation for each table?
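One way to check an answer to the question above is to fit a straight line to each table. The sketch below assumes NumPy is available and uses `np.polyfit`; the variable names are illustrative and not from the slides.

```python
import numpy as np

# Data copied from the two tables (same x values, different y values).
x = np.array([2, 4, 1, 3, 7, 8, 10, 13, 5], dtype=float)
y_first = np.array([5, 9, 3, 7, 15, 17, 21, 27, 11], dtype=float)
y_second = np.array([8.49, 12.31, 6.20, 10.61, 18.07, 20.45, 24.96, 30.05, 14.73])

# np.polyfit with deg=1 returns [slope, intercept] of the least-squares line.
slope_first, intercept_first = np.polyfit(x, y_first, deg=1)
slope_second, intercept_second = np.polyfit(x, y_second, deg=1)

# Residuals: vertical gaps between the data and the fitted line.
resid_first = y_first - (slope_first * x + intercept_first)
resid_second = y_second - (slope_second * x + intercept_second)

print(f"first table : y = {slope_first:.2f} x + {intercept_first:.2f}, "
      f"largest residual = {np.max(np.abs(resid_first)):.2e}")
print(f"second table: y = {slope_second:.2f} x + {intercept_second:.2f}, "
      f"largest residual = {np.max(np.abs(resid_second)):.2e}")
```

The first table is reproduced exactly by y = 2x + 1 (residuals are zero up to rounding), while the second table can only be approximated by a line, which is what separates a deterministic relationship from a statistical one.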

Visualize the data

X axis: Area of the house
Y axis: Price of the house

Is there a trend?

Trend

Error

Error

How does it work?

• Least Squares Method
• Gradient Descent
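For a single predictor, the least squares method has a closed-form solution. The sketch below is one common way to write it, assuming the model y = beta0 + beta1 * x; the function name `least_squares_line` is made up for illustration.

```python
import numpy as np

def least_squares_line(x, y):
    """Return (beta0, beta1) minimising the sum of squared errors for y = beta0 + beta1 * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: co-variation of x and y divided by the variation of x.
    beta1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the fitted line passes through the point of means.
    beta0 = y_mean - beta1 * x_mean
    return beta0, beta1

# The noisy x/y table from the earlier slide.
x = [2, 4, 1, 3, 7, 8, 10, 13, 5]
y = [8.49, 12.31, 6.20, 10.61, 18.07, 20.45, 24.96, 30.05, 14.73]
beta0, beta1 = least_squares_line(x, y)
print(f"y = {beta0:.2f} + {beta1:.2f} x")
```

Unlike gradient descent, this needs no learning rate or iteration count, but it relies on a formula that is specific to the squared-error cost.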

Gradient Descent: Simple Linear Regression

[Figure: Error Function plotted against β0 and β1]
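A minimal sketch of gradient descent for simple linear regression follows. It assumes the error function is the mean squared error of y = beta0 + beta1 * x; the learning rate, iteration count, and function name are illustrative choices, not values from the slides.

```python
import numpy as np

def gradient_descent_slr(x, y, learning_rate=0.01, n_iterations=10_000):
    """Estimate beta0 and beta1 by repeatedly stepping against the gradient of the MSE."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    beta0, beta1 = 0.0, 0.0  # arbitrary starting point
    for _ in range(n_iterations):
        error = (beta0 + beta1 * x) - y          # prediction minus target
        grad_beta0 = (2.0 / n) * np.sum(error)   # d(MSE)/d(beta0)
        grad_beta1 = (2.0 / n) * np.sum(error * x)  # d(MSE)/d(beta1)
        beta0 -= learning_rate * grad_beta0
        beta1 -= learning_rate * grad_beta1
    return beta0, beta1

# The exact x/y table from the earlier slide (y = 2x + 1).
x = [2, 4, 1, 3, 7, 8, 10, 13, 5]
y = [5, 9, 3, 7, 15, 17, 21, 27, 11]
beta0, beta1 = gradient_descent_slr(x, y)
print(f"beta0 = {beta0:.4f}, beta1 = {beta1:.4f}")
```

The learning rate has to suit the scale of x: with these values, a rate much above 0.02 makes the updates diverge, which is one reason features are often rescaled before gradient descent.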
What is Regression?

• Predictive modelling technique
• Models the relationship between a dependent (target) variable and one or more independent (predictor) variables
• This technique is used for:
o Forecasting
o Time series modelling
o Finding the cause-and-effect relationship between the variables
• Types:
o Linear Regression
o Multiple Linear Regression
o Polynomial Regression
o Lasso, Ridge, Stepwise, Elastic Net

[Figure: Regression Analysis]
Simple Linear Regression

o Dependent Variable (Y): Continuous

o Independent Variables (X): Continuous or Discrete

o Dependent vs Independent: Linear (straight line)

o Equation:

o Cost Function:

o Issues: Extremely sensitive to outliers

o Metric: SSE, RMSE, R squared
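The three metrics listed above can be computed directly from targets and predictions. This sketch assumes the usual definitions (SSE as the sum of squared errors, RMSE as the square root of the mean squared error, R squared as 1 - SSE/SST); the function name is illustrative.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (SSE, RMSE, R squared) for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    sse = float(np.sum((y_true - y_pred) ** 2))           # sum of squared errors
    rmse = float(np.sqrt(sse / len(y_true)))              # root mean squared error
    sst = float(np.sum((y_true - y_true.mean()) ** 2))    # total sum of squares
    r_squared = 1.0 - sse / sst
    return sse, rmse, r_squared

# Fit a line to the noisy x/y table from the earlier slide and score it.
x = np.array([2, 4, 1, 3, 7, 8, 10, 13, 5], dtype=float)
y = np.array([8.49, 12.31, 6.20, 10.61, 18.07, 20.45, 24.96, 30.05, 14.73])
slope, intercept = np.polyfit(x, y, deg=1)
sse, rmse, r_squared = regression_metrics(y, slope * x + intercept)
print(f"SSE = {sse:.3f}, RMSE = {rmse:.3f}, R squared = {r_squared:.4f}")
```

Because every error is squared, a single point far from the line can dominate SSE, which is the outlier sensitivity mentioned above.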


Bias-Variance Tradeoff

BIAS:
How well the model fits the data

VARIANCE:
How much the model changes based on changes in the inputs

Simpler models: Stable (low variance) but they don't get close to the truth (high bias).

Complex models: More prone to being overfit (high variance) but they are expressive enough to get close to the truth (low bias).
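The tradeoff can be seen by fitting polynomials of increasing degree to the same noisy data and comparing errors on the training data with errors on fresh data. Everything in this sketch is an assumption for illustration: the sine-shaped target, the noise level, the sample sizes, and the degrees.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Synthetic data: a sine curve plus Gaussian noise (made up for this sketch)."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(scale=0.2, size=n)
    return x, y

x_train, y_train = make_data(30)
x_test, y_test = make_data(30)

def mse(coeffs, x, y):
    """Mean squared error of a polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

results = {}
for degree in (1, 3, 8):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(f"degree {degree}: train MSE = {results[degree][0]:.4f}, "
          f"test MSE = {results[degree][1]:.4f}")
```

Training error can only go down as the degree grows, since each higher-degree model contains the lower-degree ones. Test error typically stops improving at some point, and the gap between the two is a rough sign of variance.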
80-20 Split Method

Complete Dataset, split at random into:
• Training Data Set (80%): used to fit the Model, gives the Training Error
• Testing Data Set (20%): used to evaluate the Model, gives the Testing Error
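A random 80-20 split can be sketched with a shuffled index array. The helper name `split_80_20` and the seed are assumptions; the data is the noisy x/y table from the earlier slide.

```python
import numpy as np

def split_80_20(x, y, test_fraction=0.2, seed=0):
    """Shuffle the sample indices, then hold out test_fraction of them for testing."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    n_test = int(round(test_fraction * len(x)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return x[train_idx], x[test_idx], y[train_idx], y[test_idx]

x = [2, 4, 1, 3, 7, 8, 10, 13, 5]
y = [8.49, 12.31, 6.20, 10.61, 18.07, 20.45, 24.96, 30.05, 14.73]
x_train, x_test, y_train, y_test = split_80_20(x, y)

# Fit on the training set only, then measure both errors.
slope, intercept = np.polyfit(x_train, y_train, deg=1)
train_error = float(np.mean((slope * x_train + intercept - y_train) ** 2))
test_error = float(np.mean((slope * x_test + intercept - y_test) ** 2))
print(f"training error = {train_error:.4f}, testing error = {test_error:.4f}")
```

scikit-learn offers the same idea as `sklearn.model_selection.train_test_split`, which is the more usual choice in practice.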

Cross Validation (K-fold)

[Figure: RUN 1 to RUN K. In each run a different fold of the data is the Test set and the remaining folds are the Train set.]
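The runs in K-fold cross validation can be generated by cutting a shuffled index array into K folds and letting each fold take a turn as the test set. The generator name `k_fold_splits`, the choice K = 3, and the seed are illustrative assumptions.

```python
import numpy as np

def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs, one per run, with each fold used once for testing."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# The noisy x/y table from the earlier slide.
x = np.array([2, 4, 1, 3, 7, 8, 10, 13, 5], dtype=float)
y = np.array([8.49, 12.31, 6.20, 10.61, 18.07, 20.45, 24.96, 30.05, 14.73])

fold_mse = []
for train_idx, test_idx in k_fold_splits(len(x), k=3):
    slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)
    predictions = slope * x[test_idx] + intercept
    fold_mse.append(float(np.mean((predictions - y[test_idx]) ** 2)))

cv_mse = float(np.mean(fold_mse))
print(f"per-run test MSE: {fold_mse}")
print(f"cross-validated MSE: {cv_mse:.4f}")
```

Averaging over the runs gives an error estimate that depends less on one lucky or unlucky split than a single 80-20 split does.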

Leave one out cross validation

[Figure: RUN 1 to RUN K over samples 1, 2, 3, 4, 5, ..., N. Each run holds out one sample for testing and trains on the rest.]
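Leave-one-out cross validation is K-fold with K equal to the number of samples. A minimal sketch, assuming a straight-line model and the noisy x/y table from the earlier slide:

```python
import numpy as np

x = np.array([2, 4, 1, 3, 7, 8, 10, 13, 5], dtype=float)
y = np.array([8.49, 12.31, 6.20, 10.61, 18.07, 20.45, 24.96, 30.05, 14.73])

squared_errors = []
for i in range(len(x)):
    # Train on every sample except sample i.
    x_train = np.delete(x, i)
    y_train = np.delete(y, i)
    slope, intercept = np.polyfit(x_train, y_train, deg=1)
    # Test on the single held-out sample.
    prediction = slope * x[i] + intercept
    squared_errors.append(float((prediction - y[i]) ** 2))

loocv_mse = float(np.mean(squared_errors))
print(f"runs: {len(squared_errors)}, leave-one-out MSE: {loocv_mse:.4f}")
```

The model is refitted once per sample, so the cost grows with the size of the dataset; that is the usual reason to prefer a smaller K on large data.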

Multiple Linear Regression

o Dependent Variable (Y): Continuous

o Independent Variables (X1, X2, ..., Xn): Continuous or Discrete

o Dependent vs Independent: Linear (hyperplane)

o Equation:

o Cost Function:

o Challenges: Extremely sensitive to outliers

o Metric: SSE, RMSE, R squared
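With several predictors, the least-squares fit can be computed on a design matrix that has one column per variable plus a column of ones for the intercept. The data below is made up (two predictors on an exact hyperplane, no noise) so that the recovered coefficients are easy to check; `np.linalg.lstsq` is used as one possible solver.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 50

# Hypothetical predictors X1 and X2, and a target lying exactly on a plane.
X = rng.uniform(0.0, 10.0, size=(n_samples, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]

# Prepend a column of ones so the first coefficient is the intercept.
design = np.column_stack([np.ones(n_samples), X])

# Least-squares solution of design @ coeffs = y.
coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
print("intercept, coefficient of X1, coefficient of X2:", np.round(coeffs, 4))
```

With noisy real data the coefficients would only be estimates, and the same SSE, RMSE, and R squared metrics apply.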


Dummy Variable

Original data:

AREA   FLOOR   ROOM   CODE   PRICE
1000   7       2      B      5617
1030   7       1      A      5201
1060   1       1      A      4779
1090   6       1      A      5424
1120   0       2      B      5657

After conversion:

AREA   FLOOR   ROOM   A   B   PRICE
1000   7       2      0   1   5617
1030   7       1      1   0   5201
1060   1       1      1   0   4779
1090   6       1      1   0   5424
1120   0       2      0   1   5657

• Categorical columns are hard to understand for regression models
• CODE is a categorical column
• Convert all the categorical columns to “DUMMY” variables
• Create n columns (one for each category in the column)
• Populate them with 0s and 1s
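The conversion described above can be sketched in plain Python. The helper name `add_dummies` is made up; the rows are the five rows of the table.

```python
rows = [
    {"AREA": 1000, "FLOOR": 7, "ROOM": 2, "CODE": "B", "PRICE": 5617},
    {"AREA": 1030, "FLOOR": 7, "ROOM": 1, "CODE": "A", "PRICE": 5201},
    {"AREA": 1060, "FLOOR": 1, "ROOM": 1, "CODE": "A", "PRICE": 4779},
    {"AREA": 1090, "FLOOR": 6, "ROOM": 1, "CODE": "A", "PRICE": 5424},
    {"AREA": 1120, "FLOOR": 0, "ROOM": 2, "CODE": "B", "PRICE": 5657},
]

def add_dummies(rows, column):
    """Replace a categorical column with one 0/1 column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every column except the categorical one.
        new_row = {key: value for key, value in row.items() if key != column}
        # Add one indicator column per category.
        for category in categories:
            new_row[category] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded, categories

encoded, categories = add_dummies(rows, "CODE")
for row in encoded:
    print(row)
```

In pandas the same step is usually done with `pandas.get_dummies`. When the model also has an intercept, the n indicator columns always sum to 1, so one of them is commonly dropped to avoid perfectly collinear predictors.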
Polynomial Regression

o Dependent Variable (Y): Continuous

o Independent Variables (X): Continuous or Discrete

o Dependent vs Independent: Non Linear

o Equation:

o Cost Function:

o Challenges: Extremely sensitive to outliers

o Metric: SSE, RMSE, R squared
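Polynomial regression is non-linear in x but still linear in the coefficients, so it can be fitted with the same least-squares machinery once powers of x are added as extra columns. The quadratic data below is made up and noise-free so the result can be checked by eye.

```python
import numpy as np

# Hypothetical data on an exact quadratic: y = 3 + 2x + 0.5x^2.
x = np.linspace(-3.0, 3.0, 13)
y = 3.0 + 2.0 * x + 0.5 * x ** 2

# Route 1: np.polyfit returns coefficients with the highest power first.
coeffs = np.polyfit(x, y, deg=2)

# Route 2: build the columns [1, x, x^2] and solve an ordinary least-squares problem.
design = np.column_stack([np.ones_like(x), x, x ** 2])
betas, *_ = np.linalg.lstsq(design, y, rcond=None)

print("polyfit (x^2, x, 1):", np.round(coeffs, 4))
print("lstsq   (1, x, x^2):", np.round(betas, 4))
```

Both routes recover the same three numbers, only in opposite order. Raising the degree makes the curve more flexible, which also makes it easier to overfit.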


Which are SLR Problems?

Example 1: YES
Example 2: YES
Example 3: YES
Example 4: NO
Scikit-Learn

• Python library (installation required)

• Gives access to well-known machine learning algorithms from Python code

• Built on the NumPy and SciPy libraries

• Intel Distribution for Python comes with an optimized sklearn

• Expects data to be stored in a two-dimensional array or matrix of size [n_samples, n_features]

• n_samples: The number of samples. Each sample is an item to process (e.g. classify). A sample can be a document, a picture, a sound, a video, a row in a database or CSV file, or whatever you can describe with a fixed set of quantitative traits.

• n_features: The number of features, or distinct traits, that can be used to describe each item in a quantitative manner. Features are generally real-valued, but may be Boolean or discrete-valued in some cases.
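The [n_samples, n_features] layout matters even with a single feature: a flat list of x values has to be reshaped into one column. The sketch below assumes scikit-learn is installed and fits `LinearRegression` to the exact x/y table from the earlier slide.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([2, 4, 1, 3, 7, 8, 10, 13, 5], dtype=float)
y = np.array([5, 9, 3, 7, 15, 17, 21, 27, 11], dtype=float)

# One feature per sample: reshape (9,) into (n_samples, n_features) = (9, 1).
X = x.reshape(-1, 1)

model = LinearRegression()
model.fit(X, y)

# New inputs must use the same two-dimensional layout.
prediction = model.predict(np.array([[6.0]]))

print("X shape:", X.shape)
print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("prediction for x = 6:", prediction[0])
```

Passing the one-dimensional `x` straight to `fit` raises an error, which is the most common first stumbling block with this layout.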

