
INTRODUCTION TO MACHINE LEARNING

LINEAR AND LOGISTIC REGRESSION + HANDS-ON WITH AZURE ML


LEARNING OBJECTIVES

• Understand linear & logistic regression

• Be able to perform logistic regression with AzureML

• Understand classification performance metrics


LINEAR REGRESSION
LINEAR REGRESSION: UNIVARIATE

We want to predict a continuous variable


LINEAR REGRESSION
We want to predict a continuous variable

Objective: Minimize the Mean Squared Error (MSE):

MSE = (1/n) · Σᵢ (yᵢ − ŷᵢ)²

where yᵢ is the observed value and ŷᵢ is the predicted value.
LINEAR REGRESSION FORMULA
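For reference, the formula (assumed here in its standard multivariate form, matching the beta notation in the note below):

ŷ = β₀ + β₁·x₁ + β₂·x₂ + … + βₖ·xₖ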

“Training” a model means finding the parameters (in this case the beta values) that minimize the error (MSE for linear regression)
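As a minimal sketch of what this training step does (not part of the course's Azure ML hands-on; assumes numpy), the betas that minimize the MSE can be found with ordinary least squares:

import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 100)
y = 3.0 + 2.0 * x + rng.normal(0, 1.0, 100)    # true betas: beta0=3, beta1=2, plus noise

X = np.column_stack([np.ones_like(x), x])      # prepend a column of 1s for beta0
betas, *_ = np.linalg.lstsq(X, y, rcond=None)  # minimizes ||X·betas − y||², i.e. the MSE
y_hat = X @ betas

print("betas:", betas)                         # close to [3.0, 2.0]
print("MSE:", np.mean((y - y_hat) ** 2))       # close to the noise variance (1.0)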
LINEAR REGRESSION METRICS
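The metric definitions on these slides are assumed here to be the standard ones:

MSE  = (1/n) · Σᵢ (yᵢ − ŷᵢ)²              (mean squared error)
RMSE = √MSE                               (root mean squared error, in the units of y)
MAE  = (1/n) · Σᵢ |yᵢ − ŷᵢ|               (mean absolute error)
R²   = 1 − Σᵢ (yᵢ − ŷᵢ)² / Σᵢ (yᵢ − ȳ)²   (fraction of variance explained; 1 is a perfect fit)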
LOGISTIC REGRESSION

We want to predict a binary (0/1) variable.

Modeling y directly as a linear combination, y = β₀ + β₁·x₁ + … + βₖ·xₖ, doesn't work because the right side is unconstrained (-infinity to +infinity) while y can only be 0 or 1.

Instead, we will predict a probability p and model its log-odds:

Ln(p / (1 − p)) = β₀ + β₁·x₁ + … + βₖ·xₖ

This works because the left side is unconstrained too (-infinity to +infinity): as p goes from 0 to 1, Ln(p / (1 − p)) covers the whole range.
LOGISTIC REGRESSION FORMULA
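Solving the log-odds equation above for p gives the standard logistic regression formula (assumed to match the slide's graphic):

p = 1 / (1 + e^−(β₀ + β₁·x₁ + … + βₖ·xₖ))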
LOGISTIC REGRESSION – LOSS FUNCTION

• We predict a probability "p" of the outcome "y" being equal to 1

• We penalize being wrong

• The cost function is:

  • −Ln(p) when y=1 (penalty high if p close to 0). Since 0 <= p <= 1, the log is a negative number; that is why we put the "−" sign in front, to make the cost function positive.
  • −Ln(1 − p) when y=0 (penalty high if p close to 1)

• The algorithm seeks to minimize the cost function
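The two cases combine into a single expression, the standard log-loss (cross-entropy):

Cost(p, y) = −[ y·Ln(p) + (1 − y)·Ln(1 − p) ]

When y=1 the second term vanishes, leaving −Ln(p); when y=0 the first term vanishes, leaving −Ln(1 − p).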


IMPORTANT PROPERTIES OF LINEAR MODELS

Applies to both linear regression and logistic regression:

1. As the name implies ("linear"), each input feature has a linear and monotonic impact on the output.
   => These models cannot, on their own, handle non-linear relationships.

2. Each input feature has, in the end, a separate and distinct impact on the output.
   => These models do not, on their own, take into account the relationships between input variables.

3. Linear and logistic regressions can be VERY sensitive to outliers (see link); the sketch below illustrates this.

Applies only to linear regression:

• The output of a linear regression is not bounded (it can go from -infinity to +infinity).
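As a minimal sketch of property 3 (not from the slides; assumes numpy and scikit-learn), a single extreme point is enough to visibly pull the fitted slope:

# Fit a line to clean data, then corrupt one point and refit.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0 + rng.normal(0, 0.5, 50)    # true slope: 2

clean_slope = LinearRegression().fit(X, y).coef_[0]

y_out = y.copy()
y_out[-1] = 200.0                                     # one extreme outlier
outlier_slope = LinearRegression().fit(X, y_out).coef_[0]

print(f"slope on clean data:         {clean_slope:.2f}")    # close to 2.0
print(f"slope with a single outlier: {outlier_slope:.2f}")  # pulled well away from 2.0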
HOW TO DEAL WITH CATEGORICAL VARIABLES?
One-hot encoding (also known as creating dummy variables)
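A minimal sketch (in pandas rather than Azure ML, which the hands-on uses) of what one-hot encoding produces:

import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One 0/1 indicator ("dummy") column per category
dummies = pd.get_dummies(df["color"], prefix="color", dtype=int)
print(dummies)
#    color_blue  color_green  color_red
# 0           0            0          1
# 1           0            1          0
# 2           1            0          0
# 3           0            1          0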
DO IT WITH AZURE ML!
CLASSIFICATION PERFORMANCE METRICS: AUC
The ROC curve is built by sweeping the classification threshold from 0.0 to 1.0 and plotting, at each threshold:

TPR = [True Positives] / [Observed Positives]
FPR = [False Positives] / [Observed Negatives]

At Threshold=0.0 every observation is predicted positive (TPR = FPR = 1); at Threshold=1.0 none is (TPR = FPR = 0).

AUC = Area Under the Curve (the ROC curve)

[Figure: confusion matrices (Predicted 1/0 vs. Observed 1/0) at Threshold=0.0 and Threshold=1.0, and the resulting ROC curve]
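A minimal sketch (assumes scikit-learn) of computing the ROC curve and its AUC from predicted probabilities:

from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1, 1, 0]                   # observed 0/1 labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.3]  # predicted probabilities

# roc_curve sweeps the threshold and returns the curve's (FPR, TPR) points
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("AUC =", roc_auc_score(y_true, y_score))       # 0.9375 here; 1.0 = perfect, 0.5 = random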
CLASSIFICATION PERFORMANCE METRICS: RECALL, PRECISION, F1
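The definitions (assumed here to be the standard ones this slide covers), with TP/FP/FN = true positives, false positives, false negatives:

Precision = TP / (TP + FP)   (of the predicted positives, how many are right)
Recall    = TP / (TP + FN)   (of the observed positives, how many are found)
F1        = 2 · (Precision · Recall) / (Precision + Recall)   (harmonic mean of the two)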
OVERFIT, HOLD-OUT DATA AND CROSS-VALIDATION

Overfit is when a model begins to "memorize" the training data rather than "learning" to generalize from a trend.

One way to measure overfit is to train on one dataset but assess model performance on a different dataset, called the testing or hold-out dataset => that's why we SPLIT the data into train and test.

A slightly more sophisticated version is to use cross-validation, which splits an initial dataset into pairs of (train, test) folds (a minimal code sketch follows below).

In general, for a fixed-size dataset, the more variables you have, the higher the risk of overfit => if it doesn't hurt performance, fewer variables is better.
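A minimal sketch (assumes scikit-learn; the hands-on itself uses Azure ML) of both ideas:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 1. Hold-out: fit on the train split, assess on the unseen test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("hold-out accuracy:", model.score(X_test, y_test))

# 2. Cross-validation: 5 (train, test) folds, one score per fold
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("accuracy per fold:", scores)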
