
Unit 4.

Regression:
Linear regression and logistic regression

4. Classification
Assoc. Prof. Nguyen Manh Tuan
AGENDA

 Linear regression
 Logistic regression

10/26/2022 internal use


Linear regression

 Linear regression is not only one of the oldest data science methodologies, but it is also the most easily explained method for demonstrating function fitting.
 The basic idea is to come up with a function that explains and predicts the value of the target variable when given the values of the predictor variables.
 Ex: the effect of the number of rooms in a house (predictor) on its median sale price (target).
 Each data point on the chart corresponds to a house.
 It is evident that, on average, increasing the number of rooms tends to also increase the median price.
 This general statement can be captured by drawing a straight line through the data.


Linear regression

The problem with one predictor:

• Clearly, one can fit an infinite number of straight lines through a given set of points.
• How does one know which one is the best?
• A metric is needed, one that helps quantify the different straight-line fits through the data
 selecting the best line then becomes a matter of finding the optimum value for this quantity.



Linear regression

 A commonly used metric is the concept of an error function.

 In the single-predictor case, the predicted value, ŷ, for a value of x that exists in the dataset is given by:
ŷ = b0 + b1x
 The error (at a single location (x, y) in the dataset) is simply the difference between the actual target value and the predicted target value:
e = y - ŷ
 The errors at all existing points can be computed to come up with an aggregate error.
 The differences can be squared to eliminate the sign bias, and an average error for a given fit can be calculated as:
J = (1/n) Σ (y - ŷ)²
n: number of points in dataset; J: total squared error

By minimizing J, the best (b0, b1) can be found for a given dataset.
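The error computation above can be sketched in a few lines of plain Python; the rooms/price values here are hypothetical, chosen only to illustrate comparing two candidate lines by their average squared error J:

```python
# A minimal sketch of the squared-error metric J for a candidate line
# ŷ = b0 + b1*x. The rooms/price values are made up, for illustration only.
def predict(x, b0, b1):
    return b0 + b1 * x

def mse(xs, ys, b0, b1):
    # J = (1/n) * sum of squared differences between actual and predicted y
    return sum((y - predict(x, b0, b1)) ** 2 for x, y in zip(xs, ys)) / len(xs)

rooms = [4, 5, 6, 7, 8]
price = [10, 14, 19, 25, 28]   # median prices in $10,000s (hypothetical)

# Of two candidate lines, the one with the lower J is the better fit.
j_close = mse(rooms, price, b0=-9.0, b1=4.7)
j_rough = mse(rooms, price, b0=0.0, b1=3.0)
```

Here `j_close` is smaller than `j_rough`, so the first line is the better of the two fits under this metric.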



Linear regression

 The method of least squares (Stigler, 1999): the coefficients that minimize J can be computed directly as:
b1 = correlation(x, y) × (sy/sx)
b0 = ymean - b1 × xmean
correlation(x, y): correlation between x and y
sy, sx: standard deviations of y and x
xmean and ymean: respective mean values

 Practical linear regression algorithm: gradient descent
 More than one predictor: multiple linear regression (MLR)

 MLR can be applied in any situation where a numeric prediction (e.g., "how much will something sell for?") is required.
 This is in contrast to making categorical predictions such as "will someone buy/not buy" or "will/will not fail," where classification tools such as decision trees or logistic regression models are used.
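The closed-form least-squares expressions can be verified with a short plain-Python sketch, here on toy data lying exactly on the line y = 2x + 1:

```python
# Verifying b1 = correlation(x, y) * (sy/sx) and b0 = ymean - b1 * xmean.
from math import sqrt

def least_squares(xs, ys):
    n = len(xs)
    xm, ym = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xm) ** 2 for x in xs)              # n * variance of x
    syy = sum((y - ym) ** 2 for y in ys)              # n * variance of y
    sxy = sum((x - xm) * (y - ym) for x, y in zip(xs, ys))
    corr = sxy / sqrt(sxx * syy)                      # Pearson correlation
    b1 = corr * sqrt(syy) / sqrt(sxx)                 # corr * (sy / sx)
    b0 = ym - b1 * xm
    return b0, b1

b0, b1 = least_squares([1, 2, 3, 4], [3, 5, 7, 9])    # expect b0 ≈ 1, b1 ≈ 2
```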



Linear regression

The objectives are:


1. Identify which of the several
attributes are required to
accurately predict the median
price of a house.
2. Build a multiple linear
regression model to predict the
median price using the most
important attributes.
13 predictors; 1 target (median price, MEDV)
- Physical characteristics (#rooms; age; tax; location)
- Neighborhood features (schools; industries)



Linear regression

How to Implement
1. Building a linear regression model
2. Measuring the performance of the model
3. Understanding the commonly used options for the Linear Regression operator
4. Applying the model to predict MEDV prices for unseen data

 Linear Regression operator: with parameters (feature selection: none; eliminate collinear features: checked; use bias: checked)
 Boston Housing dataset (506 cases, 14 attributes including the "class"/target attribute) (BKEL)
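Outside RapidMiner, the "practical algorithm" mentioned earlier, gradient descent, can be sketched in plain Python. This is an illustrative toy on single-predictor data, not the Linear Regression operator's actual implementation:

```python
# A toy sketch of linear regression by gradient descent: iteratively step
# b0, b1 opposite the gradient of J = (1/n) * sum((y - (b0 + b1*x))**2).
def fit_gd(xs, ys, lr=0.01, steps=5000):
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = (-2 / n) * sum(y - (b0 + b1 * x) for x, y in zip(xs, ys))
        g1 = (-2 / n) * sum((y - (b0 + b1 * x)) * x for x, y in zip(xs, ys))
        b0 -= lr * g0    # descend: move down the error surface
        b1 -= lr * g1
    return b0, b1

# Data on the exact line y = 2x + 1: descent should recover b0 ≈ 1, b1 ≈ 2.
b0, b1 = fit_gd([1, 2, 3, 4], [3, 5, 7, 9])
```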




Logistic regression
 From a historical perspective, there are two main classes of data science techniques:
 those that evolved from statistics (such as regression), and
 those that emerged from a blend of statistics, computer science, and mathematics (such as classification trees).
 Logistic regression applications have grown steadily in statistical research.
 Linear regression: the key assumption is that both the predictor and target variables are continuous.
 Intuitively, one can state that when x increases, y increases along the slope of the line (ex: advertisement spend and sales).
 What happens if the target variable is not continuous?



Logistic regression
 Suppose the target variable is the response to an advertisement campaign:
 if more than a threshold number of customers buy, then the response is considered to be 1;
 if not, the response is 0. In this case, the target (y) variable is discrete, and a straight line is no longer a good fit, as seen in the chart.
 Although one can still estimate, approximately, that when x (ad spend) increases, y (response/no response to a mailing campaign) also increases, there is no gradual transition:
 the y value abruptly jumps from one binary outcome to the other.
 Thus, the straight line is a poor fit for this data.




Logistic regression
The S-shaped curve is certainly a better fit for the data shown.
If the equation of this "sigmoid" curve is known, then it can be used as effectively as the straight line was used in the case of linear regression.

Logistic regression is the process of obtaining an appropriate nonlinear curve to fit the data when the target variable is discrete.
Logistic regression
 Logistic regression:
 Predictors (xi): (-∞, +∞); Target (y): binomial (yes/no; pass/fail; 0/1)
 If the target variable y is transformed to the logarithm of the odds of y, then the transformed target variable is linearly related to the predictors, x.
 In most cases, y is a yes/no type of response, usually interpreted as the probability of an event happening (y=1) or not happening (y=0).
 If y is an event (response, pass/fail, etc.), and p is the probability of the event happening (y=1), then:
● (1-p): the probability of the event not happening (y=0)
● p/(1-p): the odds of the event happening
 log(p/(1-p)) is linear in the predictors, X, and is called the logit function.
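As a quick numeric sketch of the odds, the logit, and its inverse (the logistic, or sigmoid, function) in plain Python:

```python
# The logit maps a probability p in (0, 1) to the whole real line; its inverse,
# the logistic (sigmoid) function, maps any real number back to a probability.
from math import exp, log

def logit(p):
    return log(p / (1 - p))        # log-odds of the event

def sigmoid(z):
    return 1 / (1 + exp(-z))       # inverse of the logit

odds = 0.8 / (1 - 0.8)             # odds for p = 0.8 are 4 to 1
z = logit(0.8)                     # log(4) ≈ 1.386
p_back = sigmoid(z)                # recovers p = 0.8
```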



Logistic regression
 The logit can be expressed as a linear function of the predictors X, similar to
the linear regression model as:
 More generally, with multiple independent variables, x:
 From the logit, it is easy to then compute the probability of the response y (occurring
or not occurring)
 From the data given the x’s are known, using the two above equations, one
can compute the p for any given x.
 But to do that, the coefficients first need to be determined, b. How is this done?
 Assume that one starts out with a trial of values for b. Given a training data sample,
one can compute the quantity:

y is the original outcome variable (0/1), and p is the probability estimated by the logit equation
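The behavior of this per-sample quantity can be checked numerically; the standard Bernoulli form p^y (1-p)^(1-y) is assumed here, consistent with the 0.1/0.9 behavior the text describes:

```python
# Per-sample likelihood quantity p**y * (1-p)**(1-y) (assumed Bernoulli form):
# large when the estimated p agrees with the actual outcome y, small otherwise.
def likelihood(y, p):
    return p ** y * (1 - p) ** (1 - y)

wrong = likelihood(y=0, p=0.9)   # model confident but wrong -> 0.1
right = likelihood(y=0, p=0.1)   # model right -> 0.9
```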



Logistic regression
 For a specific training sample:
 if the actual outcome was y=0 and the model estimate of p was high (say, 0.9), that is, the model was wrong, then this quantity reduces to 0.1;
 if the model estimate of the probability was low (say, 0.1), that is, the model was good, then this quantity increases to 0.9.
 Therefore, this quantity, which is a simplified form of a likelihood function, is maximized for good estimates and minimized for poor estimates.
 In practice, gradient descent or other nonlinear optimization techniques are used to search for the coefficients, b, with the objective of maximizing the likelihood of correct estimation, summed over all training samples.
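Such a search can be sketched in plain Python as gradient ascent on the summed log-likelihood. This is a minimal illustration, not the optimizer RapidMiner actually uses, and the ad-spend/response data is hypothetical:

```python
# A minimal sketch of fitting logistic regression by gradient ascent on
# sum(y*log(p) + (1-y)*log(1-p)) over the training samples.
from math import exp

def sigmoid(z):
    return 1 / (1 + exp(-z))

def fit_logistic(xs, ys, lr=0.01, steps=5000):
    b0, b1 = 0.0, 0.0
    for _ in range(steps):
        # Gradient of the log-likelihood with respect to b0 and b1.
        g0 = sum(y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys))
        g1 = sum((y - sigmoid(b0 + b1 * x)) * x for x, y in zip(xs, ys))
        b0 += lr * g0    # ascent: move *up* the likelihood surface
        b1 += lr * g1
    return b0, b1

spend = [1, 2, 3, 4, 5, 6]    # hypothetical ad spend
resp  = [0, 0, 1, 0, 1, 1]    # 0/1 response (deliberately not separable)
b0, b1 = fit_logistic(spend, resp)
```

The fitted curve is increasing (b1 > 0): low spend gives an estimated probability below 0.5, high spend above it.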



Logistic regression

How to implement
 Logistic Regression operator
 Titanic dataset (891 samples; 12 attributes) (BKEL)
 German credit (1000 cases, 20 attributes) (BKEL)



Logistic regression

Summary
 Logistic regression can be considered equivalent to using linear regression for situations where the target variable is discrete, that is, not continuous.
 In principle, the response variable or label is binomial (2 categories: Yes/No, Accept/Not Accept, Default/Not Default, and so on).
 Logistic regression is ideally suited for business analytics applications where the target variable is a binary decision (fail/pass, response/no response, etc.).
 Logistic regression comes from the concept of the logit. The logit is the logarithm of the odds of the response (y), expressed as a function of the independent (predictor) variables, x, and a constant term.
 Ex: log(odds of y=Yes) = b1x + b0.
 The predictors can be either numerical or categorical for standard logistic regression. However, in RapidMiner, the predictors can only be numerical.



THE END
