
Unit-3 Regression

• Linear regression: Linear models, A bi-dimensional example, Linear regression and higher dimensionality, Ridge, Lasso and ElasticNet, Robust regression with random sample consensus, Polynomial regression, Isotonic regression.
• Logistic regression: Linear classification, Logistic regression, Implementation and optimizations, Stochastic gradient descent algorithms, Finding the optimal hyperparameters through grid search, Classification metrics, ROC curve.
Linear Regression
• Linear regression is a linear model, that is, a model that assumes a linear relationship between the input variables (x) and the single output variable (y).
• More specifically, y can be calculated from a linear combination of the input variables (x).
• When there is a single input variable (x), the method is referred to as simple linear regression.
• When there are multiple input variables, the statistics literature often refers to the method as multiple linear regression.
Simple Linear Regression
• For a simple regression problem (a single x and a single y), the form of the model would be:

y = b0 + b1 * x1

where y is the dependent variable (DV), x1 is the independent variable (IV), b0 is the constant (intercept) and b1 is the coefficient (slope).
Simple Linear Regression
SALARY EQUATION (₹)

SALARY = b0 + b1 * EXPERIENCE

[Figure: salary (₹) plotted against years of experience. b0 is the salary at zero experience; b1 answers "how much will salary increase for +1 year of experience?" (in the figure, roughly +10K per +1 yr).]
Simple Linear Regression
ANALYZING THE DATASET
[Dataset table: the independent variable (IV) is experience, the dependent variable (DV) is salary.]
Simple Linear Regression
• LET'S CODE!
• Prepare your data preprocessing template
• Import the dataset
• No need to handle missing data
• Split into training and test sets
• Keep feature scaling available, though it is least needed here
• Correlate salaries with experience
• Later, carry out the prediction
• Verify the predicted values
• Predict on the TEST SET (a sketch of these steps follows)
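A minimal sketch of these steps with scikit-learn. The file name Salary_Data.csv and its column names are hypothetical, chosen only to illustrate the workflow:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Import the dataset (hypothetical file with YearsExperience and Salary columns)
dataset = pd.read_csv('Salary_Data.csv')
X = dataset[['YearsExperience']].values   # IV
y = dataset['Salary'].values              # DV

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)

# Fit simple linear regression (no explicit feature scaling needed here)
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predict on the test set and inspect the learned line
y_pred = regressor.predict(X_test)
print(regressor.intercept_, regressor.coef_)   # b0 and b1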
Example-2
• Let’s make this concrete with an example.
Imagine we are predicting weight (y) from
height (x). Our linear regression model
representation for this problem would be:
y = B0 + B1 * x1
or
weight =B0 +B1 * height
• Where B0 is the bias coefficient and B1 is the coefficient for
the height column. We use a learning technique to find a
good set of coefficient values. Once found, we can plug in
different height values to predict the weight.
• For example, let's use B0 = 0.1 and B1 = 0.5. Let's plug them in and calculate the weight (in kilograms) for a person with a height of 182 centimeters:
weight = 0.1 + 0.5 * 182
weight = 91.1
• You can see that the above equation could be plotted as a line
in two-dimensions. The B0 is our starting point regardless of
what height we have.
• We can run through a bunch of heights from 100 to 250 centimeters, plug them into the equation, and get weight values, creating our line (a quick sketch follows).
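A quick sketch of that calculation, using the illustrative coefficients B0 = 0.1 and B1 = 0.5 from the example above:

import numpy as np

B0, B1 = 0.1, 0.5                  # illustrative coefficients from the example
heights = np.arange(100, 251, 10)  # heights from 100 to 250 cm
weights = B0 + B1 * heights        # predicted weights, which form a straight line
print(list(zip(heights, weights)))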
Multi Linear Regression
y = b0 + b1 * x1 + b2 * x2 + ... + bn * xn

where y is the dependent variable (DV), x1 ... xn are the independent variables (IVs), b0 is the constant and b1 ... bn are the coefficients.
Multiple linear regression analysis
makes several key assumptions:
• Multivariate Normality–Multiple regression assumes that the
residuals are normally distributed.
• No Multicollinearity—Multiple regression assumes that the
independent variables are not highly correlated with each
other.  This assumption is tested using Variance Inflation
Factor (VIF) values.
• Homoscedasticity – This assumption states that the variance of the error terms is similar across the values of the independent variables. A plot of standardized residuals versus predicted values can show whether points are equally distributed across all values of the independent variables.
• Multiple linear regression requires at least
two independent variables, which can be
nominal, ordinal, or interval/ratio level
variables. 
• A rule of thumb for the sample size is that regression analysis
requires at least 20 cases per independent variable in the
analysis.
• First, multiple linear regression requires the
relationship between the independent and
dependent variables to be linear.  
• The linearity assumption can best be tested with
scatterplots.  The following two examples depict a
curvilinear relationship (left) and a linear relationship
(right).
• Second, the multiple linear regression analysis
requires that the errors between observed and
predicted values (i.e., the residuals of the regression)
should be normally distributed.
• This assumption may be checked by looking at a
histogram or a Q-Q-Plot.  Normality can also be
checked with a goodness of fit test (e.g., the
Kolmogorov-Smirnov test), though this test must be
conducted on the residuals themselves.
• Third, multiple linear regression assumes that there
is no multicollinearity in the data. 
• Multicollinearity occurs when the independent
variables are too highly correlated with each other.
• Multicollinearity may be checked multiple ways:
• 1) Correlation matrix – When computing a matrix of
Pearson’s bivariate correlations among all independent
variables, the magnitude of the correlation coefficients
should be less than .80.
• 2) Variance Inflation Factor (VIF) – The VIFs of the linear
regression indicate the degree that the variances in the
regression estimates are increased due to multicollinearity.
VIF values higher than 10 indicate that multicollinearity is a
problem.
• If multicollinearity is found in the data, one possible solution is to center the data. To center the data, subtract the mean score from each observation for each independent variable. However, the simplest solution is to identify the variables causing multicollinearity issues (i.e., through correlations or VIF values) and remove those variables from the regression.
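A small sketch of how VIF values can be computed in practice. The slides do not prescribe a library; this uses statsmodels and assumes a DataFrame df containing only the independent variables:

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(df):
    # One VIF value per independent variable (column) of df
    return pd.Series(
        [variance_inflation_factor(df.values, i) for i in range(df.shape[1])],
        index=df.columns, name='VIF')

# VIF values above 10 suggest problematic multicollinearity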
• A scatterplot of residuals versus predicted values is a good way to check for homoscedasticity. There should be no clear pattern in the distribution; if there is a cone-shaped pattern (as shown below), the data is heteroscedastic.
Multi Linear Regression
DUMMY VARIABLES
A categorical variable cannot enter the regression equation directly; it must first be encoded as dummy (0/1) variables.
Multi Linear Regression
DUMMY VARIABLES

NEW YORK   CALIFORNIA
    1          0
    0          1
    0          1
    0          1

y = b0 + b1 * x1 + b2 * x2 + b3 * x3 + b4 * D1
(D1 is the New York dummy column; only one dummy column is needed in the equation.)
Multi Linear Regression
DUMMY VARIABLE TRAP

NEW YORK   CALIFORNIA
    1          0
    0          1
    0          1
    0          1

Because D2 = 1 - D1, including both dummies introduces perfect collinearity (the dummy variable trap):

y = b0 + b1 * x1 + b2 * x2 + b3 * x3 + b4 * D1 + b5 * D2
Always OMIT one Dummy Variable
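A short sketch of how this encoding is typically done in pandas; drop_first=True keeps n-1 dummy columns, automatically omitting one dummy. The column name 'State' is a hypothetical example:

import pandas as pd

# Hypothetical categorical column
df = pd.DataFrame({'State': ['New York', 'California', 'California', 'California']})

# drop_first=True drops one category, so the remaining dummies avoid the trap
dummies = pd.get_dummies(df['State'], drop_first=True)
print(dummies)
# Only a 'New York' column remains: [1, 0, 0, 0]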
Building A Model
STEP BY
STEP
Building A Model

• METHODS OF BUILDING A MODEL
• All-in
• Backward Elimination
• Forward Selection
• Bidirectional Elimination
• Score Comparison (all possible models)
(Backward Elimination, Forward Selection and Bidirectional Elimination are collectively referred to as stepwise regression.)
Building A Model

• METHODS OF BUILDING
A MODEL
• ALL - IN
• Throw in every variable
• Prior Knowledge
• Known Values
• Preparing for backward elimination
Building A Model
• BACKWARD ELIMINATION (often considered the best of these methods)
• Step 1: Select a significance level to stay in the model (e.g. SL = 0.05)
• Step 2: Fit the full model with all possible predictors
• Step 3: Consider the predictor with the highest P-value; if P > SL, go to Step 4, otherwise go to FIN (your MODEL is BUILT)
• Step 4: Remove that predictor
• Step 5: Fit the model without this predictor, then return to Step 3
Bi-Dimensional Example

• Let's consider a small dataset built by adding some uniform noise to the points belonging to a segment bounded between -6 and 6.
• The original equation is: y = x + 2 + n, where n is a noise term.
• The figure shows a plot of the dataset together with a candidate regression function.
• As we're working on a plane, the regressor we're looking for is a function of only two parameters (an intercept and a slope):

ỹ(x) = v0 + v1 * x

• In order to fit our model, we must find the best parameters, and to do that we choose a least squares approach.
• This task can be easily accomplished by Least
Square Method. 
• It is the most common method used for fitting
a regression line.
• It calculates the best-fit line for the observed
data by minimizing the sum of the squares of
the vertical deviations from each data point
to the line.
• Because the deviations are first squared,
when added, there is no cancelling out
between positive and negative values.
• The loss function to minimize is:

L(v) = (1/2) * Σ_i (v0 + v1 * x_i - y_i)^2

• In code (for simplicity, the loss accepts a single vector v containing both variables):

import numpy as np

def loss(v):
    e = 0.0
    for i in range(nb_samples):
        e += np.square(v[0] + v[1]*X[i] - Y[i])
    return 0.5 * e
• In order to find the global minimum, we must impose a null gradient:

∇L(v) = 0

• The two components of the gradient are:

∂L/∂v0 = Σ_i (v0 + v1 * x_i - y_i)
∂L/∂v1 = Σ_i (v0 + v1 * x_i - y_i) * x_i

• In code:

def gradient(v):
    g = np.zeros(shape=2)
    for i in range(nb_samples):
        g[0] += (v[0] + v[1]*X[i] - Y[i])
        g[1] += ((v[0] + v[1]*X[i] - Y[i]) * X[i])
    return g
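The snippets above assume that nb_samples, X, and Y already exist. A minimal sketch to build the noisy dataset described earlier (the exact noise range and number of samples used for the original figure are assumptions):

import numpy as np

nb_samples = 200
X = np.arange(-6, 6, 12.0 / nb_samples)                       # segment bounded between -6 and 6
Y = X + 2 + np.random.uniform(-1.5, 1.5, size=nb_samples)      # y = x + 2 + noise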
The optimization can now be solved using SciPy:
• scipy.optimize.minimize
Parameters:
• fun : callable — the objective function to be minimized, with signature fun(x, *args) -> float, where x is a 1-D array with shape (n,) and args is a tuple of the fixed parameters needed to completely specify the function.
• x0 : ndarray, shape (n,) — initial guess; an array of real elements of size (n,), where n is the number of independent variables.
• args : tuple, optional — extra arguments passed to the objective function and its derivatives (the fun, jac and hess functions).
• method : str or callable, optional — type of solver. Should be one of 'Nelder-Mead', 'Powell', 'CG', 'BFGS', 'Newton-CG', 'L-BFGS-B', 'TNC', 'COBYLA', 'SLSQP', 'trust-constr', 'dogleg', 'trust-ncg', 'trust-exact', 'trust-krylov', or a custom callable object (added in version 0.14.0). If not given, it is chosen to be one of BFGS, L-BFGS-B or SLSQP, depending on whether the problem has constraints or bounds.
• jac : {callable, '2-point', '3-point', 'cs', bool}, optional — method for computing the gradient vector.
• hess : {callable, '2-point', '3-point', 'cs', HessianUpdateStrategy}, optional — method for computing the Hessian matrix.
>>> from scipy.optimize import minimize
>>> minimize(fun=loss, x0=[0.0, 0.0], jac=gradient,
method='L-BFGS-B')
fun: 9.7283268345966025
hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64>
jac: array([ 7.28577538e-06, -2.35647522e-05])
message: 'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'
nfev: 8
nit: 7
status: 0 success: True
x: array([ 2.00497209, 1.00822552])
• As expected, the regression denoised our dataset,
rebuilding the original equation: y = x + 2.
Scipy Optimization Example using
Python
• Optimization deals with selecting the best
option among a number of possible choices
that are feasible or don't violate constraints.
• Mathematical optimization problems may
include equality constraints (e.g. =), inequality
constraints (e.g. <, <=, >, >=), objective
functions, algebraic equations, differential
equations, continuous variables, discrete or
integer variables, etc. 
• As an example, consider a problem with a nonlinear objective that the optimizer attempts to minimize. The variable values at the optimal solution are subject to (s.t.) both an equality (=40) and an inequality (>25) constraint. The product of the four variables must be greater than 25, while the sum of squares of the variables must equal 40. In addition, all variables must be between 1 and 5, and the initial guess is x1 = 1, x2 = 5, x3 = 5, and x4 = 1 (see the sketch after the link below).
https://www.youtube.com/watch?v=cXHvC_FGx24
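The slide does not show the objective function itself; a commonly used objective for this exercise (and, to the best of my knowledge, the one in the linked video) is f(x) = x1*x4*(x1+x2+x3) + x3. A sketch with SciPy's SLSQP solver under that assumption:

import numpy as np
from scipy.optimize import minimize

def objective(x):
    # Assumed objective (not given on the slide): x1*x4*(x1+x2+x3) + x3
    return x[0] * x[3] * (x[0] + x[1] + x[2]) + x[2]

def ineq_constraint(x):
    # Product of the variables must be greater than 25 (fun >= 0)
    return np.prod(x) - 25.0

def eq_constraint(x):
    # Sum of squares of the variables must equal 40 (fun == 0)
    return np.sum(x ** 2) - 40.0

x0 = np.array([1.0, 5.0, 5.0, 1.0])          # initial guess from the slide
bounds = [(1.0, 5.0)] * 4                    # all variables between 1 and 5
constraints = [{'type': 'ineq', 'fun': ineq_constraint},
               {'type': 'eq', 'fun': eq_constraint}]

res = minimize(objective, x0, method='SLSQP', bounds=bounds, constraints=constraints)
print(res.x)   # optimal variable values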
Linear regression with scikit-learn and
higher dimensionality
• scikit-learn offers the class LinearRegression,
which works with n-dimensional spaces. For
this purpose, we're going to use the Boston
dataset:
from sklearn.datasets import load_boston
>>> boston = load_boston()
>>> boston.data.shape (506L, 13L)
>>> boston.target.shape (506L,)
It has 506 samples with 13 input features and one output. In the following figure, there's a collection of the plots of the first 12 features:
How to Find Accuracy of Model
• We ask the model to normalize the data before processing it. Moreover, for testing purposes, we split the original dataset into training (90%) and test (10%) sets:
• from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
>>> X_train, X_test, Y_train, Y_test =
train_test_split(boston.data, boston.target,
test_size=0.1)
>>> lr = LinearRegression(normalize=True)
>>> lr.fit(X_train, Y_train)
• LinearRegression(copy_X=True, fit_intercept=True,
n_jobs=1, normalize=True)
 
• To check the accuracy of a regression, scikit-
learn provides the internal method score(X,
y) which evaluates the model on test data:
>>> lr.score(X_test, Y_test)
0.77371996006718879
• So the overall accuracy is about 77%, which is an acceptable result considering the non-linearity of the original dataset, but it can also be influenced by the subdivision made by train_test_split (as in our case).
• we can use the function cross_val_score(),
which works with all the classifiers.
• The scoring parameter is very important
because it determines which metric will be
adopted for tests.
• As LinearRegression works with ordinary least
squares, we preferred the negative mean
squared error, which is a cumulative measure
that must be evaluated according to the
actual values (it's not relative).
• from sklearn.model_selection import
cross_val_score
>>> scores = cross_val_score(lr, boston.data,
boston.target, cv=7,
scoring='neg_mean_squared_error')
array([ -11.32601065, -10.96365388, -32.12770594, -
33.62294354,- 10.55957139, -146.42926647, -
12.98538412])

>>> scores.mean()
-36.859219426420601
>>> scores.std()
45.704973900600457
• Another very important metric used in regression is called the coefficient of determination, or R². It measures the amount of variance in the prediction which is explained by the dataset.
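For reference, the standard definition (not shown on the slide) is:

R² = 1 - Σ_i (y_i - ŷ_i)² / Σ_i (y_i - ȳ)²

where ŷ_i are the predicted values and ȳ is the mean of the observed values; the closer R² is to 1, the better the fit.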

>>> cross_val_score(lr, X, Y, cv=10, scoring='r2')
0.75
(cv=10: 10-fold cross-validation; an R² close to 1 indicates a good fit.)
 Big Mart Sales-In the data set, we have
product wise Sales for Multiple outlets of a
chain.
Ridge & Lasso
• Ridge and Lasso regression are powerful techniques generally used for creating parsimonious models in the presence of a 'large' number of features. Here 'large' can typically mean either of two things:
• Large enough to enhance the tendency of a model to overfit (as few as 10 variables might cause overfitting)
• Large enough to cause computational challenges. With modern systems, this situation might arise in the case of millions or billions of features
• Though Ridge and Lasso might appear to work towards a
common goal, the inherent properties and practical use cases
differ substantially. If you’ve heard of them before, you must
know that they work by penalizing the magnitude of
coefficients of features along with minimizing the error
between predicted and actual observations. These are called
‘regularization’ techniques.
Why Penalize the Magnitude of Coefficients?
• Let's try to understand the impact of model complexity on the magnitude of the coefficients. As an example, I have simulated a sine curve (between 60° and 300°).
• This resembles a sine curve, but not exactly, because of the noise.
• We'll use this as an example to test different scenarios.
• Let's try to estimate the sine function using polynomial regression with powers of x from 1 to 15.
• Let's add a column for each power up to 15 in our dataframe.
• Now that we have all 15 powers, let's make 15 different linear regression models, with each model containing variables with powers of x from 1 to the particular model number. For example, the feature set of model 8 will be {x, x_2, x_3, ..., x_8}. A sketch of this setup follows.
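A minimal sketch of this setup, assuming the noisy sine data described above (the exact sampling grid and noise level are assumptions made for illustration):

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Simulated sine curve with noise, angles between 60° and 300° (assumed grid/noise)
x = np.array([i * np.pi / 180.0 for i in range(60, 300, 4)])
y = np.sin(x) + np.random.normal(0, 0.15, len(x))
data = pd.DataFrame({'x': x, 'y': y})

# Add a column for each power of x up to 15
for i in range(2, 16):
    data['x_%d' % i] = data['x'] ** i

# Fit one model per complexity level (powers 1..power) and record its RSS
def fit_poly_model(data, power):
    predictors = ['x'] + ['x_%d' % i for i in range(2, power + 1)]
    lr = LinearRegression()
    lr.fit(data[predictors], data['y'])
    rss = np.sum((lr.predict(data[predictors]) - data['y']) ** 2)
    return rss, lr.intercept_, lr.coef_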
• RSS refers to ‘Residual Sum of Squares’ which is nothing but the sum of
square of errors between the predicted and actual values in the training
data set. 
• We would expect the models with increasing
complexity to better fit the data and result in lower
RSS values.
• This can be verified by looking at the plots generated
for 6 models:
• As the model complexity increases, the model tends to fit even smaller deviations in the training data set.
• Though this leads to overfitting, let's keep this issue aside for some time and come to our main objective, i.e. the impact on the magnitude of the coefficients.
• See the output in coef_matrix_simple.
• It is clearly evident that the size of the coefficients increases exponentially with an increase in model complexity.
• Ridge regression imposes an additional shrinkage penalty on the ordinary least squares loss function to limit the squared L2 norm of the coefficient vector.
• Ridge regression performs L2 regularization, i.e. it adds a penalty equivalent to the square of the magnitude of the coefficients:
• Minimization objective = LS Obj + α * (sum of squares of the coefficients)
• Note that here 'LS Obj' refers to the 'least squares objective', i.e. the linear regression objective without regularization.
α can take various values:
• α = 0:
– The objective becomes same as simple linear regression.
– We’ll get the same coefficients as simple linear regression.
• α = ∞:
– The coefficients will be zero. Why? Because of the infinite weightage on the squares of the coefficients, any non-zero coefficient will make the objective infinite.
• 0 < α < ∞:
– The magnitude of α will decide the weightage given to the different parts of the objective.
– The coefficients will be somewhere between 0 and the ones for simple linear regression.
(A sketch of fitting ridge models over a range of α values follows.)
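A minimal sketch under the same assumptions as the polynomial example above (the data DataFrame and its x_2 ... x_15 columns are assumed to exist; normalize=True matches the older scikit-learn API used throughout these slides):

import numpy as np
from sklearn.linear_model import Ridge

# Fit ridge models for increasing alpha on the polynomial features built earlier
alphas = [1e-15, 1e-10, 1e-8, 1e-4, 1e-3, 1e-2, 1, 5, 10, 20]
predictors = ['x'] + ['x_%d' % i for i in range(2, 16)]
for alpha in alphas:
    ridge = Ridge(alpha=alpha, normalize=True)
    ridge.fit(data[predictors], data['y'])
    rss = np.sum((ridge.predict(data[predictors]) - data['y']) ** 2)
    print(alpha, rss)   # RSS grows as alpha increases (more shrinkage)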
 as the value of alpha increases, the
model complexity reduces.
The RSS increases with an increase in alpha, as the model complexity reduces.
An alpha as small as 1e-15 already gives a significant reduction in the magnitude of the coefficients. How? Compare the coefficients in the first row of this table to the last row of the simple linear regression table.
High alpha values can lead to significant underfitting; note the rapid increase in RSS for values of alpha greater than 1.
Though the coefficients become very, very small, they are NOT zero.
• Lasso Regression: performs L1 regularization, i.e. it adds a penalty equivalent to the absolute value of the magnitude of the coefficients.
• Minimization objective = LS Obj + α * (sum of absolute values of the coefficients)
• A Lasso regressor imposes a penalty on the L1 norm of w to determine a potentially higher number of null coefficients.
• Here, α (alpha) works similarly to ridge and provides a trade-off between balancing RSS and the magnitude of the coefficients. Like ridge, α can take various values. Let's reiterate them briefly:
• α = 0: Same coefficients as simple linear
regression
• α = ∞: All coefficients zero (same logic as
before)
• 0 < α < ∞: coefficients between 0 and those of simple linear regression (a sketch follows)
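A companion sketch to the ridge one above, again assuming the data and predictors from the polynomial example (normalize=True as in the older scikit-learn API; a large max_iter helps convergence for very small alphas):

import numpy as np
from sklearn.linear_model import Lasso

alphas = [1e-15, 1e-10, 1e-8, 1e-5, 1e-4, 1e-3, 1e-2, 1, 5, 10]
predictors = ['x'] + ['x_%d' % i for i in range(2, 16)]
for alpha in alphas:
    lasso = Lasso(alpha=alpha, normalize=True, max_iter=100000)
    lasso.fit(data[predictors], data['y'])
    rss = np.sum((lasso.predict(data[predictors]) - data['y']) ** 2)
    n_zero = np.sum(lasso.coef_ == 0)    # many coefficients become exactly zero
    print(alpha, rss, n_zero)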
This again tells us that the model complexity decreases with
increase in the values of alpha.
Apart from the expected inference of higher RSS for higher alphas, we can see the following:
• For the same values of alpha, the coefficients of lasso regression are much smaller than those of ridge regression (compare row 1 of the two tables).
• For the same alpha, lasso has a higher RSS (poorer fit) compared to ridge regression.
• Many of the coefficients are zero even for very small values of alpha.
ElasticNet
• The last alternative is ElasticNet, which combines both Lasso and Ridge into a single model with two penalty factors: one proportional to the L1 norm and the other to the L2 norm. In this way, the resulting model will be sparse like a pure Lasso, but with the same regularization ability as provided by Ridge. The resulting loss function (in scikit-learn's parametrization) is:
Minimization objective = LS Obj + α * [ l1_ratio * (sum of absolute values of coefficients) + (1 - l1_ratio)/2 * (sum of squares of coefficients) ]
(A sketch follows.)
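A brief sketch, again assuming the data and predictors from the earlier example; the alpha and l1_ratio values are illustrative only:

from sklearn.linear_model import ElasticNet

# l1_ratio balances the L1 (lasso-like) and L2 (ridge-like) penalties
predictors = ['x'] + ['x_%d' % i for i in range(2, 16)]
en = ElasticNet(alpha=1e-3, l1_ratio=0.5, max_iter=100000)
en.fit(data[predictors], data['y'])
print(en.coef_)   # sparse like lasso, but shrunk like ridge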
Summary
• L1 regularization, aka Lasso regularization – adds regularization terms to the model which are a function of the absolute values of the coefficients of the parameters. The coefficients can be driven to exactly zero during the regularization process, hence this technique can be used for feature selection and for generating a more parsimonious model.
• L2 regularization, aka Ridge regularization – adds regularization terms to the model which are a function of the squares of the coefficients of the parameters. Coefficients can approach zero but never become exactly zero, hence L2 alone is not useful for feature selection.
• A combination of the above two, such as Elastic Net – adds regularization terms to the model which are a combination of both L1 and L2 regularization.
Robust regression with random sample
consensus
• A common problem with linear regression is caused by the presence of outliers. An ordinary least squares approach will take them into account and the result (in terms of coefficients) will therefore be biased. In the following figure, there's an example of such behavior:
• The less sloped line represents an acceptable regression which discards the outliers, while the other one is influenced by them. An interesting approach to avoid this problem is offered by random sample consensus (RANSAC), which works with every regressor by subsequent iterations, after splitting the dataset into inliers and outliers.
What is Outlier??
• Outlier is a commonly used term among analysts and data scientists, as outliers need close attention, otherwise they can result in wildly wrong estimations.
• Outliers can be detected via:
• Boxplot
• Histogram
• Scatter plot
Types of Outlier
• Data Entry Errors:
• Measurement Error:
• Experimental Error: 
• Intentional Outlier:
• Data Processing Error:
• Sampling Error
How to Remove Outliers
• Deleting observations
• Transforming and binning values
Random Sample Consensus (RANSAC),
• The model is trained only with valid
samples (evaluated internally or through
the callable is_data_valid()) and all
samples are re-evaluated to verify if
they're still inliers or they have become
outliers.
• The process ends after a fixed number of
iterations or when the desired score is
achieved.
Here's an example of simple linear regression applied to the dataset shown in the previous figure:
• from sklearn.linear_model import
LinearRegression
>>> lr = LinearRegression(normalize=True)
>>> lr.fit(X.reshape((-1, 1)), Y.reshape((-1, 1)))
>>> lr.intercept_ array([ 5.500572])
>>> lr.coef_
array([[ 2.53688672]])
• As imagined, the slope is high due to the presence of
outliers. The resulting regressor is y =5.5 + 2.5x
(slightly less sloped than what was shown in the
figure).
• Now we're going to use RANSAC with the same
linear regressor:
• RANSAC is an iterative algorithm for the
robust estimation of parameters from a subset
of inliers from the complete data set.
from sklearn.linear_model import RANSACRegressor
>>> rs = RANSACRegressor(lr)
>>> rs.fit(X.reshape((-1, 1)), Y.reshape((-1, 1)))
>>> rs.estimator_.intercept_ array([ 2.03602026])
>>> rs.estimator_.coef_
array([[ 0.99545348]]) 
• In this case, the regressor is about y = 2 + x (which is
the original clean dataset without outliers).
Polynomial Regression
• A regression equation is a polynomial regression equation if the power of the independent variable is greater than 1. The equation below is an example of a polynomial equation:
y = a + b * x^2
• While there might be a temptation to fit a higher
degree polynomial to get lower error, this can result
in over-fitting. Always plot the relationships to see
the fit and focus on making sure that the curve fits
the nature of the problem. Here is an example of
how plotting can help:
• Generate a new feature matrix consisting of
all polynomial combinations of the features
with degree less than or equal to the specified
degree.
• For example, if an input sample is two
dimensional and of the form [a, b], the
degree-2 polynomial features are [1, a, b,
a^2, ab, b^2].
• Number of features in the output array scales
polynomially in the number of features of the
input array, and exponentially in the degree.
High degrees can cause overfitting.
>>> import numpy as np
>>> from sklearn.preprocessing import PolynomialFeatures
>>> X = np.arange(6).reshape(3, 2)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5]])
>>> poly = PolynomialFeatures(2)
>>> poly.fit_transform(X)
array([[ 1., 0., 1., 0., 0., 1.],
       [ 1., 2., 3., 4., 6., 9.],
       [ 1., 4., 5., 16., 20., 25.]])
>>> poly = PolynomialFeatures(interaction_only=True)
>>> poly.fit_transform(X)
array([[ 1., 0., 1., 0.],
       [ 1., 2., 3., 6.],
       [ 1., 4., 5., 20.]])
• from sklearn.linear_model import LinearRegression
>>> lr = LinearRegression(normalize=True)
>>> lr.fit(X.reshape((-1, 1)), Y.reshape((-1, 1)))
>>> lr.score(X.reshape((-1, 1)), Y.reshape((-1, 1)))
0.10888218817034558
Performance is poor, as expected.
• from sklearn.preprocessing import
PolynomialFeatures
>>> pf = PolynomialFeatures(degree=2)
>>> Xp = pf.fit_transform(X.reshape(-1, 1))
>>> Xp.shape
• (100L, 3L)
• As expected, the old x1 coordinate has been
replaced by a triplet, which also contains the
quadratic and mixed terms. At this point, a
linear regression model can be trained:
>>> lr.fit(Xp, Y.reshape((-1, 1)))
>>> lr.score(Xp, Y.reshape((-1, 1)))
0.99692778265941961
The score is much higher, and the only price we have paid is an increase in the number of features.
Isotonic regression
• The isotonic regression finds a non-
decreasing approximation of a function while
minimizing the mean squared error on the
training data.
• The benefit of such a model is that it does not
assume any form for the target function such
as linearity.
• It produces a piecewise interpolating function minimizing the weighted least squares functional:
L = Σ_i w_i * (y_i - f(x_i))^2, subject to f being non-decreasing
• An example (with a toy dataset) is provided next:
>>> X = np.arange(-5, 5, 0.1)
>>> Y = X + np.random.uniform(-0.5, 1,
size=X.shape)
Following is a plot of the dataset. As you can see, it can be easily modeled by a linear regressor, but without a highly non-linear function it is very difficult to capture the slight (and local) modifications in the slope:
Another example
• The class IsotonicRegression needs to know ymin and ymax
(which correspond to the variables y0 and yn in the loss
function). In this case, we impose -6 and 10:
from sklearn.isotonic import IsotonicRegression
>>> ir = IsotonicRegression(-6, 10)
>>> Yi = ir.fit_transform(X, Y)
• The result is provided through three instance variables:
>>> ir.X_min_
-5.0
>>> ir.X_max_
4.8999999999999648
>>> ir.f_
<scipy.interpolate.interpolate.interp1d at 0x126edef8>
• The last one, (ir.f_), is an interpolating function which can be
evaluated in the domain [xmin, xmax]. For example:
>>> ir.f_(2) array(1.7294334618146134)
A plot of this function (the green line), together with the original
data set, is shown in the following figure:
Logistic Regression
Linear classification
• Let's consider a generic linear classification problem
with two classes. In the following figure,
• Our goal is to find an optimal hyperplane, which
separates the two classes. In multi-class problems, the
strategy one-vs-all is normally adopted, so the
discussion can be focused only on binary classifications.
• Suppose we have a dataset of n samples, each described by m features:
X = {x_1, x_2, ..., x_n}, with x_i ∈ R^m
• This dataset is associated with the following set of binary targets:
Y = {y_1, y_2, ..., y_n}, with y_i ∈ {0, 1}
• We can now define a weight vector made of m continuous components:
w = (w_1, w_2, ..., w_m)
• We can also define the quantity z:
z = w · x = w_1*x_1 + w_2*x_2 + ... + w_m*x_m
• If x is a variable, z is the value determined by the hyperplane equation. Therefore, if the set of coefficients w that has been determined is correct, it happens that z > 0 for samples of one class and z < 0 for samples of the other.
• Now we must find a way to optimize w, in order to reduce the classification error. If such a combination exists (with a certain error threshold), we say that our problem is linearly separable. On the other hand, when it's impossible to find a linear classifier, the problem is called non-linearly separable. A very simple but famous example is given by the logical operator XOR:
Logistic regression
• Logistic regression is the appropriate regression
analysis to conduct when the dependent variable is
dichotomous (binary). 
• It is used to predict a binary outcome (1 / 0, Yes / No,
True / False) given a set of independent variables. To
represent binary / categorical outcome, we use dummy
variables. 
• Like all regression analyses, the logistic regression is a
predictive analysis.  Logistic regression is used to
describe data and to explain the relationship between
one dependent binary variable and one or more
nominal, ordinal, interval or ratio-level independent
variables.
• Logistic regression can be viewed as a special case of linear regression where the outcome variable is categorical and the log of odds is used as the dependent variable. In simple words, it predicts the probability of occurrence of an event by fitting data to a logit function.
• The fundamental equation of generalized linear
model is: g(E(y)) = α + βx1 + γx2
• Here, g() is the link function, E(y) is the expectation
of target variable and α + βx1 + γx2 is the linear
predictor ( α,β,γ to be predicted).
• The role of link function is to ‘link’ the expectation of
y to linear predictor.
• We are provided a sample of 1000 customers. We
need to predict the probability whether a customer
will buy (y) a particular magazine or not. As you can
see, we’ve a categorical outcome variable, we’ll use
logistic regression.
• g(y) = βo + β(Age)         ---- (a)
• considered ‘Age’ as independent variable.
• In logistic regression, we are only concerned about
the probability of outcome dependent variable
( success or failure). As described above, g() is the
link function. This function is established using two
things: Probability of Success(p) and Probability of
Failure(1-p).
p should meet following criteria:
• It must always be positive (since p >= 0)
• It must always be less than equals to 1 (since p <= 1)
• Now, we’ll simply satisfy these 2 conditions and get
to the core of logistic regression. To establish link
function, we’ll denote g() with ‘p’ initially and
eventually end up deriving this function.
• Since probability must always be positive, we’ll put
the linear equation in exponential form. For any
value of slope and dependent variable, exponent of
this equation will never be negative.
• p = exp(βo + β(Age)) = e^(βo + β(Age)) ------- (b)
• To make the probability less than 1, we must divide p by a number greater than p. This can simply be done by:
p = exp(βo + β(Age)) / (exp(βo + β(Age)) + 1) = e^(βo + β(Age)) / (e^(βo + β(Age)) + 1)    ----- (c)
• Using (a), (b) and (c), we can redefine the probability as:
p = e^y / (1 + e^y)           --- (d)
• where p is the probability of success. Equation (d) is the logistic function (the inverse of the logit).
• If p is the probability of success, 1 - p will be the probability of failure, which can be written as:
q = 1 - p = 1 - e^y / (1 + e^y)    --- (e)
• where q is the probability of failure.
• On dividing (d) by (e), we get:
p / (1 - p) = e^y
• After taking the log on both sides, we get:
log(p / (1 - p)) = y
• log(p/(1-p)) is the link function. The logarithmic transformation on the outcome variable allows us to model a non-linear association in a linear way.
• After substituting the value of y, we get:
log(p / (1 - p)) = βo + β(Age)
• This is the equation used in logistic regression. Here p/(1-p) is the odds ratio. Whenever the log of the odds ratio is positive, the probability of success is more than 50%.
• A typical logistic model plot is shown below. You can see that the probability never goes below 0 or above 1. A quick numerical check follows.
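A minimal sketch of the relationship just derived; the intercept βo = -3 and slope β = 0.1 are purely hypothetical values chosen for illustration:

import numpy as np

def prob_of_success(age, b0=-3.0, b1=0.1):
    # p = e^y / (1 + e^y), with y = b0 + b1 * Age (the log-odds)
    y = b0 + b1 * age
    return np.exp(y) / (1.0 + np.exp(y))

for age in [20, 30, 40, 50, 60]:
    p = prob_of_success(age)
    # The log of the odds ratio recovers b0 + b1 * age exactly
    print(age, round(p, 3), round(np.log(p / (1 - p)), 3))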
Implementation and optimizations
>>> from sklearn.datasets import load_iris
>>> from sklearn.linear_model import LogisticRegression
>>> X, y = load_iris(return_X_y=True)
>>> clf = LogisticRegression(random_state=0, solver='lbfgs',
...                          multi_class='multinomial').fit(X, y)
>>> clf.predict(X[:2, :])
array([0, 0])
>>> clf.predict_proba(X[:2, :])
array([[9.8...e-01, 1.8...e-02, 1.4...e-08],
       [9.7...e-01, 2.8...e-02, ...e-08]])
>>> clf.score(X, y)
0.97...
Stochastic Gradient Descent Algorithms
• Stochastic Gradient Descent (SGD) is a simple
yet very efficient approach to discriminative
learning of linear classifiers under convex loss
functions such as (linear) Support Vector
Machines and Logistic Regression.
• SGD has been successfully applied to large-scale
and sparse machine learning problems often
encountered in text classification and natural
language processing. Given that the data is sparse,
the classifiers in this module easily scale to
problems with more than 10^5 training examples
and more than 10^5 features.
• The advantages of Stochastic Gradient Descent
are:
• Efficiency.
• Ease of implementation (lots of opportunities
for code tuning).
• The disadvantages of Stochastic Gradient
Descent include:
• SGD requires a number of hyperparameters
such as the regularization parameter and the
number of iterations.
• SGD is sensitive to feature scaling.
• class SGDClassifier implements a plain
stochastic gradient descent learning routine
which supports different loss functions and
penalties for classification.
• As other classifiers, SGD has to be fitted with two
arrays: an array X of size [n_samples, n_features]
holding the training samples, and an array Y of size
[n_samples] holding the target values (class labels)
for the training samples:
>>> from sklearn.linear_model import SGDClassifier
>>> X = [[0., 0.], [1., 1.]]
>>> y = [0, 1]
>>> clf = SGDClassifier(loss="hinge", penalty="l2", max_iter=5)
>>> clf.fit(X, y)
SGDClassifier(alpha=0.0001, average=False, class_weight=None,
    early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
    l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=5,
    n_iter=None, n_iter_no_change=5, n_jobs=None, penalty='l2',
    power_t=0.5, random_state=None, shuffle=True, tol=None,
    validation_fraction=0.1, verbose=0, warm_start=False)
After being fitted, the model can then be used to predict new values:
• >>> clf.predict([[2., 2.]])
array([1])
• SGD fits a linear model to the training data. The
member coef_ holds the model parameters:
>>> clf.coef_ array([[9.9..., 9.9...]])
• Member intercept_ holds the intercept (aka offset or
bias):
>>> clf.intercept_ array([-9.9...])
• Whether or not the model should use an intercept, i.e. a
biased hyperplane, is controlled by the
parameter fit_intercept.
• To get the signed distance to the hyperplane
use SGDClassifier.decision_function:
>>> clf.decision_function([[2., 2.]]) array([29.6...])
• The concrete loss function can be set via
the loss parameter. SGDClassifier supports the
following loss functions:
• loss="hinge": (soft-margin) linear Support Vector
Machine,
• loss="modified_huber": smoothed hinge loss,
• loss="log": logistic regression,
• and all regression losses below.
• The first two loss functions are lazy, they only update
the model parameters if an example violates the
margin constraint, which makes training very
efficient and may result in sparser models, even
when L2 penalty is used.
• penalty="l2": L2 norm penalty on coef_.
• penalty="l1": L1 norm penalty on coef_.
• penalty="elasticnet": Convex combination of L2 and
L1; (1 - l1_ratio) * L2 + l1_ratio * L1.
• In the case of multi-class classification, coef_ is a two-dimensional array of shape [n_classes, n_features] and intercept_ is a one-dimensional array of shape [n_classes]. The i-th row of coef_ holds the weight vector of the OVA classifier for the i-th class; classes are indexed in ascending order (see the attribute classes_). Note that, in principle, since they allow creating a probability model, loss="log" and loss="modified_huber" are more suitable for one-vs-all classification.
Finding the optimal hyperparameters
through grid search
• GridSearchCV, which automates the training
process of different models and provides the
user with optimal values using cross-
validation.
• When creating a machine learning model,
you'll be presented with design choices as to
how to define your model architecture. Often
times, we don't immediately know what the
optimal model architecture should be for a
given model, and thus we'd like to be able to
explore a range of possibilities.
• In true machine learning fashion, we'll ideally
ask the machine to perform this exploration
and select the optimal model architecture
automatically.
• These design choices are hyperparameters, and thus this process of searching for the ideal model architecture is referred to as hyperparameter tuning.
• What should I set my learning rate to for
gradient descent?
• What degree of polynomial features should I
use for my linear model?
• we show how to use it to find the best penalty and
strength factors for a linear regression with the Iris
toy dataset:
import multiprocessing
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
>>> iris = load_iris()
>>> param_grid = [
        {
            'penalty': ['l1', 'l2'],
            'C': [0.5, 1.0, 1.5, 1.8, 2.0, 2.5]
        }
    ]
Explanation
• Comparison of the sparsity (percentage of
zero coefficients) of solutions when L1 and L2
penalty are used for different values of C.
• We can see that large values of C give more
freedom to the model. Conversely, smaller
values of C constrain the model more.
• In the L1 penalty case, this leads to sparser
solutions.
• We classify 8x8 images of digits into two
classes: 0-4 against 5-9. The visualization
shows coefficients of the models for varying C.
• C=1.00
• Sparsity with L1 penalty: 4.69%
• score with L1 penalty: 0.9082
• Sparsity with L2 penalty: 4.69%
• score with L2 penalty: 0.9048

• C=0.10
• Sparsity with L1 penalty: 28.12%
• score with L1 penalty: 0.9026
• Sparsity with L2 penalty: 4.69%
• score with L2 penalty: 0.9021

• C=0.01
• Sparsity with L1 penalty: 84.38%
• score with L1 penalty: 0.8625
• Sparsity with L2 penalty: 4.69%
• score with L2 penalty: 0.8898
Continuing the grid-search code from the previous slides:
>>> gs = GridSearchCV(estimator=LogisticRegression(), param_grid=param_grid,
                      scoring='accuracy', cv=10,
                      n_jobs=multiprocessing.cpu_count())
>>> gs.fit(iris.data, iris.target)
>>> gs.best_estimator_
LogisticRegression(C=1.5, class_weight=None, dual=False, fit_intercept=True,
    intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
    penalty='l1', random_state=None, solver='liblinear', tol=0.0001,
    verbose=0, warm_start=False)
>>> cross_val_score(gs.best_estimator_, iris.data,
iris.target, scoring='accuracy', cv=10).mean()
0.96666666666666679
Basic-Grid-search and cross-validated
estimators
• By default, the GridSearchCV uses a 3-fold cross-
validation.
• However, if it detects that a classifier is passed,
rather than a regressor, it uses a stratified 3-fold.
• The default will change to a 5-fold cross-validation in
version 0.22.
• Example:
>>> clf.score(X_digits[1000:], y_digits[1000:])
0.943...
• Nested cross-validation:
>>> cross_val_score(clf, X_digits, y_digits)
array([0.938..., 0.963..., 0.944...])
• Next, we find the best parameters of an SGDClassifier trained with perceptron loss. The dataset is plotted in the following figure:
SGDClassifier
• This estimator implements regularized linear models with stochastic gradient descent (SGD) learning: the gradient of the loss is estimated one sample at a time and the model is updated along the way.
• This implementation works with data represented as
dense or sparse arrays of floating point values for the
features.
• The model it fits can be controlled with the loss
parameter; by default, it fits a linear support vector
machine (SVM).
• The regularizer is a penalty added to the loss
function that shrinks model parameters
towards the zero vector using either the
squared euclidean norm L2 or the absolute
norm L1 or a combination of both (Elastic Net).
• If the parameter update crosses the 0.0
value because of the regularizer, the
update is truncated to 0.0 to allow for
learning sparse models and achieve
online feature selection.
 from sklearn.model_selection import GridSearchCV
>>> param_grid = [
{
'penalty': [ 'l1', 'l2', 'elasticnet' ],
'alpha': [ 1e-5, 1e-4, 5e-4, 1e-3, 2.3e-3, 5e-3, 1e-2],
'l1_ratio': [0.01, 0.05, 0.1, 0.15, 0.25, 0.35, 0.5, 0.75, 0.8]
} ]
>>> sgd = SGDClassifier(loss='perceptron',
learning_rate='optimal')
>>> gs = GridSearchCV(estimator=sgd, param_grid=param_grid,
scoring='accuracy', cv=10,
n_jobs=multiprocessing.cpu_count()) 
>>> gs.fit(X, Y)
>>> gs.best_score_
0.89400000000000002
>>> gs.best_estimator_
SGDClassifier(alpha=0.001, average=False,
class_weight=None, epsilon=0.1, eta0=0.0,
fit_intercept=True, l1_ratio=0.1, learning_rate='optimal',
loss='perceptron', n_iter=5, n_jobs=1, penalty='elasticnet',
power_t=0.5, random_state=None, shuffle=True, verbose=0,
warm_start=False)
Classification metrics
• A classification task can be evaluated in many different ways to achieve specific objectives. Of course, the most important metric is the accuracy, often expressed as:
accuracy = (number of correct predictions) / (total number of samples) = 1 - error
• In scikit-learn, it can be assessed using the built-in accuracy_score() function:
from sklearn.metrics import accuracy_score
>>> accuracy_score(Y_test, lr.predict(X_test))
0.94399999999999995
from sklearn.metrics import zero_one_loss 
>>> zero_one_loss(Y_test, lr.predict(X_test))
0.05600000000000005 
>>> zero_one_loss(Y_test, lr.predict(X_test),
normalize=False)
7L
A similar but opposite metric (here, higher is better) is the Jaccard similarity coefficient, defined as:
J(A, B) = |A ∩ B| / |A ∪ B|
where A and B are the sets of true and predicted label assignments.
• This index measures the similarity and is bounded
between 0 (worst performances) and 1 (best
performances).
• In the former case, the intersection is null, while in
the latter, the intersection and union are equal
because there are no misclassifications.
from sklearn.metrics import jaccard_similarity_score 
>>> jaccard_similarity_score(Y_test, lr.predict(X_test))
0.94399999999999995
These measures provide a good insight into our
classification algorithms.
• it's necessary to be able to differentiate between different
kinds of misclassifications (we're considering the binary case
with the conventional notation: 0- negative, 1-positive),
because the relative weight is quite different. For this reason,
we introduce the following definitions:
• True positive: A positive sample correctly
classified
• False positive: A negative sample classified as
positive
• True negative: A negative sample correctly
classified
• False negative: A positive sample classified as
negative
At first glance, false positives and false negatives can be considered similar errors, but think about a medical prediction: while a false positive can easily be discovered with further tests, a false negative is often neglected, with serious repercussions following from that oversight. For this reason, it's useful to introduce the concept of a confusion matrix:
• it's possible to build a confusion matrix using a built-in
function. Let's consider a generic logistic regression on
a dataset X with labels Y:
>>> X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
test_size=0.25)
>>> lr = LogisticRegression()
>>> lr.fit(X_train, Y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False,
fit_intercept=True, intercept_scaling=1, max_iter=100,
multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
•  Now we can compute our confusion matrix
and immediately see how the classifier is
working:
from sklearn.metrics import confusion_matrix
>>> cm = confusion_matrix(y_true=Y_test, y_pred=lr.predict(X_test))
>>> cm[::-1, ::-1]
[[50 5]
 [ 2 68]]
• So we have five false negatives and two false positives.
• Another useful direct measure is the precision:
precision = TP / (TP + FP)
• This is directly connected with the ability to capture the features that determine the positiveness of a sample, so as to avoid its misclassification as negative. In scikit-learn, the implementation is:
from sklearn.metrics import precision_score 
>>> precision_score(Y_test, lr.predict(X_test))
0.96153846153846156
The ability to detect true positive samples among all the potential positives can be assessed using another measure, the recall:
recall = TP / (TP + FN)
• The scikit-learn implementation is:
• from sklearn.metrics import recall_score
>>> recall_score(Y_test, lr.predict(X_test))
0.90909090909090906
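These values can be checked by hand against the confusion matrix shown earlier (TP = 50, FN = 5, FP = 2, TN = 68):

precision = TP / (TP + FP) = 50 / (50 + 2) ≈ 0.9615
recall = TP / (TP + FN) = 50 / (50 + 5) ≈ 0.9091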
It's not surprising that we have 90 percent recall with 96 percent precision, because the number of false negatives (which impacts recall) is proportionally higher than the number of false positives (which impacts precision). A weighted harmonic mean between precision and recall is provided by the F-beta score:
F_beta = (1 + β²) * (precision * recall) / (β² * precision + recall)
• A beta value equal to 1 determines the so-called F1
score, which is a perfect balance between the two
measures.
• A beta less than 1 gives more importance to
precision and a value greater than 1 gives more
importance to recall.
• The following snippet shows how to implement it with scikit-
learn:
• from sklearn.metrics import fbeta_score
• >>> fbeta_score(Y_test, lr.predict(X_test), beta=1)
0.93457943925233655
• >>> fbeta_score(Y_test, lr.predict(X_test), beta=0.75)
0.94197437829691033
• >>> fbeta_score(Y_test, lr.predict(X_test), beta=1.25)
0.92886270956048933
• For F1 score, scikit-learn provides the function
f1_score(), which is equivalent to
fbeta_score() with beta=1.
• The highest score is achieved by giving more
importance to precision (which is higher),
while the least one corresponds to a recall
predominance.
• FBeta is hence useful to have a compact
picture of the accuracy as a trade-off
between high precision and a limited number
of false negatives.
ROC curve
• The ROC curve (or receiver operating
characteristics) is a valuable tool to compare
different classifiers that can assign a score to
their predictions.
• In general, this score can be interpreted as a
probability, so it's bounded between 0 and 1.
• The plane is structured like in the following
figure:
• The x axis represents the increasing false positive rate (equal to 1 - specificity), while the y axis represents the true positive rate (also known as sensitivity).
• The dashed oblique line represents a perfectly random
classifier, so all the curves below this threshold perform
worse than a random choice, while the ones above it
show better performances. Of course, the best classifier
has an ROC curve split into the segments [0, 0] - [0, 1]
and [0, 1] - [1, 1], and our goal is to find algorithms
whose performances should be as close as possible to
this limit.
• To show how to create a ROC curve with scikit-learn,
we're going to train a model to determine the scores for
the predictions (this can be achieved using the
decision_function() or predict_proba() methods):
>>> X_train, X_test, Y_train, Y_test =
train_test_split(X, Y, test_size=0.25) 
>>> lr = LogisticRegression()
>>> lr.fit(X_train, Y_train)
LogisticRegression(C=1.0, class_weight=None,
dual=False, fit_intercept=True,intercept_scaling=1,
max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear',
tol=0.0001, verbose=0, warm_start=False)
>>> Y_scores = lr.decision_function(X_test)
• we can compute the ROC curve:
from sklearn.metrics import roc_curve
>>> fpr, tpr, thresholds = roc_curve(Y_test, Y_scores)
• it's also useful to compute the area under the curve
(AUC), whose value is bounded between 0 (worst
performances) and 1 (best performances), with a
perfectly random value corresponding to 0.5:
from sklearn.metrics import auc
>>> auc(fpr, tpr) 0.96961038961038959
• We already know that our performances are
rather good because the AUC is close to 1.
Now we can plot the ROC curve using
matplotlib. As this book is not dedicated to
this powerful framework, I'm going to use a
snippet that can be found in several examples:
import matplotlib.pyplot as plt 
>>> plt.figure(figsize=(8, 8))
>>> plt.plot(fpr, tpr, color='red', label='Logistic regression (AUC: %.2f)'
% auc(fpr, tpr))
>>> plt.plot([0, 1], [0, 1], color='blue', linestyle='--')
>>> plt.xlim([0.0, 1.0])
>>> plt.ylim([0.0, 1.01])
>>> plt.title('ROC Curve')
>>> plt.xlabel('False Positive Rate')
>>> plt.ylabel('True Positive Rate')
>>> plt.legend(loc="lower right")
>>> plt.show()
• As confirmed by the AUC, our ROC curve
shows very good performance. In later
chapters, we're going to use the ROC curve to
visually compare different algorithms. As an
exercise, you can try different parameters of
the same model and plot all the ROC curves,
to immediately understand which setting is
preferable.
References
• https://www.statisticssolutions.com/assumptions-of-multiple-linear-
regression/
• https://machinelearningmastery.com/linear-regression-for-machine-
learning/
• https://medium.com/@jayeshbahire/lasso-ridge-and-elastic-net-
regularization-4807897cb722
• https://www.statisticssolutions.com/what-is-logistic-regression/
• https://www.analyticsvidhya.com/blog/2015/11/beginners-guide-on-
logistic-regression-in-r/
• https://www.kaggle.com/joparga3/2-tuning-parameters-for-logistic-
regression
• https://www.jeremyjordan.me/hyperparameter-tuning/
