
ENCODING

Machine learning models require all input and output variables to be numeric.

This means that if your data contains categorical data, you must encode it to numbers before
you can fit and evaluate a model.

The two most popular techniques are an Ordinal Encoding and a One-Hot Encoding.

 Nominal Variable (Categorical). A variable comprising a finite set of discrete values with no relationship between values.
 Ordinal Variable. A variable comprising a finite set of discrete values with a ranked ordering between values.

Ordinal Encoding
In ordinal encoding, each unique category value is assigned an integer value.
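As a minimal sketch, this can be done with scikit-learn's OrdinalEncoder; the temperature categories below are hypothetical example data, and the category order is passed explicitly so the integers follow the ranking:

```python
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal data with the ranking cold < warm < hot.
data = [["cold"], ["warm"], ["hot"], ["warm"]]

# Pass categories explicitly so integers follow the ranked order.
encoder = OrdinalEncoder(categories=[["cold", "warm", "hot"]])
encoded = encoder.fit_transform(data)
print(encoded.ravel())  # [0. 1. 2. 1.]
```

If the categories are not passed explicitly, OrdinalEncoder assigns integers in alphabetical order, which may not match the intended ranking.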

One-Hot Encoding
For categorical variables where no ordinal relationship exists, an integer encoding may be
insufficient at best and misleading to the model at worst.

A one-hot encoding can be applied to the ordinal representation. This is where the integer
encoded variable is removed and one new binary variable is added for each unique integer
value in the variable.

Dummy Variable Encoding


The one-hot encoding creates one binary variable for each category.

The problem is that this representation includes redundancy.

When there are C possible values of the predictor and only C – 1 dummy variables are used,
the matrix inverse can be computed, and the contrast method is said to be a full-rank
parameterization.

LOGISTIC REGRESSION
Logistic regression is named for the function used at the core of the method, the logistic
function.

The logistic function, also called the sigmoid function, was developed by statisticians to
describe properties of population growth in ecology: rising quickly and maxing out at the
carrying capacity of the environment. It’s an S-shaped curve that can take any real-valued
number and map it into a value between 0 and 1, but never exactly at those limits.
1 / (1 + e^-value)

Where e is the base of the natural logarithms (Euler’s number or the EXP() function in your
spreadsheet) and value is the actual numerical value that you want to transform. 
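The formula above translates directly into code; a minimal sketch:

```python
import math

def sigmoid(value):
    """Map any real number into (0, 1) via 1 / (1 + e^-value)."""
    return 1.0 / (1.0 + math.exp(-value))

print(sigmoid(0))    # 0.5
print(sigmoid(10))   # close to 1
print(sigmoid(-10))  # close to 0
```

Note that the output approaches 0 and 1 but never reaches them exactly, matching the description above.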

Representation Used for Logistic Regression

Logistic regression uses an equation as the representation, very much like linear regression.

Input values (x) are combined linearly using weights or coefficient values (referred to by the
Greek letter beta) to predict an output value (y). A key difference from linear
regression is that the output value being modeled is a binary value (0 or 1) rather than a
continuous numeric value.

Below is an example logistic regression equation:

y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))

Where y is the predicted output, b0 is the bias or intercept term and b1 is the coefficient for
the single input value (x). Each column in your input data has an associated b coefficient (a
constant real value) that must be learned from your training data.
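The equation above can be sketched directly; the coefficient values b0 and b1 below are made-up illustrative numbers, not learned from any data:

```python
import math

def predict(x, b0, b1):
    """y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))."""
    z = b0 + b1 * x
    return math.exp(z) / (1.0 + math.exp(z))

# With b0 = -2 and b1 = 1, an input of x = 2 makes the exponent zero,
# so the predicted probability is exactly 0.5.
print(predict(2, b0=-2.0, b1=1.0))  # 0.5
```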

 Logistic regression is a form of regression used when the dependent variable
is dichotomous, discrete, or categorical, and the explanatory variables are of any kind.
 Using the logit transformation, logistic regression always predicts the probability of
group membership in relation to several independent variables, regardless of their distribution.
 The logistic regression analysis is based on calculating the odds of the outcome as the
ratio of the probability of having the outcome to the probability of not having it.

Logistic regression is applied to the dataset. The solver used is “newton-cg”, run for up to
10000 iterations with n_jobs set to 2.
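A sketch of the fit just described, using the "newton-cg" solver with max_iter=10000 and n_jobs=2; the toy dataset from make_classification is a stand-in for the actual data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in binary classification data (the real dataset is not shown here).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

model = LogisticRegression(solver="newton-cg", max_iter=10000, n_jobs=2)
model.fit(X, y)
print(model.score(X, y))  # mean accuracy on the training data
```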

Applying GridSearchCV for Logistic Regression


 Grid Search is the process of performing hyperparameter tuning in order to determine
the optimal values for a given model. The performance of a model depends significantly
on the values of its hyperparameters.
 There is no way to know the best hyperparameter values in advance, so
ideally we would need to try all possible values to find the optimal ones. Doing this
manually could take a considerable amount of time and resources, so we use
GridSearchCV to automate the tuning of hyperparameters.
 GridSearchCV is a function that comes in scikit-learn’s (or sklearn’s) model_selection
package. This function loops through predefined hyperparameters and fits the estimator
(model) on the training set, so that, in the end, the best parameters from the listed
hyperparameters can be selected.

Grid Search is performed on the dataset with the l2 penalty, two solvers (“sag” and “lbfgs”),
and two tolerances, 0.0001 and 0.00001.
Various arguments taken by the GridSearchCV function:

1. estimator: the model instance whose hyperparameters we want to tune.

2. param_grid: the dictionary object that holds the hyperparameter values we want to try.

3. scoring: the evaluation metric we want to use; simply pass a valid string or evaluation-metric object.

4. cv: the number of cross-validation folds to try for each selected set of hyperparameters.

5. verbose: can be set to 1 to get a detailed printout while fitting the data to GridSearchCV.

6. n_jobs: the number of processes we wish to run in parallel for the task.
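A sketch of the grid search described above, with the l2 penalty, the "sag" and "lbfgs" solvers, and the two tolerances; the toy dataset is again a stand-in for the actual data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Stand-in binary classification data.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# The grid from the text: penalty l2, two solvers, two tolerances.
param_grid = {
    "penalty": ["l2"],
    "solver": ["sag", "lbfgs"],
    "tol": [0.0001, 0.00001],
}

search = GridSearchCV(
    estimator=LogisticRegression(max_iter=10000),
    param_grid=param_grid,
    scoring="accuracy",
    cv=5,
    verbose=1,
    n_jobs=2,
)
search.fit(X, y)
print(search.best_params_)  # best combination found by cross-validation
```

After fitting, `search.best_estimator_` holds the refit model with the winning parameters.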

What are the limitations of Logistic Regression?


Logistic Regression is a simple and powerful linear classification algorithm. However, it has
some disadvantages, which have led to alternative classification algorithms like LDA. Some of
the limitations of Logistic Regression are as follows:

 Two-class problems – Logistic Regression is traditionally used for two-class,
binary classification problems. Though it can be extended to multi-class
classification, this is rarely done. Linear Discriminant
Analysis is considered the better choice whenever multi-class classification is required;
for binary classification, both logistic regression and LDA are applied.
 Unstable with Well-Separated classes – Logistic Regression can lack stability when
the classes are well-separated. This is where LDA comes in.
 Unstable with few examples – If there are few examples from which the parameters
are to be estimated, logistic regression becomes unstable. However, Linear
Discriminant Analysis is a better option because it tends to be stable even in such
cases.
