Machine learning models require all input and output variables to be numeric.
This means that if your data contains categorical data, you must encode it to numbers before
you can fit and evaluate a model.
The two most popular techniques are ordinal encoding and one-hot encoding.
Ordinal Encoding
In ordinal encoding, each unique category value is assigned an integer value.
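As a sketch, this mapping can be done with scikit-learn's OrdinalEncoder (the colour values below are made up for illustration):

```python
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical categorical column: each unique value gets an integer.
data = [["red"], ["green"], ["blue"], ["green"]]

encoder = OrdinalEncoder()
encoded = encoder.fit_transform(data)

# Categories are assigned integers in sorted order: blue=0, green=1, red=2.
print(encoded.ravel().tolist())  # [2.0, 1.0, 0.0, 1.0]
```

Note that OrdinalEncoder assigns integers in sorted category order; if your categories have a meaningful order, pass it explicitly via the categories argument.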
One-Hot Encoding
For categorical variables where no ordinal relationship exists, an integer encoding is
insufficient at best and misleading to the model at worst.
A one-hot encoding can be applied to the ordinal representation. This is where the integer
encoded variable is removed and one new binary variable is added for each unique integer
value in the variable.
When there are C possible values of the predictor and only C - 1 dummy variables are used,
the matrix inverse can be computed and the contrast method is said to be a full-rank
parameterization.
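A minimal sketch of both flavours using pandas (the column name "color" is hypothetical): get_dummies produces one binary column per unique value, and drop_first=True drops one column to give the C - 1 (full-rank) parameterization.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue"]})

# Full one-hot encoding: C binary columns, one per unique value.
full = pd.get_dummies(df["color"])

# Full-rank parameterization: C - 1 dummy variables (one column dropped).
full_rank = pd.get_dummies(df["color"], drop_first=True)

print(list(full.columns))       # ['blue', 'green', 'red']
print(list(full_rank.columns))  # ['green', 'red']
```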
LOGISTIC REGRESSION
Logistic regression is named for the function used at the core of the method, the logistic
function.
The logistic function, also called the sigmoid function, was developed by statisticians to
describe properties of population growth in ecology, rising quickly and maxing out at the
carrying capacity of the environment. It’s an S-shaped curve that can take any real-valued
number and map it into a value between 0 and 1, but never exactly at those limits.
1 / (1 + e^-value)
Where e is the base of the natural logarithms (Euler’s number or the EXP() function in your
spreadsheet) and value is the actual numerical value that you want to transform.
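The transform above can be written directly in code (math.exp stands in for the spreadsheet's EXP()):

```python
import math

def logistic(value):
    """Map any real-valued number into (0, 1) via the S-shaped sigmoid curve."""
    return 1.0 / (1.0 + math.exp(-value))

# The midpoint of the curve is at 0; large inputs approach the limits
# 0 and 1 without ever reaching them exactly.
print(logistic(0))   # 0.5
print(logistic(6))   # ~0.9975
```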
Logistic regression uses an equation as the representation, very much like linear regression.
Input values (x) are combined linearly using weights or coefficient values (referred to as the
Greek capital letter Beta) to predict an output value (y). A key difference from linear
regression is that the output value being modeled is a binary value (0 or 1) rather than a
numeric value.
For a single input value (x), the model is:

y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))

Where y is the predicted output, b0 is the bias or intercept term and b1 is the coefficient for
the single input value (x). Each column in your input data has an associated b coefficient (a
constant real value) that must be learned from your training data.
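As a sketch with made-up coefficients (b0 and b1 here are illustrative values, not coefficients learned from data):

```python
import math

def predict(x, b0, b1):
    # y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x)):
    # the logistic function applied to the linear combination of inputs.
    z = b0 + b1 * x
    return math.exp(z) / (1.0 + math.exp(z))

# The output is a probability; 0.5 is the usual threshold for the class label.
p = predict(2.0, b0=-1.0, b1=0.8)
label = 1 if p >= 0.5 else 0
```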
Logistic regression is a form of regression used when the dependent variable
is dichotomous, discrete, or categorical, and the explanatory variables are of any kind.
Using the logit transformation, logistic regression always predicts the probability of
group membership from several variables, independent of their distribution.
The logistic regression analysis is based on calculating the odds of the outcome: the
probability of having the outcome divided by the probability of not having it.
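For instance, the odds corresponding to a given probability can be computed directly (the probability values below are arbitrary examples):

```python
def odds(p):
    """Odds = probability of having the outcome / probability of not having it."""
    return p / (1.0 - p)

# A probability of 0.8 corresponds to odds of 4 to 1;
# a probability of 0.5 corresponds to even odds of 1 to 1.
print(odds(0.8))
print(odds(0.5))
```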
Logistic regression is applied to the dataset. The solver used is "newton-cg", run for up to
10,000 iterations with n_jobs set to 2.
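A sketch of the fit described above, assuming a feature matrix X and labels y have already been prepared (the toy data here is purely illustrative):

```python
from sklearn.linear_model import LogisticRegression

# Toy data standing in for the real dataset.
X = [[0.5], [1.5], [2.5], [3.5]]
y = [0, 0, 1, 1]

# solver="newton-cg", up to 10,000 iterations, 2 parallel jobs, as in the text.
model = LogisticRegression(solver="newton-cg", max_iter=10000, n_jobs=2)
model.fit(X, y)

print(model.predict([[3.0]]))
```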
Grid search is performed on the dataset with the l2 penalty. Two solvers are used, "sag" and
"lbfgs", and two tolerances are given, 0.0001 and 0.00001.
Various arguments taken by the GridSearchCV function:
1. estimator: the model instance whose hyperparameters we want to search.
2. param_grid: the dictionary object that holds the hyperparameters we want to try.
3. scoring: the evaluation metric we want to use; pass a valid string or an evaluation-metric
object.
4. cv: the number of cross-validation folds to try for each selected set of hyperparameters.
5. verbose: can be set to 1 to get a detailed printout while fitting the data to
GridSearchCV.
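Putting those arguments together, the search described above might be sketched as follows (the tiny dataset is illustrative, and note the solver name is "lbfgs"):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Toy data standing in for the real dataset.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [0, 0, 0, 1, 1, 1]

# Two solvers and two tolerances, as described in the text.
param_grid = {
    "solver": ["sag", "lbfgs"],
    "tol": [0.0001, 0.00001],
}

search = GridSearchCV(
    estimator=LogisticRegression(penalty="l2", max_iter=10000),
    param_grid=param_grid,
    scoring="accuracy",
    cv=3,
    verbose=1,
)
search.fit(X, y)
print(search.best_params_)
```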