
Unit-II

Regression and Classification algorithms are Supervised Learning algorithms. Both
are used for prediction in machine learning and work with labeled datasets. They
differ in the kind of machine learning problem each is used for.

The main difference between Regression and Classification algorithms is that Regression
algorithms are used to predict continuous values such as price, salary, age, etc.,
and Classification algorithms are used to predict/classify discrete values such as
Male or Female, True or False, Spam or Not Spam, etc.

Classification
• The Classification algorithm is a Supervised Learning technique that is used to
identify the category of new observations on the basis of training data.
• In Classification, a program learns from the given dataset or observations and
then classifies each new observation into one of a number of classes or groups, such as Yes
or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes are also called
targets, labels or categories.
• Classification is the process of finding a function that divides the
dataset into classes based on different parameters. A computer
program is trained on the training dataset and, based on that training,
categorizes the data into different classes.
• Unlike regression, the output variable of Classification is a category, not a numeric value,
such as "Green or Blue", "fruit or animal", etc. Since Classification
is a Supervised Learning technique, it takes labeled input data, which
means each input comes with its corresponding output.

In a classification algorithm, the input variable (x) is mapped to a discrete output (y):

y = f(x), where y = categorical output

A well-known example of an ML classification task is an email spam detector.

Types of ML Classification Algorithms:

Classification Algorithms can be further divided into the following types:

o Logistic Regression
o K-Nearest Neighbours
o Support Vector Machines
o Kernel SVM
o Naïve Bayes
o Decision Tree Classification
o Random Forest Classification

Classification Terminologies In Machine Learning

• Classifier – An algorithm that maps the input data to a specific
category.
• Classification Model – A model that draws conclusions from the input
data given for training and predicts the class or category of new data.
• Feature – An individual measurable property of the phenomenon
being observed.
• Binary Classification – Classification with two possible outcomes, e.g.
true or false.
• Multi-Class Classification – Classification with more than two classes. In
multi-class classification each sample is assigned to one and only one label or
target.
• Multi-label Classification – Classification where each sample
is assigned to a set of labels or targets.
• Initialize – It is to assign the classifier to be used for the
• Train the Classifier – Each classifier in scikit-learn uses the fit(X, y) method
to fit the model to the training data X and the training labels y.
• Predict the Target – For an unlabeled observation X, the predict(X) method
returns the predicted label y.
• Evaluate – Evaluating the model, e.g. with a classification
report, accuracy score, etc.
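The initialize, train, predict and evaluate steps can be sketched in a few lines. The block below does not use scikit-learn itself: `NearestCentroidClassifier` is a hypothetical toy class written only to mirror the fit(X, y) / predict(X) convention described above, and all the data values are made up for illustration.

```python
from collections import defaultdict
import math


class NearestCentroidClassifier:
    """Hypothetical toy classifier exposing fit(X, y) and predict(X)."""

    def fit(self, X, y):
        # Group the training rows by label, then average each feature per class.
        grouped = defaultdict(list)
        for row, label in zip(X, y):
            grouped[label].append(row)
        self.centroids_ = {
            label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in grouped.items()
        }
        return self

    def predict(self, X):
        # Give each row the label of the closest class centroid.
        return [
            min(self.centroids_, key=lambda label: math.dist(row, self.centroids_[label]))
            for row in X
        ]


# Made-up data: two features per sample, labels 0 or 1.
X_train = [[1.0, 1.2], [0.8, 0.9], [3.1, 3.0], [2.9, 3.3]]
y_train = [0, 0, 1, 1]
X_test = [[1.1, 1.0], [3.0, 3.1]]
y_test = [0, 1]

clf = NearestCentroidClassifier()   # 1. initialize
clf.fit(X_train, y_train)           # 2. train
y_pred = clf.predict(X_test)        # 3. predict
accuracy = sum(p == t for p, t in zip(y_pred, y_test)) / len(y_test)  # 4. evaluate
print(y_pred, accuracy)
```

With a real scikit-learn classifier the same four steps apply; only the class being initialized changes.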

Logistic regression:

1. Logistic regression is one of the most popular Supervised Machine Learning
algorithms. It is used for predicting a categorical dependent variable from a
given set of independent variables.
2. The logistic model (or logit model) is used to model the probability of a certain
class or event, such as pass/fail, win/lose, alive/dead or healthy/sick. This
can be extended to model several classes of events, such as determining whether
an image contains a cat, dog, lion, etc. Each object being detected in the image
would be assigned a probability between 0 and 1.
3. Logistic regression predicts the output of a categorical dependent variable.
Therefore, the outcome must be a categorical or discrete value. It can be either
Yes or No, 0 or 1, True or False, etc., but instead of giving the exact value 0
or 1, it gives probabilistic values that lie between 0 and 1.
4. In Logistic regression, instead of fitting a regression line, we fit an "S"-shaped
logistic function, whose output is bounded by the two limiting values 0 and 1.

Logistic Function (Sigmoid Function):

o The sigmoid function is a mathematical function used to map the predicted
values to probabilities.
o It maps any real value into another value within the range 0 to 1.
o The output of logistic regression must lie between 0 and 1 and cannot go
beyond this limit, so it forms an "S"-shaped curve. This S-shaped curve is
called the Sigmoid function or the logistic function.
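As a minimal sketch of this mapping, the function below evaluates the sigmoid for a few inputs. The name `sigmoid` and the sample inputs are chosen here for illustration; the branch on the sign of z is only a precaution against overflow for very negative inputs.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real number z to a value between 0 and 1."""
    # Split on the sign of z so math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Large negative inputs approach 0, large positive inputs approach 1.
for z in (-10, -1, 0, 1, 10):
    print(z, round(sigmoid(z), 4))
```

Plotting these values against z gives the "S" shape: flat near 0 on the left, flat near 1 on the right, and passing through 0.5 at z = 0.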

Logistic Regression Equation:

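The Logistic regression equation can be obtained from the Linear Regression
equation. The mathematical steps to get the Logistic Regression equation are given
below:
o We know the equation of the straight line can be written as follows, where b_0 is the
intercept and b_1, ..., b_n are the coefficients of the inputs x_1, ..., x_n:

$$y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$$

o In Logistic Regression y can be between 0 and 1 only, so let's divide
y by (1-y). The result is 0 for y = 0 and tends to infinity as y approaches 1:

$$\frac{y}{1-y}$$

o But we need a range from -∞ to +∞, so we take the logarithm of this ratio and the
equation becomes:

$$\log\left(\frac{y}{1-y}\right) = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$$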
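The two transformations in these steps (dividing y by 1-y, then taking the logarithm) can be checked numerically. This is only a sketch: the function names `logit` and `inverse_logit` and the sample probabilities are chosen here for illustration.

```python
import math


def logit(y: float) -> float:
    """Log-odds of a probability y, defined for 0 < y < 1."""
    return math.log(y / (1.0 - y))


def inverse_logit(t: float) -> float:
    """Recover the probability from a log-odds value t."""
    return 1.0 / (1.0 + math.exp(-t))


# The odds y/(1-y) are always positive; their logarithm covers negative and positive values.
for y in (0.1, 0.5, 0.9):
    odds = y / (1.0 - y)
    print(f"y={y}  odds={odds:.3f}  log-odds={logit(y):.3f}")
```

Undoing the logarithm and the division brings back the sigmoid, which is why the linear equation on the right-hand side ends up inside an "S"-shaped curve.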

Type of Logistic Regression:

On the basis of the categories, Logistic Regression can be classified into three types:

o Binomial: In binomial Logistic regression, the dependent variable can have only two
possible types, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic regression, there can be 3 or more
possible unordered types of the dependent variable, such as "cat", "dog", or
"sheep".
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible
ordered types of the dependent variable, such as "Low", "Medium", or "High".

Logistic Regression Loss Function

Problem with the linear line:

When you extend this line, you will have values greater than 1 and less than 0, which
do not make much sense in our classification problem and make the model hard to
interpret. If we needed to predict sales for an outlet, a linear model could be
helpful. But here we need to classify customers, and that is where
`Logistic Regression` comes in.

-We need a function to transform this straight line in such a way that values will be
between 0 and 1:

$$\hat{Y} = Q(Z)$$

$$Q(Z) = \frac{1}{1 + e^{-Z}} \quad \text{(Sigmoid Function)}$$

$$\hat{Y} = \frac{1}{1 + e^{-Z}}$$

After transformation, we will get a curve that remains between 0 and 1. Another
advantage of this function is that all the continuous values it produces lie between 0 and
1, so we can use them as probabilities for making predictions. For example, if the
predicted value is on the extreme right, the probability will be close to 1 and if the
predicted value is on the extreme left, the probability will be close to 0.

Selecting the right model is not enough. You need a function that measures the
performance of a Machine Learning model for given data. The Cost Function quantifies
the error between predicted values and expected values.

`If you can’t measure it, you can’t improve it.`

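-Another thing that will change with this transformation is the Cost Function. In Linear
Regression, we use `Mean Squared Error` as the cost function, given by:

$$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

When this error function is plotted with respect to the weight parameters of the Linear
Regression model, it forms a convex curve, which makes it suitable for the Gradient
Descent optimization algorithm: the weights are adjusted step by step until the global
minimum of the error is reached. Once the prediction passes through the sigmoid, the
same squared error is no longer convex in the weights, so Logistic Regression uses
Log Loss instead.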
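To see the convex shape described here, the sketch below evaluates the Mean Squared Error of a one-weight linear model for several candidate weights. The model y = w * x, the function name `mse_for_weight` and the data are assumptions made up for illustration.

```python
def mse_for_weight(w, xs, ys):
    """Mean squared error of the linear model y_hat = w * x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up data generated by y = 2 * x, so the best weight is w = 2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

# The error falls as w approaches 2 and rises again on the other side.
for w in (0.0, 1.0, 2.0, 3.0, 4.0):
    print(w, mse_for_weight(w, xs, ys))
```

The printed errors form a bowl around w = 2, which is the single minimum that gradient descent would move towards.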

What is Log Loss?

Log Loss is one of the most widely used classification metrics based on probabilities.
Raw log-loss values are hard to interpret on their own, but log loss is still a good
metric for comparing models. For any given problem, a lower log loss value means
better predictions.

Mathematical interpretation:

Log Loss is the negative average of the log of the corrected predicted probabilities,
taken over all instances.

Let us understand it with an example:

The model is giving predicted probabilities as shown above.

What are the corrected probabilities?

-> By default, the output of the logistic regression model is the probability of the
sample being positive (indicated by 1). For example, if a logistic regression model is
trained to classify on a `company dataset`, then the predicted probability column gives
the probability that the person has bought a jacket. In the above data set, the
probability that the person with ID6 will buy a jacket is 0.94.

In the same way, the probability that the person with ID5 will buy a jacket (i.e. belongs
to class 1) is 0.1, but the actual class for ID5 is 0, so the probability for that class
is (1-0.1)=0.9. So 0.9 is the corrected probability for ID5.

We then find the log of the corrected probability for each instance.

As you can see these log values are negative. To deal with the negative sign, we take
the negative average of these values, to maintain a common convention that lower
loss scores are better.

In short, there are three steps to find Log Loss:

1. Find the corrected probabilities.
2. Take the log of the corrected probabilities.
3. Take the negative average of the values from the 2nd step.

If we summarize all the above steps, we can use the formula:

$$\text{Log Loss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(p(y_i)) + (1 - y_i) \log(1 - p(y_i)) \right]$$

Here yi represents the actual class of instance i and p(yi) is the predicted probability
of class 1.

• p(yi) is the probability of 1.
• 1-p(yi) is the probability of 0.

Now let's see how the above formula works in two cases:

1. When the actual class is 1: the second term, (1-1).log(1-p(yi)), is 0 and we are
left with the first term, yi.log(p(yi)) = log(p(yi)).
2. When the actual class is 0: the first term, 0.log(p(yi)), is 0 and we are left with
the second term, (1-yi).log(1-p(yi)) = log(1-p(yi)).

In both cases we are left with the log of the corrected probability, so the formula
reproduces the three steps above. This is the binary cross-entropy/log loss.
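The three steps and the single summarized formula can be compared directly in code. This is a sketch that assumes the standard binary cross-entropy form of the formula; the function names are chosen here, and the labels and probabilities are made up rather than taken from the original table.

```python
import math


def log_loss_steps(y_true, p_pred):
    """Log loss via the three steps: corrected probability, log, negative average."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        corrected = p if y == 1 else 1.0 - p   # step 1: corrected probability
        total += math.log(corrected)           # step 2: log of it
    return -total / len(y_true)                # step 3: negative average


def log_loss_formula(y_true, p_pred):
    """Same quantity from the single formula; p must lie strictly between 0 and 1."""
    terms = [y * math.log(p) + (1 - y) * math.log(1.0 - p)
             for y, p in zip(y_true, p_pred)]
    return -sum(terms) / len(terms)


# Made-up actual classes and predicted probabilities of class 1.
y_true = [1, 0, 1, 0]
p_pred = [0.94, 0.10, 0.60, 0.30]

print(log_loss_steps(y_true, p_pred))
print(log_loss_formula(y_true, p_pred))
```

Both functions print the same value, and a more confident correct prediction (for example 0.94 instead of 0.60 for an actual class of 1) lowers the loss.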

Maths behind Logistic Regression

We could start by assuming p(x) to be a linear function of x. However, the problem is
that p is a probability that should vary from 0 to 1, whereas a linear function is
unbounded. Assuming that log p(x) is a linear function of x helps only partly, because
the logarithm is unbounded in one direction only. To map the range (0,1) onto the whole
real line, we use the logit transformation log[p(x)/(1-p(x))] and make this function
linear, with intercept b_0 and coefficient b_1:

$$\log\frac{p(x)}{1-p(x)} = b_0 + b_1 x$$

After solving for p(x):

$$p(x) = \frac{e^{b_0 + b_1 x}}{1 + e^{b_0 + b_1 x}} = \frac{1}{1 + e^{-(b_0 + b_1 x)}}$$

To make the logistic regression a linear classifier, we could choose a certain threshold,
e.g. 0.5. Now, the misclassification rate can be minimized if we predict y=1 when p ≥
0.5 and y=0 when p<0.5. Here, 1 and 0 are the classes.
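The thresholding rule can be sketched as follows. The coefficient names `b0` and `b1`, their values and the single input feature are assumptions made up for illustration. With a threshold of 0.5, the predicted class flips exactly where the linear part b0 + b1 * x equals 0.

```python
import math


def predict_proba(x, b0, b1):
    """Probability of class 1 for input x under a one-feature logistic model."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


def predict_class(x, b0, b1, threshold=0.5):
    """Predict 1 when the probability reaches the threshold, otherwise 0."""
    return 1 if predict_proba(x, b0, b1) >= threshold else 0


# Hypothetical coefficients, not fitted to any data.
b0, b1 = -4.0, 2.0

# p(x) = 0.5 exactly where b0 + b1 * x = 0.
boundary = -b0 / b1

for x in (1.0, 2.0, 3.0):
    print(x, round(predict_proba(x, b0, b1), 3), predict_class(x, b0, b1))
```

Because the switch point depends on x only through the linear expression b0 + b1 * x, the classifier is linear even though the probability curve is not.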

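Since Logistic regression predicts probabilities, we can fit it using likelihood. For
each training data point x_i we have an observed class y_i. The probability of that
class is p(x_i) if y_i=1 or 1-p(x_i) if y_i=0. Now, the likelihood can be written as:

$$L(b_0, b_1) = \prod_{i=1}^{n} p(x_i)^{y_i} \, (1 - p(x_i))^{1 - y_i}$$

The multiplication can be transformed into a sum by taking the log:

$$\ell(b_0, b_1) = \sum_{i=1}^{n} \left[ y_i \log p(x_i) + (1 - y_i) \log(1 - p(x_i)) \right]$$

Further, after putting in the value of p(x), where log[p(x)/(1-p(x))] = b_0 + b_1 x with
intercept b_0 and coefficient b_1:

$$\ell(b_0, b_1) = \sum_{i=1}^{n} \left[ y_i (b_0 + b_1 x_i) - \log\left(1 + e^{b_0 + b_1 x_i}\right) \right]$$

The next step is to maximize the above log-likelihood function. This is why logistic
regression is fitted with gradient ascent (the opposite of gradient descent).
Equivalently, gradient descent can be used to minimize the negative log-likelihood.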
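A minimal sketch of gradient ascent on the log-likelihood is given below, for a model with one input feature. The coefficient names `b0` and `b1`, the learning rate, the number of steps and the data are all assumptions made up for illustration. The update uses the standard gradient of the log-likelihood, which is the sum of (y - p) for the intercept and the sum of (y - p) * x for the coefficient.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(xs, ys, b0, b1):
    """Sum over all points of y*z - log(1 + e^z), with z = b0 + b1*x."""
    return sum(y * (b0 + b1 * x) - math.log(1.0 + math.exp(b0 + b1 * x))
               for x, y in zip(xs, ys))


def fit_gradient_ascent(xs, ys, learning_rate=0.05, steps=5000):
    b0, b1 = 0.0, 0.0
    for _ in range(steps):
        # Difference between the actual class and the predicted probability.
        errors = [y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys)]
        grad_b0 = sum(errors)
        grad_b1 = sum(e * x for e, x in zip(errors, xs))
        # Ascent: step in the direction of the gradient to increase the likelihood.
        b0 += learning_rate * grad_b0
        b1 += learning_rate * grad_b1
    return b0, b1


# Made-up data; the classes overlap around x = 2 to 2.5, so a finite maximum exists.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_gradient_ascent(xs, ys)
print(b0, b1, log_likelihood(xs, ys, b0, b1))
```

Flipping the sign of the objective and of the update turns the same loop into gradient descent on the log loss.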

Gradient Descent in Machine Learning:

Gradient Descent is one of the most commonly used optimization algorithms
to train machine learning models by minimizing the error between actual and
predicted results. Gradient descent is also used to train Neural Networks.

The local minimum or local maximum of a function can be found using the gradient as
follows:

o If we move towards the negative gradient, i.e. away from the gradient of the function
at the current point, we approach a local minimum of that function. This procedure is
Gradient Descent, which is also known as steepest descent.
o If we move towards the positive gradient, i.e. in the direction of the gradient of the
function at the current point, we approach a local maximum of that function. This
procedure is known as Gradient Ascent.

The main objective of using a gradient descent algorithm is to
minimize the cost function using iteration. To achieve this goal, it performs two
steps iteratively:
o Calculate the first-order derivative of the function to compute the gradient or
slope of that function.
o Move in the direction opposite to the gradient, stepping from the current point by
alpha times the gradient, where alpha is defined as the Learning Rate. It is a
tuning parameter in the optimization process which helps to decide the length of
the steps.
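The two iterative steps can be sketched on a simple one-variable function. The function f(w) = (w - 3)^2, the starting point, the learning rate and the step count are assumptions chosen here for illustration.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeat the two steps: compute the gradient, then move against it."""
    w = start
    for _ in range(steps):
        slope = grad(w)                  # step 1: first-order derivative at w
        w = w - learning_rate * slope    # step 2: move opposite to the gradient
    return w


# f(w) = (w - 3)^2 has derivative 2 * (w - 3) and its minimum at w = 3.
w_min = gradient_descent(lambda w: 2.0 * (w - 3.0), start=0.0)
print(w_min)
```

A larger learning rate takes longer steps and can overshoot the minimum, while a smaller one needs more iterations to get close.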

Logistic Regression and Decision Boundary:


The fundamental application of logistic regression is to determine a decision boundary
for a binary classification problem. Although the baseline is to identify a binary decision
boundary, the approach can also be applied to scenarios with multiple
classes, i.e. multi-class classification.

What is the decision boundary?

In the above diagram, the dashed line can be identified as the decision boundary since
we will observe instances of a different class on each side of the boundary. Our intention
in logistic regression would be to decide on a proper fit to the decision boundary so that
we will be able to predict which class a new feature set might correspond to. The
interesting fact about logistic regression is the utilization of the sigmoid function as the
target class estimator. Let us have a look at the intuition behind this decision.

The Sigmoid function:


The sigmoid function for parameter z can be represented as follows:

$$Q(z) = \frac{1}{1 + e^{-z}}$$

Note that the function always lies in the range of 0 to 1, with both boundaries being
asymptotic. This also makes its output a natural representation of probabilities.
