Types of Classification
Classification is of two types:
1. Binary Classification: When we have to categorize the given data into 2 distinct classes.
Example – on the basis of the given health conditions of a person, we have to determine
whether the person has a certain disease or not.
2. Multiclass Classification: When the number of classes is more than 2. Example – on the
basis of data about different species of flowers, we have to determine which species our
observation belongs to.
Fig: Binary and Multiclass Classification. Here x1 and x2 are the variables upon which the class
is predicted.
In the disease example above, the two possible outcomes are:
1. The patient has the said disease. Basically, a result labeled “Yes” or “True”.
2. The patient is disease-free. A result labeled “No” or “False”.
A typical classification pipeline consists of the following building blocks:
1. X: pre-classified data, in the form of an N × M matrix, where N is the number of observations and M is the
number of features.
2. y: an N-dimensional vector of class labels, one for each of the N observations.
3. Feature Extraction: extracting useful information from the input X using a series of transforms.
4. ML Model: the “classifier” we will train.
5. y’: the labels predicted by the classifier.
6. Quality Metric: the metric used for measuring the performance of the model.
7. ML Algorithm: the algorithm used to update the weights w’, which updates the model so that it
“learns” iteratively.
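As a minimal sketch of how these pieces fit together (assuming scikit-learn, with a StandardScaler standing in for the feature-extraction step and logistic regression as the classifier; the dataset is synthetic and purely illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# X: N x M matrix of pre-classified observations, y: N-dimensional vector of class labels
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# feature extraction: a simple transform of the raw inputs
scaler = StandardScaler().fit(X_train)
X_train_t, X_test_t = scaler.transform(X_train), scaler.transform(X_test)

# ML model ("classifier"), trained by the ML algorithm that updates the weights w'
model = LogisticRegression().fit(X_train_t, y_train)

# y': labels predicted by the classifier, scored with a quality metric
y_pred = model.predict(X_test_t)
print("Accuracy:", accuracy_score(y_test, y_pred))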
Confusion Matrix
It is the easiest way to measure the performance of a classification problem where the
output can belong to two or more classes. A confusion matrix is simply a table with two
dimensions, “Actual” and “Predicted”; the combinations of actual and predicted classes give
the four cells “True Positives (TP)”, “True Negatives (TN)”, “False Positives (FP)” and “False
Negatives (FN)”, as shown below −
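As a small illustration with hypothetical labels: scikit-learn's confusion_matrix arranges the table with actual classes as rows and predicted classes as columns, so for a binary problem the four cells can be unpacked as TN, FP, FN, TP:

from sklearn.metrics import confusion_matrix

y_actual = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]   # hypothetical true labels
y_pred   = [1, 0, 1, 1, 1, 0, 1, 1, 0, 0]   # hypothetical predicted labels

cm = confusion_matrix(y_actual, y_pred)
print(cm)                     # rows = actual classes, columns = predicted classes
tn, fp, fn, tp = cm.ravel()   # the four cells for a binary problem
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)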
Classification Accuracy
It is the most common performance metric for classification algorithms. It may be defined as
the number of correct predictions made as a ratio of all predictions made. We can easily
calculate it from the confusion matrix with the help of the following formula −
Accuracy = (TP + TN) / (TP + FP + FN + TN)
We can use the accuracy_score function of sklearn.metrics to compute the accuracy of our
classification model.
Classification Report
This report consists of the scores of Precision, Recall, F1 and Support. They are
explained as follows −
Precision
Precision, often used in document retrieval, may be defined as the fraction of the documents
returned by our ML model that are actually correct. We can easily calculate it from the
confusion matrix with the help of the following formula −
Precision = TP / (TP + FP)
Recall or Sensitivity
Recall may be defined as the fraction of actual positives that are correctly returned by our
ML model. We can easily calculate it from the confusion matrix with the help of the following
formula −
Recall = TP / (TP + FN)
Specificity
Specificity, in contrast to recall, may be defined as the fraction of actual negatives that are
correctly identified by our ML model. We can easily calculate it from the confusion matrix
with the help of the following formula −
Specificity = TN / (TN + FP)
Support
Support may be defined as the number of samples of the true response that lie in each
class of target values.
F1 Score
This score gives us the harmonic mean of precision and recall. The best value of F1 is
1 and the worst is 0. We can calculate the F1 score with the help of the following formula −
F1 = 2 * (precision * recall) / (precision + recall)
The F1 score gives equal relative contribution to precision and recall.
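To make the formulas above concrete, here is a small worked sketch that computes accuracy, precision, recall, specificity and F1 from hypothetical confusion-matrix counts:

# hypothetical confusion-matrix counts
TP, TN, FP, FN = 45, 40, 10, 5

accuracy    = (TP + TN) / (TP + TN + FP + FN)                    # 0.85
precision   = TP / (TP + FP)                                     # ~0.818
recall      = TP / (TP + FN)                                     # 0.9 (sensitivity)
specificity = TN / (TN + FP)                                     # 0.8
f1          = 2 * (precision * recall) / (precision + recall)    # ~0.857

print(accuracy, precision, recall, specificity, f1)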
We can use the classification_report function of sklearn.metrics to get the classification
report of our classification model.
AUC (Area Under ROC Curve)
AUC (Area Under Curve)-ROC (Receiver Operating Characteristic) is a performance
metric, based on varying threshold values, for classification problems. As the name
suggests, ROC is a probability curve and AUC measures the separability. In simple
words, the AUC-ROC metric tells us about the capability of the model to distinguish
between the classes: the higher the AUC, the better the model.
Mathematically, the curve is created by plotting TPR (True Positive Rate), i.e. sensitivity or
recall, against FPR (False Positive Rate), i.e. 1 − specificity, at various threshold values.
The ROC graph has TPR on the y-axis and FPR on the x-axis.
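A minimal sketch of how such a curve can be produced with scikit-learn's roc_curve (the labels and scores below are hypothetical; in practice the scores would come from predict_proba):

from sklearn.metrics import roc_curve, auc
from matplotlib import pyplot

y_true   = [0, 0, 1, 1, 0, 1, 1, 0]                   # hypothetical true labels
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.3]  # hypothetical predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_scores)    # one (FPR, TPR) point per threshold
print("AUC =", auc(fpr, tpr))

pyplot.plot(fpr, tpr, label="ROC curve")
pyplot.plot([0, 1], [0, 1], linestyle="--", label="chance")
pyplot.xlabel("False Positive Rate (1 - Specificity)")
pyplot.ylabel("True Positive Rate (Sensitivity / Recall)")
pyplot.legend()
pyplot.show()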
Example
The following is a simple recipe in Python which gives an insight into how we
can use the performance metrics explained above on a binary classification model −
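A minimal sketch of such a recipe, using hypothetical actual and predicted label arrays:

from sklearn.metrics import confusion_matrix, accuracy_score, classification_report, roc_auc_score

X_actual = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]   # hypothetical true labels
Y_predic = [1, 0, 1, 1, 1, 0, 1, 1, 0, 0]   # hypothetical predicted labels

print("Confusion Matrix :")
print(confusion_matrix(X_actual, Y_predic))
print("Accuracy Score is", accuracy_score(X_actual, Y_predic))
print("Classification Report :")
print(classification_report(X_actual, Y_predic))
print("AUC-ROC:", roc_auc_score(X_actual, Y_predic))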
For regression models, the commonly used performance metrics are as follows −
Mean Absolute Error (MAE)
It is the mean of the absolute differences between the actual and the predicted output values.
We can calculate it with the following formula −
MAE = (1/n) Σ |Y − Ŷ|
Here, Y = actual output values, Ŷ = predicted output values and n = number of observations.
Mean Squared Error (MSE)
It is the mean of the squared differences between the actual and the predicted output values.
We can calculate it with the following formula −
MSE = (1/n) Σ (Y − Ŷ)²
Here, Y = actual output values, Ŷ = predicted output values and n = number of observations.
R Squared (R2)
The R squared metric is generally used for explanatory purposes and provides an indication
of the goodness of fit of a set of predicted output values to the actual output values.
The following formula will help us understand it −
R² = 1 − [ (1/n) Σ (Yᵢ − Ŷᵢ)² ] / [ (1/n) Σ (Yᵢ − Ȳ)² ], where the sums run over i = 1 … n
In the above equation, the numerator is the MSE and the denominator is the variance in the Y
values (Ȳ is the mean of the actual values).
We can use the r2_score function of sklearn.metrics to compute the R squared value.
Example
The following is a simple recipe in Python which gives an insight into how we
can use the performance metrics explained above on a regression model −
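A sketch of such a recipe, with illustrative actual and predicted arrays chosen to be consistent with the output shown below:

from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

X_actual = [5, -1, 2, 10]       # actual output values (illustrative)
Y_predic = [3.5, -0.9, 2, 9.9]  # predicted output values (illustrative)

print("R Squared =", r2_score(X_actual, Y_predic))
print("MAE =", mean_absolute_error(X_actual, Y_predic))
print("MSE =", mean_squared_error(X_actual, Y_predic))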
Output
R Squared = 0.9656060606060606
MAE = 0.42499999999999993
MSE = 0.5674999999999999
Have you ever wondered why Logistic Regression is used for classification
problems but still has “Regression” in its name?
In this post we will answer two questions: why is the “regression” term there in
Logistic Regression, and how can its output be converted to class
probabilities?
Consider predicting credit fraud: for a particular case we must know the class
probability of fraud. We want:
(i) the probability estimates to be well calibrated, meaning that if you take 100
cases whose class membership probability is estimated to be 0.2, then about 20
of them will actually belong to the class;
(ii) the probability estimates to be discriminative, meaning that they should give
different probability estimates for different examples. Say a class probability of 0.5
indicates that 50% of the population is fraudulent; that is just the base rate, so we need
discrimination to obtain higher or lower class-probability estimates for individual
cases.
Assume f(x) is our linear function. An instance x that is further from the
separating boundary intuitively ought to have a higher probability of being in
one class or the other; thus f(x) gives us the (signed) distance from the separating
boundary. As we know, linear regression can output values from -infinity to
+infinity, but class probabilities range from 0 to 1.
One useful notion of the likelihood of an event is its odds.
The odds of an event is the ratio of the probability of the event occurring to the
probability of the event not occurring. Odds range from 0 to +infinity, so we still
cannot map our linear function, which ranges from -infinity to +infinity, directly
onto odds.
But wait: for any number that ranges from 0 to +infinity, its log value will
range from -infinity to +infinity. So let us equate the log-odds with our linear
model.
Assume we have an instance whose membership in class c is to be predicted from the
linear model; then its log-odds will be
log( P(c) / (1 − P(c)) ) = w0 + w1·x1 + w2·x2 + …
Above, w0, w1, w2, … are the weights given by our linear model and x1, x2,
x3, … are the features of the dataset. P(c) is the probability that the instance
is a credit fraud and 1 − P(c) is the probability that it is
not a credit fraud.
Now, we usually want the probability of class c, i.e. P(c), as our predicted class
probability rather than the log-odds, so we can solve for P(c).
Exponentiating both sides cancels the log on the left-hand side:
P(c) / (1 − P(c)) = e^(w0 + w1·x1 + w2·x2 + …)
Solving for P(c) and simplifying step by step:
P(c) = e^(w0 + w1·x1 + …) · (1 − P(c))
P(c) · (1 + e^(w0 + w1·x1 + …)) = e^(w0 + w1·x1 + …)
P(c) = e^(w0 + w1·x1 + …) / (1 + e^(w0 + w1·x1 + …)) = 1 / (1 + e^−(w0 + w1·x1 + …))   (Equation 2)
If we plot Equation 2 for various values of x and w, we get a curve that looks
something like the following:
Logistic Regression : Estimates of class probabilities
The above curve is called the “Sigmoid Curve” due to its S shape, which squeezes
the probabilities into their correct range (between zero and one).
The sigmoid curve suggests that values near the boundary are uncertain of their
class; as you move away from the boundary the uncertainty decreases and
class membership becomes more certain.
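A quick numerical sketch of Equation 2, with made-up weights and features, shows how the signed distance f(x) is squeezed into a probability between 0 and 1:

import numpy as np

def sigmoid(z):
    # Equation 2: P(c) = 1 / (1 + e^(-f(x)))
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.5, 2.0, -1.0])                      # hypothetical weights w0, w1, w2
for x in ([0.1, 0.2], [3.0, -2.0], [-4.0, 1.0]):    # hypothetical feature vectors
    f_x = w[0] + np.dot(w[1:], x)                   # linear score: signed distance from the boundary
    print(f_x, "->", sigmoid(f_x))                  # far from the boundary -> probability near 0 or 1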
The g function gives the model’s estimated probability of seeing x’s actual class given
x’s features.
The maximum-likelihood model “on average” gives the highest probabilities to
the positive examples and the lowest probabilities to the negative examples.
Multi-Class Classification
Multi-class classification refers to those classification tasks that have more than two class
labels.
Examples include:
• Face classification.
• Plant species classification.
• Optical character recognition.
Unlike binary classification, multi-class classification does not have the notion of normal and
abnormal outcomes. Instead, examples are classified as belonging to one among a range of
known classes.
The number of class labels may be very large on some problems. For example, a model may
predict a photo as belonging to one among thousands or tens of thousands of faces in a face
recognition system.
Problems that involve predicting a sequence of words, such as text translation models, may
also be considered a special type of multi-class classification. Each word in the sequence of
words to be predicted involves a multi-class classification where the size of the vocabulary
defines the number of possible classes that may be predicted and could be tens or hundreds
of thousands of words in size.
It is common to model a multi-class classification task with a model that predicts a Multinoulli
probability distribution for each example.
The Multinoulli distribution is a discrete probability distribution that covers the case where an
event has one of K categorical outcomes, e.g. k in {1, 2, 3, …, K}. For classification, this means
that the model predicts the probability of an example belonging to each class label.
Many algorithms used for binary classification can also be used for multi-class classification. These include:
• k-Nearest Neighbors.
• Decision Trees.
• Naive Bayes.
• Random Forest.
• Gradient Boosting.
Algorithms that are designed for binary classification can be adapted for use for multi-class
problems.
This involves using a strategy of fitting multiple binary classification models for each class vs.
all other classes (called one-vs-rest) or one model for each pair of classes (called one-vs-
one).
• One-vs-Rest: Fit one binary classification model for each class vs. all other classes.
• One-vs-One: Fit one binary classification model for each pair of classes.
Binary classification algorithms that can use these strategies for multi-class classification
include:
• Logistic Regression.
• Support Vector Machine.
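As a hedged sketch of the one-vs-rest and one-vs-one strategies, using scikit-learn's OneVsRestClassifier and OneVsOneClassifier wrappers around logistic regression (the dataset here is synthetic and purely illustrative):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

# synthetic 3-class dataset, purely for illustration
X, y = make_classification(n_samples=300, n_features=5, n_informative=3, n_classes=3, random_state=1)

# One-vs-Rest: one binary logistic-regression model per class
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
# One-vs-One: one binary model per pair of classes
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

print(len(ovr.estimators_), "OvR models,", len(ovo.estimators_), "OvO models")   # 3 and 3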
Next, let’s take a closer look at a dataset to develop an intuition for multi-class classification
problems.
The example below creates a synthetic dataset with 1,000 examples belonging to one of three classes and summarizes it, first as raw numbers and then as a scatter plot:

from collections import Counter
from sklearn.datasets import make_blobs
from matplotlib import pyplot
# define dataset: 1,000 examples, 2 input features, 3 classes
X, y = make_blobs(n_samples=1000, centers=3, random_state=1)
print(X.shape, y.shape)
counter = Counter(y)   # class distribution
print(counter)
for i in range(10):    # first 10 examples
    print(X[i], y[i])
# scatter plot of the dataset, colored by class label
for label in counter:
    pyplot.scatter(X[y == label, 0], X[y == label, 1], label=str(label))
pyplot.legend()
pyplot.show()
Running the example first summarizes the created dataset showing the 1,000 examples
divided into input (X) and output (y) elements.
The distribution of the class labels is then summarized, showing that instances belong to
class 0, class 1, or class 2 and that there are approximately 333 examples in each class.
Next, the first 10 examples in the dataset are summarized showing the input values are
numeric and the target values are integers that represent the class membership.
(1000, 2) (1000,)
...
[-3.05837272  4.48825769] 0
[-8.60973869 -3.72714879] 1
[ 1.37129721  5.23107449] 0
[-9.33917563 -2.9544469 ] 1
[-8.63895561 -8.05263469] 2
[-8.48974309 -9.05667083] 2
[-7.51235546 -7.96464519] 2
[-7.51320529 -7.46053919] 2
[-0.61947075  3.48804983] 0
[-10.91115591 -4.5772537 ] 1
Finally, a scatter plot is created for the input variables in the dataset and the points are colored
based on their class value.
We can see three distinct clusters that we might expect would be easy to discriminate.
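As a follow-up sketch (not part of the listing above), one way to fit a multi-class model on this dataset and inspect its Multinoulli-style probability distribution over the three classes:

from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# same synthetic dataset as in the listing above
X, y = make_blobs(n_samples=1000, centers=3, random_state=1)

model = LogisticRegression()
model.fit(X, y)

# per-class probabilities for the first example (they sum to 1), plus the predicted class
print(model.predict_proba(X[:1]))
print("Predicted class:", model.predict(X[:1]))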
ML | Underfitting and Overfitting
When we talk about a machine learning model, we are really talking about
how well it performs on new data, i.e. its prediction error. Let us consider that we are
designing a machine learning model. A model is said to be a good machine learning
model if it generalizes to any new input data from the problem domain in a proper way.
This helps us make predictions about future data that the model has never
seen. Now, suppose we want to check how well our machine learning
model learns and generalizes to new data. For that, we look at
overfitting and underfitting, which are the two things chiefly responsible for the poor
performance of machine learning algorithms.
Before diving further let’s understand two important terms:
• Bias: Assumptions made by a model to make the target function easier to learn. In practice it
shows up as the error rate on the training data: when this error rate is high we call it high bias,
and when it is low we call it low bias.
• Variance: The difference between the error rate on the training data and on the testing data is called
variance. If the difference is high it is called high variance, and when the difference
of errors is low it is called low variance. Usually, we want low variance
so that our model generalizes well.
In a nutshell, underfitting refers to a model that can neither perform well on the training data nor
generalize to new data.
Overfitting: A statistical model is said to be overfitted when it does not make accurate
predictions on testing data. When a model gets trained with so much data, it starts learning from
the noise and inaccurate entries in the data set, and testing on new data then results in high
variance. The model fails to categorize the data correctly because it has absorbed too many details and
too much noise. Overfitting is typically caused by non-parametric and non-linear methods, because these types
of machine learning algorithms have more freedom in building the model based on the dataset and
can therefore build unrealistic models. Solutions to avoid overfitting include using a linear
algorithm if we have linear data, or constraining parameters such as the maximal depth if we are using
decision trees.
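A minimal sketch of the decision-tree suggestion above, on a synthetic dataset: limiting max_depth and comparing train vs. test accuracy (a large gap between the two is the usual symptom of overfitting):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (None, 3):   # unrestricted tree vs. depth-limited tree
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print("max_depth =", depth,
          "| train acc =", tree.score(X_train, y_train),
          "| test acc =", tree.score(X_test, y_test))
# the unrestricted tree typically fits the training data almost perfectly but scores lower on test data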
What is a Hypothesis?
A hypothesis is defined as a supposition or proposed explanation based on
insufficient evidence or assumptions. It is just a guess based on some known
facts, but it has not yet been proven. A good hypothesis is testable: it
results in either true or false.
Example: Let's understand the hypothesis with a common example. A
scientist claims that ultraviolet (UV) light can damage the eyes, and that it may
therefore also cause blindness.
In this example, the scientist just claims that UV rays are harmful to the eyes,
and we assume they may cause blindness. However, this may or may not be
true. Hence, such assumptions are called a hypothesis.
There are some common methods for finding the possible hypothesis
from the hypothesis space, where the hypothesis space is represented
by uppercase h (H) and a hypothesis by lowercase h (h). These are defined as
follows:
Hypothesis Space (H):
The hypothesis space is defined as the set of all possible legal hypotheses; hence it is also
known as the hypothesis set. It is used by supervised machine learning algorithms
to determine the best possible hypothesis that describes the target function or
best maps inputs to outputs.
Hypothesis (h):
Hence a hypothesis (h) can be understood as a single candidate function that maps inputs
to proper outputs; it can be evaluated as well as used to make predictions.
For a simple linear model, the hypothesis can be written as:
y = mx + b
Where,
y: range (the output)
m: slope of the line that divides the data, i.e. the change in y divided by the change
in x
x: domain (the input)
b: intercept (constant)
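As a tiny sketch, the same linear hypothesis written as code (the slope and intercept values are arbitrary):

def h(x, m=2.0, b=1.0):
    # hypothesis: y = m*x + b (slope m, intercept b; values here are arbitrary)
    return m * x + b

print(h(0), h(1), h(2))   # 1.0 3.0 5.0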
Example: Let's understand the hypothesis (h) and hypothesis space (H) with
a two-dimensional coordinate plane showing the distribution of data as
follows:
Now, assume we have some test data for which the ML algorithm has to predict the
outputs, as follows:
If we divide this coordinate plane in such a way that it helps us predict the
output or result, we get the following:
Based on the given test data, the output result will be as follows:
However, depending on the data, the algorithm, and its constraints, this coordinate plane
can also be divided in other ways, as follows:
Hypothesis space (H) is the composition of all legal best possible ways to
divide the coordinate plane so that it best maps input to proper output.
Further, each individual best possible way is called a hypothesis (h). Hence,
the hypothesis and hypothesis space would be like this:
Hypothesis in Statistics
Similar to the hypothesis in machine learning, a hypothesis in statistics is also an
assumption about the output. However, it is falsifiable, which means it can be
rejected in the presence of sufficient evidence.
The alternative hypothesis is the opposite of the null hypothesis, which means that if one of
the two hypotheses is true, then the other must be false. In other words, an alternative
hypothesis is a type of statistical hypothesis which says that there is some significant effect
present in the given set of observations.
Significance level
The significance level is the primary thing that must be set before starting an
experiment. It defines the tolerance for error, i.e. the level at which an
effect can be considered significant. In a typical experiment a 95% confidence level is
used, so the remaining 5% is the level of error that can be tolerated.
The significance level also determines the critical or threshold value.
For example, if the confidence level in an experiment is set to 98%, then the
significance level (the critical value) is 0.02.
P-value
The p-value in statistics quantifies the evidence against a null hypothesis.
In other words, the p-value is the probability of obtaining, purely by random chance,
data that is equal to or rarer than what was actually observed, assuming the null
hypothesis is true.
The smaller the p-value, the stronger the evidence against the null hypothesis, which
means the null hypothesis can be rejected in testing. It is always expressed
as a decimal, such as 0.035.
Whenever a statistical test is carried out on a population or sample to find
the p-value, the conclusion depends on the chosen significance level (critical value).
If the p-value is less than this threshold, the effect is significant and the null
hypothesis can be rejected. If it is higher, there is no significant effect and we fail
to reject the null hypothesis.
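A minimal sketch of this decision rule, using SciPy's two-sample t-test on made-up samples (ttest_ind returns the test statistic and the p-value):

from scipy import stats

group_a = [2.1, 2.5, 2.8, 3.0, 2.4, 2.9]   # hypothetical sample 1
group_b = [3.4, 3.1, 3.8, 3.5, 3.9, 3.2]   # hypothetical sample 2

alpha = 0.05                               # significance level
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print("p-value =", p_value)

if p_value < alpha:
    print("Significant effect: reject the null hypothesis")
else:
    print("No significant effect: fail to reject the null hypothesis")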
The growth function, also called the shatter coefficient or the shattering number,
measures the richness of a set family. It is especially used in the context of statistical
learning theory, where it measures the complexity of a hypothesis class. The term 'growth
function' was coined by Vapnik and Chervonenkis in their 1968 paper, where they also
proved many of its properties.[1] It is a basic concept in machine learning.[2][3]
Definitions
Set-family definition
Let H be a set family (a set of sets) and C a set. Their intersection is defined as the
following set-family:
H ∩ C := { h ∩ C : h ∈ H }
The intersection-size (also called the index) of H with respect to C is |H ∩ C|. If a
set C has m elements then the index is at most 2^m. If the index is exactly 2^m then the
set C is said to be shattered by H.
Hypothesis-class definition
Equivalently, let H be a hypothesis class (a set of binary functions) and C a set of examples.