UNIT-II

Getting started with Classification


As the name suggests, Classification is the task of “classifying things”
into sub-categories, but done by a machine. If that doesn’t sound like
much, imagine your computer being able to differentiate between you and a
stranger, between a potato and a tomato, or between an A grade and an F.
Now it sounds interesting. In Machine Learning and Statistics,
Classification is the problem of identifying to which of a set of categories
(subpopulations) a new observation belongs, on the basis of a training
set of data containing observations whose category membership is known.

Types of Classification
Classification is of two types:

1. Binary Classification: When we have to categorize given data into
2 distinct classes. Example – On the basis of given health
conditions of a person, we have to determine whether the person
has a certain disease or not.
2. Multiclass Classification: The number of classes is more than 2.
For Example – On the basis of data about different species of
flowers, we have to determine which species our observation
belongs to.

Fig: Binary and Multiclass Classification. Here x1 and x2 are the variables
upon which the class is predicted.

How does classification work?


Suppose we have to predict whether a given patient has a certain disease
or not, on the basis of 3 variables, called features.

This means there are two possible outcomes:

1. The patient has the said disease. Basically, a result labeled “Yes”
or “True”.
2. The patient is disease-free. A result labeled “No” or “False”.

This is a binary classification problem.


We have a set of observations called the training data set, which
comprises sample data with actual classification results. We train a
model, called a Classifier, on this data set, and use that model to predict
whether a certain patient has the disease or not.

The outcome thus depends upon:

1. How well these features are able to “map” to the outcome.

2. The quality of our data set. By quality, we mean its statistical and mathematical properties.
3. How well our Classifier generalizes this relationship between the features and the outcome.
4. The values of the features (x1 and x2 in the figure above).
Following is the generalized block diagram of the classification task.


Generalized Classification Block Diagram.

1. X: Pre-classified data, in the form of an N*M matrix. N is the number of observations and M is
the number of features.
2. y: An N-d vector holding the known class of each of the N observations.
3. Feature Extraction: Extracting valuable information from input X using a series of transforms.
4. ML Model: The “Classifier” we’ll train.
5. y’: Labels predicted by the Classifier.
6. Quality Metric: Metric used for measuring the performance of the model.
7. ML Algorithm: The algorithm used to update the weights w’, which updates the model so that it
“learns” iteratively.
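
The blocks of the diagram can be sketched in a few lines of scikit-learn. This is only a minimal illustration under assumptions that are not part of the notes: the data is synthetic (generated with make_classification, with 3 features to mirror the patient example), the Classifier is a logistic regression, and accuracy is used as the quality metric.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# X: N*M feature matrix, y: N known class labels (synthetic, for illustration only)
X, y = make_classification(n_samples=200, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

# Hold out part of the pre-classified data to check how well the model generalizes
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# ML Model + ML Algorithm: fitting updates the weights of the classifier
model = LogisticRegression()
model.fit(X_train, y_train)

# y': labels predicted by the classifier for unseen observations
y_pred = model.predict(X_test)

# Quality Metric: fraction of held-out observations classified correctly
score = accuracy_score(y_test, y_pred)
print("First predicted labels:", y_pred[:10])
print("Accuracy on held-out data:", score)
```

Any other classifier from the list in the next section could be swapped in for LogisticRegression without changing the rest of the sketch.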

Types of Classifiers (algorithms)

There are various types of classifiers. Some of them are :

• Linear Classifiers: Logistic Regression
• Tree-Based Classifiers: Decision Tree Classifier
• Support Vector Machines
• Artificial Neural Networks
• Bayesian Regression
• Gaussian Naive Bayes Classifiers
• Stochastic Gradient Descent (SGD) Classifier
• Ensemble Methods: Random Forests, AdaBoost, Bagging Classifier,
Voting Classifier, ExtraTrees Classifier

A detailed description of these methodologies is beyond the scope of this article!

Practical Applications of Classification

1. Google’s self-driving car uses deep learning-enabled
classification techniques, which enable it to detect and classify
obstacles.
2. Spam e-mail filtering is one of the most widespread and
well-recognized uses of Classification techniques.
3. Detecting Health Problems, Facial Recognition, Speech
Recognition, Object Detection, and Sentiment Analysis all use
Classification at their core.

Machine Learning - Performance Metrics


There are various metrics which we can use to evaluate the performance of ML
algorithms, classification as well as regression algorithms. We must carefully choose
the metrics for evaluating ML performance because −
• How the performance of ML algorithms is measured and compared will be
dependent entirely on the metric you choose.
• How you weight the importance of various characteristics in the result will
be influenced completely by the metric you choose.

Performance Metrics for Classification Problems


We have discussed classification and its algorithms in the previous chapters. Here, we
are going to discuss various performance metrics that can be used to evaluate
predictions for classification problems.

Confusion Matrix
It is the easiest way to measure the performance of a classification problem where the
output can be of two or more classes. A confusion matrix is a table with two
dimensions, “Actual” and “Predicted”, whose cells hold the counts of “True Positives
(TP)”, “True Negatives (TN)”, “False Positives (FP)” and “False Negatives (FN)”, as
shown below −

              Predicted 0    Predicted 1
Actual 0          TN             FP
Actual 1          FN             TP

The terms associated with the confusion matrix are explained as follows −

• True Positives (TP) − It is the case when both the actual class & the predicted
class of the data point are 1.
• True Negatives (TN) − It is the case when both the actual class & the predicted
class of the data point are 0.
• False Positives (FP) − It is the case when the actual class of the data point is 0 &
the predicted class of the data point is 1.
• False Negatives (FN) − It is the case when the actual class of the data point is 1 &
the predicted class of the data point is 0.

We can use the confusion_matrix function of sklearn.metrics to compute the Confusion
Matrix of our classification model.
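
To make the four terms concrete, the counts can also be tallied by hand. The following is a minimal sketch in plain Python; the function name confusion_counts and the two label lists are illustrative assumptions, with 1 as the positive class and 0 as the negative class.

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP and FN for binary labels (1 = positive, 0 = negative)."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1          # actual 1, predicted 1
        elif a == 0 and p == 0:
            tn += 1          # actual 0, predicted 0
        elif a == 0 and p == 1:
            fp += 1          # actual 0, predicted 1
        else:
            fn += 1          # actual 1, predicted 0
    return tp, tn, fp, fn


# Illustrative labels only
actual = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
predicted = [1, 0, 1, 1, 1, 0, 1, 1, 0, 0]

tp, tn, fp, fn = confusion_counts(actual, predicted)
print("TP =", tp, "TN =", tn, "FP =", fp, "FN =", fn)
# prints: TP = 3 TN = 3 FP = 3 FN = 1
```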

Classification Accuracy
It is the most common performance metric for classification algorithms. It may be defined
as the number of correct predictions made as a ratio of all predictions made. We can easily
calculate it from the confusion matrix with the help of the following formula −

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$$

We can use the accuracy_score function of sklearn.metrics to compute the accuracy of our
classification model.

Classification Report
This report consists of the scores of Precision, Recall, F1 and Support. They are
explained as follows −

Precision
Precision, used in document retrieval, may be defined as the fraction of the documents
returned by our ML model that are actually correct (relevant). We can easily calculate it
from the confusion matrix with the help of the following formula −

$$\text{Precision} = \frac{TP}{TP + FP}$$

Recall or Sensitivity
Recall may be defined as the fraction of the actual positives that are returned by our ML
model. We can easily calculate it from the confusion matrix with the help of the following
formula −

$$\text{Recall} = \frac{TP}{TP + FN}$$

Specificity
Specificity, in contrast to recall, may be defined as the fraction of the actual negatives
that our ML model correctly returns as negative. We can easily calculate it from the
confusion matrix with the help of the following formula −

$$\text{Specificity} = \frac{TN}{TN + FP}$$

Support
Support may be defined as the number of samples of the true response that lies in each
class of target values.

F1 Score
This score gives us the harmonic mean of precision and recall, i.e. an average in which
precision and recall make an equal relative contribution. The best value of F1 is 1 and
the worst is 0. We can calculate the F1 score with the help of the following formula −

$$F1 = \frac{2 \cdot (\text{precision} \cdot \text{recall})}{\text{precision} + \text{recall}}$$

We can use the classification_report function of sklearn.metrics to get the classification
report of our classification model.
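
The formulas above can be checked with a few lines of plain Python. This is a minimal sketch: the helper name classification_metrics and the four counts passed to it are assumptions chosen only for illustration, and no guard against division by zero is included.

```python
def classification_metrics(tp, tn, fp, fn):
    """Derive the report metrics from the four confusion-matrix counts.

    Assumes every denominator is non-zero (no zero-division handling).
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # also called sensitivity
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * (precision * recall) / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts, for illustration only
m = classification_metrics(tp=3, tn=3, fp=3, fn=1)
for name, value in m.items():
    print(name, "=", round(value, 3))
```

With these counts, precision is 3/6, recall is 3/4 and F1 works out to 0.6, which lies between the two but closer to the smaller value, as expected of a harmonic mean.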
AUC (Area Under ROC curve)
AUC (Area Under Curve)-ROC (Receiver Operating Characteristic) is a performance
metric, based on varying threshold values, for classification problems. As the name
suggests, ROC is a probability curve and AUC measures separability. In simple
words, the AUC-ROC metric tells us how capable the model is of distinguishing between
the classes. The higher the AUC, the better the model.
Mathematically, the curve is created by plotting TPR (True Positive Rate), i.e. sensitivity
or recall, against FPR (False Positive Rate), i.e. 1 − specificity, at various threshold values.
Following is the graph showing ROC and AUC, with TPR on the y-axis and FPR on the x-axis −

We can use roc_auc_score function of sklearn.metrics to compute AUC-ROC.
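
To see where the points of the ROC curve come from, the sketch below computes (FPR, TPR) at a few thresholds by hand. The labels, scores and function names are assumptions made for illustration. The AUC is computed here through an equivalent formulation, the fraction of (positive, negative) pairs in which the positive example receives the higher score, rather than by integrating the curve.

```python
def roc_point(y_true, scores, threshold):
    """Return (FPR, TPR) when scores >= threshold are predicted as class 1."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, scores):
        pred = 1 if s >= threshold else 0
        if y == 1 and pred == 1:
            tp += 1
        elif y == 1 and pred == 0:
            fn += 1
        elif y == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    return fp / (fp + tn), tp / (tp + fn)


def auc_by_pairs(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Illustrative labels and predicted scores
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

for t in (0.0, 0.3, 0.5, 0.9):
    print("threshold", t, "-> (FPR, TPR) =", roc_point(y_true, scores, t))
print("AUC =", auc_by_pairs(y_true, scores))
```

Lowering the threshold moves the point towards (1, 1) and raising it moves the point towards (0, 0), which is how the curve is traced out.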

LOGLOSS (Logarithmic Loss)


It is also called logistic regression loss or cross-entropy loss. It is defined on
probability estimates and measures the performance of a classification model whose
prediction is a probability value between 0 and 1. It can be understood more clearly by
contrasting it with accuracy. Accuracy is the count of correct predictions
(predicted value = actual value) in our model, whereas Log Loss is the amount of
uncertainty of our prediction, based on how much it varies from the actual label. With
the help of the Log Loss value, we can have a more accurate view of the performance of
our model. We can use the log_loss function of sklearn.metrics to compute Log Loss.
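
The contrast with accuracy can be illustrated with a small sketch. It assumes the usual binary cross-entropy definition, the negative average of y·log(p) + (1 − y)·log(1 − p), which the notes do not spell out. The two sets of predicted probabilities are hypothetical, and every probability must lie strictly between 0 and 1.

```python
import math


def binary_log_loss(y_true, probs):
    """Average negative log-likelihood of the true labels.

    probs holds the predicted probability of class 1 and must lie
    strictly between 0 and 1 (no clipping is done here).
    """
    total = 0.0
    for y, p in zip(y_true, probs):
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


y_true = [1, 0, 1]
confident = [0.9, 0.1, 0.8]    # hypothetical model A
hesitant = [0.6, 0.4, 0.55]    # hypothetical model B

# Both models classify every example correctly at a 0.5 threshold,
# so their accuracy is identical, but their log loss differs.
loss_a = binary_log_loss(y_true, confident)
loss_b = binary_log_loss(y_true, hesitant)
print("Log loss A:", loss_a)
print("Log loss B:", loss_b)
```

Model B is penalized more because its probabilities sit closer to 0.5, even though accuracy cannot tell the two models apart.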

Example
The following is a simple recipe in Python which will give us an insight about how we
can use the above explained performance metrics on binary classification model −

from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
from sklearn.metrics import log_loss

# Actual labels and the labels predicted by a binary classifier
X_actual = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
Y_predic = [1, 0, 1, 1, 1, 0, 1, 1, 0, 0]

results = confusion_matrix(X_actual, Y_predic)
print('Confusion Matrix :')
print(results)
print('Accuracy Score is', accuracy_score(X_actual, Y_predic))
print('Classification Report : ')
print(classification_report(X_actual, Y_predic))
print('AUC-ROC:', roc_auc_score(X_actual, Y_predic))
print('LOGLOSS Value is', log_loss(X_actual, Y_predic))

Performance Metrics for Regression Problems


We have discussed regression and its algorithms in previous chapters. Here, we are
going to discuss various performance metrics that can be used to evaluate predictions
for regression problems.

Mean Absolute Error (MAE)


It is the simplest error metric used in regression problems. It is the average of the
absolute differences between the predicted and actual values. In simple
words, with MAE, we can get an idea of how wrong the predictions were. MAE does
not indicate the direction of the error, i.e. it gives no indication of whether the model
is under-predicting or over-predicting. The following is the formula to calculate MAE −

$$MAE = \frac{1}{n} \sum |Y - \hat{Y}|$$

Here, $Y$ = Actual Output Values

And $\hat{Y}$ = Predicted Output Values.


We can use mean_absolute_error function of sklearn.metrics to compute MAE.

Mean Square Error (MSE)


MSE is like the MAE, but the only difference is that it squares the difference of actual
and predicted output values before summing them all, instead of using the absolute
value. The difference can be noticed in the following equation −

$$MSE = \frac{1}{n} \sum (Y - \hat{Y})^2$$

Here, $Y$ = Actual Output Values

And $\hat{Y}$ = Predicted Output Values.


We can use mean_squared_error function of sklearn.metrics to compute MSE.

R Squared (R2)
R Squared metric is generally used for explanatory purposes and provides an indication
of the goodness of fit of a set of predicted output values to the actual output values.
The following formula will help us understand it −

$$R^2 = 1 - \frac{\frac{1}{n}\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2}{\frac{1}{n}\sum_{i=1}^{n}(Y_i - \bar{Y})^2}$$

In the above equation, the numerator is the MSE and the denominator is the variance of
the $Y$ values, where $\bar{Y}$ is the mean of the actual values.
We can use r2_score function of sklearn.metrics to compute R squared value.

Example
The following is a simple recipe in Python which will give us an insight about how we
can use the above explained performance metrics on regression model −

from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

# Actual output values and the values predicted by a regression model
X_actual = [5, -1, 2, 10]
Y_predic = [3.5, -0.9, 2, 9.9]

print('R Squared =', r2_score(X_actual, Y_predic))
print('MAE =', mean_absolute_error(X_actual, Y_predic))
print('MSE =', mean_squared_error(X_actual, Y_predic))

Output
R Squared = 0.9656060606060606
MAE = 0.42499999999999993
MSE = 0.5674999999999999
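
The values in the output above can be reproduced directly from the three formulas, without scikit-learn. The sketch below reuses the same four actual and predicted values; the variable names are illustrative.

```python
actual = [5, -1, 2, 10]
predicted = [3.5, -0.9, 2, 9.9]
n = len(actual)

# MAE: average absolute difference
mae = sum(abs(y - y_hat) for y, y_hat in zip(actual, predicted)) / n

# MSE: average squared difference
mse = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / n

# R squared: 1 - MSE / variance of the actual values
mean_actual = sum(actual) / n
variance = sum((y - mean_actual) ** 2 for y in actual) / n
r2 = 1 - mse / variance

print("MAE =", mae)
print("MSE =", mse)
print("R Squared =", r2)
```

Up to floating-point rounding, the three numbers agree with the scikit-learn output: MAE is 1.7/4, MSE is 2.27/4 and R squared is 1 − 2.27/66.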

Class Probability Estimation and Logistic Regression

Have you ever wondered why Logistic Regression is used for classification
problems but still has “Regression” in its name?

In this post we will answer questions such as why the term regression appears
in Logistic Regression, and how its output can be converted to class
probabilities.

For classification problems in machine learning, we often want to know
how likely it is that an instance belongs to a class, rather than only which class
it belongs to. So in many cases we would like to use the estimated class
probability for decision making.
Consider a scenario where we have to detect credit fraud. The manager of the fraud
control department wants to know not only which accounts are likely to be
fraudulent but also the cases where the most is at stake, i.e. the accounts where
the company's expected monetary loss is highest.

Here, we must know the class probability of fraud for each particular
case.
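
Why the probability matters, and not only the predicted class, can be shown with a small sketch. The accounts, probabilities and amounts below are hypothetical, and expected loss is assumed here to be simply the estimated fraud probability times the amount at stake.

```python
# Hypothetical accounts: (account id, estimated P(fraud), amount at stake)
accounts = [
    ("A", 0.90, 100.0),
    ("B", 0.20, 5000.0),
    ("C", 0.60, 1000.0),
]


def expected_loss(p_fraud, amount):
    """Expected monetary loss, assuming loss = probability of fraud * amount."""
    return p_fraud * amount


# Rank accounts by expected loss, highest first
ranked = sorted(accounts, key=lambda a: expected_loss(a[1], a[2]), reverse=True)
for account_id, p_fraud, amount in ranked:
    print(account_id, round(expected_loss(p_fraud, amount), 2))
```

Account A is the most likely to be fraudulent, yet it ends up last in the ranking because little money is at stake. A classifier that only outputs "fraud" or "not fraud" could not produce this ordering.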

What exactly counts as an accurate estimate of class membership probability is a
subject of debate beyond the scope of this post.

Roughly, we would like

(i) the probability estimates to be well calibrated, meaning that if you take 100
cases whose class membership probability is estimated to be 0.2, then about 20
of them will actually belong to the class.

(ii) the probability estimates to be discriminative, meaning that they should give
different probability estimates for different examples. For example, assigning a
class probability of 0.5 to every case when 50% of the population is fraudulent
merely restates the base rate; we need discrimination to obtain higher or lower
class probabilities for individual cases.

Understanding the difficulty in using a Linear Model for predicting class
probability

Assume f(x) is our linear function. An instance x that is further from the
separating boundary intuitively ought to have a higher probability of being in
one class or the other, and f(x) gives us the distance from the separating
boundary. However, a linear function can take values from -infinity to
+infinity, while our class probabilities range from 0 to 1.
One useful notion of the likelihood of an event is odds.

The odds of an event is the ratio of the probability of the event occurring to the
probability of the event not occurring. Odds range from 0 to +infinity, so the
odds function still cannot be matched to our linear function, which ranges from
-infinity to +infinity.

But wait! For any number that ranges from 0 to +infinity, its log ranges
from -infinity to +infinity. So let's compare the log-odds with our linear
model.

Assume we have an instance to be predicted as class c by the linear model; then
its log-odds will be

$$\log\left(\frac{P(c)}{1 - P(c)}\right) = f(x) = w_0 + w_1 x_1 + w_2 x_2 + \dots$$

Equation 1 : Log-odds with linear model, also called the Logit Function

Above, w0, w1, w2, … are the weights given by our linear model and x1, x2,
x3, … are the features of the dataset. P(c) is the probability that the instance is
a credit fraud and 1-P(c) is the probability that it is not a credit fraud.

Now, we usually want the probability of class c, i.e. P(c), as our predicted class
probability rather than the log-odds. We can solve for P(c).

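Thus we take the exponential (base e) of both sides.

$$e^{\log\left(\frac{P(c)}{1 - P(c)}\right)} = e^{f(x)}$$

Taking exponential of both sides w.r.t e

On the left side, e and log cancel out, leaving only the argument of the log.
Solving for P(c),

$$\frac{P(c)}{1 - P(c)} = e^{f(x)}$$

Simplify

$$P(c) = e^{f(x)} - e^{f(x)} P(c)$$

Solve the bracket

$$P(c) + e^{f(x)} P(c) = e^{f(x)}$$

Take all P(c) on left side

$$P(c) \left(1 + e^{f(x)}\right) = e^{f(x)}$$

Take P(c) common on left side

$$P(c) = \frac{e^{f(x)}}{1 + e^{f(x)}}$$

Simplify

$$P(c) = \frac{e^{f(x)} / e^{f(x)}}{\left(1 + e^{f(x)}\right) / e^{f(x)}}$$

Divide numerator and denominator on right side by e^f(x)

$$P(c) = \frac{1}{e^{-f(x)} + 1}$$

Simplify

$$P(c) = \frac{1}{1 + e^{-f(x)}}$$

Equation 2 : Sigmoid Function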

In Equation 2, if we plot the graph using values for x and w, we get a curve
something like

Fig: Logistic Regression : Estimates of class probabilities

The above curve is called the “Sigmoid Curve” due to its S shape, which squeezes
the values into the correct range for probabilities (between zero and one).

The sigmoid curve suggests that instances near the boundary are uncertain in
class, and that as you move away from the boundary the uncertainty decreases and
class membership becomes more certain.
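
The relationship between probability, odds, log-odds and the sigmoid can be checked numerically. The following is a minimal sketch; the function names are illustrative and f stands for the value of the linear function f(x).

```python
import math


def odds(p):
    """Odds of an event with probability p (0 <= p < 1)."""
    return p / (1 - p)


def logit(p):
    """Log-odds of an event with probability p (0 < p < 1)."""
    return math.log(odds(p))


def sigmoid(f):
    """Map a value of the linear function f(x) to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-f))


# Far from the boundary (large |f|) the probability approaches 0 or 1;
# on the boundary (f = 0) it is exactly 0.5.
for f in (-4, -1, 0, 1, 4):
    print("f(x) =", f, "-> P(c) =", round(sigmoid(f), 4))

# The sigmoid undoes the logit: converting back recovers the linear value
print(logit(sigmoid(2.0)))
```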

This leads us to the standard objective function for fitting a logistic
regression. The “likelihood” that a given example belongs to the correct class
can be given by,

$$g(x, w) = \begin{cases} P(c) & \text{if } x \text{ belongs to class } c \\ 1 - P(c) & \text{otherwise} \end{cases}$$

The g function gives the model’s estimated probability of seeing x’s actual class given
x’s features.

The maximum likelihood model “on average” gives the highest probabilities to
the positive examples and the lowest probabilities to the negative examples.
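
The idea of choosing the weights that give the highest likelihood can be sketched as follows. This only compares two hand-picked weight vectors on hypothetical one-feature data and does not perform the actual fitting. Label 1 stands for class c, and the log of the likelihood is summed over the examples, a common convention that the notes do not state explicitly.

```python
import math


def sigmoid(f):
    return 1 / (1 + math.exp(-f))


def g(x, label, w):
    """Estimated probability of the example's actual class (label 1 = class c)."""
    f = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))   # f(x) = w0 + w1*x1 + ...
    p_c = sigmoid(f)
    return p_c if label == 1 else 1 - p_c


def log_likelihood(X, y, w):
    """Sum of log g over all examples; higher means a better fit."""
    return sum(math.log(g(x, label, w)) for x, label in zip(X, y))


# Hypothetical data: larger x tends to mean class 1
X = [[-2.0], [-1.0], [1.0], [2.0]]
y = [0, 0, 1, 1]

ll_flat = log_likelihood(X, y, [0.0, 0.0])   # every example gets probability 0.5
ll_good = log_likelihood(X, y, [0.0, 2.0])   # boundary at x = 0, steeper curve
print("log-likelihood, w = [0, 0]:", ll_flat)
print("log-likelihood, w = [0, 2]:", ll_good)
```

The second weight vector assigns higher probabilities to the positive examples and lower ones to the negative examples, so its log-likelihood is higher.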

Multi-Class Classification
Multi-class classification refers to those classification tasks that have more than two class
labels.
Examples include:
• Face classification.
• Plant species classification.
• Optical character recognition.
Unlike binary classification, multi-class classification does not have the notion of normal and
abnormal outcomes. Instead, examples are classified as belonging to one among a range of
known classes.

The number of class labels may be very large on some problems. For example, a model may
predict a photo as belonging to one among thousands or tens of thousands of faces in a face
recognition system.

Problems that involve predicting a sequence of words, such as text translation models, may
also be considered a special type of multi-class classification. Each word in the sequence of
words to be predicted involves a multi-class classification where the size of the vocabulary
defines the number of possible classes that may be predicted and could be tens or hundreds
of thousands of words in size.

It is common to model a multi-class classification task with a model that predicts a Multinoulli
probability distribution for each example.
The Multinoulli distribution is a discrete probability distribution that covers a case where an
event will have a categorical outcome, e.g. k in {1, 2, 3, …, K}. For classification, this means
that the model predicts the probability of an example belonging to each class label.
Many algorithms used for binary classification can be used for multi-class classification.

Popular algorithms that can be used for multi-class classification include:

• k-Nearest Neighbors.
• Decision Trees.
• Naive Bayes.
• Random Forest.
• Gradient Boosting.
Algorithms that are designed for binary classification can be adapted for use for multi-class
problems.

This involves using a strategy of fitting multiple binary classification models for each class vs.
all other classes (called one-vs-rest) or one model for each pair of classes (called one-vs-
one).

• One-vs-Rest: Fit one binary classification model for each class vs. all other classes.
• One-vs-One: Fit one binary classification model for each pair of classes.
Binary classification algorithms that can use these strategies for multi-class classification
include:

• Logistic Regression.
• Support Vector Machine.
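
The two strategies can be tried with the wrappers in sklearn.multiclass. The sketch below is only an illustration under assumed settings: a synthetic four-class dataset from make_blobs, logistic regression for one-vs-rest and a support vector machine for one-vs-one.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

# Synthetic dataset with 4 classes (illustrative only)
X, y = make_blobs(n_samples=400, centers=4, random_state=1)

# One-vs-Rest: one binary model per class
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# One-vs-One: one binary model per pair of classes
ovo = OneVsOneClassifier(SVC()).fit(X, y)

print("One-vs-Rest binary models:", len(ovr.estimators_))
print("One-vs-One binary models:", len(ovo.estimators_))
```

With K classes, one-vs-rest fits K binary models and one-vs-one fits K(K-1)/2, so for four classes the counts are 4 and 6.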
Next, let’s take a closer look at a dataset to develop an intuition for multi-class classification
problems.

We can use the make_blobs() function to generate a synthetic multi-class classification
dataset.
The example below generates a dataset with 1,000 examples that belong to one of three
classes, each with two input features.

# example of multi-class classification task
from numpy import where
from collections import Counter
from sklearn.datasets import make_blobs
from matplotlib import pyplot
# define dataset
X, y = make_blobs(n_samples=1000, centers=3, random_state=1)
# summarize dataset shape
print(X.shape, y.shape)
# summarize observations by class label
counter = Counter(y)
print(counter)
# summarize first few examples
for i in range(10):
    print(X[i], y[i])
# plot the dataset and color the points by class label
for label, _ in counter.items():
    row_ix = where(y == label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()

Running the example first summarizes the created dataset showing the 1,000 examples
divided into input (X) and output (y) elements.
The distribution of the class labels is then summarized, showing that instances belong to
class 0, class 1, or class 2 and that there are approximately 333 examples in each class.

Next, the first 10 examples in the dataset are summarized showing the input values are
numeric and the target values are integers that represent the class membership.

(1000, 2) (1000,)

Counter({0: 334, 1: 333, 2: 333})

[-3.05837272 4.48825769] 0
[-8.60973869 -3.72714879] 1
[1.37129721 5.23107449] 0
[-9.33917563 -2.9544469 ] 1
[-8.63895561 -8.05263469] 2
[-8.48974309 -9.05667083] 2
[-7.51235546 -7.96464519] 2
[-7.51320529 -7.46053919] 2
[-0.61947075 3.48804983] 0
[-10.91115591 -4.5772537 ] 1

Finally, a scatter plot is created for the input variables in the dataset and the points are colored
based on their class value.

We can see three distinct clusters that we might expect would be easy to discriminate.
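
As a follow-up, a classifier can be fitted on this dataset to see the predicted probability of each class label, as in the Multinoulli description earlier in this section. This is a minimal sketch; the choice of logistic regression is an assumption made for illustration and is not prescribed by the text.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Same synthetic dataset as in the example above
X, y = make_blobs(n_samples=1000, centers=3, random_state=1)

# Classifier choice is an assumption; any multi-class model could be used
model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# One probability per class label for each example
proba = model.predict_proba(X[:5])
print(proba.round(3))

# Each row sums to 1, and the predicted label is the most probable class
print(proba.sum(axis=1))
print(model.predict(X[:5]))
```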

ML | Underfitting and Overfitting
When we talk about a Machine Learning model, we actually talk about
how well it performs and how accurate it is, which is measured through
its prediction errors. Let us consider that we are designing a machine
learning model. A model is said to be a good machine learning model if it
generalizes properly to any new input data from the problem domain. This
helps us to make predictions about future data that the model has never
seen. Now, suppose we want to check how well our machine learning
model learns and generalizes to new data. For that, we look at
overfitting and underfitting, which are majorly responsible for the poor
performance of machine learning algorithms.

Before diving further let’s understand two important terms:

• Bias: Assumptions made by a model to make a function easier to
learn. In practice it is reflected in the error rate on the training data.
When the error rate has a high value, we call it high bias, and when the
error rate has a low value, we call it low bias.
• Variance: The difference between the error rate on training data
and on testing data is called variance. If the difference is high then
it’s called high variance, and when the difference is low
then it’s called low variance. Usually, we want low variance so that
our model generalizes.

Underfitting: A statistical model or a machine learning algorithm is said to underfit
when it cannot capture the underlying trend of the data, i.e., it performs poorly on the training
data as well as on testing data. (It’s just like trying to fit into undersized pants!) Underfitting
destroys the accuracy of our machine learning model. Its occurrence simply means that our model
or algorithm does not fit the data well enough. It usually happens when we have too little data to
build an accurate model, or when we try to fit a linear model to non-linear data.
In such cases, the rules of the machine learning model are too simple to capture the patterns in
the data, and therefore the model will probably make a lot of wrong predictions.
Underfitting can be reduced by using more data and by increasing the model's complexity, for
example through additional informative features.

In a nutshell, Underfitting refers to a model that can neither perform well on the training data nor
generalize to new data.

Reasons for Underfitting:

1. High bias and low variance
2. The size of the training dataset used is not enough.
3. The model is too simple.
4. Training data is not cleaned and also contains noise in it.

Techniques to reduce underfitting:

1. Increase model complexity
2. Increase the number of features, performing feature engineering
3. Remove noise from the data.
4. Increase the number of epochs or increase the duration of training to get better results.

Overfitting: A statistical model is said to be overfitted when it fits the training data closely but
does not make accurate predictions on testing data. When a model is trained too closely on its
training data, it starts learning from the noise and inaccurate data entries in the data set, and
testing with test data then reveals high variance. The model does not categorize the data correctly,
because it has learned too many details and too much noise. Overfitting is most common with
non-parametric and non-linear methods, because these types of machine learning algorithms have
more freedom in building the model based on the dataset and can therefore build unrealistic
models. A solution to avoid overfitting is to use a linear algorithm if we have linear data, or to
limit parameters such as the maximal depth if we are using decision trees.

In a nutshell, Overfitting is a problem where the performance of a machine learning algorithm on
training data is different from (typically much better than) its performance on unseen data.

Reasons for Overfitting are as follows:

1. High variance and low bias
2. The model is too complex
3. The size of the training data is too small

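
Underfitting and overfitting can be observed by comparing training and testing errors as model complexity grows. The sketch below is only an illustration under assumed settings: noisy samples of a sine curve, polynomial models of degree 1, 3 and 9 fitted with numpy.polyfit, and mean squared error as the error rate. None of these choices come from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_data(n):
    """Noisy samples of a sine curve (synthetic, for illustration only)."""
    x = rng.uniform(0.0, 1.0, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, n)
    return x, y


x_train, y_train = make_data(20)
x_test, y_test = make_data(200)


def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on the given data."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


results = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print("degree", degree,
          "train error", round(results[degree][0], 4),
          "test error", round(results[degree][1], 4))
```

The straight line (degree 1) is too simple for the curve, so its training error stays high, which is the high-bias, underfitting case. Raising the degree can only lower the training error, so a widening gap between training and testing error at high degree is the high-variance, overfitting signal to look for.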
Hypothesis in Machine Learning
The hypothesis is a common term in Machine Learning and data science
projects. As we know, machine learning is one of the most powerful
technologies across the world, which helps us to predict results based on past
experiences. Moreover, data scientists and ML professionals conduct
experiments that aim to solve a problem. These ML professionals and data
scientists make an initial assumption for the solution of the problem.

This assumption in Machine learning is known as a Hypothesis. In Machine
Learning, Hypothesis and Model are at times used interchangeably.
However, a Hypothesis is an assumption made by scientists, whereas a model
is a mathematical representation that is used to test the hypothesis. In this
topic, "Hypothesis in Machine Learning," we will discuss a few important
concepts related to a hypothesis in machine learning and their importance.
So, let's start with a quick introduction to Hypothesis.

What is Hypothesis?
The hypothesis is defined as a supposition or proposed explanation based on
insufficient evidence or assumptions. It is just a guess based on some known
facts that has not yet been proven. A good hypothesis is testable, which
means it can turn out to be either true or false.
Example: Let's understand the hypothesis with a common example. Some
scientist claims that ultraviolet (UV) light can damage the eyes, so it may
also cause blindness.

In this example, the scientist only claims that UV rays are harmful to the eyes,
but we assume they may cause blindness. However, this may or may not be
the case. Hence, these types of assumptions are called hypotheses.

Hypothesis in Machine Learning (ML)


The hypothesis is one of the commonly used concepts of statistics in Machine
Learning. It is specifically used in Supervised Machine learning, where an ML
model learns a function that best maps the input to corresponding outputs
with the help of an available dataset.

In supervised learning techniques, the main aim is to determine the possible
hypothesis out of the hypothesis space that best maps input to the corresponding
or correct outputs.

There are some common methods given to find out the possible hypothesis
from the Hypothesis space, where hypothesis space is represented
by uppercase-h (H) and hypothesis by lowercase-h (h). These are defined as
follows:

Hypothesis space (H):

Hypothesis space is defined as the set of all possible legal hypotheses; hence it is also
known as a hypothesis set. It is used by supervised machine learning algorithms
to determine the best possible hypothesis to describe the target function, i.e. the
one that best maps input to output.

It is often constrained by the choice of the framing of the problem, the choice of
model, and the choice of model configuration.

Hypothesis (h):

It is defined as the approximate function that best describes the target in
supervised machine learning algorithms. It is primarily based on data as well
as bias and restrictions applied to data.

Hence hypothesis (h) can be concluded as a single hypothesis that maps input
to proper output and can be evaluated as well as used to make predictions.

The hypothesis (h) can be formulated in machine learning as follows:

y = mx + b

Where,

y: range

m: slope of the line which divides the test data, i.e. the change in y divided by
the change in x

x: domain

b: intercept (constant)
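
The difference between the hypothesis space H and a single hypothesis h can be sketched in code. Everything here is a hypothetical illustration: the training points, the coarse grid of (m, b) pairs standing in for H, and the use of squared error to decide which h best maps input to output.

```python
# Hypothetical training data that happens to lie on the line y = 2x + 1
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]

# A small hypothesis space H: every line y = m*x + b on a coarse grid
H = [(m, b) for m in (0, 1, 2, 3) for b in (0, 1, 2)]


def squared_error(h, xs, ys):
    """Total squared error of hypothesis h = (m, b) on the data."""
    m, b = h
    return sum((m * x + b - y) ** 2 for x, y in zip(xs, ys))


# The learning algorithm picks the hypothesis h in H with the lowest error
best_h = min(H, key=lambda h: squared_error(h, xs, ys))
print("Size of hypothesis space:", len(H))
print("Best hypothesis (m, b):", best_h)
print("Its squared error:", squared_error(best_h, xs, ys))
```

Restricting H to straight lines on this grid is an example of how the choice of model and its configuration constrains the hypothesis space.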

Example: Let's understand the hypothesis (h) and hypothesis space (H) with
a two-dimensional coordinate plane showing the distribution of data as
follows:

Now, assume we have some test data by which ML algorithms predict the
outputs for input as follows:
If we divide this coordinate plane in such as way that it can help you to predict
output or result as follows:

Based on the given test data, the output result will be as follows:
However, based on data, algorithm, and constraints, this coordinate plane
can also be divided in the following ways as follows:

With the above example, we can conclude that:

Hypothesis space (H) is the set of all legal ways to divide the coordinate
plane so that inputs are mapped to outputs. Each individual way of dividing
the plane is called a hypothesis (h). Hence, the hypothesis and hypothesis
space would be like this:

Fig: Hypothesis (h) and hypothesis space (H).
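The same idea can be sketched in code. The snippet below is only an illustration under stated assumptions: the hypothesis space H is a small hand-picked list of four candidate lines (a real hypothesis space is usually far larger, often infinite), the labeled points are made up, and a point is labeled 1 when it lies above the line. Each (m, b) pair is one hypothesis h, and the learner keeps the one with the fewest training errors.

```python
# Hypothetical labeled points: (x1, x2, label).
data = [
    (1.0, 3.0, 1),
    (2.0, 4.0, 1),
    (3.0, 5.0, 1),
    (1.0, 0.0, 0),
    (2.0, 1.0, 0),
    (3.0, 1.5, 0),
]

# A small, finite hypothesis space H (an assumption for illustration).
# Each hypothesis is a dividing line x2 = m*x1 + b, stored as (m, b).
H = [(0.0, 4.5), (1.0, 0.5), (-1.0, 3.0), (0.0, 0.5)]


def predict(hyp, x1, x2):
    """Label a point 1 if it lies strictly above the line, else 0."""
    m, b = hyp
    return 1 if x2 > m * x1 + b else 0


def training_errors(hyp):
    """Count the training points that this hypothesis misclassifies."""
    return sum(predict(hyp, x1, x2) != label for x1, x2, label in data)


# Evaluate every hypothesis in H and keep the one with the fewest errors.
errors = [training_errors(hyp) for hyp in H]
best_h = min(H, key=training_errors)
print(errors, best_h)
```

With these illustrative points, only the second line separates the two classes without error, so it is the hypothesis selected from H.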

Hypothesis in Statistics
Similar to the hypothesis in machine learning, a hypothesis in statistics is
also an assumption about the outcome. However, it is falsifiable, which means
it can be rejected in the presence of sufficient evidence.

Unlike in machine learning, a hypothesis in statistics is never accepted
outright, because any conclusion about it is based on probability: we can only
reject it or fail to reject it. Before starting work on an experiment, we must
be aware of two important types of hypotheses:

o Null Hypothesis: A null hypothesis is a type of statistical hypothesis which states
that no statistically significant effect exists in the given set of
observations. It is also known as a conjecture and is used in quantitative
analysis to test theories about markets, investment, and finance to decide
whether an idea is true or false.
o Alternative Hypothesis: An alternative hypothesis is a direct contradiction of the
null hypothesis, which means if one of the two hypotheses is true, then the
other must be false. In other words, an alternative hypothesis is a type of
statistical hypothesis which states that some significant effect
exists in the given set of observations.

Significance level
The significance level must be set before starting an experiment. It defines
the tolerance for error, that is, the level at which an effect can be
considered significant. In many experiments a 95% confidence level is used,
which corresponds to a significance level of 5% (0.05). The significance level
also serves as the critical or threshold value. For example, if the confidence
level is set to 98%, then the critical value is 0.02 (2%).

P-value
The p-value in statistics is a measure of the evidence against a null hypothesis.
In other words, the p-value is the probability of obtaining data at least as
extreme as the data actually observed, assuming the null hypothesis is true.

The smaller the p-value, the stronger the evidence against the null hypothesis,
and the more readily the null hypothesis can be rejected in testing. The p-value
is always expressed in decimal form, such as 0.035.

Whenever a statistical test is carried out on a sample, the resulting p-value
is compared with the critical value. If the p-value is less than the critical
value, the effect is significant and the null hypothesis can be rejected. If
it is higher than the critical value, there is no significant effect and we
fail to reject the null hypothesis.
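The decision rule can be illustrated with a small worked example. The sketch below assumes a made-up experiment (15 heads in 20 coin tosses), a null hypothesis that the coin is fair, and a one-sided exact binomial test; the function names `binomial_p_value` and `decide` and the parameter `alpha` (the critical value) are hypothetical names used only for illustration.

```python
from math import comb


def binomial_p_value(n, k, p0=0.5):
    """One-sided p-value: probability of k or more successes in n trials
    when the null hypothesis (success probability p0) is true."""
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))


def decide(p_value, alpha):
    """Compare the p-value with the critical value alpha."""
    return "reject H0" if p_value < alpha else "fail to reject H0"


# Hypothetical experiment: 15 heads in 20 tosses; H0 says the coin is fair.
p_value = binomial_p_value(20, 15)

decision_at_5_percent = decide(p_value, alpha=0.05)
decision_at_1_percent = decide(p_value, alpha=0.01)
print(round(p_value, 4), decision_at_5_percent, decision_at_1_percent)
```

The p-value here is about 0.0207, so the null hypothesis is rejected at a critical value of 0.05 but not at 0.01. The same data can therefore lead to different conclusions depending on the threshold chosen before the experiment.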

The growth function, also called the shatter coefficient or the shattering number,
measures the richness of a set family. It is especially used in the context of statistical
learning theory, where it measures the complexity of a hypothesis class. The term 'growth
function' was coined by Vapnik and Chervonenkis in their 1968 paper, where they also
proved many of its properties.[1] It is a basic concept in machine learning.[2][3]
Definitions
Set-family definition
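Let $H$ be a set family (a set of sets) and $C$ a set. Their intersection is defined as the
following set-family:

$$H \cap C := \{h \cap C \mid h \in H\}$$

The intersection-size (also called the index) of $H$ with respect to $C$ is $|H \cap C|$. If a
set $C_m$ has $m$ elements then the index is at most $2^m$. If the index is exactly $2^m$ then the
set $C$ is said to be shattered by $H$, because $H \cap C$ contains all the subsets of $C$, i.e.:

$$|H \cap C| = 2^{|C|}$$

The growth function measures the size of $H \cap C$ as a function of $|C|$. Formally:

$$\operatorname{Growth}(H, m) := \max_{C : |C| = m} |H \cap C|$$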

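Hypothesis-class definition
Equivalently, let $H$ be a hypothesis-class (a set of binary functions) and $C$ a set
with $m$ elements. The restriction of $H$ to $C$ is the set of binary functions
on $C$ that can be derived from $H$:[3]: 45

$$H_C := \{(h(x_1), \ldots, h(x_m)) \mid h \in H,\ x_i \in C\}$$

The growth function measures the size of $H_C$ as a function of $|C|$:[3]: 49

$$\operatorname{Growth}(H, m) := \max_{C : |C| = m} |H_C|$$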


Examples:
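One standard example can be checked by brute force. The sketch below assumes the hypothesis class of threshold functions on the real line, where a hypothesis with threshold t labels x as 1 if x >= t and 0 otherwise. It counts how many distinct labelings this class can produce on m distinct points. The function name `restriction_size` and the choice of points are hypothetical; for this class the count depends only on the ordering of the points, so any m distinct points give the same result.

```python
def restriction_size(points):
    """Count the distinct labelings of `points` produced by threshold
    hypotheses h_t(x) = 1 if x >= t else 0, over all real thresholds t."""
    pts = sorted(points)
    # One threshold below all points, one between each adjacent pair, and one
    # above all points are enough to realise every achievable labeling.
    candidates = [pts[0] - 1.0]
    candidates += [(a + b) / 2 for a, b in zip(pts, pts[1:])]
    candidates.append(pts[-1] + 1.0)
    labelings = {tuple(1 if x >= t else 0 for x in points) for t in candidates}
    return len(labelings)


# Number of labelings on m distinct points, for m = 1..5.
growth = {m: restriction_size(list(range(m))) for m in range(1, 6)}
print(growth)
```

For this class the number of labelings on m points is m + 1, which is smaller than 2^m for every m of 2 or more. Threshold functions can therefore shatter a single point but no set of two points.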

Techniques to reduce overfitting:

1. Increase training data.
2. Reduce model complexity.
3. Early stopping during the training phase (monitor the loss on held-out validation data
over the training period and stop training as soon as that loss begins to increase).
4. Ridge Regularization and Lasso Regularization.
5. Use dropout for neural networks to tackle overfitting.
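As a rough illustration of early stopping (technique 3), the sketch below scans a list of per-epoch validation losses and reports which epoch's model should be kept. It is a simplified stand-in for a real training loop: the function name `early_stopping_epoch`, the `patience` parameter, and the loss values are all hypothetical, and the use of a separate validation loss is an assumption of this sketch.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch with the best validation loss seen
    before training is stopped.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs.
    """
    best_epoch = 0
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best model so far: remember it and reset the counter.
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # loss has stopped improving, so stop training
    return best_epoch


# Hypothetical validation losses: they fall, then rise as the model overfits.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.58, 0.66]
stop_at = early_stopping_epoch(val_losses)
print(stop_at)
```

With these illustrative values, the loss is lowest at epoch 3 (counting from 0) and training halts two epochs later. A patience greater than 1 avoids stopping on a single noisy epoch.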
