
MODULE 3 Classification

3.1.1 Basic Concepts of Classification

• Decision tree induction is a common technique in data mining that is used to generate
a predictive model from a dataset. This technique involves constructing a tree-like
structure, where each internal node represents a test on an attribute, each branch
represents the outcome of the test, and each leaf node represents a prediction. The
goal of decision tree induction is to build a model that can accurately predict the
outcome of a given event, based on the values of the attributes in the dataset.
• To build a decision tree, the algorithm first selects the attribute that best splits the data
into distinct classes. This is typically done using a measure of impurity, such as
entropy or the Gini index, which quantifies the degree of disorder in the data. The
algorithm then repeats this process for each branch of the tree, splitting the data into
smaller and smaller subsets until all of the data is classified (a short sketch follows this list).
• Decision tree induction is a popular technique in data mining because it is easy to
understand and interpret, and it can handle both numerical and categorical data.
Additionally, decision trees can handle large amounts of data, and they can be
updated with new data as it becomes available. However, decision trees can be prone
to overfitting, where the model becomes too complex and does not generalize well to
new data. As a result, data scientists often use techniques such as pruning to simplify
the tree and improve its performance.
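The splitting step described above can be illustrated with a short sketch. The snippet below is a minimal example, assuming scikit-learn is installed; the tiny weather-style dataset and attribute names are made up purely for illustration. It builds a small tree with the Gini index as the impurity measure (entropy can be substituted via criterion="entropy") and uses max_depth and ccp_alpha as simple pruning controls.

# A minimal decision tree induction sketch (assumes scikit-learn is installed).
# The tiny dataset below is made up purely for illustration.
from sklearn.tree import DecisionTreeClassifier, export_text

# Attributes: [outlook (0=sunny, 1=overcast, 2=rainy), humidity (0=normal, 1=high)]
X = [[0, 1], [0, 0], [1, 1], [2, 1], [2, 0], [1, 0]]
y = [0, 1, 1, 0, 1, 1]  # 0 = don't play, 1 = play

# criterion="gini" uses the Gini index; criterion="entropy" uses information gain.
# max_depth and ccp_alpha are simple ways to prune the tree and limit overfitting.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, ccp_alpha=0.0)
tree.fit(X, y)

print(export_text(tree, feature_names=["outlook", "humidity"]))
print(tree.predict([[0, 0]]))  # prediction for a sunny day with normal humidity

Printing the learned rules with export_text shows how each internal node tests one attribute, which is part of why decision trees are considered easy to interpret.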

Advantages of Decision Tree Induction


1. Easy to understand and interpret: Decision trees are a visual and intuitive model that
can be easily understood by both experts and non-experts.
2. Handle both numerical and categorical data: Decision trees can handle a mix of
numerical and categorical data, which makes them suitable for many different types
of datasets.
3. Can handle large amounts of data: Decision trees can handle large amounts of data
and can be updated with new data as it becomes available.
4. Can be used for both classification and regression tasks: Decision trees can be used
for both classification, where the goal is to predict a discrete outcome, and regression,
where the goal is to predict a continuous outcome.

Disadvantages of Decision Tree Induction
1. Prone to overfitting: Decision trees can become too complex and may not generalize
well to new data. This can lead to poor performance on unseen data.
2. Sensitive to small changes in the data: Decision trees can be sensitive to small
changes in the data, and a small change in the data can result in a significantly
different tree.
3. Biased towards attributes with many levels: Decision trees can be biased towards
attributes with many distinct levels, which may be preferred during splitting even
when they are not the most informative.

3.1.2 Naïve Bayes Classifier Algorithm

o The Naïve Bayes algorithm is a supervised learning algorithm based on Bayes'
theorem and used for solving classification problems.
o It is mainly used in text classification, where the training dataset is typically
high-dimensional.
o The Naïve Bayes classifier is one of the simplest and most effective classification
algorithms, and it helps in building fast machine learning models that can make
quick predictions.
o It is a probabilistic classifier, which means it makes predictions on the basis of the
probability of an object belonging to each class.
o Some popular applications of the Naïve Bayes algorithm are spam filtering, sentiment
analysis, and article classification.
Why is it called Naïve Bayes?
The Naïve Bayes algorithm is made up of two words, Naïve and Bayes, which can be
described as:
o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features. For example, if a fruit is identified on the
basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an
apple. Hence each feature individually contributes to identifying it as an apple,
without depending on the other features.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to
determine the probability of a hypothesis with prior knowledge. It depends on the
conditional probability.
o The formula for Bayes' theorem is given as:

P(A|B) = [P(B|A) × P(A)] / P(B)

Where,
P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given that the hypothesis is
true.
P(A) is Prior Probability: Probability of hypothesis before observing the evidence.
P(B) is Marginal Probability: Probability of Evidence.
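As a quick check of the formula, here is a tiny worked example with made-up probabilities; the numbers are purely illustrative and not drawn from any dataset in this module.

# Bayes' theorem with made-up probabilities (purely illustrative).
p_a = 0.3           # P(A): prior probability of the hypothesis
p_b_given_a = 0.8   # P(B|A): likelihood of the evidence if A is true
p_b = 0.5           # P(B): marginal probability of the evidence

p_a_given_b = (p_b_given_a * p_a) / p_b  # Bayes' theorem
print(p_a_given_b)  # 0.48, the posterior probability P(A|B)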
Working of Naïve Bayes' Classifier:
The working of the Naïve Bayes classifier can be understood with the help of the example below:
Suppose we have a dataset of weather conditions and a corresponding target variable "Play".
Using this dataset, we need to decide whether we should play or not on a particular day,
according to the weather conditions. To solve this problem, we follow the steps below:
1. Convert the given dataset into frequency tables.
2. Generate a likelihood table by finding the probabilities of the given features.
3. Use Bayes' theorem to calculate the posterior probability.
Problem: If the weather is sunny, should the player play or not?
Solution: To solve this, first consider the dataset below:
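Since the dataset table itself is not reproduced here, the sketch below uses a small made-up weather/"Play" table of the same shape, purely for illustration, and walks through the three steps above: frequency tables, likelihoods, and the posterior probabilities for a sunny day.

# A minimal Naïve Bayes sketch on a made-up weather/"Play" dataset.
from collections import Counter

weather = ["Sunny", "Sunny", "Overcast", "Rainy", "Rainy", "Overcast",
           "Sunny", "Rainy", "Sunny", "Overcast"]
play    = ["No", "Yes", "Yes", "Yes", "No", "Yes",
           "Yes", "No", "Yes", "Yes"]

# Step 1: frequency tables
play_counts = Counter(play)          # {"Yes": 7, "No": 3}
joint = Counter(zip(weather, play))  # e.g. {("Sunny", "Yes"): 3, ("Sunny", "No"): 1, ...}

# Step 2: likelihoods, priors, and the evidence probability
p_sunny_given_yes = joint[("Sunny", "Yes")] / play_counts["Yes"]
p_sunny_given_no  = joint[("Sunny", "No")]  / play_counts["No"]
p_yes = play_counts["Yes"] / len(play)
p_no  = play_counts["No"]  / len(play)
p_sunny = weather.count("Sunny") / len(weather)

# Step 3: Bayes' theorem gives the posterior probabilities for a sunny day
p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
p_no_given_sunny  = p_sunny_given_no  * p_no  / p_sunny

print(p_yes_given_sunny, p_no_given_sunny)
print("Play" if p_yes_given_sunny > p_no_given_sunny else "Don't play")

With these made-up counts, the posterior probability of "Yes" given sunny weather is higher than that of "No", so the classifier would predict that the player plays on a sunny day.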

3.1.3 Accuracy and Error measures

Data mining can be referred to as knowledge mining from data, knowledge extraction,
data/pattern analysis, data archaeology, and data dredging. In this section, we will see
techniques to evaluate the accuracy of classifiers.
HoldOut
In the holdout method, the given dataset is randomly divided into three subsets:
• The training set is the subset of the dataset used to build predictive models.
• The validation set is the subset of the dataset used to assess the performance of the
model built in the training phase. It provides a test platform for fine-tuning the
model's parameters and selecting the best-performing model. Not all modeling
algorithms need a validation set.
• The test set (the unseen examples) is the subset of the dataset used to assess the
likely future performance of the model. If a model fits the training set much better
than it fits the test set, overfitting is probably the cause.
Basically, two-thirds of the data is allocated to the training set and the remaining one-third
is allocated to the test set.
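A minimal sketch of the holdout split, assuming scikit-learn is available; the feature matrix X, the labels y, and the choice of classifier are placeholders for whatever dataset and model are being evaluated.

# Holdout: two-thirds of the data for training, one-third for testing.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X = [[i] for i in range(30)]        # placeholder feature matrix
y = [i % 2 for i in range(30)]      # placeholder labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=42)

model = DecisionTreeClassifier().fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))  # estimate of future performance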
Random Subsampling
• Random subsampling is a variation of the holdout method, in which the holdout
procedure is repeated K times.
• Each repetition involves randomly splitting the data into a training set and a
test set.
• A model is trained on the training set, and the mean squared error (MSE) is
obtained from its predictions on the test set.
• Because the MSE depends on the particular split, a single split is not recommended:
a new split can give a new MSE, so the procedure is repeated and the results averaged.
• The overall error estimate is calculated as E = (1/K) Σ_{i=1}^{K} E_i, where E_i is the
error obtained on the i-th repetition.
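As a rough sketch of random subsampling, the loop below repeats the holdout split K times and averages the per-split test errors; the toy data and variable names are placeholders, not part of the original example.

# Random subsampling: repeat the holdout split K times and average the errors E_i.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X = [[i] for i in range(30)]
y = [i % 2 for i in range(30)]      # toy labels, purely illustrative

K = 10
errors = []
for k in range(K):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=k)
    model = DecisionTreeClassifier().fit(X_tr, y_tr)
    err = sum(p != t for p, t in zip(model.predict(X_te), y_te)) / len(y_te)
    errors.append(err)

print(sum(errors) / K)  # E = (1/K) * sum of the E_i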
Cross-Validation
• K-fold cross-validation is used when only a limited amount of data is available, in
order to achieve an unbiased estimate of the model's performance.
• Here, we divide the data into K subsets of equal size.
• We build the model K times, each time leaving out one of the subsets from training
and using it as the test set.
• If K equals the sample size, this is called "Leave-One-Out" cross-validation.
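A minimal K-fold cross-validation sketch, assuming scikit-learn; cross_val_score performs the K splits and model refits internally, and the toy data is a placeholder.

# K-fold cross-validation: K models, each tested on the fold left out of training.
from sklearn.model_selection import cross_val_score, KFold, LeaveOneOut
from sklearn.tree import DecisionTreeClassifier

X = [[i] for i in range(20)]
y = [i % 2 for i in range(20)]      # toy labels, purely illustrative

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=cv)
print(scores.mean())                # average accuracy over the 5 folds

# Setting K equal to the sample size gives Leave-One-Out cross-validation:
loo_scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=LeaveOneOut())
print(loo_scores.mean())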
Bootstrapping
• Bootstrapping is a technique used to make estimates from the data by taking the
average of estimates obtained from smaller data samples.
• The bootstrapping method involves repeatedly resampling the dataset with
replacement.
• By resampling, instead of estimating a statistic only once on the complete data, we
can estimate it many times.
• Repeating this multiple times yields a vector of estimates.
• From these estimates, bootstrapping can compute the variance, the expected value,
and other relevant statistics.
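A small bootstrapping sketch in plain Python; the sample values and the chosen statistic (the mean) are made up to illustrate resampling with replacement.

# Bootstrapping: resample with replacement many times and study the estimates.
import random
import statistics

data = [4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.3, 4.4]  # made-up sample
B = 1000                                         # number of bootstrap resamples

estimates = []
for _ in range(B):
    resample = random.choices(data, k=len(data))  # sample with replacement
    estimates.append(statistics.mean(resample))   # statistic of interest

print(statistics.mean(estimates))      # expected value of the estimate
print(statistics.variance(estimates))  # variance of the estimate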
