
Introduction to Data Science (CCPS521)

Session 5: Classification
Classification
Definition:
In statistics, classification is the problem of identifying which of a set
of categories (sub-populations) an observation (or observations) belongs to.
Examples are assigning a given email to the "spam" or "non-spam" class, and
assigning a diagnosis to a given patient based on observed characteristics of the
patient (sex, blood pressure, presence or absence of certain symptoms, etc.).
Often, the individual observations are analyzed into a set of quantifiable
properties, known variously as explanatory variables or features. These properties
may variously be categorical (e.g. "A", "B", "AB" or "O", for blood type),
ordinal (e.g. "large", "medium" or "small"), integer-valued (e.g. the number of
occurrences of a particular word in an email) or real-valued (e.g. a measurement
of blood pressure). Other classifiers work by comparing observations to previous
observations by means of a similarity or distance function.
Source: https://en.wikipedia.org/wiki/Statistical_classification
Classification
Definition:
An algorithm that implements classification, especially in a concrete
implementation, is known as a classifier. The term "classifier" sometimes also
refers to the mathematical function, implemented by a classification algorithm,
that maps input data to a category.
Terminology across fields is quite varied. In statistics, where classification is often
done with logistic regression or a similar procedure, the properties of observations
are termed explanatory variables (or independent variables, regressors, etc.), and
the categories to be predicted are known as outcomes, which are considered to be
possible values of the dependent variable. In machine learning, the observations
are often known as instances, the explanatory variables are
termed features (grouped into a feature vector), and the possible categories to be
predicted are classes.
Source: https://en.wikipedia.org/wiki/Statistical_classification
Classification
What is Classification?
Classification is defined as the process of recognizing, understanding, and grouping objects
and ideas into preset categories, a.k.a. “sub-populations.” With the help of these pre-categorized
training datasets, machine learning programs leverage a wide range of classification algorithms
to classify future datasets into the respective, relevant categories.
Classification algorithms used in machine learning utilize input training data for the purpose of
predicting the likelihood or probability that the data that follows will fall into one of the
predetermined categories. One of the most common applications of classification is for filtering
emails into “spam” or “non-spam”, as used by today’s top email service providers.
In short, classification is a form of “pattern recognition”: classification algorithms applied
to the training data find the same patterns (similar number sequences, words or sentiments, and
the like) in future datasets.
Source: https://www.simplilearn.com/tutorials/machine-learning-tutorial/classification-in-machine-learning
Classification
What is Classification?
Machine learning is a field of study and is concerned with algorithms that learn from
examples.
Classification is a task that requires the use of machine learning algorithms that learn
how to assign a class label to examples from the problem domain. An easy-to-understand
example is classifying emails as “spam” or “not spam.”

There are many different types of classification tasks that you may encounter in machine
learning and specialized approaches to modeling that may be used for each.
Examples of classification problems include:
• Given an example, classify if it is spam or not.
• Given a handwritten character, classify it as one of the known characters.
• Given recent user behavior, classify as churn or not.

Source: https://machinelearningmastery.com/types-of-classification-in-machine-learning/
Classification
How does classification work?
From a modeling perspective, classification requires a training dataset with many
examples of inputs and outputs from which to learn.
A model will use the training dataset and will calculate how to best map examples
of input data to specific class labels. As such, the training dataset must be
sufficiently representative of the problem and have many examples of each class
label.
Class labels are often string values, e.g. “spam,” “not spam,” and must be mapped
to numeric values before being provided to an algorithm for modeling. This is often
referred to as label encoding, where a unique integer is assigned to each class
label, e.g. “spam” = 0, “no spam” = 1.
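A minimal sketch of label encoding with scikit-learn's LabelEncoder (the example labels are illustrative; note that LabelEncoder assigns integers in sorted order, so here “not spam” = 0 and “spam” = 1):

from sklearn.preprocessing import LabelEncoder

labels = ["spam", "not spam", "spam", "not spam", "not spam"]  # string class labels
encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)  # each unique string mapped to an integer
print(encoded)           # [1 0 1 0 0]
print(encoder.classes_)  # ['not spam' 'spam']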
Source: https://machinelearningmastery.com/types-of-classification-in-machine-learning/
Classification
How does classification work?
There are many different types of classification algorithms for modeling
classification predictive modeling problems.
There is no good theory on how to map algorithms onto problem types; instead, it
is generally recommended that a practitioner use controlled experiments and
discover which algorithm and algorithm configuration results in the best
performance for a given classification task.
Classification predictive modeling algorithms are evaluated based on their results.
Classification accuracy is a popular metric used to evaluate the performance of a
model based on the predicted class labels. Classification accuracy is not perfect but
is a good starting point for many classification tasks.
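A minimal sketch of evaluating classification accuracy with scikit-learn on a synthetic dataset (the dataset and model choices here are illustrative):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# synthetic binary classification problem
X, y = make_classification(n_samples=1000, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
model = LogisticRegression().fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))  # fraction of correct predictions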
Source: https://machinelearningmastery.com/types-of-classification-in-machine-learning/
Classification
How does classification work? - Classification rule
Given a population whose members each belong to one of a number of different
sets or classes, a classification rule or classifier is a procedure by which the
elements of the population set are each predicted to belong to one of the
classes. A perfect classification is one for which every element in the population
is assigned to the class it really belongs to. An imperfect classification is one in
which some errors appear, and then statistical analysis must be applied to analyse
the classification.
A special kind of classification rule is binary classification, for problems in which
there are only two classes.

Source: https://en.wikipedia.org/wiki/Classification_rule
Classification
How does classification work?
Instead of class labels, some tasks may require the prediction of a probability of
class membership for each example. This provides additional uncertainty in the
prediction that an application or user can then interpret. A popular diagnostic for
evaluating predicted probabilities is the ROC Curve.
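A minimal sketch of predicting class-membership probabilities and computing points on a ROC curve with scikit-learn (synthetic data; model choice is illustrative):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=1000, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
model = LogisticRegression().fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]        # probability of the positive class
fpr, tpr, thresholds = roc_curve(y_test, probs)  # points along the ROC curve
print(roc_auc_score(y_test, probs))              # area under the ROC curve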
There are perhaps four main types of classification tasks that you may encounter;
they are:
• Binary Classification
• Multi-Class Classification
• Multi-Label Classification
• Imbalanced Classification
Source: https://machinelearningmastery.com/types-of-classification-in-machine-learning/
Classification
Binary Classification
Binary classification is the task of classifying the elements of a set into two groups (each
called a class) on the basis of a classification rule. Typical binary classification problems
include:
• Medical testing to determine if a patient has a certain disease or not;
• Quality control in industry, deciding whether a specification has been met;
• In information retrieval, deciding whether a page should be in the result set of a search
or not.
Binary classification is dichotomization applied to a practical situation. In many practical
binary classification problems, the two groups are not symmetric, and rather than overall
accuracy, the relative proportion of different types of errors is of interest. For example, in
medical testing, detecting a disease when it is not present (a false positive) is considered
differently from not detecting a disease when it is present (a false negative).

Source: https://en.wikipedia.org/wiki/Binary_classification
Classification
Binary Classification
Statistical binary classification is a problem studied in machine learning. It is a type
of supervised learning, a method of machine learning where the categories are
predefined, and is used to categorize new probabilistic observations into those categories.
When there are only two categories, the problem is known as statistical binary
classification. Some of the methods commonly used for binary classification are:
• Decision trees
• Random forests
• Bayesian networks
• Support vector machines
• Neural networks
• Logistic regression

Source: https://en.wikipedia.org/wiki/Binary_classification
Classification
Binary Classification
Typically, binary classification tasks involve one class that is the normal state and
another class that is the abnormal state.
For example “not spam” is the normal state and “spam” is the abnormal state.
Another example is “cancer not detected” is the normal state of a task that involves
a medical test and “cancer detected” is the abnormal state.
The class for the normal state is assigned the class label 0 and the class with the
abnormal state is assigned the class label 1.
Source: https://machinelearningmastery.com/types-of-classification-in-machine-learning/
Classification
Binary Classification
Popular algorithms that can be used for binary classification include:
• Logistic Regression
• k-Nearest Neighbors
• Decision Trees
• Support Vector Machine
• Naive Bayes

Some algorithms are specifically designed for binary classification and do not
natively support more than two classes; examples include Logistic Regression and
Support Vector Machines.
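A minimal sketch fitting two of the listed algorithms on a synthetic two-class problem (dataset and model choices are illustrative):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# synthetic two-class problem
X, y = make_classification(n_samples=200, n_classes=2, random_state=1)
for model in (LogisticRegression(), KNeighborsClassifier()):
    model.fit(X, y)
    print(type(model).__name__, model.predict(X[:3]))  # predicted class labels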

Source: https://machinelearningmastery.com/types-of-classification-in-machine-learning/
Classification
Multiclass classification
In machine learning and statistical classification, multiclass classification or
multinomial classification is the problem of classifying instances into one of three
or more classes.
While many classification algorithms (notably multinomial logistic regression)
naturally permit the use of more than two classes, some are by
nature binary algorithms; these can, however, be turned into multinomial
classifiers by a variety of strategies.
Multiclass classification should not be confused with multi-label classification,
where multiple labels are to be predicted for each instance.

Source: https://en.wikipedia.org/wiki/Multiclass_classification
Classification
Multi-Class Classification
Multi-class classification refers to those classification tasks that have more than
two class labels.
Examples include:
• Face classification.
• Plant species classification.
• Optical character recognition.
Unlike binary classification, multi-class classification does not have the notion of
normal and abnormal outcomes. Instead, examples are classified as belonging to
one among a range of known classes.
Source: https://machinelearningmastery.com/types-of-classification-in-machine-learning/
Classification
Multi-Class Classification
• The number of class labels may be very large on some problems. For example, a model may predict a
photo as belonging to one among thousands or tens of thousands of faces in a face recognition system.

Popular algorithms that can be used for multi-class classification include:
• k-Nearest Neighbors.
• Decision Trees.
• Naive Bayes.
• Random Forest.
• Gradient Boosting.

Algorithms that are designed for binary classification can be adapted for use on multi-class problems
by using strategies such as one-vs-rest and one-vs-one. Binary classification algorithms that can use
these strategies for multi-class classification include (see the sketch after the list):
• Logistic Regression.
• Support Vector Machine.
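A minimal sketch of the one-vs-rest strategy in scikit-learn, wrapping the natively binary logistic regression (dataset parameters are illustrative):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# synthetic 3-class problem; logistic regression is natively binary
X, y = make_classification(n_samples=300, n_classes=3, n_informative=4, random_state=1)
ovr = OneVsRestClassifier(LogisticRegression()).fit(X, y)  # one binary model per class
print(ovr.predict(X[:5]))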

Source: https://machinelearningmastery.com/types-of-classification-in-machine-learning/
Classification
Multi-Label Classification
In machine learning, multi-label classification or multi-output classification is a
variant of the classification problem where multiple nonexclusive labels may be
assigned to each instance. Multi-label classification is a generalization of multiclass
classification, which is the single-label problem of categorizing instances into
precisely one of several (more than two) classes. In the multi-label problem the
labels are nonexclusive and there is no constraint on how many of the classes the
instance can be assigned to.
Formally, multi-label classification is the problem of finding a model that maps
inputs x to binary vectors y; that is, it assigns a value of 0 or 1 for each element
(label) in y.

Source: https://en.wikipedia.org/wiki/Multi-label_classification
Classification
Multi-Label Classification
Multi-label classification refers to those classification tasks that have two or more
class labels, where one or more class labels may be predicted for each example.
Consider the example of photo classification, where a given photo may have
multiple objects in the scene and a model may predict the presence of multiple
known objects in the photo, such as “bicycle,” “apple,” “person,” etc.
This is unlike binary classification and multi-class classification, where a single class
label is predicted for each example.
It is common to model multi-label classification tasks with a model that predicts
multiple outputs, with each output predicted as a Bernoulli probability
distribution. This is essentially a model that makes multiple binary classification
predictions for each example.
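A minimal sketch of multi-label prediction in scikit-learn, fitting one binary classifier per label (dataset and model choices are illustrative):

from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# each row of y is a binary vector: one 0/1 entry per (non-exclusive) label
X, y = make_multilabel_classification(n_samples=200, n_classes=3, random_state=1)
model = MultiOutputClassifier(LogisticRegression()).fit(X, y)
print(model.predict(X[:3]))  # one 0/1 prediction per label for each example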
Source: https://machinelearningmastery.com/types-of-classification-in-machine-learning/
Classification
Multi-Label Classification
Classification algorithms used for binary or multi-class classification cannot be used
directly for multi-label classification. Specialized versions of standard classification
algorithms can be used, so-called multi-label versions of the algorithms, including:
• Multi-label Decision Trees
• Multi-label Random Forests
• Multi-label Gradient Boosting
Another approach is to use a separate classification algorithm to predict the labels
for each class.

Source: https://machinelearningmastery.com/types-of-classification-in-machine-learning/
Classification
Imbalanced Classification
An imbalanced classification problem is an example of a classification problem
where the distribution of examples across the known classes is biased or skewed.
The distribution can vary from a slight bias to a severe imbalance where there is
one example in the minority class for hundreds, thousands, or millions of examples
in the majority class or classes.
Source: https://machinelearningmastery.com/what-is-imbalanced-classification/
Classification
Imbalanced Classification
For example, we may collect measurements of flowers and have 80 examples of one
flower species and 20 examples of a second flower species, and only these examples
comprise our training dataset. This represents an example of an imbalanced classification
problem.
We refer to these types of problems as “imbalanced classification” instead of
“unbalanced classification.” Unbalanced refers to a class distribution that was balanced
and is now no longer balanced, whereas imbalanced refers to a class distribution that is
inherently not balanced.
There are other less general names that may be used to describe these types of
classification problems, such as:
• Rare event prediction.
• Extreme event prediction.
• Severe class imbalance.

Source: https://machinelearningmastery.com/what-is-imbalanced-classification/
Classification
Imbalanced Classification
It is common to describe the imbalance of classes in a dataset in terms of a ratio.
For example, an imbalanced binary classification problem with an imbalance of 1
to 100 (1:100) means that for every one example in one class, there are 100
examples in the other class.
Another way to describe the imbalance of classes in a dataset is to summarize the
class distribution as percentages of the training dataset. For example, an
imbalanced multiclass classification problem may have 80 percent examples in the
first class, 18 percent in the second class, and 2 percent in a third class.
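A minimal sketch of summarizing a class distribution as counts and percentages (the label counts mirror the 80/18/2 example above and are hypothetical):

from collections import Counter

y = [0] * 800 + [1] * 180 + [2] * 20  # hypothetical imbalanced multiclass labels
counts = Counter(y)
for label, count in sorted(counts.items()):
    print(label, count, f"{100 * count / len(y):.0f}%")  # 80%, 18%, 2%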
Source: https://machinelearningmastery.com/what-is-imbalanced-classification/
Classification
Imbalanced Classification - Causes of Class Imbalance
There are perhaps two main groups of causes for the imbalance we may want to consider; they
are:
• Data sampling, and
• Properties of the domain.

Data Sampling:
It is possible that the imbalance in the examples across the classes was caused by the way the
examples were collected or sampled from the problem domain. This might involve biases
introduced during data collection, and errors made during data collection.
• Biased Sampling.
• Measurement Errors.
For example, perhaps examples were collected from a narrow geographical region, or slice of
time, and the distribution of classes may be quite different or perhaps even collected in a
different way.

Source: https://machinelearningmastery.com/what-is-imbalanced-classification/
Classification
Imbalanced Classification - Causes of Class Imbalance
Data Sampling (Continued…):
Errors may have been made when collecting the observations. One type of error
might have been applying the wrong class labels to many examples. Alternately,
the processes or systems from which examples were collected may have been
damaged or impaired to cause the imbalance.
Often in cases where the imbalance is caused by a sampling bias or measurement
error, the imbalance can be corrected by improved sampling methods, and/or
correcting the measurement error. This is because the training dataset is not a fair
representation of the problem domain that is being addressed.

Source: https://machinelearningmastery.com/what-is-imbalanced-classification/
Classification
Imbalanced Classification - Causes of Class Imbalance
Properties of the Domain:
The imbalance might be a property of the problem domain.
For example, the natural occurrence or presence of one class may dominate other
classes. This may be because the process that generates observations in one class
is more expensive in time, cost, computation, or other resources. As such, it is
often infeasible or intractable to simply collect more samples from the domain in
order to improve the class distribution. Instead, a model is required to learn the
difference between the classes.

Source: https://machinelearningmastery.com/what-is-imbalanced-classification/
Classification
Challenge of Imbalanced Classification
The imbalance of the class distribution will vary across problems.
A classification problem may be a little skewed, such as if there is a slight
imbalance. Alternately, the classification problem may have a severe imbalance
where there might be hundreds or thousands of examples in one class and tens of
examples in another class for a given training dataset.
• Slight Imbalance. An imbalanced classification problem where the distribution of
examples is uneven by a small amount in the training dataset (e.g. 4:6).
• Severe Imbalance. An imbalanced classification problem where the distribution
of examples is uneven by a large amount in the training dataset (e.g. 1:100 or
more).

Source: https://machinelearningmastery.com/what-is-imbalanced-classification/
Classification
Challenge of Imbalanced Classification
A slight imbalance is often not a concern, and the problem can often be treated
like a normal classification predictive modeling problem. A severe imbalance of the
classes can be challenging to model and may require the use of specialized
techniques.
The class or classes with abundant examples are called the major or majority
classes, whereas the class with few examples (and there is typically just one) is
called the minor or minority class.
• Majority Class: The class (or classes) in an imbalanced classification predictive
modeling problem that has many examples.
• Minority Class: The class in an imbalanced classification predictive modeling
problem that has few examples.

Source: https://machinelearningmastery.com/what-is-imbalanced-classification/
Classification
Challenge of Imbalanced Classification
When working with an imbalanced classification problem, the minority class is typically of
the most interest. This means that a model’s skill in correctly predicting the class label or
probability for the minority class is more important than the majority class or classes.
The minority class is harder to predict because there are few examples of this class, by
definition. This means it is more challenging for a model to learn the characteristics of
examples from this class, and to differentiate examples from this class from the majority
class (or classes).
The abundance of examples from the majority class (or classes) can swamp the minority
class. Most machine learning algorithms for classification predictive models are designed
and demonstrated on problems that assume an equal distribution of classes. This means
that a naive application of a model may focus on learning the characteristics of the
abundant observations only, neglecting the examples from the minority class that is, in
fact, of more interest and whose predictions are more valuable.
Source: https://machinelearningmastery.com/what-is-imbalanced-classification/
Classification
Challenge of Imbalanced Classification
Imbalanced classification is not “solved.”
It remains an open problem generally, and practically must be identified and
addressed specifically for each training dataset.
This is true even in the face of more data, so-called “big data,” large neural
network models, so-called “deep learning,” and very impressive competition-
winning models, so-called “xgboost.”

• Many of the classification predictive modeling problems that we are interested in
solving in practice are imbalanced.
• As such, it is surprising that imbalanced classification does not get more attention
than it does.

Source: https://machinelearningmastery.com/what-is-imbalanced-classification/
Classification
Challenge of Imbalanced Classification
Below is a list of examples of problem domains where the class distribution of examples is inherently
imbalanced.
Many classification problems may have a severe imbalance in the class distribution; nevertheless, looking
at common problem domains that are inherently imbalanced will make the ideas and challenges of class
imbalance concrete.
• Fraud Detection.
• Claim Prediction.
• Default Prediction.
• Churn Prediction.
• Spam Detection.
• Anomaly Detection.
• Outlier Detection.
• Intrusion Detection.
• Conversion Prediction.

Source: https://machinelearningmastery.com/what-is-imbalanced-classification/
Classification
Challenge of Imbalanced Classification
The list of examples sheds light on the nature of imbalanced classification
predictive modeling.
Each of these problem domains represents an entire field of study, where specific
problems from each domain can be framed and explored as imbalanced
classification predictive modeling. This highlights the multidisciplinary nature of
class imbalanced classification, and why it is so important for a machine learning
practitioner to be aware of the problem and skilled in addressing it.
Notice that most, if not all, of the examples are likely binary classification
problems. Notice too that examples from the minority class are rare, extreme,
abnormal, or unusual in some way.
Also notice that many of the domains are described as “detection,” highlighting the
desire to discover the minority class amongst the abundant examples of the
majority class.
Source: https://machinelearningmastery.com/what-is-imbalanced-classification/
Classification
Challenge of Imbalanced Classification
Summary:
• Imbalanced classification is the problem of classification when there is an
unequal distribution of classes in the training dataset.
• The imbalance in the class distribution may vary, but a severe imbalance is more
challenging to model and may require specialized techniques.
• Many real-world classification problems have an imbalanced class distribution
such as fraud detection, spam detection, and churn prediction.

Source: https://machinelearningmastery.com/what-is-imbalanced-classification/
Classification
Working with Imbalanced Classification
Specialized techniques may be used to change the composition of samples in the
training dataset by undersampling the majority class or oversampling the minority
class.
Examples include:
• Random Undersampling.
• SMOTE Oversampling.

Source: https://machinelearningmastery.com/types-of-classification-in-machine-learning/
Classification
Working with Imbalanced Classification - Random Oversampling and
Undersampling for Imbalanced Classification
One approach to addressing the problem of class imbalance is to randomly
resample the training dataset. The two main approaches to randomly resampling
an imbalanced dataset are to delete examples from the majority class, called
undersampling, and to duplicate examples from the minority class, called
oversampling.
Random Resampling Imbalanced Datasets
Resampling involves creating a new transformed version of the training dataset in
which the selected examples have a different class distribution. This is a simple and
effective strategy for imbalanced classification problems.
Source: https://machinelearningmastery.com/random-oversampling-and-undersampling-for-imbalanced-classification/
Classification
Working with Imbalanced Classification - Random Oversampling and
Undersampling for Imbalanced Classification
Random Resampling Imbalanced Datasets (Continued …)
• The simplest strategy is to choose examples for the transformed dataset
randomly, called random resampling.
• There are two main approaches to random resampling for imbalanced
classification; they are oversampling and undersampling.
• Random Oversampling: Randomly duplicate examples in the minority class.
• Random Undersampling: Randomly delete examples in the majority class.

Source: https://machinelearningmastery.com/random-oversampling-and-undersampling-for-imbalanced-classification/
Classification
Working with Imbalanced Classification - Random Oversampling and
Undersampling for Imbalanced Classification
Random Resampling Imbalanced Datasets (Continued …)
Random oversampling involves randomly selecting examples from the minority
class, with replacement, and adding them to the training dataset. Random
undersampling involves randomly selecting examples from the majority class and
deleting them from the training dataset.
Both approaches can be repeated until the desired class distribution is achieved in
the training dataset, such as an equal split across the classes.
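A minimal sketch of both approaches, assuming the imbalanced-learn (imblearn) package is available; dataset parameters are illustrative:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# synthetic dataset with roughly a 95:5 class distribution
X, y = make_classification(n_samples=1000, weights=[0.95], flip_y=0, random_state=1)
print(Counter(y))

X_over, y_over = RandomOverSampler(sampling_strategy=1.0).fit_resample(X, y)
print(Counter(y_over))   # equal split by duplicating minority examples

X_under, y_under = RandomUnderSampler(sampling_strategy=1.0).fit_resample(X, y)
print(Counter(y_under))  # equal split by deleting majority examples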

Source: https://machinelearningmastery.com/random-oversampling-and-undersampling-for-imbalanced-classification/
Classification
Working with Imbalanced Classification - Random Oversampling and
Undersampling for Imbalanced Classification
Random Resampling Imbalanced Datasets (Continued …)
They are referred to as “naive resampling” methods because they assume nothing
about the data and no heuristics are used. This makes them simple to implement
and fast to execute, which is desirable for very large and complex datasets.
Both techniques can be used for two-class (binary) classification problems and
multi-class classification problems with one or more majority or minority classes.
Importantly, the change to the class distribution is only applied to the training
dataset. The intent is to influence the fit of the models. The resampling is not
applied to the test or holdout dataset used to evaluate the performance of a
model.
Source: https://machinelearningmastery.com/random-oversampling-and-undersampling-for-imbalanced-classification/
Classification
Working with Imbalanced Classification - Random Oversampling and Undersampling for
Imbalanced Classification
Random Oversampling Imbalanced Datasets
Random oversampling involves randomly duplicating examples from the minority class
and adding them to the training dataset.
Examples from the training dataset are selected randomly with replacement. This means
that examples from the minority class can be chosen and added to the new “more
balanced” training dataset multiple times; they are selected from the original training
dataset, added to the new training dataset, and then returned or “replaced” in the
original dataset, allowing them to be selected again.
This technique can be effective for those machine learning algorithms that are affected
by a skewed distribution and where multiple duplicate examples for a given class can
influence the fit of the model. This might include algorithms that iteratively learn
coefficients, like artificial neural networks that use stochastic gradient descent. It can also
affect models that seek good splits of the data, such as support vector machines and
decision trees.
Source: https://machinelearningmastery.com/random-oversampling-and-undersampling-for-imbalanced-classification/
Classification
Working with Imbalanced Classification - Random Oversampling and Undersampling for
Imbalanced Classification
Random Oversampling Imbalanced Datasets (Continued …)
It might be useful to tune the target class distribution. In some cases, seeking a balanced
distribution for a severely imbalanced dataset can cause affected algorithms to overfit
the minority class, leading to increased generalization error. The effect can be better
performance on the training dataset, but worse performance on the holdout or test
dataset.
As such, to gain insight into the impact of the method, it is a good idea to monitor the
performance on both train and test datasets after oversampling and compare the results
to the same algorithm on the original dataset.
The increase in the number of examples for the minority class, especially if the class skew
was severe, can also result in a marked increase in the computational cost when fitting
the model, especially considering the model is seeing the same examples in the training
dataset again and again.

Source: https://machinelearningmastery.com/random-oversampling-and-undersampling-for-imbalanced-classification/
Classification
Working with Imbalanced Classification - Random Oversampling and
Undersampling for Imbalanced Classification
Random Undersampling Imbalanced Datasets
Random undersampling involves randomly selecting examples from the majority
class to delete from the training dataset.
This has the effect of reducing the number of examples in the majority class in the
transformed version of the training dataset. This process can be repeated until the
desired class distribution is achieved, such as an equal number of examples for
each class.
This approach may be more suitable for those datasets where there is a class
imbalance although a sufficient number of examples in the minority class, such that a
useful model can be fit.

Source: https://machinelearningmastery.com/random-oversampling-and-undersampling-for-imbalanced-classification/
Classification
Working with Imbalanced Classification - Random Oversampling and
Undersampling for Imbalanced Classification
Random Undersampling Imbalanced Datasets (Continued…)
A limitation of undersampling is that examples from the majority class are deleted
that may be useful, important, or perhaps critical to fitting a robust decision
boundary. Given that examples are deleted randomly, there is no way to detect or
preserve “good” or more information-rich examples from the majority class.
Source: https://machinelearningmastery.com/random-oversampling-and-undersampling-for-imbalanced-classification/
Classification
Working with Imbalanced Classification - Random Oversampling and Undersampling for
Imbalanced Classification
Combining Random Oversampling and Undersampling
Interesting results may be achieved by combining both random oversampling and
undersampling.
For example, a modest amount of oversampling can be applied to the minority class to
improve the bias towards these examples, whilst also applying a modest amount of
undersampling to the majority class to reduce the bias on that class.
This can result in improved overall performance compared to performing one or the
other techniques in isolation.
For example, if we had a dataset with a 1:100 class distribution, we might first apply
oversampling to increase the ratio to 1:10 by duplicating examples from the minority
class, then apply undersampling to further improve the ratio to 1:2 by deleting examples
from the majority class.
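A minimal sketch of this 1:100 to 1:10 to 1:2 sequence, assuming the imbalanced-learn (imblearn) package is available; dataset parameters are illustrative:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

# synthetic dataset with roughly a 1:100 class distribution
X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0, random_state=1)
print(Counter(y))

over = RandomOverSampler(sampling_strategy=0.1)    # oversample minority to about 1:10
under = RandomUnderSampler(sampling_strategy=0.5)  # then undersample majority to 1:2
X_res, y_res = Pipeline(steps=[("o", over), ("u", under)]).fit_resample(X, y)
print(Counter(y_res))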

Source: https://machinelearningmastery.com/random-oversampling-and-undersampling-for-imbalanced-classification/
Classification
Working with Imbalanced Classification - Random Oversampling and Undersampling for
Imbalanced Classification
Synthetic Minority Oversampling Technique (SMOTE)
A problem with imbalanced classification is that there are too few examples of the
minority class for a model to effectively learn the decision boundary.
One way to solve this problem is to oversample the examples in the minority class. This
can be achieved by simply duplicating examples from the minority class in the training
dataset prior to fitting a model. This can balance the class distribution but does not
provide any additional information to the model.
An improvement on duplicating examples from the minority class is to synthesize new
examples from the minority class. This is a type of data augmentation for tabular data
and can be very effective.
Perhaps the most widely used approach to synthesizing new examples is called
the Synthetic Minority Oversampling TEchnique, or SMOTE for short. This technique was
described by Nitesh Chawla, et al. in their 2002 paper titled
“SMOTE: Synthetic Minority Over-sampling Technique.”
Source: https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/
Classification
Working with Imbalanced Classification - Random Oversampling and Undersampling for
Imbalanced Classification
Synthetic Minority Oversampling Technique (SMOTE) (Continued…)
SMOTE works by selecting examples that are close in the feature space, drawing a line
between the examples in the feature space and drawing a new sample at a point along
that line.
Specifically, a random example from the minority class is first chosen. Then k of the
nearest neighbors for that example are found (typically k=5). A randomly selected
neighbor is chosen and a synthetic example is created at a randomly selected point
between the two examples in feature space.
This procedure can be used to create as many synthetic examples for the minority class as
are required. The paper suggests first using random undersampling to
trim the number of examples in the majority class, then using SMOTE to oversample the
minority class to balance the class distribution.
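A minimal sketch of SMOTE oversampling, assuming the imbalanced-learn (imblearn) package is available; k_neighbors=5 matches the typical k described above, and the dataset parameters are illustrative:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# synthetic dataset with roughly a 95:5 class distribution
X, y = make_classification(n_samples=1000, weights=[0.95], flip_y=0, random_state=1)
print(Counter(y))

X_res, y_res = SMOTE(k_neighbors=5).fit_resample(X, y)  # synthesize new minority examples
print(Counter(y_res))  # balanced class distribution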
Source: https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/
Classification
Working with Imbalanced Classification - Random Oversampling and
Undersampling for Imbalanced Classification
Synthetic Minority Oversampling Technique (SMOTE) (Continued…)
The approach is effective because new synthetic examples from the minority class
are created that are plausible, that is, are relatively close in feature space to
existing examples from the minority class.
A general downside of the approach is that synthetic examples are created without
considering the majority class, possibly resulting in ambiguous examples if there is
a strong overlap for the classes.
Source: https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/
Classification
Working with Imbalanced Classification
Specialized modeling algorithms may be used that pay more attention to the minority class when
fitting the model on the training dataset, such as cost-sensitive machine learning algorithms.
Examples include:
• Cost-sensitive Logistic Regression.
• Cost-sensitive Decision Trees.
• Cost-sensitive Support Vector Machines.
Finally, alternative performance metrics may be required as reporting the classification accuracy
may be misleading.
Examples include (see the sketch after the list):
• Precision.
• Recall.
• F-Measure.
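A minimal sketch combining a cost-sensitive model with precision, recall, and F-measure in scikit-learn; class_weight="balanced" is one common way to make logistic regression cost-sensitive, and the dataset parameters are illustrative:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score

# synthetic dataset with roughly a 90:10 class distribution
X, y = make_classification(n_samples=1000, weights=[0.9], flip_y=0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=1)

# weight errors inversely to class frequency (cost-sensitive logistic regression)
model = LogisticRegression(class_weight="balanced").fit(X_train, y_train)
y_pred = model.predict(X_test)
print(precision_score(y_test, y_pred), recall_score(y_test, y_pred), f1_score(y_test, y_pred))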

Source: https://machinelearningmastery.com/types-of-classification-in-machine-learning/
Classification
Summary
• Classification predictive modeling involves assigning a class label to input
examples.
• Binary classification refers to predicting one of two classes and multi-class
classification involves predicting one of more than two classes.
• Multi-label classification involves predicting one or more classes for each
example and imbalanced classification refers to classification tasks where the
distribution of examples across the classes is not equal.

Source: https://machinelearningmastery.com/types-of-classification-in-machine-learning/
