Unit 4.
Classification (2):
k-nearest neighbors, naïve Bayesian and
artificial neural networks
Assoc. Prof. Nguyen Manh Tuan
AGENDA
 k-nearest neighbors
 Naïve Bayesian
 Artificial neural networks
10/11/2022 internal use

Models – k-nearest neighbors
 Decision trees and rule induction generalize the relationship within a dataset and
use it to predict the outcome of new, unseen data.
 These are eager learners:
 they try to find the best approximation of the actual relationship between the input and target variables.
 The alternative class of learners adopts a blunt approach, where no “learning” is
performed from the training dataset
 rather the training dataset is used as a lookup table to match the input variables and find the
outcome.
 These approaches that memorize the training set are called lazy learners.
 The underlying idea here is somewhat similar to the old adage, “birds of a
feather flock together.”
 Similar records congregate in a neighborhood in n-dimensional space, with the same target class
labels.

 The entire training dataset is “memorized” and when unlabeled example records need
to be classified, the input attributes of the new unlabeled records are compared
against the entire training set to find the closest match.
 The class label of the closest training record is the predicted class label for the
unseen test record.
 The closest training record needs to be found for each test record.
 Even though no mathematical generalization or rule generation is involved, finding
the closest training record for a new unlabeled record can be a tricky problem to solve,
particularly when there is no exact match in the training data for a given test
data record.

 The year is 1798 and you are Lieutenant-Colonel David Collins of HMS Calcutta
who is exploring the region around Hawkesbury River, in New South Wales.
 After an expedition up the river one of the men tells you that he saw a strange animal
near the river.
 You ask him to describe the animal to you and he explains that he didn’t see it very
well but that he did notice that it had webbed feet and a duck-bill snout, and that it
growled at him.
 In order to plan the expedition for the next day you decide that you need to classify
the animal so that you can figure out whether it is dangerous to approach it or not.
 Matching animals you remember to the features of the unknown animal described by
the sailor.


 The process of classifying an unknown animal by matching the features of the animal
against the features of animals you can remember neatly encapsulates the big idea
underpinning similarity-based learning:
 if you are trying to classify something, then you should search your memory to find things
that are similar and label it with the same class as the most similar thing in your memory.
 One of the simplest and best-known machine learning algorithms for this type of reasoning
is the nearest neighbor algorithm.


[Figure: feature space plot. Triangles: 'Non-draft'; crosses: 'Draft']

 A similarity metric measures the similarity between two instances/cases/examples.
 Mathematically, a metric must conform to the following 4 criteria:
1 Non-negativity: metric(a, b) ≥ 0
2 Identity: metric(a, b) = 0 ⇔ a = b
3 Symmetry: metric(a, b) = metric(b, a)
4 Triangular inequality: metric(a, b) ≤ metric(a, c) + metric(b, c)
where metric(a, b) is a function that returns the distance between a and b.
 One of the best-known metrics is Euclidean distance, which computes the length of the straight
line between two points a and b in an m-dimensional feature/attribute space:
\[ \mathrm{Euclidean}(a, b) = \sqrt{\sum_{i=1}^{m} (a[i] - b[i])^2} \]

 Another, less well-known, distance measure is the Manhattan distance or taxi-cab
distance.
 The Manhattan distance between two instances a and b in an m-feature space:
\[ \mathrm{Manhattan}(a, b) = \sum_{i=1}^{m} |a[i] - b[i]| \]
 The Euclidean and Manhattan distances are special cases of the Minkowski distance
(the p-norm distance), which is defined as (a and b in an m-feature space):
\[ \mathrm{Minkowski}(a, b) = \left( \sum_{i=1}^{m} |a[i] - b[i]|^{p} \right)^{1/p} \]
p=1: Manhattan (binary)
p=2: Euclidean (numeric; or good start)
p=∞: Chebyshev
p: the order of the norm
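The three special cases of the Minkowski distance can be checked with a few lines of Python. This is a minimal sketch: the function name `minkowski` and the two sample points are illustrative assumptions, not taken from the slides.

```python
import math

def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if math.isinf(p):
        # Limit of the p-norm as p grows: the largest coordinate difference.
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)

# Illustrative points (assumed for this sketch).
a, b = (0.0, 0.0), (3.0, 4.0)
manhattan = minkowski(a, b, 1)          # 3 + 4 = 7.0
euclidean = minkowski(a, b, 2)          # sqrt(9 + 16) = 5.0
chebyshev = minkowski(a, b, math.inf)   # max(3, 4) = 4.0
```

Larger values of p put more emphasis on the single largest attribute difference, which is why p=∞ reduces to the maximum.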

 The Nearest Neighbor Algorithm

Require: a set of training instances and a query to be classified
1) Iterate across the instances in memory and find the instance that has the shortest distance
to the query position in the feature space.
2) Make a prediction for the query equal to the value of the target feature of the nearest
neighbor.
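The two steps of the algorithm can be sketched as follows. The query (SPEED = 6.75, AGILITY = 3.00) is the one used in these slides; the four training instances are hypothetical values made up for illustration, and `nearest_neighbor` is an assumed helper name.

```python
import math

def nearest_neighbor(training, query):
    """Return the target value of the training instance closest to the query."""
    # Step 1: find the instance with the shortest Euclidean distance to the query.
    _, target = min(training, key=lambda inst: math.dist(inst[0], query))
    # Step 2: predict the target value of that nearest neighbor.
    return target

# Hypothetical (SPEED, AGILITY) instances; not the data from the slides.
training = [
    ((2.50, 6.00), "Non-draft"),
    ((3.75, 8.00), "Non-draft"),
    ((6.50, 4.00), "Draft"),
    ((7.50, 2.50), "Draft"),
]
query = (6.75, 3.00)
prediction = nearest_neighbor(training, query)

# Updating the "model" is just adding another labeled instance to memory.
training.append((query, prediction))
```

With these made-up instances the closest record is (7.50, 2.50), so the query is labeled 'Draft'.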
 Similarity-based prediction models attempt to mimic a very human way of reasoning
by basing predictions for a target feature value on the most similar instances in
memory.
 This makes them easy to interpret and understand.


[Figure: feature space with the query/target marked '?'. Triangles: 'Non-draft'; crosses: 'Draft']
[Table: distances (Dist.) between the query instance (SPEED = 6.75, AGILITY = 3.00) and
each other instance]

One of the great things about nearest neighbor algorithms is that we can add in new data to update
the model very easily.

 Returning to 1798 and HMS Calcutta, the next day you accompany your men on the expedition
up the river and you encounter the strange animal the sailor had described to you.
 This time when you see the animal yourself you realize that it definitely isn’t a duck!
 It turns out that you and your men are the first Europeans to encounter a platypus.
Two important, related aspects of supervised machine learning:
 learning is based on the stationarity assumption, which states
that the data doesn’t change over time.
 for classification, learning creates models that distinguish
between the classes that are present in the dataset they are
induced from.
 When a classification model is trained to distinguish between
lions, frogs and ducks, the model will classify a query as being
either a lion, a frog or a duck, even if the query is actually a
platypus.
[Figure: a duck-billed platypus]




 But what if the closest training record is an outlier with the incorrect class in the training set?
Then, all the unlabeled test records near the outlier will get wrongly classified.
 To prevent this misclassification, the value of k can be increased to, say, 3. When k=3, the
nearest 3 training records are considered instead of 1.
[Figure: two panels comparing k=1 and k=3]

 Once the nearest k neighbors are determined, the process of determining the
predicted target class is straightforward.
 The predicted target class is the majority class of the nearest k neighbors.
y’ = majority class(y_1, y_2, …, y_k)
where y’ is the predicted target class of the test data point and y_i is the class of the i-th
neighbor n_i.
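The effect of raising k from 1 to 3 can be illustrated with a small sketch. The data below are hypothetical: one record labeled 'B' is placed among 'A' records to play the role of the outlier. `knn_predict` is an assumed helper name.

```python
import math
from collections import Counter

def knn_predict(training, query, k):
    """Predict the majority class among the k nearest training records."""
    neighbors = sorted(training, key=lambda inst: math.dist(inst[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical records; (1.1, 1.1) is an outlier carrying the "wrong" class.
training = [
    ((1.0, 1.0), "A"),
    ((1.3, 1.1), "A"),
    ((0.9, 1.2), "A"),
    ((1.1, 1.1), "B"),
    ((3.0, 3.0), "B"),
    ((3.2, 2.9), "B"),
]
query = (1.15, 1.1)

pred_k1 = knn_predict(training, query, 1)   # follows the outlier
pred_k3 = knn_predict(training, query, 3)   # outvoted by two 'A' neighbors
```

With k=1 the outlier decides the class; with k=3 it is outvoted. An odd k is a common way to avoid ties in two-class problems.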

 When k > 1, it can be argued that the closest neighbors should have more say in the
outcome of the predicted target class than the farther neighbors.
 The farther away neighbors should have less influence in determining the final
class outcome.
 This can be accomplished by assigning weights to all the neighbors, with the weights increasing as
the neighbors get closer to the test data point.
 The weights are included in the final voting step, where the predicted class is
calculated.
 Weights (wi) should satisfy two conditions:
 they should decrease as the distance between the test data point and the neighbor increases, and
 the sum of all weights should be equal to 1.

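 One of the calculations for weights follows an exponential decay based on distance:
\[ w_i = \frac{e^{-d(x, n_i)}}{\sum_{j=1}^{k} e^{-d(x, n_j)}} \]
w_i: the weight of the i-th neighbor n_i,
k: total number of neighbors,
x: the test data point,
d(x, n_i): the distance between x and n_i
 The weight is used in predicting target class y’:
\[ y' = \text{majority class}(w_1 y_1, w_2 y_2, \ldots, w_k y_k) \]
y_i: the class outcome of neighbor n_i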
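A distance-weighted vote can be sketched as below, assuming weights of the form exp(-distance) normalized to sum to 1 (one way to realize the exponential decay described above). The one-dimensional records and the name `weighted_knn_predict` are hypothetical.

```python
import math
from collections import defaultdict

def weighted_knn_predict(training, query, k):
    """Distance-weighted k-NN vote; returns (predicted label, neighbor weights)."""
    neighbors = sorted(training, key=lambda inst: math.dist(inst[0], query))[:k]
    raw = [math.exp(-math.dist(x, query)) for x, _ in neighbors]
    total = sum(raw)
    weights = [r / total for r in raw]        # normalized so the weights sum to 1
    scores = defaultdict(float)
    for w, (_, label) in zip(weights, neighbors):
        scores[label] += w                    # each neighbor votes with its weight
    return max(scores, key=scores.get), weights

# Hypothetical one-dimensional records.
training = [((0.0,), "A"), ((3.0,), "B"), ((3.1,), "B"), ((10.0,), "B")]
label, weights = weighted_knn_predict(training, (0.5,), k=3)
```

In this made-up case a plain majority vote over the three nearest records would pick 'B' (two votes to one), while the weighted vote picks 'A' because the single 'A' neighbor is much closer.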

 Correlation similarity: (for numeric attributes)
 The correlation between two data points X and Y is the measure of the linear
relationship between the attributes X and Y.
 Pearson correlation between two data points X and Y:
\[ r_{XY} = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i}(x_i - \bar{x})^2}\,\sqrt{\sum_{i}(y_i - \bar{y})^2}} \]
 The value ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation).
 0: no correlation between X and Y (no linear relationship, but there may be a quadratic or other
higher-degree relationship)
 Example: X (1,2,3,4,5) and Y (10,15,35,40,55) have a Pearson correlation of 0.98.
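The 0.98 figure for X (1,2,3,4,5) and Y (10,15,35,40,55) can be reproduced with the standard Pearson formula. The function name `pearson` is an assumption of this sketch.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

r = pearson([1, 2, 3, 4, 5], [10, 15, 35, 40, 55])   # about 0.98
```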

 Simple matching coefficient (similarity indexes): (for binary attributes)
 q, d: 2 instances; |q|: total features in dataset;
 CP (q,d): total number of co-presences between q and d
 CA (q,d): total number of co-absences between q and d
 PA (q,d): total number of presences in q with an absence in d
 AP (q,d): total number of absences in q with a presence in d
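Using the four counts defined above, two standard indexes are the simple matching coefficient, (CP + CA) / |q|, and the Jaccard index, CP / (CP + PA + AP), which ignores co-absences. The sketch below uses these textbook definitions; the binary vectors are hypothetical, and returning 0.0 for an empty Jaccard denominator is a convention chosen here.

```python
def co_counts(q, d):
    """Return (CP, CA, PA, AP) for two equal-length binary vectors."""
    cp = sum(1 for a, b in zip(q, d) if a and b)           # co-presences
    ca = sum(1 for a, b in zip(q, d) if not a and not b)   # co-absences
    pa = sum(1 for a, b in zip(q, d) if a and not b)       # present in q only
    ap = sum(1 for a, b in zip(q, d) if not a and b)       # present in d only
    return cp, ca, pa, ap

def simple_matching(q, d):
    cp, ca, _, _ = co_counts(q, d)
    return (cp + ca) / len(q)

def jaccard(q, d):
    cp, _, pa, ap = co_counts(q, d)
    denom = cp + pa + ap
    return cp / denom if denom else 0.0   # convention for two all-zero vectors

# Hypothetical binary instances.
q = [1, 0, 1, 0, 0]
d = [1, 0, 0, 0, 1]
smc = simple_matching(q, d)   # (1 + 2) / 5
jac = jaccard(q, d)           # 1 / (1 + 1 + 1)
```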

More on Jaccard similarity

 If X and Y represent two text documents, each word will be an attribute in a
dataset called a term document matrix or document vector.
 Each record in the document dataset corresponds to a separate document or a text
blob.
 Although the number of attributes would be very large, often in the thousands, most
of the attribute values will be zero.
 This means that any two documents rarely contain the same rare words.
 In this instance, what is interesting is the co-occurrence of the same word in both
documents; the shared non-occurrence of a word does not convey any information and can
be ignored.


[Figure: three panels. (1) q is more similar to d1 than d2; (2) q is equally similar to d1 and d2; (3) q is more similar to d2 than d1]

 Cosine similarity: (for numeric attributes)

 Cosine similarity between two instances is the cosine of the inner angle between the two
vectors that extend from the origin to each instance
d1 = (SMS = 97; VOICE = 21)

d2 = (SMS = 181; VOICE = 184)
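For the two instances above, the cosine of the angle between the vectors can be computed directly from the dot product and the vector lengths. This is a minimal sketch; `cosine_similarity` is an assumed helper name.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

d1 = (97, 21)     # (SMS, VOICE)
d2 = (181, 184)   # (SMS, VOICE)
sim = cosine_similarity(d1, d2)   # about 0.836
```

Because only the angle matters, scaling either vector (for example, doubling every count) leaves the similarity unchanged.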



More on cosine similarity

 Continuing with the example of the document vectors, where attributes represent either the
presence or absence of a word.
 It is possible to construct a more informational vector with the number of occurrences in the
document, instead of just 1 and 0.
 Document datasets are usually long vectors with thousands of variables or attributes.
 The cosine similarity measure is one of the most used similarity measures, but the
determination of the optimal measure comes down to the data structures.
 The choice of distance or similarity measure can also be parameterized, where multiple
models are created with each different measure.
 The model with a distance measure that best fits the data with the smallest generalization
error can be the appropriate proximity measure for the data.

Models – Data normalization
The marketing department wants to decide whether or not they should contact
a customer with the following profile: (SALARY = 56,000; AGE = 35)

 The instances are labelled with their IDs; triangles/crosses represent the negative/positive ones.
 The location of the query (SALARY = 56,000; AGE = 35) is indicated by the ‘?’.

 This odd prediction is caused by features taking different ranges of values, which is
equivalent to features having different variances.
 We can adjust for this using normalization.
 Normalizing the data is an important thing to do for almost all machine
learning algorithms, not just nearest neighbor.
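The effect of unequal ranges can be sketched with range (min-max) normalization, which is one common choice and an assumption here, since the slide's own formula is not shown. The query is the one from the slides; the four customers are hypothetical, and the sketch assumes every feature takes at least two distinct values.

```python
import math

# Hypothetical customers: (SALARY, AGE). Not the data from the slides.
customers = {"A": (57000, 70), "B": (52000, 36), "C": (30000, 22), "D": (80000, 62)}
query = (56000, 35)

def nearest(points, q):
    """Name of the point with the smallest Euclidean distance to q."""
    return min(points, key=lambda name: math.dist(points[name], q))

def min_max_scaler(points):
    """Build a function that rescales each feature to [0, 1] using the training ranges."""
    columns = list(zip(*points.values()))
    bounds = [(min(c), max(c)) for c in columns]
    def scale(p):
        return tuple((v - lo) / (hi - lo) for v, (lo, hi) in zip(p, bounds))
    return scale

scale = min_max_scaler(customers)
scaled = {name: scale(p) for name, p in customers.items()}

raw_nearest = nearest(customers, query)            # SALARY dominates the distance
scaled_nearest = nearest(scaled, scale(query))     # both features contribute
```

On the raw values the 70-year-old customer A is "nearest" purely because of salary; after rescaling, customer B, who is close in both salary and age, becomes the nearest neighbor.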

How to Implement
Step 1. Data preparation
 Iris dataset with 150 examples and 4 numeric attributes.
 All attributes need to be normalized using the Normalize operator, with Z-transformation
(most commonly used).
 The dataset is then split into two equal, mutually exclusive datasets (i.e., training and test) using
the Split Data operator (from Data Transformation/Filtering/Sampling), with
shuffled sampling.
 One half of the dataset is used as training data for developing the k-NN model and
the other half of the dataset is used to test the validity of the model

Step 2. Modelling
 k-NN operator: available in Modeling/Classification/Lazy Modeling
 Parameter: k=3
 Apply Model operator accepts the test data and applies k-NN model to predict the class type of
the species.
 Performance operator is then used to compare the predicted class with the labeled class for all
of the test records.
Step 3. Execution and interpretation
 k-NN model: just the set of training records
 Performance vector: the confusion matrix with correct and incorrect predictions for all of the
test dataset
 Labeled test dataset: The prediction can be examined at the record level.
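What the performance vector summarizes can be sketched in plain Python: counts of (actual, predicted) pairs and the overall accuracy. The labels below are hypothetical and are not results from the Iris process described above; the function names are assumptions.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) class pair occurs."""
    return Counter(zip(actual, predicted))

def accuracy(actual, predicted):
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

# Hypothetical labels for six test records.
actual    = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
predicted = ["setosa", "setosa", "versicolor", "virginica",  "virginica", "virginica"]

cm = confusion_matrix(actual, predicted)
acc = accuracy(actual, predicted)
```

Entries on the "diagonal" (same actual and predicted class) are correct predictions; every other pair is a misclassification that can be inspected at the record level.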



Summary
 The k-NN model requires normalization to avoid bias from any attribute that is measured on a
much larger or smaller scale than the others.
 The model is quite robust when there are any missing attribute values in the test record
 The entire attribute is ignored, and the model can still function with reasonable accuracy
 If sepal length of a test record is not known, k-NN becomes a 3-dimensional model instead of the original
4 dimensions.
 k-NN models can handle categorical inputs but the distance measure will be either 1 or 0.
 Ordinal values can be converted to integers so that one can better leverage the distance function.
 Because k-NN is a lazy learner, it cannot explain the relationship between input and output: the
model is just a memorized set of all training records. There is no generalization or abstraction of the
relationship.
 It is still quite an effective model and leverages the existing relationships between attributes and class
label in the training records.

Summary
 Eager learners are better at explaining the relationship and providing a model description.
 Model building in k-NN is just memorizing and does not require much time.
 But when a new unlabeled record is to be classified, the algorithm needs to find the distance
between the unseen record and all the training records.
 This process can get expensive, depending on the size of training set and the number of
attributes.
 A few sophisticated k-NN implementations index the records so that it is easy to search and
calculate the distance.
 One can also convert the actual numbers to ranges so as to make it easy to index and compare it
against the test record.
 However, k-NN is difficult to use in time-sensitive applications like serving an online
advertisement or real-time fraud detection.

Models – Naïve Bayesian
 The naïve Bayesian algorithm is built on Bayes’ theorem, which is one of the most influential
and important concepts in statistics and probability theory.
 It provides a mathematical expression for how a degree of subjective belief changes to account
for new evidence.
 Assume X is the evidence (attribute set) (X = {X1, X2, X3, . . ., Xn}), where Xi is an individual
attribute, such as credit rating; and Y is the outcome (class label).
 The probability of outcome P(Y): prior probability, which can be calculated from the
training dataset.
 Prior probability shows the likelihood of an outcome in a given dataset.
 P(Y|X) is called the conditional probability (or, posterior probability), which provides the
probability of an outcome given the evidence, that is, when the value of X is known.
 This is the likelihood of an outcome as the conditions are learnt.

 Bayes’ theorem:
\[ P(Y|X) = \frac{P(Y)\,P(X|Y)}{P(X)} \]
 P(X|Y): the class conditional probability, which is the probability of the existence of
conditions given an outcome.
 Like P(Y), P(X|Y) can be calculated from the training dataset as well.
 P(X): the probability of the evidence.
 To classify a new record, one can compute P(Y|X) for each class of Y and see which
probability “wins.”
 Class label Y with the highest value of P(Y|X) wins for given condition X.
 Since P(X) is the same for every class value of the outcome, one does not have to calculate
this and it can be assumed as a constant.

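 For an example set with n attributes, X = {X1, X2, X3, . . ., Xn}, and assuming the attributes are
independent of each other given the class:
\[ P(Y|X) = \frac{P(Y) \prod_{i=1}^{n} P(X_i|Y)}{P(X)} \]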
 Since P(X) is constant for every value of Y, it is enough to calculate the numerator
of the equation for every class value.

• weather condition: the evidence
• decision to play or not play: the belief
Categorical data type is being used for ease of explanation (temperature and humidity have
been converted from the numeric type)
Step 1: Calculating Prior Probability P(Y)

 There are two possible outcomes: (Play = yes and Play = no), with the 14 examples in total.
P(Y = no) = 5/14 ; P(Y = yes) = 9/14
 The class-stratified sampling of data from the population will be ideal for naïve Bayesian
modeling.
 The class-stratified sampling ensures the class distribution in the sample is the same as the
population
Step 2: Calculating Class Conditional Probability P(Xi|Y)
 Class conditional probability is the probability of each attribute value for an attribute, for each
outcome value.
 This calculation is repeated for all the attributes: Temperature (X1), Humidity (X2),
Outlook(X3), and Wind (X4), and for every distinct outcome value.


 The outcome class can be predicted based on Bayes’ theorem by calculating
the posterior probability P(Y|X) for both values of Y.
 Once P(Y=yes|X) and P(Y=no|X) are calculated, one can determine which outcome
has higher probability and the predicted outcome is the one that has the highest
probability.
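Steps 1 and 2 and the final comparison can be sketched as follows. The 14 records are the widely used nominal "weather/golf" data, which is an assumption here: the category names and conversions in the slide's own table may differ, although the priors (9/14 and 5/14) agree. The test record and helper names are illustrative.

```python
from collections import Counter, defaultdict

# Assumed data: (Outlook, Temperature, Humidity, Windy, Play).
DATA = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]

def train(rows):
    """Count classes (for priors) and attribute values per class (for P(Xi|Y))."""
    class_counts = Counter(row[-1] for row in rows)
    cond_counts = defaultdict(Counter)   # (attribute index, class) -> value counts
    for *features, label in rows:
        for i, value in enumerate(features):
            cond_counts[(i, label)][value] += 1
    return class_counts, cond_counts

def scores(record, class_counts, cond_counts):
    """Numerator of Bayes' theorem, P(Y) * product of P(Xi|Y), for each class."""
    total = sum(class_counts.values())
    result = {}
    for label, n in class_counts.items():
        score = n / total                                   # prior P(Y)
        for i, value in enumerate(record):
            score *= cond_counts[(i, label)][value] / n     # P(Xi | Y)
        result[label] = score
    return result

class_counts, cond_counts = train(DATA)
s = scores(("sunny", "cool", "high", "true"), class_counts, cond_counts)
prediction = max(s, key=s.get)
```

For this illustrative record the 'no' score is larger than the 'yes' score, so the sketch predicts Play = no. No Laplace correction is applied here, so an unseen attribute value would force a score of zero.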

Both estimates can be normalized by dividing each by their sum (0.0094 + 0.0274) to get:
P(Y=yes|X) = 0.0094 / (0.0094 + 0.0274) ≈ 0.26
P(Y=no|X) = 0.0274 / (0.0094 + 0.0274) ≈ 0.74
P(Y=yes|X) < P(Y=no|X)

 The prediction for the unlabeled test record will be Play=no.
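The normalization of the two estimates is a one-line calculation. The sketch below uses the values 0.0094 and 0.0274 from this example; the helper name `normalize` is an assumption.

```python
def normalize(scores):
    """Rescale unnormalized class scores so that they sum to 1."""
    total = sum(scores.values())
    return {label: value / total for label, value in scores.items()}

unnormalized = {"yes": 0.0094, "no": 0.0274}   # numerators from the worked example
posterior = normalize(unnormalized)            # roughly 0.26 for yes, 0.74 for no
prediction = max(posterior, key=posterior.get)
```

Normalizing does not change which class wins; it only turns the scores into probabilities that can be read as confidences.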
How to implement
Step 1: Data Preparation
 The Golf dataset (14 records for training) and a Golf-Test dataset (14 records for testing)
 Since the Bayes operator accepts numeric and nominal data types, no other data transformation
process is necessary.
Step 2: Modeling Operator and Parameters
 The Naïve Bayesian operator has only one parameter option to set: whether or not to include
Laplace correction (included by default)
 For smaller datasets, Laplace correction is strongly encouraged, as a dataset may not have all
combinations of attribute values for every class value.
 Outputs of the operator: the model and original training dataset.
 The model output should be connected to Apply Model to execute the model on the test dataset.
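Why Laplace correction matters on small datasets can be sketched with standard add-one smoothing. This is an assumption for illustration: the exact correction applied by the operator may differ, and the counts used below are hypothetical.

```python
def conditional_probability(count, class_total, n_values, laplace=True):
    """Estimate P(Xi = v | Y) from counts, optionally with add-one smoothing.

    count:       records of class Y where attribute Xi has value v
    class_total: records of class Y
    n_values:    number of distinct values attribute Xi can take
    """
    if laplace:
        return (count + 1) / (class_total + n_values)
    return count / class_total

# Hypothetical case: a value never seen with a class (0 of 5 records, 3 possible values).
without = conditional_probability(0, 5, 3, laplace=False)   # 0.0
with_lc = conditional_probability(0, 5, 3, laplace=True)    # 1 / 8
```

Without the correction a single zero wipes out the whole product of class conditional probabilities, regardless of how strongly the other attributes support that class.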

Step 3: Evaluation
 The output of the Apply Model operator is the labeled test dataset and the model.
 The labeled test dataset is then connected to the Performance-Classification operator
(Evaluation/Performance Measurement) to evaluate the performance of the classification model.

Step 4: Execution and Interpretation

 The process has 3 result outputs: a model description, performance vector, and labeled dataset.
 The labeled dataset contains the test dataset with the predicted class as an added column, with
the confidence for each label class, which indicates the prediction strength.
 The model description result contains more information on class conditional probabilities of all
the input attributes, derived from the training dataset.
 The Charts tab in model description contains probability density functions for the attributes
 In the case of continuous attributes, the decision boundaries can be discerned across the
different class labels for the Humidity attribute. It has been observed that when Humidity
exceeds 82, the likelihood of Play = no increases.
 The Distribution Table shown in Fig. 4.37 contains the familiar class conditional probability
table similar to Tables 4.5 and 4.6.


Summary
 The Bayesian algorithm provides a probabilistic framework for a classification problem. It has
a simple and sound foundation for modeling data and is quite robust to outliers and missing
values.
 If a test example is missing a value (suppose temperature is not available), the Bayesian
model simply omits the corresponding class conditional probability for all the outcomes.
 Having missing values in the test set would be difficult to handle in decision trees and regression
algorithms, particularly when the missing attribute is used higher up in the node of the decision
tree or has more weight in regression.
 One major limitation of the model is the assumption of independent attributes, which can be
mitigated by advanced modeling or through preprocessing to decrease attribute dependence.

Summary
 This algorithm is deployed widely in text mining and document classification where the
application has a large set of attributes and attributes values to compute.
 The naïve Bayesian classifier is often a great place to start for a data science project as it
serves as a good benchmark for comparison to other models.
 The uniqueness of the technique is that it leverages new information as it arrives and tries
to make a best prediction considering new evidence.

Models – Artificial neural networks
 The artificial neural network (ANN) is a computational and mathematical model inspired by the
biological nervous system.
 ANNs model more complex nonlinear relationships of data and learn through adaptive
adjustments of weights between the nodes.
 The 1st layer of nodes closest to the input is called the input layer/nodes.
 The last layer of nodes is called the output layer/nodes.
Perceptron:
- the most simplistic form of an ANN
- one input and one output layer
- feed-forward ANN where the input moves in one
direction and there are no loops in the topology

 There can be multiple layers, hidden layers, apart from the input and output layer, to model
nonlinear, complicated relationships between input and output variables.
 The output layer performs an activation function consisting of a combination of an aggregation
function, usually summation, and a transfer function.
 A hidden layer connects input from previous layers and also applies an activation function.
 The transfer function scales the output into the desired range; commonly used transfer functions are
sigmoid, normal bell curve, logistic, hyperbolic, or linear functions.
 sigmoid and bell curves: provide a linear transformation for a particular range of values and a nonlinear
transformation for the rest of the values.
 Because of the transfer function and the presence of multiple hidden layers, one can model or
closely approximate almost any mathematical continuous relationship between input variables
and output variables.
 A multilayer ANN is called a universal approximator.
 Searching for an optimal solution, however, is time consuming.
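The split of an activation function into aggregation and transfer can be sketched for a single node. The inputs, weights, and the name `neuron` are hypothetical; the sigmoid used here is the standard logistic function.

```python
import math

def sigmoid(z):
    """Logistic transfer function: squashes any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, transfer=sigmoid):
    """One node: aggregate by weighted summation, then apply the transfer function."""
    z = bias + sum(x * w for x, w in zip(inputs, weights))   # aggregation
    return transfer(z)                                       # transfer

# Hypothetical inputs and weights; the weighted sum is 1*0.5 + 2*(-0.25) = 0.
out_sigmoid = neuron([1.0, 2.0], [0.5, -0.25], 0.0)
out_linear = neuron([1.0, 2.0], [0.5, -0.25], 0.0, transfer=lambda z: z)
out_tanh = neuron([1.0, 2.0], [0.5, -0.25], 0.0, transfer=math.tanh)
```

Near zero the sigmoid behaves almost linearly, while large positive or negative sums are flattened toward 1 or 0; this is the nonlinearity that stacked hidden layers exploit.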

 For a categorical label problem, the ANN provides output for each class type.
 A winning class type is picked based on the maximum value of the output class label.
 One can use a topology with multiple hidden layers and even with looping, where the output
of one layer is used as input for preceding layers.

How it works
 An ANN learns the relationship between input attributes and the output class label
through a technique called back propagation.
 The key training task is to find the weights of the links.
 The model uses every training record to estimate the error between the predicted and
the actual output.
 Then the model uses the error to adjust the weights to minimize the error for the next
training record and this step is repeated until the error falls within the acceptable
range.

The steps in developing an ANN from a training dataset include:

 Step 1: Determine the Topology and Activation Function (see left)
 Step 2: Initiation (see right)
 Assuming X1 = X2 = X3 = 1; Y = 15
 Initial weights: w2 = 2; w3 = 3; w4 = 4


 Step 3: Calculating error:
 The predicted output Y’ is (1 + 1*2 + 1*3 + 1*4) = 10
 Error (e) = Y − Y’ = 15 − 10 = 5
 Step 4: Weight adjustment (w’ = w + λ * e)
 The most important part of learning in an ANN.
 The error (e) calculated in the previous step is passed back from the output node to all other nodes
(reverse direction).
 The fraction (λ): learning rate, ranging from 0 to 1.
A value close to 1: a drastic change to the model for each training record
A value close to 0: smaller changes, and, hence, less correction
 The choice of λ can be tricky in the ANN implementation.
 Some model processes start with λ close to 1 and reduce the value of λ while training each cycle.
 By this approach, any outlier records later in the training cycle will not degrade the relevance of the model

 Step 4: Weight adjustment (w’ = w + λ * e)
 Assuming: λ = 0.5
 New weight: w’2 = 2 + 0.5 * 5/3 = 2.83
(divide by 3: error is back propagated to 3 links from the output node)
 Similarly, the weight will be adjusted for all the links.
 In the next cycle, a new error will be computed for the next training record.
 This cycle goes on until all the training records are processed.
 The same training example can be repeated until the error is less than a threshold.
 In reality, there will be multiple hidden layers and multiple output links, one for each nominal
class value.
 Because of the numeric calculations, an ANN model works well with numeric inputs and outputs.
 If the input contains a nominal attribute, a pre-processing step should be included to convert the
nominal attribute into multiple numeric attributes, one for each attribute value.
 This specific preprocessing increases the number of input links for an ANN in the case of nominal
attributes and, thus, increases the necessary computing resources.
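The worked numbers from Steps 2 to 4 can be reproduced with a short sketch. It follows the simplified update rule used in this example (the error is shared equally across the 3 links), not a full gradient-based backpropagation; treating the leading 1 in the sum as a fixed bias term, and applying the same update to w3 and w4, are assumptions.

```python
def predict(inputs, weights, bias):
    """Linear output node: bias plus the weighted sum of the inputs."""
    return bias + sum(x * w for x, w in zip(inputs, weights))

inputs = [1.0, 1.0, 1.0]     # X1, X2, X3
weights = [2.0, 3.0, 4.0]    # w2, w3, w4
bias = 1.0                   # the constant 1 in the predicted-output sum (assumed fixed)
target = 15.0                # Y
learning_rate = 0.5          # lambda

predicted = predict(inputs, weights, bias)    # 1 + 2 + 3 + 4 = 10
error = target - predicted                    # 15 - 10 = 5

# The error is split equally across the 3 links feeding the output node.
share = error / len(weights)
new_weights = [w + learning_rate * share for w in weights]   # first entry is about 2.83

new_predicted = predict(inputs, new_weights, bias)
```

After one update the prediction moves from 10 toward the target of 15; repeating the cycle over the training records keeps shrinking the error as long as the learning rate is not too large.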
How to implement
Step 1: Data preparation
 Iris dataset
 The ANN operator only works on numeric data types, so categorical/nominal attributes require data transformation.
 Using Rename operator to name 4 attributes of Iris dataset
 Using Split Data operator to split 150 Iris records equally into training and test data
Step 2: Modelling operator and parameters
 Training dataset is connected to the Neural Net operator that accepts numeric data and normalizes the values
 The output of Neural Net can be connected to the Apply Model operator that also gets an input dataset from
the Split Data operator for the test dataset.
 The outputs of Apply Model are the labeled dataset and the ANN model.
Step 3: Evaluation
 The labeled dataset after using Apply Model is then connected to the Performance-Classification operator.

Step 4: Execution and interpretation


Summary
 Neural network models require stringent input constraints and preprocessing.
 An ANN does not handle categorical input data. If the data has nominal values, they need to be
converted to binary or real values. This means one input attribute expands into multiple input
attributes and substantially increases the number of nodes, links, and overall complexity.
 However, having redundant correlated attributes is not going to be a problem in an ANN model.
 If the test example has missing attribute values, the model cannot function (as with a regression or
decision tree model).
 The missing values can be replaced with average values or any default values to mitigate the constraint.
 The relationship between input and output cannot be explained clearly by an ANN.
 Since there are hidden layers, it is quite complex to understand the model.
 Building a good ANN model with optimized parameters takes time.
 It depends on the number of training records and iterations.
 There are no consistent guidelines on the number of hidden layers and nodes within each hidden layer.
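The conversion of a nominal attribute into binary inputs mentioned in this summary can be sketched as one-hot encoding. The attribute values and the function name `one_hot` are hypothetical.

```python
def one_hot(values):
    """Encode a nominal attribute as one binary column per distinct value."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded

# Hypothetical nominal attribute with three distinct values.
categories, encoded = one_hot(["sunny", "overcast", "rain", "sunny"])
```

One attribute with three values becomes three input nodes, each with its own links to the next layer, which is where the extra model size comes from.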

Summary
 However, once a model is built, it is straightforward to implement, and a new unseen record
gets classified quite fast.
 Because model building is by incremental error correction, ANN can yield local optima as the
final model.
 The rapid classification of test examples makes an ANN quite useful for anomaly detection and
classification problems.
 An ANN is commonly used in fraud detection, a scoring situation where the relationship
between inputs and output is nonlinear.
 If there is a need for something that handles a highly nonlinear landscape along with fast real-
time performance, then the ANN fits the bill.

THE END
