Professional Documents
Culture Documents
Classification (2):
k-nearest neighbors, naïve Bayesian and
artificial neural networks
4. Classification
Assoc. Prof Nguyen Manh Tuan
AGENDA
k-nearest neighbors
Naïve Bayesian
Artificial neural networks
Decision trees and rule induction: generalizing the relationship within a dataset and
using it to predict the outcome of new unseen data.
eager learners
to find the best approximation of the actual relationship between the input and target variables
The alternative class of learners adopts a blunt approach, where no “learning” is
performed from the training dataset
rather the training dataset is used as a lookup table to match the input variables and find the
outcome.
These approaches that memorize the training set are called lazy learners.
The underlying idea here is somewhat similar to the old adage, “birds of a
feather flock together.”
Similar records congregate in a neighborhood in n-dimensional space, with the same target class
labels.
The entire training dataset is “memorized” and when unlabeled example records need
to be classified, the input attributes of the new unlabeled records are compared
against the entire training set to find the closest match.
The class label of the closest training record is the predicted class label for the
unseen test record.
The closest training record needs to be found for each test record.
Even though no mathematical generalization or rule generation is involved, finding
the closet training record for a new unlabeled record can be a tricky problem to solve,
particularly when there is no exact match of training data available for a given test
data record.
The year is 1798 and you are Lieutenant-Colonel David Collins of HMS Calcutta
who is exploring the region around Hawkesbury River, in New South Wales.
After an expedition up the river one of the men tells you that he saw a strange animal
near the river.
You ask him to describe the animal to you and he explains that he didn’t see it very
well but that he did notice that it had webbed feet and a duck-bill snout, and that it
growled at him.
In order to plan the expedition for the next day you decide that you need to classify
the animal so that you can figure out whether it is dangerous to approach it or not.
Matching animals you remember to the features of the unknown animal described by
the sailor.
Triangles: ’Non-draft’
Crosses: ‘Draft’
The Euclidean and Manhattan distances are special cases of Minkowski distance
which is defined as (a and b in a m-feature space)
p=1: Manhattan (binary)
p=2: Euclidean (numeric; or good start)
p=∞: Chebyshev
p: norm
p-norm distance internal use
10/11/2022
Models – k-nearest neighbors
Distances (Dist.) between the query instance (SPEED=6.75 and AGILITY = 3.00) and
each other instance:
Returning to 1798 and HMS Calcutta, the next day you accompany your men on the expedition
up the river and you encounter the strange animal the sailor had described to you.
This time when you see the animal yourself you realize that it definitely isn’t a duck!
It turns out that you and your men are the first Europeans to encounter a platypus.
Two important, yet related aspects of supervised machine learning:
learning is based on the stationarity assumption which states
that the data doesn’t change over time.
for classification, learning creates models that distinguish
between the classes that are present in the dataset they are
induced from.
When classification model is trained to distinguish between
A duck-billed platypus lions, frogs and ducks, the model will classify a query as being
either a lion, a frog or a duck; even if the query is actually a
platypus.
The year is 1798 and you are Lieutenant-Colonel David Collins of HMS Calcutta
who is exploring the region around Hawkesbury River, in New South Wales.
After an expedition up the river one of the men tells you that he saw a strange animal
near the river.
You ask him to describe the animal to you and he explains that he didn’t see it very
well but that he did notice that it had webbed feet and a duck-bill snout, and that it
growled at him.
In order to plan the expedition for the next day you decide that you need to classify
the animal so that you can figure out whether it is dangerous to approach it or not.
Matching animals you remember to the features of the unknown animal described by
the sailor.
But what if the closest training record is an outlier with the incorrect class in the training set?
Then, all the unlabeled test records near the outlier will get wrongly classified.
To prevent this misclassification, the value of k can be increased to, say, 3. When k=3, the
nearest 3 training records are considered instead of 1.
k=1 k=3
Once the nearest k neighbors are determined, the process of determining the
predicted target class is straightforward.
The predicted target class is the majority class of the nearest k neighbors.
y’ = majority class (y1, y2, …yk)
where y’ is the predicted target class of the test data point and yi is the class of ith
neighbor ni
When k > 1, it can be argued that the closest neighbors should have more say in the
outcome of the predicted target class than the farther neighbors.
The farther away neighbors should have less influence in determining the final
class outcome.
This can be accomplished by assigned weights for all the neighbors, with the weights increasing as
the neighbors get closer to the test data point.
The weights are included in the final multi-voting step, where the predicted class is
calculated.
Weights (wi) should satisfy two conditions:
should be proportional to the distance of the test data point from the neighbor, and
the sum of all weights should be equal to 1.
One of the calculations for weights follows an exponential decay based on distance:
wi the weight of ith neighbor ni,
k: total number of neighbors,
x: the test data point
The correlation between two data points X and Y is the measure of the linear
relationship between the attributes X and Y.
Pearson correlation between two data points X and Y:
-1 (perfect negative correlation) to +1 (perfect positive correlation)
0: no correlation between X and Y (no linear relationship, but there may be a quadratic or any other
higher degree relationship)
The cosine similarity measure is one of the most used similarity measures, but the
determination of the optimal measure comes down to the data structures.
The choice of distance or similarity measure can also be parameterized, where multiple
models are created with each different measure.
The model with a distance measure that best fits the data with the smallest generalization
error can be the appropriate proximity measure for the data.
The marketing department wants to decide whether or not they should contact
a customer with the following profile: (SALARY = 56; 000; AGE = 35)
The instances are labelled their IDs, triangles/crosses represent the negative/positive ones.
The location of the query (SALARY = 56; 000; AGE = 35) is indicated by the ‘?’
This odd prediction is caused by features taking different ranges of values, this is
equivalent to features having different variances.
We can adjust for this using normalization, using the formula
How to Implement
Step 1. Data preparation
Iris dataset with 150 examples and 4 numeric attributes.
All attributes need to be normalized using the Normalize operator, with Z-
transformation (most commonly used)
The dataset is then split into two equal exclusive datasets (ie. training and test) using
the Split Data operator (from Data Transformation/Filtering/Sampling), with
shuffled sampling.
One half of the dataset is used as training data for developing the k-NN model and
the other half of the dataset is used to test the validity of the model
Step 2. Modelling
k-NN operator: available in Modeling/Classification/Lazy Modeling
Parameter: k=3
Apply Model operator accepts the test data and applies k-NN model to predict the class type of
the species.
Performance operator is then used to compare the predicted class with the labeled class for all
of the test records.
Step 3. Execution and interpretation
k-NN model: just the set of training records
Performance vector: the confusion matrix with correct and incorrect predictions for all of the
test dataset
Labeled test dataset: The prediction can be examined at the record level.
Summary
The k-NN model requires normalization to avoid bias by any attribute that has large/small
units in the scale.
The model is quite robust when there are any missing attribute values in the test record
The entire attribute is ignored, and the model can still function with reasonable accuracy
If sepal length of a test record is not known, k-NN becomes a 3-dimensional model instead of the original
4 dimensions.
k-NN models can handle categorical inputs but the distance measure will be either 1 or 0.
Ordinal values can be converted to integers so that one can better leverage the distance function.
As a lazy learner, the relationship between input and output cannot be explained, as the model is
just a memorized set of all training records. There is no generalization or abstraction of the
relationship.
It is still quite an effective model and leverages the existing relationships between attributes and class
label in the training records.
Summary
Eager learners are better at explaining the relationship and providing a model description.
Model building in k-NN is just memorizing and does not require much time.
But when a new unlabeled record is to be classified, the algorithm needs to find the distance
between the unseen record and all the training records.
This process can get expensive, depending on the size of training set and the number of
attributes.
A few sophisticated k-NN implementations index the records so that it is easy to search and
calculate the distance.
One can also convert the actual numbers to ranges so as to make it easy to index and compare it
against the test record.
However, k-NN is difficult to be used in time sensitive applications like serving an online
advertisement or real-time fraud detection.
The naïve Bayesian algorithm is built on the Bayes’ theorem that is one of the most influential
and important concepts in statistics and probability theory.
It provides a mathematical expression for how a degree of subjective belief changes to account
for new evidence.
Assume X is the evidence (attribute set) (X = {X1, X2, X3, . . ., Xn}), where Xi is an individual
attribute, such as credit rating; and Y is the outcome (class label).
The probability of outcome P(Y): prior probability, which can be calculated from the
training dataset.
Prior probability shows the likelihood of an outcome in a given dataset.
P(Y|X) is called the conditional probability (or, posterior probability), which provides the
probability of an outcome given the evidence, that is, when the value of X is known.
This is the likelihood of an outcome as the conditions are learnt.
P(X|Y): the class conditional probability, which is the probability of the existence of
conditions given an outcome.
Like P(Y), P(X|Y) can be calculated from the training dataset as well.
P(X): the probability of the evidence.
To classify a new record, one can compute P(Y|X) for each class of Y and see which
probability “wins.”
Class label Y with the highest value of P(Y|X) wins for given condition X.
Since P(X) is the same for every class value of the outcome, one does not have to calculate
this and it can be assumed as a constant.
Since P(X) is constant for every value of Y, it is enough to calculate the numerator
of the equation for every class value.
• weather condition: the evidence Categorical data type is being used for ease of
• decision to play or not play: the belief explanation (temperature and humidity have
10/11/2022 internal use been converted from the numeric type)
Models – Naïve Bayesian
0.0274
How to implement
Step 1: Data Preparation
The Golf dataset (14 records for training) and a Golf-Test dataset (14 records for testing)
Since the Bayes operator accepts numeric and nominal data types, no other data transformation
process is necessary.
Step 2: Modeling Operator and Parameters
The Naïve Bayesian operator has only one parameter option to set: whether or not to include
Laplace correction (included by default)
For smaller datasets, Laplace correction is strongly encouraged, as a dataset may not have all
combinations of attribute values for every class value.
Outputs of the operator: the model and original training dataset.
The model output should be connected to Apply Model to execute the model on the test dataset.
Step 3: Evaluation
The output of the Apply Model operator is the labeled test dataset and the model.
The labeled test dataset is then connected to the Performance-Classification operator
(Evaluation/Performance Measurement) to evaluate the performance of the classification model.
Summary
The Bayesian algorithm provides a probabilistic framework for a classification problem. It has
a simple and sound foundation for modeling data and quite robust to outliers and missing
values.
If the test example set does not contain a value, suppose temperature is not available, the Bayesian
model simply omits the corresponding class conditional probability for all the outcomes.
Having missing values in the test set would be difficult to handle in decision trees and regression
algorithms, particularly when the missing attribute is used higher up in the node of the decision
tree or has more weight in regression.
One major limitation of the model is the assumption of independent attributes, which can be
mitigated by advanced modeling or through preprocessing to decrease attribute dependence.
Summary
This algorithm is deployed widely in text mining and document classification where the
application has a large set of attributes and attributes values to compute.
The naïve Bayesian classifier is often a great place to start for a data science project as it
serves as a good benchmark for comparison to other models.
The uniqueness of the technique is that it leverages new information as it arrives and tries
to make a best prediction considering new evidence.
The artificial neural network (ANN) is a computational and mathematical model inspired by the
biological nervous system.
ANNs model more complex nonlinear relationships of data and learn though adaptive
adjustments of weights between the nodes.
The 1st layer of nodes closest to the input is called the input layer/nodes.
The last layer of nodes is called the output layer/nodes.
Perception:
- the most simplistic form of an ANN
- one input and one output layer
- feed-forward ANN where the input moves in one
direction and there are no loops in the topology
There can be multiple layers, hidden layers, apart from the input and output layer, to model
nonlinear, complicated relationships between input and output variables.
The output layer performs an activation function consisting of a combination of an aggregation
function, usually summarization, and a transfer function.
A hidden layer connects input from previous layers, and also, applies an activation function.
The transfer function, which scales the output into the desired range, commonly used are:
sigmoid, normal bell curve, logistic, hyperbolic, or linear functions.
sigmoid and bell curves: to provide a linear transformation for a particular range of values and a nonlinear
transformation for the rest of the values.
Because of the transfer function and the presence of multiple hidden layers, one can model or
closely approximate almost any mathematical continuous relationship between input variables
and output variables.
A multilayer ANN is called a universal approximator.
To search for an optimal solution: time consuming
One can use a topology with multiple hidden layers and even
with looping where the output of one layer is used as input for preceding layers.
How it works
An ANN learns the relationship between input attributes and the output class label
through a technique called back propagation.
The key training task is to find the weights of the links.
The model uses every training record to estimate the error between the predicted and
the actual output.
Then the model uses the error to adjust the weights to minimize the error for the next
training record and this step is repeated until the error falls within the acceptable
range.
Step 4: Weight adjustment (w’ = w + λ * e) Similarly, the weight will be adjusted for all the links.
In the next cycle, a new error will be computed for the
next training record.
This cycle goes on until all the training records are
processed.
The same training example can be repeated until the
error is less than a threshold.
In reality, there will be multiple hidden layers and
multiple output links – one for each nominal class
value.
Because of the numeric calculations, an ANN model
works well with numeric inputs and outputs.
If the input contains a nominal attribute, a pre-
processing step should be included to convert the
Assuming: λ = 0.5 nominal attribute into multiple numeric attributes—one
New weight: w’2 = 2 + 0.5 * 5/3 = 2.83 for each attribute value.
(divide by 3: error is back propagated to 3 links from the This specific preprocessing increases the number of
output node) input links for an ANN in the case of nominal attributes
internal use
and, thus, increases the necessary computing resources.
10/11/2022
Models – Artificial neural networks
How to implement
Step 1: Data preparation
Iris dataset
ANN operator only works on numeric data types data transformation with categorical/nominal attributes
Using Rename operator to name 4 attributes of Iris dataset
Using Split Data operator to split 150 Iris records equally into training and test data
Step 2: Modelling operator and parameters
Training dataset is connected to the Neural Net operator that accepts numeric data and normalizes the values
The output of Neural Net can be connected to the Apply Model operator that also gets an input dataset from
the Split Data operator for the test dataset.
The outputs of Apply Model is the labeled dataset and the ANN model.
Step 3: Evaluation
The labeled dataset after using Apply Model is then connected to the Performance-Classification operator.
Summary
Neural network models require stringent input constraints and preprocessing.
An ANN does not handle categorical input data. If the data has nominal values, it needs to be
converted to binary or real values. This means one input attribute explodes into multiple input
attributes and exponentially increases nodes, links, and complexity
However, having redundant correlated attributes is not going to be a problem in an ANN model.
If the test example has missing attribute values, the model cannot function (like to a regression or
decision tree model).
The missing values can be replaced with average values or any default values to mitigate the constraint.
The relationship between input and output cannot be explained clearly by an ANN.
Since there are hidden layers, it is quite complex to understand the model.
Building a good ANN model with optimized parameters takes time.
It depends on the number of training records and iterations.
There are no consistent guidelines on the number of hidden layers and nodes within each hidden layer.
Summary
However, once a model is built, it is straightforward to implement, and a new unseen record
gets classified quite fast.
Because model building is by incremental error correction, ANN can yield local optima as the
final model.
The rapid classification of test examples makes an ANN quite useful for anomaly detection and
classification problems.
An ANN is commonly used in fraud detection, a scoring situation where the relationship
between inputs and output is nonlinear.
If there is a need for something that handles a highly nonlinear landscape along with fast real-
time performance, then the ANN fits the bill.