
Machine Learning

(CS4613)

Department of Computer Science


Capital University of Science and Technology (CUST)
Course Outline
Topic | Weeks | Reference
Introduction | Week 1 | Hands-On Machine Learning, Ch. 1
Hypothesis Learning | Week 2 | Tom Mitchell, Machine Learning, Ch. 2
Model Evaluation | Week 3 | Fundamentals of Machine Learning for Predictive Data Analytics, Ch. 8
Classification: Decision Trees | Weeks 4, 5 | Fundamentals of Machine Learning for Predictive Data Analytics, Ch. 4
Bayesian Inference, Naïve Bayes | Weeks 6, 7 | Fundamentals of Machine Learning for Predictive Data Analytics, Ch. 6
PCA | Week 8 | Hands-On Machine Learning, Ch. 8
Linear Regression | Weeks 9, 10 | Fundamentals of Machine Learning for Predictive Data Analytics, Ch. 7
SVM | Weeks 11, 12 | Fundamentals of Machine Learning for Predictive Data Analytics, Ch. 7
ANN | Weeks 13, 14 | Neural Networks: A Systematic Introduction, Ch. 1–4, 7 (selected topics); Hands-On Machine Learning, Ch. 10
K-Nearest Neighbor | Weeks 15, 16 | Fundamentals of Machine Learning for Predictive Data Analytics, Ch. 5
K-Means Clustering | Weeks 15, 16 | Master Machine Learning Algorithms, Ch. 22, 23

2
Outline Week 15, 16
• Similarity Based Learning
• Measuring Similarity Using Distance Metrics
• K-Nearest Neighbors Algorithm

• Unsupervised Learning
• K-Means Clustering

3
Similarity-based
learning

4
Similarity Based Learning
• Similarity-based approaches to machine learning come from the idea that the best way to make a prediction is simply to look at what has worked well in the past and predict the same thing again.
• If you are trying to make a prediction for a current situation
then you should search your memory to find situations that
are similar to the current one and make a prediction based on
what was true for the most similar situation in your memory.
• A key component of this approach is defining a computational
measure of similarity between instances which is actually
some form of distance measure in the feature space.

5
Example Dataset
• Consider an example dataset containing two descriptive features, the SPEED and AGILITY ratings for college athletes (both measured out of 10), and one target feature that lists whether the athletes were drafted to a professional team.
• We can represent this dataset in a feature space by taking each descriptive feature to be an axis of a coordinate system.
• We can then place each instance within the feature space based on the values of its descriptive features.
• In the figure on the next slide, SPEED has been plotted on the horizontal axis,
and AGILITY has been plotted on the vertical axis.
• The value of the DRAFT feature is indicated by the shape representing each
instance as a point in the feature space: triangles for no and crosses for yes.
• In this example, there are only two descriptive features, so the feature space
is two-dimensional. Feature spaces can have many more dimensions.

6
7
[Figure: the college athletes dataset (SPEED, AGILITY, DRAFT) plotted in the feature space; triangles mark DRAFT = no, crosses mark DRAFT = yes]
Feature Space
• An abstract m-dimensional space that is created by
making each descriptive feature in a dataset an axis
of an m-dimensional coordinate system and
mapping each instance in the dataset to a point in
this coordinate system based on the values of its
descriptive features.
• The distance between two points in the feature space is a useful measure of the similarity of the descriptive features of the two instances: the smaller the distance between two instances, the more similar they are.
8
Measuring Similarity
Using Distance Metrics

9
Distance Measure
• metric(a, b) is a function that returns the distance
between two instances a and b.
• Mathematically, a metric must conform to the following four criteria:
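1. Non-negativity: metric(a, b) ≥ 0
2. Identity: metric(a, b) = 0 if and only if a = b
3. Symmetry: metric(a, b) = metric(b, a)
4. Triangle inequality: metric(a, b) ≤ metric(a, c) + metric(c, b)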

10
Euclidean distance
• One of the best known distance metrics is
Euclidean distance, which computes the length of
the straight line between two points.
• Euclidean distance between two instances a and b
in an m-dimensional feature space is defined as
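In symbols, writing a[i] for the value of the i-th descriptive feature of instance a:

$$\mathrm{Euclidean}(\mathbf{a},\mathbf{b}) = \sqrt{\sum_{i=1}^{m}\big(\mathbf{a}[i]-\mathbf{b}[i]\big)^{2}}$$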

11
Example
• For example, the Euclidean distance between
instances d12 (SPEED = 5.00, AGILITY = 2.50) and d5
(SPEED = 2.75, AGILITY = 7.50) is,
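$$\mathrm{Euclidean}(\mathbf{d}_{12},\mathbf{d}_{5}) = \sqrt{(5.00-2.75)^{2} + (2.50-7.50)^{2}} = \sqrt{5.0625 + 25} = \sqrt{30.0625} \approx 5.4829$$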

12
Manhattan distance
• The Manhattan distance between two instances a
and b in a feature space with m dimensions is
defined as
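$$\mathrm{Manhattan}(\mathbf{a},\mathbf{b}) = \sum_{i=1}^{m}\big|\mathbf{a}[i]-\mathbf{b}[i]\big|$$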

13
Example
• The Manhattan distance between instances d12
(SPEED = 5.00, AGILITY = 2.50) and d5 (SPEED =
2.75, AGILITY = 7.50) is
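$$\mathrm{Manhattan}(\mathbf{d}_{12},\mathbf{d}_{5}) = |5.00-2.75| + |2.50-7.50| = 2.25 + 5.00 = 7.25$$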

14
Discussion
• Euclidean distance and Manhattan distance are the
two most commonly used distance measures.
• From a computational perspective, the Manhattan distance has a slight advantage over the Euclidean distance because it avoids computing the squares and the square root, and computational considerations can become important when dealing with very large datasets.
• Euclidean distance is often used as the default.
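As a quick illustration, both metrics take only a few lines of Python; this is a minimal sketch, with d12 and d5 taken from the worked examples above:

```python
import math

# SPEED and AGILITY values of d12 and d5 from the worked examples.
d12 = (5.00, 2.50)
d5 = (2.75, 7.50)

def euclidean(a, b):
    # Straight-line distance: square root of the sum of squared feature differences.
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def manhattan(a, b):
    # City-block distance: sum of absolute feature differences (no squares, no square root).
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

print(euclidean(d12, d5))  # ~5.4829
print(manhattan(d12, d5))  # 7.25
```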

15
Nearest Neighbor
Algorithm

16
Introduction
• It is the standard approach to similarity-based learning
• The training phase needed to build a nearest neighbor model
is very simple and just involves storing all the training
instances in memory.
• In the standard version of the algorithm, the data structure
used to store training data is a simple list.
• When the model is used to make a prediction for a new query instance, the distance in the feature space between the query instance and each instance in memory is computed.
• The prediction returned by the model is the target feature level of the instance that is nearest to the query in the feature space.
• The default distance metric used in nearest neighbor models
is Euclidean distance.
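A minimal sketch of this prediction step in Python. The function name nn_predict and the tiny training list are ours for illustration; the list is not the full dataset from slide 7:

```python
import math

def nn_predict(training_data, query):
    # training_data: a simple list of (feature_vector, target_level) pairs held in memory.
    # Compute the Euclidean distance from the query to every stored instance
    # and return the target level of the nearest one.
    nearest = min(training_data, key=lambda inst: math.dist(inst[0], query))
    return nearest[1]

# A few made-up (SPEED, AGILITY) -> DRAFT instances for illustration only.
train = [((2.75, 7.50), "no"), ((5.00, 2.50), "no"),
         ((5.25, 9.50), "yes"), ((7.75, 3.00), "yes")]
print(nn_predict(train, (6.75, 3.00)))  # -> "yes"
```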
17
18
Example
• Assume that we are using the dataset shown on
slide 7 as our labeled training dataset
• We want to make a prediction to tell us whether a
query instance with SPEED = 6.75 and AGILITY =
3.00 is likely to be drafted or not.
• The figure on the next slide illustrates the feature space of the training dataset with the query, represented by the ? marker.

19
20
[Figure: the feature space of the training dataset with the query instance marked by ?]
Example Contd..
• Just by visually inspecting the figure, we can see that the nearest neighbor to the query instance has a target level of yes, so this is the prediction that the model should return. Let's step through how the algorithm makes this prediction.
• During the prediction stage, the nearest neighbor algorithm
iterates across all the instances in the training dataset and
computes the distance between each instance and the query.
• These distances are then ranked from lowest to highest to
find the nearest neighbor.
• The table on the next slide shows that the nearest neighbor
to the query is instance d18, with a distance of 1.2749 and a
target level of yes.

21
22
[Table: distances from the query to each training instance, ranked from lowest to highest; the nearest is d18 at 1.2749 with target level yes]
23
Example Contd..
• When the algorithm is searching for the nearest neighbor using
Euclidean distance, it is partitioning the feature space into what
is known as a Voronoi tessellation, and it is trying to decide
which Voronoi region the query belongs to.
• The Voronoi region belonging to a training instance defines the
set of queries for which the prediction will be determined by
that training instance.
• We can see in this figure that the query is inside a Voronoi
region defined by an instance with a target level of yes. As
such, the prediction for the query instance should be yes.
• The decision boundary is the boundary between regions of the
feature space in which different target levels will be predicted.

24
25
[Figure: the Voronoi tessellation of the feature space and the resulting decision boundary]
Handling Noisy Data: K-Nearest
Neighbors
• Throughout our worked example, the top right corner of the feature
space contained a no region. This is because one of the no instances
occurs far away from the rest of the instances with this target level.
• It is likely that either this instance has been incorrectly labeled, or one
of the descriptive features for this instance has an incorrect value and
hence it is in the wrong location in the feature space. This represents
noise in the dataset.
• We can reduce the dependency of the algorithm on individual (possibly noisy) instances by modifying the algorithm to return the majority target level within the set of k nearest neighbors to the query q.
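A minimal sketch of this k-nearest-neighbors variant, assuming the training data is stored as a list of (feature_vector, target_level) pairs:

```python
import math
from collections import Counter

def knn_predict(training_data, query, k=3):
    # Rank all stored instances by Euclidean distance to the query and keep the k nearest.
    neighbors = sorted(training_data, key=lambda inst: math.dist(inst[0], query))[:k]
    # Return the majority target level among those k neighbors.
    votes = Counter(target for _, target in neighbors)
    return votes.most_common(1)[0][0]
```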

26
27
Trade-off in setting the value of k
• There is always a trade-off in setting the value of k.
• If we set k too low, we run the risk of the algorithm being sensitive to noise in the data and overfitting.
• Conversely, if we set k too high, we run the risk of losing the
true pattern of the data and underfitting.
• The risks associated with setting k to a high value are
particularly acute when we are dealing with an imbalanced
dataset.
• An imbalanced dataset is a dataset that contains significantly
more instances of one target level than another.
• In these situations, as k increases, the majority target level
begins to dominate.
28
Predicting Continuous Targets
• It is relatively easy to adapt the k nearest neighbor
approach to handle continuous target features.
• To do this we simply change the approach to return
a prediction of the average target value of the
nearest neighbors, rather than the majority target
level.
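A minimal sketch of that change, reusing the same neighbor search but averaging continuous target values; the training data is again assumed to be a list of (feature_vector, target_value) pairs:

```python
import math

def knn_regress(training_data, query, k=3):
    # Same neighbor search as before, but the target is continuous,
    # so return the average target value of the k nearest neighbors.
    neighbors = sorted(training_data, key=lambda inst: math.dist(inst[0], query))[:k]
    return sum(target for _, target in neighbors) / k
```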

29
Data Normalization
• Distance computations are sensitive to the value
ranges of the features in the dataset.
• Normalizing the features in a dataset ensures that
each feature can contribute equally to the distance
metric.
• Normalizing the data is an important thing to do for
almost all machine learning algorithms, not just
nearest neighbor.
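As one common option, range (min-max) normalization rescales each feature independently into a fixed interval; a minimal sketch (the function name range_normalize is ours):

```python
def range_normalize(column, low=0.0, high=1.0):
    # Rescale a list of feature values into [low, high] so that features measured
    # on large ranges cannot dominate the distance computation.
    col_min, col_max = min(column), max(column)
    return [low + (v - col_min) * (high - low) / (col_max - col_min) for v in column]

print(range_normalize([5.00, 2.75, 7.50]))  # [0.4736..., 0.0, 1.0]
```

In practice, scikit-learn's MinMaxScaler performs the same per-feature rescaling.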

30
Similarity Indexes for
Binary Descriptive
Features

31
Similarity Measure for Binary
Features
• Many datasets contain binary features (categorical
features that have only two levels).
• For example, a dataset may record whether or not
someone liked a movie, a customer bought a
product, or someone visited a particular webpage.
• If the descriptive features in a dataset are binary, it
is often a good idea to use a similarity index that
defines similarity between instances specifically in
terms of co-presence or co-absence of features,
rather than an index based on distance.

32
Example: Predicting sign up to
paid account
• A common business model for online services is to
allow users a free trial period after which time they
have to sign up to a paid account to continue using the
service.
• These businesses often try to predict the likelihood that
users coming to the end of the trial period will accept
the upsell offer to move to the paid service.
• Such predictions can help a marketing department decide which customers approaching the end of their trial period it should contact to promote the benefits of signing up to the paid service.
33
Sample Data

• PROFILE: Did the user complete the profile form when registering for the free trial?
• FAQ: Did the user read the frequently asked questions page?
• HELPFORUM: Did the user post a question on the help forum?
• NEWSLETTER: Did the user sign up for the weekly newsletter?
• LIKED: Did the user Like the website on Facebook?
• The target feature, SIGNUP, indicates whether the customers ultimately signed up to the paid
service or not (yes or no).
• The query instance, q, is
PROFILE = true, FAQ = false, HELPFORUM = true, NEWSLETTER = false, LIKED = false
34
Similarity Measures
• co-presence (CP), how often a true value occurred for the same feature in both
the query data q and the data for the comparison user (d1 or d2)
• co-absence (CA), how often a false value occurred for the same feature in both
the query data q and the data for the comparison user (d1 or d2)
• presence-absence (PA), how often a true value occurred in the query data q
when a false value occurred in the data for the comparison user (d1 or d2) for
the same feature
• absence-presence (AP), how often a false value occurred in the query data q
when a true value occurred in the data for the comparison user (d1 or d2) for
the same feature
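A minimal sketch of these four counts, assuming each instance is represented as a dict mapping binary feature names to True/False:

```python
def binary_counts(q, d):
    # q and d map each binary feature name to True or False.
    cp = sum(q[f] and d[f] for f in q)          # co-presence: true in both
    ca = sum(not q[f] and not d[f] for f in q)  # co-absence: false in both
    pa = sum(q[f] and not d[f] for f in q)      # presence-absence: true in q, false in d
    ap = sum(not q[f] and d[f] for f in q)      # absence-presence: false in q, true in d
    return cp, ca, pa, ap
```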

35
Russel-Rao similarity index
• One way of judging similarity is to focus solely on co-
presence.
• For example, in an online retail setting, co-presence could
capture what two users jointly viewed, liked, or bought.
• The Russel-Rao similarity index focuses on this and is
measured in terms of the ratio between the number of co-
presences and the total number of binary features
considered.
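A minimal sketch of the Russel-Rao index in Python. The query q matches the slides; d1's feature values are hypothetical, since the sample-data table is not reproduced in this text:

```python
def russel_rao(q, d):
    # q and d map each binary feature name to True or False.
    # Co-presence (CP): features that are true in both q and d.
    cp = sum(1 for f in q if q[f] and d[f])
    # Russel-Rao: co-presences divided by the total number of binary features.
    return cp / len(q)

# Query from the slides; d1 is made up for illustration.
q  = {"PROFILE": True, "FAQ": False, "HELPFORUM": True, "NEWSLETTER": False, "LIKED": False}
d1 = {"PROFILE": True, "FAQ": True, "HELPFORUM": False, "NEWSLETTER": True, "LIKED": False}
print(russel_rao(q, d1))  # 0.2 (only PROFILE is co-present)
```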

36
Other Names for KNN
• Instance-Based Learning: The raw training instances are used to make predictions. As such, KNN is often referred to as instance-based or case-based learning.
• Lazy Learning: No learning of the model is required and
all of the work happens at the time a prediction is
requested. As such, KNN is often referred to as a lazy
learning algorithm.
• Nonparametric: KNN makes no assumptions about the
functional form of the problem being solved. As such
KNN is referred to as a nonparametric machine learning
algorithm.
37
Unsupervised Learning

38
Unsupervised Learning
• Unsupervised machine learning models are given
unlabeled data and allowed to discover patterns and
insights without any explicit guidance or instruction

39
40
K-Means Clustering

41
K-Means Clustering
• K-Means Clustering is an Unsupervised Learning algorithm,
which groups the unlabeled dataset into different clusters.
• It aims to partition n observations into K clusters in which each observation belongs to the cluster with the nearest mean (also called the cluster center or cluster centroid).
• The cluster mean, or centroid, serves as a prototype of the cluster.
• K defines the number of pre-defined clusters that need to
be created in the process. If K=2, there will be two
clusters, and for K=3, there will be three clusters, and so
on.
42
K-Means and K-Nearest
Neighbor
• K-Nearest Neighbor is a Supervised Algorithm
• K-Means is Unsupervised.
• The two are often confused due to similar names.
• We can however combine the two.
• Obtain cluster centers using K-Means.
• Apply the 1-nearest neighbor classifier to the cluster
centers obtained by k-means and classify new data into
the existing clusters.
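One possible way to wire the two together, assuming scikit-learn is available; the random X array is only placeholder unlabeled data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

X = np.random.rand(100, 2)  # unlabeled 2-D data (random, for illustration only)

# Step 1: obtain cluster centers with K-Means.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Step 2: fit a 1-nearest-neighbor classifier on the centers,
# using the cluster indices as the class labels.
nn = KNeighborsClassifier(n_neighbors=1).fit(kmeans.cluster_centers_, [0, 1, 2])

# New data is then assigned to whichever existing cluster's center is closest.
print(nn.predict([[0.5, 0.5]]))
```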

43
Algorithm
1. Select K random points as the initial cluster centroids.
2. Assign each data point to its closest centroid. This forms K clusters.
3. Recompute the centroid of each newly formed cluster (the mean value of each feature over all the data points in the cluster).
4. Repeat Steps 2 and 3 until no reassignments occur (no data point changes its cluster); see the sketch below.
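A minimal from-scratch sketch of these four steps in Python, assuming each data point is a tuple of numeric feature values (the helper name k_means is ours, not from the referenced texts):

```python
import math
import random

def k_means(points, k, max_iters=100):
    # Step 1: pick K random points as the initial centroids.
    centroids = random.sample(points, k)
    for _ in range(max_iters):
        # Step 2: assign each point to its closest centroid, forming K clusters.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # Step 3: recompute each centroid as the per-feature mean of its cluster.
        new_centroids = [
            tuple(sum(vals) / len(vals) for vals in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        # Step 4: stop when no assignment changes (the centroids stay the same).
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Tiny made-up dataset for illustration.
data = [(1.0, 2.0), (1.5, 1.8), (8.0, 8.0), (9.0, 9.5), (1.2, 0.9)]
centers, groups = k_means(data, k=2)
```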

44
45
Example
• https://codinginfinite.com/k-means-clustering-explained-with-numerical-example/

46
How to find optimal number of
clusters?
• Execute K-means clustering on a given dataset for different K values (for example, from 1 to 10).
• For each K, calculate the WCSS (Within-Cluster Sum of Squares) value.
• WCSS is the sum of squared distances between each cluster's centroid and the data points in that cluster.
• Plot the calculated WCSS values against the number of clusters K.
• The value of K at which there is a sharp bend (the "elbow") in the curve is taken as the optimal value of K; see the sketch below.
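A hedged sketch of this elbow procedure, assuming scikit-learn and matplotlib are available; the random X array is only a placeholder dataset, and a fitted KMeans model exposes its WCSS as the inertia_ attribute:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)  # placeholder dataset
wcss = []
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(model.inertia_)  # inertia_ is the WCSS for this value of K

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS")
plt.show()  # look for the sharp bend (the elbow)
```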

47
48
[Figure: WCSS plotted against the number of clusters K; the sharp bend marks the elbow]
Example using scikit-learn
https://drive.google.com/file/d/1T89nhetWtsbp93TO60_XYbTn_xfxJuG3/view?usp=sharing

49
That is all for Week 15
and 16

50
