
CHAPTER 7 – UNIT 5

PROBABILISTIC LEARNING

BY

NAGARATNA B. CHITTARAGI

• One criticism that is often made of neural networks—especially the
MLP—is that it is not clear exactly what it is doing: while we can go and have
a look at the activations of the neurons and the weights, they don’t tell us
much.
• We’ve already seen some methods that don’t have this problem, principally
the decision tree in Chapter 12.
• In this chapter we are going to look at methods that are based on statistics,
and that are therefore more transparent, in that we can always extract and
look at the probabilities and see what they are, rather than having to worry
about weights that have no obvious meaning.
• We will look at how to perform classification by using the frequency with which
examples appear in the training data, and then we will see how we can deal
with our first example of unsupervised learning, when the labels are not
present for the training examples.
• If the data comes from known probability distributions, then we will see that it
is possible to solve this problem with a very neat algorithm, the
Expectation-Maximisation (EM) algorithm, which we will also see in other
guises in later chapters.
• Finally, we will have a look at a rather different way of using the dataset when
we look at nearest neighbour methods.

Gaussian Mixture Model (GMM)

The Gaussian Mixture Model (GMM) is a mixture model that represents data as a
combination of several Gaussian probability distributions whose parameters are not specified in advance.

• Further, a GMM requires estimated statistics such as the mean and standard deviation (the parameters) of each component.
• These parameters of the probability distributions are estimated so as to best fit the density of a given training dataset.
• Although there are plenty of techniques available to estimate the parameters of a Gaussian Mixture Model (GMM), Maximum Likelihood Estimation is one of the most popular techniques among them.
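As an illustration (not from the original slides), scikit-learn's GaussianMixture class fits a GMM by maximum likelihood using the EM algorithm under the hood. A minimal sketch on made-up one-dimensional data, assuming scikit-learn and NumPy are available, might look like:

import numpy as np
from sklearn.mixture import GaussianMixture

# Made-up data: two Gaussian clusters mixed together
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0.0, 1.0, 200),
                       rng.normal(5.0, 0.5, 100)]).reshape(-1, 1)

# Fit a two-component GMM by maximum likelihood (EM under the hood)
gmm = GaussianMixture(n_components=2, random_state=0).fit(data)

print("means:", gmm.means_.ravel())    # estimated component means
print("weights:", gmm.weights_)        # estimated mixing proportions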

Unsupervised Machine Learning
• In many cases we do not have labelled data and need to find the hidden patterns in the given dataset.
• To solve such cases in machine learning, we need unsupervised learning techniques.

What is Unsupervised Learning?

• As the name suggests, unsupervised learning is a machine learning technique in which models are not supervised using a training dataset.
• Instead, the models themselves find the hidden patterns and insights from the given data.
• It can be compared to the learning that takes place in the human brain while learning new things. It can be defined as:

Unsupervised learning is a type of machine learning in which models are trained using an unlabelled dataset and are allowed to act on that data without any supervision.

Unsupervised learning cannot be directly applied to a regression or classification problem
because unlike supervised learning, we have the input data but no corresponding output
data.

The goal of unsupervised learning is to find the underlying structure of the dataset, group the data according to similarities, and represent the dataset in a compressed format.

Example: Suppose the unsupervised learning algorithm is given an input dataset containing images of different types of cats and dogs. The algorithm is never trained on the given dataset, which means it does not have any idea about the features of the dataset. The task of the unsupervised learning algorithm is to identify the image features on its own. The unsupervised learning algorithm will perform this task by clustering the image dataset into groups according to the similarities between images.

Why use Unsupervised Learning?

Below are some main reasons that describe the importance of unsupervised learning:
▪ Unsupervised learning is helpful for finding useful insights from data.
▪ Unsupervised learning is much like the way a human learns to think from their own experiences, which brings it closer to real AI.
▪ Unsupervised learning works on unlabelled and uncategorised data, which makes it all the more important.
▪ In the real world we do not always have input data with corresponding output, so to solve such cases we need unsupervised learning.

Working of Unsupervised Learning
Working of unsupervised learning can be understood by the below diagram:

Types of Unsupervised Learning Algorithm:
The unsupervised learning algorithm can be further categorized into two
types of problems:
▪ Clustering: Clustering is a method of grouping objects into clusters such that objects with the most similarities remain in one group and have few or no similarities with the objects of another group.
▪ Cluster analysis finds the commonalities between the data objects and categorises them according to the presence and absence of those commonalities.
▪ Association: An association rule is an unsupervised learning method used for finding relationships between variables in a large database.
▪ It determines the sets of items that occur together in the dataset.
▪ Association rules make marketing strategies more effective.
▪ For example, people who buy item X (say bread) also tend to purchase item Y (butter or jam). A typical example of association rules is Market Basket Analysis; a toy sketch of the idea is given below.
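A purely illustrative sketch (made-up baskets, plain Python) of counting which items occur together, which is the raw material for association rules such as bread → butter:

from itertools import combinations
from collections import Counter

# Made-up transactions (market baskets)
baskets = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "jam"},
]

# Count how often each pair of items occurs together
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support of a pair = fraction of baskets containing both items
for pair, count in pair_counts.most_common(3):
    print(pair, "support =", count / len(baskets))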

Unsupervised Learning algorithms:
Below is the list of some popular unsupervised learning algorithms:

▪K-means clustering
▪KNN (k-nearest neighbors)
▪Hierarchical clustering
▪Anomaly detection
▪Neural Networks
▪Principal Component Analysis
▪Independent Component Analysis
▪Apriori algorithm
▪Singular value decomposition

A technique demanded by many real-world tasks
◦ Bank/Internet Security: fraud/spam pattern discovery
◦ Biology: taxonomy of living things such as kingdom, phylum, class, order,
family, genus and species
◦ City-planning: Identifying groups of houses according to their house type,
value, and geographical location
◦ Climate change: understanding the Earth's climate, finding patterns in the atmosphere
and ocean
◦ Finance: stock clustering analysis to uncover correlation underlying shares
◦ Image Compression/segmentation: coherent pixels grouped
◦ Information retrieval/organization: Google search, topic-based news
◦ Land use: Identification of areas of similar land use in an earth observation
database
◦ Marketing: Help marketers discover distinct groups in their customer bases,
and then use this knowledge to develop targeted marketing programs
◦ Social network mining: special interest group automatic discovery

Quiz

NEAREST NEIGHBOUR METHODS
• Groups in a classroom.
• Making groups in a function hall

• This is pretty much exactly the idea behind nearest neighbour methods: if
we don’t have a model that describes the data, then the best thing to do is to look
at similar data and choose to be in the same class as them.

• We have the datapoints positioned within input space, so we just need to work out which of the training data are close to it.

• This requires computing the distance to each datapoint in the training set, which is relatively expensive:
• if we are in normal Euclidean space, then we have to compute d subtractions and d squarings (we can ignore the square root since we only want to know which points are the closest, not the actual distance), and this has to be done O(N²) times.

• We can then identify the k nearest neighbours to the test point, and then set
the class of the test point to be the most common one out of those for the
nearest neighbours.
• The choice of k is not trivial. Make it too small and nearest neighbour methods
are sensitive to noise, too large and the accuracy reduces as points that are
too far away are considered.
• Some possible effects of changing the size of k on the decision boundary are
shown in Figure 7.3.

• This method suffers from the curse of dimensionality (Section 2.1.2). First, as
shown above, the computational costs get higher as the number of
dimensions grows.
• This is not as bad as it might appear at first: there are sets of methods such as KD-Trees (see Section 7.2.2 for more details) that compute this in O(N log N) time.

• However, more importantly, as the number of dimensions increases, so the distance to other datapoints tends to increase.

• In addition, they can be far away in a variety of different directions—there might be points that are relatively close in some dimensions, but a long way away in others.

• There are methods for dealing with these problems, known as adaptive
nearest neighbour methods, and there is a reference to them in the Further
Reading section at the end of the chapter.

• The only part of this that requires any care during the implementation is what
to do when there is more than one class found in the closest points, but even
with that the implementation is nice and simple:
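The implementation from the book is not reproduced on these slides; a minimal sketch of the same idea (majority vote among the k closest points, using squared Euclidean distance, with ties between classes simply resolved by whichever class is seen first), assuming NumPy arrays for the data, might be:

import numpy as np
from collections import Counter

def knn_classify(train_X, train_y, test_point, k=3):
    # Classify test_point as the most common class among its k nearest
    # training points (squared Euclidean distance; the square root is not needed)
    diffs = train_X - test_point              # d subtractions per training point
    dists = np.sum(diffs ** 2, axis=1)        # d squarings, then a sum
    nearest = np.argsort(dists)[:k]           # indices of the k closest points
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]         # majority class

# Tiny made-up usage
X = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]])
y = np.array(["A", "A", "B", "B"])
print(knn_classify(X, y, np.array([8.5, 8.0]), k=3))   # prints B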

K-Nearest Neighbor (KNN) Algorithm for Machine Learning

▪ K-Nearest Neighbour is one of the simplest Machine Learning algorithms, based on the Supervised Learning technique.
▪ The K-NN algorithm assumes similarity between the new case/data and the available cases, and puts the new case into the category that is most similar to the available categories.
▪ The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can easily be classified into a well-suited category by using the K-NN algorithm.
▪ The K-NN algorithm can be used for Regression as well as for Classification, but mostly it is used for Classification problems.

▪ K-NN is a non-parametric algorithm, which means it does not make any assumption about the underlying data.
▪ It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead it stores the dataset and, at the time of classification, performs an action on the dataset.
▪ The KNN algorithm at the training phase just stores the dataset, and when it gets new data it classifies that data into the category that is most similar to the new data.

The K-nearest neighbours (KNN) algorithm uses ‘feature similarity’ to predict the values of new data points, which means that a new data point will be assigned a value based on how closely it matches the points in the training set.

We can understand its working with the help of the following steps −

Step 1 − We need a dataset, so during the first step of KNN we must load the training as well as the test data.
Step 2 − Next, we need to choose the value of K, i.e. the number of nearest data points to consider. K can be any integer.
Step 3 − For each point in the test data, do the following:

3.1 − Calculate the distance between the test data and each row of the training data with the help of one of the methods, namely:
◦ Euclidean, Manhattan or Hamming distance.
◦ The most commonly used method to calculate distance is Euclidean.

3.2 − Now, based on the distance values, sort them in ascending order.
3.3 − Next, choose the top K rows from the sorted array.
3.4 − Now, assign a class to the test point based on the most frequent class of these rows.
Step 4 − End

Let us start with a simple example. Consider the following table – it consists of the height,
age, and weight (target) values for 10 people. As you can see, the weight value of ID11 is
missing. We need to predict the weight of this person based on their height and age.

Note: The data in this table does not represent actual values. It is merely used as an
example to explain this concept.

How Does the KNN Algorithm Work?
•As we saw above, the KNN algorithm can be used for both classification and
regression problems.
•The KNN algorithm uses ‘feature similarity’ to predict the values of any new data
points.
•This means that the new point is assigned a value based on how closely it resembles
the points in the training set. From our example, we know that ID11 has height and age
similar to ID1 and ID5, so the weight would also approximately be the same.

Below is a stepwise explanation of the
algorithm:
1. First, the distance between the new point and
each training point is calculated.

2. The closest k data points are selected (based
on the distance). In this example, points 1, 5,
and 6 will be selected if the value of k is 3.
We will further explore how to select the right value of k later.

3. The average of these data points is the final prediction for the new point. Here, we
have the weight of ID11 = (77+72+60)/3 = 69.66 kg.

How to Calculate the Distance Between Points?

The first step is to calculate the distance between the new point and each training
point. There are various methods for calculating this distance, of which the most
commonly known methods are Euclidean and Manhattan distance (for continuous variables) and Hamming distance (for categorical variables).
1.Euclidean Distance: Euclidean distance is calculated as the square root of the sum of
the squared differences between a new point (x) and an existing point (y).
2. Manhattan Distance: This is the distance between real vectors, calculated as the sum of their absolute differences.

3. Hamming Distance: It is used for categorical variables. If the value (x) and the value (y) are the same, the distance D will be equal to 0. Otherwise, D = 1.

There is also the Minkowski distance, which is a generalised form of the Euclidean and Manhattan distances.
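As a quick illustration (not from the slides) of how these distances could be computed, assuming NumPy:

import numpy as np

def euclidean(x, y):
    # Square root of the sum of squared differences
    return np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2))

def manhattan(x, y):
    # Sum of the absolute differences
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)))

def hamming(x, y):
    # Number of positions at which the (categorical) values differ
    return sum(a != b for a, b in zip(x, y))

def minkowski(x, y, k):
    # Generalisation: k = 1 gives Manhattan, k = 2 gives Euclidean
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** k) ** (1.0 / k)

print(euclidean([0, 0], [3, 4]))                      # 5.0
print(manhattan([0, 0], [3, 4]))                      # 7
print(hamming(["red", "small"], ["red", "large"]))    # 1
print(minkowski([0, 0], [3, 4], 2))                   # 5.0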

Once the distance of a new observation from the points in our training set has been
measured, the next step is to pick the closest points. The number of points to be
considered is defined by the value of k.

The second step is to select the k value. This determines the number of neighbors we look at
when we assign a value to any new observation.
In our example, for a value k = 3, the closest points are ID1, ID5, and ID6.

ID11 = (77+72+60)/3
ID11 = 69.66 kg

Problem 1: Apply the K-Nearest Neighbour algorithm for classification to the following dataset. Find the class to which the data X={Maths=6, Physics=8} belongs, with K=3.

Apply each step and find the class to which X={Maths=6, Physics=8} belongs.
• From these we have to consider the three nearest neighbours (K=3).

• Here, the 2nd, 3rd and 5th data elements are nearest to the new test data.

• At the same time, these three all belong to class Pass in the given dataset.

• Hence we should label the class of the new data as Pass.

• In case different class labels are found among the neighbours, i.e. two Pass and one Fail (or two Fail and one Pass), the class label with the higher proportion among the neighbours must be chosen.

• This shows how the new test data is assigned a class label by considering its three neighbours (k=3).

• Similarly, you can find the class for the given data X={Maths=5, Physics=7}.

• We are going to look next at how we can use these methods for regression,
before we turn to the question of how to perform the distance calculations as
efficiently as possible, something that is done simply but inefficiently in the code
above.

• We will then consider briefly whether or not the Euclidean distance is always the
most useful way to calculate distances, and what alternatives there are.
• For the k-nearest neighbours algorithm the bias-variance decomposition can be
computed as:
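The equation itself is not reproduced on the slide; the standard form of this decomposition for k-nearest neighbour regression (as given, for example, in Hastie, Tibshirani and Friedman's The Elements of Statistical Learning), assuming y = f(x) + ε with noise variance σ², is:

E\left[(y - \hat{f}(x_0))^2\right] = \sigma^2 + \left[f(x_0) - \frac{1}{k}\sum_{i=1}^{k} f(x_{(i)})\right]^2 + \frac{\sigma^2}{k}

where x_{(1)}, …, x_{(k)} are the k nearest neighbours of x_0; the squared term in the middle is the bias and the final term is the variance.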

• The way to interpret this is that when k is small, so that there are few neighbours
considered, the model has flexibility and can represent the underlying model well,
but that it makes mistakes (has high variance) because there is relatively little
data.

• As k increases, the variance decreases, but at the cost of less flexibility and so
more bias.
Nearest Neighbour Smoothing
• Nearest neighbour methods can also be used for regression by returning the
average value of the neighbours to a point, or a spline or similar fit as the new
value.
• The most common methods are known as kernel smoothers, and they use a
kernel (a weighting function between pairs of points) that decides how much
emphasis (weight) to put onto the contribution from each datapoint according to
its distance from the input.
• We will see kernels in a different context in Section 8.2, but here we shall simply
use two kernels that are used for smoothing.
• Both of these kernels are designed to give more weight to points that are closer
to the current input, with the weights decreasing smoothly to zero as they pass
out of the range of the current input, with the range specified by a parameter.
• They are the Epanechnikov quadratic kernel:
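The kernel definitions were shown as an image on the original slide; the standard form of the Epanechnikov quadratic kernel, with λ the range parameter, is

K_\lambda(x_0, x) = D\left(\frac{|x - x_0|}{\lambda}\right), \qquad D(t) = \begin{cases} \frac{3}{4}(1 - t^2) & |t| \le 1 \\ 0 & \text{otherwise} \end{cases}

and the second smoothing kernel used in the book is most likely the tricube kernel, which has the same form but with D(t) = (1 - |t|^3)^3 for |t| ≤ 1 and 0 otherwise.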

• The results of using these kernels are shown in Figure 7.4 on a dataset that
consists of the time between eruptions (technically known as the repose) and the
duration of the eruptions of Mount Ruapehu, the large volcano in the centre of
New Zealand’s North Island.
• Values of λ of 2 and 4 were used here. Picking λ requires experimentation.
• Large values average over more datapoints, and therefore produce lower
variance, but at the cost of higher bias.
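A minimal sketch (not the book's code) of an Epanechnikov kernel smoother on made-up data, assuming NumPy, might be:

import numpy as np

def epanechnikov_smoother(x_train, y_train, x_query, lam=2.0):
    # Prediction at each query point: a weighted average of the training
    # targets, with weights given by the Epanechnikov kernel
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    predictions = []
    for x0 in np.asarray(x_query, dtype=float):
        t = np.abs(x_train - x0) / lam                    # scaled distances
        weights = np.where(t <= 1, 0.75 * (1 - t ** 2), 0.0)
        if weights.sum() == 0:                            # no points within range
            predictions.append(np.nan)
        else:
            predictions.append(np.sum(weights * y_train) / weights.sum())
    return np.array(predictions)

# Made-up data: noisy samples of a smooth curve
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 10, 50))
y = np.sin(x) + rng.normal(0, 0.2, 50)
print(epanechnikov_smoother(x, y, [2.0, 5.0, 8.0], lam=2.0))

Larger values of lam average over more datapoints, which gives the lower-variance, higher-bias behaviour described above.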

FIGURE 7.4 Output of the nearest neighbour method and two kernel smoothers on the data of duration and repose of eruptions of Mount Ruapehu 1860–2006.

Efficient Distance Computations: the KD-Tree (K-Dimensional Tree)

• As was mentioned above, computing the distances between all pairs of points
is very computationally expensive.
• Fortunately, as with many problems in computer science, designing an efficient
data structure can reduce the computational overhead a lot.
• For the problem of finding nearest neighbours the data structure of choice is the
KD-Tree.
• It has been around since the late 1970s, when it was devised by Friedman and
Bentley, and it reduces the cost of finding a nearest neighbour to O(logN) for
O(N) storage.

• The construction of the tree is O(N log² N), with much of the computational cost being in the computation of the median, which with a naïve algorithm requires a sort and is therefore O(N log N), or can be computed with a randomised algorithm in O(N) time.

• The idea behind the KD-tree is very simple. You create a binary tree by
choosing one dimension at a time to split into two, and placing the line through
the median of the point coordinates of that dimension.

• Not that different to a decision tree (Chapter 12), really. The points themselves
end up as leaves of the tree.
• Making the tree follows pretty much the same steps as usual for constructing
a binary tree: we identify a place to split into two choices, left and right, and
then carry on down the tree.
• This makes it natural to write the algorithm recursively. The choice of what to
split and where is what makes the KD-tree special.
• Just one dimension is split in each step, and the position of the split is found
by computing the median of the points that are to be split in that one
dimension, and putting the line there.
• In general, the choice of which dimension to split alternates through the
different choices, or it can be made randomly.
• The algorithm below cycles through the possible dimensions based on the
depth of the tree so far, so that in two dimensions it alternates horizontal and
vertical splits.
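The construction algorithm itself was shown as an image on the original slide; a minimal sketch of this median-splitting construction (cycling the split dimension with depth, with the points ending up as leaves), assuming NumPy, might be:

import numpy as np

class KDNode:
    # A node of the KD-tree: either an internal split or a leaf point
    def __init__(self, point=None, split_dim=None, split_value=None,
                 left=None, right=None):
        self.point = point              # set only for leaf nodes
        self.split_dim = split_dim      # dimension used for the split
        self.split_value = split_value  # median coordinate in that dimension
        self.left = left
        self.right = right

def build_kdtree(points, depth=0):
    # Recursively build the tree, splitting one dimension at a time at the median
    points = np.asarray(points, dtype=float)
    n, d = points.shape
    if n == 1:
        return KDNode(point=points[0])
    dim = depth % d                                 # cycle through the dimensions
    points = points[np.argsort(points[:, dim])]     # sort on the chosen dimension
    median = n // 2
    split_value = points[median, dim]
    left = build_kdtree(points[:median], depth + 1)
    right = build_kdtree(points[median:], depth + 1)
    return KDNode(split_dim=dim, split_value=split_value, left=left, right=right)

# The seven points used in the example that follows
tree = build_kdtree([(5, 4), (1, 6), (6, 1), (7, 5), (2, 7), (2, 2), (5, 8)])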

FIGURE 7.6 The splits and leaf points found by the KD-tree.

• Suppose that we had seven two-dimensional points to make a tree from:

(5, 4), (1, 6), (6, 1), (7, 5), (2, 7), (2, 2), (5, 8) (as plotted in Figure 7.5).

• The algorithm will pick the first coordinate to split on initially, and the median
point here is 5, so the split is through x = 5. Of those on the left of the line, the
median y coordinate is 6, and for those on the right it is 5.
• At this point we have separated all the points, and so the algorithm terminates
with the split shown in Figure 7.6 and the tree shown in Figure 7.7.

• Searching the tree is the same as any other binary tree; we are more
interested in finding the nearest neighbours of a test point.
• This is fairly easy: starting at the root of the tree you recurse down through the
tree comparing just one dimension at a time until you find a leaf node that is in
the region containing the test point.
• Using the tree shown in Figure 7.7 we introduce the test point (3, 5), which
finds (2, 2) as the leaf for the box that (3, 5) is in.
• However, looking at Figure 7.8 we see that this is not the closest point at all, so
we need to do some more work.

• The first thing we do is label the leaf we have found as a potential nearest
neighbour, and compute the distance between the test point and this point,
since any other point has to be closer.
• Now we need to check any other boxes that could contain something closer.
• Looking at Figure 7.8 you can see that the point (2, 7) is closer, and that is the label of the leaf for the sibling box to the one that was returned, so the algorithm also needs to check the sibling box.
• However, suppose that we used (4.5, 2) as the test point. In that case the
sibling is too far away, but another point (6, 1) is closer.
• So just checking the sibling is not enough — we also need to check the
siblings of the parent node, together with its descendants (the cousins of the
first point).
• A look at the figure again should convince you that the algorithm can then
terminate in most cases; very occasionally it can be necessary to go even
further afield, but it is easy to see which branches to prune.
• This leads to the following Python program:
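The book's program is not reproduced on these slides; a sketch of the same search idea (descend to the leaf containing the test point, then back out, checking a sibling branch only when the splitting line is closer than the best point found so far), continuing the build_kdtree sketch given earlier, might be:

import numpy as np

def nearest(node, test_point, best=None, best_dist=np.inf):
    # Returns the closest leaf point found so far and its squared distance
    test_point = np.asarray(test_point, dtype=float)
    if node.point is not None:                      # a leaf: candidate point
        dist = np.sum((node.point - test_point) ** 2)
        if dist < best_dist:
            return node.point, dist
        return best, best_dist
    # Descend first into the half of the split that contains the test point
    if test_point[node.split_dim] < node.split_value:
        first, second = node.left, node.right
    else:
        first, second = node.right, node.left
    best, best_dist = nearest(first, test_point, best, best_dist)
    # Check the other side only if the splitting line is closer than the
    # current best distance (the pruning step described above)
    if (node.split_value - test_point[node.split_dim]) ** 2 < best_dist:
        best, best_dist = nearest(second, test_point, best, best_dist)
    return best, best_dist

# The two test points from the text, using the tree built earlier
print(nearest(tree, (3, 5))[0])       # finds (2, 7)
print(nearest(tree, (4.5, 2))[0])     # finds (6, 1)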
FIGURE 7.8 Two test points for the example KD-tree.

• For the sake of simplicity, let us understand a 2-D Tree with an example.

• The root would have an x-aligned plane, the root’s children would both have
y-aligned planes, the root’s grandchildren would all have x-aligned planes, and the
root’s great-grandchildren would all have y-aligned planes and so on.

• Generalization:

Let us number the planes as 0, 1, 2, … (K – 1). From the above example, it is quite clear that a point (node) at depth D will have an A-aligned plane, where A is calculated as:
A = D mod K

• How do we determine whether a point will lie in the left subtree or in the right subtree?

• If the root node is aligned in plane A, then the left subtree will contain all points whose coordinate in that plane is smaller than that of the root node.
• Similarly, the right subtree will contain all points whose coordinate in that plane is greater than or equal to that of the root node.

Creation of a 2-D Tree:
Consider the following points in a 2-D plane:
(3, 6), (17, 15), (13, 15), (6, 12), (9, 1), (2, 7), (10, 19)

How is the space partitioned?
All 7 points will be plotted in the X-Y plane as follows:

1. Point (3, 6) will divide the space into two parts: draw the line X = 3.

2. Point (2, 7) will divide the space to the left of
line X = 3 into two parts horizontally.
Draw line Y = 7 to the left of line X = 3.

3. Point (17, 15) will divide the space to
the right of line X = 3 into two parts
horizontally.
Draw line Y = 15 to the right of line X = 3.

4. Point (6, 12) will divide the space below line Y = 15 and to the
right of line X = 3 into two parts.
Draw line X = 6 to the right of line X = 3 and below line Y = 15.

5. Point (13, 15) will divide the space below line Y = 15 and to
the right of line X = 6 into two parts.
Draw line X = 13 to the right of line X = 6 and below line Y = 15.

6. Point (9, 1) will divide the space
between lines X = 3, X = 6 and Y = 15 into
two parts.
Draw line Y = 1 between lines X = 3 and
X = 6.

7. Point (10, 19) will divide the space to the
right of line X = 3 and above line Y = 15 into
two parts.
Draw line Y = 19 to the right of line X = 3
and above line Y = 15.
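A minimal sketch (not from the slides) of building such a 2-D tree by inserting the points one at a time, comparing one coordinate per level as in the walkthrough above; note that how equal coordinates are handled is a convention and varies between descriptions:

K = 2   # dimensionality of the points

class Node:
    # A simple KD-tree node holding a point and two children
    def __init__(self, point):
        self.point = point
        self.left = None
        self.right = None

def insert(root, point, depth=0):
    # Compare the coordinate in plane A = depth mod K; smaller goes left
    if root is None:
        return Node(point)
    axis = depth % K
    if point[axis] < root.point[axis]:
        root.left = insert(root.left, point, depth + 1)
    else:
        root.right = insert(root.right, point, depth + 1)
    return root

# Build the tree from the walkthrough's points, one insertion at a time
points = [(3, 6), (17, 15), (13, 15), (6, 12), (9, 1), (2, 7), (10, 19)]
root = None
for p in points:
    root = insert(root, p)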

Distance Measures
• We have computed the distance between points as the Euclidean distance,
which is something that you learnt about in high school.
• However, it is not the only option, nor is it necessarily the most useful.
• In this section we will look at the underlying idea behind distance calculations
and possible alternatives.
• If I were to ask you to find the distance between my house and the nearest
shop, then your first guess might involve taking a map of my town, locating my
house and the shop, and using a ruler to measure the distance between them.
• By careful application of the map scale you can now tell me how far it is.
• However, when I set out to buy some milk I’m liable to find that I have to walk
rather further than you’ve told me, since the direct line that you measured would
involve walking through (or over) several houses, and some serious
fence-scaling.
• Your ‘as the crow flies’ distance is the shortest possible path, and it is the
straight-line, or Euclidean, distance.
• You can measure it on the map by just using a ruler, but it essentially consists
of measuring the distance in one direction (we’ll call it north-south) and then the
distance in another direction that is perpendicular to the first (let’s call it
east-west) and then squaring them, adding them together, and then taking the
square root of that.

• Writing that out, the Euclidean distance that we are all used to is:
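The equation itself (Equation (7.11) in the book) is not reproduced on the slide; the familiar two-dimensional form is:

d_E = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}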

• where (x1, y1) is the location of my house in some coordinate system (say by
using a GPS tracker) and (x2, y2) is the location of the shop.
• If I told you that my town was laid out on a grid block system, as is common in
towns that were built in the interval between the invention of the motor car and
the invention of innovative town planners, then you would probably use a
different measure.
• You would measure the distance between my house and the shop in the
‘north-south’ direction and the distance in the ‘east-west’ direction, and then add
the two distances together.
• This would correspond to the distance I actually had to walk. It is often known as
the city-block or Manhattan distance and looks like:
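The corresponding equation (Equation (7.12) in the book) is not reproduced on the slide; the standard form is:

d_C = |x_1 - x_2| + |y_1 - y_2|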

The point of this discussion is to show that there is more than one way to measure a
distance, and that they can provide radically different answers.
These two different distances can be seen in Figure 7.9.

• Mathematically, these distance measures are known as metrics.
• A metric function or norm takes two inputs and gives a scalar (the distance) back,
which is positive, and 0 if and only if the two points are the same, symmetric (so
that the distance to the shop is the same as the distance back), and obeys the
triangle inequality, which says that the distance from a to b plus the distance from
b to c should not be less than the direct distance from a to c.
• Most of the data that we are going to have to analyse lives in rather more than
two dimensions.
• Fortunately, the Euclidean distance that we know about generalises very well to
higher dimensions (and so does the city-block metric).
• In fact, these two measures are both instances of a class of metrics that work in
any number of dimensions.

The general measure is the Minkowski metric and it is written as:
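The formula is not reproduced on the slide; the standard form of the Minkowski metric in d dimensions is:

L_k(\mathbf{x}, \mathbf{y}) = \left( \sum_{i=1}^{d} |x_i - y_i|^k \right)^{1/k}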

• If we put k = 1 then we get the city-block distance (Equation (7.12)), and k = 2 gives the Euclidean distance (Equation (7.11)).
• Thus, you might possibly see the Euclidean metric written as the L2 norm and the
city-block distance as the L1 norm.
• These norms have another interesting feature.
• Remember that we can define different averages of a set of numbers. If we
define the average as the point that minimises the sum of the distance to every
datapoint, then it turns out that the mean minimises the Euclidean distance (the
sum-of-squares distance), and the median minimises the L1 metric.
• We met another distance measure earlier: the Mahalanobis distance
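Its formula is not shown on the slide; the standard form, for a point x with mean μ and covariance matrix Σ, is:

D_M(\mathbf{x}) = \sqrt{(\mathbf{x} - \boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})}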
