
UNIT 8 INTRODUCTION TO MACHINE LEARNING
Structure
8.1 Introduction
8.2 What is Machine Learning?
8.3 Types of Machine Learning
8.4 Machine Learning Algorithms
8.5 Neural Networks and Deep Learning
8.6 Mathematics for Machine Learning
8.7 Software for Machine Learning
8.8 Summary
8.9 Keywords
8.10 Check Your Progress – Possible Answers
8.11 References and Selected Readings

8.1 INTRODUCTION
The credit for defining machine learning goes to Arthur Samuel of IBM, who
wrote a paper titled “Some Studies in Machine Learning Using the
Game of Checkers” in 1959. The paper investigated the application of machine
learning to the game of checkers. The concept of machine learning introduced
by Samuel showed that machines (computers) can learn without being
explicitly programmed, that is, without the use of direct programming commands
for the task. Here machine learning refers to self-learning by a machine
(computer). How will a machine learn? A machine learns from historical data and
empirical information. The terms machine learning, statistical learning and
predictive modelling are often used for closely related concepts.
Statistical modelling is at the core of Machine Learning.
Broadly speaking, machines can learn in three ways. These form three
categories of Machine Learning. These are: Supervised Learning,
Unsupervised Learning and Reinforcement Learning. Supervised learning
uses labelled datasets. Labelled data is data in which every example comes with a name or type (its label).
Unsupervised learning involves finding a pattern in data. Thus unsupervised
learning segregates data in clusters or groups. These clusters or groups are
unlabeled. Reinforcement learning works on the principle of reward and
punishment. In other words, Reinforcement learning builds its prediction
model by gaining feedback from random trial and error and leveraging insight
from previous iterations. There are Machine Learning Algorithms
corresponding to these Machine Learning categories. Support Vector Machine
is a supervised machine learning algorithm. K-means clustering is an
unsupervised machine learning algorithm. The performance of such traditional
models tends to level off as the size of the dataset increases, whereas deep
learning methods continue to improve with the increasing size of the dataset.
Machine Learning is a rapidly growing field of computer science with wide
applications in search engines, recommendation systems, spam filters, etc.
Objectives
In this unit, you will learn the fundamentals of Machine Learning with proper
examples and illustrations. After reading this unit, you will be able to:
• Understand the concept of Machine Learning
• Discuss various types of Machine Learning
• Appreciate various Machine Learning Algorithms
• Understand the concept of neural networks and deep learning
• Explore some frameworks for Machine Learning in Python

8.2 WHAT IS MACHINE LEARNING?


Machine learning is a branch of artificial intelligence (AI) and computer
science which focuses on the use of data and algorithms to imitate the way that
humans learn, gradually improving in accuracy as more data is seen. Machine
learning algorithms are a class of computational algorithms where computers are
not programmed to do a task explicitly but rather “learn” how to perform the
task. Think of it in the following fashion: in classical programming, you write
a function f that, for any given input x, produces an output y = f(x).

However, a machine learning model tries to figure out the function f given x
and y. This approach is similar to how humans figure out things. They observe
the cause and effect and try to figure out how it happened. In other words,
machine learning tries to figure out the rule that connects the input and output.
This approach works wonders when a computer is tasked with performing non-
trivial tasks which do not have a set defined rule, such as recognizing a human
face, differentiating between different cat species, and talking to humans
(Remember Siri and Alexa?).
Another interesting non-trivial task is playing games such as chess and Go, at
which humans seem to excel. Computers, however, have historically had a tough
time understanding and evaluating such games. For a long time, computers relied
on brute force calculations and tabulated data to try and outperform human
players. For example, when Deep Blue defeated the World Chess Champion
Garry Kasparov in 1997, it relied heavily on an opening book, a database of
about 70,000 Master Level games, and Good Old-Fashioned AI algorithms like
alpha-beta search.
Cut to the present: AlphaZero, developed by Google’s DeepMind, is ruling the
chess world. It is an AI-based system that learns chess purely by playing
against itself, and it consistently outperforms Grandmasters and conventional
chess engines.
In Go (a game quite popular in Korea), the number of possible scenarios grows
even faster than in chess. This makes brute force techniques even harder to use,
and the role of intuition becomes much more important. Google’s AlphaGo
defeated the reigning world champion of Go, again showing how superior AI and
Machine Learning models can be compared to conventional algorithms.
A more business-like application of machine learning is in building
recommendation systems. Websites like Netflix and YouTube rely on
recommendation engines to show relevant results to users from an almost
infinite set of possibilities.
Another widely used application of Machine Learning is spam detection.
Google and other companies classify emails as Spam or Not-Spam based on
certain patterns found in them. Machine Learning has been applied to
this task with great success.
Check Your Progress 1
In this section, you studied “What is Machine Learning?”, now answer the
questions given in Check Your Progress-1.
Note: a) Write your answer in about 50 words
b) Check your answer with possible answers given at the end of the unit
(1) Briefly explain the idea of Machine Learning.
-----------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------
(2) Mention currently used applications based on Machine Learning.
-----------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------

8.3 TYPES OF MACHINE LEARNING


Typically machine learning is divided into three categories:
i. Supervised Learning
ii. Unsupervised Learning
iii. Reinforcement Learning
8.3.1 Supervised Learning
Supervised learning utilizes a “Labelled Dataset”, i.e., a dataset that contains
features (the data) together with their corresponding labels. For example, the
MNIST dataset has images of handwritten digits. Each image has a corresponding
integer label indicating the digit shown in the image. The images and their
corresponding labels are shown in Figure 8.1.

Figure 8.1: MNIST Dataset Showing Images and Corresponding Labels

A supervised machine learning algorithm learns the mapping from image to
label. Once the model is trained, it can predict the labels even for previously
unseen images. Note that the features need not be images; these can be any set
of attributes or characteristics. Image classification is one of several tasks that
supervised learning can accomplish.
8.3.2 Unsupervised Learning
Unsupervised learning involves finding patterns in data. It utilizes data that is
not labelled and attempts to find patterns by clustering similar objects together.
For example, to protect customers from fraudulent online activities, a company
may use either supervised learning or unsupervised learning. Unsupervised
learning can detect new patterns of fraudulent practice by clustering the data,
even when no labelled examples of those patterns are available.
Since unsupervised learning does not involve labels, large volumes of data that
are hard to label can be processed to get insights.
8.3.3 Reinforcement Learning
Reinforcement Learning models work on the principle of reward and
punishment. In reinforcement learning, an agent is allowed to interact with the
environment; it is rewarded if its action is deemed correct and penalized
otherwise. The goal of reinforcement learning is to maximize the cumulative
reward.
Reinforcement Learning (RL) models differ from Supervised and
Unsupervised Learning in the sense that there is no fixed dataset given in
advance from which patterns have to be recognized or values predicted; the
agent learns from the feedback it receives while interacting with the
environment. RL models are very useful in industrial applications where the
model can learn to perform complicated tasks and optimize the process. Other
applications include self-driving cars.
To understand reinforcement learning, one has to understand the following:
Environment: Think of the environment as a board game in which landing on
a particular square gives a reward or penalty. Thus the environment is a
collection of entities with which an agent can interact.
Agent: An agent is the entity which performs certain actions by interacting
with the environment and has a goal of maximizing its rewards.
Reward: Reward is a numerical incentive given to the agent for performing a
certain task.
Check Your Progress 2
In this section, you studied different types of Machine Learning, now answer
the questions given in Check Your Progress-2.
Note: a) Write your answer in about 50 words
b) Check your answer with possible answers given at the end of the unit
(1) Discuss different types of Machine Learning.
-----------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------
(2) How is Reinforcement Learning different from Supervised Learning?
-----------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------
(3) What is a Labelled Dataset?
-----------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------

8.4 MACHINE LEARNING ALGORITHMS


8.4.1 K-means Clustering
K-means clustering is an unsupervised learning algorithm, which attempts to
partition the set of given input data into k partitions by minimizing the
variance within each partition. In Figure 8.2, there are three clusters shown in
three different colours. K-means clustering will find three centroids
corresponding to each cluster and try to correctly allocate each data point to
the corresponding centroid by forming three sets from the given data points.

Figure 8.2: Dataset for K-means Clustering

(The colours are for readers’ convenience. The dataset does not have any prior
information regarding the cluster.)
The simplest way of doing this is using the naive k-means algorithm. Start
with some arbitrary values for the centroids ($\mu_1, \mu_2, \ldots, \mu_k$).
Then allocate each point to its nearest centroid. Next, update each centroid to
the mean of the data points allocated to it. Repeat the allocation and update
steps until the allocations no longer change.
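The naive k-means procedure can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function name `kmeans`, the toy points, and the choice of starting from k randomly picked data points are assumptions made here, not part of the unit.

```python
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Naive k-means on 2-D points; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)   # arbitrary start: k of the data points
    labels = None
    for _ in range(max_iters):
        # Allocation step: index of the nearest centroid for every point.
        new_labels = [
            min(range(k),
                key=lambda c: (px - centroids[c][0]) ** 2 + (py - centroids[c][1]) ** 2)
            for px, py in points
        ]
        if new_labels == labels:        # allocations unchanged: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                 # an empty cluster keeps its old centroid
                centroids[c] = (sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members))
    return centroids, labels

# Hypothetical toy data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```

On data this well separated the two groups end up in different clusters; on harder data the result can depend on the starting centroids, which is why practical implementations usually try several starts.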
8.4.2 Linear and Logistic Regression
Regression is a widely used method to establish a relation between input and
output variables. In linear regression, the relation between the input and output
is assumed to be linear in nature (while this might seem too restrictive, there
are indeed a lot of problems that can be mapped to this simple case), i.e.,
$$y = ax + b$$
($a$, $b$ are unknown but fixed constants)
A formal presentation of the problem at hand can be done in the following
manner.
Given a set of ordered pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, the
goal is to find the constants $a$ and $b$ that characterize a function
$f(x) = ax + b$ such that the total error $\sum_{i=1}^{n} L(f(x_i), y_i)$
($L$ is a loss function) is minimized. A commonly used loss function is the
squared error; averaged over the data it gives the mean square error, which is
defined as
$$\frac{1}{n}\sum_{i=1}^{n} \left(f(x_i) - y_i\right)^2$$
(the factor $1/n$ is sometimes omitted, since it does not change the minimizing
$a$ and $b$).

Figure 8.3: Linear Regression Fit for the Plotted Points

The above statement can be understood visually as shown in Figure 8.3.
Here n (input, output) pairs are given, and the task is to find the line which
minimizes the sum of the squared vertical distances of the points from the line.
This can be done using the gradient descent algorithm (refer to Section 8.6 for
details). Start with some random values for $a$ and $b$ and then calculate the
loss. Then find the gradient of the loss and update the parameters in the
direction that reduces the loss. Do so until the error is below the acceptable
limit or a certain number of iterations have been done.
Repeat
$$a = a - \alpha \frac{\partial L}{\partial a}$$
$$b = b - \alpha \frac{\partial L}{\partial b}$$
Until
$L < \epsilon$, where $\epsilon$ is the acceptable error limit and $\alpha$ is
the step size (the learning rate, discussed in Section 8.5.2).
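The loop above can be sketched in plain Python as follows. The function name `fit_line`, the step size `lr`, the starting point of zero (chosen for reproducibility instead of random values), and the toy data are all assumptions made for illustration.

```python
def fit_line(xs, ys, lr=0.05, eps=1e-6, max_iters=10000):
    """Fit y = a*x + b by gradient descent on the mean square error."""
    a, b = 0.0, 0.0          # fixed start for reproducibility (the text uses random values)
    n = len(xs)
    for _ in range(max_iters):
        errors = [a * x + b - y for x, y in zip(xs, ys)]
        loss = sum(e * e for e in errors) / n
        if loss < eps:       # stop once the error is below the acceptable limit
            break
        # Partial derivatives of the mean square error with respect to a and b.
        grad_a = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b, loss

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]   # points lying exactly on y = 2x + 1
a, b, loss = fit_line(xs, ys)
print(round(a, 2), round(b, 2))
```

The step size matters: with a much larger `lr` on this data the updates would overshoot and the loss would grow instead of shrinking.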
However, there is another class of problem that can be approached in a similar
manner: binary classification. Here the output is binary
(i.e. $y_i \in \{0, 1\}$) instead of being a continuous variable. One simple
approach is to fit a straight line to the data, define a threshold value $y_0$,
and classify everything greater than $y_0$ as 1 and everything less than $y_0$
as 0.
Figure 8.4: Sigmoid Function

However, a better idea is to use a “Sigmoid” function on top of the linear
model, as shown in Figure 8.4. Mathematically, the prediction made by the
model is
$$y = \frac{1}{1 + e^{-z}} \qquad \text{(i)}$$
where
$$z = ax + b$$
Equation (i) represents the logistic regression model. The output of the sigmoid
function can be interpreted as the probability or the confidence that a given
data point has the label 1. There is one last catch before we can fully
implement the logistic regression model, that is, the mean squared error (one
used in linear regression) cannot be used here as it leads to non-convex
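optimization problems. So, finally, a Log Loss (also called Cross-Entropy Loss)
function is used, which is defined as:
$$L = -\sum_{i=1}^{n} \left[ y_i \log y'_i + (1 - y_i) \log(1 - y'_i) \right]$$
where $y_i$ is the actual label and $y'_i$ is the prediction of the model for
the i-th data point.
Gradient descent is then used to optimize the parameters $a$ and $b$ and
minimize the loss function.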
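A minimal sketch of logistic regression trained with the log loss is given below. The function names, the step size `lr`, the iteration count, and the four toy data points are assumptions for illustration; the gradient expressions in the comments are the standard derivatives of the log loss for a sigmoid model.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(ys, preds, tiny=1e-12):
    """Cross-entropy summed over the data; `tiny` guards against log(0)."""
    return -sum(y * math.log(max(p, tiny)) + (1 - y) * math.log(max(1 - p, tiny))
                for y, p in zip(ys, preds))

def fit_logistic(xs, ys, lr=0.1, iters=500):
    a, b = 0.0, 0.0
    for _ in range(iters):
        preds = [sigmoid(a * x + b) for x in xs]
        # For the log loss, dL/da = sum((p - y) * x) and dL/db = sum(p - y).
        a -= lr * sum((p - y) * x for p, y, x in zip(preds, ys, xs))
        b -= lr * sum(p - y for p, y in zip(preds, ys))
    return a, b

xs = [-2.0, -1.0, 1.0, 2.0]      # hypothetical inputs
ys = [0, 0, 1, 1]                # label 1 for positive inputs
a, b = fit_logistic(xs, ys)
initial = log_loss(ys, [0.5] * len(xs))                 # loss at a = b = 0
final = log_loss(ys, [sigmoid(a * x + b) for x in xs])
print(initial, final)
```

After training, `sigmoid(a * x + b)` can be read as the model's confidence that the input `x` has the label 1.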
Bias-variance Trade-off
Bias: A model is said to have a high bias when it underfits the data, that is,
when the model does not correctly learn the relationship between the input data
and the output. This might happen because the model is too simple, or because
the model makes assumptions that simply aren’t true.
Variance: When the model overfits, it is said to have high variance. In this
case, the model is unable to generalize to unseen data and performs poorly on
the test data while scoring very well on the training data. Generally, this
happens when the model is too complex, or there is too little data to train on.
8.4.3 Support Vector Machines

Figure 8.5: Optimal Hyperplane Separating Two Classes

Support Vector Machines (SVM) are robust classifiers that are great at
separating data that have a complex decision boundary. An SVM finds a
hyperplane, as shown in Figure 8.5, that separates the two classes such that it
maximizes the distance from the nearest data points (these points are known as
the “Support Vectors”, hence the name).
In Figure 8.5, there are two classes of points, green and blue. The SVM finds
$w$ and $b$ such that the line (or, more generally, the hyperplane) separating
the two classes maximizes its distance from the points nearest to it.
Mathematically we are interested in finding the hyperplane,
$$w \cdot x - b = 0$$
which is mid-way between the hyperplanes containing the support vectors, i.e.,
$$w \cdot x - b = 1$$
$$w \cdot x - b = -1$$
such that the distance between them,
$$d = \frac{2}{\lVert w \rVert}$$
is maximized.
The intuition behind using an SVM is that points that are closest to the other
group are more important to consider while making a boundary than those which
are not. Further, the best boundary that separates the two groups is the one
that maximizes the distance to these nearby points. As can be seen intuitively
in Figure 8.6, the red line is the best boundary.
Figure 8.6: Possible Hyperplanes Separating the Two Classes

The real power of SVM is, however, unleashed when the “Kernel Trick” is
used. In this technique, the data is embedded into a higher dimensional space.
By such embedding, the data which is not linearly separable in the original
space becomes linearly separable in the higher dimensional space. The SVM
then goes on to find the optimal hyperplane separating the data.
Embedding the data to a higher dimensional space basically means we add
more dimensions to the data that are derived from the original data. For
example, as shown in Figure 8.7, there is no hyperplane (here, a straight line)
separating the two classes in the original space. However, if we introduce a
third variable
$$z = x^2 + y^2$$
we get an embedding in a 3-D space. When we plot $(x_i, y_i, z_i)$, we can clearly
see that the purple points are higher up compared to the red points; thus, a
horizontal plane can separate the two.
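The effect of this embedding can be checked numerically. The sketch below uses two hypothetical classes, one on a small circle and one on a larger circle around it (so that no straight line separates them in the plane), and shows that the added coordinate $z = x^2 + y^2$ separates them with a single threshold. The helper names and radii are assumptions for illustration.

```python
import math

def ring(radius, n=12):
    """n points evenly spaced on a circle of the given radius."""
    return [(radius * math.cos(2 * math.pi * i / n),
             radius * math.sin(2 * math.pi * i / n)) for i in range(n)]

inner = ring(1.0)   # hypothetical class 1: points close to the origin
outer = ring(3.0)   # hypothetical class 2: points surrounding class 1

def embed(point):
    x, y = point
    return (x, y, x * x + y * y)   # third coordinate z = x^2 + y^2

z_inner = [embed(p)[2] for p in inner]
z_outer = [embed(p)[2] for p in outer]
threshold = (max(z_inner) + min(z_outer)) / 2   # the horizontal plane z = threshold
print(max(z_inner), min(z_outer), threshold)
```

Every point of the inner class lies below the plane and every point of the outer class lies above it, so the classes are linearly separable in the embedded space.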

Figure 8.7: Effect of Embedding Data in a Higher Dimensional Space


(Source: Wikipedia)

Check Your Progress 3


In this section, you studied Machine Learning algorithms. Now answer the
questions given in Check Your Progress-3.
Note: a) Write your answer in about 50 words
b) Check your answer with possible answers given at the end of the unit
(1) Briefly describe the iterative K-means clustering algorithm.
-----------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------
(2) Describe different loss functions in linear and logistic regression.
-----------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------
(3) Describe how embedding data in higher dimensional space is useful for
SVM?
-----------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------

8.5 NEURAL NETWORKS AND DEEP LEARNING


As the size of data scales up, traditional machine learning algorithms stop
improving in performance. However, in this regime, deep learning models continue
to scale up in performance with the data, making them very relevant in the
current scenario when a humongous volume of data is generated daily (think
Facebook, YouTube, satellite data). In this section, we will look at how neural
networks work. Neural networks are inspired by the human brain, which has
billions of neurons working together. We will first look at the perceptron,
which is the building block of the neural network, and then at neural networks
themselves. Finally, we will explore deep learning, including models that are
designed for image processing (Convolutional Neural Networks) and models
designed for language processing (Recurrent Neural Networks).
8.5.1 The Perceptron

Figure 8.7: Perceptron

A perceptron is a single layer neural network, or in other words, it is the
building block of neural networks. A perceptron takes an input vector,
multiplies it with a parameter vector and then applies an activation function to
get the result. That might be a lot to take in at one go, so let us look at
these one by one:
Input Vector: It is the input to the perceptron. It may be from the training set
from which the perceptron has to learn or from a test set in which case a
perceptron has to predict. For example, in an image classification problem, the
image pixel values can be the input.
Parameter Vector: The parameter vector is the key to the perceptron’s
learning ability. The parameter vector is optimized (or learned) during the
training of the model.
Activation Function: Finally, an activation function is applied to map the
output to the desired range. For the example of image classification, we might
want to map the output to zero and one for negative and positive samples,
respectively. Some commonly used activation functions are: Sigmoid
Activation Function, ReLU (Rectified Linear Unit), Leaky ReLU, and tanh.
These will be discussed as and when they are used.
A simple implementation of the perceptron model in Python (refer to Section 8.7
for setting up Python):
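A minimal sketch of such a perceptron is shown here. It assumes a sigmoid activation, and the input and weight values are illustrative numbers chosen for this sketch rather than values from the unit.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def perceptron(x, w, activation=sigmoid):
    """Weighted sum of the inputs followed by an activation function."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x))
    return activation(z)

x = [1.0, 0.5, -1.5]      # illustrative input vector
w = [0.4, -0.2, 0.1]      # illustrative parameter (weight) vector
output = perceptron(x, w)
print(output)
```

Passing a different function as `activation` (for example one that returns `max(0, z)`) turns the same code into a ReLU perceptron.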

Feel free to change the input data and weights to see how the output changes.
Exercise: Generate a plot for the ReLU activation function.
(Hint: The ReLU function is defined as $\max(0, z)$)
Ans.

Figure 8.8: ReLU Activation Function Plot

To summarize mathematically:
$$\text{Inputs: } x = [x_1, x_2, \ldots, x_n]^T$$
$$\text{Parameters: } w = [w_1, w_2, \ldots, w_n]^T$$
$$z = w^T x$$
$$y = \sigma(z)$$
$$\text{Loss} = \frac{1}{2}\left(y' - y\right)^2$$
$y'$ is the actual label, $y$ is the predicted label, and $\sigma$ is the activation function.
Loss is generally taken to be mean square error or cross entropy loss.
8.5.2 Neural Network
One perceptron isn’t very powerful on its own. To build a more powerful
classifier, we combine a lot of perceptrons so that they can learn even more
complex decision boundaries; this is called a neural network. A simple Google
search would return the familiar image of a neural network shown in
Figure 8.9. One can instantly recognize that it’s just a lot of
perceptrons, each having its own input vector, parameters and
activation function. The output from one layer is passed on to the next as input.
This layered structure gives rise to some fairly obvious terminology:

Figure 8.9: Neural Network

Input Layer: The input layer is basically the input data itself, reshaped to a
suitable shape.
Hidden Layer: Hidden layer is where all of the computation and learning
occurs.
Output Layer: It finally converts the output of the hidden layer to the
desired output such as {0,1} in the case of binary classification.
Loss Function:
The goal of training the neural network is to minimize the loss function. For
binary classification problems we can use the same loss function as in the case
of logistic regression, i.e., binary cross entropy. Mean Squared Error is used
for regression problems.
Backpropagation
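The neural network learns by minimizing the loss function. To minimize the
loss function, the gradient descent algorithm is used. For that, there is a need to
find gradients with the help of the chain rule. Differentiating the equations
developed for the perceptron using the chain rule,
$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial y}\,\frac{\partial y}{\partial z}\,\frac{\partial z}{\partial w}$$
Notice that $L$ is the loss, which is a function of $y$. In turn, $y$ is a function of $z$,
which is a function of $w$. We can compute each derivative term separately.
With $L = \frac{1}{2}(y' - y)^2$,
$$\frac{\partial L}{\partial y} = y - y'$$
$$\frac{\partial y}{\partial z} = \sigma'(z)$$
If $\sigma$ is ReLU,
$$\frac{\partial y}{\partial z} = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z < 0 \end{cases}$$
And finally
$$\frac{\partial z}{\partial w} = x$$
Combining,
$$\frac{\partial L}{\partial w} = (y - y')\,\sigma'(z)\,x$$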
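One way to gain confidence in a gradient formula of this kind is to compare it with a numerical estimate. The sketch below does this for a single perceptron with the loss $\frac{1}{2}(y' - y)^2$; it assumes a sigmoid activation (whose derivative is $\sigma(z)(1 - \sigma(z))$), and the weights, inputs, and target are illustrative values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, x, target):
    y = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return 0.5 * (target - y) ** 2

def analytic_gradient(w, x, target):
    """Chain rule: dL/dw = (y - target) * sigma'(z) * x."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    y = sigmoid(z)
    dy_dz = y * (1.0 - y)                       # derivative of the sigmoid
    return [(y - target) * dy_dz * xi for xi in x]

def numeric_gradient(w, x, target, h=1e-6):
    """Central finite differences, one weight at a time."""
    grads = []
    for i in range(len(w)):
        up = w[:i] + [w[i] + h] + w[i + 1:]
        down = w[:i] + [w[i] - h] + w[i + 1:]
        grads.append((loss(up, x, target) - loss(down, x, target)) / (2 * h))
    return grads

w, x, target = [0.4, -0.2, 0.1], [1.0, 0.5, -1.5], 1.0
print(analytic_gradient(w, x, target))
print(numeric_gradient(w, x, target))
```

The two printed lists agree to several decimal places, which is what the chain-rule derivation predicts.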
Now in a neural network, there are several layers of perceptrons. For the
output layer, the above formula can be modified to,
$$\frac{\partial L}{\partial w} = (y - y')\,\sigma'(z)\,x_h$$
where $x_h$ is the output of the hidden layer and the input of the output layer.
For any other layer, the gradient depends on the layer that follows it. In other
words, the gradient of a layer determines the gradient of the layer before it, hence
the name backpropagation.
$$\frac{\partial L}{\partial w_h} = (y - y')\,\sigma'(z)\,w\,\frac{\partial x_h}{\partial w_h}$$
where,
$$x_h = \sigma\left(w_{h,j}^T\, x_{h,j}\right)$$
The subscript $j$ denotes each perceptron in the hidden layer.
Thus,
$$\frac{\partial L}{\partial w_h} = \delta_0\,\sigma'(z_h)\,x_{h,j}$$
where $z_h = w_{h,j}^T x_{h,j}$ and $\delta_0 = (y - y')\,\sigma'(z)\,w$ is the term that is carried over from the output layer.
Finally, all parameters are updated,
$$w_{i,j} = w_{i,j} - \alpha \frac{\partial L}{\partial w_{i,j}}$$
$\alpha$ is called the learning rate; it determines by how much the parameters are
updated. A very small learning rate will make the learning slow, while a very high
learning rate may cause the model to overshoot the minimum loss point and diverge.
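The effect of the learning rate can be seen on a toy problem. The sketch below is not a neural network: it assumes the simple one-parameter loss $L(w) = w^2$ (with gradient $2w$ and minimum at $w = 0$), and the three learning rates are illustrative choices.

```python
def descend(lr, w=1.0, steps=20):
    """Gradient descent on the toy loss L(w) = w**2, whose gradient is 2*w."""
    for _ in range(steps):
        w = w - lr * 2 * w
    return w

slow = descend(lr=0.01)     # small steps: still far from the minimum at w = 0
good = descend(lr=0.4)      # moderate steps: very close to the minimum
diverged = descend(lr=1.1)  # too large: every step overshoots and |w| grows
print(slow, good, diverged)
```

With the smallest rate the parameter has barely moved after 20 steps, with the moderate rate it is practically at the minimum, and with the largest rate it ends up further from the minimum than where it started.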
8.5.3 Deep Learning
A Neural Network can have any number of neurons (aka perceptrons) in the
hidden layer, and there can be several hidden layers. Neural Networks
that have a lot of hidden layers are termed deep neural networks. As the number
of layers in a neural network increases, its capacity to learn more complex models
increases. Thus, larger and more complex datasets generally call for deeper networks.
However, as the depth of the network increases, the computational resources
required to train the model go up considerably. Thus it is preferable to keep
the network just deep enough to fit the data while avoiding problems like
overfitting and resource wastage. Two of the most basic deep learning models are
Convolutional Neural Networks and Recurrent Neural Networks, used in image
processing and Natural Language Processing, respectively.
Convolutional Neural Networks
Convolutional neural networks (CNN) use “Convolutional layers” along with
Dense Layers (the normal neural network layers we studied in previous
section).
Convolutional Layers: Convolutional layers use filters to scan the image. A
filter is basically an N×N array where N is significantly less than the
dimension of the input image. The filter is placed at the start of the image, and
a dot product of the filter and the overlapping part of the image is calculated
(corresponding entries are multiplied and the products are summed).
$$A = \begin{pmatrix} 0.8 & 2.4 & 2.5 \\ 2.4 & 0 & 4.0 \\ 1.1 & 2.4 & 0.8 \end{pmatrix} * \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix} = 16.4$$
This gives us the first output value. The filter is then shifted by a fixed number
of pixels, called the “stride”, and the dot product is evaluated again. Repeating
this over the whole image gives us the output of the convolutional layer.

Figure 8.10: An Image and a Convolution Filter.

When multiple filters are used, we can stack the outputs
back to back, so if we have k filters (kernels) each giving an M×M output, we would
have a k×M×M output. Convolutional layers help the model to learn the
correlation between neighbouring pixels and identify structures and patterns
such as eyes on a face and so on.
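The scanning operation of a convolutional layer can be sketched in plain Python. The function name `conv2d`, the assumption of a square filter with no padding, and the small 4×4 image and 2×2 filter are illustrative choices; the 3×3 patch and the all-ones filter are the ones from the example.

```python
def conv2d(image, kernel, stride=1):
    """Slide a square `kernel` over `image` (lists of lists); no padding is applied."""
    n = len(kernel)
    rows = (len(image) - n) // stride + 1
    cols = (len(image[0]) - n) // stride + 1
    output = []
    for r in range(rows):
        row = []
        for c in range(cols):
            top, left = r * stride, c * stride
            # Dot product of the filter with the overlapping image patch.
            row.append(sum(image[top + i][left + j] * kernel[i][j]
                           for i in range(n) for j in range(n)))
        output.append(row)
    return output

patch = [[0.8, 2.4, 2.5],
         [2.4, 0.0, 4.0],
         [1.1, 2.4, 0.8]]
ones = [[1, 1, 1]] * 3
print(conv2d(patch, ones))        # a single value: the sum of the nine entries

image = [[1, 2, 3, 0],            # hypothetical 4x4 image
         [0, 1, 2, 3],
         [3, 0, 1, 2],
         [2, 3, 0, 1]]
edge = [[1, -1],                  # hypothetical 2x2 filter
        [1, -1]]
print(conv2d(image, edge))        # 3x3 output with stride 1
print(conv2d(image, edge, stride=2))   # 2x2 output with stride 2
```

A larger stride moves the filter in bigger jumps, so the output becomes smaller.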
Another important type of layer used in CNNs is the Pooling Layer, which
comes in two varieties: the MaxPooling Layer and the AveragePooling Layer.
Pooling Layers: Pooling layers work in a manner similar to the convolutional
layers. They scan the image using a small N×N window, but instead of taking the
dot product between a filter and the image, they pick out the maximum value
from the overlapping part of the image in the case of max pooling, or return
the average value in the case of average pooling.
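Both kinds of pooling can be sketched with one small function. The sketch assumes non-overlapping windows (the window moves by its own size at each step), which is a common choice but not one stated in the text; the function names and the 4×4 image are illustrative.

```python
def pool2d(image, size, reducer):
    """Apply `reducer` to each non-overlapping size x size window of `image`."""
    output = []
    for top in range(0, len(image) - size + 1, size):
        row = []
        for left in range(0, len(image[0]) - size + 1, size):
            window = [image[top + i][left + j]
                      for i in range(size) for j in range(size)]
            row.append(reducer(window))
        output.append(row)
    return output

def average(values):
    return sum(values) / len(values)

image = [[1, 2, 5, 6],
         [3, 4, 7, 8],
         [9, 8, 3, 2],
         [7, 6, 1, 0]]
print(pool2d(image, 2, max))       # [[4, 8], [9, 3]]
print(pool2d(image, 2, average))   # [[2.5, 6.5], [7.5, 1.5]]
```

Max pooling keeps the strongest response in each window, while average pooling smooths the window into a single value; both shrink a 4×4 input to 2×2 here.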
Recurrent Neural Networks

Figure 8.11: RNN Model

Recurrent neural networks (RNN) are useful in cases where we have sequential
data, such as language, music and so on. RNNs take a sequential input:
$x = (x_0, x_1, \ldots, x_n)$. This could be a sentence, for example, “This is a book
about neural networks”. Now, of course, to use a model such as an RNN there is a
need to encode this sentence into numeric values. So, each word in the
vocabulary is mapped to a number. For example,
This → $x_0$
is → $x_1$
a → $x_2$
book → $x_3$
about → $x_4$
neural networks → $x_5$

Now, the way an RNN works is by passing the value at each time-step, $x_i$, to a
neural network, which returns two values, $y_i$ and $a_i$, as shown in Figure 8.11.
$y_i$ is the output for that time-step and $a_i$ is passed over to the next
time-step. Then $a_i$ is taken as the input along with $x_{i+1}$ to give
$y_{i+1}$ and $a_{i+1}$. Thus, an RNN utilizes the information of the previous
time-step to generate the output for the current time-step, something that is
desired when sequential information is being processed.
Let’s look at the mathematical description of an RNN.

Figure 8.12: Single RNN Unit

$$a_i = g\left(W_{ax} x_i + W_{aa} a_{i-1} + b\right)$$
Or, compactly written as
$$a_i = g\left(W_a [a_{i-1}, x_i] + b\right)$$
where,
$$W_a = [W_{aa}, W_{ax}]$$
And,
$$y_i = h\left(W_y a_i + b_y\right)$$
$h$ and $g$ are the activation functions; $g$ is usually chosen as $\tanh$ or ReLU.
$b$ and $b_y$ are bias terms.
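The recurrence can be made concrete with a scalar sketch, in which the state, the input, and all weights are single numbers rather than vectors and matrices. The names `w_ax`, `w_aa`, `w_y`, `b`, and `b_y`, the choice of $\tanh$ for the state and the identity for the output activation, and the numeric values are all assumptions made for illustration.

```python
import math

def rnn_forward(xs, w_ax, w_aa, w_y, b=0.0, b_y=0.0, a_prev=0.0):
    """Run a scalar RNN over the sequence xs; returns (states, outputs)."""
    states, outputs = [], []
    for x in xs:
        a = math.tanh(w_ax * x + w_aa * a_prev + b)   # new state (g = tanh)
        y = w_y * a + b_y                             # output (h = identity here)
        states.append(a)
        outputs.append(y)
        a_prev = a                                    # passed to the next time-step
    return states, outputs

# Only the first input is non-zero, yet later states are non-zero as well.
states, outputs = rnn_forward([1.0, 0.0, 0.0], w_ax=0.5, w_aa=0.8, w_y=2.0)
print(states)
print(outputs)
```

The later states are non-zero only because the state from the first time-step is carried forward, and they shrink at every step, which hints at why information from distant time-steps is hard for a simple RNN to retain.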
Simple RNNs struggle to learn long-distance connections in the data. Consider,
for example, “Sachin Tendulkar is one of the greatest cricket players, he has
over 14,000 runs.” In this sentence we expect the RNN to be able to figure out
that “he” refers to Sachin Tendulkar. In practice, however, RNNs do poorly when
the gap between two time-steps increases and are unable to make the connection.
Check Your Progress 4
In this section, you studied neural networks and deep learning, now answer the
questions given in Check Your Progress-4.
Note: a) Write your answer in about 50 words
b) Check your answer with possible answers given at the end of the unit
(1) Define input layer, hidden layer and output layer for neural network.
-----------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------
(2) What advantages does deep learning have over other algorithms such as
Support Vector Machine?
-----------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------
(3) What is the major problem associated with simple RNNs?
-----------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------

8.6 MATHEMATICS FOR MACHINE LEARNING


A basic knowledge of linear algebra and vector calculus is
needed to understand machine learning algorithms, where input data is usually
given as a numerical vector.
8.6.1 Introduction to Linear Algebra
Vectors in physics are quantities that have a magnitude and a direction.
However, if you look closely, you’ll realize they are just an ordered set of three
numbers $v = (x, y, z)$. The magnitude of the vector is given by
$|v| = \sqrt{x^2 + y^2 + z^2}$. The magnitude of the vector is also known as the “norm of
the vector” (more precisely the L2 norm, as we will see). Another operation that is
defined between two vectors is the dot product, which is defined as
$v_1 \cdot v_2 = x_1 x_2 + y_1 y_2 + z_1 z_2$. Let’s represent the vector by a column matrix
$$v = \begin{pmatrix} x \\ y \\ z \end{pmatrix}$$
Matrix
A matrix is a rectangular array of numbers. An $m \times n$ matrix has m rows and n
columns. For example:
$$A = \begin{pmatrix} 1 & 2 & 5 \\ 3 & 7 & 9 \\ 5 & 9 & 3 \\ 1 & 4 & 8 \end{pmatrix}$$
is a 4×3 matrix.
A matrix can be used to write and solve a system of linear equations in a
compact form. Consider the following system of equations
$$2x_1 + 3x_2 + 4x_3 = 5$$
$$7x_1 + 1x_2 + 2x_3 = 5$$
$$12x_1 + 7x_2 + 9x_3 = 4$$
It can be written compactly as
$$\begin{pmatrix} 2 & 3 & 4 \\ 7 & 1 & 2 \\ 12 & 7 & 9 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} 5 \\ 5 \\ 4 \end{pmatrix}$$
Thus, any general linear system of equations can be written as $Ax = b$, where
$A$ is the matrix of coefficients, $x$ is the vector of unknowns, and $b$ is the vector of
constant terms.
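Once a system is written in matrix form, it can be solved mechanically. The sketch below solves the system above by Gaussian elimination with partial pivoting; this particular method and the function name `solve` are choices made here for illustration (the unit does not prescribe a solution method), and in practice a numerical library would normally be used.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    # Augmented matrix [A | b], copied so the inputs are left untouched.
    M = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        # Swap in the row with the largest pivot to limit rounding error.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular (or nearly so)")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution, from the last unknown to the first.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - known) / M[r][r]
    return x

A = [[2, 3, 4], [7, 1, 2], [12, 7, 9]]
b = [5, 5, 4]
x = solve(A, b)
# Substituting the solution back should reproduce b (up to rounding).
residual = [sum(A[i][j] * x[j] for j in range(3)) - b[i] for i in range(3)]
print(x, residual)
```

The residual printed at the end is a simple check: its entries are essentially zero when the computed x satisfies all three equations.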
The identity matrix, akin to the number 1 in the real number system, is the
multiplicative identity for matrices.
$$I = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$
A matrix $B$ is the inverse of a matrix $A$ if their product yields the identity
matrix.
$$AB = I,$$
$$B = A^{-1}$$
Transpose of a matrix
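The transpose of a matrix is obtained by exchanging the rows and columns of a
matrix.
$$A = \begin{pmatrix} 1 & 2 & 5 \\ 3 & 7 & 9 \\ 5 & 9 & 3 \\ 1 & 4 & 8 \end{pmatrix}$$
$$A^T = \begin{pmatrix} 1 & 3 & 5 & 1 \\ 2 & 7 & 9 & 4 \\ 5 & 9 & 3 & 8 \end{pmatrix}$$
Note that because of the rules of matrix multiplication, an inverse exists only for
square matrices, i.e., matrices which have an equal number of rows and
columns (and even then, not every square matrix has an inverse).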
Let’s get back to vectors. We mentioned that we can represent vectors as
column matrices. The multiplication of a vector $v$ by a 3×3 matrix $M$ would
give another vector
$$v' = \begin{pmatrix} x' \\ y' \\ z' \end{pmatrix}$$
This matrix would transform any given vector into
some other vector and hence is also known as a “linear transformation”. For
a matrix $M$ in general, there are some vectors which are transformed to the
same vector scaled by a constant. Mathematically, $Mv = \lambda v$. These vectors are
called the eigenvectors of the matrix, and the scaling constant $\lambda$ is called the
eigenvalue.
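Following is an example of eigenvectors and eigenvalues of a matrix:
Consider the following 2x2 matrix
$$\begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$$
Notice what happens when the matrix is multiplied with the following vectors,
$$\begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} \begin{pmatrix} 1 \\ 1 \end{pmatrix} = 3 \begin{pmatrix} 1 \\ 1 \end{pmatrix}$$
$$\begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} \begin{pmatrix} 1 \\ -1 \end{pmatrix} = \begin{pmatrix} 1 \\ -1 \end{pmatrix}$$
The same vector is returned but with a constant factor (3 in the first case and
1 in the second). These special vectors are the eigenvectors, and the scaling
constants are the eigenvalues.
Verify using examples that vectors which are not scalar multiples of these two
do not obey this property for the given matrix. Also, note that eigenvectors
have to be non-zero; an eigenvalue, on the other hand, may be zero.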
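The example can be checked numerically. This is a small sketch using `np.linalg.eigh`, which is suitable here because the matrix is symmetric; numpy returns unit-length eigenvectors, so they appear as scalar multiples of (1, -1) and (1, 1).

```python
import numpy as np

M = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh is meant for symmetric matrices. It returns the eigenvalues in
# ascending order and the eigenvectors as the columns of the second result.
eigenvalues, eigenvectors = np.linalg.eigh(M)
print(eigenvalues)

for k, v in zip(eigenvalues, eigenvectors.T):
    # For every eigenpair, M @ v should equal k * v.
    print(k, v, np.allclose(M @ v, k * v))
```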
Now there is nothing special about 3x3 matrices or vectors having just 3
dimensions. In general, we can define a vector to be an ordered set of N real
numbers (or even complex numbers if you wish) and represent it by a column
matrix of N entries.
$$x = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{pmatrix}$$
Let’s look at some of the common definitions in linear algebra that we will be
encountering throughout machine learning.
Vector Norm: A vector norm is a function that takes a vector and returns a
non-negative real number as output. A vector norm should fulfil the following
conditions.
$\|x\| > 0$ if $x \neq 0$, and $\|x\| = 0$ iff $x = 0$
$\|\alpha x\| = |\alpha| \, \|x\|$ for any scalar $\alpha$
$\|x + y\| \leq \|x\| + \|y\|$
The most commonly used norm is the Euclidean norm, also known as the L2
norm. It is defined as:
$$\|x\|_2 = \sqrt{\sum_{i=1}^{N} x_i^2}$$
As an exercise, verify that the Euclidean norm satisfies the above-mentioned
properties.
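A numerical check is not a proof, but it can build intuition for the exercise. The sketch below uses arbitrarily chosen example vectors and a scalar (all illustrative) to test the scaling property and the triangle inequality for the L2 norm.

```python
import numpy as np

x = np.array([3.0, 4.0])
y = np.array([1.0, -2.0])
alpha = -2.5

# Euclidean (L2) norm: square root of the sum of squared entries.
norm_x = np.sqrt(np.sum(x ** 2))
print(norm_x, np.linalg.norm(x))   # both give 5.0

# Scaling property: ||alpha * x|| = |alpha| * ||x||
print(np.isclose(np.linalg.norm(alpha * x), abs(alpha) * np.linalg.norm(x)))

# Triangle inequality: ||x + y|| <= ||x|| + ||y||
print(np.linalg.norm(x + y) <= np.linalg.norm(x) + np.linalg.norm(y))
```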
Inner Product
The inner product of two vectors $x$ and $y$ is given by $x^T y$, where $x^T$
is the transpose of the column matrix $x$.
$$x^T y = \sum_{i=1}^{N} x_i y_i$$
The inner product gives us a sense of how correlated two vectors are. For
vectors of fixed magnitude, the inner product is maximum when the two vectors
point in the same direction. When the inner product between two vectors is
zero, they are termed orthogonal vectors.
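The following sketch illustrates these statements with three example vectors chosen for illustration. The helper `cosine` is a hypothetical name; it divides the inner product by the two norms so that only the direction matters.

```python
import numpy as np

x = np.array([1.0, 1.0])
y = np.array([1.0, -1.0])   # perpendicular to x
z = np.array([2.0, 2.0])    # same direction as x

print(x @ y)   # 0.0, so x and y are orthogonal
print(x @ z)   # 4.0

def cosine(u, v):
    """Inner product of u and v after scaling both to unit length."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Close to 1 for vectors in the same direction, 0 for orthogonal vectors.
print(cosine(x, z), cosine(x, y))
```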
8.6.2 Multivariable Calculus
Single variable calculus deals with functions of one variable, $f(x)$. However,
in machine learning, we are often interested in functions that depend on
several variables. For example, to predict if it will rain today, we might want
to know the Temperature, Pressure, Wind speed, Humidity and so on. Thus the
probability of it raining today is a function of all of these variables:
$f(T, P, W, H)$
Naturally, we want to find analogues of derivatives and integrals for such
functions. Also, we would like to see how to find minima and maxima for
these functions. Multivariable Calculus deals with these questions. We will
consider the functions of two variables x and y, as these are easy to understand
visually.
Concept of Gradient
Partial Derivative: Consider a function $f(x, y)$. The partial derivative of
$f$ with respect to $x$ at a given point $(x_0, y_0)$ is given by
$$\left.\frac{\partial f}{\partial x}\right|_{(x_0, y_0)} = \lim_{h \to 0} \frac{f(x_0 + h, y_0) - f(x_0, y_0)}{h}$$
The partial derivative measures the change in the function with respect to one
variable while keeping the other variables fixed. Let’s look at an example.
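Example: Find the partial derivatives of the function $f(x, y) = x^2 + y^2$ at
$(x = 2, y = 3)$.
Solution:
$$\frac{\partial f}{\partial x} = 2x; \quad \text{at } x = 2, \ \frac{\partial f}{\partial x} = 4$$
$$\frac{\partial f}{\partial y} = 2y; \quad \text{at } y = 3, \ \frac{\partial f}{\partial y} = 6$$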
Gradient
The gradient gives the direction (in the domain of the function) in which the
function is the “steepest,” i.e., the direction of maximum increase. The
magnitude of the gradient quantifies the steepness.
$$\nabla f(x, y) = \begin{pmatrix} \dfrac{\partial f}{\partial x} \\ \dfrac{\partial f}{\partial y} \end{pmatrix}$$
Note that the gradient of a function is a vector quantity.
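The limit definition of the partial derivative suggests a simple numerical approximation: use a small but finite h. The sketch below applies this to the example function $x^2 + y^2$ at (2, 3); the function name `numerical_gradient` and the step size are illustrative choices, not part of any library.

```python
import numpy as np

def f(x, y):
    return x ** 2 + y ** 2

def numerical_gradient(func, x0, y0, h=1e-6):
    """Approximate (df/dx, df/dy) at (x0, y0) with forward differences."""
    df_dx = (func(x0 + h, y0) - func(x0, y0)) / h   # y is held fixed
    df_dy = (func(x0, y0 + h) - func(x0, y0)) / h   # x is held fixed
    return np.array([df_dx, df_dy])

grad = numerical_gradient(f, 2.0, 3.0)
print(grad)   # close to the analytic gradient (2x, 2y) = (4, 6)
```

The two components together form the gradient vector, which here points away from the origin, the direction in which this function grows fastest.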
If the gradient of a function is zero at a point, the point may be one of the
following three cases:
• Local Maximum: A point where the function attains a maximum value
locally, that is, the value at the point is greater than at all
neighbouring points.
• Local Minimum: A point where the function attains a minimum value
locally, that is, the value at the point is less than at all neighbouring
points.
• Saddle Point: A saddle point occurs when the gradient is zero, but the
point is neither a maximum nor a minimum. The figure shows how the
function appears to have a minimum when viewed from the front and a
maximum when viewed from the side.
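The three cases can be illustrated with three simple functions whose gradient vanishes at the origin. The sketch below is a rough numerical check, not a rigorous test: the hypothetical helper `classify_by_sampling` compares the function value at the point with values on a small circle around it.

```python
import numpy as np

def classify_by_sampling(func, x0, y0, radius=1e-3, samples=360):
    """Compare func at (x0, y0) with its values on a small surrounding circle."""
    angles = np.linspace(0.0, 2.0 * np.pi, samples, endpoint=False)
    neighbours = func(x0 + radius * np.cos(angles), y0 + radius * np.sin(angles))
    centre = func(x0, y0)
    if np.all(neighbours > centre):
        return "local minimum"
    if np.all(neighbours < centre):
        return "local maximum"
    return "saddle point (or inconclusive)"

print(classify_by_sampling(lambda x, y: x**2 + y**2, 0.0, 0.0))     # bowl
print(classify_by_sampling(lambda x, y: -(x**2 + y**2), 0.0, 0.0))  # inverted bowl
print(classify_by_sampling(lambda x, y: x**2 - y**2, 0.0, 0.0))     # saddle
```

The function $x^2 - y^2$ rises along the x-axis and falls along the y-axis, which is the behaviour described for a saddle point.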
Check Your Progress 5
In this section, you studied Mathematics for Machine Learning. Now answer
the questions given in Check Your Progress-5.
Note: a) Write your answer in about 50 words
b) Check your answer with possible answers given at the end of the unit
(1) Find the gradient of the following function:
$f(x, y) = e^{x^2 + y^2}$
-----------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------
(2) Find the eigenvalues and eigenvectors of the following matrix.
$$A = \begin{pmatrix} 3 & 0 \\ 1 & 1 \end{pmatrix}$$
(Hint:)
$A\vec{v} = k\vec{v}$
$(A - kI)\vec{v} = 0$
$\det(A - kI) = 0$

-----------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------
(3) Find the dot product of the following pair of vectors.

$a = (2, 4, 6, 1, 8)$

$b = (1, 0, 3, 0, 1)$
-----------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------
8.7 SOFTWARE FOR MACHINE LEARNING
Python is the most commonly used programming language for designing
machine learning algorithms. The easiest way of using Python is to use Google
Colab notebooks. Colab notebooks can run Python commands interactively.
Google also gives free access to GPUs through Colab, which greatly reduces the
time taken to train models, especially deep-learning models.
8.7.1 Setting Up the Environment
Google Colab
To use Google Colab, you must first have a Google account. If you don’t already
have one, you can create a Google account by setting up an email address and a
password.
Go to https://colab.research.google.com/ and sign in with your Google account.
Select the New Notebook option. This will open a blank notebook.

Figure 8.13: Colab Startup Screen

The default notebook contains a code cell. You can run blocks of Python code
in these cells. The other type of cell is the text cell, where you can enter a
description of your code or any other information that you want to.
To get started, create a text cell and type “My First Colab Notebook”. Then
create a code cell and type print('hello').
It should look something like this:

Figure 8.14: Colab Interface

Now run each cell using Ctrl+Enter. Alternatively, use Runtime->Run All. The
text will get formatted according to Markdown Syntax. All code output will be
displayed below the cell.
You can add multiple code cells to implement different parts of a code. All
variables, functions and classes defined in one cell are accessible to other cells.
The getting started guide gives a more detailed and comprehensive
introduction.
Setting Up Locally
The easiest way to set up an environment on your own system is by using
Anaconda. Anaconda is a package manager, an environment manager and a Python/R
data science distribution. This basically means that machine learning packages
such as numpy, Tensorflow etc. can be installed and managed through it.
Following is a quick guide on how to install Anaconda:
i. Download the Anaconda installer.
ii. Double click the installer to launch.
iii. Complete the setup by agreeing to the licensing terms and selecting
the location for installation.
iv. Select the following option in the advanced menu (recommended by
Anaconda).

v. Click the Install button. If you want to watch the packages Anaconda is
installing, click Show Details.

vi. After a successful installation you will see the “Thanks for installing
Anaconda” dialog box:
Open Anaconda Navigator from the Start menu. You can install different tools
that are used in data science and machine learning from here. It is
recommended that you install Jupyter Notebook for this course. A Jupyter
Notebook works almost like a Colab notebook, the main difference being that it
runs on your own computer.

Figure: Anaconda Navigator
8.7.1 Numpy
Although numpy is not a machine learning library, it is extensively used to
handle data and pass inputs to other libraries/frameworks. Numpy helps to
handle data that is in the form of an array. For example, an image is an array
of pixels, so you can represent an image using numpy. Operations in numpy are
highly optimized and run much faster than alternatives such as iterating over
the elements in a Python loop. To illustrate this, consider summing an array
with 10^7 elements. A Python loop takes about 2 seconds to evaluate the sum,
whereas numpy takes only about 0.02 seconds for the same task, making numpy
roughly 100x faster than the alternative (exact timings depend on the machine).
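A comparison of this kind can be sketched as follows. The exact timings will differ from machine to machine, so the code only prints what it measures; the variable names are illustrative.

```python
import time
import numpy as np

n = 10**7
data = np.arange(n, dtype=np.int64)

# Sum with an explicit Python loop.
start = time.perf_counter()
loop_total = 0
for value in data:
    loop_total += value
loop_seconds = time.perf_counter() - start

# Sum with numpy's vectorized routine.
start = time.perf_counter()
numpy_total = np.sum(data)
numpy_seconds = time.perf_counter() - start

print(f"loop:  {loop_seconds:.3f} s")
print(f"numpy: {numpy_seconds:.3f} s")
```

Both approaches compute the same total; only the time taken differs.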
Numpy arrays are often called “ndarrays”, which is short for n-dimensional
arrays. Numpy has built-in methods that can be used to perform almost all
linear algebra tasks. The simplest way to create a numpy array is by passing a
list to the np.array() function. It returns an ndarray having the same values
as the list, but with a single fixed data type.

Output: array([1, 2, 3, 4, 5, 6, 7, 8])

Concatenate function: It concatenates two arrays, that is, simply joins them
end to end.
We have already seen how np.sum() can be used; similarly, np.average(),
np.std() etc. can be used to find the average, standard deviation etc.
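A minimal sketch of these functions, using small example arrays chosen here for illustration:

```python
import numpy as np

a = np.array([1, 2, 3, 4])        # list -> ndarray
b = np.array([5, 6, 7, 8])
print(a.dtype)                    # all elements share one integer data type

joined = np.concatenate([a, b])   # join the two arrays end to end
print(joined)                     # [1 2 3 4 5 6 7 8]

print(np.sum(joined))             # 36
print(np.average(joined))         # 4.5
print(np.std(joined))             # standard deviation of the eight values
```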
We can use the reshape method to change the shape of an array. For example,
the arange function creates an array of shape (6,); reshaping it to (2,3)
arranges the same six values in two rows and three columns.

Output: array([[0, 1, 2],
               [3, 4, 5]])
Output: array([[0, 3],
               [1, 4],
               [2, 5]])
We can also do the reverse operation. Given a multi-dimensional ndarray (here
one with 12 elements), we can flatten it out into a one-dimensional array by
using the flatten method.
Output: array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
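The outputs above are consistent with the following sketch. Two assumptions are made here: the second output is taken to be the transpose of the reshaped array, and the flattened array is taken to come from a (3, 4) array holding the values 1 to 12.

```python
import numpy as np

a = np.arange(6)          # array([0, 1, 2, 3, 4, 5]), shape (6,)
b = a.reshape(2, 3)       # two rows, three columns
print(b)

print(b.T)                # transpose: shape (3, 2)

c = np.arange(1, 13).reshape(3, 4)   # values 1..12 arranged as 3 rows of 4
print(c.flatten())                   # back to a one-dimensional array
```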
8.7.2 Scikit-Learn
Scikit-Learn is a comprehensive machine learning package that has most of the
classical machine learning techniques implemented in a highly optimized
fashion. Let’s look at a few examples from the Scikit-Learn documentation to
understand it better.
First, we need to install Scikit-Learn, which can be done easily using pip
(pip install scikit-learn).

Let’s look at some of the features of scikit-learn.

Clearly, scikit-learn offers a wide range of features; however, most of these
have similar designs and are well documented.
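The shared design can be seen in a short sketch. The choice of the Iris dataset, the train/test split and the Support Vector Machine classifier are illustrative assumptions, not the example used in the unit; other scikit-learn estimators are used through the same construct, fit, predict pattern.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Load a small labelled dataset bundled with scikit-learn.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the samples for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Most estimators follow the same pattern: construct, fit, predict.
model = SVC(kernel="rbf")
model.fit(X_train, y_train)
predictions = model.predict(X_test)

accuracy = accuracy_score(y_test, predictions)
print(f"Test accuracy: {accuracy:.2f}")
```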
8.7.3 Tensorflow and Keras
Tensorflow is a deep learning library developed by Google. Keras is a
high-level API built on top of the Tensorflow backend and is a fast way of
creating deep learning models. Let’s build a small neural network to classify
handwritten digits. For this classification, we will use the MNIST dataset,
which has 60,000 training images and an additional 10,000 images for testing.
Figure 8.15: Samples from MNIST Dataset

Fortunately, the MNIST dataset can be loaded directly through Keras, so no
separate manual download is required. The following chart shows the workflow
for any neural network in Keras.
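The workflow (load the data, preprocess it, build the model, compile, fit, evaluate, predict) can be sketched in code as below. This is an illustrative sketch and not the unit's original listing: the layer sizes, dropout rate, optimizer, batch size and the helper names `load_mnist`, `build_model` and `train_and_evaluate` are all assumptions made here.

```python
from tensorflow import keras
from tensorflow.keras import layers


def load_mnist():
    """Load MNIST, flatten each 28x28 image and one-hot encode the labels."""
    (x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
    x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
    x_test = x_test.reshape(-1, 784).astype("float32") / 255.0
    y_train = keras.utils.to_categorical(y_train, 10)
    y_test = keras.utils.to_categorical(y_test, 10)
    return (x_train, y_train), (x_test, y_test)


def build_model():
    """A small fully connected network; the layer sizes are illustrative."""
    model = keras.Sequential([
        keras.Input(shape=(784,)),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.2),
        layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model


def train_and_evaluate(epochs=5):
    """Train on the training set, then report loss and accuracy on the test set."""
    (x_train, y_train), (x_test, y_test) = load_mnist()
    model = build_model()
    history = model.fit(x_train, y_train, epochs=epochs, batch_size=128,
                        validation_split=0.1)
    print(history.history.keys())
    loss, accuracy = model.evaluate(x_test, y_test, verbose=0)
    print(f"test loss {loss:.3f}, test accuracy {accuracy:.3f}")
    # Class probabilities predicted for the first test image.
    print(model.predict(x_test[:1]))
    return model, history
```

Calling `train_and_evaluate()` in a notebook cell runs the whole workflow; the first call fetches the MNIST files through Keras.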
Note that the labels are changed to categorical labels. What this means is that
each number from 0 to 9 is represented by a vector with 1 at the corresponding
position and 0 everywhere else.
$$0 \to (1,0,0,0,0,0,0,0,0,0)^T, \quad 1 \to (0,1,0,0,0,0,0,0,0,0)^T, \quad \ldots, \quad 9 \to (0,0,0,0,0,0,0,0,0,1)^T$$
This is known as one-hot encoding. It is done because we use a SoftMax
activation function with the last layer, which returns a vector of ten class
probabilities in the same format.
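One-hot encoding itself needs nothing more than numpy. The sketch below is one possible way to do it, indexing the rows of an identity matrix with a few example labels chosen for illustration.

```python
import numpy as np

labels = np.array([0, 1, 9])

# Row i of the 10x10 identity matrix is the one-hot vector for digit i.
one_hot = np.eye(10, dtype=int)[labels]
print(one_hot)

# argmax recovers the original digit from its one-hot vector.
print(np.argmax(one_hot, axis=1))   # [0 1 9]
```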
The Dropout layer is a special layer. It randomly turns off a fraction of the
neurons during training, i.e. the model behaves as if they were not there for
that training step; a different random subset is dropped at each step.
When the model completes training over the entire dataset once, we say the
model has trained for one epoch. The model is trained several times on the
same data, i.e. through several epochs, to improve its quality.
Now, we will train the model and make some predictions and measure the
accuracy of the model.

Output: dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])

Figure 8.16 Model Accuracy at Each Epoch

The fit() method returns a History object, which stores the values of
quantities such as loss, accuracy, validation loss and validation accuracy for
each epoch. Its history attribute is a dictionary that maps the name of each
quantity to a list of its per-epoch values.
The compile() method configures the model for training. It takes as input the
loss function, the optimizer that we want to use to minimize it, and the
metrics to report during training and validation.
Finally, we test the model for our test data and see how it performs.

The evaluate() method returns the value of the loss and of the chosen metric
(in our case, accuracy), computed over the entire dataset passed to it. If we
wish to make a prediction for an individual test sample, we can use the
predict() method.
Check Your Progress 6
In this section, you studied Software for Machine Learning. Now answer the
questions given in Check Your Progress-6.
Note: a) Write your answer in about 50 words
b) Check your answer with possible answers given at the end of the unit
(1) What will be the output for the following code?
print(np.sum([1,2,3,4]))
print(np.concatenate([[1,2,3,4],[4,3,2,1]]))
-----------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------
(2) Identify and describe the layers used in the following Keras model.
model = keras.Sequential(
    [
        keras.Input(shape=input_shape),
        layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Flatten(),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ]
)
-----------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------
(3) What is one-hot encoding?
-----------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------

8.8 SUMMARY
This unit dealt with the fundamentals of Machine Learning. It discussed how a
machine learns. Machine learning algorithms were explained and linked to the
various machine learning categories. Neural Networks and Deep Learning were
described to appreciate the power of machine learning. Mathematics and Machine
Learning software are crucial in understanding and applying machine learning
algorithms. These have been introduced at appropriate places in this unit and
can be used easily by learners, either through Google Colab (online) or
through the Anaconda distribution of Python (installed on your
computer/laptop).

8.9 KEYWORDS
Activation Function: The activation function is a mathematical function that
transforms the output of a neuron into a desired non-linear form before it is
sent to the next layer. It maps the summation result to a desired range.
Agent: An agent is the entity which performs certain actions by interacting
with the environment and has a goal of maximizing its rewards.
Algorithm: An algorithm provides fixed computational rules.
Conventional algorithm: Algorithms based on explicit instructions of how to
perform a task.
Machine Learning algorithm: Algorithms that are based on an implicit set of
rules (based on data, not on explicit instructions of how to perform a task).
Anaconda: A popular data science platform and distribution for Python.
Backpropagation: Backpropagation is the method for performing gradient
descent in artificial neural networks. It allows us to compute the derivative of
a loss function with respect to every free parameter (i.e. weight and bias) in the
network. It does so layer by layer.
Classifier: A classifier is a type of machine learning algorithm used to assign a
class label to a data input.
Convolutional Neural Networks (CNN): CNN is one of the most popular
deep neural network architectures. It is mostly used in visual recognition
tasks. It takes an image as input and learns features from the different parts
of the image.
Environment: Think of the environment as a board game in which landing on
a particular square gives a reward or penalty. Thus the environment is a
collection of entities with which an agent can interact.
Gradient descent: Gradient descent is an optimization algorithm which is
commonly used to train machine learning models and neural networks.
K-means clustering: K-means clustering is an unsupervised learning
algorithm, which attempts to partition the set of given input data into k
partitions by minimizing the variance for each partition.
Keras: Keras is a high level open source neural network library written in
python.
Kernel: In machine learning, a kernel is the measure of resemblance where a
kernel function defines the distribution of similarity of points around a given
point.
Kernel-trick: The kernel-trick is a method that allows one to use a linear
classifier to solve a non-linear problem.
Labelled Dataset: Labelled data is data that comes with a name or type. In
other words, a labelled dataset contains the data as well as its labels
(name/type).
MNIST dataset: The MNIST (Modified National Institute of Standards and
Technology) dataset is a large collection of handwritten digits.
Reward: Reward is a numerical incentive given to the agent for performing a
certain task.
Recurrent Neural Networks (RNN): A recurrent neural network (RNN) is a
type of artificial neural network designed for sequential data or time series
data. These deep learning algorithms are commonly used in language
translation, natural language processing (NLP), speech recognition, and image
captioning; they are incorporated into popular applications such as Siri,
voice search, and Google Translate.
ReLU (Rectified Linear Unit): ReLU is an activation function used in deep
neural networks. The ReLU function is defined as $\max(0, x)$.
Scikit-Learn: Scikit-Learn is a complete machine learning package that has
most of the machine learning techniques implemented in a highly optimized
fashion.
Support Vector Machine (SVM): SVM is a supervised machine learning algorithm
used for classification and regression.
Tanh: Tanh is an activation function in a neural network that maps its input
to the range (-1, 1).

8.10 CHECK YOUR PROGRESS – POSSIBLE ANSWERS

Check Your Progress 1 – Possible Answers
1) Briefly explain the idea of Machine Learning.
The concept of machine learning is that machines (computers) can learn
without being explicitly programmed. “Without being explicitly programmed”
means without the use of direct programming commands. Here machine learning
refers to self-learning by a machine (computer).
2) Mention currently used applications based on Machine Learning.
Machine Learning is an emerging field of computer science having wide
applications in Search engines, Recommendation systems, Spam filters etc.
Check Your Progress 2 – Possible Answers
1) Discuss different types of Machine Learning.

There are three categories of Machine Learning. These are: Supervised
Learning, Unsupervised Learning and Reinforcement Learning. Supervised
learning uses labelled datasets. Labelled data is data that comes with a name or
type. Unsupervised learning involves finding a pattern in data. Thus
unsupervised learning segregates data in clusters or groups. These clusters or
groups are unlabeled. Reinforcement learning works on the principle of reward
and punishment.

2) How is Reinforcement Learning different from Supervised Learning?

Reinforcement Learning models differ from Supervised and Unsupervised
Learning in the sense that there is no fixed dataset from which patterns have
to be recognized or values predicted. Instead, the model learns from the
rewards and penalties it receives while interacting with an environment. RL
models are very useful in industrial applications, where the model can learn
to perform complicated tasks and optimize the process. Other applications
include self-driving cars and so on.
3) What is a Labelled Dataset?
A dataset that has data as well as labels is known as a labelled dataset. In
other words, labelled data is data that comes with a name or type. A popular
example of a labelled dataset is the MNIST dataset.
Check Your Progress 3 – Possible Answers
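1) Briefly describe the iterative K-means clustering algorithm.

K-means clustering is an unsupervised learning algorithm, which attempts
to partition the set of given input data into k partitions by minimizing the
variance for each partition. It iterates between assigning each data point to
the nearest cluster centre and recomputing each centre as the mean of the
points assigned to it, until the assignments stop changing.

2) Describe different loss functions in linear and logistic regression.

Binary cross-entropy loss (logistic regression):
$$L = \sum_{i=1}^{N} \left[ -y_i \log y'_i - (1 - y_i) \log (1 - y'_i) \right]$$
Mean Squared Error (linear regression):
$$L = \frac{1}{N} \sum_{i=1}^{N} (y'_i - y_i)^2$$
Here $y_i$ is the true value and $y'_i$ is the predicted value for the i-th
sample.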
3) Describe how embedding data in a higher dimensional space is useful for
SVM.
Embedding the data into a higher dimensional space can make it linearly
separable, so that an SVM can find an optimal separating hyperplane.
Check Your Progress 4 – Possible Answers
1) Define input layer, hidden layer and output layer for a neural network.

1. Input Layer: The input layer is basically the input data itself, reshaped
to a suitable shape.
2. Hidden Layer: Hidden layer is where all of the computation and
learning occurs.
3. Output Layer: It finally converts the output of the hidden layer to the
desired output such as {0,1} in the case of binary classification.

2) What advantages does deep learning have over other algorithms such as
Support Vector Machine?

As the size of the data scales up, the performance of traditional machine
learning algorithms (like Support Vector Machine) stops improving. In this
regime, however, deep learning models continue to improve in performance with
more data, which makes them very relevant in the current scenario, when a huge
volume of data is generated on a daily basis (think Facebook, YouTube,
satellite data).
3) What is the major problem associated with simple RNNs?

It turns out that very simple RNNs are not able to learn long-distance
connections in the data. For example, in the sentence “Sachin Tendulkar is one
of the greatest cricket players, he has over 14,000 runs.”, we expect the RNN
to be able to figure out that “he” refers to Sachin Tendulkar. In practice,
however, RNNs do poorly when the gap between two time-steps increases and are
unable to make the connection.
Check Your Progress 5 – Possible Answers
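1) Find the gradient of the following function.
$$\frac{\partial f}{\partial x} = 2x e^{x^2 + y^2}, \quad \frac{\partial f}{\partial y} = 2y e^{x^2 + y^2}$$
2) Find the eigenvalues and eigenvectors of the following matrix.
$$\det \begin{pmatrix} 3-k & 0 \\ 1 & 1-k \end{pmatrix} = 0$$
$$(3-k)(1-k) = 0$$
Eigenvalues: $k = 1$, $k = 3$
For $k = 3$, solve
$$\begin{pmatrix} 3-3 & 0 \\ 1 & 1-3 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = 0$$
$x = 2, y = 1$
Similarly, solve for $k = 1$.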

3) Find the dot product of the following pair of vectors.

$a = (2, 4, 6, 1, 8)$

$b = (1, 0, 3, 0, 1)$
Ans. 28
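The answers to questions 2 and 3 can be cross-checked with numpy. This is only a verification sketch; `np.linalg.eig` returns unit-length eigenvectors, so the hand-derived vector (2, 1) is checked separately against the defining equation.

```python
import numpy as np

# Question 2: eigenvalues and eigenvectors of the matrix [[3, 0], [1, 1]].
A = np.array([[3.0, 0.0],
              [1.0, 1.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)

# The hand-derived eigenvector (2, 1) for k = 3 satisfies A v = 3 v.
v = np.array([2.0, 1.0])
print(np.allclose(A @ v, 3 * v))   # True

# Question 3: dot product of the two vectors.
a = np.array([2, 4, 6, 1, 8])
b = np.array([1, 0, 3, 0, 1])
print(np.dot(a, b))                # 28
```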
Check Your Progress 6 – Possible Answers
1) What will be the output for the following code?
10
[1 2 3 4 4 3 2 1]

2) Identify and describe the layers used in the following Keras model.
Convolution Layer: Slides a filter over the image and outputs, at each
position, the dot product of the filter and the image patch it overlaps.
MaxPool Layer: Selects the maximum value within each pooling window, which
reduces the size of the feature map.
Flatten Layer: Reshapes the multi-dimensional feature maps into a
one-dimensional vector so that they can be passed to a Dense layer.
Dropout Layer: Randomly turns off a fraction of the neurons (here 50%) at each
training step.
Dense Layer: A fully connected neural network layer; here it uses a softmax
activation to output the class probabilities.
3) What is one-hot encoding?
One-hot encoding converts the labels to categorical labels. What this means is
that each number from 0 to 9 is represented by a vector with 1 at the
corresponding position and 0 everywhere else.
$$0 \to (1,0,0,0,0,0,0,0,0,0)^T, \quad 1 \to (0,1,0,0,0,0,0,0,0,0)^T, \quad \ldots, \quad 9 \to (0,0,0,0,0,0,0,0,0,1)^T$$
This is done because we use a SoftMax activation function with the last
layer, which returns a vector of ten class probabilities in the same format.
