
What is Machine Learning?

There are some tasks for which we do not have an algorithm to solve them by conventional
programming, but we do have lots of data from which useful information can be learned (can you
think of suitable examples??). With digital devices offering powerful memory and computation
capabilities, stored data can be analyzed and turned into information that we can make use of to
make predictions. There are usually certain patterns or regularities in the data from which we
can identify the underlying process that generates the data and use it for prediction. We can
never recover this process completely, but we can construct a good and useful approximation. That
approximation may not explain everything, but it may still be able to account for some part of the
data. We believe that even though identifying the complete process may not be possible, we can still
detect certain patterns or regularities. This is machine learning. (See the more formal definition
of ML in Mitchell's book and the introduction lecture slides.) Pattern recognition is a very typical
application area of ML (can you think of some examples??). The application of machine learning
methods to large databases is called data mining. The analogy is that a large volume of earth and
raw material is extracted from a mine, which, when processed, yields a small amount of very
precious material.
Task 1: As general self-study, read about various applications of ML on the internet
under broad categories: for example, e-commerce, finance, computer vision, manufacturing,
robotics, medical diagnosis, telecommunications, science, speech recognition, bioinformatics, etc.

It must be clear that ML is NOT a database problem but a part of AI: intelligent
systems are made possible by LEARNING in a changing environment. Machine learning is
programming computers to optimize a performance criterion using example data or past
experience. We have a model defined up to some parameters, and learning is the execution of a
computer program to optimize the parameters of the model using the training data or past
experience. The model may be predictive, to make predictions about the future; descriptive, to gain
knowledge from data; or both.

Learning paradigms

1. Supervised
2. Unsupervised
3. Reinforcement Learning
In supervised learning the data comes in pairs: an input X and an output Y (often called the label
of the training data), and the task is to learn the mapping from the input to the output, whose correct
values are provided by a supervisor. The approach is to assume a model, defined with
respect to a set of parameters, that approximates this mapping. The machine learning program
optimizes the parameters such that the approximation error is minimized, that is, our estimates
are as close as possible to the correct values given in the training set. The two types of
supervised learning are:
1) Classification, where the labels are discrete variables (K categories in general for a
multi-class problem; for K=2 it is a binary classification, and Y could be +1/-1 or 0/1).
2) Regression, where the labels to be predicted are continuous.
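The two settings can be contrasted on a toy 1-D example. The sketch below is an assumed illustration (the data and the least-squares fit are our own, not from the notes): the same feature is used once with continuous labels (regression) and once with discrete labels (classification).

```python
import numpy as np

# Toy training set: one feature per example (say, hours studied).
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Regression: the label is a continuous value (e.g. an exam score).
y_reg = np.array([52.0, 55.0, 61.0, 68.0, 74.0])
# "Learning" optimizes the model parameters (w, b) of Y = w*X + b
# to minimize the squared approximation error on the training set.
A = np.column_stack([X, np.ones_like(X)])
w, b = np.linalg.lstsq(A, y_reg, rcond=None)[0]

# Classification: the label is one of K discrete categories (here K = 2).
y_cls = np.array([0, 0, 0, 1, 1])   # pass/fail labels
threshold = 3.5                     # a 1-D decision boundary
pred = (X > threshold).astype(int)

print(round(w, 2), round(b, 2))     # fitted slope and intercept: 5.7 44.9
print(pred.tolist())                # [0, 0, 0, 1, 1]
```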
In unsupervised learning (also called density estimation in statistics), there is no such
supervisor; we only have input data, and no labels are provided. Density estimation techniques
learn the probability distribution according to which the data has been sampled. There are other
approaches to unsupervised learning, such as clustering, in which similar data points are grouped
and partitioned into homogeneous sets. [Note the differences between clustering and classification.]
The overall aim in unsupervised learning is to find the regularities or associations in the input
data. There is a structure to the input space such that certain patterns occur more often than
others, and we want to see what generally happens and what does not. Feature selection and
dimensionality reduction (we will study the PCA technique later in our course) are some other
commonly used unsupervised learning schemes.
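As a minimal sketch of clustering, here is a bare-bones k-means (the specific algorithm is our illustrative choice; the notes only name clustering in general). Note that the grouping uses only the inputs, with no labels anywhere:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal k-means: partition unlabelled points into k homogeneous sets."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]   # random initial centers
    for _ in range(iters):
        # Assign each point to its nearest center (no supervisor involved).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # Move each center to the mean of the points assigned to it.
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return assign, centers

# Two well-separated blobs of 2-D points.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0], [11.0, 11.0]])
assign, centers = kmeans(X, 2)
print(sorted(np.bincount(assign).tolist()))   # [4, 4] — each blob is one cluster
```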

In reinforcement learning, an agent is placed in an environment and learns to behave in it by
performing certain actions and observing the rewards it gets from those actions. It is
relevant for those applications where the output of the system is a sequence of actions. In such a
case, a single action is not important in itself; the learning process instead finds a sequence of
actions to reach a goal. The algorithm evolves gradually: note that there is no supervised output,
only delayed rewards. What is important is the policy, that is, the sequence of correct actions to
reach the goal. There is no such thing as the best action in any intermediate state; an action is
good if it is part of a good policy. The machine learning program should therefore be able to
assess the goodness of policies and learn from past good action sequences to be able to generate
a policy. This is an advanced ML area used in applications such as self-driving cars and game
playing (like chess, etc.). Can you search for some others??
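Learning a policy from delayed rewards can be sketched with tabular Q-learning on a tiny chain world. Both the algorithm choice and the environment are our own illustration (the notes do not name a specific RL method): the only reward arrives at the goal, yet the agent learns the whole action sequence leading there.

```python
import random

# Chain environment: states 0..4, goal at state 4.
# Actions: 0 = move left, 1 = move right. Reward 1 only on reaching the
# goal — a delayed reward; no supervisor labels any single action "correct".
N_STATES, GOAL = 5, 4

def step(s, a):
    s2 = min(max(s + (1 if a == 1 else -1), 0), GOAL)
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

Q = [[0.0, 0.0] for _ in range(N_STATES)]   # Q[state][action] value table
alpha, gamma, eps = 0.5, 0.9, 0.2
random.seed(0)

for _ in range(500):                        # training episodes
    s = 0
    while True:
        # Epsilon-greedy: mostly act greedily, sometimes explore at random.
        a = random.randrange(2) if random.random() < eps \
            else max((0, 1), key=lambda x: Q[s][x])
        s2, r, done = step(s, a)
        # Q-learning update: move Q(s,a) toward reward + discounted best future value.
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2
        if done:
            break

# The learned policy is the greedy action in each non-goal state.
policy = [max((0, 1), key=lambda a: Q[s][a]) for s in range(N_STATES - 1)]
print(policy)   # [1, 1, 1, 1] — always move right, toward the goal
```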

Task 2: You should be able to classify a given problem into the correct paradigm. As self-study,
go through all the typical application examples of each category to strengthen your concepts.

The general design of a supervised learning problem is shown below (details we have discussed in
class).
We should be very clear about the following:

What do you mean by features and dimensionality of your data and feature space?

What do you mean by training examples?

How many features? How many training examples do we have in the above input space of the
training data?

The diagram below summarizes very well the role of features in the evolution of Machine
Learning from earlier AI expert systems, which were mostly rule based, to what is today
popularly called Deep Learning. We see that it is the feature discovery process that is getting
more and more automated.
Most real-world problems belong to the classification category, and we will therefore start our
lectures with classifiers (in our course we will study the Naïve Bayes classifier,
logistic regression, and SVM) and then move on to regression problems, most popularly
handled by ANNs. (Please note that logistic regression actually solves a classification problem;
the name "regression" is confusing here — it is a misnomer.) The next important thing that
we must learn is the distinction between linear and non-linear classifiers, and we should have
the concept of linearly separable (LS) and non-linearly separable (NLS) data sets. A data set is
separable by a learner if there is some instance of that learner that correctly predicts all the data
points. For a linearly separable data set, the data points can be separated into two classes using a
hyperplane in feature space. In two dimensions, the decision boundary is a straight line, as shown
in the examples below.
Note the important observation that a suitable feature mapping can often change NLS data
to LS data, usually by transforming to a higher-dimensional space. See the 1D example below, where
a feature mapping from 1D to 2D makes the data set linearly separable. This is the basis of the
kernel mapping we will study briefly in later lectures when we discuss SVM.
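The 1D-to-2D idea can be checked numerically. The data set below is an assumed example (not the one in the figure): in 1D the positive class sits between the two negative groups, so no single threshold works, but the mapping phi(x) = (x, x²) makes a straight line suffice.

```python
import numpy as np

# 1-D data set that is NOT linearly separable: the positive class (+1)
# lies between the two groups of the negative class (-1).
x = np.array([-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0])
y = np.array([  -1,   -1,    1,   1,   1,  -1,  -1])

# Feature mapping phi: 1D -> 2D, phi(x) = (x, x^2).
phi = np.column_stack([x, x ** 2])

# In the new space the classes are split by the horizontal line x2 = 2,
# i.e. the LINEAR classifier sign(2 - x^2), with W = (0, -1) and b = 2.
pred = np.sign(2.0 - phi[:, 1])
print((pred == y).all())   # True — linearly separable after the mapping
```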

Linear and Non-Linear Classifiers


Linear classifiers are those where we decide on the basis of a linear function of the input vectors
(NB, logistic regression, SVM, and the perceptron are some of the linear classifiers we will study
in our course). Some examples of non-linear classifiers are ANN, K-nearest neighbor, decision
trees, and kernels.
In the linear models that we use in learning, Y is a linear function of X. They are of the form
Y = W·X + b, where Y is the PREDICTION (this could be a score or a probability), W is the
weight vector, and b is known as the bias; W and b are together known as the model parameters.
Note in the figure below the weight and data space for a 2D data set. X is the feature vector. If we
use some feature mapping φ, then the classifier model would be Y = W·φ(X) + b. The score or
probability of a particular classification would be a linear combination of the features and their
weights (can you tell how many weights there would be??). The bias term b is often folded
into the W vector as W0 with the corresponding input X0 = 1. THIS IS IMPORTANT: we will
be using this in many of the mathematical derivations.
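The bias-folding trick can be verified in a few lines (the numbers are illustrative): prepending W0 = b to the weights and X0 = 1 to the input gives exactly the same score as computing W·X + b directly.

```python
import numpy as np

# A linear model Y = W·X + b for a 2-D input.
W = np.array([2.0, -1.0])
b = 0.5
X = np.array([3.0, 4.0])
score = W @ X + b                     # 2*3 - 1*4 + 0.5 = 2.5

# Fold the bias into the weight vector: prepend W0 = b with input X0 = 1.
W_aug = np.concatenate([[b], W])      # (W0, W1, W2)
X_aug = np.concatenate([[1.0], X])    # (1,  X1, X2)
score_aug = W_aug @ X_aug             # a single dot product, same value

print(score, score_aug)               # 2.5 2.5
```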

[Figure: Non-Linear Classifier]

[Figure: Binary Linear Classifier]

[Figure: What is this??]
