
UNIT-1

INTRODUCTION TO MACHINE LEARNING

INTRODUCTION

Machine learning is a growing technology which enables computers to learn automatically from
past data. Machine learning uses various algorithms for building mathematical models and
making predictions using historical data or information. Currently, it is being used for
various tasks such as image recognition, speech recognition, email filtering, Facebook
auto-tagging, recommender systems, and many more. The term machine learning was first
introduced by Arthur Samuel in 1959.

DEFINITION OF MACHINE LEARNING

In the real world, we are surrounded by humans who can learn everything from their experiences
with their learning capability, and we have computers or machines which work on our
instructions. But can a machine also learn from experiences or past data like a human does? So
here comes the role of Machine Learning.

Machine Learning is a subset of artificial intelligence that is mainly concerned with the
development of algorithms which allow a computer to learn from data and past experiences
on its own.

HOW DOES MACHINE LEARNING WORK?

A Machine Learning system learns from historical data, builds prediction models, and,
whenever it receives new data, predicts the output for it. The accuracy of the predicted output
depends upon the amount of data, as a larger amount of data helps to build a better model which
predicts the output more accurately.

Suppose we have a complex problem where we need to perform some predictions. Instead of
writing code for it, we just need to feed the data to generic algorithms, and with the help of
these algorithms, the machine builds the logic as per the data and predicts the output. Machine
learning has changed our way of thinking about such problems. The working of a machine
learning algorithm can be summarised as: feed historical data, train a model, then predict the
output for new data.
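A minimal sketch of this train-then-predict flow, assuming scikit-learn is installed; the data, feature meanings, and labels are invented purely for illustration.

# Minimal train-then-predict sketch (assumes scikit-learn is installed).
from sklearn.linear_model import LogisticRegression

# Historical (past) data: each row is [hours_studied, attendance_pct],
# and the label says whether the student passed (1) or failed (0).
X_train = [[2, 60], [8, 90], [1, 40], [7, 85], [5, 70], [9, 95]]
y_train = [0, 1, 0, 1, 1, 1]

model = LogisticRegression()   # a generic algorithm
model.fit(X_train, y_train)    # the machine builds its logic (model) from past data

# New, unseen data: the model predicts the output for it.
print(model.predict([[6, 80], [1, 30]]))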
FEATURES OF MACHINE LEARNING

● Machine learning uses data to detect various patterns in a given dataset.

● It can learn from past data and improve automatically.

● It is a data-driven technology.

● Machine learning is quite similar to data mining, as it also deals with huge amounts of
data.

WHY IS MACHINE LEARNING IMPORTANT?

The need for machine learning is increasing day by day. The reason is that machine learning is
capable of doing tasks that are too complex for a person to implement directly. As humans, we
have limitations: we cannot manually process huge amounts of data, so we need computer
systems, and this is where machine learning makes things easy for us.

We can train machine learning algorithms by providing them huge amounts of data and letting
them explore the data, construct models, and predict the required output automatically. The
performance of a machine learning algorithm depends on the amount of data, and it can be
measured by the cost function. With the help of machine learning, we can save both time and
money.

The importance of machine learning can be easily understood by its use cases. Currently,
machine learning is used in self-driving cars, cyber fraud detection, face recognition,
friend suggestions by Facebook, etc. Various top companies such as Netflix and Amazon have
built machine learning models that use a vast amount of data to analyze user interests and
recommend products accordingly.

CLASSIFICATIONS OF MACHINE LEARNING

● Supervised Machine Learning

● Unsupervised Machine Learning

● Reinforcement Learning

Supervised Machine Learning

Supervised learning is the type of machine learning in which machines are trained using well
"labelled" training data, and on the basis of that data, machines predict the output. Labelled data
means input data that is already tagged with the correct output.

In supervised learning, the training data provided to the machine works as a supervisor that
teaches the machine to predict the output correctly. It applies the same concept as a student
learning under the supervision of a teacher.

Supervised learning is a process of providing input data as well as correct output data to the
machine learning model. The aim of a supervised learning algorithm is to find a mapping
function that maps the input variable (x) to the output variable (y).

In the real-world, supervised learning can be used for Risk Assessment, Image classification,
Fraud Detection, spam filtering, etc.

Working of Supervised machine learning

In supervised learning, models are trained using a labelled dataset, where the model learns about
each type of data. Once the training process is completed, the model is tested on held-out test
data (data that was not used for training), and then it predicts the output.

The working of supervised learning can be easily understood by the example below.
Suppose we have a dataset of different types of shapes, which includes squares, rectangles,
triangles, and polygons. Now the first step is that we need to train the model for each shape.

● If the given shape has four sides, and all the sides are equal, then it will be labelled as a
Square.

● If the given shape has three sides, then it will be labelled as a triangle.

● If the given shape has six equal sides, then it will be labelled as a hexagon.

Now, after training, we test our model using the test set, and the task of the model is to identify
the shape.

The machine is already trained on all types of shapes, and when it finds a new shape, it classifies
the shape on the basis of its number of sides and predicts the output.
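A tiny sketch of the rule the model is meant to learn, written directly in Python for illustration; a real supervised model would learn these rules from labelled examples rather than having them hand-coded, and the function name and shape encoding are invented.

# Hypothetical shape classifier mirroring the labelling rules above.
def classify_shape(num_sides, all_sides_equal):
    if num_sides == 4 and all_sides_equal:
        return "Square"
    if num_sides == 3:
        return "Triangle"
    if num_sides == 6 and all_sides_equal:
        return "Hexagon"
    return "Polygon"

# "Test set": new shapes described only by their sides.
print(classify_shape(4, True))   # Square
print(classify_shape(3, False))  # Triangle
print(classify_shape(6, True))   # Hexagon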

Unsupervised Machine Learning

As the name suggests, unsupervised learning is a machine learning technique in which models
are not supervised using a training dataset. Instead, the model itself finds hidden patterns and
insights in the given data. It can be compared to the learning which takes place in the human brain
while learning new things.

Unsupervised learning cannot be directly applied to a regression or classification problem
because, unlike supervised learning, we have the input data but no corresponding output data. The
goal of unsupervised learning is to find the underlying structure of the dataset, group the data
according to similarities, and represent the dataset in a compressed format.

Example: Suppose the unsupervised learning algorithm is given an input dataset containing
images of different types of cats and dogs. The algorithm is never trained upon the given dataset,
which means it does not have any idea about the features of the dataset. The task of the
unsupervised learning algorithm is to identify the image features on its own. The unsupervised
learning algorithm will perform this task by clustering the image dataset into groups
according to the similarities between images.
WORKING OF UNSUPERVISED MACHINE LEARNING

The working of unsupervised learning can be understood as follows.

Here, we have taken unlabeled input data, which means it is not categorized and the
corresponding outputs are also not given. This unlabeled input data is fed to the machine
learning model in order to train it. First, the model interprets the raw data to find the hidden
patterns in the data, and then a suitable algorithm such as k-means clustering or hierarchical
clustering is applied.

Once a suitable algorithm is applied, it divides the data objects into groups
according to the similarities and differences between the objects.
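A minimal clustering sketch, assuming scikit-learn and numpy are installed; the two-dimensional points are invented for illustration and stand in for features extracted from the images.

# Minimal k-means clustering sketch (assumes scikit-learn and numpy are installed).
import numpy as np
from sklearn.cluster import KMeans

# Invented 2-D points; two natural groups are hidden in the (unlabeled) data.
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)      # no labels are given; groups are discovered

print(labels)                       # e.g. [0 0 0 1 1 1]
print(kmeans.cluster_centers_)      # centre of each discovered group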

REINFORCEMENT LEARNING

● Reinforcement Learning is a feedback-based machine learning technique in which an
agent learns to behave in an environment by performing actions and seeing the results
of those actions. For each good action, the agent gets positive feedback, and for each bad
action, the agent gets negative feedback or a penalty.

● In Reinforcement Learning, the agent learns automatically using feedback, without any
labeled data, unlike supervised learning.

● Since there is no labeled data, the agent is bound to learn from its experience only.

● The agent interacts with the environment and explores it by itself. The primary goal of an
agent in reinforcement learning is to improve its performance by getting the maximum
positive reward.

● The agent learns by trial and error, and based on that experience, it learns to
perform the task in a better way. Hence, we can say that "Reinforcement learning is a
type of machine learning method where an intelligent agent (computer program)
interacts with the environment and learns to act within it." How a robotic dog learns
the movement of its arms is an example of reinforcement learning.

● Example: Suppose there is an AI agent present within a maze environment, and its goal
is to find the diamond. The agent interacts with the environment by performing some
actions; based on those actions, the state of the agent changes, and it also
receives a reward or penalty as feedback.

● The agent continues doing these three things (take an action, change state or remain in
the same state, and get feedback), and by repeating these actions, it learns and explores
the environment.

● The agent learns which actions lead to positive feedback or rewards and which actions
lead to negative feedback or penalties. As a positive reward, the agent gets a positive
point, and as a penalty, it gets a negative point. A small sketch of this
action-state-feedback loop follows below.
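A minimal sketch of this trial-and-error loop as tabular Q-learning on a tiny, made-up corridor world; the states, rewards, and hyperparameters (learning rate, discount, exploration rate) are all invented for illustration.

# Tiny tabular Q-learning sketch on a made-up 1-D corridor: states 0..4,
# the "diamond" is at state 4 (+10 reward); every other step costs -1.
import random

n_states, actions = 5, [-1, +1]          # move left or move right
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, epsilon = 0.5, 0.9, 0.2    # learning rate, discount, exploration rate

for episode in range(200):
    s = 0
    while s != 4:
        # take an action (explore sometimes, otherwise use what was learned)
        a = random.choice(actions) if random.random() < epsilon \
            else max(actions, key=lambda act: Q[(s, act)])
        s_next = min(max(s + a, 0), n_states - 1)        # state changes (or stays)
        r = 10 if s_next == 4 else -1                    # feedback from the environment
        best_next = max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])   # learn from feedback
        s = s_next

# After training, the agent prefers moving right (towards the diamond) in every state.
print([max(actions, key=lambda act: Q[(s, act)]) for s in range(4)])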


PERSPECTIVES AND ISSUES

Although machine learning is being used in every industry and helps organizations make more
informed and data-driven choices that are more effective than classical methodologies, it still has
many problems that cannot be ignored. Here are some common issues in Machine Learning
that professionals face while building ML skills and creating an application from scratch.

1. Inadequate Training Data

The major issue that arises while using machine learning algorithms is the lack of quality as well
as quantity of data. Although data plays a vital role in machine learning, many data scientists
report that inadequate data, noisy data, and unclean data severely hamper machine learning
algorithms. For example, a simple task may require thousands of samples, while an advanced
task such as speech or image recognition needs millions of sample data examples. Further, data
quality is also important for the algorithms to work well, yet poor data quality is common in
machine learning applications. Data quality can be affected by factors such as the following:

● Noisy Data- It is responsible for an inaccurate prediction that affects the decision as well
as accuracy in classification tasks.

● Incorrect data- It is also responsible for faulty behaviour and faulty results in machine
learning models. Hence, incorrect data may affect the accuracy of the results as well.

● Generalizing of output data- Sometimes, it is also found that generalizing output data
becomes complex, which results in comparatively poor future actions.

2. Poor quality of data

As we have discussed above, data plays a significant role in machine learning, and it must be of
good quality as well. Noisy data, incomplete data, inaccurate data, and unclean data lead to less
accuracy in classification and low-quality results. Hence, data quality can also be considered as a
major common problem while processing machine learning algorithms.
3. Non-representative training data

To make sure our training model generalizes well, we have to ensure that the sample
training data is representative of the new cases we need to generalize to. The training data
must cover all cases that have already occurred as well as those that are occurring.

Further, if we use non-representative training data in the model, it results in less accurate
predictions. A machine learning model is said to be ideal if it predicts well for generalized cases
and provides accurate decisions. If there is too little training data, there will be sampling noise
in the model; such a non-representative training set will be biased towards one class or group
and will not be accurate in its predictions.

Hence, we should use representative data in training to protect against bias and to make
accurate predictions without drift.

4. Overfitting and Underfitting

Overfitting:

Overfitting is one of the most common issues faced by machine learning engineers and data
scientists. It occurs when a model fits the training data so closely that it also starts capturing
the noise and inaccuracies in the training data set, which negatively affects the performance of
the model on new data. Let's understand with a simple example where the training data set
contains 1000 mangoes, 1000 apples, 1000 bananas, and 5000 papayas. There is then a
considerable probability of identifying an apple as a papaya, because we have a massive
amount of biased data in the training data set; hence the prediction gets negatively affected.
Highly flexible non-linear methods are more prone to overfitting because they can build
unrealistic data models; overfitting can often be reduced by using simpler linear and parametric
algorithms in the machine learning models.

Methods to reduce overfitting:

● Increase training data in a dataset.

● Reduce model complexity by selecting a simpler model with fewer parameters.

● Ridge regularization and Lasso regularization (see the sketch after this list).

● Early stopping during the training phase

● Reduce the noise


● Reduce the number of attributes in training data.

● Constraining the model.
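A minimal sketch of the Ridge-regularization remedy, assuming scikit-learn and numpy are installed; the noisy data, the polynomial degree, and the value of alpha are invented for illustration.

# Ridge regularization sketch (assumes scikit-learn and numpy are installed).
# A high-degree polynomial fit on noisy data tends to overfit; the Ridge
# penalty (alpha) shrinks the coefficients and reduces that overfitting.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 30)   # noisy target

overfit = make_pipeline(PolynomialFeatures(degree=15), LinearRegression()).fit(X, y)
regular = make_pipeline(PolynomialFeatures(degree=15), Ridge(alpha=1.0)).fit(X, y)

X_new = np.linspace(0, 1, 5).reshape(-1, 1)
print(overfit.predict(X_new))   # tends to oscillate between training points
print(regular.predict(X_new))   # stays much closer to the underlying sine curve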

Underfitting:

Underfitting is just the opposite of overfitting. Whenever a machine learning model is trained
with too little data, or with a model that is too simple, it produces incomplete and inaccurate
predictions and degrades the accuracy of the machine learning model.

Underfitting occurs when our model is too simple to capture the underlying structure of the data,
just like an undersized pant. This generally happens when we have limited data in the data set
and we try to build a linear model for non-linear data. In such scenarios, the model is not
complex enough, its rules become too simple to apply to the data set, and it starts making
wrong predictions.

Methods to reduce Underfitting:

● Increase model complexity (see the sketch after this list)

● Remove noise from the data

● Train on more and better features

● Reduce the constraints

● Increase the number of epochs to get better results.
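A minimal sketch of the "increase model complexity" remedy, assuming scikit-learn and numpy are installed; the quadratic data is invented for illustration.

# Underfitting sketch (assumes scikit-learn and numpy are installed).
# A straight line underfits clearly non-linear (quadratic) data; adding
# polynomial features increases model complexity and removes the underfit.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = X.ravel() ** 2                       # non-linear relationship

too_simple = LinearRegression().fit(X, y)
richer = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

print(too_simple.score(X, y))            # low R^2: the line underfits
print(richer.score(X, y))                # close to 1.0: quadratic features capture the structure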

5. Monitoring and maintenance

As we know, generalized output is mandatory for any machine learning model; hence, regular
monitoring and maintenance become compulsory. When the desired results change for different
actions, the data must change too, so updating the code, as well as providing resources to
monitor it, also becomes necessary.

6. Getting bad recommendations

A machine learning model operates within a specific context, and when that context shifts it can
produce bad recommendations because of concept drift. Let's understand with an example: at a
certain point in time a customer is looking for some gadgets, but the customer's requirements
change over time, while the machine learning model keeps showing the same recommendations
even though the customer's expectations have changed. This phenomenon is called data drift. It
generally occurs when new data is introduced or the interpretation of the data changes. However,
we can overcome this by regularly updating and monitoring the data according to the
expectations.

7. Lack of skilled resources

Although Machine Learning and Artificial Intelligence are continuously growing in the market,
these industries are still younger than most others. The absence of skilled resources in the
form of manpower is also an issue. Hence, we need manpower with in-depth knowledge of
mathematics, science, and technology for developing and managing machine learning
solutions.

8. Customer Segmentation

Customer segmentation is also an important issue while developing a machine learning
algorithm. It is difficult to identify the customers who act on the recommendations shown by the
model and those who do not even check them. Hence, an algorithm is necessary to recognize
customer behaviour and trigger relevant recommendations for the user based on past experience.

DESIGNING LEARNING SYSTEMS

According to Arthur Samuel, “Machine Learning enables a machine to automatically learn from
data, improve performance from experience, and predict things without being explicitly
programmed.”
In simple words, when we feed the training data to a machine learning algorithm, the algorithm
produces a mathematical model, and with the help of that mathematical model the machine
makes predictions and takes decisions without being explicitly programmed. Also, the more the
machine works with the training data, the more experience it gains and the more efficient the
results it produces.

Example: In a driverless car, the training data fed to the algorithm describes how to drive the car
on highways and on busy and narrow streets, with factors such as speed limits, parking, stopping
at signals, etc. A logical and mathematical model is then created on that basis, and after that the
car works according to the logical model. Also, the more data that is fed, the more efficient the
output produced.

Designing a Learning System in Machine Learning


According to Tom Mitchell, “A computer program is said to learn from experience E with
respect to some class of tasks T and performance measure P, if its performance at tasks in T, as
measured by P, improves with experience E.”
Example: In spam e-mail detection (a small sketch of computing P follows the list below),
● Task, T: to classify mails as Spam or Not Spam.
● Performance measure, P: the percentage of mails correctly classified as “Spam” or
“Not Spam”.
● Experience, E: a set of mails with their correct “Spam” / “Not Spam” labels.
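A tiny sketch of computing the performance measure P for this spam task; the label and prediction lists are invented for illustration.

# Hypothetical example: compute P = percentage of mails correctly classified.
true_labels = ["Spam", "Not Spam", "Spam", "Not Spam", "Spam"]      # part of experience E
predictions = ["Spam", "Not Spam", "Not Spam", "Not Spam", "Spam"]  # the model performing task T

correct = sum(t == p for t, p in zip(true_labels, predictions))
P = 100.0 * correct / len(true_labels)
print(P)   # 80.0 -> 80% of mails correctly classified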

Steps for Designing a Learning System:

Step 1- Choosing the Training Experience: The first and most important task is to choose the
training data or training experience which will be fed to the machine learning algorithm. It is
important to note that the data or experience we feed to the algorithm has a significant impact
on the success or failure of the model, so the training data or experience should be chosen
wisely.

Step 2- Choosing the Target Function: The next important step is choosing the target function.
According to the knowledge fed to the algorithm, the machine will learn a NextMove function
which describes what type of legal move should be taken. For example, while playing chess
with an opponent, when the opponent plays, the machine learning algorithm decides which of
the possible legal moves to take in order to succeed.
Step 3- Choosing a Representation for the Target Function: Once the machine knows all the
possible legal moves, the next step is to choose a representation for scoring them, e.g. linear
equations, a hierarchical graph representation, a tabular form, etc. Out of the legal moves, the
NextMove function then picks the target move that promises the highest success rate. For
example, while playing chess, if the machine has 4 possible moves, it will choose the optimized
move that gives it the best chance of success.
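A tiny sketch of the "linear equations" representation mentioned above, in the spirit of Tom Mitchell's classic checkers example; the feature names and weight values here are invented for illustration, and in a real system the weights would be learned.

# Hypothetical linear representation of a target function V(board):
# V(board) = w0 + w1*x1 + w2*x2 + ..., where the x's are board features.
def V(features, weights):
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))

# Invented features: [my_pieces, opponent_pieces, my_pieces_threatened]
weights = [0.5, 1.0, -1.0, -0.5]          # a learner would adjust these from experience
board_after_move_a = [12, 10, 1]
board_after_move_b = [12, 12, 0]

# NextMove picks the legal move whose resulting board scores highest.
best = max([board_after_move_a, board_after_move_b], key=lambda b: V(b, weights))
print(best)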
Step 4- Choosing a Function Approximation Algorithm: An optimized move cannot be chosen
just from the training data. The model has to work through a set of example games; from these
examples it approximates which moves should be chosen, and the outcome of each game
provides feedback. For example, when training data for playing chess is fed to the algorithm,
the machine will at first fail or succeed more or less at random; from each failure or success it
updates its estimate of which move should be chosen next and of that move's success rate.
Step 5- Final Design: The final design is created at last, once the system has gone through a
number of examples, failures and successes, and correct and incorrect decisions, and has learned
what the next step should be. Example: Deep Blue, an intelligent ML-based computer, won a
chess match against the chess expert Garry Kasparov, becoming the first computer to beat a
human world chess champion.

CONCEPT OF HYPOTHESIS

The hypothesis is a common term in Machine Learning and data science projects. As we know,
machine learning is one of the most powerful technologies across the world, which helps us to
predict results based on past experiences. Moreover, data scientists and ML professionals
conduct experiments that aim to solve a problem. These ML professionals and data scientists
make an initial assumption about the solution of the problem. This assumption in machine
learning is known as a hypothesis.

A hypothesis is defined as a supposition or proposed explanation based on insufficient
evidence or assumptions. It is just a guess based on some known facts but has not yet been
proven. A good hypothesis is testable, resulting in either true or false.

Example: Let's understand the hypothesis with a common example. A scientist claims that
ultraviolet (UV) light can damage the eyes, and that it may therefore also cause blindness.
In this example, the scientist only claims that UV rays are harmful to the eyes, while we assume
they may cause blindness. However, this may or may not turn out to be true. Hence, these types
of assumptions are called hypotheses.

HYPOTHESIS IN MACHINE LEARNING

The hypothesis is one of the commonly used concepts of statistics in Machine Learning. It is
specifically used in Supervised Machine learning, where an ML model learns a function that best
maps the input to corresponding outputs with the help of an available dataset.

In supervised learning techniques, the main aim is to determine the possible hypothesis out of
hypothesis space that best maps input to the corresponding or correct outputs.

There are some common methods for finding the possible hypothesis from the hypothesis
space, where the hypothesis space is represented by an uppercase H and a hypothesis by a
lowercase h. These are defined as follows:

Hypothesis space (H):

Hypothesis space is defined as a set of all possible legal hypotheses; hence it is also known as
a hypothesis set. It is used by supervised machine learning algorithms to determine the best
possible hypothesis to describe the target function or best maps input to output.

It is often constrained by the framing of the problem, the choice of model, and the choice of
model configuration.

Hypothesis (h): It is defined as the approximate function that best describes the target in
supervised machine learning algorithms. It is primarily based on the data as well as on the bias
and restrictions applied to the data.
The hypothesis (h) can be formulated in machine learning as follows:

y = mx + b

where,
y: output (range)
m: slope of the line that divides the data, i.e. the change in y divided by the change in x
x: input (domain)
b: intercept (constant)
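A tiny sketch of such a linear hypothesis h(x) = mx + b; the slope and intercept values and the data points are invented for illustration, whereas a learning algorithm would pick the hypothesis from the hypothesis space H that best fits the data.

# Hypothetical linear hypothesis h(x) = m*x + b.
def h(x, m, b):
    return m * x + b

# Two candidate hypotheses from the hypothesis space H (invented parameters).
h1 = lambda x: h(x, m=2.0, b=1.0)
h2 = lambda x: h(x, m=0.5, b=3.0)

data = [(1, 3.1), (2, 5.0), (3, 6.9)]        # invented (x, y) training points

# A learner would keep the hypothesis with the smaller error on the data.
def squared_error(hyp):
    return sum((hyp(x) - y) ** 2 for x, y in data)

print(squared_error(h1), squared_error(h2))  # h1 fits these points much better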
Example: Let's understand the hypothesis (h) and hypothesis space (H) with a two-dimensional
coordinate plane showing the distribution of data as follows:

Now, assume we have some test data by which ML algorithms predict the outputs for input as
follows:

If we divide this coordinate plane in such a way that it can help you to predict the output or
result as follows:
Based on the given test data, the output result will be as follows:

However, based on data, algorithm, and constraints, this coordinate plane can also be divided in
the following ways as follows:

With the above example, we can conclude that;

Hypothesis space (H) is the composition of all legal best possible ways to divide the coordinate
plane so that it best maps input to proper output.
Further, each individual best possible way is called a hypothesis (h). Hence, the hypothesis and
hypothesis space would be like this:

VERSION SPACE

A version space is a hierarchical representation of knowledge that enables you to keep track of
all the useful information supplied by a sequence of learning examples without remembering
any of the examples.

A version space description consists of two complementary trees:

1. One that contains nodes connected to overly general models, and


2. One that contains nodes connected to overly specific models.

Diagram of a Version Space

In the diagram of a version space, the specialization tree is colored red and the generalization
tree is drawn in a second, contrasting color.
INDUCTIVE BIAS

In machine learning, the term inductive bias refers to a set of (explicit or implicit) assumptions
made by a learning algorithm in order to perform induction, that is, to generalize a finite set of
observations (the training data) into a general model of the domain.

Inductive bias is nothing but the set of assumptions through which a model, by observing the
relationships between data points, makes itself a generalized model so that the accuracy of its
predictions increases when it is exposed to new test data in real time.

For example,

Let’s consider a regression model to predict the marks of a student considering attendance percentage
as an independent variable-

Here, the model will assume that there is a linear relationship between attendance percentage and
marks of the student. This assumption is nothing but an Inductive bias.

In the future, if any new test data is applied to the model, the model will try to predict the
marks according to what it learned from this training data. Linearity is important
information (an assumption) that this model has even before it sees the test data for the first
time. So, the inductive bias of this model is the assumption of linearity between the independent
and dependent variables.
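A minimal sketch of this attendance-to-marks regression, assuming scikit-learn and numpy are installed; the data points are invented for illustration, and the linearity built into LinearRegression is exactly the inductive bias being discussed.

# Linear regression sketch (assumes scikit-learn and numpy are installed).
# Inductive bias of the model: marks are assumed to be a linear function of attendance.
import numpy as np
from sklearn.linear_model import LinearRegression

attendance = np.array([[60], [70], [80], [90], [95]])   # invented attendance percentages
marks = np.array([55, 62, 71, 80, 84])                  # invented marks

model = LinearRegression().fit(attendance, marks)
print(model.coef_, model.intercept_)   # learned slope and intercept of the line
print(model.predict([[85]]))           # prediction for unseen attendance, via the linearity assumption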

PERFORMANCE METRICS IN MACHINE LEARNING

Evaluating the performance of a Machine learning model is one of the important steps while
building an effective ML model. To evaluate the performance or quality of the model, different
metrics are used, and these metrics are known as performance metrics or evaluation metrics.
These performance metrics help us understand how well our model has performed for the given
data. In this way, we can improve the model's performance by tuning the hyper-parameters.

In a classification problem, the category or class of the data is identified based on training data.
The model learns from the given dataset and then classifies the new data into classes or groups
based on that training. It predicts class labels as the output, such as Yes or No, 0 or 1, Spam or
Not Spam, etc. To evaluate the performance of a classification model, different metrics are used,
and some of them are as follows:

● Accuracy

● Precision

● Recall

● AUC(Area Under the Curve)-ROC

I. Accuracy

The accuracy metric is one of the simplest classification metrics to implement, and it can be
determined as the number of correct predictions divided by the total number of predictions. It
can be formulated as:

Accuracy = (Number of correct predictions) / (Total number of predictions)
         = (TP + TN) / (TP + TN + FP + FN)

When to Use Accuracy?

It is good to use the accuracy metric when the target variable classes in the data are
approximately balanced. For example, if 60% of the images in a fruit dataset are of Apple and
40% are of Mango, the classes are reasonably balanced, so accuracy is a meaningful way to
judge how well the model predicts whether an image is of Apple or Mango.

When not to use Accuracy?

It is recommended not to use the accuracy measure when the target variable mostly belongs to
one class. For example, suppose there is a disease-prediction model in which, out of 100
people, only five people have the disease and 95 people don't. In this case, even if our model
predicts that every person is disease-free (which is a useless prediction rule), the accuracy
measure will still be 95%, which is misleading.
Precision

The precision metric is used to overcome the limitation of accuracy. Precision determines the
proportion of positive predictions that were actually correct. It is calculated as the number of
true positives divided by the total number of positive predictions (true positives plus false
positives):

Precision = TP / (TP + FP)

Recall or Sensitivity

It is similar to the precision metric; however, it aims to calculate the proportion of actual
positives that were identified correctly. It is calculated as the number of true positives divided
by the total number of actual positives, i.e. those correctly predicted as positive plus those
incorrectly predicted as negative (true positives plus false negatives).

The formula for calculating Recall is given below:

Recall = TP / (TP + FN)
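A minimal sketch computing these three metrics, assuming scikit-learn is installed; the label vectors are invented for illustration (1 = positive class, 0 = negative class).

# Accuracy / precision / recall sketch (assumes scikit-learn is installed).
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]      # invented ground-truth labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]      # invented model predictions

print(accuracy_score(y_true, y_pred))    # correct predictions / total predictions
print(precision_score(y_true, y_pred))   # TP / (TP + FP)
print(recall_score(y_true, y_pred))      # TP / (TP + FN)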

AUC-ROC

Sometimes we need to visualize the performance of the classification model on charts; then, we
can use the AUC-ROC curve. It is one of the popular and important metrics for evaluating the
performance of the classification model.

Firstly, let's understand the ROC (Receiver Operating Characteristic) curve. ROC is a graph
showing the performance of a classification model at different threshold levels. The curve is
plotted between two parameters, which are:

● True Positive Rate

● False Positive Rate


TPR or True Positive Rate is a synonym for Recall, and hence can be calculated as:

TPR = TP / (TP + FN)

FPR or False Positive Rate can be calculated as:

FPR = FP / (FP + TN)

To calculate the value at every point on a ROC curve, we could evaluate a logistic regression
model many times with different classification thresholds, but this would not be very efficient.
Instead, one efficient, aggregate method is used, which is known as AUC.

AUC: Area Under the ROC curve

AUC stands for Area Under the ROC curve. As its name suggests, AUC calculates the
two-dimensional area under the entire ROC curve.

AUC aggregates performance across all thresholds into a single measure. The value of AUC
ranges from 0 to 1: a model whose predictions are 100% wrong has an AUC of 0.0, whereas a
model whose predictions are 100% correct has an AUC of 1.0.
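A minimal sketch of computing AUC from predicted scores, assuming scikit-learn is installed; the labels and scores are invented for illustration.

# AUC-ROC sketch (assumes scikit-learn is installed).
# AUC is computed from the model's scores/probabilities rather than hard labels,
# so it summarises ranking quality across all classification thresholds.
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 0, 1, 1, 0]                        # invented ground truth
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.3]     # invented predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_scores)       # points of the ROC curve
print(roc_auc_score(y_true, y_scores))                   # area under that curve, between 0 and 1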

When to Use AUC

AUC should be used to measure how well the predictions are ranked rather than their absolute
values. Moreover, it measures the quality of predictions of the model without considering the
classification threshold.
When not to use AUC

AUC is scale-invariant, which is not always desirable; when we need well-calibrated probability
outputs, AUC is not the preferable metric.

Further, AUC is not a useful metric when there are wide disparities in the cost of false negatives
vs. false positives and we care about minimizing one particular type of classification error.
