INTRODUCTION
Machine learning is a growing technology that enables computers to learn automatically from
past data. It uses various algorithms to build mathematical models and make predictions from
historical data or information. Currently, it is used for tasks such as image recognition,
speech recognition, email filtering, Facebook auto-tagging, recommender systems, and many
more. The term machine learning was first introduced by Arthur Samuel in 1959.
In the real world, we are surrounded by humans who can learn from their experiences, and we
have computers or machines that work on our instructions. But can a machine also learn from
experience or past data as a human does? This is where machine learning comes in.
Machine learning is a subset of artificial intelligence that is mainly concerned with the
development of algorithms that allow a computer to learn from data and past experience on
its own.
A machine learning system learns from historical data, builds prediction models, and,
whenever it receives new data, predicts the output for it. The accuracy of the predicted
output depends in part on the amount of data, as a larger amount of good-quality data
generally helps to build a better model that predicts the output more accurately.
Suppose we have a complex problem in which we need to make predictions. Instead of writing
code for it by hand, we feed the data to generic algorithms, and with the help of these
algorithms the machine builds the logic from the data and predicts the output. Machine
learning has changed the way we think about such problems. The block diagram below explains
how a machine learning algorithm works:
FEATURES OF MACHINE LEARNING
● It is a data-driven technology.
● Machine learning is similar to data mining, as it also deals with huge amounts of data.
The need for machine learning is increasing day by day. The reason is that it is capable of
doing tasks that are too complex for a person to implement directly. As humans, we have
limitations: we cannot process huge amounts of data manually, so we need computer systems,
and this is where machine learning makes things easier for us.
We can train machine learning algorithms by providing them with a large amount of data and
letting them explore the data, construct models, and predict the required output
automatically. The performance of a machine learning algorithm depends on the amount of data
and can be measured with a cost function. With the help of machine learning, we can save
both time and money.
The importance of machine learning can be easily understood from its use cases. Currently,
machine learning is used in self-driving cars, cyber fraud detection, face recognition,
friend suggestion by Facebook, and so on. Top companies such as Netflix and Amazon have
built machine learning models that use vast amounts of data to analyze user interest and
recommend products accordingly.
Supervised learning is a type of machine learning in which machines are trained using
well-"labelled" training data and, on the basis of that data, predict the output. Labelled
data means that some input data is already tagged with the correct output.
In supervised learning, the training data provided to the machine acts as the supervisor
that teaches the machine to predict the output correctly. It applies the same concept as a
student learning under the supervision of a teacher.
Supervised learning is a process of providing input data as well as correct output data to the
machine learning model. The aim of a supervised learning algorithm is to find a mapping
function to map the input variable(x) with the output variable(y).
In the real-world, supervised learning can be used for Risk Assessment, Image classification,
Fraud Detection, spam filtering, etc.
In supervised learning, models are trained using a labelled dataset, from which the model
learns about each type of data. Once training is complete, the model is tested on test data
(a held-out portion of the labelled data that was not used for training), and then it
predicts the output.
The working of supervised learning can be easily understood from the following example:
Suppose we have a dataset of different types of shapes, including squares, rectangles,
triangles, and other polygons. The first step is to train the model on each shape.
● If the given shape has four sides, and all the sides are equal, then it will be labelled as a
Square.
● If the given shape has three sides, then it will be labelled as a triangle.
● If the given shape has six equal sides then it will be labelled as hexagon.
Now, after training, we test our model using the test set, and the task of the model is to identify
the shape.
The machine is already trained on all types of shapes, so when it finds a new shape, it
classifies the shape on the basis of the number of sides and predicts the output.
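The shape example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the feature encoding (number of sides, whether all sides are equal), the tiny hand-made dataset, and the function names `train` and `predict` are introduced here and do not come from the text. The "model" simply stores the labelled examples and answers with the label of the closest one (a 1-nearest-neighbour rule).

```python
# Each training example pairs input features with its correct label.
# Assumed feature encoding: (number_of_sides, all_sides_equal as 0 or 1).
TRAINING_DATA = [
    ((3, 0), "triangle"),
    ((3, 1), "triangle"),
    ((4, 1), "square"),
    ((4, 0), "rectangle"),
    ((6, 1), "hexagon"),
]


def train(examples):
    """'Training' a 1-nearest-neighbour model just stores the labelled examples."""
    return list(examples)


def predict(model, features):
    """Return the label of the stored example closest to `features`."""
    def squared_distance(example):
        stored_features, _label = example
        return sum((a - b) ** 2 for a, b in zip(stored_features, features))

    _nearest_features, nearest_label = min(model, key=squared_distance)
    return nearest_label


model = train(TRAINING_DATA)

# New shapes the model is asked to label, with their true labels for checking.
TEST_DATA = [((4, 1), "square"), ((3, 0), "triangle"), ((6, 1), "hexagon")]
correct = sum(predict(model, x) == y for x, y in TEST_DATA)
print(f"test accuracy: {correct}/{len(TEST_DATA)}")
```

The labels in `TRAINING_DATA` play the role of the supervisor: the model never sees a rule such as "four equal sides means square", only examples tagged with the right answer.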
As the name suggests, unsupervised learning is a machine learning technique in which models
are not supervised using a labelled training dataset. Instead, the models themselves find
hidden patterns and insights in the given data. It can be compared to the learning that
takes place in the human brain when it encounters new things.
Example: Suppose the unsupervised learning algorithm is given an input dataset containing
images of different types of cats and dogs. The algorithm is never given labels for the
dataset, which means it has no prior idea about the features of the dataset. The task of the
unsupervised learning algorithm is to identify the image features on its own. It performs
this task by clustering the image dataset into groups according to the similarities between
images.
WORKING OF UNSUPERVISED MACHINE LEARNING
Here, we take unlabelled input data, which means it is not categorized and the corresponding
outputs are not given. This unlabelled input data is fed to the machine learning model in
order to train it. First, the model interprets the raw data to find hidden patterns, and
then it applies a suitable algorithm such as k-means clustering or hierarchical clustering.
Once the suitable algorithm is applied, it divides the data objects into groups according to
the similarities and differences between the objects.
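The grouping step can be illustrated with a small k-means sketch. It is a simplified version written for this note, not a library implementation: the function name `kmeans`, the 2-D toy points, and the choice to start from the first k points (instead of a random start) are assumptions made to keep the example short and reproducible.

```python
def kmeans(points, k, iterations=10):
    """Plain k-means on 2-D points; returns (centroids, cluster index per point)."""
    # Assumption for reproducibility: start from the first k points
    # instead of a random initialisation.
    centroids = [points[i] for i in range(k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, (x, y) in enumerate(points):
            distances = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
            assignments[i] = distances.index(min(distances))
        # Update step: move each centroid to the mean of its points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, assignments


# Unlabelled data: two separate groups, but no labels are given.
data = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 8.5), (1.0, 0.5), (8.5, 9.0)]
centroids, assignments = kmeans(data, k=2)
print(assignments)
print(centroids)
```

No output labels are supplied anywhere; the two groups emerge only from the distances between the points.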
REINFORCEMENT LEARNING
● In reinforcement learning, the agent learns automatically from feedback, without any
labelled data, unlike supervised learning.
● Since there is no labelled data, the agent is bound to learn from its experience only.
● The agent interacts with the environment and explores it by itself. The primary goal of an
agent in reinforcement learning is to improve its performance by collecting the maximum
positive reward.
● The agent learns by trial and error and, based on its experience, learns to perform the
task better. Hence, we can say that "Reinforcement learning is a type of machine learning
method where an intelligent agent (computer program) interacts with the environment and
learns to act within that." A robotic dog learning to move its limbs is an example of
reinforcement learning.
● Example: Suppose an AI agent is placed in a maze environment, and its goal is to find the
diamond. The agent interacts with the environment by performing actions; based on those
actions, the state of the agent changes, and it receives a reward or penalty as feedback.
● The agent keeps doing these three things (take an action, change state or remain in the
same state, and get feedback), and by doing so it learns and explores the environment.
● The agent learns which actions lead to positive feedback (rewards) and which lead to
negative feedback (penalties). As a reward, the agent gets a positive point, and as a
penalty, it gets a negative point.
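The maze example can be made concrete with a tabular Q-learning sketch. Q-learning is one common reinforcement learning algorithm and is a choice made for this illustration; the text does not name a specific algorithm. The environment is also an assumption: a corridor of five cells with the diamond in the last cell, a reward of +1 for reaching it, and a penalty of -1 for walking into the left wall.

```python
import random

N_STATES = 5          # corridor cells 0..4; the diamond sits in cell 4
GOAL = N_STATES - 1
LEFT, RIGHT = 0, 1
ACTIONS = (LEFT, RIGHT)


def step(state, action):
    """Environment: return (next_state, reward) for one move."""
    if action == RIGHT:
        next_state = state + 1
        reward = 1.0 if next_state == GOAL else 0.0   # reward for the diamond
        return next_state, reward
    if state == 0:
        return 0, -1.0      # bumping into the wall: penalty, same state
    return state - 1, 0.0


def greedy_action(q, state, rng):
    """Pick the best-valued action, breaking ties at random."""
    best = max(q[state])
    return rng.choice([a for a in ACTIONS if q[state][a] == best])


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # Q-table: q[state][action]
    for _episode in range(episodes):
        state = 0
        for _step in range(1000):               # safety cap on episode length
            if state == GOAL:
                break
            # Explore with probability epsilon, otherwise exploit.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = greedy_action(q, state, rng)
            next_state, reward = step(state, action)
            # Q-learning update from the observed feedback.
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q_table = train()
policy = ["left" if q[LEFT] > q[RIGHT] else "right" for q in q_table[:GOAL]]
print(policy)
```

Each pass through the inner loop is one round of "take action, change state, get feedback"; the Q-table is the agent's accumulated experience.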
PERSPECTIVES AND ISSUES
Although machine learning is used in many industries and helps organizations make more
informed, data-driven choices that are often more effective than classical methodologies, it
still has many problems that cannot be ignored. Here are some common issues that
professionals face when building ML skills and creating an application from scratch.
The major issue that arises when using machine learning algorithms is the lack of quality as
well as quantity of data. Data plays a vital role in machine learning, and many data
scientists report that inadequate, noisy, and unclean data severely hamper the algorithms.
For example, a simple task may require thousands of samples, and an advanced task such as
speech or image recognition may need millions of examples. Data quality is also important
for the algorithms to work well, yet poor data quality is common in machine learning
applications. Data quality can be affected by the following factors:
● Noisy data- It leads to inaccurate predictions, which affects decisions as well as accuracy
in classification tasks.
● Incorrect data- It leads to faulty models and faulty results. Hence, incorrect data may
also affect the accuracy of the results.
● Generalizing of output data- Sometimes generalizing the output data becomes complex, which
results in comparatively poor future actions.
As discussed above, data plays a significant role in machine learning, and it must be of
good quality. Noisy, incomplete, inaccurate, and unclean data lead to lower accuracy in
classification and low-quality results. Hence, data quality can also be considered a major
common problem in machine learning.
Non-representative training data:
To make sure the trained model generalizes well, we have to ensure that the sample training
data is representative of the new cases to which we need to generalize. The training data
should cover the cases that have already occurred as well as those that are occurring now.
If we use non-representative training data, the model makes less accurate predictions. A
machine learning model is considered ideal if it predicts well for general cases and
provides accurate decisions. If there is too little training data, the sample contains
sampling noise and may be non-representative, so its predictions will not be accurate. As a
result, the model may be biased against one class or group.
Hence, we should use representative data in training to protect against bias and to make
accurate predictions without drift.
Overfitting:
Overfitting is one of the most common issues faced by machine learning engineers and data
scientists. It occurs when a model fits its training data too closely and starts capturing
the noise and inaccurate entries in the training set as if they were genuine patterns. The
model then performs well on the training data but poorly on new data. Skewed training data
makes matters worse. Let's understand this with a simple example in which the training set
contains 1000 mangoes, 1000 apples, 1000 bananas, and 5000 papayas. There is then a
considerable probability that an apple is identified as a papaya, because the training set
is heavily biased towards papayas, and prediction is negatively affected. Flexible
non-linear methods are especially prone to overfitting, since they can build unrealistic
models of the data. Overfitting can be reduced by using simpler models, such as linear and
parametric algorithms, and in particular by the following measure:
● Reduce model complexity by selecting a model with fewer parameters.
Underfitting:
Underfitting is the opposite of overfitting. It occurs when a model cannot capture the
structure of the data, so it gives inaccurate results on both the training data and new
data.
Underfitting happens when our model is too simple to capture the underlying structure of the
data, just like undersized trousers. It generally arises when there is too little data in
the data set or when we try to fit a linear model to non-linear data. In such scenarios the
rules of the machine learning model are too simple for the data set, and the model starts
making wrong predictions.
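The contrast between underfitting and overfitting shows up when training error and test error are compared. The sketch below uses made-up data (a true relation y = 2x with alternating +1/-1 noise) and three hand-written models chosen for illustration: a constant (too simple), a least-squares line, and a model that memorises every training point, noise included. All names and numbers are assumptions of this example.

```python
def mse(model, xs, ys):
    """Mean squared error of `model` (a function x -> prediction)."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Training data: the true relation is y = 2x, plus alternating +1/-1 noise.
train_x = [float(x) for x in range(10)]
train_y = [2 * x + (1 if i % 2 == 0 else -1) for i, x in enumerate(train_x)]
# Test data: unseen inputs that follow the noise-free relation.
test_x = [x + 0.25 for x in range(9)]
test_y = [2 * x for x in test_x]

mean_x = sum(train_x) / len(train_x)
mean_y = sum(train_y) / len(train_y)


def constant_model(x):
    """Underfit: ignores x and always answers with the training mean."""
    return mean_y


# Least-squares straight line through the training data.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
         / sum((x - mean_x) ** 2 for x in train_x))
intercept = mean_y - slope * mean_x


def linear_model(x):
    """Reasonable fit: matches the linear structure, averages out the noise."""
    return slope * x + intercept


def memorising_model(x):
    """Overfit: answers with the stored (noisy) value of the nearest training input."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


for name, model in [("constant", constant_model),
                    ("linear", linear_model),
                    ("memorising", memorising_model)]:
    print(name,
          "train:", round(mse(model, train_x, train_y), 3),
          "test:", round(mse(model, test_x, test_y), 3))
```

The memorising model reaches zero training error yet does worse than the line on the test points, while the constant model is poor on both: low training error alone does not indicate a good model.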
Because a machine learning model must keep generalizing well, regular monitoring and
maintenance are necessary. When the required outputs change, the data changes too; hence the
code has to be updated, and resources for monitoring it become necessary.
A machine learning model operates in a specific context, and when that context changes, the
result can be bad recommendations and concept drift. Let's understand this with an example:
at one time a customer is looking for some gadgets, but the customer's requirements change
over time, while the model keeps showing the same recommendations even though the customer's
expectations have changed. This is called data drift. It generally occurs when new data is
introduced or the interpretation of the data changes. We can overcome it by regularly
monitoring the data and updating the model accordingly.
Although machine learning and artificial intelligence are growing continuously in the
market, these industries are still young in comparison to others. The shortage of skilled
people is also an issue. Hence, we need people with in-depth knowledge of mathematics,
science, and technology to develop and manage machine learning systems.
According to Arthur Samuel, machine learning enables a machine to learn automatically from
data, improve its performance with experience, and make predictions without being explicitly
programmed.
In simple words, when we feed training data to a machine learning algorithm, the algorithm
produces a mathematical model, and with the help of that model the machine makes predictions
and takes decisions without being explicitly programmed. Also, the more training data the
machine works with, the more experience it gains and the better its results become.
Example: In a driverless car, the training data fed to the algorithm covers how to drive on
a highway and on busy and narrow streets, with factors such as speed limits, parking, and
stopping at signals. A logical and mathematical model is created on the basis of that data,
and the car then works according to this model. Also, the more data is fed in, the better
the output becomes.
Step 1- Choosing the Training Experience: The first and very important task is to choose the
training data or training experience that will be fed to the machine learning algorithm. It
is important to note that the data or experience we feed to the algorithm has a significant
impact on the success or failure of the model, so it should be chosen wisely.
Step 2- Choosing target function: The next important step is choosing the target function.
According to the knowledge fed to the algorithm, the system chooses a NextMove function that
describes which legal moves should be taken. For example, while playing chess, after the
opponent has played, the machine learning algorithm decides which of the possible legal
moves to take in order to succeed.
Step 3- Choosing Representation for Target function: Once the algorithm knows all the
possible legal moves, the next step is to choose the optimized move using some
representation, for example linear equations, a hierarchical graph representation, or a
tabular form. Out of the available moves, the NextMove function selects the target move with
the highest success rate. For example, if the machine has 4 possible moves while playing
chess, it chooses the optimized move that is most likely to lead to success.
Step 4- Choosing Function Approximation Algorithm: An optimized move cannot be chosen from
the training data alone. The algorithm has to work through a set of examples; from these
examples it approximates which moves should be chosen, and the machine then receives
feedback on them. For example, when chess training data is fed to the algorithm, it is not
known in advance whether the algorithm will fail or succeed; from each failure or success it
estimates which move should be chosen next and what its success rate is.
Step 5- Final Design: The final design is created last, after the system has gone through a
number of examples, failures and successes, and correct and incorrect decisions, and has
learned what the next step should be. Example: Deep Blue, IBM's chess computer, won a match
against the world chess champion Garry Kasparov in 1997 and became the first computer to
defeat a reigning world champion in a match under standard tournament conditions.
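Steps 3 and 4 can be illustrated with a linear representation of the target function whose weights are tuned from training examples. The sketch below uses the least-mean-squares (LMS) weight update, a common textbook choice for this kind of design; it is an assumption of this example, not something the text prescribes. The two board features (own pieces, opponent pieces) and the training values are made up.

```python
def evaluate(weights, features):
    """Linear representation of the target function: w0 + sum(w_i * x_i)."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))


def lms_update(weights, features, training_value, learning_rate=0.01):
    """One least-mean-squares step: nudge the weights to reduce the error."""
    error = training_value - evaluate(weights, features)
    updated = [weights[0] + learning_rate * error]
    updated += [w + learning_rate * error * x
                for w, x in zip(weights[1:], features)]
    return updated


def total_squared_error(weights, examples):
    """Sum of squared differences between training values and estimates."""
    return sum((value - evaluate(weights, f)) ** 2 for f, value in examples)


# Hypothetical board features: (own pieces, opponent pieces), paired with
# made-up training values (+100 = winning position, -100 = losing position).
examples = [((8, 2), 100.0), ((2, 8), -100.0), ((5, 5), 0.0)]

weights = [0.0, 0.0, 0.0]
error_before = total_squared_error(weights, examples)

# Repeatedly work through the examples, using the feedback (the error)
# from each one to adjust the weights.
for _ in range(200):
    for features, value in examples:
        weights = lms_update(weights, features, value)

error_after = total_squared_error(weights, examples)
print(error_before, error_after)
print([round(w, 2) for w in weights])
```

Here the learned weights are the "final design": once they are fixed, choosing a move amounts to evaluating the positions each legal move leads to and picking the one with the highest estimate.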
CONCEPT OF HYPOTHESIS
The hypothesis is a common term in machine learning and data science projects. As we know,
machine learning is one of the most powerful technologies across the world, and it helps us
to predict results based on past experience. Data scientists and ML professionals conduct
experiments that aim to solve a problem, and they make an initial assumption about the
solution of the problem. This assumption is known in machine learning as a hypothesis.
Example: Let's understand the hypothesis with a common example. Suppose a scientist claims
that ultraviolet (UV) light can damage the eyes, and we go on to suppose that it may also
cause blindness.
In this example, the scientist only claims that UV rays are harmful to the eyes, but we
assume they may cause blindness. This may or may not be true. Assumptions of this kind are
called hypotheses.
The hypothesis is one of the commonly used concepts of statistics in machine learning. It is
specifically used in supervised machine learning, where an ML model learns a function that
best maps the input to the corresponding outputs with the help of an available dataset.
In supervised learning techniques, the main aim is to determine, out of the hypothesis
space, the hypothesis that best maps the input to the corresponding correct outputs.
There are some common methods for finding a suitable hypothesis in the hypothesis space,
where the hypothesis space is represented by uppercase-h (H) and a hypothesis by lowercase-h
(h). These are defined as follows:
Hypothesis space (H): It is defined as the set of all possible legal hypotheses; hence it is
also known as a hypothesis set. It is used by supervised machine learning algorithms to
determine the hypothesis that best describes the target function, that is, best maps input
to output.
It is often constrained by the choice of the framing of the problem, the choice of model,
and the choice of model configuration.
Hypothesis (h): It is defined as the approximate function that best describes the target in
supervised machine learning algorithms. It is primarily based on the data as well as on the
bias and restrictions applied to the data.
The hypothesis (h) can be formulated in machine learning as follows:
y = mx + b
Where,
y: range (the predicted output)
m: slope of the line, that is, the change in y divided by the change in x
x: domain (the input)
b: intercept (constant)
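As a small worked example of the linear hypothesis, the sketch below estimates the slope m and the intercept from data with ordinary least squares and then uses the resulting h(x) to predict a new value. Least squares is one standard way to pick the line and is an assumption here, as are the function names and the made-up points.

```python
def fit_line(xs, ys):
    """Return slope m and intercept b of the least-squares line y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - m * mean_x
    return m, b


def hypothesis(x, m, b):
    """The learned hypothesis h(x) = m*x + b."""
    return m * x + b


# Made-up points lying on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

m, b = fit_line(xs, ys)
print(m, b, hypothesis(4.0, m, b))
```

Every pair (m, b) is one hypothesis h; the set of all such lines is the hypothesis space H from which the fitting procedure selects.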
Example: Let's understand the hypothesis (h) and hypothesis space (H) using a
two-dimensional coordinate plane on which the data points are plotted.
Now, assume we have some test points for which the ML algorithm must predict the output.
If we divide the coordinate plane with a boundary, each region corresponds to one output,
and the prediction for a test point is the output of the region in which it falls.
However, depending on the data, the algorithm, and the constraints, the same coordinate
plane can be divided in several different ways.
Hypothesis space (H) is the set of all legal ways to divide the coordinate plane so that
inputs are mapped to the proper outputs.
Each individual way of dividing the plane is called a hypothesis (h).
VERSION SPACE
A version space is a hierarchical representation of knowledge that enables you to keep track
of all the useful information supplied by a sequence of learning examples without
remembering any of the examples.
In the diagram below, the specialization tree is colored red, and the generalization tree is colored
.
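One way to make the idea concrete is the usual formal reading of a version space: the subset of the hypothesis space that is consistent with every training example seen so far. That reading, the toy hypothesis space of threshold rules, and the example data below are assumptions of this sketch. Each example is used once to discard inconsistent hypotheses and is not stored afterwards.

```python
# Hypothesis space H (an assumption for this sketch): threshold rules
# h_t(x) = (x >= t) for every integer threshold t from 0 to 10.
HYPOTHESIS_SPACE = list(range(11))


def h(threshold, x):
    """One hypothesis: label x as positive when it reaches the threshold."""
    return x >= threshold


def version_space(hypotheses, examples):
    """Keep only the hypotheses that agree with every labelled example."""
    return [t for t in hypotheses
            if all(h(t, x) == label for x, label in examples)]


# Made-up training examples: (input, correct label).
examples = [(2, False), (7, True), (4, False)]

# Process the examples one at a time; only the surviving hypotheses
# are carried forward, not the examples themselves.
remaining = HYPOTHESIS_SPACE
for x, label in examples:
    remaining = version_space(remaining, [(x, label)])
    print(f"after ({x}, {label}): thresholds {remaining}")
```

The final list of surviving thresholds summarises everything the three examples have taught, which is the sense in which the examples no longer need to be remembered.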
INDUCTIVE BIAS
In machine learning, the term inductive bias refers to a set of (explicit or implicit)
assumptions made by a learning algorithm in order to perform induction, that is, to
generalize a finite set of observations (training data) into a general model of the domain.
In other words, inductive bias is the set of assumptions a model relies on when it learns
from the relationships between data points, so that it becomes a generalized model whose
predictions remain accurate when it is exposed to new test data.
For example,
Let's consider a regression model that predicts the marks of a student using attendance
percentage as the independent variable.
Here, the model assumes that there is a linear relationship between attendance percentage
and the marks of the student. This assumption is an inductive bias.
In the future, when new test data is given to the model, it will predict the marks according
to what it learned from the training data. Linearity is information (an assumption) that
the model holds even before it sees the test data for the first time.
So, the inductive bias of this model is the assumption of linearity between the independent
and dependent variables.
Evaluating the performance of a Machine learning model is one of the important steps while
building an effective ML model. To evaluate the performance or quality of the model, different
metrics are used, and these metrics are known as performance metrics or evaluation metrics.
These performance metrics help us understand how well our model has performed for the given
data. In this way, we can improve the model's performance by tuning the hyper-parameters.
In a classification problem, the category or class of the data is identified based on the
training data. The model learns from the given dataset and then classifies new data into
classes or groups based on that training. It predicts class labels as the output, such as
Yes or No, 0 or 1, Spam or Not Spam, etc. To evaluate the performance of a classification
model, different metrics are used, and some of them are as follows:
● Accuracy
● Precision
● Recall
Accuracy
The accuracy metric is one of the simplest classification metrics to implement. It is the
ratio of the number of correct predictions to the total number of predictions. It can be
formulated as:
Accuracy = Number of correct predictions / Total number of predictions
It is good to use the accuracy metric when the target variable classes in the data are
approximately balanced. For example, suppose 60% of the images in a fruit dataset are of
Apple and 40% are of Mango. In this case, accuracy is a reasonable measure of how well the
model tells Apple from Mango, because a model that always predicts the majority class would
reach only 60% accuracy.
It is recommended not to use the accuracy measure when the target variable mostly belongs to
one class. For example, suppose there is a disease prediction model in which, out of 100
people, only five have the disease and 95 do not. In this case, if the model predicts that
no one has the disease (which is a bad prediction), the accuracy will still be 95%, which is
misleading because the model does not identify a single patient.
Precision
The precision metric is used to overcome the limitation of accuracy. Precision determines
the proportion of positive predictions that were actually correct. It is calculated as the
ratio of true positives (positive predictions that are actually positive) to the total
number of positive predictions (true positives plus false positives):
Precision = TP / (TP + FP)
Recall or Sensitivity
It is similar to the precision metric; however, it aims to calculate the proportion of
actual positives that were identified correctly. It is calculated as the ratio of true
positives to the total number of actual positives, whether correctly predicted as positive
or incorrectly predicted as negative (true positives plus false negatives):
Recall = TP / (TP + FN)
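The three metrics can be computed directly from the counts of true and false positives and negatives. The sketch below applies them to the disease example from the accuracy discussion (5 patients among 100 people) and to a second, hypothetical model; the function names and the second model's predictions are assumptions made for illustration.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for 0/1 labels."""
    tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
    tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
    fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
    fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    # Undefined when nothing was predicted positive; returning None
    # is a choice made for this sketch.
    return tp / (tp + fp) if (tp + fp) else None


def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else None


# The disease example: 5 of 100 people are ill (label 1).
actual = [1] * 5 + [0] * 95

# A model that always predicts "no disease".
always_negative = [0] * 100
tp, tn, fp, fn = confusion_counts(actual, always_negative)
print(accuracy(tp, tn, fp, fn), precision(tp, fp), recall(tp, fn))

# A hypothetical model that finds 4 of the 5 patients and raises 2 false alarms.
better = [1, 1, 1, 1, 0] + [1, 1] + [0] * 93
tp, tn, fp, fn = confusion_counts(actual, better)
print(accuracy(tp, tn, fp, fn), precision(tp, fp), recall(tp, fn))
```

The always-negative model scores 95% accuracy with a recall of zero, which is exactly the weakness of accuracy on imbalanced classes that precision and recall expose.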
AUC-ROC
Sometimes we need to visualize the performance of a classification model on a chart; for
this we can use the AUC-ROC curve. It is one of the popular and important metrics for
evaluating the performance of a classification model.
First, let's understand the ROC (Receiver Operating Characteristic) curve. A ROC curve is a
graph that shows the performance of a classification model at different threshold levels.
The curve is plotted between two parameters: the True Positive Rate (TPR) and the False
Positive Rate (FPR).
To calculate the value at any point on a ROC curve, we could evaluate a model such as
logistic regression many times with different classification thresholds, but this would not
be very efficient. Instead, an efficient summary is used, which is known as AUC.
AUC stands for Area Under the ROC Curve. As its name suggests, AUC calculates the
two-dimensional area under the entire ROC curve.
AUC calculates the performance across all the thresholds and provides an aggregate measure.
The value of AUC ranges from 0 to 1. It means a model with 100% wrong prediction will have
an AUC of 0.0, whereas models with 100% correct predictions will have an AUC of 1.0.
AUC should be used to measure how well the predictions are ranked rather than their absolute
values. Moreover, it measures the quality of predictions of the model without considering the
classification threshold.
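A ROC curve and its AUC can be computed by sweeping the classification threshold over the model's scores. The sketch below does this in plain Python with the trapezoidal rule; the function names, labels, and scores are made up for illustration, and it assumes the data contains at least one positive and one negative example.

```python
def roc_points(labels, scores):
    """(FPR, TPR) pairs obtained by sweeping the threshold over every score."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


labels = [1, 1, 0, 1, 0, 0]               # made-up ground truth
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]   # made-up model scores

print(auc(roc_points(labels, scores)))

# Halving every score keeps the ranking, so the AUC is unchanged.
print(auc(roc_points(labels, [s / 2 for s in scores])))
```

Because only the order of the scores matters, rescaling them leaves the AUC untouched, which is the ranking property described above.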
When not to use AUC
AUC is scale-invariant, which is not always desirable: when we need well-calibrated
probability outputs, AUC is not the preferred metric.
Further, AUC is not a useful metric when there are wide disparities in the cost of false
negatives vs. false positives and it is critical to minimize one type of classification
error.