PALLAVI SHUKLA
Assistant Professor
United College Of Engineering & Research, Prayagraj
UNIT 1 - INTRODUCTION
Learning, Types of Learning,
Well-defined learning problems,
Designing a Learning System,
History of ML, Introduction of Machine Learning Approaches – (Artificial Neural
Network,
Clustering, Reinforcement Learning,
Decision Tree Learning,
Bayesian networks, Support Vector Machine,
Genetic Algorithm),
Issues in Machine Learning and Data Science vs. Machine Learning;
What is Human learning?
Human learning happens in one of three ways:
• (1) somebody who is an expert in the subject directly teaches us,
• (2) we build our own notion indirectly, based on what we have learned from the expert in the past, or
• (3) we do it ourselves, maybe after multiple attempts, some of them unsuccessful.
Learning under Expert Guidance -
• An infant may pick up certain traits and characteristics by learning directly from its guardians.
• He calls his hand a ‘hand’ because that is the information he gets from his parents.
• The sky is ‘blue’ to him because that is what his parents have taught him.
• We say that the baby ‘learns’ things from his parents.
Learning guided by knowledge gained from
experts -
• In all these situations, there is no direct learning. Instead, past information, shared in a different context, is used as learning to make decisions.
Learning by Self -
• The process of fitting a model to a dataset is known as training. When the model
has been trained, the data has been transformed into an abstract form that
summarizes the original information.
Generalization -
• The term generalization describes the process of turning abstracted knowledge about the stored data into a form that can be utilized for future action.
• These actions are to be carried out on tasks that are similar, but not identical, to those that have been seen before.
• In generalization, the goal is to discover those properties of the data that will be most relevant to future tasks.
• It acts as a search through the entire set of models (that is, theories or
inferences) that could be established from the data during training.
• In generalization, the learner is tasked with limiting the patterns it discovers
to only those that will be most relevant to its future tasks.
• Normally, it is not feasible to reduce the number of patterns by examining
them one by one and ranking them by future utility.
• Instead, machine learning algorithms generally employ shortcuts that
reduce the search space more quickly.
• To this end, the algorithm will employ heuristics, which are educated
guesses about where to find the most useful inferences.
Evaluation -
• Evaluation is the last component of the learning process.
• It is the process of giving feedback to the user to measure the utility of the
learned knowledge.
• This feedback is then utilized to effect improvements in the whole learning
process.
• The final step in the learning process is to evaluate its success and to
measure the learner's performance in spite of its biases.
• The information gained in the evaluation phase can then be used to
inform additional training if needed.
• Generally, evaluation occurs after a model has been trained on an initial
training dataset.
• Then, the model is evaluated on a separate test dataset in order to judge
how well its characterization of the training data generalizes to new,
unseen cases.
• It's worth noting that it is exceedingly rare for a model to perfectly
generalize to every unforeseen case—mistakes are almost always
inevitable.
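The train/test evaluation idea above can be sketched in a few lines. This is an illustrative toy, not a real library workflow: the "model" is a hypothetical majority-class baseline, and the data is made up.

```python
# A minimal sketch of holdout evaluation: train on one slice of the data,
# then judge generalization on a separate, unseen test slice.
from collections import Counter

def train_test_split(data, test_ratio=0.25):
    """Hold out the last portion of the data as an unseen test set."""
    cut = int(len(data) * (1 - test_ratio))
    return data[:cut], data[cut:]

def majority_class(train_labels):
    """'Train' by memorizing the most frequent label (a trivial abstraction)."""
    return Counter(train_labels).most_common(1)[0][0]

labels = ["spam", "ham", "ham", "ham", "spam", "ham", "ham", "spam"]
train, test = train_test_split(labels)
model = majority_class(train)                       # learned from training data only
accuracy = sum(1 for y in test if y == model) / len(test)
print(model, accuracy)
```

The key design point is that the test slice never influences training, so the accuracy estimates performance on new, unseen cases.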
Well-posed learning problem:
• For defining a new problem, which can be solved using machine
learning, a simple framework, highlighted below, can be used.
• This framework also helps in deciding whether the problem is a right
candidate to be solved using machine learning.
• The framework involves answering three questions:
• 1. What is the problem?
• 2. Why does the problem need to be solved?
• 3. How to solve the problem?
• Step 1: What is the Problem?
• Several pieces of information should be collected to understand the problem.
• Informal description of the problem, e.g.
• I need a program that will prompt the next word as and when I type a word.
• Formalism - Use Tom Mitchell’s machine learning formalism to define the T, P, and E for the problem.
• For example: Task (T): Prompt the next word when I type a word.
Experience (E): A corpus of commonly used English words and phrases.
Performance (P): The number of correct words prompted considered as a
percentage (which in machine learning paradigm is known as learning accuracy).
Assumptions - Create a list of assumptions about the problem.
Similar problems - What other problems have you seen, or can you think of, that are similar to the one you are trying to solve?
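The next-word T/P/E example above can be sketched as a tiny bigram model. This is an illustrative assumption, not the only way to realize the task: E is a small corpus, T is prompting the most likely next word, and P is accuracy on some (word, expected-next-word) pairs.

```python
# A minimal bigram next-word prompter mapping onto Mitchell's T, P, E.
from collections import defaultdict, Counter

corpus = "good morning good evening good morning everyone".split()

# Experience (E): count which word follows each word in the corpus
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def prompt_next(word):
    """Task (T): prompt the most likely next word, or None if unseen."""
    if word not in following:
        return None
    return following[word].most_common(1)[0][0]

# Performance (P): fraction of correct prompts on held-out pairs
pairs = [("good", "morning"), ("morning", "good")]
accuracy = sum(prompt_next(w) == nxt for w, nxt in pairs) / len(pairs)
print(prompt_next("good"), accuracy)
```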
Step 2: Why does the problem need to be
solved?
Motivation
• What is the motivation for solving the problem?
• What requirement will it fulfill?
• For example, does this problem solve any long-standing business issue like
finding out potentially fraudulent transactions?
• Or is the purpose more modest, like trying to suggest some movies for the upcoming weekend?
Step 3: How would I solve the problem?
• Try to explore how to solve the problem manually.
• Detail out step-by-step data collection, data preparation, and program
design to solve the problem.
• Collect all these details and update the previous sections of the problem
definition, especially the assumptions.
Introduction to ML -
Advantages of Supervised Learning -
• Supervised learning allows collecting data and produces data output from
previous experiences.
• Helps to optimize performance criteria with the help of experience.
• Supervised machine learning helps to solve various types of real-world
computation problems.
• It performs classification and regression tasks.
• It allows estimating or mapping the result to a new sample.
• We have complete control over choosing the number of classes we want in
the training data.
Association Rule Mining (Unsupervised Learning) -
• It determines the sets of items that occur together in the dataset. Association rules make marketing strategies more effective.
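The items-occurring-together idea can be sketched by counting item pairs across market baskets. The transactions below are made up for illustration, and "support" is the standard fraction-of-transactions measure.

```python
# A minimal association-rule-style pair count over toy transactions.
from itertools import combinations
from collections import Counter

transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
]

# Count how often each pair of items occurs together in a transaction
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support of a pair = fraction of transactions containing both items
support = {p: c / len(transactions) for p, c in pair_counts.items()}
best_pair, best_support = max(support.items(), key=lambda kv: kv[1])
print(best_pair, best_support)
```

A full algorithm such as Apriori prunes the search over larger item sets, but the pair-counting step above is the core idea.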
Advantages of Unsupervised Learning:
• It does not require training data to be labeled.
• Dimensionality reduction can be easily accomplished using unsupervised learning.
• Capable of finding previously unknown patterns in data.
• Flexibility: Unsupervised learning is flexible in that it can be applied to a wide
variety of problems, including clustering, anomaly detection, and association rule
mining.
• Exploration: Unsupervised learning allows for the exploration of data and the
discovery of novel and potentially useful patterns that may not be apparent from the
outset.
• Low cost: Unsupervised learning is often less expensive than supervised learning
because it doesn’t require labeled data, which can be time-consuming and costly to
obtain.
Disadvantages of Unsupervised Learning :
• Difficult to measure accuracy or effectiveness due to lack of predefined answers during
training.
• The results often have lesser accuracy.
• The user needs to spend time interpreting and labeling the classes that result from the classification.
• Lack of guidance: Unsupervised learning lacks the guidance and feedback provided by
labeled data, which can make it difficult to know whether the discovered patterns are
relevant or useful.
• Sensitivity to data quality: Unsupervised learning can be sensitive to data quality,
including missing values, outliers, and noisy data.
• Scalability: Unsupervised learning can be computationally expensive, particularly for
large datasets or complex algorithms, which can limit its scalability.
REINFORCEMENT LEARNING -
• It is the problem of getting an agent to act in the world so as to maximize
its rewards.
• A learner (the program) is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them.
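The trial-and-reward loop can be sketched with tabular Q-learning on a hypothetical toy environment: a four-cell corridor where only reaching the last cell gives reward. The environment, constants, and reward are all illustrative assumptions, not a standard API.

```python
# A minimal Q-learning sketch: the agent discovers, purely by trying actions,
# that moving right maximizes reward in this toy corridor.
import random

random.seed(0)
n_states, actions = 4, [-1, +1]          # actions: move left or right
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, epsilon = 0.5, 0.9, 0.2    # learning rate, discount, exploration

for episode in range(200):
    s = 0
    while s != n_states - 1:             # run until the goal state is reached
        # epsilon-greedy: mostly exploit the current Q, sometimes explore
        a = random.choice(actions) if random.random() < epsilon \
            else max(actions, key=lambda a: Q[(s, a)])
        s2 = min(max(s + a, 0), n_states - 1)
        r = 1.0 if s2 == n_states - 1 else 0.0
        # Q-learning update: learn from the reward actually obtained
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in actions) - Q[(s, a)])
        s = s2

# After training, the greedy policy should move right in every state
policy = {s: max(actions, key=lambda a: Q[(s, a)]) for s in range(n_states - 1)}
print(policy)
```

Note that no state is ever labeled with a "correct" action; the preference for moving right emerges only from the rewards the agent experiences.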
CLUSTERING -
Example -
• Suppose you are the head of a rental store and wish to understand
the preferences of your customers to scale up your business.
• Is it possible for you to look at the details of each customer and devise
a unique business strategy for each one of them?
• Definitely not.
• But, what you can do is to cluster all of your customers into say 10
groups based on their purchasing habits and use a separate
strategy for customers in each of these 10 groups. And this is what
we call clustering.
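The grouping idea above can be sketched with a tiny 1-D k-means: cluster customers by spend into k groups. The spend values and k are made up for illustration; a real store would use the 10 groups and richer purchasing features mentioned above.

```python
# A minimal 1-D k-means sketch: alternate between assigning points to their
# nearest centroid and moving each centroid to its cluster's mean.
def kmeans_1d(values, k, iters=20):
    vs = sorted(values)
    # initialize centroids spread across the sorted data
    centroids = [vs[i * len(vs) // k] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        # assignment step: each value joins its nearest centroid
        for v in values:
            i = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[i].append(v)
        # update step: each centroid moves to its cluster's mean
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

spend = [10, 12, 11, 95, 100, 98]        # two obvious spending habits
centroids, clusters = kmeans_1d(spend, k=2)
print(sorted(round(c) for c in centroids))
```

Note that no labels are supplied anywhere: the two spending groups are discovered from the data alone, which is what makes this unsupervised.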
Advantages of the Decision Tree:
1. It is simple to understand, as it follows the same process that a human follows while making any decision in real life.
2. It can be very useful for solving decision-related problems.
3. It helps to think about all the possible outcomes for a problem.
4. There is less requirement of data cleaning compared to other algorithms.
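The human-decision analogy can be sketched as a hand-written tree: each if/else branch is one internal node. The features and thresholds below are made-up illustrations, not learned from data.

```python
# A minimal hand-written decision tree: decide whether to play outside.
def decide(outlook, humidity, windy):
    """Each if/else branch corresponds to one internal node of the tree."""
    if outlook == "sunny":
        return "play" if humidity <= 70 else "stay in"   # split on humidity
    if outlook == "rainy":
        return "stay in" if windy else "play"            # split on wind
    return "play"                                        # overcast: always play

print(decide("sunny", humidity=65, windy=False))
```

A learning algorithm such as ID3 or CART chooses these splits automatically from data, but the resulting model is exactly this kind of readable if/else cascade, which is why decision trees are easy to interpret.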
Support Vector Machine (SVM) -
• Let’s consider two independent variables x1, x2, and one dependent
variable which is either a blue circle or a red circle.
• It is clear that there are multiple lines (our hyperplane here is a line
because we are considering only two input features, x1 and x2) that
segregate our data points, i.e., classify between red and blue circles.
So how do we choose the best line, or in general the best hyperplane,
that segregates our data points?
Support Vector Machine Terminology -
4. Kernel: The kernel is a mathematical function used in SVM to map the original input data points
into a high-dimensional feature space, so that the hyperplane can be found even if the data points
are not linearly separable in the original input space. Common kernel functions include linear,
polynomial, radial basis function (RBF), and sigmoid.
5. Hard Margin: The maximum-margin hyperplane or the hard margin hyperplane is a hyperplane
that properly separates the data points of different categories without any misclassifications.
6. Soft Margin: When the data is not perfectly separable or contains outliers, SVM permits a soft
margin technique. Each data point has a slack variable introduced by the soft-margin SVM
formulation, which softens the strict margin requirement and permits certain misclassifications or
violations. It discovers a compromise between increasing the margin and reducing violations.
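The kernel idea can be sketched directly: a kernel scores similarity between two points as if they had been mapped to the high-dimensional space, without computing that mapping explicitly. The gamma value below is an illustrative width parameter, not tied to any particular library.

```python
# Minimal kernel functions as used conceptually in SVMs.
import math

def rbf_kernel(x, y, gamma=0.5):
    """K(x, y) = exp(-gamma * ||x - y||^2): similarity decays with distance."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def linear_kernel(x, y):
    """K(x, y) = x . y: no feature-space mapping at all."""
    return sum(a * b for a, b in zip(x, y))

close, far = (1.0, 1.0), (4.0, 4.0)
print(rbf_kernel(close, close))          # identical points: similarity 1.0
print(rbf_kernel(close, far) < 0.01)     # distant points: near-zero similarity
```

Because the SVM optimization only ever needs such pairwise kernel values (the "kernel trick"), it can find a separating hyperplane in the implicit high-dimensional space at the cost of these cheap evaluations.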
Genetic Algorithm -
• Initialization
• Fitness assignment
• Selection
• Reproduction
• Crossover
• Mutation
MUTATION
• This operator adds new genetic information to the new child population.
• This is achieved by flipping some bits in the chromosome.
• Mutation solves the problem of local minimum and enhances
diversification.
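Bit-flip mutation can be sketched in a few lines. The chromosome and mutation rate below are illustrative choices, not fixed by the text.

```python
# A minimal bit-flip mutation operator for a binary chromosome.
import random

def mutate(chromosome, rate=0.1, rng=random):
    """Flip each bit independently with probability `rate`."""
    return [bit ^ 1 if rng.random() < rate else bit for bit in chromosome]

rng = random.Random(42)
parent = [1, 0, 1, 1, 0, 0, 1, 0]
child = mutate(parent, rate=0.5, rng=rng)
print(parent)
print(child)            # some bits flipped, injecting new genetic information
```

Keeping the rate small in practice preserves most of the parent's genes while still giving the population a chance to escape local minima.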
ISSUES IN MACHINE
LEARNING- Lecture 6
Inadequate Training Data -
• Noisy Data- It is responsible for an inaccurate prediction that affects the decision as well as
accuracy in classification tasks.
• Incorrect data- It is also responsible for faulty programming and results obtained in machine
learning models. Hence, incorrect data may affect the accuracy of the results also.
• Generalizing of output data- Sometimes, it is also found that generalizing output data
becomes complex, which results in comparatively poor future actions.
Non-representative training data -
• To make sure our training model is generalized well or not, we have to ensure that sample
training data is representative of new cases that we need to generalize.
• The training data must cover all cases that have already occurred as well as occurring.
• Further, if we are using non-representative training data in the model, it results in less
accurate predictions.
• A machine learning model is said to be ideal if it predicts well for generalized cases and
provides accurate decisions.
• If there is too little training data, there will be sampling noise in the model; this is called a
non-representative training set.
• Such a model will not be accurate in its predictions and will be biased towards one class or group.
Overfitting and Underfitting -
Overfitting –
Overfitting is one of the most common issues faced by Machine Learning engineers and data scientists.
Whenever a machine learning model is trained with a huge amount of data, it starts capturing the noise
and inaccurate entries present in the training data set.
It negatively affects the performance of the model.
Let's understand with a simple example where we have a few training data sets such as 1000 mangoes,
1000 apples, 1000 bananas, and 5000 papayas.
Then there is a considerable probability that an apple is identified as a papaya, because we have a
massive amount of biased data in the training data set; hence the prediction is negatively affected.
One common reason behind overfitting is the use of highly flexible non-linear methods, which can build
unrealistic data models.
We can reduce overfitting by preferring simpler linear and parametric algorithms in machine learning models.
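Overfitting-as-memorization can be sketched with a 1-nearest-neighbour "model": it fits its (noisy) training set perfectly yet generalizes worse. The data below is made up for illustration, including one deliberately mislabeled point standing in for noise.

```python
# A minimal overfitting demonstration: 1-NN memorizes the training set,
# including its noise, so training accuracy is perfect but test accuracy is not.
def predict_1nn(train, x):
    """Return the label of the single closest training point (pure memorization)."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

# Intended rule: label "big" for values >= 5; the point (8, "small") is noise.
train = [(1, "small"), (2, "small"), (3, "small"),
         (6, "big"), (7, "big"), (8, "small")]
test = [(1.5, "small"), (7.5, "big"), (8.5, "big")]

train_acc = sum(predict_1nn(train, x) == y for x, y in train) / len(train)
test_acc = sum(predict_1nn(train, x) == y for x, y in test) / len(test)
print(train_acc, test_acc)   # perfect on training data, worse on unseen data
```

The gap between the two accuracies is the signature of overfitting: the model has captured the noisy point instead of the underlying rule.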
Data Drift (getting bad recommendations) -
• A machine learning model operates within a specific context; when that context shifts, the result is
bad recommendations and concept drift in the model.
• Let's understand with an example: at a specific time a customer is looking for some gadgets, but the
customer's requirements change over time; the machine learning model still shows the same
recommendations even though the customer's expectations have changed.
• This incident is called a Data Drift.
• It generally occurs when new data is introduced or the interpretation of data changes.
However, we can overcome this by regularly updating and monitoring data according to
the expectations.
Lack of skilled resources -
• The machine learning process is very complex, which is also another major issue faced by
machine learning engineers and data scientists.
• Machine Learning and Artificial Intelligence are relatively new technologies, still in an
experimental phase and continuously changing over time.
• Much of the work proceeds by trial and error; hence the probability of error is higher than expected.
• Further, it also includes analyzing the data, removing data bias, training data, applying
complex mathematical calculations, etc., making the procedure more complicated and
quite tedious.
Data Bias