ARTIFICIAL INTELLIGENCE
AND
MACHINE LEARNING
For IV semester (CSE Department)
As per the latest syllabus of Anna University (2021 Regulation)
ARUNACHALA PUBLICATIONS
Manavilai, Vellichanthai
First edition: 2023
Copyright © 2023 by ARUNACHALA PUBLICATIONS.
Price: Rs.515/-
Published by:
ARUNACHALA PUBLICATIONS
Manavilai, Vellichanthai,
Kanyakumari District,
Tamilnadu-629 203
Phone: 04651 200123
E-mail: acewomenscollege@gmail.com
Website: www.arunachalacollege.com
PREFACE
The fifth unit describes the basics of deep learning using neural networks.
ACKNOWLEDGEMENTS
First and foremost, we would like to thank God. In the process of putting
this book together, we realized how true this gift of writing is for us. We could
never have done this without our faith in Him.
We would like to thank our friends and faculty members for their valuable
inputs, comments, suggestions, and praise while we were writing this book.
AUTHORS
CONTENTS
UNIT –I PROBLEM SOLVING 1.1
1.1 Introduction to AI 1.1
1.1.1 AI approaches 1.2
1.1.2 History of AI 1.5
1.2 Applications of AI 1.7
1.2.1 Good Behaviour: The Concept of Rationality 1.14
1.2.2 The Nature of Environment 1.16
1.2.3 The Structure of Agents 1.20
1.2.3.1 Simple Reflex agent 1.22
1.2.3.2 Model-based Reflex agent 1.24
1.2.3.3 Goal-Based agent 1.25
1.2.3.4 Utility-Based agent 1.26
1.2.3.5 Learning Agent 1.27
1.3 Problem-solving agents 1.29
1.3.1 Search problems and solutions 1.30
1.3.2 Formulating problems 1.31
1.3.3 Example Problems 1.31
1.3.3.1 Toy problems 1.31
1.3.3.2 Real World Problems 1.34
1.3.3.3 Water jug problem 1.36
1.4 Search Algorithms 1.38
1.4.1 Best-first search 1.39
1.5 Uninformed Search Strategies 1.41
1.5.1 Breadth-first search 1.41
1.5.2 Uniform-cost search 1.42
1.5.3 Depth-first search 1.42
1.5.4 Depth-limited search 1.41
1.5.5 Iterative deepening search 1.44
1.5.6 Bidirectional search 1.44
1.6 Informed (Heuristic) Search Strategies 1.46
1.6.1.1 Greedy Best-first search 1.46
1.6.1.2 A* search (A-star search) 1.48
1.6.1.3 Memory-bounded search 1.50
1.6.1.4 Iterative-deepening A* search 1.51
1.6.1.5 Recursive Best-first search 1.51
1.6.2. Heuristic Functions 1.53
1.7 Local Search and Optimization Problems 1.59
1.7.1.1 Hill Climbing search 1.59
1.7.1.2 Simulated Annealing: 1.61
1.7.1.3 Local Beam Search 1.62
1.7.1.4 Evolutionary algorithms 1.62
1.7.2 Local Search in Continuous Spaces 1.64
1.7.3 Search with Nondeterministic Actions 1.67
1.7.3.1 The Erratic (Unpredictable) Vacuum World 1.67
1.7.3.2 AND-OR Search Trees 1.69
1.7.3.3 Try, Try Again For Vacuum World 1.70
1.7.4 Search in Partially Observable Environments 1.71
1.7.4.1 Searching with no observation 1.71
1.7.4.2 Searching in partially observable environments 1.75
1.7.4.3 Solving partially observable problems 1.76
1.7.4.4 An agent for partially observable environments 1.77
1.7.5 Online Search agents and Unknown Environments 1.79
1.7.5.1 Online Search Problems 1.79
1.7.5.2 Online search agents 1.81
1.7.5.3 Online Local search 1.82
1.7.5.4 Learning in Online Search 1.84
1.8 Adversarial search 1.84
1.8.1 Game Theory 1.84
1.8.1.1 Two-player zero-sum games 1.83
1.8.2 Optimal Decisions in Games 1.86
1.8.2.1 The minimax search algorithm 1.87
1.8.2.2 Optimal decisions in multiplayer games 1.88
1.8.2.3 Alpha-Beta Pruning 1.89
1.8.3 Monte Carlo Tree Search 1.92
1.8.4 Stochastic Games 1.96
1.8.4.1 Evaluation functions for games of chance 1.97
1.8.5 Partially Observable Games 1.98
1.8.5.1 Kriegspiel: partially observable chess 1.98
1.8.5.2 Card games 1.100
1.9 Constraint Satisfaction Problems 1.101
1.9.1.1 Defining Constraint Satisfaction Problems 1.101
1.9.1.2 Example Problem: Map Coloring 1.101
1.9.1.3 Variations on the CSP 1.103
1.9.2 Constraint Propagation 1.105
1.9.2.1 Node consistency 1.105
1.9.2.2 Arc consistency 1.106
1.9.2.3 Path consistency 1.107
1.9.2.4 K-consistency 1.107
1.9.2.5 Global constraints 1.107
1.9.2.6 Sudoku 1.108
1.9.3 Backtracking Search for CSPs 1.109
1.9.3.1 Variable and value ordering 1.110
1.9.3.2 Interleaving search and inference 1.111
1.9.3.3 Intelligent backtracking 1.112
1.9.3.4 Constraint learning 1.113
1.9.4 Local Search for CSPs 1.113
1.9.5 The Structure of Problems 1.115
1.9.5.1 Cutset conditioning 1.116
1.9.5.2 Tree Decomposition 1.117
1.9.5.3 Value symmetry 1.117
UNIT-II PROBABILISTIC REASONING 2.1
2.1 Acting under uncertainty 2.1
2.1.1 Basic Probability Notation 2.3
2.1.2 Inference Using Full Joint Distributions 2.6
2.1.3 Independence 2.8
2.1.4 Bayes’ Rule and Its Use 2.9
2.2 Naïve Bayes models 2.10
2.2.1 Text classification with naive Bayes 2.11
2.3 Probabilistic reasoning 2.12
2.3.1 Representing Knowledge in an Uncertain Domain
2.4 Bayesian networks 2.12
2.4.1 The Semantics of Bayesian Networks 2.14
2.4.2 Conditional independence relations in Bayesian networks 2.16
2.5 Exact inference in BN 2.17
2.5.1 Inference by enumeration 2.18
2.5.2 The variable elimination algorithm 2.20
2.6 Approximate inference in BN 2.24
2.6.1 Direct sampling methods 2.24
2.6.2 Inference by Markov chain simulation 2.28
2.7 Causal networks 2.29
2.7.1 Representing actions: The do-operator 2.30
2.7.2 The back-door criterion 2.32
UNIT III SUPERVISED LEARNING 3.1
3.1 Introduction to machine learning 3.2
3.2 Linear Regression 3.7
3.2.1 Least square regression 3.8
3.2.2 Single and Multiple Variables 3.10
3.2.3 Bayesian Regression 3.12
3.2.4 Gradient Descent in Machine Learning 3.14
3.3 Classification Models 3.19
3.3.1 Discriminant Functions 3.19
3.4 Probabilistic Discriminant Functions 3.21
3.4.1 Logistic Regression in ML 3.22
3.5 Probabilistic Generative Models 3.25
3.5.1 Naïve Bayes Classifier Algorithm 3.26
3.6 Maximum Margin Classifier 3.30
3.6.1 Support Vector Machine 3.30
3.7 Decision Tree Classification Algorithm 3.36
3.7.1 Algorithm for Decision Tree 3.40
3.7.2 Advantages of Decision Tree 3.44
3.7.3 Disadvantages of Decision Tree 3.44
3.8 Random Forest Algorithm 3.44
3.8.1 The Working Process 3.46
3.8.2 Application of Random Forest 3.46
3.8.3 Advantages of Random Forest 3.47
3.8.4 Disadvantages of Random Forest 3.47
UNIT IV ENSEMBLE TECHNIQUES AND UNSUPERVISED LEARNING 4.1
4.1 Combining Multiple Learners 4.1
4.1.1 Model Combination Schemes 4.2
4.1.2 Voting 4.3
4.1.3 Error-Correcting Output Codes 4.5
4.2 Ensemble Learning 4.6
4.2.1 Bagging 4.7
4.2.2 Boosting 4.9
4.2.3 Stacking 4.9
4.2.4 Adaboost 4.14
4.2.5 Difference between Bagging and Boosting 4.15
4.3 Clustering 4.16
4.3.1 Unsupervised Learning: K-Means 4.19
UNIT-1
PROBLEM SOLVING
1.1 INTRODUCTION TO AI
AI is composed of two words, Artificial and Intelligence, where Artificial means "man-
made" and Intelligence means "thinking power". Hence, AI means "man-made thinking
power". Artificial Intelligence exists when a machine can exhibit human skills such as
learning, reasoning, and problem solving. With AI you do not need to pre-program a machine
to do some work; instead, you can create a machine with programmed algorithms which
can work with its own intelligence. Definition: AI is a branch of computer science by which we
can create intelligent machines which can behave like humans, think like humans, and
are able to make decisions.
The definitions of AI according to some text books are categorized into four approaches and
are summarized below:
1. 1
CS3491 – ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING
"The exciting new effort to make computers think … machines with minds, in the full
and literal sense." (Haugeland, 1985)
"The study of mental faculties through the use of computer models." (Charniak and
McDermott, 1985)
"The art of creating machines that perform functions that require intelligence when
performed by people." (Kurzweil, 1990)
"Computational intelligence is the study of the design of intelligent agents." (Poole et
al., 1998)
Before learning about Artificial Intelligence, we should know why AI is important and
why we should learn it. The following are some main reasons to learn about AI:
o With the help of AI, you can create software or devices which can solve real-world
problems very easily and accurately, such as health issues, marketing, traffic issues,
etc.
o With the help of AI, you can create your personal virtual Assistant, such as Cortana,
Google Assistant, Siri, etc.
o With the help of AI, you can build such Robots which can work in an environment
where survival of humans can be at risk.
o AI opens a path for other new technologies, new devices, and new Opportunities.
o Acting humanly
o Thinking humanly
o Thinking rationally
o Acting rationally
1.1.1 AI Approaches
The Turing test, proposed by Alan Turing (1950), was designed as a thought
experiment that would sidestep the philosophical vagueness of the question “Can a machine
think?” A computer passes the test if a human interrogator, after posing some written questions,
cannot tell whether the written responses come from a person or from a computer.
Acting humanly is the art of creating machines that perform functions requiring
intelligence when performed by people; it is the study of how to make computers do things
which, at the moment, people do better. The focus is on action, not on intelligent behaviour
centered around the representation of the world.
Machine learning to adapt to new circumstances and to detect and extrapolate patterns.
The Total Turing Test includes a video signal so that the interrogator can test the subject's
perceptual abilities, as well as the opportunity for the interrogator to pass physical objects
"through the hatch".
To say that a program thinks like a human, we must know how humans think. We can
learn about human thought in three ways: through introspection (trying to catch our own
thoughts as they go by); through psychological experiments (observing a person in action);
and through brain imaging (observing the brain in action).
Once we have a sufficiently precise theory of the mind, it becomes possible to express
the theory as a computer program. If the program’s input–output behaviour matches
corresponding human behaviour, that is evidence that some of the program’s mechanisms could
also be operating in humans.
Aristotle was one of the first to attempt to codify "right thinking", that is, irrefutable
reasoning processes. His syllogisms provided patterns for argument structures that always
yielded correct conclusions when given correct premises.
For example: Socrates is a man; all men are mortal; therefore, Socrates is mortal.
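The pattern of the syllogism (from "Socrates is a man" and "all men are mortal", conclude "Socrates is mortal") can be illustrated with a tiny forward-chaining sketch. The fact and rule representation below is a hypothetical illustration, not a standard library:

```python
# Minimal forward chaining over one syllogistic rule:
# from ("man", X), infer ("mortal", X).

facts = {("man", "Socrates")}

def apply_rule(facts):
    """All men are mortal: for every ("man", X), add ("mortal", X)."""
    derived = {("mortal", x) for (pred, x) in facts if pred == "man"}
    return facts | derived

facts = apply_rule(facts)
print(("mortal", "Socrates") in facts)  # True
```

Given correct premises, the rule mechanically yields the correct conclusion, which is exactly the property Aristotle's syllogisms were meant to capture.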
An agent is just something that acts. All computer programs do something, but
computer agents are expected to do more: operate autonomously, perceive their environment,
persist over a prolonged time period, adapt to change, and create and pursue goals. A rational
agent is one that acts so as to achieve the best outcome or, when there is uncertainty, the best
expected outcome.
In the "laws of thought" approach to AI, the emphasis was on correct inferences.
Making correct inferences is sometimes part of being a rational agent, because one way to act
rationally is to deduce that a given action is best and then to act on that conclusion. There are
also ways of acting rationally that cannot be said to involve inference; for example, recoiling
from a hot stove is a reflex action that is usually more successful than a slower action taken
after careful deliberation. All the skills needed for the Turing test also allow
an agent to act rationally. Knowledge representation and reasoning enable agents to reach good
decisions. We need to be able to generate comprehensible sentences in natural language to get
by in a complex society. We need learning not only for erudition, but also because it improves
our ability to generate effective behaviour, especially in circumstances that are new.
The rational-agent approach to AI has two advantages over the other approaches. First,
it is more general than the “laws of thought” approach because correct inference is just one of
several possible mechanisms for achieving rationality. Second, it is more amenable to scientific
development. The standard of rationality is mathematically well defined and completely
general.
1.1.2 History of AI
o Year 1943: The first work which is now recognized as AI was done by Warren
McCulloch and Walter Pitts in 1943. They proposed a model of artificial neurons.
o Year 1949: Donald Hebb demonstrated an updating rule for modifying the connection
strength between neurons. His rule is now called Hebbian learning.
o Year 1950: Alan Turing, an English mathematician, pioneered machine learning in
1950. He published "Computing Machinery and Intelligence", in which he proposed
a test, now called the Turing test, that checks a machine's ability to exhibit
intelligent behaviour equivalent to human intelligence.
o Year 1955: Allen Newell and Herbert A. Simon created the first artificial
intelligence program, named "Logic Theorist". This program proved 38 of 52
mathematics theorems and found new, more elegant proofs for some of them.
o Year 1956: The term "Artificial Intelligence" was first adopted by American computer
scientist John McCarthy at the Dartmouth Conference. For the first time, AI was coined
as an academic field.
Around that time, high-level computer languages such as FORTRAN, LISP, and COBOL
were invented, and the enthusiasm for AI was very high.
o Year 1966: Researchers emphasized developing algorithms which could solve
mathematical problems. Joseph Weizenbaum created the first chatbot, named
ELIZA, in 1966.
o Year 1972: The first intelligent humanoid robot, named WABOT-1, was built in Japan.
o The period between 1974 and 1980 was the first AI winter. An AI winter refers to a
time period in which computer scientists dealt with a severe shortage of government
funding for AI research.
o During AI winters, public interest in artificial intelligence decreased.
A boom of AI (1980-1987)
o Year 1980: After the AI winter, AI came back with "Expert Systems". Expert
systems were programs that emulate the decision-making ability of a human expert.
o In the year 1980, the first national conference of the American Association for
Artificial Intelligence was held at Stanford University.
o The period between 1987 and 1993 was the second AI winter.
o Investors and governments again stopped funding AI research due to high cost and
inefficient results. The expert system XCON had been very cost effective.
o Year 1997: In 1997, IBM's Deep Blue beat world chess champion Garry Kasparov,
becoming the first computer to beat a world chess champion.
o Year 2002: For the first time, AI entered the home in the form of Roomba, a robotic
vacuum cleaner.
o Year 2006: By 2006, AI had entered the business world. Companies like Facebook,
Twitter, and Netflix started using AI.
o Year 2011: In 2011, IBM's Watson won Jeopardy!, a quiz show in which it had
to solve complex questions as well as riddles. Watson proved that it could
understand natural language and solve tricky questions quickly.
o Year 2012: Google launched the Android app feature "Google Now", which could
provide predictive information to the user.
o Year 2014: In 2014, the chatbot "Eugene Goostman" won a competition based on the
famous "Turing test".
o Year 2018: IBM's "Project Debater" debated complex topics with two master
debaters and performed extremely well.
o Google demonstrated an AI program, "Duplex", a virtual assistant that booked a
hairdresser appointment over the phone; the lady on the other end did not notice
that she was talking to a machine.
1.2 Applications of AI
1. AI in Astronomy
o Artificial Intelligence can be very useful to solve complex universe problems. AI
technology can be helpful for understanding the universe such as how it works, origin,
etc.
2. AI in Healthcare
o In the last five to ten years, AI has become more advantageous for the healthcare
industry and is going to have a significant impact on it.
o Healthcare industries are applying AI to make better and faster diagnoses than
humans. AI can help doctors with diagnoses and can indicate when patients are
worsening so that medical help can reach the patient before hospitalization.
3. AI in Gaming
o AI can be used for gaming purposes. AI machines can play strategic games like
chess, where the machine needs to think about a large number of possible positions.
4. AI in Finance
o AI and finance industries are the best matches for each other. The finance industry is
implementing automation, chatbot, adaptive intelligence, algorithm trading, and
machine learning into financial processes.
5. AI in Data Security
o The security of data is crucial for every company, and cyber-attacks are growing very
rapidly in the digital world. AI can be used to make your data more safe and secure.
Some examples, such as the AEG bot and the AI2 Platform, are used to detect software
bugs and cyber-attacks in a better way.
6. AI in Social Media
o Social Media sites such as Facebook, Twitter, and Snapchat contain billions of user
profiles, which need to be stored and managed in a very efficient way. AI can organize
and manage massive amounts of data. AI can analyze lots of data to identify the latest
trends, hashtag, and requirement of different users.
8. AI in Automotive Industry
o Some automotive industries are using AI to provide virtual assistants to their users for
better performance. For example, Tesla has introduced TeslaBot, an intelligent virtual
assistant.
o Various industries are currently working on developing self-driving cars, which can
make your journey safer and more secure.
9. AI in Robotics:
o Artificial Intelligence has a remarkable role in robotics. Usually, general robots are
programmed to perform some repetitive task, but with the help of AI,
we can create intelligent robots which can perform tasks from their own experience
without being pre-programmed.
o Humanoid robots are the best examples of AI in robotics; recently, the intelligent
humanoid robots named Erica and Sophia have been developed, and they can talk and
behave like humans.
10. AI in Entertainment
o We are currently using some AI based applications in our daily life with some
entertainment services such as Netflix or Amazon. With the help of ML/AI algorithms,
these services show the recommendations for programs or shows.
11. AI in Agriculture
o Agriculture is an area which requires various resources, labor, money, and time for the
best result. Nowadays agriculture is becoming digital, and AI is emerging in this field.
Agriculture is applying AI in the form of agricultural robotics, soil and crop monitoring,
and predictive analysis. AI in agriculture can be very helpful for farmers.
12. AI in E-commerce
o AI is providing a competitive edge to the e-commerce industry, and it is increasingly
in demand in the e-commerce business. AI helps shoppers discover associated
products in their recommended size, color, or even brand.
13. AI in education:
o AI can automate grading so that the tutor can have more time to teach. AI chatbot can
communicate with students as a teaching assistant.
o In the future, AI can work as a personal virtual tutor for students, easily accessible
at any time and any place.
Advantages of Artificial Intelligence
o High accuracy with fewer errors: AI machines or systems make fewer errors and
achieve high accuracy, as they take decisions based on prior experience or information.
o High speed: AI systems can make decisions very quickly; because of this, AI systems
can beat a chess champion at chess.
o High reliability: AI machines are highly reliable and can perform the same action
multiple times with high accuracy.
o Useful for risky areas: AI machines can be helpful in situations such as defusing a
bomb or exploring the ocean floor, where employing a human would be risky.
o Digital assistant: AI can be very useful for providing digital assistance to users; for
example, AI technology is currently used by various e-commerce websites to show
products matching customer requirements.
o Useful as a public utility: AI can be very useful for public utilities such as a self-
driving car which can make our journey safer and hassle-free, facial recognition for
security purpose, Natural language processing to communicate with the human in
human-language, etc.
Every technology has some disadvantages, and the same goes for Artificial Intelligence.
However advantageous the technology is, it still has some disadvantages that we need to keep
in mind while creating an AI system. The following are the disadvantages of AI:
o High cost: The hardware and software requirements of AI are very costly, as AI
requires a lot of maintenance to meet current world requirements.
o Can't think outside the box: Even though we are making smarter machines with AI,
they still cannot work outside the box; a robot will only do the work for which it is
trained or programmed.
o No feelings and emotions: An AI machine can be an outstanding performer, but it
does not have feelings, so it cannot form any kind of emotional attachment with
humans, and it may sometimes be harmful to users if proper care is not taken.
o Increased dependency on machines: With the advance of technology, people are
becoming more dependent on devices and hence are losing their mental capabilities.
o No original creativity: Humans are creative and can imagine new ideas, but AI
machines cannot match this power of human intelligence and cannot be creative
and imaginative.
INTELLIGENT AGENTS
Rationality:
The rationality of an agent is measured by its performance measure. Rationality can be judged
on the basis of the following points:
o The performance measure that defines the criterion of success.
o The agent's prior knowledge of the environment.
o The actions that the agent can perform.
o The agent's percept sequence to date.
An agent is anything that can be viewed as perceiving its environment through sensors
and acting upon that environment through actuators. This simple idea is illustrated
in Figure 1.2.
A human agent has eyes, ears, and other organs for sensors and hands, legs, mouth,
and other body parts for actuators.
A robotic agent might have cameras and infrared range finders for sensors and various
motors for actuators.
A software agent receives keystrokes, file contents, and network packets as sensory
inputs and acts on the environment by displaying on the screen, writing files, and
sending network packets.
Sensor: Sensor is a device which detects the change in the environment and sends the
information to other electronic devices. An agent observes its environment through
sensors.
Actuators: Actuators are the components of machines that convert energy into motion.
The actuators are responsible for moving and controlling a system. An actuator
can be an electric motor, gears, rails, etc.
Effectors: Effectors are the devices which affect the environment. Effectors can be
legs, wheels, arms, fingers, wings, fins, and display screen.
Fig 1.2: Agents interact with environments through sensors and actuators
Agent Terminology
Percept
We use the term percept to refer to the agent's perceptual inputs at any given instant.
Percept Sequence
An agent's percept sequence is the complete history of everything the agent has ever
perceived.
Agent function
Mathematically speaking, we say that an agent's behaviour is described by the agent
function that maps any given percept sequence to an action.
f : P* → A
AGENT PROGRAM
The agent function for an artificial agent will be implemented by an agent program.
It is important to keep these two ideas distinct.
The agent function is an abstract mathematical description
The agent program is a concrete implementation, running on the agent architecture
To illustrate these ideas, we will use a very simple example-the vacuum-cleaner world
shown in Figure 1.3.
This particular world has just two locations: squares A and B.
The vacuum agent perceives which square it is in and whether there is dirt in the square.
It can choose to move left, move right, suck up the dirt, or do nothing.
One very simple agent function is the following:
if the current square is dirty, then suck; otherwise, move to the other square.
Agent Program
function REFLEX-VACUUM-AGENT([location, status]) returns an action
if status = Dirty then return Suck
else if location = A then return Right
else if location = B then return Left
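The agent program above can be sketched in Python. This is a minimal illustration; the string names for locations, statuses, and actions follow the two-square world described earlier:

```python
# A simple reflex vacuum agent for the two-square world (squares A and B).
# The agent looks only at the current percept: (location, status).

def reflex_vacuum_agent(location, status):
    """Return an action based solely on the current percept."""
    if status == "Dirty":
        return "Suck"
    elif location == "A":
        return "Right"
    else:  # location == "B"
        return "Left"

# Example percepts and the actions they trigger:
print(reflex_vacuum_agent("A", "Dirty"))  # Suck
print(reflex_vacuum_agent("A", "Clean"))  # Right
print(reflex_vacuum_agent("B", "Clean"))  # Left
```

Note that the function ignores everything except the current percept, which is precisely what makes it a *reflex* agent.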
A rational agent is one that does the right thing. Obviously, doing the right thing is
better than doing the wrong thing, but what does it mean to do the right thing?
Performance measures
Moral philosophy has developed several different notions of the “right thing,” but AI
has generally stuck to one notion called consequentialism: we evaluate an agent’s
behaviour by its consequences.
When an agent is plunked down in an environment, it generates a sequence of actions
according to the percept it receives. This sequence of actions causes the environment
to go through a sequence of states.
If the sequence is desirable, then the agent has performed well. This notion of
desirability is captured by a performance measure that evaluates any given sequence of
environment states.
Humans have desires and preferences of their own. Machines, on the other hand, do not
have desires and preferences of their own; the performance measure is, initially at least,
in the mind of the designer of the machine, or in the mind of the users the machine is
designed for.
Consider, for example, the vacuum-cleaner agent. We might propose to measure
performance by the amount of dirt cleaned up in a single eight-hour shift.
A rational agent can maximize this performance measure by cleaning up the dirt, then
dumping it all on the floor, then cleaning it up again, and so on.
A more suitable performance measure would reward the agent for having a clean floor.
For example, one point could be awarded for each clean square at each time step.
As a general rule, it is better to design performance measures according to what one
actually wants to be achieved in the environment, rather than according to how one
thinks the agent should behave.
Rationality
For each possible percept sequence, a rational agent should select an action that is
expected to maximize its performance measure, given the evidence provided by the
percept sequence and whatever built-in knowledge the agent has.
Consider the simple vacuum-cleaner agent that cleans a square if it is dirty and moves to the
other square if not (as shown in the agent function table). Is this a rational agent? That depends
on what the performance measure is, what is known about the environment, and what sensors
and actuators the agent has. Let us assume the following:
The performance measure awards one point for each clean square at each time step,
over a “lifetime” of 1000 time steps.
Clean squares stay clean and sucking cleans the current square.
The Left and Right actions move the agent one square except when this would take the
agent outside the environment, in which case the agent remains where it is.
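Under the assumptions above (one point per clean square per time step, over a 1000-step lifetime), the agent's score can be computed with a short simulation. This is an illustrative sketch under a hypothetical start state in which both squares are dirty and the agent starts in square A:

```python
# Simulate the reflex vacuum agent and total its score:
# one point per clean square at each time step.

def reflex_vacuum_agent(location, status):
    if status == "Dirty":
        return "Suck"
    return "Right" if location == "A" else "Left"

def simulate(steps=1000):
    world = {"A": "Dirty", "B": "Dirty"}   # both squares start dirty
    location = "A"
    score = 0
    for _ in range(steps):
        action = reflex_vacuum_agent(location, world[location])
        if action == "Suck":
            world[location] = "Clean"      # sucking cleans the current square
        elif action == "Right":
            location = "B"
        elif action == "Left":
            location = "A"
        # award one point for each square that is clean at this time step
        score += sum(1 for s in world.values() if s == "Clean")
    return score

print(simulate())  # 1998 under these assumptions
```

Both squares are clean after three steps, so the agent collects nearly the maximum of 2000 points; under these assumptions the agent is rational.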
An omniscient agent knows the actual outcome of its actions and can act
accordingly; but omniscience is impossible in reality.
Doing actions in order to modify future percepts is sometimes called information
gathering.
A rational agent should not only gather information but also learn as much as possible
from what it perceives. The agent's initial configuration could reflect some prior
knowledge of the environment, but as the agent gains experience this may be
modified and augmented.
To the extent that an agent relies on the prior knowledge of its designer rather than on its
own percepts and learning processes, we say that the agent lacks autonomy. A rational
agent should be autonomous; it should learn what it can to compensate for partial
or incorrect prior knowledge.
All these are grouped together under the heading of the task environment.
We call this the PEAS (Performance, Environment, Actuators, Sensors) description.
In designing an agent, the first step must always be to specify the task environment as
fully as possible.
The following table shows PEAS description of the task environment for an automated
taxi.
Fig 1.4 PEAS description for an automated taxi
The following table shows PEAS description of the task environment for some other agent
type.
Partially observable
Episodic
In an episodic environment, the agent's current decision does not depend on earlier episodes.
Example: Consider a pick-and-place robot used to detect defective parts on conveyor
belts. Here, every time, the robot (agent) makes its decision on the current part alone,
i.e., there is no dependency between the current and previous decisions.
Sequential
An agent requires memory of past actions to determine the next best actions.
The current decision could affect all future decisions.
Example: Chess and Checkers, where a previous move can affect all the following
moves.
4. Single agent
If only one agent is involved in an environment, and operating by itself then such an
environment is called single agent environment.
Example: maze, A person left alone in a maze is an example of the single-agent
system.
Multiple agents
If multiple agents are involved in an environment, then it is called a multi- agent
environment.
Example: Football, The game of football is multi-agent as it involves 11 players in
each team.
5. Static
The environment does not change while an agent is acting.
Example: Crossword puzzles
An agent program takes the current percept as input, while the agent function takes the
entire percept history.
The current percept is taken as input to the agent program because nothing more is
available from the environment.
The following TABLE-DRIVEN_AGENT program is invoked for each new percept
and returns an action each time
Drawbacks:
Table lookup of percept-action pairs defining all possible condition-action rules
necessary to interact in an environment
Problems
Too big to generate and to store (Chess has about 10^120 states, for
example)
No knowledge of non-perceptual parts of the current state
Not adaptive to changes in the environment; requires entire table to be
updated if changes occur
Looping: Can't make actions conditional
Takes a long time to build the table
No autonomy
Even with learning, it needs a long time to learn the table entries
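The table-driven scheme above can be sketched as follows. The tiny table here is a hypothetical fragment for the vacuum world; the key point is that the lookup key is the *entire* percept sequence so far, which is why the table becomes unmanageably large:

```python
# Table-driven agent: the action is looked up using the complete
# percept sequence as the key. Even for the two-square vacuum world
# the table grows with every time step, the drawback noted above.

class TableDrivenAgent:
    def __init__(self, table):
        self.table = table       # maps percept sequences (tuples) to actions
        self.percepts = []       # complete percept history

    def __call__(self, percept):
        self.percepts.append(percept)
        # look up the whole history; None if the table has no entry
        return self.table.get(tuple(self.percepts))

# A fragment of the (in principle enormous) lookup table:
table = {
    (("A", "Dirty"),): "Suck",
    (("A", "Clean"),): "Right",
    (("A", "Clean"), ("B", "Dirty")): "Suck",
}

agent = TableDrivenAgent(table)
print(agent(("A", "Clean")))   # Right
print(agent(("B", "Dirty")))   # Suck
```

After only two percepts the table already needs entries for every possible two-percept history; for realistic environments this blows up exactly as the drawbacks list describes.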
Some Agent Types
Table-driven agents
use a percept sequence/action table in memory to find the next action. They
are implemented by a (large) lookup table.
Simple reflex agents
are based on condition-action rules, implemented with an appropriate
production system. They are stateless devices which do not have memory of past
world states.
Agents with memory
have internal state, which is used to keep track of past states of the
world.
Utility-based agents
base their decisions on classic axiomatic utility theory in order to act
rationally.
Kinds of Agent Programs
The following are the agent programs,
Fig 1.7 The agent program for a simple reflex agent in the two-state vacuum
environment.
Characteristics
Only works if the environment is fully observable.
Lacking history, it can easily get stuck in infinite loops.
One solution is to randomize its actions.
1.2.3.2 Model-based Reflex agent
A model-based reflex agent is an intelligent agent that uses percept history and internal
memory to maintain a "model" of the world around it and make decisions.
The Model-based agent can work in a partially observable environment, and track the
situation.
Model- Knowledge about “how the things happen in the world”.
Internal state- It is a representation of unobserved aspects of current state depending
on percept history.
Updating the state requires the information about
How the world evolves.
How the agent’s actions affect the world.
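The state-update idea above can be sketched as follows (an illustrative Python skeleton; the rules and update_state functions are hypothetical placeholders supplied by the designer):

```python
def make_model_based_agent(rules, update_state):
    # Internal state persists between calls; it is the agent's model of the world.
    state, last_action = {}, None
    def agent(percept):
        nonlocal state, last_action
        # Fold the new percept and the last action into the internal state,
        # using knowledge of how the world evolves and how actions affect it.
        state = update_state(state, last_action, percept)
        last_action = rules(state)      # condition-action rules applied to the model
        return last_action
    return agent
```

The closure keeps the internal state between percepts, which is exactly what lets this agent operate in a partially observable environment.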
The last component of the learning agent is the problem generator. It is responsible
for suggesting actions that will lead to new and informative experiences. But if the
agent is willing to explore a little, it might discover much better actions for the long
run. The problem generator's job is to suggest these exploratory actions. This is what
scientists do when they carry out experiments.
Problem Formulation
An important aspect of intelligence is goal-based problem solving.
The solution of many problems can be described by finding a sequence of actions that
lead to a desirable goal.
Each action changes the state and the aim is to find the sequence of actions and states
that lead from the initial (start) state to a final (goal) state.
What is Search?
a) Search is the systematic examination of states to find path from the start/root
state to the goal state.
b) The set of possible states, together with operators defining their connectivity
constitute the search space.
c) The output of a search algorithm is a solution, that is, a path from the initial
state to a state that satisfies the goal test.
Fig: 1.13
A transition model, which describes what each action does. RESULT (s,a)
returns the state that results from doing action a in state s. For example,
The state space can be represented as a graph in which the vertices are states
and the directed edges between them are actions. The map of Romania shown
in figure is such a graph, where each road indicates two actions, one in each
direction
1.3.2 Formulating problems
We derive a formulation of the problem in terms of the initial state, successor
function, goal test, and path cost.
Our formulation of the problem of getting to Bucharest is a model—an abstract
mathematical description—and not the real thing.
Compare the simple atomic state description Arad to an actual cross-country
trip, where the state of the world includes so many things: the traveling
companions, the current radio program, the scenery out of the window, the
proximity of law enforcement officers, the distance to the next rest stop, the
condition of the road, the weather, the traffic, and so on.
All these considerations are left out of our model because they are irrelevant to
the problem of finding a route to Bucharest.
The process of removing detail from a representation is called abstraction.
Fig: 1.14
b. 8-puzzle:
An 8-puzzle consists of a 3x3 board with eight numbered tiles and a blank space.
A tile adjacent to the blank space can slide into the space. The object is to reach the
specific goal state, as shown in figure 1.15.
Fig: 1.15
Goal Test: This checks whether the state matches the goal configuration shown in
figure 1.15. (Other goal configurations are possible.)
Path cost: Each step costs 1, so the path cost is the number of steps in the path.
The 8-puzzle belongs to the family of sliding-block puzzles, which are often used as
test problems for new search algorithms in AI.
Finding optimal solutions for this general class of puzzles is known to be NP-complete.
The 8-puzzle has 9!/2 = 181,440 reachable states and is easily solved.
The 15-puzzle (on a 4 x 4 board) has around 1.3 trillion states, and random instances can
be solved optimally in a few milliseconds by the best search algorithms.
The 24-puzzle (on a 5 x 5 board) has around 10^25 states, and random instances are still
quite difficult to solve optimally with current machines and algorithms.
c. 8-queens problem
The goal of 8-queens problem is to place 8 queens on the chessboard such that no queen
attacks any other. (A queen attacks any piece in the same row, column or diagonal).
The following figure shows an attempted solution that fails: the queen in the rightmost
column is attacked by the queen at the top left.
An incremental formulation involves operators that augment the state description;
starting with an empty board for the 8-queens problem, each action adds a queen to
the state.
A complete-state formulation starts with all 8 queens on the board and moves them
around.
In either case the path cost is of no interest because only the final state counts.
ROUTE-FINDING PROBLEM
TOURING PROBLEMS
Touring problems are closely related to route-finding problems, but with an important
difference.
Consider for example, the problem, "Visit every city at least once" as shown in Romania
map.
As with route-finding the actions correspond to trips between adjacent cities.
Initial state would be "In Bucharest; visited {Bucharest}". Intermediate state would be
"In Vaslui; visited {Bucharest, Urziceni, Vaslui}".
Goal test would check whether the agent is in Bucharest and all 20 cities have been
visited.
VLSI layout
A VLSI layout problem requires positioning millions of components and connections
on a chip to minimize area, minimize circuit delays, minimize stray capacitances, and
maximize manufacturing yield.
The layout problem is split into two parts: cell layout and channel routing.
ROBOT navigation
In the water jug problem in Artificial Intelligence, we are provided with two jugs:
one having the capacity to hold 3 gallons of water and the other has the capacity to
hold 4 gallons of water. There is no other measuring equipment available and the jugs
also do not have any kind of marking on them. So, the agent’s task here is to fill the 4-
gallon jug with 2 gallons of water by using only these two jugs and no other material.
Initially, both our jugs are empty.
So, to solve this problem, the following set of rules was proposed, as shown in figure 1.16:
Production rules for solving the water jug problem
Here, let x denote the 4-gallon jug and y denote the 3-gallon jug.
The listed production rules contain all the actions that could be performed by the agent in
transferring the contents of the jugs. But to solve the water jug problem in a minimum number
of moves, the following rules should be applied in the given sequence, as shown in figure 1.17:
Fig 1.16
Fig 1.17 Solution of water jug problem according to the production rules
On reaching the 7th attempt, we reach a state which is our goal state. Therefore,
at this state, our problem is solved.
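The same solution can be found mechanically by a breadth-first search over (x, y) states, where x is the 4-gallon jug and y is the 3-gallon jug (an illustrative Python sketch, not the production-rule trace of figure 1.17):

```python
from collections import deque

def water_jug():
    # States are (x, y): gallons in the 4-gallon and 3-gallon jugs.
    start, goal = (0, 0), 2                 # goal: 2 gallons in the 4-gallon jug
    def successors(x, y):
        return {(4, y), (x, 3), (0, y), (x, 0),        # fill / empty either jug
                (min(4, x + y), max(0, y - (4 - x))),  # pour y into x
                (max(0, x - (3 - y)), min(3, x + y))}  # pour x into y
    frontier, parents = deque([start]), {start: None}
    while frontier:                         # breadth-first: shortest move sequence
        state = frontier.popleft()
        if state[0] == goal:
            path = []
            while state is not None:
                path.append(state); state = parents[state]
            return path[::-1]
        for s in successors(*state):
            if s not in parents:
                parents[s] = state; frontier.append(s)
    return None
```

Because the search is breadth-first, the returned path is a minimum-length sequence of fills, empties, and pours from (0, 0) to a state with 2 gallons in the 4-gallon jug.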
Fig: 1.18
Fig 1.19 The best-first search algorithm, and the function for expanding a node.
Search data structures
Search algorithms require a data structure to keep track of the search tree. A node in
the tree is represented by a data structure with four components:
node.STATE: the state to which the node corresponds;
node.PARENT: the node in the tree that generated this node;
node.ACTION: the action that was applied to the parent’s state to generate this node;
node.PATH-COST: the total cost of the path from the initial state to this node. In
mathematical formulas, we use g(node) as a synonym for PATH-COST.
We need a data structure to store the frontier.
The appropriate choice is a queue of some kind, because the operations on a frontier
are:
IS-EMPTY(frontier) returns true only if there are no nodes in the frontier.
POP(frontier) removes the top node from the frontier and returns it.
TOP(frontier) returns (but does not remove) the top node of the frontier.
ADD(node, frontier) inserts node into its proper place in the queue.
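These four frontier operations map naturally onto a binary-heap priority queue, as in this Python sketch (the node fields mirror STATE, PARENT, ACTION, and PATH-COST; the two pushed nodes use the Romania step costs Arad-Sibiu 140 and Arad-Zerind 75 as illustrative values):

```python
import heapq
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass(order=True)
class Node:
    path_cost: float                                   # node.PATH-COST (ordering key)
    state: Any = field(compare=False)                  # node.STATE
    parent: Optional['Node'] = field(default=None, compare=False)  # node.PARENT
    action: Any = field(default=None, compare=False)   # node.ACTION

frontier = []                                  # the frontier as a binary-heap priority queue
heapq.heappush(frontier, Node(140, 'Sibiu'))   # ADD(node, frontier)
heapq.heappush(frontier, Node(75, 'Zerind'))   # ADD(node, frontier)
is_empty = (len(frontier) == 0)                # IS-EMPTY(frontier)
top = frontier[0]                              # TOP(frontier): peek without removing
node = heapq.heappop(frontier)                 # POP(frontier): lowest PATH-COST first
```

Ordering only on path_cost (the other fields use compare=False) is what makes the heap behave as a priority queue keyed on g.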
MEASURING PROBLEM-SOLVING PERFORMANCE
The output of a problem-solving algorithm is either failure or a solution.
The algorithm's performance can be measured in four ways :
i. Completeness: Is the algorithm guaranteed to find a solution when there
is one?
ii. Optimality: Does the strategy find the optimal solution?
iii. Time complexity: How long does it take to find a solution?
iv. Space complexity: How much memory is needed to perform the search?
Fig 1.20: The breadth-first search algorithm.
Now suppose that the solution is at depth d. Then the total number of nodes generated is
1 + b + b^2 + ... + b^d = O(b^d)
All the nodes remain in memory, so both time and space complexity are O(b^d).
The memory requirements are a bigger problem for breadth-first search than the
execution time.
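Breadth-first search with a FIFO frontier and an early goal test can be sketched in a few lines of Python (the graph in the usage note is illustrative):

```python
from collections import deque

def breadth_first_search(start, goal_test, successors):
    # Early goal test: check a node when it is generated, not when expanded.
    if goal_test(start):
        return [start]
    frontier, parents = deque([start]), {start: None}
    while frontier:
        state = frontier.popleft()          # FIFO queue: shallowest nodes first
        for child in successors(state):
            if child not in parents:        # reached set: avoid repeated states
                parents[child] = state
                if goal_test(child):
                    path = [child]
                    while parents[path[-1]] is not None:
                        path.append(parents[path[-1]])
                    return path[::-1]
                frontier.append(child)
    return None                             # failure
```

For example, on the graph {'A': ['B', 'C'], 'B': ['D'], 'C': ['D'], 'D': []} a search from 'A' for 'D' returns the shallowest path ['A', 'B', 'D']. The parents dictionary is exactly the source of the O(b^d) memory cost discussed above.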
1.5.2 Uniform-cost search
Uniform-cost search does not care about the number of steps a path has, but only about
its total cost.
By a simple extension, we can find an algorithm that is optimal with any step-cost
function.
Instead of expanding the shallowest node, uniform-cost search expands the node n
with the lowest path cost g(n).
This is done by storing the frontier as a priority queue ordered by 𝑔.
Fig: 1.20
Fig: 1.21
For a state space with branching factor b and maximum depth m, depth-first search
requires storage of only O(bm) nodes.
1.5.4 Depth-limited search
The embarrassing failure of depth-first search in infinite state spaces can be
alleviated by supplying depth-first search with a predetermined depth limit 𝒍.
That is, nodes at depth 𝒍 are treated as if they have no successors. This approach is
called depth-limited search.
The depth limit solves the infinite-path problem. Unfortunately, it also introduces an
additional source of incompleteness if we choose 𝑙 < 𝑑, that is, the shallowest goal is
beyond the depth-limit.
Depth-limited search will also be non-optimal if we choose l > d. Its time
complexity is O(b^l) and its space complexity is O(bl). Depth-first search can be
viewed as a special case of depth-limited search with l = ∞.
Fig: 1.22 The Recursive implementation of Depth-limited tree search:
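The recursive version can be sketched as follows (an illustrative Python version that returns a path to a goal, the string 'cutoff', or None for failure):

```python
def depth_limited_search(state, goal_test, successors, limit):
    """Return a path to a goal, 'cutoff' (limit reached), or None (failure)."""
    if goal_test(state):
        return [state]
    if limit == 0:
        return 'cutoff'                     # nodes at depth l have no successors
    cutoff_occurred = False
    for child in successors(state):
        result = depth_limited_search(child, goal_test, successors, limit - 1)
        if result == 'cutoff':
            cutoff_occurred = True
        elif result is not None:
            return [state] + result
    return 'cutoff' if cutoff_occurred else None
```

The 'cutoff' return value distinguishes "no solution within the limit" from genuine failure, which is what iterative deepening needs in order to know whether to try a larger limit.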
Fig: 1.23
Fig: 1.24
Advantage:
Bidirectional search is fast and it requires less memory.
Disadvantage:
We should know the goal state in advance.
Performance Evaluation
Completeness Bidirectional search is complete if branching factor b is finite and if we
use BFS in both searches.
Optimality Bidirectional search is optimal.
Time Complexity: O(b^(d/2)) if BFS is used in both directions (where b is the branching
factor and d is the depth of the shallowest solution), since each of the two searches need
only reach half the solution depth.
Space Complexity: O(b^(d/2)).
Using the straight-line distance heuristic h_SLD, the goal state can be reached
faster.
Fig: 2.2
Figure 2.2 shows the progress of greedy best-first search using h_SLD to find a path from Arad
to Bucharest. The first node to be expanded from Arad will be Sibiu, because it is closer to
Bucharest than either Zerind or Timisoara. The next node to be expanded will be
Fagaras, because it is closest. Fagaras in turn generates Bucharest, which is the goal.
1.6.1.2 A* Search
A* Search is the most widely used form of best-first search. The evaluation function
f(n) is
obtained by combining
i. g(n) = the cost to reach the node, and
ii. h(n) = the estimated cost to get from the node to the goal:
f(n) = g(n) + h(n)
A* Search is both complete and optimal. A* is optimal if h(n) is an admissible
heuristic – that is, provided that h(n) never overestimates the cost to reach the goal.
An obvious example of an admissible heuristic is the straight-line distance h_SLD that
we used in getting to Bucharest; it cannot be an overestimate. The progress of an A*
tree search for Bucharest is shown in Figure 2.2.
The values of g are computed from the step costs shown in the Romania map
(figure 2.1). The values of h_SLD are also given in Figure 2.1.
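The evaluation f(n) = g(n) + h(n) can be sketched as a graph search (an illustrative Python version; successors(s) is assumed to yield (step-cost, next-state) pairs):

```python
import heapq

def astar_search(start, goal_test, successors, h):
    # Frontier ordered by f = g + h; best_g records the cheapest known g per state.
    frontier = [(h(start), 0, start, [start])]      # (f, g, state, path)
    best_g = {start: 0}
    while frontier:
        f, g, state, path = heapq.heappop(frontier)
        if goal_test(state):
            return path, g
        for cost, child in successors(state):
            g2 = g + cost
            if g2 < best_g.get(child, float('inf')):
                best_g[child] = g2
                heapq.heappush(frontier, (g2 + h(child), g2, child, path + [child]))
    return None, float('inf')
```

With h(n) = 0 for every n (trivially admissible), this reduces to uniform-cost search: on the graph {'A': [(1, 'B'), (4, 'C')], 'B': [(2, 'C')], 'C': []} it finds the cheaper path A-B-C of cost 3 rather than the direct edge of cost 4.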
Fig: 2.2
A* search is complete.
Whether A* is cost-optimal depends on certain properties of the heuristic.
A key property is admissibility: an admissible heuristic is one that never
overestimates the cost to reach a goal.
A slightly stronger property is called consistency. A heuristic h(n) is consistent if, for
every node n and every successor n′ of n generated by an action a, we have:
h(n) ≤ c(n, a, n′) + h(n′)
This is a form of the triangle inequality, which stipulates that a side of a triangle
cannot be longer than the sum of the other two sides (see Figure 3.19 ). An example
of a consistent heuristic is the straight-line distance that we used in getting to
Bucharest.
Fig: 2.3
Fig: 2.4
IDA* and RBFS suffer from using too little memory. Between iterations, IDA*
retains only a single number: the current f-cost limit.
RBFS retains more information in memory, but it uses only linear space: even if
more memory were available, RBFS has no way to make use of it.
Because they forget most of what they have done, both algorithms may end up
reexploring the same states many times over
A better approach is to determine how much memory is available and allow the
algorithm to use all of it. Two algorithms that do this are MA* (memory-bounded A*)
and SMA* (simplified MA*).
SMA*
SMA* proceeds just like A*, expanding the best leaf until memory is full.
At this point, it cannot add a new node to the search tree without dropping an old one.
SMA* always drops the worst leaf node: the one with the highest f-value.
Like RBFS, SMA* then backs up the value of the forgotten node to its parent.
In this way, the ancestor of a forgotten subtree knows the quality of the best path in
that subtree.
With this information, SMA* regenerates the subtree only when all other paths have
been shown to look worse than the path it has forgotten.
Another way of saying this is that if all the descendants of a node n are forgotten, then
we will not know which way to go from n, but we will still have an idea of how
worthwhile it is to go anywhere from n.
1.6.2 Heuristic Functions
A heuristic function, or simply a heuristic, ranks the alternatives at each branching
step of a search algorithm, based on the available information, in order to decide
which branch to follow.
The 8-puzzle
The 8-puzzle is an example of a heuristic search problem. The object of the puzzle is to
slide the tiles horizontally or vertically into the empty space until the configuration
matches the goal configuration (Figure 2.6).
The average solution cost for a randomly generated 8-puzzle instance is about 22
steps.
The branching factor is about 3. (When the empty tile is in the middle, there are four
possible moves; when it is in a corner, there are two; and when it is along an edge,
there are three.)
This means that an exhaustive search to depth 22 would look at about
3^22 ≈ 3.1 × 10^10 states.
By keeping track of repeated states, we could cut this down by a factor of about
170,000, because there are only 9!/2 = 181,440 distinct states that are reachable. This
is a manageable number, but the corresponding number for the 15-puzzle is roughly
10^13.
If we want to find the shortest solutions by using A*,we need a heuristic function that
never overestimates the number of steps to the goal.
The two commonly used heuristic functions for the 8-puzzle are:
i. h1 = the number of misplaced tiles.
For figure 2.6, all of the eight tiles are out of position, so the start state would have h1
= 8. h1 is an admissible heuristic.
ii. h2 = the sum of the distances of the tiles from their goal positions.
This is called the city block distance or Manhattan distance.
h2 is admissible, because all any move can do is move one tile one step closer to the goal.
Tiles 1 to 8 in the start state give a Manhattan distance of
h2 = 3 + 1 + 2 + 2 + 2 + 3 + 3 + 2 = 18.
Neither of these overestimates the true solution cost, which is 26.
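Both heuristics can be written directly against a tuple representation of the board (a sketch assumed to match the start state of figure 2.6; the goal is taken to place the blank first with tiles 1-8 in row order, and 0 denotes the blank):

```python
def h1(state):
    # Number of misplaced tiles (the blank, 0, is not counted).
    # Under the assumed goal, tile t belongs at index t.
    return sum(1 for i, t in enumerate(state) if t != 0 and t != i)

def h2(state):
    # Sum of Manhattan (city-block) distances of each tile from its goal square.
    return sum(abs(i // 3 - t // 3) + abs(i % 3 - t % 3)
               for i, t in enumerate(state) if t != 0)

start = (7, 2, 4,
         5, 0, 6,
         8, 3, 1)      # assumed start state of figure 2.6, read row by row
```

For this start state, h1(start) gives 8 and h2(start) gives 18, reproducing the values computed above.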
The effect of heuristic accuracy on performance
The Effective Branching factor
One way to characterize the quality of a heuristic is the effective branching factor b*. If
the total number of nodes generated by A* for a particular problem is N, and the solution
depth is d, then b* is the branching factor that a uniform tree of depth d would have to have
in order to contain N + 1 nodes. Thus,
N + 1 = 1 + b* + (b*)^2 + ... + (b*)^d
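Since the right-hand side grows monotonically in b*, the equation can be solved numerically, for example by bisection (an illustrative Python sketch):

```python
def effective_branching_factor(N, d, tol=1e-6):
    # Solve N + 1 = 1 + b* + (b*)^2 + ... + (b*)^d for b* by bisection.
    def total(b):
        return sum(b ** i for i in range(d + 1))
    lo, hi = 1.0, float(N + 1)          # the root lies in this bracket
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if total(mid) < N + 1:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

As a sanity check, a uniform binary tree of depth 3 contains 15 nodes, so N = 14 and d = 3 should give b* = 2.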
Fig 2.7 Comparison of the search cost and effective branching factor
From (c), we can derive ℎ1 (misplaced tiles) because it would be the proper score if
tiles could move to their intended destination in one action.
If the relaxed problem is hard to solve, then the values of the corresponding heuristic
will be expensive to obtain.
A program called ABSOLVER can generate heuristics automatically from problem
definitions, using the “relaxed problem” method.
Generating admissible heuristics from sub problems: Pattern databases
Admissible heuristics can also be derived from the solution cost of a subproblem
of a given problem.
Fig 2.8
Pattern databases
The idea behind pattern databases is to store these exact solution costs for every
possible subproblem instance- in our example, every possible configuration of the
four tiles and the blank.
Then we compute an admissible heuristic for each state encountered during a
search simply by looking up the corresponding subproblem configuration in the
database.
The database itself is constructed by searching back from the goal and recording
the cost of each new pattern encountered;
Generating heuristics with landmarks
There are online services that host maps with tens of millions of vertices and find
cost-optimal driving directions in milliseconds (figure 2.9)
How can they do that, when the best search algorithms we have considered so far
are about a million times slower?
There are many tricks, but the most important one is precomputation of some
optimal path costs.
Although the precomputation can be time-consuming, it need only be done once,
and then can be amortized over billions of user search requests.
Fig 2.9
If the optimal path happens to go through a landmark, this heuristic will be exact; if
not it is inadmissible—it overestimates the cost to the goal.
In an A* search, if you have exact heuristics, then once you reach a node that is on an
optimal path, every node you expand from then on will be on an optimal path.
Some route-finding algorithms save even more time by adding shortcuts—artificial
edges in the graph that define an optimal multi-action path.
Could an agent learn how to search better? The answer is yes, and the method rests on
an important concept called the metalevel state space.
Each state in a metalevel state space captures the internal (computational) state of a
program that is searching in an ordinary state space such as the map of Romania.
(To keep the two concepts separate, we call the map of Romania an object-level state
space.)
Each action in the metalevel state space is a computation step that alters the internal
state; for example, each computation step in A* expands a leaf node and adds its
successors to the tree.
For harder problems, there will be many such missteps, and a metalevel learning
algorithm can learn from these experiences to avoid exploring unpromising subtrees.
The goal of learning is to minimize the total cost of problem solving, trading off
computational expense and path cost.
Learning heuristics from experience
One way to invent a heuristic is to devise a relaxed problem for which an optimal
solution can be found easily.
An alternative is to learn from experience. “Experience” here means solving lots
of 8-puzzles, for instance.
Each optimal solution to an 8-puzzle problem provides an example (goal, path)
pair. From these examples, a learning algorithm can be used to construct a
function that can approximate the true path cost for other states that arise during
search.
Fig: 2.10
A local minimum in the 8-queens state space: the state has h = 1, but every successor
has a higher cost.
Here h = the number of pairs of queens that are attacking each other, either directly or
indirectly; h = 17 for the state shown above.
Limitations:
Hill climbing cannot reach the optimum/best state (global maximum) if it enters any of
the following regions:
Local Maxima
A local maximum is a peak that is higher than each of its neighbouring states but
lower than the global maximum.
Plateaus
A plateau is a flat area of the state-space landscape.
It can be a flat local maximum, from which no uphill exit exists, or a shoulder, from
which progress is possible.
Ridges
A Ridge is an area which is higher than surrounding states, but it cannot be
reached in a single move.
Fig: 2.12
The ridge shown in figure 2.12 results in a sequence of local maxima that is very
difficult for a greedy algorithm to navigate.
Variations of Hill Climbing
In steepest-ascent hill climbing, all successors are compared and the one closest to the
solution is chosen.
Steepest ascent hill climbing is like best-first search, which tries all possible
extensions of the current path instead of only one.
It gives an optimal solution but is time-consuming.
1.7.1.2 Simulated Annealing:
Annealing is the process used to temper or harden metals and glass by heating them to
a high temperature and then gradually cooling them, thus allowing the material to
reach a low-energy crystalline state.
The simulated annealing algorithm is quite similar to hill climbing.
Instead of picking the best move, however, it picks a random move.
If the move improves the situation, it is always accepted.
Otherwise, the algorithm accepts the move with some probability less than 1.
Unlike steepest-ascent hill climbing, it need not examine all the neighbours.
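The acceptance rule above can be sketched as follows (an illustrative Python version; the value function, neighbour generator, and cooling schedule in the usage example are hypothetical):

```python
import math
import random

def simulated_annealing(start, value, random_neighbor, schedule, max_steps=5000):
    """Maximize `value`; schedule(t) gives the temperature at step t."""
    current = start
    for t in range(1, max_steps + 1):
        T = schedule(t)
        if T <= 0:
            break
        nxt = random_neighbor(current)
        delta = value(nxt) - value(current)
        # Always accept an improving move; accept a worsening move only with
        # probability e^(delta/T), which shrinks toward 0 as T cools.
        if delta > 0 or random.random() < math.exp(delta / T):
            current = nxt
    return current

# Hypothetical example: maximize -(x - 3)^2 over the integers.
random.seed(0)
best = simulated_annealing(0,
                           lambda x: -(x - 3) ** 2,
                           lambda x: x + random.choice([-1, 1]),
                           lambda t: 0.98 ** t)
```

Early on, high temperatures let the walk escape poor regions; as T decays the rule becomes effectively greedy, so on this unimodal landscape the walk settles at the optimum x = 3.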
Fig:2.13
Fig 2.13 c)
Fig 2.14
Fig 2.13 d)
Fig 2.13
form two children, one with the first part of parent 1 and the second part of
parent 2; the other with the second part of parent 1 and the first part of parent
2.
The mutation rate, which determines how often offspring have random
mutations to their representation. Once an offspring has been generated,
every bit in its composition is flipped with probability equal to the
mutation rate.
On each iteration, we update the parameters in the direction opposite to the gradient of
the objective function with respect to the parameters, since the gradient gives the
direction of steepest ascent.
The size of the step we take on each iteration to reach the local minimum is determined
by the learning rate 𝛼. Therefore we follow the direction of the slope downhill until we
reach a local minimum.
Gradient = partial derivative of the cost/loss function with respect to the weights/coefficients:
Gradient = ∂Cost/∂w
New weight = old weight - learning rate * ∂Cost/∂w
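The update rule can be sketched in a few lines (an illustrative Python version; the quadratic cost in the usage example is a made-up example):

```python
def gradient_descent(w, grad, alpha=0.1, steps=100):
    # Repeated update: new weight = old weight - learning rate * dCost/dw.
    for _ in range(steps):
        w = w - alpha * grad(w)
    return w

# Hypothetical example: cost(w) = (w - 4)^2, so grad(w) = 2 * (w - 4).
w = gradient_descent(0.0, lambda w: 2 * (w - 4))
```

Each step shrinks the error (w - 4) by the factor (1 - 2*alpha), so with alpha = 0.1 the iterate converges quickly to the minimum at w = 4.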
If we stray too far from the current state (by altering the location of one or more of the
airports by a large amount), then the set of closest cities for that airport changes, and we
need to recompute C_i.
One way to deal with a continuous state space is to discretize it.
For example, instead of allowing the locations to be any point in continuous two-
dimensional space, we could limit them to fixed points on a rectangular grid with
spacing of size 𝛿 (delta).
Methods that measure progress by the change in the value of the objective function
between two nearby points are called empirical gradient methods
Reducing the value of 𝛿 over time can give a more accurate solution, but does not
necessarily converge to a global optimum in the limit.
Many methods attempt to use the gradient of the landscape to find a maximum.
The gradient of the objective function is a vector ∇𝑓 that gives the magnitude and
direction of the steepest slope. For our problem, we have
∇f = (∂f/∂x1, ∂f/∂y1, ∂f/∂x2, ∂f/∂y2, ∂f/∂x3, ∂f/∂y3)
In some cases, we can find a maximum by solving the equation ∇f = 0; for example, if
we were placing just one airport.
This equation cannot be solved in closed form. For example, with three airports, the
expression for the gradient depends on what cities are closest to each airport in the
current state.
We can compute the gradient locally:
∂f/∂x1 = 2 Σ_{c ∈ C1} (x1 - xc)
We can perform steepest-ascent hill climbing by updating the current state according to the
formula
x ← x + α∇f(x)
where α (alpha) is a small constant often called the step size.
The problem occurs when α becomes too small (too many steps are needed) or too
large (the search may overshoot the maximum). The technique of line search tries to
overcome this problem by extending the current gradient direction, usually by repeatedly
doubling α, until f starts to decrease again. The point at which this occurs becomes the
new current state.
For many problems, the most effective algorithm is the venerable Newton–Raphson
method. This is a general technique for finding roots of functions—that is, solving
equations of the form 𝑔(𝑥) = 0.
A new estimate for the root according to Newton's formula is given by
x ← x - g(x)/g'(x)
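Newton's formula can be sketched as follows (an illustrative Python version; finding √2 as a root of g(x) = x^2 - 2 is a standard example):

```python
def newton_raphson(g, g_prime, x, iters=20):
    # Repeated tangent step: x <- x - g(x)/g'(x), converging to a root of g.
    for _ in range(iters):
        x = x - g(x) / g_prime(x)
    return x

# Hypothetical example: the square root of 2 is a root of g(x) = x^2 - 2.
root = newton_raphson(lambda x: x * x - 2, lambda x: 2 * x, 1.0)
```

Starting from x = 1, the iteration converges quadratically, reaching machine precision in a handful of steps.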
An optimization problem is constrained if solutions must satisfy some hard
constraints on the values of the variables. For example, in our airport-siting problem,
we might constrain sites to be inside Romania and on dry land (rather than in the middle
of lakes).
The difficulty of constrained optimization problems depends on the nature of the
constraints and the objective function. The best-known category is that of linear
programming problems, in which constraints must be linear inequalities forming
a convex set and the objective function is also linear.
1.7.3 Searching with Non-deterministic actions
Belief state
When the environment is partially observable, the agent doesn't know for sure what
state it is in; and when the environment is nondeterministic, the agent doesn't know
what state it transitions to after taking an action.
The set of physical states that the agent believes it might be in is called a belief state.
Conditional Plan
In partially observable and nondeterministic environments, the solution to a problem is
no longer a sequence, but rather a conditional plan (sometimes called a contingency
plan or a strategy) that specifies what to do depending on what percepts the agent
receives while executing the plan.
1.7.3.1 The Erratic (Unpredictable) Vacuum World
The vacuum world has eight states, and three actions- Right, Left, and Suck.
The goal is to clean up all the dirt (state 7 or 8)
If the environment is fully observable, deterministic, and completely known, then the
problem is easy to solve and the solution is an action sequence.
For example, if the initial state is 1, then the action sequence [Suck, Right, Suck] will
reach a goal state, 8.
In the erratic vacuum world, the environment is nondeterministic, and the Suck action
works as follows:
1. When applied to a dirty square the action cleans the square and sometimes, cleans
up dirt in an adjacent square, too.
2. When applied to a clean square, the action sometimes deposits dirt on the carpet.
Conditional Plan
In the Erratic Vacuum world, the solution is a conditional plan rather than a fixed
action sequence.
A conditional plan can contain if-then-else steps.
Fig 2.15
A solution for an AND-OR search problem is a subtree of the complete search tree
that
1. Has a goal node at every leaf,
2. Specifies one action at each of its OR nodes, and
3. Includes every outcome branch at each of its AND nodes.
The solution is shown in bold lines in figure 2.15.
Or by adding a label to denote some portion of the plan and referring to that label
later:
Fig 2.16
where b is the current belief state, s is a state in b, b′ is a possible updated belief state,
and s′ is a possible resulting state.
Lots of spooky math, but the first definition just means it is a one-to-one transition, and
the second means it is a one-to-many possible state transition. Here is a sample of the
predictions.
Goal test: a belief state satisfies the goal ONLY if all physical states in that
belief state satisfy the GOAL-TESTp function. (The search might pass an earlier
belief state in which one or more, but not all, of the physical states satisfy
GOAL-TESTp.)
Path cost: depends on what all the possible costs are for each potential state.
Assume they are the same, to make it easier for now.
Fig 2.17
Figure 2.18 shows the reachable belief-state space for the deterministic, sensorless
vacuum world. There are only 12 reachable belief states out of 2^8 = 256 possible belief states.
Creating the belief-state problem formulation is now easily possible using the
underlying physical problem definition.
Then, any search algorithm can be used.
Now, sensorless problem-solving is possible, but rarely practically feasible.
Why? The size of each belief state can be enormous.
Example: a vacuum world with a 10 × 10 grid of squares would mean an initial
belief state of 100 × 2^100 physical states.
Fig 2.18
The prediction stage computes the belief state resulting from the action,
exactly as we did with sensorless problems.
The possible percepts stage computes the set of percepts that could be
observed in the predicted belief state (using the letter o for observation):
The update stage computes, for each possible percept, the belief state that
would result from the percept. The updated belief state b_o is the set of states in
b that could have produced the percept:
The agent needs to deal with possible percepts at planning time, because it
won’t know the actual percepts until it executes the plan.
Each updated belief state b_o can be no larger than the predicted belief state;
Putting these three stages together, we obtain the possible belief states
resulting from a given action and the subsequent possible percepts:
Fig 2.19
Fig 2.20
Fig 2.21
What it looks like when we change the problem so that any square can become dirty at
any time, unless the agent is actively cleaning it at that moment.
The real world is almost always a partially observable environment.
An intelligent agent must therefore keep its belief state up to date.
The function for this goes by many names: monitoring, filtering, or state estimation.
The previous equation gives a recursive version:
the new belief state is a function of the old one and the latest action and percept,
not of the entire percept sequence.
The update needs to be fast; it can't fall behind. If things change too quickly, the
agent has to settle for computing an approximate belief state.
Here is an example in a discrete environment with deterministic sensors and
nondeterministic actions.
The example concerns a robot with a particular state estimation task called
localization: working out where it is, given a map of the world and a sequence of
percepts and actions.
The task? Localization.
The agent? A robot with broken actuators.
The environment? A maze.
The problem? The robot moves only in random directions and must determine where it is.
The rub? The robot can sense where it's at thanks to four sonar sensors, but its
actuators are messed up: it moves randomly.
Fig 2.22
The long and the short of it? The robot can tell where it’s at by comparing sensor
observations, even though it didn’t choose the direction it moved.
The new location in the above fig 2.22 b) is the only possible place it could have
moved to according to:
𝑈𝑃𝐷𝐴𝑇𝐸(𝑃𝑅𝐸𝐷𝐼𝐶𝑇(𝑈𝑃𝐷𝐴𝑇𝐸(𝑏, 1011), 𝑅𝑖𝑔ℎ𝑡), 1010)
We assume that the sensors give perfectly correct data, and that the robot has a correct
map of the environment. But unfortunately, the robot’s navigational system is broken,
so when it executes a Right action, it moves randomly to one of the adjacent squares.
The robot’s task is to determine its current location. Suppose the robot has just been
switched on, and it does not know where it is—its initial belief state consists of the set
of all locations.
The robot then receives the percept 1011 and does an update using the equation
𝑏𝑜 = 𝑈𝑃𝐷𝐴𝑇𝐸(𝑏, 1011), yielding the 4 locations shown in Fig 2.22 a).
Next the robot executes a Right action, but the result is nondeterministic.
The new belief state, 𝑏̂ = 𝑃𝑅𝐸𝐷𝐼𝐶𝑇(𝑏𝑜 , 𝑅𝑖𝑔ℎ𝑡), contains all the locations that are
one step away from the locations in 𝑏𝑜. When the second percept, 1010, arrives, the
robot does 𝑈𝑃𝐷𝐴𝑇𝐸(𝑏̂ , 1010) and finds that the belief state has collapsed down to
the single location shown in Figure 2.22 b). That is the only location that could be the
result of 𝑈𝑃𝐷𝐴𝑇𝐸(𝑃𝑅𝐸𝐷𝐼𝐶𝑇(𝑈𝑃𝐷𝐴𝑇𝐸(𝑏, 1011), 𝑅𝑖𝑔ℎ𝑡), 1010).
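The same update-predict-update collapse can be mimicked on a toy corridor world; the corridor, its 'end'/'mid' percepts, and the neighbour relation below are all made up for illustration:

```python
def update(belief, percept, sense):
    """Keep only the locations whose (deterministic) sensor reading matches."""
    return {loc for loc in belief if sense(loc) == percept}

def predict(belief, neighbors):
    """After a random move, the robot may be in any neighbour of its belief."""
    return {n for loc in belief for n in neighbors(loc)}

# Toy corridor 0..4: squares 0 and 4 read 'end', the rest read 'mid'.
sense = lambda loc: 'end' if loc in (0, 4) else 'mid'
neighbors = lambda loc: [l for l in (loc - 1, loc + 1) if 0 <= l <= 4]

b0 = update(set(range(5)), 'end', sense)   # first percept narrows 5 states to 2
b1 = predict(b0, neighbors)                # random move spreads the belief again
```

Each percept filters the belief state and each unobserved move expands it, exactly the alternation described above.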
Consider:
Fig 2.23
This just means the agent can't know that going up from (1, 1) leads to (1, 2) until it actually does it.
Adding a heuristic function h(s) gives an estimate of the distance from the current state
to the goal state, so the agent can try to reach the goal at minimum cost.
The ratio of the cost the agent actually incurs to the cost of the shortest possible path
(what it would pay if it knew the space in advance) is called the competitive ratio.
Make it small.
It can be infinite if some actions are irreversible and a dead-end state gets
reached.
Consider:
Fig 2.24
Fig 2.25
Fig 2.26
Eventually, in a finite space, a random walk will find what you are looking for.
The process can be very slow: one step forward, two steps back.
Improvements are made by augmenting randomness with memory.
Store a current best estimate H(s) of the cost to the goal from each state visited.
Start with the heuristic guess h(s) and update the estimate as you go.
The agent then moves according to the best estimates among its neighbours.
Fig 2.27
The agent should follow what seems to be the best path to the goal given the current
cost estimates for its neighbours. The estimated cost to reach the goal through a
neighbour s' is the cost to get to s' plus the estimated cost to get to a goal from
there, that is,
c(s, a, s') + H(s')
In (a), there are two such costs: 1 + 9 and 1 + 2. Go right, the 1 + 2 action.
LRTA* will always find a goal in any finite, safely explorable environment.
It is not complete in infinite state spaces.
It explores an environment of n states in at worst 𝑶(𝒏𝟐 ) steps.
The agent learns by updating its cost estimate for each state as it explores.
It needs accurate cost estimates of each state.
Once they are known, the best choices follow by always moving to the lowest-cost neighbour.
In effect, it is an optimistic form of hill climbing: after an apparently uphill move the
agent may have to backtrack a little.
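The update-and-move rule c(s, a, s') + H(s') can be sketched as follows; the actions, result, cost, and h callbacks are assumed stand-ins for the problem definition, not a standard interface:

```python
def lrta_step(s, H, actions, result, cost, h):
    """One LRTA*-style step: update H(s), then move to the best neighbour."""
    def f(a):                 # estimated goal cost through neighbour result(s, a)
        s2 = result(s, a)
        return cost(s, a, s2) + H.get(s2, h(s2))
    H[s] = min(f(a) for a in actions(s))   # learned estimate for state s
    return min(actions(s), key=f)          # apparently best action
```

In the 1 + 9 versus 1 + 2 situation above, f would score the two actions 10 and 3, so the agent goes right.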
We could keep the search tree in memory and reuse the parts of it that are unchanged
in the new problem. We could keep the heuristic h values and update them as we gain
new information, either because the world has changed or because we have computed
a better estimate. Or we could keep the best-path values, using them to piece together
a new solution, and updating them when the world changes.
Fig 3.1
Fig 3.2
Fig 3.3
Fig 3.4
Fig 3.5
So, we can disregard the x and y nodes because MIN will pick a value of at most 2, no matter what.
Why? A value of at most 2 is already worse for MAX than what MAX can get in another branch.
Another way of looking at it:
Fig 3.6
Fig 3.7
Move Ordering
The order in which states are examined can dramatically impact performance.
With a good ordering, cutoffs happen sooner and fewer nodes need to be examined.
o Find the smallest value sooner and you don't need to look at the others.
If examinations begin with the likely best successors:
o Alpha-beta need only examine O(𝑏 𝑚/2 ) nodes.
o Minimax needs O(𝑏 𝑚 ).
o Branching factor essentially becomes √𝑏 instead of b.
Chess would go from 35 to something like 6.
Dynamic move-ordering schemes can improve it further.
Example: use moves that were best in the past.
These best moves are often called killer moves.
Trying them first is called the killer move heuristic.
In certain games, transpositions can kill performance: different permutations of the
same moves that reach the same position.
Example: the chess move sequences
[a1, b1, a2, b2] and [a2, b2, a1, b1].
The pieces end up in the same position, reached by a different
order of the same moves.
The redundant paths to repeated states can cause an exponential increase in search
cost; keeping a table of previously reached states addresses this problem.
In game tree search, repeated states can occur because of transpositions—different
permutations of the move sequence that end up in the same position, and the problem
can be addressed with a transposition table that caches the heuristic value of states.
Keep a transposition table. Ignore the duplicates.
Similar to the explored list from GRAPH-SEARCH
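A transposition table is just a cache keyed by position that is consulted before searching. A minimal minimax sketch, assuming a hypothetical game object providing is_terminal, utility, moves, and play:

```python
def minimax(state, maximizing, game, table):
    """Minimax value of state, caching results in a transposition table."""
    key = (state, maximizing)
    if key in table:                      # transposition: reuse the cached value
        return table[key]
    if game.is_terminal(state):
        v = game.utility(state)
    else:
        vals = [minimax(game.play(state, m), not maximizing, game, table)
                for m in game.moves(state)]
        v = max(vals) if maximizing else min(vals)
    table[key] = v                        # remember this position's value
    return v
```

Any move order that transposes into an already-seen position is answered from the table instead of being re-searched.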
1.8.3 Monte Carlo Tree Search
The game of Go illustrates two major weaknesses of heuristic alpha–beta tree search:
i. Go has a branching factor that starts at 361, which means alpha–beta search
would be limited to only 4 or 5 ply.
ii. It is difficult to define a good evaluation function for Go because material value
is not a strong indicator and most positions are in flux until the endgame. In
response to these two challenges, modern Go programs have abandoned alpha–
beta search and instead use a strategy called Monte Carlo tree search (MCTS)
The basic MCTS strategy does not use a heuristic evaluation function. Instead, the value
of a state is estimated as the average utility over a number of simulations of complete
games starting from the state.
A simulation (also called a playout or rollout) chooses moves first for one player, then
for the other, repeating until a terminal position is reached. At that point the rules of the
game determine who has won or lost, and by what score.
To get useful information from the playout we need a playout policy that biases the
moves towards good ones. For Go and other games, playout policies have been
successfully learned from self-play by using neural networks.
Given a playout policy, we next need to decide two things:
i. from what positions do we start the playouts, and
ii. how many playouts do we allocate to each position?
Pure Monte Carlo search does N simulations starting from the current state of the game, and tracks
which of the possible moves from the current position has the highest win percentage.
For some stochastic games this converges to optimal play as N increases, but for most
games it is not sufficient—we need a selection policy that selectively focuses the
computational resources on the important parts of the game tree.
It balances two factors:
i. exploration of states that have had few playouts, and
ii. exploitation of states that have done well in past playouts, to get a more
accurate estimate of their value.
Monte Carlo tree search does that by maintaining a search tree and growing it on each
iteration of the following four steps, as shown in Figure
SELECTION:
Starting at the root of the search tree, we choose a move leading to a successor
node, and repeat that process, moving down the tree to a leaf.
Figure 5.10(a) shows a search tree with the root representing a state where white
has just moved, and white has won 37 out of the 100 playouts done so far.
The thick arrow shows the selection of a move by black that leads to a node
where black has won 60/79 playouts. This is the best win percentage among the
three moves.
Selection continues on to the leaf node marked 27/35.
EXPANSION:
We grow the search tree by generating a new child of the selected node; Figure
5.10(b) shows the new node marked with 0/0.
SIMULATION:
We perform a playout from the newly generated child node, choosing moves
for both players according to the playout policy.
These moves are not recorded in the search tree. In the figure, the simulation
results in a win for black.
BACK-PROPAGATION:
We now use the result of the simulation to update all the search tree nodes going
up to the root.
Since black won the playout, black nodes are incremented in both the number
of wins and the number of playouts, so 27/35 becomes 28/36 and 60/79 becomes
61/80.
Since white lost, the white nodes are incremented in the number of playouts
only, so 16/53 becomes 16/54 and the root 37/100 becomes 37/101.
We repeat these four steps either for a set number of iterations, or until the allotted time
has expired, and then return the move with the highest number of playouts.
One very effective selection policy is called “upper confidence bounds applied to trees”
or UCT. The policy ranks each possible move based on an upper confidence bound
formula called UCB1.
For a node n, the formula is UCB1(n) = U(n)/N(n) + C × √(log N(Parent(n)) / N(n)):
Fig 3.10
Where,
𝑈(𝑛) is the total utility of all playouts that went through node 𝑛,
𝑁(𝑛) is the number of playouts through node 𝑛, and 𝑃𝑎𝑟𝑒𝑛𝑡(𝑛) is the parent node of 𝑛 in the
tree.
𝑈(𝑛)/𝑁(𝑛) is the exploitation term: the average utility of 𝑛.
The term with the square root is the exploration term: it has the count 𝑁(𝑛) in the
denominator, which means the term will be high for nodes that have only been explored
a few times.
In the numerator it has the log of the number of times we have explored the parent of
n.
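The formula can be written out directly; C is the exploration constant (a common choice is √2), and giving unexplored nodes an infinite score so they are tried first is a conventional refinement, not something the formula itself dictates:

```python
import math

def ucb1(U_n, N_n, N_parent, C=math.sqrt(2)):
    """UCB1 score for a node: exploitation term plus exploration term."""
    if N_n == 0:
        return math.inf                    # unexplored nodes are tried first
    return U_n / N_n + C * math.sqrt(math.log(N_parent) / N_n)
```

For the example tree above, ucb1(60, 79, 100) exceeds ucb1(16, 53, 100), consistent with selection preferring the 60/79 branch.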
The pseudo code shows the complete UCT MCTS algorithm. When the iterations
terminate, the move with the highest number of playouts is returned.
The idea is that a win percentage based on many playouts is more reliable than the
same percentage based on a few, and the UCB1 formula ensures that the node with the
most playouts is almost always the node with the highest win percentage.
Advantages of Monte Carlo Tree Search:
It does not necessarily require any tactical knowledge about the game
A general MCTS implementation can be reused for any number of games with little
modification
MCTS supports asymmetric expansion of the search tree based on the circumstances
in which it is operating.
A drawback: as the tree grows rapidly after a few iterations, MCTS can require a huge
amount of memory.
Fig 3.11
Requires a tree containing chance nodes in addition to min and max nodes.
They consider the possible dice rolls.
Fig 3.12
Each chance node is labelled with the probability of the corresponding dice roll: 1/36, 1/18, etc.
Because of this uncertainty, we can only calculate a position's expected value: the average
over all possible outcomes of the chance nodes.
So, generalize the deterministic game's minimax value to an expectiminimax value
for games with chance nodes.
Terminal, MIN, and MAX nodes stay the same.
For chance nodes, sum the values of all outcomes, weighted by their probabilities:
EXPECTIMINIMAX(s) = Σ_r P(r) · EXPECTIMINIMAX(RESULT(s, r)),
where r is the dice roll, and RESULT(s, r) is the state s with roll r.
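The recursion can be sketched as follows; the game object with is_terminal, utility, node_type, children, and probabilities is an assumed interface for illustration:

```python
def expectiminimax(state, game):
    """Expected minimax value of a state in a game with chance nodes."""
    if game.is_terminal(state):
        return game.utility(state)
    kind = game.node_type(state)          # 'max', 'min', or 'chance'
    values = [expectiminimax(c, game) for c in game.children(state)]
    if kind == 'max':
        return max(values)
    if kind == 'min':
        return min(values)
    # Chance node: probability-weighted average over the possible outcomes.
    return sum(p * v for p, v in zip(game.probabilities(state), values))
```

MAX and MIN nodes behave exactly as in minimax; only chance nodes introduce the weighted sum.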
1.8.4.1 Evaluation functions for games of chance
Because of chance nodes, the meaning of evaluation values is a bit more dicey than in
deterministic games. Consider:
Assigning different values to the leaves leads to different decisions: leaf values
[1, 2, 3, 4] lead to taking a1, but [1, 20, 30, 400] lead to taking a2.
Fig 3.13
Kriegspiel rules:
Black and white see only their pieces, with a referee conducting the game.
Player tells ref about a move, ref resolves the move.
Humans manage to play it; computers can do so by leveraging belief states.
Starting off, white’s belief state is a singleton, black hasn’t moved yet.
After black’s move, white belief state can have 20 positions because black can
respond in 20 ways.
Keeping track of the belief state is the problem of state estimation.
Kriegspiel can be handled with the techniques for partially observable, nondeterministic problems:
RESULT combines white's own move with the unpredictable black reply.
Strategy changes in partially-observable games:
Moves are decided based on every possible percept sequence we could
get.
Not on each move the opponent might make.
With kriegspiel: a guaranteed checkmate is one where every possible percept sequence
leads to a checkmate, no matter what the opponent does.
The opponent's belief state doesn't matter.
Simplifies things a ton. Here’s a part of a guaranteed checkmate for King and Rook vs
a King situation.
Fig 3.14
The general AND-OR search algorithm can be applied to the belief-state space to find
guaranteed checkmates.
It finds midgame checkmates up to depth 9, which most humans can’t
do.
In addition to guaranteed checkmates, we have the probabilistic checkmate, a notion that
makes no sense in fully observable games; it relies on randomized play.
By moving randomly, the white king eventually bumps into the black king.
Black can’t keep guessing escape moves forever.
In KBNK(King, Bishop, Knight vs King) endgame:
White gives black infinite choices.
Eventually black guesses wrong.
This reveals black’s position.
This ends in checkmate.
Hard to find probabilistic checkmate with a reasonable depth, except endgame.
Usually you get an accidental checkmate early on, where the random choices just
work out.
So, how likely is a strategy to win? That depends on how likely each board state in the
belief state is to be the actual, true board state.
Now, not all belief states are equally likely. Certain moves are more important than
others, skewing the probabilities.
But, a player may want to avoid being predictable, skewing the probabilities even
more.
So, to play optimally, some randomness has to be built into moves on the part of the
player.
Leads to the idea of an equilibrium solution.
1.8.5.2 Card games
Many examples of stochastic partial observability.
Example:
Randomly deal cards at game start.
Cards hidden from other players.
Bridge, poker, hearts, etc.
Not exactly like dice, but it suggests an algorithm:
Solve all possible deals of the invisible cards as if they were fully observable.
Then, pick the move whose average outcome over all the deals is best.
If each deal s occurs with probability P(s), the desired move is:
𝑎𝑟𝑔𝑚𝑎𝑥𝑎 Σ𝑠 𝑃(𝑠) 𝑀𝐼𝑁𝐼𝑀𝐴𝑋(𝑅𝐸𝑆𝑈𝐿𝑇(𝑠, 𝑎))
The number of deals can be huge, so solving all of them can be impossible.
Instead, use a Monte Carlo approximation
o i.e., don't add up all deals; take a random sample of N deals.
o If deal 𝑠𝑖 appears in the sample with probability P(s), then compute:
𝑎𝑟𝑔𝑚𝑎𝑥𝑎 (1/𝑁) Σ𝑖=1..𝑁 𝑀𝐼𝑁𝐼𝑀𝐴𝑋(𝑅𝐸𝑆𝑈𝐿𝑇(𝑠𝑖 , 𝑎))
o The bigger the N, the better the approximation.
Fig 3.15
A solution is a specific assignment to each variable such that all
constraints are satisfied.
It can be helpful to visualize a CSP as a constraint graph, as shown in Fig (b).
In a constraint graph, each node is a variable and each edge connects a pair of
variables that appear together in a constraint.
This kind of constraint graph is a binary constraint graph, where each constraint
relates at most two variables; such CSPs are called binary CSPs.
A state has many variables, called state variables; each state
variable is a node.
Varieties of CSPs
Fig 3.16
Each letter stands for a distinct digit; the aim is to find a substitution of digits for
letters such that the resulting sum is arithmetically correct, with the added restriction
that no leading zeros are allowed.
The constraint hypergraph for the cryptarithmetic problem shows the Alldiff
constraint as well as the column addition constraints.
Each constraint is a square box connected to the variables it contains.
1.9 Constraint Propagation
A number of inference techniques use the constraints to infer which variable/value pairs
are consistent and which are not. These include node, arc, path, and k-consistency.
constraint propagation: Using the constraints to reduce the number of legal values for
a variable, which in turn can reduce the legal values for another variable, and so on.
local consistency: If we treat each variable as a node in a graph and each binary
constraint as an arc, then the process of enforcing local consistency in each part of the
graph causes inconsistent values to be eliminated throughout the graph.
There are different types of local consistency:
Otherwise, keep checking, trying to remove values from the domains of variables
until no more arcs are in the queue.
The result is an arc-consistent CSP that has the same solutions as the original one
but smaller domains.
The complexity of AC-3: Assume a CSP with n variables, each with domain size at
most d, and with c binary constraints (arcs). Checking consistency of an arc can be
done in O(d²) time, so the total worst-case time is O(cd³).
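A compact AC-3 sketch; representing the binary constraints as a dict mapping each directed arc (X, Y) to an allowed(x, y) predicate is an illustrative choice, not the only possible one:

```python
from collections import deque

def ac3(domains, constraints):
    """Prune domains to arc consistency; return False if a domain empties."""
    queue = deque(constraints)                      # start with every arc (X, Y)
    incoming = {}                                   # X -> {Z : (Z, X) is an arc}
    for (x, y) in constraints:
        incoming.setdefault(y, set()).add(x)
    while queue:
        X, Y = queue.popleft()
        allowed = constraints[(X, Y)]
        # Values of X with no supporting value in Y's domain must go.
        revised = {vx for vx in domains[X]
                   if not any(allowed(vx, vy) for vy in domains[Y])}
        if revised:
            domains[X] -= revised
            if not domains[X]:
                return False                        # inconsistency detected
            for Z in incoming.get(X, ()):           # re-check arcs into X
                if Z != Y:
                    queue.append((Z, X))
    return True
```

Each revision can shrink a neighbour's effective support, which is why the arcs pointing into the revised variable are put back on the queue.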
3.8.4 K-consistency:
K-consistency: A CSP is k-consistent if, for any set of k-1 variables and for any
consistent assignment to those variables, a consistent value can always be assigned to
any kth variable.
1-consistency = node consistency; 2-consistency = arc consistency; 3-consistency =
path consistency.
A CSP is strongly k-consistent if it is k-consistent and is also (k - 1)-consistent,
(k - 2)-consistent, … all the way down to 1-consistent.
If a CSP with n nodes is made strongly n-consistent, we are guaranteed to find a
solution in time O(n²d). But any algorithm for establishing n-consistency must take time
exponential in n in the worst case, and also requires space exponential in n.
1.9.2.6 Sudoku
A Sudoku board consists of 81 squares, some of which are initially filled with digits
from 1 to 9. The puzzle is to fill in all the remaining squares such that no digit appears
twice in any row, column, or box. A row, column, or 3 × 3 box is called a unit.
Fig 3.17
A Sudoku puzzle can be considered a CSP with 81 variables, one for each square. We
use the variable names A1 through A9 for the top row (left to right), down to I1
through I9 for the bottom row. The empty squares have the domain {1, 2, 3, 4, 5, 6, 7,
8, 9} and the pre-filled squares have a domain consisting of a single value.
There are 27 different Alldiff constraints: one for each row, column, and box of 9
squares:
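Using the A1..I9 naming above, the 27 units can be generated mechanically; an Alldiff constraint would then be attached to each generated unit:

```python
rows, cols = 'ABCDEFGHI', '123456789'

def units():
    """The 27 Sudoku units: 9 rows, 9 columns, and 9 boxes of 9 squares each."""
    row_units = [[r + c for c in cols] for r in rows]
    col_units = [[r + c for r in rows] for c in cols]
    box_units = [[r + c for r in rs for c in cs]
                 for rs in ('ABC', 'DEF', 'GHI')
                 for cs in ('123', '456', '789')]
    return row_units + col_units + box_units
```
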
Fig 3.18
If some variable X has no legal values left, the MRV heuristic will select X and
failure will be detected immediately—avoiding pointless searches through other
variables.
E.g. After the assignment for WA=red and NT=green, there is only one possible
value for SA, so it makes sense to assign SA=blue next rather than assigning Q.
Degree heuristic:
The degree heuristic attempts to reduce the branching factor on future choices by
selecting the variable that is involved in the largest number of constraints on other
unassigned variables. [useful tie-breaker]
E.g. SA is the variable with highest degree 5; the other variables have degree 2 or 3; T
has degree 0.
ORDER-DOMAIN-VALUES
Value selection: fail-last
If we are trying to find all the solution to a problem (not just the first one), then the
ordering does not matter.
Least-constraining-value heuristic: prefers the value that rules out the fewest choices
for the neighboring variables in the constraint graph. (Try to leave the maximum
flexibility for subsequent variable assignments.)
e.g. We have generated the partial assignment with WA=red and NT=green and that
our next choice is for Q. Blue would be a bad choice because it eliminates the last
legal value left for Q’s neighbor, SA, therefore prefers red to blue.
The minimum-remaining-values and degree heuristic are domain-independent
methods for deciding which variable to choose next in a backtracking search.
The least-constraining-value heuristic helps in deciding which value to try first for a
given variable.
1.9.3.2 Interleaving search and inference
INFERENCE - every time we make a choice of a value for a variable.
One of the simplest forms of inference is called forward checking. Whenever a
variable X is assigned, the forward-checking process establishes arc consistency for it:
for each unassigned variable Y that is connected to X by a constraint, delete from Y’s
domain any value that is inconsistent with the value chosen for X.
There is no reason to do forward checking if we have already done arc consistency as
a preprocessing step.
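Forward checking after assigning X = x can be sketched as below (constraints again assumed to be a dict of allowed(x, y) predicates over directed pairs, as an illustrative representation); returning the pruned values lets backtracking restore them on failure:

```python
def forward_check(X, x, domains, constraints):
    """Prune neighbours' domains after assigning X = x; return pruned values."""
    pruned = []
    for (A, B), allowed in constraints.items():
        if A == X:                         # B is a neighbour constrained by X
            for v in list(domains[B]):
                if not allowed(x, v):      # v is inconsistent with X = x
                    domains[B].discard(v)
                    pruned.append((B, v))
    return pruned
```

If any neighbour's domain becomes empty, the assignment X = x can be rejected immediately.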
Fig 3.19
Advantage: For many problems the search will be more effective if we combine the
MRV heuristic with forward checking.
Disadvantage: Forward checking only makes the current variable arc-consistent, but
doesn’t look ahead and make all the other variables arc-consistent.
Forward checking can supply the conflict set with no extra work.
Whenever forward checking based on an assignment X=x deletes a value from Y’s
domain, add X=x to Y’s conflict set;
If the last value is deleted from Y’s domain, the assignment in the conflict set of Y are
added to the conflict set of X.
In fact, every branch pruned by backjumping is also pruned by forward checking.
Hence simple backjumping is redundant in a forward-checking search, or in a search
that uses stronger consistency checking (such as MAC).
Conflict-directed backjumping:
e.g.
consider the partial assignment which is proved to be inconsistent: {WA=red,
NSW=red}.
We try T=red next and then assign NT, Q, V, SA, no assignment can work for these
last 4 variables.
Eventually we run out of value to try at NT, but simple backjumping cannot work
because NT doesn’t have a complete conflict set of preceding variables that caused to
fail.
The set {WA, NSW} is a deeper notion of the conflict set for NT, caused NT together
with any subsequent variables to have no consistent solution. So the algorithm should
backtrack to NSW and skip over T.
A backjumping algorithm that uses conflict sets defined in this way is called conflict-
direct backjumping.
How to Compute:
When a variable's domain becomes empty, a "terminal" failure occurs; that variable
has a standard conflict set.
Let Xj be the current variable, let conf(Xj) be its conflict set. If every possible value
for Xj fails, backjump to the most recent variable Xi in conf(Xj), and set
conf(Xi) ← conf(Xi)∪conf(Xj) – {Xi}.
The conflict set for a variable means there is no solution from that variable onward,
given the preceding assignments to the conflict set.
e.g.
assign WA, NSW, T, NT, Q, V, SA.
SA fails, and its conflict set is {WA, NT, Q}. (standard conflict set)
Backjump to Q, its conflict set is {NT, NSW}∪{WA,NT,Q}-{Q} = {WA, NT, NSW}.
Backtrack to NT, its conflict set is {WA}∪{WA,NT,NSW}-{NT} = {WA, NSW}.
Hence the algorithm backjump to NSW. (over T)
After backjumping from a contradiction, how to avoid running into the same problem
again:
The idea of finding a minimum set of variables from the conflict set that causes the problem.
This set of variables, along with their corresponding values, is called a no-good. We then
record the no-good, either by adding a new constraint to the CSP or by keeping a separate
cache of no-goods.
Backtracking occurs when no legal assignment can be found for a variable.
A backjumping algorithm that uses conflict sets defined in this way is called
conflict-directed backjumping.
Conflict-directed backjumping backtracks directly to the source of the problem.
Local search algorithms turn out to be very effective in solving many CSPs. They use
a complete-state formulation, where each state assigns a value to every variable, and
the search changes the value of one variable at a time.
As an example, we’ll use the 8-queens problem, as defined as a CSP. In Figure, we
start on the left with a complete assignment to the 8 variables; typically this will
violate several constraints.
We then randomly choose a conflicted variable, which turns out to be, the rightmost
column.
Fig 3.20
The min-conflicts heuristic: In choosing a new value for a variable, select the value
that results in the minimum number of conflicts with other variables.
In the above figure we see there are two rows that only violate one constraint; we pick
𝑄8 = 3 (that is, we move the queen to the 8th column, 3rd row).
On the next iteration, in the middle board of the figure, we select 𝑄6 as the variable to
change, and note that moving the queen to the 8th row results in no conflicts.
At this point there are no more conflicted variables, so we have a solution. The
algorithm is shown in Figure 3.21.
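The procedure can be sketched for n-queens, with state[i] holding the row of the queen in column i (a representation chosen for illustration):

```python
import random

def conflicts(state, col, row):
    """Number of queens attacking a queen placed at (col, row)."""
    return sum(1 for c in range(len(state)) if c != col and
               (state[c] == row or abs(state[c] - row) == abs(c - col)))

def min_conflicts(n=8, max_steps=10000, seed=0):
    """MIN-CONFLICTS local search for n-queens; None if no solution found."""
    rng = random.Random(seed)
    state = [rng.randrange(n) for _ in range(n)]   # random complete assignment
    for _ in range(max_steps):
        conflicted = [c for c in range(n) if conflicts(state, c, state[c]) > 0]
        if not conflicted:
            return state                           # no conflicts: a solution
        col = rng.choice(conflicted)               # random conflicted variable
        best = min(conflicts(state, col, r) for r in range(n))
        # min-conflicts value choice, ties broken at random
        state[col] = rng.choice([r for r in range(n)
                                 if conflicts(state, col, r) == best])
    return None
```

Breaking ties at random is one simple way to keep the search from cycling on a plateau.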
The landscape of a CSP under the min-conflicts heuristic usually has a series of
plateaux. Simulated annealing and plateau search (i.e. allowing sideways moves to
another state with the same score) can help local search find its way off a plateau.
This wandering on the plateau can be directed with tabu search: keeping a small list
of recently visited states and forbidding the algorithm to return to those states.
Constraint weighting: a technique that can help concentrate the search on the
important constraints.
Each constraint is given a numeric weight Wi, initially all 1.
At each step, the algorithm chooses a variable/value pair to change that will result in
the lowest total weight of all violated constraints.
Fig 3.21
The weights are then adjusted by incrementing the weight of each constraint that is
violated by the current assignment.
Local search can be used in an online setting when the problem changes, this is
particularly important in scheduling problems.
Fig 3.22
Fig 3.23
1.9.5.2 Tree Decomposition
A tree decomposition must satisfy the following three requirements:
i. Every variable in the original problem appears in at least one of the
subproblems.
ii. If two variables are connected by a constraint in the original problem, they
must appear together (along with the constraint) in at least one of the
subproblems.
iii. If a variable appears in two subproblems in a tree, it must appear in every
subproblem along the path connecting those subproblems.
Fig 3.24
2-Marks
1. Define A.I or what is A.I?
It is a branch of computer science by which we can create intelligent machines
which can behave like a human, think like humans, and able to make decisions.
2. What is meant by Turing test?
To conduct this test we need two people and one machine.
One person is the interrogator (i.e. the questioner), who asks questions of the other
person and of the machine.
The three of them are in separate rooms. The interrogator knows them only as A and B,
so it has to identify which is the person and which is the machine.
The goal of the machine is to make the interrogator believe its answers are a person's.
If the machine succeeds in fooling the interrogator, the machine acts like a human.
Programming a computer to pass the Turing test is very difficult.
3. What is called materialism?
An alternative to dualism is materialism, which holds that the entire world operates
according to physical law.
Mental processes and consciousness are therefore part of the physical world.
4. What are the capabilities, computer should posses to pass Turing test?
Natural Language Processing
Knowledge representation
Automated reasoning
Machine Learning
5. Define Total Turing Test?
The test which includes a video signals so that the interrogator can test the perceptual
abilities of the machine.
6. What are the capabilities computers needs to pass total Turing test?
Computer Vision
Robotics
7. Define Rational Agent.
A rational agent is one that does the right thing. Here right thing is one that will cause
agent to be more successful. That leaves us with the problem of deciding how and when to
evaluate the agent’s success.
8. Define Agent.
An Agent is anything that can be viewed as perceiving (i.e.) understanding its
environment through sensors and acting upon that environment through actuators.
The major functionality of the agent function is to generate the possible action to each
and every percept. It helps the agent to get the list of possible actions the agent can
take.
Agent function can be represented in the tabular form.
20. Define basic agent program?
The basic agent program is a concrete implementation of the agent function which
runs on the agent architecture.
Agent program puts a bound on the length of the percept sequence and considers only
the required percept sequences.
Agent program implements the functions of percept sequence and action which are
external characteristics of the agent.
Agent program takes input as the current percept from the sensor and return an action
to the effectors (Actuators)
21. What are the four components to define a problem? Define them.
The initial state: the state in which the agent starts.
A description of possible actions: the actions available to the agent.
The goal test: the test that determines whether a given state is a goal state.
A path cost function: a function that assigns a numeric cost (value) to each path. The
problem-solving agent is expected to choose a cost function that reflects its own performance
measure.
22. What is rational at any given time depends on four things:
The performance measure that defines the criterion of success.
The agent’s prior knowledge of the environment.
The actions that the agent can perform.
The agent’s percept sequence to date.
Heuristic search is a class of methods used to search a solution space for an optimal
solution to a problem.
The heuristic uses some method to search the solution space while assessing where in the
space the solution is most likely to be, and focuses the search on that area.
28. What is heuristic function?
A node with the lowest evaluation is selected for expansion, because the evaluation
measures distance to the goal.
A heuristic function h(n) is defined as the estimated cost of the cheapest path from
node ‘n’ to a goal node.
The greedy best-first search algorithm always selects the path which appears best at
that moment. At each step, we choose the most promising node: we expand the node
which is closest to the goal node, where closeness is estimated by the heuristic
function, i.e. f(n) = h(n).
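The evaluation f(n) = h(n) can be sketched as a small priority-queue search. This is a hedged illustration: the graph, heuristic values, and function names below are hypothetical, not taken from the text.

```python
import heapq

def greedy_best_first(start, goal, neighbors, h):
    """Greedy best-first search: always expand the node with the lowest
    heuristic value, f(n) = h(n). The cost already paid, g(n), is ignored
    (A* would use f(n) = g(n) + h(n) instead)."""
    frontier = [(h(start), start, [start])]      # (priority, node, path)
    visited = set()
    while frontier:
        _, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        for nxt in neighbors(node):
            if nxt not in visited:
                heapq.heappush(frontier, (h(nxt), nxt, path + [nxt]))
    return None

# Hypothetical graph and heuristic values toward goal 'G'.
graph = {'S': ['A', 'B'], 'A': ['G'], 'B': ['G'], 'G': []}
h_values = {'S': 5, 'A': 2, 'B': 4, 'G': 0}
path = greedy_best_first('S', 'G', lambda n: graph[n], lambda n: h_values[n])
```

Here the search prefers A (h = 2) over B (h = 4) even though nothing is known about the actual edge costs, which is exactly why greedy best-first search is not guaranteed optimal.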
31. What is an optimal solution?
An optimal solution is the solution with the lowest path cost among all solutions.
32. What is data abstraction?
Data abstraction is defined as the process of reducing the object to its essence
so that only the necessary characteristics are exposed to the users.
33. Differentiate uninformed and informed search strategies
Hill climbing algorithm is a local search algorithm which continuously moves in the
direction of increasing elevation/value to find the peak of the mountain or best solution
to the problem.
It terminates when it reaches a peak value where no neighbour has a higher value.
A ridge is a special form of the local maximum. It has an area which is higher
than its surrounding areas, but itself has a slope, and cannot be reached in a
single move.
Solution: With the use of bidirectional search, or by moving in different
directions, we can overcome this problem.
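The climb-until-no-neighbour-improves loop described above can be sketched as follows; the one-dimensional landscape and names are hypothetical, chosen so the single peak is easy to see.

```python
def hill_climb(start, neighbors, value):
    """Simple hill climbing: move to the best neighbour until no neighbour
    has a higher value; the stopping point is a peak (possibly only local)."""
    current = start
    while True:
        candidates = neighbors(current)
        if not candidates:
            return current
        best = max(candidates, key=value)
        if value(best) <= value(current):
            return current          # peak reached
        current = best

# Hypothetical 1-D landscape over states 0..6 with a single peak at x = 3.
value = lambda x: -(x - 3) ** 2
neighbors = lambda x: [n for n in (x - 1, x + 1) if 0 <= n <= 6]
peak = hill_climb(0, neighbors, value)
```

On a landscape with ridges or several peaks, the same loop would stop at the first local maximum it reaches, which is the failure mode the text describes.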
When the environment is nondeterministic, the agent doesn’t know what state
it transitions to after taking an action.
The set of physical states that the agent believes it might be in is called a
belief state.
40. What is a plateau in local search?
A plateau is a flat area of the search space in which all the neighbouring
states have the same value of the objective function, so the algorithm
cannot find a best direction in which to move.
41. What is local maximum and global maximum in state space?
Local maximum: Local maximum is a state which is better than its neighbour
states, but there is also another state which is higher than it.
Global Maximum: Global Maximum is the best possible state of state space
landscape. It has the highest value of objective function.
42. Define Admissible heuristic h(n)
Admissible heuristics are used to estimate the cost of reaching the goal state in
a search algorithm.
Admissible heuristics never overestimate the cost of reaching the goal state.
The use of admissible heuristics also results in optimal solutions as they
always find the cheapest path solution.
For a heuristic to be admissible for a search problem, it must be less than or
equal to the actual cost of reaching the goal.
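One way to make this definition concrete is to compare a candidate heuristic against the true cheapest costs h*(n), computed exactly with Dijkstra's algorithm. The graph, edge weights, and h values below are made up for illustration.

```python
import heapq

def true_costs(goal, edges):
    """Exact cheapest cost from every node to the goal: Dijkstra on the
    reversed graph. These values play the role of h*(n), the ground truth."""
    rev = {}
    for (u, v), w in edges.items():
        rev.setdefault(v, []).append((u, w))
    dist = {goal: 0}
    pq = [(0, goal)]
    while pq:
        d, node = heapq.heappop(pq)
        if d > dist.get(node, float('inf')):
            continue
        for prev, w in rev.get(node, []):
            if d + w < dist.get(prev, float('inf')):
                dist[prev] = d + w
                heapq.heappush(pq, (d + w, prev))
    return dist

# Hypothetical weighted graph and a candidate heuristic to check.
edges = {('S', 'A'): 1, ('A', 'G'): 4, ('S', 'G'): 6}
h = {'S': 4, 'A': 3, 'G': 0}

hstar = true_costs('G', edges)
# Admissible iff h never overestimates the true remaining cost.
admissible = all(h[n] <= hstar[n] for n in h)
```

Raising h['S'] above 5 (the true cheapest cost S→A→G) would make the check fail, i.e. the heuristic would overestimate and lose its admissibility guarantee.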
43. What is triangle inequality?
It states that each side of a triangle cannot be longer than the sum of the
other two sides of the triangle.
Each state in a metalevel state space captures the internal state of a program that
is searching in an object level state space.
45. When is a heuristic function h said to be admissible? Give an admissible heuristic function
for TSP.
A heuristic h is admissible if it never overestimates the true cost of reaching the goal.
For the travelling salesperson problem, the cost of a minimum spanning tree of the
remaining cities is an admissible heuristic, since the optimal tour can never be shorter
than the minimum spanning tree.
56. What are the four steps of Monte Carlo tree search?
The process of Monte Carlo Tree Search can be broken down into four distinct
steps:
Selection
Expansion
Simulation
Backpropagation
57. What are the advantages of Monte Carlo search?
MCTS is a simple algorithm to implement.
Monte Carlo Tree Search is a heuristic algorithm. MCTS can operate
effectively without any knowledge of the particular domain, apart from the
rules and conditions, and can find its own moves and learn from them by
playing random playouts.
58. What are stochastic games in detail?
A stochastic game was introduced by Lloyd Shapley in the early 1950s. It is a
dynamic game with probabilistic transitions played by one or more players.
The game is played in a sequence of stages. At the beginning of each stage,
the game is in a certain state.
Stochastic games (SG), also called Markov games, extend Markov decision
processes (MDPs) to the case where there are multiple players in a common
environment.
These agents perform a joint action that defines both the reward obtained by
the agents and the new state of the environment.
59. What is partial observability in AI?
Partial observability means that an agent does not know the state of the world
or that the agents act simultaneously.
In a partially observable system the observer may utilize a memory system in
order to add information to the observers understanding of the system.
An example of a partially observable system would be a card game in which
some of the cards are discarded into a pile face down.
60. What is CSP in AI?
A constraint satisfaction problem (CSP) consists of a set of variables, a
domain for each variable, and a set of constraints.
Part-B
1. State and Explain the Types of Environments in AI.
2. Explain the Types of Intelligent Agents.
3. Explain Rationality with Example.
4. Explain all four characteristic of Intelligent Agent.
5. List the features of Intelligent Agents
6. Explain uninformed search strategies in detail
7. Discuss about
i. Greedy best-first search
ii. A* search
iii. Memory bounded heuristic search
8. Compose the algorithm for recursive best first search.
9. Explain the nature of heuristics with an example. What is the effect of heuristic
accuracy on performance?
10. Explain the following types of Hill Climbing search techniques
i) Simple Hill Climbing
ii) Steepest-Ascent Hill Climbing
iii) Simulated Annealing
11. Explain informed search strategies with an example
12. Explain in detail about Online Search Agent and Unknown environment
13. Explain briefly about Search with non-deterministic actions
14. Describe Alpha beta pruning with algorithm.
15. Explain stochastic games with examples.
16. Why is game theory important in AI?
17. Define minimax algorithm in Game theory.
18. How does Monte Carlo Tree search work?
19. How CSP is formulated as a search problem?
20. What is backtracking search in CSP?
UNIT-II
PROBABILISTIC REASONING
Acting under uncertainty – Bayesian inference – Naïve Bayes models. Probabilistic reasoning –
Bayesian networks – Exact inference in BN – Approximate inference in BN – Causal networks.
∀p Symptom(p, Toothache)
⇒ Disease(p, Cavity) ∨ Disease(p, GumDisease) ∨ Disease(p, Swelling) ∨ …
To make the rule true, we would have to add an almost unlimited list of possible causes.
We could try a causal rule instead:
But this rule is not right either; not all cavities cause pain.
Toothache and Cavity are not connected by a strict logical rule, so the judgement may go wrong.
This is typical of the medical domain, as well as most other judgmental domains: law,
business, design, automobile repair, gardening, dating, and so on.
Three main reasons of failures
Decision theory combines the agent’s beliefs and desires, defining the best action as
the one that maximizes expected utility. (i.e, not all the time the utility values are
satisfied, but we have to maximize the expected utilities).
(In this example all events are equally likely, so there is no bias toward any instance;
the probability of each event is 1/36. Assigning this probability to a particular possible
world is governed by the probability model.)
w represents one of the possible worlds; the probability of any world lies between
0 and 1, where 0 represents an impossible event and 1 a certain event. This holds for
every w, and the probabilities of all the possible worlds sum to 1 (we have 36 events,
each with probability 1/36, so the sum is 1).
Unconditional Probability/ Prior Probability
For example, rolling two dice so that they add up to 11: the instances that add up to 11
are (5, 6) and (6, 5), and their probability is 1/36 + 1/36 = 2/36 = 1/18.
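The 1/36-per-world arithmetic above can be checked by enumerating all possible worlds (a small sketch using exact fractions):

```python
from fractions import Fraction

# The 36 equally likely possible worlds for two dice, each with P = 1/36.
worlds = [(d1, d2) for d1 in range(1, 7) for d2 in range(1, 7)]
p_world = Fraction(1, 36)

# The probabilities of all possible worlds sum to 1.
total = p_world * len(worlds)

# P(Total = 11): only the worlds (5, 6) and (6, 5) qualify,
# so 1/36 + 1/36 = 2/36 = 1/18.
p_eleven = sum(p_world for (d1, d2) in worlds if d1 + d2 == 11)
```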
Product rule
Random variables – variables in probability theory are called random variables. (We don’t
know the exact occurrence of these variables; that’s why we call them random variables.)
Domain- Each variable will have domain
(we have our syntax of logical statement inorder to represent our knowledge)
Example: “The probability that the patient has a cavity, given that she is a teenager with no
toothache, is 0.1” as follows:
Probability distribution
For example, Weather={ sunny, rain, cloudy, snow}
Sometimes we will want to talk about the probabilities of all the possible values of a random
variable. We could write:
Inclusion-exclusion principle
P(A˅B) = P(A) + P(B) − P(A∧B)
Sum rule: for mutually exclusive events, P(A∧B) = 0, so
P(A˅B) = P(A) + P(B)
Independent event
P(A|B)= P(A)
So, that we have product rule,
P(A∧B)= P(A|B). P(B)
=P(A). P(B)
So, this is true for independent events.
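The identity P(A∧B) = P(A)·P(B) for independent events can be verified by enumeration on the two-dice model; the two events chosen below are illustrative.

```python
from fractions import Fraction

# Two dice: A = "first die is even", B = "second die shows 6".
# These events are independent, so P(A ∧ B) = P(A) · P(B) must hold.
worlds = [(d1, d2) for d1 in range(1, 7) for d2 in range(1, 7)]
p = Fraction(1, 36)

p_a = sum(p for d1, d2 in worlds if d1 % 2 == 0)               # P(A) = 1/2
p_b = sum(p for d1, d2 in worlds if d2 == 6)                   # P(B) = 1/6
p_ab = sum(p for d1, d2 in worlds if d1 % 2 == 0 and d2 == 6)  # P(A ∧ B)
```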
Here, Weather and Cavity are the variables; Weather has 4 values and Cavity has 2 values.
What are the possible combinations of these variables? Considering (W, C), there are 8 combinations.
So, we find the probability distribution over these two variables.
One instance is sunny weather with cavity; another is sunny with no cavity.
Similarly, rain with cavity and rain without cavity, and so on: 4 × 2 = 8 combinations are possible.
So, every possible world has some probability, and some possible worlds are more
probable than others; the probability of such a combination is called a joint probability.
When you have multiple variables, probability distribution over multiple variables is known
as joint probability.
And if you enlist all the possible world then it becomes, full joint probability distribution.
Can be written as a single equation:
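As a numeric sketch of a full joint distribution over Weather and Cavity, the eight probabilities below are assumed for illustration (not taken from the text); they sum to 1, and any marginal can be read off by summing rows.

```python
# Hypothetical full joint distribution over Weather (4 values) and
# Cavity (2 values): 4 * 2 = 8 possible worlds.
joint = {
    ('sunny', True): 0.144, ('sunny', False): 0.576,
    ('rain', True): 0.020, ('rain', False): 0.080,
    ('cloudy', True): 0.016, ('cloudy', False): 0.064,
    ('snow', True): 0.020, ('snow', False): 0.080,
}

# All eight worlds together must account for probability 1.
total = sum(joint.values())

# Marginal P(Cavity = true): sum over all weather values.
p_cavity = sum(p for (w, c), p in joint.items() if c)
```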
2.4 Independence
Independence between propositions a and b can be written as
It seems safe to say that the weather does not influence the dental variables.
Therefore, the following assertion seems reasonable
Thus, the 32-element table for four variables can be constructed from one 8-element
table and one 4-element table.
2.5 Bayes’ Rule and its use
The product rule can be written in two forms: P(a∧b) = P(a|b)P(b) and P(a∧b) = P(b|a)P(a).
Equating the right-hand sides and dividing by P(a) gives Bayes’ rule: P(b|a) = P(a|b)P(b) / P(a).
Equation 5.3
To categorize a new document, we check which key words appear in the document
and then apply Equation 5.3 to obtain the posterior probability distribution over
categories.
If we have to predict just one category, we take the one with the highest posterior
probability.
Notice that, for this task, every effect variable is observed, since we can always tell
whether a given word appears in the document.
The naive Bayes model assumes that words occur independently in documents, with
frequencies determined by the document category. This independence assumption is
clearly violated in practice.
For example, the phrase “first quarter” occurs more frequently in business (or sports)
articles than would be suggested by multiplying the probabilities of “first” and
“quarter.”
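The categorization step can be sketched as follows. This is a minimal illustration with a tiny hypothetical vocabulary and hand-picked priors and likelihoods (not the book's numbers): the posterior over categories is proportional to the prior times the product of per-word likelihoods, and absent words are ignored for brevity.

```python
import math

# Assumed priors and per-word likelihoods P(word appears | category).
priors = {'business': 0.5, 'sports': 0.5}
likelihoods = {
    'business': {'market': 0.8, 'goal': 0.1},
    'sports':   {'market': 0.2, 'goal': 0.7},
}

def posterior(words, priors, likelihoods):
    """P(Category | words) ∝ P(Category) * Π P(word | Category),
    treating words as independent given the category (the naive assumption)."""
    scores = {}
    for cat, prior in priors.items():
        log_p = math.log(prior)
        for w in words:
            log_p += math.log(likelihoods[cat][w])
        scores[cat] = log_p
    z = sum(math.exp(s) for s in scores.values())   # normalize
    return {cat: math.exp(s) / z for cat, s in scores.items()}

post = posterior(['market'], priors, likelihoods)
best = max(post, key=post.get)   # predict the highest-posterior category
```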
2.7 Probabilistic Reasoning
Bayesian Network
A Bayesian network represents the dependencies among variables and gives a
concise specification of any full joint probability distribution.
A Bayesian network is a data structure also called a belief network, probabilistic
network, causal network, or knowledge map.
The extension of a Bayesian network is called a decision network or influence
diagram.
A Bayesian network is a directed graph in which each node is annotated with quantitative
probability information.
The full specification is as follows:
A set of random variables makes up the nodes of the network. Variables may
be discrete or continuous.
A set of directed links or arrows connects pairs of nodes. If there is an
arrow from node X to node Y, X is said to be a parent of Y.
Each node X has a conditional probability distribution P(X | Parents(X)) that
quantifies the effect of the parents on the node.
The graph has no directed cycles (and hence is a directed, acyclic graph, or
DAG.)
Example: Simple Bayesian network
John always calls when he hears the alarm, but sometimes confuses the telephone
ringing with the alarm and calls then, too.
Mary, on the other hand, likes loud music and sometimes misses the alarm altogether.
Given the evidence of who has or has not called, we would like to estimate the
probability of a burglary.
which is equal to
This identity is called the chain rule. The specification of the joint distribution is
equivalent to the general assertion that, for every variable 𝑋𝑖 in the network,
No need to consider all the other things, because the parent of MaryCalls is Alarm.
So, MaryCalls is the child node of Alarm.
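The chain-rule factorization P(b, e, a, j, m) = P(b)P(e)P(a|b,e)P(j|a)P(m|a) can be sketched numerically on the burglary network. The CPT numbers below are the commonly used AIMA textbook values and should be treated as assumed, not derived here.

```python
# Assumed CPTs for the burglary network (AIMA-style numbers).
p_b, p_e = 0.001, 0.002                       # P(Burglary), P(Earthquake)
p_a_given = {(True, True): 0.95, (True, False): 0.94,
             (False, True): 0.29, (False, False): 0.001}   # P(Alarm | B, E)
p_j_given_a = {True: 0.90, False: 0.05}       # P(JohnCalls | Alarm)
p_m_given_a = {True: 0.70, False: 0.01}       # P(MaryCalls | Alarm)

# Probability that the alarm sounds and both neighbours call,
# but there is neither a burglary nor an earthquake:
p = ((1 - p_b) * (1 - p_e) * p_a_given[(False, False)]
     * p_j_given_a[True] * p_m_given_a[True])
```

Note that each factor only conditions on a node's parents, which is exactly the compactness the chain rule buys: five small tables instead of a 2^5-entry joint.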
Compactness and node ordering
The compactness of a Bayesian network is an example of a general property of locally
structured systems (also called sparse systems).
In a locally structured system, each subcomponent interacts directly with only a
bounded number of other components, regardless of the total number of components.
Therefore the correct order in which to add nodes is to add the ‘root causes’ first, then the
variables they influence, and so on until we reach the leaves.
Suppose, we decide to add the nodes in the order MaryCalls, JohnCalls,Alarm,
Burglary, Earthquake.
Adding MaryCalls: No parents
Adding JohnCalls: If Mary calls, that probably means the alarm has gone off, which of
course would make it more likely that John calls. Therefore, JohnCalls needs
MaryCalls as a parent.
Adding Alarm: Clearly, if both call, it is more likely that the alarm has gone off than
if just one or neither call, so we need both MaryCalls and JohnCalls as parents.
Adding Burglary: If we know the alarm state, then the call from John or Mary might
give us information about our phone ringing or Mary’s music, but not about burglary:
MaryCalls
JohnCalls
Alarm
Burglary
Earthquake
ii. A node is conditionally independent of all other nodes in the network, given
its parents, children, and children’s parents- that is, given its Markov blanket.
For example, Burglary is independent of JohnCalls and MaryCalls, given
Alarm and Earthquake.
Notations
X denotes the query variable
E set of evidence variables {𝐸1 , … 𝐸𝑚 }
e particular observed event
Y non-evidence, non-query variables, 𝑌1 , … 𝑌𝑛 . (called the hidden variables)
The complete set of variables – 𝑿 = {𝑿} 𝙐 𝑬 𝙐 𝒀
A typical query asks for the Posterior Probability distribution P( X | e)
In the burglary network, we might observe the event in which
𝐽𝑜ℎ𝑛𝐶𝑎𝑙𝑙𝑠 = 𝑡𝑟𝑢𝑒 𝑎𝑛𝑑 𝑀𝑎𝑟𝑦𝐶𝑎𝑙𝑙𝑠 = 𝑡𝑟𝑢𝑒.
We could then ask for, say, the probability that a burglary has occurred:
Types of Inferences
2.9.1 Inference by Enumeration (inference by listing or recording all variables)
Any conditional probability can be computed by summing terms from the full joint
distribution.
More specifically, a query P(X | e) can be answered using equation.
Here e denotes the observed event, and the summation ranges over the values y of the
hidden variables Y.
The semantic of Bayesian network give us an expression, in terms of CPT entries, for
simplicity we do this just for Burglary = true
Intermediate results are stored, and summations of each variable are done, for only
those portion of the expression, that depends on the variable.
Let us illustrate this process for the burglary network.
We evaluate the expression
We have annotated each part of the expression with the same name of the associated
variable, these parts are called factors
For example, the factors 𝑓4(𝐴) and 𝑓5(𝐴) corresponding to 𝑃( 𝑗 | 𝑎) and 𝑃( 𝑚 | 𝑎)
depending just on A because J and M are fixed by the query.
They are therefore two element vectors.
Given two factors 𝑓(𝑋, 𝑌) and 𝑔(𝑌, 𝑍) with probability distributions shown below,
the pointwise product 𝑓 × 𝑔 = ℎ(𝑋, 𝑌, 𝑍) has 2^(1+1+1) = 8 entries.
Elimination
Summing out, or eliminating, a variable from a factor is done by adding up the sub-
arrays formed by fixing the variable to each of its values in turn.
For example, to sum out Y from ℎ(𝑋, 𝑌, 𝑍), we write:
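The pointwise product and summing-out operations can be sketched with factors stored as dictionaries mapping Boolean assignments to numbers; the variable names and factor values below are hypothetical.

```python
from itertools import product

def pointwise_product(f, f_vars, g, g_vars):
    """h = f x g: entries of f and g that agree on the shared variables are
    multiplied; the result ranges over the union of the variables."""
    h_vars = f_vars + [v for v in g_vars if v not in f_vars]
    h = {}
    for assign in product([True, False], repeat=len(h_vars)):
        world = dict(zip(h_vars, assign))
        h[assign] = (f[tuple(world[v] for v in f_vars)]
                     * g[tuple(world[v] for v in g_vars)])
    return h, h_vars

def sum_out(var, f, f_vars):
    """Eliminate `var` by adding up the sub-arrays formed by fixing it
    to each of its values in turn."""
    i = f_vars.index(var)
    out = {}
    for assign, p in f.items():
        key = assign[:i] + assign[i + 1:]
        out[key] = out.get(key, 0.0) + p
    return out, [v for v in f_vars if v != var]

# Hypothetical factors f(X, Y) and g(Y, Z).
f = {(True, True): 0.2, (True, False): 0.3,
     (False, True): 0.4, (False, False): 0.1}
g = {(True, True): 0.5, (True, False): 0.5,
     (False, True): 0.9, (False, False): 0.1}

h, h_vars = pointwise_product(f, ['X', 'Y'], g, ['Y', 'Z'])  # 2^3 = 8 entries
marg, marg_vars = sum_out('Y', h, h_vars)                    # back to (X, Z)
```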
Relevance
𝑃(𝐽|𝑏)
Sampling Methods
Basic idea:
Draw N samples from a sampling distribution S.
Compute an approximate posterior probability P.
Show that this approximation converges to the true probability distribution P.
Why sampling
Generating samples is often much faster than computing the right answer (e.g., with
variable elimination)
Sampling
How to sample from the distribution of a discrete variable X?
Assume k discrete outcomes 𝑥1,… 𝑥𝑘 with probabilities P(𝑥𝑖 ).
Assume sampling from the uniform 𝑈(0,1) is possible,
e.g. as enabled by a standard rand() function.
Divide the [0,1] interval into k regions, with region i having size P(𝑥𝑖 ).
Sample 𝑢~𝑈(0,1) and return the value associated with the region in which 𝑢 falls.
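The region-dividing procedure above can be written directly (a sketch; the outcome names and probabilities are illustrative):

```python
import random

def sample_discrete(outcomes, probs, u=None):
    """Sample x_i by dividing [0, 1] into k regions of size P(x_i) and
    returning the outcome whose region contains u ~ U(0, 1)."""
    if u is None:
        u = random.random()          # standard uniform, as with rand()
    cumulative = 0.0
    for x, p in zip(outcomes, probs):
        cumulative += p
        if u < cumulative:
            return x
    return outcomes[-1]              # guard against rounding when u ~ 1

# Regions for P = (0.2, 0.5, 0.3): [0, 0.2), [0.2, 0.7), [0.7, 1).
x = sample_discrete(['a', 'b', 'c'], [0.2, 0.5, 0.3], u=0.65)
```

Passing a fixed u makes the mapping from the unit interval to outcomes easy to verify; in normal use u is drawn fresh for every sample.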
Prior Sampling
Sampling from a Bayesian network, without observed evidence.
Sample each variable in turn, in topological order.
The probability distribution from which the value is sampled is conditioned on
the values already assigned to the variable’s parents.
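A sketch of prior sampling on a tiny assumed two-node network (Cloudy → Rain; the CPT numbers are made up): each variable is sampled in topological order, conditioned on the parent values already drawn.

```python
import random

def prior_sample(topo_order, cpts, rng):
    """Sample each variable in topological order, conditioning the draw
    on the values already assigned to the variable's parents."""
    event = {}
    for var, parents in topo_order:
        key = tuple(event[p] for p in parents)
        p_true = cpts[var][key]
        event[var] = rng.random() < p_true
    return event

# Assumed network: Cloudy -> Rain.
topo = [('Cloudy', ()), ('Rain', ('Cloudy',))]
cpts = {'Cloudy': {(): 0.5},
        'Rain': {(True,): 0.8, (False,): 0.2}}

rng = random.Random(0)
samples = [prior_sample(topo, cpts, rng) for _ in range(10000)]
# P(Rain) = 0.5 * 0.8 + 0.5 * 0.2 = 0.5, so the empirical
# frequency should be close to 0.5.
p_rain = sum(s['Rain'] for s in samples) / len(samples)
```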
Analysis
The probability that prior sampling generates a particular event is
Let 𝑁𝑃𝑆 (𝑥1,.. 𝑥𝑛 ) denote the number of samples of an event. We define the probability
estimator
Using prior sampling, an estimate 𝑃̂(𝑥|𝑒) can be formed from the proportion of
samples 𝑥 agreeing with the evidence 𝑒 among all samples agreeing with the
evidence.
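A sketch of rejection sampling on a small assumed sprinkler-style network (all numbers are illustrative): samples that disagree with the evidence are simply discarded, and the posterior is the fraction of surviving samples in which the query holds.

```python
import random

def rejection_sample_posterior(n, rng):
    """Estimate P(Rain | Sprinkler = true) in a toy network by discarding
    samples that disagree with the evidence."""
    agree, agree_and_rain = 0, 0
    for _ in range(n):
        cloudy = rng.random() < 0.5
        sprinkler = rng.random() < (0.1 if cloudy else 0.5)
        rain = rng.random() < (0.8 if cloudy else 0.2)
        if sprinkler:                  # keep only samples matching e
            agree += 1
            agree_and_rain += rain
    return agree_and_rain / agree

rng = random.Random(1)
# Exact answer for these numbers: P(R, S) / P(S) = 0.09 / 0.30 = 0.3.
estimate = rejection_sample_posterior(20000, rng)
```

Note the inefficiency the text goes on to address: with P(Sprinkler) = 0.3, roughly 70% of the generated samples are thrown away, which is what likelihood weighting avoids.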
Analysis
Let us consider the posterior probability estimator 𝑃̂(𝑋|𝑒) formed by rejection sampling:
Analysis
The sampling probability for an event with likelihood weighting, multiplied by its weight,
recovers the true joint probability:
𝑆𝑊𝑆 (𝑥, 𝑒) 𝜔(𝑥, 𝑒) = 𝑃(𝑥, 𝑒)
The estimated posterior probability is computed as follows:
𝑃̂(𝑥, 𝑒) = 𝛼𝑁𝑊𝑆 (𝑥, 𝑒)𝜔(𝑥, 𝑒)
prior 𝑃(𝐹𝑖𝑟𝑒) and the conditional probability 𝑃(𝑆𝑚𝑜𝑘𝑒|𝐹𝑖𝑟𝑒) in order to specify the
joint distribution.
However, this distribution can be represented equally well by the reverse arrow 𝐹𝑖𝑟𝑒 ←
𝑆𝑚𝑜𝑘𝑒, using the appropriate 𝑃(𝑆𝑚𝑜𝑘𝑒) and 𝑃(𝐹𝑖𝑟𝑒|𝑆𝑚𝑜𝑘𝑒) computed from Bayes’
rule. The idea that these two networks are equivalent, hence convey the same
information, evokes discomfort and even resistance in most people.
Causal Bayesian networks, sometimes called Causal Diagrams, were devised to permit
us to represent causal asymmetries and to leverage the asymmetries towards reasoning
with causal information.
If nature assigns a value to 𝑆𝑚𝑜𝑘𝑒 on the basis of what nature learns about 𝐹𝑖𝑟𝑒, we
draw an arrow from 𝐹𝑖𝑟𝑒 to 𝑆𝑚𝑜𝑘𝑒.
More importantly, if we judge that nature assigns 𝐹𝑖𝑟𝑒 a truth value that depends on
other variables, not 𝑆𝑚𝑜𝑘𝑒, we refrain from drawing the arrow 𝐹𝑖𝑟𝑒 ← 𝑆𝑚𝑜𝑘𝑒.
In other words, the value 𝑥𝑖 of each variable𝑋𝑖 is determined by an equation 𝑥𝑖 =
𝑓𝑖 (𝑂𝑡ℎ𝑒𝑟𝑉𝑎𝑟𝑖𝑎𝑏𝑙𝑒𝑠), and an arrow 𝑋𝑗 → 𝑋𝑖 is drawn if and only if 𝑋𝑗 is one of the
arguments of 𝑓𝑖 .
The equation 𝑥𝑖 = 𝑓𝑖 (. ) is called a structural equation.
We simply delete the arrow Rain→ 𝑊𝑒𝑡𝐺𝑟𝑎𝑠𝑠.
where we have abbreviated each variable name by its first letter. As a system of
structural equations, the model looks like this:
where, without loss of generality, each 𝑓𝑖 can be the identity function. The U-variables in
these equations represent unmodeled variables, also called error terms or disturbances,
that perturb the functional relationship between each variable and its parents.
Suppose we turn the sprinkler on; that is, we (who are, by definition, not part of the
causal processes described by the model) intervene to impose the condition
Sprinkler = true. In the notation of the do-calculus, which is a key part of the theory of
causal networks, this is written as 𝑑𝑜(𝑆𝑝𝑟𝑖𝑛𝑘𝑙𝑒𝑟 = 𝑡𝑟𝑢𝑒). Once done, the
sprinkler variable is no longer dependent on whether it is a cloudy day. We therefore
delete the equation 𝑆 = 𝑓𝑠 (𝐶, 𝑈𝑠 ) from the system of structural equations and replace it
with 𝑆 = 𝑡𝑟𝑢𝑒, giving us
From these equations, we obtain the new joint distribution for the remaining variables
conditioned on 𝑑𝑜(𝑆𝑝𝑟𝑖𝑛𝑘𝑙𝑒𝑟 = 𝑡𝑟𝑢𝑒):
The probability terms in the sum are obtained by computation on the original network,
by any of the standard inference algorithms. This equation is known as an adjustment
formula.
2.11.3 The back-door criterion
The ability to predict the effect of any intervention is a remarkable result, but it does
require accurate knowledge of the necessary conditional distributions in the model,
particularly 𝑃(𝑥𝑗 |𝑝𝑎𝑟𝑒𝑛𝑡𝑠(𝑋𝑗 )).
For example, we know that “genetic factors” play a role in obesity, but we do not know
which genes play a role or the precise nature of their effects.
The specific reason this is problematic in this instance is that we would like to predict
the effect of turning on the sprinkler on a downstream variable such as 𝐺𝑟𝑒𝑒𝑛𝑒𝑟𝐺𝑟𝑎𝑠𝑠,
but the adjustment formula must take into account not only the direct route from
𝑆𝑝𝑟𝑖𝑛𝑘𝑙𝑒𝑟, but also the “back door” route via 𝐶𝑙𝑜𝑢𝑑𝑦 𝑎𝑛𝑑 𝑅𝑎𝑖𝑛.
If we knew the value of 𝑅𝑎𝑖𝑛, this back-door path would be blocked—which suggests
that there might be a way to write an adjustment formula that conditions on Rain instead
of 𝐶𝑙𝑜𝑢𝑑𝑦. And indeed this is possible:
In general, if we wish to find the effect of 𝑑𝑜(𝑋𝑗 = 𝑥𝑗𝑘 ) on a variable 𝑋𝑖 , the back-door
criterion allows us to write an adjustment formula that conditions on any set of variables
Z that closes the back door.
2 marks
1. Why does uncertainty arise?
Agents almost never have access to the whole truth about their environment.
Agents cannot always find a categorical answer.
Uncertainty can also arise because of incompleteness or incorrectness in the agent's
understanding of the properties of the environment.
2. State the reasons why first-order logic fails to cope with a domain like medical
diagnosis.
Three reasons:
a. Laziness: it is too much work to list the complete set of antecedents or consequents needed
to ensure an exceptionless rule.
b. Theoretical ignorance: medical science has no complete theory for the domain.
c. Practical ignorance: even if we know all the rules, we may be uncertain about a particular
case because not all the necessary information is available.
3.What is the need for probability theory in uncertainty ?
Probability provides the way of summarizing the uncertainty that comes from our
laziness and ignorance .
Probability statements do not have quite the same kind of semantics as logical
statements; they are made with respect to the evidence available to the agent.
It is important to remember that P(a) can only be used when there is no other
information.
ML (maximum likelihood) – it is a reasonable approach when there is no reason to prefer one
hypothesis over another a priori.
13. What are the methods for maximum likelihood parameter learning?
i. Write down an expression for the likelihood of the data as a function of the parameter.
ii. Write down the derivative of the log likelihood with respect to each parameter.
iii. Find the parameter values such that the derivatives are zero.
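The three steps can be carried out by hand for a Bernoulli parameter θ (a sketch; the coin-flip setting is an assumed illustration): the likelihood is L(θ) = θ^heads · (1−θ)^tails, the derivative of log L is heads/θ − tails/(1−θ), and setting it to zero gives θ* = heads / (heads + tails).

```python
def mle_bernoulli(heads, tails):
    """Maximum-likelihood estimate for a Bernoulli parameter:
    the closed form obtained by zeroing the log-likelihood derivative."""
    return heads / (heads + tails)

# 7 successes out of 10 trials -> theta* = 0.7.
theta = mle_bernoulli(7, 3)
```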
In this model, the “class” variable C is the root and the “attribute” variables 𝑋𝑖 are the
leaves.
This model assumes that the attributes are conditionally independent of each other,
given the class.
PART-B
1. Explain Approximate inference in detail
2. Explain the Naïve Bayes model
3. What is exact inference, explain briefly
4. Write in detail about causal networks.
5. Harry installed a new burglar alarm at his home to detect burglary. Calculate the
probability that the alarm has sounded, but neither a burglary nor an earthquake has
occurred, and both John and Mary called Harry. (John and Mary are neighbours.)
UNIT III
3 SUPERVISED LEARNING
Syllabus
Introduction to machine learning - Linear Regression Models: Least squares, single &
multiple variables, Bayesian linear regression, gradient descent - Linear Classification
Models: Discriminant function - Probabilistic discriminative model - Logistic
regression, Probabilistic generative model - Naïve Bayes, Maximum margin classifier -
Support Vector Machine, Decision Tree, Random Forests
• Machine Learning (ML) is a sub-field of Artificial Intelligence (Al) which concerns with
developing computational theories of learning and building learning machines.
• Learning is a phenomenon and process which has manifestations of various aspects. The
learning process includes gaining new symbolic knowledge and developing cognitive skills
through instruction and practice. It is also the discovery of new facts and theories through
observation and experiment.
• Machine Learning Definition: A computer program is said to learn from experience E with
respect to some class of tasks T and performance measure P, if its performance at tasks in T,
as measured by P, improves with experience E.
• The goal of machine learning is to build computer systems that can adapt and learn from their
experience.
Examples:
i) Handwriting recognition learning problem
• Task T: Recognising and classifying handwritten words within images
• Performance P: Percent of words correctly classified
2.Validation: The rules are checked and, if necessary, additional training is given.
7. In artificial intelligence, it is used to teach a system to learn and adapt to changes so that the
system designer need not foresee and provide solutions for all possible situations.
8. It is used to find solutions to many problems in vision, speech recognition, and robotics.
How machines learn
Basic components of learning process
The learning process, whether by a human or a machine, can be divided into four components,
namely, data storage, abstraction, generalization and evaluation. Figure 3.2 illustrates the various
components and the steps involved in the learning process.
Example
Consider the following data regarding patients entering a clinic. The data consists of the
gender and age of the patients and each patient is labeled as “healthy” or “sick”.
2. Unsupervised learning
❖ Unsupervised learning is a type of machine learning algorithm used to draw
inferences from datasets consisting of input data without labeled responses.
❖ In unsupervised learning algorithms, a classification or categorization is not included
in the observations.
❖ The most common unsupervised learning method is cluster analysis, which is used for
exploratory data analysis to find hidden patterns or grouping in data.
Example
Consider the following data regarding patients entering a clinic. The data consists of the
gender and age of the patients.
3 .Semi-Supervised Learning
For example, documents can be crawled from the Web, images can be obtained from surveillance
cameras, and speech can be collected from broadcasts.
4 Reinforcement learning
❖ This is somewhere between supervised and unsupervised learning.
❖ The user gets immediate feedback in supervised learning and no feedback in
unsupervised learning, but in reinforcement learning the feedback is a delayed
scalar reward.
• Regression finds how the value of the dependent variable is changing according to the value
of the independent variable.
• The model provides a sloped straight line representing the relationship between the variables.
• The mathematical equation of simple linear regression is given below:
Yi = β0 + β1 Xi + εi
Here,
o Yi is the predicted output for instance i and is the dependent or explained variable.
o β0 is the intercept of the line, while β1 is the slope or scaling factor for the input.
o Xi is the independent variable (also called the explanatory variable, predictor or
feature) that governs the learning process.
o εi is the error component.
3.2.1 Least Squares Regression
o “A least-squares regression method is a form of statistical regression analysis that
establishes the relationship between the dependent (Y) and independent variable (X)
through a straight line, referred to as the line of best fit”.
o The least squares method is a statistical procedure to find the best fit for a set of data
points by minimizing the sum of the offsets or residuals of points from the plotted
curve.
o Least squares regression is used to predict the behavior of dependent variables. This
method of regression analysis begins with a set of data points to be plotted on an x-
and y-axis graph.
o If the data shows a linear relationship between two variables, the line that best fits
this linear relationship is known as a least-squares regression line, which minimizes
the vertical distance from the data points to the regression line.
o The term "least squares" indicates the smallest sum of squares of errors otherwise
known as variance.
o The least-squares method is often applied in data fitting.
There are two basic categories of least-squares problems:
▪ Ordinary or linear least squares: used in statistical regression analysis
▪ Nonlinear least squares: iterative method to approximate the model to a linear model with
each iteration.
Advantages
o The least-squares method of regression analysis is best suited for prediction models
and trend analysis.
o It is best used in the fields of economics, finance, and stock markets wherein the
value of any future variable is predicted with the help of existing variables and the
relationship between the same.
o The least-squares method provides the closest relationship between the variables.
o The difference between the sums of squares of residuals to the line of best fit is
minimal under this method.
o The computation mechanism is simple and easy to apply.
3.7
CS3491-ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING
Disadvantages
o This method relies on establishing the closest relationship between a given set of variables.
o The computation mechanism is sensitive to the data; if there are any outliers, the results may be affected severely.
o More exhaustive computation mechanisms are needed for nonlinear problems.
Least Squares Algorithm
❖ For each (x, y) point, calculate x² and xy.
❖ Sum all x, y, x² and xy, which gives Σx, Σy, Σx² and Σxy.
❖ Calculate the slope b: b = (Σxy − (Σx·Σy)/n) / (Σx² − (Σx)²/n), where n is the number of points.
❖ Calculate the intercept a: a = (Σy − b·Σx)/n.
Example 1: The below table gives the number of hours of rainfall in Chennai and the number of French fries sold in a canteen on each day from Monday to Friday. Predict the number of French fries to be prepared on Saturday if a rainfall of 8 hours is expected.
Hours of rain No.of French fries sold
2 4
3 5
5 7
7 10
9 15
Solution:
1. For each (x, y) point, calculate x² and xy.
2. Find Σx, Σy, Σx² and Σxy.
x y x² xy
2 4 4 8
3 5 9 15
5 7 25 35
7 10 49 70
9 15 81 135
Σx=26 Σy=41 Σx²=168 Σxy=263
3. Find the slope b:
b = (263 − (26×41)/5) / (168 − (26×26)/5) = 49.8/32.8 = 1.5182
4. Calculate the intercept a:
a = (41 − (1.5182×26))/5 = 0.3049
5. Form the equation: Y = 1.5182x + 0.3049
Compute the error
x y Y=1.5182x+0.3049 Error (Y−y)
2 4 3.3413 -0.6587
3 5 4.8595 -0.1405
5 7 7.8959 0.8959
7 10 10.9323 0.9323
9 15 13.9687 -1.0313
Prediction: for x = 8 hours of rain, Y = 1.5182(8) + 0.3049 ≈ 12.45, so about 12 to 13 French fries should be prepared on Saturday.
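The worked example above can be checked with a short Python sketch of the least-squares formulas; the data and variable names follow the table, and the prediction for x = 8 answers the original question:

```python
# Least-squares fit for the rainfall / French-fries data above.
points = [(2, 4), (3, 5), (5, 7), (7, 10), (9, 15)]
n = len(points)

sum_x = sum(x for x, _ in points)        # 26
sum_y = sum(y for _, y in points)        # 41
sum_x2 = sum(x * x for x, _ in points)   # 168
sum_xy = sum(x * y for x, y in points)   # 263

# Slope and intercept from the least-squares formulas.
b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
a = (sum_y - b * sum_x) / n

print(round(b, 4), round(a, 4))   # 1.5183 0.3049 (the text truncates the slope to 1.5182)
print(round(a + b * 8, 2))        # 12.45 predicted fries for 8 hours of rain
```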
• Simple (single) linear regression performs regression analysis on two variables; the single independent variable determines the slope of the regression line.
• Multiple regression is a broader class of regressions that encompasses linear and nonlinear regressions
with multiple explanatory variables.
• Each independent variable in multiple regression has its own coefficient to ensure each variable is weighted appropriately, which allows complex connections between variables to be established.
• Two main operations are done in multiple-variable regression:
i) Determine the dependent variable based on multiple independent variables.
ii) Determine the strength of the relationship between each independent variable and the dependent variable.
❖ Multiple regression assumes there is not a strong relationship between each independent
variable.
❖ It also assumes there is a correlation between each independent variable and the single
dependent variable.
❖ Each of these relationships is weighted to ensure more impactful independent variables
drive the dependent value by adding a unique regression coefficient to each independent
variable.
❖ Using multiple variables for regression is a more specific calculation than simple linear regression; more complex relationships can be captured through multiple linear regression.
❖ All the multiple variables use multiple slopes to predict the outcome of a single target variable: Y = a + b1x1 + b2x2 + ... + bnxn
❖ In the above equation, b1, b2, ..., bn are the slopes for the individual variables x1, x2, ..., xn.
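As a minimal sketch of the multi-variable equation Y = a + b1x1 + b2x2 (two features only; the data points are invented so that the true coefficients are known), the coefficients can be recovered by solving the normal equations with plain Python:

```python
# Fitting Y = a + b1*x1 + b2*x2 by solving the normal equations (X^T X) beta = X^T y.
# Toy data generated from the true coefficients a=1, b1=2, b2=3.
data = [((1, 1), 6), ((2, 1), 8), ((1, 3), 12), ((3, 2), 13)]  # ((x1, x2), y)

X = [[1, x1, x2] for (x1, x2), _ in data]   # design matrix with an intercept column
y = [t for _, t in data]

# Normal-equation matrix A = X^T X and vector v = X^T y.
A = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
v = [sum(r[i] * t for r, t in zip(X, y)) for i in range(3)]

# Solve A * beta = v by Gaussian elimination with partial pivoting.
M = [row[:] + [v[i]] for i, row in enumerate(A)]
for col in range(3):
    pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
    M[col], M[pivot] = M[pivot], M[col]
    for r in range(col + 1, 3):
        f = M[r][col] / M[col][col]
        M[r] = [a_ - f * b_ for a_, b_ in zip(M[r], M[col])]
beta = [0.0] * 3
for i in range(2, -1, -1):
    beta[i] = (M[i][3] - sum(M[i][j] * beta[j] for j in range(i + 1, 3))) / M[i][i]

print([round(b_, 6) for b_ in beta])  # [1.0, 2.0, 3.0]
```

Each recovered entry of beta is one of the slopes (or the intercept) in the equation above; with more features the same construction applies with a larger matrix.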
➢ Bayesian regression is an approach to defining and estimating statistical models. Bayesian regression
can be very useful when we have insufficient data in the dataset or the data is poorly distributed.
➢ The output of a Bayesian Regression model is obtained from a probability distribution, as
compared to regular regression techniques where the output is just obtained from a single value
of each attribute
➢ In order to explain Naive Bayes we need to first explain Bayes' theorem. The foundation of Bayes' theorem is conditional probability (figure 1); in fact, Bayes' theorem is just an alternate, or reverse, way to calculate a conditional probability. When the joint probability P(A∩B) is hard to calculate, or when the inverse probability P(B|A) is easier to calculate, Bayes' theorem can be applied.
Here’s a simple example that will be relevant given all the New Year’s resolutions: Globo Gym wants to predict whether a member will attend the gym given the weather conditions.
We have data where each row represents member attendance to Globo Gym given the weather.
So observation 3 is a member that attended the gym when it was cloudy outside.
weather attended
0 sunny yes
1 rainy no
2 snowy no
3 cloudy yes
4 cloudy no
The below equation shows our question put into Bayes' theorem notation:
P(yes | sunny) = P(sunny | yes) · P(yes) / P(sunny)
Likelihood: P(sunny | yes) = 3/8 or 0.375 (total sunny AND yes divided by total yes)
Prior: P(yes) = 0.533 (the overall attendance rate); Evidence: P(sunny) = 0.267
Posterior: P(yes | sunny) = 0.375 · 0.533 / 0.267 ≈ 0.75
It shows that a random member is 75% likely to attend the gym given that it is sunny. That's higher than the overall average attendance of 53%! On the opposite end of the spectrum, the probability of attending the gym when it is snowy out is only 25% (0.125 ⋅ 0.533 / 0.267).
Since this is a binary example (attend or not attend) and P(yes | sunny) = 0.75 or 75%, the inverse P(no | sunny) is 0.25 or 25%, since the probabilities have to sum to 1 or 100%. That's how to use Bayes' theorem to find the posterior probability for classification.
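The gym calculation above can be reproduced directly from the summary probabilities quoted in the text (the full member table is not shown here, only the rounded figures):

```python
# Posterior P(attend | weather) via Bayes' theorem, using the rounded
# probabilities quoted in the Globo Gym example.
p_yes = 0.533                # prior: overall attendance rate
likelihood_sunny = 0.375     # P(sunny | yes)
p_sunny = 0.267              # evidence: P(sunny)

posterior_sunny = likelihood_sunny * p_yes / p_sunny
print(round(posterior_sunny, 2))   # 0.75 -> 75% likely to attend when sunny

likelihood_snowy = 0.125     # P(snowy | yes)
p_snowy = 0.267              # P(snowy)
posterior_snowy = likelihood_snowy * p_yes / p_snowy
print(round(posterior_snowy, 2))   # 0.25
```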
3.2.4 Gradient Descent in Machine Learning
➢ Gradient Descent is known as one of the most commonly used optimization algorithms
to train machine learning models by means of minimizing errors between actual and
expected results. Further, gradient descent is also used to train Neural Networks.
➢ In mathematical terminology, Optimization algorithm refers to the task of
minimizing/maximizing an objective function f(x) parameterized by x. Similarly, in
machine learning, optimization is the task of minimizing the cost function
parameterized by the model's parameters.
➢ The main objective of gradient descent is to minimize the convex function using
iteration of parameter updates. Once these machine learning models are optimized,
these models can be used as powerful tools for Artificial Intelligence and various
computer science applications.
➢ In this section on gradient descent in machine learning, we will learn in detail about gradient descent, the role of cost functions (specifically as a barometer within machine learning), types of gradient descent, learning rates, etc.
The best way to define the local minimum or local maximum of a function using gradient descent is as
follows:
o If we move towards a negative gradient or away from the gradient of the function at the current
point, it will give the local minimum of that function.
o Whenever we move towards a positive gradient or towards the gradient of the function at the
current point, we will get the local maximum of that function.
Moving towards the gradient is known as gradient ascent; moving against it is gradient descent, also known as steepest descent. The main objective of using a gradient descent algorithm is to minimize the cost function using iteration. To achieve this goal, it performs two steps iteratively:
o Calculates the first-order derivative of the function to compute the gradient or slope of that
function.
o Moves away from the direction of the gradient, i.e. takes a step of size alpha times the gradient from the current point, where alpha is the learning rate. It is a tuning parameter in the optimization process which helps to decide the length of the steps.
Cost-function
The cost function is defined as the measurement of the difference, or error, between actual values and expected values at the current position, expressed as a single real number. It improves machine learning efficiency by providing feedback to the model so that it can minimize the error and find the local or global minimum.
Before starting the working principle of gradient descent, we should know some basic concepts to find
out the slope of a line from linear regression. The equation for simple linear regression is given as:
1. Y=mX+c
Where 'm' represents the slope of the line, and 'c' represents the intercepts on the y-axis.
The starting point(shown in above fig.) is used to evaluate the performance as it is considered just as an
arbitrary point. At this starting point, we will derive the first derivative or slope and then use a tangent
line to calculate the steepness of this slope. Further, this slope will inform the updates to the parameters
(weights and bias).
The slope is steeper at the starting (arbitrary) point, but as new parameters are generated the steepness gradually reduces; at the lowest point it approaches zero, and that point is called the point of convergence.
These two factors, the direction and the step size, determine the partial-derivative calculation of future iterations and carry the search to the point of convergence, a local minimum or a global minimum.
Learning rate: It is defined as the step size taken to reach the minimum or lowest point. This is typically a small value that is evaluated and updated based on the behavior of the cost function. If the learning rate is high, it results in larger steps but also risks overshooting the minimum. A low learning rate gives small step sizes, which compromises overall efficiency but gives the advantage of more precision.
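The update rule described above can be sketched for the simple line Y = mX + c. The data below is invented (drawn exactly from the line y = 2x + 1), and the learning rate and iteration count are arbitrary choices:

```python
# Batch gradient descent on mean-squared error for y = m*x + c.
# Toy data generated from the true line y = 2x + 1.
data = [(0, 1), (1, 3), (2, 5), (3, 7)]
m, c = 0.0, 0.0
alpha = 0.05                      # learning rate: size of each step
n = len(data)

for _ in range(2000):
    # Gradients of MSE = (1/n) * sum((m*x + c - y)^2) with respect to m and c.
    grad_m = (2 / n) * sum((m * x + c - y) * x for x, y in data)
    grad_c = (2 / n) * sum((m * x + c - y) for x, y in data)
    # Step *against* the gradient (steepest descent).
    m -= alpha * grad_m
    c -= alpha * grad_c

print(round(m, 3), round(c, 3))   # converges close to 2 and 1
```

A larger alpha would reach the minimum in fewer steps but risks overshooting; a smaller one converges more slowly but more precisely, exactly as described above.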
Based on the error in various training models, the Gradient Descent learning algorithm can be divided
into Batch gradient descent, stochastic gradient descent, and mini-batch gradient descent. Let's
understand these different types of gradient descent:
Batch gradient descent (BGD) computes the error for each point in the training set and updates the model only after evaluating all training examples; one such pass is known as a training epoch. In simple words, it sums over all examples for each update.
Stochastic gradient descent (SGD) is a type of gradient descent that processes one training example per iteration: the model parameters are updated after each individual example within the dataset.
As it requires only one training example at a time, it is easier to store in allocated memory. However, it loses some computational efficiency in comparison to batch gradient descent, since its frequent updates make the gradient noisy. Sometimes, however, that noise can be helpful in finding the global minimum and escaping local minima.
In Stochastic gradient descent (SGD), learning happens on every example, and it consists of a few
advantages over other gradient descent.
Mini Batch gradient descent is the combination of both batch gradient descent and stochastic
gradient descent. It divides the training datasets into small batch sizes then performs the updates on
those batches separately.
Splitting the training dataset into smaller batches strikes a balance between the computational efficiency of batch gradient descent and the speed of stochastic gradient descent. Hence, we achieve a gradient descent with higher computational efficiency and a less noisy gradient.
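The three variants differ only in how many examples feed each update. A minimal sketch of the mini-batch loop (toy data invented from the line y = 2x + 1; the batch size of 2 is an arbitrary choice):

```python
import random

# Mini-batch gradient descent: shuffle the data each epoch, then update the
# parameters once per small batch, instead of once per epoch (batch GD)
# or once per example (SGD).
random.seed(0)
data = [(0, 1), (1, 3), (2, 5), (3, 7)]   # toy points from y = 2x + 1
m, c, alpha, batch_size = 0.0, 0.0, 0.05, 2

for _ in range(2000):                      # training epochs
    random.shuffle(data)
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        k = len(batch)
        grad_m = (2 / k) * sum((m * x + c - y) * x for x, y in batch)
        grad_c = (2 / k) * sum((m * x + c - y) for x, y in batch)
        m -= alpha * grad_m
        c -= alpha * grad_c

print(round(m, 2), round(c, 2))   # approximately 2 and 1
```

Setting batch_size to the full dataset recovers batch gradient descent, and setting it to 1 recovers SGD, which is why mini-batch is described as a combination of the two.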
Although we know Gradient Descent is one of the most popular methods for optimization
problems, it still also has some challenges. There are a few challenges as follows:
For convex problems, gradient descent can find the global minimum easily, while for non-
convex problems, it is sometimes difficult to find the global minimum, where the machine learning
models achieve the best results.
Whenever the slope of the cost function is at or close to zero, the model stops learning. Apart from the global minimum, some other scenarios can show this slope: saddle points and local minima. A local minimum has a shape similar to the global minimum, where the slope of the cost function increases on both sides of the current point.
In contrast, at a saddle point the negative gradient occurs on only one side of the point: the point behaves as a local maximum along one direction and a local minimum along the other. A saddle point takes its name from a horse's saddle.
The name "local minimum" is used because the value of the loss function is minimum at that point within a local region. In contrast, the name "global minimum" is used because the value of the loss function is minimum there globally, across the entire domain of the loss function.
In a deep neural network, if the model is trained with gradient descent and backpropagation,
there can occur two more issues other than local minima and saddle point.
Vanishing Gradients:
Vanishing gradient occurs when the gradient becomes smaller than expected. During backpropagation, the gradient shrinks as it flows backwards, causing the earlier layers of the network to learn more slowly than the later layers. When this happens, the weight updates of the earlier layers become insignificant.
Exploding Gradient:
Exploding gradient is the opposite of the vanishing gradient: it occurs when the gradient is too large and creates an unstable model. In this scenario, the model weights grow too large and may end up represented as NaN. This problem can be addressed using dimensionality reduction techniques, which help to minimize complexity within the model.
Example:
➢ Suppose we have two sets of data points belonging to two different classes that we
want to classify. As shown in the given 2D graph, when the data points are plotted
on the 2D plane, there’s no straight line that can separate the two classes of the data
points completely. Hence, in this case, LDA (Linear Discriminant Analysis) is used
which reduces the 2D graph into a 1D graph in order to maximize the separability
between the two classes.
➢ Here, Linear Discriminant Analysis uses both the axes (X and Y) to create a new
axis and projects data onto a new axis in a way to maximize the separation of the
two categories and hence, reducing the 2D graph into a 1D graph.
➢ Two criteria are used by LDA to create a new axis:
• Maximize the distance between means of the two classes.
• Minimize the variation within each class.
➢ In the above graph, it can be seen that a new axis (in red) is generated and plotted
in the 2D graph such that it maximizes the distance between the means of the two
classes and minimizes the variation within each class.
➢ In simple terms, this newly generated axis increases the separation between the data
points of the two classes. After generating this new axis using the above-mentioned
criteria, all the data points of the classes are plotted on this new axis and are shown
in the figure given below.
➢ But Linear Discriminant Analysis fails when the means of the distributions are shared, as it becomes impossible for LDA to find a new axis that makes both classes linearly separable. In such cases, we use non-linear discriminant analysis.
➢ Probabilistic LDA (PLDA) is a generative model which assumes that the given data samples are generated from a distribution; we need to find the parameters of the model which best describe the training data.
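The two LDA criteria above (maximize the distance between class means, minimize within-class variation) have a standard closed form for two classes, Fisher's discriminant: project onto w = Sw⁻¹(m1 − m2), where Sw is the within-class scatter. A minimal sketch with invented toy points:

```python
# Fisher's linear discriminant for two 2-D classes:
# project onto w = Sw^{-1} (m1 - m2), where Sw is the within-class scatter.
A = [(1, 2), (2, 1), (1, 1)]      # class 1 (toy data)
B = [(5, 6), (6, 5), (6, 6)]      # class 2 (toy data)

def mean(pts):
    n = len(pts)
    return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)

def scatter(pts, m):
    # Sum of outer products of deviations from the class mean.
    sxx = sum((p[0] - m[0]) ** 2 for p in pts)
    syy = sum((p[1] - m[1]) ** 2 for p in pts)
    sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in pts)
    return [[sxx, sxy], [sxy, syy]]

mA, mB = mean(A), mean(B)
SA, SB = scatter(A, mA), scatter(B, mB)
Sw = [[SA[i][j] + SB[i][j] for j in range(2)] for i in range(2)]

# Invert the 2x2 within-class scatter matrix.
det = Sw[0][0] * Sw[1][1] - Sw[0][1] * Sw[1][0]
inv = [[Sw[1][1] / det, -Sw[0][1] / det], [-Sw[1][0] / det, Sw[0][0] / det]]

d = (mA[0] - mB[0], mA[1] - mB[1])
w = (inv[0][0] * d[0] + inv[0][1] * d[1], inv[1][0] * d[0] + inv[1][1] * d[1])

# Project every point onto the new 1-D axis.
projA = [w[0] * x + w[1] * y for x, y in A]
projB = [w[0] * x + w[1] * y for x, y in B]
print(min(projA) > max(projB))    # True: the classes separate on the new axis
```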
➢ Discriminative models are a class of models mainly used for supervised machine learning. These types of models are also known as conditional models since they learn the boundaries between classes or labels in a dataset.
➢ Discriminative models focus on modeling the decision boundary between classes in a
classification problem. The goal is to learn a function that maps inputs to binary outputs,
indicating the class label of the input. Maximum likelihood estimation is often used to
estimate the parameters of the discriminative model, such as the coefficients of a logistic
regression model or the weights of a neural network.
Some examples of discriminative models are:
• Logistic regression
• Support vector machines(SVMs)
• Traditional neural networks
• Nearest neighbor
• Conditional Random Fields (CRFs)
• Decision Trees and Random Forest
➢ Logistic regression is one of the most popular Machine Learning algorithms, which
comes under the Supervised Learning technique. It is used for predicting the
categorical dependent variable using a given set of independent variables.
➢ Logistic regression predicts the output of a categorical dependent variable.
Therefore the outcome must be a categorical or discrete value. It can be either Yes
or No, 0 or 1, true or False, etc. but instead of giving the exact value as 0 and 1, it
gives the probabilistic values which lie between 0 and 1.
➢ Logistic Regression is very similar to Linear Regression except in how they are used: Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems.
➢ In Logistic regression, instead of fitting a regression line, we fit an "S" shaped
logistic function, which predicts two maximum values (0 or 1).
➢ The curve from the logistic function indicates the likelihood of something such as
whether the cells are cancerous or not, a mouse is obese or not based on its weight,
etc.
➢ Logistic Regression is a significant machine learning algorithm because it has the
ability to provide probabilities and classify new data using continuous and discrete
datasets.
➢ Logistic Regression can be used to classify the observations using different types
of data and can easily determine the most effective variables used for the
classification. The below image is showing the logistic function:
➢ The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
➢ It maps any real value into another value within a range of 0 and 1.
➢ The value produced by logistic regression must be between 0 and 1 and cannot go beyond this limit, so it forms a curve like the "S" form. This S-form curve is called the sigmoid function or the logistic function.
➢ In logistic regression, we use the concept of a threshold value, which defines the probability of either 0 or 1: values above the threshold tend to 1, and values below the threshold tend to 0.
➢ The Logistic regression equation can be obtained from the Linear Regression equation.
The mathematical steps to get Logistic Regression equations are given below:
• We know the equation of the straight line can be written as:
y = b0 + b1x1 + b2x2 + ... + bnxn
• In logistic regression y can be between 0 and 1 only, so let's divide the above equation by (1-y):
y/(1-y); 0 for y = 0, and infinity for y = 1
• But we need a range between -[infinity] and +[infinity]; taking the logarithm of the equation, it becomes:
log[y/(1-y)] = b0 + b1x1 + b2x2 + ... + bnxn
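The mapping between the straight-line output and a probability, and back, can be sketched as follows (the 0.5 threshold is the usual default choice):

```python
import math

def sigmoid(t):
    # Maps any real value into (0, 1): the "S"-shaped logistic function.
    return 1 / (1 + math.exp(-t))

def logit(p):
    # Inverse of the sigmoid: the log-odds log[y / (1 - y)].
    return math.log(p / (1 - p))

print(sigmoid(0))                     # 0.5 exactly at t = 0
print(round(logit(sigmoid(2.0)), 6))  # recovers 2.0: logit inverts sigmoid

# Thresholding turns the probability into a class label.
predict = lambda t, threshold=0.5: 1 if sigmoid(t) >= threshold else 0
print(predict(-3.2), predict(1.7))    # 0 1
```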
➢ On the basis of the categories, Logistic Regression can be classified into three types:
• Binomial: In binomial Logistic regression, there can be only two possible types of
dependent variables, such as 0 or 1, Pass or Fail, etc.
• Multinomial: In multinomial Logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as "cat", "dogs", or "sheep"
• Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered
types of dependent variables, such as "low", "Medium", or "High".
Generative models are a class of statistical models that can generate new data instances. These models are used in unsupervised machine learning to perform tasks such as probability and likelihood estimation, modeling data points, and distinguishing between classes based on these probabilities.
Since these models often rely on Bayes' theorem to find the joint probability, generative models can tackle more complex tasks than analogous discriminative models.
These models use probability estimates and likelihood to model data points and differentiate
between different class labels present in a dataset. Unlike discriminative models, these models
can also generate new data points.
However, they also have a major drawback: if there are outliers in the dataset, these types of models are affected to a significant extent.
Some examples of generative models are:
• Naïve Bayes
• Bayesian networks
• Markov random fields
• Hidden Markov Models (HMMs)
• Latent Dirichlet Allocation (LDA)
• Generative Adversarial Networks (GANs)
• Autoregressive Model
➢ Some popular applications of the Naïve Bayes algorithm are spam filtering, sentiment analysis, and classifying articles.
➢ The Naïve Bayes algorithm is a combination of two words, Naïve and Bayes, which can be described as:
• Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple; each feature individually contributes to identifying it as an apple, without depending on the others.
• Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
Bayes' Theorem:
➢ Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine
the probability of a hypothesis with prior knowledge. It depends on the conditional
probability.
➢ The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) · P(A) / P(B)
Where,
P(A|B) is the posterior probability of hypothesis A given the observed event B; P(B|A) is the likelihood of the evidence given that the hypothesis is true; P(A) is the prior probability of the hypothesis; and P(B) is the marginal probability of the evidence.
➢ Working of Naïve Bayes' Classifier can be understood with the help of the below
example:
➢ Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using this dataset, we need to decide whether we should play on a particular day according to the weather conditions. To solve this problem, we need to follow the below steps:
✓ Convert the given dataset into frequency tables.
✓ Generate Likelihood table by finding the probabilities of given features.
✓ Now, use Bayes theorem to calculate the posterior probability.
➢ Problem: If the weather is sunny, then the Player should play or not?
Solution: To solve this, first consider the below dataset:
P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.30
P(Yes) = 0.71
P(Sunny) = 0.35
So P(Yes|Sunny) = 0.30 * 0.71 / 0.35 = 0.60
P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.5
P(No) = 0.29
P(Sunny) = 0.35
So P(No|Sunny) = 0.5 * 0.29 / 0.35 = 0.41
As P(Yes|Sunny) > P(No|Sunny), the player can play on a sunny day.
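Working the same comparison with exact fractions (the counts are reconstructed from the probabilities quoted above: 14 days total, 10 "Yes" and 4 "No", with 5 sunny days split 3 "Yes" / 2 "No"):

```python
from fractions import Fraction as F

# Counts reconstructed from the probabilities quoted above:
# 14 days total, 10 "Yes" and 4 "No"; 5 sunny days, 3 with "Yes", 2 with "No".
p_yes, p_no = F(10, 14), F(4, 14)
p_sunny = F(5, 14)
p_sunny_given_yes = F(3, 10)
p_sunny_given_no = F(2, 4)

post_yes = p_sunny_given_yes * p_yes / p_sunny   # P(Yes | Sunny)
post_no = p_sunny_given_no * p_no / p_sunny      # P(No | Sunny)

print(post_yes, post_no)          # 3/5 2/5
print(post_yes > post_no)         # True -> play on a sunny day
```

Note the two posteriors sum to 1 here only because "Sunny" was counted consistently; in general, the class with the larger posterior wins regardless of normalization.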
Advantages
➢ Naïve Bayes is one of the fast and easy ML algorithms to predict the class of a dataset.
➢ It can be used for Binary as well as Multi-class Classifications.
➢ It performs well in Multi-class predictions as compared to the other Algorithms.
➢ It is the most popular choice for text classification problems.
Disadvantages
➢ Naive Bayes assumes that all features are independent or unrelated, so it cannot learn the relationship between features.
There are three types of Naive Bayes Model, which are given below:
o Gaussian: The Gaussian model assumes that features follow a normal distribution. This
means if predictors take continuous values instead of discrete, then the model assumes
that these values are sampled from the Gaussian distribution.
o Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomially distributed. It is primarily used for document classification problems, i.e., deciding to which category a particular document belongs, such as sports, politics, education, etc.
The classifier uses the frequency of words for the predictors.
o Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor variables are independent Boolean variables, such as whether a particular word is present or not in a document. This model is also famous for document classification tasks.
Example:
SVM can be understood with the example that we used in the KNN classifier. Suppose we see a strange cat that also has some features of dogs; if we want a model that can accurately identify whether it is a cat or a dog, such a model can be created using the SVM algorithm. We first train our model with lots of images of cats and dogs so that it can learn their different features, and then we test it with this strange creature. The support vector machine creates a decision boundary between the two classes (cat and dog) and chooses the extreme cases (support vectors); on the basis of the support vectors, it will classify the creature as a cat. Consider the below diagram:
SVM algorithm can be used for Face detection, image classification, text categorization, etc.
Types of SVM
SVM can be of two types:
Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be classified into two classes by a single straight line, the data is termed linearly separable, and the classifier used is called a Linear SVM classifier.
Non-linear SVM: Non-linear SVM is used for non-linearly separable data. If a dataset cannot be classified by a straight line, the data is termed non-linear, and the classifier used is called a Non-linear SVM classifier.
Since this is a 2-D space, we can easily separate these two classes just by using a straight line. But there can be multiple lines that can separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary is called a hyperplane. The SVM algorithm finds the points of both classes that lie closest to the line; these points are called support vectors. The distance between the support vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The hyperplane with the maximum margin is called the optimal hyperplane.
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear
data, we cannot draw a single straight line. Consider the below image:
So, to separate these data points, we need to add one more dimension. For linear data, we
have used two dimensions x and y, so for non-linear data, we will add a third dimension z. It
can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way. Consider the below
image:
Since we are in 3-D space, the boundary looks like a plane parallel to the x-axis. If we convert it back to 2-D space with z = 1, it becomes:
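The z = x² + y² trick can be sketched directly with toy ring-shaped data (invented for illustration): points that no straight line separates in the (x, y) plane become separable by a single threshold on z:

```python
# Two classes arranged in rings: no straight line in (x, y) separates them,
# but the extra feature z = x^2 + y^2 does.
inner = [(0.5, 0.0), (0.0, -0.5), (-0.4, 0.3), (0.3, 0.4)]   # class 1
outer = [(3.0, 0.0), (0.0, 3.0), (-2.5, 1.5), (2.0, -2.5)]   # class 2

z = lambda p: p[0] ** 2 + p[1] ** 2

z_inner = [z(p) for p in inner]
z_outer = [z(p) for p in outer]

# In the lifted space, the plane z = 1 (a circle in the original plane)
# separates the two classes perfectly.
print(max(z_inner) < 1 < min(z_outer))   # True
```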
There are various algorithms in Machine learning, so choosing the best algorithm for the given dataset
and problem is the main point to remember while creating a machine learning model. Below are the
two reasons for using the Decision tree:
o Decision Trees usually mimic human thinking ability while making a decision, so it is easy to
understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like
structure.
Root Node: Root node is from where the decision tree starts. It represents the entire dataset, which
further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after
getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according
to the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other nodes are called
the child nodes.
In a decision tree, for predicting the class of the given dataset, the algorithm starts from the root node
of the tree. This algorithm compares the values of root attribute with the record (real dataset) attribute
and, based on the comparison, follows the branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves further. It continues this process until it reaches a leaf node of the tree. The complete process can be better understood using the below algorithm:
• Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
• Step-2: Find the best attribute in the dataset using Attribute Selection Measure (ASM).
• Step-3: Divide S into subsets that contain the possible values of the best attribute.
• Step-4: Generate the decision tree node which contains the best attribute.
• Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where the nodes cannot be classified further; the final nodes are called leaf nodes.
Example: Suppose there is a candidate who has a job offer and wants to decide whether he should
accept the offer or Not. So, to solve this problem, the decision tree starts with the root node (Salary
attribute by ASM). The root node splits further into the next decision node (distance from the office)
and one leaf node based on the corresponding labels. The next decision node further gets split into one
decision node (Cab facility) and one leaf node. Finally, the decision node splits into two leaf nodes
(Accepted offers and Declined offer). Consider the below diagram:
While implementing a decision tree, the main issue is how to select the best attribute for the root node and for the sub-nodes. To solve such problems there is a technique called the Attribute Selection Measure, or ASM. With this measurement, we can easily select the best attribute for the nodes of the tree. There are two popular techniques for ASM:
o Information Gain
o Gini Index
1. Information Gain:
o Information gain is the measurement of changes in entropy after the segmentation of a dataset
based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of information gain, and the node/attribute having the highest information gain is split first. It can be calculated using the below formula:
Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]
Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies the randomness in the data. Entropy can be calculated as:
Entropy(S) = −P(yes) log2 P(yes) − P(no) log2 P(no)
Where,
S = total number of samples, P(yes) = probability of yes, and P(no) = probability of no.
2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
o An attribute with the low Gini index should be preferred as compared to the high Gini index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create binary splits.
o Gini index can be calculated using the below formula:
Gini Index = 1 − Σj Pj²
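As a quick numeric sketch of the Gini index (using a 9 "yes" / 5 "no" class split for illustration):

```python
# Gini index of a node: 1 minus the sum of squared class probabilities.
def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(round(gini([9, 5]), 3))   # an impure node has a Gini index between 0 and 0.5
print(gini([4, 0]))             # 0.0: a pure node has zero impurity
```

CART prefers the split whose children have the lowest weighted Gini index, mirroring how ID3 prefers the highest information gain.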
“Pruning is a process of deleting the unnecessary nodes from a tree in order to get the optimal
decision tree”.
A too-large tree increases the risk of overfitting, while a small tree may not capture all the important features of the dataset. A technique that decreases the size of the learning tree without reducing accuracy is known as pruning. There are mainly two types of tree pruning techniques used: Cost Complexity Pruning and Reduced Error Pruning.
There are many algorithms to build a decision tree. They include:
1. CART (Classification and Regression Trees) — This makes use of Gini impurity as the
metric.
2. ID3 (Iterative Dichotomiser 3) — This uses entropy and information gain as metric.
Consider whether a dataset based on which we will determine whether to play football or not.
There are four independent variables to determine the dependent variable. The independent
variables are Outlook, Temperature, Humidity, and Wind. The dependent variable is whether to play
football or not.
As the first step, we have to find the parent node of our decision tree. For that, follow the steps:
Step 1: Calculate the entropy of the class variable.
Note: here we typically take log to base 2. In total there are 14 examples, of which 9 are yes and 5
are no; based on these counts, E(S) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.94.
Step 2: Calculate the average weighted entropy of each feature. From the above data, for Outlook we
can easily arrive at the following table.
Step 3: The next step is to find the information gain. It is the difference between the parent entropy
and the average weighted entropy found above.
IG(S, outlook) = 0.94 - 0.693 = 0.247
Similarly find Information gain for Temperature, Humidity, and Windy.
IG(S, Temperature) = 0.940 - 0.911 = 0.029
IG(S, Humidity) = 0.940 - 0.788 = 0.152
IG(S, Windy) = 0.940 - 0.8932 = 0.048
Step 4: Now select the feature having the largest information gain. Here it is Outlook, so it forms the
first node (root node) of our decision tree.
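The entropy and information-gain arithmetic above can be checked with a short Python sketch; the yes/no counts come from the play-football table in the text, and the function name is illustrative:

```python
from math import log2

def entropy(counts):
    # entropy of a class distribution given raw counts, log base 2
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

parent = entropy([9, 5])  # 9 yes / 5 no in the full dataset, ~ 0.940

# Outlook splits: sunny (2 yes, 3 no), overcast (4 yes, 0 no), rainy (3 yes, 2 no)
splits = [[2, 3], [4, 0], [3, 2]]
weighted = sum(sum(s) / 14 * entropy(s) for s in splits)  # ~ 0.693
ig_outlook = parent - weighted  # ~ 0.247, matching IG(S, outlook) above
```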
Since Overcast contains only examples of class ‘Yes’, we can set it as Yes. That means if the outlook
is overcast, football will be played. Now our decision tree looks as follows.
Step 5: The next step is to find the next node in our decision tree. Now we will find the one under
Sunny. We have to determine which of Temperature, Humidity, or Wind has the highest information
gain.
For humidity from the above table, we can say that play will occur if humidity is normal and will not
occur if it is high. Similarly, find the nodes under rainy.
Classification using CART is similar, but instead of entropy we use Gini impurity.
So as the first step we will find the root node of our decision tree. For that, calculate the Gini
index of the class variable.
As the next step, we will calculate the Gini gain. For that, first we will find the average weighted
Gini impurity of Outlook, Temperature, Humidity, and Windy.
Choose the one that has the highest Gini gain. Gini gain is highest for Outlook, so we can choose it
as our root node.
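The Gini computation for the same play-football counts can be sketched in Python (illustrative names; counts taken from the table in the text):

```python
def gini(counts):
    # Gini Index = 1 - sum(Pj^2) over the classes
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

class_gini = gini([9, 5])  # Gini of the class variable, ~ 0.459

# Outlook splits: sunny (2 yes, 3 no), overcast (4 yes, 0 no), rainy (3 yes, 2 no)
splits = [[2, 3], [4, 0], [3, 2]]
weighted = sum(sum(s) / 14 * gini(s) for s in splits)  # ~ 0.343
gini_gain = class_gini - weighted  # ~ 0.116
```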
Advantages of the Decision Tree:
o It is simple to understand, as it follows the same process that a human follows while making
any decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o There is less requirement of data cleaning compared to other algorithms.
Random Forest is a popular machine learning algorithm that belongs to the supervised learning
technique. It can be used for both Classification and Regression problems in ML. It is based on the
concept of ensemble learning, which is a process of combining multiple classifiers to solve a complex
problem and to improve the performance of the model.
As the name suggests, "Random Forest is a classifier that contains a number of decision trees on
various subsets of the given dataset and takes the average to improve the predictive accuracy of that
dataset." Instead of relying on one decision tree, the random forest takes the prediction from each
tree and, based on the majority vote of predictions, predicts the final output.
The greater number of trees in the forest leads to higher accuracy and prevents the problem of
overfitting.
The below diagram explains the working of the Random Forest algorithm:
Since the random forest combines multiple trees to predict the class of the dataset, it is possible that
some decision trees may predict the correct output, while others may not. But together, all the trees
predict the correct output. Therefore, below are two assumptions for a better Random forest classifier:
o There should be some actual values in the feature variable of the dataset so that the classifier
can predict accurate results rather than a guessed result.
o The predictions from each tree must have very low correlations.
Below are some points that explain why we should use the Random Forest algorithm:
o It takes less training time as compared to other algorithms.
o It predicts output with high accuracy; even for a large dataset it runs efficiently.
o It can maintain accuracy even when a large proportion of data is missing.
Random Forest works in two phases: the first is to create the random forest by combining N decision
trees, and the second is to make predictions for each tree created in the first phase.
Step-1: Select random K data points from the training set.
Step-2: Build the decision trees associated with the selected data points (subsets).
Step-3: Choose the number N of decision trees that you want to build.
Step-4: Repeat Step-1 and Step-2.
Step-5: For new data points, find the predictions of each decision tree, and assign the new data points
to the category that wins the majority votes.
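Two pieces of the procedure, random sampling with replacement (Step-1) and majority voting (Step-5), can be sketched in Python; the function names and the fruit labels are illustrative:

```python
import random
from collections import Counter

def bootstrap_subset(data):
    # Step-1: select random data points from the training set, with replacement
    return [random.choice(data) for _ in range(len(data))]

def forest_predict(tree_votes):
    # Step-5: assign the new point to the category that wins the majority vote
    return Counter(tree_votes).most_common(1)[0][0]

# hypothetical predictions from N = 5 trees for one new data point
votes = ["apple", "banana", "apple", "apple", "banana"]
final = forest_predict(votes)  # "apple", since 3 of 5 trees vote for it
```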
The working of the algorithm can be better understood by the below example:
Example: Suppose there is a dataset that contains multiple fruit images. So, this dataset is given to the
Random forest classifier. The dataset is divided into subsets and given to each decision tree. During the
training phase, each decision tree produces a prediction result, and when a new data point occurs, then
based on the majority of results, the Random Forest classifier predicts the final decision. Consider the
below image:
There are mainly four sectors where Random Forest is mostly used:
1. Banking: Banking sector mostly uses this algorithm for the identification of loan risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the disease can be
identified.
3. Land Use: We can identify the areas of similar land use by this algorithm.
4. Marketing: Marketing trends can be identified using this algorithm.
Disadvantages:
o Although random forest can be used for both classification and regression tasks, it is less
suitable for regression tasks.
Two marks
1) What is Machine Learning?
Definition: A computer program is said to learn from experience E with respect to some class of tasks
T and performance measure P, if its performance at tasks in T, as measured by P, improves with
experience E. Machine learning is programming computers to optimize a performance criterion using
example data or past experience. The application of machine learning methods to large databases is
called data mining.
2)What are the phases of machine learning?
Phases of machine learning:
1. Training: a training set of examples of correct behavior is analyzed and the newly learnt
knowledge is stored, usually in the form of rules.
2. Validation: the rules are checked and, if necessary, additional training is given.
3. Application: the rules are used to respond to new situations.
❖ The user gets immediate feedback in supervised learning and no feedback in
unsupervised learning, while in reinforcement learning the feedback is a delayed
scalar.
➢ Logistic regression is one of the most popular Machine Learning algorithms, which
comes under the Supervised Learning technique. It is used for predicting the
categorical dependent variable using a given set of independent variables.
➢ Logistic regression predicts the output of a categorical dependent variable.
Therefore the outcome must be a categorical or discrete value. It can be either Yes
or No, 0 or 1, True or False, etc., but instead of giving the exact values 0 and 1, it
gives probabilistic values which lie between 0 and 1.
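The probabilistic output described above comes from the sigmoid (logistic) function; a minimal sketch:

```python
from math import exp

def sigmoid(z):
    # maps any real value to a probability strictly between 0 and 1
    return 1 / (1 + exp(-z))

p = sigmoid(0)  # 0.5: the decision boundary between the two classes
```

A prediction is then typically made by thresholding: class 1 if sigmoid(z) >= 0.5, else class 0.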
o Decision Tree is a Supervised learning technique that can be used for both classification and
Regression problems, but mostly it is preferred for solving Classification problems. It is a tree-
structured classifier, where internal nodes represent the features of a dataset, branches
represent the decision rules and each leaf node represents the outcome.
o In a Decision tree, there are two nodes, which are the Decision Node and Leaf Node. Decision
nodes are used to make any decision and have multiple branches, whereas Leaf nodes are the
output of those decisions and do not contain any further branches.
UNIT IV
Ensemble Techniques
4 And Unsupervised Learning
Syllabus
Combining multiple learners: Model combination schemes, Voting, Ensemble Learning -
bagging, boosting, stacking; Unsupervised learning: K-means; Instance Based Learning:
KNN, Gaussian mixture models and Expectation maximization.
4.1 Combining Multiple Learners
• When designing a learning machine, we generally make some choices: parameters of
the machine, training data, representation, etc. This implies some sort of variance in
performance. For example, in a classification setting we can use a parametric or a
non-parametric classifier, and in a multilayer perceptron we must also decide on the
number of hidden units.
• Each learning algorithm dictates a certain model that comes with a set of assumptions.
This inductive bias leads to error if the assumptions do not hold for the data.
• Different learning algorithms have different accuracies. The no free lunch theorem
asserts that no single learning algorithm always achieves the best performance in any
domain. They can be combined to attain higher accuracy.
• Data fusion is the process of fusing multiple records representing the same real-world
object into a single, consistent, and clean representation. Fusion of data for improving
prediction accuracy and reliability is an important problem in machine learning.
• Combining different models is done to improve the performance of deep learning
models. Building a new model by combination requires less time, data, and
computational resources. The most common method to combine models is by
averaging multiple models, where taking a weighted average improves the accuracy.
• Different Algorithms: We can use different learning algorithms to train different base-
learners. Different algorithms make different assumptions about the data and lead to
different classifiers.
• Different Hyper-parameters: We can use the same learning algorithm but use it with
different hyper – parameters.
• Different Input Representations: Different representations make different
characteristics explicit allowing better identification.
• Different training sets: Another possibility is to train different base – learners by
different subsets of the training set.
• Different methods used for generating the final output from multiple base-learners are
multiexpert and multistage combination:
1. Multiexpert combination: the base-learners work in parallel.
2. Multistage combination: the base-learners are applied in a serial approach.
• Let’s assume that we want to construct a function that maps inputs to outputs from a
set of known N_train input-output pairs
D_train = {(x_i, y_i)}, i = 1, …, N_train
where x_i ∈ X is a D-dimensional feature input vector and y_i ∈ Y is the output.
• Classification: when the output takes values in a discrete set of class labels, the task is
classification.
4.1.2 Voting
• In this method, the first step is to create multiple classification/regression models
using some training dataset. Each base model can be created using different splits of
the same training dataset and the same algorithm, using the same dataset with different
algorithms, or by any other method.
• Fig. 4.1.2 shows general idea of Base – learners with model combiner.
• When combining multiple independent and diverse decisions each of which is at least
more accurate than random guessing, random errors cancel each other out, and correct
decisions are reinforced. Human ensembles are demonstrably better.
• Use a single, arbitrary learning algorithm but manipulate training data to make it learn
multiple models.
• The problem here is that if there is an error with one of the base-learners, there may
be a misclassification because the class code words are so similar. So the approach in
error-correcting codes is to have L>K and increase the Hamming distance between the
code words.
• One possibility is pairwise separation of classes, where there is a separate base-learner
to separate Ci from Cj, for i < j.
• Pairwise: L = K(K − 1)/2. For K = 4 classes, this gives L = 6 base-learners:
W = [ +1 +1 +1  0  0  0
      −1  0  0 +1 +1  0
       0 −1  0 −1  0 +1
       0  0 −1  0 −1 −1 ]
• With reasonable L, find W such that the Hamming distance between rows and
between columns are maximized.
• The voting scheme computes, for each class Ci,
y_i = Σ_{j=1}^{L} w_j d_ji
where d_ji is the vote of base-learner j for class Ci and the w_j are the learner weights.
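A linear combiner of this form can be sketched as follows; votes[j] holds learner j's support d_ji for each class Ci, and the function name is illustrative:

```python
def combine(votes, weights):
    # y_i = sum_j w_j * d_ji : weighted sum of learner outputs, per class
    n_classes = len(votes[0])
    return [sum(w * d[i] for w, d in zip(weights, votes)) for i in range(n_classes)]

# three learners, two classes; equal weights reduces to plurality voting
y = combine([[1, 0], [0, 1], [1, 0]], [1/3, 1/3, 1/3])
# class 0 collects 2/3 of the vote, class 1 collects 1/3
```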
• Ensemble modeling is the process of running two or more related but different
analytical models and then synthesizing the results into a single score or spread in
order to improve the accuracy of predictive analytics and data mining applications.
• An ensemble of classifiers is a set of classifiers whose individual decisions are
combined in some way to classify new examples.
• Ensemble methods combine several decision trees classifiers to produce better
predictive performance than a single decision tree classifier. The main principle
behind the ensemble model is that a group of weak learners come together to form a
strong learner, thus increasing the accuracy of the model.
• Why do ensemble methods work? They are based on one of two basic observations:
1. Variance reduction: if the training sets are completely independent, it always
helps to average an ensemble, because this reduces variance without affecting
bias (e.g. bagging) and reduces sensitivity to individual data points.
2. Bias reduction: for simple models, the average of models has much greater
capacity than a single model. Averaging models can reduce bias substantially by
increasing capacity, while variance is controlled by fitting one component at a
time.
4.2.1 Bagging
• Bagging is also called bootstrap aggregating. Bagging and boosting are meta-
algorithms that pool decisions from multiple classifiers. Bagging creates ensembles by
repeatedly randomly resampling the training data.
• Bagging was the first effective method of ensemble learning and is one of the simplest
methods of arcing. The meta-algorithm, which is a special case of model averaging,
was originally designed for classification and is usually applied to decision tree
models, but it can be used with any type of model for classification or regression.
• Ensemble classifiers such as bagging, boosting and model averaging are known to
have improved accuracy and robustness over a single model. Although unsupervised
models, such as clustering, do not directly generate label prediction for each
individual, they provide useful constraints for the joint prediction of a set of related
objects.
• Given a training set of size n, create m samples of size n by drawing n examples from
the original data, with replacement. Each bootstrap sample will on average contain
63.2% of the unique training examples; the rest are replicates. Bagging combines the
m resulting models using a simple majority vote.
• In particular, on each round, the base learner is trained on what is often called a
“bootstrap replicate” of the original training set. Suppose the training set consists of
n examples. Then a bootstrap replicate is a new training set that also consists of n
examples, and which is formed by repeatedly selecting uniformly at random and with
replacement n examples from the original training set. This means that the same
example may appear multiple times in the bootstrap replicate, or it may appear not at
all.
• It also decreases error by decreasing the variance in the results due to unstable
learners: algorithms (like decision trees) whose output can change dramatically when
the training data is slightly changed.
• Pseudocode:
1. Given training data (x1, y1), …, (xm, ym).
2. For t = 1, …, T:
a. Form bootstrap replicate dataset St by selecting m random examples from the
training set with replacement.
b. Let ht be the result of training the base learning algorithm on St.
3. Output the combined classifier H(x) = majority vote of h1(x), …, hT(x).
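The 63.2% figure quoted above can be checked empirically with a short sketch: a bootstrap replicate of n items drawn with replacement covers about 1 − 1/e ≈ 63.2% of the unique items on average.

```python
import random

def bootstrap_replicate(data):
    # draw n examples uniformly at random, with replacement
    return [random.choice(data) for _ in range(len(data))]

random.seed(0)
data = list(range(10000))
rep = bootstrap_replicate(data)
unique_frac = len(set(rep)) / len(data)  # close to 1 - 1/e ~ 0.632 for large n
```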
Bagging Steps :
1. Suppose there are N observations and M features in the training data set. A sample
from the training data set is taken randomly with replacement.
2. A subset of M features is selected randomly, and whichever feature gives the best
split is used to split the node iteratively.
3. The tree is grown to its largest extent.
4. The above steps are repeated n times and the prediction is given based on the
aggregation of predictions from the n trees.
Advantages of Bagging:
1. Reduces over-fitting of the model.
2. Handles higher dimensionality data well.
3. Maintains accuracy for missing data.
Disadvantages of Bagging:
1. Since final prediction is based on the mean predictions from subset trees, it won’t
give precise values for the classification and regression model.
4.2.2 Boosting
• Boosting is an ensemble learning method that combines a set of weak learners into a
strong learner to minimize training errors. In boosting, a random sample of data is selected,
fitted with a model and then trained sequentially—that is, each model tries to compensate for
the weaknesses of its predecessor.
• A learner is weak if it produces a classifier that is only slightly better than random
guessing, while a learner is said to be strong if it produces a classifier that achieves a
low error with high confidence for a given concept.
• AdaBoost is a practical boosting algorithm for building ensembles that empirically
improves generalization performance. Examples are given weights; at each iteration, a
new hypothesis is learned and the examples are reweighted to focus the system on
examples that the most recently learned classifier got wrong.
• Boosting is a bias reduction technique. It typically improves the performance of a
single tree model. A reason for this is that we often cannot construct trees which are
sufficiently large due to thinning out of observations in the terminal nodes.
• Boosting is then a device to come up with a more complex solution by taking linear
combination of trees. In presence of high – dimensional predictors, boosting is also
very useful as a regularization technique for additive or interaction modeling.
• To begin, we define an algorithm for finding the rules of thumb, which we call a weak
learner. The boosting algorithm repeatedly calls this weak learner, each time feeding
it a different distribution over the training data. Each call generates a weak classifier
and we must combine all of these into a single classifier that, hopefully, is much more
accurate than any one of the rules.
• Train a set of weak hypotheses h1, …, hT. The combined hypothesis H is a weighted
majority vote of the T weak hypotheses. During training, focus on the examples that
are misclassified.
AdaBoost:
Advantages of AdaBoost:
1. Very simple to implement
2. Fairly good generalization.
3. The prior error need not be known ahead of time.
Disadvantages of AdaBoost:
1. Suboptimal solution
2. Can overfit in the presence of noise.
Boosting Steps:
1. Draw a random subset of training samples d1 without replacement from the training
set D to train a weak learner C1
2. Draw a second random training subset d2 without replacement from the training set
and add 50 percent of the samples that were previously misclassified, to train a
weak learner C2.
3. Find the training samples d3 in the training set D on which C1 and C2 disagree to
train a third weak learner C3
4. Combine all the weak learners via majority voting.
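The per-round reweighting at the heart of AdaBoost can be sketched for a single hypothetical round with four equally weighted examples, one of them misclassified (the function name is illustrative):

```python
from math import exp, log

def adaboost_round(weights, correct):
    # weights: current example weights; correct[i]: was example i classified correctly?
    err = sum(w for w, c in zip(weights, correct) if not c) / sum(weights)
    alpha = 0.5 * log((1 - err) / err)  # vote weight of this weak learner
    # decrease weights of correct examples, increase weights of mistakes
    new_w = [w * exp(-alpha if c else alpha) for w, c in zip(weights, correct)]
    z = sum(new_w)  # normalizer so the weights remain a distribution
    return alpha, [w / z for w in new_w]

alpha, w = adaboost_round([0.25] * 4, [True, True, True, False])
# the single misclassified example now carries half of the total weight
```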
Advantages of Boosting:
1. Supports different loss functions.
2. Works well with interactions.
Disadvantages of Boosting:
1. Prone to over-fitting.
2. Requires careful tuning of different hyper-parameters.
4.2.3 Stacking
• Stacking, sometimes called stacked generalization, is an ensemble machine learning
method that combines multiple heterogeneous base or component models via a meta-
model.
• The base models are trained on the complete training data, and then the meta-model is
trained on the predictions of the base models. The advantage of stacking is the ability
to explore the solution space with different models on the same problem.
• The stacking model can be visualized in levels and has at least two levels of
models. The first level typically trains two or more base-learners (which can be
heterogeneous), and the second level might be a single meta-learner that uses the
base models’ predictions as input and gives the final result as output. A stacked model
can have more than two such levels, but increasing the levels doesn’t always guarantee
better performance.
• In the classification tasks, often logistic regression is used as a meta learner, while
linear regression is more suitable as a meta learner for regression – based tasks.
• Stacking is concerned with combining multiple classifiers generated by different
learning algorithms L1, …, LN on a single dataset S, which is composed of feature
vectors si = (xi, ti).
In the meta-level dataset, the features are the predictions of the base-level classifiers
and the class is the correct class of the example in hand.
4.2.4 AdaBoost
• AdaBoost, also referred to as adaptive boosting, is a method in machine learning used
as an ensemble method. The most common algorithm used with AdaBoost is one-level
decision trees, i.e., decision trees with only a single split. These trees are also
referred to as decision stumps.
Fig: Decision stump
• In the ensemble approach, we add the weak models sequentially and then train them
using weighted training data.
• We continue to iterate the process until we reach a pre-set number of weak learners or
we can no longer observe further improvement on the dataset. At the end of the
algorithm, we are left with a number of weak learners, each with a stage value.
4.2.5 Difference between Bagging and Boosting
3. Bagging aims to decrease variance; boosting aims to decrease bias.
4. In bagging, every model receives an equal weight; in boosting, models are weighted
by their performance.
4.3 Clustering
• Cluster analysis or clustering is the task of grouping a set of objects in such a way
that objects in the same group (called a cluster) are more similar (in some sense) to
each other than to those in other groups (clusters).
• Cluster analysis can be a powerful data-mining tool for any organization that needs to
identify discrete groups of customers, sales transactions, or other types of behaviors
and things. For example, insurance providers use cluster analysis to detect fraudulent
claims, and banks use it for credit scoring.
Fig. 4.3.1
• Clustering means grouping of data or dividing a large data set into smaller data sets of
some similarity.
• A clustering algorithm attempts to find natural groups of components or data based on
some similarity. The clustering algorithm also finds the centroid of a group of data
sets.
• The quality of a clustering result depends on both the similarity measure used by the
method and its implementation. The quality of a clustering method is also measured by
its ability to discover some or all of the hidden patterns.
• Clustering techniques types : The major clustering techniques are
a) Partitioning methods
b) Hierarchical methods
c) Density – based methods.
The k-means clustering algorithm mainly performs two tasks:
o Determines the best value for K center points or centroids by an iterative process.
o Assigns each data point to its closest k-center. The data points which are near to a
particular k-center create a cluster.
Hence each cluster has data points with some commonalities, and it is away from other
clusters.
The below diagram explains the working of the K-means Clustering Algorithm:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select random K points or centroids. (They can be other than the input dataset.)
Step-3: Assign each data point to their closest centroid, which will form the predefined K
clusters.
Step-4: Calculate the variance and place a new centroid of each cluster.
Step-5: Repeat the third step: reassign each data point to the new closest centroid of each
cluster.
Step-6: If any reassignment occurs, go to Step-4; otherwise the model is ready.
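The assignment and centroid-update steps can be sketched as two small helper functions (illustrative names; points are tuples of coordinates):

```python
def assign(points, centroids):
    # Step-3: label each point with the index of its closest centroid
    def d2(p, c):
        return sum((pi - ci) ** 2 for pi, ci in zip(p, c))
    return [min(range(len(centroids)), key=lambda j: d2(p, centroids[j]))
            for p in points]

def update(points, labels, k):
    # Step-4: move each centroid to the mean of its assigned points
    new_centroids = []
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        new_centroids.append(tuple(sum(xs) / len(members) for xs in zip(*members)))
    return new_centroids

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels = assign(pts, [(0, 0), (10, 10)])   # two well-separated clusters
centroids = update(pts, labels, 2)         # centroids move to cluster means
```

Iterating assign and update until the labels stop changing gives the full K-means loop.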
Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables is
given below:
o Let's take number k of clusters, i.e., K=2, to identify the dataset and to put them into
different clusters. It means here we will try to group these datasets into two different
clusters.
o We need to choose some random k points or centroids to form the clusters. These
points can be either points from the dataset or any other points. So, here we are
selecting the below two points as k points, which are not part of our dataset. Consider
the below image:
o Now we will assign each data point of the scatter plot to its closest K-point or
centroid. We will compute this by calculating the distance between two points. So, we
will draw a median between both the centroids. Consider the below image:
From the above image, it is clear that points on the left side of the line are near to the K1 or
blue centroid, and points to the right of the line are close to the yellow centroid. Let's color
them blue and yellow for clear visualization.
o As we need to find the closest cluster, we will repeat the process by choosing new
centroids. To choose the new centroids, we will compute the center of gravity of each
cluster and will find the new centroids as below:
o Next, we will reassign each data point to the new centroid. For this, we will repeat the
same process of finding a median line. The median will be like the below image:
From the above image, we can see one yellow point is on the left side of the line, and two
blue points are to the right of the line. So, these three points will be assigned to new
centroids.
As reassignment has taken place, so we will again go to the step-4, which is finding new
centroids or K-points.
o We will repeat the process by finding the center of gravity of centroids, so the new
centroids will be as shown in the below image:
o As we got the new centroids, we will again draw the median line and reassign the data
points. So, the image will be:
o We can see in the above image that there are no dissimilar data points on either side of
the line, which means our model is formed. Consider the below image:
As our model is ready, so we can now remove the assumed centroids, and the two final
clusters will be as shown in the below image:
The performance of the K-means clustering algorithm depends upon highly efficient clusters
that it forms. But choosing the optimal number of clusters is a big task. There are some
different ways to find the optimal number of clusters, but here we are discussing the most
appropriate method to find the number of clusters or value of K. The method is given below:
Elbow Method
The Elbow method is one of the most popular ways to find the optimal number of clusters.
This method uses the concept of WCSS value. WCSS stands for Within Cluster Sum of
Squares, which defines the total variations within a cluster. The formula to calculate the
value of WCSS (for 3 clusters) is given below:
WCSS = ∑Pi in Cluster1 distance(Pi, C1)² + ∑Pi in Cluster2 distance(Pi, C2)² + ∑Pi in Cluster3 distance(Pi, C3)²
∑Pi in Cluster1 distance(Pi, C1)²: it is the sum of the squares of the distances between each data
point and its centroid within cluster 1, and the same for the other two terms.
To measure the distance between data points and centroid, we can use any method such as
Euclidean distance or Manhattan distance.
To find the optimal value of clusters, the elbow method follows the below steps:
o It executes the K-means clustering on a given dataset for different K values (ranges
from 1-10).
o For each value of K, calculates the WCSS value.
o Plots a curve between calculated WCSS values and the number of clusters K.
o The sharp point of bend or a point of the plot looks like an arm, then that point is
considered as the best value of K.
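Given a clustering (points, centroids, and cluster labels, as in the K-means walkthrough above), the WCSS value that the elbow method plots can be computed directly; this is a minimal sketch with illustrative names:

```python
def wcss(points, centroids, labels):
    # within-cluster sum of squared Euclidean distances to each point's centroid
    return sum(sum((x - c) ** 2 for x, c in zip(p, centroids[lab]))
               for p, lab in zip(points, labels))

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
value = wcss(pts, [(0.0, 0.5), (10.0, 10.5)], [0, 0, 1, 1])
# each point is 0.5 away from its centroid, so WCSS = 4 * 0.25 = 1.0
```

Running this for K = 1 … 10 and plotting the values against K produces the elbow curve described above.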
Since the graph shows the sharp bend, which looks like an elbow, hence it is known as the
elbow method. The graph for the elbow method looks like the below image:
4.4 Instance-Based Learning: KNN
• Example: Suppose we have an image of a creature that looks similar to both a cat and
a dog. For this identification, we can use the KNN algorithm, because it works on a
similarity measure. Our KNN model will find the features of the new data set that are
similar to those of the cat and dog images and, based on the most similar features,
will place it in either the cat or the dog category.
• Suppose there are two categories, i.e., category A and category B, and we have a new
data point x1; this point will lie in one of these categories. To solve this type of
problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify
the category or class of a particular data point. Consider the below diagram:
• The K-NN working can be explained on the basis of the below algorithm:
Step-1: Select the number K of neighbors.
Step-2: Calculate the Euclidean distance between the new point and each training point.
Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
Step-4: Among these K neighbors, count the number of data points in each category.
Step-5: Assign the new data point to the category for which the number of neighbors is
maximum.
• Suppose we have a new data point and we want to place it in the required category.
Consider the below image:
• Firstly, we will choose the number of neighbors; we will select k = 5.
• Next, we will calculate the Euclidean distance between the data points. The Euclidean
distance is the distance between two points, which we have already studied in
geometry. It can be calculated as:
d = √((x2 − x1)² + (y2 − y1)²)
• As we can see, three of the nearest neighbors are from category A; hence this new data
point must belong to category A.
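The five algorithm steps above fit in a few lines of Python; the training points and labels below are hypothetical:

```python
from collections import Counter
from math import dist  # Euclidean distance between two points (Python 3.8+)

def knn_predict(train, new_point, k=5):
    # train is a list of ((x, y), label) pairs; vote among the k nearest neighbors
    nearest = sorted(train, key=lambda item: dist(item[0], new_point))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
         ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]
category = knn_predict(train, (1, 1), k=5)
# three of the five nearest neighbors are from category A, so the answer is "A"
```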
4.5 Gaussian Mixture Models
• The Gaussian mixture model is a probabilistic model that assumes all the data points
are generated from a mix of Gaussian distributions with unknown parameters.
• For example, in modeling human height data, height is typically modeled as a normal
distribution for each gender with a mean of approximately 5’10” for males and 5’5”
for females. Given only the height data and not the gender assignments for each data
point, the distribution of all heights would follow the sum of two scaled (different
variance) and shifted (different mean) normal distributions. A model making this
assumption is an example of a Gaussian mixture model.
• Gaussian mixture models do not rigidly classify each and every instance into one
class or the other. The algorithm attempts to produce K-Gaussian distributions that
would take into account the entire training space. Every point can be associated with
one or more distributions. Consequently, the deterministic factor would be the
probability that each point belongs to a certain Gaussian distribution.
• GMMs have a variety of real - world applications. Some of them are listed below.
a) Used for signal processing
b) Used for customer churn analysis
c) Used for language identification
d) Used in video game industry
e) Genre classification of songs
4.5.1 Expectation – maximization
• In Gaussian mixture models, the expectation-maximization method is a
powerful tool for estimating the parameters of a Gaussian mixture model.
The expectation step is termed E and the maximization step is termed M.
• Expectation is used to find the Gaussian parameters which are used to represent each
component of the Gaussian mixture model. Maximization is involved
in determining whether new data points can be added or not.
• The Expectation – Maximization (EM) algorithm is used in maximum likelihood
estimation where the problem involves two sets of random variables of which one, X,
is observable and the other, Z, is hidden.
• The goal of the algorithm is to find the parameter vector ∅ that maximizes the
likelihood of the observed values of X, L (∅ | X)
• But in cases where this is not feasible, we associate the extra hidden variables Z and
express the underlying model using both, to maximize the likelihood of the joint
distribution of X and Z, the complete likelihood Lc(∅|X, Z).
• Expectation -maximization (EM) is an iterative method used to find maximum
likelihood estimates of parameters in probabilistic models, where the model depends
on unobserved, also called latent, variables.
• EM alternates between performing an expectation (E) step, which computes an
expectation of the likelihood by including the latent variables as if they were
observed, and a maximization (M) step, which computes the maximum likelihood
estimates of the parameters by maximizing the expected likelihood found in the E
step.
• The Parameters found on the M step are then used to start another E step, and the
process is repeated until some criterion is satisfied. EM is frequently used for data
clustering like for example in Gaussian mixtures.
• In the Expectation step, find the expected values of the latent variables (here you
need to use the current parameter values)
• In the Maximization step, first plug in the expected values of the latent variables in
the log-likelihood of the augmented data. Then maximize this log-likelihood to
re-estimate the parameters.
• Expectation – Maximization (EM) is a technique used in point estimation. Given a set
of observable variables X and unknown (latent) variables Z, we want to estimate the
parameters ∅ in a model.
• The expectation maximization (EM) algorithm is a widely used maximum likelihood
estimation procedure for statistical models when the values of some of the
variables in the model are not observed.
• The EM algorithm is an elegant and powerful method for finding the maximum
likelihood of models with hidden variables. The key concept in the EM algorithm is
that it iterates between the expectation step (E-step) and the maximization step (M-step)
until convergence.
• In the E-step, the algorithm estimates the posterior distribution of the hidden variables
Q given the observed data and the current parameter settings; and in the M-step the
algorithm calculates the ML parameter settings with Q fixed.
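A minimal sketch of these alternating E and M steps for a two-component 1-D Gaussian mixture, keeping the variances fixed and equal for brevity; the data and initial means below are made up:

```python
import math

def em_two_gaussians(data, mu, sigma=1.0, iterations=20):
    """EM for a 1-D mixture of two Gaussians with fixed, equal variance."""
    pi = [0.5, 0.5]                      # mixing weights
    for _ in range(iterations):
        # E-step: posterior responsibility of each component for each point,
        # computed with the current parameter settings
        resp = []
        for x in data:
            p = [pi[k] * math.exp(-(x - mu[k]) ** 2 / (2 * sigma ** 2))
                 for k in range(2)]
            total = p[0] + p[1]
            resp.append([p[0] / total, p[1] / total])
        # M-step: re-estimate means and mixing weights with the
        # responsibilities held fixed
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            pi[k] = nk / len(data)
    return mu, pi

# Two well-separated clumps of made-up data
data = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
mu, pi = em_two_gaussians(data, mu=[0.0, 6.0])
print(mu)   # means converge near 1.0 and 5.0
```

Each iteration can only increase the likelihood, so the estimated means settle near the two clump centres after a few passes.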
• At the end of each iteration the lower bound on the likelihood is optimized for the
given parameter setting (M-step) and the likelihood is set to the bound (E-step),
which guarantees an increase in the likelihood and convergence to a local
maximum, or a global maximum if the likelihood function is unimodal.
• Generally, EM works best when the fraction of missing information is small and the
dimensionality of the data is not too large. EM can require many iterations, and higher
dimensionality can dramatically slow down the E-step.
• EM is useful for several reasons: conceptual simplicity, ease of implementation, and
the fact that each iteration improves L(∅). The rate of convergence on the first few
steps is typically quite good, but can become excruciatingly slow as you approach
local optima.
• Sometimes the M- step is a constrained maximization, which means that there are
constraints on valid solutions not encoded in the function itself.
• Expectation maximization is an effective technique that is often used in data analysis
to manage missing data. Indeed, expectation maximization overcomes some of the
limitations of other techniques, such as mean substitution or regression substitution.
These alternative techniques generate biased estimates – and specifically,
underestimate the standard errors. Expectation maximization overcomes this problem.
2. The decision rule used to drive a classification from the K-nearest neighbors.
3. The number of neighbors used to classify the new example.
Q10. What is K-means clustering?
Ans: k-means clustering is a heuristic method. Here each cluster is represented by the
center of the cluster. The k-means algorithm takes the input parameter, k, and partitions a
set of n objects into k clusters so that the resulting intracluster similarity is high but the
intercluster similarity is low.
Q.11 List the properties of K-Means algorithm.
Ans : 1. There are always k clusters.
2. There is always at least one item in each cluster.
3. The clusters are non – hierarchical and they do not overlap.
Q.12 What is stacking ?
Ans: Stacking, sometimes called stacked generalization, is an ensemble machine
learning method that combines multiple heterogeneous base or component models via a
meta-model.
Q. 13 How do GMMs differentiate from K- means clustering ?
Ans: GMMs and K-means are both clustering algorithms used for unsupervised learning
tasks. However, the basic difference between them is that K-means is a distance-based
clustering method while GMM is a distribution-based clustering method.
UNIT V
NEURAL NETWORKS
Perceptron is a Machine Learning algorithm for supervised learning of various binary classification
tasks. Further, Perceptron is also understood as an Artificial Neuron or neural network unit that
helps to detect certain input data computations in business intelligence.
The Perceptron model is also treated as one of the best and simplest types of Artificial Neural Networks.
However, it is a supervised learning algorithm of binary classifiers. Hence, we can consider it as a
single-layer neural network with four main parameters, i.e., input values, weights and bias, net
sum, and an activation function.
Mr. Frank Rosenblatt invented the perceptron model as a binary classifier which contains three
main components. These are as follows:
Fig: 5.1
o Input Nodes or Input Layer:
This is the primary component of Perceptron which accepts the initial data into the system for
further processing. Each input node contains a real numerical value.
o Weight and Bias:
The weight parameter represents the strength of the connection between units. This is another most
important parameter of Perceptron components. Weight is directly proportional to the strength of
the associated input neuron in deciding the output. Further, bias can be considered as the line of
intercept in a linear equation.
o Activation Function:
These are the final and important components that help to determine whether the neuron will fire
or not. Activation Function can be considered primarily as a step function.
o Sign function
o Step function, and
o Sigmoid function
Fig:5.2
The data scientist uses the activation function to take a subjective decision based on various
problem statements and forms the desired outputs. Activation function may differ (e.g., Sign, Step,
and Sigmoid) in perceptron models by checking whether the learning process is slow or has
vanishing or exploding gradients.
Fig:5.3
This step function or Activation function plays a vital role in ensuring that output is mapped
between required values (0,1) or (-1,1). It is important to note that the weight of input is indicative
of the strength of a node. Similarly, an input's bias value gives the ability to shift the activation
function curve up or down.
Step-1
In the first step, multiply all input values with the corresponding weight values and then add them
to determine the weighted sum. Mathematically, we can calculate the weighted sum as follows:
Add a special term called bias 'b' to this weighted sum to improve the model's performance.
∑wi*xi + b
Step-2
In the second step, an activation function is applied with the above-mentioned weighted sum,
which gives us output either in binary form or a continuous value as follows:
Y = f(∑wi*xi + b)
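The two steps can be sketched as a single neuron in Python. The weights and bias below are illustrative values (chosen so the neuron realises a logical AND of two binary inputs), not part of any trained model:

```python
def step(z):
    # Step/threshold activation: fire (1) if the weighted sum is non-negative
    return 1 if z >= 0 else 0

def perceptron_output(inputs, weights, bias):
    # Step 1: weighted sum  sum(wi * xi) + b
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Step 2: apply the activation function  Y = f(sum(wi * xi) + b)
    return step(z)

# Illustrative weights realising a logical AND of two binary inputs
weights, bias = [1.0, 1.0], -1.5
print(perceptron_output([1, 1], weights, bias))  # 1
print(perceptron_output([1, 0], weights, bias))  # 0
```

Swapping `step` for a sign or sigmoid function gives the other activation choices listed above without changing the weighted-sum step.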
Based on the layers, Perceptron models are divided into two types. These are as follows:
This is one of the easiest types of Artificial Neural Networks (ANN). A single-layered perceptron
model consists of a feed-forward network and also includes a threshold transfer function inside the
model. The main objective of the single-layer perceptron model is to analyze linearly separable
objects with binary outcomes.
In a single-layer perceptron model, the algorithm does not have previously recorded data, so it
begins with randomly allocated values for the weight parameters. Further, it sums up all the
weighted inputs. If the total sum of all inputs is more than a pre-determined value, the model gets
activated and shows the output value as +1.
If the outcome is the same as the pre-determined or threshold value, the performance of this model
is stated as satisfactory, and the weights do not change. However, this model shows a few
discrepancies when multiple weighted input values are fed into it. Hence, to find the
desired output and minimize errors, some changes to the weights are necessary.
Like a single-layer perceptron model, a multi-layer perceptron model also has the same model
structure but has a greater number of hidden layers.
The multi-layer perceptron model is also known as the Backpropagation algorithm, which executes
in two stages as follows:
o Forward Stage: Activation functions start from the input layer in the forward stage and
terminate on the output layer.
o Backward Stage: In the backward stage, weight and bias values are modified as per the
model's requirement. In this stage, the error between the actual and demanded output is
propagated backward, originating at the output layer and ending at the input layer.
• Multilayer perceptron is one of the most commonly used machine learning methods.
• The Multi-layer Perceptron network consists of multiple layers of connected neurons.
• Multilayer perceptron is an artificial neural network structure and is a nonparametric
estimator that can be used for classification and regression.
• The multi-layer perceptron is also known as back propagation algorithm, which executes
in two stages as follows:
i. Forward stage:
In Figure 5.1, we start at the left by filling in the values for the inputs. We then use these
inputs and the first level of weights to calculate the activations of the hidden layer, and then
we use those activations and the next set of weights to calculate the activations of the output
layer. Now that we’ve got the outputs of the network, we can compare them to the targets
and compute the error.
• The outputs of these neurons and the second-layer weights (labelled as w) are
used to decide if the output neurons fire or not.
• The error is computed as the sum-of-squares difference between the network outputs
and the targets.
centering the data by bringing mean close to 0. This makes learning for the next layer much
easier.
• This is defined by
ReLU stands for Rectified Linear Unit. It is the most widely used activation function, chiefly
implemented in the hidden layers of neural networks.
Nature :- Non-linear, which means we can easily backpropagate the errors and have multiple layers
of neurons being activated by the ReLU function.
Uses :- ReLU is less computationally expensive than tanh and sigmoid because it involves simpler
mathematical operations. At a time only a few neurons are activated, making the network sparse
and hence efficient and easy for computation.
• In Stochastic Gradient Descent, a few samples are selected randomly instead of the whole
data set for each iteration.
• In Gradient Descent, there is a term called “batch” which denotes the total number of
samples from a dataset that is used for calculating the gradient for each iteration.
• In typical Gradient Descent optimization, like Batch Gradient Descent, the batch is taken to
be the whole dataset. Although using the whole dataset is really useful for getting to the
minima in a less noisy and less random manner, the problem arises when our dataset gets
big.
• Suppose, you have a million samples in your dataset, so if you use a typical Gradient Descent
optimization technique, you will have to use all of the one million samples for completing
one iteration while performing the Gradient Descent, and it has to be done for every iteration
until the minima are reached. Hence, it becomes computationally very expensive to perform.
• This problem is solved by Stochastic Gradient Descent. SGD uses only a single sample,
i.e., a batch size of one, to perform each iteration.
• The sample is randomly shuffled and selected for performing the iteration.
1. Find the slope of the objective function with respect to each parameter/feature. In other
words, compute the gradient of the function.
2. Pick a random initial value for the parameters. (To clarify, in the parabola example,
differentiate “y” with respect to “x”. If we had more features like x1, x2 etc., we take the
partial derivative of “y” with respect to each of the features.)
4. Calculate the step sizes for each feature as : step size = gradient * learning rate.
5. Calculate the new parameters as : new params = old params -step size
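For the parabola example mentioned in the steps, a minimal sketch (the objective y = (x − 3)², the starting point and the learning rate are all illustrative choices):

```python
def gradient_descent(start_x, learning_rate=0.1, iterations=50):
    # Objective: y = (x - 3)^2, whose gradient is dy/dx = 2 * (x - 3)
    x = start_x                       # initial value for the parameter
    for _ in range(iterations):
        gradient = 2 * (x - 3)        # slope of the objective w.r.t. x
        step_size = gradient * learning_rate
        x = x - step_size             # new params = old params - step size
    return x

print(gradient_descent(start_x=10.0))  # converges towards the minimum at x = 3
```

Stochastic gradient descent follows the same update rule, but computes the gradient from one randomly chosen sample per iteration instead of the whole batch.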
Backpropagation is one of the important concepts of a neural network. Our task is to classify
our data best. For this, we have to update the weights and biases, but how can we do that
in a deep neural network? In the linear regression model, we use gradient descent to optimize the
parameters. Similarly, here we also use the gradient descent algorithm, via Backpropagation.
Backpropagation defines the whole process encompassing both the calculation of the gradient
and its use in stochastic gradient descent. Technically, backpropagation is used to calculate
the gradient of the error of the network with respect to the network's modifiable weights.
Backpropagation is an iterative, recursive and efficient approach through
which it computes the updated weights to improve the network until it is able to perform the
task for which it is being trained. Derivatives of the activation function need to be known at
network design time for Backpropagation.
Backpropagation is widely used in neural network training and calculates the gradient of the loss
function with respect to the weights of the network. It works with a multi-layer neural network
to discover the internal representation of the input-output mapping. It is a standard form of
artificial network training, which supports computing the gradient of the loss function with
respect to all weights in the network. The backpropagation algorithm is used to train a neural
network more effectively through a chain rule method. This gradient is used in a simple
stochastic gradient descent algorithm to find weights that minimize the error. The error
propagates backward from the output nodes to the inner nodes.
The training algorithm of backpropagation involves four stages, which are as follows:
• Initialization of weights - Small random values are assigned.
• Feed-forward - Each input unit X receives an input signal and transmits this signal to each of
the hidden units Z1, Z2, ..., Zn. Each hidden unit calculates the activation function and sends
its signal Zj to each output unit. The output unit calculates the activation function to form
the response to the given input pattern.
• Backpropagation of errors - Each output unit compares its activation Yk with the target
value Tk to determine the associated error for that unit. Based on the error, the factor
δk (k = 1, ..., m) is computed and is used to distribute the error at the output unit Yk back
to all units in the previous layer. Similarly, the factor δj (j = 1, ..., p) is computed for each
hidden unit Zj.
• Updating of weights and biases.
Consider the above backpropagation neural network example diagram to understand the process.
The backpropagation algorithm proceeds in the following steps, assuming a suitable learning
rate α and random initialization of the parameters w_ij^k:
Definition
1. Calculate the forward phase for each input-output pair (x⃗_d, y_d) and store the results ŷ_d, a_j^k and
o_j^k for each node j in layer k by proceeding from layer 0, the input layer, to layer m, the output
layer.
2. Calculate the backward phase for each input-output pair (x⃗_d, y_d) and store the result ∂E_d/∂w_ij^k for
each weight w_ij^k connecting node i in layer k − 1 to node j in layer k, by proceeding
from layer m, the output layer, to layer 1, the input layer.
(a) Evaluate the error term for the final layer δ_j^m by using the second equation.
(b) Backpropagate the error terms for the hidden layers δ_j^k, working backwards from the final
hidden layer k = m − 1, by repeatedly using the third equation.
(c) Evaluate the partial derivatives of the individual error E_d with respect to w_ij^k by using the
first equation.
3. Combine the individual gradients ∂E_d/∂w_ij^k for each input-output pair to get the total gradient
∂E(X, θ)/∂w_ij^k for the entire set of input-output pairs X = {(x⃗₁, y₁), ..., (x⃗_N, y_N)} by using the fourth
equation.
1. Input is modeled using real weights W. The weights are usually randomly selected.
2. Calculate the output for every neuron from the input layer, to the hidden layers, to the output
layer.
3. Calculate the error in the outputs.
4. Travel back from the output layer to the hidden layer to adjust the weights such that the error
is decreased.
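The stages above can be sketched for a tiny 2-1-1 network with sigmoid activations and a sum-of-squares error; the inputs, target, initial weights and learning rate are all made-up illustrative values:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Tiny network: 2 inputs -> 1 hidden unit -> 1 output
x, target = [1.0, 0.5], 1.0
w_hidden, b_hidden = [0.4, -0.2], 0.1     # stage 1: small random-looking weights
w_out, b_out = 0.3, -0.1
alpha = 0.5                                # learning rate

for _ in range(200):
    # Feed-forward
    z_h = w_hidden[0] * x[0] + w_hidden[1] * x[1] + b_hidden
    h = sigmoid(z_h)
    y = sigmoid(w_out * h + b_out)
    # Backpropagation of errors: delta terms via the chain rule
    delta_out = (y - target) * y * (1 - y)
    delta_h = delta_out * w_out * h * (1 - h)
    # Updating of weights and biases
    w_out -= alpha * delta_out * h
    b_out -= alpha * delta_out
    w_hidden = [w - alpha * delta_h * xi for w, xi in zip(w_hidden, x)]
    b_hidden -= alpha * delta_h

print(round(y, 3))  # the output climbs towards the target 1.0
```

Each pass performs one feed-forward step, one backward error-distribution step, and one weight update, exactly mirroring the numbered list above.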
Static Back Propagation - In this type of backpropagation, the static output is created because of
the mapping of static input. It is used to resolve static classification problems like optical character
recognition.
Key Points
• Simplifies the network structure by eliminating weighted links that have the least effect on
the trained network.
• You need to study a group of input and activation values to develop the relationship
between the input and hidden unit layers.
• It helps to assess the impact that a given input variable has on a network output. The
knowledge gained from this analysis should be represented in rules.
• Backpropagation is especially useful for deep neural networks working on error-prone
projects, such as image or speech recognition.
• Backpropagation takes advantage of the chain and power rules, which allows it to
function with any number of outputs.
The vanishing gradient problem is an issue that sometimes arises when training machine
learning algorithms through gradient descent. This most often occurs in neural networks that have
several neuronal layers such as in a deep learning system, but also occurs in recurrent neural
networks.
The key point is that the partial derivatives used to compute the gradient become smaller as one
goes deeper into the network. Since the gradients control how much the network learns during
training, if the gradients are very small or zero, then little to no training can take place, leading
to poor predictive performance.
The problem:
As more layers using certain activation functions are added to neural networks, the gradient
of the loss function approaches zero, making the network hard to train.
Why:
Certain activation functions, like the sigmoid function, squash a large input space into a small
output space between 0 and 1. Therefore, a large change in the input of the sigmoid function will
cause only a small change in the output. Hence, the derivative becomes small.
Fig:5.16
As an example, the above image shows the sigmoid function and its derivative. Note how, when the
inputs of the sigmoid function become larger or smaller (when |x| becomes bigger), the derivative
becomes close to zero.
For a shallow network with only a few layers that use these activations, this isn't a big problem.
However, when more layers are used, it can cause the gradient to be too small for training to work
effectively. Gradients of neural networks are found using backpropagation. Simply put,
backpropagation finds the derivatives of the network by moving layer by layer from the final layer
to the initial one. By the chain rule, the derivatives of each layer are multiplied down the network
(from the final layer to the initial) to compute the derivatives of the initial layers.
However, when n hidden layers use an activation like the sigmoid function, n small derivatives
are multiplied together. Thus, the gradient decreases exponentially as we propagate down to the
initial layers. A small gradient means that the weights and biases of the initial layers will not be
updated effectively with each training session. Since these initial layers are often crucial to
recognizing the core elements of the input data, it can lead to overall inaccuracy of the whole
network.
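The exponential shrinkage can be seen numerically: the sigmoid's derivative is at most 0.25 (at z = 0), so even in the best case a chain-rule product over n layers decays like 0.25ⁿ.

```python
import math

def sigmoid_derivative(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

# Chain-rule product of one sigmoid derivative per layer. Using z = 0 gives
# the largest possible sigmoid derivative, 0.25 -- the best case for sigmoid.
for n_layers in (2, 5, 10, 20):
    gradient = 1.0
    for _ in range(n_layers):
        gradient *= sigmoid_derivative(0.0)
    print(n_layers, gradient)
```

By 20 layers the product is below 10⁻¹², which is why the initial layers receive essentially no weight update.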
Solution:
The simplest solution is to use other activation functions, such as ReLU, which doesn't cause
a small derivative. Residual networks are another solution, as they provide residual connections
straight to earlier layers. The residual connection directly adds the value at the beginning of the
block, x, to the end of the block (F(x) + x). This residual connection doesn't go through activation
functions that "squashes" the derivatives, resulting in a higher overall derivative of the block.
Fig:5.17
An activation function is a simple mathematical function that transforms the given input to
the required output within a certain range. As the name suggests, it activates the neuron when
the output reaches the set threshold value of the function; basically, activation functions are
responsible for switching the neuron ON/OFF. The neuron receives the sum of the product of
inputs and randomly initialized weights, along with a static bias for each layer. The activation
function is applied to this sum, and an output is generated. Activation functions introduce
non-linearity, so as to make the network learn complex patterns in the data, such as in the case
of images, text, videos or sounds. Without an activation function our model is going to behave
like a linear regression model that has limited learning capacity.
5.7 ReLU
The rectified linear activation unit, or ReLU, is one of the few landmarks in the deep learning
revolution. It's simple, yet it's far superior to previous activation functions like sigmoid or tanh.
Both the ReLU function and its derivative are monotonic. If the function receives any
negative input, it returns 0; however, if the function receives any positive value x, it returns that
value. As a result, the output has a range of 0 to infinity. ReLU is the most often used activation
function in neural networks, especially CNNs, and is utilized as the default activation function.
def relu(x):
    return max(0.0, x)

print(relu(1.0))    # 1.0
print(relu(-10.0))  # 0.0
print(relu(0.0))    # 0.0
print(relu(15.0))   # 15.0
print(relu(-20.0))  # 0.0
Fig: 5.19
We see from the plot that all the negative values have been set to zero, and the positive values
are returned as they are. Note that we have given a set of consecutively increasing numbers as
input, so we have a linear output with an increasing slope.
Advantages of ReLU:
ReLU is used in the hidden layers instead of Sigmoid or tanh, as using sigmoid or tanh in the
hidden layers leads to the infamous problem of the "Vanishing Gradient". The "Vanishing Gradient"
prevents the earlier layers from learning important information when the network is
backpropagating. The sigmoid, which is a logistic function, is preferable for
regression or binary classification problems, and that too only in the output layer, as the
output of a sigmoid function ranges from 0 to 1. Also, sigmoid and tanh saturate and have lesser sensitivity.
→ Simpler Computation: Derivative remains constant i.e 1 for a positive input and thus
reduces the time taken for the model to learn and in minimizing the errors.
→ Linearity: Linear activation functions are easier to optimize and allow for a smooth flow.
So, it is best suited for supervised tasks on large sets of labelled data.
Disadvantages of ReLU:
• Exploding Gradient: This occurs when the gradient gets accumulated, causing large
differences in the subsequent weight updates. As a result, it causes instability when
converging to the global minimum, and instability in the learning too.
• Dying ReLU: The problem of "dead neurons" occurs when a neuron gets stuck on the
negative side and constantly outputs zero. Because the gradient of 0 is also 0, it is unlikely for
the neuron to ever recover. This happens when the learning rate is too high or the negative bias
is quite large.
Grid Search CV
Fig: 5.20
As in the image, for C = [0.1, 0.2, 0.3, 0.4, 0.5] and Alpha = [0.1, 0.2, 0.3, 0.4], the
performance score for the combination C = 0.3 and Alpha = 0.2 comes out to be 0.726 (the highest);
therefore it is selected.
# Necessary imports
logreg_cv.fit(X,y)
Output:
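Since the full listing is not reproduced here, the following is a sketch of what grid search does over the C and Alpha grid of Fig. 5.20. The `score` function is a made-up stand-in for cross-validated model performance; a real run would fit a model for each combination (e.g. with scikit-learn's GridSearchCV) and use its cross-validation score instead.

```python
from itertools import product

def score(c, alpha):
    # Hypothetical stand-in for a cross-validation score, peaking at the
    # combination highlighted in Fig. 5.20 (C = 0.3, Alpha = 0.2).
    return 0.726 - abs(c - 0.3) - abs(alpha - 0.2)

C_grid = [0.1, 0.2, 0.3, 0.4, 0.5]
alpha_grid = [0.1, 0.2, 0.3, 0.4]

# Exhaustively evaluate every (C, Alpha) combination and keep the best
best = max(product(C_grid, alpha_grid), key=lambda p: score(*p))
print(best, round(score(*best), 3))  # (0.3, 0.2) 0.726
```

The drawback hinted at above is exactly this exhaustive loop: the number of fits grows multiplicatively with the size of each parameter grid.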
Drawback:
Normalization is a data pre-processing tool used to bring the numerical data to a common scale
without distorting its shape. Generally, when we input the data to a machine or deep learning
algorithm we tend to change the values to a balanced scale. The reason we normalize is partly to
ensure that our model can generalize appropriately. Now coming back to Batch normalization, it
is a process to make neural networks faster and more stable through adding extra layers in a deep
neural network. The new layer performs the standardizing and normalizing operations on the input
of a layer coming from a previous layer. A typical neural network is trained using a collected set
of input data called batch. Similarly, the normalizing process in batch normalization takes place in
batches, not as a single input.
Fig: 5.21
L = Number of layers
Bias = 0
Initially, our inputs X1, X2, X3, X4 are in normalized form as they are coming from the pre-
processing stage. When the input passes through the first layer, it transforms, as a sigmoid function
applied over the dot product of input X and the weight matrix W.
h1 = 𝜎(W1X)
Fig: 5.22
Similarly, this transformation will take place for the second layer and go till the last layer L as
shown in the following image.
Fig:5.23
Although our input X was normalized, with time the output will no longer be on the same scale.
As the data go through multiple layers of the neural network and L activation functions are applied,
this leads to an internal covariate shift in the data.
Since by now we have a clear idea of why we need Batch normalization, let's understand how it
works. It is a two-step process. First, the input is normalized, and later rescaling and offsetting is
performed.
Normalization is the process of transforming the data to have mean zero and standard deviation
one. In this step we have our batch input from layer h; first, we need to calculate the mean of
the hidden activations:

μ = (1/m) Σ hᵢ

Here, m is the number of neurons at layer h. Once we have the mean, the next step is to
calculate the standard deviation of the hidden activations:

σ = √( (1/m) Σ (hᵢ − μ)² )
Further, as we have the mean and the standard deviation ready, we will normalize the hidden
activations using these values. For this, we will subtract the mean from each input and divide the
whole value by the sum of the standard deviation and the smoothing term (ε). The smoothing term (ε)
assures numerical stability within the operation by preventing division by zero.
hᵢ(norm) = (hᵢ − μ) / (σ + ε)
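The normalization equations, together with the rescaling-and-offsetting step mentioned earlier, can be sketched with NumPy. Here `gamma` and `beta` stand for the learnable scale and shift parameters, and the batch of activations is made up:

```python
import numpy as np

def batch_norm(h, gamma=1.0, beta=0.0, eps=1e-5):
    mu = h.mean()                       # mean of the hidden activations
    sigma = h.std()                     # standard deviation of the activations
    h_norm = (h - mu) / (sigma + eps)   # normalise; eps avoids division by zero
    return gamma * h_norm + beta        # rescale and offset

h = np.array([2.0, 4.0, 6.0, 8.0])
out = batch_norm(h)
print(out.mean().round(6), out.std().round(6))  # close to 0 and 1
```

With gamma = 1 and beta = 0, the output batch has mean approximately zero and standard deviation approximately one, which is exactly the standardization the equations above describe.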
By Normalizing the hidden layer activation the Batch normalization speeds up the training process.
It solves the problem of internal covariate shift. Through this, we ensure that the input for every
layer is distributed around the same mean and standard deviation. If you are unaware of what is an
internal covariate shift, look at the following example.
Suppose we are training an image classification model that classifies images into Dog or Not
Dog. Say we have images of white dogs only; these images will have a certain distribution
as well. Using these images, the model will update its parameters.
Batch normalization smoothens the loss function that in turn by optimizing the model parameters
improves the training speed of the model.
5.10 Regularization
So, before diving into regularization, let's take a step back to understand what bias-variance is and
its impact. Bias is the deviation between the values predicted by the model and the actual values
whereas, variance is the difference between the predictions when the model fits different datasets.
When a model performs well on the training data but does not perform well on the testing data,
the model is said to have a high generalization error. In other words, in such a scenario, the
model has low bias and high variance and is too complex. This is called overfitting. Overfitting
means that the model is a good fit on the train data compared to the test data, as illustrated in
the graph above. Overfitting is also a result of the model being too complex.
Regularization is one of the key concepts in Machine learning as it helps choose a simple model
rather than a complex one. We want our model to perform well both on the train and the new
unseen data, meaning the model must have the ability to be generalized. Generalization error is "a
measure of how accurately an algorithm can predict outcome values for previously unseen data."
Regularization refers to the modifications that can be made to a learning algorithm to help
reduce this generalization error and not the training error. It reduces the error by ignoring the
less important features. It also helps prevent overfitting, making the model more robust and
decreasing the complexity of the model.
Regularization works by shrinking the beta coefficients of a regression model. To understand why
we need to shrink the coefficients, let us see the below example:
Fig: 5.24
In the above graph, the two lines represent the relationship between total years of experience and
salary, where salary is the target variable. The slopes indicate the change in salary per unit
change in total years of experience. As the slope b₁ + b₃ decreases to the slope b₁, we see that
the salary is less sensitive to the total years of experience. By decreasing the slope, the target
variable (salary) becomes less sensitive to the change in the independent X variables, which
introduces bias into the model. Remember, bias is the difference between the predicted and the
actual values.
With the increase in bias to the model, the variance (which is the difference between the predictions
when the model fits different datasets.) decreases. And, by decreasing the variance, the overfitting
gets reduced. The models having the higher variance leads to overfitting, and we saw above, we
will shrink or reduce the beta coefficients to overcome the overfitting. The beta coefficients or the
weights of the features converge towards zero, which is known as shrinkage.
For linear regression, regularization adds a second term to the loss function. The goal of the plain
linear regression model is to minimize the loss function

    Loss = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

Now, with regularization, the goal becomes to minimize the following cost function:

    Cost = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² + penalty

where the penalty term comprises the regularization parameter and the weights associated with
the variables. Hence, the penalty term is:

    penalty = λ ∗ ‖w‖ₚ

with ‖w‖ₚ an Lₚ norm of the weight vector w.
where λ is the regularization parameter.
The regularization parameter λ imposes a higher penalty on variables with larger coefficients,
and hence it controls the strength of the penalty term. This tuning parameter controls the
bias-variance trade-off.
λ can take values from 0 to infinity. If λ = 0, there is no difference between a model with and
without regularization.
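The penalty term above is simple to compute directly. Below is a minimal NumPy sketch using hypothetical weights and a hypothetical λ value, showing the two most common choices of norm (the variable names are ours):

```python
import numpy as np

w = np.array([0.5, -2.0, 3.0])   # hypothetical regression weights
lam = 0.1                        # regularization parameter λ

l1_penalty = lam * np.sum(np.abs(w))   # Lasso-style (L1) penalty: λ * Σ|wⱼ|
l2_penalty = lam * np.sum(w ** 2)      # Ridge-style (L2) penalty: λ * Σwⱼ²
```

Larger weights contribute more to the penalty, so minimizing the total cost pushes the coefficients toward zero, exactly the shrinkage described above.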
Each of the following techniques uses a different regularization norm (Lₚ), based on the
mathematical methodology, which creates a different kind of regularization. These methodologies
have different effects on the beta coefficients of the features. The regularization techniques in
machine learning are as follows: Ridge regression (L₂ penalty), Lasso regression (L₁ penalty),
and Elastic Net regression (a combination of both).
The Elastic Net regression technique is a combination of the Ridge and Lasso regression
techniques. Its penalty is a linear combination of the penalties for both L₁-norm and L₂-norm
regularization.
A model using Elastic Net regression allows the learning of a sparse model, where some of
the coefficients are zero (similar to Lasso regularization), while still maintaining the Ridge
regression properties. The model is therefore trained on both the L₁ and L₂ norms.
• Ridge regression is used when it is important to consider all the independent variables
in the model or when many interactions are present. That is where collinearity or
codependency is present amongst the variables.
• Lasso regression is applied when there are many predictors available and we want
the model to perform feature selection for us as well.
When many variables are present, and we can't determine whether to use Ridge or Lasso
regression, then the Elastic-Net regression is your safe bet.
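To make the relationship between the three penalties concrete, here is a minimal NumPy sketch of the Elastic Net penalty as a weighted mix of the L₁ and L₂ terms. The function name and the lam/alpha parameterization are our own assumptions (they follow a common convention, not the book's notation): alpha = 1 recovers a pure Lasso penalty and alpha = 0 a pure Ridge penalty.

```python
import numpy as np

def elastic_net_penalty(w, lam=0.1, alpha=0.5):
    """Elastic Net penalty: a convex mix of L1 (Lasso) and L2 (Ridge) terms.
    alpha=1 -> pure L1; alpha=0 -> pure L2 (illustrative parameterization)."""
    l1 = np.sum(np.abs(w))   # Lasso term, encourages sparsity
    l2 = np.sum(w ** 2)      # Ridge term, shrinks all coefficients
    return lam * (alpha * l1 + (1.0 - alpha) * l2)

w = np.array([1.0, -2.0])
pen = elastic_net_penalty(w, lam=0.1, alpha=0.5)
```

Because the L₁ term can drive coefficients exactly to zero while the L₂ term keeps the rest well-behaved, this mix gives the sparse-yet-stable behavior described above.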
5.11 Dropout
"Dropout" in machine learning refers to the process of randomly ignoring certain nodes in a layer
during training. In the figure below, the neural network on the left represents a typical neural
network where all units are active. On the right, the red units have been dropped out of the
model: the values of their weights and biases are not considered during training.
Fig:5.25
Dropout is used as a regularization technique - it prevents overfitting by ensuring that no units are
codependent.
Early stopping: stop training automatically when a specific performance measure (e.g. validation
loss or accuracy) stops improving.
Weight decay: incentivize the network to use smaller weights by adding a penalty to the loss
function (this ensures that the norms of the weights are relatively evenly distributed amongst all
the weights in the network, which prevents just a few weights from heavily influencing the
network output).
Noise: allow some random fluctuations in the data through augmentation (which makes the
network robust to a larger distribution of inputs and hence improves generalization)
Model combination: average the outputs of separately trained neural networks (requires a lot of
computational power, data, and time)
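Of these alternatives, early stopping is the easiest to sketch. The helper below is a hypothetical minimal implementation (the function name and patience parameter are ours): it reports the epoch at which training would halt, namely when the validation loss has failed to improve for `patience` consecutive epochs.

```python
def early_stopping(val_losses, patience=2):
    """Return the epoch at which training stops: when validation loss has not
    improved for `patience` consecutive epochs (illustrative sketch)."""
    best, wait = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, wait = loss, 0   # improvement: reset the patience counter
        else:
            wait += 1              # no improvement this epoch
            if wait >= patience:
                return epoch       # patience exhausted: stop here
    return len(val_losses) - 1     # never triggered: train to the end

# The rise at epochs 3-4 exhausts patience, so training stops at epoch 4
stop_epoch = early_stopping([1.0, 0.8, 0.7, 0.75, 0.9, 0.6], patience=2)
```

In practice one would also restore the weights from the best epoch; this sketch only shows the stopping criterion itself.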
Dropout remains an extremely popular protective measure against overfitting because of its
efficiency and effectiveness.
When we apply dropout to a neural network, we're creating a "thinned" network with unique
combinations of the units in the hidden layers being dropped randomly at different points in time
during training. Each time the gradient of our model is updated, we generate a new thinned neural
network with different units dropped based on a probability hyperparameter p. Training a network
using dropout can thus be viewed as training many different thinned neural networks and
merging them into one network that picks up the key properties of each thinned network. This
process allows dropout to reduce the overfitting of models on training data.
This graph, taken from the paper "Dropout: A Simple Way to Prevent Neural Networks from
Overfitting" by Srivastava et al., compares the change in classification error of models without
dropout to the same models with dropout (keeping all other hyperparameters constant). All the
models have been trained on the MNIST dataset.
Fig:5.26
It is observed that the models with dropout had a lower classification error than the same models
without dropout at any given point in time. A similar trend was observed when the models were
trained on other datasets in vision, as well as in speech recognition and text analysis. The lower
error is because dropout helps prevent overfitting on the training data by reducing the reliance of
each unit in the hidden layer on other units in the hidden layers.
Although dropout is clearly a highly effective tool, it comes with certain drawbacks. A network
with dropout can take 2-3 times longer to train than a standard network. One way to attain the
benefits of dropout without slowing down training is by finding a regularizer that is essentially
equivalent to a dropout layer. For linear regression, this regularizer has been proven to be a
modified form of L2 regularization.
In an artificial neural network, the function which takes the incoming signals as
input and produces the output signal is known as the activation function.
In the process of training, we want to start with a badly performing neural network and end
up with a network with high accuracy. In terms of the loss function, we want our loss to be
much lower at the end of training. Improving the network is possible because we can change
its function by adjusting the weights: we want to find another function that performs better
than the initial one.
6. Define SGD.
In Stochastic Gradient Descent, a few samples are selected randomly instead of the whole
data set for each iteration. In Gradient Descent, there is a term called “batch” which denotes
the total number of samples from a dataset that is used for calculating the gradient for each
iteration. In typical Gradient Descent optimization, like Batch Gradient Descent, the batch is
taken to be the whole dataset.
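The definition above can be sketched in a few lines for linear regression. The toy data, function names, and hyperparameters below are our own illustration, not a library API: each step computes the MSE gradient on a small random mini-batch rather than on the whole dataset.

```python
import numpy as np

def sgd_step(w, X_batch, y_batch, lr=0.1):
    """One SGD update on a mini-batch. MSE gradient: (2/m) X^T (Xw - y)."""
    m = len(y_batch)
    grad = (2.0 / m) * X_batch.T @ (X_batch @ w - y_batch)
    return w - lr * grad

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))        # 100 samples, 2 features
true_w = np.array([3.0, -1.0])       # weights we hope to recover
y = X @ true_w                       # noiseless targets for the toy example

w = np.zeros(2)
for _ in range(200):
    idx = rng.integers(0, 100, size=8)   # random mini-batch of 8 samples
    w = sgd_step(w, X[idx], y[idx])
```

Despite each gradient being computed from only 8 of the 100 samples, the iterates converge toward the true weights, which is the point of SGD: cheap noisy steps in roughly the right direction.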
Static Back Propagation - In this type of backpropagation, a static output is produced by
mapping a static input. It is used to solve static classification problems such as optical
character recognition.
9. Define ReLU
The rectified linear activation unit, or ReLU, is one of the landmarks of the deep
learning revolution. It is simple, yet far superior to earlier activation functions such as
sigmoid or tanh.
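A one-line sketch of the function itself (the name `relu` is ours):

```python
import numpy as np

def relu(x):
    """Rectified linear unit: max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

out = relu(np.array([-2.0, 0.0, 3.0]))   # negatives clamp to 0, positives pass through
```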
PART B & C