ARTIFICIAL INTELLIGENCE
AND
MACHINE LEARNING
For IV semester (CSE Department)
As per the latest syllabus of Anna University (2021 Regulation)
ARUNACHALA PUBLICATIONS
Manavilai, Vellichanthai
First edition: 2023
Copyright © 2023 by ARUNACHALA PUBLICATIONS.
Price: Rs.515/-
Published by:
ARUNACHALA PUBLICATIONS
Manavilai, Vellichanthai,
Kanyakumari District,
Tamilnadu-629 203
Phone: 04651 200123
E-mail: acewomenscollege@gmail.com
Website: www.arunachalacollege.com
PREFACE
The fifth unit describes the basics of deep learning using neural networks.
ACKNOWLEDGEMENTS
First and foremost, we would like to thank God. In the process of putting
this book together, we realized how true this gift of writing is for us. We could
never have done this without our faith in Him.
We would like to thank our friends and faculty members for their valuable
inputs, comments, suggestions, and praise while we were writing this book.
AUTHORS
CONTENTS
UNIT –I PROBLEM SOLVING 1.1
1.1 Introduction to AI 1.1
1.1.1 AI approaches 1.2
1.1.2 History of AI 1.5
1.2 Applications of AI 1.7
1.2.1 Good Behaviour: The Concept of Rationality 1.14
1.2.2 The Nature of Environment 1.16
1.2.3 The Structure of Agents 1.20
1.2.3.1 Simple Reflex agent 1.22
1.2.3.2 Model-based Reflex agent 1.24
1.2.3.3 Goal-Based agent 1.25
1.2.3.4 Utility-Based agent 1.26
1.2.3.5 Learning Agent 1.27
1.3 Problem-solving agents 1.29
1.3.1 Search problems and solutions 1.30
1.3.2 Formulating problems 1.31
1.3.3 Example Problems 1.31
1.3.3.1 Toy problems 1.31
1.3.3.2 Real World Problems 1.34
1.3.3.3 Water jug problem 1.36
1.4 Search Algorithms 1.38
1.4.1 Best-first search 1.39
1.5 Uninformed Search Strategies 1.41
1.5.1 Breadth-first search 1.41
1.5.2 Uniform-cost search 1.42
1.5.3 Depth-first search 1.42
1.5.4 Depth-limited search 1.41
1.5.5 Iterative deepening search 1.44
1.5.6 Bidirectional search 1.44
1.6 Informed (Heuristic) Search Strategies 1.46
1.6.1.1 Greedy Best-first search 1.46
1.6.1.2 A* search (A-star search) 1.48
1.6.1.3 Memory-bounded search 1.50
1.6.1.4 Iterative-deepening A* search 1.51
1.6.1.5 Recursive Best-first search 1.51
1.6.2. Heuristic Functions 1.53
1.7 Local Search and Optimization Problems 1.59
1.7.1.1 Hill Climbing search 1.59
1.7.1.2 Simulated Annealing: 1.61
1.7.1.3 Local Beam Search 1.62
1.7.1.4 Evolutionary algorithms 1.62
1.7.2 Local Search in Continuous Spaces 1.64
1.7.3 Search with Nondeterministic Actions 1.67
1.7.3.1 The Erratic (Unpredictable) Vacuum World 1.67
1.7.3.2 AND-OR Search Trees 1.69
1.7.3.3 Try, Try Again For Vacuum World 1.70
1.7.4 Search in Partially Observable Environments 1.71
1.7.4.1 Searching with no observation 1.71
1.7.4.2 Searching in partially observable environments 1.75
1.7.4.3 Solving partially observable problems 1.76
1.7.4.4 An agent for partially observable environments 1.77
1.7.5 Online Search agents and Unknown Environments 1.79
1.7.5.1 Online Search Problems 1.79
1.7.5.2 Online search agents 1.81
1.7.5.3 Online Local search 1.82
1.7.5.4 Learning in Online Search 1.84
1.8 Adversarial search 1.84
1.8.1 Game Theory 1.84
1.8.1.1 Two-player zero-sum games 1.83
1.8.2 Optimal Decisions in Games 1.86
1.8.2.1 The minimax search algorithm 1.87
1.8.2.2 Optimal decisions in multiplayer games 1.88
1.8.2.3 Alpha-Beta Pruning 1.89
1.8.3 Monte Carlo Tree Search 1.92
1.8.4 Stochastic Games 1.96
1.8.4.1 Evaluation functions for games of chance 1.97
1.8.5 Partially Observable Games 1.98
1.8.5.1 Kriegspiel: partially observable chess 1.98
1.8.5.2 Card games 1.100
1.9 Constraint Satisfaction Problems 1.101
1.9.1.1 Defining Constraint Satisfaction Problems 1.101
1.9.1.2 Example Problem: Map Coloring 1.101
1.9.1.3 Variations on the CSP 1.103
1.9.2 Constraint Propagation 1.105
1.9.2.1 Node consistency 1.105
1.9.2.2 Arc consistency 1.106
1.9.2.3 Path consistency 1.107
1.9.2.4 K-consistency 1.107
1.9.2.5 Global constraints 1.107
1.9.2.6 Sudoku 1.108
1.9.3 Backtracking Search for CSPs 1.109
1.9.3.1 Variable and value ordering 1.110
1.9.3.2 Interleaving search and inference 1.111
1.9.3.3 Intelligent backtracking 1.112
1.9.3.4 Constraint learning 1.113
1.9.4 Local Search for CSPs 1.113
1.9.5 The Structure of Problems 1.115
1.9.5.1 Cutset conditioning 1.116
1.9.5.2 Tree Decomposition 1.117
1.9.5.3 Value symmetry 1.117
UNIT-II PROBABILISTIC REASONING 2.1
2.1 Acting under uncertainty 2.1
2.1.1 Basic Probability Notation 2.3
2.1.2 Inference Using Full Joint Distributions 2.6
2.1.3 Independence 2.8
2.1.4 Bayes’ Rule and Its Use 2.9
2.2 Naïve Bayes models 2.10
2.2.1 Text classification with naive Bayes 2.11
2.3 Probabilistic reasoning 2.12
2.3.1 Representing Knowledge in an Uncertain Domain
2.4 Bayesian networks 2.12
2.4.1 The Semantics of Bayesian Networks 2.14
2.4.2 Conditional independence relations in Bayesian networks 2.16
2.5 Exact inference in BN 2.17
2.5.1 Inference by enumeration 2.18
2.5.2 The variable elimination algorithm 2.20
2.6 Approximate inference in BN 2.24
2.6.1 Direct sampling methods 2.24
2.6.2 Inference by Markov chain simulation 2.28
2.7 Causal networks 2.29
2.7.1 Representing actions: The do-operator 2.30
2.7.2 The back-door criterion 2.32
UNIT III SUPERVISED LEARNING 3.1
3.1 Introduction to machine learning 3.2
3.2 Linear Regression 3.7
3.2.1 Least square regression 3.8
3.2.2 Single and Multiple Variables 3.10
3.2.3 Bayesian Regression 3.12
3.2.4 Gradient Descent in Machine Learning 3.14
3.3 Classification Models 3.19
3.3.1 Discriminant Functions 3.19
3.4 Probabilistic Discriminant Functions 3.21
3.4.1 Logistic Regression in ML 3.22
3.5 Probabilistic Generative Models 3.25
3.5.1 Naïve Bayes Classifier Algorithm 3.26
3.6 Maximum Margin Classifier 3.30
3.6.1 Support Vector Machine 3.30
3.7 Decision Tree Classification Algorithm 3.36
3.7.1 Algorithm for Decision Tree 3.40
3.7.2 Advantages of Decision Tree 3.44
3.7.3 Disadvantages of Decision Tree 3.44
3.8 Random Forest Algorithm 3.44
3.8.1 The Working Process 3.46
3.8.2 Application of Random Forest 3.46
3.8.3 Advantages of Random Forest 3.47
3.8.4 Disadvantages of Random Forest 3.47
UNIT IV ENSEMBLE TECHNIQUES AND UNSUPERVISED LEARNING 4.1
4.1 Combining Multiple Learners 4.1
4.1.1 Model Combination Schemes 4.2
4.1.2 Voting 4.3
4.1.3 Error-Correcting Output Codes 4.5
4.2 Ensemble Learning 4.6
4.2.1 Bagging 4.7
4.2.2 Boosting 4.9
4.2.3 Stacking 4.9
4.2.4 Adaboost 4.14
4.2.5 Difference between Bagging and Boosting 4.15
4.3 Clustering 4.16
4.3.1 Unsupervised Learning: K-Means 4.19
UNIT-1
PROBLEM SOLVING
1.1 INTRODUCTION TO AI
AI is composed of two words, Artificial and Intelligence, where Artificial means "man-
made" and Intelligence means "thinking power". Hence, AI means "man-made thinking
power". Artificial Intelligence exists when a machine can exhibit human skills such as
learning, reasoning, and problem solving. With AI you do not need to pre-program a machine
to do some work; instead, you can create a machine with programmed algorithms which
can work with its own intelligence. Definition: AI is a branch of computer science by which we
can create intelligent machines which can behave like humans, think like humans, and
are able to make decisions.
The definitions of AI according to some text books are categorized into four approaches and
are summarized below:
1. 1
CS3491 – ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING
"The exciting new effort to make computers think … machines with minds, in the full
and literal sense." (Haugeland, 1985)
"The study of mental faculties through the use of computer models." (Charniak and
McDermott, 1985)
"The art of creating machines that perform functions that require intelligence when
performed by people." (Kurzweil, 1990)
"Computational intelligence is the study of the design of intelligent agents." (Poole et
al., 1998)
Before learning about Artificial Intelligence, we should know why AI is important and
why we should learn it. The following are some main reasons to learn about AI:
o With the help of AI, you can create software or devices which can solve real-world
problems very easily and accurately, such as health issues, marketing, traffic issues,
etc.
o With the help of AI, you can create your personal virtual Assistant, such as Cortana,
Google Assistant, Siri, etc.
o With the help of AI, you can build such Robots which can work in an environment
where survival of humans can be at risk.
o AI opens a path for other new technologies, new devices, and new Opportunities.
o Acting humanly
o Thinking humanly
o Thinking rationally
o Acting rationally
1.1.1 AI Approaches
The Turing test, proposed by Alan Turing (1950), was designed as a thought
experiment that would sidestep the philosophical vagueness of the question “Can a machine
think?” A computer passes the test if a human interrogator, after posing some written questions,
cannot tell whether the written responses come from a person or from a computer.
Acting humanly is the art of creating machines that perform functions requiring
intelligence when performed by people; it is the study of how to make computers do things
which, at the moment, people do better. The focus is on action, not on intelligent behaviour
centered around the representation of the world.
Machine learning to adapt to new circumstances and to detect and extrapolate patterns.
The Total Turing Test includes a video signal so that the interrogator can test the subject's
perceptual abilities, as well as the opportunity for the interrogator to pass physical objects
"through the hatch".
To say that a program thinks like a human, we must know how humans think. We can
learn about human thought in three ways: through introspection (trying to catch our own
thoughts as they go by); through psychological experiments (observing a person in action);
and through brain imaging (observing the brain in action).
Once we have a sufficiently precise theory of the mind, it becomes possible to express
the theory as a computer program. If the program’s input–output behaviour matches
corresponding human behaviour, that is evidence that some of the program’s mechanisms could
also be operating in humans.
Aristotle was one of the first to attempt to codify "right thinking", that is, irrefutable
reasoning processes. His syllogisms provided patterns for argument structures that always
yielded correct conclusions when given correct premises.
For example: Socrates is a man; all men are mortal; therefore, Socrates is mortal.
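The pattern of the syllogism (from "Socrates is a man" and "all men are mortal", conclude "Socrates is mortal") can be illustrated with a tiny forward-chaining sketch. The fact and rule representation below is a hypothetical illustration, not a standard library:

```python
# Minimal forward chaining over one syllogistic rule:
# from ("man", X), infer ("mortal", X).

facts = {("man", "Socrates")}

def apply_rule(facts):
    """All men are mortal: for every ("man", X), add ("mortal", X)."""
    derived = {("mortal", x) for (pred, x) in facts if pred == "man"}
    return facts | derived

facts = apply_rule(facts)
print(("mortal", "Socrates") in facts)  # True
```

Given correct premises, the rule mechanically yields the correct conclusion, which is exactly the property Aristotle's syllogisms were meant to capture.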
An agent is just something that acts. All computer programs do something, but
computer agents are expected to do more: operate autonomously, perceive their environment,
persist over a prolonged time period, adapt to change, and create and pursue goals. A rational
agent is one that acts so as to achieve the best outcome or, when there is uncertainty, the best
expected outcome.
In the "laws of thought" approach to AI, the emphasis was on correct inferences.
Making correct inferences is sometimes part of being a rational agent, because one way to act
rationally is to deduce that a given action is best and then to act on that conclusion. There are
also ways of acting rationally that cannot be said to involve inference; for example, recoiling
from a hot stove is a reflex action that is usually more successful than a slower action taken
after careful deliberation. All the skills needed for the Turing test also allow
an agent to act rationally. Knowledge representation and reasoning enable agents to reach good
decisions. We need to be able to generate comprehensible sentences in natural language to get
by in a complex society. We need learning not only for erudition, but also because it improves
our ability to generate effective behaviour, especially in circumstances that are new.
The rational-agent approach to AI has two advantages over the other approaches. First,
it is more general than the “laws of thought” approach because correct inference is just one of
several possible mechanisms for achieving rationality. Second, it is more amenable to scientific
development. The standard of rationality is mathematically well defined and completely
general.
1.1.2 History of AI
o Year 1943: The first work which is now recognized as AI was done by Warren
McCulloch and Walter Pitts in 1943. They proposed a model of artificial neurons.
o Year 1949: Donald Hebb demonstrated an updating rule for modifying the connection
strength between neurons. His rule is now called Hebbian learning.
o Year 1950: Alan Turing, an English mathematician, pioneered machine learning in
1950. He published "Computing Machinery and Intelligence", in which he proposed
a test, now called the Turing test, that checks a machine's ability to exhibit
intelligent behaviour equivalent to human intelligence.
o Year 1955: Allen Newell and Herbert A. Simon created the first artificial
intelligence program, named "Logic Theorist". This program proved 38 of 52
mathematics theorems and found new, more elegant proofs for some of them.
o Year 1956: The term "Artificial Intelligence" was first adopted by American computer
scientist John McCarthy at the Dartmouth Conference. For the first time, AI was coined
as an academic field.
Around that time, high-level computer languages such as FORTRAN, LISP, and COBOL
were invented, and the enthusiasm for AI was very high.
o Year 1966: Researchers emphasized developing algorithms which could solve
mathematical problems. Joseph Weizenbaum created the first chatbot, named
ELIZA, in 1966.
o Year 1972: The first intelligent humanoid robot, named WABOT-1, was built in Japan.
o The period between 1974 and 1980 was the first AI winter. An AI winter refers to a
time period in which computer scientists dealt with a severe shortage of government
funding for AI research.
o During AI winters, public interest in artificial intelligence decreased.
A boom of AI (1980-1987)
o Year 1980: After the AI winter, AI came back with "Expert Systems". Expert
systems were programs that emulate the decision-making ability of a human expert.
o In the year 1980, the first national conference of the American Association for
Artificial Intelligence was held at Stanford University.
o The period between 1987 and 1993 was the second AI winter.
o Investors and governments again stopped funding AI research due to high cost and
inefficient results. The expert system XCON had been very cost effective.
o Year 1997: In 1997, IBM's Deep Blue beat world chess champion Garry Kasparov,
becoming the first computer to beat a world chess champion.
o Year 2002: For the first time, AI entered the home in the form of Roomba, a robotic
vacuum cleaner.
o Year 2006: By 2006, AI had entered the business world. Companies like Facebook,
Twitter, and Netflix started using AI.
o Year 2011: In 2011, IBM's Watson won Jeopardy!, a quiz show in which it had
to solve complex questions as well as riddles. Watson proved that it could
understand natural language and solve tricky questions quickly.
o Year 2012: Google launched the Android app feature "Google Now", which could
provide predictive information to the user.
o Year 2014: In 2014, the chatbot "Eugene Goostman" won a competition based on the
famous "Turing test".
o Year 2018: IBM's "Project Debater" debated complex topics with two master
debaters and performed extremely well.
o Google demonstrated an AI program, "Duplex", a virtual assistant that booked a
hairdresser appointment over the phone; the lady on the other end did not notice
that she was talking to a machine.
1.2 Applications of AI
1. AI in Astronomy
o Artificial Intelligence can be very useful to solve complex universe problems. AI
technology can be helpful for understanding the universe such as how it works, origin,
etc.
2. AI in Healthcare
o In the last five to ten years, AI has become more advantageous for the healthcare
industry and is going to have a significant impact on it.
o Healthcare industries are applying AI to make better and faster diagnoses than
humans. AI can help doctors with diagnoses and can indicate when patients are
worsening so that medical help can reach the patient before hospitalization.
3. AI in Gaming
o AI can be used for gaming purposes. AI machines can play strategic games like
chess, where the machine needs to think about a large number of possible positions.
4. AI in Finance
o AI and finance industries are the best matches for each other. The finance industry is
implementing automation, chatbot, adaptive intelligence, algorithm trading, and
machine learning into financial processes.
5. AI in Data Security
o The security of data is crucial for every company, and cyber-attacks are growing very
rapidly in the digital world. AI can be used to make your data more safe and secure.
Some examples, such as the AEG bot and the AI2 Platform, are used to detect software
bugs and cyber-attacks in a better way.
6. AI in Social Media
o Social Media sites such as Facebook, Twitter, and Snapchat contain billions of user
profiles, which need to be stored and managed in a very efficient way. AI can organize
and manage massive amounts of data. AI can analyze lots of data to identify the latest
trends, hashtag, and requirement of different users.
8. AI in Automotive Industry
o Some automotive industries are using AI to provide virtual assistants to their users for
better performance. For example, Tesla has introduced TeslaBot, an intelligent virtual
assistant.
o Various industries are currently working on developing self-driving cars, which can
make your journey safer and more secure.
9. AI in Robotics:
o Artificial Intelligence has a remarkable role in robotics. Usually, general robots are
programmed to perform some repetitive task, but with the help of AI,
we can create intelligent robots which can perform tasks from their own experience
without being pre-programmed.
o Humanoid robots are the best examples of AI in robotics; recently, the intelligent
humanoid robots named Erica and Sophia have been developed, and they can talk and
behave like humans.
10. AI in Entertainment
o We are currently using some AI based applications in our daily life with some
entertainment services such as Netflix or Amazon. With the help of ML/AI algorithms,
these services show the recommendations for programs or shows.
11. AI in Agriculture
o Agriculture is an area which requires various resources, labor, money, and time for the
best result. Nowadays agriculture is becoming digital, and AI is emerging in this field.
Agriculture is applying AI in the form of agricultural robotics, soil and crop monitoring,
and predictive analysis. AI in agriculture can be very helpful for farmers.
12. AI in E-commerce
o AI is providing a competitive edge to the e-commerce industry, and it is increasingly
in demand in the e-commerce business. AI helps shoppers discover associated
products in their recommended size, color, or even brand.
13. AI in education:
o AI can automate grading so that the tutor can have more time to teach. AI chatbot can
communicate with students as a teaching assistant.
o In the future, AI can work as a personal virtual tutor for students, easily accessible
at any time and any place.
Advantages of Artificial Intelligence
o High accuracy with fewer errors: AI machines or systems make fewer errors and
achieve high accuracy, as they take decisions based on prior experience or information.
o High speed: AI systems can make decisions very quickly; because of this, AI systems
can beat a chess champion at chess.
o High reliability: AI machines are highly reliable and can perform the same action
multiple times with high accuracy.
o Useful for risky areas: AI machines can be helpful in situations such as defusing a
bomb or exploring the ocean floor, where employing a human would be risky.
o Digital assistant: AI can be very useful for providing digital assistance to users; for
example, AI technology is currently used by various e-commerce websites to show
products matching customer requirements.
o Useful as a public utility: AI can be very useful for public utilities such as a self-
driving car which can make our journey safer and hassle-free, facial recognition for
security purpose, Natural language processing to communicate with the human in
human-language, etc.
Every technology has some disadvantages, and the same goes for Artificial Intelligence.
However advantageous the technology is, it still has some disadvantages that we need to keep
in mind while creating an AI system. The following are the disadvantages of AI:
o High cost: The hardware and software requirements of AI are very costly, as AI
requires a lot of maintenance to meet current world requirements.
o Can't think outside the box: Even though we are making smarter machines with AI,
they still cannot work outside the box; a robot will only do the work for which it is
trained or programmed.
o No feelings and emotions: An AI machine can be an outstanding performer, but it
does not have feelings, so it cannot form any kind of emotional attachment with
humans, and it may sometimes be harmful to users if proper care is not taken.
o Increased dependency on machines: With the advance of technology, people are
becoming more dependent on devices and hence are losing their mental capabilities.
o No original creativity: Humans are creative and can imagine new ideas, but AI
machines cannot match this power of human intelligence and cannot be creative
and imaginative.
INTELLIGENT AGENTS
Rationality:
The rationality of an agent is measured by its performance measure. Rationality can be judged
on the basis of the following points:
o The performance measure that defines the criterion of success.
o The agent's prior knowledge of the environment.
o The actions that the agent can perform.
o The agent's percept sequence to date.
An agent is anything that can be viewed as perceiving its environment through sensors
and acting upon that environment through actuators. This simple idea is illustrated
in Figure 1.2.
A human agent has eyes, ears, and other organs for sensors and hands, legs, mouth,
and other body parts for actuators.
A robotic agent might have cameras and infrared range finders for sensors and various
motors for actuators.
A software agent receives keystrokes, file contents, and network packets as sensory
inputs and acts on the environment by displaying on the screen, writing files, and
sending network packets.
Sensor: Sensor is a device which detects the change in the environment and sends the
information to other electronic devices. An agent observes its environment through
sensors.
Actuators: Actuators are the components of machines that convert energy into motion.
The actuators are responsible for moving and controlling a system. An actuator
can be an electric motor, gears, rails, etc.
Effectors: Effectors are the devices which affect the environment. Effectors can be
legs, wheels, arms, fingers, wings, fins, and display screen.
Fig 1.2: Agents interact with environments through sensors and actuators
Agent Terminology
Percept
We use the term percept to refer to the agent's perceptual inputs at any given instant.
Percept Sequence
An agent's percept sequence is the complete history of everything the agent has ever
perceived.
Agent function
Mathematically speaking, we say that an agent's behaviour is described by the agent
function that maps any given percept sequence to an action.
f : P* → A
AGENT PROGRAM
The agent function for an artificial agent will be implemented by an agent program.
It is important to keep these two ideas distinct.
The agent function is an abstract mathematical description
The agent program is a concrete implementation, running on the agent architecture
To illustrate these ideas, we will use a very simple example-the vacuum-cleaner world
shown in Figure 1.3.
This particular world has just two locations: squares A and B.
The vacuum agent perceives which square it is in and whether there is dirt in the square.
It can choose to move left, move right, suck up the dirt, or do nothing.
One very simple agent function is the following:
if the current square is dirty, then suck; otherwise, move to the other square.
Agent Program
function REFLEX-VACUUM-AGENT([location, status]) returns an action
if status = Dirty then return Suck
else if location = A then return Right
else if location = B then return Left
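The agent program above can be sketched in Python. This is a minimal illustration; the string names for locations, statuses, and actions follow the two-square world described earlier:

```python
# A simple reflex vacuum agent for the two-square world (squares A and B).
# The agent looks only at the current percept: (location, status).

def reflex_vacuum_agent(location, status):
    """Return an action based solely on the current percept."""
    if status == "Dirty":
        return "Suck"
    elif location == "A":
        return "Right"
    else:  # location == "B"
        return "Left"

# Example percepts and the actions they trigger:
print(reflex_vacuum_agent("A", "Dirty"))  # Suck
print(reflex_vacuum_agent("A", "Clean"))  # Right
print(reflex_vacuum_agent("B", "Clean"))  # Left
```

Note that the function ignores everything except the current percept, which is precisely what makes it a *reflex* agent.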
A rational agent is one that does the right thing. Obviously, doing the right thing is
better than doing the wrong thing, but what does it mean to do the right thing?
Performance measures
Moral philosophy has developed several different notions of the “right thing,” but AI
has generally stuck to one notion called consequentialism: we evaluate an agent’s
behaviour by its consequences.
When an agent is plunked down in an environment, it generates a sequence of actions
according to the percept it receives. This sequence of actions causes the environment
to go through a sequence of states.
If the sequence is desirable, then the agent has performed well. This notion of
desirability is captured by a performance measure that evaluates any given sequence of
environment states.
Humans have desires and preferences of their own. Machines, on the other hand, do not
have desires and preferences of their own; the performance measure is, initially at least,
in the mind of the designer of the machine, or in the mind of the users the machine is
designed for.
Consider, for example, the vacuum-cleaner agent. We might propose to measure
performance by the amount of dirt cleaned up in a single eight-hour shift.
A rational agent can maximize this performance measure by cleaning up the dirt, then
dumping it all on the floor, then cleaning it up again, and so on.
A more suitable performance measure would reward the agent for having a clean floor.
For example, one point could be awarded for each clean square at each time step.
As a general rule, it is better to design performance measures according to what one
actually wants to be achieved in the environment, rather than according to how one
thinks the agent should behave.
Rationality
For each possible percept sequence, a rational agent should select an action that is
expected to maximize its performance measure, given the evidence provided by the
percept sequence and whatever built-in knowledge the agent has.
Consider the simple vacuum-cleaner agent that cleans a square if it is dirty and moves to the
other square if not (as shown in the agent function table). Is this a rational agent? That depends
on what the performance measure is, what is known about the environment, and what sensors
and actuators the agent has. Let us assume the following:
The performance measure awards one point for each clean square at each time step,
over a “lifetime” of 1000 time steps.
Clean squares stay clean and sucking cleans the current square.
The Left and Right actions move the agent one square except when this would take the
agent outside the environment, in which case the agent remains where it is.
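Under the assumptions above (one point per clean square per time step, over a 1000-step lifetime), the agent's score can be computed with a short simulation. This is an illustrative sketch under a hypothetical start state in which both squares are dirty and the agent starts in square A:

```python
# Simulate the reflex vacuum agent and total its score:
# one point per clean square at each time step.

def reflex_vacuum_agent(location, status):
    if status == "Dirty":
        return "Suck"
    return "Right" if location == "A" else "Left"

def simulate(steps=1000):
    world = {"A": "Dirty", "B": "Dirty"}   # both squares start dirty
    location = "A"
    score = 0
    for _ in range(steps):
        action = reflex_vacuum_agent(location, world[location])
        if action == "Suck":
            world[location] = "Clean"      # sucking cleans the current square
        elif action == "Right":
            location = "B"
        elif action == "Left":
            location = "A"
        # award one point for each square that is clean at this time step
        score += sum(1 for s in world.values() if s == "Clean")
    return score

print(simulate())  # 1998 under these assumptions
```

Both squares are clean after three steps, so the agent collects nearly the maximum of 2000 points; under these assumptions the agent is rational.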
An omniscient agent knows the actual outcome of its actions and can act
accordingly; but omniscience is impossible in reality.
Doing actions in order to modify future percepts is sometimes called information
gathering.
A rational agent should not only gather information but also learn as much as possible
from what it perceives. The agent's initial configuration could reflect some prior
knowledge of the environment, but as the agent gains experience this may be
modified and augmented.
To the extent that an agent relies on the prior knowledge of its designer rather than on its
own percepts and learning processes, we say that the agent lacks autonomy. A rational
agent should be autonomous; it should learn what it can to compensate for partial
or incorrect prior knowledge.
All these are grouped together under the heading of the task environment.
We call this the PEAS (Performance, Environment, Actuators, Sensors) description.
In designing an agent, the first step must always be to specify the task environment as
fully as possible.
The following table shows PEAS description of the task environment for an automated
taxi.
Fig 1.4 PEAS description for an automated taxi
The following table shows PEAS description of the task environment for some other agent
type.
Partially observable
Episodic
In an episodic environment, the agent's current decision does not depend on earlier episodes.
Example: Consider a pick-and-place robot used to detect defective parts on conveyor
belts. Here, every time, the robot (agent) makes its decision on the current part alone,
i.e., there is no dependency between the current and previous decisions.
Sequential
An agent requires memory of past actions to determine the next best actions.
The current decision could affect all future decisions.
Example: Chess and Checkers, where a previous move can affect all the following
moves.
4. Single agent
If only one agent is involved in an environment, and operating by itself then such an
environment is called single agent environment.
Example: maze, A person left alone in a maze is an example of the single-agent
system.
Multiple agents
If multiple agents are involved in an environment, then it is called a multi- agent
environment.
Example: Football, The game of football is multi-agent as it involves 11 players in
each team.
5. Static
The environment does not change while an agent is acting.
Example: Crossword puzzles
An agent program takes the current percept as input, while the agent function takes the
entire percept history.
The current percept is taken as input to the agent program because nothing more is
available from the environment.
The following TABLE-DRIVEN_AGENT program is invoked for each new percept
and returns an action each time
Drawbacks:
Table lookup of percept-action pairs defining all possible condition-action rules
necessary to interact in an environment
Problems
Too big to generate and to store (Chess has about 10^120 states, for
example)
No knowledge of non-perceptual parts of the current state
Not adaptive to changes in the environment; requires entire table to be
updated if changes occur
Looping: Can't make actions conditional
Takes a long time to build the table
No autonomy
Even with learning, it needs a long time to learn the table entries
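The table-driven scheme above can be sketched as follows. The tiny table here is a hypothetical fragment for the vacuum world; the key point is that the lookup key is the *entire* percept sequence so far, which is why the table becomes unmanageably large:

```python
# Table-driven agent: the action is looked up using the complete
# percept sequence as the key. Even for the two-square vacuum world
# the table grows with every time step, the drawback noted above.

class TableDrivenAgent:
    def __init__(self, table):
        self.table = table       # maps percept sequences (tuples) to actions
        self.percepts = []       # complete percept history

    def __call__(self, percept):
        self.percepts.append(percept)
        # look up the whole history; None if the table has no entry
        return self.table.get(tuple(self.percepts))

# A fragment of the (in principle enormous) lookup table:
table = {
    (("A", "Dirty"),): "Suck",
    (("A", "Clean"),): "Right",
    (("A", "Clean"), ("B", "Dirty")): "Suck",
}

agent = TableDrivenAgent(table)
print(agent(("A", "Clean")))   # Right
print(agent(("B", "Dirty")))   # Suck
```

After only two percepts the table already needs entries for every possible two-percept history; for realistic environments this blows up exactly as the drawbacks list describes.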
Some Agent Types
Table-driven agents
use a percept sequence/action table in memory to find the next action. They
are implemented by a (large) lookup table.
Simple reflex agents
are based on condition-action rules, implemented with an appropriate
production system. They are stateless devices which do not have memory of past
world states.
Agents with memory
have internal state, which is used to keep track of past states of the
world.
Utility-based agents
base their decisions on classic axiomatic utility theory in order to act
rationally.
Kinds of Agent Programs
The following are the agent programs,
Fig 1.7 The agent program for a simple reflex agent in the two-state vacuum
environment.
Characteristics
Only works if the environment is fully observable.
Lacking history, it can easily get stuck in infinite loops.
One solution is to randomize its actions.
1.2.3.2 Model-based Reflex agent
A model-based reflex agent is an intelligent agent that uses percept history and internal
memory to maintain a "model" of the world around it and make decisions.
The Model-based agent can work in a partially observable environment, and track the
situation.
Model- Knowledge about “how the things happen in the world”.
Internal state- It is a representation of unobserved aspects of current state depending
on percept history.
Updating the state requires the information about
How the world evolves.
How the agent’s actions affect the world.
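The state-update idea above can be sketched as follows (an illustrative Python skeleton; the rules and update_state functions are hypothetical placeholders supplied by the designer):

```python
def make_model_based_agent(rules, update_state):
    # Internal state persists between calls; it is the agent's model of the world.
    state, last_action = {}, None
    def agent(percept):
        nonlocal state, last_action
        # Fold the new percept and the last action into the internal state,
        # using knowledge of how the world evolves and how actions affect it.
        state = update_state(state, last_action, percept)
        last_action = rules(state)      # condition-action rules applied to the model
        return last_action
    return agent
```

The closure keeps the internal state between percepts, which is exactly what lets this agent operate in a partially observable environment.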
The last component of the learning agent is the problem generator. It is responsible
for suggesting actions that will lead to new and informative experiences. But if the
agent is willing to explore a little, it might discover much better actions for the long
run. The problem generator's job is to suggest these exploratory actions. This is what
scientists do when they carry out experiments.
Problem Formulation
An important aspect of intelligence is goal-based problem solving.
The solution of many problems can be described by finding a sequence of actions that
lead to a desirable goal.
Each action changes the state and the aim is to find the sequence of actions and states
that lead from the initial (start) state to a final (goal) state.
What is Search?
a) Search is the systematic examination of states to find path from the start/root
state to the goal state.
b) The set of possible states, together with operators defining their connectivity
constitute the search space.
c) The output of a search algorithm is a solution, that is, a path from the initial
state to a state that satisfies the goal test.
Fig: 1.13
A transition model, which describes what each action does. RESULT (s,a)
returns the state that results from doing action a in state s. For example,
The state space can be represented as a graph in which the vertices are states
and the directed edges between them are actions. The map of Romania shown
in figure is such a graph, where each road indicates two actions, one in each
direction
1.3.2 Formulating problems
We derive a formulation of the problem in terms of the initial state, successor
function, goal test, and path cost.
Our formulation of the problem of getting to Bucharest is a model—an abstract
mathematical description—and not the real thing.
Compare the simple atomic state description Arad to an actual cross-country
trip, where the state of the world includes so many things: the traveling
companions, the current radio program, the scenery out of the window, the
proximity of law enforcement officers, the distance to the next rest stop, the
condition of the road, the weather, the traffic, and so on.
All these considerations are left out of our model because they are irrelevant to
the problem of finding a route to Bucharest.
The process of removing detail from a representation is called abstraction.
Fig: 1.14
b. 8-puzzle:
An 8-puzzle consists of a 3x3 board with eight numbered tiles and a blank space.
A tile adjacent to the blank space can slide into the space. The object is to reach the
specific goal state, as shown in figure 1.15.
Fig: 1.15
Goal Test: This checks whether the state matches the goal configuration shown in
figure 1.15. (Other goal configurations are possible.)
Path cost: Each step costs 1, so the path cost is the number of steps in the path.
The 8-puzzle belongs to the family of sliding-block puzzles, which are often used as
test problems for new search algorithms in AI.
Finding optimal solutions for this general class of puzzles is known to be NP-complete.
The 8-puzzle has 9!/2 = 181,440 reachable states and is easily solved.
The 15-puzzle (on a 4 x 4 board) has around 1.3 trillion states, and random instances can
be solved optimally in a few milliseconds by the best search algorithms.
The 24-puzzle (on a 5 x 5 board) has around 10^25 states, and random instances are still
quite difficult to solve optimally with current machines and algorithms.
c. 8-queens problem
The goal of 8-queens problem is to place 8 queens on the chessboard such that no queen
attacks any other. (A queen attacks any piece in the same row, column or diagonal).
The following figure shows an attempted solution that fails: the queen in the rightmost
column is attacked by the queen at the top left.
An incremental formulation involves operators that augment the state description;
starting with an empty board for the 8-queens problem, each action adds a queen to
the state.
A complete-state formulation starts with all 8 queens on the board and moves them
around.
In either case the path cost is of no interest because only the final state counts.
ROUTE-FINDING PROBLEM
TOURING PROBLEMS
Touring problems are closely related to route-finding problems, but with an important
difference.
Consider for example, the problem, "Visit every city at least once" as shown in Romania
map.
As with route-finding the actions correspond to trips between adjacent cities.
Initial state would be "In Bucharest; visited {Bucharest}". Intermediate state would be
"In Vaslui; visited {Bucharest, Urziceni, Vaslui}".
Goal test would check whether the agent is in Bucharest and all 20 cities have been
visited.
VLSI layout
A VLSI layout problem requires positioning millions of components and connections
on a chip to minimize area, minimize circuit delays, minimize stray capacitances, and
maximize manufacturing yield.
The layout problem is split into two parts: cell layout and channel routing.
ROBOT navigation
In the water jug problem in Artificial Intelligence, we are provided with two jugs:
one having the capacity to hold 3 gallons of water and the other has the capacity to
hold 4 gallons of water. There is no other measuring equipment available and the jugs
also do not have any kind of marking on them. So, the agent’s task here is to fill the 4-
gallon jug with 2 gallons of water by using only these two jugs and no other material.
Initially, both our jugs are empty.
So, to solve this problem, the following set of rules was proposed, as shown in figure 1.16:
Production rules for solving the water jug problem
Here, let x denote the 4-gallon jug and y denote the 3-gallon jug.
The listed production rules contain all the actions that could be performed by the agent in
transferring the contents of the jugs. But to solve the water jug problem in a minimum number
of moves, the following rules should be applied in the given sequence, as shown in figure 1.17:
Fig 1.16
Fig 1.17 Solution of water jug problem according to the production rules
On reaching the 7th attempt, we reach a state which is our goal state. Therefore,
at this state, our problem is solved.
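The same solution can be found mechanically by a breadth-first search over (x, y) states, where x is the 4-gallon jug and y is the 3-gallon jug (an illustrative Python sketch, not the production-rule trace of figure 1.17):

```python
from collections import deque

def water_jug():
    # States are (x, y): gallons in the 4-gallon and 3-gallon jugs.
    start, goal = (0, 0), 2                 # goal: 2 gallons in the 4-gallon jug
    def successors(x, y):
        return {(4, y), (x, 3), (0, y), (x, 0),        # fill / empty either jug
                (min(4, x + y), max(0, y - (4 - x))),  # pour y into x
                (max(0, x - (3 - y)), min(3, x + y))}  # pour x into y
    frontier, parents = deque([start]), {start: None}
    while frontier:                         # breadth-first: shortest move sequence
        state = frontier.popleft()
        if state[0] == goal:
            path = []
            while state is not None:
                path.append(state); state = parents[state]
            return path[::-1]
        for s in successors(*state):
            if s not in parents:
                parents[s] = state; frontier.append(s)
    return None
```

Because the search is breadth-first, the returned path is a minimum-length sequence of fills, empties, and pours from (0, 0) to a state with 2 gallons in the 4-gallon jug.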
Fig: 1.18
Fig 1.19 The best-first search algorithm, and the function for expanding a node.
Search data structures
Search algorithms require a data structure to keep track of the search tree. A node in
the tree is represented by a data structure with four components:
node.STATE: the state to which the node corresponds;
node.PARENT: the node in the tree that generated this node;
node.ACTION: the action that was applied to the parent’s state to generate this node;
node.PATH-COST: the total cost of the path from the initial state to this node. In
mathematical formulas, we use g(node) as a synonym for PATH-COST.
We need a data structure to store the frontier.
The appropriate choice is a queue of some kind, because the operations on a frontier
are:
IS-EMPTY(frontier) returns true only if there are no nodes in the frontier.
POP(frontier) removes the top node from the frontier and returns it.
TOP(frontier) returns (but does not remove) the top node of the frontier.
ADD(node, frontier) inserts node into its proper place in the queue.
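These four frontier operations map naturally onto a binary-heap priority queue, as in this Python sketch (the node fields mirror STATE, PARENT, ACTION, and PATH-COST; the two pushed nodes use the Romania step costs Arad-Sibiu 140 and Arad-Zerind 75 as illustrative values):

```python
import heapq
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass(order=True)
class Node:
    path_cost: float                                   # node.PATH-COST (ordering key)
    state: Any = field(compare=False)                  # node.STATE
    parent: Optional['Node'] = field(default=None, compare=False)  # node.PARENT
    action: Any = field(default=None, compare=False)   # node.ACTION

frontier = []                                  # the frontier as a binary-heap priority queue
heapq.heappush(frontier, Node(140, 'Sibiu'))   # ADD(node, frontier)
heapq.heappush(frontier, Node(75, 'Zerind'))   # ADD(node, frontier)
is_empty = (len(frontier) == 0)                # IS-EMPTY(frontier)
top = frontier[0]                              # TOP(frontier): peek without removing
node = heapq.heappop(frontier)                 # POP(frontier): lowest PATH-COST first
```

Ordering only on path_cost (the other fields use compare=False) is what makes the heap behave as a priority queue keyed on g.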
MEASURING PROBLEM-SOLVING PERFORMANCE
The output of a problem-solving algorithm is either failure or a solution.
The algorithm's performance can be measured in four ways :
i. Completeness: Is the algorithm guaranteed to find a solution when there
is one?
ii. Optimality: Does the strategy find the optimal solution?
iii. Time complexity: How long does it take to find a solution?
iv. Space complexity: How much memory is needed to perform the search?
Fig 1.20: The breadth-first search algorithm.
Now suppose that the solution is at depth d. Then the total number of nodes generated is
1 + b + b^2 + ... + b^d = O(b^d)
All the nodes remain in memory, so both time and space complexity are O(b^d).
The memory requirements are a bigger problem for breadth-first search than the
execution time.
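Breadth-first search with a FIFO frontier and an early goal test can be sketched in a few lines of Python (the graph in the usage note is illustrative):

```python
from collections import deque

def breadth_first_search(start, goal_test, successors):
    # Early goal test: check a node when it is generated, not when expanded.
    if goal_test(start):
        return [start]
    frontier, parents = deque([start]), {start: None}
    while frontier:
        state = frontier.popleft()          # FIFO queue: shallowest nodes first
        for child in successors(state):
            if child not in parents:        # reached set: avoid repeated states
                parents[child] = state
                if goal_test(child):
                    path = [child]
                    while parents[path[-1]] is not None:
                        path.append(parents[path[-1]])
                    return path[::-1]
                frontier.append(child)
    return None                             # failure
```

For example, on the graph {'A': ['B', 'C'], 'B': ['D'], 'C': ['D'], 'D': []} a search from 'A' for 'D' returns the shallowest path ['A', 'B', 'D']. The parents dictionary is exactly the source of the O(b^d) memory cost discussed above.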
1.5.2 Uniform-cost search
Uniform-cost search does not care about the number of steps a path has, but only about
its total cost.
By a simple extension, we can find an algorithm that is optimal with any step-cost
function.
Instead of expanding the shallowest node, uniform-cost search expands the node n
with the lowest path cost g(n).
This is done by storing the frontier as a priority queue ordered by 𝑔.
Fig: 1.20
Fig: 1.21
For a state space with branching factor b and maximum depth m, depth-first search
requires storage of only O(bm) nodes.
1.5.4 Depth-limited search
The embarrassing failure of depth-first search in infinite state spaces can be
alleviated by supplying depth-first search with a predetermined depth limit 𝒍.
That is, nodes at depth 𝒍 are treated as if they have no successors. This approach is
called depth-limited search.
The depth limit solves the infinite-path problem. Unfortunately, it also introduces an
additional source of incompleteness if we choose 𝑙 < 𝑑, that is, the shallowest goal is
beyond the depth-limit.
Depth-limited search will also be non-optimal if we choose l > d. Its time
complexity is O(b^l) and its space complexity is O(bl). Depth-first search can be
viewed as a special case of depth-limited search with l = ∞.
Fig: 1.22 The Recursive implementation of Depth-limited tree search:
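The recursive version can be sketched as follows (an illustrative Python version that returns a path to a goal, the string 'cutoff', or None for failure):

```python
def depth_limited_search(state, goal_test, successors, limit):
    """Return a path to a goal, 'cutoff' (limit reached), or None (failure)."""
    if goal_test(state):
        return [state]
    if limit == 0:
        return 'cutoff'                     # nodes at depth l have no successors
    cutoff_occurred = False
    for child in successors(state):
        result = depth_limited_search(child, goal_test, successors, limit - 1)
        if result == 'cutoff':
            cutoff_occurred = True
        elif result is not None:
            return [state] + result
    return 'cutoff' if cutoff_occurred else None
```

The 'cutoff' return value distinguishes "no solution within the limit" from genuine failure, which is what iterative deepening needs in order to know whether to try a larger limit.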
Fig: 1.23
Fig: 1.24
Advantage:
Bidirectional search is fast and it requires less memory.
Disadvantage:
We should know the goal state in advance.
Performance Evaluation
Completeness Bidirectional search is complete if branching factor b is finite and if we
use BFS in both searches.
Optimality Bidirectional search is optimal.
Time Complexity: O(b^(d/2)) if BFS is used in both directions (where b is the branching
factor and d is the depth of the shallowest solution), since each of the two searches need
only reach half the solution depth.
Space Complexity: O(b^(d/2)).
Using the straight-line distance heuristic h_SLD, the goal state can be reached
faster.
Fig: 2.2
Figure 2.2 shows the progress of greedy best-first search using h_SLD to find a path from Arad
to Bucharest. The first node to be expanded from Arad will be Sibiu, because it is closer to
Bucharest than either Zerind or Timisoara. The next node to be expanded will be
Fagaras, because it is closest. Fagaras in turn generates Bucharest, which is the goal.
1.6.1.2 A* Search
A* Search is the most widely used form of best-first search. The evaluation function
f(n) is
obtained by combining
i. g(n) = the cost to reach the node, and
ii. h(n) = the estimated cost to get from the node to the goal:
f(n) = g(n) + h(n)
A* Search is both complete and optimal. A* is optimal if h(n) is an admissible
heuristic – that is, provided that h(n) never overestimates the cost to reach the goal.
An obvious example of an admissible heuristic is the straight-line distance h_SLD that
we used in getting to Bucharest; it cannot be an overestimate. The progress of an A*
tree search for Bucharest is shown in Figure 2.2.
The values of g are computed from the step costs shown in the Romania map
(figure 2.1). The values of h_SLD are also given in Figure 2.1.
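The evaluation f(n) = g(n) + h(n) can be sketched as a graph search (an illustrative Python version; successors(s) is assumed to yield (step-cost, next-state) pairs):

```python
import heapq

def astar_search(start, goal_test, successors, h):
    # Frontier ordered by f = g + h; best_g records the cheapest known g per state.
    frontier = [(h(start), 0, start, [start])]      # (f, g, state, path)
    best_g = {start: 0}
    while frontier:
        f, g, state, path = heapq.heappop(frontier)
        if goal_test(state):
            return path, g
        for cost, child in successors(state):
            g2 = g + cost
            if g2 < best_g.get(child, float('inf')):
                best_g[child] = g2
                heapq.heappush(frontier, (g2 + h(child), g2, child, path + [child]))
    return None, float('inf')
```

With h(n) = 0 for every n (trivially admissible), this reduces to uniform-cost search: on the graph {'A': [(1, 'B'), (4, 'C')], 'B': [(2, 'C')], 'C': []} it finds the cheaper path A-B-C of cost 3 rather than the direct edge of cost 4.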
Fig: 2.2
A* search is complete.
Whether A* is cost-optimal depends on certain properties of the heuristic.
A key property is admissibility: an admissible heuristic is one that never
overestimates the cost to reach a goal.
A slightly stronger property is called consistency. A heuristic h(n) is consistent if, for
every node n and every successor n′ of n generated by an action a, we have:
h(n) ≤ c(n, a, n′) + h(n′)
This is a form of the triangle inequality, which stipulates that a side of a triangle
cannot be longer than the sum of the other two sides (see Figure 3.19 ). An example
of a consistent heuristic is the straight-line distance that we used in getting to
Bucharest.
Fig: 2.3
Fig: 2.4
IDA* and RBFS suffer from using too little memory. Between iterations, IDA*
retains only a single number: the current f-cost limit.
RBFS retains more information in memory, but it uses only linear space: even if
more memory were available, RBFS has no way to make use of it.
Because they forget most of what they have done, both algorithms may end up
reexploring the same states many times over
A better approach is to determine how much memory is available and allow the
algorithm to use all of it. Two algorithms that do this are MA* (memory-bounded A*)
and SMA* (simplified MA*).
SMA*
SMA* proceeds just like A*, expanding the best leaf until memory is full.
At this point, it cannot add a new node to the search tree without dropping an old one.
SMA* always drops the worst leaf node: the one with the highest f-value.
Like RBFS, SMA* then backs up the value of the forgotten node to its parent.
In this way, the ancestor of a forgotten subtree knows the quality of the best path in
that subtree.
With this information, SMA* regenerates the subtree only when all other paths have
been shown to look worse than the path it has forgotten.
Another way of saying this is that if all the descendants of a node n are forgotten, then
we will not know which way to go from n, but we will still have an idea of how
worthwhile it is to go anywhere from n.
1.6.2 Heuristic Functions
A heuristic function, or simply a heuristic, ranks the alternatives at each branching
step of a search algorithm, based on the available information, in order to decide
which branch to follow.
The 8-puzzle
The 8-puzzle is an example of a heuristic search problem. The object of the puzzle is to
slide the tiles horizontally or vertically into the empty space until the configuration
matches the goal configuration (Figure 2.6).
The average solution cost for a randomly generated 8-puzzle instance is about 22
steps.
The branching factor is about 3. (When the empty tile is in the middle, there are four
possible moves; when it is in a corner, there are two; and when it is along an edge,
there are three.)
This means that an exhaustive search to depth 22 would look at about
3^22 ≈ 3.1 × 10^10 states.
By keeping track of repeated states, we could cut this down by a factor of about
170,000, because there are only 9!/2 = 181,440 distinct states that are reachable. This
is a manageable number, but the corresponding number for the 15-puzzle is roughly
10^13.
If we want to find the shortest solutions by using A*,we need a heuristic function that
never overestimates the number of steps to the goal.
The two commonly used heuristic functions for the 8-puzzle are:
i. h1 = the number of misplaced tiles.
For figure 2.6, all of the eight tiles are out of position, so the start state would have h1
= 8. h1 is an admissible heuristic.
ii. h2 = the sum of the distances of the tiles from their goal positions.
This is called the city block distance or Manhattan distance.
h2 is admissible, because all any move can do is move one tile one step closer to the goal.
Tiles 1 to 8 in the start state give a Manhattan distance of
h2 = 3 + 1 + 2 + 2 + 2 + 3 + 3 + 2 = 18.
Neither of these overestimates the true solution cost, which is 26.
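Both heuristics can be written directly against a tuple representation of the board (a sketch assumed to match the start state of figure 2.6; the goal is taken to place the blank first with tiles 1-8 in row order, and 0 denotes the blank):

```python
def h1(state):
    # Number of misplaced tiles (the blank, 0, is not counted).
    # Under the assumed goal, tile t belongs at index t.
    return sum(1 for i, t in enumerate(state) if t != 0 and t != i)

def h2(state):
    # Sum of Manhattan (city-block) distances of each tile from its goal square.
    return sum(abs(i // 3 - t // 3) + abs(i % 3 - t % 3)
               for i, t in enumerate(state) if t != 0)

start = (7, 2, 4,
         5, 0, 6,
         8, 3, 1)      # assumed start state of figure 2.6, read row by row
```

For this start state, h1(start) gives 8 and h2(start) gives 18, reproducing the values computed above.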
The effect of heuristic accuracy on performance
The Effective Branching factor
One way to characterize the quality of a heuristic is the effective branching factor b*. If
the total number of nodes generated by A* for a particular problem is N, and the solution
depth is d, then b* is the branching factor that a uniform tree of depth d would have to have
in order to contain N + 1 nodes. Thus,
N + 1 = 1 + b* + (b*)^2 + ... + (b*)^d
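Since the right-hand side grows monotonically in b*, the equation can be solved numerically, for example by bisection (an illustrative Python sketch):

```python
def effective_branching_factor(N, d, tol=1e-6):
    # Solve N + 1 = 1 + b* + (b*)^2 + ... + (b*)^d for b* by bisection.
    def total(b):
        return sum(b ** i for i in range(d + 1))
    lo, hi = 1.0, float(N + 1)          # the root lies in this bracket
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if total(mid) < N + 1:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

As a sanity check, a uniform binary tree of depth 3 contains 15 nodes, so N = 14 and d = 3 should give b* = 2.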
Fig 2.7 Comparison of the search cost and effective branching factor
From (c), we can derive ℎ1 (misplaced tiles) because it would be the proper score if
tiles could move to their intended destination in one action.
If the relaxed problem is hard to solve, then the values of the corresponding heuristic
will be expensive to obtain.
A program called ABSOLVER can generate heuristics automatically from problem
definitions, using the “relaxed problem” method.
Generating admissible heuristics from sub problems: Pattern databases
Admissible heuristics can also be derived from the solution cost of a subproblem
of a given problem.
Fig 2.8
Pattern databases
The idea behind pattern databases is to store these exact solution costs for every
possible subproblem instance- in our example, every possible configuration of the
four tiles and the blank.
Then we compute an admissible heuristic for each state encountered during a
search simply by looking up the corresponding subproblem configuration in the
database.
The database itself is constructed by searching back from the goal and recording
the cost of each new pattern encountered;
Generating heuristics with landmarks
There are online services that host maps with tens of millions of vertices and find
cost-optimal driving directions in milliseconds (figure 2.9)
How can they do that, when the best search algorithms we have considered so far
are about a million times slower?
There are many tricks, but the most important one is precomputation of some
optimal path costs.
Although the precomputation can be time-consuming, it need only be done once,
and then can be amortized over billions of user search requests.
Fig 2.9
If the optimal path happens to go through a landmark, this heuristic will be exact; if
not it is inadmissible—it overestimates the cost to the goal.
In an A* search, if you have exact heuristics, then once you reach a node that is on an
optimal path, every node you expand from then on will be on an optimal path.
Some route-finding algorithms save even more time by adding shortcuts—artificial
edges in the graph that define an optimal multi-action path.
Could an agent learn how to search better? The answer is yes, and the method rests on
an important concept called the metalevel state space.
Each state in a metalevel state space captures the internal (computational) state of a
program that is searching in an ordinary state space such as the map of Romania.
(To keep the two concepts separate, we call the map of Romania an object-level state
space.)
Each action in the metalevel state space is a computation step that alters the internal
state; for example, each computation step in A* expands a leaf node and adds its
successors to the tree.
For harder problems, there will be many such missteps, and a metalevel learning
algorithm can learn from these experiences to avoid exploring unpromising subtrees.
The goal of learning is to minimize the total cost of problem solving, trading off
computational expense and path cost.
Learning heuristics from experience
One way to invent a heuristic is to devise a relaxed problem for which an optimal
solution can be found easily.
An alternative is to learn from experience. “Experience” here means solving lots
of 8-puzzles, for instance.
Each optimal solution to an 8-puzzle problem provides an example (goal, path)
pair. From these examples, a learning algorithm can be used to construct a
function that can approximate the true path cost for other states that arise during
search.
Fig: 2.10
A local minimum in the 8-queens state space: the state has h = 1, but every successor
has a higher cost.
Here h = the number of pairs of queens that are attacking each other, either directly or
indirectly; h = 17 for the state shown above.
Limitations:
Hill climbing cannot reach the optimum/best state (global maximum) if it enters any of
the following regions:
Local Maxima
A local maximum is a peak that is higher than each of its neighbouring states but
lower than the global maximum.
Plateaus
A plateau is a flat area of the state-space landscape.
It can be a flat local maximum, from which no uphill exit exists, or a shoulder, from
which progress is possible.
Ridges
A Ridge is an area which is higher than surrounding states, but it cannot be
reached in a single move.
Fig: 2.12
The ridge shown in figure 2.12 results in a sequence of local maxima that is very
difficult for a greedy algorithm to navigate.
Variations of Hill Climbing
In steepest-ascent hill climbing, all successors are compared and the one closest to the
solution is chosen.
Steepest ascent hill climbing is like best-first search, which tries all possible
extensions of the current path instead of only one.
It gives an optimal solution but is time-consuming.
1.7.1.2 Simulated Annealing:
Annealing is the process used to temper or harden metals and glass by heating them to
a high temperature and then gradually cooling them, thus allowing the material to
reach a low-energy crystalline state.
The simulated annealing algorithm is quite similar to hill climbing.
Instead of picking the best move, however, it picks a random move.
If the move improves the situation, it is always accepted.
Otherwise, the algorithm accepts the move with some probability less than 1.
Unlike steepest-ascent hill climbing, it need not examine all the neighbours.
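The acceptance rule above can be sketched as follows (an illustrative Python version; the value function, neighbour generator, and cooling schedule in the usage example are hypothetical):

```python
import math
import random

def simulated_annealing(start, value, random_neighbor, schedule, max_steps=5000):
    """Maximize `value`; schedule(t) gives the temperature at step t."""
    current = start
    for t in range(1, max_steps + 1):
        T = schedule(t)
        if T <= 0:
            break
        nxt = random_neighbor(current)
        delta = value(nxt) - value(current)
        # Always accept an improving move; accept a worsening move only with
        # probability e^(delta/T), which shrinks toward 0 as T cools.
        if delta > 0 or random.random() < math.exp(delta / T):
            current = nxt
    return current

# Hypothetical example: maximize -(x - 3)^2 over the integers.
random.seed(0)
best = simulated_annealing(0,
                           lambda x: -(x - 3) ** 2,
                           lambda x: x + random.choice([-1, 1]),
                           lambda t: 0.98 ** t)
```

Early on, high temperatures let the walk escape poor regions; as T decays the rule becomes effectively greedy, so on this unimodal landscape the walk settles at the optimum x = 3.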
Fig:2.13
Fig 2.13 c)
Fig 2.14
Fig 2.13 d)
Fig 2.13
form two children, one with the first part of parent 1 and the second part of
parent 2; the other with the second part of parent 1 and the first part of parent
2.
The mutation rate, which determines how often offspring have random
mutations to their representation. Once an offspring has been generated,
every bit in its composition is flipped with probability equal to the
mutation rate.
On each iteration, we update the parameters in the direction opposite to the gradient of
the objective function with respect to the parameters, since the gradient gives the
direction of steepest ascent.
The size of the step we take on each iteration to reach the local minimum is determined
by the learning rate 𝛼. Therefore we follow the direction of the slope downhill until we
reach a local minimum.
Gradient = partial derivative of the cost/loss function with respect to the weights/coefficients:
Gradient = ∂Cost/∂w
New weight = old weight - learning rate * ∂Cost/∂w
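The update rule can be sketched in a few lines (an illustrative Python version; the quadratic cost in the usage example is a made-up example):

```python
def gradient_descent(w, grad, alpha=0.1, steps=100):
    # Repeated update: new weight = old weight - learning rate * dCost/dw.
    for _ in range(steps):
        w = w - alpha * grad(w)
    return w

# Hypothetical example: cost(w) = (w - 4)^2, so grad(w) = 2 * (w - 4).
w = gradient_descent(0.0, lambda w: 2 * (w - 4))
```

Each step shrinks the error (w - 4) by the factor (1 - 2*alpha), so with alpha = 0.1 the iterate converges quickly to the minimum at w = 4.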
If we stray too far from the current state (by altering the location of one or more of the
airports by a large amount), then the set of closest cities for that airport changes, and we
need to recompute C_i.
One way to deal with a continuous state space is to discretize it.
For example, instead of allowing the locations to be any point in continuous two-
dimensional space, we could limit them to fixed points on a rectangular grid with
spacing of size 𝛿 (delta).
Methods that measure progress by the change in the value of the objective function
between two nearby points are called empirical gradient methods
Reducing the value of 𝛿 over time can give a more accurate solution, but does not
necessarily converge to a global optimum in the limit.
Many methods attempt to use the gradient of the landscape to find a maximum.
The gradient of the objective function is a vector ∇𝑓 that gives the magnitude and
direction of the steepest slope. For our problem, we have
∇f = (∂f/∂x1, ∂f/∂y1, ∂f/∂x2, ∂f/∂y2, ∂f/∂x3, ∂f/∂y3)
In some cases, we can find a maximum by solving the equation ∇f = 0; for example, if
we were placing just one airport.
This equation cannot be solved in closed form. For example, with three airports, the
expression for the gradient depends on what cities are closest to each airport in the
current state.
We can compute the gradient locally:
∂f/∂x1 = 2 Σ_{c ∈ C1} (x1 - xc)
We can perform steepest-ascent hill climbing by updating the current state according to the
formula
x ← x + α∇f(x)
where α (alpha) is a small constant often called the step size.
The problem occurs when α becomes too small (too many steps are needed) or too
large (the search may overshoot the maximum). The technique of line search tries to
overcome this problem by extending the current gradient direction, usually by repeatedly
doubling α, until f starts to decrease again. The point at which this occurs becomes the
new current state.
For many problems, the most effective algorithm is the venerable Newton–Raphson
method. This is a general technique for finding roots of functions—that is, solving
equations of the form 𝑔(𝑥) = 0.
A new estimate for the root according to Newton's formula is given by
x ← x - g(x)/g'(x)
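Newton's formula can be sketched as follows (an illustrative Python version; finding √2 as a root of g(x) = x^2 - 2 is a standard example):

```python
def newton_raphson(g, g_prime, x, iters=20):
    # Repeated tangent step: x <- x - g(x)/g'(x), converging to a root of g.
    for _ in range(iters):
        x = x - g(x) / g_prime(x)
    return x

# Hypothetical example: the square root of 2 is a root of g(x) = x^2 - 2.
root = newton_raphson(lambda x: x * x - 2, lambda x: 2 * x, 1.0)
```

Starting from x = 1, the iteration converges quadratically, reaching machine precision in a handful of steps.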
An optimization problem is constrained if solutions must satisfy some hard
constraints on the values of the variables. For example, in our airport-siting problem,
we might constrain sites to be inside Romania and on dry land (rather than in the middle
of lakes).
The difficulty of constrained optimization problems depends on the nature of the
constraints and the objective function. The best-known category is that of linear
programming problems, in which constraints must be linear inequalities forming
a convex set and the objective function is also linear.
1.7.3 Searching with Non-deterministic actions
Belief state
When the environment is partially observable, the agent doesn't know for sure what
state it is in; and when the environment is nondeterministic, the agent doesn't know
what state it transitions to after taking an action.
The set of physical states that the agent believes it might be in is called a belief state.
Conditional Plan
In partially observable and nondeterministic environments, the solution to a problem is
no longer a sequence, but rather a conditional plan (sometimes called a contingency
plan or a strategy) that specifies what to do depending on what percepts the agent
receives while executing the plan.
1.7.3.1 The Erratic (Unpredictable) Vacuum World
The vacuum world has eight states, and three actions- Right, Left, and Suck.
The goal is to clean up all the dirt (state 7 or 8)
If the environment is fully observable, deterministic, and completely known, then the
problem is easy to solve and the solution is an action sequence.
For example, if the initial state is 1, then the action sequence [Suck, Right, Suck] will
reach a goal state, 8.
In the erratic vacuum world, the environment is nondeterministic, and the Suck action
works as follows:
1. When applied to a dirty square the action cleans the square and sometimes, cleans
up dirt in an adjacent square, too.
2. When applied to a clean square, the action sometimes deposits dirt on the carpet.
Conditional Plan
In the Erratic Vacuum world, the solution is a conditional plan rather than a fixed
action sequence.
A conditional plan can contain if-then-else steps.
Fig 2.15
A solution for an AND-OR search problem is a subtree of the complete search tree
that
1. Has a goal node at every leaf,
2. Specifies one action at each of its OR nodes, and
3. Includes every outcome branch at each of its AND nodes.
The solution is shown in bold lines in figure 2.15.
Or by adding a label to denote some portion of the plan and referring to that label
later:
Fig 2.16
where b is the current belief state, s is a state in b, b′ is a possible updated belief state,
and s′ is a possible resulting state.
Lots of spooky math, but the first definition just means it is a one-to-one transition, and
the second means it is a one-to-many possible state transition. Here is a sample of the
predictions.
Goal test: a belief state satisfies the goal ONLY if all physical states in that
belief state satisfy the GOAL-TESTp function. (The search might pass an earlier
belief state in which one or more, but not all, of the physical states satisfy
GOAL-TESTp.)
Path cost: depends on what all the possible costs are for each potential state.
Assume they are the same, to make it easier for now.
Fig 2.17
Figure 2.18 shows the reachable belief-state space for the deterministic, sensorless
vacuum world. There are only 12 reachable belief states out of 2^8 = 256 possible belief states.
Creating the belief-state problem formulation is now easily possible using the
underlying physical problem definition.
Then, any search algorithm can be used.
Now, sensorless problem-solving is possible, but rarely practically feasible.
Why? The size of each belief state can be enormous.
Example: a vacuum world with a 10 × 10 grid of squares would mean an initial
belief state of 100 × 2^100 physical states.
Fig 2.18
The prediction stage computes the belief state resulting from the action,
exactly as we did with sensorless problems.
The possible percepts stage computes the set of percepts that could be
observed in the predicted belief state (using the letter o for observation):
The update stage computes, for each possible percept, the belief state that
would result from the percept. The updated belief state b_o is the set of states in
b that could have produced the percept:
The agent needs to deal with possible percepts at planning time, because it
won’t know the actual percepts until it executes the plan.
Each updated belief state b_o can be no larger than the predicted belief state;
Putting these three stages together, we obtain the possible belief states
resulting from a given action and the subsequent possible percepts:
Fig 2.19
Fig 2.20
Fig 2.21
What it looks like when we change the problem so that any square can become dirty at
any time, unless the agent is actively cleaning it at that moment.
The real world is almost always a partially observable environment.
An intelligent agent must therefore keep its belief state up to date.
The function for this goes by many names: monitoring, filtering, or state estimation.
The previous equation gives a recursive version:
the new belief state is a function of the old one and the latest action and percept,
not of the entire percept sequence.
The update needs to be fast; it can't fall behind. If things change too quickly, the
agent has to settle for computing an approximate belief state.
Here is an example in a discrete environment with deterministic sensors and
nondeterministic actions.
The example concerns a robot with a particular state estimation task called
localization: working out where it is, given a map of the world and a sequence of
percepts and actions.
The task? Localization.
The agent? A robot with broken actuators.
The environment? A maze.
The problem? The robot moves only in random directions and must determine where it is.
The rub? The robot can sense where it's at thanks to four sonar sensors, but its
actuators are messed up: it moves randomly.
Fig 2.22
The long and the short of it? The robot can tell where it’s at by comparing sensor
observations, even though it didn’t choose the direction it moved.
The new location in the above fig 2.22 b) is the only possible place it could have
moved to according to:
𝑈𝑃𝐷𝐴𝑇𝐸(𝑃𝑅𝐸𝐷𝐼𝐶𝑇(𝑈𝑃𝐷𝐴𝑇𝐸(𝑏, 1011), 𝑅𝑖𝑔ℎ𝑡), 1010)
We assume that the sensors give perfectly correct data, and that the robot has a correct
map of the environment. But unfortunately, the robot’s navigational system is broken,
so when it executes a Right action, it moves randomly to one of the adjacent squares.
The robot’s task is to determine its current location. Suppose the robot has just been
switched on, and it does not know where it is—its initial belief state consists of the set
of all locations.
The robot then receives the percept 1011 and does an update using the equation
𝑏𝑜 = 𝑈𝑃𝐷𝐴𝑇𝐸(𝑏, 1011), yielding the 4 locations shown in Fig 2.22 a).
Next the robot executes a Right action, but the result is nondeterministic.
The new belief state, 𝑏̂ = 𝑃𝑅𝐸𝐷𝐼𝐶𝑇(𝑏𝑜 , 𝑅𝑖𝑔ℎ𝑡), contains all the locations that are
one step away from the locations in 𝑏𝑜. When the second percept, 1010, arrives, the
robot does 𝑈𝑃𝐷𝐴𝑇𝐸(𝑏̂ , 1010) and finds that the belief state has collapsed down to
the single location shown in Figure 2.22 b). That is the only location that could be the
result of 𝑈𝑃𝐷𝐴𝑇𝐸(𝑃𝑅𝐸𝐷𝐼𝐶𝑇(𝑈𝑃𝐷𝐴𝑇𝐸(𝑏, 1011), 𝑅𝑖𝑔ℎ𝑡), 1010).
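The same update-predict-update collapse can be mimicked on a toy corridor world; the corridor, its 'end'/'mid' percepts, and the neighbour relation below are all made up for illustration:

```python
def update(belief, percept, sense):
    """Keep only the locations whose (deterministic) sensor reading matches."""
    return {loc for loc in belief if sense(loc) == percept}

def predict(belief, neighbors):
    """After a random move, the robot may be in any neighbour of its belief."""
    return {n for loc in belief for n in neighbors(loc)}

# Toy corridor 0..4: squares 0 and 4 read 'end', the rest read 'mid'.
sense = lambda loc: 'end' if loc in (0, 4) else 'mid'
neighbors = lambda loc: [l for l in (loc - 1, loc + 1) if 0 <= l <= 4]

b0 = update(set(range(5)), 'end', sense)   # first percept narrows 5 states to 2
b1 = predict(b0, neighbors)                # random move spreads the belief again
```

Each percept filters the belief state and each unobserved move expands it, exactly the alternation described above.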
Consider:
Fig 2.23
This just means the agent can't know that going up from (1, 1) leads to (1, 2) until it actually does it.
Adding a heuristic function h(s) gives an estimate of the distance from the current state
to the goal state, so the agent can try to reach the goal at minimum cost.
The ratio of the cost the agent actually incurs to the cost of the shortest possible path
(what it would pay if it knew the space in advance) is called the competitive ratio.
Make it small.
It can be infinite if some actions are irreversible and a dead-end state gets
reached.
Consider:
Fig 2.24
Fig 2.25
Fig 2.26
Eventually, in a finite space, a random walk will find what you are looking for.
The process can be very slow: one step forward, two steps back.
Improvements are made by augmenting randomness with memory.
Store a current best estimate H(s) of the cost to the goal from each state visited.
Start with the heuristic guess h(s) and update the estimate as you go.
The agent then moves according to the best estimates among its neighbours.
Fig 2.27
The agent should follow what seems to be the best path to the goal given the current
cost estimates for its neighbours. The estimated cost to reach the goal through a
neighbour s' is the cost to get to s' plus the estimated cost to get to a goal from
there, that is,
c(s, a, s') + H(s')
In (a), there are two such costs: 1 + 9 and 1 + 2. Go right, the 1 + 2 action.
LRTA* will always find a goal in any finite, safely explorable environment.
It is not complete in infinite state spaces.
It explores an environment of n states in at worst 𝑶(𝒏𝟐 ) steps.
The agent learns by updating its cost estimate for each state as it explores.
It needs accurate cost estimates of each state.
Once they are known, the best choices follow by always moving to the lowest-cost neighbour.
In effect, it is an optimistic form of hill climbing: after an apparently uphill move the
agent may have to backtrack a little.
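The update-and-move rule c(s, a, s') + H(s') can be sketched as follows; the actions, result, cost, and h callbacks are assumed stand-ins for the problem definition, not a standard interface:

```python
def lrta_step(s, H, actions, result, cost, h):
    """One LRTA*-style step: update H(s), then move to the best neighbour."""
    def f(a):                 # estimated goal cost through neighbour result(s, a)
        s2 = result(s, a)
        return cost(s, a, s2) + H.get(s2, h(s2))
    H[s] = min(f(a) for a in actions(s))   # learned estimate for state s
    return min(actions(s), key=f)          # apparently best action
```

In the 1 + 9 versus 1 + 2 situation above, f would score the two actions 10 and 3, so the agent goes right.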
We could keep the search tree in memory and reuse the parts of it that are unchanged
in the new problem. We could keep the heuristic h values and update them as we gain
new information, either because the world has changed or because we have computed
a better estimate. Or we could keep the best-path values, using them to piece together
a new solution, and updating them when the world changes.
Fig 3.1
Fig 3.2
Fig 3.3
Fig 3.4
Fig 3.5
So, we can disregard the x and y nodes because MIN will pick a value of at most 2, no matter what.
Why? A value of at most 2 is already worse for MAX than what MAX can get in another branch.
Another way of looking at it:
Fig 3.6
Fig 3.7
Move Ordering
The order in which states are examined can dramatically impact performance.
With a good ordering, cutoffs happen sooner and fewer nodes need to be examined.
o Find the smallest value sooner and you don't need to look at the others.
If examinations begin with the likely best successors:
o Alpha-beta need only examine O(𝑏 𝑚/2 ) nodes.
o Minimax needs O(𝑏 𝑚 ).
o Branching factor essentially becomes √𝑏 instead of b.
Chess would go from 35 to something like 6.
Dynamic move-ordering schemes can improve it further.
Example: use moves that were best in the past.
These best moves are often called killer moves.
Trying them first is called the killer move heuristic.
In certain games, transpositions can kill performance: different permutations of the
same moves that reach the same position.
Example: the chess move sequences
[a1, b1, a2, b2] and [a2, b2, a1, b1].
The pieces end up in the same position, reached by a different
order of the same moves.
The redundant paths to repeated states can cause an exponential increase in search
cost; keeping a table of previously reached states addresses this problem.
In game tree search, repeated states can occur because of transpositions—different
permutations of the move sequence that end up in the same position, and the problem
can be addressed with a transposition table that caches the heuristic value of states.
Keep a transposition table. Ignore the duplicates.
Similar to the explored list from GRAPH-SEARCH
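A transposition table is just a cache keyed by position that is consulted before searching. A minimal minimax sketch, assuming a hypothetical game object providing is_terminal, utility, moves, and play:

```python
def minimax(state, maximizing, game, table):
    """Minimax value of state, caching results in a transposition table."""
    key = (state, maximizing)
    if key in table:                      # transposition: reuse the cached value
        return table[key]
    if game.is_terminal(state):
        v = game.utility(state)
    else:
        vals = [minimax(game.play(state, m), not maximizing, game, table)
                for m in game.moves(state)]
        v = max(vals) if maximizing else min(vals)
    table[key] = v                        # remember this position's value
    return v
```

Any move order that transposes into an already-seen position is answered from the table instead of being re-searched.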
1.8.3 Monte Carlo Tree Search
The game of Go illustrates two major weaknesses of heuristic alpha–beta tree search:
i. Go has a branching factor that starts at 361, which means alpha–beta search
would be limited to only 4 or 5 ply.
ii. It is difficult to define a good evaluation function for Go because material value
is not a strong indicator and most positions are in flux until the endgame. In
response to these two challenges, modern Go programs have abandoned alpha–
beta search and instead use a strategy called Monte Carlo tree search (MCTS)
The basic MCTS strategy does not use a heuristic evaluation function. Instead, the value
of a state is estimated as the average utility over a number of simulations of complete
games starting from the state.
A simulation (also called a playout or rollout) chooses moves first for one player, then
for the other, repeating until a terminal position is reached. At that point the rules of the
game determine who has won or lost, and by what score.
To get useful information from the playout we need a playout policy that biases the
moves towards good ones. For Go and other games, playout policies have been
successfully learned from self-play by using neural networks.
Given a playout policy, we next need to decide two things:
i. from what positions do we start the playouts, and
ii. how many playouts do we allocate to each position?
Pure Monte Carlo search does N simulations starting from the current state of the game, and tracks
which of the possible moves from the current position has the highest win percentage.
For some stochastic games this converges to optimal play as N increases, but for most
games it is not sufficient—we need a selection policy that selectively focuses the
computational resources on the important parts of the game tree.
It balances two factors:
i. exploration of states that have had few playouts, and
ii. exploitation of states that have done well in past playouts, to get a more
accurate estimate of their value.
Monte Carlo tree search does that by maintaining a search tree and growing it on each
iteration of the following four steps, as shown in Figure
SELECTION:
Starting at the root of the search tree, we choose a move leading to a successor
node, and repeat that process, moving down the tree to a leaf.
Figure 5.10(a) shows a search tree with the root representing a state where white
has just moved, and white has won 37 out of the 100 playouts done so far.
The thick arrow shows the selection of a move by black that leads to a node
where black has won 60/79 playouts. This is the best win percentage among the
three moves.
Selection continues on to the leaf node marked 27/35.
EXPANSION:
We grow the search tree by generating a new child of the selected node; Figure
5.10(b) shows the new node marked with 0/0.
SIMULATION:
We perform a playout from the newly generated child node, choosing moves
for both players according to the playout policy.
These moves are not recorded in the search tree. In the figure, the simulation
results in a win for black.
BACK-PROPAGATION:
We now use the result of the simulation to update all the search tree nodes going
up to the root.
Since black won the playout, black nodes are incremented in both the number
of wins and the number of playouts, so 27/35 becomes 28/36 and 60/79 becomes
61/80.
Since white lost, the white nodes are incremented in the number of playouts
only, so 16/53 becomes 16/54 and the root 37/100 becomes 37/101.
We repeat these four steps either for a set number of iterations, or until the allotted time
has expired, and then return the move with the highest number of playouts.
One very effective selection policy is called “upper confidence bounds applied to trees”
or UCT. The policy ranks each possible move based on an upper confidence bound
formula called UCB1.
For a node n, the formula is UCB1(n) = U(n)/N(n) + C × √(log N(Parent(n)) / N(n)):
Fig 3.10
Where,
𝑈(𝑛) is the total utility of all playouts that went through node 𝑛,
𝑁(𝑛) is the number of playouts through node 𝑛, and 𝑃𝑎𝑟𝑒𝑛𝑡(𝑛) is the parent node of 𝑛 in the
tree.
𝑈(𝑛)/𝑁(𝑛) is the exploitation term: the average utility of 𝑛.
The term with the square root is the exploration term: it has the count 𝑁(𝑛) in the
denominator, which means the term will be high for nodes that have only been explored
a few times.
In the numerator it has the log of the number of times we have explored the parent of
n.
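The formula can be written out directly; C is the exploration constant (a common choice is √2), and giving unexplored nodes an infinite score so they are tried first is a conventional refinement, not something the formula itself dictates:

```python
import math

def ucb1(U_n, N_n, N_parent, C=math.sqrt(2)):
    """UCB1 score for a node: exploitation term plus exploration term."""
    if N_n == 0:
        return math.inf                    # unexplored nodes are tried first
    return U_n / N_n + C * math.sqrt(math.log(N_parent) / N_n)
```

For the example tree above, ucb1(60, 79, 100) exceeds ucb1(16, 53, 100), consistent with selection preferring the 60/79 branch.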
The pseudo code shows the complete UCT MCTS algorithm. When the iterations
terminate, the move with the highest number of playouts is returned.
The idea is that a win percentage based on many playouts is more reliable than the
same percentage based on a few, and the UCB1 formula ensures that the node with the
most playouts is almost always the node with the highest win percentage.
Advantages of Monte Carlo Tree Search:
It does not necessarily require any tactical knowledge about the game
A general MCTS implementation can be reused for any number of games with little
modification
MCTS supports asymmetric expansion of the search tree based on the circumstances
in which it is operating.
A drawback: as the tree grows rapidly after a few iterations, MCTS can require a huge
amount of memory.
Fig 3.11
Requires a tree containing chance nodes in addition to min and max nodes.
They consider the possible dice rolls.
Fig 3.12
Each chance node is labelled with the probability of the corresponding dice roll: 1/36, 1/18, etc.
Because of this uncertainty, we can only calculate a position's expected value: the average
over all possible outcomes of the chance nodes.
So, generalize the deterministic game's minimax value to an expectiminimax value
for games with chance nodes.
Terminal, MIN, and MAX nodes stay the same.
For chance nodes, sum the values of all outcomes, weighted by their probabilities:
EXPECTIMINIMAX(s) = Σ_r P(r) · EXPECTIMINIMAX(RESULT(s, r)),
where r is the dice roll, and RESULT(s, r) is the state s with roll r.
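The recursion can be sketched as follows; the game object with is_terminal, utility, node_type, children, and probabilities is an assumed interface for illustration:

```python
def expectiminimax(state, game):
    """Expected minimax value of a state in a game with chance nodes."""
    if game.is_terminal(state):
        return game.utility(state)
    kind = game.node_type(state)          # 'max', 'min', or 'chance'
    values = [expectiminimax(c, game) for c in game.children(state)]
    if kind == 'max':
        return max(values)
    if kind == 'min':
        return min(values)
    # Chance node: probability-weighted average over the possible outcomes.
    return sum(p * v for p, v in zip(game.probabilities(state), values))
```

MAX and MIN nodes behave exactly as in minimax; only chance nodes introduce the weighted sum.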
1.8.4.1 Evaluation functions for games of chance
Because of chance nodes, the meaning of evaluation values is a bit more dicey than in
deterministic games. Consider:
Assigning different values to the leaves leads to different decisions: leaf values
[1, 2, 3, 4] lead to taking a1, but [1, 20, 30, 400] lead to taking a2.
Fig 3.13
Kriegspiel rules:
Black and white see only their pieces, with a referee conducting the game.
Player tells ref about a move, ref resolves the move.
Humans manage to play it; computers can do so by leveraging belief states.
Starting off, white’s belief state is a singleton, black hasn’t moved yet.
After black’s move, white belief state can have 20 positions because black can
respond in 20 ways.
Keeping track of the belief state is the problem of state estimation.
Kriegspiel can be handled with the techniques for partially observable, nondeterministic problems:
RESULT combines white's own move with the unpredictable black reply.
Strategy changes in partially-observable games:
Moves are decided based on every possible percept sequence we could
get.
Not on each move the opponent might make.
With kriegspiel: a guaranteed checkmate is one where every possible percept sequence
leads to a checkmate, no matter what the opponent does.
The opponent's belief state doesn't matter.
Simplifies things a ton. Here’s a part of a guaranteed checkmate for King and Rook vs
a King situation.
Fig 3.14
The general AND-OR search algorithm can be applied to the belief-state space to find
guaranteed checkmates.
It finds midgame checkmates up to depth 9, which most humans can’t
do.
In addition to guaranteed checkmates, we have the probabilistic checkmate, a notion that
makes no sense in fully observable games; it relies on randomized play.
By moving randomly, the white king eventually bumps into the black king.
Black can’t keep guessing escape moves forever.
In KBNK(King, Bishop, Knight vs King) endgame:
White gives black infinite choices.
Eventually black guesses wrong.
This reveals black’s position.
This ends in checkmate.
Hard to find probabilistic checkmate with a reasonable depth, except endgame.
Usually you get an accidental checkmate early on, where the random choices just
work out.
So, how likely is a strategy to win? That depends on how likely each board state in the
belief state is to be the actual, true board state.
Now, not all belief states are equally likely. Certain moves are more important than
others, skewing the probabilities.
But, a player may want to avoid being predictable, skewing the probabilities even
more.
So, to play optimally, some randomness has to be built into moves on the part of the
player.
Leads to the idea of an equilibrium solution.
1.8.5.2 Card games
Many examples of stochastic partial observability.
Example:
Randomly deal cards at game start.
Cards hidden from other players.
Bridge, poker, hearts, etc.
Not exactly like dice, but it suggests an algorithm:
Solve all possible deals of the invisible cards as if they were fully observable.
Then, pick the move whose average outcome over all the deals is best.
If each deal s occurs with probability P(s), the desired move is:
𝑎𝑟𝑔𝑚𝑎𝑥𝑎 Σ𝑠 𝑃(𝑠) 𝑀𝐼𝑁𝐼𝑀𝐴𝑋(𝑅𝐸𝑆𝑈𝐿𝑇(𝑠, 𝑎))
The number of deals can be huge, so solving all of them can be impossible.
Instead, use a Monte Carlo approximation
o i.e., don't add up all deals; take a random sample of N deals.
o If deal 𝑠𝑖 appears in the sample with probability P(s), then compute:
𝑎𝑟𝑔𝑚𝑎𝑥𝑎 (1/𝑁) Σ𝑖=1..𝑁 𝑀𝐼𝑁𝐼𝑀𝐴𝑋(𝑅𝐸𝑆𝑈𝐿𝑇(𝑠𝑖 , 𝑎))
o The bigger the N, the better the approximation.
Fig 3.15
A solution is a specific assignment to each variable such that all
constraints are satisfied.
It can be helpful to visualize a CSP as a constraint graph, as shown in Fig (b).
In a constraint graph, each node is a variable and each edge connects a pair of
variables that appear together in a constraint.
This kind of constraint graph is a binary constraint graph, where each constraint
relates at most two variables; such CSPs are called binary CSPs.
A state has many variables, called state variables; each state
variable is a node.
Varieties of CSPs
Fig 3.16
Each letter stands for a distinct digit; the aim is to find a substitution of digits for
letters such that the resulting sum is arithmetically correct, with the added restriction
that no leading zeros are allowed.
The constraint hypergraph for the cryptarithmetic problem shows the Alldiff
constraint as well as the column addition constraints.
Each constraint is a square box connected to the variables it contains.
1.9 Constraint Propagation
A number of inference techniques use the constraints to infer which variable/value pairs
are consistent and which are not. These include node, arc, path, and k-consistency.
constraint propagation: Using the constraints to reduce the number of legal values for
a variable, which in turn can reduce the legal values for another variable, and so on.
local consistency: If we treat each variable as a node in a graph and each binary
constraint as an arc, then the process of enforcing local consistency in each part of the
graph causes inconsistent values to be eliminated throughout the graph.
There are different types of local consistency:
Otherwise, keep checking, trying to remove values from the domains of variables
until no more arcs are in the queue.
The result is an arc-consistent CSP that has the same solutions as the original one
but smaller domains.
The complexity of AC-3: Assume a CSP with n variables, each with domain size at
most d, and with c binary constraints (arcs). Checking consistency of an arc can be
done in O(d²) time, so the total worst-case time is O(cd³).
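A compact AC-3 sketch; representing the binary constraints as a dict mapping each directed arc (X, Y) to an allowed(x, y) predicate is an illustrative choice, not the only possible one:

```python
from collections import deque

def ac3(domains, constraints):
    """Prune domains to arc consistency; return False if a domain empties."""
    queue = deque(constraints)                      # start with every arc (X, Y)
    incoming = {}                                   # X -> {Z : (Z, X) is an arc}
    for (x, y) in constraints:
        incoming.setdefault(y, set()).add(x)
    while queue:
        X, Y = queue.popleft()
        allowed = constraints[(X, Y)]
        # Values of X with no supporting value in Y's domain must go.
        revised = {vx for vx in domains[X]
                   if not any(allowed(vx, vy) for vy in domains[Y])}
        if revised:
            domains[X] -= revised
            if not domains[X]:
                return False                        # inconsistency detected
            for Z in incoming.get(X, ()):           # re-check arcs into X
                if Z != Y:
                    queue.append((Z, X))
    return True
```

Each revision can shrink a neighbour's effective support, which is why the arcs pointing into the revised variable are put back on the queue.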
3.8.4 K-consistency:
K-consistency: A CSP is k-consistent if, for any set of k-1 variables and for any
consistent assignment to those variables, a consistent value can always be assigned to
any kth variable.
1-consistency = node consistency; 2-consistency = arc consistency; 3-consistency =
path consistency.
A CSP is strongly k-consistent if it is k-consistent and is also (k - 1)-consistent,
(k - 2)-consistent, … all the way down to 1-consistent.
If a CSP with n nodes is made strongly n-consistent, we are guaranteed to find a
solution in time O(n²d). But any algorithm for establishing n-consistency must take time
exponential in n in the worst case, and also requires space exponential in n.
1.9.2.6 Sudoku
A Sudoku board consists of 81 squares, some of which are initially filled with digits
from 1 to 9. The puzzle is to fill in all the remaining squares such that no digit appears
twice in any row, column, or box. A row, column, or 3 × 3 box is called a unit.
Fig 3.17
A Sudoku puzzle can be considered a CSP with 81 variables, one for each square. We
use the variable names A1 through A9 for the top row (left to right), down to I1
through I9 for the bottom row. The empty squares have the domain {1, 2, 3, 4, 5, 6, 7,
8, 9} and the pre-filled squares have a domain consisting of a single value.
There are 27 different Alldiff constraints: one for each row, column, and box of 9
squares:
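Using the A1..I9 naming above, the 27 units can be generated mechanically; an Alldiff constraint would then be attached to each generated unit:

```python
rows, cols = 'ABCDEFGHI', '123456789'

def units():
    """The 27 Sudoku units: 9 rows, 9 columns, and 9 boxes of 9 squares each."""
    row_units = [[r + c for c in cols] for r in rows]
    col_units = [[r + c for r in rows] for c in cols]
    box_units = [[r + c for r in rs for c in cs]
                 for rs in ('ABC', 'DEF', 'GHI')
                 for cs in ('123', '456', '789')]
    return row_units + col_units + box_units
```
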
Fig 3.18
If some variable X has no legal values left, the MRV heuristic will select X and
failure will be detected immediately—avoiding pointless searches through other
variables.
E.g. After the assignment for WA=red and NT=green, there is only one possible
value for SA, so it makes sense to assign SA=blue next rather than assigning Q.
Degree heuristic:
The degree heuristic attempts to reduce the branching factor on future choices by
selecting the variable that is involved in the largest number of constraints on other
unassigned variables. [useful tie-breaker]
E.g. SA is the variable with highest degree 5; the other variables have degree 2 or 3; T
has degree 0.
ORDER-DOMAIN-VALUES
Value selection: fail-last
If we are trying to find all the solution to a problem (not just the first one), then the
ordering does not matter.
Least-constraining-value heuristic: prefers the value that rules out the fewest choices
for the neighboring variables in the constraint graph. (Try to leave the maximum
flexibility for subsequent variable assignments.)
e.g. We have generated the partial assignment with WA=red and NT=green and that
our next choice is for Q. Blue would be a bad choice because it eliminates the last
legal value left for Q’s neighbor, SA, therefore prefers red to blue.
The minimum-remaining-values and degree heuristic are domain-independent
methods for deciding which variable to choose next in a backtracking search.
The least-constraining-value heuristic helps in deciding which value to try first for a
given variable.
1.9.3.2 Interleaving search and inference
INFERENCE - every time we make a choice of a value for a variable.
One of the simplest forms of inference is called forward checking. Whenever a
variable X is assigned, the forward-checking process establishes arc consistency for it:
for each unassigned variable Y that is connected to X by a constraint, delete from Y’s
domain any value that is inconsistent with the value chosen for X.
There is no reason to do forward checking if we have already done arc consistency as
a preprocessing step.
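Forward checking after assigning X = x can be sketched as below (constraints again assumed to be a dict of allowed(x, y) predicates over directed pairs, as an illustrative representation); returning the pruned values lets backtracking restore them on failure:

```python
def forward_check(X, x, domains, constraints):
    """Prune neighbours' domains after assigning X = x; return pruned values."""
    pruned = []
    for (A, B), allowed in constraints.items():
        if A == X:                         # B is a neighbour constrained by X
            for v in list(domains[B]):
                if not allowed(x, v):      # v is inconsistent with X = x
                    domains[B].discard(v)
                    pruned.append((B, v))
    return pruned
```

If any neighbour's domain becomes empty, the assignment X = x can be rejected immediately.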
Fig 3.19
Advantage: For many problems the search will be more effective if we combine the
MRV heuristic with forward checking.
Disadvantage: Forward checking only makes the current variable arc-consistent, but
doesn’t look ahead and make all the other variables arc-consistent.
Forward checking can supply the conflict set with no extra work.
Whenever forward checking based on an assignment X=x deletes a value from Y’s
domain, add X=x to Y’s conflict set;
If the last value is deleted from Y’s domain, the assignment in the conflict set of Y are
added to the conflict set of X.
In fact, every branch pruned by backjumping is also pruned by forward checking.
Hence simple backjumping is redundant in a forward-checking search, or in a search
that uses stronger consistency checking (such as MAC).
Conflict-directed backjumping:
e.g.
consider the partial assignment which is proved to be inconsistent: {WA=red,
NSW=red}.
We try T=red next and then assign NT, Q, V, SA, no assignment can work for these
last 4 variables.
Eventually we run out of value to try at NT, but simple backjumping cannot work
because NT doesn’t have a complete conflict set of preceding variables that caused to
fail.
The set {WA, NSW} is a deeper notion of the conflict set for NT, caused NT together
with any subsequent variables to have no consistent solution. So the algorithm should
backtrack to NSW and skip over T.
A backjumping algorithm that uses conflict sets defined in this way is called conflict-
direct backjumping.
How to Compute:
When a variable's domain becomes empty, a "terminal" failure occurs; that variable
has a standard conflict set.
Let Xj be the current variable, let conf(Xj) be its conflict set. If every possible value
for Xj fails, backjump to the most recent variable Xi in conf(Xj), and set
conf(Xi) ← conf(Xi)∪conf(Xj) – {Xi}.
The conflict set for a variable means there is no solution from that variable onward,
given the preceding assignments to the conflict set.
e.g.
assign WA, NSW, T, NT, Q, V, SA.
SA fails, and its conflict set is {WA, NT, Q}. (standard conflict set)
Backjump to Q, its conflict set is {NT, NSW}∪{WA,NT,Q}-{Q} = {WA, NT, NSW}.
Backtrack to NT, its conflict set is {WA}∪{WA,NT,NSW}-{NT} = {WA, NSW}.
Hence the algorithm backjump to NSW. (over T)
After backjumping from a contradiction, how to avoid running into the same problem
again:
The idea of finding a minimum set of variables from the conflict set that causes the problem.
This set of variables, along with their corresponding values, is called a no-good. We then
record the no-good, either by adding a new constraint to the CSP or by keeping a separate
cache of no-goods.
Backtracking occurs when no legal assignment can be found for a variable.
A backjumping algorithm that uses conflict sets defined in this way is called
conflict-directed backjumping.
Conflict-directed backjumping backtracks directly to the source of the problem.
Local search algorithms turn out to be very effective in solving many CSPs. They use
a complete-state formulation, where each state assigns a value to every variable, and
the search changes the value of one variable at a time.
As an example, we’ll use the 8-queens problem, as defined as a CSP. In Figure, we
start on the left with a complete assignment to the 8 variables; typically this will
violate several constraints.
We then randomly choose a conflicted variable, which turns out to be, the rightmost
column.
Fig 3.20
The min-conflicts heuristic: In choosing a new value for a variable, select the value
that results in the minimum number of conflicts with other variables.
In the above figure we see there are two rows that only violate one constraint; we pick
𝑄8 = 3 (that is, we move the queen to the 8th column, 3rd row).
On the next iteration, in the middle board of the figure, we select 𝑄6 as the variable to
change, and note that moving the queen to the 8th row results in no conflicts.
At this point there are no more conflicted variables, so we have a solution. The
algorithm is shown in Figure 3.21.
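The procedure can be sketched for n-queens, with state[i] holding the row of the queen in column i (a representation chosen for illustration):

```python
import random

def conflicts(state, col, row):
    """Number of queens attacking a queen placed at (col, row)."""
    return sum(1 for c in range(len(state)) if c != col and
               (state[c] == row or abs(state[c] - row) == abs(c - col)))

def min_conflicts(n=8, max_steps=10000, seed=0):
    """MIN-CONFLICTS local search for n-queens; None if no solution found."""
    rng = random.Random(seed)
    state = [rng.randrange(n) for _ in range(n)]   # random complete assignment
    for _ in range(max_steps):
        conflicted = [c for c in range(n) if conflicts(state, c, state[c]) > 0]
        if not conflicted:
            return state                           # no conflicts: a solution
        col = rng.choice(conflicted)               # random conflicted variable
        best = min(conflicts(state, col, r) for r in range(n))
        # min-conflicts value choice, ties broken at random
        state[col] = rng.choice([r for r in range(n)
                                 if conflicts(state, col, r) == best])
    return None
```

Breaking ties at random is one simple way to keep the search from cycling on a plateau.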
The landscape of a CSP under the min-conflicts heuristic usually has a series of
plateaux. Simulated annealing and plateau search (i.e. allowing sideways moves to
another state with the same score) can help local search find its way off a plateau.
This wandering on the plateau can be directed with tabu search: keeping a small list
of recently visited states and forbidding the algorithm to return to those states.
Constraint weighting: a technique that can help concentrate the search on the
important constraints.
Each constraint is given a numeric weight Wi, initially all 1.
At each step, the algorithm chooses a variable/value pair to change that will result in
the lowest total weight of all violated constraints.
Fig 3.21
The weights are then adjusted by incrementing the weight of each constraint that is
violated by the current assignment.
Local search can be used in an online setting when the problem changes, this is
particularly important in scheduling problems.
Fig 3.22
Fig 3.23
1.9.5.2 Tree Decomposition
A tree decomposition must satisfy the following three requirements:
i. Every variable in the original problem appears in at least one of the
subproblems.
ii. If two variables are connected by a constraint in the original problem, they
must appear together (along with the constraint) in at least one of the
subproblems.
iii. If a variable appears in two subproblems in a tree, it must appear in every
subproblem along the path connecting those subproblems.
Fig 3.24
2-Marks
1. Define A.I or what is A.I?
It is a branch of computer science by which we can create intelligent machines
which can behave like a human, think like humans, and able to make decisions.
2. What is meant by Turing test?
To conduct this test we need two people and one machine.
One person is the interrogator (i.e. the questioner), who asks questions of the other
person and of the machine.
The three of them are in separate rooms. The interrogator knows them only as A and B,
so it has to identify which is the person and which is the machine.
The goal of the machine is to make the interrogator believe its answers are a person's.
If the machine succeeds in fooling the interrogator, the machine acts like a human.
Programming a computer to pass the Turing test is very difficult.
3. What is called materialism?
An alternative to dualism is materialism, which holds that the entire world operates
according to physical law.
Mental processes and consciousness are therefore part of the physical world.
4. What are the capabilities, computer should posses to pass Turing test?
Natural Language Processing
Knowledge representation
Automated reasoning
Machine Learning
5. Define Total Turing Test?
The test which includes a video signals so that the interrogator can test the perceptual
abilities of the machine.
6. What are the capabilities computers needs to pass total Turing test?
Computer Vision
Robotics
7. Define Rational Agent.
A rational agent is one that does the right thing. Here right thing is one that will cause
agent to be more successful. That leaves us with the problem of deciding how and when to
evaluate the agent’s success.
8. Define Agent.
An Agent is anything that can be viewed as perceiving (i.e.) understanding its
environment through sensors and acting upon that environment through actuators.
The major functionality of the agent function is to generate the possible action to each
and every percept. It helps the agent to get the list of possible actions the agent can
take.
Agent function can be represented in the tabular form.
20. Define basic agent program?
The basic agent program is a concrete implementation of the agent function which
runs on the agent architecture.
Agent program puts a bound on the length of the percept sequence and considers only
the required percept sequences.
Agent program implements the functions of percept sequence and action which are
external characteristics of the agent.
Agent program takes input as the current percept from the sensor and return an action
to the effectors (Actuators)
21. What are the four components to define a problem? Define them.
The initial state: the state in which the agent starts.
A description of possible actions: the actions available to the agent.
The goal test: the test that determines whether a given state is a goal state.
A path cost function: a function that assigns a numeric cost (value) to each path. The
problem-solving agent is expected to choose a cost function that reflects its own performance
measure.
22. What is rational at any given time depends on four things:
The performance measure that defines the criterion of success.
The agent’s prior knowledge of the environment.
The actions that the agent can perform.
The agent’s percept sequence to date.
Heuristic search is a class of methods used to search a solution space for an optimal
solution to a problem.
The heuristic uses some method to search the solution space while assessing where in the
space the solution is most likely to be, and focuses the search on that area.
28. What is heuristic function?
A node with the lowest evaluation is selected for expansion, because the evaluation
measures distance to the goal.
A heuristic function h(n) is defined as the estimated cost of the cheapest path from
node ‘n’ to a goal node.
The greedy best-first search algorithm always selects the path which appears best at
that moment. At each step, we choose the most promising node: we expand the node
which is closest to the goal node, where closeness is estimated by the heuristic
function, i.e. f(n) = h(n).
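The evaluation f(n) = h(n) can be sketched as a small priority-queue search. This is a hedged illustration: the graph, heuristic values, and function names below are hypothetical, not taken from the text.

```python
import heapq

def greedy_best_first(start, goal, neighbors, h):
    """Greedy best-first search: always expand the node with the lowest
    heuristic value, f(n) = h(n). The cost already paid, g(n), is ignored
    (A* would use f(n) = g(n) + h(n) instead)."""
    frontier = [(h(start), start, [start])]      # (priority, node, path)
    visited = set()
    while frontier:
        _, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        for nxt in neighbors(node):
            if nxt not in visited:
                heapq.heappush(frontier, (h(nxt), nxt, path + [nxt]))
    return None

# Hypothetical graph and heuristic values toward goal 'G'.
graph = {'S': ['A', 'B'], 'A': ['G'], 'B': ['G'], 'G': []}
h_values = {'S': 5, 'A': 2, 'B': 4, 'G': 0}
path = greedy_best_first('S', 'G', lambda n: graph[n], lambda n: h_values[n])
```

Here the search prefers A (h = 2) over B (h = 4) even though nothing is known about the actual edge costs, which is exactly why greedy best-first search is not guaranteed optimal.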
31. What is an optimal solution?
An optimal solution is the solution with the lowest path cost among all solutions.
32. What is data abstraction?
Data abstraction is defined as the process of reducing the object to its essence
so that only the necessary characteristics are exposed to the users.
33. Differentiate uninformed and informed search strategies
Hill climbing algorithm is a local search algorithm which continuously moves in the
direction of increasing elevation/value to find the peak of the mountain or best solution
to the problem.
It terminates when it reaches a peak value where no neighbour has a higher value.
A ridge is a special form of the local maximum. It has an area which is higher
than its surrounding areas, but itself has a slope, and cannot be reached in a
single move.
Solution: With the use of bidirectional search, or by moving in different
directions, we can overcome this problem.
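The climb-until-no-neighbour-improves loop described above can be sketched as follows; the one-dimensional landscape and names are hypothetical, chosen so the single peak is easy to see.

```python
def hill_climb(start, neighbors, value):
    """Simple hill climbing: move to the best neighbour until no neighbour
    has a higher value; the stopping point is a peak (possibly only local)."""
    current = start
    while True:
        candidates = neighbors(current)
        if not candidates:
            return current
        best = max(candidates, key=value)
        if value(best) <= value(current):
            return current          # peak reached
        current = best

# Hypothetical 1-D landscape over states 0..6 with a single peak at x = 3.
value = lambda x: -(x - 3) ** 2
neighbors = lambda x: [n for n in (x - 1, x + 1) if 0 <= n <= 6]
peak = hill_climb(0, neighbors, value)
```

On a landscape with ridges or several peaks, the same loop would stop at the first local maximum it reaches, which is the failure mode the text describes.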
When the environment is nondeterministic, the agent doesn’t know what state
it transitions to after taking an action.
The set of physical states that the agent believes it might be in is called a
belief state.
40. What is a plateau in local search?
A plateau is a flat area of the search space in which all the neighbouring
states have the same value of the objective function, so the algorithm
cannot find a best direction in which to move.
41. What is local maximum and global maximum in state space?
Local maximum: Local maximum is a state which is better than its neighbour
states, but there is also another state which is higher than it.
Global Maximum: Global Maximum is the best possible state of state space
landscape. It has the highest value of objective function.
42. Define Admissible heuristic h(n)
Admissible heuristics are used to estimate the cost of reaching the goal state in
a search algorithm.
Admissible heuristics never overestimate the cost of reaching the goal state.
The use of admissible heuristics also results in optimal solutions as they
always find the cheapest path solution.
For a heuristic to be admissible for a search problem, it must be less than or
equal to the actual cost of reaching the goal.
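One way to make this definition concrete is to compare a candidate heuristic against the true cheapest costs h*(n), computed exactly with Dijkstra's algorithm. The graph, edge weights, and h values below are made up for illustration.

```python
import heapq

def true_costs(goal, edges):
    """Exact cheapest cost from every node to the goal: Dijkstra on the
    reversed graph. These values play the role of h*(n), the ground truth."""
    rev = {}
    for (u, v), w in edges.items():
        rev.setdefault(v, []).append((u, w))
    dist = {goal: 0}
    pq = [(0, goal)]
    while pq:
        d, node = heapq.heappop(pq)
        if d > dist.get(node, float('inf')):
            continue
        for prev, w in rev.get(node, []):
            if d + w < dist.get(prev, float('inf')):
                dist[prev] = d + w
                heapq.heappush(pq, (d + w, prev))
    return dist

# Hypothetical weighted graph and a candidate heuristic to check.
edges = {('S', 'A'): 1, ('A', 'G'): 4, ('S', 'G'): 6}
h = {'S': 4, 'A': 3, 'G': 0}

hstar = true_costs('G', edges)
# Admissible iff h never overestimates the true remaining cost.
admissible = all(h[n] <= hstar[n] for n in h)
```

Raising h['S'] above 5 (the true cheapest cost S→A→G) would make the check fail, i.e. the heuristic would overestimate and lose its admissibility guarantee.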
43. What is triangle inequality?
It states that each side of a triangle cannot be longer than the sum of the
other two sides of the triangle.
Each state in a metalevel state space captures the internal state of a program that
is searching in an object level state space.
45. When is a heuristic function h said to be admissible? Give an admissible heuristic function
for TSP.
A heuristic h is admissible if it never overestimates the true cost of reaching the goal.
For the travelling salesperson problem, the cost of a minimum spanning tree of the
remaining cities is an admissible heuristic, since the optimal tour can never be shorter
than the minimum spanning tree.
56. What are the four steps of Monte Carlo tree search?
The process of Monte Carlo Tree Search can be broken down into four distinct
steps:
Selection
Expansion
Simulation
Backpropagation
57. What are the advantages of Monte Carlo search?
MCTS is a simple algorithm to implement.
Monte Carlo Tree Search is a heuristic algorithm. MCTS can operate
effectively without any knowledge of the particular domain, apart from the
rules and conditions, and can find its own moves and learn from them by
playing random playouts.
58. What are stochastic games in detail?
A stochastic game was introduced by Lloyd Shapley in the early 1950s. It is a
dynamic game with probabilistic transitions played by one or more players.
The game is played in a sequence of stages. At the beginning of each stage,
the game is in a certain state.
Stochastic games (SG), also called Markov games, extend Markov decision
processes (MDPs) to the case where there are multiple players in a common
environment.
These agents perform a joint action that defines both the reward obtained by
the agents and the new state of the environment.
59. What is partial observability in AI?
Partial observability means that an agent does not know the state of the world
or that the agents act simultaneously.
In a partially observable system the observer may utilize a memory system in
order to add information to the observers understanding of the system.
An example of a partially observable system would be a card game in which
some of the cards are discarded into a pile face down.
60. What is CSP in AI?
A constraint satisfaction problem (CSP) consists of a set of variables, a
domain for each variable, and a set of constraints.
Part-B
1. State and Explain the Types of Environments in AI.
2. Explain the Types of Intelligent Agents.
3. Explain Rationality with Example.
4. Explain all four characteristic of Intelligent Agent.
5. List the features of Intelligent Agents
6. Explain uninformed search strategies in detail
7. Discuss about
i. Greedy best-first search
ii. A* search
iii. Memory bounded heuristic search
8. Compose the algorithm for recursive best first search.
9. Explain the nature of heuristics with an example. What is the effect of heuristic
accuracy on performance?
10. Explain the following types of Hill Climbing search techniques
i) Simple Hill Climbing
ii) Steepest-Ascent Hill Climbing
iii) Simulated Annealing
11. Explain informed search strategies with an example
12. Explain in detail about Online Search Agent and Unknown environment
13. Explain briefly about Search with non-deterministic actions
14. Describe Alpha beta pruning with algorithm.
15. Explain stochastic games with examples.
16. Why is game theory important in AI?
17. Define minimax algorithm in Game theory.
18. How does Monte Carlo Tree search work?
19. How CSP is formulated as a search problem?
20. What is backtracking search in CSP?
UNIT-II
PROBABILISTIC REASONING
Acting under uncertainty – Bayesian inference – Naïve Bayes models. Probabilistic reasoning –
Bayesian networks – Exact inference in BN – Approximate inference in BN – Causal networks.
∀p Symptom(p, Toothache)
⇒ Disease(p, Cavity) ∨ Disease(p, GumDisease) ∨ Disease(p, Swelling) ∨ …
To make the rule true, we would have to add an almost unlimited list of possible causes.
We could try a causal rule instead:
But this rule is not right either; not all cavities cause pain.
Toothache and Cavity are not connected by a strict logical rule, so the judgement may go wrong.
This is typical of the medical domain, as well as most other judgmental domains: law,
business, design, automobile repair, gardening, dating, and so on.
Three main reasons of failures
Decision theory combines the agent’s beliefs and desires, defining the best action as
the one that maximizes expected utility. (i.e, not all the time the utility values are
satisfied, but we have to maximize the expected utilities).
(In this example all events are equally likely, so there is no bias toward any instance;
the probability of each event is 1/36. Assigning this probability to a particular possible
world is governed by the probability model.)
w represents one of the possible worlds; the probability of any world lies between
0 and 1, where 0 represents an impossible event and 1 a certain event. This holds for
every w, and the probabilities of all the possible worlds sum to 1 (we have 36 events,
each with probability 1/36, so the sum is 1).
Unconditional Probability/ Prior Probability
For example, rolling two dice so that they add up to 11: the instances that add up to 11
are (5, 6) and (6, 5), and their probability is 1/36 + 1/36 = 2/36 = 1/18.
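The 1/36-per-world arithmetic above can be checked by enumerating all possible worlds (a small sketch using exact fractions):

```python
from fractions import Fraction

# The 36 equally likely possible worlds for two dice, each with P = 1/36.
worlds = [(d1, d2) for d1 in range(1, 7) for d2 in range(1, 7)]
p_world = Fraction(1, 36)

# The probabilities of all possible worlds sum to 1.
total = p_world * len(worlds)

# P(Total = 11): only the worlds (5, 6) and (6, 5) qualify,
# so 1/36 + 1/36 = 2/36 = 1/18.
p_eleven = sum(p_world for (d1, d2) in worlds if d1 + d2 == 11)
```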
Product rule
Random variables – variables in probability theory are called random variables. (We don’t
know the exact occurrence of these variables; that’s why we call them random variables.)
Domain- Each variable will have domain
(we have our syntax of logical statement inorder to represent our knowledge)
Example: “The probability that the patient has a cavity, given that she is a teenager with no
toothache, is 0.1” as follows:
Probability distribution
For example, Weather={ sunny, rain, cloudy, snow}
Sometimes we will want to talk about the probabilities of all the possible values of a random
variable. We could write:
Inclusion-exclusion principle
P(A˅B) = P(A) + P(B) − P(A∧B)
Sum rule: for mutually exclusive events, P(A∧B) = 0, so
P(A˅B) = P(A) + P(B)
Independent event
P(A|B)= P(A)
So, that we have product rule,
P(A∧B)= P(A|B). P(B)
=P(A). P(B)
So, this is true for independent events.
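The identity P(A∧B) = P(A)·P(B) for independent events can be verified by enumeration on the two-dice model; the two events chosen below are illustrative.

```python
from fractions import Fraction

# Two dice: A = "first die is even", B = "second die shows 6".
# These events are independent, so P(A ∧ B) = P(A) · P(B) must hold.
worlds = [(d1, d2) for d1 in range(1, 7) for d2 in range(1, 7)]
p = Fraction(1, 36)

p_a = sum(p for d1, d2 in worlds if d1 % 2 == 0)               # P(A) = 1/2
p_b = sum(p for d1, d2 in worlds if d2 == 6)                   # P(B) = 1/6
p_ab = sum(p for d1, d2 in worlds if d1 % 2 == 0 and d2 == 6)  # P(A ∧ B)
```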
Here, Weather and Cavity are the variables; Weather has 4 values and Cavity has 2 values.
What are the possible combinations of these variables? Considering (W, C), there are 8 combinations.
So, we find the probability distribution over these two variables.
One instance is sunny weather with cavity; another is sunny with no cavity.
Similarly, rain with cavity and rain without cavity, and so on: 4 × 2 = 8 combinations are possible.
So, every possible world has some probability, and some possible worlds are more
probable than others; the probability of such a combination is called a joint probability.
When you have multiple variables, probability distribution over multiple variables is known
as joint probability.
And if you enlist all the possible world then it becomes, full joint probability distribution.
Can be written as a single equation:
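As a numeric sketch of a full joint distribution over Weather and Cavity, the eight probabilities below are assumed for illustration (not taken from the text); they sum to 1, and any marginal can be read off by summing rows.

```python
# Hypothetical full joint distribution over Weather (4 values) and
# Cavity (2 values): 4 * 2 = 8 possible worlds.
joint = {
    ('sunny', True): 0.144, ('sunny', False): 0.576,
    ('rain', True): 0.020, ('rain', False): 0.080,
    ('cloudy', True): 0.016, ('cloudy', False): 0.064,
    ('snow', True): 0.020, ('snow', False): 0.080,
}

# All eight worlds together must account for probability 1.
total = sum(joint.values())

# Marginal P(Cavity = true): sum over all weather values.
p_cavity = sum(p for (w, c), p in joint.items() if c)
```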
2.4 Independence
Independence between propositions a and b can be written as
It seems safe to say that the weather does not influence the dental variables.
Therefore, the following assertion seems reasonable
Thus, the 32-element table for four variables can be constructed from one 8-element
table and one 4-element table.
2.5 Bayes’ Rule and its use
The product rule can be written in two forms: P(a∧b) = P(a|b)P(b) and P(a∧b) = P(b|a)P(a).
Equating the right-hand sides and dividing by P(a) gives Bayes’ rule: P(b|a) = P(a|b)P(b) / P(a).
Equation 5.3
To categorize a new document, we check which key words appear in the document
and then apply Equation 5.3 to obtain the posterior probability distribution over
categories.
If we have to predict just one category, we take the one with the highest posterior
probability.
Notice that, for this task, every effect variable is observed, since we can always tell
whether a given word appears in the document.
The naive Bayes model assumes that words occur independently in documents, with
frequencies determined by the document category. This independence assumption is
clearly violated in practice.
For example, the phrase “first quarter” occurs more frequently in business (or sports)
articles than would be suggested by multiplying the probabilities of “first” and
“quarter.”
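The categorization step can be sketched as follows. This is a minimal illustration with a tiny hypothetical vocabulary and hand-picked priors and likelihoods (not the book's numbers): the posterior over categories is proportional to the prior times the product of per-word likelihoods, and absent words are ignored for brevity.

```python
import math

# Assumed priors and per-word likelihoods P(word appears | category).
priors = {'business': 0.5, 'sports': 0.5}
likelihoods = {
    'business': {'market': 0.8, 'goal': 0.1},
    'sports':   {'market': 0.2, 'goal': 0.7},
}

def posterior(words, priors, likelihoods):
    """P(Category | words) ∝ P(Category) * Π P(word | Category),
    treating words as independent given the category (the naive assumption)."""
    scores = {}
    for cat, prior in priors.items():
        log_p = math.log(prior)
        for w in words:
            log_p += math.log(likelihoods[cat][w])
        scores[cat] = log_p
    z = sum(math.exp(s) for s in scores.values())   # normalize
    return {cat: math.exp(s) / z for cat, s in scores.items()}

post = posterior(['market'], priors, likelihoods)
best = max(post, key=post.get)   # predict the highest-posterior category
```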
2.7 Probabilistic Reasoning
Bayesian Network
A Bayesian network represents the dependencies among variables and gives a
concise specification of any full joint probability distribution.
A Bayesian network is a data structure also called a belief network, probabilistic
network, causal network, or knowledge map.
The extension of a Bayesian network is called a decision network or influence
diagram.
A Bayesian network is a directed graph in which each node is annotated with quantitative
probability information.
The full specification is as follows:
A set of random variables makes up the nodes of the network. Variables may
be discrete or continuous.
A set of directed links or arrows connects pairs of nodes. If there is an
arrow from node X to node Y, X is said to be a parent of Y.
Each node X has a conditional probability distribution P(X | Parents(X)) that
quantifies the effect of the parents on the node.
The graph has no directed cycles (and hence is a directed, acyclic graph, or
DAG.)
Example: Simple Bayesian network
John always calls when he hears the alarm, but sometimes confuses the telephone
ringing with the alarm and calls then, too.
Mary, on the other hand, likes loud music and sometimes misses the alarm altogether.
Given the evidence of who has or has not called, we would like to estimate the
probability of a burglary.
which is equal to
This identity is called the chain rule. The specification of the joint distribution is
equivalent to the general assertion that, for every variable 𝑋𝑖 in the network,
No need to consider all the other things, because the parent of MaryCalls is Alarm.
So, MaryCalls is the child node of Alarm.
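The chain-rule factorization P(b, e, a, j, m) = P(b)P(e)P(a|b,e)P(j|a)P(m|a) can be sketched numerically on the burglary network. The CPT numbers below are the commonly used AIMA textbook values and should be treated as assumed, not derived here.

```python
# Assumed CPTs for the burglary network (AIMA-style numbers).
p_b, p_e = 0.001, 0.002                       # P(Burglary), P(Earthquake)
p_a_given = {(True, True): 0.95, (True, False): 0.94,
             (False, True): 0.29, (False, False): 0.001}   # P(Alarm | B, E)
p_j_given_a = {True: 0.90, False: 0.05}       # P(JohnCalls | Alarm)
p_m_given_a = {True: 0.70, False: 0.01}       # P(MaryCalls | Alarm)

# Probability that the alarm sounds and both neighbours call,
# but there is neither a burglary nor an earthquake:
p = ((1 - p_b) * (1 - p_e) * p_a_given[(False, False)]
     * p_j_given_a[True] * p_m_given_a[True])
```

Note that each factor only conditions on a node's parents, which is exactly the compactness the chain rule buys: five small tables instead of a 2^5-entry joint.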
Compactness and node ordering
The compactness of a Bayesian network is an example of a general property of locally
structured systems (also called sparse systems).
In a locally structured system, each subcomponent interacts directly with only a
bounded number of other components, regardless of the total number of components.
Therefore the correct order in which to add nodes is to add the ‘root causes’ first, then the
variables they influence, and so on until we reach the leaves.
Suppose, we decide to add the nodes in the order MaryCalls, JohnCalls,Alarm,
Burglary, Earthquake.
Adding MaryCalls: No parents
Adding JohnCalls: If Mary calls, that probably means the alarm has gone off, which of
course would make it more likely that John calls. Therefore, JohnCalls needs
MaryCalls as a parent.
Adding Alarm: Clearly, if both call, it is more likely that the alarm has gone off than
if just one or neither call, so we need both MaryCalls and JohnCalls as parents.
Adding Burglary: If we know the alarm state, then the call from John or Mary might
give us information about our phone ringing or Mary’s music, but not about burglary:
MaryCalls
JohnCalls
Alarm
Burglary
Earthquake
ii. A node is conditionally independent of all other nodes in the network, given
its parents, children, and children’s parents- that is, given its Markov blanket.
For example, Burglary is independent of JohnCalls and MaryCalls, given
Alarm and Earthquake.
Notations
X denotes the query variable
E set of evidence variables {𝐸1 , … 𝐸𝑚 }
e particular observed event
Y non-evidence, non-query variables, 𝑌1 , … 𝑌𝑛 . (called the hidden variables)
The complete set of variables – 𝑿 = {𝑿} 𝙐 𝑬 𝙐 𝒀
A typical query asks for the Posterior Probability distribution P( X | e)
In the burglary network, we might observe the event in which
𝐽𝑜ℎ𝑛𝐶𝑎𝑙𝑙𝑠 = 𝑡𝑟𝑢𝑒 𝑎𝑛𝑑 𝑀𝑎𝑟𝑦𝐶𝑎𝑙𝑙𝑠 = 𝑡𝑟𝑢𝑒.
We could then ask for, say, the probability that a burglary has occurred:
Types of Inferences
2.9.1 Inference by Enumeration (inference by listing or recording all variables)
Any conditional probability can be computed by summing terms from the full joint
distribution.
More specifically, a query P(X | e) can be answered using equation.
Here e denotes the observed event, and the summation ranges over the values y of the
hidden variables Y.
The semantic of Bayesian network give us an expression, in terms of CPT entries, for
simplicity we do this just for Burglary = true
Intermediate results are stored, and summations of each variable are done, for only
those portion of the expression, that depends on the variable.
Let us illustrate this process for the burglary network.
We evaluate the expression
We have annotated each part of the expression with the same name of the associated
variable, these parts are called factors
For example, the factors 𝑓4(𝐴) and 𝑓5(𝐴) corresponding to 𝑃( 𝑗 | 𝑎) and 𝑃( 𝑚 | 𝑎)
depending just on A because J and M are fixed by the query.
They are therefore two element vectors.
Given two factors 𝑓(𝑋, 𝑌) and 𝑔(𝑌, 𝑍) with probability distributions shown below,
the pointwise product 𝑓 × 𝑔 = ℎ(𝑋, 𝑌, 𝑍) has 2^(1+1+1) = 8 entries.
Elimination
Summing out, or eliminating, a variable from a factor is done by adding up the sub-
arrays formed by fixing the variable to each of its values in turn.
For example, to sum out Y from ℎ(𝑋, 𝑌, 𝑍), we write:
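The pointwise product and summing-out operations can be sketched with factors stored as dictionaries mapping Boolean assignments to numbers; the variable names and factor values below are hypothetical.

```python
from itertools import product

def pointwise_product(f, f_vars, g, g_vars):
    """h = f x g: entries of f and g that agree on the shared variables are
    multiplied; the result ranges over the union of the variables."""
    h_vars = f_vars + [v for v in g_vars if v not in f_vars]
    h = {}
    for assign in product([True, False], repeat=len(h_vars)):
        world = dict(zip(h_vars, assign))
        h[assign] = (f[tuple(world[v] for v in f_vars)]
                     * g[tuple(world[v] for v in g_vars)])
    return h, h_vars

def sum_out(var, f, f_vars):
    """Eliminate `var` by adding up the sub-arrays formed by fixing it
    to each of its values in turn."""
    i = f_vars.index(var)
    out = {}
    for assign, p in f.items():
        key = assign[:i] + assign[i + 1:]
        out[key] = out.get(key, 0.0) + p
    return out, [v for v in f_vars if v != var]

# Hypothetical factors f(X, Y) and g(Y, Z).
f = {(True, True): 0.2, (True, False): 0.3,
     (False, True): 0.4, (False, False): 0.1}
g = {(True, True): 0.5, (True, False): 0.5,
     (False, True): 0.9, (False, False): 0.1}

h, h_vars = pointwise_product(f, ['X', 'Y'], g, ['Y', 'Z'])  # 2^3 = 8 entries
marg, marg_vars = sum_out('Y', h, h_vars)                    # back to (X, Z)
```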
Relevance
𝑃(𝐽|𝑏)
Sampling Methods
Basic idea:
Draw N samples from a sampling distribution S.
Compute an approximate posterior probability P.
Show that this approximation converges to the true probability distribution P.
Why sampling
Generating samples is often much faster than computing the right answer (e.g., with
variable elimination)
Sampling
How to sample from the distribution of a discrete variable X?
Assume k discrete outcomes 𝑥1,… 𝑥𝑘 with probabilities P(𝑥𝑖 ).
Assume sampling from the uniform 𝑈(0,1) is possible,
e.g. as enabled by a standard rand() function.
Divide the [0,1] interval into k regions, with region i having size P(𝑥𝑖 ).
Sample 𝑢~𝑈(0,1) and return the value associated with the region in which 𝑢 falls.
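The region-dividing procedure above can be written directly (a sketch; the outcome names and probabilities are illustrative):

```python
import random

def sample_discrete(outcomes, probs, u=None):
    """Sample x_i by dividing [0, 1] into k regions of size P(x_i) and
    returning the outcome whose region contains u ~ U(0, 1)."""
    if u is None:
        u = random.random()          # standard uniform, as with rand()
    cumulative = 0.0
    for x, p in zip(outcomes, probs):
        cumulative += p
        if u < cumulative:
            return x
    return outcomes[-1]              # guard against rounding when u ~ 1

# Regions for P = (0.2, 0.5, 0.3): [0, 0.2), [0.2, 0.7), [0.7, 1).
x = sample_discrete(['a', 'b', 'c'], [0.2, 0.5, 0.3], u=0.65)
```

Passing a fixed u makes the mapping from the unit interval to outcomes easy to verify; in normal use u is drawn fresh for every sample.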
Prior Sampling
Sampling from a Bayesian network, without observed evidence.
Sample each variable in turn, in topological order.
The probability distribution from which the value is sampled is conditioned on
the values already assigned to the variable’s parents.
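A sketch of prior sampling on a tiny assumed two-node network (Cloudy → Rain; the CPT numbers are made up): each variable is sampled in topological order, conditioned on the parent values already drawn.

```python
import random

def prior_sample(topo_order, cpts, rng):
    """Sample each variable in topological order, conditioning the draw
    on the values already assigned to the variable's parents."""
    event = {}
    for var, parents in topo_order:
        key = tuple(event[p] for p in parents)
        p_true = cpts[var][key]
        event[var] = rng.random() < p_true
    return event

# Assumed network: Cloudy -> Rain.
topo = [('Cloudy', ()), ('Rain', ('Cloudy',))]
cpts = {'Cloudy': {(): 0.5},
        'Rain': {(True,): 0.8, (False,): 0.2}}

rng = random.Random(0)
samples = [prior_sample(topo, cpts, rng) for _ in range(10000)]
# P(Rain) = 0.5 * 0.8 + 0.5 * 0.2 = 0.5, so the empirical
# frequency should be close to 0.5.
p_rain = sum(s['Rain'] for s in samples) / len(samples)
```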
Analysis
The probability that prior sampling generates a particular event is
Let 𝑁𝑃𝑆 (𝑥1,.. 𝑥𝑛 ) denote the number of samples of an event. We define the probability
estimator
Using prior sampling, an estimate 𝑃̂(𝑥|𝑒) can be formed from the proportion of
samples 𝑥 agreeing with the evidence 𝑒 among all samples agreeing with the
evidence.
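A sketch of rejection sampling on a small assumed sprinkler-style network (all numbers are illustrative): samples that disagree with the evidence are simply discarded, and the posterior is the fraction of surviving samples in which the query holds.

```python
import random

def rejection_sample_posterior(n, rng):
    """Estimate P(Rain | Sprinkler = true) in a toy network by discarding
    samples that disagree with the evidence."""
    agree, agree_and_rain = 0, 0
    for _ in range(n):
        cloudy = rng.random() < 0.5
        sprinkler = rng.random() < (0.1 if cloudy else 0.5)
        rain = rng.random() < (0.8 if cloudy else 0.2)
        if sprinkler:                  # keep only samples matching e
            agree += 1
            agree_and_rain += rain
    return agree_and_rain / agree

rng = random.Random(1)
# Exact answer for these numbers: P(R, S) / P(S) = 0.09 / 0.30 = 0.3.
estimate = rejection_sample_posterior(20000, rng)
```

Note the inefficiency the text goes on to address: with P(Sprinkler) = 0.3, roughly 70% of the generated samples are thrown away, which is what likelihood weighting avoids.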
Analysis
Let us consider the posterior probability estimator 𝑃̂(𝑋|𝑒) formed by rejection sampling:
Analysis
The sampling probability for an event with likelihood weighting, multiplied by its weight,
recovers the true joint probability:
𝑆𝑊𝑆 (𝑥, 𝑒) 𝜔(𝑥, 𝑒) = 𝑃(𝑥, 𝑒)
The estimated posterior probability is computed as follows:
𝑃̂(𝑥, 𝑒) = 𝛼𝑁𝑊𝑆 (𝑥, 𝑒)𝜔(𝑥, 𝑒)
prior 𝑃(𝐹𝑖𝑟𝑒) and the conditional probability 𝑃(𝑆𝑚𝑜𝑘𝑒|𝐹𝑖𝑟𝑒) in order to specify the
joint distribution.
However, this distribution can be represented equally well by the reverse arrow 𝐹𝑖𝑟𝑒 ←
𝑆𝑚𝑜𝑘𝑒, using the appropriate 𝑃(𝑆𝑚𝑜𝑘𝑒) and 𝑃(𝐹𝑖𝑟𝑒|𝑆𝑚𝑜𝑘𝑒) computed from Bayes’
rule. The idea that these two networks are equivalent, hence convey the same
information, evokes discomfort and even resistance in most people.
Causal Bayesian networks, sometimes called Causal Diagrams, were devised to permit
us to represent causal asymmetries and to leverage the asymmetries towards reasoning
with causal information.
If nature assigns a value to 𝑆𝑚𝑜𝑘𝑒 on the basis of what nature learns about 𝐹𝑖𝑟𝑒, we
draw an arrow from 𝐹𝑖𝑟𝑒 to 𝑆𝑚𝑜𝑘𝑒.
More importantly, if we judge that nature assigns 𝐹𝑖𝑟𝑒 a truth value that depends on
other variables, not 𝑆𝑚𝑜𝑘𝑒, we refrain from drawing the arrow 𝐹𝑖𝑟𝑒 ← 𝑆𝑚𝑜𝑘𝑒.
In other words, the value 𝑥𝑖 of each variable𝑋𝑖 is determined by an equation 𝑥𝑖 =
𝑓𝑖 (𝑂𝑡ℎ𝑒𝑟𝑉𝑎𝑟𝑖𝑎𝑏𝑙𝑒𝑠), and an arrow 𝑋𝑗 → 𝑋𝑖 is drawn if and only if 𝑋𝑗 is one of the
arguments of 𝑓𝑖 .
The equation 𝑥𝑖 = 𝑓𝑖 (. ) is called a structural equation.
We simply delete the arrow Rain→ 𝑊𝑒𝑡𝐺𝑟𝑎𝑠𝑠.
where we have abbreviated each variable name by its first letter. As a system of
structural equations, the model looks like this:
where, without loss of generality, each 𝑓𝑖 can be the identity function. The U-variables in
these equations represent unmodeled variables, also called error terms or disturbances,
that perturb the functional relationship between each variable and its parents.
Suppose we turn the sprinkler on; that is, we (who are, by definition, not part of the
causal processes described by the model) intervene to impose the condition
Sprinkler = true. In the notation of the do-calculus, which is a key part of the theory of
causal networks, this is written as 𝑑𝑜(𝑆𝑝𝑟𝑖𝑛𝑘𝑙𝑒𝑟 = 𝑡𝑟𝑢𝑒). Once done, the
sprinkler variable is no longer dependent on whether it is a cloudy day. We therefore
delete the equation 𝑆 = 𝑓𝑠 (𝐶, 𝑈𝑠 ) from the system of structural equations and replace it
with 𝑆 = 𝑡𝑟𝑢𝑒, giving us
From these equations, we obtain the new joint distribution for the remaining variables
conditioned on 𝑑𝑜(𝑆𝑝𝑟𝑖𝑛𝑘𝑙𝑒𝑟 = 𝑡𝑟𝑢𝑒):
The probability terms in the sum are obtained by computation on the original network,
by any of the standard inference algorithms. This equation is known as an adjustment
formula.
2.11.3 The back-door criterion
The ability to predict the effect of any intervention is a remarkable result, but it does
require accurate knowledge of the necessary conditional distributions in the model,
particularly 𝑃(𝑥𝑗 |𝑝𝑎𝑟𝑒𝑛𝑡𝑠(𝑋𝑗 )).
For example, we know that “genetic factors” play a role in obesity, but we do not know
which genes play a role or the precise nature of their effects.
The specific reason this is problematic in this instance is that we would like to predict
the effect of turning on the sprinkler on a downstream variable such as 𝐺𝑟𝑒𝑒𝑛𝑒𝑟𝐺𝑟𝑎𝑠𝑠,
but the adjustment formula must take into account not only the direct route from
𝑆𝑝𝑟𝑖𝑛𝑘𝑙𝑒𝑟, but also the “back door” route via 𝐶𝑙𝑜𝑢𝑑𝑦 𝑎𝑛𝑑 𝑅𝑎𝑖𝑛.
If we knew the value of 𝑅𝑎𝑖𝑛, this back-door path would be blocked—which suggests
that there might be a way to write an adjustment formula that conditions on Rain instead
of 𝐶𝑙𝑜𝑢𝑑𝑦. And indeed this is possible:
In general, if we wish to find the effect of 𝑑𝑜(𝑋𝑗 = 𝑥𝑗𝑘 ) on a variable 𝑋𝑖 , the back-door
criterion allows us to write an adjustment formula that conditions on any set of variables
Z that closes the back door.
2 marks
1. Why does uncertainty arise?
Agents almost never have access to the whole truth about their environment.
Agents cannot always find a categorical answer.
Uncertainty can also arise because of incompleteness or incorrectness in the agent's
understanding of the properties of the environment.
2. State the reasons why first-order logic fails to cope with a domain like medical
diagnosis.
Three reasons:
a. Laziness: it is too much work to list the complete set of antecedents or consequents needed
to ensure an exceptionless rule.
b. Theoretical ignorance: medical science has no complete theory for the domain.
c. Practical ignorance: even if we know all the rules, we may be uncertain about a particular
case because not all the necessary information is available.
3.What is the need for probability theory in uncertainty ?
Probability provides the way of summarizing the uncertainty that comes from our
laziness and ignorance .
Probability statements do not have quite the same kind of semantics as logical
statements; they are made with respect to the evidence available to the agent.
It is important to remember that P(a) can only be used when there is no other
information.
ML (maximum likelihood) – it is a reasonable approach when there is no reason to prefer one
hypothesis over another a priori.
13. What are the methods for maximum likelihood parameter learning?
i. Write down an expression for the likelihood of the data as a function of the parameter.
ii. Write down the derivative of the log likelihood with respect to each parameter.
iii. Find the parameter values such that the derivatives are zero.
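The three steps can be carried out by hand for a Bernoulli parameter θ (a sketch; the coin-flip setting is an assumed illustration): the likelihood is L(θ) = θ^heads · (1−θ)^tails, the derivative of log L is heads/θ − tails/(1−θ), and setting it to zero gives θ* = heads / (heads + tails).

```python
def mle_bernoulli(heads, tails):
    """Maximum-likelihood estimate for a Bernoulli parameter:
    the closed form obtained by zeroing the log-likelihood derivative."""
    return heads / (heads + tails)

# 7 successes out of 10 trials -> theta* = 0.7.
theta = mle_bernoulli(7, 3)
```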
In this model, the “class” variable C is the root and the “attribute” variables 𝑋𝑖 are the
leaves.
This model assumes that the attributes are conditionally independent of each other,
given the class.
PART-B
1. Explain Approximate inference in detail
2. Explain the Naïve Bayes model
3. What is exact inference, explain briefly
4. Write in detail about causal networks.
5. Harry installed a new burglar alarm at his home to detect burglary. Calculate the
probability that the alarm has sounded, but neither a burglary nor an earthquake has
occurred, and both John and Mary called Harry. (John and Mary are neighbours.)
UNIT III
3 SUPERVISED LEARNING
Syllabus
Introduction to machine learning - Linear Regression Models: Least squares, single &
multiple variables, Bayesian linear regression, gradient descent - Linear Classification
Models: Discriminant function - Probabilistic discriminative model - Logistic
regression, Probabilistic generative model - Naïve Bayes, Maximum margin classifier -
Support Vector Machine, Decision Tree, Random Forests
• Machine Learning (ML) is a sub-field of Artificial Intelligence (Al) which concerns with
developing computational theories of learning and building learning machines.
• Learning is a phenomenon and process which has manifestations of various aspects. The
learning process includes gaining new symbolic knowledge and developing cognitive skills
through instruction and practice. It is also the discovery of new facts and theories through
observation and experiment.
• Machine Learning Definition: A computer program is said to learn from experience E with
respect to some class of tasks T and performance measure P, if its performance at tasks in T,
as measured by P, improves with experience E.
• The goal of machine learning is to build computer systems that can adapt and learn from their
experience.
Examples:
i) Handwriting recognition learning problem
• Task T: Recognising and classifying handwritten words within images
• Performance P: Percent of words correctly classified
2.Validation: The rules are checked and, if necessary, additional training is given.
7. In artificial intelligence, it is used to teach a system to learn and adapt to changes so that the
system designer need not foresee and provide solutions for all possible situations.
8. It is used to find solutions to many problems in vision, speech recognition, and robotics.
How machines learn
Basic components of learning process
The learning process, whether by a human or a machine, can be divided into four components,
namely, data storage, abstraction, generalization and evaluation. Figure 3.2 illustrates the various
components and the steps involved in the learning process.
Example
Consider the following data regarding patients entering a clinic. The data consists of the
gender and age of the patients and each patient is labeled as “healthy” or “sick”.
2. Unsupervised learning
❖ Unsupervised learning is a type of machine learning algorithm used to draw
inferences from datasets consisting of input data without labeled responses.
❖ In unsupervised learning algorithms, a classification or categorization is not included
in the observations.
❖ The most common unsupervised learning method is cluster analysis, which is used for
exploratory data analysis to find hidden patterns or grouping in data.
Example
Consider the following data regarding patients entering a clinic. The data consists of the
gender and age of the patients.
3 .Semi-Supervised Learning
For example, documents can be crawled from the Web, images can be obtained from surveillance
cameras, and speech can be collected from broadcasts.
4 Reinforcement learning
❖ This is somewhere between supervised and unsupervised learning.
❖ The user gets immediate feedback in supervised learning and no feedback in
unsupervised learning, but in reinforcement learning the feedback is a delayed
scalar reward.
• Regression finds how the value of the dependent variable is changing according to the value
of the independent variable.
• The model provides a sloped straight line representing the relationship between the variables.
• The mathematical equation of simple linear regression is given below:
Yi = β0 + β1 Xi + εi
Here,
o Yi is the predicted output for instance i and is the dependent or explained variable.
o β0 is the intercept of the line, while β1 is the slope or scaling factor for the input.
o Xi is the independent variable (also called the explanatory variable, predictor or
feature) that governs the learning process.
o εi is the error component.
3.2.1 Least Squares Regression
o “A least-squares regression method is a form of statistical regression analysis that
establishes the relationship between the dependent (Y) and independent variable (X)
through a straight line, referred to as the line of best fit”.
o The least squares method is a statistical procedure to find the best fit for a set of data
points by minimizing the sum of the offsets or residuals of points from the plotted
curve.
o Least squares regression is used to predict the behavior of dependent variables. This
method of regression analysis begins with a set of data points to be plotted on an x-
and y-axis graph.
o If the data shows a linear relationship between two variables, the line that best fits
this linear relationship is known as a least-squares regression line, which minimizes
the vertical distance from the data points to the regression line.
o The term "least squares" indicates the smallest sum of squares of errors otherwise
known as variance.
o The least-squares method is often applied in data fitting.
There are two basic categories of least-squares problems:
▪ Ordinary or linear least squares: used in statistical regression analysis
▪ Nonlinear least squares: iterative method to approximate the model to a linear model with
each iteration.
Advantages
o The least-squares method of regression analysis is best suited for prediction models
and trend analysis.
o It is best used in the fields of economics, finance, and stock markets wherein the
value of any future variable is predicted with the help of existing variables and the
relationship between the same.
o The least-squares method provides the closest relationship between the variables.
o The difference between the sums of squares of residuals to the line of best fit is
minimal under this method.
o The computation mechanism is simple and easy to apply.
3.7
CS3491-ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING
Disadvantages
o This method relies on establishing the closest relationship between a given set of variables.
o The computation mechanism is sensitive to the data; if there are any outliers, the results may be affected severely.
o More exhaustive computation mechanisms are needed for nonlinear problems.
Least Squares Algorithm
❖ For each (x, y) point, calculate x² and xy.
❖ Sum all x, y, x² and xy, which gives Σx, Σy, Σx² and Σxy.
❖ Calculate the slope b: b = (Σxy − (Σx·Σy)/n) / (Σx² − (Σx)²/n), where n is the number of points.
❖ Calculate the intercept a: a = (Σy − b·Σx)/n.
Example 1: The below table gives the number of hours of rainfall in Chennai and the number of French fries sold in a canteen on each day from Monday to Friday. Predict the number of French fries to be prepared on Saturday if a rainfall of 8 hours is expected.
Hours of rain No.of French fries sold
2 4
3 5
5 7
7 10
9 15
Solution:
1. For each (x, y) point, calculate x² and xy.
2. Find Σx, Σy, Σx² and Σxy.
x y x² xy
2 4 4 8
3 5 9 15
5 7 25 35
7 10 49 70
9 15 81 135
Σx=26 Σy=41 Σx²=168 Σxy=263
3. Find the slope b:
b = (263 − (26×41)/5) / (168 − (26×26)/5) = 49.8/32.8 = 1.5182
4. Calculate the intercept a:
a = (41 − (1.5182×26))/5 = 0.3049
5. Form the equation: Y = 1.5182x + 0.3049
Compute the error
x y Y=1.5182x+0.3049 Error (Y−y)
2 4 3.3413 -0.6587
3 5 4.8595 -0.1405
5 7 7.8959 0.8959
7 10 10.9323 0.9323
9 15 13.9687 -1.0313
Prediction: for x = 8 hours of rain, Y = 1.5182(8) + 0.3049 ≈ 12.45, so about 12 to 13 French fries should be prepared on Saturday.
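The worked example above can be checked with a short Python sketch of the least-squares formulas; the data and variable names follow the table, and the prediction for x = 8 answers the original question:

```python
# Least-squares fit for the rainfall / French-fries data above.
points = [(2, 4), (3, 5), (5, 7), (7, 10), (9, 15)]
n = len(points)

sum_x = sum(x for x, _ in points)        # 26
sum_y = sum(y for _, y in points)        # 41
sum_x2 = sum(x * x for x, _ in points)   # 168
sum_xy = sum(x * y for x, y in points)   # 263

# Slope and intercept from the least-squares formulas.
b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
a = (sum_y - b * sum_x) / n

print(round(b, 4), round(a, 4))   # 1.5183 0.3049 (the text truncates the slope to 1.5182)
print(round(a + b * 8, 2))        # 12.45 predicted fries for 8 hours of rain
```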
• Simple (single) linear regression performs regression analysis on two variables; the single independent variable determines the slope of the regression line.
• Multiple regression is a broader class of regressions that encompasses linear and nonlinear regressions
with multiple explanatory variables.
• Each independent variable in multiple regression has its own coefficient to ensure each variable is weighted appropriately, which allows complex connections between variables to be established.
• Two main operations are done in multiple-variable regression:
i) Determine the dependent variable based on multiple independent variables.
ii) Determine the strength of the relationship between each independent variable and the dependent variable.
❖ Multiple regression assumes there is not a strong relationship between each independent
variable.
❖ It also assumes there is a correlation between each independent variable and the single
dependent variable.
❖ Each of these relationships is weighted to ensure more impactful independent variables
drive the dependent value by adding a unique regression coefficient to each independent
variable.
❖ Using multiple variables for regression is a more specific calculation than simple linear regression; more complex relationships can be captured through multiple linear regression.
❖ All the multiple variables use multiple slopes to predict the outcome of a single target variable: Y = a + b1x1 + b2x2 + ... + bnxn
❖ In the above equation, b1, b2, ..., bn are the slopes for the individual variables x1, x2, ..., xn.
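As a minimal sketch of the multi-variable equation Y = a + b1x1 + b2x2 (two features only; the data points are invented so that the true coefficients are known), the coefficients can be recovered by solving the normal equations with plain Python:

```python
# Fitting Y = a + b1*x1 + b2*x2 by solving the normal equations (X^T X) beta = X^T y.
# Toy data generated from the true coefficients a=1, b1=2, b2=3.
data = [((1, 1), 6), ((2, 1), 8), ((1, 3), 12), ((3, 2), 13)]  # ((x1, x2), y)

X = [[1, x1, x2] for (x1, x2), _ in data]   # design matrix with an intercept column
y = [t for _, t in data]

# Normal-equation matrix A = X^T X and vector v = X^T y.
A = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
v = [sum(r[i] * t for r, t in zip(X, y)) for i in range(3)]

# Solve A * beta = v by Gaussian elimination with partial pivoting.
M = [row[:] + [v[i]] for i, row in enumerate(A)]
for col in range(3):
    pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
    M[col], M[pivot] = M[pivot], M[col]
    for r in range(col + 1, 3):
        f = M[r][col] / M[col][col]
        M[r] = [a_ - f * b_ for a_, b_ in zip(M[r], M[col])]
beta = [0.0] * 3
for i in range(2, -1, -1):
    beta[i] = (M[i][3] - sum(M[i][j] * beta[j] for j in range(i + 1, 3))) / M[i][i]

print([round(b_, 6) for b_ in beta])  # [1.0, 2.0, 3.0]
```

Each recovered entry of beta is one of the slopes (or the intercept) in the equation above; with more features the same construction applies with a larger matrix.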
➢ Bayesian regression is an approach to defining and estimating statistical models. Bayesian regression
can be very useful when we have insufficient data in the dataset or the data is poorly distributed.
➢ The output of a Bayesian Regression model is obtained from a probability distribution, as
compared to regular regression techniques where the output is just obtained from a single value
of each attribute
➢ In order to explain Naive Bayes we need to first explain Bayes' theorem. The foundation of Bayes' theorem is conditional probability (figure 1); in fact, Bayes' theorem is just an alternate, or reverse, way to calculate a conditional probability. When the joint probability P(A∩B) is hard to calculate, or when the inverse probability P(B|A) is easier to calculate, Bayes' theorem can be applied.
Here’s a simple example that will be relevant given all the New Year’s resolutions: Globo Gym wants to predict whether a member will attend the gym given the weather conditions.
We have data where each row represents member attendance to Globo Gym given the weather.
So observation 3 is a member that attended the gym when it was cloudy outside.
weather attended
0 sunny yes
1 rainy no
2 snowy no
3 cloudy yes
4 cloudy no
The below equation shows our question put into Bayes' theorem notation:
P(yes | sunny) = P(sunny | yes) · P(yes) / P(sunny)
Likelihood: P(sunny | yes) = 3/8 or 0.375 (total sunny AND yes divided by total yes)
Prior: P(yes) = 0.533 (the overall attendance rate); Evidence: P(sunny) = 0.267
Posterior: P(yes | sunny) = 0.375 · 0.533 / 0.267 ≈ 0.75
It shows that a random member is 75% likely to attend the gym given that it is sunny. That's higher than the overall average attendance of 53%! On the opposite end of the spectrum, the probability of attending the gym when it is snowy out is only 25% (0.125 ⋅ 0.533 / 0.267).
Since this is a binary example (attend or not attend) and P(yes | sunny) = 0.75 or 75%, the inverse P(no | sunny) is 0.25 or 25%, since the probabilities have to sum to 1 or 100%. That's how to use Bayes' theorem to find the posterior probability for classification.
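The gym calculation above can be reproduced directly from the summary probabilities quoted in the text (the full member table is not shown here, only the rounded figures):

```python
# Posterior P(attend | weather) via Bayes' theorem, using the rounded
# probabilities quoted in the Globo Gym example.
p_yes = 0.533                # prior: overall attendance rate
likelihood_sunny = 0.375     # P(sunny | yes)
p_sunny = 0.267              # evidence: P(sunny)

posterior_sunny = likelihood_sunny * p_yes / p_sunny
print(round(posterior_sunny, 2))   # 0.75 -> 75% likely to attend when sunny

likelihood_snowy = 0.125     # P(snowy | yes)
p_snowy = 0.267              # P(snowy)
posterior_snowy = likelihood_snowy * p_yes / p_snowy
print(round(posterior_snowy, 2))   # 0.25
```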
3.2.4 Gradient Descent in Machine Learning
➢ Gradient Descent is known as one of the most commonly used optimization algorithms
to train machine learning models by means of minimizing errors between actual and
expected results. Further, gradient descent is also used to train Neural Networks.
➢ In mathematical terminology, Optimization algorithm refers to the task of
minimizing/maximizing an objective function f(x) parameterized by x. Similarly, in
machine learning, optimization is the task of minimizing the cost function
parameterized by the model's parameters.
➢ The main objective of gradient descent is to minimize the convex function using
iteration of parameter updates. Once these machine learning models are optimized,
these models can be used as powerful tools for Artificial Intelligence and various
computer science applications.
➢ In this section on gradient descent in machine learning, we will learn in detail about gradient descent, the role of cost functions (specifically as a barometer within machine learning), types of gradient descent, learning rates, etc.
The best way to define the local minimum or local maximum of a function using gradient descent is as
follows:
o If we move towards a negative gradient or away from the gradient of the function at the current
point, it will give the local minimum of that function.
o Whenever we move towards a positive gradient or towards the gradient of the function at the
current point, we will get the local maximum of that function.
Moving towards the gradient is known as gradient ascent; moving against it is gradient descent, also known as steepest descent. The main objective of using a gradient descent algorithm is to minimize the cost function using iteration. To achieve this goal, it performs two steps iteratively:
o Calculates the first-order derivative of the function to compute the gradient or slope of that
function.
o Moves away from the direction of the gradient, i.e. takes a step of size alpha times the gradient from the current point, where alpha is the learning rate. It is a tuning parameter in the optimization process which helps to decide the length of the steps.
Cost-function
The cost function is defined as the measurement of the difference, or error, between actual values and expected values at the current position, expressed as a single real number. It improves machine learning efficiency by providing feedback to the model so that it can minimize the error and find the local or global minimum.
Before starting the working principle of gradient descent, we should know some basic concepts to find
out the slope of a line from linear regression. The equation for simple linear regression is given as:
1. Y=mX+c
Where 'm' represents the slope of the line, and 'c' represents the intercepts on the y-axis.
The starting point(shown in above fig.) is used to evaluate the performance as it is considered just as an
arbitrary point. At this starting point, we will derive the first derivative or slope and then use a tangent
line to calculate the steepness of this slope. Further, this slope will inform the updates to the parameters
(weights and bias).
The slope is steeper at the starting (arbitrary) point, but as new parameters are generated the steepness gradually reduces; at the lowest point it approaches zero, and that point is called the point of convergence.
These two factors, the direction and the step size, determine the partial-derivative calculation of future iterations and carry the search to the point of convergence, a local minimum or a global minimum.
Learning rate: It is defined as the step size taken to reach the minimum or lowest point. This is typically a small value that is evaluated and updated based on the behavior of the cost function. If the learning rate is high, it results in larger steps but also risks overshooting the minimum. A low learning rate gives small step sizes, which compromises overall efficiency but gives the advantage of more precision.
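The update rule described above can be sketched for the simple line Y = mX + c. The data below is invented (drawn exactly from the line y = 2x + 1), and the learning rate and iteration count are arbitrary choices:

```python
# Batch gradient descent on mean-squared error for y = m*x + c.
# Toy data generated from the true line y = 2x + 1.
data = [(0, 1), (1, 3), (2, 5), (3, 7)]
m, c = 0.0, 0.0
alpha = 0.05                      # learning rate: size of each step
n = len(data)

for _ in range(2000):
    # Gradients of MSE = (1/n) * sum((m*x + c - y)^2) with respect to m and c.
    grad_m = (2 / n) * sum((m * x + c - y) * x for x, y in data)
    grad_c = (2 / n) * sum((m * x + c - y) for x, y in data)
    # Step *against* the gradient (steepest descent).
    m -= alpha * grad_m
    c -= alpha * grad_c

print(round(m, 3), round(c, 3))   # converges close to 2 and 1
```

A larger alpha would reach the minimum in fewer steps but risks overshooting; a smaller one converges more slowly but more precisely, exactly as described above.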
Based on the error in various training models, the Gradient Descent learning algorithm can be divided
into Batch gradient descent, stochastic gradient descent, and mini-batch gradient descent. Let's
understand these different types of gradient descent:
Batch gradient descent (BGD) computes the error for each point in the training set and updates the model only after evaluating all training examples; one such pass is known as a training epoch. In simple words, it sums over all examples for each update.
Stochastic gradient descent (SGD) is a type of gradient descent that processes one training example per iteration: the model parameters are updated after each individual example within the dataset.
As it requires only one training example at a time, it is easier to store in allocated memory. However, it loses some computational efficiency in comparison to batch gradient descent, since its frequent updates make the gradient noisy. Sometimes, however, that noise can be helpful in finding the global minimum and escaping local minima.
In Stochastic gradient descent (SGD), learning happens on every example, and it consists of a few
advantages over other gradient descent.
Mini Batch gradient descent is the combination of both batch gradient descent and stochastic
gradient descent. It divides the training datasets into small batch sizes then performs the updates on
those batches separately.
Splitting the training dataset into smaller batches strikes a balance between the computational efficiency of batch gradient descent and the speed of stochastic gradient descent. Hence, we achieve a gradient descent with higher computational efficiency and a less noisy gradient.
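The three variants differ only in how many examples feed each update. A minimal sketch of the mini-batch loop (toy data invented from the line y = 2x + 1; the batch size of 2 is an arbitrary choice):

```python
import random

# Mini-batch gradient descent: shuffle the data each epoch, then update the
# parameters once per small batch, instead of once per epoch (batch GD)
# or once per example (SGD).
random.seed(0)
data = [(0, 1), (1, 3), (2, 5), (3, 7)]   # toy points from y = 2x + 1
m, c, alpha, batch_size = 0.0, 0.0, 0.05, 2

for _ in range(2000):                      # training epochs
    random.shuffle(data)
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        k = len(batch)
        grad_m = (2 / k) * sum((m * x + c - y) * x for x, y in batch)
        grad_c = (2 / k) * sum((m * x + c - y) for x, y in batch)
        m -= alpha * grad_m
        c -= alpha * grad_c

print(round(m, 2), round(c, 2))   # approximately 2 and 1
```

Setting batch_size to the full dataset recovers batch gradient descent, and setting it to 1 recovers SGD, which is why mini-batch is described as a combination of the two.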
Although we know Gradient Descent is one of the most popular methods for optimization
problems, it still also has some challenges. There are a few challenges as follows:
For convex problems, gradient descent can find the global minimum easily, while for non-
convex problems, it is sometimes difficult to find the global minimum, where the machine learning
models achieve the best results.
Whenever the slope of the cost function is at or close to zero, the model stops learning. Apart from the global minimum, some other scenarios can show this slope: saddle points and local minima. A local minimum has a shape similar to the global minimum, where the slope of the cost function increases on both sides of the current point.
In contrast, at a saddle point the negative gradient occurs on only one side of the point: the point behaves as a local maximum along one direction and a local minimum along the other. A saddle point takes its name from a horse's saddle.
The name "local minimum" is used because the value of the loss function is minimum at that point within a local region. In contrast, the name "global minimum" is used because the value of the loss function is minimum there globally, across the entire domain of the loss function.
In a deep neural network, if the model is trained with gradient descent and backpropagation,
there can occur two more issues other than local minima and saddle point.
Vanishing Gradients:
Vanishing gradient occurs when the gradient becomes smaller than expected. During backpropagation, the gradient shrinks as it flows backwards, causing the earlier layers of the network to learn more slowly than the later layers. When this happens, the weight updates of the earlier layers become insignificant.
Exploding Gradient:
Exploding gradient is the opposite of the vanishing gradient: it occurs when the gradient is too large and creates an unstable model. In this scenario, the model weights grow too large and may end up represented as NaN. This problem can be addressed using dimensionality reduction techniques, which help to minimize complexity within the model.
Example:
➢ Suppose we have two sets of data points belonging to two different classes that we
want to classify. As shown in the given 2D graph, when the data points are plotted
on the 2D plane, there’s no straight line that can separate the two classes of the data
points completely. Hence, in this case, LDA (Linear Discriminant Analysis) is used
which reduces the 2D graph into a 1D graph in order to maximize the separability
between the two classes.
➢ Here, Linear Discriminant Analysis uses both the axes (X and Y) to create a new
axis and projects data onto a new axis in a way to maximize the separation of the
two categories and hence, reducing the 2D graph into a 1D graph.
➢ Two criteria are used by LDA to create a new axis:
• Maximize the distance between means of the two classes.
• Minimize the variation within each class.
➢ In the above graph, it can be seen that a new axis (in red) is generated and plotted
in the 2D graph such that it maximizes the distance between the means of the two
classes and minimizes the variation within each class.
➢ In simple terms, this newly generated axis increases the separation between the data
points of the two classes. After generating this new axis using the above-mentioned
criteria, all the data points of the classes are plotted on this new axis and are shown
in the figure given below.
➢ But Linear Discriminant Analysis fails when the means of the distributions are shared, as it becomes impossible for LDA to find a new axis that makes both classes linearly separable. In such cases, we use non-linear discriminant analysis.
➢ Probabilistic LDA (PLDA) is a generative model which assumes that the given data samples are generated from a distribution; we need to find the parameters of the model which best describe the training data.
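The two LDA criteria above (maximize the distance between class means, minimize within-class variation) have a standard closed form for two classes, Fisher's discriminant: project onto w = Sw⁻¹(m1 − m2), where Sw is the within-class scatter. A minimal sketch with invented toy points:

```python
# Fisher's linear discriminant for two 2-D classes:
# project onto w = Sw^{-1} (m1 - m2), where Sw is the within-class scatter.
A = [(1, 2), (2, 1), (1, 1)]      # class 1 (toy data)
B = [(5, 6), (6, 5), (6, 6)]      # class 2 (toy data)

def mean(pts):
    n = len(pts)
    return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)

def scatter(pts, m):
    # Sum of outer products of deviations from the class mean.
    sxx = sum((p[0] - m[0]) ** 2 for p in pts)
    syy = sum((p[1] - m[1]) ** 2 for p in pts)
    sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in pts)
    return [[sxx, sxy], [sxy, syy]]

mA, mB = mean(A), mean(B)
SA, SB = scatter(A, mA), scatter(B, mB)
Sw = [[SA[i][j] + SB[i][j] for j in range(2)] for i in range(2)]

# Invert the 2x2 within-class scatter matrix.
det = Sw[0][0] * Sw[1][1] - Sw[0][1] * Sw[1][0]
inv = [[Sw[1][1] / det, -Sw[0][1] / det], [-Sw[1][0] / det, Sw[0][0] / det]]

d = (mA[0] - mB[0], mA[1] - mB[1])
w = (inv[0][0] * d[0] + inv[0][1] * d[1], inv[1][0] * d[0] + inv[1][1] * d[1])

# Project every point onto the new 1-D axis.
projA = [w[0] * x + w[1] * y for x, y in A]
projB = [w[0] * x + w[1] * y for x, y in B]
print(min(projA) > max(projB))    # True: the classes separate on the new axis
```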
➢ Discriminative models are a class of models mainly used for supervised machine learning. These types of models are also known as conditional models since they learn the boundaries between classes or labels in a dataset.
➢ Discriminative models focus on modeling the decision boundary between classes in a
classification problem. The goal is to learn a function that maps inputs to binary outputs,
indicating the class label of the input. Maximum likelihood estimation is often used to
estimate the parameters of the discriminative model, such as the coefficients of a logistic
regression model or the weights of a neural network.
Some examples of discriminative models are:
• Logistic regression
• Support vector machines(SVMs)
• Traditional neural networks
• Nearest neighbor
• Conditional Random Fields (CRFs)
• Decision Trees and Random Forest
➢ Logistic regression is one of the most popular Machine Learning algorithms, which
comes under the Supervised Learning technique. It is used for predicting the
categorical dependent variable using a given set of independent variables.
➢ Logistic regression predicts the output of a categorical dependent variable.
Therefore the outcome must be a categorical or discrete value. It can be either Yes
or No, 0 or 1, true or False, etc. but instead of giving the exact value as 0 and 1, it
gives the probabilistic values which lie between 0 and 1.
➢ Logistic Regression is very similar to Linear Regression except in how they are used: Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems.
➢ In Logistic regression, instead of fitting a regression line, we fit an "S" shaped
logistic function, which predicts two maximum values (0 or 1).
➢ The curve from the logistic function indicates the likelihood of something such as
whether the cells are cancerous or not, a mouse is obese or not based on its weight,
etc.
➢ Logistic Regression is a significant machine learning algorithm because it has the
ability to provide probabilities and classify new data using continuous and discrete
datasets.
➢ Logistic Regression can be used to classify the observations using different types
of data and can easily determine the most effective variables used for the
classification. The below image is showing the logistic function:
➢ The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
➢ It maps any real value into another value within a range of 0 and 1.
➢ The value produced by logistic regression must be between 0 and 1 and cannot go beyond this limit, so it forms a curve like the "S" form. This S-form curve is called the sigmoid function or the logistic function.
➢ In logistic regression, we use the concept of a threshold value, which defines the probability of either 0 or 1: values above the threshold tend to 1, and values below the threshold tend to 0.
➢ The Logistic regression equation can be obtained from the Linear Regression equation.
The mathematical steps to get Logistic Regression equations are given below:
• We know the equation of the straight line can be written as:
y = b0 + b1x1 + b2x2 + ... + bnxn
• In logistic regression y can be between 0 and 1 only, so let's divide the above equation by (1-y):
y/(1-y); 0 for y = 0, and infinity for y = 1
• But we need a range between -[infinity] and +[infinity]; taking the logarithm of the equation, it becomes:
log[y/(1-y)] = b0 + b1x1 + b2x2 + ... + bnxn
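The mapping between the straight-line output and a probability, and back, can be sketched as follows (the 0.5 threshold is the usual default choice):

```python
import math

def sigmoid(t):
    # Maps any real value into (0, 1): the "S"-shaped logistic function.
    return 1 / (1 + math.exp(-t))

def logit(p):
    # Inverse of the sigmoid: the log-odds log[y / (1 - y)].
    return math.log(p / (1 - p))

print(sigmoid(0))                     # 0.5 exactly at t = 0
print(round(logit(sigmoid(2.0)), 6))  # recovers 2.0: logit inverts sigmoid

# Thresholding turns the probability into a class label.
predict = lambda t, threshold=0.5: 1 if sigmoid(t) >= threshold else 0
print(predict(-3.2), predict(1.7))    # 0 1
```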
➢ On the basis of the categories, Logistic Regression can be classified into three types:
• Binomial: In binomial Logistic regression, there can be only two possible types of
dependent variables, such as 0 or 1, Pass or Fail, etc.
• Multinomial: In multinomial Logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as "cat", "dogs", or "sheep"
• Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered
types of dependent variables, such as "low", "Medium", or "High".
Generative models are a class of statistical models that can generate new data instances. These models are used in unsupervised machine learning to perform tasks such as probability and likelihood estimation, modeling data points, and distinguishing between classes based on these probabilities.
Since these models often rely on Bayes' theorem to find the joint probability, generative models can tackle more complex tasks than analogous discriminative models.
These models use probability estimates and likelihood to model data points and differentiate
between different class labels present in a dataset. Unlike discriminative models, these models
can also generate new data points.
However, they also have a major drawback: if there are outliers in the dataset, these types of models are affected to a significant extent.
Some examples of generative models are:
• Naïve Bayes
• Bayesian networks
• Markov random fields
• Hidden Markov Models (HMMs)
• Latent Dirichlet Allocation (LDA)
• Generative Adversarial Networks (GANs)
• Autoregressive Model
➢ Some popular applications of the Naïve Bayes algorithm are spam filtering, sentiment analysis, and classifying articles.
➢ The Naïve Bayes algorithm is a combination of two words, Naïve and Bayes, which can be described as:
• Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple; each feature individually contributes to identifying it as an apple, without depending on the others.
• Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
Bayes' Theorem:
➢ Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine
the probability of a hypothesis with prior knowledge. It depends on the conditional
probability.
➢ The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) · P(A) / P(B)
Where,
P(A|B) is the posterior probability of hypothesis A given the observed event B; P(B|A) is the likelihood of the evidence given that the hypothesis is true; P(A) is the prior probability of the hypothesis; and P(B) is the marginal probability of the evidence.
➢ Working of Naïve Bayes' Classifier can be understood with the help of the below
example:
➢ Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using this dataset, we need to decide whether we should play on a particular day according to the weather conditions. To solve this problem, we need to follow the below steps:
✓ Convert the given dataset into frequency tables.
✓ Generate Likelihood table by finding the probabilities of given features.
✓ Now, use Bayes theorem to calculate the posterior probability.
➢ Problem: If the weather is sunny, then the Player should play or not?
Solution: To solve this, first consider the below dataset:
P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.30
P(Yes) = 0.71
P(Sunny) = 0.35
So P(Yes|Sunny) = 0.30 * 0.71 / 0.35 = 0.60
P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.5
P(No) = 0.29
P(Sunny) = 0.35
So P(No|Sunny) = 0.5 * 0.29 / 0.35 = 0.41
As P(Yes|Sunny) > P(No|Sunny), the player can play on a sunny day.
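Working the same comparison with exact fractions (the counts are reconstructed from the probabilities quoted above: 14 days total, 10 "Yes" and 4 "No", with 5 sunny days split 3 "Yes" / 2 "No"):

```python
from fractions import Fraction as F

# Counts reconstructed from the probabilities quoted above:
# 14 days total, 10 "Yes" and 4 "No"; 5 sunny days, 3 with "Yes", 2 with "No".
p_yes, p_no = F(10, 14), F(4, 14)
p_sunny = F(5, 14)
p_sunny_given_yes = F(3, 10)
p_sunny_given_no = F(2, 4)

post_yes = p_sunny_given_yes * p_yes / p_sunny   # P(Yes | Sunny)
post_no = p_sunny_given_no * p_no / p_sunny      # P(No | Sunny)

print(post_yes, post_no)          # 3/5 2/5
print(post_yes > post_no)         # True -> play on a sunny day
```

Note the two posteriors sum to 1 here only because "Sunny" was counted consistently; in general, the class with the larger posterior wins regardless of normalization.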
Advantages
➢ Naïve Bayes is one of the fast and easy ML algorithms to predict the class of a dataset.
➢ It can be used for Binary as well as Multi-class Classifications.
➢ It performs well in Multi-class predictions as compared to the other Algorithms.
➢ It is the most popular choice for text classification problems.
Disadvantages
➢ Naive Bayes assumes that all features are independent or unrelated, so it cannot learn the relationship between features.
There are three types of Naive Bayes Model, which are given below:
o Gaussian: The Gaussian model assumes that features follow a normal distribution. This
means if predictors take continuous values instead of discrete, then the model assumes
that these values are sampled from the Gaussian distribution.
o Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomially distributed. It is primarily used for document classification problems, i.e., deciding to which category a particular document belongs, such as sports, politics, education, etc.
The classifier uses the frequency of words for the predictors.
o Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor variables are independent Boolean variables, such as whether a particular word is present or not in a document. This model is also famous for document classification tasks.
Example:
SVM can be understood with the example that we used in the KNN classifier. Suppose we see a strange cat that also has some features of dogs; if we want a model that can accurately identify whether it is a cat or a dog, such a model can be created using the SVM algorithm. We first train our model with lots of images of cats and dogs so that it can learn their different features, and then we test it with this strange creature. The support vector machine creates a decision boundary between the two classes (cat and dog) and chooses the extreme cases (support vectors); on the basis of the support vectors, it will classify the creature as a cat. Consider the below diagram:
SVM algorithm can be used for Face detection, image classification, text categorization, etc.
Types of SVM
SVM can be of two types:
Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be classified into two classes by a single straight line, the data is termed linearly separable, and the classifier used is called a Linear SVM classifier.
Non-linear SVM: Non-linear SVM is used for non-linearly separable data. If a dataset cannot be classified by a straight line, the data is termed non-linear, and the classifier used is called a Non-linear SVM classifier.
Since this is a 2-D space, we can easily separate these two classes just by using a straight line. But there can be multiple lines that can separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary is called a hyperplane. The SVM algorithm finds the points of both classes that lie closest to the line; these points are called support vectors. The distance between the support vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The hyperplane with the maximum margin is called the optimal hyperplane.
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear
data, we cannot draw a single straight line. Consider the below image:
So, to separate these data points, we need to add one more dimension. For linear data, we
have used two dimensions x and y, so for non-linear data, we will add a third dimension z. It
can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way. Consider the below
image:
Since we are in 3-D space, the boundary looks like a plane parallel to the x-axis. If we convert it back to 2-D space with z = 1, it becomes:
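The z = x² + y² trick can be sketched directly with toy ring-shaped data (invented for illustration): points that no straight line separates in the (x, y) plane become separable by a single threshold on z:

```python
# Two classes arranged in rings: no straight line in (x, y) separates them,
# but the extra feature z = x^2 + y^2 does.
inner = [(0.5, 0.0), (0.0, -0.5), (-0.4, 0.3), (0.3, 0.4)]   # class 1
outer = [(3.0, 0.0), (0.0, 3.0), (-2.5, 1.5), (2.0, -2.5)]   # class 2

z = lambda p: p[0] ** 2 + p[1] ** 2

z_inner = [z(p) for p in inner]
z_outer = [z(p) for p in outer]

# In the lifted space, the plane z = 1 (a circle in the original plane)
# separates the two classes perfectly.
print(max(z_inner) < 1 < min(z_outer))   # True
```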
There are various algorithms in Machine learning, so choosing the best algorithm for the given dataset
and problem is the main point to remember while creating a machine learning model. Below are the
two reasons for using the Decision tree:
o Decision Trees usually mimic human thinking ability while making a decision, so it is easy to
understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like
structure.
Root Node: Root node is from where the decision tree starts. It represents the entire dataset, which
further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after
getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according
to the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other nodes are called
the child nodes.
In a decision tree, for predicting the class of the given dataset, the algorithm starts from the root node
of the tree. This algorithm compares the values of root attribute with the record (real dataset) attribute
and, based on the comparison, follows the branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves further. It continues this process until it reaches a leaf node of the tree. The complete process can be better understood using the below algorithm:
• Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
• Step-2: Find the best attribute in the dataset using Attribute Selection Measure (ASM).
• Step-3: Divide S into subsets that contain the possible values of the best attribute.
• Step-4: Generate the decision tree node which contains the best attribute.
• Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where the nodes cannot be classified further; the final nodes are called leaf nodes.
Example: Suppose there is a candidate who has a job offer and wants to decide whether he should
accept the offer or Not. So, to solve this problem, the decision tree starts with the root node (Salary
attribute by ASM). The root node splits further into the next decision node (distance from the office)
and one leaf node based on the corresponding labels. The next decision node further gets split into one
decision node (Cab facility) and one leaf node. Finally, the decision node splits into two leaf nodes
(Accepted offers and Declined offer). Consider the below diagram:
While implementing a decision tree, the main issue is how to select the best attribute for the root node and for the sub-nodes. To solve such problems there is a technique called the Attribute Selection Measure, or ASM. With this measurement, we can easily select the best attribute for the nodes of the tree. There are two popular techniques for ASM:
o Information Gain
o Gini Index
1. Information Gain:
o Information gain is the measurement of changes in entropy after the segmentation of a dataset
based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of information gain, and the node/attribute having the highest information gain is split first. It can be calculated using the below formula:
Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]
Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies the randomness in the data. Entropy can be calculated as:
Entropy(S) = −P(yes) log2 P(yes) − P(no) log2 P(no)
Where,
S = total number of samples, P(yes) = probability of yes, and P(no) = probability of no.
2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
o An attribute with the low Gini index should be preferred as compared to the high Gini index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create binary splits.
o Gini index can be calculated using the below formula:
Gini Index = 1 − Σj Pj²
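As a quick numeric sketch of the Gini index (using a 9 "yes" / 5 "no" class split for illustration):

```python
# Gini index of a node: 1 minus the sum of squared class probabilities.
def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(round(gini([9, 5]), 3))   # an impure node has a Gini index between 0 and 0.5
print(gini([4, 0]))             # 0.0: a pure node has zero impurity
```

CART prefers the split whose children have the lowest weighted Gini index, mirroring how ID3 prefers the highest information gain.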
“Pruning is a process of deleting the unnecessary nodes from a tree in order to get the optimal
decision tree”.
A too-large tree increases the risk of overfitting, while a small tree may not capture all the important features of the dataset. A technique that decreases the size of the learning tree without reducing accuracy is known as pruning. There are mainly two types of tree pruning techniques used: Cost Complexity Pruning and Reduced Error Pruning.
There are many algorithms to build a decision tree. They include:
1. CART (Classification and Regression Trees) — This makes use of Gini impurity as the
metric.
2. ID3 (Iterative Dichotomiser 3) — This uses entropy and information gain as metric.
Consider whether a dataset based on which we will determine whether to play football or not.
There are four independent variables to determine the dependent variable. The independent
variables are Outlook, Temperature, Humidity, and Wind. The dependent variable is whether to play
football or not.
As the first step, we have to find the parent node of our decision tree. For that, follow the steps:
Step 1: Calculate the entropy of the class variable.
Note: here we typically take log to base 2. In total there are 14 examples, of which 9 are yes and 5
are no; based on these counts, E(S) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.94.
Step 2: Calculate the average weighted entropy of each feature. From the above data, for Outlook we
can easily arrive at the following table.
Step 3: The next step is to find the information gain. It is the difference between the parent entropy
and the average weighted entropy found above.
IG(S, outlook) = 0.94 - 0.693 = 0.247
Similarly find Information gain for Temperature, Humidity, and Windy.
IG(S, Temperature) = 0.940 - 0.911 = 0.029
IG(S, Humidity) = 0.940 - 0.788 = 0.152
IG(S, Windy) = 0.940 - 0.8932 = 0.048
Step 4: Now select the feature having the largest information gain. Here it is Outlook, so it forms the
first node (root node) of our decision tree.
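The entropy and information-gain arithmetic above can be checked with a short Python sketch; the yes/no counts come from the play-football table in the text, and the function name is illustrative:

```python
from math import log2

def entropy(counts):
    # entropy of a class distribution given raw counts, log base 2
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

parent = entropy([9, 5])  # 9 yes / 5 no in the full dataset, ~ 0.940

# Outlook splits: sunny (2 yes, 3 no), overcast (4 yes, 0 no), rainy (3 yes, 2 no)
splits = [[2, 3], [4, 0], [3, 2]]
weighted = sum(sum(s) / 14 * entropy(s) for s in splits)  # ~ 0.693
ig_outlook = parent - weighted  # ~ 0.247, matching IG(S, outlook) above
```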
Since Overcast contains only examples of class ‘Yes’, we can set it as Yes. That means if the outlook
is overcast, football will be played. Now our decision tree looks as follows.
Step 5: The next step is to find the next node in our decision tree. Now we will find the one under
Sunny. We have to determine which of Temperature, Humidity, or Wind has the highest information
gain.
For humidity from the above table, we can say that play will occur if humidity is normal and will not
occur if it is high. Similarly, find the nodes under rainy.
Classification using CART is similar, but instead of entropy we use Gini impurity.
So as the first step we will find the root node of our decision tree. For that, calculate the Gini
index of the class variable.
As the next step, we will calculate the Gini gain. For that, first we will find the average weighted
Gini impurity of Outlook, Temperature, Humidity, and Windy.
Choose the one that has the highest Gini gain. Gini gain is highest for Outlook, so we can choose it
as our root node.
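The Gini computation for the same play-football counts can be sketched in Python (illustrative names; counts taken from the table in the text):

```python
def gini(counts):
    # Gini Index = 1 - sum(Pj^2) over the classes
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

class_gini = gini([9, 5])  # Gini of the class variable, ~ 0.459

# Outlook splits: sunny (2 yes, 3 no), overcast (4 yes, 0 no), rainy (3 yes, 2 no)
splits = [[2, 3], [4, 0], [3, 2]]
weighted = sum(sum(s) / 14 * gini(s) for s in splits)  # ~ 0.343
gini_gain = class_gini - weighted  # ~ 0.116
```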
Advantages of the Decision Tree:
o It is simple to understand, as it follows the same process that a human follows while making
any decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o There is less requirement of data cleaning compared to other algorithms.
Random Forest is a popular machine learning algorithm that belongs to the supervised learning
technique. It can be used for both Classification and Regression problems in ML. It is based on the
concept of ensemble learning, which is a process of combining multiple classifiers to solve a complex
problem and to improve the performance of the model.
As the name suggests, "Random Forest is a classifier that contains a number of decision trees on
various subsets of the given dataset and takes the average to improve the predictive accuracy of that
dataset." Instead of relying on one decision tree, the random forest takes the prediction from each
tree and, based on the majority vote of predictions, predicts the final output.
The greater number of trees in the forest leads to higher accuracy and prevents the problem of
overfitting.
The below diagram explains the working of the Random Forest algorithm:
Since the random forest combines multiple trees to predict the class of the dataset, it is possible that
some decision trees may predict the correct output, while others may not. But together, all the trees
predict the correct output. Therefore, below are two assumptions for a better Random forest classifier:
o There should be some actual values in the feature variable of the dataset so that the classifier
can predict accurate results rather than a guessed result.
o The predictions from each tree must have very low correlations.
Below are some points that explain why we should use the Random Forest algorithm:
o It takes less training time as compared to other algorithms.
o It predicts output with high accuracy; even for a large dataset it runs efficiently.
o It can maintain accuracy even when a large proportion of data is missing.
Random Forest works in two phases: the first is to create the random forest by combining N decision
trees, and the second is to make predictions for each tree created in the first phase.
Step-1: Select random K data points from the training set.
Step-2: Build the decision trees associated with the selected data points (subsets).
Step-3: Choose the number N of decision trees that you want to build.
Step-4: Repeat Step-1 and Step-2.
Step-5: For new data points, find the predictions of each decision tree, and assign the new data points
to the category that wins the majority votes.
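Two pieces of the procedure, random sampling with replacement (Step-1) and majority voting (Step-5), can be sketched in Python; the function names and the fruit labels are illustrative:

```python
import random
from collections import Counter

def bootstrap_subset(data):
    # Step-1: select random data points from the training set, with replacement
    return [random.choice(data) for _ in range(len(data))]

def forest_predict(tree_votes):
    # Step-5: assign the new point to the category that wins the majority vote
    return Counter(tree_votes).most_common(1)[0][0]

# hypothetical predictions from N = 5 trees for one new data point
votes = ["apple", "banana", "apple", "apple", "banana"]
final = forest_predict(votes)  # "apple", since 3 of 5 trees vote for it
```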
The working of the algorithm can be better understood by the below example:
Example: Suppose there is a dataset that contains multiple fruit images. So, this dataset is given to the
Random forest classifier. The dataset is divided into subsets and given to each decision tree. During the
training phase, each decision tree produces a prediction result, and when a new data point occurs, then
based on the majority of results, the Random Forest classifier predicts the final decision. Consider the
below image:
There are mainly four sectors where Random Forest is mostly used:
1. Banking: Banking sector mostly uses this algorithm for the identification of loan risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the disease can be
identified.
3. Land Use: We can identify the areas of similar land use by this algorithm.
4. Marketing: Marketing trends can be identified using this algorithm.
Disadvantages:
o Although random forest can be used for both classification and regression tasks, it is less
suitable for regression tasks.
Two marks
1) What is Machine Learning?
Definition: A computer program is said to learn from experience E with respect to some class of tasks
T and performance measure P, if its performance at tasks in T, as measured by P, improves with
experience E. Machine learning is programming computers to optimize a performance criterion using
example data or past experience. The application of machine learning methods to large databases is
called data mining.
2)What are the phases of machine learning?
Phases of machine learning:
1. Training: a training set of examples of correct behavior is analyzed and the newly learnt
knowledge is stored, usually in the form of rules.
2. Validation: the rules are checked and, if necessary, additional training is given.
3. Application: the rules are used to respond to new situations.
❖ The user gets immediate feedback in supervised learning and no feedback in
unsupervised learning, while in reinforcement learning the feedback is a delayed
scalar.
➢ Logistic regression is one of the most popular Machine Learning algorithms, which
comes under the Supervised Learning technique. It is used for predicting the
categorical dependent variable using a given set of independent variables.
➢ Logistic regression predicts the output of a categorical dependent variable.
Therefore the outcome must be a categorical or discrete value. It can be either Yes
or No, 0 or 1, True or False, etc., but instead of giving the exact values 0 and 1, it
gives probabilistic values which lie between 0 and 1.
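The probabilistic output described above comes from the sigmoid (logistic) function; a minimal sketch:

```python
from math import exp

def sigmoid(z):
    # maps any real value to a probability strictly between 0 and 1
    return 1 / (1 + exp(-z))

p = sigmoid(0)  # 0.5: the decision boundary between the two classes
```

A prediction is then typically made by thresholding: class 1 if sigmoid(z) >= 0.5, else class 0.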
o Decision Tree is a Supervised learning technique that can be used for both classification and
Regression problems, but mostly it is preferred for solving Classification problems. It is a tree-
structured classifier, where internal nodes represent the features of a dataset, branches
represent the decision rules and each leaf node represents the outcome.
o In a Decision tree, there are two nodes, which are the Decision Node and Leaf Node. Decision
nodes are used to make any decision and have multiple branches, whereas Leaf nodes are the
output of those decisions and do not contain any further branches.
UNIT IV
Ensemble Techniques
4 And Unsupervised Learning
Syllabus
Combining multiple learners: Model combination schemes, Voting, Ensemble Learning -
bagging, boosting, stacking; Unsupervised learning: K-means; Instance Based Learning:
KNN, Gaussian mixture models and Expectation maximization.
4.1 Combining Multiple Learners
• When designing a learning machine, we generally make some choices: parameters of
the machine, training data, representation, etc. This implies some sort of variance in
performance. For example, in a classification setting we can use a parametric or a
non-parametric classifier, and in a multilayer perceptron we must also decide on the
number of hidden units.
• Each learning algorithm dictates a certain model that comes with a set of assumptions.
This inductive bias leads to error if the assumptions do not hold for the data.
• Different learning algorithms have different accuracies. The no free lunch theorem
asserts that no single learning algorithm always achieves the best performance in any
domain. They can be combined to attain higher accuracy.
• Data fusion is the process of fusing multiple records representing the same real-world
object into a single, consistent, and clean representation. Fusion of data for improving
prediction accuracy and reliability is an important problem in machine learning.
• Combining different models is done to improve the performance of deep learning
models. Building a new model by combination requires less time, data, and
computational resources. The most common method to combine models is by
averaging multiple models, where taking a weighted average improves the accuracy.
• Different Algorithms: We can use different learning algorithms to train different base-
learners. Different algorithms make different assumptions about the data and lead to
different classifiers.
• Different Hyper-parameters: We can use the same learning algorithm but use it with
different hyper – parameters.
• Different Input Representations: Different representations make different
characteristics explicit allowing better identification.
• Different training sets: Another possibility is to train different base – learners by
different subsets of the training set.
• Different methods used for generating the final output from multiple base-learners are
multiexpert and multistage combination:
1. Multiexpert combination: the base-learners work in parallel.
2. Multistage combination: the base-learners are applied in a serial approach.
• Let’s assume that we want to construct a function that maps inputs to outputs from a
set of known N_train input-output pairs
D_train = {(x_i, y_i)}, i = 1, …, N_train
where x_i ∈ X is a D-dimensional feature input vector and y_i ∈ Y is the output.
• Classification: when the output takes values in a discrete set of class labels, the task is
classification.
4.1.2 Voting
• In this method, the first step is to create multiple classification/regression models
using some training dataset. Each base model can be created using different splits of
the same training dataset and the same algorithm, using the same dataset with different
algorithms, or by any other method.
• Fig. 4.1.2 shows general idea of Base – learners with model combiner.
• When combining multiple independent and diverse decisions each of which is at least
more accurate than random guessing, random errors cancel each other out, and correct
decisions are reinforced. Human ensembles are demonstrably better.
• Use a single, arbitrary learning algorithm but manipulate training data to make it learn
multiple models.
• The problem here is that if there is an error with one of the base-learners, there may
be a misclassification because the class code words are so similar. So the approach in
error-correcting codes is to have L>K and increase the Hamming distance between the
code words.
• One possibility is pairwise separation of classes, where there is a separate base-learner
to separate Ci from Cj, for i < j.
• Pairwise: L = K(K − 1)/2. For K = 4 classes, this gives L = 6 base-learners:
W = [ +1 +1 +1  0  0  0
      −1  0  0 +1 +1  0
       0 −1  0 −1  0 +1
       0  0 −1  0 −1 −1 ]
• With reasonable L, find W such that the Hamming distance between rows and
between columns are maximized.
• The voting scheme computes, for each class Ci,
y_i = Σ_{j=1}^{L} w_j d_ji
where d_ji is the vote of base-learner j for class Ci and the w_j are the learner weights.
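A linear combiner of this form can be sketched as follows; votes[j] holds learner j's support d_ji for each class Ci, and the function name is illustrative:

```python
def combine(votes, weights):
    # y_i = sum_j w_j * d_ji : weighted sum of learner outputs, per class
    n_classes = len(votes[0])
    return [sum(w * d[i] for w, d in zip(weights, votes)) for i in range(n_classes)]

# three learners, two classes; equal weights reduces to plurality voting
y = combine([[1, 0], [0, 1], [1, 0]], [1/3, 1/3, 1/3])
# class 0 collects 2/3 of the vote, class 1 collects 1/3
```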
• Ensemble modeling is the process of running two or more related but different
analytical models and then synthesizing the results into a single score or spread in
order to improve the accuracy of predictive analytics and data mining applications.
• An ensemble of classifiers is a set of classifiers whose individual decisions are
combined in some way to classify new examples.
• Ensemble methods combine several decision trees classifiers to produce better
predictive performance than a single decision tree classifier. The main principle
behind the ensemble model is that a group of weak learners come together to form a
strong learner, thus increasing the accuracy of the model.
• Why do ensemble methods work? They are based on one of two basic observations:
1. Variance reduction: if the training sets are completely independent, it always
helps to average an ensemble, because this reduces variance without affecting
bias (e.g. bagging) and reduces sensitivity to individual data points.
2. Bias reduction: for simple models, the average of models has much greater
capacity than a single model. Averaging models can reduce bias substantially by
increasing capacity, while variance is controlled by fitting one component at a
time.
4.2.1 Bagging
• Bagging is also called bootstrap aggregating. Bagging and boosting are meta-
algorithms that pool decisions from multiple classifiers. Bagging creates ensembles by
repeatedly randomly resampling the training data.
• Bagging was the first effective method of ensemble learning and is one of the simplest
methods of arcing. The meta-algorithm, which is a special case of model averaging,
was originally designed for classification and is usually applied to decision tree
models, but it can be used with any type of model for classification or regression.
• Ensemble classifiers such as bagging, boosting and model averaging are known to
have improved accuracy and robustness over a single model. Although unsupervised
models, such as clustering, do not directly generate label prediction for each
individual, they provide useful constraints for the joint prediction of a set of related
objects.
• Given a training set of size n, create m samples of size n by drawing n examples from
the original data, with replacement. Each bootstrap sample will on average contain
63.2% of the unique training examples; the rest are replicates. Bagging combines the
m resulting models using a simple majority vote.
• In particular, on each round, the base learner is trained on what is often called a
“bootstrap replicate” of the original training set. Suppose the training set consists of
n examples. Then a bootstrap replicate is a new training set that also consists of n
examples, and which is formed by repeatedly selecting uniformly at random and with
replacement n examples from the original training set. This means that the same
example may appear multiple times in the bootstrap replicate, or it may appear not at
all.
• It also decreases error by decreasing the variance in the results due to unstable
learners: algorithms (like decision trees) whose output can change dramatically when
the training data is slightly changed.
• Pseudocode:
1. Given training data (x1, y1), …, (xm, ym).
2. For t = 1, …, T:
a. Form bootstrap replicate dataset St by selecting m random examples from the
training set with replacement.
b. Let ht be the result of training the base learning algorithm on St.
3. Output the combined classifier H(x) = majority vote of h1(x), …, hT(x).
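The 63.2% figure quoted above can be checked empirically with a short sketch: a bootstrap replicate of n items drawn with replacement covers about 1 − 1/e ≈ 63.2% of the unique items on average.

```python
import random

def bootstrap_replicate(data):
    # draw n examples uniformly at random, with replacement
    return [random.choice(data) for _ in range(len(data))]

random.seed(0)
data = list(range(10000))
rep = bootstrap_replicate(data)
unique_frac = len(set(rep)) / len(data)  # close to 1 - 1/e ~ 0.632 for large n
```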
Bagging Steps :
1. Suppose there are N observations and M features in the training data set. A sample
from the training data set is taken randomly with replacement.
2. A subset of M features is selected randomly, and whichever feature gives the best
split is used to split the node iteratively.
3. The tree is grown to its largest extent.
4. The above steps are repeated n times and the prediction is given based on the
aggregation of predictions from the n trees.
Advantages of Bagging:
1. Reduces over-fitting of the model.
2. Handles higher dimensionality data well.
3. Maintains accuracy for missing data.
Disadvantages of Bagging:
1. Since final prediction is based on the mean predictions from subset trees, it won’t
give precise values for the classification and regression model.
4.2.2 Boosting
• Boosting is an ensemble learning method that combines a set of weak learners into a
strong learner to minimize training errors. In boosting, a random sample of data is selected,
fitted with a model and then trained sequentially—that is, each model tries to compensate for
the weaknesses of its predecessor.
• A learner is weak if it produces a classifier that is only slightly better than random
guessing, while a learner is said to be strong if it produces a classifier that achieves a
low error with high confidence for a given concept.
• AdaBoost is a practical boosting algorithm for building ensembles that empirically
improves generalization performance. Examples are given weights; at each iteration, a
new hypothesis is learned and the examples are reweighted to focus the system on
examples that the most recently learned classifier got wrong.
• Boosting is a bias reduction technique. It typically improves the performance of a
single tree model. A reason for this is that we often cannot construct trees which are
sufficiently large due to thinning out of observations in the terminal nodes.
• Boosting is then a device to come up with a more complex solution by taking linear
combination of trees. In presence of high – dimensional predictors, boosting is also
very useful as a regularization technique for additive or interaction modeling.
• To begin, we define an algorithm for finding the rules of thumb, which we call a weak
learner. The boosting algorithm repeatedly calls this weak learner, each time feeding
it a different distribution over the training data. Each call generates a weak classifier
and we must combine all of these into a single classifier that, hopefully, is much more
accurate than any one of the rules.
• Train a set of weak hypotheses h1, …, hT. The combined hypothesis H is a weighted
majority vote of the T weak hypotheses. During training, focus on the examples that
are misclassified.
AdaBoost:
Advantages of AdaBoost:
1. Very simple to implement
2. Fairly good generalization.
3. The prior error need not be known ahead of time.
Disadvantages of AdaBoost:
1. Suboptimal solution
2. Can overfit in the presence of noise.
Boosting Steps:
1. Draw a random subset of training samples d1 without replacement from the training
set D to train a weak learner C1
2. Draw a second random training subset d2 without replacement from the training set
and add 50 percent of the samples that were previously misclassified, to train a
weak learner C2.
3. Find the training samples d3 in the training set D on which C1 and C2 disagree to
train a third weak learner C3
4. Combine all the weak learners via majority voting.
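The per-round reweighting at the heart of AdaBoost can be sketched for a single hypothetical round with four equally weighted examples, one of them misclassified (the function name is illustrative):

```python
from math import exp, log

def adaboost_round(weights, correct):
    # weights: current example weights; correct[i]: was example i classified correctly?
    err = sum(w for w, c in zip(weights, correct) if not c) / sum(weights)
    alpha = 0.5 * log((1 - err) / err)  # vote weight of this weak learner
    # decrease weights of correct examples, increase weights of mistakes
    new_w = [w * exp(-alpha if c else alpha) for w, c in zip(weights, correct)]
    z = sum(new_w)  # normalizer so the weights remain a distribution
    return alpha, [w / z for w in new_w]

alpha, w = adaboost_round([0.25] * 4, [True, True, True, False])
# the single misclassified example now carries half of the total weight
```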
Advantages of Boosting:
1. Supports different loss functions.
2. Works well with interactions.
Disadvantages of Boosting:
1. Prone to over-fitting.
2. Requires careful tuning of different hyper-parameters.
4.2.3 Stacking
• Stacking, sometimes called stacked generalization, is an ensemble machine learning
method that combines multiple heterogeneous base or component models via a meta-
model.
• The base models are trained on the complete training data, and then the meta-model is
trained on the predictions of the base models. The advantage of stacking is the ability
to explore the solution space with different models on the same problem.
• The stacking model can be visualized in levels and has at least two levels of
models. The first level typically trains two or more base-learners (which can be
heterogeneous), and the second level might be a single meta-learner that uses the
base models’ predictions as input and gives the final result as output. A stacked model
can have more than two such levels, but increasing the levels doesn’t always guarantee
better performance.
• In the classification tasks, often logistic regression is used as a meta learner, while
linear regression is more suitable as a meta learner for regression – based tasks.
• Stacking is concerned with combining multiple classifiers generated by different
learning algorithms L1, …, LN on a single dataset S, which is composed of feature
vectors si = (xi, ti).
In the meta-level dataset, the features are the predictions of the base-level classifiers
and the class is the correct class of the example in hand.
4.2.4 AdaBoost
• AdaBoost, also referred to as adaptive boosting, is a method in machine learning used
as an ensemble method. The most common algorithm used with AdaBoost is one-level
decision trees, i.e., decision trees with only a single split. These trees are also
referred to as decision stumps.
Fig: Decision stump
• In the ensemble approach, we add the weak models sequentially and then train them
using weighted training data.
• We continue to iterate the process until we reach a pre-set number of weak learners or
we can no longer observe further improvement on the dataset. At the end of the
algorithm, we are left with a number of weak learners, each with a stage value.
4.2.5 Difference between Bagging and Boosting
3. Bagging aims to decrease variance; boosting aims to decrease bias.
4. In bagging, every model receives an equal weight; in boosting, models are weighted
by their performance.
4.3 Clustering
• Cluster analysis or clustering is the task of grouping a set of objects in such a way
that objects in the same group (called a cluster) are more similar (in some sense) to
each other than to those in other groups (clusters).
• Cluster analysis can be a powerful data-mining tool for any organization that needs to
identify discrete groups of customers, sales transactions, or other types of behaviors
and things. For example, insurance providers use cluster analysis to detect fraudulent
claims, and banks use it for credit scoring.
Fig. 4.3.1
• Clustering means grouping of data or dividing a large data set into smaller data sets of
some similarity.
• A clustering algorithm attempts to find natural groups of components or data based on
some similarity. The clustering algorithm also finds the centroid of a group of data
sets.
• The quality of a clustering result depends on both the similarity measure used by the
method and its implementation. The quality of a clustering method is also measured by
its ability to discover some or all of the hidden patterns.
• Clustering techniques types : The major clustering techniques are
a) Partitioning methods
b) Hierarchical methods
c) Density – based methods.
The k-means clustering algorithm mainly performs two tasks:
o Determines the best value for K center points or centroids by an iterative process.
o Assigns each data point to its closest k-center. The data points which are near to a
particular k-center create a cluster.
Hence each cluster has data points with some commonalities, and it is away from other
clusters.
The below diagram explains the working of the K-means Clustering Algorithm:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select random K points or centroids. (They can be other than the input dataset.)
Step-3: Assign each data point to their closest centroid, which will form the predefined K
clusters.
Step-4: Calculate the variance and place a new centroid of each cluster.
Step-5: Repeat the third step: reassign each data point to the new closest centroid of each
cluster.
Step-6: If any reassignment occurs, go to Step-4; otherwise the model is ready.
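The assignment and centroid-update steps can be sketched as two small helper functions (illustrative names; points are tuples of coordinates):

```python
def assign(points, centroids):
    # Step-3: label each point with the index of its closest centroid
    def d2(p, c):
        return sum((pi - ci) ** 2 for pi, ci in zip(p, c))
    return [min(range(len(centroids)), key=lambda j: d2(p, centroids[j]))
            for p in points]

def update(points, labels, k):
    # Step-4: move each centroid to the mean of its assigned points
    new_centroids = []
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        new_centroids.append(tuple(sum(xs) / len(members) for xs in zip(*members)))
    return new_centroids

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels = assign(pts, [(0, 0), (10, 10)])   # two well-separated clusters
centroids = update(pts, labels, 2)         # centroids move to cluster means
```

Iterating assign and update until the labels stop changing gives the full K-means loop.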
Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables is
given below:
o Let's take number k of clusters, i.e., K=2, to identify the dataset and to put them into
different clusters. It means here we will try to group these datasets into two different
clusters.
o We need to choose some random k points or centroids to form the clusters. These
points can be either points from the dataset or any other points. So, here we are
selecting the below two points as k points, which are not part of our dataset. Consider
the below image:
o Now we will assign each data point of the scatter plot to its closest K-point or
centroid. We will compute this by calculating the distance between two points. So, we
will draw a median between both the centroids. Consider the below image:
From the above image, it is clear that points on the left side of the line are near to the K1 or
blue centroid, and points to the right of the line are close to the yellow centroid. Let's color
them blue and yellow for clear visualization.
o As we need to find the closest cluster, we will repeat the process by choosing new
centroids. To choose the new centroids, we will compute the center of gravity of each
cluster and will find the new centroids as below:
o Next, we will reassign each data point to the new centroid. For this, we will repeat the
same process of finding a median line. The median will be like the below image:
From the above image, we can see one yellow point is on the left side of the line, and two
blue points are to the right of the line. So, these three points will be assigned to new
centroids.
As reassignment has taken place, so we will again go to the step-4, which is finding new
centroids or K-points.
o We will repeat the process by finding the center of gravity of centroids, so the new
centroids will be as shown in the below image:
o As we got the new centroids, we will again draw the median line and reassign the data
points. So, the image will be:
o We can see in the above image that there are no dissimilar data points on either side of
the line, which means our model is formed. Consider the below image:
As our model is ready, so we can now remove the assumed centroids, and the two final
clusters will be as shown in the below image:
The performance of the K-means clustering algorithm depends upon highly efficient clusters
that it forms. But choosing the optimal number of clusters is a big task. There are some
different ways to find the optimal number of clusters, but here we are discussing the most
appropriate method to find the number of clusters or value of K. The method is given below:
Elbow Method
The Elbow method is one of the most popular ways to find the optimal number of clusters.
This method uses the concept of WCSS value. WCSS stands for Within Cluster Sum of
Squares, which defines the total variations within a cluster. The formula to calculate the
value of WCSS (for 3 clusters) is given below:
WCSS = ∑Pi in Cluster1 distance(Pi, C1)² + ∑Pi in Cluster2 distance(Pi, C2)² + ∑Pi in Cluster3 distance(Pi, C3)²
∑Pi in Cluster1 distance(Pi, C1)²: it is the sum of the squares of the distances between each data
point and its centroid within cluster 1, and the same for the other two terms.
To measure the distance between data points and centroid, we can use any method such as
Euclidean distance or Manhattan distance.
To find the optimal value of clusters, the elbow method follows the below steps:
o It executes the K-means clustering on a given dataset for different K values (ranges
from 1-10).
o For each value of K, calculates the WCSS value.
o Plots a curve between calculated WCSS values and the number of clusters K.
o The sharp point of bend or a point of the plot looks like an arm, then that point is
considered as the best value of K.
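Given a clustering (points, centroids, and cluster labels, as in the K-means walkthrough above), the WCSS value that the elbow method plots can be computed directly; this is a minimal sketch with illustrative names:

```python
def wcss(points, centroids, labels):
    # within-cluster sum of squared Euclidean distances to each point's centroid
    return sum(sum((x - c) ** 2 for x, c in zip(p, centroids[lab]))
               for p, lab in zip(points, labels))

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
value = wcss(pts, [(0.0, 0.5), (10.0, 10.5)], [0, 0, 1, 1])
# each point is 0.5 away from its centroid, so WCSS = 4 * 0.25 = 1.0
```

Running this for K = 1 … 10 and plotting the values against K produces the elbow curve described above.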
Since the graph shows the sharp bend, which looks like an elbow, hence it is known as the
elbow method. The graph for the elbow method looks like the below image:
4.4 Instance-Based Learning: KNN
• Example: Suppose we have an image of a creature that looks similar to both a cat and
a dog. For this identification, we can use the KNN algorithm, because it works on a
similarity measure. Our KNN model will find the features of the new data set that are
similar to those of the cat and dog images and, based on the most similar features,
will place it in either the cat or the dog category.
• Suppose there are two categories, i.e., category A and category B, and we have a new
data point x1; this point will lie in one of these categories. To solve this type of
problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify
the category or class of a particular data point. Consider the below diagram:
• The K-NN working can be explained on the basis of the below algorithm:
Step-1: Select the number K of neighbors.
Step-2: Calculate the Euclidean distance between the new point and each training point.
Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
Step-4: Among these K neighbors, count the number of data points in each category.
Step-5: Assign the new data point to the category for which the number of neighbors is
maximum.
• Suppose we have a new data point and we want to place it in the required category.
Consider the below image:
• Firstly, we will choose the number of neighbors; we will select k = 5.
• Next, we will calculate the Euclidean distance between the data points. The Euclidean
distance is the distance between two points, which we have already studied in
geometry. It can be calculated as:
d = √((x2 − x1)² + (y2 − y1)²)
• As we can see, three of the nearest neighbors are from category A; hence this new data
point must belong to category A.
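The five algorithm steps above fit in a few lines of Python; the training points and labels below are hypothetical:

```python
from collections import Counter
from math import dist  # Euclidean distance between two points (Python 3.8+)

def knn_predict(train, new_point, k=5):
    # train is a list of ((x, y), label) pairs; vote among the k nearest neighbors
    nearest = sorted(train, key=lambda item: dist(item[0], new_point))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
         ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]
category = knn_predict(train, (1, 1), k=5)
# three of the five nearest neighbors are from category A, so the answer is "A"
```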
4.5 Gaussian Mixture Models
• The Gaussian mixture model is a probabilistic model that assumes all the data points
are generated from a mix of Gaussian distributions with unknown parameters.
• For example, in modeling human height data, height is typically modeled as a normal
distribution for each gender with a mean of approximately 5’10” for males and 5’5”
for females. Given only the height data and not the gender assignments for each data
point, the distribution of all heights would follow the sum of two scaled (different
variance) and shifted (different mean) normal distributions. A model making this
assumption is an example of a Gaussian mixture model.
• Gaussian mixture models do not rigidly classify each and every instance into one
class or the other. The algorithm attempts to produce K-Gaussian distributions that
would take into account the entire training space. Every point can be associated with
one or more distributions. Consequently, the deterministic factor would be the
probability that each point belongs to a certain Gaussian distribution.
• GMMs have a variety of real - world applications. Some of them are listed below.
a) Used for signal processing
b) Used for customer churn analysis
c) Used for language identification
d) Used in video game industry
e) Genre classification of songs
4.5.1 Expectation – maximization
• In Gaussian mixture models, the expectation-maximization method is a
powerful tool for estimating the parameters of a Gaussian mixture model.
The expectation step is termed E and the maximization step is termed M.
• Expectation is used to find the Gaussian parameters which are used to represent each
component of the Gaussian mixture model. Maximization is involved
in determining whether new data points can be added or not.
• The Expectation – Maximization (EM) algorithm is used in maximum likelihood
estimation where the problem involves two sets of random variables of which one, X,
is observable and the other, Z, is hidden.
• The goal of the algorithm is to find the parameter vector ∅ that maximizes the
likelihood of the observed values of X, L (∅ | X)
• But in cases where this is not feasible, we associate the extra hidden variables Z and
express the underlying model using both, to maximize the likelihood of the joint
distribution of X and Z, the complete likelihood Lc(∅|X, Z).
• Expectation -maximization (EM) is an iterative method used to find maximum
likelihood estimates of parameters in probabilistic models, where the model depends
on unobserved, also called latent, variables.
• EM alternates between performing an expectation (E) step, which computes an
expectation of the likelihood by including the latent variables as if they were
observed, and a maximization (M) step, which computes the maximum likelihood
estimates of the parameters by maximizing the expected likelihood found in the E
step.
• The Parameters found on the M step are then used to start another E step, and the
process is repeated until some criterion is satisfied. EM is frequently used for data
clustering like for example in Gaussian mixtures.
• In the Expectation step, find the expected values of the latent variables (here you
need to use the current parameter values)
• In the Maximization step, first plug in the expected values of the latent variables in
the log-likelihood of the augmented data. Then maximize this log-likelihood to
re-estimate the parameters.
• Expectation – Maximization (EM) is a technique used in point estimation. Given a set
of observable variables X and unknown (latent) variables Z, we want to estimate the
parameters ∅ in a model.
• The expectation maximization (EM) algorithm is a widely used maximum likelihood
estimation procedure for statistical models when the values of some of the
variables in the model are not observed.
• The EM algorithm is an elegant and powerful method for finding the maximum
likelihood of models with hidden variables. The key concept in the EM algorithm is
that it iterates between the expectation step (E-step) and the maximization step (M-step)
until convergence.
• In the E-step, the algorithm estimates the posterior distribution of the hidden variables
Q given the observed data and the current parameter settings; and in the M-step the
algorithm calculates the ML parameter settings with Q fixed.
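A minimal sketch of these alternating E and M steps for a two-component 1-D Gaussian mixture, keeping the variances fixed and equal for brevity; the data and initial means below are made up:

```python
import math

def em_two_gaussians(data, mu, sigma=1.0, iterations=20):
    """EM for a 1-D mixture of two Gaussians with fixed, equal variance."""
    pi = [0.5, 0.5]                      # mixing weights
    for _ in range(iterations):
        # E-step: posterior responsibility of each component for each point,
        # computed with the current parameter settings
        resp = []
        for x in data:
            p = [pi[k] * math.exp(-(x - mu[k]) ** 2 / (2 * sigma ** 2))
                 for k in range(2)]
            total = p[0] + p[1]
            resp.append([p[0] / total, p[1] / total])
        # M-step: re-estimate means and mixing weights with the
        # responsibilities held fixed
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            pi[k] = nk / len(data)
    return mu, pi

# Two well-separated clumps of made-up data
data = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
mu, pi = em_two_gaussians(data, mu=[0.0, 6.0])
print(mu)   # means converge near 1.0 and 5.0
```

Each iteration can only increase the likelihood, so the estimated means settle near the two clump centres after a few passes.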
• At the end of each iteration the lower bound on the likelihood is optimized for the
given parameter setting (M-step) and the likelihood is set to the bound (E-step),
which guarantees an increase in the likelihood and convergence to a local
maximum, or a global maximum if the likelihood function is unimodal.
• Generally, EM works best when the fraction of missing information is small and the
dimensionality of the data is not too large. EM can require many iterations, and higher
dimensionality can dramatically slow down the E-step.
• EM is useful for several reasons: conceptual simplicity, ease of implementation, and
the fact that each iteration improves L(∅). The rate of convergence on the first few
steps is typically quite good, but can become excruciatingly slow as you approach
local optima.
• Sometimes the M- step is a constrained maximization, which means that there are
constraints on valid solutions not encoded in the function itself.
• Expectation maximization is an effective technique that is often used in data analysis
to manage missing data. Indeed, expectation maximization overcomes some of the
limitations of other techniques, such as mean substitution or regression substitution.
These alternative techniques generate biased estimates – and specifically,
underestimate the standard errors. Expectation maximization overcomes this problem.
2. The decision rule used to drive a classification from the K-nearest neighbors.
3. The number of neighbors used to classify the new example.
Q10. What is K-means clustering?
Ans: k-means clustering is a heuristic method. Here each cluster is represented by the
center of the cluster. The k-means algorithm takes the input parameter, k, and partitions a
set of n objects into k clusters so that the resulting intracluster similarity is high but the
intercluster similarity is low.
Q.11 List the properties of K-Means algorithm.
Ans : 1. There are always k clusters.
2. There is always at least one item in each cluster.
3. The clusters are non – hierarchical and they do not overlap.
Q.12 What is stacking ?
Ans: Stacking, sometimes called stacked generalization, is an ensemble machine
learning method that combines multiple heterogeneous base or component models via a
meta-model.
Q. 13 How do GMMs differentiate from K- means clustering ?
Ans: GMMs and K-means are both clustering algorithms used for unsupervised learning
tasks. However, the basic difference between them is that K-means is a distance-based
clustering method while GMM is a distribution-based clustering method.
UNIT V
NEURAL NETWORKS
Perceptron is a Machine Learning algorithm for supervised learning of various binary classification
tasks. Further, Perceptron is also understood as an Artificial Neuron or neural network unit that
helps to detect certain input data computations in business intelligence.
The Perceptron model is also treated as one of the best and simplest types of Artificial Neural Networks.
However, it is a supervised learning algorithm of binary classifiers. Hence, we can consider it as a
single-layer neural network with four main parameters, i.e., input values, weights and bias, net
sum, and an activation function.
Mr. Frank Rosenblatt invented the perceptron model as a binary classifier which contains three
main components. These are as follows:
Fig: 5.1
o Input Nodes or Input Layer:
This is the primary component of Perceptron which accepts the initial data into the system for
further processing. Each input node contains a real numerical value.
o Weight and Bias:
The weight parameter represents the strength of the connection between units. This is another most
important parameter of Perceptron components. Weight is directly proportional to the strength of
the associated input neuron in deciding the output. Further, bias can be considered as the line of
intercept in a linear equation.
o Activation Function:
These are the final and important components that help to determine whether the neuron will fire
or not. Activation Function can be considered primarily as a step function.
o Sign function
o Step function, and
o Sigmoid function
Fig:5.2
The data scientist uses the activation function to take a subjective decision based on various
problem statements and forms the desired outputs. Activation function may differ (e.g., Sign, Step,
and Sigmoid) in perceptron models by checking whether the learning process is slow or has
vanishing or exploding gradients.
Fig:5.3
This step function or Activation function plays a vital role in ensuring that output is mapped
between required values (0,1) or (-1,1). It is important to note that the weight of input is indicative
of the strength of a node. Similarly, an input's bias value gives the ability to shift the activation
function curve up or down.
Step-1
In the first step, multiply all input values with the corresponding weight values and then add them
to determine the weighted sum. Mathematically, we can calculate the weighted sum as follows:
Add a special term called bias 'b' to this weighted sum to improve the model's performance.
∑wi*xi + b
Step-2
In the second step, an activation function is applied with the above-mentioned weighted sum,
which gives us output either in binary form or a continuous value as follows:
Y = f(∑wi*xi + b)
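The two steps can be sketched as a single neuron in Python. The weights and bias below are illustrative values (chosen so the neuron realises a logical AND of two binary inputs), not part of any trained model:

```python
def step(z):
    # Step/threshold activation: fire (1) if the weighted sum is non-negative
    return 1 if z >= 0 else 0

def perceptron_output(inputs, weights, bias):
    # Step 1: weighted sum  sum(wi * xi) + b
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Step 2: apply the activation function  Y = f(sum(wi * xi) + b)
    return step(z)

# Illustrative weights realising a logical AND of two binary inputs
weights, bias = [1.0, 1.0], -1.5
print(perceptron_output([1, 1], weights, bias))  # 1
print(perceptron_output([1, 0], weights, bias))  # 0
```

Swapping `step` for a sign or sigmoid function gives the other activation choices listed above without changing the weighted-sum step.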
Based on the layers, Perceptron models are divided into two types. These are as follows:
This is one of the easiest types of Artificial Neural Networks (ANN). A single-layered perceptron
model consists of a feed-forward network and also includes a threshold transfer function inside the
model. The main objective of the single-layer perceptron model is to analyze linearly separable
objects with binary outcomes.
In a single-layer perceptron model, the algorithm does not have previously recorded data, so it
begins with randomly allocated values for the weight parameters. Further, it sums up all the
weighted inputs. If the total sum of all inputs is more than a pre-determined value, the model gets
activated and shows the output value as +1.
If the outcome is the same as the pre-determined or threshold value, the performance of this model
is stated as satisfactory, and the weights do not change. However, this model shows a few
discrepancies when multiple weighted input values are fed into it. Hence, to find the
desired output and minimize errors, some changes to the weights are necessary.
Like a single-layer perceptron model, a multi-layer perceptron model also has the same model
structure but has a greater number of hidden layers.
The multi-layer perceptron model is also known as the Backpropagation algorithm, which executes
in two stages as follows:
o Forward Stage: Activation functions start from the input layer in the forward stage and
terminate on the output layer.
o Backward Stage: In the backward stage, weight and bias values are modified as per the
model's requirement. In this stage, the error between the actual and demanded output is
propagated backward, originating at the output layer and ending at the input layer.
• Multilayer perceptron is one of the most commonly used machine learning methods.
• The Multi-layer Perceptron network consists of multiple layers of connected neurons.
• Multilayer perceptron is an artificial neural network structure and is a nonparametric
estimator that can be used for classification and regression.
• The multi-layer perceptron is also known as back propagation algorithm, which executes
in two stages as follows:
i. Forward stage:
In Figure 5.1, we start at the left by filling in the values for the inputs. We then use these
inputs and the first level of weights to calculate the activations of the hidden layer, and then
we use those activations and the next set of weights to calculate the activations of the output
layer. Now that we’ve got the outputs of the network, we can compare them to the targets
and compute the error.
• The outputs of these neurons and the second-layer weights (labelled as w) are
used to decide if the output neurons fire or not.
• The error is computed as the sum-of-squares difference between the network outputs
and the targets.
centering the data by bringing mean close to 0. This makes learning for the next layer much
easier.
• This is defined by
ReLU stands for Rectified Linear Unit. It is the most widely used activation function, chiefly
implemented in the hidden layers of neural networks.
Nature :- Non-linear, which means we can easily backpropagate the errors and have multiple layers
of neurons being activated by the ReLU function.
Uses :- ReLU is less computationally expensive than tanh and sigmoid because it involves simpler
mathematical operations. At a time only a few neurons are activated, making the network sparse
and hence efficient and easy for computation.
• In Stochastic Gradient Descent, a few samples are selected randomly instead of the whole
data set for each iteration.
• In Gradient Descent, there is a term called “batch” which denotes the total number of
samples from a dataset that is used for calculating the gradient for each iteration.
• In typical Gradient Descent optimization, like Batch Gradient Descent, the batch is taken to
be the whole dataset. Although using the whole dataset is really useful for getting to the
minima in a less noisy and less random manner, the problem arises when our dataset gets
big.
• Suppose, you have a million samples in your dataset, so if you use a typical Gradient Descent
optimization technique, you will have to use all of the one million samples for completing
one iteration while performing the Gradient Descent, and it has to be done for every iteration
until the minima are reached. Hence, it becomes computationally very expensive to perform.
• This problem is solved by Stochastic Gradient Descent. SGD uses only a single sample,
i.e., a batch size of one, to perform each iteration.
• The sample is randomly shuffled and selected for performing the iteration.
1. Find the slope of the objective function with respect to each parameter/feature. In other
words, compute the gradient of the function.
2. Pick a random initial value for the parameters. (To clarify, in the parabola example,
differentiate “y” with respect to “x”. If we had more features like x1, x2 etc., we take the
partial derivative of “y” with respect to each of the features.)
4. Calculate the step sizes for each feature as : step size = gradient * learning rate.
5. Calculate the new parameters as : new params = old params -step size
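For the parabola example mentioned in the steps, a minimal sketch (the objective y = (x − 3)², the starting point and the learning rate are all illustrative choices):

```python
def gradient_descent(start_x, learning_rate=0.1, iterations=50):
    # Objective: y = (x - 3)^2, whose gradient is dy/dx = 2 * (x - 3)
    x = start_x                       # initial value for the parameter
    for _ in range(iterations):
        gradient = 2 * (x - 3)        # slope of the objective w.r.t. x
        step_size = gradient * learning_rate
        x = x - step_size             # new params = old params - step size
    return x

print(gradient_descent(start_x=10.0))  # converges towards the minimum at x = 3
```

Stochastic gradient descent follows the same update rule, but computes the gradient from one randomly chosen sample per iteration instead of the whole batch.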
Backpropagation is one of the important concepts of a neural network. Our task is to classify
our data best. For this, we have to update the weights and biases, but how can we do that
in a deep neural network? In the linear regression model, we use gradient descent to optimize the
parameters. Similarly, here we also use the gradient descent algorithm, via Backpropagation.
Backpropagation defines the whole process encompassing both the calculation of the gradient
and its use in stochastic gradient descent. Technically, backpropagation is used to calculate
the gradient of the error of the network with respect to the network's modifiable weights.
Backpropagation is an iterative, recursive and efficient approach through
which it computes the updated weights to improve the network until it is able to perform the
task for which it is being trained. Derivatives of the activation function need to be known at
network design time for Backpropagation.
Backpropagation is widely used in neural network training and calculates the gradient of the loss
function with respect to the weights of the network. It works with a multi-layer neural network
to discover the internal representation of the input-output mapping. It is a standard form of
artificial network training, which supports computing the gradient of the loss function with
respect to all weights in the network. The backpropagation algorithm is used to train a neural
network more effectively through a chain rule method. This gradient is used in a simple
stochastic gradient descent algorithm to find weights that minimize the error. The error
propagates backward from the output nodes to the inner nodes.
The training algorithm of backpropagation involves four stages, which are as follows:
• Initialization of weights - Small random values are assigned.
• Feed-forward - Each input unit X receives an input signal and transmits this signal to each of
the hidden units Z1, Z2, ..., Zn. Each hidden unit calculates the activation function and sends
its signal Zj to each output unit. The output unit calculates the activation function to form
the response to the given input pattern.
• Backpropagation of errors - Each output unit compares its activation Yk with the target
value Tk to determine the associated error for that unit. Based on the error, the factor
δk (k = 1, ..., m) is computed and is used to distribute the error at the output unit Yk back
to all units in the previous layer. Similarly, the factor δj (j = 1, ..., p) is computed for each
hidden unit Zj.
• Updating of weights and biases.
Consider the above backpropagation neural network example diagram to understand the process.
The backpropagation algorithm proceeds in the following steps, assuming a suitable learning
rate α and random initialization of the parameters w_ij^k:
Definition
1. Calculate the forward phase for each input-output pair (x⃗_d, y_d) and store the results ŷ_d, a_j^k and
o_j^k for each node j in layer k by proceeding from layer 0, the input layer, to layer m, the output
layer.
2. Calculate the backward phase for each input-output pair (x⃗_d, y_d) and store the result ∂E_d/∂w_ij^k for
each weight w_ij^k connecting node i in layer k − 1 to node j in layer k, by proceeding
from layer m, the output layer, to layer 1, the input layer.
(a) Evaluate the error term for the final layer δ_j^m by using the second equation.
(b) Backpropagate the error terms for the hidden layers δ_j^k, working backwards from the final
hidden layer k = m − 1, by repeatedly using the third equation.
(c) Evaluate the partial derivatives of the individual error E_d with respect to w_ij^k by using the
first equation.
3. Combine the individual gradients ∂E_d/∂w_ij^k for each input-output pair to get the total gradient
∂E(X, θ)/∂w_ij^k for the entire set of input-output pairs X = {(x⃗₁, y₁), ..., (x⃗_N, y_N)} by using the fourth
equation.
1. Input is modeled using real weights W. The weights are usually randomly selected.
2. Calculate the output for every neuron from the input layer, to the hidden layers, to the output
layer.
3. Calculate the error in the outputs.
4. Travel back from the output layer to the hidden layer to adjust the weights such that the error
is decreased.
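The stages above can be sketched for a tiny 2-1-1 network with sigmoid activations and a sum-of-squares error; the inputs, target, initial weights and learning rate are all made-up illustrative values:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Tiny network: 2 inputs -> 1 hidden unit -> 1 output
x, target = [1.0, 0.5], 1.0
w_hidden, b_hidden = [0.4, -0.2], 0.1     # stage 1: small random-looking weights
w_out, b_out = 0.3, -0.1
alpha = 0.5                                # learning rate

for _ in range(200):
    # Feed-forward
    z_h = w_hidden[0] * x[0] + w_hidden[1] * x[1] + b_hidden
    h = sigmoid(z_h)
    y = sigmoid(w_out * h + b_out)
    # Backpropagation of errors: delta terms via the chain rule
    delta_out = (y - target) * y * (1 - y)
    delta_h = delta_out * w_out * h * (1 - h)
    # Updating of weights and biases
    w_out -= alpha * delta_out * h
    b_out -= alpha * delta_out
    w_hidden = [w - alpha * delta_h * xi for w, xi in zip(w_hidden, x)]
    b_hidden -= alpha * delta_h

print(round(y, 3))  # the output climbs towards the target 1.0
```

Each pass performs one feed-forward step, one backward error-distribution step, and one weight update, exactly mirroring the numbered list above.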
Static Back Propagation - In this type of backpropagation, the static output is created because of
the mapping of static input. It is used to resolve static classification problems like optical character
recognition.
Key Points
• Simplifies the network structure by eliminating weighted links that have the least effect on
the trained network.
• You need to study a group of input and activation values to develop the relationship
between the input and hidden unit layers.
• It helps to assess the impact that a given input variable has on a network output. The
knowledge gained from this analysis should be represented in rules.
• Backpropagation is especially useful for deep neural networks working on error-prone
projects, such as image or speech recognition.
• Backpropagation takes advantage of the chain and power rules, which allows it to
function with any number of outputs.
The vanishing gradient problem is an issue that sometimes arises when training machine
learning algorithms through gradient descent. This most often occurs in neural networks that have
several neuronal layers such as in a deep learning system, but also occurs in recurrent neural
networks.
The key point is that the partial derivatives used to compute the gradient become smaller as one
goes deeper into the network. Since the gradients control how much the network learns during
training, if the gradients are very small or zero, then little to no training can take place, leading
to poor predictive performance.
The problem:
As more layers using certain activation functions are added to neural networks, the gradient
of the loss function approaches zero, making the network hard to train.
Why:
Certain activation functions, like the sigmoid function, squash a large input space into a small
output space between 0 and 1. Therefore, a large change in the input of the sigmoid function will
cause only a small change in the output. Hence, the derivative becomes small.
Fig:5.16
As an example, the above image shows the sigmoid function and its derivative. Note how, when the
inputs of the sigmoid function become larger or smaller (when |x| becomes bigger), the derivative
becomes close to zero.
For a shallow network with only a few layers that use these activations, this isn't a big problem.
However, when more layers are used, it can cause the gradient to be too small for training to work
effectively. Gradients of neural networks are found using backpropagation. Simply put,
backpropagation finds the derivatives of the network by moving layer by layer from the final layer
to the initial one. By the chain rule, the derivatives of each layer are multiplied down the network
(from the final layer to the initial) to compute the derivatives of the initial layers.
However, when n hidden layers use an activation like the sigmoid function, n small derivatives
are multiplied together. Thus, the gradient decreases exponentially as we propagate down to the
initial layers. A small gradient means that the weights and biases of the initial layers will not be
updated effectively with each training session. Since these initial layers are often crucial to
recognizing the core elements of the input data, it can lead to overall inaccuracy of the whole
network.
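The exponential shrinkage can be seen numerically: the sigmoid's derivative is at most 0.25 (at z = 0), so even in the best case a chain-rule product over n layers decays like 0.25ⁿ.

```python
import math

def sigmoid_derivative(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

# Chain-rule product of one sigmoid derivative per layer. Using z = 0 gives
# the largest possible sigmoid derivative, 0.25 -- the best case for sigmoid.
for n_layers in (2, 5, 10, 20):
    gradient = 1.0
    for _ in range(n_layers):
        gradient *= sigmoid_derivative(0.0)
    print(n_layers, gradient)
```

By 20 layers the product is below 10⁻¹², which is why the initial layers receive essentially no weight update.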
Solution:
The simplest solution is to use other activation functions, such as ReLU, which doesn't cause
a small derivative. Residual networks are another solution, as they provide residual connections
straight to earlier layers. The residual connection directly adds the value at the beginning of the
block, x, to the end of the block (F(x) + x). This residual connection doesn't go through activation
functions that "squashes" the derivatives, resulting in a higher overall derivative of the block.
Fig:5.17
An activation function is a simple mathematical function that transforms the given input to
the required output within a certain range. As the name suggests, it activates the neuron when
the output reaches the set threshold value of the function; basically, activation functions are
responsible for switching the neuron ON/OFF. The neuron receives the sum of the product of
inputs and randomly initialized weights, along with a static bias for each layer. The activation
function is applied to this sum, and an output is generated. Activation functions introduce
non-linearity, so as to make the network learn complex patterns in the data, such as in the case
of images, text, videos or sounds. Without an activation function our model is going to behave
like a linear regression model that has limited learning capacity.
5.7 ReLU
The rectified linear activation unit, or ReLU, is one of the few landmarks in the deep learning
revolution. It's simple, yet it's far superior to previous activation functions like sigmoid or tanh.
Both the ReLU function and its derivative are monotonic. If the function receives any
negative input, it returns 0; however, if the function receives any positive value x, it returns that
value. As a result, the output has a range of 0 to infinity. ReLU is the most often used activation
function in neural networks, especially CNNs, and is utilized as the default activation function.
def relu(x):
    return max(0.0, x)

print(relu(1.0))    # 1.0
print(relu(-10.0))  # 0.0
print(relu(0.0))    # 0.0
print(relu(15.0))   # 15.0
print(relu(-20.0))  # 0.0
Fig: 5.19
We see from the plot that all the negative values have been set to zero, and the positive values
are returned as they are. Note that we have given a set of consecutively increasing numbers as
input, so we have a linear output with an increasing slope.
Advantages of ReLU:
ReLU is used in the hidden layers instead of Sigmoid or tanh, as using sigmoid or tanh in the
hidden layers leads to the infamous problem of the "Vanishing Gradient". The "Vanishing Gradient"
prevents the earlier layers from learning important information when the network is
backpropagating. The sigmoid, which is a logistic function, is preferable for
regression or binary classification problems, and that too only in the output layer, as the
output of a sigmoid function ranges from 0 to 1. Also, sigmoid and tanh saturate and have lesser sensitivity.
→ Simpler Computation: Derivative remains constant i.e 1 for a positive input and thus
reduces the time taken for the model to learn and in minimizing the errors.
→ Linearity: Linear activation functions are easier to optimize and allow for a smooth flow.
So, it is best suited for supervised tasks on large sets of labelled data.
Disadvantages of ReLU:
• Exploding Gradient: This occurs when the gradient gets accumulated, causing large
differences in the subsequent weight updates. As a result, it causes instability when
converging to the global minimum, and instability in the learning too.
• Dying ReLU: The problem of "dead neurons" occurs when a neuron gets stuck on the
negative side and constantly outputs zero. Because the gradient of 0 is also 0, it is unlikely for
the neuron to ever recover. This happens when the learning rate is too high or the negative bias
is quite large.
Grid Search CV
Fig: 5.20
As in the image, for C = [0.1, 0.2, 0.3, 0.4, 0.5] and Alpha = [0.1, 0.2, 0.3, 0.4], the
performance score for the combination C = 0.3 and Alpha = 0.2 comes out to be 0.726 (the highest);
therefore it is selected.
# Necessary imports
logreg_cv.fit(X,y)
Output:
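Since the full listing is not reproduced here, the following is a sketch of what grid search does over the C and Alpha grid of Fig. 5.20. The `score` function is a made-up stand-in for cross-validated model performance; a real run would fit a model for each combination (e.g. with scikit-learn's GridSearchCV) and use its cross-validation score instead.

```python
from itertools import product

def score(c, alpha):
    # Hypothetical stand-in for a cross-validation score, peaking at the
    # combination highlighted in Fig. 5.20 (C = 0.3, Alpha = 0.2).
    return 0.726 - abs(c - 0.3) - abs(alpha - 0.2)

C_grid = [0.1, 0.2, 0.3, 0.4, 0.5]
alpha_grid = [0.1, 0.2, 0.3, 0.4]

# Exhaustively evaluate every (C, Alpha) combination and keep the best
best = max(product(C_grid, alpha_grid), key=lambda p: score(*p))
print(best, round(score(*best), 3))  # (0.3, 0.2) 0.726
```

The drawback hinted at above is exactly this exhaustive loop: the number of fits grows multiplicatively with the size of each parameter grid.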
Drawback:
Normalization is a data pre-processing tool used to bring the numerical data to a common scale
without distorting its shape. Generally, when we input the data to a machine or deep learning
algorithm we tend to change the values to a balanced scale. The reason we normalize is partly to
ensure that our model can generalize appropriately. Now coming back to Batch normalization, it
is a process to make neural networks faster and more stable through adding extra layers in a deep
neural network. The new layer performs the standardizing and normalizing operations on the input
of a layer coming from a previous layer. A typical neural network is trained using a collected set
of input data called batch. Similarly, the normalizing process in batch normalization takes place in
batches, not as a single input.
Fig: 5.21
L = Number of layers
Bias = 0
Initially, our inputs X1, X2, X3, X4 are in normalized form as they are coming from the pre-
processing stage. When the input passes through the first layer, it transforms, as a sigmoid function
applied over the dot product of input X and the weight matrix W.
h1 = 𝜎(W1X)
Fig: 5.22
Similarly, this transformation will take place for the second layer and go till the last layer L as
shown in the following image.
Fig:5.23
Although our input X was normalized, with time the output will no longer be on the same scale.
As the data go through multiple layers of the neural network and L activation functions are applied,
this leads to an internal covariate shift in the data.
Since by now we have a clear idea of why we need Batch normalization, let's understand how it
works. It is a two-step process. First, the input is normalized, and later rescaling and offsetting is
performed.
Normalization is the process of transforming the data to have mean zero and standard deviation
one. In this step we have our batch input from layer h; first, we need to calculate the mean of
the hidden activations:

μ = (1/m) Σ hᵢ

Here, m is the number of neurons at layer h. Once we have the mean, the next step is to
calculate the standard deviation of the hidden activations:

σ = √( (1/m) Σ (hᵢ − μ)² )
Further, as we have the mean and the standard deviation ready, we will normalize the hidden
activations using these values. For this, we will subtract the mean from each input and divide the
whole value by the sum of the standard deviation and the smoothing term (ε). The smoothing term (ε)
assures numerical stability within the operation by preventing division by zero.
hᵢ(norm) = (hᵢ − μ) / (σ + ε)
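The normalization equations, together with the rescaling-and-offsetting step mentioned earlier, can be sketched with NumPy. Here `gamma` and `beta` stand for the learnable scale and shift parameters, and the batch of activations is made up:

```python
import numpy as np

def batch_norm(h, gamma=1.0, beta=0.0, eps=1e-5):
    mu = h.mean()                       # mean of the hidden activations
    sigma = h.std()                     # standard deviation of the activations
    h_norm = (h - mu) / (sigma + eps)   # normalise; eps avoids division by zero
    return gamma * h_norm + beta        # rescale and offset

h = np.array([2.0, 4.0, 6.0, 8.0])
out = batch_norm(h)
print(out.mean().round(6), out.std().round(6))  # close to 0 and 1
```

With gamma = 1 and beta = 0, the output batch has mean approximately zero and standard deviation approximately one, which is exactly the standardization the equations above describe.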
By Normalizing the hidden layer activation the Batch normalization speeds up the training process.
It solves the problem of internal covariate shift. Through this, we ensure that the input for every
layer is distributed around the same mean and standard deviation. If you are unaware of what is an
internal covariate shift, look at the following example.
Suppose we are training an image classification model that classifies images into Dog or Not
Dog. Say we have images of white dogs only; these images will have a certain distribution
as well. Using these images, the model will update its parameters.
Batch normalization smoothens the loss function that in turn by optimizing the model parameters
improves the training speed of the model.
5.10 Regularization
So, before diving into regularization, let's take a step back to understand what bias-variance is and
its impact. Bias is the deviation between the values predicted by the model and the actual values
whereas, variance is the difference between the predictions when the model fits different datasets.
When a model performs well on the training data but does not perform well on the testing data,
the model is said to have a high generalization error. In other words, in such a scenario, the
model has low bias and high variance and is too complex. This is called overfitting. Overfitting
means that the model is a good fit on the train data compared to the test data, as illustrated in
the graph above. Overfitting is also a result of the model being too complex.
Regularization is one of the key concepts in Machine learning as it helps choose a simple model
rather than a complex one. We want our model to perform well both on the train and the new
unseen data, meaning the model must have the ability to be generalized. Generalization error is "a
measure of how accurately an algorithm can predict outcome values for previously unseen data."
Regularization refers to the modifications that can be made to a learning algorithm to help
reduce this generalization error and not the training error. It reduces the error by ignoring the
less important features. It also helps prevent overfitting, making the model more robust and
decreasing the complexity of the model.
Regularization works by shrinking the beta coefficients of a regression model. To understand why
we need to shrink the coefficients, let us see the below example:
Fig: 5.24
In the above graph, the two lines represent the relationship between total years of experience and
salary, where salary is the target variable. The slopes indicate the change in salary per unit
change in total years of experience. As the slope b₁ + b₃ decreases to the slope b₁, we see that
the salary is less sensitive to the total years of experience. By decreasing the slope, the target
variable (salary) becomes less sensitive to the change in the independent X variables, which
introduces bias into the model. Remember, bias is the difference between the predicted and the
actual values.
With the increase in bias to the model, the variance (which is the difference between the predictions
when the model fits different datasets.) decreases. And, by decreasing the variance, the overfitting
gets reduced. The models having the higher variance leads to overfitting, and we saw above, we
will shrink or reduce the beta coefficients to overcome the overfitting. The beta coefficients or the
weights of the features converge towards zero, which is known as shrinkage.
For linear regression, regularization adds a second term to the loss function. The goal of the plain
linear regression model is to minimize the loss function

    Loss = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

Now, with regularization, the goal becomes to minimize the following cost function:

    Cost = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² + penalty

where the penalty term comprises the regularization parameter and the weights associated with
the variables. Hence, the penalty term is:

    penalty = λ ∗ ‖w‖ₚ

with ‖w‖ₚ an Lₚ norm of the weight vector w.
where λ is the regularization parameter.
The regularization parameter λ imposes a higher penalty on variables with larger coefficients,
and hence it controls the strength of the penalty term. This tuning parameter controls the
bias-variance trade-off.
λ can take values from 0 to infinity. If λ = 0, there is no difference between a model with and
without regularization.
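The penalty term above is simple to compute directly. Below is a minimal NumPy sketch using hypothetical weights and a hypothetical λ value, showing the two most common choices of norm (the variable names are ours):

```python
import numpy as np

w = np.array([0.5, -2.0, 3.0])   # hypothetical regression weights
lam = 0.1                        # regularization parameter λ

l1_penalty = lam * np.sum(np.abs(w))   # Lasso-style (L1) penalty: λ * Σ|wⱼ|
l2_penalty = lam * np.sum(w ** 2)      # Ridge-style (L2) penalty: λ * Σwⱼ²
```

Larger weights contribute more to the penalty, so minimizing the total cost pushes the coefficients toward zero, exactly the shrinkage described above.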
Each of the following techniques uses a different regularization norm (Lₚ), based on the
mathematical methodology, which creates a different kind of regularization. These methodologies
have different effects on the beta coefficients of the features. The regularization techniques in
machine learning are as follows: Ridge regression (L₂ penalty), Lasso regression (L₁ penalty),
and Elastic Net regression (a combination of both).
The Elastic Net regression technique is a combination of the Ridge and Lasso regression
techniques. Its penalty is a linear combination of the penalties for both L₁-norm and L₂-norm
regularization.
A model using Elastic Net regression allows the learning of a sparse model, where some of
the coefficients are zero (similar to Lasso regularization), while still maintaining the Ridge
regression properties. The model is therefore trained on both the L₁ and L₂ norms.
• Ridge regression is used when it is important to consider all the independent variables
in the model or when many interactions are present. That is where collinearity or
codependency is present amongst the variables.
• Lasso regression is applied when there are many predictors available and we want
the model to perform feature selection for us as well.
When many variables are present, and we can't determine whether to use Ridge or Lasso
regression, then the Elastic-Net regression is your safe bet.
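To make the relationship between the three penalties concrete, here is a minimal NumPy sketch of the Elastic Net penalty as a weighted mix of the L₁ and L₂ terms. The function name and the lam/alpha parameterization are our own assumptions (they follow a common convention, not the book's notation): alpha = 1 recovers a pure Lasso penalty and alpha = 0 a pure Ridge penalty.

```python
import numpy as np

def elastic_net_penalty(w, lam=0.1, alpha=0.5):
    """Elastic Net penalty: a convex mix of L1 (Lasso) and L2 (Ridge) terms.
    alpha=1 -> pure L1; alpha=0 -> pure L2 (illustrative parameterization)."""
    l1 = np.sum(np.abs(w))   # Lasso term, encourages sparsity
    l2 = np.sum(w ** 2)      # Ridge term, shrinks all coefficients
    return lam * (alpha * l1 + (1.0 - alpha) * l2)

w = np.array([1.0, -2.0])
pen = elastic_net_penalty(w, lam=0.1, alpha=0.5)
```

Because the L₁ term can drive coefficients exactly to zero while the L₂ term keeps the rest well-behaved, this mix gives the sparse-yet-stable behavior described above.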
5.11 Dropout
"Dropout" in machine learning refers to the process of randomly ignoring certain nodes in a layer
during training. In the figure below, the neural network on the left represents a typical neural
network where all units are active. On the right, the red units have been dropped out of the
model: the values of their weights and biases are not considered during training.
Fig:5.25
Dropout is used as a regularization technique - it prevents overfitting by ensuring that no units are
codependent.
Early stopping: stop training automatically when a specific performance measure (e.g. validation
loss or accuracy) stops improving.
Weight decay: incentivize the network to use smaller weights by adding a penalty to the loss
function (this ensures that the norms of the weights are relatively evenly distributed amongst all
the weights in the network, which prevents just a few weights from heavily influencing the
network output).
Noise: allow some random fluctuations in the data through augmentation (which makes the
network robust to a larger distribution of inputs and hence improves generalization)
Model combination: average the outputs of separately trained neural networks (requires a lot of
computational power, data, and time)
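Of these alternatives, early stopping is the easiest to sketch. The helper below is a hypothetical minimal implementation (the function name and patience parameter are ours): it reports the epoch at which training would halt, namely when the validation loss has failed to improve for `patience` consecutive epochs.

```python
def early_stopping(val_losses, patience=2):
    """Return the epoch at which training stops: when validation loss has not
    improved for `patience` consecutive epochs (illustrative sketch)."""
    best, wait = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, wait = loss, 0   # improvement: reset the patience counter
        else:
            wait += 1              # no improvement this epoch
            if wait >= patience:
                return epoch       # patience exhausted: stop here
    return len(val_losses) - 1     # never triggered: train to the end

# The rise at epochs 3-4 exhausts patience, so training stops at epoch 4
stop_epoch = early_stopping([1.0, 0.8, 0.7, 0.75, 0.9, 0.6], patience=2)
```

In practice one would also restore the weights from the best epoch; this sketch only shows the stopping criterion itself.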
Dropout remains an extremely popular protective measure against overfitting because of its
efficiency and effectiveness.
When we apply dropout to a neural network, we're creating a "thinned" network with unique
combinations of the units in the hidden layers being dropped randomly at different points in time
during training. Each time the gradient of our model is updated, we generate a new thinned neural
network with different units dropped based on a probability hyperparameter p. Training a network
using dropout can thus be viewed as training many different thinned neural networks and
merging them into one network that picks up the key properties of each thinned network. This
process allows dropout to reduce the overfitting of models on training data.
This graph, taken from the paper "Dropout: A Simple Way to Prevent Neural Networks from
Overfitting" by Srivastava et al., compares the change in classification error of models without
dropout to the same models with dropout (keeping all other hyperparameters constant). All the
models have been trained on the MNIST dataset.
Fig:5.26
It is observed that the models with dropout had a lower classification error than the same models
without dropout at any given point in time. A similar trend was observed when the models were
trained on other datasets in vision, as well as in speech recognition and text analysis. The lower
error is because dropout helps prevent overfitting on the training data by reducing the reliance of
each unit in the hidden layer on other units in the hidden layers.
Although dropout is clearly a highly effective tool, it comes with certain drawbacks. A network
with dropout can take 2-3 times longer to train than a standard network. One way to attain the
benefits of dropout without slowing down training is by finding a regularizer that is essentially
equivalent to a dropout layer. For linear regression, this regularizer has been proven to be a
modified form of L2 regularization.
In an artificial neural network, the function which takes the incoming signals as
input and produces the output signal is known as the activation function.
In the process of training, we want to start with a badly performing neural network and end
up with a network with high accuracy. In terms of the loss function, we want our loss to be
much lower at the end of training. Improving the network is possible because we can change
its function by adjusting the weights: we want to find another function that performs better
than the initial one.
6. Define SGD.
In Stochastic Gradient Descent, a few samples are selected randomly instead of the whole
data set for each iteration. In Gradient Descent, there is a term called “batch” which denotes
the total number of samples from a dataset that is used for calculating the gradient for each
iteration. In typical Gradient Descent optimization, like Batch Gradient Descent, the batch is
taken to be the whole dataset.
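The definition above can be sketched in a few lines for linear regression. The toy data, function names, and hyperparameters below are our own illustration, not a library API: each step computes the MSE gradient on a small random mini-batch rather than on the whole dataset.

```python
import numpy as np

def sgd_step(w, X_batch, y_batch, lr=0.1):
    """One SGD update on a mini-batch. MSE gradient: (2/m) X^T (Xw - y)."""
    m = len(y_batch)
    grad = (2.0 / m) * X_batch.T @ (X_batch @ w - y_batch)
    return w - lr * grad

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))        # 100 samples, 2 features
true_w = np.array([3.0, -1.0])       # weights we hope to recover
y = X @ true_w                       # noiseless targets for the toy example

w = np.zeros(2)
for _ in range(200):
    idx = rng.integers(0, 100, size=8)   # random mini-batch of 8 samples
    w = sgd_step(w, X[idx], y[idx])
```

Despite each gradient being computed from only 8 of the 100 samples, the iterates converge toward the true weights, which is the point of SGD: cheap noisy steps in roughly the right direction.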
Static Back Propagation - In this type of backpropagation, a static output is produced by
mapping a static input. It is used to solve static classification problems such as optical
character recognition.
9. Define ReLU
The rectified linear activation unit, or ReLU, is one of the landmarks of the deep
learning revolution. It is simple, yet far superior to earlier activation functions such as
sigmoid or tanh.
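A one-line sketch of the function itself (the name `relu` is ours):

```python
import numpy as np

def relu(x):
    """Rectified linear unit: max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

out = relu(np.array([-2.0, 0.0, 3.0]))   # negatives clamp to 0, positives pass through
```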
PART B & C