
Unit 1

Formulating problems
Steps performed by a problem-solving agent
 Goal Formulation: It is the first and simplest step in
problem-solving. Based on the current situation and the
agent's performance measure (discussed below), the agent
selects one goal out of multiple possible goals, together
with the actions needed to achieve that goal.
Problem Formulation: It is the most important step of
problem-solving; it decides what actions should be taken to
achieve the formulated goal. The following five components
are involved in problem formulation:
 Initial State: It is the starting state or initial step of the
agent towards its goal.
 Actions: It is the description of the possible actions
available to the agent.
 Transition Model: It describes what each action does.
 Goal Test: It determines whether the given state is a goal
state.
 Path cost: It assigns a numeric cost to each path. The
problem-solving agent selects a cost function that reflects
its performance measure. Remember, an optimal solution
has the lowest path cost among all the solutions.
Note: The initial state, actions, and transition model together
define the state space of the problem implicitly. The state
space of a problem is the set of all states which can be
reached from the initial state by any sequence of actions.
The state space forms a directed graph where the nodes are
states, the links between the nodes are actions, and a path
is a sequence of states connected by a sequence of actions.
 Search: It identifies the best possible sequence of
actions to reach the goal state from the current state. It
takes a problem as input and returns a solution as output.
 Solution: A solution is the action sequence found by the
search algorithm; the best solution is the one with the
lowest path cost.
 Execution: It executes the solution found by the search
algorithm to reach the goal state from the current state.
A minimal code sketch of these components is given below.
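These components translate naturally into code. The following is a minimal, illustrative Python sketch of a generic problem definition; the class and method names are hypothetical, not taken from any particular library.

```python
class Problem:
    """A problem formulation: initial state, actions, transition
    model, goal test, and path cost (step cost of 1 by default)."""

    def __init__(self, initial_state, goal_state):
        self.initial_state = initial_state
        self.goal_state = goal_state

    def actions(self, state):
        """Description of the possible actions available in `state`."""
        raise NotImplementedError

    def result(self, state, action):
        """Transition model: the state produced by doing `action` in `state`."""
        raise NotImplementedError

    def goal_test(self, state):
        """Determine whether the given state is a goal state."""
        return state == self.goal_state

    def step_cost(self, state, action):
        """Numeric cost of one step; an optimal solution minimizes the total."""
        return 1
```

A concrete problem subclasses this sketch and fills in actions and result; the search algorithms discussed later rely only on these five pieces.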
Example Problems
Basically, there are two types of problem approaches:
 Toy Problem: It is a concise and exact description of the
problem, used by researchers to compare the performance
of algorithms.
 Real-world Problem: These are problems based on the
real world which require solutions. Unlike a toy problem, a
real-world problem does not have a single exact description,
but we can give a general formulation of the problem.
Some Toy Problems
 8 Puzzle Problem: Here, we have a 3×3 matrix with
movable tiles numbered from 1 to 8 and one blank space.
A tile adjacent to the blank space can slide into that
space. The objective is to transform a given start state
into the specified goal state, as shown in the figure below.
In the figure, our task is to convert the current (start)
state into the goal state by sliding digits into the blank space.

The problem formulation is as follows:
 States: It describes the location of each numbered tile
and the blank tile.
 Initial State: We can start from any state as the initial
state.
 Actions: Here, the actions of the blank space are
defined, i.e., either left, right, up, or down.
 Transition Model: It returns the resulting state for a
given state and action.
 Goal test: It identifies whether we have reached the
correct goal state.
 Path cost: The path cost is the number of steps in the
path, where the cost of each step is 1.
Note: The 8-puzzle problem is a type of sliding-block
problem which is used for testing new search algorithms in
artificial intelligence.
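As an illustrative sketch (the encoding is an assumption, not from the source), the actions and transition model of the 8-puzzle can be written with the state as a 9-tuple read row by row, with 0 standing for the blank:

```python
MOVES = {"up": -3, "down": 3, "left": -1, "right": 1}

def actions(state):
    """Legal moves of the blank (0) on a 3x3 board stored as a 9-tuple."""
    i = state.index(0)          # position of the blank
    legal = []
    if i >= 3:     legal.append("up")
    if i <= 5:     legal.append("down")
    if i % 3 != 0: legal.append("left")
    if i % 3 != 2: legal.append("right")
    return legal

def result(state, action):
    """Transition model: slide the adjacent tile into the blank."""
    i = state.index(0)
    j = i + MOVES[action]
    board = list(state)
    board[i], board[j] = board[j], board[i]
    return tuple(board)
```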

 8-queens problem: The aim of this problem is to place
eight queens on a chessboard so that no queen attacks
any other. A queen can attack another queen that lies in
the same row, the same column, or on the same diagonal.
From the following figure, we can understand the problem
as well as its correct solution.
It is noticed from the above figure that each queen is placed
on the chessboard in a position where no other queen lies
in the same row, column, or diagonal. Therefore, it is one
correct solution to the 8-queens problem.
For this problem, there are two main kinds of
formulation:
 Incremental formulation: It starts from an empty state,
and the operator adds a queen at each step. The following
steps are involved in this formulation:
 States: Any arrangement of 0 to 8 queens on the
chessboard.
 Initial State: An empty chessboard.
 Actions: Add a queen to any empty square.
 Transition model: Returns the chessboard with the
queen added to a square.
 Goal test: Checks whether 8 queens are placed on the
chessboard with none attacking another.
 Path cost: There is no need for a path cost because only
final states are counted.
In this formulation, there are approximately 1.8 × 10^14
possible sequences to investigate.
 Complete-state formulation: It starts with all 8 queens
on the chessboard and moves them around to remove the
attacks. The following steps are involved in this formulation:
 States: Arrangement of all 8 queens, one per column,
with no queen attacking another.
 Actions: Move a queen to a location where it is safe
from attack.
This formulation is better than the incremental formulation,
as it reduces the state space from 1.8 × 10^14 to 2,057
states, making it easier to find solutions. A sketch of the
goal test for this formulation is given below.
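A minimal Python sketch of the goal test for the complete-state formulation, with a state encoded as a tuple giving the row of the queen in each column (the encoding and names are illustrative):

```python
def is_goal(rows):
    """rows[c] is the row of the queen in column c (one queen per column).
    Returns True if no two queens share a row or a diagonal."""
    n = len(rows)
    for c1 in range(n):
        for c2 in range(c1 + 1, n):
            same_row = rows[c1] == rows[c2]
            same_diag = abs(rows[c1] - rows[c2]) == abs(c1 - c2)
            if same_row or same_diag:
                return False
    return True

print(is_goal((0, 4, 7, 5, 2, 6, 1, 3)))   # True: a valid 8-queens placement
```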
Some Real-world problems

 Traveling salesperson problem (TSP): It is a touring
problem where the salesperson must visit each city
exactly once. The objective is to find the shortest tour
while selling the goods in each city.
 VLSI Layout problem: In this problem, millions of
components and connections are positioned on a chip
so as to minimize area, circuit delays, and stray
capacitances, and to maximize the manufacturing yield.
The layout problem is split into two parts:
 Cell layout: Here, the primitive components of the
circuit are grouped into cells, each performing a specific
function. Each cell has a fixed shape and size. The task
is to place the cells on the chip without overlap.
 Channel routing: It finds a specific route for each wire
through the gaps between the cells.
 Protein Design: The objective is to find a sequence of
amino acids that will fold into a three-dimensional protein
with properties that might cure some disease.
Searching for solutions
We have seen many problems. Now, there is a need to
search for solutions to solve them.
In this section, we will understand how searching can be
used by the agent to solve a problem.
For solving different kinds of problems, an agent uses
different strategies to reach the goal by searching for the
best possible sequence of actions. This process of searching
is known as a search strategy.
Measuring problem-solving performance
Before discussing different search strategies, we should
consider how the performance of an algorithm is measured.
There are four ways to measure the performance of an
algorithm:
Completeness: Whether the algorithm is guaranteed to
find a solution (if any solution exists).
Optimality: Whether the strategy finds an optimal solution.
Time Complexity: The time taken by the algorithm to find a
solution.
Space Complexity: The amount of memory required to
perform the search.
The complexity of an algorithm depends on the branching
factor b (the maximum number of successors), the depth d
of the shallowest goal node (i.e., the number of steps from
the root to that node), and the maximum length m of any
path in the state space.
Search Strategies
There are two types of strategies that describe a solution for
a given problem:
Uninformed Search (Blind Search)
This type of search strategy has no additional information
about the states beyond what is provided in the problem
definition. It can only generate the successors and
distinguish a goal state from a non-goal state. This type of
search does not maintain any internal state; that is why it is
also known as blind search.
Unit 2

The following are the types of uninformed searches:
 Breadth-first search
 Uniform cost search
 Depth-first search
 Depth-limited search
 Iterative deepening search
 Bidirectional search
Informed Search (Heuristic Search)
This type of search strategy contains some additional
information about the states beyond the problem definition.
This search uses problem-specific knowledge to find more
efficient solutions. This search maintains some sort of
internal states via heuristic functions (which provides hints),
so it is also called heuristic search.
The following are the types of informed searches:
 Best first search (Greedy search)

 A* search
Breadth-first search (BFS)
It is a simple search strategy where the root node is
expanded first, then all the successors of the root node,
then the nodes of the next level, and so on, until the goal
node is found.
BFS expands the shallowest (i.e., least deep) node first,
using FIFO (first in, first out) order. Thus, new nodes (i.e.,
children of a parent node) join the back of the queue, and
old unexpanded nodes, which are shallower than the new
nodes, get expanded first.
In BFS, the goal test (a test to check whether the current
state is a goal state or not) is applied to each node at the
time of its generation rather than when it is selected for
expansion.

Breadth-first search tree
In the above figure, it is seen that the nodes are expanded
level by level starting from the root node A till the last
node I in the tree. Therefore, the BFS sequence followed
is: A->B->C->D->E->F->G->I.
BFS Algorithm
 Set a variable NODE to the initial state, i.e., the root
node.
 Set a variable GOAL which contains the value of
the goal state.
 Loop over the nodes, traversing level by level, until the
goal state is found.
 While looping, remove the elements from the queue in
FIFO order.
 If the goal state is found, return the goal state;
otherwise, continue the search.
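The steps above translate almost directly into code. Here is a minimal illustrative sketch of BFS over a graph given as an adjacency dictionary, applying the goal test at generation time as described; the graph itself is assumed for the demonstration:

```python
from collections import deque

def bfs(graph, start, goal):
    """Expand nodes level by level (FIFO); goal test at generation."""
    if start == goal:
        return [start]
    frontier = deque([[start]])        # queue of paths
    visited = {start}
    while frontier:
        path = frontier.popleft()      # shallowest path first
        for child in graph.get(path[-1], []):
            if child == goal:
                return path + [child]
            if child not in visited:
                visited.add(child)
                frontier.append(path + [child])
    return None

graph = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F", "G"], "E": ["I"]}
print(bfs(graph, "A", "I"))   # ['A', 'B', 'E', 'I']
```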
The performance measure of BFS is as follows:
 Completeness: It is a complete strategy; it is guaranteed
to find the goal state, provided the branching factor is
finite and a solution exists.
 Optimality: It gives an optimal solution if the cost of
each step is the same.
 Space Complexity: The space complexity of BFS
is O(b^d), i.e., it requires a huge amount of memory.
Here, b is the branching factor and d denotes the
depth/level of the shallowest goal node in the tree.
 Time Complexity: The time complexity of BFS is also
O(b^d), so it consumes much time to reach the goal node
for large instances.
Disadvantages of BFS
 The biggest disadvantage of BFS is that it requires a lot
of memory space; it is a memory-bounded strategy.
 BFS is a time-consuming search strategy because it
expands the nodes breadthwise.
Note: BFS expands the nodes level by level, i.e.,
breadthwise, therefore it is also known as a Level search
technique.

Example:

Consider the search problem below; we will traverse it
using greedy best-first search. At each iteration, each node
is expanded using the evaluation function f(n) = h(n),
which is given in the table below.

In this search example, we are using two lists, the OPEN
and CLOSED lists. The following are the iterations for
traversing the above example.

Expand the nodes of S and put them in the CLOSED list.
Initialization: Open [A, B], Closed [S]
Iteration 1: Open [A], Closed [S, B]
Iteration 2: Open [E, F, A], Closed [S, B]
           : Open [E, A], Closed [S, B, F]
Iteration 3: Open [I, G, E, A], Closed [S, B, F]
           : Open [I, E, A], Closed [S, B, F, G]
Hence the final solution path will be: S->B->F->G
Time Complexity: The worst-case time complexity of
greedy best-first search is O(b^m).
Space Complexity: The worst-case space complexity of
greedy best-first search is O(b^m), where m is the maximum
depth of the search space.
Complete: Greedy best-first search is incomplete, even if
the given state space is finite.
Optimal: Greedy best-first search is not optimal.

Uniform-cost search
Unlike BFS, this uninformed search explores nodes based
on their path cost from the root node. It expands the node n
having the lowest path cost g(n), where g(n) is the total cost
from the root node to node n. Uniform-cost search differs
from breadth-first search for the following two reasons:

 First, the goal test is applied to a node only when it is
selected for expansion, not when it is first generated,
because the first goal node that is generated may be on a
suboptimal path.
 Secondly, a test is added in case a better/optimal path
is found to a node currently awaiting expansion.
Thus, uniform-cost search expands nodes in order of their
optimal path cost, because before exploring any node it
searches for the optimal path. Also, since the step costs are
positive, paths never get shorter when a new node is added
in the search.

Uniform-cost search on a binary tree
In the above figure, it is seen that the goal state is F and
the start/initial state is A. There are three paths available to
reach the goal node. We need to select the optimal path,
which gives the lowest total cost g(n). Therefore, A->B->E->F
gives the optimal path cost, i.e., 0+1+3+4 = 8.
Uniform-cost search Algorithm
 Set a variable NODE to the initial state, i.e., the root
node, and expand it.
 After expanding the root node, select the node having
the lowest path cost and expand it further. Remember,
the selection of the node should give an optimal path cost.
 If the goal node is reached with the optimal value,
return the goal state; else, carry on the search.
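A minimal illustrative sketch of uniform-cost search using Python's heapq module as the priority queue; the goal test is applied when a node is selected for expansion, as described above. The edge costs are assumptions chosen so that the optimal path A->B->E->F costs 8, matching the figure's example:

```python
import heapq

def uniform_cost_search(graph, start, goal):
    """Expand the node with the lowest path cost g(n) first."""
    frontier = [(0, start, [start])]          # (g(n), node, path)
    best_cost = {start: 0}
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:                      # goal test on expansion
            return cost, path
        for child, step in graph.get(node, {}).items():
            new_cost = cost + step
            if new_cost < best_cost.get(child, float("inf")):
                best_cost[child] = new_cost
                heapq.heappush(frontier, (new_cost, child, path + [child]))
    return None

graph = {"A": {"B": 1, "C": 2}, "B": {"E": 3}, "C": {"F": 7}, "E": {"F": 4}}
print(uniform_cost_search(graph, "A", "F"))   # (8, ['A', 'B', 'E', 'F'])
```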
The performance measure of Uniform-cost search
 Completeness: It is guaranteed to reach the goal state,
provided every step cost is positive.
 Optimality: It gives the optimal path-cost solution for the
search.
 Space and time complexity: The worst-case space and
time complexity of uniform-cost search is O(b^(1+⌊C*/ε⌋)),
where C* is the cost of the optimal solution and ε is the
minimum step cost.
Note: When the path cost is the same for all the nodes, it
behaves similarly to BFS.
Disadvantages of Uniform-cost search
 It does not care about the number of steps a path takes
to reach the goal state.
 It may get stuck in an infinite loop if there is a path with
an infinite sequence of zero-cost actions.
 It works hard, as it examines each node in search of the
lowest-cost path.
Depth-first search
This search strategy explores the deepest node first, then
backtracks to explore other nodes. It uses LIFO (last in,
first out) order, based on a stack, in order to expand the
unexpanded nodes in the search tree. The search proceeds
to the deepest level of the tree, where a node has no
successors. In the worst case, the search expands nodes all
the way down to the maximum depth of the tree.

DFS search tree
In the above figure, DFS works starting from the initial
node A (root node) and traversing in one direction deeply till
node I and then backtrack to B and so on. Therefore, the
sequence will be A->B->D->I->E->C->F->G.
DFS Algorithm
 Set a variable NODE to the initial state, i.e., the root
node.
 Set a variable GOAL which contains the value of
the goal state.
 Loop over the nodes, traversing deeply in one
direction/path in search of the goal node.
 While looping, remove the elements from the stack in
LIFO order.
 If the goal state is found, return the goal state;
otherwise, backtrack to expand nodes in another direction.
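A minimal recursive sketch of DFS with backtracking (the graph and node names are illustrative):

```python
def dfs(graph, node, goal, path=None, visited=None):
    """Explore the deepest node first; backtrack when a branch dead-ends."""
    path = (path or []) + [node]
    visited = visited if visited is not None else set()
    visited.add(node)
    if node == goal:
        return path
    for child in graph.get(node, []):
        if child not in visited:
            found = dfs(graph, child, goal, path, visited)
            if found:
                return found
    return None                      # dead end: caller backtracks

graph = {"A": ["B", "C"], "B": ["D", "E"], "D": ["I"], "C": ["F", "G"]}
print(dfs(graph, "A", "G"))   # ['A', 'C', 'G']
```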
The performance measure of DFS
 Completeness: DFS does not guarantee reaching the
goal state.
 Optimality: It does not give an optimal solution, as it
expands nodes deeply in one direction.
 Space complexity: It needs to store only a single path
from the root node to a leaf node. Therefore, DFS
has O(bm) space complexity, where b is the branching
factor (i.e., the total number of child nodes a parent node
has) and m is the maximum length of any path.
 Time complexity: DFS has O(b^m) time complexity.

Disadvantages of DFS
 It may get trapped in an infinite loop.
 It is also possible that it may not reach the goal state.
 DFS does not give an optimal solution.
Note: DFS uses the concept of backtracking to explore
each node in a search tree.
Depth-limited search
This search strategy is similar to DFS, with a small
difference: in depth-limited search, we bound the search by
imposing a depth limit l on the depth of the search tree. It
does not need to explore to unbounded depth. As a result,
depth-first search is a special case of depth-limited search
with the limit l = ∞.

Depth-limited search on a binary tree
In the above figure, the depth limit is 1. So only levels 0
and 1 get expanded, in the DFS sequence A->B->C, starting
from the root node A. This does not give a satisfactory
result, because the goal node I is not reached.
Depth-limited search Algorithm
 Set a variable NODE to the initial state, i.e., the root
node.
 Set a variable GOAL which contains the value of
the goal state.
 Set a variable LIMIT which carries the depth-limit value.
 Loop over the nodes, traversing in DFS manner up to the
depth-limit value.
 While looping, remove the elements from the stack in
LIFO order.
 If the goal state is found, return the goal state;
else, terminate the search.
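A sketch of depth-limited search as a depth-bounded recursive DFS; returning the string "cutoff" separates "no solution within the depth-limit" from the standard failure value None, matching the two failure kinds noted later in this subsection (the graph is illustrative):

```python
def depth_limited_search(graph, node, goal, limit):
    """DFS that does not expand below depth `limit`.
    Returns a path, None (failure), or "cutoff" (limit reached)."""
    if node == goal:
        return [node]
    if limit == 0:
        return "cutoff"
    cutoff = False
    for child in graph.get(node, []):
        result = depth_limited_search(graph, child, goal, limit - 1)
        if result == "cutoff":
            cutoff = True
        elif result is not None:
            return [node] + result
    return "cutoff" if cutoff else None

graph = {"A": ["B", "C"], "B": ["D"], "D": ["I"]}
print(depth_limited_search(graph, "A", "I", 1))   # cutoff
print(depth_limited_search(graph, "A", "I", 3))   # ['A', 'B', 'D', 'I']
```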
The performance measure of Depth-limited search
 Completeness: Depth-limited search does not
guarantee reaching the goal node.
 Optimality: It does not give an optimal solution, as it
expands nodes only up to the depth limit.
 Space Complexity: The space complexity of
depth-limited search is O(bl).
 Time Complexity: The time complexity of depth-limited
search is O(b^l).
Disadvantages of Depth-limited search
 This search strategy is not complete.
 It does not provide an optimal solution.
Note: Depth-limit search terminates with two kinds of
failures: the standard failure value indicates “no solution,”
and cut-off value, which indicates “no solution within the
depth-limit.”
Iterative deepening depth-first search / Iterative
deepening search
This search is a combination of BFS and DFS: BFS
guarantees reaching the goal node, while DFS occupies less
memory space. Iterative deepening search combines these
two advantages to reach the goal node. It gradually
increases the depth limit from 0, 1, 2, and so on until the
goal node is reached.

In the above figure, the goal node is H and the initial
depth limit is [0-1]. So the search will expand levels 0 and 1
and terminate with the sequence A->B->C. Next, the depth
limit is changed to [0-3]; the search again expands the nodes
from level 0 to level 3 and terminates with the sequence
A->B->D->F->E->H, where H is the desired goal node.
Iterative deepening search Algorithm
 Explore the nodes in DFS order.
 Set a LIMIT variable with a limit value.
 Loop over the nodes up to the limit value, then increase
the limit value and repeat.
 Terminate the search when the goal state is found.
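Iterative deepening then just wraps depth-limited search in a loop over increasing limits. This sketch reuses the depth_limited_search function from the previous subsection and an assumed example graph:

```python
from itertools import count

def iterative_deepening_search(graph, start, goal):
    """Run depth-limited search with limit 0, 1, 2, ... until a
    solution is found or the whole tree has been explored."""
    for limit in count():
        result = depth_limited_search(graph, start, goal, limit)
        if result != "cutoff":       # either a path or definite failure
            return result

graph = {"A": ["B", "C"], "B": ["D", "E"], "E": ["H"]}
print(iterative_deepening_search(graph, "A", "H"))   # ['A', 'B', 'E', 'H']
```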
The performance measure of Iterative deepening
search
 Completeness: Iterative deepening search is complete
when the branching factor is finite.
 Optimality: It gives an optimal solution when the cost of
each step is the same.
 Space Complexity: Like DFS, its space requirement is
modest: O(bd).
 Time Complexity: It has O(b^d) time complexity.
Disadvantages of Iterative deepening search
 The drawback of iterative deepening search is that it
seems wasteful because it generates states multiple
times.
Note: Generally, iterative deepening search is required
when the search space is large, and the depth of the
solution is unknown.
Bidirectional search
The strategy behind bidirectional search is to run two
searches simultaneously: one forward search from the
initial state and the other backward from the goal, hoping
that the two searches meet in the middle. As soon as the
two searches intersect one another, the bidirectional search
terminates with the goal node. This search is implemented
by replacing the goal test with a check of whether the two
searches intersect, because if they do, a solution has been
found.
The performance measure of Bidirectional search
 Complete: Bidirectional search is complete.
 Optimal: It gives an optimal solution.
 Time and space complexity: Bidirectional search
has O(b^(d/2)) time and space complexity.
Disadvantage of Bidirectional Search
 It requires a lot of memory space.

An informed search is more efficient than an uninformed
search because, along with the current state information,
some additional information is present, which makes it
easier to reach the goal state.
Below we discuss different types of informed search:

Best-first Search (Greedy search)
A best-first search is a general approach to informed
search. Here, a node is selected for expansion based on
an evaluation function f(n), where f(n) gives a cost
estimate; the node with the lowest estimated cost is
expanded first. A component of f(n) is h(n), which carries
the additional information required by the search
algorithm, i.e.,
h(n) = estimated cost of the cheapest path from the
current node n to the goal node.
Note: If the current node n is a goal node, the value of h(n)
will be 0.
Best-first search is known as greedy search because it
always tries to explore the node which appears nearest to
the goal and selects the path which promises a quick
solution. Thus, it evaluates nodes with the help of the
heuristic function alone, i.e., f(n) = h(n).
Best-first search Algorithm
1. Set up an OPEN list and a CLOSED list, where the
OPEN list contains visited but unexpanded nodes
and the CLOSED list contains visited as well as
expanded nodes.
2. Initially, traverse the root node, visit its successor
nodes, and place them in the OPEN list in ascending
order of their heuristic values.
3. Select the first successor node from the OPEN list, i.e.,
the one with the lowest heuristic value, and expand it
further.
4. Now, reorder all the remaining unexpanded nodes in
the OPEN list and repeat the above two steps.
5. If the goal node is reached, terminate the search; else,
expand further (see the sketch below).
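A minimal sketch of this procedure in Python, using a heap-ordered OPEN list and a CLOSED set; since the figure's heuristic table is not available, the graph and h values below are assumptions that reproduce the earlier worked path S->B->F->G:

```python
import heapq

def greedy_best_first(graph, h, start, goal):
    """Always expand the OPEN node with the lowest heuristic h(n)."""
    open_list = [(h[start], start, [start])]    # ordered by h(n)
    closed = set()
    while open_list:
        _, node, path = heapq.heappop(open_list)
        if node == goal:
            return path
        closed.add(node)
        for child in graph.get(node, []):
            if child not in closed:
                heapq.heappush(open_list, (h[child], child, path + [child]))
    return None

graph = {"S": ["A", "B"], "B": ["E", "F"], "F": ["I", "G"]}
h = {"S": 13, "A": 12, "B": 4, "E": 8, "F": 2, "I": 9, "G": 0}
print(greedy_best_first(graph, h, "S", "G"))   # ['S', 'B', 'F', 'G']
```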

In the above figure, the root node is A, and its next-level
successor nodes are B and C, with h(B) = 2 and h(C) = 4.
Our task is to explore the node with the lowest h(n) value.
So we select node B and expand it further to nodes D and
E. Again, we pick the node with the lowest h(n) value and
explore it further.

The performance measure of Best-first search
Algorithm:
 Completeness: Best-first search is incomplete, even in
a finite state space.
 Optimality: It does not provide an optimal solution.
 Time and Space complexity: It has O(b^m) worst-case
time and space complexity, where m is the maximum
depth of the search tree. If the quality of the heuristic
function is good, the complexities can be reduced
substantially.
Note: Best-first search combines the advantages of BFS
and DFS to find the best solution.

Disadvantages of Best-first search
 Best-first search does not guarantee reaching the goal
state.
 Since best-first search is a greedy approach, it does not
give an optimal solution.
 It may cover a long distance in some cases.
A* Search Algorithm
A* search is the most widely used informed search
algorithm, where a node n is evaluated by combining the
values of the functions g(n) and h(n). The function g(n) is
the path cost from the start/initial node to node n, and h(n)
is the estimated cost of the cheapest path from node n to
the goal node. Therefore, we have
f(n) = g(n) + h(n)
where f(n) is the estimated cost of the cheapest solution
through n.
So, in order to find the cheapest solution, we look for the
lowest values of f(n).

Let's see the example below to understand better.
In the above example, S is the root node and G is the goal
node. Starting from the root node S, we move towards its
successor nodes A and B. In order to reach the goal node
G, we calculate the f(n) values of nodes S, A, and B using
the evaluation equation, i.e.,
f(n) = g(n) + h(n)
Calculation of f(n) for node S:
f(S) = (distance from node S to S) + h(S) = 0 + 10 = 10
Calculation of f(n) for node A:
f(A) = (distance from node S to A) + h(A) = 2 + 12 = 14
Calculation of f(n) for node B:
f(B) = (distance from node S to B) + h(B) = 3 + 14 = 17
Therefore, node A has the lowest f(n) value. Hence, node A
is explored to its next-level nodes C and D, and the lowest
f(n) value is computed again. After calculating, the
sequence we get is S->A->D->G with f(n) = 13 (the lowest
value).
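A minimal illustrative sketch of A* in Python. The figure's exact numbers are not available, so the edge costs and heuristic values below are assumptions, chosen to be admissible and to reproduce the optimal path S->A->D->G with cost 13 from the worked example:

```python
import heapq

def a_star(graph, h, start, goal):
    """Expand the node with the lowest f(n) = g(n) + h(n)."""
    frontier = [(h[start], 0, start, [start])]    # (f, g, node, path)
    best_g = {start: 0}
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return g, path
        for child, step in graph.get(node, {}).items():
            new_g = g + step
            if new_g < best_g.get(child, float("inf")):
                best_g[child] = new_g
                heapq.heappush(
                    frontier, (new_g + h[child], new_g, child, path + [child]))
    return None

graph = {"S": {"A": 2, "B": 3}, "A": {"C": 4, "D": 5},
         "D": {"G": 6}, "B": {"G": 12}}
h = {"S": 10, "A": 9, "B": 11, "C": 8, "D": 5, "G": 0}   # admissible estimates
print(a_star(graph, h, "S", "G"))   # (13, ['S', 'A', 'D', 'G'])
```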
How to make A* search admissible to get an optimized
solution?
A* search finds an optimal solution when it uses an
admissible heuristic function h(n), i.e., one which never
overestimates the cost of solving the problem. In general, a
heuristic function can either underestimate or overestimate
the cost required to reach the goal node, but an admissible
heuristic function never overestimates that cost.
Underestimating the cost value means the cost we
assumed is less than the actual cost. Overestimating the
cost value means the cost we assumed is greater than the
actual cost.
Here, if h(n) is the actual cost and h'(n) is the estimated
cost, admissibility requires h'(n) ≤ h(n).
Note: An overestimated cost value may or may not lead to
an optimized solution, but an underestimated cost value
always leads to an optimized solution.
Let's understand with the help of an example:
Consider the search tree below, where the starting/initial
node is A and the goal node is E. We have different paths to
reach the goal node E, with their different heuristic costs
h(n) and path costs g(n). The actual heuristic cost
is h(n) = 18. Let's suppose two different estimated values:
h1'(n) = 12, which is an underestimated cost value
h2'(n) = 25, which is an overestimated cost value
So, when the cost value is overestimated, the search makes
no effort to look beyond the first path it finds. But if the h(n)
value is underestimated, the search keeps trying to improve
on the current path, which leads to a good optimal solution.
Note: Underestimation of h(n) leads to a better optimal
solution than overestimating the value.
The performance measure of A* search
 Completeness: A* search is complete; it guarantees
reaching the goal node.
 Optimality: An underestimated (admissible) cost will
always give an optimal solution.
 Space and time complexity: A* search has O(b^d)
space and time complexity.
Disadvantage of A* search
 A* often runs out of memory on large problems.

AO* search Algorithm
AO* search operates on a specialized graph based on
AND/OR operations. It is a problem-decomposition
strategy: a problem is decomposed into smaller pieces
which are solved separately to obtain the solution required
to reach the desired goal. Although A* search and AO*
search both follow best-first search order, they are
dissimilar from one another.
Let's understand how AO* works with the help of the
example below:
Here, the destination/goal is to eat some food. We have
two ways: either order food from a restaurant, or buy some
food ingredients and cook in order to eat. We can apply
either of the two ways; the choice depends on us. It is not
guaranteed whether the order will be delivered on time,
whether the food will be tasty, and so on. But if we
purchase the ingredients and cook, we will be more
satisfied.
Therefore, the AO* search provides two kinds of branches
to choose from: OR and AND. Here it is better to choose the
AND branch rather than the OR branch to get a good
optimal solution.
Heuristic Functions
As we have already seen, an informed search makes use
of heuristic functions in order to reach the goal node in a
more prominent way. There can be several pathways in a
search tree to reach the goal node from the current node,
so the selection of a good heuristic function certainly
matters. A good heuristic function is judged by its
efficiency: the more information about the problem it
encodes, the more guidance it gives the search, though
computing it may take more processing time.
Some toy problems, such as the 8-puzzle, 8-queens,
tic-tac-toe, etc., can be solved more efficiently with the
help of a heuristic function. Let's see how:

Consider the following 8-puzzle problem, where we have a
start state and a goal state. Our task is to slide the tiles of
the current/start state and place them in the order followed
in the goal state. There can be four moves: left, right, up,
or down. There can be several ways to convert the
current/start state into the goal state, but we can use a
heuristic function h(n) to solve the problem more efficiently.

A heuristic function for the 8-puzzle problem is defined
below:
h(n) = number of tiles out of position.
So, there are a total of three tiles out of position, i.e., 6, 5
and 4 (do not count the empty tile present in the goal
state), i.e., h(n) = 3. Now, we need to minimize the value
of h(n) to 0. We can construct a state-space tree that
minimizes the h(n) value to 0, as shown below:

It is seen from the above state-space tree that the goal state
is reached by minimizing h(n) from 3 to 0. However, we can
create and use several heuristic functions as per the
requirement. It is also clear from the above example that a
heuristic function h(n) can be defined as the information
required to solve a given problem more efficiently. The
information can be related to the nature of the state, the
cost of transforming from one state to another, goal node
characteristics, etc., which is expressed as a heuristic
function.
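The misplaced-tiles heuristic above is a one-liner in code. The start and goal tuples below are an assumed pair for which exactly the tiles 6, 5 and 4 are out of position:

```python
def h_misplaced(state, goal):
    """h(n) = number of tiles out of position (the blank, 0, not counted)."""
    return sum(1 for s, g in zip(state, goal) if s != 0 and s != g)

start = (1, 2, 3, 8, 6, 0, 7, 5, 4)   # assumed start state, row by row
goal  = (1, 2, 3, 8, 0, 4, 7, 6, 5)   # assumed goal state
print(h_misplaced(start, goal))        # 3: tiles 6, 5 and 4 are misplaced
```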
Properties of a Heuristic search Algorithm
The use of a heuristic function in a heuristic search
algorithm leads to the following properties:

 Admissible Condition: An algorithm is said to be
admissible if it returns an optimal solution.
 Completeness: An algorithm is said to be complete if
it terminates with a solution (if a solution exists).
 Dominance Property: If there are two admissible
heuristic algorithms A1 and A2, with heuristic functions
h1 and h2, then A1 is said to dominate A2 if h1 is better
than h2 for all values of node n.
 Optimality Property: If an algorithm is complete,
admissible, and dominates the other algorithms, it will
be the best one and will definitely give an optimal
solution.
 Difference between Intelligence and Artificial
Intelligence

Intelligence | Artificial Intelligence
It is a natural process or quality given to human beings. | It is programmed using human intelligence.
It is hereditary. | It is not hereditary but a copy of human intelligence.
A human brain does not require any electricity to show its intelligence. | Artificial intelligence requires electricity to produce an output.
The execution speed of a human brain is lower. | Its execution speed is higher than that of the human brain.
Human intelligence can handle different situations in a better way. | It is designed to handle only a few types of problems.
A human brain is analog. | An artificial brain is digital.
Difference between BFS and DFS

BFS | DFS
It stands for Breadth-first search. | It stands for Depth-first search.
It searches a node breadthwise, i.e., covering each level one by one. | It searches a node depthwise, i.e., covering one path deeply.
It uses a queue to store data in memory. | It uses a stack to store data in memory.
BFS is a vertex-based algorithm. | DFS is an edge-based algorithm.
The structure of a BFS tree is wide and short. | The structure of a DFS tree is narrow and long.
The oldest unexpanded node is explored first. | The nodes along an edge are explored first.
BFS is used to examine bipartite graphs, connected paths, and shortest paths present in a graph. | DFS is used to examine two-edge-connected graphs, acyclic graphs, and topological order.
Unit 3
Knowledge representation and reasoning (KR², KR&R) is the field of artificial intelligence (AI)
dedicated to representing information about the world in a form that a computer system can utilize to
solve complex tasks such as diagnosing a medical condition or having a dialog in a natural language.
Knowledge representation incorporates findings from psychology about how humans solve problems
and represent knowledge in order to design formalisms that will make complex systems easier to
design and build. Knowledge representation and reasoning also incorporates findings from logic to
automate various kinds of reasoning, such as the application of rules or the relations
of sets and subsets.
Examples of knowledge representation formalisms include semantic nets, systems
architecture, frames, rules, and ontologies. Examples of automated reasoning engines
include inference engines, theorem provers, and classifiers.
Human beings are good at understanding, reasoning and interpreting knowledge. And
using this knowledge, they are able to perform various actions in the real world. But how
do machines perform the same? In this article, we will learn about Knowledge
Representation in AI and how it helps the machines perform reasoning and interpretation
using Artificial Intelligence in the following sequence:

 What is Knowledge Representation?
 Different Types of Knowledge
 Cycle of Knowledge Representation
 What is the relation between Knowledge & Intelligence?
 Techniques of Knowledge Representation
 Representation Requirements
 Approaches to Knowledge Representation with Example

What is Knowledge Representation?


Knowledge Representation in AI describes the representation of knowledge. Basically, it
is a study of how the beliefs, intentions, and judgments of an intelligent agent can be
expressed suitably for automated reasoning. One of the primary purposes of Knowledge
Representation includes modeling intelligent behavior for an agent.

Knowledge Representation and Reasoning (KR, KRR) represents information from the
real world for a computer to understand and then utilize this knowledge to solve complex
real-life problems like communicating with human beings in natural language.
Knowledge representation in AI is not just about storing data in a database, it allows a
machine to learn from that knowledge and behave intelligently like a human being.

The different kinds of knowledge that need to be represented in AI include:

 Objects
 Events
 Performance
 Facts
 Meta-Knowledge
 Knowledge-base

Now that you know about Knowledge representation in AI, let’s move on and know about
the different types of Knowledge.

Different Types of Knowledge


There are 5 types of knowledge:

 Declarative Knowledge – It includes concepts, facts, and objects, expressed
in declarative sentences.
 Structural Knowledge – It is basic problem-solving knowledge that describes
the relationships between concepts and objects.
 Procedural Knowledge – This is responsible for knowing how to do something
and includes rules, strategies, procedures, etc.
 Meta Knowledge – Meta knowledge is knowledge about other types of
knowledge.
 Heuristic Knowledge – This represents expert knowledge in a field or
subject.

These are the important types of Knowledge Representation in AI. Now, let’s have a look
at the cycle of knowledge representation and how it works.

Cycle of Knowledge Representation in AI


Artificial Intelligent Systems usually consist of various components to display their
intelligent behavior. Some of these components include:

 Perception
 Learning
 Knowledge Representation & Reasoning
 Planning
 Execution

Here is an example to show the different components of the system and how it works:

Example

The above diagram shows the interaction of an AI system with the real world and
the components involved in showing intelligence.

 The Perception component retrieves data or information from the environment.
With the help of this component, you can retrieve data from the environment, find
the sources of noise, and check whether the AI has been damaged by anything. It
also defines how to respond when any sense has been detected.
 Then there is the Learning Component, which learns from the data captured by
the perception component. The goal is to build computers that can be taught
instead of programmed. Learning focuses on the process of self-improvement. In
order to learn new things, the system requires knowledge acquisition, inference,
acquisition of heuristics, faster searches, etc.
 The main component in the cycle is Knowledge Representation and
Reasoning which shows the human-like intelligence in the machines. Knowledge
representation is all about understanding intelligence. Instead of trying to
understand or build brains from the bottom up, its goal is to understand and build
intelligent behavior from the top-down and focus on what an agent needs to know
in order to behave intelligently. Also, it defines how automated reasoning
procedures can make this knowledge available as needed.
 The Planning and Execution components depend on the analysis of knowledge
representation and reasoning. Here, planning includes giving an initial state, finding
their preconditions and effects, and a sequence of actions to achieve a state in
which a particular goal holds. Now once the planning is completed, the final stage
is the execution of the entire process.

So, these are the different components of the cycle of Knowledge Representation in AI.
Now, let’s understand the relationship between knowledge and intelligence.

What is the Relation between Knowledge & Intelligence?


In the real world, knowledge plays a vital role in intelligence as well as creating artificial
intelligence. It demonstrates the intelligent behavior in AI agents or systems. It is
possible for an agent or system to act accurately on some input only when it has the
knowledge or experience about the input.

Let’s take an example to understand the relationship:

In this example, there is one decision-maker whose actions are justified by sensing the
environment and using knowledge. But, if we remove the knowledge part here, it will not
be able to display any intelligent behavior.
Reasoning:
Reasoning is the mental process of deriving logical conclusions and making predictions
from available knowledge, facts, and beliefs. Or we can say, "Reasoning is a way to infer
facts from existing data." It is a general process of thinking rationally to find valid
conclusions.

In artificial intelligence, reasoning is essential so that the machine can think rationally
like a human brain and perform like a human.

Types of Reasoning
In artificial intelligence, reasoning can be divided into the following categories:

o Deductive reasoning
o Inductive reasoning
o Abductive reasoning
o Common Sense Reasoning
o Monotonic Reasoning
o Non-monotonic Reasoning

Note: Inductive and deductive reasoning are the forms of propositional logic.

1. Deductive reasoning:
Deductive reasoning is deducing new information from logically related known information. It
is a form of valid reasoning, which means the argument's conclusion must be true when the
premises are true.

Deductive reasoning is a type of propositional logic in AI, and it requires various rules and
facts. It is sometimes referred to as top-down reasoning, in contrast to inductive reasoning.

In deductive reasoning, the truth of the premises guarantees the truth of the conclusion.

Deductive reasoning mostly proceeds from general premises to a specific conclusion, as in
the example below.

Example:

Premise-1: All humans eat veggies.

Premise-2: Suresh is human.

Conclusion: Suresh eats veggies.

The general process of deductive reasoning is given below:

2. Inductive Reasoning:
Inductive reasoning is a form of reasoning that arrives at a conclusion from limited sets of
facts by the process of generalization. It starts with a series of specific facts or data and
reaches a general statement or conclusion.

Inductive reasoning is a type of propositional logic, which is also known as cause-effect
reasoning or bottom-up reasoning.

In inductive reasoning, we use historical data or various premises to generate a generic rule,
for which the premises support the conclusion.

In inductive reasoning, the premises provide only probable support for the conclusion, so the
truth of the premises does not guarantee the truth of the conclusion.

Example:

Premise: All of the pigeons we have seen in the zoo are white.

Conclusion: Therefore, we can expect all the pigeons to be white.

3. Abductive reasoning:
Abductive reasoning is a form of logical reasoning which starts with single or multiple
observations and then seeks to find the most likely explanation or conclusion for the
observations.

Abductive reasoning is an extension of deductive reasoning, but in abductive reasoning the
premises do not guarantee the conclusion.

Example:

Implication: The cricket ground is wet if it is raining.

Axiom: The cricket ground is wet.

Conclusion: It is raining.

4. Common Sense Reasoning


Common sense reasoning is an informal form of reasoning, which can be gained through
experience.

Common sense reasoning simulates the human ability to make presumptions about events
that occur every day.

It relies on good judgment rather than exact logic and operates on heuristic
knowledge and heuristic rules.

Example:

1. One person can be at only one place at a time.
2. If I put my hand in a fire, then it will burn.

The above two statements are examples of common sense reasoning, which a human mind
can easily understand and assume.

5. Monotonic Reasoning:

In monotonic reasoning, once a conclusion is drawn, it will remain the same even if we
add more information to the existing information in our knowledge base. In monotonic
reasoning, adding knowledge does not decrease the set of propositions that can be derived.

To solve monotonic problems, we can derive valid conclusions from the available facts only,
and they will not be affected by new facts.

Monotonic reasoning is not useful for real-time systems because, in real time, facts change,
so we cannot use monotonic reasoning.

Monotonic reasoning is used in conventional reasoning systems, and logic-based systems are
monotonic.

Any theorem proving is an example of monotonic reasoning.

Example:

o Earth revolves around the Sun.

It is a true fact, and it cannot be changed even if we add other sentences to the knowledge
base, like "The moon revolves around the earth" or "The Earth is not round," etc.

Advantages of Monotonic Reasoning:


o In monotonic reasoning, each old proof will always remain valid.
o If we deduce some facts from the available facts, they will remain valid forever.

Disadvantages of Monotonic Reasoning:

o We cannot represent real-world scenarios using monotonic reasoning.
o Hypothetical knowledge cannot be expressed with monotonic reasoning, which means
facts must be true.
o Since we can only derive conclusions from the old proofs, new knowledge from the
real world cannot be added.

6. Non-monotonic Reasoning
In Non-monotonic reasoning, some conclusions may be invalidated if we add some more
information to our knowledge base.

Logic will be said as non-monotonic if some conclusions can be invalidated by adding more
knowledge into our knowledge base.

Non-monotonic reasoning deals with incomplete and uncertain models.

"Human perceptions for various things in daily life, "is a general example of non-monotonic
reasoning.
Example: Suppose the knowledge base contains the following knowledge:

o Birds can fly


o Penguins cannot fly
o Pitty is a bird

So from the above sentences, we can conclude that Pitty can fly.

However, if we add another sentence to the knowledge base, "Pitty is a penguin", which
concludes "Pitty cannot fly", it invalidates the above conclusion.

Advantages of Non-monotonic reasoning:


o For real-world systems such as Robot navigation, we can use non-monotonic
reasoning.
o In Non-monotonic reasoning, we can choose probabilistic facts or can make
assumptions.

Disadvantages of Non-monotonic Reasoning:


o In non-monotonic reasoning, old facts may be invalidated by adding new
sentences.
o It cannot be used for theorem proving.

Inference in First-Order Logic

Inference in first-order logic is used to deduce new facts or sentences from existing
sentences. Before understanding the FOL inference rules, let's understand some basic
terminology used in FOL.

Substitution:

Substitution is a fundamental operation performed on terms and formulas. It occurs in all
inference systems in first-order logic. Substitution is complex in the presence of
quantifiers in FOL. If we write F[a/x], it refers to substituting the constant "a" in place of
the variable "x".

Equality:

First-Order logic does not only use predicate and terms for making atomic sentences but also
uses another way, which is equality in FOL. For this, we can use equality symbols which
specify that the two terms refer to the same object.

Example: Brother(John) = Smith.

As in the above example, the object referred to by Brother(John) is the same as the object
referred to by Smith. The equality symbol can also be used with negation to represent that
two terms are not the same object.

Example: ¬(x = y), which is equivalent to x ≠ y.

FOL inference rules for quantifier:


As in propositional logic, we also have inference rules in first-order logic. The following are
some basic inference rules in FOL:

o Universal Generalization
o Universal Instantiation
o Existential Instantiation
o Existential introduction

1. Universal Generalization:

o Universal generalization is a valid inference rule which states that if premise P(c) is
true for any arbitrary element c in the universe of discourse, then we can have the
conclusion ∀x P(x).

o It can be represented as: from P(c), for an arbitrary c, infer ∀x P(x).

o This rule can be used if we want to show that every element has a similar property.
o In this rule, x must not appear as a free variable.

Example: Let's represent P(c): "A byte contains 8 bits"; then ∀x P(x), "All bytes
contain 8 bits", will also be true.

2. Universal Instantiation:

o Universal instantiation, also called universal elimination or UI, is a valid inference
rule. It can be applied multiple times to add new sentences.
o The new KB is logically equivalent to the previous KB.
o As per UI, we can infer any sentence obtained by substituting a ground term
for the variable.
o The UI rule states that we can infer any sentence P(c) by substituting a ground term c
(a constant within domain x) from ∀x P(x) for any object in the universe of
discourse.

o It can be represented as: from ∀x P(x), infer P(c).

Example 1:

If "Every person likes ice-cream" => ∀x P(x), then we can infer that

"John likes ice-cream" => P(John).

Example: 2.

Let's take a famous example,

"All kings who are greedy are Evil." So let our knowledge base contains this detail as in the
form of FOL:

∀x King(x) ∧ Greedy(x) → Evil(x),

So from this information, we can infer any of the following statements using Universal
Instantiation:

o King(John) ∧ Greedy (John) → Evil (John),


o King(Richard) ∧ Greedy (Richard) → Evil (Richard),
o King(Father(John)) ∧ Greedy (Father(John)) → Evil (Father(John)),

3. Existential Instantiation:

o Existential instantiation, also called existential elimination, is a valid
inference rule in first-order logic.
o It can be applied only once to replace the existential sentence.
o The new KB is not logically equivalent to the old KB, but it will be satisfiable if the old
KB was satisfiable.
o This rule states that one can infer P(c) from a formula of the form ∃x P(x)
for a new constant symbol c.
o The restriction with this rule is that the c used in the rule must be a new term for
which P(c) is true.

o It can be represented as: from ∃x P(x), infer P(c) for a fresh constant symbol c.

Example:

From the given sentence: ∃x Crown(x) ∧ OnHead(x, John),

So we can infer: Crown(K) ∧ OnHead( K, John), as long as K does not appear in the
knowledge base.

o The above used K is a constant symbol, which is called Skolem constant.
o The Existential instantiation is a special case of Skolemization process.

4. Existential introduction

o Existential introduction, also known as existential generalization, is a
valid inference rule in first-order logic.
o This rule states that if there is some element c in the universe of discourse which has
property P, then we can infer that there exists something in the universe which has the
property P.

o It can be represented as: from P(c), infer ∃x P(x).

o Example: Let's say that,
"Priyanka got good marks in English."
"Therefore, someone got good marks in English."

Propositional logic in Artificial intelligence

Propositional logic (PL) is the simplest form of logic, where all statements are made of
propositions. A proposition is a declarative statement which is either true or false. It is a
technique of knowledge representation in logical and mathematical form.

Example:
a) It is Sunday.
b) The Sun rises from the West. (false proposition)
c) 3 + 3 = 7. (false proposition)
d) 5 is a prime number.

Following are some basic facts about propositional logic:

o Propositional logic is also called Boolean logic as it works on 0 and 1.
o In propositional logic, we use symbolic variables to represent the logic, and we can use
any symbol to represent a proposition, such as A, B, C, P, Q, R, etc.
o Propositions can be either true or false, but it cannot be both.
o Propositional logic consists of an object, relations or function, and logical
connectives.
o These connectives are also called logical operators.
o The propositions and connectives are the basic elements of the propositional logic.
o Connectives can be said as a logical operator which connects two sentences.

o A proposition formula which is always true is called a tautology; it is also called a
valid sentence.
o A proposition formula which is always false is called a contradiction.
o A proposition formula which has both true and false values is called a contingency.
o Statements which are questions, commands, or opinions, such as
"Where is Rohini?", "How are you?", and "What is your name?", are not propositions.

Syntax of propositional logic:


The syntax of propositional logic defines the allowable sentences for the knowledge
representation. There are two types of Propositions:

a. Atomic Propositions
b. Compound propositions

o Atomic Proposition: Atomic propositions are the simple propositions. Each consists
of a single proposition symbol. These are the sentences which must be either true or false.

Example:

a) 2 + 2 is 4; it is an atomic proposition, as it is a true fact.
b) "The Sun is cold" is also a proposition, as it is a false fact.
o Compound proposition: Compound propositions are constructed by combining
simpler or atomic propositions, using parenthesis and logical connectives.

Example:

a) "It is raining today, and the street is wet."
b) "Ankit is a doctor, and his clinic is in Mumbai."
Logical Connectives:
Logical connectives are used to connect two simpler propositions or representing a sentence
logically. We can create compound propositions with the help of logical connectives. There are
mainly five connectives, which are given as follows:

1. Negation: A sentence such as ¬P is called the negation of P. A literal can be either
a positive literal or a negative literal.
2. Conjunction: A sentence which has the ∧ connective, such as P ∧ Q, is called a
conjunction.
Example: Rohan is intelligent and hardworking. It can be written as
P = Rohan is intelligent,
Q = Rohan is hardworking → P ∧ Q.
3. Disjunction: A sentence which has the ∨ connective, such as P ∨ Q, is called a
disjunction, where P and Q are propositions.
Example: "Ritika is a doctor or an engineer."
Here P = Ritika is a doctor and Q = Ritika is an engineer, so we can write it as P ∨ Q.
4. Implication: A sentence such as P → Q is called an implication. Implications are also
known as if-then rules, for example:
If it is raining, then the street is wet.
Let P = It is raining and Q = The street is wet; it is represented as P → Q.
5. Biconditional: A sentence such as P ⇔ Q is a biconditional sentence, for example: If I
am breathing, then I am alive.
P = I am breathing, Q = I am alive; it can be represented as P ⇔ Q.

Following is the summarized table for Propositional Logic


Connectives:

Truth Table:
In propositional logic, we need to know the truth values of propositions in all possible
scenarios. We can combine all the possible combination with logical connectives, and the
representation of these combinations in a tabular format is called Truth table. Following are
the truth table for all logical connectives:

Truth table with three propositions:

We can build a proposition composed of three propositions P, Q, and R. The resulting truth
table is made up of 2^3 = 8 rows, as we have taken three proposition symbols.
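The enumeration behind a truth table is easy to mechanize. The following illustrative Python sketch prints the table for any binary connective; for three propositions, repeat=3 would give the 2^3 = 8 rows:

```python
from itertools import product

def truth_table(name, connective):
    """Print the truth table of a binary connective over P and Q."""
    print(f"P      Q      {name}")
    for p, q in product([True, False], repeat=2):
        print(f"{str(p):6} {str(q):6} {connective(p, q)}")

# Implication uses the standard identity: P -> Q is (not P) or Q.
truth_table("P -> Q", lambda p, q: (not p) or q)
truth_table("P and Q", lambda p, q: p and q)
```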

Precedence of connectives:
Just like arithmetic operators, there is a precedence order for propositional connectors or
logical operators. This order should be followed while evaluating a propositional problem.
Following is the list of the precedence order for operators:

Precedence Operators

First Precedence Parenthesis

Second Precedence Negation

Third Precedence Conjunction (AND)

Fourth Precedence Disjunction (OR)

Fifth Precedence Implication

Sixth Precedence Biconditional

Logical equivalence:
Logical equivalence is one of the features of propositional logic. Two propositions are said to
be logically equivalent if and only if the columns in the truth table are identical to each other.

Let's take two propositions A and B; for logical equivalence, we can write A ⇔ B. In the
truth table below, we can see that the columns for ¬A ∨ B and A → B are identical; hence
they are logically equivalent.
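An equivalence like this can be checked mechanically by comparing the two truth-table columns row by row; a short illustrative sketch for ¬A ∨ B versus A → B:

```python
from itertools import product

# Truth-functional definition of implication, row by row.
IMPLIES = {(True, True): True, (True, False): False,
           (False, True): True, (False, False): True}

def equivalent(f, g):
    """Two formulas are equivalent iff their truth-table columns match."""
    return all(f(a, b) == g(a, b) for a, b in product([True, False], repeat=2))

not_a_or_b = lambda a, b: (not a) or b
a_implies_b = lambda a, b: IMPLIES[(a, b)]
print(equivalent(not_a_or_b, a_implies_b))   # True: the columns are identical
```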

Properties of Operators:
o Commutativity:
o P ∧ Q = Q ∧ P
o P ∨ Q = Q ∨ P
o Associativity:
o (P ∧ Q) ∧ R = P ∧ (Q ∧ R)
o (P ∨ Q) ∨ R = P ∨ (Q ∨ R)
o Identity element:
o P ∧ True = P
o P ∨ True = True
o Distributivity:
o P ∧ (Q ∨ R) = (P ∧ Q) ∨ (P ∧ R)
o P ∨ (Q ∧ R) = (P ∨ Q) ∧ (P ∨ R)
o De Morgan's Law:
o ¬(P ∧ Q) = (¬P) ∨ (¬Q)
o ¬(P ∨ Q) = (¬P) ∧ (¬Q)
o Double-negation elimination:
o ¬(¬P) = P

Limitations of Propositional logic:

o We cannot represent relations like ALL, some, or none with propositional logic.
Example:
a. All the girls are intelligent.
b. Some apples are sweet.
o Propositional logic has limited expressive power.
o In propositional logic, we cannot describe statements in terms of their properties or
logical relationships.

First-Order Logic in Artificial intelligence
In the topic of propositional logic, we have seen how to represent statements using
propositional logic. But unfortunately, in propositional logic, we can only represent facts
which are either true or false. PL is not sufficient to represent complex sentences or
natural-language statements; propositional logic has very limited expressive power.
Consider the following sentences, which we cannot represent using PL:

o "Some humans are intelligent", or
o "Sachin likes cricket."

To represent the above statements, PL is not sufficient, so we require a more
powerful logic, such as first-order logic.

First-Order logic:
o First-order logic is another way of knowledge representation in artificial intelligence. It
is an extension of propositional logic.
o FOL is sufficiently expressive to represent natural-language statements in a concise
way.
o First-order logic is also known as predicate logic or first-order predicate logic.
First-order logic is a powerful language that expresses information about objects in
an easier way and can also express the relationships between those objects.
o First-order logic (like natural language) assumes that the world contains not only
facts, as propositional logic does, but also the following things:
o Objects: A, B, people, numbers, colors, wars, theories, squares, pits, wumpus,
......
o Relations: These can be unary relations, such as red, round, is adjacent, or
n-ary relations, such as the sister of, brother of, has color, comes between.
o Functions: father of, best friend, third inning of, end of, ......
o As a natural language, first-order logic also has two main parts:

a. Syntax
b. Semantics

Syntax of First-Order logic:

The syntax of FOL determines which collection of symbols is a logical expression in first-order
logic. The basic syntactic elements of first-order logic are symbols. We write statements in
short-hand notation in FOL.

Basic Elements of First-order logic:


Following are the basic elements of FOL syntax:

Constant 1, 2, A, John, Mumbai, cat,....

Variables x, y, z, a, b,....

Predicates Brother, Father, >,....

Function sqrt, LeftLegOf, ....

Connectives ∧, ∨, ¬, ⇒, ⇔

Equality =

Quantifier ∀, ∃

Atomic sentences:
o Atomic sentences are the most basic sentences of first-order logic. These sentences are
formed from a predicate symbol followed by a parenthesis with a sequence of terms.
o We can represent atomic sentences as Predicate (term1, term2, ......, term n).

Example: Ravi and Ajay are brothers: => Brothers(Ravi, Ajay).


Chinky is a cat: => cat (Chinky).

Complex Sentences:
o Complex sentences are made by combining atomic sentences using connectives.

First-order logic statements can be divided into two parts:

o Subject: Subject is the main part of the statement.


o Predicate: A predicate can be defined as a relation, which binds two atoms together in
a statement.

Consider the statement: "x is an integer.", it consists of two parts, the first part x is the
subject of the statement and second part "is an integer," is known as a predicate.

Quantifiers in First-order logic:
o A quantifier is a language element which generates quantification, and quantification
specifies the quantity of specimen in the universe of discourse.
o These are the symbols that permit to determine or identify the range and scope of the
variable in the logical expression. There are two types of quantifier:

a. Universal Quantifier, (for all, everyone, everything)


b. Existential quantifier, (for some, at least one).

Universal Quantifier:
Universal quantifier is a symbol of logical representation, which specifies that the statement
within its range is true for everything or every instance of a particular thing.

The Universal quantifier is represented by a symbol ∀, which resembles an inverted A.

Note: In universal quantifier we use implication "→".

If x is a variable, then ∀x is read as:

o For all x
o For each x
o For every x.

Example:
All men drink coffee.

Let x be a variable that refers to a man; then all x can be represented in the UOD as below:

∀x man(x) → drink (x, coffee).

It will be read as: For all x, if x is a man, then x drinks coffee.

Existential Quantifier:
Existential quantifiers are the type of quantifiers, which express that the statement within its
scope is true for at least one instance of something.

It is denoted by the logical operator ∃, which resembles a reversed E. When it is used with a
predicate variable, then it is called an existential quantifier.

Note: In Existential quantifier we always use AND or Conjunction symbol (∧).

If x is a variable, then existential quantifier will be ∃x or ∃(x). And it will be read as:

o There exists a 'x.'


o For some 'x.'
o For at least one 'x.'

Example:
Some boys are intelligent.

∃x: boys(x) ∧ intelligent(x)

It will be read as: There is some x such that x is a boy and x is intelligent.

Points to remember:
o The main connective for universal quantifier ∀ is implication →.
o The main connective for existential quantifier ∃ is and ∧.

Properties of Quantifiers:
o In universal quantifier, ∀x∀y is similar to ∀y∀x.
o In Existential quantifier, ∃x∃y is similar to ∃y∃x.
o ∃x∀y is not similar to ∀y∃x.

Some Examples of FOL using quantifier:

1. All birds fly.


In this question the predicate is "fly(bird)."
And since there are all birds who fly so it will be represented as follows.
∀x bird(x) →fly(x).

2. Every man respects his parent.


In this question, the predicate is "respect(x, y)," where x=man, and y= parent.
Since there is every man so will use ∀, and it will be represented as follows:
∀x man(x) → respects (x, parent).
3. Some boys play cricket.
In this question, the predicate is "play(x, y)," where x = boys, and y = game. Since there are
some boys, we will use ∃, and (since the main connective for ∃ is ∧) it will be represented as:
∃x boys(x) ∧ play(x, cricket).

4. Not all students like both Mathematics and Science.


In this question, the predicate is "like(x, y)," where x= student, and y= subject.
Since there are not all students, so we will use ∀ with negation, so following representation
for this:
¬∀ (x) [ student(x) → like(x, Mathematics) ∧ like(x, Science)].

5. Only one student failed in Mathematics.


In this question, the predicate is "failed(x, y)," where x= student, and y= subject.
Since there is only one student who failed in Mathematics, so we will use following
representation for this:
∃(x) [ student(x) ∧ failed (x, Mathematics) ∧ ∀(y) [¬(x = y) ∧ student(y) →
¬failed (y, Mathematics)]].
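Over a finite universe of discourse, such quantified formulas can be checked directly: ∀ corresponds to Python's all() and ∃ to any(). The sketch below (an illustration with made-up facts, not from the source) evaluates example 1 and example 3:

domain = ["ravi", "ajay", "chinky"]
bird = {"chinky"}                # bird(x) facts
flies = {"chinky"}               # fly(x) facts
boy = {"ravi", "ajay"}           # boys(x) facts
plays_cricket = {"ajay"}         # play(x, cricket) facts

# ∀x bird(x) → fly(x): the implication holds vacuously for non-birds
all_birds_fly = all((x not in bird) or (x in flies) for x in domain)

# ∃x boys(x) ∧ play(x, cricket): conjunction, per the points to remember
some_boys_play = any((x in boy) and (x in plays_cricket) for x in domain)

print(all_birds_fly, some_boys_play)   # True True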

Free and Bound Variables:


The quantifiers interact with variables which appear in a suitable way. There are two types of
variables in First-order logic which are given below:

Free Variable: A variable is said to be a free variable in a formula if it occurs outside the
scope of the quantifier.

Example: ∀x ∃(y)[P (x, y, z)], where z is a free variable.

Bound Variable: A variable is said to be a bound variable in a formula if it occurs within the
scope of the quantifier.

Example: ∀x ∀y [A(x) ∧ B(y)], here x and y are the bound variables.

Forward Chaining and Backward Chaining in AI

In artificial intelligence, forward and backward chaining are important inference topics, but
before understanding forward and backward chaining, let us first understand where these
two terms come from.

Inference engine:
The inference engine is the component of the intelligent system in artificial intelligence,
which applies logical rules to the knowledge base to infer new information from known
facts. The first inference engine was part of the expert system. Inference engine commonly
proceeds in two modes, which are:

a. Forward chaining
b. Backward chaining

Horn Clause and Definite clause:

Horn clause and definite clause are the forms of sentences, which enables knowledge base
to use a more restricted and efficient inference algorithm. Logical inference algorithms use
forward and backward chaining approaches, which require KB in the form of the first-
order definite clause.

Definite clause: A clause which is a disjunction of literals with exactly one positive
literal is known as a definite clause or strict horn clause.

Horn clause: A clause which is a disjunction of literals with at most one positive
literal is known as horn clause. Hence all the definite clauses are horn clauses.

Example: (¬p ∨ ¬q ∨ k). It has only one positive literal, k.

It is equivalent to p ∧ q → k.
A. Forward Chaining
Forward chaining is also known as a forward deduction or forward reasoning method when
using an inference engine. Forward chaining is a form of reasoning which start with atomic
sentences in the knowledge base and applies inference rules (Modus Ponens) in the forward
direction to extract more data until a goal is reached.

The Forward-chaining algorithm starts from known facts, triggers all rules whose premises
are satisfied, and add their conclusion to the known facts. This process repeats until the
problem is solved.

Properties of Forward-Chaining:

o It is a bottom-up approach, as it moves from the facts at the bottom up to the goal.

o It is a process of making conclusions based on known facts or data, starting
from the initial state and reaching the goal state.
o The forward-chaining approach is also called data-driven, as we reach the goal
using the available data.
o The forward-chaining approach is commonly used in expert systems, such as CLIPS,
and in business and production rule systems.

Consider the following famous example which we will use in both approaches:

Example:

"As per the law, it is a crime for an American to sell weapons to hostile nations.
Country A, an enemy of America, has some missiles, and all the missiles were sold
to it by Robert, who is an American citizen."

Prove that "Robert is criminal."

To solve the above problem, first, we will convert all the above facts into first-order definite
clauses, and then we will use a forward-chaining algorithm to reach the goal.

Facts Conversion into FOL:


o It is a crime for an American to sell weapons to hostile nations. (Let's say p, q, and
r are variables)
American (p) ∧ weapon(q) ∧ sells (p, q, r) ∧ hostile(r) → Criminal(p)
...(1)
o Country A has some missiles. ∃p Owns(A, p) ∧ Missile(p). It can be written in two
definite clauses by using Existential Instantiation, introducing the new constant T1.
Owns(A, T1) ......(2)
Missile(T1) .......(3)
o All of the missiles were sold to country A by Robert.
∀p Missile(p) ∧ Owns(A, p) → Sells(Robert, p, A) ......(4)
o Missiles are weapons.
Missile(p) → Weapons (p) .......(5)
o Enemy of America is known as hostile.
Enemy(p, America) →Hostile(p) ........(6)
o Country A is an enemy of America.
Enemy (A, America) .........(7)
o Robert is American
American(Robert). ..........(8)

Forward chaining proof:


Step-1:

In the first step we will start with the known facts and will choose the sentences which do
not have implications, such as: American(Robert), Enemy(A, America), Owns(A, T1),
and Missile(T1). All these facts will be represented as below.

Step-2:

At the second step, we will see those facts which infer from available facts and with
satisfied premises.

The premises of Rule-(1) are not satisfied yet, so it will not be added in the first iteration.

Rule-(2) and (3) are already added.

Rule-(4) is satisfied with the substitution {p/T1}, so Sells(Robert, T1, A) is added; it
is inferred from the conjunction of Rule-(2) and Rule-(3).

Rule-(6) is satisfied with the substitution {p/A}, so Hostile(A) is added; it is inferred
from Rule-(7).

Step-3:

At step-3, as we can check, Rule-(1) is satisfied with the substitution {p/Robert, q/T1,
r/A}, so we can add Criminal(Robert), which is inferred from all the available facts. And
hence we have reached our goal statement.

Hence it is proved that Robert is a criminal using the forward-chaining approach.
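The same proof can be run mechanically. The following minimal Python sketch (an illustration, not from the source; predicate arguments are folded into fact strings for brevity) forward-chains over the ground versions of the rules above:

facts = {"American(Robert)", "Missile(T1)", "Owns(A,T1)", "Enemy(A,America)"}

rules = [
    ({"Missile(T1)"}, "Weapon(T1)"),                                  # Rule (5)
    ({"Missile(T1)", "Owns(A,T1)"}, "Sells(Robert,T1,A)"),            # Rule (4)
    ({"Enemy(A,America)"}, "Hostile(A)"),                             # Rule (6)
    ({"American(Robert)", "Weapon(T1)", "Sells(Robert,T1,A)",
      "Hostile(A)"}, "Criminal(Robert)"),                             # Rule (1)
]

# Repeatedly fire every rule whose premises are all known, until nothing new is added.
changed = True
while changed:
    changed = False
    for premises, conclusion in rules:
        if premises <= facts and conclusion not in facts:
            facts.add(conclusion)
            changed = True

print("Criminal(Robert)" in facts)   # True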

B. Backward Chaining:
Backward-chaining is also known as a backward deduction or backward reasoning method
when using an inference engine. A backward chaining algorithm is a form of reasoning,
which starts with the goal and works backward, chaining through rules to find known facts
that support the goal.

Properties of backward chaining:

o It is known as a top-down approach.


o Backward-chaining is based on modus ponens inference rule.
o In backward chaining, the goal is broken into sub-goal or sub-goals to prove the
facts true.
o It is called a goal-driven approach, as a list of goals decides which rules are selected
and used.
o Backward -chaining algorithm is used in game theory, automated theorem proving
tools, inference engines, proof assistants, and various AI applications.

o The backward-chaining method mostly uses a depth-first search strategy for proof.

Example:

In backward-chaining, we will use the same above example, and will rewrite all the rules.

o American (p) ∧ weapon(q) ∧ sells (p, q, r) ∧ hostile(r) → Criminal(p) ...(1)


o Owns(A, T1) ......(2)
o Missile(T1) .......(3)
o ∀p Missile(p) ∧ Owns(A, p) → Sells(Robert, p, A) ......(4)
o Missile(p) → Weapons (p) .......(5)
o Enemy(p, America) →Hostile(p) ........(6)
o Enemy (A, America) .........(7)
o American(Robert). ..........(8)

Backward-Chaining proof:
In Backward chaining, we will start with our goal predicate, which is Criminal(Robert),
and then infer further rules.

Step-1:

At the first step, we will take the goal fact. And from the goal fact, we will infer other facts,
and at last, we will prove those facts true. So our goal fact is "Robert is Criminal," so
following is the predicate of it.

Step-2:

At the second step, we will infer other facts from the goal fact which satisfy the rules. As
we can see in Rule-(1), the goal predicate Criminal(Robert) is present with the substitution
{p/Robert}. So we will add all the conjunctive facts below the first level and will replace p
with Robert.

Here we can see American (Robert) is a fact, so it is proved here.


Step-3: At step-3, we will extract the further fact Missile(q), which is inferred from Weapon(q),
as it satisfies Rule-(5). Weapon(q) is also true with the substitution of the constant T1 for q.

Step-4:

At step-4, we can infer the facts Missile(T1) and Owns(A, T1) from Sells(Robert, T1, r), which
satisfies Rule-(4) with the substitution of A in place of r. So these two statements are
proved here.

Step-5:

At step-5, we can infer the fact Enemy(A, America) from Hostile(A) which satisfies
Rule- 6. And hence all the statements are proved true using backward chaining.
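A compact backward-chaining sketch over the same ground rules (again illustrative, with predicate arguments folded into fact strings) recursively reduces the goal to sub-goals, depth first:

facts = {"American(Robert)", "Missile(T1)", "Owns(A,T1)", "Enemy(A,America)"}

rules = {
    "Weapon(T1)": [{"Missile(T1)"}],
    "Sells(Robert,T1,A)": [{"Missile(T1)", "Owns(A,T1)"}],
    "Hostile(A)": [{"Enemy(A,America)"}],
    "Criminal(Robert)": [{"American(Robert)", "Weapon(T1)",
                          "Sells(Robert,T1,A)", "Hostile(A)"}],
}

def prove(goal):
    # A goal is proved if it is a known fact, or if every premise of some
    # rule concluding the goal can itself be proved (depth-first search).
    if goal in facts:
        return True
    return any(all(prove(p) for p in premises)
               for premises in rules.get(goal, []))

print(prove("Criminal(Robert)"))   # True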

UNIT 4
PLANNING AND MACHINE LEARNING

4.1 Planning With State Space Search

The agent first formulates a goal to achieve and then constructs a plan to achieve it from
the current state.

Problem Solving To Planning


Representation Using Problem Solving Approach

 Forward search

 Backward search

 Heuristic search

Representation Using Planning Approach


 STRIPS – Stanford Research Institute Problem Solver.

 Representation for states and goals

 Representation for plans

 Situation space and plan space

 Solutions

Why Planning?

Intelligent agents must operate in the world. They are not simply passive reasoners (knowledge
representation, reasoning under uncertainty) or problem solvers (search); they must also act on
the world.

We want intelligent agents to act in "intelligent ways": taking purposeful actions, predicting the
expected effects of such actions, and composing actions together to achieve complex goals. For
example, if we have a robot, we want the robot to decide what to do and how to act to achieve
our goals.

Planning Problem

How to change the world to suit our needs

Critical issue: we need to reason about what the world will be like after doing a few actions, not
just what it is like now

GOAL: Craig has coffee

CURRENTLY: robot in mailroom, has no coffee, coffee not made, Craig in office etc.

TO DO: goto lounge, make coffee

Partial Order Plan


 A partially ordered collection of steps

o Start step has the initial state description and its effect

o Finish step has the goal description as its precondition

o Causal links from outcome of one step to precondition of another step

o Temporal ordering between pairs of steps

 An open condition is a precondition of a step not yet causally linked

 A plan is complete if every precondition is achieved

 A precondition is achieved if it is the effect of an earlier step and no possibly intervening


step undoes it
Start

Right Sock

Right Shoe

Left Sock

Left Shoe

Finish

Partial Order Plan Algorithm

4.2 Stanford Research Institute Problem Solver (STRIPS)

STRIPS is a classical planning language, representing plan components as states, goals, and
actions, allowing algorithms to parse the logical structure of the planning problem to provide a
solution.

In STRIPS, state is represented as a conjunction of positive literals. Positive literals may be a


propositional literal (e.g., Big ∧ Tall) or a first-order literal (e.g., At(Billy, Desk)). The positive
literals must be grounded – may not contain a variable (e.g., At(x, Desk)) – and must be
function-free – may not invoke a function to calculate a value (e.g., At(Father(Billy), Desk)).
Any state conditions that are not mentioned are assumed false.

The goal is also represented as a conjunction of positive, ground literals. A state satisfies a goal
if the state contains all of the conjuncted literals in the goal; e.g., Stacked ∧ Ordered ∧ Purchased
satisfies Ordered ∧ Stacked.

Actions (or operators) are defined by action schemas, each consisting of three parts:

 The action name and any parameters.


 Preconditions which must hold before the action can be executed. Preconditions are
represented as a conjunction of function-free, positive literals. Any variables in a
precondition must appear in the action's parameter list.
 Effects which describe how the state of the environment changes when the action is
executed. Effects are represented as a conjunction of function-free literals. Any
variables in an effect must appear in the action's parameter list. Any world state not
explicitly impacted by the action schema's effect is assumed to remain unchanged.

The following, simple action schema describes the action of moving a box from location x to
location y:

Action: MoveBox(x, y)
Precond: BoxAt(x)
Effect: BoxAt(y), ¬ BoxAt(x)

If an action is applied, but the current state of the system does not meet the necessary
preconditions, then the action has no effect. But if an action is successfully applied, then any
positive literals, in the effect, are added to the current state of the world; correspondingly, any
negative literals, in the effect, result in the removal of the corresponding positive literals from the
state of the world.

For example, in the action schema above, the effect would result in the proposition BoxAt(y)
being added to the known state of the world, while BoxAt(x) would be removed from the known
state of the world. (Recall that state only includes positive literals, so a negation effect results in
the removal of positive literals.) Note also that positive effects can not get duplicated in state;
likewise, a negative of a proposition that is not currently in state is simply ignored. For example,
if Open(x) was not previously part of the state, ¬ Open(x) would have no effect.
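This add/delete semantics reduces to simple set operations when a state is represented as a set of positive ground literals. The sketch below (an illustration added here, not the original STRIPS code) applies the grounded MoveBox(A, B) action:

def apply_action(state, preconds, add_effects, del_effects):
    # If the preconditions are not met, the action has no effect
    if not preconds <= state:
        return state
    # Negative effects delete positive literals; positive effects are added
    # (adding a literal already present simply leaves the set unchanged)
    return (state - del_effects) | add_effects

# MoveBox(A, B): Precond BoxAt(A); Effect BoxAt(B), ¬BoxAt(A)
state = {"BoxAt(A)"}
state = apply_action(state,
                     preconds={"BoxAt(A)"},
                     add_effects={"BoxAt(B)"},
                     del_effects={"BoxAt(A)"})
print(state)   # {'BoxAt(B)'}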

A STRIPS problem includes the complete (but relevant) initial state of the world, the goal
state(s), and action schemas. A STRIPS algorithm should then be able to accept such a problem,
returning a solution. The solution is simply an action sequence that, when applied to the initial
state, results in a state which satisfies the goal.

4.2.1 STRIPS Planning Algorithm


As previously referenced, STRIPS began as an automated planning algorithm and has double-
meaning to describe the language (described above) used to provide input to that algorithm.
While the algorithm does not scale well to real-world problems, it, like the language, serves as a
foundational starting point to developing and understanding more powerful automated planning
algorithms. The STRIPS algorithm [3] is found below, followed by a brief commentary:

STRIPS(A, s, g)
    p = empty plan
    loop...
        if s satisfies g then return p
        a = [an applicable action in A, relevant for g]
        if a = null, then return failure
        p' = STRIPS(A, s, precond(a))
        if p' = failure, then return failure
        s = apply p' to s
        s = apply a to s
        p = p + p' + a

In the above STRIPS algorithm, A represents all of the possible, grounded actions (i.e., action
schemas with variables replaced with values), while s is the current state, and g is the goal state.

Critically important to note is that this algorithm is a backward-search algorithm. In other


words, the goal state of the planning problem is fed into the algorithm as s, while the initial state
of the planning problem is provided as g. The returned solution, if any, simply needs to be
reversed to provide the solution to the planning problem. Simply put, it works backwards from
the goal to figure out a logical path to get there.

4.3 Explanation

Most expert systems have explanation facilities that allow the user to ask questions- why and
how it reached some conclusion

The questions are answered by referring to the system goals, the rules being used, and the
existing problem-solving state. The rules typically reflect empirical or "compiled" knowledge:
they encode an expert's rules of thumb, not the expert's deeper understanding.

Example:
Dialog with an expert system designed to give advice on car problems.

System Does the car start?

User No.

System Does the engine turn over?

User yes

System Do you smell gas?

User yes

System Recommendation-wait 5 minutes and try again.

User Why?

System I used the rule:

If not start, and engine_turn_over, and smell_gas

Then recommendation is 'wait 5 minutes and try again'



Note: The rule gives the correct advice for a flooded car and knows the questions to ask to
determine whether the car is flooded, but it does not contain the knowledge of what a flooded
car is or why waiting will help.

Types of Explanation
There are four types of explanations commonly used in expert systems.

 Rule trace, which reports on the progress of a consultation;

 Explanation of how the system reached the given conclusion;

 Explanation of why the system did not give any conclusion;

 Explanation of why the system is asking a question.

4.4 Learning

Machine Learning

 Unlike humans, who learn from past experiences, a computer does not have "experiences".

 A computer system learns from data, which represent some “past experiences” of an
application domain.

 Objective of machine learning : learn a target function that can be used to predict the
values of a discrete class attribute, e.g., approve or not-approved, and high-risk or low
risk.

 The task is commonly called: Supervised learning, classification, or inductive learning

Supervised Learning

Supervised learning is a machine learning technique for learning a function from training data.
The training data consist of pairs of input objects (typically vectors), and desired outputs. The
output of the function can be a continuous value (called regression), or can predict a class label
of the input object (called classification). The task of the supervised learner is to predict the
value of the function for any valid input object after having seen a number of training examples
(i.e. pairs of input and target output). To achieve this, the learner has to generalize from the
presented data to unseen situations in a "reasonable" way.

Another term for supervised learning is classification. Classifier performance depends greatly on
the characteristics of the data to be classified. There is no single classifier that works best on all
given problems. Determining a suitable classifier for a given problem is, however, still more an art
than a science. The most widely used classifiers are the Neural Network (Multi-layer
Perceptron), Support Vector Machines, k-Nearest Neighbors, Gaussian Mixture Model,
Gaussian, Naive Bayes, Decision Tree and RBF classifiers.

Supervised learning process: two steps


 Learning (training): Learn a model using the training data
 Testing: Test the model using unseen test data to assess the model accuracy

Accuracy = (Number of correct classifications) / (Total number of test cases)

Supervised vs. unsupervised Learning


 Supervised learning:

classification is seen as supervised learning from examples.

 Supervision: The data (observations, measurements, etc.) are labeled with pre-
defined classes. It is like that a “teacher” gives the classes (supervision).

 Test data are classified into these classes too.

 Unsupervised learning (clustering)

 Class labels of the data are unknown

 Given a set of data, the task is to establish the existence of classes or clusters in
the data

Decision Tree

 A decision tree takes as input an object or situation described by a set of attributes and
returns a “decision” – the predicted output value for the input.

 A decision tree reaches its decision by performing a sequence of tests.

Example : “HOW TO” manuals (for car repair)

A decision tree reaches its decision by performing a sequence of tests. Each internal node in the
tree corresponds to a test of the value of one of the properties, and the branches from the node
are labeled with the possible values of the test. Each leaf node in the tree specifies the value to be
returned if that leaf is reached. The decision tree representation seems to be very natural for
humans; indeed, many "How To" manuals (e.g., for car repair) are written entirely as a single
decision tree stretching over hundreds of pages.

A somewhat simpler example is provided by the problem of whether to wait for a table at a
restaurant. The aim here is to learn a definition for the goal predicate Will Wait. In setting this up
as a learning problem, we first have to state what attributes are available to describe examples in
the domain. we will see how to automate this task; for now, let's suppose we decide on the
following list of attributes:

1. Alternate: whether there is a suitable alternative restaurant nearby.

2. Bar: whether the restaurant has a comfortable bar area to wait in.

3. Fri/Sat: true on Fridays and Saturdays.

4. Hungry: whether we are hungry.

5. Patrons: how many people are in the restaurant (values are None, Some, and Full).

6. Price: the restaurant's price range ($, $$, $$$).

7. Raining: whether it is raining outside.

8. Reservation: whether we made a reservation.

9. Type: the kind of restaurant (French, Italian, Thai, or burger).

10. Wait Estimate: the wait estimated by the host (0-10 minutes, 10-30, 30-60, >60).

Decision tree induction from examples
An example for a Boolean decision tree consists of a vector of input attributes, X, and a single
Boolean output value y. A set of examples (X1, y1), . . . , (Xn, yn) is shown in the figure. The
positive examples are the ones in which the goal WillWait is true (X1, X3, . . .); the negative
examples are the ones in which it is false (X2, X5, . . .). The complete set of examples is called
the training set.

Decision Tree Algorithm


The basic idea behind the Decision-Tree-Learning-Algorithm is to test the most important
attribute first. By "most important," we mean the one that makes the most difference to the
classification of an example. That way, we hope to get to the correct classification with a small
number of tests, meaning that all paths in the tree will be short and the tree as a whole will be
small.
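"Most important attribute first" is usually made precise with information gain, the expected reduction in entropy from splitting on an attribute. The sketch below (an illustration on a made-up four-example slice of the restaurant data, not the full decision-tree-learning algorithm) ranks two attributes:

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attr, labels):
    # Expected reduction in entropy from splitting on `attr`
    total = entropy(labels)
    n = len(examples)
    remainder = 0.0
    for value in set(e[attr] for e in examples):
        subset = [l for e, l in zip(examples, labels) if e[attr] == value]
        remainder += (len(subset) / n) * entropy(subset)
    return total - remainder

# Hypothetical mini training set: attributes Patrons and Type, label WillWait
examples = [{"Patrons": "Full", "Type": "Thai"},
            {"Patrons": "Some", "Type": "Thai"},
            {"Patrons": "Some", "Type": "French"},
            {"Patrons": "None", "Type": "Burger"}]
labels = [False, True, True, False]

for attr in ("Patrons", "Type"):
    print(attr, round(information_gain(examples, attr, labels), 3))
# Patrons (gain 1.0) beats Type (gain 0.5), so it would be tested first.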
Reinforcement Learning
 Learning what to do to maximize reward

 Learner is not given training examples

 Only feedback is in terms of reward

 Try things out and see what the reward is

 Different from Supervised Learning

 Teacher gives training examples

Examples
 Robotics: Quadruped Gait Control, Ball Acquisition (Robocup)

 Control: Helicopters

 Operations Research: Pricing, Routing, Scheduling

 Game Playing: Backgammon, Solitaire, Chess, Checkers

 Human Computer Interaction: Spoken Dialogue Systems

 Economics/Finance: Trading

Markov decision process VS Reinforcement Learning
 Markov decision process

 Set of state S, set of actions A

 Transition probabilities to next states T(s, a, s')

 Reward functions R(s)

 RL is based on MDPs, but

 Transition model is not known

 Reward model is not known

 MDP computes an optimal policy

 RL learns an optimal policy

Types of Reinforcement Learning


 Passive Vs Active

 Passive: Agent executes a fixed policy and evaluates it

 Active: Agents updates policy as it learns

 Model based Vs Model free

 Model-based: Learn transition and reward model, use it to get optimal policy

 Model free: Derive optimal policy without learning the model

Passive Learning

 Evaluate how good a policy π is

 Learn the utility Uπ(s) of each state

 Same as policy evaluation for known transition & reward models

Agent executes a sequence of trials:

(1, 1) → (1, 2) → (1, 3) → (1, 2) → (1, 3) → (2, 3) → (3, 3) → (4, 3)+1


(1, 1) → (1, 2) → (1, 3) → (2, 3) → (3, 3) → (3, 2) → (3, 3) → (4, 3)+1
(1, 1) → (2, 1) → (3, 1) → (3, 2) → (4, 2)−1

Goal is to learn the expected utility Uπ(s)

Direct Utility Estimation


 Reduction to inductive learning

 Compute the empirical value of each state

 Each trial gives a sample value

 Estimate the utility based on the sample values

 Example: First trial gives

 State (1,1): A sample of reward 0.72

 State (1,2): Two samples of reward 0.76 and 0.84

 State (1,3): Two samples of reward 0.80 and 0.88

 Estimate can be a running average of sample values

 Example: U(1, 1) = 0.72,U(1, 2) = 0.80,U(1, 3) = 0.84, . . .

 Ignores a very important source of information

 The utility of states satisfy the Bellman equations

 Search is in a hypothesis space for U much larger than needed

 Convergence is very slow
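To make the running-average estimate above concrete, here is a short sketch (assuming, as in the classic 4×3 grid example, a reward of −0.04 in each non-terminal state and +1 at the terminal state) that reproduces the sample values quoted in the text:

trial = [(1, 1), (1, 2), (1, 3), (1, 2), (1, 3), (2, 3), (3, 3), (4, 3)]
rewards = [-0.04] * (len(trial) - 1) + [1.0]

utilities, counts = {}, {}
# The sample return from each visited state is the sum of rewards from there onward;
# the estimate is an incremental running average over the samples.
for i, s in enumerate(trial):
    sample = sum(rewards[i:])
    counts[s] = counts.get(s, 0) + 1
    utilities[s] = utilities.get(s, 0.0) + (sample - utilities.get(s, 0.0)) / counts[s]

print(round(utilities[(1, 1)], 2))   # 0.72, matching the text
print(round(utilities[(1, 2)], 2))   # 0.8, the average of 0.76 and 0.84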

 Make use of Bellman equations to get Uπ(s)

 Need to estimate T(s, π(s), s') and R(s) from trials

 Plug-in learnt transition and reward in the Bellman equations

 Solving for Uπ: System of n linear equations

 Estimates of T and R keep changing

 Make use of modified policy iteration idea

 Run few rounds of value iteration

 Initialize value iteration from previous utilities

 Converges fast since T and R changes are small

 ADP is a standard baseline to test 'smarter' ideas

 ADP is inefficient if state space is large

 Has to solve a linear system in the size of the state space

 Backgammon: 10^50 linear equations in 10^50 unknowns

Temporal Difference Learning


 Best of both worlds

 Only update states that are directly affected

 Approximately satisfy the Bellman equations

 Example:

(1, 1) → (1, 2) → (1, 3) → (1, 2) → (1, 3) → (2, 3) → (3, 3) → (4, 3)+1

(1, 1) → (1, 2) → (1, 3) → (2, 3) → (3, 3) → (3, 2) → (3, 3) → (4, 3)+1


(1, 1) → (2, 1) → (3, 1) → (3, 2) → (4, 2)−1

 After the first trial, U(1, 3) = 0.84,U(2, 3) = 0.92

 Consider the transition (1, 3) → (2, 3) in the second trial

 If deterministic, then U(1, 3) = −0.04 + U(2, 3)

 How to account for probabilistic transitions (without a model)

 TD chooses a middle ground

 Temporal difference (TD) equation, where α is the learning rate:

 Uπ(s) ← Uπ(s) + α (R(s) + γ Uπ(s') − Uπ(s))

 TD applies a correction to approach the Bellman equations

 The update for s' will occur T(s, π(s), s') fraction of the time

 The correction happens proportional to the probabilities

 Over trials, the correction is same as the expectation

 Learning rate α determines convergence to true utility

 Decrease α in proportion to the number of visits to the state

 Convergence is guaranteed if

 The decay schedule α(m) = 1/m satisfies the condition

 TD is model free
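A minimal sketch of the TD(0) update on the three trials above (assuming a step reward of −0.04, γ = 1, a constant α = 0.1, and pinning terminal states to their rewards; these simplifications are ours, not from the source):

GAMMA, ALPHA = 1.0, 0.1
trials = [
    [(1,1),(1,2),(1,3),(1,2),(1,3),(2,3),(3,3),(4,3)],
    [(1,1),(1,2),(1,3),(2,3),(3,3),(3,2),(3,3),(4,3)],
    [(1,1),(2,1),(3,1),(3,2),(4,2)],
]
terminal_reward = {(4,3): 1.0, (4,2): -1.0}

U = {}
for trial in trials:
    for s, s_next in zip(trial, trial[1:]):
        # TD update: move U(s) toward R(s) + gamma * U(s')
        u_next = U.get(s_next, 0.0)
        U[s] = U.get(s, 0.0) + ALPHA * (-0.04 + GAMMA * u_next - U.get(s, 0.0))
    t = trial[-1]
    U[t] = terminal_reward[t]        # terminal utility is its reward

print({s: round(u, 3) for s, u in U.items()})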

TD Vs ADP
 TD is model free, as opposed to ADP, which is model based

 TD updates observed successor rather than all successors


 The difference disappears with large number of trials

 TD is slower in convergence, but much simpler computation per observation

Active Learning
 Agent updates policy as it learns

 Goal is to learn the optimal policy

 Learning using the passive ADP agent

 Estimate the model R(s), T(s, a, s') from observations

 The optimal utility and action satisfies

 Solve using value iteration or policy iteration

 Agent has “optimal” action

 Simply execute the “optimal” action

Exploitation vs Exploration
 The passive approach gives a greedy agent

 Exactly executes the recipe for solving MDPs

 Rarely converges to the optimal utility and policy

 The learned model is different from the true environment

 Trade-off

 Exploitation: Maximize rewards using current estimates

 Agent stops learning and starts executing policy

 Exploration: Maximize long term rewards

 Agent keeps learning by trying out new things


 Pure Exploitation

 Mostly gets stuck in bad policies

 Pure Exploration

 Gets better models by learning

 Small rewards due to exploration

 The multi-armed bandit setting

 A slot machine has one lever, a one-armed bandit

 n-armed bandit has n levers

 Which arm to pull?

 Exploit: The one with the best pay-off so far

 Explore: The one that has not been tried

Exploration
 Greedy in the limit of infinite exploration (GLIE)

 Reasonable schemes for trade off

 Revisiting the greedy ADP approach

 Agent must try each action infinitely often

 Rules out chance of missing a good action

 Eventually must become greedy to get rewards

 Simple GLIE

 Choose random action 1/t fraction of the time

 Use greedy policy otherwise

 Converges to the optimal policy

 Convergence is very slow


Exploration Function
 A smarter GLIE

 Give higher weights to actions not tried very often

 Give lower weights to low utility actions

 Alter Bellman equations using optimistic utilities U+(s)

 The exploration function f (u, n)

 Should increase with expected utility u

 Should decrease with number of tries n

 A simple exploration function

 Actions towards unexplored regions are encouraged

 Fast convergence to almost optimal policy in practice

Q-Learning
 Exploration function gives a active ADP agent

 A corresponding TD agent can be constructed

 Surprisingly, the TD update can remain the same

 Converges to the optimal policy as active ADP

 Slower than ADP in practice

 Q-learning learns an action-value function Q(a, s)

 Utility values U(s) = maxa Q(a, s)

 A model-free TD method

 No model for learning or action selection


 Constraint equation for the Q-values at equilibrium:

 Q(a, s) = R(s) + γ Σs' T(s, a, s') maxa' Q(a', s')

 Can be updated using a model for T(s, a, s')

 TD Q-learning does not require a model; its update is calculated whenever
action a in s leads to s'

 The next action anext = argmaxa' f(Q(a', s'), N(s', a'))

 Q-learning is slower than ADP

 Trade-off: model-free vs knowledge-based methods
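Written out, the model-free TD Q-learning update is Q(a, s) ← Q(a, s) + α (R(s) + γ maxa' Q(a', s') − Q(a, s)). A minimal sketch (assuming a constant α and γ, with made-up states and actions):

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=1.0):
    # Q(a, s) <- Q(a, s) + alpha * (r + gamma * max_a' Q(a', s') - Q(a, s))
    q_sa = Q.get((a, s), 0.0)
    best_next = max(Q.get((a2, s_next), 0.0) for a2 in actions)
    Q[(a, s)] = q_sa + alpha * (r + gamma * best_next - q_sa)

# One observed transition in a toy grid world with two actions
Q = {}
q_update(Q, s=(1, 1), a="right", r=-0.04, s_next=(1, 2), actions=["right", "up"])
print(Q)   # {('right', (1, 1)): -0.004}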

PART A

1. What are the components of planning system?

2. What is planning?

3. What is nonlinear plan?

4. List out the three types of machine learning.

5. What is Reinforcement Learning?

6. What do you mean by goal stack planning?

7. Define machine learning.

8. What are the types of Reinforcement Learning?

PART B

1. Briefly explain the advanced plan generation systems.

2. Explain Machine Learning.

3. Explain STRIPS.

4. Explain Reinforcement Learning.

5. Briefly explain Partial Order Plan.

6. Explain in detail about various Machine learning methods.

UNIT 5
Uncertainty:
Till now, we have learned knowledge representation using first-order logic and propositional
logic with certainty, which means we were sure about the predicates. With this knowledge
representation we might write A→B, which means if A is true then B is true. But consider a
situation where we are not sure whether A is true or not; then we cannot express this
statement. This situation is called uncertainty.

So to represent uncertain knowledge, where we are not sure about the predicates, we need
uncertain reasoning or probabilistic reasoning.

Causes of uncertainty:
Following are some leading causes of uncertainty to occur in the real world.

1. Information obtained from unreliable sources.


2. Experimental Errors
3. Equipment fault
4. Temperature variation
5. Climate change.

Probabilistic reasoning:
Probabilistic reasoning is a way of knowledge representation where we apply the concept of
probability to indicate the uncertainty in knowledge. In probabilistic reasoning, we combine
probability theory with logic to handle the uncertainty.

We use probability in probabilistic reasoning because it provides a way to handle the


uncertainty that is the result of someone's laziness and ignorance.

In the real world, there are lots of scenarios, where the certainty of something is not
confirmed, such as "It will rain today," "behavior of someone for some situations," "A match
between two teams or two players." These are probable sentences for which we can assume
that it will happen but not sure about it, so here we use probabilistic reasoning.

Need of probabilistic reasoning in AI:

o When there are unpredictable outcomes.


o When the specifications or possibilities of predicates become too large to handle.
o When an unknown error occurs during an experiment.

In probabilistic reasoning, there are two ways to solve problems with uncertain knowledge:
o Bayes' rule
o Bayesian Statistics

As probabilistic reasoning uses probability and related terms, so before understanding


probabilistic reasoning, let's understand some common terms:

Probability: Probability can be defined as a chance that an uncertain event will occur. It is
the numerical measure of the likelihood that an event will occur. The value of probability
always remains between 0 and 1 that represent ideal uncertainties.

o 0 ≤ P(A) ≤ 1, where P(A) is the probability of an event A.

o P(A) = 0 indicates total uncertainty in an event A.
o P(A) = 1 indicates total certainty in an event A.

We can find the probability of an uncertain event by using the below formula.

o P(¬A) = probability of event A not happening.


o P(¬A) + P(A) = 1.

Event: Each possible outcome of a variable is called an event.

Sample space: The collection of all possible events is called sample space.

Random variables: Random variables are used to represent the events and objects in the
real world.

Prior probability: The prior probability of an event is probability computed before


observing new information.

Posterior Probability: The probability that is calculated after all evidence or information
has been taken into account. It is a combination of prior probability and new information.

Conditional probability:
Conditional probability is the probability of an event occurring given that another event has
already happened.

Let's suppose we want to calculate the probability of event A when event B has already
occurred, "the probability of A under the condition B". It can be written as:

P(A|B) = P(A⋀B) / P(B)

Where P(A⋀B) = joint probability of A and B

P(B) = marginal probability of B.

If the probability of A is given and we need to find the probability of B, then it will be given
as:

P(B|A) = P(A⋀B) / P(A)

It can be explained using the Venn diagram below: once event B has occurred, the sample
space is reduced to the set B, and we can calculate event A given that B has occurred by
dividing the probability P(A⋀B) by P(B).

Example:

In a class, 70% of the students like English and 40% of the students like both English and
Mathematics. What percent of the students who like English also like Mathematics?

Solution:

Let A be the event that a student likes Mathematics, and B the event that a student likes
English. Then

P(A|B) = P(A⋀B) / P(B) = 0.4 / 0.7 ≈ 0.57

Hence, about 57% of the students who like English also like Mathematics.
Bayes' theorem in Artificial intelligence
Bayes' theorem:
Bayes' theorem is also known as Bayes' rule, Bayes' law, or Bayesian reasoning, which
determines the probability of an event with uncertain knowledge.

In probability theory, it relates the conditional probability and marginal probabilities of two
random events.

Bayes' theorem was named after the British mathematician Thomas Bayes. The Bayesian
inference is an application of Bayes' theorem, which is fundamental to Bayesian statistics.

It is a way to calculate the value of P(B|A) with the knowledge of P(A|B).

Bayes' theorem allows updating the probability prediction of an event by observing new
information of the real world.

Example: If cancer corresponds to one's age then by using Bayes' theorem, we can
determine the probability of cancer more accurately with the help of age.

Bayes' theorem can be derived using product rule and conditional probability of event A with
known event B:

As from the product rule we can write:

P(A ⋀ B) = P(A|B) P(B)

Similarly, the probability of event B with known event A:

P(A ⋀ B) = P(B|A) P(A)

Equating the right-hand sides of both equations, we get:

P(A|B) = P(B|A) P(A) / P(B) ......(a)

The above equation (a) is called Bayes' rule or Bayes' theorem. This equation is the basis
of most modern AI systems for probabilistic inference.

It shows the simple relationship between joint and conditional probabilities. Here,

P(A|B) is known as the posterior, which we need to calculate. It is read as the probability
of hypothesis A given that evidence B has occurred.

P(B|A) is called the likelihood: assuming the hypothesis is true, we calculate the
probability of the evidence.
P(A) is called the prior probability, probability of hypothesis before considering the
evidence

P(B) is called marginal probability, pure probability of an evidence.

In equation (a), in general, we can write P(B) = Σi P(Ai) P(B|Ai); hence Bayes' rule can
be written as:

P(Ai|B) = P(Ai) P(B|Ai) / Σk P(Ak) P(B|Ak)

Where A1, A2, A3,........, An is a set of mutually exclusive and exhaustive events.

Applying Bayes' rule:


Bayes' rule allows us to compute the single term P(B|A) in terms of P(A|B), P(B), and P(A).
This is very useful in cases where we have good probabilities for these three terms and want
to determine the fourth one. Suppose we want to perceive the effect of some unknown
cause and want to compute that cause; then Bayes' rule becomes:

P(cause|effect) = P(effect|cause) P(cause) / P(effect)

Example-1:

Question: what is the probability that a patient has diseases meningitis with a stiff
neck?

Given Data:

A doctor is aware that disease meningitis causes a patient to have a stiff neck, and it occurs
80% of the time. He is also aware of some more facts, which are given as follows:

o The Known probability that a patient has meningitis disease is 1/30,000.


o The Known probability that a patient has a stiff neck is 2%.

Let a be the proposition that the patient has a stiff neck and b the proposition that the
patient has meningitis. Then we can state the following:

P(a|b) = 0.8

P(b) = 1/30000

P(a) = 0.02

Applying Bayes' rule:

P(b|a) = P(a|b) P(b) / P(a) = (0.8 × 1/30000) / 0.02 = 1/750 ≈ 0.0013

Hence, we can assume that 1 patient out of 750 patients has meningitis disease with a stiff
neck.

Example-2:

Question: From a standard deck of playing cards, a single card is drawn. The
probability that the card is king is 4/52, then calculate posterior probability
P(King|Face), which means the drawn face card is a king card.

Solution:

P(king): probability that the card is King= 4/52= 1/13

P(face): probability that a card is a face card= 3/13

P(Face|King): probability of a face card given that it is a king = 1

Putting all values in equation (a), we get:

P(King|Face) = P(Face|King) P(King) / P(Face) = (1 × 1/13) / (3/13) = 1/3

Application of Bayes' theorem in Artificial intelligence:


Following are some applications of Bayes' theorem:

o It is used to calculate the next step of the robot when the already executed step is
given.
o Bayes' theorem is helpful in weather forecasting.
o It can solve the Monty Hall problem.

Bayesian Belief Network in artificial


intelligence
A Bayesian belief network is a key technology for dealing with probabilistic events and for
solving problems that involve uncertainty. We can define a Bayesian network as:

"A Bayesian network is a probabilistic graphical model which represents a set of variables
and their conditional dependencies using a directed acyclic graph."

It is also called a Bayes network, belief network, decision network, or Bayesian


model.

Bayesian networks are probabilistic, because these networks are built from a probability
distribution, and also use probability theory for prediction and anomaly detection.

Real world applications are probabilistic in nature, and to represent the relationship between
multiple events, we need a Bayesian network. It can also be used in various tasks
including prediction, anomaly detection, diagnostics, automated insight, reasoning,
time series prediction, and decision making under uncertainty.

Bayesian Network can be used for building models from data and experts opinions, and it
consists of two parts:

o Directed Acyclic Graph


o Table of conditional probabilities.

The generalized form of Bayesian network that represents and solve decision problems
under uncertain knowledge is known as an Influence diagram.

A Bayesian network graph is made up of nodes and Arcs (directed links), where:

o Each node corresponds to the random variables, and a variable can
be continuous or discrete.
o Arc or directed arrows represent the causal relationship or conditional probabilities
between random variables. These directed links or arrows connect the pair of nodes
in the graph.
These links represent that one node directly influences the other node; if there is
no directed link, the nodes are independent of each other
o In the above diagram, A, B, C, and D are random variables
represented by the nodes of the network graph.
o If we are considering node B, which is connected with node A by a
directed arrow, then node A is called the parent of Node B.
o Node C is independent of node A.

The Bayesian network has mainly two components:

o Causal Component
o Actual numbers

Each node in the Bayesian network has a conditional probability distribution P(Xi | Parent(Xi)),
which determines the effect of the parents on that node.

Bayesian network is based on Joint probability distribution and conditional probability. So
let's first understand the joint probability distribution:

Joint probability distribution:


If we have variables x1, x2, x3,....., xn, then the probabilities of a different combination of
x1, x2, x3.. xn, are known as Joint probability distribution.

P[x1, x2, x3,....., xn], it can be written as the following way in terms of the joint probability
distribution.

= P[x1| x2, x3,....., xn]P[x2, x3,....., xn]

= P[x1| x2, x3,....., xn]P[x2|x3,....., xn]....P[xn-1|xn]P[xn].

In general for each variable Xi, we can write the equation as:

P(Xi|Xi-1,........., X1) = P(Xi |Parents(Xi ))

Explanation of Bayesian network:


Let's understand the Bayesian network through an example by creating a directed acyclic
graph:

Example: Harry installed a new burglar alarm at his home to detect burglary. The alarm
reliably responds at detecting a burglary but also responds for minor earthquakes. Harry
has two neighbors David and Sophia, who have taken a responsibility to inform Harry at
work when they hear the alarm. David always calls Harry when he hears the alarm, but
sometimes he gets confused with the phone ringing and calls at that time too. On the other
hand, Sophia likes to listen to loud music, so sometimes she fails to hear the alarm. Here
we would like to compute the probability of the burglar alarm going off.

Problem:

Calculate the probability that alarm has sounded, but there is neither a burglary,
nor an earthquake occurred, and David and Sophia both called the Harry.

Solution:

o The Bayesian network for the above problem is given below. The network structure is
showing that burglary and earthquake is the parent node of the alarm and directly
affecting the probability of alarm's going off, but David and Sophia's calls depend on
alarm probability.
o The network represents that David and Sophia do not directly perceive the
burglary, do not notice the minor earthquake, and do not confer with each other
before calling.
o The conditional distributions for each node are given as conditional probabilities table
or CPT.
o Each row in the CPT must sum to 1, because all the entries in the table represent
an exhaustive set of cases for the variable.
o In a CPT, a boolean variable with k boolean parents contains 2^k probabilities. Hence, if
there are two parents, then the CPT will contain 4 probability values

List of all events occurring in this network:

o Burglary (B)
o Earthquake(E)
o Alarm(A)
o David Calls(D)
o Sophia calls(S)

We can write the events of problem statement in the form of probability: P[D, S, A, B, E],
can rewrite the above probability statement using joint probability distribution:

P[D, S, A, B, E]= P[D | S, A, B, E]. P[S, A, B, E]

=P[D | S, A, B, E]. P[S | A, B, E]. P[A, B, E]

= P [D| A]. P [ S| A, B, E]. P[ A, B, E]

= P[D | A]. P[ S | A]. P[A| B, E]. P[B, E]

= P[D | A ]. P[S | A]. P[A| B, E]. P[B |E]. P[E]

Methods to find probability

Conditional probability

P(E and B) = P(E|B) · P(B)

P(E and B) = P(B|E) · P(E)

P(B|E) = P(E|B) · P(B) / P(E)

Let's take the observed probability for the Burglary and earthquake component:

P(B= True) = 0.002, which is the probability of burglary.

P(B= False)= 0.998, which is the probability of no burglary.

P(E= True)= 0.001, which is the probability of a minor earthquake

P(E= False)= 0.999, Which is the probability that an earthquake not occurred.

We can provide the conditional probabilities as per the below tables:

Conditional probability table for Alarm A:

The Conditional probability of Alarm A depends on Burglar and earthquake:

B        E        P(A= True)    P(A= False)

True     True     0.94          0.06

True     False    0.95          0.05

False    True     0.31          0.69

False    False    0.001         0.999
Conditional probability table for David Calls:

The Conditional probability of David that he will call depends on the probability of Alarm.

A P(D= True) P(D= False)

True 0.91 0.09

False 0.05 0.95

Conditional probability table for Sophia Calls:

The conditional probability that Sophia calls depends on its parent node "Alarm."

A P(S= True) P(S= False)

True 0.75 0.25

False 0.02 0.98

From the formula of joint distribution, we can write the problem statement in the form of
probability distribution:

P(S, D, A, ¬B, ¬E) = P (S|A) *P (D|A)*P (A|¬B ^ ¬E) *P (¬B) *P (¬E).

= 0.75* 0.91* 0.001* 0.998*0.999

= 0.00068045.

Hence, a Bayesian network can answer any query about the domain by using Joint
distribution.
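The same calculation is straightforward once the CPTs are stored in tables. A minimal sketch (an illustration added here, using the CPT values above):

# Sketch: computing the joint probability P(S, D, A, ¬B, ¬E) from the CPTs
# above by multiplying each node's conditional probability given its parents.
p_b = {True: 0.002, False: 0.998}
p_e = {True: 0.001, False: 0.999}
p_a = {(True, True): 0.94, (True, False): 0.95,
       (False, True): 0.31, (False, False): 0.001}   # P(A=True | B, E)
p_d = {True: 0.91, False: 0.05}                       # P(D=True | A)
p_s = {True: 0.75, False: 0.02}                       # P(S=True | A)

b, e, a, d, s = False, False, True, True, True        # the query assignment
joint = p_s[a] * p_d[a] * p_a[(b, e)] * p_b[b] * p_e[e]
print(round(joint, 8))   # 0.00068045, matching the hand calculation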

The semantics of Bayesian Network:

There are two ways to understand the semantics of the Bayesian network, which is given
below:

1. To understand the network as the representation of the Joint probability


distribution.

It is helpful to understand how to construct the network.

2. To understand the network as an encoding of a collection of conditional
independence statements.

It is helpful in designing inference procedure.

What is Fuzzy Logic?


Fuzzy Logic (FL) is a method of reasoning that resembles human reasoning. The
approach of FL imitates the way of decision making in humans that involves all
intermediate possibilities between digital values YES and NO.
The conventional logic block that a computer can understand takes precise input and
produces a definite output as TRUE or FALSE, which is equivalent to human’s YES or
NO.
The inventor of fuzzy logic, Lotfi Zadeh, observed that unlike computers, the human
decision making includes a range of possibilities between YES and NO, such as −
CERTAINLY YES
POSSIBLY YES
CANNOT SAY
POSSIBLY NO
CERTAINLY NO
The fuzzy logic works on the levels of possibilities of input to achieve the definite
output.
Implementation
 It can be implemented in systems with various sizes and capabilities ranging
from small micro-controllers to large, networked, workstation-based control
systems.
 It can be implemented in hardware, software, or a combination of both.

Why Fuzzy Logic?


Fuzzy logic is useful for commercial and practical purposes.

 It can control machines and consumer products.


 It may not give accurate reasoning, but acceptable reasoning.
 Fuzzy logic helps to deal with the uncertainty in engineering.
Fuzzy Logic Systems Architecture
It has four main parts as shown −
 Fuzzification Module − It transforms the system inputs, which are crisp
numbers, into fuzzy sets. It splits the input signal into five steps such as −

LP    x is Large Positive

MP    x is Medium Positive

S     x is Small

MN    x is Medium Negative

LN    x is Large Negative

 Knowledge Base − It stores IF-THEN rules provided by experts.


 Inference Engine − It simulates the human reasoning process by making fuzzy
inference on the inputs and IF-THEN rules.
 Defuzzification Module − It transforms the fuzzy set obtained by the inference
engine into a crisp value.

The membership functions work on fuzzy sets of variables.


Membership Function
Membership functions allow you to quantify a linguistic term and represent a fuzzy set
graphically. A membership function for a fuzzy set A on the universe of discourse X
is defined as μA : X → [0, 1].

Here, each element of X is mapped to a value between 0 and 1. It is
called the membership value or degree of membership. It quantifies the degree of
membership of the element of X to the fuzzy set A.

 x axis represents the universe of discourse.


 y axis represents the degrees of membership in the [0, 1] interval.
There can be multiple membership functions applicable to fuzzify a numerical value.
Simple membership functions are preferred, since more complex functions do not add
precision to the output.
All membership functions for LP, MP, S, MN, and LN are shown as below −

The triangular membership function shapes are most common among various other
membership function shapes such as trapezoidal, singleton, and Gaussian.
Here, the input to 5-level fuzzifier varies from -10 volts to +10 volts. Hence the
corresponding output also changes.
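A triangular membership function is easy to write down. The sketch below (with hypothetical level centers chosen by us across the −10 V .. +10 V range; they are not values from the text) fuzzifies one crisp input:

def triangle(x, a, b, c):
    # Rises linearly from a to the peak b, falls linearly from b to c
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# Hypothetical (a, peak, c) triples for LN, MN, S, MP, LP over [-10, 10]
levels = {"LN": (-15, -10, -5), "MN": (-10, -5, 0), "S": (-5, 0, 5),
          "MP": (0, 5, 10), "LP": (5, 10, 15)}

x = 3.0   # a crisp input voltage
print({name: round(triangle(x, *abc), 2) for name, abc in levels.items()})
# {'LN': 0.0, 'MN': 0.0, 'S': 0.4, 'MP': 0.6, 'LP': 0.0}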

Example of a Fuzzy Logic System


Let us consider an air conditioning system with 5-level fuzzy logic system. This system
adjusts the temperature of air conditioner by comparing the room temperature and the
target temperature value.

Algorithm

 Define linguistic Variables and terms (start)


 Construct membership functions for them. (start)
 Construct knowledge base of rules (start)
 Convert crisp data into fuzzy data sets using membership functions.
(fuzzification)
 Evaluate rules in the rule base. (Inference Engine)
 Combine results from each rule. (Inference Engine)
 Convert output data into non-fuzzy values. (defuzzification)
Development
Step 1 − Define linguistic variables and terms
Linguistic variables are input and output variables in the form of simple words or
sentences. For room temperature, cold, warm, hot, etc., are linguistic terms.
Temperature (t) = {very-cold, cold, warm, very-warm, hot}
Every member of this set is a linguistic term and it can cover some portion of overall
temperature values.
Step 2 − Construct membership functions for them
The membership functions of temperature variable are as shown −

Step 3 − Construct knowledge base rules


Create a matrix of room temperature values versus target temperature values that an
air conditioning system is expected to provide.

RoomTemp./Target   Very_Cold    Cold         Warm         Hot          Very_Hot

Very_Cold          No_Change    Heat         Heat         Heat         Heat

Cold               Cool         No_Change    Heat         Heat         Heat

Warm               Cool         Cool         No_Change    Heat         Heat

Hot                Cool         Cool         Cool         No_Change    Heat

Very_Hot           Cool         Cool         Cool         Cool         No_Change

Build a set of rules into the knowledge base in the form of IF-THEN-ELSE structures.

Sr. No. Condition Action

1 IF temperature=(Cold OR Very_Cold) AND target=Warm THEN Heat

2 IF temperature=(Hot OR Very_Hot) AND target=Warm THEN Cool

3 IF (temperature=Warm) AND (target=Warm) THEN No_Change

Step 4 − Obtain fuzzy value


Fuzzy set operations perform evaluation of rules. The operations used for OR and AND
are Max and Min respectively. Combine all results of evaluation to form a final result.
This result is a fuzzy value.
Step 5 − Perform defuzzification
Defuzzification is then performed according to membership function for output variable.
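The min/max evaluation in Step 4 can be illustrated with a short sketch (the membership degrees below are assumed for illustration, not taken from the text); it evaluates two of the rules above using min for AND and max for OR:

# Assumed membership degrees after fuzzification
temp = {"Cold": 0.7, "Very_Cold": 0.2, "Warm": 0.1, "Hot": 0.0, "Very_Hot": 0.0}
target = {"Warm": 0.9}

# Rule 1: IF temperature = (Cold OR Very_Cold) AND target = Warm THEN Heat
rule1_heat = min(max(temp["Cold"], temp["Very_Cold"]), target["Warm"])

# Rule 3: IF temperature = Warm AND target = Warm THEN No_Change
rule3_nochange = min(temp["Warm"], target["Warm"])

print({"Heat": rule1_heat, "No_Change": rule3_nochange})
# {'Heat': 0.7, 'No_Change': 0.1} -- a fuzzy result, later defuzzified in Step 5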

Application Areas of Fuzzy Logic


The key application areas of fuzzy logic are as given −
Automotive Systems
 Automatic Gearboxes
 Four-Wheel Steering
 Vehicle environment control
Consumer Electronic Goods

 Hi-Fi Systems
 Photocopiers
 Still and Video Cameras
 Television
Domestic Goods

 Microwave Ovens
 Refrigerators
 Toasters
 Vacuum Cleaners
 Washing Machines
Environment Control

 Air Conditioners/Dryers/Heaters
 Humidifiers
Advantages of FLSs
 Mathematical concepts within fuzzy reasoning are very simple.
 You can modify a FLS by just adding or deleting rules due to flexibility of fuzzy
logic.
 Fuzzy logic Systems can take imprecise, distorted, noisy input information.
 FLSs are easy to construct and understand.
 Fuzzy logic is a solution to complex problems in all fields of life, including
medicine, as it resembles human reasoning and decision making.

Disadvantages of FLSs
 There is no systematic approach to fuzzy system designing.
 They are understandable only when simple.
 They are suitable for the problems which do not need high accuracy.

Utility Functions in Artificial Intelligence


Agents use utility theory for making decisions. A utility function is a mapping from lotteries to
real numbers. An agent is supposed to have various preferences and can choose the option
that best fits its needs.
Utility scales and Utility assessments
To help an agent make decisions and behave accordingly, we need to build a decision-
theoretic system. For this, we need to understand the utility function. The process of
working it out is known as preference elicitation: the agent is presented with choices,
and the observed preferences are used to pin down the underlying utility function.
Generally, there is no absolute scale for the utility function, but a scale can be
established by fixing two reference points, just as a temperature scale is fixed by the
boiling and freezing points of water. Thus, the utility is fixed as:
U(S) = u⊤ for the best possible outcome
U(S) = u⊥ for the worst possible outcome
A normalized utility function uses a utility scale with u⊤ = 1 and u⊥ = 0. Given such a
scale, an agent can assess any prize Z by comparing it with the standard lottery
[p, u⊤; (1 − p), u⊥], where the probability p is adjusted until the agent is indifferent
between Z and the standard lottery.
In medical, transportation, and environmental decision problems, two measurement
units are used to assess risks to life: the micromort (a one-in-a-million chance of
death) and the QALY (quality-adjusted life year).
Money Utility
Utility theory has its roots in economics, and money is one of the most common
measures of value. An agent therefore prefers more money to less, all other things
being equal; it exhibits a monotonic preference (more is preferred over less) for money.
To compare choices, the agent calculates the Expected Monetary Value (EMV) of each
one. However, maximizing EMV is not always the right decision, as the sketch below
illustrates.
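A small Python sketch of this point, with hypothetical payoffs: the gamble below has a
higher EMV than the sure thing, yet a risk-averse agent whose utility over money is,
say, the square root (an assumption for illustration) still prefers the sure thing.

import math

def emv(lottery):
    # Expected Monetary Value: sum of probability * monetary payoff.
    return sum(p * x for p, x in lottery)

def expected_utility(lottery, U):
    # Expected utility under a utility function U over money.
    return sum(p * U(x) for p, x in lottery)

gamble = [(0.5, 3000), (0.5, 0)]    # hypothetical 50/50 lottery
sure_thing = [(1.0, 1000)]

print(emv(gamble), emv(sure_thing))             # 1500.0 vs 1000.0
print(expected_utility(gamble, math.sqrt),      # ~27.4
      expected_utility(sure_thing, math.sqrt))  # ~31.6 -> sure thing wins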
Multi-attribute utility functions
Multi-attribute utility functions include those problems whose outcomes are categorized by
two or more attributes. Such problems are handled by multi-attribute utility theory.
Terminology used
 Dominance: If there are two choices, say A and B, where A is better than B on every
attribute, then A will be chosen: A dominates B. Multi-attribute utility theory offers two
types of dominance:
 Strict Dominance: If there are two websites T and D, where T costs less and provides
better service than D, the customer will obviously prefer T over D. Therefore, T strictly
dominates D. Here, the attribute values are known exactly.
 Stochastic Dominance: This is a generalized approach used when the attribute values
are uncertain, as frequently occurs in real problems. Given distributions over the
attribute values, the choice that stochastically dominates the other choices is picked.
The exact relationship can be determined by examining the cumulative distributions of
the attributes.

 Preference Structure: Representation theorems are used to show that an agent with a
preference structure has a utility function of the form:
U(x1, . . . , xn) = F[f1(x1), . . . , fn(xn)],
where F is some arithmetic function, such as addition.
Preference structures arise in two settings:
 Preference without uncertainty: Here two attributes are preferentially independent of
a third attribute if the preference between outcomes of the first two attributes does not
depend on the value of the third.
 Preference with uncertainty: This refers to preference structures under uncertainty.
Here, utility independence extends preferential independence: a set of attributes X is
utility independent of a set of attributes Y if preferences over lotteries on X are
independent of the values of the attributes in Y. A set is mutually utility independent
(MUI) if each of its subsets is utility-independent of the remaining attributes.

UNIT 6
Inductive Learning Algorithm
Inductive Learning Algorithm (ILA) is an iterative and inductive machine learning
algorithm used for generating a set of classification rules of the form “IF-THEN”. For a
set of examples, it produces rules at each iteration and appends them to the rule set.
Basic Idea:
There are basically two methods for knowledge extraction: from domain experts and
through machine learning.
For very large amounts of data, domain experts are neither practical nor reliable, so we
move towards the machine learning approach for this work.
One machine learning method is to replicate the expert's logic in the form of
algorithms, but this work is tedious, time-consuming, and expensive.
So we move towards inductive algorithms, which generate the strategy for performing
a task themselves and need not be instructed separately at each step.
Need of ILA in presence of other machine learning algorithms:
The ILA is a newer algorithm that was needed even when other inductive learning
algorithms like ID3 and AQ were available.
 The need arose from pitfalls present in the previous algorithms, one of the major
ones being the lack of generalisation of rules.
 The ID3 and AQ methods used decision tree production, which was too specific,
difficult to analyse, and slow to apply to basic short classification problems.
 A decision tree-based algorithm is unable to handle a new problem instance if
some attributes are missing.
 The ILA produces a general set of rules instead of decision trees, which
overcomes the above problems.
THE ILA ALGORITHM:
General requirements at the start of the algorithm:-
1. List the examples in the form of a table ‘T’ where each row corresponds to an
example and each column contains an attribute value.
2. Create a set of m training examples, each example composed of k attributes and a
class attribute with n possible decisions.
3. Create a rule set, R, having the initial value false.
4. Initially, all rows in the table are unmarked.
Steps in the algorithm:-
Step 1:
divide the table ‘T’ containing m examples into n sub-tables (t1, t2,…..tn). One table for
each possible value of the class attribute. (repeat steps 2-8 for each sub-table)
Step 2:
Initialize the attribute combination count ‘ j ‘ = 1.
Step 3:
For the sub-table on which work is going on, divide the attribute list into distinct
combinations, each combination with ‘j ‘ distinct attributes.
Step 4:
For each combination of attributes, count the number of occurrences of attribute values
that appear under the same combination of attributes in unmarked rows of the sub-table
under consideration, and at the same time, not appears under the same combination of
attributes of other sub-tables. Call the first combination with the maximum number of
occurrences the max-combination ‘ MAX’.
Step 5:
If ‘MAX’ == null, increase ‘j’ by 1 and go to Step 3.
Step 6:
Mark all rows of the sub-table under consideration, in which the values of ‘MAX’
appear, as classified.
Step 7:
Add a rule (IF attribute = “XYZ” –> THEN decision is YES/ NO) to R whose left-hand side
will have attribute names of the ‘MAX’ with their values separated by AND, and its right-
hand side contains the decision attribute value associated with the sub-table.
Step 8:
If all rows are marked as classi?ed, then move on to process another sub-table and go to
Step 2. else, go to Step 4. If no sub-tables are available, exit with the set of rules obtained
till then.
An example showing the use of ILA
Suppose an example set has the attributes place type, weather, and location, together
with a decision, and seven examples. Our task is to generate a set of rules telling us
what the decision is under each condition.
EXAMPLE NO.   PLACE TYPE   WEATHER   LOCATION   DECISION
I             hilly        winter    kullu      Yes
II            mountain     windy     Mumbai     No
III           mountain     windy     Shimla     Yes
IV            beach        windy     Mumbai     No
V             beach        warm      goa        Yes
VI            beach        windy     goa        No
VII           beach        warm      Shimla     Yes


Step 1
Subset 1 (decision = yes)
S.NO   PLACE TYPE   WEATHER   LOCATION   DECISION
1      hilly        winter    kullu      Yes
2      mountain     windy     Shimla     Yes
3      beach        warm      goa        Yes
4      beach        warm      Shimla     Yes

Subset 2 (decision = no)
S.NO   PLACE TYPE   WEATHER   LOCATION   DECISION
5      mountain     windy     Mumbai     No
6      beach        windy     Mumbai     No
7      beach        windy     goa        No

Step (2-8)
At iteration 1, the weather column for rows 3 and 4 is selected, and rows 3 and 4 are
marked. The rule added to R: IF weather is warm THEN the decision is yes.
At iteration 2, the place type column for row 1 is selected, and row 1 is marked. The
rule added to R: IF place type is hilly THEN the decision is yes.
At iteration 3, the location column for row 2 is selected, and row 2 is marked. The rule
added to R: IF location is Shimla THEN the decision is yes.
At iteration 4, the location column for rows 5 and 6 is selected, and rows 5 and 6 are
marked. The rule added to R: IF location is Mumbai THEN the decision is no.
At iteration 5, the place type and weather columns for row 7 are selected, and row 7 is
marked. The rule added to R: IF place type is beach AND weather is windy THEN the
decision is no.
Finally, we get the rule set:
Rule Set
 Rule 1: IF the weather is warm THEN the decision is yes.
 Rule 2: IF place type is hilly THEN the decision is yes.
 Rule 3: IF location is Shimla THEN the decision is yes.
 Rule 4: IF location is Mumbai THEN the decision is no.
 Rule 5: IF place type is beach AND the weather is windy THEN the decision is no.
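The following Python sketch is one straightforward reading of Steps 1-8 above; run on
the seven-example table, it reproduces the rule set just listed (the rules for the "no"
class come out first here, since classes are processed in sorted order). The data
encoding as a list of dictionaries is our own choice, not part of the algorithm.

from itertools import combinations

def ila(examples, attributes, class_attr):
    rules = []
    for c in sorted({ex[class_attr] for ex in examples}):
        # Step 1: sub-table for this class value, plus all other rows.
        sub = [ex for ex in examples if ex[class_attr] == c]
        others = [ex for ex in examples if ex[class_attr] != c]
        marked = [False] * len(sub)
        j = 1                                     # Step 2
        while not all(marked) and j <= len(attributes):
            # Steps 3-4: find the attribute combination whose value-tuple
            # occurs most often in unmarked rows but never in other sub-tables.
            best = (None, None, 0)                # (combo, values, count)
            for combo in combinations(attributes, j):
                counts = {}
                for i, row in enumerate(sub):
                    if marked[i]:
                        continue
                    vals = tuple(row[a] for a in combo)
                    if any(tuple(o[a] for a in combo) == vals for o in others):
                        continue
                    counts[vals] = counts.get(vals, 0) + 1
                for vals, cnt in counts.items():
                    if cnt > best[2]:
                        best = (combo, vals, cnt)
            combo, vals, _ = best
            if combo is None:                     # Step 5: MAX is null
                j += 1
                continue
            for i, row in enumerate(sub):         # Step 6: mark covered rows
                if not marked[i] and tuple(row[a] for a in combo) == vals:
                    marked[i] = True
            # Step 7: add the rule; Step 8 is the loop condition above.
            lhs = " AND ".join(f"{a} is {v}" for a, v in zip(combo, vals))
            rules.append(f"IF {lhs} THEN the decision is {c}")
    return rules

attrs = ["place type", "weather", "location"]
rows = [("hilly", "winter", "kullu", "yes"), ("mountain", "windy", "Mumbai", "no"),
        ("mountain", "windy", "Shimla", "yes"), ("beach", "windy", "Mumbai", "no"),
        ("beach", "warm", "goa", "yes"), ("beach", "windy", "goa", "no"),
        ("beach", "warm", "Shimla", "yes")]
data = [dict(zip(attrs + ["decision"], r)) for r in rows]
for rule in ila(data, attrs, "decision"):
    print(rule)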

Decision Tree
o Decision Tree is a Supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for solving
Classification problems. It is a tree-structured classifier, where internal nodes
represent the features of a dataset, branches represent the decision
rules and each leaf node represents the outcome.
o In a Decision tree, there are two nodes, which are the Decision Node and Leaf
Node. Decision nodes are used to make any decision and have multiple branches,
whereas Leaf nodes are the output of those decisions and do not contain any further
branches.
o The decisions or the test are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node,
which expands on further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands
for Classification and Regression Tree algorithm.
o A decision tree simply asks a question and, based on the answer (Yes/No), further
splits into subtrees.
o Below diagram explains the general structure of a decision tree:

Note: A decision tree can contain categorical data (YES/NO) as well as numeric data.

Why use Decision Trees?
There are various algorithms in Machine learning, so choosing the best algorithm for the
given dataset and problem is the main point to remember while creating a machine learning
model. Below are the two reasons for using the Decision tree:

o Decision Trees usually mimic human thinking ability while making a decision, so it is
easy to understand.
o The logic behind the decision tree can be easily understood because it shows a tree-
like structure.

Decision Tree Terminologies


Root Node: Root node is from where the decision tree starts. It represents the entire dataset,
which further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further
after getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other nodes are
called the child nodes.

How does the Decision Tree algorithm Work?


In a decision tree, for predicting the class of the given dataset, the algorithm starts from the
root node of the tree. This algorithm compares the values of root attribute with the record
(real dataset) attribute and, based on the comparison, follows the branch and jumps to the
next node.

For the next node, the algorithm again compares the attribute value with the other sub-
nodes and moves further. It continues the process until it reaches a leaf node of the tree.
The complete process can be better understood using the below algorithm:

o Step-1: Begin the tree with the root node, say S, which contains the complete
dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure
(ASM).
o Step-3: Divide S into subsets that contain the possible values for the best
attribute.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset
created in Step-3. Continue this process until a stage is reached where you cannot
classify the nodes further; the final nodes are called leaf nodes.

Example: Suppose there is a candidate who has a job offer and wants to decide whether he
should accept the offer or Not. So, to solve this problem, the decision tree starts with the
root node (Salary attribute by ASM). The root node splits further into the next decision node
(distance from the office) and one leaf node based on the corresponding labels. The next
decision node further gets split into one decision node (Cab facility) and one leaf node.
Finally, the decision node splits into two leaf nodes (Accepted offers and Declined offer).
Consider the below diagram:

Attribute Selection Measures
While implementing a decision tree, the main issue that arises is how to select the best
attribute for the root node and for the sub-nodes. To solve such problems there is a
technique called the Attribute Selection Measure, or ASM. With this measurement, we
can easily select the best attribute for the nodes of the tree. There are two popular
techniques for ASM, which are:

o Information Gain
o Gini Index

1. Information Gain:
o Information gain is the measurement of changes in entropy after the segmentation
of a dataset based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision
tree.
o A decision tree algorithm always tries to maximize the value of information gain, and
a node/attribute having the highest information gain is split first. It can be calculated
using the below formula:

Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]


Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies
randomness in data. Entropy can be calculated as:

Entropy(S) = −P(yes) log2 P(yes) − P(no) log2 P(no)

Where,

o S= Total number of samples


o P(yes)= probability of yes
o P(no)= probability of no

2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in
the CART(Classification and Regression Tree) algorithm.
o An attribute with the low Gini index should be preferred as compared to the high Gini
index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create
binary splits.
o Gini index can be calculated using the below formula:

Gini Index = 1 − Σj Pj^2
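Both measures are easy to compute directly. The sketch below, on a toy dataset of our
own invention, evaluates the entropy of a label set, the information gain of a split, and
the Gini index, following the formulas above.

import math
from collections import Counter

def entropy(labels):
    # Entropy(S) = -sum over classes of p * log2(p).
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    # Entropy(S) minus the weighted average entropy after splitting on feature.
    n = len(labels)
    split = {}
    for row, y in zip(rows, labels):
        split.setdefault(row[feature], []).append(y)
    weighted = sum(len(part) / n * entropy(part) for part in split.values())
    return entropy(labels) - weighted

def gini(labels):
    # Gini index = 1 - sum over classes of p^2.
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

rows = [{"outlook": "sunny"}, {"outlook": "sunny"},
        {"outlook": "rain"}, {"outlook": "rain"}]
labels = ["no", "no", "yes", "yes"]
print(entropy(labels))                            # 1.0 (maximally impure)
print(information_gain(rows, labels, "outlook"))  # 1.0 (split is pure)
print(gini(labels))                               # 0.5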

Pruning: Getting an Optimal Decision tree


Pruning is a process of deleting the unnecessary nodes from a tree in order to get the
optimal decision tree.

A tree that is too large increases the risk of overfitting, while a small tree may not
capture all the important features of the dataset. A technique that decreases the size of
the learning tree without reducing accuracy is therefore needed; this is known as
pruning. There are mainly two types of tree pruning techniques used:

o Cost Complexity Pruning


o Reduced Error Pruning.

Advantages of the Decision Tree


o It is simple to understand, as it follows the same process a human follows while
making any decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o There is less requirement of data cleaning compared to other algorithms.

Disadvantages of the Decision Tree
o The decision tree contains lots of layers, which makes it complex.
o It may have an overfitting issue, which can be resolved using the Random Forest
algorithm.
o For more class labels, the computational complexity of the decision tree may
increase.

Supervised Learning
Supervised learning is commonly used in real world applications, such as face and speech recognition,
products or movie recommendations, and sales forecasting. Supervised learning can be further classified
into two types - Regression and Classification.

Regression trains on and predicts a continuous-valued response, for example predicting real estate
prices.

Classification attempts to find the appropriate class label, such as analyzing positive/negative sentiment,
male and female persons, benign and malignant tumors, secure and unsecured loans, etc.

• In supervised learning, learning data comes with descriptions, labels, targets, or desired
outputs, and the objective is to find a general rule that maps inputs to outputs. This kind of
learning data is called labeled data. The learned rule is then used to label new data with
unknown outputs.

• Supervised learning involves building a machine learning model that is based on labeled
samples. For example, if we build a system to estimate the price of a plot of land or a house
based on various features, such as size, location, and so on, we first need to create a database
and label it. We need to teach the algorithm what features correspond to what prices. Based on
this data, the algorithm will learn how to calculate the price of real estate using the values of the
input features

Unsupervised Learning
• Unsupervised learning is used to detect anomalies, outliers, such as fraud or defective
equipment, or to group customers with similar behaviors for a sales campaign. It is the opposite
of supervised learning. There is no labeled data here.

When learning data contains only some indications without any description or labels, it is up to the coder
or to the algorithm to find the structure of the underlying data, to discover hidden patterns, or to determine
how to describe the data. This kind of learning data is called unlabeled data.

Current-Best-Hypothesis Search

The current-best-hypothesis approach was first described by John Stuart Mill in 1843.
The algorithm is extremely simple: if a new example is encountered that our hypothesis
misclassifies, then change the hypothesis as follows. (We improve the hypothesis by
modifying it whenever we encounter false positives, i.e. false instances we classify as
true, or false negatives, i.e. true instances we classify as false.)

If it is a false positive, specialize the hypothesis not to cover it. This can be done by
dropping disjuncts or adding new terms.
If it is a false negative, generalize the hypothesis by adding disjuncts or dropping
terms.

If no consistent specialization/generalization can be found, backtrack.

(Think of the Russell and Norvig example of two sets, the positive instances inside the
negative ones, and extending or shrinking the positive set as more data comes in.)

The earliest machine learning system to use this approach was the arch-learning
program of [Winston, 1970]. However, it naturally suffers from inefficiency in large
search spaces; after every modification the past instances must be checked to ensure
they are still classified correctly, and it is difficult to find good search heuristics for
generalizing or specializing the definition.

Least-Commitment Search

Rather than backtracking, this approach keeps all hypotheses that are consistent with
the data seen so far. As more data becomes available, this version space shrinks. The
algorithm for doing this is called the candidate elimination learning algorithm or
the version space learning algorithm [Mitchell, 1977], and consists simply of
constraining the version space to be consistent with all data seen so far, updating it as
each new instance is seen.

This is a least-commitment approach since it never favors one possible hypothesis
over another; all remaining hypotheses are consistent. Note that this method implicitly
assumes that there is a partial ordering on all of the hypotheses in the space, the more-
specific-than ordering. A generalization G1 is more specific than a generalization G2 if
it matches a proper subset of the instances that G2 matches.

The obvious problem is that this method potentially requires an enormous number of
hypotheses to record. The solution is to use boundary sets to circumscribe the space of
possible hypotheses. The two boundary sets are the G-set (the most general boundary)
and the S-set (the most specific boundary). Every member of the G-set is consistent,
and there are no more general consistent hypotheses; every member of the S-set is
consistent and there are no more specific hypotheses.

The algorithm works as follows. Initially, the G-set is simply True, and the S-
set False. For every new instance, there are four possible cases:

 false positive for Si -- The hypothesis Si is too general, but as it has no
consistent specializations, we throw it out.
 false negative for Si -- Si is too specific, so we replace it by all of its immediate
generalizations.
 false positive for Gi -- Gi is too general, so we replace it by all of its immediate
specializations.
 false negative for Gi -- The hypothesis Gi is too specific, but as it has no
consistent generalizations, we throw it out.

This process is repeated until one of three things happens. Eventually, either there is
only one hypothesis left in the version space (in which case we return it), the version
space collapses (either S or G becomes empty), meaning there is no consistent
hypothesis, or we run out of examples and our version space still has several
hypotheses, so we can use their collective evaluation (breaking disagreements with
majority vote).
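A compact sketch of this boundary-set bookkeeping, restricted (as a simplifying
assumption) to conjunctive hypotheses over attribute values, where '?' is a wildcard.
In this restricted space the S-boundary is a single hypothesis, so only the G-set needs
list handling; the example instances at the bottom are hypothetical.

def consistent(h, x):
    # h is a tuple of attribute values or '?' wildcards; x is an instance tuple.
    return all(hv == '?' or hv == xv for hv, xv in zip(h, x))

def candidate_elimination(examples, n_attrs):
    S = None                                    # most specific boundary
    G = [tuple('?' for _ in range(n_attrs))]    # most general boundary
    for x, positive in examples:
        if positive:
            G = [g for g in G if consistent(g, x)]    # drop inconsistent g
            # Minimal generalization of S: wildcard the disagreeing attributes.
            S = tuple(x) if S is None else tuple(
                sv if sv == xv else '?' for sv, xv in zip(S, x))
        else:
            if S is not None and consistent(S, x):
                return None, None               # version space collapses
            new_G = []
            for g in G:
                if not consistent(g, x):
                    new_G.append(g)
                    continue
                # Minimal specializations of g that exclude x but still cover S.
                for i in range(n_attrs):
                    if g[i] == '?' and S is not None and S[i] != x[i]:
                        new_G.append(g[:i] + (S[i],) + g[i + 1:])
            G = new_G
    return S, G

examples = [(("sunny", "warm"), True),
            (("rainy", "cold"), False),
            (("sunny", "cold"), True)]
S, G = candidate_elimination(examples, 2)
print(S, G)   # ('sunny', '?') [('sunny', '?')] -- S and G have converged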

The main problems with this approach are two. If there is any noise, or
insufficient attributes for classification, the version space will collapse. Also, if
unlimited disjunction is allowed in the hypothesis space, S will contain only the most-
specific hypothesis (the conjunction of the positive examples), and G will contain
only the most-general hypothesis (the negation of the disjunctions of the negative
examples). The latter problem can partially be solved by using a generalization
hierarchy.

Such learning systems cannot handle noisy data. One solution is to maintain
several S and G sets, consistent with decreasing numbers of training instances.

The pure version-space algorithm was first used in META-DENDRAL [Buchanan
and Mitchell, 1978]. It was also used in LEX [Mitchell, 1983] which learned to solve
symbolic integration problems.

Computational Complexity
Strategy                          Time Complexity                      Space Complexity
DFS (current best hypothesis)     O(p n)                               O(p + n)
Version Space                     O(s g (p + n) + s^2 p + g^2 n)       O(s + g)

In this table, p and n are the number of positive and negative training instances,
respectively, and s and g are the largest sizes of sets S and G.

The complexity of the depth-first approach stems from the need to reexamine old
instances after each revision to the hypothesis; in particular, each time a positive
instance forces the hypothesis to be changed, all past negative hypotheses must be
examined. Similarly, revising the hypothesis in response to negative instances requires
reexamining all positive instances.

In the version space strategy, no training instances need be saved, so the space
complexity is just the largest sizes of S and G. Notice that for this strategy, processing
time grows linearly with the number of training instances. However, it grows as the
square of the sizes of the boundary sets.

PAC Learning

[Introduced by Valiant, 1984] The field of computational learning theory attempts to
provide a theoretical framework for learning systems. The question is, how can we
justify the accuracy of a hypothesis h with respect to some function f if we don't know
what f is?

The argument is, roughly, that incorrect hypotheses will be discovered relatively
quickly because they will incorrectly identify instances. Any hypothesis that is
consistent with a large enough set of training examples is unlikely to be seriously
wrong. That is, it is probably approximately correct. PAC learning is the part of
computational learning theory that studies this idea.

The key assumption underlying this argument is that the training set and the test
set are randomly drawn from the same population of examples, and that they are
drawn using the same probability distribution. This idea (due to Valiant) is
the stationarity assumption. It is required to associate any future instances with the
ones seen so far.

In particular, we would like to bound the likelihood that our hypothesis is not within a
certain range of the correct hypothesis. Specifically, we define the error of a
hypothesis thus:

error(h) = Pr(h(x) != f(x) | x drawn from D)

(where D is a distribution over the samples). A hypothesis h is called approximately
correct if error(h) <= e. An approximately correct hypothesis can be thought of as
lying within an e-ball of the true hypothesis in the hypothesis space. All hypotheses
outside of this ball are in the set Hbad.

We calculate the probability that a hypothesis hb is in Hbad as follows. By
supposition, error(hb) > e. Thus the probability that it agrees with any one example is
less than 1 − e, and the bound for agreeing with all m examples is <= (1 − e)^m.
For Hbad to contain a consistent hypothesis, at least one of its hypotheses must be
consistent. The probability of this occurring is bounded by the sum of the individual
probabilities, thus

P(Hbad contains a consistent hypothesis) <= |Hbad|(1 − e)^m <= |H|(1 − e)^m.

We would like to reduce the probability of this event to some number d,
the confidence parameter, so that

|H|(1 − e)^m <= d

which we can do if we give at least

m >= (1/e) (ln(1/d) + ln |H|)

training examples to the system. Note that this approach assumes that the true
function f is somewhere in the hypothesis space. Roughly speaking, by a theorem of
Blumer, the number of examples required for learning is proportional to the log of the
size of the hypothesis space.
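The bound is directly computable. A small sketch, taking Boolean conjunctions over n
attributes as the hypothesis space (|H| is roughly 3^n, since each attribute can appear
positively, negatively, or not at all); the parameter values are arbitrary.

import math

def pac_sample_bound(e, d, h_size):
    # m >= (1/e) * (ln(1/d) + ln|H|)
    return math.ceil((1 / e) * (math.log(1 / d) + math.log(h_size)))

n = 10                                  # number of Boolean attributes
H = 3 ** n                              # conjunctions: each attribute in 3 states
print(pac_sample_bound(0.05, 0.05, H))  # about 280 examples suffice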

Theorem [Blumer et al., 1989]: A space of hypotheses H is PAC-learnable iff it has
finite VC-dimension.

Some examples of PAC-learnable classes in polynomial time: k-CNF, k-DNF, k-DL,
Boolean conjunctions. In contrast, k-term-DNF and k-3NN (a 3-layer neural network)
are NP-hard PAC-learning problems. Note that the size of the hypothesis space is not
necessarily the whole problem; k-term-DNF is a proper subset of k-CNF, but the
former is not polynomially learnable while the latter is. Note also that this means in
order to prove that learning a concept class is computationally intractable, it must be
shown intractable regardless of the representation employed by
the learning algorithm.

Connectionist Learning

The connectionist approach is inspired by the model of the human brain as an
enormous parallel computer, in which small computational units feed simple, low-
bandwidth signals to one another, and from which intelligence arises. It attempts to
replicate this behavior on an abstract level with what are called neural networks, an
idea introduced by McCullough and Pitts in 1943.

Neural Networks
More specifically, a neural network consists of a set of nodes (or units), links that
connect one node to another, and weights associated with each link. Some nodes
receive inputs via links; others directly from the environment, and some nodes send
outputs out of the network. Learning usually occurs by adjusting the weights on the
links.

Each unit has a set of weighted inputs, an activation level, and a way to compute its
activation level at the next time step. This is done by applying an activation
function to the weighted sum of the node's inputs. Generally, the weighted sum (also
called the input function) is a strictly linear sum, while the activation function may be
nonlinear. If the value of the activation function is above a threshold, the node "fires."

Generally, all nodes share the same activation function and threshold value, and only
the topology and weights change.

Network Structures

The two fundamental types of network structure are feed-forward and recurrent. A
feed-forward network is a directed acyclic graph; information flows in one direction
only, and there are no cycles. Such networks cannot represent internal state.
Usually, neural networks are also layered, meaning that nodes are organized into
groups of layers, and links only go from nodes to nodes in adjacent layers.

Recurrent networks allow loops, and as a result can represent state, though they are
much more complex to analyze. Hopfield networks and Boltzmann machines are
examples of recurrent networks; Hopfield networks are the best understood. All
connections in Hopfield networks are bidirectional with symmetric weights, all units
have outputs of 1 or -1, and the activation function is the sign function. Also, all nodes
in a Hopfield network are both input and output nodes. Interestingly, it has been
shown that a Hopfield network can reliably recognize 0.138N training examples,
where N is the number of units in the network.

Boltzmann machines allow non-input/output units, and they use a stochastic
evaluation function that is based upon the sum of the total weighted input. Boltzmann
machines are formally equivalent to a certain kind of belief network evaluated with a
stochastic simulation algorithm.

One problem in building neural networks is deciding on the initial topology, e.g., how
many nodes there are and how they are connected. Genetic algorithms have been used
to explore this problem, but it is a large search space and this is a computationally
intensive approach. The optimal brain damage method uses information theory to
determine whether weights can be removed from the network without loss of
performance, and possibly improving it. The alternative of making the network larger
has been tested with the tiling algorithm [Mezard and Nadal, 1989] which takes an
approach similar to induction on decision trees; it expands a unit by adding new ones
to cover instances it misclassified. Cross-validation techniques can be used to
determine when the network size is right.

Perceptrons

Perceptrons are single-layer, feed-forward networks that were first studied in the
1950's. They are only capable of learning linearly separable functions. That is, if we
view F features as defining an F-dimensional space, the network can recognize any
class that involves placing a single hyperplane between the instances of two classes.
So, for example, they can easily represent AND, OR, or NOT, but cannot
represent XOR.

Perceptrons learn by updating the weights on their links in response to the difference
between their output value and the correct output value. The updating rule (due to
Frank Rosenblatt, 1960) is as follows. Define Err as the difference between the correct
output and actual output. Then the learning rule for each weight is

Wj <- Wj + A x Ij x Err

where A is a constant called the learning rate.
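A short sketch of the rule in action, learning the linearly separable AND function; the
learning rate, epoch count, and the use of an explicit bias weight are our own choices
for illustration.

def step(z):
    return 1 if z >= 0 else 0

def train_perceptron(samples, epochs=20, A=0.1):
    w = [0.0, 0.0]       # one weight per input
    b = 0.0              # bias weight (moves the threshold)
    for _ in range(epochs):
        for inputs, target in samples:
            out = step(sum(wi * xi for wi, xi in zip(w, inputs)) + b)
            err = target - out
            # Rosenblatt's rule: Wj <- Wj + A * Ij * Err (the bias uses I = 1).
            w = [wi + A * xi * err for wi, xi in zip(w, inputs)]
            b += A * err
    return w, b

and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(and_data)
for x, _ in and_data:
    print(x, step(w[0] * x[0] + w[1] * x[1] + b))   # 0, 0, 0, 1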

Of course, this was too good to last, and in Perceptrons [Minsky and Papert, 1969] it
was observed how limited linearly separable functions were. Work on perceptrons
withered, and neural networks didn't come into vogue again until the 1980's, when
multi-layer networks became the focus.

Multi-Layer Feed-Forward Networks

The standard method for learning in multi-layer feed-forward networks is back-
propagation [Bryson and Ho, 1969]. Such networks have an input layer, and output
layer, and one or more hidden layers in between. The difficulty is to divide the blame
for an erroneous output among the nodes in the hidden layers.

The back-propagation rule is similar to the perceptron learning rule. If Erri is the error
at the output node, then the weight update for the link from unit j to unit i (the output
node) is

Wj,i <- Wj,i + A x aj x Erri x g'(ini)

where g' is the derivative of the activation function, and aj is the activation of the
unit j. (Note that this means the activation function must have a derivative, so the
sigmoid function is usually used rather than the step function.) Define Di as Erri x
g'(ini).

This updates the weights leading to the output node. To update the weights on the
interior links, we use the idea that the hidden node j is responsible for part of the error
in each of the nodes to which it connects. Thus the error at the output is divided
according to the strength of the connection between the output node and the hidden
node, and propagated backward to previous layers. Specifically,

Dj = g'(inj) x SUMi Wj,i Di

Thus the updating rule for internal nodes is

Wj,i <- Wj,i + A x aj x Di.

Lastly, the weight updating rule for the weights from the input layer to the hidden
layer is

Wk,j <- Wk,j + A x Ik x Dj

where k is the input node and j the hidden node, and Ik is the input value of k.
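The following sketch performs one such update for a tiny 2-input, 2-hidden, 1-output
network with sigmoid units (for which g'(in) = a(1 − a)); the random initial weights
and the learning rate are arbitrary choices, not prescribed by the rule.

import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

random.seed(0)
A = 0.5                                                               # learning rate
W_in = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]  # W[k][j]
W_out = [random.uniform(-1, 1) for _ in range(2)]                     # W[j,i]

def backprop_step(x, target):
    # Forward pass.
    in_h = [sum(W_in[k][j] * x[k] for k in range(2)) for j in range(2)]
    a_h = [sigmoid(v) for v in in_h]
    a_out = sigmoid(sum(W_out[j] * a_h[j] for j in range(2)))
    # Delta at the output node: Err * g'(in_i).
    d_out = (target - a_out) * a_out * (1 - a_out)
    # Deltas at hidden nodes: g'(in_j) * W[j,i] * D_i, propagated backward.
    d_h = [a_h[j] * (1 - a_h[j]) * W_out[j] * d_out for j in range(2)]
    # Weight updates: W <- W + A * activation * delta.
    for j in range(2):
        W_out[j] += A * a_h[j] * d_out
        for k in range(2):
            W_in[k][j] += A * x[k] * d_h[j]
    return a_out

print(backprop_step([1.0, 0.0], 1.0))   # network output before this update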

A neural network requires 2^n/n hidden units to represent all Boolean functions
of n inputs. For m training examples and W weights, each epoch in
the learning process takes O(mW) time; but in the worst case, the number
of epochs can be exponential in the number of inputs.

In general, if the number of hidden nodes is too large, the network may learn only the
training examples, while if the number is too small it may never converge on a set of
weights consistent with the training examples.

Multi-layer feed-forward networks can represent any continuous function with a
single hidden layer, and any function with two hidden layers [Cybenko, 1988, 1989].

Applications of Neural Networks

John Denker remarked that "neural networks are the second best way of doing just
about anything." They provide passable performance on a wide variety of problems
that are difficult to solve well using other methods.

NETtalk [Sejnowski and Rosenberg, 1987] was designed to learn how to pronounce
written text. Input was a seven-character window centered on the target character, and
output was a set of Booleans controlling the form of the sound to be produced. It
reached 95% accuracy on its training set, but only 78% accuracy on the test set. Not
spectacularly good, but important because it impressed many people with the potential
of neural networks.

Other applications include a ZIP code recognition [Le Cun et al., 1989] system that
achieves 99% accuracy on handwritten codes, and driving [Pomerleau, 1993] in the
ALVINN system at CMU. ALVINN controls the NavLab vehicles, and translates
inputs from a video image into steering control directions. ALVINN performs
exceptionally well on the particular road-type it learns, but poorly on other terrain
types. The extended MANIAC system [Jochem et al., 1993] has multiple ALVINN
subnets combined to handle different road types.

Bayesian Learning in Belief Networks

(NOTE: As I am largely unfamiliar with belief networks, I would take the material in
this section with a grain of salt, if I were you. -- POD)

Bayesian learning maintains a number of hypotheses about the data, each one
weighted by its posterior probability when a prediction is made. The idea is that, rather
than keeping only one hypothesis, many are entertained, and weighted based on their
likelihoods.

Of course, maintaining and reasoning with a large number of hypotheses can be
intractable. The most common approximation is to use a most probable hypothesis,
that is, an Hi of H that maximizes P(Hi | D), where D is the data. This is often called
the maximum a posteriori (MAP) hypothesis HMAP:

P(X | D) ~= P(X | HMAP) x P(HMAP | D)

To find HMAP, we apply Bayes' rule:

P(Hi | D) = [P(D | Hi) x P(Hi)] / P(D)

Since P(D) is fixed across the hypotheses, we only need to maximize the numerator.
The first term represents the probability that this particular data set would be seen,
given Hi as the model of the world. The second is the prior probability assigned to the
model.
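A tiny numerical sketch of MAP selection, with made-up priors and data: two coin
hypotheses are scored by likelihood times prior, and the maximizer is returned
(P(D) is constant across hypotheses, so it can be ignored).

# Hypotheses: probability of heads, with a prior for each (hypothetical numbers).
hypotheses = {"fair": (0.5, 0.7), "biased": (0.9, 0.3)}
data = ["H", "H", "H", "T", "H"]

def unnormalized_posterior(p_heads, prior):
    likelihood = 1.0
    for d in data:
        likelihood *= p_heads if d == "H" else 1 - p_heads
    return likelihood * prior            # P(D | Hi) * P(Hi)

scores = {h: unnormalized_posterior(p, pr) for h, (p, pr) in hypotheses.items()}
print(max(scores, key=scores.get), scores)   # 'fair' wins on this data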

Belief Network Learning Problems


There are four kinds of belief networks, depending upon whether the structure of the
network is known or unknown, and whether the variables in the network are
observable or hidden.

 known structure, fully observable -- In this case the only learnable part is the
conditional probability tables. These can be estimated directly using the
statistics of the sample data set.

 unknown structure, fully observable -- Here the problem is to reconstruct the
network topology. The problem can be thought of as a search through structure
space, and fitting data to each structure reduces to the fixed-structure problem,
so the MAP or ML probability value can be used as a heuristic in hill-climbing
or SA search.

 known structure, hidden variables -- This is analogous to neural
network learning.

 unknown structure, hidden variables -- When some variables are
unobservable, it becomes difficult to apply prior techniques for recovering
structure, because they require averaging over all possible values of the unknown
variables. No good general algorithms are known for handling this case.

Comparison of Belief and Neural Networks

The two formalisms can be compared as representations, as inference systems, and
as learning systems.

Both kinds of network are attribute-based representations. Both can handle either
discrete or continuous output. The major difference is that belief networks are
localized representations, while neural networks are distributed. Belief network nodes
represent propositions with clearly defined semantics and relationships to other nodes.
In neural networks, nodes generally don't represent specific propositions, and the
calculations would not treat them in a semantically-meaningful way. The effect is that
human beings can neither construct nor understand neural network representations,
whereas both can be done with belief networks.

Belief networks handle two kinds of activation, both in terms of the values a
proposition may take, and the probabilities assigned to each value. Neural
network outputs could be values or probabilities, but they cannot handle both
simultaneously.

Trained feed-forward neural network inference can execute in linear time, where in
belief networks inference is NP-hard. However, a neural network may have to be
exponentially larger to represent the same things that a belief network can.

As for learning, belief networks have the advantage of being easier to give prior
knowledge; also, since they represent propositions locally, it may be easier for them to
converge, since they are directly affected only by a small number of other
propositions.

Reinforcement Learning

As opposed to supervised learning, reinforcement learning takes place in an
environment where the agent cannot directly compare the results of its action to a
desired result. Instead, it is given some reward or punishment that relates to its
actions. It may win or lose a game, or be told it has made a good move or a poor one.
The job of reinforcement learning is to find a successful function using these rewards.

The reason reinforcement learning is harder than supervised learning is that the agent
is never told what the right action is, only whether it is doing well or poorly, and in
some cases (such as chess) it may only receive feedback after a long string of actions.

There are two basic kinds of information an agent can try to learn.

 utility function -- The agent learns the utility of being in various states, and
chooses actions to maximize the expected utility of their outcomes. This
requires the agent keep a model of the environment.

 action-value -- The agent learns an action-value function giving the expected
utility of performing an action in a given state. This is called Q-learning, and it
is the model-free approach; see the sketch below.
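A minimal tabular Q-learning sketch on a toy one-dimensional corridor (a throwaway
environment of our own; all parameters are illustrative). The agent is never told the
right action, only the reward at the goal, yet the learned Q-values come to favor
moving right.

import random

GOAL = 4                                  # states 0..4, reward 1 at state 4
ACTIONS = (-1, +1)                        # move left / move right

def env_step(s, a):
    s2 = min(max(s + a, 0), GOAL)
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

def q_learning(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.1):
    Q = {(s, a): 0.0 for s in range(GOAL + 1) for a in ACTIONS}
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Epsilon-greedy action selection.
            a = (random.choice(ACTIONS) if random.random() < epsilon
                 else max(ACTIONS, key=lambda act: Q[(s, act)]))
            s2, r, done = env_step(s, a)
            # Move Q(s,a) toward r + gamma * max over a' of Q(s',a').
            Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, a2)] for a2 in ACTIONS)
                                  - Q[(s, a)])
            s = s2
    return Q

Q = q_learning()
print([max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)])  # mostly +1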

Genetic Algorithms - Introduction


Genetic Algorithm (GA) is a search-based optimization technique based on the
principles of Genetics and Natural Selection. It is frequently used to find optimal or
near-optimal solutions to difficult problems which otherwise would take a lifetime to
solve. It is frequently used to solve optimization problems, in research, and in machine
learning.

Introduction to Optimization
Optimization is the process of making something better. In any process, we have a
set of inputs and a set of outputs.
Optimization refers to finding the values of inputs in such a way that we get the “best”
output values. The definition of “best” varies from problem to problem, but in
mathematical terms, it refers to maximizing or minimizing one or more objective
functions, by varying the input parameters.
The set of all possible solutions or values which the inputs can take make up the
search space. In this search space, lies a point or a set of points which gives the
optimal solution. The aim of optimization is to find that point or set of points in the
search space.

What are Genetic Algorithms?


Nature has always been a great source of inspiration to all mankind. Genetic
Algorithms (GAs) are search based algorithms based on the concepts of natural
selection and genetics. GAs are a subset of a much larger branch of computation
known as Evolutionary Computation.
GAs were developed by John Holland and his students and colleagues at the
University of Michigan, most notably David E. Goldberg, and have since been applied
to various optimization problems with a high degree of success.
In GAs, we have a pool or a population of possible solutions to the given problem.
These solutions then undergo recombination and mutation (like in natural genetics),
producing new children, and the process is repeated over various generations. Each
individual (or candidate solution) is assigned a fitness value (based on its objective
function value) and the fitter individuals are given a higher chance to mate and yield
more “fitter” individuals. This is in line with the Darwinian Theory of “Survival of the
Fittest”.
In this way we keep “evolving” better individuals or solutions over generations, till we
reach a stopping criterion.
Genetic Algorithms are sufficiently randomized in nature, but they perform much better
than random local search (in which we just try various random solutions, keeping track
of the best so far), as they exploit historical information as well.
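A minimal GA sketch on the classic OneMax problem (maximize the number of 1-bits in
a bitstring); the population size, tournament selection, single-point crossover, and
mutation rate are all illustrative choices, not prescribed by the method.

import random

random.seed(1)
GENES, POP, GENERATIONS, MUT = 20, 30, 60, 0.02

def fitness(ind):
    return sum(ind)                       # OneMax: count the 1-bits

def tournament(pop):
    return max(random.sample(pop, 3), key=fitness)

def crossover(p1, p2):
    cut = random.randrange(1, GENES)      # single-point crossover
    return p1[:cut] + p2[cut:]

def mutate(ind):
    return [1 - g if random.random() < MUT else g for g in ind]

pop = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP)]
for _ in range(GENERATIONS):
    pop = [mutate(crossover(tournament(pop), tournament(pop)))
           for _ in range(POP)]
best = max(pop, key=fitness)
print(fitness(best), best)                # fitness at or near GENES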
Advantages of GAs
GAs have various advantages which have made them immensely popular. These
include −
 They do not require any derivative information (which may not be available for
many real-world problems).
 They are faster and more efficient than the traditional methods.
 They have very good parallel capabilities.
 They optimize both continuous and discrete functions, as well as multi-objective
problems.
 They provide a list of “good” solutions, not just a single solution.
 They always get an answer to the problem, and the answer gets better over time.
 They are useful when the search space is very large and a large number of
parameters are involved.
Limitations of GAs
Like any technique, GAs also suffer from a few limitations. These include −
 GAs are not suited to all problems, especially problems which are simple and for
which derivative information is available.
 The fitness value is calculated repeatedly, which might be computationally
expensive for some problems.
 Being stochastic, there are no guarantees on the optimality or the quality of the
solution.
 If not implemented properly, the GA may not converge to the optimal solution.


Artificial Neural Network Tutorial

This Artificial Neural Network tutorial provides basic and advanced concepts of ANNs,
and is developed for beginners as well as professionals.

The term "Artificial neural network" refers to a biologically inspired sub-field of artificial
intelligence modeled after the brain. An Artificial neural network is usually a computational
CS6659-Artificial Page
network based on biological neural networks that construct the structure of the human
brain. Similar to a human brain has neurons interconnected to each other, artificial neural
networks also have neurons that are linked to each other in various layers of the networks.
These neurons are known as nodes.

Artificial neural network tutorial covers all the aspects related to the artificial neural
network. In this tutorial, we will discuss ANNs, Adaptive resonance theory, Kohonen self-
organizing map, Building blocks, unsupervised learning, Genetic algorithm, etc.

What is Artificial Neural Network?


The term "Artificial Neural Network" is derived from Biological neural networks that
develop the structure of a human brain. Similar to the human brain that has neurons
interconnected to one another, artificial neural networks also have neurons that are
interconnected to one another in various layers of the networks. These neurons are known
as nodes.

The given figure illustrates the typical diagram of Biological Neural Network.

The typical Artificial Neural Network looks something like the given figure.

Dendrites from Biological Neural Network represent inputs in Artificial Neural Networks, cell
nucleus represents Nodes, synapse represents Weights, and Axon represents Output.

Relationship between Biological neural network and artificial neural network:

Biological Neural Network    Artificial Neural Network
Dendrites                    Inputs
Cell nucleus                 Nodes
Synapse                      Weights
Axon                         Output

An artificial neural network attempts to mimic the network of neurons that makes up a
human brain, so that computers will have an option to understand things and make
decisions in a human-like manner. The artificial neural network is designed by
programming computers to behave simply like interconnected brain cells.

There are around 1000 billion neurons in the human brain, and each neuron has
between 1,000 and 100,000 association points. In the human brain, data is stored in a
distributed manner, and we can extract more than one piece of this data from memory
in parallel when necessary. We can say that the human brain is made up of incredibly
amazing parallel processors.

We can understand the artificial neural network with an example. Consider a digital
logic gate that takes an input and gives an output, such as an "OR" gate with two
inputs. If one or both of the inputs are "On," the output is "On"; if both of the inputs are
"Off," the output is "Off." Here the output depends only on the input. Our brain does not
perform the same task: the output-to-input relationship keeps changing, because the
neurons in our brain are "learning."
The architecture of an artificial neural network:
To understand the architecture of an artificial neural network, we have to understand
what a neural network consists of: a large number of artificial neurons, termed units,
arranged in a sequence of layers. Let us look at the various types of layers available in
an artificial neural network.

Artificial Neural Network primarily consists of three layers:

Input Layer:

As the name suggests, it accepts inputs in several different formats provided by the
programmer.

Hidden Layer:

The hidden layer sits between the input and output layers. It performs all the
calculations to find hidden features and patterns.

Output Layer:

The input goes through a series of transformations using the hidden layer, which finally
results in output that is conveyed using this layer.

The artificial neural network takes input and computes the weighted sum of the inputs and
includes a bias. This computation is represented in the form of a transfer function.

The weighted total is then passed as input to an activation function to produce the
output. Activation functions decide whether a node should fire or not; only the nodes
that fire make it to the output layer. There are distinctive activation functions available
that can be applied depending on the sort of task we are performing.
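In code, the computation just described is only a few lines; the weights, inputs, bias,
and choice of sigmoid activation below are illustrative assumptions.

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

inputs = [0.5, 0.3, 0.2]
weights = [0.4, 0.7, -0.2]
bias = 0.1

# Weighted sum of the inputs plus the bias (the transfer function)...
weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
# ...passed through an activation function to decide the node's output.
output = sigmoid(weighted_sum)
print(weighted_sum, output)   # 0.47 and about 0.615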

Advantages of Artificial Neural Network (ANN)


Parallel processing capability:

Artificial neural networks can perform more than one task simultaneously.

Storing data on the entire network:

Data that would sit in a database under traditional programming is stored on the whole
network in an ANN. The disappearance of a couple of pieces of data in one place does
not prevent the network from working.

Capability to work with incomplete knowledge:

After training, an ANN may produce output even with incomplete data. The loss of
performance here depends on the significance of the missing data.

Having a memory distribution:

For an ANN to be able to adapt, it is important to determine the examples and to teach
the network according to the desired output by demonstrating these examples to it. The
success of the network is directly proportional to the chosen instances; if the event
cannot be shown to the network in all its aspects, it can produce false output.

Having fault tolerance:

Corruption of one or more cells of an ANN does not prevent it from generating output,
and this feature makes the network fault-tolerant.

Disadvantages of Artificial Neural Network:


Assurance of proper network structure:

There is no particular guideline for determining the structure of artificial neural networks.
The appropriate network structure is accomplished through experience, trial, and error.

Unrecognized behavior of the network:
It is the most significant issue of ANNs. When an ANN produces a solution, it does not
provide insight concerning why and how, which decreases trust in the network.

Hardware dependence:

Artificial neural networks need processors with parallel processing power, in
accordance with their structure. Therefore, their realization is hardware-dependent.

Difficulty of showing the issue to the network:

ANNs can only work with numerical data, so problems must be converted into numerical
values before being introduced to the network. The representation mechanism chosen here
directly impacts the performance of the network, and it relies on the user's abilities.

UNIT 7
Natural Language Processing
Natural Language Processing (NLP) refers to the AI method of communicating with an
intelligent system using a natural language such as English.
Processing of natural language is required when you want an intelligent system such as a
robot to perform as per your instructions, when you want to hear a decision from a
dialogue-based clinical expert system, etc.
The field of NLP involves making computers perform useful tasks with the natural
languages humans use. The input and output of an NLP system can be −

 Speech
 Written Text
Components of NLP
There are two components of NLP as given −
Natural Language Understanding (NLU)
Understanding involves the following tasks −

 Mapping the given input in natural language into useful representations.


 Analyzing different aspects of the language.
Natural Language Generation (NLG)
It is the process of producing meaningful phrases and sentences in the form of natural
language from some internal representation.
It involves −
 Text planning − It includes retrieving the relevant content from knowledge base.
 Sentence planning − It includes choosing required words, forming meaningful
phrases, setting tone of the sentence.
 Text Realization − It is mapping sentence plan into sentence structure.
NLU is harder than NLG.

Difficulties in NLU
NL has an extremely rich form and structure.
It is very ambiguous. There can be different levels of ambiguity −
 Lexical ambiguity − It is at a very primitive level, such as the word level.
 For example, should the word “board” be treated as a noun or a verb?
 Syntax Level ambiguity − A sentence can be parsed in different ways.
 For example, “He lifted the beetle with red cap.” − Did he use a cap to lift the
beetle, or did he lift a beetle that had a red cap?
 Referential ambiguity − Referring to something using pronouns. For example,
Rima went to Gauri. She said, “I am tired.” − Exactly who is tired?
 One input can have different meanings.
 Many inputs can mean the same thing.

NLP Terminology
 Phonology − It is the study of organizing sound systematically.
 Morphology − It is the study of the construction of words from primitive meaningful
units.
 Morpheme − It is a primitive unit of meaning in a language.
 Syntax − It refers to arranging words to make a sentence. It also involves
determining the structural role of words in the sentence and in phrases.
 Semantics − It is concerned with the meaning of words and how to combine
words into meaningful phrases and sentences.
 Pragmatics − It deals with using and understanding sentences in different
situations and how the interpretation of the sentence is affected.
 Discourse − It deals with how the immediately preceding sentence can affect
the interpretation of the next sentence.
 World Knowledge − It includes the general knowledge about the world.

Steps in NLP
There are general five steps −
 Lexical Analysis − It involves identifying and analyzing the structure of words.
The lexicon of a language is the collection of words and phrases in that language.
Lexical analysis divides the whole chunk of text into paragraphs, sentences,
and words (see the sketch after this list).
 Syntactic Analysis (Parsing) − It involves analysis of words in the sentence for
grammar and arranging words in a manner that shows the relationship among
the words. The sentence such as “The school goes to boy” is rejected by
English syntactic analyzer.

 Semantic Analysis − It draws the exact meaning or the dictionary meaning from
the text. The text is checked for meaningfulness. This is done by mapping syntactic
structures to objects in the task domain. The semantic analyzer disregards
sentences such as “hot ice-cream”.
 Discourse Integration − The meaning of any sentence depends upon the
meaning of the sentence just before it. It can also shape the meaning of the
immediately succeeding sentence.
 Pragmatic Analysis − Here, what was said is re-interpreted in terms of what it
actually meant. It involves deriving those aspects of language which require real-
world knowledge.
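As a rough illustration of the first step only, the lexical-analysis pass can be sketched in a few lines of Python. The sample text is made up, and a real system would use a proper tokenizer rather than these simple regular expressions:

import re

text = "The bird pecks the grains. Rima went to Gauri."

# Lexical analysis: divide the chunk of text into sentences, then words.
sentences = [s.strip() for s in re.split(r'[.!?]', text) if s.strip()]
words = [re.findall(r'\w+', s) for s in sentences]

print(sentences)  # ['The bird pecks the grains', 'Rima went to Gauri']
print(words[0])   # ['The', 'bird', 'pecks', 'the', 'grains']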

Implementation Aspects of Syntactic Analysis


There are a number of algorithms researchers have developed for syntactic analysis,
but we consider only the following simple methods −
 Context-Free Grammar
 Top-Down Parser
Let us see them in detail −
Context-Free Grammar
It is a grammar whose rewrite rules have a single symbol on the left-hand side.
Let us create a grammar to parse a sentence −
“The bird pecks the grains”
Articles (DET) − a | an | the
Nouns − bird | birds | grain | grains
Noun Phrase (NP) − Article + Noun | Article + Adjective + Noun
= DET N | DET ADJ N
Verbs − pecks | pecking | pecked
Verb Phrase (VP) − NP V | V NP
Adjectives (ADJ) − beautiful | small | chirping
The parse tree breaks down the sentence into structured parts so that the computer
can easily understand and process it. In order for the parsing algorithm to construct this
parse tree, a set of rewrite rules, which describe what tree structures are legal, need to
be constructed.
These rules say that a certain symbol may be expanded in the tree by a sequence of
other symbols. According to the first rewrite rule, if there are two strings, a Noun Phrase
(NP) and a Verb Phrase (VP), then the string formed by NP followed by VP is a
sentence. The rewrite rules for the sentence are as follows −
S → NP VP
NP → DET N | DET ADJ N
VP → V NP
Lexicon −
DET → a | the
ADJ → beautiful | perching
N → bird | birds | grain | grains
V → peck | pecks | pecking
The parse tree can be created as shown −

[Fig: parse tree for “The bird pecks the grains”]
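As a sketch (not part of the original example), the rewrite rules and lexicon above can be encoded as plain Python data and used to generate sentences by repeatedly expanding symbols; a real system would use a parsing library instead:

import random

# The rewrite rules and lexicon above, encoded as plain Python data.
RULES = {
    "S":   [["NP", "VP"]],
    "NP":  [["DET", "N"], ["DET", "ADJ", "N"]],
    "VP":  [["V", "NP"]],
    "DET": [["a"], ["the"]],
    "ADJ": [["beautiful"], ["perching"]],
    "N":   [["bird"], ["birds"], ["grain"], ["grains"]],
    "V":   [["peck"], ["pecks"], ["pecking"]],
}

def expand(symbol):
    # Terminals (actual words) expand to themselves.
    if symbol not in RULES:
        return [symbol]
    # Non-terminals expand by a randomly chosen rewrite rule.
    return [w for part in random.choice(RULES[symbol]) for w in expand(part)]

print(" ".join(expand("S")))  # e.g. "the bird pecks the grains"

Running the generator a few times also turns up strings such as "the bird peck the grains", which previews the weakness discussed next.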
Now consider the above rewrite rules. Since V can be replaced by either "peck" or
"pecks", sentences such as "The bird peck the grains" are wrongly permitted; i.e.,
the subject-verb agreement error is accepted as correct.
Merit − It is the simplest style of grammar and therefore the most widely used.
Demerits −
 They are not highly precise. For example, “The grains peck the bird” is
syntactically correct according to the parser; even though it makes no sense, the
parser takes it as a correct sentence.
 To achieve high precision, multiple sets of grammar need to be prepared.
Completely different sets of rules may be required for parsing singular and plural
variations, passive sentences, etc., which can lead to a huge, unmanageable set of
rules.
Top-Down Parser
Here, the parser starts with the S symbol and attempts to rewrite it into a sequence
of terminal symbols that matches the classes of the words in the input sentence until it
consists entirely of terminal symbols.
These are then checked against the input sentence to see if they match. If not, the process
starts over again with a different set of rules. This is repeated until a rule is found
that describes the structure of the sentence.
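A minimal top-down (recursive-descent) recognizer for the grammar above can be sketched in Python. This is a simplification that only answers yes/no rather than building the parse tree:

# A minimal top-down recognizer for the S -> NP VP grammar above (sketch only).
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["DET", "N"], ["DET", "ADJ", "N"]],
    "VP": [["V", "NP"]],
}
LEXICON = {
    "DET": {"a", "the"},
    "ADJ": {"beautiful", "perching"},
    "N":   {"bird", "birds", "grain", "grains"},
    "V":   {"peck", "pecks", "pecking"},
}

def parse(symbols, words):
    """Try to rewrite `symbols` so they match `words`; True on success."""
    if not symbols:
        return not words              # success only if all words are consumed
    head, rest = symbols[0], symbols[1:]
    if head in LEXICON:               # terminal class: must match the next word
        return bool(words) and words[0] in LEXICON[head] and parse(rest, words[1:])
    # Non-terminal: try each rewrite rule in turn, backtracking on failure.
    return any(parse(expansion + rest, words) for expansion in GRAMMAR[head])

print(parse(["S"], "the bird pecks the grains".split()))  # True
print(parse(["S"], "the school goes to boy".split()))     # False

The repeated backtracking over alternative rules is exactly the inefficiency listed as a demerit below.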
Merit − It is simple to implement.
Demerits −

 It is inefficient, as the search process has to be repeated if an error occurs.


 It works slowly.
What are Expert Systems?
Expert systems are computer applications developed to solve complex
problems in a particular domain, at the level of extraordinary human intelligence and
expertise.
Characteristics of Expert Systems

 High performance
 Understandable
 Reliable
 Highly responsive
Capabilities of Expert Systems
The expert systems are capable of −

 Advising
 Instructing and assisting humans in decision making
 Demonstrating
 Deriving a solution
 Diagnosing
 Explaining
 Interpreting input
 Predicting results
 Justifying the conclusion
 Suggesting alternative options to a problem
They are incapable of −

 Substituting human decision makers


 Possessing human capabilities
 Producing accurate output from an inadequate knowledge base
 Refining their own knowledge
Components of Expert Systems
The components of ES include −

 Knowledge Base
 Inference Engine
 User Interface
Let us see them one by one briefly −

Knowledge Base
It contains domain-specific and high-quality knowledge.
Knowledge is required to exhibit intelligence. The success of any ES majorly depends
upon the collection of highly accurate and precise knowledge.
What is Knowledge?
Data is a collection of facts. Information is data organized as facts about the
task domain. Data, information, and past experience combined together are termed
knowledge.
Components of Knowledge Base
The knowledge base of an ES is a store of both, factual and heuristic knowledge.
 Factual Knowledge − It is the information widely accepted by the Knowledge
Engineers and scholars in the task domain.

 Heuristic Knowledge − It is about practice, accurate judgement, one’s ability of
evaluation, and guessing.
Knowledge representation
It is the method used to organize and formalize the knowledge in the knowledge base.
It is in the form of IF-THEN-ELSE rules.
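As an illustration only, a single such rule might be held as a small Python record; the condition and conclusion names below are invented, not taken from any real system:

# A production rule as data: IF all conditions hold THEN assert the conclusion.
rule = {
    "if":   ["engine_cranks", "engine_does_not_start"],  # hypothetical conditions
    "then": "check_fuel_supply",                         # hypothetical conclusion
}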
Knowledge Acquisition
The success of any expert system majorly depends on the quality, completeness, and
accuracy of the information stored in the knowledge base.
The knowledge base is formed by readings from various experts, scholars, and
the Knowledge Engineers. The knowledge engineer is a person with the qualities of
empathy, quick learning, and case analyzing skills.
He acquires information from the subject expert by recording, interviewing, and observing
him at work, etc. He then categorizes and organizes the information in a meaningful
way, in the form of IF-THEN-ELSE rules, to be used by the inference engine. The
knowledge engineer also monitors the development of the ES.

Inference Engine
Use of efficient procedures and rules by the Inference Engine is essential for deducing
a correct, flawless solution.
In case of knowledge-based ES, the Inference Engine acquires and manipulates the
knowledge from the knowledge base to arrive at a particular solution.
In case of rule-based ES, it −
 Applies rules repeatedly to the facts, which are obtained from earlier rule
application.
 Adds new knowledge into the knowledge base if required.
 Resolves rules conflict when multiple rules are applicable to a particular case.
To recommend a solution, the Inference Engine uses the following strategies −

 Forward Chaining
 Backward Chaining
Forward Chaining
It is a strategy of an expert system to answer the question, “What can happen next?”
Here, the Inference Engine follows the chain of conditions and derivations and finally
deduces the outcome. It considers all the facts and rules and sorts them before
arriving at a solution.
This strategy is followed for working on a conclusion, result, or effect. For example,
predicting the share market status as an effect of changes in interest rates.
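A hedged sketch of forward chaining in Python, using invented facts and rules loosely based on the interest-rate example; the loop applies rules to the known facts until nothing new can be derived:

# Forward chaining: derive new facts from rules until a fixed point is reached.
rules = [
    ({"interest_rates_fall"}, "borrowing_rises"),
    ({"borrowing_rises"}, "company_profits_rise"),
    ({"company_profits_rise"}, "share_prices_rise"),
]
facts = {"interest_rates_fall"}

changed = True
while changed:
    changed = False
    for conditions, conclusion in rules:
        if conditions <= facts and conclusion not in facts:
            facts.add(conclusion)     # a rule fired: assert its conclusion
            changed = True

print(facts)  # includes 'share_prices_rise' -- the predicted effect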
Backward Chaining
With this strategy, an expert system finds out the answer to the question, “Why did this
happen?”
On the basis of what has already happened, the Inference Engine tries to find out
which conditions could have happened in the past for this result. This strategy is
followed for finding out cause or reason. For example, diagnosis of blood cancer in
humans.
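Backward chaining can be sketched the same way: start from the goal (the observed result) and recurse over rules whose conclusion matches it. The facts and rule names below are invented for the blood-cancer example:

# Backward chaining: a goal holds if it is known, or if some rule
# concludes it and all of that rule's conditions can themselves be proved.
rules = [
    ({"abnormal_blood_count", "positive_biopsy"}, "blood_cancer"),
    ({"fatigue", "frequent_infection"}, "abnormal_blood_count"),
]
known = {"fatigue", "frequent_infection", "positive_biopsy"}

def provable(goal):
    if goal in known:
        return True
    return any(conclusion == goal and all(provable(c) for c in conditions)
               for conditions, conclusion in rules)

print(provable("blood_cancer"))  # True: the cause is consistent with the evidence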

User Interface
The user interface provides interaction between the user of the ES and the ES itself. It
generally uses Natural Language Processing so that it can be used by a user who is well-
versed in the task domain. The user of the ES need not necessarily be an expert in
Artificial Intelligence.
It explains how the ES has arrived at a particular recommendation. The explanation
may appear in the following forms −

 Natural language displayed on screen.


 Verbal narrations in natural language.
 Listing of rule numbers displayed on the screen.
The user interface makes it easy to trace the credibility of the deductions.
Requirements of Efficient ES User Interface
 It should help users accomplish their goals in the shortest possible way.
 It should be designed to work with the user's existing or desired work practices.
 Its technology should be adaptable to the user's requirements, not the other way
round.
 It should make efficient use of user input.

Expert Systems Limitations


No technology can offer an easy and complete solution. Large systems are costly and require
significant development time and computer resources. ESs have their limitations, which
include −

 Limitations of the technology


 Difficult knowledge acquisition
 ES are difficult to maintain
 High development costs
Applications of Expert System
The following table shows where ES can be applied.

Application − Description

Design Domain − Camera lens design, automobile design.

Medical Domain − Diagnosis systems to deduce the cause of disease from observed
data; conducting medical operations on humans.

Monitoring Systems − Comparing data continuously with an observed system or with
prescribed behavior, such as leakage monitoring in a long petroleum pipeline.

Process Control Systems − Controlling a physical process based on monitoring.

Knowledge Domain − Finding out faults in vehicles, computers.

Finance/Commerce − Detection of possible fraud, suspicious transactions, stock
market trading, airline scheduling, cargo scheduling.
Formal grammar
o Formal grammar is a set of rules used to identify correct or incorrect strings of
tokens in a language. A formal grammar is represented as G.
o Formal grammar is used to generate all possible strings over the alphabet that are
syntactically correct in the language.
o Formal grammar is used mostly in the syntactic analysis phase (parsing), particularly
during compilation.

Formal grammar G is written as follows:

G = <V, N, P, S>

Where:
V describes a finite set of terminal symbols.
N describes a finite set of non-terminal symbols.
P describes a set of production rules.
S is the start symbol.

Example:

V = {a, b}, N = {S, R, B}

Production rules:
S → bR
R → aR
R → aB
B → b

Through these productions we can produce strings such as bab, baab, baaab, etc.

These productions describe strings of the shape ba^nb (a "b", then one or more "a"s, then a "b").
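As a quick check of that claim, the shape ba^nb can be verified against a few candidate strings with a regular expression in Python:

import re

# Strings of the shape b a^n b (n >= 1) match this pattern; others do not.
pattern = re.compile(r'^ba+b$')
for s in ["bab", "baab", "baaab", "bb", "aab"]:
    print(s, bool(pattern.match(s)))
# bab/baab/baaab -> True, bb/aab -> False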

Fig: Formal grammar


BNF Notation
BNF stands for Backus-Naur Form. It is used to write a formal representation of a
context-free grammar. It is also used to describe the syntax of a programming language.

BNF notation is basically just a variant of a context-free grammar.

In BNF, productions have the form:


Left side → definition

Where leftside ∈ (Vn ∪ Vt)+ and definition ∈ (Vn ∪ Vt)*. In BNF, the leftside contains
exactly one non-terminal.

We can define several productions with the same leftside. All such productions are
separated by a vertical bar symbol "|".

For example, consider the following productions:

S → aSa
S → bSb
S → c

In BNF, we can represent the above grammar as follows:

S → aSa | bSb | c
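As a sketch, the strings of this grammar consist of matching a/b symbols mirrored around a central c (e.g. "c", "aca", "abcba"); a small recursive recognizer in Python makes that concrete:

# Recognizer for S -> aSa | bSb | c : the first and last symbols must
# match and wrap a smaller S, bottoming out at the central "c".
def in_language(s):
    if s == "c":
        return True
    return (len(s) >= 3 and s[0] == s[-1] and s[0] in "ab"
            and in_language(s[1:-1]))

for s in ["c", "aca", "abcba", "abc"]:
    print(s, in_language(s))  # True, True, True, False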
