
Artificial Intelligence
RCS-702
B.Tech 4th Year (7th Semester)

UNIT - 1
Introduction: Introduction to Artificial Intelligence

• Foundations and History of Artificial Intelligence
• Applications of Artificial Intelligence
• Intelligent Agents
• Structure of Intelligent Agents
• Computer Vision
• Natural Language Processing

UNIT 1 (Introduction to AI)

Short Questions & Answers


Ques 1. What is AI? Define Artificial Intelligence on the basis of "systems that think rationally" and "systems that act like humans".

Ans: AI is a very wide field of science and engineering which builds intelligent machines and especially intelligent computer programs. It is related to the similar task of using computers to understand human intelligence. Scientists want to automate human intelligence for the following reasons:

(i) Understanding and reasoning about human intelligence in a better way.
(ii) Making smarter programs.
(iii) Developing useful and efficient techniques to solve complex problems.

Definitions of AI vary along two main dimensions. Roughly, the ones on top are concerned with thought processes and reasoning, whereas the ones on the bottom address behavior.

Systems that think like humans: "The exciting new effort to make computers think ... machines with minds, in the full and literal sense." (Haugeland, 1985) "[The automation of] activities that we associate with human thinking, activities such as decision making, problem solving, learning ..." (Bellman, 1978)

Systems that think rationally: "The study of mental faculties through the use of computational models." (Charniak and McDermott, 1985) "The study of the computations that make it possible to perceive, reason, and act." (Winston, 1992)

Systems that act like humans: "The study of how to make computers do things at which, at the moment, people are better." (Rich and Knight, 1991)

Ques 2. Differentiate between Natural(Human) Intelligence & Artificial Intelligence.

Ans :

S.NO  NATURAL INTELLIGENCE                           ARTIFICIAL INTELLIGENCE
1     Exhibited by human beings.                     Programmed by humans into machines.
2     Highly refined; no electricity is required     Exists in a computer system, so electrical
      to generate output.                            energy is required to produce output.
3     No one person is an expert; we can get         Expert systems exist, which collect the
      better solutions from one another.             ideas of many human experts.
4     Intelligence increases under supervision.      Intelligence increases by updating the
                                                     technology and algorithms used.

Ques 3. What is Weak AI and Strong AI ?

Ans: Weak AI deals with the creation of some form of computer-based artificial intelligence which can reason and solve problems in a limited domain. Some thinking-like features may be added to the machine, but true intelligence is absent; the explanation of a solution has to be worked out by us in our own way rather than depending on the machine.
Strong AI claims that computers can think at the level of human beings; such a system truly reasons and solves complex problems. In strong AI, the programs themselves are explanations for any solution.

Ques 4. What is rationality ? Define an intelligent agent.

Ans: The word agent comes from the idea of an agency hiring a person to do particular work on behalf of a user. In AI terms, an agent is a program which perceives its environment through sensors and acts upon that environment through actuators. E.g.: software agents, robotic agents, nano-robots for body check-ups (biological agents), Internet search agents, etc. Software agents carry the following properties:

• Intelligent agents are autonomous.
• Ability to perceive data and signals from the environment.
• Adapting to change in surroundings.
• Transportable or mobile over networks.
• Ability to learn, reason, and interact with humans.

(Figure: agent architecture)

Mathematically, an agent's behavior is described by the agent function f : P* → A, which maps any given percept sequence to an action; here P* is a sequence of zero or more percepts and A is the action taken by the agent. Internally, the agent function is implemented by an agent program.
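For illustration, here is a minimal Python sketch (not from the syllabus; the percepts, actions and lookup table are made up) of a table-driven agent program implementing f : P* → A:

def make_table_driven_agent(table):
    # The agent function maps the full percept sequence (P*) to an action (A).
    percepts = []                                  # percept sequence seen so far
    def agent_program(percept):
        percepts.append(percept)
        return table.get(tuple(percepts), "NoOp")  # default action if unmapped
    return agent_program

# Hypothetical vacuum-world style lookup table.
table = {
    (("A", "Dirty"),): "Suck",
    (("A", "Clean"),): "Right",
    (("A", "Clean"), ("B", "Dirty")): "Suck",
}
agent = make_table_driven_agent(table)
print(agent(("A", "Clean")))    # -> Right
print(agent(("B", "Dirty")))    # -> Suck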

A system is said to be rational if it does the right thing, given what it knows (irrefutable reasoning). Doing the right thing makes the agent successful, so some performance measure is required to quantify the degree of success.
Rationality depends on :

• Performance measure for criterion of success.


• Agent’s prior knowledge of environment.
• Actions that agent can perform.
• Agent’s percept sequence to date.

Ques 5. Mention some related fields of Artificial Intelligence.


Ans: (i) Fields closely related to AI are the engineering domains: mechanical, electrical, electronics and computer engineering.
(ii) The field of linguistics (the study of language) is also very popular nowadays; it underlies natural language processing.
(iii) Cognitive science: cognitive science deals with the study of human psychology. Cognitive scientists are interested in the computational processes required to perform certain human functions.
(iv) Application fields: medicine, manufacturing, military and defense, aerospace engineering, and the banking and finance sector.

Ques 6. What is the importance of Natural Language in AI ?

Ans: (i) Understanding the grammatical and semantic structure of language.

(ii) Helpful in machine translation and in giving commands to intelligent agents.

(iii) Easier communication between human beings and computers.

(iv) Talking is easier than typing.

An example of an NLP grammar is given below.

Input string: The cat eats the rice.

S → NP VP
NP → DET N | DET ADJ N
VP → V NP
DET → the
ADJ → big | fat
N → cat | rice
V → eats

Here NP is a noun phrase, VP is a verb phrase, DET is an article (determiner), ADJ is an adjective, V is a verb and N is a noun; these are all non-terminals. The words the, big, fat, cat, eats and rice are terminals.
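As a hands-on sketch (assuming the third-party NLTK library is installed; the grammar below simply transcribes the rules above), this grammar can be tried out with a chart parser:

import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> DET N | DET ADJ N
VP -> V NP
DET -> 'the'
ADJ -> 'big' | 'fat'
N -> 'cat' | 'rice'
V -> 'eats'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the cat eats the rice".split()):
    print(tree)   # (S (NP (DET the) (N cat)) (VP (V eats) (NP (DET the) (N rice))))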

Ques 7. What is Lexicon ?

Ans: A lexicon is a dictionary of words (usually morphemes or root words with their derivatives), where each word carries some meaning and syntax. The information in a lexicon is needed to help determine the function and meaning of the words in a sentence. Entries in lexicons can be grouped by word category, e.g., articles, nouns, verbs, pronouns, adjectives, etc.

Long Questions & Answers

Ques 8. Explain Goal Based Agent and Utility based Agent architecture with proper diagram.

Ans: The job of AI is to design an agent program that implements the agent function mapping percepts to actions. This program executes on some sort of computing device with sensors and actuators, which is called the ARCHITECTURE. Agent = Architecture + Program.

(a) Goal based agent: this type of agent model incorporates desirable goals and promising directions toward goals that are easy to reach. Sometimes the action to be selected is simple, when a single action leads to the desired goal; but when a long sequence of percepts must be considered, the complexity of decision making increases. Example: an automated car driving agent.
A goal based agent may be less efficient, but it is flexible enough given proper knowledge and decision making. E.g.: if it starts raining, a car driving agent must be flexible enough to make the correct decision about when to apply the brakes.

(b) Utility based agent: in goal based agents we get only a distinction between happy and unhappy states, whereas a more general performance measure allows a comparison of different world states according to exactly how happy they would make the agent if the goal is reached. For this we require a utility measure.

A UTILITY FUNCTION maps a state to a real number which describes the associated degree of happiness. Two cases can be considered here for rationality:

Case 1: Conflicting goals exist and only some of them can be achieved (e.g., safety and speed of a car are conflicting requirements), so select the goal which has the higher degree of happiness and is more useful. In car driving, safety is more essential than high speed, to avoid any accident or loss of human life.

Case 2: Several goals exist and the agent cannot reach any of them with certainty; utility then provides a way in which the likelihood of success can be weighed against the importance of each goal. Example: a household robotic agent will give medicine to a person at the scheduled time even if, at the same time, another family member asks it to play his favourite sports channel on the television, because the utility of medicine consumption is higher than that of watching television.
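A minimal sketch of this idea (all numbers and names are illustrative, not from the text): a utility-based agent scores the outcomes of each candidate action and picks the action with the highest expected utility.

def expected_utility(outcomes, utility):
    # outcomes: list of (resulting state, probability) pairs for one action.
    return sum(p * utility(state) for state, p in outcomes)

def choose_action(actions, utility):
    # actions: dict mapping an action name to its possible outcomes.
    return max(actions, key=lambda a: expected_utility(actions[a], utility))

utility_of = {"medicine_given": 10, "tv_playing": 3, "nothing": 0}.get
actions = {
    "give_medicine": [("medicine_given", 0.95), ("nothing", 0.05)],
    "play_tv":       [("tv_playing", 1.0)],
}
print(choose_action(actions, utility_of))   # -> give_medicine (utility 9.5 > 3.0)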

Ques 9. (a) What is PEAS information ? Design the PEAS information for Taxi Driver Agent and
Automated Robot in a manufacturing plant.
(b) Mention various properties of task environment.
Ans : (a) PEAS is the acronym used to define the performance and other characteristics of a rational agent.

P : Performance , E : Environment , A : Actuators , S : Sensors.

The performance measure decides the criterion for the success of an agent's behavior. When an agent is plunked down in the environment, it generates a sequence of actions according to the percepts it receives. The sequence of actions causes the environment to go through a sequence of states; if the sequence is desirable, then the agent has performed well.

Taxi Driver Agent:
  Performance measure: safe, fast, legal, comfortable trip, maximum mileage.
  Environment: roads, other traffic, pedestrians, customers.
  Actuators: steering, accelerator, brake, signal, horn, display.
  Sensors: cameras, speedometer, GPS, odometer, engine sensors.

Robot part-picking agent:
  Performance measure: percentage of parts placed in the correct bins.
  Environment: conveyor belt with parts, bins.
  Actuators: jointed arms and hands.
  Sensors: camera, joint angle sensors.

(b) Properties of task environment:


Fully Observable vs Partially Observable: in a fully observable environment, the agent's sensors give it access to the complete state of the environment at each point. In a partially observable environment, noise and inaccurate sensors make the picture of the state unclear. E.g.: a taxi agent cannot know what other drivers are thinking.

Deterministic vs Stochastic: an environment is deterministic if the next state is completely determined by the current state and the agent's action; otherwise it is random/stochastic. E.g.: a taxi driving agent is stochastic because one can never predict the behavior of traffic exactly, whereas a vacuum cleaner agent is deterministic.

Episodic vs Sequential: in an episodic task environment, the agent's experience is divided into "atomic episodes". Each episode consists of the agent perceiving and then performing an action, and the next episode is independent of the actions taken in previous episodes. In a sequential environment, by contrast, the current decision affects all future decisions. E.g.: in a taxi driving agent, the intensity of braking may have long-term consequences.

Static vs Dynamic: if the environment can change while the agent is deliberating, it is a dynamic environment; otherwise it is static. A static environment is easy to work in, while a dynamic environment continuously asks the agent what to do next. E.g.: taxi driving is dynamic.

Discrete vs Continuous: this distinction applies to the states of the environment (and to percepts and actions). E.g.: a chess game has a finite number of distinct states and a discrete set of percepts and actions, whereas taxi driving is a continuous-time and continuous-state problem.

Single Agent vs Multi Agent: an agent solving a crossword puzzle alone is a single-agent environment. Chess playing is a two-agent environment. Robot soccer is multi-agent (cooperative multi-agent).

Ques 10. What is natural language processing? Mention its application domains in AI. What are some of the problems which arise in natural language understanding for autonomous machines like robots and intelligent computers?

Ans: In AI we can think of language as a pair (source, target) for a mapping between two objects. Language is a medium of communication. The most common linguistic medium of human beings is speech, but processing written language is easier than processing speech. Developing a program that understands a natural language is difficult: natural languages are large and contain endless ambiguities. Natural Language Processing is therefore the task of processing speech or written text in such a way that a program transforms sentences occurring as part of a dialogue into data structures which convey the intended meaning of the sentences to a reasoning program. A reasoning program must know about:
(i) the structure of the language,
(ii) the possible semantics,
(iii) the beliefs and goals of the user,
(iv) general knowledge of the world.

NLP = Understanding + Generation. Natural language understanding (NLU) aims at building systems that can make sense of free-form text; an NLU system converts samples of human language into more formal representations that are easier for computer programs to manipulate. Natural language generation (NLG) aims at building systems that can express their knowledge or explain their behavior in natural language; an NLG system converts information from computer databases into normal-sounding human language.

• Processing written text uses lexical, syntactic and semantic knowledge of the language as well as the required real-world information.

• Processing spoken language needs all of this information plus additional knowledge about phonology and the removal of ambiguity.

• NLP also includes multilingual translation. E.g.: the Google search engine and various smart phones offer speech-to-text and text-to-speech conversion in many different natural languages.

Block Diagram of an NLP System

(Diagram: an input string goes to a Parser, which consults a Dictionary; the parser output passes through a knowledge representation system and then a translation system, producing output as natural language text or computer code.)

Applications of NLP in AI are as follows:

(i) Machine translation, along with speech-to-text and text-to-speech conversion. E.g.: features available nowadays in Android phones as well as Windows laptops.
(ii) Information retrieval from a given collection of documents to satisfy a certain information need.
(iii) Information extraction and data mining.
(iv) Text summarization.

Problems that arise in NLU systems

1. Problem of Ambiguity: there are several knowledge levels at which ambiguity may occur in natural language.
(a) Syntactic level: a sentence or phrase may be ambiguous at the syntactic level. Syntax relates to the structure of the language as per the grammar rules and the way the words are put together.
Example: "I hit the man with the hammer." Did I use the hammer as the weapon, or was the hammer in the hand of the victim?

(b) Structural level: some sentence structures have more than one correct interpretation.

(c) Lexical level: a sentence may be ambiguous at the lexical level, where a word can have more than one meaning. Example: "I went to the bank." The word bank can mean a river bank or a financial institution, so the same word has two meanings.

(d) Referential level: this concerns what the sentence refers to, when a sentence may refer to more than one thing. Example: "Ram killed Ravana because he liked Sita." Referential ambiguity occurs for "he": does it refer to Ram or to Ravana?

(e) Semantic level: a sentence may be ambiguous at the level of meaning (i.e., two different meanings for the same expression). Example: "He saw her duck" (both lexical and semantic ambiguity). Did he see her dip down to avoid something, or did he see her web-footed bird?

(f) Pragmatic level: a sentence can be ambiguous at the pragmatic level, i.e., at the level of interpretation depending on the context in which it occurs. Some words have different meanings in different situations.

Example: "I went to the doctor yesterday." Does "yesterday" refer to the day preceding today, or to some other day under discussion? Similarly:

I waited for a long time at the bank.
There is a drought because it has not rained for a long time.
Dinosaurs have been extinct for a long time.

How long "a long time" is depends on context, so this is pragmatic-level ambiguity.

2. Problem of imprecision: very long sentences cannot be easily interpreted by machines.
3. Problem of incompleteness: an incomplete sentence may create a logical error or misinterpretation. Example: "I went there." What does "there" refer to?
4. Problems of inaccuracy may also arise in machine translation.

5. The problem of continuous change is also very common during NLU. Example: people in different parts of the world speak English with different accents.
6. Presence of noise in the input. Example: while speaking in front of a machine, background noise may hinder clear voice input to the system.
7. The quantifier scoping problem is also very common: where to apply the existential quantifier (∃) and where to use the universal quantifier (∀).

Ques 11. Write short notes on the following: (15 Marks)

(i) Top down and bottom up parsing (ii) Computer vision (iii) Turing Test

Ans: Parsing is a technique to check the grammatical structure of input syntactically and to generate a parse tree if the given input is successfully parsed by a formal context-free grammar. In an NLP system, however, this traditional parsing is quite difficult to analyze, understand and implement, because natural languages are inherently ambiguous at the lexical, syntactic, semantic, referential and pragmatic levels.

There are systematic patterns in a sentence that emerge from the knowledge of grammar; for example, sentences contain phrases such as noun phrases, verb phrases, prepositional phrases, etc. Parsing is a kind of search problem where the search space is the set of trees consistent with a given grammar. Two methods of searching are the top down approach and the bottom up approach.

Top Down approach: in this technique we start searching from the root node of the parse tree and move downwards, level by level, to the leaf nodes, to find lexicons or actual words. The top down approach is goal driven: in the task of packing bags for travel, we can start with the goal in mind and make a list of items that achieve that goal. Top down parsers are constrained by the grammar.

Bottom Up approach: this is a data driven approach in which search moves are performed upwards through the tree, starting from the leaf nodes and reaching the root node. If this is done successfully, the input contains no syntax error. Bottom up parsers are constrained by the words.

Top down parsing of "Ram ate the biscuit":

S → NP VP
  → N VP
  → Ram VP
  → Ram V NP
  → Ram ate NP
  → Ram ate ART NOUN
  → Ram ate the NOUN
  → Ram ate the biscuit

Bottom up parsing of the same sentence:

Ram ate the biscuit
NOUN ate the biscuit
NP ate the biscuit
NP VERB ART biscuit
NP VERB ART NOUN
NP VERB NP
NP VP
S
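To see both strategies run, NLTK (assumed installed) provides a top-down RecursiveDescentParser and a bottom-up ShiftReduceParser; a sketch using a grammar matching the derivation above:

import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> ART N | N
VP -> V NP
ART -> 'the'
N -> 'Ram' | 'biscuit'
V -> 'ate'
""")

sentence = "Ram ate the biscuit".split()

# Top-down (goal driven): expands from S downwards towards the words.
for tree in nltk.RecursiveDescentParser(grammar).parse(sentence):
    print(tree)

# Bottom-up (data driven): shifts words and reduces them upwards towards S.
for tree in nltk.ShiftReduceParser(grammar).parse(sentence):
    print(tree)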

(ii) Computer Vision: there are many opinions about what sort of background is necessary for computer vision, but one thing is certain: inspirations for new computer vision methods have come from fields as diverse as psychology, neuroscience, physics, robotics, and statistics. Vision deals with light and its interaction with surfaces, so of course optics plays a role in understanding computer vision systems. Cameras, lenses, focusing, binocular vision, depth of field, sensor sensitivity, time of exposure, and other concepts from optics and photography are all relevant to computer vision.

Often referred to as the "inverse" of computer graphics, computer vision attempts to make inferences about the world from images. Given a picture of two objects, we would like to infer that they are roughly cubical and that they are likely to be dice, although we can never be completely sure. A vision system may pick up telling highlights to conclude that a surface is wet, transparent, or reflective, or that its features are associated with living creatures rather than inanimate objects.

Neuroscience and physiology: the human eye, the central nervous system, and the brain are all marvels of the complex structure and performance required for vision. Studying these systems often provides insight, inspiration, and clues for artificial vision system design.

The human visual system seems to do all of these things. Just recording the speed at which a human responds
in a particular task, like reading a word, may rule out certain theories as to how certain visual stimuli are
processed.

Probability, statistics, and machine learning: the mathematical subfield of probability, the field of statistics, and the computer science discipline of machine learning have become essential tools in computer vision.

Early Vision in Multiple Images

• The geometry of multiple views.
• Stereopsis: what we know about the world from having two eyes.
• Structure from motion: what we know about the world from having many eyes or, more commonly, from our eyes moving.

Mid-Level Vision

• Finding coherent structure so as to break the image or movie into big units.
– Segmentation: breaking images and videos into useful pieces, e.g., finding video sequences that correspond to one shot, or finding image components that are coherent in internal appearance.
– Tracking: keeping track of a moving object through a long sequence of views.

High Level Vision (Geometry)

The relations between object geometry and image geometry:
• Model based vision: find the position and orientation of known objects.
• Smooth surfaces and outlines: how the outline of a curved object is formed, and what it looks like.
• Aspect graphs: how the outline of a curved object moves around as you view it from different directions.

High Level Vision (Probabilistic)

The same relations between object geometry and image geometry, treated probabilistically.

(iii) Turing Test: this test provides an answer to the question "Can machines think like human beings?"

Alan Turing, the British mathematician and computer scientist, is widely regarded as the father of artificial intelligence. Turing left a benchmark test for an intelligent computer: it must fool a person into thinking that the machine is a human being. The test is performed in the following two phases:

PHASE I: an interrogator is placed in an isolated room, with a man and a woman in separate rooms. The same questions are asked of both the man and the woman through a neutral medium, such as a teletypewriter. The questions include calculations such as multiplying big numbers like 33456012 x 6754, along with some questions on poetry and English literature.

PHASE II: in this phase the man is replaced by a computer without the knowledge of the interrogator. The interrogator does not distinguish between man, woman and machine; he knows them only as A and B.

Interpretation: if conversation with a computer is indistinguishable from conversation with a human, the computer is displaying intelligence; if we cannot distinguish between natural intelligence and artificial intelligence, they must be the same. If the interrogator could not distinguish between a man imitating a woman and a computer imitating a man, the computer succeeded in passing the test. The goal of the machine is to fool the interrogator into believing that it is a person. If the computer is successful, then we can say "machines can think like humans".

UNIT-2
Introduction to Searching Methods in AI
• Searching for solutions,
• Uninformed search strategies
• Informed search strategies
• Local search algorithms and optimization problems
• Adversarial Search
• Search for games
• Alpha Beta Pruning

Short Questions & Answers

Ques 1. What is the difference between conventional computing and AI computing?

Ans: In conventional computing: Program = Data Structures + Algorithms (operations + sequence).

In AI computing: Expert System = Knowledge Base + Control Strategies, where the control strategies determine the sequence (based on heuristics) and the search strategies supply the operations. Heuristics give order or guidance to the search.

Dimension                  Conventional computing           AI computing
1. Processing              Primarily algorithmic            Includes symbolic concepts
2. Nature of input         Must be complete                 Can be incomplete
3. Search approach         Based on algorithms              Based on rules and heuristics
4. Explanation             Not provided                     Provided
5. Focus                   Data and information             Knowledge
6. Maintenance             Usually difficult                Relatively easy; changes can be made
                                                            in self-contained modules
7. Reasoning and           Not present                      Present
   learning ability

Ques 2. What are the main aspects considered before solving a complex AI problem? What is state space representation in AI?
Ans: The following aspects are considered before solving any AI problem:
(i) What are the main assumptions about the problem?
(ii) What kind of AI technique is required to solve it (e.g., game playing, theorem proving, robotics, expert systems, NLP, etc.)?
(iii) At what level does the intelligence have to be modeled?
(iv) How will we know when we have reached the goal state of the solution?

State Space Representation: a problem solving system that uses forward reasoning, and whose operators each work by producing a single new object (a new state) in the knowledge base, is said to represent the problem in a state space structure. In backward reasoning we move from the goal state back towards the given initial states.
Constituents of state space problem searching:
(i) Set of initial states (or a single initial state).
(ii) Operator function: when applied to any state, it changes that state into another state.
(iii) State space: all states reachable from the initial state by any sequence of actions.
(iv) Path: a sequence of steps through the state space.
(v) Path cost: the cost (in terms of time or money) incurred while traversing the state space.
(vi) Preconditions: values that certain attributes must have to enable an operator's application in a state.
(vii) Postconditions: attributes of a state altered by an operator's application.
(viii) Goal state: the final state, when the problem is solved.

Ques 3. What is local beam search?

Ans: This can be viewed as a width-limited variation of breadth first search. Keeping just one node in memory, as plain hill climbing does, can be an extreme reaction to memory limitations; local beam search instead keeps track of K states rather than just one (see the sketch below).
(i) Beam search moves downwards only through the best nodes at each level, applying heuristics; the other nodes are ignored.
(ii) The width of the beam, K, is fixed.
(iii) The K parallel search threads share useful information among themselves: threads that make bad moves are halted, and their resources are passed to the threads generating good successors.
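A minimal sketch of the idea (the successor and scoring functions here are toy stand-ins): keep only the K best states at each level.

import heapq

def beam_search(start_states, successors, score, K, steps):
    beam = list(start_states)
    for _ in range(steps):
        # Expand every state in the beam, then keep only the K best children.
        candidates = [s for state in beam for s in successors(state)]
        if not candidates:
            break
        beam = heapq.nlargest(K, candidates, key=score)
    return max(beam, key=score)

# Toy problem: walk integers uphill towards 42 with beam width K = 2.
best = beam_search([0, 100], lambda x: [x - 1, x + 1],
                   lambda x: -abs(x - 42), K=2, steps=60)
print(best)   # -> 42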
Ques 4. Show that DFS is neither complete nor optimal.
Ans: Since DFS explores depth first, if the search tree is infinite or the graph contains cycles, DFS may never terminate, i.e., it can search forever, so it is not complete. Secondly, DFS may return a solution at a certain depth while a much better solution exists at an upper level of the tree, so it is non-optimal as well.

Ques 5. Define Heuristic search and heuristic function.
Ans: Heuristic search techniques are those in which we have more information than just the initial state, the operators and the goal state. This leads to more efficient searching for complex problems; heuristics serve as a guide to reach the goal state.
Heuristic function: this is also termed the objective function in mathematical optimization. A heuristic function "maps from problem state descriptions to a measure of desirability, usually represented as a number". How states are considered, evaluated, and assigned weights determines the selection of the current best path to reach the goal most efficiently. If the heuristic is very accurate and reflects the true merit of each node, the search proceeds almost directly to a solution.
A general heuristic function can be of the form f(n, g), where n is a node and g a goal state. A common form is
f(n) = g(n) + h(n),
where g(n) is the path cost from the initial state to node n, h(n) is the estimated cost of the cheapest path from n to the goal, and f(n) is the estimated cost of the cheapest solution through n.

Ques 6. What is depth limited search?


Ans: This is an uninformed search algorithm in which a depth limit L is imposed on DFS (useful for unbounded trees), so nodes at depth L are treated as if they have no children. Let d be the depth of the shallowest goal:
(i) If L < d, the search is incomplete (the shallowest goal lies beyond the depth limit).
(ii) If L > d, the solution found may be non-optimal.
(iii) If L = ∞, this is the special case of ordinary DFS.

Time complexity: O(b^L), where b is the branching factor. This algorithm solves the infinite-path problem.
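A minimal recursive sketch (the tree here is a made-up toy: node n has children 2n and 2n + 1):

def depth_limited_search(state, goal, successors, limit):
    # Returns a path to the goal, or None; nodes at the limit get no children.
    if state == goal:
        return [state]
    if limit == 0:
        return None
    for child in successors(state):
        path = depth_limited_search(child, goal, successors, limit - 1)
        if path is not None:
            return [state] + path
    return None

print(depth_limited_search(1, 6, lambda n: [2 * n, 2 * n + 1], limit=2))
# -> [1, 3, 6]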

Ques 7. What is uniform cost search?

Ans: Breadth first search is optimal when all step costs are equal, because it always expands the shallowest unexpanded node. Uniform cost search instead expands the node n with the lowest path cost, rather than the shallowest node; it emphasizes the total path cost rather than the number of steps a path has. Consequently it will get stuck in an infinite loop if it ever expands a node that has a zero-cost action leading back to the same state.
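A sketch of uniform cost search using Python's heapq as the priority queue (the weighted graph is illustrative):

import heapq

def uniform_cost_search(start, goal, neighbors):
    # neighbors(state) yields (next_state, step_cost) pairs.
    frontier = [(0, start, [start])]          # entries: (path cost g, state, path)
    explored = set()
    while frontier:
        g, state, path = heapq.heappop(frontier)   # lowest-cost node first
        if state == goal:
            return g, path
        if state in explored:
            continue
        explored.add(state)
        for nxt, cost in neighbors(state):
            if nxt not in explored:
                heapq.heappush(frontier, (g + cost, nxt, path + [nxt]))
    return None

graph = {"S": [("A", 1), ("B", 5)], "A": [("B", 2)], "B": [("G", 1)], "G": []}
print(uniform_cost_search("S", "G", lambda s: graph[s]))   # -> (4, ['S', 'A', 'B', 'G'])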

Ques 8. Differentiate between Uninformed and Informed Search technique.
Ans :

S.NO  UNINFORMED SEARCH                               INFORMED SEARCH
1.    No information about the goal is available      Some information about the goal, in the
      beyond the problem definition.                  form of a heuristic function, is available.
2.    Nodes in the search space are traversed         The search process takes less time, since
      until a goal is reached, the time limit         more information about the initial state
      is over, or failure occurs.                     and operators is exploited.
3.    These are blind search methods.                 Based on heuristic search methods.
4.    Search efficiency is low.                       Search is fast.
5.    Practical limits on computer storage            Less computation is required.
      restrict blind search.
6.    Not practical for solving very complex          Can handle large, complex AI problems.
      and large problems.
7.    It is possible to reach the best solution.      Mostly a good-enough solution is accepted
                                                      as the optimal solution.
8.    E.g.: DFS, BFS, uniform cost search,            E.g.: best first search, hill climbing,
      depth limited search.                           A* search, AO* search.

Ques 9. Write the algorithm of Depth First Search.


Ans: The depth first algorithm always expands the deepest node in the current path, moving down a branch. This is done by generating a child of the most recently expanded node, then generating that child's children, and so on, until a goal is found or some depth cut-off point d is reached. The search runs in LIFO order: the "QUEUE" below effectively behaves as a stack, since children are placed at the front. The algorithm is as follows:
Step 1. Place the initial node S on QUEUE.
Step 2. If (QUEUE == Φ), return failure and stop.
Step 3. If (first element of QUEUE == goal node), return success and stop.
        Else remove and expand the first element and place its children at the FRONT of QUEUE.
Step 4. Go to Step 2.
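The same steps as a runnable sketch (the graph is made up); the QUEUE acts as a stack because children go to the front:

def dfs(start, goal, successors):
    queue = [[start]]                     # Step 1: initial node on QUEUE
    while queue:                          # Step 2: empty QUEUE means failure
        path = queue.pop(0)               # take the first element
        node = path[-1]
        if node == goal:                  # Step 3: goal test
            return path
        children = [path + [c] for c in successors(node) if c not in path]
        queue = children + queue          # children go to the FRONT of QUEUE
    return None

graph = {"S": ["A", "B"], "A": ["C"], "B": [], "C": ["G"], "G": []}
print(dfs("S", "G", lambda n: graph[n]))   # -> ['S', 'A', 'C', 'G']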

Ques10. Heuristic path algorithm is best first search in which objective function is
f(n) = (2-w) g(n) +w h(n) for what values of w is this algorithm guaranteed to be optimal ?
What kind of search does this perform when w = 0? When w = 1? When w = 2?
Ans: Rewrite f(n) = (2 - w) g(n) + w h(n) as (2 - w) [g(n) + (w / (2 - w)) h(n)]. This behaves like A* with the heuristic scaled by w / (2 - w), which remains admissible (given an admissible h) as long as w / (2 - w) ≤ 1, i.e., for 0 ≤ w ≤ 1. So the algorithm is guaranteed to be optimal for w ≤ 1.
If w = 0, then f(n) = 2 g(n): this is uniform cost search. If w = 1, then f(n) = g(n) + h(n): this is A* search. If w = 2, then f(n) = 2 h(n): this is greedy best first search, which is not guaranteed to be optimal.

Long Questions & Answers
Ques 11. Explain Water Jug Problem using state space search. Generate Production rules
for this problem.
Ans: Two jugs of 4 liter and 3 liter capacity are given; initially both are empty. The objective is to transfer water between the jugs (and a pump) in such a way that the 4 liter jug ends up with exactly 2 liters of water. The 3 liter jug may end with n liters, where n can be any of 0, 1, 2 or 3.

Assumptions are as follows:

(i) We can fill a jug from the pump.
(ii) We can pour water onto the ground from any jug.
(iii) We can pour water from one jug to another.
(iv) No other device for measuring the water level in a jug is available.
The efficient control strategy and production rules are given below (a state (x, y) means x liters in the 4 liter jug and y liters in the 3 liter jug):

S.NO  Production rule                                           Description
1.    If x < 4, then (x, y) → (4, y)                            Fill the 4 liter jug.
2.    If y < 3, then (x, y) → (x, 3)                            Fill the 3 liter jug.
3.    If x > 0, then (x, y) → (x - d, y)                        Pour some water out of the 4 liter jug.
4.    If y > 0, then (x, y) → (x, y - d)                        Pour some water out of the 3 liter jug.
5.    If x > 0, then (x, y) → (0, y)                            Empty the 4 liter jug.
6.    If y > 0, then (x, y) → (x, 0)                            Empty the 3 liter jug.
7.    If x + y ≥ 4 and y > 0, then (x, y) → (4, y - (4 - x))    Pour water from the 3 liter jug into the 4 liter jug until the 4 liter jug is full.
8.    If x + y ≥ 3 and x > 0, then (x, y) → (x - (3 - y), 3)    Pour water from the 4 liter jug into the 3 liter jug until the 3 liter jug is full.
9.    If x + y ≤ 4 and y > 0, then (x, y) → (x + y, 0)          Pour all the water from the 3 liter jug into the 4 liter jug.
10.   If x + y ≤ 3 and x > 0, then (x, y) → (0, x + y)          Pour all the water from the 4 liter jug into the 3 liter jug.
11.   (0, 2) → (2, 0)                                           Pour the 2 liters from the 3 liter jug into the 4 liter jug (a special case of rule 9).
12.   (2, y) → (0, y)                                           Empty the 2 liters in the 4 liter jug onto the ground.

Solution for the WJP (the rule in each row is applied to reach the next row's state):

4 liter jug (x)    3 liter jug (y)    Rule applied
0                  0                  Rule 2
0                  3                  Rule 9
3                  0                  Rule 2
3                  3                  Rule 7
4                  2                  Rule 5
0                  2                  Rule 9 or 11
2                  0                  (goal state)

(Figure: state space graph for the Water Jug Problem)
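A brute-force sketch of this search (BFS over (x, y) states, so it returns a shortest rule sequence); the move list simply encodes the production rules above:

from collections import deque

def water_jug(goal=(2, 0)):
    start = (0, 0)
    parent = {start: None}
    queue = deque([start])
    while queue:
        x, y = queue.popleft()
        if (x, y) == goal:
            path, s = [], (x, y)
            while s is not None:
                path.append(s)
                s = parent[s]
            return path[::-1]
        moves = [(4, y), (x, 3), (0, y), (x, 0),        # fill / empty rules
                 (min(4, x + y), max(0, x + y - 4)),    # pour 3L jug into 4L jug
                 (max(0, x + y - 3), min(3, x + y))]    # pour 4L jug into 3L jug
        for s in moves:
            if s not in parent:
                parent[s] = (x, y)
                queue.append(s)

print(water_jug())
# -> [(0, 0), (0, 3), (3, 0), (3, 3), (4, 2), (0, 2), (2, 0)]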

Ques 12. (a) How is AI related to knowledge? Differentiate between declarative and procedural knowledge.
(b) What are AI techniques? Explain the properties of it , and purpose of this with
example.
Ans: (a) Knowledge is defined as the body of facts and principles accumulated by humankind, or the act, fact or state of knowing. Knowledge includes familiarity with language, concepts, procedures, rules, ideas, abstractions and places, coupled with an ability to use these in modeling different aspects of a real-world problem.
Intelligence requires the possession of, and access to, knowledge; a characteristic of intelligent people is that they possess much knowledge. Knowledge has the following properties:
(i) It is voluminous.
(ii) It continuously increases.
(iii) Understanding knowledge in different contexts depends on the environment. E.g.: understanding a visual scene requires knowledge of the kinds of objects in the scene.
(iv) Solving a particular problem requires knowledge of that domain.
(v) Knowledge must be effectively represented in a meaningful way, without any sort of ambiguity.

Difference between Declarative Knowledge and Procedural Knowledge

Declarative knowledge is passive knowledge expressed in the form of statements and facts about the real world. Example: the employee database of an organization, a telephone directory, etc.
Procedural knowledge is compiled knowledge related to the performance of some task, i.e., solving a problem systematically, step by step. Example: the steps used to solve a trigonometric problem.

(b) AI Techniques: these are the methods used to solve various types of tasks, e.g., game playing, theorem proving, robotics, expert systems. Simple data structures like arrays and queues are unable to represent the facts of the real world, so symbolic representations are required.

An AI technique is a method that exploits knowledge, which should be represented in such a way that:
(i) Knowledge captures generalization: it is not necessary to represent each individual situation separately; rather, situations that share important properties are grouped together.
(ii) It can be understood by the people who must provide it, depending on the problem domain (e.g., robotics, medicine, weather forecasting, biometrics, defense and military, aeronautics, space research, etc.).
(iii) It can easily be modified to correct errors.
(iv) It can be used in many situations even if it is not totally accurate or complete.

(Diagram: AI techniques applied to a knowledge base produce smarter AI computers, which have reasoning and decision-making ability.)

The knowledge base is coupled with an inference process, through which an AI system can derive solutions that are not already in the KB. To make computers smarter we need to learn how the human brain works and which properties of the brain can be used to design an AI system; this study is called cognitive science.
Before solving a problem and selecting the appropriate AI domain, we consider the following points:
• What are the assumptions about the knowledge?
• Which technique is most appropriate for reaching the goal state?
• At what level is the intelligence to be modeled?
• How will we know when we have succeeded in building an AI system?
Examples: translating English sentences into Japanese (requires NLP and NLU techniques); teaching a child to subtract integers (requires supervised learning); solving a crossword puzzle (requires logic theory).

Ques 13. What are the characteristics of AI problems? Explain with examples.
Ans: (i) Is the problem decomposable? Is there a way to divide a large, complex problem into sub-problems and then apply a simple approach to each sub-problem to obtain the solution of the main problem?
E.g.: designing large software requires designing the individual modules before starting the coding phase.
(ii) Can solution steps be ignored, or at least undone, if they prove to be irrelevant? Based on this, problems fall into the following categories:
Ignorable: solvable using a simple control strategy that never backtracks (e.g., theorem proving).
Recoverable: solvable using more complex control structures that may make errors; here solution steps can be undone (e.g., the 8-puzzle).
Irrecoverable: solved by a system that expends a great deal of effort making each decision, since each decision must be final; solution steps cannot be undone (e.g., chess and card games). The control strategy requires considerable effort, strict rules, and good heuristic information.
(iii) Is the universe predictable? Can we predict in advance the plans and steps to be taken during problem solution? E.g.: in the 8-puzzle we can predict the result of every move, whereas in a bridge game we cannot.
Certain-outcome problems: the 8-puzzle. Uncertain-outcome problems: the bridge game.
Hardest problems: irrecoverable + uncertain outcome.

(iv) Is a good solution absolute or relative? Consider answering questions from a database of simple facts using predicate logic: there can be two or more reasoning paths, and any of them will do (an any-path problem). In the travelling salesman problem, by contrast, we compare each path against the others to make sure none of them is shorter.
• Is the solution a state or a path?
• Role of knowledge: if unlimited computing power is available, how much knowledge is required? The answer may be just the rules that determine legal moves, e.g., in a chess game.
• Does the problem require interaction with a person?
Solitary problems: have no intermediate interaction or communication and no demand for an explanation of the reasoning process.

Conversational problems: here there is intermediate communication between a person and the computer, either to provide additional assistance to the system or to give additional information to the user.

• Problem classification: examine an input and decide which of a set of known classes the input is an instance of. Most diagnostic tasks are of this kind (e.g., medical diagnosis, diagnosis of faults in mechanical devices).

Ques 14. What are a control strategy and a production system? How are they helpful in AI? Give examples and types.
Ans: A control strategy is an interpreter program used to control the order in which the production rules are fired, and to resolve conflicts if more than one rule becomes applicable simultaneously. The strategy repeatedly applies rules to the database until a description of the goal state is produced.
A control strategy must be systematic and must also cause motion.
Example: in the Water Jug Problem we select rules from the given list in such a way that each selection generates a new state in the state space (causes motion); rules must be selected so that duplicate states are avoided.

Types of control strategies: (a) Irrevocable: a rule is selected and applied irrevocably, without provision for reconsideration later. (b) Tentative: a rule is selected and applied with a provision to return later to this point in the computation and apply some other rule.
Production systems: these were proposed for modeling human problem-solving behavior by Newell & Simon in 1972. They are also referred to as inferential systems or rule-based systems.
Roles of a production system:
(i) A powerful knowledge representation method with actions associated with it.
(ii) A bridge between AI research and expert systems.
(iii) A strongly data-driven form of intelligent action: when new input is given to the system, the behavior of the system changes.
(iv) New rules can easily be added to account for new situations without disturbing the rest of the system.

Expert systems have a module known as the inference engine, which works on the basis of a production system.

(Figure: architecture of a production system)


Rule Set: knowledge is encoded as a set of rules of the form Pre_condition → Post_condition (action); if the precondition of such an if-then rule is true, the action is executed, otherwise there is no output.
E.g.: "If it is hot and humid then it will rain" can be written in propositional logic as the rule H ∧ D → R (H: hot, D: humid, R: rain).

Knowledge Base: dynamic KB + static KB. The global database is the central data structure used by the production system. Application of a rule changes the database, so it is dynamic in nature, continuously changing as production rules are applied to system states; it is therefore also known as working memory or short-term memory.
The static KB holds the complete information about the facts and rules; it is fixed and never changes.


Ques 16. What is missionaries and cannibals problem? Give the production rules for its
solution.

Ans: In this problem, three missionaries and three cannibals must cross a river using a boat which can carry at most two people, under the constraint that, on both banks, the missionaries present (if any) cannot be outnumbered by cannibals. The boat cannot cross the river by itself, with no people on board.

Solution (M = missionary, C = cannibal; each row shows the banks after the crossing):

Step   Left bank      Boat crossing         Right bank
0      3M, 3C         -                     0M, 0C
1      3M, 1C         2C cross →            0M, 2C
2      3M, 2C         ← 1C returns          0M, 1C
3      3M, 0C         2C cross →            0M, 3C
4      3M, 1C         ← 1C returns          0M, 2C
5      1M, 1C         2M cross →            2M, 2C
6      2M, 2C         ← 1M, 1C return       1M, 1C
7      0M, 2C         2M cross →            3M, 1C
8      0M, 3C         ← 1C returns          3M, 0C
9      0M, 1C         2C cross →            3M, 2C
10     0M, 2C         ← 1C returns          3M, 1C
11     0M, 0C         2C cross →            3M, 3C
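A brute-force BFS sketch that reproduces an 11-crossing solution (a state is (missionaries on the left, cannibals on the left, boat on the left?); the early moves it picks may differ from the table above, but the number of crossings is the same):

from collections import deque

def safe(m, c):
    # Each bank must have no missionaries or at least as many as cannibals.
    return (m == 0 or m >= c) and (3 - m == 0 or 3 - m >= 3 - c)

def successors(state):
    m, c, b = state
    d = -1 if b == 1 else 1                  # boat carries people off its side
    for dm, dc in [(1, 0), (2, 0), (0, 1), (0, 2), (1, 1)]:
        m2, c2 = m + d * dm, c + d * dc
        if 0 <= m2 <= 3 and 0 <= c2 <= 3 and safe(m2, c2):
            yield (m2, c2, 1 - b)

def solve(start=(3, 3, 1), goal=(0, 0, 0)):
    parent, queue = {start: None}, deque([start])
    while queue:
        s = queue.popleft()
        if s == goal:
            path = []
            while s is not None:
                path.append(s)
                s = parent[s]
            return path[::-1]
        for t in successors(s):
            if t not in parent:
                parent[t] = s
                queue.append(t)

print(solve())   # 12 states, i.e., 11 crossings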

Ques 17. What are local search algorithms? Explain Hill climbing search.

Ans: Local search algorithms operate using a single current state, rather than multiple paths, and move from it to neighboring states. The paths followed by the search are not retained, i.e., not kept in memory.

Local search algorithms use very little memory. They can find reasonable solutions in large or infinite state spaces, and they are used to solve optimization problems.

The hill climbing technique is an informed search strategy based on a greedy approach. Heuristic (informed) search sacrifices claims of completeness; heuristics behave like a tour guide: they are good to the extent that they point in generally interesting directions, and bad to the extent that they miss points of interest to individuals.

Hill climbing techniques are of the following two types:

(a )Simple hill climbing (b) Steepest ascent hill climbing.


In the hill climbing technique, at each point in the search space a successor node that appears to lead quickly to the top of the HILL (the goal state) is selected for exploration. No further reference to the parent or the other children is retained.

Simple Hill Climbing: This algorithm selects the FIRST BETTER MOVE (NODE) as a
new CURRENT STATE (next state).

Consider a current state A with value 5 and successors B, C and D with values 3, 6 and 8 respectively:

        A (5)
      /   |   \
  B (3) C (6) D (8)

A is the current state with cost 5. Compare the cost of A with each of its successors one by one, starting from node B. If the optimization problem is to maximize cost, then a child node with a higher cost than A is a better next node. Since cost(B) < cost(A), we move on to node C. Cost(C) > cost(A), and C is the first better node, therefore simple hill climbing makes node C the NEW CURRENT STATE (next state), and the successors of C will be generated next.

Steepest Ascent Hill Climbing: This algorithm considers all the moves from the CURRENT
STATE and selects the BEST NODE as a new CURRENT STATE (next state).

For the same search tree, if we apply the steepest ascent approach then the BEST NODE among all the successors is selected as the next state. Comparing A with its successor nodes, D has the maximum value, so D becomes the new current state.

Key points: (a) A hill climbing algorithm terminates when it reaches a peak where no neighbor has a higher value.
(b) Hill climbing is called greedy search because it selects a good neighbor state without thinking ahead about where to go next.
(c) An operator is applied to the current state to generate its child nodes, and an appropriate heuristic function is used to estimate the value of each node.
(d) Both simple and steepest ascent hill climbing may fail to find a solution: either algorithm may terminate not by finding a goal, but by reaching a state from which no better state can be generated.

Algorithm for Steepest Ascent Hill Climbing:

function HILL_CLIMBING(problem) returns a state that is a LOCAL MAXIMUM

INPUT: the given problem
LOCAL VARIABLES: current (a node), neighbor (a node)

current ← MAKE_NODE(INITIAL_STATE[problem])
loop do
    neighbor ← a highest-valued successor of current
    if VALUE[neighbor] ≤ VALUE[current] then return STATE[current]
    current ← neighbor
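The same procedure as a runnable sketch, with a toy objective function (maximize -(x - 7)^2 over the integers, moving by ±1):

def hill_climbing(initial, successors, value):
    current = initial
    while True:
        neighbors = successors(current)
        if not neighbors:
            return current
        best = max(neighbors, key=value)      # steepest ascent: best successor
        if value(best) <= value(current):
            return current                    # local maximum (or plateau) reached
        current = best

print(hill_climbing(0, lambda x: [x - 1, x + 1], lambda x: -(x - 7) ** 2))
# -> 7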

Ques 18. Draw the state space diagram for hill climbing search. What are the drawbacks of this algorithm? Also discuss its time and space complexity.

Ans: Hill climbing search generally falls into traps for some of the following reasons, which are also the drawbacks of the method:
(a) Local maxima: a local maximum is a state in the space tree which is a peak, better than all its neighboring states but lower than the global maximum. Local maxima are particularly frustrating because they often occur almost within sight of a solution; in this case they are called FOOTHILLS.

(b) Plateau: an area of the state space where the evaluation function is flat, i.e., a whole set of neighboring states has the same value as the current state, so the best direction is difficult to find. From a flat local maximum no uphill moves exist, whereas from a shoulder plateau it is possible to make progress.

(c) Ridges: a sequence of local maxima along a slope. The orientation of the high region, relative to the set of available moves and the directions in which they move, makes it impossible to traverse a ridge by single moves.

A state space diagram is a graphical representation of the set of states the search algorithm can reach, plotted against the value of the objective function (the function we wish to maximize). The X-axis denotes the state space, i.e., the states or configurations the algorithm may reach; the Y-axis denotes the value of the objective function corresponding to each state. The best solution is the state where the objective function has its maximum value (the global maximum).

How to overcome the above drawbacks?

(a) To deal with a local maximum, backtrack to some earlier node and select a different path.
(b) To deal with a plateau, make a BIG HOP in some direction to try to reach a new section of the search space. If the rules describe single small steps, apply them several times in the same direction.
(c) To deal with ridges, apply two or more rules before testing progress; this amounts to moving in several directions at once.

Hill climbing is not complete: whether it finds the goal state depends on the quality of the heuristic function. If the function is good enough, the search will still proceed towards the goal state.

Space complexity: this is its strongest feature. It requires constant space, because it keeps only a copy of the current state; in addition it may need a little memory for storing the previous state and each candidate successor.

The time complexity of hill climbing is proportional to the length of the steepest-gradient path. In a finite domain, the search follows this path and terminates.

Ques 19. Explain the Blocks World problem using a heuristic function in the hill climbing search strategy.

Ans: In this problem an initial arrangement of eight blocks is provided. We have to reach the GOAL arrangement by moving blocks in a systematic order. States are evaluated using a heuristic, so that we can pick the next best node by applying the steepest ascent hill climbing technique.

Two heuristics are considered: (i) LOCAL and (ii) GLOBAL. Both functions try to maximize the score of each state.

Local heuristic: add 1 point for each block that is resting on the thing it is supposed to be resting on (as compared with the goal state); subtract 1 point for each block sitting on the wrong thing. The table also counts as a support.

Global heuristic: for each block that has a correct support structure, add 1 point for every block in that support structure; for each block with an incorrect support structure, subtract 1 point for every block in it. Here the table is not counted.

As the score of a structure increases, we move nearer to the goal state.
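A sketch of the local heuristic (the state encoding and block layout below are made up for illustration): represent a state as a dict mapping each block to what it rests on, with "T" for the table.

def local_score(state, goal):
    # +1 for each block on its correct support, -1 for each on a wrong one.
    return sum(1 if state[b] == goal[b] else -1 for b in state)

goal = {"A": "T", "B": "A", "C": "B", "D": "C",
        "E": "D", "F": "E", "G": "F", "H": "G"}
print(local_score(goal, goal))        # -> 8, the goal state's score
start = dict(goal, A="H")             # suppose A wrongly sits on H
print(local_score(start, goal))       # -> 6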

The score of the goal state is 8 (using the local heuristic), because all the blocks are in their correct positions.

Now J is the new current state with score 6 > the score of I (4). In step 2, three moves from the best state J are possible; all the neighbors of node J have a lower score than J, so J is a local maximum, and no improving move is possible from states K, L and M. The search falls into a TRAP situation. To overcome this problem with the local function we can apply the GLOBAL heuristic: the goal state then has a score of +28 and the initial state a score of -28, and again the best node for the next move is the one with the maximum score.

Further, from state M we can make the following moves:

(i) Push block G onto block A
(ii) Push block G onto block H
(iii) Push block H onto block A
(iv) Push block H onto block G
(v) Push block A onto block H
(vi) Push block A back onto block G
(vii) Push block G onto the TABLE ... and so on; we keep selecting the best node until we obtain the structure with a score of +28.

Ques 20. Explain Branch and bound search strategy in detail with an example.

Ans: Branch-and-Bound search is a way to combine the space saving of depth-first search with
heuristic information. It is particularly applicable when many paths to a goal exist and we want
an optimal path. As in A* search, we assume that h(n) is less than or equal to the cost of a lowest-
cost path from n to a goal node.

UNIVERSITY ACADEMY 20
The idea of branch-and-bound search is to maintain the lowest-cost path to a goal found so far, together with its cost. Suppose this cost is bound. If the search encounters a path p such that cost(p) + h(p) ≥ bound, path p can be pruned. If a non-pruned path to a goal is found, it must be better than the previous best path; this new solution is remembered and bound is set to its cost. The search then keeps looking for a better solution.
Let us take the following example for implementing the Branch and Bound algorithm.

(Figures showing the example graph and the step-by-step search are omitted here.)
Hence the search path will be A-B-D-F.
Advantages: since it finds the minimum-cost path rather than just a minimum-cost successor, there is no repetition, and its time complexity is low compared to other exhaustive algorithms.
Disadvantages: (i) the load-balancing aspects of branch and bound make it difficult to parallelize.
(ii) Branch and bound is limited to small networks. For large networks, where the solution search space grows exponentially with the scale of the network, the approach becomes prohibitive.
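A sketch of the pruning idea (the graph and heuristic values are made up, not those of the missing figure): keep the best cost found so far as the bound, and cut off any path whose g + h already meets it.

import math

def branch_and_bound(start, goal, neighbors, h):
    best_cost, best_path = math.inf, None
    def explore(state, path, g):
        nonlocal best_cost, best_path
        if g + h(state) >= best_cost:        # prune: cannot beat the bound
            return
        if state == goal:
            best_cost, best_path = g, path   # new, better solution found
            return
        for nxt, cost in neighbors(state):
            if nxt not in path:              # avoid cycles
                explore(nxt, path + [nxt], g + cost)
    explore(start, [start], 0)
    return best_cost, best_path

graph = {"A": [("B", 1), ("C", 4)], "B": [("D", 3), ("C", 1)],
         "C": [("F", 3)], "D": [("F", 2)], "F": []}
h = {"A": 3, "B": 2, "C": 2, "D": 1, "F": 0}.get
print(branch_and_bound("A", "F", lambda s: graph[s], h))
# -> (5, ['A', 'B', 'C', 'F'])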

Ques 21. What is Best first search algorithm, explain? Give an example also. Compare best
first search with Hill climbing approach.

Ques 21 (restated): BEST FIRST SEARCH. In BFS and DFS, when we are at a node, we can take any adjacent node as the next node; both BFS and DFS blindly explore paths without considering any cost function. The idea of best first search is to use an evaluation function to decide which adjacent node is most promising, and then explore it. Best first search falls under the category of heuristic (informed) search.

We use a priority queue to store the costs of nodes, so the implementation is a variation of BFS in which the queue is replaced by a priority queue. An evaluation function assigns a score to each candidate node. The algorithm maintains two lists: one containing candidates yet to be explored (OPEN), and one containing the visited nodes (CLOSED). Since all unvisited successors of every visited node are included in the OPEN list, the algorithm is not restricted to exploring only the successors of the most recently visited node: it always chooses the best of all unvisited nodes generated so far, rather than being restricted to a small subset such as the immediate neighbors. Other search strategies, such as depth-first and breadth-first, do have this restriction. The advantage of this strategy is that if the algorithm reaches a dead-end node, it will continue to try other nodes.

The heuristic function used is f(n) = g(n) + h(n), which estimates the cost of the cheapest path through node n to the goal. g(n) is the cost of getting from the initial state to the current node, and h(n) is the estimated additional cost of getting from the current node to the goal state. If n is a goal node, then h(n) = 0, so f(n) = g(n).

A directed graph (OR graph) is used, in which each node is a point in the problem space and an alternative problem-solving path exists from each branch. A parent link always points to the best node from which a node was reached, together with the list of nodes generated from it; the parent link makes it possible to recover the path to the goal once the goal is found.

The algorithm of BEST FIRST SEARCH is represented here in pseudo-code:

1. Define a list, OPEN, consisting solely of a single node, the start node s.
2. IF the list is empty, return failure.
3. Remove from the list the node n with the best score (the node where f is minimum), and move it to a list, CLOSED.
4. Expand node n.
5. IF any successor of n is the goal node, return success and the solution (by tracing the path from the goal node back to s).
6. FOR each successor node:
   a) Apply the evaluation function f to the node.
   b) IF the node has not been on either list, add it to OPEN.
7. Loop: go back to step 2.
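The pseudo-code above as a runnable sketch, using heapq as the OPEN priority queue (the graph and heuristic values are illustrative, not those of the worked figure; with f = h this is greedy best first search):

import heapq
import itertools

def best_first_search(start, goal, successors, f):
    counter = itertools.count()          # tie-breaker for equal f values
    open_list = [(f(start), next(counter), start, [start])]
    closed = set()
    while open_list:
        _, _, node, path = heapq.heappop(open_list)   # best (minimum f) node
        if node == goal:
            return path
        if node in closed:
            continue
        closed.add(node)
        for child in successors(node):
            if child not in closed:
                heapq.heappush(open_list,
                               (f(child), next(counter), child, path + [child]))
    return None                          # OPEN exhausted: failure

graph = {"S": ["A", "B", "C"], "A": ["D", "E"], "B": ["F", "G"],
         "C": ["H"], "D": [], "E": [], "F": [], "G": [], "H": ["I"], "I": []}
h = {"S": 10, "A": 4, "B": 6, "C": 3, "D": 9, "E": 8,
     "F": 7, "G": 5, "H": 1, "I": 0}.get
print(best_first_search("S", "I", lambda n: graph[n], h))   # -> ['S', 'C', 'H', 'I']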

Example of Best First Search

Step 1. Start from source node S and search for goal I using the given costs and best first search.
Step 2. The priority queue OPEN initially contains S. We move S from OPEN to CLOSED and add the unvisited neighbors of S to OPEN, giving OPEN = {A, B, C} and CLOSED = {S}.
Step 3. We remove A from OPEN and add the unvisited neighbors of A to OPEN:
OPEN = {C, B, E, D} and CLOSED = {S, A}.

Step 4. We remove C from OPEN and add the unvisited neighbors of C to OPEN:
OPEN = {B, H, E, D} and CLOSED = {S, A, C}.

Step 5. We remove B from OPEN and add the unvisited neighbors of B to OPEN:
OPEN = {H, E, D, F, G} and CLOSED = {S, A, C, B}.
Step 6. We remove H from OPEN and put H in CLOSED, so CLOSED = {S, A, C, B, H}. Since our goal I is a neighbor of H, we return success.

Example 2: (the worked figure for this second example is omitted here.)
Comparison of Best first search and Hill Climbing

S.No  Hill Climbing                                   Best first search
1.    One move is selected and the others are         One move is selected, but the others are
      rejected.                                       kept in memory for further consideration.
2.    Less memory required.                           More memory required.
3.    Terminates if there are no successor nodes      Even if every successor has a lower value
      with better values than the current state.      than the node just explored, the best
                                                      available state is still considered.
4.    Priority queues are not maintained.             Priority queues are maintained.
5.    Heuristic f(n) = g(n) + h(n) is not used.       Heuristic f(n) = g(n) + h(n) is used.
6.    Time complexity is proportional to the          The worst-case time complexity is
      length of the steepest ascent route from        O(n log n), where n is the number
      the initial state; O(∞) in the worst case.      of nodes.
7.    Space complexity is O(b), where b is the        Same as the time complexity, i.e.,
      branching factor.                               O(n log n).

Ques 22. Explain the A* search algorithm. Discuss the admissibility of the A* algorithm.

Ans : The A* algorithm is an extension of BEST FIRST SEARCH, proposed by Hart, Nilsson and Raphael in 1968. It combines the features of branch and bound, Dijkstra's algorithm and best first search.

(i) At each step, A* picks the node according to a value f, a parameter equal to the sum of two other parameters, g and h. At each step it picks the node/cell having the lowest f and processes that node/cell.
(ii) f(n) = g(n) + h(n), where g(n) is the cost of getting from the initial state to the current node, h(n) is the estimate of the additional cost of getting from the current node to the goal state, and f(n) gives the estimated total cost of reaching the goal through n.
(iii) A* maintains two lists, OPEN and CLOSED, as in best first search. OPEN contains those nodes which are unexpanded and not yet evaluated; CLOSED contains those nodes whose successors have been generated and whose cost has been evaluated using the heuristic function.

A* Search Algorithm

1. Initialize the OPEN list with the start node; set g = 0 and f = 0 + h initially, and CLOSED = Φ.
2. Repeat until a goal is reached:
If OPEN == {Φ}, then failure occurs.
Else select the node in OPEN with the minimum f value and set this node = BEST NODE for the current path.
Remove BEST NODE from OPEN and add it to CLOSED.
If (BEST NODE == Goal Node), the search succeeds.
Else generate the successors of BEST NODE, but do not set BEST NODE to point to them yet.
(a) For each SUCCESSOR:
(i) If the SUCCESSOR is the goal, stop the search; else set the SUCCESSOR node to point back to BEST NODE, to allow recovery of the current path.

(ii) Compute g(SUCCESSOR) = g(BEST NODE) + the cost of getting to the SUCCESSOR from BEST NODE.

(iii) If a node with the same position as the SUCCESSOR is in the OPEN list with a lower f than the SUCCESSOR, skip this SUCCESSOR.

(iv) If a node with the same position as the SUCCESSOR is in the CLOSED list with a lower f than the SUCCESSOR, skip this SUCCESSOR; otherwise, add the node to the OPEN list.
end (for loop)

(b) Push the current BEST NODE onto the CLOSED list.

(c) If the current path to the SUCCESSOR is cheaper than the current best path to the old node, then reset the old node's parent link to point to BEST NODE; else do nothing. Record g(old) and update f(old).

(d) To propagate the new cost down the graph, perform a depth-first traversal starting from the old node and changing each node's g value, terminating each branch when you reach either a node with no successors or a node to which an equivalent or better path has already been found.

(e) If the SUCCESSOR is in neither OPEN nor CLOSED, add it to OPEN and add it to the list of BEST NODE's successors.

(f) Compute f(SUCCESSOR) = g(SUCCESSOR) + h(SUCCESSOR).

end (while loop)
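
A compact Python sketch of the loop above (the node names, graph encoding and heuristic are assumptions for illustration). It keeps the best known g for each node and parent links for recovering the path:

import heapq

def a_star(graph, h, start, goal):
    # graph: dict node -> list of (neighbor, step cost); h: dict node -> estimate
    best_g = {start: 0}
    parent = {start: None}
    open_heap = [(h[start], start)]                 # ordered by f = g + h
    closed = set()
    while open_heap:
        f, node = heapq.heappop(open_heap)          # BEST NODE: minimum f
        if node == goal:
            path = []
            while node is not None:                 # trace parent links back to start
                path.append(node)
                node = parent[node]
            return path[::-1], best_g[goal]
        if node in closed:
            continue
        closed.add(node)
        for succ, cost in graph.get(node, []):
            g2 = best_g[node] + cost
            if g2 < best_g.get(succ, float("inf")): # cheaper path found: update links
                best_g[succ] = g2
                parent[succ] = node
                heapq.heappush(open_heap, (g2 + h[succ], succ))
    return None, float("inf")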

Admissibility of A*
An algorithm is said to be admissible if it is guaranteed to return an optimal solution, whenever one exists.
A* is admissible if it satisfies the following assumptions:
(a) The branching factor is finite; that is, from every node only a finite number of alternative paths emerge.
Proof: In every cycle of the main loop of A*, the algorithm picks one node from OPEN and places it in CLOSED. Since there are only a finite number of nodes, the algorithm will terminate in a finite number of cycles, even if it never reaches the goal.
(b) The cost of each move is greater than some arbitrarily small positive value ε,
i.e. for all m, n : k(m, n) > ε.

(c) The heuristic function underestimates the cost to the goal node, i.e. for all n : h(n) ≤ h*(n).

Proof: A* is complete and optimal. Admissible heuristics are by nature optimistic, because they assume the cost of solving the problem is less than it actually is. Since g(n) is the exact cost to reach n, we have as an immediate consequence that f(n) never overestimates the true cost of a solution through n. Let the cost of the optimal solution be C*, and let G2 be a suboptimal goal node; h(G2) = 0, since it is a goal node. Therefore f(G2) = g(G2) + h(G2) = g(G2) > C*. If h(n) does not overestimate the cost of completing the solution path, then f(n) = g(n) + h(n) ≤ C*, so A* expands n (and finds the optimal solution) before it would select G2.

A heuristic is consistent if, for each node n and each successor n′ of n generated by an operator/action a, the estimated cost of reaching the goal from n is no greater than the step cost of getting to n′ plus the estimated cost of reaching the goal from n′:
h(n) ≤ C(n, a, n′) + h(n′), where a is an action/operator.
If h(n) is consistent, then the values of f(n) along any path are non-decreasing. If n′ is a successor of n, then g(n′) = g(n) + C(n, a, n′), and
f(n′) = g(n′) + h(n′) = g(n) + C(n, a, n′) + h(n′) ≥ g(n) + h(n) = f(n).
So A* expands all the nodes with f(n) < C*.

Ques 23. Mention some observations about the g(n) and h(n) values in A* search. Discuss the underestimation and overestimation of heuristic values in the A* algorithm.

Ans : In the A* algorithm, if the heuristic function is f′ = g′ + h′, then f′, g′ and h′ denote estimated costs. If the function is f = g + h, then f, g and h are the actual costs.

Observations of g(n) value in A* search :

(i) Set g = 0 if we only want the node closest to a goal node, regardless of the path taken.

(ii) For the path having the fewest number of steps, set the cost of going from a node to its successor to a constant = 1.
(iii) If g = 0, the search is random (driven only by the heuristic).
(iv) If each step costs g = 1, the search behaves like breadth-first search: all nodes on one level have lower g values, and hence lower f values, than the nodes at the next level. E.g. at level 1, g = 1; at level 2, g = 2; and so on.

Observations of h′(n) value in A* search :

(v) If h′ is always zero, the search is controlled by the g value alone.

(vi) h′ tells how far we are from the goal state. If h′ is a perfect estimator of h, A* will converge immediately to the goal with no search. The better h′ is, the closer we come to this direct approach.

Underestimation and Overestimation of h′ values

A* is optimal if h(n) is an admissible heuristic, i.e. provided h(n) never overestimates the cost to reach the goal. The true g(n) and h(n) may not be known; what is known is the estimated cost of each.

In general g′(n) will be greater than or equal to g(n), because the algorithm may not yet have found the optimal path to n, i.e. g′(n) ≥ g(n). The heuristic value h(n) is the true distance to the goal from the current state.

If the algorithm is to guarantee an optimal solution, it is necessary that the heuristic function underestimates the distance to the goal, that is h′(n) ≤ h(n). If this condition holds, then A* is admissible.
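
The effect of overestimation can be seen on a tiny invented graph: with an admissible heuristic A* returns the optimal cost, while an overestimating one makes it commit to a worse path. A minimal sketch (all names and numbers are made up for this illustration):

import heapq

def a_star_cost(graph, h, start, goal):
    # Standard A*: stop when the goal is popped from the priority queue.
    best_g = {start: 0}
    heap = [(h[start], start)]
    while heap:
        f, node = heapq.heappop(heap)
        if node == goal:
            return best_g[goal]
        for succ, cost in graph.get(node, []):
            g2 = best_g[node] + cost
            if g2 < best_g.get(succ, float("inf")):
                best_g[succ] = g2
                heapq.heappush(heap, (g2 + h[succ], succ))
    return float("inf")

graph = {"S": [("A", 1), ("G", 3)], "A": [("G", 1)]}   # optimal path S-A-G, cost 2
h_admissible = {"S": 2, "A": 1, "G": 0}                # never overestimates
h_over       = {"S": 2, "A": 5, "G": 0}                # overestimates at A
print(a_star_cost(graph, h_admissible, "S", "G"))      # 2 (optimal)
print(a_star_cost(graph, h_over, "S", "G"))            # 3 (suboptimal)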

Ques 24. What is the problem reduction technique? Using it, explain AO* search with an example.
Ans : When a problem can be divided into a set of subproblems, where each subproblem can be solved separately and a combination of their solutions solves the whole, AND-OR graphs or AND-OR trees are used for representing the solution. The decomposition of the problem, or problem reduction, generates AND arcs. One AND arc may point to any number of successor nodes, all of which must be solved for the arc to constitute a solution. An arc may also give rise to several alternative arcs, indicating several possible solutions. The figure shows an AND-OR graph.

(i) In the above figure, the top node A has been expanded, producing two arcs: one leading to B and one leading to C-D.
(ii) The numbers at each node represent the value of f′ at that node (the cost of getting to the goal state from the current state). For simplicity, it is assumed that every operation (i.e. applying a rule) has unit cost, i.e. each arc with a single successor has a cost of 1, as does each component of an AND arc.

(iii) With the information available so far, C appears to be the most promising node to expand, since its f′ = 3 is the lowest; but going through B would be better, since to use C we must also use D, and the total cost would be 9 = (3 + 4 + 1 + 1). Through B it would be 6 = (5 + 1).
(iv) Thus the choice of the next node to expand depends not only on its f′ value but also on whether that node is part of the current best path from the initial node. Figure (b) makes this clearer: there, node G appears to be the most promising node, with the least f′ value, but G is not on the current best path, since to use G we must use the arc G-H with a cost of 9, and this in turn demands that further arcs be used (with a cost of 27).

Observation: In the AO* algorithm we consider the best node plus the best path (a global view), rather than the best node plus the best link (a local view).
The AO* algorithm uses a single structure, GRAPH, instead of the OPEN and CLOSED priority queues of A*. Each node in the graph points down to its immediate successors and up to its immediate predecessors, and carries with it the value h′, the estimated cost of a path from itself to a set of solution nodes. The cost g of getting from the start node to the current node is not stored, as it is in the A* algorithm, because it is not possible to compute a single such value: there may be many paths to the same state. In AO*, h′ therefore serves as the estimate of the goodness of a node. Also, a threshold value called FUTILITY is used: if the estimated cost of a solution exceeds FUTILITY, the search is abandoned as too expensive to be practical.
For searching such graphs, the AO* algorithm is as follows:

AO* ALGORITHM

1. Let GRAPH consist only of the node representing the initial state; call this node INIT. Compute h′(INIT).
2. Until INIT is labeled SOLVED or h′(INIT) > FUTILITY, repeat the following procedure:

(I) Trace the marked arcs from INIT and select an unexpanded node NODE.
(II) Generate the successors of NODE. If there are no successors, then assign
h′(NODE) = FUTILITY; this means that NODE is not solvable. If there are successors,
then for each one (call it SUCCESSOR) that is not also an ancestor of NODE, do the following:

(a) Add SUCCESSOR to the graph G.
(b) If SUCCESSOR is a terminal node, mark it SOLVED and set h′(SUCC) = 0.
(c) If SUCCESSOR is not a terminal node, compute its h′(SUCC).

(III) Propagate the newly discovered information up the graph by doing the following. Let S be a set of nodes that have been marked SOLVED or whose h′ values have changed. Initialize S to {NODE}. Until S is empty, repeat the following procedure:
(a) Select a node from S, call it CURRENT, and remove it from S.
(b) Compute the cost of each of the arcs emerging from CURRENT: cost(ARC) = Σ(h′ value of each node the arc points to) + the cost of the ARC itself. Assign the minimum of these costs as the new h′ of CURRENT.
(c) Mark the minimum-cost path as the best path out of CURRENT.

(d) Mark CURRENT SOLVED if all of the nodes connected to it through the newly marked arcs have been labeled SOLVED.

(e) If CURRENT has been marked SOLVED or its h′ has just changed, its new status must be propagated back up the graph; hence all the ancestors of CURRENT are added to S.
Note: AO* will always find a minimum-cost solution; AO* is both admissible and complete.
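
Step (III)(b), the revision of h′ at a node, can be sketched in a few lines of Python (the graph encoding is an assumption for illustration; the node names and numbers reuse the B / C-D example from the figure):

def revise_cost(h, arcs, node):
    # h:    dict node -> current h' estimate
    # arcs: dict node -> list of (arc cost, list of child nodes);
    #       an AND arc simply lists several children.
    options = arcs.get(node, [])
    if not options:
        return h[node]
    # cost(ARC) = cost of the arc itself + sum of the h' values of its nodes
    return min(cost + sum(h[c] for c in children) for cost, children in options)

h = {"B": 5, "C": 3, "D": 4, "A": 0}
arcs = {"A": [(1, ["B"]), (2, ["C", "D"])]}   # OR arc to B; AND arc to C and D
print(revise_cost(h, arcs, "A"))              # min(1+5, 2+3+4) = 6, i.e. via B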

Ques 25. (i) Compare the A* and AO* algorithms with each other.

(ii) Why does unnecessary backward propagation sometimes occur in AND-OR graphs?

Ans: Comparison between the A* and AO* search algorithms

S. No | A* algorithm | AO* algorithm
1. | Propagation of revised cost estimates back up the tree is not required in A*. | Backward propagation of revised costs is needed.
2. | Individual paths and their costs can be considered independently. | Not possible in AO* within an AND arc.
3. | A* uses two lists, OPEN and CLOSED, as its data structures. | Only a single graph is used.
4. | g values are stored explicitly. | g values are implicit.
5. | The desired path from one node to another is always the one with the lowest cost. | Not always true in AO* search.

Unnecessary Backward Cost Propagation in the AO* search algorithm

The path from A to C is better than the path from A to B, so the expansion of B is wasted here. But if the cost of node E is revised and the change is not propagated through B and C, then B may appear better.

If the updated cost of E = 10, then C's updated cost = 11. The path from A to C then has cost 12 compared with 11 for A to B, and it will erroneously appear to be the more promising path. If D is expanded next, its cost is propagated to B; B's cost is recomputed and E's new value is used. The new cost of B is propagated back up to A. At that point the path through C is again better, so some time has been wasted in the unnecessary expansion of D.

Ques 26. For the graph given below, explain when the search would terminate with A if node F is expanded next and its successor is node A. Also give the steps of the search.

Ans: If node F is expanded next with child node A, then the cost of A will be propagated upward in this graph with a cycle. Initially the cost of A with the AND arc is:
h(A) = h(C) + h(D) + cost of arc (A-C) + cost of arc (A-D)
h(A) = 11 + 15 + 1 + 1 = 28. Now 28 will be used to evaluate the revised cost of node F as follows:
h(F) = h(A) + 1 (because of the OR path between nodes F and A), so h(F) = 28 + 1 = 29.

Now, between h(E) and h(F): h(E) = 30 > h(F) = 29, so the OR path C-F is better than the path C-E.
So the revised cost of C = h_new(F) + 1 = 29 + 1 = 30.
Now h_new(A) = h(C) + h(D) + 2 = 30 + 15 + 2 = 47.
So h_new(F) = 47 + 1 = 48.

Now h(E) = 30 < h_new(F) = 48, so node E is the better node, and the arc C-E is better now.
So h_new(C) = h(E) + 1 = 30 + 1 = 31. Again revise the cost of node A through upward cost propagation:
h_new(A) = h_new(C) + h(D) + 2 = 31 + 15 + 2 = 48.
So h_new(F) = h_new(A) + 1 = 48 + 1 = 49.
Node E is still better than F, so h(C) = 31, and so on. This search will continue with no change in the chosen path, so the cycle will repeat until the cost of the search exceeds FUTILITY. At that point the search will be terminated.

Ques 27. How is AI useful in game playing techniques? Describe what adversarial search is.

Ans. Game playing is a very important and emerging field of AI, which makes machines play several games that people can play. A machine may play against another machine or a robot, and it may equally play against a person. This requires a great deal of searching and knowledge. Charles Babbage, the 19th-century computer architect, planned to program the Analytical Engine to play CHESS.

In the 1960s Arthur Samuel developed the first operational game-playing program. Mathematical game theory, a branch of economics, views any multi-agent environment as a game, provided that the impact of each agent on the others is significant, regardless of whether the agents are cooperative or competitive.
In AI, games are deterministic, zero-sum games: two agents whose actions alternate, and in which the UTILITY values at the end of the game are always equal and opposite. E.g. if one player wins a game of chess (+1), the other loses (-1). A utility function assigns a numeric value to terminal states.
Adversarial Search: Here two opponents (agents) playing a game compete in an environment, so that each move of agent A opposes agent B, and each tries to gain an advantage over the other by maximizing its own UTILITY and minimizing the opponent's UTILITY.

E.g : Chess game , Bridge game of playing cards, Tic Tac Toe , etc.

Some games restrict agents to a small number of actions whose outcomes are defined by precise rules. Physical games like ice hockey require a more complex rule set to be defined, and a larger range of operators/actions is needed for better efficiency and results. Among physical games, Robot Soccer is particularly popular with game-playing researchers in AI.

UTILITY FUNCTION: This assigns a numeric value to terminal states. In a chess game the outcome can be a win, loss or draw, with values +1, -1 and 0 respectively.

A game tree is generated showing the different states and the cost of each state, and the search for the goal is performed in this tree. In a game tree a full move consists of two half-moves, each called a PLY: for the maximizing player we have a MAX ply, and for the minimizing player a MIN ply.

In game playing, move generation and the terminal test must be good enough for fast searching. We can use a Plausible Move Generator function, in which only a small number of promising moves are generated. Alan Turing gave a utility function based on the material advantage of pieces in chess, as follows: add the values of the black pieces (B) and the values of the white pieces (W), and compute W/B (see the sketch below).
Factors considered in an agent's moving criteria can be:
(i) piece advantage, (ii) capability of progress, (iii) control of the center, (iv) mobility,
(v) threat of a fork.
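
A minimal sketch of such a material-based evaluation in the spirit of Turing's W/B measure (the piece values are the conventional ones; the board encoding, a list of piece letters with uppercase for White, is an assumption for illustration):

PIECE_VALUES = {"p": 1, "n": 3, "b": 3, "r": 5, "q": 9}   # conventional values

def material_utility(pieces):
    # Turing-style measure: total White material divided by total Black material.
    white = sum(PIECE_VALUES.get(p.lower(), 0) for p in pieces if p.isupper())
    black = sum(PIECE_VALUES.get(p, 0) for p in pieces if p.islower())
    return white / black if black else float("inf")

print(material_utility(["Q", "R", "p", "p", "n"]))        # (9+5)/(1+1+3) = 2.8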

TERMINAL TEST : This test determines when the game is over. States where game is over
are called terminal states.

Ques 24. Explain MINIMAX search technique/algorithm with an example.

Ans : In MINIMAX searching, the score compares what is good for the MAX player against what is good for the MIN player.

(i) Nodes for the MAX player, termed MAX nodes, take on the value of their highest-scoring successors.

(ii) Nodes for the MIN player take on the value of their lowest-scoring successors.

(iii) The assumption is that both the MAX and MIN players play optimally to win the game.

(iv) MINIMAX search uses a simple recursive computation of the minimax values of each successor node in the game tree, in DFS order. The minimax values are backed up through the game tree.

In the above game tree, the branches of B are explored first, in DFS order. Since B is on a MIN ply (it is the MIN player's move), we find min(E, F, G) and back the value up to node B:
[min(E, F, G) = min(3, 12, 8) = 3].
So 3 is backed up to node B as its score. Similarly, score(C) = min(2, 4, 6) = 2 and score(D) = min(14, 5, 2) = 2.
Now the score of the MAX player at A = max(3, 2, 2) = 3, so the winning path is A – B – E.

Complete? – Yes (if the tree is finite).

Optimal? – Yes (against an optimal opponent) –

No (it does not exploit an opponent's weaknesses against a suboptimal opponent).

Time complexity of MINIMAX search = O(b^m)

Space complexity = O(bm)

where b : number of legal moves at each point,

m : maximum depth of the game tree.
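
The recursion is short enough to state directly; a minimal Python sketch matching the walkthrough above (the nested-list encoding of the game tree is an assumption for illustration):

def minimax(node, is_max):
    # node is either a numeric leaf score or a list of child nodes;
    # plies alternate between MAX (is_max=True) and MIN (is_max=False).
    if not isinstance(node, list):
        return node                                   # leaf: static score
    values = [minimax(child, not is_max) for child in node]
    return max(values) if is_max else min(values)

# The game tree from the example: A (MAX) over B, C, D (MIN plies):
tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]
print(minimax(tree, True))                            # 3, winning path A - B - E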

Ques 28. (i) What is alpha-beta pruning/search?
(ii) Evaluate the winning cost of the MAX player in the following game tree using alpha-beta pruning.

Ans : (i) Alpha-Beta pruning is not actually a new algorithm, but rather an optimization technique for the MINIMAX algorithm. It reduces the computation time by a huge factor: the number of game states that MINIMAX has to examine is exponential in the number of moves, and pruning can effectively cut the exponent in half. This allows us to search much faster and even go to deeper levels in the game tree. It cuts off branches in the game tree which need not be searched because a better move is already available. It is called Alpha-Beta pruning because it passes two extra parameters to the minimax function, namely alpha and beta; it is also called the Alpha-Beta cut-off.
This technique, when applied, returns the same move as MINIMAX would, but prunes branches that cannot influence the final decision. It can be applied to a tree of any depth, in depth-first search order.

Alpha is the best (highest) value that the maximizer can currently guarantee at that level or above (along the MAX path); it is a lower bound at MAX nodes.

Beta is the best (lowest) value that the minimizer can currently guarantee at that level or above (along the MIN path); it is an upper bound at MIN nodes.

Note : The search below a MIN node can be terminated if the beta value of the MIN node is less than or equal to the alpha value bound at any of its ancestor MAX nodes.

The search below a MAX node can be terminated if the alpha value of the MAX node is greater than or equal to the beta value bound at any of its ancestor MIN nodes.

Pseudo Code of Alpha Beta Pruning


function minimax(node, depth, isMaximizingPlayer, alpha, beta):

if node is a leaf node :


return value of the node

if isMaximizingPlayer :
bestVal = - INFINITY
for each child node :
value = minimax(child, depth+1, false, alpha, beta)
bestVal = max( bestVal, value)
alpha = max( alpha, bestVal)
if beta <= alpha:
break

return bestVal

else :
bestVal = +INFINITY
for each child node :
value = minimax(child, depth+1, true, alpha, beta)
bestVal = min( bestVal, value)
beta = min( beta, bestVal)
if beta <= alpha:
break
return bestVal

// Calling the function for the first time, starting from the root node:

minimax(root, 0, true, -INFINITY, +INFINITY)

Solution (ii) :

Step 1: Initially apply DFS on the path A-B-D-I. The value of I is backed up at D as alpha = 3. Now, between nodes I and J, node J has beta = 5; since D is on a MAX ply, D can take a backed-up value ≥ 3, so the value at D is changed to alpha = 5.

Step 2: Now node E is generated. Alpha = 5 at D is backed up as beta = 5 at B. The value of E is less than beta = 5, i.e. B can have a value less than 5, so beta = 5 is changed to beta = 2 at node B.

Step 3: Now beta = 2 is backed up as alpha = 2 at node A.

Step 4: Now consider the other branch of node A, via the path A-C-F-K-P (DFS order). Node K is a MIN node, so min(P, Q) = min(0, 7) = 0. Therefore at P, alpha = 0, backed up as beta = 0 at node K. Node F is a MAX node, so max(K, L) = max(0, 5) = 5; hence alpha = 0 at node F is changed to alpha = 5. The search below K is therefore pruned, and node Q is not generated: at node K, beta = 0, and K's ancestor node F has alpha = 5, so K is cut off.

Step 5: Now node L is generated. At C, the ancestor node A has alpha = 2, and at C, beta = 5 (so alpha = 2 < beta = 5). Finally C is changed to beta = 3. This value is backed up as alpha = 3 (since node A, with alpha = 2, may take a value ≥ 2). Hence the new revised score of MAX node A is 3.

Ques 29. What is Constraint satisfaction problem?

Ans : A constraint satisfaction problem (CSP) is defined by a set of variables X1, X2, . . . , Xn and a set of constraints C1, C2, . . . , Cm. Each variable Xi has a nonempty domain Di of possible values. Each constraint Ci involves some subset of the variables and specifies the allowable combinations of values for that subset. A state of the problem is defined by an assignment of values to some or all of the variables, {Xi = vi, Xj = vj, . . .}. An assignment that does not violate any constraints is called a consistent or legal assignment. A complete assignment is one in which every variable is mentioned, and a solution to a CSP is a complete assignment that satisfies all the constraints. Some CSPs also require a solution that maximizes an objective function.

Varieties of CSPs
(A) Discrete variables
• Finite domains: n variables, domain size d, O(d^n) complete assignments.
• Infinite domains: integers, strings, etc.; e.g. job scheduling, where the variables are the start/end days for each job. These need a constraint language, e.g. StartJob1 + 5 ≤ StartJob3.
(B) Continuous variables
• e.g. start/end times for Hubble Space Telescope observations.
• Linear constraints are solvable in polynomial time by linear programming.

Unary constraints involve a single variable, e.g. SA ≠ green.

Binary constraints involve pairs of variables, e.g. SA ≠ WA.
Higher-order constraints involve 3 or more variables, e.g. cryptarithmetic column constraints.

CSP is a two-step process:

Step 1: Constraints are discovered and propagated as far as possible through the system. If there is still no solution, search starts: a guess is made for certain parameters and added as a new constraint, and so on. Constraint propagation (CP) can terminate for the following two reasons:

1. A contradiction arises in the given conditions.

2. No more guesses can be made.

If either of the above holds, the search proceeds further to find a solution.

Step 2: Some hypothesis is assumed to make the constraints more useful, and constraint propagation begins again for the new state.

If a solution is found, it is reported.

If still more guesses are required, they can be made.

If a contradiction arises, then backtrack to the previous correct state and proceed with a new guess.

Constraints can be used to:

(i) check the correctness of a partial solution and hence cut off the unwanted branches of the search tree;
(ii) calculate some parameters from others;
(iii) choose which parameters to fix next.
Algorithm for CSP :

Step 1: Propagate the available constraints. Initially set OPEN = {the set of objects which must have values assigned in a complete solution}. Then
do until {an inconsistency is detected or OPEN = NULL}:
1.1 Select an object OB from OPEN. Strengthen as far as possible the set of constraints that apply to OB.
1.2 If this set is different from the set that was assigned the last time OB was examined, or if this is the first time OB has been examined, then add to OPEN all objects that share any constraints with OB.
1.3 Remove OB from OPEN.

Step 2: If the union of the constraints discovered above gives a solution, then return it.

Step 3: If the union of the constraints discovered above gives a contradiction, then return failure.
Step 4: If neither step 2 nor step 3 occurs, then make a guess at something in order to proceed.
Loop until a solution is found or all possible solutions are eliminated:
4.1 Select an object whose value is not yet determined and select a way to make its constraints stronger.
4.2 Recursively invoke constraint satisfaction with the current set of constraints augmented by the selected, strengthened constraint.
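
Step 4's guess-and-backtrack loop is the core of most CSP solvers. A small self-contained sketch on a toy map-coloring CSP (the variables, neighbors and colors are invented for illustration; the only constraint is that neighboring variables take different values):

def backtrack(assignment, variables, domains, neighbors):
    if len(assignment) == len(variables):
        return assignment                        # complete, consistent assignment
    var = next(v for v in variables if v not in assignment)
    for value in domains[var]:
        # constraint check: no neighbor may already hold this value
        if all(assignment.get(n) != value for n in neighbors[var]):
            assignment[var] = value              # make a guess
            result = backtrack(assignment, variables, domains, neighbors)
            if result is not None:
                return result
            del assignment[var]                  # contradiction: undo the guess
    return None                                  # all guesses eliminated

variables = ["WA", "NT", "SA"]
domains = {v: ["red", "green", "blue"] for v in variables}
neighbors = {"WA": ["NT", "SA"], "NT": ["WA", "SA"], "SA": ["WA", "NT"]}
print(backtrack({}, variables, domains, neighbors))
# {'WA': 'red', 'NT': 'green', 'SA': 'blue'}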

Ques 30. Solve the following Crypt Arithmetic Problem: S E N D + M O R E = MONEY


Ans : Solution for the above problem is as follows:

1. From column 5, M = 1, since 1 is the only carry possible from the sum of two single-digit numbers in column 4.
2. To produce a carry from column 4 to column 5, S + M must be at least 9. So S = 8 or 9, hence S + M = 9 or 10, and so O = 0 or 1. But M = 1, so O = 0.
3. If there were a carry from column 3 to 4, then E = 9 and so N = 0. But O = 0, so there is no carry, and S = 9 and c3 = 0.

4. If there is no carry from column 2 to 3, then E = N, which is impossible; therefore there is a carry, and N = E + 1 and c2 = 1.
5. If there is no carry from column 1 to 2, then N + R = E mod 10, and with N = E + 1 this gives E + 1 + R = E mod 10, so R = 9; but S = 9 already, so there must be a carry from column 1 to 2. Therefore c1 = 1 and R = 8.
6. To produce the carry c1 = 1 from column 1 to 2, we must have D + E = 10 + Y. As Y cannot be 0 or 1, D + E is at least 12. D is at most 7 and E is at least 5 (D cannot be 8 or 9, as those are already assigned). N is at most 7 and N = E + 1, so E = 5 or 6.
7. If E were 6, then since D + E is at least 12, D would be 7; but N = E + 1 would also be 7, which is impossible. Therefore E = 5 and N = 6.
8. D + E is at least 12, from which we get D = 7 and Y = 2.

SOLUTION : The final values of each letter after executing the algorithm are:

   9567
+ 1085
 10652

S = 9, E = 5, N = 6, D = 7,
M = 1, O = 0, R = 8, Y = 2
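
The deduction above can be cross-checked mechanically; a short brute-force sketch that tries every assignment of distinct digits to the eight letters (leading digits must be nonzero):

from itertools import permutations

for digits in permutations(range(10), 8):
    s, e, n, d, m, o, r, y = digits
    if s == 0 or m == 0:
        continue                                  # leading digits cannot be zero
    send  = 1000*s + 100*e + 10*n + d
    more  = 1000*m + 100*o + 10*r + e
    money = 10000*m + 1000*o + 100*n + 10*e + y
    if send + more == money:
        print(send, "+", more, "=", money)        # 9567 + 1085 = 10652 (unique)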

UNIT-3
Knowledge Representation & Reasoning:

• Propositional logic
• Theory of first order logic
• Inference in First order logic
• Forward & Backward chaining,
• Resolution.

Probabilistic reasoning

• Utility theory
• Hidden Markov Models (HMM)
• Bayesian Networks

Short Question & Answers
Ques 1. Differentiate between declarative knowledge and procedural knowledge.

Ans : Declarative knowledge means the representation of facts or assertions. A declarative representation

declares every piece of knowledge and permits the reasoning system to use the rules of inference to derive new facts and conclusions. Declarative knowledge consists of a database containing relevant information about some objects. E.g. a relational database of company employees, or the records of the students in a particular class.
Procedural knowledge represents actions or consequences and tells the HOW of a situation. This knowledge uses inference rules to manipulate procedures to arrive at the result. Example: an algorithm that solves the Travelling Salesman Problem step by step in a systematic order.

Ques 2. Define the terms Belief, Hypothesis, Knowledge and Epistemology.

• Belief : This is any meaningful and coherent expression that can be manipulated .
• Hypothesis: This is a justified belief that is not known to be true. Thus hypothesis is a belief
which is backed up with some supporting evidence.
• Knowledge: True justified belief is called knowledge.
• Epistemology: Study of the nature of knowledge.
Ques 3. What is formal logic? Give an example.

Ans : This is a technique for interpreting some sort of reasoning process; it is a symbolic manipulation mechanism. Given a set of sentences taken to be true, the technique determines what other sentences can be shown to be true. The logical nature or validity of an argument depends on the form of the argument.
Example: Consider the following two sentences: 1. All men are mortal. 2. Socrates is a man. From these we can infer that Socrates is mortal.

Ques 4. What is CNF and DNF ?

Ans : CNF (Conjunctive Normal Form): A formula P is said to be in CNF if it is of the form
P = P1 ˄ P2 ˄ … ˄ Pn, n ≥ 1, where each Pi (i = 1 to n) is a disjunction of atoms (literals).
Example: (Q ˅ P) ˄ (T ˅ ~Q) ˄ (P ˅ ~T).
DNF (Disjunctive Normal Form): A formula P is said to be in DNF if it has the form
P = P1 ˅ P2 ˅ … ˅ Pn, n ≥ 1, where each Pi (i = 1 to n) is a conjunction of atoms (literals).
Example: (Q ˄ P) ˅ (T ˄ ~Q) ˅ (P ˄ ~T).
Ques 5.What are Horn Clauses ? What is its usefulness in logic programming?

Ans : A Horn clause is a clause (a disjunction of literals) with at most one positive literal. A Horn clause with exactly one positive literal is called a definite clause. A Horn clause with no positive literals is sometimes called a goal clause. A dual Horn clause is a clause with at most one negative literal.

Example: ~P ˅ ~Q ˅ … ˅ ~T ˅ U is a definite Horn clause. The relevance of Horn clauses to theorem

proving by predicate logic resolution is that the resolvent of two Horn clauses is itself a Horn clause, and the resolvent of a goal clause and a definite clause is again a goal clause. In automated reasoning this improves the efficiency of algorithms. Prolog is based on Horn clauses.

Ques 6. Determine whether the following PL formula is (a) satisfiable, (b) contradictory or (c) valid: (p ˄ q) → (r ˅ ~q)

Ans : Truth table for the above problem:

p  q  r  p˄q  ~q  r˅~q  (p˄q)→(r˅~q)
T  T  T   T   F    T         T
T  T  F   T   F    F         F
T  F  T   F   T    T         T
T  F  F   F   T    T         T
F  T  T   F   F    T         T
F  T  F   F   F    F         T
F  F  T   F   T    T         T
F  F  F   F   T    T         T

The formula is true under some assignments but false when p = T, q = T, r = F; therefore it is satisfiable, but it is neither valid (not a tautology) nor contradictory.
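
Such classifications can be verified mechanically by enumerating all truth assignments; a small Python sketch for the formula above:

from itertools import product

def formula(p, q, r):
    return (not (p and q)) or (r or not q)        # (p ˄ q) → (r ˅ ~q)

values = [formula(p, q, r) for p, q, r in product([True, False], repeat=3)]
if all(values):
    print("valid (tautology)")
elif any(values):
    print("satisfiable but not valid")            # printed for this formula
else:
    print("contradictory")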

Ques 7. Convert the following sentences into wffs of Predicate Logic (first order logic).
(i) Ruma dislikes children who drink tea.
(ii) Any person who is respected by every person is a king.

Ans : (i) ∀x : Child(x) ˄ DrinkTea(x) → Dislikes(Ruma, x)

(ii) ∀x ∀y : Person(y) ˄ Respects(y, x) → King(x)

Long Question & Answers
Ques 8 : Define the term knowledge. What is the role of knowledge in Artificial Intelligence?
Explain various techniques of knowledge representation.

Ans : Knowledge: Knowledge is just another form of data. Data is raw facts; when these raw facts are organized systematically and are ready to be processed by a human brain or some machine, they become knowledge. From this knowledge we can draw the desired conclusions, which can be used to solve simple and complex real-world problems.
Example: A doctor treating a patient requires both knowledge and data. The data is the patient's record (the patient's history, measurements of vital signs, diagnostic reports, response to medicines, etc.). The knowledge is the information the doctor gained during his studies at medical college.
The cycle from data to knowledge is as follows:
(a) Raw data, when refined, processed or analyzed, yields information which becomes useful in answering users' queries.
(b) With further refinement, analysis and the addition of heuristics, information may be converted into knowledge, which is useful in problem solving and from which additional knowledge may be inferred.
Role of Knowledge in AI : Knowledge is central to AI. The more knowledge an agent has, the better its chances of behaving intelligently compared with others. Knowledge also improves the search efficiency of the human brain. Knowledge is needed to support intelligence because:
(a) We can understand natural language with the help of it and use it when required.
(b) We can make decisions if we possess sufficient knowledge about the certain domain.
(c) We can recognize different objects with varying features quite easily.
(d) We can interpret various changing situations very easily and logically.
(e) We can plan strategies to solve difficult problems altogether.
(f) Knowledge is dynamic and Data is static.
An AI system must be capable of doing following three things :
(a) Store the knowledge in knowledge base(Both static and dynamic KB)
(b) Apply the knowledge stored to solve problems.
(c) Acquire new knowledge through the experience.
Three key components of an AI system.

1. Representation 2. Learning 3. Reasoning.
Various techniques of knowledge representation:

(A) Simple relational knowledge
(B) Inheritable knowledge
(C) Inferential knowledge
(D) Procedural knowledge

(A) Relational Knowledge : This is the simplest way to represent knowledge, in static form, stored in a database as a set of records. Facts about a set of objects and the relationships between objects are set out systematically in columns. This technique offers very little opportunity for inference, but it provides a knowledge base for other, more powerful inference mechanisms. Example: the set of records of employees in an organization, or the records and related information of voters for elections.
(B) Inheritable Knowledge : One of the most useful forms of inference is property inheritance. In this method, elements of certain classes inherit attributes and values from the more general classes of which they are members. Features of inheritable knowledge are:
➢ Property inheritance (objects inherit values from being members of a class; the data must be organized into a hierarchy of classes).
➢ Boxed nodes (containing objects and the values of the objects' attributes).
➢ Values can themselves be objects with attributes, and so on.
➢ Arrows (pointing from an object to its value).
➢ This structure is known as a slot-and-filler architecture, a semantic network, or a collection of frames.
➢ In semantic networks, nodes representing classes or objects with some inherent meaning are connected in a
network structure.

(C) Inferential Knowledge : Knowledge is useless unless there is some inference process that can exploit it. The required inference process implements the standard logical rules of inference, representing knowledge as a form of formal logic. Example: All dogs have tails: ∀x : dog(x) → hastail(x).
This knowledge supports automated reasoning. The advantages of this approach are:
➢ It has set of strict rules.
➢ Can be used to derive more facts.
➢ Truth of new statements can be verified.
➢ Guaranteed correctness.
(D) Procedural Knowledge: This is knowledge encoded in the form of procedures, i.e. small programs that know how to do specific things and how to proceed. E.g. a parser in a natural language system has the knowledge that a noun phrase may contain articles, adjectives and nouns; it is represented by calls to routines that know how to process articles, adjectives and nouns.
Advantages :
➢ Heuristic or domain specific knowledge can be represented
➢ Extended logical inferences, like default reasoning is incorporated.

➢ Side effects of actions may be modeled.

Disadvantages :
➢ Not all the cases may be represented.
➢ Not all the deductions may be correct
➢ Modularity is not necessary, control information is tedious.

Ques 9 : Define the term logic. What is the role of logic in Artificial Intelligence? Compare
Propositional logic with First order logic (Predicate Calculus).

Ans : Logic is defined as the scientific study of the process of reasoning and of the system of rules and procedures that help in the reasoning process. Logic-based reasoning represents the required knowledge using expressions in formal logic; inference rules and proof procedures can then apply this knowledge to solve specific problems.

We can derive a new piece of knowledge by proving that it is a consequence of knowledge that is already known. We generate logical statements to prove certain assertions.

Algorithm = logic + control

Role of Logic in AI

➢ Computer scientists are familiar with the idea that logic provides techniques for analyzing the
inferential properties of languages. Logic can provide specification for a programming language by
characterizing a mapping from programs to the computations that they implement.
➢ A compiler that implements the language can be incomplete, as long as it approximates the logical requirements of the given problem. This makes it possible for the involvement of logic in AI applications to vary, from relatively weak uses in which the logic merely informs the implementation process, to in-depth analysis.
➢ Logical theories in AI are independent of implementations. They provide insights into the reasoning problem without directly informing the implementation.
➢ Ideas from logic theorem proving and model construction techniques are used in AI.
➢ Logic works as a analysis tool , knowledge representation technique for automated reasoning and
developing Expert Systems. Also it gives the base to programming language like Prolog to develop
AI softwares.

George Boole (1815-1864) wrote a book named "An Investigation of the Laws of Thought", whose aim was: to investigate the fundamental laws of those operations of the mind by which reasoning is performed; to give expression to them in the symbolical language of a calculus, and upon this foundation to establish the science of logic and construct its method; and to make this method itself the basis of a general method, gathering from the various elements of truth brought to view in the course of these inquiries some probable intimations concerning the nature and constitution of the human mind.
Comparison b/w Propositional Logic & First Order Predicate Logic

S.NO | PL | FOPL
1. | Less declarative | More declarative
2. | Context-dependent semantics | Context-independent semantics
3. | Ambiguous and less expressive | Unambiguous and more expressive
4. | Propositions are used as components with logical connectives | Uses predicates/relations between objects, functions, variables, logical connectives and quantifiers (existential and universal)
5. | Rules of inference such as Modus Ponens, Modus Tollens and disjunctive syllogism are used for deduction | Rules of inference are used along with the rules for quantifiers
6. | Inference algorithms like inference rules, DPLL and GSAT are used | Inference algorithms like unification, resolution, and backward and forward chaining are used
7. | NP-complete | Semi-decidable

Ques 10 (A) Convert the following sentences to wff in first order predicate logic.
(i) No coat is water proof unless it has been specially treated.
(ii) A drunker is enemy of himself.
(iii) Any teacher is better than a lawyer.
(iv) If x and y are both greater than zero, so is the product of x and y.
(v)Every one in the purchasing department over 30 years is married.
(B) Determine whether each of the following sentences is satisfiable, contradictory or valid:
S1 : (p ˅ q) → ((p ˅ ~q) ˅ p)        S2 : (p → q) → ~p
Ans : (A) (i) No coat is waterproof unless it has been specially treated.

∀x : [C(x) → (~W(x) ˅ S(x))], where:

C(x) : x is a coat, ~W(x) : x is not waterproof, S(x) : x has been specially treated.
(ii) A drunkard is an enemy of himself.
∀x : [D(x) → E(x, x)], where: D(x) : x is a drunkard, E(x, x) : x is an enemy of himself.

(iii) Any teacher is better than a lawyer.

∀x : [T(x) → ∃y : (L(y) ˄ B(x, y))], where:
T(x) : x is a teacher, L(y) : y is a lawyer, B(x, y) : x is better than y.

(iv) If x and y are both greater than zero, so is the product of x and y.
∀x ∀y [GT(x, 0) ˄ GT(y, 0) → GT(times(x, y), 0)],
where GT : greater than, times(x, y) : x times y; alternatively we can use product_of(x, y), where product_of is a function.

(v) Everyone in the purchasing department over 30 years old is married.

∀x ∀y [works_in(x, purch_deptt) ˄ has_age(x, y) ˄ GT(y, 30) → Married(x)]

(B) (i) Truth table for S1 : (p ˅ q) → ((p ˅ ~q) ˅ p)

p  q  p˅q  ~q  p˅~q  (p˅q)→((p˅~q)˅p)
T  T   T   F    T           T
T  F   T   T    T           T
F  T   T   F    F           F
F  F   F   T    T           T

Hence, by the last column of the truth table, the above statement is satisfiable (but not valid).

(ii) Truth table for S2 : (p → q) → ~p

p  q  p→q  ~p  (p→q)→~p
T  T   T   F      F
T  F   F   F      T
F  T   T   T      T
F  F   T   T      T

Hence, by the last column of the truth table, the above statement is satisfiable (but not valid).

Ques 11: Using the inference rules of propositional logic, prove the validity of the following axioms:
(i) If either algebra is required or geometry is required, then all students will study mathematics.
(ii) Algebra is required and trigonometry is required; therefore all students will study mathematics.

Ans : Converting the above sentences to propositional logic and applying inference rules:

(i) (A ˅ G) → S
(ii) A ˄ T. To prove: S is true.

where A : algebra is required, G : geometry is required, T : trigonometry is required.

(iii) A ˄ T is true, so by simplification A is true (applying simplification to (ii)).
(iv) Now (A ˅ G) is true (applying addition to (iii)).
(v) Therefore S is true (applying Modus Ponens between (i) and (iv)).

Hence the above axioms are valid, because all are proved to be true.

Ques 12 : Determine whether the following argument is valid or not: "If I work the whole night on this problem, then I can solve it. If I solve the problem, then I will understand the topic. Therefore, if I work the whole night on this problem, then I will understand the topic."

Ans : Converting the above sentences to propositional logic and applying inference rules:

(i) WN → S, where WN : I work the whole night, S : I can solve the problem.

(ii) S → U, where U : I will understand the topic.

To prove the validity of: WN → U.

(iii) Between axioms (i) and (ii), apply the hypothetical syllogism (chain rule) of inference.
This gives WN → U; hence the validity of the argument is proved.

Ques 13. Given the following sentences, prove their validity:

(i) Either Smith attended the meeting or Smith was not invited to the meeting.
(ii) If the directors wanted Smith in the meeting, then Smith was invited to the meeting.
(iii) Smith did not attend the meeting.
(iv) If the directors did not want Smith in the meeting and Smith was not invited to the meeting, then Smith is on his way out of the company.
Ans : Converting the above sentences to propositional logic and applying inference rules:
(i) (A ˅ ~I), where A : Smith attended the meeting, ~I : Smith was not invited to the meeting.
(ii) (D → I), where D : the directors wanted Smith in the meeting, I : Smith was invited to the meeting.
(iii) ~A : Smith did not attend the meeting.
(iv) (~D ˄ ~I) → W. To prove that W is true, where W : Smith is on his way out of the company.
(v) ~I (by applying disjunctive syllogism between axioms (i) and (iii)).
(vi) ~D (by applying Modus Tollens between axioms (ii) and (v)).
(vii) (~D ˄ ~I) (by applying conjunction between (v) and (vi)).
(viii) W (by applying Modus Ponens between axioms (iv) and (vii)). (Hence proved.)
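
Such propositional arguments can also be validated by brute force: the conclusion is entailed iff it holds in every model of the premises. A small sketch with the variables A, I, D, W above:

from itertools import product

def premises(a, i, d, w):
    # (A ˅ ~I) ˄ (D → I) ˄ ~A ˄ ((~D ˄ ~I) → W), with → as material implication
    return ((a or not i) and ((not d) or i) and (not a)
            and ((not ((not d) and (not i))) or w))

entails = all(w for a, i, d, w in product([True, False], repeat=4)
              if premises(a, i, d, w))
print(entails)    # True: W holds in every model of the premises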

Ques 14 : What is the clause form of a wff (well-formed formula)? Convert the following formula into clause form: ∃x ∀y [∀z P(f(x), y, z) → {∃u Q(x, u) ˄ ∃v R(y, v)}].

Ans : Clause Form : In the theory of logic, whether propositional logic or predicate logic, while proving the validity of statements using the resolution principle it is required to convert a well-formed formula into clause form. Clause form is a set of clauses in which the literals within each clause are connected only through the OR (˅) connective.
Step 1: Elimination of implication, applying P → Q ≡ ~P ˅ Q:
∃x ∀y (~∀z P(f(x), y, z) ˅ (∃u Q(x, u) ˄ ∃v R(y, v)))

Step 2: Reducing the scope of negation, applying ~∀x F(x) ≡ ∃x ~F(x):

∃x ∀y (∃z ~P(f(x), y, z) ˅ (∃u Q(x, u) ˄ ∃v R(y, v)))

Step 3: Applying Qx F(x) ˅ G ≡ Qx [F(x) ˅ G]:

∃x ∀y ∃z (~P(f(x), y, z) ˅ (∃u Q(x, u) ˄ ∃v R(y, v)))

Step 4: Conversion to prenex normal form:

∃x ∀y ∃z ∃u ∃v (~P(f(x), y, z) ˅ (Q(x, u) ˄ R(y, v)))

Step 5: Skolemization (conversion to Skolem standard form), replacing x by the constant a and z, u, v by the Skolem functions g(y), h(y), I(y):

∀y (~P(f(a), y, g(y)) ˅ (Q(a, h(y)) ˄ R(y, I(y))))

Step 6: Removal of universal quantifiers:

~P(f(a), y, g(y)) ˅ (Q(a, h(y)) ˄ R(y, I(y)))

Step 7: Apply the distributive law for CNF, P ˅ (Q ˄ R) ≡ (P ˅ Q) ˄ (P ˅ R):

(~P(f(a), y, g(y)) ˅ Q(a, h(y))) ˄ (~P(f(a), y, g(y)) ˅ R(y, I(y)))

Step 8: On removing ˄ we get two clauses:

Clause 1: ~P(f(a), y, g(y)) ˅ Q(a, h(y))

Clause 2: ~P(f(a), y, g(y)) ˅ R(y, I(y))
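
For the purely propositional parts of such a conversion (implication elimination and the distributive step), a library can cross-check the algebra. A sketch using sympy's to_cnf (note this handles only propositional structure, not quantifiers or Skolemization):

from sympy import symbols
from sympy.logic.boolalg import Implies, And, to_cnf

p, q, r = symbols("p q r")
expr = Implies(p, And(q, r))      # P -> (Q ^ R)
print(to_cnf(expr))               # e.g. (q | ~p) & (r | ~p)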

Ques 15 : (A) What is the resolution principle in propositional logic? Explain.

(B) Let the following set of axioms be given as true: P, (P ˄ Q) → R,
(S ˅ T) → Q, T. The assumption is that all are true. Prove that R is true.

Ans : Resolution Principle : This is also called proof by refutation. To prove that a statement is valid, resolution attempts to show that the negation of the statement produces a contradiction with the known statements. At each step two clauses, called the PARENT CLAUSES, are compared/resolved, yielding a new clause that is inferred from them.
Example: Let two clauses in PL, C1 and C2, be given as:
C1 : winter ˅ summer, C2 : ~winter ˅ cold. The assumption is that both C1 and C2 are true. From C1 and C2 we can infer/deduce summer ˅ cold. This is the RESOLVENT CLAUSE.

The resolvent clause is obtained by combining all of the literals of the two parent clauses except the ones that cancel. If the clause that is produced is the empty clause, then a contradiction has been found.
E.g. winter and ~winter produce an empty clause.
Algorithm of resolution in propositional logic:

Step 1: Convert all the propositions of F to clause form, where F is set of axioms.

Step 2: Negate proposition P and convert the result to clause form. Add it to the set of clauses
obtained in step 1.

Step 3. Repeat until either a contradiction is found or no progress can be made:

(a) Select two clauses as a parent clause.


(b) Resolve them together. Resolvent clause will be the disjunction of all literals of both the
parent clause with following conditions :
UNIVERSITY ACADEMY 23
(i) If there are any pairs of literals L and ~L such that one of the parent clauses contains L and the other contains ~L, then select one such pair and eliminate both L and ~L from the resolvent clause.
(c) If the resolvent is the empty clause, then a contradiction has been found. If it is not, then add it to the set of clauses available to the procedure.

Ans : (B) Assume ~R is true and add it to the set of clauses formed from the given axioms (as the set of support).
C1 : P
C2 : ~P ˅ ~Q ˅ R (by eliminating the implication in (P ˄ Q) → R)
C3 : ~S ˅ Q, C4 : ~T ˅ Q
(eliminating the implication from (S ˅ T) → Q: ~(S ˅ T) ˅ Q ≡ (~S ˄ ~T) ˅ Q by De Morgan's law; now apply the distributive law to obtain (~S ˅ Q) ˄ (~T ˅ Q), which splits into the two clauses C3 and C4 after removing the AND connector)
C5 : T
C6 : ~R (the set of support)

Clauses C1 to C5 are the base set and C6 is the set of support. Resolution proceeds:

1. Resolve C2 (~P ˅ ~Q ˅ R) with C6 (~R): resolvent clause ~P ˅ ~Q.
2. Resolve ~P ˅ ~Q with C1 (P): resolvent ~Q.
3. Resolve ~Q with C4 (~T ˅ Q): resolvent ~T.
4. Resolve ~T with C5 (T): the empty clause; a contradiction is found.

The assumption that ~R is true is therefore false, so R is true.
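
The single resolution step used repeatedly above is easy to code. A minimal sketch with clauses represented as frozensets of literal strings, '~' marking negation (this representation is an assumption for illustration):

def negate(lit):
    return lit[1:] if lit.startswith("~") else "~" + lit

def resolve(c1, c2):
    # Return all resolvents of two clauses: for each complementary pair L, ~L,
    # combine the remaining literals of both parents.
    resolvents = []
    for lit in c1:
        if negate(lit) in c2:
            resolvents.append((c1 - {lit}) | (c2 - {negate(lit)}))
    return resolvents

c2 = frozenset({"~P", "~Q", "R"})       # clause C2 above
c6 = frozenset({"~R"})                  # the set of support
print(resolve(c2, c6))                  # [frozenset({'~P', '~Q'})]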

Ques 16: How is resolution in first order predicate logic different from that in propositional logic? What is the unification algorithm, and why is it required?

Ans : In FOPL, while solving through resolution, the situation is more complicated, since we must consider all the possible ways of substituting values for variables. The presence of existential and universal quantifiers in wffs and of arguments in predicates makes things more complicated.

The theoretical basis of the resolution procedure in predicate logic is Herbrand's theorem, which is as follows:
(i) To show that a set of clauses S is unsatisfiable, it is necessary to consider only interpretations over a particular set, called the Herbrand universe of S.
(ii) A set of clauses S is unsatisfiable iff a finite subset of ground instances of S (instances in which all bound variables have had a value substituted for them) is unsatisfiable.

To find a contradiction, we systematically try the possible substitutions and see whether each produces a contradiction. To apply resolution in predicate logic, we first need to apply the unification technique, because in FOPL literals with arguments are to be resolved, so matching of the arguments is also required.

Unification Algorithm: The unification algorithm is used as a recursive procedure. Let two literals in FOPL be P(x, x) and P(y, z). Here the predicate name P matches in both literals, but the arguments do not match, so a substitution is required. The first arguments, x and y, do not match, so substitute y for x; then they will match.

So the substitution 𝝈 = y/x is required (𝝈 is called the UNIFIER). If we now also tried to apply 𝝈 = z/x,

it would not be a consistent substitution, because we cannot substitute both y and z for x.

So after applying 𝝈 = y/x, we compare P(y, y) and P(y, z). Now unify the arguments

y and z by 𝝈 = z/y, so the resulting composition is (z/y)(y/x).
Some Rules for unification algorithm :

i. A variable can be unified with a constant.

ii. A variable can be unified with another variable.
iii. A variable can be unified with a function.
iv. A constant cannot be unified with a different constant.
v. Predicates/literals with different numbers of arguments cannot be unified.
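
A compact recursive unifier reflecting these rules can be sketched as follows (the term encoding, lowercase strings for variables and tuples for predicates/functions, is a toy convention for illustration; the occurs check is omitted):

def is_var(t):
    return isinstance(t, str) and t.islower()     # toy convention: lowercase = variable

def unify(x, y, s=None):
    # Return a substitution (dict) unifying x and y, or None on failure.
    if s is None:
        s = {}
    if x == y:
        return s
    if is_var(x):
        return unify_var(x, y, s)
    if is_var(y):
        return unify_var(y, x, s)
    if isinstance(x, tuple) and isinstance(y, tuple) and len(x) == len(y):
        for xi, yi in zip(x, y):                  # same name and arity required
            s = unify(xi, yi, s)
            if s is None:
                return None
        return s
    return None                                   # e.g. two different constants

def unify_var(v, t, s):
    if v in s:
        return unify(s[v], t, s)
    s = dict(s)
    s[v] = t
    return s

# P(x, x) against P(y, z): first y/x, then z/y
print(unify(("P", "x", "x"), ("P", "y", "z")))    # {'x': 'y', 'y': 'z'}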

Ques 17: Given the following set of facts, prove that "Some who are intelligent cannot read".

(i) Whoever can read is literate.

(ii) Dolphins are not literate.
(iii) Some dolphins are intelligent.

Ans : Solution: Form wffs of the given sentences.

S1 : ∀x [R(x) → L(x)], R(x) : x can read, L(x) : x is literate.

S2 : ~L(Dolphins), where ~L means not literate.
S3 : ∃x [D(x) ˄ I(x)], D(x) : x is a dolphin, I(x) : x is intelligent.
S1 to S3 form the base set. Let us assume that the negation of the statement to be proved is true.
To prove that ∃x [I(x) ˄ ~R(x)] is true, we assume ~∃x [I(x) ˄ ~R(x)] is true
and add it as the set of support to the base set:
~∃x [I(x) ˄ ~R(x)] ≡ ∀x [~I(x) ˅ R(x)] ≡ ~I(x) ˅ R(x)
Convert all wffs into clause form:

C1 : ~R(x) ˅ L(x),  C2 : ~L(Dolphins)

In S3, apply existential instantiation to remove the ∃ quantifier, introducing the constant c (a particular dolphin).

Therefore C3 : D(c) ˄ I(c) (this is in CNF now).

Two clauses can now be formed after eliminating the connector ˄, giving:
C3(a) : D(c),  C3(b) : I(c)
C4 : ~I(x) ˅ R(x) (this is the set of support).

Resolution now proceeds (reading C2 as applying to the instantiated dolphin c, i.e. ~L(c)):
1. Resolve C4 with C3(b), σ = c/x : R(c).
2. Resolve R(c) with C1, σ = c/x : L(c).
3. Resolve L(c) with C2 (~L(c)) : the empty clause; a contradiction is found.
Hence the assumption was false, and "Some who are intelligent cannot read" is proved.
Ques 18 : Given the following set of facts:

(i) John likes all kinds of food.

(ii) Apples are food.
(iii) Anything anyone eats and is not killed by is food.
(iv) Bill eats peanuts and is still alive.
(v) Sue eats everything Bill eats.
Translate the above into predicate logic and convert each wff so formed into clause form.
Prove that "John likes peanuts", using resolution.
Ans : Converting the given statements into wffs of FOPL:
1. ∀x : Food(x) → Likes(John, x)
2. Food(Apples)
3. Food(Chicken)
4. ∀x ∀y : Eats(x, y) ˄ ~Killed(x) → Food(y)
5. Eats(Bill, Peanuts) ˄ Alive(Bill)
6. ∀x : Eats(Bill, x) → Eats(Sue, x)
To prove: Likes(John, Peanuts).

Conversion of the above wffs into clause form:

C1 : ~Food(x) ˅ Likes(John, x)
C2 : Food(Apples)
C3 : Food(Chicken)
C4 : ~Eats(x, y) ˅ Killed(x) ˅ Food(y)
(from ∀x ∀y : Eats(x, y) ˄ ~Killed(x) → Food(y) ≡ ~[Eats(x, y) ˄ ~Killed(x)] ˅ Food(y))
C5(a) : Eats(Bill, Peanuts)
C5(b) : Alive(Bill), i.e. ~Killed(Bill)
C6 : ~Eats(Bill, x) ˅ Eats(Sue, x)

Let us assume that "John does not like peanuts" is true:
C7 : ~Likes(John, Peanuts) (this is the set of support).

Resolution:
1. Resolve C7 with C1, σ = x/Peanuts : ~Food(Peanuts).
2. Resolve ~Food(Peanuts) with C4, σ = y/Peanuts : ~Eats(x, Peanuts) ˅ Killed(x).
3. Resolve with C5(a), σ = x/Bill : Killed(Bill).
4. But from subclause C5(b) we have Alive(Bill), i.e. ~Killed(Bill); resolving gives the empty clause, so a contradiction has occurred.
Therefore our assumption that John does not like peanuts is false. Hence we can say that Likes(John, Peanuts) is true.

Ques 19 : Explain backward and forward chaining, with examples in logic representation. Also mention the advantages and disadvantages of both algorithms.

Ans : The process of the output of one rule activating another rule is called chaining. The chaining technique breaks the task into small procedures and then processes each procedure in the sequence by itself. Two types of chaining technique are known: forward chaining and backward chaining.

(A) Forward chaining :
➢ This is data-driven reasoning; it starts with the known facts and tries to match the rules with these facts.
➢ There is a possibility that several rules match the information (conditions). In forward chaining, the rules are first tested against the matching facts, and then the action is executed.
➢ In the next stage the working memory is updated with the new facts and the matching process starts all over again. This process runs until no more rules fire, or the goal is reached.
➢ Forward chaining is useful when a lot of information is available. It is worth implementing when there is a very large number of potential solutions, as in configuration problems and planning.
A rule-based KB is given below, and the conclusion is to be proved:
Rule 1: IF A OR B THEN C
Rule 2: IF D AND E AND F THEN G
Rule 3: IF C AND G THEN H
The following facts are presented: B, D, E, F. Goal: prove H. A small sketch of this example follows below.
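
A small data-driven sketch of this example (the rule encoding is an assumption for illustration; Rule 1's OR is modeled by writing one entry per alternative). Each rule fires when all its premises are in working memory, adding its conclusion, until the goal appears or nothing new fires:

rules = [
    ({"A"}, "C"), ({"B"}, "C"),         # Rule 1: IF A OR B THEN C
    ({"D", "E", "F"}, "G"),             # Rule 2: IF D AND E AND F THEN G
    ({"C", "G"}, "H"),                  # Rule 3: IF C AND G THEN H
]

def forward_chain(facts, rules, goal):
    facts = set(facts)
    changed = True
    while changed and goal not in facts:
        changed = False
        for premises, conclusion in rules:
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)   # fire the rule; update working memory
                changed = True
    return goal in facts

print(forward_chain({"B", "D", "E", "F"}, rules, "H"))   # True: H is proved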

Backward Chaining :
➢ The opposite of forward chaining is backward chaining.
➢ In contrast to forward chaining, backward chaining is a goal-driven reasoning method. Backward chaining starts from the goal (from the end), which is a hypothetical solution, and the inference engine tries to find matching evidence.
➢ When it is found, the condition becomes a sub-goal, and rules are then searched to prove these sub-goals; the goal is simply matched against the RHS (conclusions) of the rules. This process continues until all the sub-goals are proved, backtracking to the previous step where a rule was chosen whenever necessary.
➢ If no rule establishes an individual sub-goal, another rule is chosen.
➢ Backward chaining reasoning is good for cases where there are not many facts and the information (facts) must be elicited from the user. Backward chaining reasoning is also effective for diagnostic tasks.
In many cases, logic programming languages are implemented using the backward chaining technique. The combination of backward chaining with forward chaining provides better results in many applications.

Decision Criteria for Forward or Backward Reasoning


1. More possible goal states or start states?
(a) Move from smaller set of states to the larger
(b) Is Justification of Reasoning required?
2. Prefer direction that corresponds more closely to the way users think.

3. What kind of events triggers problem-solving?


(a) If it is arrival of a new fact, forward chaining makes sense.
(b) If it is a query to which a response is required, backward chaining is more natural.
4. In which direction is branching factor greatest?
(a) Go in direction with lower branching factor

Advantages of forward chaining:
1. It works well when a problem naturally begins by collecting data, from which further information is derived for use in future steps.
2. Forward chaining can provide a lot of conclusions from the few initial data or facts available.
3. Forward chaining is a very popular technique for implementing expert systems and systems using production rules in the knowledge base. For expert systems that need interruption, control, monitoring, and planning, forward chaining is the best choice.
4. When there are few facts and initial states, forward chaining is very useful to apply.

Disadvantages of forward chaining:

1. New information will be generated by the inference engine without any knowledge of which information will be used for reaching the goal.
2. The user might be asked to enter a lot of inputs without knowing which inputs are relevant to the conclusion.
3. Several rules may fire that contribute nothing toward reaching the goal.
4. It might produce many different conclusions, which makes the chaining process costly.

Advantages of backward chaining:

1. The system stops processing once the goal variable has its value.
2. A system that uses backward chaining tries to establish goals in the order in which they arrive in the knowledge base.
3. The search in backward chaining is directed.
4. While searching, backward chaining considers only those parts of the knowledge base which are directly related to the problem at hand; in other words, backward chaining never performs unnecessary inferences.
5. Backward chaining is an excellent tool for specific types of problems such as diagnosing and debugging.
6. Compared to forward chaining, less data is asked for, but many rules are searched.
Some disadvantages of backward chaining:
1. The goal must be known to perform the backward chaining process;
2. The implementation process of backward chaining is difficult.

Ques 20: What is Utility theory and its importance in AI ? Explain with the help of suitable examples.

Ans : Utility theory is concerned with people's choices and decisions. It is also concerned with people's
preferences and with judgments of preferability, worth, value, goodness, or any of a number of similar
concepts. Utility means the quality of being useful; accordingly, each state of the environment has a degree of
usefulness to an agent, and the agent prefers states with higher utility.

Decision Theory = Probability theory + Utility Theory.

Interpretations of utility theory are often classified under two headings, prediction and prescription:

(i) The predictive approach is interested in the ability of a theory to predict actual choice behavior.
(ii) The prescriptive approach is interested in saying how a person ought to make a decision.
E.g., psychologists are primarily interested in prediction;
economists in both prediction and prescription. In statistics the emphasis is on prescription
in decision making under uncertainty, and the emphasis in management science is prescriptive
as well.

Sometimes it is useful to ignore uncertainty and focus on ultimate choices; at other times uncertainty must be
modelled explicitly. Examples: insurance markets, financial markets, game theory. Rather than
choosing an outcome directly, the decision-maker chooses an uncertain prospect (or lottery). A lottery is a probability
distribution over outcomes.

Expected Utility : The expected utility of action A, given evidence E, written EU(A | E), is calculated as follows :


EU(A | E) = Σi P( Resulti(A) | Do(A), E ) · U( Resulti(A) ), where
P( Resulti(A) | Do(A), E ) is the probability the agent assigns to outcome Resulti(A) when action A is executed, and
Do(A) is the proposition that A is executed in the current state.

Utility theory has two basic components: consequences (or outcomes) and lotteries.

(a) Consequences: These are what the decision-maker ultimately cares about.
Example: “I get pneumonia, my health insurance company covers most of the costs, but I have to pay
a $500 deductible.” The consumer does not choose consequences directly; instead, the consumer chooses a
lottery p.
(b) Lotteries are probability distributions over consequences: p : C → [0, 1] ;
with Σc∈C p(c) = 1. The set of all lotteries is denoted by P. Example: “A gold-level health insurance
plan, which covers all kinds of diseases, but has a $500 deductible.” This makes sense because the consumer is
assumed to rank health insurance plans only insofar as they lead to different probability distributions over
consequences.
Utility Function : U : P → R has an expected utility form if there exists a function

u : C → R such that U(p) = Σc∈C p(c) u(c) for all p ∈ P. In this case, the function U is called an
expected utility function, and the function u is called a von Neumann-Morgenstern utility function. These
functions are used to capture the agent’s preferences between various world states. The function assigns a
single number to express the desirability of a state. Utilities are combined with the outcome probabilities of
actions to give an expected utility for each action. U(S) means the utility of state S for the agent’s decision.

Maximum Expected Utility (MEU) : This principle says that a rational agent should select an action that
maximizes the agent’s expected utility: “If an agent maximizes a utility function that
correctly reflects the performance measure by which its behavior is being judged, then it will achieve the
highest possible performance score when averaged over the environments of the agent.”
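As an illustration of the MEU principle, the sketch below computes expected utilities for two hypothetical actions; all probabilities and utilities are invented for illustration:

# Sketch of the MEU principle: EU(A|E) = Σi P(Resulti(A)|Do(A),E) · U(Resulti(A)).
# Each action maps to (outcome probability, outcome utility) pairs; numbers illustrative.
actions = {
    "take_umbrella":  [(0.3, 60), (0.7, 70)],    # slightly worse either way (carrying it)
    "leave_umbrella": [(0.3, 0), (0.7, 100)],    # rain ruins the day without one
}

def expected_utility(outcomes):
    return sum(p * u for p, u in outcomes)       # probability-weighted average of utilities

for action, outcomes in actions.items():
    print(action, expected_utility(outcomes))    # 67.0 and 70.0
best = max(actions, key=lambda a: expected_utility(actions[a]))
print("MEU action:", best)                       # leave_umbrella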

Ques 21: What are constraint notations in utility theory ? Define the term Lottery. Also mention the
following axioms of Utility Theory :
(i) Orderability (ii) Substitutability (iii) Monotonicity (iv)Decomposability.

Ans : Constraint notations in utility theory for two outcomes / consequences A and B are as mentioned
below :
➢ A ≻ B : A is preferred over B.
➢ A ~ B : The agent is indifferent between A and B.
➢ A ≿ B : The agent prefers A to B or is indifferent between them.

A lottery L with possible outcomes C1, C2, C3, …, Cn that occur with probabilities p1, …, pn is written [ p1, C1 ;
p2, C2 ; … ; pn, Cn ]. Each outcome of a lottery can be an atomic state or another lottery.

Axioms of Utility Theory:

(i) Orderability : Given any two states, a rational agent must either prefer one to the other or rate the
two as equally preferable; the agent cannot avoid the decision.
( A ≻ B) ∨ ( B ≻ A) ∨ ( A ~ B)
(ii) Substitutability: If an agent is indifferent between two lotteries A and B, then the agent is
indifferent between two more complex lotteries that are the same except that B is substituted for A in
one of them.
( A ~ B) ⇒ [ p, A ; 1 − p, C ] ~ [ p, B ; 1 − p, C ]

(iii) Monotonicity: Let two lotteries have the same outcomes A and B. If the agent prefers A to B,
then it prefers the lottery with the higher probability for A.
( A ≻ B) ⇒ ( p ≥ q ⇔ [ p, A ; 1 − p, B ] ≿ [ q, A ; 1 − q, B ] )
(iv) Decomposability: Compound lotteries can be reduced or decomposed to simpler ones :

[ p, A ; 1 − p, [ q, B ; 1 − q, C ] ] ~ [ p, A ; (1 − p) q, B ; (1 − p)(1 − q), C ]

Ques 22 : What is probabilistic reasoning ? Why is it required in AI applications?

Ans : Probabilistic reasoning provides the theoretical foundations and computational methods that underlie
plausible reasoning under uncertainty.
Intelligent agents almost never have access to the whole truth about their environment, so agents must act under
uncertainty. The agent’s knowledge can only provide a degree of belief. The main tool for dealing with degrees
of belief is PROBABILITY THEORY.
➢ If the probability is 0, the belief is that the statement is false.
➢ If the probability is 1, the belief is that the statement is true.
Percepts received from the environment form the evidence on which probability assertions are based.
As the agent receives new percepts, its probability assessments are updated to reflect the new evidence.
➢ Before the evidence is obtained, we talk about prior (unconditional) probability.
➢ After the evidence is given, we deal with posterior (conditional) probability.
The probability associated with a proposition (sentence) P is the degree of belief associated with it in the
absence of any other information.
• In AI applications, sample points are defined by a set of random variables,
which may be Boolean, discrete, or continuous.
Probability Distribution: With respect to some random variable, we talk about the probabilities of all
possible outcomes of that variable. E.g., let Weather be a random variable, given that :
➢ P( Weather = sunny ) = 0.7 , P( Weather = rainy ) = 0.2 , P( Weather = cloudy ) = 0.08 ,
P( Weather = snowy ) = 0.02
Joint Probability Distribution: The joint probability distribution for a set of random variables gives the
probability of every atomic event on those random variables (i.e., every sample point). In this case
P(Weather, Cavity) can be given by a 4 × 2 table of values:

                  Weather = Sunny   Rainy   Cloudy   Snowy
Cavity = True          0.144        0.02    0.016    0.02
Cavity = False         0.576        0.08    0.064    0.08

This is known as the Joint Probability Distribution of Weather and Cavity.


If a complete set of random variables is covered, then it is called a “Full Joint Probability Distribution”.
Conditional Probability:
Definition of conditional probability: P(a | b) = P(a ∧ b) / P(b), if P(b) ≠ 0.
The product rule gives an alternative formulation: P(a ∧ b) = P(a | b) · P(b) = P(b | a) · P(a).
A general version holds for whole distributions, e.g., P(Weather, Cavity) = P(Weather | Cavity) P(Cavity).
The chain rule is derived by successive application of the product rule:
P(X1, …, Xn) = P(X1, …, Xn−1) P(Xn | X1, …, Xn−1)
             = P(X1, …, Xn−2) P(Xn−1 | X1, …, Xn−2) P(Xn | X1, …, Xn−1)
             = … = Πi=1..n P(Xi | X1, …, Xi−1).
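The sketch below computes P(Cavity = true | Weather = sunny) from the joint table given earlier, by marginalizing for P(sunny) and then applying the definition of conditional probability:

# Sketch: conditional probability from the joint distribution table above.
joint = {  # P(Weather, Cavity) from the 4 x 2 table in the text
    ("sunny", True): 0.144, ("rainy", True): 0.02,
    ("cloudy", True): 0.016, ("snowy", True): 0.02,
    ("sunny", False): 0.576, ("rainy", False): 0.08,
    ("cloudy", False): 0.064, ("snowy", False): 0.08,
}
p_sunny = sum(p for (w, c), p in joint.items() if w == "sunny")  # marginal P(sunny) = 0.72
p_cavity_and_sunny = joint[("sunny", True)]                      # P(cavity ∧ sunny) = 0.144
print(p_cavity_and_sunny / p_sunny)                              # P(cavity | sunny) = 0.2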

Applications of Probability Theory in AI


➢ Uncertainty in medical diagnosis
(i) Diseases produce symptoms. (ii) In diagnosis, observed symptoms suggest a disease identification.
(iii) Uncertainties:
• Symptoms may not occur
• Symptoms may not be reported
• Diagnostic tests are not perfect
• False positives, false negatives
➢ Uncertainty in medical decision-making
(iv) Physicians and patients must decide on treatments
(v) Treatments may not be successful
(vi) Treatments may have unpleasant side effects

Ques 23:Explain in detail Markov Model and its applications in Artificial Intelligence.

Ans. Markov Model:

➢ A Markov model is an imprecise model that is used for systems that do not have any fixed
pattern of occurrence, i.e., randomly changing systems.
➢ A Markov model is based on a random probability distribution or pattern that may
be analysed statistically but cannot be predicted precisely.
➢ In a Markov model, it is assumed that the future states depend only upon the current state and not on
previously occurred states. In a first-order Markov model, the current state depends only on the immediately
previous state, i.e., the conditional probability is : P ( Xt | X0:t−1 ) = P ( Xt | Xt−1 )
Set of states: { S1, S2, S3, …, Sn }. The process moves from one state to another, generating a sequence of
states.

An observable state sequence leads to a Markov chain model. Non-observable states lead to hidden
Markov models.

Transition Probability Matrix: Each time a new state is reached, the system is said to have
advanced one step. Each step represents a time period which can result in another possible
state. Let Si be state i of the environment, for i = 1, 2, …, n.

The conditional probability of moving from state Si to Sj is P ( Sj | Si ) = Pij, where Si is the current state and Sj the
next state. Pij = 0 if no transition takes place.

Transition Matrix :
        | P11  P12  …  P1m |
   P =  | P21  P22  …  P2m |
        | …    …    …   …  |
        | Pm1  Pm2  …  Pmm |
➢ Markov chain property: the probability of each subsequent state depends only on the
previous state: P ( Sik | Si1, Si2, …, Sik−1 ) = P ( Sik | Sik−1 ).
To define a Markov model, the following probabilities have to be specified:
Transition probabilities: aij = P ( Sj | Si ), i.e., the probability of a transition from state i to j.
Initial probabilities: πi = P ( Si ). The conditional probability of a state sequence is
calculated as below :
P ( Si1, Si2, …, Sik−1, Sik ) = P ( Sik | Si1, Si2, …, Sik−1 ) · P ( Si1, Si2, …, Sik−1 )
= P ( Sik | Sik−1 ) · P ( Si1, Si2, …, Sik−1 )
= P ( Sik | Sik−1 ) · P ( Sik−1 | Sik−2 ) · … · P ( Si2 | Si1 ) · P ( Si1 ).
There are four common Markov models:
(i) Markov decision processes (ii) Markov chains (iii) Hidden Markov models (iv) Partially
observable Markov decision processes
Example : Consider a problem of weather conditions; the transition diagram is as given below :
• Two states: { ‘Rain’ and ‘Dry’ }
• Transition probabilities: P(‘Rain’|‘Rain’) = 0.3 , P(‘Dry’|‘Rain’) = 0.7 , P(‘Rain’|‘Dry’) = 0.2 ,
P(‘Dry’|‘Dry’) = 0.8
• Initial probabilities: say P(‘Rain’) = 0.4 , P(‘Dry’) = 0.6 . Suppose we want to calculate the probability of a
sequence of states in our example, {‘Dry’, ‘Dry’, ‘Rain’, ‘Rain’}.
P({‘Dry’, ‘Dry’, ‘Rain’, ‘Rain’}) = P(‘Rain’|‘Rain’) P(‘Rain’|‘Dry’) P(‘Dry’|‘Dry’) P(‘Dry’)
= 0.3 × 0.2 × 0.8 × 0.6 = 0.0288
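The same calculation can be written as a short Python sketch using the transition and initial probabilities of the example:

# Sketch: probability of the state sequence {Dry, Dry, Rain, Rain}.
trans = {("Rain", "Rain"): 0.3, ("Rain", "Dry"): 0.7,   # trans[(s, s')] = P(s' | s)
         ("Dry", "Rain"): 0.2, ("Dry", "Dry"): 0.8}
initial = {"Rain": 0.4, "Dry": 0.6}

def sequence_probability(states):
    p = initial[states[0]]                      # P(first state)
    for prev, nxt in zip(states, states[1:]):   # chain of transition probabilities
        p *= trans[(prev, nxt)]
    return p

print(sequence_probability(["Dry", "Dry", "Rain", "Rain"]))  # 0.6*0.8*0.2*0.3 = 0.0288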

Ques 24 : Explain Hidden Markov Model and its applications in AI .


Ans : Hidden Markov Model (HMM)
A hidden Markov model is a temporal probabilistic model in which a single discrete random
variable determines the state of the system. A hidden Markov model is a stochastic model where the

states of the model are hidden. Each state can emit an output which is observed. This model is used
because a simple Markov chain is too restrictive for complex applications.

➢ This means that the possible values of the variable are the possible states of the system.


➢ For example: sunlight can be the variable and the Sun the only possible state.
➢ To make the Markov model more flexible, in an HMM the assumption is made that the observations of the
model are a probabilistic function of each state.
Concept of Hidden Markov Model
Imagine you were locked in a room for several days and you were asked about the weather outside.
The only piece of evidence you have is whether the person who comes into the room bringing your daily
meal is carrying an umbrella or not.
What is hidden? Sunny, Rainy, Cloudy.
What can you observe? Umbrella or not.

➢ In a hidden Markov model, every individual state has a limited number of transitions and emissions.
The state sequence is not directly observable; rather, it can only be recognized from the sequence of
observations produced by the system.
➢ A probability is assigned to each transition between states.
➢ Given the current state, the future states are independent of the past states (the Markov property).
➢ The model is called hidden because the underlying state sequence itself cannot be observed,
only the outputs it emits.
➢ Inference in HMMs is carried out by two algorithms, called:
(i) the Forward algorithm and (ii) the Backward algorithm.
Components of HMM :
➢ Set of states: { S1, S2, S3, …, Sn }.
➢ Sequence of states generated by the system : { Si1, Si2, …, Sik−1, Sik }
➢ Joint probability distribution by the Markov chain property :
P ( Sik | Si1, Si2, …, Sik−1 ) = P ( Sik | Sik−1 )
Observations / visible states : { V1, V2, …, Vm−1, Vm }

For an HMM the following probabilities are to be specified:


(a) Transition probabilities: aij = P ( Sj | Si ), i.e., the probability of a transition from state i to j.
(b) Observation probability matrix: B = ( bi(Vm) ), where bi(Vm) = P ( Vm | Si ).
(c) Vector of initial probabilities : πi = P ( Si )
The model is defined as : M = ( A , B , π ).
Transient state: a state to which the process is not certain to return.
Recurrent state: a state to which the process eventually returns with probability = 1.
Absorbing state: if a process enters a state and is destined to remain there forever, the state is
called an absorbing state.

Applications Of Hidden Markov Model


➢ Speech Recognition.
➢ Gesture Recognition.
➢ Language Recognition.
➢ Motion Sensing and Analysis.
➢ Protein Folding.

Ques 25 : Consider the following data provided for a weather forecasting scenario.
Two states (hidden) : ‘Low’ and ‘High’ atmospheric pressure.
Two observations (visible states) : ‘Rain’ and ‘Dry’.
Suppose we want to calculate the probability of a sequence of observations in our
example, {‘Dry’, ‘Rain’}.

Ans : Solution :
Transition probabilities:
P(‘Low’|‘Low’) = 0.3
P(‘High’|‘Low’) = 0.7,
P(‘Low ’|‘High’) = 0.2 ,
P(‘High ’|‘High’) = 0.8

Observation probabilities:
P(‘Rain’|‘Low’) = 0.6
P(‘Dry’|‘Low’) = 0.4
P(‘Rain’|‘High’) = 0.4
P(‘Dry’|‘High’) = 0.6

Initial probabilities: say P(‘Low’) = 0.4 , P(‘High’) = 0.6 .


Calculation of observation sequence probability

Consider all possible hidden state sequences:


P({‘Dry’, ’Rain’} ) = P({‘Dry’,’ Rain’} , {‘Low’, ‘Low’}) + P({‘Dry’,’ Rain’} ,
{‘Low’, ‘High’}) + P ({‘Dry’,’ Rain’} , {‘High’, ‘Low’}) +
P({‘Dry’,’ Rain’} , {‘High’, ‘High’})

where the first term is :


P({‘Dry’, ‘Rain’}, {‘Low’, ‘Low’}) = P({‘Dry’, ‘Rain’} | {‘Low’, ‘Low’}) P({‘Low’, ‘Low’})

= P(‘Dry’|‘Low’) · P(‘Rain’|‘Low’) · P(‘Low’) · P(‘Low’|‘Low’)

= 0.4 × 0.6 × 0.4 × 0.3 = 0.0288

The remaining three terms are computed in the same way, and their sum gives P({‘Dry’, ‘Rain’}).
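The sketch below carries out the full calculation by enumerating all hidden state sequences, using the probabilities given above:

# Sketch: P({Dry, Rain}) by summing over all hidden state sequences.
from itertools import product

states = ["Low", "High"]
init = {"Low": 0.4, "High": 0.6}
trans = {("Low", "Low"): 0.3, ("Low", "High"): 0.7,     # trans[(s, s')] = P(s' | s)
         ("High", "Low"): 0.2, ("High", "High"): 0.8}
emit = {("Low", "Rain"): 0.6, ("Low", "Dry"): 0.4,      # emit[(s, v)] = P(v | s)
        ("High", "Rain"): 0.4, ("High", "Dry"): 0.6}

def observation_probability(obs):
    total = 0.0
    for seq in product(states, repeat=len(obs)):        # all hidden state sequences
        p = init[seq[0]] * emit[(seq[0], obs[0])]
        for t in range(1, len(obs)):
            p *= trans[(seq[t - 1], seq[t])] * emit[(seq[t], obs[t])]
        total += p
    return total

print(observation_probability(["Dry", "Rain"]))  # 0.0288 + 0.0448 + 0.0432 + 0.1152 = 0.232

For longer observation sequences this enumeration grows exponentially; the forward algorithm computes the same quantity in time linear in the sequence length.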

Ques 26 : Explain in detail Bayesian Theory and its use in AI. Define Likelihood ratio.

Ans : In probabilistic reasoning our conclusions are generally based on available evidence and past
experience, and this information is mostly incomplete. When outcomes are unpredictable we use probabilistic
reasoning, e.g., weather forecasting systems, disease diagnosis, traffic congestion control systems.
➢ A doctor examines a patient’s history, symptoms, and test results as evidence of possible disease.
➢ In weather forecasting, tomorrow’s cloud coverage, wind speed and direction, and sun
heat intensity are predicted.
➢ A business manager must make decisions based on uncertain predictions, e.g., when to launch a new
product. Factors can be: target consumers’ lifestyle, population growth in a specific city / state,
average income of consumers, and the economic scenario of the country. All of this can depend on past
experience of the market.
From the product rule of probability theory we can write the following equations:
P ( a ∧ b ) = P( a | b ) · P( b ) ..............Eq 1.
P ( a ∧ b ) = P( b | a ) · P( a )............... Eq 2.
Equating both equations: P( b | a ) = P( a | b ) P( b ) / P( a )
Bayes’ rule is used in modern AI systems for probabilistic inference. It uses the notion of conditional
probability: P ( H | E ), read as “the probability of hypothesis H given that we have
observed evidence E”. For this we require the prior probability of H (if we have no evidence) and the extent to which
E provides evidence of H.
Bayes’ theorem states : P ( Hi | E ) = P( E | Hi ) · P( Hi ) / Σn=1..K P( E | Hn ) · P( Hn )

where P ( Hi | E ) = probability that hypothesis Hi is true given evidence E,


P( E | Hi ) = probability that we will observe evidence E given that hypothesis Hi is true,
P( Hi ) = prior probability that Hi is true in the absence of E,
K = number of possible hypotheses.
Example : (i) If we know the prior probabilities of finding each of the various minerals, and we know the
probabilities that certain physical characteristics will be observed if a mineral is present, then Bayes’ rule
can be used to find the likelihood that each mineral is present.
(ii) Consider a medical diagnosis problem :
S : patient has spots , F : patient has high fever , M : patient has measles.
Without any additional evidence, the presence of spots serves as evidence in favour of measles. It also
serves indirectly as evidence of fever, since measles would cause fever; however, once it is known that the
patient has measles, spots provide no additional evidence about fever.
Alternatively, either spots or fever alone would constitute evidence in favour of measles.
Likelihood Ratio: This is also a conditional probability expression obtained from Bayes’ rule.
If the probability P( E ) is difficult to obtain, then we can write :

P( ~H | E ) = P( E | ~H ) · P( ~H ) / P( E ) ……. Eq (i)

P( H | E ) = P( E | H ) · P( H ) / P( E ) ……….Eq (ii)

On dividing Eq (ii) by Eq (i) we get :

P( H | E ) / P( ~H | E ) = [ P( E | H ) · P( H ) ] / [ P( E | ~H ) · P( ~H ) ] ………….Eq (iii)

The ratio of the probability of an event to the probability of its negation, P( H ) / P( ~H ), is known as
the “odds of the event, O( H )”. The ratio P( E | H ) / P( E | ~H ) is known as the likelihood ratio w.r.t. H,
L( E | H ).
The odds-likelihood form of Bayes’ rule from Eq (iii) is : O( H | E ) = L( E | H ) · O( H ). A numeric sketch
follows the disadvantages below.
Disadvantages of Bayes’ theorem: For a complex problem, the number of joint probabilities that we
require to compute this function grows as 2^n if there are n different propositions.
➢ Knowledge acquisition is difficult; too many probabilities are needed.
➢ The space for all the probabilities is too large.
➢ The computation over all the probabilities is too expensive.
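The sketch below illustrates Bayes' rule and its odds-likelihood form for a single hypothesis; the numbers are invented for illustration and are not taken from the examples above:

# Sketch of Bayes' rule and the odds-likelihood form (all numbers illustrative).
p_h = 0.1                       # prior P(H), e.g., H: patient has measles
p_e_given_h = 0.9               # P(E | H),  e.g., E: patient has spots
p_e_given_not_h = 0.2           # P(E | ~H)

p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)   # total probability of E
p_h_given_e = p_e_given_h * p_h / p_e                   # posterior P(H | E)

odds_h = p_h / (1 - p_h)                                # prior odds O(H)
lr = p_e_given_h / p_e_given_not_h                      # likelihood ratio L(E | H)
posterior_odds = lr * odds_h                            # O(H | E) = L(E | H) · O(H)

# Both routes give the same posterior, P(H | E) = 1/3:
print(p_h_given_e, posterior_odds / (1 + posterior_odds))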
Ques 27 : What is Bayesian Network or Belief Network ? Explain its importance with the help of
an example.
Ans : To describe the real world, it is not necessary to use a huge joint probability table in which the
probabilities of all possible outcomes are listed. To represent the relationships between independent and
conditionally independent variables, a systematic approach in the form of a data structure called a Bayesian
network is used. It is also known as a causal network, belief network, probabilistic network, or knowledge
map. An extension of it is the decision network or influence diagram.
“A Bayesian network is a directed graph in which each node is annotated with quantitative probability
information.” The network is supported by a CPT, known as a conditional probability table. Bayesian networks
are used for representing knowledge in an uncertain domain.
➢ A belief network is used to encode the meaningful dependences between variables.
1. Nodes represent random variables. 2. Arcs represent direct influence.
3. Nodes have a conditional probability table that gives the variable's probability given the
different states of its parents.
➢ The Semantics of Belief Networks
1. To construct the network, think of it as representing the joint probability distribution.

2. To infer from the network, think of it as representing conditional independence statements.


3. Calculate an entry of the joint probability by multiplying individual conditional probabilities:
P(X1 = x1, …, Xn = xn) = P(X1 = x1 | parents(X1)) · … · P(Xn = xn | parents(Xn)),

i.e., P (X1, X2, …, Xn) = Πi=1..n P( Xi | parents(Xi) )

➢ To incrementally construct a network:


1. Decide on the variables.
2. Decide on an ordering of them : direct influences must be added to the network before the nodes they
influence, so the correct order in which to add nodes is to add the root
causes first, then the variables they influence, and so on until we reach the leaves (which have no direct
causal influence on other variables). A node is conditionally independent of its non-descendants
given its parents, and it is conditionally independent of all other nodes in the network given its parents,
children, and children’s parents.
3. Repeat until no variables are left:
(a) Pick a variable and make a node for it.
(b) Set its parents to the minimal set of pre-existing nodes it depends on.
(c) Define its conditional probability table.
➢ Often, the resulting conditional probability tables are much smaller than the exponential size of the full
joint distribution. Different tables may encode the same probabilities.
➢ Some canonical distributions that appear in conditional probability tables:
(a) deterministic logical relationships (e.g., AND, OR)
(b) deterministic numeric relationships (e.g., MIN)
(c) parametric relationships (e.g., a weighted sum in a neural net)
(d) noisy logical relationships (e.g., noisy-OR, noisy-MAX)

➢ Inference in Belief Networks: After constructing such a network, an inference engine
can use it to maintain and propagate beliefs. When new information is received, the effects can be
propagated throughout the network until equilibrium probabilities are reached. Kinds of inference:
(a) Diagnostic inference: from symptoms to causes
(b) Causal inference: from causes to symptoms
(c) Intercausal inference: between causes of a common effect
(d) Mixed inference: a mix of those above
➢ Inference in Multiply Connected Belief Networks
(a) Multiply connected graphs have two nodes connected by more than one path.
(b) Techniques for handling them:
✓ Clustering: group some of the intermediate nodes into one meganode.

▪ Pro: perhaps the best way to get an exact evaluation.


▪ Con: conditional probability tables may increase exponentially in size.

✓ Cutset conditioning: obtain simpler polytrees by instantiating variables as constants.

▪ Con: may obtain an exponential number of simpler polytrees.


▪ Pro: it may be safe to ignore trees with low probability (bounded cutset
conditioning).
✓ Stochastic simulation: run through the network with randomly chosen values for each node
(weighted by prior probabilities).
✓ The probability of any atomic event (its joint probability) can be obtained from the
network.
✓ The correct order to add the nodes is "root causes" first, then the variables they influence,
until we reach the "leaves", which have no direct causal influence on the other variables.
✓ If we don't, the network will have more links, and less natural probabilities will be needed.
✓ Example: The scenario is a new burglar alarm installed at home. It also
responds to minor earthquakes. Two neighbors, John and Mary, are always available in
case of any emergency. John always calls when he hears the alarm but sometimes confuses
it with the telephone ringing. Mary likes loud music and sometimes misses the alarm
sound. The probabilities actually summarize a potentially infinite set of circumstances in
which the alarm might fail to go off (e.g., high humidity, power failure, dead battery,
cut wires, a dead mouse stuck inside the bell, etc.) or John or Mary might fail to
call and report it (out for lunch, on vacation, temporarily deaf, an airplane passing
near the house, etc.).

In this network, conditional independence gives : P ( Burglary | Alarm, JohnCalls, MaryCalls ) = P( Burglary | Alarm ),


so only Alarm is needed as a parent.
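As an illustration, the sketch below computes one entry of the joint distribution for this burglar-alarm network, P(JohnCalls, MaryCalls, Alarm, no Burglary, no Earthquake), as a product of conditional probabilities. The CPT numbers are not given in this document; the values used here are the standard textbook ones (Russell & Norvig) and should be treated as illustrative:

# Sketch: one joint probability entry of the burglar-alarm network.
# CPT values below are the standard textbook numbers, assumed for illustration.
p_b = 0.001                                          # P(Burglary)
p_e = 0.002                                          # P(Earthquake)
p_a = {(True, True): 0.95, (True, False): 0.94,      # P(Alarm | Burglary, Earthquake)
       (False, True): 0.29, (False, False): 0.001}
p_j = {True: 0.90, False: 0.05}                      # P(JohnCalls | Alarm)
p_m = {True: 0.70, False: 0.01}                      # P(MaryCalls | Alarm)

# P(j, m, a, ~b, ~e) = P(~b) P(~e) P(a | ~b, ~e) P(j | a) P(m | a)
joint = (1 - p_b) * (1 - p_e) * p_a[(False, False)] * p_j[True] * p_m[True]
print(joint)                                         # about 0.000628

Each factor is a single CPT lookup, which is exactly the semantics P(x1, …, xn) = Π P(xi | parents(Xi)).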
UNIT - 4
Machine Learning :
• Supervised and unsupervised learning
• Decision trees.
• Statistical learning models
• Learning with complete data - Naive Bayes models.

• Learning with hidden data


• EM algorithm
• Reinforcement learning

Short Questions & Answers
Ques 1. Name the three basic techniques of machine learning.

Ans : (a) Supervised Learning (b) Unsupervised Learning (c) Reinforcement Learning.

Ques 2. Write some applications of Supervised Learning.

Ans :

➢ Implementation of Perceptrons in AI.


➢ Implementation of Adaline network
➢ Application in Back propagation algorithms.
➢ Used in Hetero associative learning.

Ques 3. What is Boolean Decision Tree?

Ans : These are used in the decision-making learning technique. A Boolean decision tree consists of a vector of
input attributes X and a single Boolean output y. Example: a set of examples ( X1 , Y1 ), …, ( X6 , Y6 ).
Positive examples are those in which the goal is true. Negative examples are those in which the goal is false.
The complete set is called the Training Set.

Ques 4. Compare the Decision tree method with Naïve Baye’s Learning.
Ans : (i) Naïve Bayes learns slightly less efficiently than decision tree learning.
(ii) Naïve Bayes learning works well for a wide range of applications compared to decision trees.
(iii) Naïve Bayes scales well to very large problems. E.g., if there are n Boolean attributes, then only 2n + 1
parameters are required.

Ques 5. What is Reward Function in Re-enforcement learning ?


Ans : The reward function is used to define a goal. It maps each perceived state–action pair of the environment to a
single number, i.e., a reward that indicates the desirability of that state. A reinforcement learning agent’s only objective
is to maximize the total reward received in the long run. Reward functions may be stochastic / random in nature.

Long Question & Answers
Ques 6. Explain Machine learning. Illustrate learning model? Mention some factors that affect the
learning.
Ans : Machine learning is the subfield of AI in which we try to improve the decision-making power of
intelligent agents. The agent has a performance element that decides what actions to take and a learning element
that modifies the performance element so that it makes better decisions. The design of the learning element is
affected by the following three major factors :
1) Which components of the performance element are to be learned.
2) What feedback is available to learn these components.
3) What representation method is used for the components.
Following are some ways of learning mostly used in machines:
(A) Logical learning (B) Inductive learning (C) Deductive learning.
Logical Learning: In this process a new concept or solution is acquired through the use of similar known concepts.
We use this type of learning when solving problems in an exam, where previously learned examples serve
as a guide, or when we learn to drive a truck using our knowledge of car driving.

Inductive Learning: This technique requires the use of inductive inference, a form of logically invalid but useful
inference. We use inductive learning when we formulate a general concept after seeing a number of
instances or examples of the concept. E.g., when we learn the concept of color or sweet taste after
experiencing the sensations associated with several objects.

Deductive Learning: This is performed through a sequence of deductive inference steps using known facts.
From the known facts, new facts or relationships are logically derived. E.g., if we have the information
that the weather is hot and humid, then we can infer that it may also rain. Another example: let
P → Q and Q → R; then we can infer that P → R.

General Learning Model

The environment has been included as a part of the overall learning system. It produces random stimuli, which
work as an organized training source, such as a teacher who provides carefully selected training examples
for the learner component. A user working at a keyboard can also be an environment for some specific
systems.
Inputs to the learning system may be physical stimuli, sounds, signals, descriptions of text, or symbolic
notation. The information is used to create and modify knowledge structures in the KB. The same knowledge is
used by the performance component to carry out tasks, such as solving a problem or playing a computer
game.
The performance component produces a response / action when a task is provided. The critic module then
evaluates this response relative to an optimal response and produces feedback indicating whether or not the
performance is acceptable. This feedback is forwarded by the critic module to the learner component for its
subsequent use in modifying the structures in the knowledge base.
Factors affecting the Machine Learning Process:
1) Types of training provided. E.g: Supervised technique , Unsupervised technique etc.
2) Form and extent of any initial background knowledge or past history.
3) The types of feedbacks provided.
4) Learning algorithms applied.

Ques7. Differentiate between Supervised Learning and Unsupervised Learning. Also mention some of
the application areas of both.
Ans :
S.No | Supervised Learning | Unsupervised Learning
1. | Learning of a function can be done from its inputs and outputs. | Learning is used to draw inferences from data sets containing only input data.
2. | Classifies the data on the basis of the available training set and uses that data for classifying new data. | Clusters the data on the basis of similarities, according to the characteristics found in the data, grouping similar objects into clusters.
3. | Also known as Classification. | Also known as Clustering.
4. | The class labels on the training data are known in advance, which further helps in data classification. | Class labels on the training data are not known in advance, i.e., there are no predefined classes.
5. | Classification methods: Decision Trees, Bayesian Classification, Rule-Based Classification, Classification by back propagation, Associative Classification. | Clustering methods: Hierarchical, Partitioning, Density-Based, Grid-Based, Model-Based.

Issues in supervised learning


• Data cleaning: noise and missing values are handled.
• Feature selection: redundant and irrelevant attributes are removed when feature selection is done.
• Data transformation: data normalization and data generalization are included in data transformation.

Ques. 8 Write Short notes on the following: (a) Statistical Learning (b) Naïve Baye’s Model
Ans : (a) Statistical Learning Technique: In this technique the main ideas are data and hypotheses. Here data are
evidence, i.e., instantiations of some or all of the random variables describing the domain. Bayesian learning
calculates the probability of each hypothesis given the data and makes predictions on that basis.
Let D be the data set, with observed value d as an output. Then the probability of each hypothesis is obtained by
Bayes’ rule as: P ( hi | d ) = α P ( d | hi ) P ( hi ).
For the prediction of an unknown quantity X, the expression is given as below :
P ( X | d ) = Σi P ( X | d, hi ) P ( hi | d ) = Σi P ( X | hi ) P ( hi | d ).
The prediction above is a weighted average over the predictions of the individual hypotheses. Hypotheses are
intermediaries between the raw data and the predictions. A very common approximation
is to make predictions based on a single most probable hypothesis, i.e., an hi that maximizes
P ( hi | d ); this is called the maximum a posteriori (MAP) hypothesis.

(b) Naïve Bayes Model: This is the most common Bayesian network model used in machine learning.
In this model the class variable C (to be predicted) is the root and the attributes Xi are the leaves. The model
is called naïve because it assumes that the attributes are conditionally independent of each other, given the class.
Once the model has been trained using the maximum-likelihood technique, it can be used to classify new
examples for which the class variable C is unobserved. For the observed attribute values x1, x2, …, xn, the
probability of each class is given as: P ( C | x1, x2, …, xn ) = α P(C) Πi P ( xi | C ).
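The sketch below is a minimal naïve Bayes prediction; the tiny two-attribute example and all its probabilities are invented for illustration:

# Sketch of a naive Bayes prediction: P(C | x1..xn) = alpha * P(C) * prod_i P(xi | C).
priors = {"play": 0.6, "no_play": 0.4}            # P(C); numbers illustrative
cond = {                                          # cond[c][v] = P(attribute value v | c)
    "play":    {"sunny": 0.2, "mild": 0.5},
    "no_play": {"sunny": 0.6, "mild": 0.3},
}

def predict(values):
    scores = {}
    for c, prior in priors.items():
        score = prior
        for v in values:
            score *= cond[c][v]                   # conditional independence assumption
        scores[c] = score
    alpha = 1.0 / sum(scores.values())            # normalizing constant
    return {c: alpha * s for c, s in scores.items()}

print(predict(["sunny", "mild"]))                 # {'play': ~0.455, 'no_play': ~0.545}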

Ques.9 What is learning with complete data? Explain Maximum Likelihood Parameter Learning
with Discrete Model in detail.
Ans . Statistical learning methods start with the simplest task: parameter learning with complete data.
Parameter learning involves finding the numerical parameters for a probability model with a fixed
structure. E.g., in a Bayesian network, conditional probabilities are obtained for a given scenario. Data are
complete when each data point contains values for every variable in the specific learning model.

Maximum Likelihood Parameter Learning : Suppose we buy a bag of lime and cherry candies from a
new manufacturer whose lime–cherry proportions are completely unknown—that is, the fraction could be
anywhere between 0 and 1. The parameter θ is the proportion of cherry candies, and the
hypothesis is hθ; the proportion of limes is 1 − θ.
If we assume that all proportions are equally likely a priori, then a maximum-likelihood approach is
reasonable. If we model the situation with a Bayesian network, we need just one random variable, Flavor
(the flavor of a randomly chosen candy from the bag). It has values cherry and lime, where the probability
of cherry is θ. Now suppose we unwrap N candies, of which c are cherries and ℓ = N − c are limes.
The likelihood of this data set is:
P( d | hθ ) = Πj=1..N P( dj | hθ ) = θ^c (1 − θ)^ℓ
So the maximum likelihood is given by the value of θ that maximizes the above equation. Computing the log
likelihood:

L( d | hθ ) = log P( d | hθ ) = c log θ + ℓ log (1 − θ)

By taking logarithms, we reduce the product to a sum over the data, which is usually easier
to maximize. To find the maximum-likelihood value of θ, we differentiate L with respect to
θ and set the resulting expression to zero:

dL/dθ = c/θ − ℓ/(1 − θ) = 0  ⇒  θ = c / (c + ℓ) = c / N.

In general:
1. Write down an expression for the likelihood of the data as a function of the parameter(s).
2. Write down the derivative of the log likelihood with respect to each parameter.
3. Find the parameter values such that the derivatives are zero.

A significant problem arises when the data set is small enough that some
events have not yet been observed—for instance, no cherry candies—because the maximum-likelihood
hypothesis then assigns zero probability to those events. Various tricks are used to avoid this
problem, such as initializing the counts for each event to 1 instead of zero.
With complete data, the maximum-likelihood parameter learning problem for a Bayesian network
decomposes into separate learning problems: the parameter values for a variable given its parents are just
the observed frequencies of the variable values for each setting of the parent values.
Let us look at another example: Suppose this new candy manufacturer wants to give a
little hint to the consumer and uses candy wrappers colored red and green. The wrapper for
each candy is selected probabilistically, according to some unknown conditional distribution,
depending on the flavor. The corresponding probability model has three parameters: θ, θ1, and θ2,
where θ1 is the probability of a red wrapper on a cherry candy and θ2 the probability of a red wrapper on a
lime candy.
For a cherry candy in a green wrapper, for example, the joint probability distribution gives:
P (Flavor = cherry, Wrapper = green | hθ,θ1,θ2 ) = θ (1 − θ1).

Now let N candies be unwrapped: c cherries and ℓ = N − c limes,


with wrapper counts: rc cherries with red wrappers, gc cherries with green wrappers,
rℓ limes with red wrappers, gℓ limes with green wrappers.
So the likelihood of the data is given as below:

P( d | hθ,θ1,θ2 ) = θ^c (1 − θ)^ℓ · θ1^rc (1 − θ1)^gc · θ2^rℓ (1 − θ2)^gℓ

For maximum-likelihood estimation, simplify by taking the log to obtain a sum,
then compute the first-order partial derivatives w.r.t. θ, θ1, θ2 and equate them to zero; this yields
θ = c / (c + ℓ), θ1 = rc / (rc + gc), θ2 = rℓ / (rℓ + gℓ).
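The closed-form estimates above are simple counts; the sketch below evaluates them on hypothetical counts (the numbers are invented for illustration):

# Sketch: maximum-likelihood estimates from candy/wrapper counts (illustrative numbers).
c, l = 60, 40             # cherry and lime counts (N = 100)
rc, gc = 45, 15           # cherry candies with red / green wrappers
rl, gl = 10, 30           # lime candies with red / green wrappers

theta = c / (c + l)       # P(Flavor = cherry)
theta1 = rc / (rc + gc)   # P(Wrapper = red | cherry)
theta2 = rl / (rl + gl)   # P(Wrapper = red | lime)
print(theta, theta1, theta2)   # 0.6 0.75 0.25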

Ques.10 Write short notes on
(a) Continuous model for Maximum likelihood Estimation
(b) Learning with Hidden Variables.
(c) EM Algorithm.

Ans : (a) Continuous model for Maximum Likelihood Estimation : Continuous variables are very common
in real-world applications, so it is important to know how to learn continuous models from data. The principles of
maximum-likelihood learning are identical to those of the discrete case. Consider learning the parameters of a
Gaussian density function on a single variable; that is, the data are generated as follows:

p(x) = (1 / (σ √(2π))) e^( −(x − μ)² / (2σ²) )

The parameters of this model are the mean μ and the standard deviation σ. Let the observed values be x1, x2, …, xN.

Then the log likelihood is:

L = Σj=1..N log ( (1 / (σ √(2π))) e^( −(xj − μ)² / (2σ²) ) ) = −N (log σ + log √(2π)) − Σj (xj − μ)² / (2σ²)

Setting the first-order partial derivatives equal to zero, we obtain:

μ = ( Σj xj ) / N ,  σ = √( Σj (xj − μ)² / N )

The maximum-likelihood value of the mean is the sample average, and the maximum-likelihood value of
the standard deviation is the square root of the sample variance.
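A minimal sketch of these two estimates on an illustrative sample:

# Sketch: maximum-likelihood mean and standard deviation for Gaussian data.
import math

data = [4.9, 5.1, 5.0, 4.8, 5.2]                      # illustrative sample
mu = sum(data) / len(data)                            # sample average
sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / len(data))  # sqrt of sample variance
print(mu, sigma)                                      # 5.0 and ~0.141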

(b) Learning with Hidden Variables : Many real-world problems have hidden variables (also called
latent variables), which are not observable in the given data samples.
Examples : (i) In medical diagnosis, records mostly consist of symptoms, the treatment used, and the outcome
of the treatment, but seldom a direct observation of the disease itself.
(ii) In a scenario of traffic congestion prediction at office hours, a hidden variable can be an
unobservable “rainy day” causing very light traffic at peak hours.
Example : Let the Bayesian network for heart disease (a hidden variable) be as given in the figure below.
In figure (a), each variable has three possible values and is labeled with the number
of independent parameters in its conditional distribution. Figure (b) shows the
equivalent network with Heart Disease removed. Note that the symptom variables are no
longer conditionally independent given their parents. Latent variables can therefore dramatically reduce
the number of parameters required to specify a Bayesian network, which can reduce the amount of data
needed to learn the parameters.
(c) EM Algorithm (Expectation–Maximization Algorithm) : This algorithm is used to solve the
problems arising in learning with hidden variables. The basic idea is to pretend that we know the
parameters of the model and then infer the probability that each data point belongs to each component;
each component is then fitted to the entire data set, with each point weighted by the probability that it
belongs to that component.

➢ Expectation–maximization is the process used for clustering the data sample.
➢ For given data, EM has the ability to predict feature values for each class on the basis of the
classification of examples, by learning the theory that specifies it.
➢ It works by starting with a random theory and randomly classified data, and then
executing the steps mentioned below: compute the expected values of the hidden
variables for each example, and then re-compute the parameters using the expected values as
if they were observed values. Let X be the observed values in all examples, Z the set of all
hidden variables, and θ all the parameters of the probability model, θ = { μ, Σ }.

➢ E-step: compute the expectation of the log likelihood of the completed data with respect to
P ( Z = z | x, θi ), which is the posterior over the hidden variables.

➢ M-step: find the new values of the parameters that maximize the log likelihood
of the data, given the expected values of the hidden indicator variables.

➢ The EM algorithm increases the log likelihood of the data at every iteration. Under certain conditions
EM can be proven to reach a local maximum in likelihood, so EM resembles a gradient-based hill-climbing
algorithm.
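Below is a minimal sketch of EM for a mixture of two one-dimensional Gaussians; the data, initial guesses, and iteration count are all illustrative:

# Sketch of EM for a mixture of two 1-D Gaussians (all numbers illustrative).
import math

def pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
w, mu, sigma = [0.5, 0.5], [0.0, 6.0], [1.0, 1.0]       # initial parameters theta

for _ in range(20):
    # E-step: posterior probability that each point belongs to each component
    resp = []
    for x in data:
        p = [w[k] * pdf(x, mu[k], sigma[k]) for k in range(2)]
        s = sum(p)
        resp.append([pk / s for pk in p])
    # M-step: re-estimate parameters, weighting each point by its membership
    for k in range(2):
        nk = sum(r[k] for r in resp)
        w[k] = nk / len(data)
        mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
        sigma[k] = max(math.sqrt(sum(r[k] * (x - mu[k]) ** 2
                                     for r, x in zip(resp, data)) / nk), 1e-3)

print(w, mu, sigma)   # the means converge near 1.0 and 5.0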

Ques. 11 Explain Re-inforcement learning technique in detail .Also Mention its applications in the
field of Artificial intelligence.
Ans : Reinforcement learning : This type of learning technique is used for agent learning when there is
no teacher telling the agent what action to take in each circumstance.
Example 1 : A chess-playing agent could be trained by supervised learning, given examples of game situations
along with the best moves for those situations. It can also try random moves, so the agent can eventually build a
predictive model of its environment. The issue is that without some feedback about what is good and bad,
the agent will have no grounds for deciding which move to select. The agent needs to know that something good
has happened when it wins and that something bad has occurred when it loses. This kind of feedback is called
reward or reinforcement.

A General Learning Model of Reinforcement Learning:

➢ Reinforcement learning was developed in the context of optimal control strategies.

➢ This method is useful for making sequential decisions.
➢ The critic converts a primary reinforcement signal received from the environment into a higher-quality
signal (a heuristic signal); both are scalar inputs.
➢ The system is designed to learn with delayed reinforcement (a temporal sequence of stimuli).
Example 2 : A mobile robot decides whether it should enter a new room in search of more trash to collect
or start trying to find its way back to its battery recharging station. It makes its decision based on how
quickly and easily it has been able to find the recharger in the past.
➢ The agent’s actions are permitted to affect the future state of the environment, e.g., the next chess position.
➢ This involves interaction between an active decision-making agent and its environment, within which a goal
is to be pursued.
Markov Decision Process: Rewards serve to define optimal policies in MDPs. An optimal policy
maximizes the expected total reward. The task of reinforcement learning is to use observed rewards to learn an
optimal policy.
Elements of re-inforcement Learning:
a). A policy b). A reward function c). A value function d ). A model of environment

Architectures in Reinforcement Learning

Policy: This defines the learning agent’s behavior at a particular time. It is a mapping from perceived states of
the environment to the actions to be taken when in those states. A policy can be a simple function, a lookup
table, or even a search process.
Reward Function: This is used to define a goal. It maps each perceived state–action pair of the environment to
a single number, a reward point that indicates the desirability of that state. The objective is to maximize the total
reward received in the long run. Reward functions may be stochastic / random.
Value Function: The reward function indicates what is good in an immediate sense; a value function specifies
what is good in the long run. The value of a state is the total amount of reward an agent can expect to accumulate
over the future.
Model: This represents the behavior of the environment. Models are used for planning, i.e., a way of deciding on
a course of actions by considering future situations.
Application areas of Reinforcement learning are as mentioned below:
1) The success of DeepMind’s AI systems for playing Go means interest in reinforcement
learning (RL) is bound to increase.
2) RL requires a lot of data, and as such, it has often been associated with domains where simulated
data is available (gameplay, robotics).
3) Automation of well-defined tasks, that would benefit from sequential decision-making that RL can
help automate (or at least, where RL can augment a human expert).
4) Industrial automation is another promising area. It appears that RL technologies from
DeepMind helped Google significantly reduce energy consumption (HVAC) in its own data centers.
5) The use of RL can lead to training systems that provide custom instruction and materials tuned to the
needs of individual students. A group of researchers is developing RL algorithms and statistical
methods that require less data for use in future tutoring systems.
6) Many RL applications in health care mostly pertain to finding optimal treatment policies.
7) Companies collect a lot of text, and good tools that can help unlock unstructured text will find users.
8) RL has been used as a technique for automatically generating summaries from text, based on content
“abstracted” from the original text document.
9) A Financial Times article described an RL-based system for optimal trade execution. The system
(dubbed “LOXM”) is being used to execute trading orders at maximum speed and at the best
possible price.
10) Many warehousing facilities used by e-commerce sites and other supermarkets use intelligent
robots for sorting their millions of products every day and helping to deliver the right products to the
right people. Tesla's factory, for instance, comprises more than 160 robots that do a major part of the
work on its cars to reduce the risk of any defect.
11) Reinforcement learning algorithms can be built to reduce transit time for stocking as well as
retrieving products in the warehouse for optimizing space utilization and warehouse operations.
12) Reinforcement learning and optimization techniques are utilized to assess the security of electric
power systems and to enhance microgrid performance. Adaptive learning methods are employed to
develop control and protection schemes.

Ques 12. Discuss Various Types of Reinforcement Learning Techniques.
Ans : Reinforcement learning is of the following three types :
(a) Passive reinforcement learning (b) Temporal difference learning (c) Active reinforcement learning.
Passive Reinforcement Learning: In this technique the agent’s policy is fixed and the task is to learn the utilities
of states (or state–action pairs). If the policy is π and the state is S, then the agent always executes the action π(S).
➢ The goal is to learn how good the policy is, i.e., to learn the utility function Uπ(S). The passive learning agent
does not know the transition model T(S, a, S′), which specifies the probability of reaching state S′ from
state S after action a.
➢ The passive learner also does not know the reward function R(S).
➢ The utility is defined to be the expected sum of (discounted) rewards obtained if policy π is followed:
Uπ(S) = E[ Σt=0..∞ γ^t R(St) | π, S0 = S ], where γ is a discount factor.

Temporal Difference Learning: When a transition occurs from state S to state S′, we update Uπ(S) as
follows: Uπ(S) ← Uπ(S) + α ( R(S) + γ Uπ(S′) − Uπ(S) ),
where α is the learning-rate parameter. Because this update rule uses the difference in utilities between successive
states, it is often called the TEMPORAL DIFFERENCE equation.
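The sketch below applies the TD update along one observed trajectory; the states, rewards, and parameter values are invented for illustration:

# Sketch of the TD update: U(s) <- U(s) + alpha * (R(s) + gamma * U(s') - U(s)).
U = {"s1": 0.0, "s2": 0.0, "s3": 0.0}        # utility estimates
R = {"s1": -0.04, "s2": -0.04, "s3": 1.0}    # rewards (illustrative)
alpha, gamma = 0.1, 0.9                      # learning rate, discount factor

episode = ["s1", "s2", "s3"]                 # one observed trajectory under policy pi
for s, s_next in zip(episode, episode[1:]):
    U[s] += alpha * (R[s] + gamma * U[s_next] - U[s])
U["s3"] += alpha * (R["s3"] - U["s3"])       # terminal state: no successor utility
print(U)

Repeating such updates over many episodes makes the estimates converge toward Uπ.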

Active Reinforcement Learning: An active learner must also decide which actions to take, and with large
state spaces it typically relies on function approximation. The compression achieved by a function approximator
allows the learning agent to generalize from states it has visited to states it has not visited.
E.g., an evaluation function for CHESS that is represented as a weighted linear function of a set of features
or basis functions f1, f2, …, fn :

Û(s) = θ1 f1(s) + θ2 f2(s) + … + θn fn(s),

where θi is the coefficient we want to learn and fi is a feature extracted from the state.

Ques 13. What is Decision Tree Learning? Why it is useful in AI applications?
Ans : The decision tree method is one of the simplest and yet most successful forms of learning algorithm.
Its emphasis is on the area of inductive learning. In inductive learning, “given a collection of examples of f,
we return a function h that approximates f”, where an example is “a pair ( x, f(x) )”, x being the
input and f(x) the output of the function applied to x.
➢ h is the hypothesis. A good hypothesis will generalize well, i.e., will predict unseen examples correctly.
➢ A decision tree takes as input an object with a certain feature set and returns a decision, the predicted
output value. The output may be discrete or continuous.
➢ Learning a discrete-valued function is known as classification learning, whereas learning a continuous
function is termed regression in decision trees.
➢ A decision tree reaches its decision by performing a sequence of tests.
➢ Each internal node is a test on the value of one of the properties, and the branches from the node are labeled
with the possible values of the test.
➢ Each leaf node specifies a return value.
➢ One application of decision tree learning is in designing an expert system based on a decision tree
architecture.
➢ Decision trees are completely expressive within the class of propositional logic.
➢ Various propositions are connected via the logical OR operator (∨).

Example : ∀s F1(s) → (P1(s) ∨ P2(s) ∨ … ∨ Pn(s))
∀x P1(x) → (F1(x) ∨ F2(x))
∀y P2(y) → (Q1(y) ∨ Q2(y))
… and so on up to Pn :
∀z Pn(z) → (R1(z) ∨ R2(z))
A general decision tree for the above propositional formulas can be drawn as given below:

Boolean Decision Trees : This technique uses a vector of input attributes X and a single Boolean
output Y.
E.g., a set of examples ( X1, Y1 ), …, ( X6, Y6 ).
➢ Positive examples are those in which the goal is true.
➢ Negative examples are those in which the goal is false.
➢ The complete set is known as a TRAINING SET.

a) In the case of numeric attributes, decision trees can be geometrically interpreted
as a collection of hyperplanes, each orthogonal to one of the axes.
b) The tree's complexity has a crucial effect on its accuracy.
It is explicitly controlled by the stopping criteria used and the pruning method
employed.
c) Usually the tree complexity is measured by one of the following
metrics: the total number of nodes, the total number of leaves, the tree depth, and the
number of attributes used.
d) Decision tree induction is closely related to rule
induction. Each path from the root of a decision tree to one of its leaves can be
transformed into a rule simply by conjoining the tests along the path to form
the antecedent part and taking the leaf’s class prediction as the class value.

Example:
➢ Given this classifier, the analyst can predict the response of a potential customer (by sorting
it down the tree) and understand the behavioral characteristics of the entire population of potential
customers regarding direct mailing.
➢ Each node is labeled with the attribute it tests, and its branches are labeled with
the corresponding values.
➢ For example, one of the paths in the figure below can be converted into the rule : “If the customer's
age is less than or equal to 30, and the customer is “Male” – then the
customer will respond to the mail”.
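The sketch below encodes such a tree as nested tests in Python; the attribute names, the age threshold, and the class labels follow the direct-mailing example but are otherwise illustrative:

# Sketch: a hand-built decision tree as nested tests (leaves are class labels).
tree = ("age<=30",                                   # internal node: a test
        ("gender=Male", "respond", "no_response"),   # subtree if the test is true
        "no_response")                               # leaf if the test is false

def classify(customer, node):
    if isinstance(node, str):                        # leaf: return class prediction
        return node
    test, if_true, if_false = node
    if "<=" in test:                                 # numeric test, e.g. age<=30
        attr, _, value = test.partition("<=")
        passed = customer[attr] <= float(value)
    else:                                            # equality test, e.g. gender=Male
        attr, _, value = test.partition("=")
        passed = customer[attr] == value
    return classify(customer, if_true if passed else if_false)

print(classify({"age": 25, "gender": "Male"}, tree))   # "respond"

Conjoining the tests along the path taken ("age<=30" and "gender=Male") gives exactly the rule quoted above.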

Application Areas of Decision Tree Learning
1) Variable selection: The number of variables that are routinely monitored in clinical settings has
increased dramatically with the introduction of electronic data storage. Many of these variables are of
marginal relevance and, thus, should probably not be included in data mining exercises.
2) Handling of missing values: A common - but incorrect - method of handling missing data is to
exclude cases with missing values; this is both inefficient and runs the risk of introducing bias in the
analysis. Decision tree analysis can deal with missing data in two ways: it can either classify missing
values as a separate category that can be analyzed with the other categories, or use a built decision tree
model which sets a variable with many missing values as a target variable to make predictions and
replaces the missing values with the predicted ones.
3) Prediction: This is one of the most important usages of decision tree models. Using the tree model
derived from historical data, it’s easy to predict the result for future records.
4) Data manipulation: Too many categories of one categorical variable or heavily skewed continuous
data are common in medical research.

Ques 14 : Write Short Notes on the following : (A) Regression Trees
(B) Bayesian Parameter Learning.
Ans : Regression Trees : Regression trees are commonly used to solve problems where the target variable
is numerical / continuous instead of discrete. Regression trees possess the following properties (a small SDR
computation follows the list) :
a) Leaf nodes predict the average value of all instances reaching them.
b) Splitting criterion : minimize the variance of the values in each subset Si.
c) Standard Deviation Reduction : SDR ( A, S ) = SD ( S ) − Σi ( |Si| / |S| ) · SD ( Si )

d) Termination criteria : a lower bound on the SD in a node and a lower bound on the number of


examples in a node.
e) The pruning criterion is the mean squared error.
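A small sketch of the SDR computation for one hypothetical split (values are illustrative):

# Sketch: standard-deviation reduction for one candidate split.
import statistics

S = [10.0, 12.0, 11.0, 30.0, 32.0, 31.0]    # target values at a node (illustrative)
subsets = [S[:3], S[3:]]                     # subsets induced by a candidate test A

sd = statistics.pstdev                       # population standard deviation SD(.)
sdr = sd(S) - sum(len(si) / len(S) * sd(si) for si in subsets)
print(sdr)                                   # a large SDR indicates a good split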

Bayesian Parameter Learning: This learning technique treats the parameters as random variables
having some prior distribution. An optimal learning classifier can be designed using the class-conditional
densities p( x | ωi ). In a typical case we merely have some unclear knowledge about the situation, together with a
given number of training samples. Observation of the samples converts this to a posterior density, and the estimates
of the true values of the parameters are revised. In Bayesian learning a sharpening of the posterior density function
occurs, causing it to peak near the true values.
• We assume the priors are known, so that
p( ωi | x, D ) = p( x | ωi, D ) P( ωi ) / Σj p( x | ωj, D ) P( ωj ).
• Any information we have about the parameter vector θ prior to collecting samples is contained in a known
prior density p( θ ).
• Observation of the samples converts this to a posterior density p( θ | D ), which we hope is peaked around
the true value of θ.
• Our goal is to estimate the density for a new sample x by integrating over the parameter vector:
p( x | D ) = ∫ p( x, θ | D ) dθ = ∫ p( x | θ, D ) p( θ | D ) dθ = ∫ p( x | θ ) p( θ | D ) dθ.
UNIT – 5
Pattern Recognition:
• Introduction of Design principles of pattern recognition system
• Statistical Pattern recognition
• Parameter estimation methods
• Principal Component Analysis (PCA)
• Linear Discriminant Analysis (LDA).

Classification Techniques
• Nearest Neighbor (NN) Rule
• Bayes Classifier
• Support Vector Machine (SVM)
• K – means clustering.

Short Questions & Answers

Ques1. What is pattern recognition?

Ans. Pattern recognition is a branch of machine learning that focuses on the recognition of patterns and
regularities in data. It is the study of how machines can observe the environment intelligently, learn to
distinguish patterns of interest from their backgrounds, and make reasonable and correct decisions about the
different classes of objects. Patterns may be a fingerprint image, a handwritten cursive word, a human face,
the iris of a human eye, or a speech signal. These examples are called input stimuli. Recognition establishes a
close match between some new stimulus and previously stored stimulus patterns. Pattern recognition
systems are in many cases trained from labeled "training" data (supervised learning), but when no labeled
data are available, other algorithms can be used to discover previously unknown patterns (unsupervised
learning). At the most abstract level, patterns can also be ideas, concepts, thoughts, or procedures
activated in the human brain and body; this is studied in human psychology (cognitive
science).
Example: In the automatic sorting of integrated circuit amplifier packages, there can be three possible types :
metal-can, dual-in-line, and flat pack. The unknown object should be classified as being one of these
types.

Ques 2. Define Measurement space and Feature space in classification process for objects.

Ans: Measurement space: This is the set of all pattern attributes, which are stored in vector form.
It is the range of characteristic attribute values. In vector form, the measurement space is also called the
observation space / data space. E.g., W = [ W1, W2, …, Wn−1, Wn ] for n pattern classes;
W is a pattern vector. Let X = [ x1, x2 ]ᵀ be a pattern vector for a flower, where x1 is petal length and x2 is petal
width.

Feature Space: The range of a subset of attribute values is called the feature space F. This subset represents a
reduction of the attribute space, and the pattern classes are divided into subclasses. The feature space contains
the most important attributes of a pattern class observed in the measurement space.

Ques 3. What is dimensionality reduction problem?

Ans. In machine learning classification problems, there are often too many factors on the basis of which
the final classification is done. These factors are basically variables called features. The higher the
number of features, the harder it gets to visualize the training set and then work on it. Sometimes, most
of these features are correlated, and hence redundant. This is where dimensionality reduction algorithms
come into play. Dimensionality reduction is the process of reducing the number of random variables
under consideration by obtaining a set of principal variables. It can be divided into feature selection and
feature extraction. The methods used for dimensionality reduction include the following (a small PCA sketch
follows the list):

➢ Principal Component Analysis (PCA)


➢ Linear Discriminant Analysis (LDA)
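A minimal PCA sketch via the eigen-decomposition of the covariance matrix; it assumes numpy is available, and the 2-D data are illustrative:

# Sketch of PCA: project centered data onto the top eigenvector of the covariance matrix.
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])   # illustrative 2-D data
Xc = X - X.mean(axis=0)                  # center the data
cov = np.cov(Xc, rowvar=False)           # covariance matrix of the features
eigvals, eigvecs = np.linalg.eigh(cov)   # eigen-decomposition (eigenvalues ascending)
pc1 = eigvecs[:, -1]                     # first principal component (top eigenvector)
reduced = Xc @ pc1                       # data reduced from 2 dimensions to 1
print(reduced)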

Ques 4. State some advantages and disadvantage with application of LDA.


Ans. Advantages of Linear Discriminant Analysis
➢ Suitable for larger data sets.
➢ Calculation of the scatter matrices in LDA is much easier than that of the covariance matrix.
Disadvantages of Linear Discriminant Analysis
➢ More redundancy in the data.
➢ Memory requirement is high.
➢ More noisy.

Applications of Linear Discriminant Analysis
➢ Face recognition.
➢ Earth sciences.
➢ Speech classification.

Ques 5. Write some disadvantages of K Nearest Neighbor.

Ans. Disadvantages of using K-NN:

(a) Expensive. (b) High space complexity. (c) High time complexity.
(d) Data storage required. (e) High dimensionality of data.

Ques 7. How K-Mean is different by KNN.

Ans.

S.No | K-Means Clustering | K-Nearest Neighbor Classification
1. | This is an unsupervised learning technique. | This is a supervised learning technique.
2. | All the variables are independent. | All the variables are dependent.
3. | Splits data points into K clusters. | Determines the classification of a new point.
4. | The points in each cluster tend to be near each other. | Combines the classifications of the K nearest points.

Ques 8. What is clustering?


Ans. Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in
the same group (called a cluster) are more similar (in some sense) to each other than to those in other
groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data
analysis, used in many fields, including machine learning, pattern recognition, image
analysis, information retrieval, bioinformatics, data compression, and computer graphics.

Ques 9. What is partitioning clustering?

Ans. Partitioning algorithms are clustering techniques that subdivide a data set into a set of k groups,
where k is the number of groups pre-specified by the analyst. There are different types of partitioning
clustering methods. The most popular is K-means clustering, in which each cluster is represented by
the center, or mean, of the data points belonging to the cluster. The K-means method is sensitive to
outliers.

Long Questions & Answers

Ques 10. Explain Design Cycle of a Pattern Recognition System.

Ans : Design Cycle of a Pattern Recognition System

Pattern classification involves finding three major attribute spaces:

(a) Measurement space (b) Feature space (c) Decision space.
After this, an appropriate classifier (e.g., a neural network) is trained on these attribute sets so that the
system learns to recognize unknown patterns and objects. The steps of the classification process are as follows:

Step 1. Stimuli produced by the objects are perceived by sensory devices. Important attributes
(shape, size, color, texture) produce the strongest inputs. Data collection involves identifying the
attributes of the objects and creating the Measurement space.
Measurement space: This is the set of all pattern attributes, stored in vector form.
It is the range of characteristic attribute values; in vector form the measurement space is also called
the observation space or data space, e.g., W = [ W1, W2, ..., Wn-1, Wn ] for n pattern classes,
where W is a pattern-class vector. Let X = [ x1, x2 ]^T be a pattern vector for a flower, where x1 is the
petal length and x2 is the petal width. The pattern classes can be W1 = Lily, W2 = Rose, W3 = Sunflower.
Step 2. After this, features are selected and the feature-space vector is designed. The range of a subset of the
attribute values is called the Feature Space F. This subset represents a reduction of the attribute space, and
the pattern classes are divided into subclasses. The feature space contains the most important attributes of a
pattern class observed in the measurement space; it is therefore smaller than the measurement space.

Step 3. AI models based on probability theory, e.g., Bayesian models and hidden Markov models, are
used for grouping or clustering the objects. The attributes selected are those which provide high
inter-class separation and low intra-class scatter.
Step 4. Classifiers are trained using unsupervised techniques (for feature extraction) or supervised
techniques (for classification). When we present a pattern recognition system with a set of already-classified
patterns so that it can learn the characteristics of the set, we call it training.
Step 5. In the evaluation phase the classifier is tested: an unknown pattern is given to the PR
system, which must identify its correct class. Using the selected attribute values, object/class
characterization models are learned by forming generalized prototype descriptors, classification
rules, or decision functions. The range of decision-function values is known as the Decision space D
of r dimensions:

D = [ d1, d2, d3, ..., dr ]^T

Recognition of familiar objects is achieved through the application of the rules learned in step 4,
by comparing and matching object features with the stored models. We also evaluate the performance
and efficiency of the classifier for further improvement.

Ques 11. What are the design principles of a Pattern Recognition System? What are the major steps
involved in this process?
Ans : Design principles of a Pattern Recognition System are as mentioned below :

i. The design of a pattern recognition system is based on the construction of the following AI techniques:


➢ Multi layer perceptron in Artificial Neural Network.
➢ Decision tree implementation.
➢ Nearest neighbor classification.

ii. Designing a PR system that is robust to variations in illumination and brightness in the
environment.
iii. Designing parameters that are invariant to translation, scaling, and rotation.
iv. Representing color and texture by histograms.
v. Designing brightness-based and feature-based PR systems.

This system comprises mainly five components, namely sensing, segmentation, feature extraction,
classification, and post-processing. Together these form a system that works as follows:
1. Sensing and Data Acquisition: The various properties that describe the object, such as its
entities and attributes, are captured using a sensing device.

2. Segmentation: The sensed data objects are segmented into smaller segments in this step.

3. Feature Extraction: Measurable, discriminative features are computed from the segments (as named
in the component list above).

4. Classification: The extracted features are used to assign each object to one of the pattern classes.

5. Post-Processing & Decision: Certain refinements and adjustments are made as per the changes in the
features of the data objects under recognition. Decision making is done once post-processing is
completed.
Need of a Pattern Recognition System
A pattern recognition system is responsible for finding patterns and similarities in a given
problem/data space, which can further be used to generate solutions to complex problems effectively and
efficiently. Certain problems that can be solved by humans can also be solved by machines using this
process. An example is affective computing, which gives a computer the ability to recognize and express
emotions and to respond intelligently to the human emotions that contribute to rational decision making.

Ques 12. Discuss the four main approaches to a Pattern Recognition system. Also discuss
some of the main application areas of PR systems, with examples.
Ans: The approaches of a PR system are as mentioned below:
1) Template Matching 2) Statistical Approach 3) Syntactic Approach 4) ANN Approach.

TEMPLATE MATCHING: This approach to pattern recognition is based on finding the similarity
between two entities (points, curves, or shapes) of the same type. A 2-D shape or a prototype of the pattern
to be recognized is available. The template is a d x d mask or window, and the pattern to be recognized is
matched against the stored templates in a knowledge base, as sketched below.

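Below is a minimal sketch of template matching using normalized cross-correlation, one common similarity measure; the toy image, template, and function names are invented for illustration.

import numpy as np

def ncc(window, template):
    # Normalized cross-correlation between an image window and the template.
    w = window - window.mean()
    t = template - template.mean()
    denom = np.linalg.norm(w) * np.linalg.norm(t) + 1e-12  # avoid divide-by-zero
    return (w * t).sum() / denom

def template_match(image, template):
    # Slide the d x d template over the image; return the best-matching corner.
    d = template.shape[0]
    H, W = image.shape
    scores = np.array([[ncc(image[i:i + d, j:j + d], template)
                        for j in range(W - d + 1)]
                       for i in range(H - d + 1)])
    return np.unravel_index(scores.argmax(), scores.shape)

# Toy 8x8 "image" with a cross-shaped pattern embedded at row 2, column 3.
template = np.array([[0., 1, 0], [1, 1, 1], [0, 1, 0]])
image = np.zeros((8, 8))
image[2:5, 3:6] = template
print(template_match(image, template))           # -> (2, 3)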
STATISTICAL APPROACH: Each pattern is represented in terms of d features, as a point in d-dimensional space.
The goal is to select those features that allow pattern vectors belonging to different categories to occupy
compact and disjoint regions, so that the separation of the pattern classes is maximized. Decision surfaces and
lines are drawn, determined by the probability distributions of the random variables for each pattern class.

SYNTACTIC APPROACH: This approach solves complex pattern classification problems.
Hierarchical rules are defined, e.g., grammar rules for a natural language or a syntax-tree structure. These are
used to decompose complex patterns into simpler sub-patterns: patterns can be viewed as sentences,
sentences are decomposed into words, and words are further subdivided into letters.
NEURAL NETWORKS APPROACH: Artificial neural networks are massively parallel computing
systems consisting of an extremely large number of simple processors with many interconnections.
Network models attempt to use principles such as learning, generalization, adaptivity, fault
tolerance, and distributed representation and computation. The learning process involves updating the
network architecture and the connection weights so that the network performs its classification or
clustering task better.
Applications of PR System with Examples

Problem Domain          | Application                        | Input Pattern                     | Pattern Classes
Bioinformatics          | Sequence analysis                  | DNA / protein sequence            | Known types of gene patterns
Data Mining             | Searching for meaningful data      | Points in multidimensional space  | Compact & well-separated clusters
Internet Searching      | Document classification            | Text document                     | Semantic categories (sports, movies, business, science)
Document image analysis | Reading machines for the blind     | Document image                    | Alphanumeric characters, words
Industrial automation   | Printed circuit board inspection   | Intensity or range image          | Defective / non-defective product
Biometrics              | Personal identification            | Face, iris, fingerprints          | Authorized users for access control
Speech recognition      | Searching content via Google voice assistant | Speech waveforms        | Spoken words

Ques 13. Write short notes on :
(A) Decision theoretic classification (B) Optimum Statistical Classifier

Ans: (A) Decision-theoretic classification: This is a statistical pattern recognition technique which is
based on the use of decision functions to classify objects. A decision function maps pattern vectors X
into decision regions of D, i.e., f : X → D. These functions are also termed Discriminant Functions.
➢ Given a set of objects O = { O1, O2, ..., On }, let each Oi have k observable attributes
(the measurement space and relations are V = { V1, V2, ..., Vk }).
➢ Determine the following parameters:
a) A subset of m ≤ k of the Vi, X = [ X1, X2, ..., Xm ], whose values uniquely characterize Oi.
b) C ≥ 2 groupings of the Oi which exhibit high intra-class and low inter-class similarity,
such that a decision function d(X) can be found which partitions D into C disjoint
regions. These regions are used to classify each object Oi into some class.
➢ For W pattern classes, W = [ W1, W2, ..., Wn-1, Wn ], find W decision functions
d1(x), d2(x), ..., dw(x) with the property that if a pattern X belongs to class Wi, then
di(X) > dj(X), for j = 1, 2, ..., w; j ≠ i.
➢ A linear decision function for a 2-D pattern vector takes the form of a line equation:
d(X) = w1 x1 + w2 x2 + w3.

For a two-class problem, an object belongs to class W1 (C1) if d(x) < 0, and to class W2 (C2) if
d(x) > 0. If d(x) = 0, the class is indeterminate.

[Fig. (a): two linearly separable classes divided by the decision line d(x) = 0.]
Decision boundary: di(x) - dj(x) = 0. The aim is to identify the decision boundary between two
classes by a single function dij(x) = di(x) - dj(x) = 0.

When a line can be found that separates the classes into two or more clusters, we say the classes are
Linearly Separable; otherwise they are called Non-Linearly Separable classes.

[Figs. (b) and (c): non-linearly separable classes.]

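The following is a minimal NumPy illustration of classifying 2-D patterns with such a linear decision function; the weight values are invented for illustration.

import numpy as np

# Hypothetical weights for d(x) = w1*x1 + w2*x2 + w3 (chosen arbitrarily).
w = np.array([1.0, -1.0, 0.5])

def classify(x1, x2):
    # Assign a 2-D pattern to class W1 (d < 0), W2 (d > 0), or the boundary.
    d = w[0] * x1 + w[1] * x2 + w[2]
    if d < 0:
        return "W1"
    if d > 0:
        return "W2"
    return "indeterminate (on the boundary)"

print(classify(1.0, 3.0))   # d = -1.5 -> W1
print(classify(3.0, 1.0))   # d =  2.5 -> W2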
(B) Optimum Statistical Classifier: This is a pattern classification approach developed on a probabilistic
basis, because of the randomness under which pattern classes are normally generated.
It is based on Bayesian theory and conditional probabilities: the probability that a particular pattern x is
from class Wi is denoted P(Wi | x). If a pattern classifier decides that x came from Wj when it actually
came from Wi, it incurs a loss Lij.
The average loss incurred in assigning x to class Wj is given by the following equation:

rj(x) = Σ (k = 1 to W) Lkj P(Wk | x)   ... Eq (1)   [W is the total number of classes]

This is called the conditional average risk/loss.

By Bayes' theorem, P(A | B) = P(B | A) P(A) / P(B), so Eq (1) can be rewritten as:

rj(x) = Σ (k = 1 to W) Lkj p(x | Wk) P(Wk) / p(x)   ... Eq (2)

Here p(x | Wk) is the probability density function of the patterns from class Wk, and P(Wk) is the prior
probability of occurrence of class Wk. Since p(x) is independent of k (it has the same value for all
classes), the equation can be rewritten as:

rj(x) = (1 / p(x)) Σ (k = 1 to W) Lkj p(x | Wk) P(Wk)   ... Eq (3)

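A small numeric illustration of Eq (1), with all probabilities invented: under a 0-1 loss, the classifier assigns x to the class with minimum conditional average risk, which is the class with maximum posterior probability.

import numpy as np

# Invented posterior probabilities P(Wk | x) for a single pattern x.
posterior = np.array([0.7, 0.3])          # P(W1|x), P(W2|x)

# 0-1 loss: L[k, j] = 0 if k == j else 1.
L = np.array([[0.0, 1.0],
              [1.0, 0.0]])

# Eq (1): r_j(x) = sum over k of L[k, j] * P(Wk | x), for each candidate class j.
risk = L.T @ posterior                    # [r1(x), r2(x)] = [0.3, 0.7]
print("assign x to class W%d" % (np.argmin(risk) + 1))   # W1: minimum risk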
Ques 14. Explain Maximum Likelihood technique under Parameter Estimation of Classification.
Ans : Estimation model consists of a number of parameters. So, in order to calculate or estimate the
parameters of the model, the concept of Maximum Likelihood is used.
• Whenever the probability density function of a sample is unknown, it can be estimated by
treating the parameters of the density as quantities having unknown but fixed values.
• Suppose we want to estimate the height of the boys in a school, but it would be too
time-consuming to measure the height of every boy. If the heights are normally distributed
with unknown mean and unknown variance, then by maximum likelihood estimation we can
estimate the mean and variance by measuring the heights of only a small group of boys from the
total population.
Let us separate a collection of samples by class, giving C data sets D1, D2, ..., Dc,
with the samples in Dj drawn independently according to the probability density p(x | Wj). Assume this
density has a known parametric form determined by a parameter vector θj; e.g., p(x | Wj) ~ N(μj, Σj), in
which case θj consists of these parameters. To show the dependence explicitly we write
p(x | Wj, θj). The objective is to use the information provided by the training samples to obtain good
estimates of the unknown parameter vectors θ1, θ2, θ3, ..., θc-1, θc associated with each
category. Assume the samples in Di give no information about θj if i ≠ j, i.e., the parameters of different
classes are functionally independent. Let a set D contain n samples X1, X2, ..., Xn; then

p(D | θ) = Π (k = 1 to n) p(Xk | θ).

p(D | θ) is the likelihood of θ with respect to the set of samples. The maximum likelihood estimate of θ is,
by definition, the value θ̂ that maximizes p(D | θ).

Logarithmic form: Since the logarithm turns products into sums and makes the expressions simpler, the θ
that maximizes the log-likelihood also maximizes the likelihood. If the number of parameters to be estimated
is p, we let θ denote the p-component vector θ = (θ1, θ2, θ3, ..., θp-1, θp)^t.

Let the gradient operator be ∇θ = [ ∂/∂θ1, ∂/∂θ2, ..., ∂/∂θp ]^t, and define l(θ) as the log-likelihood function:

l(θ) = ln p(D | θ)  ⇒  θ̂ = arg max over θ of l(θ)

l(θ) = Σ (k = 1 to n) ln p(Xk | θ)  and  ∇θ l = Σ (k = 1 to n) ∇θ ln p(Xk | θ)

At the maximum likelihood solution, ∇θ l = 0.

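For a univariate Gaussian, solving ∇θ l = 0 yields the familiar closed-form estimates μ̂ = (1/n) Σ Xk and σ̂² = (1/n) Σ (Xk − μ̂)². The NumPy sketch below uses simulated heights, purely for illustration.

import numpy as np

rng = np.random.default_rng(0)
# Simulated heights (cm) of a small sample of boys -- illustrative only.
heights = rng.normal(loc=152.0, scale=6.0, size=50)

# Closed-form ML estimates for a univariate Gaussian:
mu_hat = heights.mean()                        # (1/n) * sum(Xk)
var_hat = ((heights - mu_hat) ** 2).mean()     # (1/n) * sum((Xk - mu_hat)^2)

print(f"ML mean estimate:     {mu_hat:.2f} cm")
print(f"ML variance estimate: {var_hat:.2f} cm^2")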
Ques 15. (A) Write down the steps for K nearest Neighbor estimation.
(B) Mention some of the advantages and disadvantages of KNN technique.

Ans. (A) K-Nearest Neighbor estimation proceeds as follows (a code sketch follows the steps):

1. Calculate d(x, xi) for i = 1, 2, ..., n, where d denotes the Euclidean distance between the
points.
2. Arrange the n calculated Euclidean distances in non-decreasing order.
3. Let k be a positive integer; take the first k distances from this sorted list.
4. Find the k points corresponding to these k distances.
5. Let ki denote the number of points among these k that belong to the i-th class; each ki ≥ 0
and the ki sum to k.
6. If ki > kj for all i ≠ j, then put x in class i.

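A minimal NumPy sketch of these six steps; the tiny training set and the function name are invented for illustration.

import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=3):
    # Classify pattern x by majority vote among its k nearest neighbors.
    dists = np.linalg.norm(X_train - x, axis=1)   # step 1: Euclidean distances
    nearest = np.argsort(dists)[:k]               # steps 2-4: the k closest points
    votes = Counter(y_train[nearest])             # step 5: counts per class
    return votes.most_common(1)[0][0]             # step 6: the majority class

# Tiny invented 2-D training set with two classes (0 and 1).
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [5.0, 7.0], [4.5, 5.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_classify(np.array([1.2, 1.4]), X_train, y_train, k=3))  # -> 0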
( B ) Advantages of KNN :
1. Easy to understand
2. No assumptions about data
3. Can be applied to both classification and regression
4. Works easily on multi-class problems

Disadvantages are:
1. Memory intensive / computationally expensive
2. Sensitive to the scale of the data
3. Does not work well on rare-event (skewed) target variables
4. Struggles when the number of independent variables is high

Ques 16. Explain the criterion functions used in clustering.

Ans. To measure the quality of the clustering of any partitioned data set, a criterion function is used.
1. Consider a set B = { x1, x2, x3, ..., xn } containing n samples, partitioned into exactly t
disjoint subsets B1, B2, ..., Bt.
2. The key point about these subsets is that each individual subset represents a cluster.
3. Samples inside a cluster are similar to each other and dissimilar to samples in other clusters.
4. To make this possible, criterion functions are chosen according to the situation, as classified below.

Criterion Function For Clustering

1. Internal Criterion Function

a) This class of clustering criterion takes an intra-cluster view.
b) An internal criterion function optimizes a measure of the quality of each cluster in itself, i.e., how
compact and homogeneous the samples within a cluster are (a sketch of the common
sum-of-squared-error criterion follows this list).
2. External Criterion Function
a) This class of clustering criterion takes an inter-cluster view.
b) An external criterion function optimizes a measure of how well the various clusters are separated,
i.e., how different they are from each other.
3. Hybrid Criterion Function
a) This function is used because it has the ability to simultaneously optimize multiple individual
criterion functions, unlike the internal and external criterion functions alone.

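The most common internal criterion is the sum-of-squared-error (SSE) criterion, J = Σi Σ(x in Bi) ||x − μi||², where μi is the centroid of cluster Bi; a lower J means more compact clusters. A minimal NumPy sketch with toy clusters invented for illustration:

import numpy as np

def sse(clusters):
    # Sum-of-squared-error criterion: lower J means more compact clusters.
    J = 0.0
    for pts in clusters:
        mu = pts.mean(axis=0)                    # cluster centroid
        J += ((pts - mu) ** 2).sum()             # within-cluster scatter
    return J

B1 = np.array([[1.0, 1.0], [1.5, 2.0]])          # toy cluster 1
B2 = np.array([[5.0, 7.0], [4.5, 5.0]])          # toy cluster 2
print(sse([B1, B2]))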
Ques 17. Solve the following with the help of K-means clustering.

Subject A B
1 1.0 1.0
2 1.5 2.0
3 3.0 4.0
4 5.0 7.0
5 3.5 5.0
6 4.5 5.0
7 3.5 4.5
The initial centre points are (1, 1) and (5, 7).

Ans.

This data set is to be grouped into two clusters. As a first step in finding a sensible initial partition, let the
A & B values of the two individuals furthest apart (using the Euclidean distance measure), define the
initial cluster means, giving:
           Individual   Mean Vector (centroid)
Group 1    1            (1.0, 1.0)
Group 2    4            (5.0, 7.0)

The remaining individuals are now examined in sequence and allocated to the cluster to which they are
closest, in terms of Euclidean distance to the cluster mean. The mean vector is recalculated each time a
new member is added. This leads to the following series of steps:
Step   Cluster 1: Individuals   Centroid     Cluster 2: Individuals   Centroid
1      1                        (1.0, 1.0)   4                        (5.0, 7.0)
2      1, 2                     (1.2, 1.5)   4                        (5.0, 7.0)
3      1, 2, 3                  (1.8, 2.3)   4                        (5.0, 7.0)
4      1, 2, 3                  (1.8, 2.3)   4, 5                     (4.2, 6.0)
5      1, 2, 3                  (1.8, 2.3)   4, 5, 6                  (4.3, 5.7)
6      1, 2, 3                  (1.8, 2.3)   4, 5, 6, 7               (4.1, 5.4)

Now the initial partition has changed, and the two clusters at this stage have the following
characteristics:
           Individuals   Mean Vector (centroid)
Cluster 1  1, 2, 3       (1.8, 2.3)
Cluster 2  4, 5, 6, 7    (4.1, 5.4)

But we cannot yet be sure that each individual has been assigned to the right cluster. So, we compare each
individual’s distance to its own cluster mean and to
that of the opposite cluster. And we find:
Individual   Distance to centroid of Cluster 1   Distance to centroid of Cluster 2
1            1.5                                 5.4
2            0.4                                 4.3
3            2.1                                 1.8
4            5.7                                 1.8
5            3.2                                 0.7
6            3.8                                 0.6
7            2.8                                 1.1

Only individual 3 is nearer to the mean of the opposite cluster (Cluster 2) than to its own (Cluster 1). In
other words, each individual's distance to its own cluster mean should be smaller than its distance to the
other cluster's mean (which is not the case for individual 3). Thus, individual 3 is relocated to Cluster 2,
resulting in the new partition:
           Individuals      Mean Vector (centroid)
Cluster 1  1, 2             (1.3, 1.5)
Cluster 2  3, 4, 5, 6, 7    (3.9, 5.1)

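As a cross-check, here is a minimal NumPy sketch of standard (batch) K-means on the same seven points. The worked example above recomputes the centroid after every single assignment, so the intermediate steps differ, but the final partition {1, 2} vs. {3, 4, 5, 6, 7} agrees (the text rounds the centroid (1.25, 1.5) to (1.3, 1.5)).

import numpy as np

X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0],
              [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])
centroids = np.array([[1.0, 1.0], [5.0, 7.0]])    # initial centres from above

for _ in range(10):                               # iterate until stable
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = d.argmin(axis=1)                     # assign points to nearest centroid
    new = np.array([X[labels == k].mean(axis=0) for k in range(2)])
    if np.allclose(new, centroids):
        break
    centroids = new

print(labels)      # [0 0 1 1 1 1 1] -> clusters {1, 2} and {3, 4, 5, 6, 7}
print(centroids)   # [[1.25 1.5 ] [3.9  5.1 ]]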
Ques 18. What is dimensionality reduction in pattern classification methods? Explain Principal
Component Analysis with its applications in AI.

Ans: A very common problem in statistical pattern recognition is Feature Selection, i.e., the process of
transforming the Measurement Space into the Feature Space (the set of data which are of interest).
The transformation reduces the dimensionality of the data features. Suppose we have an m-dimensional
vector X = [X1, X2, ..., Xm] and we want to convert it to l dimensions (where l << m).

X = [ X1, X2, ..., Xm ]^T; after reducing the dimensions, X = [ X1, X2, ..., Xl ]^T.

This reduction causes a mean square error, so we need to determine whether there exists an invertible
transform T such that the truncation of Tx is optimal in terms of mean square error. T must therefore have
some components of low variance σ², where σ² = E[ (x − μ)² ], E is the expectation operator, x is the
random variable, and μ is the mean value, μx = (1/m) Σ (k = 1 to m) Xk.

Definition of PCA: This is a mathematical procedure that uses Orthogonal transforms to convert a set of
observations of possibly correlated variables into a set of linearly uncorrelated variables known as
Principal Components. So here we preserve the most variance with reduced dimensions and minimum
mean square error.

➢ The number of principal components is less than or equal to the number of original variables.
➢ The first principal component has the largest variance; the variance decreases for each successive component.
➢ The principal components are given by the leading eigenvectors of the covariance matrix of the vector X.

Geometrical analysis of PCA:

i. PCA projects the data along the directions where the variance σ² is maximum.
ii. These directions are determined by the eigenvectors of the covariance matrix corresponding to the
largest eigenvalues.
iii. The magnitude of an eigenvalue is the variance of the data along the direction of its eigenvector.
iv. Eigenvalues are the characteristic values λ in AX = λX, where A is a square matrix.

• The degree to which the variables are linearly correlated is represented by their covariance.

• PCA uses the Euclidean distance calculated from the p variables as the measure of dissimilarity
among the n objects. The eigenvalues (latent roots) λ of the covariance matrix S are the solutions
of the characteristic equation |S − λI| = 0.

• The eigenvalues λ1, λ2, ..., λp are the variances of the coordinates on each principal-component
axis. The coordinate of each object i on the k-th principal axis, known as its score on PC k, is
computed as:

z_ki = u_1k x_1i + u_2k x_2i + ... + u_pk x_pi

Steps of PCA (a code sketch follows these steps):
• Let μ be the mean vector (taking the mean of all rows).
• Adjust the original data by the mean: φ = Xk − μ.
• Compute the covariance matrix C of the adjusted X.
• Find the eigenvectors and eigenvalues of C: for matrix C, the eigenvectors are the (column)
vectors e having the same direction as Ce, i.e., Ce = λe, where λ is called an eigenvalue of C.
Ce = λe  ⇔  (C − λI)e = 0.

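A minimal NumPy sketch of these steps, using random toy data purely for illustration:

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))                 # toy data: 100 samples, 5 features

mu = X.mean(axis=0)                           # step 1: mean vector
phi = X - mu                                  # step 2: mean-adjusted data
C = np.cov(phi, rowvar=False)                 # step 3: covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)          # step 4: eigen-decomposition

order = np.argsort(eigvals)[::-1]             # largest variance first
U = eigvecs[:, order[:2]]                     # keep the 2 leading principal components
Z = phi @ U                                   # scores: projections onto the PCs
print(Z.shape)                                # (100, 2)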
Applications of PCA in AI:
➢ Face recognition
➢ Image compression
➢ Gene expression analysis

Ques 19. Explain Linear Discriminant Analysis with its derivation.

Ans: PCA finds components that are useful for representing the data, but its drawback is that it cannot
discriminate between different classes. If we pool all the samples together, the directions that PCA
discards might be exactly the directions needed for distinguishing between classes.
➢ PCA finds efficient directions for representation.
➢ LDA finds efficient directions for discrimination.
The objective of LDA is to perform dimensionality reduction while preserving as much of the
class-discrimination information as possible. In LDA, the data are projected from d dimensions onto a line.
Even if the samples form well-separated, compact clusters in d-space, projection onto an arbitrary line
will usually produce poor recognition performance; by rotating the line, we can find an orientation for
which the projected samples are well separated.

In order to find a good projection vector, we need to define a measure of separation between the
projections.

The solution proposed by Fisher is to maximize a function that represents the difference between the
projected means, normalized by a measure of the within-class variability, the so-called scatter. For each
class we define the scatter, an equivalent of the variance, as the sum of squared differences between the
projected samples and their class mean.

The Fisher linear discriminant is defined as the linear function w^T x that maximizes the criterion
function J(w) = (m̃1 − m̃2)² / (s̃1² + s̃2²): the distance between the projected means, normalized by the
within-class scatter of the projected samples.

In order to find the optimum projection w*, we need to express J(w) as an explicit function of w. We
define measures of the scatter in the multivariate feature space x, denoted scatter matrices:

Si = Σ (x in ωi) (x − mi)(x − mi)^T  and  Sw = S1 + S2,

where Si is the scatter (covariance) matrix of class ωi and Sw is called the within-class scatter matrix.
Similarly, the difference between the projected means (in y-space) can be expressed in terms of the means
in the original feature space (x-space):

(m̃1 − m̃2)² = w^T SB w,  with  SB = (m1 − m2)(m1 − m2)^T.

The matrix SB is called the between-class scatter matrix of the original samples/feature vectors, while
w^T SB w is the between-class scatter of the projected samples y. The criterion becomes
J(w) = (w^T SB w) / (w^T Sw w), which is maximized by w* = Sw^-1 (m1 − m2).

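A minimal NumPy sketch of the two-class Fisher discriminant w* = Sw^-1 (m1 − m2); the toy Gaussian classes are invented for illustration.

import numpy as np

rng = np.random.default_rng(2)
# Two toy Gaussian classes in 2-D.
X1 = rng.normal([0.0, 0.0], 1.0, size=(50, 2))
X2 = rng.normal([3.0, 3.0], 1.0, size=(50, 2))

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = (X1 - m1).T @ (X1 - m1)                  # class scatter matrices
S2 = (X2 - m2).T @ (X2 - m2)
Sw = S1 + S2                                  # within-class scatter matrix

w = np.linalg.solve(Sw, m1 - m2)              # Fisher direction w* = Sw^-1 (m1 - m2)
w /= np.linalg.norm(w)

y1, y2 = X1 @ w, X2 @ w                       # 1-D projections of the two classes
print(abs(y1.mean() - y2.mean()))             # projected means are well separated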
Ques 20: Explain Support Vector Machines in detail. What are the advantages and disadvantages
of SVM?

Ans: Support Vector Machines: An SVM is a linear machine for the case of separable patterns that
arises in the context of pattern classification. The idea is to construct a HYPERPLANE as a
decision surface in such a way that the margin of separation between positive and negative
examples is maximized.
A good example of such a system is classifying a set of new documents into positive or negative
sentiment groups, based on other documents which have already been classified as positive or
negative. Similarly, we could classify new emails into spam or non-spam, based on a large
corpus of documents that have already been marked as spam or non-spam by humans. SVMs are
highly applicable to such situations.
• SVM is an approximate implementation of Structural Risk Minimization.
• The error rate of a machine on test data is bounded by the sum of the training error rate and a
term that depends on the Vapnik-Chervonenkis (VC) dimension.
• SVM sets the first term to zero and minimizes the second term. The SVM learning algorithm
can be used to construct the following three types of learning machines:
(i) Polynomial learning machines
(ii) Two-layer perceptrons
(iii) Radial-basis-function networks
The condition is: Test error rate ≤ Training error rate + f(N, h, p),
where N is the size of the training set, h is a measure of model complexity (the VC dimension),
and p is the probability that this bound fails.
If we consider an element of our p-dimensional feature space, i.e., x = (x1, ..., xp) in R^p, then we can
mathematically define an affine hyperplane by the equation b0 + b1 x1 + ... + bp xp = 0;
b0 ≠ 0 gives us an affine plane (i.e., it does not pass through the origin). More succinctly, using the
summation sign: b0 + Σ (j = 1 to p) bj xj = 0. The line that maximizes the minimum margin is better.
The maximum-margin separator is determined by a subset of the data points; the data points in this
subset are called Support Vectors. The support vectors are used to decide which side of the separator
a test case falls on.

Consider a training set { (Xi, di) } for i = 1 to n, where Xi is the input pattern for the i-th example
and di is the desired response (target output). Let di = +1 for positive examples and di = −1 for negative
examples, with the two pattern classes linearly separable. The hyperplane decision surface is given by
the equation:
W^T X + b = 0  (a data point X lying exactly on the hyperplane satisfies this equation),
where W is the adjustable weight vector and b is the bias.
Therefore W^T Xi + b ≥ 0 for di = +1, and W^T Xi + b < 0 for di = −1.
The distance from the hyperplane to the closest data point is called the margin of separation, denoted ρ.
The objective of the SVM is to maximize ρ, giving the optimal hyperplane.

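A minimal sketch using scikit-learn's linear SVC on toy separable data invented for illustration; a very large C approximates the hard-margin case described above.

import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data with labels di = -1 / +1.
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
              [5.0, 7.0], [4.5, 5.0], [6.0, 6.0]])
d = np.array([-1, -1, -1, +1, +1, +1])

clf = SVC(kernel="linear", C=1e6)     # large C approximates a hard margin
clf.fit(X, d)

print(clf.coef_, clf.intercept_)      # W and b of the hyperplane W^T x + b = 0
print(clf.support_vectors_)           # the support vectors that define the margin
print(clf.predict([[2.0, 2.0]]))      # -> [-1]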
Ques 21: What is the Nearest Neighbor rule of classification? Mention some of the metrics used in
this method.
Ans: The nearest neighbor algorithm assigns to a test pattern the class label of its closest neighbor.
Let there be n training patterns (X1, θ1), (X2, θ2), ..., (Xn, θn), where Xi is of dimension d and
θi is the class of the i-th pattern. If P is the test pattern, then P is assigned class θk, where
d(P, Xk) = min { d(P, Xi) }, i = 1 to n.
Error: The NN classifier error is at most twice the Bayes error as the number of training samples
tends to infinity. For C classes:

E(α_bayes) ≤ E(α_nn) ≤ E(α_bayes) [ 2 − (C / (C − 1)) E(α_bayes) ]
Distance metrics used in nearest neighbor classification (a code sketch follows):

(1) Euclidean distance: L(a, b) = ||a − b|| = sqrt( Σ (i = 1 to d) (ai − bi)² )

Properties of a metric:
I. L(a, b) ≥ 0
II. L(a, b) = 0 iff a = b
III. L(a, b) = L(b, a)
IV. L(a, b) + L(b, c) ≥ L(a, c)

(2) Minkowski metric: Lk(a, b) = [ Σ (i = 1 to d) |ai − bi|^k ]^(1/k). The Lk are norms; L1 is also
called the Manhattan or city-block distance.

(3) Mahalanobis distance: L(a, b) = sqrt( (a − b)^T C^-1 (a − b) ), where C is a positive-definite
matrix called the covariance matrix.

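A minimal NumPy sketch computing the three metrics for a pair of points; the covariance matrix C is invented for illustration.

import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

euclid = np.linalg.norm(a - b)                       # sqrt(9 + 16) = 5.0
manhattan = np.abs(a - b).sum()                      # L1 metric: 3 + 4 = 7.0
minkowski3 = (np.abs(a - b) ** 3).sum() ** (1 / 3)   # L3 metric

C = np.array([[2.0, 0.0], [0.0, 1.0]])               # invented covariance matrix
diff = a - b
mahalanobis = np.sqrt(diff @ np.linalg.inv(C) @ diff)

print(euclid, manhattan, minkowski3, mahalanobis)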
[ END OF 5th UNIT ]


