
AL3391 ARTIFICIAL INTELLIGENCE

UNIT IV

1. Knowledge-based agent in Artificial Intelligence


o An intelligent agent needs knowledge about the real world in order to reason, make decisions, and act
efficiently.
o Knowledge-based agents maintain an internal state of knowledge, reason over that knowledge, update it
after each observation, and take actions. These agents represent the world in some formal representation
and act intelligently.
o Knowledge-based agents are composed of two main parts:
    o Knowledge base and
    o Inference system.

A knowledge-based agent must be able to do the following:

o Represent states, actions, etc.
o Incorporate new percepts.
o Update its internal representation of the world.
o Deduce hidden properties of the world from that internal representation.
o Deduce appropriate actions.

The architecture of knowledge-based agent:

The architecture diagram represents a generalized architecture for a knowledge-based agent. The knowledge-based
agent (KBA) takes input from the environment by perceiving it. The input goes to the inference engine of the
agent, which communicates with the KB to decide on an action according to the knowledge stored there. The
learning element of the KBA regularly updates the KB with newly learned knowledge.

Knowledge base: The knowledge base (KB) is the central component of a knowledge-based agent. It is a collection
of sentences (here 'sentence' is a technical term and is not identical to a sentence in English). These
sentences are expressed in a language called a knowledge representation language. The knowledge base of a
KBA stores facts about the world.

Why use a knowledge base?

A knowledge base is required so that an agent can update its knowledge, learn from experience, and take actions
according to that knowledge.

Inference system

Inference means deriving new sentences from old ones. The inference system allows us to add new sentences to the
knowledge base. A sentence is a proposition about the world. The inference system applies logical rules to the KB
to deduce new information.

The inference system generates new facts so that the agent can update the KB. An inference system mainly works
with two strategies:

o Forward chaining
o Backward chaining

Operations Performed by KBA

The KBA performs the following three operations in order to show intelligent behavior:

1. TELL: This operation tells the knowledge base what the agent perceives from the environment.
2. ASK: This operation asks the knowledge base what action the agent should perform.
3. PERFORM: The agent performs the selected action.

A generic knowledge-based agent:

Following is the outline of a generic knowledge-based agent program:

function KB-AGENT(percept) returns an action
    persistent: KB, a knowledge base
                t, a counter, initially 0, indicating time

    TELL(KB, MAKE-PERCEPT-SENTENCE(percept, t))
    action = ASK(KB, MAKE-ACTION-QUERY(t))
    TELL(KB, MAKE-ACTION-SENTENCE(action, t))
    t = t + 1
    return action

The knowledge-based agent takes a percept as input and returns an action as output. The agent maintains the
knowledge base, KB, which initially contains some background knowledge of the real world. It also has a counter
that indicates the time for the whole process; this counter is initialized to zero.

Each time the function is called, it performs three operations:

o First, it TELLs the KB what it perceives.
o Second, it ASKs the KB what action it should take.
o Third, it TELLs the KB which action was chosen.

MAKE-PERCEPT-SENTENCE generates a sentence asserting that the agent perceived the given percept at the given
time.

MAKE-ACTION-QUERY generates a sentence that asks which action should be done at the current time.

MAKE-ACTION-SENTENCE generates a sentence asserting that the chosen action was executed.
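
The TELL / ASK / TELL cycle can be sketched in Python as below. This is only an illustrative sketch: the class names SimpleKB and KBAgent, the tuple encoding of sentences, and the percept-to-action rule table are all assumptions made for this example, and the rule lookup merely stands in for a real inference system.

```python
class SimpleKB:
    """Toy knowledge base (hypothetical): stored sentences plus percept -> action rules."""

    def __init__(self, rules):
        self.sentences = set()   # everything the agent has been told
        self.rules = rules       # stands in for real logical inference

    def tell(self, sentence):
        self.sentences.add(sentence)

    def ask(self, query):
        # query is ("action?", t): find the percept stored for time t
        _, t = query
        for kind, value, when in self.sentences:
            if kind == "percept" and when == t:
                return self.rules.get(value, "wait")
        return "wait"


class KBAgent:
    def __init__(self, kb):
        self.kb = kb
        self.t = 0  # time counter, initially 0

    def __call__(self, percept):
        self.kb.tell(("percept", percept, self.t))   # MAKE-PERCEPT-SENTENCE
        action = self.kb.ask(("action?", self.t))    # MAKE-ACTION-QUERY
        self.kb.tell(("action", action, self.t))     # MAKE-ACTION-SENTENCE
        self.t += 1
        return action


agent = KBAgent(SimpleKB({"obstacle": "turn", "clear": "forward"}))
print(agent("clear"), agent("obstacle"))   # forward turn
```

After the two calls, the KB holds two percept sentences and two action sentences, one pair per time step.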

Various levels of knowledge-based agent:

A knowledge-based agent can be viewed at different levels which are given below:

1. Knowledge level:

The knowledge level is the first level of a knowledge-based agent. At this level, we specify what the agent
knows and what its goals are; with these specifications, we can fix its behavior. For example, suppose an
automated taxi agent needs to go from station A to station B, and it knows the way from A to B. This belongs to
the knowledge level.

2. Logical level:

At this level, the knowledge is encoded into logical sentences. At the logical level, we can expect the
automated taxi agent to reach destination B.

3. Implementation level:

This is the physical representation of logic and knowledge. At the implementation level, the agent performs
actions according to the logical and knowledge levels. At this level, the automated taxi agent actually
implements its knowledge and logic so that it can reach the destination.

Approaches to designing a knowledge-based agent:

There are two main approaches to building a knowledge-based agent:

1. Declarative approach: We create a knowledge-based agent by starting with an empty knowledge base and
telling the agent all the sentences we want it to start with.
2. Procedural approach: We directly encode the desired behavior as program code, that is, we write a program
that already encodes the desired behavior of the agent.

However, in the real world, a successful agent can be built by combining both declarative and procedural
approaches, and declarative knowledge can often be compiled into more efficient procedural code.
2. Propositional logic (Representation Mechanism of knowledge)
Propositional logic (PL) is the simplest form of logic where all the statements are made by propositions. A
proposition is a declarative statement which is either true or false. It is a technique of knowledge representation in
logical and mathematical form.
Example:
a) It is Sunday.
b) The Sun rises in the West. (False proposition)
c) 3 + 3 = 7. (False proposition)
d) 5 is a prime number.

Following are some basic facts about propositional logic:

o Propositional logic is also called Boolean logic as it works on 0 and 1.
o In propositional logic, we use symbolic variables to represent the logic, and we can use any symbol to
represent a proposition, such as A, B, C, P, Q, R, etc.
o A proposition can be either true or false, but it cannot be both.
o Propositional logic consists of proposition symbols and logical connectives. (Objects, relations, and
functions appear only in first-order logic.)
o These connectives are also called logical operators.
o The propositions and connectives are the basic elements of propositional logic.
o A connective is a logical operator that connects two sentences.
o A propositional formula that is always true is called a tautology, and it is also called a valid sentence.
o A propositional formula that is always false is called a contradiction.
o A propositional formula that is true in some cases and false in others is called a contingency.
o Statements that are questions, commands, or opinions, such as "Where is Rohini", "How are you", and
"What is your name", are not propositions.

Syntax of propositional logic:

The syntax of propositional logic defines the allowable sentences for the knowledge representation. There are
two types of Propositions:

A.Atomic Propositions
B.Compound propositions

o Atomic proposition: Atomic propositions are the simplest propositions. Each consists of a single proposition
symbol and must be either true or false.

Example:

a) "2 + 2 is 4" is an atomic proposition; it is true.
b) "The Sun is cold" is also an atomic proposition; it is false.

o Compound proposition: Compound propositions are constructed by combining simpler or atomic
propositions using parentheses and logical connectives.

Example:

a) "It is raining today, and the street is wet."
b) "Ankit is a doctor, and his clinic is in Mumbai."

Logical Connectives:

Logical connectives are used to connect two simpler propositions or to represent a sentence logically. We can
create compound propositions with the help of logical connectives. There are mainly five connectives:

1. Negation: A sentence such as ¬P is called the negation of P. A literal is either a positive literal (P) or a
negative literal (¬P).
2. Conjunction: A sentence that has the ∧ connective, such as P ∧ Q, is called a conjunction.
Example: Rohan is intelligent and hardworking. It can be written as
P = Rohan is intelligent,
Q = Rohan is hardworking, so P ∧ Q.
3. Disjunction: A sentence that has the ∨ connective, such as P ∨ Q, is called a disjunction, where P and Q are
the propositions.
Example: "Ritika is a doctor or an engineer."
Here P = Ritika is a doctor, Q = Ritika is an engineer, so we can write it as P ∨ Q.
4. Implication: A sentence such as P → Q is called an implication. Implications are also known as if-then
rules.
Example: If it is raining, then the street is wet.
Let P = It is raining and Q = The street is wet, so it is represented as P → Q.
5. Biconditional: A sentence such as P ⇔ Q is a biconditional sentence.
Example: I am breathing if and only if I am alive.
P = I am breathing, Q = I am alive, so it can be represented as P ⇔ Q.

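Following is the summarized table for propositional logic connectives:

Connective   Word             Technical term   Example
∧            AND              Conjunction      P ∧ Q
∨            OR               Disjunction      P ∨ Q
→            Implies          Implication      P → Q
⇔            If and only if   Biconditional    P ⇔ Q
¬            Not              Negation         ¬P

Truth Table:

In propositional logic, we need to know the truth values of propositions in all possible scenarios. We can list
every combination of truth values together with the result of each logical connective; this tabular
representation is called a truth table. Following is the truth table for all logical connectives:

P      Q      ¬P     P ∧ Q   P ∨ Q   P → Q   P ⇔ Q
True   True   False  True    True    True    True
True   False  False  False   True    False   False
False  True   True   False   True    True    False
False  False  True   False   False   True    True

Truth table with three propositions:

We can build a proposition composed of three propositions P, Q, and R. Its truth table has 2^3 = 8 rows, because
we have taken three proposition symbols.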
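
A truth table for any number of proposition symbols can be generated mechanically. The sketch below is a minimal illustration; the function name truth_table and the example formula (P ∧ Q) → R are assumptions chosen for this example, not part of the notes.

```python
from itertools import product

def truth_table(symbols, formula):
    """Return one (model, value) pair per row; formula takes a dict of truth values."""
    rows = []
    for values in product([True, False], repeat=len(symbols)):
        model = dict(zip(symbols, values))
        rows.append((model, formula(model)))
    return rows

def implies(a, b):
    # Material implication: a → b is the same as ¬a ∨ b.
    return (not a) or b

# Hypothetical example formula: (P ∧ Q) → R
rows = truth_table(["P", "Q", "R"], lambda m: implies(m["P"] and m["Q"], m["R"]))
for model, value in rows:
    print(model, value)
print(len(rows))   # 8 rows for three symbols
```

Only the row with P and Q true and R false makes this formula false; the other seven rows make it true.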

Precedence of connectives:

Just like arithmetic operators, there is a precedence order for propositional connectors or logical operators. This order should be
followed while evaluating a propositional problem. Following is the list of the precedence order for operators:

Precedence           Operators
First precedence     Parenthesis
Second precedence    Negation
Third precedence     Conjunction (AND)
Fourth precedence    Disjunction (OR)
Fifth precedence     Implication
Sixth precedence     Biconditional

Logical equivalence:

Logical equivalence is one of the features of propositional logic. Two propositions are said to be logically equivalent if and only if
their columns in the truth table are identical.

Take two propositions A and B. The logical equivalence of two formulas is written with ⇔. In the truth table below, the columns
for ¬A ∨ B and A → B are identical, hence ¬A ∨ B is equivalent to A → B.

A      B      ¬A     ¬A ∨ B   A → B
True   True   False  True     True
True   False  False  False    False
False  True   True   True     True
False  False  True   True     True

Properties of Operators:

o Commutativity:

o P∧ Q= Q ∧ P, or

o P ∨ Q = Q ∨ P.

o Associativity:

o (P ∧ Q) ∧ R= P ∧ (Q ∧ R),

o (P ∨ Q) ∨ R= P ∨ (Q ∨ R)

o Identity element:

o P ∧ True = P,

o P ∨ False = P (whereas P ∨ True = True).

o Distributive:

o P∧ (Q ∨ R) = (P ∧ Q) ∨ (P ∧ R).

o P ∨ (Q ∧ R) = (P ∨ Q) ∧ (P ∨ R).

o De Morgan's Law:

o ¬ (P ∧ Q) = (¬P) ∨ (¬Q)

o ¬ (P ∨ Q) = (¬ P) ∧ (¬Q).

o Double-negation elimination:
o ¬ (¬P) = P.

Limitations of Propositional logic:

o We cannot represent relations like ALL, some, or none with propositional logic. Example:

a. All the girls are intelligent.

b. Some apples are sweet.

o Propositional logic has limited expressive power.

o In propositional logic, we cannot describe statements in terms of their properties or logical relationships.
Knowledge Representation:
3. Propositional theorem proving
Theorem proving: Applying rules of inference directly to the sentences in our knowledge base to construct
a proof of the desired sentence without consulting models.
Inference rules are patterns of sound inference that can be used to find proofs. The resolution rule yields a
complete inference algorithm for knowledge bases that are expressed in conjunctive normal form.
Forward chaining and backward chaining are very natural reasoning algorithms for knowledge bases in
Horn form.
Logical equivalence:
Two sentences α and β are logically equivalent if they are true in the same set of models (written α ≡ β).
Also: α ≡ β if and only if α ⊨ β and β ⊨ α.
Validity: A sentence is valid if it is true in all models.
Valid sentences are also known as tautologies; they are necessarily true. Every valid sentence is logically
equivalent to True.
The deduction theorem: For any sentences α and β, α ⊨ β if and only if the sentence (α ⇒ β) is valid.
Satisfiability: A sentence is satisfiable if it is true in, or satisfied by, some model. Satisfiability can be
checked by enumerating the possible models until one is found that satisfies the sentence.
The SAT problem: The problem of determining the satisfiability of sentences in propositional logic.
Validity and satisfiability are connected:
α is valid iff ¬α is unsatisfiable;
α is satisfiable iff ¬α is not valid;
α ⊨ β if and only if the sentence (α ∧ ¬β) is unsatisfiable.
Proving β from α by checking the unsatisfiability of (α ∧ ¬β) corresponds to proof by refutation, also called
proof by contradiction.
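
The link between entailment and unsatisfiability can be checked directly by enumerating models. The following is a minimal sketch; the function names satisfiable and entails, and the example sentences α = P ∧ (P ⇒ Q) and β = Q, are assumptions chosen for illustration.

```python
from itertools import product

def satisfiable(symbols, sentence):
    """Enumerate all models; return the first model that satisfies the sentence, else None."""
    for values in product([True, False], repeat=len(symbols)):
        model = dict(zip(symbols, values))
        if sentence(model):
            return model
    return None

def entails(symbols, alpha, beta):
    """alpha entails beta iff (alpha ∧ ¬beta) is unsatisfiable (proof by refutation)."""
    return satisfiable(symbols, lambda m: alpha(m) and not beta(m)) is None

# Hypothetical example: alpha = P ∧ (P ⇒ Q), beta = Q.
alpha = lambda m: m["P"] and ((not m["P"]) or m["Q"])
beta = lambda m: m["Q"]
print(entails(["P", "Q"], alpha, beta))   # True: alpha entails beta
print(entails(["P", "Q"], beta, alpha))   # False: the converse fails (P false, Q true)
```

Enumeration takes time exponential in the number of symbols, which is why the proof-based methods in this section matter.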

Inference and proofs


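Inference rules (such as Modus Ponens and And-Elimination) can be applied to derive a proof.
·Modus Ponens:

$$\frac{\alpha \Rightarrow \beta, \quad \alpha}{\beta}$$

Whenever any sentences of the form α ⇒ β and α are given, then the sentence β can be inferred.

·And-Elimination:

$$\frac{\alpha \land \beta}{\alpha}$$

From a conjunction, any of the conjuncts can be inferred.

·All of the logical equivalences (in Figure 7.11) can be used as inference rules.
e.g. The equivalence for biconditional elimination yields 2 inference rules:

$$\frac{\alpha \Leftrightarrow \beta}{(\alpha \Rightarrow \beta) \land (\beta \Rightarrow \alpha)} \qquad \text{and} \qquad \frac{(\alpha \Rightarrow \beta) \land (\beta \Rightarrow \alpha)}{\alpha \Leftrightarrow \beta}$$

·De Morgan's rule: ¬(α ∧ β) ≡ (¬α ∨ ¬β) and ¬(α ∨ β) ≡ (¬α ∧ ¬β).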

We can apply any of the search algorithms in Chapter 3 to find a sequence of steps that constitutes a proof. We
just need to define a proof problem as follows:
·INITIAL STATE: the initial knowledge base;
·ACTION: the set of actions consists of all the inference rules applied to all the sentences that match the top half
of the inference rule.
·RESULT: the result of an action is to add the sentence in the bottom half of the inference rule.
·GOAL: the goal is a state that contains the sentence we are trying to prove.
In many practical cases, finding a proof can be more efficient than enumerating models, because the proof can
ignore irrelevant propositions, no matter how many of them there are.

Monotonicity: A property of logical systems which says that the set of entailed sentences can only increase as
information is added to the knowledge base.
For any sentences α and β,
if KB ⊨ α then KB ∧ β ⊨ α.
Monotonicity means that inference rules can be applied whenever suitable premises are found in the
knowledge base; whatever else is in the knowledge base cannot invalidate any conclusion already inferred.
Proof by resolution
Resolution: An inference rule that yields a complete inference algorithm when coupled with any complete search
algorithm.
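Clause: A disjunction of literals (e.g. A ∨ B). A single literal can be viewed as a unit clause (a disjunction of one
literal).
Unit resolution inference rule: Takes a clause and a literal and produces a new clause.

$$\frac{l_1 \lor \cdots \lor l_k, \quad m}{l_1 \lor \cdots \lor l_{i-1} \lor l_{i+1} \lor \cdots \lor l_k}$$

where each l is a literal, and l_i and m are complementary literals (one is the negation of the other).

Full resolution rule: Takes 2 clauses and produces a new clause.

$$\frac{l_1 \lor \cdots \lor l_k, \quad m_1 \lor \cdots \lor m_n}{l_1 \lor \cdots \lor l_{i-1} \lor l_{i+1} \lor \cdots \lor l_k \lor m_1 \lor \cdots \lor m_{j-1} \lor m_{j+1} \lor \cdots \lor m_n}$$

where l_i and m_j are complementary literals.

Notice: The resulting clause should contain only one copy of each literal. The removal of multiple copies of a
literal is called factoring.
e.g. Resolve (A ∨ B) with (A ∨ ¬B) to obtain (A ∨ A), and reduce it to just A.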

The resolution rule is sound and complete.

Conjunctive normal form


Conjunctive normal form (CNF): A sentence expressed as a conjunction of clauses is said to be in CNF.
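Every sentence of propositional logic is logically equivalent to a conjunction of clauses. After converting a
sentence into CNF, it can be used as input to a resolution procedure.

A resolution algorithm

To show that KB ⊨ α, we show that (KB ∧ ¬α) is unsatisfiable. First, (KB ∧ ¬α) is converted into CNF. Then the
resolution rule is applied to pairs of clauses that contain complementary literals, and each new clause is added
to the set. The process stops when either no new clauses can be added (KB does not entail α) or two clauses
resolve to the empty clause (KB entails α).

e.g.
KB = (B_{1,1} ⇔ (P_{1,2} ∨ P_{2,1})) ∧ ¬B_{1,1}
α = ¬P_{1,2}
Notice: Any clause in which two complementary literals appear can be discarded, because it is always
equivalent to True.
e.g. B_{1,1} ∨ ¬B_{1,1} ∨ P_{1,2} = True ∨ P_{1,2} = True.
PL-RESOLUTION is complete.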
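
A resolution-based refutation for the example KB can be sketched as follows. This is an illustrative sketch rather than the textbook PL-RESOLUTION listing: the string encoding of literals ("~B11" for ¬B1,1) and the function names are assumptions, and the CNF clauses were written out by hand from KB ∧ ¬α.

```python
from itertools import combinations

def negate(lit):
    return lit[1:] if lit.startswith("~") else "~" + lit

def resolve(ci, cj):
    """All resolvents of two clauses (frozensets of literals); sets perform factoring."""
    return [frozenset((ci - {lit}) | (cj - {negate(lit)}))
            for lit in ci if negate(lit) in cj]

def is_tautology(clause):
    return any(negate(lit) in clause for lit in clause)

def pl_resolution(clauses):
    """Return True if the clause set is unsatisfiable (the empty clause is derivable)."""
    clauses = set(clauses)
    while True:
        new = set()
        for ci, cj in combinations(clauses, 2):
            for r in resolve(ci, cj):
                if not r:
                    return True        # empty clause: contradiction found
                if not is_tautology(r):
                    new.add(r)         # tautologies can be discarded
        if new <= clauses:
            return False               # nothing new: no contradiction derivable
        clauses |= new

# CNF of KB = (B11 ⇔ (P12 ∨ P21)) ∧ ¬B11
kb = [frozenset(c) for c in (
    {"~B11", "P12", "P21"}, {"~P12", "B11"}, {"~P21", "B11"}, {"~B11"})]

print(pl_resolution(kb + [frozenset({"P12"})]))    # True: KB entails ¬P12
print(pl_resolution(kb + [frozenset({"~P12"})]))   # False: KB does not entail P12
```

In the first call the negated query P12 is added to the KB, and resolving it against (¬P12 ∨ B11) and then ¬B11 yields the empty clause.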

Horn clauses and definite clauses

Definite clause: A disjunction of literals of which exactly one is positive (e.g. ¬L_{1,1} ∨ ¬Breeze ∨ B_{1,1}).
Every definite clause can be written as an implication whose premise is a conjunction of positive literals and
whose conclusion is a single positive literal.
Horn clause: A disjunction of literals of which at most one is positive. (All definite clauses are Horn
clauses.)
In Horn form, the premise is called the body and the conclusion is called the head.
A sentence consisting of a single positive literal is called a fact; it too can be written in implication
form.
Horn clauses are closed under resolution: if you resolve 2 Horn clauses, you get back a Horn clause.
Inference with Horn clauses can be done through the forward-chaining and backward-chaining algorithms.
Deciding entailment with Horn clauses can be done in time that is linear in the size of the
knowledge base.
Goal clause: A clause with no positive literals.

Forward and backward chaining


Forward-chaining algorithm: PL-FC-ENTAILS?(KB, q) (runs in linear time).
The algorithm starts from the known facts. Whenever all the premises of an implication are known, its conclusion
is added to the set of known facts, until the query q is added or no further inferences can be made.
Forward chaining is sound and complete.
e.g. A knowledge base of Horn clauses with A and B as known facts.

Fixed point: The algorithm reaches a fixed point where no new inferences are possible.
Data-driven reasoning: Reasoning in which the focus of attention starts with the known data. It can be used
within an agent to derive conclusions from incoming percepts, often without a specific query in mind. (Forward
chaining is an example.)
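
Forward chaining over definite clauses can be sketched as below. The rule set is an assumption made for illustration (the notes only say that A and B are the known facts), and for brevity the sketch scans every rule for each symbol, so it does not reach the linear running time that an indexed implementation would.

```python
from collections import deque

def pl_fc_entails(rules, facts, q):
    """rules: list of (premises, conclusion) definite clauses; facts: symbols known to be true."""
    count = [len(premises) for premises, _ in rules]   # premises not yet known, per rule
    inferred = set()
    agenda = deque(facts)
    while agenda:
        p = agenda.popleft()
        if p == q:
            return True
        if p in inferred:
            continue
        inferred.add(p)
        for i, (premises, conclusion) in enumerate(rules):
            if p in premises:
                count[i] -= 1
                if count[i] == 0:          # all premises known: fire the rule
                    agenda.append(conclusion)
    return False                           # fixed point reached without deriving q

# Hypothetical Horn knowledge base: P ⇒ Q, L ∧ M ⇒ P, B ∧ L ⇒ M, A ∧ P ⇒ L, A ∧ B ⇒ L
rules = [(["P"], "Q"), (["L", "M"], "P"), (["B", "L"], "M"),
         (["A", "P"], "L"), (["A", "B"], "L")]
print(pl_fc_entails(rules, ["A", "B"], "Q"))   # True
print(pl_fc_entails(rules, ["A"], "Q"))        # False
```

With facts A and B the symbols are derived in the order L, M, P, Q; with only A, no rule ever fires.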

Backward-chaining algorithm: works backward from the query.
If the query q is known to be true, no work is needed;
otherwise the algorithm finds those implications in the KB whose conclusion is q. If all the premises of one of
those implications can be proved true (by backward chaining), then q is true. (Runs in linear time.)
In the corresponding AND-OR graph, it works back down the graph until it reaches a set of known
facts. (The backward-chaining algorithm is essentially identical to the AND-OR-GRAPH-SEARCH
algorithm.)
Backward chaining is a form of goal-directed reasoning.
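
Backward chaining can be sketched as a short recursive function. The rule set below is the same assumed example used to illustrate forward chaining, the name pl_bc_entails is hypothetical, and this simple version does not cache results, so it is not linear-time.

```python
def pl_bc_entails(rules, facts, q, visiting=frozenset()):
    """Try to prove q by working backward from it through rules that conclude q."""
    if q in facts:
        return True                      # known fact: no work needed
    if q in visiting:
        return False                     # q is already a goal on this path: avoid looping
    visiting = visiting | {q}
    return any(
        all(pl_bc_entails(rules, facts, p, visiting) for p in premises)
        for premises, conclusion in rules
        if conclusion == q
    )

# Hypothetical Horn knowledge base: P ⇒ Q, L ∧ M ⇒ P, B ∧ L ⇒ M, A ∧ P ⇒ L, A ∧ B ⇒ L
rules = [(["P"], "Q"), (["L", "M"], "P"), (["B", "L"], "M"),
         (["A", "P"], "L"), (["A", "B"], "L")]
print(pl_bc_entails(rules, {"A", "B"}, "Q"))   # True
print(pl_bc_entails(rules, {"A"}, "Q"))        # False
```

The any(...) over rules corresponds to the OR nodes of the AND-OR graph, and the all(...) over premises corresponds to the AND nodes.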
4. Propositional model checking

The set of possible models, given a fixed propositional vocabulary, is finite, so entailment can be checked by
enumerating models. Efficient model-checking inference algorithms for propositional logic include backtracking
and local search methods, and they can often solve large problems quickly.
There are 2 families of algorithms for the SAT problem based on model checking:
a. based on backtracking
b. based on local hill-climbing search

1. A complete backtracking algorithm
The Davis-Putnam algorithm (DPLL, after Davis, Putnam, Logemann, and Loveland):

DPLL embodies 3 improvements over the scheme of TT-ENTAILS?: early termination, the pure symbol heuristic, and
the unit clause heuristic.
Tricks that enable SAT solvers to scale up to large problems: component analysis, variable and
value ordering, intelligent backtracking, random restarts, and clever indexing.
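
The three improvements can be seen in a compact DPLL sketch. This is an illustrative version under stated assumptions, not a production SAT solver: clauses are sets of integer literals (3 means x3, -3 means ¬x3, a convention borrowed from the DIMACS format), and the function name dpll is chosen for this example.

```python
def dpll(clauses, model=None):
    """Return a satisfying (possibly partial) model {var: bool}, or None if unsatisfiable."""
    model = dict(model or {})
    simplified = []
    for clause in clauses:
        if any(model.get(abs(l)) == (l > 0) for l in clause):
            continue                               # clause already true under the model
        rest = {l for l in clause if abs(l) not in model}
        if not rest:
            return None                            # early termination: a clause is false
        simplified.append(rest)
    if not simplified:
        return model                               # early termination: every clause is true
    for clause in simplified:                      # unit clause heuristic
        if len(clause) == 1:
            (l,) = clause
            return dpll(simplified, {**model, abs(l): l > 0})
    literals = {l for clause in simplified for l in clause}
    for l in literals:                             # pure symbol heuristic
        if -l not in literals:
            return dpll(simplified, {**model, abs(l): l > 0})
    var = abs(next(iter(simplified[0])))           # otherwise branch on some variable
    result = dpll(simplified, {**model, var: True})
    if result is not None:
        return result
    return dpll(simplified, {**model, var: False})

# (x1 ∨ x2) ∧ (¬x1 ∨ x2) ∧ (¬x2 ∨ x3)
print(dpll([{1, 2}, {-1, 2}, {-2, 3}]))
print(dpll([{1}, {-1}]))                           # None: unsatisfiable
```

Variables that do not appear in the returned model can take either value without falsifying any clause.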

Local search algorithms


Local search algorithms can be applied directly to the SAT problem, provided that we choose the right
evaluation function. (We can choose an evaluation function that counts the number of unsatisfied clauses.)
These algorithms take steps in the space of complete assignments, flipping the truth value of one symbol at a time.
The space usually contains many local minima, and escaping from them requires various forms of randomness.
Local search methods such as WALKSAT can be used to find solutions. Such algorithms are sound but not
complete.
WALKSAT: one of the simplest and most effective algorithms.
On every iteration, the algorithm picks an unsatisfied clause and chooses randomly between 2 ways to pick a
symbol to flip:
Either a. a "min-conflicts" step that minimizes the number of unsatisfied clauses in the new state;
Or b. a "random walk" step that picks the symbol randomly.
When the algorithm returns a model, the input sentence is indeed satisfiable;
When the algorithm returns failure, there are 2 possible causes:
Either a. The sentence is unsatisfiable;
Or b. We need to give the algorithm more time.
If we set max_flips = ∞ and p > 0, the algorithm will:
Either a. eventually return a model if one exists;
Or b. never terminate if the sentence is unsatisfiable.
Thus WALKSAT is useful when we expect a solution to exist, but it cannot always detect unsatisfiability.
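
The WALKSAT loop described above can be sketched as follows. The integer encoding of literals (3 means x3, -3 means ¬x3), the default parameter values, and the seed argument are assumptions made for this example; p and max_flips are the parameters named in the text.

```python
import random

def walksat(clauses, p=0.5, max_flips=10_000, seed=0):
    """clauses: list of lists of int literals. Returns a model {var: bool} or None (failure)."""
    rng = random.Random(seed)
    symbols = sorted({abs(l) for c in clauses for l in c})
    model = {s: rng.choice([True, False]) for s in symbols}   # random complete assignment

    def satisfied(clause):
        return any(model[abs(l)] == (l > 0) for l in clause)

    def unsatisfied_after_flip(s):
        model[s] = not model[s]                               # flip temporarily
        n = sum(not satisfied(c) for c in clauses)
        model[s] = not model[s]                               # undo the flip
        return n

    for _ in range(max_flips):
        unsatisfied = [c for c in clauses if not satisfied(c)]
        if not unsatisfied:
            return model
        clause = rng.choice(unsatisfied)
        if rng.random() < p:
            sym = abs(rng.choice(clause))                     # random-walk step
        else:                                                 # min-conflicts step
            sym = min((abs(l) for l in clause), key=unsatisfied_after_flip)
        model[sym] = not model[sym]
    return None                                               # failure: unsat or out of flips

print(walksat([[1, 2], [-1, 2], [-2, 3]]))
print(walksat([[1], [-1]], max_flips=100))                    # None: no model exists
```

A returned model always satisfies every clause (soundness), whereas None only means that no model was found within max_flips.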

The landscape of random SAT problems


Underconstrained problem: When we look at satisfiability problems in CNF, an underconstrained problem is
one with relatively few clauses constraining the variables.
An overconstrained problem has many clauses relative to the number of variables and is likely to have
no solutions.

The notation CNF_k(m, n) denotes a k-CNF sentence with m clauses and n symbols (that is, n variables and k
literals per clause).
Given a source of random sentences, the clauses are chosen uniformly, independently, and without
replacement from among all clauses with k different literals, which are positive or negative at random.
Hardness: problems right at the threshold > overconstrained problems > underconstrained problems.
Satisfiability threshold conjecture: A conjecture which says that for every k ≥ 3, there is a threshold ratio r_k
such that, as n goes to infinity, the probability that CNF_k(n, rn) is satisfiable becomes 1 for all values of r
below the threshold, and 0 for all values above it. Here r is the ratio of clauses to symbols. (The conjecture
remains unproven.)

5. Agents based on propositional logic

1. The current state of the world


We can associate propositions with time stamps to avoid contradiction.
e.g. ¬Stench^3, Stench^4
Fluent: refers to an aspect of the world that changes (e.g. L^t_{x,y}).
Atemporal variables: Symbols associated with permanent aspects of the world do not need a time superscript.
Effect axioms: specify the outcome of an action at the next time step.

Frame problem: some information is lost because the effect axioms fail to state what remains unchanged as the
result of an action.
Solution: add frame axioms explicitly asserting all the propositions that remain the same.

Representational frame problem: The proliferation of frame axioms is inefficient; the set of frame axioms
will be O(mn) in a world with m different actions and n fluents.
Solution: because the world exhibits locality (for humans, each action typically changes no more than some
number k of those fluents), define the transition model with a set of axioms of size O(mk) rather than size O(mn).

Inferential frame problem: The problem of projecting forward the results of a t-step plan of action in time O(kt)
rather than O(nt).
Solution: change one's focus from writing axioms about actions to writing axioms about fluents.
For each fluent F, we will have an axiom that defines the truth value of F^{t+1} in terms of fluents at time t and the
action that may have occurred at time t.
The truth value of F^{t+1} can be set in one of 2 ways:
Either a. The action at time t causes F to be true at t+1;
Or b. F was already true at time t and the action at time t does not cause it to be false.
An axiom of this form is called a successor-state axiom and has this schema:

$$F^{t+1} \Leftrightarrow \mathit{ActionCausesF}^{t} \lor (F^{t} \land \lnot \mathit{ActionCausesNotF}^{t})$$
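
A successor-state axiom can be read as an update rule for one fluent. The sketch below assumes the usual schema F^{t+1} ⇔ ActionCausesF^t ∨ (F^t ∧ ¬ActionCausesNotF^t) and applies it to a hypothetical fluent HaveArrow, which no action makes true and which the action Shoot makes false; the function and variable names are illustrative.

```python
def successor_state(f_now, action_causes_f, action_causes_not_f):
    """F at t+1 holds iff the action causes F, or F held at t and the action does not undo it."""
    return action_causes_f or (f_now and not action_causes_not_f)

# Hypothetical fluent HaveArrow: nothing creates an arrow, Shoot removes it.
have_arrow = True
history = []
for action in ["Forward", "Shoot", "Forward"]:
    have_arrow = successor_state(have_arrow, False, action == "Shoot")
    history.append(have_arrow)
print(history)   # [True, False, False]
```

Because the rule is stated per fluent, an action that does not mention the fluent leaves it unchanged without any separate frame axiom.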

Qualification problem: specifying all unusual exceptions that could cause the action to fail.

2. A hybrid agent
Hybrid agent: combines the ability to deduce various aspects of the state of the world with condition-action rules
and with problem-solving algorithms.
The agent maintains and updates a KB as well as a current plan.
The initial KB contains the atemporal axioms (those that do not depend on t).
At each time step, the new percept sentence is added along with all the axioms that depend on t (such as the
successor-state axioms).
Then the agent uses logical inference, by ASKing questions of the KB, to work out which squares are safe and
which have yet to be visited.

The main body of the agent program constructs a plan based on a decreasing priority of goals:
1. If there is a glitter, construct a plan to grab the gold, follow a route back to the initial location, and climb out of
the cave;
2. Otherwise, if there is no current plan, plan a route (with A* search) to the closest safe square that has not yet
been visited, making sure the route goes through only safe squares;
3. If there are no safe squares to explore and the agent still has an arrow, try to make a safe square by shooting at
one of the possible wumpus locations;
4. If this fails, look for a square to explore that is not provably unsafe;
5. If there is no such square, the mission is impossible; retreat to the initial location and climb out of the cave.
Weakness: The computational expense goes up as time goes by.

3. Logical state estimation


To get a constant update time, we need to cache the results of inference.
Belief state: Some representation of the set of all possible current states of the world (used to replace the past
history of percepts and all their ramifications).

We use a logical sentence involving the proposition symbols associated with the current time step and the atemporal
symbols.
Logical state estimation involves maintaining a logical sentence that describes the set of possible states consistent
with the observation history. Each update step requires inference using the transition model of the environment,
which is built from successor-state axioms that specify how each fluent changes.
State estimation: The process of updating the belief state as new percepts arrive.

Exact state estimation may require logical formulas whose size is exponential in the number of symbols.
One common scheme for approximate state estimation is to represent belief states as conjunctions of literals
(1-CNF formulas).
The agent simply tries to prove X^t and ¬X^t for each symbol X^t, given the belief state at t-1.
The conjunction of provable literals becomes the new belief state, and the previous belief state is discarded.
(This scheme may lose some information as time goes along.)

The set of possible states represented by the 1-CNF belief state includes all states that are in fact possible given the
full percept history. The 1-CNF belief state acts as a simple outer envelope, or conservative approximation.

4. Making plans by propositional inference


We can make plans by logical inference instead of the A* search in Figure 7.20.
Basic idea:
1. Construct a sentence that includes:
a) Init^0: a collection of assertions about the initial state;
b) Transition^1, …, Transition^t: the successor-state axioms for all possible actions at each time up to some
maximum time t;
c) HaveGold^t ∧ ClimbedOut^t: the assertion that the goal is achieved at time t.
2. Present the whole sentence to a SAT solver. If the solver finds a satisfying model, the goal is achievable;
otherwise planning is impossible.
3. Assuming a model is found, extract from the model those variables that represent actions and are assigned true.
Together they represent a plan to achieve the goals.
Decisions within a logical agent can be made by SAT solving: finding possible models specifying future
action sequences that reach the goal. This approach works only for fully observable or sensorless
environments.
SATPLAN: A propositional planning algorithm. (It cannot be used in a partially observable environment.)
SATPLAN finds models for a sentence containing the initial state, the goal, the successor-state axioms, and the
action exclusion axioms.
(Because the agent does not know how many steps it will take to reach the goal, the algorithm tries each possible
number of steps t up to some maximum conceivable plan length Tmax.)

Precondition axioms: stating that an action occurrence requires the preconditions to be satisfied, added to avoid
generating plans with illegal actions.
Action exclusion axioms: added to avoid the creation of plans with multiple simultaneous actions that interfere with
each other.
Propositional logic does not scale to environments of unbounded size because it lacks the expressive power to deal
concisely with time, space, and universal patterns of relationships among objects.

6. First-order logic

First-Order Logic in Artificial intelligence

In the topic of propositional logic, we have seen how to represent statements using propositional logic.
Unfortunately, in propositional logic we can only represent facts, which are either true or false. PL is not
sufficient to represent complex sentences or natural language statements; it has very limited expressive power.
Consider the following sentences, which we cannot represent adequately using PL:

o "Some humans are intelligent", or
o "Sachin likes cricket."

To represent the above statements, PL is not sufficient, so we require a more powerful logic, such as
first-order logic.

First-Order logic:
o First-order logic is another way of knowledge representation in artificial intelligence. It is an extension
of propositional logic.
o FOL is sufficiently expressive to represent natural language statements in a concise way.
o First-order logic is also known as predicate logic or first-order predicate logic. It is a
powerful language that describes information about objects more easily and can also express the
relationships between those objects.
o First-order logic (like natural language) does not only assume that the world contains facts, as
propositional logic does, but also assumes the following things in the world:
    o Objects: A, B, people, numbers, colors, wars, theories, squares, pits, wumpus, ......
    o Relations: These can be unary relations (properties) such as red or round, or n-ary relations such as
    is adjacent, the sister of, brother of, has color, comes between
    o Functions: father of, best friend, third inning of, end of, ......
o Like a natural language, first-order logic has two main parts:
a. Syntax
b. Semantics

Syntax of First-Order logic:

The syntax of FOL determines which collection of symbols is a logical expression in first-order logic. The
basic syntactic elements of first-order logic are symbols. We write statements in short-hand notation in FOL.

Basic Elements of First-order logic:

Following are the basic elements of FOL syntax:

Constants     1, 2, A, John, Mumbai, cat, ....
Variables     x, y, z, a, b, ....
Predicates    Brother, Father, >, ....
Functions     sqrt, LeftLegOf, ....
Connectives   ∧, ∨, ¬, ⇒, ⇔
Equality      =
Quantifiers   ∀, ∃

Atomic sentences:
o Atomic sentences are the most basic sentences of first-order logic. They are formed from a
predicate symbol followed by a parenthesized sequence of terms.
o We can represent atomic sentences as Predicate(term1, term2, ......, term n).

Example: Ravi and Ajay are brothers: Brothers(Ravi, Ajay).
Chinky is a cat: cat(Chinky).

Complex Sentences:
o Complex sentences are made by combining atomic sentences using connectives.

First-order logic statements can be divided into two parts:

o Subject: The subject is the main part of the statement.
o Predicate: A predicate can be defined as a relation that binds two atoms together in a statement.
Consider the statement "x is an integer." It consists of two parts: the first part, x, is the subject of the
statement, and the second part, "is an integer," is known as the predicate.

Quantifiers in First-order logic:


o A quantifier is a language element that generates quantification, and quantification specifies the
quantity of specimens in the universe of discourse.
o Quantifiers are the symbols that make it possible to determine or identify the range and scope of a variable
in a logical expression. There are two types of quantifier:
a. Universal quantifier (for all, everyone, everything)
b. Existential quantifier (for some, at least one).

Universal Quantifier:

Universal quantifier is a symbol of logical representation, which specifies that the statement within its range is true
for everything or every instance of a particular thing.

The Universal quantifier is represented by a symbol ∀, which resembles an inverted A.

Note: In universal quantifier we use implication "→".

If x is a variable, then ∀x is read as:

o For all x
o For each x
o For every x.

Example:

All men drink coffee.

Let x be a variable that refers to a man, so the statement can be represented in the universe of discourse (UOD) as below:
∀x man(x) → drink (x, coffee).

It will be read as: For all x, if x is a man, then x drinks coffee.

Existential Quantifier:

Existential quantifiers are the type of quantifiers, which express that the statement within its scope is true for at
least one instance of something.

It is denoted by the logical operator ∃, which resembles an inverted E. When it is used with a predicate variable, it is called an existential quantifier.

Note: With the existential quantifier we use the AND or conjunction symbol (∧).

If x is a variable, then existential quantifier will be ∃x or ∃(x). And it will be read as:

o There exists a 'x.'


o For some 'x.'
o For at least one 'x.'

Example:

Some boys are intelligent.


∃x: boys(x) ∧ intelligent(x)

It will be read as: There is some x such that x is a boy and x is intelligent.

Points to remember:
o The main connective for universal quantifier ∀ is implication →.
o The main connective for existential quantifier ∃ is and ∧.

Properties of Quantifiers:
o In universal quantifier, ∀x∀y is equivalent to ∀y∀x.
o In Existential quantifier, ∃x∃y is equivalent to ∃y∃x.
o ∃x∀y is not equivalent to ∀y∃x.
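
Over a small finite universe of discourse, the quantifiers above can be checked mechanically. The following Python sketch is only an illustration: the domain, the sets man, drinks_coffee, boy, intelligent, and the likes relation are made-up data that do not come from these notes. It evaluates ∀ with implication, ∃ with conjunction, and shows that swapping ∃ and ∀ changes the meaning.

```python
# Hypothetical finite universe of discourse; all names and facts are made up.
domain = ["Ravi", "Ajay", "Chinky"]
man = {"Ravi", "Ajay"}
drinks_coffee = {"Ravi", "Ajay"}
boy = {"Ravi", "Ajay"}
intelligent = {"Ajay"}
likes = {("Ravi", "Ajay"), ("Ajay", "Ravi"), ("Ravi", "Chinky")}


def implies(p, q):
    """Material implication p → q."""
    return (not p) or q


# ∀x man(x) → drink(x, coffee): implication is the main connective.
all_men_drink_coffee = all(implies(x in man, x in drinks_coffee) for x in domain)

# ∃x boys(x) ∧ intelligent(x): conjunction is the main connective.
some_boy_is_intelligent = any(x in boy and x in intelligent for x in domain)

# Quantifier order matters: ∀y ∃x likes(x, y) versus ∃x ∀y likes(x, y).
everyone_is_liked = all(any((x, y) in likes for x in domain) for y in domain)
someone_likes_everyone = any(all((x, y) in likes for y in domain) for x in domain)

print(all_men_drink_coffee, some_boy_is_intelligent)
print(everyone_is_liked, someone_likes_everyone)
```

With this data, every individual is liked by someone, but no single individual likes everyone, so the two mixed-quantifier sentences get different truth values.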

Some Examples of FOL using quantifier:

1. All birds fly.


In this question the predicate is "fly(bird)."
Since the statement holds for all birds, it will be represented as follows.
∀x bird(x) → fly(x).

2. Every man respects his parent.


In this question, the predicate is "respect(x, y)," where x = man and y = parent. Since the statement is about every man, we will use ∀, and it will be represented as follows:
∀x man(x) → respects (x, parent).

3. Some boys play cricket.


In this question, the predicate is "play(x, y)," where x = boys and y = game. Since the statement is about some boys, we will use ∃ with conjunction, and it will be represented as:
∃x boys(x) ∧ play(x, cricket).

4. Not all students like both Mathematics and Science.


In this question, the predicate is "like(x, y)," where x= student, and y= subject.
Since the statement is about not all students, we will use ∀ with negation, which gives the following representation:
¬∀ (x) [ student(x) → like(x, Mathematics) ∧ like(x, Science)].

5. Only one student failed in Mathematics.


In this question, the predicate is "failed(x, y)," where x= student, and y= subject.
Since there is only one student who failed in Mathematics, we will use the following representation for this:
∃(x) [ student(x) ∧ failed(x, Mathematics) ∧ ∀(y) [¬(x==y) ∧ student(y) → ¬failed(y, Mathematics)]].
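
The "only one student failed" sentence can be checked on a small finite domain. The sketch below is an illustration under assumptions: the names in students and domain, and the helper exactly_one_student_failed, are hypothetical. It also shows why an existential sentence such as example 3 is normally written with ∧: with → the sentence holds vacuously as soon as the domain contains someone who is not a boy.

```python
# Hypothetical domain: three students and one individual who is not a student.
students = {"Asha", "Bala", "Chitra"}
domain = students | {"Rex"}


def implies(p, q):
    """Material implication p → q."""
    return (not p) or q


def exactly_one_student_failed(failed):
    """∃x [student(x) ∧ failed(x) ∧ ∀y (¬(x == y) ∧ student(y) → ¬failed(y))]."""
    return any(
        x in students and x in failed
        and all(implies(y != x and y in students, y not in failed) for y in domain)
        for x in domain
    )


# Pitfall with ∃ and →: nobody plays cricket, yet the implication version holds
# because Rex is not a boy, which makes the implication true for x = Rex.
boys = {"Bala"}
plays_cricket = set()
with_implication = any(implies(x in boys, x in plays_cricket) for x in domain)
with_conjunction = any(x in boys and x in plays_cricket for x in domain)

print(exactly_one_student_failed({"Bala"}))
print(with_implication, with_conjunction)
```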

Free and Bound Variables:

The quantifiers interact with variables which appear in a suitable way. There are two types of variables in First-
order logic which are given below:

Free Variable: A variable is said to be a free variable in a formula if it occurs outside the scope of the quantifier.

Example: ∀x ∃(y)[P (x, y, z)], where z is a free variable.

Bound Variable: A variable is said to be a bound variable in a formula if it occurs within the scope of the quantifier.

Example: ∀x ∃y [A(x) ∧ B(y)], here x and y are the bound variables.
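
Free variables can be computed by walking a formula and removing each variable at the quantifier that binds it. The sketch below is a minimal illustration, assuming a made-up tuple encoding: ("forall", var, body), ("exists", var, body), connectives such as ("and", f, g), and atoms such as ("P", "x", "y", "z"), where lowercase argument names are variables. Function terms are not handled.

```python
QUANTIFIERS = {"forall", "exists"}
CONNECTIVES = {"and", "or", "not", "implies"}


def is_variable(term):
    """Assumed convention: variables are lowercase strings, constants capitalized."""
    return isinstance(term, str) and term[:1].islower()


def free_variables(formula):
    """Return the set of variables that occur free in the formula."""
    op = formula[0]
    if op in QUANTIFIERS:
        _, var, body = formula
        # The quantifier binds var inside its scope.
        return free_variables(body) - {var}
    if op in CONNECTIVES:
        result = set()
        for sub in formula[1:]:
            result |= free_variables(sub)
        return result
    # Atomic sentence: (predicate, arg1, arg2, ...)
    return {arg for arg in formula[1:] if is_variable(arg)}


# ∀x ∃y P(x, y, z): z is free.
f1 = ("forall", "x", ("exists", "y", ("P", "x", "y", "z")))
# A formula in which both x and y are quantified: no free variables.
f2 = ("forall", "x", ("exists", "y", ("and", ("A", "x"), ("B", "y"))))

print(free_variables(f1), free_variables(f2))
```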

7. Knowledge representation and engineering

Knowledge Engineering in First-order logic

What is knowledge-engineering?

The process of constructing a knowledge-base in first-order logic is called knowledge engineering. In knowledge engineering, someone who investigates a particular domain, learns the important concepts of that domain, and generates a formal representation of the objects is known as a knowledge engineer.

In this topic, we will understand the Knowledge engineering process in an electronic circuit domain, which is
already familiar. This approach is mainly suitable for creating special-purpose knowledge base.

The knowledge-engineering process:

Following are some main steps of the knowledge-engineering process. Using these steps, we will develop a
knowledge base which will allow us to reason about digital circuit (One-bit full adder) which is given below
1. Identify the task:

The first step of the process is to identify the task, and for the digital circuit, there are various reasoning tasks.

At the first level or highest level, we will examine the functionality of the circuit:

o Does the circuit add properly?


o What will be the output of gate A2, if all the inputs are high?

At the second level, we will examine the circuit structure details such as:

o Which gate is connected to the first input terminal?


o Does the circuit have feedback loops?

2. Assemble the relevant knowledge:

In the second step, we will assemble the relevant knowledge which is required for digital circuits. So for digital
circuits, we have the following required knowledge:

o Logic circuits are made up of wires and gates.


o Signal flows through wires to the input terminal of the gate, and each gate produces the corresponding
output which flows further.
o In this logic circuit, there are four types of gates used: AND, OR, XOR, and NOT.
o All these gates have one output terminal and two input terminals (except NOT gate, it has one input
terminal).

3. Decide on vocabulary:

The next step of the process is to select functions, predicate, and constants to represent the circuits, terminals,
signals, and gates. Firstly we will distinguish the gates from each other and from other objects. Each gate is
represented as an object which is named by a constant, such as, Gate(X1). The functionality of each gate is
determined by its type, which is taken as constants such as AND, OR, XOR, or NOT. Circuits will be identified by
a predicate: Circuit (C1).

For the terminal, we will use predicate: Terminal(x).

For gate input, we will use the function In(1, X1) for denoting the first input terminal of the gate, and for output
terminal we will use Out (1, X1).

The predicate Arity(c, i, j) is used to denote that circuit c has i inputs and j outputs.

The connectivity between gates can be represented by the predicate Connect, for example Connect(Out(1, X1), In(1, X2)).

We use a unary predicate On(t), which is true if the signal at a terminal is on.

4. Encode general knowledge about the domain:

To encode the general knowledge about the logic circuit, we need some following rules:

o If two terminals are connected then they have the same signal, which can be represented as:

∀ t1, t2 Terminal(t1) ∧ Terminal(t2) ∧ Connect(t1, t2) → Signal(t1) = Signal(t2).

o The signal at every terminal has either value 0 or 1, which is represented as:

∀ t Terminal(t) → Signal(t) = 1 ∨ Signal(t) = 0.

o The Connect predicate is commutative:

∀ t1, t2 Connect(t1, t2) → Connect(t2, t1).

o Representation of types of gates:

∀ g Gate(g) ∧ r = Type(g) → r = OR ∨ r = AND ∨ r = XOR ∨ r = NOT.

o The output of an AND gate is zero if and only if any of its inputs is zero:

∀ g Gate(g) ∧ Type(g) = AND → Signal(Out(1, g)) = 0 ⇔ ∃n Signal(In(n, g)) = 0.

o The output of an OR gate is 1 if and only if any of its inputs is 1:

∀ g Gate(g) ∧ Type(g) = OR → Signal(Out(1, g)) = 1 ⇔ ∃n Signal(In(n, g)) = 1.

o The output of an XOR gate is 1 if and only if its inputs are different:

∀ g Gate(g) ∧ Type(g) = XOR → Signal(Out(1, g)) = 1 ⇔ Signal(In(1, g)) ≠ Signal(In(2, g)).

o The output of a NOT gate is the inverse of its input:

∀ g Gate(g) ∧ Type(g) = NOT → Signal(In(1, g)) ≠ Signal(Out(1, g)).

o All the gates in the above circuit have two inputs and one output (except the NOT gate):

∀ g Gate(g) ∧ Type(g) = NOT → Arity(g, 1, 1)

∀ g Gate(g) ∧ r = Type(g) ∧ (r = AND ∨ r = OR ∨ r = XOR) → Arity(g, 2, 1).

o All gates are logic circuits:

∀ g Gate(g) → Circuit(g).

5. Encode a description of the problem instance:

Now we encode the problem of circuit C1. First we categorize the circuit and its gate components. This step is easy if the ontology of the problem has already been thought out. It involves writing simple atomic sentences about instances of the concepts that make up the ontology.

For the given circuit C1, we can encode the problem instance in atomic sentences as below:

Since in the circuit there are two XOR, two AND, and one OR gate, the atomic sentences for these gates will be:

For XOR gates: Type(X1) = XOR, Type(X2) = XOR
For AND gates: Type(A1) = AND, Type(A2) = AND
For OR gate: Type(O1) = OR.

And then represent the connections between all the gates.


Note: Ontology defines a particular theory of the nature of existence.

6. Pose queries to the inference procedure and get answers:

In this step, we will find all the possible sets of values of all the terminals of the adder circuit. The first query will be:

What combination of inputs would make the first output of circuit C1 equal to 0 and the second output equal to 1?

∃ i1, i2, i3 Signal(In(1, C1)) = i1 ∧ Signal(In(2, C1)) = i2 ∧ Signal(In(3, C1)) = i3
∧ Signal(Out(1, C1)) = 0 ∧ Signal(Out(2, C1)) = 1
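
The query can also be answered by brute force over the eight input combinations. The sketch below is an illustration under an assumption: the wiring of the gates X1, X2, A1, A2, and O1 follows the usual one-bit full adder layout, because the circuit figure itself is not reproduced in the text. The function name full_adder is hypothetical.

```python
from itertools import product


def full_adder(i1, i2, i3):
    """Simulate the one-bit full adder C1 (assumed standard wiring)."""
    x1 = i1 ^ i2   # gate X1 (XOR) on inputs 1 and 2
    x2 = x1 ^ i3   # gate X2 (XOR) gives Out(1, C1), the sum bit
    a1 = i1 & i2   # gate A1 (AND) on inputs 1 and 2
    a2 = i3 & x1   # gate A2 (AND) on input 3 and the output of X1
    o1 = a1 | a2   # gate O1 (OR) gives Out(2, C1), the carry bit
    return x2, o1


# Query: which inputs give Out(1, C1) = 0 and Out(2, C1) = 1?
answers = [bits for bits in product((0, 1), repeat=3) if full_adder(*bits) == (0, 1)]
print(answers)

# Task-level check "Does the circuit add properly?": sum + 2 * carry = i1 + i2 + i3.
adds_properly = all(
    full_adder(*bits)[0] + 2 * full_adder(*bits)[1] == sum(bits)
    for bits in product((0, 1), repeat=3)
)
print(adds_properly)
```

Under this wiring the answers are exactly the input combinations in which two of the three inputs are 1.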

7. Debug the knowledge base:

Now we will debug the knowledge base, and this is the last step of the complete process. In this step, we will try
to debug the issues of knowledge base.

In the knowledge base, we may have omitted assertions like 1 ≠ 0.

8. Inferences in first-order logic

Inference in First-Order Logic

Inference in First-Order Logic is used to deduce new facts or sentences from existing sentences. Before
understanding the FOL inference rule, let's understand some basic terminologies used in FOL.

Substitution:

Substitution is a fundamental operation performed on terms and formulas. It occurs in all inference systems in first-order logic. Substitution becomes more complex in the presence of quantifiers. If we write F[a/x], it means that the constant "a" is substituted in place of the variable "x" in F.

Note: First-order logic is capable of expressing facts about some or all objects in the universe.

Equality:

First-Order logic does not only use predicate and terms for making atomic sentences but also uses another way,
which is equality in FOL. For this, we can use equality symbols which specify that the two terms refer to the same
object.

Example: Brother (John) = Smith.

In the above example, the object referred to by Brother (John) is the same as the object referred to by Smith.
The equality symbol can also be used with negation to state that two terms do not refer to the same object.

Example: ¬(x=y) which is equivalent to x ≠y.


FOL inference rules for quantifier:

As in propositional logic, we also have inference rules in first-order logic. Following are some basic inference rules in FOL:

o Universal Generalization
o Universal Instantiation
o Existential Instantiation
o Existential introduction

1. Universal Generalization:

o Universal generalization is a valid inference rule which states that if premise P(c) is true for any arbitrary element c in the universe of discourse, then we can conclude ∀x P(x).

o It can be represented as: $\frac{P(c)}{\forall x\, P(x)}$

o This rule can be used if we want to show that every element has a similar property.
o In this rule, x must not appear as a free variable.

Example: Let's represent, P(c): "A byte contains 8 bits", so ∀x P(x), "All bytes contain 8 bits", will also be true.

2. Universal Instantiation:

o Universal instantiation, also called universal elimination or UI, is a valid inference rule. It can be applied multiple times to add new sentences.
o The new KB is logically equivalent to the previous KB.
o As per UI, we can infer any sentence obtained by substituting a ground term for the variable.
o The UI rule states that we can infer any sentence P(c) by substituting a ground term c (a constant within the domain of x) for the variable in ∀x P(x), for any object in the universe of discourse.

o It can be represented as: $\frac{\forall x\, P(x)}{P(c)}$

o Example 1:
o If "Every person likes ice-cream" => ∀x P(x), we can infer that
"John likes ice-cream" => P(c)
o Example 2:
o Let's take a famous example,
o "All kings who are greedy are Evil." So let our knowledge base contain this detail in the form of FOL:

∀x King(x) ∧ Greedy(x) → Evil(x),

So from this information, we can infer any of the following statements using Universal Instantiation:

o King(John) ∧ Greedy(John) → Evil(John),
o King(Richard) ∧ Greedy(Richard) → Evil(Richard),
o King(Father(John)) ∧ Greedy(Father(John)) → Evil(Father(John)),
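
Universal Instantiation amounts to applying a substitution to the body of a universally quantified sentence. The sketch below is a minimal illustration, assuming a made-up encoding in which sentences and terms are nested tuples (head, arg, ...), variables are lowercase strings, and constants are capitalized. The function subst is a hypothetical helper, and it does not deal with quantifiers inside the expression.

```python
def is_variable(term):
    """Assumed convention: variables are lowercase strings, constants capitalized."""
    return isinstance(term, str) and term[:1].islower()


def subst(theta, expr):
    """Apply the substitution theta (a dict variable -> term) to an expression."""
    if is_variable(expr):
        return theta.get(expr, expr)
    if isinstance(expr, tuple):
        # Keep the predicate, function, or connective name; substitute in the arguments.
        head, *args = expr
        return (head, *(subst(theta, arg) for arg in args))
    return expr


# Body of ∀x King(x) ∧ Greedy(x) → Evil(x)
rule = ("implies", ("and", ("King", "x"), ("Greedy", "x")), ("Evil", "x"))

# Instantiate x with three different ground terms.
for ground_term in ["John", "Richard", ("Father", "John")]:
    print(subst({"x": ground_term}, rule))
```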

3. Existential Instantiation:
o Existential instantiation, also called Existential Elimination, is a valid inference rule in first-order logic.
o It can be applied only once to replace the existential sentence.
o The new KB is not logically equivalent to the old KB, but it will be satisfiable if the old KB was satisfiable.
o This rule states that one can infer P(c) from a formula of the form ∃x P(x) for a new constant symbol c.
o The restriction with this rule is that c used in the rule must be a new term for which P(c) is true.

o It can be represented as: $\frac{\exists x\, P(x)}{P(c)}$

Example:

From the given sentence: ∃x Crown(x) ∧ OnHead(x, John),

So we can infer: Crown(K) ∧ OnHead(K, John), as long as K does not appear in the knowledge base.

o The above used K is a constant symbol, which is called a Skolem constant.
o Existential instantiation is a special case of the Skolemization process.

4. Existential introduction

o Existential introduction, also known as existential generalization, is a valid inference rule in first-order logic.
o This rule states that if there is some element c in the universe of discourse which has a property P, then we can infer that there exists something in the universe which has the property P.

o It can be represented as: $\frac{P(c)}{\exists x\, P(x)}$

o Example: Let's say that,
"Priyanka got good marks in English."
"Therefore, someone got good marks in English."

Generalized Modus Ponens Rule:

For the inference process in FOL, we have a single inference rule which is called Generalized Modus Ponens. It is a lifted version of Modus Ponens.

Generalized Modus Ponens can be summarized as, "P implies Q and P is asserted to be true, therefore Q must be true."

According to Generalized Modus Ponens, for atomic sentences pi, pi', and q, where there is a substitution θ such that SUBST(θ, pi') = SUBST(θ, pi) for all i, it can be represented as:

$\frac{p_1',\ p_2',\ \ldots,\ p_n',\ (p_1 \land p_2 \land \ldots \land p_n \Rightarrow q)}{\mathrm{SUBST}(\theta, q)}$

Example:
We will use this rule for "greedy kings are evil": we find some x such that x is a king and x is greedy, so we can infer that x is evil.

Here, let's say:
p1' is King(John)        p1 is King(x)
p2' is Greedy(y)         p2 is Greedy(x)
θ is {x/John, y/John}    q is Evil(x)
SUBST(θ, q) is Evil(John).
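
Generalized Modus Ponens can be sketched in code as unification of each premise pi with the matching fact pi', followed by applying the resulting θ to q. The code below is a minimal illustration, not a full implementation: atoms are tuples such as ("King", "x"), variables are lowercase strings, the helper names walk, unify, subst, and generalized_modus_ponens are hypothetical, and the occurs check is omitted for brevity.

```python
def is_variable(t):
    """Assumed convention: variables are lowercase strings."""
    return isinstance(t, str) and t[:1].islower()


def walk(t, theta):
    """Follow variable bindings in theta until an unbound variable or a non-variable."""
    while is_variable(t) and t in theta:
        t = theta[t]
    return t


def unify(a, b, theta):
    """Return bindings extending theta that make a and b equal, or None."""
    a, b = walk(a, theta), walk(b, theta)
    if a == b:
        return theta
    if is_variable(a):
        return {**theta, a: b}
    if is_variable(b):
        return {**theta, b: a}
    if isinstance(a, tuple) and isinstance(b, tuple) and len(a) == len(b):
        for x, y in zip(a, b):
            theta = unify(x, y, theta)
            if theta is None:
                return None
        return theta
    return None


def subst(theta, t):
    """Apply the bindings in theta to a term or atom."""
    t = walk(t, theta)
    if isinstance(t, tuple):
        return tuple(subst(theta, x) for x in t)
    return t


def generalized_modus_ponens(premises, conclusion, facts):
    """Unify each premise p_i with the fact p_i' at the same position; return SUBST(θ, q)."""
    theta = {}
    for premise, fact in zip(premises, facts):
        theta = unify(premise, fact, theta)
        if theta is None:
            return None
    return subst(theta, conclusion)


result = generalized_modus_ponens(
    premises=[("King", "x"), ("Greedy", "x")],
    conclusion=("Evil", "x"),
    facts=[("King", "John"), ("Greedy", "y")],
)
print(result)
```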

9. Forward chaining and Backward chaining

Forward Chaining and backward chaining in AI

In artificial intelligence, forward and backward chaining are important topics, but before studying them, let us first understand where these two terms come from.

Inference engine:

The inference engine is the component of the intelligent system in artificial intelligence, which applies logical rules
to the knowledge base to infer new information from known facts. The first inference engine was part of the expert
system. Inference engine commonly proceeds in two modes, which are:

A. Forward chaining
B. Backward chaining

Horn Clause and Definite clause:

Horn clauses and definite clauses are forms of sentences which enable a knowledge base to use a more restricted and efficient inference algorithm. Logical inference algorithms use forward and backward chaining approaches, which require the KB in the form of first-order definite clauses.

Definite clause: A clause which is a disjunction of literals with exactly one positive literal is known as a definite
clause or strict horn clause.

Horn clause: A clause which is a disjunction of literals with at most one positive literal is known as horn clause.
Hence all the definite clauses are horn clauses.

Example: (¬p ∨ ¬q ∨ k). It has only one positive literal, k. It is equivalent to p ∧ q → k.
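
The two definitions differ only in how many positive literals are allowed, which is easy to check in code. The sketch below is an illustration under an assumed encoding: a clause is a list of literal strings, and a negative literal starts with "¬". The helper names are hypothetical.

```python
def count_positive(clause):
    """Number of positive literals in a clause given as a list of strings."""
    return sum(1 for literal in clause if not literal.startswith("¬"))


def is_horn(clause):
    """Horn clause: at most one positive literal."""
    return count_positive(clause) <= 1


def is_definite(clause):
    """Definite clause: exactly one positive literal."""
    return count_positive(clause) == 1


for clause in (["¬p", "¬q", "k"], ["¬p", "¬q"], ["p", "q"]):
    print(clause, is_horn(clause), is_definite(clause))
```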

A. Forward Chaining

Forward chaining is also known as a forward deduction or forward reasoning method when using an inference engine. Forward chaining is a form of reasoning which starts with atomic sentences in the knowledge base and applies inference rules (Modus Ponens) in the forward direction to extract more data until a goal is reached.

The forward-chaining algorithm starts from known facts, triggers all rules whose premises are satisfied, and adds their conclusions to the known facts. This process repeats until the problem is solved.

Properties of Forward-Chaining:
o It is a bottom-up approach, as it moves from bottom to top.
o It is a process of making a conclusion based on known facts or data, starting from the initial state and reaching the goal state.
o The forward-chaining approach is also called data-driven, as we reach the goal using the available data.
o The forward-chaining approach is commonly used in expert systems such as CLIPS, and in business and production rule systems.

Consider the following famous example which we will use in both approaches:

Example:

"As per the law, it is a crime for an American to sell weapons to hostile nations. Country A, an enemy
of America, has some missiles, and all the missiles were sold to it by Robert, who is an American citizen."

Prove that "Robert is criminal."

To solve the above problem, first, we will convert all the above facts into first-order definite clauses, and then we
will use a forward-chaining algorithm to reach the goal.

Facts Conversion into FOL:


o It is a crime for an American to sell weapons to hostile nations. (Let's say p, q, and r are variables)
American(p) ∧ Weapon(q) ∧ Sells(p, q, r) ∧ Hostile(r) → Criminal(p) ...(1)
o Country A has some missiles. ∃p Owns(A, p) ∧ Missile(p). It can be written in two definite clauses by using Existential Instantiation, introducing a new constant T1.
Owns(A, T1) ...(2)
Missile(T1) ...(3)
o All of the missiles were sold to country A by Robert.
∀p Missile(p) ∧ Owns(A, p) → Sells(Robert, p, A) ...(4)
o Missiles are weapons.
Missile(p) → Weapon(p) ...(5)
o An enemy of America is known as hostile.
Enemy(p, America) → Hostile(p) ...(6)
o Country A is an enemy of America.
Enemy(A, America) ...(7)
o Robert is American.
American(Robert) ...(8)

Forward chaining proof:

Step-1:

In the first step we will start with the known facts and will choose the sentences which do not have implications,
such as: American(Robert), Enemy(A, America), Owns(A, T1), and Missile(T1). All these facts will be
represented as below.
Step-2:

At the second step, we will see those facts which can be inferred from the available facts, that is, from rules whose premises are satisfied.

Rule-(1) does not have its premises satisfied, so nothing is added from it in the first iteration.

Facts (2) and (3) are already added.

Rule-(4) is satisfied with the substitution {p/T1}, so Sells(Robert, T1, A) is added, which is inferred from the conjunction of facts (2) and (3).

Rule-(5) is satisfied with the substitution {p/T1}, so Weapon(T1) is added, which is inferred from fact (3).

Rule-(6) is satisfied with the substitution {p/A}, so Hostile(A) is added, which is inferred from fact (7).

Step-3:

At step-3, we can check that Rule-(1) is satisfied with the substitution {p/Robert, q/T1, r/A}, so we can add Criminal(Robert), which is inferred from all the available facts. Hence we have reached our goal statement.

Hence it is proved that Robert is Criminal using forward chaining approach.
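
The forward-chaining proof can be reproduced with a short program. The following is a minimal sketch, not an optimized algorithm: facts are ground tuples, rules are (premises, conclusion) pairs, lowercase arguments are variables, and the helper names match, all_matches, and forward_chain are hypothetical. It returns the new facts derived in each iteration.

```python
def is_variable(t):
    """Assumed convention: variables are lowercase strings."""
    return isinstance(t, str) and t[:1].islower()


def match(pattern, fact, theta):
    """Match a premise pattern against a ground fact; return extended bindings or None."""
    if len(pattern) != len(fact) or pattern[0] != fact[0]:
        return None
    theta = dict(theta)
    for p, f in zip(pattern[1:], fact[1:]):
        if is_variable(p):
            if theta.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return theta


def all_matches(premises, facts, theta):
    """Yield every substitution that satisfies all premises using the known facts."""
    if not premises:
        yield theta
        return
    for fact in facts:
        new_theta = match(premises[0], fact, theta)
        if new_theta is not None:
            yield from all_matches(premises[1:], facts, new_theta)


def forward_chain(rules, facts, goal):
    """Return the sets of facts added per iteration, or None if the goal is unreachable."""
    known, rounds = set(facts), []
    while goal not in known:
        new = set()
        for premises, conclusion in rules:
            for theta in all_matches(premises, known, {}):
                derived = tuple(theta.get(t, t) for t in conclusion)
                if derived not in known:
                    new.add(derived)
        if not new:
            return None
        known |= new
        rounds.append(new)
    return rounds


rules = [
    ([("American", "p"), ("Weapon", "q"), ("Sells", "p", "q", "r"), ("Hostile", "r")],
     ("Criminal", "p")),                                                         # (1)
    ([("Missile", "p"), ("Owns", "A", "p")], ("Sells", "Robert", "p", "A")),     # (4)
    ([("Missile", "p")], ("Weapon", "p")),                                       # (5)
    ([("Enemy", "p", "America")], ("Hostile", "p")),                             # (6)
]
facts = [("Owns", "A", "T1"), ("Missile", "T1"),
         ("Enemy", "A", "America"), ("American", "Robert")]

rounds = forward_chain(rules, facts, ("Criminal", "Robert"))
for number, new_facts in enumerate(rounds, start=1):
    print(number, sorted(new_facts))
```

The first iteration adds the facts that follow from rules (4), (5), and (6); the second iteration fires rule (1) and adds Criminal(Robert).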

B. Backward Chaining:

Backward-chaining is also known as a backward deduction or backward reasoning method when using an inference
engine. A backward chaining algorithm is a form of reasoning, which starts with the goal and works backward,
chaining through rules to find known facts that support the goal.

Properties of backward chaining:

o It is known as a top-down approach.


o Backward-chaining is based on modus ponens inference rule.
o In backward chaining, the goal is broken into sub-goal or sub-goals to prove the facts true.
o It is called a goal-driven approach, as a list of goals decides which rules are selected and used.
o The backward-chaining algorithm is used in game theory, automated theorem proving tools, inference engines, proof assistants, and various AI applications.
o The backward-chaining method mostly uses a depth-first search strategy for proof.

Example:

In backward-chaining, we will use the same above example, and will rewrite all the rules.

o American(p) ∧ Weapon(q) ∧ Sells(p, q, r) ∧ Hostile(r) → Criminal(p) ...(1)
o Owns(A, T1) ...(2)
o Missile(T1) ...(3)
o ∀p Missile(p) ∧ Owns(A, p) → Sells(Robert, p, A) ...(4)
o Missile(p) → Weapon(p) ...(5)
o Enemy(p, America) → Hostile(p) ...(6)
o Enemy(A, America) ...(7)
o American(Robert) ...(8)

Backward-Chaining proof:

In Backward chaining, we will start with our goal predicate, which is Criminal(Robert), and then infer further rules.

Step-1:

At the first step, we will take the goal fact. And from the goal fact, we will infer other facts, and at last, we will
prove those facts true. So our goal fact is "Robert is Criminal," so following is the predicate of it.

Step-2:

At the second step, we will infer other facts from the goal fact which satisfy the rules. As we can see in Rule-(1), the goal predicate Criminal(Robert) is present with the substitution {p/Robert}. So we will add all the conjunctive facts below the first level and will replace p with Robert.

Here we can see American (Robert) is a fact, so it is proved here.


Step-3:

At step-3, we will extract the further fact Missile(q), which is inferred from Weapon(q), as it satisfies Rule-(5). Weapon(q) is also true with the substitution of the constant T1 for q.

Step-4:

At step-4, we can infer the facts Missile(T1) and Owns(A, T1) from Sells(Robert, T1, r), which satisfies Rule-(4) with the substitution of A in place of r. So these two statements are proved here.

Step-5:

At step-5, we can infer the fact Enemy(A, America) from Hostile(A), which satisfies Rule-(6). Hence all the statements are proved true using backward chaining.
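
The backward-chaining proof can be sketched as a depth-first search over sub-goals. The code below is a minimal illustration under assumptions: facts are stored as rules with no premises, lowercase arguments are variables, rule variables are renamed with a numeric suffix before each use, the occurs check is omitted, and all helper names are hypothetical.

```python
import itertools

_fresh = itertools.count()  # source of suffixes for renaming rule variables


def is_variable(t):
    """Assumed convention: variables are lowercase strings."""
    return isinstance(t, str) and t[:1].islower()


def walk(t, theta):
    """Follow variable bindings in theta."""
    while is_variable(t) and t in theta:
        t = theta[t]
    return t


def unify(a, b, theta):
    """Return bindings extending theta that make a and b equal, or None."""
    a, b = walk(a, theta), walk(b, theta)
    if a == b:
        return theta
    if is_variable(a):
        return {**theta, a: b}
    if is_variable(b):
        return {**theta, b: a}
    if isinstance(a, tuple) and isinstance(b, tuple) and len(a) == len(b):
        for x, y in zip(a, b):
            theta = unify(x, y, theta)
            if theta is None:
                return None
        return theta
    return None


def rename(rule):
    """Standardize apart: give the rule's variables names not used anywhere else."""
    suffix = next(_fresh)
    premises, conclusion = rule

    def fresh(atom):
        return tuple(f"{t}_{suffix}" if is_variable(t) else t for t in atom)

    return [fresh(p) for p in premises], fresh(conclusion)


def prove_all(goals, theta, kb):
    """Depth-first search: yield every substitution that proves all goals."""
    if not goals:
        yield theta
        return
    first, rest = goals[0], goals[1:]
    for rule in kb:
        premises, conclusion = rename(rule)
        new_theta = unify(conclusion, first, theta)
        if new_theta is not None:
            # Replace the goal by the premises of the rule that concludes it.
            yield from prove_all(premises + rest, new_theta, kb)


KB = [
    ([("American", "p"), ("Weapon", "q"), ("Sells", "p", "q", "r"), ("Hostile", "r")],
     ("Criminal", "p")),
    ([], ("Owns", "A", "T1")),
    ([], ("Missile", "T1")),
    ([("Missile", "p"), ("Owns", "A", "p")], ("Sells", "Robert", "p", "A")),
    ([("Missile", "p")], ("Weapon", "p")),
    ([("Enemy", "p", "America")], ("Hostile", "p")),
    ([], ("Enemy", "A", "America")),
    ([], ("American", "Robert")),
]

proof = next(prove_all([("Criminal", "Robert")], {}, KB), None)
print(proof is not None)
```

Asking for ("Criminal", "who") instead of a ground goal returns bindings in which walk("who", theta) gives the criminal found by the search.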

10. Difference between backward chaining and forward chaining

Difference between backward chaining and forward chaining

Following is the difference between the forward chaining and backward chaining:

o Forward chaining, as the name suggests, starts from the known facts and moves forward by applying inference rules to extract more data, and it continues until it reaches the goal, whereas backward chaining starts from the goal and moves backward by using inference rules to determine the facts that satisfy the goal.
o Forward chaining is called a data-driven inference technique, whereas backward chaining is called a goal-driven inference technique.
o Forward chaining is known as the bottom-up approach, whereas backward chaining is known as the top-down approach.
o Forward chaining uses a breadth-first search strategy, whereas backward chaining uses a depth-first search strategy.
o Forward and backward chaining both apply the Modus Ponens inference rule.
o Forward chaining can be used for tasks such as planning, design process monitoring, diagnosis, and classification, whereas backward chaining can be used for classification and diagnosis tasks.
o Forward chaining can be like an exhaustive search, whereas backward chaining tries to avoid unnecessary paths of reasoning.
o In forward chaining there can be various ASK questions from the knowledge base, whereas in backward chaining there can be fewer ASK questions.
o Forward chaining is slow as it checks all the rules, whereas backward chaining is fast as it checks only the few required rules.

S. No. | Forward Chaining | Backward Chaining
1. | Forward chaining starts from known facts and applies inference rules to extract more data until it reaches the goal. | Backward chaining starts from the goal and works backward through inference rules to find the required facts that support the goal.
2. | It is a bottom-up approach. | It is a top-down approach.
3. | Forward chaining is known as a data-driven inference technique as we reach the goal using the available data. | Backward chaining is known as a goal-driven technique as we start from the goal and divide it into sub-goals to extract the facts.
4. | Forward chaining reasoning applies a breadth-first search strategy. | Backward chaining reasoning applies a depth-first search strategy.
5. | Forward chaining tests all the available rules. | Backward chaining only tests the few required rules.
6. | Forward chaining is suitable for planning, monitoring, control, and interpretation applications. | Backward chaining is suitable for diagnostic, prescription, and debugging applications.
7. | Forward chaining can generate an infinite number of possible conclusions. | Backward chaining generates a finite number of possible conclusions.
8. | It operates in the forward direction. | It operates in the backward direction.
9. | Forward chaining is aimed at any conclusion. | Backward chaining is only aimed at the required data.

11. Resolution

Resolution in FOL

Resolution

Resolution is a theorem proving technique that proceeds by building refutation proofs, i.e., proofs by contradiction. It was invented by the mathematician John Alan Robinson in 1965.
Resolution is used when various statements are given and we need to prove a conclusion from those statements. Unification is a key concept in proofs by resolution. Resolution is a single inference rule which can efficiently operate on the conjunctive normal form or clausal form.

Clause: A disjunction of literals (a literal is an atomic sentence or its negation) is called a clause. A clause with a single literal is known as a unit clause.

Conjunctive Normal Form: A sentence represented as a conjunction of clauses is said to be conjunctive normal
form or CNF.

The resolution inference rule:

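The resolution rule for first-order logic is simply a lifted version of the propositional rule. Resolution can resolve two clauses if they contain complementary literals; the clauses are assumed to be standardized apart so that they share no variables.

$\frac{l_1 \lor \cdots \lor l_k,\quad m_1 \lor \cdots \lor m_n}{\mathrm{SUBST}(\theta,\ l_1 \lor \cdots \lor l_{i-1} \lor l_{i+1} \lor \cdots \lor l_k \lor m_1 \lor \cdots \lor m_{j-1} \lor m_{j+1} \lor \cdots \lor m_n)}$

Where li and mj are complementary literals, that is, UNIFY(li, ¬mj) = θ.

This rule is also called the binary resolution rule because it resolves exactly two literals.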

Example:

We can resolve two clauses which are given below:

[Animal(g(x)) ∨ Loves(f(x), x)] and [¬Loves(a, b) ∨ ¬Kills(a, b)]

Where the two complementary literals are: Loves(f(x), x) and ¬Loves(a, b)

These literals can be unified with the unifier θ = {a/f(x), b/x}, and it will generate a resolvent clause:

[Animal(g(x)) ∨ ¬Kills(f(x), x)].

Steps for Resolution:


1. Conversion of facts into first-order logic.
2. Convert FOL statements into CNF
3. Negate the statement which needs to prove (proof by contradiction)
4. Draw resolution graph (unification).

To better understand all the above steps, we will take an example in which we will apply resolution.

Example:
a. John likes all kinds of food.
b. Apple and vegetable are food.
c. Anything anyone eats and is not killed by is food.
d. Anil eats peanuts and is still alive.
e. Harry eats everything that Anil eats.
Prove by resolution that:
f. John likes peanuts.

Step-1: Conversion of Facts into FOL

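In the first step we will convert all the given statements into first-order logic:

a. ∀x food(x) → likes(John, x)
b. food(Apple) ∧ food(vegetables)
c. ∀x ∀y eats(x, y) ∧ ¬killed(x) → food(y)
d. eats(Anil, Peanuts) ∧ alive(Anil)
e. ∀x eats(Anil, x) → eats(Harry, x)
f. ∀x ¬killed(x) → alive(x)
g. ∀x alive(x) → ¬killed(x)
h. likes(John, Peanuts)

Statements f and g are added to relate alive and killed, and h is the conclusion to be proved.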

Step-2: Conversion of FOL into CNF

In first-order logic resolution, it is required to convert the FOL statements into CNF, as the CNF form makes resolution proofs easier.

o Eliminate all implications (→) and rewrite
a. ∀x ¬food(x) ∨ likes(John, x)
b. food(Apple) ∧ food(vegetables)
c. ∀x ∀y ¬[eats(x, y) ∧ ¬killed(x)] ∨ food(y)
d. eats(Anil, Peanuts) ∧ alive(Anil)
e. ∀x ¬eats(Anil, x) ∨ eats(Harry, x)
f. ∀x ¬[¬killed(x)] ∨ alive(x)
g. ∀x ¬alive(x) ∨ ¬killed(x)
h. likes(John, Peanuts).
o Move negation (¬) inwards and rewrite
a. ∀x ¬food(x) ∨ likes(John, x)
b. food(Apple) ∧ food(vegetables)
c. ∀x ∀y ¬eats(x, y) ∨ killed(x) ∨ food(y)
d. eats(Anil, Peanuts) ∧ alive(Anil)
e. ∀x ¬eats(Anil, x) ∨ eats(Harry, x)
f. ∀x killed(x) ∨ alive(x)
g. ∀x ¬alive(x) ∨ ¬killed(x)
h. likes(John, Peanuts).
o Rename variables or standardize variables
a. ∀x ¬food(x) ∨ likes(John, x)
b. food(Apple) ∧ food(vegetables)
c. ∀y ∀z ¬eats(y, z) ∨ killed(y) ∨ food(z)
d. eats(Anil, Peanuts) ∧ alive(Anil)
e. ∀w ¬eats(Anil, w) ∨ eats(Harry, w)
f. ∀g killed(g) ∨ alive(g)
g. ∀k ¬alive(k) ∨ ¬killed(k)
h. likes(John, Peanuts).

o Eliminate existential quantifiers (Skolemization).
In this step, we will eliminate the existential quantifier ∃, and this process is known as Skolemization. But in this example problem, since there is no existential quantifier, all the statements will remain the same in this step.
o Drop universal quantifiers.
In this step we will drop all universal quantifiers, since all remaining variables are implicitly universally quantified, so we don't need to write them.
a. ¬food(x) ∨ likes(John, x)
b. food(Apple)
c. food(vegetables)
d. ¬eats(y, z) ∨ killed(y) ∨ food(z)
e. eats(Anil, Peanuts)
f. alive(Anil)
g. ¬eats(Anil, w) ∨ eats(Harry, w)
h. killed(g) ∨ alive(g)
i. ¬alive(k) ∨ ¬killed(k)
j. likes(John, Peanuts).

Note: Statements "food(Apple) ∧ food(vegetables)" and "eats(Anil, Peanuts) ∧ alive(Anil)" can each be written as two separate statements.
o Distribute disjunction ∨ over conjunction ∧.
This step will not make any change in this problem.

Step-3: Negate the statement to be proved

In this step, we will apply negation to the conclusion statement, which will be written as ¬likes(John, Peanuts)

Step-4: Draw Resolution graph:

Now in this step, we will solve the problem by a resolution tree using substitution. For the above problem, it will be given as follows:
Hence the negation of the conclusion leads to a complete contradiction with the given set of statements, which proves the conclusion.

Explanation of Resolution graph:


o In the first step of the resolution graph, ¬likes(John, Peanuts) and likes(John, x) get resolved (canceled) by the substitution {Peanuts/x}, and we are left with ¬food(Peanuts).
o In the second step of the resolution graph, ¬food(Peanuts) and food(z) get resolved (canceled) by the substitution {Peanuts/z}, and we are left with ¬eats(y, Peanuts) ∨ killed(y).
o In the third step of the resolution graph, ¬eats(y, Peanuts) and eats(Anil, Peanuts) get resolved by the substitution {Anil/y}, and we are left with killed(Anil).
o In the fourth step of the resolution graph, killed(Anil) and ¬killed(k) get resolved by the substitution {Anil/k}, and we are left with ¬alive(Anil).
o In the last step of the resolution graph, ¬alive(Anil) and alive(Anil) get resolved, leaving the empty clause, which is the contradiction.
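
The five resolution steps can be replayed in code. The sketch below is an illustration under assumptions: a literal is a pair (sign, atom), an atom is a flat tuple such as ("likes", "John", "x") without function terms, lowercase arguments are variables, and the helpers unify_atoms, subst, and resolve are hypothetical names. It is a replay of the given chain of clauses, not a general prover with a search strategy.

```python
def is_variable(t):
    """Assumed convention: lowercase arguments are variables, capitalized ones constants."""
    return t[:1].islower()


def walk(t, theta):
    """Follow variable bindings in theta."""
    while is_variable(t) and t in theta:
        t = theta[t]
    return t


def unify_atoms(a, b):
    """Most general unifier of two flat atoms (predicate, arg, ...), or None."""
    if a[0] != b[0] or len(a) != len(b):
        return None
    theta = {}
    for s, t in zip(a[1:], b[1:]):
        s, t = walk(s, theta), walk(t, theta)
        if s == t:
            continue
        if is_variable(s):
            theta[s] = t
        elif is_variable(t):
            theta[t] = s
        else:
            return None
    return theta


def subst(theta, atom):
    """Apply theta to the arguments of an atom, keeping the predicate name."""
    return (atom[0],) + tuple(walk(t, theta) for t in atom[1:])


def resolve(c1, c2):
    """All binary resolvents of two clauses that are standardized apart."""
    resolvents = []
    for sign1, atom1 in c1:
        for sign2, atom2 in c2:
            if sign1 == sign2:
                continue  # complementary literals need opposite signs
            theta = unify_atoms(atom1, atom2)
            if theta is None:
                continue
            rest = (c1 - {(sign1, atom1)}) | (c2 - {(sign2, atom2)})
            resolvents.append(frozenset((s, subst(theta, a)) for s, a in rest))
    return resolvents


def clause(*literals):
    return frozenset(literals)


negated_goal = clause((False, ("likes", "John", "Peanuts")))
side_clauses = [
    clause((False, ("food", "x")), (True, ("likes", "John", "x"))),
    clause((False, ("eats", "y", "z")), (True, ("killed", "y")), (True, ("food", "z"))),
    clause((True, ("eats", "Anil", "Peanuts"))),
    clause((False, ("alive", "k")), (False, ("killed", "k"))),
    clause((True, ("alive", "Anil"))),
]

derived = []
current = negated_goal
for side in side_clauses:
    (current,) = resolve(current, side)  # each step here has exactly one resolvent
    derived.append(current)
    print(sorted(current))
```

The last resolvent is the empty clause (an empty frozenset), which is the contradiction that completes the refutation.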
AL3391 ARTIFICIAL INTELLIGENCE

UNIT V PROBABILISTIC REASONING


Acting under uncertainty – Bayesian inference – naïve Bayes models. Probabilistic reasoning –
Bayesian networks – exact inference in BN – approximate inference in BN – causal networks.

1. Acting under uncertainty

Uncertainty:

Till now, we have learned knowledge representation using first-order logic and propositional logic with certainty, which means we were sure about the predicates. With this knowledge representation, we might write A→B, which means if A is true then B is true. But consider a situation where we are not sure whether A is true or not; then we cannot express this statement. This situation is called uncertainty.

So to represent uncertain knowledge, where we are not sure about the predicates, we need uncertain reasoning or probabilistic reasoning.

Causes of uncertainty:

Following are some leading causes of uncertainty to occur in the real world.

1. Information obtained from unreliable sources.


2. Experimental Errors
3. Equipment fault
4. Temperature variation
5. Climate change.

2. Bayesian inference

Bayes' theorem in Artificial intelligence

Bayes' theorem:

Bayes' theorem is also known as Bayes' rule, Bayes' law, or Bayesian reasoning, which determines the probability
of an event with uncertain knowledge.

In probability theory, it relates the conditional probability and marginal probabilities of two random events.

Bayes' theorem was named after the British mathematician Thomas Bayes. The Bayesian inference is an
application of Bayes' theorem, which is fundamental to Bayesian statistics.

It is a way to calculate the value of P(B|A) with the knowledge of P(A|B).


Bayes' theorem allows updating the probability prediction of an event by observing new information of the real world.

Example: If cancer corresponds to one's age then by using Bayes' theorem, we can determine the probability of
cancer more accurately with the help of age.

Bayes' theorem can be derived using the product rule and the conditional probability of event A given event B.

From the product rule we can write:

P(A ⋀ B) = P(A|B) P(B)

Similarly, using the probability of event B given event A:

P(A ⋀ B) = P(B|A) P(A)

Equating the right-hand sides of both equations, we get:

P(A|B) = P(B|A) P(A) / P(B)        ... (a)

The above equation (a) is called Bayes' rule or Bayes' theorem. This equation is the basis of most modern AI
systems for probabilistic inference.

It shows the simple relationship between joint and conditional probabilities. Here,

P(A|B) is known as the posterior, which we need to calculate. It is read as the probability of hypothesis A
given that evidence B has occurred.

P(B|A) is called the likelihood: assuming that the hypothesis is true, it is the probability of the evidence.

P(A) is called the prior probability, the probability of the hypothesis before considering the evidence.

P(B) is called the marginal probability, the probability of the evidence alone.

In equation (a), in general, we can write P(B) = Σi P(Ai) P(B|Ai), hence Bayes' rule can be written as:

P(Ai|B) = P(Ai) P(B|Ai) / Σk P(Ak) P(B|Ak),   k = 1, ..., n

where A1, A2, A3, ..., An is a set of mutually exclusive and exhaustive events.
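The general form of Bayes' rule over a set of mutually exclusive and exhaustive events can be checked numerically. The following is a minimal Python sketch and not part of the original notes: the function name `posteriors` and the three-hypothesis numbers are made up purely for illustration.

```python
from fractions import Fraction

def posteriors(priors, likelihoods):
    """Return P(Ai|B) for every hypothesis Ai.

    priors[i]      = P(Ai), assumed mutually exclusive and exhaustive
    likelihoods[i] = P(B|Ai)
    """
    # Law of total probability: P(B) = sum_i P(Ai) * P(B|Ai)
    evidence = sum(p * l for p, l in zip(priors, likelihoods))
    return [p * l / evidence for p, l in zip(priors, likelihoods)]

# Hypothetical example: three machines produce 50%, 30% and 20% of all parts,
# with defect rates of 1%, 2% and 3%. B is the event "the part is defective".
priors = [Fraction(50, 100), Fraction(30, 100), Fraction(20, 100)]
likelihoods = [Fraction(1, 100), Fraction(2, 100), Fraction(3, 100)]

post = posteriors(priors, likelihoods)
for i, p in enumerate(post, start=1):
    print(f"P(A{i}|B) = {p}")
```

Exact fractions are used here so that the posteriors can be seen to sum to exactly 1; ordinary floats would work equally well.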

Applying Bayes' rule:

Bayes' rule allows us to compute the single term P(B|A) in terms of P(A|B), P(B), and P(A). This is very useful in
cases where we have good estimates of these three terms and want to determine the fourth one. Suppose we
perceive the effect of some unknown cause and want to compute the probability of that cause; then Bayes' rule becomes:

P(cause|effect) = P(effect|cause) P(cause) / P(effect)

Example-1:

Question: What is the probability that a patient with a stiff neck has meningitis?

Given Data:

A doctor is aware that meningitis causes a patient to have a stiff neck 80% of the time. He
is also aware of some more facts, which are given as follows:

o The known probability that a patient has meningitis is 1/30,000.
o The known probability that a patient has a stiff neck is 2%.

Let a be the proposition that the patient has a stiff neck and b be the proposition that the patient has meningitis,
so we can calculate the following:

P(a|b) = 0.8

P(b) = 1/30000

P(a) = 0.02

P(b|a) = P(a|b) P(b) / P(a) = (0.8 × 1/30000) / 0.02 = 0.00133

Hence, we can assume that about 1 in 750 patients with a stiff neck has meningitis.
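The arithmetic of this example can be reproduced with a few lines of Python. This is only an illustrative sketch: the helper name `bayes` is an assumption, and exact fractions are used so that the 1/750 figure appears without rounding.

```python
from fractions import Fraction

def bayes(likelihood, prior, evidence):
    """Bayes' rule: P(H|E) = P(E|H) * P(H) / P(E)."""
    return likelihood * prior / evidence

p_stiff_given_men = Fraction(8, 10)    # P(a|b): stiff neck given meningitis
p_men = Fraction(1, 30000)             # P(b):   prior probability of meningitis
p_stiff = Fraction(2, 100)             # P(a):   probability of a stiff neck

p_men_given_stiff = bayes(p_stiff_given_men, p_men, p_stiff)
print(p_men_given_stiff)               # 1/750
print(float(p_men_given_stiff))        # about 0.00133
```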

Example-2:

Question: From a standard deck of playing cards, a single card is drawn. The probability that the card is a king
is 4/52. Calculate the posterior probability P(King|Face), which is the probability that a drawn face card is a king.

Solution:

P(King): probability that the card is a king = 4/52 = 1/13

P(Face): probability that the card is a face card = 12/52 = 3/13

P(Face|King): probability that the card is a face card given that it is a king = 1

Putting all values into Bayes' rule, we get:

P(King|Face) = P(Face|King) P(King) / P(Face) = (1 × 1/13) / (3/13) = 1/3
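As a cross-check, the same posterior can be obtained by enumerating a 52-card deck and counting. The sketch below is illustrative only; the variable names are assumptions, and jack, queen and king are taken to be the face cards.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = list(product(ranks, suits))                 # 52 (rank, suit) pairs

faces = [card for card in deck if card[0] in ("J", "Q", "K")]
kings = [card for card in deck if card[0] == "K"]

p_king = Fraction(len(kings), len(deck))           # 4/52 = 1/13
p_face = Fraction(len(faces), len(deck))           # 12/52 = 3/13
p_face_given_king = Fraction(sum(card in faces for card in kings), len(kings))

# Bayes' rule: P(King|Face) = P(Face|King) * P(King) / P(Face)
p_king_given_face = p_face_given_king * p_king / p_face

# Direct count: fraction of face cards that are kings
direct = Fraction(sum(card in kings for card in faces), len(faces))

print(p_king_given_face, direct)                   # 1/3 1/3
```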


Application of Bayes' theorem in Artificial intelligence:

Following are some applications of Bayes' theorem:

o It is used to calculate the next step of the robot when the already executed step is given.
o Bayes' theorem is helpful in weather forecasting.
o It can solve the Monty Hall problem.

3. Probabilistic reasoning

Probabilistic reasoning:

Probabilistic reasoning is a way of knowledge representation where we apply the concept of probability to indicate
the uncertainty in knowledge. In probabilistic reasoning, we combine probability theory with logic to handle the
uncertainty.

We use probability in probabilistic reasoning because it provides a way to handle the uncertainty that is the result of
someone's laziness and ignorance.

In the real world, there are many scenarios where the certainty of something is not confirmed, such as "It will rain
today," "behavior of someone in some situation," or "a match between two teams or two players." These are
probable sentences: we can assume that they will happen, but we are not sure about it, so here we use probabilistic
reasoning.

Need of probabilistic reasoning in AI:

o When there are unpredictable outcomes.
o When the specifications or possibilities of predicates become too large to handle.
o When an unknown error occurs during an experiment.

In probabilistic reasoning, there are two ways to solve problems with uncertain knowledge:

o Bayes' rule
o Bayesian Statistics

As probabilistic reasoning uses probability and related terms, so before understanding probabilistic reasoning, let's
understand some common terms:

Probability: Probability can be defined as the chance that an uncertain event will occur. It is the numerical measure of
the likelihood that an event will occur. The value of a probability always lies between 0 and 1, and the two
endpoints represent complete certainty.

0 ≤ P(A) ≤ 1, where P(A) is the probability of an event A.

P(A) = 0 indicates that event A is certain not to occur.

P(A) = 1 indicates that event A is certain to occur.

When all outcomes are equally likely, we can find the probability of an uncertain event by using the formula below.

Probability of occurrence = Number of desired outcomes / Total number of outcomes

o P(¬A) = probability of event A not happening.
o P(¬A) + P(A) = 1.

Event: Each possible outcome of a variable is called an event.

Sample space: The collection of all possible events is called sample space.

Random variables: Random variables are used to represent the events and objects in the real world.

Prior probability: The prior probability of an event is probability computed before observing new information.

Posterior Probability: The probability that is calculated after all evidence or information has been taken into
account. It is a combination of prior probability and new information.

Conditional probability:

Conditional probability is the probability of an event occurring given that another event has already happened.

Suppose we want to calculate the probability of event A when event B has already occurred, "the probability of A
under the conditions of B". It can be written as:

P(A|B) = P(A⋀B) / P(B)

where P(A⋀B) = joint probability of A and B, and P(B) = marginal probability of B.

If the probability of A is given and we need to find the probability of B, then it will be given as:

P(B|A) = P(A⋀B) / P(A)

It can be explained using a Venn diagram: since B has occurred, the sample space is reduced to
set B, and we can calculate the probability of event A given that B has occurred by dividing
P(A⋀B) by P(B).

Example:

In a class, 70% of the students like English and 40% of the students like both English and
mathematics. What percentage of the students who like English also like mathematics?

Solution:

Let A be the event that a student likes mathematics, and B be the event that a student likes English.

P(A|B) = P(A⋀B) / P(B) = 0.4 / 0.7 = 0.57

Hence, 57% of the students who like English also like mathematics.
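The definition of conditional probability used in this example is a single division. A small Python sketch (the function name `conditional` is chosen here for illustration) makes the computation and its one precondition explicit:

```python
def conditional(p_joint, p_condition):
    """P(A|B) = P(A and B) / P(B); undefined when P(B) = 0."""
    if p_condition == 0:
        raise ValueError("cannot condition on an event of probability 0")
    return p_joint / p_condition

p_english = 0.7    # P(B): student likes English
p_both = 0.4       # P(A and B): student likes English and mathematics

p_math_given_english = conditional(p_both, p_english)
print(round(p_math_given_english, 2))   # 0.57
```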

4. Bayesian networks or Belief networks

Bayesian Belief Network in artificial intelligence

A Bayesian belief network is a key computer technology for dealing with probabilistic events and for solving
problems that involve uncertainty. We can define a Bayesian network as:

"A Bayesian network is a probabilistic graphical model which represents a set of variables and their conditional
dependencies using a directed acyclic graph."

It is also called a Bayes network, belief network, decision network, or Bayesian model.

Bayesian networks are probabilistic, because these networks are built from a probability distribution, and also
use probability theory for prediction and anomaly detection.

Real-world applications are probabilistic in nature, and to represent the relationship between multiple events, we
need a Bayesian network. It can also be used in various tasks including prediction, anomaly detection,
diagnostics, automated insight, reasoning, time series prediction, and decision making under uncertainty.

A Bayesian network can be used for building models from data and expert opinions, and it consists of two parts:

o Directed Acyclic Graph
o Table of conditional probabilities.

The generalized form of a Bayesian network that represents and solves decision problems under uncertain knowledge
is known as an influence diagram.

A Bayesian network graph is made up of nodes and Arcs (directed links), where:

o Each node corresponds to a random variable, and a variable can be continuous or discrete.
o Arcs or directed arrows represent the causal relationships or conditional probabilities between random
variables. These directed links or arrows connect pairs of nodes in the graph.
A link represents that one node directly influences the other node; if there is no directed link between
two nodes, neither of them directly influences the other.
o In the above diagram, A, B, C, and D are random variables represented by the nodes of
the network graph.
o If we are considering node B, which is connected with node A by a directed arrow, then
node A is called the parent of node B.
o Node C is independent of node A.

Note: The Bayesian network graph does not contain any cycles. Hence, it is known as a directed acyclic
graph or DAG.

The Bayesian network has mainly two components:

o Causal Component
o Actual numbers

Each node in the Bayesian network has a conditional probability distribution P(Xi | Parents(Xi)), which determines
the effect of the parents on that node.

A Bayesian network is based on the joint probability distribution and conditional probability. So let's first
understand the joint probability distribution:

Joint probability distribution:

If we have variables x1, x2, x3, ..., xn, then the probabilities of the different combinations of x1, x2, x3, ..., xn
are known as the joint probability distribution.

P[x1, x2, x3, ..., xn] can be expanded using the chain rule in the following way:

= P[x1 | x2, x3, ..., xn] P[x2, x3, ..., xn]

= P[x1 | x2, x3, ..., xn] P[x2 | x3, ..., xn] ... P[xn-1 | xn] P[xn].

In general, for each variable Xi, when the variables are ordered so that the parents of Xi come before it, we can
write the equation as:

P(Xi | Xi-1, ..., X1) = P(Xi | Parents(Xi))

Explanation of Bayesian network:

Let's understand the Bayesian network through an example by creating a directed acyclic graph:

Example: Harry installed a new burglar alarm at his home to detect burglary. The alarm responds reliably to
a burglary but also responds to minor earthquakes. Harry has two neighbors, David and Sophia, who have
taken responsibility for informing Harry at work when they hear the alarm. David always calls Harry when he hears
the alarm, but sometimes he confuses the phone ringing with the alarm and calls then too. Sophia, on the other
hand, likes to listen to loud music, so she sometimes fails to hear the alarm. Here we would like to compute the
probability of the burglar alarm sounding.

Problem:

Calculate the probability that the alarm has sounded, but neither a burglary nor an earthquake has
occurred, and David and Sophia have both called Harry.

Solution:

o The Bayesian network for the above problem is given below. The network structure shows that
Burglary and Earthquake are the parent nodes of Alarm and directly affect the probability of the alarm
going off, whereas David's and Sophia's calls depend only on the alarm.
o The network represents our assumptions that the neighbors do not directly perceive the burglary, do not
notice the minor earthquake, and do not confer before calling.
o The conditional distribution for each node is given as a conditional probability table, or CPT.
o Each row in the CPT must sum to 1 because the entries in a row represent an exhaustive set
of cases for the variable.
o In a CPT, a boolean variable with k boolean parents contains 2^k independently specifiable probabilities
(one per row). Hence, if there are two parents, the CPT will contain 4 probability values.

List of all events occurring in this network:

o Burglary (B)
o Earthquake(E)
o Alarm(A)
o David Calls(D)
o Sophia calls(S)

We can write the events of the problem statement in the form of the probability P[D, S, A, B, E], and rewrite
this probability statement using the joint probability distribution:

P[D, S, A, B, E] = P[D | S, A, B, E] . P[S, A, B, E]

= P[D | S, A, B, E] . P[S | A, B, E] . P[A, B, E]

= P[D | A] . P[S | A, B, E] . P[A, B, E]

= P[D | A] . P[S | A] . P[A | B, E] . P[B, E]

= P[D | A] . P[S | A] . P[A | B, E] . P[B | E] . P[E]

= P[D | A] . P[S | A] . P[A | B, E] . P[B] . P[E], since B and E are independent in this network.

Let's take the observed probabilities for the Burglary and Earthquake components:

P(B= True) = 0.002, which is the probability of burglary.

P(B= False) = 0.998, which is the probability of no burglary.

P(E= True) = 0.001, which is the probability of a minor earthquake.

P(E= False) = 0.999, which is the probability that an earthquake has not occurred.

We can provide the conditional probabilities as per the tables below:

Conditional probability table for Alarm A:

The conditional probability of Alarm A depends on Burglary and Earthquake:

B        E        P(A= True)    P(A= False)
True     True     0.94          0.06
True     False    0.95          0.05
False    True     0.31          0.69
False    False    0.001         0.999

Conditional probability table for David Calls:

The conditional probability that David calls depends on the probability of Alarm.

A        P(D= True)    P(D= False)
True     0.91          0.09
False    0.05          0.95

Conditional probability table for Sophia Calls:

The conditional probability that Sophia calls depends on its parent node "Alarm."

A        P(S= True)    P(S= False)
True     0.75          0.25
False    0.02          0.98

From the formula of joint distribution, we can write the problem statement in the form of probability distribution:

P(S, D, A, ¬B, ¬E) = P (S|A) *P (D|A)*P (A|¬B ^ ¬E) *P (¬B) *P (¬E).

= 0.75* 0.91* 0.001* 0.998*0.999

= 0.00068045.
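The product above can be reproduced by storing the tables as dictionaries and multiplying one entry per node. This is a minimal sketch rather than a full Bayesian network library: the names `P_B`, `P_A`, `bern` and `joint` are illustrative assumptions, and only the "True" column of each table is stored, with the "False" entry taken as 1 minus it.

```python
# Priors for the root nodes
P_B = {True: 0.002, False: 0.998}            # burglary
P_E = {True: 0.001, False: 0.999}            # earthquake

# P(A=True | B, E), P(D=True | A), P(S=True | A) from the tables above
P_A = {(True, True): 0.94, (True, False): 0.95,
       (False, True): 0.31, (False, False): 0.001}
P_D = {True: 0.91, False: 0.05}
P_S = {True: 0.75, False: 0.02}

def bern(p_true, value):
    """Probability of `value` for a boolean variable with P(True) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(d, s, a, b, e):
    """P(D=d, S=s, A=a, B=b, E=e) = P(d|a) P(s|a) P(a|b,e) P(b) P(e)."""
    return (bern(P_D[a], d) * bern(P_S[a], s) *
            bern(P_A[(b, e)], a) * P_B[b] * P_E[e])

# Alarm sounded, no burglary, no earthquake, both neighbors called
p = joint(d=True, s=True, a=True, b=False, e=False)
print(round(p, 8))                           # 0.00068045
```

Summing `joint` over all 32 assignments gives 1, which is a quick sanity check that the factorization defines a proper distribution.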

Hence, a Bayesian network can answer any query about the domain by using the joint distribution.

The semantics of Bayesian Network:

There are two ways to understand the semantics of the Bayesian network, which are given below:

1. To understand the network as the representation of the Joint probability distribution.

It is helpful to understand how to construct the network.

2. To understand the network as an encoding of a collection of conditional independence statements.

It is helpful in designing inference procedure.


5. Inference in Bayesian Networks
1. Exact inference
2. Approximate inference

1. Exact inference:

In exact inference, we analytically compute the conditional probability distribution over the variables of
interest.

But sometimes, that’s too hard to do, in which case we can use approximation techniques based on statistical
sampling

Given a Bayesian network, what questions might we want to ask?

• Conditional probability query: P(x | e)

• Maximum a posteriori probability: What value of x maximizes P(x|e) ?

General question: What's the whole probability distribution over variable X given evidence e, P(X | e)?

In our discrete probability situation, the only way to answer a MAP query is to compute the probability of x
given e for all possible values of x and see which one is greatest.

So, in general, we'd like to be able to compute a whole probability distribution over some variable or
variables X, given instantiations of a set of variables e.

Using the joint distribution

To answer any query involving a conjunction of variables, sum over the variables not involved in the query

Given the joint distribution over the variables, we can easily answer any question about the value of a single
variable by summing (or marginalizing) over the other variables.

So, in a domain with four variables, A, B, C, and D, the probability that variable D has value d is the sum over
all possible combinations of values of the other three variables of the joint probability of all four values:

P(d) = Σ_{A,B,C} P(a, b, c, d) = Σ_a Σ_b Σ_c P(a, b, c, d)

This is exactly the same as the procedure we went through in the last lecture, where to compute the probability of
cavity, we added up the probability of cavity and toothache and the probability of cavity and not toothache.

In general, we'll use the first notation, with a single summation indexed by a list of variable names, and a joint
probability expression that mentions values of those variables. But here we can see the completely written-out
definition, just so we all know what the shorthand is supposed to mean.

To compute a conditional probability, we reduce it to a ratio of conjunctive queries using the definition of
conditional probability, and then answer each of those queries by marginalizing out the variables not mentioned:

P(b | d) = P(b, d) / P(d) = Σ_{A,C} P(a, b, c, d) / Σ_{A,B,C} P(a, b, c, d)

In the numerator, here, you can see that we're only summing over variables A and C, because b and d are
instantiated in the query.
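Marginalization and the ratio form of a conditional query can be sketched directly on a full joint table. In the Python sketch below the joint distribution over four binary variables is entirely made up (weights 1 to 16, normalized), and the helper name `prob` is an assumption; the point is only to show the summing-out pattern.

```python
from fractions import Fraction
from itertools import product

VARS = ("A", "B", "C", "D")

# Hypothetical joint distribution: assignment number i gets weight i + 1
assignments = list(product([0, 1], repeat=4))
total = sum(range(1, 17))                                   # 136
joint = {vals: Fraction(i + 1, total) for i, vals in enumerate(assignments)}

def prob(**fixed):
    """Sum the joint over every variable that is not mentioned in `fixed`."""
    return sum(p for vals, p in joint.items()
               if all(vals[VARS.index(name)] == v for name, v in fixed.items()))

p_d = prob(D=1)                                  # marginal: sum over A, B, C
p_b_given_d = prob(B=1, D=1) / prob(D=1)         # ratio of two marginal sums

print(p_d, p_b_given_d)                          # 9/17 11/18
```

This brute-force approach touches every row of the joint table, which is why it stops being practical as the number of variables grows.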

We’re going to learn a general purpose algorithm for answering these joint queries fairly efficiently. We’ll start
by looking at a very simple case to build up our intuitions, then we’ll write down the algorithm, then we’ll
apply it to a more complex case.

Here's our very simple case. It's a Bayes net with four nodes, arranged in a chain: A → B → C → D.

So, we know from before that the probability that variable D has some value little d is the sum over A, B, and C
of the joint distribution, with d fixed:

P(d) = Σ_{A,B,C} P(a, b, c, d)

Now, using the chain rule of Bayesian networks, we can write down the joint probability as a product
over the nodes of the probability of each node's value given the values of its parents. So, in this case, we get
P(d|c) times P(c|b) times P(b|a) times P(a):

P(d) = Σ_{A,B,C} P(d|c) P(c|b) P(b|a) P(a)

This expression gives us a method for answering the query, given the conditional probabilities that are stored in
the net. And this method can be applied directly to any other bayes net. But there’s a problem with it: it requires
enumerating all possible combinations of assignments to A, B, and C, and then, for each one, multiplying the
factors for each node. That’s an enormous amount of work and we’d like to avoid it if at all possible.

So, we'll try rewriting the expression into something that might be more efficient to evaluate. First, we can
make our summation into three separate summations, one over each variable:

P(d) = Σ_C Σ_B Σ_A P(d|c) P(c|b) P(b|a) P(a)

Then, by distributivity of multiplication over addition, we can push the summations in, so that the sum over A
includes all the terms that mention A, but no others, and so on:

P(d) = Σ_C P(d|c) Σ_B P(c|b) Σ_A P(b|a) P(a)

It's pretty clear that this expression is the same as the previous one in value, but it can be evaluated more
efficiently. We're still, eventually, enumerating all assignments to the three variables, but we're doing
somewhat fewer multiplications than before. So this is still not completely satisfactory.

If you look, for a minute, at the terms inside the summation over A, you'll see that we're doing these
multiplications over again for each value of C, which isn't necessary, because they're independent of C. Our idea,
here, is to do the multiplications once and store them for later use. So, first, for each value of A and B, we can
compute the product P(b|a) P(a), generating a two dimensional matrix.

Then, we can sum over the rows of the matrix, yielding one value of the sum for each possible value of b.

We'll call this set of values, which depends on b, f1 of b:

f1(b) = Σ_A P(b|a) P(a)

Now, we can substitute f1 of b in for the sum over A in our previous expression. And, effectively, we can
remove node A from our diagram. Now, we express the contribution of b, which takes the contribution of a into
account, as f1 of b:

P(d) = Σ_C P(d|c) Σ_B P(c|b) f1(b)

We can continue the process in basically the same way. We can look at the summation over b and see that the
only other variable it involves is c. We can summarize those products as a set of factors, one for each value of c.
We'll call those factors f2 of c:

f2(c) = Σ_B P(c|b) f1(b)

We substitute f2 of c into the formula, remove node b from the diagram, and now we're down to a
simple expression in which d is known and we have to sum over values of c:

P(d) = Σ_C P(d|c) f2(c)
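The chain example can be tried out in a few lines of Python. The sketch below compares brute-force enumeration with the factor-by-factor computation (f1 of b, then f2 of c) on a chain A → B → C → D of binary variables. All CPT numbers and names are made up for illustration; they do not come from the notes.

```python
from itertools import product

# Hypothetical CPTs for a binary chain A -> B -> C -> D
P_A = [0.6, 0.4]                        # P_A[a]      = P(a)
P_B_A = [[0.7, 0.3], [0.2, 0.8]]        # P_B_A[a][b] = P(b|a)
P_C_B = [[0.9, 0.1], [0.5, 0.5]]        # P_C_B[b][c] = P(c|b)
P_D_C = [[0.8, 0.2], [0.3, 0.7]]        # P_D_C[c][d] = P(d|c)

def brute_force(d):
    """Enumerate every assignment to A, B, C and add up the joint."""
    return sum(P_A[a] * P_B_A[a][b] * P_C_B[b][c] * P_D_C[c][d]
               for a, b, c in product(range(2), repeat=3))

def eliminate(d):
    """Push the sums inward: eliminate A, then B, then sum over C."""
    f1 = [sum(P_B_A[a][b] * P_A[a] for a in range(2)) for b in range(2)]
    f2 = [sum(P_C_B[b][c] * f1[b] for b in range(2)) for c in range(2)]
    return sum(P_D_C[c][d] * f2[c] for c in range(2))

for d in range(2):
    print(d, brute_force(d), eliminate(d))
```

Both functions return the same value for each d (0.65 and 0.35 with these numbers, up to floating-point rounding), but `eliminate` never builds a term that mentions more than two variables at once.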

Variable Elimination Algorithm

Given a Bayesian network and an elimination order X1, ..., Xm for the non-query variables, compute

Σ_X1 Σ_X2 ... Σ_Xm Π_j P(xj | Parents(xj))

For i = m downto 1

• remove all the factors that mention Xi
• multiply those factors, getting a value for each combination of mentioned variables
• sum over Xi
• put this new factor into the factor set

That was a simple special case. Now we can look at the algorithm in the general case. Let’s assume that we’re
given a Bayesian network and an ordering on the variables that aren’t fixed in the query. We’ll come back later
to the question of the influence of the order, and how we might find a good one.

We can express the probability of the query variables as a sum over each value of each of the non- query
variables of a product over each node in the network, of the probability that that variable has the given value
given the values of its parents.

So, we’ll eliminate the variables from the inside out. Starting with variable Xm and finishing with
variable X1.

To eliminate variable Xi, we start by gathering up all of the factors that mention Xi, and removing
them from our set of factors. Let’s say there are k such factors.

Now, we make a k+1 dimensional table, indexed by Xi as well as each of the other variables that is mentioned
in our set of factors.

We then sum the table over the Xi dimension, resulting in a k-dimensional table.

This table is our new factor, and we put a term for it back into our set of factors.

Once we've eliminated all the summations, we have the desired value.
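The four steps of the algorithm (remove, multiply, sum out, put back) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: every variable is binary, a factor is represented as a pair `(variables, table)`, the function names are invented here, and `order` lists the variables in the sequence in which they are eliminated. The example network and its numbers are hypothetical.

```python
from itertools import product

def multiply(f, g):
    """Pointwise product of two factors over the union of their variables."""
    (fv, ft), (gv, gt) = f, g
    out_vars = fv + tuple(v for v in gv if v not in fv)
    table = {}
    for values in product([0, 1], repeat=len(out_vars)):
        row = dict(zip(out_vars, values))
        table[values] = (ft[tuple(row[v] for v in fv)] *
                         gt[tuple(row[v] for v in gv)])
    return out_vars, table

def sum_out(var, f):
    """Sum a factor over one of its variables."""
    fv, ft = f
    keep = tuple(v for v in fv if v != var)
    table = {}
    for values, p in ft.items():
        key = tuple(val for v, val in zip(fv, values) if v != var)
        table[key] = table.get(key, 0.0) + p
    return keep, table

def variable_elimination(factors, order):
    """Eliminate the variables in `order`; return the factor that remains."""
    factors = list(factors)
    for var in order:
        mentioned = [f for f in factors if var in f[0]]
        if not mentioned:
            continue
        factors = [f for f in factors if var not in f[0]]   # remove
        prod = mentioned[0]
        for f in mentioned[1:]:                             # multiply
            prod = multiply(prod, f)
        factors.append(sum_out(var, prod))                  # sum, put back
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    return result

# Hypothetical binary chain A -> B -> C -> D, one factor per CPT
factors = [
    (("A",), {(0,): 0.6, (1,): 0.4}),
    (("A", "B"), {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.8}),
    (("B", "C"), {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.5, (1, 1): 0.5}),
    (("C", "D"), {(0, 0): 0.8, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.7}),
]
result_vars, result_table = variable_elimination(factors, order=["A", "B", "C"])
print(result_vars, result_table)
```

Because D is not eliminated and not instantiated, the remaining factor is the whole marginal distribution over D rather than a single number.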

One more example


Here’s a more complicated example, to illustrate the variable elimination algorithm in a more general case. We
have this big network that encodes a domain for diagnosing lung disease. (Dyspnea, as I understand it, is
shortness of breath).

We’ll do variable elimination on this graph using elimination order A, B, L, T, S, X, V.

So, we start by eliminating V. We gather the two terms that mention V and see that they also involve variable
T. So, we compute the product for each value of T, and summarize those in the factor f1 of T.

Now we can substitute that factor in for the summation, and remove the node from the network.

The next variable to be eliminated is X. There is actually only one term involving X, and it also involves
variable A. So, for each value of A, we compute the sum over X of P(x|a). But wait! We know what this
value is! If we fix a and sum over x, these probabilities have to add up to 1.

So, rather than adding another factor to our expression, we can just remove the whole sum. In general, the only
nodes that will have an influence on the probability of D are its ancestors.

Now, it’s time to eliminate S. We find that there are three terms involving S, and we gather them into the sum.
These three terms involve two other variables, B and L. So we have to make a factor that specifies, for each
value of B and L, the value of the sum of products.

We’ll call that factor f_2 of b and l.

Now we can substitute that factor back into our expression. We can also eliminate node S. But in eliminating S,
we've added a direct dependency between L and B (they used to be dependent via S, but now the dependency is
encoded explicitly in f2(b, l)). We'll show that in the graph by drawing a line between the two nodes. It's not
exactly a standard directed conditional dependence, but it's still useful to show that they're coupled.

Now we eliminate T. It involves two terms, which themselves involve variables A and L. So we make a new
factor f3 of A and L.

We can substitute in that factor and eliminate T. We’re getting close!


Next we eliminate L. It involves these two factors, which depend on variables A and B. So we make a new
factor, f4 of A and B, and substitute it in. We remove node L, but couple A and B.

At this point, we could just do the summations over A and B and be done. But to finish out the
algorithm the way a computer would, it’s time to eliminate variable B.

It involves both of our remaining terms, and it seems to depend on variables A and D. However, in this case,
we’re interested in the probability of a particular value, little d of D, and so the variable d is instantiated. Thus,
we can treat it as a constant in this expression, and we only need to generate a factor over a, which we’ll call f5
of a. And we can now, in some sense, remove D from our network as well (because we’ve already factored it
into our answer).

Finally, to get the probability that variable D has value little d, we simply sum factor f5 over all values of a.
Yay! We did it.

Properties of Variable Elimination

Let's see how the variable elimination algorithm performs, both in theory and in practice.

• Time is exponential in size of largest factor
• Bad elimination order can generate huge factors
• NP Hard to find the best elimination order
• Even the best elimination order may generate large factors
• There are reasonable heuristics for picking an elimination order (such as choosing the variable that
results in the smallest next factor)
• Inference in polytrees (nets with no cycles) is linear in size of the network (the largest CPT)
• Many problems with very large nets have only small factors, and thus efficient inference

First of all, it’s pretty easy to see that it runs in time exponential in the number of variables involved in the largest
factor. Creating a factor with k variables involves making a k+1 dimensional table. If you have b values per variable,
that’s a table of size b^(k+1). To make each entry, you have to multiply at most n numbers, where n is the number of
nodes. We have to do this for each variable to be eliminated (which is usually close to n). So we have something like
time = O(n^2 b^k).

How big the factors are depends on the elimination order. You’ll see in one of the recitation exercises just how
dramatic the difference in factor sizes can be. A bad elimination order can generate huge factors.

So, we’d like to use the elimination order that generates the smallest factors. Unfortunately, it turns out to be NP
hard to find the best elimination order.

At least, there are some fairly reasonable heuristics for choosing an elimination order. It’s usually done dynamically.
So, rather than fixing the elimination order in advance, as we suggested in the algorithm description, you can pick
the next variable to be eliminated depending on the situation. In particular, one reasonable heuristic is to pick the
variable to eliminate next that will result in the smallest factor. This greedy approach won’t always be optimal, but
it’s not usually too bad.
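The greedy heuristic described here only needs to know which variables each factor mentions, not the numbers inside it. The Python sketch below works on these "scopes" alone. The function name `greedy_order`, the alphabetical tie-breaking rule, and the small star-shaped example are all assumptions made for illustration.

```python
def greedy_order(scopes, to_eliminate):
    """Pick, at each step, the variable whose elimination creates the
    smallest new factor (measured by the number of variables it mentions)."""
    scopes = [frozenset(s) for s in scopes]
    remaining = set(to_eliminate)
    order = []
    while remaining:
        def new_scope(var):
            # Scope of the factor produced by multiplying every factor
            # that mentions `var` and then summing `var` out.
            merged = frozenset().union(*[s for s in scopes if var in s])
            return merged - {var}
        # sorted() makes ties break alphabetically, so the result is repeatable
        best = min(sorted(remaining), key=lambda v: len(new_scope(v)))
        created = new_scope(best)
        scopes = [s for s in scopes if best not in s] + [created]
        order.append(best)
        remaining.remove(best)
    return order

# Hypothetical network: H is the parent of C1, C2 and C3; the query is C1.
# Factor scopes for P(H), P(C1|H), P(C2|H), P(C3|H):
scopes = [{"H"}, {"H", "C1"}, {"H", "C2"}, {"H", "C3"}]
print(greedy_order(scopes, to_eliminate=["H", "C2", "C3"]))
# ['C2', 'C3', 'H']
```

Eliminating H first would create a factor over C1, C2 and C3 together, while the greedy order never creates a factor over more than one variable in this example.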

There is one case where Bayes net inference in general, and the variable elimination algorithm in particular, is fairly
efficient, and that's when the network is a polytree. A polytree is a network with no undirected cycles. That is, a
network in which, for any two nodes, there is at most one path between them when the direction of the arcs is ignored.
In a polytree, inference is linear in the size of the network, where the size of the network is defined to be the
size of the largest conditional probability table (or exponential in the maximum number of parents of any node).
In a polytree, the optimal elimination order is to start at the root nodes, and work downwards, always eliminating
a variable that no longer has any parents. In doing so, we never introduce additional connections into the network.

So, inference in polytrees is efficient, and even in many large non-polytree networks, it’s possible to
keep the factors small, and therefore to do inference relatively efficiently.

When the network is such that the factors are, of necessity, large, we’ll have to turn to a different class
of methods.

2. Approximate inference:

Sampling

To get approximate answer we can do stochastic simulation (sampling).


Another strategy, which is a theme that comes up also more and more in AI actually, is to say, well, we didn't really
want the right answer anyway. Let's try to do an approximation. And you can also show that it's computationally
hard to get an approximation that's within epsilon of the answer that you want, but again that doesn't keep us from
trying.

So, the other thing that we can do is the stochastic simulation or sampling. In sampling, what we do is we look at
the root nodes of our graph, and attached to this root node is some probability that A is going to be true, right?
Maybe it's .4. So we flip a coin that comes up heads with probability .4 and see if we get true or false.

We flip our coin, let's say, and we get true for A -- this time. And now, given the assignment of true to A, we look in
the conditional probability table for B given A = true, and that gives us a probability for B.

Now, we flip a coin with that probability. Say we get False. We enter that into the table.

We do the same thing for C, and let’s say we get True.

Now, we look in the CPT for D given B and C, for the case where B is false and C is true, and we flip a coin with
that probability, in order to get a value for D.

So, there's one sample from the joint distribution of these four variables. And you can just keep doing this, all day
and all night, and generate a big pile of samples, using that algorithm. And now you can ask various questions.

Estimate:

P*(D|A) = #D,A / #A

Let's say you want to know the probability of D given A. How would you answer -- given all the examples -- what
would you do to compute the probability of D given A? You would just count. You'd count the number of cases in
which A and D were true, and you'd divide that by the number of cases in which A was true, and that would give
you an unbiased estimate of the probability of D given A. The more samples, the more confidence you'd have that
the estimated probability is close to the true one.
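The sampling-and-counting procedure can be sketched in Python as follows. The network shape (A is the parent of B and C, which are the parents of D) and every CPT number other than P(A) = 0.4 are assumptions made for this illustration, as are the function names.

```python
import random

# Hypothetical CPTs; each entry is the probability that the variable is True
P_A = 0.4
P_B = {True: 0.7, False: 0.2}                      # P(B=True | A)
P_C = {True: 0.6, False: 0.3}                      # P(C=True | A)
P_D = {(True, True): 0.9, (True, False): 0.6,      # P(D=True | B, C)
       (False, True): 0.5, (False, False): 0.1}

def sample(rng):
    """Draw one sample, visiting each node after its parents."""
    a = rng.random() < P_A
    b = rng.random() < P_B[a]
    c = rng.random() < P_C[a]
    d = rng.random() < P_D[(b, c)]
    return a, b, c, d

def estimate_d_given_a(n, seed=0):
    """Estimate P(D=True | A=True) as #(D and A) / #(A) over n samples."""
    rng = random.Random(seed)
    n_a = n_da = 0
    for _ in range(n):
        a, _b, _c, d = sample(rng)
        if a:
            n_a += 1
            n_da += d
    return n_da / n_a

print(estimate_d_given_a(100_000))
```

For these made-up numbers the exact value is 0.648, so the printed estimate should land close to it, and it gets closer as n grows.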

Estimation

• Some probabilities are easier than others to estimate
• In generating the table, the rare events will not be well represented
• P(Disease | spots-on-your-tongue, sore toe)
• If spots-on-your-tongue and sore toe are not root nodes, you would generate a huge table but the cases of
interest would be very sparse in the table
• Importance sampling lets you focus on the set of cases that are important to answering your question

It's going to turn out that some probabilities are easier than other ones to estimate.

Exactly because of the process we’re using to generate the samples, the majority of them will be the typical cases.
Oh, it's someone with a cold, someone with a cold, someone with a cold, someone with a cold, someone with a cold,
someone with malaria, someone with a cold, someone with a cold. So the rare results are not going to come up very
often. And so doing this sampling naively can make it really hard to estimate the probability of a rare event. If it's
something that happens one in ten thousand times, well, you know for sure you're going to need, some number of
tens of thousands of samples to get even a reasonable estimate of that probability.

Imagine that you want to estimate the probability of some disease given -- oh, I don't know -- spots on your tongue
and a sore toe. Somebody walks in and they have a really peculiar set of symptoms, and you want to know what's
the probability that they have some disease.

Well, if the symptoms are root nodes, it's easy. If the symptoms were root nodes, you could just assign the root
nodes to have their observed values and then simulate the rest of the network as before.

But if the symptoms aren't root nodes then if you do naïve sampling, you would generate a giant table of samples,
and you'd have to go and look and say, gosh, how many cases do I have where somebody has spots on their tongue
and a sore toe; and the answer would be, well, maybe zero or not very many.

There’s a technique called importance sampling, which allows you to draw examples from a distribution that’s going
to be more helpful and then reweight them so that you can still get an unbiased estimate of the desired conditional
probability. It’s a bit beyond the scope of this class to get into the details, but it’s an important and effective idea.
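One simple member of the importance sampling family is likelihood weighting: evidence variables are fixed to their observed values instead of being sampled, and each sample is weighted by the probability of that evidence given its sampled parents. The Python sketch below is only an illustration of the idea. The four-node network, its CPT numbers, and the function name are all assumptions, and the query estimated is P(A=True | D=True), where D is not a root node.

```python
import random

# Hypothetical CPTs; each entry is the probability that the variable is True
P_A = 0.4
P_B = {True: 0.7, False: 0.2}                      # P(B=True | A)
P_C = {True: 0.6, False: 0.3}                      # P(C=True | A)
P_D = {(True, True): 0.9, (True, False): 0.6,      # P(D=True | B, C)
       (False, True): 0.5, (False, False): 0.1}

def likelihood_weighting(n, seed=0):
    """Estimate P(A=True | D=True) from n weighted samples."""
    rng = random.Random(seed)
    weight_total = weight_a = 0.0
    for _ in range(n):
        # Sample the non-evidence variables from their CPTs as usual
        a = rng.random() < P_A
        b = rng.random() < P_B[a]
        c = rng.random() < P_C[a]
        # D is evidence: do not sample it, weight by P(D=True | b, c) instead
        w = P_D[(b, c)]
        weight_total += w
        if a:
            weight_a += w
    return weight_a / weight_total

print(likelihood_weighting(100_000))
```

Every sample contributes to the estimate, so none are thrown away for disagreeing with the evidence. For these made-up numbers the exact answer is 0.2592 / 0.4476, which is about 0.579.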

Recitation Problem

• Do the variable elimination algorithm on the net below using the elimination order A,B,C (that is, eliminate
node C first). In computing P(D=d), what factors do you get?

• What if you wanted to compute the whole marginal distribution P(D)?


Here’s the network we started with. We used elimination order C, B, A (we eliminated A first). Now we’re going to
explore what happens when we eliminate the variables in the opposite order. First, work on the case we did, where
we’re trying to calculate the probability that node D takes on a particular value, little d. Remember that little d is a
constant in this case. Now, do the case where we’re trying to find the whole distribution over D, so we don’t know a
particular value for little d.

Another Recitation Problem

Find an elimination order that keeps the factors small for the net below, or show that there is no such order.

Here’s a pretty complicated graph. But notice that no node has more than 2 parents, so none of the CPTs are huge.
The question is, is this graph hard for variable elimination? More concretely, can you find an elimination order that
results only in fairly small factors? Is there an elimination order that generates a huge factor?

The Last Recitation Problem

Bayesian networks (or related models) are often used in computer vision, but they almost always require sampling.
What happens when you try to do variable elimination on a model like the grid below?

6. Causal Networks:

A causal network is an acyclic digraph arising from an evolution of a substitution system, and representing its
history. The illustration above shows a causal network corresponding to the rules
(applied in a left-to-right scan) and initial condition .

The figure above shows the procedure for diagrammatically creating a causal network from a mobile
automaton.

In an evolution of a multiway system, each substitution event is a vertex in a causal network. Two events which are
related by causal dependence, meaning one occurs just before the other, have an edge between the corresponding
vertices in the causal network. More precisely, the edge is a directed edge leading from the past event to the future
event.
Some causal networks are independent of the choice of evolution, and these are called causally invariant.
