
UNIT 3

Artificial Intelligence
and
Machine Learning
KNOWLEDGE REPRESENTATION
ONTOLOGICAL ENGINEERING
 How to create representations for domains such as Internet shopping and car driving, concentrating on general concepts—such as Events, Time, Physical Objects, and Beliefs—that occur in many different domains.
 Representing these abstract concepts is sometimes
called ontological engineering.
What is an Ontology – A General Description
 The general framework of concepts is called an upper ontology
because of the convention of drawing graphs with the general
concepts at the top and the more specific concepts below them.
 Two major characteristics of general-purpose ontologies distinguish them from collections of special-purpose ontologies:
 A general-purpose ontology should be applicable in more or less any special-purpose domain (with the addition of domain-specific axioms).
 In any sufficiently demanding domain, different areas of knowledge must be unified, because reasoning and problem solving could involve several areas simultaneously.
Categories and Objects
 Organization of objects into categories
 Much reasoning takes place at the level of categories.
 There are two choices for representing categories in first-order logic: predicates and objects.
Notations
We can use the predicate Basketball(b), or we can reify the category as an object, Basketballs.
We could then say Member(b, Basketballs), which we will abbreviate as b ∈ Basketballs, to say that b is a member of the category of basketballs.
We say Subset(Basketballs, Balls), abbreviated as Basketballs ⊂ Balls, to say that Basketballs is a subcategory of Balls.
We will use subcategory, subclass, and subset interchangeably.
 Inheritance (Food → Fruits → Apple)
– Apple inherits the property of edibility through the subset chain.
 Subclass relations organize categories into a taxonomy,
or taxonomic hierarchy.
 First-order logic makes it easy to state facts about
categories
 An object is a member of a category.
 BB9 ∈ Basketballs
 A category is a subclass of another category.
 Basketballs ⊂ Balls
 All members of a category have some properties.
 (x ∈ Basketballs) ⇒ Spherical(x)
 Members of a category can be recognized by some properties.
 Orange(x) ∧ Round(x) ∧ Diameter(x) = 9.5 ∧ x ∈ Balls ⇒ x ∈ Basketballs
 A category as a whole has some properties.
 Dogs ∈ DomesticatedSpecies
 Two or more categories are disjoint if they have no
members in common.
 And even if we know that males and females are
disjoint, we will not know that an animal that is not
a male must be a female, unless we say that males
and females constitute an exhaustive
decomposition of the animals.
 A disjoint exhaustive decomposition is known as a
partition.
Categories can also be defined by providing necessary and sufficient conditions for membership.
For example, a bachelor is an unmarried adult male:
x ∈ Bachelors ⇔ Unmarried(x) ∧ x ∈ Adults ∧ x ∈ Males
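These relations are easy to prototype in code. The following is a minimal Python sketch, with names of our own invention (Taxonomy, is_member, and the sample data are illustrative, not from the text), showing how a membership fact plus subset links yields inherited membership:

# Minimal sketch of a category taxonomy: membership facts plus
# transitive Subset links. Names and data are illustrative only.

class Taxonomy:
    def __init__(self):
        self.parents = {}   # category -> set of supercategories
        self.members = {}   # category -> set of member objects

    def subset(self, sub, sup):
        self.parents.setdefault(sub, set()).add(sup)

    def member(self, obj, cat):
        self.members.setdefault(cat, set()).add(obj)

    def supercategories(self, cat):
        # All categories reachable by following Subset links.
        seen, stack = set(), [cat]
        while stack:
            for p in self.parents.get(stack.pop(), ()):
                if p not in seen:
                    seen.add(p)
                    stack.append(p)
        return seen

    def is_member(self, obj, cat):
        # b ∈ Basketballs and Basketballs ⊂ Balls imply b ∈ Balls.
        return any(obj in objs and (c == cat or cat in self.supercategories(c))
                   for c, objs in self.members.items())

t = Taxonomy()
t.subset("Basketballs", "Balls")
t.member("BB9", "Basketballs")
print(t.is_member("BB9", "Balls"))   # True, by inheritance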
Physical composition
 The idea that one object can be part of another
 Objects can be grouped into PartOf hierarchies, reminiscent of the Subset hierarchy:
▪ PartOf (Bucharest , Romania)
▪ PartOf (Romania, EasternEurope)
▪ PartOf (EasternEurope, Europe)
▪ PartOf (Europe, Earth) .
 The PartOf relation is transitive and reflexive; that is,
PartOf(x, y) ∧ PartOf(y, z) ⇒ PartOf(x, z) .
 Therefore, we can conclude PartOf(Bucharest, Earth).
 An object is composed of the parts in its
PartPartition and can be viewed as deriving some
properties from those parts.
 For example, the mass of a composite object is the
sum of the masses of the parts.
 Consider "The apples in this bag weigh two pounds."
 BUNCH: For example, if the apples are Apple1, Apple2, and Apple3, then
BunchOf({Apple1, Apple2, Apple3})
 denotes the composite object with the three apples as parts (not elements).
 We can define BunchOf in terms of the PartOf relation. Obviously, each element of s is part of BunchOf(s):
 ∀x x ∈ s ⇒ PartOf(x, BunchOf(s)) .
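As a quick illustration, the transitive PartOf relation can be queried in a few lines of Python; the pairs and the part_of function below are our own sketch, not standard code:

# Sketch: PartOf as a set of pairs, queried transitively (and
# reflexively, since PartOf(x, x) holds).

PART_OF = {
    ("Bucharest", "Romania"),
    ("Romania", "EasternEurope"),
    ("EasternEurope", "Europe"),
    ("Europe", "Earth"),
}

def part_of(x, y):
    if x == y:                        # PartOf is reflexive
        return True
    return any(a == x and part_of(b, y) for a, b in PART_OF)

print(part_of("Bucharest", "Earth"))  # True, by transitivity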
Measurements
 The values that we assign for these properties are
called measures
 We represent the length with a units function
that takes a number as argument.
 If the line segment is called L1, we can write
▪ Length(L1) = Inches(1.5) = Centimeters(3.81) .
 Conversion between units is done by equating multiples of one unit to another:
 Centimeters(2.54 × d) = Inches(d) .
 Although measures are not numbers, we can still compare them, using an ordering symbol such as >.
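A small sketch of measures in code, under the assumption (ours) that a measure is a (magnitude, unit) pair: the conversion table encodes Centimeters(2.54 × d) = Inches(d), and comparison works even though measures are not numbers:

# Sketch: measures as (magnitude, unit) pairs. The Measure class
# and unit table are illustrative assumptions.

TO_CM = {"cm": 1.0, "in": 2.54}       # Centimeters(2.54 * d) = Inches(d)

class Measure:
    def __init__(self, value, unit):
        self.value, self.unit = value, unit

    def in_cm(self):
        return self.value * TO_CM[self.unit]

    def __lt__(self, other):          # measures are ordered ...
        return self.in_cm() < other.in_cm()

    def __eq__(self, other):          # ... and comparable for equality
        return abs(self.in_cm() - other.in_cm()) < 1e-9

print(Measure(1.5, "in") == Measure(3.81, "cm"))  # True
print(Measure(1.0, "in") < Measure(3.0, "cm"))    # True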
Objects: Things and stuff
 The real world can be seen as consisting of primitive
objects (e.g., atomic particles) and composite objects
built from them.
 There is, however, a significant portion of reality that
seems to defy any obvious individuation—division
into distinct objects
 We give this portion the generic name stuff.
 English distinguishes count nouns, such as aardvarks, holes, and theorems, from mass nouns, such as butter, water, and energy.
 Intrinsic properties belong to the very substance of the object, rather than to the object as a whole.
 When you cut an instance of stuff in half, the two
pieces retain the intrinsic properties—things like
density, boiling point, flavor, color, ownership, and
so on.
 Extrinsic properties—weight, length, shape, and
so on—are not retained under subdivision.
EVENTS
 Event calculus: a logic-based formalism for
representing actions and their effects.
 The fluent At(Shankar, Berkeley) is an object that refers to the fact of Shankar being in Berkeley, but does not by itself say anything about whether it is true.
 To assert that a fluent is actually true at some point in time we use the predicate T, as in T(At(Shankar, Berkeley), t).
 Note: a fluent is a condition that can change over time.
 Events are described as instances of event categories.
 The event E1 of Shankar flying from San Francisco to
Washington, D.C. is described as
E1 ∈ Flyings ∧ Flyer(E1, Shankar) ∧ Origin(E1, SF) ∧ Destination(E1, DC) .
 Using a three-argument version of the category of flying events, the same event can be written more compactly:
 E1 ∈ Flyings(Shankar, SF, DC) .
Set of Predicates used
 We assume a distinguished event, Start , that describes the initial
state by saying which fluents are initiated or terminated at the
start time.
 We define T by saying that a fluent holds at a point in time if the
fluent was initiated by an event at some time in the past and was
not made false (clipped) by an intervening event.
 A fluent does not hold if it was terminated by an event and not
made true (restored) by another event.
 Formally, the axiom is:
 Happens(e, (t1, t2)) ∧ Initiates(e, f, t1) ∧ ¬Clipped(f, (t1, t)) ∧ t1 < t ⇒ T(f, t)
Example
 For example, we can say that the only way a wumpus-world agent gets an arrow is at the start, and the only way to use up an arrow is to shoot it:
 Initiates(e, HaveArrow(a), t) ⇔ e = Start
 Terminates(e, HaveArrow(a), t) ⇔ e ∈ Shootings(a)
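To make the T predicate concrete, here is a small illustrative sketch (the timeline encoding is our own assumption, not a standard event-calculus implementation): a fluent holds at t if an earlier event initiated it and no intervening event clipped it:

# Sketch: T(f, t) over a list of (time, initiated, terminated) events.

events = [
    (0, {"HaveArrow"}, set()),    # the Start event initiates HaveArrow
    (5, set(), {"HaveArrow"}),    # a Shooting event terminates it
]

def T(fluent, t):
    holds = False
    for time, initiated, terminated in sorted(events):
        if time >= t:
            break
        if fluent in initiated:      # Initiates(e, f, t1), t1 < t
            holds = True
        if fluent in terminated:     # Clipped(f, (t1, t))
            holds = False
    return holds

print(T("HaveArrow", 3))  # True: initiated at Start, not yet clipped
print(T("HaveArrow", 7))  # False: terminated by the Shooting at t = 5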
Processes
 So far, we have seen events that are discrete: they have a definite structure.
 E.g., a trip has a beginning, middle, and end.
 If we take a small interval of Shankar’s flight, say, the third 20-
minute segment (while he waits anxiously for a bag of peanuts),
that event is still a member of Flyings.
 In fact, this is true for any subinterval.
 Categories of events with this property are called process
categories or liquid event categories.
 Any process e that happens over an interval also happens over
any subinterval:
 (e ∈ Processes) ∧ Happens(e, (t1, t4)) ∧ (t1 < t2 < t3 < t4) ⇒ Happens(e,
(t2, t3)) .
Time intervals
 Two kinds of time intervals: moments and extended intervals.
 The distinction is that only moments have zero duration:
 Partition({Moments, ExtendedIntervals}, Intervals)
 i ∈ Moments ⇔ Duration(i) = Seconds(0) .
Fluents and objects
 Physical objects can be viewed as generalized events, in the sense that a
physical object is a chunk of space–time.
 We can describe the changing properties of the USA using state fluents, such as Population(USA), if the USA is considered as an event that began in 1776.
 A property of the USA that changes every four or eight years, barring
mishaps, is its president.
 One might propose that President(USA) is a logical term that denotes a different object at different times.
 Instead, President(USA) denotes a single object that consists of different people at different times: it is the object that is George Washington from 1789 to 1797, John Adams from 1797 to 1801, and so on.
 To say that George Washington was president throughout 1790, we can write T(Equals(President(USA), GeorgeWashington), AD1790) .
MENTAL EVENTS AND MENTAL OBJECTS
 What we need is a model of the mental objects that are in
someone’s head (or something’s knowledge base) and of the
mental processes that manipulate those mental objects.
 We begin with the propositional attitudes that an agent can
have toward mental objects: attitudes such as Believes,
Knows, Wants, Intends, and Informs.
 The difficulty is that these attitudes do not behave like
“normal” predicates.
 For example, suppose we try to assert that Lois knows that
Superman can fly:
Knows(Lois, CanFly(Superman)) .
 A more serious problem is that, if it is true that Superman is Clark Kent, then we must conclude that Lois knows that Clark can fly:
 (Superman = Clark) ∧ Knows(Lois, CanFly(Superman)) |= Knows(Lois, CanFly(Clark)) .
 If our agent knows that 2 + 2 = 4 and 4 < 5, then we want our
agent to know that 2 + 2 < 5. This property is called
referential transparency.
 Modal logic includes special modal operators that take
sentences (rather than terms) as arguments.
 For example, “A knows P” is represented with the notation K_A P, where K is the modal operator for knowledge.
 It takes two arguments, an agent (written as the subscript) and a sentence.
 The syntax of modal logic is the same as first-order logic,
except that sentences can also be formed with modal operators
 We will need a more complicated model, one that consists of a collection
of possible worlds rather than just one true world.
 The worlds are connected in a graph by accessibility relations, one
relation for each modal operator.
 We say that world w1 is accessible from world w0 with respect to the modal operator K_A if everything in w1 is consistent with what A knows in w0, and we write this as Acc(K_A, w0, w1).
 In diagrams, we show accessibility as an arrow between possible worlds.
 As an example, in the real world, Bucharest is the capital of Romania,
 but for an agent that did not know that, other possible worlds are
accessible, including ones where the capital of Romania is Sibiu or Sofia.
 Presumably a world where 2 + 2 = 5 would not be accessible to any
agent.
 In general, a knowledge atom K_A P is true in world w if and only if P is true in every world accessible from w.
 The truth of more complex sentences is derived by recursive
application of this rule and the normal rules of first-order logic.
 That means that modal logic can be used to reason about nested knowledge sentences: what one agent knows about another agent’s knowledge.
 For example, we can say that, even though Lois doesn’t know whether Superman’s secret identity is Clark Kent, she does know that Clark knows:
 K_Lois [K_Clark Identity(Superman, Clark) ∨ K_Clark ¬Identity(Superman, Clark)]
 Modal logic solves some tricky issues with the interplay of
quantifiers and knowledge.
 The English sentence “Bond knows that someone is a spy” is
ambiguous.
 The first reading is that there is a particular someone who Bond knows is a spy; we can write this as
∃x K_Bond Spy(x) ,
which in modal logic means that there is an x that, in all accessible worlds, Bond knows to be a spy.
 The second reading is that Bond just knows that there is at least one spy:
K_Bond ∃x Spy(x) .
 First, we can say that agents are able to draw deductions: if an agent knows P and knows that P implies Q, then the agent knows Q:
(K_a P ∧ K_a (P ⇒ Q)) ⇒ K_a Q .
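The possible-worlds semantics above can be prototyped directly. In the sketch below (the worlds, agents, and knows function are our own illustrative choices), K_A P is checked by testing P in every world accessible to A:

# Sketch of Kripke semantics: K_A(P) is true in w iff P is true
# in every world accessible from w for agent A.

worlds = {
    "w0": {"CapitalOfRomaniaIsBucharest": True},
    "w1": {"CapitalOfRomaniaIsBucharest": True},
    "w2": {"CapitalOfRomaniaIsBucharest": False},   # capital is Sibiu, say
}

acc = {                                # Acc(K_A, w, w') per agent
    "informed": {"w0": {"w0", "w1"}},  # cannot access the false world
    "ignorant": {"w0": {"w0", "w1", "w2"}},
}

def knows(agent, prop, w):
    return all(worlds[v][prop] for v in acc[agent][w])

print(knows("informed", "CapitalOfRomaniaIsBucharest", "w0"))  # True
print(knows("ignorant", "CapitalOfRomaniaIsBucharest", "w0"))  # False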
REASONING SYSTEMS FOR CATEGORIES
 Categories are the primary building blocks of large-scale
knowledge representation schemes.
 There are two closely related families of systems:
semantic networks provide graphical aids for
visualizing a knowledge base and efficient algorithms for
inferring properties of an object on the basis of its
category membership; and description logics provide a
formal language for constructing and combining category
definitions and efficient algorithms for deciding subset
and superset relationships between categories
Semantic networks
A graphical notation of nodes and edges called existential graphs that
are called “the logic of the future”
There are many variants of semantic networks, but all are capable of
representing individual objects, categories of objects, and relations among
objects.
A typical graphical notation displays object or category names in ovals or
boxes, and connects them with labeled links.
For example, the figure has a MemberOf link between Mary and FemalePersons, corresponding to the logical assertion Mary ∈ FemalePersons; similarly, the SisterOf link between Mary and John corresponds to the assertion SisterOf(Mary, John).
We can connect categories using SubsetOf links, and so on.
For example, we know that persons have female persons as mothers, so
can we draw a HasMother link from Persons to FemalePersons?
 The answer is no, because HasMother is a relation between a person and his or her mother, and categories do not have mothers.
 For this reason, we have used a special notation—the double-boxed link—in the figure.
 This link asserts that
 ∀x x ∈ Persons ⇒ [∀y HasMother(x, y) ⇒ y ∈ FemalePersons] .
 We might also want to assert that persons have two legs—that is, ∀x x ∈ Persons ⇒ Legs(x, 2) .
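Inference in a semantic network is essentially a walk up the taxonomy. A minimal illustrative sketch (our own encoding of MemberOf/SubsetOf links and property inheritance, not a standard library):

# Sketch: inheritance in a semantic network. An object inherits a
# property from the first category up the SubsetOf chain that has it.

subset_of = {"FemalePersons": "Persons", "MalePersons": "Persons"}
member_of = {"Mary": "FemalePersons", "John": "MalePersons"}
properties = {"Persons": {"Legs": 2}}   # persons have two legs

def lookup(obj, prop):
    cat = member_of[obj]
    while cat is not None:
        if prop in properties.get(cat, {}):
            return properties[cat][prop]
        cat = subset_of.get(cat)         # follow the SubsetOf link
    return None

print(lookup("Mary", "Legs"))   # 2, inherited from Persons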
Description logics
 The syntax of first-order logic is designed to make it easy to
say things about objects.
 Description logics are notations that are designed to make it
easier to describe definitions and properties of categories.
 The principal inference tasks for description logics are
subsumption (checking if one category is a subset of
another by comparing their definitions) and classification
(checking whether an object belongs to a category)
 Some systems also include consistency of a category
definition—whether the membership criteria are logically
satisfiable.
 The CLASSIC language is a typical description logic.
 The syntax of CLASSIC descriptions is shown in the figure on the next slide.
 For example, to say that bachelors are unmarried adult males
we would write
Bachelor = And(Unmarried, Adult, Male) .
 The equivalent in first-order logic would be
Bachelor(x) ⇔ Unmarried(x) ∧ Adult(x) ∧ Male(x)
 For example, to describe the set of men with at least three
sons who are all unemployed and married to doctors, and at
most two daughters who are all professors in physics or math
departments, we would use
And(Man, AtLeast(3, Son), AtMost(2, Daughter),
All(Son, And(Unemployed, Married, All(Spouse, Doctor))),
All(Daughter, And(Professor, Fills(Department, Physics, Math)))) .
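To convey the flavor of these constructors, here is a toy Python encoding of a few CLASSIC-style operators as predicates over simple dictionary objects. This is our own illustrative membership-testing semantics, not the actual CLASSIC system:

# Toy semantics: each description is a predicate over an object.

def And(*ds):         return lambda x: all(d(x) for d in ds)
def Concept(name):    return lambda x: name in x.get("concepts", ())
def AtLeast(n, role): return lambda x: len(x.get(role, ())) >= n

Bachelor = And(Concept("Unmarried"), Concept("Adult"), Concept("Male"))
ManWith3Sons = And(Concept("Man"), AtLeast(3, "Son"))

fred = {"concepts": ("Unmarried", "Adult", "Male")}
dave = {"concepts": ("Man",), "Son": ("s1", "s2", "s3")}
print(Bachelor(fred), ManWith3Sons(dave))   # True True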
PROBABILISTIC REASONING
UNIT III (Chapter 14)
 Bayesian Networks
 A Bayesian network is a directed graph in which each node is
annotated with quantitative probability information.
The full specification is as follows:
 1. Each node corresponds to a random variable, which may be
discrete or continuous.
 2. A set of directed links or arrows connects pairs of nodes. If there
is an arrow from node X to node Y , X is said to be a parent of Y.
 The graph has no directed cycles (and hence is a directed acyclic graph, or DAG).
 3. Each node Xi has a conditional probability distribution P(Xi | Parents(Xi)) that quantifies the effect of the parents on the node.
https://www.youtube.com/watch?v=SkC8S3wuIfg&ab_channel=edureka%21
 The intuitive meaning of an arrow is typically that X
has a direct influence on Y, which suggests that
causes should be parents of effects.
 It is usually easy for a domain expert to decide what
direct influences exist in the domain—much easier,
in fact, than actually specifying the probabilities
themselves.
 Once the topology of the Bayesian network is laid
out, we need only specify a conditional probability
distribution for each variable, given its parents.
https://www.youtube.com/watch?v=SPFeLriivOs
An example
 You have a burglar alarm installed at home. It is fairly reliable at detecting a burglary, but also responds on occasion to minor earthquakes.
 You also have two neighbors, John and Mary, who have promised to call you at work when they hear the alarm. John nearly always calls when he hears the alarm, but sometimes confuses the telephone ringing with the alarm and calls then, too. Mary, on the other hand, likes rather loud music and often misses the alarm altogether.
 Given the evidence of who has or has not called, we would like to estimate the probability of a burglary.
A Directed Acyclic Graph
[Figure: a directed acyclic graph with nodes Burglary and Earthquake, each with an arrow into Alarm.]
1. A directed acyclic graph:
 The nodes are random variables (which can be discrete or continuous).
 Arrows connect pairs of nodes (X is a parent of Y if there is an arrow from node X to node Y).
A Directed Acyclic Graph
[Figure: the same DAG, with Burglary and Earthquake pointing to Alarm.]
 Intuitively, an arrow from node X to node Y means X has a direct influence on Y (we can say X has a causal effect on Y).
 Easy for a domain expert to determine these relationships.
 The absence/presence of arrows will be made more precise later on.
A Set of Parameters
[Figure: the Burglary–Earthquake–Alarm network annotated with its CPTs.]

B     | P(B)          E     | P(E)
false | 0.999         false | 0.998
true  | 0.001         true  | 0.002

B     | E     | A     | P(A|B,E)
false | false | false | 0.999
false | false | true  | 0.001
false | true  | false | 0.71
false | true  | true  | 0.29
true  | false | false | 0.06
true  | false | true  | 0.94
true  | true  | false | 0.05
true  | true  | true  | 0.95

 Each node Xi has a conditional probability distribution P(Xi | Parents(Xi)) that quantifies the effect of the parents on the node.
 The parameters are the probabilities in these conditional probability distributions.
 Because we have discrete random variables, we have conditional probability tables (CPTs).
A Set of Parameters
 The conditional probability distribution for Alarm stores the probability distribution for Alarm given the values of Burglary and Earthquake.

B     | E     | A     | P(A|B,E)
false | false | false | 0.999
false | false | true  | 0.001
false | true  | false | 0.71
false | true  | true  | 0.29
true  | false | false | 0.06
true  | false | true  | 0.94
true  | true  | false | 0.05
true  | true  | true  | 0.95

 For a given combination of values of the parents (B and E in this example), the entries for P(A=true|B,E) and P(A=false|B,E) must add up to 1, e.g. P(A=true|B=false,E=false) + P(A=false|B=false,E=false) = 1.
 If you have a Boolean variable with k Boolean parents, how big is the conditional probability table? How many entries are independently specifiable?
 The network structure shows that burglary and
earthquakes directly affect the probability of the
alarm’s going off, but whether John and Mary call
depends only on the alarm.
 The network thus represents our assumptions
that they do not perceive burglaries directly, they
do not notice minor earthquakes, and they do not
confer before calling.
 The conditional distributions in Figure 14.2 are shown as a conditional
probability table, or CPT.
 Each row in a CPT contains the conditional probability of each node value
for a conditioning case.
 A conditioning case is just a possible combination of values for the parent
nodes.
 Each row must sum to 1, because the entries represent an exhaustive set
of cases for the variable.
 For Boolean variables, once you know that the probability of a true value
is p, the probability of false must be 1 – p, so we often omit the second
number.
 In general, a table for a Boolean variable with k Boolean parents contains 2^k independently specifiable probabilities.
 A node with no parents has only one row, representing the prior
probabilities of each possible value of the variable.
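A CPT maps each conditioning case to a probability. The sketch below (our own dict-based encoding) stores the burglary-network CPTs, keeping only P(X = true | parents) since P(false) = 1 − p; a Boolean node with k Boolean parents therefore needs 2^k stored numbers:

# Sketch: CPTs as dicts keyed by parent values; only P(true) stored.

P_B = 0.001                      # P(Burglary=true): no parents, one row
P_E = 0.002                      # P(Earthquake=true)

P_A = {                          # P(Alarm=true | B, E): 2**2 = 4 rows
    (False, False): 0.001,
    (False, True):  0.29,
    (True,  False): 0.94,
    (True,  True):  0.95,
}

def pr(x, p_true):
    # P(X=x) recovered from the stored P(X=true).
    return p_true if x else 1 - p_true

# e.g. P(Alarm=false | B=false, E=false) = 1 - 0.001 = 0.999
print(pr(False, P_A[(False, False)]))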
THE SEMANTICS OF BAYESIAN NETWORKS
 There are two ways in which one can understand the semantics of Bayesian networks.
 The first is to see the network as a representation of
the joint probability distribution.
 The second is to view it as an encoding of a collection of
conditional independence statements.
 The two views are equivalent, but the first turns out to
be helpful in understanding how to construct networks,
whereas the second is helpful in designing inference
procedures.
 One way to define what the network means—its
semantics—is to define the way in which it represents a
specific joint distribution over all the variables.
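 Concretely, the network asserts that the full joint distribution is the product of the local conditional distributions: P(x1, . . . , xn) = Π i P(xi | parents(Xi)). Using the CPTs above, for instance, P(a ∧ ¬b ∧ ¬e) = P(a | ¬b, ¬e) P(¬b) P(¬e) = 0.001 × 0.999 × 0.998 ≈ 0.000997.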
Compactness and node ordering
 The compactness of Bayesian networks is an example of a general property of locally structured (also called sparse) systems.
 In a locally structured system, each subcomponent interacts directly with only a bounded number of other components, regardless of the total number of components.
 We will get a compact Bayesian network only if we choose the node ordering well.
Conditional independence relations in Bayesian networks
 We can start from a “topological” semantics that specifies the conditional independence relationships encoded by the graph structure, and from this we can derive the “numerical” semantics.
 The topological semantics specifies that each variable is conditionally
independent of its non-descendants, given its parents.
 For example, in Figure 14.2, JohnCalls is independent of Burglary,
Earthquake, and MaryCalls given the value of Alarm.
 The definition is illustrated in Figure 14.4(a).
 From these conditional independence assertions and the interpretation
of the network parameters θ(Xi |Parents(Xi)) as specifications of
conditional probabilities P(Xi |Parents(Xi)), the full joint distribution can
be reconstructed. In this sense, the “numerical” semantics and the
“topological” semantics are equivalent.
A node X is conditionally independent of its non-descendants, given its parents.
EFFICIENT REPRESENTATION OF CONDITIONAL
DISTRIBUTIONS
 A conditional distribution is a probability
distribution for a sub-population.
 In other words, it shows the probability that a
randomly selected item in a sub-population has a
characteristic you’re interested in.
 For example, if you are studying eye colors (the
population) you might want to know how many
people have blue eyes (the sub-population).
 A deterministic node has its value specified exactly by
the values of its parents, with no uncertainty.
 The relationship can be a logical one: for example, the
relationship between the parent nodes Canadian, US,
Mexican and the child node North American is simply
that the child is the disjunction of the parents.
 The relationship can also be numerical: for example, if
the parent nodes are the prices of a particular model of
car at several dealers and the child node is the price that
a bargain hunter ends up paying, then the child node is
the minimum of the parent values;
 Uncertain relationships can often be characterized by so-called noisy logical relationships.
 The standard example is the noisy-OR relation, which is a generalization of the logical OR.
 In propositional logic, we might say that Fever is true if and
only if Cold , Flu, or Malaria is true.
 The noisy-OR model allows for uncertainty about the ability
of each parent to cause the child to be true—the causal
relationship between parent and child may be inhibited, and
so a patient could have a cold, but not exhibit a fever
 The model makes two assumptions.
 First, it assumes that all the possible causes are listed. (If
some are missing, we can always add a so-called leak node
that covers “miscellaneous causes.”)
 Second, it assumes that inhibition of each parent is
independent of inhibition of any other parents: for example,
whatever inhibits Malaria from causing a fever is
independent of whatever inhibits Flu from causing a fever.
 Given these assumptions, Fever is false if and only if all its true
parents are inhibited, and the probability of this is the
product of the inhibition probabilities q for each parent.
 Let us suppose these individual inhibition probabilities are as follows:
 q_cold = P(¬fever | cold, ¬flu, ¬malaria) = 0.6 ,
 q_flu = P(¬fever | ¬cold, flu, ¬malaria) = 0.2 ,
 q_malaria = P(¬fever | ¬cold, ¬flu, malaria) = 0.1 .
 Then, from this information and the noisy-OR assumptions, the entire CPT can be built.
 The general rule is that
 P(x_i | parents(X_i)) = 1 − Π_{j : X_j = true} q_j ,
 where the product ranges over the parents that are set to true.
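Using the inhibition probabilities above, the entire Fever CPT follows mechanically from this rule. A short illustrative sketch:

# Sketch: P(fever | parents) = 1 - product of q_j over true parents.
# With no true parent, the empty product is 1, so P(fever) = 0.

from itertools import product

q = {"cold": 0.6, "flu": 0.2, "malaria": 0.1}   # inhibition probabilities

for values in product([False, True], repeat=3):
    inhibit = 1.0
    for parent, is_true in zip(q, values):
        if is_true:
            inhibit *= q[parent]     # every true cause must be inhibited
    print(dict(zip(q, values)), "P(fever) =", round(1 - inhibit, 3))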
Bayesian nets with continuous variables
 One possible way to handle continuous variables is to
avoid them by using discretization—that is, dividing up
the possible values into a fixed set of intervals.
 A network with both discrete and continuous variables is
called a hybrid Bayesian network
EXACT INFERENCE IN BAYESIAN NETWORKS
 The basic task for any probabilistic inference system is to
compute the posterior probability distribution for a set of query
variables, given some observed event—that is, some
assignment of values to a set of evidence variables.
 X denotes the query variable; E denotes the set of evidence variables E1, . . . , Em, and e is a particular observed event; Y denotes the nonevidence, nonquery variables Y1, . . . , Yl (called the hidden variables).
 Thus, the complete set of variables is {X} ∪ E ∪ Y. A typical query asks for the posterior probability distribution P(X | e).
In the burglary network, we might observe the event in which JohnCalls = true and MaryCalls = true.
We could then ask for, say, the probability that a burglary has occurred:
P(Burglary | JohnCalls = true, MaryCalls = true) = <0.284, 0.716> .
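This 0.284 can be reproduced by brute-force enumeration: sum the joint distribution over the hidden variables Earthquake and Alarm, then normalize. Note that the JohnCalls/MaryCalls CPT values below are an assumption (the standard figures from the textbook version of this example, Figure 14.2), since they are not listed in these slides:

# Sketch: inference by enumeration for P(Burglary | j, m).

from itertools import product

P_B, P_E = 0.001, 0.002
P_A = {(False, False): 0.001, (False, True): 0.29,
       (True,  False): 0.94,  (True,  True): 0.95}
P_J = {True: 0.90, False: 0.05}    # assumed: P(JohnCalls=true | Alarm)
P_M = {True: 0.70, False: 0.01}    # assumed: P(MaryCalls=true | Alarm)

def pr(x, p_true):
    return p_true if x else 1 - p_true

def joint(b, e, a, j, m):
    return (pr(b, P_B) * pr(e, P_E) * pr(a, P_A[(b, e)])
            * pr(j, P_J[a]) * pr(m, P_M[a]))

score = {b: sum(joint(b, e, a, True, True)
                for e, a in product([False, True], repeat=2))
         for b in (True, False)}
print(round(score[True] / sum(score.values()), 3))   # ≈ 0.284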
APPROXIMATE INFERENCE IN BAYESIAN
NETWORKS
 sampling applied to the computation of posterior probabilities.
NOTE
 P(A|B) is referred to as the posterior probability, and P(A) is referred to as the prior probability.
Direct sampling methods
 The primitive element in any sampling algorithm is the generation of samples from a known probability distribution.
 For example, an unbiased coin can be thought of as a random variable Coin with values heads, tails and a prior distribution P(Coin) = <0.5, 0.5>.
 The simplest kind of random sampling process for Bayesian
networks generates events from a network that has no evidence
associated with it.
 The idea is to sample each variable in turn, in topological order.
 The probability distribution from which the value is sampled is
conditioned on the values already assigned to the variable’s
parents.
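A sketch of this prior-sampling procedure for the burglary network (CPT values from the earlier slides; the code itself is our own illustration):

# Sketch: sample each variable in topological order, conditioning
# on the values already drawn for its parents.

import random

P_B, P_E = 0.001, 0.002
P_A = {(False, False): 0.001, (False, True): 0.29,
       (True,  False): 0.94,  (True,  True): 0.95}

def prior_sample():
    b = random.random() < P_B          # no parents: sample the prior
    e = random.random() < P_E
    a = random.random() < P_A[(b, e)]  # conditioned on sampled parents
    return {"B": b, "E": e, "A": a}

samples = [prior_sample() for _ in range(100_000)]
print(sum(s["A"] for s in samples) / len(samples))   # ≈ P(Alarm=true)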
Rejection sampling in Bayesian networks
 Rejection sampling is a general method for producing samples from a hard-to-sample distribution given an easy-to-sample distribution.
 In its simplest form, it can be used to compute conditional
probabilities—that is, to determine P(X | e).
 The REJECTION-SAMPLING algorithm first generates samples from the prior distribution specified by the network.
 Then, it rejects all those that do not match the evidence.
 Finally, the estimate P̂(X = x | e) is obtained by counting how often X = x occurs in the remaining samples.
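A matching illustrative sketch of rejection sampling, estimating P(Burglary | Alarm = true) by discarding every sample that contradicts the evidence:

# Sketch: keep only samples consistent with the evidence A=true,
# then count how often B=true among the survivors.

import random

P_B, P_E = 0.001, 0.002
P_A = {(False, False): 0.001, (False, True): 0.29,
       (True,  False): 0.94,  (True,  True): 0.95}

def prior_sample():
    b = random.random() < P_B
    e = random.random() < P_E
    a = random.random() < P_A[(b, e)]
    return {"B": b, "E": e, "A": a}

kept = [s for s in (prior_sample() for _ in range(1_000_000)) if s["A"]]
print(round(sum(s["B"] for s in kept) / len(kept), 3))
# estimate of P(Burglary=true | Alarm=true), ≈ 0.37 with these CPTs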
