
E-Notes – CO4

Learning Sets of Rules

Motivation: [https://ieeexplore.ieee.org/document/149926]

• It is notoriously difficult to obtain rules directly from human experts for prediction, diagnosis,
simulation, etc., – Knowledge Acquisition Bottleneck.

• Furthermore, if the domain necessitates reasoning under uncertainty (probabilistic reasoning),


humans are well known to be inconsistent in their description of subjective probabilities.

• Hence, if an existing database of sample data of the domain is available, a rule induction
algorithm would be very useful.

• This is known as a “data reduction process”, where we want to reduce a large database of
information to a small number of rules describing the domain.

Introduction
Learning sets of rules from data is a useful way to learn a target concept, and rule sets are among the
most expressive and human-readable representations for learned hypotheses. A rule-based system
(containing an algorithm for the induction of rules from examples) is a special type of expert system,
which typically consists of a set of if–then rules. Rule-based systems are used in knowledge discovery
tasks and predictive modeling tasks.

Sequential Covering Algorithms


Algorithms for learning rule sets based on the strategy of learning one rule, removing the data
it covers, and then iterating this process are discussed below; a sketch of the overall strategy follows.
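The following Python sketch illustrates this sequential covering strategy. It is a minimal illustration
under stated assumptions, not code from these notes: learn_one_rule() and covers() are assumed
helpers (one possible learn_one_rule is sketched later in the beam-search section), and min_accuracy
is an assumed stopping threshold.

def sequential_covering(examples, attributes, target, min_accuracy=0.8):
    learned_rules = []
    remaining = list(examples)
    while remaining:
        # learn a single rule over the examples not yet covered
        body, label = learn_one_rule(remaining, attributes, target)
        covered = [e for e in remaining if covers(body, e)]
        if not covered:
            break
        accuracy = sum(e[target] == label for e in covered) / len(covered)
        if accuracy < min_accuracy:
            break                      # stop when no sufficiently good rule remains
        learned_rules.append((body, label))
        # remove the data the new rule covers and iterate
        remaining = [e for e in remaining if not covers(body, e)]
    return learned_rules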

Background for learning sets of rules:


• Classify records by using a collection of “if…then…” rules

• Rule: (Condition) → y

• where

• Condition is a conjunction of attribute tests

• y is the class label

• LHS: rule antecedent or condition

• RHS: rule consequent

• Examples of classification rules:

• (Blood Type=Warm) ∧ (Lay Eggs=Yes) → Birds

• (Taxable Income < 50K) ∧ (Refund=Yes) → Evade=No

Example dataset and the learned rules

R1: (Give Birth = no) ∧ (Aerial Creature = yes) → Birds

R2: (Give Birth = no) ∧ (Aquatic Creature = yes) → Fishes

R3: (Give Birth = yes) ∧ (Body temperature = warm) → Mammals

R4: (Give Birth = no) ∧ (Aerial Creature = no) → Reptiles

R5: (Aquatic Creature = sometimes) → Amphibians

Distinct instances, Syntactically distinct hypotheses, Semantically distinct hypotheses for the
considered dataset

• For the considered dataset in order to learn sets of rules:


• Based on the number of values each attribute takes, we get 15 × 2 × 6 × 2 × 3 × 2 × 2 × 2 = 8,640
distinct instances.

• Adding two possible values to each attribute, the wildcard (?) and the null value (Φ), gives
17 × 4 × 8 × 4 × 5 × 4 × 4 × 4 = 696,320 syntactically distinct hypotheses.

• Allowing only the wildcard for each attribute, plus one hypothesis for the empty set (Φ), gives
1 + (16 × 3 × 7 × 3 × 4 × 3 × 3 × 3) = 108,865 semantically distinct hypotheses.

This is a large hypothesis space. The solution is to take advantage of the naturally occurring
general-to-specific structure over the hypothesis space when searching for a hypothesis that describes
the target concept.
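A quick Python check of these counts, assuming the eight attributes of the considered dataset take
15, 2, 6, 2, 3, 2, 2 and 2 values respectively (the counts used above):

from math import prod

values = [15, 2, 6, 2, 3, 2, 2, 2]

distinct_instances = prod(values)               # 8,640 distinct instances
syntactic = prod(v + 2 for v in values)         # each attribute also allows ? and Φ: 696,320
semantic = 1 + prod(v + 1 for v in values)      # wildcard only, plus one all-empty hypothesis: 108,865

print(distinct_instances, syntactic, semantic)  # 8640 696320 108865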

General-to-Specific Rule Growing

CN2 Algorithm:

• Start from an empty conjunct: {}

• Add conjuncts that minimize the entropy measure: {A}, {A,B}, …

• Determine the rule consequent by taking majority class of instances covered by the
rule

• The CN2 algorithm is a greedy, depth-first search for hypotheses (set of rules) with no
backtracking.

• With greedy search, there is a danger of selecting a suboptimal rule at any step.

• To reduce this risk, the algorithm is extended to perform a beam search.

General to Specific Beam Search


• A search in which the algorithm maintains a list of the ‘k’ best candidates at each step,
rather than a single best candidate.

• On each search step, descendants (specializations) are generated for each of these ‘k’
best candidates, and the resulting set is again reduced to the ‘k’ most promising
members.
• Beam search keeps track of the most promising alternatives to the current top-rated
hypothesis, so that all of their successors can be considered at each search step.

• The search for rule preconditions by LEARN-ONE-RULE proceeds from general to specific.

• At each step, the preconditions of the best rule are specialized in all possible ways.

• Rule postconditions are determined by the examples found to satisfy the preconditions.
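The following Python sketch shows one possible LEARN-ONE-RULE implemented as a general-to-specific
beam search of width k, in the spirit of CN2. It is an illustrative reconstruction rather than the exact
algorithm of these notes: rules are represented as sets of (attribute, value) tests, and the entropy of
the covered examples is used as the performance measure.

from collections import Counter
from math import log2

def covers(rule, example):
    # a rule (a set of attribute=value tests) covers an example if every test holds
    return all(example[a] == v for a, v in rule)

def entropy(rule, examples, target):
    # entropy of the target labels among the examples covered by the rule
    labels = [e[target] for e in examples if covers(rule, e)]
    if not labels:
        return float('inf')
    return -sum((c / len(labels)) * log2(c / len(labels))
                for c in Counter(labels).values())

def learn_one_rule(examples, attributes, target, k=3):
    best, frontier = frozenset(), [frozenset()]
    while frontier:
        # specialize every candidate in the beam by adding one more attribute test
        candidates = {rule | {(a, v)}
                      for rule in frontier
                      for a in attributes if a not in dict(rule)
                      for v in {e[a] for e in examples}}
        candidates = [c for c in candidates if any(covers(c, e) for e in examples)]
        if not candidates:
            break
        candidates.sort(key=lambda c: entropy(c, examples, target))
        if entropy(candidates[0], examples, target) < entropy(best, examples, target):
            best = candidates[0]
        frontier = candidates[:k]          # keep only the k most promising rules
    covered = [e[target] for e in examples if covers(best, e)]
    # the rule consequent is the majority class of the instances the rule covers
    return best, Counter(covered).most_common(1)[0][0]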

Rule                              Coverage    Accuracy    Entropy
(SC = hair) → Mammals             20%         100%        0 bits (M: 3+, 0−)
(BT = warm-blooded) → Mammals     46.66%      71.4%       0.863 bits (M: 5+, 2−)
(HL = No) → Mammals               33.33%      20%         0.721 bits (M: 1+, 4−)

• An important observation is that when both coverage and accuracy are taken for rule
evaluation, one rule is selected for further growing.

• When entropy is taken for rule evaluation, another rule is selected for rule growing.

• When the number-of-positive-examples heuristic is used, a beam of width k is maintained
(beam search).

Note: Heuristics can also be used to select the best node for further expansion.
[https://d-nb.info/1106118391/34]
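As a quick check of the entropy column in the table above, the following Python snippet computes the
entropy of the class distribution covered by each rule:

from math import log2

def entropy(pos, neg):
    total = pos + neg
    return -sum((n / total) * log2(n / total) for n in (pos, neg) if n > 0)

print(round(entropy(3, 0), 3))   # 0.0   bits for (M: 3+, 0-)
print(round(entropy(5, 2), 3))   # 0.863 bits for (M: 5+, 2-)
print(round(entropy(1, 4), 3))   # 0.722 bits for (M: 1+, 4-), shown truncated as 0.721 above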

Rule Coverage and Accuracy for determining the quality of the rule

• Coverage of a rule:
• Fraction of records in the dataset D that satisfy the antecedent (A) of the rule
Coverage(r) = |A| / |D|
• Accuracy of a rule:
• Fraction of records satisfying the antecedent (A) that also satisfy the consequent (y) of the rule
Accuracy(r) = |A ∩ y| / |A|
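A small, hypothetical Python helper computing both measures for a rule, where the antecedent and
consequent are passed as predicates over a record (the field names in the usage comment are
hypothetical):

def coverage_and_accuracy(antecedent, consequent, records):
    covered = [r for r in records if antecedent(r)]           # records satisfying A
    correct = [r for r in covered if consequent(r)]           # records satisfying both A and y
    coverage = len(covered) / len(records)
    accuracy = len(correct) / len(covered) if covered else 0.0
    return coverage, accuracy

# e.g. for the rule (SC = hair) → Mammals over a list of record dictionaries:
# coverage_and_accuracy(lambda r: r["SkinCover"] == "hair",
#                       lambda r: r["Class"] == "Mammals", records)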

Rule Evaluation

The relative frequency estimate of rule accuracy is nc / n, and the m-estimate of accuracy is
(nc + k·p) / (n + k), where:

n = number of examples the rule matches,

nc = number of examples that the rule classifies correctly,

k = equivalent number of examples for weighting the prior probability p.

• Entropy and other heuristics are also useful for rule evaluation.


Learning First Order Rules
 The algorithms seen so far learn sets of propositional (variable-free) rules.

 The problem is that propositional representations offer no general way to describe the
essential relations among the values of the attributes.

 In contrast, a program using first-order representations could learn the following general rule:

IF Father(y, x) ∧ Female(y) THEN Daughter(x, y)

where x and y are variables that can be bound to any person. P(x, y) can be read as “the P of x
is y”.
First Order Inductive Learner (FOIL)

FOIL Algorithm
TRAINING DATA

In terms of predicates, the training data is a simple set of assertions:

• GrandDaughter(Victor, Sharon)
• Father(Sharon, Bob)
• Father(Tom, Bob)
• Female(Sharon)
• Father(Bob, Victor)
Application of FOIL Algorithm to the training data for inferring the sets of First Order Rules
Question: Infer rules for the target literal (predicate) GrandDaughter(x, y) by using inductive
reasoning with the given assertions.
Step 1: Start with an empty body and the target literal as the head of the Horn clause.
GrandDaughter(x, y) ← {}
Step 2: To specialize this initial rule, generate candidate specializations by using the
predicates from the training data assertions.
Female(x), Female(y), Father(x, y), Father(y, x), Father(x, z), Father(z, x), Father(y, z), Father(z,
y), Equal(x, y) [Need to consider negations of each of these literals].
Step 3: Use FOIL_GAIN heuristic to choose the most useful literal to be added to the body of
the rule.
Calculation of foil_gain: GrandDaughter(x, y) for x=Victor, y=Sharon and choosing best
literal to add to rule precondition

Literal T pre_pos pre_neg post_pos post_neg gain

Female(x) 1 1 15 1 3 2

Female(y) 1 1 15 1 3 2

Father(x, y) 0 1 15 0 3 0

Father(y, x) 1 1 15 1 2 2.401

Father(x, z) 0 1 15 0 3 0

Father(z, x) 1 1 15 1 2 2.401

Father(y, z) 1 1 15 1 1 3

Father(z, y) 1 1 15 1 2 2.401

In this manner, the next predicates chosen to add to the existing rule are Father(z, x) and
Female(y). At this point the rule covers all of the positive examples, so the FOIL algorithm removes
them and terminates.
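For reference, the Foil_Gain heuristic used in the table is Quinlan's standard one:
Foil_Gain(L, R) = t · (log2(p1 / (p1 + n1)) − log2(p0 / (p0 + n0))), where p0, n0 are the positive and
negative bindings of rule R, p1, n1 are the bindings after adding the candidate literal L, and t is the
number of positive bindings of R still covered after adding L. A quick Python check against two rows of
the table:

from math import log2

def foil_gain(t, p0, n0, p1, n1):
    # Quinlan's Foil_Gain for adding literal L to rule R
    return t * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))

print(foil_gain(t=1, p0=1, n0=15, p1=1, n1=3))   # Female(x):    2.0
print(foil_gain(t=1, p0=1, n0=15, p1=1, n1=1))   # Father(y, z): 3.0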

Induction as Inverted Deduction


• Induction can be cast as a deduction problem as follows.
• We wish to learn a target function f(x) such that, for every training instance xi, the classification
f(xi) follows deductively from the hypothesis h, the instance xi, and the background knowledge B:

(∀ ⟨xi, f(xi)⟩ ∈ D)  (B ∧ h ∧ xi) ⊢ f(xi)

• The expression X ⊢ Y is read "Y follows deductively from X," or alternatively "X entails Y."

Example

Inverting Resolution
 The inverse entailment operator must derive C2 given the resolvent C and C1.
 Say C = A ∨ B, and C1= B ∨ D. How do we derive C2 s.t. C1 ∧ C2 → C?
 Find the L that appears in C1 and not in C, then form C2 by including the
following literals

C2 = (C - (C1 - {L})) ∪ {¬ L}
so
C2 = A ∨ ¬D

 C2 can also be A ∨ ¬D ∨ B.
 In general, inverse resolution can produce multiple clauses C2.
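A small, hypothetical Python sketch of this inverse resolution step, treating clauses as sets of literal
strings with '~' marking negation. It covers only the simplest case shown above, in which no literals
shared with C1 are kept in C2:

def negate(literal):
    return literal[1:] if literal.startswith('~') else '~' + literal

def inverse_resolve(C, C1):
    # for each literal L occurring in C1 but not in the resolvent C,
    # build C2 = (C - (C1 - {L})) ∪ {¬L}
    return [(C - (C1 - {L})) | {negate(L)} for L in (C1 - C)]

# Example from above: C = A ∨ B, C1 = B ∨ D
print(inverse_resolve({'A', 'B'}, {'B', 'D'}))   # [{'A', '~D'}], i.e. C2 = A ∨ ¬D

Variants such as C2 = A ∨ ¬D ∨ B arise from also keeping literals shared with C1, which this sketch
deliberately ignores.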

Learning With Inverted Resolution

 Use inverse entailment to construct hypotheses that, together with the


background information, entail the training data.
 Use sequential covering algorithm to iteratively learn a set of Horn clauses in
this way.

1. Select a training example that is not yet covered by learned clauses.


2. Use inverse resolution rule to generate candidate hypothesis h that satisfies B
∧ h ∧ x → f(x), where B = background knowledge plus any learned clauses.

 This is example-driven search.


 If multiple candidate hypotheses, then choose one with highest accuracy over
the other examples.

https://jmvidal.cse.sc.edu/talks/learningrules/invertingresolution.html

For more theoretical content:


https://people.cs.vt.edu/ramakris/Courses/CS6604/lectures/lec12/lec12.pdf
INVERSE RESOLUTION USING CIGOL ILP SYSTEM

Prolog-EBG
Prolog-EBG(TargetConcept, TrainingExamples, DomainTheory)

1. LearnedRules = {}

2. Pos = the positive examples from TrainingExamples.

3. for each PositiveExample in Pos that is not covered by LearnedRules do

   1. Explanation = an explanation, in terms of DomainTheory, of how
      PositiveExample satisfies the TargetConcept.

   2. SufficientConditions = the most general set of features of
      PositiveExample sufficient to satisfy the TargetConcept according
      to the Explanation.

   3. LearnedRules = LearnedRules + {TargetConcept ← SufficientConditions}.

4. return LearnedRules

Explaining the Example

 Give a proof, using the domain theory, that the (positive) training example
satisfies the target concept.
 In our ongoing example the positive example of
SafeToStack(o1,o2) can be explained by using the domain theory,
as such:

1. Volume(o1,2) ∧ Density(o1,0.3) ∧ Equal(0.6, 2*0.3) → Weight(o1,0.6)
2. Type(o2,endtable) → Weight(o2,5)
3. Weight(o1, 0.6) ∧ LessThan(0.6, 5) ∧ Weight(o2,5) → Lighter(o1,o2)
4. Lighter(o1, o2) → SafeToStack(o1,o2)

 In Prolog-EBG this explanation is generated using backward-chaining search, as done by Prolog.

 Like Prolog, it halts when it finds a proof.

Analyze the Explanation

 Which of the many features of the objects are relevant to the target
concept?
 Those that appear in the explanation we just built.
 We can collect these and substitute x,y for o1,o2 to get

Volume(x,2) ∧ Density(x,0.3) ∧ Type(y,endtable) → SafeToStack(x,y)

So, we now have the most general set of features, right?

 Wrong! An even more general rule can be obtained using a more careful analysis of the explanation.
Weakest Preimage

 The weakest preimage of a conclusion C with respect to a proof P is the most general set of
assertions A such that A entails C according to P.

 Prolog-EBG computes the most general rule that can be justified by the explanation by
computing the weakest preimage.

 It is calculated by using regression: working iteratively backward through the explanation,
computing the weakest preimage at each step.

Regressing the target concept backward through the explanation (each regression result is shown
after the rule used):

SafeToStack(x, y)

4. Lighter(o1, o2) → SafeToStack(o1,o2)

   Lighter(x, y)

3. Weight(o1, 0.6) ∧ LessThan(0.6, 5) ∧ Weight(o2,5) → Lighter(o1,o2)

   Weight(x,wx), LessThan(wx, wy), Weight(y,wy)

2. Type(o2,endtable) → Weight(o2,5)

   Weight(x,wx), LessThan(wx,5), Type(y,endtable)

1. Volume(o1,2) ∧ Density(o1,0.3) ∧ Equal(0.6, 2*0.3) → Weight(o1,0.6)

   Volume(x,vx), Density(x,dx), Equal(wx, vx*dx), LessThan(wx,5), Type(y,endtable)

 The final Horn clause has a body that corresponds to the weakest preconditions, and the head
is the concept:

SafeToStack(x,y) ← Volume(x,vx) ∧ Density(x,dx) ∧ Equal(wx, vx*dx) ∧ LessThan(wx,5) ∧ Type(y,endtable)
Reinforcement Learning
Reinforcement Learning is an approach to machine intelligence that combines two disciplines to
successfully solve problems that neither discipline can address individually. Dynamic Programming is a
field of mathematics that has traditionally been used to solve problems of optimization and control.
However, traditional dynamic programming is limited in the size and complexity of the problems it can
address.

Supervised learning is a general method for training a parameterized function approximator, such as a
neural network, to represent functions. However, supervised learning requires sample input-output
pairs from the function to be learned. In other words, supervised learning requires a set of questions
with the right answers. For example, we might not know the best way to program a computer to
recognize an infrared picture of a tank, but we do have a large collection of infrared pictures, and we
do know whether each picture contains a tank or not. Supervised learning could look at all the
examples with answers, and learn how to recognize tanks in general.

Unfortunately, there are many situations where we don’t know the correct answers that supervised
learning requires. For example, in a flight control system, the question would be the set of all sensor
readings at a given time, and the answer would be how the flight control surfaces should move during
the next millisecond. Simple neural networks can’t learn to fly the plane unless there is a set of known
answers, so if we don’t know how to build a controller in the first place, simple supervised learning
won’t help.

For these reasons there has been much interest recently in a different approach known as
reinforcement learning (RL). Reinforcement learning is not a type of neural network, nor is it an
alternative to neural networks. Rather, it is an orthogonal approach that addresses a different, more
difficult question. Reinforcement learning combines the fields of dynamic programming and
supervised learning to yield powerful machine-learning systems. Reinforcement learning appeals to
many researchers because of its generality. In RL, the computer is simply given a goal to achieve. The
computer then learns how to achieve that goal by trial-and-error interactions with its environment.
Thus, many researchers are pursuing this form of machine intelligence and are excited about the
possibility of solving problems that have been previously unsolvable.

To provide the intuition behind reinforcement learning consider the problem of learning to ride a
bicycle. The goal given to the RL system is simply to ride the bicycle without falling over. In the first
trial, the RL system begins riding the bicycle and performs a series of actions that result in the bicycle
being tilted 45 degrees to the right. At this point there are two actions possible: turn the handle bars
left or turn them right. The RL system turns the handle bars to the left and immediately crashes to the
ground, thus receiving a negative reinforcement. The RL system has just learned not to turn the handle
bars left when tilted 45 degrees to the right. In the next trial the RL system performs a series of actions
that again result in the bicycle being tilted 45 degrees to the right. The RL system knows not to turn
the handle bars to the left, so it performs the only other possible action: turn right. It immediately
crashes to the ground, again receiving a strong negative reinforcement. At this point the RL system has
not only learned that turning the handle bars right or left when tilted 45 degrees to the right is bad,
but that the "state" of being tilted 45 degrees to the right is bad. Again, the RL system begins another
trial and performs a series of actions that result in the bicycle being tilted 40 degrees to the right. Two
actions are possible: turn right or turn left. The RL system turns the handle bars left which results in
the bicycle being tilted 45 degrees to the right, and ultimately results in a strong negative
reinforcement. The RL system has just learned not to turn the handle bars to the left when tilted 40
degrees to the right. By performing enough of these trial-and-error interactions with the environment,
the RL system will ultimately learn how to prevent the bicycle from ever falling over.
http://www.cs.toronto.edu/~zemel/documents/411/rltutorial.pdf

The Parts Of A Reinforcement Learning Problem


In the standard reinforcement learning model an agent interacts with its environment. This interaction
takes the form of the agent sensing the environment, and based on this sensory input choosing an
action to perform in the environment. The action changes the environment in some manner and this
change is communicated to the agent through a scalar reinforcement signal. There are three
fundamental parts of a reinforcement learning problem: the environment, the reinforcement function,
and the value function.

The Environment
Every RL system learns a mapping from situations to actions by trial-and-error interactions with a
dynamic environment. This environment must at least be partially observable by the reinforcement
learning system, and the observations may come in the form of sensor readings, symbolic descriptions,
or possibly “mental” situations (e.g., the situation of being lost). The actions may be low level (e.g.,
voltage to motors), high level (e.g., accept job offer), or even “mental” (e.g., shift in focus of attention).
If the RL system can observe perfectly all the information in the environment that might influence the
choice of action to perform, then the RL system chooses actions based on true “states” of the
environment. This ideal case is the best possible basis for reinforcement learning and, in fact, is a
necessary condition for much of the associated theory.

The Reinforcement Function


As stated previously, RL systems learn a mapping from situations to actions by trial-and-error
interactions with a dynamic environment. The “goal” of the RL system is defined using the concept of
a reinforcement function, which is the exact function of future reinforcements the agent seeks to
maximize. In other words, there exists a mapping from state/action pairs to reinforcements; after
performing an action in a given state the RL agent will receive some reinforcement (reward) in the
form of a scalar value. The RL agent learns to perform actions that will maximize the sum of the
reinforcements received when starting from some initial state and proceeding to a terminal state.

It is the job of the RL system designer to define a reinforcement function that properly defines the
goals of the RL agent. Although complex reinforcement functions can be defined, there are at least
three noteworthy classes often used to construct reinforcement functions that properly define the
desired goals.

The Value Function


In previous sections the environment and the reinforcement function are discussed. However, the
issue of how the agent learns to choose “good” actions, or even how we might measure the utility of
an action is not explained. First, two terms are defined. A policy determines which action should be
performed in each state; a policy is a mapping from states to actions. The value of a state is defined as
the sum of the reinforcements received when starting in that state and following some fixed policy to
a terminal state. The optimal policy would therefore be the mapping from states to actions that
maximizes the sum of the reinforcements when starting in an arbitrary state and performing actions
until a terminal state is reached. Under this definition the value of a state is dependent upon the
policy. The value function is a mapping from states to state values and can be approximated using any
type of function approximator (e.g., multilayered perceptron, memory based system, radial basis
functions, look-up table, etc.).
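To make these definitions concrete, here is a toy Python illustration (not from the source tutorial): a
four-state corridor in which state 3 is terminal, every move to the right yields a reinforcement of −1
except the final move into the terminal state, which yields +10, and the fixed policy always moves
right. The value of a state is then simply the sum of reinforcements accumulated from that state to the
terminal state.

REWARD = {0: -1, 1: -1, 2: 10}           # reinforcement for moving right from each state

def policy(state):
    return "right"                       # a policy is a mapping from states to actions

def value(state):
    # sum of reinforcements from `state` to the terminal state under the fixed policy
    total = 0
    while state != 3:                    # state 3 is terminal
        action = policy(state)           # only "right" is available in this toy example
        total += REWARD[state]
        state += 1
    return total

print([value(s) for s in range(3)])      # [8, 9, 10] -> the value function as a look-up table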

Q-Learning
Q-learning (Watkins, 1989 and 1992) is another extension to traditional dynamic programming (value
iteration) that solves the following problem.

A deterministic Markov decision process is one in which the state transitions are deterministic (an
action performed in state xt always transitions to the same successor state xt+1). Alternatively, in a
nondeterministic Markov decision process, a probability distribution function defines a set of potential
successor states for a given action in a given state. If the MDP is non-deterministic, then value
iteration requires that we find the action that returns the maximum expected value (the sum of the
reinforcement and the integral over all possible successor states for the given action). For example, to
find the expected value of the successor state associated with a given action, one must perform that
action an infinite number of times, taking the integral over the values of all possible successor states
for that action. The reason this is necessary is demonstrated in below figure.

In the above figure there are two possible actions in state x. Each action returns a reinforcement of 0.
Action u1 causes a transition to one of two possible successor states with equal probability. The same
is true for action u2. The values of the successor states are 0 and 1 for both actions. Value iteration
requires that the value of state x be equal to the maximum over actions of the sum of reinforcement
and the expected value of the successor state. By taking an infinite number of samples of successor
states for action u1, one would be able to calculate that the actual expected value is 0.5. The same is
true for action u2. Therefore, the value of state x is 0.5. However, if one were to naively perform value
iteration on this MDP by taking a single sample of the successor state associated with each action
instead of the integral, then x would converge to a value of 0.75, which is clearly the wrong answer.
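This bias is easy to reproduce. The following Python snippet (an illustration, not from the source)
samples one successor value per action and takes the max over the two actions, each successor being
worth 0 or 1 with equal probability and all immediate reinforcements being 0; the average backed-up
value approaches 0.75 rather than the true 0.5.

import random

def naive_backup():
    sample_u1 = random.choice([0, 1])    # single sampled successor value for action u1
    sample_u2 = random.choice([0, 1])    # single sampled successor value for action u2
    return max(sample_u1, sample_u2)     # max over actions of reinforcement (0) + sample

trials = 100_000
print(sum(naive_backup() for _ in range(trials)) / trials)   # ≈ 0.75, not the true 0.5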

Theoretically, value iteration is possible in the context of non-deterministic MDPs. However, in


practice it is computationally impossible to calculate the necessary integrals without added knowledge
or some degree of modification. Q-learning solves the problem of having to take the max over a set of
integrals.

Rather than finding a mapping from states to state values (as in value iteration), Q-learning finds a
mapping from state/action pairs to values (called Q-values). Instead of having an associated value
function, Q-learning makes use of the Q-function. In each state, there is a Q-value associated with each
action. The definition of a Q-value is the sum of the (possibly discounted) reinforcements received
when performing the associated action and then following the given policy thereafter. Likewise, the
definition of an optimal Q-value is the sum of the reinforcements received when performing the
associated action and then following the optimal policy thereafter.
In the context of Q-learning, the value of a state is defined to be the maximum Q-value in the given
state. Given this definition it is easy to derive the equivalent of the Bellman equation for Q-learning.
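Assuming a discount factor γ, the standard form of this equation (due to Watkins) can be written in the
notation used here as:

V(xt) = max over actions u of Q(xt, u)

Q(xt, ut) = r(xt, ut) + γ · max over actions u of Q(xt+1, u)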

Q-learning differs from value iteration in that it doesn’t require that in a given state each action be
performed and the expected values of the successor states be calculated. While value iteration
performs an update that is analogous to a one level breadth-first search, Q-learning takes a single-step
sample of a Monte-Carlo roll-out. This process is demonstrated in below figure.

The Q-value is a prediction of the sum of the reinforcements one will receive when performing the
associated action and then following the given policy. To update that prediction Q(xt,ut) one must
perform the associated action ut, causing a transition to the next state xt+1 and returning a scalar
reinforcement r(xt,ut). Then one need only find the maximum Q-value in the new state to have all the
necessary information for revising the prediction (Q-value) associated with the action just performed.
Q-learning does not require one to calculate the integral over all possible successor states in the case
that the state transitions are nondeterministic. The reason is that a single sample of a successor state
for a given action is an unbiased estimate of the expected value of the successor state. In other words,
after many updates the Q-value associated with a particular action will converge to the expected sum
of all reinforcements received when performing that action and following the optimal policy
thereafter.
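A minimal tabular sketch of this update in Python, assuming a learning rate alpha and discount factor
gamma (parameters not discussed above); Q is a dictionary mapping (state, action) pairs to Q-values,
and the state and action names in the usage comment are hypothetical:

from collections import defaultdict

Q = defaultdict(float)                   # Q-values default to 0

def q_update(x_t, u_t, r, x_next, actions, alpha=0.1, gamma=0.9):
    # revise the prediction Q(x_t, u_t) from a single observed transition
    best_next = max(Q[(x_next, u)] for u in actions)      # max Q-value in the new state
    target = r + gamma * best_next                        # sampled one-step return
    Q[(x_t, u_t)] += alpha * (target - Q[(x_t, u_t)])     # move the prediction toward it
    return Q[(x_t, u_t)]

# e.g. after steering left while tilted 45 degrees and crashing (reinforcement -1):
# q_update("tilted_45_right", "steer_left", -1, "crashed",
#          actions=["steer_left", "steer_right"])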
