Combining Inductive and Analytical Learning


This talk is based on

Tom M. Mitchell. Machine Learning [1]. McGraw-Hill, 1997. Chapter 12,
and his slides [2].

1 Motivation
                  Inductive Learning                 Analytical Learning
Goal              Hypothesis fits data               Hypothesis fits domain theory
Justification     Statistical inference              Deductive inference
Advantages        Requires little prior knowledge    Learns from scarce data
Pitfalls          Scarce data, incorrect bias        Imperfect domain theory

They seem very complementary.

We want something in between!
Example: If you had to review a medical database and learn "symptoms for which drug X is more effective than Y", you would look at the relevant attributes (temperature, not insurance), then refine using the data.
Some domain theory (background knowledge) does squeeze into inductive methods, e.g., when choosing an encoding.

1.1 What We Want

We want a learning method such that:

1. Given no domain theory, it should be as good as purely inductive methods.
2. Given a perfect domain theory, it should be as good as analytical methods.
3. Given an imperfect domain theory and imperfect data, it should combine the two and do better than both the inductive and the analytical methods.
4. Accommodate an unknown level of error in the training data.
5. Accommodate an unknown level of error in the domain theory.

2 The Learning Problem


The learning problem:

D is the set of training examples, maybe with errors.
B is the domain theory, maybe with errors.
H is the space of candidate hypotheses.

Determine

A hypothesis that best fits the training examples and the domain theory.

2.1 Best Fits?

So, what does "best fits" mean?

Use the old $error_D(h)$ definition: the proportion of examples from $D$ misclassified by $h$.
But what about B?
Define $error_B(h)$ to be the probability that $h$ will disagree with B on the classification of a randomly drawn instance. Then find
\[ h = \argmin_{h \in H} k_D\, error_D(h) + k_B\, error_B(h) \]
But what do we set $k_D$ and $k_B$ to?
Which one is more reliable, the data or the theory?
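
As a concrete reading of this objective, here is a minimal Python sketch (my own illustration, not from the slides); the hypotheses in H, the labeled set D, and the domain-theory classifier B are assumed to be callables, and error_B is estimated on a sample of unlabeled instances:

def error_D(h, D):
    """Fraction of labeled examples (x, y) in D that h misclassifies."""
    return sum(1 for x, y in D if h(x) != y) / len(D)

def error_B(h, B, instances):
    """Estimated probability that h disagrees with the domain theory B
    on a randomly drawn instance."""
    return sum(1 for x in instances if h(x) != B(x)) / len(instances)

def best_hypothesis(H, D, B, instances, k_D=1.0, k_B=1.0):
    """Return the h in H minimizing k_D * error_D(h) + k_B * error_B(h)."""
    return min(H, key=lambda h: k_D * error_D(h, D) + k_B * error_B(h, B, instances))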

2.2 Hypothesis Space Search

As always, we view the problem as that of search over H, where:

H is the hypothesis space.
O is the set of operators (search steps).
G is the search objective.

We see that we have several possible approaches:

Use prior knowledge to derive an initial hypothesis from which to begin the search.
Use prior knowledge to alter the objective of the hypothesis space search.
Use prior knowledge to alter the available search steps.

3 KBANN
The Knowledge-Based Artificial Neural Network (KBANN [3]) algorithm uses prior knowledge to derive an initial hypothesis from which to begin the search.
It first constructs an ANN that classifies every instance as the domain theory would.
So, if B is correct then we are done!
Otherwise, we use Backpropagation to train the network.

3.1 KBANN Algorithm


KBANN(domainTheory, trainingExamples)

domainTheory: a set of propositional, non-recursive Horn clauses

1. For each instance attribute, create a network input.
2. For each Horn clause in domainTheory, create a network unit:
   1. Connect its inputs to the attributes tested by the clause's antecedents.
   2. Each non-negated antecedent gets a weight W.
   3. Each negated antecedent gets a weight -W.
   4. The threshold weight is -(n - .5)W, where n is the number of non-negated antecedents.
3. Make all other connections between layers, giving these very low weights.
4. Apply Backpropagation using trainingExamples.
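
To make step 2 concrete, here is a rough Python sketch (an illustration under my own assumptions, not KBANN's actual implementation) of how one unit's initial weights could be derived from a single Horn clause, written as a (head, antecedents) pair with a leading "~" marking a negated antecedent:

W = 4.0          # the large weight W given to antecedent links (the value is an assumption)
EPSILON = 0.01   # the "very low" weight used for all other connections

def unit_weights(clause, inputs):
    """Initial (weights, threshold) for the sigmoid unit built from one clause.
    inputs lists everything the unit may connect to: instance attributes and
    the outputs of units built from other clauses."""
    head, antecedents = clause
    weights = {name: EPSILON for name in inputs}   # step 3: near-zero links everywhere
    n_positive = 0
    for literal in antecedents:
        if literal.startswith("~"):
            weights[literal[1:]] = -W              # negated antecedent gets -W
        else:
            weights[literal] = W                   # non-negated antecedent gets W
            n_positive += 1
    threshold = -(n_positive - 0.5) * W            # unit activates only when the clause body holds
    return weights, threshold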

3.2 KBANN Example

Domain theory

Cup ← Stable, Liftable, OpenVessel
Stable ← BottomIsFlat
Liftable ← Graspable, Light
Graspable ← HasHandle
OpenVessel ← HasConcavity, ConcavityPointsUp
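
For reference, the same theory written in the (head, antecedents) clause format assumed by the sketch in section 3.1:

cup_theory = [
    ("Cup",        ["Stable", "Liftable", "OpenVessel"]),
    ("Stable",     ["BottomIsFlat"]),
    ("Liftable",   ["Graspable", "Light"]),
    ("Graspable",  ["HasHandle"]),
    ("OpenVessel", ["HasConcavity", "ConcavityPointsUp"]),
]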

Training Examples
Cup Non-Cups
BottomIsFlat X X X X X X X X
ConcavityPointsUp X X X X X X X
Expensive X X X X
Fragile X X X X X X
HandleOnTop X X
HandleOnSide X X X
HasConcavity X X X X X X X X X
HasHandle X X X X X
Light X X X X X X X X
MadeOfCeramic X X X X
MadeOfPaper X X
MadeOfStyrofoam X X X X

3.3 KBANN Example Network

3.4 After Training

3.5 KBANN Results

In classifying promoter regions in DNA, Backpropagation got an 8/106 error rate, while KBANN got 4/106.
KBANN typically generalizes more accurately than backpropagation.

3.6 Hypothesis Space Search

We can view it as search in H.

KBANN starts at a better spot.
As such, it is likely to converge to a hypothesis that generalizes beyond the data in a way that is similar to the domain theory's predictions.
On the negative side, KBANN can only deal with propositional domain theories.

4 TangentProp
The TangentProp [4] algorithm incorporates the prior knowledge into the error criterion minimized by gradient descent.
Specifically, the prior knowledge is in the form of known derivatives of the target function.

4.1 TangentProp Example


$X$ are images of single handwritten characters.
The task is to correctly classify these characters.
B is "the target function is invariant to small rotations of the character in the image". How do we express this mathematically?
Define a transformation $s(\alpha, x)$ which rotates $x$ by $\alpha$ degrees. Then say
\[ \frac{\partial f(s(\alpha, x_i))}{\partial \alpha} = 0 \]
Then incorporate these into the error function. How?
Add an additional term to the error function:
\[ E = \sum_i \left[ (f(x_i) - \hat{f}(x_i))^2 + \mu_i \sum_j \left( \frac{\partial f(s_j(\alpha, x_i))}{\partial \alpha} - \frac{\partial \hat{f}(s_j(\alpha, x_i))}{\partial \alpha} \right)^{2}_{\alpha=0} \right] \]
Then modify the gradient descent rule to use this. How? Not shown.
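
A small Python sketch of the added penalty term (my own illustration, with a central finite difference standing in for the analytic derivative that TangentProp actually uses); f_hat is the current network, s is the transformation, and mu weights the penalty:

def tangent_penalty(f_hat, s, x, mu, target_derivative=0.0, eps=1e-3):
    """mu * (d f_hat(s(alpha, x)) / d alpha at alpha = 0  minus the known derivative)^2."""
    # finite-difference estimate of the derivative of the network's output at alpha = 0
    d_f_hat = (f_hat(s(+eps, x)) - f_hat(s(-eps, x))) / (2 * eps)
    return mu * (d_f_hat - target_derivative) ** 2

With rotation invariance the known derivative is 0, so the penalty simply punishes any change in the network's output under a small rotation of x.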

4.2 TangentProp Search

The value of $\mu$ must be chosen carefully by the designer.
TangentProp is not robust to errors in the prior knowledge (they throw off Backpropagation).
TangentProp searches for a (possibly) different hypothesis than plain Backpropagation; it searches along a different path.

5 EBNN
The Explanation-Based Neural Network (EBNN [5]) algorithm extends TangentProp.
It computes the derivatives itself.
The value of $\mu$ is chosen independently for each example.
It represents the domain theory with a collection of neural networks.
Then, it learns the target function as another network.

5.1 EBNN Example

There is one network for each of the Horn clauses in the domain theory.
EBNN uses the top network to calculate the partial derivative of the prediction with respect to each feature of the instance (i.e., how much does the output change as I tweak BottomIsFlat?).
These derivatives are given to the bottom network, which is trained with a variation of TangentProp.
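
As a sketch of that step (my own finite-difference stand-in; EBNN derives these slopes analytically from the domain-theory networks' weights), domain_net is the composed domain-theory prediction and x is a list of feature values:

def explanation_slopes(domain_net, x, eps=1e-3):
    """Estimate d domain_net(x) / d x_j for every feature j of instance x."""
    slopes = []
    for j in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += eps                 # tweak feature j up ...
        x_minus[j] -= eps                # ... and down
        slopes.append((domain_net(x_plus) - domain_net(x_minus)) / (2 * eps))
    return slopes

These per-feature slopes become the target derivatives in the TangentProp-style error term, with $\mu$ chosen independently for each example as noted above.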

5.2 EBNN Summary


EBNN has been shown to generalize more accurately than backpropagation, especially when training data is scarce.
It has been used to learn to control a simulated mobile robot.
EBNN, like Prolog-EBG, constructs explanations, but they are based on a domain theory consisting of neural networks rather than Horn clauses.
EBNN accommodates imperfect domain theories.
EBNN learns a fixed-size network, so it might be unable to represent complex functions.

6 FOCL
The FOCL [6] system is an extension of the purely inductive FOIL.
FOIL generates each candidate specialization by adding a new literal to the preconditions.
FOCL generates additional specializations based on the domain theory. How?
We say that a literal is operational if it is allowed to be used in describing an output hypothesis.
At each point in the search, FOCL adds candidates by (see the sketch after this list):
1. Considering each operational literal as a candidate new precondition.
2. Creating an operational, logically sufficient condition for the target concept according to the domain theory, then pruning any unnecessary preconditions.
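
A toy Python sketch of step 2 (my own illustration, not FOCL's code): unfold the target concept through a propositional domain theory, written as (head, body) clause pairs, until only operational literals remain:

def unfold(literal, theory, operational):
    """Expand a literal into operational literals using the domain theory."""
    if literal in operational:
        return [literal]
    for head, body in theory:
        if head == literal:
            # replace the literal by the unfolded body of its defining clause
            return [lit for b in body for lit in unfold(b, theory, operational)]
    return [literal]   # no defining clause: leave the literal as it is

With the cup theory from section 3.2 and the observable attributes as the operational set, unfold("Cup", cup_theory, operational) returns BottomIsFlat, HasHandle, Light, HasConcavity, and ConcavityPointsUp; FOCL would then prune any of these preconditions that the training data shows to be unnecessary.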

6.1 FOCL Example

Note that in order to get the rule we had to "unfold" the domain theory to get to operational literals.

6.2 FOCL Search

FOCL's theory-suggested specializations correspond to large steps in FOIL's search.
The bias is a preference for Horn clauses most similar to the operational, logically sufficient conditions entailed by the domain theory.
Test results show that FOCL achieves a lower error rate than FOIL.

URLs

1. Machine Learning book at Amazon, http://www.amazon.com/exec/obidos/ASIN/0070428077/multiagentcom/
2. Slides by Tom Mitchell on Machine Learning, http://www-2.cs.cmu.edu/~tom/mlbook-chapter-slides.html
3. KBANN paper, http://citeseer.nj.nec.com/towell94knowledgebased.html
4. TangentProp paper, http://research.microsoft.com/~patrice/PS/tang_prop.ps
5. EBNN paper, http://citeseer.nj.nec.com/mitchell92explanationbased.html
6. FOCL paper, http://citeseer.nj.nec.com/pazzani92utility.html

This talk is available at http://jmvidal.cse.sc.edu/talks/indanalytical/ (17 April 2003, 12:26PM).

Copyright © 2009 José M. Vidal. All rights reserved.