Combining Inductive and Analytical Learning


This talk is based on

Tom M. Mitchell. Machine Learning [1]. McGraw-Hill, 1997. Chapter 12,
and his slides [2].

1 Motivation
                  Inductive Learning                 Analytical Learning
Goal              Hypothesis fits data               Hypothesis fits domain theory
Justification     Statistical inference              Deductive inference
Advantages        Requires little prior knowledge    Learns from scarce data
Pitfalls          Scarce data, incorrect bias        Imperfect domain theory

They seem very complementary.

We want something in between!
Example: If you had to review a medical database and learn "symptoms for which drug X is more effective than Y", you would look at the relevant attributes (temperature, not insurance), then refine using the data.
Some domain theory (background knowledge) does squeeze into inductive methods, e.g., when choosing an encoding.

1.1 What We Want

We want a learning method such that:

1. Given no domain theory, it should be as good as purely inductive methods.
2. Given a perfect domain theory, it should be as good as analytical methods.
3. Given an imperfect domain theory and imperfect data, it should combine the two and do better than both the inductive and the analytical methods.
4. Accommodate an unknown level of error in the training data.
5. Accommodate an unknown level of error in the domain theory.

2 The Learning Problem


The learning problem:

D is the set of training examples, maybe with errors.
B is the domain theory, maybe with errors.
H is the space of candidate hypotheses.

Determine

A hypothesis that best fits the training examples and the domain theory.

2.1 Best Fits?

So, what does "best fits" mean?

Use the old $error_D(h)$ definition: the proportion of examples from $D$ misclassified by $h$.
But what about B?
Define $error_B(h)$ to be the probability that $h$ will disagree with B on the classification of a randomly drawn instance. Then find
\[ h = \argmin_{h \in H} k_D\, error_D(h) + k_B\, error_B(h) \]
But what do we set $k_D$ and $k_B$ to?
Which one is more reliable, the data or the theory?
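
As a concrete reading of this objective, here is a minimal Python sketch (my own illustration, not from the slides); the hypotheses in H, the labeled set D, and the domain-theory classifier B are assumed to be callables, and error_B is estimated on a sample of unlabeled instances:

def error_D(h, D):
    """Fraction of labeled examples (x, y) in D that h misclassifies."""
    return sum(1 for x, y in D if h(x) != y) / len(D)

def error_B(h, B, instances):
    """Estimated probability that h disagrees with the domain theory B
    on a randomly drawn instance."""
    return sum(1 for x in instances if h(x) != B(x)) / len(instances)

def best_hypothesis(H, D, B, instances, k_D=1.0, k_B=1.0):
    """Return the h in H minimizing k_D * error_D(h) + k_B * error_B(h)."""
    return min(H, key=lambda h: k_D * error_D(h, D) + k_B * error_B(h, B, instances))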

2.2 Hypothesis Space Search

As always, we view the problem as that of search over H, where:

H is the hypothesis space.
O is the set of operators (search steps).
G is the search objective.

We see that we have several possible approaches:

Use prior knowledge to derive an initial hypothesis from which to begin the search.
Use prior knowledge to alter the objective of the hypothesis space search.
Use prior knowledge to alter the available search steps.

3 KBANN
The Knowledge-Based Artificial Neural Network (KBANN [3]) algorithm uses prior knowledge to derive an initial hypothesis from which to begin the search.
It first constructs an ANN that classifies every instance as the domain theory would.
So, if B is correct then we are done!
Otherwise, we use Backpropagation to train the network.

3.1 KBANN Algorithm


KBANN(domainTheory, trainingExamples)

domainTheory: a set of propositional, non-recursive Horn clauses

1. For each instance attribute, create a network input.
2. For each Horn clause in domainTheory, create a network unit:
   1. Connect its inputs to the attributes tested by the clause's antecedents.
   2. Each non-negated antecedent gets a weight W.
   3. Each negated antecedent gets a weight -W.
   4. The threshold weight is -(n - .5)W, where n is the number of non-negated antecedents.
3. Make all other connections between layers, giving these very low weights.
4. Apply Backpropagation using trainingExamples.
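
To make step 2 concrete, here is a rough Python sketch (an illustration under my own assumptions, not KBANN's actual implementation) of how one unit's initial weights could be derived from a single Horn clause, written as a (head, antecedents) pair with a leading "~" marking a negated antecedent:

W = 4.0          # the large weight W given to antecedent links (the value is an assumption)
EPSILON = 0.01   # the "very low" weight used for all other connections

def unit_weights(clause, inputs):
    """Initial (weights, threshold) for the sigmoid unit built from one clause.
    inputs lists everything the unit may connect to: instance attributes and
    the outputs of units built from other clauses."""
    head, antecedents = clause
    weights = {name: EPSILON for name in inputs}   # step 3: near-zero links everywhere
    n_positive = 0
    for literal in antecedents:
        if literal.startswith("~"):
            weights[literal[1:]] = -W              # negated antecedent gets -W
        else:
            weights[literal] = W                   # non-negated antecedent gets W
            n_positive += 1
    threshold = -(n_positive - 0.5) * W            # unit activates only when the clause body holds
    return weights, threshold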

3.2 KBANN Example

Domain theory

Cup ← Stable, Liftable, OpenVessel
Stable ← BottomIsFlat
Liftable ← Graspable, Light
Graspable ← HasHandle
OpenVessel ← HasConcavity, ConcavityPointsUp
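
For reference, the same theory written in the (head, antecedents) clause format assumed by the sketch in section 3.1:

cup_theory = [
    ("Cup",        ["Stable", "Liftable", "OpenVessel"]),
    ("Stable",     ["BottomIsFlat"]),
    ("Liftable",   ["Graspable", "Light"]),
    ("Graspable",  ["HasHandle"]),
    ("OpenVessel", ["HasConcavity", "ConcavityPointsUp"]),
]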

Training Examples
Cup Non-Cups
BottomIsFlat X X X X X X X X
ConcavityPointsUp X X X X X X X
Expensive X X X X
Fragile X X X X X X
HandleOnTop X X
HandleOnSide X X X
HasConcavity X X X X X X X X X
HasHandle X X X X X
Light X X X X X X X X
MadeOfCeramic X X X X
MadeOfPaper X X
MadeOfStyrofoam X X X X

3.3 KBANN Example Network

3.4 After Training

3.5 KBANN Results

In classifying promoter regions in DNA, Backpropagation got an 8/106 error rate, while KBANN got 4/106.
KBANN typically generalizes more accurately than backpropagation.

3.6 Hypothesis Space Search

We can view it as search in H.

KBANN starts at a better spot.
As such, it is likely to converge to a hypothesis that generalizes beyond the data in a way that is similar to the domain theory's predictions.
On the negative side, KBANN can only deal with propositional domain theories.

4 TangentProp
The TangentProp [4] algorithm incorporates the prior knowledge into the error criterion minimized by gradient descent.
Specifically, the prior knowledge is in the form of known derivatives of the target function.

4.1 TangentProp Example


$X$ are images of single handwritten characters.
The task is to correctly classify these characters.
B is "the target function is invariant to small rotations of the character in the image". How do we express this mathematically?
Define a transformation $s(\alpha, x)$ which rotates $x$ by $\alpha$ degrees. Then say
\[ \frac{\partial f(s(\alpha, x_i))}{\partial \alpha} = 0 \]
Then incorporate these into the error function. How?
Add an additional term to the error function:
\[ E = \sum_i \left[ (f(x_i) - \hat{f}(x_i))^2 + \mu_i \sum_j \left( \frac{\partial f(s_j(\alpha, x_i))}{\partial \alpha} - \frac{\partial \hat{f}(s_j(\alpha, x_i))}{\partial \alpha} \right)^{2}_{\alpha=0} \right] \]
Then modify the gradient descent rule to use this. How? Not shown.
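
A small Python sketch of the added penalty term (my own illustration, with a central finite difference standing in for the analytic derivative that TangentProp actually uses); f_hat is the current network, s is the transformation, and mu weights the penalty:

def tangent_penalty(f_hat, s, x, mu, target_derivative=0.0, eps=1e-3):
    """mu * (d f_hat(s(alpha, x)) / d alpha at alpha = 0  minus the known derivative)^2."""
    # finite-difference estimate of the derivative of the network's output at alpha = 0
    d_f_hat = (f_hat(s(+eps, x)) - f_hat(s(-eps, x))) / (2 * eps)
    return mu * (d_f_hat - target_derivative) ** 2

With rotation invariance the known derivative is 0, so the penalty simply punishes any change in the network's output under a small rotation of x.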

4.2 TangentProp Search

The value of $\mu$ must be chosen carefully by the designer.
TangentProp is not robust to errors in the prior knowledge (they throw off Backpropagation).
TangentProp searches for a (possibly) different hypothesis than plain Backpropagation; it searches along a different path.

5 EBNN
The Explanation-Based Neural Network (EBNN [5]) algorithm extends TangentProp.
It computes the derivatives itself.
The value of $\mu$ is chosen independently for each example.
It represents the domain theory with a collection of neural networks.
Then, it learns the target function as another network.

5.1 EBNN Example

There is one network for each of the Horn clauses in the domain theory.
EBNN uses the top network to calculate the partial derivative of the prediction with respect to each feature of the instance (i.e., how much does the output change as I tweak BottomIsFlat?).
These derivatives are given to the bottom network, which is trained with a variation of TangentProp.
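
As a sketch of that step (my own finite-difference stand-in; EBNN derives these slopes analytically from the domain-theory networks' weights), domain_net is the composed domain-theory prediction and x is a list of feature values:

def explanation_slopes(domain_net, x, eps=1e-3):
    """Estimate d domain_net(x) / d x_j for every feature j of instance x."""
    slopes = []
    for j in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += eps                 # tweak feature j up ...
        x_minus[j] -= eps                # ... and down
        slopes.append((domain_net(x_plus) - domain_net(x_minus)) / (2 * eps))
    return slopes

These per-feature slopes become the target derivatives in the TangentProp-style error term, with $\mu$ chosen independently for each example as noted above.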

5.2 EBNN Summary


EBNN has been shown to generalize more accurately than backpropagation, especially when training data is scarce.
It has been used to learn to control a simulated mobile robot.
EBNN, like Prolog-EBG, constructs explanations, but they are based on a domain theory consisting of neural networks rather than Horn clauses.
EBNN accommodates imperfect domain theories.
EBNN learns a fixed-size network, so it might be unable to represent complex functions.

6 FOCL
The FOCL [6] system is an extension of the purely inductive FOIL.
FOIL generates each candidate specialization by adding a new literal to the preconditions.
FOCL generates additional specializations based on the domain theory. How?
We say that a literal is operational if it is allowed to be used in describing an output hypothesis.
At each point in the search, FOCL adds candidates by (see the sketch after this list):
1. Considering each operational literal as a candidate new precondition.
2. Creating an operational, logically sufficient condition for the target concept according to the domain theory, then pruning any unnecessary preconditions.
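
A toy Python sketch of step 2 (my own illustration, not FOCL's code): unfold the target concept through a propositional domain theory, written as (head, body) clause pairs, until only operational literals remain:

def unfold(literal, theory, operational):
    """Expand a literal into operational literals using the domain theory."""
    if literal in operational:
        return [literal]
    for head, body in theory:
        if head == literal:
            # replace the literal by the unfolded body of its defining clause
            return [lit for b in body for lit in unfold(b, theory, operational)]
    return [literal]   # no defining clause: leave the literal as it is

With the cup theory from section 3.2 and the observable attributes as the operational set, unfold("Cup", cup_theory, operational) returns BottomIsFlat, HasHandle, Light, HasConcavity, and ConcavityPointsUp; FOCL would then prune any of these preconditions that the training data shows to be unnecessary.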

6.1 FOCL Example

Note that in order to get the rule we had to "unfold" the domain theory to get to operational literals.

6.2 FOCL Search

FOCL's theory-suggested specializations correspond to large steps in FOIL's search.
The bias is a preference for Horn clauses most similar to the operational, logically sufficient conditions entailed by the domain theory.
Test results show that FOCL achieves a lower error rate than FOIL.

URLs

1. Machine Learning book at Amazon, http://www.amazon.com/exec/obidos/ASIN/0070428077/multiagentcom/
2. Slides by Tom Mitchell on Machine Learning, http://www-2.cs.cmu.edu/~tom/mlbook-chapter-slides.html
3. KBANN paper, http://citeseer.nj.nec.com/towell94knowledgebased.html
4. TangentProp paper, http://research.microsoft.com/~patrice/PS/tang_prop.ps
5. EBNN paper, http://citeseer.nj.nec.com/mitchell92explanationbased.html
6. FOCL paper, http://citeseer.nj.nec.com/pazzani92utility.html

This talk is available at http://jmvidal.cse.sc.edu/talks/indanalytical/ (17 April 2003, 12:26PM).

Copyright © 2009 José M. Vidal. All rights reserved.