
Computational Intelligence, Volume 22, Number 1, 2006

NONMONOTONIC LOGIC AND STATISTICAL INFERENCE


HENRY E. KYBURG, JR.
Computer Science and Philosophy, University of Rochester, Rochester, NY 14627, USA; Institute
for Human and Machine Cognition, 40 South Alcaniz Street, Pensacola, FL 32502, USA
CHOH MAN TENG
Institute for Human and Machine Cognition, 40 South Alcaniz Street, Pensacola, FL 32502, USA
Classical statistical inference is nonmonotonic: obtaining more evidence or obtaining more knowledge about
the evidence one has can lead to the replacement of one statistical conclusion by another, or the complete withdrawal
of the original conclusion. While it has long been argued that not all nonmonotonic inference can be accounted for
in terms of relative frequencies or objective probabilities, there is no doubt that much nonmonotonic inference can
be accounted for in this way. Here we seek to explore the close connection between classical statistical inference and
default logic, treating statistical inference within the framework of default logic, and showing that nonmonotonic
logic in general, and default logic in particular, needs to take account of certain features of statistical inference.
Default logic must take account of statistics, but at the same time statistics can throw light on problematic cases of
default inference.
Key words: nonmonotonic logic, default logic, probability, statistical inference, thresholding.

1. INTRODUCTION
It is natural to think of the grounds for the defaults studied by Reiter (1980) and others
(e.g., Lukaszewicz 1988; Brewka 1991; Gelfond et al. 1991; Delgrande, Schaub, and Jackson
1994; Mikitiuk and Truszczynski 1995) as lying in what we know of relative frequencies in
the world: practically all birds fly, practically all husbands and wives live in the same city.
But there are those who have denied that high relative frequency is necessary for default
reasoning (e.g., McCarthy 1986). Our concern is not with this issue, but with a related one:
classical statistical inference. Classical statistical inference is based on the idea that samples
are somehow typical of the populations from which they are drawn. This is true of most
samples. But the operative principles of statistical inference are somewhat different: they
have to do, not with exploiting typicality or high frequencies directly, but with the control of
frequencies of error (Mayo 1996).
Not only do the terms in which classical statistical inference is conducted conform
closely to the framework of default inference, but we believe the distinction between the
prerequisite of an inference, the premise without which the inference cannot take place, and
the justifications of the inference, the statements whose denials must not be believed if the inference
is to go through, is a distinction that can throw considerable light on the statistical inference
(Kyburg and Teng 1999).
On the other hand, while some default inferences seem uncontroversial (Tweety flies
comes to mind), there are other default inferences, or groups of default inferences that seem
more controversial. For example, a person may be both a Quaker (and thus, typically, a
pacifist) and a Republican (and thus, typically, a hawk). We shall claim that considerations
that guide our choice of reference classes in probability and statistics may also prove valuable
in sorting out these difficult sets of default inferences.
First steps toward this problem, the problem of approaching statistical inference as a kind of default inference, were taken in Kyburg and Teng (1999), but the issue was not treated in depth there. The role of statistical considerations in default inference, similarly, was only touched upon in Kyburg and Teng (2001).

This material is based upon work supported by NSF IIS 0328849, NASA NNA04CK88A, and ONR N00014-03-1-0516.

2. STATISTICAL INFERENCE
Statistical inference comes in many forms. Most of these forms require something in the
way of background knowledge, though it is interesting that confidence interval inference to
class ratios arguably need not depend on empirical assumptions. We take statistical inference
as involving the tentative acceptance (or rejection) of a statistical hypothesis, or of some
statement equivalent to a statistical hypothesis. For example, measurement is often taken
to yield an interval within which a physical quantity is taken to fall. But since the true
value of that quantity is the mean of the (hypothetical) unbounded population of adjusted
measurements of that quantity, to say that a quantity lies in an interval is to say that the mean
of an unbounded population of measurements lies in that interval. Even measurement can be
construed as statistical inference.
Much of science uses statistical inference in a more transparent way. According to some
writers (Chow 1996), most scientific inference can be construed in terms of significance
testing. While we do not accept that thesis, we do agree that significance testing plays an
important role in many branches of science. Hypothesis testing, which can be seen as the foundation of statistical inference in the prevalent classical British-American school (Lehmann 1959), can be extended in many ways, and can, in particular, be taken as a foundation for
confidence interval analysis. Finally, there is Bayesian statistics. Bayesian statistics is often
so called because it allows us to take account of background knowledge, which is more extensive than is ordinarily dealt with by classical statistics, and thereby allows the use of Bayes' theorem in ways that would be prohibited in classical terms. It is also, sometimes, construed as more extreme, as denying that statistical hypotheses are ever legitimately accepted and
insisting that all we can do (and all we ought to do) on the basis of experience is to update
our probabilities, including the probabilities we assign to statistical hypotheses. This latter
possibility is one we shall not focus on, since it does not seem to us to involve nonmonotonicity. When one updates probabilities by conditioning, the new probability is based on
the same probability function with which one started: P(A | B) = P(A ∧ B)/P(B). No new
function need be introduced; we do not withdraw the results of previous inference. To be
sure, the agent's confidence in A changes with the observation of B, and changes again with
the further observation of C. But what we infer on the basis of the observation B is P(A | B)
and not P(A); what we infer on the basis of the further observation C is P(A | B ∧ C), not yet
another value of P(A).
2.1. Significance Testing
The guiding idea behind significance testing is the control of error (Fisher 1971,
Chapter 2). A classic example of simple significance testing is this: Suppose that hypothesis
H 0 , the null hypothesis, asserts that the relative frequency of defective toasters manufactured
on a certain production line is less than 10%. We design a test for this hypothesis; for example,
we decide to examine n toasters from the line and to reject H 0 if more than k are defective.
As is characteristic of significance tests, we arrive at no conclusion if k or fewer toasters in our sample are defective: we simply fail to reject H0. We may ask, as thoughtful statisticians have, what it means to reject a hypothesis. To reject H0 is to perform a positive act; it is to abandon suspension of belief; it is to accept ¬H0, i.e., it is to accept that at least 10%
of the toasters produced by that line are defective. To fail to reject H 0 , on the other hand,
calls for no action at all; we remain completely agnostic about the frequency of defective
toasters.
The significance test is characterized as follows: The size of the test, α, is the maximum long run chance of falsely rejecting the hypothesis under test, H0, i.e., the long run relative frequency of obtaining more than k defective toasters out of n, given that less than 10% in general are defective. Given that H0 is true, we can calculate this maximum long run frequency of error precisely.
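As a minimal illustration of this before-test calculation (our own sketch, not part of the original text; the sample size and cutoff below are chosen only for the example), the size of such a test can be computed directly from the binomial model:

    from math import comb

    def significance_test_size(n, k, p0=0.10):
        """Long run frequency of falsely rejecting H0 (defect rate < p0) when we
        reject on observing more than k defectives out of n, evaluated at the
        boundary rate p0, where that frequency is largest."""
        return sum(comb(n, i) * p0**i * (1 - p0)**(n - i) for i in range(k + 1, n + 1))

    # Examining 100 toasters and rejecting when more than 15 are defective keeps
    # the long run chance of false rejection below 5%.
    print(significance_test_size(100, 15))   # about 0.04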
This is the before test analysis of the significance test of H 0 . This does not end the
design of the test. We must stipulate how these n toasters are to be selected from those
produced (they must be selected in such a way that the long run frequency of error applies
to the sample we have selected), and we must also specify the standard according to which
they will be judged defective or not.
Finally, we must actually perform the sampling and testing, and draw the conclusion, "Reject H0" or "Do not reject H0," according as we observe more than k defectives in our sample or not.
We point out two things. First, if H0 is rejected, this is really, and not just metaphorically, inference. When we reject H0, we accept ¬H0. We stop the production line and look for the source of the excess defects. If H0 is the "no effect" hypothesis for a cold remedy, we
devise more and more sensitive tests; we pursue the development of the remedy, at least for
a while.
Of course, if we fail to reject H 0 , we have inferred nothing. The possibilities are just
what they were before our test.
The second thing to note is that this inference, when it occurs, is nonmonotonic: further
premises (more data) could lead to our retracting the conclusion ¬H0. For example, we could
look at a larger sample, and note a smaller fraction of defectives. But there are also less direct
grounds we could have for rejecting the conclusion. For example, we might learn that when
the production line first starts up there are a large number of defects, until the machinery gets
warmed up, and that our sample was taken from the start of the line. Or we might learn that
the sample was taken and analyzed by an engineer who is notoriously careless.
In the process of testing, there are a number of conditions we must think about; we will
come back to them in Section 5, after a brief review of evidential probability and default logic,
since these conditions are best construed, we will argue, as justifications of the relevant
statistical default rules.
2.2. Hypothesis Testing
In hypothesis testing, we take account not only of the error of falsely rejecting a hypothesis
that is true (the possibility of false rejection in significance testing) but we also take account
of the error of failing to reject a hypothesis that is false. The first of these errors is Type I
error; it is also called the size of a test, and often denoted by α. The second of these errors, called error of Type II, is often denoted by β; 1 − β is the power of a test.
We illustrate these ideas with a simple example, in which we test the hypothesis H 0 that
10% of the toasters in a shipment are defective against the hypothesis H 1 that 60% of the
toasters are defective. We look at a sample of n. There are many possibilities for this sample,
but for present purposes we can ignore the order in which the defective toasters appear. Thus,
we focus on the n + 1 points in the sample space corresponding to the number of possible
defective toasters in our sample.
Given the hypothesis under test, H0, we can calculate the proportion of samples that have 0, 1, . . . , n defective toasters. If we want the size of our test to be 0.05, we want to be sure that we do not reject H0, given that it is true, in more than 5% of the cases. We will achieve this goal if we choose a value of k such that Σ_{i≤k} C(n, i) 0.1^i 0.9^(n−i) is < 0.05, rejecting H0 just when the number of defectives observed is at most k. That is easy: for n = 100, we could choose k = 0. The probability of error that we associate with this element of the sample space is C(100, 0) 0.1^0 0.9^100 = 0.9^100, a very small number indeed.
But while this would give us excellent protection against falsely rejecting H 0 , we would
be led to accept the alternative hypothesis H 1 far too infrequently. (We do not deal with
the possibility of suspending judgment; that would be a third alternative that would not add
anything at this stage of our inquiry.) So what we want to do is to keep α at 0.05, and at the
same time maximize the chances of rejecting H 0 when it is false. Just as we can calculate the
relative frequency with which H 0 will be falsely rejected, so we can calculate the frequency
with which it will erroneously fail to be rejected. We can do this because we are testing a
simple hypothesis against a simple alternative. Suppose that the critical value of k is k 0 : if we
find k 0 or more defects in our sample, we reject H 0 . Despite the fact that we have set this up
as a test of one hypothesis against another, to reject one is not to be committed to the other:
to reject H0 is to accept ¬H0, but not to accept H1, unless we have the disjunction H0 ∨ H1
as part of our background knowledge.
To fail to reject H 0 need not be to accept H 0 , and so, of course, need not be to reject
H 1 ; nor, of course, is the rejection of H 0 tantamount to the acceptance of H 1 . It may be, for
example, that when we control error of the first kind to be less than 0.05, our test may allow
error of the second kind to be 0.30, far too large to warrant acceptance of H1.
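Numerically (our own illustration, using only the toy figures above), the trouble with attending to size alone is easy to see: a rejection region chosen merely because its probability under H0 is small may have almost no chance of occurring under H1 either.

    # Degenerate rejection region {0 defectives out of 100}:
    # small size under H0 (10% defective), but essentially no chance of
    # rejecting under H1 (60% defective), so H1 would almost never be accepted.
    size_under_h0 = 0.9 ** 100     # about 2.7e-05, comfortably below 0.05
    power_against_h1 = 0.4 ** 100  # about 1.6e-40, effectively zero
    print(size_under_h0, power_against_h1)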
Again we emphasize, first, that when we accept ¬H0 we are performing a genuine
inference. We send the shipment of toasters back, and we publish results, and we conduct
new experiments. With the possible exception of the uncertainties involved, this is an inference
in the same sense as that which allows us to infer the velocity of a 1 gram bullet from the
acceleration of a 10 kg block of wood into which it has been fired.
And second, this is a nonmonotonic inference. We may reinstate H 0 if we examine
the whole shipment and find less than 10% defectives. Even the examination of a larger
sample might lead us to withdraw our conclusion. Learning that the sampling was done by
an undercover agent of a competing supplier could also undo our acceptance of ¬H0.

2.3. Confidence Intervals


The simplest example of a confidence interval inference is measurement, say the measurement of a length. A method of measurement M is characterized by the distribution of
errors that it produces. In the best cases, the errors are known to be distributed approximately
normally, with a mean of 0, and a well-established standard deviation d. The true length of
an object may be taken to be the mean of all possible measurements of it.
Let us apply method M to the measurement of an object o; suppose we obtain the result
m(o). Of course, we cannot conclude that the length of o is m(o); measurements, as we just
noted, are subject to error. Nor can we impose bounds on the possible value of the true length
l(o) of o: given that the distribution of error is normal, an error of any finite magnitude is
possible. But noting that the true length of o is the expectation of measurements of it, we can
construe inferring the true length as inferring that expectation, that is, making a statistical
inference. If M yields errors that are normal (0, d ), then measurements of o by this method
will be distributed normally with mean equal to the true length l(o), and the same standard
deviation d.
Suppose that one chance in a hundred of error strikes us as secure enough. We can
calculate positive numbers δ1 and δ2 such that the result of measuring o will lie outside the interval (l(o) − δ1, l(o) + δ2) at most 1% of the time. We can find many such numbers, but if
we also impose the constraint that δ1 + δ2 be a minimum, we obtain a unique result. Often, as in this case, the interval will be symmetric: δ1 = δ2 = δ.
The problem, of course, is applying this analysis to the particular result m(o) we obtained
when we conducted our measurement. This is what we do, without fretting, and often without
even thinking of it as statistical inference. If M is characterized by a standard deviation d,
and we observe m(o), we infer (with confidence 0.99) that the true length of o, l(o), is in the
interval m(o) ± 2.58d.
What this means is a matter of some dispute. Many classical statisticians will insist that
what is characterized by the number 0.99 is the method M, and that all we can say of m(o)
is that it is within a distance 2.58d of l(o) or that it is not. We shall argue later that there
is a sensible objective interpretation of probability according to which we can say that the
probability is 0.99 that l(o) lies within m(o) ± 2.58d.
For present purposes the issue is not that this is statistical inference, but that it is inference.
Having made the measurement, we conclude that o will fit in an opening that is m(o) + 2.58d wide. We conclude that it meets specification (or that it does not). We infer that the combined length of this and another object o′, whose true length is l(o′), will be m(o) + l(o′) ± 2.58d.
There is not much doubt that this is inference. Nor can there be much doubt that it
is nonmonotonic. Even a second measurement would undermine the conclusion, since we
would be well advised to take the results of both measurements into account. We might also
know something relevant about the provenance of the object o that bears on its length. Or we
might get access to other measurements of o. It is only in the absence of other information
that we infer that the true length of o is in the interval m(o) ± 2.58d.
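A small sketch of the interval computation (ours, not the authors'; the 2.58 in the text is the rounded 99.5th percentile of the standard normal, and NormalDist comes from Python's standard library):

    from statistics import NormalDist

    def measurement_interval(m_o, d, confidence=0.99):
        """Symmetric interval around an observed measurement m_o, for a method of
        measurement whose errors are normally distributed with mean 0 and
        standard deviation d."""
        z = NormalDist().inv_cdf(0.5 + confidence / 2)   # about 2.576 for 99%
        return (m_o - z * d, m_o + z * d)

    print(measurement_interval(10.00, 0.02))   # roughly (9.948, 10.052)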
2.4. Inferring Relative Frequencies
These examples have involved premises, statements we take ourselves to know or accept, as well as requiring that we not know or accept certain other things. There is a
form of statistical inference, first formalized by Clopper and Pearson (1934), that (arguably) requires no general statistical premises. Loosely speaking, since most samples from
a finite population of which a proportion p of individuals are B represent the proportion of
items that are B in that parent population, in the sense that the ratio of Bs in the sample is
close to p, we can infer from a sample something about this proportion. A bit less loosely
(in fact quite precisely), given a sample size n, in which a number m of Bs occur, and given
a confidence level, e.g., 0.99, we can compute upper and lower limits u(m/n ) and l(m/n )
such that u(m/n) − l(m/n) is a minimum, and the claim that the proportion p of Bs in the parent population lies between these limits deserves at least that confidence. No assumptions or approximations are involved, only inequalities. Although it is frequently argued that this
number, 0.99, is not a probability, that argument is based on a particular interpretation of
probability, and it is arguable from another, equally objective, point of view, that this number
can be construed as a probability.
In any event, it is clear that this is an inference: to argue from a premise concerning
a sample to a conclusion expressing a constraint on the relative frequency in the parent
population is an uncertain inference whose conclusion is supported but not entailed by its
premises. It is also abundantly clear that this inference is nonmonotonic. No doubt the reader's
mind is already buzzing with possibilities that would undermine the inference. For example,
we might already know what proportion of the population are Bs. We might obtain a larger
sample. We might know that the population in question comes from a set of populations in
which the distribution of the frequency p of B is known.
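The exact limits can be obtained by inverting binomial tail probabilities, as in the following sketch (our own rendering of the Clopper-Pearson construction; the function names are illustrative and only the standard library is used):

    from math import comb

    def binom_cdf(k, n, p):
        """P(X <= k) for X ~ Binomial(n, p)."""
        return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

    def bisect(f, lo, hi, iters=60):
        """Find a root of f on [lo, hi], assuming f(lo) and f(hi) differ in sign."""
        for _ in range(iters):
            mid = (lo + hi) / 2
            if (f(lo) > 0) == (f(mid) > 0):
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    def exact_limits(m, n, confidence=0.99):
        """Upper and lower confidence limits for the proportion of Bs in the
        parent population, given m Bs in a sample of n."""
        a = 1 - confidence
        lower = 0.0 if m == 0 else bisect(lambda p: 1 - binom_cdf(m - 1, n, p) - a / 2, 0.0, 1.0)
        upper = 1.0 if m == n else bisect(lambda p: binom_cdf(m, n, p) - a / 2, 0.0, 1.0)
        return lower, upper

    print(exact_limits(20, 100))   # roughly (0.11, 0.32)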

3. EVIDENTIAL PROBABILITY
We have already incurred the irritation (or worse) of classical statisticians by referring to
the probabilities of statistical hypotheses as possibly objective. We propose here to give a
very brief sketch of evidential probability, an approach to probability that justifies this usage.
Probability is assigned to sentences relative to a body of knowledge and evidence, on the
basis of known approximate relative frequencies. Probabilities are therefore interval valued.
The probability of a statement S, relative to a set of statements Γ constituting our evidence and background knowledge, is the interval [p, q] (Prob(S, Γ) = [p, q]) and is objectively
determined as follows.
The frequency footing of any probability is a set of statements of the form %x(τ(x), ρ(x), p, q),1 which assert that the relative frequency of objects (tuples) satisfying the formula ρ that also satisfy the formula τ lies between p and q. There are many such statements we know; for example, %x(x lands heads, x is a toss of a coin, 0.45, 0.55), including logically true ones such as %y(the proportion of As in y differs by less than k/(2√n) from the proportion in B in general, y is an n-membered subset of Bs, 1 − 1/k², 1.0). A sentence %x(τ(x), ρ(x), p, q) is a candidate for the probability of a sentence S provided that for some term t of our language the sentences τ(t) ≡ S and ρ(t) are in Γ.
In general there will be many such statistical statements %x(τ(x), ρ(x), p, q). Only three principles are needed to resolve this problem. First, however, we must define the technical relation of conflict between statistical statements.

Definition 1. Given two intervals [p, q] and [r, s], we say that [p, q] is nested in [r, s] iff r ≤ p ≤ q ≤ s. The two intervals conflict iff neither is nested in the other. The cover of two intervals [p, q] and [r, s] is the interval [min(p, r), max(q, s)].

Two statistical statements %x(τ(x), ρ(x), p, q) and %x(τ′(x), ρ′(x), p′, q′) conflict if and only if their associated intervals [p, q] and [p′, q′] conflict.
Note that conflicting intervals are not necessarily disjoint. They may overlap as long as
neither is included entirely in the other.
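Definition 1 translates directly into code; the following small helpers (our own, with illustrative names) are convenient for checking the examples that follow:

    def nested(a, b):
        """[p, q] = a is nested in [r, s] = b iff r <= p <= q <= s."""
        (p, q), (r, s) = a, b
        return r <= p <= q <= s

    def conflict(a, b):
        """Two intervals conflict iff neither is nested in the other."""
        return not nested(a, b) and not nested(b, a)

    def cover(a, b):
        """The smallest interval containing both a and b."""
        return (min(a[0], b[0]), max(a[1], b[1]))

    print(conflict((0.4, 0.6), (0.48, 0.61)))   # True: they overlap, but neither contains the other
    print(cover((0.4, 0.6), (0.48, 0.61)))      # (0.4, 0.61)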
Rule I
If two statistical statements conflict and the first is based on a marginal distribution while
the second is based on the full joint distribution, ignore the first. This gives conditional
probabilities pride of place when they conflict with the corresponding marginal probabilities.
We will call this the principle of conditioning.

Example. Suppose we have 30 black and 30 white balls in a collection C. The relative
frequency of black balls among balls in C is 0.5, and this could serve as a probability that a
specific ball a in C is black. But if the members of C are divided into three urns, one of which
contains 12 black balls and 28 white balls, and two of which each contains nine black balls
and one white ball, then if a is selected by a procedure that consists of (1) selecting an urn,
and (2) selecting a ball from that urn, the relative frequency of black balls is 1/3(12/40) +
1/3(9/10) + 1/3(9/10) = 0.70, and this is the appropriate probability that a, known to be
selected in this way, is black. The marginal distribution is given by 30 black balls out of 60;
1 These formulas are subject to constraints: they should exclude such predicates as "is an emerose" or "is grue," and they should mention the smallest justifiable intervals.

the full distribution reflects the division of the balls into the urns, and the distribution of
colors in each urn.

             urn 1   urn 2   urn 3   total
    black       12       9       9      30
    white       28       1       1      30
    total       40      10      10      60
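The two candidate values in this example are just the marginal frequency and the frequency induced by the two-stage selection procedure; a quick check (ours):

    marginal_black = 30 / 60                         # ignore the urns: 0.5
    procedure_black = (12/40 + 9/10 + 9/10) / 3      # pick an urn, then a ball: 0.70
    print(marginal_black, round(procedure_black, 2))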

Rule II
If two conflicting statistical statements both survive the principle of conditioning, and the
second employs a reference class that is known to be included in the first, ignore the first.
This embodies the well-known principle of specificity.

Example. Suppose we have a population of birds in a zoo, and that 95% of them can fly.
The other 5% are penguins, and cannot fly. Given that a bird is to be selected from the zoo
(chosen, for example, by its acquisition number), the relative frequency with which it will
be a flyer is 0.95, and this is also the appropriate probability that it can fly. Given that we
know that it is a penguin that is selected, the relative frequency of flyers is 0, and that is the
probability that it can fly.
Those statistical statements we are not licensed to ignore we will call relevant. A set of
statistical statements that contains every relevant statistical statement that conflicts with a
statement in it will be said to be closed under difference.
Rule III
The probability of S is the shortest cover of any nonempty set of relevant statistical statements
closed under difference; alternatively it is the intersection of all such covers. This is the
principle of strength.

Example. Suppose we are considering a toss of a coin. The coin has been extensively tested,
and we know that it yields heads between 0.4 and 0.6 of the time. It is being tossed by an
expert whose record supports the hypothesis that he gets heads between 0.48 and 0.61 of the
time. Of course, we also know that coin tosses in general yield heads very nearly half the
time, say between 0.495 and 0.505 of the time. There is nothing that conflicts with [0.495,
0.505], so the relative frequencies in the more specific classes (tosses of this coin: [0.40, 0.60],
tosses by the expert: [0.48, 0.61]) can legitimately be ignored. Note that the minimal covers
of sets of statements closed under difference are nested; one set, {[0.4, 0.6], [0.48, 0.61]}
yields [0.4, 0.61]; the other set is {[0.495, 0.505]}, yielding [0.495, 0.505]; the probability
is the shortest cover [0.495, 0.505].
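The coin example can be checked with the interval helpers sketched after Definition 1 (repeated here so the snippet stands alone; again this is our illustration, not part of the original):

    def nested(a, b):
        return b[0] <= a[0] <= a[1] <= b[1]

    def cover(a, b):
        return (min(a[0], b[0]), max(a[1], b[1]))

    coin_tosses_in_general = (0.495, 0.505)
    tosses_of_this_coin    = (0.40, 0.60)
    tosses_by_this_expert  = (0.48, 0.61)

    # The general interval is nested in both more specific ones, so nothing
    # conflicts with it; the principle of strength then picks the shortest cover.
    print(nested(coin_tosses_in_general, tosses_of_this_coin))    # True
    print(nested(coin_tosses_in_general, tosses_by_this_expert))  # True
    print(cover(tosses_of_this_coin, tosses_by_this_expert))      # (0.4, 0.61)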
In many cases, for example those involving well-calibrated gambling apparatus, we can get quite precise probabilities from the application of these principles.
4. DEFAULT LOGIC
Now let us turn to default logic. We follow in general the terminology in Reiter (1980),
extended to include numeric terms, but what we have to say applies as well to other nonmonotonic formalisms.
Let L be a two-sorted language, with standard propositional constructs and mathematical
constructs. Let ⊢ denote the provability operator.

Definition 2. A default rule d is an expression of the form

    α : β1, . . . , βn / γ,

where α, β1, . . . , βn, γ ∈ L. We call α the prerequisite, β1, . . . , βn the justifications, and γ the consequent of the default rule d. A default rule is normal if it is of the form α : γ / γ, and seminormal if it is of the form α : β ∧ γ / γ.

A default theory Δ is an ordered pair ⟨F, D⟩, where F is a set of sentences in L and D is a set of default rules.

Loosely speaking, a default rule α : β1, . . . , βn / γ conveys the idea that if α is accepted, and none of ¬β1, . . . , ¬βn are accepted, then by default we may assert γ. For a default theory Δ = ⟨F, D⟩, the known facts constitute F, and a theory extended from F by applying the default rules in D is known as an extension of Δ, defined formally via a fixed point formulation.

Definition 3. Let Δ = ⟨F, D⟩ be a default theory and E be a set of sentences in L. Let E_0 = F, and for i ≥ 0,

    E_{i+1} = Th(E_i) ∪ { γ | α : β1, . . . , βn / γ ∈ D, α ∈ E_i, and ¬β1, . . . , ¬βn ∉ E }.

The set of sentences E is an extension of Δ iff E = ∪_{i=0}^{∞} E_i.
Basically, a default extension contains the set of given facts, is deductively closed, and all
default rules that can be applied in the extension have been applied. In addition, an extension
has to be minimal, that is, every sentence in an extension is either a fact or a consequent of
an applied default rule, or a deductive consequence of some combination of the two.
A default rule α : β1, . . . , βn / γ can be applied to conclude the consequent γ when the conditions associated with its prerequisite α and justifications β1, . . . , βn are satisfied. The prerequisite condition is satisfied by showing that α is present, and each of the justification conditions βi is satisfied by showing that ¬βi is absent. Note that "¬βi is absent" does not automatically imply that βi is accepted. It merely means that there is no hard proof that βi is false.

In the classical logic framework, the presence or absence of a formula is determined by deductive provability: α is present if α is provable from a set of sentences, and ¬βi is absent if ¬βi is not provable from the same set of sentences. However, logical provability need not be the only way to determine whether a formula is present or absent. In particular, formulas obtained by the application of default rules may qualify as being present. This is particularly important in the current context, as we will see when we examine the justifications for statistical inference.
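As a rough illustration of Definitions 2 and 3 (a deliberately simplified sketch of ours: sentences are bare literals, negation is written with a leading "~", and deductive closure is reduced to set membership, so this is far weaker than Reiter's full definition), a candidate set can be checked against the fixed point construction as follows:

    def neg(lit):
        return lit[1:] if lit.startswith("~") else "~" + lit

    def is_extension(facts, defaults, candidate):
        """defaults: list of (prerequisite, [justifications], consequent) triples of
        literals. Rebuild the E_i sequence against the candidate set E and test
        whether the construction reproduces exactly E."""
        E = set(candidate)
        level = set(facts)
        changed = True
        while changed:
            changed = False
            for pre, justs, cons in defaults:
                applicable = pre in level and all(neg(j) not in E for j in justs)
                if applicable and cons not in level:
                    level.add(cons)
                    changed = True
        return level == E

    # An adult not known to be a student has a job, by default (the non-normal
    # Tweety rule discussed in Section 5).
    facts = {"adult"}
    defaults = [("adult", ["~student"], "has_job")]
    print(is_extension(facts, defaults, {"adult", "has_job"}))   # True
    print(is_extension(facts, defaults, {"adult"}))              # False: the rule must fire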
5. STATISTICAL INFERENCE AS DEFAULT INFERENCE
Assumptions (or presuppositions) are sometimes invoked in statistical inference. An
example is the Simple Random Sampling assumption (Moore 1979) often mentioned, and
construed as the claim that each equinumerous subset of a population has the same probability of being drawn. According to this assumption, every sample must have the same chance of
being chosen; but usually we know that that is false: Samples remote in space or time have
no chance of being selected.
We cannot choose a sample of trout by a method that will with equal probability select
every subset of the set of all trout, here or there, past or present, with equal frequency. Yet,
the population whose parameter we wish to evaluate may be precisely the set of all trout,
here and there, past and present.
This is also true of the sampling assumption mentioned by Cramer (1951, p. 324)
and also by Baird (1992, p. 31) which requires that each element in the domain has an
equal probability of being selected.2 We cannot, therefore, take simple random sampling as
a premise or a prerequisite of our statistical argument. We not only have no reason to accept
it, but, usually, some reasons for denying it.
The distinction between the prerequisites and the justifications has been somewhat confused by our tendency to focus on normal defaults, in which the justification is the same as the conclusion, for example,

    Tweety is a bird : Tweety flies / Tweety flies.

When we invoke a normal default, the justification, by virtue of being identical to the consequent, is always asserted positively in the extension, and so there is no justification that is supported merely by the absence of its negation. But there are many well-known defaults that are not normal, for example,3

    Tweety is an adult : Tweety is not a student / Tweety has a job.

We need not, at any point, add the justification "Tweety is not a student" to our premises. For this default inference to go through, we only require that "Tweety is a student" is not believed.
This distinction between the prerequisite and the justifications of an inference is also
illustrated by the procedure of random sampling. A statistical inference is often based on a
random sample. In an ideal case, we can select a sample by a procedure that would select
each possible sample equally often in the long run. If we have adequate grounds for thinking
that our sampling procedure satisfies this condition (which is rare, as we will argue), that
claim may be taken as part of the prerequisite of the inference, something for which we
have positive evidence.
But we still require a justification to the effect that the sample is not misleading. Because
each sample has the same chance of being selected, it is perfectly possible that a patently
unsuitable sample may have been selected. Thus, if we are estimating the average weight
of the members of a certain high-school class, our hypothetically perfect random sampling
procedure may give us a sample consisting of all and only members of the football team. We
know that the sample is misleading, because we know that football players tend to be heavier
than their classmates.
Note that we cannot demand positive evidence that the sample is not misleading without
begging the question: the only way we can know that the sample is not misleading is to know
what the average weight is, and that the average in the sample is close to that average weight.
Then, of course, we do not need statistical inference in the first place. That the sample is not
2 Actually this condition is not sufficient; we also must require independence of selections.
3 A slight paraphrase of an example in Reiter and Criscuolo (1981).

misleading is a justification of the inference because we can have evidence that it is false (by
knowing how the sample is misleading), while we cannot ordinarily have direct evidence that
it is true.4
What we can demand as a prerequisite is that the sample be obtained by a procedure
that conforms to good statistical practice. This will not guarantee that the sample is not
misleading (even perfect randomization would not guarantee that) but is as much as we
can ask as a prerequisite. The arguments go through anyway, and justifiably so. Our lack of
knowledge of bias functions as a justification in a default inference. This is the idea we will
pursue in the following subsections.
5.1. Significance Testing
Let us return to our default inference concerning the frequency of defective toasters
coming off a production line.
Construed as a default inference, the prerequisite of the inference is "n toasters from the line were examined and at least k were found defective."
The consequent, "reject H0," is to be construed as ¬H0: more than 10% of the toasters manufactured on this production line are defective.
It is clear that the demand that the sample of toasters be selected by a method that yields
each possible sample with equal long run frequency is a condition that is always difficult, and
usually impossible to satisfy. In the case of a natural population, extended in space and time,
only those individuals in a relatively localized portion of space and time can be selected.
Even in the case of toasters, we do not go back to the founding of the company.
Furthermore, and even more important, the random-sampling assumption is a condition
that is neither necessary nor sufficient for the rational cogency of the inference. It is not
sufficient, since if the condition is satisfied bizarre samples consisting of the first n toasters
or the last n toasters must occur (with their appropriate frequency); but such samples may
well not be a good basis for inference. And it is not necessary, since many perfectly cogent
inferences are made in the face of the impossibility of satisfying this condition. What is really
required (perhaps among other things) is that we not know that there is something special
about the sample that vitiates the conclusion we hope to draw from it. This is the standard
form of a justification in default logic, which requires that we do not know something.
What we propose is that the random sampling assumption be replaced by a justification
in the default rule:

ψ1: The sample on which the inference is based is unbiased.5


As a justification, what does this come to? The violation of this justification comes to
two things: first, that we know that the sample we have, after drawing the sample, comes
from a special class K of samples, and second, that the proportion of samples that lead to
the false rejection of H 0 in this special class K conflicts with that among the general class
of samples. For example, since machine tools wear, it would be ill advised to take the last
n toasters from the line for testing. They are more likely to be defective, and using such a

4 In a sense, that the sample is random is evidence that it is representative, because most samples are representative; but
this argument is indirect, and is a reprise of the original statistical argument.
5 Although we cannot demand lack of bias as a prerequisite, we could demand such prerequisites as good sampling technique. Nevertheless, since we cannot calculate the added frequency of correct conclusions due to good technique, this seems a questionable prerequisite.

sample would lead to the excessively frequent false rejection of H 0 , which concerns long run
average conditions.
If we draw a sample of toaster knobs (the plastic dials controlling the temperature) from
the top of a crate to make an inference concerning the proportion of cracked knobs, we will
almost surely be wrong: cracked knobs tend to fall to the bottom of the crate.
This is a matter of something we know: we know that samples from the top of the crate are
less frequently representative of the proportion of cracked knobs than are stratified samples, that is, samples taken proportionally from several levels of the crate. Of course a sample from the
top could be representative; and a stratified sample could be misleading. But we must go
with the probabilities.
The mere possibility of having gotten a misleading sample should not inhibit our inference. In the absence of evidence to the contrary, the inference goes through. When there is evidence against fairness (and note that this may be statistical evidence, which could perfectly well be derived from the very sample we have drawn) the inference is blocked.
In short, the sample must not be known to belong to any alternative subclass that would
serve as a reference class for a conflicting inference.
Of course there are other conditions that must be satisfied for our significance testing
inference to be legitimate. We take these to be justifications: items whose denials must not
be known.
An obvious justification for the significance test we are discussing is that the sample of
n be the largest sample we have; to have observed n + m toasters from the line, and found at
least k defective would tell us something quite different. So one justification of our statistical
inference is:

ψ2: The sample of n is the largest appropriate sample we have.


Note that we do not have to know that there is no larger sample. It is perfectly possible
that some earlier investigator started to look into quality control, began to take a sample, and
became distracted. If we cannot combine his data with ours, there is surely no reason not to
use the data we have for our inference. On the other hand, it would clearly be wrong to use
what we know to be only part of the data.
A third justification pertains to prior knowledge. If we knew that under similar circumstances the relative frequency of error for H 0 were 0.25, as opposed to the value of 0.05,
that would surely have a bearing on our significance test inference. We might formulate this
as follows:

ψ3: There is no known set of prior distributions such that conditioning on them with the data obtained leads to a probability interval conflicting with 1 − α.6

Note that we suppose here that even if our prior knowledge is vague (represented by a set of distributions, rather than a single distribution) we may find conflict with the size of
the significance test. Thus if H 0 were given a high enough prior probability, that probability
could undermine the result of the significance test.
Something like this may well be going on in the case of tests of psychic phenomena
and other controversial statistical inferences. Of course one would need to look at individual
6 What conflict comes to in this instance is an interval that overlaps with, but does not include and is not included in, [1 − α, 1].

cases, and to evaluate the plausibility of the prior probabilities, to arrive at a useful judgment
about these matters.
It has long been agreed, even by the founding fathers of statistics, that if one has knowledge
of a prior distribution for a parameter (such as the defect rate in our toaster production line),
we should use that knowledge (Fisher 1930; Neyman 1957). This is seldom or never made
explicit as a premise, but fits in nicely as a justification for a default rule.
5.2. Hypothesis Testing
Let us consider the test of a hypothesis H 0 , the hypothesis that the relative frequency of
Bs among As is 1/3, against the alternative hypothesis H 1 that the relative frequency of Bs
among As is 3/4.
We test H 0 against H 1 by drawing a sample of five As, and observing the number of
Bs. If we observe four or five Bs, we reject H 0 . Error of the first kind, the chance of a false
rejection of H 0 , is 0.045. Error of the second kind consists of failure to reject H 0 when it is
false. This is the long run frequency with which the hypothesis under test will be mistakenly
accepted. Since the alternative H 1 to H 0 is simple, we can calculate this, too; it is 0.367.
Since we are taking as a premise, a prerequisite, that one or the other of these hypotheses is true, these are the only errors we can make in this test. Our before test chance of error is best characterized by the interval [0.045, 0.367].
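The two error rates just quoted can be checked directly (our sketch; only the binomial model stated in the text is assumed):

    from math import comb

    def binom_pmf(k, n, p):
        return comb(n, k) * p**k * (1 - p)**(n - k)

    # Reject H0 (1/3 of As are Bs) when 4 or 5 of a sample of 5 As are Bs.
    alpha = sum(binom_pmf(k, 5, 1/3) for k in (4, 5))    # P(reject | H0)
    beta = sum(binom_pmf(k, 5, 3/4) for k in range(4))   # P(fail to reject | H1)
    print(round(alpha, 3), round(beta, 3))               # 0.045 0.367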
The default rule for the inference in which we reject H0 looks like this:
Prerequisite: (H0 ∨ H1) ∧ (of a sample of five there are four or five Bs).
Conclusion: ¬H0.
Confidence (1 − α): 1 − 0.045 = 0.955.
Justifications:
ψ1: This is the largest relevant sample.
ψ2: This sample is unbiased.
ψ3: There is no prior probability that gives rise, by conditioning, to a conflicting confidence.
Since we make an error in the legitimate application of this rule if and only if H 0 is true, it
seems plausible to say that the probability of H0 is 0.045. (Of course, classical statisticians will deny that 0.045 is any kind of probability; that reflects evidential rather than frequency probability.) But clearly if we have good reason to assign a high prior probability to H0, conditioning may undermine our hypothesis test. If the probability of H0 is [p, q], and p ≥ 0.399, then the lower bound for the probability of H0 conditioned on the observation of four or five Bs in our sample of five As is greater than 0.045. In the case of such conflict it
seems intuitive to take the probability based on the richer background knowledge as correct,
in accord with Rule I for evidential probability.
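The figure of 0.399 can be recovered by conditioning on the prior (again our own sketch under the same binomial model; the variable names are illustrative):

    from math import comb

    def binom_pmf(k, n, p):
        return comb(n, k) * p**k * (1 - p)**(n - k)

    likelihood_h0 = sum(binom_pmf(k, 5, 1/3) for k in (4, 5))   # about 0.045
    likelihood_h1 = sum(binom_pmf(k, 5, 3/4) for k in (4, 5))   # about 0.633

    def posterior_h0(prior_h0):
        """P(H0 | four or five Bs) when H0 has prior prior_h0 and H1 has the rest."""
        joint_h0 = prior_h0 * likelihood_h0
        return joint_h0 / (joint_h0 + (1 - prior_h0) * likelihood_h1)

    print(posterior_h0(0.399))   # about 0.045: roughly where the posterior reaches the size of the test
    print(posterior_h0(0.5))     # about 0.067: a higher prior pushes the posterior above 0.045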
There may be other justifications that could be mentioned, but we believe they are often
reducible to these. Note that both the size α and the power 1 − β of classical tests are long run
relative frequencies. They are grounded semantically in facts about the world.
5.3. Confidence Intervals
Here is a default version of confidence interval inference. We take l_n(r) and u_n(r) to be the lower and upper 1 − α confidence bounds for a binomial sample of size n with the sample
ratio r.

The default rule is:


Prerequisite: s is a sample of n As of which a fraction r are Bs.
Conclusion: The proportion of As that are Bs lies between l_n(r) and u_n(r).
Confidence: The conclusion is categorically asserted with confidence 1 − α.
Again, the interesting part of the rule is its set of justifications:
ψ1: s is the largest sample of As we have.
ψ2: s is not biased. Put otherwise, we must not know that the sample is drawn from a class of samples in which the frequency of representative samples is less than that among the set of samples on which the statistical inference is based. Note that we do not have to know that s is not biased at all; it suffices that there is no default inference of sufficient strength with the conclusion that s is biased.
ψ3: A is not known to be a member of a set of populations that gives rise, by conditioning, to a probability for the conclusion "The proportion of As that are Bs lies between l_n(r) and u_n(r)" that conflicts with the interval [1 − α, 1] or falls within it.
ψ4: There is no secular trend of Bs among the As.
Note that the denial of this justification could be established by a default inference from the
same data we would use in this inference. It would then block the confidence interval default
rule.
Again, there may be other justifications that should be considered, but these seem to
cover the most important ones. Although they need not be known (taken as part of the evidence) they should be subject to rational debate. Of course, to list ψ as a justification does not preclude the possibility that ψ is known; if ψ is known, so much the better. What is precluded is the use of the default rule when the negation of ψ is known.
6. COMPETING DEFAULTS
Statistical inference can be modeled as default inference. We can construct normative
default rules based on these statistical inference principles. For instance, using confidence
interval calculations we may say (with confidence 0.95) that the proportion of birds that fly is in the interval [0.93, 0.97]. This gives us a reason to adopt the default rule

    bird : fly, ψ1, ψ2, . . . / fly,

where the ψi's correspond to the justifications discussed in the previous section. Similarly, we may at the same time obtain another default rule

    penguin : ¬fly, ψ1, ψ2, . . . / ¬fly.

In general we can think of a default rule as being well justified if its associated confidence interval is high; for example, if the lower bound of the interval exceeds a certain rule threshold 1 − ε.
Both of the above rules can be well supported by the underlying statistics, yet in the case
of an object being both a bird and a penguin, the two rules offer conflicting conclusions. In
this section, we consider the problem of choosing between competing defaults.

6.1. Conflicts
Consider the following canonical example.
Example 1.

We have a default theory Δ = ⟨F, D⟩, where

    F = {R, S},
    D = { R : T, ψ1, ψ2, . . . / T,   S : ¬T, ψ1, ψ2, . . . / ¬T }.

We get two extensions, one containing T and the other containing ¬T. If we take R to mean bird, S to mean penguin, and T to mean fly, then we would like to reject the extension containing T (fly) in favor of the extension containing ¬T (not fly). However, if we take S to mean animal instead and keep the interpretations of R and T the same, we would want to reverse our preference. Now the extension containing T (fly) seems more reasonable.
Note that each of the default rules involved in the example above is intuitively appealing
when viewed by itself against our background knowledge: birds fly; penguins do not fly; and
animals in general do not fly either. Moreover, both instantiations (penguins and animals) are
syntactically indistinguishable. We cannot decide to prefer one default rule over the other simply by looking at their syntactic structures.
There are several approaches to circumventing this conceptual difficulty. The first is to
revise the default theory so that the desired result is achieved (Reiter and Criscuolo 1981).
We can amend the default rules by adding the exceptions as justifications, for example,
    bird : fly, ¬penguin, . . . / fly;        animal : ¬fly, ¬bird, . . . / ¬fly.
With this approach we have to constantly revise the default rules to take into account additional
exceptions. We have little guidance in constructing the list of justifications except that the
resulting default rule has to produce the right answer in the given situation.
Another approach is to establish some priority structure over the set of defaults. For
example, we can refer to a specificity or inheritance hierarchy to determine which default
rule should be used in case of a conflict (Touretzky 1984; Horty, Touretzky, and Thomason
1987). The penguin rule is more specific than the bird rule, and therefore, when both are
applicable we use the penguin rule and not the bird rule. However, conflicting rules do not
always fit into neat hierarchies (e.g., adults are employed, students are not, how about adult
students? (Reiter and Criscuolo 1981)). It is not obvious how we can extend the hierarchical
structure without resorting to explicitly enumerating the priority relations between the default
rules (Brewka 1989, 1994).
The third approach is to appeal to probabilistic analysis. Defaults are interpreted as
representing some infinitesimal probability condition. Adams requires that for A to be a reasonable consequence of the set of sentences S, for any ε there must be a positive δ such that for every probability function, if the probability of every sentence in S is greater than 1 − δ, then the probability of A is at least 1 − ε (Adams 1966, 1975). Pearl's approach similarly involves quantification over possible probability functions (Pearl 1988, 1990).
Bacchus et al. (1993) again take the degree of belief appropriate to a statement to be the
proportion or limiting proportion of intended first-order models in which the statement is
true.
All of these approaches involve matters that go well beyond what we may reasonably
suppose to be available to us as empirical enquirers. In contrast, our approach to constructing default rules is modeled on statistical inference. The confidence parameter α and the rule threshold 1 − ε reflect established statistical procedures.
We have argued for the distinction between prerequisites and justifications in nonmonotonic reasoning. In particular, many assumptions in statistical inference should be coded
as justifications, which are acceptable unless shown false, as opposed to the stronger prerequisites, which are acceptable only if they can be proven true. Many logics of conditionals, for
example, along the lines of System P (Kraus, Lehmann, and Magidor 1990), do not provide
for such a distinction. In addition, there are some limitations that make a logic based on
conditionals unsatisfactory for nonmonotonic inference. For a more complete discussion of
these issues, see Kyburg, Teng, and Wheeler (Forthcoming).
Our principles for constructing default rules and resolving conflicts among them are
in the same spirit as the approaches taken in Kyburg (1974) and Pollock (1990). These
principles may not account for all our intuitions regarding conflicting default rules, but for
the large number of rules that are based mainly on our knowledge of or intuitions about
relative frequencies, they seem to offer sensible guidance.
6.2. A Fourth Approach
We can methodically resolve some of the conflicts posed by rules such as those discussed
in Example 1. The appropriate default rule to be applied can be picked out by reference to the
statistical information supporting the rules. This provides an accountable basis for deciding
between rules and, therefore, extensions.
Consider the birds and penguins. Suppose according to our data the proportion of birds
that fly is in the interval [0.93, 0.97], and the proportion of penguins that fly is in the interval
[0.01, 0.09]. (There are some talented penguins out there.) In addition, we also know that
penguins are birds. We have four candidate rules, as follows:
    bird : fly, . . . / fly                  (interval [0.93, 0.97]);      (1)

    penguin : fly, . . . / fly               (interval [0.01, 0.09]);      (2)

    bird : ¬fly, . . . / ¬fly                (interval [0.03, 0.07]);      (3)

    penguin : ¬fly, . . . / ¬fly             (interval [0.91, 0.99]).      (4)

Two default rules are said to compete if both are applicable and have the same consequent.
Even though competing rules would result in the same consequent, they differ in the strength
of the inference, as indicated by the supporting interval. Here rules (1) and (2) compete, and
rules (3) and (4) compete.
Recall that two intervals conflict when neither is nested in the other. When the associated
intervals of two competing default rules are in conflict, we prefer the rule, if there is one,
whose prerequisite is more specific. In this example, only rules (2) and (4) may apply (subject
to further conditions discussed below) when the object in question is known to be a penguin.
The more general rules (1) and (3) may apply only to birds that are not known to be penguins.
Note that of the four rules above, rules (2) and (3) are supported by intervals that have
very low values. It is thus very unlikely that these rules will be invoked. Adopting a rule threshold of, for instance, 0.80 would exclude these rules from being applied in practice.
However, rule (2) still competes with rule (1), and is preferred to rule (1), as far as penguins
are concerned. By the same token, rule (4) is preferred to rule (3), but this time rule (4) is
likely to be applied, as the associated interval is above the rule threshold. This allows us to
conclude that penguins do not fly.
Now consider another case. Suppose we also have information about red birds; the
proportion of red birds that fly is in the interval [0.60, 1.00]. Red birds are of course also
birds, but this time we should not prefer the more specific inference. The information about
red birds is vague (perhaps because there are few occurrences of red birds in our data). There
is no conflict between the interval for red birds [0.60, 1.00] and the interval for birds in
general [0.93, 0.97]. In this case, even though red birds are more specific, the associated
statistical information does not warrant the construction of a separate rule for this subclass.
We have only rules (1) and (3) above, and they apply to all birds, red or otherwise.
Now let us look at the Nixon diamond. Suppose the proportion of pacifists among quakers
is in the interval [0.85, 0.95], and the proportion of pacifists among Republicans is in the
interval [0.20, 0.25]. What can we say about Nixon, who is both a quaker and a Republican?
The four candidate rules are
    quaker : pacifist, . . . / pacifist              (interval [0.85, 0.95]);      (5)

    Republican : pacifist, . . . / pacifist          (interval [0.20, 0.25]);      (6)

    quaker : ¬pacifist, . . . / ¬pacifist            (interval [0.05, 0.15]);      (7)

    Republican : ¬pacifist, . . . / ¬pacifist        (interval [0.75, 0.80]).      (8)

Rules (5) and (6) compete, and rules (7) and (8) compete. In addition, the intervals of
the rules in each pair conflict. Unlike penguins and birds, the two reference classes here,
quakers and Republicans do not have a subset relationship. When the competing rules are
both applicable, as in the case for Nixon, and the conflict cannot be resolved by appealing
to subset relationships between the reference classes, we take the cover of the conflicting
intervals. This is equivalent to preferring the following rules when the person in question is
both a quaker and a Republican.
    quaker ∧ Republican : pacifist, . . . / pacifist        (interval [0.20, 0.95]);

    quaker ∧ Republican : ¬pacifist, . . . / ¬pacifist      (interval [0.05, 0.80]).
The wide intervals reflect the conflicts; we are unlikely to invoke either of these rules as they
are not likely to be above our chosen threshold of acceptance.
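Numerically (our illustration), the covers above are just the componentwise minimum and maximum of the conflicting intervals:

    def cover(a, b):
        return (min(a[0], b[0]), max(a[1], b[1]))

    quaker_pacifist, republican_pacifist = (0.85, 0.95), (0.20, 0.25)
    quaker_not_pacifist, republican_not_pacifist = (0.05, 0.15), (0.75, 0.80)

    print(cover(quaker_pacifist, republican_pacifist))          # (0.2, 0.95)
    print(cover(quaker_not_pacifist, republican_not_pacifist))  # (0.05, 0.8)
    # With a rule threshold of 0.80, neither lower bound (0.20, 0.05) clears it,
    # so neither conclusion about Nixon is drawn.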
6.3. Principles for Resolving Conflicts
We can adjudicate between competing default rules by considering the supporting statistical information. A preferred candidate rule is applied only when it is above the given rule threshold 1 − ε, where ε is the tolerable chance of error. In general, we have the following scheme.

Let the proportion of τs among ρs be [p, q], and the proportion of τs among ρ′s be [p′, q′]. Consider the two competing candidate rules:7

    ρ : τ, . . . / τ          (interval [p, q]);        (9)

    ρ′ : τ, . . . / τ         (interval [p′, q′]).      (10)

We consider three cases according to the relationship between ρ and ρ′. (This relationship is typically derived from the set of facts F of the default theory.)
1. ρ is a specialization of ρ′, that is, all ρs are ρ′s.
There are three subcases according to the relationship between p, p′, q, and q′.
(a) Conflict: (p < p′ and q < q′) or (p′ < p and q′ < q).
Rule (9) is preferred when ρ is known. Rule (10) applies when we only know the more general ρ′. This is in accord with Rule II for evidential probability.
(b) Less precise information about the more specific class ρ: p ≤ p′ and q′ ≤ q.
Only rule (10) is sanctioned, and it applies to all ρ′s regardless of the truth value of ρ. This conforms to Rule III for evidential probability.
(c) More precise information about the more specific class ρ: p′ ≤ p and q ≤ q′.
Again rule (9) is preferred for ρs, and the more general rule (10) applies only when the more specific rule for ρ cannot be applied.
2. ρ′ is a specialization of ρ, that is, all ρ′s are ρs.
This is symmetrical to Case (1).
3. Neither ρ nor ρ′ is a specialization of the other.
Again there are three subcases.
(a) Conflict: (p < p′ and q < q′) or (p′ < p and q′ < q).
Rules (9) and (10) apply respectively to cases where the object in question is known to satisfy ρ or to satisfy ρ′ but not known to satisfy both. When both ρ and ρ′ are known, and both rules are feasible, we take the supporting interval to be the cover of the conflicting intervals [p, q] and [p′, q′] of the competing rules (9) and (10).
(b) Nested intervals: p ≤ p′ and q′ ≤ q.
Rule (10), the rule supported by the tighter interval, is preferred when both ρ and ρ′ are known. Rule (9) applies when only ρ is known.
(c) Nested intervals: p′ ≤ p and q ≤ q′.
This is symmetrical to Case (3b).
Based on our background statistical knowledge and the characteristics of the particular
situation we need to make inferences about, we can pick out the preferred default rules
according to the above guidelines. Whether we actually do apply such rules depends on the
designated rule threshold 1 − ε determined by our tolerance for error. If the lower bound
of the interval associated with a preferred rule is above the threshold, the rule is deemed
applicable.
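The selection scheme for case (1), where one reference class is known to specialize the other, can be sketched as follows (our own illustration; the rule records, function names, and threshold are assumptions, and case (3), which uses covers, is handled as in the Nixon example above):

    def nested(a, b):
        return b[0] <= a[0] <= a[1] <= b[1]

    def conflict(a, b):
        return not nested(a, b) and not nested(b, a)

    def choose_interval(iv_specific, iv_general, specific_known, threshold):
        """Interval backing the rule we act on when the first reference class is a
        specialization of the second; None if no rule clears the threshold."""
        if specific_known and (conflict(iv_specific, iv_general) or nested(iv_specific, iv_general)):
            chosen = iv_specific     # specificity (1a), or more precise information (1c)
        else:
            chosen = iv_general      # vaguer specific information (1b), or only the general class known
        return chosen if chosen[0] >= threshold else None

    # Penguins (specific) vs. birds (general), consequent "does not fly":
    print(choose_interval((0.91, 0.99), (0.03, 0.07), specific_known=True, threshold=0.80))  # (0.91, 0.99)
    # Red birds (specific but vague) vs. birds, consequent "flies":
    print(choose_interval((0.60, 1.00), (0.93, 0.97), specific_known=True, threshold=0.80))  # (0.93, 0.97)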
7 In addition another pair of competing rules can be constructed: ρ : ¬τ, . . . / ¬τ, with associated interval [1 − q, 1 − p], and ρ′ : ¬τ, . . . / ¬τ, with associated interval [1 − q′, 1 − p′]. For simplicity we do not discuss this pair of rules, but the governing principles are the same.

To recapitulate, we remove from the set of competing rules the less specific and vaguer
rules. Of the remaining rules we take the cover of their associated intervals, and apply only those
whose lower bounds are above the rule threshold. These principles for resolving conflicts
between competing default rules are analogous to the ones developed for choosing between
reference classes in an evidential probability setting (Kyburg and Teng 2001).
This statistical approach provides the normative guidance that is lacking in the ad hoc
approach to revising default rules. Instead of relying on intuition to tweak the default rules
until they give the right results, our preferred rules are systematically generated from the
underlying statistics.
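To make the bookkeeping concrete, the following sketch (ours, not the authors' implementation) encodes the conflict test, the cover, and the threshold check described above. The component intervals shown for the Nixon case are hypothetical; the text reports only their cover.

    # Illustrative sketch of the Section 6.3 bookkeeping (not the authors' code).
    # An interval is a (lower, upper) pair of bounds on a proportion.

    def conflicting(i1, i2):
        # Neither interval is nested in the other: the two rules genuinely disagree.
        (p, q), (p2, q2) = i1, i2
        return (p < p2 and q < q2) or (p2 < p and q2 < q)

    def cover(i1, i2):
        # Smallest interval containing both, used when a conflict cannot be resolved.
        return (min(i1[0], i2[0]), max(i1[1], i2[1]))

    def applicable(interval, epsilon):
        # A rule fires only if its lower bound clears the threshold 1 - epsilon.
        return interval[0] >= 1 - epsilon

    # Hypothetical component intervals for 'pacifist' given Quaker and given Republican.
    quaker = (0.80, 0.95)
    republican = (0.20, 0.45)

    if conflicting(quaker, republican):
        support = cover(quaker, republican)   # (0.20, 0.95), the cover reported in the text
        print(applicable(support, 0.10))      # False: 0.20 < 0.90, so the rule is not invoked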
7. STATISTICAL DEFAULT THEORIES
The previous section deals with the problem of adjudicating potential conflicts between
competing default rules. The rules that are inappropriate in a particular situation are excluded
from consideration based on a number of principles. Applying a single rule constitutes one
step in the inference process. When we consider the grander picture of default extensions,
we need to take into account the effect of cumulative inference. An extension typically is
derived from multiple inferences, some of whose applicability is dependent upon previous
default conclusions. Statistical default rules and statistical default theories take into account
such cumulative effects in their formulation.
7.1. Statistical Considerations
An important feature of classical statistical inference is its emphasis on the control of
error (Mayo 1996). A corollary of this emphasis is that we may want to acknowledge explicitly
in our inference forms the upper limit of the frequency with which error may be committed.
This represents a difference between our statistical defaults and standard defaults. In the
case of the statistical default rule there is an unavoidable, if controlled, chance of error. The
significance test default in Section 5.1 will yield a false rejection of H0 up to a fraction ε of
the time.
We have suggested rejecting the requirement that samples be strictly random, in favor of
the requirement that we not know that they are biased. But this latter knowledge may be quite
weak; all we require in order for our inference to be blocked is some sort of good reason,
which may itself come from a statistical default, for thinking our sample biased.
Furthermore, there is the question of the interaction between sentences in a default extension. In Reiter's formulation a default extension is deductively closed. However, when dealing
with statistical inferences, the bounds on the frequency of error for individual conclusions
do not carry over unchanged when multiple conclusions are considered together. Accepting
a hypothesis H0 with confidence 0.95 and accepting another hypothesis H1 independently
with confidence 0.95 do not entail the acceptance of both H0 and H1 simultaneously at a
confidence level of 0.95.
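To spell out the arithmetic (our illustration): without further assumptions, the most we can say is that the chance of at least one of the two acceptances being in error is bounded by 0.05 + 0.05 = 0.10, so the joint acceptance carries a guaranteed confidence of only 0.90; even if the two tests were independent, the joint confidence would be 0.95 × 0.95 = 0.9025, still short of 0.95.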
The same consideration goes for the chaining of default inferences. In the standard default
logic setting, once a default consequent is asserted it enjoys the same status as the facts
that were originally given in the knowledge base.8 Such a default consequent can be used to
satisfy a prerequisite condition of an otherwise inapplicable default rule. Thus, the rules can
be chained to obtain further conclusions based on previous default inferences.
8 At least until we discover an inconsistency and need to retract some conclusions.

This makes sense in a qualitative framework, but where defaults are based on statistics,
it is clear that the greater the number of steps involved, the lower the level of confidence that
can be attributed to a conclusion thus obtained.
Given these considerations of statistically motivated default inference, a thresholding
mechanism is introduced to coordinate the interactions which give rise to a default extension.
The resulting statistical defaults and statistical default theories are discussed in the following
section.

7.2. Statistical Default Theories


For simplicity we will refer to Reiter's formulation as the standard formulation and to its
default rules as standard default rules. The rules for statistical inference are called statistical
default rules, and a default theory containing statistical defaults is called a statistical default
theory.
Each statistical default rule is associated with a rule threshold parameter ε. The value
1 − ε represents the strength of the inference. Specifically, ε is the maximum long run chance
of error of the default consequent under the circumstances prescribed by the applicability
conditions of the default rule.
In place of the standard default extension, we introduce a set of statements Γ with
respect to a cumulative threshold parameter δ. Inferences obtained from standard default
rules are constrained by logical consistency; in particular, the justifications of a rule need to
be consistent with the extension. Inferences obtained from statistical default rules are subject
to an additional constraint. A sentence is admitted into a statistical default extension only
if the chance of committing an error in doing so is compatible with the cumulative threshold 1 − δ
associated with Γ.
More formally, we have the following definitions.
Definition 4. A statistical default rule ds is an expression of the form

    α : β1, . . . , βn / γ,   ε,

where we let α : β1, . . . , βn / γ denote a standard default rule as given in Definition 2 and
0 ≤ ε ≤ 1. We call α : β1, . . . , βn / γ the body of the statistical default rule ds, and
α, β1, . . . , βn, γ, respectively, the prerequisite, justifications, and consequent of the rule.
We call ε the rule threshold parameter, and 1 − ε the strength of the rule.

A statistical default theory Δs is an ordered pair ⟨F, Ds⟩, where F is a set of sentences in
L and Ds is a set of statistical default rules.
A statistical default rule has the same structure as a standard default rule, except that there
is an additional threshold parameter ε that indicates the strength of the statistics underlying
the default inference. The chance of error of a statistical default rule of strength 1 − ε is no
more than ε.
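As a reading aid (our own representation, not part of the authors' formalism), a statistical default rule can be recorded as a plain data structure carrying the body of a standard default rule together with its threshold parameter; all field and example names below are ours.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class StatisticalDefaultRule:
        # alpha : beta_1, ..., beta_n / gamma together with a rule threshold epsilon.
        prerequisite: str            # alpha
        justifications: List[str]    # beta_1, ..., beta_n
        consequent: str              # gamma
        epsilon: float = 0.0         # maximum chance of error; 0 recovers a standard rule

        @property
        def strength(self) -> float:
            # The strength of the rule is 1 - epsilon.
            return 1.0 - self.epsilon

    # A schematic significance-test default at the 0.05 level (names are hypothetical).
    d = StatisticalDefaultRule("sample observed", ["sample is not biased"], "reject H0", 0.05)
    print(d.strength)   # 0.95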
The notion of an extension needs to be revised to take into account the statistical conclusions and the effects of their interaction. We will allow restricted deductive and default
chaining, as long as the conclusion from such a sequence of inferences is still bounded in
error by the cumulative threshold parameter δ characterizing the extension Γ (Teng 1997).
We adopt the following definition.

Definition 5. Let Δs = ⟨F, Ds⟩ be a statistical default theory and Γ be a set of sentences
in L. Let F ⊆ Γ0, and for i ≥ 0,

    Γi = { γ | φ1, . . . , φk ⊢ γ, where φ1 ∈ Γ_{ε1}, . . . , φk ∈ Γ_{εk}, and ε1 + · · · + εk ≤ i }
       ∪ { γ | (α : β1, . . . , βn / γ, εs) ∈ Ds, where α ∈ Γ_{εα}, ¬β1 ∉ Γ, . . . , ¬βn ∉ Γ, and εα + εs ≤ i }.

The set of sentences Γ is a statistical default extension of Δs with respect to δ iff Γ =
⋃_{0 ≤ i ≤ δ} Γi.
A statistical default extension contains the set of facts F, as well as the consequences
inferable by logical deduction and default rule chaining within the prescribed error bound δ.
What warrants our accepting a conclusion into a statistical default extension Γ with
respect to δ is that the chain of inference resulting in that conclusion will lead us into error
no more than a fraction δ of the time.
To determine whether a chain of inference leads to a conclusion whose relative frequency
of error is below δ, we employ a conservative scheme where the sum of the error bounds of
the constituents participating in the immediate inference step is taken as the error bound
of the conclusion itself. While the bound is not exact, in the absence of further knowledge
regarding the characteristics and relationship between the constituents, this serves as an upper
error bound such that any conclusion sanctioned by the resulting statistical default extension
has a relative frequency of error not exceeding the stated acceptable cumulative threshold
δ, as established in the theorem below.
Theorem 1. Let the probability of error of φ1, . . . , φn be denoted by ε1, . . . , εn, respectively.
For any γ such that φ1, . . . , φn ⊢ γ, the probability of error of γ is at most ε1 + · · · + εn.

The formula γ follows from the conjunction φ1 ∧ · · · ∧ φn. We then have

    Pr(φ1 ∧ · · · ∧ φn) ≤ Pr(γ),

which gives

    Pr(¬γ) ≤ Pr(¬(φ1 ∧ · · · ∧ φn))
           = Pr(¬φ1 ∨ · · · ∨ ¬φn)
           = Σ_i Pr(¬φi) − Σ_{i<j} Pr(¬φi ∧ ¬φj) + Σ_{i<j<k} Pr(¬φi ∧ ¬φj ∧ ¬φk) − · · ·
           ≤ Σ_i Pr(¬φi)
           ≤ Σ_i εi.

So, the frequency of error of a deductive consequent is bounded by the sum of the
errors of the premises. The frequency of error of a default consequent obtained from the
application of the rule α : β1, . . . , βn / γ, εs is similarly bounded by the sum of the errors associated with

the prerequisite (εα) and with the rule itself (εs). In addition, for the rule to be applicable,
the negation of each of the justifications must be excluded from the extension Γ.
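The conservative bookkeeping is easy to mechanize; the sketch below (ours, with made-up numbers) propagates the bound of Theorem 1 through a deductive step and then a default-rule application.

    # Conservative error-bound propagation (an illustration, not the authors' code).

    def deduction_bound(premise_bounds):
        # Theorem 1: a deductive consequent is charged at most the sum of its premises' bounds.
        return sum(premise_bounds)

    def default_bound(prerequisite_bound, rule_epsilon):
        # A default consequent is charged the prerequisite's bound plus the rule's epsilon.
        return prerequisite_bound + rule_epsilon

    delta = 0.10                             # cumulative threshold of the extension
    e_phi = deduction_bound([0.02, 0.03])    # conclusion derived from two risky premises
    e_psi = default_bound(e_phi, 0.04)       # then used as prerequisite of a 0.04-rule
    print(e_psi <= delta)                    # True: 0.02 + 0.03 + 0.04 = 0.09 <= 0.10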
A standard default rule can be represented as a statistical default rule with strength
1 (ε = 0). In other words, a standard default rule

    α : β1, . . . , βn / γ

can be rewritten as

    α : β1, . . . , βn / γ,   0

in the statistical default framework.


Note that this does not mean that standard default rules have a zero chance of error, and are
therefore always correct. A value of 0 in this case merely denotes that the rules in question
are qualitative and are thus subject only to the qualitative constraints, namely, logical
consistency. The formulation of a statistical default extension is such that the application of
a qualitative default rule (with ε = 0) does not add to the cumulative error of the formulas
derived via this default rule.
7.3. Properties of Statistical Default Theories
As defined in Definition 5, a statistical default extension Γ with respect to δ is a union
of sets of sentences Γi for 0 ≤ i ≤ δ. Each set Γi contains the sentences accepted with
an error bound of at most i with respect to Γ. The sets Γi are related as
follows.
Theorem 2. Given a set of sentences Γ, and the Γi constructed as in Definition 5, we have
Γi ⊆ Γj for i ≤ j.

That is, the sequence of sets Γi monotonically increases as i increases. We only need
to observe that for any formula γ we have γ ⊢ γ, and therefore, if γ is included in Γi, it is also
included in Γj for all i ≤ j, according to the specification of the first set in Definition 5.
One would be tempted to think statistical default extensions are monotonically increasing
as well, as the cumulative threshold parameter δ increases. However, this is not the case.
Given a statistical default theory, successive relaxation of the threshold gives rise to
statistical extensions that do not necessarily bear any subset relationship, as can be readily
seen in the following example.
Example 2. Consider a statistical default theory Δs = ⟨F, Ds⟩, where F = ∅, Ds = {d1, d2},
and

    d1 = : ¬p / q,  0.05;        d2 = : p / p,  0.10.

With respect to a cumulative threshold parameter δ = 0.05, the statistical default theory Δs
gives rise to a single statistical extension Γ1, which contains q but does not contain p. This
extension is obtained by applying the default rule d1. On the other hand, with respect to a
larger δ = 0.10, Δs again gives rise to a single extension, but this extension Γ2 contains
p and does not contain q. Only rule d2 is applicable in this case, as the justification of d1 is
contradicted by the consequent of d2.

The two sets of sentences Γ1 and Γ2 are overlapping (e.g., they both contain the tautologies), but each contains elements that are not in the other set.
Thus, as the cumulative threshold parameter δ increases, the extensions obtained from a
statistical default theory do not form a monotonic sequence. Even the existence of extensions
with respect to one value of δ does not provide any guarantee for the existence of extensions with
respect to other values of the threshold. This is shown in the following example.
Example 3. Consider a statistical default theory Δs = ⟨F, Ds⟩, where F = ∅, Ds =
{d1, d2, d3}, and

    d1 = : p / ¬p,  0.05;        d2 = : ¬p / ¬p,  0.10;        d3 = : q / ¬q,  0.20.

For δ = 0.05, the default theory has no extension. For δ = 0.10, there is one extension,
involving the application of default rule d2. However, as we increase δ to 0.20, once again
Δs has no extension.

Statistical default extensions correspond to standard default extensions in a number of
special cases.
Theorem 3. Given a statistical default theory Δs = ⟨F, Ds⟩, the statistical default extensions
of Δs with respect to a cumulative threshold parameter δ are identical to the standard default
extensions of the standard default theory Δ = ⟨F, D⟩ constructed as follows:

[δ = 0]: D is the set of default rules corresponding to the bodies of the statistical default
         rules of strength 1 in Ds. Only the qualitative constraints are in force in this case.
[δ = ∞]: D is the set of default rules corresponding to the bodies of all the statistical
         default rules in Ds. In other words, D is obtained from Ds by dropping the
         threshold parameter ε of each rule.

For 0 < δ < ∞, statistical default extensions can only be weakly associated with standard
extensions of some corresponding standard default theory. (Note that the statistical default
extensions are not deductively closed in general.)
One might think that δ can be capped at 1. Note, however, that while the actual chance of
error of an event cannot be greater than 1, the bound on that chance of error can be arbitrarily
large. We need to allow for the case when the sum of the error bounds on the premises, and
thus, the error bound on their consequences, exceeds 1.
7.4. Adjunction
The set Γ0 is deductively closed, but none of the Γi for 0 < i ≤ δ is closed under
deduction. A conclusion derivable from a single premise has a chance of error at most as
high as that of the premise; however, in general this bound is not guaranteed in the case of
multiple premises.
Moreover, we do not have closure under adjunction. We may accept φ and ψ, respectively,
but this does not entail the acceptance of φ ∧ ψ into the extension. The cumulative error
of φ ∧ ψ may exceed our error threshold even though both of the constituents φ and ψ are
acceptable.
Controlled forms of adjunction persist, however.

    If φ ∈ Γ, φ ⊢ ψ1, and φ ⊢ ψ2, then ψ1 ∧ ψ2 ∈ Γ.
    If φ ∈ Γ_{εφ}, ψ ∈ Γ_{εψ}, and εφ + εψ ≤ δ, then φ ∧ ψ ∈ Γ_{εφ+εψ}.

In the first case, ψ1 ∧ ψ2 is accepted at the error threshold δ if both conjuncts ψ1 and
ψ2 can be derived from the same premise φ that is itself acceptable at the error threshold δ.
In the second case, we accept φ ∧ ψ provided that the sum of errors collectively induced
by the conjuncts φ and ψ is below the cumulative threshold parameter δ of the statistical
extension Γ. The frequency of error of φ ∧ ψ is bounded by εφ + εψ.
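A quick check of the second form (the numbers here are hypothetical):

    # Restricted adjunction: the conjunction enters only if the summed bound fits under delta.
    def adjoinable(eps_phi, eps_psi, delta):
        return eps_phi + eps_psi <= delta

    print(adjoinable(0.03, 0.04, 0.10))   # True:  the conjunction carries a bound of 0.07
    print(adjoinable(0.06, 0.06, 0.10))   # False: 0.12 exceeds the cumulative threshold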
Restricted adjunction gives rise to a logic that can be paraconsistent, as shown in the
following example.
Example 4. Let Δs = ⟨F, Ds⟩ be a statistical default theory, where F = ∅, Ds = {d1, d2},
and

    d1 = : q / p,  0.10;        d2 = : r / ¬p,  0.10.

Given a cumulative threshold parameter δ such that 0.10 ≤ δ < 0.20, we obtain a statistical
extension Γ that contains both p and ¬p. However, since p and ¬p each has an associated
error bound of 0.10, the cumulative error bound for p ∧ ¬p is 0.20, which makes the conjunction unacceptable into Γ. Thus, p and ¬p can coexist in an extension while still avoiding
the uncomfortable situation of deriving the whole language L along with them.

Paraconsistency is arguably a desirable property in the all too common situation of
contradictory premises and conclusions. In some cases, we may have evidence to support a
statement and equally good (but different) evidence to support its negation, and we need to
be able to account for both lines of reasoning simultaneously.
Let us consider a slightly larger example, inspired by a default version (Poole 1989) of
the probabilistic lottery paradox (Kyburg 1961). There are n species of birds, s1 , . . . , sn . We
can say that penguins are atypical in that they cannot fly; hummingbirds are atypical in that
they have very fine motor control; parrots are atypical in that they can talk; and so on. If
we apply this train of thought to all n species of birds, there is no typical bird left, as for
each species there is always at least one aspect in which it is atypical. A parallel scenario is
formulated below.
Example 5. Suppose we know that the proportion of each species si among all birds is in
the interval [pi, qi] for 1 ≤ i ≤ n, where pi and qi are both taken to be small numbers.
Following the principles discussed in Section 6.3 we can derive n statistical default rules,

    di = bird : ¬si / ¬si,  εi        for 1 ≤ i ≤ n.        (11)

Each rule di says that a bird is typically not of species si. Rule di has an associated rule
threshold parameter εi, which is the maximum chance of error of the rule, that is, the maximum
chance that the bird in question is indeed of species si.

We can then proceed to construct a statistical default theory Δs = ⟨F, Ds⟩, where F
contains

    bird                                 [here is a bird],
    bird → s1 ∨ . . . ∨ sn               [an exhaustive list of bird species],
    si → ¬sj, for all j ≠ i              [species are mutually exclusive],

and Ds contains the n rules shown in (11).

In the standard formulation of default logic, we would get n extensions, each containing
one si and the negations of all the other sj. In other words, for each extension, we would
conclude that the bird we have is of a particular species. This seems to be an overcommitment,
given the statistical information that only between pi and qi of birds are of species si.
In the statistical default logic setting, we can have less extreme conclusions. With respect
to a cumulative threshold parameter δ, where

    max(ε1, . . . , εn) ≤ δ < min_i Σ_{j≠i} εj,

all n statistical default rules can be applied to conclude respectively ¬s1, . . . , ¬sn, but we
do not necessarily include in the same statistical extension Γ any of their deductive consequences. Thus, in the extension we can have

    s1 ∨ . . . ∨ sn, ¬s1, . . . , ¬sn,

without getting a contradiction. A conjunction of the form ¬si ∧ · · · ∧ ¬sj is included in Γ
if and only if εi + · · · + εj ≤ δ.

In the case where δ > Σ_{j≠i} εj for some i, we would derive ¬sj for all j ≠ i. These
sentences together with s1 ∨ . . . ∨ sn would give us si, blocking the application of di. Thus,
we would have an extension containing si and, for all j ≠ i, ¬sj.

In the extreme case where δ ≥ max_i(Σ_{j≠i} εj), we would get n extensions each containing one positive atom si. The extensions obtained in this case are identical to the ones
obtained in the standard default logic setting.
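To see the threshold bookkeeping of this example in action, here is a small sketch (ours; the number of species and the εi values are invented) that checks which rules fire individually and which conjunctions of their conclusions fit under δ.

    from itertools import combinations

    # Hypothetical rule thresholds eps_i for n = 4 species (numbers are ours, for illustration).
    eps = {"s1": 0.010, "s2": 0.015, "s3": 0.020, "s4": 0.025}
    delta = 0.042   # chosen so that max(eps_i) <= delta < min_i of the sum of the other eps_j

    # Every rule d_i is individually applicable, so each ~s_i is concluded.
    print([s for s, e in eps.items() if e <= delta])          # ['s1', 's2', 's3', 's4']

    def admitted(species):
        # A conjunction of conclusions is admitted only if its summed bound fits under delta.
        return sum(eps[s] for s in species) <= delta

    print([c for c in combinations(eps, 2) if admitted(c)])   # five pairs; ('s3', 's4') is too risky
    print(any(admitted(c) for c in combinations(eps, 3)))     # False: no n-1 conclusions conjoin,
                                                              # so no positive s_i is forced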

8. CONCLUSION
Much of classical statistical inference can usefully be formulated as default inference,
taking advantage of the distinction between prerequisites (statements that must be taken as
premises of the inference) and justifications (statements that must not be known, that is, statements in
whose presence the inference must fail). For example, in classical confidence inference to the
whose presence the inference must fail. For example, in classical confidence inference to the
mean of a normal distribution, the value of the observation must be known, for otherwise we
have no basis for the inference. But at the same time, we must not know that the item sampled
was obtained by a procedure that is biased toward large values; we must not have reason to
believe that the sampled item was part of a larger sample about which we have knowledge;
and there must be no way of invoking a known prior probability distribution, conditioning
on which would give us a conflicting probability.
It is common to have a collection of possible defaults whose conclusions conflict: students, by default, are not employed; adults, by default, are employed; so what are we to say
about adult students? Tweety is a penguin and a bird; birds, by default, fly; penguins, by default, do not fly. Many of these conflicting default inferences can be derived from statistical
knowledge. Choosing among these rules, when possible, can be based on the same principles
that can be invoked in choosing among reference classes in statistical inference (Kyburg and
Teng 2001).
Standard default rules are qualitative: if the default can be applied, the conclusion may
be included in an extension. Statistical defaults admit the quantitative possibility of error: if
a hypothesis is rejected at the 0.05 level, there is the explicit maximum chance of 0.05 that
this rejection is in error, and that the hypothesis is nevertheless true. Statistical defaults are

therefore different from standard defaults in that they include a parameter ε that represents
the maximum chance of error of that inference.
A consequence of this difference is that while a standard default extension is an ordinary
deductively closed logical theory, extensions for statistical default theories are more complicated. For one thing, they are not closed under adjunction: if conclusion A may be in error no
more frequently than ε and conclusion B may be in error no more frequently than ε, that does
not at all mean that the conclusion A ∧ B is similarly constrained to be in error no more often
than ε. For similar reasons, we cannot expect the extensions of statistical default theories to
be deductively closed.
However, we can make sense of the extensions of statistical default theories. We investigated some properties of statistical default theories and their connection to standard default
theories. Here Reiter's default logic is used as a concrete platform for the discussion, but our
approach is applicable to other nonmonotonic formalisms as well. The inferential relations
among statistical default extensions may be reflected in a weak modal logic, the logic of
thresholding (Kyburg and Teng 2002). We claim that these statistical extensions are plausible
candidates for realistic bodies of knowledge, and may be of use in AI.

REFERENCES
ADAMS, E. 1966. Probability and the logic of conditionals. In Aspects of Inductive Logic. Edited by J. Hintikka
and P. Suppes. Amsterdam, North Holland, pp. 265–316.
ADAMS, E. W. 1975. The Logic of Conditionals. Dordrecht, Reidel.
BACCHUS, F., A. J. GROVE, J. Y. HALPERN, and D. KOLLER. 1993. Statistical foundations for default reasoning.
In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, pp. 563–569.
BAIRD, D. 1992. Inductive Logic: Probability and Statistics. Prentice-Hall, Englewood Cliffs, NJ.
BREWKA, G. 1989. Preferred subtheories: An extended logical framework for default reasoning. In Proceedings
of the Eleventh International Joint Conference on Artificial Intelligence, pp. 1043–1048.
BREWKA, G. 1991. Cumulative default logic: In defense of nonmonotonic inference rules. Artificial Intelligence,
50:183–205.
BREWKA, G. 1994. Reasoning about priorities in default logic. In Proceedings of the Twelfth National Conference
on Artificial Intelligence, pp. 940–945.
CHOW, S. L. 1996. Statistical Significance. Sage, London.
CLOPPER, C. J., and E. S. PEARSON. 1934. The use of confidence or fiducial limits illustrated in the case of the
binomial. Biometrika, 26:404–413.
CRAMER, H. 1951. Mathematical Methods of Statistics. Princeton University Press, Princeton, NJ.
DELGRANDE, J. P., T. SCHAUB, and W. K. JACKSON. 1994. Alternative approaches to default logic. Artificial
Intelligence, 70:167–237.
FISHER, R. A. 1930. Inverse probability. Proceedings of the Cambridge Philosophical Society, 26:528–535.
FISHER, R. A. 1971. The Design of Experiments. Hafner, New York.
GELFOND, M., V. LIFSCHITZ, H. PRZYMUSINSKA, and M. TRUSZCZYNSKI. 1991. Disjunctive defaults. In Proceedings of the Second International Conference on Principles of Knowledge Representation and Reasoning,
pp. 230–238.
HORTY, J. F., D. S. TOURETZKY, and R. H. THOMASON. 1987. A clash of intuitions: The current state of nonmonotonic multiple inheritance systems. In Proceedings of the Tenth International Joint Conference on Artificial
Intelligence, pp. 476–482.
KRAUS, S., D. LEHMANN, and M. MAGIDOR. 1990. Nonmonotonic reasoning, preferential models and cumulative
logics. Artificial Intelligence, 44:167–207.

KYBURG, H. E., JR. 1961. Probability and the Logic of Rational Belief. Wesleyan University Press, Middletown,
CT.
KYBURG, H. E., JR. 1974. The Logical Foundations of Statistical Inference. Reidel, Dordrecht.
KYBURG, H. E., JR., and C. M. TENG. 1999. Statistical inference as default reasoning. International Journal of
Pattern Recognition and Artificial Intelligence, 13:267–283.
KYBURG, H. E., JR., and C. M. TENG. 2001. Uncertain Inference. Cambridge University Press, New York.
KYBURG, H. E., JR., and C. M. TENG. 2002. The logic of risky knowledge. In Electronic Notes in Theoretical
Computer Science, Volume 67. Elsevier Science.
KYBURG, H. E., JR., C. M. TENG, and G. WHEELER. Forthcoming. Conditionals and consequences. Journal of
Applied Logic.
LEHMANN, E. L. 1959. Testing Statistical Hypotheses. John Wiley and Sons, New York.
LUKASZEWICZ, W. 1988. Considerations on default logic: An alternative approach. Computational Intelligence,
4(1):1–16.
MAYO, D. 1996. Error and the Growth of Experimental Knowledge. University of Chicago Press, Chicago.
MCCARTHY, J. 1986. Applications of circumscription to formalizing common sense knowledge. Artificial Intelligence, 28:89–116.
MIKITIUK, A., and M. TRUSZCZYNSKI. 1995. Constrained and rational default logics. In Proceedings of the
Fourteenth International Joint Conference on Artificial Intelligence, pp. 1509–1517.
MOORE, D. S. 1979. Statistics: Concepts and Controversies. W. H. Freeman, San Francisco.
NEYMAN, J. 1957. Inductive behavior as a basic concept of philosophy of science. Review of the International
Statistical Institute, 25:5–22.
PEARL, J. 1988. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Francisco.
PEARL, J. 1990. System Z: A natural ordering of defaults with tractable applications to default reasoning. In
Theoretical Aspects of Reasoning about Knowledge, pp. 121–135.
POLLOCK, J. 1990. Nomic Probability and the Foundations of Induction. Oxford University Press, Oxford.
POOLE, D. 1989. What the lottery paradox tells us about default reasoning. In Proceedings of the First International
Conference on Principles of Knowledge Representation and Reasoning, pp. 333–340.
REITER, R. 1980. A logic for default reasoning. Artificial Intelligence, 13:81–132.
REITER, R., and G. CRISCUOLO. 1981. On interacting defaults. In Proceedings of the Seventh International Joint
Conference on Artificial Intelligence, pp. 270–276.
TENG, C. M. 1997. Sequential thresholds: Context sensitive default extensions. In Proceedings of the Thirteenth
Conference on Uncertainty in Artificial Intelligence, pp. 437–444.
TOURETZKY, D. S. 1984. Implicit ordering of defaults in inheritance systems. In Proceedings of the Fifth National
Conference on Artificial Intelligence, pp. 322–325.
