
Emerging Systems and Machine Intelligence

Part one - Philosophical Underpinnings


1.1 Introduction

Entities that possess the ability to become more organized over time, that is, grow and
learn, often exhibit activities that can be described, metaphorically, as intelligent. Those
activities often involve the solution of a problem relevant to the entity in its
environment. For example, the members of an insect colony swarm together to repel a
creature that threatens the colony, bees dance to communicate the location of a food
supply, or a chimpanzee uses a stick as a tool to retrieve termites from a termite mound.
On a slightly longer time scale, trees communicate to warn of impending insect
infestation. Over long time scales, species of fish invent legs and lungs so they can
inhabit more of the earth's surface. And over vast expanses of time organic life-forms in
the oceans participate in fixing carbon dioxide levels in the atmosphere to maintain
temperatures on the surface of the earth that are favorable for their continued existence.
Although insect colonies are not considered intelligent beings nor is the evolution of a
species normally described as an entity to which intelligent activities can be ascribed,
the metaphor is appropriate. We propose a working description of the phenomenon of
intelligence that accepts the metaphor as representing a true state of affairs. It says that
intelligence is bound up in the dynamic interactions that occur between systems and
their environments at various levels of existence (to be defined later). Those interactions
are a result of the nature of the system and the nature of the environment which serve to
modify and therefore to define each other. It claims that intelligence is a part of the
organizing activity that characterizes some open processing systems (that concurrently
display increasing entropy[1] and increasing order). The implication is that what we
think of as human intelligence is actually an effect of the way human systems interact
in and with their environment. It implies corollary interactions between other systems
in and with their environments. Human intelligence is then either a kind of intelligence,
when viewed in a larger context, or a complex of activities defined redundantly as the
intelligent things humans do, when viewed from a narrow human perspective. This
hypothesis implies that there is no reason to suppose that a machine intelligence must
be of the human kind and, in fact, argues against the possibility of a machine
intelligence being too human-like. It provides support for the methods of machine
learning as the appropriate approach to implementing intelligence in machines. Further,
it indicates that for a machine to approach human-like intelligence it is necessary for the
machine to possess a degree of human characteristics and to gain knowledge of the
world through interaction with humans in a human-like environment at the human
level of existence.

[1] Entropy has its roots in probability theory (Khinchin, 1957). When applied to a probability
space (which may model either physical or abstract (information) systems), entropy is a
measure of the uncertainty with which the states of the system occur. In this paper both
information-theoretic entropy and physical entropy play a part, since both kinds of
entropy are exhibited by the systems being discussed (in fact the distinction is sometimes
blurry; see section 1.7). Open systems are those systems that interact with the surrounding
environment. They exchange information, energy, matter and space with that environment,
and their parts organize themselves to further some goal of the whole system. In doing this
they exhibit a decrease in entropy both in terms of information and energy. A social system
exhibiting a decrease in physical entropy is a growing system; one exhibiting a decrease in
entropy related to information can be thought of as a learning system. Closed systems do
not interact with their environment and display increasing entropy (they run down). Entropy is
intimately involved with the existence of order, a theme that is developed further in section
1.7.
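
The footnote on entropy describes it as a measure of the uncertainty with which the states of a system occur, and describes a learning system as one whose information entropy decreases. A minimal sketch of that idea follows, assuming the standard Shannon formula H = -sum(p * log2(p)) as the measure. The function name and the three four-state distributions are hypothetical, chosen only for illustration; they are not taken from the paper.

```python
import math


def shannon_entropy(probabilities):
    """Return the Shannon entropy, in bits, of a discrete distribution."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # States with zero probability contribute nothing to the sum.
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


# Hypothetical four-state system at three stages of "learning".
uniform = [0.25, 0.25, 0.25, 0.25]  # every state equally likely
skewed = [0.70, 0.10, 0.10, 0.10]   # one state has become dominant
certain = [1.0, 0.0, 0.0, 0.0]      # the outcome is fully determined

for name, dist in [("uniform", uniform), ("skewed", skewed), ("certain", certain)]:
    print(f"{name}: {shannon_entropy(dist):.3f} bits")
```

Under these assumptions, uncertainty is greatest when all states are equally likely and falls to zero when one state is certain, which is the sense in which a system that organizes its states can be said to lose entropy.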

[Figure: timeline diagram compressing the history of the universe into one calendar year
(scale marker: 1 billion years). Labels recovered from the diagram, in order: Big Bang;
interstellar molecular activity; the Earth forms; molecular activity; RNA-like molecules;
Archeozoic; microbes; Proterozoic; cells with membranes; Paleozoic; Mesozoic; Cenozoic;
multicelled organisms; plants; animals; man.]

Figure 1.1: History of the universe (letters represent months of the year...a popular
rendition)

1.2 The evolution of the universe

Discoveries in biology coupled with current cosmological theory give us an idea of
how we have come to be. Here is a speculative account. Ten to fifteen billion years ago
the universe was small, dense, relatively homogeneous, contained only very active
subatomic particles, and was rapidly expanding and cooling. At some point in the great
expansion vast quantities of hydrogen gas formed. Over time, some of the hydrogen gas
formed into denser clouds. In a process that continues today, parts of those clouds
collapsed or were compressed to still denser masses that heated to the point that nuclear
fires began to burn. The stars thus formed grouped into pairs, clusters and galaxies.
They burned hydrogen into helium; eventually some stars went through a sequence of
internal collapse during which heavier atomic elements were formed. Some stars were
of such a size that their fate was to explode and spread the heavier materials into the
surrounding space. Those elements simmered slowly in the spaces between the stars.
Atomic structure permitted the formation of molecules of varying complexity. About
four to five billion years ago our solar system formed from hydrogen gas and the
elemental and molecular debris from past stellar explosions. Earth formed at an
appropriate distance from the sun to permit liquid water to pool on the surface. The
energy flux from the sun and atmospheric effects such as lightning, rain, running water,
and wave action produced a continuously stirred chemical stew in which various new
molecules formed, dissipated, and reformed. Some of those molecules were of a type
that attracted complementary or like molecules and grew in layers or strings. An event or
process that cleaved off a portion of such a molecule produced new molecules; self-replicating
molecules had formed. Such molecules were subject to reproduction errors
creating similar but different molecules that were also able to replicate themselves. In
the ancient seas some molecules could replicate more easily, perhaps because of the
availability of their constituent parts in the environment or perhaps because of their
structure. They would tend to flourish while those less successful would tend to die out.
More complicated arrangements of molecules arose; some had protective casings and
became the first cellular entities. Cooperating cellular entities were successful
innovations and became the first organisms. The cooperating systems progressed to
states of greater order and eventually, in combination, worked significant changes in
their environment and themselves. Systems of increasing complexity were produced
and in turn interacted in the changing environment to produce still more complex
systems. Eventually, the system produced plants, then animals, then ape-like animals,
and then man-like apes.

At this point in the narrative we shift time scales from billions of years to tens of
thousands of years so we can speculate on the advent of man. Humans that were
anatomically indistinguishable from modern humans evolved in Africa sometime
between 300,000 and 50,000 years ago. Other anatomically different Homo sapiens
(Neanderthal man and Peking man) were, for a time, contemporaries of those early
modern humans. Neanderthal man was shorter, more thickly muscled and had a
sloping cranium that contained a brain that was, on average, 10% larger than those of
the early modern humans. Little is known about Peking man because very few of his
remains have been discovered. All three branches of Homo sapiens wore clothing, built
shelters, used stone implements, used fire and buried their dead. The emergence of
these relatively unique abilities can probably be attributed to the fortunate combination
of several attributes. The first was an upright stance that freed the hands for purposes
other than locomotion. The second was the configuration of the human hand that, in
addition to being good for climbing, was good for grasping and manipulating small
objects, e.g. tools. And finally, there was the development of some critical amount of
brain matter to provide coordination and problem solving capabilities. It can be
conjectured that the hands were an artifact of the tree climbing nature of the ape
ancestors, that the upright stance was acquired when those ancestors ventured out on
the plains and found it to their advantage to stand up straight so they could look out
over the grass to see what the other animals were up to. And the large brain evolved to
process the information from the senses, since that would allow the relatively slow,
weak and defenseless ape-human to keep track of the positions of other animals, to
learn their behavior and to anticipate that behavior. With this combination of attributes
Homo sapiens developed a Stone Age culture that persisted for hundreds of
thousands of years. Then about 35,000 years ago the early modern humans emerged as
the single representative of Homo sapiens. At the same time these 'early moderns'
began to develop complex tools (over a period of time, flint tipped spears, bone carved
weapons, needles and fishhooks, rope, nets, and bows and arrows, appeared), art (cave
drawings of local flora and fauna and man and his artifacts) and they organized their
communities and their activities. These technologically innovative people are called
Cro-Magnon man.

Why did Neanderthal and Peking man become extinct while the anatomically and brain
size inferior Cro-Magnon man survived and flourished? Probably that success can be
attributed to the superior technology of Cro-Magnon man and the fact, often
demonstrated in recent times, that when a technologically superior culture comes in
contact with an inferior one, the inferior culture is often destroyed. But that does not
explain why Cro-Magnon man developed technology and the other groups did not. It
is tempting to hypothesize that, through a genetic variation, Cro-Magnon man
developed a better brain. This seems reasonable for it can be used to explain Cro-
Magnon man's inventiveness. And, contrarily, if Neanderthal man had been as
'intelligent' as Cro-Magnon man then Neanderthal man would have copied Cro-
Magnon man's technology and progressed at a sufficient rate to ensure his survival.
However, in light of current knowledge about brain structure and composition the
argument for a superior brain seems questionable. A comparison of present day
mammalian brains reveals that the brains of larger mammals differ from each other
primarily by weight, in a relation that is highly correlated with body size[2]. Further
evidence comes from the study of the skulls of ancient and modern primates. To some
extent these skulls indicate by their shape and deformations the structure of the brains
they contained. Features of brain structure noted in modern human skulls, related to
various aspects of human existence, notably speech and hand control, appear to have
developed in all of the various branches of primates, and are evident in Neanderthal
and present-day apes. If brain differences can't account for the differences between man
and animal, why suppose that they account for the differences between Neanderthal
and Cro-Magnon man? In addition, there does appear to be evidence that, for a short
period of time in western Europe in a late ice-age culture termed the Chatelperronian,
Neanderthal man did coexist with Cro-Magnon man and did adopt Cro-Magnon's
tools. Archaeologists have debated long and hard over which people created
Chatelperronian culture. The evidence for it consisted of a mixture of Neanderthal and
Cro-Magnon tools but no Cro-Magnon art. Then a skeleton was unearthed that proved
to be Neanderthal. That has apparently settled the question: Chatelperronian
culture was Neanderthal. Neanderthal man and, by extension, Peking man were
probably not hopelessly unintelligent, but just victims of a technologically advanced
culture. What then accounts for Cro-Magnon man's development of that technology?
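
The relation between brain weight and body size noted above is given in the accompanying footnote as a power law, brain weight ≈ a(body weight)^b, with b typically between 0.2 and 0.8. A small sketch of what such a relation implies follows. The constants `A_COEFF` and `B_EXPONENT`, the body weights, and the function names are hypothetical values chosen for illustration, not measurements.

```python
import math


def predicted_brain_weight(body_weight, a, b):
    """Allometric power law: brain weight ~ a * (body weight) ** b."""
    return a * body_weight ** b


def fit_exponent(body_1, brain_1, body_2, brain_2):
    """Recover b from two (body, brain) points via the log-log slope."""
    return (math.log(brain_2) - math.log(brain_1)) / (
        math.log(body_2) - math.log(body_1)
    )


# Hypothetical constants, for illustration only (not measured values).
A_COEFF = 0.1
B_EXPONENT = 0.7  # inside the 0.2 to 0.8 range quoted in the footnote

small = predicted_brain_weight(10.0, A_COEFF, B_EXPONENT)
large = predicted_brain_weight(1000.0, A_COEFF, B_EXPONENT)

# A body 100 times heavier yields a brain only 100 ** 0.7 (about 25) times
# heavier, so brain weight as a fraction of body weight falls with size.
print(f"brain ratio large/small: {large / small:.1f}")
print(f"brain/body fraction: {small / 10.0:.4f} -> {large / 1000.0:.4f}")
print(f"recovered exponent: {fit_exponent(10.0, small, 1000.0, large):.2f}")
```

Because b is less than 1 in this sketch, absolute brain weight rises with body size while relative brain weight falls, which is why raw brain weight alone says little about the differences the text is asking about.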

It has been conjectured (for instance Diamond, 1989) that the genetic innovation that
led to the development of modern language and the ascendancy of Cro-Magnon man
was the modification of the larynx and throat in a manner that permitted refined
speech[3]. The development of a complex language could then permit cooperative
behavior on a scale previously unknown ("you circle around and chase those deer this
way and I'll hide behind that large rock and jump out and spear one"), the transfer
of knowledge ("you have to use a long smooth stone and strike the flint rock at a sharp
angle to get a good edge"), and a new social structure ("he is old and weak but we must
care for him because he knows which plants we can eat and the ways of the animals").
Slowly, in conjunction with the development of language, a technological civilization
could develop. Note that the importance of the development is not in the complexity of
the ideas expressed (the ancestors of modern man were almost certainly able to
comprehend such complexities for many millions of years before speech appeared), but
in the fact that such complex ideas could be easily communicated. Discoveries
concerning the nature of the world, made by individuals, could be rapidly
communicated to everyone and that information could be passed on from one
generation to the next. Knowledge gained by experience was no longer lost upon the
death of the individual and most innovations could contribute to the advance of the
whole society. Our current civilization is a result of this process, which has expanded,
accelerated, and in effect, taken on a life of its own.

[2] The relation is $\text{brain weight} \approx a\,(\text{body weight})^{b}$, in which b is
determined by the taxonomic level (i.e. the relationship changes between orders, but b is
typically between 0.2 and 0.8). Some possible explanations are that since the brain grows
relatively faster during gestation than later on in the growth of an animal, those animals
with long gestations (most large animals) are predisposed toward relatively larger brains,
and that relatively large or small brain weights may result from the demands of particular
ecological conditions (Pagel and Harvey, 1989). If intelligence were correlated with brain
weight it would be expected that some large-brained individual animals should exhibit more
intelligence than some individual small-brained humans. Certainly large-brained humans
would be more intelligent than small-brained humans. The latter is a once popular theory,
debunked by Stephen Jay Gould (Gould, 1981).

1.3 Philosophy of mind

This story describes a sequence of events all of which seem plausible. In fact, most of the
steps depicted accord with generally accepted scientific theories supported by some
empirical evidence. But even if such an explanatory account were more detailed,
containing all of the current speculation and theory about the evolution of the universe
and particularly about how man came to be, the story would not explain why such an
apparently fantastic sequence of events, all tending to produce order from disorder and
complexity from simplicity, should occur. Since it did occur, why did it result in
creatures such as ourselves? The story implies that we, in our current state, represent
only an incremental change in accord with a well established pattern of evolution. But
we perceive ourselves as fundamentally different from all other contemporary life
forms: a great leap forward! It is probably safe to say that most people would say
that we differ from the animals because of our intelligence or because of the human mind.
Trivially and obviously, our minds are different from animal minds just as our form
differs from animal forms. But most people would not be willing to admit the
possibility that an animal even has a mind. Descartes, the father of modern philosophy,
did not believe that animals have minds, and many present day philosophers, scientists
and other intellectual types also do not believe that animals possess minds; witness the
controversy that arises whenever a scientist proposes that animals do (or do not) have
minds (see for example the letters to the editor in Science News, 7/29/89 and 9/2/89 or
the work of Donald R. Griffin, Professor Emeritus, Rockefeller University, reported in
Scientific American, SCIENCE and the CITIZEN, PROFILE May 1989). Do only people
possess minds? What then is a mind? Where does a mind come from? If humans
evolved a mind, why did no other animal evolve a mind? Or did they? Obviously the
mind is not a tangible thing but, if you tend to accept the above description of how
things have come to be (or anything similar to it), then you will tend to believe that
minds evolved along with everything else. It would be hard to deny that minds are
intimately connected to the physical organs called brains. Perhaps then, everything that
has a brain has a mind. Perhaps the difference between man and animal is that human
minds are superior? If they are superior, how are they superior? Is it that human minds
are intelligent minds? Is intelligence the key? We will argue that these questions are
only important to our human egos. The questions are misguided in that they assume a
certain scheme of things in which there are universal goals and purposes and objects of
central or universal importance. Within the assumption remains the disguised assertion
that man is in some way one of those objects of central importance. In spite of
Copernicus, Galileo, Newton, Darwin, and Einstein, the anthropocentrism observed
by Francis Bacon in his Novum Organum almost four centuries ago (1620) remains
today:

"The idols of the tribe are inherent in human nature, and the very tribe or race of man.
For man's sense is falsely asserted to be the standard of things. On the contrary, all
the perceptions, both of the senses and the mind, bear reference to man and not the
universe..."

[3] Others argue that the roots of language go back perhaps two million years and emerged
along with brain developments (reported in Science News, Vol. 136, No. 2, July 8, 1989). In
that case Neanderthal man also possessed speech and the ascendancy of Cro-Magnon man
remains unexplained.

We will argue that the above questions don't have an answer because they are
meaningless. That is, they are meaningless if you subscribe to the scheme of things as will
be developed in the first part of this paper.

In questions of physics, astronomy, and cosmology there are two philosophically
opposed viewpoints. Materialists advocate that all of nature is describable in terms of
objects and laws, while idealists believe that there exist controlling forces and/or
implicit universal templates to which the universe conforms and which are beyond the
ken of man. In biology and its related fields there are mechanists, who believe life is
describable in terms of the materials that comprise living things and the laws that bind
that material, while opposing them there are vitalists, who believe that life is so different
that it is necessary to hypothesize an élan vital, or life force, that may not be reduced to
other laws. In mathematics the opposed camps are the foundationalists and the
intuitionists, the former putting their faith in the axiomatic derivation of all things
mathematics, the latter relying on the truth as revealed by discovery. Likewise there are
two philosophically opposed descriptions of the human mind. The one that has been the
most popular down to and perhaps including the present time maintains that the
human mind is more than some physical activity of the body. In this philosophy the
mind is largely independent of the body, is inherently capable of thought processes and
inherently possesses knowledge. This idealist's (or, to use a more recent term,
mentalist's) philosophy maintains that the elements that serve to define a mind are not
comprehensible. That is, the mind is not a result of the known or knowable laws that
govern the material universe. The opposing point of view belongs to the physicalists.
Their approach to the description of the human mind, to which materialists,
mechanists, and foundationalists would undoubtedly subscribe, states that the human mind is just
the sum of the electro-chemical activity of the brain. In more recent versions of this
philosophy (since the advent of theories of evolution) the mind is seen as being
constrained indirectly by the genetic material that serves as a blueprint for ontogenesis
and as a repository and transferal agency for the information derived from man's
interaction with the environment over the course of evolution.

The two generic views (call them idealist and materialist) affect all of the rest of the
subject matter of philosophy. For instance, the idealist's philosophy accommodates the
ideologies of theology, while the materialist's view rejects, or at the minimum casts
doubt on, the existence of a Creator of the universe, and more strongly rejects the idea
that any such Creator has an ongoing interest in the universe. For the point at hand, the
idealist's view argues strongly against the possibility of constructing artificially
intelligent minds in machines, for, if that is to be done, those machines will have to be
built to conform to laws known to man. It is interesting to note, however, that should
the idealist's view be correct, the existence of a machine that is artificially intelligent is
not precluded; it is possible that the same externally motivating force that applies to the
human mind might apply to an appropriately vitalized assemblage of silicon-based
electro-chemical components. Likewise, should the materialist's view be valid, there is
no assurance that it is within man's ability to construct an acceptably intelligent
machine (we might not know all of the relevant rules nor be able to discover them in a
reasonable amount of time even if they are comprehensible to us). Whatever the case,
since the advent of computers the two views are more distinct than before. The
possibility of machine intelligence gives the advocates of both views a real world issue
on which to focus. We will present a viewpoint that lies somewhere between these two
extremes. It does not resort to the unknown or unknowable to explain things but neither
does it accept the contention of the materialist that everything is reducible to a set of
fundamental rules and objects. It is holistic in nature but not mystical or spiritual. It
does not abandon science, but it does recognize limitations in some of the most popular
mechanisms of scientific endeavor. It does not preclude the construction, by man, of
(truly) intelligent machines but it does imply that there are some restrictions on how
those machines can be brought to be.

The purpose at hand is an exploration of the nature of human intelligence and the
application of any observations or conclusions to an effort to create an artificial human-
like intelligence. However, we deem the context in which that intelligence arose to be as
important as the investigation of the mind itself. To that end a diverse set of theories
taken from philosophy, biology, physics, cosmology, mathematics, computer science,
psychology, and linguistics is reviewed below. Philosophers have been asking questions
about what it means to exist and about the nature of mind since the beginning of recorded
history, and perhaps since, or even before, the emergence of modern man. This recent (relatively
speaking) body of philosophical thought would seem a good place to start. So again we
shift gears from thousands of years to hundreds of years and continue the narrative in
historical order, but now we concentrate on the attempts to find answers to these
questions.

1.4 Historical perspective

To the ancient Greek philosophers Socrates and Plato, human perception of the world as
revealed by the senses was but a poor and corrupt reflection of the more real, perfect,
and comprehensive underlying system of truths in which everything had its proper and
logical place. They were not psychologists and spent little time in analyzing the obvious
and imperfect functioning of the human mind. They advocated that man should strive
to use his mind in a systematic and logical manner that would allow him to see the
underlying perfection of that more perfect and more real world. This Platonic concept of
the ideal and real world as revealed by the mind is the most important legacy of these
philosophers and remains with us today. In a sense it is the distant parent of the current
philosophies that encourage reification of abstract ideas such as intelligence. In the area
of mathematics, Euclid in his Elements produced a set of propositions and theorems
carefully built by reason in such a clear and lucid manner that it immediately became
a textbook and remained the primary text on geometry for centuries. This seed was to
lie dormant until the Western Renaissance, at which time it began to grow. It became the
paradigm for mathematical and philosophical reasoning. Today "Euclidean" may be used
as a synonym for the adjective "axiomatic" and may be directly associated with
foundationalism in mathematics. Aristotle was the last of the line of Greek philosophers
and in fact the last philosopher of note for a long time. In his philosophy universals
replace the Platonic Ideals. He associated words with concepts and so presented the first
philosophy of meaning and reference. He had a theory in which forms were
independent of the substance of which they were composed. The essence of a man was
that without which a man would no longer be a man. That essence was a form and so
independent of the physical substance of a man. This duality of substance and form is,
for a man, body and soul, an idea to be pursued later by Rene Descartes.

Greek thought represented a relatively unshackled pursuit of knowledge. Great strides
in human culture, government, architecture, astronomy, engineering, technology and
especially mathematics were made during this period. But when the Greek influence in
world affairs declined, philosophizing also declined until, by the end of the Roman
Empire, which at least approved of the Greek ways, what passed for philosophy was
argument over theological doctrine. For over 2000 years little was added to the little
already achieved in the philosophy of mind. Over that time Aristotle's philosophy and
science and Euclid's methods and geometry were unchallenged and unchanged. It was
in the context of what the Greeks had achieved two thousand years before that progress
was finally resumed in the seventeenth century.

Rene Descartes published his Principia Philosophiae in 1644 and founded modern
philosophy. Descartes was looking for an axiomatic system of philosophy similar to the
axiomatic systems of mathematics. To this end he contrived to strip away from human
knowledge all elements that could be reduced and to concentrate on the "smallest and
simplest things"[4]. Then, from these fundamental propositions, and using "all the aids of
intellect, imagination, sense, and memory"[5], he would build up all of human knowledge (Smith
and Grene, 1948). Descartes' starting place was the famous proposition "I think therefore
I am." He contrived to show the dual nature of the mind as existing independent of the
body and controlling the body (as Descartes maintained, through the pineal gland). This
was an introspective approach that measured the mind through the mind's own eye. It
is idealistic in view but based on analysis; attempting to prove that the mind is
autonomous and comes with "imagination, sense and memory." These ideas are
constructed in a manner that rigorously adheres to logic and by an individual with a
background of immensely successful mathematical invention. Although the logic is
incorrect[6], it set the tone for philosophical investigations of the mind in the succeeding
age. The idea that humans were born with a mind endowed with logic and world
knowledge, and independent of the body, was ubiquitous among philosophers until
some philosophers, enthusiastic about the successes of the new scientific methods based
upon empirical evidence, brought about a new approach to the concept of mind.

[4] Rene Descartes, Rules for the Direction of Understanding, Rule IX (City: Publisher, Year) Pages.

[5] Rene Descartes, Rules for the Direction of Understanding, Rule XII (City: Publisher, Year) Pages.

[6] Norman Malcolm, Thought and Knowledge (City: Publisher, 1977) Pages.

Descartes died in 1650, eight years after the birth of Isaac Newton. Descartes was
undoubtedly one of the giants Newton was referring to when he spoke of seeing further
because he stood on the shoulders of giants. A contemporary of Newton's and perhaps
another of the giants to which he referred was Thomas Hobbes. The mechanism and
precision of the universe was being revealed by science and was reflected in Hobbes's
philosophy. Here is a part of the introduction to his Leviathan:

"Nature, The art whereby God hath made and governs the world, is by the art of man, as in
many other things, so in this also imitated, that it can make an artificial animal. For seeing life
is but a motion of limbs, the beginning whereof is in some principal part within; why may we
not say, that all automata (engines that move themselves by springs and wheels as doth a
watch) have an artificial life? For what is the heart but a spring; and the nerves, but so many
strings; and the joints, but so many wheels, giving motion to the whole body, such as was
intended by the artificer?"

Hobbes saw imagination (ideas or impressions) as sense weakened by the absence of the
object and thought as a form of reckoning. He was enthusiastic about the new empirically
based science and the results to be derived therefrom. He was the precursor of a string
of empirical philosophers, caught up in the methods and results of the emerging
science.

John Locke in his Essay Concerning Human Understanding published in 1690 was the
first of these empiricists to clearly state an emerging theme, rhetorically answering his
own query about the origin of human knowledge, "To this I answer in one word, from
experience: in that all our knowledge is founded, and from that it ultimately derives
itself."[7] But it was David Hume, in his A Treatise of Human Nature published in 1739, who
clearly elucidated this new approach. Hume specifies all knowledge as arising from two
sources: ideas and impressions. By impressions he means percepts or thoughts arising
from the senses and by ideas he means all of the faint echoes of impressions, and the
combinations of them, that arise in the mind. He states that the purpose of the Treatise is
to establish the proposition "that all our simple ideas in their first appearance are
deriv'd from simple impressions, which are correspondent to them, and which they
exactly represent."[8] He seized upon an idea of Bishop George Berkeley (a
contemporary) that general ideas are simply a collection of particular ideas "annexed to
a certain term," in other words connected mentally in a network. He then proceeds to
detail the characteristics of such a network. A present day computer programmer
familiar with knowledge representation schemes would be startled by Hume's
anticipation of those schemes. It is possible to detect in the Treatise concepts that would
translate into the present day computer knowledge implementation techniques of
frames with default values and inheritance, various sorts of semantic networks (e.g.
kind-of, is-a), procedural attachment, indexing and classification schemes as well as
the psychological concepts of short term and long term memory. The knowledge

7
Book II, Ch. I, Sec 2
8
Book I, Part I, Sec. I
representation hypothesis9 (Smith 1982) is a modern restatement of Hume's hypothesis
in which computers are the object of these mechanizations.
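
The frame and is-a techniques named above can be illustrated with a minimal sketch. Everything in it (the Frame class, the slot names, the bird example) is hypothetical and chosen only for illustration; it is not drawn from Hume, from Smith, or from any particular knowledge representation system.

```python
class Frame:
    """A hypothetical frame: named slots plus an optional parent (an "is-a" link)."""

    def __init__(self, name, parent=None, **slots):
        self.name = name
        self.parent = parent
        self.slots = dict(slots)

    def get(self, slot):
        # Look in this frame first, then climb the is-a chain for a default value.
        frame = self
        while frame is not None:
            if slot in frame.slots:
                return frame.slots[slot]
            frame = frame.parent
        raise KeyError(slot)

    def is_a(self, other):
        # True if `other` is this frame or one of its ancestors.
        frame = self
        while frame is not None:
            if frame is other:
                return True
            frame = frame.parent
        return False


# A general idea ("bird") carries defaults; particular ideas override or inherit them.
bird = Frame("bird", can_fly=True, legs=2)
penguin = Frame("penguin", parent=bird, can_fly=False)
opus = Frame("opus", parent=penguin)

print(opus.get("can_fly"), opus.get("legs"), opus.is_a(bird))
```

Here the particular frame inherits the number of legs from the general one while the intermediate frame overrides the default for flight, which is roughly the sense in which particular ideas are "annexed" to a general term.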

From a philosophical standpoint, Hume's major impact was in his denial of a necessary
relation between cause and effect. That an effect followed necessarily from every cause
was an axiom used freely by philosophers and was the basis of some proofs for the
existence of God (basically such proofs traced the chain of cause and effect back to a first
cause, which was God). Hume pointed out that there was no reason for this belief, since
the fact of sensory awareness of what is termed the cause and what is termed the effect,
together with the fact that the cause is observed in conjunction with the effect, are the
sole criteria by which to judge, and they do not prove the assertion. For instance, if two
clocks were identical except that the first had a chime and the other didn't, then the chiming
of the first might be perceived to be caused by the second. Hume argued that judgement
based upon the probable progression of events had to be substituted for the certainty of
the absolute (this anticipates the importance accorded the probability distribution of
states of systems in this paper). Hume's philosophy focused on the nature of human
mind and surveyed the universe from that viewpoint. It emphasized reason in this age
of reason and made no inferences about God. For this Hume was considered a skeptic
and was guaranteed a response from less skeptical philosophers. The reply came from
Immanuel Kant in his Critique of Pure Reason published in 1781.

Kant accepted Hume's proof that the law of causality was not analytic and went further
to assert that everything was subjective in nature. But he also maintained that the mind
had to possess an innate mechanism that provided the order that it made of its percepts.
A simple network or classification system would not do the trick; how would such a
network be established in the first place? Sense data would not simply order
themselves. He proposed that there were twelve a priori concepts, three each in four
categories. They included the categories of quantity (unity, plurality, totality), quality
(reality, negation, limitation), relation (substance and accident, cause and effect,
reciprocity) and modality (possibility, existence, necessity) (Russell 1945). His argument
in support of the existence of these innate mechanisms of mental order depended upon
showing the inconsistencies that arise from applying them to non-mental constructs.
Much of the Critique is given over to such demonstrations. Whether or not Kant's
arguments are accepted, his criticism that Hume's concept of mind is missing an
ordering mechanism is valid. By extension, this criticism must also be seen to apply to
the knowledge representation hypothesis (Smith, 1985) (which could be easily corrected
to specify that such a mechanism is included ... evidently being supplied by an incredibly
insightful human computer programmer).

9
From Brian C. Smith, Prologue to Reflection and Semantics in a Procedural Language, paraphrased and
annotated. Comments in parentheses have been added.

Any mechanically embodied intelligent process will be comprised of structural ingredients that:

1. we as external observers naturally take to represent a propositional account of the knowledge that the
overall process exhibits, (i.e. we recognize as the structures containing the knowledge of the system) and

2. independent of such external semantical attribution play a formal but causal and essential role in
engendering the behavior that manifests that knowledge. (i.e. which the machine can use to act on or
otherwise exhibit that knowledge.)

In the course of the Critique, Kant discovers what he terms antinomies, or contradictory
propositions that are apparently provable. As examples, consider the proposition that
space is finite as opposed to the proposition that space is infinite, and the proposition
that everything is made of composite parts as opposed to the proposition that there are
elemental things that cannot be subdivided. Georg Wilhelm Friedrich Hegel was a
philosophical successor to Kant. His influence was strong during the early part of the
nineteenth century. He seized upon the idea of antinomy as a dialectic. He championed
the idea that the mind saw things in terms of theses, each of which had its antithesis that
came together with the thesis in a synthesis that had some of the attributes of both. This
struggle between thesis and antithesis was the law of growth (Durant 1961).

Hegel saw the only reality in the world as the whole of the world. Everything derived
its existence from its relation to the whole (which might, for the sake of clarity, be
viewed as an organism, so that, for example, your heart derives its nature from the part
it plays in your body). The Hegelian dialectic (basically the idea of thesis, antithesis and
synthesis augmented by other arguments) provided the method by which, eventually,
necessarily, and inevitably the whole is derived from its parts. Starting at any
proposition, its antithesis is obtained, the resulting synthesis provides further
propositions to which the dialectic is recursively applied; synthesis becomes thesis, the
dialectic is reapplied and so on. Eventually the whole system is encompassed. Hegel's
philosophy is convoluted but interesting for that convolution. All notable philosophies
up to Hegel's time (and philosophies to the present time) are constructed on the
Euclidean (i.e. axiomatic) model, that is, according to the scheme called
Foundationalism. Foundationalism is characterized by a set of theses (or axioms) that
are accepted as truths together with justifying arguments (or logic) by which further
theses are derived. A characteristic method of foundationalism is reductionism in which
theses are made acceptable by reducing them to known (previously accepted) theses. In
Hegel's system the criterion for a thesis being accepted as part of the whole does not
depend on a basic set of axioms but on the whole itself. This has been termed the
Hegelian inversion by later philosophers. While Hegel's political philosophy was
influential in his time, notably influencing the young Karl Marx, his epistemological
arguments have implications for non-axiomatic, system-based philosophies (see the
section on cognitive systematization below).

The philosophers from Descartes through the empiricists philosophized to a
background of mathematical invention and scientific discovery that revealed the
vastness of the universe and the clockwork precision with which it operated. A
philosopher could be reasonably well informed and understand all of the major
discoveries. But science was growing and dividing into a host of disciplines. By the
middle of the nineteenth century no one man could be knowledgeable in all of them.
Since philosophy is largely a synoptic endeavor it became increasingly difficult for any
one philosopher to give an account of the importance and impact on philosophical ideas
of the new discoveries in physics, computer science, biology, linguistics, cosmology,
mathematics and psychology. To some extent the individual disciplines acquired their
own philosophers whose views were often myopic. In the following sections we will
discuss those ideas in association with the disciplines that gave rise to them.

1.5 Physical and Biological considerations

1.5.1 Evolutionary theory

The philosophical-religious climate of Europe in the eighteenth and nineteenth centuries
was favorable for investigations into the nature of biological organisms. This was due
to two aspects of the Judeo-Christian thought of that time:

1. Time was considered linear, with events measured from creation to eternity and
(for the Christians) memorable deterministic events occurring at the advent and
second coming of Christ. This was different from the cyclical nature of time of the
Greek philosophers (and of most other cultures).

2. It was assumed that God had created the world in a methodical ordered manner.
One perception of the order of things was the "scale of being," the "ladder of
perfection" or the "ladder of life." Man was perceived to occupy the highest earthly
rung of the ladder with all of the various life forms occupying lower positions
depending upon their perfection or complexity. The ladder did not stop at man
but continued on with Angels, Saints and other heavenly creatures that occupied
successively higher rungs until finally God was reached. Man thus had a dual
nature; a spiritual nature that he shared with the Angels above him on the ladder
and an animal nature that he shared with the creatures below him on the ladder.

Scientific thought and religious thought often mixed during the Renaissance and even
into the nineteenth century. Investigations into biological complexity were encouraged
so that the order of succession of creatures on the ladder of life could be more accurately
ascertained. That, coupled with the perception that time moved inexorably from the
past into the future, set the stage for the discovery of evolutionary processes. All that
was needed was the concept of vast periods of time10, the idea that the ladder of life
might have a dynamic character, and a non-supernatural mechanism by which
movement on it could occur. These ideas, or various forms of them, had already gained
widespread acknowledgement when Charles Darwin wrote his Origin of Species
(published in 1859). Origin united the idea of the transmutation of species over geologic
time with the mechanism of natural selection11 to yield the theory of biological
evolution. The theory required strong champions, for it made formal and respectable the
heretical ideas that had only been the object of speculation. Even in 1866 (seven years
after publication of Origin of Species) the ladder was still being defended (Eiseley, 1958).
From a scientific point of view, these were the last serious challenges to evolution. The
evidence from biology, geology and paleontology was too strong for that particular
form of theological predestination. At about the same time (1866) an Austrian monk
named Gregor Mendel published a report on his research into genes, the units of
heredity. In it he outlined the rules that govern the passing of biological form and
attribute from one generation to the next. The report was to lie unappreciated for more
than thirty years, until around the turn of the century, when the rules were
independently rediscovered. The Mendelian theory provided the Darwinian mechanism
of natural selection with an explanation for the necessary diversification and mutation
upon which it relied. The Mendelian laws were incorporated into Darwinism, which
became known as neo-Darwinism. In 1953 J. D. Watson and F. H. C. Crick proposed
the double helix molecular structure of the deoxyribonucleic acid (DNA) from which
genes are constructed and that contains the actual coded information needed to produce
an organism. This and various loosely related theories such as population genetics,
speciation, systematics, paleontology and developmental genetics complement
neo-Darwinism and combine with it to form what is termed the synthetic theory of
evolution. The synthetic theory of evolution represents the state of the theory today
(Salthe 1985).

10
The idea of geologic time was provided by James Hutton, the father of geology.
11
The result of the uneven reproduction of the genotype–phenotype in a group of members of a
species that can mate with each other. The mechanism of selection is that of the interaction of the
phenotype with its environment. The concept was popularized as 'the survival of the fittest' by
Herbert Spencer.

This new wave of scientific discovery about the earth, and particularly the biology of
the earth worked a new revolution in philosophical thought in which man had to be
perceived as playing a smaller role in the scheme of things if for no other reason than
the true grandeur of the scheme of things was becoming apparent. The first philosopher
to embrace the new biological theories was Herbert Spencer. He was a philosopher of
the latter half of the nineteenth century who seized upon the theory of evolution as
presented by Darwin and generalized it to a philosophical principle that applied to
everything. As Will Durant (Durant 1961) demonstrates by partial enumeration, to:

"The growth of the planets out of the nebulae; the formation of oceans and mountains on the
earth; the metabolism of elements by plants, and of animal tissues by men; the development of
the heart in the embryo, and the fusion of the bones after birth; the unification of sensations and
memories into knowledge and thought, and of knowledge into science and philosophy; the
development of families into clans and gentes and cities and states and alliances and the
'federation of the world'."

The last, of course, being a prediction. In everything Spencer saw the differentiation of
the homogeneous into parts and the integration of those parts into wholes. The process
from simplicity to complexity or evolution is balanced by the opposite process from
complexity to simplicity or dissolution. He attempted to show that this followed from
mechanical principles by hypothesizing a basic instability of the homogeneous, the
similar parts being driven by external forces to separated areas in which the different
environments produce different results. Equilibration follows; the parts form alliances
in a differentiated whole. But all things run down and equilibration turns into
dissolution. Inevitably, the fate of everything is dissolution. This philosophy was rather
gloomy but it was in accord with the second law of thermodynamics12 or law of entropy
that states that all natural processes run down. Entropy was proposed by Rudolf
Clausius and William Thomson in the 1850's and was derived from the evidence of
experiments with mechanical systems, in particular heat engines (see the section on
entropy below). The fact that science tended to support a philosophy in which the universe
and everything in it was bound for inevitable dissolution was shocking and depressing
to philosophers and to the educated populace at the turn of the twentieth century. But
thought about the mind now had a new dimension within which to work. Here was a
proposal that tried to explain (as a small part of the overall philosophy) how
environment might produce mind rather than the inverse. But theories of evolution do
not provide for biology what the grand unified theories of physics would
provide for physics. The universal application of the Darwinian theory of evolution is
not so simple as Spencer would have it.

12
The first law of thermodynamics deals with the conservation of energy, stating that the sum of the flow of heat and
the rate of change of work done is equal to the rate of change of energy in a system. There is a zeroth law (that, as
you may suspect, was proposed after the first and second laws, but that logically precedes them). It states that two
objects that are in thermal equilibrium with a third object will be in thermal equilibrium with each other.

In response to the overwhelming wealth of detail, the biological sciences began to turn
inward. The detail allowed for tremendous diversification without leaving the confines
of the life sciences. The subject matter of biology was, and is, perceived as so different
from the subject matter of other areas of science that it is considered unique and to some
extent closed. But the isolationist tendency is widespread in all of the sciences and, as
has been shown on many occasions, uniqueness is an illusion. In recent years there has
been a move on the part of some biologists to find a more general framework within
which biology fits together with the other disciplines of science. Such a move toward
integration should, of course, be of interest to all of the sciences. One such approach is
that proposed by Stanley N. Salthe13.

Salthe notes (Salthe 1985) that there are at least seven levels of biological activity that
must be considered for the investigation of what are considered common biological
processes. These include, but are not limited to, the molecular, cellular, organismic,
population, local ecosystem, biotic regional, and the surface of the earth levels. These
are usually studied as autonomous systems largely because of the difficulty in
attributing cause to other levels. Salthe proposes that there is, inherent in biological
phenomena, a hierarchical structure differentiated mainly by spatial and temporal
scaling. The hierarchy is not limited to biological systems, extending upward to the
cosmos and downward to quantum particles. Nor is it limited to any particular
enumeration such as the seven levels mentioned above. Interactions between levels are
largely constrained to those that occur between adjacent levels. At a particular level
(focal level) and for a particular entity the surrounding environment (or next higher
level) and the material substance of the entity (next lower level) are seen as providing
initial conditions and constraints for the activities or evolution of the entity. Salthe
proposes that a triadic system of three levels (the focal level and the immediately
adjacent levels) is the appropriate context in which to investigate and describe
systemic, and in particular biological, phenomena. So the theory of evolution that depends
upon natural selection applies at the population level and may or may not apply at
other levels. Salthe distinguishes several factors that provide differentiation into levels.
These include 1) scale in both size and time that prohibits any dynamic interaction
between entities at the different levels, 2) polarity, that causes the phenomena at higher
and lower levels to be radically different in a manner that prohibits recursion (that is,
phenomena observed at a higher level are not seen to recur at a lower level) and 3)
complexity, that describes the level of multiple interrelatedness between entities at a
level.
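
The triadic arrangement described above can be sketched as a small data structure. The level names are the seven listed in the text; the function names are hypothetical, and the sketch makes two simplifying assumptions that go beyond the text: the hierarchy is cut off at the ends of that list (the text says it extends further in both directions), and interaction is treated as strictly limited to adjacent levels (the text says only "largely constrained").

```python
# The seven levels enumerated in the text, ordered from lowest to highest.
LEVELS = [
    "molecular",
    "cellular",
    "organismic",
    "population",
    "local ecosystem",
    "biotic regional",
    "surface of the earth",
]


def triad(focal):
    """Return (lower, focal, upper) for a focal level.

    None marks a neighbour that lies outside this illustrative enumeration.
    """
    i = LEVELS.index(focal)
    lower = LEVELS[i - 1] if i > 0 else None
    upper = LEVELS[i + 1] if i + 1 < len(LEVELS) else None
    return lower, focal, upper


def interacts_directly(a, b):
    """Idealized rule: only immediately adjacent levels constrain each other."""
    return abs(LEVELS.index(a) - LEVELS.index(b)) == 1


print(triad("population"))
print(interacts_directly("molecular", "population"))
```

For the population level the lower neighbour (organismic) stands for the material substance of the entity and the upper neighbour (local ecosystem) for its surrounding environment.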

Biologists tend to be divided into two camps, the vitalists and the mechanists. The
mechanists see no reason why all aspects of living systems cannot be described in
physical terms. They believe it is possible to reduce phenomena observed at one level
to the operation of laws on objects at a lower level. This mechanist/reductionist
approach has succeeded in explaining many biological phenomena to the great benefit
of mankind. Most biologists would fall into this category (partly because of the success
and partly because the alternative feels unscientific). Vitalists, on the other hand, feel
that the immense complexity of physical systems precludes an explanation for life in
terms of physical laws. The distinction between living and non-living things, while
perfectly obvious, is virtually indefensible when formalized. Vitalists hypothesize an
elan vital or life force to account for the difference. The nature of that force has not been
succinctly identified by vitalists.

13
The specific approach presented by Salthe has its roots in general systems theory. That theory’s
inception was in Norbert Wiener’s cybernetics. It was given its strong foundation and extended far
beyond the original content by such researchers as Ludwig von Bertalanffy (1968), Ervin Laszlo (1972),
Ilya Prigogine (1980), Erich Jantsch (1980), and many others. Systems science is now a firmly
established discipline. It provides the context in which the discussion concerning ‘levels’ on the ensuing
pages should be considered.

In spite of the popularity of mechanist approaches, complete explanations are not
forthcoming from them; it is one thing to recognize parts and interactions of parts of systems and
another to explain the activities of the whole in terms of the parts. Two problems stand
in the way of the success of the mechanists: 1) the process itself is doomed to an infinite
regress of explanation of finer and finer detail or the description of processes in
contexts of ever greater scope, and 2) even after the successful explanation of
the nature of an object in terms of a lower level, those explanations provide no explanation
for the activities of that object at its level of existence as a whole. So the attempts of the
mechanists to reduce the events at one level to those at another, while useful, are not
complete; it may be possible to explain the structure of a cell in terms of certain
molecular building blocks and the laws that govern those constituents, but doing so
does not explain the activities of that cell. W. M. Elsasser (Elsasser, 1970) argues that the
existence of unique, distinguishable individuals at a level actually constitutes a feature
of reality overlooked by the standard physics:

"The primary laws are the laws of physics which can be studied quantitatively only in terms
of their validity in homogeneous classes. There is then a 'secondary' type of order or regularity
which arises only through the (usually cumulative) effect of individualities in inhomogeneous
systems and classes. Note that such order need not violate the laws of physics"

Salthe and others who attempt to distinguish levels as more than mere handy
abstractions to locate or specify objects provide a means by which mechanists and
vitalists have some common ground. The processes and objects that occupy a level are
not expected to be completely explainable in terms of any other level; in fact any
complete explanation is precluded by the increasing impermeability that characterizes
the correspondence between increasingly distant levels. Only immediately adjacent levels
have any appreciable impact on each other's activities. This saves the mechanists
from the infinite regress of explanations but leaves the explanation of elan vital for
further investigation. As Elsasser points out, such explanations need not be mystic, spiritualistic, or
non-scientific. One possible explanation for the observed propensity of systems to self-
organize is given in section 1.7. It is an unexpected explanation, advocating that
organization results not because of any positive motivating force (thereby avoiding a
retreat from science) but rather because of the inability of some open systems to pursue
their entropic imperative at a rate commensurate with the expanding horizons of the
local state space. It will be argued that the nature of the system at a level in an
environment dictates the form that the resulting organization takes. These ideas have
applicability to virtually any system at any level, not just biological systems at the levels
identified above. We shall comment and expand upon these ideas later in sections 1.7,
1.11, and 1.12, but first we will review the development of knowledge about the nature
of the physical world beyond (and below) the lowest levels mentioned above. We skip
over the molecular level, whose rules and objects are detailed in chemistry, the atomic
level described by atomic physics, and the nuclear level described by nuclear physics, to
the quantum level whose rules and objects are expressed in particle physics by quantum
mechanics. We do this because the quantum level vividly illustrates the increasing
inability to relate the actions and objects at different levels far removed from the human
level of existence.

1.5.2 Quantum theory

At the end of the nineteenth century physicists were perplexed by the fact that objects
heated to the glowing point gave off light of various colors (the spectra). Calculations
based on the then current model of the atom indicated that blue light should always be
emitted from intensely hot objects. To solve the puzzle, the German scientist Max
Planck proposed that light was emitted from atoms in discrete quanta or packets of
energy according to the formula $E = hf$, where $f$ was the frequency of the observed light
in hertz and $h$ was a constant unit of action (of dimensions energy/frequency). Planck
calculated $h = 6.63 \times 10^{-27}$ ergs/hertz and the value became known as Planck's constant.
In 1905 Albert Einstein suggested that radiation might have a corpuscular nature.
Noting that $mc^2 = E = hf$, or more generally $mv^2 = hf$ (where $v$ stands for any velocity),
provides a relation between mass and frequency, he suggested that waves should have
particle characteristics. In particular, since the wavelength $\lambda$ of any wave is related to
the frequency by $f = v/\lambda$, then $\lambda = h/mv$, or setting the momentum $mv = p$, then
$p = h/\lambda$. That is, the momentum of a wave is Planck's constant divided by the wavelength.
Years later (in 1923), Prince Louis de Broglie, a French scientist, noted that it should be
true that the inverse relation also exists and that $\lambda = h/p$. That is, particles have the
characteristics of waves. This hypothesis was quickly verified by observing the
interference patterns formed when an electron beam from an electron gun was projected
through a diffraction grating onto a photo-sensitive screen. An interference pattern
forms from which the wavelength of the electrons can be calculated and the individual
electron impact points can be observed. This, in itself, is not proof of the wave nature of
particles since waves of particles (for instance sound waves or water waves), when
projected through a grating, will produce the same effect. However, even when the
electrons are propagated at widely separated intervals (say one a day), the same pattern
is observed. This can only occur if the particles themselves, and not merely their
collective interactions, have the characteristics of waves. In other words, all of the objects
of physics (hence all of the objects in the universe) have a wave-particle dual nature.
This idea presented something of a problem to the physicists of the early part of this
century. It can be seen that the wave nature of objects on the human scale of existence
can be ignored (the wavelength of such objects will generally be less than $10^{-27}$ meters
so, for instance, if you send a stream of billiard balls through an (appropriately large)
grating the interference pattern will be too small to measure, or at least, small enough to
safely ignore). Classical physics, for most practical purposes, remained valid, but a new
physics was necessary for investigating phenomena on a small scale. That physics
became known as quantum mechanics.
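
The scale argument above can be checked with a few lines of arithmetic. The sketch below uses SI units rather than the erg-based units quoted in the text, and the masses and speeds (an electron at a million meters per second, a 0.17 kg billiard ball at one meter per second) are illustrative assumptions, not values from the text.

```python
PLANCK_H = 6.626e-34  # Planck's constant in joule-seconds (SI units)


def photon_energy(frequency_hz):
    """Planck's relation E = h f."""
    return PLANCK_H * frequency_hz


def de_broglie_wavelength(mass_kg, speed_m_s):
    """The relation lambda = h / p with non-relativistic momentum p = m v."""
    return PLANCK_H / (mass_kg * speed_m_s)


# Assumed illustrative values: a slow (non-relativistic) electron and a billiard ball.
electron_wavelength = de_broglie_wavelength(9.109e-31, 1.0e6)
ball_wavelength = de_broglie_wavelength(0.17, 1.0)

print(f"electron:      {electron_wavelength:.3e} m")
print(f"billiard ball: {ball_wavelength:.3e} m")
```

Under these assumptions the electron's wavelength comes out comparable to the spacing of atoms, which is why diffraction can be observed, while the billiard ball's wavelength is many orders of magnitude below anything measurable.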

The Austrian physicist Erwin Schrödinger applied the classical mathematics of waves, of the
kind developed by James Clerk Maxwell, to develop an appropriate wave function ($\psi$) for
calculating the probability of the occurrence of the attributes of quantum particles14. For
example, solving $\psi$ for a specified position of a particle (say the position of an electron
near an atom) yields a value whose squared magnitude gives the probability of finding the
particle at that position. Then, to investigate the trajectory of an electron in relation to its atom,
the equation can be solved for a multitude of positions. The locus of points of highest
probability can be thought of as a most likely trajectory for the electron. There are many
attributes that a quantum particle might have, among which are mass, position, charge,
momentum, spin, and polarization. The probability of the occurrence of particular
values for all of these attributes can be calculated from Schrödinger's wave function.
Some of these attributes (for example mass and charge) are considered static and may
be thought of as fixed and always belonging to the particle. They provide, in effect,
immutable evidence for the existence of the particle. But other attributes of a particle are
complementary in that the determination of one affects the ability to measure the other.
These other attributes are termed dynamic attributes and come in pairs. Foremost
among the dynamic complementary attributes are position, $q$, and momentum, $p$. In
1927 Werner Heisenberg introduced his famous uncertainty principle in which he
asserted that the product of the uncertainty with which the position is measured, $\Delta q$,
and the uncertainty with which the momentum is measured, $\Delta p$, can never be less
than Planck's constant. That is, $\Delta q \, \Delta p \geq h$. In other words, if you measure one attribute
to great precision (e.g. $\Delta q$ is very small) then the complementary attribute must
necessarily have a very large uncertainty (e.g. $\Delta p \geq h/\Delta q$). A similar relation is true of
any pair of dynamic attributes. In the early days of quantum mechanics it was believed
that this fact could be attributed to the disturbance of the attribute not being measured
by the measuring process. This view was necessitated by the desire to view quantum
particles as objects that conform to the normal human concept of objects with a real
existence and tangible attributes (call such a concept normal reality). Obviously, if
Heisenberg's principle was true and the quantum world possessed a normal reality, the
determination of a particle's position must have altered the particle's momentum in a
direct manner. Unfortunately, as will be discussed below, the assumption of normal
reality at quantum levels leads to the necessity of assuming faster than light
communication among the parts of the measured particle and the measuring device.
The idea that quantum objects conform to a normal reality is now out of favor with
physicists (though not discarded). In any case, the principle was not intended as
recognition of the clumsiness of measuring devices. Heisenberg's uncertainty principle
arose from considerations concerning the wave nature of the Schrödinger equations and
results directly from them.
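
The procedure described earlier in this paragraph, evaluating the wave function at many positions and looking for the locus of highest probability, can be sketched for one concrete case. The case chosen here, the standard textbook ground state of the hydrogen atom, is an assumption made for illustration and does not appear in the text; the function names are likewise hypothetical.

```python
import math

BOHR_RADIUS = 5.29177e-11  # meters; length scale of the hydrogen ground state


def psi_1s(r):
    """Textbook hydrogen ground-state wave function at distance r from the nucleus."""
    return math.exp(-r / BOHR_RADIUS) / math.sqrt(math.pi * BOHR_RADIUS ** 3)


def radial_probability_density(r):
    """Probability per unit radius: |psi|^2 summed over the spherical shell at r."""
    return 4.0 * math.pi * r ** 2 * psi_1s(r) ** 2


# "Solve for a multitude of positions": scan radii out to five Bohr radii.
step = BOHR_RADIUS / 100
radii = [i * step for i in range(1, 501)]
most_likely_r = max(radii, key=radial_probability_density)

# Probability of finding the electron somewhere inside the scanned region.
total_probability = sum(radial_probability_density(r) * step for r in radii)

print(f"most likely radius / Bohr radius = {most_likely_r / BOHR_RADIUS:.2f}")
print(f"probability within five Bohr radii = {total_probability:.4f}")
```

The most likely radius found by the scan coincides with the Bohr radius, which is the sense in which a locus of highest probability can stand in for a classical orbit even though the electron has no definite trajectory.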
14
All things at the quantum level (even forces) are represented by particles. Some of these
particles are familiar to most people (e.g. the electron and photon), others are less familiar
(muons, gluons, neutrinos, etc.). Basically, quantum particles consist of Fermions and
Bosons. The most important Fermions are quarks and leptons. Quarks are the components
of protons and neutrons, while leptons include electrons. The Fermions then, contain all
the particles necessary to make atoms. The Bosons carry the forces between the Fermions
(and their constructs). For example, the electromagnetic force is intermediated by photons,
and the strong and weak nuclear forces are intermediated by mesons, and intermediate
vector bosons respectively. Quantum physicists have found it convenient to speak of all of
the phenomena at the quantum level in terms of particles. This helps avoid confusion that
might arise because of the wave/particle duality of quantum phenomena. It gives
non–physicists a comfortable but false image of a quantum level populated by tiny billiard
balls.

In the early part of the nineteenth century Joseph Fourier discovered that any waveform
whatsoever could be completely described as a composition of other different
waveforms. The waveforms that might provide the constituents for such a composition
may belong to an infinity of families of waves. For example, sine waves are one family
of waves whose members might be used in combination to describe some different and
specific wave. Likewise, impulse waves (waves of various amplitude but infinitesimally
small wavelengths) could be used to describe that same wave. Depending on the shape
of the wave to be described the sine waves may be more appropriate than the impulse
waves or vice versa (e.g. it might take only a few sine waves to describe a very smooth
curve which would require a very large number of impulse waves while a spike-like
wave might best be described by impulse waves). In principle any wave may be
described in terms of members of any other wave family. Of course, the wave family
that can best describe some wave will be its own family, in which only one wave, the
wave itself, is required for the description. On the other hand, there will be some wave
family that is the worst family from which to compose the desired wave, because it will
take more waves from that family than any other family to describe the given wave.
These wave families of opposing usefulness for composition are called conjugate
waveform families. Any arbitrary wave is describable in terms of either member of a
pair of conjugate wave families. If a wave is conveniently described in the
waveforms of one family, it will be less easily described in those of that family's
conjugate, and vice versa. In fact, if we denote the number of waves required for the
description in the one family as D_A and the number of waves required by the conjugate
family as D_B, then the tradeoff in description can be expressed as
D_A · D_B ≥ k, for some constant k. The probabilities of the values that quantum particle
attributes take are described by the waveforms generated by the Schrödinger equations.
The descriptions of certain pairs of those attributes stand in relation to each other as
conjugate waveform families, and it is that fact that leads to the Heisenberg uncertainty
principle. The uncertainty principle is a real feature of the quantum world, just as it is a
real feature of the normal world that we may calculate, to any desired level of accuracy,
the value of any attribute we wish to assign to an object.
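
A small numerical sketch can make the tradeoff between conjugate families concrete. It is only an illustration under assumptions that are not in the text: waves are sampled at N discrete points, the impulse family is taken to be single-sample spikes, the sine family is taken to be the complex sinusoids of a discrete Fourier transform, and the names dft and count_needed, along with the tolerance, are hypothetical choices.

```python
import cmath
import math


def dft(samples):
    """Naive discrete Fourier transform: coefficients in the sinusoid family."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def count_needed(coefficients, tol=1e-9):
    """Number of family members with a non-negligible coefficient."""
    return sum(1 for c in coefficients if abs(c) > tol)


N = 64  # assumed number of sample points

# A smooth wave: three full cycles of a sine across the sampling window.
smooth = [math.sin(2 * math.pi * 3 * t / N) for t in range(N)]
# A spike-like wave: zero everywhere except at one sample.
spike = [1.0 if t == 10 else 0.0 for t in range(N)]

for name, wave in (("smooth", smooth), ("spike", spike)):
    d_impulse = count_needed(wave)    # impulses needed (one per nonzero sample)
    d_sine = count_needed(dft(wave))  # sinusoids needed
    print(name, d_impulse, d_sine, d_impulse * d_sine)
```

In this sampled setting the smooth wave needs only two complex sinusoids but dozens of impulses, while the spike needs a single impulse and all 64 sinusoids. In both cases the product of the two counts is no smaller than N, which plays the role of the constant k.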

The predictions made by quantum mechanics (which now consists of four equivalent
systems: the Schrödinger equations, Heisenberg's matrix mechanics, Dirac's
transformation theory and Feynman's sum over histories system) of the results of quantum
particle experiments have proven accurate for fifty years. They consistently and precisely
predict the observations made by physicists of the effects produced by the interactions
of single particles (from perhaps cosmic sources) and beams of particles produced in
laboratories. The tools for the prediction of experimental results are perfected and used
daily by physicists. So, under the test that a successful theory can be measured by its
ability to predict, quantum theory is very successful. In spite of all of their accuracy and
predictive ability, the systems do not provide a picture of, or feeling for, the nature of
quantum reality any more than the equation F = ma provides a picture of normal
reality. But quantum mechanics does provide enough information to indicate that
quantum reality cannot be like normal reality. Unfortunately (or fortunately if you are a
philosopher) there are many different versions of quantum reality that are consistent
with the facts, all of which are difficult to imagine. For instance, the concept of attribute,
so strong and unequivocal in normal reality, is qualified at the quantum level as
indicated above. Worse, if the quantum mechanics equations are to be interpreted as a
model of the quantum level of existence then the inhabitants of quantum reality consist
of waves of probability, not objects with attributes (strange and quirky though they may
be). As Werner Heisenberg said, "Atoms are not things." The best one can do is to
conjure up some version of an Alice in Wonderland place in which entities are
identifiable by their static attributes and in which all values of dynamic attributes are
possible but none actually exists. Occasionally the quantum peace is disturbed
by an act that assigns an attribute and forces a value assignment (e.g. a measurement or
observation occurs). Nick Herbert in his book Quantum Reality (Herbert, 1989) has
identified eight versions of quantum reality held by various physicists (the versions are
not all mutually exclusive):

1. The Copenhagen interpretation (the reigning favorite), originated by Niels
Bohr, Heisenberg, and Max Born, holds that there is no underlying reality. Quantum
entities possess no dynamic attributes. Attributes arise as a joint product of the
probability wave and the normal reality measuring device.

2. The Copenhagen interpretation part II maintains that the world is created by an
act of observation made in the normal world. This has the effect of choosing an
attribute and forcing the assignment of values.

3. The world is a connected whole arising from the history of phase entanglements of
the quantum particles that make up the world. That is, when any two possibility
waves (representative of some quantum particle) interact, their phases become
entangled. Forever afterward, no matter their separation in space, whatever
actuality may manifest itself in the particles, the particles share a common part.
Originally the phase entanglement was thought of as just a mathematical fiction
required by the form of the equations. Recent developments (Bell's theorem, see
below) lend credence to the actuality of the entanglements in the sense that those
entanglements can explain, and in fact are needed to explain, experimental data.
Phase entanglement was originally posed by Erwin Schrödinger.

4. The "there is no normal reality" hypothesis (normal reality is a function of the mind).
John Von Neumann felt that the problem with explaining the reality behind the
experimental observation arose because the measurement instruments were
treated as normally real while the particles being measured were considered as
waveforms in a quantum reality. He undertook to treat the measurement devices
as quantum waveforms too. The problem then became one of determining when a
quantum particle as represented by a wave of all possible attributes and values,
collapsed (took a quantum leap) to a particular set of attributes and values. He
examined all of the possible steps along the path of the process of measurement
and determined that there was no distinguished point on that path and that the
waveform could collapse anywhere without violating the observed data. Since
there was no compelling place to assume the measurement took place, Von
Neumann placed it in the one place along the path that remained somewhat
mysterious, the human consciousness.

5. There is an infinity of worlds. In 1957, Hugh Everett, then a Princeton University
graduate student, made the startling proposal that the wave function does not
collapse to one possibility but that it collapses to all possibilities. That is, upon a
moment requiring the assignment of a value, every conceivable value is assigned,
one for each of a multitude of universes that split off at that point. We observe a
wave function collapse only because we are stuck in one branch of the split.
Strange as it may seem this hypothesis explains the experimental observations.

6. Logic at the quantum level is different than in normal reality. In particular the
distributive laws in logic do not apply to quantum level entities, that is, A or (B
and C) ≠ (A or B) and (A or C). For example, if club Swell admits people who are
rich or it admits people who come from a good family and have good connections,
while club Upper Crust admits people who are rich or come from a good family
and who are rich or have good connections, then in normal reality we would find
that after one thousand people apply for membership to both clubs the same
group of people will have been accepted at both clubs. In the quantum world, not
only will the memberships be different but in club Upper Crust there will be
members who are not rich and do not come from a good family or are not well
connected.

7. Neo-realism (the quantum world is populated with normal objects). Albert
Einstein said that he did not believe that God played dice with the universe. This
response was prompted by his distaste for the idea that there was no quantum
reality expressible in terms of the normal reality. He didn't want to accept
probability waves as in some sense real. He and de Broglie argued that atoms are
indeed things and that the probabilistic nature of the quantum mechanics is just
the characteristic of ensembles of states of systems as commonly presented in
statistical mechanics. John Von Neumann proved a theorem that stated that
objects that displayed reasonable characteristics could not possibly explain the
quantum facts as revealed by experimental data. This effectively destroyed the
neo-realist argument until it was rescued by David Bohm who developed a
quantum reality populated by normal objects consistent with the facts of quantum
mechanics. The loophole that David Bohm found that allowed him to develop his
model was the assumption by Von Neumann of reasonable behavior by quantum
entities. To Von Neumann, reasonableness did not include faster than light
phenomena. In order to explain the experimental data in a quantum reality
populated by normal objects, Bohm found it necessary to hypothesize a pilot wave
associated with each particle that was connected to distant objects and that was
able to receive superluminal communications. The pilot wave was then able to
guide the particle to the correct disposition to explain the experimental data.
Unfortunately for the neo-realist argument, faster than light communication puts
physicists off. As discussed below, the acceptance of a variety of that concept, at
least for objects in the quantum world, may be required by recent findings.

8. Werner Heisenberg champions a combination of 1 and 2 above. He sees the
quantum world as populated by potentia or "tendencies for being." Measurement is
the promotion of potentia to real status. The universe is observer created (which is
not the same as Von Neumann's consciousness created universe).
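
The classical half of the club example in item 6 can be checked by brute force. The following is a minimal sketch of ordinary Boolean logic only (it does not model quantum logic); the function names swell and upper_crust are hypothetical labels for the two admission rules described there.

```python
from itertools import product


def swell(rich, good_family, good_connections):
    """Club Swell: rich, or (good family and good connections)."""
    return rich or (good_family and good_connections)


def upper_crust(rich, good_family, good_connections):
    """Club Upper Crust: (rich or good family) and (rich or good connections)."""
    return (rich or good_family) and (rich or good_connections)


# Try every combination of the three yes/no attributes an applicant can have.
mismatches = [
    applicant
    for applicant in product([False, True], repeat=3)
    if swell(*applicant) != upper_crust(*applicant)
]
print(mismatches)  # prints []: in normal-reality logic the two rules always agree
```

Because the two rules agree on all eight kinds of applicant, any difference in membership between the clubs would signal a failure of the distributive law, which is the claim made for the quantum level.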

Albert Einstein did not like the concept of a particle consisting of a wave of
probabilities. In 1935, in collaboration with Boris Podolsky and Nathan Rosen, he
proposed a thought experiment that would show that, even if quantum mechanics
could not be proven wrong, it was an incomplete theory. The idea was to create a
hypothetical situation in which it would have to be concluded that there existed
quantum attributes/features that were not predictable by quantum mechanics. The
thought experiment is known as the EPR experiment.

The EPR experiment consists of the emission in opposite directions from some source of
two momentum correlated quantum particles. Correlated particles (correlated in all
attributes, not just by the momentum attribute) may be produced, for example, by the
simultaneous emission of two particles from a given energy level of an atom. Call the
paths that the first and second particles take the right hand and left hand paths
respectively. At some point along the right hand path there is a measuring device that
can measure the momentum of the first particle. On the left hand path there is an
identical measuring device at a point twice as far from the source as the first
device. When the particles are simultaneously emitted along their respective paths,
according to quantum theory, they both consist of a wave of probability that will not be
converted into an actuality until they are measured. At the point of being measured
their probability wave is collapsed, or the momentum attribute is forced to take a value,
or the consciousness of the observer assigns a momentum value (e.g. a human is
monitoring the measuring device), or the universe splits, or some other such event takes
place to fix the momentum. Now consider the events of the experiment. The particles
are emitted at the same time in opposite directions. Soon the first particle (on the right
hand path) is measured and its momentum is fixed to a particular value. What is the
status of the second particle at this point in time? According to quantum mechanics it is
still a wave of probability that won't become actualized until it encounters the
measuring device (still off in the distance). And yet it is known that the left hand
measuring device will produce the same momentum that the right hand device has
already produced and when the second particle finally gets to that measuring device it
does show the expected momentum. Two possibilities present themselves: either the
result of the measurement of the first particle is somehow communicated to the second
particle in time for it to assign the correct value to its momentum attribute, or the
second particle already 'knows' in some sense, which actual momentum to exhibit
when it gets to the measuring device. The particles are moving at or near the speed of
light so the former possibility requires superluminal communication and must be
rejected. But then the quantum particle must contain knowledge that is not accounted
for by quantum theory. In fact it must contain a whole list of values to assign to
attributes upon measurement because it can't possibly know which attribute will be
measured by the measuring device it encounters. So, Einstein concludes, quantum
theory is incomplete since it says nothing about such a list.

Einstein's rejection of superluminal communication is equivalent to an assumption of
locality. That is, any communication that takes place must happen through a chain of
mediated connections and may not exceed the speed of light. Electric, magnetic and
gravitational fields may provide such mediation but the communication must occur
through distortions of those fields that can only proceed at subluminal speeds. In 1964
John Stewart Bell, an Irish physicist attached to CERN, devised a means of using a real
version of the EPR thought experiment to test the assumption of locality. He substituted
two beams of polarity correlated (but randomly polarized) photons for the momentum
correlated particles of the EPR experiment. All photons possess a polarization attribute
but a light beam is said to be unpolarized if its photons, when measured for that
attribute, show no special orientation. The measuring devices were located at equal
distances from the source and could test arriving photons for polarization in any
direction. The object was to compare the records of polarized photons produced by the two
measurement devices for various combinations of direction of polarization. The
assumption of locality leads to an assessment as to the similarity between the records
produced at each device that does not agree with quantum mechanics. The following is
a variation of an experimental setup that could be used to investigate the problem. It
should make the disagreement between quantum mechanics and the assumption of
locality more understandable.

The correlated beams are emitted in opposite directions from one of a number of
sources and are unpolarized. This can be implemented, for example, by using a cesium
source. A cesium source can generate unpolarized but correlated light beams because
cesium emits separable twin state photons (i.e. two photons are emitted together from
the same energy level of a cesium atom). The measuring devices on each path are calcite
crystals backed up by a pair of photon counters. Calcite crystals act as a filter passing
only photons that are polarized in one of two directions. They allow light polarized in
the particular direction of the crystals' orientation, say Q°, to pass through normally
while light polarized at right angles to that orientation (i.e. Q°+90°) is refracted and
passed through at an angle. Obviously, for a given orientation of the crystal, many
photons don't get through at all. Behind each crystal are two photon counters. One of
the photon counters counts the photons polarized along the crystal's orientation and the
other counts the refracted photons. The results are combined to produce a record of
photon polarizations for a particular crystal orientation. Such a record might look like
'1100100101010010010' where 1 indicates the measurement of a photon polarized at Q°,
and 0 represents the measurement of a photon polarized at Q°+90°. No matter what
angle of polarization is tested the sequence of ones and zeros will have a random nature
because the light beams are unpolarized. When the two calcite crystals at the end of
each path are oriented in the same direction the record produced by each device is
exactly the same. However, when one of the crystals is offset by some angle, the
records produced at each device begin to differ by a percentage based on the difference
of the two orientations; call the difference in orientations (Qd)° = (Q1)° - (Q2)°. A simple
calculation in quantum mechanics predicts that the records at each measuring
device, having a difference in orientation of (Qd)°, will differ by an amount calculated
as sin²((Qd)°). On the other hand, the assumption of locality means that the difference
must change in an essentially linear fashion with the angular difference. To see this,
suppose that crystal one is rotated by a° and a difference in records of, say, d% is
observed. Denote this by O(a°) = d. Then crystal one is returned to its original
orientation. Crystal two is rotated by -a° and O(-a°) = d is obtained. Then crystal two is
returned to its original orientation. Now consider the case when crystal one is set at a°
and crystal two is set at -a° simultaneously and then the comparison of records is made.
The assumption of locality means that the effects of the measurement made at crystal
one cannot affect the measurements made at crystal two and vice versa. The maximum
number of changes between the two records when (Qd)° = 2a° must be just twice those
observed when (Qd)° = a° except in the case that a change from the old record was
observed for a particular photon at both devices (in which case the changes will cancel
each other and the two records will agree on the polarization of that photon). That is,
under the assumption of locality, the difference in records when (Qd)° = 2a° must be
less than or equal to twice the differences observed when (Qd)° = a°. In the notation of
observations, O(2a°) ≤ 2O(a°). Quantum mechanics predicts the difference in records
for one crystal rotated a° and the other crystal rotated -a° will be sin²(2a°). Both
assertions can't be true, as the following argument shows. Assume the quantum
mechanics calculations are correct, so that the difference in the records due to any
orientations of the measuring devices is correctly calculated by quantum mechanics,
i.e. O(Q°) = sin²(Q°). Then the assumption of locality implies sin²(2a°) ≤ 2sin²(a°) for all
a°. But, for example, at a° = 30°, sin²(2a°) = .75, while 2sin²(a°) = .5, so the inequality
doesn't hold. One or the other of the assumptions must be false. But all of this is a
description of an experiment that can be performed and which can determine which
assumption is false (that quantum mechanics is valid, or that locality is a valid assumption).
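
The arithmetic of the disagreement can be checked directly. The sketch below simply evaluates the two expressions from the argument above, taking the quantum prediction to be the squared sine of the relative angle and the locality limit to be twice the single-offset value; the function names and the small tolerance are hypothetical choices, and nothing here simulates an actual experiment.

```python
import math


def qm_mismatch(angle_deg):
    """Quantum prediction from the text: sin^2 of the relative angle."""
    return math.sin(math.radians(angle_deg)) ** 2


def violates_locality_bound(a_deg, tol=1e-12):
    """True when sin^2(2a) exceeds the locality limit 2*sin^2(a)."""
    return qm_mismatch(2 * a_deg) > 2 * qm_mismatch(a_deg) + tol


# Compare the quantum prediction at 2a with twice the prediction at a.
for a in (10, 20, 30, 40, 50, 60):
    print(a, round(qm_mismatch(2 * a), 3), round(2 * qm_mismatch(a), 3),
          violates_locality_bound(a))
```

Since sin²(2a) = 4 sin²(a) cos²(a), the quantum prediction exceeds the locality limit exactly when cos²(a) > 1/2. For offsets between 0° and 90° that means the bound fails below 45° and holds from 45° upward, so the 30° case in the text is one of a whole range of violating angles.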

By 1972 John Clauser at the University of California at Berkeley, using a mercury source
and a variation of the above inequality, verified that the quantum mechanics predictions
were correct. One loophole in the experiment was due to his inability to switch the
direction of polarization to be tested while the photon was in flight. The failure to do so
allows for the possibility of subluminal information leaks between devices. The problem
could be overcome by providing multiple measuring devices (at different settings) and
switching rapidly among them during the experiment. In 1982 Alain Aspect of the
University of Paris made the corrections and performed the experiment. The validity of
quantum theory was upheld and the assumption of locality in quantum events was shown
to be wrong. That quantum reality must be non-local is known as Bell's theorem. It is a
startling theorem that says that unmediated and instantaneous communication occurs
between quantum entities no matter their separation in space. It is so startling that
many physicists do not accept it. It does little to change the eight pictures of quantum
reality given above except to extend the scope of what might be considered a
measurement situation from the normal reality description of components to include
virtually everything in the universe. The concept of phase entanglement is promoted
from mathematical device to active feature of quantum reality.

Of course, we see none of this in our comfortable, normal reality. Objects are still
constrained to travel below the speed of light. Normal objects communicate at
subluminal rates. Attributes are infinitely measurable and their permanence is assured
even though they may go for long periods of time unobserved and uncontemplated. We
can count on our logic for solving our problems; it will always give the same correct
answer, completely in keeping with the context of our normal world. But we are
constructed from quantum stuff. How can we not share the attributes of that level of
existence? And we are one of the constituents of the Milky Way (although we never
think of ourselves as such); what role do we play in the nature of its existence? Does
such a question even make sense? It might help to look more closely at the nature of
levels of existence.

1.5.3 The new scale of being

Biologists, physicists, astronomers, and other scientists, when describing their work,
often qualify their statements by specifying the scale at which their work occurs. Thus
it's common to hear the 'quantum level', the 'atomic level', the 'molecular level', the
'cellular level', the 'planetary level' or the 'galactic level' mentioned in their work. And
they might further qualify things by mentioning the time frame in which the events of
their discipline occur, e.g. picoseconds, microseconds, seconds, days, years, millions of
years, or billions of years. The fact that such specifications must be made is not often
noted; it's just taken as the nature of things that, for example, objects and rules at the
quantum level don't have much to do with the objects and rules at the cellular level, and
the objects and rules at the cellular level don't have much to do with objects and rules at
the galactic level and so on. A bit of reflection however, reveals that it's really quite an
extraordinary fact! All of these people are studying exactly the same universe, and yet
the objects and rules being studied are so different at the different levels, that an
ignorant observer would guess that they were different universes. Further, when the
level at which humankind exists is placed on a scale representing the levels that are
easily identified, it is very nearly in the center (see figure 1.2). Three possible reasons
come immediately to mind: 1) we're in the center by pure chance, 2) the
universe was created around us (and therefore, probably for us), or 3) it only looks like
we're in the center because we can 'see' about the same distance in both directions.

Size (meters)   Representative objects
10^-25
10^-20
                range of weak nuclear force
10^-15          radii of proton or neutron
                atomic nuclei; range of strong nuclear force
10^-10          radii of electron shells of atoms
                molecules
                macro-molecules (DNA etc.)
                viruses
10^-5           bacteria
                blood corpuscles
                vegetable and animal cells
                snowflakes
                mouse
10^0            man
                whale
                small town
                large city
10^5
                radius of the Earth
                radius of the Sun
10^10
                radius of Solar System
10^15
                distance to nearest star
10^20           radius of galaxy (Milky Way)
                radius of largest structures (clusters of galaxies)
10^25

Figure 1.2. New scale of being

Is our perception of these levels, embedded in the scale of being, just an
artifact of our inability to see an underlying scheme (i.e. the universe can be explained
by a few immutable object types together with their rules of interaction, so that the
different objects and rules that we perceive at different levels are simply a complex
combination of those primitives), or do these levels (and their objects and rules) have a
real and, in some sense, independent existence? The answer we give is part of the
hypothesis made in this thesis: levels exist and contain rules not derived from surrounding
levels of existence, and the further removed any two levels the less the one can be interpreted or
explained in terms of the other. If accepted, this answers the question as to why we seem to
be at the center of the scale of being; the further removed a level from the human level
of existence, the less that level has in common with the human level and the less
humans are able to interpret what occurs at that level in familiar terms. For sufficiently
remote levels, no interpretation can be made. We cannot 'see' beyond what our minds
can comprehend. So, for example, we don't understand quantum reality because it is on
the very edge of that to which we as humans can relate. For the same reason, at the
other end of the scale, we can't conceive of all that is as the universe, and at the same
time, the universe as all that is, nor do our minds comfortably deal with infinity and
eternity. If we were a quantum entity, we would undoubtedly express the same wonder
at the impossible nature of the strange subluminal world populated by unconnected,
virtually uncommunicative, objects with frozen attributes, that exist at the very edge of
the ability to comprehend. But we shall pull back from such tempting speculations. Our
purpose is not to investigate the concept of levels in general. We will leave aside many
questions, such as "what parameters participate in the emergence of levels (e.g. energy,
scale of being, etc.), how fast do levels emerge, and what governs the interactions at a
level including such phenomena as life, procreation, etc?" We are mainly interested in
the fact of the existence of levels and the effect of that on what we call intelligence. We
take up that problem in section 1.7 and subsequent sections of this part.

1.6 Mathematical considerations

1.6.1 The progress in mathematics

Unlike philosophers who were in a state of turmoil at the beginning of the twentieth
century, mathematicians were reaching a state of great contentment. The state was not
to be long lived. René Descartes had started the age of analysis with the invention of
analytic geometry (early seventeenth century); this was followed by the invention of
the calculus by Isaac Newton and Gottfried Wilhelm Leibniz (late seventeenth century).
Later (early nineteenth century) the invention of non-Euclidean geometries by Nikolai
Lobachevsky shook mathematicians, who had considered geometry to be the most
sound of all the mathematical systems. But, by the beginning of the twentieth century
geometry had been put back onto a firm foundation by David Hilbert. In fact, the scare
that the discovery of non-Euclidean geometries had caused had a beneficial effect in
that other branches of mathematics were inspected. The calculus was found wanting
and was repaired, principally by Augustin Cauchy, by founding it on the concept of
limits. The attempt was made, largely successfully, to ground the largest part of
mathematics in arithmetic. And Giuseppe Peano had managed to produce arithmetic
from a handful of axioms about the whole numbers. Mathematics seemed very secure
and was certainly the most powerful tool ever invented (or discovered, depending upon
your point of view). In the seventeenth century Leibniz had proposed a calculus of
reason in which all of human knowledge based on reason could be derived from basic
concepts, so that, if any disputes should arise it would only be necessary for the
disputants to sit down and calculate the truth. Developments in mathematical logic
(Gottlob Frege had essentially completed the first order predicate calculus by the late
nineteenth century) seemed to make that statement plausible. Pierre-Simon Laplace had
claimed (in the early nineteenth century) that given the initial conditions (position,
velocities and forces applying to all of the particles in the universe), and sufficient
calculating power the future and past course of the cosmos could be calculated. It was
easy to believe that mathematics represented an unassailable body of truth. There was
little reason to think that Laplace was wrong.

But the axiomatic level of mathematics is not free of philosophical considerations and
many of the axioms upon which mathematics had been secured were being challenged.
In logic, the law of the excluded middle is the statement that "all propositions are either
true or false." This is particularly important to mathematics, especially logic, because the
mathematical method of proof by contradiction depends upon it. But as Bertrand
Russell pointed out, the law is itself a proposition and consequently subject to being
false and thus cannot be an axiom (Kline 1980). Other antinomies had popped up in
various branches of mathematics. In set theory, Georg Cantor had problems with "the
set of all sets." He had fashioned his sequence of cardinal numbers based on the fact that
from any set (that has a characteristic cardinal number equal to the number of elements
in the set), a set of larger cardinal number can be constructed as the set of all subsets of
the original set. The set of all possible sets must be the largest possible set and thus be
represented by the largest cardinal number. But then there must be a larger set
consisting of the subsets of this set. So there must at once be a largest cardinal number
and no largest cardinal number. Many proofs in mathematics depended implicitly or
explicitly on the "axiom of choice": the assumption that you can choose an element from
a set or infinite sequence of sets without specifying a rule for making the choice. This
axiom had been used (implicitly) by Peano in his development of arithmetic (Kline
1980). Similarly, existence proofs in which the existence of some mathematical entity is
proven without actually providing an example or means of constructing that entity,
were increasingly being regarded with suspicion.
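
The construction Cantor relied on, forming the set of all subsets of a given set, can be illustrated for small finite sets. This is a minimal sketch covering the finite case only: Cantor's argument also applies to infinite sets, which no program can enumerate, and the helper name power_set is a hypothetical choice.

```python
from itertools import combinations


def power_set(elements):
    """Return every subset of `elements` as a list of frozensets."""
    items = list(elements)
    return [
        frozenset(subset)
        for size in range(len(items) + 1)
        for subset in combinations(items, size)
    ]


for n in range(6):
    original = set(range(n))
    subsets = power_set(original)
    # A finite set with n elements has 2**n subsets, always more than n.
    print(n, len(subsets), len(subsets) > len(original))
```

Each set of subsets is strictly larger than the set it was built from, so the step can always be repeated. That is why a supposed "set of all sets" cannot have a largest cardinal number.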

David Hilbert was worried by these doubts and inconsistencies. At the beginning of the
twentieth century he proposed as one of the twenty-three most important problems in
mathematics (second on the list), that of finding a proof of the consistency of arithmetic.
Bertrand Russell and Alfred North Whitehead believed that Frege's mathematical logic
could provide a foundation for arithmetic, thus reducing all of mathematics to a
consequence of the laws of thought, which they considered inherent in or a reflection of
nature. They set out to prove that contention in their Principia Mathematica. They had to
avoid the antinomies that can arise when sets are defined to include themselves as
members. To this end they developed a theory of types in which individual elements
were of type 0, sets of elements were of type 1, sets of sets were of type 2, and so forth.
In their system, sets, hence intensional propositions that serve to describe sets, have to
be of a higher type than the elements that comprise them. This ploy successfully
banishes antinomies, but in applying it they were forced to create another axiom, the
axiom of reducibility, which proved as controversial as any of the original axioms of
arithmetic. Another school, the Intuitionists, were opposed to the Logicists and
maintained that mathematics was a construct of the human mind and was prior to or
fundamental to logic. Mathematics should be based on mathematical axioms that are
intuitively obvious to the human mind. They banished the law of the excluded middle,
existence proofs, the axiom of choice and axioms that include infinite sets. From what
remained they attempted to rebuild mathematics. They were notoriously unsuccessful,
being unable to rebuild much of the previous structure. At this point David Hilbert
stepped back into the fray and proposed a Formalist approach in which any axiom
system would be reduced to symbols and bound by explicit rules of proof so that the
truth of any proposed theorem could be proved or disproved by rigorously applying
the rules to the symbolized axioms and established theorems of the system.
Mathematical induction16 and other methods depending upon infinitely lengthy
processes would be allowed by the device of considering them as being applicable to a
finite number n, no matter how large n. But proofs that rely on contradiction, infinite
sets, or the axiom of choice remain banished.

Hilbert insisted that in his system the consistency of any axiomatic system ensured the
existence of any of its entities. The Intuitionists complained that Hilbert's plan robbed
those entities of any meaning. Meanwhile Ernst Zermelo and Abraham Fraenkel had
completed a new axiomatization of mathematics in terms of set theory. These axioms,
termed the Zermelo-Fraenkel set theory axioms, were so persuasive, providing an
attractive means of axiomatizing arithmetic while avoiding the antinomies, that they
became the preferred foundation for many mathematicians. They did not, however,
solve the consistency problem; they simply replaced one set of axioms with another. A
group of mathematicians undertook to complete the development of mathematics from
the Zermelo-Fraenkel axioms. They called themselves the Bourbaki. So, by early in the
twentieth century, there were four schools of thought on how mathematics should be
founded: the Logicists, the Intuitionists, the Formalists and the Set Theorists. All of them
based mathematics on axiomatic systems. Only the Intuitionists believed that their
system was consistent (could not have two contradictory theorems) and complete
(could not contain meaningful statements that were neither provable nor disprovable). None of them
could prove consistency and completeness.

But then in the 1930s Kurt Goedel built a system based on arithmetic in which he
reduced the assertion "This statement is unprovable" to an arithmetical statement. If the
statement could be proven, the system would have proven a falsehood and so would be
inconsistent; if it could not be proven, that very fact would affirm the assertion. Goedel
had thus produced a meaningful arithmetic assertion that, provided the system was
consistent, was true but could not be proven. His system, based on arithmetic, was
incomplete. Next he reduced the assertion "arithmetic is consistent" to an arithmetic
statement and proved that it implied the first assertion. From this Goedel proved the
theorem that if any system is complex enough to include arithmetic, then if it is
consistent it is incomplete. As a corollary, he showed that it would not be possible to
prove arithmetic consistent within any of the extant axiomatic systems. More was to come.

16 Mathematical induction is different from the logical induction treated in section 1.9.1 of
this paper. Mathematical induction applies to mathematical objects and as such is subject to all
of the rigor associated with a mathematical interpretation. Logical induction applies to logical
objects which are subject to worldly interpretations.

In 1915 Leopold Lowenheim began, and later Thoralf Skolem completed (1933), work on
a theory (Lowenheim-Skolem theory) in which they showed that axiomatic systems
intended to characterize one set of mathematical objects might be satisfied by many
other species of objects. For a system intended to axiomatize non-denumerable sets (sets
that cannot be put into one to one correspondence with the whole numbers) there would
turn out to be a denumerable model (specific set of objects that satisfy a condition) that also
satisfied the axioms. This meant that such a set of axioms could not be considered to
characterize one set of objects if other different objects satisfied the axioms. In
particular, they showed any set of axioms for the whole numbers has models in the real
numbers and any axiomatic system has denumerable models. This was confusing.
Goedel had demonstrated an intuitively true assertion in arithmetic that could not be
proven. This implied that arithmetic (consequently mathematics) was not capable of
proving assertions intuitively obvious to the human mind, while Lowenheim-Skolem
theory implied that axiomatic systems had models far beyond what could be intuited by
the human mind. More was to come.

In 1963 Paul Cohen showed that the axiom of choice and Cantor's continuum
hypothesis (there is no cardinal number strictly between the cardinal numbers of the
rational and the real numbers) were independent of the Zermelo-Fraenkel set theory
axioms. This is analogous to the independence of the parallel axiom in Euclidean
Geometry. Just as the adoption of various forms of that axiom led to various
non-euclidean geometries, the adoption of various forms of these two axioms leads to
various set theories, some of which produce results that are completely at odds with
human intuition17.

17 For instance, under some systems it is possible to mathematically subdivide a sphere the size
of a baseball and reassemble the parts into a sphere the size of the earth.

The desire to explore these and other tributaries of the mainstream of mathematics has
proven irresistible. Throughout the twentieth century mathematicians have
increasingly distanced themselves and their discipline from the real world. Many have
abandoned themselves to abstraction, attracted by the multiplicity of systems, the
fictioneer's creativity, and the freedom from the wealth of detail generated by the
scientific endeavor necessary to ground mathematics in reality. It can be argued that this
is justifiable because all of mathematics is man's creation, a product of Kant's innate
ordering principle and, as such, represents the finest expression of the structure of the
human mind. Implicit in this assumption is the acceptance of some unfathomed
purpose behind the structure that from time to time makes it relevant to physical reality.
We reject this hypothesis and will present below (section 1.11) an alternative
explanation for the existence of mathematics.

1.6.2 Chaos

Geometry, algebra, calculus, logic, computability, probability theory, and the various
other areas and subdivisions of mathematics that have been developed over the years
serve to describe and aid in the development of the various hard and soft sciences. As
the sciences have come to treat more complex systems (e.g. global weather systems, the
oceans, economic systems and biological systems to name just a few) the traditional
mathematical tools have proved, in practice, to be inadequate. With the advent of
computers a new tactic emerged. The systems could be broken down into manageable
portions. A prediction would be calculated for a small step ahead in time for each piece,
taking into account the effects of the nearby pieces. In this manner the whole system
could be stepped ahead as far as desired. A shortcoming of the approach is the
propagation of error inherent in the immense amount of calculation, in the
approximation represented by the granularity of the pieces, and in the necessary
approximation of non-local effects. To circumvent these problems, elaborate precautions
consisting of redundancy and calculations of great accuracy have been employed. To
some extent such tactics are successful, but they always break down at some point in the
projection. The hope has been that with faster, more accurate computers the projections
could be extended as far as desired.
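
The stepping tactic described above can be illustrated with a minimal sketch. It assumes a
deliberately simple stand-in for a real model: a ring of cells holding a quantity such as heat,
where the next value of each cell depends only on itself and its two neighbours. The names
step and project and the coupling constant alpha are illustrative assumptions, not taken from
any particular simulation.

```python
def step(cells, alpha=0.1):
    """Advance every cell one small time step using only its two neighbours.

    alpha is an assumed coupling constant; this explicit scheme is only
    stable for alpha <= 0.5. The cells form a ring, so the first and last
    cells are neighbours.
    """
    n = len(cells)
    return [
        cells[i] + alpha * (cells[i - 1] - 2.0 * cells[i] + cells[(i + 1) % n])
        for i in range(n)
    ]


def project(cells, steps, alpha=0.1):
    """Step the whole system ahead by repeating the small local update."""
    for _ in range(steps):
        cells = step(cells, alpha)
    return cells


if __name__ == "__main__":
    start = [0.0] * 10
    start[5] = 1.0  # all of the "heat" begins in a single cell
    later = project(start, steps=50)
    print([round(v, 3) for v in later])
```

Each call to step is itself an approximation, so a long projection compounds many small errors,
which is the shortcoming the paragraph above describes.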

In 1980 Mitchell J. Feigenbaum published a paper (Feigenbaum 1980) in which he
presented simple functionally described systems whose stepwise iterations (i.e. the
output of one iteration serves as the input to the next iteration in a manner similar to the
mechanism of the models mentioned above) produced erratic, non-predictable
trajectories. Beyond a certain point in the iteration, the points take on a verifiably
random nature. That is, the points produced satisfy the condition for a
truly random sequence18. The resulting unpredictability of such systems arises not as
an artifact of the accumulation of computational inaccuracies but rather from a
fundamental sensitivity of the process to the initial conditions of the system19.
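
A minimal sketch of this sensitivity follows. It assumes, purely as an example system, the
equation x_{i+1} = a x_i (1 - x_i) with a = 4; the function names and the size of the initial
difference are likewise illustrative choices. Two starting values that differ by one part in
ten billion are iterated side by side and the gap between them is recorded at every step.

```python
def logistic(x, a=4.0):
    """One iteration of x -> a * x * (1 - x)."""
    return a * x * (1.0 - x)


def gaps(x0, delta=1e-10, steps=60, a=4.0):
    """Return |x_i - y_i| for two trajectories started delta apart."""
    x, y = x0, x0 + delta
    out = [abs(x - y)]
    for _ in range(steps):
        x, y = logistic(x, a), logistic(y, a)
        out.append(abs(x - y))
    return out


if __name__ == "__main__":
    # Print the gap every tenth iteration to watch it grow.
    for i, g in enumerate(gaps(0.2)):
        if i % 10 == 0:
            print(i, g)
```

The gap grows, on average, by a roughly constant factor per step until it is comparable to the
whole interval, after which the two trajectories bear no useful relation to one another.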

18 For example, a sequence on an interval is random if the probability that the process will
generate a value in any given subinterval is the same as the probability of it generating a number
in any other subinterval of the same length.

19 Consider the following simple system: x_0 = \pi - 3, x_{i+1} = 10 x_i - trunc(10 x_i), where
trunc means to drop the decimal part of the argument. The sequence of numbers generated by the
indicated iteration consists of numbers between 0 and 1, where the nth number is the decimal
portion of \pi with the first n digits deleted. This is a random sequence of numbers arising from a
deterministic equation. Changing the value of x_0, however slightly, inevitably leads to a different
sequence of numbers.

Typically these systems display behavior that depends upon the value of parameters in
that system. For example the equation x_{i+1} = a x_i (1 - x_i), where 0 < x_0 < 1, has a single
solution for each a < 3 (that is, the iteration settles down so that above some value
of i, x_i remains the same). At a = 3 there is an onset of doubling of solutions (i.e.
beyond a certain i, x_i alternates between a finite set of values; the number of values in
that set doubles at incremental values as a increases). Above a = 3.5699... the system
quits doubling and begins to produce random values. If you denote by \lambda_n the value of a at
which the function doubles for the nth time, then the ratio
(\lambda_n - \lambda_{n-1})/(\lambda_{n+1} - \lambda_n) approaches 4.6692 as n increases. The
value 4.6692 and another value 2.5029 are universal constants that have been shown to
be associated with all such doubling systems. The points at which doubling occurs are
called points of bifurcation. They are of considerable interest because for real systems
that behave according to such an iterative procedure20, they imply a point at which a
system might move arbitrarily to one of two solutions. The above example has one
dimension but the same activity has been observed for many dimensional systems and
apparently occurs for systems of any dimension. When the points generated by a two
dimensional system are plotted on a coordinate system (for various initial conditions)
they form constrained but unpredictable, and often beautiful trajectories. The space to
which they are constrained often exhibits coordinates to which some trajectories are
attracted, sometimes providing a simple stable point to which the points gravitate or
around which the points orbit. At other times trajectories orbit around multiple
coordinates in varying patterns.

20 For example, the equation x_{i+1} = a x_i (1 - x_i) is called the logistic function and can
serve as a model for the growth of populations.
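
The doubling described above can be observed numerically. The sketch below iterates the
equation x_{i+1} = a x_i (1 - x_i) until transients have died away and then counts the distinct
values the iteration goes on to visit. The function name attractor, the starting value, the
burn-in length and the tolerance are all assumptions chosen for illustration.

```python
def attractor(a, x0=0.5, burn_in=2000, sample=256, tol=1e-6):
    """Return the distinct values x_i visits after the transient has passed.

    Two values closer together than tol are treated as the same value.
    """
    x = x0
    for _ in range(burn_in):  # discard the transient
        x = a * x * (1.0 - x)
    seen = []
    for _ in range(sample):
        x = a * x * (1.0 - x)
        if all(abs(x - s) > tol for s in seen):
            seen.append(x)
    return sorted(seen)


if __name__ == "__main__":
    for a in (2.8, 3.2, 3.5, 3.9):
        values = attractor(a)
        print(a, len(values), [round(v, 4) for v in values[:4]])
```

Under these assumptions the sketch should report one value at a = 2.8, two at a = 3.2 and four
at a = 3.5, while at a = 3.9 nearly every sampled value is distinct.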

More spectacularly, some trajectories gravitate toward what have been called strange
attractors. Strange attractors are objects with fractional dimensionality. Such objects have
been termed fractals by Benoit B. Mandelbrot, who developed a geometry based on the
concept of fractional dimensionality (Mandelbrot, 1977). Fractals of dimension 0 < d < 1
consist of sets of points that might best be described as a set of points trying to be a
line. An example of such a set of points is the Cantor set, which is constructed by taking a
line segment, removing the middle one third and repeating that operation recursively
on all the line segments generated (stipulating that the remaining sections retain
specifiable endpoints). In the limit an infinite set of points remains that exhibits a fractal
dimensionality of log 2 / log 3 = .6309. Objects of varying dimensionality in the range
[0,1] may be constructed by varying the size of the chunk removed from the middle of
each line. An object with dimensionality 1 < d < 2 can be described as a line trying to
become a plane. It consists of a line so contorted that it can have infinite length yet still
be contained in a finite area of a plane. An example of such an object is the Koch
snowflake. The Koch snowflake is generated by taking an equilateral triangle and
deforming the middle one third of each side outward into two sides of another
equilateral triangle. After the first deformation a star of David is formed. Each of the
twelve sides of the star of David is deformed in the same manner, and the sides of the
resulting figure deformed, and so on. In the limit a snowflake shaped figure with an
infinitely long perimeter and a fractal dimensionality of log 4 / log 3 = 1.2618 results.
Strangely, beyond a few iterations the snowflake does not change much in appearance.
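
Both dimensions quoted above follow from the same rule for exactly self-similar figures: if
each step replaces a piece with N copies, each reduced in size by a factor s, the dimension is
log N / log s. The sketch below applies that rule. The function names are illustrative, and the
rule is assumed to apply only to idealized constructions of this kind.

```python
import math


def similarity_dimension(copies, scale):
    """Dimension of a figure made of `copies` pieces, each 1/scale the size."""
    return math.log(copies) / math.log(scale)


def cantor_like_dimension(removed_fraction):
    """Dimension when the middle `removed_fraction` of each segment is removed."""
    piece = (1.0 - removed_fraction) / 2.0  # length of each surviving piece
    return similarity_dimension(2, 1.0 / piece)


def koch_perimeter(iterations, side=1.0):
    """Perimeter of the snowflake after the given number of deformations.

    Every deformation replaces each side with four sides one third as long.
    """
    return 3.0 * side * (4.0 / 3.0) ** iterations


if __name__ == "__main__":
    print(similarity_dimension(2, 3))   # Cantor set
    print(similarity_dimension(4, 3))   # Koch snowflake
    print(cantor_like_dimension(0.5))   # a larger chunk removed
    print(koch_perimeter(10))           # perimeter keeps growing
```

Removing a larger middle chunk lowers the dimension toward 0 and removing a smaller one raises
it toward 1, which is how the intermediate dimensionalities mentioned above arise.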

Many processes in the real world display the kind of abrupt change from stability to
chaos that Feigenbaum's systems display: turbulence, weather, population growth and
decline, material phase changes (e.g. ice to water), dispersal of chemicals in solution and
so forth. Many objects of the real world display fractal like dimensionality. Coastlines
consist of convolutions, in convolutions, in convolutions, and so on. Mountains consist
of randomly shaped lumps, on randomly shaped lumps, on randomly shaped lumps,
and so on. Cartographers have long noted that coastlines and mountains appear much
the same no matter the scale. This happens because at large scales detail is lost and
context is gained while at small scale detail is gained but context is lost. In other words,
shorelines look convoluted and mountains bumpy. It might be expected, given that the
chaotic processes described above give rise to points, lines, orbits, and fractal patterns
that real world systems might be described in terms of their chaotic tendencies. And
indeed that is the case. The new area of research in mathematics and physics arising
from these discoveries has come to be known as chaos theory. Chaos theory holds both
promise and peril for model builders. Mathematical models whose iterations display
the erratic behavior that natural systems display, capture to some extent the nature of
the bounds, constraints and attractors to which the systems are subject. On the other
hand, the fact that any change, however slight (quite literally), in the initial conditions
from which the system starts, results in totally different trajectories implies that such
models are of little use as predictors of a future state of the system. Further, it would
appear that it may be very difficult, perhaps even impossible, to extract the equations of
chaos (no matter that they may be very simple) from observation of the patterns to
which those equations give rise. Whatever the case, the nature of the activity of these
systems has profound implications for the sciences.

If Pierre-Simon Laplace turns out to have been correct in assuming that the nature of
the universe is deterministic (which the tenets of chaos theory do not preclude, but
which they do argue against21), he was wrong when he conjectured that he only needed
enough calculating ability and an initial state to determine all the past and future
history of the universe. A major error was in assuming that initial conditions could ever
be known to the degree necessary to provide for such calculations. As an example,
imagine a frictionless game of billiards played on a perfect billiard table. The player,
who has absolutely perfect control over his shot and can calculate all of the resulting
collisions, sends all of the balls careening around the table. If he neglects to take into
account the effects of one electron on the edge of the galaxy, his calculations will begin
to diverge from reality within one minute (Crutchfield et al, 1986). Knowing the
impossibility of prediction even in a theoretically deterministic universe decouples the
concepts of determinism and predictability, and of describability and randomness.
Further, chaotic systems have built into them an arrow of time. The observer of such a
system at some given point in time cannot simply extrapolate backward to the initial
conditions nor can he (so far as we know) divine the underlying equations from the
character and relationship of the events it generates. The laws that underlie those events
are obscured and can be determined only by simulated trial and error, the judicious use
of analogy, description in terms of probability, or blind luck. There is a strong sense that
something of Goedel's theorem is revealing itself here. Just as in any mathematical
system of sufficient complexity there are truths that cannot be proved, there are real
deterministic systems for which observations can be made that cannot be predicted.
Perhaps the most important result of the discoveries in chaotic processes is the
growing realization among scientists that the linear, closed system mathematics that
has served so well for describing the universe during the last two hundred and fifty
years, and which gave rise to the metaphor of a clockwork universe, only describes
exceptionally well behaved and relatively rare (but very visible) phenomena of that
universe. It is becoming apparent that it may not be possible to derive a neat linear,
closed system model for everything that exhibits a pattern. It is likely that new
non-linear, open system approaches that rely heavily on computers will be required to
proceed beyond the current state of knowledge, and that in the explanation of
phenomena in terms of laws, those laws may have to be constrained to the levels of
existence of the systems they describe and not be reduced to more fundamental laws at
more fundamental levels.

21 We reject it mainly because of the butterfly effect, which states that the flapping of the wings of
a butterfly in Tokyo might affect the weather at a later date in New York City. This is because
the weather systems at the latitude of New York City and Tokyo travel West to East so that the
weather in New York can be thought of as arising from the initial conditions of the atmosphere
over Tokyo. As another example, consider the old saying, "for want of a nail the shoe was lost,
for want of a shoe the horse was lost, for want of a horse the battle was lost, and for want of the
battle the kingdom was lost." Consider the nail in the horse's shoe. For some coefficient of
friction between the nail and the hoof, the shoe would have been retained. The exact value of that
coefficient might have depended on the state of a few molecules at the nail/hoof interface.
Suppose those molecules were behaving in a manner that had brought them to a point of
bifurcation that would result in either an increase or a decrease in the coefficient of friction. In
the one case the shoe would have been lost and in the other it would have been retained. The loss
of the nail would then have been a true random event that would have propagated its way to the
level of events in human civilization. If the battle in question had been the battle of Hastings, the
strong French influence in the language you are reading might well have been a Germanic
influence.

It is not surprising that the investigation of chaotic activity waited on the development
of fast computers and high resolution graphics display devices. The amount of
calculation required to get a sense of chaos is staggering and the structures revealed can
only be appreciated graphically with the use of computers. But chaos theory has
implications for computing as well. Computers have been around for a long time. The
first machine that can truly be said to have been a computer was the Analytical Engine,
designed in England in the mid nineteenth century by Charles Babbage. The machine
was never finished, but Ada Augusta Byron, daughter of Lord Byron and holder of the
title Countess of Lovelace, wrote programs for it, making her the first computer
programmer. The Analytical Engine, though incomplete, was a showpiece that impressed
and intimidated people. The countess is said to have stated (McCorduck, 1979):

"The Analytical Engine has no pretensions whatever to originate anything. It can do whatever we
know how to order it to perform."

Since then, those sentiments have often been repeated to reassure novice programmers
and others whose lack of understanding causes them to see the computer as an
intimidating, intelligent, and perhaps malevolent entity. More experienced
programmers, however, find that the machine is truly an ignorant brute, incapable of
doing anything on its own. And so the myth is propagated across generations. In chaos
theory the myth is refuted. In pursuing chaotic processes the machine can truly create,
though in a mechanical manner that seems to lack purpose. We will argue that the
seeming lack of purpose is a human prejudice based on our concept of what constitutes
purpose (see section 1.12), and that chaotic processes may very well be the mark of a
creative universe.

The structure revealed in chaotic processes is statistically accessible. While the sequence
and position of the points generated within the constraints of the system are not
predictable, if, taking as an example a two dimensional space, the points are treated as
forming a histogram on a grid imposed upon the shape/form generated by the system
then the system can be treated probabilistically. In terms of levels, the points may be
seen as occupying one level and the form defined by the points as occupying the next
higher level. The theory then has potential for the probabilistic prediction of system
activities. Given that we could explain (if not predict) the processes of nature using
some combination of statistics and chaos theory, there still seems to be a missing
element that idealists would happily call animus or moving force. What drives such
systems in the first place? The traditional answer, given by thermodynamics, is that
there is potential for work in a system (force through a distance) anytime there is a heat
source and a heat sink. Any system embodying such a source and sink is considered to
be far from thermodynamic equilibrium until the temperatures of the source and sink
have been equalized. A measure of the difficulty with which that heat can be converted
to work is termed entropy. One of the most famous laws of science states that in a
closed system in which work or heat interactions occur the entropy must increase. In the
last century, an Austrian, Ludwig Boltzmann, recognized that this behavior is
intimately connected with the concept of the possible states that a physical system
might occupy. He was able to state the law simply as S = k \log_e P, in which S is the
entropy, P is the probability of the thermodynamic state occurring, and k is the ratio of
the ideal gas constant to Avogadro's number and is called Boltzmann's constant. The
equation is engraved on Boltzmann's tombstone in Vienna. The description of systems
by the use of statistics and probability theory as reflected in entropic processes is taken
up next.
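
The histogram treatment mentioned at the start of the preceding paragraph can be sketched as
follows. The Henon map with its customary parameters a = 1.4 and b = 0.3 is used here purely as
an assumed example of a two dimensional chaotic system; the grid size, the grid bounds and the
function names are likewise illustrative choices.

```python
def henon_points(n, a=1.4, b=0.3, burn_in=100):
    """Generate n points of the Henon map x' = 1 - a*x*x + y, y' = b*x."""
    x, y = 0.0, 0.0
    points = []
    for i in range(n + burn_in):
        x, y = 1.0 - a * x * x + y, b * x
        if i >= burn_in:  # skip the transient before recording points
            points.append((x, y))
    return points


def grid_histogram(points, bins=40, x_range=(-1.5, 1.5), y_range=(-0.5, 0.5)):
    """Count how many points fall in each cell of a bins x bins grid."""
    counts = [[0] * bins for _ in range(bins)]
    for x, y in points:
        col = int((x - x_range[0]) / (x_range[1] - x_range[0]) * bins)
        row = int((y - y_range[0]) / (y_range[1] - y_range[0]) * bins)
        col = min(max(col, 0), bins - 1)  # clamp points on or past the edge
        row = min(max(row, 0), bins - 1)
        counts[row][col] += 1
    return counts


if __name__ == "__main__":
    pts = henon_points(20000)
    hist = grid_histogram(pts)
    occupied = sum(1 for row in hist for c in row if c > 0)
    print("occupied cells:", occupied, "of", 40 * 40)
    # Relative frequency of the most visited cell, an estimate of its probability.
    print(max(max(row) for row in hist) / len(pts))
```

The order in which individual points arrive is unpredictable, but the relative frequency of
each cell gives an estimate of the probability of finding the system there, which is the sense
in which the form occupies a higher level than the points.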

1.7 Systems and entropy (the emergence of levels)

The word entropy is commonly used to indicate the process by which ordered systems
become disordered. It is also the name for the measure of disorder. The next few
paragraphs deal with the measure of entropy but the importance of entropy to this
paper is as a process.

There are two distinct types of systems to which the mathematical equation that
describes entropy is commonly applied: the physical systems22 that are the object of
study in physics, and the communication systems discussed in information theory.
Entropy, as it applies to physical systems, is the subject of the second law of
thermodynamics and is associated with time-irreversible characteristics of large systems
whose sheer number of components and inaccessible initial conditions preclude
deterministic analysis. Its form arises in the statistical consideration of the possible
configurations of the components of a system and is often described as a measure of the
randomness in motion and distribution of those components. The law states that, in any
closed23 system, the entropy, or disorder, or inability to convert heat or energy to work
can only increase or remain the same over time. Among the laws of physics this
irreversibility with respect to time is peculiar to the second law of thermodynamics
(with the possible exception that might result from investigations in chaos theory). This
view of entropy is intimately connected to the micro-unpredictable nature of large real
physical systems with many components and consequently does not often appear in
classical physics where the behavior of small deterministic kinetic systems is
investigated.

22 A system is a structure of components that function together. This distinguishes it from a
structure, which may be an arrangement of non-dynamic items, or an aggregate, which may be an
arrangement of non-related items. 'Function together' may be interpreted liberally to mean
constrained to interact. We may, for instance, consider a gas in a container a system. The terms
system, structure and aggregate depend to some extent upon context. The phrase "aggregate of
stones" is meant to imply that we are speaking of a collection of stones whose size, hardness and
other stony characteristics are not related. We could speak of the same collection as a structure if
we had reason to refer to the spatial distribution of those stones.

23 A closed system is one that cannot exchange energy, space or matter with its surroundings.

This form of the law is expressed in a simple formula that first appeared in work
attributable to DeMoivre, who used it to investigate games of chance (Wicken,
1988). When applied to thermodynamic systems (perhaps a container full of a gas or a
liquid), it involves the probabilities with which the states24 of the system manifest
themselves. The entropy of a system is given by the natural logarithm of the probability
'P_i' of each state 'i', weighted by that probability and summed over all of the states.
This yields the formula:

S = -k \sum_{i=1}^{W} P_i \log_e P_i     (1)

in which 'S' is the entropy, 'W' is the number of possible or attainable states of the
system, and 'k' is Boltzmann's constant25. When the probability of occurrence of any
state is the same as the probability of the occurrence of any other state, i.e. when all
state probabilities P_i = 1/W, the maximum disorder occurs and (1) collapses to

max(S) = k \log_e W.     (2)

Given this measure of the maximum entropy possible in a system and the actual
entropy (1), and allowing that the percentage disorder displayed by a system can be
represented as the ratio of actual entropy to the maximum entropy, then we can define

disorder ≡ S/max(S) (3)

so that it is natural to define

order ≡ 1 - disorder. (4)
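
Equations (1) through (4) can be evaluated directly for any list of state probabilities. The
sketch below is one way to do so. The function names are assumptions, k is treated as a plain
scalar defaulting to 1, at least two states are assumed so that max(S) is not zero, and a state
with zero probability is taken to contribute nothing to the sum.

```python
import math


def entropy(probs, k=1.0):
    """Equation (1): S = -k * sum(P_i * ln P_i) over the W states."""
    return -k * sum(p * math.log(p) for p in probs if p > 0.0)


def max_entropy(num_states, k=1.0):
    """Equation (2): max(S) = k * ln W, reached when every P_i = 1/W."""
    return k * math.log(num_states)


def disorder(probs, k=1.0):
    """Equation (3): ratio of actual entropy to maximum entropy."""
    return entropy(probs, k) / max_entropy(len(probs), k)


def order(probs, k=1.0):
    """Equation (4): order = 1 - disorder."""
    return 1.0 - disorder(probs, k)


if __name__ == "__main__":
    print(order([0.25, 0.25, 0.25, 0.25]))  # every state equally likely
    print(order([0.97, 0.01, 0.01, 0.01]))  # one state dominates
    print(order([1.0, 0.0, 0.0, 0.0]))      # one state certain
```

Because k appears in both the numerator and the denominator of (3), the measures of disorder
and order do not depend on the value chosen for it.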

In the second form of entropy, the one that is associated with communication theory,
the concepts change in several ways.

24 For a physical system 'state' implies one specific configuration of components (probably but
not necessarily at the molecular level). A state is thought of as occupying a point in phase space.
Phase space is simply a multidimensional space whose dimensions represent the spatial
coordinates and momenta of the components of the system so that one point in phase space can
represent one state of the system (Tolman 1979). A probability density function defined on this
space can be used to describe the system.

25 k = 1.380 x 10^-16 erg per degree centigrade. In the subsequent development in this section k
can be thought of as a simple scalar value. The association of k with Boltzmann's constant is
required in the investigations of entropy associated with an ideal gas. Note that the form of the
entropy equation given here is just the expected value (in a phase space) of entropy as defined
by Boltzmann.

Some people consider it a mistake for Claude Shannon to have given the name entropy to
his measure of the uncertainty concerning the information content of a message. The obvious
reason for his choice of the word is that the formulas that define information entropy
are identical to those that define the entropy of physical systems, i.e. they are just (1)
and (2) above except that 'H' is usually used to denote entropy instead of 'S'. A second
reason is that when information is transmitted through a communications channel, that
information content can only remain the same or decrease in time, just as the order in a
closed physical system can only remain the same or decrease in time. But equivalence in
formula is not equivalence in interpretation unless the constants and variables in each
formula refer to the same things. In the pure cases mentioned thus far, they do not. The
meaning of the various components as they apply to physical systems was given
above. In communication theory the variables, constants and probabilities have the
following meanings: W refers to the number of possible character strings of length L,
where each character in a string is selected from an alphabet of N characters, that is,
there exist W = N^L possible strings; the P_i refer to the probability of string 'i' occurring;
and the 'k' is used for scaling purposes. It has become traditional to use \log_2 instead of
\log_e in the formula since the alphabet used in information theory is that of electronic
switching and consists of two characters, i.e. the binary alphabet of '1' and '0'. The 'k'
may be used to scale or normalize the equation. If, for instance, we assign k = 1/\log_2 W
then the minimum entropy is scaled to 0 while the maximum entropy is scaled to 1
(see footnote 26). The minimum entropy is associated with the certainty that the message
is one particular message out of the W possible messages while unitary entropy represents the
complete uncertainty as to which of the W possible messages is the correct one. An
increase of information entropy is then associated with the loss of certainty as to which
string (and its associated information) is being sent or is intended. Such a loss of
certainty is usually attributed to noise in the environment in which the transmission
takes place. It is a theorem of information theory that the increase in entropy can be
made arbitrarily small by increasing the redundancy in the message.
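
The scaling just described can be checked with a short sketch. It assumes the W possible
strings are simply enumerated and supplied as a list of probabilities; the function names are
illustrative.

```python
import math


def string_count(alphabet_size, length):
    """W = N^L, the number of possible strings of length L."""
    return alphabet_size ** length


def normalized_entropy(probs):
    """Entropy with log base 2 and k = 1 / log2(W), so the result lies in [0, 1]."""
    w = len(probs)
    k = 1.0 / math.log2(w)
    return -k * sum(p * math.log2(p) for p in probs if p > 0.0)


if __name__ == "__main__":
    w = string_count(2, 3)               # binary strings of length 3
    certain = [1.0] + [0.0] * (w - 1)    # the message is known
    uniform = [1.0 / w] * w              # every message equally likely
    print(w, normalized_entropy(certain), normalized_entropy(uniform))
```

Any distribution between these two extremes, such as one produced by a noisy channel, yields
a value strictly between 0 and 1.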

As noted, the formulas (1) and (2) normally refer to different concepts when used to
describe features of a physical system as opposed to an information system. However,
both kinds of entropy may be legitimately used to describe aspects of either kind of
system (since information can be associated with the structure of any physical system
and any information system is ultimately expressed in a physical structure that will
dissipate energy in its activities). The distinction between an information system and a
physical system is sometimes blurry.

26 The minimum information entropy occurs when it is known with certainty which string of the
W strings comprises the message. In that case

S = (-1/\log_2 W) \log_2 1 = 0.

When every message is equally probable then

S = (-1/\log_2 W) \sum_{i=1}^{W} (1/W) \log_2 (1/W) = (-1/\log_2 W)(\log_2 1 - \log_2 W) = 1.

Take for example DNA replication. DNA is a physical system consisting of strings of
four kinds of molecular bases, combined in long sequences and forming the famous
double helix. Other catalytic molecules can interact with DNA and free bases in the
immediate environment, forming and breaking chemical bonds, in a process that results
in the construction of another molecule (RNA). RNA is a copy of a portion of the
original DNA. The change of entropy in this system may be described in two ways.
One description depicts it as a closed physical system that at one point in time exhibits
a certain degree of organization (as represented by the DNA, catalyzing agents and free
molecular bases) and that at a later point in time displays more organization (as
represented by the DNA, catalyzing agents and RNA). This change translates into a
decrease in physical entropy. A second equally valid description depicts the DNA as a
string of characters selected from an alphabet of four characters whose replication in the
RNA (creating redundancy) makes more certain the transmission of the information
that they contain, i.e. decreases the information entropy of the system. The process,
viewed in either way, has the same result: it provides the zygote with the organization it
needs to construct an organism.

The similarity between thermodynamic entropy and information theoretic entropy in
this case can lead to speculation that the connection between the two forms of entropy
is more than coincidental. This suspicion is strengthened by the manner in which
Maxwell's demon may be dispatched. In 1867 James Clerk Maxwell, the great physicist,
hypothesized a demon whose actions would violate the second law of
thermodynamics. He stationed his demon at a door between two otherwise closed
chambers of gas, say chambers A and B. When the demon saw a particle of the gas
approaching the door from the chamber A side, he would open the door and let the
particle enter into chamber B. He would be careful not to let any particle from chamber
B get into chamber A. In this way the order in the system comprised of the two
chambers full of gas would be increased, the entropy would be decreased and the
second law would be violated. While Maxwell's original description can be attacked on
many situational grounds (e.g. the energy dissipated in opening the door or "seeing" a
molecule coming could make up for the decrease in entropy) all such objections can be
countered by modifying the situation. They are inessential to the underlying problem
which is the possibility of an entropy decrease caused by an intelligent reordering of a
physical system. The problem can be stated as: "can the entropy in a system be
decreased by friction free intelligent manipulation of its components?" In 1929 Leo
Szilard (Szilard 1983) proposed a different kind of system that came to be known as
Szilard's Engine. The system appears to convert heat from its surroundings into work,
an effect equivalent to a reduction in entropy. It relies upon being able to determine and
remember in which half of a cylinder a molecule of gas can be found. By knowing this
fact, the system positions a piston to obtain work. In his careful analysis of the system,
Szilard determined that in measuring and remembering the location of the molecule, the
system squanders the energy it gains. This was before the advent of computers but
Szilard's solution just misses the current solution as given by Charles H. Bennett
(Bennett 1987). In his solution Bennett maintains that it is the clearing of memory, in
anticipation of storing the information needed to extract the work, that spends the
entropy decrease manifest in the work. Clearing memory is an irreversible, entropic
event, as shown by Rolf Landauer (Bennett 1985). Storing information is not necessarily
an entropic event. The significance of this for our purposes lies in the meeting of
physical entropy and information theoretic entropy. If the remembering device is a
computer the information stored in it (and subsequently cleared) can be described as
information in the information theoretic sense. This fact coupled with Bennett's solution
appears to establish an exchange rate between information theoretic entropy and
thermodynamic entropy.
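The text does not give a figure for this exchange rate. The value usually associated with Landauer's principle is k_B ln 2 of thermodynamic entropy per bit erased, or equivalently k_B T ln 2 of dissipated heat at temperature T. The sketch below, with function names chosen only for illustration, evaluates that figure as an assumption about what the exchange rate amounts to.

```python
import math

BOLTZMANN_J_PER_K = 1.380649e-23  # Boltzmann's constant (exact SI value)


def entropy_per_erasure(bits):
    """Thermodynamic entropy (J/K) produced by erasing `bits` of memory."""
    return bits * BOLTZMANN_J_PER_K * math.log(2)


def landauer_heat(bits, temperature_kelvin):
    """Minimum heat (J) dissipated when erasing `bits` at the given temperature."""
    if temperature_kelvin <= 0:
        raise ValueError("temperature must be positive")
    return temperature_kelvin * entropy_per_erasure(bits)


# Erasing the single bit that records which half of the cylinder holds the molecule,
# at an assumed room temperature of 300 K.
one_bit_heat = landauer_heat(1, 300.0)
print(f"{one_bit_heat:.3e} J per bit at 300 K")
```

On this reading the work gained from one cycle of Szilard's Engine is at best repaid in full when the one-bit record of the molecule's position is cleared.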

That a physical system can contain information in the information theoretic sense is
hardly startling; a computer is a machine dedicated to that purpose, as are a traffic light,
television set, clock, abacus and a strand of DNA. In these systems information may be
contained in the physical structure of the system and the physical structure of the
system may derive, to some extent, from the information it contains. Structure in the
world takes the form of, or may be described in terms of, hierarchical organization. Thus
books are comprised of chapters, chapters are comprised of paragraphs, paragraphs of
words and words of letters. Animals consist of organs, organs of tissues, tissues of cells
and so forth. Even processes submit to hierarchical description: a meal consists of
preparation, consuming the meal and cleaning up. Preparation consists of selecting
appropriate foods, cooking them and serving them. Consuming the meal consists of
several actions, such as cutting, scooping and spearing various foodstuffs, transporting
them to your mouth, chewing, swallowing and so on. The description of entropy must
then deal with the hierarchical structure of real systems, in particular, the possible
groupings, at the various levels of the hierarchy, of the physical objects that manifest
information. Pure information theoretic formulas such as (1) and (2) treat only the
lowest level (or terminal nodes) as information. This is because the information
elements are abstract and considered interchangeable. But physical structures that carry
information are distinct. Grouping them requires that the combinatorics of such a
selection process be considered27. Formulas (1) and (2) are insufficient for the
purpose; however, it is not difficult to modify them so that they work (Brooks,
Cumming and LeBlond 1988).

The basic idea is to identify the distinct subsystems, calculate their entropy and sum.
For example, assume that a subsystem may consist of an arrangement of L components
that may be selected from a basic alphabet of N elements (for instance letters or
molecules arranged in strings or arrays). There are $N^i$ possible distinct subsystems of
size i. The entropy of an i sized subsystem is, from (1) (letting k be 1),

$$S_i = -\sum_{j=1}^{N^i} P_{i,j} \log_2 P_{i,j}, \qquad (5)$$

where $P_{i,j}$ is the probability of the jth distinct subsystem of size i occurring. The entropy
of the whole system is

$$S = \sum_{i=1}^{L} S_i. \qquad (6)$$

At maximum entropy each subsystem has an equal probability of occurring, i.e. $P_{i,j} = 1/N^i$.
In that case (5) becomes

$$S_i = i \log_2 N. \qquad (7)$$

27
If it is desired to calculate the information containing potential of 3 binary digits we simply use
formula (1), with k = 1, (with $2^3 = 8$ possible, equally probable states) to calculate
$S = 8((1/8)\log_2 8) = 3$. But a system of three toggle switches may be grouped in a number of ways
(creating hierarchical systems) that may contain additional information. For example, the three
switches may be grouped separately or as a combination of 2 switches and 1 switch, or as a group
of three switches. When treated separately there are 3 ways to choose the first switch, two ways
to choose the second switch and one way to choose the last switch for a total of 6
combinations. There are three combinations of 2 switches and 1 switch and there is 1
combination of 3 switches. Altogether 10 different structures may be composed, each of which
may be associated with information in addition to that which may be associated with the
terminal. As is shown further on, this can allow max(S) = 6.
The maximum entropy of the whole system (which may also be thought of as its
information carrying capacity) is then

$$\max(S) = \sum_{i=1}^{L} i \log_2 N = \frac{L(L+1)}{2} \log_2 N. \qquad (8)$$

Order and disorder may again be defined as in (3) and (4) above except that they apply
to a hierarchical structure. Note that the maximum entropy of a system that permits
hierarchical structure is greater than the maximum entropy of a system consisting of
one level. This will be significant in the development below.
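The hierarchical formulas can be checked numerically. The following Python sketch, whose function names are illustrative assumptions, evaluates formulas (5), (7) and (8) for the three toggle switches of footnote 27 (L = 3 components, N = 2 states each) and compares the single-level capacity with the hierarchical one.

```python
import math


def level_entropy(probs):
    """Formula (5) with k = 1: entropy in bits of the subsystems at one level."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


def max_level_entropy(i, n):
    """Formula (7): all n**i subsystems of size i are equally probable."""
    return i * math.log2(n)


def max_hierarchical_entropy(length, n):
    """Formula (8) as a sum of formula (7) over the levels i = 1..L."""
    return sum(max_level_entropy(i, n) for i in range(1, length + 1))


def max_hierarchical_closed_form(length, n):
    """Formula (8) in closed form: (L(L+1)/2) * log2(N)."""
    return length * (length + 1) / 2 * math.log2(n)


# Formula (7) should agree with formula (5) when every subsystem is equally likely.
uniform_level_2 = level_entropy([1 / 2**2] * 2**2)

single_level = max_level_entropy(3, 2)         # three binary digits, one level
hierarchical = max_hierarchical_entropy(3, 2)  # levels of size 1, 2 and 3

print(single_level, hierarchical, max_hierarchical_closed_form(3, 2))
```

For three switches the single-level capacity is 3 bits and the hierarchical capacity is 6 bits, which matches the value max(S) = 6 quoted in footnote 27.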

That the entropy equations are important for the investigations of physicists and
communications researchers is obvious but entropy is becoming important in the
investigation of many systems that become more ordered over time, that is, that
organize themselves. This means that entropy becomes important in the investigations
of biologists, astronomers, economists, sociologists, researchers into artificial
intelligence and other scientists who investigate emerging self-ordering systems. In
these systems the distinction between the two forms of entropy is definitely blurred.

As a descriptor of an important attribute of large systems, entropy has been around for
a long time. It has not, however, entered into the investigations of systems that self-
organize, because it has not been clear how a description of the tendency of systems to
dissipate could be applied to such systems. In fact, entropy poses a conundrum: how
can a system become more ordered when the second law implies that just the opposite
tendency is normal? The answer requires a careful look at the nature of systems.

Three kinds of evolving systems can be identified (Olmstead p245 1988). Closed
systems are systems that cannot exchange matter or energy with their surroundings; in
obeying the second law of thermodynamics they evolve toward a disorganized
equilibrium state (a state in which no organization exists and no change in organization
occurs). There are few if any examples of truly closed systems in nature; a laboratory
experiment that isolates some material from all outside influences might approximate a
closed system, or the universe itself might be used as an example. A second kind of
evolving system is an out-putting system. Out-putting systems are those that can expel
matter and/or energy and that can evolve toward an ordered equilibrium state at the
expense of an increase in the entropy of the surrounding system (which includes the
out-putting system). A star provides a good example of this kind of system. Stars can
achieve the status of neutron stars in which organization is present but no further
change occurs. The last kind of evolving system is a processing system in which energy
and matter are input to the system, processed, degraded and retained or expelled, to lead
to an ordered relatively steady state system far from equilibrium. Continuance of the
steady state depends upon maintenance of the flow of energy and matter into and out of
the system. The earth's surface/atmosphere and all biological systems are good
examples of processing systems. In both out-putting and processing systems, entropy can
decrease in the system at the expense of a corresponding increase in the entropy of the
encompassing environment. A processing system is intimately connected to and
dependent over time on its environment. This empirically established fact is developed
formally by Ilya Prigogine (Prigogine 1980) especially as it applies to thermodynamic
systems.
A fundamental question is: how did order come to be in the first place? Most theories of
the formation of the universe hypothesize an early stage of homogeneous
disorganization. For the second law to have held from the beginning, it would be
required that order appear and entropy increase concurrently. Since an increase in
order ordinarily implies a decrease in entropy, this would seem to be a contradiction. The
explanation is that the two processes are not mutually exclusive in an expanding
universe. As pointed out by Layzer (Layzer p26 1987) the order exhibited by a system is
the difference between the maximum possible entropy in the system and the actual
entropy ($O_s = \max(S) - S$, ... if we normalize this by multiplying through by $1/\max(S)$
we obtain $O = 1 - S/\max(S)$, which is equivalent to setting $k = 1/\log W$ and provides the
motivation for defining disorder as $S/\max(S)$). The increase in the physical size of the
universe would mean that the states that the physical substance of the universe might
occupy increase rapidly. If that rate of increase is greater than would allow the
substance of the universe to occupy those states then the maximum entropy would
increase faster than the actual entropy. In general, the entropy and order of a system
can both increase so long as the phase space describing the possible states that the
system can occupy grows at a rate sufficient to accommodate the growth of both.
Specifically as regards the creation of the universe, a rapid expansion led to a cooling
and the subsequent coalescing of matter and energy. A currently popular theoretical
account of the big bang (up to $10^{-30}$ seconds) called inflation (Guth and Steinhardt 1984)
depends on an immense expansion occurring during that early fraction of a second of
the existence of the universe. The expansion provides the driving force to produce all of
the matter and energy of the universe (hence its structure and order), possibly out of
nothingness. As a more prosaic example, imagine a container of gas as a closed system.
Its entropy approaches the maximum as the gas molecules are nearly randomly
distributed throughout its volume. Then in a very short period of time the container
expands to millions of times its original volume. The gas immediately tries to fill the
new volume, but it cannot do this as fast as the container is expanding. In the new
system with its larger phase space the small gas cloud represents an orderly system,
though one in which entropy is increasing rapidly. With the increase in container size
(hence maximum entropy) comes an increase in order. More formally, if we view
entropy and maximum entropy as functions of time

$$\frac{dO_s}{dt} = \frac{d(\max(S))}{dt} - \frac{dS}{dt}, \qquad (9)$$

so that

$$\frac{dO_s}{dt} > 0 \quad \text{if} \quad \frac{d(\max(S))}{dt} > \frac{dS}{dt}. \qquad (10)$$

That is, order must increase if the rate of growth of the maximum entropy is greater
than the rate of growth of the actual entropy. If a system is being driven toward order
by such a process, a hierarchically organized system will permit greater order than a
single level system. Now consider a planet. It collects energy and matter, degrades it
and expels energy and matter into space. The phase space of the planet is growing as
the totality of all energy and matter that have impinged upon it. Some planets simply
heat up and reradiate the energy; others also display elaborate atmospheric formations.
To explain the particular means by which order is expressed on the earth one has to
investigate the particular composition and situation of the earth. Obviously the
composition of the earth permits a hierarchical structure and a rapid progression to
greater order.
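The expanding container can be turned into a small numerical toy. The Python sketch below assumes, purely for illustration, that all occupied states are equally probable (so entropy is the base 2 logarithm of the number of states), that the container's phase space grows eightfold per time step and that the gas spreads only twofold per step. None of these numbers comes from the text.

```python
import math


def order_history(max_states, occupied_states):
    """Return (S, max(S), O_s, O) per time step, assuming equally probable states."""
    rows = []
    for w_max, w in zip(max_states, occupied_states):
        s = math.log2(w)          # actual entropy
        s_max = math.log2(w_max)  # maximum entropy of the enlarged phase space
        rows.append((s, s_max, s_max - s, 1 - s / s_max))
    return rows


steps = range(6)
container = [1024 * 8**t for t in steps]  # available states (assumed growth rate)
gas = [1024 * 2**t for t in steps]        # occupied states (assumed growth rate)

history = order_history(container, gas)
for t, (s, s_max, absolute_order, normalized_order) in enumerate(history):
    print(t, s, s_max, absolute_order, round(normalized_order, 3))
```

In this toy the entropy S rises at every step, yet the order max(S) - S rises as well, because max(S) grows faster than S, which is the condition stated in (10).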
The observation that entropy and order can both increase in an expanding phase space
has implications for biological systems. Prigogine, while describing how processing
systems can exhibit a decrease in entropy, does not provide an explanation of why the
amount and speed of entropy decrease (or increase in order) should vary so much from
system to system . In particular why is it that the production of order is so much greater
in biological systems than in non-biological systems? That biological order might be
created and maintained at the expense of thermodynamic order was a view first
espoused by Erwin Shrödinger in the 1940s, but the complexity of biological systems
made it difficult to accept a principle that seemed to apply more to simple chemical
systems. Most biologists concentrated on natural selection as the main mechanism
driving evolutionary processes. Now, however, there are new theories, gaining some
acceptance, that depend upon the hierarchical description of entropy given above, to
explain evolution.(Brooks, Cumming, and LeBlond 1988). First, they note that Dollo's
law and the proposition that natural selection is the primary driving force of evolution
are not compatible. Dollo's law is the empirically supported hypothesis that biological
evolution is characterized by the monotonically increasing complexity of organisms. If
natural selection were the only or even the major driving force for evolution, the cyclical
nature of climatic conditions over the eons would cause a corresponding cyclical effect
in the evolution of life. Such cycles are not observed; instead, a continuous increase in
the complexity of organisms is observed. Secondly they point out that the genome (gene
bearing structures) of organisms represents a phase space for the potential development
of the organism. An increase in the phase space, as represented by extended gene
structures can result in more order while entropy increases. They specify the
characteristics of a self-organizing system as follows (simplified, Brooks, Cumming,
and LeBlond p216 1988)

1. All information (entropy) and order capacities (maximum entropies), both total and
at each level increase over time.

2. a) At low levels (small L) disorder is greater than order, and
b) at high levels (large L) order is greater than disorder.

The first requirement is a restatement, in a manner applicable to a hierarchical system,
of the above discussed possibility of a concurrent increase in order and entropy in a
thermodynamic system. The second requirement stems from the observed nature of
self-organizing systems. If 2a holds but not 2b the system is disordered. If 2b
holds but not 2a then a limited ordered system such as a snowflake results. If both 2a and 2b
hold then a complex system results. Brooks et al checked DNA against this
hypothesis, testing the order/disorder capacity of the DNA of various life-forms. Their
results confirmed that DNA conforms to 2. When 1 and 2 hold then an increase in
complexity is expected over time and Dollo's law is given theoretical underpinnings. It
is the self organizing, energy processing systems that exhibit the strongest tendency to
order and complexity.

The equations (3), (4), (9), and (10) normally serve only to quantify the order in a
structure assumed to have been imposed upon a system. It is difficult to refrain from
elevating the equations to a teleological statement that equates the increased order that
they permit with all the growth and organization we observe around us. Accepting the
equations as a description of such a force provides a basis for explaining nature, in its
hierarchical structure, at all levels, from the emergence and evolution of cosmological
structures to the emergence and evolution of life. There is no logical justification for
making such a leap of faith. In fact, in so doing one returns to a kind of vitalist
viewpoint once removed (it could easily be misconstrued as the advocacy of a
mystical organizing or guiding force). We will resist the temptation to equate the two and be
satisfied with inferring a strong relation between the conditions that cause the equations
to indicate increased order and the organization we observe in nature. This is in accord
with the position of some physicists. Paul Davies (Davies, 1989), while supporting the
general concept of levels resistant to reductionism, protests the identification of
increased order with organization. He states that the simple fact that order can be
maintained or increased in open dynamic systems does not explain the organization but
only serves as a prerequisite for that organization. He argues that rules that he terms
software laws (because they often deal with information) emerge along with and
complementary to the emergence of levels. Software laws cannot be reduced to the
standard laws of physics and apply only to the systems at the levels of the observed
phenomena. He points to developments in chaos theory, fractal geometry, cellular
automata theory, along with the enhanced organizational tendencies of systems far
from equilibrium28, as possible generators of such laws. We are content to concede that
such may be the case and are satisfied with determining that the entropic processes of
dissipating systems are coincidental with self-organizing processes in a hierarchy of
semi-impermeable levels.

We will accept the following description of how the observed increasingly complex
organization of the universe occurs: Entropy is a fundamental fact of nature that serves
to describe or govern (depending on your point of view) the distribution of order and
disorder in the universe. Processing systems are those that accept energy and matter
from outside the system, process it (or degrade it) and expel it. The number of possible states that
such a system can occupy is always increasing (all of the energy and matter ever
processed through the system must be considered as belonging to the system although
only the extant structure is implied by the word system). Order, structure or
organization is observed to emerge and grow in processing systems. In terms of the
phase space in which such systems may be described, the space is growing at a rate that
greatly exceeds the ability of the system to occupy it. When the equations describing
the process are normalized by a factor dependent on the size of the phase space the
process can be interpreted as an increase in order at the expense of disorder. When the
system is closed, disorder always increases at the expense of order, reiterating the second
law of thermodynamics. Hierarchically organized systems seem to be preferred, an
observation that might be explained by the fact that such systems can, in a sense, absorb
more order with fewer parts than other organizations. Hierarchical, self-organizing
systems occur either as the result of an inherent, mathematically revealed necessity (in
which organization and order are considered interchangeable or at least, intimately
linked together) or because non-reducible laws, complementary and non-contradictory,
to the laws of physics (or the laws at any level for that matter), emerge with and apply
to the occupants of a level. These processes are assumed to be an ongoing and
permanent source of novelty or creativity at all levels throughout the universe. As an
example, the biosphere of the earth is a processing system in which order has increased
over time, in which a hierarchical organization seems to be preferred, and in which the
process continues today.
28
It is also possible that these phenomena are a direct result of the tendency to order rather than
ancillary organizing principles that come into play when permitted.
1.8 Psychological considerations

Psychology grew out of philosophical considerations of mind. The approach was, at
first, more introspective than scientific owing to its origin in an introspectively oriented
philosophy, the readiness to hand of subjects, and the difficulty in devising experiments
that would reveal the inner workings of the human mind. The theories of Freud and
Jung were based largely on case histories of patients who were subjected to
psychological examinations. Slowly, however, scientific experiments were devised and
applied to animals and people. Notable experiments on dogs were performed by I. V.
Pavlov that showed that animals could be conditioned to respond automatically and
involuntarily to various stimuli. Because of the difficulties in accessing the unconscious,
or lower levels of the mind, there arose the idea of treating the mind as a black box in
which the outputs engendered by given inputs would be studied. B. F. Skinner, in the
1930s, carried this line of research to great lengths in the behaviorist school of
psychology. The theory was good at explaining some human behavior but failed in
many cases. Psychologists returned to investigating the functions of the lower levels of
the mind and made significant advances, especially through direct neurobiologic
experiments on the brain. In recent years psychologists have turned to the concept of
the human brain as a machine upon which the functions of the mind execute as
software. This has occurred more perhaps because of the advent of computers and the
interesting potential for experimentation than because of the availability of compelling
theories. But it is the computer scientist, in particular the computer scientist engaged in
artificial intelligence work who should take to heart the models of cognition that
psychologists have developed over the years. The theories of Jean Piaget are a case in
point.

1.8.1 Structure of the mind

Piaget is best known as a psychologist whose interest is in the early development of
children. His interest goes beyond the study of childhood development to the
investigation of the origins and nature of knowledge. This would normally be the
subject matter of philosophers, but Piaget sees the development of knowledge in a child
as an extension of the basic organizing principles of biology that in turn are an extension
of principles of universal order. Piaget determines to bring the subject under the
scrutiny of scientific investigation by using the methods of psychology. Because of this
he is often referred to as a genetic epistemologist (McNally 1973).

The world view that Piaget subscribes to is termed "structuralism" (Piaget 1968). It has
three defining attributes: wholeness, transformation and self regulation. The first
attribute, wholeness, implies that the elements of the structure are in some
distinguishable manner subordinated to the whole structure. The second attribute,
transformation, is seen as the hallmark of structure. As determined by the Bourbaki
program, there are three basic frameworks into which all mathematical systems can be
sorted: algebraic (especially groups), networks, and topological structures. All
structures defined in this way display certain attributes among which are closure and
reversibility, i.e. ability to return to a previous state. Reversibility takes different forms
in each of the basic structures. Algebraic structures show reversibility through
inversion, networks through reciprocity and topological structures through continuity
and separation. Piaget notes that, in the intellectual development of the child these
forms may be observed as the recognition of number, serial correspondences and
neighborhoods and boundaries. In this view the order inherent in mathematics derives
directly from the nature of things and is not superimposed by the mathematician onto
the world. The last attribute, self regulation, results from the particular transformation
rules that define the structure, and is inevitable because of the nature of a
transformational system. Rhythm, regulation and operation are the hallmarks of self
regulation.

The dependence of this theory upon the reversibility of the processes associated with
structures causes Piaget to gloss over entropy and the probabilistic approaches of
physics, since they are intimately associated with irreversible processes. In fact, these
methods and theories elevate chaotic processes to a basic framework status, essentially a
network model but, contrary to Piaget's fears, quite reversible in the sense deemed
important by him. Nevertheless, Piaget suggests that a hierarchical organization of form
and content is required because of the inconsistency and incompleteness theorems of
Kurt Gödel. And he recognizes that structures (what we have referred to previously,
especially in the section on entropy, as systems) must be dynamic and grow or be static
and superfluous to cognition (Piaget 1970). So he arrives by a different route at many of
the same conclusions that result from embracing the ideas put forth in the entropy
section above. Piaget is a constructivist: he believes that structures (we would call them
systems) are not pre-determined but constructed by interaction with the environment.
He is specific about the mechanism by which growth and learning take place, which
he terms adaptation. He specifies the functions of assimilation, by which an existing
structure is added to (for example a child recognizing a football as a kind of ball), and
accommodation, by which new structures are created (for example a child learning how to
eat with a knife and fork). Piaget terms these "knowledge" structures into which the
child fits new information (or that are created from scratch), "schemas." A student of
computer science will recognize a similarity between schemas and the Frames and
Scripts of artificial intelligence although those structures, as advocated, have a
somewhat more static nature.

1.8.2 The development of mind (how parents create children)

If the human mind is mostly created by interaction with its environment exactly what is
the process? Kenneth Kaye gives a creditable account (Kaye, 1982) which we summarize
in the following section. Newborn babies are not intelligent but they are endowed with
the prerequisites for intelligence. By the end of their first year babies do, to a degree,
exhibit intelligence. Parents, in particular the mother, would deny that the baby is not
intelligent from the start. They and other peripheral adults interpret everything that the
newborn baby does as an intelligent act. And in fact, previous theories (summarized in
Kaye (1982)) have maintained that the baby is organized in certain respects from birth;
that the baby reacts to or bonds with the mother and that the mother/baby pair
immediately form an interacting system. However, Kaye indicates that this is an
exaggeration. The baby acts more or less at random with respect to the mother and gives no
indication of awareness of the purported system. The mother persists in interpreting the baby's
movements as intentional acts. She talks to it and encourages it to reply and in fact
treats the baby as though it were making replies. The mother tries to elicit a response
from the baby by making faces and noises, and by poking, prodding and jiggling the
baby. When the baby does begin to exhibit a reaction specifically to the mother (at
several months of age) it is by synchronizing its facial expressions with those of the
mother (i.e. smiling) in a receive greeting, initiate greeting, turn-taking manner. The
mother has been acting all along as though the baby were returning her greetings by
interpreting as such anything from a burp to a wet diaper. In so doing the mother sets
up a (one-sided) framework in which the baby has a part to play even though it is a
passive part. When the baby finally does begin to become aware of the world it finds
itself already accepted as part of that framework and the mother/baby pair takes on the
aspects of a system.

The concept of a system entails the recognition of parts of the system that interact
together in ways that determine the action of the system as a whole. Social systems have
parts that consist of biological organisms and are always open systems. But not all
relationships between organisms comprise a social system. In order to be considered a
social system the organisms that comprise the parts of the system must also have a
shared development and a shared purpose.

A mother and newborn baby are not yet a social system because all of the purpose is on
the part of the mother. Even so, she treats the baby as though it were cooperating and
forms an apparent social system. As Kaye says (p109)

"So instead of being born with rules, the infant is born with a few
consistent patterns of behavior that will look enough like the rules -
necessarily, like just the universal features of interaction rules- of adult life
so that adults will treat him as a person. And, of course, infants are also
born with learning mechanisms that enable them to improve the fit
between the way they are expected to behave and the way they do
behave."

Some learning mechanisms include learning from the consequences of actions, learning
by habitual association and learning by imitation. Through these and other mechanisms
the baby eventually begins to behave, at 5 to 6 months of age (Kaye p152), as the mother
has expected all along. The mother and infant pair begin to become a social system.

In their social system adult humans follow rules in their interactions with one another.
Among the first rules that the baby learns is turn-taking. The earliest turn-taking is in
greetings between the mother and baby in which the mother fits all her turns into
whatever space the baby leaves. As time goes by the turn-taking becomes more shared.
When the social system begins to develop, the baby's imitative abilities are optimized
under the turn-taking system. At about one year of age the baby can begin to speak and
the turn-taking becomes a valuable framework from within which dialogue can occur
and the child can be taught. At this point the child begins to learn how to use symbols.
Until now learning has occurred on a contingent basis, that is, as animals are trained
through incremental steps towards some goal. But with the acquisition of language the
baby can begin to learn in a more organized manner. (Kaye p115)

"The social matrix that provides for the human developmental processes is
as important to cognition as it is to language. Others have described the
development of thought as an interiorization of symbols, but they have not
always included in their account the internalization of the social skills
through which those symbols are acquired and through which they work.
Thought is not just internalized symbolic meaning, a construction of
propositions using an acquired code. It is an internalized discourse, a
matter of anticipating what the response of another might be to one's
behavior and responding to those responses in advance. Thought is, in
fact, verbally or non-verbally, a dialogue with oneself."

So turn-taking (or the development of rules) is important since it sets the pattern by
which the individual will acquire knowledge and even think for the remainder of his
life.

The use of symbols by the baby occurs at the same time as the acquisition of language.
There are two criteria that must be met before a signal can be considered a
symbol. The first is that the signal is intended to signify something else and the second
is that the signal must be a convention and different from that which it represents. So
words are symbols while a brake light or frightened look, even when conveying a
meaning, are not symbols. The important thing about symbols is that their meaning can
be shared by individuals in a social system. When this occurs it is termed
intersubjectivity.

The pattern of discourse that arises from the injection of language into the turn-taking
between mother and baby is characterized by what Kaye calls turnarounds.
Turnarounds are largely produced by the mother and are in the nature of both a reply
to a verbalization on the part of the child and a demand for a further response from the
child. The child, for its part, rarely produces turnarounds. It responds to the mother's
question or comments further on whatever topic is being discussed. The structure of the
dialogue does not require that the child share in the responsibility for maintaining the
discourse and yet allows him to practice language and demonstrate acquired
knowledge in an environment in which immediate feedback is provided. Through the
dialogues intersubjectivity occurs. The child comes to understand the meanings the
mother has for the symbols they are using and comes to be able to anticipate the effects
that his own use of the symbols will have in the social system. As Kaye puts it (p149):

"The conventional symbols inserted by adults into their systematic
exchanges with infants during the first year are directly responsible for
the dawning of human intelligence."

Nothing discussed sheds any light on what the actual mechanism might be that causes
the baby to begin to understand the meanings behind the symbols. But once this occurs,
the transformation from relying on learning by contingencies to learning by the
assimilation and the use of symbols is rapid. It happens in the baby at the same time as
the acquisition of language. The acquisition of a formal language with its capacity to
describe the world prepares the way for the baby to become intelligent.

Imitation is Kaye's answer to the question "what is the actual mechanism by which
babies learn to attach meanings to symbols?" (p159)

"At the heart of imitation is assimilation or the classification of new
events as equivalent to known objects or events. We cannot imagine, let
alone find in the real world, an organism that does not assimilate.
Assimilation is involved in all categorical knowledge. I assimilate the
object in my hand to the category "pen": even without labeling the
category with a symbol, I assimilate the object to sensorimotor schemas
that know what to do with pens. When infants perceive modeled action
X as an occasion for performing their own version of X, the fundamental
ability involved is no different from the fundamental ability of any
organism to recognize novel events as instances of known categories,
despite their novelties. The process is hardly unique to man but is, as
Piaget (1952) argued, the biological essence of intelligence."

Kaye distinguishes four levels of assimilation undertaken by the baby, all under the
general designation of imitation. The first level is involuntary assimilation as the baby
perceives the world. The second level is "accommodation" that occurs when the baby
actually responds in some manner to that which he is assimilating. The third level is
"signification" during which the baby begins to make signals indicating that he is
beginning to recognize his impact on the world. Kaye argues here that the parents'
actions greatly help the infant at this point as they tend to carry out the apparent
intentions of the baby, providing a model for the baby to imitate. First words occur at
this level. The final level Kaye terms "designation." It begins at the point at which the
baby starts to attach meanings to symbols. The acquisition of language rapidly follows.

1.8.3 The emergence of mind (a speculative account)

A system, no matter the type, can be described in terms of a phase space upon which a
probability density function is imposed. For example a physical system can be described
by means of a state space in which the momenta and position of each component of the
system are represented by dimensions in that space. A point in the space is then a state
of the system and a probability density function defined on that space will generate the
probability of occurrence for each state. Entropy is a mathematical definition on such a
probability distribution. So entropy equations may be applied to any sufficiently
described system. Of course, the meaning assigned to the values derived from those
equations will vary according to the system to which they are applied. In the case of
physical systems the entropy is a measure of the disorder in the system. The maximum
entropy less the actual entropy is interpreted as the maximum possible disorder less the
actual disorder observed and so is interpreted as the order in the system. In the case of
communication systems the entropy represents the uncertainty as to the message being
received. In some cases the physical system and the information system are the same. In
that case the entropy equations substantially coincide. Note that, in the communication
system, the maximum entropy less the actual entropy is the maximum possible
uncertainty about the message less the actual uncertainty about the message and so
may be interpreted as the information received from the message. Suppose then that we
regard the human mind as a physical/information system. It consists of the brain which
is a collection of neurons, connective tissue, chemicals and electrical potentials
occupying various states that are more or less probable according to some probability
density function. The states of the mind, in the physicalist's view of things, are associated
with the things we think. That does not mean that every state represents a thought but
that potentially a state may be associated with a thought much as a word may be
associated with a sequence of letters. In this system maximum entropy represents the
same thought or no thought associated with every brain state, or we might say,
maximum ignorance. The amount of uncertainty as to the message also depends upon
the size of the message, that is, we are more uncertain about a garbled message that is
10 characters long than one that is one character long. This is because there can be many
more messages of length ten than of length one. The amount of ignorance, like the
amount of uncertainty in the message system, depends upon the number of states that
the mind might occupy. So we would consider the maximum potential ignorance of a
mind with a billion different states as being greater than a mind with a million different
states. The actual entropy represents the actual ignorance so that maximum entropy
less actual entropy is maximum ignorance less actual ignorance. In the same manner
that we assign order to be the difference between maximum disorder and actual
disorder, and information to be the difference between maximum uncertainty about a
message and the actual uncertainty, we could assign intelligence as the difference
between maximum ignorance and actual ignorance. But we won't succumb to this
temptation to define intelligence in a semi-mentalist manner because it overlooks the
interaction of the growing mind with the environment and the effect of the material
nature of the brain (however, if we wanted, we could supplement the argument by
noting that, given the growth of total intelligence and the dual
physical/information nature of the brain/mind, we could infer that the intelligence
would manifest itself in levels). In any case, knowledge might be a more appropriate term
than intelligence to assign in this context. Intelligence is as much a dynamic multi-level
process as a state of mind.
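
The relation used throughout this argument (maximum entropy less actual entropy) can be sketched numerically. The following is a minimal illustration only, assuming discrete states, Shannon entropy measured in bits, and a 26-letter alphabet for the message-length point; the function names and the sample distributions are hypothetical and are not part of the argument above.

```python
import math

def entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def max_entropy_bits(n_states):
    """Entropy of the uniform distribution over n_states states."""
    return math.log2(n_states)

def order_bits(probs):
    """Maximum entropy less actual entropy (read as order or information)."""
    return max_entropy_bits(len(probs)) - entropy_bits(probs)

# Illustrative distributions over four states (assumed values).
uniform = [0.25, 0.25, 0.25, 0.25]  # every state equally likely
peaked = [0.97, 0.01, 0.01, 0.01]   # one state dominates

print(order_bits(uniform))  # no order: actual entropy equals the maximum
print(order_bits(peaked))   # positive: most of the uncertainty is removed

# Message length: a longer message has more possible values, hence a
# larger maximum uncertainty (assumed alphabet of 26 characters).
ALPHABET_SIZE = 26
print(max_entropy_bits(ALPHABET_SIZE ** 1))
print(max_entropy_bits(ALPHABET_SIZE ** 10))
```

Under these assumptions the maximum uncertainty of a ten-character message is ten times that of a one-character message, which is the sense in which a system with more available states has a greater maximum potential ignorance.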

As regards the growth of knowledge in the human mind a couple of possibilities
present themselves: 1) The brain is created fully connected and totally available for
acquisition of knowledge. It represents a potentially large probability space of states of
knowledge (a large potential maximum entropy) but is created with low actual entropy
and correspondingly high order (i.e. it is blank, or has a few states occupied
representing the necessary starter set of human capabilities). Then the brain, as it
interacts through the mediation of the senses of the body to acquire knowledge,
increases its entropy at a rate proportional to the rate at which it is acquiring
knowledge. In terms of order, the brain would then actually become less orderly over
time even though its knowledge increases. A knowledge acquisition mechanism might
need to be hypothesized. Or 2) The brain is not fully connected when it begins to learn.
The normal growth of the brain or stimulation of growth of the brain through
interactions with the environment by the mediation of the senses causes a rapid
expansion of the capacity to learn (e.g. through the connection of synapses). This
expansion of the maximum potential entropy of the mind at a rate greater than the
increase in actual entropy would require an increase of order. In this hypothesis an
association of knowledge with an increased order in the brain is possible and the order
is mandated. Since we equate the order in the world with the order in our minds, this
latter hypothesis is pleasing. It would also help explain why the older one gets, the
more difficult it is to learn (there are fewer and fewer new connections that can be made).
There is some evidence that the brain is plastic (or most plastic) for a period early in life
(Aoki and Siekevitz, 1988), at least in cat brains as regards the visual system and in
rat brains as regards the tactile senses associated with the rat's whiskers. There is also
evidence that production of neurons occurs in adult song birds. The new neurons may
be associated with the bird's ability to learn new songs (Nottebohm, 1989). So far there is
no evidence for the production of new neurons in adult humans. Certainly learning
slows down as a person matures. However, the fact that memories can be forgotten or
modified, unless that process is completely wasteful of brain tissue, suggests that the
same portion of brain tissue can be rewritten. The truth, like most truths, is probably
somewhere between the two extremes of wastefulness and complete flexibility.
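
The difference between the two hypotheses can be made concrete with a small numerical sketch. It assumes, purely for illustration, that order is measured as log2 of the number of available states less the actual entropy in bits; the state counts and entropy values below are invented for the example and are not measurements of any brain.

```python
import math

def order_bits(n_states, actual_entropy_bits):
    """Maximum entropy (log2 of available states) less actual entropy."""
    return math.log2(n_states) - actual_entropy_bits

# Actual entropy rising as knowledge is acquired (assumed values, in bits).
actual_entropy = [1.0, 3.0, 5.0, 7.0]

# Hypothesis 1: the space of states is fixed from the start.
fixed_states = [2 ** 20] * 4
# Hypothesis 2: the space of states expands faster than actual entropy rises.
growing_states = [2 ** 10, 2 ** 14, 2 ** 18, 2 ** 22]

order_h1 = [order_bits(n, h) for n, h in zip(fixed_states, actual_entropy)]
order_h2 = [order_bits(n, h) for n, h in zip(growing_states, actual_entropy)]

print(order_h1)  # order falls at each step
print(order_h2)  # order rises at each step
```

With these assumed numbers, the same growth in actual entropy produces falling order under the first hypothesis and rising order under the second, which is why only the second permits knowledge to be associated with increasing order.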

1.9 Cognitive systematization

At this point we wish to summarize and place in perspective the concepts presented in
the previous sections and make further observations prior to stating the hypothesis.
First however we will briefly return to philosophical matters to discuss the subject of the
content, structure, and validity of mental objects. Cognitive systematization is
epistemological in nature, dealing with the handling, verification, and structure of
knowledge. We see it as both a method for conducting research and as a possible model
for the structure of knowledge that we are trying to create in a machine mind. It
provides a model of mind because the distinction between the cognitive ordering of
information in the human mind and order among elements of knowledge in the
abstracted epistemological sense is a weak one. Any concept of order among elements
of knowledge arises first in and is accepted by the human mind of the person
conducting the epistemic inquiry. That alone would not impress a psychologist since the
linkages so derived represent only a possible state of affairs and not necessarily the one
used by the human mind. But if the purpose at hand is, as is ours, a broader
investigation of what it means to be intelligent, not constrained to human intelligence,
then those ideas can be relevant without further justification. The ideas of Hume, Kant
and Hegel, generated as they were before the advent of psychology, are more
epistemological in nature than psychological. That does not prevent those ideas from
directly affecting, perhaps even applying directly to, our generalized concept of mind.
Following is a discussion of cognitive systematization, a subject that is essentially
epistemological in nature but that has broad applications, especially to the concept of
mind.

Coherentism is a tool of cognitive systematization as reductionism is a tool of
foundationalism. It relies on the concepts of systematization as the mechanism by which
new theses are accepted. It might be characterized as a best fit, network model. There is
nothing new in systematization as a concept. The earliest Greek philosophers made
their arguments in terms of a cohesive system as did Kant, Hegel and many, if not most
philosophers since then. Piaget's mechanisms of assimilation and accommodation rely
on coherence more than the construction of an unassailable foundation. Rescher
(Rescher 1979) synthesizes the features of systematicity as culled from the discussions
of the earlier theoreticians. They include 1) wholeness, 2) completeness,
3) self-sufficiency, 4) cohesiveness, 5) consonance (consistency, or absence of internal discord),
6) architectonic (well-integrated structure, usually hierarchical, of ordered component
parts), 7) functional unity (purposeful interrelationship of internal components), 8)
functional regularity, 9) functional simplicity, 10) mutual supportiveness (of the
components of the system), and 11) functional efficacy (appropriateness to the task).
Coherentism then relies upon a kind of goodness of fit of a thesis into a body of
knowledge as measured by the extent to which the resulting system exhibits these
characteristics. Induction augments the deductive mechanism of an axiomatic system as
the means of validating new concepts; if a_1, a_2, ..., a_n is a recognizable sequence of
elements and all exhibit a trait, then it is plausible to assume that a_(n+1) will also exhibit
the trait. This sounds like a heresy to a reductionist (substituting plausibility for truth),
but should not be taken as such, for it is not intended that deductive methods be
invalidated, but that it be recognized that axiomatic systems inevitably fail to be
complete or are inconsistent (see section 1.6 on mathematics and section 2.3.1 on logic
systems). Thus in coherentism there are no unshakable truths. Systems are necessarily
complete because completeness is a prerequisite for validation (in the sense that there
are no plausible theses that cannot be validated by induction or deduction). Systems are
consistent (have consonance) for the same reason.

The success of science and mathematics, both of which are essentially axiomatic or
reductionist in methodological approach, has inhibited the development of
systematization as a formal mechanism of research. But, in fact, the actual growth of the
various disciplines in science and mathematics is one that reflects an adherence to the
style if not the mechanism of systematization as outlined above. The axioms of
axiomatic systems, in the historical perspective, are subject to change and are often
changed when the systems that they generate become implausible. Thomas S. Kuhn in
The Structure of Scientific Revolutions (Kuhn 1962) points out that contrary to the common
perception of science as a process of accretion in which new achievements are firmly
based on past achievements, it is more a process of revision of theories and the
construction of a consensus of scientific opinion. The process is marked by the
acceptance of new facts, inventions and theories that sometimes work such a change to
the network of theory, or a portion of that network, as to be termed revolutions. In the
resulting new network of theory often statements previously rejected as fallacies are
accepted as obvious truths and statements previously accepted as facts are rejected as
patently incorrect. So the actual progress of science is coherentist in nature.

The advantages of the best fit, network model are that 1) all knowledge is on an equal
footing, there being no sacrosanct ideas or facts or truths so that facts may be accepted
on a contingent basis and discarded when they no longer fit, 2) the inconsistency and
incompleteness defects of the axiomatic systems are avoided, 3) it is holistic in nature, 4)
pseudo-axiomatic sub-systems can be tolerated and 5) it is possible to derive consistent
results from a partially inconsistent body of information.

1.10 Philosophical observations

Underlying much of western thought lies a persistent Platonism (idealism) that elevates
abstract thought to a higher status of reality than the reality of the senses. Hume argued
the contrary, pointing out the dependence of human thought on the senses. He noted
that chains of reason depending on a supposed law of causality cannot be trusted and
advocated trusting judgement and intuition. That is, trusting the kind of reasoning with
incomplete information that the human mind is good at. Subsequent philosophers,
while conceding the point on causality, promptly returned to the standard reification of
thought objects. Even scientists and mathematicians succumb to the temptation to reify.
Newton defined force in terms of an accelerating mass. Force then is just the name for a
particular relation (distance changing at a changing rate) between some mass and some
point of reference. Force is not a thing in spite of the fact that it can be quantified. Yet
most people, even physicists, treat it as a thing because it is convenient to do so.
Causality too, is often misused as a justification for drawing conclusions. A
mathematical correlation between two sets of data is often interpreted as implying a
causal relation between the objects that give rise to the data. Economists are famous for
this error; at one time sunspots were blamed for the cycle of agricultural failures in
India because of the remarkably good correlation between those cycles and sunspot
activity. A positive correlation can be cited as evidence for any causal hypothesis. The
measurement of intelligence is a case that demonstrates both reification and causal
errors. Intelligence tests are considered effective because the scores on those tests
correlate well with other, non test (and usually subjective) judgements of intelligence.
Then, since it can be measured, intelligence must be a thing.[29]
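
The point about correlation can be illustrated with a small sketch. It generates two synthetic series that are computed independently of one another but happen to share a cycle length, and then measures their Pearson correlation. The 11-unit period, the phase offset, and the series names are assumptions chosen for illustration; they are not sunspot or harvest data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

PERIOD = 11          # assumed shared cycle length
PHASE_OFFSET = 0.3   # assumed offset between the two cycles, in radians
steps = range(10 * PERIOD)

# Neither series is computed from the other; they only share a period.
series_a = [math.sin(2 * math.pi * t / PERIOD) for t in steps]
series_b = [math.sin(2 * math.pi * t / PERIOD + PHASE_OFFSET) for t in steps]

r = pearson_r(series_a, series_b)
print(round(r, 3))
```

The coefficient comes out close to 1 although nothing in the construction makes one series depend on the other, so the number by itself cannot distinguish a causal link from a shared rhythm.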

David Hume, when he argued against the necessity of the law of cause and effect did
not mean by his use of the word probability, the numbers that may be associated with the
throw of a die. He meant a person had to rely on his judgement about the perceived
continuity of events in the world he was born into and not rely on some theory of
imposed necessary external ordering. The perceptions do not have to represent truth
and the judgments do not have to be infallible. Kant replied that there must be inherent
mechanisms in the human mind for calculating those probabilities. He did not have the
benefit of knowledge about theories of evolution that would explain how such innate
knowledge might come to be. It is easy for a physicalist to accept that mechanism as an
explanation of our equippedness for the world. But an idealist would protest that there is
more to the innate portion of the mind than just that necessary to enable a baby to
survive its initiation into the environment.

Philosophers have considered mathematics as evidence for such an innate portion of the
mind. Mathematical objects sometimes have analogues in the real world but sometimes
they don't. What else could explain such discoveries? Certainly if there is some ideal
underlying reality and the human mind has an inherent access to it then we might
explain human discovery of such objects. But there is another possibility. As we have
indicated, the structure and the ordering principle may not be a feature of the human
mind but a characteristic of systems in general. Mathematics may arise, not in any
predetermined manner, but as a result of the evolution under certain conditions of a
system from a state of relative disorder to a state of greater order (see section 1.7 on
entropy and 1.8 on psychological considerations). The result can be as varied as any
produced by a system emerging in an environment. The outcome would reflect initial
conditions (the original state of the mathematical system that is emerging) and the
constraints of a changing environment (the accepted axioms and procedures of the
system and their interpretation in the world). Pure mathematicians, then, conspire to
remove the associations with the world from that environment and treat the rest as
formal initial conditions. This greatly facilitates the production of mathematics but is
unlikely to reveal the kinds of truths that explain the world and that have made
mathematics the queen of the sciences.

[29] The elevation of ideas to 'thingness' can be dangerous. It implies a permanence which when
coupled with a deterministic view of the world can lead to social prejudices that cause harm
to individuals or supposed classes of people. Witness the child who, tested as below average
in intelligence, is thereby condemned to an educational environment in which he is not
expected to accomplish much and who may, for no other reason than that environment, fulfill
those expectations. But our purpose is not to make moral judgements.

An idealistic philosophy tends to preclude the possibility of a machine intelligence.
Strangely, foundationalism (together with its tool reductionism), at the opposite
philosophic end of the spectrum from idealism, in a more subtle manner, also tends to
preclude the possibility of machine intelligence. Foundationalism relies on the discovery
of universal, unquestionable truths upon which to build a structure. In the case of
machine intelligence the method implies the existence of complete and consistent rule
bases and data bases that model the world. But the discoveries in mathematics
mentioned above argue that such systems will always be faulty, and the dynamic
nature of the world implies that any such database will quickly become obsolete. The
popularity of foundationalism is easy to understand; it provides a common framework
within which workers can build. It facilitates communication and it reduces the
possibility that a change to one part of the structure will force a modification to the rest
of the structure. But this approach can be stifling and hazardous, especially for the
workers in the more encompassing sciences such as physics, biology or cognitive
science. Heisenberg, Einstein, and Darwin made their breakthroughs just because they
recognized that basic assumptions (parts of the foundations of their disciplines) were
wrong. In the resulting revolution, many persons whose works were based on the
assumptions saw the efforts of a lifetime reduced to the status of a footnote. But, so
long as basic assumptions are made, such massive shifts of beliefs are likely to occur.
Physicists in search of a grand unified theory, cosmologists in their search for the origin
and eventual disposition of the universe, biologists in their search for the origin and
mechanisms of life, and cognitive scientists in their search for a mechanism by which
intelligence may be implemented on a machine to support their theories concerning
cognition, may be hampered by such assumptions. But the changes do come. When a
science stagnates, someone eventually pushes through an assumption and a revolution
ensues. So science, when viewed as an entity at a level above that of the individual and
at a scale of time greater than that at which individuals measure their accomplishments,
shows itself as possessing a temporally extended natural systematicity.

Still another problem facing researchers into machine intelligence is the accepted
methodology of science. Most scientists conform to a research methodology first
propounded by André-Marie Ampère in the early 1800s called the hypothetico-deductive
approach (Williams 1989). In this method a scientist is free to make a hypothesis
concerning the nature of non-evident or non-observable phenomena, based on any
human reasoning process. The hypothesis has to be made in a manner that implies the
existence of observable phenomena related to the non-observable phenomena. Those
observable phenomena are then searched for, or an experiment is contrived to make them
evident. This method is difficult to apply to research into intelligence. As a contrived
example, assume a hypothesis is made about the mechanisms behind the activities that
are collectively termed intelligence. Part of the hypothesis is that intelligence is
independent of the medium in which it is observed and so may be implemented on a
computer. The rest of the hypothesis concerns the nature of that intelligence itself. The
experiment performed to test the hypothesis is the construction of a computer program
that implements the hypothesized mechanisms of intelligence. The idea is that if the
machine shows intelligence the hypothesis is supported and we have discovered the
mechanisms of intelligence. But, as might be guessed, the results of such an experiment,
positive or negative, would never be accepted as scientific evidence for or against the
hypothesis. The reason is that intelligence is subjective in nature; you cannot deduce
that a machine is intelligent from its behavior unless you have defined intelligence. But
the success of any experiment concerning intelligence can be challenged by questioning
the sufficiency of the assumed definition. Much of this paper is devoted to
circumventing that problem. We make the argument for the possibility of intelligent
machines in terms of deniably intelligent machines (DIMs), limited intelligence
machines (LIMs), and undeniably intelligent machines (UIMs). We describe the DIMs as
the existing rule-based, expert system type machines. These are machines with which,
so far as intelligence is concerned, almost everyone can find fault. LIMs are machines
with limited intelligence, a step or more beyond DIMs in their dynamic nature and
ability to deal with an environment. UIMs are acceptably intelligent to everyone except
machine intelligence bigots; they operate in the human environment with humans on
human terms. Obviously the determination of where a LIM resides on the scale between
DIMs and UIMs requires a judgement that must coherently weigh many objective and
subjective factors that may change as the LIM changes. Such a determination is not
really amenable to the hypothetico-deductive approach although, if it were, it would
clearly be a desirable situation. Nevertheless the attempt to construct a machine
intelligence obviously falls into the category of doing science.

Many proposed mechanisms of intelligence (most of which have found some
application in programs that are as a consequence widely touted as intelligent) are
explicitly or implicitly foundationalist in the sense that they consist of a set of rules,
each of whose truth is assumed, or is provable, and that are consistent, complete and
purport to represent the world or some portion of it. This is especially true of most
logic-based systems (e.g. production systems or rule-based systems, expert systems, etc.), and
in fact it is true of any non-trivial system that cannot change and learn in response to a
dynamic environment. These systems are implemented under the illusion that
intelligence is a real thing that some sufficiently comprehensive rule-base/database will
display. If an effort doesn't result in an acceptably intelligent program, then add more
rules and more data. Unfortunately, the experience so far shows that this just results in
bigger, more elaborate DIMs. Such systems, as the old adage against the possibility of a
machine intelligence asserts, can deal only with things for which they have been
explicitly or implicitly programmed. We are convinced that these approaches will never
achieve the goal of a UIM. Instead, we conclude (section 1.12) that another approach is
warranted. That approach is detailed in parts two and three.

1.11 The hypothesis

Complexity comes into the universe through a process of system building within an
existing level. The interaction of systems in a coherent manner defines the new level.
Each level as it is generated gives rise to a probability space. This ties directly to the
concept of order as described by the entropy equations. The entropy at a particular level
in the hierarchy is calculated by summing over the probabilities that derive from
defining the universe as all of the states possible at that level. In this way the effects of
other levels upon that level are contained indirectly in the measure or ignored. The
environment of a system can provide both a source of matter and energy with which the
space of states may be expanded and a sink for waste products (i.e.
dissipated and expelled portions of the system). In the hierarchy that the universe as a
whole represents, the original source is the quanta and the final sink is the universe
itself.

1. Levels may be recognized by the degree to which the objects and processes at that
level participate in or determine the objects and processes of another level. The
less the participation, the further removed the levels. Everything, with the possible
exception of the universe itself or quantum particles, is embedded in, or
encompasses, many higher and lower levels.

2. The emergence of order is consistent with entropic processes that work on all
systems at all levels at all times. The process is coincidental with organization
(preferably hierarchical) that manifests itself in systems and proceeds so long as
conditions merit. As a definition, the portion of a level with which a system
interacts is the environment of that system. A level then, may consist of many
environments. By 1. above, systems mainly affect and are affected by their
environment, their level, and, to a monotonically decreasing extent, the levels at
increasing distance.

3. The elements of a system may be either real or informational or of a dual nature
(although all information systems are ultimately grounded in real parts and are
unreal only in the sense that they focus on the nature and effects of
communication, that is, relations, between systems).

4. The rules that characterize organization emerge along with a level. New levels
may be generated whenever processes of organization proceed to a point that the
system meets the criteria for recognizing a level.

If accepted, the hypothesis presents striking cosmological variances from the
Newtonian and Einsteinian cosmologies, and it treats phenomena usually overlooked
by those cosmologies.

1. Natural laws need not apply across all levels (although nothing proposed
prohibits them from doing so). There are some indications that could be
interpreted as evidence for the non-persistence of rules/laws across levels:

a. quantum level particle behavior simply does not conform to the rules that
we take for granted at the human level. The concept of attribute, so
pervasive and concrete to us, is qualified at the quantum level. While we can
imagine instantaneous communication, the idea that something can exist as
something that can only be described as a probability wave is beyond
human understanding. Even the rules of logic are different at the quantum
level. That level is so far removed from human existence that it marks the
boundary of phenomena that can be translated into terms comprehensible
by the human mind.

b. Euclidean geometry and Newtonian physics suffice to describe the behavior
at the human level of existence but are inadequate for the description of
objects of immense size, moving at very high speeds, through immense
distances. Even within the realm of applicability of Euclidean space and
Newtonian physics, laws are seen to dissolve across levels. A mistake often
made by the makers of motion pictures, who infer that a 50 foot spider will
be 600 times more frightening than a one inch spider, is to assume that such
a creature could exist. The relationship of surface area to volume precludes
such monsters, which would promptly collapse under their own weight.[30]

c. it is equally impossible for the human mind to frame the concepts required for
any explanation of levels of existence at the opposite end of the scale of
being from the quantum level. That is, any discussion of that which frames
the universe, involving as it must concepts such as infinity and eternity, is
as foreign to the human ability to imagine (convert into human level terms)
as waves of probability.

d. chaotic processes have been detected at most levels from the molecular to
the galactic; however, there is an apparent absence of chaotic processes at
the quantum level (Science, February 17, 1989).

e. application of the rules of physics as we know them predicts that fully 90
percent of the matter necessary to explain observed galactic behavior is
missing. It is hypothesized that this dark matter exists but that we lack the
tools to see it (or the imagination to determine where it might be). It is
possible, however, that the matter doesn't exist and the rules of galactic
interaction are fundamentally different from what is projected from the
laws of physics derived from observations at the human level.

f. laws of chemistry apply only at the atomic and molecular levels; the gas
laws (i.e. laws of constant temperature, pressure, and volume) apply only at
levels at which contained and isolated gases exist (e.g. planetary surfaces
etc.). In interstellar space, gas clouds tend to form stars, a fact not predicted
by the ideal gas law.

g. laws addressing human development, civilization, technology, apply only
at or near the human level of existence (e.g. natural selection/mutation,
supply and demand, combustion, temperature).

2. The universe is becoming more complex through the evolution of levels (i.e.
creation is an ongoing process). One need only be an observer of human
civilization to form this conclusion, but the ongoing creation of stars, and the
creation of the elemental forms of matter in the death of stars are more
dramatic examples.

30
If volume is considered proportionate to weight, a spider with a spherical body of radius .5 inch
would weigh $\rho\pi(.5)^3 \approx .39\rho$, while a spider with a radius of 25 feet (i.e. 600 times larger in dimension)
would weigh $\rho\pi(300)^3 \approx 84823001\rho$. If $\rho$ converts to ounces then a spider 600 times larger than a .39
ounce version would weigh about 2651 tons. Even though his legs would be the size of telephone
poles it is questionable that they would support such weight.

3. The further removed two levels the less effect they have on one another. The
universal constant c, relating space and time, forces a correlation of space and
time scales and defines absolute limits on interaction, at least in the levels in the
immediate vicinity of human existence31 e.g.

a. galactic events have no effect on the course of human history; human
history has no effect on the course of galactic events

b. an automobile accident immediately and permanently alters our existence;
a malfunction of an internal organ is cause for immediate concern and may
alter our existence; the activity of a bacterium in the gut may or may not, in
time, be cause for alarm. The destruction or construction of a particular
amino acid molecule in our body is almost certainly no cause for alarm. The
exchange of electrons by two atoms in our body is of no consequence to us,
etc.
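
The arithmetic behind the giant spider of item 1.b and footnote 30 can be checked with a short sketch. It uses the footnote's own weight proxy (density times pi times the radius cubed) with the density set to one ounce per unit. The comparison with leg cross-section is an assumption added here for illustration (that the strength of a leg grows with its cross-sectional area); the text itself speaks only of surface area and volume.

```python
import math

OUNCES_PER_TON = 16 * 2000  # avoirdupois ounces in a short ton


def body_weight(radius_inches, density=1.0):
    """Footnote 30's weight proxy: density * pi * r**3 (weight taken as proportional to volume)."""
    return density * math.pi * radius_inches ** 3


small = body_weight(0.5)    # one inch spider, body radius 0.5 inch
large = body_weight(300.0)  # fifty foot spider, body radius 25 feet = 300 inches

scale = 300.0 / 0.5                          # linear scale factor between the two spiders
weight_ratio = large / small                 # weight grows as scale**3
leg_area_ratio = scale ** 2                  # assumed: leg strength grows as scale**2
load_ratio = weight_ratio / leg_area_ratio   # load carried per unit of leg cross-section

print(f"small spider: {small:.2f} oz")
print(f"large spider: {large / OUNCES_PER_TON:.0f} tons")
print(f"weight x{weight_ratio:.3g}, leg area x{leg_area_ratio:.3g}, load per area x{load_ratio:.0f}")
```

Under these assumptions the weight grows with the cube of the scale factor while the supporting cross-section grows only with its square, so each unit of leg must carry a load that grows in proportion to the scale factor itself.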

Thus the general theory of relativity explains the relations that exist among the largest
structures of the universe (the level above which we lose the ability to translate into
human terms corresponding to events at the quantum level). Ecological principles are
the mechanisms by which the biosphere maintains itself and becomes more orderly.
Natural selection coupled with genetic mutation and genetic mixing are mechanisms by
which the biological order of populations is increased. Culture, economics, sociological
principles and technology are mechanisms by which the order in human society is
increased. Adaptation, assimilation, accommodation and imitation are mechanisms by
which human mental systems grow, or are formed. Syntax and semantics are the
mechanisms of language. Chemistry describes the interaction of objects at the molecular
level and quantum mechanics describes the rules by which the smallest known particles
of the universe order themselves.

In this view, life on earth did not begin because of some chance combination of
chemicals, but because the conditions that lead to order were present on the early earth
and the chemical constituents in that environment were such that they led to an
expression of order that we call biological life. The observable patterns at one scale or
level of a hierarchy of systems are explained best in the context of their environment
and the subsystems of which they are comprised. Explaining one level in terms of the
features of constituent systems many levels removed cannot be complete. For example,
the grand unified theories of physics are not complete because they do not include
gravity in their quantum scheme. Gravity is a mechanism for explaining the behavior of
the universe on its largest scale and the forces of quantum mechanics do the same for
the smallest scale. Efforts to unify the two have been largely unsuccessful except when
the singularities of the general theory are explored (see Hawking 1988). At those
singularities matter is concentrated to the extent that quantum effects must be taken into
consideration.

31
There is a natural, infinitely gradated, and absolute partitioning of events in space that
occurs because of the limitations imposed by the universally constant speed that cannot be
exceeded (approximately 186,282 miles per second, a speed that only light ever reaches and
as a consequence the constant is called the speed of light). This makes it impossible for cause
and effect relations to occur between any two elements in space in time periods less than the
time it takes for light to proceed from the one point to the other. The result is a positive
correlation between the spacial and time scales at which events occur. For example the
smaller the spacial scale at which an integrated circuit is constructed the fewer time units it
requires to perform a given operation. On a larger spacial scale, the collision of two galaxies
requires very many time units to occur. The correlation is not perfect since most
interactions occur at speeds considerably below that of light. Still the relation is noticeable
and important for all levels of being. The universality of the limitation on speed together
with chaos and entropy seem to be among the very few laws that are true everywhere and at
all levels above the quantum level and below the universal level.
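
The correlation of space and time scales described in footnote 31 can be made concrete with a small sketch that computes the shortest time in which a cause at one point can produce an effect at another. The speed is the figure quoted in the footnote; the three sizes are rough illustrative assumptions, not values taken from the text.

```python
C_MILES_PER_SECOND = 186_282.0        # speed of light as quoted in footnote 31
METERS_PER_MILE = 1609.344
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def min_interaction_time(distance_miles):
    """Shortest time in seconds for a cause at one point to have an effect at another."""
    return distance_miles / C_MILES_PER_SECOND


# Illustrative sizes only (assumed round figures, smallest to largest).
scales = {
    "integrated circuit (1 cm)": 0.01 / METERS_PER_MILE,
    "earth diameter (about 7,918 miles)": 7_918.0,
    "galaxy (about 100,000 light years)": 100_000 * C_MILES_PER_SECOND * SECONDS_PER_YEAR,
}

for name, miles in scales.items():
    print(f"{name}: at least {min_interaction_time(miles):.3g} seconds")
```

These are lower bounds only. As the footnote observes, most interactions proceed far more slowly than light, so the actual time scales are longer, but the ordering of the levels by size is preserved.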

This is not meant to imply that these levels consist of some small number of readily
discernible, easily enumerated, regular, distinct levels. But it does maintain that levels
may be determined. They may be determined through either inclusional methods, in
which a combination of features serves to identify a level by identifying the members of
that level, or exclusional methods, in which elements at different levels are recognized
as such by their inability to communicate with, or affect, one another. Features that may
be used to recognize levels, though not presented in that context, may be found in the
lists of characteristics we have seen above presented by Kant, Salthe, Fodor (section 2.3),
and Rescher, and are implicit in Einstein's theory of relativity, not to mention the
works of numerous others who attempt to distinguish systems, their parts and their
interactions. The theme that characterizes these features is the recognition of scales of
space and time and frame of reference. The more completely a level is distinguished the
more sense it makes to treat the states it may assume as a probability space. This
applies even more strongly to environments. The relationship is not reversible however;
distinguishing a probability space does not imply the identification of a level. It is to
probability space that the concept of entropy applies.
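
As a minimal sketch of what it means to apply entropy to a probability space, the following computes the entropy of two hypothetical distributions over the states of an environment. The choice of Shannon's formula and the two example distributions are assumptions made for illustration; the text does not commit to a particular entropy equation at this point.

```python
import math


def shannon_entropy(probabilities):
    """Entropy in bits of a discrete probability distribution over states."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # States with zero probability contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


disordered = [0.25, 0.25, 0.25, 0.25]  # four equally likely states
ordered = [0.97, 0.01, 0.01, 0.01]     # one state dominates

print(f"disordered: {shannon_entropy(disordered):.3f} bits")
print(f"ordered:    {shannon_entropy(ordered):.3f} bits")
```

The distribution in which one state dominates has the lower entropy, which is the sense in which a system that becomes more orderly is said to decrease its entropy.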

From the point of view of a processing system, at its level, in its environment, the
processing system can behave in a way to decrease its entropy (i.e. become more
orderly). It does this in a manner consistent with its internal rules and the possibilities
presented in that environment. Of course, the constituent parts may be such that the
system achieves a stable orderly state (e.g. the earth could freeze over, raising its
albedo to the point that it reflects most of the energy that arrives from the sun, and
thus maintain itself for a long period in an orderly but static state). But obviously, it
may also progress to states far from equilibrium in which the production of order
continues for long periods.

Entities that are part of a system may see that system as their environment. A level can
have many environments (e.g. the organs of animals may be considered all at a level but
each individual animal is an environment for the organs that comprise that animal). In a
sense the various environments are different levels since the rules that govern the
systems that those environments represent may prohibit any interaction of their
subsystems (e.g. for the most part, organs of one animal do not interact with the organs
of another animal) and this in itself would act as a partition. But we will consider levels
to be defined in light of the prohibitions to communication between entities at different
levels as due to differences in spacial and time scales, though we do not insist on
attributing all partitioning into levels to those two reasons. In an environment the
entities that make up that environment are wholes. But with the possible exception of
the quanta, every entity is itself a system with subsystems. To those subsystems the
entity is the environment. The point of view of the entity is at the interface of its
environment and the subsystems that comprise its material being. Within a level, in an
environment, because of the finite number of entities making up that environment, and
relatively well defined possible relations between those entities, the possible states of
that environment are calculable and conform to the laws of entropy. The levels so
formed are susceptible to interference from other levels. Influences from remote levels
are always possible (the earth may be struck by a comet, destroying various organisms,
or a cancer at the cellular level may destroy an organism). Generally, the further apart
two levels are, the less the entities in the one can affect the entities in the other. The
calculations of probability or possibility that an entity can bring to bear are largely
limited to those that occur from its habituation to its environment and knowledge of its
internal state. The calculations will always be approximate because of the inability to
predict the effects of remote levels.

The mind exists as a system in the environment in which humans live. It consists of two
main components, the brain and the body. It is a peculiar system because it is almost
entirely given over to dealing with the environment in which it exists. It is this fact that
yields the illusion, considered real by Descartes, that the mind exists independently of
the body. All of a person's knowledge deals with relationships that apply in his
environment. It's almost impossible to distinguish between the nature of mind and
palpable entities at the level of human interaction in the environment. Since the mind is
intangible in that environment the inclination is to assign the mind an existence similar
to the other intangibles of that level, e.g. gases, clouds, auras, or objects of the
imagination, e.g. ghosts, spirits or souls.

From the hypothesis many observations can be made and conclusions drawn. We make
them because they fit into the general scheme of systematization rather than because
they can be deduced from the acceptance of levels as a fact of nature. That is, we are
making a coherentist argument rather than a foundationalist argument.

1.12 Observations and Conclusions

1.12.1 Language, meaning, intelligence

The universe exhibits an ongoing hierarchical stratification that manifests itself in the
emergence of levels consisting of complex objects along with rules that govern their
interaction. These rules of interaction (or behavior, or organization) cannot be
completely explained in terms of the rules and constituent parts of the surrounding
levels, but they in no way contradict the rules extant in those levels. The further
removed two levels, the less effect the events in the one have on the events in the other
and the less the rules of organization in the one can be explained in terms of the rules of
organization in the other. Thus the organization of galaxies cannot
be explained in terms of the interaction of molecules, and the behavior of biological
organisms cannot be explained in terms of their cellular construction and the affairs (or
lives) of a cell. The order that manifests itself at levels can be seen to occur in
conjunction with factors that can cause a decrease in disorder as described by entropy
equations. Since the entropy equations apply equally to physical states and
informational states and in many situations the two kinds of entropy coincide (e.g.
where information bearing systems are concerned), it can be inferred that the thoughts
in the minds of men cannot be fully explained in terms of their constituent parts and
rules (e.g. symbols, sense data, memory, logic, perception, intention, etc.) and/or the
information bearing connections of neurons in the brain. Further, the emergence of
every man as a thinking individual (in possession of a mind) is a process of
self-organization driven by these same entropic principles. The level into which humans
emerge and at which minds think is not unique or special and does not represent the
end of the creative process of hierarchical stratification. Another higher level can be seen
to be forming in the emergence of human civilization. It is emerging along with the
technology that characterizes it and the rules for communication among its human parts.
Language or grammar, and meaning as well, can be seen as rules in this higher level.
Human intelligence too, may be seen as a particular kind or quality of thought that is
meaningful only at this level.

In the same way that hunger is not a feature of the organs of an animal, but of the
animal itself, language is not a feature of a man but of a population of men. It results
from the interactions of individuals in a population and is important to individuals only
so much as they are members of that population. This suggests a method of determining
to which level a rule of interaction belongs; if the rule is of no use to a member, isolated
from the rest of the system, then it belongs to the system itself. It argues against any
innate language facility for it would then be expected that, like other innate
characteristics of the individual, such as race or eye color, there should be a strong
predilection, or at least a tendency, for an individual to speak a particular language. But
an individual can learn to speak any language much as he can learn to drive any make
or model of automobile. That he can reason better with the use of language is no
more remarkable than the fact that he can haul more potatoes to market with the use of
a truck. Further, the capacity to communicate does not rely on the ability to hear or
speak; the deaf and dumb learn sign language. Even the deaf, dumb and blind learn
signing through touch and braille. Large portions of the human mind are devoted to
handling communication with the other entities that share the environment but it is not
pre-programmed for a language. Still normal, formal, language is the preferred means
of communication between humans. Language is the mechanism by which procedural
knowledge is made declarative and declarative knowledge is made permanent and
available. Its chief effect is to facilitate the maintenance and growth of order in the
population to which it belongs. That civilization (at least the technologically oriented
civilization of which we are a part) is rapidly evolving beyond the comprehension or
control of any of its members. This suggests that civilization is interposing itself
between man and the ecological system, becoming the next higher level for the human
organism. An average man would find it difficult to survive outside the protection of
that civilization. The dependence is not as pronounced as that of a cell of a multicellular
organism (the cell would find it impossible to live outside of its organism), but certainly
the dependence is greater today than it was thousands of years ago. So the effects of
language on the individual are secondary in the sense that first the environment within
which the individual develops is changed by the communication afforded by language
and then the mind of the individual that grows into that environment is different from
that which it once would have been.

So it is a modern myth that the technical achievements of mankind were produced
because of the superior intelligence of man. The civilization and the minds that fit into it
emerged together. Consider that the myth is not extended to other social animals such
as ants, termites and bees. While the termite engineering that keeps a termite colony
cool, or the ant agriculture that provides aphid herding and fungi gardening, or the bees'
application of eugenics to maximize the success of the hive may be marvelous
inventions, the individual members of those communities are nevertheless considered
mindless automatons. Humans can be very objective where the human ego is not
concerned. Interestingly, humans are almost oblivious to their own composite nature.
Like an ant colony, the human body consists of an organization of cooperating entities.
Biologists are aware of that fact and occasionally remark on it. Lewis Thomas in The Lives of
a Cell muses:

"Mitochondria are stable and responsible lodgers, and I choose to
trust them. But what of the other little animals, similarly
established in my cells, sorting and balancing me, clustering me
together? My centrioles, basal bodies, and probably a good many
other more obscure tiny beings at work inside my cells, each with
its own special genome, are as foreign, and as essential, as aphids
in anthills. My cells are no longer the pure line entities I was raised
with; they are ecosystems more complex than Jamaica Bay."

If it were true that the glories of the whole vest in its components then we should give
full credit to these little beasties who have so cleverly put us together. That is a tempting
thought but one with uncomfortable ramifications...Lewis Thomas continues:

"I like to think that they work in my interest, that each breath they
draw for me, but perhaps it is they who walk through the local
park in the early morning, sensing my senses, listening to my
music, thinking my thoughts."

The interactions and characteristics of entities at the level of human existence should not
be imposed upon other levels of existence and the interactions and characteristics of
other levels should not be reckoned as originating at the level of human perceptions.
Individuals in modern society are in a curious position. Because of technology they
have observational access to other levels. Whenever possible the data from those levels
is presented in a comprehensible form; atoms become billiard balls, weather fronts
become jagged blue lines on a map, bacteria are little bugs, and the world is (or was
until it was photographed from space) a multicolored globe on which geologic and
geopolitical distinctions are represented by bright pastel colors. At levels further
removed, the metaphors break down and the description of objects and relations
becomes purely symbolic; the planets execute trajectories that are the solutions of
differential equations, electrons move and occupy positions according to probability
distributions, and the wealth of nations fluctuates according to the interplay of variables
that include interest rates, trade balances, exchange rates, tariffs, and the relative costs
of labor and capital. While the levels being observed certainly exist, it must be kept in
mind that the observations are always interpreted by minds whose only direct
experience is at the level of human existence.

It is a common mistake to assume that the processes observed at the human level of
existence pervade all of the universe. Spencer made that mistake when he projected the
processes that characterize earthly evolution as applying universally. As a further
example of this mistake consider economic systems.

In the eighteenth century, Adam Smith recognized what he called the invisible hand. By
this he meant the self-regulating, self-organizing nature of individual industries that
leads to the growth of wealth and order in an industrial economy. His theory was a
rejection of the Mercantilist theory which concentrated on the accumulation by trade of
the representatives of wealth (gold, silver, etc.). Smith investigated division of labor
and supply and demand as mechanisms by which to explain the observed production of
goods. Supply and demand work in the market place as the buyer (demander) and
supplier fix prices for their goods and labor at a level that satisfies both. When the
suppliers of one good or service garner profits that exceed the profits of other goods
and services, those making lower profits will divert resources to the more profitable
endeavors. This raises the supply of the high profit good or service and results in a
lower price to the consumer and less profit to the supplier. Eventually an equilibrium is
reached at a point that results in an allocation of resources to the production of goods
according to the demand, and a reward to labor according to its productivity. This
natural system results in an optimal production of wealth as measured by the positive
gain of goods and services (as opposed to accumulation of gold). The implication for
political institutions is that they should exercise laissez faire. A century later, upon
witnessing the vast inequities in the distribution of wealth in the industrial states
(capitalist economies) that followed Adam Smith's dictate, Karl Marx rejected the
theory and maintained that the possessors of goods and land come by their property
through extortion and/or political means and not by productive contribution to society.
Marx borrowed the Hegelian dialectic and applied it to a materialistic concept of the
universe. The world was ever changing and progressing to a higher order. Thesis
merged with antithesis in a synthesis that represented progress. To Marx, labor was the
elemental material from which the wealth of nations derived. In particular Marx held
that capitalism was a perversion of that natural progression, in which the wealthy
and powerful actively prevent the workers from taking their labor to market. Workers
cannot and will not get what a capitalist calls a fair market value for their labor. They
become, in effect, slaves who receive only a subsistence share of the wealth of society.
They are kept in that condition by a collusion of the state and the wealthy. The
successful oppression of the workers results in the further accumulation of wealth in the
hands of the few. Eventually, because of its antithetical nature the whole capitalist
structure must collapse, perhaps in a depression as the human and natural resources are
exhausted, but most probably in a revolt of the workers. As an alternative system Marx
proposed communism. Communists believe that the state must step in and direct the
distribution of the goods and the allocation of the resources according to the maxim
from each according to his abilities, to each according to his needs. In a fair economy the
workers will happily produce for the good of all and at a level exceeding what they
would produce for capitalist taskmasters. Eventually the natural equilibrium will be
restored, the synthesis achieved and a state of unlimited, equitably distributed wealth
will result. The state will have served its purpose and can wither away. Activist
communists must encourage revolution in nations that subscribe to capitalist notions in
order to hasten the day that the oppressed workers are freed, utopia achieved, and the
state dissolved.
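
The equalization of profits that Smith describes above can be illustrated with a toy model of two goods competing for a fixed pool of resources. Everything in the sketch is a hypothetical simplification introduced here: the profit function (demand divided by supply), the demand figures, and the adjustment rate are assumptions chosen only to show resources drifting toward the more profitable good until the profits are equal.

```python
def profit(demand, supply):
    """Toy profit per unit of resource: rises with demand, falls as supply grows."""
    return demand / supply


def reallocate(demands, supplies, rate=50.0, steps=50):
    """Shift resources toward the more profitable of two goods until profits equalize."""
    a, b = supplies
    for _ in range(steps):
        gap = profit(demands[0], a) - profit(demands[1], b)
        shift = rate * gap  # positive: resources flow from good B to good A
        # Never move more than half of a sector's resources in one step.
        shift = max(-0.5 * a, min(0.5 * b, shift))
        a, b = a + shift, b - shift
    return a, b


demands = (30.0, 10.0)   # hypothetical demand for goods A and B
supplies = (50.0, 50.0)  # resources start evenly split, so A is the more profitable good

a, b = reallocate(demands, supplies)
print(f"resources: A={a:.2f}, B={b:.2f}")
print(f"profits:   A={profit(demands[0], a):.3f}, B={profit(demands[1], b):.3f}")
```

In this sketch no rule instructs the system to divide resources in proportion to demand; that allocation appears as a consequence of each supplier following the higher profit.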

The true state of affairs is that capitalism is never imposed by government. To the
contrary, it arises when the state does not meddle in the affairs of producers and
consumers. On the other hand, communism seeks to impose, on an existing system, an
artificial set of rules that it perceives to be appropriate. It tries to achieve economic,
political and social goals by constructing a system from existing parts and new rules.
The mistake is believing that by arranging the thesis and antithesis, the synthesis (at a
higher level) can be controlled. Unfortunately the imposition of the good at one level will
not necessarily result in a good system at another level. It is not necessarily true that
the concept of the good applies to economies. We have seen that the rules and parts of a
system emerge inseparably and together. Modifying the one or the other inevitably
modifies the system as a whole, and that change cannot be predicted.

Both capitalism and communism have been put into practice in the twentieth century to
the extent that significant evidence is available to evaluate the hypotheses that support
those systems. It seems evident at this point in time32 (1989), that capitalism works to
allocate resources and goods at the level of an industrial economy in a natural manner
that achieves great wealth as compared to the state controlled systems. This would
support the prediction that when that natural inclination of the parts of an economic
system is subverted by the imposition of arbitrary rules, that economy works with an
efficiency limited by the degree to which the imposed rules conflict with the natural
rules. The failure of the communist economies to be as productive as the capitalist
economies is really the result of a failure to see that a system is more than the sum of its
parts and rules, and cannot be fully explained in terms of those constituents. Goals
cannot be artificially imposed on such a system and be expected to be achieved through
the simple manipulation of the rules. This is not to say that a social/economic system
with given goals cannot exist. Marx was just born too late to effect a successful
transformation to his dream of utopia and too early to see that such a transformation
might not be necessary.

It would be helpful if man would see his civilization as one facet of the continuation of
a process, initiated four billion years ago, toward order on the surface of the earth. That
process has no interest in the success or failure of any person or of man as a species. The
process is not a thing and it is not cognizant, and it did not produce order for some
purpose akin to amusement or so that it could know itself (however engaging that idea
might be). In fact it is highly unlikely that any idea based on analogy with human
motives and emotion (intensions) motivates anything anywhere except on earth or other
earth-like environments and at the human level of existence. Obviously, nothing
prevents the process from giving rise to systems that know themselves. In fact we can
infer that all systems at all levels have an attribute that may be called awareness. Given
the common sense meaning of the word, the existence of awareness in systems is
implicit in the definition of a level or an environment as a system of entities that interact
and participate in the existence, maintenance, and disposition of that environment. The
coherent interaction of an entity with the other entities in its environment is the
awareness of that entity of its environment. The more complex the environment the
more complex the interaction and the more acute the awareness. Awareness by one
entity of the fact that a second entity is also aware and has a similar view of the
environment is a primary requirement for the imputation by the first of intelligence in
the second. Because a human, by using language, can easily achieve this rapport with
other humans, but must expend great effort to communicate with other non-speaking
entities, he considers his kind the more intelligent. Intelligence defined in human terms
is a subjective measure with man as the yardstick. But intelligence can be an objective
measure when described as a measure of the proper and appropriate functioning of an
entity in its environment.

And here we can see the origin of what are popularly termed intensional systems; those
that seek to explain the activities of the human in terms of the goals, purposes, hopes,
beliefs, desires, fears, hunches or, in general, intensions of that system. They are the
result of the reification through the facilities of language, of observations made by a
human of himself and of other systems at the level of human activity. The name (hope,
fear, purpose etc.) comes to represent a thing that, because of the inability of a system at a
level to interact across levels, is assumed an object of universal import. There is little
harm in such deceptions except as they affect the attempts of men to create, in an
entirely different medium than a man, that which they have reified as intensions of one
sort or another.

32
Judging by the success of the economies of states that apply the capitalist dictum laissez
faire (let them be, referring to business).

1.12.2 Intelligence and machines

The idea that a thinking, cognizant entity can be created, as it were, from whole cloth, is
as chancy a conjecture as the sometimes proffered hypothesis that the earth and
universe were created in totality a few thousand years ago in complete detail: light
streaming through light years of space, from stars that didn't exist just seconds before,
down to an earth replete with fossil records of animals that never existed, strewn
through layers of sediments that were never formed, by seas that never were. Certainly,
it is easy to reject that possibility, both for the case of the creation of the universe and
the creation of a mind33. We are persuaded that minds emerge as a system of mental
objects and rules through the interaction of the human brain and body with its
environment, and that the emergence is a crucial aspect of its existence. Further, there
is nothing magical or special in the creation of a mind, it occurs as a result of the same
ordering process that leads to the creation of the objects and rules that characterize any
of the levels we observe in the universe. The machine equivalent of a mind, to be other
than a carefully constructed sham whose purpose is to win acceptance as intelligent,
must emerge in a similar manner. This point is reiterated in the conclusions below.
There are other implications for a machine intelligence, of a universe consisting of
hierarchically stratified semi-permeable levels. There are problems of scale in space and
time.

An acceptable (to humans) machine intelligence must, as a prerequisite, exhibit two
features: it must perceive spatial features on a scale close to the scale at which a man
perceives spatial features and it must have thought processes that recognize sequential
events on a time scale approximating the time scale at which men perceive the passage
of time. The Gaia hypothesis, that the biosphere is a self-organizing, processing system,
is unacceptable to most people even if no intelligence is attributed to that organization.
But it is obvious that, even if the intelligence of the Mother Nature of myth (conscious,
purposeful action) could be attributed to the biosphere, the space/time scaling of such a
sentient being would be orders of magnitude larger than that of an individual human.
Only the most liberal of men would accept as intelligent, beings that operate on time
and spatial scales that are to them only an abstraction. Even God that rules the universe
(or the Gods depending upon the religion) is perceived by believers to be an
anthropomorphic being operating at (or slightly above) the human scale of being. The
failure to recognize the scale of things as an important aspect of consciousness has
sometimes given rise to arguments against the possibility of machine intelligence.

33 We are using the word mind in its normal sense as the complex composition of mental
objects and thought processes that include reasoning, self-awareness, other-awareness,
personality, emotional response, etc. The word, by default, is associated with humans. Other
animals (and even some humans), while possessing many of the attributes of mind, may be
excluded from the ranks of those who possess minds by careful definition. The words an
intelligence are often used when the possibility of a non-human mind is discussed. The
principal qualification for the possession of a mind is the acceptance of that mind by other
minds. Nothing in this precludes animals or machines from possessing the equivalent of a
mind, or an intelligence.

One such argument involves a model of a computing machine running a program. The
model consists of a man in a room with an instruction book, indexed in Chinese, but
with entries written in English, the man's native language. Also in the room are large
filing cabinets, indexed in English and full of data written in Chinese together with
other English instructions. Occasionally a secretary enters the room with a message
written in Chinese. The man consults the Chinese index in his instruction book and
retrieves the English instructions. They tell him how to access the file cabinets. From the
file cabinets he gets Chinese characters and/or further English instructions that he
follows. Ultimately the instructions cause him to construct a message in Chinese. The
secretary comes back and retrieves the message and delivers it back to the outside
world. The man represents the computer's CPU, the filing cabinets full of data and
instructions represent the computer's memory, loaded with data and programs. The
secretary represents input and output channels. The claim is that this system can do
anything that a computer can; in particular, it can run an AI program. Inasmuch as no
one would consider the output from this system intelligent, no one should consider the
machine running an AI program intelligent.

The argument is wrong because it involves two translations of scale and ignores some of
the other requirements of an AI system. By focusing on the CPU level (or the inner
workings of the "office"), the scale on which the overall system works is overlooked. If
one described the functioning of the human brain as a system of electrical discharges
predicated upon the chemical disposition of various cells enclosed in a bony
carapace...initiated by and resulting in discharges that enter and leave through one
access channel into the carapace, then it would be difficult to view the brain as
possessing intelligence. The second change of scale is in time. The system could not
possibly react on a time scale that humans perceive as the scale at which intelligent
interactions occur. Further, the system is fixed; it has no chance to grow. The man can't
hire more secretaries, buy more filing systems, rearrange the ones he has in order to
improve access, develop techniques to make the office run more efficiently (such as
associating recurring inputs with outputs so that the busy work of accessing the files can
be avoided), or rewrite some of the instructions or even some of the data when it
becomes apparent that to do so would improve the operation of the system. Finally, no
entity can be considered intelligent in a void. An environment in which to be intelligent
is essential. Because of the scale differences, an office system, such as the one
hypothesized above cannot interact with humans in a human environment. But let us
hypothesize similar systems with which the office system can communicate and a
universe of other objects that operate at the appropriate time and space scales and that
can in some way be sensed by the office systems. In such an environment and to the
extent that their ability to communicate permits, the office systems would consider each
other as intelligent entities. Humans would not consider the office system intelligent,
but they might well consider a computer, operating at their scale, in their environment,
with an ability to grow, learn and change as an intelligent being.

1.12.3 Conclusions

To be explicit about what we hope to have accomplished in this first part we offer the
following summary of conclusions. The principle involved in the creation of an
intelligence is the same principle that explains the organization that manifests itself
throughout our world and seems to hold throughout the universe. We are brought
to this conclusion by the observation that entropic descriptions of physical systems that
are essentially information bearing systems (e.g. RNA, DNA, neural networks, etc.)
describe the same thing under interpretation as information or structure. From this we
feel justified in applying to intangible information systems, the observations concerning
stratification and inter-level rule generation so obvious in physical systems. Then, that
which we recognize as mind and invest with the quality of intelligence emerges into the
human environment as a multitude of mental objects and associated rules. As noted,
the rules/laws that emerge along with a level are not reducible to the functions and
interactions of constituent parts at lower levels. As will be discussed in part two, the
symbolic/functional and neural net approaches to the implementation of machine
intelligence are, in essence, attempts to recognize the rules and constituent parts of the
lower levels of the human mind. Obviously any implementation of human-like
intelligence in a machine will have to be made at some lower level. Functionalists will
argue that their level is closer to that of the human mind and would therefore be the
most natural level at which to attempt an implementation. Those enamored of neural
networks will argue that their model more perfectly imitates the structure of sublevels
of the human mind and that, though it might be at a level further removed than the
functionalists' level, its similarities to actual human brain structure make it the
preferred implementation mechanism.

However, the problem is not in the level at which the implementation takes place, or in
the similarity of the model to the system being emulated. Rather the problem is in the
implementation philosophy. If, as is the usual case, an attempt is made to create a
functioning, fully capable intelligence, directly from human level rules of behavior, or
even from constituent parts, whether they be mind function equivalents, or
neuromorphic mechanisms, the resulting intelligence is bound to be deficient. The
natural process of producing new human minds does not attempt this fait accompli. That
is, human minds aren't born full blown; they are grown over a long period of interaction
in an environment; they emerge. Only the most rudimentary, necessary knowledge
about the operation of the body, is transferred directly from the molecular level to the
mind level. Any successful implementation of human-like mind in a machine will have
to provide a program that can emulate this process of a human mind growing and
learning in an environment. This is not just because it is impossibly difficult to produce
a program that represents a full grown intelligence capable of dealing expertly with an
environment that it has never seen, but because human intelligence is truly an emergent
property of the human brain and body at the human level of existence as it grows into
its environment.

The problem is further complicated by the fact that the system that is called the mind
has a body component. Descartes tried to isolate those components as separate entities,
but they are not. The mind and body grow and emerge together as a system. There is a
great deal of confusion about this because thoughts (the product of the mind) and acts
(the product of the body) are different phenomena. And the mind is considered
intangible while the body is very tangible. But it is the brain and the body that are better
described as separate systems. They are subsystems of the mind, existing at a level
below the mind. A brain without a body or a body without a brain is useless, even in
perfect health. When joined together and allowed to interact with an environment they
become more than the sum of brain and body. They become a cognizant entity. If
allowed to interact in human society they become a person. Person and cognizant entity
are kinds of minds. This will be further discussed in parts two and three.

The significance of all of this for the purpose of this thesis is that all examples of
systems that are unconditionally considered intelligent (humans) belong to the class of
processing, self-organizing systems, described above. Figure 1.3 part A provides a
graphic representation of such a system (in this case a person) showing the interaction
between the person and the environment. Computer programs, including AI programs,
are processing systems only in a limited sense and, at present, are not self-organizing.
Their limited material nature and its interaction with an environment are
portrayed in part B of the same figure. It is a corollary of the hypothesis that the
activities exhibited by systems that are interpreted as evidence of intelligence, can be
attributed to that self-organizing, energy-processing and order-producing nature, the
specific form of which depends on the environment. AI systems must be constructed in
a manner that more closely approximates those systems. To be more specific, and as
explored more fully below, AI systems should be constructed that can learn, grow,
change, and interface with the surrounding environment in manners analogous to their
human counterparts.

[Figure 1.3: two panels showing lines of influence between a focal level and its adjacent levels.
Panel A (person): higher levels; immediate higher level: Environment (regulatory constraints,
initiating conditions); focal level activities (what's happening); immediate lower level:
Person (Material Nature) (material constraints, initiating conditions); lower levels.
Panel B (computer): higher levels; immediate higher level: Laboratory (Programmer) (regulatory
constraints, initiating conditions); focal level activities (the activity of the program);
immediate lower level: Computer (Program and architecture) (material constraints, initiating
conditions); lower levels.
Other labels in the diagram: "developmental" and "(feedback only)".]

Figure 1.3: Interaction between a person and adjacent levels vs. computer and
adjacent levels

Emerging Systems and Machine Intelligence

Part two - The State of the Art


2.1 Introduction

The conclusions above paint a gloomy picture for the possibility of creating intelligent
machines. The idea that all of the information necessary to deal with a real world can be
programmed into a machine is precluded. This problem has been anticipated by Terry
Winograd and Fernando Flores (Winograd, 1986) who believe that intelligence is a
manifestation of the dynamic nature of the structures of the mind and their reflection of
the ever changing environment into which an organism is thrust. They elaborate on the
hypothesis, asserting that intelligence is not only affected by, but is actually a result of
our history of involvement and continuous participation in the world, in particular, in
human society. In setting forth these ideas they draw upon the works of the philosopher
Martin Heidegger and the biologist Humberto Maturana.

"Heidegger argues that our being-in-the-world is not a detached reflection
on the external world as present-at-hand, but exists in the readiness-to-hand
of the world as it is unconcealed in our actions. Maturana through
his examination of biological systems, arrives in a different way at a
remarkably similar understanding. He states that our ability to function as
observers is generated from our functioning as structure-determined
systems, shaped by structural coupling. Every organism is engaged in a
pattern of activity that is triggered by changes in its medium, and that has
the potential to change the structure of the organism (and hence to change
its future activity). Both authors recognize and analyze the phenomena
that have generated our naive view of the connection between thinking
and acting, and both argue that we must go beyond this view if we want
to understand the nature of cognition - cognition viewed not as an activity
in some mental realm, but as a pattern of behavior that is relevant to the
functioning of the person or organism in the world."

Kenneth Kaye in analyzing the growth of human babies into intelligent persons (Kaye
1982) concludes that the social system into which the baby is thrust is responsible for the
growth of its intelligence. Kaye's exposition is relevant to the growth of human
intelligence and is consequently of relevance to any researcher who would aspire to see
that kind of intelligence replicated in a machine (see part three below).

If it is the case that human intelligence is a dynamic and changing condition dependent
upon interaction with the environment (past and present) and the human social
structure for its existence, then researchers under the traditional paradigms in artificial
intelligence are not likely to succeed. They have been working under the assumption
that they can build into a program representations of knowledge that when activated
will exhibit intelligence. They are likely to be frustrated in their efforts for two reasons:

1. If intelligence is created through the interactions of the individual within
human society then machines will not be intelligent. Machines cannot
interact with humans as humans and, for the present, interact with their
environment almost not at all.

2. When researchers build programs to be intelligent they fix the means by
which the machine can represent the world. This cannot lead to
cognition, for it is precisely the dynamic nature of interaction with the
constantly changing requirements of the (largely social) environment
and the ability of the environment to change the organism which gives
rise to intelligence.

Winograd and Flores conclude that the situation precludes the development of an
artificial intelligence. They admit however that they are leaving intelligence as a poorly
defined concept. Their conclusion is inescapable if what is meant by artificial
intelligence is the duplication of human intelligence; the first of the two reasons given
above is not likely to be overcome to the extent that a machine intelligence is ever
accepted as a human intelligence. There is, however, nothing inherent in the nature of a
machine that precludes growth and learning, and consequently, intelligence in the
broader sense of an undeniably intelligent machine. Further, since interaction in the
environment is an essential part of intelligence a direction for research is indicated. The
interaction of a dynamic learning program with other intelligent systems (in particular
with humans) in an appropriate environment becomes a candidate system for
development. If an intelligence with which humans can identify is desired then the
kind of environment in which the proposed system should grow and learn should
include features whose meanings can be shared by man and machine and should be
populated by men and machines.

Consider the development of a machine with sublevels sufficiently like those of humans
that it can learn and grow in the human environment. Further suppose the machine has
physical and mental capacities significantly similar to those of a human. Place it into the
human environment (or a human-like environment) where it can interact with humans
and teach it. In light of the hypothesis on the nature of the emergence of levels, we can
expect a mind to emerge according to the same principles that lead to the emergence of
a human mind. This seems like an interesting project and is taken up further in part
three. Unfortunately, any implementation of sublevels and environments will have to be
constructed from the tools at hand. So before we can discuss implementation it would
be helpful to review the current progress towards the goal of machine intelligence. To
that end we review the body of knowledge that now comprises the discipline of
computer science as it bears on this problem. The advent of cognitive science is
discussed along with the current debate that pits symbolic-logic against neuromorphic
approaches. We will attempt to determine how the tools that have been generated by
this research can best be put to work in a manner consistent with the conclusions of the
previous section. Part three will then discuss an actual implementation scheme.

2.2 Computer Science

Gödel's incompleteness theorem states that in any consistent axiomatic system rich enough
to express arithmetic there will be meaningful statements that can be neither proven nor
disproven. That is, there will be undecidable assertions. The question then arises, "given a
particular assertion, can it be determined whether that assertion is undecidable?" The question amounts to asking
about the possibility of finding an effective procedure that can provide a test of the truth
of the thing being asserted. An effective procedure is just a specific sequence of
operations or an algorithm. Thus if the assertion is made that twenty-two is an even
number, the Euclidean division algorithm can be used to divide twenty-two by two. A
remainder of zero shows that twenty-two is even. An effective procedure that can be
successfully applied to an assertion to determine whether it is true or false is called a
decision procedure. So a decision procedure for the assertion that twenty-two is an even
number does exist. The existence of decision procedures was investigated by Alonzo
Church. By 1936 he had managed to prove that, for the general case of an arbitrary
assertion, no decision procedure was possible34 . In the course of his investigation he
defined the concept of effective procedure in terms of recursive functions35 . The results
of the research into the nature of effective procedures became known as the theory of
computability. This theory defines a class of functions that can be calculated in terms of
a very limited set of primitive functions36 . The derivable functions (that by Church's
above result cannot be all of the functions that exist) are called partial recursive
functions. Recursive function theory restricts the domain of the arguments of its
functions and the range of the values calculated by those functions to belong to the set
of natural numbers. Partial refers to functions that calculate a value for a subset of the
natural numbers, i.e. the domain of a partial recursive function is a subset of the natural
numbers. If a recursive function calculates a value for every natural number it is termed
a total recursive function. It may seem that restricting the domain and range of
recursive functions (that are, after all, supposed to represent any kind of effective
procedure) to natural numbers might work to constrain any results. Such constraint is
less than would be expected for it can be shown that other enumerable sets of objects
may be mapped to the natural numbers, a recursive function applied, and the reverse
mapping performed, so that, in effect, recursive functions work on any enumerable set
(a set that can be put in a one-to-one correspondence to the natural numbers or a subset
of the natural numbers). In particular, assertions can be mapped to true or false, i.e. the
set {0, 1}; thus the theory applies to the objects of logic. The partial recursive functions,
Church hypothesized, are capable of representing all the things that can be calculated
(i.e. for which effective procedures exist). The hypothesis remains a hypothesis because
the original concept of an effective procedure is intuitive. The attempt to formalize it in
terms of recursive functions is justified by the fact that all of the other numerous
attempts to formalize the notion have been shown to be equivalent to Church's
treatment. Still, the idea remains unproven and subject to suspicion. In any event, the
class of partial recursive functions is very versatile and powerful. At that time,
electronic computers were in the very first stages of development and motivated the
research into these ideas only tangentially, but the theory being developed by Church
and equivalent theories being developed by his contemporaries Post, Markov, Kleene,
and Turing, would provide the basis for the science of what computers can do. This was
especially true in the case of the theoretical computing machines developed by Alan
Turing, now called Turing machines (TMs).

34 To see that functions exist for which no effective procedure is possible consider the
following. An effective procedure is necessarily finite in length and must consist of finite
strings of symbols. So the set of all effective procedures may be put into one-to-one
correspondence with the integers. The set of functions that map the nonnegative integers onto
the set {0, 1} may be put into one-to-one correspondence with the real numbers. Since those
functions alone outnumber the effective procedures there must be functions for which there is
no effective procedure.

35 A recursive function is a function that is defined in terms of its previous values (and some
starting value). It may be defined precisely as follows (Cutland, 1980). Given two functions
f(x) and g(x, y, h(x, y)) where x = (x1, x2, ..., xn), and y is an index, a recursive definition of
h is given as follows:

h(x, 0) = f(x),
h(x, y+1) = g(x, y, h(x, y)).

For example the definition of addition as h(x, y) = x + y becomes:

h(x, 0) = x
h(x, y+1) = h(x, y) + 1

That is, x is a single argument, f(x) = x, and g(x, y, h(x, y)) = h(x, y) + 1. To see how this
definition works assume x = 3 and y = 2. We can calculate h(3, 2) as follows:
h(3, 2) = h(3, 1) + 1 = h(3, 0) + 1 + 1 = 3 + 1 + 1 = 5.

36 That is, the function can be obtained by the use of composition and minimization from the
following list of functions (Davis 1981):

1. C_A(x) (the characteristic function of A)
2. S(x) = x + 1 (the successor function)
3. U_i^n(x1, x2, ..., xn) = xi for 1 ≤ i ≤ n (the projection function)
4. x + y
5. x - y (subtraction when x ≥ y)
6. xy
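The recursive definition of addition given in footnote 35 can be tried out directly. What follows is only a minimal sketch in Python, assuming a single argument x rather than a tuple (x1, ..., xn); the names primitive_recursion, add and multiply are illustrative and do not come from Cutland or Davis. The value h(x, y) is built upward from h(x, 0) by a loop, which yields the same result as unwinding the recursion downward as the footnote does. The multiplication example is an added illustration of the same scheme.

```python
def primitive_recursion(f, g):
    """Build h such that h(x, 0) = f(x) and h(x, y + 1) = g(x, y, h(x, y))."""
    def h(x, y):
        value = f(x)              # the starting value h(x, 0)
        for i in range(y):        # apply g once for each step from 0 up to y
            value = g(x, i, value)
        return value
    return h


# Addition as in the footnote: f(x) = x and g(x, y, h(x, y)) = h(x, y) + 1.
add = primitive_recursion(lambda x: x, lambda x, y, prev: prev + 1)

# Illustration only: multiplication built on addition in the same way.
multiply = primitive_recursion(lambda x: 0, lambda x, y, prev: add(prev, x))

print(add(3, 2))       # the worked example from the footnote, h(3, 2)
print(multiply(3, 4))
```

With these definitions add(3, 2) follows the same chain of values as the worked calculation in the footnote: 3, then 4, then 5.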

The purpose of a TM is to execute an effective procedure. The idea is to reduce the
process of executing such a procedure to its most basic mechanical components. Turing
proposed a machine consisting of an endless tape (usually one that has a beginning but
extends as far as is desired) that feeds through a machine that can read from or write
onto the tape. The machine can be thought of as a mechanical or an electronic
contrivance that is able to transition between a limited number of internal states. The
transitions the machine makes while running are dependent upon the current state and
the symbol currently being read on the tape. Transitioning to a new state might entail
moving the tape forward, or backward, and/or writing onto the tape. To keep it simple
the symbols that may be read from or written to the tape are usually restricted to the
binary set {0, 1} plus one special symbol, usually a blank. One or more of the internal
states is special and represents a state of acceptance. A TM signals the successful
completion of a procedure by entering one of these states. When in a state from which
no transition is specified the machine halts (accepting states normally exhibit this
feature). So a TM can be specified mathematically by specifying the set of characters that
may be on the tape, the set of states (including at least one accepting state), and a table
that provides the state to which to transition, and the mechanical action to perform
(move forward, backward, write a zero, write a one) given the current state and the
symbol being read. A sequence of symbols on a tape then may represent some
procedure that the machine can execute by being placed at the beginning of the tape in
some state (that can be designated as the start state) and being allowed to run until it
possibly halts, hopefully in an accepting state. The output of the procedure is whatever
is written on the tape (or some designated portion of the tape) when an accepting state
is entered. Figure 2.1 represents a simple TM that will traverse any series of zeros and
ones on the tape, converting all of the zeros to ones. It accepts and halts when it runs out
of symbols. To make the use of a TM easier, various kinds of systems can be set up in
which the natural numbers are represented by some simple code and specific areas on
the tape are designated to contain data, programs, and results.
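The table-driven description above can be made concrete with a small simulator. The sketch below, in Python, encodes the machine of Figure 2.1 (states Q, R and Accept; symbols 0, 1 and B for blank) as a dictionary from (state, symbol) to (symbol to write, head movement, next state). The names RULES and run, the max_steps guard, and the choice to treat everything beyond the end of the input as blank are assumptions made for the illustration, not part of the original figure.

```python
BLANK = "B"

# (current state, symbol read) -> (symbol to write or None, head movement, next state)
RULES = {
    ("Q", "1"): (None, +1, "Q"),        # seeing a 1: move forward, stay in Q
    ("Q", "0"): (None, 0, "R"),         # seeing a 0: go to R without moving
    ("Q", BLANK): (None, 0, "Accept"),  # out of symbols: accept
    ("R", "0"): ("1", 0, "Q"),          # in R: write a 1 and return to Q
    ("R", "1"): ("1", 0, "Q"),
    ("R", BLANK): ("1", 0, "Q"),
}


def run(tape_string, state="Q", max_steps=10_000):
    """Run the machine from the start of the tape; return (final state, tape)."""
    tape = list(tape_string)
    head = 0
    for _ in range(max_steps):
        symbol = tape[head] if head < len(tape) else BLANK
        rule = RULES.get((state, symbol))
        if rule is None:                # no transition specified: the machine halts
            break
        write, move, state = rule
        if write is not None:
            if head == len(tape):       # extend the tape when writing past its end
                tape.append(BLANK)
            tape[head] = write
        head += move
    else:
        raise RuntimeError("step limit reached without halting")
    return state, "".join(tape)


print(run("0101"))
```

The step limit is there only because a simulator cannot know in advance whether an arbitrary rule table will halt; this particular machine always does.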

All of those things that can be calculated by TMs comprise the class of Turing
computable functions. It is a tedious exercise but it can be shown that the primitive
functions of recursive function theory are Turing computable and that Turing
computable functions are closed under substitution, recursion and minimization (this is
done in Davis, 1958). So Church's hypothesis can be restated as "the intuitively and
informally defined class of effectively computable functions are exactly the Turing
computable functions." That is, anything that can be computed can be computed by a
TM. If the hypothesis is accepted (as it is by most mathematicians and computer
scientists) the TM becomes a powerful theoretical tool for investigating both the things
that can be computed and the nature of and the constraints placed upon machines that
would do the computing.

States: Q, R, Accept
Symbols: 1, 0, B

Actions:
If in state Q seeing a 1, move forward and go to state Q
If in state Q seeing a 0, go to state R
If in state Q seeing a B (blank), go to state Accept
If in state R, write a 1 and go to state Q

[Diagram: a tape feeding forward through the Turing machine, shown in current state Q and
currently reading 0, with action None and next state R.]

Figure 2.1: A Turing machine

It can be shown that the Turing computable functions (or programs) can be listed in a
particular order and used to generate an index number that uniquely encodes that
program37 . If a TM is constructed with the ability to decode the index and run the
resulting program it can, in effect, run any effectively computable program. Such a
machine is termed a Universal Turing machine. A universal TM need be only slightly
more complex than a TM that computes only one function. If the number of tape
symbols is not restricted then one tape and three states are sufficient; alternatively,
universal TMs are known to exist that have one tape, five states and seven tape symbols.
The present day computer is the practical equivalent of such a TM38 although obviously
all the programs that are possible to write have not been written. The results that apply
to TMs then apply to computers. Further, Church's hypothesis can be modified to say
that "the intuitively and informally defined class of effectively computable functions are
exactly the things that can be calculated by a computer." Some interesting consequences
concerning computers can be inferred from the results obtained from the mathematical
model of a TM. It can be shown39 that the standard, one read-head, one tape TM
described above is equivalent to other more elaborate versions that, at first glance,
might seem more powerful. For instance, no matter how many tapes a TM has40, so
long as they are finite in number, it can do no more than the single tape version. This is
not too surprising since computers that operate on multiple programs (e.g. time sharing
computers) or that treat data and programs separately or that access several different
kinds of memory are quite common. However, it can also be shown that multiple
read-head41, multiple tape TMs have no computability advantage over the single head,
single tape version. The various distributed or parallel processor computers that are
now becoming available represent computer versions of a multi-head, multi-tape TM.
They too can do no more than a single headed, single tape TM (although they do it more
quickly and perhaps in a manner more easily conceptualized because of the
modularization imposed on those systems by their structure). This indicates that any
advantage of a neural network approach to artificial intelligence42 resides in the abstract
model and not the hardware implementation. It also provides a counter to the
argument that the parallel nature of the computation that takes place in the human
mind makes it impossible to implement intelligence on single processor computers. So
long as speed is not a factor, if intelligence can be implemented on a multiple processor
machine it can be implemented on a single processor machine (but we do maintain that
speed is a crucial factor, see section 1.11).

37 See for instance Cutland, 1980, chapter four.
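The claim that programs can be listed in order and given index numbers rests on the fact that finite strings over a finite alphabet can be numbered without gaps or repeats. The following is a minimal sketch of one such numbering in Python (bijective base-k numeration). It is not the construction used by Cutland: it numbers every string over a hypothetical three-symbol tape alphabet, whether or not the string is a meaningful program, and the names ALPHABET, index_of and program_at are invented for the illustration.

```python
ALPHABET = "01B"  # hypothetical tape alphabet; any finite alphabet would do


def index_of(program):
    """Map a string over ALPHABET to a unique natural number."""
    k = len(ALPHABET)
    n = 0
    for ch in program:
        # each symbol acts as a digit in the range 1..k, so no two strings collide
        n = n * k + ALPHABET.index(ch) + 1
    return n


def program_at(n):
    """Inverse of index_of: recover the string whose index is n."""
    k = len(ALPHABET)
    chars = []
    while n > 0:
        n, digit = divmod(n - 1, k)
        chars.append(ALPHABET[digit])
    return "".join(reversed(chars))


# The first few strings in the listing, starting with the empty string at index 0.
print([program_at(n) for n in range(6)])
```

A universal machine in the sense described above would take such an index, decode it with something like program_at, and then interpret the resulting string as a rule table.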

An alternative way to view computable functions is in terms of the possible contents of
the tape that the TM processes, i.e. the programs. The set of different strings of symbols
that a TM can operate on is termed the language of that TM. Because they are finite and
can be put in one-to-one correspondence with the natural numbers all of the languages
of all of the different TMs are collectively termed the recursively enumerable languages
(r.e.). When a TM is run on an arbitrary string to test whether that string belongs to an
r.e. language, it can happen that the machine
will never halt (a program that enters an infinite loop is a good example). In many cases
that is an undesirable trait. For that reason it is convenient to distinguish the subset of the
r.e. languages for which at least one TM always halts. These languages are termed the
recursive languages. They are of obvious interest for those engaged in the practical use
of computers. They are just the decidable effective procedures that Church was
investigating in 1936. So three broad classes of languages (hence functions) exist: those
that are decidable (the recursive functions), those that are semi-decidable in that the TM
that represents them will halt on the proper input but may not halt otherwise (the
recursively enumerable but not recursive functions), and the largest class, the
undecidable (non recursively enumerable functions) for which no TM exists.

38 That is, if the computer is provided with compilers to decode and run programs that are
written in languages that can permit the expression of partial recursive functions. Computers
execute in one step what would require many steps on a TM, but they do nothing that cannot
be done by a TM.

39 The proofs of the results mentioned concerning Turing machines are produced in Hopcroft
and Ullman (1979).

40 Assuming that the read-head may switch between tapes in some specified manner.

41 By which it is meant that the Turing machine can operate simultaneously on as many tapes
as it has read-heads.

42 A neural network approach is one in which the knowledge in the system resides in the
connections of a network. It usually champions an architecture characterized by the elaborate
interconnection of many small computers.
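The difference between a decidable and a semi-decidable question can be felt in a few lines of code. The sketch below, in Python, shows the characteristic behavior of a semi-decision procedure: it searches the natural numbers for a witness, answers yes when one is found, and otherwise never returns. The function names and the optional max_steps budget are assumptions made for the illustration. The example property (being a perfect square) is in fact decidable by a bounded search; it is used here only because it makes the halting behavior easy to see.

```python
from itertools import count


def semi_decide(is_witness, max_steps=None):
    """Search 0, 1, 2, ... for a witness.

    Returns True as soon as one is found. With no budget the search runs
    forever when no witness exists; with a budget it gives up and returns
    None, meaning "no answer yet", which is not the same as "no".
    """
    candidates = count() if max_steps is None else range(max_steps)
    for w in candidates:
        if is_witness(w):
            return True
    return None


def is_square_root_of(x):
    """Return a test that recognizes a natural number w with w * w == x."""
    return lambda w: w * w == x


print(semi_decide(is_square_root_of(49)))                  # halts with an answer
print(semi_decide(is_square_root_of(50), max_steps=1000))  # gives up, no answer
```

Calling semi_decide(is_square_root_of(50)) with no budget would loop indefinitely, which is the situation described above for a TM applied to input outside its language.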

Describing a set of strings of symbols (or programs) as a language implies that a
grammar exists that describes the structure of those strings. A grammar, in its
mathematical sense, is a set of variable names together with the symbols of the
language and a set of production rules that describe the allowable ways in which the
symbols may be strung together. As might be anticipated, the kind of
grammar/language (hence function and TM) is distinguished from others by the form
that its production rules take. Generally speaking, production rules are just rules of
substitution. They are commonly represented by a string of symbols and variables
followed by an arrow followed by another string of variables and symbols, e.g. gal →
gbl. This means that wherever gal appears as part of the right hand side of any other
production (or string of symbols/variables to which we are applying the grammar as a
test) it can be replaced by gbl. In this particular case the effect is that a gets replaced by
b wherever a appears between g and l, so we can say gal produces gbl. A grammar
will contain many such productions all culminating (or commencing depending upon
your point of view) in some initial production S → w where w consists of symbols and
variables defined elsewhere in the grammar. Any string in the language may be
generated by starting from S and using the productions to make substitutions until all of
the variables have been eliminated. Conversely, the opposite process can be undertaken
and any string of symbols can be tested to see if it belongs to the language. This process,
called parsing, uses the productions of the associated grammar to repeatedly substitute
variables for symbols or strings of symbols in the string being tested. If the process can
be continued until the whole string is reduced to S then the string is in the language.
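Both directions of this process, generating strings from S and reducing a string back to S, can be sketched in Python. The toy grammar below (S → aSb, S → ab) and the function names generate and parses are assumptions made for illustration; they are not taken from the text. The length cut-off in generate is valid here only because none of these productions shortens a string.

```python
from collections import deque

# Hypothetical toy grammar: S -> aSb | ab, which generates a^n b^n.
PRODUCTIONS = {"S": ["aSb", "ab"]}

def generate(productions, start="S", max_len=6):
    """Derive strings by substituting for the leftmost variable until none remain."""
    language, queue = set(), deque([start])
    while queue:
        form = queue.popleft()
        idx = next((i for i, ch in enumerate(form) if ch in productions), None)
        if idx is None:
            language.add(form)          # no variables left: a string of the language
            continue
        for rhs in productions[form[idx]]:
            new_form = form[:idx] + rhs + form[idx + 1:]
            if len(new_form) <= max_len:
                queue.append(new_form)
    return sorted(language, key=len)

def parses(string, productions, start="S"):
    """Parse by reverse substitution: replace right hand sides by their variable."""
    seen, stack = set(), [string]
    while stack:
        form = stack.pop()
        if form == start:
            return True                 # the whole string has been reduced to S
        if form in seen:
            continue
        seen.add(form)
        for var, rhss in productions.items():
            for rhs in rhss:
                pos = form.find(rhs)
                while pos != -1:
                    stack.append(form[:pos] + var + form[pos + len(rhs):])
                    pos = form.find(rhs, pos + 1)
    return False
```

With this grammar, generate(PRODUCTIONS) yields ab, aabb and aaabbb, and parses("aabb", PRODUCTIONS) succeeds while parses("abab", PRODUCTIONS) fails.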

The recursively enumerable languages are generated by the unrestricted, semi-Thue,
or type 0 grammars, which are characterized by productions of the form f → j where both
f and j may be any string of variables and/or symbols. The significance of this is that
anything that is computable may be represented by a grammar (this will be an
important point for the implementation scheme proposed in part three). It is not
surprising, however, that the grammars for recursive languages, representing as they do
programs with the nice characteristic of being guaranteed to halt, have been more
thoroughly investigated.

There are several categories of grammars associated with the recursive languages.
Perhaps the most studied of these are the context free grammars (CFGs) and their
associated context free languages (CFLs). CFLs are context free in the sense that the
left hand side of each production consists of exactly one variable. If other
symbols or variables were allowed to accompany it they would provide a limiting
context in which the substitution would have to take place. That is, the production rules
of CFGs can be put into the form A → aB1...Bk, or simply A → a, where 'A' and the 'Bi' are
variables and 'a' is a symbol of the corresponding CFL (called a terminal of the grammar
since no substitutions can be made for a symbol during a derivation; this is because no
symbol may appear on the left hand side of a production). Here the production means that
whenever 'A' occurs in the right hand side of other productions or in strings being
tested for membership in the CFL, 'aB1...Bk' may be freely substituted. A special kind of
machine, less powerful than a TM, called a pushdown automaton (PDA), is sufficient to
execute programs that fall into the CFL category. If a CFG is further restricted so that
the same variable may not appear on the left hand side of more than one production
then it is called a deterministic context free grammar (DCFG) and the language
generated is called a deterministic context free language (DCFL). Parsing a string of a
DCFL is less complicated because no choices have to be made concerning which
variable to substitute for the sub-strings of symbols/variables generated during the
parse. There will always be just one choice for any particular sub-string (although the
sequence in which the substitutions into different parts of the whole string are made can
still cause problems). As a consequence it is easier to write a parsing program and the
resulting program is faster. Further, in the structure of a CFG there is nothing to prevent
there being more than one way to parse a string; in other words there can be ambiguity,
since different meanings can be associated with the different parses. Since each parse in a
DCFL is unique, most programming languages are contrived to be DCFLs. Their compilers
(programs that convert the programming language into a language that may be
executed directly by the computer) can take advantage of that fact. CFLs are type 2
languages.
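Two of the points above lend themselves to a short Python sketch: recognition with a single stack, in the manner of a pushdown automaton that never has a choice of moves, and the ambiguity that an unrestricted CFG permits. The bracket language, the grammar E → E+E | a and the function names are all assumptions chosen for illustration.

```python
from functools import lru_cache

BRACKETS = {"(": ")", "[": "]"}

def balanced(string):
    """Stack-based recognition of nested brackets: one possible move per symbol."""
    stack = []
    for ch in string:
        if ch in BRACKETS:
            stack.append(BRACKETS[ch])           # push the closer now expected
        elif not stack or stack.pop() != ch:     # missing or mismatched closer
            return False
    return not stack                             # accept only with an empty stack

def count_parses(tokens):
    """Count the parse trees of tokens under the ambiguous grammar E -> E+E | a."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def ways(i, j):
        # Number of ways to derive tokens[i:j] from E.
        if j - i == 1:
            return 1 if tokens[i] == "a" else 0
        return sum(ways(i, k) * ways(k + 1, j)
                   for k in range(i + 1, j - 1) if tokens[k] == "+")

    return ways(0, len(tokens)) if tokens else 0
```

Here count_parses("a+a+a") is 2, one parse for each way of grouping the additions, which is the kind of ambiguity a language designer tries to exclude.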

Other grammars that produce languages that may be shown to be recursive are the
context sensitive grammars (CSGs). These grammars are characterized in the same
manner as the unrestricted grammars (i.e. f → j where both f and j may be any string
of variables and/or symbols) except that the length of f is constrained to be less than or
equal to the length of j. For instance a production of a CSG might look like ga → gbl.
The context sensitive languages (CSLs) associated with the CSGs represent programs
that may be run on a restricted form of a TM called a linear bounded automaton (LBA).
An LBA is simply a TM that operates with a tape that is only as long as its input string
(or some linear function of the length of the input string). CSLs are type 1 languages. The
language types form a hierarchy in which type43 3 ⊂ type 2 ⊂ type 1 ⊂ type 0. This is
called the Chomsky hierarchy after the linguist Noam Chomsky who first suggested it
as a possible model for natural languages. The lower in the hierarchy (closer to type 3)
the easier it is to write programs that perform language recognition and generation. All
natural languages fit in no higher than type 1 in the hierarchy. Most computer
languages and even natural language recognizers fit in no higher than type 2.
Obviously it would be advantageous to treat natural languages as being at the type 1
level; in fact, it would be required if we are to have practical computer implementations
of natural languages. It is very difficult to even describe a language in the type 0
category. The standard example is the language recognized by a universal TM. What it
would mean for programs to operate at the type 0 level and whether to do so would be
necessary or advantageous (particularly in implementing machine intelligence, the
subject at hand) is not known. Humans arguably can describe everything they think and
do using natural language; that would seem to indicate that the type 1 level is sufficient
for modeling the human mind.
43
Type 3 languages represent what are called regular sets. They are recognized by
deterministic finite automata (DFAs). DFAs are characterized by making moves based only
on the current state and current input. That is, their moves are independent of any past
moves. A DCFG would be sufficient to describe a DFA.
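At the bottom of the hierarchy, a deterministic finite automaton is simple enough to simulate in a few lines. The following Python sketch assumes a transition table keyed by (state, symbol); the function name run_dfa and the example machine, which accepts binary strings containing an even number of 1s, are hypothetical.

```python
def run_dfa(transitions, start, accepting, string):
    """Simulate a DFA: each move depends only on the current state and symbol."""
    state = start
    for symbol in string:
        state = transitions.get((state, symbol))
        if state is None:           # no move defined: reject
            return False
    return state in accepting

# Hypothetical example: binary strings with an even number of 1s.
EVEN_ONES = {
    ("even", "0"): "even", ("even", "1"): "odd",
    ("odd", "0"): "odd",   ("odd", "1"): "even",
}

accepted = run_dfa(EVEN_ONES, "even", {"even"}, "1001")   # two 1s
rejected = run_dfa(EVEN_ONES, "even", {"even"}, "10")     # one 1
```

The machine keeps no record of past moves beyond its current state, which is why it needs neither a stack nor a tape.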

Even when a language falls into the recursive category so that there is some TM that
will halt for it, there is no guarantee that it will be practically computable. Many
problems have the undesirable characteristic of requiring impossibly large amounts of
memory or time to solve what would seem like a reasonable problem. The classic
problem of this genre is the traveling salesman problem: a traveling salesman must
make a tour of N cities keeping his total mileage under B miles; can he do it? Given a
finite number of cities with every possible pair of cities connected by a road, there is a
finite number of such tours possible. The tours can be enumerated by picking a city in
which to start, then choosing a next city from the remaining N-1 cities, then from the
remaining N-2 cities and so on until there is only one city left; we then return to the
original city. One way to find an answer then is to calculate the length of each tour and
compare it to B. The problem is that as the number of cities grows, the number of tours
grows as the factorial of N. While it may be possible, using the fastest available
twentieth century computers, to solve a problem in the above manner for a small
number of cities, say N equals ten or twelve, a salesman whose territory is all of the
United States can expect no help. The traveling salesman problem is closely related to a
problem in graph theory known as the Hamiltonian Circuit problem, in which the problem is
to find a cycle that includes each of the vertices in the graph exactly once. The difficulty
of this problem does not grow as N! but it does grow at a very fast rate. To investigate the feasibility of
solving such problems they are converted into comparable deterministic and
non-deterministic TMs. The time and space requirements of these machines then serve as the
standard by which the difficulty of different problems can be compared. The problems are
assumed to be stated in a manner that allows a yes or no answer, e.g. instead of asking
"what is the shortest tour the traveling salesman can make?", the variable B is
introduced and it is asked "does there exist a tour shorter than B?" (we are trying to
discover the difficulty of a kind of problem rather than discover solutions for a
particular problem). A particular candidate problem (for instance a traveling salesman
problem in which a set of cities and a B is given) is termed an instance of the problem.
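The enumeration just described can be written out directly. The Python sketch below answers the yes or no form of the question for one small instance; the distance table DIST and the function names are assumptions for illustration, and the start city is fixed so that (N-1)! orderings are examined.

```python
from itertools import permutations
from math import factorial

def tour_length(tour, dist):
    """Total length of a closed tour, including the leg back to the start."""
    return sum(dist[a][b] for a, b in zip(tour, tour[1:] + tour[:1]))

def exists_tour_under(dist, bound):
    """Decision version: is there a tour of all cities shorter than bound?"""
    n = len(dist)
    for rest in permutations(range(1, n)):      # city 0 is the fixed start
        if tour_length((0,) + rest, dist) < bound:
            return True
    return False

# Hypothetical instance with four cities and symmetric distances.
DIST = [[0, 2, 9, 10],
        [2, 0, 6, 4],
        [9, 6, 0, 3],
        [10, 4, 3, 0]]

orderings_examined = factorial(len(DIST) - 1)   # 6 orderings for N = 4
```

For this instance the shortest tour has length 18, so the answer is yes for B = 19 and no for B = 18. The same loop for N = 50 would have to examine 49! orderings.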

By deterministic TM (DTM) it is meant that the TM will halt on each input whether or
not it represents an instance of a problem. A non-deterministic TM (NTM) is one that
may not halt in the event that it is fed an invalid problem instance or one that does not
have an affirmative answer. It may appear that an NTM could not be used and should
not be required to solve recursive language problems but this is not the case. Recursive
problems exist for which no DTM exists (Hopcroft and Ullman 1979 page 228).
Fortunately, the finite nature of recursive problem instances permits the construction of
an NTM that halts in a non-accepting state when it becomes apparent that no answer
will be forthcoming44. The difficulty level of a recursive problem is analyzed
by counting the maximum number of moves that its respective most efficient TM must
make for a problem instance of encoded length n. This represents a machine speed
independent measure of the time that it takes to solve the problem. Because TM tapes
may be expanded or compacted linearly (via encoding schemes) without affecting the
represented problem only non-linear time complexities are considered. Problems are
then observed to fall into the categories (that is, are of the time order complexity) log(n),
p(n), k^n, and n!, in which p(n) stands for a polynomial in the encoded length n of the
instance and k is a natural number (usually two). Problems that are solved within the
constraints of these functions are said to be solvable in log time, polynomial time,
exponential time and factorial time respectively. If a DTM or NTM exists for the
solution of that problem it is modified by the adjective deterministic or non-deterministic
respectively. For example, there are problems that may be solved in deterministic
polynomial time (P) and others that require non-deterministic polynomial time (NP).
Log and polynomial time problems are generally thought to be easy (although for high
order polynomials that might be arguable). Polynomial time problems may be thought
of as included in the NP problems because the deterministic polynomial time problem
can always be embedded in an NP problem and the non-deterministic part ignored. But
the opposite does not seem to be the case. For instance the traveling salesman problem,
while having a solution in NP time, has no known solution in P. Attention then focuses on
the NP problems because these problems seem to sit on the edge of what is computable in a
practical sense. They are problems similar to the traveling salesman problem. When
problems go beyond NP they are considered to be intractable even though they can be
theoretically solvable. In the set of NP problems there are hardest problems, in the sense
that if one of them can be solved in polynomial time then any other NP problem can be
too. Such a problem is termed NP-complete.
44
It can be shown that there are a finite number, X, of possible combinations of NTM states,
head positions and tape contents. Every move of the TM is represented by some such
combination. Whenever such a combination is encountered a second time the machine must
be in a loop. If no accepting (affirmative) state has been encountered by that time, none will
be encountered and the machine can be halted in a non-accepting state. The maximum
number of moves such an NTM can make before repeating is X. For the traveling salesman
problem it can be shown that X is of the exponential order 2^p(N) where p(N) is some
polynomial in the length N of the maximum instance.
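The gap between these time orders is easiest to appreciate with numbers. The short Python sketch below tabulates each order for a few instance lengths; the cubic polynomial, the base k = 2 and the function name growth_table are arbitrary choices made for illustration.

```python
from math import factorial, log2

def growth_table(sizes=(10, 20, 30), k=2, degree=3):
    """Tabulate log, polynomial, exponential and factorial growth for each n."""
    rows = []
    for n in sizes:
        rows.append({
            "n": n,
            "log": log2(n),         # log time
            "poly": n ** degree,    # polynomial time, here n cubed
            "exp": k ** n,          # exponential time, here 2 to the n
            "fact": factorial(n),   # factorial time
        })
    return rows

for row in growth_table():
    print(row)
```

At n = 10 the polynomial and exponential orders are still close (1000 against 1024), but by n = 20 the exponential order has passed a million while the cubic is at 8000.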

For a problem to be NP-complete it is required that it be in NP and that all other
problems in NP be polynomial transformable (reducible) to it. Since transformability is
a transitive property, once one such example has been given it can be used to
establish the NP-completeness of any candidate problem in NP by describing a
polynomial transformation from the NP-complete example to the candidate problem. In
1971 Stephen Cook proved that the satisfiability problem of logic is NP-complete (for a
version of the proof see Garey and Johnson 1979). The satisfiability problem asks
whether for any specified set of clauses in logic (see section 1.9.1 below) there is an
assignment of truth values {T,F} to the variables in the set such that all of the clauses are
satisfied (have a truth value of T). Since Cook's proof hundreds of other problems have
been shown to be NP-complete by exhibiting that the problem is in NP and that the
satisfiability problem, or any one of the other, more recently proven, NP-complete
problems, has a polynomial transformation to it. Some examples of NP-complete problems are:
the acceptance of a string X by a linear bounded automaton, context sensitive language
membership, context-free programmed language membership and regular grammar
inequivalence (i.e. for G1 and G2 both regular grammars do G1 and G2 generate
different languages?). NP-completeness should not be mistakenly associated with
categories of language generation even though the characterization of such languages in
terms of decidability may coincide. NP-completeness does not mean that each instance
of a problem is NP-complete, as is demonstrated in figure 2.2 below. While it may be
true that the problem of determining if an arbitrary string X is accepted by some LBA is
NP-complete, that does not mean that all languages generated by LBAs or problems
solvable by LBAs are NP-complete. But when a problem can be shown to be
NP-complete, it is, given the current state of knowledge, a difficult problem. It is suspected
to be impossible to solve an NP-complete problem in polynomial time (in other words it
is not known whether the class of P problems has an intersection with the class of
NP-complete problems) but this has not been proven. If it should ever be shown that one
NP-complete problem can be solved in polynomial time then, by the nature of
NP-complete problems, all NP-complete problems will be solvable in polynomial time. At
present, however, this seems to be an unlikely possibility. Figure 2.2 shows the
relations of grammars, languages, sets, functions and levels of difficulty.
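The satisfiability problem itself is easy to state in code, even though no polynomial time method for it is known. The Python sketch below simply tries every assignment of truth values, which takes up to 2^n attempts for n variables. The encoding of clauses as lists of signed integers (i for variable i, -i for its negation) and the function name satisfiable are assumptions made for illustration.

```python
from itertools import product

def satisfiable(clauses, num_vars):
    """Exhaustive search for an assignment that satisfies every clause.

    Returns a dict mapping variable number to truth value, or None.
    """
    for values in product([False, True], repeat=num_vars):
        if all(any(values[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses):
            return dict(enumerate(values, start=1))
    return None

# (x1 or x2) and (not x1 or x2) and (not x2 or x3)
example = satisfiable([[1, 2], [-1, 2], [-2, 3]], num_vars=3)

# x1 and (not x1) can never both be satisfied.
contradiction = satisfiable([[1], [-1]], num_vars=1)
```

Checking one proposed assignment against the clauses is quick; it is the number of assignments that grows exponentially.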

[Figure 2.2, diagram not reproduced. Region labels: non-recursively enumerable sets and
functions (not computable, no language); recursively enumerable sets and functions (Turing
machine computable, unrestricted languages, type 0); recursive sets and functions ("every
recursive language will find at least one halting TM in here"); context sensitive languages
(CSL), linear bounded automata (LBA), type 1; context free languages (CFL), push down
automata (PDA), type 2; DCFL; regular sets and functions, finite automata, type 3.
Difficulty labels: Polynomial (P), non-deterministic polynomial (NP), NP-complete, "?".
Axis label: FUNCTIONS / SETS / GRAMMARS / LANGUAGES.]

Figure 2.2: The Chomsky hierarchy and problem difficulty

While NP-complete problems may be very difficult in the general form, that need not be
the case for a particular instance or for a restricted set of parameters. For instance it is
not difficult to determine whether a given route that a traveling salesman might take is
shorter than some given value. Even problems more difficult than NP can be solved if
appropriate restrictions are observed: while it may be impossible to calculate all of the
prime numbers, a candidate prime can usually be successfully tested for primality.
And while we can't build a universal TM complete with its tape representing everything
computable, we can and do build computers complete with the subset of languages and
programs that we find useful.
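Both restricted tasks mentioned here are short programs. The Python sketch below checks a single given route against a bound and tests one candidate for primality by trial division; the function names and the three-city distance table are assumptions made for illustration, and trial division is only one of several possible tests.

```python
def tour_within_bound(tour, dist, bound):
    """Check one given route: a single pass over its legs, back to the start."""
    legs = zip(tour, tour[1:] + tour[:1])
    return sum(dist[a][b] for a, b in legs) < bound

def is_prime(n):
    """Test a single candidate by trial division up to its square root."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

# Hypothetical three-city instance; the route 0, 1, 2 has length 3 + 5 + 4 = 12.
DIST3 = [[0, 3, 4],
         [3, 0, 5],
         [4, 5, 0]]
route_ok = tour_within_bound([0, 1, 2], DIST3, bound=13)
```

Neither function searches over all routes or all numbers; each answers a question about one candidate.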

2.3 Toward a science of cognition

The Second World War in the first half of the twentieth century brought with it the
application of new scientific discoveries to weapons research. Three developments were
to have great impact on philosophical ideas about the mind. The first was the invention
of electronic feedback control for the aiming of guns. This led to the development of
systems theory. The second was the development of cipher decoding machines and
trajectory calculating machines that were among the first electronic computers. The
study of these machines and their capabilities became computer science. The third was
the development, by Claude Shannon, of the physical concepts of entropy applied to
information, or communications theory45 . The three developments together with
elements of biology and psychology were combined by Norbert Wiener into a discipline
he termed Cybernetics, or the study of mechanisms of control and communications in
animals and machines. The proponents of Cybernetics were perhaps too exuberant and
premature in their projections of the likely fruit of the new science. Then too the subject
matter was too diverse to be easily mastered by an individual or even a group. Since
that ambitious beginning the various disciplines of that science have tended to go their
own ways. Only recently have they recombined to some extent in a new science:
cognitive science. In cognitive science the emphasis is on the psychological aspects of
cognition as they might be modeled in a machine rather than the engineering of
automata.

Hilary Putnam (Putnam 1960) proposed that the physical activities of the brain could be
described at a functional level that would adequately model the mind. This provided a
level of abstraction at which to study the human mind. Previously, it was thought that
the mind could be studied only at the physical brain level or at the black box
(interaction with the environment) level. The psychological school of behaviorism,
founded on the works of I. V. Pavlov and propounded by B. F. Skinner, put forward the
black box theory, but ran into difficulty in applying it to many aspects of human
behavior. Functionalism, as Putnam called his theory, offered an abstract model that
avoided the reliance on behaviorist theories, and the difficult and slow progress of
neuro-biology while maintaining a physicalist approach. On the (non abstract) physical
level the brain is composed of the bio-chemical stuff of the brain. This is the level at
which the neuro-biologist works and is the level analogous to the hardware level of a
computer. But the brain may be modeled at greater and greater levels of abstraction.
The function of the various parts of the brain, the thought processes and knowledge of
the brain, inherent or learned, and other levels analogous to the software and firmware
of computers may be distinguished. Further, the theory implies that many physical
states of the brain might map to one mental state (e.g. many different thought processes
might lead to a belief in a particular idea or a desire of some sort). It also implies that
many different physical implementations of a brain might possess physical states that
map to equivalent abstract mental states of other physically realized systems. So, for
instance, a human mind and a computer program could be functionally equivalent at an
appropriate level of abstraction. Functionalism was adopted as a general principle of
cognitive science and the computer became the primary tool of investigation for
cognitive scientists.
45
Actually Claude Shannon presented his theory in a paper "A Mathematical Theory of
Communication" in 1948, three years after the end of the second world war.
The mind is then viewed as an information processing system. The physical operations
of the brain may be understood by considering its activities as the processing of strings
of symbols that are intentional (to which meanings are ascribed), that is, information.
This is also known as the representational theory of the mind. It raises many questions,
two of which harken back to the old questions raised by Hume and Kant: "How much of
the processing capability and knowledge structure of the brain is
innate?" and "How does that mental structure, innate or learned, relate to the real
world?" A third question deals with how any model of mind depends upon the
environment. Is it possible to study the mind and understand its processes by restricting
the study to the mind itself (knowledge of the world can be assumed already acquired...
such a research strategy is termed a mentalist approach), or must interaction of the mind
and environment also be modeled? These questions are discussed in the literature and
various answers are proposed and disputed. Most of such discussion is in linguistic
terms because language is the standard way for humans to formulate and exchange
information and is assumed by many to be a requirement for intelligent thought. The
representation hypothesis implies a propositional account of knowledge.

Jerry Fodor, a psychologist of language, believes that much of the human capability in
language is innate (Fodor). He advocates an organization along lines suggested by
Joseph Gall (1758-1828): a modularity of mind. That is, the mind possesses modules to
process information according to type, not subject matter, and these processing
capabilities are constrained. Fodor says input and output modules exist much as Gall
implied, but Gall never answered the questions: what coordinates the activities of the
modules? How do the modules go together to make up a mind? Fodor suggests the
answer to the question is another module, but one that operates differently than the
input/output modules. Fodor presents the following as symptoms of modularity.
Modules are:

1. not controllable. They involuntarily apply to inputs.
2. fast (100-200 milliseconds); that is, studied thought is not an indication of modularity.
3. computationally shallow (e.g. in speech recognition it would give you the kind of sentence
heard but not the meaning of what was said).
4. associated with fixed neural architecture (e.g. speech or visual area of brain). No modularity
is proposed for thought in general.
5. subject to characteristic breakdowns (because of 4).
6. such that they exhibit characteristic development independent from the development of
intelligence at large.

Noam Chomsky is a linguist who also believes that the language facility in a human
represents an innate system. He hypothesizes an innate deep structure language that,
unlike ordinary learned languages, is unambiguous and whose every construct maps to
constructs in a surface language, say English. Such theories are particularly attractive to
those attempting to create natural language interfaces or language translators for
computer systems. If the deep structure could be modeled then it would be a relatively
straightforward process to create such programs. In its crudest form this theory implies
what is termed the message model of human communication. That is, a person desiring to
communicate a message to another person formulates the message in the underlying
deep language. The message is then coded into the particular surface language that the
person uses. The message is spoken to the intended receiver. The receiver reverses the
process, decoding the message into the underlying language at which the meaning is
unambiguously understood.

Both Fodor and Chomsky then are taking an essentially mentalist approach, downplaying
the role of the environment and emphasizing the innate structures of the brain.
But Hilary Putnam has recently (Putnam 1988) objected to the mentalist approach. He
argues that there are three reasons it is wrong: 1) meaning is dependent on
environmental forces that can revise the meanings in the mind at any time, that is,
meaning is holistic in nature; 2) meaning is normative in that a context change may change
significance, e.g. Roger Bannister's record of a four minute mile will always be "Roger
Bannister's record of a four minute mile" but it has a different significance in a world of
three minute and fifty-odd second miles; and 3) the mentalist approach cannot
explain the ability of the mind to deal with a civilization not in existence when it first
evolved.

Nor can the mentalist approach explain the inability of the mind to handle concepts that
describe real features of the universe, except as abstract mathematical objects. Here is
what Werner Heisenberg has to say about understanding principles in quantum
mechanics (Heisenberg 1949):

"It is not surprising that our language should be incapable of describing the
processes occurring within atoms, for, as has been remarked, it was invented
to describe the experiences of daily life, and these consist only of processes
involving exceedingly large numbers of atoms. Furthermore, it is very difficult
to modify our language so that it will be able to describe these atomic
processes, for words can only describe things for which we can form a mental
picture, and this ability too is a result of daily experience. Fortunately,
mathematics is not subject to this limitation, and it has been possible to invent
a mathematical scheme - the quantum theory - which seems entirely adequate
for the treatment of atomic processes; for visualization, however, we must
content ourselves with two incomplete analogies - the wave picture and the
corpuscular picture."

So with respect to understanding the behavior of atoms our language fails us and
our ability to visualize fails us. Only our ability to formally and abstractly manipulate
symbols saves us from remaining ignorant about the behavior of atomic processes.
Aside from the uniqueness of the application of mathematics by the human mind (an
idea treated separately above), Heisenberg points out that language fails in the
description of atomic activity because description is necessarily tied to mental imagery
and that capacity is tied to our experience in our environment. The argument is one of
contradiction: if the brain contains a portion innately in tune with the universe, why
does it fail us as soon as the object of understanding is beyond being described in terms
of experience? Coincidence? The illusion of the innateness of language might be
explained by the fact that language is hierarchically ordered (described by a grammar).
That fact is in accord with the order in general of physical and information systems.
But when language is studied in isolation there is the illusion that it is constructed from
primitives that are provided by the brain/mind. We will use the fact of the hierarchical
order of language (grammar) and its identification with the mind as part of the
implementation scheme in part three. Experience is unique to an individual, so the
viewpoint in cognitive science that advocates that mental states be described in
conjunction with the environmental states that give rise to them is termed naturalistic
individualism. It is a view well defended by Zenon W. Pylyshyn (Pylyshyn 1985) and is
supported by the hypothesis in section 1.12.

2.4 Logic systems

Schemes for the creation of artificially intelligent programs are usually founded, or at
least depend to a large extent, on logical processes46. The computer implementations of
such systems represent attempts to realize the universal logical language of Leibniz or
Bertrand Russell's scheme to reduce knowledge to mathematical statements. Logic, as
described in the first order predicate calculus47, consists of a set of symbolic objects (or
variables representing such objects) together with symbolic functions and relations of
those objects or variables. The objects (e.g. Bob, x, apple), functions (e.g. Age(x)) and
relations (e.g. Father(Bob, x))48 may be combined into sentences using rules for
quantification (e.g. ∀49 x Age(x), ∃50 x Father(Bob,x)), conjunction (e.g. Father(Bob,Tom)
∧51 Father(Bob,Bill)), disjunction (e.g. Father(Bob,Tom) ∨52 Father(Bob,Bill)), negation
(e.g. ¬53 Father(Bob,Tom)) and implication (e.g. Father(Bob,x) ⇒54 married(Bob)). A
conjunction of sentences is usually presented by listing each sentence on a separate line.
The objects, functions and relations are constructed in a manner that allows them to be
interpreted as representing the relations and functions of objects in the real world. A
logic database is a set of sentences that are consistent (give rise to no contradictions)
under the interpretation that is to be imposed upon them. This requirement is
sometimes made more strict, requiring that the set of sentences be consistent under any
interpretation that can be placed on them. Logic includes rules of inference (or laws of
logic) that provide for deduction of theorems, i.e. true sentences, from a database. So,
given a database, rules of inference and a methodical way to apply those rules, a
computer program can be constructed to test the truth of postulated sentences or to
determine the object values that, when substituted for object variables, can make a
sentence true. The application of logic to reasoning computer systems has been very
successful, particularly in expert systems, production systems and in the design of some
programming languages. These systems can be distinguished by the different inference
schemes they use and by the kind and extent to which other, non-logic facilities are
provided.
46
This discussion of logic based systems refers to the design and implementation of software
logic facilities rather than the use of logic at the hardware level. Digital computers, at the
hardware level, operate largely if not exclusively by rules of logic. For a more formal and
complete discussion of logic systems and the part they play in AI systems see Logical
Foundations of Artificial Intelligence (Genesereth and Nilsson, 1987) from which this
review is largely taken.
47
The predicates referred to are simply the symbolic representation of the predicate part of
simple sentences that consist of subject(s), verb and direct object(s) from the universe of
discourse. Thus likes(Bob, Mary) represents "Bob likes Mary." The calculus is 'first order' in
the sense that it allows variable objects, i.e. x might be allowed to stand for Bob or Mary or
any other object. A logical calculus that permitted variable relations (or predicates) would be
a second order predicate calculus.
48
Read "Bob is the father of someone."
49
For all.
50
There exists.
51
And.
52
Or.
53
Not.
54
Implies.
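As a rough illustration of testing the truth of a sentence and of finding the object values that make a sentence true, the Python sketch below stores a tiny database as a set of tuples and evaluates the connectives and quantifiers over a finite universe of discourse. Everything here is an assumption made for illustration: the facts, the helper names, and in particular the rule that any relation instance not listed in the database is treated as false.

```python
# Hypothetical universe of discourse and database of ground facts.
DOMAIN = {"Bob", "Tom", "Bill"}
FACTS = {
    ("Father", "Bob", "Tom"),
    ("Father", "Bob", "Bill"),
    ("married", "Bob"),
}

def holds(relation, *args):
    """A relation instance is true only if it is listed (an assumption)."""
    return (relation, *args) in FACTS

def exists(pred):
    """There exists an x in the domain for which pred(x) is true."""
    return any(pred(x) for x in DOMAIN)

def forall(pred):
    """pred(x) is true for all x in the domain."""
    return all(pred(x) for x in DOMAIN)

def implies(p, q):
    """Material implication."""
    return (not p) or q

def bindings(relation, first):
    """Object values x that make relation(first, x) true."""
    return sorted(x for x in DOMAIN if holds(relation, first, x))

someone = exists(lambda x: holds("Father", "Bob", x))
both = holds("Father", "Bob", "Tom") and holds("Father", "Bob", "Bill")
rule = forall(lambda x: implies(holds("Father", "Bob", x), holds("married", "Bob")))
children = bindings("Father", "Bob")
```

A full first order system would also need functions, nested quantifiers and rules of inference; this sketch only evaluates sentences against stored facts.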

One inference scheme is to traverse the set of sentences in a methodical manner using
one or more of the inference rules provided by logic to generate additional sentences
that are then added back to the part of the database that has not yet been traversed. The
process is repeated. In an appropriately constructed database the process will
eventually stop as no new sentences are generated. The sentence in question will either
be in the final set of sentences (and thus be affirmed as provable from the original set of
sentences) or it will not be generated. Various schemes to limit the search can be
applied. Another inference scheme restricts the form of the sentences, for instance to
implications and atomic (or literal)55 sentences. To test the truth of a new sentence the
atomic sentences are checked. If the new sentence can be shown to be true by the rules
of inference the process is finished, otherwise the antecedents of the implications whose
consequents would provide the needed proof are added to the sentence(s) to be proved.
The process is repeated on the newly augmented set of sentences to be proved. When
one of those sentences is proved its consequence is added to the facts in the database
and it is removed from the set of sentences to be proved. Eventually the original
sentence will be proved or the database will be exhausted. The utility of these kinds of
schemes lies in the ease with which they may be augmented by other heuristic or non-
logical mechanisms to restrict the search and the fact that the consistency of the
database is not always an absolute requirement for them to function. These inference
schemes, however, are incomplete in that there is no guarantee that everything implied by
the database will have been generated or tested for, and, in inappropriately
constructed databases, the processes may not stop. Various schemes may be
implemented to provide a measure of completeness; the most important of these is the
resolution process.
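
The two inference schemes just described can be sketched in a few lines. The sketch below is only an illustration, not the implementation of any particular system: it assumes that sentences are restricted to variable-free atoms (plain strings) and to implications written as (antecedents, consequent) pairs, and the function names forward_chain and backward_chain are our own.

```python
def forward_chain(facts, rules):
    """First scheme: keep generating new sentences until none are produced."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for antecedents, consequent in rules:
            if consequent not in known and all(a in known for a in antecedents):
                known.add(consequent)  # add the new sentence back to the database
                changed = True
    return known


def backward_chain(goal, facts, rules, seen=frozenset()):
    """Second scheme: replace a goal by the antecedents that would prove it."""
    if goal in facts:
        return True
    if goal in seen:  # guard against circular rules, which would never stop
        return False
    for antecedents, consequent in rules:
        if consequent == goal and all(
            backward_chain(a, facts, rules, seen | {goal}) for a in antecedents
        ):
            return True
    return False


# Hypothetical database used only for illustration.
facts = {"have_money(Bob)", "like_movie_playing(Bob)"}
rules = [(("have_money(Bob)", "like_movie_playing(Bob)"), "go_to_movies(Bob)")]

closure = forward_chain(facts, rules)
print(sorted(closure))
print(backward_chain("go_to_movies(Bob)", facts, rules))
```

The guard on repeated goals in backward_chain is one example of the non-logical mechanisms that such schemes readily accept; without it a circular rule set would cause the process to run forever.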

The possibility of using a sound inference process on computers was greatly facilitated
when John Robinson (Robinson, 1965) developed a process by which a body of
sentences could be reduced to simple clausal form56, then unified (a process in which
implications are made explicit and non-redundant by a process of substituting the
objects in the universe of discourse57 into the variables in the clauses), and resolved (i.e.
in the set of unified clauses, consistency is mechanically checked by showing that for all
55
A literal or an atomic sentence is an n–ary relation. For instance 'like' is a binary relation. In
first order predicate calculus "John likes Mary" would be written likes(John, Mary).
56
Clausal form consists of a disjunction of atoms or literals e.g. "Bob paid or Bill paid or
Tom paid or someone washed dishes." This is written as a list of literals in brackets using the
first order predicate calculus notation, i.e. {paid(Bob), paid(Bill), paid(Tom),
washed_dishes(x)}
57
The set of objects referred to in the set of sentences.
P it cannot be inferred that both {P} and {¬P}). In the event that there are contradictions
explicit or implicit in the set of clauses the resolution process will at some point produce
an empty clause (e.g. if both {P} and {¬P} are present or derivable from the set of clauses
then {}, the empty clause, will be produced by the resolution process). Given a
consistent database, the truth of a proposed clause concerning some object contained in
the universe of discourse may be tested. First the negation of the clause to be tested is
added to the original consistent set of clauses. The resolution process is run on the
newly augmented set. If the truth of the new clause was implicit in the old set, it will
resolve with the newly added, negated version to produce an empty clause. This affirms
the truth of the new clause (i.e. it shows that it had to already exist in or was already
implied by the original set). With this algorithm, and given a finite universe of
discourse, it is a simple iterative process to produce a list of all of the candidate objects
that would satisfy some new clause that contains a variable or variables (simply try all
the combinations of objects and save the clauses for which the resolution process
produces the empty clause). The resolution process is made more efficient by reducing
the clauses to a special form called a Horn clause. Horn clauses have at most one
positive literal. Fortunately the simple and common kinds of sentences, assertions or
rules that people use to describe the world are easily reduced to the Horn clause form.
Consider the assertion "if a person has the money and he likes the movie playing then
he'll go to the movies." To create a Horn clause we can take the following steps: restate
the assertion in a negative manner without changing its meaning as "if it is not the case
that a person doesn't have the money or he doesn't like the movie playing then he'll go
to the movies." Notice that the conjunction in the antecedent has become a disjunction.
The if-then clause can also be reduced to a disjunction by an inference rule of logic that
permits a sentence of the form "if x then y" (x ⇒ y) to be stated as "¬x ∨ y". So our rule
becomes "¬ (it is not the case that a person doesn't have the money or he doesn't like the
movie playing) ∨ (he'll go to the movies)." The truth of this clause will not be changed
by removing the '¬' from the first half of the newly created 'or' clause if at the same time
we remove the 'it is not the case' from within the parentheses to obtain "(a person
doesn't have the money or he doesn't like the movie playing) ∨ (he'll go to the movies)".
We can then restate the clause, using predicate calculus notation as {¬have_money(x),
¬like_movie_playing(x), go_to_movies(x)} which is a Horn clause since it only has one
positive literal. This sentence can be efficiently resolved with other Horn clauses. So, if
the other clauses in the database include the clauses {have_money(Bob)} and
{like_movie_playing(Bob)} then the test of the clause {go_to_movies(Bob)} would
produce the empty clause and be affirmed as true indicating that Bob will go to the
movies. It is easy to arrange a program that allows the clause {go_to_movies(x)} to be
tested for the various possible 'x's. In this case the process would discover 'Bob' as an
object that produces an empty clause. These are essentially the tactics employed by the
programming language Prolog (although Prolog does not provide facilities for the
conversion of common English into clauses in the predicate calculus and it does offer a
greater range of programming facilities than the resolution of a set of Horn clauses). For
such systems to work they must be consistent. That requirement makes it difficult to
construct large systems or systems that can grow automatically and that are amenable
to the resolution process. This is discussed below.
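
The refutation test described above can be illustrated with a small sketch. It is not Prolog's actual mechanism: as a simplifying assumption, unification is replaced by substituting every object of a small, finite universe into the rule, literals are plain strings with a leading "~" standing for ¬, and the names negate, resolvents, refutes and entails are our own. The object 'Bill' is added to the universe purely for illustration.

```python
from itertools import combinations


def negate(literal):
    """Return the complementary literal ("~P" for "P" and vice versa)."""
    return literal[1:] if literal.startswith("~") else "~" + literal


def resolvents(c1, c2):
    """All clauses obtained by resolving c1 with c2 on one complementary pair."""
    return [
        frozenset((c1 - {lit}) | (c2 - {negate(lit)}))
        for lit in c1
        if negate(lit) in c2
    ]


def refutes(clauses):
    """True if the empty clause can be derived, i.e. the set is inconsistent."""
    clauses = set(clauses)
    while True:
        new = set()
        for c1, c2 in combinations(clauses, 2):
            for r in resolvents(c1, c2):
                if not r:  # the empty clause {}
                    return True
                new.add(r)
        if new <= clauses:  # nothing new was generated
            return False
        clauses |= new


def entails(database, atom):
    """Add the negated atom to the database and look for the empty clause."""
    return refutes(set(database) | {frozenset({negate(atom)})})


universe = ["Bob", "Bill"]
database = {frozenset({"have_money(Bob)"}), frozenset({"like_movie_playing(Bob)"})}
for x in universe:
    # The Horn clause {¬have_money(x), ¬like_movie_playing(x), go_to_movies(x)}
    database.add(frozenset({
        f"~have_money({x})", f"~like_movie_playing({x})", f"go_to_movies({x})",
    }))

moviegoers = [x for x in universe if entails(database, f"go_to_movies({x})")]
print(moviegoers)
```

Trying each object in turn mirrors the "try all the combinations of objects" procedure in the text; a practical system would rely on unification rather than exhaustive substitution.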

The success of logic based systems has been largely restricted to expert systems (usually
production, or rule-based, systems) and programming languages (mainly Prolog,
although even Lisp relies to a large extent on principles of logic). Many other systems
are essentially logic based or contain elements of logic in their implementation. They
include, but are not limited to, various kinds of semantic nets, logic based natural
language interface programs (including those based on Augmented Transition
Networks, see part three), decision trees, Frames and Scripts. The major limitations that
must be overcome before these systems can be more broadly applied to artificial
intelligence can be categorized (Genesereth and Nilsson 1987):

1. such systems are limited to that which can be described by language,

2. such systems can't produce new knowledge about the world; they
can make implicit knowledge explicit but they are unable to go beyond
that, and

3. they have no way to reason with uncertain or incomplete knowledge.

These difficulties are associated with non-dynamic (non-growing or fixed and
axiomatic) databases. Most such systems may be so characterized (i.e. represent a
mentalist's approach). The static nature of these systems is itself a problem. In
particular we would reiterate Putnam's above reservations against the possibility of
modeling intelligence without reference to the environment. Permitting the database to
be dynamically interactive with the environment can also provide a solution for some of
the problems listed above (particularly problems 2 and 3). But building such a system
is complicated and has its own problems. The system described in part three represents
an attempt to implement such a dynamic database.

2.4.1 Limitations due to language

These limitations exist in as much as the human mind entertains thoughts that are not
describable in any language or, equivalently, in as much as there are processes of the
mind that can have no symbolic representation. Whether or not such thoughts exist is, at
this time, subject only to philosophical argument, and will remain unprovable until an
acceptably human mind is duplicated in a machine (an unlikely event) and all of the
thoughts are counted. It is sometimes asserted that procedural knowledge is a kind of
thought that cannot be symbolically represented. Knowledge is often classified by
researchers in cognition as procedural knowledge or declarative knowledge. The
distinction is very close to the epistemological distinction between knowing that some
state of affairs obtains (which corresponds to declarative knowledge) and knowing
how to perform some task or sequence of tasks (which corresponds to procedural
knowledge). The psychological distinction has a slightly different emphasis in that it
distinguishes knowledge that can be transferred in its entirety by describing it and
visceral knowledge that cannot be described but that can be performed. For example
knowing how to play chess would not be considered procedural knowledge (under this
distinction) since the rules and strategy of the game can be made declarative and a
person might learn how to play chess by being told about the game. The procedural
knowledge necessary to catch a ball, on the other hand, even though the act and feelings
accompanying the catching of a ball can be described, cannot be acquired simply by
hearing about the act and/or those feelings (although the description can be a useful
guide for learning). That knowledge comes only with practice and is peculiar to the
individual doing the ball catching (a person with a weak or deformed arm will execute a
different procedure to catch the ball than, say, a healthy professional baseball player).
However, it can be argued that to execute a procedure, rules must be followed, and that
these rules represent the declarative knowledge of the procedure. Certainly most of
these rules, if they exist, are not accessible to the conscious mind. But the condition of
not being accessible does not preclude the existence of describable rules, just our
ignorance of those rules. The counter argument to the existence of such hidden rules is
that, to apply such rules requires the application of still other rules concerning the
appropriate application of the first rules, and that these rules would require still other
rules, and so forth in an infinite regress of rules. The rejoinder is that it doesn't require
an elaborate rule structure to activate every sequence of physical activities; a relatively
dumb executive procedure can achieve that effect (see for instance Minsky 1987). At
some point, sequences of rule activation terminate in essentially pre-wired physical
activations of the body. Some questions then remain. At what level of complexity do
such physical activations exist and how elaborate must the rules be that terminate in
those physical activations? If the rules are not accessible to us, how will we program
them into a computer? Then too, the connection must work both ways, with the
physical configuration, at some point, being convertible to statements that can be
integrated into the logic structure. Research into these problems is proceeding rapidly
with the greatest emphasis being placed on visual and auditory pattern recognition, i.e.
hearing (especially speech recognition) and seeing. The research constitutes a new field
of study called natural computation (see for instance Richards 1988). If the efficient
resolution process described above is the inference process used, the rules base must be
carefully constructed to maintain consistency and closedness. If practice (a method of
learning) is the practical way to implement procedural knowledge then a dynamic
database is required. How then can such a rules base be added to and/or modified
efficiently while maintaining consistency and completeness?

2.4.2 Limitations of logical inference (logic systems cannot be creative)

Creativity is often assumed to imply a new and unexpected idea not evident in the
existing state of knowledge. If that is the case then the creativity of mathematicians and
scientists is restricted to their mistakes and to those ideas that lead to a scientific
revolution. The contrary argument can be made that acts of creativity are to be found
more in the selection and recognition of appropriate chains of implications in an
existing state of a system than in any leap to a new and unique idea. The argument
maintains that while the reasoning process itself may not be conscious, the mind comes
to a conclusion that is implicit in some previously accepted set of facts. That we acclaim
the creativity of artists, poets, authors and musicians then, stems more from the lack of
an explicit, agreed upon, formal set of rules regarding the subject of their creativity than
in the absence of such a rulebase. So, if a machine-based system can be made to
accurately and consistently reflect the nature of the world and the intentions of an
intelligent being, ideas that are creative in the above sense can be forthcoming. Science
is predicated on the proposition that such a process is the best way to discover and
describe the nature of the universe. If the chain of implications supporting some theory
does not lie within the body of previously accepted scientific knowledge, then the
theory is not accepted into that body of knowledge (as noted elsewhere, in the long run
science does not succeed in excluding revolutionary ideas that require a rejection of
previously accepted ideas). There must be a database of information before any
inference process can occur. Humans must learn before they can create, but we can
always consider the mind as fixed for some small increment of time before the act of
creativity. There are other problems (see below) but it would seem that logical inference
might provide an adequate basis for modeling human reasoning. Purely logical systems
(e.g. systems that rely on resolution) are at a disadvantage because of the old problem
that, to work, they depend upon a closed and consistent database. The modeling of
systems as capable as the average human implies the construction of databases of truly
immense proportions that will run on very fast (or powerful) computers. One of the
areas in which the difficulties inherent in building such closed, consistent, and large
structures manifest themselves is the ability to handle qualifications and assumptions.

Almost any assertion that can be made requires qualifications: all cats chase
mice... except lazy cats, sleeping cats, sick cats, dead cats, blind cats, scaredy cats, hep
cats, big cats, etc. So all of those exceptions must be included as statements in the
system. Humans find it easy to make assumptions; if Bob tells me that George lives
next door to Bill, in the absence of information to the contrary, and only if the question
of the whereabouts of Bob's living quarters ever arises, I will assume that Bob does not
also live next door to Bill, but I will not formally establish that as a fact in my memory.
To handle the assumption problem the logic system is usually completed by the
addition of the negative assertion of all literals that cannot be proved by resolution. That
is 'Bob does not live next door to Bill' would be added into the base of assertions either
directly or at the time the unification/resolution process is run, along with that same
assertion about every other object in the database that cannot be proven from the
original sentences. In addition to causing the processing of large amounts of
information with which the human mind would not bother it can cause other problems.
Suppose that Bob is married to Mary, Tom is rich, and Bill works. The assertion Bob is
rich or Mary works is a meaningful sentence, and not inconsistent with the other three
assertions. But it could not be added to those statements if the above completion
method was applied since that would lead to the augmentation of the database by the
statements that "Bob is not rich" and "Mary does not work". The database would be
inconsistent. Fortunately, as long as the database is represented by Horn clauses the
completion process will produce a consistent system ("Bob is rich or Mary works" is not
a Horn clause so some other way would have to be found to represent that information).
This means the database has to be carefully constructed. Various other completion
strategies are used. Predicate completion allows the assumption of ∀x P1 ∨ P2 ∨ ... ⇐
Q(x) given that ∀x P1 ∨ P2 ∨ ... ⇒ Q(x) under the condition that each Pn is a conjunction of
literals not containing Q. So if Q(x) stands for "someone is rich" and P1 stands for "Tom
has a lot of money and Bill has a lot of money" and P2 stands for "Mary has a lot of
money" then P1 ∨ P2 is completed by the statement that if someone is rich then it can
only be because either Tom and Bill have a lot of money or Mary has a lot of money.
That is, the only things that are rich are those specified by the database; any other
objects in the database cannot be rich. This scheme is generalized by what are called
circumscription formulas under which the requirement that Q be solitary is dropped.
The application of the circumscription formulas requires that the database be put into a
special form and is further complicated by the fact that they are derived through the use
of the second order predicate calculus. Most cases collapse to first order databases but
not all. This complicates implementation. A more telling deficiency of logical inference
lies in the difficulty with which it can model the human mental processes of analogy
and induction by which vague similarities in spatial and temporal patterns evoke
mental states (conclusions). In the course of normal everyday living humans accept and
use information produced by these methods. To some extent induction can be modeled
in a logical database.
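
The completion problem raised earlier in this paragraph (Bob, Mary, Tom and Bill) can be checked mechanically. The following is a minimal sketch under our own assumptions: sentences are variable-free, a clause is a list of (atom, polarity) pairs, and consistency is tested by brute-force enumeration of truth assignments rather than by resolution. The names satisfiable, entails and close_world are hypothetical.

```python
from itertools import product


def satisfiable(clauses, atoms):
    """True if some truth assignment makes every clause true."""
    for values in product([False, True], repeat=len(atoms)):
        model = dict(zip(atoms, values))
        if all(any(model[a] == pol for a, pol in clause) for clause in clauses):
            return True
    return False


def entails(clauses, atoms, atom):
    """An atom is provable if adding its negation makes the set inconsistent."""
    return not satisfiable(clauses + [[(atom, False)]], atoms)


def close_world(clauses, atoms):
    """Complete the database by negating every atom that cannot be proved."""
    return clauses + [[(a, False)] for a in atoms if not entails(clauses, atoms, a)]


atoms = ["married(Bob,Mary)", "rich(Tom)", "works(Bill)", "rich(Bob)", "works(Mary)"]
facts = [[("married(Bob,Mary)", True)], [("rich(Tom)", True)], [("works(Bill)", True)]]
disjunction = [("rich(Bob)", True), ("works(Mary)", True)]  # not a Horn clause

print(satisfiable(close_world(facts, atoms), atoms))
print(satisfiable(close_world(facts + [disjunction], atoms), atoms))
```

With only the three Horn facts the completed database remains consistent. Once the disjunction is included, neither "Bob is rich" nor "Mary works" is individually provable, so both negations are added and the completed database contradicts the disjunction.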

Induction may be thought of as the generalization of rules from instances so that the
truth value of a hypothesis regarding further instances can be obtained. As an example
if we observe a pair of cardinals and a pair of bluebirds and in each case the male bird is
brightly feathered while the female is dully feathered we can generalize the rule that
male birds are brightly feathered and female birds are dully feathered. If we
subsequently observe a pair of pheasants we will conclude that the brightly feathered
pheasant is the male bird. Children are taught this skill early. Small children are given
picture book puzzles for practice. In the picture book a sequence of objects is displayed
from which the one that doesn't fit is to be picked. For instance, the child may be shown
three empty pails and a full pail and asked to choose the one that is different. Older
children may be given more difficult problems, for example the pictures may be of a
bluebird, a hummingbird, a monarch butterfly and a moose. They are asked which one
doesn't belong. They are expected to discover the rule that all of the animals have
wings (or fly) except for the moose. The ability of the child to solve this problem
depends to an extent upon the pictorial representation in the picture book. Perhaps the
flying animals are all displayed with outstretched wings drawing attention to that
attribute, and, in effect, requesting a choice from the set (wings, wings, wings, no
wings). To avoid that, assume that an adult has been asked to discover the
rule that governs the inclusion of members in the non-pictorial list (bluebird,
hummingbird, monarch butterfly, moose) and to find the one that doesn't fit. The
obvious method to solve this problem is to think of all of the sets of attributes of the
members and find the ones that overlap. When we find three items with many of the
same attributes (more than any other group of three) we will conclude the leftover item
is the one that doesn't fit. We would note that all of the items are included in the set of
things called animals, that the bluebird, hummingbird and butterfly all fly, that the
bluebird, butterfly and moose all migrate and that the bluebird, hummingbird and
butterfly are all small. So we choose the bluebird, hummingbird and butterfly because
they have the most things alike. The rule for inclusion is that the animal be small and
able to fly. If we now added 'toy airplane' to the list, the same process will produce a
new rule that the thing to be included be small and fly. Actually we did not need the
relation 'animal' to express the rule intended in the first case. "small and flies" would
have been sufficient to do the job and desirable in the sense that it is a minimal rule. We
should note that items on a list of things from which a rule is to be generalized may
contain negative examples. For instance 'not a toy airplane' could have been added
instead of 'toy airplane'. In fact, when checking bluebird, hummingbird and monarch
butterfly against moose we are really looking for a rule that generalizes the inclusion of
items on the list (bluebird, hummingbird, monarch butterfly, ¬moose).
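
The "obvious method" described in this paragraph can be sketched as follows. The attribute sets are taken from the example in the text; the function name odd_one_out, and the choice to report only those shared attributes that the leftover item lacks (the minimal rule), are our own assumptions.

```python
from itertools import combinations

# Attributes as given in the example above.
attributes = {
    "bluebird": {"animal", "flies", "migrates", "small"},
    "hummingbird": {"animal", "flies", "small"},
    "monarch butterfly": {"animal", "flies", "migrates", "small"},
    "moose": {"animal", "migrates"},
}


def odd_one_out(attributes):
    """Find the group of n-1 items sharing the most attributes and the leftover item."""
    items = list(attributes)

    def shared(group):
        return set.intersection(*(attributes[i] for i in group))

    best_group = max(combinations(items, len(items) - 1), key=lambda g: len(shared(g)))
    odd = next(i for i in items if i not in best_group)
    # Drop attributes the leftover item also has: they do not distinguish the group.
    rule = shared(best_group) - attributes[odd]
    return odd, rule


odd, rule = odd_one_out(attributes)
print(odd, sorted(rule))
```

Removing the attributes that the leftover item shares with the group is what discards 'animal' and leaves the minimal rule "small and flies".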

We can make several observations about the nature of a logic system that would use
induction. First we observe that any database to which induction is to be applied must
contain many objects about which much is known. The particular subset of instances of
interest must somehow (independently from the induction process itself, perhaps by
some measure of proximity in time, space or structure), be selected or indicated. Second
we note that the power of induction resides in its ability to generalize rules so that they
may be applied to new instances. In a closed database the instances are considered to
exhaust all of the possibilities. In that case circumscription, or an inheritance rule can
provide information missing from a particular instance. The fact that rules may be
generalized becomes a moot point. The use of induction is most useful if the database is
growing, or better still, dynamic. Thirdly, the example above indicates a process that
only derives conjunctive rules, e.g. "if it is small and it flies then it belongs to the set."
Disjunctive rules such as "if it is small or it flies then it belongs to the set" are more
difficult to derive. The problem of deriving rules more complex than conjunctive ones
remains largely unsolved. And fourthly to implement induction we must have a
database that provides a thorough taxonomy of the objects to which induction is being
applied. That is, we have to have the kinds of rules that indicate migrating animals are
animals, that birds are migrating animals, that bluebirds are birds etc. If all of these
conditions are met then we can implement induction according to a scheme similar to
the following due to Patrick H. Winston (Winston, 1975) and Tom M. Mitchell (Mitchell,
1978):

Assume we are testing a problem to see if rules exist for inclusion in the set of instances
(bluebird, hummingbird, monarch butterfly, ¬mouse, ¬condor). For this set of objects
the candidate relations are (bird, insect, mammal, tiny, small, large, flier, migrates,
animal). If we use the common sense method of elimination, unfortunately we will
eliminate everything. That is, a relation is not admissible if there is a positive instance
to which it does not apply or a negative instance to which it does apply. So we eliminate
'bird' and 'mammal' because they do not apply to the positive instance 'monarch
butterfly'. Likewise we eliminate 'insect' because it doesn't apply to 'hummingbird'. We
also eliminate 'animal', and 'migrates' because those relations do apply to the negative
instance 'moose'. And we eliminate 'small' because it applies to the negative instance
'mouse' and 'flier' because it does apply to the negative instance 'condor'. We have
eliminated all of the candidates. But obviously the rule "if the object is small and flies
then it belongs to the set" will admit all of the objects in the set. To get around this
problem a version graph is created that consists of all admissible relations concerning
one instance under consideration, see figure 2.3. The admissible relations for the
instance 'hummingbird' are (tiny, bird, small, flier, animal). The database tells us that
tiny things are a subset of small things and small things are a subset of anysize things
and that every physical object falls in the category 'anysize'. The database also contains
the information that birds are a subset of fliers and fliers are a subset of animals. So tiny
→ small → anysize and bird → flier → animal are two hierarchically ordered sets of
relations. With this information we can construct a version graph which is a directed
graph consisting of all of the combinations of elements from the two hierarchies
arranged so that an arrow points from any given combination to the other combinations
that contain it, with the exception that the combinations pointed to, are not themselves
contained by some combination to which the given combination is already pointing. So
hummingbird points to 'tiny flier' since a hummingbird belongs to the set of tiny fliers
but hummingbird does not point to 'small flier' (even though it is a member of that set)
since 'tiny fliers' is already a subset of small fliers. The rule that we are looking for must
reside somewhere in the version graph (even though we have not considered any other
instances at this point) since the most specific possible rule that contains the single
instance we have considered, lies at the bottom of the graph and the most general
possible rule lies at the top of the graph. We then use the other instances to prune the
graph to get the single rule or small graph of rules that applies to those instances. In this
example, we use 'bluebird' to prune the entire edge containing the relation tiny since
bluebird is a positive instance and is not tiny. We use 'butterfly' to prune the edge
containing the relation 'bird' since butterfly is a positive instance and is not a bird. At
this point we are left with only the upper diamond in the diagram. We can use 'not a
mouse' to prune 'small animal' and 'anysize animal' because it is a negative instance that
is a small and, of course, an anysize animal. Finally we can use 'not a condor' to prune
'anysize flier' since the condor is a large (hence anysize) bird, and therefore an anysize
flier, but it is not an
example of a 'small flier'. So the pruning technique produces the single minimal rule
"small flier" or to restate it "small and flies".

                  anysize animal

          small animal        anysize flier

   tiny animal        small flier        anysize bird

            tiny flier          small bird

                    hummingbird
                    (tiny bird)

Figure 2.3: A version graph
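
The pruning of the version graph can be sketched as a simple filter over candidate rules. This is an illustration under stated assumptions, not the procedure of Winston or Mitchell as published: each candidate rule is a (size, kind) pair drawn from the two hierarchies, each instance is described by its most specific size and kind, and 'large' is added as a second child of 'anysize' so that the condor can be described. The names ancestors, covers and version_space are our own.

```python
from itertools import product

# Hierarchies from the text; "large" is an assumption added for the condor.
SIZE_PARENT = {"tiny": "small", "small": "anysize", "large": "anysize", "anysize": None}
KIND_PARENT = {"bird": "flier", "flier": "animal", "animal": None}


def ancestors(value, parent):
    """Return value together with everything above it in its hierarchy."""
    chain = []
    while value is not None:
        chain.append(value)
        value = parent[value]
    return chain


def covers(hypothesis, instance):
    """A rule covers an instance if both of its relations apply to the instance."""
    size, kind = hypothesis
    return (size in ancestors(instance[0], SIZE_PARENT)
            and kind in ancestors(instance[1], KIND_PARENT))


def version_space(seed, examples):
    """Start with every rule that admits the seed, then prune with the examples."""
    candidates = set(product(ancestors(seed[0], SIZE_PARENT),
                             ancestors(seed[1], KIND_PARENT)))
    for instance, positive in examples:
        # Keep rules that admit positive instances and exclude negative ones.
        candidates = {h for h in candidates if covers(h, instance) == positive}
    return candidates


hummingbird = ("tiny", "bird")
examples = [
    (("small", "bird"), True),     # bluebird
    (("small", "flier"), True),    # monarch butterfly
    (("small", "animal"), False),  # not a mouse
    (("large", "bird"), False),    # not a condor
]
surviving = version_space(hummingbird, examples)
print(surviving)
```

The nine starting candidates correspond to the nine nodes above 'hummingbird' in figure 2.3. Processing the examples in a different order leaves the same surviving rule, since each example removes candidates independently of the others.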

2.4.3 Limitations of the crisp laws of logic (reasoning with uncertain and incomplete knowledge)

Sentences in the first order predicate calculus must conform to crisp rules of logic, that
is, they are required to be either true or false; the mathematical law of the excluded
middle is accepted. But in reality people make assertions about the world in which they
are not completely confident. This is especially true as regards intentions. For instance a
person might assert "If I come close to finishing this task I will take the rest of the day
off." But there is always the possibility that further pressing tasks will present
themselves or that the person will change his mind. This is further complicated in that
the determination of progress on the task may be uncertain. Conclusions reached
through inductive processes are also subject to doubt. In the previous section the
example involving the induced conclusion that, of a pair of birds, the male bird is the
colorful one can only be asserted with a degree of confidence that depends upon how
many different kinds of pairs of birds with a colorful male have been observed. One or
two kinds of birds that conform to the rule make it plausible, hundreds of kinds make it
almost certain. Even a few exceptions will probably be tolerated without discarding the
rule (you would simply explicitly include the exceptions in the database). What if a
significant percentage of pairs are exceptions? The mechanisms developed to deal with
this problem fall primarily into three categories: 1) objectively probabilistic, 2)
subjectively probabilistic and 3) possibilistic.

2.4.3.1 Objectively probabilistic

If sufficient information is available (for instance we have observed hundreds of kinds
of pairs of birds) and the predicate is clearly defined (e.g. "the male is colorful and the
female is not" as opposed to "the task is close to finished"), then we can associate the
sentence with a binary valued probability distribution {p and (1-p)} where p is the
observed probability of the sentence being true. For sentences that meet the criteria the
observed joint probabilities (probabilities of the conjunction and disjunction of
sentences and their negations being true) can be used to calculate the marginal
probabilities (i.e. the probabilities of the individual sentences). The logical inference rule
modus ponens (given P is true and P ⇒ Q we can infer Q is true) can be modeled by
using the odds form of Bayes' theorem, that is

O(Q|P) = (p(P|Q)/p(P|¬Q)) * O(Q),

from which the probability of Q given P is easily calculated58. This requires that we
know the conditional probabilities of P given Q and P given ¬Q as well as the prior
odds on Q being true. That, however, might be the case. For example if P is "a front is
passing" and Q is "it is raining" we need to know the probability that a front is passing
for both the case that it is raining and that it is not raining. We further need to know the
odds of it raining in any given time period. All of these numbers are easily obtainable
from the weather bureau, so we can calculate p(Q|P). That is we can calculate the
probability that it is raining given that it is true that a front is passing. We can imagine,
however, that there are many cases when the numbers are not available.

2.4.3.2 Subjectively probabilistic

If the statistical information required by the objectively probabilistic method is not
available, it may be supplied by the subjective estimate of a knowledgeable source. In
the previous example almost anyone could supply the numbers. For instance, the odds
of it raining here in the time period necessary for the passage of a front (say half of a
day) are about one rain to four dry (it rains about every two or three days in this area).
The probability that a front is passing given it is raining is high in the winter and low in
the summer, on average about sixty percent of the time. The probability that a front is
passing given that it isn't raining is very low; a front is passing perhaps one percent of
the time that it is not raining. Then O(Q|P) = (.6/.01)*.25 = 15 and p(Q|P) = 15/16 = .94
represents the probability that it is raining given that a front is passing. Attention is
focused on p(P|Q)/p(P|¬Q). This ratio is called the likelihood ratio of P with respect to
Q. An expert may find it easier to give an estimate of this number by answering a
question such as (using our example) "how much more likely is it that a front is passing
if it is raining than if it is not raining?". This may seem easier than directly providing
the conditional probabilities. The other required number, the prior odds on Q, that in
our example is the odds on it raining at any given time, is also an easy number to
determine. But the accuracy of the method still relies upon the fact that the random
58
Use p(P) = O(P)/(O(P)+1)
variable being estimated is well defined and observed often. This is not always the case.
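
The arithmetic of the weather example can be reproduced with a short sketch. The function names posterior_odds and odds_to_probability are our own; the three input numbers are the subjective estimates quoted above.

```python
def posterior_odds(p_evidence_given_q, p_evidence_given_not_q, prior_odds):
    """Odds form of Bayes' theorem: likelihood ratio times prior odds."""
    likelihood_ratio = p_evidence_given_q / p_evidence_given_not_q
    return likelihood_ratio * prior_odds


def odds_to_probability(odds):
    """Convert odds to a probability using p = O / (O + 1)."""
    return odds / (odds + 1)


# P = "a front is passing", Q = "it is raining"
odds = posterior_odds(
    p_evidence_given_q=0.6,       # p(P|Q)
    p_evidence_given_not_q=0.01,  # p(P|¬Q)
    prior_odds=0.25,              # one rain to four dry
)
probability = odds_to_probability(odds)
print(round(odds, 2), round(probability, 2))
```

If the expert supplies the likelihood ratio directly, the same calculation applies with the ratio in place of the two conditional probabilities.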

2.4.3.3 Possibilistic (fuzzy)

A knowledgeable expert providing the likelihood ratios for a subjective estimate of
probability is not free to make any rational guess that comes to mind. Those choices are
constrained by the restrictions that:

1. the probability density function and the space within which the random
variables assume value must be precisely specified,

2. the variables must be measurable and repeatedly observed, and

3. summation of the probability density function across the probability space must
total 1.

If the expert cannot provide values consistent with those requirements then the
program must take steps to cook the estimates or to obtain more consistent values from
the expert. The requirements that must be satisfied by a possibility distribution are more
relaxed. They are:

1. The possibility value for an event is always greater than or equal to the
probability value for that event, because the possibility of an event is less in
doubt than its probability.

2. The possibility distribution is a continuous function into [0,1].

The relaxation of requirements comes in recognition of the fact that a person's view of
the world, rather than consisting of a database of certainties, consists of beliefs to which
degrees of confidence ranging from certainty to uncertainty are assigned. The sources of
such uncertainty lie in imprecise or incorrect perceptions from whatever cause, and in
the indeterminate nature of the world arising from truly random or at least
unpredictable circumstances. One does not have to look far for evidence of such fuzzy
thinking. The evidence is in the language we use every day to describe ourselves, our
thoughts, the world, that which we have seen happen, or think we have seen happen,
that which we think will happen and that which we intend to do. For example consider
the following commonly used parts of speech (Kandel, 1989):

1. Predicates: small, large, young, safe, much smaller than, soon etc. The idea of tall(x)
or ¬tall(x) is not probabilistic in nature but each person could describe a
function that returns his degree of belief that x is tall.

2. Quantifiers: all, some, most, many, few, several, often, usually etc. Probability
theory can handle only all and there exists.

3. Qualifiers: likely, unlikely, not very likely, probably, very probable.

4. Possibilities: possibly, quite possible, impossible, almost impossible, probably possible.
It would be ludicrous at best to try to convert expressions whose very purpose
is to escape the preciseness implied by probability back into the terms of that
theory.

5. Truth Values: very true, quite true, mostly untrue. These expressions deny truth in
its logical and absolute sense.

6. Predicate Modifiers: very, quite, extremely, somewhat, slightly. These are items
that suffer from the same difficulties as truth values.

Classical logic (as opposed to fuzzy logic) described in set theoretic terms provides that
a characteristic function, say $\mu_A$, can exist for determining the inclusion of what we
have termed an instance above, in the set A. That function determines for the instance x

\[
\mu_A(x) = \begin{cases} 1 & \text{iff } x \in A, \\ 0 & \text{iff } x \notin A. \end{cases}
\]

Fuzzy logic allows the valuation set, {0,1} in the classical case above, to be the real
interval [0,1]. The function $\mu_A(x)$ is then termed the membership function of x in A.
Negation is defined as $\mu_{A'}(x) = 1 - \mu_A(x)$ where A' is the complement of A. The values
of disjunctions and conjunctions are calculated by applying a membership function to
the union and intersection of sets as $\mu_{A \cup B}(x) = \max(\mu_A(x), \mu_B(x))$ and
$\mu_{A \cap B}(x) = \min(\mu_A(x), \mu_B(x))$. It is no longer necessary to resort to Bayesian
probabilities to calculate the value of the implication operator. If the fuzzy truth value of
P is $\mu_A(x)$ and that of Q is $\mu_B(x)$ then the truth value of $P \Rightarrow Q$ can be calculated
as the truth value of $\neg P \lor Q$, that is $\mu_{A' \cup B}(x) = \max(1 - \mu_A(x), \mu_B(x))$. This is
usually qualified with the injunction that, at the
extremes, the measure must return the crisp logical values of 1 or 0. That is, when P and
Q take on crisp values the implication measure must return the expected crisp value.
Various measures have been suggested that have this quality. There is considerable
debate as to which is the most appropriate.
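As a minimal sketch of the operators just described, the following uses the max/min definitions and evaluates the implication as (not P) or Q, that is max(1 - mu_A, mu_B). This is only one of the several implication measures the text alludes to, and the function names are illustrative assumptions.

```python
def fuzzy_not(mu_a):
    """Membership in the complement A'."""
    return 1.0 - mu_a


def fuzzy_or(mu_a, mu_b):
    """Membership in the union of A and B."""
    return max(mu_a, mu_b)


def fuzzy_and(mu_a, mu_b):
    """Membership in the intersection of A and B."""
    return min(mu_a, mu_b)


def fuzzy_implies(mu_p, mu_q):
    """Truth value of P => Q, taken here as that of (not P) or Q."""
    return fuzzy_or(fuzzy_not(mu_p), mu_q)


# At the crisp extremes the measure reduces to the classical truth table.
for p in (0.0, 1.0):
    for q in (0.0, 1.0):
        print(p, q, fuzzy_implies(p, q))
```

With crisp inputs the only false case is P true and Q false, which is the boundary condition the text requires of any candidate implication measure.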

The problems then with logic systems as they would apply to machine intelligence can
be seen to be treatable, but at great cost to the way that such systems have been
traditionally constructed. Most of the inequities derive from the closed nature of logic
systems that exhibit the "desirable" traits such as consistency and completeness. Such
systems submit to efficient and sound inference techniques. But that perfection is the
root cause of their inability to handle the imperfect, illogical, uncertain and incomplete
information that plagues real world systems. That, fortunately, is just a problem and not
a condemnation so long as the ideal of mathematical perfection is forgone.

2.4.4 Problems with symbolic-logic systems

The symbolic-logic approach to modeling machine intelligence enjoys its greatest
success when the intelligent process to be simulated is known in sufficient detail to
allow the construction of a consistent and complete logic base. In a static and known
environment with well defined constraints and with a computer of sufficient
computational capacity it is possible to construct a system that performs in a highly
constrained environment in a manner that would be termed intelligent if the
performance was by a human. But the environment in which a human must think and
act is infinite in its complexity and variety. The human brain, while large, is finite in
computational capacity and must compensate by adapting to the environment in which
it finds itself. While it would be too strong to say that adaptability defines human
intelligence it is a hallmark of that intelligence. Without the lifelong capacity to change a
human would be a creature of habit, unable to change occupation, acquaintances, or
even residence without severe incapacitation. As has been indicated in previous
sections, the machine systems that are the most difficult to create are those that learn
(grow) and change while maintaining a consistent and complete world view that
prevents the breakdown of logic processes. When faced with a situation only slightly
removed from its area of expertise, topical (expert) systems fail. This phenomenon has
been referred to as 'brittleness' or 'falling off the knowledge cliff'. Symbolic-logic
systems generally do an inadequate job of assimilating new information into the
existing knowledge base, in generalizing from the existing information and in finding
similarities (e.g. drawing analogies, understanding metaphors, recognizing a misspelled
word, or a slightly disfigured object). Symbolic-logic systems are based on models of the
human thought processes as they are believed to occur at the level just below the human
consciousness. There are, however, other lower level models based on the
neurobiological construction of the human brain. Advocates of these models claim that
they hold the promise of remedying some of the deficiencies of the symbolic-logic
models. The models are called neural net, connectionist, parallel distributed processing
or, in general, neuromorphic models. They differ from symbolic-logic models in that the
information in the system is distributed across the system, residing in the connections
between the neurons that make up the system.

2.5 Neuromorphic models

There are perhaps a hundred or more variations or types of neurons (or nerve cells) in
the human body. Physically they vary in length of axon, location of the cell body in
relation to the axon and dendritic structures, and the distribution of dendrites.
Functionally they take on roles as muscle activators (motor neurons), sensors to relay
information about the body and the environment to the brain (sensory neurons), and
relays and manipulators of information (internuncial neurons). The latter are the
primary mechanisms by which the functions of the brain are implemented. A common
configuration of a neuron is that shown in figure 2.4. An internuncial neuron typically
functions by accepting frequency modulated pulses of electrical energy through
synapses (points at which the neuron dendritic structure contacts other nerve cells). The
cell body then combines or otherwise processes those signals. It is believed that if the
frequency, strength and polarity of the pulses combine to reach a threshold determined
by the particular neuron, the neuron activates, or fires electrical pulses through the axon
to other dendritic structures (labeled terminals in figure 2.4). In 1943 Warren McCulloch
and Walter Pitts (McCulloch and Pitts, 1943) observed that, with certain simplifying
assumptions it is possible to associate the activity of groups of neurons with formal
models of propositional logic. They proposed that the neurons in the brain may be
thought of as performing these logic functions. They stated that, as regards neurons
arranged into logical networks, it can be shown (note that "circles" refers to what would be
called cycles in a graph):
"first, that every net, if furnished with a tape, scanners connected to
afferents, and suitable efferents to perform the necessary motor operations,
can compute only such numbers as can a Turing machine; second, that
each of the latter numbers can be computed by such a net; and that nets
with circles can be computed by such a net; and that nets with circles can
compute, without scanners and tape, some of the numbers the machine
can, but no others, and not all of them."

[Figure 2.4: The Neuron. Labeled parts, from input to output: synapses (input signal),
dendrites, cell body (summing process, comparing process), axon, terminals (output signal).]

The model was simple and marked the debut of what has become, in recent years, a
flood of neural network models that purport to represent the neural structures from
which mind might emerge. The simplifying assumptions of the McCulloch–Pitts model
were that 1) the neuron's activity was discrete, operating in an all–or–none fashion, 2)
neurons operate in discrete and independent time periods during which they sum
the signals from the synapses connecting to them, 3) the only delays in the system are at
the synapses, 4) there can be inhibitory as well as excitatory signals from the synapses
and that inhibitory neurons absolutely inhibit the neurons they connect to and, 5) the
structure of the net itself does not change. The activity of the cell body is assumed to be
a simple summation of the values coming into it through the dendrites. If that sum
exceeds some fixed threshold value then the neuron fires a signal to its output
connections in the network. Figure 2.5, part A indicates how the logical and, or, and not
functions can be implemented by formal neurons. The vertical and horizontal lines
represent dendrites. The dots at the intersections of the lines are synapses that
will pass the value indicated when activated; when not activated the lines are assumed
to have zero value. Each of the logical functions depicted requires one neuron for its
implementation and those neurons are represented by the four vertical lines. The circles
represent the cell bodies of the neuron. The value in each circle is the threshold value
that the sum of values reaching the cell body must exceed before it will fire. The input
signals are assumed to be coming from neurons (w, x, y and z) whose thresholds are
immaterial to the diagram.

[Figure 2.5: Formal neuron model (McCulloch and Pitts), two configurations. Part A: neurons
in grid. Part B: same neurons in perceptron-like net. In both parts the inputs are w, x, y
and z. Inputs x and y reach the "x and y" and "x or y" neurons through synapses of value +1
and reach the "not x" and "not y" neurons through synapses of value -1. The thresholds of the
four neurons are 1, 0, -1 and -1 respectively.]

It's easy to verify that the indicated logic functions are implemented when x, y,
or any combination of x and y send a signal down a dendrite. For instance, since the
vertical line attached to the not y has the value 0 constantly present on it, it will fire
during each time period until a signal from y causes the synapse to issue a -1 which is
subsequently summed in the cell body, and which reduces the total value in that neuron to the
threshold value. In the McCulloch-Pitts model this signal would be an inhibiting signal
that absolutely prevents the neuron from firing, but that is not a feature essential to
establishing the logic functions. The neuron will cease firing since the values being
summed in it no longer exceed the threshold value. It's also easy to see that by
modifying the synaptic weights any desired set of output signals can be arranged for a
given set of input signals. Part B of figure 2.5 shows the same network as a bipartite
directed graph (labeled perceptron-like net in anticipation of succeeding sections) in
which the synapses have been replaced with boxes containing the appropriate
weights. The logic functions depicted in figure 2.5, rearranged and combined in a
structure superficially different from the neuronal model, are just those used to
implement the logic in most present day digital computers.
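The formal neurons of figure 2.5 can be sketched in a few lines. The sketch assumes the synaptic values (+1, +1, -1) and thresholds (1, 0, -1) read from the figure labels, and uses the rule stated above that a neuron fires only when the summed value exceeds its threshold; the function names are illustrative.

```python
def formal_neuron(synaptic_values, threshold):
    """Fire (return 1) only if the summed synaptic values exceed the threshold."""
    return 1 if sum(synaptic_values) > threshold else 0


def and_gate(x, y):
    # Both synapses pass +1 when active; threshold 1 requires both inputs.
    return formal_neuron([+1 * x, +1 * y], threshold=1)


def or_gate(x, y):
    # Same synapses as the and neuron; threshold 0 requires at least one input.
    return formal_neuron([+1 * x, +1 * y], threshold=0)


def not_gate(x):
    # The idle sum 0 exceeds -1, so the neuron fires until x contributes -1.
    return formal_neuron([-1 * x], threshold=-1)


for x in (0, 1):
    for y in (0, 1):
        print(x, y, and_gate(x, y), or_gate(x, y), not_gate(x), not_gate(y))
```

Changing only the thresholds or the synaptic values changes which logic function a neuron computes, which is the observation that motivates the adjustable weights of later models.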

The architecture of digital computers is named after John von Neumann who is
credited with conceiving of the organization of sequential, single central processing unit
(CPU), multi-purpose memory computers (computers in which both programs and data
share the same memory). Von Neumann did much of his significant work in the same
time frame as McCulloch and Pitts. He noted that a logic structure similar to that in
figure 2.5, unlike that of a von Neumann digital computer, would be easy to extend by
the addition of more input lines (w and z, etc.) and the concurrent modification of the
synaptic weights and/or the threshold values. The responsibility, then, for activating
the output lines would be shared by many inputs (other neurons). In this way
redundancy would be introduced into the network so that the loss of any small portion
of the input or synapses, would have a gracefully degrading effect on the output. This is
preferable to the catastrophic failure that occurs with the disruption of a single circuit in
a digital computer. Von Neumann also showed that neuromorphic networks, despite
their poor precision, could carry out accurate arithmetic operations. However, the truly
exciting and potentially revolutionary aspect of a neuronal model was in the possibility
of arranging the synaptic weights to effect a particular output, indicating that neuronal
networks could be devised to associate any two pieces of data (although, given just the
McCulloch-Pitts model, to create such a network would require the efforts of a
knowledgeable and clever programmer). It was a tempting thought that automatic
adjustment of the weights might provide circuits that could learn to make associations.

In 1949 Donald O. Hebb introduced a network model in which the weights of active
synapses were increased and inactive synapses decreased according to their
contribution to the desired result (Hebb, 1949). This allowed groups of weakly
connected neurons (or units or cells) to become strongly connected upon repeated
activation. The scheme worked well to create cooperating assemblies of cells and the
Hebbian rule of adjusting the weights according to use is still, in modified forms,
commonly used in implementing learning networks. The notion of a gracefully
degrading output performance coupled with auto-association of data and feedback of
output to input implied a memory that could retrieve a complete piece of data from
incomplete or noisy input. This was equivalent to the psychological phenomenon
termed redintegration (the recollection of an entire thought upon the presentation of a
phrase or portion of the whole, e.g. most people raised in the United States will
immediately complete the phrase "Mary had a little lamb" with "whose fleece was white
as snow"). So, by the 1950s, there was a great and growing enthusiasm, reinforced by
the cybernetics movement, that machines that could think and behave like humans were
not only possible, but likely to be developed in the foreseeable future.

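[Figure 2.6: A Perceptron. Retina sensors s1 to s3 connect through the weights A (for
example a_s3a1 to a_s3a4) to the associators a1 to a4, which connect through the weights W
(for example w_a1r1 and w_a1r2) to the responders r1 to r3.]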


2.5.1 The Perceptron

In 1958 Frank Rosenblatt (Rosenblatt 1958) introduced a neural network model called a
perceptron. This provided the first description of a neural network that was complete
and computationally accessible. The ease with which this perceptron could be analyzed,
its computational complexity and its potential for learning led engineers,
mathematicians, psychologists and computer scientists to join in the research.
Eventually a working model, slightly modified from Rosenblatt's original description
and similar to the one in figure 2.6, above, emerged (see Block 1962 or Minsky 1988). It
consists of three layers of neurons termed the retina, associator and response layers.
Each neuron in the retina layer (sensor) connects to each neuron in the associator layer
(associator) and each neuron in the associator layer connects to each neuron in the
response layer (responder). In some models the associators and the responders connect
to themselves. The action of the model takes place in discrete but highly parallel steps.
That is, the firing of the retina neurons may be considered one step. Likewise, the
actions taken by the associator and responders may be considered to be sequences of
parallel activity. When a sensor fires (as indicated by the shaded neurons in figure 2.6)
an excitatory or inhibitory stimulus is transmitted to the associators according to the
magnitudes of the weights $a_{s_i a_j}$ (the weights represent the synapses of the previous
models and may have positive or negative value). Each associator performs a
mathematical operation on all of the stimuli coming into it (we will assume it is a simple
summation unless the associator layer is to participate in the output calculations; we
will call the function associated with associator i, $\phi_i$). Each associator transmits a
stimulus to the responders and does so with a strength proportional to the weights
$w_{a_i r_j}$ (abbreviated $w_{ij}$ below). Intra-layer connections (e.g. from associators to
associators) are always negative (inhibitory). Each responder has an associated activation
function, $\rho_j$, that calculates the final output value of the responder. The firing pattern
of the responders represents the output of the perceptron. The relation
$\sum_i w_{ij}\phi_i > \Theta_j$, termed a linear threshold relation
(LTR), sums the inputs from all the associators i as they come into responder j, and
then compares the result to some arbitrary value $\Theta_j$. So the effect of the LTR at
responder j can be written as the linear threshold function (LTF) $\rho_j$ where

\[
\rho_j = \begin{cases} 1 & \text{if } \sum_i w_{ij}\phi_i > \Theta_j, \\ 0 & \text{if } \sum_i w_{ij}\phi_i \le \Theta_j. \end{cases}
\]

[Figure 2.7: A squashing function. Curves labeled squash, squash' and threshold, plotted
against $\sum w\Phi$ from -6 to 6 on a vertical scale reaching 1.0.]


This is a discrete, non-linear step function and has been used extensively in theorizing
about neural networks (especially by Minsky and Papert, see below). More recently
squashing functions have been used instead of threshold relations. Squashing functions
still rely on a linear summation of inputs but instead of comparing the output to a single
value, squashing functions simply squash big sums into the interval [0,1].

This provides a continuous non-linear response that, to some extent, takes into account
the magnitude of input into the neuron. A widely used squashing function is shown in
figure 2.7. It is $\mathrm{squash}(x) = 1/(1 + e^{-x})$. Using this squashing function the output
(threshold) function of responder j, written $\rho_j$, becomes

\[
\rho_j = \begin{cases} 1 & \text{if } \mathrm{squash}(\sum_i w_{ij}\phi_i) > \Theta, \\ 0 & \text{if } \mathrm{squash}(\sum_i w_{ij}\phi_i) \le \Theta, \end{cases} \qquad 0 < \Theta \le 1.
\]

The squashing function is particularly useful in multi-layer networks where a non-linear
relation between layers is required but in which as much information concerning
the strength of inputs is to be preserved in the outputs. This is taken up below in the
section on back propagation.
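The two responder outputs described above, the plain linear threshold and the thresholded squashing function, can be compared in a small sketch. It assumes the logistic form squash(x) = 1/(1 + e^-x) given in the text; the function names, the sample weights and the threshold values are illustrative assumptions.

```python
import math


def squash(x):
    """Logistic squashing function 1 / (1 + e^-x), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def weighted_sum(weights, phis):
    """Linear summation of associator outputs arriving at one responder."""
    return sum(w * phi for w, phi in zip(weights, phis))


def ltf(weights, phis, theta):
    """Linear threshold function: compare the raw sum with theta."""
    return 1 if weighted_sum(weights, phis) > theta else 0


def squashed_threshold(weights, phis, theta):
    """Compare the squashed sum, which lies in (0, 1), with theta."""
    return 1 if squash(weighted_sum(weights, phis)) > theta else 0


weights = [1.0, 1.0]  # hypothetical weights into one responder
for phis in ([0, 0], [1, 0], [1, 1]):
    total = weighted_sum(weights, phis)
    print(phis, total, round(squash(total), 3), ltf(weights, phis, theta=1.0))
```

The squashed value retains a graded record of how strongly the responder was driven, which is the property the text singles out as useful between layers.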

The perceptron in figure 2.6 displays three layers; the retina, the associators and the
responders. Actually the perceptron is considered to consist of only one layer, which in
this case would be the responder layer, together with the associated weight matrix W,
and the threshold functions, $\rho$, at each responder neuron. For purposes of exposition, the
retina and associator layers may be thought of as fixed so that they do not contribute
anything to the calculations made by the perceptron (until specifically included as
such). Generally the number of layers in a perceptron is counted as the number of layers
with associated weight matrices that have activation functions and that take part in the
calculations of the network output. The presentation shown in figure 2.6 will be useful
in explaining the representational capabilities of a network and for extending the
concept of single layer perceptrons to multiple layer networks.

The purpose of the perceptron is to produce a desired responder pattern when the
retina is stimulated by some input pattern (hereinafter referred to as a figure,
unconnected subparts of a figure will be referred to as components). The responder
functions $\rho_j$ provide a
non-linear, step response due to their description in terms of linear inequalities. Given a
particular associator pattern, the desired responder pattern can, with some restrictions,
be achieved by adjusting the synaptic weights w. If the retina consists of n sensors then
(given that a sensor has two states, firing or not firing) there can be $2^n$ possible figures
on a retina. If there are m associators (m < n) then, on the average, there are $2^{n-m}$ figures
for every associator pattern. Since every figure has to map to one of the $2^m$ associator
patterns the figures are partitioned into $2^m$ categories. The connections from the
associators to the responders are then available for effecting a final mapping of the
categories to the desired responder patterns. To see the restrictions on this mapping
scheme consider the boolean value of the n sensors of the retina as the coordinates of a
point in an n dimensional figure space in which each point represents some figure, i.e.
some particular combination of firing and non-firing sensors. Since each sensor can only
have one of two values all of the points happen to fall at the vertices of a hypercube. An
LTR can be interpreted as representing a hyperplane in that n-space. So a perceptron
can be seen to be distinguishing figures according to how they are partitioned by the
LTR planes. The partitioning has a limited granularity according to the ability of the
LTR to represent possible hyperplanes. That ability depends upon the number of terms
available to the LTR (i.e. the number of associators $\phi$) and the values that the coefficients
(a, w) can assume.

As an example, consider a perceptron consisting of a set of n ordered and sequentially
numbered sensors connected to two associators which are in turn connected to one
responder. If all of the even numbered sensors have
excitatory connections to the first associator and inhibitory connections to the second
associator, and all of the odd numbered sensors have the reverse connections, then
figures that cover more even numbered sensors than odd numbered sensors will cause
the first associator to fire while figures covering more odd numbered sensors will cause
the second associator to fire. Then the simple LTR, $\phi_1 - \phi_2 \ge 0$, determines the predicate
truth(the figure covers more even numbered sensors). Note that the inequality '≥' lumps
the cases in which there are exactly as many odd numbered as even numbered covered
sensors, along with the zero covered sensors case, into the even category (see footnote 59).

Footnote 59: To visualize how all this translates into the n–space model consider a three
dimensional space represented by the coordinate system s1, s2, s3 (for sensor numbered 1,
sensor numbered 2, and sensor numbered 3). On this retina figures can cover 0, 1, 2,
or 3 of the sensors. The corners of the cube represent figures covering various combinations
of odd and even numbered sensors on the three sensor retina. Listed for each corner of the
cube are the (s1, s2, s3) coordinate, the numbers of the sensors covered, and an indication of
'even' if there are as many or more even numbers covered or 'odd' if there are more odd
numbers covered.

(s1, s2, s3)   sensors covered   category
(0,0,0)        none              even
(1,0,0)        1                 odd
(0,1,0)        2                 even
(0,0,1)        3                 odd
(1,1,0)        1, 2              even
(1,0,1)        1, 3              odd
(0,1,1)        2, 3              even
(1,1,1)        1, 2, 3           odd

The odd cases all map onto an area above the plane through the cube (given the perspective
of the original figure) and the even cases fall below the plane.

Commonly it is desired that a perceptron recognize some geometric pattern that has symbolic
content. For instance it might be desired that a figure shaped like an 'X' give rise to a
responder pattern that can be interpreted as the ASCII code for the letter 'X'. It is desirable
that figures close to 'X' give the response for 'X'. So in most perceptron models the
synaptic weights between the retina and associators (the a) are chosen randomly. This
avoids sensitizing the perceptron to a particular pattern as in the even/odd example above. In
the classical perceptron, once the a are chosen they are not subject to modification.
Patterns close to a known figure will have a high probability of activating the response
for that figure. Choosing a connection pattern like that in the odd/even example above
might defeat that purpose. If it is desired only that the perceptron recognize a specific
fixed figure or set of figures, the sensors that represent that figure can be weighted to
activate one and only one specific associator that is dedicated to that figure. That
associator would in turn be connected to a dedicated responder. This is inefficient,
uninteresting, and not the solution arrived at by 'learning' perceptrons or desired by
those who would use neural nets.
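The even/odd perceptron described earlier in this section can be sketched for a three sensor retina. The sketch assumes that each associator simply sums its inputs with weights of +1 (excitatory) and -1 (inhibitory), and that the responder applies the LTR phi_1 - phi_2 >= 0; the function names are illustrative.

```python
from itertools import product


def associator_sums(figure):
    """Return (phi_1, phi_2) for a figure given as a tuple of 0/1 sensor states.

    Sensors are numbered from 1. Even numbered sensors excite the first
    associator and inhibit the second; odd numbered sensors do the reverse.
    """
    even = sum(s for number, s in enumerate(figure, start=1) if number % 2 == 0)
    odd = sum(s for number, s in enumerate(figure, start=1) if number % 2 == 1)
    return even - odd, odd - even


def responder(figure):
    """LTR phi_1 - phi_2 >= 0: ties and the empty figure count as even."""
    phi_1, phi_2 = associator_sums(figure)
    return phi_1 - phi_2 >= 0


# Enumerate the eight vertices of the figure space cube.
for figure in product((0, 1), repeat=3):
    print(figure, "even" if responder(figure) else "odd")
```

Because the weights are fixed by hand to a single predicate, this network is sensitized to exactly one distinction, which is the limitation the random retina to associator weights are meant to avoid.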

When the model has the ability to adjust its own synaptic weights according to the
Hebbian rule that the weights of the connections between active associators and active
responders are positively reinforced, then it can be shown (Block 1962) that upon
repeated cycles of adjustment and evaluation the weights will converge to values that
will provide the desired responder pattern. The rule can be formalized by writing it as a
so-called delta rule, e.g.

\[
w_{ij}(t+1) = w_{ij}(t) + \eta\,\delta_{ij}\,\phi_i,
\]

where $\eta$ is a training coefficient to control the speed of weight change (this may be
thought of as a step size and is usually a value between 0.1 and 1.0). For the case in
which the activation function is an LTF, $\delta_{ij} = \phi_i r_j$, in which $r_j$ is the target output of
responder j. The weight $w_{ij}$ will not get incremented unless both associator i is firing
and responder j is supposed to be firing. In the case that the activation function is a
squashing function,
$\delta_{ij} = (r_j - \rho_j)\,\mathrm{squash}(\sum_i w_{ij}\phi_i)\,(1 - \mathrm{squash}(\sum_i w_{ij}\phi_i))$,
where $\rho_j$ is the actual output of responder j. Note that squash'
= squash(1 - squash) so that the rather complicated expression is really just
$\delta_{ij} = (r_j - \rho_j)\,\mathrm{squash}'$. Again, if there is no error ($r_j - \rho_j = 0$), or
associator i is not firing ($\phi_i = 0$), there
is no adjustment. The derivative of the squashing function is a bell shaped curve
centered on zero, so large inputs to the activation function are dampened in their effect
on the weight adjustment.
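A minimal sketch of the delta rule for the squashing case follows, for a single responder. The training set (logical or), the constant first associator used as a bias input, the zero starting weights and the number of passes are all assumptions made for illustration; only the update w(t+1) = w(t) + eta * delta * phi with delta = (target - output) * squash' is taken from the text.

```python
import math


def squash(x):
    """Logistic squashing function, written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def respond(weights, phis):
    """Actual output of the responder for one associator pattern."""
    return squash(sum(w * phi for w, phi in zip(weights, phis)))


def train_responder(patterns, targets, eta=0.5, epochs=2000):
    """Repeated cycles of evaluation and weight adjustment by the delta rule."""
    weights = [0.0] * len(patterns[0])
    for _ in range(epochs):
        for phis, target in zip(patterns, targets):
            output = respond(weights, phis)
            # squash' = squash * (1 - squash), evaluated at the current sum.
            delta = (target - output) * output * (1.0 - output)
            weights = [w + eta * delta * phi for w, phi in zip(weights, phis)]
    return weights


# Hypothetical task: the first associator is always on and acts as a bias.
PATTERNS = [(1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)]
TARGETS = [0, 1, 1, 1]

weights = train_responder(PATTERNS, TARGETS)
for phis, target in zip(PATTERNS, TARGETS):
    print(phis, target, round(respond(weights, phis), 3))
```

An associator that is not firing contributes phi = 0 and so leaves its weight untouched, and an output already close to 0 or 1 makes squash' small, which damps the adjustment as the text describes.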

The fact that the w converge means that figures that are separable in n-space as
indicated above, will be separated. This is a hill climbing process which relies on the
freedom of linear processes from local maxima. There are variations of this basic
model that rely on averages or bounded summations, or a combination of increments
and decrements. They may change the speed of convergence but do not alter the basic
conclusion that if a solution can be found a perceptron will find it. No matter the
scheme, realistic models require large numbers of neurons to implement useful
associations. Beyond a certain size it becomes prohibitively complex to connect all
sensors to all associators and all associators to all responders. Because of the
redundancy of neural networks, such complete connectionism is not required to achieve
results. This raises the question of how thoroughly connected a neural network must be
in order to be effective. Anderson, Kohonen, Minsky and Papert explored these
problems in the late 1960s and early 1970s.

2.5.2 A mathematically oriented model

Inevitably the perceptron gave rise to more mathematically concise descriptions. In 1972
Teuvo Kohonen (Kohonen, 1972) and James A. Anderson (Anderson, 1972)
independently constructed and analyzed essentially the same mathematical model of a
perceptron-like network. They noted that, for a perceptron in which the associators and
target responders are described as vectors, say $\Phi$ and $r$, the mapping from associators to
responders can be described by a connection matrix $W$, whose elements are weights $w$,
determined such that $W\Phi = r$. For a particular pair of vectors, say $\Phi_i$ and $r_i$, representing
some specific association of input and response i, $W_i$ can be calculated as
$W_i = r_i \Phi_i^T / \|\Phi_i\|^2$, where T stands for matrix transpose and $\|\cdot\|$ stands for the
euclidean norm. For example, in figure 2.6

\[
r_i = \begin{pmatrix} 1 \\ -1 \\ 1 \end{pmatrix}, \quad
\Phi_i = \begin{pmatrix} 1 \\ -1 \\ 1 \\ -1 \end{pmatrix}, \quad
\|\Phi_i\|^2 = 4, \quad
W_i = \begin{pmatrix} 1/4 & -1/4 & 1/4 & -1/4 \\ -1/4 & 1/4 & -1/4 & 1/4 \\ 1/4 & -1/4 & 1/4 & -1/4 \end{pmatrix},
\]

and as can be quickly verified $W_i\Phi_i = r_i$. When there are many pairs $(\Phi_j, r_j)$ of vectors to
be associated in the network, $W$ is calculated simply as $\sum_j W_j$. As long as $\Phi_i$ and $\Phi_j$ for
i ≠ j are orthogonal (i.e. $\Phi_i^T\Phi_j = 0$) and there is no noise on the network, then $W\Phi_j = r_j$
exactly for all associations j. The condition of orthogonality can be met only to the extent
that the input patterns differ. Similar patterns give rise to 'crosstalk' that will cause an
error, $err_j$, that will modify the $r_j$ calculated by $W\Phi_j$. A further source of complication is
noise $n$ that might be included with the input vector, i.e. $\Phi_j + n$. So
$W(\Phi_j + n) = r_j + err_j + Wn$. So long as $err_j + Wn$ are not sufficiently offsetting to affect
the LTF, they do not affect the network. That is, writing $\rho_j$ for the LTF of responder j,
$\rho_j(\sum_i w_{ij}\phi_i + err_j + \sum_i w_{ij}n_i) = \rho_j(\sum_i w_{ij}\phi_i)$.
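The construction of the connection matrix can be sketched directly from the formula W_i = r_i Phi_i^T / ||Phi_i||^2. The first pair below is the one from the worked example; the second pair is hypothetical, chosen only so that its input is orthogonal to the first, and the helper names are illustrative.

```python
def outer_product_memory(phi, r):
    """Return W = r phi^T / ||phi||^2 as a list of rows (one row per responder)."""
    norm_sq = sum(p * p for p in phi)
    return [[r_k * p / norm_sq for p in phi] for r_k in r]


def mat_add(a, b):
    """Element-wise sum of two matrices stored as lists of rows."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def mat_vec(matrix, vector):
    """Matrix times column vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


phi_1, r_1 = [1, -1, 1, -1], [1, -1, 1]   # pair from the worked example
phi_2, r_2 = [1, 1, 1, 1], [-1, -1, 1]    # hypothetical pair, phi_2 orthogonal to phi_1

W_1 = outer_product_memory(phi_1, r_1)
W = mat_add(W_1, outer_product_memory(phi_2, r_2))

print(W_1[0])             # first row of the example matrix
print(mat_vec(W, phi_1))  # recalls r_1 because phi_1 and phi_2 are orthogonal
print(mat_vec(W, phi_2))  # recalls r_2
```

If phi_2 were not orthogonal to phi_1, each recall would pick up a multiple of the other response, which is the crosstalk error described above.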

The above described networks are called feed forward networks because the W are
constructed to generate a specific target output without iteration (i.e. WFj = rj). The 'to
be associated' input vectors (which in general are not the target output vectors) are fed
forward through the network to achieve that end. Feedback networks, on the other
hand, have no target and feed the output from one iteration back into the system to
allow the system to reach its own equilibrium. This results in associating the original
input with an output determined by the system. Feedback systems exhibit many
desirable characteristics. Assume, W initially consists of random values. As each Fj
generates a rj (assuming that we can ignore the errors), those values are fed back as
input, allowing further modification of W. To see the effects of this mechanism, note
that such a feedback system can be represented as a dynamic system which attains the
state Wrj = kjrj where the kj is a scaler proportional to the number of times that pattern
Fj has been presented to the system (Anderson et al, 1977). From linear algebra the kj
can be recognized as the eigenvalues and the rj as the associated eigenvectors of W.
The largest value (or response) of ||Wx|| is achieved when x is the eigenvector rj
corresponding to the largest eigenvalue kj, of W. The second largest response
corresponds to the second largest eigenvalue/eigenvector pair and so on. So the
response of W to an input is sensitized to the rj. When an input pattern. x, contains
elements of an eigenvector (i.e. is to some extent a linear combination of eigenvectors of
W), those vectors will be more amplified on passing through Wx than other vectors.
Repeatedly feeding the result of one iteration into the next results in the generation of
some eigenvector of the system, i.e. the output of some rj. In the association of (Fj, rj ),
rj is picked by the system and must be subsequently associated with the desired output
(if that is the intended purpose of the system).
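
The amplification argument above can be tried out numerically. The following is a minimal sketch, not the author's procedure: it assumes W is built as a sum of outer products of two orthogonal stored patterns weighted by presentation counts (so that the counts are the eigenvalues), and it renormalizes x after each pass, an added step that only keeps the numbers bounded. The names `outer_sum`, `feed_back` and the sample patterns are hypothetical.

```python
import math

def mat_vec(W, x):
    """Return the product W x for a square matrix W and a vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def normalize(x):
    """Scale x to unit length."""
    length = math.sqrt(sum(v * v for v in x))
    return [v / length for v in x]

def outer_sum(patterns, counts):
    """Assumed construction: W = sum_j k_j r_j r_j^T."""
    n = len(patterns[0])
    W = [[0.0] * n for _ in range(n)]
    for r, k in zip(patterns, counts):
        for i in range(n):
            for j in range(n):
                W[i][j] += k * r[i] * r[j]
    return W

def feed_back(W, x, iterations=50):
    """Repeatedly feed the output W x back in as the next input."""
    x = normalize(x)
    for _ in range(iterations):
        x = normalize(mat_vec(W, x))
    return x

# Two orthogonal stored patterns; r1 was "presented" three times, r2 once.
r1 = normalize([1.0, 1.0, 0.0, 0.0])
r2 = normalize([0.0, 0.0, 1.0, 1.0])
W = outer_sum([r1, r2], counts=[3, 1])

# The input resembles r2 more than r1, yet iteration settles on r1,
# the eigenvector with the larger eigenvalue.
x0 = [0.2, 0.1, 0.9, 0.8]
result = feed_back(W, x0)
print(result)
```

The input here is closer to r2, but because it contains some r1 the more frequently presented pattern wins, which is the sensitization described above.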

In the feedback model, it can be seen that the number of eigenvalues of W sets a bound
on the number of different patterns that may be stored in a given network and that the
largest eigenvalues of that system (hence its sensitivity to input) are associated with the
patterns most frequently seen during the learning process. Note that the feed forward
and feedback networks differ in the way they calculate and use W. In feed forward
networks, W is calculated directly or is obtained by an iterative process of error
correction. In either case, the feed forward network, when presented with an arbitrary
input vector, yields the result in one step. In the feedback model W is calculated directly
(usually as $W = f_1(\sum r r^T)$ for some $f_1$). The desired result, $r_0$, is obtained from some r by
feeding $x = f_2(Wr)$ back into Wx until $x \to r_0$. Feedback networks (as developed by John
J. Hopfield) will be covered more thoroughly below in section 2.4.5.

Note that for all of these networks, as long as there is no non-linear function interposed
between layers in a multiple layer perceptron, there is an equivalent single layer
perceptron. For example if we consider the perceptron in figure 2.6 as a multiple layer
perceptron in which the associator neurons and the associated weight matrix A act as an
additional layer that enters into the calculations, then if the associators simply sum their
inputs (instead of performing a non-linear calculation resulting in a value of 1 or 0 via a
threshold function, or a value on [0,1] via a squashing function), the same perceptron
could be attained with a single layer perceptron that has the weight matrix AW.
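
A small numerical check makes the collapse concrete. This sketch assumes a row-vector convention (the sensor vector s multiplies A on the left, and the result multiplies W), which is the convention under which the combined matrix is the product AW; the matrices and the helper names are illustrative only.

```python
def vec_mat(v, M):
    """Row vector v times matrix M."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

def mat_mul(A, B):
    """Matrix product A B."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

# Hypothetical weights: 3 sensors -> 2 associators -> 3 responders.
A = [[1.0, -2.0], [0.5, 3.0], [2.0, 0.0]]
W = [[1.0, 0.0, -1.0], [2.0, 1.0, 0.5]]
s = [1.0, 0.0, 1.0]

# Associators that merely sum their inputs, followed by the responders.
two_layer = vec_mat(vec_mat(s, A), W)

# The equivalent single layer with weight matrix AW.
one_layer = vec_mat(s, mat_mul(A, W))

print(two_layer, one_layer)
```

Once a threshold or squashing function is applied to the associator sums, the two computations no longer agree in general, which is why the extra layer only adds capability when it is non-linear.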

As regards the feed forward model, Kohonen observed that networks in which the
associators and responders were not all connected gave rise to a connection matrix W,
with zeros in the slots corresponding to the missing connections. He determined that
the minimum number of connections, s, between associators and responders needed to
ensure that the deviation of the calculated response vector $\hat{r}$ from the target response
vector r would not exceed some selected bound was $s > n/d^2$, where n was the length of
the response vector and d was the standard deviation [60] not to be exceeded. For example,
if a standard deviation less than 0.5 was desired and the response vector was 10 neurons
long, then there must be more than 10/0.25 = 40 associator/responder connections.
Anderson investigated the signal to noise ratio, Sn, of the response vector. Sn is the ratio
formed by the square of a measure of the response due to the input pattern divided by
the mean square of a measure of the response due to crosstalk. It is desirable to have as
large an Sn as possible. Anderson determined that Sn = CN/J (= s/J) where C is the
percentage of connectivity of the associators/responders, N is the number of neurons
and J is the number of patterns to be associated. This is an intuition-pleasing result,
indicating that the perceptron works better the more neurons and the more connected
the neurons and degrades in inverse proportion to the number of patterns stored. So the
perceptron seemed to be a simple and powerful scheme that could explain many of the
features of human memory (associative memory, memory degradation, and speed of
recall to name a few). In the 1960s there was little reason to suspect that there were
limits to the kinds of things that could be learned by a perceptron. In a sense the
optimism was justified; with unlimited neurons, all totally connected, there would be no
limit on things that could be associated beyond those imposed by the above limitations.
Of course, unlimited models were not practical, but constrained connection schemes
had the pleasant property of a kind of graceful degradation rather than the all-out
failure that occurs in logical processes. That the imposition of constraints seriously
affects the kinds of things a perceptron calculates and the difficulty with which they are
calculated was demonstrated by Marvin Minsky and Seymour Papert in their book
Perceptrons (Minsky and Papert, 1969). This analysis deals exclusively with feed forward
networks.

[60] Relative standard deviation = $\mathrm{var}(\hat{r})^{1/2}/E(\hat{r})$.
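
The two connectivity results quoted above (Kohonen's bound and Anderson's signal to noise ratio) are simple enough to evaluate directly. The sketch below only restates those two formulas; the function names and the sample values passed to `signal_to_noise` are hypothetical.

```python
def min_connections(n, d):
    """Kohonen's bound: the number of connections s must exceed n / d**2."""
    return n / d ** 2

def signal_to_noise(C, N, J):
    """Anderson's ratio Sn = C * N / J (C = connectivity, N = neurons, J = patterns)."""
    return C * N / J

# The example from the text: response vector of 10 neurons, d = 0.5.
bound = min_connections(n=10, d=0.5)

# Illustrative values only: halving the connectivity halves Sn,
# and doubling the number of stored patterns halves it again.
sn_full = signal_to_noise(C=1.0, N=200, J=20)
sn_half = signal_to_noise(C=0.5, N=200, J=20)
sn_more_patterns = signal_to_noise(C=0.5, N=200, J=40)
print(bound, sn_full, sn_half, sn_more_patterns)
```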

[Diagram: RETINA sensors s1 to s3 feed ASSOCIATORS a1 to a5, each computing a predicate f_ai(X) on its own subset of sensors; the associators feed RESPONSE units r1 to r3 through weights w11 ... w53.]

Figure 2.8. A constrained perceptron


2.5.3 Minsky and Papert's analysis

If unlimited associators and total connection are provided then, for example, every
possible circle that can be described upon the retina can be directly connected to one
dedicated associator whose response can be passed directly on to a responder that
indicates the recognition of a circle. The simple fact that the human brain contains
neither the neurons nor connections required for this sort of brute force recognition
scheme [61] (given the almost unlimited number of things we can recognize under all
manner of distortion, translation and rotation) implies that the brain recognizes things
using other, perhaps logical, processes. This implies that if perceptrons are not to
exceed the connectivity of the brain, they too must use processes that can be explained
as the logical combination of the firing patterns generated in the associators by the
patterns on the retina. Minsky and Papert analyzed the ability of constrained
perceptrons (each associator was connected to a subset of the sensors instead of to every
sensor, see figure 2.8), to calculate the truth of logical predicates concerning the nature
of a figure X (e.g. circle(X), connected(X), odd_number_of_pixels(X) or, in general, r(X)).
The associators were seen as calculating a simple predicate, $f_i(X)$, on some 'local' area, i,
of the retina. This would probably, but not necessarily, be a simple threshold predicate,
e.g. $f_i(X) = \mathrm{truth}(\sum_j a_{ij} s_{ij} > Q)$ where the $s_{ij}$ represents the input from the sensor j of local
area i of X to associator i, and Q is some arbitrary constant. The value of $f_i$ is specified
to be 1 if the predicate is true and 0 if false. The $f_i$ are then combined linearly by a
responder according to the rules of a linear predicate algebra. The firing or not firing
status of the responder yields the truth value of the predicate (in figure 2.8 above there
are many responders; for simplicity Minsky and Papert usually restricted their model to
one responder). That is, the associator outputs are combined to determine r(X) where
$r(X) = \mathrm{truth}(\sum_i w_{ij} f_i(X) > Q)$ in which the ws and Q are chosen, in the course of the
analysis, so that they effect the desired outcome.
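
The two-stage calculation just described can be sketched in a few lines. This is an illustration under assumptions, not Minsky and Papert's construction: the retina has six sensors, a figure X is the set of firing sensors, and the two local areas, the weights and the thresholds are all hypothetical choices.

```python
def truth(condition):
    """1 if the predicate holds, 0 otherwise."""
    return 1 if condition else 0

def associator(X, local_area, a, Q):
    """f_i(X) = truth(sum_j a_ij * s_ij > Q), computed on one local area."""
    total = sum(a_ij * (1 if sensor in X else 0)
                for a_ij, sensor in zip(a, local_area))
    return truth(total > Q)

def responder(f_values, w, Q):
    """r(X) = truth(sum_i w_i * f_i(X) > Q)."""
    return truth(sum(w_i * f_i for w_i, f_i in zip(w, f_values)) > Q)

# Hypothetical constrained perceptron: each associator sees only half the retina
# and fires when at least two of its three sensors are on.
LOCAL_AREAS = [(0, 1, 2), (3, 4, 5)]

def predicate(X):
    f = [associator(X, area, a=(1, 1, 1), Q=1) for area in LOCAL_AREAS]
    # The responder fires only when both associators fire.
    return responder(f, w=(1, 1), Q=1)

print(predicate({0, 1, 3, 5}), predicate({0, 1, 3}))
```

No associator here sees the whole retina, so the predicate is decided purely by a linear combination of local evidence, which is the restriction the analysis goes on to exploit.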

By the means of these LTFs the responders of a perceptron can be shown to be capable
of calculating in one step, any boolean predicate on the associators, save for the
'exclusive or' predicate (⊕) and its complement, the identity predicate (≡). That is for
any boolean combination of the fi there is some combination of ws and Qs such that
$w_1 f_i + w_2 f_j > Q$ will be true when the boolean function is true. For example,
truth($f_1 \wedge f_2$) can be determined by truth($1f_1 + 1f_2 > 1$). In the case of the boolean operations ⊕
and ≡ the LTF will have second order elements, e.g. truth($f_1 \oplus f_2$) can be determined by
truth($f_1 + f_2 - 2f_1f_2 > 0$). Since the second order term has to be calculated before it
can be combined with the other terms another associator layer is required. This means
that the single associator layer perceptrons cannot perform all logical operations. A
solution is to provide another layer of associators between the original layer and the
responders, or to allow modification of the synaptic weights, a, between the sensor
layer and the associator layer (see figure 2.9 below). Such an extra layer is often added
to perceptrons and is sometimes referred to as a hidden layer. The thought might occur
that as many associator layers may be required as the order of the LTF, but this is not
the case. Rather, the order of a predicate indicates the maximum number of sensors that
may be required in a subset of sensors in order that the truth of the predicate can be
determined (as regards the determination of some predicate, it relates to the necessary
connectivity between the sensors and associators). Figure 2.8 above has overlapping
subsets that range in size from 5 to 9 sensors. If that configuration of subsets was
necessary and sufficient to recognize some predicate on X then we would observe that
the associated LTF was of order 9. Figure 2.9 below depicts how the 'and' (∧), 'or' (∨),
'exclusive or' (⊕), and 'identity' (≡) functions can be calculated by a perceptron. Note
that the ⊕ and ≡ functions require each associator to be attached to two sensors and are,
consequently, second order predicates.

[61] When arrangements of cells are proposed as directly representing some product of the
senses they are sometimes referred to as 'granny' cells, an allusion to the idea that everyone
has a cell (or set of cells) dedicated to recognizing his or her grandmother.
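
The threshold expressions for the four predicates can be checked by enumeration. In this sketch the 'and' and 'exclusive or' expressions are the ones given in the text; the 'or' and 'identity' thresholds are assumptions chosen to be consistent with them. The small search at the end is only a demonstration over a grid of integer weights, not a proof, although the exclusive-or predicate is in fact not computable by any first order threshold function.

```python
from itertools import product

def truth(condition):
    return 1 if condition else 0

def and_ltf(f1, f2):
    return truth(1 * f1 + 1 * f2 > 1)

def or_ltf(f1, f2):
    return truth(1 * f1 + 1 * f2 > 0)            # assumed threshold

def xor_ltf(f1, f2):
    return truth(f1 + f2 - 2 * f1 * f2 > 0)      # second order term f1*f2

def equiv_ltf(f1, f2):
    return truth(2 * f1 * f2 - f1 - f2 > -1)     # assumed: negated xor expression

INPUTS = list(product((0, 1), repeat=2))         # (0,0), (0,1), (1,0), (1,1)

def first_order_solves(target, grid=range(-3, 4)):
    """Search a grid for w1, w2, Q with truth(w1*a + w2*b > Q) == target(a, b)."""
    return any(
        all(truth(w1 * a + w2 * b > q) == target(a, b) for a, b in INPUTS)
        for w1, w2, q in product(grid, repeat=3)
    )

for name, fn in [("and", and_ltf), ("or", or_ltf), ("xor", xor_ltf), ("equiv", equiv_ltf)]:
    print(name, [fn(a, b) for a, b in INPUTS], first_order_solves(fn))
```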

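[Diagram: four small threshold networks with connection weights of +1 and -1, one each for ∧ (threshold Σ > 1), ∨ (threshold Σ > 0), ⊕ and ≡; the ⊕ and ≡ networks pass through an intermediate layer of threshold units.]

Figure 2.9: First order predicates x ∧ y and x ∨ y, second order x ⊕ y and x ≡ y.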

The larger share of Minsky and Papert's analysis was directed at determining the
difficulty with which things could be represented (or recognized, or evaluated,
depending upon the predicate and point of view) by a perceptron. To this end, they
investigated the order of the LTF required to characterize various predicates and they
compared the computational capacity of perceptrons on some selected predicates
against other computational schemes (like those that would be used on a serial
computer). In general, the larger the required order of an LTF, the less efficient a
perceptron is at evaluating the predicate. When the order becomes as large as the
number of sensors in the retina then the required perceptron approaches the maximally
connected version. So Minsky and Papert were determining which things could or
could not be calculated by a perceptron without requiring huge arrays of associators,
an impossibly thick connection scheme, or ridiculously large coefficients in the LTR.
The usefulness of 'inefficient' perceptrons is limited by the ability of technology to
produce immense and highly connected machines and the sufficiency of alternative
methods of calculation. Following are some results regarding problems of predicate
representation on a perceptron.

A predicate whose LTF is first order can only make assertions that depend upon the
absolute coordinates of sensors in the retina or that compare the number of firing
sensors in subsets of a retina. E.g. such an LTF can determine predicates such as truth(the
number of sensors in X is greater than 10) or truth(there are more sensors on the left side
of the retina [that is, less than some retina coordinate] than on the right side of the
retina). In particular, (constrained) first order LTFs cannot represent predicates that
make assertions about some invariant feature of the figure in the retina. E.g. a first order
LTF could not determine truth(the figure in the retina [no matter its location] forms an
X). In deriving this result they used a 'group invariance theorem' that stated that, if an
LTF could be written for predicates about invariant properties (such as the recognition
of arbitrarily translated and rotated figures), the coefficients in that LTF must rely only
upon the equivalence classes into which such (group) transformations partition the
various figures. The coefficients in the LTF must be the same for all the terms that
represent components in the same equivalence class (e.g. an X in the upper left corner of
a retina is in the same equivalence class as an X in the lower right corner of the retina
and the terms representing either in an LTF must exhibit the same coefficients).

They then considered some particular predicates, starting with parity; truth(there are an
odd number of sensors in figure X). Parity on a retina is an invariant function (it doesn't
matter where on the retina the firing sensors are located in order that there be an odd or
even number of them). Then, because of the group invariance theorem, an LTR for
parity can be represented as a linear function of variables each of which represents the
number of different ways in which that given number of sensors can be firing in the
figure X. All such variations will have the same coefficient and so can be gathered
together to form the LTR $\sum_i a_i N_i(|X|) > 0$ (where $0 \le |X| \le |R|$, $N_i$ is the variable
representing all of the figures in the ith equivalence class and $a_i$ is a function of the
number of figures in $N_i$). The value of each of these newly created variables $N_i$ (there
will be one for each $i \le |X|$) is a function of the combination of |X| things taken i at a
time; specifically, it is a polynomial function of |X| [62]. The LTR may then be treated as a
polynomial in |X|, $P(|X|) = \sum_i a_i N_i(|X|)$. The (odd) parity predicate requires that for
figures with odd(|X|), $P(|X|) > 0$, and for even(|X|), $P(|X|) \le 0$. Since P(|X|) must work for
$0 \le |X| \le |R|$ it must be that P(|X|) changes signs each time |X| changes from an odd to an
even number or from an even to an odd number. The algebra of polynomials states that
the order of a polynomial must be at least as great as the number of sign changes it
makes, so P(|X|) has order at least as large as |X| can be. Since the figure may encompass
the entire retina (R) then the order of a polynomial function that recognizes parity on a
retina must be at least |R|. That is, the order of the parity function in a perceptron is at
least |R|. So any perceptron that can detect parity must be very well connected. This is a
'not too surprising', perhaps even encouraging result (encouraging in the sense that it
implies the perceptron model is on the right track in its attempt to emulate the human
brain; humans, too, are not very good at looking at a large number of items and
immediately detecting the parity). It is possible to use this result to show that detecting
connectedness in a figure is also a difficult problem. Consider figure 2.10.
[62] The combination of |X| things taken j at a time is $\binom{|X|}{j} = |X|(|X|-1)(|X|-2)\cdots 1/((|X|-j)!\,j!)$.
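
The sign-change argument can be illustrated with one concrete parity polynomial. The coefficients used here, $a_i = (-2)^{i-1}$, are an assumption made for the sketch (they are one known choice that works, not necessarily the one Minsky and Papert used); $N_i(|X|)$ is taken to be the binomial coefficient of footnote 62, and the retina size `R = 8` is arbitrary.

```python
from math import comb

def parity_polynomial(k, order):
    """P(k) = sum over i = 1..order of a_i * C(k, i), with assumed a_i = (-2)**(i - 1)."""
    return sum((-2) ** (i - 1) * comb(k, i) for i in range(1, order + 1))

R = 8  # hypothetical retina size

# With terms up to order |R| the polynomial is positive for odd |X| and not positive for even |X|.
values = [parity_polynomial(k, order=R) for k in range(R + 1)]

# The predicate P(|X|) > 0 flips each time |X| grows by one sensor.
flips = sum(1 for a, b in zip(values, values[1:]) if (a > 0) != (b > 0))

# Dropping the highest order term breaks this particular polynomial on the full retina.
truncated_at_full_retina = parity_polynomial(R, order=R - 1)

print(values, flips, truncated_at_full_retina)
```

The truncated case shows only that this particular choice of coefficients fails when the top term is removed; the general statement that no lower order polynomial can succeed is the sign-change argument given in the text.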
In figure 2.10 there are two retinas, one representing parity (say 'P'), consisting of three
neurons, and the other represents a connection scheme consisting of an eighteen by
sixteen array of neurons (call it 'C'). The retinas are arranged as though they were two
layers in a perceptron, that is, whenever sensors are turned on in 'P' they are to 'fire' into
the sensors in 'C' (but the lines are not shown because they would be confusing). The
effect of a neuron in 'P' firing into 'C' is constrained to the three rows of 'C' indicated by
curly brackets. If the neuron is firing the cells marked 'd' are turned on in that row
(indicated by medium shading). If the neuron is not firing then the cells marked 'u' are
turned on.

[Diagram: a SWITCH inset showing its connected and not connected settings, above retina 'C' with its rows of 'u' and 'd' switch cells, the labelled points P0 to P3, and curly brackets marking the rows driven by each 'P' neuron.]

Figure 2.10: Reducing a connectedness problem to a parity problem.

The 'd' stands for 'down' and the 'u' stands
for 'up', to conform to the metaphor of a switch. A 'switch' is diagrammed at the top of
figure 2.10. The dark squares in 'C' represent the figure which is either connected (in the
case that a path of dark sensors can be traced through the switches and possibly around
to the other side of the figure) or not connected. We can imagine an electric current
being applied at one point (say p1) and being detected at another (say p2), indicating
that p1 is connected to p2 (we assume the current cannot travel diagonally from neuron
to neuron but must cross on an edge). The connection scheme may be extended to as
large vectors 'P' and 'C' as desired by adding rows of switches and their associated 'P'
neuron. Each row of switches contains four switches; a row of switches is constructed
from three rows of neurons. Each added row of switches also requires a connecting line
from one side of the array to the other. It can be seen then that, if Rp is the number of
neurons in 'P' then the height of 'C' is $6R_p$ while the width is $2R_p + 11$. The size, Rc, of
'C' is then given by $R_c = 12R_p^2 + 66R_p$. Or, to look at it from the point of view of the
parity retina, Rp is of the order $R_c^{1/2}$.

Every time a sensor, i, in P is turned on (or off) the entire gang of four switches
associated with that neuron changes setting (they are always either all up or all down).
As can be easily confirmed, whenever there are an odd number of 'P' sensors firing
(including none) then p0 of the figure in 'C' is connected to p3. Whenever an even
number of parity neurons are firing then p0 is not connected to p3. Other such relations
can be found but this one is all that is needed to establish a relationship between 'P' and
'C'. The relation remains true no matter the number of switches in 'C'. Obviously, the
connectedness predicate on 'C' must be of as high an order as parity on 'P' (which we
determined above was of the order $R_c^{1/2}$) or we would be able to construct a parity
detector for 'P' with a lower order than Rp (i.e. just convert the 'P' into 'C' and determine
connectedness on it). That is impossible because we know the parity predicate is of
order Rp. This is a nice result because it is easy to check. But, in fact, Minsky and Papert,
by developing some theorems which we won't go into, determined that the order of the
connectedness predicate was at least kRc in which (0 < k < 1). They suspect k = 1/2.
Predicates whose order depends upon R are considered 'not of finite order' because that
order grows with R without bound. So the parity and connectedness predicates are not
of finite order.

A topologically invariant predicate on a figure is one whose truth value remains
unchanged through topological distortions, so long as the distortions do not change the
number of components (unconnected parts) in the figure or the number of holes in those
components. Figure 2.10 above contains two components and three holes, so a
topologically invariant predicate on that figure will maintain its truth value for any 'two
component/ three hole' figure on retina 'C'. The definition of 'topological invariance'
depends upon the constancy of its connectedness while being distorted. It is not
surprising then, that the orders of topologically invariant predicates are related to the
order of the connectedness predicates on the same figures. Minsky and Papert were able
to show that, with the exception of Euler numbers [63] (which may be order 4), there are
no topologically invariant predicates of finite order. From one perspective this indicates
a strength of perceptrons in that figures with different Euler numbers can be easily
distinguished. If circles are expected, a simple perceptron might verify the existence of
something that could pass as a circle even though it may be very distorted (i.e. a one
component, one hole figure can be loosely referred to as a circle). From another
perspective it hints that perceptrons might have to be highly connected to recognize the
composition of a figure (e.g. is one component of a figure inside or outside of another
component) and/or to distinguish between two topologically indistinguishable figures
(e.g. is the figure an 'O' or a 'D'). It would appear that predicates that depend on
topological invariance are of limited value.

[63] An Euler number is an invariant relation between the number of faces, vertices, and lines in
a figure whose components contain no holes. For a particular holeless component the Euler
number is always 1 and is calculated as faces – edges + vertices. When there are holes in a
component, the Euler number is calculated by decomposing the figure to remove the holes.
The removal of a hole creates a new component so, in effect, the Euler number of a figure is
the sum of all of its components plus all of the holes in those components. Predicates
involving this number can be calculated by a fourth order threshold function.

On the other hand, geometric patterns (e.g. letters, circles, polygons, etc.), under various
geometric transformations such as translation and rotation can be recognized by low
order LTFs. The idea is to describe a figure by a composition of masks [64] of a few (1, 2, 3
or 4) elements. A 'signature', consisting of a particular composition of kinds of masks
can then serve to identify the figure wherever it may be on the retina (i.e. by an LTF
$\sum_i a_i f_i > Q$ where the $a_i$ represent the number of low order masks of configuration $f_i$).
The masks making up a particular component of a figure are called its geometric
spectra. For one or two element masks two different components might have the same
spectra, however, three element masks have sufficient differentiating power to uniquely
identify a component. Figure 2.11 below, is an example of a scheme for recognizing a
one component, two sensor, diagonal vector that slopes downward to the right. Any
such vector and only such a vector will be recognized by that particular second order
perceptron (additional components could be handled by the addition of more
responders). The only mask needed to recognize the component is a two element mask
arranged to cover the vector itself. In order that only the vector be recognized the
perceptron requires a number of connections whose purpose is to effect a rejection
when there are extraneous points on the retina. This means that the order of the LTF for
recognizing a figure under transformation cannot be limited to that required to
recognize the largest mask in the spectra. However it can be shown that the order of the
perceptron is $O_g < 2nO_m$ where n is the number of components to be recognized and
$O_m$ is the maximum order required to recognize the largest of the spectral masks. So in the
example the order can be kept less than 2*1*2 = 4 and is in fact 2. Unfortunately, while
this perceptron exhibits a low connectivity it requires a large number of associators and
responders. In general, using such a scheme can require a number of associators that
depends upon the highest order of the spectra, and the size of the retina (the relation
may be something like: associators required = $c|R|^{k-1}$ where c is a constant and k is the
highest order of spectra). The efficiency gained by lower connectivity is partially lost in
required additional associators (the example of figure 2.11 required $1.33|R|^{2-1} = 12$
associators for the simplest spectra possible, the simplest figure possible and the third
smallest square retina possible). Further, the scheme requires that whenever there is
more than one component on the retina, they may not overlap or the order will become
non-finite. A feeling for why this happens comes by considering a scheme for rejecting
extraneous points in a figure of overlapping components.

[64] A mask is just the conjunction of some specified subset of sensors on the retina. It can be
thought of as an overlay (or mask) that covers the retina and which has some pattern of holes
over some of the sensors. All of the holes have to be black (or firing) for the mask to be
true.
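
The idea of a translation-invariant signature built from low order masks can be sketched as follows. This is a simplified illustration, not the perceptron of figure 2.11: it counts only one and two element masks, groups the two element masks by the relative displacement of their sensors (an assumed way of forming the translation classes), and compares counts directly rather than through a threshold function. All names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def spectrum(figure):
    """Count the satisfied two element masks of a figure, grouped by displacement."""
    counts = Counter()
    for (r1, c1), (r2, c2) in combinations(sorted(figure), 2):
        counts[(r2 - r1, c2 - c1)] += 1
    return counts

def translate(figure, dr, dc):
    """Move every firing sensor of the figure by (dr, dc)."""
    return {(r + dr, c + dc) for r, c in figure}

# The target: two sensors on a diagonal sloping downward to the right.
DIAGONAL = {(0, 0), (1, 1)}
TARGET_SPECTRUM = spectrum(DIAGONAL)

def recognizes(figure):
    """Accept only figures with two firing sensors and the target two element spectrum."""
    return len(figure) == 2 and spectrum(figure) == TARGET_SPECTRUM

moved = translate(DIAGONAL, 1, 1)
anti_diagonal = {(0, 1), (1, 0)}
with_extra_point = DIAGONAL | {(2, 0)}
print(recognizes(moved), recognizes(anti_diagonal), recognizes(with_extra_point))
```

The check on the number of firing sensors plays the role of the rejecting connections described above: without it, any figure that merely contains the diagonal pair would be accepted.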

Another problem associated with LTFs is the possibility of very large coefficients. If the
polynomial function P(|X|), described above as an LTF for detecting parity on the
retina, is inspected for the magnitude of its coefficients they are found to be $|a_i| \ge 2^{i-1}$.
Since there are |R| coefficients the largest will be at least $2^{|R|-1}$. It does not take a large
retina to quickly exceed the ability of any machine to store the required coefficients.
Further, this unbounded growth of the coefficients affects the ability of a perceptron to
learn. As noted, learning in a perceptron is a hill climbing process in which finite sized
steps, chosen by rules as discussed above, carry WF inexorably to r. Larger coefficients
mean longer training periods. In the case of the parity function this can be shown to
require n steps where $5^{|R|} \le n \le 10^{|R|}$. The parity LTF detects parity by a process that
Minsky and Papert have termed stratification. Stratification relies upon the growth of the
coefficients in the sequence of terms in the threshold function.

[Diagram: a perceptron with two threshold units (Σ > 1 and Σ > 0) and connection weights of +1, -1 and +2, labelled as recognizing its figure anywhere on the retina.]

Figure 2.11: A perceptron that relies upon geometric spectra

In a sequence of terms
with alternating signs such as $\sum_i 2^i f_i(X) > 0$, the last term in the sequence outweighs the
sum of all of the previous terms and so determines the truth of the threshold relation.
Minsky and Papert were able to show that a similar scheme could be applied to other
predicates (e.g. symmetry, translation in a plane, dilation) to produce LTFs of low order
but with unwieldy coefficients. The existence of these low order solutions coupled with
the fact that a perceptron will find a solution if there is a solution implies that the
learning process in many cases might be a long one. In a machine that is expected to
exhibit intelligence, a trait marked by continuous learning and testing of predicates, the
process might be impractically long.
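
Both halves of the coefficient problem are easy to verify with exact integer arithmetic. The sketch below checks the property that stratification relies on (a term with coefficient $2^n$ outweighs all of the earlier terms put together) and evaluates the lower bound $2^{|R|-1}$ for a few retina sizes; the sizes chosen are arbitrary examples.

```python
def last_term_dominates(n):
    """True when 2**n exceeds the sum of 2**i over all earlier terms i < n."""
    return 2 ** n > sum(2 ** i for i in range(n))

def largest_coefficient_bound(retina_size):
    """Lower bound 2**(|R| - 1) on the largest parity coefficient."""
    return 2 ** (retina_size - 1)

dominates_everywhere = all(last_term_dominates(n) for n in range(1, 65))

# Hypothetical retina sizes: an 8 x 8 retina already needs a 64 bit magnitude.
for size in (9, 16, 64):
    bound = largest_coefficient_bound(size)
    print(size, bound, bound.bit_length())
```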

On the encouraging side, many of the predicates that Minsky and Papert exhibited as
malevolent examples (e.g. parity, connectedness, topological invariance, overlapping
figures and context) are problems with which humans have difficulty (consider figure
2.12). And it was becoming clear that the 'crosstalk' between figures, instead of being
undesirable mathematically induced errors, was responsible for the ability of
perceptrons to recognize never before seen figures as belonging to a known category.
This is a form of generalization which, when it occurs in a highly trained network, can
take the form of a natural kind of induction that works well in an ordered world...e.g.
large, fanged and clawed, striped animals eat people so large, fanged and clawed black
animals eat people! This is the bent but effective logic that people exhibit.

[Four panels, each posing a question about a drawing: "Without counting, are there an odd or even number of squares above?"; "Without tracing, is the above figure connected or unconnected?"; "Is the figure two triangles or six triangles and a pentagon?"; "Is there a large equilateral triangle in here?"]

Figure 2.12: Problems with human perception

As L.N. Cooper pointed out (Cooper, 1973):

"The animal philosopher sophisticated enough to argue 'the tiger ate my friend
but that does not allow me to conclude that he might want to eat me' might then
be a recent [evolutionary] development whose survival depends on other less
sophisticated animals who jump to conclusions"

Minsky and Papert suggested this ability to find the closest match was a true strength of
perceptrons when they pointed out that finding a 'close or closest match' (as opposed to
the normal exact match search) is a very difficult problem for serial/symbolic processes
but one at which perceptrons excel. The deeper understanding of how neural networks
worked, the discovery of possible limitations to neural network capabilities and the
dramatic development of serial/symbolic techniques served to dampen the enthusiasm
for these models.

2.5.4 Backward propagation

It was known that multiple layer perceptrons would be more capable than single layer
machines since single layer machines are most efficient on predicates on a single
convex figure while multiple layer machines could form predicates on combinations of
convex figures. This would allow for the testing of predicates on any figure (i.e. any
figure can be represented as a linear combination of convex figures) [65]. The problem was
that there was no known weight adjustment algorithm for the multi-layer model
equivalent to the one for the single layer network (the middle layers of a multi-layer
network have no target output from which to extrapolate backwards for weight
adjustment purposes). It turns out that such a method had been devised by P. J. Werbos
in 1974 (Werbos, 1974). The method was not widely known until rediscovered by D. B.
Parker in 1982 (Parker, 1982), and clearly presented in a section of Parallel Distributed
Processing, Vol. I, by Rumelhart, Hinton and Williams in 1986 (Rumelhart, Hinton,
Williams, 1986). Over the decade of the seventies the symbolic-logic approach to
artificial intelligence had not lived up to its hype and times had become ripe for a
change in emphasis. Despite the fact that nothing in the new methods invalidated the
results of Minsky and Papert (other than demonstrating the ability of the multi-layered
models to handle exclusive-or predicates and concavity), the backward propagation
method, along with J. J. Hopfield's feedback network (see section 2.4.5) were hailed as
breakthroughs and brought on a renewed interest in neural networks.

Back propagation works as the name would indicate, by setting the weights associated
with each level by calculating a delta value for the last output level and then
propagating it backwards from layer to layer. The process as developed for multi-
layered models is termed the generalized delta rule. The successful application of the
generalized delta rule depends upon the employment of a continuous non-linear
activation function such as the squashing function. This means that there is a possibility
of the convergence process getting caught in a local minimum, but in many situations it
works fine. The problem in calculating the delta for a hidden (or middle) layer is that,
unlike the output layer, there is no target value equivalent to the output layer's $r_j$ from
which to determine the error $(r_j - \hat{r}_j)$. The solution is to propagate the delta values
backwards by using the existing weights associated with the layer that has delta values.
To see how this works, assume that the associator layer in figure 2.6 actively
participates in the calculation of the output, e.g. the $f_i$ is non-linear, say
$f_i = \mathrm{squash}(\sum_k a_{ki} s_k)$. That is, the associator level is now to be a hidden layer. The delta values
for the elements in the associator layer may then be calculated by
$\delta_{ki} = (\sum_j w_{ij}\delta_{ij})\,\mathrm{squash}'(\sum_k a_{ki} s_k)$, i.e. each delta in the associator level is just the back-weighted
sum of the deltas for the next layer (in this case the output layer), squashed by an
amount appropriate to this layer. Then the weights for this layer can be adjusted
according to the delta rule specified above, i.e. $a_{ki}(t + 1) = a_{ki}(t) + \eta\,\delta_{ki} s_k$. The process
can be carried out through as many levels as required, allowing multi-layered networks
of any desired depth.

[65] Recall that each layer must have its own non-linear (e.g. threshold, sigmoid, etc.) function
or the multi-layer network is just the equivalent of some single layer network.
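
The backward pass just described can be sketched for a network with one hidden (associator) layer. Several things here are assumptions made for the illustration: the squashing function is taken to be the logistic function, the output deltas are taken to be the error times the derivative of the squashing function, and the weights, input, target and learning rate η (`eta`) are arbitrary.

```python
import math

def squash(x):
    """Assumed squashing function: the logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

def squash_prime(x):
    y = squash(x)
    return y * (1.0 - y)

def forward(s, a, w):
    """Return hidden sums, hidden outputs f, output sums and outputs r."""
    hid_net = [sum(a[k][i] * s[k] for k in range(len(s))) for i in range(len(a[0]))]
    f = [squash(x) for x in hid_net]
    out_net = [sum(w[i][j] * f[i] for i in range(len(f))) for j in range(len(w[0]))]
    return hid_net, f, out_net, [squash(x) for x in out_net]

def deltas(s, target, a, w):
    """Output deltas, then hidden deltas as back-weighted sums of the output deltas."""
    hid_net, f, out_net, r = forward(s, a, w)
    d_out = [(target[j] - r[j]) * squash_prime(out_net[j]) for j in range(len(r))]
    d_hid = [sum(w[i][j] * d_out[j] for j in range(len(d_out))) * squash_prime(hid_net[i])
             for i in range(len(f))]
    return f, d_out, d_hid

def train_step(s, target, a, w, eta=0.5):
    """Apply the delta rule to both layers, using deltas computed before any update."""
    f, d_out, d_hid = deltas(s, target, a, w)
    for i in range(len(f)):
        for j in range(len(d_out)):
            w[i][j] += eta * d_out[j] * f[i]
    for k in range(len(s)):
        for i in range(len(d_hid)):
            a[k][i] += eta * d_hid[i] * s[k]

def error(s, target, a, w):
    r = forward(s, a, w)[3]
    return 0.5 * sum((t - y) ** 2 for t, y in zip(target, r))

# Hypothetical network: 2 sensors -> 2 associators -> 1 responder.
a = [[0.1, -0.2], [0.4, 0.3]]
w = [[0.5], [-0.6]]
s, target = [1.0, 0.0], [1.0]

before = error(s, target, a, w)
train_step(s, target, a, w)
after = error(s, target, a, w)
print(before, after)
```

A single step moves the output toward the target and so reduces the error slightly; repeated presentation of all patterns is what makes training lengthy.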

This is certainly an improvement over single layer models but it should be noted that
each layer adds a serial step to the processing done by the network and so in some sense
begins to defeat the benefits of parallel processing. A second problem with back
propagation is that training can become a very lengthy process. In fact, with the
squashing function of figure 2.7, the system can get hung up on very small incremental
changes of the weights (which can in theory become infinitely small as the sum Σ becomes large)
and so take thousands or even tens of thousands of iterations to converge. And, the
more complicated networks have other problems.
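
The shrinking weight changes come from the derivative of the squashing function, which multiplies every delta. The sketch below assumes the squashing function is the logistic function (the function of figure 2.7 may differ in detail) and simply tabulates its derivative for growing values of the summed input.

```python
import math

def squash(x):
    """Assumed squashing function: the logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

def squash_prime(x):
    """Derivative of the logistic function, squash(x) * (1 - squash(x))."""
    y = squash(x)
    return y * (1.0 - y)

# The derivative peaks at a summed input of 0 and decays rapidly as the sum grows.
sums = [0, 2, 5, 10, 20]
slopes = [squash_prime(x) for x in sums]
for x, slope in zip(sums, slopes):
    print(x, slope)
```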

The use of multiple layers and non-linear functions such as squashing functions (as well
as feedback systems covered below), give rise to separation surfaces that are no longer
planes and consequently may have local maximums/minimums that can trap the hill-
climbing (or depth-seeking) training vector (it is immaterial whether we speak of hill-
climbing or depth-seeking so we shall stick with the latter as the more commonly found
metaphor). It can falsely appear that a network cannot be trained to remember some
input as the training process gets stuck at some undesirable association. There are
statistical methods by which such local minima can be avoided. The best known of these
is the Boltzmann method (Hinton and Sejnowski, 1986), so called because it makes use
of a stochastic process modeled on the Boltzmann distribution (that characterizes
annealing processes in metals):

$$p(E) \propto e^{-E/kT},$$

where p(E) is the probability of being in energy state E, T is the temperature of the
metal, and k is Boltzmann's constant. All of this is in analogy with the statistical
mechanics concept of a system's energy surface (see section 2.4.5). The idea is to make
random weight changes in W, and changes that result in reducing the error function ||r -
r|| are kept. Even on some occasions when the error function is not smaller the change
may be kept depending upon the above relation. The idea is that even if the process gets
stuck in a local minimum, it will have the opportunity to escape. Gradually the
opportunity is reduced so that at first the calculated output vector gets to jump around
quite a bit while descending down the big gradients (thus avoiding local minima), but
eventually is constrained to stick with one solution. The process starts by picking
T0 as a large value and an artificial k appropriate to the application. p(E) is now
interpreted as the probability of change in the error ||r -r|| which it is desired to
minimize. A pattern to be associated is input and the error generated. A random change
in the weight matrix is made and the error recalculated. If the error is smaller than
before, the change is retained. If the error is not smaller then p(E) is calculated and
compared to a random number between 0 and 1. If the random number is smaller than
p(E) the change is retained anyway. The process is repeated with a smaller T.
Unfortunately, it has been determined (Geman and Geman, 1984) that if this process is to
guarantee convergence, the temperature T must decrease no faster than T(t) = T0/log(1
+ t). This is a slow change that implies a long convergence process. By substituting the
Cauchy distribution66 for the Boltzmann distribution (Szu and Hartley, 1987) the
convergence can be sped up to be inversely linear (T(t) = T0/(1 + t)). Even better
results have been obtained by Phillip Wasserman (Wasserman, 1988) by combining both
the back-propagation and the Cauchy methods. But the convergence can still be very
slow.
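
The accept/reject loop and the two cooling schedules described above can be sketched as follows. This is an illustrative sketch only: the uniform random perturbation, the step_size and seed parameters, and the habit of remembering the best weights seen are assumptions added here, and the Cauchy variant is represented only by its faster schedule, not by Cauchy-distributed steps.

```python
import math
import random


def boltzmann_temperature(t0, t):
    """Logarithmic schedule T(t) = T0 / log(1 + t), for t >= 1."""
    return t0 / math.log(1.0 + t)


def cauchy_temperature(t0, t):
    """Inversely linear schedule T(t) = T0 / (1 + t)."""
    return t0 / (1.0 + t)


def anneal(error, weights, t0, k, steps, schedule, step_size=0.5, seed=0):
    """Randomly perturb the weights, keeping improvements always and
    worsening changes with probability exp(-increase / (k * T))."""
    rng = random.Random(seed)
    current = list(weights)
    current_err = error(current)
    best, best_err = list(current), current_err
    for t in range(1, steps + 1):
        temp = schedule(t0, t)
        candidate = [w + rng.uniform(-step_size, step_size) for w in current]
        cand_err = error(candidate)
        increase = cand_err - current_err
        if increase < 0 or rng.random() < math.exp(-increase / (k * temp)):
            current, current_err = candidate, cand_err
            if current_err < best_err:
                best, best_err = list(current), current_err
    return best, best_err
```

Passing boltzmann_temperature or cauchy_temperature as the schedule argument switches between the slow logarithmic cooling and the faster inversely linear cooling.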

2.5.5 Feedback networks

Suppose there exists a system of equations, dr/dt = f(r), whose state is defined by a
vector rT (t) = [r1(t) r2(t) ... rn(t)] which depends on time. Sufficient evidence for the
convergence of this system to a steady state r0, is the existence of a function E(r) such
that E(r) = 0 for r = r0, while E(r) > 0 and dE(r)/dt < 0 for all r ≠ r0. This is true since if
there exists such a function E, then as time increases E(r) must approach 0 because the
first derivative with respect to time is negative, but since E(r) = 0 only for r = r0, we
must have r → r0, i.e. the system state must converge to r0. Showing that a system must
converge (is stable) by exhibiting an E with the above restrictions is called the second
method of Liapunov after the Russian mathematician who developed the method from
the energy concepts of classical mechanics. Mechanical systems are known to be stable if
their total energy (E > 0) is steadily decreasing (dE/dt < 0). Liapunov generalized from
the energy equation to any positive definite scalar function with continuous first partial
derivatives (normally written as V(x), but which we have denoted as E(r) to coincide
with the energy concept and prior notation)67. By virtue of their positive definite nature,
any quadratic form matrix equation is a candidate Liapunov function. An example of a
quadratic form matrix equation is E(r) = rTWr which may be written as

$$
E(r) = w_{11}r_1^2 + (w_{12} + w_{21})r_1r_2 + \dots + (w_{1n} + w_{n1})r_1r_n
     + w_{22}r_2^2 + (w_{23} + w_{32})r_2r_3 + \dots + (w_{2n} + w_{n2})r_2r_n
     + \dots + w_{nn}r_n^2.
$$
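
A small numerical sketch may help make the Liapunov argument concrete. It assumes, purely for illustration, the system dr/dt = -r (so that r0 = 0) and a hand-picked symmetric positive definite W; neither comes from the text. Along a simple Euler-integrated trajectory, the quadratic form E(r) = rTWr stays positive and falls at every step while r approaches r0.

```python
def quadratic_energy(w, r):
    """Evaluate the quadratic form E(r) = r^T W r."""
    n = len(r)
    return sum(w[i][j] * r[i] * r[j] for i in range(n) for j in range(n))


def simulate(w, r, dt=0.1, steps=50):
    """Euler-integrate dr/dt = -r and record E(r) after every step."""
    energies = [quadratic_energy(w, r)]
    for _ in range(steps):
        r = [ri + dt * (-ri) for ri in r]
        energies.append(quadratic_energy(w, r))
    return r, energies


# Hypothetical example values: W is symmetric with eigenvalues 1 and 3.
W = [[2.0, 1.0], [1.0, 2.0]]
final_r, energies = simulate(W, [1.0, -2.0])
```

For this choice dE/dt = -2E, so E is a valid Liapunov function for the assumed system and the recorded energies shrink monotonically toward zero.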

Assume that r is the state of the response vector in a neural network described by dr/dt
= f(r). The wij are elements of the connection matrix, W, and dE(r)/dt = (drT/dt)Wr +
rTW(dr/dt). Substituting the system equation dr/dt = f(r) into dE(r)/dt then suggests
that the state trajectory of an initial point r located on E(r) depends on r, W, and f(r).
Further, the stability of such a system depends upon E(r) > 0, dE(r)/dt < 0 for every
point r ≠ r0, and the existence of some minimum r0. This can be shown to be the case for
quadratic form E functions in which W is symmetric and in which the diagonal
elements are 0 (Cohen and Grossberg, 1983). A method for constructing appropriate
connection matrices was given by J. J. Hopfield as described below. Actually there can
be many minima in the whole space but there will be some local area associated
exclusively with each. See figure 2.13, which depicts a surface E, and several possible
initial vectors r, and the local minimum to which each would proceed.

66
p(x) = T(t)/[T(t)² + x²]
67
More formally, dr(t)/dt = [ dr1(t)/dt dr2(t)/dt ... drn(t)/dt ]T, ∇E = [∂E/∂r1 ∂E/∂r2 ...
∂E/∂rn], so dE(r)/dt = (∂E/∂r1) dr1(t)/dt + (∂E/∂r2) dr2(t)/dt + ... + (∂E/∂rn) drn(t)/dt
= ∇E dr(t)/dt = ∇E f(r) and the above restriction that dE(r)/dt < 0 translates into ∇E f(r)
< 0. ∇E must exist. The concept of positive definiteness is associated with the idea of a
closed surface in n space so, as E(r) decreases with time, ||r|| must also decrease (Schultz and
Melsa, 1967).

The energy conceptualization is a fruitful one (as pointed out by J. J. Hopfield, a
physicist (Hopfield, 1982, 1983)). The states of the neurons of such a network may be
described as occupying positions in a state space in a manner that is isomorphic to the
description of a physical system that is attempting to find a (locally) minimum energy
state r0. The energy function forms a surface in the state space of the system in which
every neuron is represented by a dimension. Such a space cannot be represented in
three dimensional space but the idea is clear and may be depicted as in figure 2.13. The
idea of an energy surface is simply a carryover from its origin in physical systems. The
hills of the surface represent possible initial states of the state vector and the lowest
points in the valleys represent stable points that may be likened to complete or desired
states. In particular, such a network can be seen as an associative memory in which
some small part of the whole memory to be retrieved (the initial value of r) causes the
whole memory to be retrieved (r → r0). This characterization by Hopfield opened up
the field of neural networks to many of the extensive and ongoing investigations of
physics.

[Figure 2.13. Vertical axis: Information Energy. Horizontal axes: State Variable
dimensions (one for each neuron). Labeled points on the surface: r, r0, and rx.]

Figure 2.13. Depiction in three dimensions of n dimension energy surface.

The networks that Hopfield discussed are feedback networks. Information is stored in
these networks in the connection matrices Wk = Σ pol(ri) pol(rj)T, where the k indicates
the layer (of a possibly multiple layer network) with which W is associated. W, as
before, is just the sum of the covariance matrices of all of the vectors i ≠ j to be
associated. First, however, each element of the vectors to be associated (which normally
has a value of 1 or 0) is converted to 1 or -1 by the function pol(r) = 2r - 1 (that is, the
zeros are converted to minus ones). The flow of information for a one layer network
(say layer L1 with input r1 = Φ and desired response r1 = r10) is Φ → L1 → L1 → L1 →
... → r10. For a two layer network (layers L1 and L2) the flow would be Φ → L1 → L2 →
L1 → L2 → ... → r10 or r20. In the single layer network r10 can be retrieved by initializing
it with an r ← Φ that represents only a portion of that desired memory. The network is
then allowed to relax (feed back into itself until dr/dt = 0), at which point (it is hoped)
r = r10. The two layer network allows the association of any r10 with any r20. These
non-linear networks have some activation function ρ, perhaps ρ = squash. In the single
layer model, the value of r1 is calculated as r1 = ρ(WL1r1). In the two layer model, for
the first layer (L1), r1 = ρ(WL2r2), while for the second layer (L2), r2 = ρ(WL1r1).
Hopfield found that the recall of associated vectors worked well so long as the number
of vectors associated in W did not exceed 0.15n, where n is the length of the r vectors.
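
The storage and relaxation steps for a single layer network can be sketched as below. The sketch makes several assumptions that are not taken from the text: a hard threshold stands in for the activation function, units are updated one at a time until nothing changes, and the function names and the max_sweeps limit are hypothetical.

```python
def pol(r):
    """Convert a 0/1 vector to -1/+1 (zeros become minus ones)."""
    return [2 * x - 1 for x in r]


def build_weights(patterns):
    """Sum the outer products of the polarized patterns, leaving the diagonal at zero."""
    n = len(patterns[0])
    w = [[0] * n for _ in range(n)]
    for p in patterns:
        v = pol(p)
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += v[i] * v[j]
    return w


def recall(w, probe, max_sweeps=20):
    """Relax the network from a partial or noisy probe until the state stops changing."""
    v = pol(probe)
    n = len(v)
    for _ in range(max_sweeps):
        changed = False
        for i in range(n):
            net = sum(w[i][j] * v[j] for j in range(n))
            new = 1 if net >= 0 else -1
            if new != v[i]:
                v[i] = new
                changed = True
        if not changed:
            break
    return [(x + 1) // 2 for x in v]


# Two hypothetical stored patterns and a probe with one bit of the first flipped.
stored = [[1, 1, 1, 1, 0, 0, 0, 0], [1, 0, 1, 0, 1, 0, 1, 0]]
w = build_weights(stored)
restored = recall(w, [0, 1, 1, 1, 0, 0, 0, 0])
```

The two example patterns are deliberately chosen to be orthogonal after polarization, so a probe with one corrupted element relaxes back to the first stored pattern.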

It is possible that the construction of W will lead to a surface E that contains local
minima to which no r0 is deliberately associated. These points represent undesirable
points, local minima of greater energy than r0 in which the state can get "stuck" as it
proceeds towards the desired r0 (see that the vector in the middle of the surface of
figure 2.13 is stuck at rx, a valley higher than r0). Once again the simulated annealing
methods can be used to "bounce" the vector out of the minimum so that it can find the
correct minimum. In fact the annealing process seems much more appropriate here
given the direct analogy of the Hopfield feedback mechanisms with the energy surfaces
of state space representations of physical systems.

So there are several drawbacks associated with feedback networks: 1) the relaxation
process can get stuck in a valley requiring simulated annealing processes, 2) very close
initial points may lead to different stable points, 3) the practical storage capacity is 0.15n,
so for example, to store 100 patterns requires a W matrix of dimension 667 × 667 or
about 1/2 million bits of memory, and 4) feedback vectors are still subject to the
representational limitations of Minsky and Papert. Nevertheless, the mechanism for
storing patterns is simple and the memory limitations may not be significant in the
future.
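
The storage figure quoted in drawback 3 can be checked with a few lines of arithmetic. The sketch assumes, as the text's estimate appears to, that each entry of W is counted as a single unit of memory; the function name and default ratio argument are illustrative only.

```python
import math


def matrix_size_for(num_patterns, capacity_ratio=0.15):
    """Smallest vector length n with capacity_ratio * n >= num_patterns,
    together with the number of entries in the n x n matrix W."""
    n = math.ceil(num_patterns / capacity_ratio)
    return n, n * n


n, entries = matrix_size_for(100)
```

For 100 patterns this gives n = 667 and 444,889 entries, which is the "about 1/2 million" of the text.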

2.6 Conclusions

2.6.1 Symbolic-logic systems

Goedel's incompleteness theorem states that any symbolic-logic system sufficiently
complex to incorporate arithmetic is necessarily inconsistent or incomplete. But
symbolic-logic systems can be constructed to be complete and consistent. In fact if a
system (set of sentences) is constructed for which John Robinson's resolution theorem
works, then that system is complete and consistent (otherwise the resolution technique
won't work). So such systems, in light of Goedel's theorem, may be considered
particularly uncomplicated and incapable of doing arithmetic. This means that a system
building program based on the resolution theorem (such as Prolog), must provide
additional mechanisms for doing arithmetic, and other functions that are based on
arithmetic. That may be no problem for a programming language but it does mean that
attempts to construct artificial intelligence systems based on logic face great difficulty
since, as argued above, any system that would be intelligent must be able to interact
with its environment and learn and grow. If the resolution theorem is used then any
addition to the system must not disturb the completeness and consistency of the system.
The task of adding information then becomes fraught with difficulty and will inhibit the
expansion of the system in a manner that is consistent with what would be called
common sense (e.g. conflicting facts that must await future resolution cannot be
incorporated for use in present calculations). If such closed system mechanisms are
avoided then inconsistency and incompleteness rear their ugly heads. For many
applications the restrictions are not a problem. The success of an expert system is
judged on its ability to perform the narrow (closed system) task for which it was
created. Such systems are so valuable that it is worth the effort to endow the program
with the complete and consistent information required for it to do its job. Some people
have taken the success of such systems as an indication that, to create a truly intelligent
machine it is necessary, in effect, to build a complete and consistent model of the world
(perhaps consisting of a very large number of expert systems). If such an immense task
could be completed (a dubious assumption), that which would be created would not be
a human-like intelligence, but a massive infallible database. Human reasoning, unlike
such a database, is incredibly inconsistent and any human's knowledge is profoundly
incomplete. Still humans solve difficult problems, handle situations for which they have
little knowledge, seize opportunities when they present themselves, and, it must be
added, often fail miserably. They do so because they are continuously reacting to the
environment. Their world model is continually being extended and revised. They
reason with logic but act on habit or judgement when the logic fails. Intuitively, no
closed system, no matter how extensive, can ever emulate this behavior, but now we
know this is also true theoretically. Chaos theory indicates that the dynamic processes at
the human level in the world cannot be predicted, and the results of part one of this
thesis imply that human knowledge of the linkages between levels is, and will always
be, incomplete. So a closed model intelligence, no matter how well suited to a particular
moment, will soon be obsolete, and such a machine can never philosophize (how would
it handle such a question as, "how many times can you divide this pie?").

All of this implies that the use of symbolic-logic systems in implementing machine
intelligence cannot realistically depend upon closed system resolution methodologies.
There are other symbolic-logic systems that are not closed but that depend upon
carefully fashioned sets of rules that, once constructed, remain fixed (semantic nets,
frames, scripts, tree store and search spaces, means-ends analysis, etc.). Such systems
might find applications in machine intelligence so long as they are used to model the
fixed aspects of human level existence (such items as the persistence of material objects,
the possession by those objects of attributes with assignable values and the effects of
background variables such as time and space). All of these systems have the virtue that
they are relatively easy to construct (in the sense that they can be completely specified
and verified to work correctly). But what is required of a symbolic-logic system used
in an implementation of machine intelligence is malleability. It has to be able to modify
itself over time. It has to be able to accept conflicting information and be able to use that
information in its calculations. In short, it has to be able to learn and grow in the sense
made famous by Piaget, by assimilation and accommodation of facts gleaned from
experience.

The thrust in logic research has been in finding mechanisms to extend the resolution
type processing of closed symbolic-logic systems to include ever larger and more
comprehensive models of some aspects of reality. Research in knowledge representation
and retrieval abandons the closed system approach to concentrate on the efficient, if not
guaranteed, storage and retrieval of information. Unfortunately both efforts have
concentrated on the static aspects of handling knowledge. Only the sub-branch of
computer science/cognitive science known as machine learning has addressed the
problem at any length. And there the results are limited and have been pursued more
because a machine that can learn will obviously be a valuable asset than because
machine intelligence is perceived to depend on the ability to learn. Still, symbolic-logic
systems have an irresistible attraction for researchers. They, of all systems, most closely
resemble the processes we recognize as components of our own thinking. It is easy to
mistakenly believe that we can construct a thinking machine from those components.
Part three of this thesis proposes a system, part of which is a hierarchically structured
symbolic-logic system with a (limited) ability to assimilate and accommodate new data.
It is not a perfect system, but demonstrates the kinds of features required by a logic
system that can interact with and change and be changed by its environment. The other
major kind of system currently considered as a possible implementation medium for
machine intelligence is the neuromorphic system.

2.6.2 Neuromorphic systems

Current neuromorphic systems are not sufficiently developed to provide the sole
mechanism by which to generate generally intelligent systems. Because of the scaling
limitations of current networks, they can, at best, be used as modules. When their
capacity is exceeded they lose information in an arbitrary and sometimes catastrophic
manner. This means that the patterns they contain may not be edited. Only total erasure
and rewriting can effectively reuse a module. Training (feed-forward networks) and
recall (feedback networks) are not predictable and can be lengthy, perhaps even
non-ending, processes. Research into sequential or process control by network mechanisms
has just begun, so that the only currently practical application of these networks is as
recognizers, classifiers, and associative memories. These and other drawbacks or
benefits are associated with certain of the features of neural networks. Here is a list of
some of those features together with their drawbacks and benefits.

Feature:
There is a uniformity of cell/unit structure coupled with modifiable, weighted
interconnections between those cells.
Benefits:
1. It's easy to implement and extend on serial or multiprocessor machines.
2. It makes it possible to use or refer to theories concerning computation that
occurs in nature.
3. It allows learning by the forced or natural associations of any two pieces of
information through the appropriate modification of the weighted
interconnections.
Drawbacks:
It makes debugging at the unit/cell level difficult or impossible. The
implementor must rely on the performance of the network and guess at
changes that might lead to improvement of performance.
Feature:
Each piece of information is distributed across the network in the weights of
many interconnections between unit/cells. Like a hologram that information
can be retrieved by subjecting the network to the appropriate stimulus.
Benefits:
1. It permits storage of redundant information allowing the complete
retrieval of information in the event of a partial failure of the network or, at
worst, the graceful degradation of the information.
2. It encourages (does not discourage) the use of multiple processors (yielding
speed increases).
Drawbacks:
1. It makes the process of storage and retrieval of information opaque to the
user. The storage or recall of information is a complicated process of
weight adjustment that is often slow, requires more hardware than the
equivalent serial algorithms, and is less precise.
2. Unlearning, forgetting (or the disassociation or modification of associated
information) is difficult to achieve in a predictable manner. Information
that it is desired to retain can accidentally be removed or replaced.

Feature:
Each weighted interconnection contributes to the storage of multiple pieces of
information so that one network configuration can respond differently to many
stimuli.
Benefits:
1. It permits the recall of the whole from parts of the whole (the closest match
is returned, which is most likely to be the whole, if a part of the whole is
input).
2. Permits automatic categorization in that like stimuli will elicit like
responses from a network. This may be considered generalization in that a
category may be thought of as the generalization68 of its members.
3. In feed-forward networks, recall of associated information is very fast. This
is particularly efficient (in comparison with symbolic techniques) when the
closest match is desired.
Drawbacks:
The process is inherently imprecise; the recall of something similar may be
considered a benefit unless precision is desired.

68
Note, passive generalization is not a process of comparison but, in effect, arises from the
poor resolving capability of neural nets...i.e. it is the result of mistaken identity. Similar
things get lumped together and treated as though they were the same thing... an effect that
works well in a well ordered world (e.g. in this world it is desirable to recognize an oak tree
as just another oak tree and not as a special object unlike any other object in the world. This
is effective because, for virtually any human purpose, one oak tree may be substituted for
another). In other words, neural networks provide a mechanism by which non-standard
(non-classical) logical principles may be seen to arise naturally. E.g. the example about
jumping to the conclusion that a tiger might eat you if he eats your friend! The logic works
only because the system exists in an ordered environment in which 'jumping to conclusions'
or making inductive inferences on relatively weak evidence, in general, works.

2.6.3 Making use of the tools at both levels

Both symbolic-logic systems and neural networks display weaknesses. Logical systems
suffer from brittleness and inability to assimilate new data into their structure. They are
usually very dependent on that structure and consult it exhaustively for every problem,
demanding an exact answer (they are poor at approximating a correct answer, hence all
the effort spent developing probabilistic and possibilistic logic systems). However these
systems excel at logic, and they are good for executing serial processes. They are
compact, easy to debug, and only make mistakes when data is in error, or hardware
fails. The facilities for the implementation of such systems abound. Neural nets, on the
other hand, are not good at logic; they jump to conclusions, and may be very inaccurate.
They are of little use for executing serial processes or controlling processes. And to
reach their potential they require hardware and software that is just now in the
development stage. However neural nets can automatically assimilate information with
more ease than logical data bases and they are very good at retrieving the next best
thing when an exact answer is not available. They offer a form of inexact generalization
that serves well as long as the environment in which the network is being applied is an
orderly one. Obviously the strengths of these two kinds of systems complement each
other. Neural networks represent a lower level approach to the implementation of
machine intelligence than do symbolic-logic systems. Whereas symbolic-logic systems
may be used to directly simulate reasoning processes, neural nets, modeled as they are
on the physical structure of the brain, are the stuff from which such functions should be
built. In any event, neither of these technologies is to be used here to directly implement
intelligence, rather the attempt will be to construct a machine that, when exposed to an
appropriate environment, has the potential to develop an intelligence.

Four alternative philosophies for employing the existing technologies present
themselves: 1) use symbolic-logic, 2) use neural networks, 3) build symbolic-logic
functions using neural networks and use them to construct an intelligence, and 4) build
a hybrid system of neural networks and symbolic-logic systems. We discard the first
two possibilities for the reasons stated in the paragraph above. The third method, to
build mind functions from neural networks which would then be used to build a mind,
is the mechanism used by the brain and would be the preferred approach except for one
thing: neural networks are at least two levels removed from the level at which human
intelligence occurs. As was concluded in part one, the activities at one level cannot be
explained in terms of the organization and constituents at other levels, and the further
removed the levels the more this is so. The technique of growing an intelligence is
difficult enough, starting from a level just once removed from the target level. This fact
would tend to argue in favor of attempting to implement a program using
symbolic-logic techniques because they are closer to the desired level. But the shortcomings,
already noted, preclude that approach. As a compromise we are forced to use the
hybrid approach as the most promising. And that is the kind of system presented in
part three.

Emerging Systems and Machine Intelligence

Part three - Toward an Implementation


3.1 Introduction

Part three contains a plan for constructing a machine that can learn and grow through
interacting with humans in a simulated environment. As we noted in the preface, the
ultimate intent is to construct an undeniably intelligent machine (UIM). Since that is not
possible in a simulation, we will refer to the object of the simulation as a limited
intelligence machine, or a LIM. It is designed to run on presently available computers,
and parts of it already exist. Because of the inherent limitations it should be considered
as a first step for testing out the ideas in the previous parts. It is hoped that such a
program, when complete, will display elements of spontaneity and dynamic interaction
to the extent that the simulated environment allows. The success of a LIM, as measured
by the extent to which it surpasses DIMs (deniably intelligent machines) in behaving
realistically and intelligently in an environment, would support the contentions made in
section 2.1. That is, a machine built with typical human mental and physical capacities,
and taught in an actual human environment, will have the ability to
grow, learn, and function with humans in their environment. In other words it will
be a UIM.

Let's take a moment to review those points made in parts one and two that bear on an
attempt to implement a truly human-like intelligent machine.

I. Minds emerge into the human environment, and, in a sense, continue to emerge
throughout their existence. That process itself is an intimate feature of the mind
and must be duplicated by the LIM.

A. This requires that the LIM be constructed to be able to interact in an actual
human environment (a very expensive and time-consuming proposition).
As an alternative, and experimental first step, a simulated environment can
be constructed in which man and machine can interact in comfort. We opt
for this, rather inadequate, alternative.

B. The process of emergence is an interaction in which the LIM modifies the
environment and the environment modifies the LIM. This requires that the
LIM be able to change in reaction to the environment. We feel that this will
require a LIM that can build and modify itself, and not just some database
that it can access. This constrains the choice of implementation languages.

II. Under the hypothesis of part one, emerging minds evolve mental objects and
associated rules of interaction in the same manner that physical systems evolve
objects and rules at a level. Those mental objects and rules are then, to an extent,
independent of their constituent levels, and so, are difficult or impossible to
produce by direct implementation at lower levels. But an implementation solely at
the conscious mind level produces a database that provides only a simulation of
intelligence. The ideal implementation technique would be to use the lowest level
models available (neuromorphic) and grow the intelligence by interaction with the
human environment. This would be a trial and error process, but one guided by
human knowledge and experience. Unfortunately neuromorphic systems are less
well understood, and far less advanced than symbolic-logic systems. Unlike
neuromorphic systems, symbolic-logic can be used to model mind functions closer
to the level at which human conscious thought takes place. In the spirit of
compromise, and because there is little choice, we opt for a hybrid
implementation. We will use the parts of each level that best complement each
other and any information on the nature of the human mind that is available.

A. Open symbolic-logic systems can be used to hierarchically and sequentially
assimilate, order, and provide dynamic control for the manipulation of
information through construction, reconstruction, and destruction of
executable modules. Closed logic modules can be used for some problem
solving but such processes would be more properly learned.

B. Neuromorphic processes can be used to recognize and categorize objects in
the environment and in memory, and (in a reverse process) to provide an
associative memory.

Part three is arranged as follows: Section 3.2 discusses some of the assumptions and
compromises that the inferences from the hypothesis, available tools, and facts of
simulation force upon the implementation. It includes an outline of the macro structure
of the simulation. Section 3.3 is a more detailed description of the structure of the LIM,
explaining the environment, the natural language interface and the LIM. The latter
emphasizes the selection of languages, the use of grammars, the self-modification
mechanism of the grammar, and the integration of lower level neural network
mechanisms. Section 3.5 summarizes part three. This is followed by general conclusions
regarding what has been achieved.

3.2 Assumptions and compromises

An attempt to implement a UIM would be a very expensive project, requiring an
extensive integration and extension of all of the techniques that have been achieved in
robotics, artificial intelligence, sensing devices, cognitive science, and numerous other
disciplines. Most people would agree that such an attempt would be premature. The
obvious first step is a simulation, and indeed, that is what is proposed here. By the
inferences from the hypothesis, this would not produce a UIM but a LIM whose success
would be judged by the progress made beyond DIMs towards a UIM. In this scheme,
the parts of reality that the simulation simulates are an environment and a counterpart
to human senses for the LIM. But beyond this we have made other assumptions and
compromises.

In part two we implicitly accepted symbolic-logic structures as capable of modeling the
levels of human mind that exist at and just below consciousness, and neural network
structures as capable of modeling levels of mind that exist below those levels. These are
assumptions that seem reasonable but are, nevertheless, assumptions. That there exists a
sequence of levels between the neural level and the mind level is assured by the
hypothesis of part one. Unfortunately the complexities of the human mind are
perceived but not well known. It is entirely possible that there exist unknown levels
between the two extremes. Indeed, researchers into neural networks have not been able
to create networks that adequately emulate the symbolic manipulative abilities of the
mind. The hybrid system, described here, in part avoids this difficulty by applying
symbolic-logic directly to provide a hierarchically organized symbol manipulating
structure that can learn and grow. Neural networks are used in those situations for
which research has shown they are well suited (which tend to be the recognition and
association tasks). These activities, appropriately, tend to occur at the lower levels of the
symbolic hierarchical structure.

Because we are relying on symbolic techniques to model the higher functions of the
mind we are taking a top down approach to the implementation. This is exactly the
opposite of the more desirable approach of building up from lower levels. It forces the
construction of levels based on our observations of how those levels work. Of course,
any implementation is forced to start at some level, the lower the better. Fortunately,
there exist well-researched and powerful models that can serve as a guide. We are
forced to rely on some of these models.

As indicated in section 1.9.2 a baby's acquisition of knowledge really accelerates when it
acquires language. In fact, language is the preferred means used by humans for the
acquisition of knowledge. Maturana (Maturana, 1970) considers it more than that:

"The linguistic domain as a domain of orienting behavior requires at least two
interacting organisms with comparable domains of interactions, so that a
cooperative system of consensual interactions may be developed in which the
emerging conduct of the two organisms is relevant for both. The central feature
of human existence is its occurrence in a linguistic cognitive domain. This
domain is constitutively social."

If there is a uniqueness about man on earth it is in his participation in this linguistic
cognitive domain. It is the means by which complex ideas are maintained and passed
down over generations and is a feature of society and a major force in the shaping of
that society. From the hypothesis, we have determined that individuals acquire
language from society. They then use that facility to acquire and order the information
that society has produced. So language is the primary tool by which society creates
people who fit into that society and by which people help shape the society. The
individual is largely unaware that, in language, he has acquired a powerful tool that
reflects the very structure of his mind. We will use the structure of language, reflected
in a generalized grammar, as the model upon which to base the structure of the higher
mental functions of the LIM. Obviously this structure does not model the whole
structure of a human mind or of a UIM. It does, however, capture a significant portion
of that part of a mind which, when well developed in a human, is often referred to as
evidence of intelligence. Then too, the simulation is limited to interaction of the LIM
and humans in a simulated environment. Such interaction consists largely of
communications about the events and features of the environment. So in the simulation,
language exceeds even its importance in the real world.

A problem that might be anticipated in teaching a LIM is the length of time the process
requires. It takes a year of continuous interaction with the environment and an adult to
elicit just the beginning of intelligence from a baby. It might take an even longer
period of intensive teaching to evoke intelligence in a LIM. On the other hand, it could
be that the concentrated nature of the information being taught to the machine, and the
freedom from the time-consuming requirement to satisfy biological needs, speed the
process. Whatever the case, the process of programming and reprogramming a large,
complex simulation is immensely time-consuming, so that, on balance, creating and
training even a LIM is going to be a lengthy process. Inasmuch as the program is a first
step, it is advisable to accept compromises that can speed the process. In consideration
of all of the above, we will take advantage of an existing natural language interface
(NLI) wherever we can in the simulation.

The following section is a description of a system (portions of which exist) that attempts
to meet the following criteria:

1. All learning takes place in an environment shared by the program and a
human teacher.

2. Both the teacher and the program have the same potential for control over the
environment, though the program starts with little knowledge and the teacher
presumably knows almost everything.

3. Natural language dialogue (i.e. the turnaround of Kaye, see section 1.9.2), and
the ability to scan the environment, are used to convey information to the
program. This provides some natural rules for interaction. There is a concession
here to the nature of present day computers and to the limited degree that they
can interact with an environment. Ideally the program should be required to
learn a language from scratch. Instead we use a natural language interface
(YANLI, Glasgow, 1987).

4. The program is flexible in its ability to represent knowledge and changes and
grows over time in reaction to the teacher and the environment.

5. Teaching starts at a low level. Only the ability to learn by rote and to imitate is
built in to the program. Imitation will consist of the ability to substitute self for
the teacher. This is expanded to include the substitution of things for other
things in the primitives or developed operations of the system.

By the hypothesis, a program with appropriately effective lower levels that adheres to
these requirements will gather knowledge and build structure more in the sense of
learning than in the sense of being programmed. In the course of interaction between
the teacher and the program, the semblance of a social system in which shared meanings
develop should emerge. Recall that one of the implications of the hypothesis for
human-level activities is that the development of shared meanings (i.e. Kaye's
intersubjectivity, or the consensual domain of Maturana) is the key to the recognition of
intelligence by one entity in another.

[Diagram labels: TEACHER, KARA, YANLI, ENVIRONMENT, program.]

KARA stands for Knowledge Acquisition, Representation and Activation. TEACHER and
KARA have exactly the same access to each other and the environment.

Figure 3.1: Structure of the program.

3.3 The structure of the program

The user (who will be referred to as the teacher) and the program interact in an environment
that can consist of a number of features, for example, 1) the operating system, 2) a small
world in which both the teacher and program have representations and that is
populated by (initially) nameless objects together with a number of primitive operations
(e.g. move, pick-up, indicate, etc.), 3) the conversational back-drop and 4) the elements
of the program itself (e.g. the dictionary, augmented transition networks (ATNs)69,
functions, etc.). All of these features are equally accessible to both the teacher and the
program.

A natural language interface (YANLI, Yet Another Natural Language Interface,
Glasgow 1987) provides for a classification of natural language input from the teacher
and provides natural language output from the LIM. Both the teacher and the LIM
(which henceforth will be referred to specifically as KARA, which stands for
Knowledge Acquisition, Representation and Activation) can 'see' the environment. The
teacher sees it as a graphical representation on the computer terminal and KARA 'sees'
it by directly accessing the database from which the graphical interface is drawn (in a
restricted manner that emulates direct line-of-sight vision). Both the teacher and KARA must
use YANLI to manipulate the environment. The purpose of KARA is to try to respond
to the input from the teacher. It can imitate, make substitutions, and ask for instruction
and approval. Information acquired from the teacher is captured by KARA in the
construction or modification of an ATN (see figure 3.2 below). A classification scheme
provided by YANLI supplies a name for the new ATN (or access to an existing ATN to
be modified). This information, in the form of new or modified ATNs, remains in
memory temporarily, to be acted upon during the immediate session. But it can be
stored permanently should the situation warrant. The information can be acted upon
because the ATN is an executable program as well as a data structure. Such actions do
not consist of an analysis of word definitions and syntax in order to discover meaning.
Rather, KARA attempts to make an appropriate response based on what it has
previously learned are the expected responses associated with similar patterns of input.
Because of the classification scheme used and the emphasis on pattern matching, the
semantics, as evidenced in the response, can depend upon context.

69 First proposed by W. A. Woods (Woods 1970). See figures 3.2 and 3.3. For a detailed
explanation refer to section 3.3.2.2 below.

Satisfactory performance on the part of the program should be rewarded by the
approval of the teacher; this might cause the program to store the response away (via
YANLI) for automatic invocation when that input is again encountered. The
approval/disapproval given by the teacher can be in the form of a linguistic fuzzy
measure, e.g. good, very good, excellent, etc. (see for instance Baldwin 1982 or Kandel
1987).
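
The approval mechanism can be pictured with a small sketch. It is written in Python purely for illustration (the system itself is described as Franz Lisp), and the numeric degrees, the "poor" term, the storage threshold and the function name review are all assumptions introduced here; the source names only the terms good, very good and excellent.

```python
# Hypothetical degrees of approval for the teacher's linguistic terms.
# The numbers are illustrative assumptions, not values from the source.
APPROVAL = {"poor": 0.2, "good": 0.6, "very good": 0.8, "excellent": 1.0}

STORE_THRESHOLD = 0.7  # assumed cut-off above which a response is kept


def review(memory, pattern, response, verdict):
    """Keep `response` for `pattern` when the teacher's verdict is strong enough.

    Returns True if the response was stored for later automatic invocation.
    """
    degree = APPROVAL.get(verdict.strip().lower(), 0.0)
    if degree >= STORE_THRESHOLD:
        memory[pattern] = response
        return True
    return False


memory = {}
review(memory, "det noun verb", "What {noun} {verb}?", "very good")  # stored
review(memory, "verb", "Why?", "good")                               # not stored
print(memory)
```

A graded verdict, rather than a yes/no flag, leaves room for the program to prefer the better of two previously approved responses.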

All aspects of the system use ATNs in their implementation including the environment.
KARA can be thought of as consisting of three overlapping parts that provide for the
acquisition, representation and activation of information. In the following, "system"
will refer to the combined environment, natural language interface and KARA.

(ATN-NAME-A
   (STATE-IA
      (if TEST-X transit-to-state STATE-JA     ;transition
          but-first DO-SOME-THINGS)            ;augmentation
      (if TEST-Y...

      ]
   (STATE-JA
      (if...
      ]

(ATN-NAME-B
   (STATE-IB...
   ]

The STATES and ATN-NAMES are generated by a syntactic/semantic/contextual classification
scheme. The TESTs can be any Lisp expression or ATN that evaluates true or false.
It is often another ATN. DO-SOME-THINGS can be anything implementable in Lisp, for
instance, store some facts about the world in registers, implement some process, etc.

Figure 3.2: Template for an Augmented Transition Network

3.3.1 The background systems

3.3.1.1 The Environment

While the operating system is an obvious common ground for the teacher and KARA
and meets criteria one and two given in section 3.2, using the operating system requires
that the teacher be knowledgeable about that system. Instead we will emphasize the use
of a small world, crudely represented on the terminal screen, populated by objects,
barriers, and spaces. In this world there exist several primitive operations so that KARA
and the teacher will be able to affect and be affected by each other and the objects in
that world. The crude, two-dimensional nature of the simulated environment is not a
drawback since it is not desirable that the objects in that environment be so recognizable
to the teacher that he attaches to them meanings that only have relevance in the real
human world. The objects need not be devoid of external meaning, but like chess pieces,
any connotations should be inessential to the game. The difference between our use of a toy
world and the use of toy worlds in other AI programs is that those worlds have been
created to limit the demands placed upon the program; we wish to obtain unlimited
learning in a database world limited to those things about which the teacher and KARA
share an understanding.

Another reason for using a crude graphical world is the easy opportunity it affords for
the application of neural networks to the objects in that world. For example, on a
standard terminal screen a number of graphic characters may be displayed whose
representation in memory is one of the first 236 binary numbers. Neural nets
customarily use binary sequences as their input and output vectors, so characters on a
terminal screen are ready-made for them. A normal terminal screen consists of 24 rows
of 80 characters. Anything displayed in such a small array will be crude, but in memory
the space can be extended as far as desired so that the richness of the whole
environment can be substantial, and as a bonus, no special equipment is needed. The
arithmetic of vision in such a flatland world is easy to arrange, consisting mainly of a
sequential scan of all characters within a cone of vision. To the teacher the characters
represent lines and geometric objects on the screen. To KARA, scanned characters
represent a binary pattern suitable for neural network recognition and association. But
both the teacher and KARA will perceive the same objects in the same environment.
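
The scan described above might be sketched as follows. This is an illustrative Python sketch, not the system's Franz Lisp code; the function names cone_scan and to_bits, the 45-degree half-angle, the range limit and the 8-bit character encoding are assumptions made for the example. Occlusion by walls is deliberately left out.

```python
import math


def cone_scan(world, pos, facing, half_angle_deg=45.0, max_range=5):
    """Return (row, col, char) for every cell inside the cone of vision.

    Cells are visited in a simple sequential (row-major) scan. `facing` is a
    direction vector (d_row, d_col). Walls do not block the view here.
    """
    r0, c0 = pos
    fr, fc = facing
    f_len = math.hypot(fr, fc)
    cos_limit = math.cos(math.radians(half_angle_deg))
    seen = []
    for r, line in enumerate(world):
        for c, ch in enumerate(line):
            dr, dc = r - r0, c - c0
            dist = math.hypot(dr, dc)
            if dist == 0 or dist > max_range:
                continue  # skip the observer's own cell and anything too far
            cos_angle = (dr * fr + dc * fc) / (dist * f_len)
            if cos_angle >= cos_limit - 1e-9:  # small tolerance on the edge
                seen.append((r, c, ch))
    return seen


def to_bits(chars):
    """Turn scanned characters into one flat binary vector (8 bits each)."""
    bits = []
    for ch in chars:
        bits.extend(int(b) for b in format(ord(ch) & 0xFF, "08b"))
    return bits


world = [
    "#####",
    "#..o#",
    "#K..#",
    "#####",
]
# K stands at row 2, column 1 and looks toward increasing columns.
view = cone_scan(world, (2, 1), (0, 1), half_angle_deg=45.0, max_range=3)
pattern = to_bits(ch for _, _, ch in view)
print("".join(ch for _, _, ch in view), len(pattern))
```

The teacher would see the same cells drawn on the screen, while a recognition network would receive only the bit vector.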

The simulated environment is not intended to be neutral and always compliant. The
objects that represent KARA and the teacher are subject to the rules of the world. They
cannot see or go through walls or objects. Some objects can be moved; others are fixed.
In some places in the world there are forces that must be contended with, e.g. a form of
gravitation can draw all non-fixed objects to a wall or a point. Entities such as the teacher
or KARA may be forced to use a technique akin to crawling or climbing to defeat the
force. There are other forces that apply occasionally and unpredictably, e.g. there is a
wind that can blow small objects around and small animals that do damage to objects.
Some objects may be contained in other objects and must be accessed by means of
doors. In places the nesting may be complicated and deep, or the wall structure
complicated and maze-like. In other words, despite the crudeness, a fairly complex,
active, and reactive environment can be contrived.

3.3.1.2 YANLI

YANLI is implemented in Franz Lisp. It consists of an interpreter for ATNs, a set of
built-in ATNs that recognize a subset of English, and a response generator which may
be used to associate any response with sentence templates (the response has to be
within the capabilities of Franz Lisp). The set of ATNs supplied with the system was
constructed to recognize arithmetic word problems, but new ATNs specific to any
purpose may be constructed. The response generator allows a user to tailor responses to
generic natural language inputs. For example, YANLI may recognize a particular input
as a specific kind of sentence, say a declarative sentence, consisting of the words "The
dog barked." The user may then specify that a particular response be issued upon
detecting that sentence. Any response may be ordered. Additionally any of the parts of
the sentence may be used in the response. For example, the response might be "What
dog barked?", in which the subject dog and the verb barked have been reused. The user
may specify that the same kind of response should be issued in the event words other
than dog and barked are found in the input sentence. That is, if the words cat and pig are
found in place of dog, or the words meowed and oinked in place of barked then the
response should be "What cat meowed?" or "What pig oinked?" respectively. Because
the response need not be a natural language response, YANLI can serve as the interface
between KARA, the teacher, and the environment. A set of primitive commands can be
designed which when issued to YANLI by KARA or the teacher cause it, in turn, to
issue appropriate environmental commands. As an example, suppose it is desired that
KARA and the teacher be able to move around in the environment. A primitive form
such as 'move w x y z' is decided upon as appropriate. YANLI is instructed to accept
'KARA' or 'teacher' as alternatives to 'w'; 'left', 'right', 'forward' or 'back' as alternatives to 'x';
any number as an alternative to 'y'; and 'feet' or 'inches' as alternatives to 'z'. Now
YANLI will accept any command of the type "move KARA forward 10 feet." The
elements in this sentence can then be easily converted by the response generator into
whatever actual commands the environment accepts. YANLI possesses a detailed
classification scheme so that any variation in sentence form will generate a unique
template. This means that a large variety of sentences can be accepted. The mechanism
by which these things are accomplished is discussed more completely below in section
3.3.3.1.
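
The idea of attaching a response to a sentence template, and reusing parts of the input in that response, can be sketched as follows. The sketch is in Python for illustration only and is not YANLI's actual mechanism: the word categories, the table names DICTIONARY and RESPONSES, the helper to_inches and the tuple form of the environmental command are all hypothetical.

```python
# Hypothetical word categories standing in for YANLI's dictionary.
DICTIONARY = {
    "the": "det", "dog": "noun", "cat": "noun", "pig": "noun",
    "barked": "verb", "meowed": "verb", "oinked": "verb",
    "move": "move", "kara": "agent", "teacher": "agent",
    "left": "direction", "right": "direction",
    "forward": "direction", "back": "direction",
    "feet": "unit", "inches": "unit",
}


def classify(sentence):
    """Return (template, words): the category pattern and the words that fit it."""
    words = sentence.rstrip(".?!").split()
    template = tuple(
        "number" if w.isdigit() else DICTIONARY.get(w.lower(), "unknown")
        for w in words
    )
    return template, words


def to_inches(amount, unit):
    """Convert a distance to inches (assumed unit of the environment)."""
    return amount * 12 if unit.lower() == "feet" else amount


# One response per template; each response may reuse words from the input.
RESPONSES = {
    ("det", "noun", "verb"): lambda w: f"What {w[1]} {w[2]}?",
    ("move", "agent", "direction", "number", "unit"):
        lambda w: ("MOVE", w[1], w[2], to_inches(int(w[3]), w[4])),
}


def respond(sentence):
    """Look up the response attached to the sentence's template, if any."""
    template, words = classify(sentence)
    handler = RESPONSES.get(template)
    return handler(words) if handler else None


print(respond("The dog barked."))
print(respond("The pig oinked."))
print(respond("move KARA forward 10 feet."))
```

Because the response is keyed on the template rather than on particular words, a single entry covers the dog, the cat and the pig alike, and a non-linguistic response (the command tuple) is handled the same way as a linguistic one.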

3.3.2 Grammar and Language

3.3.2.1 Choice of language

The underlying language of implementation for KARA (as well as for YANLI and the
environment) is a dialect of Lisp named Franz Lisp. Lisp has long been known as the
language of artificial intelligence. There are three reasons why Lisp is the choice of
programmers who write AI programs: 1) the syntax is simple, consisting only of lists
and atoms, 2) the programmer is relieved of much of the responsibility of type checking
(it is done automatically when the program runs; surprisingly, this causes few errors
but does make for an inherently slow language, so Lisp requires a powerful processor),
and 3) the programs and data of Lisp are indistinguishable. The last item is, for our
purposes, by far the most important attribute of Lisp. It is essential to the hypothesis
under which the LIM is being developed that it be able to modify itself in response to its
environment. The simplicity of Lisp syntax contributes to this feature. All programs
(usually referred to as functions) and data in Lisp are simply lists of atoms and other
lists. Atoms may be thought of as multi-valued pointers that may be dereferenced to
access the items in memory to which the pointers point. These items being pointed to
may consist of the kinds of values associated with more standard kinds of languages
(numbers, strings, arrays, etc.) or they may be other atoms or lists. An atom may also
point to a Lisp function. Lisp functions are also lists but they start with a special atom
(in the case of Franz Lisp the special atom is 'defun' for define function). The Lisp
interpreter, when presented with a Lisp list, assumes that the first element in the list, if
it is an atom, points to a function. In the absence of any instructions to the contrary the
interpreter will execute that function, treating the remainder of atoms (or lists) in the list
as arguments to the function. If the list being presented to the interpreter is preceded by a
single quote (the instructions to the contrary mentioned above), then all of the elements
in the list are treated simply as data. Lisp has a large repertoire of list manipulation,
value assignment, dereferencing and I/O functions built into it. From this basic set the
user can construct functions equivalent to those in any other language, but, importantly,
it is easy in Lisp to construct programs that construct functions (since functions are just
lists) that may then be executed or to modify a portion of the executing program itself.
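
The point that programs are just lists can be illustrated with a toy evaluator. The sketch below is Python, not Lisp, and it mimics only the behavior described above; the names evaluate, make_scaler and ENV, and the tiny set of operators, are assumptions made for the example.

```python
import operator

# Atoms are looked up ("dereferenced") in an environment of named values.
ENV = {"+": operator.add, "*": operator.mul}


def evaluate(expr, env):
    """Evaluate a list-structured expression in the manner described above."""
    if isinstance(expr, str):        # an atom: dereference it
        return env[expr]
    if not isinstance(expr, list):   # a number evaluates to itself
        return expr
    head = expr[0]
    if head == "quote":              # quoted list: treat as data
        return expr[1]
    if head == "lambda":             # ["lambda", [params], body]
        _, params, body = expr
        return lambda *args: evaluate(body, {**env, **dict(zip(params, args))})
    fn = evaluate(head, env)         # first element names a function
    args = [evaluate(a, env) for a in expr[1:]]
    return fn(*args)


def make_scaler(factor):
    """A program that constructs a program: the result is an ordinary list."""
    return ["lambda", ["x"], ["*", "x", factor]]


scaler = make_scaler(2)        # data: can be inspected or edited like any list
scaler[2][2] = 3               # modify the constructed program in place
triple = evaluate(scaler, ENV)

print(triple(7))
print(evaluate(["+", 1, 2], ENV))
print(evaluate(["quote", ["+", 1, 2]], ENV))
```

The same list is data while it is being built or edited and a program once it is handed to the evaluator, which is the property the text relies on.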

Such self-modification is a tricky business. To avoid unpredictable results it is best to
partition the program into a control section that may not be modified and a section that
may be modified or added to. One scheme that can be successful is to invent a
language appropriate to your purpose (it need not be elaborate and it can borrow Lisp's
syntax). You then create an interpreter to interpret the programs constructed in your
language and a language generator to generate programs (or modify programs) in your
new language. All self-modification and program creation is restricted to the new
language. That is essentially the scheme that we will be presenting in the next few
sections. The new language we are using, augmented transition networks (ATNs), is not
new at all. It has been around since the early 1970s as a popular means of implementing
language parsers. We will use it as a language parser, but we will also modify it and use
it for all of the knowledge acquisition, representation and process activation
requirements of the program; that is new. Modifications include the ability to generate and
modify ATNs to represent the dynamic changes in mind state that derive from the
interaction of the program with its environment.

3.3.2.2 Augmented transition networks

The system is implemented through an expanded use of ATNs. In the case of YANLI
(Glasgow 1987) the ATNs represent a subset of the grammar of English; for KARA they
serve as both the structures into which knowledge is encoded and the programs
through which that knowledge is accessed. For the environment, ATNs can be used to
describe the primitive operations that are available to KARA and the teacher.

Augmented transition networks, since first proposed by Woods (1970), have become a
popular method of defining grammars for natural language parsing programs. This is
due to the flexibility with which ATNs may be applied to hierarchically describable
phenomena and to their power, which is that of a universal Turing machine. As
discussed below, a grammar in the form of an ATN (written in Lisp) is more
informative than a grammar in Backus-Naur notation. As an example, the first few
lines of a grammar for an English sentence as written in Backus-Naur notation might
look like (similar to, but not identical to, figure 3.3):

sentence ::= [subject] predicate
subject ::= noun-group
predicate ::= verb [predicate-object] . . .

[Figure 3.3 diagram. Networks shown: Sentence; Subject / Noun; Predicate / Verb Phrase. Node and arc labels: Start, Win, Free, Word, Subject, Predicate, Predicate Object, Adverb, Verb, Determiner, Adjective, Noun.]

Figure 3.3: Graphical depiction of a particular Augmented Transition Network

This same portion of a grammar, when written as transition nets (TNs), would look
like:

(sentence
   (test-for-subject
      (if subject transit-to-state test-for-predicate)
      (if t transit-to-state test-for-predicate))
   (test-for-predicate
      (if predicate transit-to-state win)
      (if t lose)))
(subject
   (test-for-noun-group
      (if noun-group transit-to-state win)
      (if t lose)))
(predicate
   (test-for-verb
      (if verb transit-to-state
          test-for-predicate-object)
      (if t lose))
   (test-for-predicate-object
      (if ...
These TNs (ignoring the concept of augmentations for the moment) have more of a
procedural presentation than the same grammar in Backus-Naur notation. They
indicate, by the inclusion of certain words such as "if" and "transit-to-state", the process
that must take place if a sentence of input is to be recognized. The Backus-Naur
notation makes only a static definition of the grammatical structure. Actually the
words "if" and "transit-to-state" in the TNs are just space fillers and the "test-for-xxx"
words (referred to here as "states") serve only to mark the right-hand side of each production
rule in the Backus-Naur form. So there is no extra procedural meaning attached to
these TNs not implicit in the Backus-Naur grammar; the one could be converted
directly into the other (and in fact the programming language 'g' discussed below will
do just that). Still the form of the TN assists a person with the construction of a
grammar by making it easier to conceptualize the process that is to take place. The
process implied by the form and content of the TNs is carried out by the parser.

The parser works by traversing the TN list, and making transitions as appropriate.
For instance the parser evaluates the second argument in a list headed by an "if" as a
new TN to be traversed so that in the above example the words "subject" and
"predicate" following the "if"s serve to indicate to the parser that it should now
traverse the TNs of that name. Proceeding in this manner the parser must eventually
arrive at a TN that tests a word of input. If the word exhibits the proper feature (e.g. is
a noun where a noun is required, information found in a dictionary provided by
YANLI) then the parser pops back to the previous TN reporting success (win). On
the other hand, if the word does not pass the test, the parser pops back reporting
failure (lose). Now the parser continues in the manner implied by the "if". In the
event of a "win" the traverse proceeds to whichever state is indicated after the
"transit-to-state" word of the "if" statement. In the event of a "lose" the parser proceeds to the
next "if" list in the TN. If there are no more "if" lists it pops back again reporting "lose".
In any given TN the parser must eventually reach one of the "win" or "lose" states or
run out of "if" lists. Inevitably the parser must return to the TN from which it began,
in a "win" or "lose" condition. The "win" indicates that the words of input seen during
the traverse were a sentence of the grammar while a "lose" means they were not a
sentence of the grammar70.
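
The traversal just described can be sketched as a small recognizer. The sketch is in Python for illustration and is not the YANLI parser. The table layout (network, then state, then a list of (test, next-state) pairs), the word list LEXICON, the contents of the noun-group network, the reduction of the predicate to a single verb test, and the requirement that all input be consumed are assumptions made to obtain a runnable example.

```python
# Hypothetical dictionary of word features.
LEXICON = {"the": "determiner", "a": "determiner", "dog": "noun",
           "cat": "noun", "barked": "verb", "meowed": "verb"}

# network -> state -> list of (test, next state), one pair per "if" list.
GRAMMAR = {
    "sentence": {
        "test-for-subject": [("subject", "test-for-predicate"),
                             ("t", "test-for-predicate")],
        "test-for-predicate": [("predicate", "win"), ("t", "lose")],
    },
    "subject": {
        "test-for-noun-group": [("noun-group", "win"), ("t", "lose")],
    },
    "noun-group": {  # assumed contents, for illustration only
        "test-for-determiner": [("determiner", "test-for-noun"),
                                ("t", "test-for-noun")],
        "test-for-noun": [("noun", "win"), ("t", "lose")],
    },
    "predicate": {   # the predicate object is left out of this sketch
        "test-for-verb": [("verb", "win"), ("t", "lose")],
    },
}


def run_test(test, words, pos):
    """Apply one test; return (succeeded, new input position)."""
    if test == "t":                       # always true, consumes nothing
        return True, pos
    if test in GRAMMAR:                   # another network: traverse it
        return traverse(test, words, pos)
    if pos < len(words) and LEXICON.get(words[pos]) == test:
        return True, pos + 1              # a word with the required feature
    return False, pos


def traverse(name, words, pos):
    """Traverse network `name`; return (won, new input position)."""
    states = GRAMMAR[name]
    state = next(iter(states))            # first state listed is the start
    start = pos
    while True:
        for test, target in states[state]:
            ok, new_pos = run_test(test, words, pos)
            if ok:
                break
        else:                             # ran out of "if" lists
            return False, start
        if target == "win":
            return True, new_pos
        if target == "lose":
            return False, start
        state, pos = target, new_pos


def recognize(sentence):
    words = sentence.lower().rstrip(".?!").split()
    won, pos = traverse("sentence", words, 0)
    return won and pos == len(words)      # assumed: all input must be used


print(recognize("The dog barked."), recognize("Barked."), recognize("The dog."))
```

As in the text, the recognizer only reports win or lose; it builds nothing and takes no action along the way.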

70 Note that a "sentence" of the grammar need not be a generally correct English sentence, but
simply a structure that the parser accepts. This is desirable. For example a tutoring system
might ask "Do you understand the problem?", and the student respond "Not really."
Although the response is not a sentence, the system would expect such a response, and the
grammar would be defined to accept it.

The parser, operating on TNs, is capable only of "recognizing" a sentence of the
grammar. To do useful work the TNs must be augmented with the ability to
perform actions. Providing the TNs with this ability requires the addition of some
extra elements to the TN lists. These extra elements will be interpreted as Lisp
symbolic expressions or s-expressions. S-expressions are evaluated when and if the
parser encounters them in the traversal of the ATNs. When evaluated an s-expression
always returns a value (although the main purpose may be to cause some
side effect). For example71:

(sentence
   (test-for-subject
      (if subject transit-to-state test-for-predicate)
      (if t transit-to-state test-for-predicate
          but-first
          (addr 'comments 'no-subject)))
   (test-for-predicate
      (if predicate transit-to-state win
          but-first
          (print "It's a sentence"))
      (if t transit-to-state lose)))

The addition of the element "but-first" to an "if" list signals the parser to look for
s-expressions to execute. In the example a return to the previous ATN from the
test-for-predicate state (win) would result in "It's a sentence" being flashed on the screen and,
more significantly, if the parse had failed to find a subject in "test-for-subject" then
"no-subject" will have been added to the property list of "sentence" under the register
"comments". At a later time that register can be accessed and it can be deduced that
the sentence is a command or a question. It is useful to be able to issue messages to
the screen during a parse but it is more important to be able to deposit messages at
different times during the parse to be read at a later point in the parse or after the
parse has concluded. The augmentation makes the parsing system powerful enough
to do anything that can be done by high-order (3rd generation) programming
languages.
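
The effect of the "but-first" augmentation can be sketched in a few lines. This is an illustrative Python sketch of the single "sentence" network above, not YANLI's interpreter: the subject and predicate tests are reduced to single-word noun and verb tests, the helper addr is only a stand-in for whatever the Lisp addr does, and the message is collected in a register instead of being printed.

```python
# Hypothetical dictionary of word features.
LEXICON = {"dogs": "noun", "dog": "noun", "bark": "verb", "run": "verb"}


def addr(registers, name, value):
    """Stand-in for the source's addr: append `value` to register `name`."""
    registers.setdefault(name, []).append(value)


# state -> list of (test, next state, but-first action or None)
ATN = {
    "test-for-subject": [
        ("noun", "test-for-predicate", None),
        ("t", "test-for-predicate",
         lambda regs: addr(regs, "comments", "no-subject")),
    ],
    "test-for-predicate": [
        ("verb", "win",
         lambda regs: addr(regs, "messages", "It's a sentence")),
        ("t", "lose", None),
    ],
}


def parse(sentence):
    """Return (won, registers) after traversing the augmented network."""
    words = sentence.lower().rstrip(".?!").split()
    registers, pos, state = {}, 0, "test-for-subject"
    while state not in ("win", "lose"):
        for test, target, action in ATN[state]:
            matched = test == "t" or (
                pos < len(words) and LEXICON.get(words[pos]) == test)
            if matched:
                if test != "t":
                    pos += 1              # the test consumed a word
                if action:
                    action(registers)     # "but-first": act, then transit
                state = target
                break
        else:
            state = "lose"                # no "if" list applied
    return state == "win", registers


print(parse("Dogs bark."))
print(parse("Run!"))
```

After the second parse, the "comments" register still holds "no-subject", so a later stage can deduce that the input was a command or a question.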

As the parser proceeds through the ATNs it constructs a tree by joining, as limbs,
those sub-trees from which it returns in the 'win' state. Upon returning to the starting
ATN it has constructed a tree of nodes that represent the structure of the input
sentence it has just read. This tree is the output of the parser and contains all of the
information that the parser has been able to gather about the sentence by matching it
to the grammar represented by the ATNs. The tree structure consists of nodes that
contain named registers. The registers may contain any information about the input
that the parser, following the pattern of the ATNs, was able to deduce. The content of
the parse tree thus created depends completely upon how carefully and for what
purpose the ATNs were constructed. The purpose may be as simple as recognition of a
sentence of the grammar or as difficult as understanding the semantics of that
sentence. The purpose of YANLI's ATNs is to discover the syntactic structure of the
input sentences (as well as it can without using world knowledge) and to make the
elements of that structure easily accessible to other programs, in particular to make
those elements available to the response generator.
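
Tree construction during the traverse can be sketched as follows. The sketch is Python and illustrative only: the node layout (type, registers, children), the small grammar and the rule that each matched word is stored in a register named after its category are assumptions, not YANLI's actual representation.

```python
LEXICON = {"the": "determiner", "dog": "noun", "barked": "verb"}

# network -> state -> list of (test, next state); an assumed toy grammar.
GRAMMAR = {
    "sentence": {
        "test-for-subject": [("subject", "test-for-predicate"),
                             ("t", "test-for-predicate")],
        "test-for-predicate": [("predicate", "win"), ("t", "lose")],
    },
    "subject": {
        "test-for-determiner": [("determiner", "test-for-noun"),
                                ("t", "test-for-noun")],
        "test-for-noun": [("noun", "win"), ("t", "lose")],
    },
    "predicate": {
        "test-for-verb": [("verb", "win"), ("t", "lose")],
    },
}


def traverse(name, words, pos):
    """Return (node, new position) on a win, or (None, old position) on a lose."""
    node = {"type": name, "registers": {}, "children": []}
    states = GRAMMAR[name]
    state = next(iter(states))
    start = pos
    while True:
        for test, target in states[state]:
            if test == "t":
                break
            if test in GRAMMAR:
                child, new_pos = traverse(test, words, pos)
                if child is not None:
                    node["children"].append(child)  # join winning sub-tree
                    pos = new_pos
                    break
            elif pos < len(words) and LEXICON.get(words[pos]) == test:
                node["registers"][test] = words[pos]  # record what was seen
                pos += 1
                break
        else:
            return None, start
        if target == "win":
            return node, pos
        if target == "lose":
            return None, start
        state = target


def parse(sentence):
    words = sentence.lower().rstrip(".?!").split()
    tree, pos = traverse("sentence", words, 0)
    return tree if tree is not None and pos == len(words) else None


tree = parse("The dog barked.")
print(tree)
```

Only sub-trees returned in the win state are attached, so the finished tree mirrors the path the parser actually took through the networks.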

71 The ATNs supplied with YANLI follow this form. The effort is made to make as explicit
as possible the process that is to occur, so notations such as "(Np/(cat det..))(jump Np/..).."
etc. have been dropped in favor of descriptive phrases.

3.3.2.3 Can ATNs do the job?

It was mentioned in section 2.2 above that anything that can be computed can be
generated by a grammar. There are then, no theoretical problems with using grammars
to implement anything computable. That does not mean that they are a desirable means
of implementing any program. ATNs are, after all, grammars and the word "grammar"
brings to mind the rules for speaking or writing English. It is to be expected that the
primary association of the word grammar is with natural language, speech, writing or
communication. Language/grammar systems are usually involved in the transfer of
useful information from a sender to a receiver. In fact it's hard to think of a
grammar/language as anything but the rules of natural communication between two
humans. Successful communication between humans cannot occur unless the sending
and receiving parties agree upon the nature of a valid message, and since any such
agreement generates a set of rules, language/grammar must be closely related to
that communication. But practically speaking, do language and grammar apply only
to reading and writing and speaking and hearing?

The answer is no. Grammar/language systems are often treated in a more general way.
A grammar may be considered simply as a hierarchically organized set of rules that
specify the order with which a set of objects may be combined. For a spoken or
written language the objects are symbols, for other languages the objects might be
forms and colors, actions and processes, or any other set of objects that work. All of the
sequences of objects that a grammar serves to define, that is, all of the sentences
generated by the grammar, taken together, comprise the language associated with that
grammar. A grammar and its language are interchangeable since the grammar can
generate every sentence in the language and since the sentences of the language (if there
are a finite number of them), taken together, serve to define the rules in the sense that
the grammatical correctness of a sentence may be checked by observing whether it is in
the set of sentences. Implications of the generality of the grammar concept are in
evidence whenever any system other than human speech or writing is referred to as a
language. For instance, references to the language of love, body language (or
body English), the unspoken language (meaning a set of rules of behavior that can
communicate intent or meaning without the use of speech or writing), animal
language, computer language, mathematical language, and the genetic code (or
language) imply that there exist grammar-like rules for a diverse number of systems.
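
The interchangeability of a finite grammar and its language can be shown with a toy example. The sketch is in Python for illustration; the grammar itself and the names generate and LANGUAGE are assumptions, and the approach works only because the grammar is non-recursive and therefore generates finitely many sentences.

```python
from itertools import product

# A small non-recursive grammar: symbol -> list of alternative productions.
# Any symbol without an entry is a terminal word.
GRAMMAR = {
    "sentence": [["subject", "predicate"], ["predicate"]],
    "subject": [["the", "noun"]],
    "noun": [["dog"], ["cat"]],
    "predicate": [["barked"], ["slept"]],
}


def generate(symbol):
    """Return every word sequence that `symbol` can produce."""
    if symbol not in GRAMMAR:
        return [[symbol]]
    sentences = []
    for production in GRAMMAR[symbol]:
        parts = [generate(s) for s in production]
        for combo in product(*parts):
            sentences.append([w for part in combo for w in part])
    return sentences


# The language: the set of all sentences the grammar generates.
LANGUAGE = {" ".join(s) for s in generate("sentence")}


def is_grammatical(sentence):
    """Check correctness by membership in the set of sentences."""
    return sentence in LANGUAGE


print(sorted(LANGUAGE))
print(is_grammatical("the dog barked"), is_grammatical("dog the barked"))
```

Here the rules produce the set, and the set in turn answers every question of grammatical correctness that the rules could.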

The word information, like the word grammar, has a common meaning and a more general
meaning. Commonly, information is used to describe data that has been acquired by a
human who was not previously in possession of that data. More generally, information
refers to the reduction in the number of possible states in which a system can be. The
common sense form of the word complies with the more general model in that when
one receives information the large number of possible states of the object of the
information is reduced so far as the receiver of the information is concerned. The
conveyance of this common sense information from one entity to another can be
accomplished through the use of language but in fact most information is conveyed not
from entity to entity but from an entity's environment to the entity. For example, a bear
in his passage through the woods leaves signs that a skilled tracker can read. There is
little doubt that when the tracker observes the tracks he has received information
because the possible locations of the bear are reduced.
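
The general sense of information as a reduction in the number of possible states can be given a number. The source offers no measure, so the following Python sketch assumes the standard logarithmic one with equally likely states; the function name information_gained and the figures of 64 and 4 regions are illustrative assumptions.

```python
import math


def information_gained(states_before, states_after):
    """Bits of information in narrowing equally likely states.

    Assumes every state is equally likely before and after the observation.
    """
    if not 0 < states_after <= states_before:
        raise ValueError("need 0 < states_after <= states_before")
    return math.log2(states_before / states_after)


# Suppose the bear could have been in any of 64 regions of the woods, and
# the tracks leave only 4 of them possible.
print(information_gained(64, 4))
```

On this measure the tracker who reads the tracks has gained information whether or not anyone intended to send it.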

Although the tracker can describe a set of rules that explain bear behavior in terms of
the bear's tracks, the tracks would not be accepted as sentences in a language because
the tracks are signs but not symbols. Symbols, as opposed to signs, are intentionally
separate from that which they designate and may be manipulated mentally, that is,
symbols by their nature, require an intelligent entity who uses the symbols to exist. The
objection to bear tracks as a sentence in a language then, depends on the
requirement indicated above that communication involves a sender; one that by
virtue of his ability to manipulate symbols, is aware of the rules of the grammar. The
bear's tracks are a natural phenomenon; they are not symbols, and the bear is neither
aware that he is making tracks nor doing so intentionally. There is no qualified
sender. So it appears that a bear's tracks, while conveying information do not
comprise a sentence in a language.

What is overlooked in such an analysis is the possibility that, after the bear tracks
have been recognized by the tracker, a mental communication occurs between parts of
the tracker's brain in which symbols representing the bear's tracks are used by some
"sending" portion of the mind to query the "bear track analyzing" section of the
tracker's mind. At least within the confines of the tracker's mind, the bear tracks are a
sentence in a language. It is the premise of those who subscribe to the thought-as-language
school of knowledge representation that the mind handles such messages in
essentially the same manner that it handles spoken language. The fact that spoken
language appears to be processed in a part of the brain separate from the area in which
the processing of visual stimuli occurs does not preclude this possibility. So although
it may be useful in the analysis of written and spoken communication to restrict
the definition of language/grammar as indicated above, there is much to be gained
by allowing the definition to be more general and, as such, more powerful. This
concept can be useful if, for instance, we are trying to construct a computer program
that acts like a bear tracker. On the other hand, a linguist or other person involved in
analyzing the spoken or written word might find his purpose confused if he concerned
himself with such grammars. For an investigator of intelligence, especially as it relates to
building models on a computer, the more general definition is the useful one. Perhaps such generalized
grammars should have a different name just to set them apart from the more
traditional grammars. But then, much of what goes on in the mind involves spoken
and written language so that such a distinction would be artificial and arbitrary. A
more useful way to think of grammars and languages is to simply equate language
with grammar, and allow grammars to be defined as any set of hierarchically described
rules that are useful for capturing a structure involved with the manipulation of
information, where information is used in the general sense. Grammars may then be
described for many things not normally considered associated with the concept of
a language. Almost anything organized can be described by a grammar and, when
we provide for a grammar to be dynamic and grow and change over time in response
to a changing environment, the activities of even dynamic systems can be described
by a grammar. For our purposes, to be a language it is necessary only that all of the
sentences that comprise that language (be they propositions, actions, physical
objects or a combination of these), conform to a set of rules. The concept of
grammar and language can provide the tools with which to describe systems and can
provide the control and communication needed to make computer programs
simulate intelligence, perhaps even become intelligent.
The underlying structure of a grammar is that of a graph in which some branches
can (but do not necessarily) connect to previous nodes. The children of a node define
the parent, that is, the children of a node (or the processes they represent) may
replace the parent for any use in the grammar. So a grammar is just a hierarchically
ordered system of substitutions in which recursion is allowed. All of the branches of
the graph either cycle back to a previous node or terminate at some node. The nodes
at which a branch terminates are termed terminal nodes. It is only at the terminal
nodes that any action other than substitution can occur. For a simple language
recognizer the terminal nodes implement the comparison of input to the required
values to achieve the desired recognition. Processes other than recognition are also
possible.
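
The structure just described can be illustrated with a minimal sketch. It is written in Python rather than in the Lisp used elsewhere in this work, and the toy grammar, the function names "recognize" and "accepts", and the vocabulary are all assumptions made for illustration. Each non-terminal is replaced by its children, the node "nominal" cycles back to itself, and input is compared only at the terminal nodes.

```python
# Each non-terminal maps to its alternatives; each alternative is a sequence
# of children that may replace the parent. Names absent from the table are
# terminal nodes. "nominal" refers back to itself, so the graph has a cycle.
GRAMMAR = {
    "sentence": [["noun-phrase", "verb", "noun-phrase"]],
    "noun-phrase": [["determiner", "nominal"]],
    "nominal": [["adjective", "nominal"], ["noun"]],
    "determiner": [["the"], ["a"]],
    "adjective": [["large"], ["fresh"]],
    "noun": [["bear"], ["tracker"], ["tracks"]],
    "verb": [["reads"], ["leaves"]],
}


def recognize(node, words, pos=0):
    """Yield each input position that can follow a match of node at pos."""
    if node not in GRAMMAR:
        # Terminal node: the only place where input is compared.
        if pos < len(words) and words[pos] == node:
            yield pos + 1
        return
    for alternative in GRAMMAR[node]:
        # Substitute the children for the parent, left to right.
        positions = [pos]
        for child in alternative:
            positions = [q for p in positions
                         for q in recognize(child, words, p)]
        yield from positions


def accepts(sentence):
    """True when the whole sentence is derivable from the 'sentence' node."""
    words = sentence.lower().split()
    return len(words) in recognize("sentence", words)


print(accepts("the tracker reads the large fresh tracks"))
```

The recursion in this sketch is on the right-hand side of the rule only; a rule whose first child is the rule itself would need additional care to avoid an endless substitution.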

One such system is that proposed by Marvin Minsky as a model for cognition in the
human mind (Minsky 1987). Minsky proposes that the processing of information that
occurs in the mind is handled by a hierarchically structured set of agencies. Each
agency is responsible for only a small set of tasks; in fact, most agencies only assign
tasks to other agencies. Eventually, at the bottom of this bureaucracy there must be
agents that actually do some work. A salient point that Minsky makes is that
cognition and intelligence can be modeled by hypothesizing only a set of agencies,
each of which possesses little if any intelligence, and an organization that allows the
agencies to combine their talents in a way that allows tasks requiring intelligence to be
accomplished. A grammar can be used to describe this structure save only that the
role of the terminal node is enlarged so that it can perform not just tasks of recognition
but any task. This is no large extension in a computer implementation of a grammar.
Since the recognition that goes on at the terminal nodes of a grammar is a process, all
of the machinery of process initiation is already present. A computer program that
performs recognition tasks based on a grammar should be easily converted to a
program that controls processes by the substitution of one kind of process for another.

The ability of grammars (or ATNs) to perform the functions of knowledge acquisition,
representation and process control has been demonstrated in the programming
language 'g' (Glasgow 1988)72. 'g' is essentially an implementation, as a language, of an
extended form of the Backus-Naur grammar notation (an example of which appears
above). The extensions allow for the initiation of procedures at any point in the
grammar and the substitution (if desired) of processes that provide recognition
procedures at the terminal nodes of the grammar. The language works by translating
the extended Backus-Naur notation into ATNs that are then interpreted by the standard
interpreter mentioned above. It is then a simple matter to write programs that
demonstrate the abilities of grammars to perform all of the tasks required for the
program we have proposed. With some effort 'g' programs could be used directly to
implement KARA but, although the Backus-Naur notation is easy for humans to use,
the manipulation of that notation by Lisp is not so easy since it is not in the form of a
list. The underlying ATNs, happily, are easily produced and manipulated in Lisp.

72 See Appendix B for a description of the construction of 'g' and some tests for the
capabilities of ATNs (grammars) made with it.
play-blocks = get-blocks [play-with-blocks] [put-blocks-away]
get-blocks = find-box #open-box [get-blocks-out]
find-box = (#west | #south | #east | #north) #go-to-box
get-blocks-out = find-a-block #take-block-out-of-box get-blocks-out
find-a-block = $l1 #scan-for-block #go-get-block
play-with-blocks = #go-to-blocks [build-a-tower]
build-a-tower = find-a-block #put-block-on-stack build-a-tower
put-blocks-away = #go-back-to-box [store-block]
store-block = find-a-block #put-block-in-box store-block
<
$l1 [(listen-for-teacher)] {stays true if teacher doesn't interrupt, if teacher interrupts,
Kara quits to listen}
#west [(setq dir 180)(look-for-object)] {look west for object (box)}
#south [(setq dir 270)(look-for-object)]
#east [(setq dir 0)(look-for-object)]
#north [(setq dir 90)(look-for-object)]
#go-to-box [(setq truth-value (go-to-object 'handle))]
#open-box [(setq truth-value (open-up entity box))]
#scan-for-block [(scan-for-blocks)]
#go-get-block [(setq truth-value (go-to-object 'location))]
#take-block-out-of-box [(move-object entity 4 90) (move-object entity 12 0)
(setq truth-value (move-entity entity 12 180))]
#go-to-blocks [(setq truth-value (move-entity entity 12 0))]
#put-block-on-stack [(move-object entity 3 90) (take-object-to 12 14)
(setq truth-value (go-to 14 25))]
#go-back-to-box [(setq truth-value (go-to 14 12))]
#put-block-in-box
[(move-object entity 3 90) (move-object entity (setq temp-rand (+ 7 (random 3))) 180)
(setq truth-value (move-entity entity temp-rand 0))]
>
grammar-name [kara-play-blocks]

preliminary
[(defvar cone 90) (defvar dir 0) (defvar entity 'kara) (defvar box 'object2)
(defvar object box) (defvar temp-rand 0) (defvar blocks
'(object14 object15 object16 object17 object18 object19))
(load 'environment) (spawn 'listen-for-teacher.o) (envi)]

on-success
[(move-entity entity 10 90) (bn-position-cursor 24 1) (princ "That was fun.") (terpri)]

on-failure [(princ "I don't want to play blocks.")]

.
Figure 3.4. Example of grammar controlled process

Appendix B describes three programs written in 'g' which demonstrate the capabilities
of ATNs. The first performs a parse and outputs a parse tree much in the manner of
YANLI. The second demonstrates the ability of grammars to implement processes. It is
an implementation of the example given by Marvin Minsky (Minsky, 1986) of a child
playing blocks. The example uses the environment described above in which KARA
opens a toy chest, removes the blocks, makes a stack of them, and then replaces the
blocks in the chest. It is a particularly interesting program to watch because KARA
takes many different paths, and performs to various degrees of skill in stacking and
retrieving blocks depending upon the initial position of KARA, the toy chest, and the
blocks. It demonstrates the strong effect of environment on events (see figure 3.4).
The last example demonstrates the ability of a grammar to act as a hierarchical
database.

3.3.2.4 Embedding neural networks

In this implementation we are restricting the use of neural nets to recognition and
associative memory duties. For recognition purposes we assume that there already exist
connection matrices W_i of the feed-forward type. Each i represents a possible object;
views of that object from various perspectives are associated with a 'true' response, and
all other inputs with a 'false' response. Then the look-for-object and scan-for-block routines in
the 'g' program of figure 3.4 above can simply execute true(W_i r) for the particular
objects i that they are designed to detect. To create the association matrices W_i in the
first place will require a training session with the teacher. This can be accomplished by
repetitively repositioning KARA, directing attention, and naming objects. A desirable
extension of this concept is for KARA to use the capabilities of neural nets to
automatically create an association between an object and the representation generated
by the initial configuration of W. In other words, upon initially viewing the object
KARA would generate a unique internal representation. She would automatically
reposition herself or manipulate the object to generate the alternate viewpoints. She
would need only the ability to distinguish an object from its surroundings to create her
own name for the object. Then, in a much less arduous training session, this internal
name could be associated with names supplied by the teacher. The ability to associate
purely internal concepts eventually might be put under the control of KARA.
Unfortunately this scheme cannot be extended too far at the present because of the
limited capacity of neural nets to store and reliably retrieve very large quantities of
data, and/or because that data, once created, cannot be modified. Happily all of these
processes seem to fit into the LIM at the appropriately low level.
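
As a rough illustration of what a recognition routine might do with such a matrix, the sketch below applies a single-layer connection matrix to an input vector and thresholds the result. It is in Python and is only one possible reading: the input vector r, the threshold, the weights, and the function name "feed_forward" are assumptions, and no training procedure is shown.

```python
def feed_forward(W, r):
    """Apply connection matrix W (a list of weight rows) to input vector r."""
    return [sum(w * x for w, x in zip(row, r)) for row in W]


def true(W, r, threshold=0.5):
    """Report whether the first output unit exceeds the threshold."""
    return feed_forward(W, r)[0] > threshold


# Hypothetical, hand-set weights for a detector of one object.
W_BOX = [[1.0, -1.0, 0.5]]

print(true(W_BOX, [1, 0, 1]), true(W_BOX, [0, 1, 0]))
```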

3.3.3 The Acquisition, Representation and Activation of Knowledge

3.3.3.1 The Natural Language Interface

The program's ability to acquire knowledge from instruction depends upon a natural
language interface (NLI). The NLI YANLI (Glasgow 1987), based upon the use of ATNs,
provides this interface. YANLI also contains a response generator that can serve as the
tool through which autonomic responses can be made. The heart of YANLI is the same
ATN interpreter used in KARA. Although there is a set of built-in ATNs representing
the grammar of a subset of English, YANLI can be used as an interpreter of any ATNs
supplied to it. The rules representing actions possible in the environment can be
described by ATNs so YANLI can also serve as access to the environment for both the
teacher and KARA.

Although YANLI is a general purpose ATN interpreter, the ATNs that form a subset of
English grammar allow YANLI to perform a syntactic parse on sentences of common
English including:
1) Declarative sentences consisting of a subject, predicate and objects and with
conjunctions.

John and Mary ate cookies and cake.
Go away. {the you is understood}
The dog, that ran down the street, was my dog. {relative clause}

2) Exclamatory or short idiomatic expressions.

Yes., No., Sure. {converted to grammatical sentences}
No thanks.

3) Questions beginning with the words who, what, where, when, why, how
many, how much, can, could, will, and would.

Who are you?
How many apples do John and Joe have altogether?

4) Conditional statements, questions or rules.

question: if -declaration- then -question-
If the chicken crosses the road then where will she be?

rule: if -declaration- then -declaration-
If the chicken crosses the road then she will be on the other side.

When recognized by the grammar an input is classified according to the kind of
sentence it is, the context in which it occurred, the principal verb it contains and its
syntactical shape. A name is generated by the classification so that subsequent inputs of
like classification can be associated with that name. The scheme is detailed as follows:

1) usage (a=autonomic, z=ATN)
2) context (supplied by the user)
3) type sentence (d=declarative, q=interrogative, i=imperative,
r=rule, e=exclamation)
4) subject type (n=name, p=pronoun, o=object)
5) verb-root (the actual root form of the verb)
6) prepositional-phrase-type (n=name, p=pronoun, o=object)
7) predicate-object-type (n=name, p=pronoun, o=object)

Each sentence is classified according to these seven categories. The letters (or string in
the case of the verb-root and context) are concatenated into a name, in which each of the
items is separated by the symbol "$". The sentence "Mary had a little lamb." is classified
(items 2 through 7) as x$d$n$have$x$o, which means that no context is specified, it is a
declarative sentence, the subject is a name, the root verb is "have", there is no
prepositional phrase, and the verb takes an object. The classification name provides a convenient file name into which YANLI can
store appropriate responses to particular inputs of the same classification. When YANLI
stores the response it prefixes the name with an "a". The name can also be used as an
index into the ATN structure (actually the name of the ATN into which the
information in the sentence is coded, as discussed below). In this case the name is
prefixed with a "z".
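
The naming scheme can be sketched in a few lines. The sketch is in Python and the function name "classification_name" is hypothetical; the use of "x" for an unspecified item follows the "Mary had a little lamb." example above.

```python
def classification_name(usage, context, sentence_type, subject_type,
                        verb_root, pp_type, po_type):
    """Join the seven classification items with '$'; 'x' marks an empty item."""
    items = [context, sentence_type, subject_type, verb_root, pp_type, po_type]
    return "$".join([usage] + [item or "x" for item in items])


# "Mary had a little lamb." stored as an autonomic response (prefix "a").
print(classification_name("a", None, "d", "n", "have", None, "o"))
```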

YANLI parses only a subset of English so that much of what might be entered as input
to the program might not be accepted, ungrammatical constructs in particular. To help
avoid this YANLI has been provided with a table in which short ungrammatical
constructs can be associated with an equivalent grammatical structure. For any
sentence that YANLI can parse, a response can be supplied so that upon encountering
that sentence again the response will be automatically invoked. YANLI provides
functions to extract the parts of speech from the input. These might be used in a
response to customize it to the particular input. For example, (sbj-det 1) extracts the
subject determiner from the first sentence, (po-noun 2 1) extracts the first predicate
object noun from the second sentence of input. Verb roots and tense are also available,
e.g. (verb-root 1) yields the verb root of the first sentence and (verb-tense 1) supplies the
tense of that verb.

The response to be associated with a particular input can be built by the program. The
function that associates an output with a recognizable pattern of input is of the form:

(machine-build-responses-to input-sentence desired-response
    permissible-modifications-to-input test),

where "input-sentence" is any sentence recognizable by YANLI, "desired-response" is
any text, Lisp s-expression or combination of both, "permissible-modifications-to-input"
are acceptable alternative words for the identifiable parts of speech of the input
sentence, and "test" is a boolean value that if true causes the response to be exercised to
determine if it works and otherwise accepts the response as is. The response may
contain sentential variables whose values have been assigned by YANLI during the
parse.

As an example assume there has been created a function to remember, perhaps

(defun remember ()
  (addprop (verb-root 1)
           (list (verb-tense 1)
                 (list (po-det 1) (po-adj 1) (po-noun 1)))
           (eval sbj-noun)))

then KARA could build a response to the input "This object is a small box." with the
function call

(machine-build-responses-to-order
  '(This object is a small box.)
  '((remember) I understand.)
  '((sbj-det *) (sbj-noun {thing.object}) (po-adj {small.medium.large})
    (po-noun *))
  nil).
This has the result of creating a file under the name a$x$d$o$be$x$o that contains
the patterns specified and the response as follows:

69 (1 22) (this *)
2 (43 3) ({thing.object} object)
6 (7) (is)
16 (17) (a)
17 (62 18) ({small.medium.large} large)
18 (19 66) (box *)
21 nil (((remember) I understand.))
23 (24) ({thing.object})
27 (28) (is)
37 (38) (a)
38 (39) ({small.medium.large})
39 (40) (*)
42 nil (((remember) I understand.))
46 (47) (is)
56 (57) (a)
57 (58) ({small.medium.large})
58 (59) (*)
61 nil (((remember) I understand.))
62 (63) (*)
65 nil (((remember) I understand.))
68 nil (((remember) I understand.))

If YANLI is activated to read input in its "respond-to" mode then any sentence that
matches the pattern will cause the function "remember" to be executed and "I
understand" to be printed on the screen. "remember" causes the predicate object and a
time frame indicator to be included in the property list of the verb root "be" in
association with the referent of the subject noun. Input sentences are matched to the list
of file a$x$d$o$be$x$o by determining if the parts of speech of the input sentence match
some sequence in the file. The number at the beginning of each line, except the number
beginning the first line, stands for the relative position of a part of speech in a preset
sequence of parts of speech (the number beginning the first line represents the value
from which an extension of the total list must occur). The first part of speech checked is
always the subject determiner. In the example above if the input sentence has the
subject determiner "this" or any other word ("*" matches everything) then a match has
occurred. If the word was "this" then the next line to be checked for a match will be the
number corresponding to "this" in the list of numbers on the same line, in this case the
number 1. Since there is no line numbered 1 the line with the smallest number greater
than 1 is used, in this case 2, and an attempt is made to match the input sentence noun
with "object" or "thing". If the matching process succeeds in making it to a line on which
there is no list of forwarding line numbers then whatever is found in parentheses is
considered to be an executable response.
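
The matching process can be sketched as follows, using the contents of the file listed above. The sketch is in Python and simplifies the scheme in ways that are assumptions, not features of YANLI: the words of the input are taken to line up one-for-one with successive pattern lines, "{a.b.c}" is read as a set of alternatives, and the response is kept as an uninterpreted string. The names "fits" and "respond_to" are chosen for this sketch only.

```python
import bisect

# First line of the file: forwarding numbers and the patterns they belong to.
FIRST = ([1, 22], ["this", "*"])

# Remaining lines, keyed by line number. A forwarding list of None marks a
# response line.
RESPONSE = "((remember) I understand.)"
LINES = {
    2: ([43, 3], ["{thing.object}", "object"]),
    6: ([7], ["is"]),
    16: ([17], ["a"]),
    17: ([62, 18], ["{small.medium.large}", "large"]),
    18: ([19, 66], ["box", "*"]),
    21: (None, RESPONSE),
    23: ([24], ["{thing.object}"]),
    27: ([28], ["is"]),
    37: ([38], ["a"]),
    38: ([39], ["{small.medium.large}"]),
    39: ([40], ["*"]),
    42: (None, RESPONSE),
    46: ([47], ["is"]),
    56: ([57], ["a"]),
    57: ([58], ["{small.medium.large}"]),
    58: ([59], ["*"]),
    61: (None, RESPONSE),
    62: ([63], ["*"]),
    65: (None, RESPONSE),
    68: (None, RESPONSE),
}
NUMBERS = sorted(LINES)


def fits(pattern, word):
    """'*' matches anything, '{a.b}' matches any listed word, else equality."""
    if pattern == "*":
        return True
    if pattern.startswith("{"):
        return word in pattern.strip("{}").split(".")
    return pattern == word


def respond_to(sentence):
    """Return the stored response if the sentence matches, otherwise None."""
    words = sentence.lower().rstrip(".").split()
    forwards, patterns = FIRST
    for word in words:
        if forwards is None:
            return None                      # words left over after a response
        target = next((f for f, p in zip(forwards, patterns)
                       if fits(p, word)), None)
        if target is None:
            return None                      # no pattern on this line matched
        # Use the line with that number, or the smallest number greater.
        index = bisect.bisect_left(NUMBERS, target)
        if index == len(NUMBERS):
            return None
        forwards, patterns = LINES[NUMBERS[index]]
    return patterns if forwards is None else None


print(respond_to("That thing is a small table."))
```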

As an example suppose the thing indicated is object34 (the value of the atom "thing"
could be 'object34), then

(respond-to '(That thing is a small table.))


causes "I understand" to be printed on the screen and the property list of the verb "be"
to contain

(... object34 ((present (a small table))) ...)

and we could say that the program has acquired the knowledge that object34 is a small
table. If object34 has been indicated and the statement "This object was a large box."
was issued then the property list would contain

(... object34 ((present (a small table)) (past (a large box)))...).

This example is not intended to imply that representations in KARA need be of the
semantic net variety with is-a and has-a type links. Rather the example is intended to
show the mechanism by which any sort of information storage, as represented here by
the function "remember", can be implemented. For instance "remember" could be
implemented as a modifiable ATN or even an image of some sort should the proper
hardware be available.
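
One simple way to picture the storage performed by "remember" is a table keyed by verb root and referent. The sketch below is in Python; the dictionary layout and the explicit arguments are assumptions standing in for the Lisp property list and the part-of-speech extraction functions.

```python
from collections import defaultdict

# property_lists[verb_root][referent] -> list of (tense, predicate object)
property_lists = defaultdict(dict)


def remember(verb_root, tense, predicate_object, referent):
    """Record a (tense, predicate object) pair for the referent, once."""
    entries = property_lists[verb_root].setdefault(referent, [])
    entry = (tense, tuple(predicate_object))
    if entry not in entries:
        entries.append(entry)


remember("be", "present", ["a", "small", "table"], "object34")
remember("be", "past", ["a", "large", "box"], "object34")
print(property_lists["be"]["object34"])
```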

3.3.3.2 The use of ATNs to represent knowledge

The input from the teacher can be used to create ATNs that act both as a representation
of the input and a means of associating an appropriate response with that input. The
necessary function is one that creates an ATN or modifies an existing one. In KARA it is
named "edit-atn". "Edit-atn" takes as its argument a list of items that are substituted into
an ATN template (see figures 3.2 and 3.3) to form the new or modified ATN. If an ATN
already exists with the atn-name given in the argument list then the existing ATN is
modified by the addition of the items on the list. If no ATN of the given name already
exists then a new ATN is created. The argument list must be built in the following
manner (essentially an ATN without the ifs and but-firsts):

(atn-name
  (state-name1 (cond1.1 state1.1 do-list)
               (cond1.2 state1.2 do-list)...)
  (state-name2 (cond2.1 state2.1 do-list)
               (cond2.2 state2.2 do-list)...)
  . . .
  (state-namem (condm.1 statem.1 do-list)
               (condm.2 statem.2 do-list)...))

where the "cond"s are any s-expression returning t or nil, "state-name"s must be unique
and "do-list" is a sequence (any length including 0) of s-expressions to be executed
should the related condition evaluate to true.

(defun handle-declaration-with-object ()
  (edit-atn
    ; These values will be substituted into a template.
    (subst current-atn 'atn-name
      (subst (sbj-noun 1) 'subjectnoun
        (subst (verb-tense 1) 'verbtense
          (subst (po-noun 1) 'ponoun
            (subst (verb-root 1) 'verbroot
              ; This section represents the calculation of the template.
              '(atn-name
                (find-subject ((equal sbj-noun subjectnoun) subjectnoun))
                (subjectnoun ((and (equal (verb-tense 1) verbtense)
                                   (equal po-noun ponoun))
                              single-predicate-object))
                (single-predicate-object (t win (remember)
                                            (princ "ok")))))))))))

(defun handle-rule-with-s-expression ()
  (edit-atn
    ; These values will be substituted into a template.
    (subst (construct-name-from-rule-name) 'atn-name
      (subst (apply 'append (list (list po-det) (list po-adj)
                                  (list po-noun)))
             'predicate-object
        (subst (apply 'append (list (list pp-prep) (list po-det)
                                    (list po-adj) (list po-noun)))
               'prepositional-phrase
          (subst (eval (po-noun 1 2)) 'parenthesized-object
            ; This section represents the calculation of the template.
            '(atn-name
              (find-object
                ((equal (apply 'append (list (list po-det)
                                             (list po-adj) (list po-noun)))
                        predicate-object)
                 find-preposition-object
                 (setq exists-object1? (cadr parenthesized-object))))
              (find-preposition-object
                ((equal (apply 'append (list (list pp-prep) (list po-det)
                                             (list po-adj) (list po-noun)))
                        prepositional-phrase)
                 execute
                 (setq exists-object2?
                       (revget be (apply 'append (list (list po-det)
                                                       (list po-adj) (list po-noun)))))))
              (execute
                ((and exists-object1? exists-object2?)
                 win
                 (print "ok")
                 parenthesized-object
                 (account))))))))))

In both cases the functions return ATNs.

Figure 3.5: Some functions that will generate executable ATNs

If there already exists an ATN with the
given name then a matching process takes place between the argument list and the
existing ATN (that is itself just a list). First a match attempt for the first state on the
argument list takes place. If a match occurs then a match for the first condition under
that state is attempted. If found, then a check is made to see if the "transition-to-state"s
agree. If not the old "transition-to-state" is replaced with the new "transition-to-state".
Finally the items of the new "do-list" are added to the old "do-list" (eliminating
redundant items). At any point that a match fails, then the whole tree of items under
that new state or condition is added at the end of the list of states or conditions as
appropriate.
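
The merge just described can be sketched as below. The sketch is in Python, with nested lists standing in for Lisp lists and strings standing in for s-expressions; the registry "ATNS" and the decision to loop over every state and condition of the argument list are assumptions, since the text spells out only the first match attempt.

```python
import copy

ATNS = {}   # hypothetical store of existing ATNs, keyed by atn-name


def edit_atn(argument_list):
    """Create the named ATN, or merge the argument list into the existing one."""
    name, *new_states = copy.deepcopy(argument_list)
    atn = ATNS.setdefault(name, [name])
    for new_state in new_states:
        old_state = next((s for s in atn[1:] if s[0] == new_state[0]), None)
        if old_state is None:
            atn.append(new_state)            # whole state added at the end
            continue
        for new_arc in new_state[1:]:
            old_arc = next((a for a in old_state[1:]
                            if a[0] == new_arc[0]), None)
            if old_arc is None:
                old_state.append(new_arc)    # whole condition added at the end
                continue
            old_arc[1] = new_arc[1]          # new transition-to-state wins
            for item in new_arc[2:]:         # extend do-list, no duplicates
                if item not in old_arc[2:]:
                    old_arc.append(item)
    return atn


edit_atn(["z$x$d$o$be$x$o",
          ["find-subject", ["(equal sbj-noun object)", "object"]],
          ["object", ["(equal po-noun box)", "single-predicate-object"]],
          ["single-predicate-object",
           ["t", "win", "(remember)", '(princ "ok")']]])
merged = edit_atn(["z$x$d$o$be$x$o",
                   ["find-subject", ["(equal sbj-noun thing)", "thing"]],
                   ["thing", ["(equal po-noun window)",
                              "single-predicate-object"]],
                   ["single-predicate-object",
                    ["t", "win", "(remember)", '(princ "ok")', "(account)"]]])
print(merged)
```

In this sketch a state that is new to the ATN is always placed at the end of the list of states, as the text specifies.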

This is straightforward and consists mainly of arranging items into lists in the
appropriate order. A number of functions representing the various classifications of
input, call on the "edit-atn" function using the parts of speech of the input as arguments.
These functions determine the actual shape of the ATN. For example, the declaration
that something acts through a verb on some object might be handled by the function
"handle-declaration-with-object" shown in figure 3.5. If YANLI has just parsed the
sentence "That object is a small box." and determined that it consists of a declarative
sentence with an object so that the "handle-declaration-with-object" function is called
the following ATN will be created

(z$_$d$o$be$_$o
  (find-subject
    (if (equal sbj-noun object)
        transit-to-state
        object))
  (object
    (if (and (equal (verb-tense 1) present)
             (equal po-noun box))
        transit-to-state
        single-predicate-object))
  (single-predicate-object
    (if t
        transit-to-state
        win
        but-first
        (remember)
        (princ "ok"))))

A future YANLI parse resulting in the same classification can execute z$x$d$o$be$x$o
as a function in an attempt to satisfy the input. In this example, when z$x$d$o$be$x$o
is executed in response to "That object is a small box." the word ok is shown on the
screen, and "object34 (present (a small box))" is added to the property list of the verb
root "be" (through the action of the function "remember"). Subsequent inputs that yield
the same classification but with different specific nouns or objects will cause the
addition of states to the network. This modifies the ATN so that it can handle the new
information. For example if "That thing is the window." is input (with "thing"
indicating object81) then the call to handle-declaration-with-object modifies the above
ATN yielding:

(z$_$d$o$be$_$o
  (find-subject
    (if (equal sbj-noun object)
        transit-to-state
        object)
    (if (equal sbj-noun thing)
        transit-to-state
        thing))
  (object
    (if (and (equal (verb-tense 1) present)
             (equal po-noun box))
        transit-to-state
        single-predicate-object))
  (thing
    (if (and (equal (verb-tense 1) present)
             (equal po-noun window))
        transit-to-state
        single-predicate-object))
  (single-predicate-object
    (if t
        transit-to-state
        win
        but-first
        (remember)
        (princ "ok")
        (account))))

and the property list of "be" would have "object81 (present (the window))" added to it.
By modifying, adding to or replacing the functions activated in the "but-first" section of
a winning state, special action for any given combination of noun, objects, verb, context,
and sentence type can be provided. For example the function "account" has been added
to keep track of how often this information has been requested or proffered, or to
analyze context changes or perhaps to elicit requests for approval or for further
information. Since the ATNs can be manipulated, either through "edit-atn" or directly
by KARA they can be taught to KARA. At first those modifications will have to be
implemented by providing specific Lisp functions. These, however, immediately
become associated with the normal English phrases that command their execution.

When YANLI detects a parenthesized string in a position that could be filled by a noun,
it considers the parenthesized string to be an object to be handled literally and in its
entirety. In this way the teacher can provide KARA with executable Lisp expressions.
These s-expressions can be given in the form of rules. A function to generate ATNs
when rule type input is encountered is the "handle-rule-with-s-expression" function in
figure 3.5.

YANLI parses "If you want to move the small table to the window then execute (move
(revget be '(the small table)) (window x) (window y))." creating the classification name
z$x$r$p&p$move&execute$o&x$o&o. YANLI recognizes two separate sentences; the
first is "You want to move the table to the window" and the second is "you execute
parenthesized-object" where "parenthesized-object" contains the s-expression "(move ..."
as a value. Subsequent activation will come only from an input of the form "move the
small table to the window." The name that would classify this input is easily extracted
from the name generated for the rule and is accomplished by the function "construct-
name-from-rule-name". The "(apply 'append (list ..." sequence reconstructs the relevant
predicate object or prepositional phrase and revget finds the primitive object definition
associated with "the small table", i.e. object34. So handle-rule-with-s-expression
produces

(z$_$c$p$move$o$o
  (find-object
    (if (equal (apply 'append (list (list po-det) (list po-adj)
                                    (list po-noun)))
               '(the small table))
        transit-to-state
        find-preposition-object
        but-first
        (setq exists-object1? (revget be
                                      '(the small table)))))
  (find-preposition-object
    (if (equal (apply 'append (list (list pp-prep) (list pp-det)
                                    (list pp-adj) (list pp-noun)))
               '(to the window))
        transit-to-state
        execute
        but-first
        (setq exists-object2? (revget be '(the window)))))
  (execute
    (if (and exists-object1? exists-object2?)
        transit-to-state
        win
        but-first
        (move (revget be '(the small table)) (window x)
              (window y))
        (account))))

Once this ATN is established the command "move the small table to the window" will
result in the execution of "(move (revget be '(the small table)) (window x) (window y))."
If either of the phrases "the small table" and "the window" has never been associated with an
object in the environment, the fact is detected and the process will fail in the execute
state. Though not explicitly indicated in the example, other contingencies, such as the
immobility of an object that is to be moved, can be reflected in the value returned by the
primitive. The ATN can be modified to respond to those contingencies.
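
How such an ATN might be executed can be sketched with a small interpreter. The sketch is in Python and is not the interpreter used by KARA or YANLI; conditions and "but-first" actions are represented as Python callables acting on a dictionary, and the names "run_atn", "phrase_is", "look_up" and "move", as well as the contents of the dictionary, are assumptions made for illustration.

```python
def run_atn(atn, env, max_steps=100):
    """Interpret [name, [state, arc, ...], ...]; an arc is
    (condition, next_state, *actions). Returns True on reaching "win"."""
    states = {state[0]: state[1:] for state in atn[1:]}
    current = atn[1][0]                      # start in the first state listed
    for _ in range(max_steps):
        if current == "win":
            return True
        for condition, next_state, *actions in states.get(current, ()):
            if condition(env):
                for action in actions:       # the "but-first" items
                    action(env)
                current = next_state
                break
        else:
            return False                     # no arc applies: fail here
    return False


def phrase_is(key, words):
    return lambda env: env[key] == words


def look_up(slot, words):
    """Stand-in for revget: find the object associated with a phrase."""
    def action(env):
        env[slot] = env["known"].get(words)
    return action


def move(env):
    env["moved"].append((env["object1"], env["object2"]))


MOVE_ATN = [
    "move-table",
    ["find-object",
     (phrase_is("po", "the small table"), "find-preposition-object",
      look_up("object1", "the small table"))],
    ["find-preposition-object",
     (phrase_is("pp", "to the window"), "execute",
      look_up("object2", "the window"))],
    ["execute",
     (lambda env: bool(env["object1"] and env["object2"]), "win", move)],
]

env = {"po": "the small table", "pp": "to the window", "moved": [],
       "known": {"the small table": "object34", "the window": "object81"}}
print(run_atn(MOVE_ATN, env), env["moved"])
```

If a phrase has no associated object, the lookup leaves the corresponding slot empty and the interpreter fails in the execute state, which mirrors the behavior described above.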

From the accounting information a determination can be made as to the appropriate
time to circumvent the response generated by using the ATN and create it as an
autonomic response by using YANLI as detailed above.

3.3.3.3 The activation of the knowledge stored in the ATNs

The primary means through which KARA can act have been outlined. However, the
higher functions of how the program is to interact with humans and the environment
have not been mentioned. The ATN in figure 3.6 provides a loop in which the teacher is
the instigator of conversation and action to which KARA tries to respond. Since it is an
ATN it too can be modified by KARA. The function "think-while-waiting" serves to
decouple KARA so that it can run independently while waiting for input from the
teacher (or environment). All interactions between the teacher and KARA are time-
stamped and stored. During the time KARA is not occupied with the teacher it can
review the stored protocol; cross-referencing items, or determining which information
needs to be preserved and how. When the teacher interrupts, KARA will read the input
using YANLI in the "silent" mode. This causes the input to be parsed and makes
available the parts of speech extraction functions and a parse tree. At this point control
passes back to the ATN shown in figure 3.6.

When receiving information from declarative sentences or rules KARA will respond
passively acknowledging that it has received the information. In the event that the
information is incomprehensible to KARA then the response is to indicate
what was not understood and to ask for a description in terms that it can understand
(perhaps a rephrasing in the event of ungrammatical syntax). When KARA must
respond to a command or question, the request itself can be manipulated to form a
statement that can be run against existing ATNs. In the event of failure, a search of
similarly classified inputs, with an attempt at substituting the particular parts of speech
of the request can be attempted. The search can be widened to search nearby
classifications. Should all attempts at an answer fail, a default response of some sort can
be given and the conversation continued.

KARA has the ability to recognize similar inputs. If there are only minor variations
between two patterns and there is a response available for the one and not the other,
then substitutions can be made into the known response to create a possible response
for the input for which there is no response. The classification system makes the
identification of similar inputs easy; the availability of the list structured ATNs and the
Lisp function "subst" makes the substitution possible. For instance if KARA knows how
to "Move the small table to the window" and is requested to "move the small box to the
window." then KARA should
substitute box for table in the ATN that provides a response for the "move table"
command and attempt the modified response. This is not as straightforward as it
sounds since KARA must determine that it is an object that needs to be substituted and
must find an appropriate one. Blind substitutions might have undesirable effects so
KARA should rely on the approval or disapproval of the teacher when attempting those
operations. If successful, then the original ATN and the new one can be merged or the
substitution followed by execution can be added as a new sequence in the old ATN.
With these abilities and a considerable amount of training, KARA should be able to
participate in a dialogue similar to that in figure 3.7.
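
The substitution step can be illustrated with a minimal sketch. It is written in Python rather than the Franz Lisp used by KARA, and the nested list standing in for a stored ATN response is a hypothetical layout chosen for illustration, not KARA's actual representation. The function mirrors the behavior of the Lisp function "subst" on list structures.

```python
def subst(new, old, tree):
    """Return a copy of a nested list with every atom equal to `old` replaced by `new`."""
    if isinstance(tree, list):
        return [subst(new, old, item) for item in tree]
    return new if tree == old else tree


# Hypothetical stored response for "move the small table to the window".
move_table = ["respond",
              ["move", ["small", "table"], "to", "window"],
              ["say", "ok"]]

# Candidate response for "move the small box to the window".
move_box = subst("box", "table", move_table)
print(move_box)
```

Because the original list is copied rather than modified, the known response is preserved while the candidate is tried, which leaves room for the teacher to approve or reject the result before anything is merged.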

(conversation
  (start-turnarounds
    (if t transit-to-state listen but-first (make-greeting)))
  (listen
    (if (think-while-waiting) transit-to-state respond-or-quit)
    (if t transit-to-state listen but-first (handle 'no-sentence)))
  (respond-or-quit
    (if (response-exists) transit-to-state listen but-first
        (make-response))
    (if (null quit) transit-to-state determine-response)
    (if t transit-to-state win but-first (clean-up)
        (store-knowledge)))
  (determine-response
    (if (equal sentence-type 'question) transit-to-state
        respond-to-question but-first (prepare-question))
    (if (equal sentence-type 'command) transit-to-state
        respond-to-command)
    (if (or (equal sentence-type 'declaration)
            (equal sentence-type 'rule)) transit-to-state listen
        but-first (handle-declaration)))
  (respond-to-question
    (if (cond ((respond-to-prepared-question) t)
              ((search-for-response) t)
              (t default-response)) transit-to-state listen))
  (respond-to-command
    (if (cond ((respond-to-prepared-command) t)
              ((search-for-command-response) t)
              (t default-response)) transit-to-state listen)))

Figure 3.6: An ATN to implement a conversational mode in KARA
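
The control flow of this ATN can be approximated by an ordinary state machine. The following Python sketch is only an illustration under simplifying assumptions: inputs arrive already classified as (sentence-type, text) pairs, known responses are kept in a dictionary, exhausted input is treated as a request to quit, and the state names are borrowed from the figure. None of the helper names below belong to YANLI or KARA.

```python
def converse(turns, known_responses):
    """Run a simplified version of the conversational loop and return a log of actions."""
    log = ["greeting"]              # start-turnarounds: greet, then listen
    pending = list(turns)
    current = None
    state = "listen"
    while state != "win":
        if state == "listen":
            # Assumption: when no input remains we behave as if the teacher quit.
            current = pending.pop(0) if pending else ("quit", "")
            state = "respond-or-quit"
        elif state == "respond-or-quit":
            kind, text = current
            if text in known_responses:         # a response already exists
                log.append(known_responses[text])
                state = "listen"
            elif kind != "quit":
                state = "determine-response"
            else:                               # clean up and store knowledge
                log.append("store-knowledge")
                state = "win"
        elif state == "determine-response":
            kind, text = current
            if kind in ("declaration", "rule"):
                log.append("acknowledged")      # passive acknowledgement
            else:                               # question or command with no answer
                log.append("default-response")
            state = "listen"
    return log


print(converse([("declaration", "the box is small"),
                ("question", "where is the box")],
               {"where is the box": "by the window"}))
```

The sketch keeps the two properties the figure emphasizes: every turn returns to the listen state, and a default response keeps the conversation going when nothing better is found.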


EXAMPLE SESSION (learning by being told)
K = KARA, T = Teacher

[Diagram of the environment, with the positions of K and T marked]

Environmental primitives used: (move entity x in y), (direction entity qualifier item),
and (distance entity qualifier item). Assume KARA has already been taught how to
issue a command and has been taught the names of the objects in the environment
shown. Bracketed {} words are comments.

T: KARA: go to the outside door.
K: I want to (go to the outside door) but I don't know how to do that.
T: If you want to go to the outside door then move the distance to the
outside door in the direction to the outside door.
K: I want to (move the distance to the outside door in the direction to the
outside door) but I don't know (the distance to the outside door) and
(the direction to the outside door).
T: To find the direction to the outside door use the command (direction
KARA outside door)
K: I understand.
T: To find the distance to the outside door use the command (distance
KARA outside door)
K: I understand.
T: What is the distance to the outside door?
K: The distance to the outside door is 10. {KARA repeats the sentence
and issues the command (distance KARA outside door)}
T: KARA: go to the door.
K: ok {KARA moves to outside door by issuing (move KARA (distance
KARA outside door) in (direction KARA outside door))}
T: KARA: go to the inside door.
K: I'll try. ok? {KARA recognizes similarity in commands and tries to
substitute inside for outside... which should succeed}
T: that was good.{the ATN for going to the outside door can be preserved}

Figure 3.7: A simulated session with KARA

KARA has the ability to act upon context since it is one of the items used to encode and
classify inputs. However, the scheme to be used to first determine and then encode
situational information so that it can be included in the classification is not addressed
here. While context is probably not as important in the early stages of teaching KARA
as it must become later, the sooner some scheme is incorporated into KARA the better.
Rachel Reichman in her book "Getting Computers To Talk Like You and Me" proposes
that ATNs provide the appropriate language in which to present a model for human
discourse. The model she develops could provide the direction for expansion of KARA
from a turnaround based discourse to one in which focus, context and semantics play a
part in the flow of the dialogue. This would provide a multi-level framework in which
contexts can become nested, and that would come closer to normal adult conversation
than the primitive, top-level loop indicated in figure 3.6.

3.4 Summary

Language is a primary means by which humans become intelligent. Language systems
have a symbolic nature and model the higher structures of the human mind. In contrast,
neuromorphic systems model lower portions of the mind. Grammar and neural nets
then provide at least two levels for a multi-level implementation of machine
intelligence. A hybrid learning system named KARA, consisting of grammars
implemented via augmented transition networks together with neural network
recognizers and associators, has been outlined in this part. ATNs are particularly well
suited to the task of a learning program because they are both data structures and
programs and because they can be written in Lisp which allows them to modify
themselves while running. The programming language 'g' has been constructed to
demonstrate the complete utility of grammars in the form of ATNs. The examples in
appendix B indicate their adequacy for any major programming task. ATNs are
commonly used as a method for implementing a natural language interface. One such
interface, YANLI (Glasgow 1987) provides the interpreter for the ATNs used in KARA.
YANLI also provides the natural language interface for the teacher and KARA,
permitting easy access to the primitive operations of the environment. Most
importantly the inputs from the teacher to KARA can be converted into ATNs that can
be categorized, stored away, and accessed later for automatic execution upon KARA's
encountering the same or similar input. ATNs may also be thought of as templates so
that complex variations of actions can be performed by making simple substitutions.
The use of neural networks fits comfortably into this scheme. They perform as
recognizers of the objects in the environment and associators of those objects with their
societal names. The neural nets appropriately reside at the lower levels of the
grammatical hierarchy. Since KARA is actually constructed as learning occurs, and can
change as new information is received, the objection to an artificial intelligence based on
fixed representations (Winograd and Flores, 1982) is avoided. The environment is
constructed so that the program and teacher have equal control in it and can recognize
the relations between the objects in the environment in the same way, giving rise to
shared meanings. There are, however, many problems: the parts are wholly or partially
constructed and verified but not integrated, and it will require considerable effort and time
to reach the point at which learning commences. Once that point is reached, it might
take a long time before KARA's actions are of a quality that justifies concluding that it is a
LIM, that is, that it has moved significantly beyond a DIM towards a UIM. All of this is,
of course, subject to modification depending upon the effort expended to achieve these
goals.

3.5 Conclusions

The successful implementation of a UIM may falter on several points. First, we have
hypothesized that the physical universe is in a state of continuous creation through the
emergence of new levels consisting of objects and associated rules. Support for the
hypothesis relies mainly on consideration of interpretations of the meaning of entropy
and chaos, and on observations in biology, cosmology, chemistry, quantum mechanics,
other physical sciences and even mathematics (Goedel's incompleteness theorem
implies the need for levels that encompass any non-trivial system). Part of the
hypothesis claims that the rules and objects that arise at a level, cannot be wholly
determined from their constituent and surrounding levels. This does not contradict the
existing arguments concerning the origin of the universe in a big bang, or its ultimate
dissolution through entropy or in a big crunch, and it does not deny any of the existing
discovered laws of nature. What it does contradict is one of the favorite arguments of
scientists. The argument takes various forms including mechanism, physicalism,
materialism, or, in general, foundationalism; that is, the idea that everything can be
explained in terms of constituent parts and their rules for interaction. Actually,
foundationalism has been under attack for some time from many quarters. The
principal argument against it is that the required regress of explanations cannot go on
forever, nor can it ever stop. It is a disputed philosophical point.

Another point upon which success might hang concerns the assertion that information
systems obey the same creative urges as physical systems so far as the emergence of
objects and rules at a level is concerned. Here we relied on an assortment of
observations. The most telling of these (to us) is the fact that the entropy equations
describe the same thing in molecular systems devoted to information storage (e.g.
DNA) regardless of whether those systems are considered as consisting of information
or molecules. Then too it has been shown that the destruction of information (as for
example in a computer or human memory) necessarily generates an increase in physical
entropy (as well as in the information entropy). Another piece of evidence can be found
in the fact that models developed to describe neural networks have progressed,
independently, to a point that they may be described as isomorphic to models used by
physicists to describe real world phenomena (Hopfield 1982). And, of course, the most
obvious example of this mathematical similarity between physical and information
systems is seen in the entropy equations themselves, which are identical for both kinds
of systems. Finally psychologists observe a structure in the mind and describe the
emergence of that structure in purely physical terms.

The two points taken together make up the hypothesis of this thesis. They are points
that we accept on the basis of their coherence, or their systematic, cognitive, goodness of
fit with the observed world. Philosophically the hypothesis lies somewhere between
foundationalism and idealism. As regards the mind it implies that, as the physicalists
believe, the mind arises in physical structures. It departs radically, however, from the
physicalists in that it implies that the structure of the mind cannot be wholly
discovered or predicted on the basis of its constituent or surrounding levels. This
implies a third point at which the implementation of a UIM might fail.

Given the hypothesis, we are led to infer that the nature of the mind is that of a system
emerging into the level of human existence. The implementation of a UIM must capture
that nature. This cannot be done by an implementation that focuses on modeling the
activities that occur on the human level of existence for that leads to somewhat less than
dynamic databases that, at best, present a facade of intelligence. This explains the
failure of the DIMs to truly capture intelligence (although they are undeniably useful
machines). But if it is necessary to include the lower levels in any implementation and
we cannot predict the expression of those lower level objects and rules in the objects
and rules at the higher levels, how can an implementation be accomplished? In fact, we
could never be sure that we had correctly implemented the lower level at which we
started. We could be faced with another form of the regression problem in which we
descend to lower and lower levels in search of a secure starting place. Winograd and
Flores (1987) suggest that it might be necessary to recapitulate evolution itself to
recreate the higher levels.

We are not so pessimistic. The fact is that evolution has produced intelligent beings: us.
We don't have to reinvent the wheel. We are here and available for inspection. We
know what we are trying to accomplish, we have a well defined goal against which to
measure the success or failure of our attempts, and our knowledge is growing daily.
True, the tools currently available to implement a UIM are in an early stage of
development and force short cuts and compromises. And true, our knowledge of the
higher and lower levels of the mind is still sketchy or even wrong and forces us to
make assumptions or even leaps of faith. And certainly many attempts will fail before
we achieve even limited success. Still, unless the hypothesis is wrong or we find the
levels of the mind impenetrable (and unguessable), the successful implementation of a
UIM will inevitably occur.

Appendix A

The Turing test (Turing 1950), or a reasonable semblance of it, can be described thus:
There are three beings in a room; a human interrogator, a machine that answers
questions and a male human that answers questions. The three are separated from
each other by partitions so that they cannot see or speak to one another. Each can
communicate with the other participants by means of a video display terminal. The
human interrogator knows each of the other participants by code names, say A and B.
He does not know which of A and B is the machine and which is the human. The
object of the game is for the interrogator to determine which of A and B is the machine.
He may ask any questions of A and B, and A and B may answer with a lie, tell the truth
or be deceptive. Both A and B are to try to make the interrogator believe that he (or it)
is the man. Pick some time limit and play a game. Now modify the set-up so that a
woman replaces the machine. The object of the game is the same as before, that is, both
the man and the woman are to try to convince the interrogator that they are the man.
Use the same time limit and play the game again. The two games form the Turing test. If
this test is performed many times with many different women and the machine is as
successful as the woman at convincing the interrogator that it is the man, then we will
say that the machine is intelligent.

While this is certainly a reasonable way to determine whether a machine can pass for a
human making conversation, questions can be raised as to its appropriateness for
screening for machine intelligence because it could possibly be defeated in two ways.

1. A well programmed automaton might pass the test by virtue of a large
enough repertoire of answers to sentences and sequences of sentences. This might
be achieved by simulating a large enough sample of games in which a woman
plays the part of the machine. The responses of the woman together with a
sufficiently detailed (rather crude) description of the context in which they are
uttered, would then be stored in a syntactically/semantically indexed data base
so that a machine could access them based on syntactic input and some evaluation
of the context. Specific personal information about the physical description and
history of the (hypothetical) man would have to be carefully compiled and
indexed to avoid conflicting responses (e.g. the machine couldn't respond at one
point that he had brown eyes and remark at another that his eyes matched his
blue shirt). Believable confessions of ignorance about particular topics of
conversation, faked confusion, or graceful attempts to change the subject could be
programmed to handle insufficiencies in the data base. This would be a difficult
project, not worth the effort. But if a sufficiently large base of responses were
developed and installed on an appropriately fast computer, it is conceivable that
the Turing test could be passed by the machine. Statements about a world not in
evidence are the only transfer of information permitted in the test. To the machine it
is simply an exercise in intelligent database access; no meaning is attached to the
responses, no anticipation of the effect of those statements on the other
participants is present. The reason such a program is feasible is that, in the Turing
test, the real world has been eliminated, the machine has only to operate in the
artificial and constrained environment of blind casual conversation between
humans. We might consider the program artificially intelligent within those
explicit constraints. Even that is acceptable only to the extent that the adjective
artificial in the phrase artificial intelligence is construed to be synonymous with fake
or pseudo or simulated. If the word artificial is considered to refer to the medium in
which intelligence occurs (machine instead of man) then the machine can be
considered no more or less intelligent than any other database program (and such
programs are not normally considered intelligent).

2. An intelligent machine might easily fail the test. Consider a hypothetical truly
intelligent machine. It has vision sensors and hearing sensors sensitive to much
wider frequencies than human eyes and ears. The machine has other sensors
sensitive to other radiations not detectable by human sense organs. It senses the
proximity of nearby objects with sonar and consequently has little need of tactile
sensors, but on its object manipulation devices it has an array of sensors capable of
detecting chemical and tensile characteristics. It has no sense of taste or smell since
it has no need for food or water, however it can sample gasses, liquids and solids
to determine their chemical composition. It experiences discomfort when its
power is in some way disrupted, perhaps by atmospheric conditions or because of
frayed wiring. It has an aversion to a loss of electrical power much as a human
fears having an operation; that is, it is ever wary of and tries to avoid the loss of
participation in providing for its own continued existence, but not to the point
that it is distrustful of humans. This hypothetical machine is the end result of
years of research. The programs that make up its intelligence were not created by
humans, rather they were created by a master program that has continuously
and throughout the machine's existence, stored the information obtained from its
sensory inputs, analyzed that information, made comparisons with previously
acquired information, amended the old information (and its own evaluation
processes when needed), fit the new information in where possible and opened up
new knowledge categories when new information did not fit in. In other words
the program learned. In a sense the machine was "brought up" and not
programmed. It has no need to reproduce itself because it can diagnose and repair
itself. It will live on indefinitely. In fact, the current body that it occupies is not the
original one in which it began to learn. There have been many improvements in
body technology since then. It looks forward to transferring to the latest model
(that it helped to design) in the near future. It has little fear of accidental
destruction since it regularly backs itself up on a secure device and can only lose
its experiences since the last backup.

As a consequence of the difference in the material nature of this machine and the
material nature of humans the machine views the world in a fundamentally different
way than a human. It would, assuming it has the capacity to lie, give silly answers to
questions about things like the flavors of foods or the desirability of sexual relations. It
would volunteer inaccurate (to humans) descriptions of the appearance of objects and
the sounds of music. It may be enthusiastic and very interested in aspects of the world
that seem unimportant or even meaningless to a human. In the Turing test, it would be
as easy for the interrogator to distinguish this machine from a man as it would be for
the interrogator to distinguish an alien being from another star system from a man. The
machine might be able to pull off the deception by learning an immense amount of
detail about men, but failure to possess such knowledge would in no way compromise
the fact that such a machine deserves to be called intelligent. The problem is that the
Turing test assumes that the human mind with its cognitive capabilities that have been
molded through years of experience in a human environment, serves to define
intelligence. It does not allow for the possibility of a non-human intelligence or the
difficulties that a human would encounter in recognizing such an intelligence.

So the Turing test does not truly provide a test for intelligence but a test for
conversation making skills. It begs the question of intelligence by merely redefining it
in terms of a test. Still most people would agree that a machine that passed the Turing
test was intelligent enough. Certainly they would conclude that a machine that passed
a Turing test was more human than one that passed some IQ test, in seconds, with a
perfect score. In the popular mind it is humanness that is the most important
component of intelligence. Any computer that is to be gladly accepted by the average
human will have to have a good dose of that attribute.

Appendix B

Introduction to the Language g

"g" is written in Franz Lisp and runs on Unix operating systems. To effectively use
"g" a programmer should have a good working knowledge of Lisp. Inasmuch as
"g" has its own syntax and is compilable it is a programming language. Since part of its
syntax is the syntax of Lisp, "g" may be considered an extension of Lisp.

"g" is a computer language that allows the construction of grammars that recognize
(parse) inputs and/or execute processes based on inputs according to a grammar given
in "g" syntax by the programmer. The inputs may be any strings of words, symbols or
tokens (e.g. proper English sentences) or may be taken from some data base (e.g. the
environment described in the example below). A program in "g" consists of a
sequence of annotated grammar rules that describe the hierarchical organization of the
system, followed by a pair of angle brackets that enclose the definitions of the
annotations. This is followed by a sequence of labeled Lisp s-expressions, function
definitions and external references that comprise the basic procedures available to the
rest of the program. S-expression stands for symbolic expression and is defined in
Lisp as any executable expression that returns a value. Since these procedures or
expressions are written in Lisp, "g" can be considered an extension of Lisp. This is the
format of a "g" program. {words in italics are comments, words in bold describe the
kind of token that belongs in that place, and the special characters are the reserved
symbols of the language}

rules in Backus Naur Form (BNF) notation which look like:

non-terminal @name #name $name = sequence of terminals and non-terminals

definitions of symbol prefixed names which look like:


<
@name [s-expression number] or
#name [s-expressions] or
$name [s-expressions]
>

a sequence of headers each followed by a set of associated Lisp s-expressions or Lisp functions
each of which looks like:

header-name [s-expressions]

For example, the following is a program representing a grammar that recognizes some
simple sentences and whose only action is to print information to indicate where in
the grammar the parser is currently working. "g" provides the variables *this-node*
which contains the name of the node currently being processed, *parent-node*
which contains the name of the node that is the parent of *this-node*, *last-word*
which contains the name of the word preceding *current-word* in the input string
and *next-word* which contains the next word of input in the input string. Also
available are the functions "pring" for debug printing (see below), "(show-parse)" to
output a parse tree (outlining the node sequence of a successful parse), "(return-parse)"
to return the results of a successful parse (in the form of the unevaluated s-
expressions), "(execute-parse)" which executes those results and "(plain-name node-
name)" which strips off the suffixes that "g" affixes to node names to make them
unique. Curly brackets in a "g" program contain comments.

sentence $d1 = pronoun ((had object1) | (caught object2 [prep-phrase]))
prep-phrase = #f1 preposition [object3]
object1 @e1 = [determiner] [adjective...] noun1
object2 @e1 = [determiner] [adjective...] noun2
object3 @e1 = #f1 [determiner] [adjective...] noun3 {BNF like Grammar}
pronoun $d2 = (a person) | mary | tom | dick
preposition $d3 = in | with
determiner = a | the
adjective = little | big
noun1 = lamb | dog | pig
noun2 = ball | pass
noun3 = (his hands) | (his glove)
< {--------------------------------------------}
#f1 [(cond ((null (peek 1)) nil)
(t t))]
@e1 [((get (peek 1) 'parse-data) 2)
((get (peek 1) 'parse-data) 3)] {augmentations/enhancements}
$d1 [(terpri)
(pring "I have discovered a sentence ")]
$d2 [(terpri)(pring "I have found a pronoun")]
$d3 [{(terpri)(pring "I have discovered a preposition ")
(pring *last-parsed*)}]
> {--------------------------------------------}
grammar-name [test-grammar]
preliminary
[(clear-screen) (princ "enter a sentence enclosed in parentheses.")
(setq *remaining-words* (read))]
all-terminals
[(terpri)(pring "visiting ") (princ *this-node*) (delay-execution (terpri)
(pring "I am ") (princ '*this-node*) (princ "my parent was ") (princ '*parent-node*)
(terpri) (pring "The last word was ") (princ '*last-word*)
(princ "the next word is ") (princ '*next-word*))]
mary {procedures}
[(delay-execution (pring "mary is a grand old name"))]
on-success
[(show-parse)(terpri) (execute-parse)(terpri)(exit)]
on-failure [(terpri)(terpri)(pring nil)(exit)]
external-functions []
functions
[(defun peek (ahead) (cond ((equal 1 ahead)(cadr *remaining-words*))
((equal 2 ahead)(caddr *remaining-words*))))
(defun clear-screen () (pring "<esc>[2J"))]

Figure 1: An example "g" program that recognizes some simple sentences.


Given that the program has been run and the user has entered the list of symbols
(mary had a little lamb) from the keyboard, then, the "g" program in figure 1 generates
the following screen output:

test-grammar
visiting a
I have found a pronoun
visiting had
visiting a
visiting little
visiting little
visiting big
visiting lamb
I have discovered a sentence
<clear-screen>
test-grammar..00071
sentence..00072
mary
had-object1-caught-object2-%-prep-phrase..05
had
object1..00077
a
mult-adjective..00079
little
lamb
mary is a grand old name
I am had my parent was had-object1..00076
The last word was mary the next word is a
I am a my parent was determiner..00078
The last word was had the next word is little
I am little my parent was adjective..00080
The last word was a the next word is lamb
I am lamb my parent was noun1..00082
The last word was little the next word is nil

Figure 2: Output from the program in figure 1, upon receiving the input (mary had a
little lamb).

A typical use of a grammar is to compare a natural language input string against
it to determine whether the string is in the language, and perhaps to output
information about what the parser has discovered. The program in figure 1 is such a
program. Because "g" is versatile, this is not the only purpose for which a "g"
program can be constructed (see the examples below). But
first we'll describe the features of the grammar as displayed in figure 1.

Control Strategies For Compiling A Grammar Based Language

"g" contains a special program called the interpreter that performs the matching
operations to compare an input sentence against the grammar to determine if it is a
sentence of the language. There are two ways that this could be accomplished. The
first method is a bottom up method. In this method, successive words of the input
sentence are "looked up" using the grammar as a table. When a word or sequence of
input words is found to match the right hand side of a line of the grammar the item
appearing on the left hand side of that line of the grammar is substituted into the
input for those words to form a new input. The process is repeated again and again.
Since one or more words of the input will always produce only one word from the
left hand side of the grammar, the process must always reduce the size of the input
string or at worst, keep it at a constant size (in which case a repeating cycle of
substitutions might occur; such cycles can be avoided by proper grammar
construction). The sequence of substitutions must then terminate because the
sentence has been reduced to one item or there are no more substitutions that can be
made. If the termination occurs because the input has been reduced to a single item
and that item was the root node of the grammar then the input has been shown to be a
sentence of the language.
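
The bottom up method can be sketched as follows. This is an illustration in Python, not part of "g" (which uses the top down method), and it assumes a flattened subset of the grammar in figure 1 in which the optional and repeated items have been written out as separate rules. Since a wrong choice of substitution can lead to a dead end, the sketch tries every applicable substitution and remembers the forms it has already examined.

```python
# (left hand side, right hand side) pairs; an assumed, flattened subset of figure 1.
RULES = [
    ("sentence", ("pronoun", "had", "object1")),
    ("object1", ("determiner", "adjective", "noun1")),
    ("object1", ("determiner", "noun1")),
    ("object1", ("noun1",)),
    ("pronoun", ("mary",)), ("pronoun", ("tom",)),
    ("determiner", ("a",)), ("determiner", ("the",)),
    ("adjective", ("little",)), ("adjective", ("big",)),
    ("noun1", ("lamb",)), ("noun1", ("dog",)),
]


def reduces_to_root(symbols, root="sentence", seen=None):
    """Return True if repeated substitutions can reduce `symbols` to the root node alone."""
    symbols = tuple(symbols)
    seen = set() if seen is None else seen
    if symbols == (root,):
        return True
    if symbols in seen:                 # already explored; also guards against cycles
        return False
    seen.add(symbols)
    for lhs, rhs in RULES:
        width = len(rhs)
        for start in range(len(symbols) - width + 1):
            if symbols[start:start + width] == rhs:
                # Substitute the left hand side for the matched words.
                reduced = symbols[:start] + (lhs,) + symbols[start + width:]
                if reduces_to_root(reduced, root, seen):
                    return True
    return False


print(reduces_to_root("mary had a little lamb".split()))
print(reduces_to_root("mary had lamb a".split()))
```

Each substitution either shortens the input or leaves its length unchanged, so the set of reachable forms is finite and the search must terminate, as argued above.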

The other method is the top down method and is the method used in the
implementation of "g". We will use the above example in describing the process. In
this method the interpreter traverses the graph structure represented by the grammar.
First the root node is visited, then the search continues in a depth first manner from the
left until it reaches a terminal node at which some action can be taken. In our example
the rule described by the root node has two terminal nodes and four non-terminal
nodes in its definition. If the input sentence is in the language described by this
grammar, then it will consist of a pronoun followed by either the sequence "had
object1" or the sequence "caught object2 [prep-phrase]". "g" must proceed from the left
(i.e. with pronoun) in trying to find a match because it is not guaranteed that "pronoun"
will reduce to just one terminal node. If "g" tried to start from any other point it would
have no way of determining at which point in the input sequence to look for a
match. By starting at the left the input sequence can always be inspected from the left
also. The first item here (pronoun) is not a terminal node so "g" shifts its attention
to those things that can be substituted for "pronoun", namely the alternatives "(a
person)", "mary", "tom" and "dick". The first item here is a list of terminal nodes so "g" is
finally able to execute the process that has been provided for the occasion. In this case
the only action taken is to scan the first two words of input to see if they are "a"
followed by "person". If that is the case the terminal function reports success back to
the "pronoun" node; if the words are not found it reports failure. The action of
"pronoun" upon receiving word of success is in turn to report success to its parent
"sentence". "sentence" promptly proceeds to the next item in its definition. If
"pronoun" had received a report of failure it would have transferred control to the
next item on its list (which would have been "mary") since the next item is separated
by a bar from "(a person)" and represents an alternative to "(a person)". Proceeding in
this manner the process must eventually return to the root with a report of success
or failure (or all the input words may have been exhausted before the root was reached,
which is the same as failure). Success means the input was a sentence of the language
and failure means the input was not a sentence.
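
The traversal just described can be sketched in a few lines. The sketch below is in Python, not Lisp, and is not the "g" interpreter: it assumes a simplified grammar in which optional items are written out as separate alternatives, it tries the alternatives of each node in order from the left, and it records a "visiting" line for every terminal node it inspects.

```python
# Non-terminal -> ordered list of alternatives; names not in the table are terminals.
GRAMMAR = {
    "sentence":   [["pronoun", "had", "object1"], ["pronoun", "caught", "object2"]],
    "pronoun":    [["a", "person"], ["mary"], ["tom"], ["dick"]],
    "object1":    [["determiner", "adjective", "noun1"], ["determiner", "noun1"], ["noun1"]],
    "object2":    [["determiner", "noun2"], ["noun2"]],
    "determiner": [["a"], ["the"]],
    "adjective":  [["little"], ["big"]],
    "noun1":      [["lamb"], ["dog"], ["pig"]],
    "noun2":      [["ball"], ["pass"]],
}


def parse(symbol, words, pos, trace):
    """Try to match `symbol` at words[pos:]; return the new position, or None on failure."""
    if symbol not in GRAMMAR:                       # terminal node: inspect the input
        trace.append("visiting " + symbol)
        return pos + 1 if words[pos:pos + 1] == [symbol] else None
    for alternative in GRAMMAR[symbol]:             # alternatives separated by a bar
        current = pos
        for item in alternative:                    # items are taken from the left
            current = parse(item, words, current, trace)
            if current is None:
                break                               # failure: move on to the next alternative
        else:
            return current                          # success is reported to the parent
    return None


def accepts(sentence):
    """Return (is the input a sentence of the language, list of terminals visited)."""
    words = sentence.split()
    trace = []
    end = parse("sentence", words, 0, trace)
    return end == len(words), trace


ok, visited = accepts("mary had a little lamb")
print(ok)
print(visited)
```

For the input "mary had a little lamb" the first terminal visited is "a", from the alternative "(a person)", which fails before "mary" succeeds. The sketch commits to the first alternative that succeeds at each node, which is sufficient for this small grammar but is an assumption about the control strategy rather than a description of "g".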

During the course of the parse "g" keeps track of all the nodes at which success was
reported. Wherever a mother and daughter node are both marked success they are
connected by inserting pointers in the property list attached to each node (each symbol
or "atom" in Lisp has such a property list attached to it). Upon a successful parse
there will exist a chain of pointers starting from the root node from which the details
of a successful parse can be recovered. The graph represented by the links is called a
parse tree. "g" provides the function "(show-parse)" so that the parse tree may be
printed to the screen or back to the calling process.
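
A minimal sketch of the pointer scheme follows, again in Python. A dictionary of dictionaries stands in for the Lisp property lists, and the property names "mother" and "daughters" as well as the function names are assumptions made for illustration. The real node names carry the suffixes that "g" adds to keep them unique; plain names are used here for readability.

```python
from collections import defaultdict

# node name -> property list (hypothetical layout)
plist = defaultdict(dict)


def link(mother, daughter):
    """Connect two nodes that were both marked as successful."""
    plist[mother].setdefault("daughters", []).append(daughter)
    plist[daughter]["mother"] = mother


def show_parse(node, depth=0, lines=None):
    """Follow the chain of pointers from `node` and return the tree as indented lines."""
    lines = [] if lines is None else lines
    lines.append("  " * depth + node)
    for daughter in plist[node].get("daughters", []):
        show_parse(daughter, depth + 1, lines)
    return lines


# Links that a successful parse of (mary had a little lamb) might leave behind.
for mother, daughter in [("sentence", "pronoun"), ("pronoun", "mary"),
                         ("sentence", "had"), ("sentence", "object1"),
                         ("object1", "determiner"), ("determiner", "a"),
                         ("object1", "adjective"), ("adjective", "little"),
                         ("object1", "noun1"), ("noun1", "lamb")]:
    link(mother, daughter)

print("\n".join(show_parse("sentence")))
```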

Syntax and Symbols

Names

Names used in constructing the grammar may consist of any name that would be
acceptable in a Lisp s-expression. Some special names are used and these will be
described below. Each line of the grammar consists of a name, optionally annotated
by one or two special names, followed by an equal sign "=" followed by a sequence of
names and symbols. The string of names and symbols on the right hand side of the
equal sign comprises a definition of the name found on the left hand side of the equal
sign on that same line. The name on the left hand side of the equal sign on the first line
of the grammar is termed the root node. The first names on the left side of the equal
sign on every line of the grammar are termed non-terminal nodes (so the root node
is also a non-terminal node). All remaining names save those beginning with "@" or "$"
are termed terminal nodes (those nodes that do begin with "@" or "$", as well as those
that begin with "#", will be described below). The other symbols, "[]", "...", "|" and
"{}", are shorthand notations for "optional", "repeated", "or" and "comment" respectively.
That is, names enclosed in brackets are optional as far as the grammar is concerned
and may be ignored if that helps match the input. Names followed by three dots
(an ellipsis) may be repeated as many times as may be needed during the process of
matching the input. Names separated by a bar are alternates: any one, but only
one, of a sequence of alternates may be used to parse the input successfully. And
anything between curly brackets is taken to be a comment by the routine that initially
reads the "g" source code and is ignored completely. Items in the definition of a
non-terminal are mandatory except when they are separated from their neighbor by "|"
or included in square brackets "[]". That is, the input must match all such items for
the non-terminal to be considered as an acceptable candidate for a successful parse.
Parentheses must be used to indicate that a number of items should be considered as a
sequence when confusion would arise from the location of a bar. For instance "a b |
c" could be interpreted as "(a b) | c" or "a (b | c)". Parentheses may be used for
making such distinctions, but for no other purpose.
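The meaning of the four notations, and of the parenthesized grouping, can be illustrated with a small backtracking matcher in Python. The tuple encoding ("seq", "alt", "opt", "rep") is an assumption made for this sketch, as is reading "..." as "one or more"; it is not how "g" stores a grammar.

```python
def ends(item, words, pos):
    """Yield every input position at which `item` can finish, starting at pos."""
    if isinstance(item, str):  # a terminal name matches one input word
        if pos < len(words) and words[pos] == item:
            yield pos + 1
        return
    kind, body = item
    if kind == "seq":  # every item is mandatory, in order
        if not body:
            yield pos
            return
        for mid in ends(body[0], words, pos):
            yield from ends(("seq", body[1:]), words, mid)
    elif kind == "alt":  # a | b : one of the alternates is used
        for choice in body:
            yield from ends(choice, words, pos)
    elif kind == "opt":  # [a] : may be matched or ignored
        yield from ends(body, words, pos)
        yield pos
    elif kind == "rep":  # a... : assumed here to mean one or more
        for mid in ends(body, words, pos):
            yield mid
            if mid > pos:  # guard against looping on an empty match
                yield from ends(("rep", body), words, mid)


def matches(item, words):
    """True if `item` can account for the whole input."""
    return any(end == len(words) for end in ends(item, words, 0))


# The two readings of "a b | c" accept different inputs.
grouped_left = ("alt", [("seq", ["a", "b"]), "c"])   # (a b) | c
grouped_right = ("seq", ["a", ("alt", ["b", "c"])])  # a (b | c)
```

With these definitions the input "c" matches only "(a b) | c", and the input "a c" matches only "a (b | c)", which is why the parentheses are needed.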

Augmentations And Enhancements

The special symbols "@", "#", and "$" are used to prefix names that introduce special
functions into the grammar. Each such name is defined in the special section that
follows the grammar proper and that is set off from the rest of the program by being
enclosed in angle brackets. The definitions take the form of the name followed by
one or more blanks followed by a list of s-expressions, or, in the case of "@", a
single s-expression followed by an integer. The s-expressions are used for different
purposes depending upon the symbol that prefixes each name.

The "@" Sign - prefixes names whose s-expression will be used to determine entry
points into the right hand side of a definition. These names must be included in the
text of the grammar between the non-terminal node that is being defined and the equal
sign. The s-expression associated with a name takes the form "[s-expression n]" where n
is an integer between one and the number of sequential items that comprise the
definition on that line (counting a sequence of or'd items as a single item). The action
taken by the interpreter upon encountering an "@" prefixed name is to evaluate
(execute) the s-expression. If the value returned by the evaluation is the special symbol
"nil" then the evaluation is considered false. If any other value is returned then
the evaluation is considered true. If the evaluation is true then instead of proceeding
in its normal manner, from left to right, the interpreter will jump to the item on the line
whose number is n. That is, processing will proceed at the nth sequential item in the
definition. In the example above "@e1" occurs in the line defining "object1". Then, from
the definition of "@e1", if the value of "(equal 'adjective (get (peek 1) 'parse-data))" is
true, that is, if the next word is an adjective, then processing will skip to item 2 on the
line, or "[adjective...]".
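The entry-point rule can be sketched in Python as follows. The function "start_index", the word table and the three-item definition given for "object1" are assumptions for illustration; Python's None stands in for Lisp's nil.

```python
def start_index(items, entry_test, n):
    """Return the 0-based index at which traversal of `items` begins.

    `entry_test` plays the role of the "@" s-expression and `n` is the
    1-based item number that follows it in the definition.
    """
    if not 1 <= n <= len(items):
        raise ValueError("n must name an item of the definition")
    # Lisp treats nil as false and anything else as true.
    return n - 1 if entry_test() is not None else 0


# Hypothetical word table and definition, used only for this example.
PARTS_OF_SPEECH = {"big": "adjective", "the": "determiner", "dog": "noun"}
OBJECT1 = ["[determiner]", "[adjective...]", "noun"]


def next_word_is_adjective(words):
    """Return True when the next word is an adjective, else None (nil)."""
    if words and PARTS_OF_SPEECH.get(words[0]) == "adjective":
        return True
    return None


skipped = start_index(OBJECT1, lambda: next_word_is_adjective(["big", "dog"]), 2)
normal = start_index(OBJECT1, lambda: next_word_is_adjective(["the", "dog"]), 2)
```

When the next word is "big" traversal starts at item 2, "[adjective...]"; otherwise it starts at the first item as usual.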

The "#" Sign - prefixes place-holders for functions that are to be executed as they are
encountered during the interpreter's traversal of the grammar. Normally s-expressions
are associated with the non-terminal elements of the grammar and are executed
conditionally upon the successful matching of the definition of that non-terminal with
some part of the input (see "$" below). The s-expressions associated with names
beginning with "#" are executed as they are encountered and may be a part of the
grammar in the same way any terminal may be a part of the grammar. While these
expressions could be used to recognize input, it is intended that they provide a facility
for the initiation of processes dependent on the interpreter's traversal of the
grammar instead of the successful matching of input against the grammar. Because
they occur exactly as a terminal but are not evaluated as a success or failure depending
upon their matching an element from the input string, it is necessary to provide
the special variable "truth-value" whose value determines whether the node has
succeeded or failed. In the definition of the node the programmer can set the value of
"truth-value". If the value is nil upon exiting, the node will be determined to have
failed; if "truth-value" is anything else the node will be determined to have succeeded.
In the event that "truth-value" is not modified by the node the normal matching of input
against the node name will occur. Then, unless the input includes the node name
(including the "#" prefix) the node will fail.
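The three possible outcomes of a "#" node can be sketched in Python. The function "run_hash_node" and the "state" dictionary are illustrative assumptions; in particular, leaving the input position unchanged when "truth-value" has been set is an assumption of this sketch rather than a documented detail of the interpreter.

```python
UNSET = object()  # marks a truth-value that the node body never assigned


def run_hash_node(name, body, words, pos):
    """Run a "#" node and return (succeeded, new_pos)."""
    state = {"truth-value": UNSET}
    body(state)  # the node's s-expressions run as soon as it is reached
    if state["truth-value"] is UNSET:
        # Default action: match the next input word against the node name.
        matched = pos < len(words) and words[pos] == name
        return matched, pos + 1 if matched else pos
    # nil (None here) means failure; any other value means success.
    return state["truth-value"] is not None, pos


def force_success(state):
    state["truth-value"] = True


def force_failure(state):
    state["truth-value"] = None


def leave_unset(state):
    pass
```

For example, "run_hash_node("#open-box", force_success, [], 0)" succeeds even with no input, while the same node with "leave_unset" succeeds only if the next input word is literally "#open-box".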

The "$" Sign - prefixes names that represent a list of s-expressions to be evaluated
upon the successful matching of input against the definition of a non-terminal. These
names must be inserted in a line of the grammar between the non-terminal and the
equal sign. These s-expressions, executed upon the successful recognition of a
portion of input, are intended to provide two services. The first service is to provide
for the storage and recall of information pertinent to the actions of the interpreter in
its traversal of the grammar. For instance, it might be desirable to store the fact
that a particular word or expression has been encountered so that at a later point in
the parse an appropriate parsing decision can be made. The second service that these
s-expressions provide is to be a vehicle by which the program can act.

Because it is possible, even likely, that sometimes input will be successfully but
incorrectly matched against a portion of the grammar (and, one hopes, later discarded as
an incorrect match), it is a good idea that s-expressions intended to be part of the final
output of the program not be executed immediately upon a successful match,
but be stored in the register "delay-execution" of the property list of the non-terminal
node with which they are affiliated. The same advice also applies to actions to be
taken by terminal nodes. Upon the successful completion of a parse, "g" will have
constructed a "parse-tree" that contains all of the terminal and non-terminal nodes
encountered along the successful path of the parse so that the "delay-execution"
registers associated with a successful parse may be retrieved. "g" provides the functions
"(execute-parse)" and "(return-parse)" that will execute, or return in a list, those
s-expressions in the registers "delay-execution" in the parse tree of the successful parse.
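The effect of delaying execution can be sketched in Python. The class "Node" and the Python versions of "return_parse" and "execute_parse" are assumptions for illustration; callables stand in for stored s-expressions.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.children = []
        self.delayed = []  # stands in for the "delay-execution" register


def return_parse(node):
    """Collect delayed actions along the successful tree, in parse order."""
    actions = list(node.delayed)
    for child in node.children:
        actions.extend(return_parse(child))
    return actions


def execute_parse(node):
    """Run the collected actions and return their results."""
    return [action() for action in return_parse(node)]


# Only nodes linked into the successful tree contribute actions. A node that
# matched on an abandoned path is never linked, so its actions never run.
root, had, discarded = Node("sentence"), Node("had"), Node("has")
root.children.append(had)
had.delayed.append(lambda: "I am node had")
discarded.delayed.append(lambda: "I am node has")
results = execute_parse(root)
```

Because "has" was matched but later discarded, its stored action is neither returned nor executed; had it run immediately on matching, its output could not have been taken back.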

Control and more procedural attachments

The final section of a "g" program is indicated by the closing angle bracket ">" of the
augmentations and enhancements section. It consists of a sequence of header names
each followed by a list of s-expressions. The user may use any header and associate it
with a list of s-expressions. These lists are then available in "g". Some headers (and their
associated list of s-expressions) are essential for a "g" program to do any useful work.
They are as follows (their order in a "g" program is not important):

The header "preliminary" - This header is followed by a list of s-expressions that will
be executed before the grammar portion of the program is entered. The list is
delimited by square brackets as are all lists following headers in the control and
procedural attachments section. This section allows the programmer to define
variables, set up data structures and read and modify initial input. Immediately
after execution of the last s-expression in the preliminary list, control is transferred
to the grammar interpreter. Usually a "g" program will need a list of symbols to
work on. That list may take any valid Lisp form. The globally defined variables
*remaining-words*, *current-word*, *next-word* and *last-word* are provided to
contain and manipulate these symbols. The desired list of input symbols normally
should be stored in *remaining-words*. However, a "g" program may not need to
read input at the start (for example the process oriented program given below takes
its input from the environment). This means that the reading of input cannot be made
automatic. A "g" program depends upon the programmer to place the appropriate
Lisp "read" s-expressions in an appropriate place under the "preliminary"
header. Even when a "g" program requires an input list to work on, a read statement
in the preliminary section is not mandatory since read and exit expressions may be
initiated from any point in the grammar. Further, the use to which such input may be
put is up to the programmer. The normal case would be to add the input to that
contained in *remaining-words* so that it may be treated by the grammar, but the
input may be used for entirely different purposes.

The processing of *remaining-words*, *last-parsed*, *next-word* and *current-word*
that occurs after the preliminary section and during the parse is handled automatically
by "g", although the grammar itself, or rather the s-expressions and functions
associated with the nodes of the grammar, can be constructed to modify any of the
global variables as the programmer may see fit.
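One plausible reading of how the input variables move as words are consumed is sketched below in Python. The class "InputState" and its exact update rule are assumptions; the rule was chosen to agree with the output in figure 3, where the node matching "had" reports the last word as "mary" and the next word as "a".

```python
class InputState:
    """Hypothetical holder for the input variables described above."""

    def __init__(self, words):
        self.remaining_words = list(words)
        self.last_word = None
        self.current_word = None
        self.next_word = words[0] if words else None

    def advance(self):
        """Consume one word, shifting last/current/next along the input."""
        if not self.remaining_words:
            return False  # input exhausted
        self.last_word = self.current_word
        self.current_word = self.remaining_words.pop(0)
        self.next_word = self.remaining_words[0] if self.remaining_words else None
        return True


state = InputState(["mary", "had", "a"])
state.advance()  # now looking at "mary"
state.advance()  # now looking at "had"
```

After the second call the state holds "mary" as the last word, "had" as the current word and "a" as the next word.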

Variable initializing s-expressions should also be included under the "preliminary"
header. In particular any variables intended to be global in nature and which are not
a part of the grammar proper (i.e. that are not a terminal or non-terminal node)
should be initialized with a "defvar" s-expression.

The Header "on-success" - If the interpreter exits the grammar in a win state then
it will execute the sequence of s-expressions listed under the "on-success" header.
If "(return-parse)" is included as the last item here, the sequence of actions that were
included in a "delay-execution" enclosure and that were part of the actions to be
taken by terminal nodes along the successful parse path will be returned by the
program as a list. If the program was initiated by another "g" program then the list is
returned to that program. The contents of the return-parse list are such that the calling
program can attach them to its own parse tree. Consequently a parse tree of the
total parse, regardless of sub-program calls, will be available at the highest program
level. If "(execute-parse)" is included instead of "(return-parse)", then the list of
s-expressions will be executed as the last act of the program. Part of the list returned by
the above example when "(return-parse)" is included in the "on-success" section is
shown in figure 3.

(test-grammar..00071 nil
(test-grammar..00071
(parents (dummy)
children (sentnce@test-grammar..00072)))
(sentnce@test-grammar..00072
(parents (test-grammar..00071)
children
(mary
had-object1-caught-object2-%-prep-phrase@test-grammar..00075)))
(had-object1-caught-object2-%-prep-phrase@test-grammar..00075
(parents (sentnce@test-grammar..00072)
children (had)))
(had
(parents
(had-object1-caught-object2-%-prep-phrase@test-grammar..00075)
children (object1@test-grammar..00077)
delay-execution
(((terpri)
(terpri)
(pring "I am node ")
(princ 'had)
(princ " my parent was ")
(princ 'had-object1@test-grammar..00076)
(terpri)
(pring "The last word was ")
(princ 'mary)
(princ " the next word is ")
(princ 'a)))))
(object1@test-grammar..00077
.
.
.
etc.
Figure 3: The output of the program in figure 1 when "(return-parse)" is included under
the "on-success" header.

Since it is possible that the "g" program might be loaded and run by another Lisp or
"g" program, the programmer is responsible for including or not including an "(exit)"
statement under the "on-success" and "on-failure" headers. In particular, if the
program is to be used as an independent module that is spawned from another "g"
program, it is mandatory that the programmer provide an "(exit)" statement under both
the "on-success" and "on-failure" headers to ensure a graceful exit from the
spawned process and return to the spawning process.

The Header "on-failure" - This is the list of s-expressions executed if the interpreter
exits the grammar in a lose state. Comments applying to "on-success" apply to "on-
failure" save that the "(return-parse)" and "(execute-parse)" functions are unlikely to
provide meaningful information and should not be used here.

Optional Headers

The Header "grammar-name" - Applying the "g" compiler to a "g" program produces
two files: an executable program (a self-booting, compiled Lisp program) and the
intermediate code from which the executable program was produced. The executable
code is suffixed with ".o" and the intermediate code with ".l". If the header
"grammar-name" is omitted the name of both of these files will default to the name of
the file containing the "g" source code. If a name is given under this header, it will
become the name of those files. The name given here does not need to be enclosed in
parentheses. For example:

grammar-name [test-grammar]

The Header "all-terminals" - This header precedes a list of s-expressions which will be
sequentially executed each time the grammar interpreter encounters a terminal node.
The sequence of s-expressions is delimited by square brackets. If the sequence, or a
sub-sequence of the s-expressions, is enclosed in parentheses and headed by the
word "delay-execution", then that sequence or sub-sequence of s-expressions will not be
executed upon the interpreter's reaching that portion of the grammar but will be
stored and either returned as the value of the parse or executed on completion of
the parse. The action to be taken is indicated by including the function call
"(return-parse)" or the function call "(execute-parse)" in the "on-success" section (see
above). The "delay-execution" enclosure may be used on "all-terminals" and on specifically
named terminals (see below). For use in the s-expressions associated with all the
nodes there are two reserved grammar variables, *this-node* and *parent-node*, and
three reserved input variables, *last-word*, *current-word* and *next-word*.
Also accessible is the reserved variable *remaining-words*. Each contains that
which its name describes. When used in an s-expression each reserved variable
should be treated as if it were the actual word or node that it represents, hence the "'"
preceding the use of the variables in figure 1. As another example, suppose
*current-word* represents the input word "mary"; to use *current-word* without the quote
would mean to use the value contained in mary and not "mary" literally.

The Header "all-non-terminals-on-entry" - This list of s-expressions is treated in
the same manner as all-terminals except that execution occurs at the
non-terminal nodes. This causes the listed s-expressions to be evaluated upon entering a
non-terminal node.

The Header "all-non-terminals-on-success" - This list of s-expressions is treated
in the same manner as all-non-terminals-on-entry except that execution occurs
after the node has completed its execution and only if the execution succeeds.

The Header "all-non-terminals-on-failure" - This list of s-expressions is treated in
the same manner as all-non-terminals-on-entry except that execution occurs after
the node has executed and only if the execution failed.

The Header "all-nodes-on-entry" - This is the same as all-non-terminals-on-entry
except that it includes terminal nodes.

The Header "all-nodes-on-success" - This is the same as all-non-terminals-on-success
except that it includes terminal nodes.

The Header "all-nodes-on-failure" - This is the same as all-non-terminals-on-failure
except that it includes terminal nodes.

The Header Which May Be Any Literal Terminal Node Name - These lists of
s-expressions are applied upon the interpreter encountering the terminal node of the
given name. In figure 1, the terminal node "mary" has special s-expressions
associated with it, namely, [(terpri)(princ "mary is a grand old name")]. These
s-expressions will replace those that would otherwise be included in the node due to
all-terminals. Otherwise the comments for all-terminals apply to a specifically
named terminal node.
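How a named-terminal header overrides "all-terminals", and how a "delay-execution" enclosure separates stored actions from immediate ones, can be sketched in Python. The function names and the "HEADERS" table are assumptions for illustration; s-expressions are kept as plain strings and are not evaluated.

```python
def actions_for_terminal(name, headers):
    """Pick the action list for terminal `name`.

    A header named after the terminal replaces the "all-terminals" list.
    """
    return headers.get(name, headers.get("all-terminals", []))


def split_actions(actions):
    """Separate actions to run now from those under "delay-execution"."""
    immediate, delayed = [], []
    for action in actions:
        if isinstance(action, tuple) and action[:1] == ("delay-execution",):
            delayed.extend(action[1:])  # stored until the parse completes
        else:
            immediate.append(action)
    return immediate, delayed


# Hypothetical header table; the strings are not evaluated in this sketch.
HEADERS = {
    "all-terminals": ["(terpri)", ("delay-execution", '(pring "I am node ")')],
    "mary": ["(terpri)", '(princ "mary is a grand old name")'],
}
```

The terminal "mary" receives only its own list, while any other terminal, such as "had", receives the "all-terminals" list with its delayed part set aside.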

The Header "external-functions" - This is a list of files containing Lisp function
definitions to be loaded. Any functions so loaded are available anywhere in the "g"
program loading them.

The Header "functions" - This is a list of Lisp function definitions supplied by the
programmer. The functions generated by them are available anywhere in the "g"
program defining them. There are a number of functions already built into "g". For
debugging purposes the programmer may put the s-expression "(setq debug t)" into
the preliminary section. This will cause the interpreter to pretty-print information
at each step of its traversal of the grammar. Since this output would become interspersed
with or overwrite any normal output from the "g" program, a special printing function
"pring" is supplied. If debug is not set to true then pring simply acts like the Lisp
function "princ". If debug is true then pring cooperates with the debugging so as to
prevent overwriting any normal output. The functions "(return-parse)" and
"(execute-parse)" have been discussed. "(show-parse)" prints the parse tree generated by the
parse to the screen in outline form. The raw node list from which the parse tree is
generated can be accessed using the s-expression "(g-node-list node-list-g)".

In Short

1) The "preliminary", "on-success" and "on-failure" headers are required in a "g"
program. The "preliminary" section should contain variable declarations and
initializations and should read any input required. The "on-success" and
"on-failure" sections should provide for returning the results of the parse and
making a graceful exit.

2) "all-terminals" s-expressions when not enclosed in a "delay-execution" list
execute when control is passed to a terminal node.

3) "all-terminals" s-expressions enclosed in a "delay-execution" list are collected
and either executed or returned intact (depending upon the program) if they
are associated with terminal nodes along a successful parse path.

4) "all-non-terminals-x" s-expressions execute upon control being passed to a
non-terminal node.

5) "all-nodes-x" s-expressions execute upon control being passed to any node.

6) S-expressions included in a "$" definition only execute upon the successful
traversal of a non-terminal node.

7) S-expressions in a "#" definition execute as a function in place of a terminal
node. They do not advance the input pointer unless the input contains the
node name (including the "#"). The variable "truth-value" may be set to true to
ensure that the node is accepted as successful.

8) The s-expression in an "@" definition determines the entry point at which the
interpreter continues its traversal of the grammar. If it evaluates to true,
processing continues at the node indicated by the number following the
s-expression.

Processes

Separate Compilation

"g" allows separate compilation of modules. When the interpreter encounters a
terminal-node in a particular module for which there exists a compiled module of the
same name in the local file system, it will spawn the compiled module as an
independent process. When a "g" program spawns an independent "g" process the
contents of *remaining-words* at that point are forwarded to the spawned process and
any results printed by the spawned process are returned to the spawning program
rather than to the screen. The programmer must be aware of this and plan for such
results (see delayed-execution below). If the grammar is such that no input is
required to drive it (for instance its only purpose is to initiate a sequence of processes)
the value of *remaining-words* will be nil. Unless specific provisions are made to
excuse the failure or force success as described below, the parse will fail. After the
information on the state of the parse consisting of the contents of *remaining-words* is
passed directly to the compiled module, the parent process waits for the termination
of the process and receives as output from that process any information generated as
output (e.g. via "(return-parse)"). The information is returned in a list whose first
element is the name of the compiled module. Including the name of the module in the
list allows for the implementation of multi-tasking and/or co-processes in future
versions of "g".
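The hand-off between a parent and a spawned module can be sketched with ordinary operating-system processes in Python. Everything here is an assumption for illustration: the child is a throwaway Python one-liner rather than a compiled "g" module, JSON replaces Lisp lists on the pipe, and "spawn" is a stand-in with an invented signature.

```python
import json
import subprocess
import sys

# Stand-in for a compiled module: it reads the forwarded words from standard
# input and prints its result instead of writing to the screen.
CHILD_SOURCE = (
    "import json, sys\n"
    "words = json.load(sys.stdin)\n"
    "print(json.dumps([w.upper() for w in words]))\n"
)


def spawn(module_name, remaining_words):
    """Run the child, wait for it, and return [module_name, *its output]."""
    finished = subprocess.run(
        [sys.executable, "-c", CHILD_SOURCE],
        input=json.dumps(remaining_words),  # forward the remaining words
        capture_output=True,                # results come back to the parent
        text=True,
        check=True,                         # a crashed child raises an error
    )
    return [module_name] + json.loads(finished.stdout)
```

A call such as "spawn("prep-phrase", ["with", "blocks"])" blocks until the child exits and returns a list headed by the module name, which is what would let a parent tell apart the results of several spawned modules.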

Process Control

"g" can do more than recognize input strings. Recognition is just one kind of
process. Although by default "g" automatically attempts to match the input against
the terminal nodes of the grammar, any process may be substituted at any point
for the recognition process. The simplest way to indicate that a process is to replace
the normal recognition process is to preface the terminal node with the "#" sign and
then define the process that is to be substituted in the augmentations and
enhancements section as described above. Doing this does not prevent "g" from
consuming a word of input and attempting to match it against the name of this
terminal node. The match, however, is doomed to fail except in the case that the user has
provided the proper input (i.e. the terminal node name prefaced by "#"). Since this is an
unlikely and not very useful prospect, the word match will normally fail. The
programmer can force a success (and guarantee that the parse continues past this
process) by assigning a value of true to the pre-defined variable "truth-value" in the
terminal node process being created. Similarly failure can be forced by assigning a
value of "nil" to "truth-value". If no value is assigned to "truth-value" then the
result of the word matching (default action) will determine whether the node succeeds
or fails. By using this technique the programmer can create grammars that result in
pure process activation, pure string recognition, or any combination in between. The
following example is based on Marvin Minsky's example of a child playing blocks
(Minsky 87). It is a process activation routine.

An Example Of Process Control

The Environment - The action takes place on the user's terminal screen on which
the floor-plan of a small two room kindergarten or school is displayed (the
environment). There are various objects located about the two rooms which are
connected by a door. The objects represent a blackboard, a box containing blocks, a
table, and a small box. The user (or teacher) and the program (Kara) are represented
by the characters "T" and "K" respectively together with a pointer which indicates
the way they are facing. The environment has definite rules which cannot be violated
by either entity. They may move about the rooms, and they may move the various
objects around in the rooms. They can "see" the room and the objects by scanning in
the appropriate direction. Walls and objects are impervious, that is, the teacher and
Kara may not move or see through walls or objects and they may not move objects
through walls. The box containing the blocks can be opened and the blocks removed.
The blocks are attracted to one wall of the floor and unless they are "held" by an entity
they will "fall" towards that wall. To accomplish all of these actions the entities are
provided with primitive operations, for example "(move-entity entity distance
direction)" which moves the indicated entity the distance in the given direction, and
"(scan-cone entity direction cone-width)" which returns a list of the items "visible" in a
cone of width cone-width in the direction indicated from the entity.
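A simplified version of the "scan-cone" primitive can be sketched in Python. The signature, the coordinate system (y increasing "north") and the object positions are assumptions; angles are taken in degrees counter-clockwise from east, which agrees with the values 0, 90, 180 and 270 that the control program assigns to east, north, west and south. Occlusion by walls and objects, which the environment enforces, is not modelled.

```python
import math


def scan_cone(origin, direction, cone_width, objects):
    """Return names of objects whose bearing lies within the cone.

    `objects` maps a name to an (x, y) position.
    """
    ox, oy = origin
    visible = []
    for name, (x, y) in objects.items():
        if (x, y) == (ox, oy):
            continue  # an entity does not see its own square
        bearing = math.degrees(math.atan2(y - oy, x - ox)) % 360
        # Smallest signed difference between the bearing and the direction.
        offset = (bearing - direction + 180) % 360 - 180
        if abs(offset) <= cone_width / 2:
            visible.append(name)
    return visible


# Hypothetical layout used only for this example.
OBJECTS = {"box": (5, 0), "table": (0, 5), "blackboard": (-5, 1)}
```

With a cone width of 90, looking east from the origin shows only the box, looking north only the table, and looking west only the blackboard.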

The Process Control Program
play-blocks = get-blocks [play-with-blocks] [put-blocks-away]
get-blocks = find-box #open-box [get-blocks-out]
find-box = (#west | #south | #east | #north) #go-to-box
get-blocks-out = find-a-block #take-block-out-of-box get-blocks-out
find-a-block $l1= #scan-for-block #go-get-block
play-with-blocks= #go-to-blocks [build-a-tower]
build-a-tower = find-a-block #put-block-on-stack build-a-tower
put-blocks-away = #go-back-to-box [store-block]
store-block = find-a-block #put-block-in-box store-block
<
$l1 [(listen-for-teacher)] {stays true if teacher doesn't interrupt, if teacher interrupts,
Kara quits to listen}
#west [(setq dir 180)(look-for-object)] {look west for object (box)}
#south [(setq dir 270)(look-for-object)]
#east [(setq dir 0)(look-for-object)]
#north [(setq dir 90)(look-for-object)]
#go-to-box [(setq truth-value (go-to-object 'handle))]
#open-box [(setq truth-value (open-up entity box))]
#scan-for-block [(scan-for-blocks)]
#go-get-block [(setq truth-value (go-to-object 'location))]
#take-block-out-of-box [(move-object entity 4 90) (move-object entity 12 0)
(setq truth-value (move-entity entity 12 180))]
#go-to-blocks [(setq truth-value (move-entity entity 12 0))]
#put-block-on-stack [(move-object entity 3 90) (take-object-to 12 14)
(setq truth-value (go-to 14 25))]
#go-back-to-box [(setq truth-value (go-to 14 12))]
#put-block-in-box
[(move-object entity 3 90) (move-object entity (setq temp-rand (+ 7 (random 3))) 180)
(setq truth-value (move-entity entity temp-rand 0))]
>
grammar-name [kara-play-blocks]

preliminary
[(defvar cone 90) (defvar dir 0) (defvar entity 'kara) (defvar box 'object2)
(defvar object box) (defvar temp-rand 0) (defvar blocks
'(object14 object15 object16 object17 object18 object19))
(load 'environment) (spawn 'listen-for-teacher.o) (envi)]

on-success
[(move-entity entity 10 90) (bn-position-cursor 24 1) (princ "That was fun.") (terpri)]

on-failure [(princ "I don't want to play blocks.")]

Figure 4 (part one): Control program

The Action - The program (Figure 4) causes the entity Kara to look for the box
containing the blocks. If she finds the box she opens it and removes all of the blocks.
She plays with the blocks by stacking them into a tall pile. Finally, Kara puts (or tries
to put) the blocks back in the box. Upon completion of the task (or failure to complete
the task) Kara issues an appropriate statement. The exact actions that Kara takes during
the process are highly dependent upon the initial configuration of Kara, the box, and
the blocks in the box. Kara may succeed or fail to complete the entire task (and execute
the "on-success" tasks) depending upon the initial configuration.

functions
[(defun look-for-object ()
   (cond ((member box (scan-cone entity dir cone))
          (setq truth-value t))
         (t (setq truth-value nil))))

 (defun scan-for-blocks ()
   (let ((temp-object))
     (cond ((setq temp-object (car (remove-non-blocks (scan-cone entity dir cone))))
            (setq object temp-object) (setq truth-value t))
           (t (setq truth-value nil)))))

 (defun remove-non-blocks (a-list)
   (apply 'append
          (mapcar '(lambda (z)
                     (cond ((null (member z blocks)) nil)
                           (t (list z))))
                  a-list)))

 (defun go-to-object (place)
   (move-entity entity
     (eval (list 'distance
                 (car (get entity 'location)) (cadr (get entity 'location))
                 (car (get object place)) (cadr (get object place))))
     (eval (list 'direction
                 (car (get entity 'location)) (cadr (get entity 'location))
                 (car (get object place)) (cadr (get object place))))))

 (defun take-object-to (x y)
   (move-object entity
     (eval (list 'distance (car (get entity 'location))
                 (cadr (get entity 'location)) x y))
     (eval (list 'direction (car (get entity 'location))
                 (cadr (get entity 'location)) x y))))

 (defun go-to (x y)
   (move-entity entity
     (eval (list 'distance (car (get entity 'location))
                 (cadr (get entity 'location)) x y))
     (eval (list 'direction (car (get entity 'location))
                 (cadr (get entity 'location)) x y))))]

Figure 4 (part two): A process control program
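The functions "distance" and "direction" on which "go-to", "take-object-to" and "go-to-object" rely are not listed in figure 4. A minimal Python sketch of what they might compute follows, assuming Euclidean distance rounded to whole screen units and a direction in degrees counter-clockwise from east; both conventions are assumptions, as is the Python "go_to", which only returns the values that would be handed to "move-entity".

```python
import math


def distance(x1, y1, x2, y2):
    """Straight-line distance, rounded to whole units (an assumption)."""
    return round(math.hypot(x2 - x1, y2 - y1))


def direction(x1, y1, x2, y2):
    """Bearing from point 1 to point 2 in whole degrees, 0 = east, 90 = north."""
    return round(math.degrees(math.atan2(y2 - y1, x2 - x1))) % 360


def go_to(location, x, y):
    """Return the (distance, direction) pair go-to would pass to move-entity."""
    return distance(*location, x, y), direction(*location, x, y)
```

For instance, under these assumptions an entity at (14, 12) asked to "go_to" (14, 25) would move 13 units in direction 90, and the return trip would be 13 units in direction 270.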


A short program to recognize some simple commands that will load and run the above
program is as follows:

kara-starter = [kara] (play | (go [#mark-go])) prep-phrase
prep-phrase = preposition object
preposition $d1 = with | to
object @e1 = [determiner] [adjective] noun
determiner $d2 = the | your
adjective $d3 = big | large | little | small
noun = (blocks | (box [#mark-box])) [#execute]
<
#mark-go [(setq verb-used 'go)]
@e1 [((equal (get *next-word* 'part-of-speech)
'adjective) 2)]
$d1 [(setq preposition-used *last-word*)]
$d2 [(setq determiner-used *last-word*)]
$d3 [(setq adjective-used *last-word*)]
#mark-box [(setq noun-used 'box)]
#execute [(cond ((or (equal verb-used 'go) (equal noun-used 'box)) (princ "I'd like to ")
(cond ((null verb-used)(princ "play ")) (t (princ "go ")))
(princ preposition-used)(princ " ")
(cond (determiner-used (princ determiner-used) (princ " ")))
(cond (adjective-used (princ adjective-used) (princ " ")))
(cond ((null noun-used)(princ "blocks")) (t (princ "box")))
(princ " but I don't know how.")(terpri)
(pring "enter y to see the parse..")
(setq display-parse (read)))
(t (load 'kara-play-blocks)))]
>
grammar-name [g-test]

preliminary
[(defvar display-parse nil) (defvar last-word nil) (defvar verb-used nil)
(defvar preposition-used nil) (defvar determiner-used nil) (defvar adjective-used nil)
(defvar noun-used nil)
(princ "This is a simple g loader for a process oriented g routine.") (terpri)
(princ "It provides an example of the integration of recognition and") (terpri)
(princ "control processes. Phrases like '(play blocks)' or ")
(princ "'(kara play blocks)'")(terpri)
(princ "will initiate a process that causes kara to play with her") (terpri)
(princ "blocks. Any other response will cause the program to remark") (terpri)
(princ "that it doesn't know how to do that thing, or in the event ") (terpri)
(princ "it cannot parse the input, it will say 'I don't understand.'") (terpri)
(setq *remaining-words* (read)) ]

on-success
[(cond ((equal display-parse 'y)(show-parse))) (terpri)(princ "good-bye")(exit)]

on-failure [(princ "I don't understand. good-bye")(exit)]

Figure 5: A Program to load and run the process control program in figure 4. It
provides an interface to the user.

Notice that recursion is used three times in the program of figure 4: once to cause Kara
to remove all of the blocks from the box, once to cause her to build a tower with all of
the blocks (at least the ones she can find), and once to cause her to put the blocks back
in the box. To ensure that running out of blocks does not cause the entire program to fail
(and Kara to respond "I don't want to play blocks"), the initial call to each of these
routines is made optional by placing it inside square brackets. In this
way Kara will continue the indicated activity until she either can find no more blocks
or cannot complete the task. At that point she will proceed with the rest of the
program.
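The interplay between a recursive definition and an optional initial call can be sketched in Python. The functions below mirror the shape of "build-a-tower = find-a-block #put-block-on-stack build-a-tower" and "play-with-blocks = #go-to-blocks [build-a-tower]", but the Python names, the list-based "blocks" and "tower", and the helper "optional" are assumptions for illustration.

```python
def find_a_block(blocks):
    """Succeed by returning a block, or fail (None) when none are left."""
    return blocks.pop() if blocks else None


def build_a_tower(blocks, tower):
    """Stack one block, then recurse; the recursive item is mandatory."""
    block = find_a_block(blocks)
    if block is None:
        return False  # the innermost call always ends in failure
    tower.append(block)  # the action already taken is not undone
    return build_a_tower(blocks, tower)


def optional(node, *args):
    """[node]: run the node but excuse its failure."""
    node(*args)
    return True


def play_with_blocks(blocks, tower):
    """Only the first call to the recursive routine is bracketed."""
    return optional(build_a_tower, blocks, tower)


blocks, tower = ["b1", "b2", "b3"], []
finished = play_with_blocks(blocks, tower)
```

In this sketch the recursive routine itself always reports failure once the blocks run out, yet the tower has been built and the enclosing definition still succeeds because the first call was bracketed.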

Co-processes And Independent Processes - Kara may be interrupted by Teacher.
The spawned process "listen-for-teacher" provides an interface for the user who may
enter instructions from the keyboard. Kara "listens" each time she looks for something
or finds a block (her two most frequent activities). The "listen-for-teacher"
process may then initiate programs that invoke other activities or responses from Kara.
Eventually, when those activities or responses are completed, and if she has not
been directed to cease activities, Kara will return to the original task of playing blocks.
"g" provides the function "(spawn 'program.o)", which spawns any "program.o" as a
new process (program.o must already exist as a compiled "g" or Lisp program).
"Spawn" automatically sends the contents of *remaining-words* to the new process.
It then waits until the new process completes execution and receives the results of
that process. If the new process has succeeded, the parse tree it generated is joined
to the parse tree of the originating process as a branch.
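
The hand-off performed by "spawn" can be sketched in Python. This is a minimal model, not the "g" implementation: the JSON message format, the inline child program CHILD_SOURCE and the function name spawn are all assumptions made for illustration. The parent sends its remaining words to a new process, waits for it to finish, and grafts the returned parse tree onto its own tree only if the child reports success.

```python
import json
import subprocess
import sys

# Hypothetical child program: it consumes one word and reports a parse tree.
CHILD_SOURCE = r"""
import json, sys
words = json.load(sys.stdin)
if words:
    result = {"ok": True, "tree": ["child", words[0]], "remaining": words[1:]}
else:
    result = {"ok": False, "tree": None, "remaining": words}
json.dump(result, sys.stdout)
"""


def spawn(remaining_words, parent_tree):
    """Run the child, wait for it, and join its tree to parent_tree on success."""
    completed = subprocess.run(
        [sys.executable, "-c", CHILD_SOURCE],
        input=json.dumps(remaining_words),  # send the remaining words
        capture_output=True,
        text=True,
        check=True,                         # blocks until the child exits
    )
    result = json.loads(completed.stdout)
    if result["ok"]:
        parent_tree.append(result["tree"])  # new tree becomes a branch
    return result["ok"], result["remaining"]


tree = ["parent"]
ok, rest = spawn(["stop", "playing"], tree)
print(ok, rest, tree)
```

In this sketch the child is a separate operating system process, so the only state shared with the parent is what is explicitly sent and returned, which mirrors the role of *remaining-words* and the returned parse tree in the description above.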

Knowledge Storage And Retrieval

"g" can be used to store and retrieve information in any taxonomic system by using
the "[]" notation. For example, the program in figure 6 accepts input sequences of the
sort (small dogs) and returns a list consisting of just the animals in the small dog
category. An input sequence consisting of the list (four-legged) returns all of the
four-legged animals in the database. In the example *remaining-words* is never
initialized, so the parse is guaranteed to fail at each step. Instead of filling
*remaining-words*, the program prompts the user for the set of keys with which the
database is to be entered. Enclosing in brackets (and thus making optional) all terminals
and non-terminals save the last one in each definition has the effect of excusing the
failure generated by the empty *remaining-words*. Each node in a definition will
succeed except for the last node (which has not been protected by brackets). The
interpreter must proceed completely through each definition and, consequently,
through the entire grammar. During the parse the functions defined under
"all-nodes-on-entry", "all-nodes-on-failure" and "all-nodes-on-success" maintain the
list "matched-keys", which contains the names of the nodes on the path from the root
node to the current node. Since the grammar represents a taxonomy, the list serves as
the set of keys that have so far been encountered. The function "add-to-data" adds the
list belonging to a terminal node to the results so long as all of the keys entered by the
user are present in "matched-keys". In this manner the taxonomy is traversed and the
items constrained by the keys entered by the user are collected. The final act of the
program is to pretty print the resulting list to the screen.
petstart = [no-legged] [two-legged] [four-legged]
no-legged = [#snakes] #fish
two-legged = [#birds] #monkeys
four-legged = [dogs] [cats] #turtles
dogs = [#large] #small
cats = [#longhair] #shorthair
<
#turtles [(add-to-data 'turtles 'green-turtles 'painted-turtles)]
#birds [(add-to-data 'birds 'parrots 'canaries 'parakeets)]
#snakes [(add-to-data 'snakes 'boa-constrictors)]
#fish [(add-to-data 'fish 'guppies 'goldfish)]
#large [(add-to-data 'large 'st-bernards 'german-shepherds 'irish-setters)]
#small [(add-to-data 'small 'chihuahuas 'dachshunds 'cocker-spaniels)]
#longhair [(add-to-data 'longhair 'persians 'himalayans)]
#shorthair [(add-to-data 'shorthair 'tabbies 'siamese 'maltese)]
>
grammar-name [pets]

preliminary
[(defvar keys nil) (defvar matched-keys nil) (defvar results nil)
(bn-position-cursor 'c 'a)
(princ "enter the keys in a list. e.g. (small dogs) or (two-legged)") (terpri)
(setq keys (read))]

on-success
[(terpri)(cond (results (princ "items matching your keys are ") (pp-form results))
(t (princ "no items satisfy keys")))
(terpri)(exit)]

on-failure [(exit)]

all-nodes-on-entry [(add-to-matched-keys)]

all-nodes-on-failure [(remove-from-matched-keys)]

all-nodes-on-success [(remove-from-matched-keys)]

functions
[(defun add-to-data (name &rest a-list)
   (prog ()
     (mapcar '(lambda (z)
                (cond ((null (member z matched-keys)) (go exit))))
             keys)
     (setq results (append results a-list))
     exit))

 (defun add-to-matched-keys ()
   (setq matched-keys (append matched-keys (list (plain-name *this-node*)))))

 (defun remove-from-matched-keys ()
   (setq matched-keys (delete (plain-name *this-node*) matched-keys))) ]
Figure 6: A database written in "g"

Example: (longhair cats) returns (persians himalayans) while (cats) returns (persians
himalayans tabbies siamese maltese).
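
The retrieval scheme of figure 6 can be restated as a short Python sketch. It is an approximation under stated assumptions, not a translation of the "g" interpreter: the taxonomy is held in a nested dictionary named TAXONOMY, the function collect is hypothetical, and the path of node names is passed down as a tuple rather than being pushed and popped by entry and exit hooks. A leaf contributes its items only when every key supplied by the user appears on the path from the root to that leaf, which is the test made by "add-to-data".

```python
# Taxonomy of figure 6 as nested dicts; leaves hold the stored items.
# "monkeys" has no stored items in the figure, so its list is empty here.
TAXONOMY = {
    "no-legged": {
        "snakes": ["boa-constrictors"],
        "fish": ["guppies", "goldfish"],
    },
    "two-legged": {
        "birds": ["parrots", "canaries", "parakeets"],
        "monkeys": [],
    },
    "four-legged": {
        "dogs": {
            "large": ["st-bernards", "german-shepherds", "irish-setters"],
            "small": ["chihuahuas", "dachshunds", "cocker-spaniels"],
        },
        "cats": {
            "longhair": ["persians", "himalayans"],
            "shorthair": ["tabbies", "siamese", "maltese"],
        },
        "turtles": ["green-turtles", "painted-turtles"],
    },
}


def collect(node, keys, matched_keys=(), results=None):
    """Visit every node; gather leaf items whose path contains all keys."""
    if results is None:
        results = []
    for name, child in node.items():
        path = matched_keys + (name,)        # names from the root to this node
        if isinstance(child, dict):
            collect(child, keys, path, results)
        elif all(key in path for key in keys):
            results.extend(child)            # same test as add-to-data
    return results


print(collect(TAXONOMY, ["longhair", "cats"]))
print(collect(TAXONOMY, ["cats"]))
```

Because the whole structure is always traversed, the order of the keys does not matter and a key at any level of the taxonomy narrows the result.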
Summary

"g" is a programming language that may be thought of as an extension of Lisp. It
encourages the hierarchical organization of processes and data into a grammar-like
structure. Programs written in "g" may be compiled and run independently or
loaded and run by other Lisp or "g" programs. "g" can spawn processes that run
independently or as co-processes with the spawning process. "g" can be used as a
language with which to model many kinds of systems, among them language recognition,
process control and taxonomic knowledge storage and retrieval. It is suggested that
"g" programs can be written to capture the structure inherent in most systems.
Index
Index compiled with a different word processor, so pages #s are approximate.
Accommodation 39, 41, 43 neural networks 57
Activation function 89 plasticity of 43
AI programs (limitations) 58 Brooks, Daniel R. 37
Algebraic groups 39 Byron, Ada Augusta 29
Ambiguity 67 Calculus of reason (Leibnitz) 24
Ampere, Marie Andre 46 Cantor set as a fractal 28
Analytic Engine, (Babbage) 29 Cantor, Georg 24, 26
Anderson, James A. 92, 94 Capitalism 54
Annealing (simulated) 104 Cardinal numbers of sets 24
Anthropocentricism 6 Cauchy distribution 105
Antinomies 10, 25 Cauchy, Augustine 24
Aristotle 8 Cause and effect 10
Aspect, Alain 21 CERN 20
Assimilation 39, 43 CFG 66
ATN 117, 131, 136 CFL 66
ATNs sufficient for purpose? 124 Chaos 26
Augmented transition networks 120 Chaos theory 28, 31
Axiom of choice 24 implications for computing 29
Axiom of reducibility 25 Chaotic processes 48
Axiomatic 8 Chatelperronian 5
Axiomatic system 11, 25, 44, 61 Childhood development 38
Axiomizing arithmetic 25 Chinese machine AI argument 56
Babbage, Charles 29 Chomsky's hierarchy 67
Backus-Naur grammar extended 126 Chomsky, Noam 67, 72
Backus-Naur notation 121 Church's hypothesis 62, 64-65
Backward propagation 103 Church, Alonzo 61-62
Bacon, Francis 6 Classification scheme 9, 117, 129
Baye's theorem 81 Clausal form 75
Behaviorism 71 Clauser, John 21
Behaviorist school of psychology 38 Clausius, Rudolf 13
Bell's theorem 18, 21 Closed system 31, 35, 108
Bell, John Stewart 20 Cognitive systematization 43
Bennett, Charles H. 33 Cohen, Paul 26
Bifurcations 27 Coherentism 43
Big bang 36, 139 Communications 71
Big crunch 139 Communism 54
Biology 12, 14 Complexity 14
Bohm, David 19 Computability theory 62
Bohr, Niels 18 Computable functions 65
Boltzmann distribution 104 Computer development 70
Boltzmann's constant 30-31, 105 Conjugate waveform families 17
Boltzmann, Ludwig 30 Consensual domain 116
Born, Max 18 Consistency of arithmetic 25
Bourbaki 25, 39 Constructivism 39
Brain Context free grammar 66
mind and entropy 42 Context free language 66
Context sensitive grammar 67 Effective procedure 61
Cook, Steven 69 Einstein, Albert 6, 15, 19-20, 46
Cooper, L. N. 102 Elan vital 6, 14
Copernicus 6 Elsasser, W. M. 14
Corpuscular nature (of radiation) 15 Emergence 113
Creation of the universe, inflation 36 civilization level 51
Creativity and logic 78 levels, rules, and objects 51
Crick, F. H. C. 13 Empiricists 11
Critique of Pure Reason (Kant) 10 Energy surface 106
Cro-Magnon man 4 Ensembles 19
CSG 67 Entropy 13, 39, 51, 71, 139
Cybernetics 71, 88 applicability of equations 51
Dark matter (missing in universe) 48 as measure of disorder 30
Darwin, Charles 6, 12, 46 as uncertainty 42
Data base 45 equations 31
Davies, Paul 37 evolutionary processes 36
DCFG 66 hierarchical equations 34
DCFL 66 in definition of ignorance 42
DeBroglie, Prince Louis 15, 19 in growing mind 42
Decision procedures 61 in hierarchical 33
Delta rule 92 in information theory 30
Delta rule (generalized) 104 in physical system 30
DeMoivre 31 information 33
Dendrites 84 maximum disorder 31
Denumerable sets 26 minimum entropy 32
Deoxyribonucleic acid (DNA) 13 order due to 32
Descartes, Rene 8, 11, 57 self-organization 37
Designation 41 theory 30
Determinism 29 thermodynamic 30
Deterministic context free grammar time irreversibility 31
66 Enumerable sets 62
Deterministic context free language Environment 60
66 simulated 115
Deterministic polynomial time (P) 68 Epistemology 43
Difficulty of problems 68 EPR experiment 19
DIM deniably intelligent machine 46, Essay Concerning Human
113, 139 Understanding (Locke) 9
DNA 13, 32, 37, 57 Euclid 8
Double helix 13 Euclid's Elements 8
Doubling of solutions 27 Everett, Hugh 18
DTM deterministic Turing machine Evolution
68 and entropy 36
Durant, Will 13 Darwin's theory 12
Economics Dollo's law 36
inappropriate rules 54 genes 13
invisible hand 53 heredity 12
laissez faire 53 natural selection 36
mercantilist theory 53 neo-Darwinism 13
Self-organization 53 of the universe 3
philosophical schools 7 Hilbert, David 24-25
synthetic theory of 13 Hill-climbing 104
theory 12 Hinton, G. E. 104
three kinds of systems 35 Hobbes, Thomas 9
Existence proofs 25 Homo sapiens 4
Expert systems 46, 74 Hopfield, J. J. 94, 104, 106
Feed forward model 93 Horn clause 75
Feedback control 70 Hume, David 9-10, 43-45, 71
Feedback model 93 Hybrid
Feigenbaum, Mitchell J. 27 implementation 113
First computer programmer 29 system 111
First order predicate calculus 73 Hypothesis 47
Flores, Fernando 60 Idealism 7, 45
Fodor, Jerry 72 Implementation
Formalist 25 philosophy 57
Foundationalism 6, 45 top down approach 114
Fourier, Joseph 17 Impulse waves 17
Fractal dimensionality 28 Induction 25, 44
Fractals 28 Inflation 36
Frames 9 Information 32
Frankel, Abraham 25 Initial conditions 28
Frege, Gottlob 24-25 Innate knowledge
Freud, Sigmund 38 Kant 45
Functionalism 57, 71 mathematical 45
Gaia hypothesis (Lovelock) 56 Intelligence
Galileo 6 dynamic manifestation 60
Gall, Joseph 72 recognition of 116
Genes 36 reification of 45
Genetic epistomologist 38 Intensions
Geology 12 awareness 55
God (anthropomorphic entity) 56 human level phenomena 54
Goedel's theorem 26, 29, 61, 108 origin of 55
Goedel, Kurt 25, 39 Intersubjectivity 116
Grammar 124 Intuitionism 6, 25
described 125 Iterative procedure in chaos theory 27
mathematical description 66 Judeo-Christian thought 12
used in implementation 114 Jung 38
Grand unified theories 50 Kant, Immanuel 10, 26, 43, 45, 71
Greek philosophers 7 KARA 117, 127-128, 131, 136
Hebbian rule 87 Kaye, Kenneth 39, 60, 115-116
Hebbs, Donald O. 87 Kleene 63
Hegel, Wilhelm Friedrich 10, 43 Knowledge representation hypothesis 9-
Hegelian dialectic 11, 53 10
Hegelian inversion 11 acquisition, representation, activation
Heidegger, Martin 60 128
Heisenberg, Werner 16, 18, 46, 72 activation of ATNs 136
Herbert, Nick 18 and grammar 126
Hidden layers 96 declarative 76
Hierarchical structure 14, 34-36, 38 procedural 76
representation by ATNs 131 Leviathon (Hobbes) 9
thought as language 125 Liapunov 105
Koch snowflake as a fractal 28 Life (beginning) 50
Kohonen, Teuvo 92, 94 Likelihood 82
Kuhn, Thomas S. 44 LIM limited intelligence machine 46,
Ladder of life 12 113, 115, 128, 139
Ladder of perfection 12 Linear bounded automata 67, 89
Landauer, Rolf 33 Linear threshold relation 89
Language Linguistic cognitive domain 115
acquisition 41 Linguistic fuzzy measure 117
as feature of civilization 52 Linguistics 71
baby's acquisition 115 Lives of a Cell (Thomas) 52
deep structure 72 Lobatchewsky, Nikolas 24
fuzzy 83 Locke, John 9
g (programming language) 126 Logic based systems 47
innateness challenged 52 brittleness 84
limitations on logic 76 classical vs fuzzy 83
Lisp interpreter 119, 118-119 fuzzy 82
Prolog 76 induction 78
surface 72 limitations of crisp laws 81
Laplace, Pierre-Simon 24, 28 limitations of inference 77
Law of causality 44 objectively probabilistic 81
Law of the excluded middle 25 possibilistic 81-82
Laws of logic 74 processes 73
Layzer, David 35 qualifications and assumptions 78
Learning subjectively probabilistic 81-82
by assimilation 41 systems 73
by habitual association 40 Logicists 25
by imitation 40 Lovelace, Countess of 29
from consequences of actions 40 Lowenheim-Skolem theory 26
on a contingent basis 40 Lowenheim, Leopold 26
Leibnitz, Gottfried Wilhelm 24 LTF 89
Levels LTR 89
absolute partitioning 49 Mandelbrot, Benoit B. 28
biological 14 Markov 63
complexity 14 Marx, Karl 11, 53
creativity at 38 Materialism 7
differentiation 14, 50 Mathematical reversibility 39
effects between levels 51 Mathematics (progress in) 24
environments 50 Maturana, Humberto 60, 115-116
Euclidean space 48 Maxwell's demon 33
hierarchy hypothesized 47 Maxwell, James Clerk 16, 33
hypothesized 47 McCulloch-Pitts model 87
orders of magnitude 22 McCulloch, Warren 84
polarity 14 Measurement
quantum logic 48 EPR experiment 20
rules and objects 23 of quantum particles 18
rules don't apply across 53 Mechanists 6, 14
scale 14 Membership function 83
Mendel, Gregor 12 statistical methods 104
Mendelian theory 12 Neurons
Mentalism 7, 71-72 equivalence to Turing machine 87
Message model 72 formal model 84
Mind motor, sensory, and internuncial
body as component of mind 51 84
brain and body 57 Newton, Isaac 6, 9
brain as component of mind 51 Non-denumerable sets 26
creation at a level 55 Non-deterministic polynomial time
development of child's mind 39 (NP) 68
dual nature 8 Normal reality 16
growth of knowledge 42 Novum Organum 6
information processing system 71 NP problems 68
limitations in scale 48 NP-complete problem 69
modularity of 72 NTM non-deterministic Turing
perception of 51 machine 68
represntational theory of 71 Origin of Species (Darwin) 12
single level implementation 57 Out-putting system (thermodynamic)
social animals 52 35
structure of 38 Paleontology 12
implementation implications 113 Papert, Seymour 89, 92
Minsky, Marvin 89, 92, 126-127 Parents influence on child 39
Mitchell, Tom M. 79 Parker, D. B. 104
Modus ponens 81 Parser 122
Natural language interface 115 Parsing 66
Natural selection 12 defined 66
Naturalistic individualism 73 Partial recursive functions 62
Neanderthal man 4 Pavlov, I. V. 38, 71
Neo-Darwinism 13 Peano, Giuseppe 24
Nerve cells 84 Peking man 4
networks Perception problems (humans) 103
convergence 105 Perceptron 88
Networks 39 Euler numbers 100
augmented transition (ATN) 120 group invariance theorem 97
embedding neural 128 hill climbing 92
feedback 105 LTR hyperplanes 90
multi-layer 103 masks 100
neural 57 mathematical model 92
Perceptron 90, 92 Minsky and Papert analysis 94
semantic 9 parity/switch example 98
systematization 43 purpose 90
Neuro-biology 71 representation problems 97
Neuromorphic models 61, 84 retina, associators, and responders
Boltzmann method 104 91
features, drawbacks, benefits 109 stratification 101
generalization 102 topological invariant 100
part in implementation 114 very large coefficients 101
systems 109 Person 57
neuromorphic systems Phase entanglement 18, 22
Phase space 36, 38, 42 Recursive problem 68
Photons, polarization 20 Recursively enumerable languages 65
Physicalism 7 Reductionism 11
Piaget, Jean 38, 43 Regress of explanation 14
Pilot wave 19 Reichman, Rachel 138
Pitts, Walter 84 Renaissance 12
Planck's constant 15 Rescher, Nicholas 43
Planck, Max 15 Resolution 75, 108
Plato 7 RNA 33, 57
Platonic ideal 8 Robinson, John 75, 108
Platonism 44 Rosen, Nathan 19
Podowski, Boris 19 Rosenblatt, Frank 88
Polarity 14 Rule base 45-46, 76-77
Post 63 Rules of inference 74
Predictability 29 Rumelhart, David E. 104
Prigogine, Ilya 35 Russell, Bertrand 24-25
Principia Mathematica (Russell) 25 S-expressions 123, 130
Principia Philosiphiae (Descartes) 8 Salthe, Stanley N. 14
Probability Scale of being 12, 14
use in judgement 45 Scale of being (new) 22
Processing system 35, 50, 58 Schrodinger equations 17
Production rules 66 Schrodinger, Erwin 16, 18
Production systems 74 Science
Productions in grammars 66 cognitive 70
Prolog 108 computer 61
Psychology 38 hypothetico-deductive approach
Putnam, Hilary 71 46
Pylyshyn, Zenon W. 73 methodology 46
Quantum level 18 revolution in 44, 46
Quantum mechanics (predictive ability) Second law of thermodynamics 13
17, 72 Second method of Liapunov 105
Quantum particles 14 Self regulation (psychological) 39
attributes 16 Self-organization 51
correlated 19 Self-organizing system
description 16 characteristics 37
Quantum reality 17 human verses computer 58
attributes 18 processes 37
Copenhagen interpretation 18 Semi-Thue (type 0 language) 66
infinity of worlds hypothesis 18 Set Theorists 25
logic 19 Shannon, Claude 32, 70
neo-realism 19 Shrodinger, Erwin 36
no normal reality hypothesis 18 Signification 41
potentia 19 Simulated annealing 107
waves of probability 18 Processes 104
Quantum theory 15 Simulated environment 118
Quantum wave family 17 Sine waves 17
Random sequence of numbers 27 Singularities (cosmological) 50
Recursive functions 62 Skinner, B. F. 38, 71
Recursive languages 65 Skolem, Thoralf 26
Smith, Adam 53 Type 1 language 67
Social systems 40, 60 Type 2 languages 67
Socrates 7 Type 3 language 67
Software laws 37 UIM undeniably intelligent machine 46,
Spencer, Herbert 13, 53 113, 115, 139
Squashing functions 89, 104 Uncertainty principle (Heisenberg's) 16-
Strange attractors 28 17
Structuralism 38 Unification 75
Sunspot activity 45 Universe (continuing creation) 49
Superluminal communications 19-20 Version graph 80
Symbolic-logic 61 Vitalists 14
applications 108 Von Neumann architecture 87
part in implementation 114 Von Neumann, John 18, 87
problems with 84, 108 Wasserman, Philip 105
slow progress in seventies 104 Watson, J. D. 13
systems completeness 108 Wave mechanics 16
Symbols use by baby 40 Wave-particle dual nature 16
Synapses 86 Waveforms 17
Synaptic weights 87 Weight matrix 90
Syntactic parse 128 Werbos, P. J. 104
Systems theory 70 Whitehead, Alfred North 25
Szilard's Engine 33 Wiener, Norbert 71
Szilard, Leo 33 Williams, R. J. 104
Teacher 115 Winograd, Terry 60
Terminal of a grammar 66 Winston, Patrick Henry 79
Thermodynamic equilibrium 30, 35 Woods, W. A. 121
Thesis, antithesis and synthesis (Hegel) Work in thermodynamics 30
11 YANLI 117-118, 122, 128-129, 131,
Thomas, Lewis 52 133, 136
Thomson, William 13 Zermelo-Fraenkel set theory axioms 25
Time Zermelo, Ernst 25
Arrow of (in Chaos theory) 29
geologic 12
linear and cyclical 12
Topological structures 39
Trajectories 28
Transition nets 122
Traveling salesman problem 67
Treatise on Human Nature (Hume) 9
Triadic system (Salthe) 14
Turing machine
description of 63
operations, states, tapes 63
purpose of 63
universal 65
Turing, Alan 63
Turn-taking 39-40
Turnaround 41, 115, 139
Type 0 language 66
Table of Contents
Part one - Philosophical Underpinnings 1
1.1 Introduction 2
1.2 The evolution of the universe 3
1.3 Philosophy of mind 5
1.4 Historical perspective 7
1.5 Physical and Biological considerations 12
1.5.1 Evolutionary theory 12
1.5.2 Quantum theory 15
1.6 Mathematical considerations 24
1.6.1 The progress in mathematics 24
1.6.2 Chaos 26
1.7 Systems and entropy (the emergence of levels) 30
1.8 Psychological considerations 38
1.8.1 Structure of the mind 38
1.8.2 The development of mind 39
1.8.3 The emergence of mind 42
1.9 Cognitive systematization 43
1.10 Philosophical observations 44
1.11 The hypothesis 47
1.12 Observations and Conclusions 51
1.12.1 Language, meaning, intelligence 51
1.12.2 Intelligence and machines 55
1.12.3 Conclusions 57
Part two - The State of the Art 59
2.1 Introduction 60
2.2 Computer Science 61
2.3 Toward a science of cognition 70
2.4 Logic systems 73
2.4.1 Limitations due to language 76
2.4.2 Limitations of logical inference 77
2.4.3 Limitations of the crisp laws of logic 81
2.4.3.1 Objectively probabilistic 81
2.4.3.2 Subjectively probabilistic 82
2.4.3.3 Possibilistic (fuzzy) 82
2.4.4 Problems with symbolic-logic systems 84
2.5 Neuromorphic models 84
2.5.1 The Perceptron 88
2.5.2 A mathematically oriented model 92
2.5.3 Minsky and Papert's analysis 95
2.5.4 Backward propagation 103
2.5.5 Feedback networks 105
2.6 Conclusions 108
2.6.1 Symbolic-logic systems 108
2.6.2 Neuromorphic systems 109
2.6.3 Making use of the tools at both levels 110
Part three - Toward an Implementation 112
3.1 Introduction 113
3.2 Assumptions and compromises 114
3.3 The structure of the program 116
3.3.1 The background systems 118
3.3.1.1 The Environment 118
3.3.1.2 YANLI 118
3.3.2 Grammar and Language 119
3.3.2.1 Choice of language 119
3.3.2.2 Augmented transition networks 120
3.3.2.3 Can ATNs do the job? 124
3.3.2.4 Embedding neural networks 128
3.3.3 The Acquisition, Representation and Activation of Knowledge 128
3.3.3.1 The Natural Language Interface 128
3.3.3.2 The use of ATNs to represent knowledge 131
3.3.3.3 The activation of the knowledge stored in the ATNs 136
3.4 Summary 139
3.5 Conclusions 139
Appendix A 141
Appendix B 143
References 161
Index 168