What is AI?

Is it...
• A type of applied computer science? (But what about theoretical and applied AI?)
• A branch of cognitive science?
• A science at all?

What is intelligence?
• What distinguishes human behavior from the behavior of everything else?
• Whatever behaviors we don't understand?
• Whatever "works", that is, results in survival (or flourishing) in complex environments?

Possible goals
• To understand intelligence independent of any particular "implementation"
• To model human or animal behavior
• To model human thought
• To implement rational thought
• To implement rational action

How is the topic divided up?

• By domain
  o Vision
  o Natural language
  o Diagnosis (medicine, etc.)
  o Mathematics
  o Robotics
  o Education
• By domain-independent task type
  o Search
  o Representation
  o Inference
  o Learning
  o Evolution
  o Perception
  o Planning and action
• By theoretical framework
  o Symbolic AI
    ▪ Logicist AI
    ▪ Case-based reasoning
    ▪ Soar
    ▪ ACT-R
  o "Biomorphic" AI, non-symbolic AI, subsymbolic AI
    ▪ Neural networks, connectionist AI, dynamical systems
    ▪ Evolutionary computation

What are some basic differences in approach?

• The role of learning
  o If you want an intelligent machine, program in the intelligence.
  o If you want an intelligent machine, make it a good learner and send it out into the world.
• The role of the hardware
  o Intelligence is a software problem. An intelligent program can be run on a brain or a computer.
  o The hardware is relevant. We should look for intelligence that is based on the properties of nervous systems because the smartest systems we know about are smart because of their nervous systems. [neural networks, connectionism]
• The role of domain-independent methods
  o Neat AI
    ▪ Theories should be elegant and parsimonious.
    ▪ We should understand precisely what our theories can do and how they behave.
    ▪ Most (or all) of intelligence is governed by general principles.
  o Scruffy AI
    ▪ The mind is a kludge. To make things efficient, inelegant shortcuts are often appropriate.
    ▪ It may be impossible to come to a precise understanding of our theories.
    ▪ There are only a few general principles that apply across domains. Intelligence comes from domain knowledge.
• Two frameworks: symbolic and subsymbolic (connectionist) AI
  o Symbolic models
    ▪ Physical Symbol Systems (Newell, Pylyshyn, Fodor; summarized by Harnad)
        ▪ A set of arbitrary physical tokens (scratches on paper, holes on a tape, events in a digital computer, etc.) that are
        ▪ manipulated on the basis of explicit rules that are
        ▪ likewise physical tokens and strings of tokens. The rule-governed symbol-token manipulation is based
        ▪ purely on the shape of the symbol tokens (not their "meaning"), i.e., it is purely syntactic, and consists of
        ▪ rulefully combining and recombining symbol tokens. There are primitive atomic symbol tokens and composite symbol-token strings. The entire cognitive system and all its parts (the atomic tokens, the composite tokens, the syntactic manipulations, both actual and possible, and the rules) are all
        ▪ semantically interpretable: the syntax can be systematically assigned a meaning (e.g., as standing for objects, as describing states of affairs).
    ▪ Processes happen sequentially.
    ▪ There is a central controller which coordinates the activities of the modules of the cognitive system and selects among candidate processes at each point in time.
    ▪ The cognitive system interacts with the world through interfaces to perception and action, which operate very differently from the internal (cognitive) system.
    ▪ Time is often mapped onto space; that is, the cognitive system has simultaneous access to all of a pattern of some length (word, sentence, etc.). Inputs may also be presented sequentially, but the problem of temporal short-term memory is side-stepped because the inputs are preprocessed.
    ▪ Knowledge is usually programmed into the cognitive system by someone who has a theory of how knowledge is organized. Learning is also possible, but models that learn usually start intelligent.
  o Subsymbolic (connectionist, dynamical) models
    ▪ Control is distributed. There is just the illusion of someone being in charge because the behavior seems purposeful, and it seems to be possible to write a centralized program to make it happen.
    ▪ The basic processes involve very simple interactions among primitive elements arranged in a network. Usually the interaction amounts to the spread of activation.
    ▪ Many of the processes happen in parallel.
    ▪ The cognitive system may interact with the world through perception and action components which are similar to the internal (cognitive) parts of the creature. In some models, the environment and the creature itself constitute one large dynamical system.
    ▪ Knowledge is distributed, usually in the form of patterns of connectivity among the primitive elements. The knowledge in such systems is implicit; it often cannot be simply read off.
    ▪ Knowledge gets into the cognitive system through learning, as the system discovers the statistical properties of the world around it, or through evolution, as generations of creatures are forced to survive in the world.
    ▪ The problem of temporal short-term memory is often addressed, though the continuous interaction of components of the cognitive system with each other and the world may not be.

Introduction to representation
Why representation?
• Tasks: going from inputs (stimuli) to outputs (responses)
• The most primitive solution: a lookup table which specifies an output for every input
• The problem with lookup tables:
  o There may be too many inputs to store (the world is continuous, after all).
  o There is a need for the system to be able to respond appropriately to novel inputs.
• The alternative solution: a function from inputs to outputs
  o AI is about these functions: what they might look like for tasks requiring "intelligence".
  o The functions may be very complex, requiring one or more transformations of the input on the way to the output: internal representations.

What are representations like?

• Are they explicit (directly interpretable), or are they in a form that looks like garbage to an outside observer (even though they serve their function for the system)?
• Are they localized (in one place), or are they distributed throughout the system?
• Are they propositional (language-like), or are they in some other form, for example, more like images?
• Are they static or dynamic? Do they just sit there or do they "happen"?
• Are there different kinds of representations for different domains that have little in common with each other?

What do they need to have?
• Distinction between objects and relations
• Wholes consisting of parts, which are in turn wholes consisting of parts: recursive structure
• (Maybe) slots (roles) and fillers (values)
  o In an object, the SHAPE slot may have filler CYLINDER.
  o In a sentence, the VERB slot may have filler "PUT".
  o In an event, the AGENT slot may have filler ROBOT and the PATIENT slot may have filler STICK1.
• (Maybe) truth and falsehood
• Generality (abstraction)
  o The generalization (concept) BLOCK is an abstraction over all blocks, including, for example, BLOCK4.
  o The generalization (concept) PUT is an abstraction over all instances of putting, including, for example, PUT8, in which the Robot puts a block on another block.

Two basic kinds of representations

• Symbolic
  o The primitives are symbols, e.g., PUT, BLOCK4.
  o More complex expressions are built up by concatenating symbols together into symbol structures, e.g., PUT(ROBOT, BLOCK4, TABLE).
  o Similarity is all-or-none: eq?.
• Connectionist (subsymbolic)
  o The primitives are vectors of numbers.
  o There are no more complex expressions; combinations are produced through addition or some other form of superposition.
  o Similarity is (potentially) continuous: the distance between the vectors.
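The contrast in how the two kinds of representation treat similarity and combination can be sketched in a few lines. This is only an illustration, written in Python rather than Scheme; the procedure names and the three feature vectors are invented for the example and are not part of the notes.

```python
import math

# Symbolic: similarity is all-or-none, like Scheme's eq? on symbols.
def symbolic_same(a, b):
    return a == b

# Connectionist: primitives are vectors, and similarity is graded.
def distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

# Combination by superposition (here, element-wise addition).
def superpose(u, v):
    return [x + y for x, y in zip(u, v)]

# Hypothetical feature vectors (values made up for illustration).
block4 = [1.0, 0.0, 0.9]
block5 = [0.9, 0.1, 1.0]
ball1 = [0.0, 1.0, 0.1]

print(symbolic_same("BLOCK4", "BLOCK5"))   # the symbols are simply different
print(distance(block4, block5) < distance(block4, ball1))
```

With symbols, BLOCK4 and BLOCK5 are no more alike than BLOCK4 and BALL1; with vectors, the two blocks can end up closer to each other than either is to the ball.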

Some other questions concerning representation
• What sort of evidence do we have for what the internal representations of cognition look like?
• What about the input and output themselves? What form do they take?
• How do representations relate to the world, to perception, and to action?

Memory and learning (and representation)

There is usually a basic distinction between long-term, general knowledge of a particular type (in long-term memory) and the short-term characterization of the current situation. The system needs to be able to take items in STM and use them to access knowledge in LTM (a kind of search). In symbolic systems this is usually accomplished through some form of pattern matching. LTM normally consists of categories (concepts) and rules specifying what sort of action to take or inference to make in a particular sort of situation. Application of a rule can result in external behavior or a change in the contents of STM. Knowledge may get into LTM through hard-wiring by the programmer, through learning, or through evolution.

Learning is usually induction: in response to a set of examples, the system creates new categories or rules (a kind of search). Ease of learning depends on the quality of the feedback (if any) provided by the environment.

Kinds of processing: what we need to do with representations

Categorization: given a representation of an object or situation, assign it to one of a finite set of labels. Categorizing an input image as a BLOCK. Categorizing an input word as a NOUN.

Parsing: given a pattern, assign some structure to it. Segmenting an input image into constituent objects. Segmenting an input sentence into constituent phrases.

Compression: given a pattern, represent it in a more compact way, taking advantage of the regularity in it. Representing an input image in terms of a small number of dimensions.

Deduction, inference: given some facts, infer one or more other facts that follow from them. Given that BLOCK1 is ON BLOCK2, infer that BLOCK2 is UNDER BLOCK1. Given that X is a BLOCK, infer that it has flat faces.

Induction: given some examples, create a general rule or category that covers them all. Given multiple instances of scenes, create the rule relating ON and UNDER. Given multiple instances of blocks, create the category BLOCK and use it to categorize new instances.

Action: given a situation (or some facts), take an appropriate action. Given a command to put a block in a box, perform the sequence of actions involved in doing it.

Predicate calculus
What should a good representational format do for us?

• It should be correct. It should represent what we think it represents: together with an inference mechanism, it should permit the inferences that we would like the system to make and fail to make the inferences that we would not like.
• It should be expressive; it should allow us to distinguish all situations that we need to distinguish (it should be unambiguous).
• It should allow us to point to entities that need to be pointed to.
• It should treat situations which we believe are similar as similar.
• It should be flexible, allowing us to represent the same situation in different ways, reflecting different construals.

Some spatial examples

• Some "facts" about objects
  o BLOCKS
    ▪ They have square corners.
    ▪ They have straight edges.
    ▪ They have six faces.
    ▪ They don't roll.
    ▪ They are a kind of prism.
    ▪ The thing I'm looking at now is one of them.
  o BALLS
    ▪ They're round.
    ▪ They have no edges or faces.
    ▪ They roll.
• Some facts about relations
  o IN
    ▪ The contained object is surrounded by the container at some point.
    ▪ The contained object is smaller than the container in some dimension.
    ▪ The container has some empty space within it.
    ▪ In order to be freely moved, the contained object needs to be taken out of the container.
    ▪ It's a kind of spatial relation.
    ▪ The scene I'm looking at now has one in it. It relates a cup (the container) and a stick (the contained thing).
  o BEHIND (viewer-centered)
    ▪ At least part of the object that is behind is obscured by the object that is in front.
    ▪ The object that is behind is further from the viewer than the object that is in front.
    ▪ To touch the object that is behind, the viewer has to reach over or around the object that is in front.
  o ON
    ▪ The bottom of the supported object is in contact with the top of the supporter.
• Some facts about functions
  o TOP-OF
    ▪ The top of an object is a surface or a corner.
    ▪ We can find the top of an object by looking for the part of it that is furthest from the earth (assuming we're on the earth).
    ▪ When an object is turned over, whatever was the top of it is now the bottom of it.
  o INSIDE-OF
    ▪ The inside of an object is a region in space.
    ▪ If an object is solid, the inside of it is part of it.
  o SUPPORTER
    ▪ Given two objects, one on the other, the supporter of the two (or of the situation) is the one that, if taken away, would cause the other to fall.

Relations, objects, predicates, functions
• Unlike objects, relations take arguments.
• A relation predicated of one or more objects is either true or not.
• In predicate calculus
  o Objects and relations are represented by explicit constant symbols: block23, in
  o Predicates are represented by expressions consisting of a relation constant followed by object constants in a fixed order:
      (in block23 cup4)
  o An alternate way of representing predicates: the arguments are paired with role constants, and their order is unspecified:
      (in (container cup4) (contained block23))
  o Note that nouns on the one hand and verbs, pre/postpositions, and adjectives on the other hand do not map neatly onto object and relation constants.
  o Functions are represented by function constants, and function expressions, like predicates, by a function constant followed by one or more arguments:
      (top-of block8)
  o Function expressions return objects, so they can replace object symbols in predicates:
      (above (bottom-of block8) (top-of block4))

Categories and categorization
• Categorization is about going from lower-level to higher-level representations, for example, from a specific object instance to an object category such as BLOCK.
• We need to be able to distinguish instances from categories and to represent their relationship.
• We need to be able to distinguish categories at different levels of abstractness (different taxonomic levels) and represent their relationship.
• Predicate calculus
  o Represents object instances and object categories with explicit symbols and their relationship as a predicate: (block obj23).
  o Represents the relationship between categories at different levels using variables, universal quantification, and implication:
      (forall (?x) (if (block ?x) (prism ?x)))
• An alternative way of representing the relationship between instances and categories and between categories at different taxonomic levels: AKO (a kind of) and ISA relations:
      (isa obj23 block) (ako block prism)
• Representing relation instances
  o Not explicit in ordinary predicate calculus; no way to directly express that a particular event belongs to the event category SING.
  o Alternative: relation instance (and relation category) symbols.
      (sing event23) (= (singer-of event23) terry)

Miscellaneous considerations

• Connectives: conjunction (and), disjunction (or), implication (if), equivalence (equiv), negation (not); the truth of each is defined in terms of the truth of its operand(s)
• Existential quantification
    (exists (?x) (and (b551-student ?x) (not (know ?x scheme))))
    (not (exists (?x) (and (b551-student ?x) (not (know ?x arithmetic)))))
• Sentences, well-formed formulas
• Equivalences of particular expressions (examples)
    (equiv (if p q) (if (not q) (not p)))
    (equiv (if p q) (or (not p) q))
    (equiv (or p (and q r)) (and (or p q) (or p r)))
    (equiv (not (exists (?x) (p ?x))) (forall (?x) (not (p ?x))))
    (equiv (forall (?x) (and (p ?x) (q ?x))) (and (forall (?x) (p ?x)) (forall (?x) (q ?x))))
• Primitives
  o If we opt for a symbolic representation such as predicate calculus, we need an alphabet of basic symbols.
  o How would we decide on a set of primitives? This is usually based on practical, rather than theoretical, considerations.
  o We must make decisions about representational granularity.
• Limitations of first-order predicate calculus
  o Representing belief and knowledge
      (believes al (loves mary al))
      (believes al (exists-life mars))
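The three propositional equivalences listed under "Equivalences of particular expressions" can be checked mechanically by comparing truth tables, since the truth of each connective is defined in terms of the truth of its operands. The sketch below is in Python rather than Scheme; implies and equivalent are hypothetical helper names, not procedures from the notes.

```python
from itertools import product

def implies(p, q):
    """Truth-table definition of (if p q): false only when p is true and q is false."""
    return not (p and not q)

def equivalent(f, g, nvars):
    """True if f and g agree on every assignment of truth values."""
    return all(f(*vals) == g(*vals)
               for vals in product([False, True], repeat=nvars))

# (equiv (if p q) (if (not q) (not p)))
contraposition = equivalent(lambda p, q: implies(p, q),
                            lambda p, q: implies(not q, not p), 2)

# (equiv (if p q) (or (not p) q))
as_disjunction = equivalent(lambda p, q: implies(p, q),
                            lambda p, q: (not p) or q, 2)

# (equiv (or p (and q r)) (and (or p q) (or p r)))
distribution = equivalent(lambda p, q, r: p or (q and r),
                          lambda p, q, r: (p or q) and (p or r), 3)

print(contraposition, as_disjunction, distribution)
```

The same helper also shows when two expressions are not equivalent: (if p q) and (if q p) disagree when p is true and q is false. The quantified equivalences need a domain of individuals and are not covered by this sketch.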

Representing predicate calculus in Scheme
• One possibility: relation constants are procedures, and predicates are procedure calls, returning either #t or #f.
• A better option: all predicate calculus expressions take the form of lists, and procedures are ways of manipulating and searching through them, such as infer and true?. The second option requires that we build in the definitions of conjunction, negation, etc.

What will we do with our representations?

assert takes a sentence and adds it to a database, which is a conjunction of facts taken as true.

(assert '(block b1) database) → modified database

infer takes a database and returns a list of sentences that can be inferred from the database.

(infer '((block b1) (forall (?b) (if (block ?b) (nfaces ?b six))))) → ((nfaces b1 six))

true? takes a database and a sentence and tells whether the sentence is true, given the database. dont-know is a possibility.

(true? '((block b1) (forall (?b) (if (block ?b) (nfaces ?b six)))) '(nfaces b1 six)) → yes

fill-in takes a database and a sentence containing one or more variables and returns bindings for the variables.

(fill-in '((block b1) (forall (?b) (if (block ?b) (nfaces ?b six)))) '(nfaces b1 ?nf)) → ((?nf six))

categorize takes a conjunction of facts about an individual and returns a category for the individual.

(categorize '((nfaces b1 six) (height b1 5cm) (square-corners b1) (forall (?b) (if (and (nfaces ?b six) (square-corners ?b) (less-than (height ?b 20cm)) (greater-than (height ?b 2cm))) (block ?b)))) 'b1) → (block b1)

learn takes a conjunction of facts about instances of a category and a category symbol and returns a generalization about the category.

(learn '((block b1) (nfaces b1 six) (square-corners b1) (color b1 red) (block b2) (square-corners b2) (color b2 black)...) 'block) → (forall (?b) (if (and (nfaces ?b six) (square-corners ?b)) (block ?b)))
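To make the first four operations concrete, here is a minimal sketch in Python, with tuples standing in for the Scheme lists. It is an illustration under strong assumptions, not the course's implementation: the database is a Python list, rules are restricted to the form (forall (vars) (if antecedent consequent)) with a single atomic antecedent, and infer does only one step of inference. The names assert_fact, true_p and fill_in are stand-ins for assert, true? and fill-in, which are not legal Python identifiers.

```python
def is_var(x):
    return isinstance(x, str) and x.startswith("?")

def match(pattern, fact, bindings=None):
    """Match a pattern (which may contain ?variables) against a ground fact.
    Returns a dict of bindings, or None if the two do not match."""
    bindings = dict(bindings or {})
    if is_var(pattern):
        if pattern in bindings and bindings[pattern] != fact:
            return None
        bindings[pattern] = fact
        return bindings
    if isinstance(pattern, tuple) and isinstance(fact, tuple):
        if len(pattern) != len(fact):
            return None
        for p, f in zip(pattern, fact):
            bindings = match(p, f, bindings)
            if bindings is None:
                return None
        return bindings
    return bindings if pattern == fact else None

def substitute(expr, bindings):
    """Replace bound variables in expr by their values."""
    if is_var(expr):
        return bindings.get(expr, expr)
    if isinstance(expr, tuple):
        return tuple(substitute(e, bindings) for e in expr)
    return expr

def assert_fact(sentence, database):
    return database + [sentence]

def infer(database):
    """One step of inference from single-antecedent forall/if rules."""
    facts = [s for s in database if s[0] != "forall"]
    rules = [s for s in database if s[0] == "forall"]
    derived = []
    for _, _, (_, antecedent, consequent) in rules:
        for fact in facts:
            b = match(antecedent, fact)
            if b is not None:
                new = substitute(consequent, b)
                if new not in facts and new not in derived:
                    derived.append(new)
    return derived

def known(database):
    return [s for s in database if s[0] != "forall"] + infer(database)

def true_p(database, sentence):
    return "yes" if sentence in known(database) else "dont-know"

def fill_in(database, pattern):
    return [b for b in (match(pattern, f) for f in known(database))
            if b is not None]

rule = ("forall", ("?b",), ("if", ("block", "?b"), ("nfaces", "?b", "six")))
db = assert_fact(("block", "b1"), [rule])
print(infer(db))
print(fill_in(db, ("nfaces", "b1", "?nf")))
```

Because the sketch has no treatment of negation, true_p can only answer yes or dont-know; answering no would require representing and reasoning about negated sentences.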

Predicate calculus: practice
Using predicate calculus representation (Scheme format), show how to represent the knowledge embodied in the following English sentences. As far as possible, show the relationships among the different elements.

When a moving ball strikes a stationary ball, the moving ball is deflected in a direction which is roughly opposite to its original direction, and the originally stationary ball starts moving roughly in the direction of the originally moving ball. Ball 3 struck Ball 2, which was stationary. Ball 3 was moving south when this happened.

(Ignore details like velocity, the effect of the mass of the balls, and the angle at which the moving ball strikes the stationary ball (unless you want to be really ambitious :-). This is very naive physics.)

Search 1
Problem solving as search
• A solution as a state in the space of possible solutions
• Problem solving as search through this state space

Some basic questions concerning problem solving as search
• How can solving the problem be treated as the execution of a sequence of steps?
• Do the steps result in more and more complete solutions, or are candidate complete solutions available at every step?
• Is it easier (or cheaper) to start at the goal and work backwards?
• Is the consequence of a step predictable?
• Is there an adversary (another agent who can constrain the possible steps)?
• Is the solution the way a goal is reached or the nature of the goal itself (what's found at the end)?
• Is it important to know more than one way to reach the goal?
• Is it important that the goal is reached in the most efficient way, the way requiring the least cost to the agent?
• Is it important that the solution is found quickly?
• Is there a way to estimate the "distance" to the goal?
• How easy is it to reconsider and try a completely different set of steps?
• Is the step size adjustable?
• Is it possible to consider a number of different options in parallel?
• Is the number of potential ways to the goal finite?

Formalizing search

• Problem state: a particular configuration of the objects that are relevant for a problem
  o Figuring out how to represent the problem states is no simple matter.
• State space: the space of all possible problem states
  o Normally only a partial state space is considered.
• Initial state: where the search starts
• Goal states: where the search should end
  o There may be any number of these, and we may be interested in finding only one or all of them.
• State space search: search through the space of problem states for a goal state
• Search trees: nodes are states (or paths), links are one-step transitions from one state to a successor state
• Expanding a node (extending a state/path)
• Testing for goal states (nodes)
• The queue of untried states (paths); adding new states (paths) to the queue (stack)
• Branching factor of the search: average number of successor states each state has
• Depth of the search: how far down the tree is extended during the search

Basic schema for search
(Assuming that the path makes a difference, we maintain a queue of paths rather than just states.)

The algorithm makes use of two procedures specific to the problem:
1. A predicate goal?, which takes a state and returns #t if the state is a goal state
2. A procedure expand, which takes a state and returns all of the successor states to the state

• Form a one-element queue consisting of a zero-length path that contains only the root node.
• Until the first path in the queue terminates at a goal node (satisfies goal?) or the queue is empty,
  o Remove the first path from the queue; create new paths by expanding the first path to all the neighbors of the terminal node.
  o Reject all new paths with loops.
  o ...
  o Add the new paths, if any, to the queue.
  o ...
• If the goal node is found, announce success and return the path to the goal state found; otherwise, announce failure.

Depth-first, breadth-first, nondeterministic search

• Depth-first search: Add the new paths to the front of the queue. (That is, use a stack.)
  o Backtracks when a dead-end or "futility" limit is reached
  o Appropriate when there is a high branching factor
  o May be advantageous when many solutions exist but only one needs to be found
  o May fail to find a solution
• Breadth-first search: Add the new paths to the back of the queue.
  o Appropriate when there are long useless branches but not when there is a high branching factor (time and space complexity is exponential)
  o Guaranteed to find a solution (if there is one) and to find the one with the fewest steps (though not necessarily the least cost) first
• Nondeterministic search: Add the new paths at random places in the queue.
  o When unable to choose between depth-first and breadth-first

To implement the type of search, we can add a merge-queue argument to our search procedure. There is a different one of these for each of the ways we will add new paths to the queue.
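A minimal sketch of the basic schema with a merge-queue argument follows, in Python rather than Scheme. Paths are lists of states with the most recent state last; the small graph and the underscore-style procedure names are assumptions made for the example.

```python
import random

def search(start, goal, expand, merge_queue):
    """Basic search schema; the queue holds paths, not just states."""
    queue = [[start]]                      # one zero-length path
    while queue:
        path = queue.pop(0)                # remove the first path
        if goal(path[-1]):
            return path                    # success: path to the goal
        new_paths = [path + [s] for s in expand(path[-1])
                     if s not in path]     # reject new paths with loops
        queue = merge_queue(queue, new_paths)
    return None                            # failure: queue is empty

def depth_first_merge(queue, new_paths):
    return new_paths + queue               # front of the queue (a stack)

def breadth_first_merge(queue, new_paths):
    return queue + new_paths               # back of the queue

def nondeterministic_merge(queue, new_paths):
    queue = list(queue)
    for p in new_paths:                    # random places in the queue
        queue.insert(random.randint(0, len(queue)), p)
    return queue

# Hypothetical state space: each state maps to its successors.
graph = {"S": ["A", "B"], "A": ["C"], "B": ["G"], "C": ["G"], "G": []}
is_goal = lambda state: state == "G"
successors = lambda state: graph[state]

print(search("S", is_goal, successors, depth_first_merge))
print(search("S", is_goal, successors, breadth_first_merge))
```

On this graph, depth-first search follows the first branch all the way down and returns the three-step path through A and C, while breadth-first search returns the two-step path through B.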

Tower of Hanoi

• To do best-first search using your basic search procedure, you only need to define a new merge-queue procedure.
  o best-first-merge takes the queue, the new paths, and the problem-specific estimate procedure, adds the new paths to the queue, and then sorts the new queue by the values returned by estimate when applied to the first state on each path.
  o You can use the Scheme procedure sort to do this. sort takes a comparison predicate and a list and sorts the list using the predicate to compare items.
• Figure out how you will represent states for your particular problem. Here's one way for Tower of Hanoi.
  o For the Tower of Hanoi Puzzle, states are lists consisting of a sublist for the disks on each peg. Numbers represent the diameters of the disks, and they are arranged from top to bottom. Thus this is the initial state for the 3-disk puzzle:
      ((1 2 3) () ())
• Write the goal?, expand, estimate, and print-state procedures for your particular problem.
  o For the Tower of Hanoi Puzzle, these are one possibility.
• Call the search procedure on the problem-specific states and procedures.
  o Best-first search on the Tower of Hanoi Puzzle: a trace indicating states which are searched and the queue at each point in the search
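One possible set of problem-specific procedures for the 3-disk puzzle is sketched below, in Python rather than Scheme. Several choices here are assumptions for the example: the goal is taken to be all disks on the third peg, estimate simply counts the disks not yet on that peg, paths keep their most recent state last (so the merge applies estimate to the last state on each path), and the search loop also skips states it has already generated, which is an addition to the basic schema that keeps the search short.

```python
GOAL = ((), (), (1, 2, 3))        # assumed goal: everything on the third peg

def goal(state):
    return state == GOAL

def expand(state):
    """All states reachable by moving one top disk onto a legal peg."""
    successors = []
    for i, src in enumerate(state):
        if not src:
            continue
        disk = src[0]                       # top disk of the source peg
        for j, dst in enumerate(state):
            if i != j and (not dst or dst[0] > disk):
                pegs = list(state)
                pegs[i] = src[1:]
                pegs[j] = (disk,) + dst
                successors.append(tuple(pegs))
    return successors

def estimate(state):
    """Rough distance to the goal: disks not yet on the third peg."""
    return len(state[0]) + len(state[1])

def best_first_merge(queue, new_paths, estimate):
    return sorted(queue + new_paths, key=lambda path: estimate(path[-1]))

def search(start, merge_queue):
    queue = [[start]]
    seen = {start}                          # states generated so far
    while queue:
        path = queue.pop(0)
        if goal(path[-1]):
            return path
        new_paths = []
        for state in expand(path[-1]):
            if state not in seen:
                seen.add(state)
                new_paths.append(path + [state])
        queue = merge_queue(queue, new_paths, estimate)
    return None

start = ((1, 2, 3), (), ())
solution = search(start, best_first_merge)
for state in solution:
    print(state)
```

Best-first search with this crude estimate finds a legal solution but not necessarily the shortest one (seven moves for three disks).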

Heuristic search: using estimated distance remaining
• Another procedure specific to the problem: estimate, which takes a state and estimates the distance from it to a goal state
• Hill climbing: Sort the new paths by the estimated distance left to the goal (using estimate) and add them to the front of the queue.
  o Parameter-oriented hill climbing: each problem state is a setting for a set of parameters
  o Problems for hill-climbing: foothills (local maxima), plateaus, ridges
  o Nondeterministic search as a way of escaping from local maxima
  o Gradient ascent: For a parameter x and "goodness" g which is a smooth function of x, the change in x should be proportional to the speed with which g changes as a function of x, that is, ∂g/∂x.
• Best-first search: After adding new paths to the queue, sort all the paths in the queue by the estimated distance left to the goal (using estimate). Unlike hill climbing, can jump around in the search space.
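Two small sketches of parameter-oriented hill climbing, in Python rather than Scheme. The first climbs a made-up one-dimensional "landscape" of goodness values and shows the foothill problem; the second is gradient ascent with a numerically estimated slope. The landscape, the goodness function, the step rate, and the procedure names are all assumptions for the example.

```python
def hill_climb(landscape, i):
    """Move to the better neighbor until no neighbor improves on position i."""
    while True:
        neighbors = [j for j in (i - 1, i + 1) if 0 <= j < len(landscape)]
        best = max(neighbors, key=lambda j: landscape[j])
        if landscape[best] <= landscape[i]:
            return i                       # a (possibly local) maximum
        i = best

def gradient_ascent(g, x, rate=0.1, steps=100, h=1e-6):
    """Change x in proportion to an estimate of dg/dx."""
    for _ in range(steps):
        slope = (g(x + h) - g(x - h)) / (2 * h)
        x += rate * slope
    return x

landscape = [1, 3, 2, 5, 9, 4]             # goodness at positions 0..5
print(hill_climb(landscape, 0))            # stuck on the foothill at 1
print(hill_climb(landscape, 2))            # reaches the peak at 4

goodness = lambda x: -(x - 3) ** 2         # smooth, single peak at x = 3
print(gradient_ascent(goodness, 0.0))
```

Starting from position 0 the climber stops on the foothill at position 1, even though position 4 is higher; starting from position 2 it reaches the true peak.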

Heuristic search: optimal search
• British Museum procedure: blindly find all paths, selecting the best
• Branch-and-bound search: After adding new paths to the queue, sort all the paths in the queue by the current path length.
  o Note: When a goal state is found, it is still necessary to extend partial paths which are shorter than the complete one because they may end up shorter overall.
• Using underestimates of distance remaining: After adding new paths to the queue, sort all the paths in the queue by an underestimate of total path length (using estimate). Underestimates allow you to stop when partial path estimates are longer than the shortest complete path.
• Eliminating redundant paths: After adding new paths to the queue, if there are two or more paths reaching a common node, keep only the one with the shortest path length.
• A*: branch-and-bound with underestimates and redundant paths eliminated
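A compact sketch of A* in Python (not Scheme), combining the three ingredients above: paths are ordered by path length so far plus an underestimate of the distance remaining, and a path is dropped if an equally short or shorter path to the same node has already been extended. The graph, its costs, and the underestimates are invented for the example; a priority queue from the standard heapq module replaces the sort-the-whole-queue step.

```python
import heapq

def a_star(start, goal, neighbors, estimate):
    """Return (length, path) for a shortest path, or None if there is none."""
    # Queue entries: (underestimate of total length, length so far, path)
    queue = [(estimate(start), 0, [start])]
    best = {}                  # shortest length at which each node was extended
    while queue:
        _, length, path = heapq.heappop(queue)
        node = path[-1]
        if node == goal:
            return length, path
        if node in best and best[node] <= length:
            continue           # redundant path to an already-extended node
        best[node] = length
        for nxt, cost in neighbors(node):
            if nxt not in path:
                g = length + cost
                heapq.heappush(queue, (g + estimate(nxt), g, path + [nxt]))
    return None

# Hypothetical weighted graph and underestimates of distance to G.
graph = {"S": [("A", 1), ("B", 4)],
         "A": [("B", 2), ("G", 6)],
         "B": [("G", 2)],
         "G": []}
remaining = {"S": 4, "A": 3, "B": 2, "G": 0}

print(a_star("S", "G", lambda n: graph[n], lambda n: remaining[n]))
```

Here the direct-looking routes S-B-G (length 6) and S-A-G (length 7) both lose to S-A-B-G (length 5).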

Genetic search

• Parameter adjustment, function optimization problems
  o Calculus methods
  o Blind search and random methods
  o Parallel search
• Evolutionary computation
  o What's needed for evolutionary computation (abstract or real) to work
    ▪ "Creatures" which
        1. Give birth to other creatures, passing on their traits to them
        2. Die
    ▪ A way for traits to be passed on: an inherited genotype, in interaction with the environment, results in a phenotype
    ▪ A way of evaluating the creatures' traits: some aid in survival or reproduction, others don't ("survival of the fittest")
    ▪ A way of generating new traits: mutation
  o How it works
    ▪ Each creature is born with some combination of traits; it may not be possible to simply figure out what combination works best for the environment of the creatures.
    ▪ Creatures live their lives. Some survive long enough to reproduce and have offspring.
    ▪ If a particular trait or combination of traits helps creatures reproduce or live longer, creatures with that trait (those traits) will tend to have more offspring.
    ▪ Creatures pass on (at least some of) their traits to their offspring. In sexual reproduction, they pass on a combination of the parents' traits.
    ▪ The percentage of creatures with the good traits should increase on each generation.
    ▪ There is a small probability that a new creature will end up with some random traits which it did not inherit from its parent(s). In this way new traits or combinations of traits can be tried out in the world.
• Genetic algorithms
  o What makes them special
    ▪ Work from a coding of the parameter set, not the parameters themselves
    ▪ Search from a population of points, not a single point
    ▪ Use fitness information only, no auxiliary knowledge
    ▪ Use probabilistic transition rules
  o Used for
    ▪ Parameter-oriented search for problems in which partial solutions are not evaluated (paths are not sought)
    ▪ Modeling biological evolution
    ▪ Designing a suitable initial architecture for cognition (say, a neural network)
  o The basic GA
    ▪ Operators
        ▪ Selection: Select individuals in the population for mating on the basis of the individuals' fitness. A common choice is fitness-proportionate selection: the probability of selecting an individual is its relative fitness (implemented through "roulette-wheel sampling").
        ▪ Crossover: Combine the genomes of two parents to produce the genome of the child. The most common choice: select a position and exchange the substrings before and after that locus between two genomes to create two offspring.
        ▪ Mutation: Make random changes in the genome of an individual. The most common choice: with a small probability, flip each bit in the genome.
    ▪ Parameters
        ▪ n individuals in the population
        ▪ l loci in each genome
        ▪ Probability of crossover: pc (often something like .7)
        ▪ Probability of mutation: pm (often something like .001)
    ▪ Environment
        ▪ Fitness function f which evaluates each individual, assigning a quantity to it
        ▪ Or (less commonly) a "world" which permits some individuals to produce more offspring than others
    ▪ Algorithm
        ▪ For each run
            ▪ Start with a population of n randomly generated genomes
            ▪ For each generation
                ▪ (Realize the genomes as phenomes (individuals).)
                ▪ Evaluate each of the individuals with the fitness function.
                ▪ Until n offspring have been created,
                    ▪ Using the selection operator, choose a pair of parents from the population.
                    ▪ Produce two offspring. Use the crossover operator with probability pc. Otherwise produce copies of the parents.
                    ▪ For each locus in each new offspring, apply the mutation operator with probability pm.
                    ▪ Place the resulting offspring in the new population.
                ▪ Replace the old population with the new.
  o An example: maximizing a function (x^2)

Generation   Genomes      Fitness      Individuals selected for mating
1            0 1 0 1 0    100 (.06)    1 1 1|1 0
             1 1 1 1 0    900 (.57)    1 1 0|0 0
             1 1 0 0 0    576 (.36)    1|1 1 1 0
             0 0 1 0 0     16 (.01)    0|1 0 1 0
2            1 1 1 0 0    784 (.34)    1 1 1 0|0
             1 1 0 1 0    676 (.29)    1 1 0 1|0
             1 1 0 1 0    676 (.29)    1 1 1|0 0
             0 1 1 1 0    196 (.08)    1 1 0|1 0
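The basic GA, applied to the same x^2 example with 5-bit genomes, can be sketched as follows. This is Python rather than Scheme and only an illustration: the procedure names, the seed, the number of generations, and the fallback to uniform selection when every fitness is zero are assumptions, while pc = .7 and pm = .001 are the typical values mentioned above.

```python
import random

LOCI = 5                                   # l loci in each genome

def fitness(genome):
    """Decode the bit string as an integer x and return x squared."""
    return int(genome, 2) ** 2

def select(population, rng):
    """Fitness-proportionate (roulette-wheel) selection."""
    weights = [fitness(g) for g in population]
    if sum(weights) == 0:                  # assumed fallback: pick uniformly
        return rng.choice(population)
    return rng.choices(population, weights=weights, k=1)[0]

def crossover(a, b, point):
    """Exchange the substrings after the crossover point."""
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(genome, pm, rng):
    """Flip each bit independently with probability pm."""
    return "".join("10"[int(bit)] if rng.random() < pm else bit
                   for bit in genome)

def next_generation(population, pc, pm, rng):
    new = []
    while len(new) < len(population):
        a, b = select(population, rng), select(population, rng)
        if rng.random() < pc:
            a, b = crossover(a, b, rng.randint(1, LOCI - 1))
        new += [mutate(a, pm, rng), mutate(b, pm, rng)]
    return new[:len(population)]

def run(n=4, generations=20, pc=0.7, pm=0.001, seed=0):
    rng = random.Random(seed)
    population = ["".join(rng.choice("01") for _ in range(LOCI))
                  for _ in range(n)]
    for _ in range(generations):
        population = next_generation(population, pc, pm, rng)
    return population

# The first crossover in the table: 1 1 1|1 0 with 1 1 0|0 0.
print(crossover("11110", "11000", 3))
print(run())
```

The crossover call reproduces the first two genomes of generation 2 in the table (1 1 1 0 0 and 1 1 0 1 0). With a population of only four, a run may converge on a genome short of 1 1 1 1 1; larger populations and more generations make that less likely.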

Forward and backward chaining
The basic elements
Given a set of assertions (facts), a set of rules, and a goal, prove the goal.

Assertions: Each a predicate calculus sentence with no variables and no connectives other than not.

Goal: Either a sentence with no variables and no connectives, in which case the goal is to prove that it is true (inferable from the facts and the rules), or an existentially quantified sentence whose variables are to be assigned values (if possible) given the facts and the rules.

Rules: Each a universally quantified implication with a conjunction of sentences as antecedent and a single sentence as consequent.

An example

• Assertions
  o A ball is on a block.
      (on ball1 block1)
  o A pyramid is above the ball.
      (above pyramid1 ball1)
• Goals
  o Is the pyramid above the block?
      (above pyramid1 block1)
  o What's above the block?
      (above ?x block1)
• Rules
  1. If something is on something, it's also above it.
      (((on ?x ?y)) (above ?x ?y))
  2. If a is above b, b is below a.
      (((above ?a ?b)) (below ?b ?a))
  3. If b is above c and a is above b, then a is above c.
      (((above ?b ?c) (above ?a ?b)) (above ?a ?c))

(In Homework 2 you will be doing a restricted version of forward chaining. Because each rule has only one conjunct in its antecedent, you can iterate through the assertions rather than the rules to attempt to find new assertions (extend the current state).)

Forward chaining

Attempt to match the antecedents of rules with the assertions, adding new assertions based on the consequents if this is possible, until an assertion matching the goal is added.

To prove: (above pyramid1 block1)
• The antecedent of the first rule matches (on ball1 block1), so we can assert (above ball1 block1).
• The antecedent of the third rule matches (above ball1 block1) and (above pyramid1 ball1), so we can assert (above pyramid1 block1). This matches the goal.
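The forward-chaining trace above can be reproduced with a short sketch, in Python rather than Scheme. Assertions are tuples, each rule is a pair of a list of antecedents and a consequent (mirroring the list format of the rules in the example), and the goal is assumed to contain no variables. The procedure names are hypothetical.

```python
def is_var(x):
    return isinstance(x, str) and x.startswith("?")

def match(pattern, fact, bindings):
    """Match one antecedent against one assertion, extending the bindings."""
    if len(pattern) != len(fact):
        return None
    b = dict(bindings)
    for p, f in zip(pattern, fact):
        if is_var(p):
            if b.setdefault(p, f) != f:
                return None                # variable already bound differently
        elif p != f:
            return None
    return b

def match_all(antecedents, facts, bindings):
    """Yield every set of bindings that satisfies all the antecedents."""
    if not antecedents:
        yield bindings
        return
    for fact in facts:
        b = match(antecedents[0], fact, bindings)
        if b is not None:
            yield from match_all(antecedents[1:], facts, b)

def forward_chain(facts, rules, goal):
    """Add consequents until the goal is asserted or nothing new can be added."""
    facts = list(facts)
    added = True
    while goal not in facts and added:
        added = False
        for antecedents, consequent in rules:
            for b in list(match_all(antecedents, facts, {})):
                new = tuple(b.get(t, t) for t in consequent)
                if new not in facts:
                    facts.append(new)
                    added = True
    return goal in facts

assertions = [("on", "ball1", "block1"), ("above", "pyramid1", "ball1")]
rules = [([("on", "?x", "?y")], ("above", "?x", "?y")),
         ([("above", "?a", "?b")], ("below", "?b", "?a")),
         ([("above", "?b", "?c"), ("above", "?a", "?b")], ("above", "?a", "?c"))]

print(forward_chain(assertions, rules, ("above", "pyramid1", "block1")))
```

Forward chaining also derives assertions that are irrelevant to the goal, such as (below ball1 pyramid1); that cost is one motivation for working backward from the goal instead.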

Backward chaining
Attempt to match the consequents of rules with a goal, replacing the goal with new goals based on the antecedent of the rule if this is possible, until all of these goals match assertions.

To prove: (below block1 ?x)
• The consequent of the second rule matches the goal, so we can replace the goal with (above ?x block1).
• The consequent of the first rule matches the new goal, so we can replace it with (on ?x block1). This goal matches the assertion (on ball1 block1) with the variable binding ?x = ball1.
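A comparable sketch of backward chaining follows. It is again only an illustration under stated assumptions: the tuple encoding, the renaming of rule variables on each use, and the depth bound (needed because the transitivity rule can otherwise recurse forever) are choices made here, not something the notes prescribe.

```python
import itertools

# A minimal backward-chaining sketch; names and the depth bound are
# illustrative assumptions.

FACTS = [("on", "ball1", "block1"), ("above", "pyramid1", "ball1")]
RULES = [
    ([("on", "?x", "?y")], ("above", "?x", "?y")),
    ([("above", "?a", "?b")], ("below", "?b", "?a")),
    ([("above", "?b", "?c"), ("above", "?a", "?b")], ("above", "?a", "?c")),
]

_fresh = itertools.count()


def is_var(term):
    return isinstance(term, str) and term.startswith("?")


def walk(term, bindings):
    """Follow variable bindings until a constant or unbound variable."""
    while is_var(term) and term in bindings:
        term = bindings[term]
    return term


def unify(a, b, bindings):
    """Unify two flat sentences; return extended bindings or None."""
    if len(a) != len(b):
        return None
    bindings = dict(bindings)
    for s, t in zip(a, b):
        s, t = walk(s, bindings), walk(t, bindings)
        if s == t:
            continue
        if is_var(s):
            bindings[s] = t
        elif is_var(t):
            bindings[t] = s
        else:
            return None
    return bindings


def rename(rule):
    """Give a rule fresh variable names so separate uses do not clash."""
    n = next(_fresh)
    antecedents, consequent = rule

    def ren(sentence):
        return tuple(f"{t}_{n}" if is_var(t) else t for t in sentence)

    return [ren(a) for a in antecedents], ren(consequent)


def prove(goals, facts, rules, bindings, depth=0, max_depth=5):
    """Yield bindings under which every goal in the list is proved."""
    if not goals:
        yield bindings
        return
    if depth > max_depth:
        return
    first, rest = goals[0], goals[1:]
    for fact in facts:                      # goal matches an assertion
        b = unify(first, fact, bindings)
        if b is not None:
            yield from prove(rest, facts, rules, b, depth, max_depth)
    for rule in rules:                      # replace goal by antecedents
        antecedents, consequent = rename(rule)
        b = unify(first, consequent, bindings)
        if b is not None:
            yield from prove(antecedents + rest, facts, rules, b,
                             depth + 1, max_depth)


def backward_chain(goal, facts, rules):
    """Return values for the goal's variables from the first proof found."""
    for b in prove([goal], facts, rules, {}):
        return {t: walk(t, b) for t in goal if is_var(t)}
    return None
```

Only the first proof is returned here; continuing to iterate over `prove` would enumerate further bindings.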

More on representation
Much of human knowledge seems to be organized in chunks representing types of events: frames or schemas.

Consider what we know about CABBAGE: what it looks like, how it tastes, how it's prepared, how nutritious it is, how much it costs, what other vegetables and plants it's related to. Within the CABBAGE frame, there is knowledge about what CABBAGE has. Since CABBAGE is (probably) a basic-level category, there is a lot of knowledge in its frame. But some knowledge about CABBAGE is shared with other vegetables, and some knowledge about vegetables is shared with other food items. Also there are subtypes of CABBAGE such as RED-CABBAGE. Knowledge seems to be organized in an inheritance (is-a) hierarchy, one sort of ontology. Categories also have default properties that can be overridden by subcategories or instances.

Consider what we know about instances of GOING and GIVING. When something is given, there is a GIVER, a RECEIVER, and a GIVEN-OBJECT. Before the giving, the GIVER controls the OBJECT, and RECEIVER doesn't. After the giving, the RECEIVER controls the OBJECT, and the GIVER doesn't. The giving is consciously initiated by the GIVER, who wants the RECEIVER to control the OBJECT.
(forall (?g ?r ?o ?t0)
  (if (give ?g ?r ?o ?t0)
      (and (exists (?t1)
             (and (before ?t1 ?t0)
                  (control ?g ?o ?t1)
                  (not (control ?r ?o ?t1))
                  (exists (?t2)
                    (and (before ?t1 ?t2)
                         (goal ?g (control ?r ?o ?t2) ?t1)))))
           (exists (?t3)
             (and (before ?t0 ?t3)
                  (control ?r ?o ?t3)
                  (not (control ?g ?o ?t3)))))))

Though we don't have an English verb for it, there is a more abstract event category that includes GIVE, STEAL, TAKE, and RECEIVE. And there are subtypes of GIVE such as DONATE and LEND.

Using frames
• How is knowledge within a frame instantiated when an instance of the category is created?
• How can we efficiently access inherited knowledge for an instance of a category?
• How can we answer questions about properties using an inheritance hierarchy?

Frame representation
(PHYS-OBJ (is-a THING) (color) (weight) (shape) (edibility) ...)

(FOOD-ITEM (is-a PHYS-OBJ)
  (nutritional-value) (fat-content) (starch-content) (vitamin-content)
  (source) (taste)
  (preparation (processing) (cooking) (accompanying-ingredients) (serving))
  (availability-form)
  (edibility YES)
  (english-lex (neutral "FOOD") (informal "GRUB"))
  (japanese-lex "TABEMONO")
  ...)

(VEGETABLE (is-a FOOD-ITEM)
  (plant-part) (plant-type)
  (source PLANT)
  (nutritional-value HIGH) (fat-content LOW) (vitamin-content HIGH)
  (english-lex (neutral "VEGETABLE") (informal "VEGGIE"))
  ...)

(LEAF-VEGETABLE (is-a VEGETABLE) (plant-part LEAF) (color GREEN) (taste BITTER) ...)

(COLE-VEGETABLE (is-a VEGETABLE) ...)

(CABBAGE (is-a LEAF-VEGETABLE) (is-a COLE-VEGETABLE)
  (plant-type CABBAGE-PLANT) (taste CABBAGE-TASTE)
  (availability-form VEGETABLE-HEAD) (shape SPHERICAL) ...)

(RED-CABBAGE (is-a CABBAGE) (color PURPLE) (english-lex "RED CABBAGE") ...)

(RED-CABBAGE23 (is-a RED-CABBAGE)
  (preparation (accompanying-ingredients {MAYONNAISE, MUSTARD, TARRAGON})
               (cooking NIL)))

(ABSTRACT-TRANSFER
  (source ?s) (destination ?d) (object ?o)
  (precondition1 (is-a CONTROL) (controller ?s) (object ?o))
  (effect1 (is-a CONTROL) (controller ?d) (object ?o)))

(GIVE (is-a ABSTRACT-TRANSFER)
  (source ?s) (destination ?d) (object ?o)
  (effect1 ?e)
  (precondition2 (is-a WANT) (wanter ?s) (wanted ?e)))

(GIVE23 (is-a GIVE) (source AL) (destination SUE) (object (is-a DVD)))
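One way to answer questions about properties using an inheritance hierarchy is to search upward along is-a links until some frame supplies a filler for the slot. The Python sketch below assumes a dictionary encoding of a few of the frames above and a breadth-first search in which the nearest filler wins; both the encoding and the function name `inherit` as used here are illustrative choices, not the course's implementation.

```python
from collections import deque

# A hypothetical dictionary encoding of part of the frame hierarchy above.
FRAMES = {
    "PHYS-OBJ": {"is-a": ["THING"]},
    "FOOD-ITEM": {"is-a": ["PHYS-OBJ"], "edibility": "YES"},
    "VEGETABLE": {"is-a": ["FOOD-ITEM"], "source": "PLANT",
                  "nutritional-value": "HIGH", "fat-content": "LOW"},
    "LEAF-VEGETABLE": {"is-a": ["VEGETABLE"], "plant-part": "LEAF",
                       "color": "GREEN", "taste": "BITTER"},
    "COLE-VEGETABLE": {"is-a": ["VEGETABLE"]},
    "CABBAGE": {"is-a": ["LEAF-VEGETABLE", "COLE-VEGETABLE"],
                "taste": "CABBAGE-TASTE", "shape": "SPHERICAL"},
    "RED-CABBAGE": {"is-a": ["CABBAGE"], "color": "PURPLE"},
    "RED-CABBAGE23": {"is-a": ["RED-CABBAGE"]},
}


def inherit(frame, slot, frames=FRAMES):
    """Return the nearest filler for slot, searching up the is-a links
    breadth-first so that more specific frames override defaults."""
    queue = deque([frame])
    seen = set()
    while queue:
        name = queue.popleft()
        if name in seen or name not in frames:
            continue
        seen.add(name)
        slots = frames[name]
        if slot in slots:
            return slots[slot]
        queue.extend(slots.get("is-a", []))
    return None  # no frame in the hierarchy fills this slot
```

With this encoding, RED-CABBAGE23 gets its color from RED-CABBAGE (overriding the GREEN default of LEAF-VEGETABLE), its taste from CABBAGE, and its edibility from FOOD-ITEM.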

A more elaborate style of frame representation from the CHEF program (Hammond, 1989)
(defmop i-m-beef-and-green-beans (m-recipe)
  (meat i-m-beef)
  (vege i-m-green-beans)
  (style i-m-stir-fried)
  (steps m-recipe-steps
    (bone-steps i-m-empty-group)
    (chop-steps m-step-group
      (1 m-chop-step (object i-m-beef)))
    (let-stand-steps m-step-group
      (1 m-let-stand-step (object m-ingred-group (1 i-m-beef) (2 i-m-spices))))
    (stir-fry-steps m-step-group
      (1 m-stir-fry-step (object m-ingred-group (1 i-m-beef) (2 i-m-spices)))
      (2 m-stir-fry-step (object m-ingred-group (1 i-m-beef) (2 i-m-green-beans) (3 i-m-spices))))
    (serve-steps m-step-group
      (1 m-serve-step (object m-ingred-group (1 i-m-beef) (2 i-m-green-beans) (3 i-m-spices))))))

(define instantiate (lambda (frame bindings) ...))
(define inherit (lambda (instance role) ...))

Semantic networks
Knowledge in the form of a graph. A single node corresponds to a frame symbol.
Two "styles":
1. Labeled links represent many relations and roles.
2. All relations and roles are represented by nodes; there are only a small number of very general link types.

Examples in a particular formalism of type 2 (NETL: Fahlman, 1979)

Types, roles and value, is-a

Individuals, sets
Insects have six legs. Each leg is jointed. Ladybugs are insects. Sam is a ladybug. Sam's right rear leg is broken.


An example of a type 1 formalism

Inheritance in a semantic network

Distributed connectionist representation
A fixed network of nodes (units) that can be active or not (or active to different degrees). The unlabeled, weighted, modifiable links (connections) between the units represent the tendency for pairs of units to be co-activated. Any representation is a pattern of activation across the network. A representation is distributed if each element is involved in the representation of multiple concepts and each concept is represented by multiple elements. Units may represent primitive semantic features or (for example, in holographic representations) may not be directly interpretable.

Instantiation, inheritance
Instantiation does not involve the creation of any new hardware (because the size of the network is fixed) but rather consists in the activation of certain units and the strengthening or weakening of some weights. There is no distinction in the network between individuals and types. Different types are not represented by separate units in the network. Abstraction/generality corresponds to uncertainty about the value of units.

Inheritance is in a sense automatic. When we activate a type, for example, on the basis of an input word such as cabbage, the pattern represents features of the instance as well as the type.

Representing commonsense knowledge

• Acts (Schank, etc.)
• Semantic roles

Scripts, plans, goals
Knowledge of stereotypical situations helps in understanding language (Schank, etc.).
  Mary wanted that camera really bad, so she went and bought a gun.
  Phil went into a restaurant and sat down at a table. A waiter came over after a few minutes, but he said he wasn't ready to order.

Big Projects (CYC, etc.)

The Grounding Problem
• Will the knowledge be in a usable form if it's not tied to perception and action?
• Is it possible to build in commonsense knowledge, or will it have to be learned?

Machine learning: overview

• Internal changes in biological systems recorded in memory of one kind or another. Change may be more or less permanent, resulting in different (usually improved) behavior following the change.
• Kinds of change and kinds of memory:
  o Evolutionary change; genetic memory, genotype
  o Development; phenotype
  o Learning; long-term memory
  o Cultural change; cultural memory
  o Processing (temporary change); short-term (working) memory
• Why learn (and develop), rather than evolve? (Miller & Todd)
  o Learning (development) allows an organism to build a more complex phenotype than it could otherwise, given a genotype of a certain size. Environmental regularities can do much of the work of wiring up adaptive behavior-generators.
  o Learning allows an organism to make use of the past as well as the here-and-now. This sort of learning consists in the creation of episodic memories and their retrieval later on.
  o Learning allows an organism to adjust its behavior faster than natural selection would allow. This is advantageous because there may be changes in
    ▪ the organism's body
    ▪ the organism's family
    ▪ the organism's environment
    This function dominates thinking about learning but may be less important than the other two for most animals.

Some basic concepts

• Availability of feedback
  o Available
    ▪ Supervised learning: there is access to the correct output
    ▪ Reinforcement learning: there is access to the goodness of the actual output
    ▪ The credit assignment problem in reinforcement (and sometimes supervised) learning: when the behavior of the system is wrong, what aspect of the system's internals led to the error?
  o Unavailable: unsupervised learning, no information about the correctness of outputs
    Example: A speech system is to be trained to recognize English words. During an initial phase, the system is simply exposed to samples of English speech without being told what the content of the speech is. The hope is that the system will pick up on some of the systematic phonetic properties that characterize English.
• Prior knowledge
  o Learning from scratch
  o Building on prior knowledge
  o Martin's law: You cannot learn anything unless you almost know it already.
• What is learned
  o Stimulus-response behavior
  o Concepts
  o Regularities in the environment: cooccurrences, clustering, prediction
  o Utility information concerning possible states of the world
  o Results of possible actions
  o Ways of organizing knowledge internally to maximize performance

(supervised or reinforcement)

Learning the representation of a function f

• Given a collection of examples of f, find a function h (the hypothesis) which approximates f
• Positive and negative examples of the function
• Generalization of the current hypothesis through positive examples
• Specialization of the current hypothesis through negative examples; value of near misses
  Example of a bad negative example: A robot is being trained to recognize a soda can. It is shown a can from two different angles. Then, in order that it doesn't produce too general a concept, it is shown a chair and told that that is not an example of a soda can. The robot does not seem to improve.
• In general, induction is not sound: a hypothesis is not usually a logical conclusion of the data; numerous hypotheses may be consistent with the data
• Incremental learning: the hypothesis is updated whenever an example arrives
  Example: A robot is being trained using reinforcement learning to find all of the empty soda cans in the lab and throw them in the recycling bin. This proves too difficult, so the robot is first trained only to recognize soda cans. Then it is trained to approach soda cans. ...

Introduction to neural networks


  o Units: simple processing elements that respond to the behavior of other units via input connections, produce an activation, and send it along output connections to other units. The activations of all of the units in the network represent the system's short-term memory.
  o Connections: weighted (unlabeled) links between units, multiplying the activation from the source unit on its way to the destination unit. The weights along all of the connections in the network represent the system's long-term memory.
• Formalization
  o State
    ▪ Vector of activations $x(t)$
    ▪ Matrix of weights $W$
  o Set of input vectors $I(t)$, possibly infinite
  o (Sometimes) an associated set of target vectors $T(t)$
  o Dynamics
    ▪ Discrete (difference equations) or continuous (differential equations)
    ▪ Activation: $x(t+1) = g(h(x(t), W(t), I(t)))$, where $g$ is the activation function and $h$ the input function
    ▪ Weight: $W(t+1) = f(x(t), W(t), I(t), T(t))$, where $f$ is the learning rule
• Some differences between models
  o What sort of connectivity (reflected in where there are gaps in the weight matrix)?
  o Are there targets (a supervised network)?
  o Is this a feedforward network, a partially recurrent network, or a completely recurrent (settling, attractor, constraint satisfaction) network?
  o Is the activation function threshold or continuous?
  o Does the network handle sequences of inputs or just static patterns?
  o Are there separate input and output units?
  o Does the network have hidden units?
• Running a network
  o Update units
    ▪ For attractor (feedback) networks, update units (usually randomly selected) until the network has settled (no further changes in activations occur)
    ▪ For feedforward (and simple recurrent) networks, update each unit once in a (mostly) fixed sequence
    ▪ Update a unit
      Calculate the input ($h$) to the unit, the weighted sum of the activations of units feeding into the unit
      Calculate the activation ($x$) of the unit, a function ($g$) of the input
      Example: activation of a unit with a threshold activation function: if $h_i > \theta_i$, $g(h_i) = 1$, else $g(h_i) = 0$
  o Update connections
    ▪ Following the presentation of a single training pattern or the whole set of training patterns
    ▪ Weight changes are usually small changes in a given direction, determined by a learning rate ($\eta$)
    ▪ The direction and magnitude of the change are usually proportional to the activation of the source unit and either the activation of the destination unit or some error measure.

Supervised learning in neural networks

• Patterns
  o Training set: pairs of inputs and targets for training the network
  o Test set: pairs of inputs and targets for testing the network for generalization
• Training
  o Training phase: weights are adjusted in response to the training set
  o Test phase: weights are not adjusted as the test set is presented

Feedforward networks
• Appropriate for problems with no interacting bottom-up and top-down effects; pattern association
• Usually trained with error-driven learning, a form of supervised learning in which the change in weights depends on the error, ultimately the difference between the target and the actual output for each output unit
• Networks with no hidden units, for example, perceptrons
• Networks with hidden layers, usually trained with backpropagation


Pattern association problems
• Given a representation of one kind of entity (the input), generate a representation of another (the output).
• Both representations often take the form of patterns, that is, vectors of numbers. In this case, the inputs and outputs are represented in a distributed way.
• Training (supervised): expose the learner to a number of input patterns and their correct associated output patterns.
• Generalization: given an unfamiliar input pattern, respond with an appropriate output pattern.
• Examples
  o Perceptual input, category output (pattern classification)
  o State (perceptual) input, action Q-value output
  o Word input, meaning output
  o Meaning input, word output
  o English input, Spanish output
• Implementation in feedforward neural networks

Feedforward networks and pattern association

In a feedforward neural network, the units can be thought of as arranged in separate groups, or layers. The connections joining units in two layers all have the same direction. Some of the units are designated input units. These are clamped to particular activations when the network is presented with an input pattern. The activation of a clamped unit does not change. Some of the units are designated output units; their activations represent the network's response to the current input, the pattern that the network "believes" should be associated with the input pattern. In general, each output unit repeatedly updates its activation while the network is "running"; in a feedforward network, each output unit updates its activation only once in response to each input pattern.

There may also be one or more layers of hidden units between the input and output layers.

Perceptrons: how they work
• The simplest architecture for supervised pattern classification
• Architecture
  o Input units + bias (threshold) unit
  o Binary output unit; each output unit a separate perceptron
• Input and activation rules

  $$y = \delta\left(\sum_{j=1}^{N} w_j x_j + b\right)$$

  $$\delta(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases}$$

  That is, the activation function is a simple linear threshold function.
• Learning
  o For each input pattern $p$, there are three cases:
    ▪ The pattern is classified correctly. In this case, no changes are made to the weights.
    ▪ The target is 1, but the network yielded 0. In this case, we need to change each weight so that the output will be higher. We can achieve this by adding the input vector (or a fraction of it, given by the learning rate $\eta$) to the weight vector (the superscript represents the particular pattern): $\Delta w = \eta x^p$
    ▪ The target is 0, but the network yielded 1. In this case, we need to change each weight so that the output will be lower. We can achieve this by subtracting the input vector (or a fraction of it) from the weight vector: $\Delta w = -\eta x^p$
  o The three cases can all be expressed with this general rule: $\Delta w = \eta x^p (t^p - y^p)$
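The perceptron learning rule can be sketched in a few lines of Python. The sketch assumes zero initial weights, treats the bias as a weight from a unit clamped to 1, and stops when a full pass through the patterns produces no errors; the function names and the epoch limit are illustrative.

```python
def step(x):
    """Linear threshold activation: 1 if x > 0, else 0."""
    return 1 if x > 0 else 0


def predict(w, b, x):
    return step(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_perceptron(patterns, eta=1.0, max_epochs=100):
    """patterns is a list of (input tuple, target) pairs.
    Returns (weights, bias, converged)."""
    n = len(patterns[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, t in patterns:
            y = predict(w, b, x)
            if y != t:
                errors += 1
                # General rule: delta w = eta * x * (t - y)
                w = [wj + eta * (t - y) * xj for wj, xj in zip(w, x)]
                b += eta * (t - y)   # bias input is always 1
        if errors == 0:
            return w, b, True
    return w, b, False


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
```

Training on `AND` converges after a handful of epochs, while training on `XOR` never reaches an error-free pass, which previews the limits of perceptrons.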

The Perceptron Convergence Theorem (Rosenblatt)





• If there is a set of weights that solves the problem, then there is a weight vector w* that never yields a sum lying in a region around 0 of width 2ε. That is, for inputs that are supposed to yield a positive output, w* yields an output greater than ε, and for inputs that are supposed to give a negative output, w* yields an output less than -ε.
• The theorem proves that the angle between the current weight vector and w* is bounded for each training pattern by an envelope that decreases with each presentation of the pattern.
• The theorem does not guarantee that the angle between the current and final weight vectors will decrease monotonically, only that the envelope within which this angle is found decreases with the number of updates. That is, the error may sometimes rise for a given run through the training patterns.
• The theorem only guarantees convergence if there is a set of weights that solves the problem.

Perceptrons: what they can and can't do

• One input
  o The input patterns fall along a line; all points on one side of a given value are in, the others outside, the category. The trainable bias establishes the threshold, and the sign of the single weight establishes the direction of category membership on either side of the threshold.
  o If the points in the category are broken up by points not in the category, there is no way for the network to solve the problem.

• Two inputs
  o The two weights and the bias define a line, $w_1 x_1 + w_2 x_2 + b = 0$, with slope $-w_1/w_2$ and y-intercept $-b/w_2$. Points on one side of the line turn on the output unit; points on the other side turn it off. Because the perceptron defines an inequality (two possibilities for each line), three values are required rather than the two required to define a line.
  o The line defined by the weights and bias divides the input space into two regions. A perceptron in 2-space can only learn to separate two sets of points that are on either side of a line.

• In general, a perceptron can only solve a pattern classification problem if the sets of patterns are linearly separable; that is, if in N-space, there is a hyperplane of N-1 dimensions which separates the two sets of points. N+1 values (N weights and the trainable bias) are needed to specify the desired behavior because, in addition to the hyperplane, we need to say on which side of the hyperplane points are in the category.
• Problems which perceptrons can't solve
  o Examples of non-linearly separable sets of patterns
    ▪ Exclusive OR: (0, 0), (1, 1); (1, 0), (0, 1)
    ▪ Connectivity
  o Solving the problems with additional input dimensions, for example, for exclusive OR, an input unit that codes for whether the other two inputs are the same
  o Solving the problems with hidden units
    ▪ Networks with hidden units with linear activation functions are equivalent to networks without hidden units
    ▪ Hidden units must have non-linear activation functions, for example, a simple threshold function (like perceptron output units) or sigmoidal or gaussian functions.
    ▪ One possibility: provide "enough" hidden units connected by random, hard-wired weights to the input units.
    ▪ Another possibility: train the input-to-hidden weights. But how?

The delta rule and backpropagation
Activation functions
• The simplest activation function is the identity function: the activation is just the input.
• Another possibility is a threshold function, which converts inputs above some threshold to a maximum value (usually 1.0) and inputs below the threshold to a minimum value (usually -1.0 or 0.0).
• A third possibility is a "soft" threshold function, which smooths out the region near the threshold. The following function, the sigmoid, has a minimum of 0.0 and a maximum of 1.0 (note that these are never reached):

  $$g(h) = \frac{1}{1 + e^{-h}}$$

  ($h$ is the input to the unit.)

The delta rule
• Supervised learning (pattern association) in feedforward networks with multiple output units and continuous activation functions
• The delta rule (least mean squares rule) for supervised learning (when the activation function of the output unit is the identity function)

  $$\Delta w_{ji} = \eta (t_j - x_j) x_i$$

  (The $x$s represent activations, $t$ represents a target, and $\eta$ is a learning rate between 0.0 and 1.0.)
• A formal way to derive the delta rule for the more general case
  o Gradient descent learning: learning by moving in the direction which looks locally to be the best
  o For supervised neural network learning, the best direction to move in "weight space": for each weight, how the global error changes with respect to that weight
  o A global error function: for each pattern, the sum of the errors over all of the output units

    $$E = \sum_j \tfrac{1}{2} (t_j - x_j)^2$$

  o We want to move in "weight space" in a direction which is opposite that of the slope of the error function with respect to each weight because this will move us toward a region with a lower error. The size of the move should be proportional to the magnitude of the slope.

  1. To find the slope, we take the partial derivative of the error with respect to the weight. But the only element in the sum of error terms that depends on the weight is the one for the output unit where that weight ends ($j$ in what follows).
     $$\frac{\partial E}{\partial w_{ji}} = \frac{\partial \left[\tfrac{1}{2} (t_j - x_j)^2\right]}{\partial w_{ji}}$$
  2. Using the chain rule, we can decompose this derivative into two that are easier to calculate:
     $$\frac{\partial \left[\tfrac{1}{2} (t_j - x_j)^2\right]}{\partial x_j} \cdot \frac{\partial x_j}{\partial w_{ji}}$$
  3. The first derivative is easy to figure; it's just $-(t_j - x_j)$.
  4. The second derivative can be decomposed using the chain rule again if we remember that the activation of unit $j$ is a function of the input to the unit, $h_j$, which is in turn a function of the weights into the unit.
     $$\frac{\partial x_j}{\partial w_{ji}} = \frac{\partial x_j}{\partial h_j} \cdot \frac{\partial h_j}{\partial w_{ji}}$$
  5. Since the activation of an output unit is the activation function $g$ applied to the input $h$, the first derivative on the right-hand side of (4) is just $g'(h_j)$, that is, the derivative of whatever the activation function is at the value of the current input to unit $j$.
  6. The second derivative on the right-hand side of (4) can be derived as follows:
     $$\frac{\partial h_j}{\partial w_{ji}} = \frac{\partial \left(\sum_k x_k w_{jk}\right)}{\partial w_{ji}} = x_i$$
     because none of the other weights or input activations depend on $w_{ji}$.
  7. Putting all of the parts together, we get
     $$\frac{\partial E}{\partial w_{ji}} = -(t_j - x_j)\, g'(h_j)\, x_i$$
  8. Remember that we want the weight change to be proportional to the negative of the derivative with respect to the weight. So with a learning rate to control the step size for weight changes, we get the more general delta (least mean squares) learning rule
     $$\Delta w_{ji} = \eta (t_j - x_j)\, g'(h_j)\, x_i$$
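As a concrete illustration of the delta rule with an identity activation function (so that g′(h) = 1), the Python sketch below trains a single output unit online on a small linear problem. The training patterns, the learning rate, and the number of epochs are arbitrary choices made for the example.

```python
def train_delta(patterns, eta=0.1, epochs=500):
    """Online delta-rule (LMS) learning for one linear output unit.
    patterns is a list of (input tuple, target) pairs."""
    n = len(patterns[0][0])
    w = [0.0] * n
    for _ in range(epochs):
        for x, t in patterns:
            y = sum(wi * xi for wi, xi in zip(w, x))   # identity activation
            # delta w_i = eta * (t - y) * x_i
            w = [wi + eta * (t - y) * xi for wi, xi in zip(w, x)]
    return w


# Targets generated by t = 2*x1 - x2, which the unit can represent exactly.
LINEAR = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0)]
```

Because the targets are an exact linear function of the inputs, the weights approach (2, -1); with targets the unit cannot represent exactly, the rule instead settles near the least-squares solution.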

• Problems that are not linearly separable can't be solved by a perceptron (or a network learning with the delta rule).
• How hidden units can solve non-linearly separable problems.

• Backpropagation: a gradient descent algorithm for learning the weights into hidden units as well as output units
• A network with hidden units with linear activation functions is equivalent to a network with no hidden units, so at least the hidden units must have non-linear activation functions, and these must be differentiable for backpropagation to apply; the sigmoid function is the usual choice.
• The learning rule: $\Delta w_{ji} = \eta\, \delta_j\, x_i$
  o For output units: $\delta_j = (t_j - x_j)\, g'(h_j)$
  o For hidden units ($k$ indexes the units in the next higher layer): $\delta_j = \left[\sum_k \delta_k w_{kj}\right] g'(h_j)$
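The backpropagation rules can be sketched for a network with one hidden layer of sigmoid units. The layout below is an assumption made for illustration: each row of a weight matrix holds the weights into one unit, with the last entry acting as a bias weight from a unit clamped to 1. For the sigmoid, g′(h) = x(1 - x), where x is the unit's activation.

```python
import math


def sigmoid(h):
    return 1.0 / (1.0 + math.exp(-h))


def forward(x, W_hid, W_out):
    """Return (hidden activations, output activations) for input list x."""
    hid = [sigmoid(sum(w * a for w, a in zip(row, x + [1.0]))) for row in W_hid]
    out = [sigmoid(sum(w * a for w, a in zip(row, hid + [1.0]))) for row in W_out]
    return hid, out


def deltas(x, t, W_hid, W_out):
    """Compute the delta terms for output and hidden units."""
    hid, out = forward(x, W_hid, W_out)
    # Output units: delta_j = (t_j - x_j) g'(h_j)
    d_out = [(tj - xj) * xj * (1.0 - xj) for tj, xj in zip(t, out)]
    # Hidden units: delta_j = [sum_k delta_k w_kj] g'(h_j)
    d_hid = [sum(d_out[k] * W_out[k][j] for k in range(len(W_out)))
             * hid[j] * (1.0 - hid[j]) for j in range(len(hid))]
    return hid, d_hid, d_out


def train_step(x, t, W_hid, W_out, eta=0.5):
    """Apply delta w_ji = eta * delta_j * x_i to every weight, in place."""
    hid, d_hid, d_out = deltas(x, t, W_hid, W_out)
    for j, row in enumerate(W_out):
        for i, a in enumerate(hid + [1.0]):
            row[i] += eta * d_out[j] * a
    for j, row in enumerate(W_hid):
        for i, a in enumerate(x + [1.0]):
            row[i] += eta * d_hid[j] * a


def error(x, t, W_hid, W_out):
    """E = sum_j 1/2 (t_j - x_j)^2 for one pattern."""
    _, out = forward(x, W_hid, W_out)
    return sum(0.5 * (tj - xj) ** 2 for tj, xj in zip(t, out))
```

A useful sanity check on code like this is to compare the value -δ_j x_i with a finite-difference estimate of ∂E/∂w_ji for a few weights; the two should agree closely.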

A famous example: NETTALK, the text-to-speech problem

• Some questions and concerns
  o Does BP get stuck in local minima?
  o Does it take forever to learn the weights?
    ▪ Faster as number of hidden units increases (assuming parallel update)
    ▪ Faster with higher learning rate, within limits
  o How does the network solve the problem? What sort of hidden-layer representations does it build? Using statistical techniques to analyze hidden-layer representations.
  o Does it generalize? Does the network behave appropriately on patterns which it has not been trained on?
    ▪ More local and more distributed (greater generalization) hidden-layer patterns
    ▪ Effect of too many trainable connections: overfitting, the network "memorizes" individual patterns rather than generalizing over them
• Optimization: setting the learning rate, other parameters
• Incremental training: learning a simpler task which enables the learning of a more complex task
• Multiple tasks in a single network
  o Catastrophic forgetting: does the network unlearn one set of patterns when trained on a second?
  o Does the network fail to learn two interfering tasks which it is trained on simultaneously? Example: the what-where vision problem
  o Modularity as a solution to problems of interference

Sequential problems and simple recurrent networks

• Sequence processing: inputs consist of sequences of patterns; the network's output depends on previous patterns as well as the current one
  o Prediction: given a partial sequence, predict the next element (pattern)
  o Sequence classification
  o Parsing
• Sequence processing requires some form of short-term memory (in addition to unit activations)
• Simple recurrent network (Elman net): recurrent connections on the hidden layer with a time delay of one sequence event, usually implemented with a context layer that maintains a copy of the hidden-unit activations on the last time step
• Training an SRN on prediction
  o The input and output layers represent a single sequence event.
  o During training, sequences of inputs are presented repeatedly.
  o On a single training trial, an event is presented to the input layer, and the network is run in the usual fashion, with the context layer treated as another input layer.
  o The target is the next event in the sequence. Error is back-propagated, and weights are updated using the backpropagation rule, with context-to-hidden weights treated exactly as input-to-hidden weights.
  o Finally the activations on the hidden layer are copied to the context layer.

Unsupervised learning
Auto-association and content-addressable memories

• Auto-association: one form of unsupervised learning
  o Patterns are associated with themselves
  o Purposes: dimensionality reduction, data compression, pattern completion
  o Implementation: Hopfield nets (pattern completion only), other constraint satisfaction (settling) networks with hidden layers, feedforward nets (can be trained with backpropagation)
• Content-addressable memories
  o Desired behavior:
    ▪ When part of a familiar pattern enters the memory system, the system fills in the missing parts (recall).
    ▪ When a familiar pattern enters the memory system, the response is a stronger version of the input (recognition).
    ▪ When an unfamiliar pattern enters the memory system, it is dampened (unfamiliarity).
    ▪ When a pattern similar to a stored pattern enters the memory system, the response is a version of the input distorted toward the stored pattern (assimilation).
    ▪ When a number of similar patterns have been stored, the system will respond to the central tendency of the stored patterns, even if the central tendency itself never appeared (prototype effects).
• Discrete Hopfield networks
  o Basic properties
    ▪ CAM
    ▪ Potentially completely recurrent
    ▪ Symmetric weights
    ▪ Activation rule ($\theta$ a threshold; sgn() is 1 if its argument is positive, -1 otherwise):

      $$x_i(t + 1) = \mathrm{sgn}\left(\sum_j w_{ij} x_j(t) - \theta_i\right)$$

    ▪ Settling: asynchronous, random update
    ▪ Training: single presentation of each pattern
    ▪ Each training pattern should yield a (fixed-point) attractor.
  o Stability
    ▪ Lyapunov stability: if there is a function of the network state which decreases or stays the same as the network is updated, then the network is asymptotically stable.
    ▪ Energy of network (a Lyapunov function):

      $$E = -\tfrac{1}{2} \sum_i \sum_j w_{ij} x_i x_j$$

      For symmetric weights, this can be rewritten as

      $$E = C - \sum_{(ij)} w_{ij} x_i x_j$$

      where $(ij)$ refers to distinct pairs of indices, and $C$ is a constant.
    ▪ Activation rule minimizes energy. Assuming no thresholds, for a given updated unit $i$, either its activation is unchanged, in which case the energy is unchanged, or it is negated, in which case $x_i$ and $\sum_j w_{ij} x_j$ have opposite signs, and $x'_i = -x_i$, where $x'_i$ is the activation of unit $i$ following the update. Then the difference between the energy after and before the update of unit $i$ is

      $$E' - E = -\sum_{j \neq i} w_{ij} x'_i x_j + \sum_{j \neq i} w_{ij} x_i x_j = 2 \sum_{j \neq i} w_{ij} x_i x_j = 2 x_i \sum_{j \neq i} w_{ij} x_j = 2 x_i \sum_j w_{ij} x_j - 2 w_{ii}$$

      The first of these terms is negative and the second is not positive (assuming $w_{ii} \geq 0$), so, for asynchronous updates, the energy always either remains the same or decreases.

Learning  Storing Q memories in a Hopfield net:

Hebbian learning: weight on the connection joining two units is proportional to the correlation between their activations. For one pattern p, we get stability if, for all i :

The expression in parentheses (the input to unit i) is

If the magnitude of the second term, the crosstalk term, is less than N, then pattern p is stable.

Capacity of a network  Crosstalk between patterns  Number of random patterns storable is proportional to N if small percentage of errors tolerated, but it is quite small.
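A small discrete Hopfield network can be sketched as follows. The sketch assumes the Hebbian prescription in which each weight is the sum over stored patterns of the product of the two units' activations (self-connections kept), no thresholds, and a fixed update order in place of random selection so that the example is reproducible. The two stored patterns are arbitrary orthogonal vectors chosen for illustration.

```python
def sgn(v):
    """1 if the argument is positive, -1 otherwise."""
    return 1 if v > 0 else -1


def store(patterns):
    """Hebbian weights: w_ij = sum over patterns of x_i * x_j (assumed form)."""
    n = len(patterns[0])
    return [[sum(p[i] * p[j] for p in patterns) for j in range(n)]
            for i in range(n)]


def settle(W, state, max_sweeps=10):
    """Asynchronously update units until no activation changes."""
    x = list(state)
    n = len(x)
    for _ in range(max_sweeps):
        changed = False
        for i in range(n):   # fixed order here; usually random
            new = sgn(sum(W[i][j] * x[j] for j in range(n)))
            if new != x[i]:
                x[i] = new
                changed = True
        if not changed:
            break
    return x


def energy(W, x):
    """E = -1/2 sum_i sum_j w_ij x_i x_j"""
    n = len(x)
    return -0.5 * sum(W[i][j] * x[i] * x[j] for i in range(n) for j in range(n))


P1 = [1, 1, 1, 1, -1, -1, -1, -1]
P2 = [1, -1, 1, -1, 1, -1, 1, -1]
W = store([P1, P2])
probe = [-1] + P1[1:]          # P1 with its first element flipped
recalled = settle(W, probe)
```

Here the corrupted probe settles back to the stored pattern, and the energy of the settled state is lower than that of the probe, as the stability argument predicts.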

Competitive learning

• What it is
  o Winner-take-all output units compete to classify input patterns, only one (roughly) coming on at a time
  o Clustering unlabelled data
  o Categorization, vector quantization
• Simple, single-layer competitive learning
  o Binary output units fully connected to (usually binary) input units by non-negative weights
  o Only one output unit fires at a time, the one whose input weight vector is closest to the input vector ($i^*$ is the winning unit):

    $$|w_{i^*} - x| \leq |w_i - x| \quad \text{(for all } i\text{)}$$

    For normalized weights, the winner is always the one with the highest input (dot product of input pattern and weight vector):

    $$w_{i^*} \cdot x \geq w_i \cdot x \quad \text{(for all } i\text{)}$$

  o The winner-take-all process can be implemented by simply picking the unit with the highest activation or through lateral inhibitory connections.
  o Learning
    ▪ Weights initially random
    ▪ For each input pattern, update the weights into the winning unit only
    ▪ The standard rule moves the winning weight vector directly towards the input pattern:

      $$\Delta w_{i^*j} = \eta (x_j - w_{i^*j})$$

    ▪ Because losers are not activated, the rule is equivalent to ($y_i$ is the activation of the $i$th category unit)

      $$\Delta w_{ij} = \eta\, y_i (x_j - w_{ij})$$
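Simple competitive learning can be sketched as below, assuming the standard rule in which only the winner's weight vector moves a fraction η of the way toward the input. The two clusters of inputs and the initial weights are made up for the example; the initial weights are set by hand rather than randomly so that the run is reproducible and neither unit starts out "dead".

```python
def closest(weights, x):
    """Index of the unit whose weight vector is nearest to x."""
    return min(range(len(weights)),
               key=lambda i: sum((w - a) ** 2 for w, a in zip(weights[i], x)))


def competitive_step(weights, x, eta):
    """Move only the winner toward x: delta w = eta * (x - w)."""
    i_star = closest(weights, x)
    weights[i_star] = [w + eta * (a - w) for w, a in zip(weights[i_star], x)]
    return i_star


# Two small clusters of input patterns (illustrative data).
INPUTS = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]

weights = [[0.6, 0.4], [0.4, 0.6]]
for _ in range(20):
    for x in INPUTS:
        competitive_step(weights, x, eta=0.5)
```

After training, each unit's weight vector sits near one of the two clusters, so the units act as category detectors for unlabelled data.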

Geometric analogy

• Problem of "dead units": units which start out far away from input patterns and never win
• Feature maps
  o Networks in which location of output unit conveys information
  o Output units have fixed positions in one-, two-, or three-dimensional grids
  o Topology-preserving map from the space of possible inputs to the line, plane, or cube of the output units
    ▪ A mapping that preserves neighborhood relations.
    ▪ As two input patterns get closer in input space, the winning output units get closer in output space.
  o Self-organizing feature maps (Kohonen nets), one type of feature map architecture
    ▪ The neighborhood relations in the output array are built into the learning rule. Weights into many (sometimes all) units are changed on each update (there may be no dead units).
    ▪ Winning output unit i*:
      |wi* - x| ≤ |wi - x| (for all i)
    ▪ Learning rule:
      Δwij = η Λ(i, i*) (xj - wij)
      Λ(i, i*) = 1, for i = i*
      The neighborhood function falls off with the distance |ri - ri*| in the output array, where the r vectors are the coordinates of the units in the output space.
    ▪ Network as an elastic net in which the weight vector of the winner is dragged toward the input vector and the weight vectors of neighboring units are pulled along with it. Nearby units respond to nearby input patterns.
    ▪ Typical neighborhood function:
      Λ(i, i*) = exp(-|ri - ri*|^2 / (2σ^2))
    ▪ Both σ and η start large and are decreased during training
    ▪ Result is sensitive to probability of inputs as well as their location in input space: more output units are associated with regions of higher probability
    ▪ 1-to-1, 2-to-1, 2-to-2 mappings
    ▪ Convergence
      Usually in two stages: (1) untangling, (2) detailed adapting
      Kinds of tangles: twists (2 dimensions), kinks (1 dimension)
    ▪ Example applications
      Robot joint angles, rather than actual positions, as input
      Phoneme similarity
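As a rough sketch of one self-organizing-map update, the code below applies the learning rule Δwij = η Λ(i, i*) (xj - wij) to every unit. The Gaussian form of Λ, the function name `som_step`, and the three-unit, one-dimensional grid in the usage lines are assumptions chosen for illustration.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def som_step(weights, positions, x, eta, sigma):
    """One Kohonen update; positions[i] is unit i's fixed grid coordinate r_i."""
    winner = min(range(len(weights)), key=lambda i: sq_dist(weights[i], x))
    for i, w in enumerate(weights):
        # Gaussian neighborhood: 1 for the winner, smaller for distant grid units.
        lam = math.exp(-sq_dist(positions[i], positions[winner]) / (2 * sigma ** 2))
        weights[i] = [wj + eta * lam * (xj - wj) for wj, xj in zip(w, x)]
    return winner

# Three units on a line; the input sits on unit 0's weight vector.
grid = [(0,), (1,), (2,)]
w = [[0.0], [0.5], [1.0]]
before = [v[0] for v in w]
win = som_step(w, grid, [0.0], eta=0.5, sigma=1.0)
moves = [b - v[0] for b, v in zip(before, w)]
```

In the usage lines the winner's grid neighbor moves further toward the input than the unit two grid steps away, which is the "elastic net" behavior. In practice both `eta` and `sigma` would be decreased over training, which this single step does not show.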

Reinforcement learning
Markov decision processes
• The agent and the environment (the world)
• Discrete time
• States: At each time step, the agent's sensory/perceptual system returns a state, xt, a representation of its current situation in the environment, which may be errorful and normally misses many aspects of the "real" situation.
• Actions: At each time step, the agent has the option of executing one of a finite set of possible actions, ut, each of which potentially puts it in a new state.
• Reinforcements: rewards and punishments. In response to the agent's action in a particular state, the world provides a reinforcement.
• The reinforcement function in the world: r(x,u)
• The next-state function in the world: s(x,u)
• An example

Simple reinforcement learning
• The goal: to learn a value for each state-action pair
• One possibility:
  Q(xt, ut) = r(xt, ut)
  But this bases too much on a single instance of reinforcement. We need to learn in smaller steps.
  Qnew(xt, ut) = (1 - η) Qold(xt, ut) + η r(xt, ut)
  where η is a learning rate between 0 and 1.
  But so far the algorithm can only learn in response to immediate reinforcement. What about delayed reinforcement?

Q learning
• The real value of an action in a state (optimal Q) depends not only on immediate reinforcement but also on reinforcements that can be received later as a result of the next state the agent gets to.
• The value (estimate Q) that an agent stores for each state-action pair should reflect how much reinforcement it will receive immediately and in the future if it takes that action in that state.
• Policy: a way of using the stored Q values to select actions.
• More precisely, an optimal Q value for a given state and action is the sum of all reinforcements received if that action is taken in the state, and then the agent follows the optimal policy specified by the other Q values.
• A first definition (the maximum is taken over the possible next actions ut+1):
  Qopt(xt, ut) = r(xt, ut) + max_{ut+1} [Qopt(xt+1, ut+1)]
• But this causes problems because there may be many, even an infinite number of, future reinforcements. We need to weight the future by a discount rate (γ) between 0 and 1.
  Qopt(xt, ut) = r(xt, ut) + γ max_{ut+1} [Qopt(xt+1, ut+1)]
• To approach optimal Q values, the learner starts with 0 or random values for each state-action pair, then updates the values gradually using the reinforcement received and what it thinks is the best Q value for the next state.
  Qnew(xt, ut) = (1 - η) Qold(xt, ut) + η {r(xt, ut) + γ max_{ut+1} [Qold(xt+1, ut+1)]}

An example
γ = 0.8, η = 0.5 and all Q values initialized at 0. In the chart, "new" means the reinforcement received plus the discounted maximum value of the next state. The "new" value is combined with the "old" using the learning rate to give the updated Q value appearing in the next line of the chart. (Note: in this example, in order to illustrate how the agent can learn to "look ahead", it is effectively picked up after it reaches the goal state and dropped back in state 1. There is no "natural" way of reaching state 1 from state 4.)

x   Q(1,r)  Q(2,l)  Q(2,r)  Q(3,l)  Q(3,r)  Q(4,l)  new  u
1   0       0       0       0       0       0       0    r
2   0       0       0       0       0       0       0    r
3   0       0       0       0       0       0       1    r
4   0       0       0       0       .5      0       0    l
1   0       0       0       0       .5      0       0    r
2   0       0       0       0       .5      0       .4   r
3   0       0       .2      0       .5      0       1    r
4   0       0       .2      0       .75     0       0    l
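The chart can be reproduced with a small tabular Q-learning sketch. Several details are assumptions inferred from the chart rather than stated in the notes: the set of actions available in each state, a reinforcement of 1 only for moving right from state 3, and modeling the "pick up and drop back in state 1" step as a transition from state 4 to state 1. The names `ACTIONS`, `q_update` and `episode` are hypothetical.

```python
GAMMA, ETA = 0.8, 0.5

# Actions available in each state (assumed from the chart's column headings).
ACTIONS = {1: ["r"], 2: ["l", "r"], 3: ["l", "r"], 4: ["l"]}

def q_update(q, x, u, reward, x_next):
    """Apply Q_new = (1 - eta) * Q_old + eta * (r + gamma * max Q_old(x_next, .))."""
    best_next = max(q[(x_next, v)] for v in ACTIONS[x_next])
    new = reward + GAMMA * best_next
    q[(x, u)] = (1 - ETA) * q[(x, u)] + ETA * new
    return new  # the "new" column of the chart

# All Q values start at 0.
q = {(x, u): 0.0 for x, acts in ACTIONS.items() for u in acts}

# (state, action, reinforcement, next state). The last step of each pass is
# the artificial move from the goal state back to state 1.
episode = [(1, "r", 0, 2), (2, "r", 0, 3), (3, "r", 1, 4), (4, "l", 0, 1)]
news = [q_update(q, *step) for _ in range(2) for step in episode]
```

After the two passes, `news` matches the "new" column of the chart, and the only nonzero entries are Q(3,r) = 0.75 and Q(2,r) = 0.2: the value of the rewarded action has started to propagate back to the state before it.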

Making decisions
o How is the agent to pick an action? One possibility is exploitation: to pick the action that has the highest Q-value for the current state.
o But the agent can only learn about the value of actions that it tries. Thus it should try a variety of actions. This may involve ignoring what it thinks is best some of the time: exploration. Exploration makes more sense early on in learning, when the agent doesn't know much.
o One possibility for selecting an action: pick the "best" action with probability P = 1 - e^(-E a), where a is the number of training samples (the "age" of the agent). Here is how the probability of selecting the "best" action depends on age when E is 0.1.

Here is how it depends on age when E is 0.01
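The age-dependent rule P = 1 - e^(-E a) can be sketched as follows. The notes do not say what the agent does when it does not pick the "best" action; choosing uniformly at random among all actions is an assumption made here, as are the names `p_exploit` and `choose`.

```python
import math
import random

def p_exploit(age, e=0.1):
    """Probability of picking the best action after `age` training samples."""
    return 1.0 - math.exp(-e * age)

def choose(q_values, age, e=0.1, rng=random):
    """q_values maps action -> Q. Exploit with probability p_exploit, else explore.

    Exploration here is a uniform random choice over all actions (an assumption).
    """
    if rng.random() < p_exploit(age, e):
        return max(q_values, key=q_values.get)
    return rng.choice(list(q_values))

# Compare the two settings of E mentioned in the notes at a few ages.
schedule = [(a, p_exploit(a, 0.1), p_exploit(a, 0.01)) for a in (0, 10, 50, 100)]
```

With E = 0.1 the agent is exploiting most of the time after a few dozen samples; with E = 0.01 it keeps exploring for roughly ten times as long.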


A smarter possibility would be to have the probability of picking an action depend on how high its value is relative to the values of all of the other possible actions. Here is one way:

where the vs represent all of the possible actions in state xt.
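The formula itself did not survive in these notes. One common rule with the stated property is Boltzmann (softmax) selection, sketched below; it is offered as an example of the idea, not as the formula the notes originally showed. The `temperature` parameter and the name `softmax_probs` are assumptions.

```python
import math

def softmax_probs(q_values, temperature=1.0):
    """Map action -> probability proportional to exp(Q / temperature)."""
    top = max(q_values.values())  # subtract the max for numerical stability
    exps = {u: math.exp((q - top) / temperature) for u, q in q_values.items()}
    total = sum(exps.values())
    return {u: e / total for u, e in exps.items()}

probs = softmax_probs({"l": 0.0, "r": 0.75})
```

Actions with higher Q values get higher probability, but no action's probability ever reaches zero, so some exploration always remains. A lower temperature makes the choice greedier.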

Implementing Q learning
• A lookup table: a Q value for each state-action pair. But in the real world the number of states may be very large, even infinite. Distributed representations of states permit
  o More efficient coding
  o Generalization to novel states
• A neural network
  o Inputs are distributed representations of states.
  o Outputs are Q values for each action (represented locally).
  o Weights represent associations of state features with actions.
  o Error-driven learning: for the selected action, the target is the "new" Q value from the Q learning rule.

Concept learning
Concepts (categories)
• Features
  o Sufficient features and decision trees
  o Necessary features
  o Typical features
  o Prohibited features
• Which features are relevant for concepts (of particular types)? Where does the set of features come from?
• One-shot learning
• Quine's problem

Positive and negative examples, generalization and specialization
• Evolving hypotheses
  o Current best guess
  o Current set consistent with examples
• Generalization
  o In response to a positive example that should not be an example, according to the current hypothesis: a false negative
  o Variabilization
  o Assigning a more general type to an element
  o Dropping features from a conjunction of features
  o Making a disjunction of features
• Specialization
  o In response to a negative example that should be an example, according to the current hypothesis: a false positive
  o The value of near misses
  o Adding features to a conjunction of features
  o Assigning a more specific type to an element
  o Prohibiting a feature
• Generalization (left) and specialization (right) illustrated

An example of current-best-guess learning (that fails): SAME-COLOR
o A positive example
  Object 1 is to the left of object 2.
  Object 1 is a square.
  Object 2 is a square.
  Object 1 is red.
  Object 2 is red.
o Another positive example
  Object 1 is to the left of object 2.
  Object 1 is a rectangle.
o A negative example
  Object 1 is to the left of object 2.
  Object 1 is a rectangle.
  Object 2 must not be yellow.
o The learner fails because of the inadequacy of the representation; learning relies in part on perception.

Version Space learning (Mitchell)
• Incremental learning vs. batch learning
• Maintaining the set of all hypotheses that are consistent with the set of positive and negative examples so far
• Version Space learning: incremental, least-commitment algorithm (makes no arbitrary choices)
• Version space: set of all hypotheses consistent with examples seen so far
• Version graph: directed acyclic graph in which nodes are elements of version space and there is an arc from node p to node q iff p is less general than q and there is no node r that is more general than p and less general than q.
• Positive examples lead to the elimination of hypotheses that are too specific. Negative examples lead to the elimination of hypotheses that are too general.
• Problem of size of version graph
• Solution: use of boundary sets, sets of hypotheses defining boundaries on which hypotheses are consistent with examples
  o Most general boundary set (G-set): every member consistent with examples, and there are no more general consistent hypotheses
  o Most specific boundary set (S-set): every member consistent with examples, and there are no more specific consistent hypotheses
• Updating the boundary sets, given a new example (the rules for G-set and S-set are symmetric)
  o G-set
    ▪ If the example is positive, exclude any hypotheses that do not cover the example.
    ▪ If the example is negative, find the most general set within or below the old G-set and above the S-set which fails to cover the example.
  o S-set
    ▪ If the example is positive, find the most specific set within or above the old S-set and below the G-set which covers the example.
    ▪ If the example is negative, exclude any hypotheses that cover the example.
• Update until
  o there is exactly one concept left in the version space or
  o the version space collapses (either the S-set or G-set becomes empty) or
  o there are no more examples.
• Example

Another detailed example, from a class by Julian Francis Miller when he was at the University of Birmingham
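To make the definitions of version space, S-set and G-set concrete, here is a brute-force sketch. It enumerates every hypothesis and filters, which is exactly what the boundary-set algorithm is designed to avoid, so it illustrates the definitions rather than the incremental update. The hypothesis language (one value or a "?" wildcard per attribute) and the two-attribute toy domain are assumptions.

```python
from itertools import product

def covers(h, example):
    """A hypothesis covers an example if every slot is '?' or matches."""
    return all(hv == "?" or hv == ev for hv, ev in zip(h, example))

def more_general(h1, h2):
    """True if h1 is at least as general as h2 (covers everything h2 covers)."""
    return all(a == "?" or a == b for a, b in zip(h1, h2))

def version_space(domains, examples):
    """All hypotheses consistent with the labeled examples (brute force)."""
    hypotheses = product(*[list(d) + ["?"] for d in domains])
    return [h for h in hypotheses
            if all(covers(h, x) == label for x, label in examples)]

def boundaries(vs):
    """S-set: no strictly more specific member; G-set: no strictly more general one."""
    s_set = [h for h in vs if not any(o != h and more_general(h, o) for o in vs)]
    g_set = [h for h in vs if not any(o != h and more_general(o, h) for o in vs)]
    return s_set, g_set

domains = [("red", "blue"), ("square", "circle")]
examples = [(("red", "square"), True), (("blue", "circle"), False)]
vs = version_space(domains, examples)
s_set, g_set = boundaries(vs)
```

For this toy input the version space has three members, with S-set `[("red", "square")]` and G-set `[("red", "?"), ("?", "square")]`. A further negative example such as a red circle would remove `("red", "?")` and shrink the G-set.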

ID3 (decision tree) learning (Quinlan)
• ID3: batch learning of concepts using decision trees
• A set of classified examples

      class  body covering  habitat  flies?  breathes air?
  1   m      hair           land     no      yes
  2   m      hair           land     yes     yes
  3   m      other          water    no      yes
  4   b      feathers       land     yes     yes
  5   b      feathers       land     no      yes
  6   f      scales         water    no      no
  7   f      scales         water    yes     no
  8   f      other          water    no      no
  9   f      scales         water    no      yes
  10  r      scales         land     no      yes

• Alternate decision trees that successfully classify the data, differing in number of decisions

• Creating a decision tree. For each decision point,
  o If all remaining examples have the same classification, we're done.
  o Else if there are examples with different classifications left and attributes left, pick the remaining attribute which is the "most important", the one which tends to divide the remaining examples into homogeneous sets
  o Else if there are no examples left, no such example has been observed; return default
  o Else if there are no attributes left, examples with the same description have different classifications: noise or insufficient attributes or nondeterministic domain
• Selection of "most important" attribute
  o For possible answers vi to a question with probabilities P(vi), the information content of the answer is
    I(P(v1), ... , P(vn)) = ∑_{i=1}^{n} −P(vi) log2 P(vi)
  o In our example the information content of a tree that classifies the data is
    I(P(m), P(b), P(f), P(r)) = −P(m) log2 P(m) − P(b) log2 P(b) − P(f) log2 P(f) − P(r) log2 P(r)
    = −0.3 log2 0.3 − 0.2 log2 0.2 − 0.4 log2 0.4 − 0.1 log2 0.1
    = (0.3)(1.74) + (0.2)(2.32) + (0.4)(1.32) + (0.1)(3.32) = 1.846


  o Each decision point adds to the information content. To determine the information gain, we subtract the information content left after the decision from the total. The information content left after the decision is the weighted sum of the information content in each of the groups resulting from the decision. At each decision point, the algorithm picks the attribute of those left that maximizes the information gain.
  o For the first decision point, the four attributes classify the examples in this way:
    body covering: hair(M: 2), feathers(B: 2), scales(F: 3, R: 1), other(M: 1, F: 1)
    habitat: land(M: 2, B: 2, R: 1), water(M: 1, F: 4)
    flies?: yes(M: 1, B: 1, F: 1), no(M: 2, B: 1, F: 3, R: 1)
    breathes air?: yes(M: 3, B: 2, F: 1, R: 1), no(F: 3)
  o For each of these, the information content remaining after the decision is:
    body covering: (0.2)(0) + (0.2)(0) + (0.4)(0.811) + (0.2)(1.0) = 0.524
    habitat: (0.5)(1.524) + (0.5)(0.722) = 1.123
    flies?: (0.3)(1.586) + (0.7)(1.845) = 1.767
    breathes air?: (0.7)(1.845) + (0.3)(0) = 1.291
  o Therefore the appropriate first choice is the "body covering" attribute.
  o For the "scales" branch below this decision point, the three remaining attributes classify the four examples as follows:
    habitat: land(R: 1), water(F: 3)
    flies?: yes(F: 1), no(F: 2, R: 1)
    breathes air?: yes(F: 1, R: 1), no(F: 2)
  o For each of these, the information content remaining after the decision is:
    habitat: (0.1)(0) + (0.3)(0) = 0
    flies?: (0.1)(0) + (0.3)(0.918) = 0.275
    breathes air?: (0.2)(1) + (0.2)(0) = 0.2
  o Therefore "habitat" is the right choice for this branch.
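The attribute selection above can be checked with a short sketch that computes the information content and the remainder after each candidate split on the ten examples. The names `information`, `remainder` and `best_attribute` are hypothetical. The values agree with the hand calculation to about two decimal places, since the notes round intermediate logarithms. Within the "scales" branch this code weights groups by their share of the four branch examples rather than of all ten, which rescales the remainders but does not change their ranking.

```python
from collections import Counter
from math import log2

# (class, body covering, habitat, flies?, breathes air?) for examples 1-10.
EXAMPLES = [
    ("m", "hair", "land", "no", "yes"),
    ("m", "hair", "land", "yes", "yes"),
    ("m", "other", "water", "no", "yes"),
    ("b", "feathers", "land", "yes", "yes"),
    ("b", "feathers", "land", "no", "yes"),
    ("f", "scales", "water", "no", "no"),
    ("f", "scales", "water", "yes", "no"),
    ("f", "other", "water", "no", "no"),
    ("f", "scales", "water", "no", "yes"),
    ("r", "scales", "land", "no", "yes"),
]
ATTRIBUTES = {"body covering": 1, "habitat": 2, "flies?": 3, "breathes air?": 4}

def information(examples):
    """Entropy of the class labels: sum of -P(v) log2 P(v)."""
    counts = Counter(e[0] for e in examples)
    n = len(examples)
    return sum(-(c / n) * log2(c / n) for c in counts.values())

def remainder(examples, column):
    """Weighted information content left after splitting on one attribute."""
    groups = {}
    for e in examples:
        groups.setdefault(e[column], []).append(e)
    n = len(examples)
    return sum(len(g) / n * information(g) for g in groups.values())

def best_attribute(examples, attributes):
    """Smallest remainder is the same as largest information gain."""
    return min(attributes, key=lambda a: remainder(examples, attributes[a]))

first = best_attribute(EXAMPLES, ATTRIBUTES)
scales = [e for e in EXAMPLES if e[1] == "scales"]
rest = {a: c for a, c in ATTRIBUTES.items() if a != "body covering"}
second = best_attribute(scales, rest)
```

Here `first` is "body covering" and `second` is "habitat", matching the choices made by hand.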

Some implications of AI
Philosophy and cognitive science
• Is intelligence something that can be defined and studied abstractly, independently from the details of the intelligent agent?
• Does intelligence require a body?
• Are there different intelligences (logics?), each with its own advantages for a given environment, that could collaborate in solving problems?
• Where does the mind stop and the "outside world" begin?
• Are there fundamental differences between human and "animal" intelligence, or are all of the differences just a matter of degree?
• How much of human intelligence is cultural, as opposed to genetic?

• How can AI give certain groups of people (more) power over other groups of people?
• Who benefits from applied AI (medicine, law, design, education, information retrieval, natural language processing, stock market, marketing, transportation, entertainment, agriculture, military)?
• Who funds AI research?
• How might AI be used to further
  o free, independent media; equal access to information; a better informed public?
  o a public capable of making rational economic and political decisions?
  o international (intercultural) understanding (tolerance)?
  o equal access to (or equitable distribution of) the world's resources?
  o protection of the environment?

Artificial neural network
An artificial neural network (ANN) or commonly just neural network (NN) is an interconnected group of artificial neurons that uses a mathematical model or computational model for information processing based on a connectionist approach to computation. In most cases an ANN is an adaptive system that changes its structure based on external or internal information that flows through the network.

(The term "neural network" can also mean biological-type systems.) In more practical terms neural networks are non-linear statistical data modeling tools. They can be used to model complex relationships between inputs and outputs or to find patterns in data.

A neural network is an interconnected group of nodes, akin to the vast network of neurons in the human brain.

More complex neural networks are often used in Parallel Distributed Processing.

There is no precise agreed definition among researchers as to what a neural network is, but most would agree that it involves a network of simple processing elements (neurons) which can exhibit complex global behavior, determined by the connections between the processing elements and element parameters. The original inspiration for the technique was from examination of the central nervous system and the neurons (and their axons, dendrites and synapses) which constitute one of its most significant information processing elements (see Neuroscience). In a neural network model, simple nodes (called variously "neurons", "neurodes", "PEs" ("processing elements") or "units") are connected together to form a network of nodes, hence the term "neural network." While a neural network does not have to be adaptive per se, its practical use comes with algorithms designed to alter the strength (weights) of the connections in the network to produce a desired signal flow.

These networks are also similar to the biological neural networks in the sense that functions are performed collectively and in parallel by the units, rather than there being a clear delineation of subtasks to which various units are assigned (see also connectionism). Currently, the term Artificial Neural Network (ANN) tends to refer mostly to neural network models employed in statistics, cognitive psychology and artificial intelligence. Neural network models designed with emulation of the central nervous system (CNS) in mind are a subject of theoretical neuroscience.

In modern software implementations of artificial neural networks the approach inspired by biology has more or less been abandoned for a more practical approach based on statistics and signal processing. In some of these systems neural networks, or parts of neural networks (such as artificial neurons) are used as components in larger systems that combine both adaptive and non-adaptive elements. While the more general approach of such adaptive systems is more suitable for real-world problem solving, it has far less to do with the traditional artificial intelligence connectionist models. What they do however have in common is the principle of non-linear, distributed, parallel and local processing and adaptation.

Models
Neural network models in artificial intelligence are usually referred to as artificial neural networks (ANNs); these are essentially simple mathematical models defining a function f : X → Y. Each type of ANN model corresponds to a class of such functions.

The network in artificial neural network
The word network in the term 'artificial neural network' arises because the function f(x) is defined as a composition of other functions gi(x), which can further be defined as a composition of other functions. This can be conveniently represented as a network structure, with arrows depicting the dependencies between variables. A widely used type of composition is the nonlinear weighted sum, where f(x) = K(∑i wi gi(x)), where K is some predefined function, such as the hyperbolic tangent. It will be convenient for the following to refer to a collection of functions gi as simply a vector g = (g1, g2, ..., gn).

Figure: ANN dependency graph.

This figure depicts such a decomposition of f, with dependencies between variables indicated by arrows. These can be interpreted in two ways. The first view is the functional view: the input x is transformed into a 3-dimensional vector h, which is then transformed into a 2-dimensional vector g, which is finally transformed into f. This view is most commonly encountered in the context of optimization. The second view is the probabilistic view: the random variable F = f(G) depends upon the random variable G = g(H), which depends upon H = h(X), which depends upon the random variable X. This view is most commonly encountered in the context of graphical models. The two views are largely equivalent. In either case, for this particular network architecture, the components of individual layers are independent of each other (e.g., the components of g are independent of each other given their input h). This naturally enables a degree of parallelism in the implementation.

Figure: Recurrent ANN dependency graph.

Networks such as the previous one are commonly called feedforward, because their graph is a directed acyclic graph. Networks with cycles are commonly called recurrent. Such networks are commonly depicted in the manner shown at the top of the figure, where f is shown as being dependent upon itself. However, there is an implied temporal dependence which is not shown. What this actually means in practice is that the value of f at some point in time t depends upon the values of f at zero or at one or more other points in time. The graphical model at the bottom of the figure illustrates the case: the value of f at time t only depends upon its last value. Models such as these, which have no dependencies in the future, are called causal models.

See also: graphical models

Learning
However interesting such functions may be in themselves, what has attracted the most interest in neural networks is the possibility of learning, which in practice means the following: Given a specific task to solve, and a class of functions F, learning means using a set of observations in order to find f* ∈ F which solves the task in an optimal sense.

This entails defining a cost function C : F → R such that, for the optimal solution f*, C(f*) ≤ C(f) for all f ∈ F (no solution has a cost less than the cost of the optimal solution).

The cost function C is an important concept in learning, as it is a measure of how far away we are from an optimal solution to the problem that we want to solve. Learning algorithms search through the solution space in order to find a function that has the smallest possible cost.

For applications where the solution is dependent on some data, the cost must necessarily be a function of the observations, otherwise we would not be modelling anything related to the data. It is frequently defined as a statistic to which only approximations can be made. As a simple example consider the problem of finding the model f which minimizes C = E[(f(x) − y)^2], for data pairs (x,y) drawn from some distribution D. In practical situations we would only have N samples from D and thus, for the above example, we would only minimize Ĉ = (1/N) ∑_{i=1}^{N} (f(xi) − yi)^2. Thus, the cost is minimized over a sample of the data rather than the true data distribution.

When N → ∞ some form of online learning must be used, where the cost is partially minimized as each new example is seen. While online learning is often used when D is fixed, it is most useful in the case where the distribution changes slowly over time. In neural network methods, some form of online learning is frequently also used for finite datasets.

See also: Optimization (mathematics), Statistical Estimation, Machine Learning

Choosing a cost function
While it is possible to arbitrarily define some ad hoc cost function, frequently a particular cost will be used either because it has desirable properties (such as convexity) or because it arises naturally from a particular formulation of the problem (e.g., in a probabilistic formulation the posterior probability of the model can be used as an inverse cost). Ultimately, the cost function will depend on the task we wish to perform. The three main categories of learning tasks are overviewed below.

Learning paradigms
There are three major learning paradigms, each corresponding to a particular abstract learning task. These are supervised learning, unsupervised learning and reinforcement learning. Usually any given type of network architecture can be employed in any of those tasks.

Supervised learning
In supervised learning, we are given a set of example pairs and the aim is to find a function f in the allowed class of functions that matches the examples. In other words, we wish to infer the mapping implied by the data; the cost function is related to the mismatch between our mapping and the data and it implicitly contains prior knowledge about the problem domain. A commonly used cost is the mean-squared error which tries to minimise the average error between the network's output, f(x), and the target value y over all the example pairs. When one tries to minimise this cost using gradient descent for the class of neural networks called Multi-Layer Perceptrons, one obtains the well-known backpropagation algorithm for training neural networks. Tasks that fall within the paradigm of supervised learning are pattern recognition (also known as classification) and regression (also known as function approximation). The supervised learning paradigm is also applicable to sequential data (e.g., for speech and gesture recognition). This can be thought of as learning with a "teacher," in the form of a function that provides continuous feedback on the quality of solutions obtained thus far.

Unsupervised learning
In unsupervised learning we are given some data x, and the cost function to be minimised can be any function of the data x and the network's output, f. The cost function is dependent on the task (what we are trying to model) and our a priori assumptions (the implicit properties of our model, its parameters and the observed variables).

As a trivial example, consider the model f(x) = a, where a is a constant and the cost C = (E[x] − f(x))^2. Minimising this cost will give us a value of a that is equal to the mean of the data. The cost function can be much more complicated. Its form depends on the application: for example, in compression it could be related to the mutual information between x and y. In statistical modelling, it could be related to the posterior probability of the model given the data. (Note that in both of those examples those quantities would be maximised rather than minimised.)

Tasks that fall within the paradigm of unsupervised learning are in general estimation problems; the applications include clustering, the estimation of statistical distributions, compression and filtering.

Reinforcement learning
In reinforcement learning, data x is usually not given, but generated by an agent's interactions with the environment. At each point in time t, the agent performs an action at and the environment generates an observation xt and an instantaneous cost ct, according to some (usually unknown) dynamics. The aim is to discover a policy for selecting actions that minimises some measure of a long-term cost, i.e. the expected cumulative cost. The environment's dynamics and the long-term cost for each policy are usually unknown, but can be estimated.

More formally, the environment is modeled as a Markov decision process (MDP) with states and actions with the following probability distributions: the instantaneous cost distribution P(ct | st), the observation distribution P(xt | st) and the transition P(st+1 | st, at), while a policy is defined as a conditional distribution over actions given the observations. Taken together, the two define a Markov chain (MC). The aim is to discover the policy that minimises the cost, i.e. the MC for which the cost is minimal.

ANNs are frequently used in reinforcement learning as part of the overall algorithm. Tasks that fall within the paradigm of reinforcement learning are control problems, games and other sequential decision making tasks.

See also: dynamic programming, stochastic control

Learning algorithms
Training a neural network model essentially means selecting one model from the set of allowed models (or, in a Bayesian framework, determining a distribution over the set of allowed models) that minimises the cost criterion. There are numerous algorithms available for training neural network models; most of them can be viewed as a straightforward application of optimization theory and statistical estimation.

Most of the algorithms used in training artificial neural networks employ some form of gradient descent. This is done by simply taking the derivative of the cost function with respect to the network parameters and then changing those parameters in a gradient-related direction. Evolutionary methods, simulated annealing, expectation-maximization and nonparametric methods are among other commonly used methods for training neural networks. See also machine learning.
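As a minimal sketch of gradient descent on a cost function, the example below fits a single linear unit f(x) = w x + b by minimizing the mean-squared error over a small sample. The data, the learning rate and the function names are assumptions chosen for illustration; a real network has many more parameters, but each is adjusted in the same way.

```python
def mse(w, b, data):
    """Sample cost: average squared difference between f(x) = w*x + b and y."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def gradient_step(w, b, data, lr=0.05):
    """Move (w, b) a small step against the gradient of the sample cost."""
    n = len(data)
    dw = sum(2 * (w * x + b - y) * x for x, y in data) / n
    db = sum(2 * (w * x + b - y) for x, y in data) / n
    return w - lr * dw, b - lr * db

data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # generated by y = 2x + 1
w, b = 0.0, 0.0
start_cost = mse(w, b, data)
for _ in range(2000):
    w, b = gradient_step(w, b, data)
end_cost = mse(w, b, data)
```

After training, `w` and `b` are close to 2 and 1 and the cost has dropped to nearly zero. The step size matters: too large a value of `lr` makes the updates diverge instead of converging.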

Employing artificial neural networks
Perhaps the greatest advantage of ANNs is their ability to be used as an arbitrary function approximation mechanism which 'learns' from observed data. However, using them is not so straightforward and a relatively good understanding of the underlying theory is essential.
• Choice of model: This will depend on the data representation and the application. Overly complex models tend to lead to problems with learning.
• Learning algorithm: There are numerous tradeoffs between learning algorithms. Almost any algorithm will work well with the correct hyperparameters for training on a particular fixed dataset. However, selecting and tuning an algorithm for training on unseen data requires a significant amount of experimentation.
• Robustness: If the model, cost function and learning algorithm are selected appropriately, the resulting ANN can be extremely robust.

With the correct implementation ANNs can be used naturally in online learning and large dataset applications. Their simple implementation and the existence of mostly local dependencies exhibited in the structure allows for fast, parallel implementations in hardware.

Applications
The utility of artificial neural network models lies in the fact that they can be used to infer a function from observations. This is particularly useful in applications where the complexity of the data or task makes the design of such a function by hand impractical.

Real life applications
The tasks to which artificial neural networks are applied tend to fall within the following broad categories:
• Function approximation, or regression analysis, including time series prediction and modeling.
• Classification, including pattern and sequence recognition, novelty detection and sequential decision making.
• Data processing, including filtering, clustering, blind source separation and compression.

Application areas include system identification and control (vehicle control, process control), game-playing and decision making (backgammon, chess, racing), pattern recognition (radar systems, face identification, object recognition and more), sequence recognition (gesture, speech, handwritten text recognition), medical diagnosis, financial applications, data mining (or knowledge discovery in databases, "KDD"), visualization and e-mail spam filtering.

Neural network software
Neural network software is used to simulate, research, develop and apply artificial neural networks, biological neural networks and in some cases a wider array of adaptive systems.

Types of neural networks
Feedforward neural network
The feedforward neural networks are the first and arguably simplest type of artificial neural networks devised. In this network, the information moves in only one direction, forward, from the input nodes, through the hidden nodes (if any) and to the output nodes. There are no cycles or loops in the network.

Single-layer perceptron
The earliest kind of neural network is a single-layer perceptron network, which consists of a single layer of output nodes; the inputs are fed directly to the outputs via a series of weights. In this way it can be considered the simplest kind of feed-forward network. The sum of the products of the weights and the inputs is calculated in each node, and if the value is above some threshold (typically 0) the neuron fires and takes the activated value (typically 1); otherwise it takes the deactivated value (typically -1). Neurons with this kind of activation function are also called McCulloch-Pitts neurons or threshold neurons. In the literature the term perceptron often refers to networks consisting of just one of these units. They were described by Warren McCulloch and Walter Pitts in the 1940s. A perceptron can be created using any values for the activated and deactivated states as long as the threshold value lies between the two. Most perceptrons have outputs of 1 or -1 with a threshold of 0 and there is some evidence that such networks can be trained more quickly than networks created from nodes with different activation and deactivation values.

Perceptrons can be trained by a simple learning algorithm that is usually called the delta rule. It calculates the errors between calculated output and sample output data, and uses this to create an adjustment to the weights, thus implementing a form of gradient descent. Single-unit perceptrons are only capable of learning linearly separable patterns; in 1969 in a famous monograph entitled Perceptrons Marvin Minsky and Seymour Papert showed that it was impossible for a single-layer perceptron network to learn an XOR function. They conjectured (incorrectly) that a similar result would hold for a multi-layer perceptron network. Although a single threshold unit is quite limited in its computational power, it has been shown that networks of parallel threshold units can approximate any continuous function from a compact interval of the real numbers into the interval [-1,1]. This very recent result can be found in [Auer, Burgsteiner, Maass: The p-delta learning rule for parallel perceptrons, 2001 (state Jan 2003: submitted for publication)]. A single-layer neural network can compute a continuous output instead of a step function. A common choice is the so-called logistic function:

y = \frac{1}{1 + e^{-x}}

With this choice, the single-layer network is identical to the logistic regression model, widely used in statistical modelling. The logistic function is also known as the sigmoid function. It has a continuous derivative, which allows it to be used in backpropagation. This function is also preferred because its derivative is easily calculated: y' = y(1 − y)
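As a rough sketch of how the logistic unit and the delta rule fit together, the code below trains one logistic unit by gradient descent on squared error. The learning rate, epoch count, zero initial weights and the OR training set are all assumptions made for the illustration.

```python
import math


def logistic(x):
    """Logistic (sigmoid) activation: maps any real x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def logistic_derivative(x):
    """Derivative expressed through the output itself: y' = y(1 - y)."""
    y = logistic(x)
    return y * (1.0 - y)


def train_delta_rule(samples, rate=0.5, epochs=2000):
    """Train one logistic unit; the last returned weight is the bias."""
    n_inputs = len(samples[0][0])
    weights = [0.0] * (n_inputs + 1)
    for _ in range(epochs):
        for inputs, target in samples:
            extended = list(inputs) + [1.0]  # constant bias input
            y = logistic(sum(w * x for w, x in zip(weights, extended)))
            delta = (target - y) * y * (1.0 - y)  # output error times y'
            weights = [w + rate * delta * x for w, x in zip(weights, extended)]
    return weights


def predict(weights, inputs):
    extended = list(inputs) + [1.0]
    return logistic(sum(w * x for w, x in zip(weights, extended)))


# Logical OR is linearly separable, so a single unit can learn it.
OR_SAMPLES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
```

Calling train_delta_rule(OR_SAMPLES) and rounding predict(...) for each input pair should reproduce the OR targets. The same procedure cannot succeed on XOR, because no single unit separates that pattern.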

Multi-layer perceptron

Figure: A two-layer neural network capable of calculating XOR. The numbers within the neurons represent each neuron's explicit threshold (which can be factored out so that all neurons have the same threshold, usually 1). The numbers that annotate arrows represent the weight of the inputs. This net assumes that if the threshold is not reached, zero (not -1) is output. Note that the bottom layer of inputs is not always considered a real neural network layer.

This class of networks consists of multiple layers of computational units, usually interconnected in a feed-forward way. Each neuron in one layer has directed connections to the neurons of the subsequent layer. In many applications the units of these networks apply a sigmoid function as an activation function. The universal approximation theorem for neural networks states that every continuous function that maps intervals of real numbers to some output interval of real numbers can be approximated arbitrarily closely by a multi-layer perceptron with just one hidden layer. This result holds only for restricted classes of activation functions, e.g. for the sigmoidal functions.

Multi-layer networks use a variety of learning techniques, the most popular being backpropagation. Here the output values are compared with the correct answer to compute the value of some predefined error function. By various techniques the error is then fed back through the network. Using this information, the algorithm adjusts the weights of each connection in order to reduce the value of the error function by some small amount. After repeating this process for a sufficiently large number of training cycles the network will usually converge to some state where the error of the calculations is small. In this case one says that the network has learned a certain target function.

To adjust weights properly one applies a general method for non-linear optimization that is called gradient descent. For this, the derivative of the error function with respect to the network weights is calculated and the weights are then changed such that the error decreases (thus going downhill on the surface of the error function). For this reason back-propagation can only be applied on networks with differentiable activation functions.
In general the problem of teaching a network to perform well, even on samples that were not used as training samples, is a quite subtle issue that requires additional techniques. This is especially important for cases where only very limited numbers of training samples are available. The danger is that the network overfits the training data and fails to capture the true statistical process generating the data. Computational learning theory is concerned with training classifiers on a limited amount of data. In the context of neural networks a simple heuristic, called early stopping, often ensures that the network will generalize well to examples not in the training set. Other typical problems of the back-propagation algorithm are the speed of convergence and the possibility of ending up in a local minimum of the error function. Today there are practical solutions that make back-propagation in multi-layer perceptrons the solution of choice for many machine learning tasks.
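To make the XOR point concrete, here is a minimal sketch of a two-layer threshold network that computes XOR. The weights and thresholds are hand-picked assumptions for this illustration and are not the values from the original figure. Units output 0 when the threshold is not reached.

```python
def step(total, threshold):
    """Threshold unit that outputs 1 above the threshold and 0 otherwise."""
    return 1 if total > threshold else 0


def xor_net(a, b):
    # Hidden layer: one unit detects "a OR b", the other "a AND b".
    h_or = step(1 * a + 1 * b, 0.5)
    h_and = step(1 * a + 1 * b, 1.5)
    # Output layer: fire when OR is on and AND is off.
    return step(1 * h_or - 1 * h_and, 0.5)
```

The hidden layer re-codes the inputs so that the output unit faces a linearly separable problem, which is exactly what a single-layer network lacks.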

ADALINE
ADALINE stands for Adaptive Linear Neuron, later called Adaptive Linear Element. It was developed by Professor Bernard Widrow and his graduate student Ted Hoff at Stanford University in 1960. It is based on the McCulloch-Pitts model and consists of a weight, a bias and a summation function.

Operation: y_i = w x_i + b

Its adaptation is defined through a cost function (error metric) of the residual e_i = d_i − (b + w x_i), where d_i is the desired output. With the MSE error metric E = \frac{1}{N} \sum_i e_i^2, the adapted weight and bias become the least-squares estimates

w = \frac{\sum_i (x_i - \bar{x})(d_i - \bar{d})}{\sum_i (x_i - \bar{x})^2} and b = \bar{d} - w \bar{x},

where \bar{x} and \bar{d} are the means of the inputs and of the desired outputs.

While the Adaline is thereby capable of simple linear regression, it has limited practical use. There is an extension of the Adaline, called the Multiple Adaline (MADALINE), that consists of two or more adalines serially connected.
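For a single input, the weight and bias that minimise the MSE are the ordinary least-squares slope and intercept. The sketch below computes them directly; the function name and the sample data are assumptions for the illustration.

```python
def fit_adaline(xs, ds):
    """Return (w, b) minimising the mean squared residual d_i - (b + w * x_i)."""
    n = len(xs)
    x_mean = sum(xs) / n
    d_mean = sum(ds) / n
    # Sum of squared deviations of the inputs, and their co-deviation with targets.
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxd = sum((x - x_mean) * (d - d_mean) for x, d in zip(xs, ds))
    w = sxd / sxx
    b = d_mean - w * x_mean
    return w, b


def adaline_output(x, w, b):
    return w * x + b
```

For data that lie exactly on a line, such as inputs 0, 1, 2, 3 with desired outputs 1, 3, 5, 7, the fit recovers the slope and intercept of that line.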

Radial basis function (RBF) network
Radial basis functions are powerful techniques for interpolation in multidimensional space. An RBF is a function which has a distance criterion with respect to a centre built into it. Radial basis functions have been applied in the area of neural networks where they may be used as a replacement for the sigmoidal hidden layer transfer characteristic in multi-layer perceptrons.

RBF networks have two layers of processing. In the first, input is mapped onto each RBF in the 'hidden' layer. The RBF chosen is usually a Gaussian. In regression problems the output layer is then a linear combination of hidden layer values representing mean predicted output. The interpretation of this output layer value is the same as a regression model in statistics. In classification problems the output layer is typically a sigmoid function of a linear combination of hidden layer values, representing a posterior probability. Performance in both cases is often improved by shrinkage techniques, known as ridge regression in classical statistics and known to correspond to a prior belief in small parameter values (and therefore smooth output functions) in a Bayesian framework.

RBF networks have the advantage of not suffering from local minima in the same way as multi-layer perceptrons. This is because the only parameters that are adjusted in the learning process are those of the linear mapping from hidden layer to output layer. Linearity ensures that the error surface is quadratic and therefore has a single easily found minimum. In regression problems this can be found in one matrix operation. In classification problems the fixed non-linearity introduced by the sigmoid output function is most efficiently dealt with using iteratively reweighted least squares.

RBF networks have the disadvantage of requiring good coverage of the input space by radial basis functions. RBF centres are determined with reference to the distribution of the input data, but without reference to the prediction task. As a result, representational resources may be wasted on areas of the input space that are irrelevant to the learning task. A common solution is to associate each data point with its own centre, although this can make the linear system to be solved in the final layer rather large, and requires shrinkage techniques to avoid overfitting. Associating each input datum with an RBF leads naturally to kernel methods such as Support Vector Machines and Gaussian Processes (the RBF is the kernel function). All three approaches use a non-linear kernel function to project the input data into a space where the learning problem can be solved using a linear model. Like Gaussian Processes, and unlike SVMs, RBF networks are typically trained in a Maximum Likelihood framework by maximizing the probability (minimizing the error) of the data under the model. SVMs take a different approach to avoiding overfitting by maximizing instead a margin. RBF networks are outperformed in most classification applications by SVMs. In regression applications they can be competitive when the dimensionality of the input space is relatively small.
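A small sketch of RBF regression with one Gaussian centre per data point and a ridge (shrinkage) term follows. It is an illustration under assumptions: one-dimensional inputs, a fixed width of 1.0, a ridge value of 1e-6, and a hand-written linear solver so that the example needs no external library.

```python
import math


def gaussian(x, centre, width):
    return math.exp(-((x - centre) ** 2) / (2.0 * width ** 2))


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_rbf(xs, ds, width=1.0, ridge=1e-6):
    """Fit output weights only; the centres are fixed at the data points."""
    centres = list(xs)
    phi = [[gaussian(x, c, width) for c in centres] for x in xs]
    n = len(centres)
    rows = range(len(xs))
    # Ridge-regularised normal equations: (Phi^T Phi + ridge * I) w = Phi^T d
    a = [[sum(phi[k][i] * phi[k][j] for k in rows) + (ridge if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    b = [sum(phi[k][i] * ds[k] for k in rows) for i in range(n)]
    return centres, solve(a, b)


def predict_rbf(x, centres, weights, width=1.0):
    return sum(w * gaussian(x, c, width) for w, c in zip(weights, centres))
```

Because only the linear output weights are fitted, the whole training step is a single linear solve, which is the "one matrix operation" property of RBF regression. Raising the ridge value trades fit on the training points for smoother output.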

Kohonen self-organizing network
The self-organizing map (SOM) invented by Teuvo Kohonen uses a form of unsupervised learning. A set of artificial neurons learn to map points in an input space to coordinates in an output space. The input space can have different dimensions and topology from the output space, and the SOM will attempt to preserve these.

Recurrent network
Contrary to feedforward networks, recurrent neural networks (RNs) are models with bidirectional data flow. While a feedforward network propagates data linearly from input to output, RNs also propagate data from later processing stages to earlier stages.

Simple recurrent network
A simple recurrent network (SRN) is a variation on the multi-layer perceptron, sometimes called an "Elman network" due to its invention by Jeff Elman. A three-layer network is used, with the addition of a set of "context units" in the input layer. There are connections from the middle (hidden) layer to these context units fixed with a weight of one. At each time step, the input is propagated in a standard feed-forward fashion, and then a learning rule (usually back-propagation) is applied. The fixed back connections result in the context units always maintaining a copy of the previous values of the hidden units (since they propagate over the connections before the learning rule is applied). Thus the network can maintain a sort of state, allowing it to perform such tasks as sequence-prediction that are beyond the power of a standard multi-layer perceptron.
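The role of the context units can be sketched as a single forward step. This is an illustration only: the tanh hidden activation, the linear output layer and the weight layout are assumptions, and the learning rule is left out.

```python
import math


def srn_step(inputs, context, w_in, w_context, w_out):
    """One time step of an Elman-style network (forward pass only).

    w_in[j] and w_context[j] hold the weights into hidden unit j from the
    inputs and from the context units; w_out[k] holds the weights into
    output unit k from the hidden units.
    """
    hidden = []
    for j in range(len(w_in)):
        net = sum(w * x for w, x in zip(w_in[j], inputs))
        net += sum(w * c for w, c in zip(w_context[j], context))
        hidden.append(math.tanh(net))
    outputs = [sum(w * h for w, h in zip(row, hidden)) for row in w_out]
    # The fixed weight-one back connections copy the hidden values into
    # the context units for use at the next time step.
    new_context = list(hidden)
    return outputs, new_context
```

Feeding the same input twice in a row generally gives two different outputs, because the second step also sees the hidden state left behind by the first. That retained state is what a standard multi-layer perceptron lacks.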

In a fully recurrent network, every neuron receives inputs from every other neuron in the network. These networks are not arranged in layers. Usually only a subset of the neurons receive external inputs in addition to the inputs from all the other neurons, and another disjoint subset of neurons report their output externally as well as sending it to all the neurons. These distinctive inputs and outputs perform the function of the input and output layers of a feed-forward or simple recurrent network, and also join all the other neurons in the recurrent processing.

Hopfield network
The Hopfield network is a recurrent neural network in which all connections are symmetric. Invented by John Hopfield in 1982, this network guarantees that its dynamics will converge. If the connections are trained using Hebbian learning then the Hopfield network can perform as robust content-addressable memory, resistant to connection alteration.
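A minimal sketch of a Hopfield network used as content-addressable memory is given below. It assumes +1/-1 unit states, Hebbian outer-product weights with no self-connections, and sequential (asynchronous) updates; the function names are hypothetical.

```python
def train_hebbian(patterns):
    """Build symmetric weights from +1/-1 patterns using the Hebbian rule."""
    n = len(patterns[0])
    weights = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:  # no self-connections
                    weights[i][j] += p[i] * p[j] / n
    return weights


def recall(weights, state, max_sweeps=10):
    """Update units one at a time until no unit changes (or sweeps run out)."""
    state = list(state)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(state)):
            total = sum(w * s for w, s in zip(weights[i], state))
            new_value = 1 if total >= 0 else -1
            if new_value != state[i]:
                state[i] = new_value
                changed = True
        if not changed:
            break
    return state
```

Storing one pattern and presenting a copy with a single flipped unit lets recall restore the stored pattern, which illustrates the memory's tolerance to corrupted cues.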

Echo State Network
The Echo State Network (ESN) is a recurrent neural network with a sparsely connected random hidden layer. The weights of the output neurons are the only part of the network that can change and be learned. ESNs are good at (re)producing temporal patterns.

Stochastic neural networks
A stochastic neural network differs from a regular neural network in the fact that it introduces random variations into the network. In a probabilistic view of neural networks, such random variations can be viewed as a form of statistical sampling, such as Monte Carlo sampling.

Boltzmann machine
The Boltzmann machine can be thought of as a noisy Hopfield network. Invented by Geoff Hinton and Terry Sejnowski in 1985, the Boltzmann machine is important because it is one of the first neural networks to demonstrate learning of latent variables (hidden units). Boltzmann machine learning was at first slow to simulate, but the contrastive divergence algorithm of Geoff Hinton (circa 2000) allows models such as Boltzmann machines and products of experts to be trained much faster.

Modular neural networks
Biological studies showed that the human brain functions not as a single massive network, but as a collection of small networks. This realisation gave birth to the concept of modular neural networks, in which several small networks cooperate or compete to solve problems.

Committee of machines
A committee of machines (CoM) is a collection of different neural networks that together "vote" on a given example. This generally gives a much better result compared to other neural network models. In fact in many cases, starting with the same architecture and training but using different initial random weights gives vastly different networks. A CoM tends to stabilize the result. The CoM is similar to the general machine learning bagging method, except that the necessary variety of machines in the committee is obtained by training from different random starting weights rather than training on different randomly selected subsets of the training data.
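The voting step can be sketched as a simple majority vote. The committee members below are hypothetical stand-ins (plain threshold classifiers) for networks trained from different random starting weights.

```python
from collections import Counter


def committee_vote(models, example):
    """Return the label predicted by the largest number of committee members."""
    votes = Counter(model(example) for model in models)
    return votes.most_common(1)[0][0]


def make_threshold_model(threshold):
    """Stand-in for a trained network: classifies by comparing to a threshold."""
    return lambda x: 1 if x > threshold else -1


# Three members that disagree slightly near the decision boundary.
COMMITTEE = [make_threshold_model(t) for t in (0.4, 0.5, 0.6)]
```

An odd number of members avoids ties in a two-class vote. Near the boundary the individual members disagree, and the vote smooths over that disagreement.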

Associative Neural Network (ASNN)
The ASNN is an extension of the committee of machines that goes beyond a simple/weighted average of different models. ASNN represents a combination of an ensemble of feed-forward neural networks and the k-nearest neighbour technique (kNN). It uses the correlation between ensemble responses as a measure of distance amid the analysed cases for the kNN. This corrects the bias of the neural network ensemble. An associative neural network has a memory that can coincide with the training set. If new data becomes available, the network instantly improves its predictive ability and provides data approximation (self-learn the data) without a need to retrain the ensemble. Another important feature of ASNN is the possibility to interpret neural network results by analysis of correlations between data cases in the space of models. The method is demonstrated at www.vcclab.org, where you can either use it online or download it.

Other types of networks
These special networks do not fit in any of the previous categories.

Holographic associative memory
Holographic associative memory represents a family of analog, correlation-based, associative, stimulus-response memories, where information is mapped onto the phase orientation of complex numbers operating.

Instantaneously trained networks
Instantaneously trained neural networks (ITNNs) were inspired by the phenomenon of short-term learning that seems to occur instantaneously. In these networks the weights of the hidden and the output layers are mapped directly from the training vector data. Ordinarily, they work on binary data, but versions for continuous data that require small additional processing are also available.

Spiking neural networks
Spiking neural networks (SNNs) are models which explicitly take into account the timing of inputs. The network input and output are usually represented as series of spikes (delta functions or more complex shapes). SNNs have the advantage of being able to process information continuously. They are often implemented as recurrent networks. Networks of spiking neurons -- and the temporal correlations of neural assemblies in such networks -- have been used to model figure/ground separation and region linking in the visual system (see e.g. Reitboeck et al. in Haken and Stadler: Synergetics of the Brain. Berlin, 1989). Gerstner and Kistler have a freely available online textbook on Spiking Neuron Models. Spiking neural networks with axonal conduction delays exhibit polychronisation, and hence could have a potentially unlimited memory capacity. In June 2005 IBM announced construction of a Blue Gene supercomputer dedicated to the simulation of a large recurrent spiking neural network [1].

Dynamic neural networks
Dynamic neural networks not only deal with nonlinear multivariate behaviour, but also include (learning of) time-dependent behaviour such as various transient phenomena and delay effects. Meijer has a Ph.D. thesis online in which regular feedforward perceptron networks are generalized with differential equations, using variable time step algorithms for learning in the time domain and including algorithms for learning in the frequency domain (in that case linearized around a set of static bias points).

Cascading neural networks
Cascade-Correlation is an architecture and supervised learning algorithm developed by Scott Fahlman and Christian Lebiere. Instead of just adjusting the weights in a network of fixed topology, Cascade-Correlation begins with a minimal network, then automatically trains and adds new hidden units one by one, creating a multi-layer structure. Once a new hidden unit has been added to the network, its input-side weights are frozen. This unit then becomes a permanent feature-detector in the network, available for producing outputs or for creating other, more complex feature detectors. The Cascade-Correlation architecture has several advantages over existing algorithms: it learns very quickly, the network determines its own size and topology, it retains the structures it has built even if the training set changes, and it requires no back-propagation of error signals through the connections of the network.

Neuro-fuzzy networks
A neuro-fuzzy network is a fuzzy inference system in the body of an artificial neural network. Depending on the FIS type, there are several layers that simulate the processes involved in a fuzzy inference like fuzzification, inference, aggregation and defuzzification. Embedding an FIS in a general structure of an ANN has the benefit of using available ANN training methods to find the parameters of a fuzzy system.

Theoretical properties
Capacity
Artificial neural network models have a property called 'capacity', which roughly corresponds to their ability to model any given function. It is related to the amount of information that can be stored in the network and to the notion of complexity.

Convergence
Nothing can be said in general about convergence since it depends on a number of factors. Firstly, there may exist many local minima. This depends on the cost function and the model. Secondly, the optimization method used might not be guaranteed to converge when far away from a local minimum. Thirdly, for a very large amount of data or parameters, some methods become impractical. In general, it has been found that theoretical guarantees regarding convergence are not always a very reliable guide to practical application.

Generalisation and statistics
In applications where the goal is to create a system that generalises well in unseen examples, the problem of overtraining has emerged. This arises in overcomplex or overspecified systems when the capacity of the network significantly exceeds the needed free parameters. There are two schools of thought for avoiding this problem: The first is to use cross-validation and similar techniques to check for the presence of overtraining and optimally select hyperparameters such as to minimise the generalisation error. The second is to use some form of regularisation. This is a concept that emerges naturally in a probabilistic (Bayesian) framework, where the regularisation can be performed by putting a larger prior probability over simpler models; but also in statistical learning theory, where the goal is to minimise over two quantities: the 'empirical risk' and the 'structural risk', which roughly correspond to the error over the training set and the predicted error in unseen data due to overfitting.

Confidence analysis of a neural network
Supervised neural networks that use an MSE cost function can use formal statistical methods to determine the confidence of the trained model. The MSE on a validation set can be used as an estimate for variance. This value can then be used to calculate the confidence interval of the output of the network, assuming a normal distribution. A confidence analysis made this way is statistically valid as long as the output probability distribution stays the same and the network is not modified.

By assigning a softmax activation function on the output layer of the neural network (or a softmax component in a component-based neural network) for categorical target variables, the outputs can be interpreted as posterior probabilities. This is very useful in classification as it gives a certainty measure on classifications.

The softmax activation function:

y_i = \frac{e^{x_i}}{\sum_{j} e^{x_j}}

where x_i is the net input to output unit i and the sum runs over all output units.
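A short sketch of softmax as normalised exponentials follows. Subtracting the largest input before exponentiating is an implementation choice assumed here for numerical safety; it does not change the result.

```python
import math


def softmax(xs):
    """Map a list of real-valued net inputs to probabilities that sum to 1."""
    largest = max(xs)
    exps = [math.exp(x - largest) for x in xs]  # shift avoids overflow
    total = sum(exps)
    return [e / total for e in exps]
```

The outputs are positive, sum to one, and preserve the ordering of the inputs, which is why they can be read as a certainty measure over the classes.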

Dynamical properties
Various techniques originally developed for studying disordered magnetic systems (spin glasses) have been successfully applied to simple neural network architectures, such as the Hopfield network. Influential work by E. Gardner and B. Derrida has revealed many interesting properties about perceptrons with real-valued synaptic weights, while later work by W. Krauth and M. Mezard has extended these principles to binary-valued synapses.

About Fuzzy Logic
What is fuzzy logic? What's the difference between fuzzy logic and Boolean logic? What are connectives in fuzzy logic? Boolean or "two-valued" logic is traditional logic with all statements either being true or false. Symbolic logic is something that you can master. The hardest thing about symbolic logic is learning how to work with the symbols. Once you know what all the symbols stand for, the logic should come more easily.

Philosophers and Logicians
First, I'd like to do a bit of philosophizing as a way to lead into the logic. Philosophers and logicians have a lot of overlap in what they do. Many logicians are also philosophers, and all philosophers are logicians to some extent (some much more so than others). Given that there is such a connection between philosophers and logicians, I find it striking just how radically different the fields are. Philosophers are interested in finding deep truths about the world, be they epistemological, metaphysical, ethical, etc. Logicians (qua logicians) are only interested in using a set of rules to manipulate arbitrary symbols that have no relevance to the world. The (sometimes difficult) marriage between philosophy and logic comes from the fact that everyone in the world (except, I would argue, people who are commonly called "crazy") accepts the truths proven by logic to be universally true and unquestionable. Philosophy needs logic because in order to establish that a philosophical doctrine is true, one needs to show that the doctrine is universally and unquestionably true. One needs to, in other words, give a demonstration that everyone would accept as proof that the proposition is true. To do that, the philosopher needs logic.

Logic Takes Small Steps
Logic accomplishes this magical universal acceptance because it makes only little tiny steps. It does silly little things like:

ASSUMING: The dog is brown.
AND ASSUMING: The dog weighs 15 lbs.
I CONCLUDE: The dog is brown and the dog weighs 15 lbs.

which anyone who understands what the word "and" means would agree with.

Shorthand Sentences
Logicians, though, are very lazy people. They don't like to write long derivations in English because English sentences can be fairly long. So what they do instead is a kind of shorthand. If you give a logician a sentence like

The dog is brown.

he will pick a letter and assign it to that sentence. He now knows that the letter is just shorthand for the sentence. The way I learned logic, capital letters are used for sentences (with the exception of U, V, W, X, Y, and Z; I'll get to those later).

So let's just start at the beginning of the alphabet and use the letter "A" to represent the sentence "The dog is brown." While we're at it, let's use the letter "B" to represent the sentence "The dog weighs 15 lbs." In addition to saving time and ink, this practice of using capital letters to represent whole sentences has a couple of other advantages.

The first is that to a logician, not every word is as interesting as every other. Logicians are extremely interested in the following list of words:

and
or
if...then
if and only if
not

They call these words "connectives." This is because you can use them to connect sentences that you already have together to make new sentences. When you write in English, those words don't stand out; they just get lost in the middle of sentences. Logicians want to make sure the words look special, so they take the whole rest of the sentence (the part they don't care about) and use a single letter to represent that. Then their favorite words stand out. Let's rewrite our earlier example about the dog using our logician's shorthand:

ASSUMING: A
AND ASSUMING: B
I CONCLUDE: A and B

The other advantage of using capital letters to represent sentences is that you ignore all the information that isn't relevant to what you're trying to do. For the derivation I did above, it didn't matter that the sentences were both about some dog. It didn't matter that they were about weight or color. They could have just as easily been sentences about how tall the dog is or about a cat or a person or a war or whatever. And if we can do the derivation for A and B, then we can do the same exact derivation for C and D or E and N or any other sentences we like.


Now, as I said before, logicians are lazy. They really don't want to have anything to do with English. So instead of using the English words and, or, if...then, if and only if, and not, they make up their own symbols for these:

For these words      Logicians use this symbol
and                  ^
or                   v
if ... then          ->
if and only if       <->
not                  ~

(Sometimes they also use a triple equals sign for '<->', but I can't type that.) Here are some examples:

You and I would write this                                      A logician writes this
The dog is brown and the dog weighs 15 lbs.                     (A ^ B)
The dog is brown or the dog weighs 15 lbs.                      (A v B)
If the dog is brown, then the dog weighs 15 lbs.                (A -> B)
The dog is brown if and only if the dog weighs 15 lbs.          (A <-> B)
The dog is not brown.                                           ~A

There are a few things to notice here:

1. The symbols ^, v, ->, and <-> are called "two-place connectives." This is because they connect two sentences together into a more complicated sentence.

2. The symbol ~ is called a "one-place connective" because you only add it to one sentence. (You cannot join multiple sentences together with it.) To negate a sentence, all you have to do is stick a ~ on the front.

3. When you join two sentences with a two-place connective, you ALWAYS put parentheses around it. So it is NOT appropriate to write this:

A ^ B

That makes as much sense in symbolic logic as writing:

Nn7&% mm)]mm (


Parentheses
I know that a lot of books and instructors claim that it is okay to drop the outermost parentheses in a sentence. I've done it myself many times. And 95% of the time it won't cause you trouble if you're careful. But let's say we started with this 'sentence'

A ^ B

and decided to negate it. Well, the way to negate a sentence is to stick a ~ on the front, so let's do that:

~A ^ B

But wait! What we did there was just negate the A. We wanted to negate the whole sentence. If we were really sharp, then we might notice that somebody had given us an illegitimate sentence that was missing parentheses, and so we would add the parentheses before adding the ~, to get:

~(A ^ B)

which is what we wanted. It seems silly to make such a big deal about parentheses when we're dealing with simple sentences, but when you're doing a 30-line derivation and you're tired, it's easy to make a mistake just like that on line 17 and get yourself into real trouble. It's better to just remember the simple rule and always add parentheses when you have a two-place connective.

. . .

Let's take a deep breath and then go quickly over what we have so far.

Using Connectives
Connectives are logical terms,

^      (and)
v      (or)
->     (if...then)
<->    (if and only if)
~      (not)

which you can add to a sentence. A simple sentence is one that has no connectives. For example:

A (the dog is brown).

A complex sentence is a sentence which is made up of one or more simple sentences and one or more connectives. Some examples are:

(A ^ B)
(A v B)
(A -> B)
(A <-> B)
~A

You can use connectives on complex sentences just as you can on simple sentences. Let's introduce a new simple sentence "it is raining," and let's call our new sentence C. We now have a lot more sentences that we can make. (Keep in mind, we have no idea yet which of these sentences are true or false; we also don't yet know how these sentences relate.) For example:

(C ^ B)
~C
(B v C)
(C v B)
(B -> C)
(A -> C)
(C -> A)
(B <-> C)
(~B <-> C)
~(B <-> C)
C
~~C
((A ^ B) v C)
(((A ^ ~B) v ~C) -> (~(A v B) <-> C))

These can get a little complicated. That last sentence is especially scary looking; we'll come back to it in a little while. For now, here is a quick run-down of how to use connectives to make complex sentences from simple ones.

To make this complex sentence    Do this                        We say
~C                               Stick a ~ on C.                ~C is the negation of C.
(B v C)                          Use a v to join B and C.       (B v C) is the disjunction of B and C.
(B ^ C)                          Use a ^ to join B and C.       (B ^ C) is the conjunction of B and C.
(B -> C)                         Use a -> to join B and C.      B implies C. (B -> C) is a conditional or implication.
(B <-> C)                        Use a <-> to join B and C.     B implies C and C implies B. (B <-> C) is a biconditional.

Some of our sentences had more than one connective:

(~B <-> C)
~(B <-> C)
~~C
((A ^ B) v C)
(((A ^ ~B) v ~C) -> (~(A v B) <-> C))

The sentence (~B <-> C) is made by joining the sentences ~B and C with <->. The complex sentence ~B is called a "subsentence" of the larger sentence (~B <-> C) because it is a smaller sentence inside the large one. The simple sentence C is also a subsentence of the larger sentence (~B <-> C). The simple sentence B is a subsentence of the subsentence ~B, and so B is also a subsentence of (~B <-> C).

There are two connectives used in the larger sentence, <-> and ~, but they are not equally important. In this case the <-> is much more important than the ~. Remember how the sentence was made by taking the two smaller sentences ~B and C and connecting them with a <->. The <-> is therefore called the "main connective" of the sentence.

Main connectives are, without a doubt, absolutely the most important idea in logic. The hardest skill to learn in logic is to identify the main connective of a sentence. Make sure you understand what main connectives are.

Compare the sentence we've been looking at, (~B <-> C), with one that looks similar, ~(B <-> C). This new sentence is very different. It was made by negating (B <-> C). The main connective of ~(B <-> C) is therefore ~, and (B <-> C) is just a subsentence of ~(B <-> C).

Complicated sentences
Now let's take a closer look at the most complicated sentence on our list and see if we can make it more manageable. The way to analyze a complicated sentence is to start at the outside and work your way in. The outermost parentheses on this ugly sentence

(((A ^ ~B) v ~C) -> (~(A v B) <-> C))

are used to connect these two sentences

((A ^ ~B) v ~C)
(~(A v B) <-> C)

with a ->. So the way to build our ugly sentence is to start with these two less ugly sentences:

((A ^ ~B) v ~C)
(~(A v B) <-> C)

and connect them with the main connective ->. We can then analyze each subsentence if we like.

I told you before that simple sentences are represented by the capital letters A through T, and that U, W, X, Y, and Z are saved for something else. (I rarely use V because it looks too much like the symbol for 'or'.) U, W, X, Y, and Z are used as shorthand for other sentences in logic (some books use italic letters and others use Greek letters, but since I only have plain text to work with, I use the end of the alphabet). I call these "sentence variables." So, just as we can take the English sentence

There is nothing on TV.

and use the capital letter D to represent it, we can take the sentence in logic

((A ^ ~B) v ~D)

and use the capital letter U to represent it. [Note: it is also legal to use capital variables to stand for simple sentences. So you can take the simple sentence B and use the letter Z to stand for it.]

This can be useful in analyzing complicated sentences. For example, if we have the scary looking sentence

((((A ^ ~B) v (B <-> C)) -> (~(C v D) ^ ~(~A -> ~~D))) v A)

we can start using sentence variables to stand for subsentences. So if U stands for (A ^ ~B), then we have

(((U v (B <-> C)) -> (~(C v D) ^ ~(~A -> ~~D))) v A)

and if V stands for (U v (B <-> C)), then we have

((V -> (~(C v D) ^ ~(~A -> ~~D))) v A)

If W stands for ~(C v D), we have

((V -> (W ^ ~(~A -> ~~D))) v A)

and if X stands for ~(~A -> ~~D), we have

((V -> (W ^ X)) v A)

and if Y stands for (V -> (W ^ X)), we have

(Y v A)

So we know where our main connective is. And by substituting back in for the sentence variables, we can recreate our sentence in manageable chunks. It is very important to keep track of what sentence variables stand for when you're doing this kind of substitution. This can be a major source of error if you're not keeping close track of what every letter stands for.

. . .

Now that we know all the details of the language of symbolic logic, it's time to actually do symbolic logic. The first step with every sentence is to identify the main connective. The reason is simple: In symbolic logic, the main connective of a sentence is the only thing that you can work with. Let's look at our complicated sentence from earlier:

(((A ^ ~B) v ~C) -> (~(A v B) <-> C))

It is fundamentally an implication between these two subsentences:

((A ^ ~B) v ~C)
(~(A v B) <-> C)

There is no -> in either subsentence, but the sentence as a whole is still first and foremost an implication because of what its main connective is. So when you're trying to figure out how in the heck you can work with this ugly sentence

(((A ^ ~B) v ~C) -> (~(A v B) <-> C))

you need to remember that it is an implication and treat it just as one.
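Finding the main connective of a fully parenthesized sentence can be sketched mechanically. A leading ~ is the main connective; otherwise it is the two-place connective that sits just inside the outermost parentheses. The function names below are hypothetical, and the sketch assumes the sentence is well formed in the plain-text notation used here.

```python
TWO_PLACE = ("^", "v", "->", "<->")


def tokenize(sentence):
    """Split a sentence into parentheses, connectives and sentence letters."""
    tokens = []
    i = 0
    while i < len(sentence):
        if sentence[i].isspace():
            i += 1
        elif sentence.startswith("<->", i):
            tokens.append("<->")
            i += 3
        elif sentence.startswith("->", i):
            tokens.append("->")
            i += 2
        else:
            tokens.append(sentence[i])
            i += 1
    return tokens


def main_connective(sentence):
    """Return the main connective, or None for a simple sentence."""
    tokens = tokenize(sentence)
    if tokens[0] == "~":
        return "~"  # the whole sentence is a negation
    if tokens[0] != "(":
        return None  # a single letter has no connective
    depth = 0
    for token in tokens:
        if token == "(":
            depth += 1
        elif token == ")":
            depth -= 1
        elif depth == 1 and token in TWO_PLACE:
            return token  # sits directly inside the outermost parentheses
    raise ValueError("not a well-formed sentence: " + sentence)
```

Note that the lowercase v is the connective while the capital V is an ordinary letter, so a sentence variable named V does not confuse the scan. Applied to the two ugly sentences above, the function reports -> and v respectively.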

What the Connectives Mean
Here's a quick course on what the connectives mean. (I assume you have some familiarity with them.)

The sentence | is TRUE whenever                                      | is FALSE whenever
A            | A is true                                             | A is false
~A           | A is false                                            | A is true
(A ^ B)      | A is true and B is true                               | A is false; or B is false; or both A and B are false
(A v B)      | A is true; or B is true; or both A and B are true     | A is false and B is false
(A <-> B)    | A and B are both true; or A and B are both false      | A is false and B is true, or A is true and B is false
(A -> B)     | A is false; or B is true; or A is false and B is true | A is true and B is false

This last one is a little weird, so let's think about it. If we translate it back into English, we get If the dog is brown then the dog weighs 15 lbs. How would we go about proving that this sentence is false? Let's say that the dog is brown and the dog weighs 15 lbs. Does that disprove the if...then statement? Certainly not! What if the dog is brown but the dog weighs 25 lbs.? That does disprove the statement. What if the dog turns out to be white? Then we cannot disprove the inference because it only makes a prediction about a brown dog. If the dog isn't brown, then we can't test the prediction. So the only way to make the sentence (A -> B) false is to make A true and B false at the same time. Given any other values of A and B, the sentence comes out true.
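As a rough sketch, the truth conditions above can be written as small Python functions. The function names are chosen for this example only. The loop prints the four rows for -> and confirms that the only false row is the one where A is true and B is false.

```python
from itertools import product

# One function per connective, following the truth conditions listed above.
def NOT(a):
    return not a

def AND(a, b):
    return a and b

def OR(a, b):
    return a or b

def IFF(a, b):
    return a == b

def IMPLIES(a, b):
    # False only when a is true and b is false.
    return (not a) or b

# Print the truth table for (A -> B).
for a, b in product([True, False], repeat=2):
    print(a, b, IMPLIES(a, b))

# Collect the rows that make (A -> B) false.
false_rows = [(a, b) for a, b in product([True, False], repeat=2)
              if not IMPLIES(a, b)]
print(false_rows)
```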

The Rules of Logic
Now we're finally ready to learn the rules of logic. There are exactly 12 - no more, no less. Each connective has two rules associated with it, and there are two special rules. Let's start with one of the special rules first.

1. Assumptions The first special rule is the rule of assumptions. It is deceptively easy. The rule is:
You are allowed to assume anything you want at any time.
But there is a catch:
You have to keep track of what assumptions you have made.
Well that makes sense. Let's say you and I are detectives trying to solve a mystery. I could say something like "let's assume for the time being that the dog is brown." Once I said that, we could discuss what that would mean. Anything we conclude from that assumption is perfectly okay, as long as we remember that it was under the assumption that the dog is brown. In other words, whatever we do prove under the assumption that the dog is brown must be followed by a disclaimer "assuming that the dog is brown." Eventually, we would want to prove something about the case that doesn't depend on the dog being brown. Logicians call this "discharging" the assumption. Fortunately, some of our other rules tell us how to discharge assumptions.

1. When I do derivations, I number each new line. I start new assumptions using curly brackets
{ 2. and then I indent everything after a new assumption;
  3. When I discharge an assumption I close the curly brackets
}
4. And then I stop indenting.

One other very important thing to keep in mind: Once you close off an assumption, you can no longer use any lines between the curly brackets. So since I've closed the curly brackets above, I would no longer be able to use either of the two lines between them: they are gone forever. So lines (2) and (3) above are illegal. However, line (1) is legal because it is outside the curly brackets, and so is line (4). This can get complicated if you have assumptions inside of assumptions.

And finally, and perhaps central to logic: A logical truth is something that you can write with all your assumptions discharged.

Before we can do some short derivations, we need to learn two other rules. Let's start with one of the two rules that we get from the ->


2. -> Introduction The rule is called "-> introduction." The way it works is: If you assume X and then you derive Y then you are entitled to discharge the assumption and write (X -> Y) That makes sense. Let's just say that we assumed A [The dog is brown] And then we did some logic and out of that we proved E [The killer is a man] If we did that, we would be entitled to say to a jury (A -> E) [If the dog is brown then the killer is a man] The sentence (A -> E) is true.

3. ^ Elimination Let's learn one more rule for now. This one is called "^ elimination." If you have (X ^ Y) then you are entitled to X and you are also entitled to Y That makes sense too. Lets say we knew for a fact that (A ^ B) [The dog is brown and the dog weighs 15 lbs] Then we would certainly be entitled to conclude

A [The dog is brown] and we would also certainly be entitled to conclude B [The dog weighs 15 lbs] A Derivation Now let's take an example of a derivation. Suppose I wanted to prove that this is a logical truth ((A ^ B) -> A) I would start by identifying the main connective, which is a ->. I know how to introduce a new ->, I assume the left and then derive the right. Let's try it:
{ 1) (A ^ B)              [assumption]
  2) A                    [^elim on 1]
}
3) ((A ^ B) -> A)         [->intro on 1-2]

We just used our 3 rules to derive ((A ^ B) -> A) [If (the dog is brown and the dog weighs 15 lbs) then the dog is brown]

4. Repetition There's one other special rule. It's called "repetition." The rule simply says that if you have X Then you are entitled to write X Provided that it was not inside a closed curly bracket.

5. ^ Introduction The other rule with ^ is called "^ introduction." It says, if you have X and you also have Y

then you are entitled to (X ^ Y) That makes sense too. Let's say that I have already proven E [The killer is a man] And I have also proven F [The killer is tall] Then I am certainly allowed to say to the jury (E ^ F) [The killer is a man and the killer is tall] Continuing the Derivation Let's continue our derivation using our new rules.
{ 1) (A ^ B)                         [assumption]
  2) A                               [^elim on 1]
}
3) ((A ^ B) -> A)                    [->intro on 1-2]
{ 4) C                               [assumption]
  5) ((A ^ B) -> A)                  [repetition of 3]
  6) (((A ^ B) -> A) ^ C)            [^intro on 4 and 5]
}
7) (C -> (((A ^ B) -> A) ^ C))       [->intro on 4-6]


6. -> Elimination The other rule for -> is called "-> elimination." It says that if you have X and you have (X -> Y) then you are entitled to Y That makes sense too. If I know

A [The dog is brown] and I know (A -> E) [If the dog is brown then the killer is a man] then I am certainly entitled to conclude E [The killer is a man] Adding to the Derivation Let's add a little more to our derivation:
{ 1) (A ^ B)                         [assumption]
  2) A                               [^elim on 1]
}
3) ((A ^ B) -> A)                    [->intro on 1-2]
{ 4) C                               [assumption]
  5) ((A ^ B) -> A)                  [repetition of 3]
  6) (((A ^ B) -> A) ^ C)            [^intro on 4 and 5]
}
7) (C -> (((A ^ B) -> A) ^ C))       [->intro on 4-6]
{ 8) (A ^ B)                         [assumption]
  9) ((A ^ B) -> A)                  [repetition of 3]

[Note: Line (9) is NOT a repetition of (5) because (5) is inside closed curly brackets. (3) is not, so it is okay to repeat it here.]

  10) A                              [->elim on 8 and 9]

[Note: I did not discharge the assumption I made on line (8). So A is not a logical truth; it is true only on the assumption that (A ^ B) is true.]

7. <-> Introduction The next two rules have to do with <->. The first is called "<-> introduction." It states that if you have (X -> Y) and you have (Y -> X) then you are entitled to (X <-> Y) This one is a little tricky to explain, and the best way (I'm sorry to say) is truth tables. So you should try all the possible combinations for X and Y and convince yourself that if (X -> Y) and (Y -> X) are both true, then (X <-> Y) must be true too.
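Trying all the possible combinations, as suggested above, takes only a few lines of Python. This is a minimal sketch; the helper name implies is an assumption of this example.

```python
def implies(x, y):
    # (X -> Y) is false only when X is true and Y is false.
    return (not x) or y

# Keep the rows where (X -> Y) and (Y -> X) are both true.
rows = [(x, y)
        for x in (True, False)
        for y in (True, False)
        if implies(x, y) and implies(y, x)]
print(rows)

# In every surviving row X and Y have the same truth value,
# which is exactly the condition for (X <-> Y).
assert all(x == y for x, y in rows)
```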

8. <-> Elimination The next rule is called "<-> elimination." This one says that if you have (X <-> Y) And you have X Then you are entitled to Y OR If you have (X <-> Y) And you have Y Then you are entitled to X This makes sense because if you know (X <-> Y) then you know that X and Y have the same truth value. So if you know one of them is true, then the other must also be true. A New Derivation

Let's start a new derivation.
{ 1) (A ^ B)                     [assumption]
  2) A                           [^elim on 1]
  3) B                           [^elim on 1]
  4) (B ^ A)                     [^intro on 2 and 3]
}
5) ((A ^ B) -> (B ^ A))          [->intro on 1-4]
{ 6) (B ^ A)                     [assumption]
  7) B                           [^elim on 6]
  8) A                           [^elim on 6]
  9) (A ^ B)                     [^intro on 7 and 8]
}
10) ((B ^ A) -> (A ^ B))         [->intro on 6-9]
11) ((A ^ B) -> (B ^ A))         [repetition of 5]
12) ((A ^ B) <-> (B ^ A))        [<->intro on 10 and 11]

9. ~ Introduction Next we have "~ introduction." It says that if you assume X And then you derive a contradiction, you are entitled to discharge the assumption and write ~X A contradiction is any sentence Y followed on the next line by the negation of that sentence ~Y This rule is the familiar "reductio ad absurdum." An easy way to think of it is this. If we assume ~F [The killer does not have red hair]

And we prove from that A [The dog is brown] and ~A [The dog is not brown] then something is wrong with our assumption.

10. ~ Elimination "~ elimination" is almost identical. It says that if you assume ~X and derive a contradiction, then you are entitled to discharge the assumption and write X A Quick Derivation
{ 1) (A ^ ~A)         [assumption]
  2) A                [^elim on 1]
  3) ~A               [^elim on 1]
}
4) ~(A ^ ~A)          [~intro on 1-3]

Lastly, let's look at the rules for v.

11. v Introduction The first is "v introduction." It says that if you have X then you are entitled to write (X v Y) no matter what Y is. That seems a little strange. Normally you wouldn't think you can just go throwing any old sentence into a derivation. But remember (X v Y)

is true as long as X is true OR Y is true OR both are true. So if you already know that X is true, then the disjunction of X and anything else will be true. A Short Derivation
{ 1) A                    [assumption]
  2) A                    [repetition of 1]
}
3) (A -> A)               [->intro on 1-2]
4) ((A -> A) v B)         [vintro on 3]


12. v Elimination The last rule is a little tricky. It's "v elimination." It says if you have (X v Y) and you have (X -> Z) and you have (Y -> Z) Then you are entitled to Z [Most of the time this means that when you have a disjunction that you don't know what to do with, you have to derive an implication for each side of the disjunction before you can go on.] The rule is hard to do with derivations, but it is actually not too hard to understand if you take an example. Let's say we know (A v B) [The dog is brown or the dog weighs 15 lbs] And we know (A -> E) [If the dog is brown, then the killer is a man] And we know

(B -> E) [If the dog weighs 15 lbs, the killer is a man] Then we don't have to bother figuring out whether A is true or B is true; either way we are entitled to E [The killer is a man] Continuing Our Last Derivation Let's continue our last derivation to get a demonstration of "velim."
{ 1) A                               [assumption]
  2) A                               [repetition of 1]
}
3) (A -> A)                          [->intro on 1-2]
4) ((A -> A) v B)                    [vintro on 3]
{ 5) (A -> A)                        [assumption]
  { 6) C                             [assumption]
    7) C                             [repetition of 6]
  }
  9) (C -> C)                        [->intro on 6-7]
}
10) ((A -> A) -> (C -> C))           [->intro on 5-9]
{ 11) B                              [assumption]
  { 12) C                            [assumption]
    13) C                            [repetition of 12]
  }
  14) (C -> C)                       [->intro on 12-13]
}
15) (B -> (C -> C))                  [->intro on 11-14]
16) ((A -> A) v B)                   [repetition of 4]
17) ((A -> A) -> (C -> C))           [repetition of 10]
18) (B -> (C -> C))                  [repetition of 15]
19) (C -> C)                         [velim on 16, 17, 18]

[Not the most efficient way to prove (C -> C), but it is valid.]

There are a lot of other rules people try to tell you, but anything you can do with those, you can do with these 12 rules.

Why These 12 Rules? A Review
The reason I like these rules is that with these rules you can do any derivation using the same five steps:
Step 1: Find the main connective of the sentence you are trying to derive.
Step 2: Apply the rule for introducing that main connective.
Step 3: When you're in the middle of a derivation and you don't know what to do, find the main connective of the sentence you have and eliminate it.
Step 4: Along the way you may have to derive subsentences using steps 1 through 3.
Step 5: If all else fails, you may have to do a "~ elimination" [I'll explain this step a little later].

If you use those five steps, you should always know which rule to use. The reason is that there are ONLY four things you are ever allowed to do in a derivation:
1. Eliminate the main connective of the sentence you are on.
2. Use the sentence you are on to eliminate the main connective of another sentence (AS LONG AS THAT OTHER SENTENCE ISN'T CLOSED OFF IN CURLY BRACKETS).
3. Repeat an earlier line that isn't closed off in curly brackets.
4. Make a new assumption.

Mundane Rules: What Do You Have? Now that we have the steps for doing derivations, let me try to explain that confusing business about discharging assumptions. I'm going to approach this from a slightly different angle this time. Of the 12 rules I gave you, 8 are pretty straightforward. They are what I would call the "Mundane Rules." The way Mundane Rules work is: they say "if you have X and Y and Z, then you are entitled to U." The tricky thing with Mundane Rules is knowing what you "have." You "have" any sentence that is written down on a line of the derivation except those which are closed off in curly brackets (which are gone forever once the brackets close). Being "entitled" to something just means that you can legally write it down as the next line of the derivation. The easiest Mundane Rule is repetition: If you have X then you are entitled to X Another Mundane Rule is ^ introduction: If you have X and you have Y then you are entitled to (X ^ Y) Simple enough. (I went into more detail on _why_ this is a sound rule in the last e-mail.) Another pretty easy Mundane Rule is ^ elimination: If you have (X ^ Y) then you are entitled to X

Or, if you prefer, you are also entitled to Y So far so good. Another Mundane Rule is -> elimination: If you have X and you have (X -> Y) then you are entitled to Y This is actually the same thing as Modus Ponens, so you can call it that if you prefer. Since I don't speak Latin, I prefer calling it "-> elimination" because that is more descriptive of what the rule is doing. Another Mundane Rule is <-> introduction: If you have (X -> Y) and you have (Y -> X) then you are entitled to (X <-> Y) This one is a little tricky to explain. Let's assume somehow we have (X -> Y) Under what conditions could that be true? There are 3 possibilities: X is true and Y is true X is false and Y is true X is false and Y is false Also, we have (Y -> X) That can only be true under these conditions: X is true and Y is true X is true and Y is false X is false and Y is false Since we "have" both of these sentences, then they must both be true. So under what conditions are they both true? Well, only these two:

X is true and Y is true X is false and Y is false Which are exactly the conditions for: (X <-> Y) Which means we are entitled to write that. Another Mundane rule is <-> elimination: If we have (X <-> Y) and we have X then we are entitled to Y OR: If we have (X <-> Y) and we have Y then we are entitled to X Another Mundane Rule is v introduction: If you have X then you are entitled to (X v Y) and you are also entitled to (Y v X) This is a little tricky too. We "have" X, which means X must be true. Now we can just, out of the blue, pick any sentence we like and put it into a disjunction with X. Why can we do that? Well, let's say we pick a FALSE sentence. Is that still okay? Yes, it is! Even if Y is false, the disjunction with X is still true, so we haven't written a false sentence, and we are still okay. The last Mundane Rule is v elimination: If you have (X v Y) and you have

(X -> Z) and you have (Y -> Z) then you are entitled to Z So much for the Mundane Rules. Mundane Rules are useful in derivations because they let you move from one step to the next. They tell you what you can do with the sentences you have. They also can give you a hint as to what you need to do next. For example, if you have (X v Y) and you want to eliminate the v, but you don't have (X -> Z) (Y -> Z) yet, then you'd better go get those two sentences. The problem with the Mundane Rules is that they only let you play around with sentences you already HAVE. You can't get anything NEW out of them. So far we've gone over the 8 Mundane Rules. There are 12 rules in total. Of the 4 remaining, one is a 'Special Rule' and three are 'Fun Rules'.

Special Rule The Special Rule is the rule of assumptions: You are free to assume anything you like at any time as long as you do these things: 1. Use curly brackets and indentation to keep track of what you have assumed. 2. Only discharge the assumption using one of the Fun Rules. [Discharging an assumption just means you close the curly brackets and stop indenting. So you can forget about the assumption.]
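The bookkeeping behind curly brackets can be sketched as a stack: opening a bracket starts a new scope, and closing it throws away every line written inside. This is only an illustration under that assumption; the function name lines_you_have and the encoding of a derivation as a list of "{", "}", and line numbers are invented for this example.

```python
def lines_you_have(steps):
    """Return the line numbers still usable at the end of a derivation.

    steps is a sequence of "{" (new assumption), "}" (discharge) and
    line numbers, in the order they are written down.
    """
    scopes = [[]]  # the outermost scope holds lines with no open assumption
    for step in steps:
        if step == "{":
            scopes.append([])
        elif step == "}":
            if len(scopes) == 1:
                raise ValueError("closing a bracket that was never opened")
            scopes.pop()  # lines between the brackets are gone forever
        else:
            scopes[-1].append(step)
    return [line for scope in scopes for line in scope]

# Bracket structure of the ten-line derivation from earlier:
# { 1 2 } 3 { 4 5 6 } 7 { 8 9 10
steps = ["{", 1, 2, "}", 3, "{", 4, 5, 6, "}", 7, "{", 8, 9, 10]
print(lines_you_have(steps))
```

Lines 8 through 10 are still usable here only because the assumption on line 8 has not been discharged yet.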

Three 'Fun' Rules The three Fun Rules all have this form: If you assume X and then, on that assumption, you derive Y

You can discharge the assumption you made at X and then you are entitled to
Z
The first Fun Rule is -> introduction: If you assume
X
And then, on that assumption, you derive
Y
You can discharge the assumption you made at X and then you are entitled to
(X -> Y)
This is, in my opinion, the most important and fundamental rule in logic. It is the foundation of all logic. [It's also really important for derivations. If you look up at the Mundane Rules, a lot of them require you to have sentences of the form (X -> Y) to apply them.] The justification is that if you assume (but don't prove)
A [The dog is brown]
and then, on that assumption, you derive
P [The floor is wet]
then you HAVE NOT proven that the floor is wet, but you have PROVEN (no assumptions required) that
(A -> P) [If the dog is brown, then the floor is wet.]
The last two Fun Rules are closely related. One is ~ introduction: If you assume
X
And then, on that assumption, you derive
Y
and
~Y
you can discharge the assumption you made at X; then you are entitled to
~X
[You may have been taught this rule as a reductio ad absurdum.] The idea is that if assuming X leads you to a contradiction, then there must've been something contradictory ABOUT X ITSELF. So X must be false. If X is false, then by definition ~X is true: no ifs, ands, buts, or assumptions about it.

The last Fun Rule is ~ elimination: If you assume ~X and then, on that assumption, you derive Y and ~Y you can discharge the assumption you made at ~X and then you are entitled to: X The idea here is basically the same. ~X is contradictory and therefore false, so X is proven true. I have another nickname for ~ elimination. It is what I call the "Fallback Rule." With every other rule, the way it works is by getting what you want by introducing the main connective or by using what you have by eliminating the main connective. But take a look back at what I just did with ~ elimination. I just proved X is true. There's no way to tell by looking at X that you can prove it by eliminating a ~, but you can. [Actually, ANY sentence you like can be proven with ~ elimination, but it is sometimes hard to do.] So this is where Step 5 of the derivations comes from. If you are trying to prove some sentence X, the first thing to try is to try to introduce the main connective of X. But if you run into a dead end doing that, then assume ~X and try to derive a contradiction. Those are the rules reviewed and better organized so that they make sense, and the difficult bit about discharging assumptions is (I hope) a little clearer. One other thing to watch out for. Some logic problems ask you to prove that a certain sentence is a logical truth. On those problems, you have to discharge all your assumptions and prove that the sentence is true with no assumptions (that is, write it without indenting and outside of all the curly brackets). I'll do an example of a derivation like that in a minute.

Deriving a Conclusion
Other logic problems give you a list of "givens" or "hypotheses" and ask you to derive a conclusion from them. In those problems, what they are saying is that you need to assume the hypotheses, but not discharge those assumptions. Let me give you an example: Given: A (A -> B)

(~B v C) Prove: C Here we go:
{ 1) A                    [assumption, given]
  { 2) (A -> B)           [assumption, given]
    { 3) (~B v C)         [assumption, given]
      4) A                [repetition of 1]
      5) (A -> B)         [repetition of 2]
      6) B                [->elim on 4&5]

Now what do we do? We have a disjunction on 3 that we don't know what to do with, so we need to eliminate it. But in order to eliminate it, we need to get (~B -> something) and (C -> something). Let's first work on getting (~B -> something).
      { 7) ~B             [assumption] [Note: this assumption isn't given, so we're going to have to discharge it]

Let's see if we can derive C. If we derive (~B -> C) then we'll be most of the way to finishing the problem. How can we derive C? Well, we should try to introduce the main connective. But wait! There is no main connective. C is just a simple sentence. So what can we do? I guess we have to do step 5, try ~elimination.
        { 8) ~C           [assumption] [another assumption we'll have to discharge]
          9) B            [repetition of 6]
          10) ~B          [repetition of 7]
        }                 [Closes off lines 8-10]
        11) C             [~elim on 8-10]

Up to this point we haven't closed off any assumptions. That means that all of the lines up to this point were sentences that we "have" and can use. But now we just closed off lines 8-10 by discharging the assumption at 8. That means that lines 8-10 are gone; they are off-limits and illegal forever. The good news, though, is that we derived C, so now we can discharge the assumption we made at line 7.
      }                   [Closes off 7-11]
      12) (~B -> C)       [->intro on 7-11]

This may seem like a bit of sleight of hand, like I'm trying to pull the wool over your eyes. How can I use the assumption I made at line 7 as part of the contradiction? I just did a ~elimination to prove C, but there was nothing contradictory about ~C itself; the contradiction was that I had B and then I assumed ~B. This is the familiar refrain "anything can be proven from a contradiction." Once I assumed ~B, I could've proven (~B -> anything-I-want), I chose to prove (~B -> C) because I eventually want to get C. Now we have made some progress on eliminating the disjunction we had on line 3: (~B v C). We have (~B -> C), now we need (C -> C), so let's go get it.
      { 13) C             [assumption]
        14) C             [repetition of 13]

Notice that I can repeat 13 because I have not yet discharged that assumption. However, I cannot repeat the C on line 11 because I closed off line 11 already, so it is gone forever.
      }                   [Closes off 13-14]
      15) (C -> C)        [->intro on 13-14]
      16) (~B v C)        [repetition of 3]
      17) (~B -> C)       [repetition of 12]
      18) (C -> C)        [repetition of 15]
      19) C               [velim on 16,17,&18]

Not all of those repetitions were necessary, since we "had" those lines already (they hadn't been closed off), but I added them for clarity. You should go back and double-check the derivation to make sure that I never broke the rules by using a line that was closed off and that I didn't break any other rules. Also, make sure that I discharged all the assumptions except the three I was given at the start.
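One way to double-check the result, separate from checking the derivation line by line, is to confirm by truth table that the three givens really do force C to be true. The sketch below is an illustration only; the function name entails and the encoding of sentences as Python lambdas are assumptions of this example.

```python
from itertools import product

def implies(x, y):
    return (not x) or y

# The three givens and the conclusion from the problem above.
givens = [
    lambda A, B, C: A,              # A
    lambda A, B, C: implies(A, B),  # (A -> B)
    lambda A, B, C: (not B) or C,   # (~B v C)
]
conclusion = lambda A, B, C: C      # C

def entails(givens, conclusion, n_letters=3):
    """True if no row makes every given true and the conclusion false."""
    for values in product([True, False], repeat=n_letters):
        if all(g(*values) for g in givens) and not conclusion(*values):
            return False
    return True

print(entails(givens, conclusion))      # all three givens
print(entails(givens[1:], conclusion))  # without the given A
```

Dropping the given A breaks the entailment, which matches the derivation: line 6 (B) depends on it.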

Deriving a Sentence
As I said before, the other type of problem is where you are handed a sentence and told to derive it. In this problem, you can make any assumptions you need, but you have to discharge all of them and end up with the sentence you're looking for at the end. This usually involves finding the main connective of the sentence you're supposed to prove and then introducing it (sometimes you have to find other sentences too: for example if the main connective is ^, you need to prove each half of the sentence and then do an ^intro). Sometimes trying to do this will get you to a dead end, and then you may try to assume the negation of the sentence you're trying to get and see if you can find a contradiction. Let's do an example. Let's try to prove (((~A ^ ~C) v (~C <-> B)) -> (B -> ~C)) The main connective is a ->, so let's introduce it. To do that we need to assume the left and derive the right.
{ 1) ((~A ^ ~C) v (~C <-> B)) [assumption]

Here we have a disjunction, so we need to eliminate it. That means we need to find two entailments. It would be great if we had ((~A ^ ~C) -> (B -> ~C)) and ((~C <-> B) -> (B -> ~C)), so let's try to get those. First, let's work on ((~A ^ ~C) -> (B -> ~C)).
  { 2) (~A ^ ~C)                       [assumption]
    3) ~A                              [^elim on 2]
    4) ~C                              [^elim on 2]

We actually don't need line 3, but it's good practice to get both sides of an ^ while you can just in case you might need them later. Now we have ~C, but what we want is (B -> ~C), so let's work toward getting that.
    { 5) B                             [assumption]
      6) ~C                            [repetition of 4]
    }                                  [closes 5-6]
    7) (B -> ~C)                       [->intro on 5-6]
  }                                    [closes 2-7]
  8) ((~A ^ ~C) -> (B -> ~C))          [->intro on 2-7]

So now what we need to finish the velimination on line 1 is to derive ((~C <-> B) -> (B -> ~C)).
  { 9) (~C <-> B)                      [assumption]

So now we need (B -> ~C).
    { 10) B                            [assumption]
      11) ~C                           [<->elim on 9-10]

If you wanted to, you could make this a little more clear by repeating (~C <-> B) and then doing the <->elim, but it's not necessary since we have both (~C <-> B) and B we are entitled to ~C.
    }                                  [closes 10-11]
    12) (B -> ~C)                      [->intro on 10-11]
  }                                    [closes 9-12]
  13) ((~C <-> B) -> (B -> ~C))        [->intro on 9-12]

Now we can do our velimination on line 1 because we have ((~C <-> B) -> (B -> ~C)) and ((~A ^ ~C) -> (B -> ~C)). If you want, for clarity, you can repeat line 1 and line 8, but it's not necessary.
  14) (B -> ~C)                        [velim on 1,8,13]
}
15) (((~A ^ ~C) v (~C <-> B)) -> (B -> ~C))        [->intro on 1-14]

As always, you should go back and double-check the derivation once you're done. The hardest thing about doing derivations is figuring out what to do next. When you have a lot of random rules with Latin names to choose from, it's difficult. This set of rules helps you to know what to do by either introducing what you're trying to get or eliminating what you have. My advice to you is to try to do some of the derivations in your book or that you had for your class using these rules. Any derivation is possible with them. It takes a lot of time to learn logic and have it sink in, but if you take it slowly enough and practice, it will become easier. Get comfortable with these 12 basic rules and the 5-step method for doing derivations.
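A derived sentence with all assumptions discharged should come out true under every assignment of truth values, so a truth-table check is a useful sanity test after a long derivation. The sketch below evaluates sentences written in the plain-text notation used here. It is an illustration, not a proof checker: the names evaluate and is_logical_truth are invented for this example, and it assumes fully parenthesized sentences built from capital letters, ~, ^, v, -> and <->.

```python
import re
from itertools import product

# Tokens of the plain-text notation: connectives, brackets, capital letters.
TOKEN = re.compile(r"<->|->|[()~^v]|[A-Z]")

def evaluate(sentence, values):
    """Evaluate a fully parenthesized sentence under a truth assignment."""
    tokens = TOKEN.findall(sentence)
    position = 0

    def parse():
        nonlocal position
        token = tokens[position]
        position += 1
        if token == "~":
            return not parse()
        if token == "(":
            left = parse()
            connective = tokens[position]
            position += 1
            right = parse()
            position += 1  # skip the closing ")"
            if connective == "^":
                return left and right
            if connective == "v":
                return left or right
            if connective == "->":
                return (not left) or right
            if connective == "<->":
                return left == right
            raise ValueError("unexpected connective: " + connective)
        return values[token]  # a simple sentence letter

    return parse()

def is_logical_truth(sentence):
    """True if the sentence is true in every row of its truth table."""
    letters = sorted(set(re.findall(r"[A-Z]", sentence)))
    return all(
        evaluate(sentence, dict(zip(letters, row)))
        for row in product([True, False], repeat=len(letters))
    )

print(is_logical_truth("(((~A ^ ~C) v (~C <-> B)) -> (B -> ~C))"))
print(is_logical_truth("((A ^ B) -> A)"))
print(is_logical_truth("(A -> B)"))
```

The check only says whether a sentence is a logical truth; it does not show how to derive it, which is what the 12 rules are for.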

Rules with Latin Names
In many (probably most) places, you don't learn logic with these 12 rules. Even though I think these rules make the most sense and allow a straightforward approach to solving any problem in symbolic logic, more advanced students may want to study rules such as Modus Tollens and DeMorgan's Law. The twelve rules I've presented here are systematic and straightforward, and all of them move by baby steps. The way I think of the shortcut rules (in some cases this is not historically accurate) is that logicians noticed that when doing derivations, they often repeated the same steps over and over. Eventually, someone decided that rather than doing these same five or ten steps, you can take shortcuts. Modus Tollens serves as an instructive example. Let's say I have:
(A -> B)
And I have:
~B
And I am trying to get:
~A
Here's what I'd have to do. Since I'm trying to find ~A, I'll do a ~introduction on A:
{ 1) (A -> B)             [assumption, given]
  { 2) ~B                 [assumption, given]
    { 3) A                [new assumption]
      4) B                [->elim on 1 and 3]
      5) ~B               [repetition of 2]
    }                     [closes off 3-5]
    6) ~A                 [~intro 3-5]

We have to do this so often that we just call these five steps "Modus Tollens." Modus Tollens is a shortcut rule. There are several others too, some more involved.

The following is a list of the major rules, together with a justification of why each of them is valid and a short example of how you might use some of the more challenging ones.

1. Modus Ponens This is one of the most straightforward laws in logic. It states that if you have (X -> Y) and you have X then you are entitled to: Y This is just what we've been calling "-> elimination." The reason it works is that we are given (X -> Y). Which means that X cannot be true at the same time Y is false. So if X is true (which is the other given), then Y must be true as well, so we are free to conclude Y is true. Example: "If it is raining, then there are clouds" and "it is raining" together imply "there are clouds."

2. Modus Tollens This law is just the flip side of modus ponens. It states that if you have (X -> Y) and you have ~Y then you are entitled to ~X

The reason this works is that we are again given (X -> Y). This means that X cannot be true at the same time Y is false. So if Y is false (which is the other given), then X must be false as well. So we are free to conclude X is false (or ~X is true). Example: "If it is raining, then there are clouds" and "there are no clouds" together imply "it is not raining."

3. DeMorgan's Law (I) DeMorgan came up with a couple sets of equivalencies. The first is that if you have ~(X ^ Y) then you can conclude (~X v ~Y) and if you have (~X v ~Y) then you can conclude ~(X ^ Y) The reason this works is that our starting point is ~(X ^ Y), which is the negation of (X ^ Y). Now, (X ^ Y) can only be true if X is true and Y is also true. So (X ^ Y) will be false if X is false or if Y is false. That is, (X ^ Y) will be false if (~X v ~Y) is true. So ~(X ^ Y) is equivalent to (~X v ~Y). Example: "My dog is fat, or my cat is fat" is equivalent to "It is not true that both my dog and cat are thin."

4. DeMorgan's Law (II) The second equivalence which bears DeMorgan's name is that ~(X v Y) is interchangeable with (~X ^ ~Y)

The only way in which ~(X v Y) can be true is if X and Y are both false. So the two expressions can be interchanged just like in the first law. Example: "My dog is fat and my cat is fat" is equivalent to "It is not true that my dog or cat is thin."

5. Hypothetical Syllogism The rule here is that if you have (X -> Y) and you have (Y -> Z) then you can conclude (X -> Z) Here's why: We know that "if X is true, then Y is true." And we know that "if Y is true, then Z is true." But we don't know anything about whether any of the letters are actually true or not. Let's assume (or hypothesize) for a second that X is true. Then, by modus ponens, Y is true. And then by modus ponens again, Z is true. So: If we assume X is true, then we conclude Z is true. Since we didn't know X was true, we cannot take Z home with us, but we can say that "If X was true, then Z would be true." This is equivalent to saying "If X, then Z" or (X -> Z). Example: "If it is raining, then there are clouds" together with "if there are clouds, then the sun will be blocked" imply "if it is raining, then the sun will be blocked."

6. Disjunctive Syllogism The rule here is that if you have (X v Y) and you have ~X

then you can conclude Y Here's why: We know first of all that "X or Y is true." We also know that X is false. If X or Y is true, and X is false, then Y has no choice but to be true. So we can conclude that Y is true. Example: "My dog is fat or my cat is fat" together with "my dog is thin" imply "my cat is fat."

7. Reductio Ad Absurdum (Proof by Contradiction) This rule states that if you assume X and, from that, you conclude a contradiction, such as (Y ^ ~Y) then you can conclude that your assumption was false, and ~X must be true. You can find a more complete explanation of this at Proof by Contradiction http://mathforum.org/library/drmath/view/62852.html

8. Double Negation This rule simply states that if you have ~~X then you can interchange that with X which should be apparent based on what the ~ is.


9. Switcheroo (I've heard that this was actually named after a person, but I don't know that for certain.) This is a shortcut rule which states that if you have (X v Y) then you can interchange that with (~X -> Y) To understand why, let's think about (~X -> Y). This says that ~X cannot be true at the same time that Y is false. Or, to put that another way, X cannot be false at the same time Y is false. So (~X -> Y) can only be false when X and Y are both false. Similarly, the only way for (X v Y) to be false is to have X and Y both false. So the two expressions are true unless X and Y are both false, so they have the same "truth conditions" and are therefore equivalent (i.e. interchangeable). Example: "My dog is fat, or my cat is fat" is equivalent to "If my dog is thin, then my cat is fat." (This one is hard to wrap your mind around, but think about what must be true/false about the world in order to make each statement true or false and it should eventually become clear.)

10. Disjunctive Addition This is just what we've been calling "v introduction."

11. Simplification This is just what we've been calling "^ elimination."

12. Rule of Joining This is just what we've been calling "^ introduction."

The problem with shortcut rules is that they're easy to misuse. In my opinion, the best way to learn them is to practice with the twelve systematic rules and if you find yourself doing the same steps over and over, you may have found a shortcut rule. If there's a rule you don't understand, try to use the twelve systematic rules to figure out how the rule works. Once you see the steps in deriving the rule and you know why it is a valid shortcut, you won't have any trouble using it. And remember, if you get stuck and don't know what to do, you can always fall back on the twelve systematic rules.

Fuzzy or "multi-valued" logic is a variation of traditional logic in which there are many (sometimes infinitely many) possible truth values for a statement. True is considered equal to a truth value of 1, false is a truth value of 0, and the real numbers between 0 and 1 are intermediate values.

What is Fuzzy Logic?

The easy definition is that fuzzy logic is a kind of logic in which propositions don't have to be either true or false. In normal binary logic, the answer to a question like "Is Joe tall?" would have to be either "yes" or "no" - either 1 or 0. In terms of attributes, Joe would either have "tallness," or he wouldn't. This is one of the things that makes binary logic break down so easily when you try to apply it to the real world, where people are "sort of tall," food is "mostly cooked," cars are "pretty fast," jewelry is "very expensive," patients are "barely conscious," and so on. To paraphrase Einstein, to the extent that binary logic applies to reality, it is not certain; and to the extent that it is certain, it doesn't apply to reality.

In fuzzy logic, Joe can have a tallness value of, say, 0.9, which can combine with values for other attributes to produce "conclusions" that look more like "The clothes have a dryness value of 0.91" than "The clothes are dry" or "The clothes are not dry." Fuzzy logic is frequently used to control physical processes - washing or drying clothes, toasting bread, bringing trains to smooth stops at the right places, keeping planes on course, and so on. It's also used to support decisions such as whether to buy or sell stock, whether to support or oppose a particular political position, and so on.

In short, fuzzy logic provides an alternative to binary logic that is useful whenever you want to be able to express attributes in shades of gray, rather than as black or white.

As for the more advanced question of connectives in fuzzy logic, here are what I believe are the generally accepted rules. You start with the basic connectives in symbolic logic:

^ (and)
v (or)
-> (if, then)
<-> (if and only if)
~ (not)

And extend them.

Let's start with ^ (and). In Boolean logic (A ^ B) is true if and only if both A is true and B is true. We can say that in a different way: (A ^ B) has a value of 1 if A has a value of 1 and B has a value of 1; if either A or B has a value of 0, then the conjunction has a value of 0. So the way this is traditionally extended to fuzzy logic is to say that the conjunction (A ^ B) carries the minimum of the truth values of A and B. For example, if A has a truth value of 1 and B has a truth value of 0.8, then the minimum of these two is 0.8 and the conjunction (A ^ B) will carry the truth value of 0.8.

Next, let's talk about v (or). In Boolean logic (A v B) is true if A is true or if B is true or if both are true. We can say that in a different way: (A v B) has a value of 1 if A has a value of 1 or B has a value of 1; if both A and B have a value of 0, then the disjunction has a value of 0. So the way this is traditionally extended to fuzzy logic is to say that the disjunction (A v B) carries the maximum of the truth values of A and B. For example, if A has a truth value of 0.2 and B has a truth value of 0.6, then the maximum of these is 0.6 and the disjunction (A v B) will carry the truth value of 0.6.

Next, let's talk about ~ (not). Unlike the other connectives, ~ doesn't join two sentences; it is only applied to a single sentence. In Boolean logic, ~A carries the opposite truth value of A. So it is false if A is true and true if A is false. Or: ~A has a value of 0 if A has a value of 1 and a value of 1 if A has a value of 0. The way this is extended to fuzzy logic is to say that ~A has a truth value equal to 1 minus the truth value of A. So, for example, if A has a value of 0.3, then 1 - 0.3 = 0.7, so ~A has a value of 0.7.

Next, let's talk about -> (if, then). In Boolean logic, (A -> B) is ONLY false if A is true AND B is false; in all other cases it is true. Or, (A -> B) carries a value of 1 UNLESS A has a value of 1 and B has a value of 0. So, if A has a value of 0, then (A -> B) will definitely have a value of 1; or if B has a value of 1 then (A -> B) will definitely have a value of 1. So (A -> B) is at least as true as the opposite of A, and it is also at least as true as B. So in Boolean logic, (A -> B) has a truth value equal to the maximum of ~A and B. The fuzzy logic described here uses this definition. The truth value of (A -> B) is equal to the maximum of the truth value of ~A and the truth value of B. For example, if B has a truth value of 0.5 and A has a truth value of 0.4, then ~A has a value of 0.6. The maximum of B (0.5) and ~A (0.6) is 0.6 so (A -> B) will have a value of 0.6.

The last connective, <-> (if and only if), is the hardest to extend to fuzzy logic. In Boolean logic, (A <-> B) is true if both A and B have the same truth value, and false otherwise. Or you can say: (A <-> B) has a value of 1 if A and B both have a value of 1 or if A and B both have a value of 0, and it has a value of 0 otherwise (i.e. if A has a value of 0 and B has 1 or vice versa).

My first guess for how to apply this to fuzzy logic was that it is simply an equals relation. So if A and B have equal values, then (A <-> B) will have a value of 1; otherwise it will have a value of 0. This is okay as a first guess, but the problem is that now (A <-> B) can only have 1 or 0 as a value, which is not really very fuzzy at all.

Let's go back to Boolean logic for a minute and think a bit more about (A <-> B). This is read "A if and only if B." It means that if we know A is true, then we can conclude that B is true AND if we know B is true, then we can conclude that A is true. In other words, (A <-> B) is equivalent to the conjunction of (A -> B) and (B -> A) or to put it more formally: (A <-> B) = ((A -> B) ^ (B -> A)) But we've already figured out how to do fuzzy logic on these connectives. So let's just apply that to our <-> connective.
Specifically: (A <-> B) has a value equal to the minimum (conjunction) of: (A -> B) and (B -> A) So first you figure out the value of (A -> B) (which is the maximum of ~A and B), then you figure out the value of (B -> A) (which is the maximum of ~B and A), and then you take the minimum of those. That's pretty complicated, so before we do an example calculation with fuzzy logic, let's make sure it works with two-valued (Boolean) logic.

Let's say that A and B both have truth values of 0. What does (A <-> B) have in this case? First, we take the maximum of ~A and B. ~A will have a value of 1 and B will have a value of 0. So the maximum is 1. Second, we take the maximum of ~B and A. ~B will have a value of 1 and A will have a value of 0. So the maximum is 1. Finally, we take the minimum of the first two steps above. Step one gave us 1 and step two gave us 1, so the minimum is 1.

What if A has a truth value of 1 and B has a truth value of 0? First, we take the maximum of ~A and B. ~A will have a value of 0 and B will have a value of 0. So the maximum is 0. Second, we take the maximum of ~B and A. ~B will have a value of 1 and A will have a value of 1. So the maximum is 1. Finally, we take the minimum of the first two steps above. Step one gave us 0 and step two gave us 1, so the minimum is 0.

For a fuzzy logic example, let's say that A has a value of 0.7 and B has a value of 0.5. What is the value of (A <-> B)? First, we take the maximum of ~A and B. ~A will have a value of 0.3 and B will have a value of 0.5. So the maximum is 0.5. Second, we take the maximum of ~B and A. ~B will have a value of 0.5 and A will have a value of 0.7. So the maximum is 0.7. Finally, we take the minimum of the first two steps above. Step one gave us 0.5 and step two gave us 0.7, so the minimum is 0.5. So in this example (A <-> B) has a truth value of 0.5.
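The min/max rules described above are short enough to write down directly. The following is a minimal sketch of those rules only; the function names (`fuzzy_not`, `fuzzy_and`, and so on) are made up for this example, and other fuzzy systems define some connectives differently, the conditional in particular.

```python
def fuzzy_not(a):
    """~A: one minus the truth value of A."""
    return 1 - a


def fuzzy_and(a, b):
    """(A ^ B): the minimum of the two truth values."""
    return min(a, b)


def fuzzy_or(a, b):
    """(A v B): the maximum of the two truth values."""
    return max(a, b)


def fuzzy_implies(a, b):
    """(A -> B): the maximum of ~A and B, as defined in the text."""
    return max(fuzzy_not(a), b)


def fuzzy_iff(a, b):
    """(A <-> B): the minimum of (A -> B) and (B -> A)."""
    return fuzzy_and(fuzzy_implies(a, b), fuzzy_implies(b, a))


# Two-valued sanity checks, then the fuzzy example from the text.
print(fuzzy_iff(0, 0))      # both false
print(fuzzy_iff(1, 0))      # A true, B false
print(fuzzy_iff(0.7, 0.5))  # the fuzzy case
```

With truth values restricted to 0 and 1, each function reproduces the ordinary Boolean truth table, which is the check the walkthrough above performs by hand. Note that floating-point subtraction can leave tiny rounding errors (for example in `fuzzy_not(0.7)`), so compare results with a small tolerance rather than exact equality when the answer passes through `1 - a`.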