
A survey of machine learning

First edition

by Carl Burch
for the Pennsylvania Governor’s School for the Sciences

 

© 2001, Carl Burch. All rights reserved.


Contents
0 Introduction
  0.1 Data mining
  0.2 Neural networks
  0.3 Reinforcement learning
  0.4 Artificial life

1 Data mining
  1.1 Predicting from examples
  1.2 Techniques
    1.2.1 Linear regression
    1.2.2 Nearest neighbor
    1.2.3 ID3
  1.3 General issues
  1.4 Conclusion

2 Neural networks
  2.1 Perceptron
    2.1.1 The perceptron algorithm
    2.1.2 Analysis
  2.2 Artificial neural networks
    2.2.1 ANN architecture
    2.2.2 The sigmoid unit
    2.2.3 Prediction and training
    2.2.4 Example
    2.2.5 Analysis

3 Reinforcement learning
  3.1 Modeling the environment
  3.2 Q learning
  3.3 Nondeterministic worlds
  3.4 Fast adaptation
  3.5 TD learning
  3.6 Applications
    3.6.1 TD-Gammon
    3.6.2 Elevator scheduling

4 References

Chapter 0
Introduction

Machine learning attempts to tell how to automatically find a good predictor based on past experiences. Although you might argue that machine learning has been around as long as statistics has, it really only became a separate topic in the 1990's. It draws its inspiration from a variety of academic disciplines, including computer science, statistics, biology, and psychology.

In this class we're going to look at some of the significant results from machine learning. One goal is to learn some of the techniques of machine learning, but also, just as significant, we are going to get a glimpse of the research front and the sort of approaches researchers have taken toward this very nebulous goal of automatically finding predictors.

Researchers have approached the general goal of machine learning from a variety of approaches. Before we delve into details about these, let's do a general overview of machine learning research. This also constitutes something of an outline of this text: We'll spend a chapter on each of the following topics.

0.1 Data mining

With the arrival of computing in all facets of day-to-day business, the amount of accessible data has exploded. Employers keep information about employees, businesses keep information about customers, hospitals keep information about patients, factories get information about instrument performance, scientists collect information about the natural world — and it's all stored in computers, ready to access in mass. Data mining is gradually proving itself as an important tool for people who wish to analyze all this data for patterns.

One of the most famous examples is from the late 1970s, when data mining proved itself as potentially important for both scientific and commercial purposes on a particular test application of diagnosing diseases in soybean plants [MC80]. First the researchers found an expert, who they interviewed for a list of rules about how to diagnose diseases in a soybean plant. Then they collected about 680 diseased soybean plants and determined about 35 pieces of data on each case (such as the month the plant was found to have a disease, the amount of recent precipitation, the size of the plant's seeds, the condition of its roots). They handed the plants to the expert to diagnose, and then they performed some data mining techniques to look for rules predicting the disease based on the characteristics they measured. What they found is that the rules the expert gave during the interview were accurate only 70% of the time, while the rules discovered through data mining were accurate 97.5% of the time. Moreover, after revealing these rules to the expert, the expert reportedly was impressed enough to adopt some of the rules in place of the ones given during the interview!

In this text, we'll see a few of the more important machine learning techniques used in data mining, as well as surrounding issues that apply regardless of the learning algorithm.

0.2 Neural networks

Cognitive science aims to understand how the human brain works, and researchers still don't have a good grasp on it. One important part of cognitive science involves simulating models of the human brain on a computer to learn more about the model and also potentially to instill additional intelligence into the computer. Though scientists are far from understanding the brain, machine learning has already reaped the reward of artificial neural networks (ANNs). Here we'll look briefly at human neurons and how artificial neurons model their behavior. Then we'll see how they might be networked together to form an artificial neural network that improves with training.

ANNs are deployed in a variety of situations. One of the more impressive is ALVINN, a system that uses an ANN to steer a car on a highway [Pom93]. ALVINN consists of an array of cameras on the car, whose inputs are fed into a small ANN whose outputs control the steering wheel. The system has been tested going up to 70 miles per hour over 90 miles of a public divided highway (with other vehicles on the road).

0.3 Reinforcement learning

Through this point, we'll have worked exclusively with systems that need immediate feedback from their actions in order to learn. This is supervised learning, which though useful is a limited goal. Reinforcement learning (sometimes called unsupervised learning) refers to a brand of learning situation where a machine should learn to behave in situations where feedback is not immediate. This is especially applicable in robotics situations, where you might hope for a robot to learn how to accomplish a certain task in the same way a dog learns a trick. Reinforcement learning is much more challenging than supervised learning.

Probably the most famous success story from reinforcement learning is TD-Gammon, a program that learns to play the game of Backgammon [Tes95]. TD-Gammon uses a neural network coupled with reinforcement learning techniques. After playing against itself for 1.5 million games, the program learned enough to rank among the world's best backgammon players (including both humans and computers). We'll see a few of the proposed techniques, and how they can be applied in situations like the one that TD-Gammon tackles.

0.4 Artificial life

Finally, artificial life seeks to emulate living systems with a computer, where systems loosely based on evolution are simulated to see what might evolve, thus gaining a greater understanding of how evolution works and what effects it has. They hope to evolve very simple behavior, like that of amoebas, without explicit programming. If you understand learning as improving performance based on past experience, the evolutionary process learns, even outside pseudo-biological simulations — though usually the verb we use is adapt. In a sense, the word learn has a similar denotation to adapt, even if the connotation is different.

Our study of artificial life will concentrate on genetic algorithms, which appear to be a promising technique for learning from past experience. We'll look at this technique, and then we'll look at its use in attempting to evolve simple behaviors.

We'll emerge from this familiar with much of the established machine learning results and prepared to study less polished research.

Chapter 1
Data mining

Data is abundant in today's society. Every transaction, every accomplishment gets stored away somewhere. We get much more data than we could ever hope to analyze by hand. A natural hope is to analyze the data by computer. Data mining refers to the various tasks of analyzing stored data for patterns — seeking clusters, trends, predictors, and patterns in a mass of stored data.

For example, many grocery stores now have customer cards, rewarding frequent users of the grocery store with discounts on particular items. The stores give these cards to encourage customer loyalty and to collect data — for, with these cards, they can track a customer between visits to the store and potentially mine the collected data to determine patterns of customer behavior. This is useful for determining what to promote through advertisement, shelf placement, or discounts.

1.1 Predicting from examples

We'll emphasize a particular type of data mining called predicting from examples. A large banking corporation makes many decisions about whether to accept a loan application or not. On hand is a variety of information about the applicant — age, salary, employment history, credit history — and about the loan application — amount, purpose, interest rate. In this scenario, the bank wants to know whether it should make the loan. Additionally, the bank has the same information about thousands of past loans, plus whether the loan proved to be a good investment or not. Data mining can potentially improve the loan acceptance rate without sacrificing on the default rate, profiting both the bank and its customers.

In predicting from examples, the algorithm has access to several training examples, representing the past history. Each training example is a vector of values for the different attributes measured, and each example has a label. In our banking application, each example represents a single loan application, represented by a vector holding the customer age, customer salary, loan amount, and other characteristics of the loan application; the label might be a rating of how well the loan turned out. The learning algorithm sees all these labeled examples, and then it should produce a hypothesis — a new algorithm that, given a new vector as an input, produces a prediction of its label. (That a learning algorithm produces another algorithm may strike you as odd. But, given the variety of types of hypotheses that learning algorithms produce, this is the best we can do formally.)

In Section 0.1, we briefly looked at some data mining results for soybean disease diagnosis by Michalski and Chilausky [MC80]. Figure 1.1 illustrates a selection of the data they used, each example with four attributes (plant growth, stem condition, leaf-spot halo, and seed mold) and a label (the disease). (I've taken just seven of the 680 plants and just four of the 35 attributes.) There are six training examples here. We would feed those into the learning algorithm, and it would produce a hypothesis that labels any plant with a supposed disease.
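To make the shape of this process concrete, here is a minimal Python sketch of a learning algorithm that returns a hypothesis, which is itself a function. The trivial most-common-label rule and the two made-up plants are placeholders for illustration, not anything from the soybean study.

```python
# A learning algorithm takes labeled examples and produces a hypothesis:
# another function that maps a new attribute vector to a predicted label.
# The rule used here (predict the most common training label) is only a
# placeholder standing in for a real technique like those in Section 1.2.
def learn(examples, labels):
    most_common = max(set(labels), key=labels.count)
    def hypothesis(new_example):
        return most_common
    return hypothesis

h = learn([["normal", "abnormal"], ["abnormal", "abnormal"]],
          ["frog eye leaf spot", "herbicide injury"])
print(h(["normal", "normal"]))   # the hypothesis labels a new plant
```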

The last, Plant Q, representing a new plant whose disease we want to diagnose, illustrates what we might feed to the hypothesis to arrive at a prediction.

            plant     stem       halo on      mold
            growth    condition  leaf spots   on seed   disease
  Plant A   normal    abnormal   no           no        frog eye leaf spot
  Plant B   abnormal  abnormal   no           no        herbicide injury
  Plant C   normal    normal     yes          yes       downy mildew
  Plant D   abnormal  normal     yes          yes       bacterial pustule
  Plant E   normal    normal     yes          no        bacterial blight
  Plant F   normal    abnormal   no           no        frog eye leaf spot
  Plant Q   normal    normal     no           yes       ???

Figure 1.1: A prediction-from-examples problem of diagnosing soybean diseases.

In the soybean data, each of the four attributes has only two possible values (either normal and abnormal, or yes and no). The full data set has some attributes with more possible values (the month found could be any month between April and October; the precipitation could be below normal, normal, or above normal). But each attribute has just a small number of possible values. Such attributes are called discrete attributes.

In some domains, attributes may be numeric instead, which is what we want to look at here. A numeric attribute has several possible values, in a meaningful linear order. Consider the classic data set created by Fisher, a statistician working in the mid-1930s [Fis36]. Fisher measured the sepals and petals of 150 different irises, and labeled them with the specific species of iris. For illustration purposes, we'll just work with the five labeled examples of Figure 1.2 and the single unlabeled example whose identity we want to predict.

These two examples come from actual data sets, but they are simpler than what one normally encounters in practice. Normally, data has many more attributes and several times the number of examples. Moreover, some data will be missing, and attributes are usually mixed in character — some of them will be discrete (like a loan applicant's state of residence) while others are numeric (like the loan applicant's salary). But these things complicate the basic ideas, so we'll keep with the idealized examples.

1.2 Techniques

We're going to look at three important techniques for data mining. The first two — linear regression and nearest-neighbor search — are best suited for numeric data. The last, ID3, is aimed at discrete data.

[Figure 1.2: A prediction-from-examples problem of Iris classification. Each of Iris A (setosa), Iris B (versicolor), Iris C (versicolor), Iris D (virginica), Iris E (virginica), and Iris Q (species ???) is described by its sepal length, sepal width, petal length, and petal width.]

1.2.1 Linear regression

Linear regression is one of the oldest forms of machine learning. It is a long-established statistical technique that involves simply fitting a line to some data.

Single-attribute examples

The easiest case for linear regression is when the examples have a single numeric attribute and a numeric label; we'll look at this case first. For example, we might want to predict the petal width of an iris given its petal length using the data of Figure 1.2.

Say we have $n$ examples, where the attribute for each example $i$ is called $x_i$ and the label is called $y_i$. We can envision each example as being a point in two-dimensional space, with an $x$-coordinate of $x_i$ and a $y$-coordinate of $y_i$. Figure 1.3(a) graphs the points. Linear regression would seek the line $y = mx + b$ (specifying the prediction for any given $x$) that minimizes the sum-of-squares error for the training examples,

  $\sum_i \bigl(y_i - (m x_i + b)\bigr)^2 .$

The quantity $y_i - (m x_i + b)$ is the distance from the value predicted by the hypothesis line to the actual value — the error of the hypothesis for this training example. (Squaring this value gives greater emphasis to larger errors and saves us dealing with complicated absolute values in the mathematics.)

[Figure 1.3: Fitting a line to a set of data — (a) the training points plotted as petal width against petal length; (b) the same points with the best-fit line.]

With a little bit of calculus, it's not too hard to compute the exact values of $m$ and $b$ that minimize the sum-of-squares error. We'll skip the derivation and just show the result. Let $\bar{x}$ be the average attribute value ($\frac{1}{n}\sum_i x_i$) and $\bar{y}$ be the average label value ($\frac{1}{n}\sum_i y_i$). The optimal choices for $m$ and $b$ are

  $m = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad b = \bar{y} - m\,\bar{x} .$

For the irises, each example is a point whose $x$-coordinate is the petal length and whose $y$-coordinate is the petal width. Here $x_1$ is the petal length of Iris A and $y_1$ is its petal width; we get similar data from Iris B, Iris C, Iris D, and Iris E. Knowing these, we compute $\bar{x}$ and $\bar{y}$, and from them $m$ and $b$. So our hypothesis is that, if the petal length is $x$, then the petal width will be $mx + b$. (See Figure 1.3(b).) On Iris Q, with its petal length, this hypothesis predicts a petal width not too far from the actual value. (On the other hand, the hypothesis would predict a negative petal width for a very short petal, which is clearly not appropriate.)
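Here is a minimal sketch of the single-attribute least-squares fit in Python. The petal measurements are made-up stand-ins rather than Fisher's actual values, and the function name is just for illustration.

```python
# Single-attribute linear regression: fit y = b + m*x by least squares.
def fit_line(xs, ys):
    """Return (b, m) minimizing the sum-of-squares error of y = b + m*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    m = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
        sum((x - x_bar) ** 2 for x in xs)
    b = y_bar - m * x_bar
    return b, m

petal_length = [1.4, 4.5, 4.0, 5.6, 5.1]   # hypothetical training attributes
petal_width  = [0.2, 1.5, 1.3, 2.2, 1.9]   # hypothetical training labels

b, m = fit_line(petal_length, petal_width)
print("predicted petal width for length 4.7:", b + m * 4.7)
```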

What we’ll do is say that the label is if the species is versicolor and otherwise.6 Data mining Multiple-attribute examples When each piece of training data is a vector of several attributes. . . Iris B is at the point . For the iris example. . . . The label needs to be numeric. Thus Iris A is at the point . the hypothesis is that the iris is probably versicolor and if the predicted value is small (less than ). we can get around this limitation by simply inserting an additional attribute to every vector that is always (so that for that coordinate is just the constant ). Thus our examples are as follows.. . . . So to compute the best-fit hyperplane for our versicolor labeling. . This is slightly different than the two-dimensional case because this is forced to go through the origin . . . ). . Forcing this keeps the mathematics prettier. What we’re going to do is to expand the single-attribute example by expanding the dimensions of the space. we’ll have to find our set of equa- # 75 w w BguR" £`£`£ Ga0A " u k5 Here we have equations and unknowns (namely. with an -coordinate corresponding to the first attribute. . though the iris example labels aren’t. . with a coordinate for each attribute. .attribute. Thus if the predicted value is large (at least ). Given an example prediction. We’ll look for a -dimensional hyperplane that minimizes the sum-of-squares error. . we could view each labeled example in three-dimensional space. . we would predict . . to refer to # 5 r 1  8 r utsh(   £  £  w w w w w 1 ©£ §£ ©£ ¥£  g ¤  ¨§ ¦¤¢ x( 1 "( C0)' 1 G % 1 U % w U w u " uG " # "# 5 m £ £ w ``£ £ £ ``£ w { £ £ ``£ G …5 " w w U w } " }G ~a" } D " 8 8 8  U U U U 51 R" R" # Y( # # e U 5 1 # " 0# " # Y( e G U w U w TU U w U " G a" U 5 ¨1 # ‰u # " # Y( " e z w w 1T’uR" £``£ G‡"0T)' B £ A( #$0#75 # e 3q120)' " "( r D " w w w G F e 3 1 u £ £ U "( # " # 5 Hw# bvuP" ``£ " G 0)' u w w U " w G |( GG " Ta|( G D €( " # 5 r w % q 5 8 " o 5 3 1 "( ‰7•p7n)¤% 0)'  8 8 y # "  3 yvr  U 8†G…¨„0$" R" # Y( 5 1 G# # e 5 G G e G ¨1 0# " 0# " # Y( 5 G " e G 1 0# fu # " # ~( w w w w w 1 ©£ ¥£ £ £  u & ¦¤¢ ¨© ¨ x( w w w 1 £ £ g ``£  h(  r z # &% % . through hyperplane is the function . . . it turns out that we need to solve 1 D % w u D " 8‚``‘ ‘ ‘ 8 ‘ ‘ ‚``‘ 8 ‘ ‘ ‚``‘ w . . . We’d that minimizes the sum-of-squares error. plus a coordinate for the label.. . and then we have the label. . look for a plane For a general number of attributes . If each example has just two attributes. the best-fit for which we want to make a % # `u # " # e U # # &% R" # e # % ƒ# " # e G 3 3 3 U uk¨uu # " $" # ~( 5 1 # e u 5 1 k¨uu # " 0# " # ~( e G u 5 1 " kgu # fu # " # Y( e 1 "( 20)' G F e 3 1 "( # " # 5 H# @v20)' u as closely approximates the . . . . the problem becomes more complicated. If we want to be able to learn a hyperplane that doesn’t necessarily go through the origin. we’ll view each labeled example as a point in -dimensional space. . . After we solve for the . the -coordinate corresponding to the second attribute. . we’re in -dimensional space. # % to refer to the th example’s th attribute value and the notation We’ll use the notation the th example’s label. and the -coordinate corresponding to the label. To compute this a set of equations. giving us a total of attributes in each example. We’re adding the extra always. the hypothesis is that it is probably not. We want to find a set of coefficients so that the function as possible (still using the sum-of-squares error).

To do that, we need to compute a lot of sums. (You can try to do it by hand, but at this point I broke down and went to a computer.) From these sums we derive a set of five equations with five unknowns. Now we solve this to get the hyperplane hypothesis. For Iris Q, the predicted value comes out large, so linear regression indicates that we are fairly confident that Iris Q is versicolor (as indeed it is).

Analysis

Linear regression is really best suited for problems where the attributes and labels are all numeric and there is reason to expect that a linear function will approximate the problem. This is rarely a reasonable expectation — linear functions are just too restricted to represent a wide variety of hypotheses.

Advantages:
– Has real mathematical rigor.
– Handles irrelevant attributes somewhat well. (Irrelevant attributes tend to get a coefficient close to zero in the hyperplane's equation.)
– Hypothesis function is easy to understand.

Disadvantages:
– Limited to numeric attributes.
– Oversimplifies the classification rule. (Why should we expect a hyperplane to be a good approximation?)
– Difficult to compute.
– Doesn't at all represent how humans learn.
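As a sketch of the multiple-attribute case, the NumPy least-squares routine below finds the coefficient vector minimizing the sum-of-squares error, which amounts to solving the system of equations described above. The iris rows, the 1/0 versicolor labels, and the query values are illustrative placeholders rather than the actual data.

```python
# Multiple-attribute linear regression via least squares, using NumPy.
import numpy as np

# Each row: constant-one attribute, sepal length, sepal width,
# petal length, petal width (made-up values).
X = np.array([
    [1.0, 5.1, 3.5, 1.4, 0.2],
    [1.0, 7.0, 3.2, 4.7, 1.4],
    [1.0, 6.4, 3.2, 4.5, 1.5],
    [1.0, 6.3, 3.3, 6.0, 2.5],
    [1.0, 5.8, 2.7, 5.1, 1.9],
])
y = np.array([0.0, 1.0, 1.0, 0.0, 0.0])   # 1 if versicolor, 0 otherwise

# Minimize the sum-of-squares error ||X m - y||^2 for the coefficients m.
m, *_ = np.linalg.lstsq(X, y, rcond=None)

query = np.array([1.0, 6.1, 2.9, 4.7, 1.4])   # a hypothetical Iris Q
print("predicted label:", query @ m)           # large value suggests versicolor
```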

Doesn’t at all represent how humans learn. r w w U w B ’u y " ``£ " ‰¨0A £ £ y G y " &c3 £  &c3 §£  ©gd¢™3 £  &™3 £  ¢ £ § ¨—3 U U £ U ‰u y vxu # 0b``…8 U 1 vI $0—8 U „‰¨vr0R0( 1 " I "( 8 ‘ ‘ ‘ y " # "( 1 G y " I G# " U ‡’¨Sb8 1 ©£  I £ © ( U ‡”xb8 1 ©£  I £  ( U ‡”xb8 1 ©£  I §£  ( U ‡”xb8 1 ©£  I ©£  ( U ‡”hb8 1 ©£  I ©£  ( U U £ X " I ˜‡u y vxu # " b``n‰X y …I # " b‰X ‰y vI 0# X" X 8 ‘ ‘ ‘ 8 " X 8 G " G y ž " ¤› I y ž " ’› ž  Ÿ ž  œ y# R" w w U w B gu # " ``£ R" 0R0A £ £ # G# " U ‡¨Qs¨Sb8 1 £ § I £ ( U 1 £ § I £ ( ‡¨Qs¨Sb8 U ‡¨Qs¤0b8 1 £ § I £ ¢( U ‡¨Q”¦¤0b8 1 £ § I ¥£ ¢( U ‡¨Q’xb8 1 £ § I §£  ( U &¦¨Q’¨Sb8 1 ¥£ © I ©£ § ( U &¦¨Q”¦¨Sb8 1 ¥£ © I ¥£ © ( U &¦¨Qi¨Sb8 1 ¥£ © I £ § ( U &¦¨Q’¨Sb8 1 ¥£ © I £ © ( U &¦¨Q’¨Sb8 1 ¥£ © I ©£ § ( 3 y Š# š  U ‡¨Q”¨S( 1 £  I £  U ‡¨Q”¨S( 1 £  I £  U ‡¨Q”¨S( 1 £  I £  U ‡¨Qx¨S( 1 £  I £  U ‡¨Q–¦¤0( 1 £  I ¥£ ¢ — ™ ™ ™ ™ ™ – – – – – – Handles irrelevant attributes somewhat well. the nearest-neighbor algorithm would predict that Iris Q has the same label as Iris C: versicolor. example distance label Iris A setosa Iris B versicolor versicolor Iris C Iris D virginica Iris E virginica Since Iris C is the closest. (Irrelevant attributes tend to get a coefficient close to zero in the hyperplane’s equation. To determine what we mean by most “similar”. The formula for this would be Another common possibility is to use the Manhattan distance function. In the Iris example. while the width only varies up to two centimeters). the and in -dimensional straight-line distance between the two points space.) Hypothesis function is easy to understand. to ensure that the distance between two attribute values in the same attribute never exceeds . In the Iris example.2 Our second approach originates in the idea that given a query.8 Data mining Disadvantages: 1. we’ll compute the Euclidean distance from each of the vectors for the training examples to the query vector representing Iris Q. it just seems to work well in general. Nearest neighbor . They don’t really have much mathematical reasoning to back this up. The first is that the the raw Euclidean distance overemphasizes attributes that have broader ranges. Limited to numeric attributes. the training example that is most similar to it probably has the same label. the best choice would be the Euclidean distance function. the petal length is given more emphasis than the petal width (since the length varies up to four centimeters. There are two common refinements to the nearest-neighbor algorithm to address common issues with the data. (Why should we expect a hyperplane to be a good approximation?) Difficult to compute. we have to design some sort of distance function. In many cases. Oversimplifies the classification rule. In practice. people generally go for the Euclidean distance function.2. This is easy to fix by scaling each attribute by the maximum difference in the attribute values.

By applying the distance function to the scaled attribute values instead of the raw values, this artificial overemphasis disappears.

The second common refinement is to address the issue of noise — typically, a small but unknown fraction of the data might be mislabeled or unrepresentative. In the iris example, Iris C may not really be versicolor, or perhaps Iris C's dimensions are uncharacteristic of versicolor specimens. We can get around this using the law of large numbers: Select some number $k$ and use the $k$ nearest neighbors. The prediction can be the average of these nearest neighbors if the label values are numeric, or the plurality if the label values are discrete (breaking ties by choosing the closest). If we choose $k$ to be 3, we find that the three closest irises to Iris Q are Irises B, C, and D, of which two are versicolor and one is virginica. Therefore we still predict that Iris Q is versicolor.

Analysis

The nearest-neighbor algorithm and its variants are particularly well-suited to collaborative filtering, where a system is to predict a given person's preference based on other people's preferences. For example, a movie Web site might ask you to rate some movies and then try to find movies you'd like to see. Here, each attribute is a single movie that you have seen, and the Web site looks for people whose movie preferences are close to yours and then predicts movies that these neighbors liked but that you have not seen. Or you might see this on book-shopping sites, where the site makes book recommendations based on your past order history. Collaborative filtering fits into nearest-neighbor search well because attributes tend to be numeric and similar in nature, so it makes sense to give them equal weight in the distance computation.

Advantages:
– Represents complex spaces very well.
– Easy to compute.

Disadvantages:
– Does not handle many irrelevant attributes well. If we have lots of irrelevant attributes, the distance between examples is dominated by the differences in these irrelevant attributes and so becomes meaningless.
– Still doesn't look much like how humans learn.
– Hypothesis function is too complex to describe easily.

1.2.3 ID3

Linear regression and nearest-neighbor search don't work very well when the data is discrete — they're really designed for numeric data. ID3 is designed with discrete data in mind. ID3's goal is to generate a decision tree that seems to describe the data.

Decision trees are well-suited for discrete data. They are easy to interpret but can represent a wide variety of functions. Recall that one of our primary complaints about linear regression was that its hypothesis was too constricted to represent very many types of data, while one of our primary complaints about nearest-neighbor search was that its hypothesis was too complex to be understandable. Decision trees represent a good compromise between simplicity and complexity.

Figure 1.4 illustrates one decision tree. Given a vector, the decision tree predicts a label. To get its prediction, we start at the top and work downward. Say we take Plant Q. We start at the top node. Since Plant Q's stem is normal, we go to the right. Now we look for mold on Plant Q's seeds, and we find that it has some, so we go to the left from there. Finally we examine the spots on Plant Q's leaves; since they don't have yellow halos, we conclude that Plant Q must have downy mildew.

[Figure 1.4: A decision tree example. The root tests the stem condition; further tests on plant growth and mold on seed lead to leaves labeled herbicide injury, frog eye leaf spot, bacterial pustule, downy mildew, and bacterial blight.]

Constructing the tree automatically

Given a set of data, the goal of ID3 is to find a "good" decision tree that represents it — the general goal is to find a small decision tree that approximates the true label pretty well. ID3 follows a simple technique for constructing such a decision tree. We begin with a single node containing all the training data. Then we continue the following process: We find some node containing data with different labels, and we split it based on some attribute we select. By split, I mean that we take the node and replace it with new nodes for each possible value of the chosen attribute. For example, with the soybean data, if we have a node containing all the examples and choose to split on plant growth, the effect would be as follows.

  before the split:  { A, B, C, D, E, F }
  after the split on plant growth:  normal { A, C, E, F }   abnormal { B, D }

We stop splitting nodes when every node of the tree is either labeled unanimously or contains indistinguishable vectors.

How does ID3 decide on which attribute to split a node? Before we answer this, we need to define the entropy of a set of examples. The entropy of a set $S$ is defined by the following equation.

  $\mathrm{entropy}(S) = -\sum_{\text{labels } \ell} p_\ell \ln p_\ell$

Here $p_\ell$ is the fraction of points in the set with the label $\ell$. The entropy is a weird quantity that people use more or less just because it works. When the entropy is small, this indicates that things are labeled pretty consistently. As an extreme example, if everything has the same label $\ell$, then $p_\ell = 1$, and so the entropy would be just zero. We're aiming for a small decision tree where the total entropy of all the leaves is as small as possible.

In the training data of Figure 1.1, there are five labels: two of the six examples have frog eye leaf spot, and one example has each of the four other diseases. So the entropy is

  $-\tfrac{2}{6}\ln\tfrac{2}{6} - 4\cdot\tfrac{1}{6}\ln\tfrac{1}{6} \approx 1.56 .$

This entropy is rather large, quantifying the fact that the set isn't labeled consistently at all.

Now let's look at what happens when we split a node. This splits the set of examples it represents into pieces based on the values of the attribute. To quantify how good the split is, we define the gain of splitting a node on a particular attribute: It is the entropy of the old set it represented, minus the weighted sum of the entropies of the new sets. Say $S$ represents the old set, and $S_v$ represents the examples of $S$ with the value $v$ for the attribute under consideration. The gain would be

  $\mathrm{gain}(S) = \mathrm{entropy}(S) - \sum_{\text{values } v} \frac{|S_v|}{|S|}\,\mathrm{entropy}(S_v) .$

At the beginning of the algorithm, all the examples are in a single node. Let's consider splitting on plant growth. We get two sets of plants: B and D have abnormal plant growth, and A, C, E, and F have normal plant growth. We just computed the entropy of all six training examples; computing the entropy of each of the two new sets and weighting them by their sizes gives the gain of splitting on plant growth.

We can do similar calculations to compute the gain for stem condition instead. This divides the plants into a set of A, B, and F and a set of C, D, and E, and the gain works out to 0.6931. This is larger than the gain for plant growth, so splitting on stem condition is preferable. We would also consider splitting on leaf-spot halos (gain of 0.6931) and splitting on seed mold (gain of 0.6367). After computing the gains of splitting for each of the attributes, we choose the attribute that gives the largest gain and then split on it. Here we could go with either stem condition or leaf-spot halos: They both have the same gain. We'll choose stem condition, giving us a new tree.

  stem condition:  abnormal { A, B, F }   normal { C, D, E }

We continue splitting nodes until the examples in every node either have identical labels or indistinguishable attributes.
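Before continuing the walk-through, here is a sketch of the entropy and gain computations in Python, applied to the six labeled plants of Figure 1.1. It uses natural logarithms, consistent with the gain values quoted above; the short attribute names used as dictionary keys are just shorthand.

```python
# Entropy and information gain as used by ID3 to pick a split.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def gain(examples, labels, attribute):
    """Entropy before splitting minus the weighted entropy after splitting
    on the given attribute (each example is a dict of attribute -> value)."""
    n = len(examples)
    after = 0.0
    for value in set(ex[attribute] for ex in examples):
        subset = [lab for ex, lab in zip(examples, labels) if ex[attribute] == value]
        after += (len(subset) / n) * entropy(subset)
    return entropy(labels) - after

examples = [
    {"growth": "normal",   "stem": "abnormal", "halo": "no",  "mold": "no"},
    {"growth": "abnormal", "stem": "abnormal", "halo": "no",  "mold": "no"},
    {"growth": "normal",   "stem": "normal",   "halo": "yes", "mold": "yes"},
    {"growth": "abnormal", "stem": "normal",   "halo": "yes", "mold": "yes"},
    {"growth": "normal",   "stem": "normal",   "halo": "yes", "mold": "no"},
    {"growth": "normal",   "stem": "abnormal", "halo": "no",  "mold": "no"},
]
labels = ["frog eye leaf spot", "herbicide injury", "downy mildew",
          "bacterial pustule", "bacterial blight", "frog eye leaf spot"]

for attr in ["growth", "stem", "halo", "mold"]:
    print(attr, round(gain(examples, labels, attr), 4))
```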

We repeat the process for each of the remaining sets. We first consider Plants A, B, and F. They don't have the same disease, so we'll look for an attribute on which to split them. That turns out to be easy: splitting on plant growth gives a gain of 0.6365, while splitting on leaf-spot halos or seed mold gives a gain of 0. We must split on plant growth, giving us the following tree.

  stem condition:
    abnormal → plant growth:  abnormal { B }   normal { A, F }
    normal   → { C, D, E }

The set of A and F isn't a problem, as they both have the same disease. So the next set we'll consider for splitting is C, D, and E. There, the gain for splitting on plant growth and the gain for splitting on seed mold are the same, while the gain for splitting on leaf-spot halos is 0, since they all have leaf-spot halos. We could go for either plant growth or seed mold; say we choose seed mold. Now we have the following tree.

  stem condition:
    abnormal → plant growth:  abnormal { B }   normal { A, F }
    normal   → mold on seed:  yes { C, D }   no { E }

Finally, we consider how to split Plants C and D. We must split on plant growth, as they disagree only on the plant-growth attribute (they don't differ on leaf-spot halos or seed mold). The gain of this would be 0.6931, giving us the tree of Figure 1.4.

Analysis

Advantages:
– Represents complex spaces well.
– Generates a simple, meaningful hypothesis.
– Filters out irrelevant data.

Disadvantages:

– Doesn't handle numeric attributes well.
– Still doesn't look much like how humans learn.

1.3 General issues

There are a number of overarching data mining issues that apply regardless of the technique. The importance of these issues makes them worth studying separately. We'll look at just a few.

Overfitting

In applying a machine learning technique, we need to be careful to avoid overfitting. This occurs when the algorithm adapts very carefully to the specific training data without improving general performance. For example, consider the two graphs in Figure 1.5. Although the graph at the right fits the data perfectly, it's likely that the graph at left is a better hypothesis.

[Figure 1.5: Overfitting some data.]

There's a tradeoff: Do we go for the perfect fit (which may be an overfit), or do we settle for a simpler hypothesis that seems to be pretty close? The answer is that this is part of the art of applying machine learning techniques to data mining, but an aid to this is to be able to evaluate the error of a given hypothesis. Overfitting applies to just about any learning algorithm, though ID3 is particularly prone to it, as it continues growing the tree until it fits the data perfectly. Machine learning researchers have ways of working around this, but they get rather complicated, and so we're choosing to skip their approaches.

Data selection

Choosing the data is an important element of applying data mining techniques. We certainly need a good sample of data, or else the learned hypothesis won't be worth much. Of course we want the data to have integrity — though there's a natural tradeoff between integrity and quantity that we have to balance. But we also need a large sample to work from.

Less obvious is the importance of selecting good attributes. Sometimes we need to do some preprocessing to get good data. For example, with loan applications, the applicant's date of birth is probably what is really stored on the computer, but really this isn't significant to the application (neglecting astrological effects) — what's important is the applicant's age. So, though the database probably holds the birthday, we should compute the age to give to the learning algorithm. We can see a similar thing happening in the soybean data (Figure 1.1): Instead of giving the plant height (which, after all, coupled with the month the plant was found implies whether growth is stunted), the data set just includes a feature saying whether plant growth is normal or abnormal. The researchers here did some preprocessing to simplify the task for the learner.

Evaluating hypotheses

Once we get a hypothesis from a learning algorithm, we often need to get some sort of estimate of how good it is. The most typical measure is the error: the probability that the hypothesis predicts the wrong label for a randomly chosen new example.

and it’s every bit as intriguing! . nearest-neighbor search. a bank that intends to use data mining to determine whether to approve loans should think twice before including race or gender as one of the attributes given to the learning algorithm. but not nearest-neighbor search) are particularly useful for such applications that need human review at the end. Ethics Obviously any time you deal with personal information. In many situations. so we’re stuck with the handful we have. Algorithms that generate meaningful hypothesis (like linear regression or ID3. U S Presidential elections would be a good example: It’s not as if we can go out and generate new elections. We also need to use some data to compute the error. Databases often include sensitive information that shouldn’t be released. the data just isn’t plentiful enough to allow this. to give a reasonable tradeoff on the accuracy of the hypothesis and the accuracy of the reported error. and ID3 — and other issues that data mining practitioners must heed. but if we need to report the hypothesis error.14 Data mining It’s very tempting to feed all the data we have into the learning algorithm. we separate the data into two sets — the training set. but what have they learned?) So in situations where we need to compute the error. but they’re beyond the scope of this survey. this is a mistake. The topic of data mining is rich and just now becoming a widely applied field. (This would be analogous to giving students all the answers to a test the day before: Of course they’ll do well on the test. which we use for computing the error. For example. Even zip codes should probably not go into the learning algorithm. Machine learning has proposed several techniques for both using all the data and getting a close estimate of the error. the data mining practitioner should review the generated hypothesis to look for unethical or illegal discrimination. Social Security numbers are just one of several pieces of data that one can use to gain access to a person’s identity. 1. We could easily spend a complete semester studying it — it involves many interesting applications of mathematics to this goal of data analysis. For these applications. If we want to apply a learning algorithm to past polling data and their effect on the final result. we want to use all the data we can find. which holds the examples we give to the learning algorithm — and the test set.4 Conclusion We have seen a sample of data mining techniques — linear regression. Less obviously. ethical considerations arise. as implicit discrimination can arise due to communities with a particularly high density of one race. which you don’t want to get around too much. The error we report would be the fraction of the examples in the test set on which the learning algorithm’s hypothesis predicts wrongly. Typically two-thirds of the data might go into the training set and a third into the test set. data miners also need to be careful to avoid discrimination. I’d personally love to spend more time on it — but I also know what’s to come. and we can’t reuse the training data for this computation.

We’ll also make the label numeric by labeling an example if it is versicolor and if it is not. remains a very difficult question for psychologists and biologists. 2. which we’ll investigate in Section 2. and how the changing of synapses represents an actual change in knowledge.2.1) is a cell with two significant pieces: some dendrites. we’ll add a constant-one attribute to give added flexibility to the hypothesis. the neuron becomes excited and sends an impulse down the axon. which can receive impulses. It becomes excited whenever Let us return to classifying irises. These neurons receive the impulse and may themselves become excited. we need to know roughly how a brain works. As with linear regression.1. This connection between an axon and a dendrite is a synapse. it outputs . Although the perceptron isn’t too useful as a learning technique on its own. Over time. this axon is next to other neurons’ dendrites. . it’s worth studying due to its role in artificial neural networks (ANNs). Figure 2. The perceptron can output only these two # qµ £  ¶ # "# ”i$0qµ G F Hw# D E   ’I ! w w ! ``£ ciz £ £  3 # $" .2 contains the data we’ll use. In the brain. To do this. When the dendrites receive enough impulses.   iI When it is excited. though. and so the brain “learns” as the conditions under which neurons become excited change. We’ll call these inputs for (where is the number of inputs). 2. the connections between axons and dendrites change. A brain neuron (see Figure 2.Chapter 2 Neural networks One approach to learning is to try to simulate the human brain in the computer.1 Perceptron Using what we know about a neuron. The simplest attempt at this is called the perceptron. It also has a weight for each input (corresponding to the synapses). and at other times it outputs values.1 The perceptron algorithm The perceptron has a number of inputs corresponding to the axons of other neurons. propagating the signal further. which can send impulses. and an axon. it’s easy to build a mechanism that approximates one. How exactly neurons map onto concepts.

[Figure 2.1: A depiction of a human neuron, showing the dendrites, nucleus, and axon.]

[Figure 2.2: A prediction-from-examples problem of Iris classification. Each of Irises A through E is described by a constant-one attribute, sepal length, sepal width, petal length, and petal width, together with a label indicating whether the species is versicolor.]

We'll choose random weights with which to begin the perceptron. For the first prediction, on Iris A, we determine whether the weighted sum of its attributes exceeds 0. With the starting weights it does, so the perceptron predicts that Iris A's label will be 1 — when in fact, according to Figure 2.2, the correct label is −1.

A perceptron normally starts with random weights, but to learn, it must adapt over time by changing its weights to more accurately match what happens. Each time it makes a mistake by predicting one label when the correct answer is the other, we change all weights as follows.

  $w_i \leftarrow w_i + \eta\,(y - o)\,x_i$

Here $o$ is the perceptron's prediction, $y$ is the correct label, and $\eta$ represents the learning rate, which should be chosen to be some small number like 0.1. (If you choose $\eta$ too large, the perceptron may erratically oscillate rather than settle down to something meaningful.)

Since we just predicted that Iris A's label would be 1 when in fact the correct label is −1, we update the weights. These are the weights we'll use for the next training example. To compute the prediction for Iris B, we determine whether the weighted sum of its attributes under the new weights exceeds 0. It does not, so the perceptron predicts that Iris B is labeled −1. This is wrong: According to Figure 2.2, the correct label for Iris B is 1. So we change the weights again, and these are the weights we'll use for the next training example.

Let me make an intuitive argument for why this training rule makes sense. Say we get an example where the perceptron predicts 1 when the answer is −1. In this case, each input contributed $w_i x_i$ to a total that ended up being positive (and so the perceptron got improperly excited). After the update, the new weight is $w_i - 2\eta x_i$, and since $x_i \cdot x_i$ must be positive regardless of $x_i$'s value, this input's contribution $w_i x_i - 2\eta x_i^2$ is smaller than before. Thus if we repeat the same example immediately, the sum will be smaller than before — and hence closer to being labeled correctly. (The same line of argument applies when the perceptron predicts −1 when the answer is 1.)
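Here is a sketch of the perceptron algorithm just described: predict 1 when the weighted sum exceeds zero, and adjust the weights after each mistake. The starting weight range, the learning rate of 0.1, and the iris measurements are placeholders for illustration.

```python
# Perceptron training: weights change only when a prediction is wrong.
import random

def predict(weights, x):
    return 1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else -1

def train_perceptron(examples, labels, rate=0.1, passes=10, seed=0):
    rng = random.Random(seed)
    weights = [rng.uniform(-0.05, 0.05) for _ in range(len(examples[0]))]
    for _ in range(passes):
        for x, y in zip(examples, labels):
            o = predict(weights, x)
            if o != y:
                # w_i <- w_i + rate * (y - o) * x_i
                weights = [w + rate * (y - o) * xi for w, xi in zip(weights, x)]
    return weights

# Constant-one attribute first, then the four iris measurements (placeholders).
examples = [
    [1.0, 5.1, 3.5, 1.4, 0.2],
    [1.0, 7.0, 3.2, 4.7, 1.4],
    [1.0, 6.4, 3.2, 4.5, 1.5],
    [1.0, 6.3, 3.3, 6.0, 2.5],
    [1.0, 5.8, 2.7, 5.1, 1.9],
]
labels = [-1, 1, 1, -1, -1]          # 1 means versicolor

w = train_perceptron(examples, labels)
print([predict(w, x) for x in examples])
```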

To train the perceptron on the irises. If you continue to see how it improves. We know that linear regression has a strong mathematical foundation. . and we know that the perceptron hypothesis isn’t any more powerful than that used by linear regression.2 Analysis Although the approaches are very different. and they predict according to whether the weighted sum of the attributes exceeds a threshold. It finally gets all flowers right on the th correct until iteration through the examples. The following table demonstrates how the weights change while going through all the examples three times over. I don’t know of any reason to use a single-perceptron predictor. On the first pass through the examples. sure. and the human brain is the best learning device known to humankind. On the second pass. and . Their prediction techniques are identical: They have a set of weights. the sum will be smaller than before — and hence closer to being labeled correctly. Thus if we repeat the same example immediately.) So what’s the point of studying perceptrons? They’re inspired by human biology. we’d go through the examples several times. unless there’s a hyperplane that separates the data perfectly. the final weights would be . 2. Perceptrons are easier to program.18 Neural networks this contribution is smaller than before. (Perceptrons aren’t even guaranteed to converge to a single answer. . (The training rate doesn’t affect this too much: Increasing the training rate even quite a bit doesn’t speed it up much. the perceptron classified only of the flowers correctly. though the human brain § w   &&§  ¢  £ &  £ &  £ &  £ & & £ & £ & £  £ & © £ &  £ &  £ &  £ & ‡ £ ‡ £ & £  xµ  £   I 3  &&’cxxµ  £  3 ‡¼pº ¥ £  uxI ¥ £  uxI  £ ‡  £ ‡  £  ‰xI  £  ‰xI  £  ‰xI  £ `  £  `xI ¢ §£  ’xI ¢ §£  ’xI  £  ‡xI ¥u £ ¥u £ ¥ £  gxI  qµ   £  ‰xI  £  ‰xI ©‡xI £  ©‡xI £   £  ‰xI  £  ‰xI  £  ‰xI §‡xI £   £  ‰xI © §£  &xI © §£  &xI  £  ‰xI ©‡xI £  ©‡xI £   £  ‰xI } xµ ©  ’I ¢ ©£  ’xI ¢ ©£  ’xI  £ ‡  £ ‡  ©£  &xI  ©£  &xI  ©£  &xI ¢ £  ¢ ©£  ’xI  £  &xI  £  &xI © ©£  &xI ¥g £ ¥g £ ¢ ©£  ’xI U µ  £ ¢ 3  &¤nsqµ ‰xcsxµ xc3 £  I 3 }  §£  I  & £ & £  £ &  £ & & £ & £ & £  £ & & £  £   £  & £  £ &  £ &  £ & G –µ ¥ ©£  ‡xI ¥ ¢£ d¨§  £ ¨© &xI £   ¥£  ¦xI  ©£  xI ¢ £ ’¨§  £  ‰xI  £ § &¨sI ¢ £ © ’¨sI  £   ©£ ¤¢ &¨§ §£  £  ‡xI  ©£  # "# $0qµ # e  ’I U ¢ ’§ µ ic‰–µ  £  I 3 G   . though.) Incidentally. So why would you ever use a perceptron instead? You wouldn’t. this looks quite nice. when you could just as easily do linear regression.) The perceptron algorithm will iterate over and over through the training examples until either it predicts all training examples correctly or until somebody decides it’s time to stop. But we need to keep in mind that. the perceptron doesn’t label of the flowers times through the training set. it got correct. (The same line of argument applies when the perceptron predicts when the answer is . and easier to understand. and decreasing the rate only slows it a little. iris A B C D E A B C D E A B C D E At first glance. after going through all these iterations using . And on the third pass. it got correct.1. if you’re lucky enough to get an answer. But they take a lot more computation to get the same answer. it’s instructive to compare linear regression with the perceptron. .

Advantages:
– Simple to understand.
– Adapts to new data simply.
– Inspired by human biology.
– Generates a simple, meaningful hypothesis.

Disadvantages:
– Only works for predicting yes/no values.
– The hypothesis is a simple linear combination of inputs, just like linear regression, but linear regression has much more solid mathematical grounds.

2.2 Artificial neural networks

Artificial neural networks (ANNs) are relatively complex learning devices. To get good performance, we're going to have to network neurons together. That's what we're going to look at next, and that's when we'll start to see some dividends from our study of perceptrons. We'll look first at the overall architecture, then at the individual neurons, and finally at how the network predicts and adapts for training examples.

2.2.1 ANN architecture

Figure 2.3 pictures the layout of one ANN. This particular network has three layers. The first is the input layer, including four nodes in Figure 2.3. These nodes aren't really neurons of the network — they just feed the attributes of an example to the neurons of the next layer. (We'll have four inputs when we look at the irises, as each iris has four attributes.) The next layer is the hidden layer, with every input node connected to every hidden neuron. This layer has several neurons that should adapt to the input. These neurons are meant to process the inputs into something more useful — like detecting particular features of a picture if the inputs represent the pixels of a picture — but they'll automatically adapt, so what they detect isn't necessarily meaningful. Figure 2.3 includes three hidden neurons in the hidden layer, but really an ANN could have any number of hidden neurons in any configuration. Choosing the right number of neurons for this layer is an art. The final layer is the output layer. It has an output neuron for each output that the ANN should produce.

[Figure 2.3: An artificial neural network, with a layer of input nodes, a layer of hidden nodes, and an output node.]

One could conceivably have more hidden layers, or even use a more complex unlayered design, or delete or add links different from those diagrammed in Figure 2.3. But the most common design is the two-layer network, like that of Figure 2.3. In this architecture, there is a layer of some number of hidden neurons, followed by a layer of some number of output neurons; each input is connected to each hidden neuron, and each hidden neuron is connected to each output neuron. People usually use two-layer networks because more complex designs just haven't proven useful in many situations.

In our walk-through example of how the ANN works, we'll use a very simple two-layer network with two hidden units $a$ and $b$ and one output unit $o$. As is commonly done with ANNs, each of the neurons in our network has an additional constant-one input. We do this instead of adding a new constant-one attribute (as we did with linear regression and perceptrons), because that wouldn't give a constant-one input into the output unit $o$.

2.2.2 The sigmoid unit

Each unit of the network is to resemble a neuron. You might hope to use perceptrons, but in fact that doesn't work very well. The problem is one of feedback: In a neural network, we must sometimes attribute errors in the ANN prediction to mistakes by the hidden neurons, if the hidden neurons are to change behavior over time at all. (If they don't change behavior, there isn't much point in having them.) Researchers just haven't found a good way of doing this with regular perceptrons, but they have figured out a practical way to do this attribution of error using a slight modification of the perceptron, called a sigmoid unit.

A sigmoid unit still has a weight $w_i$ for each input $x_i$, but it processes the weighted sum slightly differently, using the sigmoid function, defined as

  $\sigma(y) = \frac{1}{1 + e^{-y}} .$

This is a smoothed approximation to the threshold function used by perceptrons, as Figure 2.4 illustrates. Sigmoid units produce an output between 0 and 1, so their low output is 0. (Having the low output be 0 is just a minor difference from the perceptron; we'll just rework our training example labels by replacing −1 labels with 0.) To make a prediction given the inputs $x_1, \ldots, x_k$, the sigmoid unit computes and outputs the value

  $\sigma\!\left(\sum_i w_i x_i\right) .$

[Figure 2.4: The threshold function and the sigmoid function.]

Notice that this means it will output some number between 0 and 1. It will never actually be 0 or 1, but it can get very close. The advantage of a sigmoid unit is that its behavior is smooth and never flat. This makes mathematical analysis easier, since it means we can always improve the situation by climbing one way or the other along the sigmoid curve: If the output is too high, we'll try to go downhill a bit. Too small? Go uphill a bit. The difference between a sigmoid unit and a perceptron is quite small. The only real reason to use the sigmoid unit is so that the mathematics behind the analysis — which we won't examine — works out to show that the neural network will approach some solution. But we won't get into the details of why it works.

2.2.3 Prediction and training

Handling a single training example is a three-step process. We'll see how to do each of these steps in detail soon, but here's the overview.

1. We run the training example through the network to see how it behaves (the prediction step).
2. We assign an "error" to each sigmoid unit in the network (the error attribution step).
3. We update all the weights of the network (the weight update step).

The whole process is akin to the perceptron training process, except here we'll always update the weights. (Recall that the perceptron weights got updated only when the perceptron erred.)

We have three sigmoid units in our architecture: $a$, $b$, and $o$. The hidden units $a$ and $b$ have five inputs and hence five weights each — one from each input node, plus the constant-one input. The output unit has three inputs and hence three weights — one from each hidden unit, plus the constant-one input. To begin the network, we initialize each of these weights using small random values.

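The prediction step can be sketched as follows. Again this is a rough illustration rather than the original walk-through: the initial weights here are simply small random numbers, and the attribute values are made up.

    import math, random

    def sigmoid(y):
        return 1.0 / (1.0 + math.exp(-y))

    def unit_output(weights, inputs):
        # weights[0] multiplies the constant-one input.
        return sigmoid(weights[0] + sum(w * x for w, x in zip(weights[1:], inputs)))

    random.seed(0)
    def small():
        return random.uniform(-0.05, 0.05)

    w_a = [small() for _ in range(5)]   # hidden unit a: constant-one plus four inputs
    w_b = [small() for _ in range(5)]   # hidden unit b
    w_o = [small() for _ in range(3)]   # output unit o: constant-one plus two hidden units

    def predict(attributes):
        h_a = unit_output(w_a, attributes)
        h_b = unit_output(w_b, attributes)
        return unit_output(w_o, [h_a, h_b])

    print(predict([5.1, 3.5, 1.4, 0.2]))    # made-up iris attributes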
Given Iris A, we first compute the output of hidden node a, applying a's five weights to the constant-one input and Iris A's four attributes. Similarly, we compute the output of hidden node b. Finally, from these two values (plus the constant-one input) we compute the output of the output node o; that output is the computed prediction for Iris A.

Error attribution step

The error attribution step uses a technique called backpropagation. What we'll do is look at the entire network's output (made by the output layer) and determine the error of each output unit. Then we'll move backward and attribute errors to the units of the hidden layer.

To assign the error of an output unit, say the desired output for the unit is y, but the actual output made by the unit was o. We compute the error of that output unit (which we'll write δ_o) as

    δ_o = o (1 - o) (y - o)

In our example, the correct answer for Iris A was 0 (Iris A is not versicolor), while the network output was some positive number, so δ_o comes out negative.

Now that we have errors attributed to the output layer, we backpropagate to the hidden layer. Consider a hidden unit h whose output during the prediction step was x (this is the value that was sent forward to the output node when making the prediction), and whose weight connecting it to an output unit is w. The error that we attribute to h from that output unit is

    x (1 - x) w δ

where δ is the error we computed for the output unit. (If there are multiple output units, we sum these contributions over all the outputs to get the total error attributed to h.) We compute the error of hidden unit a this way, and we compute the error of hidden unit b similarly.
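A sketch of the error attribution in Python follows. The formulas are the backpropagation rules just described; the numbers plugged in at the end are invented for illustration.

    def output_error(actual, desired):
        # delta for an output unit: o(1 - o)(y - o)
        return actual * (1.0 - actual) * (desired - actual)

    def hidden_error(hidden_output, downstream):
        # downstream: list of (weight to an output unit, that unit's delta) pairs
        return hidden_output * (1.0 - hidden_output) * sum(w * d for w, d in downstream)

    delta_o = output_error(0.52, 0.0)          # network said 0.52, desired label was 0
    delta_a = hidden_error(0.47, [(0.03, delta_o)])
    delta_b = hidden_error(0.55, [(-0.01, delta_o)])
    print(delta_o, delta_a, delta_b)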

Weight update step

The last step of handling the training example is updating the weights. We update all the weights, both those going from inputs to hidden nodes and those going from hidden nodes to output nodes, as follows. Consider one input into a sigmoid unit of the ANN. Say x is the value fed into that input during the prediction step, and δ is the error attributed during the error-attribution step to the sigmoid unit receiving the input. Then we add

    η δ x

to the weight associated with this input, where η is a small learning rate. Doing this for every weight in the network (including the weights on the constant-one inputs, for which x is 1) gives us the weights we'll use for our next training example.

Just to demonstrate that we've made progress, let's see what the network would predict if we tried Iris A again. We propagate its attributes through the network using the updated weights, and the computed prediction for Iris A comes out closer to the correct answer of 0 than the previously predicted value. This provides some evidence that the ANN has learned something through this training process.

This is all just for a single training example. Like in training a perceptron, we'd do all of this for each of the examples in the training set, and we'd repeat it several times over. It's not the sort of thing you can do by hand, though a computer can do it pretty easily for small networks.
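The weight update can be sketched as below; the learning rate, error, and weight values are illustrative stand-ins rather than the walk-through's numbers.

    def update_weights(weights, inputs_with_one, delta, rate):
        # Add rate * delta * x to the weight on each input x; inputs_with_one[0]
        # is the constant-one input, so weights[0] pairs with 1.0.
        return [w + rate * delta * x for w, x in zip(weights, inputs_with_one)]

    delta_o, h_a, h_b = -0.13, 0.47, 0.55      # invented error and hidden-unit outputs
    w_o = [0.02, 0.03, -0.01]                  # output unit's weights before the update
    w_o = update_weights(w_o, [1.0, h_a, h_b], delta_o, 0.1)
    print(w_o)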

2.2.4 Example

Backpropagated ANNs are so complex that it's difficult to get a strong handle on how they work from a single hand-worked example. It's instructive to look at a more industrial-strength example to see how they might actually be used. Let's consider the full iris classification example of Fisher [Fis36]. Recall that this set of examples includes 150 irises, each iris having four numeric attributes and a label classifying it as one of three species. Of the 150 irises, we choose a random subset of 100 as training examples and keep the remaining 50 for evaluation purposes.

We'll model this as a two-layer ANN. For each of the four numeric attributes, we include an input node in the input layer. In the hidden layer, we choose to use two sigmoid units. And the output layer has a unit for each of the three possible labels. For training purposes, a label is encoded as three desired outputs: 0.9 for the unit corresponding to the correct species and 0.1 for the other two, so that setosa, versicolor, and virginica each get their own target pattern. We use 0.1 and 0.9 instead of 0 and 1 because a sigmoid unit never actually reaches an output of 0 or 1; if we encoded labels using 0s and 1s, the weights would gradually become more and more extreme. When we interpret the outputs, we find the output unit emitting the greatest output and interpret the ANN as predicting the corresponding label.

When we train, we go through the 100 training examples several times, using a small fixed learning rate. It's interesting to see how the ANN improves as this continues. We'll look at the average sum-of-squares error: if the ANN's output units emit o_1, o_2, o_3 and the desired outputs are y_1, y_2, y_3, the sum-of-squares error for that example is

    (y_1 - o_1)^2 + (y_2 - o_2)^2 + (y_3 - o_3)^2

This is an odd quantity to investigate. We choose it because the mathematics behind the error attribution and weight update is motivated by trying to decrease the sum-of-squares error for any given training example. (And the researchers who did the derivation chose it basically because they found a technique for decreasing it. That circularity, deciding to minimize this peculiar error function because the mathematics happened to work for it, is the spark of genius that makes backpropagation work.)

Figure 2.5 graphs the sum-of-squares error (vertical axis) against the number of iterations through the data set (horizontal axis). The thin line is the average sum-of-squares error over the 100 training examples; the thick line is the average error per evaluation example, over the 50 examples the ANN never trains on.

[Figure 2.5: ANN error against iterations through the training set, from 0 to 3,000 iterations.]

Since the mathematics behind the backpropagation update rule works toward minimizing the sum-of-squares error, we expect the thin line in this graph to be constantly decreasing. But that improvement is on the training examples: it's not indicative of general performance. The thick line is the important number, since performance on examples the ANN never trains on is what indicates general performance. You can see that it flattens out after 1,000 iterations through the examples and then takes an upward turn after another 1,000 iterations, despite continued improvement on the training examples. What's happening beyond the 2,000th iteration is overfitting.
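The two curves of Figure 2.5 come from recomputing the average error after each pass through the data. A rough Python sketch of that bookkeeping is below; the targets and error formula follow the description above, while `predict` is a stand-in for whatever routine runs the current network on one example.

    TARGETS = {
        "setosa":     [0.9, 0.1, 0.1],
        "versicolor": [0.1, 0.9, 0.1],
        "virginica":  [0.1, 0.1, 0.9],
    }
    SPECIES = ["setosa", "versicolor", "virginica"]

    def sum_of_squares_error(outputs, desired):
        return sum((y - o) ** 2 for y, o in zip(desired, outputs))

    def interpret(outputs):
        # The predicted species is the one whose output unit emits the greatest value.
        return SPECIES[outputs.index(max(outputs))]

    def average_error(examples, predict):
        # examples: list of (attributes, species) pairs; predict: attributes -> 3 outputs.
        total = sum(sum_of_squares_error(predict(attrs), TARGETS[label])
                    for attrs, label in examples)
        return total / len(examples)

    # After each pass through the 100 training examples, record both curves:
    #     history.append((average_error(training, predict), average_error(evaluation, predict)))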
The neural net is adapting to specific anomalies in the training data, not the general picture. If we were running the experiment, we'd want to stop around this point rather than continue onward.

Of course, one could argue that what we really care about is the number of evaluation examples classified correctly, not the average sum-of-squares error. By the 100th iteration through the examples, the ANN got 47 of the 50 evaluation examples correct. By the 1,000th, it was getting 49 of 50. Around the 2,000th iteration, the ANN was back to 48 of 50, and it remains there. (This is just for a single run through the data; different runs give slightly different numbers, due to the small random numbers chosen for the initial weights in the network.) But this supports the same conclusion we reached using the sum-of-squares error: it was probably best to stop after the first 1,000 iterations.

It's also worthwhile to look at how the behavior changes with different numbers of hidden units. Only one hidden unit isn't powerful enough: with a single hidden unit, the network never reaches the accuracy reported above. Adding hidden units beyond two keeps the picture more or less the same, maybe even slightly worse, at the added expense of much more computation. Typically, people find that there's some critical number of hidden units: below this the ANN performs poorly, at the critical number it does well, and additional hidden units provide only marginal help, if they help at all.

2.2.5 Analysis

ANNs have proven to be a useful, if complicated, way of learning. They are one of the more important results coming out of machine learning.

Advantages:
- Very flexible in the types of hypotheses they can represent.
- They adapt to strange concepts relatively well in many situations.
- Can adapt to new data with labels.
- Bear some resemblance to a very small human brain.

Disadvantages:
- Very difficult to interpret the hypothesis as a simple rule.
- Difficult to compute.

One of the complexities of using ANNs is the number of parameters you can tweak to fit the data better: you choose the representation of attribute vectors and labels, the architecture of the network, the training rate, and how many iterations through the examples you want to do. The extended example presented in this chapter is intended to illustrate a realistic case, where we had to make a variety of decisions in order to get a good neural network for predicting iris classification. The process is much more complicated than simply feeding the data into a linear regression program.

Chapter 3

Reinforcement learning

Until now we've been working with supervised learning, where the learner gets instant feedback about each example it sees. While this is useful and interesting, it doesn't model all types of learning. Consider, for example, a mouse trying to learn how to find cheese in a given maze. Whether the mouse chooses to go forward, right, or left doesn't result in immediate feedback; there is a whole sequence of actions that the mouse must learn in order to arrive at the cheese. There is sensory feedback along the way, but no information about whether the mouse is on the right track or not. This doesn't fit precisely into the supervised learning model. The field of reinforcement learning looks at how one might approach a problem like this.

Two of the most prominent potential applications of reinforcement learning are game playing and robot learning. In both, the learner has to make a series of actions before achieving the goal, with no direct feedback about whether it's making the correct action.

3.1 Modeling the environment

Before we can analyze this problem, we need some way of thinking about it. One of the most useful techniques is to model the problem as a state space, with a state for each situation the learner might be in and transitions representing how an action changes the current situation.

We'll be working with the following very simple state space with just three states, a, b, and c.

[Diagram: states a, b, and c in a row, each with a left action L and a right action R; the R action out of c is a long transition leading back to a.]

This is a diagram of how our world operates. From each state there are two actions: we can move left (action L) or we can move right (action R). The state space is meant to represent a very simple maze, actually just a short hallway, with four locations. The cheese sits in the fourth location, to the right of c; when the mouse reaches it, the psychologist picks up the mouse and puts it back at location a for another trial, which is why the diagram shows the R transition from c leading back to a. We'll think of the mouse as starting out at location a. This is an unrealistically simple state space, but we need it that simple to keep within the scope of hand calculation.

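If you wanted to play with this world on a computer, the transition structure can be written down as a small table. This is a sketch rather than anything from the original text, and it assumes that moving left from a simply leaves the mouse at a.

    # Each (state, action) pair maps to the state the action leads to.  The R action
    # from c is the long transition back to a (the mouse is returned after the cheese).
    TRANSITIONS = {
        ("a", "L"): "a", ("a", "R"): "b",
        ("b", "L"): "a", ("b", "R"): "c",
        ("c", "L"): "b", ("c", "R"): "a",
    }

    def take(state, action):
        return TRANSITIONS[(state, action)]

    print(take("a", "R"), take("b", "R"), take("c", "R"))   # b c a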
The notion of reward is crucial to the problem, so we need to incorporate it into our model too. We'll place a number on some transitions representing how good that action is. In game playing, for example, the last move before a game is won would get a positive number, while the last move before a game is lost would get a negative number. In our example, the mouse is rewarded every time it goes right from c, arriving at the cheese; so we place a positive reward, call it r, on the long transition from c back to state a, and no reward on the other transitions.

Our goal as the learner is to find a strategy, that is, a mapping associating an action with each state. The optimal strategy for our example, you might guess, is to go right (R) from every state.

Which strategy do we want, formally? To compare strategies, we define the discounted cumulative reward (or just the discounted reward) by the infinite summation

    r_0 + γ r_1 + γ^2 r_2 + γ^3 r_3 + ...

In this expression, r_t represents the reward we receive on the t-th step if we follow the strategy, while γ is a number (between 0 and 1) that gives more distant rewards less emphasis. We'll use the same fixed γ throughout this chapter. The strategy that we want to learn is the strategy with the maximum discounted reward.

To compute the discounted reward of the all-R strategy, we reason as follows. Recall that we decided to begin in state a. Our first action, going right, gives us a reward of 0, leaving us at b. Our second action takes us to state c, again with no reward. The third earns the cheese's reward r, moving us back to state a. The fourth again has a reward of 0, leaving us at b, and this continues in cycles of three. Thus the discounted reward of the strategy is

    γ^2 r + γ^5 r + γ^8 r + ... = γ^2 r / (1 - γ^3)

In the last step of this computation, we used the fact that the infinite sum 1 + γ^3 + γ^6 + ... is exactly 1 / (1 - γ^3). (You may have learned how to do these infinite sums before; if not, don't worry, we won't use the trick again.)

Why the discounted reward and not something else? Basically, we use the discounted reward because it's convenient. Part of the point of γ is to emphasize rewards received earlier; but more importantly, it keeps the sum finite. If we were to leave out the γ altogether, the cumulative reward would be infinite. It also keeps the mathematics simple to have our goal be a combination of additions and multiplications. You could use another criterion for comparing strategies, and researchers have examined several, but each makes things more complicated.

Finding the optimal strategy when we have a map of the maze is an interesting problem, but it's not learning; it's just computation. In reality we won't have a map: either we won't understand our environment (like the feeble mouse), or the map will be too large to remember explicitly (like in game playing).
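As a check on the formula above, the same number can be computed directly. The discount factor and cheese reward below are arbitrary illustrative choices rather than values taken from the text.

    GAMMA = 0.8            # illustrative discount factor
    CHEESE_REWARD = 100.0  # illustrative reward for reaching the cheese

    # Going right forever from a earns the cheese reward on every third step.
    closed_form = CHEESE_REWARD * GAMMA**2 / (1 - GAMMA**3)

    # The same quantity, approximated by summing a long finite prefix of the series.
    approx = sum(CHEESE_REWARD * GAMMA**t for t in range(2, 3000, 3))

    print(closed_form, approx)    # the two agree (about 131.15 with these numbers)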

3.2 Q learning

In this section we'll look at Q learning, a specific algorithm for learning the optimal strategy, developed in 1989 by Watkins [Wat89].

Model

What we'll actually learn is a set of numbers derived from the discounted reward. For each transition of the state space we'll learn a number: for a state s and an action a available from that state, we'll use Q(s, a) to refer to this number, representing the discounted reward if we start at s, take action a, and then follow the optimal strategy from there on.

The nice thing about the Q values is that knowing them gives us the optimal strategy: for a particular state, to find the optimal action, we look at the Q values for each action starting at that state and take the action associated with the largest of these.

Here are the exact Q values for our space.

[Table: the exact values of Q(s, L) and Q(s, R) for each of the states a, b, and c.]

For example, we know Q(a, R), because it represents the discounted reward of going right from a and then following the optimal strategy thereafter; that's exactly the action sequence we considered when we computed the discounted reward in Section 3.1, so Q(a, R) = γ^2 r / (1 - γ^3).

Or consider Q(a, L). Here we want the discounted reward of starting in a, taking the action L, and then following the optimal strategy from there on. After our first step of going left, we get a reward of 0 and are still in state a. The second step (now following the optimal strategy) takes us right from a, getting a reward of 0 and putting us in state b. The third step places us in c, again with no reward. The fourth step earns the cheese's reward and places us back in a. The fifth step places us in b at a reward of 0, and so on. Thus Q(a, L) is γ times the discounted reward we computed before. (We scaled that discounted reward by γ because here the optimal strategy starts only at the second step and beyond, whereas the discounted reward starting at a follows the optimal strategy from the first step on.)

At any rate, the discounted reward starting at a state s, following the optimal strategy, is simply the largest of the Q values available at s. (Don't worry, we won't compute this quantity directly in our algorithm, but it is key to understanding Q learning.) Notice the following property of the Q values: for any action a taking us from state s to state s', giving us an immediate reward r, the value Q(s, a) is the immediate reward r followed by the discounted reward of following the optimal strategy starting at s'. In other words,

    Q(s, a) = r + γ max_{a'} Q(s', a')

Using this formula, we can verify each of the values in the table above. We can't simply look at the maze, determine the optimal strategy, and read off the Q values, since we don't have a map to work with. Instead, we'll gradually zero in on the correct values of Q through a series of trials.

Algorithm

We'll use this last observation to define our algorithm. We maintain an estimate of Q(s, a) for every transition, and we update the estimates using our formula as we go.

    Start with Q(s, a) = 0 for all states s and actions a.
    repeat indefinitely:
        s ← current state; a ← some action we select.
        Perform action a.
        r ← reward for performing a; s' ← new state after performing a.
        Q(s, a) ← r + γ max_{a'} Q(s', a')
    end repeat

This is really quite simple: we begin by estimating all Q values as 0, then we wander around the state space, updating the estimate for each transition we use according to our formula. We continue doing this over time until we're satisfied.

For example, say we start at state a and choose to move right first. We get a reward of 0 and end up in state b. According to our current Q-value estimates, the discounted reward starting at b is 0, so we update Q(a, R) to 0 + γ · 0 = 0, which was the value we had before. Now say we move to the right again. We get a reward of 0 and end up in state c; our estimate of the discounted reward starting in c is 0, so Q(b, R) stays at 0 as well. And if we move to the right again, we get the cheese's reward and end up back in state a. Our estimate of the discounted reward starting in a, according to the current Q values, is still 0, so we update Q(c, R) to the cheese's reward.

We continue wandering like this for several more steps, each step updating the estimate for one more transition in the same way. After all these steps, our estimates of the Q values are the following.

[Table: the estimated Q(s, L) and Q(s, R) for states a, b, and c after this short walk.]
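The wandering-and-updating loop is easy to automate. Here is a sketch of it in Python for the hallway world; the discount factor and cheese reward are again illustrative choices rather than the text's values.

    import random

    TRANSITIONS = {
        ("a", "L"): "a", ("a", "R"): "b",
        ("b", "L"): "a", ("b", "R"): "c",
        ("c", "L"): "b", ("c", "R"): "a",
    }
    REWARDS = {key: 0.0 for key in TRANSITIONS}
    REWARDS[("c", "R")] = 100.0        # illustrative cheese reward
    GAMMA = 0.8                        # illustrative discount factor
    ACTIONS = ["L", "R"]

    Q = {key: 0.0 for key in TRANSITIONS}   # every estimate starts at zero

    random.seed(0)
    state = "a"
    for _ in range(10000):
        action = random.choice(ACTIONS)          # wander at random
        reward = REWARDS[(state, action)]
        new_state = TRANSITIONS[(state, action)]
        # The Q-learning update: immediate reward plus the discounted best
        # estimate available from the state we end up in.
        Q[(state, action)] = reward + GAMMA * max(Q[(new_state, a)] for a in ACTIONS)
        state = new_state

    for key in sorted(Q):
        print(key, round(Q[key], 2))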
These estimates aren't too far off the correct answers, and if we continued wandering we would get closer and closer to the correct values.

It doesn't much matter how we choose which way to move, because the choice doesn't affect whether Q learning works: as long as we occasionally try out all the possible transitions, any way of making the selection is fine. Choosing randomly at each step from the available actions, for example, isn't a bad idea. At this point our goal is simply to learn the optimal strategy, not to garner points along the way. In situations where you want to accumulate reward as you learn, you have a conflict of interest: the learner wants to try out actions that haven't been tried often before in the current state, to gain more experience with their consequences, but it also wants to accumulate reward when it can, and so would tend to select moves that have higher Q values. This is known as the issue of exploration versus exploitation, and balancing the two is an interesting problem; we'll sidestep it here.

Analysis

The neat thing about Q learning is that it works, even though the algorithm is very simple. Moreover, it works without constructing an explicit map of the state space, which is important from a practical point of view, since in practice the map can get very complicated.

But Q learning has some problems that are worth pointing out.

- The algorithm learns more slowly than we might hope. In our example, the first time through the maze we learned that going right from c is a good move; only on the next trip through did we learn that moving from b to c is good, and only on the trip after that did we learn to move from a to b. Each time through the maze, the algorithm extends its knowledge just one step backward, even though we should have figured out earlier that it is a good idea to move from a toward the cheese.

- The algorithm doesn't work well in a nondeterministic world, where an action doesn't always have the same consequences. Our update forgets everything learned about a transition each time the action is performed, replacing it with the new observation. That works fine in a perfect nonrandom world, when every action invariably leads to the same situation, but not in more realistic settings, where the world often doesn't behave exactly as predicted. This is obviously true in game playing, where each action is followed by an opponent's move, which the player cannot predict. It's also true in robot learning: a robot building a tower of blocks, for example, may find that a gust of wind blows the tower over unexpectedly.

- For problems of a realistic size, hoping to go through all possible states is simply not realistic. In a chess game, for example, there are so many states that we can never hope to visit every possible board that might show up during a game. This is a problem of generalizing from what we have learned: we need to somehow recognize that two states are similar, so that if an action is good for one, a similar action is good for the other. That is more domain-specific than our state-space abstraction allows; in the abstraction, we've thrown away all details about the characteristics of a state, and each state is completely incomparable to the rest.

In the next two sections we'll see ways of refining Q learning to address the first two issues. The third issue is more endemic to the overall model; a simple patch isn't going to solve it.

People don't really have a good way of handling this last problem in general. We'll look at the two most prominent successful applications of reinforcement learning at the close of this chapter; in both, the researchers faced this problem of not being able to visit all possible states, and we'll see how they resolved it in their particular situations.

3.3 Nondeterministic worlds

To address nondeterministic consequences of actions, what we'll do is average the current value of Q with the new observation, instead of forgetting everything we had learned about the transition. The algorithm proceeds in the same way as before, but we alter the update rule to the following.

    Q(s, a) ← (1 - α) Q(s, a) + α (r + γ max_{a'} Q(s', a'))

Here α, which we choose between 0 and 1, is a parameter that tells us how much weight to give the new observation and how much weight to give to the value we were remembering from earlier. If α is large, we give more weight to the new observation; if α is small, we give lots of weight to the old value of Q, so that we improve our estimate only slowly over time. If we chose α to be 1, we might as well not have bothered: that choice gives us the same algorithm we used before.

In practice, you want to use a large α at the beginning, when the Q values don't reflect any experience (and so you might as well adapt quickly), and you want to reduce α as you gain more experience. A very sensible choice changes α with each action, setting it to 1 / (1 + visits(s, a)), where visits(s, a) is the number of times we've performed action a from state s previously.

In our little world we don't need this alternative formulation, because the world is deterministic. But if we use it anyway, we end up with the following estimates after the same sequence of actions we performed before.

[Table: the estimated Q(s, L) and Q(s, R) for states a, b, and c under the averaging rule.]

We haven't gotten as close to the correct values here as we did in Section 3.2; that's because we intentionally modified the original algorithm to adapt more slowly.
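A sketch of the averaging update, with the 1 / (1 + visits) schedule for α, might look like this in Python (the function and dictionary names are mine, not the text's, and the reward and discount are illustrative).

    def q_update(Q, visits, state, action, reward, new_state, actions, gamma):
        prev = visits.get((state, action), 0)      # times this action was taken here before
        alpha = 1.0 / (1.0 + prev)
        visits[(state, action)] = prev + 1
        observation = reward + gamma * max(Q[(new_state, a)] for a in actions)
        # Average the old estimate with the new observation.
        Q[(state, action)] = (1 - alpha) * Q[(state, action)] + alpha * observation

    Q = {("a", "L"): 0.0, ("a", "R"): 0.0, ("c", "L"): 0.0, ("c", "R"): 0.0}
    visits = {}
    q_update(Q, visits, "c", "R", 100.0, "a", ["L", "R"], 0.8)
    print(Q[("c", "R")])    # first visit, so alpha is 1 and the observation is taken whole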

3.4 Fast adaptation

A more radical change is to make the algorithm propagate each reward immediately backward to actions performed in the past. The idea is to speed up the learning process so that less wandering is required: we learn more rapidly, at the expense of the risk of overemphasizing rare consequences of our actions.

What we'll do is maintain an e-value for each transition of the state space, measuring how recently that transition has been used. The e-values start out at 0, but each time we use a transition we add one to its e-value. Each e-value then continually decays, by a factor of γλ after every action, so a transition used recently has a sizable e-value, which rapidly decays back toward zero as further actions are performed. The parameter λ (with 0 ≤ λ ≤ 1) controls how much effect a reward has on actions taken earlier.

The algorithm becomes the following. After each action we compute the error, the difference between what our formula says Q(s, a) should be and our current estimate, and we apply that error to every transition in proportion to its e-value.

    Start with Q(s, a) = 0 and e(s, a) = 0 for all states s and actions a.
    repeat indefinitely:
        s ← current state; a ← some action we select.
        Perform action a.
        r ← reward for performing a; s' ← new state after performing a.
        error ← r + γ max_{a'} Q(s', a') - Q(s, a)
        e(s, a) ← e(s, a) + 1
        for each state s'' and each action a'':
            Q(s'', a'') ← Q(s'', a'') + error · e(s'', a'')
            e(s'', a'') ← γ λ e(s'', a'')
        end for
    end repeat

Let's trace through the same wandering we used before, with a moderate value of λ. Things start out much the same. If we begin at a and move right, we get a reward of 0 and go into b; the error is 0, so no Q values change, and the only nonzero e-value is e(a, R). Our second action gives us a reward of 0 and places us in state c; again nothing but the e-values change. So far the Q values are just as in our first example.

But for the third action, when we move right from c to earn the cheese's reward and end up in state a, we'll see something different. This time the error is the full cheese reward, since all of our estimates are still 0. Going through each state and action, Q(c, R) increases by the whole error, because e(c, R) is 1; but Q(b, R) also increases, by the error times the partially decayed e(b, R), and Q(a, R) increases by a still smaller amount. Each e-value then decays by γλ once more. What's happened is that we received a reward for reaching the cheese, and the new algorithm propagated that reward to our past actions as well: a fraction of it to the action of moving right from b, and an even smaller fraction to the action of moving right from a.

Were we to continue through the entire sequence of actions we used before, we'd find that the final estimates have gotten very close to the exact Q values, with what is really a pretty short time wandering through the state space.
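Below is one way to realize this idea in Python. It is a sketch under my own bookkeeping choices (the decay constant, reward, and discount are illustrative), so the numbers it prints won't match the text's hand-worked trace.

    import random

    TRANSITIONS = {
        ("a", "L"): "a", ("a", "R"): "b",
        ("b", "L"): "a", ("b", "R"): "c",
        ("c", "L"): "b", ("c", "R"): "a",
    }
    REWARDS = {key: 0.0 for key in TRANSITIONS}
    REWARDS[("c", "R")] = 100.0       # illustrative cheese reward
    GAMMA, LAMBDA = 0.8, 0.5          # illustrative discount and decay parameters
    ACTIONS = ["L", "R"]

    Q = {key: 0.0 for key in TRANSITIONS}
    e = {key: 0.0 for key in TRANSITIONS}     # eligibility of each transition

    random.seed(1)
    state = "a"
    for _ in range(100):
        action = random.choice(ACTIONS)
        reward = REWARDS[(state, action)]
        new_state = TRANSITIONS[(state, action)]
        # How far off our current estimate for this transition is.
        error = reward + GAMMA * max(Q[(new_state, a)] for a in ACTIONS) - Q[(state, action)]
        e[(state, action)] += 1.0
        for key in TRANSITIONS:
            Q[key] += error * e[key]          # recent transitions share in the correction
            e[key] *= GAMMA * LAMBDA          # and their eligibility decays
        state = new_state

    for key in sorted(Q):
        print(key, round(Q[key], 2))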

3.5 TD learning

Temporal difference learning is an alternative to Q learning, originally proposed by Sutton in 1988 [Sut88]. It's closely related to Q learning; it's even a little simpler. The difference is that TD learning learns the discounted reward for each state of the state space, while Q learning learns the discounted reward beginning at each transition of the state space. Because you're learning a value for each state, rather than for each possible action from each state, you have fewer numbers to learn, and so less exploration is required to find a good solution.

The simplest version of TD learning (analogous to the Q learning algorithm of Section 3.2) revises the update rule to the following, where V(s) is our estimate of the discounted reward starting at state s, and r and s' are the reward and new state observed after the action we perform:

    V(s) ← r + γ V(s')

Everything else remains the same, and TD learning can be modified as in Sections 3.3 and 3.4.

To figure out the optimal action from a state s given the V values, we would look through each possible action from s, take the reward r and subsequent state s' for that action, and say that the optimal action is the one for which r + γ V(s') is largest. But if we don't understand the state space, we don't know r or s' for each action, and so we can't perform this computation. Thus TD learning makes sense in situations where the state space is known (as in game playing), and there it is frequently superior to Q learning; in situations where the state space is unknown, TD learning is impractical, and Q learning is the appropriate choice.
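A sketch of the two pieces in Python follows. The dictionaries and the illustrative reward and discount are my own, and choosing an action this way presumes we have a model of the transitions and rewards, as just discussed.

    # The simplest TD update, applied after observing one step of experience.
    def td_update(V, state, reward, new_state, gamma):
        V[state] = reward + gamma * V[new_state]

    # Choosing an action requires knowing, for each action, the reward and the
    # state it would lead to; we pick the action with the best one-step value.
    def best_action(V, state, actions, transitions, rewards, gamma):
        def value(a):
            return rewards[(state, a)] + gamma * V[transitions[(state, a)]]
        return max(actions, key=value)

    V = {"a": 0.0, "b": 0.0, "c": 0.0}
    td_update(V, "c", 100.0, "a", 0.8)     # illustrative reward and discount
    print(best_action(V, "b", ["L", "R"],
                      {("b", "L"): "a", ("b", "R"): "c"},
                      {("b", "L"): 0.0, ("b", "R"): 0.0}, 0.8))    # prints R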
3.6 Applications

As we saw in Section 3.2, Q learning, and indeed the whole state-space model, has a fatal problem: the world is just too complicated for us to hope to go through most real-life state spaces. Much of current research in reinforcement learning looks at practical applications demonstrating techniques for getting around this difficulty of tackling huge state spaces. Results have been mixed. We'll look at two of the more significant successes.

3.6.1 TD-Gammon

The first big reinforcement learning success is certainly TD-Gammon, a backgammon-playing system developed by Tesauro in the early 1990's [Tes95]. Conventional game-playing techniques have not applied well to backgammon, as the number of possible moves in each turn is unusually large (about 400, versus about 40 for chess), especially if you must consider all possible dice rolls. We're not going to look at the specifics of backgammon here; the important thing to know is that there is a set of pieces you are trying to move to goal positions, and that each turn the player rolls a pair of dice and must choose from a set of moves dictated by that roll.

Tesauro used TD learning. The number of states in backgammon is too overwhelming to tabulate, so rather than keep a table of all possible states, TD-Gammon employed a neural network to approximate the V value of each state. In the first version (version 0.0), the neural network had input units representing the current board, a layer of hidden units, and a single output representing the estimate of the input board's V value. From here you can train the network much as in regular TD learning: generally the reward is 0, with a nonzero reward arriving only when a game is won or lost. To choose a move, you have only to try each of the possible moves and see which resulting board state the network gives the greatest V value; then you take that move.

By training the network on 200,000 games against itself, Tesauro found that TD-Gammon performed about as well as one of the best computer backgammon players of the time: Neurogammon (also by Tesauro), a program with much specialized knowledge built into it. This was shocking, as Neurogammon had been written specifically to play backgammon, while TD-Gammon had nearly no backgammon knowledge outside of the rules. (For the updates, Tesauro used the TD equivalent of the Q learning refinements we saw earlier in this chapter.)

Encouraged by this work, Tesauro developed TD-Gammon 1.0. The basic change was to alter the inputs to the neural network to represent more complex concepts that backgammon masters feel are important about a board state: whether crucial positions are blocked, how close pieces are to winning, and so on. In other words, this new version incorporated some of the advanced backgammon theory that Neurogammon had. It also used more hidden units and trained on itself with 300,000 games. TD-Gammon 1.0 played competitively with human champions, but not at a level where it might win a world championship. Subsequent versions have made the program look more than one move in advance, taking some advantage of classical minimax search. The current version, TD-Gammon 3.0, is widely regarded as ranking among the best backgammon players in the world, human or computer, and very capable of winning a world championship if admitted.

People have tried duplicating this work in other games, especially chess and go. The results of these attempts have been encouraging but not as noteworthy as TD-Gammon: the programs learn to play well, but not near the level of human champions.

3.6.2 Elevator scheduling

Scheduling elevators in a building is an important practical problem. Crites and Barto can pride themselves on having one of the best published algorithms for scheduling elevators [CB96], and their technique uses reinforcement learning. Unfortunately, they don't get to see their work put into practice, or at least they don't know about it: elevator manufacturers jealously protect the scheduling algorithms they use, as this is part of their competitive edge over other manufacturers. (So we also don't have a way of comparing to the manufacturers' unpublished algorithms.)

In their study, Crites and Barto concentrated on scheduling four elevators in a ten-story building during the 5:00 rush, when nearly all passengers wish to go to the lobby to leave for home. In their simulated system, each of the four elevators has within it ten buttons, and each floor has an up and a down button on it. At any point in time, each elevator has a location, direction, and speed, and each button pressed on a floor has a "waiting time" associated with it,
saying how long it has been since the prospective passenger requested an elevator. This is a world with a great many states: Crites and Barto conservatively estimate that it has far more states than could ever be visited in the course of learning how to behave in each one.

An action in this state space gives the behavior of the elevators: an elevator may decide to go up, go down, or stay in its current location to unload or load passengers. The penalty for an action can be computed as the sum of the squares of the waiting times of any passengers loaded by the elevators during the time step. (We square the waiting time because realistically we want to give extra weight to passengers who have been waiting a long while.) Notice that one aspect of this scenario is that we can't predict the result of an action, because we can't predict where potential passengers will appear requesting an elevator; thus Q learning is more applicable than TD learning for this problem. But we still have the problem of all those states that we can't visit thoroughly during training.
Crites and Barto used several techniques to bring the state space down to a more reasonable level.

One thing they did to simplify the problem was to add additional constraints on possible actions, constraints that you probably need anyway: an elevator cannot pass a floor if a passenger within the elevator has requested a stop on that floor, nor can an elevator reverse direction if a passenger has requested a floor in the current direction. These constraints reduce the number of decisions that the elevators have to make. Despite the constraints, elevators must still make some choices; namely, when an elevator reaches a floor where a passenger is waiting, should it stop? (You may expect it to always stop for potential passengers, but if another elevator is stopping there anyway because it is carrying a passenger there, why waste time? Or perhaps the elevator wants desperately to reach a person who has been waiting longer.)

Crites and Barto also decided to make the elevators make their decisions independently. This will hurt performance, since more centralized control would allow elevators to band together to tackle their passengers more effectively. But making them independent simplifies the set of actions from which the learning algorithm must choose to just two: when we reach a waiting passenger, do we stop or not?

Even so, there are still far too many states to hope to apply Q learning directly. The final technique that Crites and Barto employ is to use a neural network to approximate the Q values, similar to how Tesauro used a neural network to approximate V values in TD learning. After much experimentation, they settled on a network with a few dozen input units, a layer of hidden units, and two output units. The input units represent different characteristics of the current state, and it is here that you can see how Crites and Barto must have tried a lot of things before settling on this particular representation. Some of the input units represent the buttons in the hallways: one unit for each floor saying whether a down passenger is still pending there, and one unit for each floor saying how long that down passenger has been pending. Other units represent the current location and direction of the elevator in question, and which floors hold the other elevators. One unit says whether the elevator in question is at the highest floor with an elevator request, one unit says whether it is at the floor with the oldest unsatisfied elevator request, and a final unit is a constant. Recall that the elevators make their decisions independently, so each network evaluates just the two possible actions for one elevator; the two output units represent the Q values of those two actions.

They trained their network in a simulated system and compared the results with those of many other algorithms. What they found is that the strategy learned by Q learning with their neural network outperformed the previous ad hoc algorithms designed by humans for elevator scheduling. This doesn't necessarily imply that Q learning is the best algorithm for elevator scheduling, but it does illustrate a situation where it has proven successful.

Chapter 4

References

One of the best machine learning textbooks is Machine Learning, by Tom Mitchell [Mit97]. Nearly everything covered in these chapters is also covered in Mitchell's textbook, and of course Mitchell covers much more. The book is generally aimed at somebody with several years of computer science training (like a senior).

On data mining, I found the textbook Data Mining, by Witten and Frank, useful [WF00]. It's really a misnamed book, as it's more of a practitioner's guide to machine learning than a textbook about data mining; but it does a good job of that. Its presentation is typically simpler (if less rigorous) than Mitchell's, and it is well-suited for experienced programmers who want to use machine learning.

Even Mitchell's book, thorough in many other respects, neglects reinforcement learning to a large degree. The only reasonable reference is Sutton and Barto's book Reinforcement Learning: An Introduction [SB98]. (These machine learning authors don't seem to have any desire for interesting titles.)

Finally, I drew much of the material about artificial life from Franklin's Artificial Minds. The book is really a non-specialist's take on artificial intelligence, written for a popular audience (i.e., non-specialists), so you can expect some handwaving of the same sort you would see in other mass-market science books. But Franklin writes well, and he covers the philosophical issues that authors of artificial intelligence textbooks typically avoid.

[AL92] D. Ackley and M. Littman. Interactions between learning and evolution. In C. Langton et al., editors, Artificial Life II, pages 487–509. Redwood City, Calif.: Addison-Wesley, 1992.

[CB96] R. Crites and A. Barto. Improving elevator performance using reinforcement learning. In D. Touretzky, M. Mozer, and M. Hasselmo, editors, Advances in Neural Information Processing Systems, volume 8. Cambridge, Mass.: MIT Press, 1996.

[Fis36] R. Fisher. The use of multiple measurements in taxonomic problems. Annual Eugenics, 7 (part II), 1936.

[MC80] R. Michalski and R. Chilausky. Learning by being told and learning from examples: An experimental comparison of the two methods of knowledge acquisition in the context of developing an expert system for soybean disease diagnosis. International Journal of Policy Analysis and Information Systems, 4(2), 1980.

[Mit97] T. Mitchell. Machine Learning. Boston: McGraw-Hill, 1997.

[Pom93] D. Pomerleau. Knowledge-based training of artificial neural networks for autonomous robot driving. In J. Connell and S. Mahadevan, editors, Robot Learning, pages 19–43. Boston: Kluwer Academic Publishers, 1993.
[SB98] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. Cambridge, Mass.: MIT Press, 1998.

[Sut88] R. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9–44, 1988.

[Tes95] G. Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3), 1995.

[Wat89] C. Watkins. Learning from delayed rewards. PhD thesis, Cambridge University, 1989.

[WF00] I. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. San Francisco: Morgan Kaufmann, 2000.

[Wil85] S. Wilson. Knowledge growth in an artificial animal. In Proceedings of the First International Conference on Genetic Algorithms and Their Applications, pages 16–23. Hillsdale, N.J.: Erlbaum, 1985.