John R. Anderson
Shawn Betts
Daniel Bothell
Ryan Hope
Christian Lebiere
This work was supported by the Office of Naval Research Grant N00014-15-1-2151. We would
like to thank Niels Taatgen and Matthew Walsh for their comments on the paper.
Abstract
A theory is presented about how instruction and experience combine to produce competence in a
complex skill. The theory depends critically on three components of the ACT-R architecture.
The first component interprets declarative representations of instruction so that they lead to
action. The second component converts this knowledge into procedural form so that appropriate
actions can be quickly executed. The third component, newly added to the architecture, learns
the setting of control parameters for actions through a process similar to reinforcement learning.
These three components are intermingled throughout the course of skill acquisition, providing an
account of the gradual shifts in the factor structure of skills. The overall theory is implemented in computational models
that are capable of simulating human learning in different versions of the video game Space
Fortress. Other than humans, these models are the only agents, natural or artificial, capable of learning these games at human speed.
The course of our species has been determined by learning skills as varied as hunting with a bow and arrow and driving a car. This paper describes a general theory of the mechanisms that underlie the human ability to acquire novel skills, which we have applied to learning a video game.
Like many interesting skills, learning to play a video game involves simultaneously mastering
components at perceptual, motor, and cognitive levels. The level of expertise players can achieve
with extensive practice in these games is truly remarkable (e.g., Gray, 2017), but also remarkable
is the speed with which people can acquire a serviceable competence at playing games. There
are now appearing artificial systems that use deep reinforcement learning to outperform
humans in some video games (e.g., Mnih et al., 2015). Whatever the judgment on the
achievements of such systems, so far they have offered little insight into how humans can learn
as fast as they do. As examples of the speed of human learning, Tessler et al. (2017) and
Tsividis et al. (2017) both report that humans can achieve in a matter of minutes the expertise that takes deep reinforcement learners many hours of play to acquire.
The essential feature of the task in this paper is that it requires simultaneously learning
many things at many levels. There has been considerable research on perceptual learning such
as visual patterns (Gauthier et al., 2010), motor learning (Wolpert, 2011), and complex tasks like
learning algebra (Anderson, 2005), but little research that has addressed how these are put
together. Our task also stresses the combination of instruction and experience. While one
cannot become good at a video game just by reading instructions, upfront instruction certainly
helps. This combination of initial instruction refined by subsequent experience is typical of human skill acquisition. As a test of whether our theory has really put the many pieces
together, we required that the model learn by playing the actual game. The model is implemented
within the ACT-R cognitive architecture (Anderson et al., 2004). Using ACT-R avoids the
irrelevant-specification problem (Newell, 1990) – that is, the need to make a large number of
under-constrained design decisions to allow the simulation to run. This is because these details
have been developed and verified in other empirical studies. Of particular importance, ACT-R
has a model of the speed and accuracy of perception and action and so prevents us from
implementing a system with super-human response speeds – as is the case in the typical artificial game-playing system.
Theoretical discussions of skill acquisition (e.g., Ackerman, 1988; Anderson, 1982; Kim
et al., 2013; Rasmussen, 1986; Rosenbaum et al., 2001; VanLehn, 1986) frequently make
reference to the original characterization by Fitts (1964; Fitts & Posner, 1967) of skill learning as
involving Cognitive, Associative, and Autonomous phases. While Fitts' original theory was
focused on motor skills, Anderson (1982) extended it to cognitive skills in terms of the then
current ACT* theory. Anderson (1982) interpreted Fitts’ phases as successive periods in the
course of skill acquisition. Fitts’ Cognitive Phase corresponded in ACT* to the acquisition and
utilization of declarative knowledge about the task domain. While ACT* focused on acquisition
and use of problem-solving operators, more recent ACT-R models have also used declarative
knowledge in analogical processes (particularly use of examples) and instruction following (e.g.,
Anderson, 2005; Taatgen et al., 2008; Taatgen & Anderson, 2002). This interpretive use of
declarative knowledge is eventually replaced through production compilation (both in ACT* and ACT-R). This results in action rules, called
productions, that directly perform the task without retrieval of declarative knowledge for
guidance. This both speeds up performance and frees retrieval processes to access other
knowledge. Fitts’ Associative phase was interpreted as the period over which compilation was
occurring. ACT* accounted for learning in Fitts' Autonomous phase in terms of production-rule
tuning, a process by which a search is carried out over the selection criteria of productions to find
the most successful combinations. Production tuning was heavily symbolic and did not apply
to learning over continuous dimensions in the skills of concern to Fitts and Posner or in video
games. Production tuning was not carried forward in the modern ACT-R theory because it did
not generalize in a workable way2 to the wide range of tasks that have occupied ACT-R. Thus,
current ACT-R does not have a characterization of how a skill becomes tuned to features that are
predictive of success3. A significant outcome of the current modeling effort is the development
of an extension to ACT-R that serves that function. This mechanism, called Control Tuning, is
related to the motor-learning literature on how movements like reaching for a target are learned (Wolpert, 1995). This
research has emphasized the development of an internal model mapping controllable properties
of the movement (forces produced by the muscles) to features of the movement (e.g., position,
velocity, and acceleration). This internal model enables selection of a range of desired
movements, not just the trained movements. While our video game (controlled by discrete key
presses) is not a typical motor learning task, it requires learning a model of how settings of
action parameters can meet with different success because of variability in the environment and variability
in the learner's behavior. A further complication is that the environment may not be stable and payoffs might change over time (Wilson & Niv, 2011).

2 That is to say, it produced a lot of broken or malfunctioning models.

3 In principle, it would be possible to do this by learning the utility of different productions (a learning mechanism in ACT-R), but this would require learning productions for every combination of features. This would be a very large number in a domain like that in the current experiment, and there is no plausible theory of how these individual rules would be learned.

A significant source of non-stability, relevant in video games, is that as one's skill changes, different choices may become optimal.
The best approach to these issues in the reinforcement-learning literature (for domains like n-armed
bandit problems; e.g., Cohen et al., 2007; Lee et al., 2011; Zhang & Yu, 2013) involves
continuously updating estimates of the payoffs of the options, choosing them with probabilities
that reflect their payoffs, and becoming more deterministic with experience. Not only does this
approach lead to adaptive behavior, it can be a good characterization of human learning (Daw et al., 2006).
The Space Fortress video game has had a long history in the study of skill acquisition,
first used in the late 1980’s by a wide consortium of researchers (e.g., Donchin, 1989;
Frederiksen & White’s, 1989; Gopher et al., 1989). Figure 1a illustrates the critical elements of
the game. Players are supposed to try to keep a ship flying between the two hexagons. They are
shooting “missiles” at a fortress in the middle, while trying to avoid being shot by “shells” from
the fortress. The ship flies in a frictionless space in which, to navigate, the player must
combine thrusts in various directions to achieve a path around the fortress. Mastering
navigation in the Space Fortress environment is challenging. While our subjects were
experienced video-game players, none had prior experience flying a ship in a frictionless
environment.
Details about our version of Space Fortress are available in Appendix A. Here we
describe what is necessary to understand the results. We used the Pygame version of Space
Fortress (Destefano & Gray, 2008) where subjects interact by pressing keys on a keyboard. In keeping
with a finger mapping used in many video games, this implementation maps the keys A, W, D
onto counterclockwise turn, thrust, and clockwise turn, respectively. Subjects pressed the space
bar to shoot. The original version of Space Fortress is more complex than our version and
requires tens of hours to achieve mastery. To focus on learning in a shorter time period we
created a simpler version that just required flying the ship and destroying the fortress. The ship
starts out at the beginning of the game aimed at the fortress, at the position of the starting vector
in Figure 1a, and flying at a moderate speed (30 pixels per second) in the direction of the vector.
To avoid having their ship destroyed, subjects must avoid hitting either the inner or outer
hexagons and keep their ship flying fast enough to prevent the fortress from getting an aim on it and
shooting it. When subjects are successful the ship goes around the fortress in a clockwise
direction. If their ship is destroyed it respawns after a second in the starting position flying along
the starting vector. Our version of the game eliminated much of the complexity of scoring in the original game:
1. Subjects gained 100 points every time they destroyed the fortress.
2. Subjects lost 100 points every time they died.
To keep subjects from being discouraged early on, their score never went negative.
Good performance involved mastering two skills – destroying the fortress and flying the
ship in the frictionless environment. To destroy the fortress one must build up the vulnerability
of the fortress (displayed at the bottom of the screen). When the vulnerability reaches 10,
subjects can destroy the fortress with a quick double shot. Each time a shot hits the fortress the
vulnerability increments by one, provided the shots are paced at least 250 msec. apart. If the
inter-shot interval is less than 250 msec. the vulnerability is reset to 0 and one must begin the
build up of vulnerability anew. In contrast to the shots that build up the vulnerability, the double
shot to destroy the fortress must be a pair of shots less than 250 msec. apart. While subjects
could easily make sure the shots building up vulnerability are at least 250 msec. apart by putting
long pauses between them, this would mean that they would destroy few fortresses and gain few
points. Thus, they are motivated to pace the shots as close to 250 msec. as they can without
going less than 250 msec. and producing a reset. The other aspect of shooting at the fortress is
aiming (see Figure 1b). While shots need to be aimed roughly at the fortress, they do not need to
be aimed at the dead center because the fortress has a width. Not to waste time on needless
aiming but not to miss, subjects need to learn the boundary that defines an adequate aim.
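To make the shot-pacing rules concrete, here is a minimal sketch of the vulnerability logic just described (hypothetical Python with invented names; the actual game code differs):

    # Sketch of the fortress-vulnerability rules (times in msec.; names are ours).
    class Fortress:
        def __init__(self):
            self.vulnerability = 0
            self.last_hit = None

        def register_hit(self, t):
            """Update state for a shot that hits the fortress at time t."""
            if self.last_hit is not None and t - self.last_hit < 250:
                if self.vulnerability >= 10:
                    return "destroyed"     # rapid double shot kills the fortress
                self.vulnerability = 0     # too-fast pacing resets the build-up
            else:
                self.vulnerability += 1    # paced shot increments vulnerability
            self.last_hit = t
            return "alive"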
The goal in flying is to maintain a trajectory that requires the fewest navigation key presses
(enabling the most shot key presses), that avoids being killed, and that optimizes one’s ability to
aim at the fortress. Subjects are instructed that they can achieve this by flying in an approximately
circular orbit (created by a sequence of legs producing a polygon-like path – see Figure 1a). This
can be achieved by applying thrust when they start to get too close to the outer hexagon. They
can only maintain a successful flight path if they control the speed of their ship. If the ship
moves too slowly the fortress can shoot it and destroy it. If they start moving too fast they will
not be able to make navigation actions in time. They can change the speed of the ship by the
direction the ship is pointed when they thrust (thrust angle in Figure 1b). The ship can be sped up
by changing its thrust angle so that it points more in the direction it is currently going and the
ship can be slowed down by aiming more in the opposing direction. Exactly how current speed,
current direction, and orientation of the ship determine future speed and direction is something
that subjects must learn from experience. Also, the amount of thrust is determined by the length
of time they hold down the key, and they need to learn appropriate durations for the thrust.
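The frictionless dynamics that subjects must internalize amount to a simple vector update; the sketch below is our illustration (the acceleration constant is an arbitrary placeholder, not the game's value):

    import math

    # Thrust adds a velocity increment along the ship's current orientation,
    # so the thrust angle relative to the direction of motion determines
    # whether the ship speeds up or slows down.
    def apply_thrust(vx, vy, orientation_deg, duration_s, accel=60.0):
        dv = accel * duration_s                      # longer press, bigger change
        vx += dv * math.cos(math.radians(orientation_deg))
        vy += dv * math.sin(math.radians(orientation_deg))
        return vx, vy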
A second version of the game, called AutoTurn, reduced the demands of aiming at the fortress and of maintaining a successful flight path. In this
version the ship was automatically turned to always aim at the fortress. One could control the
path and speed of the ship by timing thrusts of appropriate durations. These two games created our two experimental conditions, YouTurn and AutoTurn.
While these versions were piloted with CMU student subjects, the data come from
Mechanical Turk subjects. The performances of the two populations were similar but the
Mechanical Turk subjects performed slightly better, probably because the Mechanical Turk
subjects who selected this HIT were almost all video gamers. Only one subject reported no video
game experience, although no one reported experience with flying in a frictionless environment.
A minority reported having played Asteroids, which has some similarity to Space Fortress. No
subject mentioned Star Castle, which was the arcade game that Space Fortress was based on.
Asteroids is one of the Atari games that deep reinforcement learning has performed
poorly in (Mnih et al., 2015) and despite improvements, reinforcement methods still
considerably underperform experienced game players (e.g., Elfwing et al., 2017; Zhu & Poupart,
2017). Appendix B reports deep reinforcement learning results for Space Fortress based on the
original Mnih et al. (2015) approach. Even after extensive training the performance is much worse than
human performance. One source of difficulty for Space Fortress (and probably Asteroids) is that
actions can have long-term consequences for the flight of the ship that hurt future prospects but
can have no or contradictory effects on score. Another source of difficulty unique to Space
Fortress is learning to make 10 paced shots followed by a rapid pair of shots (and all shots
decrement score) to gain the big score from destroying the fortress. Thus, in effect it is
necessary to apply one policy ten times, getting small negative reinforcements along the way,
and then switch to a different policy. Space Fortress instructions inform subjects about this
whereas reinforcement learning methods try to discover it. Some game players can discover this without instruction.
The subjects played 20 3-minute games. Some subjects failed to show learning over the
course of the game and some of these might not have been trying. We regressed subjects’ scores
on game number and excluded subjects who did not show a positive slope with significance of at
least .1. This mainly eliminated poor performers, including a number who scored 0 points for all
games (but Appendix C reports average results for all subjects). There were 65 subjects in
YouTurn (19-55 years, mean 31.9 years, 29 female) of which 48 met the learning criterion for
inclusion (19-55 years, mean 32.5 years, 22 female, 1 non video game player, 6 who had played
Asteroids). YouTurn subjects were paid a base pay of $7.50 for completing the experiment
and .02 cents per point (average bonus $2.11). There were 52 subjects in AutoTurn (19-41
years, mean 31.7 years, 18 female) of which 42 met the learning criterion for inclusion (24-41
years, mean 32.5 years, 15 female, all had played video games, 10 who had played Asteroids).
AutoTurn subjects were paid a base pay of $4 for completing the experiment and .02 cents per
point (average bonus $7.88). Appendix C reports average results for all subjects.
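The learning criterion described above can be summarized in a few lines; the following is a sketch of our reading of the procedure (a per-subject regression of score on game number, keeping subjects with a positive slope at one-sided p < .1):

    from scipy.stats import linregress

    def shows_learning(scores, alpha=0.1):
        """scores: one total per game, in order of play."""
        games = list(range(1, len(scores) + 1))
        fit = linregress(games, scores)
        # scipy's p-value is two-sided, so halve it for the one-sided test
        return fit.slope > 0 and fit.pvalue / 2 < alpha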
Figure 2 shows the average performance of subjects in the two versions of the game. To
stress the magnitude of individual differences, the error bars denote a standard deviation of the
population (not of the mean). Both populations show substantial improvement over the course
of the 20 games, but throughout the experiment AutoTurn subjects are much better than
YouTurn subjects, indicating a major cost associated with having to turn one’s ship. Subjects in
YouTurn only achieved 53% as many shots on the fortress as AutoTurn subjects and averaged 3.8
times as many deaths. We postpone further discussion of the data until after the description of the model.

A single ACT-R model was run in both versions of the game. That model consists of a
set of production rules that respond to information in a number of buffers standard to ACT-R
models: information about the screen, like where the ship is, is held in the Visual buffer;
information about various control variables, like the time to wait between shots, is held in the
Goal buffer; information about the passage of time is in the Temporal buffer; information about
recent outcomes (like a shot having missed) is in the Imaginal buffer; and retrieved information
about the game instructions is available in the Retrieval buffer. The development of game
competence is based on three critical theoretical mechanisms: a system of rules that interprets
declarative representations of the instructions, production compilation that converts this
knowledge into direct-action productions, and the new Control Tuning component that learns
settings for the control parameters of actions.

The subjects were given instructions relevant to playing YouTurn and AutoTurn. These
instructions were represented in ACT-R's declarative memory as a set of operators (as in many
prior ACT-R models – e.g., Anderson et al., 2004; Anderson, 2007; Taatgen et al., 2008;
Anderson & Fincham, 2014). Operators take the form of specifying a state in which they apply,
specifying the action to take in that state, and then specifying the new state that will result.
Figure 3 illustrates the state structure of these operators where each node represents a state. In
Space Fortress many of the operators serve a selection function in that their action is to test for a
condition in the game and the resulting state depends on the outcome. This produces the
branching structure in Figure 3. The labels on the arrows emanating from these states are the
tests that have to be satisfied to go to the state the arrow is pointing to. The labels on the arrows
emanating from the bottom states are the key presses specified by the operators associated with
these states. The A key turns counterclockwise, the D key clockwise, the W key thrusts, and the
space bar shoots. Upon pressing a key the model returns to the top of the tree to determine the
next action. To illustrate what these operators are like, we will focus on the highlighted path in Figure 3:
1. Fortress Present: The appropriate actions depend on whether the fortress is present or
not. The fortress is off screen for a second each time it is destroyed before it respawns
(shooting is not useful when there is no fortress and one can concentrate on navigation).
In the case of the highlighted path, the fortress is present and the operator specifies
transitioning to the state for decision making when the fortress is present.
2. Far from Border: The choice to shoot or navigate depends on how close the ship is to
drifting into the outer hexagon (too close is dangerous; Control Tuning learns what
defines too close). In this case, it is far from that border and the model can focus on
shooting.
3. Aim Beyond: The decision to shoot depends on whether the ship is aimed at the fortress
(what constitutes an adequate aim is something else that needs to be learned by Control
Tuning). In this case the ship is aimed too far beyond the fortress and it is necessary to
turn clockwise.
4. Turn Clockwise: the operator associated with this state specifies pressing the D key to
turn the ship clockwise.
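To make the operator representation concrete, the four operators on this path might be rendered as declarative structures along the following lines (illustrative field names; the actual ACT-R chunks differ in detail):

    # Each operator names the state it applies in, the test or action it
    # performs, and the state(s) that result (cf. Figure 3).
    OPERATORS = [
        {"state": "start",             "test": "fortress-present?",
         "if-true": "fortress-present", "if-false": "no-fortress"},
        {"state": "fortress-present",  "test": "far-from-border?",
         "if-true": "consider-shooting", "if-false": "navigate"},
        {"state": "consider-shooting", "test": "aim-beyond-fortress?",
         "if-true": "turn-clockwise",   "if-false": "check-aim"},
        {"state": "turn-clockwise",    "action": "press-D", "next": "start"},
    ]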
The model starts with instruction-following productions that interpret these instructions and
execute them. For instance, Figure 4 shows the ACT-R production (right) and the buffers (left)
that the production matches when the second operator above is used. In green is the operator that
is in the retrieval buffer. It specifies performing a test on distance to the big hex. In red are the
contents of the goal buffer, including the threshold for distance to the big hex as 33.95 pixels4.
In blue are contents of the visual buffer, including the current distance from the big hex as 112.85
pixels. The operator specifies that if the distance to the big hex is greater than the threshold
(which it is in this case) the model should go on to testing the aim. The production requests
retrieving the operator for aiming, which is the third operator in the sequence of 4 above.
4 To be clear, we are not assuming humans see distance in pixels (let alone fractions of pixels), only that their perceptual scale is linear in pixels.
In the ACT-R model, the average time to retrieve an instruction is about 100 msec. and
the average time to execute a production is 50 msec. Thus, each of the 4 steps above will take
about 150 msec., and the time to execute this sequence will be approximately 600 msec., plus the time for motor execution5.
Production Compilation
Without the instructions just discussed the model is not able to play the game. With these
instructions but without production compilation, the model is able to play the game but
accumulates virtually no points because it averages 4 to 8 times as many deaths as fortress kills
(see Figure 13 later in the paper). It often takes too long to step through these instructions to act
in time. (The timing parameters that control the speed through these steps are based on a history
of successful ACT-R models.)

5 Counting motor execution into the total time is complex because execution of a motor action can overlap with retrieval of what to do next.

Subjects also initially find the task overwhelming. In YouTurn
the subjects average 100 seconds to destroy their first fortress, while they only take about 5
seconds to kill the fortress by the end of play (for the model runs the average is 81 seconds for the first kill).
Production compilation (Taatgen & Anderson, 2002) has been used in past instruction-
following models of skill acquisition (e.g., Anderson et al., 2004, Anderson, 2005, Taatgen et al.,
2008; Taatgen, 2013) to capture the transition that people make from acting on the basis of
retrieved declarative information to responding directly to the situation. If two productions fire in
sequence, then they can be combined into a single new production that is added to procedural
memory. If the first of these productions makes a request to declarative memory, and the second
production uses the retrieved fact to determine what to do, then the retrieved fact is substituted
into the new production. So, for instance, consider the production on the right of Figure 4. It
would apply to interpret both steps 2 and 3 because they both involve testing whether a value
exceeds a threshold. Production compilation would combine a double execution of this production (first to process operator 2, second to process
operator 3) into a single rule that responds to the retrieval of operator 2 and has the content of
operator 3 built into it. Thus, it not only collapses two productions into one, it skips the retrieval
of a declarative instruction. Production compilation would combine this rule with others and
eventually a single rule would emerge that encapsulates the sequence of 4 operators into the
direct-action production illustrated in Figure 6. The production in Figure 6 detects the presence
of a fortress (first operator), the distance from the fortress (second operator), the aim of the ship
(third operator), and the fact that a clockwise correction is required (fourth operator). This
production collapses a sequence that originally took about 600 msec. in
instruction-following mode to 50 msec. The final execution still requires the time to press D and
this motor time does not change (but see footnote 5). High levels of performance require such
production rules that directly respond with key presses to different game situations. However,
the model can start to achieve positive points even with production rules that compile parts of
the operator sequences (like in Figure 5) and so speed up performance. The development of such
intermediate rules allows the model to begin to achieve kills about midway through the first
game.
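In outline, the substitution step of production compilation looks like the following toy sketch (our illustration, not ACT-R's implementation):

    # If rule A requests a fact from declarative memory and rule B acts on the
    # retrieved fact, the compiled rule has the fact built in and needs no
    # retrieval.
    def compose(rule_a_matches, rule_b_act, fact):
        def compiled_rule(situation):
            if rule_a_matches(situation):           # rule A's condition
                return rule_b_act(situation, fact)  # rule B's action, fact baked in
        return compiled_rule

    # e.g., hard-code the "test aim next" step into the border-distance check:
    far_from_border = lambda s: s["dist_to_border"] > s["border_threshold"]
    act_on_operator = lambda s, op: op
    direct_rule = compose(far_from_border, act_on_operator, "test-aim")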
A compiled production must compete with the first production that it originated from (the
“parent” production) since that parent production will match in any situation that the new
production will match. Utility values of productions determine which production applies when
more than one matches the current situation. The utility of a production reflects the value of
what it eventually achieves minus the amount of time it takes to do that. What this model is
trying to achieve is to press the next key to play the game, and it treats all key presses as having
the same utility. Thus, the factor that determines the relative utility of productions is how fast
they lead to the next key press.
This model uses the standard ACT-R utility learning (Anderson, 2007). In ACT-R new
productions start with zero utility and must learn their true utility. Thus, at first a new
production will lose in competition with its parent and will only gradually begin to take effect as
its true utility is learned. Whenever the parent production fires and production compilation
recreates the new production, ACT-R adjusts the utility of the new production towards the
18
The utility values fluctuate moment by moment from the value prescribed above because of a
normal-like noise. Thus, when the utility of the new production approaches that of the parent,
the new production will have some probability of applying. When the production rule does
apply, its value can be determined, and its utility will change according the linear learning rule
The value of a compiled rule will be greater than its parent’s value because it is more efficient.
Eventually the compiled rule will acquire greater utility and tend to be selected rather than the
parent. Thus, new productions are only introduced slowly, which is consistent with the gradual
nature of skill acquisition. The parameter α determines how fast the utility of the new
production grows and comes to exceed the parent. It takes many opportunities before rules come
to dominate that skip all instruction-following steps and just perform the appropriate actions.
The speed with which direct-action productions are acquired also depends on their frequency in
the game. For instance, the direct-action single-shot production is learned faster than the direct-
action double-shot production because the single-shot action occurs approximately 10 times as
often (to build up fortress vulnerability) as the double-shot action that destroys the fortress.
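The competition just described reduces to a few lines; this sketch uses our own names and placeholder parameter values:

    import random

    def update_utility(u, reward, alpha=0.05):
        """ACT-R's linear learning rule: move utility toward the reward."""
        return u + alpha * (reward - u)

    def select_production(utilities, noise_sd=1.0):
        """Fire the matching production with the highest noisy utility."""
        noisy = {name: u + random.gauss(0, noise_sd)
                 for name, u in utilities.items()}
        return max(noisy, key=noisy.get)

    # A compiled rule starts at utility 0, so its parent wins at first; as
    # update_utility pulls it toward its (higher) true value, the noise lets
    # it start winning some competitions, and eventually most of them.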
Control Tuning
A model that has the declarative instructions and production compilation still does not
learn to play at human level. Runs of such models playing YouTurn reach an average of about
200 points per game (see Figure 13 later in the paper), well under the average of YouTurn
subjects in Figure 2. To learn how to play YouTurn, the model must learn settings for 5 control
values: shot timing, aim (see Figure 1b), thrust timing, thrust duration, and thrust angle (see
Figure 1b). Here we will focus on shot timing and aiming as illustrative of the general principles
behind Control Tuning. Information on the other control values is available in Appendix D.
Control Tuning involves three critical components: selecting a range of possible values for a
parameter that will control an action, attending to relevant feedback, and converging on good values.
Setting a Range for a Control Value. Subjects have some sense of what good values are for
shot timing and aim, and we set ranges for the model that reflect that.
1. Shot timing. The instructions tell subjects that while they are building the
vulnerability of the fortress up to 10 their shots must be at least 250 msec. apart.
Subjects have a vague idea of what constitutes 250 msec. and tune that range to
something more precise. ACT-R can be sensitive to the passage of time through its
Temporal module that ticks at somewhat random and increasing intervals producing
the typical logarithmic discrimination for temporal judgments (Taatgen et al., 2008).
The model places the possible range between 9 and 18 time ticks. Given the standard
parameters of the Temporal module, this corresponds to roughly 150 to 500 msec.
(a sketch of the tick process follows this list).
2. Aiming. The ship must be pointed towards the fortress to shoot it (see Figure 1b). All
but 4 subjects start out with average aims less than 20 degrees off (maximum was 33
degrees) suggesting that most understood the need for aiming from the beginning.
The ship turns in 6-degree increments, which means it is futile to aspire to less than
a 3-degree offset in aim. The most accurate subject averaged a 5-degree offset. The
model starts out with a range of 3 to 20 degrees and learns the best value in that
range.
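Here is the sketch of the Temporal module's tick process promised above, using what we understand to be the conventional ACT-R defaults (start increment of 11 msec., multiplier 1.1, noise scale .015); with these values 9 ticks is roughly 150-170 msec. and 18 ticks roughly 500-550 msec.:

    import random

    def time_after_ticks(n, t0=0.011, mult=1.1, noise=0.015):
        """Approximate seconds elapsed after n temporal-module ticks."""
        elapsed, tick = 0.0, t0
        for _ in range(n):
            tick = mult * tick + random.gauss(0, noise * mult * tick)
            elapsed += tick            # ticks grow, giving log discrimination
        return elapsed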
Attending to Relevant Feedback. One of the reasons human players can learn faster than the
typical deep reinforcement learners is that they know what features of the game are relevant as
feedback to the setting of each parameter. They adjust their behavior according to these features
and not raw points scored. The model looks for a control setting that yields an optimal rate of
return per unit time on the features relevant to that setting:
1. Shot timing. Subjects know that a good setting for shot timing results in increments
to vulnerability and not resets of vulnerability. Therefore, the features that the model
attends to are increments and resets, weighting the first positively and the second
negatively. For simplicity, for shot timing and all other control variables the positive
and negative weights are equal (+1 and -1). Note that the model treats increments in
vulnerability as positive even though each shot produces a small
decrement in score. Note also that very slow shooting will eliminate resets, but it will
also produce few increments and few kills.
2. Aiming. Subjects know that a good aim results in a hit and a bad aim results in a
miss. Therefore, the model weights hits positively and misses negatively. Similar to
shot timing, if it tries to aim too accurately it will avoid misses but not get as many hits.
Convergence. A number of ACT-R models (e.g., Taatgen et al., 2007; Taatgen & van Rijn,
2011; Moon & Anderson, 2012) have learned timing using the Temporal module, which is used
for shot timing in the current effort. These models store in declarative memory all results for
different choices of number of time ticks and select a time for waiting by merging the results of
the individual choices together using an ACT-R mechanism called blending (Lebiere et al.,
2007). These models retrieve these past experiences and then a production acts on a blended
outcome. There is not enough time in a fast-paced game for this retrieve-and-act cycle to work.
It is all the more difficult because there are many control dimensions to monitor and select good
values for. To capture the successful performance of the subjects, we needed blending to happen
independent of declarative retrieval and procedural execution so that it would not interfere with
the use of the declarative and procedural modules for game play. To achieve this we created a
new Control Tuning Module that functions in parallel with other modules.
The Control Tuning Module samples a candidate control setting for a while, determines a
mean rate of return per unit time for that control setting, and then tries another control setting. It
gradually builds up a sense of the payoff for different control values by blending these outcomes
together. It learns a mapping of different choices onto outcomes through a function learning
process (e.g., Koh & Meyer, 1991; Kalish, Lewandowsky, & Kruschke, 2004; McDaniel &
Busemeyer, 2005). As suggested in Lucas et al. (2015), we assume a learning process that favors
a simple function as most probable and in the current context this leads to estimating a quadratic
function V(S) that describes payoff as a function of the control setting S. A quadratic function
facilitates identification of a good point, which is where the function has a maximum. The
model starts off seeded with a function that gives 0 for all values in the initial range and weights
this function as if it is based on an amount of prior observations (20 seconds in the current
model). It starts to build up an estimate of the true function as actual observations come in.
The control settings are changed at random intervals. The durations of these intervals are
selected from an exponential probability distribution (with a mean of 10 seconds in the current
model). Figure 7 shows the quadratic function that was estimated for shot timing in one run of
the model after playing a single game (180 seconds). The red dots reflect performance of
various choices during that game. They are quite scattered reflecting the random circumstances
of each interval. For instance, note the value at 9 ticks, which is approximately 150 msec. This
brief an interval would typically lead to resets, but at the beginning of the game when instruction
interpretation slows down action, it actually results in about 2 hits per second.
Tuning needs to balance exploration to find good values with exploitation to use the best values
it currently has. For this purpose Control Tuning uses a softmax equation that gives the
probability of choosing control value S:

P(S) = exp(V(S)/T(t)) / Σs' exp(V(s')/T(t))

V(S) is the current value of the quadratic function for control value S. T(t) is the temperature,
which decreases over time according to the function T(t) = A/(1+B*t), where A is the
initial temperature and B scales the passage of time (in the model B is set to 1/180 per second,
reflecting a 180 second game)6. Figure 8 illustrates convergence for one random run of 20
games. Figure 8a shows that in this run the estimated quadratic function became fairly tuned by
game 5. However, Figure 8b shows that the choices continue to concentrate more and more
around the best value as the temperature decreases.

6 The variable t reflects experience in the game, not clock time. So, if a player has a break between games, the value of t does not advance.
Summary. The Control Tuning Module learns to map a continuous space onto an evaluation by
exploring that space and converging on an optimal point. While the spaces being mapped in this
task are all one-dimensional, the methods, being based on ACT-R blending, could be
generalized to multi-dimensional spaces (as in the backgammon model of Sanner et al, 2000).
Other function-learning approaches could be applied to this task. Indeed, deep reinforcement
learning methods could apply to this task. This would differ from the typical application of deep
reinforcement learning methods to games in that (a) they would apply to reduced spaces relevant
to a particular control variable rather than the full space of the game, (b) the space would be
defined by prior assumptions about plausible ranges for that control variable, and (c) the
evaluation would be specific to that control variable. Such an informed focus seems
characteristic of human learning. In some cases this focus can mean that humans miss learning
about contingencies they did not expect to be relevant.
Comparison of Model and Data
Figure 9 compares subjects and model in terms of total score. We ran 100 models in each
condition. Besides matching the overall growth in points of the subjects, the models show a high
variability in performance, almost matching the range shown by the subjects. The 100 models
were created by crossing 4 different strategies with 5 different settings of the production
learning rate and 5 settings of the initial temperature (4 x 5 x 5 = 100):
1. Strategies. The strategies reflected different things subjects reported. Some reported
attending to distance to reach the outer
border while others reported attending to time to reach outer border. Some reported
lowering their threshold for thrusting when the fortress was absent to get the
navigational needs over with when there was nothing to shoot at. Others reported a
constant navigation emphasis. Crossing these two factors yielded the 4 strategies.
The models attending to time averaged 182 points better than the models attending to
distance (F(1,24) = 32.64, p < .0001) and models emphasizing navigation when the
fortress was absent did a marginal 38 points better (F(1,24) = 4.18, p < .10).
2. Production Learning Rate. We used values of .025, .05, .1, .2, and .3 for this
learning parameter that controls the rate at which compiled productions replace the
productions they are compiled from. This variable loosely reflects differences in
relevant prior experience in subjects. More experienced subjects probably start with
some partial compilations as Taatgen’s (2013) transfer model might imply. Mean
points varied strongly with this factor: 1757, 1975, 2025, 2231 and 2210 points in
AutoTurn and 361, 700, 853, 1080, and 1089 points in YouTurn.
3. Initial temperature A. We used values of .1, .25, .5, 1, and 2. This variable reflects
attention and motivation in subjects. More attentive subjects will incorporate more of
their learning experiences and more motivated subjects will weight the payoff higher.
Either factor results in a more rapid focus on good control values. The variation of
mean scores with this parameter was 2077, 2010, 2111, 2038, and 1964 in AutoTurn
and 871, 933, 861, 767, and 648 in YouTurn. The overall variation in AutoTurn was not
significant (F(4,76)= 1.86) but it was in YouTurn (F(4,76)= 18.73, p < .0001).
Performance was worst when this variable was set to 2 (high exploration) because the
model remained quite variable in its choice behavior. Performance can also be worse
with low noise because the model can prematurely converge on non-optimal values.
The goal in producing a set of different models was not to create a model of each subject's
unique behavior, but to show that the theory predicts subject-like behavior over a range of
strategies and parameter values.
The right panels of Figure 9 (panels c and d) show the strong relationship that exists
between the mean number of keys a subject hits (averaged over the 20 games) and the mean
number of points per game that the subject achieved (r = .799 in AutoTurn and .766 in
YouTurn). The model reproduces the relationship. The range of either measure is not as
extreme in the model as in the subjects and the correlations are a bit weaker in the model (r
= .693 and .621). The relationship would not emerge if subjects or models were making
counterproductive actions (e.g., shooting too fast and resetting or thrusting too much and running
into the hexagons). However, as subject and model actions tend to be things that lead to points,
the more rapidly they can execute them the more points they can build up.
The subjects slightly outperform the model in AutoTurn (an average difference of
80 points), but not significantly so (t(140) = 0.68, p > .05). Subjects are not as good in YouTurn (an
average difference of 117 points, again not significant, t(146) = 1.60). Even if these differences
were significant, little can be made of such direct comparisons of level of performance. The
subjects do not reflect all the variation in a larger population that did not self-select to be in this
experiment on Mechanical Turk. The models also do not reflect all the possible models that
could be created with different strategies or parameters. Also the models are trying all the time,
something that is not true of every subject. More important than the absolute level of scores are
the learning trends over games. Both models and subjects show learning but the models are
learning faster in YouTurn. The improvement from the first 5 games to the last 5 is 833 points
for the models and 630 points for the subjects, a difference that is quite significant (t(146)=3.34,
p < .005).
Each time the fortress is destroyed subjects gain 100 points and each time they are killed
they lose 100 points. Thus, kills and deaths largely determined their total points. The only
additional factor is the 2 points that each shot costs. Figure 10 looks at the growth in fortress
kills over trials and the decrease in ship deaths. The correspondence between models and
subjects is strikingly close for fortress kills, but subjects achieve a lower rate of death in
AutoTurn while the models outperform the subjects on this measure in YouTurn. The
interaction for death rates between conditions and populations (subjects versus models) is
significant (F(1,284) = 8.07). However, the two populations show considerable overlap in
Figure 10.
So as not to get the discussion focused on the details of Space Fortress rather than the
underlying learning processes, we will focus further comparisons on Control Tuning, which is
the new theoretical addition to ACT-R. We will describe results relevant to two control values,
shot timing and aiming. These results are representative of the learning of the other control
values, which are covered in Appendix D.
Shot Timing. The simplicity of AutoTurn makes it good for assessing the learning of
shot timing. Most of the key presses are shots, and they are produced in runs (an
average run of 6.3 shots for subjects and 6.9 for the models). This means one can use intershot times to
get an estimate of how subjects are pacing their shots. Figure 11a compares subjects and models
in terms of intershot time. Both show a decrease from about 600 msec. to about 350 msec. The
asymptotic value of 350 msec. is 100 msec. above the minimum to avoid a reset, but given
variability in one's ability to time actions and the game variability7, the optimal setting for this
control variable is above the minimum. Note that on the first game the models average an
intershot time of 600 msec. even though the upper bound of their range for timing is 500 msec.
They take this long because of the time to reason through all of the steps in determining that they
can shoot. We assume the slow initial times of subjects also reflect the cost of thinking through
the instructions.

7 The game updates every 1/30th of a second (a game tick) and requires that shots be 8 game ticks apart to avoid resets. This means the actual time needed to avoid a reset can vary from 234 to 267 msec. depending on when the key press occurs relative to the game ticks.
Figure 11b compares subjects and models in terms of frequency of resets. Subjects
average about 3 resets while the models average about 2 (t(146)=3.91, p < .0001). While resets
are not decreasing, the number of shots taken is increasing from about 250 to 450 reflecting the
decrease in intershot time and other improvements in performance. So the proportion of shots
that produce resets is decreasing.
Figure 11c shows the probability of Control Tuning choosing different waiting times
averaged over the hundred models. The models tend to converge on a sweet spot at about 300
msec. (production execution adds another 50 msec. to intershot times). There is a small rise in
choice probability at 500 msec. even after 20 games. This reflects two models that converge on
high durations and stop exploring. Both of these models never reset the fortress but score rather
poorly.
Aiming. Subjects have to turn their ship in YouTurn and turning is critical both to
aiming at the fortress and to controlling navigation. Figure 12 displays information on aiming
comparable to Figure 11 on shot timing. Part (a) shows how far the aim was off the fortress when a
shot was taken. Both subjects and models show a decrease in the average angular disparity over
games. Part (b) shows the resulting decrease in proportion of missed shots. While both
populations show similar decreases and both populations overlap, the models average more
accurate behavior. Figure 12c shows that the models tend to converge on a maximum disparity
of about 9-10 degrees. The slight peak at 0 reflects two models that got locked into seeking
maximum accuracy. These two models made 50% more turns than average, made half as many
shots, and scored less than half as well as the average model.
The models showed learning results much like those of the human players. Three aspects of the
models were critical to this outcome – interpretation of declarative instructions, the production
compilation process, and Control Tuning. To investigate the contributions of production
compilation and Control Tuning, we created 9 variants on the original model by crossing 3
choices with respect to production compilation and 3 choices with respect to Control Tuning.
For production compilation the three choices were
1. Already Compiled. The model started out with the direct-action productions that
production compilation would eventually produce.
2. Compiling. Production compilation operated as in the models reported above.
3. No Compiling. The model kept interpreting the declarative instructions and never
formed compiled productions.
For Control Tuning the three choices were
1. Tuned. The model started with the control parameters fixed at high-performing values.
2. Tuning. The Control Tuning mechanism was in operation as in the models reported above.
3. No Tuning. The choice of control values stayed uniform across the full range of values
throughout the games.
All models had the strategy of attending to time to border and emphasizing navigation during the
1-second periods that the fortress was absent. This was the best scoring of the 4 strategies that
contributed to the model results compared to subjects. The combination of the second choice for
both (Compiling, Tuning) was identical to the model reported earlier. We used the same
sets of choices for the compilation learning rate and for the initial temperature. We gathered
model runs extended out to 100 games to clearly identify asymptotic performance. The three No Compiling
versions score close to zero points (averages between 10 and 20 points per game) because
they cannot act with sufficient speed to play the game properly. The Compiling models come
up to the level of the Already Compiled models over the course of the games. The No Tuning,
Compiling models actually outperform the No Tuning, Compiled models because these models
miss some of the resets that would occur when a short intershot delay is chosen8. The Compiled,
Tuned models have nothing to learn and define the optimal performance. They average about
1900 points over the course of the games. The models that are learning Control Tuning reach an
average of about 1450 points by the end of the 100 games. Their lower score than the pre-tuned
models reflects that some of the models are converging on suboptimal control choices (see
Figures 11c and 12c).

8 Along the way to acquiring direct action productions, production compilation will produce a
pair of shooting productions, one that goes from the Start state in Figure 3 to the Shoot state and another
that goes from the Shoot state to the key press. In cases where the time delay has not yet passed,
the first production of this pair will fire rather than the direct action production. When the time
delay has passed the second action production will fire. When there is no tuning, the models keep
choosing too short times with substantial probabilities. This 2-production combination serves to
delay the shot a bit and avoid many resets.
Tsividis et al. (2017) discuss why humans can learn to play Atari video games so much
faster than the deep learning results. As they note, humans (and this is true of the ACT-R
models) come into the task with many advantages, including the ability to parse the screen and
prior understanding about the games. The ACT-R models also start with instructions about how
to play the game. Tsividis et al. (2017; see also Tessler et al., 2017) show that such instructions
confer a substantial advantage on human players. However, the performance of the No Tuning
models in Figure 13 provides an illustration that abstract knowledge about how to play the game
is not enough for high human performance. Humans need to perfect that knowledge through
experience with the task at hand. In our models, production compilation and tuning underlie this
gradual perfection.
The tuning that subjects and the models are doing has similarity to the learning in Mnih et
al. (2015), but subjects and models have advantages in the learning they are doing.
1. Subjects and the models start out knowing what the relevant control variables are and
what actions these variables are relevant to. This knowledge includes what players
already know (e.g., aim is critical to shooting) and other knowledge explicitly given in the
instructions.
2. Subjects and the models have a sense of the relevant range of possible values
for the control variables (e.g., how accurate an aim is good enough, how much time
to wait between shots).
3. Subjects and the models know what outcomes are relevant to assessing these
variables (e.g., vulnerability resets indicate shots are too rapid, misses indicate aims
are not accurate enough). Unlike reinforcement learners they are not learning from
total points, which have a less direct connection to the setting of control variables.
We re-engineered the model to assess the impact of these features of the modeling approach.
For a referent, we used the precompiled version of the model so the learning outcomes would not
be confounded with the effects of production compilation (the compiled-tuning variation from
Figure 13). Figure 14 compares the result of this option (called Model) with the following
options:
Wide. To assess the contribution of having a sense of the range for control values, we
doubled the range of initial values. For shot timing this meant going from a range of 9-18
ticks to a range of 6-24 ticks; for aim it meant going from a range of 3-20 degrees to 3-37
degrees. This version does not reach asymptotic behavior as fast as the other models.
It may be reaching an asymptote by 100 games, but many of these models are converging
on suboptimal values.
Points. We created a version of the model that used change in points to assess each
control variable rather than features directly relevant to that variable like resets for shot
timing and misses for aiming. Points were divided by 10 to get the magnitude of
reinforcement into the same range as the magnitudes used in Figure 10. This version does
not do as well as the model with more direct feedback, but we were surprised by how
well it did.
Exploit. The No Tuning models in Figure 13 always explored the full range of control
values, according to the softmax equation, rather than using the current best performing
value. We created a “pure exploitation” model that always chose the current best value.
This model performs better on the first few games because it tends to select reasonable
control values. However, it sticks with these values and does not improve by identifying
better values.
Instant. The ACT-R model represents the time costs of the perceptual, cognitive, and
motor actions. The line labeled “Instant” in Figure 14 represents a model with those
limitations removed. This version still needs to hold keys down long enough to get the effects
it wants and needs to pace its shots to avoid resets, but it can begin executing an action as
soon as it is needed. Not surprisingly this model performs much better than the ACT-R
model, but it still has to learn appropriate settings for control variables. This model
represents an advantage that the artificial deep reinforcement learning systems have.
Discussion
We should emphasize that this is the first ACT-R model that learns to achieve human-level
performance in such a task9. Moreover, it learns in a way that displays much similarity to human
learning, making it a plausible model of humans, the only other agents that so far can learn in
Space Fortress. In this discussion, we will try to situate the theory behind the model in the
framework of the Fitts three-phase characterization and human skill learning more generally.
9 To see a minute of model play compared to a minute of human play, both late in training, visit http://act-r.psy.cmu.edu/wordpress/wp-content/uploads/2018/04/A2O7H7VXLFN6BP_vs_model_2981_game_8_x2.mpg
While Fitts’ three-phase characterization of skill acquisition has often been discussed as a
three-stage theory, Fitts did not originally intend it as a stage characterization. As Fitts (1964)
wrote:

“it is misleading to assume distinct stages in skill learning. Instead, we should think of
gradual shifts in the factor structure of skills.”
Nonetheless, even in Fitts’ original paper he described skills as being in a particular phase and
his characterization has since been used with the understanding that a skill can be treated as being in
one of the phases at most points in the course of learning. With complex skills, such phase
assignments probably only make sense as characterization of parts of the skill. For instance, in
Space Fortress the actual hitting of keyboard keys is highly automated at the beginning of the
game while the decision making about what keys to hit is slow.
Following the 1982 ACT* characterization (Anderson, 1982), one might map this ACT-R
model onto Fitts and Posner's 3-phase theory as follows: The interpretation of declarative
instruction corresponds to the Cognitive phase. Production
compilation, which produces direct actions, corresponds to the Associative phase. Control
Tuning, which produces the final high level of performance, corresponds to the Autonomous phase.
However, upon closer inspection it turns out that this mapping does not work, and what is
happening in the model is closer to what is suggested by the above quote from Fitts. All three
learning processes (instruction interpretation, production compilation, and control
tuning) are progressing in parallel for each bit of knowledge that composes part of the skill in the
ACT-R model. For instance, consider the highlighted path in Figure 3 that indicates when to
execute a clockwise turn to maintain aim. There are a number of steps of collapsing of the
instruction interpretation on this path into a direct action mapping. At any point in time some
parts may be interpreted and some parts may be compiled. Also choices for the control variables
of distance from border and aim can be at different levels of tuning. The actual control values
chosen will jump around over time as part of the exploration process, only slowly converging
on their best values.
It is also misleading to think of compilation occurring before tuning; rather, they are
interleaved. In Figure 13 the Compiled-Tuning option is a
bit slower than the Compiling-Tuned option, but this could be changed by varying the
production-learning rates or the initial temperatures. Both of these versions are converging
faster than the Compiling-Tuning version because the optimal control variable settings change as
production compilation advances. For instance, as one can act faster one needs to increase the
number of time ticks to wait between shots to avoid resetting the vulnerability counter.
The learning in ACT-R nicely instantiates Fitts’ description of “gradual shifts in the
factor structure of skills”. While there is not a truly satisfying mapping of the growth of skills
in this task onto three distinct phases, one might offer the following:
1. The Cognitive phase corresponds to the period of time before the model is achieving positive
points, which is typically midway through the first game. This is the period when the model
is still largely interpreting the declarative instructions.
2. The Associative phase is the period of time when production compilation and control tuning
are producing substantial improvements in performance.
3. The Autonomous phase would be when the model reaches asymptotic performance.
Note under this characterization the Autonomous phase is when the model’s learning is
complete. The models and the subjects do come close to asymptoting after 20 games in
AutoTurn but not in YouTurn (Figure 9). The model takes about 60 games to asymptote in
YouTurn (Compiling Tuning in Figure 13) and subjects may also have reached an asymptotic
score had they played 3 times as many games. Subjects did appear to asymptote after this much
practice in the Fortress-Only condition of Anderson et al. (2011), which is similar to YouTurn.
However, that condition and YouTurn are rather simple compared to many tasks including the
original Space Fortress. Improvement in the original version of Space Fortress continues for as
long as 31 hours (Destefano & Gray, 2016). Interestingly, as Destefano and Gray describe, part
of continued improvement involves discovering strategies late in the game that go beyond what
is instructed. Players deliberately try to apply what they have discovered about the game much
like they did with the initial instructions. This suggests that some of that late learning is really
declarative, the kind of learning that one might have thought was in the Cognitive Phase.
Motor learning is another potential kind of learning that might continue beyond where the
current models asymptote. ACT-R hits individual keys rather than chording keys, where pairs of
keys are hit together. This means that the lower bound on interkey times is about 100 msec.
While this is also true of initial subject behavior, some subjects did start to simultaneously press
two keys (e.g., thrust and shoot), something that the model never comes to do.
The fact that there are three phases and three critical aspects to our theory should not be
taken as a one-to-one correspondence.
There are other learning processes beyond those modeled, both high-level like strategy discovery
and low-level like chording. However, the three aspects modeled are critical aspects of human
skill acquisition, and they appear to be the major drivers of learning in games like Space
Fortress. More generally, they are the major drivers of learning in tasks that involve receiving a
set of instructions and then perfecting the instructed skills with experience. Many real world
tasks (and some video games) have other complexities, but involve the kind of learning we have
modeled as part of a larger goal structure. For instance, the mechanics of driving a car are typically a case of perfecting performance that starts with instruction, but this skill is embedded in a much larger goal structure of navigating in familiar and unfamiliar terrains.
References
Anderson, J. R. (2007). How Can the Human Mind Occur in the Physical Universe? New York: Oxford University Press.
Anderson, J. R., Betts, S., Ferris, J. L., & Fincham, J. M. (2010). Neural imaging to track mental states while using an intelligent tutoring system. Proceedings of the National Academy of Sciences, 107(15), 7018-7023.
Anderson, J. R., Bothell, D., Byrne, M. D., Douglass, S., Lebiere, C., & Qin, Y. (2004). An integrated theory of the mind. Psychological Review, 111(4), 1036-1060.
Anderson, J. R., Bothell, D., Fincham, J. M., Anderson, A. R., Poole, B., & Qin, Y. (2011). Brain regions engaged by part- and whole-task performance in a video game: A model-based test of the decomposition hypothesis. Journal of Cognitive Neuroscience, 23(12), 3983-3997.
Anderson, J. R., Bothell, D., Fincham, J. M., & Moon, J. (2016). The sequential structure of thought. Cognitive Science, 40(8), 1936-1968.
Bothell, D. (2010). Modeling Space Fortress: CMU effort. http://act-r.psy.cmu.edu/workshops/workshop-2010/schedule.html
Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., & Zaremba, W. (2016). OpenAI Gym. arXiv preprint. http://arxiv.org/abs/1606.01540
Cohen, J. D., McClure, S. M., & Yu, A. J. (2007). Should I stay or should I go? How the human brain manages the trade-off between exploitation and exploration. Philosophical Transactions of the Royal Society B: Biological Sciences, 362(1481), 933-942.
Daw, N. D., O'Doherty, J. P., Dayan, P., Seymour, B., & Dolan, R. J. (2006). Cortical substrates for exploratory decisions in humans. Nature, 441(7095), 876-879.
Destefano, M. (2010). The mechanics of multitasking: The choreography of perception, action, and cognition over 7.05 orders of magnitude (Doctoral dissertation, Rensselaer Polytechnic Institute).
Destefano, M., & Gray, W. D. (2008). Progress report on Pygame Space Fortress. Troy, NY: Rensselaer Polytechnic Institute.
Destefano, M., & Gray, W. D. (2016). Where should researchers look for strategy discoveries during the acquisition of complex task performance? The case of Space Fortress. In Proceedings of the 38th Annual Conference of the Cognitive Science Society (pp. 668-673).
Elfwing, S., Uchibe, E., & Doya, K. (2017). Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. arXiv preprint arXiv:1702.03118.
Fitts, P. M. (1964). Perceptual-motor skill learning. Categories of human learning, 47,
381-391.
Fitts, P. M., & Posner, M. I. (1967). Learning and skilled performance. In Human Performance. Belmont, CA: Brooks/Cole.
Frederiksen, J. R., & White, B. Y. (1989). An approach to training based upon principled task decomposition. Acta Psychologica, 71(1-3), 89-146.
Gauthier, I., Tarr, M., & Bub, D. (Eds.). (2010). Perceptual expertise: Bridging brain and behavior. New York: Oxford University Press.
Gopher, D., Weil, M., & Siegel, D. (1989). Practice under changing priorities: An approach to the training of complex skills. Acta Psychologica, 71(1-3), 147-177.
Kim, J. W., Ritter, F. E., & Koubek, R. J. (2013). An integrated theory for improved skill acquisition and retention in the three phases of learning. Theoretical Issues in Ergonomics Science, 14(1), 22-37.
Kalish, M. L., Lewandowsky, S., & Kruschke, J. K. (2004). Population of linear experts: Knowledge partitioning and function learning. Psychological Review, 111(4), 1072-1099.
Koh, K., & Meyer, D. E. (1991). Function learning: Induction of continuous stimulus-response relations. Journal of Experimental Psychology: Learning, Memory, and Cognition, 17(5), 811-836.
Lebiere, C., Gonzalez, C., & Martin, M. K. (2007). Instance-based decision making model of repeated binary choice. In Proceedings of the 8th International Conference on Cognitive Modeling.
Lee, M. D., Zhang, S., Munro, M., & Steyvers, M. (2011). Psychological models of
human and optimal performance in bandit problems. Cognitive Systems Research, 12(2), 164-
174.
Lucas, C. G., Griffiths, T. L., Williams, J. J., & Kalish, M. L. (2015). A rational model of function learning. Psychonomic Bulletin & Review, 22(5), 1193-1215.
McDaniel, M. A., & Busemeyer, J. R. (2005). The conceptual basis of function learning and extrapolation: Comparison of rule-based and associative-based models. Psychonomic Bulletin & Review, 12(1), 24-42.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. A. (2013). Playing Atari with deep reinforcement learning. CoRR, abs/1312.5602. http://arxiv.org/abs/1312.5602
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., ... & Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533.
Moon, J., & Anderson, J. R. (2013). Timing in multitasking: Memory contamination and time pressure bias. Cognitive Psychology, 67(1-2), 26-54.
Plappert, M. (2016). keras-rl. https://github.com/matthiasplappert/keras-rl
Sanner, S., Anderson, J. R., Lebiere, C., & Lovett, M. (2000). Achieving efficient and cognitively plausible learning in backgammon. In Proceedings of the Seventeenth International Conference on Machine Learning.
Shebilske, W. L., Volz, R. A., Gildea, K. M., Workman, J. W., Nanjanath, M., Cao, S., & Whetzel, J. (2005). Revised Space Fortress: A validation study. Behavior Research Methods, 37(4), 591-601.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., ... & Dieleman, S. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484-489.
Sutton, R. S., & Barto, A. G. (1998). Introduction to Reinforcement Learning (1st ed.). Cambridge, MA: MIT Press.
Taatgen, N. A. (2013). The nature and transfer of cognitive skills. Psychological review,
120(3), 439.
Taatgen, N. A., & Anderson, J. R. (2002). Why do children learn to say “broke”? A
model of learning the past tense without feedback. Cognition, 86(2), 123-155.
Taatgen, N. A., Huss, D., Dickison, D., & Anderson, J. R. (2008). The acquisition of
robust and flexible cognitive skills. Journal of Experimental Psychology: General, 137(3), 548.
Taatgen, N., & van Rijn, H. (2011). Traces of times past: Representations of temporal intervals in memory. Memory & Cognition, 39(8), 1546-1560.
Taatgen, N. A., Van Rijn, H., & Anderson, J. (2007). An integrated theory of prospective
time interval estimation: The role of cognition, attention, and learning. Psychological Review,
114(3), 577.
Tessler, M. H., Goodman, N. D., & Frank, M. C. (2017). Avoiding frostbite: It helps to learn from others. Behavioral and Brain Sciences, 40.
Tsividis, P. A., Pouncy, T., Xu, J. L., Tenenbaum, J. B., & Gershman, S. J. (2017). Human learning in Atari. In The AAAI 2017 Spring Symposium on Science of Intelligence: Computational Principles of Natural and Artificial Intelligence.
VanLehn, K. (1996). Cognitive skill acquisition. Annual Review of Psychology, 47, 513-539.
Wilson, R. C., & Niv, Y. (2011). Inferring relevance in a changing world. Frontiers in Human Neuroscience, 5.
Wolpert, D. M., Ghahramani, Z., & Jordan, M. I. (1995). An internal model for sensorimotor integration. Science, 269(5232), 1880-1882.
Zhang, S., & Yu, A. J. (2013). Cheap but clever: Human active learning in a bandit setting. In Proceedings of the 35th Annual Conference of the Cognitive Science Society.
Zhu, P., Li, X., & Poupart, P. (2017). On improving deep reinforcement learning for POMDPs. arXiv preprint arXiv:1704.07978.
Figure Captions
Figure 1. (a) The Space Fortress screen, showing the inner and outer hexagon, a missile
shot at the fortress, and a shell shot at the ship. The dotted lines illustrate an example path
during one game. (b) A schematic representation of critical values for shooting and flight control.
Figure 2. Growth in mean number of points earned per game. Error bars reflect one standard deviation of the population.
Figure 3. The operator structure of the instructed task: each arrow is an operator going from one state to one of the next. Except for the bottom operators, they test conditions to determine the next state. The bottom operators perform actions. Note that for graphical clarity the same bottom state (e.g., CCW) is repeated wherever it appears.
Figure 4. The production test-dimension-success is on the right. On the left are color-coded contents of the buffers at one time when it matches – the retrieval buffer in green, the goal buffer, and the other buffers each in their own color.
Figure 5. On the right is the production resulting from compiling successive firings of productions like test-dimension (Figure 4). On the left are the buffer contents to which it would match. In the middle is the operator retrieval that is now built into the compiled production.
Figure 6. A direct action production which is the result of compiling the highlighted path
of operators in Figure 3.
Figure 7. An example of estimating a payoff function from the first game in Space Fortress. Each dot reflects the mean payoff for a sampled value. The function is the best-fitting quadratic to these points, weighted by the duration of time over which they were estimated.
Figure 8. (a) The quadratic functions estimated for different delays at different points in one sequence of model runs through 20 learning games. (b) The probability of choosing different delays at different points in learning.
Figure 9. A comparison of subject and model performance: (a) and (b) Growth in mean
number of points earned per game in AutoTurn and YouTurn. Error bars reflect one standard
deviation of the population. (c) and (d) Relationship that emerges between average number of
keys a subject or model hits over games and their average score over games. Each dot is a
subject or model.
Figure 10. A comparison of subjects and models on two critical factors determining
score. (a) and (b) growth in mean number of kills over games. (c) and (d) decrease in mean
number of deaths over games. Error bars reflect one standard deviation of the population.
Figure 11. Shot timing in AutoTurn: (a) Mean intershot duration per game for subjects and models; (b) Mean number of resets per game for subjects and models; (c) Probability of choosing different time thresholds, averaged over the 100 models. Error bars in (a) and (b) reflect one standard deviation of the population.
Figure 12. Aiming in YouTurn: (a) Mean offset of aim at shot per game for subjects and models; (b) Mean proportion of misses per game for subjects and models; (c) Probability of choosing different maximum aim offsets, averaged over the 100 models. Error bars in (a) and (b) reflect one standard deviation of the population.
Figure 14. Performance of the model against various versions created by changing basic mechanisms of the model.
Figure D1. Distance for thrusts in AutoTurn: (a) Mean distance from the fortress at which subjects and models chose to thrust; (b) Probability of choosing different time thresholds, averaged over 50 models; (c) Probability of choosing different distance thresholds, averaged over 50 models.
Figure D2. Thrust angle in YouTurn: (a) Mean thrust angle per game for subjects and models; (b) Mean speed of ship for subjects and models; (c) Probability of choosing different thrust angles, averaged over the 100 models. Error bars in (a) and (b) reflect one standard deviation of the population.
Figure D3. Distribution of ship deaths in AutoTurn (a) and YouTurn (b).
Appendix A: Space Fortress Implementation
There have been many implementations of Space Fortress over the decades. The
development prior to the Pygame version is described in Shebilske et al. (2005) and the original
Pygame implementation is described in Destefano (2010). The most consequential change in the
Pygame implementation over previous implementations was replacing the use of the joystick for
navigation by key presses. We considerably simplified the game from the Pygame
implementation. We eliminated mines and bonus points, both of which were aspects of the
original game. We also made hitting the hexagon boundaries fatal rather than just a loss of
control points. We also simplified the death rules so that being shot or hitting the boundary
immediately resulted in death rather than building up towards death. Here is the pointer to the web version of the game:
http://andersonlab.net/sf-paper-2017/turning.html
4. The ship starts at x=245, y=315, moving up .5 pixels and to the right .866 pixels per game tick.
5. The ship starts aimed horizontally at the fortress. The ship automatically turns to maintain aim in AutoTurn, but subjects must turn the ship in YouTurn.
6. Every game tick that the thrust key is held down, the velocity vector will change by .3 pixels per tick in the direction the ship is pointed.
7. Every game tick that a turn key is held down, the ship will turn 6 degrees.
8. If the ship does not traverse 10 degrees of arc around the fortress in a second, the fortress will fire a shell at it.
9. For purposes of determining when the ship or fortress is hit, everything is treated as a
circle:
11. Shells travel at 6 pixels per tick and missiles travel at 20 pixels per tick.
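The flight rules above imply a simple per-tick physics, sketched below under the assumption of standard screen coordinates (y increasing downward); the class and constant names are ours, not the game's.

```python
import math

TICKS_PER_SECOND = 30
THRUST_DELTA = 0.3      # px/tick added to velocity while thrust is held (rule 6)
TURN_RATE = 6.0         # degrees/tick while a turn key is held (rule 7)

class Ship:
    def __init__(self):
        # Rule 4: start at (245, 315), moving up .5 px and right .866 px per tick.
        self.x, self.y = 245.0, 315.0
        self.vx, self.vy = 0.866, -0.5      # screen y grows downward, so "up" is -y
        self.heading = 0.0                   # degrees; 0 = aimed horizontally (rule 5)

    def tick(self, thrust=False, turn=0):
        """Advance one game tick (1/30 s). turn is -1, 0, or +1."""
        if turn:
            self.heading = (self.heading + turn * TURN_RATE) % 360.0
        if thrust:
            rad = math.radians(self.heading)
            self.vx += THRUST_DELTA * math.cos(rad)
            self.vy -= THRUST_DELTA * math.sin(rad)   # minus: screen y is inverted
        # Frictionless flight: velocity persists unchanged between thrusts.
        self.x += self.vx
        self.y += self.vy

# Projectile speeds (rule 11): shells 6 px/tick, missiles 20 px/tick.
SHELL_SPEED, MISSILE_SPEED = 6.0, 20.0
```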
Appendix B: Deep Reinforcement Learning in Space Fortress
Reinforcement learning (RL), which has roots in the study of animal and human behavior, provides a normative account of how agents optimize their behavior in different environments (Sutton & Barto, 1998). In recent years there have been many advances in the field of deep learning, which have made it possible for deep RL agents to learn directly from high-dimensional input. One of the most successful applications of deep learning to the RL problem is the work of Mnih et al. (2013), who created a convolutional neural network (CNN) trained with a variant of Q-learning called a Deep Q-Network (DQN). With the only prior knowledge being the available actions, and using the same network architecture and hyperparameters, the DQN agent was able to learn to play a wide range of Atari 2600 video games with just raw pixels and a reward signal as input.
The games that the DQN agent excels at vary greatly in their genres and visual appearance, from side-scrolling shooters (e.g., River Raid) to sports games (e.g., Boxing) and 3D auto-racing games (e.g., Enduro). The authors argue that for certain games the agent is able to
discover relatively long-term strategies. For example, in the game Breakout, the DQN agent is
able to learn to dig a tunnel along the wall, which allows the ball to be sent around the back to
destroy a large number of blocks. Nevertheless, the DQN agent struggles with games that require
temporally extended rules/strategies and games that require higher than first-order control. For
example, the DQN agent was only able to score 13% of what a professional player can score in
Ms. Pac-Man, a game that requires continual path planning and obstacle avoidance. Similarly,
the DQN agent only achieved 7% of the score of a professional player in Asteroids, a game that, like Space Fortress, involves flying a ship in a frictionless environment. Of the 49 Atari 2600 games tested by Mnih et al. (2015), Montezuma’s Revenge was easily the most difficult game for the DQN agent to learn. Montezuma’s Revenge is a non-linear exploration
game that requires both temporally extended strategies and higher order control. For example, in
Montezuma’s Revenge players must move from room to room in an underground labyrinth, searching for keys (to open doors) and equipment (such as torches, swords, amulets, etc.), in an attempt to avoid or defeat the challenges in the player’s path. The DQN agent was not able to score any points in this game.
It is not obvious whether the DQN agent’s poor performance on certain games is a result of a failure to extract useful features from the raw pixel input, of an inability to learn a good control policy (assuming good feature extraction), or of a combination of both. The modeling
efforts below are an attempt to address this question. Space Fortress and its variants share many
features of the ATARI games that the DQN agent struggles with. For example, in Space Fortress
the player controls a ship in a frictionless environment, which is analogous to (but more difficult than) flying the ship in the ATARI game Asteroids. Also, much like Montezuma’s Revenge, Space Fortress requires temporally extended strategies, for example, building the fortress’s vulnerability up to 10 before killing it. Since the preceding ACT-R modeling work has already
identified a set of visual features in Space Fortress capable of producing human-like learning and
performance, Space Fortress is the perfect environment to evaluate the learning limitations of the
DQN agent.
Methods
In order to make Space Fortress more accessible to deep reinforcement learning research,
an OpenAI Gym (Brockman et al., 2016) interface for both the AutoTurn and YouTurn variants
of Space Fortress was created. The Space Fortress OpenAI Gym source is available on BitBucket, and a packaged version will be available on PyPi (soon). The OpenAI Gym interface provides RL agents with two state observation formats: 1) a raw image/pixel-based representation of the game state in the form of a 126x142 RGB array, and 2) a hand-crafted feature vector (of length 15) containing all of the information used by the best-performing ACT-R models. Both environments share 3 discrete actions: NOOP, FIRE, and THRUST. The YouTurn environment supports 2 additional discrete actions for turning the ship left and right.
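As a usage illustration, a minimal interaction loop with the feature-vector environment might look like the sketch below. The package and environment id strings are placeholders, since the paper does not give the registered names; the observation and action structure follows the description above.

```python
import gym
import gym_spacefortress  # hypothetical package name registering the environments

# The environment id below is illustrative; the actual registered name
# would come from the Space Fortress gym package described above.
env = gym.make("SpaceFortress-AutoTurn-features-v0")

obs = env.reset()          # length-15 hand-crafted feature vector
done = False
while not done:
    action = env.action_space.sample()   # 0=NOOP, 1=FIRE, 2=THRUST
    obs, reward, done, info = env.step(action)
```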
Further facilitating the introduction of Space Fortress to the deep reinforcement learning
community, a keras-rl (Plappert, 2016) implementation of the original Mnih et al. (2013) DQN
agent was trained and tested in both the AutoTurn and YouTurn Space Fortress gym
environments. For these baseline tests, every attempt was made to replicate the model
parameters, hyperparameters and experimental setup of the Mnih et al. (2015) paper. However,
since Space Fortress runs at 30 frames per second, half the rate at which the ATARI system runs, the agents tested against Space Fortress used frame skip and memory skip values of 2, half the values used in the agents that were tested against ATARI games. Additionally, since it is not a goal of this work to assess the ability of a CNN to extract features relevant to Space Fortress game play from raw pixels, the DQN agents tested in the Space Fortress environments used the hand-crafted feature vector state representation as input. Consequently, instead of a complex CNN to estimate the Q-values of the state-action pairs, a simple multilayer perceptron (MLP) with 3 hidden layers of 64 units was used. Because of the different number of actions in AutoTurn and YouTurn Space Fortress, the MLPs used by the agents had 3 and 5 outputs respectively.
Both the AutoTurn and YouTurn DQN agents were trained for 10,000 games, which, with a frameskip of 2, equates to 26,470,000 steps. The replay memory of the DQN agent was configured with a capacity of 1 million steps (about 377 games with a frameskip of 2). The exploration rate of the ε-greedy control policy was annealed from ε=1.00 to ε=0.10 over the first 1000 games and fixed at ε=0.10 thereafter. After each epoch of 100 games, training was paused and a single test game was played with an exploration rate of ε=0.05. Consistent with Mnih et al. (2015), reward was clipped to the range -1 to 1 and the RMSProp optimizer was used with minibatches of size 32. For reference purposes, two additional agents were created and tested: a random agent and a perfect/cheater agent. The random agent was implemented as the DQN agent with its exploration rate fixed at ε=1.00. The cheater agent algorithmically implements a “shoot 12 then thrust” strategy with zero constraints on how quickly actions can be made.
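A sketch of how such an agent could be assembled with keras-rl is shown below, using the hyperparameters stated in this section; values not given in the text (the warmup period, and the learning rate taken from Mnih et al., 2015) should be read as assumptions, and the env object is the one from the interface sketch above.

```python
from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.optimizers import RMSprop
from rl.agents.dqn import DQNAgent
from rl.memory import SequentialMemory
from rl.policy import LinearAnnealedPolicy, EpsGreedyQPolicy

NB_ACTIONS = 3          # AutoTurn: NOOP, FIRE, THRUST (5 for YouTurn)
STEPS_PER_GAME = 2647   # 26,470,000 steps / 10,000 games, per the text

# MLP Q-network: 3 hidden layers of 64 units over the 15-feature state.
model = Sequential([
    Flatten(input_shape=(1, 15)),   # window_length of 1 below
    Dense(64, activation='relu'),
    Dense(64, activation='relu'),
    Dense(64, activation='relu'),
    Dense(NB_ACTIONS, activation='linear'),
])

# Replay memory of 1 million steps; epsilon annealed 1.0 -> 0.1 over the
# first 1000 games, with epsilon = 0.05 during test games.
memory = SequentialMemory(limit=1000000, window_length=1)
policy = LinearAnnealedPolicy(EpsGreedyQPolicy(), attr='eps',
                              value_max=1.0, value_min=0.1, value_test=0.05,
                              nb_steps=1000 * STEPS_PER_GAME)

agent = DQNAgent(model=model, nb_actions=NB_ACTIONS, memory=memory,
                 policy=policy, batch_size=32,
                 nb_steps_warmup=50000)          # warmup value is an assumption
agent.compile(RMSprop(lr=0.00025))               # learning rate from Mnih et al. (2015)
agent.fit(env, nb_steps=10000 * STEPS_PER_GAME)  # env from the gym sketch above
```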
Results
Table B1 shows how the DQN agents compare to the performance of other agents across
5 critical performance measures: raw points, deaths, kills, resets and shot efficiency (the average
number of shots needed to kill a fortress). The human and ACT-R values in the table are for the
best performing subject and ACT-R model in each condition. While the performance of the
AutoTurn and YouTurn DQN agents was much better than that of the purely random agents, their performance was nowhere near that of humans or the ACT-R agents. As a way to more directly compare the DQN agents to the ACT-R agents, the scores of all agents were placed on a normalized performance scale. The normalized performance score of the ACT-R agents in the AutoTurn and YouTurn environments was 97.89% and 95.24% respectively. Likewise, the normalized performance scores of the DQN agents were only 66.64% and 38.62% respectively. Perfect performance (as produced by the cheater agent) results in normalized performance scores of 107.70% and 121.29% in the AutoTurn and YouTurn environments respectively. To see if the poor
performance was related to the MLP configuration, a few alternative MLP configurations were
tested, including adding up to 5 hidden layers and up to 256 hidden units per layer. None of these configurations appreciably improved performance.
While the DQN agents develop some flight skills, as indicated by their fewer deaths compared to the random agents, they fail to learn the actual rules of the game necessary to deliberately kill the fortress. It is probably fair to say that the DQN agents never learn that the fortress vulnerability needs to be raised to 10 before it can be killed, or that killing the fortress requires a different shot pacing than raising its vulnerability. A
perfect agent (like the cheater agent represented in Table B1) only needs 12 shots to kill a fortress. Humans, because they know the rules before playing their first game, have a shots-per-kill efficiency very close to 12. However, the DQN agents typically raise the
vulnerability well above 10 before getting a kill. The ability to get a kill can only be explained by
the stochastic nature of the agents. Further supporting this explanation is the counterintuitive fact
that the random YouTurn agent scores more points (by getting more kills) than the AutoTurn
agent. Since the random AutoTurn agent is always perfectly aimed at the fortress, it resets often,
rarely allowing the vulnerability to get high enough for a chance at a kill shot. In contrast, the
random YouTurn agent naturally shoots less often (and thus resets less often) because it has
more actions to pick from. The random YouTurn agent also misses constantly, further reducing the
frequency of resets. These two factors combined cause the random YouTurn agent to get more
kill opportunities, and in the end more random kills than the AutoTurn agent.
Appendix C: Playing Without Instructions
A separate set of subjects played without the game instructions, in both AutoTurn and YouTurn. AutoTurn subjects were only told that the relevant keys were W and spacebar, while YouTurn subjects were told that the relevant keys were A, W, D, and spacebar. The
conversion of points earned to bonus payment was doubled. For comparison we will use all subjects who were run with instruction, rather than just those filtered to show signs of learning, who were the focus in the main paper. Table C1 reports average values for the measures of performance reported in the paper and Appendix D. We have broken this out into performance in the first and second halves of the 20 games.
In terms of mean number of points, uninstructed subjects are clearly showing learning (compare with the DQN agents in Appendix B), but they are also scoring less than instructed subjects
(first half AutoTurn, t(59) = 4.45, p < .0001; second half AutoTurn, t(59) = 2.13, p < .05; first
half YouTurn, t(74) = 2.52, p < .05; second half YouTurn, t(74) = 1.24, p > .1; all 2-tailed). The
individual variability in both populations is huge. Nonetheless, we can conclude that without instructions at least some game players can learn to play these games, but with greater difficulty.
Appendix D: Control Values for Flying
The main text provides information on the control values for aiming and shooting. This appendix describes the control values that govern flying the ship.
When to thrust. As discussed in the main text, the model chooses to thrust according either to how soon it will hit the outer hexagon or to its distance from the outer hexagon. If using time, the range of the control values is between .5 and 5 seconds. If using distance, the range is between 10% and 90% of the way to the boundary. The longer the model waits to thrust, the greater the chance of completing a fortress kill; but if it waits too long to thrust, it will crash into the outer border. Therefore, it treats fortress kills as positive feedback and crashes into the outer border as negative feedback.
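The payoff-estimation step (illustrated in Figure 7 of the main text) can be sketched as a weighted quadratic fit; the function below is our own minimal illustration, not the model's actual implementation.

```python
import numpy as np

def fit_payoff_quadratic(samples):
    """Estimate a payoff function over a control value (e.g., thrust distance).

    samples: list of (value, mean_payoff, duration) tuples, where duration is
    how long the value was in force (longer samples give better estimates).
    Returns the quadratic coefficients and the payoff-maximizing value.
    """
    values = np.array([s[0] for s in samples], dtype=float)
    payoffs = np.array([s[1] for s in samples], dtype=float)
    weights = np.array([s[2] for s in samples], dtype=float)
    # Weighted least-squares quadratic, as in the Figure 7 description.
    a, b, c = np.polyfit(values, payoffs, deg=2, w=weights)
    best = -b / (2 * a) if a < 0 else values[np.argmax(payoffs)]
    return (a, b, c), best

# Illustrative data: distance-to-border values with kill (+) / crash (-) payoffs.
samples = [(0.2, -0.5, 30.0), (0.4, 0.3, 45.0), (0.6, 0.8, 60.0),
           (0.8, 0.1, 20.0)]
coeffs, best_distance = fit_payoff_quadratic(samples)
```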
As is the case for shot timing, good data on when thrust decisions are made can be
collected in AutoTurn because there is never any need for preliminary turning. Figure D1a
displays the mean distance at which subjects and models chose to thrust. The values are quite similar, although there is a bit of a downward drift in the average subject distance not seen in the
average model distance. Figures D1b and D1c show the average probability of the choices that the models made over the course of learning.
Thrust time.10 The effect of a thrust increases linearly with the duration that the thrust key is depressed (a vector of .3 pixels per tick in the direction pointed is added to the velocity each game tick, which is a 30th of a second). The model searches between 67 msec. (which is 2 game ticks, the typical time associated with a tap) and 167 msec. (5 game ticks, a noticeable duration of sustained pressing). We should note that while the model may choose to press for a certain period, motor noise and asynchrony between model and game will result in the thrust key being depressed for a duration somewhat different from the selected time. A thrust that is very long may drive the ship into the inner hex. A thrust that is too brief may result in a slow ship speed and the ship being shot by the fortress. Therefore, the tracker monitors these two events (hitting the inner hex and being shot by the fortress) and seeks a thrust duration that minimizes both.
Subjects depress the thrust key for an average of 112 msec. in AutoTurn and 134 msec. in
YouTurn. The models average 115 msec. in both versions of the game.
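A small sketch of how a chosen press duration could turn into an actual one under motor noise and tick quantization; the Gaussian noise model and its 15 msec. standard deviation are assumptions for illustration, not fitted values.

```python
import random

TICK_MS = 1000.0 / 30.0   # one game tick is ~33.3 msec

def press_duration(chosen_ms, motor_sd=15.0):
    """Intended key-press time plus motor noise, quantized to game ticks.

    chosen_ms is the selected duration (67-167 msec, i.e. 2-5 ticks);
    motor_sd is an assumed standard deviation for execution noise.
    """
    actual = random.gauss(chosen_ms, motor_sd)
    ticks = max(1, round(actual / TICK_MS))   # the game registers whole ticks
    return ticks * TICK_MS

# e.g., a chosen 100 msec press may register as 2, 3, or 4 ticks of thrust.
```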
Thrust angle (see Figure 1b). The instructions to the subject indicate that, to keep speed under control when thrusting, the ship should point 90 degrees or more from the direction the ship is moving. The model searches for a good angle between 85 and 140 degrees. The sign of a good thrust angle is that the resulting ship speed stays within the good range (30 to 50 pixels per second); the sign of a bad one is that the ship ends up slower or faster. Therefore the model evaluates a thrust angle in terms of the speed of the ship after the thrust, treating good speed as positive feedback and bad speed as negative feedback.
10 This control parameter is not very consequential in the two versions of the game we report, but it did prove consequential in another version of the game where subjects must fly narrow paths.
Subjects are told they can adjust the speed of the ship by adjusting the thrust angle. Correspondingly, how the model treats the thrust angle it has chosen as its control value depends on the speed of the ship. If it is flying at a good speed (1 to 1.7 pixels per game tick), it wants a thrust angle within 15 degrees of the control value; if it needs to turn, it will turn to get a thrust angle close to the control value (it cannot produce an exact match to the control value given its limitations and the game implementation). If it is flying too slow, it wants a thrust angle less than the control value; if the actual thrust angle is more, it will turn to get close to 15 degrees less than the control value. If it is flying too fast, it wants a thrust angle more than the control value; if the actual thrust angle is less, it will turn to get close to 15 degrees more than the control value. This decision rule is restated as a sketch below.
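The sketch below is our own compact restatement of the rule just described (the model itself implements it in ACT-R productions); the names are illustrative.

```python
GOOD_SPEED = (30.0, 50.0)     # px/sec (1 to 1.7 px per game tick)
TOLERANCE = 15.0              # degrees

def desired_thrust_angle(speed, current_angle, control_angle):
    """Return the angle to turn toward before thrusting, or None if no turn is needed.

    speed: current ship speed in px/sec; current_angle: present angle between
    heading and direction of motion; control_angle: the tuned control value.
    """
    lo, hi = GOOD_SPEED
    if lo <= speed <= hi:
        # Good speed: accept anything within 15 degrees of the control value.
        if abs(current_angle - control_angle) <= TOLERANCE:
            return None
        return control_angle
    if speed < lo:
        # Too slow: want an angle below the control value (thrust speeds us up).
        if current_angle >= control_angle:
            return control_angle - TOLERANCE
        return None
    # Too fast: want an angle above the control value (thrust slows us down).
    if current_angle <= control_angle:
        return control_angle + TOLERANCE
    return None
```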
As a reference for what a plausible thrust angle should be: if the thrust key were held down long enough to add as much velocity in the direction of the thrust angle as the current velocity, a thrust angle of 120 degrees would exactly maintain the current speed. The average velocity added by a thrust is just a little less than the average ship speed, implying an average thrust angle a little less than 120 degrees.
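The 120-degree reference follows from the law of cosines. With current speed v and a thrust that adds a vector of equal magnitude v at thrust angle θ, the new speed satisfies

\[ |\vec{v}_{\mathrm{new}}|^{2} = v^{2} + v^{2} + 2v^{2}\cos\theta = 2v^{2}(1+\cos\theta), \]

which equals v² exactly when cos θ = −1/2, that is, θ = 120°.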
Figure D2 compares subjects and models with respect to thrust angle in YouTurn. Part (a)
shows that the average thrust angle of both subjects and models increases a little over the games
to about 100 degrees. Part (b) shows that they both converge to an average speed of about 40
pixels per second. Part (c) illustrates how the models’ choices for a control value converge over
time. Note that although the preferred control value is over 110 degrees by the end, the actual
angles at which the ship thrusts are just over 100 degrees. Aiming often leaves the ship a little
under the chosen thrust angle but not enough to provoke a turn.
Deaths. The frequencies of different types of death reflect the consequences of decisions
about when to thrust, how long to thrust, and thrust angle. Figure D3 shows the relative
frequency of ship deaths in the two versions of the game. The scale is doubled for YouTurn,
reflecting the greater number of deaths. The major cause of deaths in AutoTurn is being shot by
the fortress, both for subjects and models. In the case of YouTurn, there is a large difference between models and subjects in the number of cases of being shot with a shell from the fortress.
Being shot by the fortress is the major cause of death for subjects while running into the outer
hex is the major cause of death for the models. The high value of subject deaths by shell reflects
the greater difficulty subjects have maintaining a good speed. Their average within-game
variability in speed is about 50% more than the models’ average variability. This means that
while they average the same speed as the model over games (Figure D2b), they are more likely
to fall into slow lapses where the fortress can shoot them. Nonetheless, as indicated by the one
standard deviation error bars, there is substantial overlap between subjects and models.