
Three Aspects of Skill Acquisition1

John R. Anderson

Carnegie Mellon University

Shawn Betts

Carnegie Mellon University

Daniel Bothell

Carnegie Mellon University

Ryan Hope

Carnegie Mellon University

Christian Lebiere

Carnegie Mellon University

1
This work was supported by the Office of Naval Research Grant N00014-15-1-2151. We would
like to thank Niels Taatgen and Matthew Walsh for their comments on the paper.

Abstract

A theory is presented about how instruction and experience combine to produce competence in a

complex skill. The theory depends critically on three components of the ACT-R architecture.

The first component interprets declarative representations of instruction so that they lead to

action. The second component converts this knowledge into procedural form so that appropriate

actions can be quickly executed. The third component, newly added to the architecture, learns

the setting of control parameters for actions through a process similar to reinforcement learning.

These three components are intermingled throughout the course of skill acquisition, providing an

instantiation of Fitts’ (1964) original characterization of skill acquisition as involving gradual

shifts in the factor structure of skills. The overall theory is implemented in computational models

that are capable of simulating human learning in different versions of the video game Space

Fortress. Other than humans, these models are the only agents, natural or artificial, capable of

learning to play Space Fortress.

The course of our species has been determined by learning skills as varied as hunting with a bow and arrow and driving a car. This paper describes a general theory of the mechanisms that

underlie human ability to acquire novel skills, which we have applied to learning a video game.

Like many interesting skills, learning to play a video game involves simultaneously mastering

components at perceptual, motor, and cognitive levels. The level of expertise players can achieve

with extensive practice in these games is truly remarkable (e.g., Gray, 2017), but also remarkable

is the speed with which people can acquire a serviceable competence at playing games. There

are now artificial systems using deep reinforcement learning that can outperform humans in some video games (e.g., Mnih et al., 2015). Whatever the judgment on the

achievements of such systems, so far they have offered little insight into how humans can learn

as fast as they can. As examples of the speed of human learning, Tessler et al. (2017) and Tsividis et al. (2017) both report that humans can achieve in a matter of minutes the expertise that

these reinforcement-learning algorithms require hundreds of hours of playing to achieve.

The essential feature of the task in this paper is that it requires simultaneously learning

many things at many levels. There has been considerable research on perceptual learning such as learning visual patterns (Gauthier et al., 2010), motor learning (Wolpert, 2011), and complex tasks like learning algebra (Anderson, 2005), but little research has addressed how these are put

together. Our task also stresses the combination of instruction and experience. While one

cannot become good at a video game just by reading instructions, upfront instruction certainly

speeds up learning. This combination of learning by instruction and learning by practice is

typical of human skill acquisition. As a test of whether our theory has really put the many pieces

together, we required that the model learn by playing the actual game. The model is implemented

within the ACT-R cognitive architecture (Anderson et al., 2004). Using ACT-R avoids the

irrelevant-specification problem (Newell, 1990) – that is, the need to make a large number of

under-constrained design decisions to allow the simulation to run. This is because these details

have been developed and verified in other empirical studies. Of particular importance, ACT-R

has a model of the speed and accuracy of perception and action and so prevents us from

implementing a system with super-human response speeds – as is the case in the typical

reinforcement-learning approach to video games.

Theoretical discussions of skill acquisition (e.g., Ackerman, 1988; Anderson, 1982; Kim et al., 2013; Rasmussen, 1986; Rosenbaum et al., 2001; VanLehn, 1986) frequently make

reference to the original characterization by Fitts (1964, Fitts & Posner, 1967) of skill learning as

involving Cognitive, Associative, and Autonomous phases. While Fitts’ original theory was

focused on motor skills, Anderson (1982) extended it to cognitive skills in terms of the then

current ACT* theory. Anderson (1982) interpreted Fitts’ phases as successive periods in the

course of skill acquisition. Fitts’ Cognitive Phase corresponded in ACT* to the acquisition and

utilization of declarative knowledge about the task domain. While ACT* focused on acquisition

and use of problem-solving operators, more recent ACT-R models have also used declarative

knowledge in analogical processes (particularly use of examples) and instruction following (e.g.,

Anderson, 2005; Taatgen et al. 2008; Taatgen & Anderson, 2002). This interpretive use of

declarative knowledge is gradually converted into procedural knowledge by a process called

production compilation (both in ACT* and ACT-R). This results in action rules, called

productions, that directly perform the task without retrieval of declarative knowledge for

guidance. This both speeds up performance and frees retrieval processes to access other

knowledge. Fitts’ Associative phase was interpreted as the period over which compilation was

occurring. ACT* accounted for learning in Fitts’ Autonomous phase in terms of production-rule

tuning, a process by which a search is carried out over the selection criteria of productions to find

the most successful combinations. Production tuning was heavily symbolic and does not apply

to learning over continuous dimensions in the skills of concern to Fitts & Posner or in video

games. Production tuning was not carried forward in the modern ACT-R theory because it did not generalize in a workable way2 to the wide range of tasks that have occupied ACT-R. Thus,

current ACT-R does not have a characterization of how a skill becomes tuned to features that are

predictive of success3. A significant outcome of the current modeling effort is the development

of an extension to ACT-R that serves that function. This mechanism, called Control Tuning, is

one contribution of the paper.

Our development of Control Tuning is related to theoretical proposals in the motor

learning literature on how movements like reaching for a target are learned (Wolpert, 1995). This

research has emphasized the development of an internal model mapping controllable properties

of the movement (forces produced by the muscles) to features of the movement (e.g., position,

velocity, and acceleration). This internal model enables selection of a range of desired

movements, not just the trained movements. While our video game (controlled by discrete key

presses) is not a typical motor learning task, it requires learning a model of how parameters of

one’s game play affect significant outcomes in the game.

A challenge in learning appropriate behavior in complex environments is that the same

action choice can have different success because of variability in the environment and variability

in the learner’s behavior. A further complication is that the environment may not be stable and payoffs might change over time (Wilson & Niv, 2011).

2
That is to say, it produced a lot of broken or malfunctioning models.
3
In principle, it would be possible to do this by learning the utility of different productions
(a learning mechanism in ACT-R) but this would require learning productions for every
combination of features. This would be a very large number in a domain like that in the
current experiment and there is no plausible theory of how these individual rules would be
learned.

A significant source of non-stability,

relevant in video games, is that as one’s skill changes different choices may become optimal.

The best approach to these issues in the reinforcement literature (e.g., for domains like n-arm

bandit problems; e.g., Cohen et al., 2007; Lee et al., 2011; Zhang & Yu, 2013) involves

continuously updating estimates of the payoffs of the options, choosing them with probabilities

that reflect their payoffs, and becoming more deterministic with experience. Not only does this approach lead to adaptive behavior, it can also be a good characterization of human learning (Daw et al., 2006).

Space Fortress Video Game and Mechanical Turk Subjects

The Space Fortress video game has had a long history in the study of skill acquisition,

first used in the late 1980’s by a wide consortium of researchers (e.g., Donchin, 1989;

Frederiksen & White, 1989; Gopher et al., 1989). Figure 1a illustrates the critical elements of

the game. Players try to keep a ship flying between the two hexagons. They are

shooting “missiles” at a fortress in the middle, while trying to avoid being shot by “shells” from

the fortress. The ship flies in a frictionless space in which, to navigate, the player must

combine thrusts in various directions to achieve a path around the fortress. Mastering

navigation in the Space Fortress environment is challenging. While our subjects are

overwhelmingly video game players, most have no experience in navigating in a frictionless

environment.

Details about our version of Space Fortress are available in Appendix A. Here we

describe what is necessary to understand the results. We used the Pygame version of Space

Fortress (Destefano, 2008) where subjects interact by pressing keys on a keyboard. In keeping

with a finger mapping used in many video games, this implementation maps the keys A, W, D

onto counterclockwise turn, thrust, and clockwise turn, respectively. Subjects pressed the space

bar to shoot. The original version of Space Fortress is more complex than our version and

requires tens of hours to achieve mastery. To focus on learning in a shorter time period we

created a simpler version that just required flying the ship and destroying the fortress. The ship

starts out at the beginning of the game aimed at the fortress, at the position of the starting vector

in Figure 1a, and flying at a moderate speed (30 pixels per second) in the direction of the vector.

To avoid having their ship destroyed, subjects must avoid hitting either the inner or outer

hexagons and keep their ship flying fast enough to prevent the fortress getting an aim on it and

shooting it. When subjects are successful the ship goes around the fortress in a clockwise

direction. If their ship is destroyed it respawns after a second in the starting position flying along

the starting vector. Our version of the game eliminated much of the complexity of scoring in the original game and kept just three rules:

1. Subjects gained 100 points every time they destroyed the fortress.

2. Subjects lost 100 points every time they died.

3. To reinforce accurate shooting, every shot cost 2 points.

To keep subjects from being discouraged early on, their score never went negative.
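
As a minimal summary of these rules, the scoring logic can be sketched as follows (the function and event names are illustrative, not taken from the game code):

```python
def update_score(score, event):
    """Apply the three simplified scoring rules plus the no-negative floor.
    Sketch only; the event names are illustrative."""
    if event == "fortress-destroyed":
        score += 100       # rule 1: 100 points for destroying the fortress
    elif event == "ship-destroyed":
        score -= 100       # rule 2: 100 points lost for dying
    elif event == "shot-fired":
        score -= 2         # rule 3: every shot costs 2 points
    return max(score, 0)   # the score never goes negative
```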

Good performance involved mastering two skills – destroying the fortress and flying the

ship in the frictionless environment. To destroy the fortress one must build up the vulnerability

of the fortress (displayed at the bottom of the screen). When the vulnerability reaches 10,

subjects can destroy the fortress with a quick double shot. Each time a shot hits the fortress the

vulnerability increments by one, provided the shots are paced at least 250 msec. apart. If the

inter-shot interval is less than 250 msec. the vulnerability is reset to 0 and one must begin the

build up of vulnerability anew. In contrast to the shots that build up the vulnerability, the double

shot to destroy the fortress must be a pair of shots less than 250 msec. apart. While subjects

could easily make sure the shots building up vulnerability are at least 250 msec. apart by putting

long pauses between them, this would mean that they would destroy few fortresses and gain few

points. Thus, they are motivated to pace the shots as close to 250 msec. as they can without

going less than 250 msec. and producing a reset. The other aspect of shooting at the fortress is

aiming (see Figure 1b). While shots need to be aimed roughly at the fortress, they do not need to

be aimed at dead center because the fortress has width. To avoid wasting time on needless aiming while still not missing, subjects need to learn the boundary that defines an adequate aim.
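
A sketch of the shooting mechanics just described, in pseudocode-style Python. The handling of boundary cases (e.g., an interval of exactly 250 msec., or a rapid shot at vulnerability above 10) follows our reading of the text rather than the game source:

```python
RESET_WINDOW = 0.250  # seconds; shots closer together than this reset vulnerability

def on_fortress_hit(vulnerability, inter_shot_interval):
    """Return (new_vulnerability, fortress_destroyed) for a shot that hits.

    Properly paced shots build vulnerability; a too-fast shot resets it,
    unless vulnerability has reached 10, in which case the rapid double
    shot destroys the fortress.  Sketch only.
    """
    if inter_shot_interval < RESET_WINDOW:
        if vulnerability >= 10:
            return 0, True           # quick double shot at vulnerability 10: kill
        return 0, False              # shot too fast: vulnerability resets to 0
    return vulnerability + 1, False  # paced shot: vulnerability increments by one
```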

The goal in flying is to maintain a trajectory that requires the fewest navigation key presses

(enabling the most shot key presses), that avoids being killed, and that optimizes one’s ability to

aim at the fortress. Subjects are instructed that they can achieve this by flying in an approximately circular orbit (created by a sequence of legs producing a polygon-like path – see Figure 1a). This

can be achieved by applying thrust when they start to get too close to the outer hexagon. They

can only maintain a successful flight path if they control the speed of their ship. If the ship

moves too slowly the fortress can shoot it and destroy it. If they start moving too fast they will

not be able to make navigation actions in time. They can change the speed of the ship by the

direction the ship is pointed when they thrust (thrust angle in Figure 1b). The ship can be sped up

by changing its thrust angle so that it points more in the direction it is currently going and the

ship can be slowed down by aiming more in the opposing direction. Exactly how current speed,

current direction, and orientation of the ship determine future speed and direction is something

that subjects must learn from experience. Also, the amount of thrust is determined by the length of time they hold down the key, and they need to learn appropriate durations for the thrust.
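
The frictionless dynamics that subjects must learn can be idealized as a simple vector update: thrusting adds to the ship's velocity a component along the ship's current heading, scaled by how long the W key is held. This is an illustration under assumed constants; the exact update in the Pygame implementation may differ:

```python
import math

THRUST_ACCEL = 40.0  # pixels/sec^2 while W is held (illustrative value)

def apply_thrust(vx, vy, heading_deg, hold_time):
    """Frictionless velocity update: thrust acts along the ship's heading.

    Thrusting toward the current direction of motion speeds the ship up;
    thrusting against it slows the ship down.  Sketch only.
    """
    dvx = THRUST_ACCEL * hold_time * math.cos(math.radians(heading_deg))
    dvy = THRUST_ACCEL * hold_time * math.sin(math.radians(heading_deg))
    return vx + dvx, vy + dvy
```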

We created an alternative version of the game, called AutoTurn, which simultaneously

reduced the demands of aiming at the fortress and of maintaining a successful flight path. In this

version the ship was automatically turned to always aim at the fortress. One could control the

path and speed of the ship by timing thrusts of appropriate durations. These two games created

two conditions, called YouTurn and AutoTurn.

While these versions were piloted with CMU student subjects, the data come from

Mechanical Turk subjects. The performances of the two populations were similar but the

Mechanical Turk subjects performed slightly better, probably because the Mechanical Turk

subjects who selected this HIT were almost all video gamers. Only one subject reported no video

game experience, although no one reported experience with flying in a frictionless environment.

A minority reported having played Asteroids, which has some similarity to Space Fortress. No

subject mentioned Star Castle, which was the arcade game that Space Fortress was based on.

Asteroids is one of the Atari games that deep reinforcement learning has performed

poorly on (Mnih et al., 2015), and despite improvements, reinforcement methods still

considerably underperform experienced game players (e.g., Elfwing et al., 2017; Zhu & Poupart,

2017). Appendix B reports deep reinforcement learning results for Space Fortress based on the

original Mnih et al. approach. Even after extensive training the performance is much worse than

human performance. One source of difficulty for Space Fortress (and probably Asteroids) is that

actions can have long-term consequences for the flight of the ship that hurt future prospects but

can have no or contradictory effects on score. Another source of difficulty unique to Space

Fortress is learning to make 10 paced shots followed by a rapid pair of shots (and all shots

decrement score) to gain the big score from destroying the fortress. Thus, in effect it is

necessary to apply one policy ten times, getting small negative reinforcements along the way,

and then switch to a different policy. Space Fortress instructions inform subjects about this

whereas reinforcement learning methods try to discover it. Some game players can discover this

without instruction because the concept of building up vulnerability to destroy an enemy is a

common video game theme (see Appendix C).

The subjects played 20 3-minute games. Some subjects failed to show learning over the

course of the 20 games and some of these might not have been trying. We regressed subjects’ scores

on game number and excluded subjects who did not show a positive slope with significance of at

least .1. This mainly eliminated poor performers, including a number who scored 0 points for all

games (but Appendix C reports average results for all subjects). There were 65 subjects in

YouTurn (19-55 years, mean 31.9 years, 29 female) of which 48 met the learning criterion for

inclusion (19-55 years, mean 32.5 years, 22 female, 1 non video game player, 6 who had played

Asteroids). YouTurn subjects were paid a base pay of $7.50 for completing the experiment

and .02 cents per point (average bonus $2.11). There were 52 subjects in AutoTurn (19-41

years, mean 31.7 years, 18 female) of which 42 met the learning criterion for inclusion (24-41

years, mean 32.5 years, 15 female, all had played video games, 10 who had played Asteroids).

AutoTurn subjects were paid a base pay of $4 for completing the experiment and .02 cents per

point (average bonus $7.88). Appendix C reports average results for all subjects.
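
A minimal sketch of the inclusion criterion described above, assuming an ordinary least-squares regression of score on game number and a one-tailed test of the slope at the .1 level (the exact test is not specified in the text):

```python
from scipy.stats import linregress

def meets_learning_criterion(scores, alpha=0.1):
    """Regress a subject's 20 game scores on game number and require a
    positive slope significant at the .1 level (one-tailed test assumed)."""
    games = list(range(1, len(scores) + 1))
    fit = linregress(games, scores)
    one_tailed_p = fit.pvalue / 2 if fit.slope > 0 else 1.0
    return fit.slope > 0 and one_tailed_p < alpha
```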

Figure 2 shows the average performance of subjects in the two versions of the game. To

stress the magnitude of individual differences, the error bars denote a standard deviation of the

population (not of the mean). Both populations show substantial improvement over the course

of the 20 games, but throughout the experiment AutoTurn subjects are much better than

YouTurn subjects, indicating a major cost associated with having to turn one’s ship. Subjects in

YouTurn achieve only 53% as many shots on the fortress as AutoTurn subjects and average 3.8

times as many deaths. We postpone further discussion of the data until after the description of

the model of the task and its theoretical assumptions.

A Model that Plays Space Fortress and Learns.

A single ACT-R model was run in both versions of the game. That model consists of a

set of production rules that respond to information in a number of buffers standard to ACT-R models: information about the screen, like where the ship is, is held in the Visual buffer; information about various control variables, like the time to wait between shots, is held in the Goal buffer; information about passage of time is in the Temporal buffer; information about recent outcomes (like a shot having missed) is in the Imaginal buffer; and retrieved information about the game instructions is available in the Retrieval buffer. The development of game competence is based on three critical theoretical mechanisms: a system of rules that interprets declarative representations of instructions, the production compilation mechanism that converts

declarative knowledge into direct action productions, and the new Control Tuning component

that identifies successful parameters for action.

Declarative Interpretation of Instructions

The subjects were given instructions relevant to playing YouTurn and AutoTurn (available

in the online versions of the game – http://andersonlab.net/sf-paper-2017/turning.html). The

instructions were represented in ACT-R’s declarative memory as a set of operators (as in many

prior ACT-R models – e.g., Anderson et al., 2004, Anderson, 2007; Taatgen et al., 2008,

Anderson & Fincham, 2014). Operators take the form of specifying a state in which they apply,

specifying the action to take in that state, and then specifying the new state that will result.

Figure 3 illustrates the state structure of these operators where each node represents a state. In

Space Fortress many of the operators serve a selection function in that their action is to test for a

condition in the game and the resulting state depends on the outcome. This produces the

branching structure in Figure 3. The labels on the arrows emanating from these states are the

tests that have to be satisfied to go to the state the arrow is pointing to. The labels on the arrows

emanating from the bottom states are the key presses specified by the operators associated with

these states. The A key turns counterclockwise, the D key clockwise, the W key thrusts, and the space bar shoots. Upon pressing a key the model would return to the top of the tree to determine the

next action. To illustrate what these operators are like, we will focus on the highlighted path

through the tree that leads to pressing a D key in service of shooting:

1. Fortress Present: The appropriate actions depend on whether the fortress is present or

not. The fortress is off screen for a second each time it is destroyed before it respawns

(shooting is not useful when there is no fortress and one can concentrate on navigation).

In the case of the highlighted path, the fortress is present and the operator specifies

transitioning to the state for decision making when the fortress is present.

2. Far from Border: The choice to shoot or navigate depends on how close the ship is to

drifting into the outer hexagon (too close is dangerous; Control Tuning learns what

defines too close). In this case, it is far from that border and the model can focus on

shooting.

3. Aim Beyond: The decision to shoot depends on whether the ship is aimed at the fortress

(what constitutes an adequate aim is something else that needs to be learned by Control

Tuning). In this case the ship is aimed too far beyond the fortress and it is necessary to turn clockwise.

4. To turn clockwise, one presses the D key. (A sketch of how such operators might be encoded follows this list.)
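
To make the operator representation concrete, here is one way the highlighted path could be written down as state-test-action triples. This is an illustrative Python rendering, not ACT-R chunk notation; the state and slot names are ours:

```python
# Each operator names the state it applies in, the test or action it performs,
# and the state(s) that result.  After any key press the model returns to the
# top of the tree.  Sketch only; real operators are ACT-R declarative chunks.
OPERATORS = [
    {"state": "start",             "test": "fortress-present?",
     "if-true": "fortress-present", "if-false": "no-fortress"},
    {"state": "fortress-present",  "test": "far-from-border?",
     "if-true": "consider-shooting", "if-false": "navigate"},
    {"state": "consider-shooting", "test": "aim-beyond-fortress?",
     "if-true": "turn-clockwise",   "if-false": "check-aim"},
    {"state": "turn-clockwise",    "action": "press-D", "next": "start"},
]
```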

The model starts with instruction-following productions that interpret these instructions and

execute them. For instance, Figure 4 shows the ACT-R production (right) and the buffers (left)

that the production matches when the second operator above is used. In green is the operator that

is in the retrieval buffer. It specifies performing a test on distance to the big hex. In red are the

contents of the goal buffer, including the threshold for distance to the big hex as 33.95 pixels4.

In blue are contents of the visual buffer, including the current distance from the big hex as 112.85 pixels. The operator specifies that if the distance to the big hex is greater than the threshold

(which it is in this case) the model should go on to testing the aim. The production requests

retrieving the operator for aiming, which is the third operator in the sequence of 4 above.

4
To be clear, we are not assuming humans see distance in pixels (let alone fractions of
pixels) only that their perceptual scale is linear in pixels.

In the ACT-R model, the average time to retrieve an instruction is about 100 msec. and

the average time to execute a production is 50 msec. Thus, each of the 4 steps above will take about 150 msec., and the time to execute this sequence will be approximately 600 msec. with

possibly some motor execution time5.

Production Compilation

Without the instructions just discussed the model is not able to play the game. With these

instructions but without production compilation, the model is able to play the game but

accumulates virtually no points because it averages 4 to 8 times as many deaths as fortress kills

(see Figure 13 later in the paper). It often takes too long to step through these instructions to act

in time. (The timing parameters that control the speed through these steps are based on a history of successful ACT-R models.)

5
Counting motor execution into the total time is complex because execution of a motor action
can overlap with retrieval of what to do next.

Subjects also initially find the task overwhelming. In YouTurn

the subjects average 100 seconds to destroy their first fortress, while they only take about 5

seconds to kill the fortress by the end of play (for the model runs the average is 81 seconds for

first kill and 5 seconds asymptotically).

Production compilation (Taatgen & Anderson, 2002) has been used in past instruction-

following models of skill acquisition (e.g., Anderson et al., 2004, Anderson, 2005, Taatgen et al.,

2008; Taatgen, 2013) to capture the transition that people make from acting on the basis of retrieved declarative information to responding directly to the situation. If two productions fire in

sequence, then they can be combined into a single new production that is added to procedural

memory. If the first of these productions makes a request to declarative memory, and the second

production uses the retrieved fact to determine what to do, then the retrieved fact is substituted

into the new production. So, for instance, consider the production on the right of Figure 4. It

would apply to interpret both steps 2 and 3 because they both involve testing whether one

quantity is greater than another. As illustrated in Figure 5, production compilation would

combine a double execution of this production (first to process operator 2, second to process

operator 3) into a single rule that responds to the retrieval of operator 2 and has the content of

operator 3 built into it. Thus, it not only collapses two productions into one, it skips the retrieval

of a declarative instruction. Production compilation would combine this rule with others and

eventually a single rule would emerge that encapsulates the sequence of 4 operators into the

direct-action production illustrated in Figure 6. The production in Figure 6 detects the presence

of a fortress (first operator), the distance from the border (second operator), the aim of the ship

(third operator), and the fact that a clockwise correction is required (fourth operator). This

production collapses a sequence that originally took about 600 milliseconds when in

instruction-following mode to 50 msec. The final execution still requires the time to press D and

this motor time does not change (but see footnote 5). High levels of performance require such

production rules that directly respond with key presses to different game situations. However,

the model can start to achieve positive points even with production rules that compile parts of

the operator sequences (like in Figure 5) and so speed up performance. The development of such

intermediate rules allows the model to begin to achieve kills about midway through the first

game.
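
The combination step can be sketched as follows: if the first rule's action was a declarative retrieval and the second rule acted on the retrieved fact, the compiled rule keeps the first rule's condition, builds the fact's content in, and performs the second rule's action directly. The dictionary representation below is our illustration, not ACT-R's production syntax:

```python
def compile_productions(p1, p2, retrieved_fact):
    """Collapse two productions that fired in sequence into one new rule.

    The retrieval requested by p1 is skipped: the retrieved fact's content
    is built into the compiled rule, which performs p2's action directly.
    Sketch only.
    """
    return {
        "condition": p1["condition"],        # matches wherever p1 matched
        "knowledge": dict(retrieved_fact),   # fact content baked into the rule
        "action": p2["action"],              # act directly, no retrieval step
        "utility": 0.0,                      # new productions start at 0 utility
        "parent": p1,                        # must out-compete its parent rule
    }
```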

A compiled production must compete with the first production that it originated from (the

“parent” production) since that parent production will match in any situation that the new

production will match. Utility values of productions determine which production applies when

more than one matches the current situation. The utility of a production reflects the value of

what it eventually achieves minus the amount of time it takes to do that. What this model is

trying to achieve is to press the next key to play the game and it treats all key presses as having

the same utility. Thus, the factor that determines the relative utility of productions is how fast

they lead to key presses.

This model uses the standard ACT-R utility learning (Anderson, 2007). In ACT-R new

productions start with zero utility and must learn their true utility. Thus, at first a new

production will lose in competition with its parent and will only gradually begin to take effect as

its true utility is learned. Whenever the parent production fires and production compilation

recreates the new production, ACT-R adjusts the utility of the new production towards the

parent’s utility according to a simple linear learning rule:

Utility = Utility + (Parent-Utility – Utility)

The utility values fluctuate moment by moment from the value prescribed above because of a

normal-like noise. Thus, when the utility of the new production approaches that of the parent,

the new production will have some probability of applying. When the production rule does

apply, its value can be determined, and its utility will change according the linear learning rule

Utility = Utility + (Value – Utility)

The value of a compiled rule will be greater than its parent’s value because it is more efficient.

Eventually the compiled rule will acquire greater utility and tend to be selected rather than the

parent. Thus, new productions are only introduced slowly, which is consistent with the gradual

nature of skill acquisition. The parameter α determines how fast the utility of the new production grows and comes to exceed the parent’s. It takes many opportunities before rules come

to dominate that skip all instruction-following steps and just perform the appropriate actions.
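
A sketch of the utility learning just described. The learning rate α and the noise term are ACT-R parameters; the values below are illustrative, and Gaussian noise is used as an approximation of ACT-R's "normal-like" utility noise:

```python
import random

ALPHA = 0.05      # learning rate alpha (one of the values explored later)
NOISE_SD = 0.5    # SD of the moment-to-moment utility noise (illustrative)

def update_utility(utility, target, alpha=ALPHA):
    """Linear learning rule: move utility toward the parent's utility when the
    rule is recreated, or toward the experienced value when the rule fires."""
    return utility + alpha * (target - utility)

def noisy_utility(utility):
    """Utility used for conflict resolution on a given cycle; the transient
    noise lets a new rule occasionally win once it nears its parent's utility."""
    return utility + random.gauss(0.0, NOISE_SD)
```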

The speed with which direct-action productions are acquired also depends on their frequency in

the game. For instance, the direct-action single-shot production is learned faster than the direct-

action double-shot production because the single-shot action occurs approximately 10 times as

often (to build up fortress vulnerability) as the double-shot action that destroys the fortress.

Control Tuning

A model that has the declarative instructions and production compilation still does not

learn to play at human level. Runs of such models playing YouTurn reach an average of about

200 points per game (see Figure 13 later in the paper), much under the average of YouTurn

subjects in Figure 2. To learn how to play YouTurn, the model must learn settings for 5 control

values: shot timing, aim (see Figure 1b), thrust timing, thrust duration, and thrust angle (see

Figure 1b). Here we will focus on shot timing and aiming as illustrative of the general principles

behind Control Tuning. Information on the other control values is available in Appendix D.

Control Tuning involves three critical components: selecting a range of possible values for a

parameter that will control an action, attending to relevant feedback, and converging to a good range for that parameter.

Setting a Range for a Control Value. Subjects have some sense of what good values are for

shot timing and aim and we set ranges for the model that reflect that.

1. Shot timing. The instructions tell subjects that while they are building the

vulnerability of the fortress up to 10 their shots must be at least 250 msec. apart.

Subjects have a vague idea of what constitutes 250 msec. and tune that range to

something more precise. ACT-R can be sensitive to the passage of time through its

Temporal module that ticks at somewhat random and increasing intervals producing

the typical logarithmic discrimination for temporal judgments (Taatgen et al., 2008).

The model places the possible range between 9 and 18 time ticks. Given the standard

parameterization of the Temporal module, 9 to 18 ticks correspond to a range from 150 to 500 msec. (a sketch of this mapping follows the list).

2. Aiming. The ship must be pointed towards the fortress to shoot it (see Figure 1b). All

but 4 subjects start out with average aims less than 20 degrees off (maximum was 33

degrees) suggesting that most understood the need for aiming from the beginning.

The ship turns in 6-degree increments, which means it is futile to aspire to less than

a 3 degree offset in aim. The most accurate subject averaged 5 degrees offset. The

model starts out with a range of 3 to 20 degrees and learns the best value in that

range.
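
A sketch of the tick-to-time mapping referred to in point 1, assuming the Temporal module's standard parameters (an initial tick of about 11 msec. that grows by a factor of about 1.1 per tick, with noise ignored). Under these assumptions 9 ticks come to roughly 150 msec. and 18 ticks to roughly 500 msec., matching the range above:

```python
def ticks_to_seconds(n_ticks, start=0.011, mult=1.1):
    """Approximate (noise-free) time spanned by n Temporal-module ticks,
    assuming the standard start increment and multiplier."""
    total, tick = 0.0, start
    for _ in range(n_ticks):
        total += tick
        tick *= mult
    return total

# ticks_to_seconds(9) is about 0.15 sec; ticks_to_seconds(18) is about 0.50 sec
```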

Attending to Relevant Feedback. One of the reasons human players can learn faster than the

typical deep reinforcement learners is that they know what features of the game are relevant as

feedback to the setting of each parameter. They adjust their behavior according to these features

and not raw points scored. The model looks for a control setting that yields an optimal

combination of two relevant features specific to the dimension it is controlling:

1. Shot timing. Subjects know that a good setting for shot timing results in increments

to vulnerability and not resets of vulnerability. Therefore, the features that the model

attends to are increments and resets, weighting the first positively and the second

negatively. For simplicity, for shot timing and all other control variables the positive

and negative weights are equal (+1 and -1). Note that it treats increments in

vulnerability as positive outcomes even though each shot results in a 2-point

decrement in score. Note also that very slow shooting will eliminate resets but it will

also decrease number of hits.

2. Aiming. Subjects know that a good aim results in a hit and a bad aim results in a

miss. Therefore, the model weights hits positively and misses negatively. Similar to

timing, if it tries to aim too accurately it will avoid misses but not get as many hits

because of unnecessary turns to improve aim.

Convergence. A number of ACT-R models (e.g., Taatgen et al., 2007; Taatgen & van Rijn,

2011; Moon & Anderson, 2012) have learned timing using the Temporal module, which is used

for shot timing in the current effort. These models store in declarative memory all results for

different choices of number of time ticks and select a time for waiting by merging the results of

the individual choices together using an ACT-R mechanism called blending (Lebiere et al.,

2007). These models retrieve these past experiences and then a production acts on a blended

outcome. There is not enough time in a fast-paced game for this retrieve-and-act cycle to work.

It is all the more difficult because there are many control dimensions to monitor and select good

values for. To capture the successful performance of the subjects, we needed blending to happen

independent of declarative retrieval and procedural execution so that it would not interfere with

the use of the declarative and procedural modules for game play. To achieve this we created a

new Control Tuning Module that functions in parallel with other modules.

The Control Tuning Module samples a candidate control setting for a while, determines a

mean rate of return per unit time for that control setting, and then tries another control setting. It

gradually builds up a sense of the payoff for different control values by blending these outcomes

together. It learns a mapping of different choices onto outcomes through a function learning

process (e.g., Koh & Meyer, 1991; Kalish, Lewandowsky, & Kruschke, 2004; McDaniel &

Busemeyer, 2005). As suggested in Lucas et al. (2015), we assume a learning process that favors

a simple function as most probable and in the current context this leads to estimating a quadratic

function V(S) that describes payoff as a function of the control setting S. A quadratic function

facilitates identification of a good point, which is where the function has a maximum. The

model starts off seeded with a function that gives 0 for all values in the initial range and weights

this function as if it were based on a certain amount of prior observation (20 seconds in the current

model). It starts to build up an estimate of the true function as actual observations come in.
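
One way to sketch this estimation step is as a weighted quadratic fit over observed payoff rates, seeded with a flat prior worth 20 seconds of observation. This least-squares rendering is our approximation of the blending-based estimate; class and parameter names are illustrative:

```python
import numpy as np

class ControlDimension:
    """Sketch of one Control Tuning dimension (e.g., shot timing)."""

    def __init__(self, low, high, prior_seconds=20.0, n_prior=5):
        # Seed with a flat prior: payoff 0 across the initial range,
        # weighted as if based on prior_seconds of observation.
        self.settings = list(np.linspace(low, high, n_prior))
        self.payoffs = [0.0] * n_prior
        self.weights = [prior_seconds / n_prior] * n_prior

    def observe(self, setting, reward, duration):
        """Record the mean rate of return for an interval spent at `setting`."""
        self.settings.append(setting)
        self.payoffs.append(reward / duration)   # payoff per unit time
        self.weights.append(duration)            # longer intervals count more

    def value_function(self):
        """Weighted quadratic fit V(S); np.polyfit weights scale residuals,
        so pass the square root of each observation's duration."""
        coeffs = np.polyfit(self.settings, self.payoffs, deg=2,
                            w=np.sqrt(self.weights))
        return np.poly1d(coeffs)
```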

The control settings are changed at random intervals. The durations of these intervals are

selected from an exponential probability distribution (with a mean of 10 seconds in the current

model). Figure 7 shows the quadratic function that was estimated for shot timing in one run of

the model after playing a single game (180 seconds). The red dots reflect performance of

various choices during that game. They are quite scattered reflecting the random circumstances

of each interval. For instance, note the value at 9 ticks, which is approximately 150 msec. This

brief an interval would typically lead to resets, but at the beginning of the game when instruction

interpretation slows down action, it actually results in about 2 hits per second.

As discussed with respect to reinforcement learning, in a variable environment Control

Tuning needs to balance exploration to find good values with exploitation to use the best values

it currently has. For this purpose Control Tuning uses a softmax equation that give the

probability of selecting a value S at time t:

V(S) is the current value of the quadratic function for control value S. T(t) is the temperature

which decreases over time according to the function T(t) = A/(1 + B*t), where A is the

initial temperature and B scales the passage of time (in the model this is set to 1/180 of a second,

reflecting a 180-second game)6. Figure 8 illustrates convergence for one random run of 20 games.
6
The variable t reflects experience in the game, not clock time. So, if a player has a break
between games, the value of t does not advance.

Figure 8a shows that in this run the estimated quadratic function became fairly well tuned by

game 5. However, Figure 8b shows that the choices continue to concentrate more around the

optimal value as the model plays more games.
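
A sketch of the selection step, pairing the estimated V(S) with the softmax rule and the temperature schedule T(t) = A/(1 + B*t). The candidate grid and the default value of A are illustrative (A is varied between .1 and 2 in the models reported below; B is 1/180):

```python
import numpy as np

def temperature(t, A=0.5, B=1.0 / 180.0):
    """Temperature decay T(t) = A / (1 + B*t), with t the seconds of game
    experience accumulated so far."""
    return A / (1.0 + B * t)

def choose_setting(candidates, V, t, rng=None):
    """Softmax choice over candidate settings: P(S) proportional to
    exp(V(S) / T(t)).  Early on (high T) choices are near-uniform; as T
    falls the choice concentrates on the best-valued setting.  Sketch only."""
    rng = rng or np.random.default_rng()
    scaled = np.array([V(s) for s in candidates]) / temperature(t)
    scaled -= scaled.max()                 # for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return rng.choice(candidates, p=probs)

# Usage with the sketch above: dim = ControlDimension(9, 18);
# choose_setting(np.linspace(9, 18, 10), dim.value_function(), t=300)
```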

Summary. The Control Tuning module learns to map a continuous space onto an evaluation by

exploring that space and converging on an optimal point. While the spaces being mapped in this

task are all one-dimensional, the methods, being based on ACT-R blending, could be generalized to multi-dimensional spaces (as in the backgammon model of Sanner et al., 2000).

Other function-learning approaches could be applied to this task. Indeed, deep reinforcement

learning methods could apply to this task. This would differ from the typical application of deep

reinforcement learning methods to games in that (a) they would apply to reduced spaces relevant

to a particular control variable rather than the full space of the game, (b) the space would be

defined by prior assumptions about plausible ranges for that control variable, and (c) the

evaluation would be specific to that control variable. Such an informed focus seems

characteristic of human learning. In some cases this focus can mean that humans miss learning

the best solutions, but the focus enables rapid learning.

Comparison of Model and Data

Figure 9 compares subjects and model in terms of total score. We ran 100 models in each

condition. Besides matching the overall growth in points of the subjects, the models show a high

variability in performance, almost matching the range shown by the subjects. The 100 models

were created by crossing 4 different strategies with 5 different settings for production compilation and 5 different settings for Control Tuning.

1. Strategies. Four different strategies for navigation were implemented to reflect

different things subjects reported. Some reported attending to the distance to the outer border while others reported attending to the time to reach the outer border. Some reported

lowering their threshold for thrusting when the fortress was absent to get the

navigational needs over with when there was nothing to shoot at. Others reported a

constant navigation emphasis. Crossing these two factors yielded the 4 strategies.

Strategy choice showed no significant effects on score in AutoTurn, but in YouTurn

models attending to time averaged 182 points better than the models attending to

distance (F(1,24) = 32.64, p < .0001) and models emphasizing navigation when

the fortress was absent did a marginal 38 points better (F(1,24) = 4.18, p < .10). The

interaction was not significant (F(1,24)=2.92).

2. Production Learning Rate α. We used values of .025, .05, .1, .2, and .3 for this

learning parameter that controls the rate at which compiled productions replace the

productions they are compiled from. This variable loosely reflects differences in

relevant prior experience in subjects. More experienced subjects probably start with

some partial compilations as Taatgen’s (2013) transfer model might imply. Mean

points varied strongly with this factor: 1757, 1975, 2025, 2231 and 2210 points in

AutoTurn and 361, 700, 853, 1080, and 1089 points in YouTurn.

3. Initial temperature A. We used values of .1, .25, .5, 1, and 2. This variable reflects

attention and motivation in subjects. More attentive subjects will incorporate more of

their learning experiences and more motivated subjects will weight the payoff higher.

Either factor results in a more rapid focus on good control values. The variation of

mean scores with this parameter was 2077, 2010, 2111, 2038, and 1964 in AutoTurn

and 871, 933, 861, 767, and 648 in YouTurn. The overall variation in AutoTurn was not

significant (F(4,76)= 1.86) but it was in YouTurn (F(4,76)= 18.73, p < .0001).

Performance was worst when this variable was set to 2 (high exploration) because the

model remained quite variable in its choice behavior. Performance can also be worse

with low noise because the model can prematurely converge on non-optimal values.

The goal in producing a set of different models was not to create a model of each subject’s

unique behavior, but to show that the theory predicts subject-like behavior over a range of

individual variation of the sort that is occurring in subjects.

The right panels of Figure 9 (panels c and d) show the strong relationship that exists

between the mean number of keys a subject hits (averaged over the 20 games) and the mean

number of points per game that the subject achieved (r = .799 in AutoTurn and .766 in

YouTurn). The model reproduces the relationship. The range of either measure is not as

extreme in the model as in the subjects and the correlations are a bit weaker in the model (r

= .693 and .621). The relationship would not emerge if subjects or models were making

counterproductive actions (e.g., shooting too fast and resetting or thrusting too much and running

into the hexagons). However, as subject and model actions tend to be things that lead to points,

the more rapidly they can execute them the more points they can build up.

The subjects are slightly outperforming the model in AutoTurn (an average difference of

80 points), but not significantly so (t(140) = 0.68, p > .05). Subjects are not as good in YouTurn (an

average difference of 117 points, again not significant; t(146) = 1.60). Even if these differences

were significant, little can be made of such direct comparisons of level of performance. The

subjects do not reflect all the variation in a larger population that did not self-select to be in this

experiment on Mechanical Turk. The models also do not reflect all the possible models that

could be created with different strategies or parameters. Also the models are trying all the time,

something that is not true of every subject. More important than the absolute level of scores are

the learning trends over games. Both models and subjects show learning but the models are

learning faster in YouTurn. The improvement from the first 5 games to the last 5 is 833 points

for the models and 630 points for the subjects, a difference that is quite significant (t(146)=3.34,

p < .005).

Each time the fortress is destroyed subjects gain 100 points and each time they are killed

they lose 100 points. Thus, kills and deaths largely determined their total points. The only

additional factor is the 2 points that each shot costs. Figure 10 looks at the growth in fortress

kills over trials and the decrease in ship deaths. The correspondence between models and

subjects is strikingly close for fortress kills but subjects achieve a lower rate of death in

AutoTurn while the models outperform the subjects on this measure in YouTurn. The

interaction for death rates between conditions and populations (subjects versus models) is

significant (F(1,284) = 8.07). However, the two populations show considerable overlap in

Figure 10.

To keep the discussion focused on the underlying learning processes rather than the details of Space Fortress, we will concentrate further comparisons on Control Tuning, which is

the new theoretical addition to ACT-R. We will describe results relevant to two control values,

shot timing and aiming. These results are representative of the learning of other control values,

which are described in Appendix D.

Shot Timing. The simplicity of AutoTurn makes it good for assessing the learning of

shot timing. Most of the key presses are shots, produced in runs of consecutive shots (an average of 6.3 in a row for subjects and 6.9 for the models). This means one can use intershot times to

get an estimate of how subjects are pacing their shots. Figure 11a compares subjects and models

in terms of intershot time. Both show a decrease from about 600 msec. to about 350 msec. The

asymptotic value of 350 msec. is 100 msec. above the minimum to avoid a reset, but given

variability in one’s ability to time actions and the game variability7, the optimal setting for this control variable is above the minimum.
7
The game updates every 1/30th of a second (a game tick) and requires that shots be 8 game ticks apart to avoid resets. This means the actual time needed to avoid a reset can vary from 234 to 267 msec. depending on when the key press occurs relative to the game ticks.

Note that on the first game the models average an

intershot time of 600 msec. even though the upper bound of their range for timing is 500 msec.

They take this long because of the time to reason through all of the steps in determining that they

can shoot. We assume the slow initial times of subjects also reflect the cost of thinking through

what they should do.

Figure 11b compares subjects and models in terms of frequency of resets. Subjects

average about 3 resets while the models average about 2 (t(146)=3.91, p < .0001). While resets

are not decreasing, the number of shots taken is increasing from about 250 to 450 reflecting the

decrease in intershot time and other improvements in performance. So the proportion of shots

resulting in resets is decreasing.

Figure 11c shows the probability of Control Tuning choosing different waiting times

averaged over the hundred models. The models tend to converge on a sweet spot at about 300

msec. (production execution adds another 50 msec. to intershot times). There is a small rise in

choice probability at 500 msec. even after 20 games. This reflects two models that converge on

high durations and stop exploring. Both of these models never reset the fortress but score rather

poorly.

Aiming. Subjects have to turn their ship in YouTurn and turning is critical both to

aiming at the fortress and controlling navigation. Figure 12 displays information on aiming comparable to that in Figure 11 on shot timing. Part (a) shows how far the aim was off the fortress when a

shot was taken. Both subjects and models show a decrease in the average angular disparity over

games. Part (b) shows the resulting decrease in proportion of missed shots. While both

populations show similar decreases and both populations overlap, the models average more

accurate behavior. Figure 12c shows that the models tend to converge on a maximum disparity


of about 9-10 degrees. The slight peak at 0 reflects two models that got locked into seeking

maximum accuracy. These two models made 50% more turns than average, made half as many

shots, and scored less than half as well as the average model.

Exploration of Model Variants

The models showed learning results much like those of human players. Three aspects of the

models were critical to this outcome – interpretation of declarative instructions, the production

compilation process, and Control Tuning. To investigate the contributions of production compilation and Control Tuning, we created 9 variants on the original model by crossing 3

choices with respect to production compilation and 3 choices with respect to Control Tuning.

For production compilation the three choices were

1. Already Compiled. The model started out with the direct action productions that

would result from production compilation.

2. Compiling. New productions were created from the instruction-following rules as in

the models reported in the prior section.

3. No Compiling. The model worked by interpreting the declarative instructions

throughout the game. Production compilation was turned off.

For Control Tuning the three choices were

1. Tuned. The model started with the control parameters fixed at high-performing values.

2. Tuning. The Control Tuning mechanism was in operation as in the models reported in

the prior section.

3. No Tuning. The choice of control values stayed uniform across the full range of values

sampled. Control Tuning was turned off.

All models had the strategy of attending to time to border and emphasizing navigation during the

1-second periods that the fortress was absent. This was the best scoring of the 4 strategies that

contributed to the model results compared to subjects. The combination of the second choice for

both (Compiling, Tuning) was identical to the model reported in the prior section. We used the same

sets of choices for the compilation learning rate and for the initial temperature. We gathered

average scores over 100 model runs for each choice.

Figure 13 shows the average performance in YouTurn of these different combinations

extended out to 100 games to clearly identify asymptotic performance. The three No Compiling

versions are scoring close to zero points (averages between 10 and 20 points per game) because

they cannot act with sufficient speed to play the game properly. The Compiling models come

up to the level of the Compiled models over the course of the games. The No Tuning,

Compiling models actually outperform the No Tuning, Compiled models because these models

miss some of the resets that would occur when a short intershot delay is chosen8. The Compiled,

Tuned models have nothing to learn and define the optimal performance. They average about 1900 points over the course of the games.
8
Along the way to acquiring direct action productions, production compilation will produce a
pair of shooting productions, one that goes from Start state in Figure 3 to Shoot state and another
that goes from the Shoot state to the key press. In cases where the time delay has not yet passed,
the first production of this pair will fire rather than the direct action production. When the time
delay has passed the second action production will fire. When there is no tuning, the models keep
choosing too short times with substantial probabilities. This 2-production combination serves to
delay the shot a bit and avoid many resets.

The models that are learning Control Tuning reach an average of about 1450 points by the end of the 100 games. Their lower score than the pre-tuned

models reflects that some of the models are converging to suboptimal control choices (see

Figures 11c and 12c).

Tsividis et al. (2017) discuss why humans can learn to play Atari video games so much

faster than deep learning systems. As they note, humans (and this is true of the ACT-R

models) come into the task with many advantages, including the ability to parse the screen and

prior understanding about the games. The ACT-R models also start with instructions about how

to play the game. Tsividis et al. (2017, and also Tessler et al., 2017) show such instructions

confer a substantial advantage on human players. However, the performance of the No Tuning

models in Figure 13 provides an illustration that abstract knowledge about how to play the game

is not enough for high human performance. Humans need to perfect that knowledge through

experience with the task at hand. In our models, production compilation and tuning underlie this

gradual perfection.

The tuning that subjects and the models are doing has similarity to the learning in Mnih et al. (2015), but subjects and models have advantages in the learning they are doing.

1. Subjects and the models start out knowing what the relevant control variables are and

what actions these variables are relevant to. This knowledge includes what players

already know (e.g., aim is critical to shooting) and other knowledge explicitly

instructed (e.g., vulnerability must be at 10 for the kill shot).

2. Subjects and the models have a sense of what the relevant range is of possible values

for the control variables (e.g., how accurate an aim is good enough, how much time

corresponds to 250 msec.).

3. Subjects and the models know what outcomes are relevant to assessing these

variables (e.g., vulnerability resets indicate shots are too rapid, misses indicate aims

are not accurate enough). Unlike reinforcement learners they are not learning from

total points, which have a less direct connection to the setting of control variables.

We re-engineered the model to assess the impact of these features of the modeling approach.

As a reference, we used the precompiled version of the model so the learning outcomes would not

be confounded with the effects of production compilation (the compiled-tuning variation from

Figure 13). Figure 14 compares the result of this option (called Model) with the following

options:

Wide. To assess the contribution of having a sense of the range for control values, we

doubled the range of initial values. For shot timing this meant going from a range of 9-18

ticks to a range of 6 to 24 ticks; for aim it meant going from a range of 3-20 degrees to 3-37 degrees. This version does not reach asymptotic behavior as fast as the other models.

It may be reaching an asymptote by 100 games but many of these models are converging

on suboptimal values (given the temperature settings).

Points. We created a version of the model that used change in points to assess each

control variable rather than features directly relevant to that variable like resets for shot

timing and misses for aiming. Points were divided by 10 to get magnitude of

reinforcement in the same range as the magnitudes used in Figure 10. This version does

not do as well as the model with more direct feedback, but we were surprised by how

well it did do.

Exploit. The No Tuning models in Figure 13 always explored the full range of control

values, according to the softmax equation, rather than using the current best performing

value. We created a “pure exploitation” model that always chose the current best value.

This model performs better on the first few games because it tends to select reasonable

control values. However, it sticks with these values and does not improve by identifying

better values.

Instant. The ACT-R model represents the time costs of the perceptual, cognitive, and

motor actions. The line labeled “Instant” in Figure 14 represents a model with those

limitations removed. This version still needs to press keys long enough to get the effects

it wants and needs to pace its shots to avoid resets, but it can begin executing an action as

soon as it is needed. Not surprisingly this model performs much better than the ACT-R

model, but it still has to learn appropriate settings for control variables. This model

represents an advantage that the artificial deep reinforcement learning systems have.

Discussion

We should emphasize that this is the first model that learns to achieve human-level

performance in ACT-R9. Moreover, it learns in a way that displays much similarity to human

learning, making it a plausible model of humans, the only other agents that so far can learn in

Space Fortress. In this discussion, we will try to situate the theory behind the model in the

framework of the Fitts three-phase characterization and human skill learning more generally.

9
To see a minute of model play compared to a minute of human play, both late in training,
visit
http://act-r.psy.cmu.edu/wordpress/wp-content/uploads/2018/04/A2O7H7VXLFN6BP_v
s_model_2981_game_8_x2.mpg

While Fitts’ three-phase characterization of skill acquisition has often been discussed as a

three-stage theory, Fitts did not originally intend it as a stage characterization: As Fitts (1964)

wrote:

“it is misleading to assume distinct stages in skill learning. Instead, we should think of

gradual shifts in the factor structure of skills” (p. 261).

Nonetheless, even in Fitts’ original paper he described skills as being in a particular phase and

his characterization has since been used with the understanding that a skill can be treated as in

one of the phases at most points in the course of learning. With complex skills, such phase

assignments probably only make sense as characterizations of parts of the skill. For instance, in

Space Fortress the actual hitting of keyboard keys is highly automated at the beginning of the

game while the decision making about what keys to hit is slow.

Following the 1982 ACT* characterization (Anderson, 1982), one might map this ACT-R model onto Fitts and Posner's three-phase theory as follows: The interpretation of declarative instruction corresponds to the Cognitive phase. Production compilation, which produces direct actions, corresponds to the Associative phase. Control Tuning, which produces the high level of performance, corresponds to the Autonomous phase. However, upon closer inspection it turns out that this mapping does not work, and what is happening in the model is closer to what is suggested by the above quote from Fitts.

The three different processes (knowledge interpretation, production compilation, and tuning) progress in parallel for each bit of knowledge that composes part of the skill in the ACT-R model. For instance, consider the highlighted path in Figure 3 that indicates when to execute a clockwise turn to maintain aim. Collapsing the instruction interpretation along this path into a direct action mapping takes a number of steps. At any point in time some parts may be interpreted and some parts may be compiled. Also, the choices for the control variables of distance from border and aim can be at different levels of tuning. The actual control values chosen will jump around over time as part of the exploration process, only slowly converging reliably on optimal values.

It is also misleading to think of compilation occurring before tuning; rather, they are happening simultaneously. The Compiled-Tuning version in Figure 13 does reach asymptote a bit more slowly than the Compiling-Tuned option, but this could be changed by varying the production-learning rates or the initial temperatures. Both of these versions converge faster than the Compiling-Tuning version because the optimal control variable settings change as production compilation advances. For instance, as one comes to act faster, one needs to increase the number of time ticks to wait between shots to avoid resetting the vulnerability counter.

The learning in ACT-R nicely instantiates Fitts’ description of “gradual shifts in the

factor structure of skills”. While there is not a truly satisfying mapping of the growth of skills

onto his phases, the best choices are probably

1. The Cognitive phase corresponds to the period of time before the model is achieving positive

points, which is typically midway through the first game. This is the period when the model

is laboriously retrieving and executing declarative instructions.

2. The Associative phase is the period of time when production compilation and control tuning

are converging on a final structure.

3. The Autonomous phase would be when the model reaches asymptotic performance.

Note that under this characterization the Autonomous phase is when the model's learning is complete. The models and the subjects do come close to asymptoting after 20 games in AutoTurn but not in YouTurn (Figure 9). The model takes about 60 games to asymptote in YouTurn (Compiling-Tuning in Figure 13), and subjects might also have reached an asymptotic score had they played three times as many games. Subjects did appear to asymptote after this much practice in the Fortress-Only condition of Anderson et al. (2011), which is similar to YouTurn.

However, that condition and YouTurn are rather simple compared to many tasks including the

original Space Fortress. Improvement in the original version of Space Fortress continues for as

long as 31 hours (Destefano & Gray, 2016). Interestingly, as Destefano and Gray describe, part

of continued improvement involves discovering strategies late in the game that go beyond what

is instructed. Players deliberately try to apply what they have discovered about the game much

like they did with the initial instructions. This suggests that some of that late learning is really

declarative, the kind of learning that one might have thought was in the Cognitive Phase.

Motor learning is another potential kind of learning that might continue beyond where the current models asymptote. ACT-R hits individual keys rather than chording keys, where pairs of keys are hit together. This means that the lower bound on interkey times is about 100 msec. While this is also true of initial subject behavior, some subjects did start to simultaneously press two keys (e.g., thrust and shoot), something that the model never comes to do.

The fact that there are three phases and there are three critical aspects to our theory (instruction interpretation, production compilation, and Control Tuning) is just a coincidence. There are other learning processes beyond those modeled, both high-level, like strategy discovery, and low-level, like chording. However, the three aspects modeled are critical aspects of human skill acquisition, and they appear to be the major drivers of learning in games like Space Fortress. More generally, they are the major drivers of learning in tasks that involve receiving a set of instructions and then perfecting the instructed skills with experience. Many real-world tasks (and some video games) have other complexities, but they involve the kind of learning we have modeled as part of a larger goal structure. For instance, the mechanics of driving a car are typically a case of perfecting performance that starts with instruction, but this skill is embedded in a much larger goal structure of navigating familiar and unfamiliar terrain, obeying traffic laws, avoiding dangerous situations, and so on.

References

Anderson, J. R. (1982). Acquisition of cognitive skill. Psychological review, 89(4), 369.

Anderson, J. R. (2005) Human Symbol Manipulation within an Integrated Cognitive

Architecture, Cognitive Science, 29, 313-342.

Anderson, J. R. (2007). How Can the Human Mind Occur in the Physical Universe? New

York: Oxford University Press.

Anderson, J. R., Betts, S., Ferris, J. L., Fincham, J. M. (2010). Neural Imaging to Track

Mental States while Using an Intelligent Tutoring System. Proceedings of the National Academy

of Science, USA, 107, 7018-7023.

Anderson, J. R., Bothell, D., Byrne, M.D., Douglass, S., Lebiere, C., Qin, Y. (2004) An

integrated theory of Mind. Psychological Review, 111, 1036-1060.

Anderson, J. R., Bothell, D., Fincham, J. M., Anderson, A. R., Poole, B., & Qin, Y.

(2011). Brain regions engaged by part-and whole-task performance in a video game: a model-

based test of the decomposition hypothesis. Journal of cognitive neuroscience, 23(12), 3983-

3997.

Anderson, J.R., Bothell, D., Fincham, J.M., & Moon, J. (2016). The Sequential Structure of

Brain Activation Predicts Skill. Neuropsychologia, 81, 94-106.

Anderson, J. R., & Fincham, J. M. (2014). Extending problem-solving procedures

through reflection. Cognitive psychology, 74, 1-34.

Ackerman, P. L. (1988). Determinants of individual differences during skill acquisition:

Cognitive abilities and information processing. Journal of experimental psychology: General,

117(3), 288.

Bothell, D. (2010). Modeling Space Fortress: CMU effort. http://act-r.psy.cmu.edu/workshops/workshop-2010/schedule.html

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., & Zaremba, W. (2016). OpenAI Gym. arXiv preprint arXiv:1606.01540. http://arxiv.org/abs/1606.01540

Cohen, J. D., McClure, S. M., & Yu, A. J. (2007). Should I stay or should I go? How the human brain manages the trade-off between exploitation and exploration. Philosophical Transactions of the Royal Society of London B: Biological Sciences, 362(1481), 933-942.

Daw, N. D., O'Doherty, J. P., Dayan, P., Seymour, B., & Dolan, R. J. (2006). Cortical substrates for exploratory decisions in humans. Nature, 441(7095), 876.

Destefano, M. (2010). The mechanics of multitasking: The choreography of perception,

action, and cognition over 7.05 orders of magnitude (Doctoral dissertation, Rensselaer

Polytechnic Institute).

Destefano, M., & Gray, W. D. (2008). Progress report on Pygame Space Fortress. Troy,

NY: Rensselaer Polytechnic Institute.

Destefano, M., & Gray, W. D. (2016). Where should researchers look for strategy

discoveries during the acquisition of complex task performance? The case of space fortress. In

Proceedings of the 38th Annual Conference of the Cognitive Science Society (pp. 668-673)

Donchin, E. (1989). The learning-strategies project: Introductory remarks. Acta

Psychologica, 71(1-3), 1-15.

Elfwing, S., Uchibe, E., & Doya, K. (2017). Sigmoid-weighted linear units for neural

network function approximation in reinforcement learning. arXiv preprint arXiv:1702.03118.

Fitts, P. M. (1964). Perceptual-motor skill learning. Categories of human learning, 47,

381-391.

Fitts, P. M., & Posner, M. I. (1967). Human performance. Belmont, CA: Brooks/Cole.

Frederiksen, J. R., & White, B. Y. (1989). An approach to training based upon principled

task decomposition. Acta Psychologica, 71(1-3), 89-146.

Gauthier, I., Tarr, M., & Bub, D. (Eds.). (2010). Perceptual expertise: Bridging brain

and behavior. OUP USA.

Gopher, D., Weil, M., & Siegel, D. (1989). Practice under changing priorities: An

approach to training of complex skills. Acta Psychologica, 71, 147–178.

Gray, W. D. (2017). Game‐XP: Action Games as Experimental Paradigms for Cognitive

Science. Topics in Cognitive Science, 9(2), 289-307.

Kim, J. W., Ritter, F. E., & Koubek, R. J. (2013). An integrated theory for improved skill

acquisition and retention in the three phases of learning. Theoretical Issues in Ergonomics

Science, 14(1), 2

Kalish, M. L., Lewandowsky, S., & Kruschke, J. K. (2004). Population of linear experts:

knowledge partitioning and function learning. Psychological Review, 111(4), 1072.

Koh, K., & Meyer, D. E. (1991). Function learning: Induction of continuous stimulus-

response relations. Journal of Experimental Psychology: Learning, Memory, and Cognition,

17(5), 811.

Lebiere, C., Gonzalez, C., & Martin, M. K. (2007). Instance-based decision making

model of repeated binary choice. In Proceedings of the 8th international conference on cognitive

modeling (pp. 77–82), Ann Arbor, MI.

Lee, M. D., Zhang, S., Munro, M., & Steyvers, M. (2011). Psychological models of

human and optimal performance in bandit problems. Cognitive Systems Research, 12(2), 164-

174.

Lucas, C. G., Griffiths, T. L., Williams, J. J., & Kalish, M. L. (2015). A rational model of

function learning. Psychonomic bulletin & review, 22(5), 1193-1215.

McDaniel, M. A., & Busemeyer, J. R. (2005). The conceptual basis of function learning

and extrapolation: Comparison of rule-based and associative-based models. Psychonomic

bulletin & review, 12(1), 24-42.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. A. (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. http://arxiv.org/abs/1312.5602

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., ... & Petersen, S. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533.

Moon, J., & Anderson, J. R. (2013). Timing in multitasking: Memory contamination and

time pressure bias. Cognitive psychology, 67(1), 26-54.

Newell, A. (1990). Unified theories of cognition. Harvard University Press.

Plappert, M. (2016). keras-rl. GitHub repository. https://github.com/matthiasplappert/keras-rl

Rosenbaum, D. A., Carlson, R. A., & Gilmore, R. O. (2001). Acquisition of intellectual

and perceptual-motor skills. Annual review of psychology, 52(1), 453-470.

Sanner, S., Anderson, J. R., Lebiere, C., & Lovett, M. (2000). Achieving efficient and

cognitively plausible learning in backgammon. Proceedings of the Seventeenth International

Conference on Machine Learning (pp. 823-830). San Francisco: Morgan Kaufmann.

Shebilske, W. L., Volz, R. A., Gildea, K. M., Workman, J. W., Nanjanath, M., Cao, S., &

Whetzel, J. (2005). Revised Space Fortress: a validation study. Behavior research methods,

37(4), 591-601.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., ... &

Dieleman, S. (2016). Mastering the game of Go with deep neural networks and tree search.

Nature, 529(7587), 484-489.

Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.

Taatgen, N. A. (2013). The nature and transfer of cognitive skills. Psychological review,

120(3), 439.

Taatgen, N. A., & Anderson, J. R. (2002). Why do children learn to say “broke”? A

model of learning the past tense without feedback. Cognition, 86(2), 123-155.

Taatgen, N. A., Huss, D., Dickison, D., & Anderson, J. R. (2008). The acquisition of

robust and flexible cognitive skills. Journal of Experimental Psychology: General, 137(3), 548.

Taatgen, N., & van Rijn, H. (2011). Traces of times past: representations of temporal

intervals in memory. Memory & cognition, 39(8), 1546-1560.

Taatgen, N. A., Van Rijn, H., & Anderson, J. (2007). An integrated theory of prospective

time interval estimation: The role of cognition, attention, and learning. Psychological Review,

114(3), 577.

Tessler, M. H., Goodman, N. D., & Frank, M. C. (2017). Avoiding frostbite: It helps to

learn from others. Behavioral and Brain Sciences, 40.

Tsividis, P. A., Pouncy, T., Xu, J. L., Tenenbaum, J. B., & Gershman, S. J. (2017). Human learning in Atari. In The AAAI 2017 Spring Symposium on Science of Intelligence: Computational Principles of Natural and Artificial Intelligence.

Wolpert, D. M., Diedrichsen, J., & Flanagan, J. R. (2011). Principles of sensorimotor learning. Nature Reviews Neuroscience, 12(12), 739.

VanLehn, K. (1996). Cognitive skill acquisition. Annual review of psychology, 47(1),

513-539.

Wilson, R. C., & Niv, Y. (2011). Inferring relevance in a changing world. Frontiers in

human neuroscience, 5.

Wolpert, D. M., Ghahramani, Z., & Jordan, M. I. (1995). An internal model for

sensorimotor integration. Science, 1880-1882.

Zhang, S., & Yu, J. A. (2013). Cheap but clever: human active learning in a bandit

setting. Ratio, 12(13), 14.

Zhu, P., Li, X., & Poupart, P. (2017). On Improving Deep Reinforcement Learning for

POMDPs. arXiv preprint arXiv:1704.07978.

Figure Captions

Figure 1. (a) The Space Fortress screen, showing the inner and outer hexagon, a missile

shot at the fortress, and a shell shot at the ship. The dotted lines illustrate an example path

during one game. (b) A schematic representation of critical values for shooting and flight control.

Figure 2. Growth in mean number of points earned per game. Error bars reflect one

standard deviation of the population.

Figure 3. A representation of the instructions in YouTurn. Each set of branching arrows

is an operator going from one state to one of the next states. Except for the bottom operators, they test conditions to determine the next state. The bottom operators perform actions. Note that for graphical clarity the same bottom state (e.g., CCW) is repeated wherever it appears.

Figure 4. The production test-dimension-success is shown on the right. On the left are the color-coded contents of the buffers at one point when it matches: the retrieval buffer in green, the goal buffer in red, and the visual buffer in blue.

Figure 5. On the right is the production resulting from compiling successive firings of productions such as test-dimension-success (Figure 4). On the left are the buffer contents to which it would match. In the middle is the operator retrieval that is now built into the compiled production.

Figure 6. A direct action production which is the result of compiling the highlighted path

of operators in Figure 3.

Figure 7. An example of estimating a payoff function from the first game in Space Fortress. Each dot reflects the mean payoff for a sampled value. The function is the best-fitting quadratic to these points, weighted by the duration of time over which they are estimated.

Figure 8. (a) The quadratic functions estimated for different delays at different points in one sequence of model runs through 20 learning games. (b) The probability of choosing different delays after different numbers of games.

Figure 9. A comparison of subject and model performance: (a) and (b) Growth in mean

number of points earned per game in AutoTurn and YouTurn. Error bars reflect one standard

deviation of the population. (c) and (d) Relationship that emerges between average number of

keys a subject or model hits over games and their average score over games. Each dot is a

subject or model.

Figure 10. A comparison of subjects and models on two critical factors determining

score. (a) and (b) growth in mean number of kills over games. (c) and (d) decrease in mean

number of deaths over games. Error bars reflect one standard deviation of the population.

Figure 11. Shot Timing in AutoTurn: (a) Mean intershot duration per game for subjects

and models; (b) Mean number of resets per game for subjects and models. (c) Probability of

choosing different time thresholds, averaged over the 100 models. Error bars in (a) and (b)

reflect one standard deviation of the population.

Figure 12. Aiming in YouTurn: (a) Mean offset of aim at shot per game for subjects and

models; (b) Mean proportion of misses per game for subjects and models. (c) Probability of

choosing different maximum aim offsets, averaged over the 100 models. Error bars in (a) and (b) reflect one standard deviation of the population.

Figure 13. Performance of various learning options in YouTurn.

Figure 14. Performance of the model against various versions created by changing basic

features of the model and architecture. See text for discussion.

Figure D1. Distance for Thrusts in AutoTurn: (a) Mean distance from the fortress at which subjects and models chose to thrust; (b) Probability of choosing different time thresholds, averaged over 50 models; (c) Probability of choosing different distance thresholds, averaged over 50 models.

Figure D2. Thrust angle in YouTurn: (a) Mean thrust angle per game for subjects and

models; (b) Mean speed of ship for subjects and models. (c) Probability of choosing different

thrust angles, averaged over the 100 models. Error bars in (a) and (b) reflect one standard

deviation of the population.

Figure D3. Distribution of ship deaths in AutoTurn (a) and YouTurn (b).

Appendix A: Space Fortress Implementation

There have been many implementations of Space Fortress over the decades. The

development prior to the Pygame version is described in Shebilske et al. (2005) and the original

Pygame implementation is described in Destefano (2010). The most consequential change in the

Pygame implementation over previous implementations was replacing the use of the joystick for

navigation by key presses. We considerably simplified the game from the Pygame

implementation. We eliminated mines and bonus points, both of which were aspects of the

original game. We also made hitting the hexagon boundaries fatal rather than just a loss of

control points. We also simplified the death rules so that being shot or hitting the boundary

immediately resulted in death rather than building up towards death. Here is the pointer to web

versions of the games subjects played, which contain the instructions:

http://andersonlab.net/sf-paper-2017/turning.html

To review the critical parameters of the game (collected into the code sketch after this list):

1. The game updates every 30th of a second (game tick).

2. The total screen is 710 pixels wide by 626 pixels high.

3. The fortress is centered at x=355, y=315.

4. The ship starts at x=245, y=315, moving up .5 pixel and to the right .866 pixels per game

tick. Thus, its initial speed is 1 pixel per game tick.

5. The ship starts aimed horizontally at the fortress. The ship automatically turns to maintain aim in AutoTurn, but subjects must turn the ship in YouTurn.

6. Every game tick that the thrust key is held down, the velocity vector changes by .3 pixels per tick in whatever direction the ship is pointing.

7. Every game tick that a turn key is held down, the ship will turn 6 degrees.

8. If the ship does not traverse 10 degrees of arc around the fortress in a second the fortress

will shoot a shell at it.

9. For purposes of determining when the ship or fortress is hit, everything is treated as a

circle:

a. The ship has a radius of 10 pixels,

b. The fortress has a radius 18 pixels,

c. Shells have a radius of 3 pixels,

d. Missiles have a radius of 5 pixels.

10. A collision occurs if two circles intersect on a game tick.

11. Shells travel at 6 pixels per tick and missiles travel at 20 pixels per tick.
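For readers who want to reimplement the simplified game, the parameters above translate directly into code. The following Python sketch is ours, not the actual game implementation; the names are arbitrary, and only the numbers and the circle-intersection collision rule come from the list above.

    import math

    # Game parameters from Appendix A (units: pixels and game ticks of 1/30 s).
    TICKS_PER_SECOND = 30
    SCREEN_W, SCREEN_H = 710, 626
    FORTRESS_POS = (355, 315)
    SHIP_START_POS = (245, 315)
    SHIP_START_VEL = (0.866, -0.5)      # ~1 pixel/tick; the sign convention is an assumption
    THRUST_DELTA = 0.3                  # velocity change per tick the thrust key is held
    TURN_RATE_DEG = 6                   # degrees per tick a turn key is held
    SHIP_RADIUS, FORTRESS_RADIUS = 10, 18
    SHELL_RADIUS, MISSILE_RADIUS = 3, 5
    SHELL_SPEED, MISSILE_SPEED = 6, 20  # pixels per tick

    def collided(pos_a, radius_a, pos_b, radius_b):
        """Everything is a circle: a collision occurs when two circles intersect on a tick."""
        return math.hypot(pos_a[0] - pos_b[0], pos_a[1] - pos_b[1]) <= radius_a + radius_b

    # Example: a missile 15 pixels from the fortress center has hit it (15 <= 5 + 18).
    print(collided((340, 315), MISSILE_RADIUS, FORTRESS_POS, FORTRESS_RADIUS))  # True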

Appendix B: Deep Reinforcement Learning in Space Fortress

The theory of reinforcement learning (RL), rooted deeply in the neuropsychological

perspectives of animal and human behavior, provides a normative account for how agents

optimize their behavior in different environments (Sutton and Barto, 1998). In recent years there

have been many advances in the field of deep learning, which have made it possible for deep RL

agents to achieve human-level performance in a number of complex task environments. One of

the most successful applications of deep learning to the RL problem is the work of Mnih et al. (2013), who created a convolutional neural network (CNN) trained with a variant of Q-learning, called a Deep Q-Network (DQN). With the only prior knowledge being the available actions, and using the same network architecture and hyperparameters, the DQN agent was able to learn to play a wide range of Atari 2600 video games with just raw pixels and a reward signal as input.

The games that the DQN agent excels at vary greatly in genre and visual appearance, from side-scrolling shooters (e.g., River Raid) to sports games (e.g., Boxing) and 3D

auto racing games (e.g. Enduro). The authors argue that for certain games the agent is able to

discover relatively long-term strategies. For example, in the game Breakout, the DQN agent is

able to learn to dig a tunnel along the wall, which allows the ball to be sent around the back to

destroy a large number of blocks. Nevertheless, the DQN agent struggles with games that require

temporally extended rules/strategies and games that require higher than first-order control. For

example, the DQN agent was only able to score 13% of what a professional player can score in

Ms. Pac-Man, a game that requires continual path planning and obstacle avoidance. Similarly,

the DQN agent only achieved 7% of the score of a professional player in Asteroids, a game that

requires second-order control (acceleration) of a spaceship in a low-friction environment. Of the

49 Atari 2600 games tested by Mnih et al. (2015), Montezuma's Revenge was easily the most difficult game for the DQN agent to learn. Montezuma's Revenge is a non-linear exploration game that requires both temporally extended strategies and higher-order control. For example, in Montezuma's Revenge players must move from room to room in an underground labyrinth, searching for keys (to open doors) and equipment (such as torches, swords, and amulets) while attempting to avoid or defeat the challenges in the player's path. The DQN agent was not able to

score any points in this game.

It is not obvious whether the DQN agent's poor performance on certain games results from a failure to extract useful features from the raw pixel input, from an inability to learn a good control policy (assuming good feature extraction), or from a combination of both. The modeling efforts below are an attempt to address this question. Space Fortress and its variants share many features of the Atari games that the DQN agent struggles with. For example, in Space Fortress the player controls a ship in a frictionless environment, which is analogous to (but more difficult than) flying the ship in the Atari game Asteroids. Also, much like Montezuma's Revenge, Space Fortress requires temporally extended strategies, such as building the fortress vulnerability up to 10 before killing it. Since the preceding ACT-R modeling work has already identified a set of visual features in Space Fortress capable of producing human-like learning and performance, Space Fortress is an ideal environment in which to evaluate the learning limitations of the DQN agent.

Methods

In order to make Space Fortress more accessible to deep reinforcement learning research, an OpenAI Gym (Brockman et al., 2016) interface for both the AutoTurn and YouTurn variants of Space Fortress was created. The Space Fortress OpenAI Gym source is available on Bitbucket (https://bitbucket.org/andersonlab/c-spacefortress/branch/package), and prebuilt packages will soon be available on PyPI. The OpenAI Gym interface provides RL agents with two state observation formats: (1) a raw image/pixel-based representation of the game state in the form of a 126x142 RGB array, and (2) a hand-crafted feature vector (of length 15) containing all of the information used by the best performing ACT-R models. Both environments share 3 discrete actions: NOOP, FIRE, and THRUST. The YouTurn environment supports 2 additional discrete actions: LTURN and RTURN.
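A minimal interaction loop with this interface might look like the following sketch. The package import and the environment id are hypothetical placeholders (the actual registration names live in the Bitbucket repository); only the observation formats and discrete action sets follow the description above.

    import gym
    import spacefortress  # hypothetical package name that registers the environments

    # Hypothetical id for the feature-vector AutoTurn variant
    # (length-15 observation; discrete actions NOOP, FIRE, THRUST).
    env = gym.make("SpaceFortress-autoturn-features-v0")

    obs = env.reset()
    done, total_reward = False, 0.0
    while not done:
        action = env.action_space.sample()          # a random agent, for illustration
        obs, reward, done, info = env.step(action)  # classic Gym 4-tuple step API
        total_reward += reward
    print("score:", total_reward)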

Further facilitating the introduction of Space Fortress to the deep reinforcement learning community, a keras-rl (Plappert, 2016) implementation of the original Mnih et al. (2013) DQN agent was trained and tested in both the AutoTurn and YouTurn Space Fortress gym environments. For these baseline tests, every attempt was made to replicate the model parameters, hyperparameters, and experimental setup of the Mnih et al. (2015) paper. However, since Space Fortress runs at 30 frames per second, half the rate at which the Atari system runs, the agents tested against Space Fortress used frame skip and memory skip values of 2, half the values used in the agents that were tested against Atari games. Additionally, since it is not a goal of this work to assess the ability of a CNN to extract features relevant to Space Fortress game play from raw pixels, the DQN agents tested in the Space Fortress environments used the hand-crafted feature vector state representation as input. Consequently, instead of a complex CNN to estimate the Q-values of the state-action pairs, a simple multilayer perceptron (MLP) with 3 layers of 64 hidden units was used. Because of the different numbers of actions in AutoTurn and YouTurn Space Fortress, the MLPs used by the agents had 3 and 5 outputs respectively.

Both the AutoTurn and YouTurn DQN agents were trained for 10,000 games, which, with a frame skip of 2, equates to 26,470,000 agent steps. The replay memory of the DQN agent was configured with a capacity of 1 million steps (about 377 games with a frame skip of 2). The exploration rate of the ε-greedy control policy was annealed from ε=1.00 to ε=0.10 over the first 1000 games and fixed at ε=0.10 thereafter. After each epoch of 100 games, training was paused and a single test game was played with an exploration rate of ε=0.05. Consistent with Mnih et al. (2015), reward was clipped to the range -1 to 1 and the RMSProp optimizer was used with minibatches of size 32. For reference purposes, two additional agents were created and tested: a random agent and a perfect/cheater agent. The random agent was implemented as the DQN agent with its exploration rate fixed at ε=1.00. The cheater agent algorithmically implements a "shoot 12 then thrust" strategy with no constraints on how quickly actions can be made.
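A rough keras-rl configuration matching the setup just described is sketched below. It is a sketch under the stated hyperparameters (a 3x64 MLP, a replay memory of 1 million steps, ε annealed from 1.0 to 0.1 and set to 0.05 at test, minibatches of 32, RMSprop), not the authors' exact training script; the environment id is again a hypothetical placeholder, and details such as reward clipping and the frame/memory skip of 2 are omitted.

    import gym
    from keras.models import Sequential
    from keras.layers import Dense, Flatten
    from keras.optimizers import RMSprop
    from rl.agents.dqn import DQNAgent
    from rl.memory import SequentialMemory
    from rl.policy import LinearAnnealedPolicy, EpsGreedyQPolicy

    env = gym.make("SpaceFortress-youturn-features-v0")   # hypothetical id
    nb_actions = env.action_space.n                       # 3 in AutoTurn, 5 in YouTurn

    # A simple MLP replaces the DQN's convolutional network: 3 layers of 64 hidden units.
    model = Sequential([
        Flatten(input_shape=(1,) + env.observation_space.shape),
        Dense(64, activation="relu"),
        Dense(64, activation="relu"),
        Dense(64, activation="relu"),
        Dense(nb_actions, activation="linear"),
    ])

    memory = SequentialMemory(limit=1000000, window_length=1)      # 1 million steps
    policy = LinearAnnealedPolicy(EpsGreedyQPolicy(), attr="eps",
                                  value_max=1.0, value_min=0.1, value_test=0.05,
                                  nb_steps=2647000)                # roughly the first 1000 games

    agent = DQNAgent(model=model, nb_actions=nb_actions, memory=memory,
                     policy=policy, batch_size=32,
                     nb_steps_warmup=50000)                        # warm-up length assumed
    agent.compile(RMSprop(lr=0.00025))                             # learning rate as in Mnih et al. (2015)
    agent.fit(env, nb_steps=26470000, visualize=False, verbose=1)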

Results

Table B1 shows how the DQN agents compare to the performance of other agents across

5 critical performance measures: raw points, deaths, kills, resets and shot efficiency (the average

number of shots needed to kill a fortress). The human and ACT-R values in the table are for the

best performing subject and ACT-R model in each condition. While the performance of the AutoTurn and YouTurn DQN agents was much better than that of the purely random agent, their performance was nowhere near that of humans or the ACT-R agents. As a way to more directly compare the performance of the DQN agents to that of the ACT-R agents, performance can be normalized and expressed as a percentage using the equation 100 × (targetScore − randomScore) / (humanScore − randomScore) (Mnih et al., 2015). Using the values from Table B1, the normalized performance scores of the ACT-R agents in the AutoTurn and YouTurn environments were 97.89% and 95.24% respectively. Likewise, the normalized performance scores of the DQN agents were only 66.64% and 38.62% respectively. Perfect performance (as

produced by the cheater agent) results in normalized performance scores of 107.70% and

121.29% in the AutoTurn and YouTurn environments respectively. To see whether the poor performance was related to the MLP configuration, a few alternative MLP configurations were tested, including adding up to 5 hidden layers and up to 256 hidden units per layer. None of these
changes improved the performance of the DQN agents.
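To make the normalization concrete, the helper below implements the percentage formula; the raw scores in the example are placeholders rather than the actual Table B1 values.

    def normalized_score(target, random_score, human_score):
        """Express a score as a percentage of the human-random gap (Mnih et al., 2015)."""
        return 100.0 * (target - random_score) / (human_score - random_score)

    # Placeholder raw scores only (Table B1 holds the real ones):
    # random = -500, best human = 2000, agent = 1200 -> 68% of the human-random gap.
    print(normalized_score(1200, -500, 2000))  # 68.0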

While the DQN agent develops some flight skills, as indicated by the fewer deaths compared to the random agents, the DQN agents fail to learn the actual rules of the game necessary to deliberately kill the fortress. It is probably fair to say that the DQN agents never learn that the fortress vulnerability needs to be raised to 10 before it can be killed or that killing the fortress requires a different shot pacing than raising its vulnerability. A perfect agent (like the cheater agent represented in Table B1) only needs 12 shots to kill a fortress. Humans, because they know the rules before playing their first game, have a shot efficiency very close to 12. However, the DQN agents typically raise the vulnerability well above 10 before getting a kill. Their ability to get kills at all can only be explained by the stochastic nature of the agents. Further supporting this explanation is the counterintuitive fact that the random YouTurn agent scores more points (by getting more kills) than the random AutoTurn agent. Since the random AutoTurn agent is always perfectly aimed at the fortress, it resets often, rarely allowing the vulnerability to get high enough for a chance at a kill shot. In contrast, the random YouTurn agent naturally shoots less often (and thus resets less often) because it has more actions to pick from. The random YouTurn agent also misses constantly, further reducing the frequency of resets. These two factors combined give the random YouTurn agent more kill opportunities and, in the end, more random kills than the random AutoTurn agent.

Appendix C: Comparison of All Instructed Subjects with Uninstructed Subjects

We ran 9 pilot Mechanical Turk subjects without instruction in AutoTurn and 11 in YouTurn. AutoTurn subjects were only told that the relevant keys were W and spacebar, while YouTurn subjects were told that the relevant keys were A, W, D, and spacebar. The conversion of points earned to bonus payment was doubled. For comparison we will use all subjects who were run with instruction, rather than just those filtered to show signs of learning, who were the focus in the main paper. Table C1 reports average values for the measures of performance reported in the paper and Appendix D. We have broken this out into performance during the first 10 games and the last 10 games.

In terms of mean number of points, uninstructed subjects clearly are showing learning (compare with the DQN agents in Appendix B), but they are also scoring less than instructed subjects (first half AutoTurn, t(59) = 4.45, p < .0001; second half AutoTurn, t(59) = 2.13, p < .05; first half YouTurn, t(74) = 2.52, p < .05; second half YouTurn, t(74) = 1.24, p > .1; all 2-tailed). The individual variability in both populations is huge. Nonetheless, we can conclude that, without instructions, at least some game players can learn to play these games, but with greater difficulty.

Appendix D: More on Control Tuning

The main text provides information on the control values for aiming and shooting. This

appendix provides similar information for the other control values.

When to thrust. As discussed in the main text, the model chooses to thrust either according to how soon it will hit the outer hexagon or according to its distance from the outer hexagon. If using time, the range of the control values is between .5 and 5 seconds. If using distance, the range is between 10% and 90% of the way to the boundary. The longer the model waits to thrust, the greater its chance of completing a fortress kill; if it waits too long, it will crash into the outer border. Therefore, it treats fortress kills as positive feedback and crashes into the outer border as negative feedback.
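The tuning of such a control value can be sketched as a weighted curve fit of the sort shown in Figure 7: each sampled value accumulates a mean payoff from the events attributed to it, a quadratic payoff function is fit to the samples weighted by how long each value was in force, and choices are then made over that function. The numbers below are illustrative placeholders, not the model's actual payoffs or parameters.

    import numpy as np

    # Illustrative samples: (when-to-thrust value in seconds, mean payoff observed
    # while that value was in force, seconds of play over which it was estimated).
    samples = [(0.5, -0.8, 40.0), (1.5, 0.1, 55.0), (2.5, 0.6, 60.0),
               (3.5, 0.4, 50.0), (4.5, -0.5, 30.0)]
    values, payoffs, durations = (np.array(col) for col in zip(*samples))

    # Best-fitting quadratic, weighted by the duration over which each payoff was estimated.
    payoff_fn = np.poly1d(np.polyfit(values, payoffs, deg=2, w=durations))

    candidates = np.linspace(0.5, 5.0, 10)
    print(candidates[np.argmax(payoff_fn(candidates))])  # current best-looking value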

As is the case for shot timing, good data on when thrust decisions are made can be collected in AutoTurn because there is never any need for preliminary turning. Figure D1a displays the mean distance at which subjects and models chose to thrust. The values are quite similar, although there is a bit of a downward drift in the average subject distance not seen in the average model distance. Figures D1b and D1c show the average probability of the choices that determine when the model chooses to thrust.

Thrust time.10 The effect of a thrust increases linearly with the duration that the thrust key is depressed (a vector of .3 pixels per tick in the direction the ship is pointing is added each game tick, which is a 30th of a second). The model searches between 67 msec. (2 game ticks, the typical time associated with a tap) and 167 msec. (5 game ticks, a noticeable duration of sustained pressing). We should note that while the model may choose to press for a certain period, motor noise and asynchrony between the model and the game will result in the thrust key being depressed for a duration somewhat different from the selected time. A thrust that is very long may result in driving the ship into the inner hex. A thrust that is too short may result in a slow ship speed and the ship being shot by the fortress. Therefore, the tracker monitors these two events (hitting the inner hex and being shot by the fortress) and seeks a thrust duration that minimizes both. Subjects depress the thrust key for an average of 112 msec. in AutoTurn and 134 msec. in YouTurn. The models average 115 msec. in both versions of the game.

Thrust angle (see Figure 1b). The instructions indicate that, to keep speed under control when thrusting, subjects should point their ship 90 degrees or more from the direction the ship is moving. The model searches for a good angle between 85 and 140 degrees. The sign of a good thrust angle is that the resulting ship speed stays within the good range (30 to 50 pixels per second); the sign of a bad thrust angle is a speed that is slower or faster. Therefore, the model evaluates thrust angle in terms of the speed of the ship after the thrust, treating good speeds as positive feedback and bad speeds as negative feedback.

10 This control parameter is not very consequential in the two versions of the game we report, but it did prove consequential in another version of the game where subjects must fly narrow paths.

Subjects are told they can adjust the speed of the ship by adjusting the thrust angle. Correspondingly, how the model treats the thrust angle it has chosen as its control value depends on the speed of the ship. If it is flying at a good speed (1 to 1.7 pixels per game tick), it wants a thrust angle within 15 degrees of the control value; if it needs to turn, it will turn to get a thrust angle close to the control value (it cannot produce an exact match to the control value given its limitations and the game implementation). If it is flying too slowly, it wants a thrust angle less than the control value; if the actual thrust angle is more, it will turn to get close to 15 degrees less than the control value. If it is flying too fast, it wants a thrust angle more than the control value; if the actual thrust angle is less, it will turn to get close to 15 degrees more than the control value. As a reference for what a plausible thrust angle should be: if the thrust key were held down long enough to add as much velocity in the direction of the thrust angle as the current velocity, a thrust angle of 120 degrees would maintain the current speed. The average velocity added by a thrust is just a little less than the average ship speed, implying an average thrust angle a little less than 120 degrees.
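The 120-degree reference point is just vector addition, and the claim can be checked numerically. In the sketch below, the press duration (about 115 msec., i.e., roughly 3.5 game ticks at .3 pixels per tick) comes from the text, and 40 pixels per second is used as a representative speed from the middle of the good range; the rest is geometry.

    import math

    def new_speed(speed, added, thrust_angle_deg):
        """Speed after adding a thrust vector at the given angle from the direction of motion."""
        theta = math.radians(thrust_angle_deg)
        return math.sqrt(speed**2 + added**2 + 2 * speed * added * math.cos(theta))

    # If the added velocity equals the current speed, a 120-degree thrust leaves speed unchanged.
    print(round(new_speed(40.0, 40.0, 120.0), 3))                  # 40.0

    # A ~115 msec. press adds roughly 3.5 ticks * 0.3 pixels/tick * 30 ticks/s ~ 31 pixels/s,
    # so the speed-preserving angle is a little under 120 degrees.
    added = 3.5 * 0.3 * 30
    print(round(math.degrees(math.acos(-added / (2 * 40.0))), 1))  # ~113 degrees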

Figure D2 compares subjects and models with respect to thrust angle in YouTurn. Part (a) shows that the average thrust angle of both subjects and models increases a little over the games, to about 100 degrees. Part (b) shows that both converge to an average speed of about 40 pixels per second. Part (c) illustrates how the models' choices for a control value converge over time. Note that although the preferred control value is over 110 degrees by the end, the actual angles at which the ship thrusts are just over 100 degrees. Aiming often leaves the ship a little under the chosen thrust angle, but not by enough to provoke a turn.

Deaths. The frequencies of different types of death reflect the consequences of decisions

about when to thrust, how long to thrust, and thrust angle. Figure D3 shows the relative

frequency of ship deaths in the two versions of the game. The scale is doubled for YouTurn,

reflecting the greater number of deaths. The major cause of death in AutoTurn is being shot by the fortress, both for subjects and models. In the case of YouTurn, there is a large difference between models and subjects in the number of cases of being shot with a shell from the fortress. Being shot by the fortress is the major cause of death for subjects, while running into the outer hex is the major cause of death for the models. The high rate of subject deaths by shell reflects

the greater difficulty subjects have maintaining a good speed. Their average within-game

variability in speed is about 50% more than the models’ average variability. This means that

while they average the same speed as the model over games (Figure D2b), they are more likely

to fall into slow lapses where the fortress can shoot them. Nonetheless, as indicated by the one

standard deviation error bars, there is substantial overlap between subjects and models.

