JOURNAL OF INTELLIGENT TRANSPORTATION SYSTEMS
2019, VOL. 23, NO. 4, 319–331
https://doi.org/10.1080/15472450.2018.1491003

Asynchronous n-step Q-learning adaptive traffic signal control


Wade Genders and Saiedeh Razavi
Department of Civil Engineering, McMaster University, Hamilton, Ontario, Canada

ABSTRACT
Ensuring transportation systems are efficient is a priority for modern society. Intersection traffic signal control can be modeled as a sequential decision-making problem. To learn how to make the best decisions, we apply reinforcement learning techniques with function approximation to train an adaptive traffic signal controller. We use the asynchronous n-step Q-learning algorithm with a two hidden layer artificial neural network as our reinforcement learning agent. A dynamic, stochastic rush hour simulation is developed to test the agent's performance. Compared against traditional loop detector actuated and linear Q-learning traffic signal control methods, our reinforcement learning model develops a superior control policy, reducing mean total delay by up to 40% without compromising throughput. However, we find our proposed model slightly increases delay for left turning vehicles compared to the actuated controller, as a consequence of the reward function, highlighting the need for an appropriate reward function which truly develops the desired policy.

ARTICLE HISTORY
Received 21 October 2017; Revised 13 June 2018; Accepted 16 June 2018

KEYWORDS
Artificial intelligence; intelligent transportation systems; neural networks; reinforcement learning; traffic signal controllers

1. Introduction

Society relies on its many transportation systems for the movement of individuals, goods, and services. Ensuring vehicles can move efficiently from their origin to destination is desirable by all. However, increasing population, and subsequent vehicle ownership, has increased the demand on road infrastructure, often beyond its capacity, resulting in congestion, travel delays, and unnecessary vehicle emissions. To address this problem, two types of solutions are possible. The first is to increase capacity by expanding road infrastructure; however, this can be expensive, protracted, and decrease capacity in the short term (i.e. parts of the existing infrastructure are often restricted to traffic while construction occurs). The second solution is to increase the efficiency of existing infrastructure and the systems that govern them, such as traffic signal controllers (TSC). We advocate this second solution, by utilizing innovations from the domain of artificial intelligence (Mnih et al., 2015, 2016) to develop an adaptive traffic signal controller.

Developing a traffic signal controller means attempting to solve the traffic signal control problem (i.e. deciding what traffic phases should be enacted at any given time). Many solutions have been proposed to solve the traffic signal control problem, differing in complexity and assumptions made. The traffic signal control problem can be modeled as a sequential decision-making problem. Reinforcement learning, a machine learning paradigm, provides a framework for developing solutions to sequential decision-making problems. A reinforcement learning agent learns how to act (i.e. control the intersection phases) in an uncertain environment (i.e. stochastic vehicle dynamics) to achieve a goal (i.e. maximize a numerical reward signal, such as throughput). Rewards $r$ are quantitative measures of how well the agent is achieving its goal. Described mathematically, given a discount factor $\gamma \in [0, 1)$, the agent seeks to maximize the sum of discounted future rewards $G_t = \sum_{i=0}^{\infty} \gamma^{i} r_{t+i}$. Formally, the agent develops a policy $\pi$ which maps states to actions, $\pi(s) = a$, in an attempt to maximize the sum of discounted future rewards. To achieve optimal control, we develop an optimal state-action policy $\pi^*$. We use value-based reinforcement learning to estimate an action-value function, which tells us how "good" any action is in a given state based on the agent's prior experiences.


We derive the optimal policy by developing an optimal action-value function $Q^*(s, a)$, defined in Equation (1):

$Q^*(s, a) = \max_{\pi} \mathbb{E}[G_t \mid s = s_t, a = a_t, \pi]$ (1)

Q-learning (Watkins & Dayan, 1992), a type of temporal difference learning (Sutton, 1988), is used to estimate the optimal action-value function through repeated environment interactions, composed of the agent, at time t, observing the state of the environment $s_t$, choosing an action $a_t$, receiving feedback in the form of a reward $r_t$ and observing a new state $s_{t+1}$. Accurately estimating an action's value allows the agent to select the action which will yield the most long-term reward.

However, traditional reinforcement learning can suffer from the curse of dimensionality, where developing tabular solutions becomes intractable due to a combinatorial explosion of the state-action space. This problem can be alleviated by combining traditional reinforcement learning with function approximators and supervised learning techniques. We approximate the optimal action-value function using an artificial neural network, defined in Equation (2):

$Q^*(s, a) \approx Q(s, a; \theta)$ (2)

We develop the parameters $\theta$ (weights between neurons) of the neural network action-value function using a variant of the deep Q-network (DQN) algorithm (Mnih et al., 2015), asynchronous n-step Q-learning (Mnih et al., 2016). Once sufficiently approximated, the optimal policy can be derived by selecting the action with the highest value using the developed action-value function, defined in Equation (3):

$\pi^*(s) = \operatorname{argmax}_a Q(s, a; \theta)$ (3)
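As a minimal illustration of Equation (3), the greedy policy simply takes the argmax over the approximate action-values; the Q-value numbers below are purely hypothetical:

    import numpy as np

    # Hypothetical Q(s, a; theta) estimates for one observed state and the
    # four green-phase actions defined later in the paper.
    ACTIONS = ["NSG", "EWG", "NSLG", "EWLG"]
    q_values = np.array([-4.2, -1.3, -7.9, -6.1])  # illustrative numbers only

    def greedy_action(q):
        """Equation (3): pi*(s) = argmax_a Q(s, a; theta)."""
        return ACTIONS[int(np.argmax(q))]

    print(greedy_action(q_values))  # -> "EWG", the highest-valued action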
Using the microtraffic simulator SUMO (Krajzewicz, Erdmann, Behrisch, & Bieker, 2012) as the environment, we ultimately produce an agent that can control the traffic phases, termed an n-step Q-network traffic signal controller (nQN-TSC), illustrated in Figure 1.

Figure 1. nQN-TSC model. First, the density and queue state $s_t$ is observed by the agent, which is used as input to the neural network [Fully Connected (FC) layer of 42 neurons connected to another FC layer of 42 neurons], which produces a column vector of real numbers, each representing a different action's value. Solid green arrows represent protected movements and white arrows represent permissive movements.

Other researchers have long noted reinforcement learning's potential in solving the traffic signal control problem and have applied it with varying degrees of success (Balaji, German, & Srinivasan, 2010; El-Tantawy, Abdulhai, & Abdelgawad, 2013; Prashanth & Bhatnagar, 2011). Recent deep reinforcement learning techniques, successful in other domains, have provided further improvements over traditional reinforcement learning traffic signal control models (Casas, 2017; Li, Lv, & Wang, 2016; Mousavi, Schukat, Corcoran, & Howley, 2017; Rijken, 2015; van der Pol, 2016). However, we observed subtle deficiencies in previous work which we address in this paper; specifically, the traffic signal control problem has been modeled too simply (discussed in Section 2. Literature review) and must be confronted authentically. We argue these simplifications have created uncertainty as to whether reinforcement learning can truly be a contending solution for the traffic signal control problem compared against established solutions.

2. Literature review

Intersection traffic signal control methods have been studied for decades. Traditionally, traffic signal phases are displayed in an ordered sequence to produce a cycle. If the phase durations and sequence are static, a fixed cycle traffic signal controller is produced and can be classified as a nonadaptive form of traffic signal control. Nonadaptive signal control methods are stable and relatively simple to create, yet may not be optimal as they are not influenced by current traffic conditions. With the advent of sensors (e.g. loop detectors, radar, cameras) at intersections, adaptive traffic signal controllers became possible. Adaptive traffic signal controllers can vary phase duration or phase sequence, producing dynamic phase durations and acyclic phase sequences. Adaptive methods can potentially improve performance as they react to current traffic conditions, but can be unstable, complex, and difficult to develop.

We propose reinforcement learning as the appropriate method for traffic signal control because it offers specific advantages compared to other solutions. First, reinforcement learning utilizes the structure of the considered problem, observing what actions in certain states achieve reward (Sutton & Barto, 1998).

This is in contrast to other optimization techniques, such as evolutionary methods (e.g. genetic algorithms; Lee, Abdulhai, Shalaby, & Chung, 2005), which ignore the information yielded from individual environment interactions. Second, Q-learning is a model-free reinforcement learning algorithm, requiring no model of the environment once it has been sufficiently trained. Contrasted with many traffic signal control optimization systems (e.g. SCOOT – Hunt, Robertson, Bretherton, & Winton, 1981; SCATS – Lowrie, 1990; RHODES – Mirchandani & Head, 2001) which require traffic models, or control theory approaches (Gregoire, Qian, Frazzoli, De La Fortelle, & Wongpiromsarn, 2015; Timotheou, Panayiotou, & Polycarpou, 2015; Wongpiromsarn, Uthaicharoenpong, Wang, Frazzoli, & Wang, 2012) which use other models, such as backpressure, model-free reinforcement learning makes no assumptions about the model as no model is required. This arguably makes reinforcement learning more robust to traffic dynamics; model-based methods require that their models accurately reflect reality. As the disparity between the model and reality increases, the performance of the system suffers. Model-free reinforcement learning is parsimonious compared to model-based methods, requiring less information to function.

Significant research has been conducted using reinforcement learning for traffic signal control. Early efforts were limited by simple simulations and a lack of computational power (Abdulhai, Pringle, & Karakoulas, 2003; Brockfeld, Barlovic, Schadschneider, & Schreckenberg, 2001; Thorpe & Anderson, 1996; Wiering, 2000). Beginning in the early 2000s, continuous improvements in both of these areas have created a variety of simulation tools that are increasingly complex and realistic. Traffic microsimulators are the most popular tool used by traffic researchers, as they model individual vehicles as distinct entities and can reproduce real-world traffic behavior such as shockwaves. Research conducted has differed in reinforcement learning type, state space definition, action space definition, reward definition, simulator, traffic network geometry and vehicle generation model, and can be separated into traditional reinforcement learning and deep reinforcement learning. Previous traditional reinforcement learning traffic signal control research efforts have defined the state space as an aggregate traffic statistic, with the number of queued vehicles (Abdoos, Mozayani, & Bazzan, 2013; Abdulhai et al., 2003; Aziz, Zhu, & Ukkusuri, 2018; Chin, Bolong, Kiring, Yang, & Teo, 2011; Wiering, 2000) and traffic flow (Arel, Liu, Urbanik, & Kohls, 2010; Balaji et al., 2010) the most popular. The action space has been defined as all available signal phases (Arel et al., 2010; El-Tantawy et al., 2013) or restricted to green phases only (Abdoos et al., 2013; Balaji et al., 2010; Chin et al., 2011). The most common reward definitions are functions of delay (Arel et al., 2010; El-Tantawy et al., 2013) and queued vehicles (Abdoos et al., 2013; Balaji et al., 2010; Chin et al., 2011). For a comprehensive review of traditional reinforcement learning traffic signal control research, the reader is referred to (El-Tantawy, Abdulhai, & Abdelgawad, 2014) and (Mannion, Duggan, & Howley, 2016). Bayesian reinforcement learning has also been demonstrated effective for adaptive traffic signal control (Khamis & Gomaa, 2012, 2014; Khamis, Gomaa, El-Mahdy, & Shoukry, 2012; Khamis, Gomaa, & El-Shishiny, 2012).

Deep reinforcement learning techniques were first applied to traffic signal control via the DQN in (Rijken, 2015). The authors followed the example in (Mnih et al., 2015) closely, defining the dimensions of the state space to be the same dimensions as in the Atari environment, meaning only a fraction of any state observation contained relevant information (which the authors identified and acknowledged as problematic). A uniform distribution was used to generate vehicles into the network, which is likely too simple a model of vehicles arriving at an intersection. Also, only green phases were modeled, meaning no yellow change or red clearance phases were used between green phases.

Subsequent work by (van der Pol, 2016) expanded on the ideas in (Rijken, 2015), with the inclusion of vehicle speed and signal phases in the state observation, 1 s yellow change phases, Double Q-learning and limited experiments with multiple intersections. Both authors noted that while the agent's performance improved with training, there were concerns with the agent's learned policy, such as quickly oscillating phases and the inability to develop stable, long-term policies.

Research by (Li et al., 2016) deviated from previous work, using deep stacked autoencoders to approximate the optimal action-value function. A disadvantage of stacked autoencoders is that each layer must be trained individually until convergence and then stacked on top of one another, as opposed to DQN, which can be trained quicker in an end-to-end manner. The state observation was the number of queued vehicles in each lane over the past 4 s. The authors tested their agent on an isolated four way, two lane intersection where vehicles could not turn left or right, only travel through. This simplified the action space to two green phases. They also mandated a minimum green time of 15 s and did not model any yellow change or red clearance phases.

The traffic demand for each intersection approach was [100, 2000] veh/h, generated "randomly," without detailing the specific probability distribution.

Research referenced thus far used value-based reinforcement learning, which estimates value functions to develop the traffic signal control policy. Policy-based methods are another type of reinforcement learning which explicitly parameterize the policy instead of using a value function to derive the policy. A traffic signal control policy was parameterized in (Casas, 2017) and (Mousavi et al., 2017). Deep policy gradient was used in (Casas, 2017) to control dozens of traffic signal controllers in a large traffic network with impressive results. However, the model developed modified an existing traffic signal control policy instead of deriving its own in a truly autonomous manner. It seems this was a concession made so that multiple intersections could be controlled with the same agent; however, it makes the implicit assumption that the optimal policy for each intersection is the same, which is unlikely. Regardless, the size and scope of the experiments to which the agent was subjected, and its subsequent performance, are strong evidence for further work using policy-based reinforcement learning for traffic signal control.

3. Contribution

Considering the previous work as detailed, we identify the following problems and our contributions.

First, prior research has modeled the traffic signal problem with unrealistic simplifications, specifically, either completely omitting yellow change and red clearance phases (Li et al., 2016; Rijken, 2015), or including them for short durations (van der Pol, 2016). Acknowledging that yellow and red phase durations can depend on many factors (e.g. intersection geometry and traffic laws), we contend that omitting yellow and red phases or including them for unrealistically short durations simplifies the traffic signal control problem such that any results are questionable. If reinforcement learning methods are to provide a solution to the traffic signal control problem, the modeling must include all relevant aspects as would be present in practice. We contribute a realistic model of the traffic signal control problem, with yellow and red phase durations of 4 s. Although the inclusion of these additional signals may seem slight, it changes how the agent's actions are implemented.

Second, we contend the only actions worth choosing by the agent are the green phases, as these are the actions which ultimately achieve any rational traffic signal control goal (i.e. maximize throughput, minimize delay or queue), assuming safety concerns have been satisfied. However, if green phases are the only actions available to the agent, depending on the current phase, some actions cannot always be enacted immediately – yellow change and red clearance phases may be necessary. For example, there exist action sequences which are unsafe to transition from immediately, meaning the traffic signal phase cannot immediately change to the agent's chosen action for safety reasons (i.e. the movements constituting these phases conflict). Some action sequences may require the traffic signal to enact a yellow change phase, instructing vehicles to slow down and prepare to stop, and then a red clearance phase, instructing vehicles to stop. Note that the agent does not explicitly select these transition traffic phases as actions; they are implicit when selecting actions that conflict with the previous action. Therefore, yellow change and red clearance phases are only included between sequences of conflicting actions.

Third, prior research has generated traffic using simple (Rijken, 2015; van der Pol, 2016), and likely unrealistic, models (e.g. uniform distribution vehicle generation). For reinforcement learning traffic signal control to be a contender in practice, it must be able to control intersections at any point on the demand spectrum, up to saturation demands. We train and test our agent under a dynamic rush hour demand scenario, authentically modeling the challenging environment as would be present in practice.

4. Proposed model

Attempting to solve the traffic signal control problem using reinforcement learning requires a formulation of the problem using concepts from Markov Decision Processes, specifically, defining a state space S, an action space A and a reward r.

4.1. State space

The state observation is composed of four elements. The first two elements are aggregate statistics of the intersection traffic – the density and queue of each lane approaching the intersection. The density, $s_{d,l}$, is defined in Equation (4) and the queue, $s_{q,l}$, is defined in Equation (5):

$s_{d,l} = \frac{|V_l|}{c_l}$ (4)

where $V_l$ represents the set of vehicles on lane $l$ and $c_l$ represents the capacity of lane $l$.

$s_{q,l} = \frac{|V_{q,l}|}{c_l}$ (5)

where $V_{q,l}$ represents the set of queued (i.e. stationary) vehicles on lane $l$ and $c_l$ represents the capacity of lane $l$.

The third and fourth elements encode information about the current traffic signal phase. These elements are a one-hot vector encoding the current traffic signal phase and a real-valued scalar encoding the time spent in the current phase. A one-hot vector is a binary valued, categorical encoding of qualitative properties. Therefore, the state space at an intersection with L lanes and a set of traffic phases P is formally defined as $S \in (\mathbb{R}^{L} \times \mathbb{R}^{L} \times \mathbb{B}^{|P|} \times \mathbb{R})$. At time t, the agent observes the traffic state as $s_t \in S$.
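For illustration, a sketch of how one observation $s_t$ could be assembled under the definitions above; the per-lane vehicle and queue counts are placeholder inputs that would, in practice, be read from the simulator:

    import numpy as np

    PHASES = ["NSG", "EWG", "NSLG", "EWLG", "NSY", "EWY", "NSLY", "EWLY", "R"]

    def state_observation(veh_counts, queue_counts, lane_capacity,
                          current_phase, time_in_phase):
        """Assemble s_t = (density, queue, one-hot phase, elapsed time).

        veh_counts / queue_counts: vehicles and stopped vehicles per incoming
        lane (hypothetical inputs here).
        """
        density = np.asarray(veh_counts, dtype=float) / lane_capacity   # Eq. (4)
        queue = np.asarray(queue_counts, dtype=float) / lane_capacity   # Eq. (5)
        phase_one_hot = np.zeros(len(PHASES))
        phase_one_hot[PHASES.index(current_phase)] = 1.0
        return np.concatenate([density, queue, phase_one_hot, [time_in_phase]])

    # Example: 16 incoming lanes with a capacity of 20 vehicles per lane.
    s_t = state_observation(veh_counts=[3] * 16, queue_counts=[1] * 16,
                            lane_capacity=20, current_phase="NSG",
                            time_in_phase=10)
    print(s_t.shape)  # (42,)

With 16 incoming lanes the vector has 16 + 16 + 9 + 1 = 42 elements, matching the state cardinality $|s_t|$ reported later in Table 3.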

4.2. Action space

The actions available to the agent are the green phases that the intersection traffic signal controller can enact. Although traditionally denoted numerically with natural numbers starting at 1 [i.e. National Electrical Manufacturers Association (NEMA) standard], in this research, to reduce ambiguity, we denote phases by their compass directions and movement priority. Beginning with the four compass directions, North (N), South (S), East (E), and West (W), combining the parallel directions, along with the two different lane movements which allow vehicles to traverse the intersection, through green (G) and advanced left green (LG), the four possible actions are North–South Green (NSG), East–West Green (EWG), North–South Advance Left Green (NSLG), and East–West Advance Left Green (EWLG). For any of the action phases, all other compass direction traffic movements are prohibited (i.e. an East–West Green phase implies all North–South signals are red), except for right turns, which are permissive (i.e. right on red). Formally, the set of all possible actions A is defined as A = {NSG, EWG, NSLG, EWLG}, with additional information available in Table 1. Therefore, at time t, the agent chooses an action $a_t$, where $a_t \in A$.

Table 1. Traffic signal phase information.

Action | NEMA phases | Compass directions | Left turn  | Through    | Right turn
NSG    | 2, 6        | North, South       | Permissive | Protected  | Permissive
NSLG   | 1, 5        | North, South       | Protected  | Prohibited | Permissive
EWG    | 4, 8        | East, West         | Prohibited | Protected  | Permissive
EWLG   | 3, 7        | East, West         | Protected  | Prohibited | Permissive

However, when an agent chooses an action, it may not be immediately enacted. To ensure safe control of the intersection, additional phases may precede the chosen action. Instead of immediately transitioning from the current traffic signal phase to the selected action, a sequence of yellow change and red clearance phases may be required, dependent on the current phase and chosen action. All possible transition sequences from the current traffic phase to the chosen action phase are shown in Table 2. Note the addition of the North–South Yellow (NSY), East–West Yellow (EWY), North–South Advance Left Yellow (NSLY), East–West Advance Left Yellow (EWLY) change and red clearance (R) phases, which cannot be chosen explicitly as actions, but are part of some phase transition sequences. The set of all traffic signal phases, actions and transition phases, is defined as P = {NSG, EWG, NSLG, EWLG, NSY, EWY, NSLY, EWLY, R}.

Table 2. Traffic signal phase transition sequences by current phase and selected action $a_t$.

Current phase | NSG      | EWG      | NSLG     | EWLG
NSG           | –        | {NSY, R} | {NSLY}   | {NSY, R}
EWG           | {EWY, R} | –        | {EWY, R} | {EWLY}
NSLG          | –        | {NSY, R} | –        | {NSY, R}
EWLG          | {EWY, R} | –        | {EWY, R} | –
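To make the transition logic of Table 2 concrete, it can be encoded as a small lookup; the helper below is a sketch with names of our choosing, not the paper's implementation:

    # Transition phases inserted between the current phase and the chosen
    # action, transcribed from Table 2 (missing key = switch immediately).
    TRANSITIONS = {
        ("NSG",  "EWG"):  ("NSY", "R"),   ("NSG",  "NSLG"): ("NSLY",),
        ("NSG",  "EWLG"): ("NSY", "R"),   ("EWG",  "NSG"):  ("EWY", "R"),
        ("EWG",  "NSLG"): ("EWY", "R"),   ("EWG",  "EWLG"): ("EWLY",),
        ("NSLG", "EWG"):  ("NSY", "R"),   ("NSLG", "EWLG"): ("NSY", "R"),
        ("EWLG", "NSG"):  ("EWY", "R"),   ("EWLG", "NSLG"): ("EWY", "R"),
    }

    def phase_sequence(current_phase, chosen_action):
        """Full sequence of phases the signal actually displays."""
        between = TRANSITIONS.get((current_phase, chosen_action), ())
        return list(between) + [chosen_action]

    print(phase_sequence("NSG", "EWG"))   # ['NSY', 'R', 'EWG']
    print(phase_sequence("NSLG", "NSG"))  # ['NSG'] (Table 2 lists no transition)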
4.3. Reward

The final element of reinforcement learning, after the agent has observed the state of the environment s, chosen an action $a_t$, and performed it, is receiving the reward $r_t$. Reward is one element that differentiates reinforcement learning from other types of machine learning; the agent seeks to develop a policy which maximizes the sum of future discounted rewards. Compared to supervised learning, in which correct actions are given by instruction, reinforcement learning has the agent evaluate actions by interacting with the environment and receiving reward.

In the context of traffic signal control, various rewards have been successfully used. Examples of rewards used are functions based on queue, delay, and throughput (El-Tantawy et al., 2014); however, there is yet no consensus on which is best.

We define the reward $r_t$ as a function of the number of queued vehicles in Equation (6):

$r_t = -|V_{q_t}|^2$ (6)

where $V_{q_t}$ is the set of queued vehicles at the intersection at time t. The number of queued vehicles is squared to increasingly penalize actions that lead to longer queues. Since reinforcement learning agents seek to maximize reward, the negative number of queued vehicles squared is used, rewarding the agent for minimizing the number of queued vehicles.
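A direct transcription of Equation (6) as a sketch:

    def reward(num_queued_vehicles):
        """Equation (6): r_t = -|V_q_t|^2, the negative squared queue length."""
        return -float(num_queued_vehicles) ** 2

    # Longer queues are penalized increasingly harshly:
    print(reward(2), reward(10))  # -4.0 -100.0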

4.4. Agent

In reinforcement learning, the agent is the entity that learns by interacting with the environment. There is a single agent governing the traffic signal controller, exclusively responsible for selecting the next green traffic phase from the set of possible green phases A.

We model the agent controlling the traffic signals similar to a DQN (Mnih et al., 2015). A DQN is a deep artificial neural network trained using Q-learning, a value-based reinforcement learning algorithm. Artificial neural networks are mathematical functions inspired by biological neural networks (i.e. brains) that are appealing for their function approximation capabilities. Many problems in machine learning can suffer from the curse of dimensionality, in which, as the dimensionality of the data increases, the training and computational resources required grow exponentially. Artificial neural networks have the capability to generalize from what they have learned, weakening the problems posed by the curse of dimensionality.

Artificial neural networks are defined by their architecture and weight parameters $\theta$. Typically, neural networks are composed of at least one hidden layer of computational neurons. Additional hidden layers in a network allow it to develop high level representations of its input data through multiple layers of function composition, potentially increasing performance. The nQN-TSC is implemented as a two hidden layer artificial neural network. The input to the nQN-TSC is the state observation $s_t$ outlined in Section 4.1. The two hidden layers are fully connected with 42 neurons each and are each followed by rectified linear activation units (ReLU). The output layer is composed of four neurons with linear activation units, each representing the action-value of a different action (i.e. green phase).
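As an illustration, the architecture just described can be written down in a few lines; the sketch below uses tf.keras for brevity (the original implementation used TensorFlow 1.2.1, so the layer and function names here are ours):

    import numpy as np
    import tensorflow as tf

    def build_q_network(state_size=42, num_actions=4):
        """Two fully connected hidden layers of 42 ReLU units each and a
        linear output layer with one action-value per green phase."""
        return tf.keras.Sequential([
            tf.keras.layers.Dense(42, activation="relu",
                                  input_shape=(state_size,)),
            tf.keras.layers.Dense(42, activation="relu"),
            tf.keras.layers.Dense(num_actions, activation="linear"),
        ])

    q_network = build_q_network()
    q_values = q_network(np.zeros((1, 42), dtype=np.float32))
    print(q_values.shape)  # (1, 4): one value per action in A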
The optimal policy $\pi^*$ is achieved using the neural network to approximate the optimal action-value function by changing the parameters $\theta$ through training. The action-value function maps states to action utilities (i.e. what is the value of each action from a given state). Values represent long-term reward. If an action has a high value (and is accurately estimated), enacting it means reaping future reward. Choosing actions with the highest value will yield the optimal policy.

The specific reinforcement learning algorithm used in this research is asynchronous n-step Q-learning (Mnih et al., 2016), a variant of DQN (Mnih et al., 2015). Asynchronous n-step Q-learning differs from DQN in two fundamental ways. First, instead of a single agent interacting with a single environment, the algorithm leverages modern multi-core computer processors to simulate multiple agent-environment pairs in parallel, each agent-environment pair on a CPU thread maintaining its own local parameters $\theta'$. The agents collect parameter updates $d\theta$, using their local parameter set $\theta'$, that are used to asynchronously update a global parameter set $\theta$. Each agent periodically copies the global parameter set to its local parameter set every I actions. Utilizing multiple agents improves learning, as each agent is likely experiencing different environment states, taking different actions, and receiving different rewards. Second, instead of updating the parameters $\theta$ after each action, the agent executes n actions and uses data from all n actions to improve reward (and value) estimates. The asynchronous n-step Q-learning algorithm pseudo code is presented in Algorithm 1 and the model hyperparameters are presented in Table 3. Note the inclusion of reward normalization by the maximum reward $r_{max}$ experienced thus far across all agents.

Table 3. Model hyperparameters.

Variable    | Parameter                       | Value
$|s_t|$     | State cardinality               | 42
$c_l$       | Lane capacity                   | 20 veh
T           | Traffic simulation length       | 7200 s
$t_{max}$   | Max experience sequence/n-step  | 4
M           | Total training actions          | 1.5 × 10^6
I           | Update interval                 | 7500
$\gamma$    | Discount factor                 | 0.99

The RMSProp (Tieleman & Hinton, 2012) optimization algorithm is used to train the network with an initial learning rate $\alpha$ of 0.001.

The agent employs an action repeat of 10 for improved stability (i.e. actions/green phases have a duration of 10 s). Yellow change and red clearance phases have a duration of 4 s. If at least one vehicle is on an incoming lane to the intersection, the agent must select an action. However, if no vehicles are present on incoming lanes, this is considered the terminal state $s_{terminal}$; the agent does not select an action and the traffic phase defaults to the red clearance (R) phase.

To select actions during training, we implement the simple, yet effective, $\epsilon$-greedy exploration policy, which selects a random action (explore) with probability $\epsilon$ and selects the action with the highest value (exploit) with probability $1.0 - \epsilon$. The value of $\epsilon$ decreases as training progresses according to Equation (7):

$\epsilon_m = 1.0 - \frac{m}{M}$ (7)

where m is the current number of completed actions and M is the total number of actions. Initially, m = 0, meaning the agent exclusively explores; however, as training progresses, the agent increasingly exploits what it has learned, until it exclusively exploits.
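A minimal sketch of this exploration schedule; the Q-values passed in are placeholders:

    import numpy as np

    rng = np.random.default_rng(0)

    def epsilon(m, M=1.5e6):
        """Equation (7): epsilon_m = 1.0 - m / M, decayed linearly."""
        return 1.0 - m / M

    def select_action(q_values, m, M=1.5e6):
        """Explore with probability epsilon_m, otherwise exploit greedily."""
        if rng.uniform(0.0, 1.0) <= epsilon(m, M):
            return int(rng.integers(len(q_values)))   # explore: random action
        return int(np.argmax(q_values))               # exploit: highest value

    # Early in training (m small) actions are mostly random; late (m near M)
    # the agent almost always exploits.
    print(select_action(np.array([-4.0, -1.0, -9.0, -6.0]), m=0))
    print(select_action(np.array([-4.0, -1.0, -9.0, -6.0]), m=1_499_999))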

5. Experimental setup and training

All experiments were conducted using the traffic microsimulator SUMO v0.29 (Krajzewicz et al., 2012). SUMO provides a Python application programming interface (API), by which custom functionality can be implemented in the traffic simulation. We used the SUMO Python API and custom code to implement all experiment code. The nQN-TSC was implemented using Tensorflow v1.2.1 (Abadi et al., 2015) with additional code from (Kapturowski, 2017). Additional optimized functionality was provided by the NumPy and SciPy libraries (Jones, Oliphant, & Peterson, 2001). The simulations were executed using eight parallel agents on a desktop computer with an i7-6700 CPU and 16 GB of RAM running Ubuntu 16. Training the nQN-TSC for $M = 1.5 \times 10^6$ actions requires 6 h of wall time (Figure 1).

The intersection geometry is four lanes approaching the intersection from every compass direction (i.e. North, South, East, and West), connected to four outgoing lanes from the intersection. The traffic movements for each approach are as follows: the inner lane is left turn only, the two middle lanes are through lanes and the outer lane is through and right turning. All lanes are 300 m in length, from the vehicle origin to the intersection stop line. The state observations are derived from the stop bar to 150 m away from the intersection (i.e. only vehicles 150 m or closer to the intersection are considered in the state). A visual representation can be seen in Figure 2.

Figure 2. Simulation model of intersection, arrows represent lane movements.

For training the nQN-TSC, we subject it to different traffic simulations. We seek to model the dynamic nature of traffic at an intersection, which changes over time. Therefore, we model a rush hour or peak traffic demand scenario. Initially, the traffic demand is at a minimum; then it begins increasing until it reaches a maximum, where it remains for an hour, after which it begins decreasing and then returns to a constant minimum. This training procedure is devised to create a dynamic training environment. We seek to make solving the traffic signal control problem in simulation as challenging as would be faced in practice – dynamic and stochastic. To ensure these qualities, we randomly translate the traffic demand in time at the beginning of each training simulation, displayed in Figure 4. The demands are translated to vary training and to ensure the agent does not overfit the training data. Vehicles are generated using a negative exponential distribution, where the rate parameter is sampled from a normal distribution $\mathcal{N}(\lambda, \frac{\lambda}{10})$ to stochastically generate vehicles. This procedure is used to ensure training simulations are similar but not the same, exposing the agent to a diversity of traffic states. Training is completed after the agent has executed $M = 1.5 \times 10^6$ actions. A block diagram describing the use of the proposed model after training is displayed in Figure 3.

Figure 3. Block diagram describing main steps for real world deployment. The neural network $Q(s_t, a; \theta)$ is first developed in simulation and then used to determine the next phase using traffic sensor data.

Vehicles are generated into the network from each compass direction with uniform probability. Vehicle routing is proportional to turning movement capacity. Vehicles are randomly assigned a route with a probability proportional to the route's lane capacity (i.e. through movements are more likely than turning movements as there are more lanes from which through movements are possible). Given the lane movements defined earlier in the section, the probability a vehicle is assigned a through movement is 3/5, and 1/5 each for left and right turning movements.
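The generation procedure can be sketched as follows; the exact demand profile values (base and peak rates, ramp lengths, translation range) are illustrative assumptions, not the paper's calibrated values:

    import numpy as np

    rng = np.random.default_rng(42)
    shift = rng.uniform(-600.0, 600.0)   # random time translation for training

    def rush_hour_rate(t):
        """Illustrative rush hour profile (veh/s): low, ramp up, hold the
        peak for an hour, ramp back down. Numbers are ours, not the paper's."""
        base, peak = 0.05, 0.25
        ramp_up = np.clip((t - shift) / 1800.0, 0.0, 1.0)
        ramp_down = np.clip((7200.0 + shift - t) / 1800.0, 0.0, 1.0)
        return base + (peak - base) * min(ramp_up, ramp_down)

    def generate_arrivals(rate, sim_length=7200):
        """Negative exponential headways whose rate parameter is itself
        drawn from N(lambda, lambda/10), as described above."""
        times, t = [], 0.0
        while t < sim_length:
            lam = max(rng.normal(rate(t), rate(t) / 10.0), 1e-6)
            t += rng.exponential(1.0 / lam)
            times.append(t)
        return times

    # Route choice proportional to lane capacity: 3/5 through, 1/5 left, 1/5 right.
    routes = rng.choice(["through", "left", "right"], size=10, p=[0.6, 0.2, 0.2])
    print(len(generate_arrivals(rush_hour_rate)), routes[:3])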

Figure 4. Examples of randomly generated simulation rush hour traffic demand. The demands are randomly translated in time for use during training while the untranslated rush hour demand is used during testing.

Algorithm 1: nQN-TSC parallel training pseudo code

    m ← 0; Initialize θ, θ'
    While m < M
        t ← 0, Generate vehicle demands
        While t < T
            Reset parameter gradients dθ ← 0
            Set thread parameters to global: θ' = θ
            t_start = t
            Observe state s_t
            While s_t ≠ s_terminal and t − t_start ≠ t_max
                If Unif(0, 1.0) > ε_m
                    a_t = argmax_a Q(s_t, a; θ')
                Else
                    a_t = Unif(A)
                Receive reward r_t and observe new state s_{t+1}
                If r_t > r_max
                    r_max = r_t
                r_t = r_t / r_max
                s_t = s_{t+1}
                t ← t + 1
                m ← m + 1
                If m mod I == 0
                    θ' ← θ
            If s_t = s_terminal
                G = 0
            Else
                G = max_a Q(s_t, a; θ')
            For i ∈ {t−1, t−2, ..., t_start}
                G ← r_i + γG
                Collect gradients: dθ ← dθ + ∂(G − Q(s_i, a_i; θ'))² / ∂θ'
            Asynchronously update θ with dθ
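The core of the inner loop — working backwards from a bootstrap value to form the n-step targets — can be sketched as follows; the rewards and bootstrap value are placeholders for the agent's stored experience:

    import numpy as np

    def n_step_targets(rewards, bootstrap_value, gamma=0.99):
        """Return the target G for each step of an n-step segment.

        Working backwards from the bootstrap value (0 at a terminal state,
        max_a Q(s_t, a; theta') otherwise), G <- r_i + gamma * G, as in the
        inner loop of Algorithm 1.
        """
        targets = np.zeros(len(rewards))
        G = bootstrap_value
        for i in reversed(range(len(rewards))):
            G = rewards[i] + gamma * G
            targets[i] = G
        return targets

    # Four-step segment (t_max = 4) with normalized rewards, bootstrap -0.5.
    print(n_step_targets([-0.2, -0.1, -0.4, -0.3], bootstrap_value=-0.5))

Each target is then regressed against $Q(s_i, a_i; \theta')$ with squared error, and the resulting gradients are accumulated into $d\theta$ before the asynchronous update of the global parameters.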
6. Results and discussion

To evaluate the performance of the nQN-TSC after training, untranslated rush hour demands are used and its performance is compared against three other TSC methods: random, actuated and linear Q-learning.

The random TSC method enacts each action with a uniform probability (i.e. $\pi_{Rand}(s_t) = \text{Unif}(A)$). A random action policy is extremely naive with respect to optimality; however, it provides a baseline from which to compare the progress of the nQN-TSC.

The actuated method is a fixed phase sequence, dynamic phase duration TSC. Sensors are used to modify the green phase durations, making this method adaptive. Each green phase has a minimum duration (10 s), after which a timer begins decrementing from a gap-out time (5 s). If a vehicle is sensed in a green phase lane (i.e. the lane currently has a permissive or protected movement given the current phase) while the timer is decrementing, the timer is reset and the current phase is temporarily extended.

Only a finite number of extensions can occur, limited by a maximum extension time (40 s). The sensors modeled in simulation are loop detectors (one per lane, 50 m from the stop line), which trigger when a vehicle traverses them.
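A sketch of the gap-out logic just described; the detector hit times are hypothetical inputs:

    def actuated_green_duration(detector_hits, min_green=10, gap_out=5,
                                max_extension=40):
        """Seconds of green served for one phase under the gap-out logic.

        detector_hits: set of second indices (relative to the start of the
        green) at which a loop detector in a served lane was triggered.
        """
        t = min_green
        timer = gap_out
        while timer > 0 and (t - min_green) < max_extension:
            if t in detector_hits:     # a vehicle was sensed: reset the timer
                timer = gap_out
            else:
                timer -= 1
            t += 1
        return t

    # Sparse arrivals gap out quickly; steady arrivals extend to the maximum.
    print(actuated_green_duration({12, 14}))            # short extension
    print(actuated_green_duration(set(range(10, 60))))  # capped at 50 s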
The linear Q-learning method is similar to the proposed nQN-TSC, selecting the next phase in an ad hoc manner, except that it uses a linear combination of features instead of a neural network for function approximation. A linear combination of features function approximator can be thought of as a neural network without hidden layers and activation functions. Each action has its own value function, as a linear combination of features cannot approximate all action-values simultaneously. The features used by Linear Q-learning are the same as the input layer to the nQN-TSC, a combination of the lane queue, density, and phase information. The Linear Q-learning method does not use asynchronous learning (i.e. only one actor and one environment) or n-step returns, instead computing the traditional action-value target update (i.e. 1-step return). Gradient descent is used to train the Linear Q-learning action-value functions using the same model hyperparameters as the nQN-TSC. Training the Linear Q-learning method requires approximately 24 h of wall time.
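For contrast with the nQN-TSC, a sketch of this linear baseline — one weight vector per action and a 1-step target — under assumed hyperparameters:

    import numpy as np

    class LinearQ:
        """One linear value function per action: Q(s, a) = w_a . s."""

        def __init__(self, num_features=42, num_actions=4,
                     alpha=0.001, gamma=0.99):
            self.w = np.zeros((num_actions, num_features))
            self.alpha, self.gamma = alpha, gamma

        def q_values(self, s):
            return self.w @ s

        def update(self, s, a, r, s_next):
            """1-step Q-learning target and a gradient step on (target - Q)^2."""
            target = r + self.gamma * np.max(self.q_values(s_next))
            td_error = target - self.q_values(s)[a]
            self.w[a] += self.alpha * td_error * s

    agent = LinearQ()
    s = np.random.default_rng(1).random(42)
    agent.update(s, a=1, r=-0.3, s_next=s)
    print(agent.q_values(s)[1])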
Total vehicle delay D (8) and throughput F (9) traffic data are collected for comparing TSC methods:

$D = \sum_{t=0}^{T} \sum_{v \in V_{arrive}^{t}} d_v$ (8)

where T is the total number of simulation steps, $V_{arrive}^{t}$ is the set of vehicles at time t that arrive at their destination (i.e. are removed from the simulation) and $d_v$ is the delay experienced by vehicle v.

$F = \sum_{t=0}^{T} |V_{arrive}^{t}|$ (9)

where $V_{arrive}^{t}$ is the set of vehicles at time t that arrive at their destination (i.e. are removed from the simulation) and T is the total number of simulation steps.
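A sketch of Equations (8) and (9); the per-step arrival records are placeholders for data collected from the simulator:

    def total_delay_and_throughput(arrivals_per_step):
        """Equations (8) and (9).

        arrivals_per_step[t] is the list of delays (s) of the vehicles that
        reached their destination at simulation step t.
        """
        D = sum(d for step in arrivals_per_step for d in step)   # Eq. (8)
        F = sum(len(step) for step in arrivals_per_step)         # Eq. (9)
        return D, F

    # Three steps: no arrivals, then two vehicles, then one vehicle.
    print(total_delay_and_throughput([[], [12.0, 30.5], [4.0]]))  # (46.5, 3)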
To estimate relevant statistics, each TSC method is subjected to 100, 2 h simulations at testing rush hour demands. Results from the testing scenarios are shown in Table 4 and Figures 5–7.

Table 4. Traffic signal control rush hour results (μ, σ).

Traffic signal control method | Total throughput (veh/sim) | Total delay (s/sim)
Random   | (6 508, 151) | (81 × 10^6, 20 × 10^6)
Actuated | (6 594, 95)  | (7.8 × 10^6, 0.7 × 10^6)
Linear   | (6 618, 91)  | (20 × 10^6, 9.1 × 10^6)
nQN-TSC  | (6 609, 93)  | (3.0 × 10^6, 0.8 × 10^6)

Figure 5. Performance of TSC methods under 100, 2 h rush hour simulations. Lines represent the mean and the shaded area is the 99% confidence interval. Note that the throughput is influenced by the traffic demand. Traffic demand begins low, increases to a maximum, remains constant, and then decreases at the end of the testing scenario.

Figure 6. Total throughput and delay for individual simulation runs. Colored rectangles represent the range between the first and third quartiles, the solid white line inside the rectangle represents the median, dashed colored lines represent 1.5 × interquartile range, colored dots represent outliers.

Figure 7. Individual vehicle delay by intersection lane movement and TSC type. Colored rectangles represent the range between the first and third quartiles, the solid white line inside the rectangle represents the median, dashed colored lines represent 1.5 × interquartile range, colored dots represent outliers. Performance of TSC methods under 100, 2 h rush hour simulations.

As expected, the Random TSC achieves the worst performance, with the lowest throughput and highest delay.

Comparing the Actuated, Linear, and nQN-TSC, there is no significant difference in throughput; however, the nQN-TSC achieves the lowest delay. Interestingly, Linear Q-learning produces higher delay than the Actuated TSC. This is likely evidence that the use of a linear combination of features as a function approximator has insufficient capacity to model the action-value function for the traffic signal control problem. Although the nQN-TSC and Linear Q-learning methods share many hyperparameters, the additional layers and nonlinear transformations of the neural network seem to allow it to more accurately estimate the optimal action-value function, leading to better performance.

We find it interesting that the same numbers of vehicles are traversing the intersection during the rush hour under the Actuated, Linear and nQN-TSC, but there are significant differences in delay. We conjecture there is a strong correlation between vehicle queues and delay, and since the nQN-TSC seeks to minimize the queue, it also minimizes vehicle delay, as queued vehicles are delayed vehicles (i.e. both are stationary). The performance disparity can be attributed to the flexibility the nQN-TSC has in selecting the next green phase.

The nQN-TSC does not need to adhere to a particular phase cycle; it operates acyclically. Every action (i.e. green phase) chosen is in an attempt to maximize reward. The Actuated TSC does not exhibit such goal-oriented behavior, instead executing primitive logic that produces inferior performance compared to the nQN-TSC. Granted, the Actuated TSC does not have the benefit of training like the nQN-TSC, but its design precludes it from being of any benefit even if it was available. The information contained in the nQN-TSC's state representation is also much higher compared to the Actuated TSC, improving its action selection. The Actuated TSC's state representation is a bit for every lane and this low-resolution state likely aliases many different traffic states. These results provide evidence to support that adaptive traffic signal control can be accomplished with reinforcement learning and function approximation.

We also present individual vehicle delay results for each lane/turning movement in Figure 7. We observe that although in aggregate the nQN-TSC achieves the lowest total delay results, this observation does not hold for all lanes. For the through and right lanes, the nQN-TSC exhibits the lowest median delay, quartiles and outliers compared to the other TSC methods.

However, observing the left lane delay, the Actuated method achieves lower delay than the nQN-TSC, specifically in regards to outliers, as the nQN-TSC has significant outliers which indicate some left turning vehicles have to wait a disproportionate amount of time to turn left compared to other vehicles. This result can be understood by considering the reward function, which seeks to minimize the squared number of queued vehicles. The intersection geometry used in this research has, in any single compass direction, three incoming lanes which allow through movements while only one lane for left turn movements. The nQN-TSC has discovered that selecting
the actions/phases which give protected green signals to through movements (i.e. NSG and EWG) yields the highest reward. The nQN-TSC learns to favor these actions over the exclusive left turning actions/phases (i.e. NSLG, EWLG), because the latter yield less reward from having fewer queued vehicles potentially traversing the intersection when selected.

The nQN-TSC's behavior is contrasted with the Actuated method, which implements a cycle and reliably enacts exclusive left turning phases, yielding lower delays for vehicles in left turning lanes. This is an example of a reinforcement learning agent developing a policy which seems optimal with respect to the reward, but with undesirable, unintended consequences. This behavior could be corrected with a different reward function for the nQN-TSC, but perhaps at the expense of its desirable performance in the other metrics (e.g. total delay, through lane delay, right lane delay). This is an interesting result that we have not observed discussed in the TSC reinforcement learning literature and should be the focus of future research.

7. Conclusions and future work

We modeled an n-step Q-network traffic signal controller (nQN-TSC) trained to control phases at a traffic intersection to minimize vehicle queues. Compared to a loop detector Actuated TSC in a stochastic rush hour demand scenario, the modeled nQN-TSC achieves superior performance by reducing average total vehicle delay by 40%. This research provides evidence that value-based reinforcement learning methods with function approximation can provide improved performance compared to established traffic signal control methods in stochastic environments, such as rush hour demands.

Beyond this research, many areas of future work are available. The function approximator used in this research was simple (i.e. a two fully connected hidden layer neural network) compared to research in other domains; using convolutional or recurrent layers in the neural network will likely yield additional performance improvements. Future work can also investigate policy or actor-critic reinforcement learning algorithms, different state representations, pedestrians, and multiple intersections. These ideas will be explored in future endeavors by the authors.

Acknowledgments

The authors would like to acknowledge the authors of the various programming libraries used in this research. The authors would also like to acknowledge the comments and support of friends, family, and colleagues.

ORCID

Wade Genders  http://orcid.org/0000-0001-9844-8652
References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., … Zheng, X. (2017). Tensorflow: Large-scale machine learning on heterogeneous systems (1.2.1). Retrieved from https://tensorflow.org.
Abdoos, M., Mozayani, N., & Bazzan, A. L. (2013). Holonic multi-agent system for traffic signals control. Engineering Applications of Artificial Intelligence, 26(5–6), 1575–1587. doi: 10.1016/j.engappai.2013.01.007
Abdulhai, B., Pringle, R., & Karakoulas, G. J. (2003). Reinforcement learning for true adaptive traffic signal control. Journal of Transportation Engineering, 129(3), 278–285. doi: 10.1061/(ASCE)0733-947X(2003)129:3(278)
Arel, I., Liu, C., Urbanik, T., & Kohls, A. (2010). Reinforcement learning-based multi-agent system for network traffic signal control. IET Intelligent Transport Systems, 4(2), 128–135. doi: 10.1049/iet-its.2009.0070
Aziz, H. A., Zhu, F., & Ukkusuri, S. V. (2018). Learning based traffic signal control algorithms with neighborhood information sharing: An application for sustainable mobility. Journal of Intelligent Transportation Systems, 22(1), 40–52. doi: 10.1080/15472450.2017.1387546
Balaji, P., German, X., & Srinivasan, D. (2010). Urban traffic signal control using reinforcement learning agents. IET Intelligent Transport Systems, 4(3), 177–188. doi: 10.1049/iet-its.2009.0096
Brockfeld, E., Barlovic, R., Schadschneider, A., & Schreckenberg, M. (2001). Optimizing traffic lights in a cellular automaton model for city traffic. Physical Review E, Statistical, Nonlinear, and Soft Matter Physics, 64(5 Pt 2), 056132.
Casas, N. (2017). Deep deterministic policy gradient for urban traffic light control. arXiv preprint arXiv:1703.09035.
Chin, Y. K., Bolong, N., Kiring, A., Yang, S. S., & Teo, K. T. K. (2011). Q-learning based traffic optimization in management of signal timing plan. International Journal of Simulation, Systems, Science & Technology, 12(3), 29–35.
El-Tantawy, S., Abdulhai, B., & Abdelgawad, H. (2013). Multiagent reinforcement learning for integrated network of adaptive traffic signal controllers (MARLIN-ATSC): Methodology and large-scale application on downtown Toronto. IEEE Transactions on Intelligent Transportation Systems, 14(3), 1140–1150. doi: 10.1109/TITS.2013.2255286
El-Tantawy, S., Abdulhai, B., & Abdelgawad, H. (2014). Design of reinforcement learning parameters for seamless application of adaptive traffic signal control. Journal of Intelligent Transportation Systems, 18(3), 227–245. doi: 10.1080/15472450.2013.810991
Gregoire, J., Qian, X., Frazzoli, E., De La Fortelle, A., & Wongpiromsarn, T. (2015). Capacity-aware backpressure traffic signal control. IEEE Transactions on Control of Network Systems, 2(2), 164–173. doi: 10.1109/TCNS.2014.2378871
Hunt, P., Robertson, D., Bretherton, R., & Winton, R. (1981). SCOOT – a traffic responsive method of coordinating signals (Report No. LR 1014). Berkshire, UK: Transport and Road Research Laboratory.
Jones, E., Oliphant, T., & Peterson, P. (2001). SciPy: Open source scientific tools for Python (0.19.0). Retrieved from https://scipy.org.
Kapturowski, S. (2017). Tensorflow-rl. Retrieved from https://github.com/steveKapturowski/tensorflow-rl
Khamis, M. A., & Gomaa, W. (2012). Enhanced multiagent multi-objective reinforcement learning for urban traffic light control. Paper presented at the 2012 11th International Conference on Machine Learning and Applications (ICMLA), IEEE, Vol. 1, pp. 586–591.
Khamis, M. A., & Gomaa, W. (2014). Adaptive multiobjective reinforcement learning with hybrid exploration for traffic signal control based on cooperative multi-agent framework. Engineering Applications of Artificial Intelligence, 29, 134–151. doi: 10.1016/j.engappai.2014.01.007
Khamis, M. A., Gomaa, W., El-Mahdy, A., & Shoukry, A. (2012). Adaptive traffic control system based on Bayesian probability interpretation. Paper presented at the 2012 Japan-Egypt Conference on Electronics, Communications and Computers (pp. 151–156). Alexandria, Egypt: IEEE.
Khamis, M. A., Gomaa, W., & El-Shishiny, H. (2012). Multi-objective traffic light control system based on Bayesian probability interpretation. Paper presented at the 2012 15th International IEEE Conference on Intelligent Transportation Systems (ITSC), IEEE, pp. 995–1000.
Krajzewicz, D., Erdmann, J., Behrisch, M., & Bieker, L. (2012). Recent development and applications of SUMO—Simulation of Urban MObility. International Journal on Advances in Systems and Measurements, 5(3 & 4), 128–138.
Lee, J., Abdulhai, B., Shalaby, A., & Chung, E.-H. (2005). Real-time optimization for adaptive traffic signal control using genetic algorithms. Journal of Intelligent Transportation Systems, 9(3), 111–122. doi: 10.1080/15472450500183649
Li, L., Lv, Y., & Wang, F.-Y. (2016). Traffic signal timing via deep reinforcement learning. IEEE/CAA Journal of Automatica Sinica, 3(3), 247–254. doi: 10.1109/JAS.2016.7508798
Lowrie, P. (1990). SCATS, Sydney co-ordinated adaptive traffic system: A traffic responsive method of controlling urban traffic. New South Wales, Australia: Traffic Control Section, Roads and Traffic Authority of New South Wales.
Mannion, P., Duggan, J., & Howley, E. (2016). An experimental review of reinforcement learning algorithms for adaptive traffic signal control. In Autonomic road transport support systems (pp. 47–66). Basel, Switzerland: Springer.
Mirchandani, P., & Head, L. (2001). A real-time traffic signal control system: Architecture, algorithms, and analysis. Transportation Research Part C: Emerging Technologies, 9(6), 415–432. doi: 10.1016/S0968-090X(00)00047-4
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., … Hassabis, D. (2016). Asynchronous methods for deep reinforcement learning. Paper presented at the International Conference on Machine Learning, pp. 1928–1937. New York, NY: International Machine Learning Society.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., … Ostrovski, G. (2015). Human level control through deep reinforcement learning. Nature, 518(7540), 529–533. doi: 10.1038/nature14236
Mousavi, S. S., Schukat, M., Corcoran, P., & Howley, E. (2017). Traffic light control using deep policy-gradient and value-function based reinforcement learning. Retrieved from https://arxiv.org/abs/1704.08883
Prashanth, L., & Bhatnagar, S. (2011). Reinforcement learning with function approximation for traffic signal control. IEEE Transactions on Intelligent Transportation Systems, 12(2), 412–421. doi: 10.1109/TITS.2010.2091408
Rijken, T. (2015). DeepLight: Deep reinforcement learning for signalised traffic control (Master's thesis). London, UK: University College London.
Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3(1), 9–44.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction (Vol. 1). Cambridge: MIT Press.
Thorpe, T. L., & Anderson, C. W. (1996). Traffic light control using SARSA with three state representations. Technical Report, Citeseer.
Tieleman, T., & Hinton, G. (2012). Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2), 26–31.
Timotheou, S., Panayiotou, C. G., & Polycarpou, M. M. (2015). Distributed traffic signal control using the cell transmission model via the alternating direction method of multipliers. IEEE Transactions on Intelligent Transportation Systems, 16(2), 919–933.
van der Pol, E. (2016). Deep reinforcement learning for coordination in traffic light control (Master's thesis). Amsterdam, Netherlands: University of Amsterdam.
Watkins, C. J., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3–4), 279–292. doi: 10.1007/BF00992698
Wiering, M. (2000). Multi-agent reinforcement learning for traffic light control. In Proceedings of the 17th International Conference on Machine Learning (pp. 1151–1158). San Francisco, CA: International Machine Learning Society.
Wongpiromsarn, T., Uthaicharoenpong, T., Wang, Y., Frazzoli, E., & Wang, D. (2012). Distributed traffic signal control for maximum network throughput. Paper presented at the 2012 15th International IEEE Conference on Intelligent Transportation Systems (ITSC), IEEE, pp. 588–595.
