
Multi-Agent Reinforcement Learning for Autonomous Driving

Tirth Patel
Lakehead University - Computer Science
Email: tpatel16@lakeheadu.ca
Student Id: 1143729

Abstract— With the advancement of self-driving cars in recent years, the notions of multi-agent systems and reinforcement learning have emerged, building on the technologies of self-driving cars, and they will be quite important for the coming years. Many researchers have stepped forward to perform diverse experiments by constructing simulators that support a wide range of agent behaviours, surroundings, tasks, situations, and other associated elements in order to determine the best potential agent actions in any dilemma-like circumstance. However, these publications have problems and drawbacks in certain respects, and high efficiency and benefits in others. As a result, this study examines and contrasts a number of well-known simulators that are widely used in the community for creating a variety of multi-agent reinforcement learning (MARL) models designed specifically for self-driving vehicles.

Index Terms— MARL, Reinforcement Learning, Simulation, Autonomous Driving

I. INTRODUCTION

Various groundbreaking studies in the field of autonomous driving have been conducted, utilising a range of algorithms centred on teaching agents how to respond in challenging situations by engaging with other agents via various communication strategies. The existing simulators discussed in this paper are the Scalable Multi-Agent Reinforcement Learning Training School (SMARTS), Multi-Agent Connected Autonomous Driving (MACAD-Gym), the Multi-Agent Driving Simulator (MADRaS), and the Behaviour Benchmark (BARK), all of which are based on Reinforcement Learning (RL) and include a variety of agents, environments, scenarios, algorithms, communication mechanisms, and more. The focus of this research is on comparing the above-mentioned factors across all four existing frameworks in order to gain a clear picture of which simulator is ideal for each of the given domains. Before getting into the details of the comparison, this paper gives a quick rundown of all four simulators, including their design, significant features, and how they work with their back-end algorithms. Finally, the study summarises the comparison table and presents notable findings before concluding.

II. METHODS

Before we begin the method discussion, we will go over a quick rundown of all four simulators, including how they function, their design, significant features, architecture, and more. The main goal of the summaries is to provide a general overview of the simulators that are already available, for better future implementations.

A. SMARTS: Scalable Multi-Agent Reinforcement Learning Training School for Autonomous Driving

SMARTS is an open-source, realistic, multi-agent simulator [6] that serves as a platform for the industrial research and development of Multi-Agent Reinforcement Learning (MARL) based autonomous driving (AD), allowing simulations to be extended to real-world scenarios. SMARTS focuses on teaching agents to communicate with each other before coming to a decision, just like in the real world, where genuine and diverse interactions between different agents play a big role. SMARTS evaluates aspects such as vehicle physics, the actions of other road users, the traffic scenario, the overall architecture and structure of the environment and the roadways, as well as all the rules and regulations to be obeyed, in order to enable realistic interactions. Aside from realistic interactions, SMARTS is notable for having heterogeneous agents: primarily ego agents (vehicles controlled by the AD software), social agents (vehicles sharing the environment with the autonomous vehicles) trained using real-world data and Reinforcement Learning (RL) algorithms, and even hand-written social agents (i.e. agents following only pre-defined specific tasks). It also offers the Social Agent Zoo, a collection of varied agents that can be used to make models more robust: people can contribute their own agent algorithm designs to the zoo and thus integrate and simulate AD among a variety of agents, making SMARTS a great option for research [6]. It also has a feature for visualizing the simulation in a web-based application, along with recording the whole scenario [6].
Fig. 1. SMARTS Architecture

In the SMARTS architecture from Fig. 1, the background traffic provider is responsible for providing the traffic scenarios to which the agents must respond through the interactions happening in the foreground. For the background traffic provider, SMARTS uses Simulation of Urban MObility (SUMO) [2]. Moreover, for the interaction scenarios, it uses a Domain Specific Language (DSL) written in Python, which pre-programs SUMO and integrates agents from the Social Agent Zoo with the routes, the vehicles, and the overall map of the scenario. The bubble is a specific region of the whole map where real interactions between the environment and the vehicles take place. Furthermore, the physics of the system, mainly "throttle-brake-steering, trajectory tracking, and lane following" (Zhou et al., 2020), is provided by the Vehicle Physics Provider. Lastly, the motion plan provider is used for executing social agents devoted to highly specific maneuvers such as cut-ins or U-turns. SMARTS also uses distributed computing, in which "the bubble mechanism allows SMARTS to scale up without sacrificing interaction realism" (Zhou et al., 2020). Moreover, there is a need to allocate computing resources according to the type of simulation required, as social agents cannot be treated as "Non-character Players (NPCs)": they require a large amount of computing time. This raises the further issue of managing the highly sophisticated deep learning dependencies of the social agents; SMARTS handles it using Ray, a distributed system in which AI applications can learn in parallel by continuously interacting with an environment distributed among various computers [3]. Therefore, distributed computing enables large-scale simulations wherein multiple computers can be dedicated to performing specific tasks inside various bubbles [6], which makes for proper utilization of resources and faster execution.
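Conceptually, agents in SMARTS interact with the simulation through the usual multi-agent observe-act-step cycle exposed by OpenAI-Gym-style interfaces, with one observation, action, and reward per agent at each step. The following is a minimal, self-contained sketch of that loop; the environment class, observation fields, and action strings are hypothetical stand-ins for illustration, not the actual SMARTS API.

```python
import random

class MultiAgentDrivingEnv:
    """Hypothetical stand-in for a SMARTS-like multi-agent environment."""
    def __init__(self, agent_ids):
        self.agent_ids = agent_ids

    def reset(self):
        # One observation per agent; here just a random speed reading.
        return {a: {"speed": random.uniform(0.0, 30.0)} for a in self.agent_ids}

    def step(self, actions):
        # A real simulator would advance physics and background traffic here.
        obs = {a: {"speed": random.uniform(0.0, 30.0)} for a in self.agent_ids}
        rewards = {a: 1.0 for a in self.agent_ids}  # placeholder reward
        dones = {a: False for a in self.agent_ids}
        dones["__all__"] = False
        return obs, rewards, dones, {}

def policy(observation):
    # Trivial rule: brake above 25 m/s, otherwise throttle.
    return "brake" if observation["speed"] > 25.0 else "throttle"

env = MultiAgentDrivingEnv(agent_ids=["ego", "social_1"])
observations = env.reset()
for _ in range(10):  # one short episode
    actions = {a: policy(o) for a, o in observations.items()}
    observations, rewards, dones, _ = env.step(actions)
    if dones["__all__"]:
        break
```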
B. MADRaS: Multi-Agent Driving Simulator

MADRaS is an open-source Multi-Agent DRiving Simulator for motion planning in AD, which demonstrates its ability to create driving scenarios with high degrees of variability [5]. It accomplishes demanding tasks such as "driving vehicles with drastically different dynamics, maneuvering through a variety of track geometries at a high speed, navigating through a narrow road avoiding collisions with both moving and parked traffic cars and making two cars learn to cooperate and pass through a traffic bottleneck" (Santara et al., 2020). It uses the concept of curriculum learning, where the agents break a task into a hierarchy of subtasks and fulfil each of the subtasks to achieve the main goal of the particular scenario. Moreover, to minimize the computational overhead of perception and action, it uses a physics simulator and representative graphics, in which each driving agent gets a high-level object-oriented representation of the world as its observation and an OpenAI Gym (Brockman, Cheung, Pettersson, Schneider, Schulman, Tang, and Zaremba, 2016) interface for independent control [5]. In this architecture, the TORCS simulator, which stands for The Open Racing Car Simulator (Wymann et al., 2000), simulates the necessary elements of vehicular dynamics, such as mass, rotational inertia, collision, the mechanics of suspensions, links and differentials, friction, and aerodynamics, and it offers a variety of free tracks and cars for experimenting. Furthermore, for enabling communication, it uses the Simulated Car Racing (SCR) server, which sets up User Datagram Protocol (UDP) communication (a connection-less communication protocol) with all the cars through their individual dedicated UDP ports; in Fig. 2, the double-headed arrows between the various agents represent the UDP communications between them. Moreover, MADRaS contains traffic agents, which are formed using pre-programmed algorithms and exhibit pre-defined behaviours; they can be programmed through a common base template by adding various interesting behavioural patterns, and they are equipped with functions responsible for avoiding collisions and going off track [5]. However, these traffic agents are different from the MADRaS agents, which are based upon TORCS and are learning agents that use MARL to learn AD based on the behaviours of the traffic agents and the other MADRaS agents. Overall, as seen in Fig. 2, each SCR client talks to the TORCS server through a dedicated port over UDP as discussed before; the simulation engine as a whole is a UDP server and all the agents are UDP clients, thus establishing a connection between all learning and traffic agents.

Fig. 2. MADRaS simulator

For the exposure of agents to different types of scenarios, MADRaS offers various tracks and car models inherited from the TORCS platform. Moreover, the traffic agents are of varied types, showing behaviours like [5]:

• Driving on a specified lane at a specified speed
• Speed varying as a sine wave, i.e. decreasing and increasing speed at specific time intervals (see the sketch after this list)
• Agents randomly changing lanes
• An agent that drives a specific distance and subsequently parks itself
• Agents randomly stopping while driving
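As a concrete illustration of the sine-wave behaviour above, the target speed can be computed as a sinusoidal function of simulation time. This is a minimal sketch; the base speed, amplitude, and period are made-up values rather than anything prescribed by MADRaS.

```python
import math

def sine_wave_speed(t, base_speed=15.0, amplitude=5.0, period=30.0):
    """Target speed (m/s) at time t (s) for a traffic agent whose speed
    oscillates sinusoidally around a base speed."""
    return base_speed + amplitude * math.sin(2.0 * math.pi * t / period)

# The traffic agent would track this target at every simulation step:
for step in range(5):
    t = step * 7.5  # seconds
    print(f"t={t:5.1f}s  target speed={sine_wave_speed(t):5.2f} m/s")
```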
Furthermore, MADRaS uses the snakeoil library of Gym-TORCS (Yoshida, 2016) [5], which is an OpenAI Gym (Brockman et al., 2016) wrapper for SCR cars developed to run Reinforcement Learning experiments. MADRaS is built upon Gym-TORCS, increasing its stability and simplicity of use and adding extra features like multi-agent training and custom traffic cars. The snakeoil library is responsible for parsing the state information returned by the TORCS server about each agent, like odometry, range data, obstacle detection, engine statistics, and metadata regarding the position of a particular vehicle relative to the other cars on the road [5]. It also lets agents control their respective cars via steering, acceleration, and brake commands.
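The SCR communication described earlier follows the standard UDP client-server pattern, with the simulation engine as the server and each car's controller as a client on its own socket. Below is a minimal sketch using Python's standard socket module; the port number and message strings are illustrative only, not the real SCR protocol.

```python
import socket

# Server side: the simulation engine listens for agent datagrams.
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 3001))  # illustrative port, not the real SCR port

# Client side: each agent is a UDP client on its own dedicated socket.
client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.sendto(b"(init ego)", ("127.0.0.1", 3001))  # made-up message format

data, addr = server.recvfrom(1024)          # engine receives the request
server.sendto(b"(state speed 12.3)", addr)  # and replies with sensor state

reply, _ = client.recvfrom(1024)
print(reply.decode())  # the agent parses the state and chooses its controls
```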
A separate study reviewed here performs three experiments in order to answer the research question of how modifications in a robot's appearance and behaviour affect people's reactions to it. The researchers constructed a variety of face characteristics for robots in order to test human reactions to both humorous and serious robot behaviours on various tasks. The appearance and social behaviour of a robot should either be based on the positive hypothesis, which states that the more attractive a robot appears and the more extraverted and cheerful its behaviour is, the more people will accept and comply with the robot, or on the matching hypothesis, which states that the appearance and social behaviour of a robot should match the seriousness of the task and situation. The study finds that in the real world, robots based on the matching hypothesis will prove to be much more cooperative, especially in many serious day-to-day work activities, such as taking medicines, exercising, and following a healthy routine, where people are required to complete these tasks. People are more likely to follow the orders of a serious robot in a disciplined manner than those of an entertaining robot.

For their first experiment, the researchers performed an online study with 108 pupils. They made 12 2D robots that were classified by level of human similarity (more human-like, midway, more robot-like), age (young and old), and gender (male, female) [4], and the students had to allocate the robots to a variety of professions from an inventory. The participants were asked to assign an appropriate robot to each profession from a list of 12 options, based on how well that robot would fit in the workplace or among humans. The results were such that for many jobs that were social, like actress and drawing instructor (Artistic), retail clerk and sales representative (Enterprise), office clerk and hospital message and food carrier (Conventional), and aerobics instructor and museum guide (Social) [4], human-like robots were preferred over machine-like ones. However, for the jobs of lab assistant, customs inspector, soldier, and security guard, machine-like robots were preferred.
The rationale is that the latter jobs do not require many interactions with humans; moreover, a robot in such jobs is far better than a human, as human security guards tend to sleep at night, a customs inspector might take longer to find a problem than a machine would, and a military robot in place of a soldier would perform better and save more human lives. On the other hand, an interactive, playful, human-like robot will do wonders in the fields of socially active jobs.

The second study measured human responses to two personalities of a robot, playful and serious: the playful robot had humorous, friendly conversations, and the serious robot had more straightforward, on-point conversations with the participants. The experiments were done with both robots, and the robots asked the participants to do some exercise routines in pre-defined dialogue conversations, where the playful robot told jokes, laughed, and made the conversation humorous and friendly, whereas the serious robot kept a simple, on-point conversation. Surprisingly, the results were such that the participants exercised longer with the serious robot. Moreover, they rated the serious robot as significantly higher than the playful robot in terms of conscientiousness and smartness, but less playful and witty, and they rated the playful robot as slightly more obnoxious. This clearly suggests that for accomplishing more important tasks, interactions with the serious robot would be more beneficial.

Finally, the third trial was an extension of the second, in which participants were required to complete two tasks, one of which was comparable to the prior exercise and the other of which was related to jellybeans. Participants in the jellybean task were instructed to guess the flavours and to create as many concoctions as possible by combining different jellybeans. They were also asked to conduct the activities for as long as they could, in order to compare the two trials. Participants in the study spent more time on the jellybean activity because it was more pleasurable. Furthermore, under the jellybean condition the fun robot evoked more compliance than the serious robot, whereas in the exercise situation the serious robot elicited more compliance than the playful robot. The most compliance overall was generated by the fun robot in the jellybean condition.

To recapitulate, the robots' personalities and appearances had an impact on the participants' mental models as well as on their desire to follow the robots' commands. However, the participants were more predisposed to the idea that the robots should behave and look in accordance with their principal objective of job completion, so appearance and behaviour were secondary. Thus, the robot whose personality corresponded to the participants' expectations in terms of task seriousness or playfulness yielded the greatest participant compliance. The results of these experiments also showed that humans' expectations of a robot's movements change under different environmental constraints, and the participants' predictions of the robot's choices gave different insights into what they think about the robot's capabilities. Moreover, the robot's appearance played a huge role in determining its movements and behaviours.

C. MACAD-Gym: Multi-Agent Connected Autonomous Driving

Fig. 3. MACAD-Gym simulator

MACAD-Gym is a multi-agent reinforcement learning platform with support for Connected Autonomous Driving (CAD) simulation. It ships with MACAD-Agents, a set of baseline/starter agents, which enables the community to add various algorithms on top of the baseline agents' algorithms, to conduct various experiments, and to train multiple agents which can be contributed and experimented with on the platform. Here, the baseline algorithm is trained in a partially observable environment, the scenario being a stop-sign-controlled three-way urban intersection. Moreover, the simulations use raw camera observations to learn the behaviours underlying the various agents' strategies and actions. In the MACAD-Gym simulator, the goal of each agent is to maximize the expected value of its long-term future reward, where, unlike in single-agent systems, the objective function of an agent depends on the policies of the other agents in the multi-agent scenario (Praveen, 2019). Thus, the optimal policy of an agent is the best response to the policies of the other agents present in the environment.
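The "expected long-term future reward" that each agent maximizes is the usual discounted return from RL. A small worked sketch follows; the discount factor 0.99 is an arbitrary illustrative choice.

```python
def discounted_return(rewards, gamma=0.99):
    """Long-term reward of one agent: the discounted sum of its future
    rewards, G_t = sum_k gamma^k * r_{t+k}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0]))  # 1 + 0.99*(1 + 0.99*1) = 2.9701
```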
The agents follow different policies, which are shown in Fig. 3; they can be centralized learners or decentralized learners. In centralized learning, the actors of each vehicle have their own previous-experience (XP) memory, which stores the baseline algorithms as well as any newly learnt strategies. All the agents share those memorized strategies, like parameters, trajectory sequences, etc., with each other via Vehicle-to-Vehicle (V2V), Vehicle-to-Infrastructure (V2I), or Vehicle-to-Pedestrian (V2P) channels, depending upon the types of agents and the environment. This shared information is fed into a centralized learner algorithm, which then decides a globally optimal solution for the environment as a whole and commands the individual agents to follow specific strategies to achieve that globally optimal solution. Contrary to the centralized learners, the decentralized learners' (Fig. 3, right) aim is to achieve a locally optimal outcome. Here, each vehicle has separate learning algorithms and memories, through which each agent learns locally and thereafter shares the learned strategies and actions to a common XP memory, where the results are stored and actions are performed. Hence, there is no common learning component here; learning is decentralized, and the agents achieve a locally optimal output rather than a globally optimal solution.
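The centralized/decentralized distinction above comes down to whether all vehicles write into one shared experience buffer consumed by a single learner, or each vehicle keeps its own buffer and learner. A minimal sketch follows; the class and buffer layout are hypothetical, for illustration only.

```python
from collections import deque

class ExperienceMemory:
    """A bounded 'XP' buffer of experience tuples."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, experience):
        self.buffer.append(experience)

# Centralized learning: every vehicle writes into one shared memory,
# so a single learner can search for a globally optimal joint policy.
shared_xp = ExperienceMemory()
for vehicle in ["car_1", "car_2", "car_3"]:
    shared_xp.add((vehicle, "obs", "action", 1.0))  # e.g. shared via V2V/V2I

# Decentralized learning: each vehicle keeps its own memory and learner,
# converging only to locally optimal behaviour.
local_xp = {v: ExperienceMemory() for v in ["car_1", "car_2", "car_3"]}
local_xp["car_1"].add(("obs", "action", 1.0))
```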
D. BARK: Behaviour BenchmARK

Fig. 4. BARK architecture

BARK's architecture (Fig. 4) is built around the following components:

• World and ObservedWorld model: The World model of BARK consists of the map and the static and dynamic objects (agents). The ObservedWorld model, as the name suggests, reflects the world as it is perceived by the agents, where different degrees of observability can be represented by either completely restricting access to the world behaviour model or only allowing consideration of ObservedWorld parameters, which consist of occlusions and sensor noise [1].

• Agent models: There are two interfaces for the agent model: one responsible for calling a behaviour model and thereafter generating a behavioural trajectory, and another for executing an agent-specific model that determines the next state based on the generated behavioural trajectory [1]. For accomplishing these two objectives, some additional information is required, such as (a) Goal Definition: each agent has a specific goal, like geometric goal regions or lane-based goals; (b) Road corridor: agents use it for determining the set of roads, lanes, topology information about road connections, etc.; (c) Polygon: a 2D polygon which specifies the particular shape of an agent.

• Scenario: It contains a list of agents and their behaviours, initial states, goal definitions, execution models, and a simulation map. Thus, in a particular scenario the agents, using the factors contained in the Scenario, can learn and simulate both interactive and data-driven scenarios.

• Scenario generation: It contains source-sink pairs, which determine a lane where the agents are present. It also has other parameter sets which specify the distribution of agent states, their behaviours, goal definitions, and execution models. Hence, a scenario is generated from the above factors and then executed, allowing a variety of scenario types to be modelled and already existing scenarios to be easily extended.

• Benchmarking: It provides a benchmarking database, which consists of large-scale behaviour models and scenario sets, and a BenchmarkRunner, which evaluates the specific behaviour models and parameters present in the benchmarking database [1]. The evaluations are based on (see the sketch after this list):
– StepCount: counts the number of steps
– GoalReached: returns a binary value representing whether the goal is satisfied or not
– DrivableArea: checks whether the agent is inside the RoadCorridor or not
– Collision: checks whether any collisions have occurred
– GoalDistance: calculates the Euclidean distance as per the GoalDefinition
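A minimal sketch of how such evaluators might be written follows; the function names mirror the list above, but the state representation (2D points, set-based footprints) is a made-up simplification, not BARK's actual implementation.

```python
import math

def step_count(history):
    # StepCount: number of simulation steps taken so far.
    return len(history)

def goal_distance(agent_pos, goal_pos):
    # GoalDistance: Euclidean distance between the agent and its goal.
    return math.dist(agent_pos, goal_pos)

def goal_reached(agent_pos, goal_pos, tolerance=1.0):
    # GoalReached: 1 if the agent is within tolerance of its goal, else 0.
    return int(goal_distance(agent_pos, goal_pos) <= tolerance)

def collision(agent_cells, other_agents_cells):
    # Collision: True if the agent's occupied cells overlap any other agent's.
    return any(agent_cells & cells for cells in other_agents_cells)

print(goal_distance((0.0, 0.0), (3.0, 4.0)))    # 5.0
print(goal_reached((2.8, 4.0), (3.0, 4.0)))     # 1
print(collision({(1, 1)}, [{(1, 1), (1, 2)}]))  # True
```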
III. BACKGROUND

Multiple aspects, such as vehicles operated by a variety of agents, settings, tasks, and scenarios, are used in MARL-based AD to achieve the goal of self-driving cars engaging with other agents and working with one another to make the overall scene of AD successful. This section examines the numerous sorts of environments, tasks, and agents available in the most recent simulators for experimentation.

A. Types of environment

1) Fully/Partially Observable environment: For the environment to be fully observable, all agents are required to observe the whole map of the environment, including all the other agents like vehicles and pedestrians, as well as traffic lights and crossroads behaving as agents, as a whole at each and every point in time [4]. However, most environments in real-world scenarios are inherently partially observable, as no agent at any particular time can observe the entire environment: wireless communications are limited to certain areas, making connectivity between vehicles or any other agents both helpful and limited [4].
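Partial observability is often modelled by masking the world state down to what an agent's sensors can reach. Here is a minimal sketch; the sensing radius and state fields are invented for illustration.

```python
import math

def partial_observation(world_state, ego_position, sensing_radius=50.0):
    """The ego agent only 'sees' agents within its sensing radius;
    everything else in the world state is hidden from it."""
    return {
        agent_id: state
        for agent_id, state in world_state.items()
        if math.dist(state["position"], ego_position) <= sensing_radius
    }

world = {
    "car_2":      {"position": (10.0, 0.0)},
    "pedestrian": {"position": (200.0, 5.0)},  # outside sensor range
}
print(partial_observation(world, ego_position=(0.0, 0.0)))
# -> only 'car_2' is visible to the ego agent
```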
2) Synchronous/Asynchronous: A synchronous environment, as the name suggests, is one where all the agents act simultaneously at a particular moment, whereas in an asynchronous environment the agents act at different times [4]. For instance, vehicle agents stopping or moving while the pedestrians act similarly at the same time, based on the same traffic light indication, can be considered actions done in a synchronous manner; whereas, when making turns, an agent in a single lane behind another agent needs to wait until the agent ahead completes its action. Here, the environment is asynchronous.

3) Adversarial: Adversarial elements of an environment are factors like weather and an altered communication medium (which can be caused by unanticipated attacks) [4], which hinder the normal functioning of the agents. This type of environment can be dealt with by using extra conditions in the algorithm, which can subsequently overcome any attacks by taking immediate measures, as well as detect bad weather conditions and consequently take other routes or change strategies to whatever is most favourable at the moment, based on the respective environmental circumstances.

B. Types of agents

1) Homogeneous agents: These are agents belonging to the same category, i.e. agents handling only cars or only motorcycles [4]. As the types of agents are the same, their action spaces are likely to be similar, due to the same physical properties. Also, since homogeneous agents are a collection of only one agent type, their interactions and communications are limited to other agents of the same homogeneous type [4]; other agents like traffic lights, pedestrians, or motorcycles are not considered by a homogeneous car agent, making the simulator limited.

2) Heterogeneous agents: Contrary to homogeneous agents, heterogeneous agents consider various different types of agents interacting with each other [4]. Most realistic applications are composed of heterogeneous agents; for example, multiple vehicles, pedestrians, and even traffic lights (which can be represented as intelligent actors) capable of interacting with each other are heterogeneous agents.

3) Communicating agents: As the name suggests, these are agents capable of communicating with each other with the sole purpose of increasing information availability [4], by means of either direct or indirect channels; this can be one-way communication between a traffic light and vehicles, or two-way communication between two or more vehicles negotiating in traffic (a sketch follows below). The communication can be established via virtual/shared/crowd-sourced sensors, hand and body gestures by pedestrians, visual external displays like car light signals (especially used while turning and braking), traffic light indication signals, and auditory signals like horns and ambulance and police car sirens, among many others.

4) Non-communicating agents: Here the agent cannot communicate, either because of an incapability of communication amongst the agents or due to the nature of the agent, which forces it not to use the communication channels.
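A toy sketch of the one-way vs two-way channels just described; the message dictionaries are invented for demonstration only.

```python
def traffic_light_broadcast(phase):
    # One-way channel: the light broadcasts, vehicles only listen.
    return {"sender": "traffic_light_7", "phase": phase}

def negotiate_merge(vehicle_a, vehicle_b):
    # Two-way channel: both vehicles exchange intents before acting.
    offer = {"from": vehicle_a, "intent": "merge_left"}
    reply = {"from": vehicle_b, "intent": "yield"}
    return [offer, reply]

print(traffic_light_broadcast("red"))
print(negotiate_merge("car_1", "car_2"))
```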
C. Nature of tasks

1) Independent tasks: This type of task consists of self-interested agents, which are neither selfish (caring only about things that are beneficial to themselves) nor malicious (harming other agents), but are likely to follow the description of states that they prefer; thus, their actions are highly motivated by this description (i.e. the algorithm). The model for these types of agents is similar to a single-agent environment: they mainly have their own unique goals and no explicit communication mediums [4]. Thus, all these communication-deprived individual agents will accomplish their own objectives, and "will benefit from agents modeling agents" (Praveen, 2019).
2) Cooperative tasks: In this type of task all the agents act as a cooperative unit, which requires developing algorithms for the different agents such that they learn to achieve a near-globally optimal outcome for each of the agents present in a particular scenario. This leads to developing agents which communicate with each other and can therefore solve, and subsequently eliminate, issues like congestion and collisions by having pre-planned communication for such future difficulties [4].

3) Competitive tasks: As opposed to cooperative tasks, agents here are developed with a competitive nature, specifically for extreme driving scenarios like road rage. This type of task leads to the formation of zero-sum stochastic games, where two agents have opposite interests; such agents can be used for law enforcement or for the development of strong adversarial (external environmental hindrance) agents capable of handling actions in a competitive task.
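In a zero-sum formulation of such a competitive task, one agent's reward is exactly the other's loss, so the two rewards always sum to zero. A minimal illustrative sketch:

```python
def zero_sum_rewards(gain_for_a):
    """Competitive (zero-sum) task: agent A's gain is agent B's loss."""
    return {"agent_a": gain_for_a, "agent_b": -gain_for_a}

print(zero_sum_rewards(1.0))  # {'agent_a': 1.0, 'agent_b': -1.0}
```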
4) Mixed: As the name implies, in mixed tasks agents tend to behave in a cooperative, independent, or competitive manner, according to the particular task the agent needs to perform. This type is the most beneficial of all, as it leads to maximizing rewards. Moreover, an agent designed with one particular nature (say, independent) can be extrapolated to learn the other kinds of tasks (cooperative or competitive), making the simulation highly efficient, as the best possible task can easily be executed by that particular agent with mixed tasks.
IV. RESULTS

All of the above-mentioned comparison criteria, as well as the details about the types of environments, tasks, and agents described in the Background section, are collectively applied to all the simulators explained in the Methods section, and a brief overall comparison and contrast for each specific criterion is presented in the comparison table.
V. DISCUSSION

When distinguishing between the four simulators discussed above, there are several criteria that are common to all of them and can be used to make useful comparisons, such as determining which simulator gives good results for one specific criterion and then using that simulator to experiment with related features.

1) Agents: For comparison, the various agent types outlined in the Background section (Section III) are used. Agents can be homogeneous or heterogeneous, learning or non-learning, communicating or non-communicating, synchronous/asynchronous or mixed, independent or cooperative or competitive, and so on. In the Results section, all of the above attributes of the agents employed in each simulator are listed and compared across the simulators.

2) Scenarios: Because a simulation necessitates particular circumstances in which the agents can act as they should, this criterion compares the scenarios included in each simulator. Two-way or three-way traffic, as well as intersections, are the most typical types of scenarios, and the actions that agents contemplate when dealing with them can be used to evaluate a simulator's overall effectiveness.

3) Involvement of humans: This is the most significant component since, in a real-world scenario, there is a good chance that there will be more people than multi-agent-based AD vehicles, and those people may or may not behave as expected by the AD cars (humans may be unwilling to communicate with AD cars before or after executing activities). As a result, whether or not a simulator uses human-like agents to train AD cars is a significant consideration.

4) Communication techniques: When it comes to dealing with AD automobiles, communication between multiple agents is critical. There are several methods for facilitating communication among agents, but they all rely on wireless communication, because a cable connection is difficult to implement in a real-world environment. As a result, the simulators use diverse server and client models based on IP addresses, Wi-Fi, and Bluetooth technologies for wireless communication.

5) Environment: This primarily concerns the type of environment, which is divided into two categories: partially observable and fully observable. In comparison to fully observable environments, partially observable environments are more frequent, because establishing close-range connections for communication is a simpler and less expensive choice; a fully observable environment requires each agent to have a broad understanding of the whole environment, which is a difficult challenge to execute in a real-world setting.

6) Actions: The agents existing in the environment must follow specific pre-defined activities in order to achieve their objectives, dependent on the circumstances they are in. As a result, all agents have a list of appropriate actions, primarily turning, braking, and speeding up, which they use based on the situations likely to appear while driving, in order to prevent colliding with other vehicles or objects found in the environment.

7) Rewards: Because AD is founded on RL, it employs the concept of rewards. As in real life, to get agents to learn certain things, they are rewarded (in the form of points) if they perform actions correctly and penalised (in the form of penalties) if they crash or detour from the core goals connected to completing a task. Accordingly, this criterion covers the simulators' reward functions as well as some of the criteria they employ to determine the appropriate payouts.
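As a concrete illustration of such a reward function, the per-step reward can be written as a weighted total of progress incentives and penalties. The weights below are made up, not taken from any of the four simulators.

```python
def driving_reward(progress, collided, off_route,
                   w_progress=1.0, w_collision=100.0, w_detour=10.0):
    """Illustrative per-step reward: progress incentives minus weighted
    penalties, combined into a single weighted total."""
    reward = w_progress * progress
    if collided:
        reward -= w_collision
    if off_route:
        reward -= w_detour
    return reward

print(driving_reward(progress=5.0, collided=False, off_route=True))  # -5.0
```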
VI. CONCLUSION

Certain aspects of all four simulators are very similar, while others are diametrically opposed. To begin with, SMARTS, MADRaS, and BARK support a variety of agents, some of which are pre-defined and non-learning, and others which are powered by algorithms and make decisions based on RL. MACAD-Gym, on the other hand, only employs pre-defined learning agents and has no more than two variations of agents. Non-learning agents are extremely useful in a simulator because they can easily stand in for human agents who are unwilling to learn, making the simulator better from a real-world standpoint.
Furthermore, all of the simulators have very similar scenarios and environment characteristics. All the simulators, with the exception of MACAD-Gym, only examine a partially observable environment, whereas MACAD-Gym also considers a fully observable world, which is harder to set up. Furthermore, all of the simulators' actions and rewards are very similar: braking, accelerating, and turning are examples of actions, whereas rewards, penalties, and other progress incentives are combined into a weighted total. Finally, for more scenarios and agents to choose from, SMARTS can be the best solution; however, for experimenting in a small, completely observable environment, with additional action options and simple UDP-based communication protocols, MACAD-Gym can be taken into account. Furthermore, MADRaS would deliver the greatest results for developing racing scenarios, in which various racing tracks are provided for experimentation. Finally, BARK, which emphasises the importance of social agents in a similar way to SMARTS, is a suitable alternative for newcomers in this field because it allows them to play more with non-learning social agents by writing pre-defined algorithms based on AD.
REFERENCES

[1] Julian Bernhard, Klemens Esterle, Patrick Hart, and Tobias Kessler. 2020. BARK: Open Behavior Benchmarking in Multi-Agent Environments. https://doi.org/10.1109/IROS45743.2020.9341222
[2] D. Krajzewicz, G. Hertkorn, C. Rossel, and P. Wagner. 2002. SUMO (Simulation of Urban MObility): an open-source traffic simulation. In MESM. 183-187.
[3] P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang, M. Elibol, Z. Yang, W. Paul, M. I. Jordan, et al. 2018. Ray: A distributed framework for emerging AI applications. In OSDI. 561-577.
[4] Praveen Palanisamy. 2019. Multi-Agent Connected Autonomous Driving using Deep Reinforcement Learning. arXiv:1911.04175
[5] Anirban Santara, Sohan Rudra, Sree Aditya Buridi, Meha Kaushik, Abhishek Naik, Bharat Kaul, and Balaraman Ravindran. 2020. MADRaS: Multi Agent Driving Simulator. arXiv:2010.00993
[6] Ming Zhou, Jun Luo, Julian Villella, Yaodong Yang, David Rusu, Jiayu Miao, Weinan Zhang, Montgomery Alban, Iman Fadakar, Zheng Chen, Aurora Chongxi Huang, Ying Wen, Kimia Hassanzadeh, Daniel Graves, Dong Chen, Zhengbang Zhu, Nhat Nguyen, Mohamed Elsayed, Kun Shao, Sanjeevan Ahilan, Baokuan Zhang, Jiannan Wu, Zhengang Fu, Kasra Rezaee, Peyman Yadmellat, Mohsen Rohani, Nicolas Perez Nieves, Yihan Ni, Seyedershad Banijamali, Alexander Cowen Rivers, Zheng Tian, Daniel Palenicek, Haitham Bou Ammar, Hongbo Zhang, Wulong Liu, Jianye Hao, and Jun Wang. 2020. SMARTS: Scalable Multi-Agent Reinforcement Learning Training School for Autonomous Driving. arXiv:2010.09776
[7] AdaFruit. 2016. Arduino Complex. International Conference on Robotics and Automation. 2767-2772. https://cdn-shop.adafruit.com
[8] Anon. 2018. Dude, Where's My Autonomous Car? The 6 Levels of Vehicle Autonomy. https://www.synopsys.com/automotive/autonomous-driving-levels.html
[9] Anon. n.d. The path to autonomous driving.
[10] BENUI, N. 2012. Weigh-in-motion system. http://www.loadcell.cn/weigh-in-motion-system.html
[11] Davies, A. 2018. The WIRED Guide to Self-Driving Cars. 576-577. https://www.wired.com/story/guide-self-driving-cars/
