Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.
Digital Object Identifier 10.1109/ACCESS.2019.2961174

An Efficient Hardware Implementation of
Reinforcement Learning: the Q-Learning
Algorithm

SERGIO SPANÒ1, GIAN CARLO CARDARILLI1 (Member, IEEE), LUCA DI NUNZIO1, ROCCO
FAZZOLARI1, DANIELE GIARDINO1, MARCO MATTA1, ALBERTO NANNARELLI2 (Senior
Member, IEEE), and MARCO RE1 (Member, IEEE)
1 Department of Electronic Engineering, University of Rome “Tor Vergata”, 00133 Rome, Italy
2 Department of Applied Mathematics and Computer Science, Danmarks Tekniske Universitet, 2800 Kgs. Lyngby, Denmark
Corresponding author: Sergio Spanò (e-mail: spano@ing.uniroma2.it).
The authors would like to thank Xilinx Inc, for providing FPGA hardware and software tools by Xilinx University Program.

ABSTRACT In this paper we propose an efficient hardware architecture that implements the Q-Learning
algorithm, suitable for real-time applications. Its main features are low-power, high throughput and limited
hardware resources. We also propose a technique based on approximated multipliers to reduce the hardware
complexity of the algorithm. We implemented the design on a Xilinx Zynq Ultrascale+ MPSoC ZCU106
Evaluation Kit. The implementation results are evaluated in terms of hardware resources, throughput and
power consumption. The architecture is compared to the state of the art of Q-Learning hardware accelerators
presented in the literature, obtaining better results in terms of speed, power and hardware resources. Experiments
using different sizes for the Q-Matrix and different wordlengths for the fixed-point arithmetic are presented.
With a Q-Matrix of size 8×4 (8 bit data) we achieved a throughput of 222 MSPS (Mega Samples Per
Second) and a dynamic power consumption of 37 mW, while with a Q-Matrix of size 256×16 (32 bit data)
we achieved a throughput of 93 MSPS and a power consumption of 611 mW. Due to the small amount of
hardware resources required by the accelerator, our system is suitable for multi-agent IoT applications.
Moreover, the architecture can be used to implement the SARSA (State-Action-Reward-State-Action)
Reinforcement Learning algorithm with minor modifications.

INDEX TERMS Artificial Intelligence, Hardware Accelerator, Machine Learning, Q-Learning, Reinforce-
ment Learning, SARSA, FPGA, ASIC, IoT, Multi-agent

I. INTRODUCTION
Reinforcement Learning (RL) is a Machine Learning (ML) approach used to train an entity, called agent, to accomplish a certain task [1]. Unlike the classic supervised and unsupervised ML techniques [2], RL does not require two separate training and inference phases, being based on a trial & error approach. This concept is very close to human learning.
As depicted in Fig. 1, the agent “lives” in an environment where it performs some actions. These actions may affect the environment, which is time-variant and can be modelled as a Markovian Decision Process (MDP) [1]. An interpreter observes the scenario, returning to the agent the state of the environment and a reward. The reward (or reinforcement) is a quality figure for the last action performed by the agent and it is represented as a positive or negative number. Through this iterative process, the agent learns an optimal action-selection policy to accomplish its task. This policy indicates which is the best action the agent should perform when the environment is in a certain state. Eventually, the interpreter may be integrated into the agent, which becomes self-critic.

FIGURE 1. Reinforcement Learning framework.

Thanks to this approach, RL represents a very powerful tool to solve problems where the operating scenario is unknown or changes over time.
Recently, the applications of RL have become increasingly popular in various fields such as robotics [3]–[5], Internet of Things (IoT) [6], power management [7], financial trading [8] and telecommunications [9], [10]. Another research area in RL is multi-agent and swarm systems [11]–[14].
These kinds of applications require powerful computing platforms able to process very large amounts of data as fast as possible and with limited power consumption. For these reasons, the performance of software-based implementations is now the main limitation to the further development of such systems, and hardware accelerators based on FPGAs or ASICs can represent an efficient solution for implementing RL algorithms.
The main contribution of this work is a flexible and efficient hardware accelerator for the Q-Learning algorithm. The system is not constrained to any specific application, RL policy or environment. Moreover, for IoT target devices, a low-power version of the architecture based on approximated multipliers is presented.
The paper is organized as follows.
• Section I is a brief survey on Reinforcement Learning and its applications. The Q-Learning algorithm and the related work in the literature are presented.
• Section II describes the proposed hardware architecture, detailing its functional blocks. A technique to reduce the hardware complexity of the arithmetic operations is also proposed.
• Section III presents the implementation results and the comparisons with the state of the art.
• In Section IV final considerations and future developments are given.
• Appendix A shows how the architecture can be exploited to implement the SARSA (State-Action-Reward-State-Action) RL algorithm [15] with minor modifications.

A. Q-LEARNING ALGORITHM
Q-Learning [16] is one of the most known and employed RL algorithms [17] and belongs to the class of off-policy methods, since its convergence is guaranteed for any agent’s policy. It is based on the concept of Quality Matrix, also known as Q-Matrix. The size of this matrix is N × Z, where N is the number of possible states of the environment that the agent can sense and Z is the number of possible actions that the agent can perform. This means that Q-Learning operates in a discrete state-action space S × A. Considering a row of the Q-Matrix that represents a particular state, the best action to be performed is selected by computing the maximum value in the row.
At the beginning of the training process, the Q-Matrix is initialized with random or zero values, and it is updated by using (1).

Q_new(s_t, a_t) = (1 − α)·Q(s_t, a_t) + α·[ r_t + γ·max_a Q(s_{t+1}, a) ]    (1)

The variables in (1) refer to:
• s_t and s_{t+1}: current and next state of the environment.
• a_t and a_{t+1}: current and next action chosen by the agent (according to its policy).
• γ: discount factor, γ ∈ [0, 1]. It defines how much the agent has to take into account long-run rewards instead of immediate ones.
• α: learning rate, α ∈ [0, 1]. It determines how much the newest piece of knowledge has to replace the older one.
• r_t: current reward value.
In [16] it is proved that the knowledge of the Q-Matrix suffices to extract the optimal action-selection policy for a RL agent.
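As a reference for readers approaching the hardware description from a software perspective, the following Python fragment sketches one update step of (1). It is only an illustrative model: the function name, the array shapes and the α and γ values are arbitrary choices of this example and are not part of the proposed design.

```python
import numpy as np

def q_learning_step(Q, s_t, a_t, r_t, s_next, alpha=0.5, gamma=0.9):
    """One tabular Q-Learning update of the N x Z Q-Matrix, as in (1)."""
    best_next = np.max(Q[s_next, :])                      # max_a Q(s_{t+1}, a)
    Q[s_t, a_t] = (1 - alpha) * Q[s_t, a_t] + alpha * (r_t + gamma * best_next)
    return Q

# Example: an 8x4 Q-Matrix initialized to zero, as done at the start of training.
Q = np.zeros((8, 4))
Q = q_learning_step(Q, s_t=2, a_t=1, r_t=1.0, s_next=3)
```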
B. RELATED WORK
Despite the growing interest in RL and the need for systems capable of processing large amounts of data in a very short time, just a few works can be found in the literature about the hardware implementation of RL algorithms. Moreover, the comparison is hard due to the lack of implementation details and homogeneous benchmarks. In this section we review the most prominent works in this field.
In 2005, Hwang et al. [18] proposed a hardware accelerator for the “Flexible Adaptable Size Topology” (FAST) algorithm [19]. The system was implemented on a Xilinx XCV800 FPGA and was validated using the cart-pole problem [20]. The architecture is well described but few details about the implementation are given.
In 2007, Pérez et al. [21] proposed a smart power management application for embedded systems based on the SARSA algorithm [15]. The system was implemented on a Xilinx Spartan-II FPGA. Although the authors proved its functionality, neither the architecture nor the implementation details are given.
One of the most relevant works in the field is [22] by Gankidi et al. that, in 2017, proposed a RL accelerator for space rovers. The authors implemented the Deep Q-Learning technique [23] on a Xilinx Virtex-7 FPGA. They obtained a throughput of 2.34 MSPS (Mega Samples Per Second) for a 4×2 state-action space.
Also in 2017, Su et al. [24] proposed another Deep Q-Learning hardware implementation based on an Intel Arria-10 FPGA. The architecture was compared to an Intel i7-930 CPU and a Nvidia GTX-760 GPU implementation. They achieved a throughput of 25 KSPS with 32 bit fixed-point representation for a 27×5 state-action space.
In 2018, Shao et al. [21] proposed a hardware accelerator for robotic applications based on “Trust Region Policy Optimization” (TRPO) [25].

The architecture was implemented on different devices: FPGA (Intel Stratix-V), CPU (Intel i7-5930K) and GPU (Nvidia Tesla-C2070). With respect to the CPU, the authors obtained a speed-up factor of 4.14× and 19.29× for the GPU and the FPGA implementation, respectively.
The most recent works (published in 2019) include Cho et al. [26]. They propose a hardware accelerator for the “Asynchronous Advantage Actor-Critic” (A3C) algorithm [27], describing an implementation based on a Xilinx VCU1525 FPGA. The system was validated using 6 Atari-2600 videogames.
In the work by Huang et al. [28], another Deep Q-Learning network was implemented on a Digilent Pynq development board for the cart-pole problem. The system is meant only for inference mode and, consequently, cannot be used for real-time learning.
One of the most advanced hardware accelerators for Q-Learning was proposed by Da Silva et al. [29]. The authors presented an implementation based on a Xilinx Virtex-6 FPGA. Moreover, they performed a fixed-point analysis to confirm the convergence of the algorithm. Different comparisons with state of the art implementations were made. Since this is one of the best performing Q-Learning accelerators available today, we provide an extensive comparison with our architecture (sec. III-B).

II. PROPOSED ARCHITECTURE
The Q-Learning agent shown in Fig. 2 is composed of two main blocks: the Policy Generator (PG) and the Q-Learning accelerator.
The agent receives the state s_{t+1} and the reward r_{t+1} from the observer, while the next action is generated by the PG according to the values of the Q-Matrix stored into the Q-Learning accelerator.
Note that s_t, a_t and r_t are obtained by delaying s_{t+1}, a_{t+1} and r_{t+1} by means of registers. s_t and a_t represent the indices of the rows and columns of the Q-Matrix, respectively. These delays do not affect the convergence of the Q-Learning algorithm, as proved in [30].

FIGURE 2. High level architecture of the Q-Learning agent.

With the aim of designing a general purpose hardware accelerator, we do not provide a particular implementation for the PG, since it is application-defined. The PG has been included only in the experiments for the comparison with the state of the art (sec. III-B).
Figure 3 shows the Q-Learning accelerator.
The Q-Matrix is stored into Z Dual-Port RAMs, named Action RAMs. Consequently, we have one memory block per action. Each RAM contains an entire column of the Q-Matrix and the number of memory locations corresponds to the number of states N. The read address is the next state s_{t+1}, while the write address is the current state s_t. The enable signals for the Action RAMs, generated by a decoder driven by the current action a_t, select the value Q(s_t, a_t) to be updated. The Action RAM outputs correspond to a row of the Q-Matrix, Q(s_{t+1}, A).
The signal Q(s_t, a_t) is obtained by delaying the output of the memory blocks and then selecting the Action RAM through a multiplexer driven by a_t. A MAX block fed by the outputs of the Action RAMs generates max_a Q(s_{t+1}, A).
The Q-Updater (Q-Upd) block implements the Q-Matrix update equation (1), generating Q_new(s_t, a_t) to be stored into the corresponding Action RAM.

FIGURE 3. Q-Learning accelerator architecture.

The accelerator can also be used for Deep Q-Learning [23] applications if the Action RAMs are replaced with Neural Network-based approximators.
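The behaviour of this organization can be summarized by the minimal Python model below (an illustrative software sketch under our own naming, not the VHDL source): each “Action RAM” is modelled as a list holding one Q-Matrix column, a single element Q(s_t, a_t) is updated per step, and floating point is used in place of the fixed-point arithmetic of the real datapath.

```python
class QLearningAcceleratorModel:
    """Behavioural sketch of the accelerator of Fig. 3 (illustration only)."""

    def __init__(self, n_states, n_actions, alpha=0.5, gamma=0.9):
        # Z Action RAMs, each holding one Q-Matrix column of N locations.
        self.action_rams = [[0.0] * n_states for _ in range(n_actions)]
        self.alpha, self.gamma = alpha, gamma

    def step(self, s_t, a_t, r_t, s_next):
        # Read port: the row Q(s_{t+1}, A) is read from all Action RAMs in parallel.
        row_next = [ram[s_next] for ram in self.action_rams]
        max_next = max(row_next)                  # MAX block
        q_old = self.action_rams[a_t][s_t]        # value selected through the MUX
        # Q-Upd block: rearranged update (2), only two multiplications.
        q_new = q_old + self.alpha * (r_t + self.gamma * max_next - q_old)
        # Write port: the decoder enables only the Action RAM of action a_t.
        self.action_rams[a_t][s_t] = q_new
        return q_new
```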
A. MAX BLOCK
An extensive study about this block has been proposed in [30]. In that paper, the authors proved that the propagation delay of this block is the main limitation for the speed of Q-Learning accelerators when a large number of actions is required. Consequently, they propose an implementation based on a tree of binary comparators (M = ⌈log2(Z)⌉ stages) that is a good trade-off between area and speed [31].
This architecture is employed by the Q-Learning accelerators presented in [22], [29] and has also been used in our architecture (Fig. 4).

FIGURE 4. MAX block tree architecture for Z=6.

Moreover, in [30] it is proved that, when pipelining is used to speed up the MAX block, the latency does not affect the convergence of the Q-Learning algorithm. This means that, when an application requires a very high throughput, it is possible to use pipelining.
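A software analogy of the comparator tree is sketched below (the function name and list-based interface are our own). Each loop iteration corresponds to one of the M = ⌈log2(Z)⌉ comparator stages; for the Z=6 case of Fig. 4, three stages are needed.

```python
def max_tree(values):
    """Reduce the Z values Q(s_{t+1}, a) with pairwise binary comparators."""
    stage = list(values)
    while len(stage) > 1:                 # one iteration per comparator stage
        nxt = [max(stage[i], stage[i + 1]) for i in range(0, len(stage) - 1, 2)]
        if len(stage) % 2:                # odd element forwarded to the next stage
            nxt.append(stage[-1])
        stage = nxt
    return stage[0]

print(max_tree([3, 7, 1, 5, 2, 6]))       # Z=6: three stages, result 7
```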
B. Q-UPDATER BLOCK
Equation (1) can be rearranged as

Q_new(s_t, a_t) = Q(s_t, a_t) + α·( r_t + γ·max_a Q(s_{t+1}, a) − Q(s_t, a_t) )    (2)

to obtain an efficient implementation. Equation (2) is computed by using 2 multipliers, while (1) requires 3 multipliers.
The Q-Updater block in Fig. 5 is used to compute (2), generating Q_new(s_t, a_t).

FIGURE 5. Q-Matrix updater block architecture.

The critical path consists of 2 multipliers and 2 adders. In the next section (II-B1) a method to reduce the hardware complexity of the multipliers is illustrated.
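The equivalence between (1) and (2) can be verified with a few lines of Python using arbitrary numerical values (chosen here only for illustration); the rearranged form also makes evident that only the two products by α and γ remain.

```python
# Arbitrary example values for q_old = Q(s_t, a_t) and max_next = max_a Q(s_{t+1}, a).
q_old, r_t, max_next, alpha, gamma = 0.25, 1.0, 0.75, 0.5, 0.9

q_eq1 = (1 - alpha) * q_old + alpha * (r_t + gamma * max_next)    # form (1): three products
q_eq2 = q_old + alpha * (r_t + gamma * max_next - q_old)          # form (2): two products

assert abs(q_eq1 - q_eq2) < 1e-12
```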

1) Approximated multipliers
The main speed limitation in the updater block is the propagation delay of the multipliers. Using an approach similar to [32], it is possible to replace the full multipliers shown in Fig. 5 with approximated multipliers based on barrel shifters [33].
In this way, we approximate α and γ with their nearest power of two (single shifter), or with the nearest sum of powers of two (two or more shifters). Since α, γ ∈ [0, 1], only right shifts are needed.
Considering a number x ≤ 1, its binary representation using M bits for the fractional part is

x = x_0·2^0 + x_{−1}·2^{−1} + x_{−2}·2^{−2} + ... + x_{−M}·2^{−M}    (3)

where x_0, ..., x_{−M} are the binary digits. Let i, j, k be the positions of the first, second and third ‘1’ in the binary representation of x, starting from the most significant bit. Moreover, we define ⟨x⟩_OPn as the approximation of x with the n most significant powers of two in the (M + 1)-bit representation. That is

⟨x⟩_OP1 = 2^{−i}
⟨x⟩_OP2 = 2^{−i} + 2^{−j}    (4)
⟨x⟩_OP3 = 2^{−i} + 2^{−j} + 2^{−k}

for the approximation with one, two and three powers of two. The concept can be extended to more powers of two. For example, x = 0.101101_(2) = 0.703125 can be approximated as:

⟨0.101101_(2)⟩_OP1 = 2^{−1} = 0.5
⟨0.101101_(2)⟩_OP2 = 2^{−1} + 2^{−3} = 0.625    (5)
⟨0.101101_(2)⟩_OP3 = 2^{−1} + 2^{−3} + 2^{−4} = 0.6875

Some examples of the approximated values for different numbers of powers of two are presented in Fig. 6 (x ≤ 1). Consequently, the product z = x · y can be approximated as:

⟨z⟩_OP1 = 2^{−i}·y
⟨z⟩_OP2 = 2^{−i}·y + 2^{−j}·y    (6)
⟨z⟩_OP3 = 2^{−i}·y + 2^{−j}·y + 2^{−k}·y

The approximated multipliers are implemented by one or more barrel shifters in the Q-Updater block, depending on the approximation, as shown in Figs. 7 and 8.

FIGURE 6. Approximated values for a 6-bit number using M=5 bits for the fractional part. (a) 1 power of two, (b) 2 powers of two, (c) 3 powers of two.

FIGURE 7. Q-Matrix updater block with multipliers implemented by a single barrel shifter.

FIGURE 8. Q-Matrix updater block with multipliers implemented by two barrel shifters.

The positions of the leading ones i and j in the representations of α and γ can be given as inputs, if constant for the whole computation, or determined by Leading-One-Detectors (LOD) [34] if the values are modified at run time.
The error introduced by this approximation does not affect the convergence of the Q-Learning algorithm [16] and, as a side effect, we obtain a shorter critical path and a lower power consumption (sec. III-A). Moreover, we tested the system in different applications, which proved to be almost insensitive to the approximation error since the convergence conditions of Q-Learning are still satisfied (α, γ ≤ 1).
By using approximated multipliers, we avoid the need for FPGAs with DSP blocks and we can implement the accelerator in small, ultra-low-power FPGAs suitable for IoT applications [35], [36].
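The following Python sketch illustrates the approximation of (3)–(6): the coefficient is reduced to its n most significant powers of two and the product becomes a sum of right shifts. The function names, the quantization of the coefficient and the fixed-point format of the operand are assumptions of this example and do not describe the actual barrel shifter hardware.

```python
def leading_one_positions(x, m_bits, n_terms):
    """Positions i, j, k, ... of the first '1's in the (M+1)-bit code of x <= 1."""
    code = int(round(x * (1 << m_bits)))       # fixed-point code, M fractional bits
    positions = []
    for pos in range(m_bits + 1):              # pos 0 is the 2^0 bit, pos M is 2^-M
        if code & (1 << (m_bits - pos)):
            positions.append(pos)
            if len(positions) == n_terms:
                break
    return positions

def approx_multiply(y_fixed, x, m_bits, n_terms):
    """Approximate x*y using only right shifts on the fixed-point operand y."""
    return sum(y_fixed >> pos for pos in leading_one_positions(x, m_bits, n_terms))

# Example from (5): x = 0.101101b = 0.703125, two terms -> <x>_OP2 = 2^-1 + 2^-3.
y = 1 << 10                                    # y = 1.0 with 10 fractional bits
print(approx_multiply(y, 0.703125, m_bits=6, n_terms=2) / (1 << 10))   # 0.625
```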


III. IMPLEMENTATION EXPERIMENTS
In order to validate the proposed architecture, we implemented different versions of the Q-Learning accelerator.
In the experiments, we used a Xilinx Zynq UltraScale+ MPSoC ZCU106 Evaluation Kit featuring the XCZU7EV-2FFVC1156 FPGA. All the results in this section were obtained using the Vivado 2019.1 EDA tool with default implementation parameters and setting a timing constraint of 2 ns. The system was coded in VHDL.
The design exploration was carried out for the following range of parameters:
• Number of bits for the Q-Matrix values: 8, 16 and 32 bit.
• Number of states N: 8, 16, 32, 64, 128 and 256.
• Number of actions Z: 4, 8 and 16.
We focused the implementation analysis on the following resources [37]:
• Look Up Tables (LUT);
• Look Up Tables used as RAM (LUTRAM);
• Flip-Flops (FF);
• Digital Signal Processing slices (DSP).
For every resource of the device, we also provide the percentage usage with respect to the total available.
The performances were measured in terms of maximum clock frequency (CLK) and dynamic power consumption (PWR). The latter was evaluated using Vivado after the Place&Route, considering the maximum clock frequency and a worst case scenario with a 0.5 activity factor on the circuit nodes [38].
All the implementation examples in this section do not make use of pipelining in the MAX block (sec. II-A). Unless otherwise stated, no approximated multipliers are used.
Tables 1 to 9 show the implementation results for different numbers of states, actions and data-widths (8, 16 and 32 bit) for the Q-Matrix values.

TABLE 1. Implementation results for Q-Matrices with 8 bit data and Z=4.
N     LUT           LUTRAM        FF            DSP         CLK (MHz)   PWR (mW)
8     193 (0.08%)   32 (0.03%)    154 (0.03%)   0 (0.00%)   222         37
16    192 (0.08%)   32 (0.03%)    156 (0.03%)   0 (0.00%)   240         37
32    193 (0.08%)   32 (0.03%)    158 (0.03%)   0 (0.00%)   234         41
64    214 (0.09%)   64 (0.06%)    160 (0.03%)   0 (0.00%)   227         38
128   271 (0.12%)   80 (0.08%)    162 (0.04%)   0 (0.00%)   239         45
256   362 (0.16%)   160 (0.16%)   164 (0.04%)   0 (0.00%)   233         61

TABLE 2. Implementation results for Q-Matrices with 8 bit data and Z=8.
N     LUT           LUTRAM        FF            DSP         CLK (MHz)   PWR (mW)
8     290 (0.13%)   64 (0.06%)    252 (0.05%)   0 (0.00%)   190         64
16    293 (0.13%)   64 (0.06%)    254 (0.06%)   0 (0.00%)   168         68
32    294 (0.13%)   64 (0.06%)    256 (0.06%)   0 (0.00%)   187         65
64    343 (0.15%)   128 (0.16%)   258 (0.06%)   0 (0.00%)   183         71
128   451 (0.20%)   160 (0.16%)   260 (0.06%)   0 (0.00%)   210         90
256   632 (0.27%)   320 (0.31%)   268 (0.06%)   0 (0.00%)   188         109

TABLE 3. Implementation results for Q-Matrices with 8 bit data and Z=16.
N     LUT            LUTRAM        FF            DSP         CLK (MHz)   PWR (mW)
8     497 (0.22%)    128 (0.13%)   318 (0.07%)   0 (0.00%)   147         88
16    498 (0.22%)    128 (0.13%)   320 (0.07%)   0 (0.00%)   130         94
32    496 (0.22%)    128 (0.13%)   322 (0.07%)   0 (0.00%)   141         91
64    587 (0.25%)    256 (0.25%)   330 (0.07%)   0 (0.00%)   149         81
128   717 (0.36%)    320 (0.31%)   344 (0.07%)   0 (0.00%)   160         114
256   1192 (0.52%)   640 (0.63%)   346 (0.08%)   0 (0.00%)   167         162

TABLE 4. Implementation results for Q-Matrices with 16 bit data and Z=4.
N     LUT           LUTRAM        FF            DSP         CLK (MHz)   PWR (mW)
8     179 (0.06%)   64 (0.06%)    250 (0.05%)   3 (0.17%)   152         56
16    178 (0.06%)   64 (0.06%)    252 (0.05%)   3 (0.17%)   152         58
32    180 (0.08%)   64 (0.06%)    254 (0.06%)   3 (0.17%)   149         60
64    204 (0.09%)   96 (0.09%)    256 (0.06%)   3 (0.17%)   153         57
128   333 (0.14%)   160 (0.16%)   258 (0.06%)   3 (0.17%)   158         80
256   513 (0.22%)   320 (0.31%)   272 (0.06%)   3 (0.17%)   156         88

TABLE 5. Implementation results for Q-Matrices with 16 bit data and Z=8.
N     LUT            LUTRAM        FF            DSP         CLK (MHz)   PWR (mW)
8     382 (0.17%)    128 (0.13%)   444 (0.10%)   3 (0.17%)   129         107
16    383 (0.17%)    128 (0.13%)   446 (0.10%)   3 (0.17%)   127         108
32    380 (0.16%)    128 (0.13%)   448 (0.10%)   3 (0.17%)   132         110
64    417 (0.18%)    192 (0.19%)   450 (0.10%)   3 (0.17%)   127         108
128   691 (0.30%)    320 (0.31%)   458 (0.10%)   3 (0.17%)   135         124
256   1048 (0.45%)   640 (0.63%)   478 (0.10%)   3 (0.17%)   136         180

TABLE 6. Implementation results for Q-Matrices with 16 bit data and Z=16.
N     LUT            LUTRAM         FF            DSP         CLK (MHz)   PWR (mW)
8     759 (0.33%)    256 (0.25%)    577 (0.13%)   3 (0.17%)   117         133
16    759 (0.33%)    256 (0.25%)    580 (0.13%)   3 (0.17%)   117         139
32    761 (0.33%)    256 (0.25%)    583 (0.13%)   3 (0.17%)   117         142
64    828 (0.36%)    384 (0.38%)    592 (0.13%)   3 (0.17%)   115         143
128   1386 (0.60%)   640 (0.63%)    606 (0.13%)   3 (0.17%)   121         187
256   2101 (0.91%)   1280 (1.26%)   632 (0.13%)   3 (0.17%)   115         284

TABLE 7. Implementation results for Q-Matrices with 32 bit data and Z=4.
N     LUT            LUTRAM        FF            DSP         CLK (MHz)   PWR (mW)
8     383 (0.17%)    96 (0.09%)    586 (0.13%)   5 (0.29%)   136         116
16    383 (0.17%)    96 (0.09%)    588 (0.13%)   5 (0.29%)   125         113
32    382 (0.17%)    96 (0.09%)    590 (0.13%)   5 (0.29%)   106         128
64    417 (0.18%)    160 (0.16%)   592 (0.13%)   5 (0.29%)   116         125
128   682 (0.30%)    320 (0.31%)   606 (0.13%)   5 (0.29%)   112         142
256   1035 (0.45%)   640 (0.63%)   620 (0.13%)   5 (0.29%)   110         183

TABLE 8. Implementation results for Q-Matrices with 32 bit data and Z=8.
N     LUT            LUTRAM         FF             DSP         CLK (MHz)   PWR (mW)
8     752 (0.33%)    192 (0.19%)    940 (0.20%)    5 (0.29%)   97          209
16    759 (0.33%)    192 (0.19%)    942 (0.20%)    5 (0.29%)   102         205
32    763 (0.33%)    192 (0.19%)    944 (0.20%)    5 (0.29%)   100         220
64    845 (0.37%)    320 (0.31%)    958 (0.21%)    5 (0.29%)   92          233
128   1387 (0.60%)   640 (0.63%)    972 (0.21%)    5 (0.29%)   103         263
256   2095 (0.91%)   1280 (1.26%)   1004 (0.22%)   5 (0.29%)   104         329

TABLE 9. Implementation results for Q-Matrices with 32 bit data and Z=16.
N     LUT            LUTRAM         FF             DSP         CLK (MHz)   PWR (mW)
8     1366 (0.59%)   384 (0.38%)    1092 (0.24%)   5 (0.29%)   95          244
16    1362 (0.59%)   384 (0.38%)    1096 (0.24%)   5 (0.29%)   96          260
32    1365 (0.59%)   384 (0.38%)    1100 (0.24%)   5 (0.29%)   95          248
64    1525 (0.66%)   640 (0.63%)    1116 (0.24%)   5 (0.29%)   97          251
128   2579 (1.12%)   1280 (1.26%)   1148 (0.25%)   5 (0.29%)   91          408
256   4017 (1.74%)   2560 (2.52%)   1210 (0.26%)   5 (0.29%)   93          611

FIGURE 9. Average clock frequency for different Q-Matrix data bit-widths vs number of actions.

FIGURE 10. Number of LUTs for different implementations.

FIGURE 11. Dynamic power consumption for different implementations.
The first consideration is related to the number of DSPs. Since only one Q-Matrix element is updated per clock cycle, the only parameter that affects the number of required DSPs is the bit-width. For a Q-Matrix with 8-bit data, we obtain the fastest implementations, which do not require any DSP slice. For 16-bit and 32-bit data, 3 DSPs and 5 DSPs are required, respectively.
Another consideration concerns the maximum clock frequency (which corresponds exactly to the throughput of the system). Given a certain data-path bit-width and number of actions, the clock frequency remains almost unaltered. This can be ascribed to the different solutions found by the routing tool. For this reason, in Fig. 9 we use the average clock frequencies. The frequency drop, when the number of actions increases, is greater for 8 bit data-paths with respect to the 16 and 32 bit cases. This behaviour can be justified by taking into account the major role of FPGA interconnections when a large number of bits is used.
Concerning the hardware resources, the number of required LUTRAMs is related to the size of the Q-Matrix N × Z. From N=8 to N=32 the resources remain the same, while from N=64 a higher number of LUTRAMs is required.
As expected, the power consumption is proportional to the number of required LUTs (considering architectures with the same parameters). The trend can be observed in Figs. 10 and 11.
Even for the largest implementation considered (N=256, Z=16, 32-bit Q-Matrix values), the required FPGA resources are moderate. This suggests that the architecture can be easily employed in applications requiring a large number of states or actions, and in applications where multiple agents must be implemented on the same device.
The main result of the design exploration shows that we can implement fast Q-Learning accelerators with a small amount of resources and low power consumption.

A. Q-UPDATER BASED ON APPROXIMATED MULTIPLIERS
As discussed in sec. II-B1, to allow the use of IoT devices, the hardware complexity of the Q-Updater block can be reduced by replacing the full multipliers with approximated multipliers based on barrel shifters.
In order to evaluate the benefits of such approach, we implemented the multipliers by using the single power-of-two approach (1 barrel shifter per multiplier) and the more precise approach based on the linear combination of two powers of two (two barrel shifters per multiplier), as depicted in Figs. 7 and 8. We considered 8, 16 and 32 bit operands.
Since the dynamic power consumption is directly proportional to the clock frequency [38], for a fair comparison we provide the energy required to update one Q-Matrix element and the percentage of energy saved with respect to the traditional implementation. Tables 10, 11 and 12 show the comparison between the implementations of approximated and full multipliers. Note that for the 8 bit architectures the power dissipation was too low to be accurately estimated. In the traditional implementation, we forced the Vivado synthesizer not to use any DSP block.
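Since one Q-Matrix element is updated per clock cycle, the energy figures reported in the tables can be read as the ratio between the dynamic power and the clock frequency. The small Python sketch below (the helper name is ours) reproduces, up to the rounding of the reported frequency, the 0.017082 nJ listed for the full 16-bit multiplier in Table 11.

```python
def energy_per_update_nJ(power_mW, clock_MHz):
    """Energy per Q-Matrix update in nanojoules, assuming one update per clock cycle."""
    return (power_mW * 1e-3) / (clock_MHz * 1e6) * 1e9

# Full 16-bit multiplier in Table 11: 9 mW at 527 MHz -> about 0.01708 nJ,
# consistent with the reported 0.017082 nJ (the reported frequency is rounded).
print(energy_per_update_nJ(9, 527))
```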
The barrel shifter-based architectures do not require any DSP slice, they use less hardware resources, and they are faster and more power-efficient than their full multiplier-based counterparts, especially for the 16 and 32 bit implementations. For these reasons, they are suitable for Q-Learning applications on very small and low-power IoT devices, at the cost of a reduced set of possible α and γ values.

TABLE 10. Implementation comparisons: approximated and full multipliers with 8 bit operands.
Impl. type   LUT           CLK (MHz) / SpeedUp   PWR (mW)   Energy (nJ)   Energy saving
Full         35 (0.01%)    795 (1×)              2          0.002516      -Ref-
1 Shift      11 (<0.01%)   1644 (2.07×)          <1         <0.000608     >76%
2 Shift      27 (0.01%)    864 (1.09×)           <1         <0.004116     >54%

TABLE 11. Implementation comparisons: approximated and full multipliers with 16 bit operands.
Impl. type   LUT           CLK (MHz) / SpeedUp   PWR (mW)   Energy (nJ)   Energy saving
Full         155 (0.07%)   527 (1×)              9          0.017082      -Ref-
1 Shift      33 (0.01%)    1412 (2.68×)          2          0.001416      92%
2 Shift      66 (0.02%)    700 (1.33×)           4          0.005716      67%

TABLE 12. Implementation comparisons: approximated and full multipliers with 32 bit operands.
Impl. type   LUT           CLK (MHz) / SpeedUp   PWR (mW)   Energy (nJ)   Energy saving
Full         731 (0.32%)   414 (1×)              49         0.118286      -Ref-
1 Shift      87 (0.04%)    808 (1.95×)           5          0.006190      95%
2 Shift      184 (0.08%)   637 (1.54×)           9          0.014130      73%

B. STATE OF THE ART ARCHITECTURE COMPARISON
The architecture proposed in this paper has been compared with one of the best performing Q-Learning hardware accelerators available today [29].
In their paper, Da Silva et al. proposed a parallel implementation based on the number of states N, while in our work the parallelization is based on the number of actions Z. Since in most RL applications Z << N (see the examples in sec. I), our approach results in a smaller architecture.
Another important difference consists in the earlier selection of the Q-Matrix value to be updated. This allows the implementation of a single block for the computation of Q_new(s_t, a_t), while in [29] N × Z blocks are required. Moreover, in the case of FPGA implementations, our architecture allows the use of distributed RAM or embedded block-RAM. This gives an additional degree of freedom compared to [29], where only registers are considered for storing the Q-Matrix values.
To obtain a fair comparison:
• We implemented the same RL environment of [29] and stored the reward values in a Look-Up Table.
• We implemented a random PG as described in [29].
• We considered 16-bit Q-Matrix values.
• We implemented the architectures on the same Virtex-6 FPGA ML605 Evaluation Kit (using the ISE 14.7 Xilinx suite).
The experimental results are shown in Tables 13, 14, 15 and 16. We can only make comparisons with Z=4 and Z=8, since they are the only values implemented in [29]. The implementation results are given in terms of [39]:
• DSP blocks (DSP)
• Slice Registers (REG)
• Slice LUTs (LUT)
• Maximum clock frequency (CLK)
• Power consumption (PWR)
• Energy required to update one Q-Matrix element (Energy)
As expected, our architecture employs a constant number of DSP slices, while in [29] this number is proportional to N × Z. The number of Slice Registers required by our implementations remains almost unaltered when the number of states increases, while in [29] it grows with N.

TABLE 13. Da Silva et al. [29] implementation results for 16 bit Q-Matrix values and Z=4.
N     DSP            REG              LUT               CLK (MHz)   PWR (mW)   Energy (nJ)
6     34 (4.16%)     548 (0.18%)      1734 (1.15%)      24          3          0.127443
12    58 (7.55%)     1029 (0.34%)     3387 (2.24%)      22          6          0.269421
20    90 (11.71%)    1670 (0.55%)     5594 (3.71%)      20          10         0.496032
30    130 (16.92%)   2470 (0.81%)     8872 (5.88%)      20          13         0.660905
56    234 (30.46%)   4552 (1.51%)     15526 (10.30%)    15          29         1.914191
132   370 (48.17%)   10533 (3.49%)    70311 (46.65%)    13          67         5.018727

TABLE 14. Proposed implementation results for 16 bit Q-Matrix values and Z=4.
N     DSP          REG            LUT            CLK (MHz)   PWR (mW)   Energy (nJ)
8     3 (0.39%)    186 (0.06%)    148 (0.09%)    72          14         0.193798
16    3 (0.39%)    188 (0.06%)    142 (0.09%)    74          15         0.202922
32    3 (0.39%)    190 (0.06%)    147 (0.09%)    72          14         0.193157
64    3 (0.39%)    192 (0.06%)    191 (0.12%)    74          12         0.163088
128   3 (0.39%)    194 (0.06%)    304 (0.20%)    71          16         0.226053
256   3 (0.39%)    196 (0.06%)    512 (0.33%)    72          20         0.279330

TABLE 15. Da Silva et al. [29] implementation results for 16 bit Q-Matrix values and Z=8.
N     DSP            REG             LUT              CLK (MHz)   PWR (mW)   Energy (nJ)
12    106 (13.80%)   1797 (0.59%)    6960 (4.61%)     21          10         0.483793
20    170 (23.13%)   2950 (0.97%)    11561 (7.67%)    18          21         1.189128
30    250 (32.55%)   4390 (1.45%)    17208 (11.41%)   16          26         1.580547
56    378 (49.21%)   8136 (2.69%)    48536 (32.20%)   14          48         3.331020

TABLE 16. Proposed implementation results for 16 bit Q-Matrix values and Z=8.
N     DSP          REG            LUT             CLK (MHz)   PWR (mW)   Energy (nJ)
8     3 (0.39%)    316 (0.10%)    316 (0.20%)     62          20         0.322529
16    3 (0.39%)    318 (0.10%)    324 (0.21%)     62          18         0.289902
32    3 (0.39%)    320 (0.10%)    316 (0.20%)     63          18         0.28454
64    3 (0.39%)    322 (0.10%)    423 (0.28%)     64          22         0.346075
128   3 (0.39%)    324 (0.10%)    655 (0.43%)     63          24         0.380288
256   3 (0.39%)    326 (0.10%)    1064 (0.70%)    59          35         0.591816

FIGURE 12. Clock frequency comparison between Da Silva et al. [29] and the proposed architecture, 16 bit Q-Matrix values.

FIGURE 13. Energy required to update one Q-Matrix element: comparison between Da Silva et al. [29] and the proposed architecture, 16 bit values.

Figure 12 compares the maximum clock frequency for different numbers of states and actions. Our system is more than 3 times faster and its speed is almost independent of the number of states of the Q-Matrix.
Figure 13 compares the energy required to update a single Q-Matrix element for different numbers of states and actions. Also in this case our architecture, except for the N=6 Z=4 case, presents a better energy efficiency, which remains almost unaltered when increasing the number of states.
It is important to highlight that the most evident difference between the proposed architecture and [29] is its independence from the environment and the agent's policy. This happens because the system in [29] cannot be used as a general-purpose hardware accelerator, since the RL environment is mapped on the FPGA. Our system does not have such a limitation.

IV. CONCLUSIONS
In this paper we proposed an efficient hardware implementation for the Reinforcement Learning algorithm called Q-Learning. Our architecture exploits the learning formula by selecting a priori the required element of the Q-Matrix to be updated. This approach made it possible to minimize the hardware resources.
We also presented an alternative method for reducing the computational complexity of the algorithm by employing approximated multipliers instead of full multipliers. This technique is an effective solution to implement the accelerator on small ultra-low-power FPGAs for IoT applications.
Our architecture has been compared to the state of the art in the literature, showing that our solution requires a smaller amount of hardware resources, is faster and dissipates less power. Moreover, our system can be used as a general-purpose hardware accelerator for the Q-Learning algorithm, not being tied to a particular RL environment or agent's policy.
With little effort, the proposed approach can also be exploited to implement the on-policy version of the Q-Learning algorithm: SARSA. This aspect is further explored in Appendix A.
For the above reasons, our architecture is suitable for high-throughput and low-power applications. Due to the small amount of required resources, it also allows the implementation of multiple Q-Learning agents on the same device, both on FPGA and ASIC.

APPENDIX A SARSA ACCELERATOR ARCHITECTURE
The proposed architecture for the acceleration of the Q-Learning algorithm can be easily exploited to implement the SARSA (State-Action-Reward-State-Action) [15] algorithm.

Equation (7) shows the SARSA update formula for the Q-Matrix.

Q_new(s_t, a_t) = Q(s_t, a_t) + α·( r_t + γ·Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) )    (7)

Comparing (2) to (7), it is straightforward to note the similarities between the two equations. Since the update of the Q-Matrix depends on the agent's next action a_{t+1}, the SARSA algorithm is the on-policy version of the Q-Learning algorithm (which is off-policy).
The resulting architecture is presented in Fig. 14. The main difference with respect to the Q-Learning implementation in Fig. 3 consists in the replacement of the MAX block with a multiplexer driven by the next action a_{t+1}.

FIGURE 14. SARSA accelerator architecture.

The analysis about the Q-Learning architecture can also be extended to the SARSA accelerator.
REFERENCES
[1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[2] Y. S. Abu-Mostafa, M. Magdon-Ismail, and H.-T. Lin, Learning from Data. AMLBook, New York, NY, USA, 2012, vol. 4.
[3] S. Levine, C. Finn, T. Darrell, and P. Abbeel, "End-to-end training of deep visuomotor policies," The Journal of Machine Learning Research, vol. 17, no. 1, pp. 1334–1373, 2016.
[4] A. Konar, I. G. Chakraborty, S. J. Singh, L. C. Jain, and A. K. Nagar, "A deterministic improved Q-learning for path planning of a mobile robot," IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 43, no. 5, pp. 1141–1153, 2013.
[5] J.-L. Lin, K.-S. Hwang, W.-C. Jiang, and Y.-J. Chen, "Gait balance and acceleration of a biped robot based on Q-learning," IEEE Access, vol. 4, pp. 2439–2449, 2016.
[6] J. Zhu, Y. Song, D. Jiang, and H. Song, "A new deep-Q-learning-based transmission scheduling mechanism for the cognitive Internet of Things," IEEE Internet of Things Journal, vol. 5, no. 4, pp. 2375–2385, 2017.
[7] C. Wei, Z. Zhang, W. Qiao, and L. Qu, "Reinforcement-learning-based intelligent maximum power point tracking control for wind energy conversion systems," IEEE Transactions on Industrial Electronics, vol. 62, no. 10, pp. 6360–6370, 2015.
[8] Y. Deng, F. Bao, Y. Kong, Z. Ren, and Q. Dai, "Deep direct reinforcement learning for financial signal representation and trading," IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 3, pp. 653–664, 2016.
[9] M. Matta, G. C. Cardarilli, L. Di Nunzio, R. Fazzolari, D. Giardino, A. Nannarelli, M. Re, and S. Spanò, "A Reinforcement Learning-Based QAM/PSK Symbol Synchronizer," IEEE Access, vol. 7, pp. 124147–124157, 2019.
[10] A. He, K. K. Bae, T. R. Newman, J. Gaeddert, K. Kim, R. Menon, L. Morales-Tirado, Y. Zhao, J. H. Reed, W. H. Tranter et al., "A survey of artificial intelligence for cognitive radios," IEEE Transactions on Vehicular Technology, vol. 59, no. 4, pp. 1578–1592, 2010.
[11] Q. Wang, H. Liu, K. Gao, and L. Zhang, "Improved Multi-Agent Reinforcement Learning for Path Planning-Based Crowd Simulation," IEEE Access, vol. 7, pp. 73841–73855, 2019.
[12] M. Jiang, T. Hai, Z. Pan, H. Wang, Y. Jia, and C. Deng, "Multi-Agent Deep Reinforcement Learning for Multi-Object Tracker," IEEE Access, vol. 7, pp. 32400–32407, 2019.
[13] M. Matta, G. C. Cardarilli, L. Di Nunzio, R. Fazzolari, D. Giardino, M. Re, F. Silvestri, and S. Spanò, "Q-RTS: a real-time swarm intelligence based on multi-agent Q-learning," Electronics Letters, vol. 55, no. 10, pp. 589–591, 2019.
[14] X. Gan, H. Guo, and Z. Li, "A New Multi-Agent Reinforcement Learning Method based on Evolving Dynamic Correlation Matrix," IEEE Access, 2019, in press.
[15] G. A. Rummery and M. Niranjan, "On-Line Q-Learning Using Connectionist Systems," University of Cambridge, Department of Engineering, Cambridge, England, Tech. Rep., 1994.
[16] C. J. Watkins and P. Dayan, "Q-Learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.
[17] B. Jang, M. Kim, G. Harerimana, and J. W. Kim, "Q-Learning Algorithms: A Comprehensive Classification and Applications," IEEE Access, vol. 7, pp. 133653–133667, 2019.
[18] K.-S. Hwang, Y.-P. Hsu, H.-W. Hsieh, and H.-Y. Lin, "Hardware implementation of FAST-based reinforcement learning algorithm," in Proceedings of the 2005 IEEE International Workshop on VLSI Design and Video Technology. IEEE, 2005, pp. 435–438.
[19] A. Pérez and E. Sanchez, "The FAST architecture: a neural network with flexible adaptable-size topology," in Proceedings of the Fifth International Conference on Microelectronics for Neural Networks. IEEE, 1996, pp. 337–340.
[20] S. Geva and J. Sitte, "A cartpole experiment benchmark for trainable controllers," IEEE Control Systems Magazine, vol. 13, no. 5, pp. 40–51, 1993.
[21] S. Shao, J. Tsai, M. Mysior, W. Luk, T. Chau, A. Warren, and B. Jeppesen, "Towards hardware accelerated reinforcement learning for application-specific robotic control," in 2018 IEEE 29th International Conference on Application-specific Systems, Architectures and Processors (ASAP). IEEE, 2018, pp. 1–8.
[22] P. R. Gankidi and J. Thangavelautham, "FPGA architecture for deep learning and its application to planetary robotics," in 2017 IEEE Aerospace Conference. IEEE, 2017, pp. 1–9.
[23] V. François-Lavet, P. Henderson, R. Islam, M. G. Bellemare, J. Pineau et al., "An Introduction to Deep Reinforcement Learning," Foundations and Trends® in Machine Learning, vol. 11, no. 3-4, pp. 219–354, 2018.
[24] J. Su, J. Liu, D. B. Thomas, and P. Y. Cheung, "Neural network based reinforcement learning acceleration on FPGA platforms," ACM SIGARCH Computer Architecture News, vol. 44, no. 4, pp. 68–73, 2017.
[25] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust Region Policy Optimization," in International Conference on Machine Learning, 2015, pp. 1889–1897.
[26] H. Cho, P. Oh, J. Park, W. Jung, and J. Lee, "FA3C: FPGA-Accelerated Deep Reinforcement Learning," in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, 2019, pp. 499–513.
[27] A. K. Mackworth, "Consistency in networks of relations," Artificial Intelligence, vol. 8, no. 1, pp. 99–118, 1977.
[28] M.-J. Li, A.-H. Li, Y.-J. Huang, and S.-I. Chu, "Implementation of Deep Reinforcement Learning," in Proceedings of the 2019 2nd International Conference on Information Science and Systems. ACM, 2019, pp. 232–236.
[29] L. M. Da Silva, M. F. Torquato, and M. A. Fernandes, "Parallel implementation of reinforcement learning Q-learning technique for FPGA," IEEE Access, vol. 7, pp. 2782–2798, 2018.
[30] Z. Liu and I. Elhanany, "Large-scale tabular-form hardware architecture for Q-Learning with delays," in 2007 50th Midwest Symposium on Circuits and Systems. IEEE, 2007, pp. 827–830.
[31] B. Yuce, H. F. Ugurdag, S. Gören, and G. Dündar, "Fast and efficient circuit topologies for finding the maximum of n k-bit numbers," IEEE Transactions on Computers, vol. 63, no. 8, pp. 1868–1881, 2014.
[32] G. C. Cardarilli, L. Di Nunzio, R. Fazzolari, M. Re, and S. Spanò, "AW-SOM, an Algorithm for High-speed Learning in Hardware Self-Organizing Maps," IEEE Transactions on Circuits and Systems II: Express Briefs, 2019, in press.
[33] M. R. Pillmeier, M. J. Schulte, and E. G. Walters III, "Design alternatives for barrel shifters," in Advanced Signal Processing Algorithms, Architectures, and Implementations XII, vol. 4791. International Society for Optics and Photonics, 2002, pp. 436–447.
[34] K. H. Abed and R. E. Siferd, "VLSI implementations of low-power leading-one detector circuits," in Proceedings of the IEEE SoutheastCon 2006. IEEE, 2006, pp. 279–284.
[35] H. Qi, O. Ayorinde, and B. H. Calhoun, "An ultra-low-power FPGA for IoT applications," in 2017 IEEE SOI-3D-Subthreshold Microelectronics Technology Unified Conference (S3S). IEEE, 2017, pp. 1–3.
[36] Microsemi, "FPGAs," https://www.microsemi.com/product-directory/fpga-soc/1638-fpgas.
[37] Xilinx, "Vivado Design Suite User Guide - Synthesis," June 12th 2019, https://www.xilinx.com/support/documentation/sw_manuals/xilinx2019_1/ug901-vivado-synthesis.pdf.
[38] A. P. Chandrakasan, S. Sheng, and R. W. Brodersen, "Low-power CMOS digital design," IEEE Journal of Solid-State Circuits, vol. 75, no. 4, pp. 473–484, 1992.
[39] Xilinx, "Synthesis and Simulation Design Guide," December 18th 2012, https://www.xilinx.com/support/documentation/sw_manuals/xilinx14_7/sim.pdf.

SERGIO SPANÒ received the B.S. degree in Electronic Engineering from the University of "Tor Vergata", Rome, Italy, in 2015 and the M.S. degree (summa cum laude) in Electronic Engineering from the same University in 2018. He is currently pursuing the Ph.D. degree in Electronic Engineering in Tor Vergata as a member of the DSPVLSI research group. His interests include Digital Signal Processing, Machine Learning, Telecommunications and ASIC/FPGA hardware design. He had industrial experiences in the Space and Telecommunications field. His current research topics relate to Machine Learning hardware implementations for embedded and low-power systems.

LUCA DI NUNZIO is Adjunct Professor in Digital Electronics Lab at the University of Rome "Tor Vergata" and Adjunct Professor of Digital Electronics at the University Guglielmo Marconi. He received the Master Degree (summa cum laude) in Electronics Engineering in 2006 at the University of Rome "Tor Vergata" and the PhD in Systems and Technologies for the Space in 2010 in the same University. He has a working history with several companies in the fields of Electronics and Communications. His research activity is in the fields of Reconfigurable Computing, Communication circuits, Digital Signal Processing and Machine Learning.

GIAN CARLO CARDARILLI was born in Rome, Italy, and received the laurea (summa cum laude) from the University of Rome "La Sapienza" in 1981. He has been with the University of Rome "Tor Vergata" since 1984. At present, he is full professor of digital electronics and electronics for communication systems at the University of Rome "Tor Vergata" (dept. of Electronic Engineering). During 1992-1994, he was with the University of L'Aquila. During 1987-1988, he was with the Circuits and Systems team at EPFL, Lausanne, Switzerland. His interests are in the area of VLSI architectures for Signal Processing and IC design. In this field, he published more than 160 papers in international journals and conferences. He also has regular cooperation with companies such as: Alcatel Alenia Space, Italy; STM, Agrate Brianza, Italy; Micron, Italy; Selex S.I., Italy. His scientific interest concerns the design of special architectures for signal processing. He works in the field of computer arithmetic and its application to the design of fast digital signal processors.

ROCCO FAZZOLARI received the Master degree in Electronic Engineering in May 2009 and the Ph.D. in "Space Systems and Technologies" in June 2013 at the University of Rome "Tor Vergata" (Italy). At present, he is a postdoctoral fellow and assistant professor at the University of Rome "Tor Vergata" (dept. of Electronic Engineering). He works on hardware implementation of high-speed systems for Digital Signal Processing, Machine Learning, Arrays of Wireless Sensor Networks and systems for data analysis of Acoustic Emission (AE) sensors (based on ultrasonic waves).


DANIELE GIARDINO received the B.S. and M.S. degrees in electronic engineering at the University of Rome "Tor Vergata" (Italy) in 2015 and 2017, respectively. He is currently a Ph.D. student at the same institute in electronic engineering and is a member of the research group DSPVLSI. He works on digital development for wideband signal architectures, telecommunications, digital signal processing, and machine learning. Specifically, he is focused on the digital implementation of MIMO systems for wideband signals.

MARCO RE (M'92) holds a Ph.D. in Microelectronics and he is Associate Professor at the University of Rome "Tor Vergata", where he teaches Digital Electronics and Hardware Architectures for DSP. He was awarded two NATO fellowships at the University of California at Berkeley, working as a visiting scientist at Cadence Berkeley Laboratories. He has been awarded the Otto Moensted fellowship as visiting professor at the Technical University of Denmark. Marco collaborates in many research projects with different companies in the field of DSP architectures and algorithms. His main scientific interests are in the field of: low-power DSP algorithms and architectures, Hardware-Software Codesign, Fuzzy Logic and Neural hardware architectures, low-power digital implementations based on non-traditional number systems, Computer Arithmetic and CAD tools for DSP. He is author of about 200 papers in international journals and international conferences. He is a member of the IEEE and of the Audio Engineering Society (AES). He is Director of a Master in Audio Engineering at the University of Rome "Tor Vergata", Department of Electronic Engineering.

MARCO MATTA was born in Cagliari, Italy, in


1989. He received the B.S. degree in electronic
engineering from the University of Rome “Tor
Vergata", Italy, in 2014 and the M.S. degree in
electronic engineering from the same institute in
2017. Since 2017, he has been a member of the
DSPVLSI research group of the same University.
He is currently pursuing the Ph.D. degree in elec-
tronic engineering. His research interests include
the development of hardware platforms and low-
power accelerators aimed at Machine Learning algorithms and telecommu-
nications. In particular, he is currently focused on the implementation of
Reinforcement Learning models on FPGA.

ALBERTO NANNARELLI (S’94-M’99-SM’13)


graduated in electrical engineering from the Uni-
versity of Roma “La Sapienza,” Roma, Italy, in
1988, and received the M.S. and the Ph.D. degrees
in electrical and computer engineering from the
University of California at Irvine, CA, USA, in
1995 and 1999, respectively. He is an Associate
Professor at the Technical University of Denmark,
Lyngby, Denmark. He worked for SGS-Thomson
Microelectronics and for Ericsson Telecom as a
Design Engineer and for Rockwell Semiconductor Systems as a summer
intern. From 1999 to 2003, he was with the Department of Electrical
Engineering, University of Roma “Tor Vergata,” Italy, as a Postdoctoral
Researcher. His research interests include computer arithmetic, computer
architecture, and VLSI design. Dr. Nannarelli is a Senior Member of the
IEEE Computer Society.
