An Efficient Hardware Implementation of Reinforcement Learning: the Q-Learning Algorithm

Sergio Spanò et al.

IEEE Access, DOI 10.1109/ACCESS.2019.2961174
ABSTRACT In this paper we propose an efficient hardware architecture that implements the Q-Learning algorithm, suitable for real-time applications. Its main features are low power consumption, high throughput and limited hardware resource usage. We also propose a technique based on approximated multipliers to reduce the hardware complexity of the algorithm. We implemented the design on a Xilinx Zynq UltraScale+ MPSoC ZCU106 Evaluation Kit. The implementation results are evaluated in terms of hardware resources, throughput and power consumption. The architecture is compared with the state-of-the-art Q-Learning hardware accelerators presented in the literature, obtaining better results in speed, power and hardware resources. Experiments using different sizes for the Q-Matrix and different wordlengths for the fixed-point arithmetic are presented. With a Q-Matrix of size 8×4 (8-bit data) we achieved a throughput of 222 MSPS (Mega Samples Per Second) and a dynamic power consumption of 37 mW, while with a Q-Matrix of size 256×16 (32-bit data) we achieved a throughput of 93 MSPS and a power consumption of 611 mW. Due to the small amount of hardware resources required by the accelerator, our system is suitable for multi-agent IoT applications. Moreover, the architecture can be used to implement the SARSA (State-Action-Reward-State-Action) Reinforcement Learning algorithm with minor modifications.
INDEX TERMS Artificial Intelligence, Hardware Accelerator, Machine Learning, Q-Learning, Reinforcement Learning, SARSA, FPGA, ASIC, IoT, Multi-agent
… on different devices: FPGA (Intel Stratix-V), CPU (Intel i7-5930K) and GPU (Nvidia Tesla-C2070). With respect to the CPU, the authors obtained speed-up factors of 4.14× for the GPU and 19.29× for the FPGA implementation, respectively.

The most recent works (published in 2019) include Cho et al. [26]. They propose a hardware accelerator for the "Asynchronous Advantage Actor-Critic" (A3C) algorithm [27], describing an implementation based on a Xilinx VCU1525 FPGA. The system was validated using six Atari 2600 videogames.
In the work by Huang et al. [28] another Deep Q-Learning network was implemented on a Digilent Pynq development board for the cart-pole problem. The system is meant only for inference and, consequently, cannot be used for real-time learning.

One of the most advanced hardware accelerators for Q-Learning was proposed by Da Silva et al. [29]. The authors presented an implementation based on a Xilinx Virtex-6 FPGA. Moreover, they performed a fixed-point analysis to confirm the convergence of the algorithm. Different comparisons with state-of-the-art implementations were made. Since this is one of the best performing Q-Learning accelerators to date, we provide an extensive comparison with our architecture (sec. III-B).
II. PROPOSED ARCHITECTURE
The Q-Learning agent shown in Fig. 2 is composed of two main blocks: the Policy Generator (PG) and the Q-Learning accelerator.

FIGURE 2. High level architecture of the Q-Learning agent.

The agent receives the state st+1 and the reward rt+1 from the observer, while the next action is generated by the PG according to the values of the Q-Matrix stored into the Q-Learning accelerator.

Note that st, at and rt are obtained by delaying st+1, at+1 and rt+1 by means of registers. st and at represent the indices of the rows and columns of the Q-Matrix, respectively. These delays do not affect the convergence of the Q-Learning algorithm, as proved in [30].

With the aim of designing a general purpose hardware accelerator, we do not provide a particular implementation for the PG, since it is application-defined. The PG has been included in Fig. 2 for completeness.

FIGURE 3. Q-Learning accelerator architecture.

The Q-Matrix is stored in Z Action RAMs (Fig. 3); consequently, we have one memory block per action. Each RAM contains an entire column of the Q-Matrix, and the number of memory locations corresponds to the number of states N. The read address is the next state st+1, while the write address is the current state st. The enable signals for the Action RAMs, generated by a decoder driven by the current action at, select the value Q(st, at) to be updated. The Action RAM outputs correspond to a row of the Q-Matrix, Q(st+1, A).
The signal Q(st, at) is obtained by delaying the output of the memory blocks and then selecting the Action RAM through a multiplexer driven by at. A MAX block fed by the outputs of the Action RAMs generates max_a Q(st+1, a).

The Q-Updater (Q-Upd) block implements the Q-Matrix update equation (1), generating Qnew(st, at) to be stored into the corresponding Action RAM.

The accelerator can also be used for Deep Q-Learning [23] applications if the Action RAMs are replaced with Neural Network-based approximators.

A. MAX BLOCK
An extensive study of this block has been proposed in [30]. In that paper, the authors proved that the propagation delay of this block is the main limitation on the speed of Q-Learning accelerators when a large number of actions is required. Consequently, they propose an implementation based on a tree of binary comparators (M stages) that is a good trade-off between area and speed [31]. This architecture is employed by the Q-Learning accelerators presented in [22], [29] and has also been used in our architecture (Fig. 4).

Moreover, in [30] it is proved that, when pipelining is used to speed up the MAX block, the latency does not affect the convergence of the Q-Learning algorithm. This means that, when an application requires a very high throughput, it is possible to use pipelining.
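A behavioral sketch of this comparator tree, assuming the pairwise-max structure of Fig. 4 with ⌈log₂(Z)⌉ stages (the Python function is illustrative):

```python
def max_tree(q_row):
    # Tree of binary comparators: each while-iteration is one of the
    # ceil(log2(Z)) stages; in hardware, a pipeline register can be
    # placed after every stage without affecting convergence [30].
    stage = list(q_row)
    while len(stage) > 1:
        nxt = [max(stage[i], stage[i + 1]) for i in range(0, len(stage) - 1, 2)]
        if len(stage) % 2:      # an odd survivor passes straight to the next stage
            nxt.append(stage[-1])
        stage = nxt
    return stage[0]             # max over actions of Q(s_next, a)
```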
B. Q-UPDATER BLOCK
Equation (1), i.e. the classical Q-Learning update

$$Q^{new}(s_t, a_t) = (1 - \alpha)\,Q(s_t, a_t) + \alpha \left( r_t + \gamma \max_a Q(s_{t+1}, a) \right), \tag{1}$$

can be rearranged as

$$Q^{new}(s_t, a_t) = Q(s_t, a_t) + \alpha \left( r_t + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right) \tag{2}$$

to obtain an efficient implementation: (2) is computed using 2 multipliers, while (1) requires 3. The Q-Updater block in Fig. 5 computes (2), generating Qnew(st, at). Its critical path consists of 2 multipliers and 2 adders. In the next section (II-B1) a method to reduce the hardware complexity of the multipliers is illustrated.
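In software terms, the rearranged update (2) reduces to the following sketch (variable names are illustrative):

```python
def q_update(q_sa, q_next_max, r, alpha, gamma):
    # Eq. (2): one multiplication by gamma and one by alpha, against
    # the three multiplications of the direct form (1).
    delta = r + gamma * q_next_max - q_sa   # temporal-difference term
    return q_sa + alpha * delta             # Qnew(st, at)
```

The two products, by γ and by α, are exactly the multipliers targeted by the approximation technique of Section II-B1.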
The representation of a value x using M bits for the fractional part is

$$x = x_0 2^0 + x_{-1} 2^{-1} + x_{-2} 2^{-2} + \ldots + x_{-M} 2^{-M}. \tag{3}$$

We denote by i, j and k the positions of the first, second and third '1' in the binary representation of x, starting from the most significant bit. Moreover, we define ⟨x⟩OPn as the approximation of x with n powers of two:

$$\begin{aligned} \langle x \rangle_{OP1} &= 2^{-i} \\ \langle x \rangle_{OP2} &= 2^{-i} + 2^{-j} \\ \langle x \rangle_{OP3} &= 2^{-i} + 2^{-j} + 2^{-k} \end{aligned} \tag{4}$$

for the approximation with one, two and three powers of two; the concept can be extended to more power-of-two terms. For example, x = 0.101101(2) = 0.703125 can be approximated as

$$\begin{aligned} \langle 0.101101_{(2)} \rangle_{OP1} &= 2^{-1} = 0.5 \\ \langle 0.101101_{(2)} \rangle_{OP2} &= 2^{-1} + 2^{-3} = 0.625 \\ \langle 0.101101_{(2)} \rangle_{OP3} &= 2^{-1} + 2^{-3} + 2^{-4} = 0.6875. \end{aligned} \tag{5}$$

FIGURE 4. MAX block tree architecture for Z = 6 (M = ⌈log₂(Z)⌉ comparator stages).

FIGURE 5. Q-Matrix updater block architecture.

Some examples of the approximated values for different …
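A software model of this power-of-two approximation, under the definitions of (3)-(5) (the function names and the 6-bit fractional word are illustrative choices):

```python
def leading_one_positions(x, n_terms, frac_bits=6):
    # Positions i, j, k, ... of the most significant '1' bits of x (Eq. 3).
    # In hardware these are either constants or the outputs of
    # Leading-One-Detectors (LOD) when alpha and gamma change at run time.
    word = int(round(x * 2 ** frac_bits))      # fixed-point fractional word
    return [m for m in range(1, frac_bits + 1)
            if (word >> (frac_bits - m)) & 1][:n_terms]

def approx_mul(q, x, n_terms=2):
    # <x>_OPn multiplier of Eq. (4): the product q*x collapses to n
    # right-shifts of q plus additions, so no true multiplier is needed.
    return sum(q * 2.0 ** -m for m in leading_one_positions(x, n_terms))

# Reproducing Eq. (5) for x = 0.101101(2) = 0.703125:
assert approx_mul(1.0, 0.703125, 1) == 0.5
assert approx_mul(1.0, 0.703125, 2) == 0.625
assert approx_mul(1.0, 0.703125, 3) == 0.6875
```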
[Figure: example bit layout for M = 5: x = x₀ . x₋₁ x₋₂ x₋₃ x₋₄ x₋₅.]

FIGURE 7. Q-Matrix updater block with the multipliers implemented by a single barrel shifter.

The positions of the leading ones i and j in the representations of α and γ can be given as inputs, if they are constant for the whole computation, or determined by Leading-One-Detectors (LOD) [34] if the values are modified at run time.

… different applications, which proved to be almost insensitive to the approximation error, since the convergence conditions of Q-Learning are still satisfied (α, γ ≤ 1). By using approximated multipliers, we avoid the use of FP…
… obtained using the Vivado 2019.1 EDA tool with default implementation parameters and a timing constraint of 2 ns. The system was coded in VHDL.

The design exploration covered the following ranges of parameters:
• Number of bits for the Q-Matrix values: 8, 16 and 32 bit.
• Number of states N: 8, 16, 32, 64, 128 and 256.
• Number of actions Z: 4, 8 and 16.

We focused the implementation analysis on the following resources [37]:
• Look Up Tables (LUT);
• Look Up Tables used as RAM (LUTRAM);
• Flip-Flops (FF);
• Digital Signal Processing blocks (DSP);
reporting, for each resource, the usage with respect to the total available. The performance was measured in terms of maximum clock frequency (CLK) and dynamic power consumption (PWR). The latter was evaluated using Vivado after the …

TABLE 1. Implementation results for Q-Matrices with 8 bit data and Z=4 (percentages: usage with respect to the total available).

| N   | LUT         | LUTRAM      | FF          | DSP       | CLK (MHz) | PWR (mW) |
|-----|-------------|-------------|-------------|-----------|-----------|----------|
| 8   | 193 (0.08%) | 32 (0.03%)  | 154 (0.03%) | 0 (0.00%) | 222       | 37       |
| 16  | 192 (0.08%) | 32 (0.03%)  | 156 (0.03%) | 0 (0.00%) | 240       | 37       |
| 32  | 193 (0.08%) | 32 (0.03%)  | 158 (0.03%) | 0 (0.00%) | 234       | 41       |
| 64  | 214 (0.09%) | 64 (0.06%)  | 160 (0.03%) | 0 (0.00%) | 227       | 38       |
| 128 | 271 (0.12%) | 80 (0.08%)  | 162 (0.04%) | 0 (0.00%) | 239       | 45       |
| 256 | 362 (0.16%) | 160 (0.16%) | 164 (0.04%) | 0 (0.00%) | 233       | 61       |
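For reference, the explored configuration grid listed above can be enumerated programmatically (a sketch; the parameter names and the sweep driver are illustrative, not the paper's actual scripts):

```python
from itertools import product

# Full design-space grid: 3 wordlengths x 6 state counts x 3 action
# counts = 54 configurations. Tables 1-9 cover this grid, one table
# per (wordlength, Z) pair, with one row per value of N.
WORDLENGTHS = (8, 16, 32)            # bits per Q-Matrix value
STATES = (8, 16, 32, 64, 128, 256)   # N
ACTIONS = (4, 8, 16)                 # Z

for bits, n, z in product(WORDLENGTHS, STATES, ACTIONS):
    config = {"data_bits": bits, "states": n, "actions": z}
    # ... generate the VHDL generics and run synthesis for `config` ...
```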
TABLE 2. Implementation results for Q-Matrices with 8 bit data and Z=8.
TABLE 3. Implementation results for Q-Matrices with 8 bit data and Z=16.
TABLE 4. Implementation results for Q-Matrices with 16 bit data and Z=4.
TABLE 5. Implementation results for Q-Matrices with 16 bit data and Z=8.
TABLE 6. Implementation results for Q-Matrices with 16 bit data and Z=16.
TABLE 7. Implementation results for Q-Matrices with 32 bit data and Z=4.
TABLE 8. Implementation results for Q-Matrices with 32 bit data and Z=8.
TABLE 9. Implementation results for Q-Matrices with 32 bit data and Z=16.
FIGURE 9. Average clock frequency for different Q-Matrix data bit-widths (8, 16 and 32 bit) vs the number of actions Z. [Axes: clock frequency, 80 to 240 MHz; Z from 4 to 16.]

[Figure: LUT usage (0 to 4000) vs the number of states N (up to 300) and the Q-values quantization (bit), with series for Z = 4, 8 and 16.]
TABLE 10. Implementation comparisons: approximated and full multipliers with 8 bit operands.
TABLE 13. Da Silva et al. [29] implementation results for 16 bit Q-Matrix values and Z=4.
TABLE 15. Da Silva et al. [29] implementation results for 16 bit Q-Matrix values and Z=8.

TABLE 16. Proposed implementation results for 16 bit Q-Matrix values and Z=8.

FIGURE 13. Energy required to update one Q-Matrix element: comparison between Da Silva et al. [29] and the proposed architecture, 16 bit values. [Energy (nJ), 0 to 5, vs the number of states N, 0 to 250; curves: Da Silva Z=4, Da Silva Z=8, Proposed Z=4, Proposed Z=8. The recoverable entries of Table 15 show Da Silva's energy per update growing from about 0.48 nJ to about 3.33 nJ across the reported configurations.]
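The energy metric of Fig. 13 can be reproduced from the tables (a sketch; it assumes, as the matching CLK and MSPS figures suggest, one Q-Matrix update per clock cycle):

```python
def energy_per_update_nj(dyn_power_mw, clock_mhz):
    # Energy per Q-Matrix update = dynamic power / update rate.
    # With one update per cycle, mW divided by MHz gives nJ directly:
    # (1e-3 W) / (1e6 updates/s) = 1e-9 J per update.
    return dyn_power_mw / clock_mhz

# Example with the 8x4, 8-bit configuration of Table 1:
energy_per_update_nj(37, 222)   # ~0.17 nJ per updated element
```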
Equation (7) shows the SARSA update formula for the Q-Matrix:

$$Q^{new}(s_t, a_t) = Q(s_t, a_t) + \alpha \left( r_t + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right) \tag{7}$$

Comparing (2) to (7), it is straightforward to note the similarities between the two equations. Since the update of the Q-Matrix depends on the agent's next action at+1, the SARSA algorithm is the on-policy version of the Q-Learning algorithm (which is off-policy).

The resulting architecture is presented in Fig. 14. The main difference with respect to the Q-Learning implementation in Fig. 3 consists in the replacement of the MAX block with a multiplexer driven by the next action at+1.

The analysis of the Q-Learning architecture can also be extended to the SARSA accelerator.
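A minimal sketch of the two updates side by side, assuming a dense Q-Matrix q indexed as q[state][action] (names are illustrative):

```python
def q_learning_update(q, s, a, r, s_next, alpha, gamma):
    # Eq. (2), off-policy: bootstraps on the greedy value (MAX block).
    q[s][a] += alpha * (r + gamma * max(q[s_next]) - q[s][a])

def sarsa_update(q, s, a, r, s_next, a_next, alpha, gamma):
    # Eq. (7), on-policy: bootstraps on the action actually taken next;
    # in hardware the MAX block becomes a multiplexer driven by a_next.
    q[s][a] += alpha * (r + gamma * q[s_next][a_next] - q[s][a])
```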
REFERENCES
[1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[2] Y. S. Abu-Mostafa, M. Magdon-Ismail, and H.-T. Lin, Learning from Data. New York, NY, USA: AMLBook, 2012, vol. 4.
[3] S. Levine, C. Finn, T. Darrell, and P. Abbeel, "End-to-end training of deep visuomotor policies," The Journal of Machine Learning Research, vol. 17, no. 1, pp. 1334–1373, 2016.
[4] A. Konar, I. G. Chakraborty, S. J. Singh, L. C. Jain, and A. K. Nagar, "A deterministic improved Q-learning for path planning of a mobile robot," IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 43, no. 5, pp. 1141–1153, 2013.
[5] J.-L. Lin, K.-S. Hwang, W.-C. Jiang, and Y.-J. Chen, "Gait balance and acceleration of a biped robot based on Q-learning," IEEE Access, vol. 4, pp. 2439–2449, 2016.
[6] J. Zhu, Y. Song, D. Jiang, and H. Song, "A new deep-Q-learning-based transmission scheduling mechanism for the cognitive Internet of Things," IEEE Internet of Things Journal, vol. 5, no. 4, pp. 2375–2385, 2017.
[7] C. Wei, Z. Zhang, W. Qiao, and L. Qu, "Reinforcement-learning-based intelligent maximum power point tracking control for wind energy conversion systems," IEEE Transactions on Industrial Electronics, vol. 62, no. 10, pp. 6360–6370, 2015.
[8] Y. Deng, F. Bao, Y. Kong, Z. Ren, and Q. Dai, "Deep direct reinforcement learning for financial signal representation and trading," IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 3, pp. 653–664, 2016.
[9] M. Matta, G. C. Cardarilli, L. Di Nunzio, R. Fazzolari, D. Giardino, A. Nannarelli, M. Re, and S. Spanò, "A Reinforcement Learning-Based QAM/PSK Symbol Synchronizer," IEEE Access, vol. 7, pp. 124147–124157, 2019.
[10] A. He, K. K. Bae, T. R. Newman, J. Gaeddert, K. Kim, R. Menon, L. Morales-Tirado, Y. Zhao, J. H. Reed, W. H. Tranter et al., "A survey of artificial intelligence for cognitive radios," IEEE Transactions on Vehicular Technology, vol. 59, no. 4, pp. 1578–1592, 2010.
[11] Q. Wang, H. Liu, K. Gao, and L. Zhang, "Improved Multi-Agent Reinforcement Learning for Path Planning-Based Crowd Simulation," IEEE Access, vol. 7, pp. 73841–73855, 2019.
[12] M. Jiang, T. Hai, Z. Pan, H. Wang, Y. Jia, and C. Deng, "Multi-Agent Deep Reinforcement Learning for Multi-Object Tracker," IEEE Access, vol. 7, pp. 32400–32407, 2019.
[13] M. Matta, G. C. Cardarilli, L. Di Nunzio, R. Fazzolari, D. Giardino, M. Re, F. Silvestri, and S. Spanò, "Q-RTS: a real-time swarm intelligence based on multi-agent Q-learning," Electronics Letters, vol. 55, no. 10, pp. 589–591, 2019.
[14] X. Gan, H. Guo, and Z. Li, "A New Multi-Agent Reinforcement Learning Method Based on Evolving Dynamic Correlation Matrix," IEEE Access, 2019, in press.
[15] G. A. Rummery and M. Niranjan, "On-Line Q-Learning Using Connectionist Systems," University of Cambridge, Department of Engineering, Cambridge, England, Tech. Rep., 1994.
[16] C. J. Watkins and P. Dayan, "Q-Learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.
[17] B. Jang, M. Kim, G. Harerimana, and J. W. Kim, "Q-Learning Algorithms: A Comprehensive Classification and Applications," IEEE Access, vol. 7, pp. 133653–133667, 2019.
[18] K.-S. Hwang, Y.-P. Hsu, H.-W. Hsieh, and H.-Y. Lin, "Hardware implementation of FAST-based reinforcement learning algorithm," in Proceedings of the 2005 IEEE International Workshop on VLSI Design and Video Technology. IEEE, 2005, pp. 435–438.
[19] A. Pérez and E. Sanchez, "The FAST architecture: a neural network with flexible adaptable-size topology," in Proceedings of the Fifth International Conference on Microelectronics for Neural Networks. IEEE, 1996, pp. 337–340.
[20] S. Geva and J. Sitte, "A cartpole experiment benchmark for trainable controllers," IEEE Control Systems Magazine, vol. 13, no. 5, pp. 40–51, 1993.
[21] S. Shao, J. Tsai, M. Mysior, W. Luk, T. Chau, A. Warren, and B. Jeppesen, "Towards hardware accelerated reinforcement learning for application-specific robotic control," in 2018 IEEE 29th International Conference on Application-specific Systems, Architectures and Processors (ASAP). IEEE, 2018, pp. 1–8.
[22] P. R. Gankidi and J. Thangavelautham, "FPGA architecture for deep learning and its application to planetary robotics," in 2017 IEEE Aerospace Conference. IEEE, 2017, pp. 1–9.
[23] V. François-Lavet, P. Henderson, R. Islam, M. G. Bellemare, J. Pineau et al., "An Introduction to Deep Reinforcement Learning," Foundations and Trends in Machine Learning, vol. 11, no. 3-4, pp. 219–354, 2018.
[24] J. Su, J. Liu, D. B. Thomas, and P. Y. Cheung, "Neural network based reinforcement learning acceleration on FPGA platforms," ACM SIGARCH Computer Architecture News, vol. 44, no. 4, pp. 68–73, 2017.
[25] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust Region Policy Optimization," in International Conference on Machine Learning, 2015, pp. 1889–1897.
[26] H. Cho, P. Oh, J. Park, W. Jung, and J. Lee, "FA3C: FPGA-Accelerated Deep Reinforcement Learning," in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, 2019, pp. 499–513.
[27] A. K. Mackworth, "Consistency in networks of relations," Artificial Intelligence, vol. 8, no. 1, pp. 99–118, 1977.
[28] M.-J. Li, A.-H. Li, Y.-J. Huang, and S.-I. Chu, "Implementation of Deep Reinforcement Learning," in Proceedings of the 2019 2nd International Conference on Information Science and Systems. ACM, 2019, pp. 232–236.
[29] L. M. Da Silva, M. F. Torquato, and M. A. Fernandes, "Parallel implementation of reinforcement learning Q-learning technique for FPGA," IEEE Access, vol. 7, pp. 2782–2798, 2018.
[30] Z. Liu and I. Elhanany, "Large-scale tabular-form hardware architecture for Q-Learning with delays," in 2007 50th Midwest Symposium on Circuits and Systems. IEEE, 2007, pp. 827–830.
[31] B. Yuce, H. F. Ugurdag, S. Gören, and G. Dündar, "Fast and efficient circuit topologies for finding the maximum of n k-bit numbers," IEEE Transactions on Computers, vol. 63, no. 8, pp. 1868–1881, 2014.
[32] G. C. Cardarilli, L. Di Nunzio, R. Fazzolari, M. Re, and S. Spanò, "AW-SOM, an Algorithm for High-speed Learning in Hardware Self-Organizing Maps," IEEE Transactions on Circuits and Systems II: Express Briefs, 2019, in press.
[33] M. R. Pillmeier, M. J. Schulte, and E. G. Walters III, "Design alternatives for barrel shifters," in Advanced Signal Processing Algorithms, Architectures, and Implementations XII, vol. 4791. International Society for Optics and Photonics, 2002, pp. 436–447.
[34] K. H. Abed and R. E. Siferd, "VLSI implementations of low-power leading-one detector circuits," in Proceedings of the IEEE SoutheastCon 2006. IEEE, 2006, pp. 279–284.
[35] H. Qi, O. Ayorinde, and B. H. Calhoun, "An ultra-low-power FPGA for IoT applications," in 2017 IEEE SOI-3D-Subthreshold Microelectronics Technology Unified Conference (S3S). IEEE, 2017, pp. 1–3.
[36] Microsemi, "FPGAs," https://www.microsemi.com/product-directory/fpga-soc/1638-fpgas.
[37] Xilinx, "Vivado Design Suite User Guide: Synthesis," June 12, 2019, https://www.xilinx.com/support/documentation/sw_manuals/xilinx2019_1/ug901-vivado-synthesis.pdf.
[38] A. P. Chandrakasan, S. Sheng, and R. W. Brodersen, "Low-power CMOS digital design," IEEE Journal of Solid-State Circuits, vol. 27, no. 4, pp. 473–484, 1992.
[39] Xilinx, "Synthesis and Simulation Design Guide," December 18, 2012, https://www.xilinx.com/support/documentation/sw_manuals/xilinx14_7/sim.pdf.
FIGURE 14. SARSA accelerator architecture: the structure of Fig. 3 with the MAX block replaced by a multiplexer, driven by the next action at+1, that selects Q(st+1, at+1).
SERGIO SPANÒ received the B.S. degree in Electronic Engineering from the University of Rome "Tor Vergata", Rome, Italy, in 2015 and the M.S. degree (summa cum laude) in Electronic Engineering from the same University in 2018. He is currently pursuing the Ph.D. degree in Electronic Engineering at Tor Vergata as a member of the DSPVLSI research group. His interests include Digital Signal Processing, Machine Learning, Telecommunications and ASIC/FPGA hardware design. He has industrial experience in the Space and Telecommunications fields. His current research topics relate to Machine Learning hardware implementations for embedded and low-power systems.

LUCA DI NUNZIO is Adjunct Professor in the Digital Electronics Lab at the University of Rome "Tor Vergata" and Adjunct Professor of Digital Electronics at the University Guglielmo Marconi. He received the Master Degree (summa cum laude) in Electronics Engineering in 2006 from the University of Rome "Tor Vergata" and the Ph.D. in Systems and Technologies for Space in 2010 from the same University. He has a working history with several companies in the fields of Electronics and Communications. His research activity is in the fields of Reconfigurable Computing, Communication circuits, Digital Signal Processing and Machine Learning.
GIAN CARLO CARDARILLI (S'79-M'81) was born in Rome, Italy, and received the laurea (summa cum laude) from the University of Rome "La Sapienza" in 1981. He has been with the University of Rome "Tor Vergata" since 1984, where he is at present full professor of digital electronics and electronics for communication systems. During 1992-1994 he was with the University of L'Aquila, and during 1987-1988 with the Circuits and Systems team at EPFL, Lausanne, Switzerland. His interests are in the area of VLSI architectures for Signal Processing and IC design, a field in which he has published more than 160 papers in international journals and conferences. He also has regular cooperations with companies such as Alcatel Alenia Space, Italy; STM, Agrate Brianza, Italy; Micron, Italy; and Selex S.I., Italy. His scientific interest concerns the design of special architectures for signal processing, and he works in the field of computer arithmetic and its application to the design of fast digital signal processors.

ROCCO FAZZOLARI received the Master degree in Electronic Engineering in May 2009 and the Ph.D. in "Space Systems and Technologies" in June 2013 from the University of Rome "Tor Vergata" (Italy), where he is at present a postdoctoral fellow and assistant professor in the Department of Electronic Engineering. He works on hardware implementation of high-speed systems for Digital Signal Processing, Machine Learning, arrays of Wireless Sensor Networks and systems for data analysis of Acoustic Emission (AE) sensors (based on ultrasonic waves).
DANIELE GIARDINO received the B.S. and M.S. degrees in electronic engineering at the University of Rome "Tor Vergata" (Italy) in 2015 and 2017, respectively. He is currently a Ph.D. student in electronic engineering at the same institute and is a member of the DSPVLSI research group. He works on digital development for wideband signal architectures, telecommunications, digital signal processing, and machine learning. Specifically, he is focused on the digital implementation of MIMO systems for wideband signals.

MARCO RE (M'92) holds a Ph.D. in Microelectronics and is Associate Professor at the University of Rome "Tor Vergata", where he teaches Digital Electronics and Hardware Architectures for DSP. He was awarded two NATO fellowships at the University of California at Berkeley, working as a visiting scientist at Cadence Berkeley Laboratories, and the Otto Moensted fellowship as visiting professor at the Technical University of Denmark. He collaborates in many research projects with different companies in the field of DSP architectures and algorithms. His main scientific interests are in the fields of low-power DSP algorithms and architectures, hardware-software codesign, fuzzy logic and neural hardware architectures, low-power digital implementations based on non-traditional number systems, computer arithmetic and CAD tools for DSP. He is author of about 200 papers in international journals and conferences, a member of the IEEE and of the Audio Engineering Society (AES), and Director of a Master in Audio Engineering at the University of Rome Tor Vergata, Department of Electronic Engineering.