Autonomous Voltage Control for Grid Operation Using Hardware Implementation of
Reinforcement Learning
Rizky Ardi M (13217054)

Devananda (13217061)
Kevin Sutardi (13217088)
School of Electrical Engineering and Informatics ITB
Abstract- Currently, the power system requires a manual regulator to keep the voltage at a stable level, because the load drawn by customers changes frequently. However, other technologies can be applied to overcome this problem. One method is to use artificial intelligence, in particular reinforcement learning. This method can be implemented on an FPGA and is useful for overcoming power system problems. In this experiment, four processes were therefore carried out: a hardware design based on Verilog, a software design based on C, a memory simulation, and system integration on the Zybo board. The first three processes were completed successfully. The integration of the system on the Zybo has been attempted, but the output shown on the terminal is not yet as desired.

Keywords: Reinforcement Learning, Hardware, Software, Memory, System Integration

1. INTRODUCTION

Currently, electrical energy is a source of energy that is widely used by humans. However, producing electric power requires a series of processes in order to obtain the appropriate power and voltage. Then there is the issue of electric power transmission. As is well known, the generation of electric power requires a specific, sizeable site and is quite dangerous for humans. Therefore, the electric generator is located far from the loads it serves, and a mechanism is needed to transmit the power.

A simple model of the power transmission process consists of generators, transformers, line impedances, and electrical loads. Generally, the voltage of the electricity produced by the generator is raised using a step-up transformer. This voltage is then transmitted over a cable to an area close to the settlement, where it is lowered again using a step-down transformer. This process is carried out to reduce the power lost in transmission due to the resistance of the cable lines.

Basically, the load behavior changes over time as energy is used, like a lamp that is only used at night. Thus, various problems arise for the power system model being built. In general, changes in load power can create overvoltage and undervoltage conditions. Therefore, it is necessary to change the parameters of the power system model to prevent overvoltage and undervoltage conditions that can damage the load.

Currently, this voltage regulation process is handled by a dedicated system that relies on human operators. However, along with the development of technology, there are many models that can solve this problem. One model that can resemble human capabilities is artificial intelligence. AI itself has various forms, such as machine learning, deep learning, and reinforcement learning. Reinforcement learning is a method that learns using a Q-table to decide which actions the system must take, based on its current condition, to achieve its goals. One common reinforcement learning model is the maze solver, which has a clear representation of the states and actions. Therefore, this application model can be used to verify the hardware architecture.

Thus, reinforcement learning can be used to solve the voltage regulation problem more effectively, because the learning process does not require humans to operate it. This can be done by defining the state as the present voltage condition and the action as a change to the generator voltage and the transformer taps. Next, the reinforcement learning system has to be realized.

A simple way to realize it is to use an FPGA onto which the reinforcement learning architecture can be synthesized. Using an FPGA to realize machine learning is expected to give better results and higher speed than high-level computing. By combining hardware and software architectures in this way, optimum results are expected.

Thus, in this experiment, a design process is carried out to produce a reinforcement learning architecture in hardware that performs the Q-table calculations and is integrated with software that determines the state. The objectives of this experiment are as follows.
1. Design reinforcement learning hardware based on Verilog and verify it with ModelSim.
2. Create a software design based on the C language for the application used and verify it.
3. Create a memory design on the Zybo and verify it.
4. Integrate memory, hardware, and software to solve the problem.

2. LITERATURE REVIEW

2.1 REINFORCEMENT LEARNING

Reinforcement learning is a machine learning technique that deals with how software agents take actions. Reinforcement learning is one of the three basic machine learning paradigms, alongside supervised learning and unsupervised learning.

This technique differs from supervised learning in that it does not require labeled input/output pairs to be provided, and it does not require sub-optimal actions to be corrected explicitly. Instead, the focus is on finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge).

The environment in RL is usually expressed in the form of a Markov decision process (MDP), because many RL algorithms for this context use dynamic programming techniques. The main difference between classical dynamic programming methods and RL algorithms is that RL algorithms do not assume knowledge of the exact mathematical model of the MDP, and they target large MDPs where exact methods are not feasible.

FIGURE 1 REINFORCEMENT LEARNING FRAMEWORK [3]

2.2 VERILOG DESIGN AND MODELSIM

Verilog is a hardware description language (HDL) used to model electronic systems. This language is most often used in the design and verification of digital circuits at the register-transfer level of abstraction. Apart from that, Verilog is also used in the verification of analog circuits and mixed-signal circuits, as well as in the design of genetic circuits. In 2009, the Verilog standard was merged into the SystemVerilog standard, resulting in IEEE Standard 1800-2009. Since then, Verilog has officially been part of the SystemVerilog language. At the time of writing, the current version is the IEEE 1800-2017 standard.

Hardware description languages such as Verilog are similar to software programming languages, but they also include ways of describing propagation time and signal strength (sensitivity). There are two types of assignment operators, namely the blocking assignment (=) and the non-blocking assignment (<=). Non-blocking assignments allow designers to describe state machine updates without needing to declare and use temporary storage variables. Since this concept is part of the semantics of the Verilog language, designers can quickly write large circuit descriptions in a relatively compact and concise form. At the time of its introduction, Verilog represented a tremendous increase in productivity for circuit designers who had been using graphical schematic capture software and specially written software programs to document and simulate electronic circuits.

The designers of Verilog wanted a language with a syntax similar to the C programming language, which was already widely used in engineering software development. Like C, Verilog is case sensitive and has a basic preprocessor (although it is less sophisticated than that of ANSI C/C++). The control flow keywords (if/else, for, while, case, etc.) are the same, and the operator precedence is compatible with C. Syntax differences include the bit widths required in variable declarations, the demarcation of procedural blocks (Verilog uses begin/end instead of curly braces {}), and many other minor differences. Verilog requires that variables be given a definite size. In C, this size is inferred from the 'type' of the variable (e.g. an integer type might be 8 bits).

A Verilog design consists of a hierarchy of modules. Modules encapsulate a design hierarchy and communicate with other modules via a set of declared input, output, and bidirectional ports. Internally, a module can contain any combination of the following: net/variable declarations (wire, reg, integer, etc.), concurrent and sequential statement blocks, and instances of other modules (sub-hierarchies). Sequential statements are placed inside a begin/end block and are executed sequentially within the block. However, the blocks themselves are executed concurrently, making Verilog a dataflow language.

Verilog's concept of a 'wire' consists of a signal value (4-state: "1, 0, floating, undefined") and a signal strength (strong, weak, etc.). This system allows abstract modeling of shared signal paths, where multiple sources drive a common net. When a wire has multiple drivers, the (readable) value of the wire is determined by a function of the source drivers and their strengths.

A subset of the statements in Verilog can be synthesized. A Verilog module that conforms to a synthesizable coding style, known as RTL (register-transfer level), can be physically realized by synthesis software. The synthesis software algorithmically converts the (abstract) Verilog source into a netlist, a logically equivalent description consisting only of the basic logic primitives (AND, OR, NOT, flip-flop, etc.) available in a particular FPGA or VLSI technology. Further manipulation of the netlist ultimately leads to circuit fabrication blueprints (such as photomask sets for ASICs or bitstream files for FPGAs).

Final Exam Report - VLSI System Design - STEI ITB

2.3 MODELSIM

FIGURE 2 MODELSIM LOGO

ModelSim is a multi-language environment by Mentor Graphics for simulating hardware description languages such as VHDL, Verilog, and SystemC, and it includes a built-in C debugger. ModelSim can be used independently, or in conjunction with Intel Quartus Prime, Xilinx ISE, or Xilinx Vivado. Simulations are performed using a graphical user interface (GUI), or automatically using scripts.

2.4 VIVADO

FIGURE 3 VIVADO LOGO

Vivado is software produced by Xilinx to synthesize and analyze HDL (Hardware Description Language) designs that are intended to be embedded in an FPGA (Field Programmable Gate Array). The FPGA used here is also produced by Xilinx. Prior to Vivado, there was a predecessor software known as Xilinx ISE.

2.5 ZYNQ ARCHITECTURE

Zynq is an FPGA SoC. The Zynq architectural block diagram is shown in Figure 4. In Zynq there is a Processing System (PS) and Programmable Logic (PL). The PS contains a dual-core ARM Cortex-A9 processor, while the PL itself is an FPGA.

FIGURE 4 ZYNQ ARCHITECTURE
On the PL, we can design using Verilog or use the IP cores provided by Xilinx. A design on the PL can communicate with the PS through several interfaces such as SGP, MGP, HP, and ACP. The design made on the PL must be compatible with the standard AXI bus in order to communicate with the PS.

2.6 REINFORCEMENT LEARNING ARCHITECTURE

The following are a number of reference architectures used to realize the reinforcement learning module. This design is based on a journal article that discusses the hardware design process for efficient RL.

FIGURE 5 Q-LEARNING AGENT DESIGN
FIGURE 6 Q-LEARNING ACCELERATOR DESIGN
FIGURE 7 MAX BLOCK DESIGN
FIGURE 8 Q-UPDATER DESIGN

3. METHODOLOGY

In this experiment, a number of software tools and boards are used to carry out the realization process. The following are the tools and software used.
• Computer
• ModelSim
• Vivado
• Xilinx Zybo FPGA board

To realize this reinforcement learning process on the FPGA, four stages need to be carried out in accordance with the predetermined goals. The following are the steps that need to be done.

1. Perform the hardware design and verification.
2. Design the software in the C language.
3. Create a Vivado memory block.
4. Integrate hardware, software, and memory.

A. Hardware Design

In the hardware design process, Verilog-based programming is needed to create a circuit. This circuit will later be implemented on a Zybo board. To realize the reinforcement learning process, an architecture that works well is needed. Therefore, the architecture found in the literature study is used to create a good hardware reinforcement learning model. The following are the steps taken to make the hardware design.
1. Make the blocks of the architecture.
2. Perform the simulation process.
3. Debug and verify the block.
4. Repeat steps one to three for each block.
5. Integrate the created Verilog blocks.
6. Create the state selector and the Control Unit.
7. Carry out the final verification process with a specific application.

B. Software design process

For the software design, the reinforcement learning model is programmed in the C language. This is done by studying the Matlab model and adapting it to C. The process is as follows.
1. Search for the Matlab model of the system.
2. Adapt it to the C language.
3. Integrate the program.
4. Perform verification.
5. Adapt the code to fit the memory.

C. Block Memory Vivado

To be realized, the hardware requires real memory that can be accessed on the board. Therefore, memory located directly on the board, which handles reads and writes well, has to be used. The process is as follows.
1. Make a project and do the Vivado setup.
2. Add the Memory Block and configure its settings.
3. Create a testbench.
4. Perform the memory simulation.

D. Integration Process

Finally, after the whole process has gone well, the integration between PL, PS, and memory can be carried out. This process follows the procedure below.
1. Create a Q-Agent module on the PL.
2. Perform the Q-Agent simulation.
3. Create an AXI interface.
4. Adapt the C program to provide the action.
5. Perform the integration.
6. Observe the results and demo.

Application Details of Reinforcement Learning on Voltage Control in Electric Power Systems

Conventional electric power systems face several challenges, such as fast and deep ramps and increasing uncertainty, that threaten the safety and economics of their operation. In extreme conditions or under local disturbances, a disturbance that is not properly controlled can spread to neighboring areas and cause a cascade of disturbances, potentially causing a widespread blackout. Therefore, operational problems need to be detected early. In addition, it may take a lot of time for the system to return to its normal state.

FIGURE 9 GRID OPERATION AS RL ENVIRONMENT

For the application being built, the power system model above is used. The states, actions, and rewards used are described below.

1. State Space
The state is defined as a vector of system information that represents the system condition, namely the bus voltage magnitude. The voltage at the load is quantized in steps of 0.05.

2. Action Space
In manual control, the actions that can be taken to deal with voltage problems include adjusting the generator terminal voltage set point, switching shunt elements, changing the tap ratio, etc. In this task, the action chosen for reinforcement learning is adjusting the set point. For each generator, the terminal voltage set point can take one of the values [0.95, 0.975, 1.0, 1.025, 1.05]. The combination over all available generators forms the action space for training the agent.

3. Rewards
Several voltage operating zones are defined to determine the quality of the voltage profile.
• Normal Zone (0.95-1.05 pu) (100 points)
• Violation Zone (0.8-0.95 pu or 1.05-1.25 pu) (-50 points)
• Diverged Zone (<0.8 pu or >1.25 pu) (-100 points)

4. RESULT AND ANALYSIS

A. Block diagram

1. General Design

FIGURE 10 GENERAL DESIGN BLOCK DIAGRAM

Basically, the system being developed has three parts: the memory module, the Q-Agent, and the software. As can be observed in the figure above, the BRAM functions as the block that stores the Q-values, which are updated in each step by the Q-Agent. The Q-Agent is the hardware tasked with performing the computations and issuing the reads and writes to the BRAM. Meanwhile, the software determines the state, determines the reward, and communicates with the environment outside the system.
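As a small illustration of the software-side task of determining the state and the reward, the state, action, and reward definitions from the methodology can be sketched in C. This is only a sketch under stated assumptions, not the program used in the experiment: the function names are hypothetical, the 0.05 quantization step is assumed to be in per unit and counted from 0 pu with rounding to the nearest step, and the zone boundaries 0.95 and 1.05 pu are assumed to belong to the normal zone.

```c
#include <assert.h>

/* Reward values of the three operating zones listed in the methodology. */
#define REWARD_NORMAL     100
#define REWARD_VIOLATION  (-50)
#define REWARD_DIVERGED   (-100)

/* Set points that may be selected for one generator (taken from the text). */
#define NUM_SETPOINTS 5
static const double SETPOINTS[NUM_SETPOINTS] = {0.95, 0.975, 1.0, 1.025, 1.05};

/* Assumption: the state index counts 0.05 pu steps from 0 pu, rounded to
   the nearest step. */
int quantize_state(double v_pu) {
    assert(v_pu >= 0.0);
    return (int)(v_pu / 0.05 + 0.5);
}

/* Reward for a load voltage in pu. Assumption: 0.95 and 1.05 pu count as
   normal, 0.8 and 1.25 pu count as violation. */
int reward_for_voltage(double v_pu) {
    if (v_pu < 0.8 || v_pu > 1.25) return REWARD_DIVERGED;
    if (v_pu < 0.95 || v_pu > 1.05) return REWARD_VIOLATION;
    return REWARD_NORMAL;
}

/* Set point selected by an action index for a single generator. */
double setpoint_for_action(int action) {
    assert(action >= 0 && action < NUM_SETPOINTS);
    return SETPOINTS[action];
}

/* Size of the joint action space: every combination of set points over
   all generators, i.e. NUM_SETPOINTS to the power of num_generators. */
int action_space_size(int num_generators) {
    assert(num_generators >= 0);
    int size = 1;
    for (int g = 0; g < num_generators; g++) size *= NUM_SETPOINTS;
    return size;
}
```

With these definitions, a load voltage of 0.9 pu falls in the violation zone and yields a reward of -50, while two generators give 25 joint actions.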
2. Q-Agent Block

FIGURE 11 Q-AGENT BLOCK DIAGRAM

The image above shows a block diagram of the Q-Agent, whose task is to compute the Q-values and determine the action to be selected. This block has two main parts, namely the Q-learning accelerator and the policy generator. The Q-learning accelerator is the module that focuses on computing the Q-value. Meanwhile, the policy generator is the module that determines which action to choose. There are also a number of delay blocks, so that the next action, state, and reward become the current ones.

3. Policy Generator

FIGURE 12 POLICY GENERATOR BLOCK DIAGRAM

To determine the action, the policy generator uses two main blocks, namely the action selector and the randomizer, plus one delay block. The action is selected either with the greedy algorithm or by random exploration, depending on the epsilon value and a random value. Therefore, an LFSR that can generate random values is needed.

4. Q Learning Accelerator

As already mentioned, the Q-learning accelerator is responsible for calculating the Q-values and storing them in memory. The diagram below shows that it consists of several blocks with their respective functions. The Q updater is the block that performs the multiplications using a barrel shifter and performs the additions, so that a new Q-value is generated according to the standard Q-learning update rule:

$$Q(s_t, a_t) \leftarrow (1 - \alpha)\, Q(s_t, a_t) + \alpha \left( r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) \right)$$

Then there are the mux with delay and the Max block, which provide the inputs to the Q updater. The mux block selects Q(s_t, a_t), and the Max block finds the maximum Q-value of the next state, so that the parameters of the equation above are available. There is also a decoder that accepts the action and provides an enable signal at its output. This decoder serves as a memory regulator, so that the active memory can be determined. Action RAM is a memory model used to simulate the BRAM.

FIGURE 13 Q-LEARNING ACCELERATOR BLOCK DIAGRAM

B. Hardware Simulation in Verilog

1. Implementation of Max Block
The implementation uses assign statements because the block is assumed to be combinational. The designer avoids if statements to prevent latches from being inferred in the circuit. The comparator is built as a tree with 4 levels of 2-input comparators. In total, this code uses 8 comparators.

FIGURE 14 MAX BLOCK SIMULATION RESULT USING MODELSIM

It can be observed that, of the nine inputs, the output value is the maximum value of the input set.

2. Barrel Shifter
The multiplication in the reinforcement learning model is carried out using the barrel shifter method. Therefore, a hardware barrel shifter was made and the following results were obtained.

FIGURE 15 BARREL SHIFTER SIMULATION RESULT USING MODELSIM

3. Q-Updater
In the implementation, the Q-Matrix updater temporarily uses only one barrel shifter, because one barrel shifter is assumed to be accurate enough and makes testing easier. If the system is deemed inadequate during the next experimental stage, the number of barrel shifters will be increased. The state itself is determined through a quantization process, which means that high accuracy is not required here. The following are the Verilog simulation results of the Q-updater module.

FIGURE 16 Q-UPDATER SIMULATION RESULT USING MODELSIM

4. Action Ram
Action RAM is a module that functions as memory to store (write) Q-values, which can later be read as input for the Q-updater. The simulation results of the action RAM are shown in the following figure.

FIGURE 17 ACTION RAM SIMULATION RESULT USING MODELSIM

5. Q-Learning Accelerator
The Q-Learning Accelerator module is realized by following the hardware diagram shown earlier. Likewise, the simulation was carried out and the appropriate results were obtained.

FIGURE 18 Q-LEARNING ACCELERATOR SIMULATION RESULT USING MODELSIM

6. Action Selector
The action selector, which previously used only the greedy algorithm, has now been extended with a randomized selection process. The following are the results of the simulation carried out.

FIGURE 19 ACTION SELECTOR SIMULATION RESULT USING MODELSIM

7. Policy Generator
A policy generator module has also been prepared and simulated. It can be observed that the results obtained are consistent.

FIGURE 20 POLICY GENERATOR SIMULATION RESULT USING MODELSIM

8. Q-Agent
The Q-Agent has also been successfully compiled, and its simulation results are in accordance with expectations.

FIGURE 21 Q-AGENT SIMULATION RESULT USING MODELSIM
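As a software reference for the blocks simulated above, the following C sketch models the Q-updater, the Max block, and the action selection on 16-bit values. It is only a sketch under stated assumptions, not the authors' implementation: each multiplication is approximated by a single shift (alpha = 2^-alpha_shift, gamma = 2^-gamma_shift), the LFSR taps (16, 14, 13, 11) are a common textbook choice and not taken from the Randomizer module, the random action is picked with a modulo rather than thresholds, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Divide by 2^k. Hardware would use an arithmetic right shift; a division
   is used here because right-shifting negative values is
   implementation-defined in C (results may differ by one LSB). */
int32_t scale_pow2(int32_t x, unsigned k) {
    assert(k < 16);
    return x / (int32_t)(1u << k);
}

/* Max block: largest of n Q-values. */
int16_t max_q(const int16_t *q, int n) {
    assert(n > 0);
    int16_t best = q[0];
    for (int i = 1; i < n; i++)
        if (q[i] > best) best = q[i];
    return best;
}

/* Q-updater: Q + alpha * (r + gamma * maxQ' - Q), which is algebraically
   the same as (1 - alpha) * Q + alpha * (r + gamma * maxQ').
   The result is saturated to the 16-bit range. */
int16_t q_update(int16_t q_old, int16_t reward, int16_t q_next_max,
                 unsigned alpha_shift, unsigned gamma_shift) {
    int32_t target = (int32_t)reward + scale_pow2(q_next_max, gamma_shift);
    int32_t q_new = (int32_t)q_old + scale_pow2(target - q_old, alpha_shift);
    if (q_new > INT16_MAX) q_new = INT16_MAX;
    if (q_new < INT16_MIN) q_new = INT16_MIN;
    return (int16_t)q_new;
}

/* One step of a 16-bit Fibonacci LFSR (assumed taps: 16, 14, 13, 11). */
uint16_t lfsr16_next(uint16_t s) {
    uint16_t bit = ((s >> 0) ^ (s >> 2) ^ (s >> 3) ^ (s >> 5)) & 1u;
    return (uint16_t)((s >> 1) | (bit << 15));
}

/* Epsilon-greedy selection: exploit when epsilon <= rnd_explore (the same
   comparison direction as in ActionSelector.v), otherwise explore. */
int select_action(const int16_t *q, int n, uint16_t epsilon,
                  uint16_t rnd_explore, uint16_t rnd_action) {
    assert(n > 0);
    if (epsilon <= rnd_explore) {
        int best = 0;
        for (int i = 1; i < n; i++)
            if (q[i] > q[best]) best = i;
        return best;               /* greedy: index of the largest Q-value */
    }
    return rnd_action % n;         /* exploration: pseudo-random index */
}
```

For example, with alpha_shift = 1 and gamma_shift = 1, an old value of 0, a reward of 256, and a next-state maximum of 512 give a new Q-value of 256.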
C. Verifying Hardware Design with Maze Solver Example

The Q-Agent is then validated by simulation in ModelSim. In the following simulation results, it can be seen that the Q table has been updated and the agent has reached the destination.

FIGURE 22 MAZE SOLVER SIMULATION RESULT USING PROPOSED HARDWARE DESIGN

The training process is repeated 300 times. Completing it takes 5694 ns with a clock period of 2 ns, which corresponds to 2847 clock cycles.

D. Software design and verification

Besides the hardware architecture, a C-based software system is created, which is used to determine the state and the rewards given to the Q-Agent. The flowchart of the C program is shown below.

FIGURE 23 SOFTWARE DESIGN FLOWCHART

First, the program asks the user for the load voltage, that is, the voltage at the load. The load voltage is then quantized to obtain the starting state of the system. Before training, the program displays the Q table and the actions the system would take to bring the load voltage into the safe zone. The Q table is initially filled with random values, so the actions taken by the system are not yet optimal.

After that, training is carried out for the desired number of episodes, each of which has 15 iterations. In each iteration, if the state reached is already in the normal zone, the episode is terminated. In addition, the Q table value and the state of the system are updated in each iteration.

After the desired number of episodes has been reached, the Q table is displayed again, together with the actions the system will take to keep the load voltage in the safe zone. The final Q table and the resulting actions should be more optimal than before. The verification is therefore also carried out entirely in C, and the results are shown in Figure 24.
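The training procedure described above can be sketched in C as follows. This is a hedged illustration rather than the program used in the experiment: the plant model is a hypothetical stand-in in which the load voltage simply shifts with the chosen set point, the Q table starts at zero instead of random values, alpha = 0.5 and a discount of 0.9 are assumed, the random source is a simple linear congruential generator, and all names are illustrative. The epsilon schedule follows the form epsilon = 1 - episode/(episodes + 1) noted in the appendix, scaled by an initial value.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_STATES  32   /* 0.05 pu bins, assumed range 0 to 1.55 pu */
#define NUM_ACTIONS 5
#define MAX_ITER    15   /* iterations per episode, as stated in the text */

static const double SETPOINTS[NUM_ACTIONS] = {0.95, 0.975, 1.0, 1.025, 1.05};
static uint32_t lcg_state = 1u;

/* Quantize a voltage into a 0.05 pu state index, clamped to the table. */
int quantize(double v_pu) {
    int s = (int)(v_pu / 0.05 + 0.5);
    if (s < 0) s = 0;
    if (s >= NUM_STATES) s = NUM_STATES - 1;
    return s;
}

/* Reward zones from the methodology. */
int reward(double v_pu) {
    if (v_pu < 0.8 || v_pu > 1.25) return -100;
    if (v_pu < 0.95 || v_pu > 1.05) return -50;
    return 100;
}

/* Hypothetical plant: the load voltage moves with the set point offset. */
double plant(double v_base, int action) {
    return v_base + (SETPOINTS[action] - 1.0);
}

/* Uniform pseudo-random number in [0, 1) from a simple LCG. */
double rand_unit(void) {
    lcg_state = lcg_state * 1664525u + 1013904223u;
    return (lcg_state >> 8) / 16777216.0;
}

/* Index of the largest Q-value in row s (first one wins on ties). */
int greedy_action(double q[][NUM_ACTIONS], int s) {
    int best = 0;
    for (int a = 1; a < NUM_ACTIONS; a++)
        if (q[s][a] > q[s][best]) best = a;
    return best;
}

/* Train for a number of episodes of at most MAX_ITER iterations each. */
void train(double q[][NUM_ACTIONS], double v_base, int episodes,
           double epsilon0) {
    const double alpha = 0.5, discount = 0.9;   /* assumed values */
    assert(episodes > 0);
    for (int ep = 0; ep < episodes; ep++) {
        double epsilon = epsilon0 * (1.0 - (double)ep / (episodes + 1));
        int s = quantize(v_base);
        for (int it = 0; it < MAX_ITER; it++) {
            int a = (rand_unit() < epsilon)
                        ? (int)(rand_unit() * NUM_ACTIONS)
                        : greedy_action(q, s);
            double v_next = plant(v_base, a);
            int s_next = quantize(v_next);
            int r = reward(v_next);
            double best_next = q[s_next][greedy_action(q, s_next)];
            q[s][a] += alpha * (r + discount * best_next - q[s][a]);
            s = s_next;
            if (r == 100) break;   /* normal zone reached: end the episode */
        }
    }
}
```

In this toy setting, starting from a load voltage of 0.92 pu, only the highest set point (1.05) brings the voltage into the normal zone, so the trained table should prefer action index 4 in the starting state.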

FIGURE 24 RESULT OF VOLTAGE CONTROL LEARNING ALGORITHM IN C

The results above show that the C program has been verified and gives the desired results.

E. Block Memory Simulation of Timing

In addition, verification and simulation are carried out with a timing diagram. This is done for the Zybo board in the Vivado application. The following is the simulation result of the BRAM block timing.

FIGURE 25 BRAM TIMING SIMULATION RESULT

It can be observed that the memory simulation with read and write processes has been carried out successfully. The write and read addresses and the other signals can also be observed.

F. Integration Process

First, the Q-Agent hardware is verified for the Zybo board. The following simulation integrates the Zybo hardware and the block memory.

FIGURE 26 Q-AGENT SIMULATION RESULT USING VIVADO

It can be observed that the memory and the hardware behave as shown in the timing diagram above.

Based on this simulation, the following utilization results are obtained.

FIGURE 27 UTILIZATION RESULT

The image above shows that the system runs under certain timing condition parameters, such as slack.

FIGURE 28 TIMING CONDITION PARAMETER

In the simulation process, a clock with the frequency and period shown above is used.

FIGURE 29 POWER CONSUMPTION

Based on the information above, it can also be seen that the power dissipated by the system is 1.663 W.

Next, the information below describes aspects of the design that are likely to affect its size. The following utilization figures relate to the slice logic.

FIGURE 30 GATE UTILIZATION

In addition, information on the types of registers is provided in the following figure.

FIGURE 31 REGISTER UTILIZATION

Block memory information, covering the memory used and not used on the board, is also provided.

FIGURE 32 MEMORY UTILIZATION

There is also information related to primitives and their functions, as follows.

FIGURE 33 PRIMITIVE FUNCTION UTILIZATION

The process has reached the final stage, namely integrating the software with the memory and hardware that have been simulated. However, after adapting the C model code and attempting the integration, the results displayed are not yet as desired. The selected action values should appear, but instead values that do not match appear.

5. CONCLUSION

Based on the experiments carried out, the following conclusions were obtained.
• The hardware design of the Q-Agent has been successfully carried out and simulated. Verification using the maze solver has also been carried out.
• The software in the C language has also been completed and gives appropriate results in verification.
• The timing simulation of the block memory has been carried out, with successful writing and reading.
• The integration process has been attempted. Hardware and memory have been successfully simulated on the Zybo. However, when integrating with the software, an error appears in the information displayed.
REFERENCES
[1] Sergio Spanò, An Efficient Hardware Implementation of Reinforcement Learning: The Q-Learning Algorithm, IEEE Access, Rome, 2019.
[2] https://en.wikipedia.org/wiki/Verilog, 29 January 2020, 20:00
[3] Ruisheng Diao, Zhiwei Wang, Di Shi, Qianyun Chang, Jiajun Duan, & Xiaohu Zhang. (2019). Autonomous Voltage Control for Grid Operation Using Deep Reinforcement Learning.

APPENDIX
ActionRam.v
//action ram
module ActionRAM(clk, en, wr_addr, rd_addr, write_en, data_in, data_out);
input clk;
input en;
input write_en;
input[5:0] wr_addr ;
input[5:0] rd_addr;
input[15:0] data_in;
output reg[15:0] data_out;
//File Name Parameter
parameter FILENAME = "memory_in.list";
//Memory Model
reg[15:0] mem[0:63];
initial begin
$readmemh (FILENAME, mem);
end
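always @(posedge clk) begin
    // Synchronous read; the output is cleared while the RAM is disabled
    if (!en) begin
        data_out <= 16'd0;   // non-blocking, consistent with the rest of the clocked block
    end
    else begin
        data_out <= mem[rd_addr];
    end
    // Synchronous write; the memory keeps its contents when write_en is low
    if (write_en) begin
        mem[wr_addr] <= data_in;
    end
end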
endmodule
ActionRam_tb.v
//Testbench Action RAM
`timescale 1 ns/10 ps // time-unit = 1 ns, precision = 10 ps
module ActionRAM_tb();
//input Declaration
reg clk = 1'b0;
reg en = 1'b0;
reg write_en = 1'b0;
reg [5:0] wr_addr ;
reg [5:0] rd_addr;
reg [15:0] data_in;
wire [15:0] data_out;
//port mapping
ActionRAM DUT(.clk(clk),
.en(en),
.write_en(write_en),
.wr_addr(wr_addr),
.rd_addr(rd_addr),
.data_in(data_in),
.data_out(data_out));
//clock generator
always begin
#10 clk = ~clk; // Clock with a period of 20 time units
end

//test case
initial begin
#10;
en = 1'b1;
#20;
write_en = 1'b1;
wr_addr = 6'd5;
rd_addr = 6'd5;
data_in = 16'd5;
#20;
write_en = 1'b0;
rd_addr = 6'd5;
#20;
write_en = 1'b1;
wr_addr = 6'd1;
rd_addr = 6'd1;
data_in = 16'd10;
#20;
write_en = 1'b0;
rd_addr = 6'd1;
#20;
end
//display monitor
initial begin
$monitor("time = %2d\n dout = %2d",
$time , data_out);
end
endmodule
ActionSelector.v
module ActionSelector (clk, start, q_values, epsilon, action);
input clk, start;

input [63:0] q_values; // Values of Q_Table at row equal to current state
input [15:0] epsilon; // epsilon = 1 - episode/301
output reg [3:0] action; // action that will be taken
wire [15:0] max_1;
wire [15:0] max_2;
wire [15:0] max_value;
wire [15:0] random;
wire [15:0] random_value;
Randomizer randomizer_1 (16'b0011110011000011,start,clk,random);
Randomizer randomizer_2 (16'b1100001100111100,start,clk,random_value);
assign max_1 = (q_values[15:0] >= q_values[31:16]) ? q_values[15:0] : q_values[31:16];
assign max_2 = (q_values[47:32] >= q_values[63:48]) ? q_values[47:32] :

q_values[63:48];
assign max_value = (max_1 >= max_2) ? max_1 : max_2;
always @(*) begin
if (epsilon <= random) begin
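if (q_values[15:0] == max_value) begin
    action = 4'd4;
end
else if (q_values[31:16] == max_value) begin
    action = 4'd3;
end
else if (q_values[47:32] == max_value) begin // assumed condition: third Q-value is the maximum
    action = 4'd2;
end
else if (q_values[63:48] == max_value) begin // assumed condition: fourth Q-value is the maximum
    action = 4'd1;
end
else begin
    action = 4'd1;
end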
end
else begin
if (random_value <= 16'b0000000001000000) begin
action = 4'd1;
end
else if (random_value <= 16'b0000000010000000) begin
action = 4'd2;
end
else if (random_value <= 16'b0000000011000000) begin
action = 4'd3;
end
else begin
action = 4'd4;
end
end
end
endmodule
ActionSelector_tb.v
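module ActionSelector_tb();
reg clk = 1'b0;
reg start = 1'b1; // assumed: start is held high to seed the randomizers
reg [63:0] q_values;
reg [15:0] epsilon;
wire [3:0] action;
// Named port mapping matches the ActionSelector port list (clk, start, q_values, epsilon, action)
ActionSelector action_selector(.clk(clk), .start(start), .q_values(q_values),
    .epsilon(epsilon), .action(action));
always #5 clk = ~clk; // free-running clock for the randomizers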
initial begin

q_values =
64'b0000000000001100000000000000000100000000000000100000000000000011;
epsilon = 16'b0000000011100000;
$display("Random = %f", $itor(epsilon)*2.0**-8.0);
#10;
q_values =
64'b0000000000001100000000000000000100000000000000100000000000000111;
epsilon = 16'b0000000011000000;
#10;
$stop;
end
endmodule
BarrelShifter.v
module BarrelShifter(op, shift_mag, result);
input [15:0] op;
input [3:0] shift_mag;
output [15:0] result;
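// All stage outputs must be 16 bits wide; undeclared nets would default to 1 bit
wire [15:0] mux1_out, mux2_out, mux3_out;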
mux mux_instance1(shift_mag[0], op, (op >> 1), mux1_out);
mux mux_instance2(shift_mag[1], mux1_out, (mux1_out >> 2), mux2_out);
mux mux_instance3(shift_mag[2], mux2_out, (mux2_out >> 4), mux3_out);
mux mux_instance4(shift_mag[3], mux3_out, (mux3_out >> 8), result);
endmodule
BarrelShifter_tb.v
module tb_barrel;
reg [15:0] op;
reg [3:0] shift_mag;
wire [15:0] result;
BarrelShifter shifter_instance(op, shift_mag, result);

initial begin
shift_mag = 4'b0001;
op = 16'b0000000000000001;
#10;
shift_mag = 4'b0101;
op = 16'b0000000000000101;
#10;
$stop;
end
endmodule
CU.v
module ControlUnit(clk,enb,epsilon,next_action,current_st,episode,fail,finish,print);
input clk;
input [15:0]epsilon;
input enb;
input [3:0] next_action; //remove when using the Q-Agent
//input [5:0] current_state;
output reg [9:0] episode;
output reg fail;
output reg finish;
output reg print;
output reg [5:0] current_st;
wire [3:0] w_next_action;
wire signed [15:0] w_next_reward ;
wire [5:0] next_state;
reg [5:0] current_state = 6'd0;
reg start;
reg en;
reg [1:0] current_condition = 2'b00;

reg [1:0] next_condition = 2'b00;
reg [5:0] counter;
RewardGenerator Reward(.next_state(next_state),
.next_reward(w_next_reward));
StateSelector State(.next_action(w_next_action),
.current_state(current_state),
.next_state(next_state));
QLearningAgent Agent(.clk(clk),
.en(en),
.start(start),
.next_reward(w_next_reward),
.next_state(next_state),
.epsilon(epsilon),
.next_action(w_next_action));
always@(*) begin
if(current_condition == 2'b00) begin
//State Start
current_state=6'd1;
current_st=6'd1;
start=1'b1;
en=1'b1;
print=1'b1;
counter=4'd0;

end
else if (current_condition ==2'b01) begin
//State Calculation
start=1'b0;
en=1'b1;
print=1'b1;
end
else if (current_condition == 2'b10) begin
//State finish
start=1'b0;
en=1'b0;
print=1'b0;
counter=4'd0;
finish=1'b1;
end
else begin
//State failed
start=1'b0;
en=1'b0;
print=1'b0;
counter=4'd0;
fail=1'b0;
end
if(current_state == 6'd25) begin
next_condition=2'b10;
end
else if(current_state == 5 || current_state == 8 || current_state == 7 ||

current_state == 20 || current_state == 14 || current_state == 17 ||
current_state == 19 || current_state == 22) begin

end
else if (counter > 5'd14) begin
end
else if (episode == 10'd256) begin
end
else if (current_condition == 2'b00) begin
end
end
current_state<=next_state;
current_st<=next_state;
current_condition<=next_condition;
if (enb == 1'b1) begin
current_condition<=2'b00;
episode=10'd0;
end
counter=counter+1;
end
episode= episode+1;
end

end
endmodule
CU_tb.v
`timescale 100 ps/1 ps // time-unit = 100 ps, precision = 1 ps
module ControlUnit_tb();
//input Declaration
reg clk = 1'b1;
reg enb = 1'b1;
reg [15:0] epsilon;
reg [3:0] next_action ;
wire [9:0] episode;
wire fail;
wire finish;
wire print;
wire [5:0] current_st;
wire [35:0] step;
// reg [3:0] gamma;
// reg [3:0] alpha;
//reg [16:0] action_counter;
//wire [15:0] result;
// wire [63:0] Q_out_action;
//port mapping
ControlUnit DUT(.clk(clk),
.enb(enb),
.epsilon(epsilon),
.next_action(next_action),
.current_st(current_st),

.episode(episode),
.fail(fail),
.finish(finish),
.print(print));
reg[3:0] memory_map[0:24];
//read initial memory access
initial begin
$readmemh ("memory_map.list", memory_map);
end
//clock generator
always begin
#10 clk = ~clk; // clock with a period of 20 time units
end
assign step = ({episode, 8'b00000000}) * 18'b000000000000000001;
initial begin
#20;
//next_reward = 16'b00000111_00000000; //7
enb = 1'b0;
next_action = 4'b0000;
//current_action = 4'd1;
// alpha = 4'b1000;
// gamma = 4'b1110;
#20;

enb= 1'b0;
//next_reward = 16'b00000101_00000000; //7
#20;
#20;
#20;
#20;
#20;
#20;
#20;
#20;

#20;
#20;
#20;
#20;
#20;
#20;
#20;
end
always@(episode) begin
epsilon = 16'b0000000100000000 - (step[23:8]);
end
//display monitor
always@(negedge clk) begin

case(current_st)
5'b00001: begin memory_map[0]=3'd1; end
default: begin memory_map[0]=3'd1; end
endcase
$monitor("%d %d %d %d %d\n%d %d %d %d %d\n%d %d %d %d %d\n%d %d %d %d %d\n%d %d %d %d %d",
memory_map[0], memory_map[1],memory_map[2],memory_map[3],memory_map[4],
memory_map[5], memory_map[6], memory_map[7], memory_map[8], memory_map[9],
memory_map[10],memory_map[11],memory_map[12],memory_map[13],memory_map[14],
memory_map[15],memory_map[16],memory_map[17],memory_map[18],memory_map[19],
memory_map[20],memory_map[21],memory_map[22],memory_map[23],memory_map[24]);
end
endmodule
ControlUnit.v
module ControlUnit(clk,enb,epsilon,next_action,current_st,episode,fail,finish,print);
input clk;
input [15:0]epsilon;
input enb;
input [3:0] next_action; // remove when using the Q agent
//input [5:0] current_state;
output reg [8:0] episode;
output reg fail;
output reg finish;
output reg print;
output reg [5:0] current_st;
wire [3:0] w_next_action;
wire signed [15:0] w_next_reward ;
wire [5:0] next_state;
reg [5:0] current_state = 6'd0;
reg start;
reg en;
reg [1:0] current_condition = 2'b00;
reg [1:0] next_condition = 2'b00;

reg [5:0] counter;
//current_condition = 2'b00;
RewardGenerator Reward(.next_state(next_state),
.next_reward(w_next_reward));
StateSelector State(.next_action(next_action),
.current_state(current_state),
.next_state(next_state));
/*
QLearningAgent Agent(.clk(clk),
.en(en),
.reward(w_next_reward),
.next_state(w_next_state)
.next_action(w_next_action));
*/
always@(*) begin
if(current_condition == 2'b00) begin
//State Start
current_state=6'd1;
current_st=6'd1;
//next_state=6'd1;
start=1'b1;
en=1'b0;
print=1'b1;
counter=4'd0;
end

else if (current_condition == 2'b01) begin
//State Calculation
start=1'b0;
en=1'b1;
print=1'b1;
end
else if (current_condition == 2'b10) begin
//State finish
start=1'b0;
en=1'b0;
print=1'b0;
counter=4'd0;
finish=1'b1;
end
else begin
//State failed
start=1'b0;
en=1'b0;
print=1'b0;
counter=4'd0;
fail=1'b0;
end
if(current_state == 6'd25) begin
end
else if(current_state == 5 || current_state == 8 || current_state == 7 ||

current_state == 20 || current_state == 14 || current_state == 17 ||
current_state == 19 || current_state == 22) begin
end

else if (counter > 5'd14) begin
end
else if (episode == 9'd300) begin
end
end
end
current_state<=next_state;
current_st<=next_state;
current_condition<=next_condition;
if (enb == 1'b1) begin
current_condition<=2'b00;
end
counter=counter+1;
end
episode= episode+1;
//current_state=6'd1;
end
end
endmodule
ControlUnit_tb.v
`timescale 100 ps/1 ps // time-unit = 100 ps, precision = 1 ps
module ControlUnit_tb();
//input Declaration
reg clk = 1'b1;
reg enb = 1'b1;
reg [15:0] epsilon = 16'd0;
reg [3:0] next_action ;
wire [8:0] episode;
wire fail;
wire finish;
wire print;
wire [5:0] current_st;
// reg [3:0] gamma;
// reg [3:0] alpha;
//port mapping
ControlUnit DUT(.clk(clk),
.enb(enb),
.epsilon(epsilon),
.next_action(next_action),
.current_st(current_st),
.episode(episode),
.fail(fail),
.finish(finish),
.print(print));

initial begin
end
//clock generator
always begin
#10 clk = ~clk; // clock with a period of 20 time units
end
initial begin
#20;
//next_reward = 16'b00000111_00000000; //7
enb = 1'b0;
// alpha = 4'b1000;
// gamma = 4'b1110;
#20;
enb= 1'b0;
//next_reward = 16'b00000101_00000000; //7
#20;

#20;
#20;
#20;
#20;
#20;
#20;
#20;
#20;
#20;

#20;
#20;
#20;
#20;
#20;
end
always@(episode) begin
epsilon=1- episode/300;
end
//display monitor
always@(negedge clk) begin
case(current_st)

endcase

%d %d",

memory_map[9],

end
endmodule
Decoder.v
module Decoder (
input [3:0] at,
output en1,en2,en3,en4,en5,en6,en7,en8,en9,en10,en11,en12,en13,en14,en15);
reg [15:0] out;
always @(*) begin
case(at)
4'd1: begin out=16'b0000000000000001; end
4'd2: begin out=16'b0000000000000010; end
4'd3: begin out=16'b0000000000000100; end
4'd4: begin out=16'b0000000000001000; end
4'd5: begin out=16'b0000000000010000; end
4'd6: begin out=16'b0000000000100000; end
4'd7: begin out=16'b0000000001000000; end
4'd8: begin out=16'b0000000010000000; end
4'd9: begin out=16'b0000000100000000; end
4'd10: begin out=16'b0000001000000000; end
4'd11: begin out=16'b0000010000000000; end
4'd12: begin out=16'b0000100000000000; end
4'd13: begin out=16'b0001000000000000; end
4'd14: begin out=16'b0010000000000000; end
4'd15: begin out=16'b0100000000000000; end
default : begin out=16'b0000000000000000; end

endcase
end
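// one enable per action: en1 is bit 0 of the one-hot word, en15 is bit 14
assign {en15,en14,en13,en12,en11,en10,en9,en8,en7,en6,en5,en4,en3,en2,en1} = out[14:0];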
endmodule
Decoder_tb.v
//Testbench Decoder
module decoder_tb();
//input Declaration
reg clk = 1'b0;
reg en = 1'b0;
reg[3:0] action;
wire en1,en2,en3,en4,en5,en6,en7,en8,en9,en10,en11,en12,en13,en14,en15;
//port mapping
Decoder DUT( .at(action),

.en1(en1),
.en2(en2),
.en3(en3),
.en4(en4),
.en5(en5),
.en6(en6),
.en7(en7),
.en8(en8),
.en9(en9),
.en10(en10),
.en11(en11),
.en12(en12),
.en13(en13),
.en14(en14),
.en15(en15));
//clock generator
always begin
#10 clk = ~clk; // clock with a period of 20 time units
end
//test case
initial begin
#10;
action = 1;
#20;
action = 2;
#20;
action = 3;
#20;
action = 4;
end

//display monitor
// initial begin
// $monitor("time = %2d\n dout = %2d",
// $time , Q_max);
// end
endmodule
Delay.v
// Block Delay
module DelayActionRAM(clk, din,dout);
input clk;
input [15:0] din;
output reg [15:0] dout = 16'd0;
//buffer register;
reg [15:0] temp1,temp2,temp3;
always@(posedge clk) begin
temp1 <= din;
dout <= temp1;
end
endmodule
module DelayReward(clk,din,dout);
input clk;
input [15:0] din;
output reg [15:0] dout = 16'd0;
//buffer register;
reg [15:0] temp1,temp2;
always@(posedge clk) begin
//temp1 <= din;
dout <= din;
end
endmodule
module DelayState(clk,din,dout);
input clk;
input [5:0] din;
output reg [5:0] dout = 6'd0;
//buffer register;
reg [5:0] temp1,temp2;
always@(posedge clk) begin
temp1 <= din;
dout <= temp1;
end
endmodule
Delay_tb.v
module Delay_tb();
reg clk = 0;
reg [15:0] din;
wire [15:0] dout;
DelayActionRAM DUT(.clk(clk),
.din(din),
.dout(dout));
//clock generator
always begin
#10 clk = ~clk; // clock with a period of 20 time units
end
initial begin
#10;
din = 16'd10;
#20;
din = 16'd20;
end
endmodule
MaxBlock.v
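//Model 1: register at the end
module MaxBlock (
input [15:0] Q_Act1, Q_Act2, Q_Act3, Q_Act4, Q_Act5,
input [15:0] Q_Act6, Q_Act7, Q_Act8, Q_Act9, Q_Act10,
input [15:0] Q_Act11, Q_Act12, Q_Act13, Q_Act14, Q_Act15,
input clk,
output reg [15:0] out);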

wire [15:0] a;
wire [15:0] b;
wire [15:0] c;
wire [15:0] d;
wire [15:0] e;
wire [15:0] f;
wire [15:0] g;
wire [15:0] h;
wire [15:0] i;
wire [15:0] j;
wire [15:0] k;
wire [15:0] l;
wire [15:0] m;
wire [15:0] RegOut;
assign a = (Q_Act1>=Q_Act2)? Q_Act1 : Q_Act2;
assign b = (Q_Act3>=Q_Act4)? Q_Act3 : Q_Act4;
assign c = (Q_Act5>=Q_Act6)? Q_Act5 : Q_Act6;
assign d = (Q_Act7>=Q_Act8)? Q_Act7 : Q_Act8;
assign e = (Q_Act9>=Q_Act10)? Q_Act9 : Q_Act10;
assign f = (Q_Act11>=Q_Act12)? Q_Act11 : Q_Act12;
assign g = (Q_Act13>=Q_Act14)? Q_Act13 : Q_Act14;
assign h = (a>=b)? a : b;
assign i = (c>=d)? c : d;
assign j = (e>=f)? e : f;
assign k = (g>=Q_Act15)? g : Q_Act15;

assign l = (h>=i)? h : i;
assign m = (j>=k)? j : k;
assign RegOut = (l>=m)? l : m;
always@(posedge clk) begin
out <= RegOut;
end
endmodule
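The comparator tree above reduces the 15 Q values of one state to their maximum. As a rough cross-check, the sketch below models it in C under the assumption that the comparisons are unsigned, since the Verilog wires are not declared `signed`. The names `max_block` and `NUM_ACTIONS` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_ACTIONS 15  /* number of Q_Act inputs of MaxBlock */

/* Hypothetical software model of MaxBlock: returns the largest of the 15
   Q values, compared as unsigned 16-bit words as the unsigned wires would. */
uint16_t max_block(const uint16_t q[NUM_ACTIONS])
{
    assert(q != NULL);
    uint16_t best = q[0];
    for (size_t i = 1; i < NUM_ACTIONS; i++) {
        if (q[i] >= best) {
            best = q[i];
        }
    }
    return best;
}
```

One consequence of the unsigned assumption is that a negative two's-complement Q value such as `0xFF9C` would win against any small positive value, which may be worth checking in the waveform if negative rewards are stored in the Q matrix.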
/*
//Model with a register at each level
module Max_Block (
input clk,

output reg [15:0] out);
wire [15:0] a;
wire [15:0] b;
wire [15:0] c;
wire [15:0] d;
wire [15:0] e;
wire [15:0] f;
wire [15:0] g;
wire [15:0] h;
wire [15:0] i;
wire [15:0] j;
wire [15:0] k;
wire [15:0] l;
wire [15:0] m;
wire [15:0] RegOut;
reg [15:0] Rega;
reg [15:0] Regb;
reg [15:0] Regc;
reg [15:0] Regd;
reg [15:0] Rege;
reg [15:0] Regf;
reg [15:0] Regg;
reg [15:0] Regh;
reg [15:0] Regi;
reg [15:0] Regj;
reg [15:0] Regk;
reg [15:0] Regl;
reg [15:0] Regm;
assign a = (Q_Act1>=Q_Act2)? Q_Act1 : Q_Act2;

assign b = (Q_Act3>=Q_Act4)? Q_Act3 : Q_Act4;
assign c = (Q_Act5>=Q_Act6)? Q_Act5 : Q_Act6;
assign d = (Q_Act7>=Q_Act8)? Q_Act7 : Q_Act8;
assign e = (Q_Act9>=Q_Act10)? Q_Act9 : Q_Act10;
assign f = (Q_Act11>=Q_Act12)? Q_Act11 : Q_Act12;
assign g = (Q_Act13>=Q_Act14)? Q_Act13 : Q_Act14;
assign h = (Rega>=Regb)? Rega : Regb;
assign i = (Regc>=Regd)? Regc : Regd;
assign j = (Rege>=Regf)? Rege : Regf;
assign k = (Regg>=Q_Act15)? Regg : Q_Act15;
assign l = (Regh>=Regi)? Regh : Regi;
assign m = (Regj>=Regk)? Regj : Regk;
assign RegOut = (Regl>=Regm)? Regl : Regm;
Rega=a;
Regb=b;
Regc=c;
Regd=d;
Rege=e;
Regf=f;
Regg=g;
Regh=h;
Regi=i;
Regj=j;
Regk=k;
Regl=l;
Regm=m;
out=RegOut;
end
endmodule
*/
MaxBlock_tb.v
module MaxBlock_tb();
//input Declaration
reg clk = 1'b0;
reg en = 1'b0;
reg[15:0] Q_Act1;
reg[15:0] Q_Act2;
reg[15:0] Q_Act3;
reg[15:0] Q_Act4;
reg[15:0] Q_Act5;
reg[15:0] Q_Act6;
reg[15:0] Q_Act7;
reg[15:0] Q_Act8;
reg[15:0] Q_Act9;
reg[15:0] Q_Act10;
reg[15:0] Q_Act11;
reg[15:0] Q_Act12;
reg[15:0] Q_Act13;
reg[15:0] Q_Act14;
reg[15:0] Q_Act15;
wire[15:0] Q_max;
//port mapping
MaxBlock DUT(.clk(clk),
.Q_Act1(Q_Act1),
.Q_Act2(Q_Act2),
.Q_Act3(Q_Act3),
.Q_Act4(Q_Act4),
.Q_Act5(Q_Act5),

.Q_Act6(Q_Act6),
.Q_Act7(Q_Act7),
.Q_Act8(Q_Act8),
.Q_Act9(Q_Act9),
.Q_Act10(Q_Act10),
.Q_Act11(Q_Act11),
.Q_Act12(Q_Act12),
.Q_Act13(Q_Act13),
.Q_Act14(Q_Act14),
.Q_Act15(Q_Act15),
.out(Q_max));
//clock generator
always begin
#10 clk = ~clk; // clock with a period of 20 time units
end
//test case
initial begin
#10;
Q_Act1 = 16'd1;
Q_Act2 = 16'd2;
Q_Act3 = 16'd3;
Q_Act4 = 16'd4;
Q_Act5 = 16'd0;
Q_Act6 = 16'd0;
Q_Act7 = 16'd0;
Q_Act8 = 16'd0;
Q_Act9 = 16'd0;
Q_Act10 = 16'd0;
Q_Act11 = 16'd0;

Q_Act12 = 16'd0;
Q_Act13 = 16'd0;
Q_Act14 = 16'd0;
Q_Act15 = 16'd0;
#10;
Q_Act1 = 16'd2;
Q_Act2 = 16'd3;
Q_Act3 = 16'd6;
Q_Act4 = 16'd1;
#30;
Q_Act1 = 16'd5;
Q_Act2 = 16'd4;
Q_Act3 = 16'd1;
Q_Act4 = 16'd2;
#20;
Q_Act1 = 16'd4;
Q_Act2 = 16'd7;
Q_Act3 = 16'd2;
Q_Act4 = 16'd3;
end
//display monitor
initial begin
$monitor("time = %2d\n dout = %2d",
$time , Q_max);
end
endmodule
MazeMap.v
//Maze Map

module MazeMap(clk,current_state, out_map_row_1, out_map_row_2, out_map_row_3,
out_map_row_4, out_map_row_5);
input clk;
input [6:0] current_state;
//output
output reg[3:0] out_map_row_1, out_map_row_2, out_map_row_3, out_map_row_4,

out_map_row_5;
reg [6:0] prev_state;
initial begin
end
always @(posedge clk) begin
prev_state <= current_state;
out_map_row_1 <= memory_map[0:4];
end
always @(*) begin
case(prev_state)

endcase
case(current_state)

endcase
end
//display monitor
// initial begin
// $monitor("%d %d %d %d %d\n%d %d %d %d %d\n%d %d %d %d %d\n%d %d %d %d %d\n%d %d %d %d %d",
// memory_map[0], memory_map[1],memory_map[2],memory_map[3],memory_map[4],
// memory_map[5], memory_map[6], memory_map[7], memory_map[8], memory_map[9],
//
//
//
// end
endmodule
PolicyGenerator.v
module PolicyGenerator(clk, start, q_values, epsilon, current_action,next_action);
input clk; // Clock
input start;
input [15:0] epsilon;
input [63:0] q_values; // Q values for one state
output reg [3:0] current_action; // Next Action
output [3:0] next_action;
wire [3:0] w_current_action; // Next Action Wire
ActionSelector Act(.clk(clk),.start(start),.q_values(q_values), .epsilon(epsilon),

.action(w_current_action));
assign next_action=w_current_action;
//Delay Action
always@(posedge clk)
begin
current_action <= w_current_action;
end

endmodule
PolicyGenerator_tb.v
module PolicyGenerator_tb();
reg clk = 1'b0; // Clock
reg [15:0] epsilon;
reg [63:0] q_values; // Q values for one state
wire [3:0] action;
PolicyGenerator policy_generator(.clk(clk), .q_values(q_values),

.epsilon(epsilon), .current_action(action));
//clock generator
always begin
#10 clk = ~clk; // clock with a period of 20 time units
end
initial begin
q_values =
64'b0000000000001100000000000000000100000000000000100000000000000011;
epsilon = 16'b0000000011100000;
$display("epsilon = %f", $itor(epsilon)*2.0**-8.0);
#10;
q_values =
64'b0000000000001100000000000000000100000000000000100000000000000111;
epsilon = 16'b0000000011000000;
$display("epsilon = %f", $itor(epsilon)*2.0**-8.0);
#20;
q_values = 64'h0001100001000010;

$stop;
end
endmodule
QLearningAccelerator.v
// Q-Learning Accelerator
module QLearningAccelerator(clk, en, current_action, current_state, next_state,

current_reward, Q_out_action);
// Input and Output
input clk, en; // Control Signal
input[3:0] current_action; // Current Action
input [5:0] current_state; // Current State
input [5:0] next_state; //Next State
input signed [15:0] current_reward; // Current Reward
output reg[63:0] Q_out_action; // Q values in the Q-matrix row of the current state
reg [15:0] Q_new; // Updated Q Value
//wiring Q new
wire [15:0] w_Q_new;
//wiring
wire[15:0] out_ram_1, out_delay_1,
out_ram_2, out_delay_2,

out_ram_15, out_delay_15;
//decoder to ram wire
wire en1,
en2,
en3,
en4,
en5,
en6,
en7,
en8,
en9,
en10,
en11,
en12,
en13,
en14,
en15;
//new q value wire
//reg [15:0] Q_new;
//wire output mux
wire[15:0] q_value_selected;

//wire q maximum
wire[15:0] q_max;
//module instantiation
ActionRAM ram_1(.clk(clk),
.en(en),
.wr_addr(current_state),
.rd_addr(next_state),
.write_en(en1),
.data_in(w_Q_new),
.data_out(out_ram_1));
.en(en),
.write_en(en2),
.data_in(w_Q_new),
.en(en),
.write_en(en3),
.data_in(w_Q_new),
.en(en),

.write_en(en4),
.data_in(w_Q_new),
.en(en),
.write_en(en5),
.data_in(Q_new),
.en(en),
.write_en(en6),
.data_in(Q_new),
.en(en),
.write_en(en7),
.data_in(Q_new),

.en(en),
.write_en(en8),
.data_in(Q_new),
.en(en),
.write_en(en9),
.data_in(Q_new),
.en(en),
.write_en(en10),
.data_in(Q_new),
.en(en),
.write_en(en11),
.data_in(Q_new),

.en(en),
.write_en(en12),
.data_in(Q_new),
.en(en),
.write_en(en13),
.data_in(Q_new),
.en(en),
.write_en(en14),
.data_in(Q_new),
.en(en),
.write_en(en15),
.data_in(Q_new),

//Delay
DelayActionRAM delay_1(.clk(clk),
.din(out_ram_1),
.dout(out_delay_1));
.din(out_ram_2),
.din(out_ram_3),
.din(out_ram_4),
.din(out_ram_5),
.din(out_ram_6),
.din(out_ram_7),

.din(out_ram_8),
.din(out_ram_9),
.din(out_ram_10),
.din(out_ram_11),
.din(out_ram_12),
.din(out_ram_13),
.din(out_ram_14),
.din(out_ram_15),

//multiplexer 16 to 1
Mux16to1 mux16to1(.sel(current_action),
.d1(out_delay_1),
.d2(out_delay_2),
.d3(out_delay_3),
.d4(out_delay_4),
.d5(out_delay_5),
.d6(out_delay_6),
.d7(out_delay_7),
.d8(out_delay_8),
.d9(out_delay_9),
.d10(out_delay_10),
.d11(out_delay_11),
.d12(out_delay_12),
.d13(out_delay_13),
.d14(out_delay_14),
.d15(out_delay_15),
.dout(q_value_selected));
//this is new
//Decoder
Decoder Decoder(.at(current_action),
.en1(en1),
.en2(en2),
.en3(en3),
.en4(en4),
.en5(en5),
.en6(en6),
.en7(en7),
.en8(en8),

.en9(en9),
.en10(en10),
.en11(en11),
.en12(en12),
.en13(en13),
.en14(en14),
.en15(en15));
//Max_Block
MaxBlock MaxBlock(.clk(clk),
.Q_Act1(out_ram_1),
.Q_Act2(out_ram_2),
.Q_Act3(out_ram_3),
.Q_Act4(out_ram_4),
.Q_Act5(out_ram_5),
.Q_Act6(out_ram_6),
.Q_Act7(out_ram_7),
.Q_Act8(out_ram_8),
.Q_Act9(out_ram_9),
.Q_Act10(out_ram_10),
.out(q_max));
//Q updater
QUpdater Qupdater(.old_Q(q_value_selected),
.max_Q(q_max),

.reward(current_reward),
.new_Q(w_Q_new));
always@(posedge clk) begin
Q_new <= w_Q_new;
Q_out_action[15:0] <= out_delay_1;
end
endmodule
QLearningAccelerato_tb.v
module QLearningAccelerator_tb();
//input Declaration
reg clk = 1'b0;
reg en = 1'b0;
reg [3:0] current_action = 4'd0;
reg [5:0] current_state;
reg [5:0] next_state;
reg signed [15:0] current_reward;
// reg [3:0] gamma;
// reg [3:0] alpha;
reg [16:0] action_counter;

wire [15:0] result;
wire [63:0] Q_out_action;
//port mapping
QLearningAccelerator DUT(.clk(clk),
.en(en),
.current_action(current_action),
.current_state(current_state),
.next_state(next_state),
.current_reward(current_reward),
.Q_out_action(Q_out_action));
//clock generator
always begin
#10 clk = ~clk; // clock with a period of 20 time units
end
//test case
// always@(*) begin
// if (next_state == 6'd25)
// reward = 16'd100;
// else if (action_counter == 16'd15)
// reward = 16'hFFCE;// -50
// else if (next_state == 6'd3)
// reward = 16'hFF9C; //-100
// reward = 16'hFF9C; //-100
// reward = 16'hFF9C; //-100
// reward = 16'hFF9C; //-100

// reward = 16'hFF9C; //-100
// reward = 16'hFF9C; //-100
// reward = 16'hFF9C; //-100
// reward = 16'hFF9C; //-100
// else
// reward = 16'h0; // 0
// end
initial begin
#10;
current_reward = 16'b00000111_00000000; //7
en = 1'b1;
current_state = 6'b000001; //1
next_state = 6'b000010; //2
// alpha = 4'b1000;
// gamma = 4'b1110;
#20;
current_state = 6'b000010;
next_state = 6'b000011;
#20;
current_state = 6'd3;
next_state = 4'd8;
current_action = 4'd1;

#20;
current_state = 4'd8;
next_state = 4'd3;
current_action = 4'd4;
end
//display monitor
initial begin
$monitor("time = %2d\n result = %2d",
$time , result);
end
endmodule
QLearningAgent.v
//Module Q learning Agent
module QLearningAgent (clk, en, start, next_reward, next_state,epsilon,next_action);
input clk;
input en;
input start;
input[15:0] next_reward; // Reward
input[5:0] next_state; // Next State
input[15:0] epsilon;
output [3:0] next_action;
wire [63:0] w_q_values;
//wire w_next_reward;
//wire w_next_state;
wire [15:0] w_curr_reward;
wire [5:0] w_curr_state;
wire [3:0] w_curr_action;

QLearningAccelerator QLearningAccelerator(.clk(clk),
.en(en),
.current_action(w_curr_action),
.current_state(w_curr_state),
.next_state(next_state),
.current_reward(w_curr_reward),
.Q_out_action(w_q_values));
PolicyGenerator PolicyGenerator(.clk(clk),
.start(start),
.current_action(w_curr_action),
.epsilon(epsilon),
.q_values(w_q_values),
.next_action(next_action));
DelayReward DelayReward(.clk(clk),
.din(next_reward),
.dout(w_curr_reward));
DelayState DelayState(.clk(clk),
.din(next_state),
.dout(w_curr_state));
endmodule
QLearningAgent_tb.v
module QLearningAgent_tb();
//input Declaration
reg clk = 1'b0;
reg en = 1'b0;

reg [5:0] next_state;
reg signed [15:0] next_reward;
reg signed [15:0] next_reward_temp;
// reg [3:0] gamma;
// reg [3:0] alpha;
//port mapping
QLearningAgent DUT(.clk(clk),
.en(en),
.next_reward(next_reward));
initial begin
end
//clock generator
always begin
#10 clk = ~clk; // clock with a period of 20 time units
end
//test case

always@(*) begin
if (next_state == 6'd25)
next_reward = 16'd100;
else if (next_state == 6'd3)
next_reward = 16'h9C00; //-100
else
next_reward = 16'h0; // 0
end
initial begin
#10;
//next_reward = 16'b00000111_00000000; //7
en = 1'b1;
memory_map[0] = 1;
next_state = 6'b000001; //1
// alpha = 4'b1000;

// gamma = 4'b1110;
#20;
next_state = 6'b000010; //2
memory_map[1] = 1;
//next_reward = 16'b00000101_00000000; //7
#20;
next_state = 6'b000111; //7
memory_map[2] = 1;
//next_reward = 16'b00000101_00000000; //7
#20;
next_state = 6'b001000; //8
memory_map[7] = 1;
#20;
memory_map[8] = 1;
//next_reward = 16'b00000101_00000000; //7
end
//display monitor
initial begin

%d %d",

memory_map[9],

end
endmodule
QUpdater.v
module QUpdater (old_Q, max_Q, reward, new_Q);
input [15:0] old_Q;
input [15:0] max_Q;
input [15:0] reward; // Current Reward
output [15:0] new_Q; // Updated Q value
wire [15:0] max_i;
wire [15:0] max_j;
wire [15:0] max_k;
wire [15:0] combined_Q;
wire [15:0] final_Q;
BarrelShifter barrel_max_i(max_Q, 4'd1, max_i);
BarrelShifter barrel_max_j(max_Q, 4'd2, max_j);
BarrelShifter barrel_max_k(max_Q, 4'd3, max_k);
BarrelShifter barrel_alpha(combined_Q, 4'd1, final_Q);
assign combined_Q = reward + max_i + max_j + max_k - old_Q;
assign new_Q = final_Q + old_Q;
endmodule
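The update computed by QUpdater can be read as the Q-learning rule with a learning rate of 1/2 and a discount factor of 1/2 + 1/4 + 1/8 = 0.875, both realised with right shifts. The C sketch below mirrors that arithmetic on 16-bit words, assuming Q8.8 fixed point as in the testbench scaling factor and logical (unsigned) shifts as in BarrelShifter. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical software model of QUpdater on raw Q8.8 words:
   new_Q = old_Q + (reward + 0.875 * max_Q - old_Q) / 2,
   with every operation wrapping at 16 bits and shifts being logical. */
uint16_t q_update(uint16_t old_q, uint16_t max_q, uint16_t reward)
{
    uint16_t discounted = (uint16_t)((max_q >> 1) + (max_q >> 2) + (max_q >> 3));
    uint16_t combined = (uint16_t)(reward + discounted - old_q);
    return (uint16_t)((combined >> 1) + old_q);
}

/* Interprets a raw 16-bit word as a signed Q8.8 number. */
double q88_to_double(uint16_t raw)
{
    double value = (raw >= 0x8000u) ? (double)raw - 65536.0 : (double)raw;
    return value / 256.0;
}
```

With the testbench stimulus (old Q = 2, max Q = 2, reward = 7) the model gives 2 + (7 + 1.75 - 2) / 2 = 5.375, i.e. the raw word `0x0560`. Because the shifts are logical, negative intermediate values are not sign-extended, so the model only matches the floating-point rule while `combined_Q` and `max_Q` stay non-negative.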
QUpdater_tb.v
module QUpdater_tb();

reg signed [15:0] old_Q;
reg signed [15:0] max_Q;
reg [3:0] gamma;
reg [15:0] current_reward;
reg [3:0] alpha;
wire [15:0] new_Q;
localparam sf = 2.0**-8.0; // Q8.8 scaling factor is 2^-8
QUpdater q_updater(old_Q, max_Q, current_reward, new_Q);
initial begin
old_Q = 16'b00000010_00000000; // 2
max_Q = 16'b00000010_00000000; // 2
current_reward = 16'b00000111_00000000; //7
#10;
$display("new_Q = %f", $itor(new_Q)*sf);
$stop;
end
endmodule
Randomizer.v
module Randomizer(ic,start,clk,q); // main module for lfsr
input [15:0]ic;
input start, clk;
output [15:0]q;
wire s;
wire [15:0]lfs;
assign s=lfs[15]^lfs[10]^lfs[9]^lfs[5];
dff dff1(lfs[15],start?ic[15]:s,clk);
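dff dff2(lfs[14],start?ic[14]:lfs[15],clk);
// remaining shift stages (assumed): each bit loads the seed on start,
// otherwise it takes the value of its upper neighbour
genvar gi;
generate
for (gi = 0; gi < 14; gi = gi + 1) begin : shift_stage
dff dff_stage(lfs[gi], start ? ic[gi] : lfs[gi+1], clk);
end
endgenerate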
assign q = {8'b00000000, lfs[7:0]};
endmodule
module dff (Q, D, Clock); //sub module- d flip flop
output Q;
input D;
input Clock;
reg Q;
always @(posedge Clock)
begin
Q <= D;
end
endmodule
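The Randomizer is a 16-bit linear-feedback shift register whose feedback bit is the XOR of bits 15, 10, 9 and 5. A minimal C sketch of one clock step follows, assuming the register shifts right by one position per clock with the feedback entering at bit 15 (as the `dff1` and `dff2` stages suggest) and that the output is the lower byte. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one Randomizer clock step with start = 0:
   feedback = lfs[15] ^ lfs[10] ^ lfs[9] ^ lfs[5], shifted in at bit 15. */
uint16_t lfsr_step(uint16_t lfs)
{
    uint16_t feedback = (uint16_t)(((lfs >> 15) ^ (lfs >> 10) ^
                                    (lfs >> 9) ^ (lfs >> 5)) & 1u);
    return (uint16_t)((lfs >> 1) | (feedback << 15));
}

/* Output of the module: the lower 8 bits, zero-extended to 16 bits. */
uint16_t lfsr_output(uint16_t lfs)
{
    assert(lfs != 0);   /* an all-zero register would stay at zero */
    return (uint16_t)(lfs & 0x00FFu);
}
```

Starting from the testbench seed `16'b0100001001000010` (`0x4242`), repeated calls to `lfsr_step` give the sequence of register values to compare against the simulation. Interpreted as Q8.8, the output byte is a fraction in the range from 0 up to just below 1, which is what the comparison with epsilon needs.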
RewardGenerator.v
module RewardGenerator (next_state, next_reward);
input [5:0] next_state;
output reg signed [15:0] next_reward;

always @(*) begin
case (next_state)
6'd25 : next_reward = 16'd100; //100
6'd3 : next_reward = 16'hFF9C; //-100
default : next_reward = 16'h0; //0
endcase
end
endmodule
StateSelector.v
module StateSelector (
input [3:0] next_action,
input [5:0] current_state,
output reg [5:0] next_state);
always @(*) begin
case(next_action)
//Move right
4'b0000: begin
if(current_state%5 != 0)
begin
next_state=current_state+1;
end
else

begin
next_state=current_state;
end
end
//Move up
4'b0001: begin
if(current_state > 5)
begin
next_state=current_state-5;
end
else
begin
next_state=current_state;
end
end
//Move left
4'b0010: begin
if(current_state%5 != 1)
begin
next_state=current_state-1;
end
else
begin
next_state=current_state;
end
end
//Move down
4'b0011: begin
if(current_state < 21)
begin
next_state=current_state+5;
end
else
begin
next_state=current_state;
end
end
default: begin next_state=current_state; end
endcase
end
endmodule
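The StateSelector encodes movement on a 5x5 grid whose cells are numbered 1 to 25 row by row. The C sketch below restates the transition rule, assuming that a move which would leave the grid keeps the agent in place and that a downward move is allowed from any cell above the last row (states below 21). The function name `next_state` and the action codes in the comments follow the Verilog case labels but are otherwise illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of StateSelector on a 5x5 grid with states 1..25.
   Actions: 0 = right, 1 = up, 2 = left, 3 = down; others keep the state. */
uint8_t next_state(uint8_t current_state, uint8_t action)
{
    assert(current_state >= 1 && current_state <= 25);
    switch (action) {
    case 0:  /* right: blocked in the last column (5, 10, ..., 25) */
        return (uint8_t)((current_state % 5 != 0) ? current_state + 1 : current_state);
    case 1:  /* up: blocked in the first row (1..5) */
        return (uint8_t)((current_state > 5) ? current_state - 5 : current_state);
    case 2:  /* left: blocked in the first column (1, 6, ..., 21) */
        return (uint8_t)((current_state % 5 != 1) ? current_state - 1 : current_state);
    case 3:  /* down: blocked in the last row (21..25) */
        return (uint8_t)((current_state < 21) ? current_state + 5 : current_state);
    default:
        return current_state;
    }
}
```

For the testbench stimulus `current_st = 8`, the four actions lead to states 3 (up), 9 (right), 7 (left) and 13 (down) in this model.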
StateSelector_tb.v
module PenentuState_Block_tb;
reg [5:0] current_st;
reg [3:0] action;
wire [5:0] next_st;
StateSelector PenentuState(.current_state(current_st), .next_action(action),
.next_state(next_st));
initial begin
current_st = 6'b001000;
action = 4'b0001;
#100;
action = 4'b0000;
#100;
action = 4'b0010;
#100;

action = 4'b0011;
#100;
$stop;
end
endmodule
Multiplexer.v
//multiplexer 16 to 1
module Mux16to1(sel, d0, d1,d2,d3,d4,d5,d6,d7,d8,d9,d10,d11,d12,d13,d14,d15, dout);
input[3:0] sel;
input[15:0] d0, d1,d2,d3,d4,d5,d6,d7,d8,d9,d10,d11,d12,d13,d14,d15;
output reg [15:0] dout;
always@(*) begin
case (sel)
4'd1 : dout = d1;
4'd2 : dout = d2;
4'd3 : dout = d3;
4'd4 : dout = d4;
4'd5 : dout = d5;
4'd6 : dout = d6;
4'd7 : dout = d7;
4'd8 : dout = d8;
4'd9 : dout = d9;
4'd10 : dout = d10;
4'd11 : dout = d11;
4'd12 : dout = d12;
4'd13 : dout = d13;
4'd14 : dout = d14;
4'd15 : dout = d15;

default : dout = d1;
endcase
end
// assign dout = (sel == 4'd0) ? d0 :
// (sel == 4'd1) ? d1 :
// (sel == 4'd2) ? d2 :
// (sel == 4'd3) ? d3 :
// (sel == 4'd4) ? d4 :
// (sel == 4'd5) ? d5 :
// (sel == 4'd6) ? d6 :
// (sel == 4'd7) ? d7 :
// (sel == 4'd8) ? d8 :
// (sel == 4'd9) ? d9 :
// (sel == 4'd10) ? d10 :
// (sel == 4'd11) ? d11 :
// (sel == 4'd12) ? d12 :
// (sel == 4'd13) ? d13 :
// d14;
endmodule
Multiplexer_tb.v
module Mux16to1_tb();
//input Declaration
reg clk = 1'b0;
reg en = 1'b0;
reg[3:0] sel;
reg[15:0] d0, d1,d2,d3,d4,d5,d6,d7,d8,d9,d10,d11,d12,d13,d14,d15;

wire[15:0] dout;
//port mapping
Mux16to1 DUT( .sel(sel),
.d1(d1),
.d2(d2),
.d3(d3),
.d4(d4),
.d5(d5),
.d6(d6),
.d7(d7),
.d8(d8),
.d9(d9),
.d10(d10),
.d11(d11),
.d12(d12),
.d13(d13),
.d14(d14),
.d15(d15),
.dout(dout)
);
//clock generator
always begin
#10 clk = ~clk; // clock with a period of 20 time units
end
//test case
initial begin
#10;
d0 = 4'd0;
d1 = 4'd1;

d2 = 4'd2;
d3 = 4'd3;
d4 = 4'd4;
d5 = 4'd5;
d6 = 4'd6;
d7 = 4'd7;
d8 = 4'd8;
d9 = 4'd9;
d10 = 4'd10;
d11 = 4'd11;
d12 = 4'd12;
d13 = 4'd13;
d14 = 4'd14;
d15 = 4'd15;
sel = 4'd0;
#20;
sel = 4'd1;
#20;
sel = 4'd2;
#20;
sel = 4'd3;
#20;
sel = 4'd4;
#20;
sel = 4'd5;
#20;
sel = 4'd6;
#20;
sel = 4'd7;

#20;
sel = 4'd8;
end
//display monitor
initial begin
$monitor("time = %2d\n dout = %2d",
$time , dout );
end
endmodule
Mux.v
module mux (sel, op1, op2, result);
input sel;
input [15:0] op1;
input [15:0] op2;
output [15:0] result;
assign result = (sel == 1'b0) ? op1 : op2;
endmodule
tb_action.v
module tb_action;
reg clk = 1'b0;
reg start = 1'b1; // Start signal, Active Low
reg [63:0] q_values;
reg [15:0] epsilon;
wire [3:0] action;
ActionSelector action_selector(clk, start, q_values, epsilon, action);
always begin
#10 clk = ~clk; // assumed clock with a period of 20 time units
end

initial begin
#10;
start = 1'b0;
q_values =
64'b0000000000001100000000000000000100000000000000100000000000000011;
epsilon = 16'b0000000011100000;
#10;
q_values =
64'b0000000000001100000000000000000100000000000000100000000000000111;
epsilon = 16'b0000000011000000;
#10;
$stop;
end
endmodule
tb_random.v
module tb_random;
reg [15:0] seed;
reg clk = 1'b0;
reg start = 1'b1; // Start signal, Active Low
wire [15:0] out;
always begin
#5 clk = ~clk; // assumed clock with a period of 10 time units
if (clk) begin
$displayb(out);
$display("Random Number = %f", $itor(out)*2.0**-8.0);
end
end
Randomizer randomizer(seed, start, clk, out);
initial begin

seed = 16'b0100001001000010;
#10;
start = 1'b0;
#1000;
$stop;
end
endmodule
axi_control_tb.v
`timescale 1ns / 1ps

module axi_control_tb();
localparam T = 8;
// *** Multiplier ***
reg clk = 0;
reg rst_n = 0;
wire [15:0] new_Q;

wire [3:0] current_action;
// *** AXI control ***

wire s_axi_arready;
reg [31:0] s_axi_araddr;
reg s_axi_arvalid;
wire s_axi_awready;
reg [31:0] s_axi_awaddr;

reg s_axi_awvalid;
reg s_axi_bready;
wire [1:0] s_axi_bresp;
wire s_axi_bvalid;
reg s_axi_rready;
wire [31:0] s_axi_rdata;
wire [1:0] s_axi_rresp;
wire s_axi_rvalid;
wire s_axi_wready;
reg [31:0] s_axi_wdata;
reg [3:0] s_axi_wstrb;
reg s_axi_wvalid;
integer i;
ActionRAM_wrapper uut ( .clk(clk),

.rst_n(rst_n),
.new_Q(new_Q),
.current_action(current_action),
.s_axi_araddr(s_axi_araddr),
.s_axi_arready(s_axi_arready),
.s_axi_arvalid(s_axi_arvalid),
.s_axi_awaddr(s_axi_awaddr),
.s_axi_awready(s_axi_awready),
.s_axi_awvalid(s_axi_awvalid),
.s_axi_bready(s_axi_bready),
.s_axi_bresp(s_axi_bresp),

.s_axi_bvalid(s_axi_bvalid),
.s_axi_rdata(s_axi_rdata),
.s_axi_rready(s_axi_rready),
.s_axi_rresp(s_axi_rresp),
.s_axi_rvalid(s_axi_rvalid),
.s_axi_wdata(s_axi_wdata),
.s_axi_wready(s_axi_wready),
.s_axi_wstrb(s_axi_wstrb),
.s_axi_wvalid(s_axi_wvalid) );
always begin
// *** Clock ***
clk= ~clk;
#(T/2);
end
initial begin
// *** Init ***
// axi_control_write(4'h8, 17'h000); //write epsilon and start
s_axi_awaddr = 0;
s_axi_awvalid = 0;
s_axi_wstrb = 0;
s_axi_wdata = 0;
s_axi_wvalid = 0;
s_axi_bready = 1;
s_axi_araddr = 0;
s_axi_arvalid = 0;
s_axi_rready = 1;
// *** Reset ***

rst_n = 0;
#(T*5);
rst_n = 1;
#(T*5);
// *** Configuration and start ***

axi_control_write(8'h0, 4'd1); //write next_state
axi_control_write(8'h4, 16'd0); //write next_reward
axi_control_write(8'h8, 17'h0c00); //write epsilon
axi_control_write(8'hc, 1'b0); //start
// Wait until process is done
#(T*50);
axi_control_read(8'h10); // 8-bit literal: 4'h10 would be truncated to 4'h0
#(T*10);
// ### 2 ###
axi_control_write(8'h4, 16'hFF9C); //write next_reward
// axi_control_write(4'h8, 17'h0f000); //write epsilon and start
#(T*50);
#(T*10);
axi_control_write(8'h4, 16'hFF9C); //write next_reward
#(T*50);

// *** Read output ***
#(T*10);
// ### 4 ### // *** Configuration and start ***
axi_control_write(8'h4, 16'h0); //write next_reward
#(T*50);
end
task axi_control_write;
input [31:0] awaddr;
input [31:0] wdata;
begin // *** Write address ***

s_axi_awaddr = awaddr;
s_axi_awvalid = 1;
#T;
s_axi_awvalid = 0;
// *** Write data ***
s_axi_wdata = wdata;
s_axi_wstrb = 4'hf;
s_axi_wvalid = 1;
#T;
s_axi_wvalid = 0;
#T;
end
endtask
task axi_control_read;
input [31:0] araddr;
begin
// *** Read address ***
s_axi_araddr = araddr;
s_axi_arvalid = 1;
#T;
s_axi_arvalid = 0;
#T;
end
endtask
endmodule
Maze solver with AXI control
#include <stdio.h>
#include <math.h>
#include "platform.h"
#include "xil_printf.h"
#include <stdlib.h>
#include <stdint.h>
#define CTRL_BASE 0X4000000
int main()
{
//REGISTERS
volatile uint32_t *ctrl_p; // volatile: points at memory-mapped AXI registers
init_platform();
//Initialize pointer

ctrl_p = (uint32_t *) CTRL_BASE;
/* ctrl_p : next_state
* ctrl_p+1 : next_reward
* ctrl_p+2 : epsilon
* ctrl_p+3 : start
* ctrl_p+4 : next_action (input)
*/
int episode,new_action, t, reward;

double epsilon;
int state_i = 0, state_j = 0;

uint8_t state = 0;
int new_state = 0;
*(ctrl_p+3) = 1;
*(ctrl_p+3) = 0;
for (episode = 0; episode < 500; episode++)

{
epsilon = (double)(1.00 - (double)(episode+1)/501.00);
*(ctrl_p+2) = round(epsilon*pow(2,16));
// printf("\nepisode : %d epsilon : %d", episode,
// *(ctrl_p+2));
//writing start
*(ctrl_p+3) = 0;
for (t = 0; t < 15; t++)

{
if (state == 24)
{
break;
}
if (new_state == 24)
{
reward = 0x6400; //100
}
else if (t == 14)
{
reward = 0xf100;
}
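// obstacle cells; the indices are assumed to be the 0-based form of the
// terminal states 5, 8, 7, 20, 14, 17, 19, 22 listed in ControlUnit.v
else if (new_state == 4 || new_state == 7 || new_state == 6 ||
new_state == 19 || new_state == 13 || new_state == 16 ||
new_state == 18 || new_state == 21)
{
reward = 0x9C00;
}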
else
{
reward = 0;
}
//read new action
new_action = (uint32_t)*(ctrl_p+4);
printf("%d->", new_action);
// write next reward

*(ctrl_p+1) = reward;
switch(new_action)
{
case 1:
state_j += 1; //down
break;
case 2:
state_i -= 1; //up
break;
case 3:
state_j -= 1; //left
break;
case 4:
state_i += 1; //right
break;
}
if (state_i < 0)
{
state_i = 0;
}
else if (state_j < 0)
{
state_j = 0;
}
else if (state_i > 4)
{
state_i = 4;
}
else if (state_j > 4)
{
state_j = 4;
}
// //writing next_state

new_state = state_i * 5 + state_j;
state = new_state;
*(ctrl_p + 0) = new_state;
// printf("\naction\n");
}
printf("\n");
}
cleanup_platform();
return 0;
}
