Reinforcement Learning
Basically, the load behavior that arises from energy use changes continually. Like a

Thus, in this experiment, a design process is carried out to produce a reinforcement learning architecture in hardware that performs the Q-table calculations and is integrated with software that determines the state. The objectives of this experiment are as follows.
1. Design the reinforcement learning hardware in Verilog and verify it with ModelSim.
2. Create a software design in the C language for the application used and verify it.
3. Design the memory on the Zybo board and verify it.
4. Integrate the memory, hardware, and software to solve the problem.

2. LITERATURE REVIEW

2.1 REINFORCEMENT LEARNING

Reinforcement learning (RL) is a machine learning technique concerned with how software agents take actions in an environment. It is one of the three basic machine learning paradigms, alongside supervised learning and unsupervised learning.
This technique differs from supervised learning in that it does not require labeled input/output pairs to be presented, and it does not require sub-optimal actions to be corrected explicitly. Instead, the focus is on finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge).
The environment in RL is usually expressed as a Markov decision process (MDP), because many RL algorithms for this context use dynamic programming techniques. The main difference between classical dynamic programming methods and RL algorithms is that RL algorithms do not assume knowledge of an exact mathematical model of the MDP, and they target large MDPs for which exact methods are not feasible.

FIGURE 1 REINFORCEMENT LEARNING FRAMEWORK [3]

2.2 VERILOG DESIGN AND MODELSIM

Verilog is a hardware description language (HDL) used to model electronic systems. It is most often used in the design and verification of digital circuits at the register-transfer level of abstraction. Apart from that, Verilog is also used in the verification of analog and mixed-signal circuits, as well as in the design of genetic circuits. In 2009, the Verilog standard was merged into the SystemVerilog standard, resulting in IEEE Standard 1800-2009. Since then, Verilog has officially been part of the SystemVerilog language. The version current at the time of writing is the IEEE 1800-2017 standard.
Hardware description languages such as Verilog resemble software programming languages, and they also include ways of describing propagation time and signal strength (sensitivity). There are two types of assignment operators, namely blocking assignments (=) and non-blocking assignments (<=). Non-blocking assignments allow designers to describe state-machine updates without needing to declare and use temporary storage variables. Since this concept is part of the semantics of the Verilog language, designers can quickly write large circuit descriptions in a relatively compact and concise form. At the time Verilog was introduced, it represented a tremendous increase in productivity for circuit designers who had been using graphical schematic capture software and specially written software programs to document and simulate electronic circuits.
Verilog's designers wanted a language with a syntax similar to the C programming language, which was already widely used in engineering software development. Like C, Verilog is case sensitive and has a basic preprocessor (although it is less sophisticated
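The difference between blocking and non-blocking assignment described above can be imitated in C. The sketch below is illustrative only and is not taken from the report: the struct `Regs` and both function names are hypothetical. `swap_blocking` applies the two assignments one after another, so the second one sees the value written by the first. `swap_nonblocking` evaluates both right-hand sides on the current state and only then commits them, which is how `<=` behaves inside a clocked block and why no temporary variable is needed in Verilog.

```c
#include <stdint.h>

/* Two 16-bit registers; the names a and b are illustrative. */
typedef struct {
    uint16_t a;
    uint16_t b;
} Regs;

/* Blocking style (a = b; b = a;): statements take effect in order,
 * so b is assigned the value that a has already received from b. */
Regs swap_blocking(Regs r)
{
    r.a = r.b;
    r.b = r.a; /* reads the new a, so both end up equal to the old b */
    return r;
}

/* Non-blocking style (a <= b; b <= a;): every right-hand side is
 * evaluated on the current state, then all updates are committed. */
Regs swap_nonblocking(Regs r)
{
    Regs next;
    next.a = r.b; /* evaluated from the old state */
    next.b = r.a; /* evaluated from the old state */
    return next;  /* commit both at once */
}
```

Starting from a = 1 and b = 2, the blocking version leaves both registers at 2, while the non-blocking version yields a = 2 and b = 1.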
Flowchart steps: adapt to the C language; integrate hardware, software, and memory; carry out the final verification process with a specific application; perform integration; observe the results and demo.

D. Integration Process
Finally, after the whole process has gone well, the integration process can be carried out.

There are several voltage operating zones that are defined to determine the quality of the voltage distribution.
• Normal Zone (0.95-1.05 pu) (100 points)
• Violation Zone (0.8-0.95 pu or 1.05-1.25 pu) (-50 points)
• Diverged Zone (<0.8 pu or >1.25 pu) (-100 points)
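The zone scoring above can be written as a small reward function. This is a minimal sketch, assuming the reward is computed per bus from its voltage in per-unit (pu). The function name `voltage_reward` is hypothetical, and the handling of the shared boundary values is an assumption: 0.95 and 1.05 pu are counted as Normal, and 0.8 and 1.25 pu as Violation.

```c
/* Reward for one bus voltage, given in per-unit (pu).
 * Boundary handling is an assumption: the limits of the Normal zone
 * count as Normal, the outer limits count as Violation. */
int voltage_reward(double v_pu)
{
    if (v_pu >= 0.95 && v_pu <= 1.05) {
        return 100;  /* Normal Zone */
    }
    if (v_pu >= 0.80 && v_pu <= 1.25) {
        return -50;  /* Violation Zone */
    }
    return -100;     /* Diverged Zone */
}
```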
Application Details of Reinforcement Learning on Voltage Control in Electric Power Systems
Conventional electric power systems face several challenges, such as fast and deep ramps and increasing uncertainty, that threaten the safety and economics of their operation. Under extreme conditions or local disturbances, a disturbance that is not properly controlled can spread to neighboring areas and cause a cascade of disturbances, potentially leading to a widespread blackout. It is therefore necessary to detect operational problems early. In addition, it may take a long time for the system to return to its normal state.

4. RESULT AND ANALYSIS
A. Block diagram
1. General Design
FIGURE 10 GENERAL DESIGN BLOCK DIAGRAM
E. Block Memory Timing Simulation
In addition, verification and simulation are carried out with a timing diagram. This is done using the Zybo board via the Vivado application. The following is the simulation result of the BRAM block timing.
It can be observed that the memory simulation with read and write processes has been carried out successfully. The write and read addresses and the other signals can also be observed.
F. Integration Process
First, verification is carried out on the Zybo board for the Q-Agent hardware being built. The following is a simulation carried out using the Zybo board. This simulation integrates the Zybo hardware and the block memory.
FIGURE 26 Q-AGENT SIMULATION RESULT USING VIVADO
It can be observed that the memory and hardware have gone through the timing diagram above.
Based on this simulation, the following utilization results are obtained.
The image above shows that the system runs under specific timing parameters such as slack.
FIGURE 28 TIMING CONDITION PARAMETER
In the simulation, a clock with the frequency and period shown above is used.
FIGURE 31 REGISTER UTILIZATION
It also provides block memory information on the memory used and unused on the board.
FIGURE 32 MEMORY UTILIZATION
5. CONCLUSION
Based on the experiments carried out, the following conclusions were obtained.
• The hardware design for the Q-Agent has been successfully carried out and simulated. Verification using the maze solver has also been carried out.
• The software in the C language is also well made and provides appropriate results and verification.
• Timing simulation of the block memory has been carried out with successful writing and reading.
• The integration process has been attempted. Hardware and memory have been successfully simulated on Zybo.
REFERENCE
//action ram
input clk;
input en;
input write_en;
input[5:0] wr_addr ;
input[5:0] rd_addr;
input[15:0] data_in;
//Memory Model
reg[15:0] mem[0:63];
initial begin
end
if(!en) begin
data_out = 16'd0;
end
else begin
end
if (write_en) begin
end
else begin
mem[wr_addr] <= mem[wr_addr]; //do nothing
end
end
endmodule
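Because parts of the listing above are missing, the following C model is only a behavioral sketch of what the action RAM appears to do: 64 words of 16 bits, a read port that outputs zero when `en` is low, and a write port gated by `write_en`. The type `ActionRam` and the function `action_ram_tick` are hypothetical names. Two details are assumptions: the read returns the contents from before the write of the same clock edge, and the write does not depend on `en`.

```c
#include <assert.h>
#include <stdint.h>

#define ACTION_RAM_DEPTH 64 /* matches reg[15:0] mem[0:63] */

typedef struct {
    uint16_t mem[ACTION_RAM_DEPTH];
    uint16_t data_out;
} ActionRam;

/* Model of one rising clock edge.
 * Assumption: read-before-write, so a read of the address being
 * written returns the old contents on this edge. */
void action_ram_tick(ActionRam *ram, int en, int write_en,
                     uint8_t wr_addr, uint8_t rd_addr, uint16_t data_in)
{
    assert(wr_addr < ACTION_RAM_DEPTH && rd_addr < ACTION_RAM_DEPTH);

    /* Read port: forced to zero while the block is disabled. */
    ram->data_out = en ? ram->mem[rd_addr] : 0;

    /* Write port: memory keeps its value when write_en is low. */
    if (write_en) {
        ram->mem[wr_addr] = data_in;
    }
}
```

Writing 5 to address 5 and reading the same address one edge later returns 5, which is the pattern the testbench below exercises.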
ActionRam_tb.v
module ActionRAM_tb();
//input Declaration
reg en = 1'b0;
//port mapping
ActionRAM DUT(.clk(clk),
.en(en),
.write_en(write_en),
.wr_addr(wr_addr),
.rd_addr(rd_addr),
.data_in(data_in),
.data_out(data_out));
//clock generator
always begin
end
initial begin
#10;
en = 1'b1;
#20;
write_en = 1'b1;
wr_addr = 6'd5;
rd_addr = 6'd5;
data_in = 16'd5;
#20;
write_en = 1'b0;
rd_addr = 6'd5;
#20;
write_en = 1'b1;
wr_addr = 6'd1;
rd_addr = 6'd1;
data_in = 16'd10;
#20;
write_en = 1'b0;
rd_addr = 6'd1;
#20;
end
//display monitor
initial begin
$time , data_out);
end
endmodule
ActionSelector.v
action = 4'd4;
end
action = 4'd3;
end
action = 4'd2;
end
action = 4'd1;
end
action = 4'd1;
end
end
else begin
action = 4'd1;
end
action = 4'd2;
end
action = 4'd3;
end
else begin
action = 4'd4;
end
end
end
endmodule
ActionSelector_tb.v
module ActionSeletor_tb();
initial begin
epsilon = 16'b0000000011100000;
#10;
q_values =
64'b0000000000001100000000000000000100000000000000100000000000000111;
epsilon = 16'b0000000011000000;
#10;
$stop;
end
endmodule
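The testbench above drives a 64-bit `q_values` word and a 16-bit `epsilon`, which suggests an epsilon-greedy choice among four actions. The C sketch below illustrates that idea under stated assumptions and is not the author's ActionSelector: the packing order (action 1 in the least significant 16 bits), the unsigned comparison, and the way the random action is derived are all assumptions, and `q_field` and `select_action` are hypothetical names. The random numbers are passed in as arguments so that the function stays deterministic.

```c
#include <stdint.h>

/* Extract the 16-bit Q-value of action 1..4 from the packed word.
 * Assumption: action 1 sits in the least significant 16 bits. */
uint16_t q_field(uint64_t q_values, int action)
{
    return (uint16_t)(q_values >> (16 * (action - 1)));
}

/* Epsilon-greedy choice. rnd is a 16-bit random draw compared with
 * epsilon; rnd_action supplies the random action when exploring. */
int select_action(uint64_t q_values, uint16_t epsilon,
                  uint16_t rnd, uint16_t rnd_action)
{
    if (rnd < epsilon) {
        return (int)(rnd_action % 4) + 1; /* explore: actions 1..4 */
    }

    int best = 1;
    for (int a = 2; a <= 4; a++) {
        if (q_field(q_values, a) > q_field(q_values, best)) {
            best = a;
        }
    }
    return best; /* exploit: first action with the largest Q-value */
}
```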
BarrelShifter.v
endmodule
BarrelShifter_tb.v
module tb_barrel;
shift_mag = 4'b0001;
op = 16'b0000000000000001;
#10;
shift_mag = 4'b0101;
op = 16'b0000000000000101;
#10;
$stop;
end
endmodule
CU.v
module ControlUnit(clk,enb,epsilon,next_action,current_st,episode,fail,finish,print);
input clk;
input [15:0]epsilon;
input enb;
reg start;
reg en;
RewardGenerator Reward(.next_state(next_state),
.next_reward(w_next_reward));
StateSelector State(.next_action(w_next_action),
.current_state(current_state),
.next_state(next_state));
QLearningAgent Agent(.clk(clk),
.en(en),
.start(start),
.next_reward(w_next_reward),
.next_state(next_state),
.epsilon(epsilon),
.next_action(w_next_action));
always@(*) begin
//State Start
current_state=6'd1;
current_st=6'd1;
start=1'b1;
en=1'b1;
print=1'b1;
counter=4'd0;
//State Calculation
start=1'b0;
en=1'b1;
print=1'b1;
end
//State finish
start=1'b0;
en=1'b0;
print=1'b0;
counter=4'd0;
finish=1'b1;
end
else begin
start=1'b0;
en=1'b0;
print=1'b0;
counter=4'd0;
fail=1'b0;
end
next_condition=2'b10;
end
next_condition=2'b00;
next_condition=2'b00;
end
next_condition=2'b11;
end
next_condition=2'b01;
end
end
current_state<=next_state;
current_st<=next_state;
current_condition<=next_condition;
current_condition<=2'b00;
episode=10'd0;
end
counter=counter+1;
end
episode= episode+1;
end
endmodule
CU_tb.v
module ControlUnit_tb();
//input Declaration
wire fail;
wire finish;
wire print;
//port mapping
ControlUnit DUT(.clk(clk),
.enb(enb),
.epsilon(epsilon),
.next_action(next_action),
.current_st(current_st),
.fail(fail),
.finish(finish),
.print(print));
reg[3:0] memory_map[0:24];
initial begin
end
//clock generator
always begin
end
initial begin
#20;
enb = 1'b0;
next_action = 4'b0000;
//current_action = 4'd1;
// alpha = 4'b1000;
// gamma = 4'b1110;
#20;
next_action = 4'b0000;
//current_action = 4'd1;
#20;
next_action = 4'b0000;
#20;
next_action = 4'b0000;
#20;
next_action = 4'b0000;
#20;
next_action = 4'b0000;
#20;
next_action = 4'b0000;
#20;
next_action = 4'b0011;
#20;
next_action = 4'b0011;
#20;
next_action = 4'b0011;
#20;
next_action = 4'b0011;
#20;
next_action = 4'b0000;
#20;
next_action = 4'b0000;
#20;
next_action = 4'b0011;
#20;
next_action = 4'b0011;
#20;
next_action = 4'b0000;
#20;
next_action = 4'b0000;
end
always@(episode) begin
end
//display monitor
endcase
memory_map[10],memory_map[11],memory_map[12],memory_map[13],memory_map[14],
memory_map[15],memory_map[16],memory_map[17],memory_map[18],memory_map[19],
memory_map[20],memory_map[21],memory_map[22],memory_map[23],memory_map[24]);
end
endmodule
ControlUnit.v
module ControlUnit(clk,enb,epsilon,next_action,current_st,episode,fail,finish,print);
input clk;
input [15:0]epsilon;
input enb;
reg start;
reg en;
//current_condition = 2'b00;
RewardGenerator Reward(.next_state(next_state),
.next_reward(w_next_reward));
StateSelector State(.next_action(next_action),
.current_state(current_state),
.next_state(next_state));
/*
QLearningAgent Agent(.clk(clk),
.en(en),
.reward(w_next_reward),
.next_state(w_next_state)
.next_action(w_next_action));
*/
always@(*) begin
//State Start
current_state=6'd1;
current_st=6'd1;
//next_state=6'd1;
start=1'b1;
en=1'b0;
print=1'b1;
counter=4'd0;
end
start=1'b0;
en=1'b1;
print=1'b1;
end
//State finish
start=1'b0;
en=1'b0;
print=1'b0;
counter=4'd0;
finish=1'b1;
end
else begin
start=1'b0;
en=1'b0;
print=1'b0;
counter=4'd0;
fail=1'b0;
end
next_condition=2'b10;
end
next_condition=2'b00;
end
next_condition=2'b00;
end
next_condition=2'b11;
end
next_condition=2'b01;
end
end
current_state<=next_state;
current_st<=next_state;
current_condition<=next_condition;
current_condition<=2'b00;
end
counter=counter+1;
end
episode= episode+1;
//current_state=6'd1;
end
end
endmodule
ControlUnit_tb.v
Final Exam Report - VLSI System Design - STEI ITB
`timescale 100 ps/1 ps // time-unit = 100 ps, precision = 1 ps
module ControlUnit_tb();
//input Declaration
wire fail;
wire finish;
wire print;
//port mapping
ControlUnit DUT(.clk(clk),
.enb(enb),
.epsilon(epsilon),
.next_action(next_action),
.current_st(current_st),
.episode(episode),
.fail(fail),
.finish(finish),
.print(print));
initial begin
end
//clock generator
always begin
end
initial begin
#20;
enb = 1'b0;
next_action = 4'b0000;
//current_action = 4'd1;
// alpha = 4'b1000;
// gamma = 4'b1110;
#20;
enb= 1'b0;
next_action = 4'b0000;
//current_action = 4'd1;
#20;
next_action = 4'b0000;
#20;
next_action = 4'b0000;
#20;
next_action = 4'b0000;
#20;
next_action = 4'b0000;
#20;
next_action = 4'b0000;
#20;
next_action = 4'b0011;
#20;
next_action = 4'b0011;
#20;
next_action = 4'b0011;
#20;
next_action = 4'b0011;
#20;
next_action = 4'b0000;
#20;
next_action = 4'b0000;
#20;
next_action = 4'b0011;
#20;
next_action = 4'b0011;
#20;
next_action = 4'b0000;
#20;
next_action = 4'b0000;
end
epsilon=1- episode/300;
end
//display monitor
case(current_st)
endcase
memory_map[0], memory_map[1],memory_map[2],memory_map[3],memory_map[4],
memory_map[10],memory_map[11],memory_map[12],memory_map[13],memory_map[14],
memory_map[15],memory_map[16],memory_map[17],memory_map[18],memory_map[19],
end
endmodule
Decoder.v
module Decoder (
output en1,en2,en3,en4,en5,en6,en7,en8,en9,en10,en11,en12,en13,en14,en15);
case(at)
end
endmodule
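The decoder turns the 4-bit action into fifteen enable lines, one per action RAM. A minimal C sketch of that mapping follows. The function name `decode_enables` is hypothetical, the enables are packed into one word with bit 0 standing for `en1`, and the behavior for action 0 (no enable asserted) is an assumption because the `case` body is not shown in the listing.

```c
#include <stdint.h>

/* One-hot write enables: bit (action - 1) corresponds to en<action>.
 * Assumption: action 0 and any value above 15 enable nothing. */
uint16_t decode_enables(uint8_t action)
{
    if (action < 1 || action > 15) {
        return 0;
    }
    return (uint16_t)(1u << (action - 1));
}
```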
Decoder_tb.v
module decoder_tb();
//input Declaration
reg en = 1'b0;
reg[3:0] action;
wire en1,en2,en3,en4,en5,en6,en7,en8,en9,en10,en11,en12,en13,en14,en15;
//port mapping
.en2(en2),
.en3(en3),
.en4(en4),
.en5(en5),
.en6(en6),
.en7(en7),
.en8(en8),
.en9(en9),
.en10(en10),
.en11(en11),
.en12(en12),
.en13(en13),
.en14(en14),
.en15(en15));
//clock generator
always begin
end
//test case
initial begin
#10;
action = 1;
#20;
action = 2;
#20;
action = 3;
#20;
action = 4;
end
// initial begin
// $time , Q_max);
// end
endmodule
Delay.v
// Block Delay
input clk;
//buffer register;
end
endmodule
module DelayReward(clk,din,dout);
input clk;
//buffer register;
end
endmodule
module DelayState(clk,din,dout);
input clk;
//buffer register;
end
endmodule
Delay_tb.v
module Delay_tb();
reg clk = 0;
DelayActionRAM DUT(.clk(clk),
.din(din),
.dout(dout));
//clock generator
always begin
initial begin
#10;
din = 16'd10;
#20;
din = 16'd20;
end
endmodule
MaxBlock.v
module MaxBlock (
input clk,
wire [15:0] b;
wire [15:0] c;
wire [15:0] d;
wire [15:0] e;
wire [15:0] f;
wire [15:0] g;
wire [15:0] h;
wire [15:0] i;
wire [15:0] j;
wire [15:0] k;
wire [15:0] l;
wire [15:0] m;
assign h = (a>=b)? a : b;
assign i = (c>=d)? c : d;
assign j = (e>=f)? e : f;
assign m = (j>=k)? j : k;
out=RegOut;
end
endmodule
/*
module Max_Block (
input clk,
wire [15:0] a;
wire [15:0] b;
wire [15:0] c;
wire [15:0] d;
wire [15:0] e;
wire [15:0] f;
wire [15:0] g;
wire [15:0] h;
wire [15:0] i;
wire [15:0] j;
wire [15:0] k;
wire [15:0] l;
wire [15:0] m;
Rega=a;
Regb=b;
Regc=c;
Regd=d;
Rege=e;
Regf=f;
Regg=g;
Regh=h;
Regi=i;
Regj=j;
Regk=k;
Regl=l;
Regm=m;
out=RegOut;
end
endmodule
*/
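The `assign h = (a>=b)? a : b;` lines above form a tree of two-input comparators that reduces the fifteen Q-values to one maximum. The C sketch below shows the same pairwise reduction in software. The names `max2` and `max_block` are hypothetical, and the comparison is unsigned, as it would be for plain `wire [15:0]` operands.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_ACTIONS 15 /* Q_Act1 .. Q_Act15 */

/* Two-input comparator, same form as (a>=b)? a : b. */
static uint16_t max2(uint16_t a, uint16_t b)
{
    return (a >= b) ? a : b;
}

/* Pairwise (tournament) reduction, one comparator level per pass. */
uint16_t max_block(const uint16_t q[NUM_ACTIONS])
{
    uint16_t level[NUM_ACTIONS];
    size_t n = NUM_ACTIONS;

    for (size_t i = 0; i < n; i++) {
        level[i] = q[i];
    }
    while (n > 1) {
        size_t out = 0;
        for (size_t i = 0; i + 1 < n; i += 2) {
            level[out++] = max2(level[i], level[i + 1]);
        }
        if (n % 2 == 1) {
            level[out++] = level[n - 1]; /* odd element advances as is */
        }
        n = out;
    }
    return level[0];
}
```

With fifteen inputs the loop runs four passes (15, 8, 4, 2, 1 values), which corresponds to four comparator levels in hardware.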
MaxBlock_tb.v
//Testbench Max Block
module MaxBlock_tb();
//input Declaration
reg en = 1'b0;
reg[15:0] Q_Act1;
reg[15:0] Q_Act2;
reg[15:0] Q_Act3;
reg[15:0] Q_Act4;
reg[15:0] Q_Act5;
reg[15:0] Q_Act6;
reg[15:0] Q_Act7;
reg[15:0] Q_Act8;
reg[15:0] Q_Act9;
reg[15:0] Q_Act10;
reg[15:0] Q_Act11;
reg[15:0] Q_Act12;
reg[15:0] Q_Act13;
reg[15:0] Q_Act14;
reg[15:0] Q_Act15;
wire[15:0] Q_max;
//port mapping
MaxBlock DUT(.clk(clk),
.Q_Act1(Q_Act1),
.Q_Act2(Q_Act2),
.Q_Act3(Q_Act3),
.Q_Act4(Q_Act4),
.Q_Act5(Q_Act5),
.Q_Act6(Q_Act6),
.Q_Act7(Q_Act7),
.Q_Act8(Q_Act8),
.Q_Act9(Q_Act9),
.Q_Act10(Q_Act10),
.Q_Act11(Q_Act11),
.Q_Act12(Q_Act12),
.Q_Act13(Q_Act13),
.Q_Act14(Q_Act14),
.Q_Act15(Q_Act15),
.out(Q_max));
//clock generator
always begin
end
//test case
initial begin
#10;
Q_Act1 = 16'd1;
Q_Act2 = 16'd2;
Q_Act3 = 16'd3;
Q_Act4 = 16'd4;
Q_Act5 = 16'd0;
Q_Act6 = 16'd0;
Q_Act7 = 16'd0;
Q_Act8 = 16'd0;
Q_Act9 = 16'd0;
Q_Act10 = 16'd0;
Q_Act11 = 16'd0;
Q_Act12 = 16'd0;
Q_Act13 = 16'd0;
Q_Act14 = 16'd0;
Q_Act15 = 16'd0;
#10;
Q_Act1 = 16'd2;
Q_Act2 = 16'd3;
Q_Act3 = 16'd6;
Q_Act4 = 16'd1;
#30;
Q_Act1 = 16'd5;
Q_Act2 = 16'd4;
Q_Act3 = 16'd1;
Q_Act4 = 16'd2;
#20;
Q_Act1 = 16'd4;
Q_Act2 = 16'd7;
Q_Act3 = 16'd2;
Q_Act4 = 16'd3;
end
//display monitor
initial begin
$time , Q_max);
end
endmodule
MazeMap.v
//Maze Map
input clk;
//output
reg[3:0] memory_map[0:24];
initial begin
end
end
case(prev_state)
endcase
case(current_state)
endcase
end
//display monitor
// initial begin
// memory_map[10],memory_map[11],memory_map[12],memory_map[13],memory_map[14],
// memory_map[15],memory_map[16],memory_map[17],memory_map[18],memory_map[19],
// memory_map[20],memory_map[21],memory_map[22],memory_map[23],memory_map[24]);
// end
endmodule
PolicyGenerator.v
input start;
assign next_action=w_current_action;
//Delay Action
always@(posedge clk)
begin
end
PolicyGenerator_tb.v
module PolicyGenerator_tb();
//clock generator
always begin
end
initial begin
q_values =
64'b0000000000001100000000000000000100000000000000100000000000000011;
epsilon = 16'b0000000011100000;
#10;
q_values =
64'b0000000000001100000000000000000100000000000000100000000000000111;
epsilon = 16'b0000000011000000;
#20;
q_values = 64'h0001100001000010;
end
endmodule
QLearningAccelerator.v
// Q-Learning Accelerator
//wiring Q new
//wiring
wire[15:0] out_ram_1, out_delay_1,
out_ram_2, out_delay_2,
out_ram_3, out_delay_3,
out_ram_4, out_delay_4,
out_ram_5, out_delay_5,
out_ram_6, out_delay_6,
out_ram_7, out_delay_7,
out_ram_8, out_delay_8,
out_ram_9, out_delay_9,
out_ram_10, out_delay_10,
out_ram_11, out_delay_11,
out_ram_12, out_delay_12,
out_ram_13, out_delay_13,
out_ram_14, out_delay_14,
out_ram_15, out_delay_15;
wire en1,
en2,
en3,
en4,
en5,
en6,
en7,
en8,
en9,
en10,
en11,
en12,
en13,
en14,
en15;
wire[15:0] q_value_selected;
wire[15:0] q_max;
//module instantiation
ActionRAM ram_1(.clk(clk),
.en(en),
.wr_addr(current_state),
.rd_addr(next_state),
.write_en(en1),
.data_in(w_Q_new),
.data_out(out_ram_1));
ActionRAM ram_2(.clk(clk),
.en(en),
.wr_addr(current_state),
.rd_addr(next_state),
.write_en(en2),
.data_in(w_Q_new),
.data_out(out_ram_2));
ActionRAM ram_3(.clk(clk),
.en(en),
.wr_addr(current_state),
.rd_addr(next_state),
.write_en(en3),
.data_in(w_Q_new),
.data_out(out_ram_3));
ActionRAM ram_4(.clk(clk),
.en(en),
.wr_addr(current_state),
.rd_addr(next_state),
.write_en(en4),
.data_in(w_Q_new),
.data_out(out_ram_4));
ActionRAM ram_5(.clk(clk),
.en(en),
.wr_addr(current_state),
.rd_addr(next_state),
.write_en(en5),
.data_in(Q_new),
.data_out(out_ram_5));
ActionRAM ram_6(.clk(clk),
.en(en),
.wr_addr(current_state),
.rd_addr(next_state),
.write_en(en6),
.data_in(Q_new),
.data_out(out_ram_6));
ActionRAM ram_7(.clk(clk),
.en(en),
.wr_addr(current_state),
.rd_addr(next_state),
.write_en(en7),
.data_in(Q_new),
.data_out(out_ram_7));
ActionRAM ram_8(.clk(clk),
.en(en),
.wr_addr(current_state),
.rd_addr(next_state),
.write_en(en8),
.data_in(Q_new),
.data_out(out_ram_8));
ActionRAM ram_9(.clk(clk),
.en(en),
.wr_addr(current_state),
.rd_addr(next_state),
.write_en(en9),
.data_in(Q_new),
.data_out(out_ram_9));
ActionRAM ram_10(.clk(clk),
.en(en),
.wr_addr(current_state),
.rd_addr(next_state),
.write_en(en10),
.data_in(Q_new),
.data_out(out_ram_10));
ActionRAM ram_11(.clk(clk),
.en(en),
.wr_addr(current_state),
.rd_addr(next_state),
.write_en(en11),
.data_in(Q_new),
.data_out(out_ram_11));
ActionRAM ram_12(.clk(clk),
.en(en),
.wr_addr(current_state),
.rd_addr(next_state),
.write_en(en12),
.data_in(Q_new),
.data_out(out_ram_12));
ActionRAM ram_13(.clk(clk),
.en(en),
.wr_addr(current_state),
.rd_addr(next_state),
.write_en(en13),
.data_in(Q_new),
.data_out(out_ram_13));
ActionRAM ram_14(.clk(clk),
.en(en),
.wr_addr(current_state),
.rd_addr(next_state),
.write_en(en14),
.data_in(Q_new),
.data_out(out_ram_14));
ActionRAM ram_15(.clk(clk),
.en(en),
.wr_addr(current_state),
.rd_addr(next_state),
.write_en(en15),
.data_in(Q_new),
.data_out(out_ram_15));
DelayActionRAM delay_1(.clk(clk),
.din(out_ram_1),
.dout(out_delay_1));
DelayActionRAM delay_2(.clk(clk),
.din(out_ram_2),
.dout(out_delay_2));
DelayActionRAM delay_3(.clk(clk),
.din(out_ram_3),
.dout(out_delay_3));
DelayActionRAM delay_4(.clk(clk),
.din(out_ram_4),
.dout(out_delay_4));
DelayActionRAM delay_5(.clk(clk),
.din(out_ram_5),
.dout(out_delay_5));
DelayActionRAM delay_6(.clk(clk),
.din(out_ram_6),
.dout(out_delay_6));
DelayActionRAM delay_7(.clk(clk),
.din(out_ram_7),
.dout(out_delay_7));
DelayActionRAM delay_8(.clk(clk),
.din(out_ram_8),
.dout(out_delay_8));
DelayActionRAM delay_9(.clk(clk),
.din(out_ram_9),
.dout(out_delay_9));
DelayActionRAM delay_10(.clk(clk),
.din(out_ram_10),
.dout(out_delay_10));
DelayActionRAM delay_11(.clk(clk),
.din(out_ram_11),
.dout(out_delay_11));
DelayActionRAM delay_12(.clk(clk),
.din(out_ram_12),
.dout(out_delay_12));
DelayActionRAM delay_13(.clk(clk),
.din(out_ram_13),
.dout(out_delay_13));
DelayActionRAM delay_14(.clk(clk),
.din(out_ram_14),
.dout(out_delay_14));
DelayActionRAM delay_15(.clk(clk),
.din(out_ram_15),
.dout(out_delay_15));
Mux16to1 mux16to1(.sel(current_action),
.d1(out_delay_1),
.d2(out_delay_2),
.d3(out_delay_3),
.d4(out_delay_4),
.d5(out_delay_5),
.d6(out_delay_6),
.d7(out_delay_7),
.d8(out_delay_8),
.d9(out_delay_9),
.d10(out_delay_10),
.d11(out_delay_11),
.d12(out_delay_12),
.d13(out_delay_13),
.d14(out_delay_14),
.d15(out_delay_15),
.dout(q_value_selected));
//this is new
//Decoder
Decoder Decoder(.at(current_action),
.en1(en1),
.en2(en2),
.en3(en3),
.en4(en4),
.en5(en5),
.en6(en6),
.en7(en7),
.en8(en8),
.en9(en9),
.en10(en10),
.en11(en11),
.en12(en12),
.en13(en13),
.en14(en14),
.en15(en15));
//Max_Block
MaxBlock MaxBlock(.clk(clk),
.Q_Act1(out_ram_1),
.Q_Act2(out_ram_2),
.Q_Act3(out_ram_3),
.Q_Act4(out_ram_4),
.Q_Act5(out_ram_5),
.Q_Act6(out_ram_6),
.Q_Act7(out_ram_7),
.Q_Act8(out_ram_8),
.Q_Act9(out_ram_9),
.Q_Act10(out_ram_10),
.Q_Act11(out_ram_11),
.Q_Act12(out_ram_12),
.Q_Act13(out_ram_13),
.Q_Act14(out_ram_14),
.Q_Act15(out_ram_15),
.out(q_max));
//Q updater
QUpdater Qupdater(.old_Q(q_value_selected),
.max_Q(q_max),
.new_Q(w_Q_new));
end
endmodule
QLearningAccelerator_tb.v
module QLearningAccelerator_tb();
//input Declaration
reg en = 1'b0;
//port mapping
QLearningAccelerator DUT(.clk(clk),
.en(en),
.current_action(current_action),
.current_state(current_state),
.next_state(next_state),
.current_reward(current_reward),
.Q_out_action(Q_out_action));
//clock generator
always begin
end
//test case
// always@(*) begin
// if (next_state == 6'd25)
// reward = 16'd100;
// else
// reward = 16'h0; // 0
// end
initial begin
#10;
en = 1'b1;
//current_action = 4'd1;
// alpha = 4'b1000;
// gamma = 4'b1110;
#20;
current_state = 6'b000010;
next_state = 6'b000011;
//current_action = 4'd1;
#20;
current_state = 6'd3;
next_state = 6'd8;
current_action = 4'd1;
#20;
current_state = 6'd8;
next_state = 6'd3;
current_action = 4'd4;
end
//display monitor
initial begin
$time , result);
end
endmodule
QLearningAgent.v
input clk;
input en;
input start;
input[15:0] epsilon;
//wire w_next_reward;
//wire w_next_state;
.en(en),
.current_action(w_curr_action),
.current_state(w_curr_state),
.next_state(next_state),
.current_reward(w_curr_reward),
.Q_out_action(w_q_values));
PolicyGenerator PolicyGenerator(.clk(clk),
.start(start),
.current_action(w_curr_action),
.epsilon(epsilon),
.q_values(w_q_values),
.next_action(next_action));
DelayReward DelayReward(.clk(clk),
.din(next_reward),
.dout(w_curr_reward));
DelayState DelayState(.clk(clk),
.din(next_state),
.dout(w_curr_state));
endmodule
QLearningAgent_tb.v
module QLearningAgent_tb();
//input Declaration
reg en = 1'b0;
//port mapping
QLearningAgent DUT(.clk(clk),
.en(en),
.next_state(next_state),
.next_reward(next_reward));
reg[3:0] memory_map[0:24];
initial begin
end
//clock generator
always begin
end
//test case
if (next_state == 6'd25)
next_reward = 16'd100;
else
next_reward = 16'h0; // 0
end
initial begin
#10;
en = 1'b1;
memory_map[0] = 1;
//current_action = 4'd1;
// alpha = 4'b1000;
#20;
memory_map[1] = 1;
//current_action = 4'd1;
#20;
memory_map[2] = 1;
#20;
memory_map[7] = 1;
#20;
memory_map[8] = 1;
end
//display monitor
initial begin
memory_map[0], memory_map[1],memory_map[2],memory_map[3],memory_map[4],
memory_map[15],memory_map[16],memory_map[17],memory_map[18],memory_map[19],
memory_map[20],memory_map[21],memory_map[22],memory_map[23],memory_map[24]);
end
endmodule
QUpdater.v
endmodule
QUpdater_tb.v
module QUpdater_tb();
initial begin
old_Q = 16'b00000010_00000000; // 2
max_Q = 16'b00000010_00000000; // 2
#10;
$stop;
end
endmodule
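The body of QUpdater is not shown, so the following is only a sketch of the standard Q-learning update, Q_new = Q_old + alpha * (reward + gamma * Q_max - Q_old), in integer arithmetic. Several things here are assumptions rather than facts from the report: alpha and gamma are taken from the commented-out testbench values `4'b1000` and `4'b1110` and read as fractions of 16 (0.5 and 0.875), Q-values and the reward share one signed fixed-point scale (8 fractional bits, as the `// 2` comment on `16'b00000010_00000000` suggests), and the name `q_update` is hypothetical.

```c
#include <stdint.h>

#define FRAC_SHIFT 4  /* alpha and gamma are 4-bit fractions (value/16) */
#define ALPHA_Q4   8  /* 4'b1000 -> 8/16  = 0.5   (assumed) */
#define GAMMA_Q4   14 /* 4'b1110 -> 14/16 = 0.875 (assumed) */

/* Q_new = Q_old + alpha * (reward + gamma * Q_max - Q_old).
 * All three inputs use the same signed fixed-point scale. */
int16_t q_update(int16_t old_q, int16_t max_q, int16_t reward)
{
    int32_t discounted = ((int32_t)GAMMA_Q4 * max_q) / (1 << FRAC_SHIFT);
    int32_t td_error = (int32_t)reward + discounted - old_q;
    int32_t new_q = old_q + ((int32_t)ALPHA_Q4 * td_error) / (1 << FRAC_SHIFT);
    return (int16_t)new_q; /* no saturation in this sketch */
}
```

For the testbench stimulus above (old_Q = 2.0, max_Q = 2.0) and a reward of zero, the sketch gives 2 + 0.5 * (0.875 * 2 - 2) = 1.875, which is 0x01E0 with 8 fractional bits.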
Randomizer.v
input [15:0]ic;
output [15:0]q;
wire s;
wire [15:0]lfs;
assign s=lfs[15]^lfs[10]^lfs[9]^lfs[5];
dff dff1(lfs[15],start?ic[15]:s,clk);
dff dff2(lfs[14],start?ic[14]:lfs[15],clk);
dff dff3(lfs[13],start?ic[13]:lfs[14],clk);
dff dff4(lfs[12],start?ic[12]:lfs[13],clk);
dff dff5(lfs[11],start?ic[11]:lfs[12],clk);
dff dff6(lfs[10],start?ic[10]:lfs[11],clk);
dff dff7(lfs[9],start?ic[9]:lfs[10],clk);
dff dff8(lfs[8],start?ic[8]:lfs[9],clk);
dff dff9(lfs[7],start?ic[7]:lfs[8],clk);
dff dff10(lfs[6],start?ic[6]:lfs[7],clk);
dff dff11(lfs[5],start?ic[5]:lfs[6],clk);
dff dff12(lfs[4],start?ic[4]:lfs[5],clk);
dff dff13(lfs[3],start?ic[3]:lfs[4],clk);
dff dff14(lfs[2],start?ic[2]:lfs[3],clk);
dff dff15(lfs[1],start?ic[1]:lfs[2],clk);
dff dff16(lfs[0],start?ic[0]:lfs[1],clk);
endmodule
output Q;
input D;
input Clock;
reg Q;
begin
Q <= D;
end
endmodule
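The randomizer above is a 16-bit shift register whose feedback bit is the XOR of bits 15, 10, 9, and 5. One clock of it, with `start` low, can be sketched in C as below. The function name `lfsr_step` is hypothetical, and the sketch assumes a regular chain in which every bit takes the value of the bit above it.

```c
#include <stdint.h>

/* One clock of the 16-bit shift register with start = 0.
 * Feedback s = lfs[15]^lfs[10]^lfs[9]^lfs[5] enters bit 15 and every
 * other bit takes the value of the bit above it (a right shift). */
uint16_t lfsr_step(uint16_t lfs)
{
    uint16_t s = (uint16_t)(((lfs >> 15) ^ (lfs >> 10) ^
                             (lfs >> 9) ^ (lfs >> 5)) & 1u);
    return (uint16_t)((lfs >> 1) | (s << 15));
}
```

An all-zero register stays at zero, so the initial value loaded through `ic` while `start` is high should be nonzero.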
RewardGenerator.v
case (next_state)
endcase
end
endmodule
StateSelector.v
module StateSelector (
case(next_action)
//Move right
4'b0000: begin
if(current_state%5 != 0)
begin
next_state=current_state+1;
end
else
begin
next_state=current_state;
end
end
//Move up
4'b0001: begin
if(current_state > 5)
begin
next_state=current_state-5;
end
else
begin
next_state=current_state;
end
end
//Move left
4'b0010: begin
if(current_state%5 != 1)
begin
next_state=current_state-1;
end
else
begin
next_state=current_state;
end
end
//Move down
4'b0011: begin
begin
next_state=current_state+5;
else
begin
next_state=current_state;
end
end
endcase
end
endmodule
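The state selector moves the agent on a 5x5 grid whose cells are numbered 1 to 25 row by row, and it leaves the state unchanged when a move would cross a wall. The C sketch below mirrors the conditions visible in the listing. The guard for the move down is missing from the listing, so the `current_state <= 20` test is an assumption chosen by symmetry with the move up. The function name `next_state` and the enum names are hypothetical.

```c
#include <stdint.h>

enum { MOVE_RIGHT = 0, MOVE_UP = 1, MOVE_LEFT = 2, MOVE_DOWN = 3 };

/* States are numbered 1..25, row by row, on a 5x5 grid. */
uint8_t next_state(uint8_t current_state, uint8_t action)
{
    switch (action) {
    case MOVE_RIGHT: /* blocked in the last column */
        return (current_state % 5 != 0) ? (uint8_t)(current_state + 1)
                                        : current_state;
    case MOVE_UP:    /* blocked in the first row */
        return (current_state > 5) ? (uint8_t)(current_state - 5)
                                   : current_state;
    case MOVE_LEFT:  /* blocked in the first column */
        return (current_state % 5 != 1) ? (uint8_t)(current_state - 1)
                                        : current_state;
    case MOVE_DOWN:  /* assumed guard: blocked in the last row */
        return (current_state <= 20) ? (uint8_t)(current_state + 5)
                                     : current_state;
    default:         /* unknown action: stay in place */
        return current_state;
    }
}
```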
StateSelector_tb.v
module PenentuState_Block_tb;
initial begin
current_st = 6'b001000;
action = 4'b0001;
#100;
current_st = 6'b001000;
action = 4'b0000;
#100;
current_st = 6'b001000;
action = 4'b0010;
#100;
current_st = 6'b001000;
#100;
$stop;
end
endmodule
Multiplexer.v
//multiplexer 16 to 1
input[3:0] sel;
always@(*) begin
case (sel)
endcase
end
// (sel == 4'd1) ? d1 :
// (sel == 4'd2) ? d2 :
// (sel == 4'd3) ? d3 :
// (sel == 4'd4) ? d4 :
// (sel == 4'd5) ? d5 :
// (sel == 4'd6) ? d6 :
// (sel == 4'd7) ? d7 :
// (sel == 4'd8) ? d8 :
// (sel == 4'd9) ? d9 :
// d14;
endmodule
Multiplexer_tb.v
module Mux16to1_tb();
//input Declaration
reg en = 1'b0;
reg[3:0] sel;
//port mapping
.d1(d1),
.d2(d2),
.d3(d3),
.d4(d4),
.d5(d5),
.d6(d6),
.d7(d7),
.d8(d8),
.d9(d9),
.d10(d10),
.d11(d11),
.d12(d12),
.d13(d13),
.d14(d14),
.d15(d15),
.dout(dout)
);
//clock generator
always begin
end
//test case
initial begin
#10;
d0 = 4'd0;
d1 = 4'd1;
d2 = 4'd2;
d3 = 4'd3;
d4 = 4'd4;
d5 = 4'd5;
d6 = 4'd6;
d7 = 4'd7;
d8 = 4'd8;
d9 = 4'd9;
d10 = 4'd10;
d11 = 4'd11;
d12 = 4'd12;
d13 = 4'd13;
d14 = 4'd14;
d15 = 4'd15;
sel = 4'd0;
#20;
sel = 4'd1;
#20;
sel = 4'd2;
#20;
sel = 4'd3;
#20;
sel = 4'd4;
#20;
sel = 4'd5;
#20;
sel = 4'd6;
#20;
sel = 4'd7;
#20;
sel = 4'd8;
end
//display monitor
initial begin
$time , dout );
end
endmodule
Mux.v
input sel;
endmodule
tb_action.v
module tb_action;
always begin
end
#10;
start = 1'b0;
q_values =
64'b0000000000001100000000000000000100000000000000100000000000000011;
epsilon = 16'b0000000011100000;
#10;
q_values =
64'b0000000000001100000000000000000100000000000000100000000000000111;
epsilon = 16'b0000000011000000;
#10;
$stop;
end
endmodule
tb_random.v
module tb_random;
always begin
if (clk) begin
$displayb(out);
end
end
initial begin
#10;
start = 1'b0;
#1000;
$stop;
end
endmodule
axi_control_tb.v
always begin
// *** Clock ***
clk= ~clk;
#(T/2);
end
initial begin
// *** Init ***
// axi_control_write(4'h8, 17'h000); //write epsilon and start
s_axi_awaddr = 0;
s_axi_awvalid = 0;
s_axi_wstrb = 0;
s_axi_wdata = 0;
s_axi_wvalid = 0;
s_axi_bready = 1;
s_axi_araddr = 0;
s_axi_arvalid = 0;
s_axi_rready = 1;
#(T*10);
// ### 2 ###
// *** Configuration and start ***
axi_control_write(8'h0, 4'd2); //write next_state
axi_control_write(8'h4, 16'hFF9C); //write next_reward
// axi_control_write(4'h8, 17'h0f000); //write epsilon and start
// Wait until process is done
#(T*50);
axi_control_read(8'h10);
#(T*10);
// *** Configuration and start ***
axi_control_write(8'h0, 4'd3); //write next_state
axi_control_write(8'h4, 16'hFF9C); //write next_reward
// Wait until process is done
#(T*50);
axi_control_read(8'h10);
#(T*10);
// ### 4 ###
// *** Configuration and start ***
axi_control_write(8'h0, 4'd4); //write next_state
axi_control_write(8'h4, 16'h0); //write next_reward
// Wait until process is done
#(T*50);
axi_control_read(8'h10);
end
task axi_control_write;
input [31:0] awaddr;
input [31:0] wdata;
task axi_control_read;
input [31:0] araddr;
begin
// *** Read address ***
s_axi_araddr = araddr;
s_axi_arvalid = 1;
#T;
s_axi_arvalid = 0;
#T;
end
endtask
endmodule
#include <stdio.h>
#include <math.h>
#include "platform.h"
#include "xil_printf.h"
#include <stdlib.h>
#include <stdint.h>
int main()
{
//REGISTERS
uint32_t *ctrl_p;
init_platform();
//Initialize pointer
/* ctrl_p : next_state
* ctrl_p+1 : next_reward
* ctrl_p+2 : epsilon
* ctrl_p+3 : start
* ctrl_p+4 : next_action (input)
*/
*(ctrl_p+2) = round(epsilon*pow(2,16));
// printf("\nepisode : %d epsilon : %d", episode, *(ctrl_p+2));
//writing start
*(ctrl_p+3) = 0;
switch(new_action)
{
case 1:
state_j += 1; //down
break;
case 2:
state_i -= 1; //up
break;
case 3:
state_j -= 1; //left
break;
case 4:
state_i += 1; //right
break;
}
if (state_i < 0)
{
state_i = 0;
}
else if (state_j < 0)
{
state_j = 0;
}
else if (state_i > 4)
{
state_i = 4;
}
else if (state_j > 4)
{
state_j = 4;
}
// //writing next_state
}
printf("\n");
}
cleanup_platform();
return 0;
}
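The listing above writes `round(epsilon*pow(2,16))` to the epsilon register, and the testbench uses `epsilon=1- episode/300`. The sketch below combines the two as a linear decay followed by a conversion to a 16-bit fraction. It is an illustration under assumptions, not the author's code: the function names are hypothetical, and the saturation to 0xFFFF is added on the assumption that the hardware keeps only 16 bits, in which case epsilon = 1.0 scaled by 2^16 would otherwise wrap to zero.

```c
#include <stdint.h>

#define EPSILON_EPISODES 300 /* from epsilon = 1 - episode/300 */

/* Linear decay from 1.0 at episode 0 to 0.0 at episode 300 and beyond. */
double epsilon_for_episode(unsigned episode)
{
    if (episode >= EPSILON_EPISODES) {
        return 0.0;
    }
    return 1.0 - (double)episode / EPSILON_EPISODES;
}

/* Scale by 2^16 and round to nearest, then saturate to 16 bits so that
 * epsilon = 1.0 maps to 0xFFFF instead of wrapping to 0. */
uint16_t epsilon_to_fixed(double epsilon)
{
    if (epsilon <= 0.0) {
        return 0;
    }
    uint32_t scaled = (uint32_t)(epsilon * 65536.0 + 0.5);
    return (scaled > 0xFFFFu) ? 0xFFFFu : (uint16_t)scaled;
}
```

Halfway through the schedule, at episode 150, epsilon is 0.5 and the register value is 0x8000.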