ECAD FOURTH YEAR PROBLEM BASED LABORATORY REPORT

PROBLEM PACK 7 (MODULAR MULTIPLICATION)
LEE YEE HUI, NG TZE JIA, MOHAMMAD AZWAN BIN ALI, MUHAMMAD NUR AKMAL, Students of the Department of Microelectronics and Computer Engineering, Faculty of Electrical Engineering, Universiti Teknologi Malaysia, Skudai, Johor.

Abstract: This document is an ECAD problem-based laboratory report on designing the software and hardware partitions of an embedded system. A 32-bit modular multiplication (mod_mul) function is designed.

I. INTRODUCTION

Embedded systems are commonly implemented as a System-on-Chip, referred to as an SoC design or SoC embedded system. In this lab, SOPC Builder, Quartus II, and the Nios II IDE are used within the Altera SOPC (System-on-Programmable-Chip) Builder flow to develop a Nios II-based embedded system. The lab aims to design the software and hardware partitions of an embedded system and to perform design-space exploration between the two partitions for a specific computation. Performance is measured in logic cost and computation cycle count. In this experiment, a 32-bit modular multiplication (mod_mul) function is developed. The program generates 25 sets of 32-bit random input operands (x, y, p) and performs the 32-bit modular multiplication (mod_mul) on them consecutively. The modular multiplication algorithm is given. The performance of the hardware and software partitions is then compared.

II. LITERATURE STUDY

An embedded system is at the heart of almost every modern electronic device, from toys to traffic lights to power plant controllers. Embedded systems touch all aspects of modern life, and examples of their use are everywhere: watches, microwaves, and phones all rely on them. An embedded system is a specialized computer system containing hardware and software customized to perform one or a few particular tasks under real-time constraints. Such systems are generally part of a larger system or machine (e.g., industrial controllers), housed on a single microprocessor board with the programs stored in ROM. Embedded systems are controlled by one or more main processing cores, normally either microcontrollers or digital signal processors (DSPs). Since an embedded system is dedicated to specific tasks, the design engineer can optimize it to reduce the size and cost of the product while increasing its reliability and performance. Nevertheless, the complexity of embedded systems is continually increasing. The number of states in the software is very large, and the complex description of a system makes system analysis very hard.

System on chip (SoC) refers to integrating all components of a computer or other electronic system into a single integrated circuit (chip). It may include digital, analogue, mixed-signal, and often radio-frequency functions on one single chip. A typical application of SoC is embedded systems. The major advantage of an SoC device is that it greatly reduces the size, cost, and power consumption of a system. In handheld digital products, SoC devices have replaced bulkier, higher-power digital systems built as a board with several chips. A system on chip may include a configurable logic unit, which includes a processor, an interface, and programmable logic on the same device. As technology advances, integration of the various units included in an SoC design becomes increasingly complicated. An SoC device may integrate on a single chip many of the components of a complex electronic system, such as a wireless receiver. This complexity requires a good hardware/software co-design process.

The common description of hardware/software (HW/SW) co-design is the meeting of system-level objectives by taking advantage of the trade-offs between HW and SW in a system through their parallel design. The HW and SW are developed at the same time, on parallel paths, to produce a design that meets the performance and functional specifications. The two key concepts involved in co-design are concurrent development of HW and SW, and integrated design. Integrated design allows interaction between the design of the HW and the SW. Co-design techniques using these two key concepts take advantage of design flexibility to create systems that can meet strict performance requirements with a shorter design cycle.

The main reason motivating the need for HW/SW co-design is the fact that most systems today include both dedicated hardware and software units on microcontrollers or general purpose processors. Meanwhile, the increasing use of programmable processors, the availability of cheap microcontrollers, and the increased efficiency of higher level

[...] p) and the 32-bit modular were stored in an array. By using a loop, the multiplication operation was carried out (according to the algorithm given) 25 times, and each result generated, u, was stored in another array. At the end of the function, another loop was created to run 25 times and display the output to check its correctness.

The cycle count was recorded by using the alt_timestamp() command in the source code in order to compare performance. This timestamp can be used by including the library "sys/alt_timestamp.h".

For the second part, the hardware part consists of two modules that are later connected together into a top-level entity design. The first module is called 'fsm' (fsm.vhd) and the second module is called 'ALU Interface' (ALU_Interface.vhd). Both modules are then combined into one big module called 'alu_avalon' to form an Avalon bus interface that integrates with the NIOS II development board.

For the first module, the 'fsm' is where the modular multiplication operation of x, y and p is carried out. The ASM flow chart is shown in the APPENDIX at page 12. Combinational logic is used for the construction of this core.

The second module, 'ALU Interface', controls the flow of data from the firmware, which is written in C using the NIOS II IDE software. It has the inputs and outputs that interface with the 'fsm' core and the Avalon bus interface (alu_avalon.vhd), allowing data from the memory to flow in and the processed data to flow out of the hardware accelerator. The 'ALU Interface' has reset, clk, chipselect, address[2:0], write, writedata[31:0], result[31:0] and done as input signals. It also has x[31:0], y[31:0], z[31:0] and readdata[31:0] as outputs.

Firmware was created to work with this bus. The firmware generates 100 sets of 32-bit random numbers and sends commands to the bus to do the calculation. In the firmware, IOWR and IORD commands were used to send and read data from the Avalon bus interface. The addresses for datax, datay, and datau were declared before the main function. The firmware displays the answer at the console when it is done calculating. Both hardware and software code are written so that the program can run simultaneously to make a comparison between hardware and software.

IV. RESULT & DISCUSSION

Figure 1: Timing simulation result waveform for fsm

Table 1: Time consumption for various implementations

Implementation                                Ticks     Total time, T (25 sets)   Average time, t
Software                                      503716    5037.16 µs                201.4864 µs
Hardware/Software co-design (without delay)   12359     123.59 µs                 4.9436 µs
Hardware/Software co-design (with delay)      14971     149.71 µs                 5.9884 µs

Hardware/Software co-design (without delay) integrates both software and hardware partitions in NIOS II. The result obtained is incorrect due to a slight difference in hardware and software execution time; when a slight delay is added in the program to overcome the execution time difference, the correct result is obtained. Hence, by comparing the average times of the Software design (201.4864 µs) and the Hardware/Software co-design (5.9884 µs), it can be observed that the Hardware/Software co-design is 33.64 times faster than the pure software design. Hardware/Software co-design takes less time because the hardware is designed specifically for that particular algorithm and is therefore more efficient.

Table 2: Logic cost for various implementations

Implementation                 Logic cost
Software                       3530
Hardware/Software co-design    4127

The above table shows that the logic cost of the Hardware/Software co-design is 1.17 times that of the software design.
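The timing and cost figures in Table 1 and Table 2 follow from simple arithmetic on the tick counts and the reported running frequency of 100000000 Hz. The sketch below redoes that arithmetic in portable C. The function names (ticks_to_us, avg_us, ratio) are illustrative and are not part of the lab code; at 100 MHz the converted times come out in microseconds.

```c
#include <assert.h>
#include <stdio.h>

/* Convert a timestamp tick count to microseconds for a timer running at freq_hz. */
double ticks_to_us(unsigned long ticks, double freq_hz)
{
    return (double)ticks * 1e6 / freq_hz;
}

/* Average time per operation, in microseconds, over `sets` operations. */
double avg_us(unsigned long ticks, double freq_hz, unsigned sets)
{
    return ticks_to_us(ticks, freq_hz) / (double)sets;
}

/* How many times larger a is than b (used for speed-up and cost ratios). */
double ratio(double a, double b)
{
    return a / b;
}

#ifdef DEMO_MAIN
int main(void)
{
    const double freq = 100000000.0;            /* reported running frequency */
    double sw = avg_us(503716UL, freq, 25);     /* software                   */
    double hw = avg_us(14971UL, freq, 25);      /* co-design with delay       */

    printf("software average:  %.4f us\n", sw);
    printf("co-design average: %.4f us\n", hw);
    printf("speed-up:          %.2f\n", ratio(sw, hw));
    printf("logic cost ratio:  %.2f\n", ratio(4127.0, 3530.0));
    return 0;
}
#endif
```

Compiling with -DDEMO_MAIN prints the averages, the speed-up, and the logic cost ratio, so the table entries can be rechecked whenever a new tick count is measured.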

V. CONCLUSION

In conclusion, the algorithm and the functionality of the hardware are verified, while the timing for both software and hardware executions shows that the full software implementation is slower than the hardware implementation. By increasing the logic cost by 1.17 times, the execution time is shortened by 33.64 times. Therefore, this trade-off is acceptable.

ACKNOWLEDGMENT

The authors would like to express their gratitude to Universiti Teknologi Malaysia (UTM) for supporting this laboratory.

Result Verification:

1932757444 x 1812851329 = 3503801900990043076
3503801900990043076 ÷ 4294967295 = 815792452.03310519643898708662926
Multiply the fractional part by the divisor to get the remainder:
0.03310519643898708662926 x 4294967295 = 142185736

Running frequency: 100000000 Hz
Time taken(Software): 503716 ticks
Time taken(Hardware): 14971 ticks
1932757444*1812851329 mod 4294967295=142185736
786279284*481968937 mod 4160749567=890607585
1699140529*368211428 mod 3187671039=2547598544
1619701602*850790193 mod 4294967295=740490066
1642638206*810407642 mod 4294443007=4023211626
199959575*1030935169 mod 4294967295=1966239155
237895031*590249699 mod 4294967295=131936119
1997930855*53546138 mod 2147483647=670880292
668396581*1839177873 mod 4294967295=1633499898
638216208*1670124393 mod 4294967295=2273883489
332713197*1035074448 mod 4294836191=2738433170
1985900155*725370578 mod 4294967295=2373844565
1302741285*134650351 mod 4278190079=3480589006
638160306*565314516 mod 4294967295=425204706
1889387775*1286302809 mod 4294967295=570708015
598279984*371398014 mod 4294967295=2305680216
1765157118*627747827 mod 4294967231=2726428706
2093973672*1774683490 mod 4294967295=3010521705
1234867862*900890657 mod 4294963199=2671521229
282741182*1140457973 mod 4294967295=3174557551
426936334*993029438 mod 4294967295=2955747092
714291992*22021730 mod 4294967295=1333203325
375146338*397206613 mod 4294967295=1646288869
2088659870*125417311 mod 4294967295=3307767680
217963178*957649779 mod 4261412863=751583278
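The result verification above works through a decimal quotient. The same check can be done in exact integer arithmetic, which avoids rounding in the fractional part. The following is a minimal sketch in standard C; the helper name mod_mul_ref is an assumption introduced here for illustration and does not appear in the lab code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Reference modular product: widen to 64 bits so that x * y cannot overflow,
   then take the remainder. The caller must pass p != 0. */
uint32_t mod_mul_ref(uint32_t x, uint32_t y, uint32_t p)
{
    uint64_t product = (uint64_t)x * (uint64_t)y;
    return (uint32_t)(product % p);
}

#ifdef DEMO_MAIN
int main(void)
{
    /* First operand set from the console output. */
    const uint32_t x = 1932757444u, y = 1812851329u, p = 4294967295u;
    uint64_t product = (uint64_t)x * y;

    printf("product   = %llu\n", (unsigned long long)product);
    printf("quotient  = %llu\n", (unsigned long long)(product / p));
    printf("remainder = %lu\n", (unsigned long)mod_mul_ref(x, y, p));
    return 0;
}
#endif
```

For the first operand set this gives the product 3503801900990043076, the integer quotient 815792452, and the remainder 142185736, which matches the value printed by the co-design with delay.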

APPENDIX

Running frequency: 100000000 Hz
Time taken: 12359 ticks
1932757444*1812851329 mod 4294705143=344120785
481968937*1699140529 mod 4255121407=2627761612
1619701602*850790193 mod 4294967295=3119638564
810407642*199959575 mod 4294967295=740490066
237895031*590249699 mod 4293918719=3818040340
53546138*668396581 mod 4294967291=3387772582
638216208*1670124393 mod 4026531839=2934738067
1035074448*1985900155 mod 4294967295=679815706
1302741285*134650351 mod 4227857407=4194470430
565314516*1889387775 mod 4294967295=2228820368
598279984*371398014 mod 4294967287=51824115
627747827*2093973672 mod 4294967295=2719559960
1234867862*900890657 mod 4294967295=3015005904
1140457973*426936334 mod 2013265772=2582953174
714291992*22021730 mod 4293916671=574899538
397206613*2088659870 mod 3976200191=1802861821
217963178*957649779 mod 4294705023=2926194162
1714510321*1894596297 mod 4294959103=1911865590
762000468*989767529 mod 2147483643=2616249845
1299598477*1277489890 mod 4294967295=1813901745
1953573275*608438773 mod 4260364287=3987417940
1600570223*1112997517 mod 4289724415=2312138027
1880702872*471094094 mod 4294967279=3171707931
1806439037*1060615205 mod 4294705151=3224758328
1236269356*1695691943 mod 4294967231=636010679

Result obtained for hardware/software co-design without delay.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include "altera_avalon_pio_regs.h"
#include "sys/alt_timestamp.h"
#include "alt_types.h"
#include "system.h"

/* Shift-and-add modular multiplication: returns (V * A) mod P, assuming V < P. */
alt_u32 mod_mul(alt_u64 V, alt_u64 A, alt_u32 P)
{
    alt_u64 U = 0;
    while (A != 0) {
        if (A & 0x01) {
            U = U + V;          /* add the current multiple of V */
        }
        if (U >= P) {
            U = U - P;          /* keep the accumulator below P */
        }
        A >>= 1;
        V <<= 1;                /* V doubles for the next bit of A */
        if (V >= P) {
            V = V - P;
        }
    }
    return U;
}

int seed = 10;

/* Fill p[], x[] and y[] with 25 operand sets such that 0 < x[i], y[i] < p[i]. */
void generate_random(alt_u32 p[], alt_u32 x[], alt_u32 y[])
{
    int i;
    srand(seed);
    for (i = 0; i < 25; i++) {
        /* The original listing wrote (2^32) - 1 here. In C, ^ is XOR, so the
           intended 32-bit bound is written as a constant instead.
           Note: p[i] is read here before it has been assigned a value. */
        while (p[i] <= 1) p[i] = (rand()) % 0xFFFFFFFFu;
        do {
            x[i] = rand() % (p[i]);
        } while (x[i] == 0);
        do {
            y[i] = rand() % (p[i]);
        } while (y[i] == 0);
        seed = i;
    }
}

int main()
{
    unsigned int start, end;
    int i;
    alt_u32 p[25];
    alt_u32 x[25];
    alt_u32 y[25];
    alt_u32 result[25];

    if (alt_timestamp_start() < 0)
        printf("No timestamp device available\n");
    else {
        printf("Running frequency: %u Hz\n", (unsigned int)(alt_timestamp_freq()));
        generate_random(p, x, y);
        start = alt_timestamp();
        for (i = 0; i < 25; i++) {
            result[i] = mod_mul(x[i], y[i], p[i]);
            //result[i] = (x[i]*y[i])%p[i];
        }
        end = alt_timestamp();
        printf("Time taken: %u ticks\n", (unsigned int)(end - start));
    }
    for (i = 0; i < 25; i++) {
        printf("%lu*%lu mod %lu=%lu\n", x[i], y[i], p[i], result[i]);
    }
    while (1);
    return 0;
}

Source Code for Software Design
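The software listing depends on the Nios II HAL headers, so it cannot be compiled on a desktop machine as it stands. The sketch below restates the same shift-and-add loop with standard C types and cross-checks it against a direct 64-bit product. The names mod_mul_shift_add, lcg_next and self_check are assumptions made for this illustration, as are the small linear congruential generator used in place of rand() and the initial reduction of x modulo p, which the original code leaves to the caller.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Shift-and-add modular multiplication, portable restatement of mod_mul.
   Returns (x * y) mod p for p >= 1. */
uint32_t mod_mul_shift_add(uint32_t x, uint32_t y, uint32_t p)
{
    uint64_t U = 0;          /* running result, kept below p        */
    uint64_t V = x % p;      /* added here: guarantees V < p        */
    uint64_t A = y;          /* multiplier, consumed bit by bit     */

    while (A != 0) {
        if (A & 1u) {
            U += V;
        }
        if (U >= p) {
            U -= p;
        }
        A >>= 1;
        V <<= 1;
        if (V >= p) {
            V -= p;
        }
    }
    return (uint32_t)U;
}

/* Small deterministic generator so the check does not depend on rand(). */
uint32_t lcg_next(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state;
}

/* Compare against the direct 64-bit computation; returns the mismatch count. */
int self_check(int rounds)
{
    uint32_t s = 10u;
    int mismatches = 0;
    for (int i = 0; i < rounds; i++) {
        uint32_t p = lcg_next(&s);
        if (p < 2u) p = 2u;
        uint32_t x = lcg_next(&s) % p;
        uint32_t y = lcg_next(&s) % p;
        uint32_t expect = (uint32_t)(((uint64_t)x * y) % p);
        if (mod_mul_shift_add(x, y, p) != expect) mismatches++;
    }
    return mismatches;
}

#ifdef DEMO_MAIN
int main(void)
{
    printf("mismatches: %d\n", self_check(1000));
    return 0;
}
#endif
```

Because U and V are held in 64-bit variables and kept below p after every step, the intermediate sums never exceed 2p, which is why no wider arithmetic is needed inside the loop.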

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>
#include <time.h>
#include "altera_avalon_pio_regs.h"
#include "sys/alt_timestamp.h"
#include "alt_types.h"
#include "system.h"

/* Register offsets of the alu_avalon slave */
#define dataX      0x00
#define dataY      0x01
#define dataZ      0x02
#define start_pin  0x03
#define done_pin   0x04
#define resultData 0x05

/* Hardware-accelerated modular multiplication: returns (V * A) mod P. */
alt_u32 mod_mul(alt_u32 V, alt_u32 A, alt_u32 P)
{
    alt_u8 done = 0;
    alt_u32 U = 0;
    int d = 0;

    IOWR(ALU_AVALON_0_BASE, dataX, V);          // write dataX=V
    IOWR(ALU_AVALON_0_BASE, dataY, A);          // write dataY=A
    IOWR(ALU_AVALON_0_BASE, dataZ, P);          // write dataZ=P
    IOWR(ALU_AVALON_0_BASE, start_pin, 0x01);   // start the computation
    do {
        done = IORD(ALU_AVALON_0_BASE, done_pin);   // loop & wait for done signal
    } while (!done);
    while (d < 1) { d++; }                      // delay added for correct result to be ready
    U = IORD(ALU_AVALON_0_BASE, resultData);    // store result into U
    return U;
}

int seed = 10;

/* Fill p[], x[] and y[] with 25 operand sets such that 0 < x[i], y[i] < p[i]. */
void generate_random(alt_u32 p[], alt_u32 x[], alt_u32 y[])
{
    int i;
    srand(seed);
    for (i = 0; i < 25; i++) {
        /* The original listing wrote (2^32) - 1 here. In C, ^ is XOR, so the
           intended 32-bit bound is written as a constant instead.
           Note: p[i] is read here before it has been assigned a value. */
        while (p[i] <= 1) p[i] = (rand()) % 0xFFFFFFFFu;
        do {
            x[i] = rand() % (p[i]);
        } while (x[i] == 0);
        do {
            y[i] = rand() % (p[i]);
        } while (y[i] == 0);
        seed = rand();
    }
}

int main()
{
    unsigned int start, end;
    int i;
    alt_u32 p[25];
    alt_u32 x[25];
    alt_u32 y[25];
    alt_u32 result[25];

    if (alt_timestamp_start() < 0)
        printf("No timestamp device available\n");
    else {
        printf("Running frequency: %u Hz\n", (unsigned int)(alt_timestamp_freq()));
        generate_random(p, x, y);
        start = alt_timestamp();
        for (i = 0; i < 25; i++) {
            result[i] = mod_mul(x[i], y[i], p[i]);
        }
        end = alt_timestamp();
        printf("Time taken: %u ticks\n", (unsigned int)(end - start));
    }
    for (i = 0; i < 25; i++) {
        printf("%lu*%lu mod %lu=%lu\n", x[i], y[i], p[i], result[i]);
    }
    while (1);
    return 0;
}

Source Code for Hardware/Software Co-Design

Hardware Partition
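The firmware's access sequence (write x, y and the modulus, pulse start, poll done, read the result) can be exercised on a host machine against a mock register file. The sketch below is such a mock and makes several assumptions: mock_iowr, mock_iord and hw_mod_mul are hypothetical stand-ins for IOWR, IORD and the firmware's mod_mul, the mock finishes the computation instantly, and no bus or FSM timing is modeled. Masking the done read to bit 0 is a defensive choice suggested by the interface code, which drives only readdata(0) with the done flag.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Register offsets, mirroring the #define values in the firmware listing. */
enum { REG_X = 0, REG_Y = 1, REG_Z = 2, REG_START = 3, REG_DONE = 4, REG_RESULT = 5, REG_COUNT = 6 };

/* Hypothetical stand-in for the accelerator's registers (host simulation only). */
static uint32_t regs[REG_COUNT];

/* Mock of IOWR: store the value; writing 1 to REG_START "runs" the core at once. */
void mock_iowr(unsigned offset, uint32_t value)
{
    if (offset >= REG_COUNT) return;
    regs[offset] = value;
    if (offset == REG_START && (value & 1u)) {
        uint32_t p = regs[REG_Z];
        regs[REG_RESULT] = (p == 0u) ? 0u
            : (uint32_t)(((uint64_t)regs[REG_X] * regs[REG_Y]) % p);
        regs[REG_DONE] = 1u;
        regs[REG_START] = 0u;
    } else if (offset <= REG_Z) {
        regs[REG_DONE] = 0u;    /* new operands invalidate the old result */
    }
}

/* Mock of IORD: return the register contents. */
uint32_t mock_iord(unsigned offset)
{
    return (offset < REG_COUNT) ? regs[offset] : 0u;
}

/* Same access sequence as the firmware's mod_mul, against the mock. */
uint32_t hw_mod_mul(uint32_t v, uint32_t a, uint32_t p)
{
    mock_iowr(REG_X, v);
    mock_iowr(REG_Y, a);
    mock_iowr(REG_Z, p);
    mock_iowr(REG_START, 0x01u);
    while ((mock_iord(REG_DONE) & 0x1u) == 0u) {
        /* wait for done; only bit 0 is treated as the flag */
    }
    return mock_iord(REG_RESULT);
}

#ifdef DEMO_MAIN
int main(void)
{
    printf("%lu\n", (unsigned long)hw_mod_mul(1932757444u, 1812851329u, 4294967295u));
    return 0;
}
#endif
```

A mock like this only checks that the driver writes and reads the intended offsets in the intended order. It says nothing about when the real core's result becomes valid, which is the issue the added delay in the firmware addresses.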

library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_unsigned.all;

entity fsm is
    port(
        x, y, z    : in std_logic_vector(31 downto 0);
        clk, start : in std_logic;
        result     : out std_logic_vector(31 downto 0);
        done       : out std_logic);
end fsm;

architecture bhv of fsm is
    type statetype is (s0, s1, s2, s3, s4);
    signal state, next_state : statetype := s0;
    signal U, V, A, P : std_logic_vector(63 downto 0);
begin
    -- next-state logic
    abc: process (state)
    begin
        case state is
            when s0 =>
                if (A /= "00000000000000000000000000000000") then
                    next_state <= s1;
                else
                    next_state <= s4;
                end if;
            when s1 => next_state <= s2;
            when s2 => next_state <= s3;
            when s3 => next_state <= s0;
            when s4 => next_state <= s4;
        end case;
    end process; --abc

    -- state register and datapath
    def: process
    begin
        wait until (clk'event and clk = '1'); -- wait for rising edge of clk
        if start = '1' then
            state <= s0;
            V(31 downto 0) <= x;
            A(31 downto 0) <= y;
            P(31 downto 0) <= z;
            U <= (others => '0');
            done <= '0';
        else
            state <= next_state;
            case state is
                when s0 =>
                when s1 =>
                    if (A(0) = '1') then
                        U <= U + V;
                    end if;
                when s2 =>
                    A <= '0' & A(63 downto 1);
                    V <= V(62 downto 0) & '0';
                    if (U >= P) then
                        U <= U - P;
                    end if;
                when s3 =>
                    if (V >= P) then
                        V <= V - P;
                    end if;
                when s4 =>
                    result <= U(31 downto 0);
                    done <= '1';
            end case;
        end if;
    end process; --def
end bhv;

VHDL code for FSM
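To see how the five states map onto the software algorithm, the FSM can be modeled in C with one loop iteration per clock cycle. This is a behavioral sketch under stated assumptions: fsm_model and fsm_run_t are names invented here, start is assumed to have already loaded the operands, the upper halves of the 64-bit registers are taken as zero, and the cycle count is that of the model only, not a measurement of the synthesized core or of the bus transfers around it.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

typedef enum { S0, S1, S2, S3, S4 } state_t;

typedef struct {
    uint32_t result;   /* value latched in state s4          */
    unsigned cycles;   /* clock cycles from s0 until done    */
} fsm_run_t;

/* Cycle-by-cycle model of the fsm entity after start has loaded x, y, z. */
fsm_run_t fsm_model(uint32_t x, uint32_t y, uint32_t z)
{
    uint64_t U = 0, V = x, A = y, P = z;
    state_t state = S0;
    fsm_run_t run = { 0u, 0u };
    int done = 0;

    while (!done) {
        state_t next;

        /* next-state logic (process abc) */
        switch (state) {
        case S0: next = (A != 0) ? S1 : S4; break;
        case S1: next = S2; break;
        case S2: next = S3; break;
        case S3: next = S0; break;
        default: next = S4; break;
        }

        /* registered datapath actions (process def) */
        switch (state) {
        case S1: if (A & 1u) U += V; break;
        case S2: A >>= 1; V <<= 1; if (U >= P) U -= P; break;
        case S3: if (V >= P) V -= P; break;
        case S4: run.result = (uint32_t)U; done = 1; break;
        default: break;   /* s0 has no datapath action */
        }

        state = next;
        run.cycles++;
    }
    return run;
}

#ifdef DEMO_MAIN
int main(void)
{
    fsm_run_t r = fsm_model(1932757444u, 1812851329u, 4294967295u);
    printf("result=%lu cycles=%u\n", (unsigned long)r.result, r.cycles);
    return 0;
}
#endif
```

In this model each bit of the multiplier costs four cycles (s0 to s3), followed by one final pass through s0 and one through s4, so a multiplier with n significant bits takes 4n + 2 model cycles.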

library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;

ENTITY alu_avalon IS
    PORT (
        reset      : IN STD_LOGIC;
        clk        : IN STD_LOGIC;
        chipselect : IN STD_LOGIC;
        address    : IN STD_LOGIC_VECTOR (2 DOWNTO 0);
        write      : IN STD_LOGIC;
        writedata  : IN STD_LOGIC_VECTOR (31 DOWNTO 0);
        readdata   : OUT STD_LOGIC_VECTOR (31 DOWNTO 0) );
END alu_avalon;

ARCHITECTURE arch OF alu_avalon IS

    component ALU_interface IS
        PORT (
            reset      : IN STD_LOGIC;
            clk        : IN STD_LOGIC;
            chipselect : IN STD_LOGIC;
            address    : IN STD_LOGIC_VECTOR (2 DOWNTO 0);
            write      : IN STD_LOGIC;
            writedata  : IN STD_LOGIC_VECTOR (31 DOWNTO 0);
            readdata   : OUT STD_LOGIC_VECTOR (31 DOWNTO 0);
            start      : buffer std_logic;
            x, y, z    : out std_logic_vector(31 downto 0);
            result     : in std_logic_vector(31 downto 0);
            done       : in std_logic);
    END component;

    component fsm is
        port(
            x, y, z    : in std_logic_vector(31 downto 0);
            clk, start : in std_logic;
            result     : out std_logic_vector(31 downto 0);
            done       : out std_logic);
    end component;

    signal lineX      : std_logic_vector (31 downto 0);
    signal lineY      : std_logic_vector (31 downto 0);
    signal lineZ      : std_logic_vector (31 downto 0);
    signal result_fsm : std_logic_vector (31 downto 0);
    signal start_fsm  : std_logic;
    signal done_fsm   : std_logic;

BEGIN

    U_arithUnit: fsm port map (
        x => lineX,
        y => lineY,
        z => lineZ,
        clk => clk,
        start => start_fsm,
        result => result_fsm,
        done => done_fsm );

    U_Interface_Alu: ALU_interface port map (
        clk => clk,
        reset => reset,
        chipselect => chipselect,
        address => address,
        write => write,
        writedata => writedata,
        readdata => readdata,
        start => start_fsm,
        x => lineX,
        y => lineY,
        z => lineZ,
        result => result_fsm,
        done => done_fsm);

END arch;

VHDL code for alu_avalon

library ieee;
use ieee.std_logic_1164.all;
use IEEE.std_logic_arith.all;

ENTITY ALU_interface IS
    PORT (
        reset      : IN STD_LOGIC;
        clk        : IN STD_LOGIC;
        chipselect : IN STD_LOGIC;
        address    : IN STD_LOGIC_VECTOR (2 DOWNTO 0);
        write      : IN STD_LOGIC;
        writedata  : IN STD_LOGIC_VECTOR (31 DOWNTO 0);
        readdata   : OUT STD_LOGIC_VECTOR (31 DOWNTO 0);
        start      : buffer std_logic;
        x, y, z    : out std_logic_vector(31 downto 0);
        result     : in std_logic_vector(31 downto 0);
        done       : in std_logic);
END ALU_interface;

ARCHITECTURE arch OF ALU_interface IS
BEGIN
    process (reset, clk)
    begin
        if reset = '1' then
            readdata <= (others => '0');
        elsif clk'event and clk = '1' then
            if chipselect = '1' then
                -- start is cleared on the next bus access after it was set
                if (start = '1') then
                    start <= '0';
                end if;
                -- register decode: 0..2 operands, 3 start, 4 done, others result
                case address is
                    when "000" => x <= writedata (31 downto 0);
                    when "001" => y <= writedata (31 downto 0);
                    when "010" => z <= writedata (31 downto 0);
                    when "011" => start <= writedata(0);
                    when "100" => readdata(0) <= done;
                    when others => readdata(31 downto 0) <= result;
                end case;
            end if;
        end if;
    end process;
END arch;

VHDL code for ALU_interface