
Debjit Sinha Arindam Mallik Somsubhra Mondal

ECE 361: Final Project (Fall 2002) Northwestern University, Evanston, IL (9th December)


A 32-bit single-cycle CPU has been designed and tested, as was the objective of the project. The structural design of the CPU was done using the CAD tools from Mentor Graphics. The CPU handles a small subset of the MIPS instruction set, and its instruction formats replicate those of MIPS. The following instructions are supported:

Arithmetic: add, addi, sub
Logical: and, or, sll
Data Transfer: lw, sw
Conditional Branch: beq, bne, slt
NOP: nop

Salient Features of the CPU

Incorporates a 32-bit ALU with two 32-bit inputs, a 32-bit output, and the status bits Carry Out, Zero and Overflow for detailed analysis of the output.

Memory elements are 32 bits wide (data width) and are accessed by a 32-bit address (address width).

The register file has 8 GPRs (General Purpose Registers), all 32 bits wide.

The CPU has an input signal Reset which, when asserted, sets the PC (Program Counter) to memory address 0x400020.

The CPU stalls (de-asserts all its control signals) on a nop instruction. Besides doing nothing for the current cycle, the nop halts the CPU in this design, which is not the case for MIPS.

Schematic and design of the 32-bit ALU
Schematic and design of the Register File
Schematic and design of the primary and the ALU Control Unit
Final schematic and working of the CPU
Sample programs and results of simulation


The 32-bit ALU has been built by cascading 32 single-bit ALUs, each of which performs the basic operations add, sub, and, or and slt on 2 bits. The Carry In of the 1-bit ALU corresponding to the LSB of the 32-bit ALU is fed with a 1 for a sub or slt operation (refer to Fig-1 for details) and with a 0 otherwise. Each Carry Out bit feeds the Carry In bit of the next block. Overflow is detected by an XOR of the Carry In and the Carry Out of the 1-bit ALU corresponding to the MSB of the 32-bit ALU. Fig-1 shows the design of a 1-bit ALU. The Set bit is present only in the 1-bit ALU corresponding to the MSB; after an XOR with the Overflow bit, it feeds the Less bit of the 1-bit ALU corresponding to the LSB. All other Less bits are set to 0.

Fig 1: A 1-bit ALU
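The cascade described above can be modelled behaviourally. The following is a minimal Python sketch, not the Mentor Graphics schematic: the function names `one_bit_alu` and `alu32` and the string operation names are illustrative assumptions, and the ALU control encoding is deliberately left out.

```python
def one_bit_alu(a, b, carry_in, less, op, invert_b):
    """One ALU slice. Returns (result_bit, carry_out, sum_bit)."""
    b_eff = b ^ invert_b                      # B is inverted for sub / slt
    sum_bit = a ^ b_eff ^ carry_in            # full-adder sum
    carry_out = (a & b_eff) | (carry_in & (a ^ b_eff))
    if op == "and":
        result = a & b
    elif op == "or":
        result = a | b
    elif op == "slt":
        result = less                         # slices pass their Less input
    else:                                     # "add" or "sub"
        result = sum_bit
    return result, carry_out, sum_bit


def alu32(a, b, op):
    """Ripple 32 slices from LSB to MSB; op is add, sub, and, or, or slt."""
    invert_b = 1 if op in ("sub", "slt") else 0
    carry = invert_b                          # Carry In of the LSB slice
    result = 0
    carry_in = 0
    set_bit = 0
    for i in range(32):
        carry_in = carry                      # after the loop: Carry In of the MSB
        bit, carry, set_bit = one_bit_alu(
            (a >> i) & 1, (b >> i) & 1, carry_in, 0, op, invert_b)
        result |= bit << i
    overflow = carry_in ^ carry               # MSB Carry In XOR MSB Carry Out
    if op == "slt":
        result = set_bit ^ overflow           # Set XOR Overflow feeds the LSB Less bit
    return {"result": result, "carry_out": carry,
            "overflow": overflow, "zero": int(result == 0)}
```

For example, `alu32(0x80000000, 1, "slt")` returns a result of 1: the subtraction overflows, and the XOR with Overflow corrects the Set bit. The overflow value is only meaningful for the add, sub and slt operations.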

Another module associated with the 32-bit ALU is the SLL unit, a logical left shifter that takes a 32-bit operand and shifts it by a 5-bit amount supplied as the second input to the module. Its output is the 32-bit shifted value. Depending on the control signal, the final output of the ALU comes either from the 32 cascaded 1-bit ALUs or from the SLL unit. 32-bit muxes (multiplexers) have been designed and incorporated for this purpose; each takes two 32-bit values and passes one of them to its output depending on its control signal. A Zero Detect unit detects whether the final output value is 0 and asserts the Zero output bit accordingly. Fig-2, Fig-3 and Fig-4 show the schematics of the final 32-bit ALU, the cascaded ALUs, and the SLL unit respectively.
Fig 2: Final 32-bit ALU (labelled blocks: the cascaded 32-bit ALU module, SLL Unit)

Fig 3: 32 cascaded 1-bit ALUs

Fig 4: SLL Unit
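The output stage of the ALU (SLL unit, 2-to-1 mux and Zero Detect) can be sketched in a few lines of Python. This is only a behavioural illustration; the names `sll_unit`, `mux2_32`, `zero_detect` and `alu_output` are hypothetical and do not come from the schematics.

```python
MASK32 = 0xFFFFFFFF


def sll_unit(operand, shamt):
    """Logical left shift of a 32-bit operand by a 5-bit amount."""
    return (operand << (shamt & 0x1F)) & MASK32


def mux2_32(in0, in1, select):
    """2-to-1 mux on 32-bit values: select = 0 picks in0, select = 1 picks in1."""
    return (in1 if select else in0) & MASK32


def zero_detect(value):
    """Returns 1 when all 32 bits of the value are 0."""
    return int((value & MASK32) == 0)


def alu_output(cascaded_result, operand, shamt, select_sll):
    """Final ALU output: cascaded 1-bit ALUs or SLL unit, plus the Zero bit."""
    out = mux2_32(cascaded_result, sll_unit(operand, shamt), select_sll)
    return out, zero_detect(out)
```

As a usage example, `sll_unit(4096, 16)` gives 0x10000000, which is how the sample programs build the base address of the data memory with an addi followed by an sll.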


The register file consists of a set of registers that can be read and written by supplying the number of the register to be accessed. The read ports of the register file can be implemented with a pair of muxes, each 32 bits wide. Since two operands must be read in parallel, two read ports are implemented. Fig 5 (below) shows the schematic of one multiplexer needed for a read.

The write port is implemented using a decoder that selects the register to be written. Note that all writes are done on the rising clock edge, so the global clock has to be fed in at the final CPU design stage. Thus, when the same register is read and written in the same cycle, there is no conflict, since all writes happen after the first half of the clock cycle. Fig 6 (below) shows the schematic of the register file implemented.

Fig 6: Register File
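A small behavioural model may help to illustrate the two read ports and the decoder-driven write port. The class below is a sketch under stated assumptions: the class and method names are hypothetical, all registers are assumed to start at 0, and register 0 is not hard-wired to zero because the report does not say that it is.

```python
class RegisterFile:
    """Behavioural sketch of an 8 x 32-bit register file."""

    def __init__(self, num_regs=8):
        self.regs = [0] * num_regs            # assumption: registers start at 0

    def read(self, ra, rb):
        """Two combinational read ports, one mux per port."""
        return self.regs[ra], self.regs[rb]

    def write_edge(self, rw, data, reg_wr):
        """Models the rising clock edge: the decoder enables one register."""
        enables = [int(bool(reg_wr) and i == rw) for i in range(len(self.regs))]
        for i, enable in enumerate(enables):
            if enable:
                self.regs[i] = data & 0xFFFFFFFF
```

An instruction such as add $3, $1, $3 first calls `read(1, 3)` and only later calls `write_edge(3, ...)`, which mirrors the read-before-write ordering within a cycle.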


(A) CENTRAL CONTROL UNIT DESIGN

The central control unit takes in the opcode of an instruction (6 bits in this case: Instruction[31:26]) and generates the various control signals. The table below summarizes each instruction or instruction type and the corresponding control signals to be generated by the Central Control Unit. The don't-care conditions in the table are also set to 0: since the controller is implemented using PLAs, this requires no additional logic. Fig 7 shows the schematic of the corresponding design.
Instruction  Opcode  RegDst  ALUSrc  Mem2Reg  RegWr  MemRd  MemWr  Branch  BNE  ALU1  ALU0
Rtype        000000  1       0       0        1      0      0      0       0    1     0
lw           100011  0       1       1        1      1      0      0       0    0     0
sw           101011  0       1       0        0      0      1      0       0    0     0
beq          000100  0       0       0        0      0      0      1       0    0     1
bne          000101  0       0       0        0      0      0      1       1    0     1
addi         001000  0       1       0        1      0      0      0       0    0     0

Fig 7
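The control table can be written as a lookup keyed by the opcode field. The sketch below is a Python illustration of that truth table, not a description of the PLA itself; the names `SIGNALS`, `CONTROL` and `control_signals` are assumptions, and raising `KeyError` for an unsupported opcode is a choice made here for clarity.

```python
# Signal order follows the columns of the control table.
SIGNALS = ("RegDst", "ALUSrc", "Mem2Reg", "RegWr", "MemRd",
           "MemWr", "Branch", "BNE", "ALU1", "ALU0")

CONTROL = {
    0b000000: "1001000010",  # Rtype
    0b100011: "0111100000",  # lw
    0b101011: "0100010000",  # sw
    0b000100: "0000001001",  # beq
    0b000101: "0000001101",  # bne
    0b001000: "0101000000",  # addi
}


def control_signals(instruction):
    """Decode Instruction[31:26] and return the control signals as a dict."""
    opcode = (instruction >> 26) & 0x3F
    bits = CONTROL[opcode]                    # KeyError for unsupported opcodes
    return dict(zip(SIGNALS, (int(bit) for bit in bits)))
```

For instance, the word 0x8ce40000 (lw $4, 0($7) in the summation program) decodes to opcode 100011, so `control_signals(0x8ce40000)` has MemRd, Mem2Reg, ALUSrc and RegWr set to 1.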

(B) ALU CONTROL UNIT DESIGN

The central control unit sends a 2-bit input to the ALU control unit, which in turn sends the appropriate control signals to the ALU. Besides this 2-bit input, the ALU control unit also needs the 6 bits of the funct field, which must be decoded to select the right ALU control signal for Rtype instructions. Given these inputs, the control unit generates the ALU signals listed in the table below. ALU_OP[1:0] is the input from the central control unit, and ALU[2:0] is the output control signal for the ALU. The schematic of the implementation is shown in Fig 8.
Instruction      Funct[5:0]  ALU_OP[1:0]  Desired ALU Operation  ALU[2:0]
Rtype (add)      100000      10           ADD                    010
Rtype (sub)      100010      10           SUB                    011
Rtype (and)      100100      10           AND                    000
Rtype (or)       100101      10           OR                     100
Rtype (sll)      000000      10           SLL                    110
Rtype (slt)      101010      10           SLT                    111
lw / sw / addi   X           00           ADD                    010
beq / bne        X           01           SUB                    011

Fig 8
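The ALU control table translates directly into a two-level decode. The Python sketch below mirrors the table only; the names `FUNCT_TO_ALU` and `alu_control` are illustrative, and the error handling for an unused ALU_OP value is an assumption.

```python
# Funct[5:0] -> ALU[2:0] for Rtype instructions, as in the table.
FUNCT_TO_ALU = {
    0b100000: 0b010,  # add
    0b100010: 0b011,  # sub
    0b100100: 0b000,  # and
    0b100101: 0b100,  # or
    0b000000: 0b110,  # sll
    0b101010: 0b111,  # slt
}


def alu_control(alu_op, funct):
    """Return ALU[2:0] for the given ALU_OP[1:0] and Funct[5:0]."""
    if alu_op == 0b00:                        # lw / sw / addi: funct is don't-care
        return 0b010                          # ADD
    if alu_op == 0b01:                        # beq / bne: funct is don't-care
        return 0b011                          # SUB
    if alu_op == 0b10:                        # Rtype: decode the funct field
        return FUNCT_TO_ALU[funct & 0x3F]
    raise ValueError("ALU_OP value not used in the table")
```

For example, `alu_control(0b10, 0b101010)` returns 0b111, the SLT code.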


The final schematic of the CPU is shown in Fig 9. The following points are of significance in the design:

The 32-bit ALU designed earlier has been used in place of the 32-bit adders wherever applicable. Similarly, the SLL unit has been reused wherever a shift is needed (e.g. for the left shift by 2 bits in branch instructions).

Since an ALU has been used in place of an adder, the control input of that ALU is held constant at ALU[2:0] = 010 (= ADD). Fixed inputs of this kind have been built into a module, which simply plugs into the proper input of the ALU.

A Sign Extender module has been used to sign-extend data wherever necessary.

Memory elements (Instruction and Data Memory) have been implemented using the ram.3so component from the gen_lib library of the Mentor Graphics tools.

The PC is updated at the start of every clock cycle. To do this, the clock signal is passed through an inverter and then fed to the Enable bit of the PC. Thus the PC is updated (written) at the falling edge of the clock, which marks the start of a cycle.

Registers are written on the rising edge of the clock. This gives sufficient time for the read when a register is both source and destination (e.g. add $3, $1, $3).

Initial PC load: it is assumed that all programs start from memory address 0x400020 (0x signifies a hexadecimal number), so the PC must be set to 0x400020. Before every new program execution, the Reset bit of the CPU must therefore be set to 1 for one cycle. Reset acts as the control signal of a 2-to-1 32-bit multiplexer, which chooses between NEXT_PC_VALUE and the constant 0x400020. Thus whenever Reset is asserted (set to 1), the PC is initialized. Reset must be set back to 0 after one cycle, else the new value of the PC will not be loaded.

Program halt: the nop instruction of the CPU (nop / sll $0, $0, $0) has been modified to signify the end of a program, and a nop is inserted at the end of all programs. When the CPU reads a nop, it de-asserts all control signals, thereby holding the PC at its current value and stalling the CPU. Though this is not what a nop should do, it has been implemented just as a check. We assume that no program uses the instruction sll $0, $0, $0 elsewhere (since it corresponds to a nop).
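The PC update path (Reset mux, PC + 4, sign extension and the shift left by 2 for branches) can be sketched as follows. This is a behavioural Python illustration with hypothetical names. In particular, combining the Branch, BNE and Zero signals as "Branch and (Zero differs from BNE)" is an assumption about the schematic, and the stall on nop is not modelled.

```python
RESET_PC = 0x00400020                         # constant input of the Reset mux
MASK32 = 0xFFFFFFFF


def sign_extend16(imm16):
    """Sign-extend a 16-bit immediate to a Python integer."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def next_pc(pc, imm16, branch, bne, zero, reset):
    """Value loaded into the PC at the start of the next cycle."""
    if reset:                                 # Reset mux selects the constant
        return RESET_PC
    pc_plus_4 = (pc + 4) & MASK32
    # Assumption: beq is taken when Zero = 1, bne when Zero = 0.
    taken = bool(branch) and (bool(zero) != bool(bne))
    if taken:
        offset = sign_extend16(imm16) << 2    # word offset to byte offset
        return (pc_plus_4 + offset) & MASK32
    return pc_plus_4
```

As a check against the summation program, the bne at 0x00400040 with immediate 0xfffc gives `next_pc(0x00400040, 0xfffc, 1, 1, 0, 0) == 0x00400034`, the address of the lw at the top of the loop.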

Summary: The CPU was designed and tested for different instructions, and also for 3 programs, namely:

Sort Program: sort_corrected_branch.dat
Summation Program: sum_branch.dat
Simple Transaction Simulator: bills_branch.dat

The program executions were checked for correctness, and the CPU was found to work without errors for all the test cases tried so far. Figures 10, 11, and 12 show traces of simulations of the above programs. The cycle time for the tests was set to 100 ns (clock rate = 10 MHz). The simulations were found to work correctly for cycle times as low as 5 ns (clock rate = 200 MHz), but for still lower cycle times the outputs came out garbled.

Fig 9: Schematic of the CPU (M = Multiplexer)

Blocks labelled in the schematic: Program Counter, Instruction Memory, Central Control Unit, Register File, Sign Extender Unit, ALU Control Unit, 32-bit ALU, SLL Unit (Shift left 2), two 32-bit Adders, Data Memory, Input Bits (Clock & Reset), and output signals for debugging.

Section V- Traces of Sample Program Simulations

Program # 1: Summation Program (sum_branch.dat)
Instruction Memory

# add $5, $0, $0
00400020 / 00002820;
# addi $7, $0, 4096
00400024 / 20071000;
# sll $7, $7, 16
00400028 / 00073c00;
# add $6, $7, $0
0040002c / 00e03020;
# addi $6, $6, 40
00400030 / 20c60028;
# lw $4, 0($7)
00400034 / 8ce40000;
# add $5, $5, $4
00400038 / 00a42820;
# addi $7, $7, 4
0040003c / 20e70004;
# bne $7, $6, -16 [loop-0x00400040-4]
00400040 / 14e6fffc;
# sw $5, 0($7)
00400044 / ace50000;
#nop

Data Memory

# DATA
10000000 / 00000001;
10000004 / 00000002;
10000008 / 00000003;
1000000c / 00000004;
10000010 / 00000005;
10000014 / 00000006;
10000018 / 00000007;
1000001c / 00000008;
10000020 / 00000009;
10000024 / 0000000a;
10000028 / 00000037;

The program sums the data in the data memory from address 0x10000000 to address 0x10000024 and places the result in memory location 0x10000028. A look at the code shows that the program requires 46 cycles in total. With a clock period of 100 ns, the program is indeed seen to terminate at t = 4600 ns. The last instruction stores the sum (0x37, i.e. 55 in decimal) at address 0x10000028, as can be seen from the traces. It can also be observed that the PC is automatically loaded with 0x400020 when Reset = 1.

Fig 10: Trace of the program

Arrows show start of loop

End of program, after 0x37 written to address 0x10000028
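The cycle count and the stored result can be checked with simple arithmetic, since every instruction takes one cycle on a single-cycle CPU. The short Python sketch below counts instructions from the listing; the variable names are illustrative, and the trailing nop is not counted.

```python
data = list(range(1, 11))      # words at 0x10000000 .. 0x10000024 (1 to 0xa)

setup_instructions = 5         # add, addi, sll, add, addi before the loop
loop_instructions = 4          # lw, add, addi, bne per data word
final_store = 1                # sw $5, 0($7)

cycles = setup_instructions + loop_instructions * len(data) + final_store
total = sum(data)
end_time_ns = cycles * 100     # clock period of 100 ns

print(cycles, hex(total), end_time_ns)
```

This gives 46 cycles, a sum of 0x37 and an end time of 4600 ns, in agreement with the trace.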

Program #2: Transaction Simulator Program (bills_branch.dat)

Instruction Memory

# addi $5, $0, 1
00400020 / 20050001;
# addi $6, $0, 100
00400024 / 20060064;
# addi $2, $0, 4096
00400028 / 20021000;
# sll $2, $2, 16
0040002c / 00021400;
# addi $7, $2, 40
00400030 / 20470028;
# lw $3, 0($2)
00400034 / 8c430000;
# slt $4, $6, $3
00400038 / 00c3202a;
# beq $4, $5, 8 [next-0x0040003c-4]
0040003c / 10850002;
# sub $6, $6, $3
00400040 / 00c33022;
# sw $0, 0($2)
00400044 / ac400000;
# addi $2, $2, 4
00400048 / 20420004;
# bne $2, $7, -28 [loop-0x0040004c-4]
0040004c / 1447fff9;
# sw $6, 0($7)
00400050 / ace60000;
# nop

Data Memory

# DATA
10000000 / 0000000a;
10000004 / 00000009;
10000008 / 00000008;
1000000c / 000002bc;
10000010 / 00000005;
10000014 / 00000006;
10000018 / 00000190;
1000001c / 00000001;
10000020 / 00000002;
10000024 / 00000003;

The program compares each data word in the data memory, from address 0x10000000 to address 0x10000024, with $6. For each value that is not greater than $6, $6 is decremented by that value and the memory location is filled with 0. The final value of $6 is placed in memory location 0x10000028. A look at the code shows that the program requires 72 cycles in total for the given data. With a clock period of 100 ns, the program is indeed seen to terminate at t = 7200 ns. The last instruction stores $6 (= 0x38, which is indeed correct) at address 0x10000028, as can be seen from the traces. It can also be observed that the PC is automatically loaded with 0x400020 when Reset = 1. (Arrows show where 0x0 is written in memory locations.)

Fig 11: Trace of the program

Program ends, after $6=0x38 written in address 0x10000028
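The behaviour of the transaction program and its cycle count can be reproduced with a short high-level model. The Python function below is a sketch that follows the listing instruction by instruction; the name `run_bills` is hypothetical, and the trailing nop is not counted.

```python
def run_bills(bills, balance=100):
    """Model of bills_branch.dat: returns (final balance, memory, cycles)."""
    memory = list(bills)
    cycles = 5                         # addi, addi, addi, sll, addi
    for i, amount in enumerate(memory):
        cycles += 3                    # lw, slt, beq
        if not (balance < amount):     # slt result is 0: beq falls through
            balance -= amount          # sub $6, $6, $3
            memory[i] = 0              # sw $0, 0($2)
            cycles += 2
        cycles += 2                    # addi, bne
    cycles += 1                        # sw $6, 0($7)
    return balance, memory, cycles


bills = [0x0a, 0x09, 0x08, 0x2bc, 0x05, 0x06, 0x190, 0x01, 0x02, 0x03]
balance, memory, cycles = run_bills(bills)
print(hex(balance), cycles)
```

For the given data the model returns a balance of 0x38 after 72 cycles, and only the two entries larger than the running balance (0x2bc and 0x190) remain non-zero in memory.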


Program #3: Sort Program (sort_corrected_branch.dat)

Instruction Memory

# MAIN
# addi $2, $0, 4096
00400020 / 20021000;
# sll $2, $2, 16
00400024 / 00021400;
# addi $4, $2, 36
00400028 / 20440024;
# addi $5, $2, 40
0040002c / 20450028;
# lw $7, 0($2)
00400030 / 8c470000;
# addi $3, $2, 4
00400034 / 20430004;
# lw $1, 0($3)
00400038 / 8c610000;
# slt $6, $7, $1
0040003c / 00e1302a;
# bne $6, $0, 12 [incr_j-0x400040-4]
00400040 / 14c00003;
# sw $1, 0($2)
00400044 / ac410000;
# sw $7, 0($3)
00400048 / ac670000;
# add $7, $1, $0
0040004c / 00203820;
# addi $3, $3, 4
00400050 / 20630004;
# bne $3, $5, -32
00400054 / 1465fff8;
# addi $2, $2, 4
00400058 / 20420004;
# bne $2, $4, -48
0040005c / 1444fff4;

Data Memory

# DATA
10000000 / 00000009;
10000004 / 0000000a;
10000008 / 00000008;
1000000c / 00000007;
10000010 / 00000005;
10000014 / 00000006;
10000018 / 00000004;
1000001c / 00000001;
10000020 / 00000002;
10000024 / 00000003;

The program sorts the data elements in the data memory in the address range 0x10000000 to 0x10000024. Simulation of the program shows that it requires 379 cycles in total for the given data. With a clock period of 100 ns, the program is seen to terminate at t = 37900 ns. The trace below shows the starting points of the outer and inner loops (thick and dotted arrows), and the points where the two largest numbers (0x9 and 0xa) are stored in memory locations 0x10000020 and 0x10000024 respectively (shown by the symbol).

Fig 12: Trace of the program

Data written in memory

Program Ends
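The sorting loop in the listing is an exchange sort: the outer pointer ($2) fixes a position, and the inner pointer ($3) scans the remaining elements, swapping whenever the element held in $7 is not smaller than the one just loaded into $1. The Python function below is a high-level sketch of that control flow. The name `exchange_sort` is hypothetical, and no attempt is made here to model cycle counts.

```python
def exchange_sort(values):
    """High-level model of the loop structure in sort_corrected_branch.dat."""
    a = list(values)
    n = len(a)
    for i in range(n - 1):             # outer loop: $2 walks up to $4
        current = a[i]                 # lw $7, 0($2)
        for j in range(i + 1, n):      # inner loop: $3 walks up to $5
            other = a[j]               # lw $1, 0($3)
            if not (current < other):  # slt $6, $7, $1 gives 0: no branch
                a[i] = other           # sw $1, 0($2)
                a[j] = current         # sw $7, 0($3)
                current = other        # add $7, $1, $0
    return a


data = [0x9, 0xa, 0x8, 0x7, 0x5, 0x6, 0x4, 0x1, 0x2, 0x3]
print(exchange_sort(data))
```

For the given data the model leaves the values in ascending order, with 0x9 and 0xa in the last two positions, which correspond to addresses 0x10000020 and 0x10000024.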