
ADITYA ENGINEERING COLLEGE - SURAMPALEM

1. INTRODUCTION

1.0 Motivation


A computer is a machine that processes information. A machine, of course, is some tangible device (i.e., hardware) built by hooking together physical components, such as transistors, in an appropriate arrangement. Processing occurs when the machine follows the steps of a mathematical algorithm. Information is represented in the machine by bits, each of which is either 0 or 1. Computer design is the thought process that arrives at how to construct the tangible hardware so that it implements the desired algorithm. The goal is to turn an algorithm into hardware.

Computer designers have two ways to look at the machines they build: the way they act (known as the behavioral viewpoint, which is closely related to algorithms), and the way they are built (known as the structural viewpoint, which is like a blueprint for building the machine).

In the beginning, hardware designers were programmers and vice versa. The world of hardware design and software design fragmented into separate camps during the 1950s and 1960s as advancing technology made software programming easier. The industry needs many more programmers than hardware designers, and programmers require far less knowledge of the physical machine than hardware designers do. The goal of using the current generation of general purpose computers to help design the next generation of special and general purpose computers required bringing the worlds of hardware and software back together again. Out of this union was born the concept of the Hardware Description Language (HDL). Being a computer language, an HDL allows use of many of the time-saving software methodologies that hardware designers had been lacking. But as a hardware language, the HDL allows the expression of concepts that previously could only be expressed by manual notations, such as the ASM notation and circuit diagrams.

Microprocessor Programming and Implementation:


Hardware description languages are used to implement, verify and simulate digital logic chips. The commonly used HDLs are VHDL and Verilog HDL. In VHDL a multitude of language-defined or user-defined data types can be used. Compared to
1 Verilog Implementation of ARM Compatible Core


VHDL, Verilog data types are very simple, easy to use and very much geared towards modelling hardware structure as opposed to abstract hardware modelling. Unlike VHDL, all data types used in a Verilog model are defined by the Verilog language and not by the user. Verilog may be preferred because of its simplicity. Hardware implementation is done using FPGA trainer kits.

CISC versus RISC:


One attempt to increase the performance of general-purpose processors that became popular in the 1970s is the idea of a Complex Instruction Set Computer (CISC). In essence, the idea is to merge a simple general-purpose machine with special hardware (and special registers) that solves certain specific computations. The thought was that this would give the user the best of both worlds (special-purpose and general-purpose computers). Activating each special hardware unit requires including a new instruction in the instruction set. Rather than a handful of machine language instructions, a CISC machine might have thousands of distinct instructions. Fitting all these instructions into a reasonably sized instruction register requires that some instructions occupy multiple words, which is known as a variable length instruction set. Such machines are aptly named CISC because the fetch/execute algorithm, although fundamentally the same, has much more complex details with a variable length instruction set. This is especially true if the machine is to be pipelined.

Two factors led to the popularity of CISC processors. First, improved fabrication technologies allowed ever increasing amounts of hardware to fit on a chip. Second, instruction set designers had a mistaken belief that programmers and compilers would be able to utilize all this special purpose hardware effectively. By the early 1980s, several empirical studies had shown that CISC processors did not make effective use of all of their special purpose hardware. As a result of these studies, several groups designed Reduced Instruction Set Computers (RISC). Like the PDP-8, RISC machines have fixed length instruction sets. This simplifies pipelined implementation. RISC instruction sets are chosen with pipelining of fetch/execute in mind, while CISC instruction sets make pipelining fetch/execute difficult. Unlike the PDP-8, RISC processors have several features that allow higher performance than is possible on a single accumulator machine. Although CISC processors remained popular through the end of the twentieth century (the Pentium II is a CISC processor), the momentum in computer design shifted to the RISC philosophy.

RISC, or Reduced Instruction Set Computer, is a type of microprocessor architecture that utilizes a small, highly optimized set of instructions, rather than the more specialized set of instructions often found in other types of architectures. RISC processors have advantages in applications that benefit from faster instruction execution, such as engineering and graphics workstations and parallel-processing systems, through features like one-cycle execution time, pipelining, and a large number of registers. They are also less costly to design, test, and manufacture. The simpler instruction set of a RISC processor is an easier target for compilers, and it allows a simpler processor design. For these reasons a RISC instruction set is used here rather than a CISC one.

The ARM:
In the early 1980s, Acorn Computers, Ltd. designed an inexpensive computer for teaching computer literacy in conjunction with a BBC television program in Great Britain. The machine was originally dubbed the "Acorn RISC Machine" (ARM). Several years later, Acorn entered into a consortium with more than a dozen manufacturers, including DEC (the company that manufactured the PDP-8) and Apple (which uses the ARM in its Newton PDA), to promote the ARM worldwide. The ARM acronym was redefined to mean "Advanced RISC Machine." The ARM is probably the most elegant RISC processor ever marketed. Its instruction set is simpler than most of the other RISC processors with which it can be compared.

1.1 Objectives & Goals

The purpose of this project is to develop a high performance embedded processor core that can execute the ARMv2a instruction set. It is a general-purpose 32-bit microprocessor for use in application-specific or customer-specific integrated circuits. It targets the low power, price sensitive market, and can be used in palmtop computers, smart cards, and GSM terminal controllers. The advantages of the core stem from its simple design and efficient instruction set.

1.2 Specifications:
A behavioural description of the memory controller.
A behavioural description of the instruction cache controller and data cache controller.
RTL synthesizable instruction prefetch buffer.
Load and store architecture.
3-stage pipeline.
Unified instruction and data cache.
Multiplication using Booth's algorithm.
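The last specification item, multiplication using Booth's algorithm, can be illustrated with a small behavioural model. The following is a minimal Python sketch of the textbook radix-2 Booth procedure, not the project's Verilog multiplier; the function name `booth_multiply` and the choice of a one-bit-wider accumulator are assumptions made for this illustration.

```python
def booth_multiply(multiplicand, multiplier, bits=32):
    """Radix-2 Booth multiplication of two signed `bits`-wide integers.

    Illustrative model only. The accumulator is one bit wider than the
    operands so that negating the most negative multiplicand cannot overflow.
    """
    acc_bits = bits + 1
    acc_mask = (1 << acc_bits) - 1
    q_mask = (1 << bits) - 1

    m = multiplicand & acc_mask   # sign-extended multiplicand (two's complement)
    q = multiplier & q_mask       # multiplier register
    acc = 0                       # accumulator
    q_minus1 = 0                  # the extra bit to the right of q

    for _ in range(bits):
        pair = (q & 1, q_minus1)
        if pair == (1, 0):        # start of a run of ones: subtract
            acc = (acc - m) & acc_mask
        elif pair == (0, 1):      # end of a run of ones: add
            acc = (acc + m) & acc_mask
        # Arithmetic shift right of the combined (acc, q, q_minus1) register.
        q_minus1 = q & 1
        q = (q >> 1) | ((acc & 1) << (bits - 1))
        sign = acc >> (acc_bits - 1)
        acc = (acc >> 1) | (sign << (acc_bits - 1))

    product = (acc << bits) | q
    total_bits = acc_bits + bits
    if product >> (total_bits - 1):   # reinterpret as a signed value
        product -= 1 << total_bits
    return product
```

For example, `booth_multiply(3, -2, bits=4)` walks through four add/subtract-and-shift steps and yields -6. A hardware implementation would hold `acc`, `q` and `q_minus1` in registers and perform one loop iteration per clock cycle.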


2. LITERATURE SURVEY
2.1 Overview:
The design of a 32-bit RISC microprocessor includes the study of the history of microprocessors, their architectures, their design strategies, and their programming and implementation on chips. The RISC processor works with a reduced number of instructions, fixed instruction length, more general-purpose registers, a load-store architecture and simplified addressing modes, which make individual instructions execute faster and achieve a net gain in performance.

2.2 Summary of Literature review:

The main modules of the 32-bit RISC processor are the ALU, CU, registers, instruction set, shifters, rotators and multiplexers. Each module consists of many sub-modules, which are built from basic gates and flip-flops. The microprocessor is designed with an instruction set containing 16 operations. Arithmetic and logical instructions require two source registers and one destination register.

The control unit is the heart of the microprocessor. It accepts as input those signals that are needed to operate the processor, and provides as output all the control signals necessary to effect that operation. Most of a processor's operations are performed by one or more ALUs. An ALU loads data from input registers, an external control unit then tells the ALU what operation to perform on that data, and the ALU stores its result into an output register. The control unit is responsible for moving the processed data between these registers, the ALU and memory.

The modules needed to perform the 16 operations in the instruction set are identified first and designed according to the design specifications. The identified blocks, along with their sub-components, are described in a hardware description language, and the code is verified and simulated using synthesis software. Finally, once the functionality is verified, the hardware implementation is carried out on FPGA trainers or CPLD devices.


3. PROBLEM DEFINITION
3.0 Introduction:
Microprocessors have traditionally been designed around two philosophies: Complex Instruction Set Computer (CISC) and Reduced Instruction Set Computer (RISC). The CISC concept is an approach to Instruction Set Architecture (ISA) design that emphasizes doing more with each instruction, using a wide variety of addressing modes and a variable number of operands in various locations. Typically, CISC chips have a large number of different and complex instructions. On the other hand, the RISC processor works with a reduced number of instructions, fixed instruction length, more general-purpose registers, a load-store architecture and simplified addressing modes, which make individual instructions execute faster and achieve a net gain in performance. The overall design is simpler and consumes less silicon than CISC, leading to less heat dissipation and a much lower power chip. This gives the RISC architecture more room to add on-chip peripherals, interrupt controllers and programmable timers. Considering all of this, we have designed a 32-bit RISC processor. RISC processors also have a quicker time-to-market: a smaller processor has fewer instructions and a less complicated design, so it may be produced more rapidly.

3.1 Project Objectives:
The project objectives are:
To review the literature on 32-bit RISC microprocessors and identify the blocks required to perform the operations in the instruction set according to the design specifications.
To describe the functionality of the 32-bit RISC processor in Verilog HDL.
To develop Verilog code for each component in the microprocessor.
To synthesize the design module and generate the RTL and technology schematics.
To simulate the synthesized code and verify the design functionality by applying the user and input constraints.

3.2 Methodologies adopted:
The 32-bit microprocessor performs 16 operations as mentioned in the instruction set. As the microprocessor is designed under the RISC architecture, the code is written only to perform the desired operations. The code is written in Verilog using the procedural (behavioural) model, which is similar to C code: the microprocessor block corresponds to the main function, and the sub-blocks (ALU, CU, registers, shifters, rotators) are declared as sub-functions of the main (microprocessor) block. The sub-functions are defined in the remainder of the code. The outputs of every component are returned to its parent block and then to the main (microprocessor) block. The code is then synthesized, and verification of the design functionality is carried out using synthesis software. Once the design meets the desired specifications, it is implemented on FPGA or CPLD devices.

3.3 Design flow:
The design flow is as shown in the following flow chart.

Figure 3.1: Design flow chart

In the design flow, the first step is to study the literature on the various blocks, such as the ALU and CU, and how to design each block using Verilog HDL. After that, describe the functionality of the 32-bit RISC processor in Verilog HDL. After the identification, develop Verilog code for each component in the microprocessor, synthesize the design module, and generate the RTL and technology schematics. Finally, implement the design on an FPGA.

3.3.1 Schematic entry:
Schematic entry, or schematic capture, is the step in the electronic design automation design cycle at which the electronic diagram (schematic) of the circuit is created by a designer. This is done interactively with the help of a schematic capture tool known as a schematic editor. Circuit design is the very first step of the actual design of an electronic circuit. Typically, sketches are drawn on paper and then entered into a computer using a schematic editor. Schematic entry is therefore said to be a front-end operation, preceding several others in the design flow.

3.3.2 Simulation:
Simulation is done by a tool termed a simulator, which debugs a microprogram before the program is loaded onto the target machine. Essential to HDL design is the ability to simulate HDL programs. Simulation allows an HDL description of a design to pass design verification, an important milestone that validates the design's intended function (specification) against the code implementation in the HDL description. It also permits architectural exploration. Simulators also have a GUI complete with a suite of debug tools, which allow the user to stop and restart the simulation at any time.

3.3.3 FPGA implementation:
FPGA configuration is generally specified using a hardware description language (HDL). FPGAs contain programmable logic components called "logic blocks", and a hierarchy of reconfigurable interconnects that allow the blocks to be "wired together". Logic blocks can be configured to perform complex combinational functions, or merely simple logic gates like AND and XOR. In most FPGAs, the logic blocks also include memory elements, which may be simple flip-flops or more complete blocks of memory. The reconfigurable interconnect makes FPGA architectures far more flexible (in terms of the range of designs that are practical to implement within them) but also far more complex to design for.
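How a configurable logic block can act as either an AND or an XOR gate is easiest to see with a lookup table (LUT). The sketch below is a hypothetical Python model of a 2-input LUT. It assumes the logic block is LUT-based, which is common in FPGAs but is not stated in this report, and the names `make_lut`, `AND2` and `XOR2` are invented for the illustration.

```python
def make_lut(init_bits):
    """Return a function that models a 2-input lookup table.

    `init_bits` is a 4-bit configuration value: bit i is the output for
    the input combination whose index is (b << 1) | a.
    """
    def lut(a, b):
        index = ((b & 1) << 1) | (a & 1)
        return (init_bits >> index) & 1
    return lut


# The same structure, configured two different ways.
AND2 = make_lut(0b1000)   # only index 3 (a=1, b=1) produces 1
XOR2 = make_lut(0b0110)   # indices 1 and 2 (inputs differ) produce 1
```

Changing only the configuration value changes the function the block computes, which is the sense in which the logic is "programmable".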

3.4 Summary:
The design of the 32-bit RISC microprocessor is done using Verilog HDL. Synthesis is carried out using Xilinx synthesis software, and the error-free code is simulated using the ModelSim simulator to pass design verification and is then implemented on an FPGA.


4. ARM COMPATIBLE CORE


4.1 Fundamentals:
The processor has a 32-bit data bus and a 26-bit address bus. It supports two data types, 8-bit bytes and 32-bit words, where words must be aligned on four-byte boundaries. Instructions are exactly one word, and data operations (e.g. ADD) are only performed on word quantities. Load and store operations can transfer either bytes or words. The processor supports four modes of operation, including protected supervisor and interrupt handling modes.
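The two addressing constraints stated here, a 26-bit address space and word alignment on four-byte boundaries, can be written as simple checks. This is a minimal Python sketch; the helper names `is_legal_address` and `is_word_aligned` are hypothetical and not taken from the design.

```python
ADDRESS_BITS = 26
ADDRESS_MASK = (1 << ADDRESS_BITS) - 1   # 0x3FFFFFF, the highest legal address


def is_legal_address(addr32):
    """True when none of the top six bits of a 32-bit address are set."""
    return ((addr32 & 0xFFFFFFFF) >> ADDRESS_BITS) == 0


def is_word_aligned(addr):
    """True when the address lies on a four-byte boundary."""
    return (addr & 0x3) == 0
```

A 32-bit address calculation that fails the first check is one that does not fit on the 26-bit address bus.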

4.1.1 Byte Significance:
Some programming techniques may write a 32-bit (word) quantity to memory, but will later retrieve the data as a sequence of byte (8-bit) items. For these purposes, the processor stores word data in least-significant-byte-first (LSB-first) order. This means that the least significant byte of a 32-bit word occupies the lowest byte address. (The VLSI Technology, Inc. assemblers nonetheless display compiled data in MSB-first order, but for the sake of clarity only. The internal machine representation is preserved as LSB-first.)

There are two types of representation:
1) Little Endian
2) Big Endian
Their formats are as shown below:
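As a small illustration of the two byte orders, the following Python sketch stores the same 32-bit word both ways using the standard `struct` module. The example value 0x11223344 is arbitrary and chosen only so that each byte is recognisable.

```python
import struct

word = 0x11223344

# Index 0 of each list corresponds to the lowest byte address.
little = list(struct.pack("<I", word))   # little endian: LSB at lowest address
big = list(struct.pack(">I", word))      # big endian: MSB at lowest address

# Reading the little-endian bytes back as a word recovers the original value.
recovered = struct.unpack("<I", bytes(little))[0]
```

Here `little` is `[0x44, 0x33, 0x22, 0x11]`, which matches the LSB-first storage order described above, while `big` is `[0x11, 0x22, 0x33, 0x44]`.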


Fig: Little Endian

4.1.2 Registers:
The processor has 27 registers (32 bits each), 16 of which are visible to the programmer at any time. The visible subset depends on the current processor mode; special registers are switched in to support interrupt and supervisor processing. The register bank organization is shown in the table.


User mode is the normal program execution state; registers R15-R0 are directly accessible. All registers are general purpose and may be used to hold data or address values, except that register R15 contains the Program Counter (PC) and the Processor Status Register (PSR). Special bits in some instructions allow the PC and PSR to be treated together or separately as required. Figure 1 shows the allocation of bits within R15. R14 is used as the subroutine link register, and receives a copy of R15 when a Branch and Link instruction is executed. It may be treated as a general purpose register at all other times. R14_svc, R14_irq and R14_fiq are used similarly to hold the return values of R15 when interrupts and exceptions arise, or when Branch and Link instructions are executed within supervisor or interrupt routines.


4.2 PROCESSOR MODES:

1) FIQ PROCESSING MODE: The FIQ (Fast Interrupt Request) mode has seven private registers mapped to R14-R8 (R14_fiq-R8_fiq). Many FIQ programs will not need to save any registers.
2) IRQ PROCESSING MODE: The IRQ (Interrupt Request) mode has two private registers mapped to R14 and R13 (R14_irq and R13_irq).
3) SUPERVISOR MODE: Supervisor mode is the mode that the processor is in after Reset and is generally the mode that an operating system kernel operates in. The SVC mode has two private registers mapped to R14 and R13 (R14_svc and R13_svc).
4) USER MODE: User mode is used for programs and applications.

The two private registers allow the IRQ and supervisor modes each to have a private stack pointer and link register. Supervisor and IRQ mode programs are expected to save the user state on their respective stacks and then use the user registers, remembering to restore the user state before returning. In user mode only the N, Z, C and V bits of the PSR may be changed. The I, F and mode flags will change only when an exception arises. In supervisor and interrupt modes all flags may be manipulated directly.
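The register banking described above can be modelled as a lookup from (mode, register number) to a physical register. This is a minimal Python sketch; the table `BANKED` and the function `physical_register` are hypothetical names, and the mapping only restates the private registers listed for each mode.

```python
# Private (banked) registers per mode; every other register is shared.
BANKED = {
    "user": {},
    "fiq": {n: f"R{n}_fiq" for n in range(8, 15)},   # R8_fiq .. R14_fiq
    "irq": {n: f"R{n}_irq" for n in (13, 14)},
    "svc": {n: f"R{n}_svc" for n in (13, 14)},
}


def physical_register(mode, n):
    """Name of the physical register seen as Rn in the given mode."""
    if not 0 <= n <= 15:
        raise ValueError("register number must be in the range 0..15")
    return BANKED[mode].get(n, f"R{n}")


# Every physical register reachable from any mode.
ALL_PHYSICAL = {physical_register(m, n) for m in BANKED for n in range(16)}
```

Counting the entries of `ALL_PHYSICAL` gives 16 shared registers plus 7 + 2 + 2 private ones, which agrees with the total of 27 registers quoted earlier.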

4.2.1 PROGRAM STATUS REGISTER:
A special register, R15, holds both the Program Counter and the Program Status Register. It is a 32-bit register. Using this register we can find the currently running processor mode, i.e. whether the processor is in User, FIQ, IRQ or Supervisor mode. We can also enable or disable the FIQ and IRQ interrupt requests. It contains four flag bits: Overflow, Carry, Zero and Negative. The basic layout of the Program Counter and the Program Status Register is shown in the following figure.


Program Counter And Processor Status Register
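Since the figure itself is not reproduced in this text, the following Python sketch packs and unpacks R15 under an assumed layout, the one used by 26-bit ARM architectures: N, Z, C, V in bits 31-28, I in bit 27, F in bit 26, the word-aligned PC in bits 25-2, and the mode bits M1, M0 in bits 1-0 (00 user, 01 FIQ, 10 IRQ, 11 supervisor). The function names are hypothetical.

```python
MODE_NAMES = {0b00: "user", 0b01: "fiq", 0b10: "irq", 0b11: "svc"}
MODE_BITS = {name: bits for bits, name in MODE_NAMES.items()}
PC_MASK = 0x03FFFFFC   # bits 25..2 hold the word-aligned program counter


def pack_r15(pc, mode, n=0, z=0, c=0, v=0, i=0, f=0):
    """Combine PC, flags and mode into one 32-bit R15 value (assumed layout)."""
    flags = (n << 31) | (z << 30) | (c << 29) | (v << 28) | (i << 27) | (f << 26)
    return flags | (pc & PC_MASK) | MODE_BITS[mode]


def unpack_r15(r15):
    """Split a 32-bit R15 value into its fields (assumed layout)."""
    return {
        "N": (r15 >> 31) & 1, "Z": (r15 >> 30) & 1,
        "C": (r15 >> 29) & 1, "V": (r15 >> 28) & 1,
        "I": (r15 >> 27) & 1, "F": (r15 >> 26) & 1,
        "pc": r15 & PC_MASK,
        "mode": MODE_NAMES[r15 & 0x3],
    }
```

For instance, `pack_r15(0x8000, "svc", z=1, i=1)` gives 0x48008003, and `unpack_r15` recovers the same fields from that value.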

4.3 EXCEPTIONS:
Exceptions arise whenever there is a need for the normal flow of program execution to be broken, so that (for instance) the processor can be diverted to handle an interrupt from a peripheral. The processor state just prior to handling the exception must be preserved so that the original program can be resumed when the exception routine has completed. Many exceptions may arise at the same time.

The processor handles exceptions by using the banked registers to save state. The old PC and PSR are copied into the appropriate R14, and the PC and processor mode bits are forced to a value which depends on the exception. Interrupt disable flags are set where required to prevent unmanageable nesting of exceptions. In the case of a re-entrant interrupt handler, R14 should be saved onto a stack in main memory before re-enabling the interrupt. When multiple exceptions arise simultaneously, a fixed priority determines the order in which they are handled.

1) FIQ:
The FIQ (Fast Interrupt Request) exception is externally generated by taking the FIQ pin low. This input can accept asynchronous transitions, and is delayed by one clock cycle for synchronization before it can affect the processor execution flow. It is designed to support a data transfer or channel process, and has sufficient private registers to remove the need for register saving in such applications, so that the overhead of context switching is minimized. The FIQ exception may be disabled by setting the F flag in the PSR (but note that this is not possible from user mode). If the F flag is clear, the processor checks for a low level on the output of the FIQ synchronizer at the end of each instruction.
The impact upon execution of an FIQ interrupt is defined in the table below. The return-from-interrupt sequence is also defined there. This will resume execution of the interrupted code sequence and restore the original processor state.

2) IRQ:
The IRQ (Interrupt Request) exception is a normal interrupt caused by a low level on the IRQ pin. It has a lower priority than FIQ, and is masked out when an FIQ sequence is entered. Its effect may be masked out at any time by setting the I bit in the PC (but note that this is not possible from user mode). If the I flag is clear, the processor checks for a low level on the output of the IRQ synchronizer at the end of each instruction.
The impact upon execution of an IRQ interrupt is defined in the table below. The return-from-interrupt sequence is also defined there. This will cause execution to resume at the instruction following the interrupted one, restore the original processor state, and re-enable the IRQ interrupt.
3) ADDRESS EXCEPTION TRAP:
An address exception arises whenever a data transfer is attempted with a calculated address above 3FFFFFFH. The VL86C020 address bus is 26 bits wide, and an address calculation will have a 32-bit result. If this result has a logic one in any one of the top six bits, it is assumed that the address is an error and the address exception trap is taken. A branch cannot cause an address exception, and a block data transfer instruction which starts in the legal area but increments into the illegal area will not trap. The check is performed only on the address of the first word to be transferred.
When an address exception is seen, the processor will respond as defined in the table below. The return-from-interrupt sequence is also defined there. This will resume execution of the interrupted code sequence and restore the original processor state. Normally an address exception is caused by erroneous code, and it is inappropriate to resume execution. If a return is required from this trap, use SUBS PC,R14_svc,4 as defined in the table below. This will return to the instruction after the one causing the trap.

4) ABORT:
The ABORT signal comes from an external memory management system, and indicates that the current memory access cannot be completed. For instance, in a virtual memory system the data corresponding to the current address may have been moved out of memory onto a disk, and considerable processor activity may be required to recover the data before the access can be performed successfully. The processor checks for an abort at the end of the first phase of each bus cycle. When successfully aborted, the VL86C020 will respond in one of three ways:

I. Abort during instruction prefetch:
If abort is signalled during an instruction prefetch (a Prefetch Abort), the prefetched instruction is marked as invalid; when it comes to execution, it is reinterpreted as below. (If the instruction is not executed, for example as a result of a branch being taken while it is in the pipeline, the abort will have no effect.) Then the ARM will:
1) Save R15 in R14_svc, or (for 32-bit configuration ARMs) save R15 in R14_abt and save the CPSR in SPSR_abt.
2) Force the mode bits to SVC mode or (for 32-bit configuration ARMs) ABT mode and set the I bit in the PSR.
3) Force the PC to fetch the next instruction from address &0C.
To continue after a Prefetch Abort use SUBS PC, R14, #4 (where R14 is R14_svc or R14_abt depending on the processor configuration). The ARM will then re-execute the aborting instruction, so you should ensure that you have removed the cause of the original abort.

II. Abort during data access:
If the abort occurs during a data access (a Data Abort), the action depends on the instruction type.
1. Single data transfer instructions (LDR and STR) are aborted as though the instruction had not executed.
2. Block data transfer instructions (LDM and STM) complete, and if writeback is set, the base is updated. If the instruction would normally have overwritten the base with data (i.e. LDM with the base in the transfer list), this overwriting is prevented. All register overwriting is prevented after the abort is indicated, which means in particular that R15 is preserved in an aborted LDM instruction.

III. Abort during an internal cycle:
If the abort occurs during an internal cycle it is ignored.

In cases I and II the processor will respond as defined in the table. The return from the prefetch abort defined in the table will attempt to execute the aborting instruction (which will only be effective if action has been taken to remove the cause of the original abort). A Data Abort requires any auto-indexing to be reversed before returning to re-execute the offending instruction. The return is performed as defined in the table.

The abort mechanism allows a demand paged virtual memory system to be implemented when a suitable memory management unit is available. The processor is allowed to generate arbitrary addresses, and when the data at an address is unavailable the memory manager signals an abort. The processor traps into system software, which must work out the cause of the abort, make the requested data available, and retry the aborted instruction. The application program needs no knowledge of the amount of memory available to it, nor is its state in any way affected by the abort.

5) Software Interrupt:
The software interrupt is used for getting into supervisor mode, usually to request a particular supervisor function. The response to the software interrupt (SWI) instruction is defined in the table below, as is the method of returning. The indicated return method will return to the instruction following the SWI.

6) Undefined Instruction Trap:
When the VL86C020 executes a coprocessor instruction or the undefined instruction, it offers it to any coprocessors which may be present. If a coprocessor can perform this instruction but is busy at that moment, the processor will wait until the coprocessor is ready. If no coprocessor can handle the instruction, the VL86C020 will take the undefined instruction trap. The trap may be used for software emulation of a coprocessor in a system which does not have the coprocessor hardware, or for general purpose instruction set extension by software emulation.
When the undefined instruction trap is taken, the VL86C020 will respond as defined in the table. The return from this trap (after performing a suitable emulation of the required function), as defined in the table, will return to the instruction following the undefined instruction.

7) Reset:
When RESET goes high, the processor will stop the currently executing instruction and start executing no-ops. When RESET goes low again it will respond as defined in the table below.

TABLE: EXCEPTION TRAP CONSIDERATIONS

TRAP TYPE: RESET
CPU TRAP ACTIVITY:
1. Save R15 in R14 (SVC).
2. Force M1, M0 to SVC mode, and set F and I status bits in the PC.
3. Force PC to 0x000000.
PROGRAM SEQUENCE RETURN: (n/a)

TRAP TYPE: UNDEFINED INSTRUCTION
CPU TRAP ACTIVITY:
1. Save R15 in R14 (SVC).
2. Force M1, M0 to SVC mode, and set I status bit in the PC.
3. Force PC to 0x000004.
PROGRAM SEQUENCE RETURN: MOVS PC,R14 ; SVC's R14

TRAP TYPE: SOFTWARE INTERRUPT
CPU TRAP ACTIVITY:
1. Save R15 in R14 (SVC).
2. Force M1, M0 to SVC mode, and set I status bit in the PC.
3. Force PC to 0x000008.
PROGRAM SEQUENCE RETURN: MOVS PC,R14 ; SVC's R14

TRAP TYPE: PREFETCH AND DATA ABORTS
CPU TRAP ACTIVITY:
1. Save R15 in R14 (SVC).
2. Force M1, M0 to SVC mode, and set I status bit in the PC.
3. Force PC to 0x000010 (data abort) or 0x00000C (prefetch abort).
PROGRAM SEQUENCE RETURN:
Prefetch Abort: SUBS PC,R14,4 ; SVC's R14
Data Abort: SUBS PC,R14,8 ; SVC's R14

TRAP TYPE: ADDRESS EXCEPTION
CPU TRAP ACTIVITY:
1. Convert stores to loads.
2. Complete the instruction (see text for details).
3. Save R15 in R14 (SVC).
4. Force M1, M0 to SVC mode, and set I status bit in the PC.
5. Force PC to 0x000014.
PROGRAM SEQUENCE RETURN: SUBS PC,R14,4 ; SVC's R14 (returns the CPU to the address following the one causing the trap)

TRAP TYPE: IRQ
CPU TRAP ACTIVITY:
1. Save R15 in R14 (IRQ).
2. Force M1, M0 to IRQ mode, and set I status bit in the PC.
3. Force PC to 0x000018.
PROGRAM SEQUENCE RETURN: SUBS PC,R14,4 ; IRQ's R14

TRAP TYPE: FIQ
CPU TRAP ACTIVITY:
1. Save R15 in R14 (FIQ).
2. Force M1, M0 to FIQ mode, and set the F and I status bits in the PC.
3. Force PC to 0x00001C.
PROGRAM SEQUENCE RETURN: SUBS PC,R14,4 ; FIQ's R14
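The common pattern in the trap table (save R15 in the banked R14, force the mode bits, set the interrupt disable bits, force the PC to the vector) can be sketched behaviourally. The Python model below is illustrative only. It assumes the 26-bit R15 layout with I in bit 27, F in bit 26, the PC in bits 25-2 and the mode in bits 1-0, and the names `enter_exception` and `r14_bank` are hypothetical.

```python
I_BIT = 1 << 27
F_BIT = 1 << 26
PC_MASK = 0x03FFFFFC
MODE_BITS = {"user": 0b00, "fiq": 0b01, "irq": 0b10, "svc": 0b11}

VECTORS = {
    "reset": 0x00, "undefined": 0x04, "swi": 0x08, "prefetch_abort": 0x0C,
    "data_abort": 0x10, "address_exception": 0x14, "irq": 0x18, "fiq": 0x1C,
}


def enter_exception(kind, r15, r14_bank):
    """Return the new R15 after taking exception `kind`.

    `r14_bank` is a dict of banked link registers keyed by mode name;
    the old R15 (PC plus PSR) is stored in the entry for the new mode.
    """
    mode = kind if kind in ("irq", "fiq") else "svc"
    r14_bank[mode] = r15                              # save old PC and PSR
    new_r15 = r15 & ~(PC_MASK | 0x3) & 0xFFFFFFFF     # keep flags, clear PC and mode
    new_r15 |= MODE_BITS[mode] | I_BIT                # force mode, disable IRQ
    if kind in ("reset", "fiq"):
        new_r15 |= F_BIT                              # these also disable FIQ
    return new_r15 | VECTORS[kind]                    # force PC to the vector
```

Taking an IRQ from user mode at PC 0x8000, for example, leaves 0x8000 in the IRQ link register and produces a new R15 of 0x0800001A (I bit set, IRQ mode, PC at 0x18).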

4.3 VECTOR TABLE:
When an exception or interrupt occurs, the processor sets the PC to a specific memory address. The address is within a special address range called the vector table. The entries in the vector table are instructions that branch to specific routines designed to handle a particular exception or interrupt.
The memory map from address 0x00000000 is reserved for the vector table, a set of 32-bit words. When an exception or interrupt occurs, the processor suspends normal execution and starts loading instructions from the exception vector table. Each vector table entry contains a form of branch instruction pointing to the start of the specific routine.


THE VECTOR TABLE:

EXCEPTION/INTERRUPT        SHORTHAND   ADDRESS
RESET                      RESET       0000000
UNDEFINED INSTRUCTION      UNDEF       0000004
SOFTWARE INTERRUPT         SWI         0000008
PREFETCH ABORT             PABT        000000C
DATA ABORT                 DABT        0000010
INTERRUPT REQUEST          IRQ         0000018
FAST INTERRUPT REQUEST     FIQ         000001C

These are byte addresses, and each contains a branch instruction pointing to the relevant routine. The FIQ routine might reside at 000001C onwards, and thereby avoid the need for a branch instruction.
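To show what "a branch instruction pointing to the relevant routine" looks like as a 32-bit word, the Python sketch below computes the encoding of an unconditional ARM branch (B) placed at a vector address. It assumes the standard ARM branch format: condition field 1110 (always), a 24-bit signed word offset, and an offset measured from the vector address plus 8 because of instruction prefetch. The handler address used in the example is made up.

```python
def branch_word(vector_addr, target_addr):
    """Encode `B target_addr` for an instruction located at `vector_addr`."""
    # The PC reads as the instruction address plus 8 when the branch executes.
    offset_words = (target_addr - (vector_addr + 8)) >> 2
    return 0xEA000000 | (offset_words & 0x00FFFFFF)


# Hypothetical example: the IRQ vector at 0x18 branches to a handler at 0x100.
irq_entry = branch_word(0x18, 0x100)
```

With these values `irq_entry` is 0xEA000038. A branch whose target is its own address encodes as 0xEAFFFFFE.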

EXCEPTION PRIORITIES:
1. RESET (highest priority)
2. Address exception, Data abort
3. FIQ
4. IRQ
5. Prefetch abort
6. Undefined instruction, Software interrupt (lowest priority)

Note that not all exceptions can occur at once. Address exception and data abort are mutually exclusive, since if an address is illegal the processor ignores the ABORT input. Undefined instruction and software interrupt are also mutually exclusive, since they each correspond to particular decodings of the current instruction. If an address exception or data abort occurs at the same time as a FIQ, and FIQs are enabled (i.e. the F flag in the PSR is clear), the processor will enter the address exception or data abort handler and then immediately proceed to the FIQ vector. A normal return from FIQ will cause the address exception or data abort handler to resume execution. Placing address exception and data abort at a higher priority than FIQ is necessary to ensure that the transfer error does not escape detection, but the time for this exception entry should be reflected in worst-case FIQ latency calculations.
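The priority order can be sketched as a small arbiter that picks the first pending exception. This is an illustrative Python model, not the controller's Verilog; the names `PRIORITY_LEVELS`, `first_exception` and `fiq_disabled` are assumptions introduced here, and only the F flag masking described above is modelled.

```python
# Priority levels from the list above, highest first. Exceptions that share
# a level are mutually exclusive, so their order inside a tuple is arbitrary.
PRIORITY_LEVELS = [
    ("RESET",),
    ("ADDRESS_EXCEPTION", "DATA_ABORT"),
    ("FIQ",),
    ("IRQ",),
    ("PREFETCH_ABORT",),
    ("UNDEF", "SWI"),
]


def first_exception(pending, fiq_disabled=False):
    """Return the pending exception that is entered first, or None."""
    candidates = set(pending)
    if fiq_disabled:
        # F flag set in the PSR: FIQ requests are ignored.
        candidates.discard("FIQ")
    for level in PRIORITY_LEVELS:
        for name in level:
            if name in candidates:
                return name
    return None
```

With a data abort and a FIQ pending together, the sketch selects the data abort first, which is the case the worst-case FIQ latency calculation has to account for.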

4.4 PIPELINE:

A pipeline is the mechanism a RISC processor uses to execute instructions. Using a pipeline speeds up execution by fetching the next instruction while other instructions are being decoded and executed. The main function of the processor architecture is to fetch and execute processor instructions. This happens in a 3-stage pipeline consisting of fetch, decode, and execute. The architecture fetches instructions one at a time, decodes each, and executes each instruction by asserting the appropriate datapath control signals. The two basic entities at the heart of this architecture are the datapath and the controller. The datapath integrates an ALU, a multiplier, a register file, a barrel shifter and many other modules into a structure capable of supporting the execution of each processor instruction that comes through the pipeline. The controller governs this datapath to ensure proper execution.

Part of the controller's responsibilities is communicating with the memory interface for both instruction and data transfers and talking to the FPU coprocessor to ensure that floating point instructions are executed properly. The controller is also concerned with handling exceptions, including resets, aborts, interrupt requests, fast interrupt requests, undefined instructions, and software interrupts.

Functionality: The basic movement of an instruction through the pipeline is as follows. An instruction is fetched from the memory location pointed to by the program counter (PC), which is maintained and incremented by the controller. The process of obtaining an instruction involves handshaking with the memory interface using a request/acknowledge wait loop. Once the instruction is fetched into instruction register one (ir1), it is passed to instruction register two (ir2) on the next clock cycle, which is called the decode stage. In our implementation, no actual decoding is done on ir1, which follows the scheme found in the Arnold text used in this class. Instead, decoding happens in the execute stage based on the instruction in ir2. The decode stage essentially only passes the instruction through the pipeline without parsing it.

This was one of the first major design issues about which we debated. Our reasons for finally choosing to do the decoding on ir2 were twofold. First, because our textbook served as a good starting model for our Verilog implementation, we wanted to follow it as closely as possible. Second, decoding on ir1 raised some difficult hurdles in terms of the proper timing for setting up datapath control signals between the decode and execute pipeline stages. Implementation in Verilog seemed most straightforward if we were able to set the appropriate control signals immediately after decoding an instruction, as opposed to scheduling those signals to be asserted on the next clock cycle through the use of a complicated set of pipeline control registers between the decode and execute stages. This would have been especially troublesome in the face of multicycle instructions that stall the pipeline. Even with our method, ir1 still serves a purpose in that it holds the next instruction to be executed so that it will be ready to go in the next cycle without having to access the memory to get it.

In the execute stage, the instruction in ir2 is checked against the condition flags found in the CPSR to determine if it will be executed. If so, and if the instruction is neither a branch nor a data processing instruction that modifies register 15 (the PC), the pipeline is scheduled to move ir1 to ir2 and fetch a new instruction from memory to ir1. Otherwise, the pipeline is flushed with NOPs because the current instructions will be skipped after the branch or data processing instruction that changes the PC executes.
If the instruction gets past the condition check, it passes through a long set of conditional statements that check the opcode bits for each instruction until either a recognized ARM7 instruction is found or it is determined that the instruction belongs to the coprocessor. In the latter case, the architecture performs handshaking with the FPU coprocessor. If it is determined that the instruction belongs neither to the ARM7 core nor to the FPU, it is simply ignored and the pipeline advances. Within each subtree of this series of conditional decoding statements, every bit of ir2 is decoded and appropriate datapath control signals are set to execute the instruction. For multicycle instructions like loads, stores, and multiplies, handshaking wait loops are used to communicate with the appropriate module until the instruction is finished executing. Meanwhile, the pipeline is allowed to advance one cycle, and the current multicycle instruction in ir2 is moved to a temporary register called ir2_mult, where it can still be referenced until its execution completes.
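The movement of instructions through ir1 and ir2, and the flush on a taken branch, can be illustrated with a small behavioural model. This is a simplified Python sketch under stated assumptions: instructions are plain strings, a branch is the hypothetical tuple `("B", target_word)`, every branch is taken, and condition checks, memory handshaking and multicycle instructions (ir2_mult) are left out.

```python
NOP = "NOP"


class Pipeline:
    def __init__(self, program):
        self.program = program  # list of instructions, indexed by word
        self.pc = 0
        self.ir1 = NOP          # most recently fetched instruction
        self.ir2 = NOP          # instruction decoded and executed this cycle
        self.executed = []

    def fetch(self):
        """Fetch the word at PC (NOP past the end) and increment PC."""
        in_range = self.pc < len(self.program)
        instr = self.program[self.pc] if in_range else NOP
        self.pc += 1
        return instr

    def step(self):
        """Advance the model by one clock cycle."""
        instr = self.ir2
        if instr != NOP:
            self.executed.append(instr)
        if isinstance(instr, tuple) and instr[0] == "B":
            # The PC changes: flush both instruction registers with NOPs.
            self.pc = instr[1]
            self.ir1 = NOP
            self.ir2 = NOP
        else:
            # Normal advance: ir1 moves to ir2, a new word is fetched to ir1.
            self.ir2 = self.ir1
            self.ir1 = self.fetch()


pipe = Pipeline(["MOV", ("B", 3), "ADD", "SUB", "CMP"])
for _ in range(10):
    pipe.step()
print(pipe.executed)
```

In this run the ADD at word 2 has already been fetched into ir1 when the branch executes, so it is discarded by the flush and never reaches the execute stage.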

4.5 INSTRUCTION SET ARCHITECTURE:

The processor core is an ARM-compatible 32-bit RISC processor. The core is fully compatible with the ARM v2a instruction set architecture (ISA) and is therefore supported by the GNU toolset. This older version of the ARM instruction set is supported because it is not covered by patents and so can be implemented without a license from ARM. The project provides a complete embedded FPGA system incorporating the core and a number of peripherals, including UARTs, timers and an Ethernet MAC.

All instructions are conditionally executed, which means their execution may or may not take place depending on the values of the N, Z, C and V flags in the PSR at the end of the preceding instruction. All instructions include a 4-bit condition execution code, and an instruction is only executed if the condition specified in it agrees with the current value of the status flags. If the ALWAYS condition is specified, the instruction will be executed irrespective of the flags, and likewise the NEVER condition causes it not to be executed. The other condition codes have the meanings detailed below. For instance, code 0000 (Equal) causes the instruction to be executed only if the Z flag is set. This would correspond to the case where a compare (CMP) instruction had found the two operands to be equal. If the two operands were different, the compare instruction would have cleared the Z flag, and the instruction would not be executed.


Condition  Mnemonic extension  Meaning                              Condition flag state
0000       EQ                  Equal                                Z set
0001       NE                  Not equal                            Z clear
0010       CS/HS               Carry set / unsigned higher or same  C set
0011       CC/LO               Carry clear / unsigned lower         C clear
0100       MI                  Minus / negative                     N set
0101       PL                  Plus / positive or zero              N clear
0110       VS                  Overflow                             V set
0111       VC                  No overflow                          V clear
1000       HI                  Unsigned higher                      C set and Z clear
1001       LS                  Unsigned lower or same               C clear or Z set
1010       GE                  Signed greater than or equal         N==V
1011       LT                  Signed less than                     N!=V
1100       GT                  Signed greater than                  Z==0, N==V
1101       LE                  Signed less than or equal            Z==1 or N!=V
1110       AL                  Always (unconditional)               -
1111       -                   Invalid condition                    -
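The condition table translates directly into a predicate over the four flags. The following Python sketch is illustrative only; the function name `condition_passed` is hypothetical, and treating code 1111 as "do not execute" is an assumption made here.

```python
def condition_passed(cond, n, z, c, v):
    """Evaluate a 4-bit condition code against the N, Z, C and V flags."""
    checks = {
        0b0000: z,                      # EQ
        0b0001: not z,                  # NE
        0b0010: c,                      # CS/HS
        0b0011: not c,                  # CC/LO
        0b0100: n,                      # MI
        0b0101: not n,                  # PL
        0b0110: v,                      # VS
        0b0111: not v,                  # VC
        0b1000: c and not z,            # HI
        0b1001: (not c) or z,           # LS
        0b1010: n == v,                 # GE
        0b1011: n != v,                 # LT
        0b1100: (not z) and n == v,     # GT
        0b1101: z or n != v,            # LE
        0b1110: True,                   # AL
    }
    # Code 1111 is not in the dictionary and is treated as never executed
    # (an assumption of this sketch).
    return bool(checks.get(cond, False))
```

For example, after a compare of two equal operands Z is set, so `condition_passed(0b0000, False, True, True, False)` is true and an instruction with the EQ extension would execute.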

4.5.1 Conditional Instruction Sequence:

Branches which are taken cause breaks in the pipeline. For this reason they often waste time, and can sometimes be replaced by a suitable conditional instruction sequence. If the condition tested is true, the instruction is performed. If it is false, the instruction is skipped and the PC is advanced to the next memory word, which takes little processor time. After the instruction is obeyed, the arithmetic logic unit (ALU) will output appropriate signals on the flag lines. On certain instructions, the flags set the condition code bits in the PSR; for other instructions, the flags in the PSR are only altered if the programmer permits them to be updated.

4.6 THE BARREL SHIFTER:

The arithmetic logic unit has a 32-bit barrel shifter capable of various shift and rotate operations. Data involved in the data processing group of instructions may pass through the barrel shifter, either as a direct consequence of the programmer's actions, or as a result of the internal computations of ObjAsm. The barrel shifter also affects the index for the single data transfer instructions. The barrel shifter has a carry in, which takes its input from the C flag of the PSR; and a carry out, which may be latched back into the C bit of the PSR for logical data operations. Various instructions use the barrel shifter to shift register operands. The effects of such shifts are detailed in this section, rather than being repeated for each instruction.

a. Mnemonics
There are six assembler mnemonics for shift types, used to control the barrel shifter. These are:

LSL   Logical Shift Left
ASL   Arithmetic Shift Left
LSR   Logical Shift Right
ASR   Arithmetic Shift Right
ROR   Rotate Right
RRX   Rotate Right with Extend

The mnemonic ASL (arithmetic shift left) may be freely interchanged with LSL (logical shift left).


b. Specification of the shift amount
The shift amount may either be specified in the instruction, or in a register specified by the instruction.

c. Instruction specified shift amount
When the shift amount is specified in the instruction, it is contained in a 5-bit field which may take any value from 0 to 31.

d. Register specified shift amount
Only the least significant byte of the contents of Rs is used to determine the shift amount. If this byte is zero, the unchanged contents of Rm will be used as the second operand, and the old value of the PSR C flag will be passed on as the shifter carry output. If the byte has a value between 1 and 31, the shifted result will exactly match that of an instruction specified shift with the same value and shift operation. If the value in the byte is 32 or more, the result will be a logical extension of the shifting process.

1) Logical shift left, or arithmetic shift left:
A logical shift left (LSL) takes the contents of Rm and moves each bit by the specified amount to a more significant position. The least significant bits of the result are filled with zeroes. The high bits of Rm which do not map into the result are discarded, except that the least significant discarded bit becomes the barrel shifter's carry out.

2) Logical shift right:
A logical shift right (LSR) is similar to a logical shift left, but the contents of Rm are moved to less significant positions in the result. For example, LSR #5 moves bits 31 to 5 of Rm into bits 26 to 0 of the result, fills bits 31 to 27 with zeroes, and passes bit 4 of Rm to the carry out.

Logical shift right zero is redundant as it is the same as logical shift left zero. The form of the shift field which might be expected to correspond to LSR #0 is therefore used to encode LSR #32. ObjAsm assembles LSR #0 (and ASR #0 and ROR #0) as LSL #0, and allows you to specify LSR #32.

3) Arithmetic shift right:
An arithmetic shift right (ASR) is similar to a logical shift right, except that the high bits are filled with bit 31 of Rm instead of zeroes. This preserves the sign in 2's complement notation. For example, ASR #5 fills bits 31 to 27 of the result with copies of bit 31 of Rm.

Arithmetic shift right zero is redundant as it is the same as logical shift left zero. The form of the shift field which might be expected to correspond to ASR #0 is therefore used to encode ASR #32. ObjAsm assembles ASR #0 (and LSR #0 and ROR #0) as LSL #0, and allows you to specify ASR #32.

4) Rotate right


Rotate right (ROR) operations reuse the bits which overshoot in a logical shift right operation by reintroducing them at the high end of the result, in place of the zeroes used to fill the high end in logical right operations. For example, ROR #5 places bits 4 to 0 of Rm in bits 31 to 27 of the result.

Rotate right zero is redundant as it is the same as logical shift left zero. The form of the shift field which might be expected to correspond to ROR #0 is therefore used to encode rotate right extended (see the next section). ObjAsm assembles ROR #0 (and LSR #0 and ASR #0) as LSL #0.

5) Rotate right with extend
The form of the shift field which might be expected to give ROR #0 is used to encode a special function of the barrel shifter, rotate right extended (RRX). This is a rotate right by one bit position of the 33-bit quantity formed by appending the PSR C flag to the most significant end of the contents of Rm. The old C flag therefore enters bit 31 of the result, and bit 0 of Rm becomes the carry out.
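The shift operations described in this section can be modelled on 32-bit values as follows. This is a behavioural Python sketch, not the Verilog barrel shifter; the function names are hypothetical. Each function takes a shift amount as it would come from a register (so zero leaves Rm and the carry unchanged) and returns the pair (result, carry out).

```python
MASK32 = 0xFFFFFFFF


def lsl(rm, amount, carry_in):
    if amount == 0:
        return rm & MASK32, carry_in
    if amount > 32:
        return 0, 0
    # The carry out is the least significant discarded bit.
    return (rm << amount) & MASK32, (rm >> (32 - amount)) & 1


def lsr(rm, amount, carry_in):
    rm &= MASK32
    if amount == 0:
        return rm, carry_in
    if amount > 32:
        return 0, 0
    return rm >> amount, (rm >> (amount - 1)) & 1


def asr(rm, amount, carry_in):
    rm &= MASK32
    if amount == 0:
        return rm, carry_in
    sign = rm >> 31
    if amount >= 32:
        return (MASK32 if sign else 0), sign
    result = rm >> amount
    if sign:
        # Fill the vacated high bits with copies of bit 31.
        result |= (MASK32 << (32 - amount)) & MASK32
    return result, (rm >> (amount - 1)) & 1


def ror(rm, amount, carry_in):
    rm &= MASK32
    if amount == 0:
        return rm, carry_in
    amount %= 32
    if amount == 0:
        return rm, rm >> 31
    result = ((rm >> amount) | (rm << (32 - amount))) & MASK32
    return result, result >> 31


def rrx(rm, carry_in):
    rm &= MASK32
    return ((carry_in << 31) | (rm >> 1)) & MASK32, rm & 1
```

As a usage example, `ror(0x1F, 5, 0)` moves the five low bits to the top of the word, and `rrx(0x1, 1)` shifts the old carry into bit 31 while bit 0 drops out as the new carry.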


4.7 Branch, Branch with Link (B, BL):

Instructions for branching to an instruction other than the next one.

These instructions branch to an instruction other than the next one, by altering the value of the program counter (R15). The Branch with Link form of the instruction also stores a return address in the link register (R14), so that program flow can branch to a subroutine, and then return to the instruction immediately following the Branch with Link instruction.

All branches take a signed 2's complement 24-bit word offset. This is shifted left two bits, and added to the program counter, with any overflow being ignored, giving an offset range of +/- 32 Mbytes. The branch can therefore reach any word-aligned address within a 26-bit address space, since the calculation wraps round between the top and bottom of memory. When using this instruction with ObjAsm you should provide a label, from which ObjAsm will calculate the 24-bit offset. The encoded offset must take account of the effects of pipelining and prefetching within the CPU, which cause the PC to be two words ahead of the current instruction.
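The offset arithmetic can be checked with a short calculation. The sketch below is illustrative Python with hypothetical function names; it assumes the PC reads as the branch address plus 8 bytes (two words ahead) and wraps the result within the 26-bit address space.

```python
ADDRESS_MASK = 0x03FFFFFC  # word-aligned address in a 26-bit space


def sign_extend_24(value):
    """Interpret a 24-bit field as a signed 2's complement number."""
    value &= 0xFFFFFF
    return value - (1 << 24) if value & 0x800000 else value


def branch_target(instr_addr, offset24):
    """Address reached by a branch located at instr_addr."""
    pc = instr_addr + 8  # PC is two words ahead of the branch
    return (pc + (sign_extend_24(offset24) << 2)) & ADDRESS_MASK


def encode_offset(instr_addr, target):
    """24-bit offset field an assembler would emit for this branch."""
    return ((target - (instr_addr + 8)) >> 2) & 0xFFFFFF
```

A branch to itself, for instance, needs an offset of minus two words, so `encode_offset(0x100, 0x100)` gives 0xFFFFFE, and `branch_target(0x100, 0xFFFFFE)` returns 0x100 again.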

4.7.2 The link bit:

Branch with Link works in the same way as Branch, but it also writes the old PC and PSR into the link register (R14) of the current bank. The PC value written is first adjusted to allow for the prefetch, and contains the address of the instruction following the branch and link instruction. This form of the instruction is often used for branching to subroutines. At the end of the subroutine the program flow can return to the instruction immediately following the Branch with Link instruction by writing the link register (R14) value back into the program counter (R15).

4.8 DIFFERENT INSTRUCTION TYPES:

1) Data processing: Instructions for performing an arithmetic or logical operation on one or two operands.
INSTRUCTION FORMAT:


2) Multiply and Multiply-Accumulate (MUL, MLA): Instructions for performing integer multiplication, giving a 32-bit result.

The multiply and multiply-accumulate instructions use a 2-bit Booth's algorithm to perform integer multiplication. They give the least significant 32 bits of the product of two 32-bit operands, and may be used to synthesize higher precision multiplications. The multiply form of the instruction gives Rd := Rm × Rs. Rn is ignored, and should be set to zero for compatibility with possible future upgrades to the instruction set. The multiply-accumulate form gives Rd := Rm × Rs + Rn, which can save an explicit ADD instruction in some circumstances. The results of a signed multiply and of an unsigned multiply of 32-bit operands differ only in the upper 32 bits; the low 32 bits are identical. As these instructions only produce those low 32 bits, they can be used with operands which may be considered as either signed (2's complement) or unsigned integers. The instruction is only executed if the condition is true.
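The claim that signed and unsigned products agree in their low 32 bits can be checked numerically. The Python sketch below models only the result of MUL and MLA, not the Booth's algorithm hardware; the helper names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def mul(rm, rs):
    """Rd := Rm * Rs, keeping the least significant 32 bits."""
    return (rm * rs) & MASK32


def mla(rm, rs, rn):
    """Rd := Rm * Rs + Rn, keeping the least significant 32 bits."""
    return (rm * rs + rn) & MASK32


def to_signed(value):
    """Reinterpret a 32-bit pattern as a 2's complement integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


# 0xFFFFFFFF is 4294967295 unsigned or -1 signed; both views give the
# same low 32 bits when multiplied by 2.
unsigned_low = mul(0xFFFFFFFF, 2)
signed_low = (to_signed(0xFFFFFFFF) * 2) & MASK32
print(hex(unsigned_low), hex(signed_low))
```

Both values are 0xFFFFFFFE, which read as a signed number is -2, the expected signed product.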


3) Single data transfer (LDR, STR): Instructions for loading or storing single bytes or words of data.
INSTRUCTION FORMAT:

Synopsis: The single data transfer instructions are used to load or store single bytes or words of data. The memory address used in the transfer is calculated by adding an offset to or subtracting an offset from a base register. The result of this calculation may be written back into the base register if auto-indexing is required. If the contents of the base are not destroyed by other instructions, the continued use of LDR (or STR) with write back will continually move the base register through memory in steps given by the index value. Note that ! is invalid for post-indexed addressing, as write back is automatic in this case. The instruction is only executed if the condition is true.

4) Block data transfer (LDM, STM): Instructions for loading or storing any subset of the currently visible registers.
INSTRUCTION FORMAT:

Synopsis: Block data transfer instructions are used to load (LDM) or store (STM) any subset of the currently visible registers from or to memory. They support all possible stacking modes, maintaining full or empty stacks which can grow up or down memory, and are very efficient instructions for saving or restoring context, or for moving large blocks of data around main memory.

5) Single data swap (SWP): Instruction for swapping atomically between a register and external memory.
INSTRUCTION FORMAT:

Synopsis: The data swap instruction is used to swap atomically a byte or word quantity between a register and external memory. It is implemented as a memory read followed by a memory write to the same address, which are locked together. The processor cannot be interrupted until both operations have completed, and the memory manager is warned to treat them as inseparable. This instruction is particularly useful for implementing software semaphores. The swap address is determined by the contents of the base register (Rn). The processor first reads the contents of the swap address. It then writes the contents of the source register (Rm) to the swap address, and stores the old memory contents in the destination register (Rd). The same register may be specified as both the source and destination; its contents are correctly swapped with memory. The LOCK output goes HIGH for the duration of the read and write operations to signal to the external memory manager that they are locked together, and should be allowed to complete without interruption. This is important in multi-processor systems, where the swap instruction is the only indivisible instruction which may be used to implement semaphores. Control of the memory must not be removed from a processor while it is performing a locked operation.
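How a swap supports a semaphore can be sketched as follows. This is a single-threaded Python illustration with hypothetical names; memory is a dictionary, and the indivisibility that the LOCK output provides in hardware is simply assumed, since nothing can run between the read and the write inside `swp`.

```python
def swp(memory, address, rm_value):
    """Write rm_value to memory[address] and return the old contents (Rd)."""
    old = memory[address]       # locked read
    memory[address] = rm_value  # locked write to the same address
    return old


def try_acquire(memory, lock_address):
    """Swap 1 into the semaphore word; it was free if the old value was 0."""
    return swp(memory, lock_address, 1) == 0


def release(memory, lock_address):
    memory[lock_address] = 0


memory = {0x40: 0}
first = try_acquire(memory, 0x40)   # semaphore was free
second = try_acquire(memory, 0x40)  # already held
```

The second attempt leaves the semaphore word unchanged at 1, so a failed acquire does no harm and the caller can simply retry.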

6) Software Interrupt (SWI): Instruction for entering supervisor mode in a controlled manner.
INSTRUCTION FORMAT:

Synopsis: The software interrupt instruction is used to enter supervisor mode in a controlled manner. The instruction causes the software interrupt trap to be taken, which effects the mode change. The PC is then forced to the SWI vector. If this address is suitably protected (by external memory management hardware) from modification by the user, a fully protected operating system may be constructed. The instruction is only executed if the condition is true.


4.9 INSTRUCTION SET SUMMARY:

Instructions available on ARM, briefly summarized as:

I25 = Immediate form of shifter_operand
L24 = Link; save PC to LR
U23 = 1: address = Rn + offset_12; 0: address = Rn - offset_12
B22 = Byte (0 = word)
A21 = Accumulate
L20 = Load (0 = store)
S20 = Update condition flags
P24, W21 = Select different modes of operation
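The bit positions in this legend can be read out of an instruction word with shifts and masks. The Python sketch below handles only the single data transfer flags; the function names are hypothetical, placing offset_12 in bits 11 to 0 is an assumption taken from the standard ARM encoding rather than from this summary, and the example word is chosen by hand for illustration.

```python
def decode_transfer_fields(instr):
    """Extract the single data transfer flag bits named in the legend."""
    def bit(n):
        return (instr >> n) & 1

    return {
        "I": bit(25),
        "P": bit(24),
        "U": bit(23),
        "B": bit(22),
        "W": bit(21),
        "L": bit(20),
        "offset_12": instr & 0xFFF,  # assumed to occupy bits 11..0
    }


def transfer_address(rn_value, fields):
    """Apply the U bit: add the offset when U = 1, subtract when U = 0."""
    if fields["U"]:
        return rn_value + fields["offset_12"]
    return rn_value - fields["offset_12"]


# Example word with P=1, U=1, B=0, W=0, L=1 and offset_12 = 4.
fields = decode_transfer_fields(0xE5921004)
address = transfer_address(0x1000, fields)
```

With a base register value of 0x1000 the example gives an address of 0x1004; clearing bit 23 in the same word would give 0x0FFC instead.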


5. ARM CORE ARCHITECTURE


INTRODUCTION:
The ARM core architecture can be simply overviewed as below:

It consists of 3 stages:
1. FETCH
2. DECODE
3. EXECUTE

We can see the detailed description of each block in the following pages.


5.1 FETCH BLOCK:


A cache is a small, fast array of memory placed between the processor core and main memory that stores portions of recently referenced main memory. The processor uses cache memory instead of main memory whenever possible to increase system performance. The goal of a cache is to reduce the memory access bottleneck imposed on the processor core by slow memory. Often used with a cache is a write buffer, a very small first-in-first-out (FIFO) memory placed between the processor core and main memory. The purpose of a write buffer is to free the processor core and cache memory from the slow write time associated with writing to main memory.

The word cache is a French word meaning a concealed place for storage. When applied to ARM embedded systems, this definition is very accurate. The cache memory and write buffer hardware when added to a processor core are designed to be transparent to software code execution, and thus previously written software does not need to be rewritten for use on a cached core. Both the cache and write buffer have additional control hardware that automatically handles the movement of code and data between the processor and main memory. However, knowing the details of a processor's cache design can help you create programs that run faster on a specific ARM core.

The main drawback is the difficulty of determining the execution time of a program. Why this is a problem will become evident shortly. Since cache memory only represents a very small portion of main memory, the cache fills quickly during program execution. Once full, the cache controller frequently evicts existing code or data from cache memory to make more room for the new code or data. This eviction process tends to occur randomly, leaving some data in cache and removing others. Thus, at any given instant in time, a value may or may not be stored in cache memory.


The Memory Hierarchy and Cache Memory:

Fig 5.1: The Memory Hierarchy and Cache Memory

The innermost level of the hierarchy is at the processor core. This memory is so tightly coupled to the processor that in many ways it is difficult to think of it as separate from the processor. This memory is known as a register file. These registers are integral to the processor core and provide the fastest possible memory access in the system. At the primary level, memory components are connected to the processor core through dedicated on-chip interfaces. It is at this level we find tightly coupled memory (TCM) and level 1 cache.


Fig 5.2: Relationship that a cache has with the processor core and main memory.

5.1.1 Cache Architecture:

ARM uses two bus architectures in its cached cores, the Von Neumann and the Harvard. The Von Neumann and Harvard bus architectures differ in the separation of the instruction and data paths between the core and memory. A different cache design is used to support the two architectures. In processor cores using the Von Neumann architecture, there is a single cache used for instruction and data. This type of cache is known as a unified cache. A unified cache memory contains both instruction and data values. The Harvard architecture has separate instruction and data buses to improve overall system performance, but supporting the two buses requires two caches. In processor cores using the Harvard architecture, there are two caches: an instruction cache (I-cache) and a data cache (D-cache). This type of cache is known as a split cache. In a split cache, instructions are stored in the instruction cache and data values are stored in the data cache.

We introduce the basic architecture of caches by showing a unified cache in Figure 12.4. The two main elements of a cache are the cache controller and the cache memory. The cache memory is a dedicated memory array accessed in units called cache lines. The cache controller uses different portions of the address issued by the processor during a memory request to select parts of cache memory. We will present the architecture of the cache memory first and then proceed to the details of the cache controller.

Basic Architecture of a Cache Memory:


A simple cache memory is shown on the right side of Figure 12.4. It has three main parts: a directory store, a data section, and status information. All three parts of the cache memory are present for each cache line. The cache must know where the information stored in a cache line originates from in main memory. It uses a directory store to hold the address identifying where the cache line was copied from main memory. The directory entry is known as a cache-tag. A cache memory must also store the data read from main memory. This information is held in the data section. The size of a cache is defined as the actual code or data the cache can store from main memory. Not included in the cache size is the cache memory required to support cache-tags or status bits. There are also status bits in cache memory to maintain state information. Two common status bits are the valid bit and dirty bit. A valid bit marks a cache line as active, meaning it contains live data originally taken from main memory and is currently available to the processor core on demand. A dirty bit defines whether or not a cache line contains data that is different from the value it represents in main memory.


A 4 KB cache consisting of 256 cache lines of four 32-bit words.

5.1.2 Basic Operation of a Cache Controller:


The cache controller is hardware that copies code or data from main memory to cache memory automatically. It performs this task automatically to conceal cache operation from the software it supports. Thus, the same application software can run unaltered on systems with and without a cache. The cache controller intercepts read and write memory requests before passing them on to the memory controller. It processes a request by dividing the address of the request into three fields: the tag field, the set index field, and the data index field.

First, the controller uses the set index portion of the address to locate the cache line within the cache memory that might hold the requested code or data. This cache line contains the cache-tag and status bits, which the controller uses to determine the actual data stored there. The controller then checks the valid bit to determine if the cache line is active, and compares the cache-tag to the tag field of the requested address. If both the status check and comparison succeed, it is a cache hit. If either the status check or comparison fails, it is a cache miss.

On a cache miss, the controller copies an entire cache line from main memory to cache memory and provides the requested code or data to the processor. The copying of a cache line from main memory to cache memory is known as a cache line fill. On a cache hit, the controller supplies the code or data directly from cache memory to the processor. To do this it moves to the next step, which is to use the data index field of the address request to select the actual code or data in the cache line and provide it to the processor.
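The lookup sequence can be sketched in software for the 4 KB cache of 256 lines of four 32-bit words described earlier. This is a read-only Python model under stated assumptions: the field widths (4 data index bits, 8 set index bits, 20 tag bits) follow from that geometry, main memory is a hypothetical dictionary of word addresses, and dirty bits, writes and the write buffer are not modelled.

```python
LINE_WORDS = 4    # four 32-bit words per cache line (16 bytes)
NUM_LINES = 256   # 4 KB / 16 bytes


def split_address(addr):
    """Divide an address into (tag, set index, data index)."""
    data_index = addr & 0xF         # byte offset inside the line
    set_index = (addr >> 4) & 0xFF  # selects one of 256 lines
    tag = addr >> 12                # remaining high bits
    return tag, set_index, data_index


class DirectMappedCache:
    def __init__(self, main_memory):
        self.main_memory = main_memory  # dict: word address -> value
        self.lines = [
            {"valid": False, "tag": 0, "data": [0] * LINE_WORDS}
            for _ in range(NUM_LINES)
        ]

    def read_word(self, addr):
        """Return (value, hit) for a word read at addr."""
        tag, set_index, data_index = split_address(addr)
        line = self.lines[set_index]
        hit = line["valid"] and line["tag"] == tag
        if not hit:
            # Cache line fill: copy the whole line from main memory.
            base = addr & ~0xF
            line["data"] = [
                self.main_memory.get(base + 4 * i, 0)
                for i in range(LINE_WORDS)
            ]
            line["tag"] = tag
            line["valid"] = True
        return line["data"][data_index >> 2], hit
```

Reading one word of a line misses and triggers the fill; a following read of a neighbouring word in the same line then hits without touching main memory.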

The Relationship between Cache and Main Memory:


Having a general understanding of basic cache memory architecture and how the cache controller works provides enough information to discuss the relationship that a cache has with main memory. The figure below shows where portions of main memory are temporarily stored in cache memory. The figure represents the simplest form of cache, known as a direct-mapped cache. In a direct-mapped cache each addressed location in main memory maps to a single location in cache memory. Since main memory is much larger than cache memory, there are many addresses in main memory that map to the same single location in cache memory. The figure shows this relationship for the class of addresses ending in 0x824.

The three bit fields introduced in Figure 12.4 are also shown in this figure. The set index selects the one location in cache where all values in memory with an ending address of 0x824 are stored. The data index selects the word/halfword/byte in the cache line, in this case the second word in the cache line. The tag field is the portion of the address that is compared to the cache-tag value found in the directory store. In this example there are one million possible locations in main memory for every one location in cache memory. Only one of the possible one million values in the main memory can exist in the cache memory at any given time. The comparison of the tag with the cache-tag determines whether the requested data is in cache or represents another of the million locations in main memory with an ending address of 0x824.

During a cache line fill the cache controller may forward the loading data to the core at the same time it is copying it to cache; this is known as data streaming. Streaming allows a processor to continue execution while the cache controller fills the remaining words in the cache line. If valid data exists in this cache line but represents another address block in main memory, the entire cache line is evicted and replaced by the cache line containing the requested address. This process of removing an existing cache line as part of servicing a cache miss is known as eviction: returning the contents of a cache line to main memory from the cache to make room for new data that needs to be loaded in cache.

Fig: How main memory maps to a direct-mapped cache.

A direct-mapped cache is a simple solution, but there is a design cost inherent in having a single location available to store a value from main memory. Direct-mapped caches are subject to high levels of thrashing, a software battle for the same location in cache memory. The result of thrashing is the repeated loading and eviction of a cache line. The loading and eviction result from program elements being placed in main memory at addresses that map to the same cache line in cache memory.
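Thrashing can be reproduced with a few lines that track only the tag held by each cache line. The Python sketch below is illustrative; the function name is hypothetical, and it assumes the 4 KB direct-mapped geometry used in this section (16-byte lines, 256 lines), so two addresses that both end in 0x824 compete for the same line.

```python
def count_misses_direct_mapped(addresses):
    """Count cache misses for a sequence of reads in a direct-mapped cache."""
    stored_tag = {}  # set index -> tag currently held by that cache line
    misses = 0
    for addr in addresses:
        set_index = (addr >> 4) & 0xFF
        tag = addr >> 12
        if stored_tag.get(set_index) != tag:
            misses += 1                   # line fill evicts the old contents
            stored_tag[set_index] = tag
    return misses


# Two locations ending in 0x824, accessed alternately: every access misses.
thrashing = count_misses_direct_mapped([0x0824, 0x1824] * 4)

# Moving one item by a single cache line removes the conflict.
no_conflict = count_misses_direct_mapped([0x0824, 0x1834] * 4)
```

In the first sequence all eight accesses miss, while in the second only the two initial fills miss, which is why the placement of program elements in main memory matters.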


5.1.3 Set Associativity:


Some caches include an additional design feature to reduce the frequency of thrashing. This structural design feature is a change that divides the cache memory into smaller equal units, called ways. The cache is still a four KB cache; however, the set index now addresses more than one cache line: it points to one cache line in each way. Instead of one way of 256 lines, the cache has four ways of 64 lines. The four cache lines with the same set index are said to be in the same set, which is the origin of the name set index.

Fig: A 4 KB, four-way set associative cache. The cache has 256 total cache lines, which are separated into four ways, each containing 64 cache lines. The cache line contains four words.
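The effect of adding ways can be seen by counting misses for the four-way organisation. The Python sketch below is illustrative: with 64 sets the set index shrinks to 6 bits and the tag grows to the bits above bit 9, and the round-robin choice of which way to replace is an assumption of this sketch, since the replacement policy is not specified here.

```python
WAYS = 4
SETS = 64  # 256 lines split into four ways of 64 lines


def count_misses_four_way(addresses):
    """Count misses in a 4 KB, four-way set associative cache model."""
    tags = [[None] * WAYS for _ in range(SETS)]
    next_victim = [0] * SETS  # round-robin replacement pointer per set
    misses = 0
    for addr in addresses:
        set_index = (addr >> 4) & 0x3F
        tag = addr >> 10
        if tag not in tags[set_index]:
            misses += 1
            way = next_victim[set_index]
            tags[set_index][way] = tag
            next_victim[set_index] = (way + 1) % WAYS
    return misses


# The two addresses that thrash a direct-mapped cache now share a set
# but sit in different ways, so only the two initial fills miss.
two_items = count_misses_four_way([0x0824, 0x1824] * 4)

# Five items competing for one set still exceed the four ways.
five_items = count_misses_four_way([0x024 + k * 0x400 for k in range(5)] * 2)
```

Associativity therefore reduces thrashing rather than removing it: once more than four active items map to one set, lines are evicted again.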

5.2On Chip Bus:


Bus Structure: In addition to using the processor as a digital filter, a bus structure was implemented in the model so that it could be interfaced with various slave devices using a WISHBONE structure. In a WISHBONE structure there are several interrupt devices that are used to signal that a data transfer is requested. In this case a strobe input goes high when a peripheral requests a transfer. The strobe causes an interrupt in the ARM processor model that tells the processor to interact with the bus. The ARM processor leaves its main program and enters an interrupt vector where, depending on the read/write enable input, the processor will either take information from the bus or write to the bus. When the transaction is complete, the model sends its own acknowledge signal, which tells the external device that the transfer is complete; the model then re-enters the program and continues as before. A very basic bus structure that accomplishes this is implemented in this processor model.
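The strobe, transfer and acknowledge sequence can be summarised as a small behavioural sketch. It is only an illustration: the function name, its arguments and the polarity of the read/write enable (1 = processor writes to the bus) are assumptions, since the text does not give them.

```python
def service_transfer(strobe, write_enable, bus_value, processor_value):
    """Model one bus transaction.

    Returns (bus_value, captured, ack, trace), where `captured` is the data
    the processor took from the bus (None if nothing was read).
    """
    trace = []
    if not strobe:
        # No peripheral request: the main program keeps running.
        return bus_value, None, False, trace
    trace.append("interrupt: leave main program")
    captured = None
    if write_enable:                 # assumed polarity: 1 = write to the bus
        bus_value = processor_value
        trace.append("write to bus")
    else:
        captured = bus_value
        trace.append("read from bus")
    trace.append("acknowledge: resume main program")
    return bus_value, captured, True, trace
```

For example, `service_transfer(1, 0, 0xAA, 0x55)` models a peripheral that raises the strobe and presents 0xAA; the processor captures the value and acknowledges.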


5.3 Decode Stage:


The decode stage produces the control signals used in the execute stage.

5.4 Execute Stage:

5.4.1 REGISTER FILE / BARREL SHIFTER

REGISTER FILE
The processor has a total of 37 registers, consisting of 31 general purpose registers and 6 status registers. At any one time, 16 general registers (R0-R15) and one or two status registers are accessible through a register banking scheme. The accessible registers depend on the processor mode, which is held in the CPSR (Current Program Status Register); the banking supports the IRQ, FIQ, Supervisor, Abort and Undefined modes.
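The register counts can be checked with a short sketch of the banking scheme. It assumes the standard ARM arrangement (FIQ mode banks R8-R14, the other exception modes bank R13-R14, and each exception mode has its own SPSR); the physical register names such as `R13_irq` are illustrative and are not signal names from the design.

```python
MODES = ("USR", "FIQ", "IRQ", "SVC", "ABT", "UND")


def physical_register(mode, n):
    """Map architectural register n (0-15) in a given mode to a physical name."""
    if mode == "FIQ" and 8 <= n <= 14:
        return f"R{n}_fiq"                      # FIQ banks R8-R14
    if mode in ("IRQ", "SVC", "ABT", "UND") and n in (13, 14):
        return f"R{n}_{mode.lower()}"           # other modes bank R13-R14
    return f"R{n}"                              # shared with User mode


# Collect every distinct physical register reachable from any mode.
general = {physical_register(m, n) for m in MODES for n in range(16)}
status = {"CPSR"} | {f"SPSR_{m.lower()}" for m in MODES if m != "USR"}
total = len(general) + len(status)
```

Under these assumptions the model yields 31 general purpose registers and 6 status registers, 37 in total, matching the figures quoted above.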

INTERFACE
INPUTS:
RF_Addr_A[3:0]        - Bus A read address
RF_Addr_B[3:0]        - Bus B read address
RF_Addr_C[3:0]        - Bus C read address
RF_Addr_Write[3:0]    - address of register to write
RF_Bus_Write[31:0]    - value of register to write
RF_PC_Write[31:0]     - program counter write to R15
RF_Load_Write         - enable signal to write to general purpose register
RF_Load_Flags         - enable signal to write status/mode flags
RF_PSR_R_Sel          - select signal for reading CPSR/mode PSR register
RF_PSR_W_Sel          - select signal for writing CPSR / mode PSR
RF_Flags_Write[10:0]  - input value for mode / status flags
sysclk                - the system clock

OUTPUTS:
RF_Bus_A              - output of register A read
RF_Bus_B              - output of register B read
RF_Bus_C              - output of register C read
RF_PC_Read            - output of R15, program counter register
RF_PSR_Read           - output of CPSR/PSR, selected above

5.3.2 FUNCTIONAL DESCRIPTION
Register: Registers are read out onto buses A, B and C asynchronously, based first on the mode the processor is in and then on which register (0-14) is addressed for each bus. CPSR[4:0] is checked to determine the mode, and then the respective RF_Addr_A, RF_Addr_B or RF_Addr_C bus is decoded. Every clock cycle, RF_Load_Write is examined; if it is asserted, a register is written, selected by the processor mode CPSR[4:0] and the addressed register 0-14 of that mode.

PSR Write: At every clock, RF_Load_Flags is checked to determine if a PSR register should be written. If RF_Load_Flags == 1, a PSR register is written. The PSR register is written from the RF_Flags_Write input bus. The PSR register to be written is chosen based first on RF_PSR_Sel, where a value of 0 means the CPSR and a value of 1 means the SPSR register associated with the current mode of the processor, which is determined by the value of CPSR[4:0]. A value of unknown-X is put in the register if the mode is USER and an SPSR is selected to be written.

PSR Read: At every clock, the RF_PSR_Read bus is driven with the appropriate PSR value, selected just as above. Unknown-X is put on the bus if the mode is USER and an SPSR is selected.
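The PSR selection rule can be written as a few lines of Python. This is a sketch under stated assumptions: `select_psr` and the `spsr_by_mode` dictionary are hypothetical names, `None` stands in for the Verilog unknown-X value, and the User mode encoding 0b10000 is the standard ARM value for CPSR[4:0].

```python
USER_MODE = 0b10000   # standard ARM encoding of User mode in CPSR[4:0]


def select_psr(psr_sel, cpsr, spsr_by_mode):
    """Return the PSR value chosen by the select signal.

    psr_sel == 0 selects the CPSR; psr_sel == 1 selects the SPSR of the
    current mode. None models the unknown-X result in User mode.
    """
    mode = cpsr & 0x1F
    if psr_sel == 0:
        return cpsr
    if mode == USER_MODE:
        return None           # User mode has no SPSR
    return spsr_by_mode[mode]
```

For instance, with the processor in Supervisor mode (CPSR[4:0] = 0b10011), `select_psr(1, 0x13, {0x13: 0xABC})` returns the stored Supervisor SPSR value.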


5.3.3 ADDRESS REGISTER


INTERFACE
INPUTS:
AR_Bus_ALU[31:0]      - Input bus to the MAR from the ALU
AR_Bus_PC[31:0]       - Input bus to the MAR from the PC
AR_Bus_Sel            - Select for the MAR 0=>AR_Bus_ALU, 1=>AR_BUS_PC
sysclk                - system clock

OUTPUTS:
AR_Output_Bus[31:0]   - Output from the MAR

5.3.4 FUNCTIONAL DESCRIPTION


At every positive clock edge, the address register is loaded with a new address. The new value is selected from the PC incrementer or the ALU bus, depending on the select signal AR_Bus_Sel.

5.3.5 WRITE DATA REGISTER

INTERFACE

INPUTS:
WD_Bus_Write[31:0]    - input bus
WD_DBE                - sets output to data bus High-Z when asserted
WD_Load               - load enable signal for buffering register
sysclk                - system clock

OUTPUTS:
WD_DOUT[31:0]         - output to data bus
WD_nENOUT             - unused, was a pass-through signal

FUNCTIONAL DESCRIPTION
At every positive clock edge, the write data register registers the data on the input bus and drives it onto the output bus. If the DBE signal is deasserted (electrically low), the output is high-Z.

WD_nENOUT: currently unused; it was a pass-through signal used for bus arbitration.

5.4 BARREL SHIFTER


When the second operand is specified to be a shifted register, the operation of the barrel shifter is controlled by the Shift field in the instruction. This field indicates the type of shift to be performed. The amount by which the register is to be shifted may be contained in an immediate field in the instruction, or in the bottom byte of another register.

5.4.1 INTERFACE
INPUTS:
BS_Enable               - Active high enable signal.
BS_Input_Bus[31:0]      - Input data (Input Bus)
BS_Shift_Type[1:0]      - Control signal which gives the shift type.
BS_Shift_Amt[5:0]       - Control signal which gives the number of bits to be shifted.
BS_Cin                  - Input carry bit.

OUTPUTS:
BS_Shift_Output[31:0]   - Shifted Output bus (Output Bus)
BS_Cout                 - Output Carry bit


5.4.2 FUNCTIONAL BEHAVIOR


If BS_Enable is low, the output bus is the same as the input bus and BS_Cout is equal to BS_Cin. If BS_Enable is high, the barrel shifter performs one of the following types of shift, depending on the value of BS_Shift_Type.

BS_Shift_Type[1:0]    TYPE OF SHIFT
2'b00                 Logical Shift Left
2'b01                 Logical Shift Right
2'b10                 Arithmetic Shift Right
2'b11                 Rotate Right

Logical Shift Left (LSL)

LSL #N (non-zero): The barrel shifter takes the contents of BS_Input_Bus and moves each bit by the amount specified in BS_Shift_Amt to a more significant position. The least significant bits of the result are filled with zeros. The high bits shifted out are discarded, and the least significant discarded bit of BS_Input_Bus becomes BS_Cout. LSL #0 is a special case, where the shift carry out is the old value of the CPSR C flag and BS_Shift_Output is the same as BS_Input_Bus.

Logical Shift Right (LSR)

LSR #N (non-zero): The barrel shifter takes the contents of BS_Input_Bus and moves each bit by the amount specified in BS_Shift_Amt to a less significant position. The most significant bits of the result are filled with zeros. The low bits shifted out are discarded, and the most significant discarded bit of BS_Input_Bus becomes BS_Cout. LSR #0 is used to encode LSR #32, which makes BS_Shift_Output all zeros and sets BS_Cout to bit 31 of BS_Input_Bus.


Arithmetic Shift Right (ASR)

ASR #N (non-zero): An arithmetic shift right is similar to a logical shift right, except that the high bits are filled with bit 31 instead of zeros. This preserves the sign in 2's complement notation. ASR #0 is used to encode ASR #32. Bit 31 of BS_Input_Bus is used as the carry out, and each bit of BS_Shift_Output is also equal to bit 31 of BS_Input_Bus.

Rotate Right (ROR)

ROR #N (non-zero): Rotate right operations reuse the bits which overshoot in a logical shift right operation by reintroducing them at the high end of the result (BS_Shift_Output), in place of the zeros used to fill the high end in logical shift right operations. ROR #0 is used to encode a special function of the barrel shifter, rotate right extended (RRX). This is a rotate right by one bit position of the 33-bit quantity formed by appending the CPSR C flag to the most significant end of the contents of the Input Bus.
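The four shift types and their #0 special cases can be collected into one behavioural sketch. It is a minimal Python model, not the Verilog of the design: the function name `barrel_shift` is illustrative, only shift amounts 0 to 31 are modelled (the cases described above), and the carry out of a non-zero ROR is taken to be bit 31 of the result, as in the ARM architecture.

```python
MASK32 = 0xFFFFFFFF


def barrel_shift(enable, value, shift_type, amount, cin):
    """Return (BS_Shift_Output, BS_Cout) for a 32-bit input value."""
    if not 0 <= amount <= 31:
        raise ValueError("this sketch models shift amounts 0..31 only")
    value &= MASK32
    if not enable:
        return value, cin                        # pass-through when disabled
    if shift_type == 0b00:                       # LSL
        if amount == 0:
            return value, cin                    # LSL #0: unchanged, carry = C flag
        return (value << amount) & MASK32, (value >> (32 - amount)) & 1
    if shift_type == 0b01:                       # LSR; #0 encodes LSR #32
        n = amount or 32
        return (value >> n) & MASK32, (value >> (n - 1)) & 1
    if shift_type == 0b10:                       # ASR; #0 encodes ASR #32
        n = amount or 32
        signed = value - (1 << 32) if value & 0x80000000 else value
        return (signed >> n) & MASK32, (signed >> (n - 1)) & 1
    if amount == 0:                              # ROR #0 encodes RRX
        return ((cin << 31) | (value >> 1)) & MASK32, value & 1
    result = ((value >> amount) | (value << (32 - amount))) & MASK32
    return result, result >> 31                  # carry = last bit rotated in
```

For example, `barrel_shift(1, 0x80000000, 0b01, 0, 0)` exercises the LSR #32 encoding, and `barrel_shift(1, 0x00000001, 0b11, 0, 1)` exercises RRX.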

5.5 ARITHMETIC LOGIC UNIT/BOOTH MULTIPLIER

5.5.1 Description of Booth's multiplier

The multiplier module takes inputs A and B (32 bits each), Enable and the system clock (Sysclk). The module outputs the least significant 32 bits of the product of the two 32-bit operands, together with Ready. The Enable signal indicates when multiplication begins, and the Ready signal indicates when it ends. The results of a signed multiply and an unsigned multiply of 32-bit operands differ only in the upper 32 bits; the low 32 bits of the signed and unsigned results are identical. The same multiplier can therefore be used for both signed and unsigned multiplies.
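The claim about the low 32 bits is easy to check numerically. The sketch below is an illustration only; the helper names are hypothetical and Python's arbitrary-precision integers are used to form the full 64-bit products.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(x):
    """Interpret the low 32 bits of x as a two's complement value."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def compare_products(a, b):
    """Return (unsigned_low, signed_low, unsigned_high, signed_high)."""
    unsigned = (a & MASK32) * (b & MASK32)
    signed = to_signed32(a) * to_signed32(b)
    return (unsigned & MASK32, signed & MASK32,
            (unsigned >> 32) & MASK32, (signed >> 32) & MASK32)


# 0xFFFFFFFE is 4294967294 unsigned but -2 signed.
u_low, s_low, u_high, s_high = compare_products(0xFFFFFFFE, 3)
```

Here both low words are 0xFFFFFFFA, while the high words differ (0x00000002 unsigned, 0xFFFFFFFF signed).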

Block Diagram of Booth's Multiplier:

The Booth's multiplier operates over multiple cycles. As the performance specification, the maximum number of cycles required is 64, and the cycle length should be as small as possible (typically less than 5 ns, which is equivalent to 200 MHz). There is a tradeoff between the number of cycles and the cycle length. With a large number of short cycles, the total time required for a multiply may still be good, but the short cycle length may affect other interfaced circuits. The opposite choice (fewer cycles, each of them longer) has its own advantages and disadvantages. On the whole, the number of cycles and the cycle length are both important variables in designing the Booth's multiplier.

5.5.2 Description of ALU

The ALU takes two 32-bit inputs and performs logic and arithmetic operations. The operation is selected by the 5-bit Alu_Control signal. The logic operations are bit-wise AND, bit-wise OR, and bit-wise XOR. The arithmetic operations are add, add Carry_in, sub, sub Carry_in, move from coprocessor register, and pass data. The ALU has two outputs: the 32-bit Output is the result of the operation on Input1 and Input2, and the 4-bit Flag_signals supports every conditional instruction in ARM7. Flag_signals is broken down as [N,Z,C,V], where N indicates that Output is negative, Z indicates that Output is equal to zero, C indicates that the operation produced a carry out (arithmetic operations only), and V indicates an overflow (again, arithmetic operations only).

Block Diagram of ALU:

The performance specification of the ALU is to generate the logic/arithmetic result using combinational logic. The ALU is also required to generate the correct Flag_signals in order to support every conditional instruction. Besides those specifications, we need to keep the area of the ALU, an important variable in our design, as small as possible, which leads to a lower cost.
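A behavioural sketch of the Output and [N,Z,C,V] generation may help make the flag definitions concrete. It is not the design's implementation: the string operation names stand in for the 5-bit Alu_Control codes, which are not listed here, and subtraction is modelled as addition of the inverted second operand, so C follows the ARM convention (C = 1 means no borrow).

```python
MASK32 = 0xFFFFFFFF


def alu(op, in1, in2, carry_in=0):
    """Return (Output, (N, Z, C, V)) for a subset of the ALU operations."""
    a, b = in1 & MASK32, in2 & MASK32
    c = v = 0                                   # C and V stay 0 for logic ops
    if op in ("add", "adc", "sub", "sbc"):
        if op in ("sub", "sbc"):
            b = ~b & MASK32                     # a - b == a + ~b + 1
            cin = 1 if op == "sub" else carry_in
        else:
            cin = carry_in if op == "adc" else 0
        total = a + b + cin
        out = total & MASK32
        c = total >> 32                         # carry out of bit 31
        v = (((a ^ out) & (b ^ out)) >> 31) & 1 # operands agree, result differs
    elif op == "and":
        out = a & b
    elif op == "or":
        out = a | b
    elif op == "xor":
        out = a ^ b
    else:
        raise ValueError(f"operation not modelled: {op}")
    n = out >> 31
    z = int(out == 0)
    return out, (n, z, c, v)
```

For example, `alu("add", 0x7FFFFFFF, 1)` produces a negative result from two positive operands, so both N and V are set.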

5.6 Design Process of the Booth's Multiplier

There are several algorithms to implement the multiplier. In ARM7, Booth's algorithm is chosen to implement the multiplier. This choice is desirable because Booth's algorithm is simple to design. The alternative approach to this design is to implement a parallel multiplier or a Wallace Tree multiplier to gain better performance. However, these designs are much more complex.
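To show what Booth's algorithm computes, here is a minimal radix-2 sketch in Python. It is an assumption-laden illustration rather than the ASM used in this design: it processes one multiplier bit per loop iteration, keeps only the low 32 bits of the product, and the function name `booth_multiply_low32` is hypothetical.

```python
MASK32 = 0xFFFFFFFF


def booth_multiply_low32(a, b):
    """Radix-2 Booth multiplication; returns the low 32 bits of a * b."""
    a &= MASK32
    b &= MASK32
    product = 0
    prev_bit = 0                          # implicit bit to the right of b[0]
    for i in range(32):
        bit = (b >> i) & 1
        if (bit, prev_bit) == (0, 1):     # end of a run of ones: add A << i
            product += a << i
        elif (bit, prev_bit) == (1, 0):   # start of a run of ones: subtract A << i
            product -= a << i
        prev_bit = bit                    # (0,0) and (1,1): no operation
    return product & MASK32
```

A run of ones in the multiplier costs only one subtraction and one addition, which is the property that keeps the hardware simple. `booth_multiply_low32(7, 6)` returns 42.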


We divide the design process into several steps:

First step: Draw the ASM (Algorithmic State Machine) chart based on Booth's algorithm, which is the major design algorithm for the multiplier. The ASM of the multiplier is shown in Figure 1.3.

Second step: Based on the ASM, we write the behavioral Verilog code for the multiplier. Then we run the simulation to see if the multiplier performs its function correctly.

Third step: After verifying that the multiplier functions correctly, we begin to work on the architecture of the multiplier (the actual hardware components used in the algorithm). Then we write the mixed Verilog code for the multiplier (a combination of behavioral code and actual hardware). As always, we run the simulation to check that the mixed Verilog code agrees with the behavioral code we wrote earlier.

Final step: In this step, the actual components are used to build the whole design. No behavioral code is involved. All the code written in this step is structural code, which describes the design structurally. Again, we verify that the structural code gives the same result as the behavioral and mixed code. Once this is done, we use Synopsys to synthesize the structural code into the real circuit, which is ready to fabricate at the chip level.

5.7 Design Process of the Arithmetic Logic Unit (ALU)


As with the multiplier above, we go through all the steps except the mixed stage. This is because the ALU is a combinational logic circuit, which does not need the mixed stage. We can go directly from the behavioral code to the structural code and synthesize from there. An alternative approach to the design is to use a conditional sum adder or a binary carry look-ahead adder for better performance instead of the carry select adder. However, these designs are considerably more complex. The major design equations for the ALU implement the following functions depending on Alu_Control:

5.8 ERROR DATA



During the full chip-testing phase, all errors that were found were tracked. A description of the error, its status, the defective module's name, and the time to debug were documented for each bug. In all, fifty-one errors were documented. By the end of testing, thirty-nine had been fixed. From Table E.1 we see that a majority of all the errors involved the controller or datapath. This is to be expected, as it is the largest and most significant module and also the module that is tested the most vigorously.

GROUP NAME                      Number of Errors    % of Total
Controller/Datapath             28                  61 %
FPU                             4                   9 %
Memory                          4                   9 %
ALU/Multiplier                  3                   6.5 %
Register File/Barrel Shifter    2                   4 %
Other                           5                   11 %

Table E.1 - Errors broken down by team.

Errors can be categorized into two groups: functional errors and timing errors. Functional errors are due to missing signals, incorrect driving of signals, missing instructions, and malfunctioning programs used for debug. Timing errors are based on timing issues, such as handshaking and stalling of the microprocessor. Table E.2 shows that a majority of the errors were functional errors. Again, this is to be expected, since modules were tested hastily due to the severe lack of time.

ERROR TYPE    Number of Errors    % of Total
Functional    34                  72.72 %
Timing        12                  27.28 %

Table E.2 - Errors broken down by type.

Another important metric to keep is the average lifetime of an error. The lifetime of an error is characterized as the time from when an error is opened to the time when it is closed. The lifetime of errors found during full-chip testing was 1.25 days on average. This is really a metric of how quickly the Architecture team fixed errors with the controller and datapath, as they had 61.36 % of the errors.
ALU functions selected by Alu_Control:

Output = Input1                             (Pass Input1)
Output = Input2                             (Pass Input2)
Output = Input1 + Input2                    (Add)
Output = Input1 + Input2 + Carry_in         (Add Carry_in)
Output = Input1 - Input2                    (Subtract)
Output = Input1 - Input2 + Carry_in - 1     (Subtract Carry_in)
Output = Input2 - Input1
Output = Input2 - Input1 + Carry_in - 1
Output = Input1 & Input2                    (Logical AND)
Output = Input1 | Input2                    (Logical OR)
Output = Input1 ^ Input2                    (Logical XOR)
Output = Input1 & (~Input2)                 (Logical AND NOT)
Output = ~Input2                            (Logical NOT)

To understand how the Carry Select Adder works, we consider an 8-bit Carry Select Adder for simplicity. Please refer to Figure 1.4 on the next page for the block diagram. For a 32-bit Carry Select Adder, we cascade in the same manner. The high level block diagram of the 32-bit Carry Select Adder is shown on


6 Validation and Discussion of Results


6.1 SIMULATION RESULTS:

DESIGN VERIFICATION:
Consider an instruction:

mov r1, r1, lsr r3

This ARM assembly instruction does a logical shift right of the contents of the r1 register. The shift amount is given by the r3 register. The result of the shift is stored in the r1 register. The machine language for this instruction is 0xe1b01331. Table 5 on page 14 shows how the 32-bit instruction is split into fields. Each field controls a different part of the execution stage.

1. Since bits 27 and 26 are 0, this instruction is a REGOP type. (Full decode logic is in a23_decode.v, line 364).
2. The opcode, bits 24 to 21, is 0xd, which is the mov opcode (table 7).
3. Bits 15 to 12 define the destination register for this operation. They are decoded into the reg_bank_wen[14:0] signal.
4. Bits 11 to 0 define a shifter operand, see table 8. Rs is set to 3 and Rm is set to 1.

These fields are extracted in the decoder stage, along with a number of mux control signals, then registered and fed into the execute stage.
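The field extraction for this instruction word can be reproduced with a few shifts and masks. The sketch below follows the standard ARM data-processing layout for a register-specified shift; the function and dictionary key names are illustrative and are not signal names from a23_decode.v.

```python
def decode_regop(instr):
    """Split a 32-bit ARM data-processing instruction into its fields."""
    return {
        "cond":       (instr >> 28) & 0xF,  # condition code, bits 31-28
        "type":       (instr >> 26) & 0x3,  # bits 27-26, 0 for REGOP
        "imm":        (instr >> 25) & 0x1,  # 0 = operand 2 is a shifted register
        "opcode":     (instr >> 21) & 0xF,  # bits 24-21
        "s":          (instr >> 20) & 0x1,  # flag-setting bit
        "rn":         (instr >> 16) & 0xF,  # first operand register
        "rd":         (instr >> 12) & 0xF,  # destination register, bits 15-12
        "rs":         (instr >> 8) & 0xF,   # register holding the shift amount
        "shift_type": (instr >> 5) & 0x3,   # 0b01 = LSR
        "reg_shift":  (instr >> 4) & 0x1,   # 1 = shift amount comes from Rs
        "rm":         instr & 0xF,          # register to be shifted
    }


fields = decode_regop(0xE1B01331)
```

The result confirms the walk-through: type 0, opcode 0xd, destination register 1, Rs = 3 and Rm = 1, with shift type 0b01 (LSR). Bit 20 is also set in this encoding, which in the ARM instruction set is the flag-setting (S) form of the instruction.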

SIMULATION RESULTS :


SIMULATION RESULTS OF DECODER

In the execute stage, the fields Rs and Rm are used to select registers which are put onto the rs and rm buses respectively. E.g. Rs has a value of 4. This is put into a mux which selects one of the 16 registers. The output of the mux, a full 32-bit value, is connected to the rs bus. rs is used as the barrel shift amount. rm is used as the barrel shift value.

SIMULATION RESULTS OF REGISTER BANK:


The barrel shift function is set to lsr, and the barrel shifter performs the shift, outputting the shifted value on barrel_shift_out. This passes through the ALU with no changes (the ALU is put in pass-through mode), and the ALU output is written back into the register bank.

SIMULATION RESULTS OF BARREL SHIFTER:

Similarly, for an add operation, the simulation result of the ALU is:
