Dr. Aparna P., Assistant Professor, EC Dept., NITK, Surathkal
Embedded Processor Categories

- General Purpose Processors
- Microcontrollers
- Digital Signal Processors
- Customized processors and FPGAs can be included for specific functionality.

General Purpose Processors

- Processor designed for a variety of computation tasks; pre-designed for a common task and sold "off-the-shelf".
- Low unit cost, in part because the manufacturer spreads NRE over a large number of units.
- Carefully designed, since higher NRE is acceptable; can yield good performance, size, and power.
- High flexibility: the user just writes software, with no processor design.
- Low NRE cost for the user; short time-to-market/prototype.
[Figure: basic processor organization — a control unit (controller) and a datapath containing the ALU.]
History of General Purpose Processors

- 1950s: IBM instituted a research program; first processor with index registers.
- 1964: Release of the System/360.
- 1971: Intel released its first processor, the Intel 4004, for use in calculators.
- Mid-1970s: Improved measurement tools demonstrated problems with CISC.
- 1975: MC6800 released; the 801 project was initiated at IBM's Watson Research Center, where a 32-bit RISC microprocessor (the 801) was developed, led by Joel Birnbaum.
- 1979: MC68000 released, a 32-bit processor with 16-bit buses and a protected (supervisor) mode of operation; RISC-I developed at Berkeley.
- 1981: MIPS-I developed at Stanford.
- 1988: RISC processors had taken over the high end of the workstation market.
- Early 1990s: IBM's POWER (Performance Optimization With Enhanced RISC) architecture introduced with the RISC System/6000; the AIM (Apple, IBM, Motorola) alliance formed, resulting in PowerPC.
Architectural Variants: Von Neumann vs Harvard Architecture

- Von Neumann: a single memory holds both program and data; most computers use this architecture.
- Harvard: separate program and data memories, allowing two simultaneous memory fetches; this gives greater and more predictable memory bandwidth.
- Most DSPs and embedded controllers use the Harvard architecture for streaming data.
- In certain embedded applications where the program is more-or-less hard-wired, the Harvard architecture is advantageous.
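The bandwidth advantage of two simultaneous fetches can be sketched with a toy cycle-count model (illustrative numbers only, not tied to any particular processor):

```python
# Toy model: cycles to run n instructions when every instruction needs
# one instruction fetch plus one data access.

def von_neumann_cycles(n: int) -> int:
    # Single shared memory port: the instruction fetch and the data
    # access must occupy separate cycles.
    return n * 2

def harvard_cycles(n: int) -> int:
    # Separate program and data memories: both accesses can proceed
    # in the same cycle.
    return n

if __name__ == "__main__":
    print(von_neumann_cycles(1000))  # 2000
    print(harvard_cycles(1000))      # 1000
```

Under this simplified model the Harvard machine sustains twice the memory bandwidth, which is why streaming DSP workloads favor it.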
RISC vs CISC

- Complex instruction set computer (CISC): many addressing modes and many operations; simple programming and less program space; complex processor with a control-store (microprogrammed) control unit.
- Reduced instruction set computer (RISC): load/store architecture; simple processor with a hardwired control unit; pipelinable instructions.
Pipelining: Increasing Instruction Throughput

Typical stages: Fetch, Decode, Fetch operands, Execute, Store results.

[Figure: non-pipelined vs pipelined execution of eight instructions over time; once the pipeline is full, one instruction completes every cycle.]
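The throughput gain can be sketched with a small calculation, assuming one cycle per stage and no stalls (illustrative, not the timing of any specific processor):

```python
# Latency of n instructions on a k-stage pipeline, ideal case (no stalls).

def nonpipelined_cycles(n: int, k: int) -> int:
    # Each instruction occupies the whole datapath for k cycles.
    return n * k

def pipelined_cycles(n: int, k: int) -> int:
    # The first instruction takes k cycles to fill the pipeline;
    # after that, one instruction completes every cycle.
    return k + (n - 1)

if __name__ == "__main__":
    n, k = 8, 5  # 8 instructions; 5 stages as listed above
    print(nonpipelined_cycles(n, k))  # 40
    print(pipelined_cycles(n, k))     # 12
```

For the eight instructions in the figure, pipelining cuts the total from 40 cycles to 12, approaching one instruction per cycle as n grows.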
Superscalar vs VLIW

Both use multiple ALUs to support more than one instruction stream.

- Superscalar: fetches instructions in batches and executes as many independent instructions as possible; may require extensive hardware to detect independent instructions.
- VLIW (Very Long Instruction Word): each word in memory holds multiple independent instructions; relies on the compiler to detect and schedule independent instructions; currently growing in popularity.

[Figure: instruction execution on two parallel pipelines.]
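The compiler-side scheduling a VLIW relies on can be sketched as a greedy packer that places each instruction in the earliest word after its producers. This is a simplified illustration that tracks only true data dependences (the instruction tuple format is hypothetical):

```python
# Greedy VLIW-style list scheduling: pack instructions into wide words,
# keeping an instruction in a later word than all of its producers.
# Each instruction is (dest_register, source_registers); anti- and
# output-dependences are ignored for simplicity.

def schedule(instrs, width=2):
    word_of = {}   # dest register -> index of the word that produces it
    words = []     # each word is a list of instruction indices
    for i, (dest, srcs) in enumerate(instrs):
        # Earliest legal word: one past the latest word producing an input.
        earliest = max((word_of[s] + 1 for s in srcs if s in word_of),
                       default=0)
        w = earliest
        while w < len(words) and len(words[w]) >= width:
            w += 1                       # word full, spill to the next one
        while len(words) <= w:
            words.append([])
        words[w].append(i)
        word_of[dest] = w
    return words

program = [
    ("r1", ("r8", "r9")),    # independent
    ("r2", ("r10", "r11")),  # independent -> packs with the first
    ("r3", ("r1", "r2")),    # needs r1 and r2 -> next word
]
print(schedule(program))  # [[0, 1], [2]]
```

The two independent instructions share one long word; the dependent one is forced into the next, which is exactly the decision a superscalar would instead make in hardware at run time.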
Typical Processors: VIA C3

- The VIA C3 is a processor by VIA Technologies based on the x86 ISA.
- Compared to the Pentium, these processors are power efficient and hence more suitable for the embedded market: low power consumption and effective heat dissipation.
- Multiple pipeline stages (12 stages) and more than one level of cache memory.
- Available in an EBGA package.
- Suitable for personal electronics and mobile phones; good performance for Internet use (web browsing, video conferencing) and digital media applications.
Architectural Details: Instruction Fetch Unit

- Fetches instructions from the I-cache or the external bus.
- Three pipeline stages in the Instruction Fetch Unit deliver aligned instructions into the instruction decode buffers; fetched instruction data is placed sequentially into multiple buffers.
- The instruction is predecoded as it comes out of the cache; predecode is overlapped with other required operations and thus effectively takes no time.
- The TLB (Translation Look-aside Buffer) holds the addresses of recently accessed memory pages. It enables faster computing because it allows address processing to take place independent of the normal address-translation pipeline.
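A TLB lookup can be sketched as a small associative cache sitting in front of the page table. This is a simplified illustration with an assumed 4 KB page size and FIFO replacement, not the C3's actual organization:

```python
# Minimal TLB sketch: virtual page number -> physical frame, with a
# page-table walk only on a miss.
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size for this illustration

class TLB:
    def __init__(self, capacity=8):
        self.capacity = capacity
        self.entries = OrderedDict()  # vpn -> physical frame number

    def translate(self, vaddr, page_table):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.entries:              # TLB hit: no page-table walk
            return self.entries[vpn] * PAGE_SIZE + offset, "hit"
        frame = page_table[vpn]              # TLB miss: walk the page table
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict the oldest entry (FIFO)
        self.entries[vpn] = frame
        return frame * PAGE_SIZE + offset, "miss"

page_table = {0: 7, 1: 3}   # hypothetical page table
tlb = TLB()
print(tlb.translate(4100, page_table))  # (12292, 'miss')
print(tlb.translate(4104, page_table))  # (12296, 'hit')
```

The second access to the same page hits in the TLB and skips the translation walk entirely, which is the speedup the slide describes.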
Instruction Decode Unit

- Converts instruction bytes into the internal execution format in two pipeline stages.
- The F stage decodes and "formats" an instruction into an intermediate format; the internal-format instructions are placed into a five-deep FIFO queue, the FIQ.
- The X stage "translates" an intermediate-form instruction from the FIQ into the internal microinstruction format.
- Instruction fetch, decode, and translation are thus made asynchronous from execution via the five-entry FIFO queue.
- Branching operations are identified here, and the processor starts fetching instructions from the new location.
Branch History Table (BHT) & Branch Target Buffer (BTB)

- The IFU pre-fetches instructions into the instruction-fetch cache at different stages and sends them for decoding. In the case of a branch instruction, all prefetched instructions are abandoned and a new set must be loaded.
- Branch Prediction (BP) is a technique that attempts to infer the proper next instruction address, knowing only the current one. Predicting a branch earlier in the pipeline saves the time spent flushing the current instructions and fetching new ones.
- Prediction is based on branch history, which is stored in a set of buffers known as the BHT.
- Typically a BTB is used: a small, associative memory that watches the instruction-cache index and tries to predict which index should be accessed next. This is carried out in the F stage.
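BHT-style dynamic prediction is commonly built from 2-bit saturating counters; the sketch below illustrates that common scheme (not necessarily the exact one the C3 uses):

```python
# 2-bit saturating-counter branch predictor: counters 0..3, where
# values >= 2 predict "taken". Two wrong outcomes in a row are needed
# to flip a strongly biased prediction.

class BranchHistoryTable:
    def __init__(self, size=1024):
        self.size = size
        self.counters = [1] * size  # start "weakly not taken"

    def predict(self, pc):
        return self.counters[pc % self.size] >= 2

    def update(self, pc, taken):
        i = pc % self.size
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

# A loop branch: taken 8 times, not taken once (loop exit), taken 8 more.
bht = BranchHistoryTable()
outcomes = [True] * 8 + [False] + [True] * 8
correct = 0
for taken in outcomes:
    if bht.predict(0x4000) == taken:
        correct += 1
    bht.update(0x4000, taken)
print(correct, "of", len(outcomes))  # 15 of 17
```

The single loop exit costs only one misprediction, and the counter's hysteresis keeps the next loop iteration predicted correctly — the benefit of two bits over a one-bit history.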
Cache Access stages (D. All basic ALU functions take one clock except multiply and divide. Execute stage (E): Integer ALU operations are performed. Write-back stage (W): The results of operations are committed to the register file.
. Store stage (S): Integer store data is grabbed in this stage and placed in a store buffer. Addressing stage (A): Memory addresses are calculated and sent to the D-cache (Data Cache).-Contd
Decode stage (R): Micro-instructions are decoded. integer register files are accessed and resource dependencies are evaluated. G): The D-cache and D-TLB (Data Translation Look aside Buffer) are accessed and aligned load data returned at the end of the G-stage.
Floating Point Unit (FPU)

- A separate 80-bit floating-point execution unit that can execute floating-point instructions (FPI) in parallel with integer instructions.
- FPI are passed from the integer pipeline to the FPU through a separate FIFO queue. This queue, which runs at the processor clock speed, decouples the slower-running FP unit from the integer pipeline so that the integer pipeline can continue to process instructions overlapped with FP instructions.
- Basic arithmetic floating-point instructions (add, multiply, divide, square root, compare, etc.) are represented by a single internal floating-point instruction.
- Certain little-used and complex floating-point instructions (sin, tan, etc.) are implemented in microcode and are represented by a long stream of instructions coming from the ROM; these instructions "tie up" the integer instruction pipeline such that integer execution cannot proceed until they complete.
MMX & 3D Unit

- A separate execution unit handles the MMX-compatible instructions. One MMX instruction can issue into the MMX unit every clock.
- The MMX multiplier is fully pipelined and can start one non-dependent MMX multiply[-add] instruction (which consists of up to four separate multiplies) every clock. Multiplies followed by a dependent MMX instruction require two clocks; other MMX instructions execute in one clock.
- A separate execution unit handles some specific 3D instructions. These instructions provide assistance for graphics transformations through SIMD (Single Instruction Multiple Data) single-precision floating-point capabilities.
- The 3D unit has two single-precision floating-point multipliers and two single-precision floating-point adders. The multiplier and adder are fully pipelined and can start any non-dependent 3D instruction every clock; one 3D instruction can issue into the 3D unit every clock.
- Other functions such as conversions, reciprocal, and reciprocal square root are also provided.
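The SIMD idea behind the 3D unit — one instruction applying the same operation to several single-precision lanes at once — can be illustrated in pure Python (a conceptual sketch, not the actual instruction set):

```python
# Each "instruction" operates on four float32 lanes in one step.
import array

def simd_mul(a, b):
    # One SIMD multiply: all four lanes at once, element-wise.
    return array.array("f", (x * y for x, y in zip(a, b)))

def simd_add(a, b):
    # One SIMD add: all four lanes at once, element-wise.
    return array.array("f", (x + y for x, y in zip(a, b)))

# Scale-and-translate a 3D vertex (x, y, z, w) in just two SIMD operations,
# instead of six scalar multiplies/adds.
vertex = array.array("f", [1.0, 2.0, 3.0, 1.0])
scale  = array.array("f", [2.0, 2.0, 2.0, 1.0])
offset = array.array("f", [0.5, 0.0, -1.0, 0.0])
print(list(simd_add(simd_mul(vertex, scale), offset)))  # [2.5, 4.0, 5.0, 1.0]
```

This is the kind of graphics transformation (scaling plus translation of vertices) that the slide says the 3D unit accelerates, with each lane processed by its own multiplier/adder.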
VIA C3: Summary

- The VIA C3 processor uses the same x86 instruction set as Intel processors and has a pipelined architecture.
- It handles a very complex instruction set, and the overall power consumption is higher because of the complexity of the processor.
- Because of the uncertainties associated with branching, the overall instruction execution time is not fixed; it is therefore not suitable for some real-time applications that need a precise execution speed.
Typical Processors: PowerPC

- POWER (Performance Optimization With Enhanced RISC) is a RISC instruction set architecture designed by IBM. PowerPC is largely based on IBM's POWER architecture and was created by the 1991 Apple-IBM-Motorola alliance, known as AIM.
- 32-bit and 64-bit PowerPC processors have been a favorite of embedded computer designers. The MPC601 was the first PowerPC processor, with a speed of 66 MHz and 132 MIPS.
- The PowerPC architecture allows optimizing compilers to schedule instructions to maximize performance through efficient use of the instruction set and register model. The multiple, independent execution units allow compilers to maximize parallelism and instruction throughput.
MPC601 Features

- High-performance superscalar microprocessor:
  - As many as three instructions in execution per clock
  - Single-clock-cycle execution for most instructions
  - Pipelined FPU for all single-precision and most double-precision operations
- Three independent execution units and two register files:
  - BPU featuring static branch prediction
  - A 32-bit IU
  - Fully IEEE 754-compliant FPU for both single- and double-precision operations
  - 32 GPRs for integer operands; 32 FPRs for single- or double-precision operands
- High instruction and data throughput:
  - Zero-cycle branch capability
  - Instruction unit capable of fetching eight instructions per clock from the cache
  - An eight-entry instruction queue that provides look-ahead capability
  - Interlocked pipelines with feed-forwarding that control data dependencies in hardware
- Unified 32-Kbyte cache: eight-way set-associative, physically addressed, LRU replacement
- Memory unit with a two-element read queue and a three-element write queue; run-time reordering of loads and stores
- BPU that performs condition register (CR) look-ahead operations
- Address translation facilities for both data and instructions, through the UTLB and ITLB respectively
- 52-bit virtual address, 32-bit physical address