 Computer Design

Part 1B, Dr S.W. Moore

 Introduction
Aims
The aims of this course are to introduce the hardware/software interface models and the hardware structures used in designing computers. The first seven lectures are concerned with the hardware/software interface and cover the programmer's model of the computer. The last nine lectures look at hardware implementation issues at a register transfer level.

Lectures
1. Introduction to the course and some background history.
2. Historic machines. EDSAC versus Manchester Mark I.
3. Introduction to RISC processor design and the MIPS instruction set.
4. MIPS tools and code examples.
5. Operating system support including memory hierarchy and management.
6. Intel x86 instruction set.
7. Java Virtual Machine.
8. Memory hierarchy (caching).
9. Executing instructions. An algorithmic viewpoint.
10. Basic processor hardware, pipelining and hazards. [2 lectures]
11. Verilog implementation of a MIPS processor. [3 lectures]
12. Internal and external communication.
13. Data-flow and comments on future directions.

Objectives
At the end of the course students should:
1. be able to read assembler given a guide to the instruction set, and be able to write short pieces of assembler if given an instruction set or asked to invent an instruction set
2. understand the differences between RISC and CISC assembler
3. understand what facilities a processor provides to support operating systems, from memory management to software interrupts
4. understand memory hierarchy including different cache structures
5. appreciate the use of pipelining in processor design
6. understand the communications structures, from buses close to the processor, to peripheral interfaces
7. have an appreciation of control structures used in processor design
8. have an appreciation of how to implement a processor in Verilog

 Instruction Sets
Objective 1

Accumulator
An accumulator is a register in which intermediate arithmetic and logic results are stored. The characteristic which distinguishes one register as being the accumulator of a computer architecture is that the accumulator is used as an implicit operand for arithmetic instructions. For instance, a computer might have an instruction like:

ADD memaddress

This instruction would add the value read from the memory location at memaddress to the value in the accumulator, placing the result in the accumulator.
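As a sketch, the accumulator semantics can be modelled in a few lines of Python (the memory contents and addresses here are made up for illustration):

```python
# Illustrative model of an accumulator machine: the accumulator is the
# implicit source and destination of every arithmetic instruction.
memory = {0x10: 5, 0x11: 7}   # assumed memory contents
acc = 0                        # the accumulator

def load(addr):
    """LOAD addr: acc <- memory[addr]"""
    global acc
    acc = memory[addr]

def add(addr):
    """ADD addr: acc <- acc + memory[addr]; acc is never named explicitly."""
    global acc
    acc = acc + memory[addr]

load(0x10)   # acc = 5
add(0x11)    # acc = 5 + 7 = 12
```

Note that `ADD` names only one operand; the other operand and the destination are both the accumulator.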

Register
A processor register is a small amount of storage available on the CPU whose contents can be accessed more quickly than storage available elsewhere. Moving data from main memory into registers, operating on them, then moving the result back into main memory is common—a so-called load-store architecture.

Stack
A stack machine's memory takes the form of one or more stacks. In addition, a stack machine can also refer to a real or simulated machine with a "0-operand" instruction set. In such a machine, most instructions implicitly operate on values at the top of the stack and replace those values with the result. Typically such machines also have "load" and "store" instructions that read and write arbitrary RAM locations. A stack machine will often be simpler to program and more reliable to run than other machines. Writing compilers for stack-based machines is also comparatively simple, as they have fewer exceptional cases to complicate matters. Since running compilers can take up a significant percentage of machine resources, building a machine that can have an efficient compiler is important. A stack-based machine instruction is smaller than a register-based one since there is no need to specify operand addresses. This nearly always leads to dramatically smaller compiled programs. However, a given task can often be expressed using fewer register machine instructions than stack ones.

Stack machine: a = b + c might be translated as LOAD c, LOAD b, ADD, STORE a.
Register machine: the same code could be a single instruction: ADD a, b, c.
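The stack-machine translation of a = b + c above can be simulated directly (a Python sketch; the instruction names follow the text, the memory contents are made up):

```python
# Illustrative 0-operand stack machine: ADD takes no operands because it
# implicitly pops its inputs from, and pushes its result to, the stack.
memory = {"a": 0, "b": 2, "c": 3}   # assumed variable values
stack = []

def LOAD(name):                      # push memory[name] onto the stack
    stack.append(memory[name])

def ADD():                           # pop two values, push their sum
    y, x = stack.pop(), stack.pop()
    stack.append(x + y)

def STORE(name):                     # pop the top of stack into memory[name]
    memory[name] = stack.pop()

# a = b + c, exactly as translated in the text:
LOAD("c"); LOAD("b"); ADD(); STORE("a")
```

Four tiny instructions do the work of the register machine's single `ADD a, b, c`, which illustrates the trade-off: smaller instructions, but more of them.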

 OS Support & Memory Management
Objective 3

Exceptions & Interrupts
Exception handling is a programming language construct or computer hardware mechanism designed to handle the occurrence of some condition that changes the normal flow of execution, for instance division by zero. In general, the current state will be saved in a predefined location and execution will switch to a handler. The handler may later resume the execution at the original location, using the saved information to restore the original state. Interrupts have a similar effect to exceptions except that they are caused by an external signal, for example a direct memory access device signalling completion. Applications normally run in user mode; however, when an interrupt or exception occurs the processor is switched into an alternative mode which has a higher privilege. The software handler is now exposed to more of the internals of the processor and sees a more complex memory model.

Memory Management & Protection
Memory management involves providing ways to allocate portions of memory to programs at their request, and freeing it for reuse when no longer needed.
1. Relocation – Programs must be able to reside in different parts of memory at different times, since when a program is swapped back into memory after being swapped out for a while it cannot always be placed in the same location.
2. Protection – Processes should not be able to reference the memory of another process without permission, preventing malicious code in one program from interfering with another.
3. Sharing – Processes should be able to share access to the same part of memory with the necessary permissions.
4. Logical organization – Programs are often organized in modules. Some of these modules could be shared between different programs, some are read only and some contain data that can be modified. The memory manager is responsible for handling this logical organization, which is different from the physical linear address space.
5. Physical organization – Memory is usually divided into fast primary storage and slow secondary storage. The memory manager handles moving information between these two levels of memory.

Virtual Addressing
Virtual memory gives a program the impression that it has contiguous working memory, while in fact the memory is physically fragmented and may even overflow onto disk storage. This technique makes programming large applications easier, as well as using physical memory more efficiently.

Virtual addresses are what an application uses to address its memory. These must be converted to reference a block of physical memory, usually between 1 and 64 Kbytes in size, called a page. The upper bits of a virtual address correspond to a page and the lower bits specify an index into the page. If there is insufficient physical memory then some of the pages may be swapped out to disk. If an application attempts to use a page currently stored on disk then the address translation mechanism causes an exception which is handled by the operating system. The OS selects a 'suitable' page to be swapped out to disk, and then swaps the required page from disk to memory. There are two principal virtual addressing schemes:

Multiple address space – Each application resides in its own separate virtual address space and is prohibited from making accesses outside this space. This method makes sharing libraries and data more difficult as it involves having several virtual addresses for the same physical address.

Single address space – There is only one virtual address space and each application being executed is allowed some part of it. Linking must be done at load time because the exact address of any particular component is only known then. However, sharing libraries and data is much simpler since there is only one virtual address for each physical one.

Address Translation
If each page is 4 KB in size then we require 12 bits for the page offset. Assuming we are using 32-bit addressing, that leaves us with 20 bits to express the virtual page number. Each entry of a page table is 4 bytes. A simple page table therefore consumes 4 × 2^20 = 4194304 bytes = 4 MB of memory per table / application. One solution to this large overhead is to use multilevel page tables. However, page tables may still be large, with many unused entries. Particular problems arise with address spaces that are wider than 32 bits or if the virtual address space is occupied very sparsely.
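The arithmetic above can be checked directly (a Python sketch using the figures from the text; the example address is made up):

```python
# 4 KB pages -> 12 offset bits; 32-bit addresses -> 20 bits of virtual page number.
OFFSET_BITS = 12
PAGE_SIZE = 1 << OFFSET_BITS              # 4096 bytes
VPN_BITS = 32 - OFFSET_BITS               # 20
ENTRY_SIZE = 4                            # bytes per page-table entry
table_size = ENTRY_SIZE * (1 << VPN_BITS)  # 4 * 2**20 bytes = 4 MB per table

def split(vaddr):
    """Split a 32-bit virtual address into (virtual page number, page offset)."""
    return vaddr >> OFFSET_BITS, vaddr & (PAGE_SIZE - 1)

vpn, offset = split(0x00402ABC)   # -> page 0x402, offset 0xABC
```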

An inverted page table keeps one entry for each of the pages in physical memory, and uses a hash lookup to translate virtual addresses to physical addresses in nearly constant time.

This page table is inverted in the sense that physical frames, instead of virtual pages, are used as the main index into the table. A system-wide inverted page table (IPT) maps each physical page number to a virtual page number. A key advantage of an IPT is that its size grows in direct proportion to the number of physical pages in the system. Each entry in the table records which page of which process occupies that particular page of physical memory. The main disadvantage is that a search mechanism is required to locate the physical frame (if any) that holds a particular page. This can be done efficiently using a hash function.
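A hashed IPT lookup might be sketched as follows (a Python sketch; the table structure, hash scheme and sizes are illustrative, not any particular machine's):

```python
# Illustrative inverted page table: one entry per physical frame, plus a
# hash table of candidate frames (chained) to make lookups near-constant time.
NUM_FRAMES = 8
ipt = [None] * NUM_FRAMES    # frame -> (process id, virtual page number)
anchors = {}                 # hash bucket -> list of candidate frame numbers

def insert(pid, vpn, frame):
    ipt[frame] = (pid, vpn)
    anchors.setdefault(hash((pid, vpn)) % NUM_FRAMES, []).append(frame)

def translate(pid, vpn):
    """Return the physical frame holding (pid, vpn), or None (page fault)."""
    for frame in anchors.get(hash((pid, vpn)) % NUM_FRAMES, []):
        if ipt[frame] == (pid, vpn):   # confirm: hash buckets can collide
            return frame
    return None

insert(pid=1, vpn=0x402, frame=3)
```

The table has `NUM_FRAMES` entries regardless of how large the virtual address spaces are, which is exactly the size advantage described above.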

Memory-Mapped I/O
Input and output devices are usually mapped to part of the address space. Thus, communicating to an I/O device can be the same as reading and writing to memory addresses devoted to the I/O device. The I/O device merely has to use the same protocol to communicate with the CPU as memory uses. Reading from an I/O device often has side effects. Memory protection is used to ensure that only the device driver has access to its area in memory. Some processors have special instructions to access I/O within a dedicated I/O address space.
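The idea that an ordinary store either reaches RAM or a device register, depending only on the address, can be sketched as follows (a Python sketch; the UART_TX address and the device itself are invented for illustration):

```python
# Illustrative memory-mapped I/O: the "bus" routes each store either to RAM
# or to a device register, using the same store operation for both.
ram = bytearray(256)
uart_output = []        # what the pretend output device has "printed"
UART_TX = 0x80          # assumed address of the device's transmit register

def store(addr, value):
    if addr == UART_TX:                 # store to the device: has a side effect
        uart_output.append(chr(value))
    else:                                # store to ordinary memory
        ram[addr] = value

store(0x10, 42)                          # plain memory write
store(UART_TX, ord("H"))                 # same instruction, drives the device
store(UART_TX, ord("i"))
```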

TLB
Performing a complete address translation for every memory request is prohibitively time consuming. The translation look-aside buffer (TLB) is used to cache recently performed translations so that they may be reused. The MIPS R3000 caches 64 translation entries in a fully associative store. When an address needs to be translated, the TLB is searched in parallel for the appropriate virtual page translation and protection information. A TLB miss occurs if the translation information is not present in the TLB. A TLB miss is handled in software.

Cache: A cache is a small local memory which makes use of the temporal and spatial characteristics of data to store values which are likely to be needed in the near future. (see next section)

 Memory Hierarchy
Objective 4

"Ideally one would desire an indefinitely large memory capacity such that any particular word would be immediately available. We are, however, forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less quickly accessible." (Burks, Goldstine and von Neumann, 1946)

Memory Technologies

Static Memory (SRAM):
- Maintains store provided power is kept on
- Typically 4 to 6 transistors per bit
- Fast but expensive

Dynamic Memory (DRAM):
- Relies on storing charge on the gate of a transistor
- Charge decays over time so requires refreshing
- 1 transistor per bit
- Fairly fast, not too expensive

Magnetic Disk:
- Maintains store even when power is turned off
- Much slower than DRAM but much cheaper per MB

ROM/PROM/EPROM/EEPROM/Flash:
- Read only memory, Programmable ROM, Erasable PROM, Electrically EPROM (faster erase)

Latency and Bandwidth

Register file:
- Multi-ported small SRAM
- < 1 cycle latency, multiple reads and writes per cycle

First level cache:
- Single or multi-ported SRAM
- 1 to 3 cycles latency, 1 or 2 reads or writes per cycle

Second level cache:
- Single ported SRAM
- Around 3 to 9 cycles latency, 0.5 to 1 reads or writes per cycle

Main memory:
- DRAM
- Reads take anything from 10 to 100 cycles to get the first word and can receive adjacent words every 2 to 8 cycles
- Writes take 8 to 80 cycles and each further consecutive word takes 2 to 8 cycles

Hard disk:
- Slow to seek, around 2 million clock cycles, but returns a large block of data

Cache Design
Temporal locality: If a word is accessed once then it is likely to be accessed again soon. Spatial locality: If a word is accessed then its neighbors are likely to be accessed soon. In both cases it is advantageous to store data close to the processor in a cache.

Fully Associative Cache: If an entry from main memory is free to reside in any part of the cache, the cache is fully associative.
Direct Mapped Cache: At the other extreme, if each entry in main memory can go in just one place in the cache, the cache is direct mapped.
Set Associative Cache: If a block can be placed in a restricted set of places in the cache, the cache is said to be set associative. A set is a group of blocks in the cache. A block is first mapped onto a set, and then the block can be placed anywhere within that set. (The original notes include a diagram of a 2-way set associative cache here.)
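How an address is decomposed into a tag, a set index and a byte offset can be sketched as follows (a Python sketch; the cache geometry is an assumed example):

```python
# Illustrative address decomposition for a set-associative cache:
# the offset bits select the byte within a line, the index bits select the
# set, and the remaining tag bits are compared against every line in that set.
LINE_BYTES = 32    # assumed: 8 words of 4 bytes per line
NUM_SETS = 128     # assumed: e.g. an 8 KB 2-way cache (2 * 128 * 32 bytes)
OFFSET_BITS = LINE_BYTES.bit_length() - 1    # 5
INDEX_BITS = NUM_SETS.bit_length() - 1       # 7

def decompose(addr):
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

tag, index, offset = decompose(0x12A64)
```

A direct-mapped cache is the special case with one line per set (all index, no choice); a fully associative cache has one set (all tag, no index).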

Cache Line: A line is an adjacent series of bytes in main memory; that is, their addresses are contiguous. Typically 4 or 8 words. Utilises DRAM’s high read speed for successive locations.

Cache Replacement Policies
When a miss occurs, the cache controller must select a block to be replaced with the desired data. A replacement policy determines which block should be replaced. With direct-mapped placement the decision is simple because there is no choice: only one block frame is checked for a hit and only that block can be replaced. With fully-associative or set-associative placement, there is more than one block to choose from on a miss.
- Least Recently Used – Good, but requires usage information to be stored in the cache.
- Not Last Used – Tends to remove infrequently used cache lines; has a few pathological cases.
- Random – Actually quite simple and works well in practice.

Victim cache: a one-line buffer that stores the last line overwritten in the cache.

Reading Memory
When the CPU requests a memory location, the required block is searched for in the cache. If it is found then it is sent to the CPU and no further work has to be done. If we encounter a cache miss we must go out to main memory. Upon returning with the block from main memory we have two options:
1. Read Through – do not store the block in the cache; take it straight to the CPU.
2. No Read Through – store the block in the cache, and from there transfer it to the CPU.

Writing Memory
When the CPU writes to a memory location and the block currently exists in the cache we have a write hit. We can do one of two things:
1. Write Through – data is written to both the cache and the lower level memory, so if a cache line is replaced it doesn't need to be written back to the memory first. This method is common in multiprocessor computers so that cache coherency is possible.
2. Write Back – data is initially written to the cache only and will be written to the lower level memory when its cache line is replaced. A dirty bit is used to indicate whether the cache line has been modified and therefore requires writing back before removal.

When the CPU writes to a memory location and the block doesn't exist in the cache we have a write miss. We can do one of two things:
1. Fetch – bring the block from main memory into the cache and then perform either write hit action.
2. Write Around – the block is modified in main memory and not loaded into the cache.

Since writing to lower level memory takes time, we avoid the processor having to wait by using a write buffer to store the upcoming writes.
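The write-back policy and its dirty bit can be sketched with a single-line "cache" (a Python sketch; everything here is illustrative, including the fetch-on-write miss handling):

```python
# Illustrative write-back cache of exactly one line: repeated writes to the
# same line generate no memory traffic until the dirty line is evicted.
memory = {0: 10}
cache = {"addr": None, "data": None, "dirty": False}
writes_to_memory = 0

def evict():
    global writes_to_memory
    if cache["dirty"]:                       # dirty bit set: write back first
        memory[cache["addr"]] = cache["data"]
        writes_to_memory += 1
    cache.update(addr=None, data=None, dirty=False)

def write(addr, value):
    if cache["addr"] != addr:                # write miss: fetch the block
        evict()
        cache.update(addr=addr, data=memory.get(addr, 0))
    cache["data"] = value
    cache["dirty"] = True                    # written back only on eviction

write(0, 1); write(0, 2); write(0, 3)        # three writes, no memory traffic yet
evict()                                      # eviction writes the line back once
```

Under write-through the same three writes would have cost three memory writes; here they cost one.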

 Pipelining
Objective 5

An instruction pipeline is used in processors to increase their instruction throughput, with the result that a program's overall execution time is lowered. Pipelining doesn't speed up instruction execution time, but it does speed up program execution time by increasing the number of instructions we can work on simultaneously. The more finely we can slice the instruction's lifecycle, the more of the hardware implementing those phases is in use at any given moment. When a processor does not implement pipelining it is said to be sequential. The RISC pipeline is broken into five stages with a set of flip-flops between each stage:
1. Instruction fetch (IF)
2. Decode and register fetch (DC)
3. Execute (EX)
4. Memory access (MA)
5. Register write back (WB)

Instruction latency: the number of clock cycles it takes for an instruction to pass through the pipeline.

Sequential processors, ones that aren't pipelined, have a latency of one clock cycle. In contrast, an n-stage pipeline has a minimum latency of n cycles, or longer if the instruction stalls. Pipelining offers greater performance in general, but we can encounter problems when the programmer's model is violated: when a programmer (or compiler) writes assembly code, they make the assumption that each instruction is executed before execution of the subsequent instruction is begun. This assumption is invalidated by pipelining. When this causes a program to behave incorrectly, the situation is known as a hazard. There are two types of hazard: data and control. Techniques such as forwarding and stalling exist to help guard against hazards. When a branch instruction occurs in assembler, a pipelined processor does not know what instruction to fetch next. We could:
a) stall until the branch is resolved
b) predict the branch outcome and, if found to be wrong, flush the pipeline
c) execute the instruction after the branch regardless; this is known as having a branch delay slot

Data Hazard
Data hazards result from a conflict over the sharing of data. Although the instructions are written with sequential execution in mind, a previous instruction's output may be the current instruction's input, and this data dependency creates a problem while pipelining the instructions.

//..Example Code..//
add t4,t1,t2 // A1: t4=t1+t2
add t5,t4,t3 // A2: t5=t4+t3

The dependency on register t4 is a problem: A2 uses the old value of t4 because A1 hasn't written its updated value back.

Pipeline snapshot (no stalls):

time 0  1  2  3  4  5
IF   A1 A2 ~  ~  ~  ~
DC   ~  A1 A2 ~  ~  ~
EX   ~  ~  A1 A2 ~  ~
MA   ~  ~  ~  A1 A2 ~
WB   ~  ~  ~  ~  A1 A2

Stalling solves this problem: bubbles, a type of 'non-operation' (marked * below), are inserted into the pipeline to keep A2 in the decode stage until A1 has written its new value.

With bubbles:

time 0  1  2  3  4  5  6  7
DC   ~  A1 A2 A2 A2 ~  ~  ~
EX   ~  ~  A1 *  *  A2 ~  ~
MA   ~  ~  ~  A1 *  *  A2 ~
WB   ~  ~  ~  ~  A1 *  *  A2

In another instance we might want to use the result of the ALU directly, without having to wait for the result to be written back to the register file. For this, we need to forward the result of the ALU directly back into the ALU.

//..Example Code..//
lw  t2,0(t1) // L: t2=load(t1)
add t3,t3,t2 // A: t3=t3+t2

In this example the value of t2 is fed back into the ALU rapidly so that the value of t3 is calculated correctly and not based on an old value. Note that a single stall is still required whilst the value of t2 is loaded from memory.
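The cycle counts in the snapshots above follow a simple formula: k instructions take n + (k - 1) cycles through an n-stage pipeline, plus any stall cycles. A quick Python check (single-issue pipeline assumed):

```python
# Timing model for a single-issue pipeline: the first instruction takes
# num_stages cycles, each further instruction completes one cycle later,
# and every stall adds one cycle.
def pipeline_cycles(num_instructions, num_stages=5, stall_cycles=0):
    return num_stages + (num_instructions - 1) + stall_cycles

no_hazard = pipeline_cycles(2)                     # A1, A2 back to back
with_bubbles = pipeline_cycles(2, stall_cycles=2)  # A2 held in DC for 2 cycles
sequential = 2 * 5                                 # unpipelined: 5 cycles each
```

This matches the tables: cycles 0 to 5 (6 cycles) without stalls, cycles 0 to 7 (8 cycles) with two bubbles, against 10 cycles for a sequential processor.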

Control Hazard
Control hazards occur when the processor is told to branch - i.e., if a certain condition is true, then jump from one part of the instruction stream to another - not necessarily to the next sequential instruction. In such a case, the processor cannot tell in advance whether it should process the next instruction (it may instead have to move to a distant instruction). This can result in the processor doing unwanted actions.

//..Example Code..//
J   label    // J:  jump to label
add t3,t1,t2 // A1: t3=t1+t2
label:
add t6,t4,t5 // A2: t6=t4+t5

We have two solutions to this problem: 1. Flush A1 from the pipeline, converting it to a bubble. 2. Execute A1 anyway, thus exposing the branch delay slot.

 Pros
1. The cycle time of the processor is reduced, thus increasing instruction bandwidth in most cases.

 Cons
1. The design is complex and harder to manufacture.
2. The instruction latency in a non-pipelined processor is slightly lower than in a pipelined equivalent due to the extra flip-flops that are added to the data path.
3. The performance of a pipelined processor is much harder to predict and may vary more widely between different programs.

The higher throughput of pipelines falls short when the executed code contains many branches: the processor cannot know where to read the next instruction, and must wait for the branch instruction to finish, leaving the pipeline behind it empty. After the branch is resolved, the next instruction has to travel all the way through the pipeline before its result becomes available and the processor appears to "work" again. In the extreme case, the performance of a pipelined processor could theoretically approach that of an unpipelined one, or even be slightly worse, when all but one pipeline stage is idle and a small overhead is present between stages.

 Internal and External Communication
Objective 6

As a general rule of thumb, when a wire is longer than 1/100 of the wavelength of the signal being transmitted down it, we need to consider the transmission-line properties of the wire, e.g. any wire over 3 mm for a 1 GHz signal.

Characteristic Impedance: the ratio of the amplitudes of a single pair of voltage and current waves propagating along the line in the absence of reflections.

For a loss-less transmission line (i.e. R and G are negligible) which is terminated, the characteristic impedance is given by Z0 = √(L/C), measured in ohms.
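A quick numeric check of the formula (a Python sketch; the per-metre L and C values are assumed, chosen to give roughly the 50 ohms of common coaxial cable):

```python
from math import sqrt

# Characteristic impedance of a loss-less line: Z0 = sqrt(L/C).
L = 250e-9    # inductance per metre, henries (assumed value)
C = 100e-12   # capacitance per metre, farads (assumed value)

Z0 = sqrt(L / C)   # ohms; note the per-metre factors cancel
```

Because both L and C are per unit length, Z0 is independent of the length of the line.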

An electrical pulse injected into a transmission line has energy. If the end of the transmission line is unconnected, the pulse is reflected! To prevent reflections, the energy should be dissipated at the receiver, e.g. using a terminating resistor equal to Z0.

Communication Methods
Synchronous communication: uses no start and stop bits but instead synchronizes transmission speeds at both the receiving and sending ends of the transmission using clock signals built into each component. The data transfer rate is quicker, although more errors will occur as the clocks become skewed.

Skew: the difference in arrival time of bits transmitted at the same time.

Asynchronous communication: a start signal is sent prior to each byte, character or code word, and a stop signal is sent after each code word.

Parallel Communication
Parallel transmission involves sending several bits at the same time, with each bit transmitted over a separate wire. An 8-bit parallel channel transmits eight bits (a byte) simultaneously. Parallel communication is usually synchronous. Parallel communication won't work well for high clock frequencies over long distances due to the skew experienced when communicating synchronously. The longer the distance, the bigger the problem:
- PCI cards were limited to 66 MHz.
- DDR2 memory chips operate at 660 MHz as they are closer to the CPU.
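A back-of-envelope skew budget shows why distance limits the clock rate of a synchronous parallel bus (a Python sketch; all the numbers are assumed for illustration):

```python
# Illustrative skew budget: for bits on parallel wires to be sampled together,
# the clock period must comfortably exceed the worst-case skew between wires.
skew_per_metre = 1e-9   # seconds of skew per metre of cable (assumed)
length = 0.5            # metres of cable (assumed)
margin = 10             # require period >= margin * total skew (assumed rule)

total_skew = skew_per_metre * length
max_frequency = 1 / (margin * total_skew)   # Hz
```

Halving the cable length doubles the permissible clock rate under this model, which is the qualitative point made by the PCI versus DDR2 comparison above.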

Serial Communication
Serial communication is the process of sending data one bit at a time, sequentially, over a twisted pair or coaxial connection. Contrast this with parallel communication, where all the bits of each symbol are sent together. Serial communication is used for all long-haul communications and most computer networks, where the cost of cable and synchronization difficulties make parallel communication impractical. High data rates are possible: Ethernet (1 to 10 Gb/s), SATA (3 Gb/s), PCI Express (2.5 Gb/s per lane).

The diagram above shows the ASCII character A being sent over RS-232, a commonly used asynchronous serial line data transmission standard.

USB
The Universal Serial Bus was designed to support a large range of devices that can be chained together (see diagram). Electrically, USB is a twisted data pair with power and ground lines to supply power to devices.
- Version 1.1 could transfer at 12 Mb/s at full speed.
- Version 2 added a 480 Mb/s mode.
- Devices are identified by class, vendor etc. to allow plug and play.
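Returning to the RS-232 example: its asynchronous framing (start bit 0, eight data bits sent LSB first, stop bit 1) can be sketched as follows, assuming no parity bit:

```python
# Illustrative RS-232 framing: each byte is wrapped in a start bit (0) and a
# stop bit (1); the data bits are transmitted least significant bit first.
def frame(byte):
    data_bits = [(byte >> i) & 1 for i in range(8)]   # LSB first
    return [0] + data_bits + [1]

bits = frame(ord("A"))   # 'A' = 0x41 = 0b01000001
```

So a single character costs 10 bit times on the wire, which is why "bits per second" on an asynchronous serial line overstates the useful data rate by 25%.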

On-Chip Communication
His notes make no sense; try to read them.
