
INPUT/OUTPUT

CHAPTER # 6 Computer Organization & Architecture


External Devices

 Human readable
 Screen, printer, keyboard
 Machine readable
 Monitoring and control
 Communication
 Modem
 Network Interface Card (NIC)

Chapter # 6 Computer Organization & Architecture 2


Input/Output Problems

 Wide variety of peripherals


 Delivering different amounts of data
 At different speeds

 In different formats

 All slower than CPU and RAM


 Need I/O modules

Chapter # 6 Computer Organization & Architecture 3


Input/Output Module

 Interface to CPU and Memory


 Interface to one or more peripherals

Chapter # 6 Computer Organization & Architecture 4


Generic Model of I/O Module

Chapter # 6 Computer Organization & Architecture 5


External Device Block Diagram

Chapter # 6 Computer Organization & Architecture 6


I/O Module Function

 Control & Timing


 CPU Communication
 Device Communication
 Data Buffering
 Error Detection

Chapter # 6 Computer Organization & Architecture 7


I/O Steps

 CPU checks I/O module device status


 I/O module returns status
 If ready, CPU requests data transfer
 I/O module gets data from device
 I/O module transfers data to CPU
 Variations for output, DMA, etc.

Chapter # 6 Computer Organization & Architecture 8


I/O Module Diagram

Chapter # 6 Computer Organization & Architecture 9


I/O Module Decisions

 Hide or reveal device properties to CPU


 Support multiple or single device
 Control device functions or leave for CPU
 Also O/S decisions
 e.g. Unix treats everything it can as a file

Chapter # 6 Computer Organization & Architecture 10


Input Output Techniques

 Programmed I/O
 Interrupt driven I/O
 Direct Memory Access (DMA)

Chapter # 6 Computer Organization & Architecture 11


Three Techniques for Input of a Block of Data

Chapter # 6 Computer Organization & Architecture 12


Programmed I/O

 CPU has direct control over I/O


 Sensing status
 Read/write commands

 Transferring data

 CPU waits for I/O module to complete operation


 Wastes CPU time

Chapter # 6 Computer Organization & Architecture 13


Programmed I/O - Detail

 CPU requests I/O operation


 I/O module performs operation
 I/O module sets status bits
 CPU checks status bits periodically
 I/O module does not inform CPU directly
 I/O module does not interrupt CPU
 CPU may wait or come back later
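A minimal C sketch of this polling scheme, using simulated status and data registers (the register names and the ready bit are invented for illustration, not any real device interface):

#include <stdio.h>

/* Hypothetical device registers, simulated as plain variables so the
   sketch runs anywhere; on real hardware these would be I/O ports or
   memory-mapped locations. */
static unsigned char dev_status = 0;   /* bit 0 = data ready */
static unsigned char dev_data   = 0;

/* Pretend the device eventually produces a byte. */
static void device_tick(void) {
    static int countdown = 3;
    if (countdown-- == 0) { dev_data = 'X'; dev_status |= 1; }
}

int main(void) {
    /* CPU issues a read command, then repeatedly checks the status bits
       (programmed I/O): the CPU is busy for the whole transfer. */
    while ((dev_status & 1) == 0) {
        device_tick();                 /* stand-in for waiting on hardware */
    }
    printf("received byte: %c\n", dev_data);
    return 0;
}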

Chapter # 6 Computer Organization & Architecture 14


I/O Commands

 CPU issues address


 Identifies module (& device if >1 per module)
 CPU issues command
 Control - telling module what to do
 e.g. spin up disk
 Test - check status
 e.g. power? Error?
 Read/Write
 Module transfers data via buffer from/to device

Chapter # 6 Computer Organization & Architecture 15


Addressing I/O Devices

 Under programmed I/O, data transfer is very similar to


memory access (CPU viewpoint)
 Each device given unique identifier
 CPU commands contain identifier (address)

Chapter # 6 Computer Organization & Architecture 16


I/O Mapping

 Memory mapped I/O


 Devices and memory share an address space
 I/O looks just like memory read/write

 No special commands for I/O


 Large selection of memory access commands available
 Isolated I/O
 Separate address spaces
 Need I/O or memory select lines

 Special commands for I/O


 Limited set
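A rough sketch of the memory-mapped case in C: the device register is reached with ordinary loads and stores through a volatile pointer, so no special I/O instructions are needed. The register is simulated by a local variable here so the example runs; on real hardware the address would come from the device documentation.

#include <stdint.h>
#include <stdio.h>

/* With memory-mapped I/O a device register is read and written with the
   ordinary load/store instructions, so plain C pointer accesses work. */
static void write_ctrl(volatile uint32_t *ctrl_reg, uint32_t value) {
    *ctrl_reg = value;                /* compiles to a normal store */
}

static uint32_t read_status(volatile uint32_t *status_reg) {
    return *status_reg;               /* compiles to a normal load */
}

int main(void) {
    uint32_t fake_reg = 0;            /* simulated register so the demo runs */
    write_ctrl(&fake_reg, 0x1u);
    printf("status = %u\n", (unsigned)read_status(&fake_reg));
    /* Isolated I/O, by contrast, needs special instructions (e.g. x86 IN/OUT),
       which cannot be expressed in portable C. */
    return 0;
}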

Chapter # 6 Computer Organization & Architecture 17


Interrupt Driven I/O

 Overcomes CPU waiting


 No repeated CPU checking of device
 I/O module interrupts when ready

Chapter # 6 Computer Organization & Architecture 18


Interrupt Driven I/O Basic Operation

 CPU issues read command


 I/O module gets data from peripheral whilst CPU
does other work
 I/O module interrupts CPU
 CPU requests data
 I/O module transfers data
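A simplified C sketch of the same sequence, with an ordinary function standing in for the interrupt service routine and a counter standing in for the useful work the CPU performs while the I/O module is busy (all names are illustrative):

#include <stdio.h>

/* Simulated state for one interrupt-driven read.  On real hardware the
   handler would be registered with the interrupt controller; here it is
   simply called when the simulated device finishes. */
static volatile int data_ready = 0;
static volatile unsigned char dev_data = 0;

static void io_interrupt_handler(void) {    /* stands in for the ISR */
    data_ready = 1;                          /* CPU can now fetch the data */
}

int main(void) {
    printf("CPU: issue read command, then keep working\n");

    int useful_work = 0;
    while (!data_ready) {
        useful_work++;                       /* CPU does other work, no device polling */
        if (useful_work == 5) {              /* device finishes: module interrupts CPU */
            dev_data = 'Y';
            io_interrupt_handler();
        }
    }
    printf("CPU: handled interrupt after %d units of work, data=%c\n",
           useful_work, dev_data);
    return 0;
}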

Chapter # 6 Computer Organization & Architecture 19


Simple Interrupt Processing

Chapter # 6 Computer Organization & Architecture 20


CPU Viewpoint

 Issue read command


 Do other work
 Check for interrupt at end of each instruction cycle
 If interrupted
 Save context (registers)
 Process interrupt
 Fetch data & store

Chapter # 6 Computer Organization & Architecture 21


Changes in Memory and Registers for an Interrupt

Chapter # 6 Computer Organization & Architecture 22


Design Issues

 How do you identify the module issuing the


interrupt?
 How do you deal with multiple interrupts?
 i.e. an interrupt handler being interrupted

Chapter # 6 Computer Organization & Architecture 23


Identifying Interrupting Module

 Multiple Interrupt Lines


 Different line for each module
 Limits number of devices
 Software poll
 CPU asks each module in turn
 Slow
 Daisy Chain or Hardware poll
 Interrupt Acknowledge sent down a chain
 Module responsible places vector on bus
 CPU uses vector to identify handler routine
 Bus Master
 Module must claim the bus before it can raise interrupt
 e.g. PCI & SCSI
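A sketch of the software-poll option in C, assuming each module exposes a status word with a hypothetical interrupt-pending bit:

#include <stdio.h>

#define NUM_MODULES 4
#define IRQ_PENDING 0x1               /* hypothetical "I raised the interrupt" bit */

/* Simulated per-module status registers; module 2 is the interrupter here. */
static unsigned int module_status[NUM_MODULES] = { 0, 0, IRQ_PENDING, 0 };

/* Software poll: ask each module in turn whether it raised the interrupt.
   Simple, but slower than vectored (daisy-chain) identification. */
static int find_interrupting_module(void) {
    for (int m = 0; m < NUM_MODULES; m++)
        if (module_status[m] & IRQ_PENDING)
            return m;
    return -1;                        /* spurious interrupt */
}

int main(void) {
    printf("interrupting module: %d\n", find_interrupting_module());
    return 0;
}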
Chapter # 6 Computer Organization & Architecture 24
Multiple Interrupts

 Each interrupt line has a priority


 Higher priority lines can interrupt lower priority lines
 If bus mastering only current master can interrupt

Chapter # 6 Computer Organization & Architecture 25


Direct Memory Access

 Interrupt driven and programmed I/O require active


CPU intervention
 Transfer rate is limited
 CPU is tied up

 DMA is the answer

Chapter # 6 Computer Organization & Architecture 26


DMA Function

 Additional Module (hardware) on bus


 DMA controller takes over from CPU for I/O

Chapter # 6 Computer Organization & Architecture 27


Typical DMA Module Diagram

Chapter # 6 Computer Organization & Architecture 28


DMA Operation

 CPU tells DMA controller


 Read/Write
 Device address

 Starting address of memory block for data

 Amount of data to be transferred

 CPU carries on with other work


 DMA controller deals with transfer
 DMA controller sends interrupt when finished
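A hedged sketch of the hand-off in C: the CPU fills in a descriptor with the four items listed above and starts the controller; the transfer itself is simulated with memcpy so the example runs (the descriptor layout is invented, not any real controller's register map):

#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* Hypothetical DMA descriptor: the four things the CPU tells the controller. */
struct dma_request {
    int      write;        /* 0 = read from device, 1 = write to device   */
    uint32_t device_addr;  /* which device/port                           */
    void    *mem_addr;     /* starting address of the memory block        */
    size_t   count;        /* amount of data to transfer                  */
};

/* Simulated "device" buffer and a stand-in for the controller doing the
   transfer while the CPU gets on with other work. */
static const char device_buffer[] = "block read by DMA";

static void dma_start(const struct dma_request *req) {
    if (!req->write)
        memcpy(req->mem_addr, device_buffer, req->count);  /* transfer without CPU help */
    /* ...the controller would raise an interrupt here when the count reaches zero. */
}

int main(void) {
    char ram[32] = {0};
    struct dma_request req = { 0, 0x01u, ram, sizeof device_buffer };
    dma_start(&req);                  /* the CPU would now continue with other work */
    printf("memory after DMA: %s\n", ram);
    return 0;
}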

Chapter # 6 Computer Organization & Architecture 29


DMA Transfer Cycle Stealing

 DMA controller takes over bus for a cycle


 Transfer of one word of data
 Not an interrupt
 CPU does not switch context
 CPU suspended just before it accesses bus
 i.e. before an operand or data fetch or a data write
 Slows down CPU but not as much as CPU doing
transfer

Chapter # 6 Computer Organization & Architecture 30


DMA Configurations

 Single Bus, Detached DMA controller


 Each transfer uses bus twice
 I/O to DMA then DMA to memory
 CPU is suspended twice

Chapter # 6 Computer Organization & Architecture 31


DMA Configurations

 Single Bus, Integrated DMA controller


 Controller may support >1 device
 Each transfer uses bus once
 DMA to memory
 CPU is suspended once

Chapter # 6 Computer Organization & Architecture 32


DMA Configurations

 Separate I/O Bus


 Bus supports all DMA enabled devices
 Each transfer uses bus once
 DMA to memory
 CPU is suspended once

Chapter # 6 Computer Organization & Architecture 33


Intel 8237A DMA Controller
 Interfaces to 80x86 family and DRAM
 When DMA module needs buses it sends HOLD signal to processor
 CPU responds HLDA (hold acknowledge)
 DMA module can use buses
 E.g. transfer data from memory to disk
1. Device requests service of DMA by pulling DREQ (DMA request) high
2. DMA puts high on HRQ (hold request),
3. CPU finishes present bus cycle (not necessarily present instruction) and puts
high on HLDA (hold acknowledge). HOLD remains active for duration of DMA
4. DMA activates DACK (DMA acknowledge), telling device to start transfer
5. DMA starts transfer by putting address of first byte on address bus and
activating MEMR; it then activates IOW to write to peripheral. DMA
decrements counter and increments address pointer. Repeat until count
reaches zero
6. DMA deactivates HRQ, giving bus back to CPU

Chapter # 6 Computer Organization & Architecture 34


8237 DMA Usage of Systems Bus

Chapter # 6 Computer Organization & Architecture 35


Fly-By

 While DMA using buses, processor is idle


 Processor using bus, DMA is idle
 Known as fly-by DMA controller
 Data does not pass through and is not stored in DMA
chip
 8237 contains four DMA channels
 Programmed independently
 Any one active
 Numbered 0, 1, 2, and 3

Chapter # 6 Computer Organization & Architecture 36


I/O Channels

 I/O devices getting more sophisticated


 e.g. 3D graphics cards
 CPU instructs I/O controller to do transfer
 I/O controller does entire transfer
 Improves speed
 Takes load off CPU
 Dedicated processor is faster

Chapter # 6 Computer Organization & Architecture 37


I/O Channel Architecture

Chapter # 6 Computer Organization & Architecture 38


The Evolution of the I/O Function
 Programmed I/O
 The CPU directly controls a peripheral device
 Simple microprocessor controlled devices
 I/O Module
 A controller or I/O module is added
 The CPU uses programmed I/O without interrupts
 CPU somewhat divorced from the details of external device interfaces
 Interrupt Driven
 Now interrupts are employed
 The CPU need not spend time waiting for an I/O operation to be
performed
 thus increasing efficiency
 DMA
 The I/O module is given direct access to memory via DMA
 It can now move a block of data to or from memory without involving
the CPU, except at the beginning and end of the transfer
Chapter # 6 Computer Organization & Architecture 39
The Evolution of the I/O Function

 I/O Channel
 The I/O module is enhanced to become a processor in its own right,
with a specialized instruction set tailored for I/O
 The CPU directs the I/O processor to execute an I/O program in
memory
 The I/O processor fetches and executes these instructions without CPU
intervention
 I/O Processor
 The I/O module has a local memory of its own
 A large set of I/O devices can be controlled, with minimal CPU
involvement

Chapter # 6 Computer Organization & Architecture 40


Parallel and Serial I/O

 Parallel interface
 multiple lines connecting the
peripheral
 multiple bits are transferred
simultaneously
 Serial interface
 Only one line used to transmit
data
 bits must be transmitted one at a
time

Chapter # 6 Computer Organization & Architecture 41


Point-to-Point vs Multipoint interface

 Connection between an I/O module in a computer


system and external devices can be either:
 point-to-point
 multipoint
 Point-to-point interface
 provides a dedicated line between the I/O module and the
external device
 On small systems (PCs, workstations) typical point-to-point links
include those to the keyboard, printer, and external modem
 Multipoint external interfaces
 used to support external mass storage devices (disk and tape
drives) and multimedia devices (CD-ROMs, video, audio)

Chapter # 6 Computer Organization & Architecture 42


Thunderbolt
 Most recent and fastest peripheral connection technology to
become available for general-purpose use
 Developed by Intel with collaboration from Apple
 The technology combines data, video, audio, and power into a
single high-speed connection for peripherals such as hard drives,
RAID arrays, video-capture boxes, and network interfaces
 Provides up to 10 Gbps throughput in each direction and up to 10
Watts of power to connected peripherals
 A Thunderbolt-compatible peripheral interface is considerably
more complex than a simple USB device
 First generation products are primarily aimed at the professional-
consumer market such as audiovisual editors who want to be able
to move large volumes of data quickly between storage devices and
laptops
 Thunderbolt is a standard feature of Apple’s MacBook Pro laptop
and iMac desktop computers

Chapter # 6 Computer Organization & Architecture 43


Thunderbolt in a Computer Configuration

Chapter # 6 Computer Organization & Architecture 44


Thunderbolt Protocol Layers

Chapter # 6 Computer Organization & Architecture 45


InfiniBand

 Recent I/O specification aimed at the high-end server


market
 First version was released in early 2001
 Standard describes an architecture and specifications for
data flow among processors and intelligent I/O devices
 Has become a popular interface for storage area
networking and other large storage configurations
 Enables servers, remote storage, and other network
devices to be attached in a central fabric of switches and
links
 The switch-based architecture can connect up to 64,000
servers, storage systems, and networking devices

Chapter # 6 Computer Organization & Architecture 46


InfiniBand Switch Fabric

Chapter # 6 Computer Organization & Architecture 47


InfiniBand Links and Data Throughput Rates

Chapter # 6 Computer Organization & Architecture 48


zEnterprise 196
 Introduced in 2010
 IBM’s latest mainframe computer offering
 System is based on the use of the z196 chip
 5.2 GHz multi-core chip with four cores
 Can have a maximum of 24 processor chips (96 cores)
 Has a dedicated I/O subsystem that manages all I/O operations
 Of the 96 cores, up to 4 can be dedicated to I/O
use, creating 4 channel subsystems (CSS)
 Each CSS is made up of the following elements:
 System assist processor (SAP)
 Hardware system area (HSA)
 Logical partitions
 Subchannels
 Channel path
 Channel

Chapter # 6 Computer Organization & Architecture 49


INSTRUCTION SET
CHARACTERISTICS AND
FUNCTIONS
CHAPTER # 7 Computer Organization & Architecture
What is an Instruction Set?

 The complete collection of different instructions that


the processor can execute or understand is referred
to as the processor’s instruction set
 Machine Code
 Binary
 Usually represented by assembly codes

Chapter # 7 Computer Organization & Architecture 2


Elements of an Instruction

 Operation code (Op code)


 Specifies the operation to be performed
 Source Operand reference
 To this
 The operation may involve one or more source operands

 Result Operand reference


 Put the answer here
 Next Instruction Reference
 This tells the processor where to fetch the next instruction
after the execution of this instruction is complete

Chapter # 7 Computer Organization & Architecture 3


Instruction Cycle State Diagram

Chapter # 7 Computer Organization & Architecture 4


Where the Source and Result Operands can be

 Main or virtual memory


 As with next instruction references, the main or virtual
memory address must be supplied
 I/O device
 The instruction must specify the I/O module and device for
the operation
 Processor register
 A processor contains one or more registers that may be
referenced by machine instructions
 Immediate
 The value of the operand is contained in a field in the
instruction being executed

Chapter # 7 Computer Organization & Architecture 5


Instruction Representation

 In machine code each instruction has a unique bit


pattern
 For human consumption (well, programmers
anyway) a symbolic representation is used
 e.g. ADD, SUB, LOAD
 Operands can also be represented in this way
 ADD A,B

Chapter # 7 Computer Organization & Architecture 6


Simple Instruction Format

Chapter # 7 Computer Organization & Architecture 7


Instruction Types

 Data processing
 Data storage (main memory)
 Data movement (I/O)
 Program flow control

Chapter # 7 Computer Organization & Architecture 8


Number of Addresses

 3 addresses
 Operand 1, Operand 2, Result
 a = b + c;
 ADD a, b, c
 May be a fourth - next instruction address (usually implicit)
 Not common

 Needs very long words to hold everything

Chapter # 7 Computer Organization & Architecture 9


Number of Addresses

 2 addresses
 One address doubles as operand and result
 a=a+b
 ADD a, b
 Reduces length of instruction
 Requires some extra work
 Temporary storage to hold some results

Chapter # 7 Computer Organization & Architecture 10


Number of Addresses

 1 address
 Implicit second address
 ADD a
 Usually a register (accumulator)
 Common on early machines

Chapter # 7 Computer Organization & Architecture 11


Number of Addresses

 0 (zero) addresses
 All addresses implicit
 Uses a stack

 e.g. c = a + b
 push a
 push b
 Add
 pop c
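A tiny C model of the zero-address case: the same c = a + b sequence executed on an explicit operand stack, where the add instruction names no operands at all:

#include <stdio.h>

/* A small operand stack to model zero-address (stack) instructions. */
static int stack[16];
static int sp = 0;                      /* next free slot */

static void push(int v) { stack[sp++] = v; }
static int  pop(void)   { return stack[--sp]; }
static void add(void)   { int b = pop(), a = pop(); push(a + b); }  /* operands implicit */

int main(void) {
    int a = 2, b = 3, c;
    push(a);                            /* push a */
    push(b);                            /* push b */
    add();                              /* add    (no addresses in the instruction) */
    c = pop();                          /* pop c  */
    printf("c = %d\n", c);
    return 0;
}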

Chapter # 7 Computer Organization & Architecture 12


Number of Addresses

Chapter # 7 Computer Organization & Architecture 13


How Many Addresses

 More addresses
 More complex (powerful?) instructions
 More registers
 Inter-register operations are quicker
 Fewer instructions per program
 Fewer addresses
 Less complex (powerful?) instructions
 More instructions per program

 Faster fetch/execution of instructions

Chapter # 7 Computer Organization & Architecture 14


Instruction Set Design Decisions

 Operation range
 How many ops?
 What can they do?
 How complex are they?

 Data types
 The various types of data upon which operations are
performed
 Instruction formats
 Instruction length in bits
 Length of op code field
 Number of addresses
 Size of various fields, etc.

Chapter # 7 Computer Organization & Architecture 15


Instruction Set Design Decisions

 Registers
 Number of CPU registers available
 Which operations can be performed on which registers?

 Addressing modes
 RISC v CISC

Chapter # 7 Computer Organization & Architecture 16


Types of Operand

 Addresses
 Numbers
 Integer/floating point
 Characters
 ASCII etc.
 Logical Data
 Bits or flags

Chapter # 7 Computer Organization & Architecture 17


x86 Data Types

 8 bit Byte
 16 bit word
 32 bit double word
 64 bit quad word
 128 bit double quadword
 Addressing is by 8 bit unit
 Words do not need to align at even-numbered
addresses
 Data accessed across 32 bit bus in units of double
word read at addresses divisible by 4
 Little endian
Chapter # 7 Computer Organization & Architecture 18
x86 Numeric Data Formats

Chapter # 7 Computer Organization & Architecture 19


ARM Data Types
 8 (byte), 16 (halfword), 32 (word) bits
 Halfword and word accesses should be word aligned
 Unsigned integer interpretation supported for all
types
 Twos-complement signed integer interpretation
supported for all types
 Majority of implementations do not provide floating-
point hardware
 Saves power and area
 Floating-point arithmetic implemented in software
 Optional floating-point coprocessor
 Single- and double-precision IEEE 754 floating point data
types

Chapter # 7 Computer Organization & Architecture 20


Types of Operation

 Data Transfer
 Arithmetic
 Logical
 Conversion
 I/O
 System Control
 Transfer of Control

Chapter # 7 Computer Organization & Architecture 21


Types of Operation

Chapter # 7 Computer Organization & Architecture 22


Types of Operation

Chapter # 7 Computer Organization & Architecture 23


Data Transfer

 Specify
 Source
 Destination

 Amount of data

 May be different instructions for different


movements
 e.g. IBM 370
 Or one instruction and different addresses
 e.g. VAX

Chapter # 7 Computer Organization & Architecture 24


Arithmetic

 Add, Subtract, Multiply, Divide


 Signed Integer
 Floating point
 May include
 Increment (a++)
 Decrement (a--)

 Negate (-a)

Chapter # 7 Computer Organization & Architecture 25


Logical

 Bitwise operations
 AND
 OR

 NOT

 Test
 Compare

Chapter # 7 Computer Organization & Architecture 26


Conversion

 E.g. Binary to Decimal

Chapter # 7 Computer Organization & Architecture 27


Input/Output

 May be specific instructions


 May be done using data movement instructions
(memory mapped)
 May be done by a separate controller (DMA)
 Examples
 Input
 Output

 Start I/O

Chapter # 7 Computer Organization & Architecture 28


Systems Control

 Privileged instructions
 CPU needs to be in specific state
 Ring 0 on 80386+
 Kernel mode

 For operating systems use

Chapter # 7 Computer Organization & Architecture 29


Transfer of Control

 Branch
 e.g. branch to x if result is zero
 Skip
 e.g. increment and skip if zero
 ISZ Register1

 Branch xxxx

 ADD A

 Subroutine call
 c.f. interrupt call

Chapter # 7 Computer Organization & Architecture 30


Branch Instruction

Chapter # 7 Computer Organization & Architecture 31


x86 Operation Types

Chapter # 7 Computer Organization & Architecture 32


x86 Operation Types

Chapter # 7 Computer Organization & Architecture 33


x86 Single-Instruction, Multiple-Data (SIMD)
Instructions
 In 1996 Intel introduced MMX technology into its Pentium
product line
 MMX is a set of highly optimized instructions for multimedia
tasks
 Video and audio data are typically composed of large
arrays of small data types
 Three new data types are defined in MMX
 Packed byte
 Packed word
 Packed doubleword
 Each data type is 64 bits in length and consists of
multiple smaller data fields, each of which holds a fixed-
point integer
Chapter # 7 Computer Organization & Architecture 34
MMX Instruction Set

Chapter # 7 Computer Organization & Architecture 35


ARM Operation Types

 Load and store instructions


 Branch instructions
 Data-processing instructions
 Multiply instructions
 Parallel addition and subtraction instructions
 Extend instructions
 Status register access instructions

Chapter # 7 Computer Organization & Architecture 36


Byte Order (A portion of chips?)

 What order do we read numbers that occupy more


than one byte
 e.g. (numbers in hex to make it easy to read)
 12345678 can be stored in 4x8-bit locations as follows
 Address    Value (1)    Value (2)
 184        12           78
 185        34           56
 186        56           34
 187        78           12

Chapter # 7 Computer Organization & Architecture 37


Byte Order Names

 The problem is called Endian


 Big endian
 In big endian, you store the most significant byte in the
smallest address.
 Little-endian
 In little endian, you store the least significant byte in the
smallest address

Chapter # 7 Computer Organization & Architecture 38


Example of C Data Structure

Chapter # 7 Computer Organization & Architecture 39


Alternative View of Memory Map

Chapter # 7 Computer Organization & Architecture 40


Byte Order Standard

 Pentium (x86), VAX are little-endian


 IBM 370, Motorola 680x0 (Mac), and most RISC are
big-endian
 Internet is big-endian
 Makes writing Internet programs on PC more awkward!
 WinSock provides htonl/htons and ntohl/ntohs (host-to-network
and network-to-host) functions to convert
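A short C example of the conversion, using the standard htonl/ntohl functions (shown with the POSIX header; on Windows the same functions are declared in winsock2.h):

#include <arpa/inet.h>   /* htonl/ntohl; on Windows use <winsock2.h> instead */
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t host_value = 0x12345678u;

    /* Convert to big-endian network order before sending, and back after
       receiving.  On a big-endian host both calls are no-ops. */
    uint32_t wire_value = htonl(host_value);
    uint32_t back_again = ntohl(wire_value);

    printf("host 0x%08X -> network 0x%08X -> host 0x%08X\n",
           (unsigned)host_value, (unsigned)wire_value, (unsigned)back_again);
    return 0;
}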

Chapter # 7 Computer Organization & Architecture 41


INSTRUCTION ADDRESSING
MODES & FORMATS
CHAPTER # 8 Computer Organization & Architecture
Addressing Modes

 Immediate
 Direct
 Indirect
 Register
 Register Indirect
 Displacement (Indexed)
 Stack

Chapter # 8 Computer Organization & Architecture 2


Immediate Addressing

 Operand is part of instruction


 Operand = address field
 e.g. ADD 5
 Add 5 to contents of accumulator
 5 is operand

 No memory reference to fetch data


 Fast
 Limited range

Chapter # 8 Computer Organization & Architecture 3


Immediate Addressing Diagram

Instruction

Opcode Operand

Chapter # 8 Computer Organization & Architecture 4


Direct Addressing

 Address field contains address of operand


 Effective address (EA) = address field (A)
 e.g. ADD A
 Add contents of cell A to accumulator
 Look in memory at address A for operand

 Advantages:
 Single memory reference to access data
 No additional calculations to work out effective address

 Disadvantages:
 Limited address space

Chapter # 8 Computer Organization & Architecture 5


Direct Addressing Diagram

Instruction
Opcode Address A

Memory

Operand

Chapter # 8 Computer Organization & Architecture 6


Indirect Addressing

 Memory cell pointed to by address field contains the


address of (pointer to) the operand
 EA = (A)
 Look in A, find address (A) and look there for operand
 e.g. ADD (A)
 Add contents of cell pointed to by contents of A to
accumulator

Chapter # 8 Computer Organization & Architecture 7


Indirect Addressing

 Advantage:
 Large address space available
 2n where n = word length

 Disadvantage:
 Instruction execution requires two memory references to
fetch the operand
 One to get its address and a second to get its value

 Hence slower

 May be nested, multilevel, cascaded


 e.g. EA = (((A)))

Chapter # 8 Computer Organization & Architecture 8


Indirect Addressing Diagram
Instruction
Opcode Address A

Memory

Pointer to operand

Operand

Chapter # 8 Computer Organization & Architecture 9


Register Addressing

 Operand is held in register named in address field


 EA = R
 Advantages:
 Very fast execution
 Only a small address field is needed in the instruction
 Shorter instructions
 Faster instruction fetch
 No time-consuming memory references are required
 Disadvantage:
 The address space is very limited
 Limited number of registers

Chapter # 8 Computer Organization & Architecture 10


Register Addressing

 Multiple registers helps performance


 Requires good assembly programming or compiler writing
 C programming
 register int a;
 Also called Register Direct addressing

Chapter # 8 Computer Organization & Architecture 11


Register Addressing Diagram

Instruction
Opcode Register Address R

Registers

Operand

Chapter # 8 Computer Organization & Architecture 12


Register Indirect Addressing

 EA = (R)
 Operand is in memory cell pointed to by contents of
register R
 Large address space (2n)
 One fewer memory access than indirect addressing

Chapter # 8 Computer Organization & Architecture 13


Register Indirect Addressing Diagram

Instruction

Opcode Register Address R

Memory

Registers

Pointer to Operand Operand

Chapter # 8 Computer Organization & Architecture 14


Displacement Addressing
 Combines the capabilities of direct addressing and
register indirect addressing
 EA = A + (R)
 Requires that the instruction have two address fields, at
least one of which is explicit
 A = base value
 The value contained in one address field is used directly
 R = register that holds displacement
 The other address field’s contents are added to A to produce the
effective address
 or vice versa
 Most common uses:
 Relative addressing
 Base-register addressing
 Indexing
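A small C simulation of EA = A + (R), with an invented register file and memory array standing in for the real hardware:

#include <stdio.h>

#define MEM_WORDS 1024

static int memory[MEM_WORDS];            /* simulated main memory   */
static int regs[8];                      /* simulated register file */

/* Displacement addressing: effective address = A + (R), then the operand
   is fetched from memory at that address. */
static int fetch_displacement(int a, int r) {
    int ea = a + regs[r];
    return memory[ea];
}

int main(void) {
    regs[3] = 16;                        /* base or index value in register 3 */
    memory[100 + 16] = 42;               /* operand stored at address 116     */
    printf("operand = %d\n", fetch_displacement(100, 3));   /* EA = 100 + (R3) */
    return 0;
}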
Chapter # 8 Computer Organization & Architecture 15
Displacement Addressing Diagram

Instruction
Opcode Register R Address A

Memory

Registers

Pointer to Operand + Operand

Chapter # 8 Computer Organization & Architecture 16


Relative Addressing
 A version of displacement addressing
 R = Program counter, PC
 EA = A + (PC)
 i.e. get operand from A cells from current location pointed to by
PC
 The implicitly referenced register is the program counter
(PC)
 The next instruction address is added to the address field to
produce the EA
 Typically the address field is treated as a twos complement
number for this operation
 Thus the effective address is a displacement relative to the
address of the instruction
 Exploits the concept of locality

Chapter # 8 Computer Organization & Architecture 17


Base-Register Addressing

 A holds displacement
 R holds pointer to base address
 R may be explicit or implicit
 e.g. segment registers in 80x86
 Exploits the locality of memory references
 Convenient means of implementing segmentation
 In some implementations a single segment base register is
employed and is used implicitly
 In others the programmer may choose a register to hold
the base address of a segment and the instruction must
reference it explicitly

Chapter # 8 Computer Organization & Architecture 18


Indexed Addressing
 The method of calculating the EA is the same as for base-
register addressing
 An important use is to provide an efficient mechanism for
performing iterative operations
 Good for accessing arrays
 Indexing: EA = A + (R)
 A = base: the address field references a main memory address
 R = displacement: the referenced register contains a positive
displacement from that address
 Autoindexing
 Automatically increment or decrement the index register after each
reference to it
 EA = A + (R), followed by (R) ← (R) + 1

Chapter # 8 Computer Organization & Architecture 19


Indexed Addressing

 Postindexing
 Indexing is performed after the indirection
 EA = (A) + (R)

 Preindexing
 Indexing is performed before the indirection
 EA = (A + (R))

Chapter # 8 Computer Organization & Architecture 20


Stack Addressing

 A stack is a linear array of locations


 Sometimes referred to as a pushdown list or last-in-first-out
queue
 Items are appended to the top of the stack so that the block is
partially filled
 Associated with the stack is a pointer whose value is the
address of the top of the stack
 The stack pointer is maintained in a register
 Thus references to stack locations in memory are in fact register
indirect addresses
 Operand is (implicitly) on top of stack
 e.g.
 ADD Pop top two items from stack and add

Chapter # 8 Computer Organization & Architecture 21


Addressing Modes Summary

Chapter # 8 Computer Organization & Architecture 22


Addressing Modes Summary

Chapter # 8 Computer Organization & Architecture 23


x86 Addressing Modes

 Virtual or effective address is offset into segment


 Starting address plus offset gives linear address
 This goes through page translation if paging enabled
 12 addressing modes available
 Immediate
 Register operand
 Displacement
 Base
 Base with displacement
 Scaled index with displacement
 Base with index and displacement
 Base scaled index with displacement
 Relative

Chapter # 8 Computer Organization & Architecture 24


x86 Addressing Mode Calculation

Chapter # 8 Computer Organization & Architecture 25


x86 Addressing Mode Calculation

Chapter # 8 Computer Organization & Architecture 26


ARM Addressing Modes Load/Store
 Data processing instructions
 Use either register addressing or a mixture of register and
immediate addressing
 For register addressing the value in one of the register
operands may be scaled using one of the five shift
operators
 Branch instructions
 The only form of addressing for branch instructions is
immediate
 Instruction contains 24 bit value
 Shifted 2 bits left so that the address is on a word boundary
 Effective range +/-32MB from the program counter
 Indirectly through base register plus offset
 Offset added to or subtracted from base register contents
to form the memory address

Chapter # 8 Computer Organization & Architecture 27


ARM Addressing Modes Load/Store
 Preindex
 Memory address is formed as for offset addressing
 Memory address also written back to base register
 So base register value incremented or decremented by offset value
 Postindex
 Memory address is base register value
 Offset added or subtracted
 Result written back to base register
 Base register acts as index register for preindex and postindex
addressing
 Offset either immediate value in instruction or another
register
 If a register, scaled register addressing is available
 Offset register value scaled by shift operator
 Instruction specifies shift size

Chapter # 8 Computer Organization & Architecture 28


ARM Indexing Methods

Chapter # 8 Computer Organization & Architecture 29


Instruction Formats

 Layout of bits in an instruction


 Includes opcode
 Includes (implicit or explicit) operand(s)
 Usually more than one instruction format in an
instruction set

Chapter # 8 Computer Organization & Architecture 30


Instruction Length

 Affected by and affects


 Memory size
 Memory organization

 Bus structure

 CPU complexity

 CPU speed

 Trade off between powerful instruction repertoire


and saving space

Chapter # 8 Computer Organization & Architecture 31


Address Mapping

 Address modification
 Base, displacement
 Base, index, displacement

Chapter # 8 Computer Organization & Architecture 32


Allocation of Bits

 Number of addressing modes


 Number of operands
 Register versus memory
 Number of register sets
 Address range
 Address granularity

Chapter # 8 Computer Organization & Architecture 33


Improvements

 Use hexadecimal rather than binary


 Code as series of lines
 Hex address and memory contents
 Need to translate automatically using program
 Add symbolic names or mnemonics for instructions
 Three fields per line
 Location address
 Three letter opcode

 If memory reference: address

 Need more complex translation program

Chapter # 8 Computer Organization & Architecture 34


PDP-8 Instruction Format

Chapter # 8 Computer Organization & Architecture 35


PDP-10 Instruction Format

Chapter # 8 Computer Organization & Architecture 36


Variable-Length Instructions

 Variations can be provided efficiently and compactly


 Increases the complexity of the processor
 Does not remove the desirability of making all of the
instruction lengths integrally related to word length
 Because the processor does not know the length of the
next instruction to be fetched a typical strategy is to fetch a
number of bytes or words equal to at least the longest
possible instruction
 Sometimes multiple instructions are fetched

Chapter # 8 Computer Organization & Architecture 37


PDP-11 Instruction Format

Chapter # 8 Computer Organization & Architecture 38


VAX Instruction Examples

Chapter # 8 Computer Organization & Architecture 39


x86 Instruction Format

Chapter # 8 Computer Organization & Architecture 40


x86 Instruction Format

 Instruction prefixes
 Instruction prefix
 The instruction prefix, if present, consists of the LOCK prefix or one
of the repeat prefixes
 The LOCK prefix is used to ensure exclusive use of shared memory in
multiprocessor environments
 The repeat prefixes specify repeated operation of a string, which
enables the x86 to process strings much faster than with a regular
software loop
 There are five different repeat prefixes: REP, REPE, REPZ, REPNE, and
REPNZ

Chapter # 8 Computer Organization & Architecture 41


x86 Instruction Format

 Segment override
 Explicitly specifies which segment register an instruction should use,
overriding the default segment-register selection generated by the
x86 for that instruction
 Operand size
 An instruction has a default operand size of 16 or 32 bits, and the
operand prefix switches between 32-bit and 16-bit operands
 Address size
 The processor can address memory using either 16- or 32-bit
addresses
 The address size determines the displacement size in instructions
and the size of address offsets generated during effective address
calculation

Chapter # 8 Computer Organization & Architecture 42


x86 Instruction Format

 Opcode
 The opcode field is 1, 2, or 3 bytes in length
 The opcode may also include bits that specify if data is
byte- or full-size (16 or 32 bits depending on context),
direction of data operation (to or from memory), and
whether an immediate data field must be sign extended
 ModR/M
 This byte provides addressing information
 The ModR/M byte specifies whether an operand is in a
register or in memory
 If it is in memory, then fields within the byte specify the
addressing mode to be used
Chapter # 8 Computer Organization & Architecture 43
x86 Instruction Format

 SIB
 Certain encodings of the ModR/M byte specify the inclusion of
the SIB byte to fully specify the addressing mode
 The SIB byte consists of three fields
 Scale field (2 bits) specifies the scale factor for scaled indexing
 Index field (3 bits) specifies the index register
 Base field (3 bits) specifies the base register
 Displacement
 When the addressing-mode specifier indicates that a
displacement is used, an 8-, 16-, or 32-bit signed integer
displacement field is added
 Immediate
 Provides the value of an 8-, 16-, or 32-bit operand
Chapter # 8 Computer Organization & Architecture 44
ARM Instruction Formats

Chapter # 8 Computer Organization & Architecture 45


Thumb Instruction Set

Chapter # 8 Computer Organization & Architecture 46


PROCESSOR STRUCTURE &
FUNCTION
CHAPTER # 9 Computer Organization & Architecture
CPU Function

 CPU must
 Fetch instruction
 The processor reads an instruction from memory (register, cache, main
memory)
 Interpret instruction
 The instruction is decoded to determine what action is required
 Fetch data
 The execution of an instruction may require reading data from memory
or an I/O module
 Process data
 The execution of an instruction may require performing some
arithmetic or logical operation on data
 Write data
 The results of an execution may require writing data to memory or an
I/O module

Chapter # 9 Computer Organization & Architecture 2


CPU With Systems Bus

Chapter # 9 Computer Organization & Architecture 3


CPU Internal Structure

Chapter # 9 Computer Organization & Architecture 4


Registers

 CPU must have some working space (temporary


storage)
 Called registers
 Number and function vary between processor
designs
 One of the major design decisions
 Top level of memory hierarchy

Chapter # 9 Computer Organization & Architecture 5


Types of Registers

 User-visible registers
 Enable the machine or assembly language programmer to
minimize main memory references by optimizing use of
registers
 Control and status registers
 Used by the control unit to control the operation of the
processor and by privileged, operating system programs to
control the execution of programs

Chapter # 9 Computer Organization & Architecture 6


User Visible Registers

 General Purpose
 Data
 Address
 Condition Codes

Chapter # 9 Computer Organization & Architecture 7


General Purpose Registers

 May be true general purpose


 May be restricted
 May be used for data or addressing
 Data
 Accumulator
 Addressing
 Segment

Chapter # 9 Computer Organization & Architecture 8


General Purpose Registers

 Make them general purpose


 Increase flexibility and programmer options
 Increase instruction size & complexity

 Make them specialized


 Smaller (faster) instructions
 Less flexibility

Chapter # 9 Computer Organization & Architecture 9


How Many GP Registers?

 Between 8 - 32
 Fewer = more memory references
 More does not reduce memory references and takes
up processor real estate

Chapter # 9 Computer Organization & Architecture 10


How big?

 Large enough to hold full address


 Large enough to hold full word
 Often possible to combine two data registers
 C programming
 long long int a;  (a double-length integer)

Chapter # 9 Computer Organization & Architecture 11


Condition Code Registers

 Sets of individual bits


 e.g. result of last operation was zero
 Can be read (implicitly) by programs
 e.g. Jump if zero
 Can not (usually) be set by programs

Chapter # 9 Computer Organization & Architecture 12


Control & Status Registers

 Program counter (PC)


 Contains the address of an instruction to be fetched
 Instruction register (IR)
 Contains the instruction most recently fetched
 Memory address register (MAR)
 Contains the address of a location in memory
 Memory buffer register (MBR)
 Contains a word of data to be written to memory or the
word most recently read

Chapter # 9 Computer Organization & Architecture 13


Program Status Word
A set of bits which includes Condition Codes
 Sign
 Contains the sign bit of the result of the last arithmetic operation
 Zero
 Set when the result is 0
 Carry
 Set if an operation resulted in a carry
 Equal
 Set if a logical compare result is equality
 Overflow
 Used to indicate arithmetic overflow
 Interrupt Enable/Disable
 Used to enable or disable interrupts
 Supervisor
 Indicates whether the processor is executing in supervisor or user mode
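As an illustration, the sketch below recomputes four of these bits after an 8-bit add in C; the grouping into a struct is just for the example, not any particular machine's PSW layout:

#include <stdint.h>
#include <stdio.h>

/* Recompute a few status bits after an 8-bit add, the way a PSW would be
   updated after an arithmetic operation. */
struct psw { int sign, zero, carry, overflow; };

static uint8_t add8(uint8_t a, uint8_t b, struct psw *f) {
    uint16_t wide = (uint16_t)a + (uint16_t)b;
    uint8_t  r    = (uint8_t)wide;
    f->carry    = wide > 0xFF;                          /* carry out of bit 7     */
    f->zero     = (r == 0);                             /* result is zero         */
    f->sign     = (r & 0x80) != 0;                      /* sign bit of the result */
    f->overflow = (~(a ^ b) & (a ^ r) & 0x80) != 0;     /* signed overflow        */
    return r;
}

int main(void) {
    struct psw f;
    uint8_t r = add8(0x7F, 0x01, &f);    /* +127 + 1 overflows in signed terms */
    printf("result=%02X S=%d Z=%d C=%d V=%d\n", r, f.sign, f.zero, f.carry, f.overflow);
    return 0;
}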

Chapter # 9 Computer Organization & Architecture 14


Supervisor Mode

 Intel ring zero


 Kernel mode
 Allows privileged instructions to execute
 Used by operating system
 Not available to user programs

Chapter # 9 Computer Organization & Architecture 15


Other Registers

 May have registers pointing to


 Process control blocks
 Interrupt Vectors

 Page table

 CPU design and operating system design are closely


linked

Chapter # 9 Computer Organization & Architecture 16


Example Register Organizations

Chapter # 9 Computer Organization & Architecture 17


EFLAGS Register

Chapter # 9 Computer Organization & Architecture 18


Intel x86-64 Registers
 16 integer general-purpose registers
 64-bit
 RAX, RBX, RCX, RDX, RBP, RSI, RDI, RSP, R8 - R15
 8 floating-point registers, FPU x87
 80-bit
 ST0 - ST7
 8 Multimedia Extensions registers
 64-bit
 MM0 – MM7
 they share space with the registers ST0 - ST7
 16 SSE (Streaming SIMD Extensions) registers
 128-bit
 XMM0 - XMM15
 64-bit RIP pointer
 64-bit flag register RFLAGS
Chapter # 9 Computer Organization & Architecture 19
Intel x86-64 Registers
 64-bit x86 adds 8 more general-purpose registers,
named R8-R15
 It also introduces a new naming convention (R0–R15), in which
AH, CH, DH and BH have no equivalents
 R0 is RAX
 R1 is RCX
 R2 is RDX
 R3 is RBX
 R4 is RSP
 R5 is RBP
 R6 is RSI
 R7 is RDI

Chapter # 9 Computer Organization & Architecture 20


Intel x86-64 Registers

 R8, R9, R10, R11, R12, R13, R14, R15 are the new
registers and have no other names
 R0D – R15D are the lowermost 32 bits of each
register
 For example, R0D is EAX
 R0W – R15W are the lowermost 16 bits of each
register
 For example, R0W is AX
 R0L – R15L are the lowermost 8 bits of each register
 for example, R0L is AL

Chapter # 9 Computer Organization & Architecture 21


Intel x86-64 Registers

Chapter # 9 Computer Organization & Architecture 22


Simplified ARM Organization

Chapter # 9 Computer Organization & Architecture 23


ARM (32-bit) Processor Modes
 The ARM architecture supports seven execution modes
 User Mode
 Most application programs execute in user mode
 User program being executed is unable to access protected system
resources or to change mode
 Privilege Modes
 These modes are used to run system software and divided into two
categories
 System Mode
 This mode is not entered by any exception and uses the same registers available
in User mode
 The System mode is used for running certain privileged operating system tasks
 System mode tasks may be interrupted by any of the five exception categories
 Exception Modes
 The exception modes are entered when specific exceptions occur
 Each of these modes has some dedicated registers that substitute for some of
the user mode registers

Chapter # 9 Computer Organization & Architecture 24


ARM Exception Modes
 The exception modes are entered when specific exceptions occur
 Each of these modes has some dedicated registers that substitute for
some of the user mode registers, and which are used to avoid corrupting
User mode state information when the exception occurs
 The exception modes are as follows
 Supervisor mode
 Usually what the OS runs in and it is entered when the processor encounters a software interrupt
instruction
 Abort mode
 Entered in response to memory faults
 Undefined mode
 Entered when the processor attempts to execute an instruction that is supported neither by the
main integer core nor by one of the coprocessors
 Interrupt mode
 Entered whenever the processor receives an interrupt signal from any other interrupt source
 Fast interrupt mode
 Entered whenever the processor receives an interrupt signal from the designated fast interrupt
source
 A fast interrupt cannot be interrupted, but a fast interrupt may interrupt a normal interrupt

Chapter # 9 Computer Organization & Architecture 25


ARM Register Organization
 The ARM processor has a total of thirty seven 32-bit registers,
classified as follows
 31 registers referred to in the ARM manual as general-purpose
registers
 In fact, some of these, such as the program counter, have special purposes
 6 program status registers
 Registers are arranged in partially overlapping banks, with the current
processor mode determining which bank is available
 At any time, sixteen numbered registers and one or two program
status registers are visible, for a total of 17 or 18 software-visible
registers
 Registers R0 through R7, register R15 (the program counter) and the current
program status register (CPSR) are visible in and shared by all modes
 Registers R8 through R12 are shared by all modes except fast interrupt, which
has its own dedicated registers R8_fiq through R12_fiq
 All the exception modes have their own versions of registers R13 and R14
 All the exception modes have a dedicated saved program status register (SPSR)

Chapter # 9 Computer Organization & Architecture 26


ARM Register Organization

Chapter # 9 Computer Organization & Architecture 27


ARM Register Organization

Chapter # 9 Computer Organization & Architecture 28


ARM AArch64 Registers

Chapter # 9 Computer Organization & Architecture 29


Instruction Cycle

Chapter # 9 Computer Organization & Architecture 30


Indirect Cycle

 May require memory access to fetch operands


 Indirect addressing requires more memory accesses
 Can be thought of as additional instruction sub-cycle

Chapter # 9 Computer Organization & Architecture 31


Instruction Cycle with Indirect

Chapter # 9 Computer Organization & Architecture 32


Instruction Cycle State Diagram

Chapter # 9 Computer Organization & Architecture 33


Data Flow (Instruction Fetch)

 Depends on CPU design


 Fetch
 PC contains address of next instruction
 Address moved to MAR

 Address placed on address bus

 Control unit requests memory read

 Result placed on data bus, copied to MBR, then to IR

 Meanwhile PC incremented by 1
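The same register traffic written out as a C sketch, with a small array standing in for memory and made-up instruction words:

#include <stdint.h>
#include <stdio.h>

/* Registers involved in the fetch, plus a tiny simulated memory. */
static uint16_t PC = 0, MAR = 0, MBR = 0, IR = 0;
static uint16_t memory[8] = { 0x1A05, 0x2B06, 0x3C07 };   /* made-up instruction words */

static void fetch(void) {
    MAR = PC;              /* address of next instruction moved to MAR          */
    MBR = memory[MAR];     /* memory read: result arrives in MBR                */
    IR  = MBR;             /* instruction copied to IR for decoding             */
    PC  = PC + 1;          /* meanwhile PC incremented to the next instruction  */
}

int main(void) {
    for (int i = 0; i < 3; i++) {
        fetch();
        printf("fetched IR=%04X, next PC=%u\n", (unsigned)IR, (unsigned)PC);
    }
    return 0;
}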

Chapter # 9 Computer Organization & Architecture 34


Data Flow (Data Fetch)

 IR is examined
 If indirect addressing, indirect cycle is performed
 Rightmost N bits of MBR transferred to MAR
 Control unit requests memory read

 Result (address of operand) moved to MBR

Chapter # 9 Computer Organization & Architecture 35


Data Flow (Fetch Diagram)

Chapter # 9 Computer Organization & Architecture 36


Data Flow (Indirect Diagram)

Chapter # 9 Computer Organization & Architecture 37


Data Flow (Execute)

 May take many forms


 Depends on instruction being executed
 May include
 Memory read/write
 Input/Output

 Register transfers

 ALU operations

Chapter # 9 Computer Organization & Architecture 38


Data Flow (Interrupt)
 Simple
 Predictable
 Current PC saved to allow resumption after interrupt
 Contents of PC copied to MBR
 Special memory location (e.g. stack pointer) loaded to MAR
 MBR written to memory
 PC loaded with address of interrupt handling routine
 Next instruction (first of interrupt handler) can be fetched

Chapter # 9 Computer Organization & Architecture 39


Data Flow (Interrupt Diagram)

Chapter # 9 Computer Organization & Architecture 40


Prefetch

 Fetch accessing main memory


 Execution usually does not access main memory
 Can fetch next instruction during execution of
current instruction
 Called instruction prefetch

Chapter # 9 Computer Organization & Architecture 41


Improved Performance

 But not doubled


 Fetch usually shorter than execution
 Prefetch more than one instruction?
 Any jump or branch means that prefetched instructions are
not the required instructions
 Divide in more activities/stages to improve
performance
 Solution is processor pipelining

Chapter # 9 Computer Organization & Architecture 42


Pipelining is Natural
 Laundry Example
 Nazim, Botir, Babar, Temur each have one load of clothes (A, B, C, D)
to wash, dry, and fold
 “Washing” takes 30 minutes
 “Drying” takes 30 minutes
 “Folding” takes 30 minutes
 “Stashing” (putting clothes into drawers) takes 30 minutes

Chapter # 9 Computer Organization & Architecture 43


Sequential Laundry

[Timing diagram: loads A–D processed one after another in task order, from 6 PM to 2 AM in 30-minute stages]

 Sequential laundry takes 8 hours for 4 loads

Chapter # 9 Computer Organization & Architecture 44


Pipelined Laundry: Start work ASAP

[Timing diagram: loads A–D overlapped in the pipeline, from 6 PM to about 9:30 PM in 30-minute stages]

 Pipelined laundry takes 3.5 hours for 4 loads!

Chapter # 9 Computer Organization & Architecture 45


Processor Pipelining

 Fetch instruction
 Decode instruction
 Calculate operands address
 Fetch operands
 Execute instructions
 Write result

 Overlap these operations

Chapter # 9 Computer Organization & Architecture 46


Two Stage Instruction Pipeline

Chapter # 9 Computer Organization & Architecture 47


Timing Diagram for
Instruction Pipeline Operation

Chapter # 9 Computer Organization & Architecture 48


Why Pipeline?

 Suppose we execute 100 instructions


 Single Cycle Machine
 45 ns/cycle x 1 CPI x 100 inst = 4500 ns
 Multicycle Machine
 10 ns/cycle x 4.6 CPI (due to instr. mix) x 100 inst = 4600 ns
 Ideal pipelined machine
 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
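The three figures can be reproduced with the formulas used on this slide (time = cycle time x CPI x instruction count, plus a pipeline drain of stages - 1 cycles for the ideal pipeline); a quick C check:

#include <stdio.h>

int main(void) {
    int n = 100;                                        /* instructions */

    double single_cycle = 45.0 * 1.0 * n;               /* 45 ns/cycle, CPI = 1       */
    double multi_cycle  = 10.0 * 4.6 * n;               /* 10 ns/cycle, CPI = 4.6     */
    double pipelined    = 10.0 * (1.0 * n + 4);         /* 10 ns/cycle, 4-cycle drain */

    printf("single-cycle : %.0f ns\n", single_cycle);   /* 4500 ns */
    printf("multi-cycle  : %.0f ns\n", multi_cycle);    /* 4600 ns */
    printf("pipelined    : %.0f ns\n", pipelined);      /* 1040 ns */
    return 0;
}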

Chapter # 9 Computer Organization & Architecture 49


The Effect of a Conditional Branch on
Instruction Pipeline Operation

Chapter # 9 Computer Organization & Architecture 50


Six Stage Instruction Pipeline

Chapter # 9 Computer Organization & Architecture 51


Alternative Pipeline Depiction

Chapter # 9 Computer Organization & Architecture 52


Speedup Factors with Instruction Pipelining

Chapter # 9 Computer Organization & Architecture 53


Pipeline Hazards

 Pipeline, or some portion of pipeline, must stall


 Also called pipeline bubble
 Types of hazards
 Resource
 Data

 Control

Chapter # 9 Computer Organization & Architecture 54


Resource Hazards
 Two (or more) instructions in pipeline need same resource
 Executed in serial rather than parallel for part of pipeline
 Also called structural hazard
 E.g. Assume simplified five-stage pipeline
 Each stage takes one clock cycle
 Ideal case is new instruction enters pipeline each clock cycle
 Assume main memory has single port
 Assume instruction fetches and data reads and writes performed one at a time
 Ignore the cache
 Operand read or write cannot be performed in parallel with instruction fetch
 Fetch instruction stage must idle for one cycle fetching I3

 E.g. multiple instructions ready to enter execute instruction phase


 Single ALU

 One solution: increase available resources


 Multiple main memory ports
 Multiple ALUs

Chapter # 9 Computer Organization & Architecture 55


Resource Hazard Diagram

Chapter # 9 Computer Organization & Architecture 56


Data Hazards
 Conflict in access of an operand location
 Two instructions to be executed in sequence
 Both access a particular memory or register operand
 If in strict sequence, no problem occurs
 If in a pipeline, operand value could be updated so as to produce different
result from strict sequential execution
 E.g. x86 machine instruction sequence:

 ADD EAX, EBX /* EAX = EAX + EBX */
 SUB ECX, EAX /* ECX = ECX - EAX */

 ADD instruction does not update EAX until end of stage 5, at clock cycle 5
 SUB instruction needs value at beginning of its stage 2, at clock cycle 4
 Pipeline must stall for two clocks cycles
 Without special hardware and specific avoidance algorithms, results in
inefficient pipeline usage
Chapter # 9 Computer Organization & Architecture 57
Data Hazard Diagram

Chapter # 9 Computer Organization & Architecture 58


Types of Data Hazard

 Read after write (RAW), or true dependency


 An instruction modifies a register or memory location
 Succeeding instruction reads data in that location
 Hazard if read takes place before write complete
 Write after read (WAR), or antidependency
 An instruction reads a register or memory location
 Succeeding instruction writes to location
 Hazard if write completes before read takes place
 Write after write (WAW), or output dependency
 Two instructions both write to same location
 Hazard if the writes take place in the reverse of the intended order
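The three cases written out on ordinary C variables; the comments note which hazard appears if a pipeline overlaps or reorders the accesses (the register-style names are illustrative):

#include <stdio.h>

int main(void) {
    int r1 = 1, r2 = 2, r3 = 3, r4 = 0, r5 = 5;

    r1 = r2 + r3;      /* I1 writes r1                                     */
    r4 = r1 + 1;       /* I2 reads r1  -> RAW (true dependency) on I1      */

    r5 = r4 + r2;      /* I3 reads r4                                      */
    r4 = r3 * 2;       /* I4 writes r4 -> WAR (antidependency) with I3     */

    r2 = r1 + r3;      /* I5 writes r2                                     */
    r2 = r5 - 1;       /* I6 writes r2 -> WAW (output dependency) with I5  */

    printf("%d %d %d %d %d\n", r1, r2, r3, r4, r5);
    return 0;
}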

Chapter # 9 Computer Organization & Architecture 59


Control Hazard

 Also known as branch hazard


 Pipeline makes wrong decision on branch prediction
 Brings instructions into pipeline that must
subsequently be discarded

Chapter # 9 Computer Organization & Architecture 60


Dealing with Branches

 Multiple Streams
 Prefetch Branch Target
 Loop buffer
 Branch prediction
 Delayed branching

Chapter # 9 Computer Organization & Architecture 61


Multiple Streams

 Have two pipelines


 Prefetch each branch into a separate pipeline
 Use appropriate pipeline
 Leads to bus & register contention
 Multiple branches lead to further pipelines being
needed

Chapter # 9 Computer Organization & Architecture 62


Prefetch Branch Target

 Target of branch is prefetched in addition to


instructions following branch
 Keep target until branch is executed
 Used by IBM 360/91

Chapter # 9 Computer Organization & Architecture 63


Loop Buffer

 Very fast memory


 Maintained by fetch stage of pipeline
 Check buffer before fetching from memory
 Very good for small loops or jumps
 c.f. cache
 Used by CRAY-1

Chapter # 9 Computer Organization & Architecture 64


Loop Buffer Diagram

Chapter # 9 Computer Organization & Architecture 65


Branch Prediction

 Predict never taken


 Assume that jump will not happen
 Always fetch next instruction

 68020 & VAX 11/780

 VAX will not prefetch after branch if a page fault would


result (O/S v CPU design)
 Predict always taken
 Assume that jump will happen
 Always fetch target instruction

Chapter # 9 Computer Organization & Architecture 66


Branch Prediction

 Predict by Opcode
 Some instructions are more likely to result in a jump than
others
 Can get up to 75% success

 Taken/Not taken switch


 Based on previous history
 Good for loops
 Refined by two-level or correlation-based branch history

 Correlation-based
 In more complex structures, branch direction correlates
with that of related branches
 Use recent branch history as well
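The taken/not-taken switch based on previous history is often a 2-bit saturating counter; a minimal C sketch of that predictor, run on a made-up outcome sequence:

#include <stdio.h>

/* 2-bit saturating counter: 0,1 predict not taken; 2,3 predict taken.
   Two wrong predictions in a row are needed to flip the prediction,
   which suits loop branches. */
static int counter = 2;                  /* start weakly taken */

static int  predict(void)     { return counter >= 2; }
static void update(int taken) {
    if (taken  && counter < 3) counter++;
    if (!taken && counter > 0) counter--;
}

int main(void) {
    int outcomes[] = { 1, 1, 1, 0, 1, 1, 1, 0 };   /* a loop taken 3 times, then exits */
    int hits = 0, n = (int)(sizeof outcomes / sizeof outcomes[0]);

    for (int i = 0; i < n; i++) {
        int p = predict();
        hits += (p == outcomes[i]);
        update(outcomes[i]);
    }
    printf("correct predictions: %d of %d\n", hits, n);
    return 0;
}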

Chapter # 9 Computer Organization & Architecture 67


Branch Prediction

 Delayed Branch
 Do not take jump until you have to
 Rearrange instructions

Chapter # 9 Computer Organization & Architecture 68


Branch Prediction State Diagram

Chapter # 9 Computer Organization & Architecture 69


Branch Prediction Flowchart

Chapter # 9 Computer Organization & Architecture 70


Dealing With Branches

Chapter # 9 Computer Organization & Architecture 71


Intel 80486 Pipelining
 Fetch
 From cache or external memory
 Put in one of two 16-byte prefetch buffers
 Fill buffer with new data as soon as old data consumed
 Average 5 instructions fetched per load
 Independent of other stages to keep buffers full
 Decode stage 1
 Opcode & address-mode info
 At most first 3 bytes of instruction
 Can direct D2 stage to get rest of instruction
 Decode stage 2
 Expand opcode into control signals
 Computation of complex address modes
 Execute
 ALU operations, cache access, register update
 Writeback
 Update registers & flags
 Results sent to cache & bus interface write buffers

Chapter # 9 Computer Organization & Architecture 72


80486 Instruction Pipeline Examples

Chapter # 9 Computer Organization & Architecture 73


Q.1
 Given the following memory and register values:
 Word 700 contains 740
 Word 710 contains 750
 Word 720 contains 710
 Word 730 contains 740
 Word 740 contains 700
 Word 750 contains 700
 AX Register contains 720
 BX Register contains 740
 CX Register contains 710
 DX Register contains 750
 Base Register contains 200
 What would be the result in the following cases?
 ADD DX, [BX]
 SUB [CX], BX
 MOV DX, 30
 ADD [AX], [700]

Q.2
 Consider different systems with and without pipelining. Each system has to
execute 1400 instructions. Calculate the total execution time for 1400
instructions in each of the following cases:
 Single-cycle machine
 It takes 40ns for each cycle
 Multi-cycle machine (without pipelining)
 It takes 6ns for each cycle
 It has 7 stages
 It consumes 7 clocks per instruction on average
 Ideal pipelined machine
 It takes 6ns for each cycle
 It has 7 stages
Chapter # 9 Computer Organization & Architecture 74
SYSTEM INTERCONNECTION
CHAPTER # 10 Computer Organization & Architecture
Connecting

 All the units must be connected


 Different type of connection for different type of unit
 Memory
 Input/Output

 CPU

Chapter # 10 Computer Organization & Architecture 2


Types of Transfers

 Interconnection structure must support the


following types of transfers
 Memory to processor
 Processor reads an instruction or a unit of data from memory
 Processor to memory
 Processor writes a unit of data to memory
 I/O to processor
 Processor reads data from an I/O device via an I/O module
 Processor to I/O
 Processor sends data to the I/O device
 I/O to or from memory
 An I/O module is allowed to exchange data directly with memory
without going through the processor using direct memory access

Chapter # 10 Computer Organization & Architecture 3


Computer Modules

Chapter # 10 Computer Organization & Architecture 4


Memory Connection

 Receives and sends data


 Receives addresses (of locations)
 Receives control signals
 Read
 Write

 Timing

Chapter # 10 Computer Organization & Architecture 5


Input/Output Connection

 Similar to memory from computer’s viewpoint


 Output
 Receive data from computer
 Send data to peripheral

 Input
 Receive data from peripheral
 Send data to computer

Chapter # 10 Computer Organization & Architecture 6


Input/Output Connection

 Receive control signals from computer


 Send control signals to peripherals
 e.g. spin disk
 Receive addresses from computer
 e.g. port number to identify peripheral
 Send interrupt signals (control)

Chapter # 10 Computer Organization & Architecture 7


CPU Connection

 Reads instruction and data


 Writes out data (after processing)
 Sends control signals to other units
 Receives (& acts on) interrupts

Chapter # 10 Computer Organization & Architecture 8


What is a Bus?

 A communication pathway connecting two or more devices
 Usually broadcast
 Signals transmitted by any one device are available for
reception by all other devices attached to the bus
 Often grouped
 A number of channels in one bus
 e.g. 32 bit data bus is 32 separate single bit channels

 Power lines may not be shown

Chapter # 10 Computer Organization & Architecture 9


Buses

 There are a number of possible interconnection systems
 Single and multiple BUS structures are most
common
 e.g. Control/Address/Data bus (PC)
 e.g. Unibus (DEC-PDP)

Chapter # 10 Computer Organization & Architecture 10


Data Bus

 Carries data
 Remember that there is no difference between “data” and
“instruction” at this level
 The number of lines determines how many bits can
be transferred at a time
 May consist of 8, 16, 32, 64, 128, or more separate lines
 Width is a key determinant of performance

Chapter # 10 Computer Organization & Architecture 11


Address bus

 Identify the source or destination of data


 e.g. CPU needs to read an instruction or data from a given
location in memory
 Bus width determines maximum memory capacity of
system
 e.g. 8080 has a 16-bit address bus giving a 64K address space
 Also used to address I/O ports
 The higher order bits are used to select a particular
module on the bus and the lower order bits select a
memory location or I/O port within the module
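
To make the last point concrete, here is a small sketch. The 16-bit address width matches the 8080 example above, but the 4/12 split between module-select bits and offset bits is purely an assumption for illustration:

```python
# Hedged sketch of address decoding on a shared bus (the 4/12 split is assumed).
ADDRESS_BITS = 16          # e.g. the 8080's address bus: 2**16 = 65,536 locations
MODULE_BITS  = 4           # assumed number of higher-order bits selecting a module
OFFSET_BITS  = ADDRESS_BITS - MODULE_BITS

addr   = 0xA034                            # an example 16-bit address
module = addr >> OFFSET_BITS               # higher-order bits pick the module
offset = addr & ((1 << OFFSET_BITS) - 1)   # lower-order bits pick a word or port inside it

print(hex(module), hex(offset))            # 0xa 0x34
```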

Chapter # 10 Computer Organization & Architecture 12


Control Bus

 Used to control the access and the use of the data and address lines
 Carries control and timing information
 Timing signals indicate the validity of data and address
information
 Command signals specify operations to be performed

 Generally used for


 Memory read/write signal
 Interrupt request

 Clock signals

Chapter # 10 Computer Organization & Architecture 13


Bus Interconnection Scheme

Chapter # 10 Computer Organization & Architecture 14


What do Buses Look Like?

 Has many shapes


 Parallel lines on circuit boards
 Ribbon cables

 Strip connectors on motherboards


 e.g. PCI
 Sets of wires

Chapter # 10 Computer Organization & Architecture 15


Single Bus Problems

 Lots of devices on one bus leads to:


 Propagation delays
 Long data paths mean that co-ordination of bus use can adversely
affect performance
 The bus may become a bottleneck if aggregate data transfer approaches bus capacity
 Most systems use multiple buses to overcome these
problems

Chapter # 10 Computer Organization & Architecture 16


Traditional Architecture

Chapter # 10 Computer Organization & Architecture 17


High Performance Bus

Chapter # 10 Computer Organization & Architecture 18


Elements of Bus Design

 Bus Type
 Dedicated
 Multiplexed
 Method of Arbitration
 Centralized
 Distributed
 Timing
 Synchronous
 Asynchronous
 Bus Width
 Address
 Data
 Data Transfer Type
 Read
 Write
 Read-modify-write
 Read-after-write
 Block

Chapter # 10 Computer Organization & Architecture 19


Bus Types

 Dedicated
 Separate data & address lines
 Multiplexed
 Shared lines
 Address valid or data valid control line

 Advantage
 Fewer lines
 Disadvantage
 More complex control

Chapter # 10 Computer Organization & Architecture 20


Bus Arbitration

 More than one module controlling the bus


 e.g. CPU and DMA controller
 Only one module may control bus at one time
 Arbitration may be centralised or distributed

Chapter # 10 Computer Organization & Architecture 21


Centralised or Distributed Arbitration

 Centralised
 Single hardware device controlling bus access
 Bus Controller
 Arbiter
 May be part of CPU or separate
 Distributed
 Each module may claim the bus
 Control logic on all modules
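
A minimal sketch of the centralised case, assuming a simple fixed-priority arbiter; the priority scheme and names are illustrative only (real bus controllers also use daisy chaining, polling, or rotating priority):

```python
# Hedged sketch of a centralised fixed-priority bus arbiter.
def grant(requests):
    """requests[i] is True if module i wants the bus; lower index = higher priority.
    Returns the index of the module granted the bus, or None if nobody asked."""
    for module, wants_bus in enumerate(requests):
        if wants_bus:
            return module
    return None

print(grant([False, True, True]))   # -> 1: module 0 is idle, so module 1 wins over module 2
```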

Chapter # 10 Computer Organization & Architecture 22


Timing of Synchronous Bus Operations

Chapter # 10 Computer Organization & Architecture 23


Timing of Asynchronous Bus Operations

Chapter # 10 Computer Organization & Architecture 24


Point-to-Point Interconnect

 In a conventional bus
 Over time, electrical constraints were encountered as the frequency of wide synchronous buses increased
 At higher and higher data rates it becomes increasingly difficult to perform the synchronization and arbitration functions in a timely fashion
 A shared bus on the same chip magnified the difficulties of increasing bus data rate and reducing bus latency to keep up with the processors
 All of this motivated a move away from the shared bus

 Point-to-Point Interconnect was introduced

 It has lower latency, higher data rate, and better scalability
Chapter # 10 Computer Organization & Architecture 25
Quick Path Interconnect

 QPI is a point-to-point interconnect


 Introduced in 2008
 Multiple direct connections
 Direct pairwise connections to other components eliminating
the need for arbitration found in shared transmission systems
 Layered protocol architecture
 These processor level interconnects use a layered protocol
architecture rather than the simple use of control signals found
in shared bus arrangements
 Packetized data transfer
 Data are sent as a sequence of packets each of which includes
control headers and error control codes

Chapter # 10 Computer Organization & Architecture 26


Multicore Configuration using QPI

Chapter # 10 Computer Organization & Architecture 27


QPI Layers
 Physical
 Consists of the actual wires carrying the signals
 The unit of transfer at the physical layer is 20 bits, which is called a Phit (physical unit)
 Link
 Responsible for reliable transmission and flow control
 The Link layer’s unit of transfer is an 80-bit Flit (flow control unit); see the check after this list
 Routing
 Provides the framework for directing packets through the fabric
 Protocol
 The high-level set of rules for exchanging packets of data between devices. A packet is composed of an integral number of Flits
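
Putting the unit sizes above together, a trivial check assuming the 20-bit phit and 80-bit flit quoted on this slide:

```python
PHIT_BITS = 20                    # physical-layer unit of transfer
FLIT_BITS = 80                    # link-layer unit of transfer
print(FLIT_BITS // PHIT_BITS)     # 4: each flit is carried as four phits
```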

Chapter # 10 Computer Organization & Architecture 28


Physical Interface of the Intel QPI Interconnect

Chapter # 10 Computer Organization & Architecture 29


Physical Interface of the Intel QPI Interconnect

 The QPI port consists of 84 individual links


 Each data path consists of a pair of wires referred to as a lane
 It transmits data one bit at a time
 There are 20 data lanes in each direction
 The 20-bit unit is referred to as a phit
 QPI can transmit 20 bits in parallel in each direction
 Typical signaling speed is 6.4 GT/s
 At 20 bits per transfer, that adds up to 16 GB/s
 Being bidirectional, the total capacity is 32 GB/s (see the check after this list)
 The lanes are grouped into four quadrants of 5 lanes each
 In some applications, the link can also operate at half or quarter widths
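
A back-of-the-envelope check of the figures above, using the quoted 6.4 GT/s and 20 bits per transfer (decimal gigabytes assumed):

```python
transfers_per_second = 6.4e9               # 6.4 GT/s signaling speed
bits_per_transfer    = 20                  # one 20-bit phit per transfer

per_direction_GBps = transfers_per_second * bits_per_transfer / 8 / 1e9
print(per_direction_GBps, 2 * per_direction_GBps)   # 16.0 GB/s each way, 32.0 GB/s total
```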

Chapter # 10 Computer Organization & Architecture 30


Peripheral Component Interconnect (PCI)
 A popular high bandwidth, processor independent bus
that can function as a mezzanine or peripheral bus
 Delivers better system performance for high speed I/O
subsystems
 PCI Special Interest Group (SIG)
 Created to develop further and maintain the compatibility of the
PCI specifications
 PCI Express (PCIe)
 Point-to-point interconnect scheme intended to replace bus-based schemes such as PCI
 Key requirement is high capacity to support the needs of higher
data rate I/O devices, such as Gigabit Ethernet
 Another requirement deals with the need to support time
dependent data streams

Chapter # 10 Computer Organization & Architecture 31


PCIe Configuration

Chapter # 10 Computer Organization & Architecture 32


PCIe Protocol Layers
 Physical
 Consists of the actual wires carrying the signals
 Data link
 Responsible for reliable transmission and flow control
 Data packets generated/consumed are called Data Link Layer Packets (DLLPs)
 Transaction
 Generates and consumes data packets used to implement load/store data transfer mechanisms
 Also manages the flow control of those packets between the two components on a link
 Data packets generated and consumed by the TL are called Transaction Layer Packets (TLPs)
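
As an illustration of the layering idea only (this is not the real PCIe packet format), a sketch in which a hypothetical transaction-layer packet is wrapped by the data-link layer with a sequence number and a CRC before being handed to the physical layer:

```python
import zlib

def transaction_layer(payload: bytes) -> bytes:
    # Stand-in TLP: an invented header plus the payload (format is assumed).
    return b"TLP" + payload

def data_link_layer(tlp: bytes, seq: int) -> bytes:
    # Data-link-layer framing: sequence number in front, CRC behind (illustrative only).
    crc = zlib.crc32(tlp).to_bytes(4, "big")
    return seq.to_bytes(2, "big") + tlp + crc

frame = data_link_layer(transaction_layer(b"read 0x1000"), seq=7)
print(frame.hex())
```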
Chapter # 10 Computer Organization & Architecture 33
